Adversarial Online Multi-Task Reinforcement Learning
Abstract
We consider the adversarial online multi-task reinforcement learning setting, where in each of episodes the learner is given an unknown task taken from a finite set of unknown finite-horizon MDP models. The learner’s objective is to minimize its regret with respect to the optimal policy for each task. We assume the MDPs in are well-separated under a notion of -separability, and show that this notion generalizes many task-separability notions from previous works. We prove a minimax lower bound of on the regret of any learning algorithm and an instance-specific lower bound of in sample complexity for a class of uniformly good cluster-then-learn algorithms. We use a novel construction called 2-JAO MDP for proving the instance-specific lower bound. The lower bounds are complemented with a polynomial time algorithm that obtains sample complexity guarantee for the clustering phase and regret guarantee for the learning phase, indicating that the dependency on and is tight.
1 Introduction
The majority of theoretical works in online reinforcement learning (RL) have focused on single-task settings in which the learner is given the same task in every episode. In practice, an autonomous agent might face a sequence of different tasks. For example, an automatic medical diagnosis system could be given an arbitrarily ordered sequence of patients who are suffering from an unknown set of variants of a virus. In this example, the system needs to classify and learn the appropriate treatment for each variant of the virus. This example is an instance of the adversarial online multi-task episodic RL setting, an important learning setting for which the theoretical understanding is rather limited. The framework commonly used in existing theoretical works is an episodic setting of episodes; in each episode an unknown Markov decision process (MDP) from a finite set of size is given to the learner. When , the setting reduces to single-task episodic RL. Most existing algorithms for single-task episodic RL are based on aggregating samples across all episodes to obtain sub-linear bounds on various notions of regret (Azar et al., 2017; Jin et al., 2018; Simchowitz and Jamieson, 2019) or finite -PAC bounds on the sample complexity of exploration (Dann and Brunskill, 2015). When , without any assumptions on the common structure of the tasks, aggregating samples from different tasks could produce negative transfer (Brunskill and Li, 2013). To avoid negative transfer, existing works (Brunskill and Li, 2013; Hallak et al., 2015; Kwon et al., 2021) assumed that there exists some notion of task-separability that defines how different the tasks in are. Based on this notion of separability, most existing algorithms followed a two-phase cluster-then-learn paradigm that first attempts to figure out which MDP is being given and then uses the samples from the previous episodes of the same MDP for learning. However, most existing works employ strong assumptions, such as that the tasks are given stochastically following a fixed distribution (Azar et al., 2013; Brunskill and Li, 2013; Steimle et al., 2021; Kwon et al., 2021) or that the task-separability notion allows the MDPs to be distinguished in a small number of exploration steps (Hallak et al., 2015; Kwon et al., 2021). Removing these strong assumptions is the main theoretical challenge in understanding this setting.
Our goal in this work is to study the adversarial setting with a more general task-separability notion, in which the aforementioned strong assumptions do not hold. Specifically, the learner makes no statistical assumptions on the sequence of tasks; the task in each episode can be either the same as or different from the tasks in any other episode. Moreover, the difference between the tasks in two consecutive episodes can be large (linear in the length of the episodes), so that algorithms based on a fixed budget for total variation such as RestartQ-UCB (Mao et al., 2021) cannot be applied. The performance of the learner is measured by its regret with respect to an omniscient agent that knows which tasks are coming in every episode and the optimal policies for these tasks. We consider the same cluster-then-learn paradigm as previous works and focus on the following two questions:
• Is there a task-separability notion that generalizes the notions from previous works while still enabling tasks to be distinguished by a cluster-then-learn algorithm with polynomial time and sample complexity? If so, what is the optimal sample complexity of clustering under this notion?
• Is there a polynomial time cluster-then-learn algorithm that simultaneously obtains near-optimal sample complexity in the clustering phase and near-optimal regret guarantee for the learning phase in the adversarial setting?
We answer both questions positively. For the first question, we introduce the notion of -separability, a task-separability notion that generalizes the task-separability definitions in previous works in the same setting (Brunskill and Li, 2013; Hallak et al., 2015; Kwon et al., 2021). Definition 1 formally defines -separability. A more informal version of -separability has appeared in the discounted setting of Concurrent PAC RL (Guo and Brunskill, 2015), where multiple MDPs are learned concurrently; however, the implications for the episodic sequential setting and the tightness of their results were not established. In essence, -separability assumes that between every pair of MDPs in , there exists some state-action pair whose transition functions are well-separated in -norm. This setting is more challenging than the one considered by Hallak et al. (2015), where all state-action pairs are well-separated. In Appendix B, we show that -separability is more general than the entropy-based separability defined in Kwon et al. (2021) and thus requires novel approaches to exploring and clustering samples from different episodes. Under this notion of -separability, we show an instance-specific lower bound on both the sample complexity and regret of the clustering phase for a class of cluster-then-learn algorithms that includes most of the existing works (here and throughout the introduction, we suppress factors related to the MDPs, such as the number of states and actions and the horizon length, in all the bounds).
To answer the second question, we propose a new cluster-then-learn algorithm, AOMultiRL, which obtains a regret upper bound of (the notation hides logarithmic terms). This upper bound indicates that the linear dependency on and in the lower bounds is tight. The upper bound in the learning phase is near-optimal because if the identity of the model is revealed to a learner at the beginning of every episode (so that no clustering is necessary), there exists a straightforward lower bound obtained by combining the lower bound for the single-task episodic setting of Domingues et al. (2021b) and the Cauchy-Schwarz inequality. In the stochastic setting, the L-UCRL algorithm (Kwon et al., 2021) obtains regret with respect to the optimal policy of a partially observable MDP (POMDP) setting that does not know the identity of the MDPs in each episode; thus their notion of regret is weaker than the one in our work.
Overview of Techniques
• In Section 3, we present two lower bounds. The first is a minimax lower bound on the total regret of any algorithm. This result uses the construction of JAO MDPs in Jaksch et al. (2010). The second is an instance-specific lower bound on the sample complexity and regret of the clustering phase for a class of uniformly good cluster-then-learn algorithms when both and are sufficiently large. The instance-specific lower bound relies on the novel construction of the 2-JAO MDP, a hard instance combining two JAO MDPs in which one is the minimax lower bound instance and the other satisfies -separability. We show that learning 2-JAO MDPs is fundamentally a two-dimensional extension of the problem of finding a biased coin among a collection of fair coins (e.g. Tulsiani, 2014), for which the information-theoretic techniques of the one-dimensional problem can be adapted.
• In Section 4, we show that AOMultiRL obtains a regret upper bound of . The main idea of AOMultiRL is based on the observation that a fixed horizon of order with a small constant factor is sufficient to obtain a -dependent coarse estimate of the transition functions of all state-action pairs. In turn, this coarse estimate is sufficient to provide high-probability guarantees for the correctness of the clustering phase. This allows AOMultiRL to use a fixed horizon for the learning phase and therefore to apply single-task RL algorithms with theoretical guarantees, such as UCBVI-CH (Azar et al., 2017), in that phase.
Our paper is structured as follows: Section 2 formally sets up the problem. Section 3 presents the lower bounds. AOMultiRL and its regret upper bound are shown in Section 4. Several numerical simulations are in Section 5. The appendix contains formal proofs of all results. We defer detailed discussion on related works to Appendix A.
2 Problem Setup
Our learning setting consists of episodes. In episode , an adversary chooses an unknown Markov decision process (MDP) from a set of finite-horizon tabular stationary MDP models where is the shared reward function, is the set of states with size , is the set of actions with size , is the length of each episode, and is the transition function where specifies the probability of being in state after taking action at state . The state space and action space are known and shared between all models; however, the transition functions are distinct and unknown. Following common practice in the single-task RL literature (Azar et al., 2017; Jin et al., 2018), we assume that the reward function is known and deterministic; however, our techniques and results extend to the setting of an unknown stochastic reward function. Furthermore, the MDPs are assumed to be communicating with a finite diameter (Jaksch et al., 2010). A justification for this assumption on the diameter is provided in Section 2.1.
The adversary also chooses the initial state . The policy of the learner in episode is a collection of functions , which can be non-stationary and history-dependent. The value function of starting in state at step is the expected reward obtained by following for steps , where the expectation is taken with respect to the stochasticity in and . Let denote the value function of the optimal policy in episode .
The performance of the learner is measured by its regret with respect to the optimal policies in every episode:
(1)
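Since this notion of regret is central to what follows, we spell it out in a standard form (the symbols below are our own illustrative notation, not necessarily those of the original display):
\[
\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K} \Big( V^{*}_{k,1}(s_{k,1}) \;-\; V^{\pi_k}_{k,1}(s_{k,1}) \Big),
\]
where $s_{k,1}$ is the adversarially chosen initial state of episode $k$, $V^{*}_{k,1}$ is the optimal value function of the task given in episode $k$, and $V^{\pi_k}_{k,1}$ is the value of the learner's policy $\pi_k$ on that task.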
Let . We assume that the MDPs in are -separable:
Definition 1 (-separability)
Let and consider set of MDP models with models. For all and , the -distinguishing set for two models and is defined as the set of state-action pairs such that the distance between and is larger than : , where denotes the -norm and .
The set is -separable if for every two models in , the set is non-empty:
In addition, is called a separation level of , and we say a state-action pair is -distinguishing for two models and if .
We use the following notion of a -distinguishing set for a collection of MDP models :
Definition 2 (-distinguishing set)
Given a -separable set of MDPs , a -distinguishing set of is a set of state-action pairs such that for all . In particular, the set is a -distinguishing set of .
By definition, a state-action pair can be -distinguishing for some pairs of models and not -distinguishing for other pairs of models.
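As a concrete illustration of Definitions 1 and 2, the following sketch checks separability and collects distinguishing pairs when the transition functions are available as arrays (the function and variable names are ours, not part of the paper):

```python
import numpy as np

def distinguishing_pairs(models, epsilon):
    """models[i, s, a] is the transition distribution of (s, a) under model i,
    stored as an array of shape (N, S, A, S).  Returns the set of state-action
    pairs that are epsilon-distinguishing for at least one pair of models, and
    a flag indicating whether the whole set of models is epsilon-separable."""
    N = models.shape[0]
    pairs, separable = set(), True
    for i in range(N):
        for j in range(i + 1, N):
            # l1 distance between the transition distributions of each (s, a)
            dist = np.abs(models[i] - models[j]).sum(axis=-1)   # shape (S, A)
            hits = np.argwhere(dist > epsilon)
            if hits.size == 0:
                separable = False   # models i and j cannot be told apart
            pairs.update(map(tuple, hits))
    return pairs, separable
```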
2.1 Assumption on the finite diameter of the MDPs
In this work, all MDPs are assumed to be communicating. We employ the following formal definition and assumption commonly used in literature (Jaksch et al., 2010; Brunskill and Li, 2013; Sun and Huang, 2020; Tarbouriech et al., 2021):
Definition 3
(Jaksch et al., 2010) Given an ergodic Markov chain , let be the first passage time for two states on . Then the hitting time of a unichain MDP is , where is the Markov chain induced by on . In addition, is the diameter of .
Assumption 4
The diameters of all MDPs in are bounded by a constant .
While this finite diameter assumption is common in undiscounted and discounted single-task setting (Jaksch et al., 2010; Guo and Brunskill, 2015), it is not necessary in the episodic single-task setting (Jin et al., 2018; Mao et al., 2021). Therefore, it is important to justify this assumption in the episodic multi-task setting. In the episodic single-task setting, for any initial state , the average time between any pair of states reachable from is bounded ; hence, plays the same role as (Domingues et al., 2021b). This allows the learner to visit and gather state-transition samples in each state multiple times and construct accurate estimates of the model.
However, in the multi-task setting, the same initial state in one episode might belong to a different MDP than the state in the previous episodes. Therefore, the set of reachable states and their state-transition distributions could change drastically. Hence, it is important that the -distinguishing state-action pairs be reachable from any initial state for the learner to recognize which MDP it is in and use the samples appropriately. Otherwise, combining samples from different MDPs could lead to negative transfer. Conversely, if the MDPs are allowed to be non-communicating, the component that makes them -separable might be unreachable from other components. In this case, the adversary can pick the initial states in these components and block the learner from accessing the -distinguishing state-actions. A construction that formalizes this argument is shown at the end of Section 3.
3 Minimax and Instance-Dependent Lower Bounds
We first show that if is sufficiently small and , then the setting is uninteresting in the sense that one cannot do much better than learning every episode individually without any transfer, leading to an expected regret that grows linearly in the number of episodes .
Lemma 5 (Minimax Lower Bound)
Suppose and are given. Let . There exists a set of -separable MDPs of size , each with states, actions, diameter at most and horizon such that if the tasks are chosen uniformly at random from , the expected regret of any sequence of policies over episodes is
Proof (Sketch) We construct so that each MDP in is a JAO MDP (Jaksch et al., 2010) of two states , actions and diameter . Figure 1 (left) illustrates the structure of a JAO MDP. State has no reward, while state has reward . Each model has a unique best action that starts from and goes to . The pair is a -distinguishing state-action pair.
A JAO MDP can be converted to an MDP with states, actions and diameter , and this type of MDP underlies the minimax lower bound proof in the undiscounted setting (Jaksch et al., 2010). The adversary selects a model from uniformly at random, so previous episodes provide no useful information for the current episode; hence, the regret of any learner equals the sum of its one-episode learning regrets. The one-episode learning regret for JAO MDPs is known to be when comparing against the optimal infinite-horizon average reward. For JAO MDPs, the optimal infinite-horizon policy is also optimal for the finite horizon, so we can use a geometric convergence result from Markov chain theory (Levin et al., 2008) to convert this lower bound into a lower bound of the same order on the standard finite-horizon regret, giving the result.
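For concreteness, here is a minimal sketch of a JAO MDP as described above, written as a transition tensor; the values of delta and eps are placeholders for the instance-specific quantities:

```python
import numpy as np

def jao_mdp(num_actions, delta, eps, best_action):
    """Two-state JAO MDP sketch: state 1 carries reward 1 and state 0 carries
    reward 0.  From state 1, every action returns to state 0 with probability
    delta; from state 0, every action reaches state 1 with probability delta,
    except `best_action`, which reaches it with probability delta + eps."""
    P = np.zeros((2, num_actions, 2))
    P[0, :, 1] = delta                      # state 0 -> state 1
    P[0, :, 0] = 1.0 - delta
    P[0, best_action, 1] = delta + eps      # the unique best action
    P[0, best_action, 0] = 1.0 - delta - eps
    P[1, :, 0] = delta                      # state 1 -> state 0
    P[1, :, 1] = 1.0 - delta
    rewards = np.array([0.0, 1.0])          # reward depends only on the state
    return P, rewards
```

Two such MDPs that differ only in `best_action` differ exactly at the state-action pairs involving the best actions from state 0, matching the distinguishing pair mentioned in the sketch.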
Using the same technique as in the proof of Lemma 5, we can show that applying UCRL2 (Jaksch et al., 2010) in every episode individually leads to a regret upper bound of . This implies that learning every episode individually already gives a near-optimal regret guarantee.
Remark 6
Our proof for Lemma 5 contains a simple yet rigorous proof for the mixing-time argument used in Mao et al. (2021); Jin et al. (2018). This argument claims that for JAO MDPs, when the diameter is sufficiently small compared to the horizon, the optimal -step value function in the regret of the episodic setting can be replaced by the optimal average reward in the undiscounted setting without changing the order of the lower bound. To the best of our knowledge, our proof is the first rigorous proof for this argument that applies to any number of episodes, including . Domingues et al. (2021b) provide an alternative proof; however, the results therein hold in a different setting where is sufficiently large and the horizon can be much smaller than .
We emphasize that the lower bound in Lemma 5 holds for any learning algorithm. This result motivates the more interesting setting in which is a fixed and large constant independent of . In this case, we are interested in an instance-specific lower bound. For multi-armed bandits, instance-specific lower bounds are constructed with respect to a class of uniformly good learning algorithms (Lai and Robbins, 1985). In our setting, we focus on defining a class of uniformly good algorithms that includes the cluster-then-learn algorithms from previous works on multi-task PAC RL, such as Finite-Model-RL (Brunskill and Li, 2013) and PAC-EXPLORE (Guo and Brunskill, 2015). We consider a class of MDPs and a cluster-then-learn algorithm to be uniformly good if they satisfy an intuitive property: for any MDP in that class, the algorithm should be able to correctly classify whether a cluster of samples is from that MDP or not with an arbitrarily low (but not zero) failure probability, provided that the horizon is sufficiently long for the algorithm to collect enough samples. The following definition formalizes this idea.
Definition 7 (PAC identifiability of MDPs)
A set of models of size is PAC identifiable if there exists a function , a sample collection policy and a classification algorithm with the following property: for every , for each model in , if is run for steps and the state-transition samples are given to , then the algorithm returns the correct identity of with probability at least , where the probability is taken over all possible sequences of samples collected by running on for steps. The smallest choice of function among all possible choices is called the sample complexity of model identification of .
The clustering algorithm in a cluster-then-learn framework solves a problem different from classification: it only needs to tell whether a cluster of samples comes from the same distribution as another cluster or from a different one, not the identity of the distribution. We can reduce one problem to the other by the following construction: consider an adversary that gives all models in the first episodes. After the first episodes, there are clusters of samples, each corresponding to one model in . Once the learner has constructed different clusters, from the episode onward the clustering problem is as hard as classification, since identifying the right cluster immediately implies the identity of the MDP from which the samples come, and vice versa. Hence, the sample complexity lower bound for classification carries over to clustering.
Next, we show the lower bound on the sample complexity of model identification for the class of -separable communicating MDPs.
Lemma 8
For any and , there exists a PAC identifiable -separable set of MDPs of size , each with at most states, actions and diameter such that for any classification algorithm , if the number of state-transition samples given to is less than then for at least one MDP in , algorithm fails to identify that MDP with probability at least .
Proof (Sketch) The set is a set of 2-JAO MDPs, shown in Figure 1 (right). Each 2-JAO MDP combines two JAO MDPs with the same number of actions and with diameter in the range ; one is -separable and one is the hard instance for the minimax lower bound of Jaksch et al. (2010). Rewards exist only in the part containing the hard instance. If a learner completely ignores the -separable part, by Lemma 5 the learner cannot do much better than just learning every episode individually. On the other hand, with enough samples from the -separable part, the learner can identify the MDP and use the samples collected in the previous episodes of the same MDP to accelerate learning the hard instance part. However, the -separable part is also a JAO MDP, for which no useful information from previous episodes can help identify the MDP in the current episode.
Only the actions at state are -distinguishing and can be used to identify the MDPs. Taking an action in state can be seen as flipping a coin: heads for transitioning to another state and tails for staying in state . Identifying a 2-JAO MDP reduces to the problem of using at most coin flips to identify, in a matrix of coins, a row whose coins are slightly different from the others. The first column has fair coins except in row , where the success probability is . The second column has coins with success probability except in row , where the coin is upwardly biased by . Lemma 23 and Corollary 24 in the appendix show a lower bound on the number of coin flips on the first column (the left part of the 2-JAO MDP), implying the desired result.
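The coin-flipping reduction can be made concrete with a small simulation; the probabilities below are illustrative placeholders, not the exact instance-specific values:

```python
import numpy as np

rng = np.random.default_rng(0)

def coin_table(num_rows, special_row, base_probs, biases):
    """Success probabilities of the two-column coin table: column c has
    probability base_probs[c] in every row except `special_row`, where it
    is base_probs[c] + biases[c]."""
    table = np.tile(np.asarray(base_probs, dtype=float), (num_rows, 1))
    table[special_row] += np.asarray(biases, dtype=float)
    return table

def guess_special_row(table, flips_per_coin):
    """Naive identification strategy: flip every coin in the first column
    the same number of times and return the row with the most heads."""
    heads = rng.binomial(flips_per_coin, table[:, 0])
    return int(np.argmax(heads))

# Illustrative instance: fair coins in column 0 except one slightly biased
# coin; reliable identification needs roughly (number of rows)/(bias^2) flips.
table = coin_table(num_rows=20, special_row=7, base_probs=(0.5, 0.1), biases=(0.05, 0.05))
print(guess_special_row(table, flips_per_coin=2000))
```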
Lemma 8 implies that for 2-JAO MDPs, any uniformly good model identification algorithm needs to collect at least samples from state on the left part. Whenever an action towards state is taken from state , the learner may end up in state . Once in state , the learner needs to get back to state to obtain the next useful sample. The expected number of actions needed to get back to state from state is . This implies the following two lower bounds on the horizon of the clustering phase and the total regret of any cluster-then-learn algorithm.
Corollary 9
For any and , there exists a PAC identifiable -separable set of MDPs of size , each with states, actions and diameter such that for any uniformly good cluster-then-learn algorithm, to find the correct cluster with probability of at least , the expected number of exploration steps needed in the clustering phase is . Furthermore, the expected regret over episodes of the same algorithm is
Proof (Sketch) In the lower bound construction, the learner is assumed to know everything about the set of models, including their optimal policies. Hence, after having identified the model in the clustering phase, the learner can follow the optimal policy in the learning phase and incur a small regret of at most in this phase. Therefore, the regret is dominated by the regret in the clustering phase, which is of order .
Remark 10
The lower bound in Corollary 9 holds for a particular class of uniformly good cluster-then-learn algorithms under an adaptive adversary. It remains an open question whether this lower bound holds for all algorithms, not just cluster-then-learn algorithms.
Remark 11
Corollary 9 implies that, without further assumptions, it is not possible to improve the dependency on . At first glance, this seems to contradict existing results in the bandits and online learning literature, where the regret bound depends on , where is the difference in expected reward between the best arm and the sub-optimal arms. However, does not play the same role as the gaps in bandits. Observe that on the 2-JAO MDPs, the arms with positive reward lie only in the right JAO MDP. The lower-bound learner knows this, but chooses to pull the (zero-reward) arms on the left JAO MDP to collect side information that helps it learn the right part faster. In this analogy, the learner already knows the arms on the left JAO MDP are suboptimal; the role of is in model identification, for which similar lower bounds are known (e.g. Tulsiani, 2014).
Finally, we construct a non-communicating variant of the 2-JAO MDP to show that the finite diameter assumption is necessary. Figure 4 in Appendix C illustrates this construction. On this variant, all the transitions from state to state are reversed. In addition, no actions take state to state , making this MDP non-communicating. A set of these non-communicating MDPs is still -separable due to the state-action pairs that start at state . However, by setting the initial state to , the adversary can force the learner to operate only on the right part, regardless of how large is.
4 Non-Asymptotic Upper Bounds
We propose and analyze AOMultiRL, a polynomial time cluster-then-learn algorithm that obtains a high-probability regret bound of . In each episode, the learner starts with the clustering phase to identify the cluster of samples generated in previous episodes that corresponds to the same task. Once the right cluster is identified, the learner can use the samples from those previous episodes in the learning phase.
A fundamental difference between the undiscounted infinite horizon setting considered in previous works (Guo and Brunskill, 2015; Brunskill and Li, 2013) and the episodic finite horizon in our work is the horizon of the two phases. In previous works, different episodes might have different horizons for the clustering phase depending on whether the learner decides to start exploration at all (Brunskill and Li, 2015) or which state-action pairs are to be explored (Brunskill and Li, 2013). This poses a challenge for the episodic finite-horizon setting, because a varying horizon for the clustering phase leads to a varying horizon for the learning phase. Thus, standard single-task algorithms that rely on a fixed horizon such as UCBVI (Azar et al., 2017) and StrongEuler (Simchowitz and Jamieson, 2019) cannot be applied directly. From an algorithmic standpoint, for a fixed horizon , a non-asymptotic bound on the horizon of the clustering phase is necessary so that the learner knows exactly whether is large enough and when to stop collecting samples.
AOMultiRL alleviates this issue by setting a fixed horizon for the clustering phase, which reduces the learning phase to standard single-task episodic RL. First, we state an assumption on the ergodicity of the MDPs.
Assumption 12
The hitting times of all MDPs in are bounded by a known constant .
The main purpose of Assumption 12 is to simplify the computation of a non-asymptotic upper bound for the clustering phase, in order to keep the exposition focused on the main ideas. We discuss a method for removing this assumption in Appendix G.
Algorithm 1 outlines the main steps of our approach. Given a set of -distinguishing state-action pairs, in the clustering phase the learner employs a history-dependent policy specified by Algorithm 2, ExploreID, to collect at least samples for each state-action pair in , where will be determined later. Once all in have been visited at least times, Algorithm 3, IdentifyCluster, computes the empirical means of the transition function of these and then compares them with those in each cluster to determine which cluster contains the samples from the same task (or whether none does, in which case a new cluster is created). For the rest of the episode, the learner uses the UCBVI-CH algorithm (Azar et al., 2017) to learn the optimal policy.
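The per-episode control flow can be summarized by the following sketch; `explore_id`, `identify_cluster`, and `single_task_learner` are injected placeholders standing in for Algorithms 2, 3 and UCBVI-CH rather than an actual API:

```python
def aomultirl_episode(env, clusters, D, m, tau, horizon, threshold,
                      explore_id, identify_cluster, single_task_learner):
    """One episode of the cluster-then-learn loop (high-level sketch only).

    `clusters` is a list of dicts holding the clustering-phase and
    learning-phase samples accumulated for each discovered task."""
    # Clustering phase: visit every pair in the distinguishing set D at
    # least m times, using at most tau steps of the episode.
    clustering_samples, state = explore_id(env, D, m, max_steps=tau)

    # Match the fresh samples against the existing clusters; a return value
    # of len(clusters) means "no cluster matched, open a new one".
    k = identify_cluster(clustering_samples, clusters, D, threshold)
    if k == len(clusters):
        clusters.append({"clustering": [], "learning": []})
    clusters[k]["clustering"].extend(clustering_samples)

    # Learning phase: run a single-task algorithm on the remaining steps,
    # warm-started with the learning-phase samples stored in cluster k.
    policy, learning_samples = single_task_learner(
        env, state, horizon - tau, prior_samples=clusters[k]["learning"])
    clusters[k]["learning"].extend(learning_samples)
    return policy
```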
The algorithms and results up to Theorem 16 are presented for a general set . Since is generally unknown, Corollary 17 shows the result for and .
4.1 The Exploration Algorithm
Given a collection of state-transition tuples, the empirical transition function is estimated by dividing the number of observed instances of each transition triple by the number of visits to the corresponding state-action pair, where these counts are the numbers of instances of the triple and of the pair in the collection, respectively.
For each episode , let denote the transition function of the task and denote the collection of samples collected during the learning phase. The empirical means estimated using samples in are . The value of can be chosen so that for all , with high probability is close to . Specifically, we find that if is large enough so that is within in norm of the true function , then the right cluster can be identified in every episode. The exact value of is given in the following lemma.
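A sketch of the empirical estimator discussed here, assuming the clustering-phase samples are summarized by a count tensor (naming is ours):

```python
import numpy as np

def empirical_transitions(counts):
    """counts[s, a, s2] = number of observed (s, a, s2) transitions.
    Returns the empirical transition estimate; unvisited (s, a) pairs
    default to the uniform distribution."""
    visits = counts.sum(axis=-1, keepdims=True)            # n(s, a)
    num_states = counts.shape[-1]
    return np.where(visits > 0, counts / np.maximum(visits, 1), 1.0 / num_states)

def max_l1_gap(p_hat, q_hat, pairs):
    """Largest l1 distance between two estimates over a set of (s, a) pairs."""
    return max(np.abs(p_hat[s, a] - q_hat[s, a]).sum() for (s, a) in pairs)
```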
Lemma 13
Suppose the learner is given a constant and a -distinguishing set . If each state-action pair in is visited at least times during the clustering phase of each episode , then with probability at least , the event
The exploration in AOMultiRL is modelled as an instance of the active model estimation problem (Tarbouriech et al., 2020). Given the current state , if there exists an action such that and has not been visited at least times, this action will be chosen (with ties broken by selecting the most chosen action). Otherwise, the algorithm chooses an action that has the highest estimated probability of leading to an under-sampled state-action pair in . The following lemma computes the number of steps in the clustering phase.
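A minimal sketch of this greedy exploration rule, with variable names of our own choosing:

```python
import numpy as np

def explore_action(s, visits, p_hat, D, m):
    """One step of the exploration rule: visits[s, a] counts how often (s, a)
    has been taken in this episode's clustering phase, p_hat is the current
    empirical transition estimate of shape (S, A, S), D is the distinguishing
    set and m is the required number of visits per pair in D."""
    under_sampled = {(x, a) for (x, a) in D if visits[x, a] < m}
    local = [a for (x, a) in under_sampled if x == s]
    if local:
        # An under-sampled distinguishing action is available at the current
        # state; ties are broken toward the most frequently chosen action.
        return max(local, key=lambda a: visits[s, a])
    # Otherwise head toward the states of under-sampled pairs: choose the
    # action with the highest estimated probability of reaching one of them.
    targets = {x for (x, _) in under_sampled}
    scores = [sum(p_hat[s, a, x] for x in targets) for a in range(visits.shape[1])]
    return int(np.argmax(scores))
```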
4.2 The Clustering Algorithm
Denote by the set of clusters, the number of clusters and the cluster. Each is a collection of two multisets which contain the samples collected during the clustering and learning phases, respectively. Formally, up to episode we have
where and are the state and action at time step of episode , respectively and is the cluster index returned by Algorithm 3 in episode .
Let denote the empirical means estimated using samples in . For each episode , by Lemma 14, with high probability each state-action pair has been visited at least times after the first steps. Algorithm 3 determines the right cluster for a task by computing the distance between and the empirical transition function of each cluster . If there exists an such that the distance is larger than a certain threshold , i.e., , then the algorithm concludes that the task belongs to a different cluster. Otherwise, the task is considered to belong to cluster . We set . The following lemma shows that with this choice of , the right cluster is identified by Algorithm 3 in all episodes.
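A sketch of the resulting cluster test, reusing the kind of estimates shown earlier (the threshold is passed in rather than fixed here):

```python
def identify_cluster(episode_hat, cluster_hats, D, threshold):
    """Return the index of the first cluster whose empirical transition
    estimate stays within `threshold` (in l1 norm) of the current episode's
    estimate on every pair in D; return len(cluster_hats) if none does,
    signalling that a new cluster should be created."""
    for k, c_hat in enumerate(cluster_hats):
        gaps = [abs(episode_hat[s, a] - c_hat[s, a]).sum() for (s, a) in D]
        if max(gaps) <= threshold:
            return k
    return len(cluster_hats)   # no existing cluster matches
```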
Lemma 15
Consider a -separable set of MDPs and an -distinguishing set where . If the events defined in Lemma 13 hold for all , then with the distance threshold Algorithm 3 always produces a correct output in each episode: the trajectories of the same model in two different episodes are clustered together and no two trajectories of two different models are in the same cluster.
Once the clustering phase finishes, the learner enters the learning phase and uses the UCBVI-CH algorithm (Azar et al., 2017) to learn the optimal policy for this phase. In principle, almost all standard single-task RL algorithms with a near-optimal regret guarantee can be used for this phase. We chose UCBVI-CH to simplify the analysis and make the exposition clear.
To simulate the standard single-task episodic learning setting, the learner only uses the samples in for regret minimization. The impact of combining samples in two phases for regret minimization is addressed in Appendix H. Theorem 16 states a regret bound for Algorithm 1.
Theorem 16
For any failure probability , with probability at least the regret of Algorithm 1 is bounded as
where , , , and .
For , the first two terms are the most significant. The term accounts for the clustering phase and the fact that the exploration policy might lead the learner to an undesirable state after steps. The term comes from the fact that the learning phase is equivalent to episodic single-task learning with horizon . When , the sub-linear bound on the learning phase is a major improvement compared to the bound of the strategy that learns each episode individually.
By setting and , we obtain
Corollary 17
For any failure probability , with probability at least , by setting with , the regret of Algorithm 1 is
(2)
where .
Time Complexity The clustering algorithm runs once in each episode, which leads to time complexity of . When , the overall time complexity is dominated by the learning phase, which is for UCBVI-CH.
Remark 18
Instead of clustering, a different paradigm involves actively merging samples from different MDPs to learn a model that is an averaged estimate of the MDPs in . The best regret guarantee in this paradigm, to the best of our knowledge, is , where is a variation budget, achieved by RestartQ-UCB (Mao et al., 2021, Theorem 3). In our setting, if the adversary frequently alternates between tasks, then and therefore this bound becomes , which is larger than the trivial bound and worse than the bound in Corollary 17. If the adversary selects tasks so that is small, i.e., , then the bound offered by RestartQ-UCB is better, since it is sub-linear in . Note that this does not contradict the lower bound result in Section 3, since the lower bound is constructed with an adversary that selects tasks uniformly at random, in which case is linear in .
4.3 Learning a distinguishing set when is small
As pointed out by Brunskill and Li (2013), for all , the size of the smallest -distinguishing set of is at most . If and such a set is known to the learner, then the clustering phase only needs to collect samples from this set instead of the full set of state-action pairs. However, in general this set is not known. We show that if the adversary is weaker, so that all models are guaranteed to appear at least once early on, then the learner will be able to discover a -distinguishing set of size at most . Specifically, we employ the following assumption:
Assumption 19
There exists an unknown constant satisfying such that after at most episodes, each model in has been given to the learner at least once.
In order to discover , the learner uses Algorithm 4, which consists of two stages:
• Stage 1: the learner starts by running Algorithm 1 with the -distinguishing set candidate until the number of clusters is . With high probability, each cluster corresponds to a model. At the end of stage 1, the learner uses the empirical estimates in all clusters for to construct a -distinguishing set for .
• Stage 2: the learner runs Algorithm 1 with the distinguishing set as an input.
Extracting -distinguishing pairs: After episodes, with high probability there are clusters corresponding to models. For two clusters and , the set contains the first state-action pair that satisfies . With high probability, every satisfies this condition, hence .
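A sketch of this extraction step, operating on the stage-1 empirical estimates of the clusters (the concrete threshold used in the paper is not asserted here):

```python
def first_distinguishing_pair(p_hat, q_hat, threshold):
    """First (s, a) whose transition estimates differ by more than
    `threshold` in l1 norm, or None if no such pair exists."""
    S, A, _ = p_hat.shape
    for s in range(S):
        for a in range(A):
            if abs(p_hat[s, a] - q_hat[s, a]).sum() > threshold:
                return (s, a)
    return None

def discover_distinguishing_set(cluster_hats, threshold):
    """Union, over all pairs of clusters, of the first distinguishing
    state-action pair found for that pair; at most N(N-1)/2 pairs in total."""
    D = set()
    for i in range(len(cluster_hats)):
        for j in range(i + 1, len(cluster_hats)):
            pair = first_distinguishing_pair(cluster_hats[i], cluster_hats[j], threshold)
            if pair is not None:
                D.add(pair)
    return D
```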
Let denote the index of the MDP model corresponding to cluster . For all , by the triangle inequality, we have
where is omitted for brevity. It follows that the set is -distinguishing and . Although is smaller than the -separation level of , it is sufficient for the conditions in Lemma 15 to hold. Thus, with high probability the clustering algorithm in stage 2 works correctly. The next theorem shows the regret guarantee of Algorithm 4.
Theorem 20
Compared to Corollary 17, Theorem 20 improves the clustering phase’s dependency from to . This implies that if the number of models is small and all models appear relatively early, we can discover a -distinguishing set quickly without increasing the order of the total regret bound.
5 Experiments

We evaluate AOMultiRL on a sequence of episodes, where the task in each episode is taken from a set of MDPs. Each MDP in is a grid of cells with valid actions: up, down, left, right. The state for row and column (0-indexed) is represented by the tuple . The reward is 0 in every state, except for the four corners , and , where the reward is 1. We fix the initial state at .
To simulate an adversarial sequence of tasks, episodes 100 to 150 and episodes 180 to 200 contain only the MDP . The other episodes choose and uniformly at random. The hitting time is and the failure probability is . We use the rlberry framework (Domingues et al., 2021a) for our implementation; the code is available at https://github.com/ngmq/adversarial-online-multi-task-reinforcement-learning.
We construct the transition functions so that each MDP has only one easy-to-reach corner, which corresponds to a unique optimal policy. The separation level is . Furthermore, there exist state-action pairs that are -distinguishing but not -distinguishing. More details can be found in Appendix J.
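To convey the flavor of this construction, here is a sketch of a gridworld model with one easy-to-reach rewarding corner per model; the grid size, slip probabilities, and the exact mechanism making a corner easy to reach are our own illustrative choices, not the ones used in Appendix J:

```python
import numpy as np

def grid_model(side, easy_corner, slip=0.3, easy_slip=0.05):
    """Gridworld sketch: states are (row, col) cells, actions are
    up/down/left/right, the four corners give reward 1, and moves toward
    `easy_corner` fail (stay in place) less often than other moves."""
    moves = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}   # up, down, left, right
    corners = {(0, 0), (0, side - 1), (side - 1, 0), (side - 1, side - 1)}
    states = [(r, c) for r in range(side) for c in range(side)]
    idx = {s: i for i, s in enumerate(states)}
    P = np.zeros((len(states), len(moves), len(states)))
    R = np.array([1.0 if s in corners else 0.0 for s in states])
    for (r, c) in states:
        for a, (dr, dc) in moves.items():
            nr = min(max(r + dr, 0), side - 1)
            nc = min(max(c + dc, 0), side - 1)
            toward_easy = (abs(nr - easy_corner[0]) + abs(nc - easy_corner[1])
                           < abs(r - easy_corner[0]) + abs(c - easy_corner[1]))
            fail = easy_slip if toward_easy else slip
            P[idx[(r, c)], a, idx[(nr, nc)]] += 1.0 - fail   # intended move
            P[idx[(r, c)], a, idx[(r, c)]] += fail           # stay in place
    return P, R
```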
The baseline algorithms include a random agent that chooses actions uniformly at random, a one-episode UCBVI agent which does not group episodes and learns using only the samples of each episode, and the optimal non-stationary agent that acts optimally in every episode. The first and the last baselines serve as lower-bound and upper-bound performance references for AOMultiRL, while the second baseline helps illustrate the effectiveness of clustering episodes correctly. We evaluate two instances of AOMultiRL: AOMultiRL1 with a set of given and AOMultiRL2 without any distinguishing set given. We follow the approach of Kwon et al. (2021) and evaluate all five algorithms based on their expected cumulative reward when starting at state and following their learned policy for steps (averaged over 10 runs). While the horizon for the learning phase is much smaller than the horizon for the clustering phase of , we ensure the fairness of the comparisons by not using the samples collected in the clustering phase in the learning phase, thus simulating the setting where without the need to use significantly larger MDPs. We use the average per-episode reward as the performance metric. Figure 2 shows the results.
The effectiveness of clustering for the learning phase. To measure the effectiveness of aggregating samples from episodes of the same task for the learning phase, we compare AOMultiRL1 and the one-episode UCBVI agent. Since for every pair of MDP models, the transition functions are distinct for state-action pairs adjacent to two of the corners, AOMultiRL1 can learn an accurate model estimate for each MDP only if the clustering phase produces correct clusters in most of the episodes. We can observe in Figure 2 that after about thirty episodes, AOMultiRL1 starts outperforming the one-episode UCBVI agent and approaching the performance of the optimal non-stationary agent. The model appears for the first time in episode 100, which accounts for the sudden drop in performance in that episode. Afterwards, the performance of AOMultiRL1 steadily increases again. This demonstrates that AOMultiRL1 is able to identify the correct cluster in most of the episodes, which enables the multi-episode UCBVI algorithm in AOMultiRL1 to estimate the MDP models much more accurately than the non-transfer one-episode UCBVI agent. This suggests that for larger MDPs where , spending a number of initial steps on finding the episodes of the same task would yield higher long-term rewards.
Performance of AOMultiRL with the discovered . Next, we examine the performance of AOMultiRL2 when no distinguishing set is given. We run AOMultiRL2 for 204 episodes, in which stage 1 consists of the first four episodes, each containing one of the four MDP models in . As the identities of the models are not given, the algorithm has to correctly construct four clusters and then compute a -distinguishing set after the episode even though each model is seen just once. As mentioned above, the MDPs are set up so that if AOMultiRL2 correctly identifies four clusters, then the discovered will contain at least one state-action pair that is -distinguishing but not -distinguishing. In stage 2, the horizon of the learning phase is set to the same value used for AOMultiRL1. The performance in stage 2 of AOMultiRL2 approaches that of AOMultiRL1, indicating that the discovered is as effective as the set given to AOMultiRL1.
6 Conclusion
In this paper, we studied the adversarial online multi-task RL setting with the tasks belonging to a finite set of well-separated models. We used a general notion of task-separability, which we call -separability. Under this notion, we proved a minimax regret lower bound that applies to all algorithms and an instance-specific regret lower bound that applies to a class of uniformly good cluster-then-learn algorithms. We further proposed AOMultiRL, a polynomial time cluster-then-learn algorithm that obtains a nearly-optimal instance-specific regret upper bound. These results addressed two fundamental aspects of online multi-task RL, namely learning an adversarial task sequence and learning under a general task-separability notion. Adversarial online multi-task learning remains challenging when the diameter and the number of models are unknown; this is left for future work.
Acknowledgement
This work was supported by the NSERC Discovery Grant RGPIN-2018-03942.
References
- Abel et al. (2018) David Abel, Yuu Jinnai, Sophie Yue Guo, George Konidaris, and Michael Littman. Policy and value transfer in lifelong reinforcement learning. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 20–29. PMLR, 10–15 Jul 2018.
- Auer and Ortner (2007) Peter Auer and Ronald Ortner. Logarithmic online regret bounds for undiscounted reinforcement learning. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems, volume 19. MIT Press, 2007.
- Auer et al. (2002) Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002. doi: 10.1137/S0097539701398375. URL https://doi.org/10.1137/S0097539701398375.
- Azar et al. (2013) Mohammad Gheshlaghi Azar, Alessandro Lazaric, and Emma Brunskill. Sequential transfer in multi-armed bandit with finite set of models. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013.
- Azar et al. (2017) Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 263–272. PMLR, 06–11 Aug 2017.
- Brunskill and Li (2013) Emma Brunskill and Lihong Li. Sample complexity of multi-task reinforcement learning. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, UAI’13, page 122–131, Arlington, Virginia, USA, 2013. AUAI Press.
- Brunskill and Li (2015) Emma Brunskill and Lihong Li. The online discovery problem and its application to lifelong reinforcement learning. CoRR, abs/1506.03379, 2015.
- Cover and Thomas (2006) Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, USA, 2006. ISBN 0471241954.
- Dann and Brunskill (2015) Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.
- Dann et al. (2017) Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
- Domingues et al. (2021a) Omar Darwiche Domingues, Yannis Flet-Berliac, Edouard Leurent, Pierre Ménard, Xuedong Shang, and Michal Valko. rlberry - A Reinforcement Learning Library for Research and Education, 10 2021a. URL https://github.com/rlberry-py/rlberry.
- Domingues et al. (2021b) Omar Darwiche Domingues, Pierre Ménard, Emilie Kaufmann, and Michal Valko. Episodic reinforcement learning in finite MDPs: Minimax lower bounds revisited. In Vitaly Feldman, Katrina Ligett, and Sivan Sabato, editors, Proceedings of the 32nd International Conference on Algorithmic Learning Theory, volume 132 of Proceedings of Machine Learning Research, pages 578–598. PMLR, 16–19 Mar 2021b. URL https://proceedings.mlr.press/v132/domingues21a.html.
- Guo and Brunskill (2015) Zhaohan Guo and Emma Brunskill. Concurrent PAC RL. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, page 2624–2630. AAAI Press, 2015. ISBN 0262511290.
- Hallak et al. (2015) Assaf Hallak, Dotan Di Castro, and Shie Mannor. Contextual Markov Decision Processes, 2015.
- Jaksch et al. (2010) Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(51):1563–1600, 2010.
- Jin et al. (2018) Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably efficient? In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
- Kwon et al. (2021) Jeongyeol Kwon, Yonathan Efroni, Constantine Caramanis, and Shie Mannor. RL for latent MDPs: Regret guarantees and a lower bound. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021.
- Lai and Robbins (1985) Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4–22, 1985.
- Lecarpentier et al. (2021) Erwan Lecarpentier, David Abel, Kavosh Asadi, Yuu Jinnai, Emmanuel Rachelson, and Michael L. Littman. Lipschitz lifelong reinforcement learning. In AAAI, 2021.
- Levin et al. (2008) David Asher Levin, Yuval Peres, and Elizabeth Lee Wilmer. Markov Chains and Mixing Times. American Mathematical Soc., 2008. ISBN 9780821886274. URL http://pages.uoregon.edu/dlevin/MARKOV/.
- Mao et al. (2021) Weichao Mao, Kaiqing Zhang, Ruihao Zhu, David Simchi-Levi, and Tamer Basar. Near-optimal model-free reinforcement learning in non-stationary episodic MDPs. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 7447–7458. PMLR, 18–24 Jul 2021. URL http://proceedings.mlr.press/v139/mao21b.html.
- Sason (2015) Igal Sason. On reverse Pinsker inequalities. arXiv preprint arXiv:1503.07118, 2015.
- Simchowitz and Jamieson (2019) Max Simchowitz and Kevin G Jamieson. Non-asymptotic gap-dependent regret bounds for tabular MDPs. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
- Steimle et al. (2021) Lauren N. Steimle, David L. Kaufman, and Brian T. Denton. Multi-model Markov decision processes. IISE Transactions, 53(10):1124–1139, 2021. doi: 10.1080/24725854.2021.1895454.
- Sun and Huang (2020) Yanchao Sun and Furong Huang. Can agents learn by analogy? An inferable model for PAC reinforcement learning. In Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’20, page 1332–1340, Richland, SC, 2020. International Foundation for Autonomous Agents and Multiagent Systems. ISBN 9781450375184.
- Tarbouriech et al. (2020) Jean Tarbouriech, Shubhanshu Shekhar, Matteo Pirotta, Mohammad Ghavamzadeh, and Alessandro Lazaric. Active model estimation in Markov Decision Processes. In Uncertainty in Artificial Intelligence, 2020. URL http://proceedings.mlr.press/v124/tarbouriech20a/tarbouriech20a-supp.pdf.
- Tarbouriech et al. (2021) Jean Tarbouriech, Matteo Pirotta, Michal Valko, and Alessandro Lazaric. A provably efficient sample collection strategy for reinforcement learning. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=AvVDR8R-kQX.
- Tulsiani (2014) Madhur Tulsiani. Lecture 6, lecture notes in information and coding theory, October 2014. URL https://home.ttic.edu/~madhurt/courses/infotheory2014/l6.pdf.
- Weissman et al. (2003) Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J Weinberger. Inequalities for the l1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep, 2003.
Appendix A Related Work
Stochastic Online Multi-task RL. The Finite-Model-RL algorithm (Brunskill and Li, 2013) considered the stochastic setting with infinite-horizon MDPs and focused on deriving a sample complexity of exploration in a -PAC setting. As shown by Dann et al. (2017), even an optimal -PAC bound can only guarantee a necessarily sub-optimal regret bound for each task that appears in episodes, leading to an overall regret bound for the learning phase in the multi-task setting.
The Contextual MDPs algorithm by Hallak et al. (2015) is capable of obtaining a regret bound in the learning phase after the right cluster has been identified; however, their clustering phase has exponential time complexity in . The recent L-UCRL algorithm (Kwon et al., 2021) considered the stochastic finite-horizon setting and reduced the problem to learning the optimal policy of a POMDP. Under a set of assumptions that allow the clusters to be discovered in , L-UCRL is able to obtain an overall regret with respect to a POMDP planning oracle which aims to learn a policy that maximizes the expected single-task return when a task is randomly drawn from a known distribution of tasks. In contrast, our work adopts a stronger notion of regret that encourages the learner to maximize its expected return for a sequence of tasks chosen by an adversary. When the models are bandits instead of MDPs, Azar et al. (2013) use spectral learning to estimate the mean reward of the arms in all models and obtain an upper bound linear in .
Lifelong RL. Learning a sequence of related tasks is better studied in the lifelong learning literature. Recent works in lifelong RL (Abel et al., 2018; Lecarpentier et al., 2021) often focus on the setting where tasks are drawn from an unknown distribution of MDPs and there exists some similarity measure between MDPs that supports transfer learning. Our work instead focuses on learning the dissimilarity between tasks for the clustering phase and avoiding negative transfer.
Active model estimation The exploration in AOMultiRL is modelled after the active model estimation problem (Tarbouriech et al., 2020), which is often presented in PAC-RL setting. Several recent works on active model estimation are PAC-Explore (Guo and Brunskill, 2015), FW-MODEST (Tarbouriech et al., 2020), curious walking (Sun and Huang, 2020), and GOSPRL (Tarbouriech et al., 2021).
The bound on the horizon of clustering in Lemma 14 has the same dependency on the number of states and actions as the state-of-the-art bound by GOSPRL (Tarbouriech et al., 2021) for the active model estimation problem. The main drawback is that depends linearly on the hitting time and not the diameter of the MDPs. As the hitting time is often strictly larger than the diameter (Jaksch et al., 2010; Tarbouriech et al., 2021), this dependency on is sub-optimal. On the other hand, AOMultiRL is substantially less computationally expensive than GOSPRL since there is no shortest-path policy computation involved.
Appendix B The generality of -separability notion
In this section, we show that the general separation notion in Definition 1 defines a broader class of online multi-task RL problems that extends the entropy-based separation assumption in the latent MDPs setting (Kwon et al., 2021). We start by restating the entropy-based separation condition of Kwon et al. (2021):
Definition 21
Let denote the class of all history-dependent and possibly non-Markovian policies, and let be a trajectory of length sampled from MDP by a policy . The set is well-separated if the following condition holds:
(3)
where is a target failure probability, are universal constants and is the probability that is realized when running policy on model .
The following lemma constructs a set of just two models that satisfy the -separability condition but not the entropy-based separation condition.
Lemma 22
Given any and any constants , there exists a set of MDPs with horizon that is -separable but is not well-separated in the sense of Definition 21.
Proof Consider the set with in Figure 3. Both and have the same transition functions in all state-action pairs except for :
It follows that the distance between and is
As a result, this set is -separable. However, any deterministic policy that takes action in and an arbitrary action in and will induce the same Markov chain on the two MDP models. Thus, the entropy-based separation condition does not hold. An example of such a policy is shown below.
Consider running the following deterministic policy on model :
Consider an arbitrary trajectory . The probability that this trajectory is realized with respect to both models is
(4)–(7)
As a result, for all ,
(8)
which implies that
(9)
which is larger than .
Appendix C Proofs of the lower bounds
Lemma 5 (restated)
Proof We construct in the following way: each MDP in is a JAO MDP (Jaksch et al., 2010) of two states and actions and diameter . The translation from this JAO MDP to an MDP with states, actions and diameter is straightforward (Jaksch et al., 2010). State has reward while state has no reward. In state , for all actions the probability of transitioning to state is except for one best action where this probability is . Every MDP in has a unique best action: for , the action is the best action in the MDP . The starting state is always .
We consider a learner who knows all the parameters of models in , except the identity of the task given in episode . We employ the following information-theoretic argument from Mao et al. (2021): when the task in episode is chosen uniformly at random from , no useful information from the previous episodes can help the learner identify the best action in . This is true since all the information in the previous episodes is samples from the MDPs in , which provide no more information than the parameters of the models in . Since , all actions (from state ) are equally likely to be the best action in . Therefore, the learner is forced to learn from scratch. It follows that the total regret of the learner is the sum of the one-episode-learning regrets in every episode:
where is the one-episode-learning regret in episode . One-episode learning is equivalent to learning in the undiscounted setting with horizon . Applying the lower bound for the undiscounted setting in Jaksch et al. (2010, Theorem 5) yields that for all ,
where is the average reward of the optimal policy (Jaksch et al., 2010). Note that since only state has reward , is also the stationary probability that the optimal learner is at state .
Next, we show that for all and , it holds that . The optimal policy on all induces a Markov chain between two states with transition matrix
Let be the probability that the Markov chain is in state after time steps with the initial state . Let . Obviously, . By Levin et al. (2008, Equation 1.8), we have . It follows that, for the optimal policy,
(10)–(14)
Hence,
(15)–(20)
where the last equality follows from .
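For reference, the geometric convergence of the induced two-state chain used in this derivation can be written explicitly (notation ours): if the chain moves from state 0 to state 1 with probability $p$ and from state 1 to state 0 with probability $q$, then
\[
P = \begin{pmatrix} 1-p & p \\ q & 1-q \end{pmatrix},
\qquad
\pi = \Big( \tfrac{q}{p+q},\ \tfrac{p}{p+q} \Big),
\qquad
\mu_t(1) - \pi(1) = (1-p-q)^t \big( \mu_0(1) - \pi(1) \big),
\]
so the state distribution $\mu_t$ converges to the stationary distribution $\pi$ geometrically at rate $|1-p-q|$, which matches the statement cited from Levin et al. (2008).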
For any and , we have , and hence . We conclude that
The upper bound of UCRL2 can be proved similarly: Theorem 2 in Jaksch et al. (2010) states that for any , by running UCRL2 with failure parameter , we obtain that for any initial state and any , with probability at least ,
(21)
Setting and trivially bounding the regret in the failure cases by , we obtain
(22)–(23)
This bound holds across all episodes, hence the total regret bound with respect to is . Combining this with the fact that , we obtain the upper bound.
Lemma 8 (restated). Before showing the proof of Lemma 8, we consider the following auxiliary problem: Suppose we are given three constants and a set of coins. The coins are arranged into a table of rows and columns so that each cell contains exactly one coin. The rows are indexed from to and the columns are indexed from to . In the first column, all coins are fair except for one coin at row , which is biased with probability of heads equal to . In the second column, all coins have probability of heads equal to except for the coin at row , which has probability of heads . In this setting, row is a special row that contains the most biased coins in the two columns. The objective is to find this special row after at most coin flips, where is a constant representing a fixed budget. Note that if we ignore the second column, then this problem reduces to the well-known problem of identifying one biased coin in a collection of -coins (Tulsiani, 2014).
Let be the number of flips an algorithm performs on the first and second column, respectively. For a fixed global budget , after coin flips, the algorithm recommends as its prediction for . Note that is a random stopping time which can depend on any information the algorithm observes up to time . Let be the random variable for the outcome of flip, and be the sequence of outcomes after flips. For , let denote the probability measure induced by corresponding to the case when . We first show that if the algorithm fails to flip the coins sufficiently many times in both columns, then for some the probability of failure is at least .
Lemma 23
Let and . For any algorithm , if
then there exists a set with such that
The proof uses a reasonably well-known reverse Pinsker inequality (Sason, 2015, Equation 10):
Let and be probability measures over a common discrete set. Then
(24)
where is the total variation distance. In the particular case where and are Bernoulli distributions with success probabilities and respectively, we get
(25)
Proof (of Lemma 23) As reasoned in the proof of the lower bound for multi-armed bandits (Auer et al., 2002), we can assume that is deterministic (conditional on the random history). Our proof closely follows the main steps of the proof in Tulsiani (2014) for the setting with only one column. We will lower bound the probability of mistake of based on its behavior on a hypothetical instance where .
To account for algorithms that do not exhaust both budgets and , we introduce two “dummy coins” by adding a zeroth row with two identical coins, solely for the analysis. These two coins have the same mean of 1 under all models, and hence flipping either of them provides no information. An algorithm that wishes to stop in a round simply flips any dummy coin in the remaining rounds . This way, we can always work with a sequence of outcomes in the analysis.
Let and denote the probability and expectation over taken on the hypothetical instance with , respectively. Let be the coin that the algorithm flips in step . Let denote the outcome of where is tails and is heads.
The number of flips the coin in row , column is
By the earlier definition of for , we have
We define
Clearly, at most rows satisfy and, similarly, at most rows satisfy . Therefore, .
We also define
As at most rows can satisfy , it holds that .
Consequently, defining , we have .
For any , we have
(26)–(28)
where the first inequality follows from Auer et al. (2002, Equation 28) since the final output is a function of the outcomes , and the last inequality is Pinsker's inequality.
Since is deterministic, the flip at step is fully determined given the previous outcomes . Applying the chain rule for KL-divergences (Cover and Thomas, 2006, Theorem 2.5.3) we obtain
Note that is the result of a single coin flip. When , the KL-divergence is zero since the two instances have identical coins in both columns. When , the KL-divergence is either or , depending on whether or , respectively. It follows that
Since and , we can bound (Tulsiani, 2014) and . Consequently,
Plugging this into Equation 28 and applying , we obtain
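As a quick numerical illustration of the per-flip KL divergences that drive this bound, the snippet below evaluates the two relevant Bernoulli KL terms and compares them against a quadratic-in-epsilon reference. The comparison constant 4 is illustrative and need not match the constant used in the proof.

```python
import numpy as np

def bernoulli_kl(p, q):
    """KL divergence KL(Bern(p) || Bern(q)) in nats."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

for eps in [0.01, 0.05, 0.1]:
    kl_col1 = bernoulli_kl(0.5, 0.5 + eps)            # fair coin vs. slightly biased coin
    kl_col2 = bernoulli_kl(0.5 + eps, 0.5 + 2 * eps)  # biased coin vs. more biased coin
    print(f"eps={eps:.2f}  KL(col 1)={kl_col1:.6f}  KL(col 2)={kl_col2:.6f}  "
          f"4*eps^2={4 * eps ** 2:.6f}")
```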
The next result shows that if is sufficiently small, then any algorithm has to flip the coins in the first column sufficiently many times; otherwise the probability of failure is at least .
Corollary 24
Let and be the constants defined in Lemma 23. Let be the budget for the number of flips on both columns. If , then for any algorithm , if
then there exists a set with such that
Proof We will show that when , the inequality holds trivially for any (recall that is the fixed budget for the total number of coin flips). The result then follows directly from Lemma 23. We have
which implies that always holds for any .
We are now ready to prove Lemma 8.
Proof (of Lemma 8) We construct as the set of 2-JAO MDPs in Figure 1 (right). Each MDP has a left part and a right part, where each part is a JAO MDP. The left part of the MDP consists of two states and actions numbered from to , where every action from state transitions to state with probability or stays at state with probability , except for the action that transitions to state with probability and stays at state with probability . The right part of the MDP consists of two states and also actions numbered from to , where every action from state transitions to state with probability or stays at state with probability , except for the action that transitions to state with probability and stays at state with probability . We set . We will show the conversion from these 2-JAO MDPs to MDPs with states and actions later.
Since each model in has a distinct index for the actions on the two parts that transition from to and with higher probability than any other action, identifying a model in is equivalent to identifying this distinct action. Each action on either part can be seen as a (possibly biased) coin, where the probability of getting tails equals the probability of ending up in state when the action is taken. Thus, the problem of identifying this distinct action index reduces to the above auxiliary problem of identifying the row of the most biased coins, where taking an action from state is equivalent to flipping a coin, and is replaced by . Corollary 24 states that for every algorithm, if the number of coin flips in the first column is less than , then there exists a set of at least candidate positions of the most biased row such that the algorithm fails to find the biased coin with probability at least . Correspondingly, for any model classification algorithm, if the number of state-transition samples from state 0 towards state 2 (i.e., the first column) is less than , then the algorithm fails to identify the model for at least MDPs in .
Finally, we show the conversion from the 2-JAO MDP to an MDP with states and actions. The conversion is almost identical to that of Jaksch et al. (2010), which starts with an atomic 2-JAO MDP of three states and actions and builds an -ary tree from there. Assuming is an even positive number, each part of the atomic 2-JAO MDP has actions. We make copies of these atomic 2-JAO MDPs, only one of which has the best action on the right part. Arranging copies of these atomic 2-JAO MDPs and connecting their states by connections, we obtain an -ary tree which represents a composite MDP with at most states, actions and diameter . The transitions of the actions on the tree are defined identically to those of Jaksch et al. (2010): self-loops for states and , and deterministic connections to the state of other nodes on the tree for state . By having in each atomic 2-JAO MDP, the diameter of this composite MDP is at most . This composite MDP is harder to explore and learn than the 2-JAO MDP with three states and actions, and hence all the lower bound results apply.
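For concreteness, a minimal sketch of the three-state 2-JAO MDP transition kernel described above is given below. The per-action transition probability `delta`, the two bias parameters, and the action indexing are placeholders, and the rewards (collected only in the rewarding state of the right part) are omitted.

```python
import numpy as np

def two_jao_mdp(actions_per_part, delta, eps_left, eps_right, star_left, star_right):
    """Transition tensor P[s, a, s'] of a three-state 2-JAO MDP sketch.

    State 0 is the central state; state 2 is the far state of the left part
    (used only for model identification) and state 1 is the far state of the
    right part (where reward is collected).  Actions 0..A-1 belong to the left
    part and A..2A-1 to the right part; star_left / star_right are the
    distinguished actions with the extra transition probability."""
    A = actions_per_part
    P = np.zeros((3, 2 * A, 3))
    for a in range(A):                       # left part: state 0 -> state 2
        p = delta + (eps_left if a == star_left else 0.0)
        P[0, a, 2], P[0, a, 0] = p, 1.0 - p
    for a in range(A, 2 * A):                # right part: state 0 -> state 1
        p = delta + (eps_right if a - A == star_right else 0.0)
        P[0, a, 1], P[0, a, 0] = p, 1.0 - p
    for s in (1, 2):                         # every action returns to state 0 w.p. delta
        P[s, :, 0], P[s, :, s] = delta, 1.0 - delta
    return P

P = two_jao_mdp(actions_per_part=4, delta=0.1, eps_left=0.05,
                eps_right=0.05, star_left=2, star_right=1)
assert np.allclose(P.sum(axis=-1), 1.0)
```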
See 9
Proof As argued in Section 3, we can carry the sample complexity lower bound for the classification problem over to the clustering algorithm. Using the same set of 2-JAO MDPs constructed in the proof of Lemma 8, for any given MDP , any PAC classification learner has to be in state and take at least actions from state to state . If the learner stays at state , then it can take the next action from to in the next time step. However, if the learner transitions to state , then it has to wait until it gets back to state to take the next action. Let denote the number of times the learner ends up in state after taking actions on the left part from state . Since every action from to has probability at least of ending up in state , we have
(29)
Since every action from state transitions to state with the same probability of , every time the learner is in state , the expected number of time steps it needs to get back to state is . Hence, the expected number of time steps the learner needs to get back to state after times being in state is . We conclude that for any PAC learner, the expected number of exploration steps needed to identify the model with probability of correctness at least is at least
(30)
Next, we lower bound the expected regret of the same algorithm. Let be the number of time steps the algorithm spends on the left part and on the right part of each model in , respectively. Note that and are random variables. Recall that the right part of each MDP in resembles the JAO MDP in the minimax lower bound proof of Lemma 5; hence we can apply the regret formula of the JAO MDP to the 2-JAO MDP and obtain that the regret in each episode is of the same order as
(31)–(36)
where
• the second equality follows from ,
• the third equality follows from the fact that the time steps spent on the left part of the MDP return no rewards,
• the fourth equality follows from the linearity of expectation,
• the inequality follows from and Equation 20,
• the last equality follows from for all and .
We conclude that the expected regret over episodes is at least
Appendix D Proofs of the upper bounds
First, we state the following concentration inequality for vector-valued random variables by Weissman et al. (2003).
Lemma 25 (Weissman et al. (2003))
Let be a probability distribution on the set . Let be a set of i.i.d. samples drawn from . Then, for all :
Using Lemma 25, we can show that samples are sufficient for each so that with high probability, the empirical means of the transition function are within of their true values, measured in distance.
Corollary 26
Denote . If a state-action pair is visited at least
(37)
times, then with probability at least ,
Proof We simplify the bound in Lemma 25 as follows:
Next, we substitute into the right hand side and solve the following inequality for :
to obtain . Thus satisfies this condition.
Taking a union bound of the result in Corollary 26 over all state-action pairs in the set of all episodes from to , we obtain Lemma 13.
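For reference, the following snippet computes the sample size obtained by inverting the standard form of the bound in Lemma 25, i.e. solving 2^S exp(-n eps^2 / 2) <= delta for n; the constants in Corollary 26 may differ, since they are not reproduced above.

```python
import math

def weissman_sample_size(num_states, eps, delta):
    """Smallest n with 2^S * exp(-n * eps^2 / 2) <= delta, so that the empirical
    distribution over S outcomes is within eps of the truth in L1 distance with
    probability at least 1 - delta (standard form of Weissman et al., 2003)."""
    return math.ceil(2.0 * (num_states * math.log(2.0) + math.log(1.0 / delta)) / eps ** 2)

# e.g. a 10-state next-state distribution, accuracy 0.1, failure probability 1e-3
print(weissman_sample_size(num_states=10, eps=0.1, delta=1e-3))
```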
Next, we show the proof of Lemma 14. The proof strategy is similar to that of Auer and Ortner (2007); Sun and Huang (2020).
See 14
Proof The history-dependent exploration policy in Algorithm 2 visits an under-sampled state-action pair in whenever possible; otherwise it starts a sequence of steps that would lead to such a state-action pair. In the latter case, denote the current state of the learner by and the number of steps needed to travel from to an under-sampled state by . By Assumption 4 and Markov's inequality, we have
It follows that . In other words, in every interval of time steps, the probability of visiting an under-sampled state-action pair in is at least . Over such intervals, the expected number of such visits is lower bounded by . Fix a . Let denote the number of visits to after intervals. Using a Chernoff bound for Poisson trials, we have
for any . Setting and solving
for , we obtain
(38)
By definition of ,
We also have . Overall, satisfies the condition in Equation 38. Taking a union bound over all and noting that each interval has length steps, the total number of identifying steps needed is .
To prove Lemma 15, we state the following auxiliary proposition and its corollary.
Proposition 27
Suppose we are given a probability distribution over , a constant and two sets of samples and drawn from such that and . Then,
Proof Let and denote the number of samples of in and , respectively. We have:
(39)–(44)
Corollary 28
Suppose we are given a probability distribution over , a constant and a finite number of sets of samples such that for all . Then,
(45)
Proof (of Lemma 15) The proof is by induction. The claim is trivially true for the first episode (). For an episode , assume that the outputs of Algorithm 3 are correct until the beginning of this episode. We consider two cases:
• When the task has never been given to the learner before episode .
Consider an arbitrary existing cluster . Denote by the identity of the model to which the samples in belong, the identity of the task , and in a state-action pair that distinguishes these two models. By the definition of , the result in Lemma 13 and the result in Corollary 28, the following three inequalities hold:
From here, we omit the and write for when no confusion is possible. Applying the triangle inequality twice, we obtain:
It follows that the break condition in Algorithm 3 is satisfied, and the correct value of 0 is returned. A new cluster is created containing only the samples generated by the new task .
• When the task has been given to the learner before episode .
In this case, there exists a cluster containing the samples generated from model . Using an argument similar to the previous case, whenever the iteration in Algorithm 3 reaches a cluster whose identity , the break condition is true for at least one , and the algorithm moves to the next cluster. When the iteration reaches cluster , for all , we have:
Hence, the break condition is false for all , and thus the algorithm returns as expected.
By induction, under event , Algorithm 3 always produces correct outputs throughout the episodes.
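The cluster-assignment test that this induction analyzes can be summarized by the following sketch. The threshold stands in for the separation-dependent test of Algorithm 3 (whose exact value depends on the separation level and the concentration radius and is not reproduced here), and returning `None` plays the role of the value 0 that signals a new cluster.

```python
import numpy as np

def assign_cluster(new_estimates, clusters, threshold):
    """Return the index of the cluster whose empirical transition estimates agree
    with the current episode's estimates on every distinguishing state-action
    pair, or None if no cluster does (a new cluster should then be created).

    `new_estimates` and every element of `clusters` map a distinguishing
    state-action pair to an empirical next-state distribution (1-D arrays)."""
    for idx, cluster_estimates in enumerate(clusters):
        for sa, p_cluster in cluster_estimates.items():
            l1_dist = np.abs(new_estimates[sa] - p_cluster).sum()
            if l1_dist > threshold:   # break condition: this pair separates the models
                break
        else:                         # no pair separated them: same underlying model
            return idx
    return None
```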
We can now state the regret bound of Algorithm 1 where the regret minimization algorithm in every episode is UCBVI-CH (Azar et al., 2017). For each state-action pair in episode , UCBVI-CH needs a bonus term defined as
where , is the total number of visits to in the learning phase before episode , and is the total number of episodes in which the model is given to the learner. However, is unknown to the learner. We instead upper bound by and modify the bonus term as
(46)
where . Since , this algorithm still retains the optimism principle needed for UCBVI-CH. The total regret of each model in is bounded by the following result, whose proof is in Appendix E.
Lemma 29
With probability at least , applying UCBVI-CH with the bonus term defined in Equation 46, each task in has a total regret of
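A sketch of the modified bonus in Equation 46 is given below. The multiplicative constant and the logarithmic term follow the standard Chernoff-type bonus of UCBVI-CH and are only indicative, since the exact expression in Equation 46 is not reproduced above.

```python
import math

def modified_ucbvi_ch_bonus(horizon, n_visits, num_states, num_actions,
                            total_episodes, delta):
    """Chernoff-type exploration bonus in the spirit of UCBVI-CH (Azar et al.,
    2017), with the unknown per-model episode count replaced by the total number
    of episodes K, as in Equation 46.  Constants are illustrative."""
    T = total_episodes * horizon
    L = math.log(5 * num_states * num_actions * T / delta)
    return 7 * horizon * L * math.sqrt(1.0 / max(n_visits, 1))
```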
See 16
Proof Summing up the regret for all and applying the Cauchy-Schwarz inequality, Lemma 29 together with Lemma 15 and Lemma 14 implies that with probability , the total regret is bounded by
(47)
Note that the bound in Equation 47 is tighter than the bound in Theorem 16. To obtain the bound in Theorem 16, notice that and thus .
See 20
Proof In stage 1, as the distinguishing set has size , the number of time steps needed in the clustering phase is
where .
In stage 2, the length of the clustering phase is
where .
Substituting and into Theorem 16, we obtain the regret bounds of stage 1 and stage 2:
where and .
where and .
Since , we have . Using the assumption that and the Cauchy-Schwarz inequality for the sum , we obtain
(48)–(49)
By having and , we obtain
(50)
where .
Appendix E Per-model regret analysis
First, we prove the following lemma, which upper bounds the per-episode regret as a function of and the regret of the clustering phase.
Lemma 30
The regret of Algorithm 1 in episode is
Proof Denote by the probability of visiting state at time when the learner follows a (possibly non-stationary) policy in model starting from state . The regret of task in a single episode can be written as
The first inequality follows from the assumption that for all . The second inequality follows from the fact that
and
Furthermore, since for all , we have
For each state , the value of is the expected total -step reward of an optimal non-stationary step policy starting in state on the MDP . Thus, the term represents the bounded span of the finite-step value function in MDP . Applying equation 11 of Jaksch et al. (2010), the span of the value function is bounded by the diameter of the MDP. We obtain for all
It follows that
Denote the set of episodes where the model is given to the learner. The total regret of the learner in episodes is
The policy from time step to is the UCBVI-CH algorithm (Azar et al., 2017). Therefore, the term corresponds to the total regret of UCBVI-CH in an adversarial setting in which the starting state in each episode is chosen by an adversary that maximizes the regret in that episode. In Appendix F, we give a simplified analysis of UCBVI-CH and show that with probability at least ,
(51)
where
Appendix F A simplified analysis for UCBVI-CH
In this section, we construct a simplified analysis for the UCBVI-CH algorithm in Azar et al. (2017). The proof largely follows the existing constructions in Azar et al. (2017), with two differences: the definition of “typical” episodes and the analysis are tailored specifically to the Chernoff-type bonus of UCBVI-CH, without the complications of handling the variances for the Bernstein-type bonus of UCBVI-BF in Azar et al. (2017). For completeness, the full UCBVI-CH algorithm from Azar et al. (2017) is shown in Algorithms 5 and 6.
Notation. In this section, we consider the standard single-task episodic RL setting in Azar et al. (2017), where the learner is given the same MDP in episodes. We assume the reward function is deterministic and known. The state and action spaces and are discrete with sizes and , respectively. Denote by the failure probability and let . We assume the product is sufficiently large that .
Let denote the optimal value function and the value function of the policy of the UCBVI-CH agent in episode . The regret is defined as follows.
(52)
where .
Denote by the number of visits to the state-action pair up to the beginning of episode .
We call an episode “typical” if all state-action pairs visited in episode have been visited at least times at the beginning of episode . The set of typical episodes is defined as follows.
(53)
Equation 52 can be written as
(54)
The first inequality follows from the trivial upper bound of the regret in an episode . The second inequality comes from the fact that each state-action pair can cause at most episodes to be non-typical; therefore there are at most non-typical episodes.
Next, we have:
(55)
From here we write for brevity.
Lemma 3 in Azar et al. (2017) implies that, for all ,
(56)
where , and are martingale difference sequences which, by Lemma 5 in Azar et al. (2017), satisfy
(57)
and is a confidence interval to be defined later.
Note that the second inequality follows from the fact that , and the last inequality follows directly from Equation 57.
Let . By the definition of a “typical” episode, implies that for all . It follows that . Thus,
(58)
where , and .
Next, we compute . In Equation (32) in Azar et al. (2017), corresponds to the confidence interval of
Equation (9) in Azar et al. (2017) computes a confidence interval for this term using Bernstein's inequality. Instead, we use Hoeffding's inequality and obtain
(59)
We focus on the third and dominant term . As , this term can be upper bounded by
(61)
We bound and separately.
First, we bound . We introduce the following lemma, which is an analogue of Lemma 19 in Jaksch et al. (2010) for the finite-horizon setting.
Lemma 31
Let . For any sequence of numbers with , consider the sequence defined as
Then, for all ,
Lemma 32
Denote the number of times the state-action pair is visited during episode , and let be the first episode where the state-action pair is visited at least times. Then,
(62)
Proof By definition, . Regrouping the sum in by , we have
where the last two inequalities follow from Lemma 31, the Cauchy-Schwarz inequality and the fact that .
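The following numerical check illustrates the Jaksch et al. (2010, Lemma 19)-style bound that Lemma 31 adapts; the constant sqrt(2) + 1 is the classical one and may differ from the constant in Lemma 31, whose exact statement is not reproduced above.

```python
import numpy as np

rng = np.random.default_rng(1)

def lemma19_check(x):
    """Check sum_k x_k / sqrt(X_{k-1}) <= (sqrt(2) + 1) * sqrt(X_n),
    where X_k = max(1, x_1 + ... + x_k) and X_0 = 1."""
    X = np.maximum(1.0, np.cumsum(x))
    X_prev = np.concatenate(([1.0], X[:-1]))
    lhs = float(np.sum(x / np.sqrt(X_prev)))
    rhs = float((np.sqrt(2.0) + 1.0) * np.sqrt(X[-1]))
    return lhs, rhs

for _ in range(3):
    x = rng.uniform(0.0, 1.0, size=1000)   # satisfies 0 <= x_k <= X_{k-1} since X_{k-1} >= 1
    lhs, rhs = lemma19_check(x)
    assert lhs <= rhs
    print(f"lhs = {lhs:8.2f} <= rhs = {rhs:8.2f}")
```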
In order to bound the term in Equation 61, we use the following lemma, which is a variant of Lemma 31 and was stated in Azar et al. (2017) without proof.
Lemma 33
Let . For any sequence of numbers with , consider the sequence defined as
Then, for all ,
Proof The second half follows immediately from existing results for the partial sum of the harmonic series. We prove the first half of the inequality by induction. By definition of the two sequences, and for all . At , if then the inequality trivially holds. If , then and
since .
For , by the induction hypothesis, we have
where the last inequality follows from . Therefore, the inequality holds for all .
Using Lemma 33, the term can be bounded similarly to term as follows:
Lemma 34
With and defined in Lemma 32, we have
Proof We write as
where the last inequality follows from Lemma 33. Trivially bounding the logarithm term by , we obtain
Appendix G Removing the assumption on the hitting time
GOSPRL (Tarbouriech et al., 2021, Lemma 3) guarantees that in the undiscounted infinite-horizon setting, with , Lemma 14 holds with high probability. Thus, in the episodic finite-horizon setting, by setting for some appropriately large constant and applying GOSPRL in each episode, we obtain a bound with tight dependency on and for communicating MDPs. One difficulty in this approach is that both and are unknown. One possible way to overcome this is to apply the doubling trick as follows: at the beginning of episode , we set , where . If the learner successfully visits every state-action pair at least times after steps, we set . Otherwise, . There are at most episodes with failed exploration until is large enough so that, with high probability, all subsequent episodes have successful exploration. Moreover, the horizons of the clustering and learning phases change at most times. The full analysis of this approach is beyond the scope of this paper and is left to future work.
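The doubling schedule sketched above could look as follows; `explore_one_episode` is a hypothetical stand-in for running GOSPRL-style exploration for the allotted clustering horizon and reporting whether every required state-action pair was visited often enough, and the initial guess is arbitrary.

```python
def doubling_clustering_horizons(explore_one_episode, num_episodes, h_init=1):
    """Doubling-trick schedule for the clustering-phase horizon when the
    hitting-time parameter is unknown (Appendix G).  The horizon is doubled
    after every episode in which exploration fails; once it is large enough,
    all subsequent episodes succeed with high probability."""
    j, horizons = 0, []
    for _ in range(num_episodes):
        h = h_init * 2 ** j
        horizons.append(h)
        if not explore_one_episode(h):   # failed exploration: double the horizon
            j += 1
    return horizons
```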
Appendix H Using samples in both phases for regret minimization
One of the results from previous works on the stochastic infinite-horizon multi-task setting (Brunskill and Li, 2013) is that in the cluster-then-learn paradigm, the samples collected in their first stage (before all models have been seen at least once) can be used to accelerate the learning in their second stage (after all models have been seen at least once). In this work, we study a similar effect at the phase level. Specifically, in the finite-horizon setting, the clustering phase is always followed by the learning phase; therefore it is desirable to use the samples collected in the clustering phase to improve the regret bound of the learning phase.
Our goal is to improve the regret of stage 1 in Algorithm 4. The reason we focus on stage 1 is twofold:
• In case Assumption 19 does not hold, i.e., is close to , the total regret is dominated by the regret of stage 1. Given that the length of the clustering phase is already of the same order as the state-of-the-art bound of the recently proposed GOSPRL algorithm (Tarbouriech et al., 2021), we conjecture that without further assumptions it is difficult to improve it substantially, and thus we focus on improving the learning phase.
• In stage 1, every state-action pair is uniformly visited at least times before the learning phase. This uniformity allows us to study the impact of these samples in a systematic way without any further assumptions.
Using samples collected in both phases for the learning phase in Algorithm 1 is equivalent to using the policy
for the learning phase, since contains both and .
The regret minimization process in the learning phase is now equivalent to single-task episodic RL where, at the beginning of each episode, the learner is given more samples, in which the transition function of each is sampled i.i.d. times. We extend the UCBVI-CH algorithm in Azar et al. (2017) to this new setting and obtain Algorithm 7. The bonus function of episode in UCBVI-CH is set to
(63)
where .
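The only change relative to the earlier bonus sketch is that the visit count in the denominator is augmented by the external samples; again, the constants are illustrative rather than the exact ones in Equation 63.

```python
import math

def e_ucbvi_ch_bonus(horizon, n_visits, n_external, num_states, num_actions,
                     total_episodes, delta):
    """Chernoff-type bonus of UCBVI-CH with external samples (Algorithm 7):
    the n_external i.i.d. samples of the transition function handed to the
    learner are added to the visit count in the denominator."""
    T = total_episodes * horizon
    L = math.log(5 * num_states * num_actions * T / delta)
    return 7 * horizon * L * math.sqrt(1.0 / max(n_visits + n_external, 1))
```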
The regret of this algorithm is bounded in the following theorem (proved in Appendix I).
Theorem 35
Given a constant . With probability at least , the regret of Algorithm 7 is bounded by
It can be observed that when , this bound recovers the bound of UCBVI-CH (up to a constant factor). Intuitively, when is small compared to , the regret should still be of order since most of the useful information for learning still comes from exploring the environment. As increases, since the logarithmic term grows much more slowly than , the dominant term converges to 0.
Using Theorem 35 and , we can directly bound the regret of each model that is given in :
Lemma 36
The stage-1 regret of each model is
where .
Adding up the bound in Lemma 36 for all models and applying the Cauchy-Schwarz inequality, we obtain the total regret bound of stage 1:
Theorem 37
In our setting, recall that and . Since we assumed that , we also have , and thus the bound in Theorem 37 is an improvement over the bound for stage 1 in the proof of Theorem 20, although the order stays the same. Intuitively, this means that the length of the learning phase is much larger than the length of the clustering phase, and therefore the learner spends more time on learning the optimal policy. When the length of the learning phase is small compared to , the samples collected in the clustering phase significantly reduce the regret bound of the learning phase. Therefore, Algorithm 7 also accelerates the learning phase after the exploration phase, which is consistent with the findings for the stochastic infinite-horizon multi-task setting in Brunskill and Li (2013).
Appendix I Proofs for Appendix H
We analyze the regret of the UCBVI-CH algorithm with external samples, where at the beginning of each episode, each state-action pair receives additional samples drawn i.i.d. from the transition function .
Adapting from Equation 60, the regret of E-UCBVI-CH can be bounded by
(64)
where as defined in Appendix F.
The first term bounds the total regret of episodes where a state-action pair is visited less than times: in each episode where a pair is visited at least once there are at least more samples of this pair, and therefore there can be at most such episodes.
As in Appendix F, we bound by bounding its two components and , where
In order to bound , we first prove the following technical lemma, which quantifies the fraction of the regret that is reduced when using external samples.
Lemma 38
Suppose two constants are given. For any sequence of numbers with , consider the sequence defined as
Then, for all ,
Proof If , then the claim is trivially true. For , the claim is equivalent to
which is true, since .
Corollary 39
Suppose two constants are given. For any sequence of numbers with , consider the sequence defined as
Then, for all ,
Proof The first half of the claim follows from Lemma 38. We now show the second half. Consider the following function
The derivative is . Since , we have , and therefore is decreasing. It follows that for ,
Using Corollary 39, we can bound as follows.
Proof We can write as follows
By definition of :
Next, we bound . Using techniques similar to those in Lemma 38 and Corollary 39, we can show that the following claims are true.
Lemma 41
Given two constants . For any sequence of numbers with , consider the sequence defined as
Then, for all ,
And for all ,
Consequently, is bounded in the following corollary.
and the total regret is
Appendix J Experimental details
Transition functions. Figure 5 illustrates the gridworld environment of the four MDPs in . The rows are numbered top to bottom from to . The columns are numbered left to right from to . The starting state is at position . In every state, the probability of success of every action is . When an action is unsuccessful, the remaining probability of is divided equally among the other adjacent cells. There are several exceptions:
• In the four corners, if the agent takes an action in the direction of the border, then with probability of it will stay in the same corner, and with probability it will end up in the cell in the opposite direction. For example, if the agent is at and takes action up, then with probability it will actually go down to the cell .
• Each of the four MDPs has an easy-to-reach corner and three hard-to-reach corners. The easy-to-reach corners in models and are and , respectively. In each of these models, the probability of success of an action that leads to one of the hard-to-reach corners is , except for the corner , where this probability is . For example, in model , taking action right in cell has probability of success equal to , while taking action down in cell has probability of success equal to .
• On the four edges, any action that takes the agent out of the grid has probability of success equal to 0, and the agent ends up in one of the three adjacent cells with equal probability of . For example, taking action up in position will take the agent to one of the three positions and with probability .
Under this construction, the separation level is . One example of a -distinguishing set of optimal size is . One example of a set that is -distinguishing but not -distinguishing is .
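A minimal sketch of the basic transition rule above is given below. The grid size and success probability are placeholder values, and the corner/edge exceptions that distinguish the four models are deliberately omitted (blocked moves simply stay put here).

```python
import numpy as np

ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def gridworld_transitions(size, p_success):
    """Transition tensor P[s, a, s'] for the basic rule: a move succeeds with
    probability p_success and otherwise lands in one of the other adjacent
    cells with equal probability.  Corner/edge special cases are simplified."""
    n = size * size
    P = np.zeros((n, len(ACTIONS), n))
    for r in range(size):
        for c in range(size):
            s = r * size + c
            neighbours = {a: (r + dr, c + dc) for a, (dr, dc) in ACTIONS.items()
                          if 0 <= r + dr < size and 0 <= c + dc < size}
            for ai, a in enumerate(ACTIONS):
                if a not in neighbours:          # simplified: blocked moves stay put
                    P[s, ai, s] = 1.0
                    continue
                tr, tc = neighbours[a]
                P[s, ai, tr * size + tc] += p_success
                others = [cell for b, cell in neighbours.items() if b != a]
                for nr, nc in others:
                    P[s, ai, nr * size + nc] += (1.0 - p_success) / len(others)
    return P

P = gridworld_transitions(size=5, p_success=0.8)
assert np.allclose(P.sum(axis=-1), 1.0)
```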
Performance metric. At the end of each episode, the two AOMultiRL agents and the one-episode UCBVI agent obtain their estimated model . The estimated optimal policy computed based on is run for steps starting from . The average per-episode reward (APER) in episode of an agent is defined as
(65)
where is the reward this agent received in step of episode .
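Under the plausible reading that APER averages the per-step rewards collected while running the estimated optimal policy for the evaluation horizon (the exact form of Equation 65 is not reproduced above), the metric can be computed as follows.

```python
import numpy as np

def average_per_episode_reward(rewards):
    """Average per-episode reward (APER) of one evaluation rollout: the mean of
    the rewards collected while executing the estimated optimal policy."""
    return float(np.mean(rewards))

# e.g. a 100-step evaluation rollout with Bernoulli rewards
rng = np.random.default_rng(0)
print(average_per_episode_reward(rng.binomial(1, 0.3, size=100)))
```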
Horizon settings. For AOMultiRL2, the horizons of the clustering phase in the two stages are different since the distinguishing sets in the two stages are different. In order to make a fair comparison with the other algorithms, the horizon of the learning phase is set to in stage 1 and in stage 2. Since we assumed that stage 2 is dominant, the goal of the experiment is to examine whether a -distinguishing set can be discovered and how effective that set can be. We observe that AOMultiRL2 is able to discover the same -distinguishing set in all 10 runs. Since this set also has the optimal size of , in stage 2 the clustering phase's horizon of AOMultiRL2 is identical to that of AOMultiRL1.