Policy Finetuning in Reinforcement Learning
via Design of Experiments using Offline Data
Abstract
In some applications of reinforcement learning, a dataset of pre-collected experience is already available but it is also possible to acquire some additional online data to help improve the quality of the policy. However, it may be preferable to gather additional data with a single, non-reactive exploration policy and avoid the engineering costs associated with switching policies.
In this paper we propose an algorithm with provable guarantees that can leverage an offline dataset to design a single non-reactive policy for exploration. We theoretically analyze the algorithm and measure the quality of the final policy as a function of the local coverage of the original dataset and the amount of additional data collected.
1 Introduction
Reinforcement learning (RL) is a general framework for data-driven, sequential decision making (Puterman, 1994; Sutton and Barto, 2018). In RL, a common goal is to identify a near-optimal policy, and there exist two main paradigms: online and offline RL.
Online RL is effective when the practical cost of a bad decision is low, such as in simulated environments (e.g., Mnih et al., 2015; Silver et al., 2016). In online RL, a well-designed learning algorithm starts tabula rasa and implements a sequence of policies whose value should approach that of an optimal policy. When the cost of making a mistake is high, such as in healthcare (Gottesman et al., 2018) and in self-driving (Kiran et al., 2021), an offline approach is preferred. In offline RL, the agent uses a dataset of pre-collected experience to extract a policy that is as good as possible. In this latter case, the quality of the policy that can be extracted from the dataset is limited by the quality of the dataset.
Many applications, however, fall between these two opposite settings: for example, a company that sells products online has most likely recorded the feedback that it has received from its customers, but can also collect a small amount of additional strategic data in order to improve its recommendation engine. While in principle an online exploration algorithm can be used to collect fresh data, in practice a number of engineering considerations require the deployed policy to be non-reactive. We say that a policy is non-reactive (or passive, or memoryless) if it chooses actions only according to the current state of the system. Most online algorithms are, by design, reactive to the data being acquired.
Examples of situations where non-reactive policies may be preferred are those where a human in the loop is required to validate each exploratory policy before it is deployed, to ensure it is of high quality (Dann et al., 2019), safe (Yang et al., 2021), and free of discriminatory content (Koenecke et al., 2020). Other situations that may warrant non-reactive exploration are those where the interaction with the user occurs through a distributed system with delayed feedback. In recommendation systems, data collection may take only minutes, but policy deployment and updates can span weeks (Afsar et al., 2022). Similar considerations apply across various RL application domains, including healthcare (Yu et al., 2021), computer networks (Xu et al., 2018), and new material design (Raccuglia et al., 2016). In all such cases, the engineering effort required to implement a system that handles real-time policy switches may be prohibitive: deploying a single, non-reactive policy is much preferred.
Non-reactive exploration from offline data Most exploration algorithms that we are aware of incorporate policy switches when they interact with the environment (Dann and Brunskill, 2015; Dann et al., 2017; Azar et al., 2017; Jin et al., 2018; Dann et al., 2019; Zanette and Brunskill, 2019; Zhang et al., 2020b). Implementing a sequence of non-reactive policies is necessary in order to achieve near-optimal regret: the number of policy switches must be at least of order $HSA\log\log K$, where $S$, $A$, $H$ and $K$ denote the size of the state space, the size of the action space, the horizon, and the total number of episodes, respectively (Qiao et al., 2022). With no switches, i.e., when a fully non-reactive data collection strategy is implemented, it is information-theoretically impossible (Xiao et al., 2022) to identify a good policy using a number of samples polynomial in the size of the state and action space.
However, these fundamental limits apply to the case where the agent learns tabula rasa. In the more common case where offline data is available, we demonstrate that it is possible to leverage the dataset to design an effective non-reactive exploratory policy. More precisely, an available offline dataset contains information (e.g., transitions) about a certain area of the state-action space, a concept known as partial coverage. A dataset with partial coverage naturally identifies a ‘sub-region’ of the original MDP—more precisely, a sub-graph—that is relatively well explored. We demonstrate that it is possible to use the dataset to design a non-reactive policy that further explores such a sub-region. The additional data collected can then be used to learn a near-optimal policy in that sub-region.
In other words, exploration with no policy switches can collect additional information and compete with the best policy that is restricted to an area where the original dataset has sufficient information. The value of such a policy can be much higher than that of one computed using only the offline dataset, and it does not directly depend on a concentrability coefficient (Munos and Szepesvári, 2008; Chen and Jiang, 2019).
Perhaps surprisingly, addressing the problem of non-reactive exploration in reinforcement learning requires an approach that combines both optimism and pessimism in the face of uncertainty in order to explore efficiently. While optimism drives exploration, pessimism ensures that the agent explores conservatively, in a way that restricts its exploration effort to a region that it knows how to navigate; this combination is a technical contribution which can be of independent interest.
Contributions To the best of our knowledge, this is the first theoretically rigorous paper that considers the problem of designing an experiment in reinforcement learning for online, passive exploration, using a dataset of pre-collected experience. More precisely, our contributions are as follows:
• We introduce an algorithm that takes as input a dataset, uses it to design and deploy a non-reactive exploratory policy, and then outputs a locally near-optimal policy.
• We introduce the concept of sparsified MDP, which is actively used by our algorithm to design the exploratory policy, as well as to theoretically analyze the quality of the final policy that it finds.
• We rigorously establish a nearly minimax-optimal upper bound on the sample complexity needed to learn a locally $\epsilon$-optimal policy using our algorithm.
2 Related Work
In this section we discuss some related literature. Our work is related to low-switching algorithms, but unlike those, we focus on the limiting case where no switches are allowed. For more related work on low-switching algorithms, offline RL, task-agnostic RL, and reward-free RL, we refer the reader to Appendix F.
Low-switching RL In reinforcement learning, Bai et al. (2019) first proposed Q-learning with UCB2 exploration, proving a switching cost of order $H^3SA\log K$. This was later improved by a factor of $H$ by the UCB-Advantage algorithm in Zhang et al. (2020b). Recently, Qiao et al. (2022) generalized the policy elimination algorithm from Cesa-Bianchi et al. (2013) and introduced APEVE, which attains an optimal $O(HSA\log\log K)$ switching cost. The reward-free version of their algorithm (which is not regret minimizing) has a switching cost independent of the number of episodes.
Similar ideas were soon applied in RL with linear function approximation (Gao et al., 2021; Wang et al., 2021; Qiao and Wang, 2022) and general function approximation (Qiao et al., 2023). Additionally, numerous research efforts have focused on low-adaptivity in other learning domains, such as batched dueling bandits (Agarwal et al., 2022), batched convex optimization (Duchi et al., 2018), linear contextual bandits (Ruan et al., 2021), and deployment-efficient RL (Huang et al., 2022).
Our work was inspired by the problem of non-reactive policy design in linear contextual bandits. Given access to an offline dataset, Zanette et al. (2021a) proposed an algorithm that outputs a single exploratory policy, which generates a dataset from which a near-optimal policy can be extracted. However, a number of additional challenges arise in reinforcement learning, including the fact that the state space is only partially explored in the offline dataset. In fact, in reinforcement learning, Xiao et al. (2022) established an exponential lower bound for any non-adaptive policy learning algorithm that starts tabula rasa.
3 Setup
Throughout this paper, we let $[n] = \{1, 2, \ldots, n\}$. We adopt the big-O notation, where $\widetilde{O}(\cdot)$ suppresses poly-log factors of the input parameters. We indicate the cardinality of a set $\mathcal{X}$ with $|\mathcal{X}|$.
Markov decision process We consider time-homogeneous episodic Markov decision processes (MDPs). They are defined by a finite state space $\mathcal{S}$, a finite action space $\mathcal{A}$, a transition kernel $P$, a reward function $r$ and the episodic length $H$. The transition probability $P(s' \mid s,a)$, which does not depend on the current time-step $h \in [H]$, denotes the probability of transitioning to state $s'$ when taking action $a$ in the current state $s$. Typically we denote with $s_1$ the initial state. For simplicity, we consider deterministic reward functions $r : \mathcal{S} \times \mathcal{A} \rightarrow [0,1]$. A deterministic non-reactive (or memoryless, or passive) policy $\pi : \mathcal{S} \rightarrow \mathcal{A}$ maps a given state to an action.
The value function is defined as the expected cumulative reward. It depends on the state under consideration, the transition kernel and reward function that define the MDP, as well as on the policy being implemented. It is defined as $V^{\pi}_h(s; r, P) = \mathbb{E}_{P, \pi}\big[\sum_{t=h}^{H} r(s_t, a_t) \mid s_h = s\big]$, where $\mathbb{E}_{P, \pi}$ denotes the expectation generated by $P$ and policy $\pi$. A closely related quantity is the state-action value function, or $Q$-function, defined as $Q^{\pi}_h(s, a; r, P) = r(s,a) + \mathbb{E}_{P, \pi}\big[\sum_{t=h+1}^{H} r(s_t, a_t) \mid s_h = s, a_h = a\big]$. When it is clear from the context, we sometimes omit the dependence on $(r, P)$ and simply write them as $V^{\pi}_h(s)$ and $Q^{\pi}_h(s, a)$. We denote an MDP defined by $\mathcal{S}$, $\mathcal{A}$, the reward $r$ and the transition matrix $P$ as $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, H)$.
3.1 Interaction protocol
In this paper we assume access to an offline dataset $\mathcal{D} = \{(s_i, a_i, s_i')\}$, where every state-action pair $(s_i, a_i)$ is sampled in an i.i.d. fashion from some distribution $\mu$ and $s_i' \sim P(\cdot \mid s_i, a_i)$, which is common in the offline RL literature (Xie et al., 2021a; Zhan et al., 2022; Rashidinejad et al., 2021; Uehara and Sun, 2021). We denote by $N(s,a)$ and $N(s,a,s')$ the number of $(s,a)$ and $(s,a,s')$ samples in the offline dataset, respectively. The interaction protocol considered in this paper consists of three distinct phases, which are displayed in algorithm 1. They are:
• the offline phase, where the learner uses an offline dataset of pre-collected experience to design the non-reactive exploratory policy;
• the online phase, where the exploratory policy is deployed to generate the online dataset;
• the planning phase, where the learner receives a reward function and uses all the data collected to extract a good policy with respect to that reward function. A minimal sketch of the three phases is given below.
4 Algorithm: balancing optimism and pessimism for experimental design
In this section we outline our algorithm, Reward-Free Non-reactive Policy Design (RF-NPD), which follows the high-level protocol described in algorithm 1. The technical novelty lies almost entirely in the design of the exploratory policy. In order to prepare the reader for the discussion of the algorithm, we first give some intuition in section 4.1, followed by the definition of the sparsified MDP in section 4.2, a central concept of this paper, and then describe the implementation of step 2 of the protocol in algorithm 1 in section 4.3. We conclude by presenting the implementation of steps 3 and 4 of the protocol in algorithm 1.
4.1 Intuition
In order to present the main intuition for this paper, in this section we assume that enough transitions are available in the dataset for every edge $(s,a,s')$, namely that the critical condition
$N(s,a,s') \geq \Phi$  (4.1)
holds for all tuples $(s,a,s')$ (the precise value for the threshold $\Phi$ will be given later in eq. 5.1). Such a condition is rarely satisfied everywhere in the state-action-state space, but assuming it in this section simplifies the presentation of one of the key ideas of this paper.
The key observation is that when eq. 4.1 holds for all $(s,a,s')$, we can use the empirical transition kernel to design an exploration policy from which a near-optimal policy can eventually be extracted for any desired level of sub-optimality $\epsilon$, despite the threshold in eq. 4.1 being independent of $\epsilon$. More precisely, let $\widehat{P}(s' \mid s,a) = N(s,a,s')/N(s,a)$ be the empirical transition kernel, defined in the usual way for any tuple $(s,a,s')$. The intuition—which will be verified rigorously in the analysis of the algorithm—is the following:
If eq. 4.1 holds for every $(s,a,s')$, then $\widehat{P}$ can be used to design a non-reactive exploration policy which can be deployed on the real MDP to find an $\epsilon$-optimal policy using a number of online samples polynomial in $S$, $A$, $H$ and $1/\epsilon$.
We remark that even if condition 4.1 holds for all tuples $(s,a,s')$, the empirical kernel $\widehat{P}$ is not accurate enough to extract an $\epsilon$-optimal policy from the dataset without collecting further data. Indeed, the threshold $\Phi$ on the number of samples is independent of the desired sub-optimality $\epsilon$, while it is well known that of order $1/\epsilon^2$ offline samples are needed to find an $\epsilon$-optimal policy. Therefore, directly running an offline RL algorithm on the available offline dataset does not yield an $\epsilon$-optimal policy. However, the threshold is sufficient to design a non-reactive exploratory policy that can discover an $\epsilon$-optimal policy after collecting online data.
4.2 Sparsified MDP
The intuition in the prior section must be modified to work with heterogeneous datasets and dynamics where eq. 4.1 may fail to hold everywhere. For example, if a certain tuple $(s,a,s')$ has very small probability of being generated under the data distribution, it is unlikely that the dataset contains enough samples for that particular tuple. In a more extreme setting, if the dataset is empty, the critical condition in eq. 4.1 is violated for all tuples $(s,a,s')$, and in fact the lower bound of Xiao et al. (2022) states that finding $\epsilon$-optimal policies by exploring with a non-reactive policy is not feasible with polynomial sample complexity. This suggests that in general it is not possible to output an $\epsilon$-optimal policy using the protocol in algorithm 1.
However, a real-world dataset generally covers at least a portion of the state-action space, and so we expect the condition to hold somewhere; the sub-region of the MDP where it holds represents the connectivity graph of the sparsified MDP. This is the region that the agent knows how to navigate using the offline dataset, and so it is the one that the agent can explore further using the exploratory policy. More precisely, the sparsified MDP is defined to have dynamics identical to those of the original MDP on the edges that satisfy the critical condition 4.1. When instead an edge fails to satisfy the critical condition 4.1, it is replaced with a transition to an absorbing state $s^\dagger$.
Definition 4.1 (Sparsified MDP).
Let $s^\dagger$ be an absorbing state, i.e., such that $P(s^\dagger \mid s^\dagger, a) = 1$ and $r(s^\dagger, a) = 0$ for all $a \in \mathcal{A}$. The state space of the sparsified MDP $\mathcal{M}^\dagger$ is defined as that of the original MDP with the addition of $s^\dagger$. The dynamics $P^\dagger$ of the sparsified MDP are defined as
$P^\dagger(s' \mid s,a) = P(s' \mid s,a)\,\mathbb{1}\{N(s,a,s') \geq \Phi\}$ for $s' \neq s^\dagger$, and $P^\dagger(s^\dagger \mid s,a) = \sum_{s'' : N(s,a,s'') < \Phi} P(s'' \mid s,a)$.  (4.2)
For any deterministic reward function $r$, the reward function on the sparsified MDP is defined as $r^\dagger(s,a) = r(s,a)$ for $(s,a) \in \mathcal{S} \times \mathcal{A}$ and $r^\dagger(s^\dagger, a) = 0$; for simplicity we only consider deterministic reward functions.
The empirical sparsified MDP $\widehat{\mathcal{M}}^\dagger$ is defined in the same way, but using the empirical transition kernel $\widehat{P}$ in eq. 4.2. The empirical sparsified MDP is used by our algorithm to design the exploratory policy, while the (population) sparsified MDP is used for its theoretical analysis. They are two fundamental concepts in this paper.
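As a concrete illustration of how the empirical sparsified MDP can be materialized from the offline counts, consider the following sketch. The tabular arrays, the variable names, and the convention of placing the absorbing state at the last index are our own assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def empirical_sparsified_kernel(counts: np.ndarray, threshold: int) -> np.ndarray:
    """counts[s, a, s2] is the number of (s, a, s2) transitions in the offline dataset.

    Returns an array of shape (S + 1, A, S + 1), where index S is the absorbing state:
    empirical transition frequencies are kept only on edges observed at least
    `threshold` times (the critical condition (4.1)); the remaining probability
    mass is redirected to the absorbing state.
    """
    S, A, _ = counts.shape
    P = np.zeros((S + 1, A, S + 1))
    for s in range(S):
        for a in range(A):
            n_sa = counts[s, a].sum()
            if n_sa == 0:
                P[s, a, S] = 1.0                      # nothing observed: go to the absorbing state
                continue
            kept = counts[s, a] >= threshold           # edges satisfying the critical condition
            P[s, a, :S] = np.where(kept, counts[s, a] / n_sa, 0.0)
            P[s, a, S] = 1.0 - P[s, a, :S].sum()       # leftover mass to the absorbing state
    P[S, :, S] = 1.0                                   # the absorbing state transitions to itself
    return P
```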
4.3 Offline design of experiments
In this section we describe the main sub-component of the algorithm, namely the sub-routine that uses the offline dataset to compute the exploratory policy. The exploratory policy is a mixture of the policies produced by a variant of the reward-free exploration algorithm of Kaufmann et al. (2021) and Ménard et al. (2021). Unlike in the prior literature, the reward-free algorithm is not interfaced with the real MDP, but rather simulated on the empirical sparsified MDP $\widehat{\mathcal{M}}^\dagger$. This avoids interacting with the real MDP with a reactive policy, but it introduces some bias that must be controlled. The overall procedure is detailed in algorithm 2. To be clear, no real-world samples are collected by algorithm 2; instead, we use the term ‘virtual samples’ to refer to those generated from $\widehat{\mathcal{M}}^\dagger$.
At a high level, algorithm 2 implements value iteration using the empirical sparsified transition kernel, with the exploration bonus defined in eq. 4.3 replacing the reward function. The exploration bonus can be seen as implementing the principle of optimism in the face of uncertainty; however, the possibility of transitioning to an absorbing state with zero reward (due to the use of the absorbing state in the definition of $\widehat{\mathcal{M}}^\dagger$) implements the principle of pessimism.
This delicate interplay between optimism and pessimism is critical to the success of the overall procedure: while optimism encourages exploration, pessimism ensures that the exploration efforts are directed to the region of the state space that the agent actually knows how to navigate, and prevents the agent from getting ‘trapped’ in unknown regions. In fact, these latter regions could have combinatorial structures (Xiao et al., 2022) which cannot be explored with non-reactive policies.
More precisely, at the beginning of the $t$-th virtual episode in algorithm 2, $n^t(s,a)$ and $n^t(s,a,s')$ denote the counters for the number of virtual samples simulated from $\widehat{\mathcal{M}}^\dagger$ at each $(s,a)$ and $(s,a,s')$ tuple. We define the bonus function
(4.3)
which is used to construct the empirical uncertainty function, a quantity that serves as a proxy for the uncertainty of the value of any policy on the sparsified MDP. Specifically, for the $t$-th virtual episode, the empirical uncertainty is set to zero at the last time-step, and for earlier time-steps we further define:
(4.4)
Note that the above bonus function takes a form similar to that of the bonus function in Ménard et al. (2021). This $1/n$ order is set to achieve the optimal sample complexity, and other works have investigated other forms of bonus functions (Kaufmann et al., 2021). Finally, in lines 11 through 13, the current policy—which is the greedy policy with respect to the empirical uncertainty function—is simulated on the empirical sparsified MDP $\widehat{\mathcal{M}}^\dagger$, and the virtual counters are updated. It is crucial to note that the simulation takes place entirely offline, by generating virtual transitions from $\widehat{\mathcal{M}}^\dagger$. Upon termination of algorithm 2, the uniform mixture of the greedy policies forms the non-reactive exploration policy, ensuring that the latter has wide ‘coverage’ over the sparsified MDP.
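The sketch below mirrors the structure of this offline design phase, but it is not algorithm 2 itself: the bonus used here is a generic $1/n$-style surrogate rather than the exact expression in eq. 4.3, the initial state is assumed to be index 0, and the returned list of greedy policies is meant to be deployed as a uniform mixture.

```python
import numpy as np

def design_exploration_policy(P_hat_dagger: np.ndarray, horizon: int,
                              num_virtual_episodes: int,
                              rng: np.random.Generator) -> list:
    """Offline design sketch: value iteration with a count-based bonus as the reward,
    simulated entirely on the empirical sparsified kernel P_hat_dagger of shape (S, A, S)
    (absorbing state included). No real-world samples are collected here."""
    S, A, _ = P_hat_dagger.shape
    n = np.zeros((S, A))                               # virtual visitation counters
    policies = []

    for _ in range(num_virtual_episodes):
        bonus = np.minimum(horizon, horizon ** 2 / np.maximum(n, 1.0))   # optimism
        W_next = np.zeros(S)                           # empirical uncertainty at step h + 1
        policy = np.zeros((horizon, S), dtype=int)
        for h in reversed(range(horizon)):
            Q = bonus + P_hat_dagger @ W_next          # the bonus plays the role of the reward
            policy[h] = Q.argmax(axis=1)               # greedy w.r.t. the empirical uncertainty
            W_next = Q.max(axis=1)
        policies.append(policy)

        # Simulate the greedy policy on the empirical sparsified MDP (virtual samples only).
        # Reaching the absorbing state keeps the rollout there: this is the pessimistic part.
        s = 0
        for h in range(horizon):
            a = policy[h, s]
            n[s, a] += 1
            s = rng.choice(S, p=P_hat_dagger[s, a])

    return policies                                    # deploy as a uniform mixture
```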
4.4 Online and planning phase
Algorithm 2 implements step 2 of the procedure in algorithm 1 by finding the exploratory policy. After that, in step 3 of the interaction protocol, the online dataset is generated by deploying the exploratory policy on the real MDP to generate trajectories. Conceptually, the online dataset and the offline dataset jointly identify an updated empirical transition kernel and its sparsified version, computed from the combined offline and online counters of the $(s,a)$ and $(s,a,s')$ visits (see definition A.3 for the precise statement). Finally, in step 4, a reward function is received, and the value iteration algorithm (see Appendix E) is invoked with that reward function and the updated empirical sparsified dynamics, producing the near-optimal policy. The use of the (updated) empirical sparsified dynamics can be seen as incorporating the principle of pessimism under uncertainty due to the presence of the absorbing state.
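A sketch of this planning step is given below. It assumes the updated sparsified kernel and the reward are available as tabular arrays (with the reward set to zero at the absorbing state), which is our own illustrative representation rather than the pseudocode of Appendix E.

```python
import numpy as np

def plan_with_value_iteration(P_tilde_dagger: np.ndarray, reward: np.ndarray,
                              horizon: int) -> np.ndarray:
    """Finite-horizon value iteration on the updated empirical sparsified MDP.

    P_tilde_dagger has shape (S, A, S) with the absorbing state included; reward has
    shape (S, A) and is zero at the absorbing state, which is what makes the planner
    pessimistic about poorly covered transitions."""
    S, A, _ = P_tilde_dagger.shape
    V_next = np.zeros(S)
    policy = np.zeros((horizon, S), dtype=int)
    for h in reversed(range(horizon)):
        Q = reward + P_tilde_dagger @ V_next           # Bellman backup under the sparsified model
        policy[h] = Q.argmax(axis=1)
        V_next = Q.max(axis=1)
    return policy
```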
Our complete algorithm is reported in algorithm 3, and it can be seen as implementing the interaction protocol described in algorithm 1.
5 Main Result
In this section, we present a performance bound for our algorithm, namely a bound on the sub-optimality of the value of the final policy when measured on the sparsified MDP $\mathcal{M}^\dagger$. The sparsified MDP arises because it is generally not possible to directly compete with the optimal policy using a non-reactive data collection strategy and a polynomial number of samples, due to the lower bound of Xiao et al. (2022); more details are given in Appendix C.
In order to state the main result, we denote the number of episodes for the offline simulation and for the online interaction, respectively. Let $c$ be some universal constant, and choose the threshold $\Phi$ in the definition of the sparsified MDP as
(5.1)
Theorem 5.1.
For any $\epsilon > 0$ and $\delta \in (0,1)$, if we let the number of online episodes be
then with probability at least $1-\delta$, for any reward function, the final policy returned by Algorithm 3 satisfies the bound
(5.2)
The theorem gives a performance guarantee on the value of the final policy, which depends both on the initial coverage of the offline dataset and on the number of samples collected in the online phase. The dependence on the coverage of the offline dataset is implicit in the definition of the (population) sparsified MDP $\mathcal{M}^\dagger$, which is determined by the counts $N(s,a,s')$.
In order to gain some intuition, we examine some special cases as a function of the coverage of the offline dataset.
Empty dataset Suppose that the offline dataset is empty. Then the sparsified MDP identifies a multi-armed bandit at the initial state $s_1$, where any action taken from that state gives back its instantaneous reward and leads to the absorbing state $s^\dagger$. In this case, our algorithm essentially designs an allocation strategy that is uniform across all actions at the starting state. Given enough online samples, the final policy converges to the action with the highest instantaneous reward on the multi-armed bandit induced by the start state. With no coverage from the offline dataset, the lower bound of Xiao et al. (2022) for non-reactive policies precludes finding an $\epsilon$-optimal policy on the original MDP unless exponentially many samples are collected.
Known connectivity graph On the other extreme, assume that the offline dataset contains enough information everywhere in the state-action space such that the critical condition 4.1 is satisfied for all $(s,a,s')$ tuples. Then the sparsified MDP and the real MDP coincide, i.e., $\mathcal{M}^\dagger = \mathcal{M}$, and so the final policy directly competes with the optimal policy for any given reward function in eq. 5.2. More precisely, the final policy is $\epsilon$-suboptimal on $\mathcal{M}$ if enough trajectories (at the rate prescribed by theorem 5.1) are collected in the online phase, a result that matches the lower bound for reward-free exploration of Jin et al. (2020b) up to log factors. However, we achieve such a result with a data collection strategy that is completely passive, one that is computed with the help of an initial offline dataset whose size need not depend on the final accuracy $\epsilon$.
Partial coverage In more typical cases, the offline dataset has only partial coverage over the state-action space and the critical condition 4.1 may be violated at certain state-action-successor state triples. In this case, the connectivity graph of the sparsified MDP is a sub-graph of the original MDP, augmented with edges towards the absorbing state. The lack of coverage of the original dataset enters the guarantees of theorem 5.1 through the sparsified MDP. In this section, we ‘translate’ such guarantees into guarantees on the original MDP $\mathcal{M}$, in which case the ‘lack of coverage’ is naturally represented by the concentrability coefficient;
see for example the papers (Munos and Szepesvári, 2008; Chen and Jiang, 2019) for background material on the concentrability factor. More precisely, we compute the sample complexity—in terms of online as well as offline samples—required for the final policy to be $\epsilon$-suboptimal with respect to any comparator policy, and so in particular with respect to the optimal policy on the “real” MDP $\mathcal{M}$. The next corollary is proved in section B.3.
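For concreteness, one common single-policy form of such a coefficient is the largest ratio between the comparator policy's occupancy measure and the offline data distribution. The snippet below computes this quantity; the precise coefficient appearing in corollary 5.2 may differ in its details (for instance, in how time-steps are handled), so this is only an illustration.

```python
import numpy as np

def concentrability(occupancy: np.ndarray, mu: np.ndarray) -> float:
    """Largest ratio between the comparator policy's state-action occupancy and the
    offline data distribution mu (both of shape (S, A)). The coefficient is infinite
    if the policy visits a pair that the data distribution never covers."""
    covered = mu > 0
    if np.any(~covered & (occupancy > 0)):
        return float("inf")
    ratio = np.zeros_like(occupancy, dtype=float)
    ratio[covered] = occupancy[covered] / mu[covered]
    return float(ratio.max())
```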
Corollary 5.2.
Suppose that the offline dataset contains
samples and that additional
online samples are collected during the online phase. Then with probability at least $1-\delta$, for any reward function, the policy returned by our algorithm is $\epsilon$-suboptimal with respect to any comparator policy, i.e.,
(5.3)
The online sample size is equivalent to the one that arises in the statement of theorem 5.1 (expressed as number of online trajectories), and does not depend on the concentrability coefficient. The dependence on the offline dataset in theorem 5.1 is implicit in the definition of sparsified MDP; here we have made it explicit using the notion of concentrability.
Corollary 5.2 can be used to compare the achievable guarantees of our procedure with those of an offline algorithm, such as the minimax-optimal procedure detailed in Xie et al. (2021b). The procedure described in Xie et al. (2021b) achieves (5.3) with probability at least $1-\delta$ by using
(5.4)
offline samples.² In terms of offline data, our procedure has a similar dependence on various factors, but it depends on the desired accuracy $\epsilon$ more mildly than the $1/\epsilon^2$ scaling that is typical of an offline algorithm. This implies that in the small-$\epsilon$ regime, if sufficient online samples are collected, one can improve upon a fully offline procedure by collecting a number of additional online samples in a non-reactive way.

²Technically, Zhan et al. (2022) consider the non-homogeneous setting and express their result in terms of the number of trajectories. In obtaining eq. 5.4, we ‘removed’ an $H$ factor due to our dynamics being homogeneous, and added it back to express the result in terms of the number of samples. However, notice that Zhan et al. (2022) consider the reward-aware setting, which is simpler than the reward-free RL setting that we consider. This should add an additional factor that is not accounted for in eq. 5.4; see Jin et al. (2020b) for more details.
Finally, notice that one may improve upon an offline dataset simply by collecting more data from the same distribution $\mu$ that generated it, i.e., without performing experimental design. Compared to this latter case, notice that our online sample complexity does not depend on the concentrability coefficient. Further discussion can be found in appendix B.
6 Proof
In this section we prove theorem 5.1 and defer the proofs of the supporting statements to Appendix A.
Let us define the comparator policy used for the comparison in eq. 5.2 to be the (deterministic) policy with the highest value function on the sparsified MDP:
We can bound the suboptimality using the triangle inequality as
The middle term after the first inequality is negative due to the optimality of the comparator policy on the sparsified MDP. It suffices to prove that for any arbitrary policy and reward function, the following statement holds with probability at least $1-\delta$:
(6.1)
Bounding the estimation error using the population uncertainty function
In order to prove eq. 6.1, we first define the population uncertainty function, which is a scalar function over the state-action space. It represents the maximum estimation error on the value of any policy when it is evaluated on the fine-estimated sparsified MDP instead of the sparsified MDP. The uncertainty function is defined to be zero at the last time-step, and for earlier time-steps it is defined through the recursion given in definition A.6.
We extend the definition to the absorbing state by letting the uncertainty vanish there at every time-step. The summation in the recursion is over the states of the sparsified MDP, but since the uncertainty vanishes at the absorbing state, it is equivalent to the summation over the original state space. Intuitively, the uncertainty function takes a form similar to the Bellman optimality equation. The additional multiplicative factor and the additional bonus term quantify the uncertainty of the true Q-function on the sparsified MDP and converge to zero as the sample size goes to infinity. This definition of the uncertainty function and the following lemma follow closely the uncertainty function defined in Ménard et al. (2021).
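Purely as an illustrative schematic (the constant $c$, the logarithmic term $\beta$, the kernel $p$ and the counter $n$ below are placeholders of our own; the precise recursion is given in definition A.6), the uncertainty function has the flavor of
$$X_h(s,a) \;\approx\; \min\!\Big(H,\; c\,H^2\,\frac{\beta\big(n(s,a),\delta\big)}{n(s,a)} \;+\; \Big(1+\tfrac{1}{H}\Big)\sum_{s'} p(s'\mid s,a)\,\max_{a'} X_{h+1}(s',a')\Big), \qquad X_{H+1}\equiv 0,$$
which mirrors the construction of Ménard et al. (2021): the first term vanishes as the counts grow, while the second term propagates the uncertainty of later time-steps.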
The next lemma highlights the key property of the uncertainty function, namely that for any reward function and any policy we can upper bound the estimation error via the uncertainty function at the initial time-step; it is proved in section A.2.1.
Lemma 6.1.
With probability at least $1-\delta$, for any reward function and any deterministic policy, it holds that
(6.2)
Leveraging the exploration mechanics
Throughout this section, $c$ denotes some universal constant that may vary from line to line. Recall that the agent greedily minimizes the empirical uncertainty function to compute the exploratory policy. The empirical uncertainty is defined to be zero at the last time-step, and for earlier time-steps as
(6.3)
where $n^t(s,a)$ is the counter of the number of times we encounter $(s,a)$ until the beginning of the $t$-th virtual episode in the simulation phase. Note that the empirical uncertainty function takes a form similar to that of the population uncertainty function, except that the former depends on the empirical transition probability while the latter depends on the true transition probability of the sparsified MDP. For the exploration scheme to be effective, the two should be close in value, a concept which is at the core of this work and which we formally state below and prove in section A.2.2.
Lemma 6.2 (Bounding uncertainty function with empirical uncertainty functions).
With probability at least $1-\delta$, we have, for any
Notice that the population uncertainty is evaluated after the online samples have been collected, while the corresponding empirical uncertainty varies over the virtual episodes of the offline simulation phase.
Rate of decrease of the estimation error
Combining lemmas 6.1 and 6.2 shows that (a function of) the agent’s uncertainty estimate upper bounds the estimation error in eq. 6.1. In order to conclude, we need to show that the empirical uncertainty decreases on average at a fast enough rate, a statement that we present below and prove in section A.2.3.
Lemma 6.3.
With probability at least $1-\delta$, we have
(6.4)
Then, for any $\epsilon > 0$, if we take the number of virtual episodes to be sufficiently large,
then with probability at least $1-\delta$, it holds that
After combining lemmas 6.1, 6.2 and 6.3, we see that the estimation error can be bounded as
Here, the constant may vary between lines. Rescaling the universal constant and the failure probability $\delta$, we obtain the upper bound in equation (6.1) and hence complete the proof of the main result.
References
- Afsar et al. (2022) M Mehdi Afsar, Trafford Crump, and Behrouz Far. Reinforcement learning based recommender systems: A survey. ACM Computing Surveys, 55(7):1–38, 2022.
- Agarwal et al. (2022) Arpit Agarwal, Rohan Ghuge, and Viswanath Nagarajan. Batched dueling bandits. In International Conference on Machine Learning, pages 89–110. PMLR, 2022.
- Auer et al. (2002) Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47:235–256, 2002.
- Azar et al. (2017) Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263–272. PMLR, 2017.
- Bai et al. (2019) Yu Bai, Tengyang Xie, Nan Jiang, and Yu-Xiang Wang. Provably efficient q-learning with low switching cost. Advances in Neural Information Processing Systems, 32, 2019.
- Cesa-Bianchi et al. (2013) Nicolo Cesa-Bianchi, Ofer Dekel, and Ohad Shamir. Online learning with switching costs and other adaptive adversaries. Advances in Neural Information Processing Systems, 26, 2013.
- Chen and Jiang (2019) Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, pages 1042–1051. PMLR, 2019.
- Chen et al. (2022) Jinglin Chen, Aditya Modi, Akshay Krishnamurthy, Nan Jiang, and Alekh Agarwal. On the statistical efficiency of reward-free exploration in non-linear rl. arXiv preprint arXiv:2206.10770, 2022.
- Dann and Brunskill (2015) Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), 2015.
- Dann et al. (2017) Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning. Advances in Neural Information Processing Systems, 30, 2017.
- Dann et al. (2019) Christoph Dann, Lihong Li, Wei Wei, and Emma Brunskill. Policy certificates: Towards accountable reinforcement learning. In International Conference on Machine Learning, pages 1507–1516, 2019.
- Du et al. (2019) Simon Du, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal, Miroslav Dudik, and John Langford. Provably efficient rl with rich observations via latent state decoding. In International Conference on Machine Learning, pages 1665–1674. PMLR, 2019.
- Duan et al. (2021) Yaqi Duan, Chi Jin, and Zhiyuan Li. Risk bounds and rademacher complexity in batch reinforcement learning. In International Conference on Machine Learning, pages 2892–2902. PMLR, 2021.
- Duchi et al. (2018) John Duchi, Feng Ruan, and Chulhee Yun. Minimax bounds on stochastic batched convex optimization. In Conference On Learning Theory, pages 3065–3162. PMLR, 2018.
- Durrett (2019) Rick Durrett. Probability: theory and examples, volume 49. Cambridge university press, 2019.
- Foster et al. (2021) Dylan J Foster, Akshay Krishnamurthy, David Simchi-Levi, and Yunzong Xu. Offline reinforcement learning: Fundamental barriers for value function approximation. arXiv preprint arXiv:2111.10919, 2021.
- Gao et al. (2021) Minbo Gao, Tianle Xie, Simon S Du, and Lin F Yang. A provably efficient algorithm for linear markov decision process with low switching cost. arXiv preprint arXiv:2101.00494, 2021.
- Gao et al. (2019) Zijun Gao, Yanjun Han, Zhimei Ren, and Zhengqing Zhou. Batched multi-armed bandits problem. Advances in Neural Information Processing Systems, 32, 2019.
- Gottesman et al. (2018) Omer Gottesman, Fredrik Johansson, Joshua Meier, Jack Dent, Donghun Lee, Srivatsan Srinivasan, Linying Zhang, Yi Ding, David Wihl, Xuefeng Peng, Jiayu Yao, Isaac Lage, Christopher Mosch, Li wei H. Lehman, Matthieu Komorowski, Matthieu Komorowski, Aldo Faisal, Leo Anthony Celi, David Sontag, and Finale Doshi-Velez. Evaluating reinforcement learning algorithms in observational health settings, 2018.
- Hazan et al. (2019) Elad Hazan, Sham Kakade, Karan Singh, and Abby Van Soest. Provably efficient maximum entropy exploration. In International Conference on Machine Learning, pages 2681–2691. PMLR, 2019.
- Huang et al. (2022) Jiawei Huang, Jinglin Chen, Li Zhao, Tao Qin, Nan Jiang, and Tie-Yan Liu. Towards deployment-efficient reinforcement learning: Lower bound and optimality. arXiv preprint arXiv:2202.06450, 2022.
- Jiang and Huang (2020) Nan Jiang and Jiawei Huang. Minimax value interval for off-policy evaluation and policy optimization. Advances in Neural Information Processing Systems, 33:2747–2758, 2020.
- Jin et al. (2018) Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.
- Jin et al. (2020a) Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu. Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, pages 4870–4879. PMLR, 2020a.
- Jin et al. (2020b) Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu. Reward-free exploration for reinforcement learning. In International Conference on Machine Learning (ICML), 2020b.
- Jin et al. (2020c) Ying Jin, Zhuoran Yang, and Zhaoran Wang. Is pessimism provably efficient for offline rl? arXiv preprint arXiv:2012.15085, 2020c.
- Jonsson et al. (2020) Anders Jonsson, Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Edouard Leurent, and Michal Valko. Planning in markov decision processes with gap-dependent sample complexity. Advances in Neural Information Processing Systems, 33:1253–1263, 2020.
- Kallus and Uehara (2022) Nathan Kallus and Masatoshi Uehara. Efficiently breaking the curse of horizon in off-policy evaluation with double reinforcement learning. Operations Research, 2022.
- Kaufmann et al. (2021) Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Edouard Leurent, and Michal Valko. Adaptive reward-free exploration. In Algorithmic Learning Theory, pages 865–891. PMLR, 2021.
- Kiran et al. (2021) B Ravi Kiran, Ibrahim Sobh, Victor Talpaert, Patrick Mannion, Ahmad A Al Sallab, Senthil Yogamani, and Patrick Pérez. Deep reinforcement learning for autonomous driving: A survey. IEEE Transactions on Intelligent Transportation Systems, 23(6):4909–4926, 2021.
- Koenecke et al. (2020) Allison Koenecke, Andrew Nam, Emily Lake, Joe Nudell, Minnie Quartey, Zion Mengesha, Connor Toups, John R Rickford, Dan Jurafsky, and Sharad Goel. Racial disparities in automated speech recognition. Proceedings of the National Academy of Sciences, 117(14):7684–7689, 2020.
- Lattimore and Szepesvári (2020) Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
- Li et al. (2023) Gen Li, Yuling Yan, Yuxin Chen, and Jianqing Fan. Optimal reward-agnostic exploration in reinforcement learning. 2023.
- Long et al. (2021) Jihao Long, Jiequn Han, and E Weinan. An l2 analysis of reinforcement learning in high dimensions with kernel and neural network approximation. arXiv preprint arXiv:2104.07794, 3, 2021.
- Maurer and Pontil (2009) Andreas Maurer and Massimiliano Pontil. Empirical bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.
- Ménard et al. (2021) Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Emilie Kaufmann, Edouard Leurent, and Michal Valko. Fast active learning for pure exploration in reinforcement learning. In International Conference on Machine Learning, pages 7599–7608. PMLR, 2021.
- Misra et al. (2020) Dipendra Misra, Mikael Henaff, Akshay Krishnamurthy, and John Langford. Kinematic state abstraction and provably efficient rich-observation reinforcement learning. In International conference on machine learning, pages 6961–6971. PMLR, 2020.
- Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. nature, 518(7540):529–533, 2015.
- Munos and Szepesvári (2008) Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(5), 2008.
- Nachum et al. (2019) Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, and Dale Schuurmans. Algaedice: Policy gradient from arbitrary experience. arXiv preprint arXiv:1912.02074, 2019.
- Nguyen-Tang et al. (2022) Thanh Nguyen-Tang, Ming Yin, Sunil Gupta, Svetha Venkatesh, and Raman Arora. On instance-dependent bounds for offline reinforcement learning with linear function approximation. arXiv preprint arXiv:2211.13208, 2022.
- Puterman (1994) Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1994. ISBN 0471619779.
- Qiao and Wang (2022) Dan Qiao and Yu-Xiang Wang. Near-optimal deployment efficiency in reward-free reinforcement learning with linear function approximation. arXiv preprint arXiv:2210.00701, 2022.
- Qiao et al. (2022) Dan Qiao, Ming Yin, Ming Min, and Yu-Xiang Wang. Sample-efficient reinforcement learning with loglog (t) switching cost. In International Conference on Machine Learning, pages 18031–18061. PMLR, 2022.
- Qiao et al. (2023) Dan Qiao, Ming Yin, and Yu-Xiang Wang. Logarithmic switching cost in reinforcement learning beyond linear mdps. arXiv preprint arXiv:2302.12456, 2023.
- Qiu et al. (2021) Shuang Qiu, Jieping Ye, Zhaoran Wang, and Zhuoran Yang. On reward-free rl with kernel and neural function approximations: Single-agent mdp and markov game. In International Conference on Machine Learning, pages 8737–8747. PMLR, 2021.
- Raccuglia et al. (2016) Paul Raccuglia, Katherine C Elbert, Philip DF Adler, Casey Falk, Malia B Wenny, Aurelio Mollo, Matthias Zeller, Sorelle A Friedler, Joshua Schrier, and Alexander J Norquist. Machine-learning-assisted materials discovery using failed experiments. Nature, 533(7601):73–76, 2016.
- Rashidinejad et al. (2021) Paria Rashidinejad, Banghua Zhu, Cong Ma, Jiantao Jiao, and Stuart Russell. Bridging offline reinforcement learning and imitation learning: A tale of pessimism. arXiv preprint arXiv:2103.12021, 2021.
- Rashidinejad et al. (2022) Paria Rashidinejad, Hanlin Zhu, Kunhe Yang, Stuart Russell, and Jiantao Jiao. Optimal conservative offline rl with general function approximation via augmented lagrangian. arXiv preprint arXiv:2211.00716, 2022.
- Ruan et al. (2021) Yufei Ruan, Jiaqi Yang, and Yuan Zhou. Linear bandits with limited adaptivity and learning distributional optimal design. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 74–87, 2021.
- Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
- Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.
- Talebi and Maillard (2018) Mohammad Sadegh Talebi and Odalric-Ambrym Maillard. Variance-aware regret bounds for undiscounted reinforcement learning in mdps. In Algorithmic Learning Theory, pages 770–805. PMLR, 2018.
- Uehara and Sun (2021) Masatoshi Uehara and Wen Sun. Pessimistic model-based offline reinforcement learning under partial coverage, 2021.
- Wagenmaker et al. (2022) Andrew J Wagenmaker, Yifang Chen, Max Simchowitz, Simon Du, and Kevin Jamieson. Reward-free rl is no harder than reward-aware rl in linear markov decision processes. In International Conference on Machine Learning, pages 22430–22456. PMLR, 2022.
- Wainwright (2019) Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019.
- Wang et al. (2020a) Ruosong Wang, Simon S Du, Lin Yang, and Russ R Salakhutdinov. On reward-free reinforcement learning with linear function approximation. Advances in neural information processing systems, 33:17816–17826, 2020a.
- Wang et al. (2020b) Ruosong Wang, Dean P Foster, and Sham M Kakade. What are the statistical limits of offline rl with linear function approximation? arXiv preprint arXiv:2010.11895, 2020b.
- Wang et al. (2021) Tianhao Wang, Dongruo Zhou, and Quanquan Gu. Provably efficient reinforcement learning with linear function approximation under adaptivity constraints. Advances in Neural Information Processing Systems, 34:13524–13536, 2021.
- Xiao et al. (2022) Chenjun Xiao, Ilbin Lee, Bo Dai, Dale Schuurmans, and Csaba Szepesvari. The curse of passive data collection in batch reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 8413–8438. PMLR, 2022.
- Xie and Jiang (2020a) Tengyang Xie and Nan Jiang. Batch value-function approximation with only realizability. arXiv preprint arXiv:2008.04990, 2020a.
- Xie and Jiang (2020b) Tengyang Xie and Nan Jiang. Q* approximation schemes for batch reinforcement learning: A theoretical comparison. In Conference on Uncertainty in Artificial Intelligence, pages 550–559. PMLR, 2020b.
- Xie et al. (2021a) Tengyang Xie, Ching-An Cheng, Nan Jiang, Paul Mineiro, and Alekh Agarwal. Bellman-consistent pessimism for offline reinforcement learning. arXiv preprint arXiv:2106.06926, 2021a.
- Xie et al. (2021b) Tengyang Xie, Nan Jiang, Huan Wang, Caiming Xiong, and Yu Bai. Policy finetuning: Bridging sample-efficient offline and online reinforcement learning. Advances in neural information processing systems, 34:27395–27407, 2021b.
- Xiong et al. (2022) Wei Xiong, Han Zhong, Chengshuai Shi, Cong Shen, Liwei Wang, and Tong Zhang. Nearly minimax optimal offline reinforcement learning with linear function approximation: Single-agent mdp and markov game. arXiv preprint arXiv:2205.15512, 2022.
- Xu et al. (2018) Zhiyuan Xu, Jian Tang, Jingsong Meng, Weiyi Zhang, Yanzhi Wang, Chi Harold Liu, and Dejun Yang. Experience-driven networking: A deep reinforcement learning based approach. In IEEE INFOCOM 2018-IEEE conference on computer communications, pages 1871–1879. IEEE, 2018.
- Yang et al. (2021) Tsung-Yen Yang, Justinian Rosca, Karthik Narasimhan, and Peter J. Ramadge. Accelerating safe reinforcement learning with constraint-mismatched policies, 2021.
- Yin and Wang (2020) Ming Yin and Yu-Xiang Wang. Asymptotically efficient off-policy evaluation for tabular reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 3948–3958. PMLR, 2020.
- Yin et al. Ming Yin, Yu-Xiang Wang, Yaqi Duan, and Mengdi Wang. Near-optimal offline reinforcement learning with linear representation: Leveraging variance information with pessimism.
- Yin et al. (2020) Ming Yin, Yu Bai, and Yu-Xiang Wang. Near optimal provable uniform convergence in off-policy evaluation for reinforcement learning. arXiv preprint arXiv:2007.03760, 2020.
- Yin et al. (2022) Ming Yin, Mengdi Wang, and Yu-Xiang Wang. Offline reinforcement learning with differentiable function approximation is provably efficient. arXiv preprint arXiv:2210.00750, 2022.
- Yu et al. (2021) Chao Yu, Jiming Liu, Shamim Nemati, and Guosheng Yin. Reinforcement learning in healthcare: A survey. ACM Computing Surveys (CSUR), 55(1):1–36, 2021.
- Zanette (2023) Andrea Zanette. When is realizability sufficient for off-policy reinforcement learning? ICML 2023, 2023.
- Zanette and Brunskill (2019) Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pages 7304–7312. PMLR, 2019.
- Zanette and Wainwright (2022) Andrea Zanette and Martin J. Wainwright. Bellman residual orthogonalization for offline reinforcement learning, 2022. URL https://arxiv.org/abs/2203.12786.
- Zanette et al. (2020) Andrea Zanette, Alessandro Lazaric, Mykel J Kochenderfer, and Emma Brunskill. Provably efficient reward-agnostic navigation with linear value iteration. Advances in Neural Information Processing Systems, 33:11756–11766, 2020.
- Zanette et al. (2021a) Andrea Zanette, Kefan Dong, Jonathan N Lee, and Emma Brunskill. Design of experiments for stochastic contextual linear bandits. Advances in Neural Information Processing Systems, 34:22720–22731, 2021a.
- Zanette et al. (2021b) Andrea Zanette, Martin J Wainwright, and Emma Brunskill. Provable benefits of actor-critic methods for offline reinforcement learning. Advances in neural information processing systems, 34:13626–13640, 2021b.
- Zhan et al. (2022) Wenhao Zhan, Baihe Huang, Audrey Huang, Nan Jiang, and Jason D Lee. Offline reinforcement learning with realizability and single-policy concentrability. arXiv preprint arXiv:2202.04634, 2022.
- Zhang et al. (2022) Ruiqi Zhang, Xuezhou Zhang, Chengzhuo Ni, and Mengdi Wang. Off-policy fitted q-evaluation with differentiable function approximators: Z-estimation and inference theory. In International Conference on Machine Learning, pages 26713–26749. PMLR, 2022.
- Zhang et al. (2020a) Xuezhou Zhang, Yuzhe Ma, and Adish Singla. Task-agnostic exploration in reinforcement learning. Advances in Neural Information Processing Systems, 33:11734–11743, 2020a.
- Zhang et al. (2020b) Zihan Zhang, Yuan Zhou, and Xiangyang Ji. Almost optimal model-free reinforcement learning via reference-advantage decomposition. Advances in Neural Information Processing Systems, 33:15198–15207, 2020b.
- Zhang et al. (2021) Zihan Zhang, Simon Du, and Xiangyang Ji. Near optimal reward-free reinforcement learning. In International Conference on Machine Learning, pages 12402–12412. PMLR, 2021.
Appendix A Proof of the main result
A.1 Definitions
In this section, we define some crucial concepts that will be used in the proof of the main result.
A.1.1 Sparsified MDP
First, we restate definition 4.1 from the main text.
Definition A.1 (Sparsified MDP).
Let $s^\dagger$ be an absorbing state, i.e., such that $P(s^\dagger \mid s^\dagger, a) = 1$ and $r(s^\dagger, a) = 0$ for all $a \in \mathcal{A}$. The state space of the sparsified MDP $\mathcal{M}^\dagger$ is defined as that of the original MDP with the addition of $s^\dagger$. The dynamics $P^\dagger$ of the sparsified MDP are defined as
$P^\dagger(s' \mid s,a) = P(s' \mid s,a)\,\mathbb{1}\{N(s,a,s') \geq \Phi\}$ for $s' \neq s^\dagger$, and $P^\dagger(s^\dagger \mid s,a) = \sum_{s'' : N(s,a,s'') < \Phi} P(s'' \mid s,a)$.  (A.1)
For any deterministic reward function $r$, the reward function on the sparsified MDP is defined as $r^\dagger(s,a) = r(s,a)$ for $(s,a) \in \mathcal{S} \times \mathcal{A}$ and $r^\dagger(s^\dagger, a) = 0$; for simplicity we only consider deterministic reward functions.
In the offline phase of our algorithm, we simulate virtual episodes from the empirical version of the sparsified MDP (see algorithm 2). We now formally define this MDP.
Definition A.2 (Empirical sparsified MDP).
Let $s^\dagger$ be the absorbing state defined in the sparsified MDP. The state space of the empirical sparsified MDP $\widehat{\mathcal{M}}^\dagger$ is defined as that of the original MDP with the addition of $s^\dagger$. The dynamics $\widehat{P}^\dagger$ of the empirical sparsified MDP are defined as
$\widehat{P}^\dagger(s' \mid s,a) = \frac{N(s,a,s')}{N(s,a)}\,\mathbb{1}\{N(s,a,s') \geq \Phi\}$ for $s' \neq s^\dagger$, and $\widehat{P}^\dagger(s^\dagger \mid s,a) = 1 - \sum_{s' \neq s^\dagger} \widehat{P}^\dagger(s' \mid s,a)$.  (A.2)
For any deterministic reward function $r$, the reward function on the empirical sparsified MDP is defined as $r^\dagger(s,a) = r(s,a)$ and $r^\dagger(s^\dagger, a) = 0$; for simplicity we only consider deterministic reward functions. Here, the counters $N(s,a)$ and $N(s,a,s')$ are the number of $(s,a)$ and $(s,a,s')$ samples in the offline data, respectively.
Finally, in the planning phase, after we interact with the true environment for the online episodes, we construct a fine-estimated sparsified MDP, which is used to extract the optimal policy for given reward functions. We formally define it below.
Definition A.3 (Fine-estimated sparsified MDP).
Let $s^\dagger$ be the absorbing state defined in the sparsified MDP. The state space of the fine-estimated sparsified MDP is defined as that of the original MDP with the addition of $s^\dagger$. The dynamics of the fine-estimated sparsified MDP are defined as
(A.3)
For any deterministic reward function $r$, the reward function on the fine-estimated sparsified MDP is defined as in definition A.1; for simplicity we only consider deterministic reward functions. Here, the counters $N(s,a)$ and $N(s,a,s')$ are the number of $(s,a)$ and $(s,a,s')$ samples in the offline data, respectively, while the corresponding online counters record the number of $(s,a)$ and $(s,a,s')$ samples collected during the online episodes.
A.1.2 High probability events
In this section, we define all the high-probability events we need in order for our theorem to hold. Specifically, we define
where
Here, $\mathrm{KL}$ denotes the Kullback–Leibler divergence between two distributions. We also use the number of episodes of the sub-routine RF-UCB and the policy executed at the $i$-th episode of RF-UCB (see algorithm 2), the number of episodes executed in the online phase, the counter of visits before the beginning of the $t$-th episode in the offline phase, and the number of samples in the online data (see the definitions in section A.1). We denote the occupancy measures at stage $h$ under a policy on the true transition dynamics, on the transition dynamics of the sparsified MDP, and on the transition dynamics of the empirical sparsified MDP, respectively.
For the event defined above, we have the following guarantee, which is proved in section A.3.1.
Lemma A.4.
For the threshold chosen as in eq. 5.1, the event defined above
happens with probability at least $1-\delta$.
A.1.3 Uncertainty function and empirical uncertainty function
In this section, we restate the definitions of the uncertainty function and the empirical uncertainty function, as well as some intermediate uncertainty functions which will often be used in the proof. First, we restate the definition of the bonus function.
Definition A.5 (Bonus Function).
We define
(A.4)
Further, we define
(A.5)
Next, we restate the definitions of the uncertainty function and the empirical uncertainty function. Notice that the uncertainty function does not depend on the reward function or on a specific policy.
Definition A.6 (Uncertainty function).
We define for any and any deterministic policy
where we specify
In addition, we extend the definition to the absorbing state. The summation used above is over the states of the sparsified MDP, but since the uncertainty vanishes at the absorbing state, it is equivalent to the summation over the original state space; the transition probability is that of the fine-estimated sparsified MDP defined in section A.1.1.
Then, to help the reader better understand the proof, we define some intermediate quantities, which also measure the uncertainty of the value estimate but may be reward or policy dependent.
Definition A.7 (Intermediate uncertainty function).
We define for any and any deterministic policy
where
In addition, we define the corresponding quantity for any reward function. Here, the transition probability is that of the fine-estimated sparsified MDP defined in section A.1.1.
The intermediate function is reward and policy dependent and can be used to upper bound the value estimation error. Further, we need to define a policy-dependent version of the uncertainty function, as well as another related quantity.
Definition A.8.
We define for any and
where
In addition, we define the corresponding quantity for any reward and any policy. Here, the transition probability is that of the fine-estimated sparsified MDP defined in section A.1.1.
Finally, we define the empirical uncertainty functions.
Definition A.9.
We define for any and Further, we define
(A.6)
where the bonus function is defined in eq. A.4 and the counter records the number of times we encounter each state-action pair until the beginning of the $t$-th episode when running the sub-routine RF-UCB in the offline phase.
A.2 Proof of the key theorems and lemmas
A.2.1 The uncertainty function upper bounds the estimation error
Proof.
This proof follows closely the techniques in Ménard et al. (2021), but here we consider the homogeneous MDP. We let the occupancy measure be the probability of encountering a given state-action pair at stage $h$ when running a policy on the sparsified MDP, starting from the initial state. We now assume the high-probability event holds and fix a reward function and a policy. From lemma A.14, we know
where the intermediate uncertainty function is defined in definition A.7. Here, we use the policy-dependent version of the uncertainty function, which will then be upper bounded using two other policy-dependent quantities defined in definition A.8. We define these policy-dependent uncertainty functions to upper bound the estimation error of a specific policy, but since we are considering a reward-free setting, which requires a low estimation error for any policy and reward, these quantities cannot be directly used in the algorithm to update the policy. Next, we claim
where the two quantities are defined in definition A.8. This follows easily from their definitions: for the last time-step it is trivial, and assuming it holds at time-step $h+1$, the case of time-step $h$ follows from the definitions. Therefore, we have
(A.7)
Next, we eliminate the dependency on the policy and obtain an upper bound on the estimation error using the policy-independent uncertainty function. This is done by bounding the first policy-dependent quantity by the second, and then upper bounding the latter by the policy-independent uncertainty function. From definition A.8, the Cauchy-Schwarz inequality and elementary bounds, we have, for any reward function and any deterministic policy,
(by induction, successively using the definition of the policy-dependent uncertainty function, followed by the Cauchy-Schwarz inequality)
We notice that the right-hand side of the last inequality above is the value of the policy under consideration for some specific reward function. This specific reward function is non-negative and uniformly bounded by one, so the corresponding value is bounded as well. Moreover, from the definition of the bonus function (eq. A.4) and from the definition of the uncertainty function in definition A.6, we apply an inductive argument to obtain
for any deterministic policy. So we have
Therefore, combining the last inequality with (A.7), we have
From the definitions of the uncertainty function (definition A.6) and of its policy-dependent counterpart (definition A.8), we can see
for any state-action pair and any deterministic policy. Therefore, we conclude that
∎
A.2.2 Proof of lemma 6.2
Proof.
It suffices to prove that for any it holds that
since the lemma follows from averaging the inequalities above. For any fixed index, we prove the claim by induction on the time-step. At the last time-step, both sides are zero by definition. Suppose the claim holds at time-step $h+1$; we will prove that it also holds at time-step $h$. We denote the number of virtual episodes in the offline phase. From lemma A.16, the decreasing property established in lemma A.17, and lemma A.21, we have
(by lemma A.16, lemma A.17, and lemma A.21)
The inductive hypothesis gives us for any
which implies
(A.8)
Again, by definition of we have
(A.9)
Combining (A.8) and (A.9), as well as the definition of the empirical uncertainty function (definition A.9), we prove the case of stage $h$, and we conclude the proof by induction. ∎
A.2.3 Upper bounding the empirical uncertainty function (lemma 6.3)
Proof.
First, we are going to prove
(A.10)
From the definition (see algorithm 2), an important property of the empirical uncertainty function is that the policy executed at each virtual episode is greedy with respect to it. So we have
Therefore, we have
(by definition A.9 and lemma A.20)
We apply lemma D.7 and Jensen's inequality to obtain
Inserting the last display into the upper bound above, we conclude the proof of the upper bound on the empirical uncertainty.
Then, to prove the sample complexity, we use lemma D.8. We take
in Lemma D.8. From lemma D.8, we know there exists a universal constant such that when it holds that which implies whenever
it holds that
(A.11)
For notational simplicity, we define
Comparing the left-hand side of (A.11) and the right-hand side of (A.10), we know
It is easy to see that the right-hand side of the last inequality is upper bounded by
From (A.11), we know this is upper bounded as long as the condition above holds. From the definition of $L$ and equations (A.10) and (A.11), we have that when the condition holds, it holds that
In conclusion, this admits a sample complexity of
∎
A.3 Omitted proofs
A.3.1 Proof for lemma A.4 (high probability event)
The lemma A.4 is proved by combining all the lemmas below via a union bound. Below, we always denote as the number of in the offline dataset.
Lemma A.10.
holds with probability at least
Proof.
For any fixed state-action pair with at least one visit in the offline dataset, we denote the index of the $i$-th time that pair is visited. For notational simplicity, we fix the state-action pair and suppress it from the notation. Notice that the total number of visits is random, so our argument is based on conditional probabilities. We denote the indicator of whether the next state equals the state of interest at the $i$-th visit of the pair. From the data generating mechanism and the definition of the reference MDP, we know that, conditional on the total number of visits and on all the visit indexes, these indicators are independent Bernoulli random variables with the corresponding success probability. We denote their arithmetic average. Using the empirical Bernstein inequality (lemma D.3), we have
Taking integral of the conditional probability, we conclude that the unconditional probability of the event is at least Therefore, we conclude. ∎
Lemma A.11.
For it holds that
Proof.
Consider for a fixed triple . Let denote the number of times the tuple was encountered in total during the online interaction phase. We define as follows: for , we let be the subsequent state when was encountered for the -th time in the whole run of the algorithm; otherwise, we let be an independent sample from . By construction, is a sequence of i.i.d. categorical random variables from with distribution . We denote by the empirical probability mass and by the probability mass of . Then, from Lemma D.4, with probability at least , it holds that for any
which implies
Using a union bound for we conclude. ∎
Lemma A.12 (Lower Bound on Counters).
For it holds that and
Proof.
We denote by the number of times we encounter at stage before the beginning of episode . Concretely, we define for any . Then, we define
We fix an and denote by the sigma-field generated by the first episodes when running RF-UCB and . Then, we know is -measurable and is -measurable. Taking in Lemma D.1 and applying a union bound, we know that with probability the following event happens:
To finish the proof, it remains to note that the event above implies the event we want, by summing over for each and each . For the other counter, the proof is almost the same. ∎
Lemma A.13 (Upper Bound on Counters).
For it holds that
Proof.
We denote by the number of times we encounter at stage before the beginning of episode . Concretely, we define for any . Then, we define
We fix an and denote by the sigma-field generated by the first episodes when running RF-UCB and . Then, we know is -measurable and is -measurable. Taking in Lemma D.2 and applying a union bound, we know that with probability the following event happens:
To finish the proof, it remains to note that the event above implies the event we want, by summing over for each and each . ∎
A.3.2 Proof for lemma A.14 (property of intermediate uncertainty function)
Lemma A.14 (Intermediate Uncertainty Function).
We define for any and any deterministic policy
where
In addition, we define for any and any reward function . Then, under for any , any deterministic policy and any deterministic reward function (with its augmentation ), it holds that
Proof.
We prove the claim by induction. For , we know that by definition, and the left hand side of the inequality we want also vanishes. Suppose the claim holds for and for any . The Bellman equation gives us
Under , we know that for any , so from Lemma D.5 we have
where is the bonus defined in (A.4). Here, we let in Lemma D.5 be . So the range is and the upper bound for the KL divergence is given by the high-probability event . We further apply Lemma D.6 to get
and
Therefore, we have
Using the fact that and for positive and the definition of we obtain
which implies
If , then by definition we have
Otherwise, if , we have
Therefore, we have in either case
The induction hypothesis gives us that
Inserting this into the upper bound of we prove the case of by the definition of and the simple fact that ∎
A.3.3 Proof for lemma A.15 and lemma A.16 (properties of uncertainty function)
In this section, we prove some lemmas for upper bounding the uncertainty functions . We first provide a basic upper bound for . The uncertainty function is defined in definition A.6.
Lemma A.15.
Under the high probability event (defined in section A.1.2) for all it holds that
In addition, for , the above upper bound holds directly from the definition.
Proof.
For any fixed from the definition of the uncertainty function, we know
To prove the lemma, it suffices to bound the difference between and Under (which happens with probability at least ), it holds that Applying Lemma D.5 and a simple bound for variance gives us
The last line uses the inequality for any positive , and the fact that . Inserting this bound into the definition of , we obtain
Noticing that we conclude. ∎
Lemma A.16.
There exists a universal constant such that under the high probability event when we have for any it holds that
In addition, for , the above upper bound holds directly from the definition.
A.3.4 Proof for lemma A.17, lemma A.18, lemma A.19, lemma A.20 (properties of bonus function)
For our bonus function, we have the following basic property.
Lemma A.17.
is non-increasing when For any we have
Proof.
We define where Then, for
Taking and , we conclude the first claim. The second claim is trivial, since the logarithm function is increasing. ∎
Lemma A.18.
Under we have for any
Proof.
For fixed when we know that implies From Lemma A.17, we know
When simple algebra shows that
Therefore, we conclude. ∎
Lemma A.19.
Under we have for any
Proof.
We consider under the event it holds that for any
If then we have which implies
Otherwise, if then we have
Combining the two cases above, we conclude. ∎
Lemma A.20.
Under we have for any and any
Proof.
Under it holds that
If then and hence,
This result equals the right hand side in the lemma, because (so taking the maximum does not change anything). Otherwise, if , then
Since we have
The last inequality comes from simple algebra. Therefore, we conclude. ∎
A.3.5 Proof for lemma A.21 and lemma A.22 (properties of empirical sparsified MDP)
In this section, we state two important properties of the empirical sparsified MDP and prove them. We remark that we do not include in these two lemmas, since by definition
Lemma A.21 (Multiplicative Accuracy).
We set
Then, when under we have for any
Proof.
For both sides are zero. For recall then from lemma A.10, with probability at least ,
(by Lemma A.10 and the definition of )
where the second line comes from the lower bound on . We conclude. ∎
Lemma A.22 (Bound on Ratios of Visitation Probability).
For any deterministic policy and any it holds that
Here, recall that we denote and as the occupancy measures of at stage under policy on (the transition dynamics in the sparsified MDP) and (the transition dynamics in the empirical sparsified MDP), respectively.
We remark that for the inequality does not necessarily hold.
Proof.
We denote by the set of all truncated trajectories up to stage such that . Notice that if , then it holds that for . We denote by the probability under a specific transition dynamics and policy. For any fixed and any fixed , we apply Lemma A.21 to get
and
We conclude by rewriting the visiting probability as
∎
Appendix B Additional comparisons
B.1 Comparison with other comparator policy
Our main result compares the sub-optimality of the policy against the optimal policy on the sparsified MDP. We can further derive the sub-optimality of our output with respect to any comparator policy on the original MDP. If we denote and as the globally optimal policy, the optimal policy on the sparsified MDP, and the policy output by our algorithm, respectively, and denote as the comparator policy, we have
(B.1)
Here, the second term is non-positive from the definition of , the third term is upper bounded by due to our main theorem (Theorem 5.1), and the last term is non-positive from the definition of the sparsified MDP: since the connectivity graph of the sparsified MDP is a sub-graph of that of the original MDP, for any policy the value on the sparsified MDP can be no higher than that on the original MDP.
At a high level, the term in the last line of (B.1) represents the error from the finite number of online episodes, while the approximation error term measures the difference between the value of on the original MDP and on the sparsified one, and thus reflects the coverage quality of the offline dataset. If the dataset covers most of what covers, then this gap should be small. When is the globally optimal policy, this means the data should cover the state-action pairs that the optimal policy visits. The approximation error here plays a role similar to that of the concentrability coefficient in offline reinforcement learning.
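For concreteness, the decomposition in (B.1) can be written out as follows; the symbols are our own placeholders and may differ from the paper's notation: $M$ denotes the original MDP, $M^{\dagger}$ the sparsified MDP, $\pi_{\mathrm{comp}}$ the comparator policy, $\pi^{\dagger}$ the optimal policy on the sparsified MDP, and $\widehat{\pi}$ the policy returned by the algorithm.
\begin{align*}
V^{\pi_{\mathrm{comp}}}_{M} - V^{\widehat{\pi}}_{M}
&= \underbrace{\big(V^{\pi_{\mathrm{comp}}}_{M} - V^{\pi_{\mathrm{comp}}}_{M^{\dagger}}\big)}_{\text{approximation error (offline coverage)}}
+ \underbrace{\big(V^{\pi_{\mathrm{comp}}}_{M^{\dagger}} - V^{\pi^{\dagger}}_{M^{\dagger}}\big)}_{\le 0 \text{ by optimality of } \pi^{\dagger} \text{ on } M^{\dagger}} \\
&\quad + \underbrace{\big(V^{\pi^{\dagger}}_{M^{\dagger}} - V^{\widehat{\pi}}_{M^{\dagger}}\big)}_{\text{estimation error, controlled by Theorem 5.1}}
+ \underbrace{\big(V^{\widehat{\pi}}_{M^{\dagger}} - V^{\widehat{\pi}}_{M}\big)}_{\le 0 \text{ since values on the sparsified MDP are no larger}}.
\end{align*}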
B.2 Comparison with offline reinforcement learning
Our algorithm leverages some information from the offline dataset, so it is natural to ask what we expect of that offline dataset compared with what traditional offline reinforcement learning requires. In offline RL, we typically require a concentrability condition, namely good coverage by the offline dataset, in order to achieve polynomial sample complexity. Specifically, if we assume the offline data are sampled by first drawing i.i.d. from and then sampling the subsequent state from the transition dynamics, then the concentrability condition says that the following constant is well-defined and finite.
The concentrability coefficient can be defined in several alternative ways, either for a set of policies or with respect to a single policy [Chen and Jiang, 2019, Zhan et al., 2022, Xie et al., 2021b, Zanette et al., 2021b]. Here, we follow the definition in [Xie et al., 2021b]. This means that the sampling distribution must cover the region that the globally optimal policy visits, which is an intuition very similar to the one arising in our setting.
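As a point of reference, the single-policy version of this coefficient, in the style of [Xie et al., 2021b] and written here in our own placeholder notation (the paper's exact definition may differ, e.g., by being stated for a policy class), is
\[
C^{\star} \;=\; \max_{(s,a)} \frac{\bar{d}^{\pi^{\star}}(s,a)}{\mu(s,a)},
\qquad
\bar{d}^{\pi^{\star}}(s,a) \;=\; \frac{1}{H}\sum_{h=1}^{H} d^{\pi^{\star}}_{h}(s,a),
\]
where $\mu$ is the offline sampling distribution over state-action pairs, $d^{\pi^{\star}}_{h}$ is the occupancy measure of the globally optimal policy $\pi^{\star}$ at stage $h$, and the ratio is interpreted as zero when both numerator and denominator vanish.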
[Xie et al., 2021b] also showed that the optimal sample complexity (in terms of state-action pairs) for an offline RL algorithm is
which is minimax optimal up to logarithmic and higher-order terms. Similar sample complexities were also obtained in the literature [Yin and Wang, 2020, Yin et al., 2020, Xie and Jiang, 2020b].
Uniform data distribution
For simplicity, we first assume to be uniform over all state-action pairs and the reward function to be given. Suppose we have state-action pairs in the offline data, sampled i.i.d. from the distribution . Notice that, here, the globally optimal policy still differs from the optimum on the sparsified MDP: even if we get enough samples for each pair, we might not get enough samples for every , and hence not all will be included in the set of known tuples.
Concretely, if we consider the case where we sample each state-action pair times and simply treat the transition frequency as the true probability, then for any it holds that . So for any and any we know , while for any we have and . Therefore, we have
From the value difference lemma (lemma D.11), we can upper bound the approximation error by
(by Lemma D.11; the value function is upper bounded by ; summing over and ; using the definition of in (5.1))
Therefore, to get an -optimal policy compared to the global optimal one, we need the number of state-action pairs in the initial offline dataset to be
From Theorem 5.1, the offline data size we need here is significantly smaller than what is required by an offline algorithm. As long as is not too small, for instance larger than , we shave off the whole term. The order of the offline sample complexity here is instead of the typical in offline RL, which is significantly smaller when is small. To compensate for the smaller offline sample size, we need more online samples to obtain a globally -optimal policy; we summarize the general requirements on the offline and online sample sizes in corollary 5.2.
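To make the role of the known-tuple set concrete, the following sketch shows one plausible way to assemble empirical sparsified dynamics from offline counts. The function name, the count threshold, the stage-indexed counters, and the routing of the leftover mass to an absorbing state are illustrative assumptions, not the paper's exact construction (which is given formally in section A.1.1).
```python
import numpy as np
from collections import Counter

def build_sparsified_dynamics(offline_data, S, A, H, threshold=1):
    """Illustrative sketch only: keep a transition (s, a, s') whose offline count
    reaches `threshold`, and route the remaining probability mass to an absorbing
    state with index S. See section A.1.1 for the formal definition."""
    triple_counts = Counter()   # counts of (h, s, a, s') in the offline data
    pair_counts = Counter()     # counts of (h, s, a) in the offline data
    for h, s, a, s_next in offline_data:
        triple_counts[(h, s, a, s_next)] += 1
        pair_counts[(h, s, a)] += 1

    P = np.zeros((H, S + 1, A, S + 1))
    P[:, S, :, S] = 1.0                              # absorbing state self-loops
    for h in range(H):
        for s in range(S):
            for a in range(A):
                n_sa = pair_counts[(h, s, a)]
                kept = 0.0
                for s_next in range(S):
                    c = triple_counts[(h, s, a, s_next)]
                    if n_sa > 0 and c >= max(threshold, 1):
                        P[h, s, a, s_next] = c / n_sa    # empirical frequency
                        kept += c / n_sa
                P[h, s, a, S] = max(0.0, 1.0 - kept)     # unknown mass -> absorbing
    return P
```
The resulting array can be fed directly to the backward-induction planner sketched at the end of appendix E.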
Non-uniform data distribution
Assume the data-generating distribution is not uniform but is still supported on all pairs such that , so that the concentrability coefficient in offline RL is still well defined. We simply consider the case where each state-action pair is sampled times and treat the transition frequency as the true underlying probability. Then, following an argument very similar to that of the last paragraph, the number of state-action pairs needed in the initial offline dataset in order to extract an -globally-optimal policy is
Here, the quantity
plays a role similar to that of the classical concentrability coefficient and also measures the distribution shift between two policies. In the worst case, this coefficient can be , resulting in an extra factor compared to the optimal offline sample complexity. However, we still shave off the entire term, and we also shave off in the term.
Partial coverage data
Under partial coverage, we expect the output policy to be competitive with the value of the best policy supported in the region covered by the offline dataset. In such a case, theorem 5.1 provides guarantees against the best comparator policy on the sparsified MDP . In order to gain further intuition, it is best to ‘translate’ such guarantees into guarantees on .
In the worst case, the data distribution at a certain pair can be zero when , which implies that the concentrability coefficient . Here, is an arbitrary comparator policy. In this case, neither a classical offline RL algorithm nor our policy finetuning algorithm can guarantee an -optimal policy compared to the globally optimal policy. However, we can still output a locally -optimal policy, i.e., one that is -optimal compared to the optimal policy on the sparsified MDP.
In order to compare against any policy on the original MDP, we have corollary 5.2, which is proved in section B.3.
The statement of corollary 5.2 is a fairly direct consequence of theorem 5.1, and it expresses the sub-optimality gap of with respect to any comparator policy on the original MDP . It can also be written in terms of the sub-optimality: if we fix a comparator policy , then with probability at least , for any reward function , the policy returned by algorithm 2 satisfies:
where is the number of online episodes and is the number of state-action pairs in the offline data. Here, the sub-optimality depends on an online error as well as on an offline error. The online error is the one that also arises in the statement of theorem 5.1. It is an error that can be reduced by collecting more online samples, i.e., by increasing , with the typical inverse square-root dependence .
However, the upper bound suggests that even in the limit of infinite online data, the value of will not approach that of , because of a residual error due to the offline dataset . Such a residual error depends on a certain concentrability factor, expressed as , whose presence is intuitive: if a comparator policy is not covered well, our algorithm does not have enough information to navigate to the area that tends to visit, and so it is unable to refine its estimates there. However, the dependence on the number of offline samples is through its inverse, i.e., , as opposed to the more typical ; this gap represents the performance that can be recovered when additional online data are collected non-reactively.
It is useful to compare corollary 5.2 with what is achievable by using a minimax-optimal online algorithm [Xie et al., 2021b]. In this latter case, one can bound the sub-optimality gap for any comparator policy with high probability as
(B.2)
B.3 Proof of corollary 5.2
Let us denote the offline dataset by , with the total number of tuples. We keep the same notation as in the main text. We use and to denote the counts of and in the offline data . The state-action pairs are sampled i.i.d. from , and the subsequent states are sampled from the transition dynamics. We fix a comparator policy and assume for any such that , which implies a finite concentrability constant . Here, is the occupancy probability of when executing policy , averaged over all stages.
Similarly to Section B.1, we have
(B.3)
where and are the optimal policy on the sparsified MDP and the policy output by our algorithm, respectively, and is the fixed comparator policy. Here, the second term is non-positive from the definition of , and the last term is non-positive from the definition of the sparsified MDP . This is because, for any state-action pair and any fixed policy, the probability of reaching under will not exceed that under the true transition probability . If we denote the visiting probabilities under and ( resp.) as ( resp.), then we have
for any . Note that, for , this does not necessarily hold. Then, for any policy we have
(by the definition of the policy value and the fact that for any )
Therefore, we get when we take
From the main result (Theorem 5.1), we know the estimation error is bounded as
(B.4)
where is the number of online episodes. Therefore, to make the right hand side of (B.4) less than , one needs at least online episodes. This is exactly what the main result shows.
So it suffices to bound the approximation error term. From the value difference lemma (Lemma D.11), we have
(by expanding the expectation; and the inner product has terms)
We define
(B.5)
as the average visiting probability. Then, we have
(B.6)
So it suffices to upper bound . Notice here we only consider and since the value function starting from is always zero.
For , if it holds that Otherwise, from the definition, we know so it suffices to bound in this case. From lemma B.1, we know that with probability at least for any such that , it holds that
Then we deal with two cases. When from lemma B.2 we have
From the definition of we know that and which implies
The last inequality comes from our assumption for
In the other case, when it holds that
for any . Therefore, for any and for any such that , we have
(B.7)
Combining equations (B.7) and (B.6), we know for any comparator policy it holds that
where is the total number of transitions in the offline data. In order to make the right hand side of the last display less than , one needs at least
offline transitions. Combining the bounds on the estimation error and the approximation error, we conclude.
B.4 Proof for lemma B.1 and lemma B.2
Lemma B.1.
With probability at least for any it holds that
Proof.
We fix and denote by the index set where for . We order the indexes in as . For , we denote by the indicator of whether the next state is when we encounter for the -th time. When , we let be independent Bernoulli random variables with success rate . Then, for all , is an i.i.d. sequence of Bernoulli random variables. From Lemma D.1, with probability at least , for any positive integer , it holds that
We take (although is random, we can still take it because for any the inequality above holds) to get
Applying a union bound for all we conclude. ∎
Lemma B.2.
With probability at least for any it holds that
Proof.
If , this is trivial. We fix an such that . For , we denote by the indicator of whether the -th state-action pair we encounter in the offline dataset is . When , we let be independent Bernoulli random variables with success rate . Then, for all , is an i.i.d. sequence of Bernoulli random variables. From Lemma D.1, with probability at least , for any positive integer , it holds that
We take to get
Applying a union bound for all such that we conclude. ∎
Appendix C Lower bound
In this section we briefly discuss the optimality of the algorithm. Although the following considerations are also mentioned in the main text, here we explain how they naturally lead to a lower bound.
Lower bound for reward-free exploration
Consider the MDP class defined in the proof of the lower bound of Theorem 4.1 in [Jin et al., 2020b]. Assume that the dataset arises from a logging policy which induces the condition for all , for every instance of the class. In this case, every MDP instance and its sparsified version coincide. Then the concatenation of the logging policy and of the policy produced by our algorithm (i.e., algorithm 3) can be interpreted as a reactive policy, which must comply with the reward-free lower bound established in Theorem 4.1 of [Jin et al., 2020b]. More precisely, the reward-free sample complexity lower bound established in Theorem 4.1 of [Jin et al., 2020b] is
(C.1)
trajectories. This matches the sample complexity of theorem 5.1. Notice that the number of samples originally present in the dataset can be
(C.2)
a term independent of the accuracy . Given that when we have
our algorithm is unimprovable beyond constant terms and logarithmic terms in a minimax sense.
Lower bound for non-reactive exploration
Consider the MDP class defined in the proof of the lower bound in Theorem 1 of [Xiao et al., 2022]. It establishes an exponential sample complexity for non-reactive exploration when no prior knowledge is available. In other words, in the absence of any data about the MDP, non-reactive exploration must suffer an exponential sample complexity. In such a case, our theorem 5.1 (correctly) provides vacuous guarantees, because the sparsified MDP is degenerate (all edges lead to the absorbing state).
Combining the two constructions
It is possible to combine the MDP class from the paper [Xiao et al., 2022] with the MDP class from the paper [Jin et al., 2020b]. In lieu of a formal proof, here we provide only a sketch of the construction that would induce a lower bound. More precisely, consider a starting state where only two actions, and , are available. Taking leads to the start state of an instance of the class , while taking leads to the start state of an instance of the class ; in both cases the transition occurs with probability one and zero reward is collected.
Furthermore, assume that the reward function given over the MDPs in is shifted such that the value of a policy that takes in and then plays optimally is , and that the reward function on is shifted such that the value of a policy which takes initially and then plays optimally is .
In addition, assume that the dataset arises from a logging policy which takes initially and then visits all uniformly.
Such a construction and dataset identify a sparsified MDP which coincides with , with the addition of (and its transition to with zero reward). Intuitively, a policy with value arbitrarily close to must take action , which leads to , the portion of the MDP that is unexplored in the dataset. In this case, unless the agent collects exponentially many trajectories in the online phase, the lower bound from [Xiao et al., 2022] implies that it is not possible to discover a policy with value close to (e.g., larger than ). On the other hand, our theorem 5.1 guarantees that has a value of at least , because is -optimal on the sparsified MDP (i.e., -optimal when restricted to an instance on ) with high probability using at most trajectories. This value is unimprovable given the lower bound of Jin et al. [2020b], which applies to the class .
For completeness, in the next sub-section we refine the lower bound of [Xiao et al., 2022] to handle mixture policies.
Appendix D Technical lemmas and proofs
Lemma D.1 (Lemma F.4 in [Dann et al., 2017]).
Let for be a filtration and be a sequence of Bernoulli random variables with with being -measurable and being measurable. It holds that
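For reference, a commonly cited form of this lemma (our rendering; the exact constants are an assumption and may differ from the original statement) is
\[
\mathbb{P}\left( \exists\, n:\ \sum_{i=1}^{n} X_i \;<\; \frac{1}{2}\sum_{i=1}^{n} P_i \;-\; W \right) \;\le\; e^{-W}
\qquad \text{for all } W > 0.
\]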
Lemma D.2.
Let for be a filtration and be a sequence of Bernoulli random variables with with being -measurable and being measurable. It holds that
Proof.
Notice that is non-decreasing on , where at zero we continuously extend this function. For any , since , we have . Taking the expectation conditional on , and noticing that is a martingale difference sequence w.r.t. the filtration , we have
where the last inequality comes from the fact that, conditional on , is a Bernoulli random variable. We define , which is a supermartingale by the derivation above. We now define the stopping time and the sequence . Applying the convergence theorem for nonnegative supermartingales (Theorem 4.2.12 in [Durrett, 2019]), we get that is well-defined almost surely. Therefore, is well-defined even when . By the optional stopping theorem for non-negative supermartingales (Theorem 4.8.4 in [Durrett, 2019]), we have for all , and applying Fatou's lemma, we obtain . Using Markov's inequality, we can finally bound
∎
Lemma D.3 (Empirical Bernstein Inequality, Theorem 11 in [Maurer and Pontil, 2009]).
Let and be i.i.d. random variables such that with probability 1. Let and ; then with probability we have
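For reference, a commonly used form of the empirical Bernstein bound (our rendering, for variables bounded in $[0,1]$; the exact constants and variance normalization in Theorem 11 of [Maurer and Pontil, 2009] may differ slightly) states that, with probability at least $1-\delta$,
\[
\mathbb{E}[X_1] - \frac{1}{n}\sum_{i=1}^{n} X_i
\;\le\;
\sqrt{\frac{2\, V_n \log(2/\delta)}{n}} \;+\; \frac{7 \log(2/\delta)}{3(n-1)},
\]
where $V_n$ denotes the sample variance of $X_1,\dots,X_n$.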
Lemma D.4 (Concentration for KL Divergence, Proposition 1 in [Jonsson et al., 2020]).
Let be i.i.d. samples from a distribution supported over , of probabilities given by , where is the probability simplex of dimension . We denote by the empirical vector of probabilities. Then, for any , with probability at least it holds that
We remark that there is a slight difference between this and the original version: in [Jonsson et al., 2020], they use instead of . But since the second term on the right hand side above is increasing in , our version also holds.
Lemma D.5 (Bernstein Transportation, Lemma 11 in [Talebi and Maillard, 2018]).
For any function and any two probability measures which satisfy , we denote and . Assuming that and are finite, we have
Lemma D.6 (Difference of Variance, Lemma 12 in [Ménard et al., 2021]).
Let be two probability measures on a discrete sample space of cardinality . Let be two functions defined on such that for all . Then we have that
Further, if it holds that
Lemma D.7.
For any sequence of numbers with we have
Proof.
∎
Lemma D.8.
For and there exists a universal constant such that when
it holds that
Proof.
We have
We define then we have Since we have
We can take such that whenever Therefore, we know is increasing when Then, it suffices to prove
Since it suffices to prove
When , the difference between the right hand side and the left hand side is always increasing w.r.t. for fixed . Therefore, it suffices to prove the case when . Noticing that we can always take a sufficiently large universal constant such that the inequality above holds when , we conclude. ∎
Lemma D.9 (Chain rule of Kullback-Leibler divergence, Exercise 3.2 in [Wainwright, 2019]).
Given two -variate distributions and , the Kullback-Leibler divergence can be decomposed as
where denotes the conditional distribution of given under , with a similar definition for .
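In standard notation (our rendering of the usual statement), the decomposition reads
\[
D_{\mathrm{KL}}\big(\mathbb{Q} \,\|\, \mathbb{P}\big)
\;=\;
\sum_{j=1}^{n} \mathbb{E}_{X_1^{\,j-1}\sim \mathbb{Q}}\left[
D_{\mathrm{KL}}\Big( \mathbb{Q}_j\big(\cdot \mid X_1^{\,j-1}\big) \,\Big\|\, \mathbb{P}_j\big(\cdot \mid X_1^{\,j-1}\big) \Big)
\right],
\]
where $\mathbb{Q}_j(\cdot \mid X_1^{\,j-1})$ is the conditional distribution of the $j$-th coordinate given the first $j-1$ coordinates under $\mathbb{Q}$, and similarly for $\mathbb{P}_j$.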
Lemma D.10 (Bretagnolle-Huber Inequality, Theorem 14.1 in [Lattimore and Szepesvári, 2020]).
Let and be probability measures on the same measurable space , and let be an arbitrary event. Then,
where is the complement of .
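For reference, the inequality (in our rendering of the standard statement) is
\[
P(A) + Q\big(A^{c}\big) \;\ge\; \frac{1}{2}\exp\!\big(-D_{\mathrm{KL}}(P \,\|\, Q)\big).
\]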
Lemma D.11 (Value Difference Lemma, Lemma E.15 in [Dann et al., 2017]).
For any two MDPs and with rewards and and transition probabilities and , the difference in values with respect to the same policy can be written as
where for any state and the expectation is taken with respect to and and with respect to and .
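A standard rendering of this identity (often called the simulation lemma) is given below in our own placeholder notation; in particular, which MDP the outer expectation and the inner value function refer to is an assumption and may be arranged differently in the paper's statement.
\[
V^{\pi}_{1,M}(s_1) - V^{\pi}_{1,M'}(s_1)
\;=\;
\mathbb{E}_{M',\pi}\left[\,\sum_{h=1}^{H}
\Big( r_h(s_h,a_h) - r'_h(s_h,a_h)
\;+\; \big(P_h(\cdot\mid s_h,a_h) - P'_h(\cdot\mid s_h,a_h)\big)^{\top} V^{\pi}_{h+1,M} \Big)\right].
\]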
Appendix E Details of the planning phase
In this section, we provide some details of the planning phase in algorithm 3. In the planning phase, we are given a reward function and we compute an estimate of the sparsified transition dynamics, which is formally defined in section A.1.1. The goal of the planning phase is to compute the optimal policy on the MDP specified by the transition dynamics and the reward function , where is the sparsified version of and for any . To compute the optimal policy, we iteratively apply the Bellman optimality equation. First, we define for any and solve
Then, for we iteratively define
for any , and then solve
for any . For and any , can be an arbitrary action. Then, from the properties of the Bellman optimality equation, we know that is the optimal policy on , and
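The backward induction described above can be summarized by the following sketch. The array shapes, the function name, and the convention that the absorbing state is included among the states with zero reward are illustrative assumptions, not the paper's exact implementation.
```python
import numpy as np

def plan_on_sparsified_mdp(P, r):
    """Finite-horizon backward induction (sketch).

    P: array of shape (H, S, A, S) with the estimated sparsified transition
       dynamics; the absorbing state is assumed to be one of the S states.
    r: array of shape (H, S, A) with the (augmented) reward, assumed zero
       at the absorbing state.
    Returns a deterministic non-reactive policy pi[h, s] and the values V.
    """
    H, S, A, _ = P.shape
    V = np.zeros((H + 1, S))            # terminal values: V_{H+1} = 0
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):      # stages h = H, ..., 1 (0-indexed here)
        Q = r[h] + P[h] @ V[h + 1]      # Q_h(s,a) = r_h(s,a) + <P_h(.|s,a), V_{h+1}>
        pi[h] = Q.argmax(axis=1)        # greedy action at each state
        V[h] = Q.max(axis=1)
    return pi, V
```
The returned policy is a fixed mapping from stage and state to an action, so it can be deployed as a single non-reactive policy; ties in the argmax are broken arbitrarily.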
Appendix F More related works
Other low-switching algorithms Low-switching learning algorithms were initially studied in the context of bandits, with the UCB2 algorithm [Auer et al., 2002] achieving an switching cost. Gao et al. [2019] demonstrated a sufficient and necessary switching cost for attaining the minimax optimal regret in multi-armed bandits. In both adversarial and stochastic online learning, [Cesa-Bianchi et al., 2013] designed an algorithm that achieves an switching cost.
Reward-free reinforcement learning In reward-free reinforcement learning (RFRL), the goal is to find a near-optimal policy for any given reward function. [Jin et al., 2020a] proposed an algorithm based on EULER [Zanette and Brunskill, 2019] that can find such a policy with trajectories. Subsequently, [Kaufmann et al., 2021] reduced the sample complexity by a factor of by using uncertainty functions to upper bound the value estimation error. The sample complexity was further improved by another factor of by [Ménard et al., 2021].
A lower bound of was established for homogeneous MDPs by [Jin et al., 2020a], while an additional factor is conjectured for non-homogeneous cases. [Zhang et al., 2021] proposed the first algorithm with matching sample complexity in the homogeneous case. Similar results are available with linear [Wang et al., 2020a, Wagenmaker et al., 2022, Zanette et al., 2020] and general function approximation [Chen et al., 2022, Qiu et al., 2021].
Offline reinforcement learning In offline reinforcement learning, the goal is to learn a near-optimal policy from an existing dataset which is generated from a (possibly very different) logging policy. Offline RL in tabular domains has been investigated extensively [Yin and Wang, 2020, Jin et al., 2020c, Nachum et al., 2019, Rashidinejad et al., 2021, Kallus and Uehara, 2022, Xie and Jiang, 2020a]. Similar results were shown using linear [Yin et al., , Xiong et al., 2022, Nguyen-Tang et al., 2022, Zanette et al., 2021b] and general function approximation [Xie et al., 2021a, Long et al., 2021, Zhang et al., 2022, Duan et al., 2021, Jiang and Huang, 2020, Uehara and Sun, 2021, Zanette and Wainwright, 2022, Rashidinejad et al., 2022, Yin et al., 2022]. Offline RL is effective when the dataset ‘covers’ a near-optimal policy, as measured by a certain concentrability factor. In the function approximation setting, additional conditions, such as Bellman completeness, may need to be approximately satisfied [Munos and Szepesvári, 2008, Chen and Jiang, 2019, Zanette, 2023, Wang et al., 2020b, Foster et al., 2021, Zhan et al., 2022].
Task-agnostic reinforcement learning Another related line of work is task-agnostic RL, where tasks are considered during the planning phase, and the reward functions are collected from the environment instead of being provided directly. [Zhang et al., 2020a] presented the first task-agnostic algorithm, UBEZero, with a sample complexity of . Recently, [Li et al., 2023] proposed an algorithm that leverages offline RL techniques to estimate a well-behaved behavior policy in the reward-agnostic phase, achieving minimax sample complexity. Other works exploring effective exploration schemes in RL include [Hazan et al., 2019, Du et al., 2019, Misra et al., 2020].