Provably Efficient Reinforcement Learning in
Partially Observable Dynamical Systems
Abstract
We study Reinforcement Learning for partially observable dynamical systems using function approximation. We propose a new Partially Observable Bilinear Actor-Critic framework that is general enough to include models such as observable tabular Partially Observable Markov Decision Processes (POMDPs), observable Linear-Quadratic-Gaussian (LQG), Predictive State Representations (PSRs), as well as a newly introduced model, Hilbert Space Embeddings of POMDPs, and observable POMDPs with latent low-rank transition. Under this framework, we propose an actor-critic style algorithm that is capable of performing agnostic policy learning. Given a policy class that consists of memory-based policies (which look at a fixed-length window of recent observations) and a value function class that consists of functions taking both memory and future observations as inputs, our algorithm learns to compete against the best memory-based policy in the given policy class. For certain examples such as undercomplete observable tabular POMDPs, observable LQGs, and observable POMDPs with latent low-rank transition, by implicitly leveraging their special properties, our algorithm is even capable of competing against the globally optimal policy without paying an exponential dependence on the horizon in its sample complexity.
1 Introduction
Large state spaces and partial observability are two key challenges of Reinforcement Learning (RL). While recent advances in RL for fully observable systems have focused on the challenge of scaling RL to large state spaces, in both theory and practice, using rich function approximation, the understanding of large-scale RL under partial observability is still limited. In POMDPs, for example, a core issue is that the optimal policy is not necessarily Markovian since the observations are not Markovian.
A common heuristic to tackle large-scale RL with partial observability in practice is to simply maintain a time window of the history of observations, which is treated as a state to feed into the policy and the value function. Such a window of history can often be maintained explicitly by truncating away older history (e.g., DQN uses a window of length 4 for playing video games [49]; OpenAI Five uses a window of length 16 for LSTMs [4]). Since, even for planning under partial observations and known dynamics, finding the globally optimal policy conditioned on the entire history is generally NP-hard (due to the curse of history) [44, 55, 24], searching for a short memory-based policy can be understood as a reasonable middle ground that balances computation and optimality. The impressive empirical results of these prior works also demonstrate that in practice, there often exists a high-quality policy (not necessarily the globally optimal one) that is only a function of a short window of recent observations. However, these prior works that search for the best memory-based policy unfortunately cannot ensure sample-efficient PAC guarantees due to the difficulty of strategic exploration in POMDPs. The key question that we aim to answer in this work is:
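As a concrete, purely illustrative sketch of this windowing heuristic, the following Python snippet wraps an arbitrary policy so that it acts on a fixed-length window of recent observation-action pairs; the callable `policy_fn` and the window length are placeholders and not part of the formal framework developed below.

```python
from collections import deque

class MemoryPolicy:
    """Wrap a policy that acts on a fixed-length window of recent
    observation-action pairs (an M-memory policy)."""

    def __init__(self, policy_fn, memory_len):
        # policy_fn maps (window, current observation) to an action; both the
        # function and the window length are placeholders for this sketch.
        self.policy_fn = policy_fn
        self.window = deque(maxlen=memory_len)

    def reset(self):
        self.window.clear()

    def act(self, observation):
        action = self.policy_fn(tuple(self.window), observation)
        self.window.append((observation, action))
        return action
```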
Can we design provably efficient RL algorithms that agnostically learn the best fixed-length memory based policy with function approximation?
We provide affirmative answers to the above question. More formally, we study RL for partially observable dynamical systems that include not only the classic Partially Observable MDPs (POMDPs) [52, 56, 64], but also a more general model called Predictive State Representations (PSRs) [46]. We design a model-free actor-critic framework, named PO-Bilinear Actor-Critic Class, where we have a policy class (i.e., actors) that consists of policies that take a fixed-length window of observations as input (memory-based policy), and a newly introduced value link function class (i.e., critics) that consists of functions that take the fixed-length window of history and (possibly multi-step if the system is overcomplete) future observations as inputs. A value link function class is an analog of the value function class tailored to partially observable systems that only relies on observable quantities (i.e., past and future observations and actions). In our algorithm, we agnostically search for the best memory-based policy from the given policy class.
Our framework is based on a newly introduced notion of value link functions equipped with future observations. While the idea of using future observations has appeared in the literature on POMDPs, our work is the first to use this idea to learn a high-quality policy in a model-free manner. Existing works discuss how to use future observations only in a model-based manner [8, 25]. While recent works that leverage these model-based viewpoints discuss strategic exploration to learn near-optimal policies, their results are either limited to the tabular setting (and are not scalable to large state spaces) [36, 22, 3, 74, 43] or are tailored to specific non-tabular models where it is unclear how to incorporate general function approximation [65, 42, 12]. We break these barriers by devising a new actor-critic-based, model-free view on POMDPs. We demonstrate the scalability and generality of our PO-bilinear actor-critic framework by showing PAC guarantees for many models as follows (see Table 1 for a summary).
Observable Tabular POMDPs. In tabular observable POMDPs, i.e., POMDPs where multi-step future observations retain information about latent states, the PO-bilinear rank decomposition holds. We can ensure the sample complexity is , where ( is an emission matrix), , and are the cardinalities of the state, action, and observation spaces, respectively, is the horizon, and is the number of future observations. (In Section 8, we discuss how to get rid of using a model-based learning perspective. The intuition is that a tabular POMDP's model complexity has nothing to do with or , i.e., the number of parameters in the transition and emission distributions is (even if we consider the time-inhomogeneous setting, it scales with , but not with or ), and the PO-bilinear rank is still .) In the special undercomplete () case, our framework is also flexible enough to set the memory length according to the properties of the problem in order to search for the globally optimal policy. More specifically, using the latest result from [24] about belief contraction, we can set with being the optimality threshold. This allows us to compete against the globally optimal policy without paying an exponential dependence on .
Observable Linear Quadratic Gaussian (LQG). In observable LQG, a classic partially observable linear dynamical system, our algorithm can compete against the globally optimal policy with a sample complexity scaling polynomially with respect to the horizon, the dimensions of the state, observation, and action spaces (and other system parameters). This is achieved by simply setting the memory length to . The special linear structure of the problem allows us to avoid an exponential dependence on even when using the full history as memory. While global optimality results for tabular POMDPs and for LQG exist via different algorithms, to the best of our knowledge, this is the first unified algorithm that can solve both tabular POMDPs and LQG simultaneously without paying an exponential dependence on the horizon .
Observable Hilbert Space Embedding POMDPs (HSE-POMDPs). Our framework ensures an agnostic PAC guarantee for HSE-POMDPs, where policy-induced transitions and emission distributions have conditional mean embeddings [8, 6]. This model naturally generalizes tabular POMDPs and LQG. We show that the sample complexity scales polynomially with respect to the dimensions of the embeddings. This is the first PAC guarantee for HSE-POMDPs.
Predictive State Representations (PSRs). We give the first PAC guarantee for PSRs. PSRs model partially observable dynamical systems without even using the concept of latent states and strictly generalize the POMDP model. Our work significantly generalizes a prior PAC learning result for reactive PSRs, a much more restricted setting that requires the strong condition that the optimal policy depends only on the latest observation [33].
-step decodable POMDPs [20]. Our framework can capture -step decodable POMDPs, where there is an (unknown) decoder that can perfectly decode the latent state by looking at the latest -memory. Our algorithm can compete against the globally optimal policy with a sample complexity scaling polynomially with respect to the horizon , , and the statistical complexities of the function classes, without any explicit dependence on . This PAC result is similar to the one from [20].
Observable POMDPs with low-rank latent transition. Our framework captures observable POMDPs where the latent transition is low-rank. This is the first PAC guarantee for this model. Under this model, we first show that with , where is the rank of the latent transition matrix, there exists an -memory policy that is -near optimal with respect to the globally optimal policy. Then, starting with a general model class that contains the ground truth transition and emission distribution (i.e., realizability in the model class), we first convert the model class to a policy class and a value link function class, and we then show that our algorithm competes against the globally optimal policy with a sample complexity scaling polynomially with respect to , and the statistical complexity of the model class. In particular, the sample complexity has no explicit dependence on the size of the state and observation spaces; instead, it depends only on the statistical complexity of the given model class.
[Table 1: Summary of the models captured by the PO-bilinear actor-critic framework, together with their PO-bilinear ranks. PAC learning results were previously known for three of the models and are new to this work for the other three.]
1.1 Related Works
Generalization and function approximation of RL in MDPs.
In Markovian environments, there is a growing literature that gives PAC bounds with function approximation under certain models. Some representative models are linear MDPs [36, 76], block MDPs [16, 48, 78], and low-rank MDPs [2, 71]. Several general frameworks in [33, 61, 35, 21, 17] characterize sufficient conditions for provably efficient RL, and each of the above models is captured by these frameworks as a special case. While our work builds on the bilinear/Bellman rank framework [17, 33], when we naïvely reduce POMDPs to MDPs, the bilinear/Bellman rank is . These two frameworks are only shown to be applicable to reactive POMDPs, where the optimal policy only depends on the latest observation. However, this assumption makes the POMDP model very restricted.
Online RL for POMDPs.
Prior works [39, 19] showed -type sample complexity bounds for general POMDPs. The exponential dependence can be circumvented with more structure. First, in the tabular setting, under observability assumptions, [3, 22, 34, 43, 23] obtain favorable sample complexities by leveraging the spectral learning technique [29] (see Section 1.1 in [34] for an excellent summary). Second, for LQG, which is a partially observable version of LQR, sub-linear regret algorithms are proposed in [42, 65]. These works use random policies for exploration, which is sufficient for LQG. Since a random exploration strategy is not enough for tabular POMDPs, it is unclear whether the existing techniques from LQG can be applied to solve general POMDPs. Third, the recent work [20] provides a new model called the -step decodable POMDP (when , it reduces to a Block MDP) together with an efficient algorithm.
Our framework captures all of the above-mentioned POMDP models. In addition, we propose a new model called HSE-POMDPs, which extends prior work on HSE-HMMs [6] to POMDPs and includes LQG and tabular POMDPs. Our algorithm delivers the first PAC bound for this model.
System identification for uncontrolled partially observable systems.
There is a long line of work on system identification for uncontrolled partially observable systems, among which the spectral learning based methods are related to our work [72, 29, 58, 8, 28, 54, 6, 38, 25, 67]. Informally, these methods leverage the high-level idea that under some observability conditions, one can use the sufficient statistics of (possibly multi-step) future observations as a surrogate for the belief states, thus allowing the learning algorithms to ignore the latent state inference and completely rely on observable quantities. Our approach shares a similar spirit in the sense that we use sufficient statistics of future observations to replace latent states, and our algorithm only relies on observable quantities. The major difference is that these prior works only focus on passive system identification for uncontrolled systems, while we need to find a high-performance policy by actively interacting with the systems for information acquisition.
Reinforcement learning in PSRs.
PSRs [32, 46, 62, 8, 68] are models that generalize POMDPs. PSRs also rely on the idea of using the sufficient statistics of multi-step future observations (i.e., predictive states) to serve as a summary of the history. Prior works on RL for PSRs [8, 38, 14, 45, 30] do not address the problem of strategic exploration and operate under the assumption that a pre-collected diverse training dataset is given and that the data collection policy is a blind policy (i.e., it does not depend on the history of observations). To our knowledge, the only existing PAC learning algorithm for PSRs is limited to a reactive PSR model where the optimal policy depends only on the latest observation [33]. Our framework captures standard PSR models that are strictly more general than reactive PSRs.
Value link functions.
Analogues of value link functions (referred to as bridge functions) are used in the literature on causal inference (offline contextual bandits) [50, 13, 11, 40, 53, 60, 75] and offline RL with unmeasured confounders [7, 66]. However, their settings are not standard POMDPs in the sense that they consider POMDPs with unmeasured confounders following [69]. Our setting is a standard POMDP without unmeasured confounders. Here, we emphasize that their setting does not capture ours. More specifically, taking [66] as an example, they require that the logged data is generated by policies that can depend on latent states but cannot depend on observable states. Thus, their definition of link functions (called bridge functions) is not applicable to our setting, since the data we use is clearly generated by policies that depend on observations. Due to this difference, their setting prohibits using future observations, unlike ours. Finally, we stress that our work is online, while their setting is offline. Hence, they do not discuss any methods for exploration.
1.2 Organization
In Section 2, we introduce the notation, the definition of POMDPs, and our function-approximation setup, namely the policy and value link function classes. In Section 3, we define value link functions and the PO-bilinear actor-critic class. In Section 4, we give examples that admit the PO-bilinear actor-critic class, including observable undercomplete tabular POMDPs, observable overcomplete tabular POMDPs, observable LQG, and observable HSE-POMDPs. In Section 5, we give a unified algorithm for the PO-bilinear actor-critic class and the sample complexity of the algorithm. We also instantiate this general result for the examples presented in Section 4. In Section 6, we show that PSRs, which are more general models than POMDPs, also admit a PO-bilinear rank decomposition. In Section 7, we give a more general definition of the PO-bilinear actor-critic class, followed by showing that two additional examples, -step decodable POMDPs and observable POMDPs with low-rank latent transition, fall into this general definition (Section 8). Both examples use general nonlinear function approximation, and their sample complexities do not explicitly depend on the size of the state and observation spaces, but only on the statistical complexities of the function classes. As a by-product, we can refine the sample complexity result in the tabular case in Section 5. Most of the proofs are deferred to the Appendix.
2 Preliminary
We introduce background for POMDPs here and defer the introduction of PSRs to Section 6. We consider an episodic POMDP specified by , where is the unobserved state space, is the observation space, is the action space, is the horizon, is the transition probability, is the emission probability, and is the reward. Here, are unknown distributions. For notational simplicity, we consider the time-homogeneous case in this paper; the extension to the time-inhomogeneous setting is straightforward.
In our work, we consider -memory policies. Let and . An element is represented as , and an element is represented as (thus, ). Figure 1 illustrates this situation. An -memory policy is defined as where each is a mapping from to a distribution over actions .
In a POMDP, an -memory policy generates data as follows. Each episode starts with an initial state sampled from some unknown distribution. At each step , from , the agent observes , executes action , receives reward , and transitions to the next latent state . Note that the agent does not observe the underlying states but only the observations . We denote by the value of the policy , i.e., , where the expectation is taken w.r.t. the stochasticity of the policy , the emission distribution , and the transition dynamics .
We define a value function for a policy at step to be the expected cumulative reward to go under the policy starting from a and , i.e. where . The notation means the expectation is taken under a policy from to . Compared to the standard MDP setting, the expectation is conditional on not only but also since we consider -memory policies. The corresponding Bellman equation for is .
The Actor-critic function approximation setup.
Our goal is to find a near-optimal policy that maximizes the policy value in an online manner. Since any POMDP can be converted into an MDP by setting the state at level to the observable history up to , any off-the-shelf provably efficient online algorithm for MDPs can be applied to POMDPs. By defining as the whole history up to step (i.e., a history is of the form ), these naïve algorithms ensure that the output policies can compete against the globally optimal policy , where . However, this conversion results in an error with exponential dependence on the horizon , which is prohibitively large in the long-horizon setting.
Instead of directly competing against the globally optimal policy, we aim for agnostic policy learning, i.e., competing against the best policy in a given -memory policy class. Our function approximation setup consists of two function classes: a policy class consisting of -memory policies where (i.e., actors), and a set of value link functions where , whose role is to approximate (i.e., critics). Our goal is to provide an algorithm that outputs a policy with low excess risk, where the excess risk is defined by , where is the best policy in the class . To motivate this agnostic setting, we note that -memory policies are widely used in practice, e.g., DQN [49] sets . Besides, there are natural examples where -memory policies are close to the globally optimal policy with being only polynomial with respect to other problem-dependent parameters, e.g., observable POMDPs [24] and LQG [42, 65, 51]. We will show the global optimality in these two examples later, without any exponential dependence on in the sample complexity.
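For concreteness, writing $J(\pi)$ for the value of a policy $\pi$ (notation introduced here only for illustration), the agnostic guarantee we aim for is

$$ J(\pi^\star_{\Pi}) - J(\hat{\pi}) \;\le\; \epsilon, \qquad \pi^\star_{\Pi} \in \operatorname*{argmax}_{\pi \in \Pi} J(\pi), $$

i.e., the excess risk of the learned policy $\hat{\pi}$ relative to the best $M$-memory policy in the given class is small.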
Remark 1 (Limits of existing MDP actor-critic framework).
While the general actor-critic framework proposed for MDPs [33] is applicable to POMDPs via the naïve POMDP-to-MDP reduction, it is unable to leverage any benefit from the restricted policy class. This naïve reduction (from POMDP to MDP) uses the full history and will incur a sample complexity that scales exponentially with respect to the horizon.
Additional notation. Let and . Given a matrix , we denote its pseudo-inverse by and its operator norm by . We define the norm . The outer product is denoted by . Let be the marginal distribution at and be the Dirac delta function. We denote the policy by . We denote a uniform action by . Given a function class , we define .
3 Value Link Functions and the PO-bilinear Framework
Unlike MDPs, we cannot directly work with value functions (or Q functions) in POMDPs, since they depend on the unobserved state . To handle this issue, below we first introduce new value link functions by using future observations, and then discuss the PO-bilinear framework.
3.1 Value Link Functions
Definition 1 (K-step value link functions).
Fix a set of policies where . Value link functions at step for a policy are defined as the solution to the following integral equation:
where the expectation is taken under the policy .
Link functions do not necessarily exist, nor do they need to be unique. At an intuitive level, K-step value link functions are embeddings of the value functions onto the observation space, and their existence essentially means that K-step futures carry sufficient information to recover the latent-state-dependent value function. The proper choice of depends on the underlying model. For example, we use the uniform policy in the tabular case, and in LQG. For notational simplicity, we mostly focus on the case of , though we will also discuss the general case of . The simplified definition for 1-step link functions is provided below. Note that this definition is agnostic to .
Definition 2 (1-step value link functions).
One-step value link functions at step for a policy are defined as the solution to the following integral equation:
(1)
In Section 4, we will demonstrate the form of the value link function for various examples. The idea of encoding latent state information using the statistics of (multi-step) futures has been widely used in learning models of HMMs [58, 29], PSRs [8, 6, 25, 67], and system identification for linear systems [72]. Existing provably efficient (online) RL works for POMDPs elaborate on this viewpoint [36, 22, 3]. Compared to them, the novelty of link functions is that they are introduced to recover value functions rather than models. This model-free view differs from the dominant model-based view in existing online RL for POMDPs. In our setup, we can control systems if we can recover value functions on the underlying states, even if we fail to identify the underlying model.
3.2 The PO-Bilinear Actor-critic Framework for POMDPs
With the definition of value link functions, we are now ready to introduce the PO-bilinear actor-critic (AC) class for POMDPs. We will focus on the case of here. Let , where , be a class consisting of functions that satisfy the following realizability assumption w.r.t. the policy class .
Assumption 1 (Realizability).
We assume that is realizable w.r.t. the policy class , i.e., , there exists at least one such that is a value link function w.r.t. the policy . Note that realizability implicitly requires the existence of link functions.
We next introduce the PO-Bilinear Actor-critic class. For each level , we first define the Bellman loss:
given M-memory policies and . Letting be a link function for , our key observation is that value link functions satisfy
for any M-memory roll-in policy and any evaluation pair . This is an analog of the Bellman equation in MDPs. The above equation tells us that is a valid loss to quantify how far the estimator is from . When has low-rank structure in a suitable sense, we can efficiently learn a near-optimal M-memory policy. The following definition precisely quantifies the low-rank structure that we need for sample-efficient learning.
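To make the discussion concrete, one plausible rendering of this Bellman-equation analog (the notation below, with $z_{h-1}$ for the length-$M$ memory at step $h$, $g_h$ for the candidate link function, and $\pi'$ for the roll-in policy, is introduced here only as an illustration) is

$$\mathrm{Br}_h(\pi, g; \pi') \;:=\; \mathbb{E}\Big[\, g_h(z_{h-1}, o_h) - r_h - g_{h+1}(z_h, o_{h+1}) \;\Big|\; a_{1:h-1} \sim \pi',\ a_h \sim \pi \,\Big],$$

which equals zero whenever $g$ is a value link function for $\pi$, regardless of the roll-in policy $\pi'$.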
Definition 3 (PO-bilinear AC Class, ).
The model is a PO-bilinear Actor-critic class of rank if is realizable, and there exist and such that for all and ,
1. ;
2. for any and the corresponding value link function .
We define as the PO-bilinear rank.
While the above definition is enough to capture most of the examples we discuss later in this work, including undercomplete tabular POMDPs, LQG, HSE-POMDPs, we provide two useful extensions. The first extension incorporates discriminators into the framework, which can be used to capture the M-step decodable POMDPs and POMDPs with low-rank latent transition (see Section 7). The second extension incorporates multi-step futures, which can be used to capture overcomplete POMDPs and general PSRs. In the next section, we introduce the multi-step future version.
3.3 PO-bilinear Actor-critic Class with Multi Step Future
In this section, we provide an extension to Definition 3 to incorporate multiple-step futures (i.e., ). For simplicity, we assume that .
The definition is then as follows. The main difference is that we roll out a policy , times to incorporate multi-step link functions. We introduce the notation
Then, combining the Bellman equation for state-value functions and the definition of K-step link functions, we have
Thus, by taking expectations further with respect to (i.e., can be sampled from some roll-in policy), we have
Hence, the Bellman loss of a pair under a roll-in denoted by at is defined as
The above is a proper loss function when we use multi-step futures. Here is the structure we need for .
Definition 4 (PO-bilinear AC Class for POMDPs with multi-step future).
The model is a PO-bilinear class of rank if is realizable (regarding general K-step link functions), and there exist and such that for all and ,
1. We have: ;
2. for any and the corresponding value link function in .
We define as the PO-bilinear rank.
4 Examples of PO-Bilinear Actor-critic Classes
We consider three examples (observable tabular POMDPs, LQG, and HSE-POMDPs) that admit a PO-bilinear rank decomposition. Our framework can also capture PSRs and -step decodable POMDPs, whose discussions are deferred to Section 6 and Section 8, respectively. We mainly focus on one-step futures, i.e., , and briefly discuss the extension to in the tabular case. In this section, except for LQG, we assume for any . All the missing proofs are deferred to Section B.
4.1 Observable Undercomplete Tabular POMDPs
Example 1 (Observable undercomplete tabular POMDPs).
Let where the entry indexed by a pair is defined as . Assume that , which we call observability. This requires undercompleteness, .
The following lemma shows that being full rank implies the existence of value link functions.
Lemma 1.
For Example 1, there exists a one-step value link function for any and .
Proof.
Consider any function (thus, this captures all possible ). Denote as the one-hot encoding of over (similarly for ). We have , where we use the assumption that and thus . Then,
(2)
which means that the value link function corresponding to is . ∎
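As a numerical illustration of this construction (a minimal sketch with a randomly generated emission matrix; the sizes and the value vector are arbitrary), the following Python snippet builds the link function via the pseudo-inverse of the emission matrix and checks the defining equation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_obs = 3, 5                  # undercomplete: |S| <= |O| (sizes arbitrary)

# Emission matrix O with O[o, s] = P(o | s); columns sum to one.
O = rng.random((n_obs, n_states))
O /= O.sum(axis=0, keepdims=True)
assert np.linalg.matrix_rank(O) == n_states   # observability: full column rank

# A value vector v(s) = V(z, s) for some fixed memory z (here: random).
v = rng.random(n_states)

# Link function g(o) = ((O^+)^T v)(o), as in the proof of Lemma 1.
g = np.linalg.pinv(O).T @ v

# Defining equation: E[g(o) | s] = sum_o P(o | s) g(o) = v(s) for every s.
assert np.allclose(O.T @ g, v)
print("E[g(o) | s]:", np.round(O.T @ g, 4), " v:", np.round(v, 4))
```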
We next show that the PO-Bilinear rank (Definition 3) of tabular POMDPs is bounded by .
Lemma 2.
Assume is full column rank. Set the value link function class for certain , and the policy class . Then, the model is a PO-bilinear AC class (Definition 3) with PO-bilinear rank at most .
Later, in Section 7, we will see that the PO-bilinear rank in the more general definition is just . This fact will result in a significant improvement in terms of the sample complexity, yielding a bound that does not incur .
4.2 Observable Overcomplete Tabular POMDPs
We consider overcomplete POMDPs with multi-step futures. The proofs are deferred to Section B.2. We have the following lemma, which is a generalization of Lemma 1: when , it reduces to Lemma 1.
Lemma 3.
Define a -dimensional matrix whose entry indexed by and is equal to . When this matrix is full-column rank, K-step link functions with respect to exist.
Note that a sufficient condition for the above is that the matrix whose entry indexed by and is equal to is full column rank for a certain . It says that there is an (unknown) action sequence of length that retains information about the latent states.
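As an illustration of this condition (a sketch only; the instance below is a small randomly generated overcomplete POMDP, and whether full column rank holds is instance-dependent), the following snippet builds the matrix of two-step future probabilities, indexed by an action and two observations versus the latent state, and checks its column rank:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
nS, nO, nA = 4, 2, 2           # overcomplete (|O| < |S|), two-step futures; sizes arbitrary

# Random tabular POMDP: T[a][s_next, s] = P(s_next | s, a), Ob[o, s] = P(o | s).
T = rng.random((nA, nS, nS)); T /= T.sum(axis=1, keepdims=True)
Ob = rng.random((nO, nS));    Ob /= Ob.sum(axis=0, keepdims=True)

# Rows indexed by the future (a_h, o_h, o_{h+1}); columns indexed by the state s_h.
rows = []
for a, o1, o2 in product(range(nA), range(nO), range(nO)):
    # P(o_h = o1, o_{h+1} = o2 | s_h; do(a_h = a))
    rows.append(Ob[o1, :] * (Ob[o2, :] @ T[a]))
M = np.stack(rows)             # shape (|A| * |O|^2, |S|)

# Full column rank of this matrix is the condition in Lemma 3 for K = 2.
print("rank:", np.linalg.matrix_rank(M), "out of", nS)
```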
We next calculate the PO-bilinear rank. Importantly, this does not depend on and .
Lemma 4.
Set a value link function class for certain and a policy class . Then, the model satisfies PO-bilinear rank condition with PO-bilinear rank (Definition 4) at most .
Note that the bilinear rank is still (just in the more general definition in Section 7). Crucially, it does not depend on the length of futures .
4.3 Observable Linear Quadratic Gaussian
The next example is Linear Quadratic Gaussian (LQG) with continuous state and action spaces. The details are deferred to Section J. Here, we set so that the policy class contains the globally optimal policy.
Example 2 (Linear Quadratic Gaussian (LQG)).
Consider LQG:
where are Gaussian distributions with mean and variances and , respectively, and , and , and are positive definite matrices.
We define the policy class as the linear policy class , where is the dimension of . This choice is natural since the globally optimal policy is known to be linear with respect to the entire history [5, Chapter 4]. We define two quadratic features, with , and with . We have the following lemma.
Lemma 5 (PO-bilinear rank of observable LQG).
Assume . Then, the following holds:
• For any policy linear in , a one-step value link function exists, and is linear in .
• Letting be the dimension of , we set and to be linear in . Then LQG satisfies Definition 3 with PO-bilinear rank at most .
We have two remarks. First, when , K-step link functions exist when is full row rank. This assumption is referred to as observability in control theory [27]. Second, the PO-bilinear rank scales polynomially with respect to even with . As we show in Section J, due to this fact, we can compete against the globally optimal policy with polynomial sample complexity.
4.4 Observable Hilbert Space Embedding POMDPs
We consider HSE-POMDPs that generalize tabular POMDPs and LQG. Proofs here are deferred to Section B.4. Consider any . Given a policy , we define the induced transition operator as , where we have . Namely, is the transition kernel of some Markov chain induced by the policy . The HSE-POMDP assumes two conditional distributions and have conditional mean embeddings.
Example 3 (HSE-POMDPs).
We introduce features . We assume the existence of the conditional mean embedding operators: (1) there exists a matrix such that for all , and (2) for all , there exists a matrix , such that .
The existence of conditional mean embeddings is a common assumption in prior RL works on learning the dynamics of HMMs and PSRs [59, 6] and in Bellman-complete linear MDPs [77, 15, 10, 26]. HSE-POMDPs naturally capture tabular POMDPs and LQG. For tabular POMDPs, and are one-hot encoding features. In LQG, and are the quadratic features defined in Section 4.3. Here, for simplicity, we focus on finite-dimensional features and . The extension to infinite-dimensional reproducing kernel Hilbert spaces is deferred to Section B.4.
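For intuition, a conditional mean embedding operator of this kind can be estimated from samples by (ridge) regression of the outcome features onto the conditioning features. The sketch below uses synthetic data and hypothetical features purely to illustrate the shape of the object; it is not the estimator used by our algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_out, n = 5, 3, 2000     # feature dimensions and sample size (arbitrary)

# Hypothetical samples: conditioning features Psi (e.g. of the latent state) and
# outcome features Phi (e.g. of the observation), with E[phi | psi] = K_true psi.
Psi = rng.normal(size=(n, d_in))
K_true = rng.normal(size=(d_out, d_in))
Phi = Psi @ K_true.T + 0.1 * rng.normal(size=(n, d_out))

# Ridge-regression estimate of the conditional mean embedding operator.
lam = 1e-3
K_hat = np.linalg.solve(Psi.T @ Psi + lam * np.eye(d_in), Psi.T @ Phi).T

print("relative error:", np.linalg.norm(K_hat - K_true) / np.linalg.norm(K_true))
```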
The following shows the existence of value link functions and the PO-bilinear rank decomposition.
Lemma 6 (PO-bilinear rank of observable HSE-POMDPs).
Assume is full column rank (observability), and is linear in for any . Then the following holds.
• A one-step value link function exists for any , and is linear in .
• We set a value function class and a policy class . Then HSE-POMDP satisfies Definition 3 with PO-bilinear rank at most .
The first statement can be verified by noting that when , value link functions take the following form where we leverage the existence of the conditional mean embedding operator , and that is full column rank (thus ). Note that the PO-bilinear rank depends only on the dimension of the features without any explicit dependence on the length of memory.
5 Algorithm and Complexity
In this section, we first give our algorithm followed by a general sample complexity analysis. We then instantiate our analysis to specific models considered in Section 4.
5.1 Algorithm
We first focus on the case where models satisfy the PO-bilinear AC definition (i.e., Definition 3) with finite actions and one-step link functions. We discuss the extension to continuous actions in Remark 2 and to multi-step link functions at the end of this subsection.
We present our algorithm Provable in Algorithm 1. Note that Provable is agnostic to the form of and . Inside iteration , given the latest learned policy , we define the Bellman error for all pairs , where the Bellman error is averaged over the samples from . Here, to evaluate the Bellman loss for any policy , we use importance sampling by running rather than executing a policy , so that we can reuse samples. (This choice might limit the algorithm to the case where is discrete. However, for examples such as LQG, we show that we can replace by a G-optimal design over the quadratic polynomial feature of the actions.) A pair that has a small total Bellman error intuitively means that, given the data so far, could still be a value link function for the policy . Then, in the constrained optimization formulation, we only focus on pairs whose Bellman errors are small so far. Among these pairs, we select the pair using the principle of optimism in the face of uncertainty. We remark that the algorithm leverages some design choices from the Bilinear-UCB algorithm for MDPs [17]. The key difference between our algorithm and Bilinear-UCB is that we leverage the actor-critic framework equipped with value link functions to handle partial observability and agnostic learning.
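The following Python sketch mirrors the structure of Provable as described above (data collection with the current policy and a uniform action at the evaluated step, importance-weighted Bellman-error constraints accumulated over iterations, and an optimistic selection among feasible pairs). The helper callables and the exact form of the constraint are placeholders and differ from the formal pseudocode of Algorithm 1.

```python
import itertools

def provable(policy_class, link_class, horizon, n_iters, n_samples, threshold,
             collect_transition, bellman_residual, optimistic_value):
    """Structural sketch of Provable's constrained-optimism loop.

    All three callables are placeholders, not the paper's pseudocode:
      collect_transition(pi, h): roll in with pi to step h, take a uniformly
          random action there, and return the data needed for the Bellman loss.
      bellman_residual(pi, g, datum): importance-weighted (by pi(a|z) * |A|)
          Bellman error of the pair (pi, g) on a single datum.
      optimistic_value(pi, g): estimate of pi's value implied by the candidate
          link function g at the initial time step.
    """
    pi_hat, g_hat = policy_class[0], link_class[0]
    batches = []                                   # data collected so far
    for _ in range(n_iters):
        batches.append([[collect_transition(pi_hat, h) for _ in range(n_samples)]
                        for h in range(horizon)])
        # Keep only pairs whose estimated Bellman errors are small on every
        # batch collected so far (the constraint set).
        feasible = [(pi, g) for pi, g in itertools.product(policy_class, link_class)
                    if all(abs(sum(bellman_residual(pi, g, d) for d in batch[h])
                               / n_samples) <= threshold
                           for batch in batches for h in range(horizon))]
        # Optimism in the face of uncertainty: pick the feasible pair with the
        # largest estimated value (the realizable pair keeps this set non-empty
        # with high probability).
        pi_hat, g_hat = max(feasible, key=lambda pair: optimistic_value(*pair))
    return pi_hat
```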
With multi-step link functions.
Finally, we consider the case with multi-step futures in Algorithm 2 when . Recall the notation . The only difference is in the process of data collection. In particular, at every iteration , we roll in using up to (and including) time step , and we then roll out by switching to for steps.
Remark 2 (Continuous control).
The algorithms so far implicitly assume that the action space is finite. However, we can consider LQG, which has continuous actions. By employing a G-optimal design over actions, our algorithm can handle continuous actions. The discussion is deferred to Section C.
5.2 Sample Complexity
We show a sample complexity result by using reduction to supervised learning analysis. We begin by stating the following assumption which is ensured by standard uniform convergence results.
Assumption 2 (Uniform Convergence).
Fix . Let be a set of i.i.d. tuples following . With probability ,
For , we also require
Remark 3 (Finite function classes).
The term depends on the statistical complexities of the function classes and . As a simple example, we consider the case where and are discrete. In this case, we have , and , which are standard statistical complexities for discrete function classes and . Achieving this result simply requires standard concentration and a union bound over all functions in .
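To spell out the calculation behind Remark 3, assume the per-sample Bellman loss is bounded in $[-C, C]$ (a normalization introduced here for illustration). Hoeffding's inequality plus a union bound over all pairs in $\Pi \times \mathcal{G}$ gives, with probability at least $1-\delta$, simultaneously for all $(\pi, g)$,

$$\left| \frac{1}{m}\sum_{i=1}^{m} \ell_i(\pi, g) \;-\; \mathbb{E}\big[\ell(\pi, g)\big] \right| \;\le\; C \sqrt{\frac{2\,\ln\!\big(2\,|\Pi|\,|\mathcal{G}|/\delta\big)}{m}},$$

which is the $\sqrt{\ln(|\Pi|\,|\mathcal{G}|/\delta)/m}$ scaling quoted above.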
Under Assumption 2, when the model is PO-bilinear with rank , we get the following.
Theorem 1 (PAC guarantee of Provable).
Suppose we have a PO-bilinear AC class with rank . Suppose Assumption 2, and for any
By setting
where
With probability at least , letting , we have
The total number of samples used in the algorithm is .
Informally, when , to achieve -near optimality, the above theorem indicates that we just need to set , which results in a sample complexity scaling as (since only scales as ). We give detailed derivations and examples in the next section.
5.3 Examples
Hereafter, we show the sample complexity result by using Theorem 1. For complete results, refer to Section F–K. The result of -step decodable POMDPs and observable low-rank POMDPs are deferred to Section M.
5.3.1 Finite Sample Classes
We first consider the case where the hypothesis class is finite when the class admits PO-bilinear rank decomposition.
Example 4 (Finite Sample Classes).
Consider the case when and are finite and the PO-bilinear rank assumption is satisfied. When and are infinite hypothesis classes, and are replaced with their -covering numbers, respectively.
Theorem 2 (Sample complexity for discrete and (informal)).
Suppose for any and that the PO-bilinear rank assumption holds with PO-bilinear rank . By letting , with probability , we can achieve when we use samples
Here, are omitted.
5.3.2 Observable Undercomplete Tabular POMDPs
We start with tabular POMDPs. The details here are deferred to Section H.
Example 5 (continues=ex:under_tabular).
In tabular models, recall the PO-bilinear rank is at most . We suppose for any . Assuming is full-column rank, to satisfy the realizability, we set where and are one-hot encoding vectors over and , respectively. We set . Then, the following holds.
Theorem 3 (Sample complexity for undercomplete tabular models (Informal)).
With probability , we can achieve when we use samples at most
Here, are omitted.
First, while the above error incurs , we will later see in Section 8.2.2 that, when we use the more general definition of the PO-bilinear AC class and combine it with a model-based perspective, we can remove from the error bound. The intuition here is that the statistical complexity still scales with and does not incur . At the same time, although the PO-bilinear rank currently scales with , we can show that it is just under the more refined definition. Second, can be replaced with other analogous conditions . Here, note . The reason why we use the -norm is to invoke the result of [24] and achieve near global optimality, as in the next paragraph.
Near global optimality.
Finally, we consider the PAC guarantee against the globally optimal policy. As shown in [24], it is enough to set to compete with the globally optimal policy . Thus we achieve a quasi-polynomial sample complexity when competing against .
Theorem 4 (Sample complexity for undercomplete tabular models (Informal) — competing against ).
With probability , we can achieve when we use samples at most
5.3.3 Observable Tabular Overcomplete POMDPs
We consider overcomplete tabular POMDPs. In this case, the PO-bilinear rank is at most . We suppose for any . Assuming is full column rank, to satisfy realizability, we set , where and are one-hot encoding vectors over and , respectively. We set . Then, the following holds.
Theorem 5 (Sample complexity for overcomplete tabular models).
With probability , we can achieve
when we use samples at most
Here, are omitted.
When we use K-step futures, in the above theorem we additionally incur , which comes from a naïve parameterization of . In Section 8.2.3, we will see that under the model-based learning perspective (i.e., we parameterize first and then construct and using the model class), we get rid of the dependence on and . This is because the complexity of the model class is independent of and (i.e., the number of parameters in is ).
5.3.4 Observable LQG
Now let us revisit LQG. The details here are deferred to Section J. We show that Provable can compete against the globally optimal policy with polynomial sample complexity.
Example 6 (continues=ex:lqqs).
In LQG, by setting , we achieve a polynomial sample complexity when competing against the globally optimal policy .
Theorem 6 (Sample complexity for LQG (informal) – competing against ).
Consider a linear policy class , and assume and that all policies induce a stable system (we formalize this in Section J). With probability , we can achieve when we use samples at most
5.3.5 Observable HSE-POMDPs
Next, we study HSE-POMDPs. The details here are deferred to Section G.
Example 7 (continues=ex:linear).
In HSE-POMDPs, the PO-bilinear rank is at most . Suppose and such that for any . Then, to satisfy realizability, we set where .
Theorem 7 (Sample complexity for HSE-POMDPs (Informal)).
Let . Suppose lies in for any . Then, with probability , we can achieve when we use samples
Here, are omitted and .
Note that the sample complexity above does not explicitly depend on the memory length ; instead, it only explicitly depends on the dimension of the features . In other words, if we have a feature mapping that can map the entire history (i.e., ) to a low-dimensional vector (e.g., LQG), our algorithm can immediately compete against the globally optimal policy .
6 Predictive State Representations
In this section, we demonstrate that our definition and algorithm apply to PSRs, models that strictly generalize POMDPs [46, 62]. Below, we first briefly introduce PSRs, and then show that they form a PO-bilinear AC model. Throughout this section, we focus on discrete linear PSRs. We also suppose the reward at is a deterministic function of conditional on , where . Given , the dynamical system generates . Here we use the superscript on to emphasize that it ends with the action .
PSRs use the concept of a test, which is a sequence of future observations and actions. For a test of length , we define the probability of the test being successful as , which is the probability of observing by actively executing actions , conditioned on history .
We now explain one-step observable PSRs while deferring the general multi-step observable setting to Section D. A one-step observable PSR uses the observations in as tests, i.e., tests with length 1.
Definition 5 (Core test set and linear PSRs).
A core test set contains a finite number of tests (i.e., observations from ). For any , any history , and any future test for any , there exists a vector such that the probability that succeeds conditioned on can be expressed as , where we denote by the vector in with entries equal to for . The vector is called the predictive state.
A core test set that has the smallest number of tests is called a minimum core test set, denoted . PSRs are strictly more expressive than POMDPs in that every POMDP can be embedded into a PSR whose minimum core test set has size at most ; the converse does not hold [46]. For example, in observable undercomplete POMDPs (i.e., full column rank ), the observation set can serve as a core test set, but the minimum core test set will have size . Here, we assume we know a core test set that contains ; however, we are agnostic to which set is the actual . In the literature on PSRs, this setting is often referred to as transform PSRs [8, 57].
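To see concretely why the observation set can serve as a core test set in an observable undercomplete POMDP, the sketch below (a randomly generated instance; all sizes arbitrary) checks that the one-step test probabilities determine the belief linearly, so the probability of any longer test is also linear in the predictive state.

```python
import numpy as np

rng = np.random.default_rng(3)
nS, nO, nA = 3, 5, 2           # undercomplete, observable instance (sizes arbitrary)

T = rng.random((nA, nS, nS)); T /= T.sum(axis=1, keepdims=True)   # T[a][s_next, s]
Ob = rng.random((nO, nS));    Ob /= Ob.sum(axis=0, keepdims=True) # Ob[o, s]

b = rng.random(nS); b /= b.sum()      # belief over latent states after some history

# Predictive state: success probabilities of the one-step tests (the observations).
q = Ob @ b                            # q[o] = P(o | history)

# Observability (full column rank of Ob) lets us recover the belief linearly from q,
b_rec = np.linalg.pinv(Ob) @ q
assert np.allclose(b_rec, b)

# ... so the probability of a longer test, e.g. (a, o_next), is also linear in q.
a, o_next = 0, 1
p_direct = Ob[o_next, :] @ T[a] @ b                 # computed from the belief
m = Ob[o_next, :] @ T[a] @ np.linalg.pinv(Ob)       # the linear weights for this test
assert np.allclose(m @ q, p_direct)
print("P(test) via predictive state:", float(m @ q))
```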
Now we define a value link function in PSRs. First, given an -memory policy, define , i.e., the expected total reward under , conditioned on the history . Note that our value function here depends on the entire history.
Definition 6 (General value link functions).
Consider an -memory policy . One-step general value link functions at step are defined as solutions to
(3)
This definition is more general than Definition 2, since (3) implies (1) in POMDPs by setting . In PSRs, we can show the existence of such general value link functions.
Lemma 7 (The existence of link functions for PSRs).
Suppose is a core test set. Then, a one-step value link function always exists.
The high-level derivation is as follows. Using the linear PSR property, one can first show that has a bilinear form , where denotes the one-hot encoding vector over , and is a matrix. Then, given any and , for some matrix , we can show satisfies the above, where is a one-hot encoding vector over and serves as an unbiased estimate of .
Finally, we show that PSR admits PO-bilinear rank decomposition (Definition 3).
Lemma 8.
Suppose a core test set includes a minimum core test set . Set and ; then the PO-bilinear rank is at most .
Then, Algorithm 1 is directly applicable to PSRs. Note that here, fortunately, the PO-bilinear rank scales with but not with . The dependence comes from the dimension of the "feature" of the memory . If one has a compact feature representation such that is linear with respect to the feature , then the PO-bilinear rank is . This implies that if one has a compact featurization of the memory a priori, one can avoid an exponential dependence on .
Sample complexity.
We finally briefly mention the sample complexity result. The details are deferred to Section K. The sample complexity to satisfy is given as
where and are parameters associated with PSRs. Here, there is no explicit dependence on . Note that in the worst case, scales as and scales as .
7 Generalization of PO-Bilinear AC Class
We extend our previous definition of the PO-Bilinear AC framework. We first present an even more general framework that captures all the examples we have discussed so far. We then provide two more examples that are covered by this framework: (1) -step decodable POMDPs, and (2) observable POMDPs with low-rank latent transition. Using the result in (2), we obtain refined results in the tabular setting compared to those from Section 5.3.2.
The following is a general PO-Bilinear AC class. Recall . We consider one-step futures, i.e., , but the extension to is straightforward. Compared to Definition 3, we introduce another class of functions, termed discriminators, and a loss function .
Definition 7 (General PO-Bilinear AC Class).
Consider a tuple consisting of a policy class , a function class , a loss function where , a set of estimation policies where , and a discriminator class with . Consider a non-decreasing function with .
The model is a PO-bilinear class of rank if is realizable, and there exist and such that for all and ,
(a) ,
(b) (In -step decodable POMDPs and POMDPs with low-rank latent transition, we set , and in the previous sections, we set .)
(c) for any and the corresponding value link function in .
The first condition states that the average Bellman error under is upper-bounded by the quantity in the bilinear form. The second condition states that we have a known loss function that can be used to estimate an upper bound (up to a non-decreasing transformation ) on the value of the bilinear form. Our algorithm will use the surrogate loss . As we will show, just being able to estimate an upper bound on the value of the bilinear form suffices for deriving a PAC algorithm. The discriminator class and the non-decreasing function give us additional freedom in designing the loss function. For simple examples such as tabular POMDPs and LQG, as we have already seen, we simply set the discriminator class (i.e., we do not use discriminators) and take to be the identity mapping.
With this definition, we slightly modify Provable to incorporate the discriminator when constructing constraints. The algorithm is summarized in Algorithm 3 and is named DisProvable. There are two modifications: (1) when we collect data, we switch from the roll-in policy to the policy at time step ; (2) the Bellman error constraint is defined using the loss together with the discriminator class .
The following theorem shows the sample complexity of Algorithm 3. For simplicity, we directly consider the case where are all discrete.
Assumption 3 (Uniform Convergence).
Fix . Let be a set of i.i.d. tuples obtained by executing . With probability ,
For , we also require
Theorem 8 (Sample complexity of Algorithm 3).
Suppose we have a PO-bilinear AC class with rank in Definition 7. Suppose Assumption 3, and for any
By setting
where
With probability at least , letting , we have
The total number of samples used in the algorithm is .
This reduces to Theorem 1 when we set to be the identity function and . When is a strongly convex function, we can obtain more refined rates. For example, when , i.e., , with , the above theorem implies a slow sample complexity rate . However, by leveraging the strong convexity of the square function , a refined analysis can give the fast rate . We will see two such examples in the next sections.
8 Examples for Generalized PO-Bilinear AC Class
We demonstrate that our generalized framework captures two models: (1) -step decodable POMDPs, and (2) observable POMDPs with the latent low-rank transition. In this section, we assume for any .
8.1 -step decodable POMDPs
The example we include here is a model that involves nonlinear function approximation but has a unique assumption on the exact identifiability of the latent states.
Example 8 (-step decodable POMDPs [20]).
There exists an unknown decoder , such that for every reachable trajectory , we have for all .
Existence of value link functions.
From the definition, using a value function over , we can define a value link function as
since it satisfies
This is summarized in the following lemma.
Lemma 9 (Existence of link functions in -step decodable POMDPs).
In -step decodable POMDPs, link functions exist.
-step decodable POMDPs showcase the generality of value link functions, which not only capture standard observability conditions where future observations and actions are used to replace belief states (e.g., observable tabular POMDPs and observable LQG), but also capture a model where history is used to replace latent states.
PO-Bilinear Rank.
Next, we calculate the PO-bilinear rank based on Definition 7. In the tabular case, we can naïvely obtain the PO-bilinear decomposition with rank following Example 1. Here, we consider the nontabular case where function approximation is used and can be extremely large. We define the following Bellman operator associated with at step :
(4)
Note that above we use the ground truth decoder to decode from to its associated latent state . The existence of this Bellman operator is crucially dependent on the existence of such decoder .
We show that -step decodable POMDPs satisfy the definition in Definition 7. We assume that the latent state-wise transition model is low-rank. In MDPs, this assumption is widely used in [76, 36, 2, 71]. Here, we do not need to know in the algorithm.
Assumption 4 (Low-rankness of latent transition).
Suppose is low-rank, i.e., where are (unknown) -dimensional features. As technical conditions, we suppose for any and for any .
Lemma 10 (Bilinear decomposition of low-rank -step decodable POMDPs ).
Suppose Assumption 4, , for any . Assume a discriminator class is Bellman complete, i.e.,
for any . The loss function is designed as
(5)
Then, there exist so that the PO-bilinear rank is at most such that
(6)
(7)
and
(8) |
Proof.
The proof is deferred to Section N.2. Note that (6), (7), (8) correspond to (a), (b), (c) in Definition 7. ∎
We use the most general bilinear class definition from Definition 7, where for a scalar . Hence is a non-decreasing function ( is non-decreasing in ). The proof of the above lemma leverages the novel trick of the so-called moment matching policy introduced by [20]. When the latent state and action spaces are discrete, it states that the bilinear rank is , which is much smaller than . Note that we introduce in the loss function (5) to induce strong convexity w.r.t. , as in [70, 18, 9], which is important for obtaining the fast rate later.
The concrete sample complexity of DisProvable (Algorithm 3) for this model is summarized in the following. Recall that the bilinear rank is , where is the rank of the transition matrix. We set . Then, we have the following result.
Theorem 9 (Sample complexity for -step decodable POMDPs (Informal)).
Suppose Assumption 4, Bellman completeness, , for any . With probability , we can achieve when we use samples at most
Here, are omitted.
The following are several implications. First, the error rate scales as . As promised, by leveraging the strong convexity of the loss functions, we obtain a rate of , which is faster than the rate attained when we naïvely invoke Theorem 8 with . Second, the error bound incurs . As shown in [20], this is inevitable in -step decodable POMDPs. Third, in the tabular case, when we use naïve function classes for , i.e., , , , the bound could incur an additional , since the complexity of the function classes can scale with (e.g., can be of the order , and similarly for ). However, when we start from a realizable model class that captures the ground truth transition and emission distributions, we can remove . See Section 8.2.4 for an example.
Note that [20] uses a different function class setup, where they assume one has an M-memory-action-dependent function class that contains , while we use the actor-critic framework . The two function class setups are not directly comparable. Generally, we mention that such an optimal with truncated history does not exist when exact decodability does not hold (e.g., such a with truncated history does not exist in LQG). This displays the potential generality of the actor-critic framework we propose here.
8.2 Observable POMDPs with Latent Low-rank Transition: a model-based perspective
The final example we include in this work is a POMDP with the latent low-rank transition. We first introduce the model, and then we introduce our function approximation setup and show the sample complexity. Finally, we revisit the sample complexity for observable tabular POMDPs and -step decodable tabular POMDPs using the improved algorithm that elaborates on the model-based approach in this section.
Example 9 (Observable POMDPs with latent low-rank transition).
The latent transition is factorized as where and . The observation matrix has full-column rank.
In the tabular POMDP example, we have . However, in general can be much smaller than . Note that in this section we focus on the setting where are discrete to avoid measure-theoretic language, but their size could be extremely large. In particular, our sample complexity will not have explicit polynomial or logarithmic dependence on ; instead, it will only scale polynomially with respect to the complexity of the hypothesis class and the rank .
Model-based function approximation.
Our function approximation class consists of a set of models where together models latent transition as , and models , and is full column rank. For notation simplicity, we often use to denote a model . We impose the following assumption.
Assumption 5 (Realizability).
We assume realizability, i.e., .
We assume is discrete, but can be large such that a linear dependence on in the sample complexity is not acceptable. Our goal is to get a bound that scales polynomially with respect to , which is the standard statistical complexity of the discrete hypothesis class .
Next, we construct using the model class . Given , we denote as the optimal -memory policy, i.e., the -memory policy that maximizes the total expected reward. We set
We consider the value function class for with being full column rank. For each , we can define the corresponding value function of the policy at : . Then, since is full column rank, as we see in the proof of Lemma 1, a corresponding value link function is
where . Then, we construct as:
(9)
By construction, since , we must have , which implies is realizable (note ). Here, from the construction and the assumption for any , we have and , which can be seen from
by assuming and .
To construct a discriminator class , we first define the Bellman operator for :
where is the whole history space up to (, and is just part of this history) and is the probability of generating conditioned on under model . Then, we construct such that
(10)
so that we can ensure the Bellman completeness:
noting . Here, from the construction, and .
We define the loss in the same way as the one we used in -step decodable POMDPs, except that our discriminators now take the entire history as input:
(11)
Finally, as in the case of -step decodable POMDPs (Lemma 10), we get the following lemma that states that our model is a PO-bilinear AC class (Definition 7) under the following model assumption.
Assumption 6.
We assume for any in the model. Suppose for any , and in the model. Suppose for any in the model and . Suppose for any and for any in the model, we have .
Lemma 11 (PO-bilinear decomposition for Observable POMDPs with low-rank transition).
The above lemma ensures that the PO-bilinear rank only depends on , and is independent of the length of the memory. For example, in the tabular case, it is .
Next, we show that the output of DisProvable can search for the best in-class -memory policy, as follows.
Theorem 10 (Sample complexity of DisProvable for observable POMDPs with latent low-rank transition).
Here, we emphasize that there is no explicit polynomial or logarithmic dependence on and , which permits learning for large state and observation spaces. We also do not have any explicit polynomial dependence on , as we construct and from the model class which ensures the complexities of and are in the same order as that of .
8.2.1 Global Optimality
We show a quasi-polynomial sample complexity bound for competing against the globally optimal policy . To compete against the globally optimal policy , we need to set properly. We use the following lemma. The proof is given in Section O.
Lemma 12 (Near global optimality of -memory policy).
Consider , and a POMDP with low-rank latent transition and being full column rank with . When (with being some absolute constant), there must exist an -memory policy such that
Note that the memory length above is independent of ; instead it only depends on the rank . To prove the above lemma, we first show a new result on belief contraction for low-rank POMDPs under the -based observability. The proof of the belief contraction borrows some key lemmas from [24] but extends the original result from small-size tabular POMDPs to low-rank POMDPs. We leverage the linear structure of the problem and the G-optimal design to construct an initial distribution over that can be used as a starting point for belief propagation along the memory.
We conclude the study on the POMDPs with low-rank latent transition by the following theorem, which demonstrates a quasi-polynomial sample complexity for learning the globally optimal policy.
Theorem 11 (Sample complexity of DisProvable for POMDPs with low-rank latent transition — competing against ).
Remark 4 (Comparison to [73]).
We compare our results to the very recent work [73] that studies POMDPs with the low-rank latent transition. The results are in general not directly comparable, but we state several key differences here. First, [73] considers a special instance of low-rank transition, i.e., [73] assumes has low non-negative rank, which could be exponentially larger than the usual rank [2]. Second, [73] additionally assumes short past sufficiency, a condition which intuitively says that for any roll-in policy, the sufficient statistics of a short memory is enough to recover the belief over the latent states, and their sample complexity has an exponential dependence on the length of the memory. While our result also relies on the fact that the globally optimal policy can be approximated by an -memory policy with small , this fact is derived directly from the standard observability condition.
8.2.2 Revisiting Observable Undercomplete Tabular POMDPs
We reconsider the sample complexity of undercomplete tabular POMDPs using Theorem 10. In this case, we will start from a model class that captures the ground truth latent transition and omission distribution . By constructing -nets over the model class, we can set , since these have and many parameters, respectively. Besides, the PO-bilinear rank is . Therefore, the sample complexity is
We leave the formal analysis to future work.
Compared to the results in Section 5.3.2, there is no term. This is due to two improvements. The first improvement is that we refine the rank from to . The second improvement is that we construct the value link function class and the policy class starting from the model class, whose complexity does not depend on the length of the memory (note that previously, from a pure model-free perspective, the statistical complexity of can scale as in the worst case).
8.2.3 Revisiting Observable Overcomplete POMDPs
We reconsider the sample complexity of overcomplete tabular POMDPs using Theorem 10 with slight modification to incorporate multi-step future. Suppose (recall is defined in Lemma 3 in Section 4.2). Then, we can achieve a sample complexity
since the PO-bilinear rank is . Note that there is no dependence, since both the policy class and the value link function class are built from the model class whose complexity has nothing to do with .
Note that due to our definition of , there is no term. However, when we use a different definition, for instance, (recall is defined in Section 4.2), we would incur . This is because if we only know that there is an unknown sequence of actions such that is full column rank, we need to use uniform samples in the importance sampling step to identify such a sequence. More formally, we can see that
(12)
8.2.4 Revisiting -step Decodable Tabular POMDPs
We reconsider the sample complexity of tabular -step decodable POMDPs by constructing from the model class as we did for the low-rank POMDP. In this case, by constructing -nets, we can set , since these have and parameters, respectively. Therefore, the sample complexity is
Again, we leave the formal analysis to future work. Compared to the naive result mentioned after Theorem 9, where could scale on the order of in the tabular case, we do not have such dependence here.
9 Summary
We propose a PO-bilinear actor-critic framework that is the first unified framework for provably efficient RL on large-scale partially observable dynamical systems. Our framework captures not only many models where provably efficient learning was already known, such as tabular POMDPs, LQG, and M-step decodable POMDPs, but also models where provably efficient RL was not previously known, such as HSE-POMDPs, general PSRs, and observable POMDPs with low-rank latent transition. Our unified actor-critic style algorithm, Provable, provably performs agnostic learning by searching for the best memory-based policy. For special models such as observable tabular POMDPs, LQG, and POMDPs with low-rank latent transition, by leveraging their special properties, i.e., the exponential stability of Bayesian filters in tabular and low-rank POMDPs, and the existence of a compact featurization of histories in LQG, we are able to directly compete against the globally optimal policy without paying an exponential dependence on the horizon.
Acknowledgement
We thank Nan Jiang for valuable discussions on PSRs.
References
- AHKS [20] Alekh Agarwal, Mikael Henaff, Sham Kakade, and Wen Sun. Pc-pg: Policy cover directed exploration for provable policy gradient learning. Advances in Neural Information Processing Systems, 33:13399–13412, 2020.
- AKKS [20] Alekh Agarwal, Sham Kakade, Akshay Krishnamurthy, and Wen Sun. Flambe: Structural complexity and representation learning of low rank mdps. Advances in neural information processing systems, 33:20095–20107, 2020.
- ALA [16] Kamyar Azizzadenesheli, Alessandro Lazaric, and Animashree Anandkumar. Reinforcement learning of pomdps using spectral methods. In Conference on Learning Theory, pages 193–256. PMLR, 2016.
- BBC+ [19] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
- Ber [12] Dimitri Bertsekas. Dynamic programming and optimal control: Volume I, volume 1. Athena scientific, 2012.
- BGG [13] Byron Boots, Geoffrey Gordon, and Arthur Gretton. Hilbert space embeddings of predictive state representations. arXiv preprint arXiv:1309.6819, 2013.
- BK [21] Andrew Bennett and Nathan Kallus. Proximal reinforcement learning: Efficient off-policy evaluation in partially observed markov decision processes. 2021.
- BSG [11] Byron Boots, Sajid M Siddiqi, and Geoffrey J Gordon. Closing the learning-planning loop with predictive state representations. The International Journal of Robotics Research, 30(7):954–966, 2011.
- CJ [19] Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, pages 1042–1051. PMLR, 2019.
- CO [20] Sayak Ray Chowdhury and Rafael Oliveira. No-regret reinforcement learning with value function approximation: a kernel embedding approach. arXiv preprint arXiv:2011.07881, 2020.
- CPS+ [20] Yifan Cui, Hongming Pu, Xu Shi, Wang Miao, and Eric Tchetgen Tchetgen. Semiparametric proximal causal inference. arXiv preprint arXiv:2011.08411, 2020.
- CYW [22] Qi Cai, Zhuoran Yang, and Zhaoran Wang. Sample-efficient reinforcement learning for pomdps with linear function approximations. arXiv preprint arXiv:2204.09787, 2022.
- Dea [18] Ben Deaner. Proxy controls and panel data. arXiv preprint arXiv:1810.00283, 2018.
- DHB+ [17] Carlton Downey, Ahmed Hefny, Byron Boots, Geoffrey J Gordon, and Boyue Li. Predictive state recurrent neural networks. Advances in Neural Information Processing Systems, 30, 2017.
- DJW [20] Yaqi Duan, Zeyu Jia, and Mengdi Wang. Minimax-optimal off-policy evaluation with linear function approximation. In International Conference on Machine Learning, pages 2701–2709. PMLR, 2020.
- DKJ+ [19] Simon Du, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal, Miroslav Dudik, and John Langford. Provably efficient rl with rich observations via latent state decoding. In International Conference on Machine Learning, pages 1665–1674. PMLR, 2019.
- DKL+ [21] Simon Du, Sham Kakade, Jason Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, and Ruosong Wang. Bilinear classes: A structural framework for provable generalization in rl. In International Conference on Machine Learning, pages 2826–2836. PMLR, 2021.
- DLMS [20] Nishanth Dikkala, Greg Lewis, Lester Mackey, and Vasilis Syrgkanis. Minimax estimation of conditional moment models. In Advances in Neural Information Processing Systems, volume 33, pages 12248–12262, 2020.
- EDKM [05] Eyal Even-Dar, Sham M Kakade, and Yishay Mansour. Reinforcement learning in pomdps without resets. 2005.
- EJKM [22] Yonathan Efroni, Chi Jin, Akshay Krishnamurthy, and Sobhan Miryoosefi. Provable reinforcement learning with a short-term memory. arXiv preprint arXiv:2202.03983, 2022.
- FKQR [21] Dylan J Foster, Sham M Kakade, Jian Qian, and Alexander Rakhlin. The statistical complexity of interactive decision making. arXiv preprint arXiv:2112.13487, 2021.
- GDB [16] Zhaohan Daniel Guo, Shayan Doroudi, and Emma Brunskill. A pac rl algorithm for episodic pomdps. In Artificial Intelligence and Statistics, pages 510–518. PMLR, 2016.
- [23] Noah Golowich, Ankur Moitra, and Dhruv Rohatgi. Learning in observable pomdps, without computationally intractable oracles. arXiv preprint arXiv:2206.03446, 2022.
- [24] Noah Golowich, Ankur Moitra, and Dhruv Rohatgi. Planning in observable pomdps in quasipolynomial time. arXiv preprint arXiv:2201.04735, 2022.
- HDG [15] Ahmed Hefny, Carlton Downey, and Geoffrey J Gordon. Supervised learning for dynamical system learning. Advances in neural information processing systems, 28, 2015.
- HDL+ [21] Botao Hao, Yaqi Duan, Tor Lattimore, Csaba Szepesvári, and Mengdi Wang. Sparse feature selection makes batch reinforcement learning more sample efficient. In International Conference on Machine Learning, pages 4063–4073. PMLR, 2021.
- Hes [18] Joao P Hespanha. Linear systems theory. Princeton university press, 2018.
- HFP [13] William L Hamilton, Mahdi Milani Fard, and Joelle Pineau. Modelling sparse dynamical systems with compressed predictive state representations. In International Conference on Machine Learning, pages 178–186. PMLR, 2013.
- HKZ [12] Daniel Hsu, Sham M Kakade, and Tong Zhang. A spectral algorithm for learning hidden markov models. Journal of Computer and System Sciences, 78(5):1460–1480, 2012.
- IP [08] Masoumeh T Izadi and Doina Precup. Point-based planning for predictive state representations. In Conference of the Canadian Society for Computational Studies of Intelligence, pages 126–137. Springer, 2008.
- Jae [98] Herbert Jaeger. Discrete-time, discrete-valued observable operator models: a tutorial. GMD-Forschungszentrum Informationstechnik Darmstadt, Germany, 1998.
- Jae [00] Herbert Jaeger. Observable operator models for discrete stochastic time series. Neural computation, 12(6):1371–1398, 2000.
- JKA+ [17] Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. Contextual decision processes with low bellman rank are pac-learnable. In International Conference on Machine Learning, pages 1704–1713. PMLR, 2017.
- JKKL [20] Chi Jin, Sham Kakade, Akshay Krishnamurthy, and Qinghua Liu. Sample-efficient reinforcement learning of undercomplete pomdps. Advances in Neural Information Processing Systems, 33:18530–18539, 2020.
- JLM [21] Chi Jin, Qinghua Liu, and Sobhan Miryoosefi. Bellman eluder dimension: New rich classes of rl problems, and sample-efficient algorithms. Advances in Neural Information Processing Systems, 34, 2021.
- JYWJ [20] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
- KECM [21] Jeongyeol Kwon, Yonathan Efroni, Constantine Caramanis, and Shie Mannor. Rl for latent mdps: Regret guarantees and a lower bound. Advances in Neural Information Processing Systems, 34, 2021.
- KJS [15] Alex Kulesza, Nan Jiang, and Satinder Singh. Spectral learning of predictive state representations with insufficient statistics. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
- KMN [99] Michael Kearns, Yishay Mansour, and Andrew Ng. Approximate planning in large pomdps via reusable trajectories. Advances in Neural Information Processing Systems, 12, 1999.
- KMU [21] Nathan Kallus, Xiaojie Mao, and Masatoshi Uehara. Causal inference under unmeasured confounding with negative controls: A minimax learning approach. arXiv preprint arXiv:2103.14029, 2021.
- KW [60] Jack Kiefer and Jacob Wolfowitz. The equivalence of two extremum problems. Canadian Journal of Mathematics, 12:363–366, 1960.
- LAHA [20] Sahin Lale, Kamyar Azizzadenesheli, Babak Hassibi, and Anima Anandkumar. Regret minimization in partially observable linear quadratic control. arXiv preprint arXiv:2002.00082, 2020.
- LCSJ [22] Qinghua Liu, Alan Chung, Csaba Szepesvári, and Chi Jin. When is partially observable reinforcement learning not scary? arXiv preprint arXiv:2204.08967, 2022.
- Lit [96] Michael Lederman Littman. Algorithms for sequential decision-making. Brown University, 1996.
- LMPR [20] Tianyu Li, Bogdan Mazoure, Doina Precup, and Guillaume Rabusseau. Efficient planning under partial observability with unnormalized q functions and spectral learning. In International Conference on Artificial Intelligence and Statistics, pages 2852–2862. PMLR, 2020.
- LS [01] Michael Littman and Richard S Sutton. Predictive representations of state. Advances in neural information processing systems, 14, 2001.
- LS [20] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
- MHKL [20] Dipendra Misra, Mikael Henaff, Akshay Krishnamurthy, and John Langford. Kinematic state abstraction and provably efficient rich-observation reinforcement learning. In International conference on machine learning, pages 6961–6971. PMLR, 2020.
- MKS+ [13] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
- MST [18] Wang Miao, Xu Shi, and Eric Tchetgen Tchetgen. A confounding bridge approach for double negative control inference on causal effects. arXiv preprint arXiv:1808.04945, 2018.
- MTR [19] Horia Mania, Stephen Tu, and Benjamin Recht. Certainty equivalent control of lqr is efficient. arXiv preprint arXiv:1902.07826, 2019.
- Mur [00] Kevin P Murphy. A survey of pomdp solution techniques. environment, 2(10), 2000.
- MZG+ [21] Afsaneh Mastouri, Yuchen Zhu, Limor Gultchin, Anna Korba, Ricardo Silva, Matt J Kusner, Arthur Gretton, and Krikamol Muandet. Proximal causal learning with kernels: Two-stage estimation and moment restriction. arXiv preprint arXiv:2105.04544, 2021.
- NBGF [12] Yu Nishiyama, Abdeslam Boularias, Arthur Gretton, and Kenji Fukumizu. Hilbert space embeddings of pomdps. arXiv preprint arXiv:1210.4887, 2012.
- PT [87] Christos H Papadimitriou and John N Tsitsiklis. The complexity of markov decision processes. Mathematics of operations research, 12(3):441–450, 1987.
- PVSP [06] Josep M Porta, Nikos Vlassis, Matthijs TJ Spaan, and Pascal Poupart. Point-based value iteration for continuous pomdps. 2006.
- RGT [04] Matthew Rosencrantz, Geoff Gordon, and Sebastian Thrun. Learning low dimensional predictive representations. In Proceedings of the twenty-first international conference on Machine learning, page 88, 2004.
- SBS+ [10] Le Song, Byron Boots, Sajid Siddiqi, Geoffrey J Gordon, and Alex Smola. Hilbert space embeddings of hidden markov models. 2010.
- SHSF [09] Le Song, Jonathan Huang, Alex Smola, and Kenji Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 961–968, 2009.
- Sin [21] Rahul Singh. A finite sample theorem for longitudinal causal inference with machine learning: Long term, dynamic, and mediated effects. arXiv preprint arXiv:2112.14249, 2021.
- SJK+ [19] Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Model-based rl in contextual decision processes: Pac bounds and exponential improvements over model-free approaches. In Conference on learning theory, pages 2898–2933. PMLR, 2019.
- SJR [04] Satinder Singh, Michael R James, and Matthew R Rudary. Predictive state representations: a new theory for modeling dynamical systems. In Proceedings of the 20th conference on Uncertainty in artificial intelligence, pages 512–519, 2004.
- SKKS [09] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.
- SPK [13] Guy Shani, Joelle Pineau, and Robert Kaplow. A survey of point-based pomdp solvers. Autonomous Agents and Multi-Agent Systems, 27(1):1–51, 2013.
- SSH [20] Max Simchowitz, Karan Singh, and Elad Hazan. Improper learning for non-stochastic control. In Conference on Learning Theory, pages 3320–3436. PMLR, 2020.
- SUJ [21] Chengchun Shi, Masatoshi Uehara, and Nan Jiang. A minimax learning approach to off-policy evaluation in partially observable markov decision processes. arXiv preprint arXiv:2111.06784, 2021.
- SVBB [16] Wen Sun, Arun Venkatraman, Byron Boots, and J Andrew Bagnell. Learning to filter with predictive state inference machines. In International conference on machine learning, pages 1197–1205. PMLR, 2016.
- TJ [15] Michael R Thon and Herbert Jaeger. Links between multiplicity automata, observable operator models and predictive state representations: a unified learning framework. J. Mach. Learn. Res., 16:103–147, 2015.
- TSM [20] Guy Tennenholtz, Uri Shalit, and Shie Mannor. Off-policy evaluation in partially observable environments. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 10276–10283, 2020.
- UIJ+ [21] Masatoshi Uehara, Masaaki Imaizumi, Nan Jiang, Nathan Kallus, Wen Sun, and Tengyang Xie. Finite sample analysis of minimax offline reinforcement learning: Completeness, fast rates and first-order efficiency. arXiv preprint arXiv:2102.02981, 2021.
- UZS [21] Masatoshi Uehara, Xuezhou Zhang, and Wen Sun. Representation learning for online and offline rl in low-rank mdps. arXiv preprint arXiv:2110.04652, 2021.
- VODM [12] Peter Van Overschee and Bart De Moor. Subspace identification for linear systems: Theory—Implementation—Applications. Springer Science & Business Media, 2012.
- WCYW [22] Lingxiao Wang, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Embed to control partially observed systems: Representation learning with provable sample efficiency. arXiv preprint arXiv:2205.13476, 2022.
- XCGZ [21] Yi Xiong, Ningyuan Chen, Xuefeng Gao, and Xiang Zhou. Sublinear regret for learning pomdps. arXiv preprint arXiv:2107.03635, 2021.
- XKG [21] Liyuan Xu, Heishiro Kanagawa, and Arthur Gretton. Deep proxy causal learning and its application to confounded bandit policy evaluation. Advances in Neural Information Processing Systems, 34:26264–26275, 2021.
- YW [20] Lin Yang and Mengdi Wang. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. In International Conference on Machine Learning, pages 10746–10756. PMLR, 2020.
- ZLKB [20] Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, and Emma Brunskill. Learning near optimal policies with low inherent bellman error. In International Conference on Machine Learning, pages 10978–10989. PMLR, 2020.
- ZSU+ [22] Xuezhou Zhang, Yuda Song, Masatoshi Uehara, Mengdi Wang, Wen Sun, and Alekh Agarwal. Efficient reinforcement learning in block mdps: A model-free representation learning approach. arXiv preprint arXiv:2202.00063, 2022.
Appendix A Supplement for Section 3
We generalize Definition 3 to capture more models. The first extension is to use multi-step link functions. This extension is essential to capture overcomplete POMDPs and multi-step PSRs. The second extension is to use minimax loss functions with discriminators so that we can use not only absolute value loss functions but also squared loss functions. This extension is important to capture M-step decodable POMDPs.
Appendix B Supplement for Section 4
B.1 Observable Undercomplete Tabular POMDPs
B.2 Observable Overcomplete POMDPs
We consider overcomplete POMDPs with multi-step futures. We have the following theorem. This is a generalization of Lemma 1.
Proof of Lemma 3
Consider any function (thus, this captures all possible ). Denote as the one-hot encoding of (similarly for over and over ). We have , where we use the assumption that and thus . Then,
which means that the value bridge function corresponding to is
Proof of Lemma 4
Recall we want to show the low-rank property of the following loss function:
We consider an expectation conditioning on and . For some vector , which depends on , we write it in the form of where is the one-hot encoding vector over . Then, the loss for is equal to
Hence, we can take and .
B.3 Observable Linear Quadratic Gaussian
B.4 Observable HSE-POMDPs
We first provide the proof of Lemma 6. Then, we briefly mention how we extend to the infinite-dimensional setting.
Proof of the first statement in Lemma 6
First, we need to show value bridge functions exist. This is proved noting
Thus, is a value bridge function.
Proof of the second statement in Lemma 6
Consider a triple , with and , we have:
which verifies the bilinear structure, i.e., , and , and shows that the bilinear rank is at most .
Infinite dimensional HSE-POMDPs
Consider the case and are features in infinite dimensional RKHS. By assuming that the spectrum of the operator is decaying with a certain order, we can still ensure the existence of value bridge functions even if and are infinite dimensional.
Next, we consider the PO-bilinear rank. We can still use the decomposition in the proof above. While the PO-bilinear rank itself in the current definition is infinite-dimensional, when we get the PAC result later, the dependence on the PO-bilinear rank comes from the information gain based on , which is the intrinsic dimension of . Thus, we can easily get the sample complexity result by replacing with the information gain over [63]. Generally, to take infinite dimensional models into account, the PO-bilinear rank in Definition 3 can be generalized using the critical information gain [17].
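To make the replacement of the finite rank by an information-gain quantity concrete, the following is a minimal numerical sketch, under our own illustrative assumptions (an RBF kernel, a fixed lengthscale, and a fixed noise level, none of which are specified by the text): it computes the log-determinant information gain that serves as the intrinsic-dimension proxy in the infinite-dimensional case.

```python
import numpy as np

def information_gain(X, lengthscale=1.0, noise=1.0):
    """Compute 0.5 * log det(I + K / noise) for an RBF kernel matrix K.

    This log-determinant is the information-gain proxy for the intrinsic
    dimension that replaces the finite PO-bilinear rank in the bound.
    """
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(X**2, axis=1)[None, :]
                - 2.0 * X @ X.T)
    K = np.exp(-0.5 * sq_dists / lengthscale**2)
    _, logdet = np.linalg.slogdet(np.eye(len(X)) + K / noise)
    return 0.5 * logdet

# toy usage: for smooth kernels the gain grows slowly with the sample size
rng = np.random.default_rng(0)
for n in (50, 100, 200):
    print(n, information_gain(rng.normal(size=(n, 2))))
```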
Appendix C Supplement for Section 5 (Algorithm for LQG with Continuous Action)
In this section, we present a modification to handle LQG with continuous action in Definition 3.
Our algorithm so far samples from and performs importance weighting in designing the loss , which will incur a polynomial dependence on as we will see in the next section. However, among the examples that we consider in Section 4, LQG has continuous actions. If we naïvely sample from a ball in and perform (nonparametric) importance weighting, we will pay in our sample complexity bound, which is not ideal for high-dimensional control problems. To avoid an exponential dependence on , here we replace with a -optimal design over the action’s quadratic feature space.
Here, we want to evaluate the Bellman error of pair under a roll-in policy :
where for any linear deterministic policy (here ) using a single policy. In other words, we would like to get a good loss such that
for some policy without incurring an exponential dependence on . We explain how to design such a loss function step by step.
First Step
The first step is to consider the conditional expectation on . Here, using the quadratic form of , we can show that there exist some :
where . Then, the Bellman loss we want to evaluate can be written in the form of
Second step
The second step is to compute a G-optimal design for the set for a sufficiently large , and denote by the support of the design. Note that in LQG, though we cannot ensure that the action lives in a compact set, we can still ensure that with high probability, which suffices in our setting as we will see. Since the dimension of is , we can ensure [47, 41]. Here is the concrete theorem we invoke.
Theorem 12 (Property of G-optimal design).
Suppose is a compact set. There exists a distribution over such that:
- is supported on at most points.
- For any , we have .
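As an illustration of how such a design can be computed in practice, here is a minimal sketch of the classical Frank-Wolfe (Kiefer-Wolfowitz) iteration for an approximate optimal design over a finite set of candidate feature vectors; the function name, the finite candidate set, and the stopping tolerance are our own assumptions and not part of the paper's algorithm.

```python
import numpy as np

def approximate_g_optimal_design(Phi, n_iters=1000, tol=1e-3):
    """Frank-Wolfe (Kiefer-Wolfowitz) iterations for an approximate
    G-optimal design over a finite set of candidate features.

    Phi: (n, d) array whose rows are candidate feature vectors.
    Returns weights rho (n,) such that max_i phi_i^T M(rho)^{-1} phi_i ~ d,
    where M(rho) = sum_i rho_i phi_i phi_i^T.
    """
    n, d = Phi.shape
    rho = np.full(n, 1.0 / n)                      # start from the uniform design
    for _ in range(n_iters):
        M = Phi.T @ (rho[:, None] * Phi)           # design matrix M(rho)
        M_inv = np.linalg.pinv(M)
        lev = np.einsum("ij,jk,ik->i", Phi, M_inv, Phi)  # leverage scores
        j = int(np.argmax(lev))
        g = lev[j]
        if g <= d * (1.0 + tol):                   # optimum has max leverage equal to d
            break
        step = (g / d - 1.0) / (g - 1.0)           # closed-form line-search step
        rho *= (1.0 - step)
        rho[j] += step
    return rho
```

By the Kiefer-Wolfowitz equivalence theorem (the result invoked above), driving the maximum leverage score toward the feature dimension makes the design simultaneously D- and G-optimal, which is exactly the coverage property in Theorem 12; in practice, the returned weights can be pruned to their small support.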
We have the following handy lemma stating that any is spanned by .
Lemma 13.
Let and . Then, it satisfies
Proof.
Since is full rank by the construction of the G-optimal design, is invertible. Then, we have
For the latter statement, we have
We use a property of G-optimal design in Theorem 12.
For the last statement, we have
from the CS inequality. ∎
Third Step
The third step is to combine the facts so far. Recall that we want to evaluate
In addition, the following also holds:
Here, we use . This concludes that
Thus, we can perform policy evaluation for a policy if we can intervene with actions from .
Fourth Step
The fourth step is to replace with a single policy that selects actions uniformly at random from the set , which we denote as . Using importance weighting, we define the loss function for as follows:
(13)
where is a uniform action over and
The term (13) is equal to the quantity we want to evaluate.
Summary
To summarize, we just need to use the following loss function in line 7 in Algorithm 1:
where is
and is a set of i.i.d. samples following the distribution induced by executing . The thresholds in the indicator functions are some large values selected properly later. Due to the unbounded Gaussian noise in LQG, the indicator functions for truncation are introduced here for technical reasons, in order to obtain valid concentration in Assumption 2.
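For concreteness, the following is a minimal sketch of the resulting loss estimator, with hypothetical names: `design_actions` stands for the support of the G-optimal design, `coeff` holds the spanning coefficients of the evaluated policy's action from Lemma 13, and `clip` plays the role of the truncation thresholds in the indicator functions. This illustrates the importance-weighting scheme, not the paper's exact implementation.

```python
import numpy as np

def lqg_design_loss(batch, n_support, coeff, value_fn, next_value_fn, clip=1e6):
    """Estimate the Bellman error of (pi, g) from data collected by playing
    actions uniformly at random from the design support.

    batch: iterable of (z, j, reward, z_next) where j indexes the design
           action that was played at this step.
    coeff: spanning coefficients (Lemma 13) of the evaluated policy's action
           in terms of the design points; hypothetical input for illustration.
    """
    total, count = 0.0, 0
    for z, j, reward, z_next in batch:
        td = reward + next_value_fn(z_next) - value_fn(z)
        td = float(np.clip(td, -clip, clip))   # truncation: Gaussian noise is unbounded
        total += n_support * coeff[j] * td     # importance weight n_support * 1{a = a_j}
        count += 1
    return total / max(count, 1)
```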
Appendix D Supplement for Section 6
We first add several discussions explaining core tests in detail. Next, we show the existence and form of link functions. Finally, we calculate the PO-bilinear rank. In this section, we focus on the general case where tests can span multiple steps.
D.1 Definition of PSRs
We first define core tests and predictive states [46, 62]. This definition is a generalization of Definition 5 with multi-step futures.
We slightly abuse notation and denote throughout this whole section — note that here does not include .
Definition 8 (Core test sets and PSRs).
A set is called a core test set if for any , , any possible future (i.e., test) and any history , there exists such that
The vector is referred to as the predictive state.
We often denote . To understand the above definition, we revisit observable undercomplete POMDPs and overcomplete POMDPs.
Example 1 (Observable undercomplete POMDPs).
In undercomplete POMDPs, when is full-column rank, is a core test set. Recall that is a matrix in whose entry indexed by is equal to .
Lemma 14 (Core tests in undercomplete POMDPs).
When is full-column rank, is a core test set.
Proof.
Consider any . Given a -dimensional belief state with each entry , for any future , there exists a -dimensional vector such that . More specifically, can be written as:
where is a vector with the entry indexed by equal to , and is a matrix with the entry indexed by equal to . Here, note that given a vector , is defined as a diagonal matrix whose diagonal elements correspond to . Thus, we have
where and . This concludes the proof. ∎
Example 2 (Overcomplete POMDPs).
We consider overcomplete POMDPs so that we can permit .
Lemma 15 (Core tests in overcomplete POMDPs).
Recall . Define a -dimensional matrix whose entry indexed by is equal to . When this matrix is full-column rank for all , is a core test set.
Proof.
Fix a test and consider a step . Then,
Thus, the assumption that is full column rank implies that the matrix with the entry indexed by equal to is full-column rank.
Define a -dimensional state given history . By definition, we have
Using the fact that is full-column rank, we have . Thus, using the form of from the proof of Lemma 14, we can conclude that for any test , we have . This concludes that is a core test set. ∎
Finally, we present an important property of predictive states, which corresponds to the Bayesian filter in POMDP.
Lemma 16 (Forward dynamics of predictive states).
We have
When we define where rows are for , we can express the forward update rule of predictive states as follows:
Proof.
The proof is an application of Bayes’s rule. We denote the observation part and the action part of by and , respectively. We have
(by definition)
(Bayes rule)
(by definition)
This concludes the proof. ∎
To further understand why PSRs generalize POMDPs, let us revisit undercomplete POMDPs (i.e., being full column rank) again. Set . As we saw in the proof of Lemma 14, the belief state together with defines the predictive state, i.e., , with , and . Note that in POMDPs, the matrix and the vector contain only non-negative entries. On the other hand, in PSRs, and could contain negative entries. This is the intuitive reason why PSRs are more expressive than POMDPs [46]. For a formal instance of a finite-dimensional PSR that cannot be expressed as a finite-dimensional POMDP, refer to [62, 31].
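The following is a small self-contained numerical check of this correspondence, written under our own toy conventions (the observation is emitted before the action at each step, matching the filter used in this paper, and all sizes are arbitrary small numbers): it builds a random undercomplete POMDP, updates the belief with the Bayesian filter, and verifies that the PSR operator built from the observation and transition matrices reproduces the predictive state.

```python
import numpy as np

rng = np.random.default_rng(0)
S, O_dim, A = 3, 4, 2                                 # |S| <= |O| so O can be full column rank

Obs = rng.dirichlet(np.ones(O_dim), size=S).T         # Obs[o, s] = O(o | s)
T = [rng.dirichlet(np.ones(S), size=S).T for _ in range(A)]   # T[a][s', s] = T(s' | s, a)

b = rng.dirichlet(np.ones(S))                         # current belief over latent states
q = Obs @ b                                           # predictive state for one-step tests

a, o = 1, 2
# Bayesian filter: incorporate the observation o, then the transition under a.
b_next = T[a] @ (Obs[o] * b)
b_next /= b_next.sum()

# PSR forward update: a linear operator on q, normalized by P(o | history) = q[o].
M_ao = Obs @ T[a] @ np.diag(Obs[o]) @ np.linalg.pinv(Obs)
q_next = M_ao @ q
q_next /= q_next.sum()

assert np.allclose(q_next, Obs @ b_next, atol=1e-8)
```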
D.2 Existence of link functions
We discuss the existence and the form of link functions. First, we define general value link functions with multi-step futures. For notational simplicity, we assume here that the tests have the same length, i.e., there is a , such that .
Definition 9 (General value link functions in dynamical systems).
Recall is the set of tests. At time step , general value link functions are defined as solutions to the following:
(14)
where is some distribution over the action set induced by the test set, i.e., . Here, for , we often denote and by and , respectively.
To show the existence of general value link functions for PSRs, we first study the form of value functions in PSRs. The following lemma states that value functions for -memory policies have bilinear forms.
Lemma 17 (Bilinear form of value functions for -memory policies).
Let be a one-hot encoding vector over . Suppose is a core test set. Then, for any -memory policy , there exists such that
Proof.
From Lemma 16, there exists a matrix such that via Bayes rule:
(15)
We prove the claim by induction. The base case clearly holds. Thus, we assume
We have
Note we use the assumption that the reward is a function of conditional on .
We first check the first term (a) that contains rewards. Using the fact that , this is equal to
Thus, it has a bilinear form, i.e., there exists some matrix such that
where is a matrix whose row indexed by is equal to .
Next, we see the second term (b). Using (15), the second term is equal to
where we use the notation to represent the operation of appending the pair to the memory while maintaining the proper length of the memory by truncating away the oldest observation-action pair. Thus, it again has a bilinear form, and the matrix can be defined such that its row indexed by is equal to . This concludes the proof. ∎
Next, we check sufficient conditions that ensure the existence of general K-step link functions. Given , we define the corresponding set of action sequences as . We set in (14) to be a uniform distribution over the set, denoted by . Namely, selects a sequence of test actions uniformly at random from .
Lemma 18 (Existence of link functions in PSRs).
Suppose is a core test. There exists such that
Proof.
We mainly need to design an unbiased estimator of the predictive state . We use importance weighting to do that. Given , and the resulting corresponding random observations , we define the following estimator , such that its entry indexed by a test is equal to:
We can verify that
Then,
With this estimator, now we can define the link function using the bilinear form of , i.e.,
Using the fact that is an unbiased estimate of , we can conclude the proof. ∎
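For concreteness, here is a minimal sketch of this importance-weighted estimator, with hypothetical argument names: each test contributes a nonzero entry only when the executed action sequence matches that test's actions, and the match indicator is reweighted by the number of distinct test action sequences.

```python
import numpy as np

def predictive_state_estimate(tests, actions_taken, obs_seen, n_action_seqs):
    """Importance-weighted estimate of the predictive state (sketch of Lemma 18).

    tests:         list of (obs_seq, act_seq) pairs defining the core tests.
    actions_taken: the executed action sequence, drawn uniformly among the
                   n_action_seqs distinct test action sequences.
    obs_seen:      the observations generated by the environment.
    Entry i is an unbiased estimate of P(obs_seq_i | history; do(act_seq_i)).
    """
    q_hat = np.zeros(len(tests))
    for i, (obs_seq, act_seq) in enumerate(tests):
        if tuple(actions_taken[:len(act_seq)]) == tuple(act_seq):
            match = tuple(obs_seen[:len(obs_seq)]) == tuple(obs_seq)
            q_hat[i] = n_action_seqs * float(match)   # weight 1 / (1 / n_action_seqs)
    return q_hat
```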
D.3 PO-Bilinear Rank Decomposition
Finally, we calculate the PO-bilinear rank. Here,
The Bellman error for under a roll-in denoted by is defined as
In fact, for any general value link functions .
Our goal is to design a loss function such that we can estimate the above Bellman error using data from a single policy. To do that, we design the following randomized action selection strategy.
Given an action sequence from a test , let us denote by a copy of but starting from the second action of , i.e., if , then . Denote . Our random action selection strategy first selects uniformly at random from , and then selects a sequence of actions uniformly at random from . Here, we remark that the length of the output is not fixed (i.e., has length larger than ).
As a first step, we define two unbiased estimators for and . Conditioning on history , given actions followed by action sequence , denote the corresponding observations as . We construct unbiased estimators for and as follows. As an unbiased estimator of , we define with the entry indexed by test as follows:
(16)
Similarly, as an unbiased estimator of , we define with the entry indexed by test as follows:
(17)
We remark that the length of in (16) and that of in (17) are different.
Then, by using importance sampling, we can verify
With the above setup, we can construct the loss function for estimating the Bellman error. We set the loss as follows:
(18)
Since we have shown that and are unbiased estimators of and , respectively, we can show that for any roll-in policy :
The above shows that we can use as a loss function.
Summary
We can use almost the same algorithm as Algorithm 1. The sole difference is that we need to replace with
where is an empirical approximation when executing .
Calculation of PO-bilinear rank
Finally, we prove that a PSR belongs to the PO-bilinear class.
Lemma 19 (PO-bilinear decomposition).
Let be a minimum core test set contained in . The PSR model has PO-bilinear rank at most , i.e., there exist two -dimensional mappings and such that for any triple , we have:
Proof.
We first take expectation conditional on . Then, we have
where and are two matrices as defined in the proof of Lemma 17, where we have already seen that the -induced Bellman backup of a value function with a bilinear form gives back a value function with a bilinear form. Rearranging terms, we get:
Now recall that the minimum core test set is . The final step is to argue that lives in a subspace whose dimension is . Since is a core test set, by definition, we can express using , i.e.,
where the row of indexed by is equal to , and is the vector that is used to predict , whose existence is ensured by the definition of PSRs. This implies that
Finally, we take the expectation with respect to ; then we get such that
∎
The key observation here is that the bilinear rank scales with but not . This is good news since we often cannot identify exact minimal core test sets; however, it is easy to find core test sets that include a minimal core test set. Thus, even if we do not know the linear dimension of a dynamical system a priori, the resulting bilinear rank is still the linear dimension of the dynamical system as long as the core test set is large enough to include a minimal core test set. This results in a benefit in the sample complexity, as we will see in Section K.
Appendix E Proof of Theorem 1
We fix the parameters as in Theorem 1. Let
We define
Then, by Assumption 2, with probability , we have
(19)
(20)
Hereafter, we condition on the above events.
We first show the following lemma. Recall
Lemma 20 (Optimism).
Set . For all , is a feasible solution of the constrained program. Furthermore, we have for any , where is the value link function selected by the algorithm in iteration .
Proof.
For any , we have
since is a value link function in . This is because
(IS sampling)
(First assumption in Definition 3)
(Second assumption in Definition 3)
Then, we have
(Uniform convergence result)
(Using the construction of algorithm)
(Uniform convergence)
∎
Remark 5.
Note that
holds for general link functions in Definition 6. Thus, the statement goes through even if we use Definition 6.
Next, we prove the following lemma to upper bound the per step regret.
Lemma 21.
For any , we have
Lemma 22.
Let . We have
Proof.
Lemma 23.
Proof.
We have
The first term is upper-bounded by . The second term is upper-bounded by
(First assumption in Definition 3)
From the first line to the second line, we use the definition of bilinear rank models. From the second line to the third line, we use . In the last line, we use the constraint on .
∎
Combining lemmas so far, we have
(Use Lemma 21)
(CS inequality)
(Use Lemma 22 and Lemma 23)
We set such that and . Then,
since when .
Finally, the following holds
(Plug in )
(Plug in )
Appendix F Sample Complexity for Finite Function Classes
Consider cases where and are finite and the PO-bilinear rank assumption is satisfied. When and are infinite hypothesis classes, and are replaced with their -covering numbers, respectively.
Theorem 13 (Sample complexity for discrete and ).
Let for any and the PO-bilinear rank assumption holds with PO-bilinear rank . By letting , with probability , we can achieve when we use samples at most
Here, are omitted.
Proof.
We derive the above result. First, we check the uniform convergence result. Then,
Thus, we need to set such that
where is some constant and
By organizing the term, the following is sufficient
Using Lemma 44, the following satisfies the condition:
Combining all together, the sample complexity is , i.e.,
Appendix G Sample Complexity in Observable HSE POMDPs
We revisit the existence of link functions by taking the norm constraint into account. Then, we consider the PO-bilinear decomposition with certain and . Next, we calculate the uniform convergence result. Finally, we show the sample complexity result.
We use the following assumptions
Assumption 7.
For any , the following holds:
1.
2. There exists a matrix such that (i.e., conditional embedding of the omission distribution),
3.
4. There exists a matrix such that (i.e., conditional embedding of the transition)
5. is finite.
We define
Existence of link functions.
We show value link functions exist. This is proved by noting
Thus, is a value link function. The radius of the parameter space is upper-bounded by . Hence, we set
Then, the realizability holds.
PO-bilinear decomposition.
Recall we derive the PO-bilinear decomposition in Section B.4. Consider a triple with and , we have:
which verifies the PO-bilinear structure, i.e.,
and shows that the PO-bilinear rank is at most . Thus, based on the above PO-bilinear decomposition, we set . This is because
and
In the above, we use Jensen’s inequality.
Uniform convergence.
To invoke Theorem 1, we show the uniform convergence result.
Lemma 24 (Uniform convergence of loss functions).
Let . Then, with probability ,
and
Proof.
Let . Define as an -net for . Then, . Then,
Besides, for fixed , , we have
Then, for , , we have
Hence, for any ,
By taking , we have :
Similarly,
∎
Finally, to obtain the PAC bound, we need to find such that
where is some constant and
By organizing the term, the following is sufficient:
Appendix H Sample Complexity in Observable Undercomplete Tabular POMDPs
We revisit the existence of value link functions. Then, we show the PO-bilinear rank decomposition. After showing the uniform convergence lemma, we calculate the sample complexity.
Existence of value link functions.
In the tabular case, by setting
where is a one-hot encoding vector over , we can regard the tabular model as an HSE-POMDP. Here is our assumption.
Assumption 8.
(a) , (b) is full-column rank and for any .
Note we use the -norm since this choice is more amenable in the tabular setting. However, even if the norm bound is given in terms of -norm, we can still ensure the PAC guarantee (this is because ).
Here, since we assume the reward lies in , value functions on the latent state belong to . Here, letting , value link functions exist by taking . Hence, we take
so that the realizability holds. Importantly, we can ensure since
for any . Note is contained in
(21)
This is because each is equal to for some vector . Here, denoting the component of corresponding to by , is a vector stacking for each . Then, we have
Hence, .
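The following tiny numerical check illustrates this construction in isolation, under our own conventions (the memory argument z is suppressed for simplicity, and all variable names are ours): when the observation matrix has full column rank, taking the link function as the pseudo-inverse transpose applied to the latent value reproduces that value in conditional expectation.

```python
import numpy as np

rng = np.random.default_rng(1)
S, O_dim = 3, 5
Obs = rng.dirichlet(np.ones(O_dim), size=S).T   # Obs[o, s] = O(o | s), full column rank a.s.
V = rng.uniform(0.0, 1.0, size=S)               # latent-state value (memory z suppressed)

g = np.linalg.pinv(Obs.T) @ V                   # candidate value link function g(o)

# E[g(o) | s] = sum_o O(o | s) g(o) should reproduce V(s) for every latent state s
assert np.allclose(Obs.T @ g, V, atol=1e-8)
```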
PO-Bilinear decomposition.
Next, recall we derive the PO-bilinear decomposition:
Then, and . We use . This is because
In the last line, we use (21).
Uniform convergence.
Then, we can obtain the following uniform convergence lemma.
Lemma 25.
Let and . Then, with probability ,
and
Proof.
Let .
Define as an -net for with respect to -norm. Define as an -net for with respect to the following norm:
Then, .
Let where is a one-hot encoding vector over . Then, when , we have
Besides, for fixed , , we have
Then, for , , we have
Hence, for any ,
By taking , we have
∎
Sample Complexity.
Finally, we obtain the PAC bound. We need to find such that
where is some constant and
By organizing terms, we get
Thus, we need to set
Hence, the sample complexity is
By some algebra, it is
Later, we prove that we can remove using a more refined analysis in Section N.
Global optimality.
Appendix I Sample Complexity in Observable Overcomplete Tabular POMDPs
To simplify the presentation, we focus on the case when .
Existence of value link functions.
In the tabular case, by setting
where is a one-hot encoding vector over and is a one-hot encoding vector over , we can regard the tabular model as an HSE-POMDP. Here is our assumption.
Assumption 9.
(a) , (b) is full-column rank and .
PO-bilinear decomposition.
Next, we derive the PO-bilinear decomposition:
Then, and . We use .
Sample Complexity.
Appendix J Sample Complexity in LQG
In this section, we derive the sample complexity in LQG. We first explain the setting. Then, we prove the existence of link functions; Lemma 5 is proved there. Furthermore, we show the PO-bilinear rank decomposition in LQG. Next, we show the uniform convergence result in LQG. Finally, by invoking Theorem 1, we calculate the sample complexity.
We study a finite-horizon discrete time LQG governed by the following equation:
where is a Gaussian noise with mean and covariance , and is a Gaussian noise with mean and . We use a matrix instead of to avoid notational confusion later.
where is the vector removing from and is a matrix mapping to . This is derived by
We suppose the system is always stable in the sense that the operator norm of is upper-bounded by . Here is the assumption we introduce throughout this section.
Assumption 10.
Suppose . Suppose for any . is full-column rank.
We present the form of linear mean embedding operators in LQGs.
Lemma 26 (Linear mean embedding operator).
Let . We have
Proof.
Here, we have
From the second line to the third line, we use the formula . This immediately concludes the result.
∎
Thus, the matrix has the left inverse when is full-column rank as follows:
We use a block matrix inversion formula:
J.1 Existence of Link Functions
Lemma 27 (Value functions in LQGs).
Let for . Then, a value function has a bilinear form:
For any , these parameters are defined recursively by
Proof.
The proof is completed by backward induction regarding , starting from level . First, we have
For the inductive step, supposing the statement is true at horizon , we have
where is a vector that removes the last component from and is a state at . Here, recall we have
Then, the statement is concluded by some algebra.
∎
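As a reference point for the recursion in Lemma 27, the following sketch gives the standard quadratic policy-evaluation recursion for a linear system x_{h+1} = A x_h + B a_h + w_h under a fixed linear policy a = K x, stated for per-step costs (negate for rewards). All matrices here are generic LQR stand-ins rather than the paper's memory-augmented notation, which performs the analogous computation on the extended state.

```python
import numpy as np

def quadratic_policy_evaluation(A, B, K, Q, R, Sigma_w, H):
    """Backward recursion for V_h(x) = x^T P_h x + c_h under a = K x.

    Per-step cost is x^T Q x + a^T R a; the Gaussian process noise with
    covariance Sigma_w only contributes to the constant term c_h.
    """
    d = A.shape[0]
    P_next, c_next = np.zeros((d, d)), 0.0
    A_cl = A + B @ K                                    # closed-loop dynamics
    for _ in range(H):
        P = Q + K.T @ R @ K + A_cl.T @ P_next @ A_cl    # quadratic coefficient
        c = c_next + float(np.trace(P_next @ Sigma_w))  # noise enters as a constant
        P_next, c_next = P, c
    return P_next, c_next
```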
Lemma 28 (Norm constraints on value functions).
We can set such that
Proof.
We have
Then,
Since we assume , this immediately leads to
Besides,
Thus,
∎
Next, we set the norm on the function class .
Lemma 29 (Realizability on LQGs).
We set
A function class includes at least one value link function for any linear policy for .
Proof.
Here, we have
The norm constraint on is decided by the following calculation:
Then, the norm on is decided by the following calculation:
From the first line to the second line, we use Lemma 43.
∎
J.2 PO-bilinear Rank Decomposition
Lemma 30 (Bilinear rank decomposition for LQG).
For any , we have the following bilinear rank decomposition:
where
Here, and depend on a policy . The following norm constraints hold:
Proof.
Hereafter, we suppose the expectation is always taken under . We also denote to simplify the presentation. Using this simplified notation, we get
Besides,
Then, the bilinear decomposition is clear by using
where is any vector and is any matrix.
First, we calculate the upper bounds of the norms.
From the third line to the fourth line, we use . Thus, .
Next, we consider . By some algebra, we can see
∎
Lemma 31 (Variance of marginal distribution).
Recall is a marginal distribution over at when we execute . The distribution is a Gaussian distribution with mean . The operator norm on the variance of is upper-bounded by .
Proof.
We first calculate the operator norm of the variance of . The variance is
The statement is immediately concluded. ∎
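For intuition, the bound in Lemma 31 corresponds to the following standard covariance recursion of the closed-loop state, which stays bounded when the operator norm of the closed-loop matrix is below one; the matrices are again generic stand-ins, assuming a zero initial state.

```python
import numpy as np

def closed_loop_covariance(A_cl, Sigma_w, H):
    """Covariance of x_H for x_{h+1} = A_cl x_h + w_h with x_0 = 0.
    With ||A_cl|| < 1, the result is bounded by sum_k ||A_cl||^{2k} ||Sigma_w||."""
    Sigma = np.zeros_like(Sigma_w)
    for _ in range(H):
        Sigma = A_cl @ Sigma @ A_cl.T + Sigma_w
    return Sigma
```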
Let . Recall . We define
Then, the final estimator is constructed by
This is equal to
where
We set
for any .
J.3 Uniform Convergence
Recall that
Besides, is included in
Lemma 32 (Concentration of loss functions).
With probability ,
is upper-bounded by
Proof.
Due to indicator functions, is bounded for any by
Thus, for fixed and , we can say that with high probability
Besides, we can consider a covering number with respect to -norm for the space of and since both are bounded. The radius of each space is upper-bounded by
Thus, by taking a union bound and considering the bias term due to the discretization as in the proof of Lemma 24, the statement is concluded. ∎
Lemma 33 (Bias terms 1).
The expectations of and are equal to
where
Proof.
We want to upper bound the difference of
and
By CS inequality, we have
We analyze the term (a) and the term (b). Before starting the analysis, note that follows a Gaussian distribution with mean and variance upper-bounded by
using Lemma 31. Besides, from Lemma 13. Note we can use a G-optimal design since we have a norm constraint on .
Regarding the term (a), by setting and properly, we can ensure it is upper-bounded by
Regarding the term (b), noting that higher-order moments of Gaussian distributions can always be upper-bounded, the term (b) is upper-bounded by . This concludes the statement. ∎
Lemma 34 (Bias terms 2).
Proof.
First Statement
We have
Here, by some algebra, there exists a vector
Thus, there exists and a vector such that
Recall we can write
Using the above,
Besides,
Thus,
In conclusion,
Second Statement
As we see in the proof of Lemma 33, the following term
is upper-bounded by . Hence,
(Definition)
(Statement of Lemma 33)
(First statement)
∎
J.4 Sample Complexity
Appendix K Sample Complexity in PSRs
To focus on the main point, we just use a one-step future. We first show the form of link functions to set a proper class for . Next, we show the PO-bilinear decomposition.
We make the following assumptions.
Assumption 11.
(a) is a core test set and is a minimum core test set, (b) for any , where is in .
K.1 Existence of Link Functions
Recall , where we use to denote the one-hot encoding vector over , and is a matrix in .
Then, is a value link function. This is because
Hence, we set to be
so that the realizability holds.
K.2 PO-bilinear Rank Decomposition
We show that PSR admits PO-bilinear rank decomposition (Definition 6). Here is the Bellman loss:
To analyze the above, we decompose it into three terms:
Let be a minimum core test. Here, for any future , there exists such that where is a -dimensional predictive state . This satisfies
(23)
where is a matrix whose -th row is as we see in Section D.
Term (c).
We have
where is a matrix whose -th row is . The existence of is ensured since is a core test.
Term (b).
We have
for some matrix . In the first inequality, we use the fact that the reward is a function of conditioned on the whole history. From the first line to the second line, we use a property of core tests.
Term (a).
We have
for some matrix . Then, the above is further equal to
for some matrix . From the first line to the second line, we use in (23).
Summary.
Combining all terms, there exists a matrix such that
Here, we suppose for any . Besides,
Thus, we can set .
K.3 Sample Complexity
Suppose are finite and rewards at lie in . Assume the realizability holds. Then,
Following the calculation in Section F, the sample complexity is
Here, there is no explicit dependence on . Note that the worst-case sample complexity of is and the worst-case sample complexity of is .
K.4 Most General Case
We consider the general case in Section D. Let be a function class consisting of where satisfies and . When the realizability holds, we would get
Here, there is no explicit dependence on . Note that the worst-case sample complexity of is and the worst-case sample complexity of is .
Appendix L Proof of Theorem 8
We fix the parameters as in Theorem 8. Let
From the assumption, with probability , we have
(24)
(25)
We first show the following lemma. Recall
Lemma 35 (Optimism).
Set . For all , is a feasible solution of the constrained program. Furthermore, we have for any , where is the value link function selected by the algorithm in iteration .
Proof.
For any , we have
since is a value link function in , noting condition (c) in Definition 7. Thus,
using (24) noting . Hence, is a feasible set for any and any .
Then, we have
(Uniform convergence result)
(Using the construction of algorithm)
(Uniform convergence)
∎
Next, we prove the following lemma to upper bound the per step regret.
Lemma 36.
For any , we have
From Lemma 22, we have
Lemma 37.
Proof.
We have
The first term is upper-bounded by . The second term is upper-bounded by
From the first line to the second line, we use (b) in Definition 7. From the second line to the third line, we use the fact that is a non-decreasing function. In the last line, we use the constraint on .
∎
Combining lemmas so far, we have
(CS inequality)
We set such that and . Then,
since when .
Finally, the following holds
(Plug in )
(Plug in )
Appendix M Sample Complexity in -step Decodable POMDPs
We first give a summary of our results. Then, we show that an -step decodable POMDP is a PO-bilinear rank model. After showing the uniform convergence of the loss function with fast rates, we calculate the sample complexity. Since we use squared loss functions, we need to modify the proof of Theorem 1.
M.1 PO-bilinear Rank Decomposition (Proof of Lemma 10)
In this section, we derive the PO-bilinear decomposition of -step decodable POMDPs (Lemma 10).
First, we define moment matching policies following [20]. We denote .
Definition 10 (Moment Matching Policies).
For , we define
where and . For an -step policy and , we define the moment matching policy :
Note the expectation in the right hand side is taken under a policy .
Thus, the first condition in Definition 7 (i.e., (6)) is satisfied.
Next, we show the second condition in Definition 7 (i.e., (7)). This is proved as follows:
(26)
(Jensen’s inequality)
(27)
From the first line to the second line, we use [20, Lemma B.2]. From the third to the fourth line, we use the Bellman completeness assumption: . From the fourth line to the fifth line, we use importance sampling.
M.2 Uniform Convergence
We define the operator
and
Lemma 38 (Uniform Convergence).
Let . Suppose for . Fix .
1. Take a true link function . Then, it satisfies
2. Suppose satisfies
and the Bellman completeness holds. Then, with probability , we have
Proof.
To simplify the notation, we define
Given , we define as the maximizer:
In this proof, the expectation is always taken with respect to the data generating process . We first observe
Then, we define
As the first step, we prove with probability
(29) |
We first fix . Then, from the definition of and the Bellman completeness , we have
(30) |
Here, we invoke Bernstein’s inequality:
(31) |
Hereafter, we condition on the above event. Then, combining (30) and (31), we have
Here, we use
Thus, by some algebra,
Besides,
Lastly, by a union bound over , the statement (29) is proved. Note .
First Statement.
From the second line to the third line, we use (29).
Second Statement.
Now, we use the assumption on :
From what we showed in (29), this implies
Recall that we want to upper-bound the error for . Here, we use the following observation later:
We use Bernstein’s inequality: with probability , for any ,
Here, we use
Hereafter, we condition on the above event.
Finally, we have
Hence,
∎
M.3 Proof of Main Statement
We define
Let
Then, from the first statement in Lemma 38, with probability ,
(32)
Besides, from the second statement in Lemma 38, for , when satisfies
we have
(33) |
We first show the optimism. Recall
Lemma 39 (Optimism).
Set . For all , is a feasible solution of the constrained program. Furthermore, we have for any .
Proof.
For any , letting be a corresponding value link function, we have
using (32). This implies
Hence, is a feasible set for any . Then, we have
∎
Next, recall the following two statements, which are proved as in the proof of Theorem 1.
- For any ,
- Let . We have
Lemma 40.
Proof.
The rest of the argument is the same as the proof in Theorem 1. Finally, the following holds
Sample Complexity Result.
We want to find such that
where
By organizing terms, we have
Thus, setting the following is enough:
The total number of samples we use is
Appendix N Sample Complexity in Observable POMDPs with Latent Low-rank Transition
This section largely follows the one in Section M.
N.1 Existence of Value Link Functions
Since we consider the discrete setting, we can set the value link function class as in Section H. Hence, we set
Then, we can ensure . Then, from the construction of , we can also ensure .
N.2 PO-bilinear Rank Decomposition (Proof of Lemma 11)
In this section, we derive the PO-bilinear decomposition of observable POMDPs with the latent low-rank transition. We want to prove Lemma 11. Recall .
Next, we show the second condition in Definition 7. This is proved as follows:
From the first line to the second line, we use [20, Lemma B.2]. From the third to the fourth line, we use the Bellman completeness assumption: . From the fourth line to the fifth line, we use importance sampling.
The third condition
is easily proved.
Finally, the following norm constraints hold:
Sample Complexity Result.
Following the same procedure as in Section M, we want to find such that
where
By organizing terms, we have
Thus, setting the following is enough
The total number of samples we use is
Finally, we plug in .
Appendix O Exponential Stability for POMDPs with Low-rank Transition
In this section, we prove that a short-memory policy is a globally near-optimal policy in low-rank POMDPs. We first introduce some notation. Next, we prove the exponential stability of Bayesian filters, which immediately leads to the main statement.
Notation.
Given a belief , an action and observation pair , we define the Bayesian update as follows. We define as the operation that incorporates observation , i.e., with , and as the operation that incorporates the transition, i.e., . Finally, we denote as the full Bayesian filter, i.e.,
Let us denote as the initial latent state distribution. Given the first observation , we denote as the initial belief of the system conditioned on the first observation . Given two beliefs , we define the distance
Consider a POMDP whose latent transition is low-rank, i.e., . For notational simplicity, we still consider discrete state, action, and observation spaces to avoid using measure-theoretic language.
Design of initial distribution.
We want to design a good initial distribution for an artificial Bayesian filter that ignores all history other than the short memory.
The following lemma, from [24, Lemma 4.9], quantifies the contraction of the Bayesian map.
Lemma 41 (Contraction property of beliefs).
Suppose and . Then we have:
Next, we compute the G-optimal design using the feature . Denote the G-optimal design as . Here, we use the assumption for any in Assumption 6, which ensures that lives in a compact space for any . The property is given in Theorem 12. In summary, the support of (denoted by ) contains at most points, and for any , there exists such that
(34) |
where we denote the points on the support as .
We set our “empty” belief as follows:
Note that this belief does not depend on any history. We aim to bound using the following lemma, where is some belief resulting from applying for any to a belief . This is a newly introduced lemma.
Lemma 42 (Distance between the actual belief and the designed initial distribution).
For any distribution that results from a previous belief and a one-step latent transition under action , i.e., , we have:
Proof.
For any , using its definition, we have:
(Definition)
(Property of G-optimal design)
Similarly, the construction of implies that , thus, we have:
(Use property of the G-optimal design (34))
Thus, . ∎
Theorem 14 (Exponential stability for POMDPs with Low-rank Latent Transition).
Consider a . Consider any policy (fully history-dependent) and a trajectory for . Denote by the (true) belief conditioned on . For the approximated belief, first for , we define as:
for , we define as:
Then we have:
Proof.
We define
Hereafter, we omit to simplify the notation.
We start from the base case (i.e., ).
First case, consider , . Denote . From Lemma 42, we know that:
Thus, noting and , we have:
(From Lemma 41)
which implies the base case:
Now for any , we have:
(Data processing inequality from [24, Lemma 2.7])
This completes the induction step. Adding back the expectation with respect to the history, we conclude the proof.
When , we simply start with the original belief . For any , we simply set , thus the conclusion still holds.
∎
The above Theorem 14 indicates that, in order to approximate the ground-truth belief conditioned on the entire history, we only need to apply the Bayesian filter on the M-step memory starting from a fixed distribution . The existence of such a is proven by construction, where we rely on the low-rankness of the latent transition and a G-optimal design over using the feature .
Theorem 14, together with the proof of Theorem 1.2 in [24], immediately implies that for (with being some absolute constant), there must exist an M-memory policy such that ; thus, a globally optimal policy can be approximated by a policy that only relies on short memories.
Appendix P Auxiliary Lemmas
We use the following in Section 4.3.
Lemma 43 (Useful inequalities).
-
- When and are positive semi-definite matrices, we have
The following lemma is useful when we calculate the sample complexity.
Lemma 44.
The following is satisfied
when
for some constant .