Provably Efficient Reinforcement Learning in
Partially Observable Dynamical Systems
Abstract
We study Reinforcement Learning for partially observable dynamical systems using function approximation. We propose a new Partially Observable Bilinear Actor-Critic framework that is general enough to include models such as observable tabular Partially Observable Markov Decision Processes (POMDPs), observable Linear-Quadratic-Gaussian (LQG), Predictive State Representations (PSRs), as well as a newly introduced model, Hilbert Space Embeddings of POMDPs, and observable POMDPs with latent low-rank transition. Under this framework, we propose an actor-critic style algorithm that is capable of performing agnostic policy learning. Given a policy class that consists of memory-based policies (which look at a fixed-length window of recent observations) and a value function class that consists of functions taking both memory and future observations as inputs, our algorithm learns to compete against the best memory-based policy in the given policy class. For certain examples such as undercomplete observable tabular POMDPs, observable LQGs, and observable POMDPs with latent low-rank transition, by implicitly leveraging their special properties, our algorithm is even capable of competing against the globally optimal policy without paying an exponential dependence on the horizon in its sample complexity.
1 Introduction
Large state spaces and partial observability are two key challenges of Reinforcement Learning (RL). While recent advances in RL for fully observable systems have focused on the challenge of scaling RL to large state spaces, in both theory and practice, using rich function approximation, the understanding of large-scale RL under partial observability is still limited. In POMDPs, for example, a core issue is that the optimal policy is not necessarily Markovian since the observations are not Markovian.
A common heuristic to tackle large-scale RL with partial observability in practice is to simply maintain a time window of the history of observations, which is treated as a state to feed into the policy and the value function. Such a window of history can often be maintained explicitly by truncating away older history (e.g., DQN uses a window of length 4 for playing video games [49]; OpenAI Five uses a window of length 16 for LSTMs [4]). Since, even for planning under partial observations and known dynamics, finding the globally optimal policy conditioned on the entire history is generally NP-hard (due to the curse of history) [44, 55, 24], searching for a short memory-based policy can be understood as a reasonable middle ground that balances computation and optimality. The impressive empirical results of these prior works also demonstrate that in practice, there often exists a high-quality policy (not necessarily the globally optimal one) that is only a function of a short window of recent observations. However, these prior works that search for the best memory-based policy unfortunately cannot ensure sample-efficient PAC guarantees due to the difficulty of strategic exploration in POMDPs. The key question that we aim to answer in this work is:
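As a concrete, purely illustrative sketch of this windowing heuristic, the following Python snippet wraps an arbitrary policy so that it acts on a fixed-length window of recent observation-action pairs; the callable `policy_fn` and the window length are placeholders and not part of the formal framework developed below.

```python
from collections import deque

class MemoryPolicy:
    """Wrap a policy that acts on a fixed-length window of recent
    observation-action pairs (an M-memory policy)."""

    def __init__(self, policy_fn, memory_len):
        # policy_fn maps (window, current observation) to an action; both the
        # function and the window length are placeholders for this sketch.
        self.policy_fn = policy_fn
        self.window = deque(maxlen=memory_len)

    def reset(self):
        self.window.clear()

    def act(self, observation):
        action = self.policy_fn(tuple(self.window), observation)
        self.window.append((observation, action))
        return action
```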
Can we design provably efficient RL algorithms that agnostically learn the best fixed-length memory based policy with function approximation?
We provide affirmative answers to the above question. More formally, we study RL for partially observable dynamical systems that include not only the classic Partially Observable MDPs (POMDPs) [52, 56, 64], but also a more general model called Predictive State Representations (PSRs) [46]. We design a model-free actor-critic framework, named PO-Bilinear Actor-Critic Class, where we have a policy class (i.e., actors) that consists of policies that take a fixed-length window of observations as input (memory-based policy), and a newly introduced value link function class (i.e., critics) that consists of functions that take the fixed-length window of history and (possibly multi-step if the system is overcomplete) future observations as inputs. A value link function class is an analog of the value function class tailored to partially observable systems that only relies on observable quantities (i.e., past and future observations and actions). In our algorithm, we agnostically search for the best memory-based policy from the given policy class.
Our framework is based on a newly introduced notion of value link functions equipped with future observations. While the idea of using future observations has appeared in the literature on POMDPs, our work is the first to use this idea to learn a high-quality policy in a model-free manner. Existing works discuss how to use future observations only in a model-based manner [8, 25]. While recent works that leverage these model-based viewpoints discuss strategic exploration to learn near-optimal policies, their results are either limited to the tabular setting (and are not scalable to large state spaces) [36, 22, 3, 74, 43] or are tailored to specific non-tabular models where it is unclear how to incorporate general function approximation [65, 42, 12]. We break these barriers by devising a new actor-critic-based, model-free view on POMDPs. We demonstrate the scalability and generality of our PO-bilinear actor-critic framework by showing PAC guarantees for many models as follows (see Table 1 for a summary).
Observable Tabular POMDPs. In tabular observable POMDPs, i.e., POMDPs where multi-step future observations retain information about latent states, the PO-bilinear rank decomposition holds. We can ensure the sample complexity is , where ( is an emission matrix), , and are the cardinalities of the state, action, and observation spaces, respectively, is the horizon, and is the number of future observations. (In Section 8, we discuss how to get rid of using a model-based learning perspective. The intuition is that a tabular POMDP's model complexity has nothing to do with or , i.e., the number of parameters in the transition and emission distributions is (even if we consider the time-inhomogeneous setting, it scales with , but not with or ), and the PO-bilinear rank is still .) In the special undercomplete () case, our framework is also flexible enough to set the memory length according to the properties of the problem in order to search for the globally optimal policy. More specifically, using the latest result from [24] about belief contraction, we can set with being the optimality threshold. This allows us to compete against the globally optimal policy without paying an exponential dependence on .
Observable Linear Quadratic Gaussian (LQG). In observable LQG, a classic partially observable linear dynamical system, our algorithm can compete against the globally optimal policy with a sample complexity scaling polynomially with respect to the horizon, the dimensions of the state, observation, and action spaces (and other system parameters). This is achieved by simply setting the memory length to . The special linear structure of the problem allows us to avoid an exponential dependence on even when using the full history as memory. While global optimality results for tabular POMDPs and for LQG exist via different algorithms, to the best of our knowledge, this is the first unified algorithm that can solve both tabular POMDPs and LQG simultaneously without paying an exponential dependence on the horizon .
Observable Hilbert Space Embedding POMDPs (HSE-POMDPs). Our framework ensures an agnostic PAC guarantee for HSE-POMDPs, where policy-induced transitions and emission distributions have conditional mean embeddings [8, 6]. This model naturally generalizes tabular POMDPs and LQG. We show that the sample complexity scales polynomially with respect to the dimensions of the embeddings. This is the first PAC guarantee for HSE-POMDPs.
Predictive State Representations (PSRs). We give the first PAC guarantee for PSRs. PSRs model partially observable dynamical systems without even using the concept of latent states and strictly generalize the POMDP model. Our work significantly generalizes a prior PAC learning result for reactive PSRs, a much more restricted setting that requires the strong condition that the optimal policy depends only on the latest observation [33].
-step decodable POMDPs [20]. Our framework can capture -step decodable POMDPs, where there is an (unknown) decoder that can perfectly decode the latent state by looking at the latest -memory. Our algorithm can compete against the globally optimal policy with a sample complexity scaling polynomially with respect to the horizon , , and the statistical complexities of the function classes, without any explicit dependence on . This PAC result is similar to the one from [20].
Observable POMDPs with low-rank latent transition. Our framework captures observable POMDPs where the latent transition is low-rank. This is the first PAC guarantee for this model. Under this model, we first show that with , where is the rank of the latent transition matrix, there exists an -memory policy that is -near optimal with respect to the globally optimal policy. Then, starting with a general model class that contains the ground truth transition and emission distribution (i.e., realizability in the model class), we first convert the model class to a policy class and a value link function class, and we then show that our algorithm competes against the globally optimal policy with a sample complexity scaling polynomially with respect to , and the statistical complexity of the model class. In particular, the sample complexity has no explicit dependence on the size of the state and observation spaces; instead, it depends only on the statistical complexity of the given model class.
[Table 1: Summary of the models captured by the PO-bilinear actor-critic framework, together with their PO-bilinear ranks. PAC learning results were previously known for three of the models and are new to this work for the other three.]
1.1 Related Works
Generalization and function approximation of RL in MDPs.
In Markovian environments, there is a growing literature that gives PAC bounds with function approximation under certain models. Some representative models are linear MDPs [36, 76], block MDPs [16, 48, 78], and low-rank MDPs [2, 71]. Several general frameworks in [33, 61, 35, 21, 17] characterize sufficient conditions for provably efficient RL, and each of the above models is captured by these frameworks as a special case. While our work builds on the bilinear/Bellman rank framework [17, 33], when we naïvely reduce POMDPs to MDPs, the bilinear/Bellman rank is . These two frameworks are only shown to be applicable to reactive POMDPs, where the optimal policy only depends on the latest observation. However, this assumption makes the POMDP model very restricted.
Online RL for POMDPs.
Prior works [39, 19] showed -type sample complexity bounds for general POMDPs. The exponential dependence can be circumvented with more structure. First, in the tabular setting, under observability assumptions, [3, 22, 34, 43, 23] obtain favorable sample complexities by leveraging the spectral learning technique [29] (see Section 1.1 in [34] for an excellent summary). Second, for LQG, which is a partially observable version of LQR, sub-linear regret algorithms are proposed in [42, 65]. These works use random policies for exploration, which is sufficient for LQG. Since a random exploration strategy is not enough for tabular POMDPs, it is unclear whether the existing techniques from LQG can be applied to solve general POMDPs. Third, the recent work [20] provides a new model called the -step decodable POMDP (when , it reduces to a Block MDP) together with an efficient algorithm.
Our framework captures all of the above-mentioned POMDP models. In addition, we propose a new model called HSE-POMDPs, which extends prior work on HSE-HMMs [6] to POMDPs and includes LQG and tabular POMDPs. Our algorithm delivers the first PAC bound for this model.
System identification for uncontrolled partially observable systems.
There is a long line of work on system identification for uncontrolled partially observable systems, among which the spectral learning based methods are related to our work [72, 29, 58, 8, 28, 54, 6, 38, 25, 67]. Informally, these methods leverage the high-level idea that under some observability conditions, one can use the sufficient statistics of (possibly multi-step) future observations as a surrogate for the belief states, thus allowing the learning algorithms to ignore the latent state inference and completely rely on observable quantities. Our approach shares a similar spirit in the sense that we use sufficient statistics of future observations to replace latent states, and our algorithm only relies on observable quantities. The major difference is that these prior works only focus on passive system identification for uncontrolled systems, while we need to find a high-performance policy by actively interacting with the systems for information acquisition.
Reinforcement learning in PSRs.
PSRs [32, 46, 62, 8, 68] are models that generalize POMDPs. PSRs also rely on the idea of using the sufficient statistics of multi-step future observations (i.e., predictive states) to serve as a summary of the history. Prior works on RL for PSRs [8, 38, 14, 45, 30] do not address the problem of strategic exploration and operate under the assumption that a pre-collected diverse training dataset is given and that the data collection policy is a blind policy (i.e., it does not depend on the history of observations). To our knowledge, the only existing PAC learning algorithm for PSRs is limited to a reactive PSR model where the optimal policy depends only on the latest observation [33]. Our framework captures standard PSR models that are strictly more general than reactive PSRs.
Value link functions.
Analogues of value link functions (referred to as bridge functions) are used in the literature on causal inference (offline contextual bandits) [50, 13, 11, 40, 53, 60, 75] and offline RL with unmeasured confounders [7, 66]. However, their settings are not standard POMDPs in the sense that they consider POMDPs with unmeasured confounders following [69]. Our setting is a standard POMDP without unmeasured confounders. Here, we emphasize that their setting does not capture ours. More specifically, taking [66] as an example, they require that the logged data is generated by policies that can depend on latent states but cannot depend on observable states. Thus, their definition of link functions (called bridge functions) is not applicable to our setting, since the data we use is clearly generated by policies that depend on observations. Due to this difference, their setting prohibits using future observations, unlike ours. Finally, we stress that our work is online, while their setting is offline. Hence, they do not discuss any methods for exploration.
1.2 Organization
In Section 2, we introduce the notation, the definition of POMDPs, and our function-approximation setup, namely the policy and value link function classes. In Section 3, we define value link functions and the PO-bilinear actor-critic class. In Section 4, we give examples that admit the PO-bilinear actor-critic class, including observable undercomplete tabular POMDPs, observable overcomplete tabular POMDPs, observable LQG, and observable HSE-POMDPs. In Section 5, we give a unified algorithm for the PO-bilinear actor-critic class and the sample complexity of the algorithm. We also instantiate this general result for the examples presented in Section 4. In Section 6, we show that PSRs, which are more general models than POMDPs, also admit a PO-bilinear rank decomposition. In Section 7, we give a more general definition of the PO-bilinear actor-critic class, followed by showing that two additional examples, -step decodable POMDPs and observable POMDPs with low-rank latent transition, fall into this general definition (Section 8). Both examples use general nonlinear function approximation, and their sample complexities do not explicitly depend on the size of the state and observation spaces, but only on the statistical complexities of the function classes. As a by-product, we can refine the sample complexity result in the tabular case in Section 5. Most of the proofs are deferred to the Appendix.
2 Preliminary
We introduce background for POMDPs here and defer the introduction of PSRs to Section 6. We consider an episodic POMDP specified by , where is the unobserved state space, is the observation space, is the action space, is the horizon, is the transition probability, is the emission probability, and is the reward. Here, are unknown distributions. For notational simplicity, we consider the time-homogeneous case in this paper; the extension to the time-inhomogeneous setting is straightforward.
In our work, we consider -memory policies. Let and . An element is represented as , and an element is represented as (thus, ). Figure 1 illustrates this situation. An -memory policy is defined as where each is a mapping from to a distribution over actions .
In a POMDP, an -memory policy generates data as follows. Each episode starts with an initial state sampled from some unknown distribution. At each step , from , the agent observes , executes action , receives reward , and transitions to the next latent state . Note that the agent does not observe the underlying states but only the observations . We denote by the value of the policy , i.e., , where the expectation is taken w.r.t. the stochasticity of the policy , the emission distribution , and the transition dynamics .
We define a value function for a policy at step to be the expected cumulative reward to go under the policy starting from a and , i.e. where . The notation means the expectation is taken under a policy from to . Compared to the standard MDP setting, the expectation is conditional on not only but also since we consider -memory policies. The corresponding Bellman equation for is .
The Actor-critic function approximation setup.
Our goal is to find a near-optimal policy that maximizes the policy value in an online manner. Since any POMDP can be converted into an MDP by setting the state at level to the observable history up to , any off-the-shelf provably efficient online algorithm for MDPs can be applied to POMDPs. By defining as the whole history up to step (i.e., a history is of the form ), these naïve algorithms ensure that the output policies can compete against the globally optimal policy , where . However, this conversion results in an error with exponential dependence on the horizon , which is prohibitively large in the long-horizon setting.
Instead of directly competing against the globally optimal policy, we aim for agnostic policy learning, i.e., competing against the best policy in a given -memory policy class. Our function approximation setup consists of two function classes: a policy class consisting of -memory policies where (i.e., actors), and a set of value link functions where , whose role is to approximate (i.e., critics). Our goal is to provide an algorithm that outputs a policy with low excess risk, where the excess risk is defined by , where is the best policy in the class . To motivate this agnostic setting, we note that -memory policies are widely used in practice, e.g., DQN [49] sets . Besides, there are natural examples where -memory policies are close to the globally optimal policy with being only polynomial with respect to other problem-dependent parameters, e.g., observable POMDPs [24] and LQG [42, 65, 51]. We will show the global optimality in these two examples later, without any exponential dependence on in the sample complexity.
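For concreteness, writing $J(\pi)$ for the value of a policy $\pi$ (notation introduced here only for illustration), the agnostic guarantee we aim for is

$$ J(\pi^\star_{\Pi}) - J(\hat{\pi}) \;\le\; \epsilon, \qquad \pi^\star_{\Pi} \in \operatorname*{argmax}_{\pi \in \Pi} J(\pi), $$

i.e., the excess risk of the learned policy $\hat{\pi}$ relative to the best $M$-memory policy in the given class is small.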
Remark 1 (Limits of existing MDP actor-critic framework).
While the general actor-critic framework proposed for MDPs [33] is applicable to POMDPs via the naïve POMDP-to-MDP reduction, it is unable to leverage any benefit from the restricted policy class. This naïve reduction (from POMDP to MDP) uses the full history and will incur a sample complexity that scales exponentially with respect to the horizon.
Additional notation. Let and . Given a matrix , we denote its pseudo-inverse by and its operator norm by . We define the norm . The outer product is denoted by . Let be the marginal distribution at and be the Dirac delta function. We denote the policy by . We denote a uniform action by . Given a function class , we define .
3 Value Link Functions and the PO-bilinear Framework
Unlike MDPs, we cannot directly work with value functions (or Q functions) in POMDPs, since they depend on the unobserved state . To handle this issue, below we first introduce new value link functions by using future observations, and then discuss the PO-bilinear framework.
3.1 Value Link Functions
Definition 1 (K-step value link functions).
Fix a set of policies where . Value link functions at step for a policy are defined as the solution to the following integral equation:
where the expectation is taken under the policy .
Link functions do not necessarily exist, nor do they need to be unique. At an intuitive level, K-step value link functions are embeddings of the value functions onto the observation space, and their existence essentially means that K-step futures carry sufficient information to recover the latent-state-dependent value function. The proper choice of depends on the underlying model. For example, we use the uniform policy in the tabular case, and in LQG. For notational simplicity, we mostly focus on the case of , though we will also discuss the general case of . The simplified definition for 1-step link functions is provided below. Note that this definition is agnostic to .
Definition 2 (1-step value link functions).
One-step value link functions at step for a policy are defined as the solution to the following integral equation:
(1)
In Section 4, we will demonstrate the form of the value link function for various examples. The idea of encoding latent state information using the statistics of (multi-step) futures has been widely used in learning models of HMMs [58, 29], PSRs [8, 6, 25, 67], and system identification for linear systems [72]. Existing provably efficient (online) RL works for POMDPs elaborate on this viewpoint [36, 22, 3]. Compared to them, the novelty of link functions is that they are introduced to recover value functions rather than models. This model-free view differs from the dominant model-based view in existing online RL for POMDPs. In our setup, we can control systems if we can recover value functions on the underlying states, even if we fail to identify the underlying model.
3.2 The PO-Bilinear Actor-critic Framework for POMDPs
With the definition of value link functions, we are now ready to introduce the PO-bilinear actor-critic (AC) class for POMDPs. We will focus on the case of here. Let , where , be a class consisting of functions that satisfy the following realizability assumption w.r.t. the policy class .
Assumption 1 (Realizability).
We assume that is realizable w.r.t. the policy class , i.e., , there exists at least one such that is a value link function w.r.t. the policy . Note that realizability implicitly requires the existence of link functions.
We next introduce the PO-Bilinear Actor-critic class. For each level , we first define the Bellman loss:
given M-memory policies and . Letting be a link function for , our key observation is that value link functions satisfy
for any M-memory roll-in policy and any evaluation pair . This is an analog of the Bellman equation in MDPs. The above equation tells us that is a valid loss to quantify how far the estimator is from . When has low-rank structure in a suitable sense, we can efficiently learn a near-optimal M-memory policy. The following definition precisely quantifies the low-rank structure that we need for sample-efficient learning.
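To make the discussion concrete, one plausible rendering of this Bellman-equation analog (the notation below, with $z_{h-1}$ for the length-$M$ memory at step $h$, $g_h$ for the candidate link function, and $\pi'$ for the roll-in policy, is introduced here only as an illustration) is

$$\mathrm{Br}_h(\pi, g; \pi') \;:=\; \mathbb{E}\Big[\, g_h(z_{h-1}, o_h) - r_h - g_{h+1}(z_h, o_{h+1}) \;\Big|\; a_{1:h-1} \sim \pi',\ a_h \sim \pi \,\Big],$$

which equals zero whenever $g$ is a value link function for $\pi$, regardless of the roll-in policy $\pi'$.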
Definition 3 (PO-bilinear AC Class, ).
The model is a PO-bilinear Actor-critic class of rank if is realizable, and there exist and such that for all and ,
1. ;
2. for any and the corresponding value link function .
We define as the PO-bilinear rank.
While the above definition is enough to capture most of the examples we discuss later in this work, including undercomplete tabular POMDPs, LQG, HSE-POMDPs, we provide two useful extensions. The first extension incorporates discriminators into the framework, which can be used to capture the M-step decodable POMDPs and POMDPs with low-rank latent transition (see Section 7). The second extension incorporates multi-step futures, which can be used to capture overcomplete POMDPs and general PSRs. In the next section, we introduce the multi-step future version.
3.3 PO-bilinear Actor-critic Class with Multi Step Future
In this section, we provide an extension to Definition 3 to incorporate multiple-step futures (i.e., ). For simplicity, we assume that .
The definition is then as follows. The main difference is that we roll out a policy , times to incorporate multi-step link functions. We introduce the notation
Then, combining the Bellman equation for state-value functions and the definition of K-step link functions, we have
Thus, by taking expectations further with respect to (i.e., can be sampled from some roll-in policy), we have
Hence, the Bellman loss of a pair under a roll-in denoted by at is defined as
The above is a proper loss function when we use multi-step futures. Here is the structure we need for .
Definition 4 (PO-bilinear AC Class for POMDPs with multi-step future).
The model is a PO-bilinear class of rank if is realizable (regarding general K-step link functions), and there exist and such that for all and ,
1. We have: ;
2. for any and the corresponding value link function in .
We define as the PO-bilinear rank.
4 Examples of PO-Bilinear Actor-critic Classes
We consider three examples (observable tabular POMDPs, LQG, and HSE-POMDPs) that admit a PO-bilinear rank decomposition. Our framework can also capture PSRs and -step decodable POMDPs, whose discussions are deferred to Section 6 and Section 8, respectively. We mainly focus on one-step futures, i.e., , and briefly discuss the extension to in the tabular case. In this section, except for LQG, we assume for any . All the missing proofs are deferred to Section B.
4.1 Observable Undercomplete Tabular POMDPs
Example 1 (Observable undercomplete tabular POMDPs).
Let where the entry indexed by a pair is defined as . Assume that , which we call observability. This requires undercompleteness, .
The following lemma shows that being full rank implies the existence of value link functions.
Lemma 1.
For Example 1, there exists a one-step value link function for any and .
Proof.
Consider any function (thus, this captures all possible ). Denote as the one-hot encoding of over (similarly for ). We have , where we use the assumption that and thus . Then,
(2)
which means that the value link function corresponding to is . ∎
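As a numerical illustration of this construction (a minimal sketch with a randomly generated emission matrix; the sizes and the value vector are arbitrary), the following Python snippet builds the link function via the pseudo-inverse of the emission matrix and checks the defining equation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_obs = 3, 5                  # undercomplete: |S| <= |O| (sizes arbitrary)

# Emission matrix O with O[o, s] = P(o | s); columns sum to one.
O = rng.random((n_obs, n_states))
O /= O.sum(axis=0, keepdims=True)
assert np.linalg.matrix_rank(O) == n_states   # observability: full column rank

# A value vector v(s) = V(z, s) for some fixed memory z (here: random).
v = rng.random(n_states)

# Link function g(o) = ((O^+)^T v)(o), as in the proof of Lemma 1.
g = np.linalg.pinv(O).T @ v

# Defining equation: E[g(o) | s] = sum_o P(o | s) g(o) = v(s) for every s.
assert np.allclose(O.T @ g, v)
print("E[g(o) | s]:", np.round(O.T @ g, 4), " v:", np.round(v, 4))
```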
We next show that the PO-Bilinear rank (Definition 3) of tabular POMDPs is bounded by .
Lemma 2.
Assume is full column rank. Set the value link function class for certain , and the policy class . Then, the model is a PO-bilinear AC class (Definition 3) with PO-bilinear rank at most .
Later, in Section 7, we will see that the PO-bilinear rank in the more general definition is just . This fact will result in a significant improvement in terms of the sample complexity, yielding a bound that does not incur .
4.2 Observable Overcomplete Tabular POMDPs
We consider overcomplete POMDPs with multi-step futures. The proofs are deferred to Section B.2. We have the following lemma, which is a generalization of Lemma 1: when , it reduces to Lemma 1.
Lemma 3.
Define a -dimensional matrix whose entry indexed by and is equal to . When this matrix is full-column rank, K-step link functions with respect to exist.
Note that a sufficient condition for the above is that the matrix whose entry indexed by and is equal to is full column rank for a certain . It says that there is an (unknown) action sequence of length that retains information about the latent states.
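As an illustration of this condition (a sketch only; the instance below is a small randomly generated overcomplete POMDP, and whether full column rank holds is instance-dependent), the following snippet builds the matrix of two-step future probabilities, indexed by an action and two observations versus the latent state, and checks its column rank:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
nS, nO, nA = 4, 2, 2           # overcomplete (|O| < |S|), two-step futures; sizes arbitrary

# Random tabular POMDP: T[a][s_next, s] = P(s_next | s, a), Ob[o, s] = P(o | s).
T = rng.random((nA, nS, nS)); T /= T.sum(axis=1, keepdims=True)
Ob = rng.random((nO, nS));    Ob /= Ob.sum(axis=0, keepdims=True)

# Rows indexed by the future (a_h, o_h, o_{h+1}); columns indexed by the state s_h.
rows = []
for a, o1, o2 in product(range(nA), range(nO), range(nO)):
    # P(o_h = o1, o_{h+1} = o2 | s_h; do(a_h = a))
    rows.append(Ob[o1, :] * (Ob[o2, :] @ T[a]))
M = np.stack(rows)             # shape (|A| * |O|^2, |S|)

# Full column rank of this matrix is the condition in Lemma 3 for K = 2.
print("rank:", np.linalg.matrix_rank(M), "out of", nS)
```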
We next calculate the PO-bilinear rank. Importantly, this does not depend on and .
Lemma 4.
Set a value link function class for certain and a policy class . Then, the model satisfies PO-bilinear rank condition with PO-bilinear rank (Definition 4) at most .
Note that the bilinear rank is still (just in the more general definition in Section 7). Crucially, it does not depend on the length of futures .
4.3 Observable Linear Quadratic Gaussian
The next example is Linear Quadratic Gaussian (LQG) with continuous state and action spaces. The details are deferred to Section J. Here, we set so that the policy class contains the globally optimal policy.
Example 2 (Linear Quadratic Gaussian (LQG)).
Consider LQG:
where are Gaussian distributions with mean and variances and , respectively, and , and , and are positive definite matrices.
We define the policy class as the linear policy class , where is the dimension of . This choice is natural since the globally optimal policy is known to be linear with respect to the entire history [5, Chapter 4]. We define two quadratic features, with , and with . We have the following lemma.
Lemma 5 (PO-bilinear rank of observable LQG).
Assume . Then, the following holds:
• For any policy linear in , a one-step value link function exists, and is linear in .
• Letting be the dimension of , we set and to be linear in . Then LQG satisfies Definition 3 with PO-bilinear rank at most .
We have two remarks. First, when , K-step link functions exist when is full row rank. This assumption is referred to as observability in control theory [27]. Second, the PO-bilinear rank scales polynomially with respect to even with . As we show in Section J, due to this fact, we can compete against the globally optimal policy with polynomial sample complexity.
4.4 Observable Hilbert Space Embedding POMDPs
We consider HSE-POMDPs that generalize tabular POMDPs and LQG. Proofs here are deferred to Section B.4. Consider any . Given a policy , we define the induced transition operator as , where we have . Namely, is the transition kernel of some Markov chain induced by the policy . The HSE-POMDP assumes two conditional distributions and have conditional mean embeddings.
Example 3 (HSE-POMDPs).
We introduce features . We assume the existence of the conditional mean embedding operators: (1) there exists a matrix such that for all , and (2) for all , there exists a matrix , such that .
The existence of conditional mean embeddings is a common assumption in prior RL works on learning the dynamics of HMMs and PSRs [59, 6] and in Bellman-complete linear MDPs [77, 15, 10, 26]. HSE-POMDPs naturally capture tabular POMDPs and LQG. For tabular POMDPs, and are one-hot encoding features. In LQG, and are the quadratic features defined in Section 4.3. Here, for simplicity, we focus on finite-dimensional features and . The extension to infinite-dimensional reproducing kernel Hilbert spaces is deferred to Section B.4.
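For intuition, a conditional mean embedding operator of this kind can be estimated from samples by (ridge) regression of the outcome features onto the conditioning features. The sketch below uses synthetic data and hypothetical features purely to illustrate the shape of the object; it is not the estimator used by our algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_out, n = 5, 3, 2000     # feature dimensions and sample size (arbitrary)

# Hypothetical samples: conditioning features Psi (e.g. of the latent state) and
# outcome features Phi (e.g. of the observation), with E[phi | psi] = K_true psi.
Psi = rng.normal(size=(n, d_in))
K_true = rng.normal(size=(d_out, d_in))
Phi = Psi @ K_true.T + 0.1 * rng.normal(size=(n, d_out))

# Ridge-regression estimate of the conditional mean embedding operator.
lam = 1e-3
K_hat = np.linalg.solve(Psi.T @ Psi + lam * np.eye(d_in), Psi.T @ Phi).T

print("relative error:", np.linalg.norm(K_hat - K_true) / np.linalg.norm(K_true))
```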
The following shows the existence of value link functions and the PO-bilinear rank decomposition.
Lemma 6 (PO-bilinear rank of observable HSE-POMDPs).
Assume is full column rank (observability), and is linear in for any . Then the following holds.
• A one-step value link function exists for any , and is linear in .
• We set a value function class and a policy class . Then HSE-POMDP satisfies Definition 3 with PO-bilinear rank at most .
The first statement can be verified by noting that when , value link functions take the following form where we leverage the existence of the conditional mean embedding operator , and that is full column rank (thus ). Note that the PO-bilinear rank depends only on the dimension of the features without any explicit dependence on the length of memory.
5 Algorithm and Complexity
In this section, we first give our algorithm followed by a general sample complexity analysis. We then instantiate our analysis to specific models considered in Section 4.
5.1 Algorithm
We first focus on the case where models satisfy the PO-bilinear AC definition (i.e., Definition 3) with finite actions and one-step link functions. We discuss the extension to continuous actions in Remark 2 and to multi-step link functions at the end of this subsection.
We present our algorithm Provable in Algorithm 1. Note that Provable is agnostic to the form of and . Inside iteration , given the latest learned policy , we define the Bellman error for all pairs , where the Bellman error is averaged over the samples from . Here, to evaluate the Bellman loss for any policy , we use importance sampling by running rather than executing a policy , so that we can reuse samples. (This choice might limit the algorithm to the case where is discrete. However, for examples such as LQG, we show that we can replace by a G-optimal design over the quadratic polynomial feature of the actions.) A pair that has a small total Bellman error intuitively means that, given the data so far, could still be a value link function for the policy . Then, in the constrained optimization formulation, we only focus on pairs whose Bellman errors are small so far. Among these pairs, we select the pair using the principle of optimism in the face of uncertainty. We remark that the algorithm leverages some design choices from the Bilinear-UCB algorithm for MDPs [17]. The key difference between our algorithm and Bilinear-UCB is that we leverage the actor-critic framework equipped with value link functions to handle partial observability and agnostic learning.
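The following Python sketch mirrors the structure of Provable as described above (data collection with the current policy and a uniform action at the evaluated step, importance-weighted Bellman-error constraints accumulated over iterations, and an optimistic selection among feasible pairs). The helper callables and the exact form of the constraint are placeholders and differ from the formal pseudocode of Algorithm 1.

```python
import itertools

def provable(policy_class, link_class, horizon, n_iters, n_samples, threshold,
             collect_transition, bellman_residual, optimistic_value):
    """Structural sketch of Provable's constrained-optimism loop.

    All three callables are placeholders, not the paper's pseudocode:
      collect_transition(pi, h): roll in with pi to step h, take a uniformly
          random action there, and return the data needed for the Bellman loss.
      bellman_residual(pi, g, datum): importance-weighted (by pi(a|z) * |A|)
          Bellman error of the pair (pi, g) on a single datum.
      optimistic_value(pi, g): estimate of pi's value implied by the candidate
          link function g at the initial time step.
    """
    pi_hat, g_hat = policy_class[0], link_class[0]
    batches = []                                   # data collected so far
    for _ in range(n_iters):
        batches.append([[collect_transition(pi_hat, h) for _ in range(n_samples)]
                        for h in range(horizon)])
        # Keep only pairs whose estimated Bellman errors are small on every
        # batch collected so far (the constraint set).
        feasible = [(pi, g) for pi, g in itertools.product(policy_class, link_class)
                    if all(abs(sum(bellman_residual(pi, g, d) for d in batch[h])
                               / n_samples) <= threshold
                           for batch in batches for h in range(horizon))]
        # Optimism in the face of uncertainty: pick the feasible pair with the
        # largest estimated value (the realizable pair keeps this set non-empty
        # with high probability).
        pi_hat, g_hat = max(feasible, key=lambda pair: optimistic_value(*pair))
    return pi_hat
```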
With multi-step link functions.
Finally, we consider the case with multi-step futures in Algorithm 2 when . Recall the notation . The only difference is in the process of data collection. In particular, at every iteration , we roll in using up to (and including) time step , and we then roll out by switching to for steps.
Remark 2 (Continuous control).
The algorithms so far implicitly assume that the action space is finite. However, we can consider LQG, which has continuous actions. By employing a G-optimal design over actions, our algorithm can handle continuous actions. The discussion is deferred to Section C.
5.2 Sample Complexity
We show a sample complexity result by using reduction to supervised learning analysis. We begin by stating the following assumption which is ensured by standard uniform convergence results.
Assumption 2 (Uniform Convergence).
Fix . Let be a set of i.i.d. tuples following . With probability ,
For , we also require
Remark 3 (Finite function classes).
The term depends on the statistical complexities of the function classes and . As a simple example, we consider the case where and are discrete. In this case, we have , and , which are standard statistical complexities for discrete function classes and . Achieving this result simply requires standard concentration and a union bound over all functions in .
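To spell out the calculation behind Remark 3, assume the per-sample Bellman loss is bounded in $[-C, C]$ (a normalization introduced here for illustration). Hoeffding's inequality plus a union bound over all pairs in $\Pi \times \mathcal{G}$ gives, with probability at least $1-\delta$, simultaneously for all $(\pi, g)$,

$$\left| \frac{1}{m}\sum_{i=1}^{m} \ell_i(\pi, g) \;-\; \mathbb{E}\big[\ell(\pi, g)\big] \right| \;\le\; C \sqrt{\frac{2\,\ln\!\big(2\,|\Pi|\,|\mathcal{G}|/\delta\big)}{m}},$$

which is the $\sqrt{\ln(|\Pi|\,|\mathcal{G}|/\delta)/m}$ scaling quoted above.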
Under Assumption 2, when the model is PO-bilinear with rank , we get the following.
Theorem 1 (PAC guarantee of Provable).
Suppose we have a PO-bilinear AC class with rank . Suppose Assumption 2, and for any
By setting
where
With probability at least , letting , we have
The total number of samples used in the algorithm is .
Informally, when , to achieve -near optimality, the above theorem indicates that we just need to set , which results in a sample complexity scaling as (since only scales as ). We give detailed derivations and examples in the next section.
5.3 Examples
Hereafter, we show the sample complexity result by using Theorem 1. For complete results, refer to Section F–K. The result of -step decodable POMDPs and observable low-rank POMDPs are deferred to Section M.
5.3.1 Finite Sample Classes
We first consider the case where the hypothesis class is finite when the class admits PO-bilinear rank decomposition.
Example 4 (Finite Sample Classes).
Consider the case when and are finite and the PO-bilinear rank assumption is satisfied. When and are infinite hypothesis classes, and are replaced with their -covering numbers, respectively.
Theorem 2 (Sample complexity for discrete and (informal)).
Suppose for any and that the PO-bilinear rank assumption holds with PO-bilinear rank . By letting , with probability , we can achieve when we use samples
Here, are omitted.
5.3.2 Observable Undercomplete Tabular POMDPs
We start with tabular POMDPs. The details here are deferred to Section H.
Example 5 (continues=ex:under_tabular).
In tabular models, recall the PO-bilinear rank is at most . We suppose for any . Assuming is full-column rank, to satisfy the realizability, we set where and are one-hot encoding vectors over and , respectively. We set . Then, the following holds.
Theorem 3 (Sample complexity for undercomplete tabular models (Informal)).
With probability , we can achieve when we use samples at most
Here, are omitted.
First, while the above error incurs , we will later see in Section 8.2.2 that, when we use the more general definition of the PO-bilinear AC class and combine it with a model-based perspective, we can remove from the error bound. The intuition here is that the statistical complexity still scales with and does not incur . At the same time, although the PO-bilinear rank currently scales with , we can show that it is just under the more refined definition. Second, can be replaced with other analogous conditions . Here, note . The reason why we use the -norm is to invoke the result of [24] and achieve near global optimality, as in the next paragraph.
Near global optimality.
Finally, we consider the PAC guarantee against the globally optimal policy. As shown in [24], it is enough to set to compete with the globally optimal policy . Thus we achieve a quasi-polynomial sample complexity when competing against .
Theorem 4 (Sample complexity for undercomplete tabular models (Informal) — competing against ).
With probability , we can achieve when we use samples at most
5.3.3 Observable Tabular Overcomplete POMDPs
We consider overcomplete tabular POMDPs. In this case, the PO-bilinear rank is at most . We suppose for any . Assuming is full column rank, to satisfy realizability, we set , where and are one-hot encoding vectors over and , respectively. We set . Then, the following holds.
Theorem 5 (Sample complexity for overcomplete tabular models).
With probability , we can achieve
when we use samples at most
Here, are omitted.
When we use K-step futures, in the above theorem we additionally incur , which comes from a naïve parameterization of . In Section 8.2.3, we will see that under the model-based learning perspective (i.e., we parameterize first and then construct and using the model class), we get rid of the dependence on and . This is because the complexity of the model class is independent of and (i.e., the number of parameters in is ).
5.3.4 Observable LQG
Now let us revisit LQG. The details here are deferred to Section J. We show that Provable can compete against the globally optimal policy with polynomial sample complexity.
Example 6 (continues=ex:lqqs).
In LQG, by setting , we achieve a polynomial sample complexity when competing against the globally optimal policy .
Theorem 6 (Sample complexity for LQG (informal) – competing against ).
Consider a linear policy class , and assume and that all policies induce a stable system (we formalize this in Section J). With probability , we can achieve when we use samples at most
5.3.5 Observable HSE-POMDPs
Next, we study HSE-POMDPs. The details here are deferred to Section G.
Example 7 (continues=ex:linear).
In HSE-POMDPs, the PO-bilinear rank is at most . Suppose and such that for any . Then, to satisfy realizability, we set where .
Theorem 7 (Sample complexity for HSE-POMDPs (Informal)).
Let . Suppose lies in for any . Then, with probability , we can achieve when we use samples
Here, are omitted and .
Note that the sample complexity above does not explicitly depend on the memory length ; instead, it only explicitly depends on the dimension of the features . In other words, if we have a feature mapping that can map the entire history (i.e., ) to a low-dimensional vector (e.g., LQG), our algorithm can immediately compete against the globally optimal policy .
6 Predictive State Representations
In this section, we demonstrate that our definition and algorithm apply to PSRs, models that strictly generalize POMDPs [46, 62]. Below, we first briefly introduce PSRs, and then show that they form a PO-bilinear AC model. Throughout this section, we focus on discrete linear PSRs. We also suppose the reward at is a deterministic function of conditional on , where . Given , the dynamical system generates . Here we use the superscript on to emphasize that it ends with the action .
PSRs use the concept of a test, which is a sequence of future observations and actions. For a test of length , we define the probability of the test being successful as , which is the probability of observing by actively executing actions , conditioned on history .
We now explain one-step observable PSRs while deferring the general multi-step observable setting to Section D. A one-step observable PSR uses the observations in as tests, i.e., tests with length 1.
Definition 5 (Core test set and linear PSRs).
A core test set contains a finite number of tests (i.e., observations from ). For any , any history , and any future test for any , there exists a vector such that the probability that succeeds conditioned on can be expressed as , where we denote by the vector in with entries equal to for . The vector is called the predictive state.
A core test set that has the smallest number of tests is called a minimum core test set, denoted . PSRs are strictly more expressive than POMDPs in that every POMDP can be embedded into a PSR whose minimum core test set has size at most ; the converse does not hold [46]. For example, in observable undercomplete POMDPs (i.e., full column rank ), the observation set can serve as a core test set, but the minimum core test set will have size . Here, we assume we know a core test set that contains ; however, we are agnostic to which set is the actual . In the literature on PSRs, this setting is often referred to as transform PSRs [8, 57].
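To see concretely why the observation set can serve as a core test set in an observable undercomplete POMDP, the sketch below (a randomly generated instance; all sizes arbitrary) checks that the one-step test probabilities determine the belief linearly, so the probability of any longer test is also linear in the predictive state.

```python
import numpy as np

rng = np.random.default_rng(3)
nS, nO, nA = 3, 5, 2           # undercomplete, observable instance (sizes arbitrary)

T = rng.random((nA, nS, nS)); T /= T.sum(axis=1, keepdims=True)   # T[a][s_next, s]
Ob = rng.random((nO, nS));    Ob /= Ob.sum(axis=0, keepdims=True) # Ob[o, s]

b = rng.random(nS); b /= b.sum()      # belief over latent states after some history

# Predictive state: success probabilities of the one-step tests (the observations).
q = Ob @ b                            # q[o] = P(o | history)

# Observability (full column rank of Ob) lets us recover the belief linearly from q,
b_rec = np.linalg.pinv(Ob) @ q
assert np.allclose(b_rec, b)

# ... so the probability of a longer test, e.g. (a, o_next), is also linear in q.
a, o_next = 0, 1
p_direct = Ob[o_next, :] @ T[a] @ b                 # computed from the belief
m = Ob[o_next, :] @ T[a] @ np.linalg.pinv(Ob)       # the linear weights for this test
assert np.allclose(m @ q, p_direct)
print("P(test) via predictive state:", float(m @ q))
```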
Now we define a value link function in PSRs. First, given an -memory policy, define , i.e., the expected total reward under , conditioned on the history . Note that our value function here depends on the entire history.
Definition 6 (General value link functions).
Consider an -memory policy . One-step general value link functions at step are defined as solutions to
(3)
This definition is more general than Definition 2, since (3) implies (1) in POMDPs by setting . In PSRs, we can show the existence of such general value link functions.
Lemma 7 (The existence of link functions for PSRs).
Suppose is a core test set. Then, a one-step value link function always exists.
The high-level derivation is as follows. Using the linear PSR property, one can first show that has a bilinear form , where denotes the one-hot encoding vector over , and is a matrix. Then, given any and , for some matrix , we can show satisfies the above, where is a one-hot encoding vector over and serves as an unbiased estimate of .
Finally, we show that PSR admits PO-bilinear rank decomposition (Definition 3).
Lemma 8.
Suppose a core test set includes a minimum core test set . Set and ; then the PO-bilinear rank is at most .
Then, Algorithm 1 is directly applicable to PSRs. Note that here, fortunately, the PO-bilinear rank scales with but not with . The dependence comes from the dimension of the "feature" of the memory . If one has a compact feature representation such that is linear with respect to the feature , then the PO-bilinear rank is . This implies that if one has a compact featurization of the memory a priori, one can avoid an exponential dependence on .
Sample complexity.
We finally briefly mention the sample complexity result. The details are deferred to Section K. The sample complexity to satisfy is given as
where and are parameters associated with PSRs. Here, there is no explicit dependence on . Note that in the worst case, scales as and scales as .
7 Generalization of PO-Bilinear AC Class
We extend our previous definition of the PO-Bilinear AC framework. We first present an even more general framework that captures all the examples we have discussed so far. We then provide two more examples that are covered by this framework: (1) -step decodable POMDPs, and (2) observable POMDPs with low-rank latent transition. Using the result in (2), we obtain refined results in the tabular setting compared to those from Section 5.3.2.
The following is a general PO-Bilinear AC class. Recall . We consider one-step futures, i.e., , but the extension to is straightforward. Compared to Definition 3, we introduce another class of functions, termed discriminators, and a loss function .
Definition 7 (General PO-Bilinear AC Class).
Consider a tuple consisting of a policy class , a function class , a loss function where , a set of estimation policies where , and a discriminator class with . Consider a non-decreasing function with .
The model is a PO-bilinear class of rank if is realizable, and there exist and such that for all and ,
(a) ,
(b) (In -step decodable POMDPs and POMDPs with low-rank latent transition, we set , and in the previous sections, we set .)
(c) for any and the corresponding value link function in .
The first condition states that the average Bellman error under is upper-bounded by the quantity in the bilinear form. The second condition states that we have a known loss function that can be used to estimate an upper bound (up to a non-decreasing transformation ) on the value of the bilinear form. Our algorithm will use the surrogate loss . As we will show, just being able to estimate an upper bound on the value of the bilinear form suffices for deriving a PAC algorithm. The discriminator class and the non-decreasing function give us additional freedom in designing the loss function. For simple examples such as tabular POMDPs and LQG, as we have already seen, we simply set the discriminator class (i.e., we do not use discriminators) and take to be the identity mapping.
With this definition, we slightly modify Provable to incorporate the discriminator when constructing constraints. The algorithm is summarized in Algorithm 3 and is named DisProvable. There are two modifications: (1) when we collect data, we switch from the roll-in policy to the policy at time step ; (2) the Bellman error constraint is defined using the loss together with the discriminator class .
The following theorem shows the sample complexity of Algorithm 3. For simplicity, we directly consider the case where are all discrete.
Assumption 3 (Uniform Convergence).
Fix . Let be a set of i.i.d. tuples obtained by executing . With probability ,
For , we also require
Theorem 8 (Sample complexity of Algorithm 3).
Suppose we have a PO-bilinear AC class with rank in Definition 7. Suppose Assumption 3, and for any
By setting
where
With probability at least , letting , we have
The total number of samples used in the algorithm is .
This reduces to Theorem 1 when we set to be the identity function and . When is a strongly convex function, we can obtain more refined rates. For example, when , i.e., , with , the above theorem implies a slow sample complexity rate . However, by leveraging the strong convexity of the square function , a refined analysis can give the fast rate . We will see two such examples in the next sections.
8 Examples for Generalized PO-Bilinear AC Class
We demonstrate that our generalized framework captures two models: (1) -step decodable POMDPs, and (2) observable POMDPs with the latent low-rank transition. In this section, we assume for any .
8.1 -step decodable POMDPs
The example we include here is a model that involves nonlinear function approximation but has a unique assumption on the exact identifiability of the latent states.
Example 8 (-step decodable POMDPs [20]).
There exists an unknown decoder , such that for every reachable trajectory , we have for all .
Existence of value link functions.
From the definition, using a value function over , we can define a value link function as
since it satisfies
This is summarized in the following lemma.
Lemma 9 (Existence of link functions in -step decodable POMDPs).
In -step decodable POMDPs, link functions exist.
-step decodable POMDPs showcase the generality of value link functions, which not only capture standard observability conditions where future observations and actions are used to replace belief states (e.g., observable tabular POMDPs and observable LQG), but also capture a model where history is used to replace latent states.
PO-Bilinear Rank.
Next, we calculate the PO-bilinear rank based on Definition 7. In the tabular case, we can naïvely obtain the PO-bilinear decomposition with rank following Example 1. Here, we consider the nontabular case where function approximation is used and can be extremely large. We define the following Bellman operator associated with at step :
(4)
Note that above we use the ground truth decoder to decode from to its associated latent state . The existence of this Bellman operator is crucially dependent on the existence of such decoder .
We show that -step decodable POMDPs satisfy the definition in Definition 7. We assume that the latent state-wise transition model is low-rank. In MDPs, this assumption is widely used in [76, 36, 2, 71]. Here, we do not need to know in the algorithm.
Assumption 4 (Low-rankness of latent transition).
Suppose is low-rank, i.e., where are (unknown) -dimensional features. As technical conditions, we suppose for any and for any .
Lemma 10 (Bilinear decomposition of low-rank -step decodable POMDPs ).
Suppose Assumption 4, , for any . Assume a discriminator class is Bellman complete, i.e.,
for any . The loss function is designed as
(5)
Then, there exist so that the PO-bilinear rank is at most such that
(6)
(7)
and
(8) |
Proof.
The proof is deferred to Section N.2. Note that (6), (7), (8) correspond to (a), (b), (c) in Definition 7. ∎
We use the most general bilinear class definition from Definition 7, where for a scalar . Hence is a non-decreasing function ( is non-decreasing in ). The proof of the above lemma leverages the novel trick of the so-called moment matching policy introduced by [20]. When the latent state and action spaces are discrete, it states that the bilinear rank is , which is much smaller than . Note that we introduce in the loss function (5) to induce strong convexity w.r.t. , as in [70, 18, 9], which is important for obtaining the fast rate later.
The concrete sample complexity of DisProvable (Algorithm 3) for this model is summarized in the following. Recall that the bilinear rank is , where is the rank of the transition matrix. We set . Then, we have the following result.
Theorem 9 (Sample complexity for -step decodable POMDPs (Informal)).
Suppose Assumption 4, Bellman completeness, , for any . With probability , we can achieve when we use samples at most
Here, are omitted.
The following are several implications. First, the error rate scales as . As promised, by leveraging the strong convexity of the loss functions, we obtain a rate of , which is faster than the rate attained when we naïvely invoke Theorem 8 with . Second, the error bound incurs . As shown in [20], this is inevitable in -step decodable POMDPs. Third, in the tabular case, when we use naïve function classes for , i.e., , , , the bound could incur an additional , since the complexity of the function classes can scale with (e.g., can be of the order , and similarly for ). However, when we start from a realizable model class that captures the ground truth transition and emission distributions, we can remove . See Section 8.2.4 for an example.
Note that [20] uses a different function class setup, where they assume one has an M-memory-action-dependent function class that contains , while we use the actor-critic framework . The two function class setups are not directly comparable. Generally, we mention that such an optimal with truncated history does not exist when exact decodability does not hold (e.g., such a with truncated history does not exist in LQG). This displays the potential generality of the actor-critic framework we propose here.
8.2 Observable POMDPs with Latent Low-rank Transition: a model-based perspective
The final example we include in this work is a POMDP with the latent low-rank transition. We first introduce the model, and then we introduce our function approximation setup and show the sample complexity. Finally, we revisit the sample complexity for observable tabular POMDPs and -step decodable tabular POMDPs using the improved algorithm that elaborates on the model-based approach in this section.
Example 9 (Observable POMDPs with latent low-rank transition).
The latent transition is factorized as where and . The observation matrix has full-column rank.
In the tabular POMDP example, we have . However, in general can be much smaller than . Note that in this section we focus on the setting where are discrete to avoid measure-theoretic language, but their size could be extremely large. In particular, our sample complexity will not have explicit polynomial or logarithmic dependence on ; instead, it will only scale polynomially with respect to the complexity of the hypothesis class and the rank .
Model-based function approximation.
Our function approximation class consists of a set of models where together models latent transition as , and models , and is full column rank. For notation simplicity, we often use to denote a model . We impose the following assumption.
Assumption 5 (Realizability).
We assume realizability, i.e., .
We assume is discrete, but can be large such that a linear dependence on in the sample complexity is not acceptable. Our goal is to get a bound that scales polynomially with respect to , which is the standard statistical complexity of the discrete hypothesis class .
Next, we construct using the model class . Given , we denote as the optimal -memory policy, i.e., the -memory policy that maximizes the total expected reward. We set
We consider the value function class for with being full column rank. For each , we can define the corresponding value function of the policy at : . Then, since is full column rank, as we see in the proof of Lemma 1, a corresponding value link function is
where . Then, we construct as:
(9)
By construction, since , we must have , which implies is realizable (note ). Here, from the construction and the assumption for any , we have and , which can be seen from
by assuming and .
To construct a discriminator class , we first define the Bellman operator for :
where is the whole history space up to (, and is just part of this history) and is the probability of generating conditioned on under model . Then, we construct such that
(10)
so that we can ensure the Bellman completeness:
noting . Here, from the construction, and .
We define the loss in the same way as the one we used in -step decodable POMDPs, except that our discriminators now take the entire history as input:
(11)
Finally, as in the case of -step decodable POMDPs (Lemma 10), we get the following lemma that states that our model is a PO-bilinear AC class (Definition 7) under the following model assumption.
Assumption 6.
We assume for any in the model. Suppose for any , and in the model. Suppose for any in the model and . Suppose for any and for any in the model, we have .
Lemma 11 (PO-bilinear decomposition for Observable POMDPs with low-rank transition).
The above lemma ensures that the PO-bilinear rank only depends on , and is independent of the length of the memory. For example, in the tabular case, it is .
Next, we show that the output of DisProvable can search for the best in-class -memory policy, as follows.
Theorem 10 (Sample complexity of DisProvable for observable POMDPs with latent low-rank transition).
Here, we emphasize that there is no explicit polynomial or logarithmic dependence on and , which permits learning for large state and observation spaces. We also do not have any explicit polynomial dependence on , as we construct and from the model class which ensures the complexities of and are in the same order as that of .
8.2.1 Global Optimality
We show a quasi-polynomial sample complexity bound for competing against the globally optimal policy . To compete against the globally optimal policy , we need to set properly. We use the following lemma. The proof is given in Section O.
Lemma 12 (Near global optimality of -memory policy).
Consider , and a POMDP with low-rank latent transition and being full column rank with . When (with being some absolute constant), there must exist an -memory policy such that
Note that the memory length above is independent of ; instead it only depends on the rank . To prove the above lemma, we first show a new result on belief contraction for low-rank POMDPs under the -based observability. The proof of the belief contraction borrows some key lemmas from [24] but extends the original result from small-size tabular POMDPs to low-rank POMDPs. We leverage the linear structure of the problem and the G-optimal design to construct an initial distribution over that can be used as a starting point for belief propagation along the memory.
We conclude the study on the POMDPs with low-rank latent transition by the following theorem, which demonstrates a quasi-polynomial sample complexity for learning the globally optimal policy.
Theorem 11 (Sample complexity of DisProvable for POMDPs with low-rank latent transition — competing against ).
Remark 4 (Comparison to [73]).
We compare our results to the very recent work [73] that studies POMDPs with the low-rank latent transition. The results are in general not directly comparable, but we state several key differences here. First, [73] considers a special instance of low-rank transition, i.e., [73] assumes has low non-negative rank, which could be exponentially larger than the usual rank [2]. Second, [73] additionally assumes short past sufficiency, a condition which intuitively says that for any roll-in policy, the sufficient statistics of a short memory is enough to recover the belief over the latent states, and their sample complexity has an exponential dependence on the length of the memory. While our result also relies on the fact that the globally optimal policy can be approximated by an -memory policy with small , this fact is derived directly from the standard observability condition.
8.2.2 Revisiting Observable Undercomplete Tabular POMDPs
We reconsider the sample complexity of undercomplete tabular POMDPs using Theorem 10. In this case, we will start from a model class that captures the ground truth latent transition and omission distribution . By constructing -nets over the model class, we can set , since these have and many parameters, respectively. Besides, the PO-bilinear rank is . Therefore, the sample complexity is
We leave the formal analysis to future work.
Compared to the results in Section 5.3.2, there is no term. This is due to two improvements. The first improvement is that we refine the rank from to . The second improvement is that we construct the value link function class and the policy class starting from the model class, whose complexity does not depend on the length of the memory (note that previously, from a pure model-free perspective, the statistical complexity of can scale as in the worst case).
8.2.3 Revisiting Observable Overcomplete POMDPs
We reconsider the sample complexity of overcomplete tabular POMDPs using Theorem 10 with slight modification to incorporate multi-step future. Suppose (recall is defined in Lemma 3 in Section 4.2). Then, we can achieve a sample complexity
since the PO-bilinear rank is . Note that there is no dependence, since both the policy class and the value link function class are built from the model class whose complexity has nothing to do with .
Note that due to our definition of , there is no term. However, when we use a different definition, for instance, (recall is defined in Section 4.2), we would incur . This is because if we only know that there is an unknown sequence of actions such that is full column rank, we need to use uniform samples in the importance sampling step to identify such a sequence. More formally, we can see that
(12)
8.2.4 Revisiting -step Decodable Tabular POMDPs
We reconsider the sample complexity of tabular -step decodable POMDPs by constructing from the model class as we did for the low-rank POMDP. In this case, by constructing -nets, we can set , since these have and parameters, respectively. Therefore, the sample complexity is
Again, we leave the formal analysis to future work. Compared to the naive result mentioned after Theorem 9, where could scale on the order of in the tabular case, we do not have such dependence here.
9 Summary
We propose a PO-bilinear actor-critic framework that is the first unified framework for provably efficient RL on large-scale partially observable dynamical systems. Our framework captures not only many models where provably efficient learning was already known, such as tabular POMDPs, LQG, and M-step decodable POMDPs, but also models where provably efficient RL was not previously known, such as HSE-POMDPs, general PSRs, and observable POMDPs with low-rank latent transition. Our unified actor-critic style algorithm, Provable, provably performs agnostic learning by searching for the best memory-based policy. For special models such as observable tabular POMDPs, LQG, and POMDPs with low-rank latent transition, by leveraging their special properties, i.e., the exponential stability of Bayesian filters in tabular and low-rank POMDPs, and the existence of a compact featurization of histories in LQG, we are able to directly compete against the globally optimal policy without paying an exponential dependence on the horizon.
Acknowledgement
We thank Nan Jiang for valuable discussions on PSRs.
References
- AHKS [20] Alekh Agarwal, Mikael Henaff, Sham Kakade, and Wen Sun. Pc-pg: Policy cover directed exploration for provable policy gradient learning. Advances in Neural Information Processing Systems, 33:13399–13412, 2020.
- AKKS [20] Alekh Agarwal, Sham Kakade, Akshay Krishnamurthy, and Wen Sun. Flambe: Structural complexity and representation learning of low rank mdps. Advances in neural information processing systems, 33:20095–20107, 2020.
- ALA [16] Kamyar Azizzadenesheli, Alessandro Lazaric, and Animashree Anandkumar. Reinforcement learning of pomdps using spectral methods. In Conference on Learning Theory, pages 193–256. PMLR, 2016.
- BBC+ [19] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
- Ber [12] Dimitri Bertsekas. Dynamic programming and optimal control: Volume I, volume 1. Athena scientific, 2012.
- BGG [13] Byron Boots, Geoffrey Gordon, and Arthur Gretton. Hilbert space embeddings of predictive state representations. arXiv preprint arXiv:1309.6819, 2013.
- BK [21] Andrew Bennett and Nathan Kallus. Proximal reinforcement learning: Efficient off-policy evaluation in partially observed markov decision processes. 2021.
- BSG [11] Byron Boots, Sajid M Siddiqi, and Geoffrey J Gordon. Closing the learning-planning loop with predictive state representations. The International Journal of Robotics Research, 30(7):954–966, 2011.
- CJ [19] Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, pages 1042–1051. PMLR, 2019.
- CO [20] Sayak Ray Chowdhury and Rafael Oliveira. No-regret reinforcement learning with value function approximation: a kernel embedding approach. arXiv preprint arXiv:2011.07881, 2020.
- CPS+ [20] Yifan Cui, Hongming Pu, Xu Shi, Wang Miao, and Eric Tchetgen Tchetgen. Semiparametric proximal causal inference. arXiv preprint arXiv:2011.08411, 2020.
- CYW [22] Qi Cai, Zhuoran Yang, and Zhaoran Wang. Sample-efficient reinforcement learning for pomdps with linear function approximations. arXiv preprint arXiv:2204.09787, 2022.
- Dea [18] Ben Deaner. Proxy controls and panel data. arXiv preprint arXiv:1810.00283, 2018.
- DHB+ [17] Carlton Downey, Ahmed Hefny, Byron Boots, Geoffrey J Gordon, and Boyue Li. Predictive state recurrent neural networks. Advances in Neural Information Processing Systems, 30, 2017.
- DJW [20] Yaqi Duan, Zeyu Jia, and Mengdi Wang. Minimax-optimal off-policy evaluation with linear function approximation. In International Conference on Machine Learning, pages 2701–2709. PMLR, 2020.
- DKJ+ [19] Simon Du, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal, Miroslav Dudik, and John Langford. Provably efficient rl with rich observations via latent state decoding. In International Conference on Machine Learning, pages 1665–1674. PMLR, 2019.
- DKL+ [21] Simon Du, Sham Kakade, Jason Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, and Ruosong Wang. Bilinear classes: A structural framework for provable generalization in rl. In International Conference on Machine Learning, pages 2826–2836. PMLR, 2021.
- DLMS [20] Nishanth Dikkala, Greg Lewis, Lester Mackey, and Vasilis Syrgkanis. Minimax estimation of conditional moment models. In Advances in Neural Information Processing Systems, volume 33, pages 12248–12262, 2020.
- EDKM [05] Eyal Even-Dar, Sham M Kakade, and Yishay Mansour. Reinforcement learning in pomdps without resets. 2005.
- EJKM [22] Yonathan Efroni, Chi Jin, Akshay Krishnamurthy, and Sobhan Miryoosefi. Provable reinforcement learning with a short-term memory. arXiv preprint arXiv:2202.03983, 2022.
- FKQR [21] Dylan J Foster, Sham M Kakade, Jian Qian, and Alexander Rakhlin. The statistical complexity of interactive decision making. arXiv preprint arXiv:2112.13487, 2021.
- GDB [16] Zhaohan Daniel Guo, Shayan Doroudi, and Emma Brunskill. A pac rl algorithm for episodic pomdps. In Artificial Intelligence and Statistics, pages 510–518. PMLR, 2016.
- [23] Noah Golowich, Ankur Moitra, and Dhruv Rohatgi. Learning in observable pomdps, without computationally intractable oracles. arXiv preprint arXiv:2206.03446, 2022.
- [24] Noah Golowich, Ankur Moitra, and Dhruv Rohatgi. Planning in observable pomdps in quasipolynomial time. arXiv preprint arXiv:2201.04735, 2022.
- HDG [15] Ahmed Hefny, Carlton Downey, and Geoffrey J Gordon. Supervised learning for dynamical system learning. Advances in neural information processing systems, 28, 2015.
- HDL+ [21] Botao Hao, Yaqi Duan, Tor Lattimore, Csaba Szepesvári, and Mengdi Wang. Sparse feature selection makes batch reinforcement learning more sample efficient. In International Conference on Machine Learning, pages 4063–4073. PMLR, 2021.
- Hes [18] Joao P Hespanha. Linear systems theory. Princeton university press, 2018.
- HFP [13] William L Hamilton, Mahdi Milani Fard, and Joelle Pineau. Modelling sparse dynamical systems with compressed predictive state representations. In International Conference on Machine Learning, pages 178–186. PMLR, 2013.
- HKZ [12] Daniel Hsu, Sham M Kakade, and Tong Zhang. A spectral algorithm for learning hidden markov models. Journal of Computer and System Sciences, 78(5):1460–1480, 2012.
- IP [08] Masoumeh T Izadi and Doina Precup. Point-based planning for predictive state representations. In Conference of the Canadian Society for Computational Studies of Intelligence, pages 126–137. Springer, 2008.
- Jae [98] Herbert Jaeger. Discrete-time, discrete-valued observable operator models: a tutorial. GMD-Forschungszentrum Informationstechnik Darmstadt, Germany, 1998.
- Jae [00] Herbert Jaeger. Observable operator models for discrete stochastic time series. Neural computation, 12(6):1371–1398, 2000.
- JKA+ [17] Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. Contextual decision processes with low bellman rank are pac-learnable. In International Conference on Machine Learning, pages 1704–1713. PMLR, 2017.
- JKKL [20] Chi Jin, Sham Kakade, Akshay Krishnamurthy, and Qinghua Liu. Sample-efficient reinforcement learning of undercomplete pomdps. Advances in Neural Information Processing Systems, 33:18530–18539, 2020.
- JLM [21] Chi Jin, Qinghua Liu, and Sobhan Miryoosefi. Bellman eluder dimension: New rich classes of rl problems, and sample-efficient algorithms. Advances in Neural Information Processing Systems, 34, 2021.
- JYWJ [20] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
- KECM [21] Jeongyeol Kwon, Yonathan Efroni, Constantine Caramanis, and Shie Mannor. Rl for latent mdps: Regret guarantees and a lower bound. Advances in Neural Information Processing Systems, 34, 2021.
- KJS [15] Alex Kulesza, Nan Jiang, and Satinder Singh. Spectral learning of predictive state representations with insufficient statistics. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
- KMN [99] Michael Kearns, Yishay Mansour, and Andrew Ng. Approximate planning in large pomdps via reusable trajectories. Advances in Neural Information Processing Systems, 12, 1999.
- KMU [21] Nathan Kallus, Xiaojie Mao, and Masatoshi Uehara. Causal inference under unmeasured confounding with negative controls: A minimax learning approach. arXiv preprint arXiv:2103.14029, 2021.
- KW [60] Jack Kiefer and Jacob Wolfowitz. The equivalence of two extremum problems. Canadian Journal of Mathematics, 12:363–366, 1960.
- LAHA [20] Sahin Lale, Kamyar Azizzadenesheli, Babak Hassibi, and Anima Anandkumar. Regret minimization in partially observable linear quadratic control. arXiv preprint arXiv:2002.00082, 2020.
- LCSJ [22] Qinghua Liu, Alan Chung, Csaba Szepesvári, and Chi Jin. When is partially observable reinforcement learning not scary? arXiv preprint arXiv:2204.08967, 2022.
- Lit [96] Michael Lederman Littman. Algorithms for sequential decision-making. Brown University, 1996.
- LMPR [20] Tianyu Li, Bogdan Mazoure, Doina Precup, and Guillaume Rabusseau. Efficient planning under partial observability with unnormalized q functions and spectral learning. In International Conference on Artificial Intelligence and Statistics, pages 2852–2862. PMLR, 2020.
- LS [01] Michael Littman and Richard S Sutton. Predictive representations of state. Advances in neural information processing systems, 14, 2001.
- LS [20] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
- MHKL [20] Dipendra Misra, Mikael Henaff, Akshay Krishnamurthy, and John Langford. Kinematic state abstraction and provably efficient rich-observation reinforcement learning. In International conference on machine learning, pages 6961–6971. PMLR, 2020.
- MKS+ [13] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
- MST [18] Wang Miao, Xu Shi, and Eric Tchetgen Tchetgen. A confounding bridge approach for double negative control inference on causal effects. arXiv preprint arXiv:1808.04945, 2018.
- MTR [19] Horia Mania, Stephen Tu, and Benjamin Recht. Certainty equivalent control of lqr is efficient. arXiv preprint arXiv:1902.07826, 2019.
- Mur [00] Kevin P Murphy. A survey of pomdp solution techniques. environment, 2(10), 2000.
- MZG+ [21] Afsaneh Mastouri, Yuchen Zhu, Limor Gultchin, Anna Korba, Ricardo Silva, Matt J Kusner, Arthur Gretton, and Krikamol Muandet. Proximal causal learning with kernels: Two-stage estimation and moment restriction. arXiv preprint arXiv:2105.04544, 2021.
- NBGF [12] Yu Nishiyama, Abdeslam Boularias, Arthur Gretton, and Kenji Fukumizu. Hilbert space embeddings of pomdps. arXiv preprint arXiv:1210.4887, 2012.
- PT [87] Christos H Papadimitriou and John N Tsitsiklis. The complexity of markov decision processes. Mathematics of operations research, 12(3):441–450, 1987.
- PVSP [06] Josep M Porta, Nikos Vlassis, Matthijs TJ Spaan, and Pascal Poupart. Point-based value iteration for continuous pomdps. 2006.
- RGT [04] Matthew Rosencrantz, Geoff Gordon, and Sebastian Thrun. Learning low dimensional predictive representations. In Proceedings of the twenty-first international conference on Machine learning, page 88, 2004.
- SBS+ [10] Le Song, Byron Boots, Sajid Siddiqi, Geoffrey J Gordon, and Alex Smola. Hilbert space embeddings of hidden markov models. 2010.
- SHSF [09] Le Song, Jonathan Huang, Alex Smola, and Kenji Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 961–968, 2009.
- Sin [21] Rahul Singh. A finite sample theorem for longitudinal causal inference with machine learning: Long term, dynamic, and mediated effects. arXiv preprint arXiv:2112.14249, 2021.
- SJK+ [19] Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Model-based rl in contextual decision processes: Pac bounds and exponential improvements over model-free approaches. In Conference on learning theory, pages 2898–2933. PMLR, 2019.
- SJR [04] Satinder Singh, Michael R James, and Matthew R Rudary. Predictive state representations: a new theory for modeling dynamical systems. In Proceedings of the 20th conference on Uncertainty in artificial intelligence, pages 512–519, 2004.
- SKKS [09] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.
- SPK [13] Guy Shani, Joelle Pineau, and Robert Kaplow. A survey of point-based pomdp solvers. Autonomous Agents and Multi-Agent Systems, 27(1):1–51, 2013.
- SSH [20] Max Simchowitz, Karan Singh, and Elad Hazan. Improper learning for non-stochastic control. In Conference on Learning Theory, pages 3320–3436. PMLR, 2020.
- SUJ [21] Chengchun Shi, Masatoshi Uehara, and Nan Jiang. A minimax learning approach to off-policy evaluation in partially observable markov decision processes. arXiv preprint arXiv:2111.06784, 2021.
- SVBB [16] Wen Sun, Arun Venkatraman, Byron Boots, and J Andrew Bagnell. Learning to filter with predictive state inference machines. In International conference on machine learning, pages 1197–1205. PMLR, 2016.
- TJ [15] Michael R Thon and Herbert Jaeger. Links between multiplicity automata, observable operator models and predictive state representations: a unified learning framework. J. Mach. Learn. Res., 16:103–147, 2015.
- TSM [20] Guy Tennenholtz, Uri Shalit, and Shie Mannor. Off-policy evaluation in partially observable environments. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 10276–10283, 2020.
- UIJ+ [21] Masatoshi Uehara, Masaaki Imaizumi, Nan Jiang, Nathan Kallus, Wen Sun, and Tengyang Xie. Finite sample analysis of minimax offline reinforcement learning: Completeness, fast rates and first-order efficiency. arXiv preprint arXiv:2102.02981, 2021.
- UZS [21] Masatoshi Uehara, Xuezhou Zhang, and Wen Sun. Representation learning for online and offline rl in low-rank mdps. arXiv preprint arXiv:2110.04652, 2021.
- VODM [12] Peter Van Overschee and Bart De Moor. Subspace identification for linear systems: Theory—Implementation—Applications. Springer Science & Business Media, 2012.
- WCYW [22] Lingxiao Wang, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Embed to control partially observed systems: Representation learning with provable sample efficiency. arXiv preprint arXiv:2205.13476, 2022.
- XCGZ [21] Yi Xiong, Ningyuan Chen, Xuefeng Gao, and Xiang Zhou. Sublinear regret for learning pomdps. arXiv preprint arXiv:2107.03635, 2021.
- XKG [21] Liyuan Xu, Heishiro Kanagawa, and Arthur Gretton. Deep proxy causal learning and its application to confounded bandit policy evaluation. Advances in Neural Information Processing Systems, 34:26264–26275, 2021.
- YW [20] Lin Yang and Mengdi Wang. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. In International Conference on Machine Learning, pages 10746–10756. PMLR, 2020.
- ZLKB [20] Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, and Emma Brunskill. Learning near optimal policies with low inherent bellman error. In International Conference on Machine Learning, pages 10978–10989. PMLR, 2020.
- ZSU+ [22] Xuezhou Zhang, Yuda Song, Masatoshi Uehara, Mengdi Wang, Wen Sun, and Alekh Agarwal. Efficient reinforcement learning in block mdps: A model-free representation learning approach. arXiv preprint arXiv:2202.00063, 2022.
Appendix A Supplement for Section 3
We generalize Definition 3 to capture more models. The first extension is to use multi-step link functions. This extension is essential to capture overcomplete POMDPs and multi-step PSRs. The second extension is to use minimax loss functions with discriminators so that we can use not only absolute value loss functions but also squared loss functions. This extension is important to capture M-step decodable POMDPs.
Appendix B Supplement for Section 4
B.1 Observable Undercomplete Tabular POMDPs
B.2 Observable Overcomplete POMDPs
We consider overcomplete POMDPs with multi-step futures. We have the following theorem. This is a generalization of Lemma 1.
Proof of Lemma 3
Consider any function (thus, this captures all possible ). Denote as the one-hot encoding of (similarly for over and over ). We have , where we use the assumption that and thus . Then,
which means that the value bridge function corresponding to is
Proof of Lemma 4
Recall we want to show the low-rank property of the following loss function:
We consider an expectation conditioning on and . For some vector , which depends on , we write it in the form of where is the one-hot encoding vector over . Then, the loss for is equal to
Hence, we can take and .
B.3 Observable Linear Quadratic Gaussian
B.4 Observable HSE-POMDPs
We first provide the proof of Lemma 6. Then, we briefly mention how we extend to the infinite-dimensional setting.
Proof of the first statement in Lemma 6
First, we need to show value bridge functions exist. This is proved noting
Thus, is a value bridge function.
Proof of the second statement in Lemma 6
Consider a triple , with and , we have:
which verifies the bilinear structure, i.e., , and , and shows that the bilinear rank is at most .
Infinite dimensional HSE-POMDPs
Consider the case and are features in infinite dimensional RKHS. By assuming that the spectrum of the operator is decaying with a certain order, we can still ensure the existence of value bridge functions even if and are infinite dimensional.
Next, we consider the PO-bilinear rank. We can still use the decomposition in the proof above. While the PO-bilinear rank itself in the current definition is infinite-dimensional, when we get the PAC result later, the dependence on the PO-bilinear rank comes from the information gain based on , which is the intrinsic dimension of . Thus, we can easily get the sample complexity result by replacing with the information gain over [63]. Generally, to take infinite dimensional models into account, the PO-bilinear rank in Definition 3 can be generalized using the critical information gain [17].
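To make the replacement of the finite rank by an information-gain quantity concrete, the following is a minimal numerical sketch, under our own illustrative assumptions (an RBF kernel, a fixed lengthscale, and a fixed noise level, none of which are specified by the text): it computes the log-determinant information gain that serves as the intrinsic-dimension proxy in the infinite-dimensional case.

```python
import numpy as np

def information_gain(X, lengthscale=1.0, noise=1.0):
    """Compute 0.5 * log det(I + K / noise) for an RBF kernel matrix K.

    This log-determinant is the information-gain proxy for the intrinsic
    dimension that replaces the finite PO-bilinear rank in the bound.
    """
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(X**2, axis=1)[None, :]
                - 2.0 * X @ X.T)
    K = np.exp(-0.5 * sq_dists / lengthscale**2)
    _, logdet = np.linalg.slogdet(np.eye(len(X)) + K / noise)
    return 0.5 * logdet

# toy usage: for smooth kernels the gain grows slowly with the sample size
rng = np.random.default_rng(0)
for n in (50, 100, 200):
    print(n, information_gain(rng.normal(size=(n, 2))))
```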
Appendix C Supplement for Section 5 (Algorithm for LQG with Continuous Action)
In this section, we present a modification to handle LQG with continuous action in Definition 3.
Our algorithm so far samples from and performs importance weighting in designing the loss , which will incur a polynomial dependence on as we will see in the next section. However, among the examples that we consider in Section 4, LQG has continuous actions. If we naïvely sample from a ball in and perform (nonparametric) importance weighting, we will pay in our sample complexity bound, which is not ideal for high-dimensional control problems. To avoid an exponential dependence on , here we replace with a -optimal design over the action’s quadratic feature space.
Here, we want to evaluate the Bellman error of pair under a roll-in policy :
where for any linear deterministic policy (here ) using a single policy. In other words, we would like to get a good loss such that
for some policy without incurring an exponential dependence on . We explain how to design such a loss function step by step.
First Step
The first step is to consider the conditional expectation on . Here, using the quadratic form of , we can show that there exist some :
where . Then, the Bellman loss we want to evaluate can be written in the form of
Second step
The second step is to compute a G-optimal design for the set for a sufficiently large , and denote by the support of the design. Note that in LQG, though we cannot ensure that the action lives in a compact set, we can still ensure that with high probability, which suffices in our setting as we will see. Since the dimension of is , we can ensure [47, 41]. Here is the concrete theorem we invoke.
Theorem 12 (Property of G-optimal design).
Suppose is a compact set. There exists a distribution over such that:
- is supported on at most points.
- For any , we have .
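As an illustration of how such a design can be computed in practice, here is a minimal sketch of the classical Frank-Wolfe (Kiefer-Wolfowitz) iteration for an approximate optimal design over a finite set of candidate feature vectors; the function name, the finite candidate set, and the stopping tolerance are our own assumptions and not part of the paper's algorithm.

```python
import numpy as np

def approximate_g_optimal_design(Phi, n_iters=1000, tol=1e-3):
    """Frank-Wolfe (Kiefer-Wolfowitz) iterations for an approximate
    G-optimal design over a finite set of candidate features.

    Phi: (n, d) array whose rows are candidate feature vectors.
    Returns weights rho (n,) such that max_i phi_i^T M(rho)^{-1} phi_i ~ d,
    where M(rho) = sum_i rho_i phi_i phi_i^T.
    """
    n, d = Phi.shape
    rho = np.full(n, 1.0 / n)                      # start from the uniform design
    for _ in range(n_iters):
        M = Phi.T @ (rho[:, None] * Phi)           # design matrix M(rho)
        M_inv = np.linalg.pinv(M)
        lev = np.einsum("ij,jk,ik->i", Phi, M_inv, Phi)  # leverage scores
        j = int(np.argmax(lev))
        g = lev[j]
        if g <= d * (1.0 + tol):                   # optimum has max leverage equal to d
            break
        step = (g / d - 1.0) / (g - 1.0)           # closed-form line-search step
        rho *= (1.0 - step)
        rho[j] += step
    return rho
```

By the Kiefer-Wolfowitz equivalence theorem (the result invoked above), driving the maximum leverage score toward the feature dimension makes the design simultaneously D- and G-optimal, which is exactly the coverage property in Theorem 12; in practice, the returned weights can be pruned to their small support.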
We have the following handy lemma stating that any is spanned by .
Lemma 13.
Let and . Then, it satisfies
Proof.
Since is full rank by the construction of the G-optimal design, is invertible. Then, we have
For the latter statement, we have
We use a property of G-optimal design in Theorem 12.
For the last statement, we have
from the CS inequality. ∎
Third Step
The third step is to combine the facts so far. Recall that we want to evaluate
In addition, the following also holds:
Here, we use . This concludes that
Thus, we can perform policy evaluation for a policy if we can intervene with actions from .
Fourth Step
The fourth step is to replace with a single policy that selects actions uniformly at random from the set , which we denote as . Using importance weighting, we define the loss function for as follows:
(13)
where is a uniform action over and
The term (13) is equal to the quantity we want to evaluate.
Summary
To summarize, we just need to use the following loss function in line 7 in Algorithm 1:
where is
and is a set of i.i.d. samples following the distribution induced by executing . The thresholds in the indicator functions are some large values selected properly later. Due to the unbounded Gaussian noise in LQG, the indicator functions for truncation are introduced here for technical reasons, in order to obtain valid concentration in Assumption 2.
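For concreteness, the following is a minimal sketch of the resulting loss estimator, with hypothetical names: `design_actions` stands for the support of the G-optimal design, `coeff` holds the spanning coefficients of the evaluated policy's action from Lemma 13, and `clip` plays the role of the truncation thresholds in the indicator functions. This illustrates the importance-weighting scheme, not the paper's exact implementation.

```python
import numpy as np

def lqg_design_loss(batch, n_support, coeff, value_fn, next_value_fn, clip=1e6):
    """Estimate the Bellman error of (pi, g) from data collected by playing
    actions uniformly at random from the design support.

    batch: iterable of (z, j, reward, z_next) where j indexes the design
           action that was played at this step.
    coeff: spanning coefficients (Lemma 13) of the evaluated policy's action
           in terms of the design points; hypothetical input for illustration.
    """
    total, count = 0.0, 0
    for z, j, reward, z_next in batch:
        td = reward + next_value_fn(z_next) - value_fn(z)
        td = float(np.clip(td, -clip, clip))   # truncation: Gaussian noise is unbounded
        total += n_support * coeff[j] * td     # importance weight n_support * 1{a = a_j}
        count += 1
    return total / max(count, 1)
```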
Appendix D Supplement for Section 6
We first add several discussions explaining core tests in detail. Next, we show the existence and form of link functions. Finally, we calculate the PO-bilinear rank. In this section, we focus on the general case where tests can span multiple steps.
D.1 Definition of PSRs
We first define core tests and predictive states [46, 62]. This definition is a generalization of Definition 5 with multi-step futures.
We slightly abuse notation and denote throughout this whole section — note that here does not include .
Definition 8 (Core test sets and PSRs).
A set is called a core test set if for any , , any possible future (i.e., test) and any history , there exists such that
The vector is referred to as the predictive state.
We often denote . To understand the above definition, we revisit observable undercomplete POMDPs and overcomplete POMDPs.
Example 1 (Observable undercomplete POMDPs).
In undercomplete POMDPs, when is full-column rank, is a core test set. Recall that is a matrix in whose entry indexed by is equal to .
Lemma 14 (Core tests in undercomplete POMDPs).
When is full-column rank, is a core test set.
Proof.
Consider any . Given a -dimensional belief state with each entry , for any future , there exists a -dimensional vector such that . More specifically, can be written as:
where is a vector with the entry indexed by equal to , and is a matrix with the entry indexed by equal to . Here, note that given a vector , is defined as a diagonal matrix whose diagonal elements correspond to . Thus, we have
where and . This concludes the proof. ∎
Example 2 (Overcomplete POMDPs).
We consider overcomplete POMDPs so that we can permit .
Lemma 15 (Core tests in overcomplete POMDPs).
Recall . Define a -dimensional matrix whose entry indexed by is equal to . When this matrix is full-column rank for all , is a core test set.
Proof.
Fix a test and consider a step . Then,
Thus, the assumption that is full column rank implies that the matrix with the entry indexed by equal to is full-column rank.
Define a -dimensional state given history . By definition, we have
Using the fact that is full-column rank, we have . Thus, using the form of from the proof of Lemma 14, we can conclude that for any test , we have . This concludes that is a core test set. ∎
Finally, we present an important property of predictive states, which corresponds to the Bayesian filter in POMDP.
Lemma 16 (Forward dynamics of predictive states).
We have
When we define where rows are for , we can express the forward update rule of predictive states as follows:
Proof.
The proof is an application of Bayes’s rule. We denote the observation part and the action part of by and , respectively. We have
(by definition)
(Bayes rule)
(by definition)
This concludes the proof. ∎
To further understand why PSRs generalize POMDPs, let us revisit undercomplete POMDPs (i.e., being full column rank) again. Set . As we saw in the proof of Lemma 14, the belief state together with defines the predictive state, i.e., , with , and . Note that in POMDPs, the matrix and the vector contain only non-negative entries. On the other hand, in PSRs, and could contain negative entries. This is the intuitive reason why PSRs are more expressive than POMDPs [46]. For a formal instance of a finite-dimensional PSR that cannot be expressed as a finite-dimensional POMDP, refer to [62, 31].
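The following is a small self-contained numerical check of this correspondence, written under our own toy conventions (the observation is emitted before the action at each step, matching the filter used in this paper, and all sizes are arbitrary small numbers): it builds a random undercomplete POMDP, updates the belief with the Bayesian filter, and verifies that the PSR operator built from the observation and transition matrices reproduces the predictive state.

```python
import numpy as np

rng = np.random.default_rng(0)
S, O_dim, A = 3, 4, 2                                 # |S| <= |O| so O can be full column rank

Obs = rng.dirichlet(np.ones(O_dim), size=S).T         # Obs[o, s] = O(o | s)
T = [rng.dirichlet(np.ones(S), size=S).T for _ in range(A)]   # T[a][s', s] = T(s' | s, a)

b = rng.dirichlet(np.ones(S))                         # current belief over latent states
q = Obs @ b                                           # predictive state for one-step tests

a, o = 1, 2
# Bayesian filter: incorporate the observation o, then the transition under a.
b_next = T[a] @ (Obs[o] * b)
b_next /= b_next.sum()

# PSR forward update: a linear operator on q, normalized by P(o | history) = q[o].
M_ao = Obs @ T[a] @ np.diag(Obs[o]) @ np.linalg.pinv(Obs)
q_next = M_ao @ q
q_next /= q_next.sum()

assert np.allclose(q_next, Obs @ b_next, atol=1e-8)
```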
D.2 Existence of link functions
We discuss the existence and the form of link functions. First, we define general value link functions with multi-step futures. For notational simplicity, we assume here that the tests have the same length, i.e., there is a , such that .
Definition 9 (General value link functions in dynamical systems).
Recall is the set of tests. At time step , general value link functions are defined as solutions to the following:
(14)
where is some distribution over the action set induced by the test set, i.e., . Here, for , we often denote and by and , respectively.
To show the existence of general value link functions for PSRs, we first study the form of value functions in PSRs. The following lemma states that value functions for -memory policies have bilinear forms.
Lemma 17 (Bilinear form of value functions for -memory policies).
Let be a one-hot encoding vector over . Suppose is a core test set. Then, for any -memory policy , there exists such that
Proof.
From Lemma 16, there exists a matrix such that via Bayes rule:
(15)
We prove the claim by induction. The base case clearly holds. Thus, we assume
We have
Note we use the assumption that the reward is a function of conditional on .
We first check the first term (a) that contains rewards. Using the fact that , this is equal to
Thus, it has a bilinear form, i.e., there exists some matrix such that
where is a matrix whose row indexed by is equal to .
Next, we see the second term (b). Using (15), the second term is equal to
where we use the notation to represent the operation of appending the pair to the memory while maintaining the proper length of the memory by truncating away the oldest observation-action pair. Thus, it again has a bilinear form, and the matrix can be defined such that its row indexed by is equal to . This concludes the proof. ∎
Next, we check sufficient conditions that ensure the existence of general K-step link functions. Given , we define the corresponding set of action sequences as . We set in (14) to be a uniform distribution over the set, denoted by . Namely, selects a sequence of test actions uniformly at random from .
Lemma 18 (Existence of link functions in PSRs).
Suppose is a core test. There exists such that
Proof.
We mainly need to design an unbiased estimator of the predictive state . We use importance weighting to do that. Given , and the resulting corresponding random observations , we define the following estimator , such that its entry indexed by a test is equal to:
We can verify that
Then,
With this estimator, now we can define the link function using the bilinear form of , i.e.,
Using the fact that is an unbiased estimate of , we can conclude the proof. ∎
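For concreteness, here is a minimal sketch of this importance-weighted estimator, with hypothetical argument names: each test contributes a nonzero entry only when the executed action sequence matches that test's actions, and the match indicator is reweighted by the number of distinct test action sequences.

```python
import numpy as np

def predictive_state_estimate(tests, actions_taken, obs_seen, n_action_seqs):
    """Importance-weighted estimate of the predictive state (sketch of Lemma 18).

    tests:         list of (obs_seq, act_seq) pairs defining the core tests.
    actions_taken: the executed action sequence, drawn uniformly among the
                   n_action_seqs distinct test action sequences.
    obs_seen:      the observations generated by the environment.
    Entry i is an unbiased estimate of P(obs_seq_i | history; do(act_seq_i)).
    """
    q_hat = np.zeros(len(tests))
    for i, (obs_seq, act_seq) in enumerate(tests):
        if tuple(actions_taken[:len(act_seq)]) == tuple(act_seq):
            match = tuple(obs_seen[:len(obs_seq)]) == tuple(obs_seq)
            q_hat[i] = n_action_seqs * float(match)   # weight 1 / (1 / n_action_seqs)
    return q_hat
```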
D.3 PO-Bilinear Rank Decomposition
Finally, we calculate the PO-bilinear rank. Here,
The Bellman error for under a roll-in denoted by is defined as
In fact, for any general value link functions .
Our goal is to design a loss function such that we can estimate the above Bellman error using data from a single policy. To do that, we design the following randomized action selection strategy.
Given an action sequence from a test , let us denote by a copy of but starting from the second action of , i.e., if , then . Denote . Our random action selection strategy first selects uniformly at random from , and then selects a sequence of actions uniformly at random from . Here, we remark that the length of the output is not fixed (i.e., has length larger than ).
As a first step, we define two unbiased estimators for and . Conditioning on history , given actions followed by action sequence , denote the corresponding observations as . We construct unbiased estimators for and as follows. As an unbiased estimator of , we define with the entry indexed by test as follows:
(16)
Similarly, as an unbiased estimator of , we define with the entry indexed by test as follows:
(17)
We remark that the length of in (16) and that of in (17) are different.
Then, by using importance sampling, we can verify
With the above setup, we can construct the loss function for estimating the Bellman error. We set the loss as follows:
(18)
Since we have shown that and are unbiased estimators of and , respectively, we can show that for any roll-in policy :
The above shows that we can use as a loss function.
Summary
We can use almost the same algorithm as Algorithm 1. The sole difference is that we need to replace with
where is an empirical approximation when executing .
Calculation of PO-bilinear rank
Finally, we prove that a PSR belongs to the PO-bilinear class.
Lemma 19 (PO-bilinear decomposition).
Let be a minimum core test set contained in . The PSR model has PO-bilinear rank at most , i.e., there exist two -dimensional mappings and such that for any triple , we have:
Proof.
We first take expectation conditional on . Then, we have
where and are two matrices as defined in the proof of Lemma 17, where we have already seen that the -induced Bellman backup of a value function with a bilinear form gives back a value function with a bilinear form. Rearranging terms, we get:
Now recall that the minimum core test set is . The final step is to argue that lives in a subspace whose dimension is . Since is a core test set, by definition, we can express using , i.e.,
where the row of indexed by is equal to , and is the vector that is used to predict , whose existence is ensured by the definition of PSRs. This implies that
Finally, we take the expectation with respect to ; then we get such that
∎
The key observation here is that the bilinear rank scales with but not . This is good news since we often cannot identify exact minimal core test sets; however, it is easy to find core test sets that include a minimal core test set. Thus, even if we do not know the linear dimension of a dynamical system a priori, the resulting bilinear rank is still the linear dimension of the dynamical system as long as the core test set is large enough to include a minimal core test set. This results in a benefit in the sample complexity, as we will see in Section K.
Appendix E Proof of Theorem 1
We fix the parameters as in Theorem 1. Let
We define
Then, by Assumption 2, with probability , we have
(19)
(20)
Hereafter, we condition on the above events.
We first show the following lemma. Recall
Lemma 20 (Optimism).
Set . For all , is a feasible solution of the constrained program. Furthermore, we have for any , where is the value link function selected by the algorithm in iteration .
Proof.
For any , we have
since is a value link function in . This is because
(IS sampling)
(First assumption in Definition 3)
(Second assumption in Definition 3)
Then, we have
(Uniform convergence result)
(Using the construction of algorithm)
(Uniform convergence)
∎
Remark 5.
Note that
holds for general link functions in Definition 6. Thus, the statement goes through even if we use Definition 6.
Next, we prove the following lemma to upper bound the per step regret.
Lemma 21.
For any , we have
Lemma 22.
Let . We have
Proof.
Lemma 23.
Proof.
We have
The first term is upper-bounded by . The second term is upper-bounded by
(First assumption in Definition 3)
From the first line to the second line, we use the definition of bilinear rank models. From the second line to the third line, we use . In the last line, we use the constraint on .
∎
Combining lemmas so far, we have
(Use Lemma 21)
(CS inequality)
(Use Lemma 22 and Lemma 23)
We set such that and . Then,
since when .
Finally, the following holds
(Plug in )
(Plug in )
Appendix F Sample Complexity for Finite Function Classes
Consider cases where and are finite and the PO-bilinear rank assumption is satisfied. When and are infinite hypothesis classes, and are replaced with their -covering numbers, respectively.
Theorem 13 (Sample complexity for discrete and ).
Let for any and the PO-bilinear rank assumption holds with PO-bilinear rank . By letting , with probability , we can achieve when we use samples at most
Here, are omitted.
Proof.
We derive the above result. First, we check the uniform convergence result. Then,
Thus, we need to set such that
where is some constant and
By organizing the term, the following is sufficient
Using Lemma 44, the following satisfies the condition:
Combining all together, the sample complexity is , i.e.,
Appendix G Sample Complexity in Observable HSE POMDPs
We revisit the existence of link functions by taking the norm constraint into account. Then, we consider the PO-bilinear decomposition with certain and . Next, we calculate the uniform convergence result. Finally, we show the sample complexity result.
We use the following assumptions
Assumption 7.
For any , the following holds:
1.
2. There exists a matrix such that (i.e., conditional embedding of the omission distribution),
3.
4. There exists a matrix such that (i.e., conditional embedding of the transition)
5. is finite.
We define
Existence of link functions.
We show value link functions exist. This is proved by noting
Thus, is a value link function. The radius of the parameter space is upper-bounded by . Hence, we set
Then, the realizability holds.
PO-bilinear decomposition.
Recall we derive the PO-bilinear decomposition in Section B.4. Consider a triple with and , we have:
which verifies the PO-bilinear structure, i.e.,
and shows that the PO-bilinear rank is at most . Thus, based on the above PO-bilinear decomposition, we set . This is because
and
In the above, we use Jensen’s inequality.
Uniform convergence.
To invoke Theorem 1, we show the uniform convergence result.
Lemma 24 (Uniform convergence of loss functions).
Let . Then, with probability ,
and
Proof.
Let . Define as an -net for . Then, . Then,
Besides, for fixed , , we have
Then, for , , we have
Hence, for any ,
By taking , we have :
Similarly,
∎
Finally, to obtain the PAC bound, we need to find such that
where is some constant and
By organizing the term, the following is sufficient:
Appendix H Sample Complexity in Observable Undercomplete Tabular POMDPs
We revisit the existence of value link functions. Then, we show the PO-bilinear rank decomposition. After showing the uniform convergence lemma, we calculate the sample complexity.
Existence of value link functions.
In the tabular case, by setting
where is a one-hot encoding vector over , we can regard the tabular model as an HSE-POMDP. Here is our assumption.
Assumption 8.
(a) , (b) is full-column rank and for any .
Note we use the -norm since this choice is more amenable in the tabular setting. However, even if the norm bound is given in terms of -norm, we can still ensure the PAC guarantee (this is because ).
Here, since we assume the reward lies in , value functions on the latent state belong to . Here, letting , value link functions exist by taking . Hence, we take
so that the realizability holds. Importantly, we can ensure since
for any . Note is contained in
(21)
This is because each is equal to for some vector . Here, denoting the component of corresponding to by , is a vector stacking for each . Then, we have
Hence, .
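The following tiny numerical check illustrates this construction in isolation, under our own conventions (the memory argument z is suppressed for simplicity, and all variable names are ours): when the observation matrix has full column rank, taking the link function as the pseudo-inverse transpose applied to the latent value reproduces that value in conditional expectation.

```python
import numpy as np

rng = np.random.default_rng(1)
S, O_dim = 3, 5
Obs = rng.dirichlet(np.ones(O_dim), size=S).T   # Obs[o, s] = O(o | s), full column rank a.s.
V = rng.uniform(0.0, 1.0, size=S)               # latent-state value (memory z suppressed)

g = np.linalg.pinv(Obs.T) @ V                   # candidate value link function g(o)

# E[g(o) | s] = sum_o O(o | s) g(o) should reproduce V(s) for every latent state s
assert np.allclose(Obs.T @ g, V, atol=1e-8)
```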
PO-Bilinear decomposition.
Next, recall we derive the PO-bilinear decomposition:
Then, and . We use . This is because
In the last line, we use (21).
Uniform convergence.
Then, we can obtain the following uniform convergence lemma.
Lemma 25.
Let and . Then, with probability ,
and
Proof.
Let .
Define as an -net for with respect to -norm. Define as an -net for with respect to the following norm:
Then, .
Let where is a one-hot encoding vector over . Then, when , we have
Besides, for fixed , , we have
Then, for , , we have
Hence, for any ,
By taking , we have
∎
Sample Complexity.
Finally, we obtain the PAC bound. We need to find such that
where is some constant and
By organizing terms, we get
Thus, we need to set
Hence, the sample complexity is
By some algebra, it is
Later, we prove that we can remove using a more refined analysis in Section N.
Global optimality.
Appendix I Sample Complexity in Observable Overcomplete Tabular POMDPs
To simplify the presentation, we focus on the case when .
Existence of value link functions.
In the tabular case, by setting
where is a one-hot encoding vector over and is a one-hot encoding vector over , we can regard the tabular model as an HSE-POMDP. Here is our assumption.
Assumption 9.
(a) , (b) is full-column rank and .
PO-bilinear decomposition.
Next, we derive the PO-bilinear decomposition:
Then, and . We use .
Sample Complexity.
Appendix J Sample Complexity in LQG
In this section, we derive the sample complexity in LQG. We first explain the setting. Then, we prove the existence of link functions; Lemma 5 is proved there. Furthermore, we show the PO-bilinear rank decomposition in LQG. Next, we show the uniform convergence result in LQG. Finally, by invoking Theorem 1, we calculate the sample complexity.
We study a finite-horizon discrete time LQG governed by the following equation:
where is a Gaussian noise with mean and covariance , and is a Gaussian noise with mean and . We use a matrix instead of to avoid notational confusion later.
where is the vector removing from and is a matrix mapping to . This is derived by
We suppose the system is always stable in the sense that the operator norm of is upper-bounded by . Here is the assumption we introduce throughout this section.
Assumption 10.
Suppose . Suppose for any . is full-column rank.
We present the form of linear mean embedding operators in LQGs.
Lemma 26 (Linear mean embedding operator).
Let . We have
Proof.
Here, we have
From the second line to the third line, we use the formula . This immediately concludes the result.
∎
Thus, the matrix has the left inverse when is full-column rank as follows:
We use a block matrix inversion formula:
J.1 Existence of Link Functions
Lemma 27 (Value functions in LQGs).
Let for . Then, a value function has a bilinear form:
For any , these parameters are defined recursively by
Proof.
The proof is completed by backward induction regarding , starting from level . First, we have
For the inductive step, supposing the statement is true at horizon , we have
where is a vector that removes the last component from and is a state at . Here, recall we have
Then, the statement is concluded by some algebra.
∎
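As a reference point for the recursion in Lemma 27, the following sketch gives the standard quadratic policy-evaluation recursion for a linear system x_{h+1} = A x_h + B a_h + w_h under a fixed linear policy a = K x, stated for per-step costs (negate for rewards). All matrices here are generic LQR stand-ins rather than the paper's memory-augmented notation, which performs the analogous computation on the extended state.

```python
import numpy as np

def quadratic_policy_evaluation(A, B, K, Q, R, Sigma_w, H):
    """Backward recursion for V_h(x) = x^T P_h x + c_h under a = K x.

    Per-step cost is x^T Q x + a^T R a; the Gaussian process noise with
    covariance Sigma_w only contributes to the constant term c_h.
    """
    d = A.shape[0]
    P_next, c_next = np.zeros((d, d)), 0.0
    A_cl = A + B @ K                                    # closed-loop dynamics
    for _ in range(H):
        P = Q + K.T @ R @ K + A_cl.T @ P_next @ A_cl    # quadratic coefficient
        c = c_next + float(np.trace(P_next @ Sigma_w))  # noise enters as a constant
        P_next, c_next = P, c
    return P_next, c_next
```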
Lemma 28 (Norm constraints on value functions).
We can set such that
Proof.
We have
Then,
Since we assume , this immediately leads to
Besides,
Thus,
∎
Next, we set the norm on the function class .
Lemma 29 (Realizability on LQGs).
We set
A function class includes at least one value link function for any linear policy for .
Proof.
Here, we have
The norm constraint on is decided by the following calculation:
Then, the norm on is decided by the following calculation:
From the first line to the second line, we use Lemma 43.
∎
J.2 PO-bilinear Rank Decomposition
Lemma 30 (Bilinear rank decomposition for LQG).
For any , we have the following bilinear rank decomposition:
where
Here, and depend on a policy . The following norm constraints hold:
Proof.
Hereafter, we suppose the expectation is always taken under . We also denote to simplify the presentation. Using this simplified notation, we get
Besides,
Then, the bilinear decomposition is clear by using
where is any vector and is any matrix.
First, we calculate the upper bounds of the norms.
From the third line to the fourth line, we use . Thus, .
Next, we consider . By some algebra, we can see
∎
Lemma 31 (Variance of marginal distribution).
Recall is a marginal distribution over at when we execute . The distribution is a Gaussian distribution with mean . The operator norm on the variance of is upper-bounded by .
Proof.
We first calculate the operator norm of the variance of . The variance is
The statement is immediately concluded. ∎
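For intuition, the bound in Lemma 31 corresponds to the following standard covariance recursion of the closed-loop state, which stays bounded when the operator norm of the closed-loop matrix is below one; the matrices are again generic stand-ins, assuming a zero initial state.

```python
import numpy as np

def closed_loop_covariance(A_cl, Sigma_w, H):
    """Covariance of x_H for x_{h+1} = A_cl x_h + w_h with x_0 = 0.
    With ||A_cl|| < 1, the result is bounded by sum_k ||A_cl||^{2k} ||Sigma_w||."""
    Sigma = np.zeros_like(Sigma_w)
    for _ in range(H):
        Sigma = A_cl @ Sigma @ A_cl.T + Sigma_w
    return Sigma
```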
Let . Recall . We define
Then, the final estimator is constructed by
This is equal to
where
We set
for any .
J.3 Uniform Convergence
Recall that
Besides, is included in
Lemma 32 (Concentration of loss functions).
With probability ,
is upper-bounded by
Proof.
Due to indicator functions, is bounded for any by
Thus, for fixed and , we can say that with high probability
Besides, we can consider a covering number with respect to -norm for the space of and since both are bounded. The radius of each space is upper-bounded by
Thus, by taking a union bound and considering the bias term due to the discretization as in the proof of Lemma 24, the statement is concluded. ∎
Lemma 33 (Bias terms 1).
The expectations of and are equal to
where
Proof.
We want to upper bound the difference of
and
By CS inequality, we have
We analyze the term (a) and the term (b). Before starting the analysis, note that follows a Gaussian distribution with mean and variance upper-bounded by
using Lemma 31. Besides, from Lemma 13. Note we can use a G-optimal design since we have a norm constraint on .
Regarding the term (a), by setting and properly, we can ensure it is upper-bounded by
Regarding the term (b), noting that higher-order moments of Gaussian distributions can always be upper-bounded, the term (b) is upper-bounded by . This concludes the statement. ∎
Lemma 34 (Bias terms 2).
Proof.
First Statement
We have
Here, by some algebra, there exists a vector
Thus, there exists and a vector such that
Recall we can write
Using the above,
Besides,
Thus,
In conclusion,
Second Statement
As we see in the proof of Lemma 33, the following term
is upper-bounded by . Hence,
(Definition)
(Statement of Lemma 33)
(First statement)
∎
J.4 Sample Complexity
Appendix K Sample Complexity in PSRs
To focus on the main point, we just use a one-step future. We first show the form of link functions to set a proper class for . Next, we show the PO-bilinear decomposition.
We make the following assumptions.
Assumption 11.
(a) is a core test set and is a minimum core test set, (b) for any , where is in .
K.1 Existence of Link Functions
Recall , where we use to denote the one-hot encoding vector over , and is a matrix in .
Then, is a value link function. This is because
Hence, we set to be
so that the realizability holds.
K.2 PO-bilinear Rank Decomposition
We show that PSR admits PO-bilinear rank decomposition (Definition 6). Here is the Bellman loss:
To analyze the above, we decompose it into three terms:
Let be a minimum core test. Here, for any future , there exists such that where is a -dimensional predictive state . This satisfies
(23)
where is a matrix whose -th row is as we see in Section D.
Term (c).
We have
where is a matrix whose -th row is . The existence of is ensured since is a core test.
Term (b).
We have
for some matrix . In the first inequality, we use the fact that the reward is a function of conditioned on the whole history. From the first line to the second line, we use a property of core tests.
Term (a).
We have
for some matrix . Then, the above is further equal to
for some matrix . From the first line to the second line, we use in (23).
Summary.
Combining all terms, there exists a matrix such that
Here, we suppose for any . Besides,
Thus, we can set .
K.3 Sample Complexity
Suppose are finite and rewards at lie in . Assume the realizability holds. Then,
Following the calculation in Section F, the sample complexity is
Here, there is no explicit dependence on . Note that the worst-case sample complexity of is and the worst-case sample complexity of is .
K.4 Most General Case
We consider the general case in Section D. Let be a function class consisting of where satisfies and . When the realizability holds, we would get
Here, there is no explicit dependence on . Note that the worst-case sample complexity of is and the worst-case sample complexity of is .
Appendix L Proof of Theorem 8
We fix the parameters as in Theorem 8. Let
From the assumption, with probability , we have
(24)
(25)
We first show the following lemma. Recall
Lemma 35 (Optimism).
Set . For all , is a feasible solution of the constrained program. Furthermore, we have for any , where is the value link function selected by the algorithm in iteration .
Proof.
For any , we have
since is a value link function in , noting condition (c) in Definition 7. Thus,
using (24) noting . Hence, is a feasible set for any and any .
Then, we have
(Uniform convergence result)
(Using the construction of algorithm)
(Uniform convergence)
∎
Next, we prove the following lemma to upper bound the per step regret.
Lemma 36.
For any , we have
From Lemma 22, we have
Lemma 37.
Proof.
We have
The first term is upper-bounded by . The second term is upper-bounded by
From the first line to the second line, we use (b) in Definition 7. From the second line to the third line, we use the fact that is a non-decreasing function. In the last line, we use the constraint on .
∎
Combining lemmas so far, we have
(CS inequality)
We set such that and . Then,
since when .
Finally, the following holds
(Plug in )
(Plug in )
Appendix M Sample Complexity in -step Decodable POMDPs
We first give a summary of our results. Then, we show that an -step decodable POMDP is a PO-bilinear rank model. After showing the uniform convergence of the loss function with fast rates, we calculate the sample complexity. Since we use squared loss functions, we need to modify the proof of Theorem 1.
M.1 PO-bilinear Rank Decomposition (Proof of Lemma 10)
In this section, we derive the PO-bilinear decomposition of -step decodable POMDPs (Lemma 10).
First, we define moment matching policies following [20]. We denote .
Definition 10 (Moment Matching Policies).
For , we define
where and . For an -step policy and , we define the moment matching policy :
Note the expectation in the right hand side is taken under a policy .
Thus, the first condition in Definition 7 (i.e., (6)) is satisfied.
Next, we show the second condition in Definition 7 (i.e., (7)). This is proved as follows:
(26)
(Jensen’s inequality)
(27)
From the first line to the second line, we use [20, Lemma B.2]. From the third to the fourth line, we use the Bellman completeness assumption: . From the fourth line to the fifth line, we use importance sampling.
M.2 Uniform Convergence
We define the operator
and
Lemma 38 (Uniform Convergence).
Let . Suppose for . Fix .
1. Take a true link function . Then, it satisfies
2. Suppose satisfies
and the Bellman completeness holds. Then, with probability , we have
Proof.
To simplify the notation, we define
Given , we define as the maximizer:
In this proof, the expectation is always taken with respect to the data generating process . We first observe
Then, we define
As the first step, we prove with probability
(29) |
We first fix . Then, from the definition of and the Bellman completeness , we have
(30) |
Here, we invoke Bernstein’s inequality:
(31) |
Hereafter, we condition on the above event. Then, combining (30) and (31), we have
Here, we use
Thus, by some algebra,
Besides,
Lastly, by a union bound over , the statement (29) is proved. Note .
First Statement.
From the second line to the third line, we use (29).
Second Statement.
Now, we use the assumption on :
From what we showed in (29), this implies
Recall that we want to upper-bound the error for . Here, we use the following observation later:
We use Bernstein’s inequality: with probability , for any ,
Here, we use
Hereafter, we condition on the above event.
Finally, we have
Hence,
∎
M.3 Proof of Main Statement
We define
Let
Then, from the first statement in Lemma 38, with probability ,
(32)
Besides, from the second statement in Lemma 38, for , when satisfies
we have
(33) |
We first show the optimism. Recall
Lemma 39 (Optimism).
Set . For all , is a feasible solution of the constrained program. Furthermore, we have for any .
Proof.
For any , letting be a corresponding value link function, we have
using (32). This implies
Hence, is a feasible set for any . Then, we have
∎
Next, recall the following two statements, which are proved as in the proof of Theorem 1.
- For any ,
- Let . We have
Lemma 40.
Proof.
The rest of the argument is the same as the proof in Theorem 1. Finally, the following holds
Sample Complexity Result.
We want to find such that
where
By organizing terms, we have
Thus, setting the following is enough:
The total number of samples we use is
Appendix N Sample Complexity in Observable POMDPs with Latent Low-rank Transition
This section largely follows the one in Section M.
N.1 Existence of Value Link Functions
Since we consider the discrete setting, we can set the value link function class as in Section H. Hence, we set
Then, we can ensure . Then, from the construction of , we can also ensure .
N.2 PO-bilinear Rank Decomposition (Proof of Lemma 11)
In this section, we derive the PO-bilinear decomposition of observable POMDPs with the latent low-rank transition. We want to prove Lemma 11. Recall .
Next, we show the second condition in Definition 7. This is proved as follows:
From the first line to the second line, we use [20, Lemma B.2]. From the third to the fourth line, we use the Bellman completeness assumption: . From the fourth line to the fifth line, we use importance sampling.
The third condition
is easily proved.
Finally, the following norm constraints hold:
Sample Complexity Result.
Following the same procedure as in Section M, we want to find such that
where
By organizing terms, we have
Thus, setting the following is enough
The total number of samples we use is
Finally, we plug in .
Appendix O Exponential Stability for POMDPs with Low-rank Transition
In this section, we prove that a short-memory policy is a globally near-optimal policy in low-rank POMDPs. We first introduce some notation. Next, we prove the exponential stability of Bayesian filters, which immediately leads to the main statement.
Notation.
Given a belief , an action and observation pair , we define the Bayesian update as follows. We define as the operation that incorporates observation , i.e., with , and as the operation that incorporates the transition, i.e., . Finally, we denote as the full Bayesian filter, i.e.,
Let us denote as the initial latent state distribution. Given the first observation , we denote as the initial belief of the system conditioned on the first observation . Given two beliefs , we define the distance
Consider a POMDP whose latent transition is low-rank, i.e., . For notational simplicity, we still consider discrete state, action, and observation spaces to avoid using measure-theoretic language.
Design of initial distribution.
We want to design a good initial distribution for an artificial Bayesian filter that ignores all history other than the short memory.
The following lemma, from [24, Lemma 4.9], quantifies the contraction of the Bayesian map.
Lemma 41 (Contraction property of beliefs).
Suppose and . Then we have:
Next, we compute the G-optimal design using the feature . Denote the G-optimal design as . Here, we use the assumption for any in Assumption 6, which ensures that lives in a compact space for any . The property is given in Theorem 12. In summary, the support of (denoted by ) contains at most points, and for any , there exists such that
(34) |
where we denote the points on the support as .
We set our “empty” belief as follows:
Note that this belief does not depend on any history. We aim to bound using the following lemma, where is some belief resulting from applying for any to a belief . This is a newly introduced lemma.
Lemma 42 (Distance between the actual belief and the designed initial distribution).
For any distribution that results from a previous belief and a one-step latent transition under action , i.e., , we have:
Proof.
For any , using its definition, we have:
(Definition)
(Property of G-optimal design)
Similarly, the construction of implies that , thus, we have:
(Use property of the G-optimal design (34))
Thus, . ∎
Theorem 14 (Exponential stability for POMDPs with Low-rank Latent Transition).
Consider a . Consider any policy (fully history-dependent) and a trajectory for . Denote by the (true) belief conditioned on . For the approximated belief, first for , we define as:
for , we define as:
Then we have:
Proof.
We define
Hereafter, we omit to simplify the notation.
We start from the base case (i.e., ).
First case, consider , . Denote . From Lemma 42, we know that:
Thus, noting and , we have:
(From Lemma 41)
which implies the base case:
Now for any , we have:
(Data processing inequality from [24, Lemma 2.7])
This completes the induction step. Adding back the expectation with respect to the history, we conclude the proof.
When , we simply start with the original belief . For any , we simply set , thus the conclusion still holds.
∎
The above Theorem 14 indicates that, in order to approximate the ground-truth belief conditioned on the entire history, we only need to apply the Bayesian filter on the M-step memory starting from a fixed distribution . The existence of such a is proven by construction, where we rely on the low-rankness of the latent transition and a G-optimal design over using the feature .
Theorem 14, together with the proof of Theorem 1.2 in [24], immediately implies that for (with being some absolute constant), there must exist an M-memory policy such that ; thus, a globally optimal policy can be approximated by a policy that only relies on short memories.
Appendix P Auxiliary Lemmas
We use the following in Section 4.3.
Lemma 43 (Useful inequalities).
-
- When and are positive semi-definite matrices, we have
The following lemma is useful when we calculate the sample complexity.
Lemma 44.
The following is satisfied
when
for some constant .