
Provably Efficient Reinforcement Learning in
Partially Observable Dynamical Systems

Masatoshi Uehara ([email protected]), Cornell University. Supported by the Masason Foundation.
Ayush Sekhari ([email protected]), Cornell University.
Jason D. Lee ([email protected]), Princeton University.
Nathan Kallus ([email protected]), Cornell University.
Wen Sun ([email protected]), Cornell University.
Abstract

We study Reinforcement Learning for partially observable dynamical systems using function approximation. We propose a new Partially Observable Bilinear Actor-Critic framework that is general enough to include models such as observable tabular Partially Observable Markov Decision Processes (POMDPs), observable Linear-Quadratic-Gaussian (LQG), Predictive State Representations (PSRs), as well as two newly introduced models: Hilbert Space Embeddings of POMDPs and observable POMDPs with latent low-rank transition. Under this framework, we propose an actor-critic style algorithm that is capable of performing agnostic policy learning. Given a policy class that consists of memory-based policies (that look at a fixed-length window of recent observations), and a value function class that consists of functions taking both memory and future observations as inputs, our algorithm learns to compete against the best memory-based policy in the given policy class. For certain examples such as undercomplete observable tabular POMDPs, observable LQGs, and observable POMDPs with latent low-rank transition, by implicitly leveraging their special properties, our algorithm is even capable of competing against the globally optimal policy without paying an exponential dependence on the horizon in its sample complexity.

1 Introduction

Large state spaces and partial observability are two key challenges of Reinforcement Learning (RL). While recent advances in RL for fully observable systems have focused on the challenge of scaling RL to large state spaces in both theory and practice using rich function approximation, the understanding of large-scale RL under partial observability is still limited. In POMDPs, for example, a core issue is that the optimal policy is not necessarily Markovian since the observations are not Markovian.

A common heuristic to tackle large-scale RL with partial observability in practice is to simply maintain a time window of the history of observations, which is treated as a state to feed into the policy and the value function. Such a window of history can often be maintained explicitly by truncating away older history (e.g., DQN uses a window of length 4 for playing video games [49]; OpenAI Five uses a window of length 16 for LSTMs [4]). Since, even for planning under partial observations and known dynamics, finding the globally optimal policy conditioned on the entire history is generally NP-hard (due to the curse of history) [44, 55, 24], searching for a short memory-based policy can be understood as a reasonable middle ground that balances computation and optimality. The impressive empirical results of these prior works also demonstrate that in practice, there often exists a high-quality policy (not necessarily the globally optimal one) that is only a function of a short window of recent observations. However, these prior works that search for the best memory-based policy unfortunately cannot ensure sample-efficient PAC guarantees due to the difficulty of strategic exploration in POMDPs. The key question that we aim to answer in this work is:

Can we design provably efficient RL algorithms that agnostically learn the best fixed-length memory based policy with function approximation?

We provide affirmative answers to the above question. More formally, we study RL for partially observable dynamical systems that include not only the classic Partially Observable MDPs (POMDPs) [52, 56, 64], but also a more general model called Predictive State Representations (PSRs) [46]. We design a model-free actor-critic framework, named PO-Bilinear Actor-Critic Class, where we have a policy class (i.e., actors) that consists of policies that take a fixed-length window of observations as input (memory-based policy), and a newly introduced value link function class (i.e., critics) that consists of functions that take the fixed-length window of history and (possibly multi-step if the system is overcomplete) future observations as inputs. A value link function class is an analog of the value function class tailored to partially observable systems that only relies on observable quantities (i.e., past and future observations and actions). In our algorithm, we agnostically search for the best memory-based policy from the given policy class.

Our framework is based on the newly introduced notion of a value link function equipped with future observations. While the idea of using future observations has appeared in the literature on POMDPs, our work is the first to use this idea to learn a high-quality policy in a model-free manner; existing works use future observations only in a model-based manner [8, 25]. Building on these model-based viewpoints, recent works discuss strategic exploration to learn near-optimal policies, but their results are either limited to the tabular setting (and are not scalable to large state spaces) [36, 22, 3, 74, 43] or are tailored to specific non-tabular models where it is unclear how to incorporate general function approximation [65, 42, 12]. We break these barriers by devising a new actor-critic-based model-free view on POMDPs. We demonstrate the scalability and generality of our PO-bilinear actor-critic framework by showing PAC guarantees for many models as follows (see Table 1 for a summary).

Observable Tabular POMDPs. In tabular observable POMDPs, i.e., POMDPs where multi-step future observations retain information about latent states, the PO-bilinear rank decomposition holds. We can ensure the sample complexity is $\mathrm{Poly}(S,A^{M},O^{M},A^{K},O^{K},H,1/\sigma_{1})$, where $\sigma_{1}=\min_{x}\|\mathbb{O}x\|_{1}/\|x\|_{1}$ ($\mathbb{O}$ is the emission matrix), $S,A,O$ are the cardinalities of the state, action, and observation spaces, respectively, $H$ is the horizon, and $K$ is the number of future observations.¹ In the special undercomplete ($O\geq S$) case, our framework is also flexible enough to set the memory length according to the property of the problem in order to search for the globally optimal policy. More specifically, using the latest result from [24] about belief contraction, we can set $M=\tilde{O}((1/\sigma^{4}_{1})\ln(SH/\epsilon))$ with $\epsilon$ being the optimality threshold. This allows us to compete against the globally optimal policy without paying an exponential dependence on $H$.

¹In Section 8, we discuss how to get rid of $O^{M},O^{K}$ using a model-based learning perspective. The intuition is that a tabular POMDP's model complexity does not depend on $M$ or $K$, i.e., the number of parameters in the transition and emission distributions is $S^{2}A+OA$ (even in the time-inhomogeneous setting, it scales with $H(S^{2}A+OA)$, with no $O^{M}$ or $O^{K}$), and the PO-bilinear rank is still $S$.

Observable Linear Quadratic Gaussian (LQG). In observable LQG, a classic partially observable linear dynamical system, our algorithm can compete against the globally optimal policy with a sample complexity scaling polynomially with respect to the horizon and the dimensions of the state, observation, and action spaces (and other system parameters). This is achieved by simply setting the memory length $M$ to $H$. The special linear structure of the problem allows us to avoid exponential dependence on $H$ even when using the full history as memory. While global optimality results for tabular POMDPs and LQG exist separately via different algorithms, to the best of our knowledge, this is the first unified algorithm that can solve both tabular POMDPs and LQG simultaneously without paying an exponential dependence on the horizon $H$.

Observable Hilbert Space Embedding POMDPs (HSE-POMDPs). Our framework ensures an agnostic PAC guarantee for HSE-POMDPs, where policy-induced transitions and emission distributions have conditional mean embeddings [8, 6]. This model naturally generalizes tabular POMDPs and LQG. We show that the sample complexity scales polynomially with respect to the dimensions of the embeddings. This is the first PAC guarantee for HSE-POMDPs.

Predictive State Representations (PSRs). We give the first PAC guarantee for PSRs. PSRs model partially observable dynamical systems without even using the concept of latent states and strictly generalize the POMDP model. Our work significantly generalizes a prior PAC learning result for reactive PSRs [33], a much more restricted setting that requires the strong condition that the optimal policy depends only on the latest observation.

$M$-step decodable POMDPs [20]. Our framework can capture $M$-step decodable POMDPs, where there is an (unknown) decoder that can perfectly decode the latent state by looking at the latest $M$-step memory. Our algorithm can compete against the globally optimal policy with a sample complexity scaling polynomially with respect to the horizon $H$, $S$, $A^{M}$, and the statistical complexities of the function classes, without any explicit dependence on $O$. This PAC result is similar to the one from [20].

Observable POMDPs with low-rank latent transition. Our framework captures observable POMDPs where the latent transition is low-rank. This is the first PAC guarantee for this model. Under this model, we first show that with $M=\tilde{O}\left((1/\sigma_{1}^{4})\ln(dH/\epsilon)\right)$, where $d$ is the rank of the latent transition matrix, there exists an $M$-memory policy that is $\epsilon$-near optimal with respect to the globally optimal policy. Then, starting with a general model class that contains the ground-truth transition and emission distributions (i.e., realizability in the model class), we convert the model class into a policy class and a value link function class, and then show that our algorithm competes against the globally optimal policy with a sample complexity scaling polynomially with respect to $H,d,|\mathcal{A}|^{(1/\sigma_{1}^{4})\ln(dH/\epsilon)},1/\sigma_{1}$, and the statistical complexity of the model class. In particular, the sample complexity has no explicit dependence on the size of the state and observation spaces; instead it depends only on the statistical complexity of the given model class.

Model | PO-Bilinear Rank | PAC Learning
Observable tabular POMDPs | $(OA)^{M}S$ $(\dagger)$ (can be $S$) | Known
Observable LQG | $O(Md^{2}_{a}d^{2}_{s})$ $(\dagger)$ | Known
Low-rank $M$-step decodable POMDPs | Rank $(\dagger)$ | Known
Observable HSE-POMDPs | Feature dimension on $(z,s)$ | New
PSRs | $(OA)^{M}\times$ # of core tests | New
Low-rank observable POMDPs | Rank $(\dagger)$ | New

Table 1: Summary of settings that fall into the PO-Bilinear AC class. The 2nd column gives the parameters that bound the PO-Bilinear rank. Here $M$ denotes the length of memory used to define memory-based policies $\pi(\cdot|\bar{z}_{h})$, where $\bar{z}_{h}=(o_{h-M:h},a_{h-M:h-1})$ denotes the $M$-step memory. In the 3rd column, "Known" means that sample-efficient algorithms already exist, and "New" means our result gives the first sample-efficient algorithm. However, even in the "Known" case, agnostic guarantees are new; hence, when the policy class is small, we can gain some benefit. The symbol $\dagger$ means we can compete with the globally optimal policy without paying an exponential dependence on the horizon $H$. For the tabular case, the PO-bilinear rank can be improved to $S$ when we use the most general definition (refer to Section 8). For LQG, $d_{a}$ and $d_{s}$ are the dimensions of the action and state spaces. For PSRs, $O$ and $A$ denote the sizes of the observation and action spaces.

1.1 Related Works

Generalization and function approximation of RL in MDPs.

In Markovian environments, there is a growing literature that gives PAC bounds with function approximation under certain models. Some representative models are linear MDPs [36, 76], block MDPs [16, 48, 78], and low-rank MDPs [2, 71]. Several general frameworks [33, 61, 35, 21, 17] characterize sufficient conditions for provably efficient RL, and each of the above models is captured by these frameworks as a special case. While our work builds on the bilinear/Bellman rank framework [17, 33], when we naïvely reduce POMDPs to MDPs, the bilinear/Bellman rank is $\Theta(A^{H})$. These two frameworks are only shown to be applicable to reactive POMDPs, where the optimal policy depends only on the latest observation; this assumption makes the POMDP model very restricted.

Online RL for POMDPs.

Prior works [39, 19] showed $A^{H}$-type sample complexity bounds for general POMDPs. Exponential dependence can be circumvented with more structure. First, in the tabular setting, under observability assumptions, favorable sample complexities are obtained in [3, 22, 34, 43, 23] by leveraging the spectral learning technique [29] (see Section 1.1 in [34] for an excellent summary). Second, for LQG, which is a partially observable version of LQR, sub-linear regret algorithms are proposed in [42, 65]. These works use random policies for exploration, which is sufficient for LQG. Since a random exploration strategy is not enough for tabular POMDPs, it is unclear whether the existing techniques from LQG can be applied to solve general POMDPs. Third, the recent work [20] provides a new model called the $M$-step decodable POMDP (when $M=1$, it is a Block MDP) together with an efficient algorithm.

Our framework captures all of the above-mentioned POMDP models. In addition, we propose a new model called HSE-POMDPs, which extends prior work on HSE-HMMs [6] to POMDPs and includes LQG and tabular POMDPs. Our algorithm delivers the first PAC bound for this model.

Finally, we remark that there are several existing POMDP models for which it is unclear whether our framework can capture them. The first is a POMDP whose emissions and transitions are modeled by linear mixture models [12]. The second is the latent POMDP [37]. We leave these as future work.

System identification for uncontrolled partially observable systems.

There is a long line of work on system identification for uncontrolled partially observable systems, among which the spectral learning based methods are related to our work [72, 29, 58, 8, 28, 54, 6, 38, 25, 67]. Informally, these methods leverage the high-level idea that under some observability conditions, one can use the sufficient statistics of (possibly multi-step) future observations as a surrogate for the belief states, thus allowing the learning algorithms to ignore the latent state inference and completely rely on observable quantities. Our approach shares a similar spirit in the sense that we use sufficient statistics of future observations to replace latent states, and our algorithm only relies on observable quantities. The major difference is that these prior works only focus on passive system identification for uncontrolled systems, while we need to find a high-performance policy by actively interacting with the systems for information acquisition.

Reinforcement learning in PSRs.

PSRs [32, 46, 62, 8, 68] are models that generalize POMDPs. PSRs also rely on the idea of using the sufficient statistics of multi-step future observations (i.e., predictive states) to serve as a summary of the history. Prior works on RL for PSRs [8, 38, 14, 45, 30] do not address the problem of strategic exploration and operate under the assumption that a pre-collected diverse training dataset is given and the data collection policy is a blind policy (i.e., it does not depend on history of observations). To our knowledge, the only existing PAC learning algorithm for PSRs is limited to a reactive PSR model where the optimal policy depends just on the latest observation [33]. Our framework captures standard PSRs models that are strictly more general than reactive PSRs.

Value link functions.

Analogues of value link functions (referred to as bridge functions) are used in the literature on causal inference (offline contextual bandits) [50, 13, 11, 40, 53, 60, 75] and offline RL with unmeasured confounders [7, 66]. However, their settings are not standard POMDPs: they consider POMDPs with unmeasured confounders following [69], while our setting is a standard POMDP without unmeasured confounders. Here, we emphasize that their setting does not capture ours. More specifically, taking [66] as an example, they require that the logged data is generated by policies that can depend on latent states but cannot depend on observations. Thus, their definition of link functions (called bridge functions) is not applicable to our setting, since the data we use is clearly generated by policies that depend on observations. Due to this difference, their setting prohibits the use of future observations, unlike ours. Finally, we stress that our work is online, while their setting is offline; hence, they do not discuss any methods for exploration.

1.2 Organization

In Section 2, we introduce the notation, the definition of POMDPs, and our function-approximation setup, i.e., the policy and value link function classes. In Section 3, we define value link functions and the PO-bilinear actor-critic class. In Section 4, we give examples that admit the PO-bilinear actor-critic class, including observable undercomplete tabular POMDPs, observable overcomplete tabular POMDPs, observable LQG, and observable HSE-POMDPs. In Section 5, we give a unified algorithm for the PO-bilinear actor-critic class and the sample complexity of the algorithm. We also instantiate this general result for the examples presented in Section 4. In Section 6, we show that PSRs, which are more general models than POMDPs, also admit the PO-bilinear rank decomposition. In Section 7, we give a more general definition of the PO-bilinear actor-critic class, followed by showing that two additional examples, $M$-step decodable POMDPs and observable POMDPs with low-rank latent transition, fall into this general definition (Section 8). Both examples use general nonlinear function approximation, and their sample complexities do not explicitly depend on the size of the state and observation spaces, but only on the statistical complexities of the function classes. As a by-product, we can refine the sample complexity result for the tabular case in Section 5. Most of the proofs are deferred to the Appendix.

2 Preliminary

Figure 1: Case with $M=3$. A 3-memory policy determines action $a_{6}$ based on $\bar{z}_{6}=(o_{3:6},a_{3:5})$.

We introduce background on POMDPs here and defer the introduction of PSRs to Section 6. We consider an episodic POMDP specified by $\mathcal{M}=\langle{\mathcal{S}},\mathcal{O},\mathcal{A},H,\mathbb{T},\mathbb{O},r\rangle$, where ${\mathcal{S}}$ is the unobserved state space, $\mathcal{O}$ is the observation space, $\mathcal{A}$ is the action space, $H$ is the horizon, $\mathbb{T}:{\mathcal{S}}\times\mathcal{A}\to\Delta({\mathcal{S}})$ is the transition probability, $\mathbb{O}:{\mathcal{S}}\to\Delta(\mathcal{O})$ is the emission probability, and $r:\mathcal{O}\times\mathcal{A}\to\mathbb{R}$ is the reward. Here, $\mathbb{T},\mathbb{O}$ are unknown distributions. For notational simplicity, we consider the time-homogeneous case in this paper; the extension to the time-inhomogeneous setting is straightforward.

In our work, we consider MM-memory policies. Let 𝒵h=(𝒪×𝒜)min{h,M}\mathcal{Z}_{h}=(\mathcal{O}\times\mathcal{A})^{\min\{h,M\}} and 𝒵¯h=𝒵h1×𝒪\bar{\mathcal{Z}}_{h}=\mathcal{Z}_{h-1}\times\mathcal{O}. An element zh𝒵hz_{h}\in\mathcal{Z}_{h} is represented as zh=[omax(hM+1,1):h,amax(hM+1,1):h]z_{h}=[o_{\max(h-M+1,1):h},a_{\max(h-M+1,1):h}], and an element z¯h𝒵¯h\bar{z}_{h}\in\bar{\mathcal{Z}}_{h} is represented as z¯h=[omax(hM,1):h,amax(hM,1):h1]\bar{z}_{h}=[o_{\max(h-M,1):h},a_{\max(h-M,1):h-1}] (thus, z¯h=[zh1,oh]\bar{z}_{h}=[z_{h-1},o_{h}]). Figure 1 illustrates this situation. An MM-memory policy is defined as π={πh}h=1H\pi=\{\pi_{h}\}_{h=1}^{H} where each πh\pi_{h} is a mapping from 𝒵¯h\bar{\mathcal{Z}}_{h} to a distribution over actions Δ(𝒜)\Delta(\mathcal{A}).

In a POMDP, an MM-memory policy generates the data as follows. Each episode starts with the initial state s1s_{1} sampled from some unknown distribution. At each step h[H]h\in[H], from sh𝒮s_{h}\in{\mathcal{S}}, the agent observes oh𝕆(|sh)o_{h}\sim\mathbb{O}(\cdot|s_{h}), executes action ahπh(|z¯h)a_{h}\sim\pi_{h}(\cdot|\bar{z}_{h}), receives reward r(sh,ah)r(s_{h},a_{h}), and transits to the next latent state sh+1𝕋(|sh,ah)s_{h+1}\sim\mathbb{T}(\cdot|s_{h},a_{h}). Note that the agent does not observe the underlying states but only the observations {oh}hH\{o_{h}\}_{h\leq H}. We denote J(π)J(\pi) as the value of the policy π\pi, i.e., 𝔼[h=1Hrh;a1:Hπ]\mathbb{E}[\sum_{h=1}^{H}r_{h};a_{1:H}\sim\pi] where the expectation is taken w.r.t. the stochasticity of the policy π\pi, emissions distribution 𝕆\mathbb{O} and transition dynamics 𝕋\mathbb{T}.
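To make the interaction protocol and the memory bookkeeping concrete, below is a minimal Python sketch of rolling out an $M$-memory policy; the `env` interface (`reset`/`step`) and the `policy` callable are hypothetical stand-ins, not part of the paper.

```python
from collections import deque

def rollout(env, policy, H, M):
    """Roll out an M-memory policy for one episode.

    At step h, the policy input is z_bar_h = (o_{max(h-M,1):h}, a_{max(h-M,1):h-1}).
    """
    obs_mem = deque(maxlen=M + 1)   # holds o_{max(h-M,1):h}  (at most M+1 observations)
    act_mem = deque(maxlen=M)       # holds a_{max(h-M,1):h-1} (at most M actions)
    obs_mem.append(env.reset())     # o_1 ~ O(. | s_1), s_1 from the initial distribution
    trajectory = []
    for h in range(1, H + 1):
        z_bar = (tuple(obs_mem), tuple(act_mem))
        a = policy(h, z_bar)        # a_h ~ pi_h(. | z_bar_h)
        o_next, r = env.step(a)     # r_h = r(o_h, a_h), o_{h+1} ~ O(. | s_{h+1})
        trajectory.append((z_bar, a, r, o_next))
        act_mem.append(a)           # the deques automatically truncate older history
        obs_mem.append(o_next)
    return trajectory
```

Note that the latent states $s_{h}$ never appear in the returned trajectory; the agent only ever records observable quantities.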

We define a value function for a policy π\pi at step hh to be the expected cumulative reward to go under the policy π\pi starting from a z𝒵h1z\in\mathcal{Z}_{h-1} and s𝒮s\in{\mathcal{S}}, i.e. Vhπ:𝒵h1×𝒮V^{\pi}_{h}:\mathcal{Z}_{h-1}\times{\mathcal{S}}\to\mathbb{R} where Vhπ(z,s)=𝔼[h=hHrhzh1=z,sh=s;ah:Hπ]V^{\pi}_{h}(z,s)=\mathbb{E}[\sum_{h^{\prime}=h}^{H}r_{h^{\prime}}\mid z_{h-1}=z,s_{h}=s;a_{h:H}\sim\pi]. The notation 𝔼[;ah:Hπ]\mathbb{E}[\cdot\,;a_{h:H}\sim\pi] means the expectation is taken under a policy π\pi from hh to HH. Compared to the standard MDP setting, the expectation is conditional on not only shs_{h} but also zh1z_{h-1} since we consider MM-memory policies. The corresponding Bellman equation for VhπV^{\pi}_{h} is Vhπ(zh1,sh)=𝔼[rh+Vh+1π(zh,sh+1)zh1,sh;ahπ]V^{\pi}_{h}(z_{h-1},s_{h})=\mathbb{E}\left[r_{h}+V^{\pi}_{h+1}(z_{h},s_{h+1})\mid z_{h-1},s_{h};a_{h}\sim\pi\right].

The Actor-critic function approximation setup.

Our goal is to find a near-optimal policy that maximizes the policy value $J(\pi)$ in an online manner. Since any POMDP can be converted into an MDP by setting the state at level $h$ to the observable history up to $h$, any off-the-shelf online provably efficient algorithm for MDPs can be applied to POMDPs. Defining $\mathcal{H}_{h}$ as the whole history up to step $h\in[H]$ (i.e., a history $\tau_{h}\in\mathcal{H}_{h}$ is of the form $o_{1:h},a_{1:h-1}$), these naïve algorithms ensure that the output policies can compete against the globally optimal policy $\pi^{\star}_{\mathrm{gl}}=\mathop{\mathrm{argmax}}_{\pi\in\tilde{\Pi}}J(\pi)$, where $\tilde{\Pi}=\{\tilde{\Pi}_{h}\}$ and $\tilde{\Pi}_{h}=[\mathcal{H}_{h}\to\Delta(\mathcal{A})]$. However, this conversion results in an error with exponential dependence on the horizon $H$, which is prohibitively large in the long-horizon setting.

Instead of directly competing against the globally optimal policy, we aim for agnostic policy learning, i.e., competing against the best policy in a given $M$-memory policy class. Our function approximation setup consists of two function classes: $(a)$ a policy class $\Pi:=\{\Pi_{h}\}_{h=1}^{H}$ consisting of $M$-memory policies, where $\Pi_{h}\subset[\bar{\mathcal{Z}}_{h}\to\Delta(\mathcal{A})]$ (i.e., actors), and $(b)$ a set of value link functions $\mathcal{G}=\{\mathcal{G}_{h}\}_{h=1}^{H}$, where $\mathcal{G}_{h}\subset[\bar{\mathcal{Z}}_{h}\to\mathbb{R}]$, whose role is to approximate $V^{\pi}_{h}$ (i.e., critics). Our goal is to provide an algorithm that outputs a policy $\hat{\pi}=\{\hat{\pi}_{h}\}$ with low excess risk, where the excess risk is defined as $J(\pi^{\star})-J(\hat{\pi})$ and $\pi^{\star}=\mathop{\mathrm{argmax}}_{\pi\in\Pi}J(\pi)$ is the best policy in the class $\Pi$. To motivate this agnostic setting, note that $M$-memory policies are widely used in practice, e.g., DQN [49] sets $M=4$. Besides, there are natural examples where $M$-memory policies are close to the globally optimal policy with $M$ being only polynomial with respect to other problem-dependent parameters, e.g., observable POMDPs [24] and LQG [42, 65, 51]. We will show the global optimality in these two examples later, without any exponential dependence on $H$ in the sample complexity.

Remark 1 (Limits of existing MDP actor-critic framework).

While the general actor-critic framework proposed for MDPs [33] is applicable to POMDPs via the naïve POMDP-to-MDP reduction, it is unable to leverage any benefit from the restricted policy class. This naïve reduction uses the full history and incurs a sample complexity that scales exponentially with respect to the horizon.

Additional notation. Let $[H]=\{1,\cdots,H\}$ and $[t]=\{1,\cdots,t\}$. Given a matrix $A$, we denote its pseudo-inverse by $A^{\dagger}$ and its operator norm by $\|A\|$. We define the induced 1-norm $\|A\|_{1}=\max_{x:x\neq 0}\|Ax\|_{1}/\|x\|_{1}$. The outer product is denoted by $\otimes$. Let $d^{\pi}_{h}\in\Delta(\bar{\mathcal{Z}}_{h}\times{\mathcal{S}})$ be the marginal distribution at step $h$ and $\delta(\cdot)$ be the Dirac delta function. We denote the policy $\delta(a=a^{\prime})$ by $\mathrm{do}(a^{\prime})$. We denote the uniform distribution over actions by $\mathcal{U}(\mathcal{A})$. Given a function class $\mathcal{G}$, we define $\|\mathcal{G}\|_{\infty}=\sup_{g\in\mathcal{G}}\|g\|_{\infty}$.

3 Value Link Functions and the PO-bilinear Framework

Unlike MDPs, we cannot directly work with value functions Vhπ(s)V^{\pi}_{h}(s) (or Q functions) in POMDPs, since they depend on the unobserved state ss. To handle this issue, below we first introduce new value link functions by using future observations, and then discuss the PO-bilinear framework.

3.1 Value Link Functions

Definition 1 (K-step value link functions).

Fix a set of policies πout={πiout}i=1K\pi^{out}=\{\pi^{out}_{i}\}_{i=1}^{K} where πiout:𝒪Δ(𝒜)\pi^{out}_{i}:\mathcal{O}\to\Delta(\mathcal{A}). Value link functions ghπ:𝒵h1×𝒪K×𝒜K1g^{\pi}_{h}:\mathcal{Z}_{h-1}\times\mathcal{O}^{K}\times\mathcal{A}^{K-1}\to\mathbb{R} at step h[H]h\in[H] for a policy π\pi are defined as the solution to the following integral equation:

zh1𝒵h1,sh𝒮,𝔼[ghπ(zh1,oh:h+K1,ah:h+K2)zh1,sh;ah:h+K2πout]=Vhπ(zh1,sh),\displaystyle\forall z_{h-1}\in\mathcal{Z}_{h-1},s_{h}\in{\mathcal{S}},\qquad\mathbb{E}[g^{\pi}_{h}(z_{h-1},o_{h:h+K-1},a_{h:h+K-2})\mid z_{h-1},s_{h};a_{h:h+K-2}\sim\pi^{out}]=V^{\pi}_{h}(z_{h-1},s_{h}),

where the expectation is taken under the policy πout\pi^{out}.

Link functions do not necessarily exist, nor need they be unique. At an intuitive level, K-step value link functions are embeddings of the value functions onto the observation space, and their existence essentially means that K-step futures carry sufficient information to recover the latent-state-dependent value function. The proper choice of $\pi^{out}$ depends on the underlying model; for example, we use the uniform policy in the tabular case and $\delta(a=0)$ in LQG. For notational simplicity, we mostly focus on the case of $K=1$, though we will also discuss the general case of $K\geq 2$. The simplified definition for 1-step link functions is provided below. Note that this definition is agnostic to $\pi^{out}$.

Definition 2 (1-step value link functions).

One-step value link functions ghπ:𝒵h1×𝒪g^{\pi}_{h}:\mathcal{Z}_{h-1}\times\mathcal{O}\to\mathbb{R} at step h[H]h\in[H] for a policy π\pi are defined as the solution to the following integral equation:

zh1𝒵h1,sh𝒮:𝔼[ghπ(zh1,oh)zh1,sh]=Vhπ(zh1,sh).\displaystyle\forall z_{h-1}\in\mathcal{Z}_{h-1},s_{h}\in{\mathcal{S}}:\qquad\mathbb{E}[g^{\pi}_{h}(z_{h-1},o_{h})\mid z_{h-1},s_{h}]=V^{\pi}_{h}(z_{h-1},s_{h}). (1)

In Section 4, we will demonstrate the form of the value link function for various examples. The idea of encoding latent state information using the statistics of (multi-step) futures has been widely used in learning models of HMMs [58, 29], PSRs [8, 6, 25, 67], and system identification for linear systems [72]. Existing provably efficient (online) RL works for POMDPs elaborate on this viewpoint [36, 22, 3]. Compared to them, the novelty of link functions is that they are introduced to recover value functions rather than models. This model-free view differs from the dominant model-based view in existing online RL for POMDPs. In our setup, we can control the system if we can recover value functions over the underlying states, even if we fail to identify the underlying model.

3.2 The PO-Bilinear Actor-critic Framework for POMDPs

With the definition of value link functions, we are now ready to introduce the PO-bilinear actor-critic (AC) class for POMDPs. We will focus on the case of K=1K=1 here. Let 𝒢={𝒢h}h=1H\mathcal{G}=\{\mathcal{G}_{h}\}_{h=1}^{H}, where 𝒢h[𝒵¯h]\mathcal{G}_{h}\subset[\bar{\mathcal{Z}}_{h}\to\mathbb{R}], be a class consisting of functions that satisfy the following realizability assumption w.r.t.  the policy class Π\Pi.

Assumption 1 (Realizability).

We assume that 𝒢\mathcal{G} is realizable w.r.t. the policy class Π\Pi, i.e., πΠ,h[H]\forall\pi\in\Pi,h\in[H], there exists at least one ghπ𝒢hg^{\pi}_{h}\in\mathcal{G}_{h} such that ghπg^{\pi}_{h} is a value link function w.r.t. the policy π\pi. Note that realizability implicitly requires the existence of link functions.

We next introduce the PO-Bilinear Actor-critic class. For each level h[H]h\in[H], we first define the Bellman loss:

\displaystyle\mathrm{Br}_{h}(\pi,g;\pi^{in}):=\mathbb{E}[g_{h}(\bar{z}_{h})-r_{h}-g_{h+1}(\bar{z}_{h+1});\,a_{1:h-1}\sim\pi^{in},a_{h}\sim\pi]

given M-memory policies π={πh},πin={πhin}\pi=\{\pi_{h}\},\pi^{in}=\{\pi^{in}_{h}\} and g={gh}g=\{g_{h}\}. Letting gπ={ghπ}h=1Hg^{\pi}=\{g^{\pi}_{h}\}_{h=1}^{H} be a link function for π\pi, our key observation is that value link functions satisfy

0=Brh(π,gπ;πin)\displaystyle 0=\mathrm{Br}_{h}(\pi,g^{\pi};\pi^{in})

for any $M$-memory roll-in policy $\pi^{in}=\{\pi^{in}_{h}\}_{h=1}^{H}$ and any evaluation pair $(\pi,g^{\pi})$. This is an analog of the Bellman equation in MDPs. The above equation tells us that $\mathrm{Br}_{h}(\pi,g;\pi^{in})$ is a suitable loss to quantify how much the estimator $g$ differs from $g^{\pi}_{h}$. When $\mathrm{Br}_{h}(\pi,g;\pi^{in})$ has a low-rank structure in a proper sense, we can efficiently learn a near-optimal $M$-memory policy. The following definition precisely quantifies the low-rank structure that we need for sample-efficient learning.

Definition 3 (PO-bilinear AC Class, K=1K=1).

The model is a PO-bilinear Actor-critic class of rank dd if 𝒢\mathcal{G} is realizable, and there exist Wh:Π×𝒢dW_{h}:\Pi\times\mathcal{G}\to\mathbb{R}^{d} and Xh:ΠdX_{h}:\Pi\to\mathbb{R}^{d} such that for all π,πΠ,g𝒢\pi^{\prime},\pi\in\Pi,g\in\mathcal{G} and h[H]h\in[H],

  1. $\mathbb{E}[g_{h}(\bar{z}_{h})-r_{h}-g_{h+1}(\bar{z}_{h+1});a_{1:h-1}\sim\pi^{\prime},a_{h}\sim\pi]=\langle W_{h}(\pi,g),X_{h}(\pi^{\prime})\rangle.$

  2. $W_{h}(\pi,g^{\pi})=0$ for any $\pi\in\Pi$ and the corresponding value link function $g^{\pi}\in\mathcal{G}$.

We define dd as the PO-bilinear rank.

While the above definition is enough to capture most of the examples we discuss later in this work, including undercomplete tabular POMDPs, LQG, HSE-POMDPs, we provide two useful extensions. The first extension incorporates discriminators into the framework, which can be used to capture the M-step decodable POMDPs and POMDPs with low-rank latent transition (see Section 7). The second extension incorporates multi-step futures, which can be used to capture overcomplete POMDPs and general PSRs. In the next section, we introduce the multi-step future version.

3.3 PO-bilinear Actor-critic Class with Multi Step Future

In this section, we provide an extension to Definition 3 to incorporate multiple-step futures (i.e., K>1K>1). For simplicity, we assume that πout=𝒰(𝒜)\pi^{out}=\mathcal{U}(\mathcal{A}).

The definition is then as follows. The main difference is that we roll out a policy 𝒰(𝒜)\mathcal{U}(\mathcal{A}), K1K-1 times to incorporate multi-step link functions. We introduce the notation

(zh1,oh:h+K1,ah:h+K2)=z¯hK𝒵¯hK=𝒵h1×𝒪K×𝒜K1.\displaystyle(z_{h-1},o_{h:h+K-1},a_{h:h+K-2})=\bar{z}^{K}_{h}\in\bar{\mathcal{Z}}^{K}_{h}=\mathcal{Z}_{h-1}\times\mathcal{O}^{K}\times\mathcal{A}^{K-1}.

Then, combining the Bellman equation for state-value functions and the definition of K-step link functions, we have

0\displaystyle 0 =𝔼[Vh+1π(zh,sh+1)+rhVhπ(zh1,sh)zh1,sh;ahπ]\displaystyle=\mathbb{E}[V^{\pi}_{h+1}(z_{h},s_{h+1})+r_{h}-V^{\pi}_{h}(z_{h-1},s_{h})\mid z_{h-1},s_{h};a_{h}\sim\pi]
=𝔼[gh+1π(z¯h+1K)zh1,sh;ahπ,ah+1:h+K1𝒰(𝒜)]+𝔼[rhzh1,sh;ahπ]\displaystyle=\mathbb{E}[g^{\pi}_{h+1}(\bar{z}^{K}_{h+1})\mid z_{h-1},s_{h};a_{h}\sim\pi,a_{h+1:h+K-1}\sim\mathcal{U}(\mathcal{A})]+\mathbb{E}[r_{h}\mid z_{h-1},s_{h};a_{h}\sim\pi]
𝔼[ghπ(z¯hK)zh1,sh;ah:h+K2𝒰(𝒜)]\displaystyle-\mathbb{E}[g^{\pi}_{h}(\bar{z}^{K}_{h})\mid z_{h-1},s_{h};a_{h:h+K-2}\sim\mathcal{U}(\mathcal{A})]

Thus, by taking expectations further with respect to (zh1,sh)(z_{h-1},s_{h}) (i.e., zh1,shz_{h-1},s_{h} can be sampled from some roll-in policy), we have

0\displaystyle 0 =𝔼[gh+1π(z¯h+1K);a1:h1π,ahπ,ah+1:h+K1𝒰(𝒜)]+𝔼[rh;a1:h1π,ahπ]\displaystyle=\mathbb{E}[g^{\pi}_{h+1}(\bar{z}^{K}_{h+1});a_{1:h-1}\sim\pi^{\prime},a_{h}\sim\pi,a_{h+1:h+K-1}\sim\mathcal{U}(\mathcal{A})]+\mathbb{E}[r_{h};a_{1:h-1}\sim\pi^{\prime},a_{h}\sim\pi]
𝔼[ghπ(z¯hK);a1:h1π,ah:h+K2𝒰(𝒜)].\displaystyle-\mathbb{E}[g^{\pi}_{h}(\bar{z}^{K}_{h});a_{1:h-1}\sim\pi^{\prime},a_{h:h+K-2}\sim\mathcal{U}(\mathcal{A})].

Hence, the Bellman loss of a pair (π,g)(\pi,g) under a roll-in π\pi^{\prime} denoted by Brh(π,g;π)\mathrm{Br}_{h}(\pi,g;\pi^{\prime}) at h[H]h\in[H] is defined as

Brh(π,g;π)\displaystyle\mathrm{Br}_{h}(\pi,g;\pi^{\prime}) =𝔼[gh+1(z¯h+1K);a1:h1π,ahπ,ah+1:h+K1𝒰(𝒜)]+𝔼[rh;a1:h1π,ahπ]\displaystyle=\mathbb{E}[g_{h+1}(\bar{z}^{K}_{h+1});a_{1:h-1}\sim\pi^{\prime},a_{h}\sim\pi,a_{h+1:h+K-1}\sim\mathcal{U}(\mathcal{A})]+\mathbb{E}[r_{h};a_{1:h-1}\sim\pi^{\prime},a_{h}\sim\pi]
𝔼[gh(z¯hK);a1:h1π,ah:h+K2𝒰(𝒜)].\displaystyle-\mathbb{E}[g_{h}(\bar{z}^{K}_{h});a_{1:h-1}\sim\pi^{\prime},a_{h:h+K-2}\sim\mathcal{U}(\mathcal{A})].

The above is a proper loss function when we use multi-step futures. Here is the structure we need for Brh(π,g;π)\mathrm{Br}_{h}(\pi,g;\pi^{\prime}).

Definition 4 (PO-bilinear AC Class for POMDPs with multi-step future).

The model is a PO-bilinear class of rank $d$ if $\mathcal{G}$ is realizable (with respect to general K-step link functions), and there exist $W_{h}:\Pi\times\mathcal{G}\to\mathbb{R}^{d}$ and $X_{h}:\Pi\to\mathbb{R}^{d}$ such that for all $\pi^{\prime},\pi\in\Pi,g\in\mathcal{G}$ and $h\in[H]$,

  1. We have:

     $\mathbb{E}[g_{h+1}(\bar{z}^{K}_{h+1});a_{1:h-1}\sim\pi^{\prime},a_{h}\sim\pi,a_{h+1:h+K-1}\sim\mathcal{U}(\mathcal{A})]+\mathbb{E}[r_{h};a_{1:h-1}\sim\pi^{\prime},a_{h}\sim\pi]-\mathbb{E}[g_{h}(\bar{z}^{K}_{h});a_{1:h-1}\sim\pi^{\prime},a_{h:h+K-2}\sim\mathcal{U}(\mathcal{A})]=\langle W_{h}(\pi,g),X_{h}(\pi^{\prime})\rangle,$

  2. $W_{h}(\pi,g^{\pi})=0$ for any $\pi\in\Pi$ and the corresponding value link function $g^{\pi}\in\mathcal{G}$.

We define dd as the PO-bilinear rank.

4 Examples of PO-Bilinear Actor-critic Classes

We consider three classes of examples (observable tabular POMDPs, LQG, and HSE-POMDPs) that admit the PO-bilinear rank decomposition. Our framework can also capture PSRs and $M$-step decodable POMDPs, whose discussions are deferred to Section 6 and Section 8, respectively. We mainly focus on the one-step future, i.e., $K=1$, and briefly discuss the extension to $K>1$ in the tabular case. In this section, except for LQG, we assume $r_{h}\in[0,1]$ for any $h\in[H]$. All the missing proofs are deferred to Section B.

4.1 Observable Undercomplete Tabular POMDPs

Example 1 (Observable undercomplete tabular POMDPs).

Let $\mathbb{O}\in\mathbb{R}^{|\mathcal{O}|\times|{\mathcal{S}}|}$ be the matrix whose entry indexed by a pair $(o,s)$ is $\mathbb{O}_{o,s}=\mathbb{O}(o|s)$. Assume that ${\rm rank}(\mathbb{O})=|{\mathcal{S}}|$, which we call observability. This requires undercompleteness, i.e., $|\mathcal{O}|\geq|{\mathcal{S}}|$.

The following lemma shows that 𝕆\mathbb{O} being full rank implies the existence of value link functions.

Lemma 1.

For Example 1, there exists a one-step value link function ghπg^{\pi}_{h} for any πΠ\pi\in\Pi and h[H]h\in[H].

Proof.

Consider any function f:𝒵h1×𝒮f:\mathcal{Z}_{h-1}\times{\mathcal{S}}\to\mathbb{R} (thus, this captures all possible VhπV^{\pi}_{h}). Denote 1(z){\textbf{1}}(z) as the one-hot encoding of zz over 𝒵h1\mathcal{Z}_{h-1} (similarly for 1(s){\textbf{1}}(s)). We have f(z,s)=f,1(z)1(s)=f,1(z)(𝕆𝕆1(s))f(z,s)=\langle f,{\textbf{1}}(z)\otimes{\textbf{1}}(s)\rangle=\langle f,{\textbf{1}}(z)\otimes(\mathbb{O}^{\dagger}\mathbb{O}{\textbf{1}}(s))\rangle, where we use the assumption that rank(𝕆)=|𝒮|{\rm rank}(\mathbb{O})=|{\mathcal{S}}| and thus 𝕆𝕆=I\mathbb{O}^{\dagger}\mathbb{O}=I. Then,

f(z,s)=\langle f,{\textbf{1}}(z)\otimes(\mathbb{O}^{\dagger}\mathbb{E}_{o\sim\mathbb{O}(\cdot|s)}{\textbf{1}}(o))\rangle=\mathbb{E}_{o\sim\mathbb{O}(\cdot|s)}\langle f,{\textbf{1}}(z)\otimes\mathbb{O}^{\dagger}{\textbf{1}}(o)\rangle, \qquad (2)

which means that the value link function corresponding to ff is g(z,o):=f,1(z)𝕆1(o)g(z,o):=\langle f,{\textbf{1}}(z)\otimes\mathbb{O}^{\dagger}{\textbf{1}}(o)\rangle. ∎
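The construction in this proof is easy to verify numerically. The following sketch (our own illustration with randomly generated parameters, not from the paper) checks that $g(z,o)=\langle f,{\textbf{1}}(z)\otimes\mathbb{O}^{\dagger}{\textbf{1}}(o)\rangle$ satisfies the defining equation (1) in a random undercomplete tabular model:

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_o, n_z = 3, 5, 4                    # |S|, |O| (undercomplete: n_o >= n_s), |Z_{h-1}|

# Random emission matrix Obs[o, s] = O(o | s) with full column rank (observability)
Obs = rng.random((n_o, n_s)) + 0.1
Obs /= Obs.sum(axis=0, keepdims=True)      # each column is a distribution over observations
assert np.linalg.matrix_rank(Obs) == n_s

f = rng.random((n_z, n_s))                 # an arbitrary candidate for V^pi_h(z, s)
g = f @ np.linalg.pinv(Obs)                # g[z, o] = <f, 1(z) (x) O^dagger 1(o)>

# E_{o ~ O(.|s)}[g(z, o)] = (g @ Obs)[z, s] should equal f(z, s), since O^dagger O = I:
assert np.allclose(g @ Obs, f)
```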

We next show that the PO-Bilinear rank (Definition 3) of tabular POMDPs is bounded by |𝒮|(|𝒪||𝒜|)M|{\mathcal{S}}|(|\mathcal{O}||\mathcal{A}|)^{M}.

Lemma 2.

Assume $\mathbb{O}$ is full column rank. Set the value link function class $\mathcal{G}_{h}=[\mathcal{Z}_{h-1}\times\mathcal{O}\to[0,C_{\mathcal{G}}]]$ for some $C_{\mathcal{G}}\in\mathbb{R}$, and the policy class $\Pi_{h}=[\bar{\mathcal{Z}}_{h}\to\Delta(\mathcal{A})]$. Then, the model is a PO-bilinear AC class (Definition 3) with PO-bilinear rank at most $|{\mathcal{S}}|(|\mathcal{O}||\mathcal{A}|)^{M}$.

Later, in Section 7, we will see that the PO-bilinear rank under the more general definition is just $|{\mathcal{S}}|$. This fact results in a significant improvement in the sample complexity: the bound no longer incurs $|\mathcal{O}|^{M}$.

4.2 Observable Overcomplete Tabular POMDPs

We consider overcomplete POMDPs with multi-step futures. The proofs are deferred to Section B.2. The following lemma generalizes Lemma 1; setting $K=1$ recovers Lemma 1.

Lemma 3.

Define a |𝒯K|×|𝒮||{\mathcal{T}}^{K}|\times|{\mathcal{S}}|-dimensional matrix 𝕆K\mathbb{O}^{K} whose entry indexed by (oh:h+K1,ah:h+K2)𝒯K(o_{h:h+K-1},a_{h:h+K-2})\in{\mathcal{T}}^{K} and sh𝒮s_{h}\in{\mathcal{S}} is equal to (oh:h+K1,ah:h+K2sh;ah:h+K2𝒰(𝒜))\mathbb{P}(o_{h:h+K-1},a_{h:h+K-2}\mid s_{h};a_{h:h+K-2}\sim\mathcal{U}(\mathcal{A})). When this matrix is full-column rank, K-step link functions with respect to 𝒰(𝒜)\mathcal{U}(\mathcal{A}) exist.

Note that a sufficient condition for the above is that the matrix $\mathbb{O}^{K}(a^{\prime}_{h:h+K-2})\in\mathbb{R}^{|\mathcal{O}|^{K}\times|{\mathcal{S}}|}$, whose entry indexed by $o_{h:h+K-1}\in\mathcal{O}^{K}$ and $s_{h}\in{\mathcal{S}}$ is equal to $\mathbb{P}(o_{h:h+K-1}\mid s_{h};a_{h:h+K-2}=a^{\prime}_{h:h+K-2})$, is full column rank for some $a^{\prime}_{h:h+K-2}\in\mathcal{A}^{K-1}$. This says that there is an (unknown) action sequence of length $K-1$ whose induced K-step futures retain information about the latent states.
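To illustrate the construction of $\mathbb{O}^{K}$, the following sketch (our own illustration on a randomly generated overcomplete tabular POMDP, where $|\mathcal{O}|<|{\mathcal{S}}|$, so the one-step emission matrix cannot be full column rank) builds the $K=2$ matrix under uniform actions and compares its rank to that of $\mathbb{O}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n_s, n_o, n_a = 4, 3, 2                    # overcomplete: |O| < |S|, so rank(Obs) < |S|

def rand_cond_dist(shape):                 # random conditional distributions along axis 0
    p = rng.random(shape) + 0.1
    return p / p.sum(axis=0, keepdims=True)

Obs = rand_cond_dist((n_o, n_s))           # Obs[o, s]   = O(o | s)
T = rand_cond_dist((n_s, n_s, n_a))        # T[s2, s, a] = T(s2 | s, a)

# O^K for K = 2: row (o1, a1, o2), column s, entry P(o1, a1, o2 | s; a1 ~ U(A))
OK = np.zeros((n_o * n_a * n_o, n_s))
for o1 in range(n_o):
    for a1 in range(n_a):
        for o2 in range(n_o):
            # P(o1 | s) * (1/|A|) * sum_{s2} O(o2 | s2) T(s2 | s, a1)
            OK[(o1 * n_a + a1) * n_o + o2] = Obs[o1] * (Obs[o2] @ T[:, :, a1]) / n_a

# One-step observability fails, but two-step futures can still identify the state:
print(np.linalg.matrix_rank(Obs), np.linalg.matrix_rank(OK))  # typically prints: 3 4
```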

We next calculate the PO-bilinear rank. Importantly, this does not depend on |𝒜|K|\mathcal{A}|^{K} and |𝒪|K|\mathcal{O}|^{K}.

Lemma 4.

Set a value link function class $\mathcal{G}_{h}=[\bar{\mathcal{Z}}^{K}_{h}\to[0,C_{\mathcal{G}}]]$ for some $C_{\mathcal{G}}\in\mathbb{R}^{+}$ and a policy class $\Pi_{h}=[\bar{\mathcal{Z}}_{h}\to\Delta(\mathcal{A})]$. Then, the model satisfies the PO-bilinear rank condition (Definition 4) with PO-bilinear rank at most $|{\mathcal{S}}|(|\mathcal{O}||\mathcal{A}|)^{M}$.

Note that the bilinear rank is still |𝒮|(|𝒪||𝒜|)M|{\mathcal{S}}|(|\mathcal{O}||\mathcal{A}|)^{M} (just |𝒮||{\mathcal{S}}| in the more general definition in Section 7). Crucially, it does not depend on the length of futures KK.

4.3 Observable Linear Quadratic Gaussian

The next example is Linear Quadratic Gaussian (LQG) with continuous state and action spaces. The details are deferred to Section J. Here, we set M=H1M=H-1 so that the policy class Π\Pi contains the globally optimal policy.

Example 2 (Linear Quadratic Gaussian (LQG)).

Consider LQG:

s=As+Ba+ϵ,o=Cs+τ,r=(sQs+aRa)\displaystyle s^{\prime}=As+Ba+\epsilon,~{}o=Cs+\tau,~{}r=-(s^{\top}Qs+a^{\top}Ra)

where $\epsilon,\tau$ are Gaussian noises with mean zero and covariances $\Sigma_{\epsilon}$ and $\Sigma_{\tau}$, respectively, $s\in\mathbb{R}^{d_{s}},o\in\mathbb{R}^{d_{o}},a\in\mathbb{R}^{d_{a}}$, and $Q,R$ are positive definite matrices.

We define the policy class as the linear policy class $\Pi_{h}=\{\delta(a_{h}=K_{h}\bar{z}_{h})\mid K_{h}\in\mathbb{R}^{d_{a}\times d_{\bar{z}_{h}}}\}$, where $d_{\bar{z}_{h}}$ is the dimension of $\bar{z}_{h}\in\bar{\mathcal{Z}}_{h}$. This choice is natural since the globally optimal policy is known to be linear with respect to the entire history [5, Chapter 4]. We define two quadratic features, $\phi_{h}(z_{h-1},s_{h})=(1,[z^{\top}_{h-1},s^{\top}_{h}]\otimes[z^{\top}_{h-1},s^{\top}_{h}])^{\top}$ with $z_{h-1}\in\mathcal{Z}_{h-1},s_{h}\in{\mathcal{S}}$, and $\psi_{h}(z_{h-1},o_{h})=(1,[z^{\top}_{h-1},o^{\top}_{h}]\otimes[z^{\top}_{h-1},o^{\top}_{h}])^{\top}$ with $z_{h-1}\in\mathcal{Z}_{h-1},o_{h}\in\mathcal{O}$. We have the following lemma.

Lemma 5 (PO-bilinear rank of observable LQG).

Assume rank(C)=ds{\rm rank}(C)=d_{s}. Then, the following holds:

  • For any policy π\pi linear in z¯h\bar{z}_{h}, a one-step value link function ghπ()g^{\pi}_{h}(\cdot) exists, and is linear in ψh()\psi_{h}(\cdot).

  • Letting dψhd_{\psi_{h}} be the dimension of ψh\psi_{h}, we set 𝒢h={θψh()|θdψh}\mathcal{G}_{h}=\{\theta^{\top}\psi_{h}(\cdot)|\theta\in\mathbb{R}^{d_{\psi_{h}}}\} and Π\Pi being linear in z¯h\bar{z}_{h}. Then LQG satisfies Definition 3 with PO-bilinear rank at most O({1+(H1)(do+da)+ds}2)O(\{1+(H-1)(d_{o}+d_{a})+d_{s}\}^{2})

We have two remarks. First, when $\pi^{out}_{t}=\delta(a=0)$, K-step link functions exist when $[C^{\top},(CA)^{\top},\dots,(CA^{K-1})^{\top}]$ is full row rank. This assumption is referred to as observability in control theory [27]. Second, the PO-bilinear rank scales polynomially with respect to $H,d_{o},d_{a},d_{s}$ even with $M=H-1$. As we show in Section J, due to this fact, we can compete against the globally optimal policy with polynomial sample complexity.
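The observability condition above is easy to check numerically. The sketch below (our own illustration with randomly generated system matrices) stacks $C,CA,\dots,CA^{K-1}$ and verifies that the resulting matrix has rank $d_{s}$, which is equivalent to the full-row-rank condition on $[C^{\top},(CA)^{\top},\dots,(CA^{K-1})^{\top}]$:

```python
import numpy as np

rng = np.random.default_rng(2)
d_s, d_o, K = 4, 2, 3                      # need K * d_o >= d_s for observability
A = rng.standard_normal((d_s, d_s))        # latent transition matrix
C = rng.standard_normal((d_o, d_s))        # observation matrix

# Observability matrix: stack C, CA, ..., CA^{K-1} vertically
Omat = np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(K)])
print(np.linalg.matrix_rank(Omat) == d_s)  # True for generic (A, C): the system is observable
```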

4.4 Observable Hilbert Space Embedding POMDPs

We consider HSE-POMDPs that generalize tabular POMDPs and LQG. Proofs here are deferred to Section  B.4. Consider any h[H]h\in[H]. Given a policy πh:𝒵¯h𝒜\pi_{h}:\bar{\mathcal{Z}}_{h}\to\mathcal{A}, we define the induced transition operator 𝕋π={𝕋π;h}h=1H\mathbb{T}_{\pi}=\{\mathbb{T}_{\pi;h}\}_{h=1}^{H} as (zh,sh+1)𝕋π;h(zh1,sh)(z_{h},s_{h+1})\sim\mathbb{T}_{\pi;h}(z_{h-1},s_{h}), where we have oh𝕆(sh),ahπh(z¯h),sh+1𝕋(sh,ah)o_{h}\sim\mathbb{O}(s_{h}),a_{h}\sim\pi_{h}(\bar{z}_{h}),s_{h+1}\sim\mathbb{T}(s_{h},a_{h}). Namely, 𝕋π\mathbb{T}_{\pi} is the transition kernel of some Markov chain induced by the policy π\pi. The HSE-POMDP assumes two conditional distributions 𝕆(|s)\mathbb{O}(\cdot|s) and 𝕋π(,|z,s)\mathbb{T}_{\pi}(\cdot,\cdot|z,s) have conditional mean embeddings.

Example 3 (HSE-POMDPs).

We introduce features ϕh:𝒵h1×𝒮dϕh,ψh:𝒵h1×𝒪dψh\phi_{h}:\mathcal{Z}_{h-1}\times{\mathcal{S}}\to\mathbb{R}^{d_{\phi_{h}}},\psi_{h}:\mathcal{Z}_{h-1}\times\mathcal{O}\to\mathbb{R}^{d_{\psi_{h}}}. We assume the existence of the conditional mean embedding operators: (1) there exists a matrix KhK_{h} such that for all z𝒵h1,s𝒮z\in\mathcal{Z}_{h-1},s\in{\mathcal{S}}, 𝔼o𝕆(|s)ψh(z,o)=Khϕh(z,s)\mathbb{E}_{o\sim\mathbb{O}(\cdot|s)}\psi_{h}(z,o)=K_{h}\phi_{h}(z,s) and (2) for all πΠ\pi\in\Pi, there exists a matrix Tπ;hT_{\pi;h}, such that 𝔼zh,sh+1𝕋π;h(zh1,sh)ϕh+1(zh,sh+1)=Tπ;hϕh(zh1,sh)\mathbb{E}_{z_{h},s_{h+1}\sim\mathbb{T}_{\pi;h}(z_{h-1},s_{h})}\phi_{h+1}(z_{h},s_{h+1})=T_{\pi;h}\phi_{h}(z_{h-1},s_{h}).

The existence of conditional mean embeddings is a common assumption in prior RL works on learning the dynamics of HMMs and PSRs [59, 6], and in Bellman complete linear MDPs [77, 15, 10, 26]. HSE-POMDPs naturally capture tabular POMDPs and LQG. For tabular POMDPs, $\psi_{h}$ and $\phi_{h}$ are one-hot encoding features. In LQG, $\phi_{h}$ and $\psi_{h}$ are the quadratic features we define in Section 4.3. Here, for simplicity, we focus on finite-dimensional features $\phi_{h}$ and $\psi_{h}$; the extension to infinite-dimensional Reproducing Kernel Hilbert Spaces is deferred to Section B.4.

The following shows the existence of value link functions and the PO-bilinear rank decomposition.

Lemma 6 (PO-bilinear rank of observable HSE-POMDPs).

Assume KhK_{h} is full column rank (observability), and Vhπ()V_{h}^{\pi}(\cdot) is linear in ϕh\phi_{h} for any πΠ,h[H]\pi\in\Pi,h\in[H]. Then the following holds.

  • A one-step value link function ghπ()g^{\pi}_{h}(\cdot) exists for any πΠ,h[H]\pi\in\Pi,h\in[H], and is linear in ψh\psi_{h}.

  • We set a value function class 𝒢h={wψh()|wdψh}\mathcal{G}_{h}=\{w^{\top}\psi_{h}(\cdot)|w\in\mathbb{R}^{d_{\psi_{h}}}\}, policy class Πh[𝒵¯hΔ(𝒜)]\Pi_{h}\subset[\bar{\mathcal{Z}}_{h}\to\Delta(\mathcal{A})]. Then HSE-POMDP satisfies Definition 3 with PO-bilinear rank at most maxh[H]dϕh\max_{h\in[H]}d_{\phi_{h}}.

The first statement can be verified by noting that when $V^{\pi}_{h}(\cdot)=\langle\theta_{h},\phi_{h}(\cdot)\rangle$, value link functions take the form $g_{h}^{\pi}(\cdot)=\langle(K^{\dagger}_{h})^{\top}\theta_{h},\psi_{h}(\cdot)\rangle$, where we leverage the existence of the conditional mean embedding operator $K_{h}$ and the fact that $K_{h}$ is full column rank (thus $K_{h}^{\dagger}K_{h}=\mathbb{I}_{d_{\phi_{h}}}$). Note that the PO-bilinear rank depends only on the dimension of the features $\phi_{h}$, without any explicit dependence on the length of the memory.
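As a sanity check of this argument, the following sketch (our own illustration with random matrices) verifies that $w=(K^{\dagger}_{h})^{\top}\theta_{h}$ satisfies the defining property of a link function at the level of conditional means:

```python
import numpy as np

rng = np.random.default_rng(3)
d_phi, d_psi = 3, 5                       # d_psi >= d_phi so K can be full column rank
K = rng.standard_normal((d_psi, d_phi))   # embedding operator: E[psi | z, s] = K phi(z, s)
theta = rng.standard_normal(d_phi)        # V^pi_h(z, s) = <theta, phi_h(z, s)>

w = np.linalg.pinv(K).T @ theta           # link-function weights (K^dagger)^T theta

phi = rng.standard_normal(d_phi)          # an arbitrary phi_h(z, s)
# E[g(z, o) | z, s] = w^T E[psi | z, s] = theta^T (K^dagger K) phi = theta^T phi:
assert np.isclose(w @ (K @ phi), theta @ phi)
```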

5 Algorithm and Complexity

In this section, we first give our algorithm followed by a general sample complexity analysis. We then instantiate our analysis to specific models considered in Section 4.

5.1 Algorithm

Algorithm 1 PaRtially ObserVAble BiLinEar (Provable) # multi-step version is in Algorithm 2
1:  Input: Value link function class $\mathcal{G}=\{\mathcal{G}_{h}\},\mathcal{G}_{h}\subset[\bar{\mathcal{Z}}_{h}\to\mathbb{R}]$, Policy class $\Pi=\{\Pi_{h}\},\Pi_{h}\subset[\bar{\mathcal{Z}}_{h}\to\Delta(\mathcal{A})]$, parameters $m\in\mathbb{N},R\in\mathbb{R}$
2:  Initialize π0Π\pi^{0}\in\Pi
3:  Form the first step dataset 𝒟0={oi}i=1m\mathcal{D}^{0}=\{o^{i}\}_{i=1}^{m}, with oi𝕆(|s1)o^{i}\sim\mathbb{O}(\cdot|s_{1})
4:  for t=0T1t=0\to T-1 do
5:     For any h[H]h\in[H], collect mm i.i.d tuple as follows: (z¯h,sh)dhπt,ah𝒰(𝒜),rh=rh(oh,ah),sh+1𝕋(sh,ah),oh+1𝕆(|sh+1)(\bar{z}_{h},s_{h})\sim d^{\pi^{t}}_{h},a_{h}\sim\mathcal{U}(\mathcal{A}),r_{h}=r_{h}(o_{h},a_{h}),s_{h+1}\sim\mathbb{T}(s_{h},a_{h}),o_{h+1}\sim\mathbb{O}(\cdot|s_{h+1}).
6:     Define 𝒟ht={(z¯hi,ahi,rhi,oh+1i)}i=1m\mathcal{D}^{t}_{h}=\{(\bar{z}^{i}_{h},a^{i}_{h},r^{i}_{h},o^{i}_{h+1})\}_{i=1}^{m} # note latent state ss is not in the dataset
7:     Define the Bellman error (π,g)Π×𝒢\forall(\pi,g)\in\Pi\times\mathcal{G},
σht(π,g):=𝔼𝒟ht[πh(ahz¯h)|𝒜|{gh+1(z¯h+1)+rh}gh(z¯h)].\textstyle\sigma^{t}_{h}(\pi,g):=\mathbb{E}_{\mathcal{D}^{t}_{h}}\left[\pi_{h}(a_{h}\mid\bar{z}_{h})|\mathcal{A}|\{g_{h+1}(\bar{z}_{h+1})+r_{h}\}-g_{h}(\bar{z}_{h})\right].
8:     Select policy optimistically as follows
(πt+1,gt+1):=argmaxπΠ,g𝒢𝔼𝒟0[g1(o)]s.t.h[H],i[t],(σhi(π,g))2R.\displaystyle\textstyle(\pi^{t+1},g^{t+1}):=\mathop{\mathrm{argmax}}_{\pi\in\Pi,g\in\mathcal{G}}\mathbb{E}_{\mathcal{D}^{0}}[g_{1}(o)]\quad\mathrm{s.t.}\quad\forall h\in[H],\forall i\in[t],(\sigma^{i}_{h}(\pi,g))^{2}\leq R.
9:  end for
10:  Output: Randomly choose $\hat{\pi}$ from $(\pi^{1},\cdots,\pi^{T})$.

We first focus on the cases where models satisfy the PO-bilinear AC model (i.e., Definition 3) with finite action and with one-step link function. We discuss the extension to handle continuous action in Remark 2 and multi-step link functions at the end of this subsection.

We present our algorithm Provable in Algorithm 1. Note that Provable is agnostic to the form of $X_{h}$ and $W_{h}$. Inside iteration $t$, given the latest learned policy $\pi^{t}$, we define the Bellman error for all pairs $(\pi,g)$, where the Bellman error is averaged over the samples from $\pi^{t}$. Here, to evaluate the Bellman loss for any policy $\pi\in\Pi$, we use importance sampling by running $\mathcal{U}(\mathcal{A})$ rather than executing the policy $\pi$ itself, so that we can reuse samples.² A pair $(\pi,g)$ with small total Bellman error intuitively means that, given the data so far, $g$ could still be a value link function for the policy $\pi$. Then, in the constrained optimization formulation, we only keep $(\pi,g)$ pairs whose Bellman errors are small so far. Among these pairs, we select the pair using the principle of optimism in the face of uncertainty. We remark that the algorithm leverages some design choices from the Bilinear-UCB algorithm for MDPs [17]. The key difference between our algorithm and Bilinear-UCB is that we leverage the actor-critic framework equipped with value link functions to handle partial observability and agnostic learning.

²This choice might limit the algorithm to the case where $\mathcal{A}$ is discrete. However, for examples such as LQG, we show that we can replace $\mathcal{U}(\mathcal{A})$ by a G-optimal design over the quadratic polynomial feature of the actions.
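For concreteness, here is a minimal sketch (our own illustration; the callables `pi_h`, `g_h`, `g_h1` are hypothetical stand-ins) of the empirical Bellman loss $\sigma^{t}_{h}$ from line 7 of Algorithm 1, where the importance weight $\pi_{h}(a_{h}\mid\bar{z}_{h})|\mathcal{A}|$ corrects for the uniform action $a_{h}\sim\mathcal{U}(\mathcal{A})$:

```python
import numpy as np

def sigma_t_h(data, pi_h, g_h, g_h1, n_actions):
    """Empirical Bellman loss for a candidate pair (pi, g).

    data: tuples (z_bar_h, a_h, r_h, z_bar_h1) collected as in lines 5-6 of
          Algorithm 1, with a_h drawn uniformly from the action set.
    pi_h(a, z_bar): probability pi_h(a | z_bar_h) under the candidate policy.
    g_h, g_h1: candidate link functions at steps h and h + 1.
    """
    losses = [
        pi_h(a, z) * n_actions * (g_h1(z1) + r) - g_h(z)
        for (z, a, r, z1) in data
    ]
    return float(np.mean(losses))

# Line 8 of Algorithm 1 then keeps only pairs with sigma_t_h(...) ** 2 <= R for
# every past iteration, and picks the optimistic pair among the survivors.
```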

With multi-step link functions.

Finally, we consider the case with multi-step futures in Algorithm 2 when $\pi^{out}=\mathcal{U}(\mathcal{A})$. Recall the notation $\bar{z}^{K}_{h}=(z_{h-1},o_{h:h+K-1},a_{h:h+K-2})$. The only difference is in the data collection process. In particular, at every iteration $t$, we roll in using $\pi^{t}$ up to (and including) time step $h-1$, and we then roll out by switching to $\mathcal{U}(\mathcal{A})$ for $K$ steps.

Algorithm 2 PaRtially ObserVAble BiLinEar (Provable) # multi-step version
1:  Input: Value link function class 𝒢={𝒢h},𝒢h[Z¯hK]\mathcal{G}=\{\mathcal{G}_{h}\},\mathcal{G}_{h}\subset[\bar{Z}^{K}_{h}\to\mathbb{R}], Policy class Π={Πh},Πh[𝒵¯hΔ(𝒜)]\Pi=\{\Pi_{h}\},\Pi_{h}\subset[\bar{\mathcal{Z}}_{h}\to\Delta(\mathcal{A})], parameters m,Rm\in\mathbb{N},R\in\mathbb{R}
2:  Define
lh(z¯hK,ah+K1,rh,oh+K;π,g):=|𝒜|πh(ahz¯h)(gh+1(z¯h+1K)+rh)gh(z¯hK).\displaystyle l_{h}(\bar{z}^{K}_{h},a_{h+K-1},r_{h},o_{h+K};\pi,g):=|\mathcal{A}|\pi_{h}(a_{h}\mid\bar{z}_{h})\left(g_{h+1}(\bar{z}^{K}_{h+1})+r_{h}\right)-g_{h}(\bar{z}^{K}_{h}).
3:  Initialize π0Π\pi^{0}\in\Pi
4:  Form the first step dataset $\mathcal{D}^{0}=\{\bar{z}^{K;i}_{1}\}_{i=1}^{m}$, where each $\bar{z}^{K}_{1}$ is generated by following $a_{1:K-1}\sim\mathcal{U}(\mathcal{A})$ in an i.i.d. manner.
5:  for t=0T1t=0\to T-1 do
6:     for any h[H]h\in[H], define the Bellman error
σht(π,g)=𝔼𝒟ht[lh(z¯hK,ah+K1,rh,oh+K;π,g)]\displaystyle\sigma^{t}_{h}(\pi,g)=\mathbb{E}_{\mathcal{D}^{t}_{h}}[l_{h}(\bar{z}^{K}_{h},a_{h+K-1},r_{h},o_{h+K};\pi,g)]
where 𝒟ht\mathcal{D}^{t}_{h} means empirical approximation by executing a1:h1πt,ah:h+K1𝒰(𝒜)a_{1:h-1}\sim\pi^{t},a_{h:h+K-1}\sim\mathcal{U}(\mathcal{A}) and collecting mm i.i.d tuples.
7:     Select policy optimistically as follows (here note g={gh}h=1Hg=\{g_{h}\}_{h=1}^{H})
(πt+1,gt+1):=argmaxπΠ,g𝒢𝔼𝒟0[g1(z¯1K)]s.t.h[H],i[t],σhi(π,g)2R.\displaystyle(\pi^{t+1},g^{t+1}):=\mathop{\mathrm{argmax}}_{\pi\in\Pi,g\in\mathcal{G}}\mathbb{E}_{\mathcal{D}^{0}}[g_{1}(\bar{z}^{K}_{1})]\quad\mathrm{s.t.}\quad\forall h\in[H],\forall i\in[t],\,\sigma^{i}_{h}(\pi,g)^{2}\leq R.
8:  end for
9:  Output: Randomly choose $\hat{\pi}$ from $(\pi^{1},\cdots,\pi^{T})$.
Remark 2 (Continuous control).

The algorithms so far implicitly assume that the action space is finite. However, we can also handle LQG, which has continuous actions: by employing a G-optimal design over actions, our algorithm extends to the continuous-action setting. The discussion is deferred to Section C.

5.2 Sample Complexity

We show a sample complexity result via a reduction to supervised learning analysis. We begin by stating the following assumption, which is ensured by standard uniform convergence results.

Assumption 2 (Uniform Convergence).

Fix h[H]h\in[H]. Let 𝒟h\mathcal{D}^{\prime}_{h} be a set of mm i.i.d tuples following (zh1,sh,oh)dhπt,ah𝒰(𝒜),sh+1𝕋(sh,ah),oh+1𝕆(sh+1)(z_{h-1},s_{h},o_{h})\sim d^{\pi^{t}}_{h},a_{h}\sim\mathcal{U}(\mathcal{A}),s_{h+1}\sim\mathbb{T}(s_{h},a_{h}),o_{h+1}\sim\mathbb{O}(s_{h+1}). With probability 1δ1-\delta,

supπΠ,g𝒢|(𝔼𝒟h𝔼)[πh(ahz¯h)|𝒜|{gh+1(z¯h+1)+rh}gh(z¯h)]|ϵgen,h(m,Π,𝒢,δ)\displaystyle\textstyle\sup_{\pi\in\Pi,g\in\mathcal{G}}|(\mathbb{E}_{\mathcal{D}^{\prime}_{h}}-\mathbb{E})[\pi_{h}(a_{h}\mid\bar{z}_{h})|\mathcal{A}|\{g_{h+1}(\bar{z}_{h+1})+r_{h}\}-g_{h}(\bar{z}_{h})]|\leq\epsilon_{gen,h}(m,\Pi,\mathcal{G},\delta)

For h=1h=1, we also require

supg1𝒢1|𝔼𝒟1[g1(o1)]𝔼[𝔼𝒟1[g1(o1)]]|ϵini,1(m,𝒢,δ).\sup_{g_{1}\in\mathcal{G}_{1}}|\mathbb{E}_{\mathcal{D}^{\prime}_{1}}[g_{1}(o_{1})]-\mathbb{E}[\mathbb{E}_{\mathcal{D}^{\prime}_{1}}[g_{1}(o_{1})]]|\leq\epsilon_{ini,1}(m,\mathcal{G},\delta).
Remark 3 (Finite function classes).

The term ϵgen\epsilon_{gen} depends on the statistical complexities of the function classes Π\Pi and 𝒢\mathcal{G}. As a simple example, we consider the case where Π\Pi and 𝒢\mathcal{G} are discrete. In this case, we have ϵgen,h(m,Π,𝒢,δ)=O(ln(|Π||𝒢|/δ)/m)\epsilon_{gen,h}(m,\Pi,\mathcal{G},\delta)=O(\sqrt{\ln(|\Pi||\mathcal{G}|/\delta)/m}), and ϵini,1(m,𝒢,δ)=O(ln(|𝒢|/δ)/m)\epsilon_{ini,1}(m,\mathcal{G},\delta)=O(\sqrt{\ln(|\mathcal{G}|/\delta)/m}), which are standard statistical complexities for discrete function classes Π\Pi and 𝒢\mathcal{G}. Achieving this result simply requires standard concentration and a union bound over all functions in Π,𝒢\Pi,\mathcal{G}.

Under Assumption 2, when the model is PO-bilinear with rank dd, we get the following.

Theorem 1 (PAC guarantee of Provable).

Suppose we have a PO-bilinear AC class with rank $d$. Suppose Assumption 2 holds, and that $\sup_{\pi\in\Pi}\|X_{h}(\pi)\|\leq B_{X}$ and $\sup_{\pi\in\Pi,g\in\mathcal{G}}\|W_{h}(\pi,g)\|\leq B_{W}$ for any $h\in[H]$.
By setting T=2Hdln(4Hd(BX2BW2ϵ~gen2+1)),R=ϵgen2T=2Hd\ln\left(4Hd\left(\frac{B^{2}_{X}B^{2}_{W}}{\tilde{\epsilon}^{2}_{gen}}+1\right)\right),R=\epsilon^{2}_{gen} where

ϵgen:=maxhϵgen,h(m,Π,𝒢,δ/(TH+1)),ϵ~gen:=maxhϵgen,h(m,Π,𝒢,δ/H).\displaystyle\textstyle\epsilon_{gen}:=\max_{h}\epsilon_{gen,h}(m,\Pi,\mathcal{G},\delta/(TH+1)),\tilde{\epsilon}_{gen}:=\max_{h}\epsilon_{gen,h}(m,\Pi,\mathcal{G},\delta/H).

With probability at least $1-\delta$, letting $\pi^{\star}=\mathop{\mathrm{argmax}}_{\pi\in\Pi}J(\pi)$, we have

J(π)J(π^)5ϵgendH2ln(4Hd(BX2BW2/ϵ~gen2+1))+2ϵini,1(m,𝒢,δ/(TH+1)).\displaystyle\textstyle J(\pi^{\star})-J(\hat{\pi})\leq 5\epsilon_{gen}\sqrt{dH^{2}\cdot\ln\left(4Hd\left({B^{2}_{X}B^{2}_{W}}/{\tilde{\epsilon}^{2}_{gen}}+1\right)\right)}+2\epsilon_{ini,1}(m,\mathcal{G},\delta/(TH+1)).

The total number of samples used in the algorithm is mTHmTH.

Informally, when $\epsilon_{gen}\approx\tilde{O}(1/\sqrt{m})$, to achieve $\epsilon$-near optimality, the above theorem indicates that we just need to set $m\approx\tilde{O}(1/\epsilon^{2})$, which results in a sample complexity scaling as $\tilde{O}(1/\epsilon^{2})$ (since $T$ only scales as $\tilde{O}(dH)$). We give a detailed derivation and examples in the next section.
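As a back-of-the-envelope sketch (ignoring the $\epsilon_{ini,1}$ term and the $|\mathcal{A}|$ and $C_{\mathcal{G}}$ range factors, which we absorb into a constant $c$), plugging the finite-class rate of Remark 3 into Theorem 1 and solving for $m$ gives

$$\epsilon_{gen}\approx c\sqrt{\frac{\ln(|\Pi||\mathcal{G}|TH/\delta)}{m}},\qquad 5\epsilon_{gen}\sqrt{dH^{2}\iota}\leq\epsilon\;\Longrightarrow\;m\approx\frac{25c^{2}dH^{2}\iota\ln(|\Pi||\mathcal{G}|TH/\delta)}{\epsilon^{2}},$$

where $\iota:=\ln\left(4Hd\left(B^{2}_{X}B^{2}_{W}/\tilde{\epsilon}^{2}_{gen}+1\right)\right)$ is a logarithmic factor. Since $T=\tilde{O}(dH)$, the total number of samples is $mTH\approx\tilde{O}\left(d^{2}H^{4}\ln(|\Pi||\mathcal{G}|/\delta)/\epsilon^{2}\right)$, consistent with Theorem 2 below.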

5.3 Examples

Hereafter, we derive sample complexity results using Theorem 1. For complete results, refer to Sections F–K. The results for M-step decodable POMDPs and observable low-rank POMDPs are deferred to Section M.

5.3.1 Finite Sample Classes

We first consider the case where the hypothesis classes are finite and the class admits a PO-bilinear rank decomposition.

Example 4 (Finite Sample Classes).

Consider the case when Π and 𝒢 are finite and the PO-bilinear rank assumption is satisfied. When Π and 𝒢 are infinite hypothesis classes, |Π| and |𝒢| are replaced with their L^∞-covering numbers, respectively.

Theorem 2 (Sample complexity for discrete Π\Pi and 𝒢\mathcal{G} (informal)).

Let ∥𝒢_h∥_∞ ≤ C_𝒢 and r_h ∈ [0,1] for any h ∈ [H], and suppose the PO-bilinear rank assumption holds with PO-bilinear rank d. Letting |Π_max| = max_h |Π_h| and |𝒢_max| = max_h |𝒢_h|, with probability 1−δ, we can achieve J(π⋆) − J(π̂) ≤ ϵ when we use samples

O~(d2H4max(C𝒢,1)2|𝒜|2ln(|𝒢max||Πmax|/δ)ln2(BXBW/δ)(1/ϵ)2).\displaystyle\tilde{O}\left(d^{2}H^{4}\max(C_{\mathcal{G}},1)^{2}|\mathcal{A}|^{2}\ln(|\mathcal{G}_{\max}||\Pi_{\max}|/\delta)\ln^{2}(B_{X}B_{W}/\delta)(1/\epsilon)^{2}\right).

Here, polylog(d, H, |𝒜|, ln(|𝒢_max|), ln(|Π_max|), ln(1/δ), ln(B_X), ln(B_W), 1/ϵ) factors are omitted.

5.3.2 Observable Undercomplete Tabular POMDPs

We start with tabular POMDPs. The details are deferred to Section H.

Example 5 (continues=ex:under_tabular).

In tabular models, recall that the PO-bilinear rank is at most d = |𝒪|^M |𝒜|^M |𝒮|. We suppose r_h ∈ [0,1] for any h ∈ [H]. Assuming 𝕆 has full column rank, to satisfy realizability we set 𝒢_h = {⟨θ, 𝟏(z) ⊗ 𝕆†𝟏(o)⟩ ∣ ∥θ∥_∞ ≤ H}, where ∥𝕆†∥₁ ≤ 1/σ₁ and 𝟏(z), 𝟏(o) are one-hot encoding vectors over 𝒵_{h−1} and 𝒪, respectively. We set Π_h = [𝒵̄_h → Δ(𝒜)]. Then, the following holds.

Theorem 3 (Sample complexity for undercomplete tabular models (Informal)).

With probability 1δ1-\delta, we can achieve J(π)J(π^)ϵJ(\pi^{\star})-J(\hat{\pi})\leq\epsilon when we use samples at most O~(|𝒮|2|𝒜|3M+3|𝒪|3M+1H6(1/ϵ)2(1/σ1)2ln(1/δ)).\tilde{O}\left(|{\mathcal{S}}|^{2}|\mathcal{A}|^{3M+3}|\mathcal{O}|^{3M+1}H^{6}(1/\epsilon)^{2}(1/\sigma_{1})^{2}\ln(1/\delta)\right).
Here, polylog(|𝒮|,|𝒪|,|𝒜|,H,1/σ1,ln(1/δ))\mathrm{polylog}(|{\mathcal{S}}|,|\mathcal{O}|,|\mathcal{A}|,H,1/\sigma_{1},\ln(1/\delta)) are omitted.

Firstly, while the above error bound incurs |𝒪|^M|𝒜|^M, we will see in Section 8.2.2 that, when we use the more general definition of the PO-bilinear AC class and combine it with a model-based perspective, we can remove |𝒪|^M from the error bound. The intuition is that the statistical complexity then scales with |𝒮|²|𝒜| + |𝒪||𝒜| and does not incur |𝒪|^M. At the same time, although the PO-bilinear rank currently scales with |𝒪|^M|𝒜|^M|𝒮|, we can show that it can be just |𝒮| under a more refined definition. Secondly, the condition ∥𝕆†∥₁ ≤ 1/σ₁ can be replaced with analogous conditions such as ∥𝕆†∥₂ ≤ 1/σ₂. Here, note ∥𝕆†∥₁ = 1/{min_x ∥𝕆x∥₁/∥x∥₁} and ∥𝕆†∥₂ = 1/{min_x ∥𝕆x∥₂/∥x∥₂}. The reason we use the 1-norm is to invoke the result of [24] and achieve near global optimality, as discussed below.
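When the emission matrix is known (e.g., in a simulation study), the observability parameter σ₁ is straightforward to evaluate numerically. The following numpy sketch, with a hypothetical random column-stochastic 𝕆, computes ∥𝕆†∥₁ as the induced 1-norm (maximum absolute column sum) of the pseudo-inverse and reads off σ₁.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_states = 8, 3  # hypothetical undercomplete sizes (|O| >= |S|)

# Column-stochastic emission matrix O[o, s] = O(o | s).
O = rng.random((n_obs, n_states))
O /= O.sum(axis=0, keepdims=True)
assert np.linalg.matrix_rank(O) == n_states  # observability: full column rank

O_dag = np.linalg.pinv(O)
norm1 = np.abs(O_dag).sum(axis=0).max()  # induced 1-norm = max absolute column sum
print(f"||O^dagger||_1 = {norm1:.3f}, sigma_1 = {1.0 / norm1:.3f}")
```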

Near global optimality.

Finally, we consider the PAC guarantee against the globally optimal policy. As shown in [24], it is enough to set M = O((1/σ₁⁴) ln(|𝒮|H/ϵ)) to compete with the globally optimal policy π⋆_gl. Thus, we achieve a quasipolynomial sample complexity when competing against π⋆_gl.

Theorem 4 (Sample complexity for undercomplete tabular models (Informal) — competing against πgl\pi^{\star}_{\mathrm{gl}}).

With probability 1δ1-\delta, we can achieve J(πgl)J(π^)ϵJ(\pi^{\star}_{\mathrm{gl}})-J(\hat{\pi})\leq\epsilon when we use samples at most

poly(|𝒮|,|𝒜|ln(|𝒮|H/ϵ)/σ14,|𝒪|ln(|𝒮|H/ϵ)/σ14,H,1/σ1,1/ϵ,ln(1/δ)).\displaystyle\textstyle\mathrm{poly}(|{\mathcal{S}}|,|\mathcal{A}|^{\ln(|{\mathcal{S}}|H/\epsilon)/\sigma^{4}_{1}},|\mathcal{O}|^{\ln(|{\mathcal{S}}|H/\epsilon)/\sigma^{4}_{1}},H,1/\sigma_{1},1/\epsilon,\ln(1/\delta)).
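For a rough sense of the scaling, the sketch below (illustrative only: the absolute constant hidden in M = O((1/σ₁⁴) ln(|𝒮|H/ϵ)) is unspecified, so we take it to be 1) computes the memory length prescribed by [24] for hypothetical problem parameters; the |𝒜|^M and |𝒪|^M factors in Theorem 4 are then exponential in this quantity, which is how the quasipolynomial dependence arises.

```python
import math

def memory_length(sigma_1, n_states, H, eps, C=1.0):
    """M = C * (1/sigma_1)^4 * ln(|S| * H / eps); C is a hypothetical constant."""
    return math.ceil(C * (1.0 / sigma_1) ** 4 * math.log(n_states * H / eps))

M = memory_length(sigma_1=0.5, n_states=10, H=50, eps=0.1)
print(M)  # 137 for these hypothetical parameters; M grows only logarithmically in 1/eps
```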

5.3.3 Observable Tabular Overcomplete POMDPs

We consider overcomplete tabular POMDPs. In this case, the PO-bilinear rank is at most |𝒪|^M |𝒜|^M |𝒮|. We suppose r_h ∈ [0,1] for any h ∈ [H]. Assuming 𝕆^K has full column rank, to satisfy realizability we set 𝒢_h = {⟨θ, 𝟏(z) ⊗ {𝕆^K}†𝟏(t^K)⟩ ∣ ∥θ∥_∞ ≤ H}, where ∥{𝕆^K}†∥₁ ≤ 1/σ₁ and 𝟏(z), 𝟏(t^K) are one-hot encoding vectors over 𝒵_{h−1} and 𝒪^K × 𝒜^{K−1}, respectively. We set Π_h = [𝒵̄_h → Δ(𝒜)]. Then, the following holds.

Theorem 5 (Sample complexity for overcomplete tabular models).

With probability 1δ1-\delta, we can achieve J(π)J(π^)ϵJ(\pi^{\star})-J(\hat{\pi})\leq\epsilon when we use samples at most O~(|𝒮|2|𝒜|3M+K+2|𝒪|3M+KH6(1/ϵ)2(1/σ1)2ln(1/δ)).\tilde{O}\left(|{\mathcal{S}}|^{2}|\mathcal{A}|^{3M+K+2}|\mathcal{O}|^{3M+K}H^{6}(1/\epsilon)^{2}(1/\sigma_{1})^{2}\ln(1/\delta)\right).
Here, polylog(|𝒮|,|𝒪|,|𝒜|,H,1/σ1,ln(1/δ))\mathrm{polylog}(|{\mathcal{S}}|,|\mathcal{O}|,|\mathcal{A}|,H,1/\sigma_{1},\ln(1/\delta)) are omitted.

When we use K-step futures, the above theorem additionally incurs |𝒜|^K|𝒪|^K, which comes from the naive parameterization of 𝒢_h. In Section 8.2.3, we will see that under the model-based learning perspective (i.e., we parameterize 𝕋, 𝕆 first and then construct Π and 𝒢 using the model class), we can get rid of the dependence on |𝒪|^{M+K} and |𝒜|^K. This is because the complexity of the model class is independent of M and K (i.e., 𝕋 and 𝕆 have O(|𝒮|²|𝒜|) and O(|𝒪||𝒮|) parameters, respectively).

5.3.4 Observable LQG

Now let us revisit LQG. The details are deferred to Section J. We show that Provable can compete against the globally optimal policy with polynomial sample complexity.

Example 6 (continues=ex:lqqs).

In LQG, by setting M = H−1, we achieve a polynomial sample complexity when competing against the globally optimal policy π⋆_gl.

Theorem 6 (Sample complexity for LQG (informal) – competing against πgl\pi^{\star}_{\mathrm{gl}}).

Consider a linear policy class Π_h = {δ(a_h = K̄_h z̄_h) ∣ ∥K̄_h∥ ≤ Θ}, and assume max(∥A∥, ∥B∥, ∥C∥, ∥Q∥, ∥R∥) ≤ Θ and that all policies induce a stable system (formalized in Section J). With probability 1−δ, we can achieve J(π⋆_gl) − J(π̂) ≤ ϵ when we use samples at most

poly(H,ds,do,da,Θ,C,ln(1/δ))×(1/ϵ)2.\displaystyle\textstyle\mathrm{poly}(H,d_{s},d_{o},d_{a},\Theta,\|C^{\dagger}\|,\ln(1/\delta))\times(1/\epsilon)^{2}.

5.3.5 Observable HSE-POMDPs

Next, we study HSE-POMDPs. The details are deferred to Section G.

Example 7 (continues=ex:linear).

In HSE-POMDPs, the PO-bilinear rank is at most max_h d_{ϕ_h}. Suppose ∥ψ_h∥ ≤ 1 and V^π_h(⋅) = ⟨θ^π_h, ϕ_h(⋅)⟩ with ∥θ^π_h∥ ≤ Θ_V for any h ∈ [H]. Then, to satisfy realizability, we set 𝒢_h = {⟨θ, ψ_h(⋅)⟩ ∣ ∥θ∥ ≤ Θ_V/σ_min(K)}, where σ_min(K) = min_{h∈[H]} 1/∥K†_h∥.

Theorem 7 (Sample complexity for HSE-POMDPs (Informal)).

Let d_ψ = max_h {d_{ψ_h}}, d_ϕ = max_h {d_{ϕ_h}}, and |Π_max| = max_h |Π_h|. Suppose r_h lies in [0,1] for any h ∈ [H]. Then, with probability 1−δ, we can achieve J(π⋆) − J(π̂) ≤ ϵ when we use samples

O~(dϕ2H4|𝒜|2max(ΘV,1)2{dψ+ln(|Πmax|/δ)}(1/σmin(K))2(1/ϵ)2).\displaystyle\textstyle\tilde{O}\left(d^{2}_{\phi}H^{4}|\mathcal{A}|^{2}\max(\Theta_{V},1)^{2}\{d_{\psi}+\ln(|\Pi_{\max}|/\delta)\}{(1/\sigma_{\min}(K))^{2}}\cdot(1/\epsilon)^{2}\right).

Here, polylog(dϕ,dψ,|𝒜|,ΘV,ln(|Πmax|),1/σmin(K),1/ϵ,ln(1/δ),σmax(T),σmax(K))\mathrm{polylog}(d_{\phi},d_{\psi},|\mathcal{A}|,\Theta_{V},\ln(|\Pi_{\max}|),1/\sigma_{\min}(K),1/\epsilon,\ln(1/\delta),\sigma_{\max}(T),\sigma_{\max}(K)) are omitted and σmax(K)=maxh[H]Kh,σmax(T)=maxπΠ,h[H]Tπ:h\sigma_{\max}(K)=\max_{h\in[H]}\|K_{h}\|,\sigma_{\max}(T)=\max_{\pi\in\Pi,h\in[H]}\|T_{\pi:h}\|.

Note that the sample complexity above does not explicitly depend on the memory length M; instead, it depends only on the dimensions of the features ϕ, ψ. In other words, if we have a feature mapping ψ_h that maps the entire history (i.e., M = H) to a low-dimensional vector (e.g., LQG), our algorithm can immediately compete against the globally optimal policy π⋆_gl.

6 Predictive State Representations

In this section, we demonstrate that our definition and algorithm apply to PSRs, models that strictly generalize POMDPs [46, 62]. Below, we first briefly introduce PSRs and then show that they form a PO-bilinear AC model. Throughout this section, we focus on discrete linear PSRs. We also suppose the reward at step h is a deterministic function of (o_h, a_h), where τ^a_h = (o_1, a_1, ⋯, o_{h−1}, a_{h−1}). Given τ^a_h, the dynamical system generates o_h ∼ ℙ(⋅|τ^a_h). Here we use the superscript a on τ^a_h to emphasize that τ^a_h ends with the action a_{h−1}.

PSRs use the concept of test, which is a sequence of future observations and actions, i.e., for some test t=(oh:h+W1,ah:h+W2)t=(o_{h:h+W-1},a_{h:h+W-2}) with length W+W\in\mathbb{N}^{+}, we define the probability of test tt being successful (t|τha)\mathbb{P}(t|\tau^{a}_{h}) as (t|τha):=(oh:h+W1|τha;do(ah:h+W2))\mathbb{P}(t|\tau^{a}_{h}):=\mathbb{P}(o_{h:h+W-1}|\tau^{a}_{h};\text{do}(a_{h:h+W-2})) which is the probability of observing oh:h+W1o_{h:h+W-1} by actively executing actions ah:h+W2a_{h:h+W-2} conditioned on history τha\tau^{a}_{h}.

We now explain one-step observable PSRs while deferring the general multi-step observable setting to Section D. A one-step observable PSR uses the observations in 𝒪\mathcal{O} as tests, i.e., tests with length 1.

Definition 5 (Core test set and linear PSRs).

A core test set 𝒯 ⊂ 𝒪 contains a finite number of tests (i.e., observations from 𝒪). For any h, any history τ^a_h, and any future test t_h = (o_{h:h+W−1}, a_{h:h+W−2}) with W ∈ ℕ⁺, there exists a vector m_{t_h} ∈ ℝ^{|𝒯|} such that the probability that t_h succeeds conditioned on τ^a_h can be expressed as ℙ(t_h|τ^a_h) = m_{t_h}^⊤ [ℙ(o|τ^a_h)]_{o∈𝒯}, where we denote q_{τ^a_h} := [ℙ(o|τ^a_h)]_{o∈𝒯} as the vector in ℝ^{|𝒯|} with entries ℙ(o|τ^a_h) for o ∈ 𝒯. The vector q_{τ^a_h} is called the predictive state.

A core test set 𝒯 with the smallest number of tests is called a minimum core test set, denoted 𝒬. PSRs are strictly more expressive than POMDPs in that every POMDP can be embedded into a PSR whose minimum core test set has size at most |𝒮|; the converse does not hold [46]. For example, in observable undercomplete POMDPs (i.e., 𝕆 has full column rank), the observation set 𝒪 can serve as a core test set, but the minimum core test set 𝒬 will have size |𝒮|. Here, we assume we know a core test set 𝒯 that contains 𝒬; however, we are agnostic to which subset is the actual 𝒬. In the literature on PSRs, this setting is often referred to as transform PSRs [8, 57].
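As a concrete illustration of Definition 5, the following numpy sketch (a hypothetical random undercomplete POMDP with 𝒯 = 𝒪 as the core test set) verifies that a two-step test probability is linear in the predictive state: since q_{τ^a_h} = 𝕆 b for the belief b and ℙ(t|τ^a_h) = u_t^⊤ b for a model-determined vector u_t, taking m_t = (𝕆†)^⊤ u_t gives ℙ(t|τ^a_h) = m_t^⊤ q_{τ^a_h}.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nO, nA = 3, 6, 2  # hypothetical sizes; nO >= nS so O can have full column rank

O = rng.random((nO, nS)); O /= O.sum(axis=0)                     # O[o, s] = P(o | s)
T = rng.random((nA, nS, nS)); T /= T.sum(axis=2, keepdims=True)  # T[a, s, s'] = P(s' | s, a)
b = rng.random(nS); b /= b.sum()                                 # belief over the latent state

q = O @ b                                                        # predictive state [P(o | tau)]_{o in T}

# Two-step test t = (o1, a, o2): P(t | tau) = sum_s b(s) O(o1|s) sum_s' T(s'|s,a) O(o2|s').
o1, a, o2 = 2, 1, 4
u_t = O[o1, :] * (T[a] @ O[o2, :])                               # u_t(s), so P(t | tau) = u_t . b
m_t = np.linalg.pinv(O).T @ u_t                                  # m_t = (O^dagger)^T u_t

assert np.isclose(m_t @ q, u_t @ b)                              # P(t | tau) is linear in q
print(m_t @ q)
```

The identity holds because 𝕆†𝕆 = I when 𝕆 has full column rank, so m_t^⊤ q = u_t^⊤ 𝕆†𝕆 b = u_t^⊤ b.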

Now we define a value link function in PSRs. First, given an MM-memory policy, define 𝒱hπ(τha)=𝔼[t=hHrt|τha;ah:Hπ]\mathcal{V}^{\pi}_{h}(\tau^{a}_{h})=\mathbb{E}[\sum_{t=h}^{H}r_{t}|\tau^{a}_{h};a_{h:H}\sim\pi], i.e., the expected total reward under π\pi, conditioned on the history τha\tau^{a}_{h}. Note that our value function here depends on the entire history.

Definition 6 (General value link functions).

Consider an MM-memory policy π\pi. One-step general value link functions ghπ:𝒵h1×𝒯g^{\pi}_{h}:\mathcal{Z}_{h-1}\times{\mathcal{T}}\to\mathbb{R} at step h[H]h\in[H] are defined as solutions to

𝒱hπ(τha)=𝔼[ghπ(zh1,oh)τha].\displaystyle\mathcal{V}^{\pi}_{h}(\tau^{a}_{h})=\mathbb{E}[g^{\pi}_{h}(z_{h-1},o_{h})\mid\tau^{a}_{h}]. (3)

This definition is more general than Definition 3 since (3) implies (1) in POMDPs by setting 𝒪=𝒯\mathcal{O}={\mathcal{T}}. In PSRs, we can show the existence of this general value link function.

Lemma 7 (The existence of link functions for PSRs).

Suppose 𝒯{\mathcal{T}} is a core test set. Then, a one-step value link function ghπg^{\pi}_{h} always exists.

The high-level derivation is as follows. Using the linear PSR property, one can first show that 𝒱hπ(τha)\mathcal{V}^{\pi}_{h}(\tau^{a}_{h}) has a bilinear form 𝒱hπ(τha)=1(zh1)𝕁hπ𝐪τha\mathcal{V}^{\pi}_{h}(\tau^{a}_{h})={\textbf{1}}(z_{h-1})^{\top}\mathbb{J}_{h}^{\pi}{\bf q}_{\tau^{a}_{h}}, where 1(z)|𝒵h1|{\textbf{1}}(z)\in\mathbb{R}^{|\mathcal{Z}_{h-1}|} denotes the one-hot encoding vector over 𝒵h1\mathcal{Z}_{h-1}, and 𝕁hπ\mathbb{J}_{h}^{\pi} is a |𝒵h1|×|𝒯||\mathcal{Z}_{h-1}|\times|{\mathcal{T}}| matrix. Then, given any τha\tau^{a}_{h} and o(|τha)o\sim\mathbb{P}(\cdot|\tau^{a}_{h}), for some |𝒵h1|×|𝒯||\mathcal{Z}_{h-1}|\times|{\mathcal{T}}| matrix 𝕁\mathbb{J}, we can show gh(zh1,o):=1(zh1)𝕁[1(t=o)]t𝒯g_{h}(z_{h-1},o):={\textbf{1}}(z_{h-1})^{\top}\mathbb{J}[{\textbf{1}}(t=o)]_{t\in{\mathcal{T}}} satisfies the above, where [1(t=o)]t𝒯|𝒯|[{\textbf{1}}(t=o)]_{t\in{\mathcal{T}}}\in\mathbb{R}^{|{\mathcal{T}}|} is a one-hot encoding vector over 𝒯{\mathcal{T}} and serves as an unbiased estimate of 𝐪τha{\bf q}_{\tau^{a}_{h}}.

Finally, we show that PSR admits PO-bilinear rank decomposition (Definition 3).

Lemma 8.

Suppose a core test set 𝒯{\mathcal{T}} includes a minimum core test set 𝒬\mathcal{Q}. Set Πh=[𝒵¯hΔ(𝒜)]\Pi_{h}=[\bar{\mathcal{Z}}_{h}\to\Delta(\mathcal{A})] and 𝒢h={(zh1,o)1(zh1)𝕁[1(t=o)]t𝒯𝕁|𝒵h1|×|𝒯|}\mathcal{G}_{h}=\{(z_{h-1},o)\mapsto{\textbf{1}}(z_{h-1})^{\top}\mathbb{J}[{\textbf{1}}(t=o)]_{t\in{\mathcal{T}}}\mid\mathbb{J}\in\mathbb{R}^{|\mathcal{Z}_{h-1}|\times|{\mathcal{T}}|}\}, the PO-bilinear rank is at most (|𝒪||𝒜|)M|𝒬|(|\mathcal{O}||\mathcal{A}|)^{M}|\mathcal{Q}|.

Then, Algorithm 1 is directly applicable to PSRs. Note that, fortunately, the PO-bilinear rank here scales with |𝒬| rather than |𝒯|. The dependence (|𝒪||𝒜|)^M comes from the dimension of the one-hot memory feature 𝟏(z_{h−1}). If one has a compact feature representation ϕ: 𝒵_{h−1} → ℝ^d such that 𝒱^π_h(τ^a_h) = ϕ(z_{h−1})^⊤ 𝕁^π_h q_{τ^a_h} is linear with respect to the feature ϕ(z_{h−1}), then the PO-bilinear rank is d|𝒬|. This implies that if one has a compact featurization of memory a priori, one can avoid the exponential dependence on M.

Sample complexity.

Finally, we briefly mention the sample complexity result; the details are deferred to Section K. The sample complexity to satisfy J(π⋆) − J(π̂) ≤ ϵ is given as

O~(|𝒪|M|𝒜|M1|𝒬|2max(Θ,1)H4|𝒜|2ln(|𝒢max||Πmax|/δ)ln(ΘW)2ϵ2)\displaystyle\tilde{O}\left(\frac{|\mathcal{O}|^{M}|\mathcal{A}|^{M-1}|\mathcal{Q}|^{2}\max(\Theta,1)H^{4}|\mathcal{A}|^{2}\ln(|\mathcal{G}_{\max}||\Pi_{\max}|/\delta)\ln(\Theta_{W})^{2}}{\epsilon^{2}}\right)

where Θ_W and Θ are parameters associated with PSRs. Here, there is no explicit dependence on |𝒯|. Note that in the worst case, ln|𝒢_max| scales as O(|𝒵_{h−1}||𝒯|), and ln|Π_max| scales as O(|𝒵_{h−1}||𝒪||𝒜|).

7 Generalization of PO-Bilinear AC Class

We extend our previous definition of the PO-bilinear AC framework. We first present an even more general framework that captures all the examples we have discussed so far. We then provide two more examples covered by this framework: (1) M-step decodable POMDPs, and (2) observable POMDPs with low-rank latent transition. Using the result in (2), we obtain refined results in the tabular setting compared to those of Section 5.3.2.

The following defines the general PO-bilinear AC class. Recall M(h) := max(h−M, 1). We consider a one-step future, i.e., K = 1, but the extension to K > 1 is straightforward. Compared to Definition 3, we introduce another class of functions, termed discriminators ℱ, and a loss function l.

Definition 7 (General PO-Bilinear AC Class).

Consider a tuple Π,𝒢,l,Πe,\langle\Pi,\mathcal{G},l,\Pi^{e},\mathcal{F}\rangle consisting of a policy class Π\Pi, a function class 𝒢\mathcal{G}, a loss function l={lh}h=1Hl=\{l_{h}\}_{h=1}^{H} where lh(;f,π,g):h1×𝒪×𝒜××𝒪l_{h}(\cdot;f,\pi,g):\mathcal{H}_{h-1}\times\mathcal{O}\times\mathcal{A}\times\mathbb{R}\times\mathcal{O}\to\mathbb{R}, a set of estimation policies Πe:={πe(π):πΠ}\Pi^{e}:=\{\pi^{e}(\pi):\pi\in\Pi\} where πhe(π):𝒵¯hΔ(𝒜)\pi^{e}_{h}(\pi):\bar{\mathcal{Z}}_{h}\to\Delta(\mathcal{A}), and a discriminator class ={h}\mathcal{F}=\{\mathcal{F}_{h}\} with h[h]\mathcal{F}_{h}\subset[\mathcal{H}_{h}\to\mathbb{R}]. Consider a non-decreasing function ζ:+\zeta:\mathbb{R}^{+}\to\mathbb{R} with ζ(0)=0\zeta(0)=0.

The model is a PO-bilinear class of rank dd if 𝒢\mathcal{G} is realizable, and there exist Wh:Π×𝒢dW_{h}:\Pi\times\mathcal{G}\to\mathbb{R}^{d} and Xh:ΠdX_{h}:\Pi\to\mathbb{R}^{d} such that for all π,πΠ,g𝒢\pi,\pi^{\prime}\in\Pi,g\in\mathcal{G} and h[H]h\in[H],

  1. (a)

    |𝔼[gh(z¯h)rhgh+1(z¯h+1);a1:hπ]||Wh(π,g),Xh(π)||\mathbb{E}[g_{h}(\bar{z}_{h})-r_{h}-g_{h+1}(\bar{z}_{h+1});a_{1:h}\sim\pi]|\leq|\langle W_{h}(\pi,g),X_{h}(\pi)\rangle|,

  2. (b)
    ζ(maxfh|𝔼[lh(τh,ah,rh,oh+1;f,π,g);a1:M(h)1π,aM(h):hπe(π)]|)|Wh(π,g),Xh(π)|.\displaystyle\zeta(\max_{f\in\mathcal{F}_{h}}|\mathbb{E}[l_{h}(\tau_{h},a_{h},r_{h},o_{h+1};f,\pi,g);a_{1:M(h)-1}\sim\pi^{\prime},a_{M(h):h}\sim\pi^{e}(\pi^{\prime})]|)\geq|\langle W_{h}(\pi,g),X_{h}(\pi^{\prime})\rangle|.

(In M-step decodable POMDPs and POMDPs with low-rank latent transition, we set π^e(π′) = 𝒰(𝒜), and in the previous sections we set π^e(π′) = π′.)

  3. (c)
    maxfh|𝔼[lh(τh,ah,rh,oh+1;f,π,gπ);a1:M(h)1π,aM(h):hπe(π)]|=0\max_{f\in\mathcal{F}_{h}}|\mathbb{E}[l_{h}(\tau_{h},a_{h},r_{h},o_{h+1};f,\pi,g^{\pi});a_{1:M(h)-1}\sim\pi^{\prime},a_{M(h):h}\sim\pi^{e}(\pi^{\prime})]|=0

    for any πΠ\pi\in\Pi and the corresponding value link function gπg^{\pi} in 𝒢\mathcal{G} .

The first condition states that the average Bellman error under π is upper-bounded by the quantity in the bilinear form. The second condition states that we have a known loss function l that can be used to estimate an upper bound (up to a non-decreasing transformation ζ) on the value of the bilinear form. Our algorithm will use the surrogate loss l(⋅); as we will show, being able to estimate merely an upper bound on the value of the bilinear form suffices for deriving a PAC algorithm. The discriminator class ℱ and the non-decreasing function ζ give us additional freedom in designing the loss. For simple examples such as tabular POMDPs and LQG, as we have already seen, we simply set the discriminator class ℱ = ∅ (i.e., we do not use discriminators) and take ζ to be the identity mapping.

With this definition, we slightly modify Provable to incorporate the discriminator when constructing constraints. The algorithm, named DisProvable, is summarized in Algorithm 3. There are two modifications: (1) when we collect data, we switch from the roll-in policy π^t to the estimation policy π^e at time step M(h); (2) the Bellman error constraint σ^t_h is defined using the loss l together with the discriminator class ℱ_h.

Algorithm 3 PaRtially ObserVAble BiLinEar with DIScriminators (DisProvable)
1:  Input: Value link function class 𝒢={𝒢h},𝒢h[𝒵¯h]\mathcal{G}=\{\mathcal{G}_{h}\},\mathcal{G}_{h}\subset[\bar{\mathcal{Z}}_{h}\to\mathbb{R}], discriminator class ={h},h[h]\mathcal{F}=\{\mathcal{F}_{h}\},\mathcal{F}_{h}\subset[\mathcal{H}_{h}\to\mathbb{R}], policy class Π={Πh},Πh[𝒵¯hΔ(𝒜)]\Pi=\{\Pi_{h}\},\Pi_{h}\subset[\bar{\mathcal{Z}}_{h}\to\Delta(\mathcal{A})], parameters m,Rm\in\mathbb{N},R\in\mathbb{R}
2:  Initialize π0Π\pi^{0}\in\Pi
3:  Form the first step dataset 𝒟0={oi}i=1m\mathcal{D}^{0}=\{o^{i}\}_{i=1}^{m}, with oi𝕆(|s1)o^{i}\sim\mathbb{O}(\cdot|s_{1})
4:  for t=0T1t=0\to T-1 do
5:     For any h[H]h\in[H], define the Bellman error
(π,g)Π×𝒢:σht(π,g):=maxfh|𝔼𝒟ht[lh(τh,ah,rh,oh+1;f,π,g)]|\forall(\pi,g)\in\Pi\times\mathcal{G}:\sigma^{t}_{h}(\pi,g):=\max_{f\in\mathcal{F}_{h}}|\mathbb{E}_{\mathcal{D}^{t}_{h}}\left[l_{h}(\tau_{h},a_{h},r_{h},o_{h+1};f,\pi,g)\right]|
where 𝒟^t_h is the empirical approximation obtained by executing a_{1:M(h)−1} ∼ π^t, a_{M(h):h} ∼ π^e(π^t) and collecting m i.i.d. tuples.
6:     Select policy optimistically as follows
(πt+1,gt+1):=argmaxπΠ,g𝒢𝔼𝒟0[g1(o)]s.t.h[H],i[t],σhi(π,g)R.\displaystyle(\pi^{t+1},g^{t+1}):=\mathop{\mathrm{argmax}}_{\pi\in\Pi,g\in\mathcal{G}}\mathbb{E}_{\mathcal{D}^{0}}[g_{1}(o)]\quad\mathrm{s.t.}\quad\forall h\in[H],\forall i\in[t],\sigma^{i}_{h}(\pi,g)\leq R.
7:  end for
8:  Output: Randomly choose π̂ from (π^1, ⋯, π^T).
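For finite Π, 𝒢, and ℱ, the optimistic constrained search in Algorithm 3 can be carried out by brute-force enumeration. The following is a minimal sketch of this loop; the callables collect and loss_h are hypothetical stand-ins for the data-collection subroutine (assumed to switch to π^e at step M(h)) and the loss in Definition 7, and R is assumed large enough that the realizable pair stays feasible.

```python
import numpy as np

def dis_provable(Pi, G, F, loss_h, collect, D0, H, T, R, m, rng):
    """Minimal sketch of DisProvable for finite classes.

    Pi      : list of candidate policies
    G       : list of candidate link functions g = (g_1, ..., g_H); g[0] maps o_1 -> R
    F       : F[h] is the list of discriminators at step h
    loss_h  : loss_h(batch, h, f, pi, g) -> array of per-sample losses
    collect : collect(pi, h, m) -> batch of m i.i.d. tuples (tau_h, a_h, r_h, o_{h+1})
    D0      : m i.i.d. first observations o_1
    """
    datasets = []                      # accumulated (h, batch) constraints over iterations
    pi_t, policies = Pi[0], []
    for t in range(T):
        datasets += [(h, collect(pi_t, h, m)) for h in range(H)]

        def sigma(h, batch, pi, g):    # empirical surrogate Bellman error with discriminators
            return max(abs(np.mean(loss_h(batch, h, f, pi, g))) for f in F[h])

        feasible = [(pi, g) for pi in Pi for g in G
                    if all(sigma(h, batch, pi, g) <= R for (h, batch) in datasets)]
        # Optimistic selection: maximize the estimated initial value E_{D^0}[g_1(o)].
        pi_t, _ = max(feasible, key=lambda pg: np.mean([pg[1][0](o) for o in D0]))
        policies.append(pi_t)
    return policies[rng.integers(len(policies))]   # uniform draw from (pi^1, ..., pi^T)
```

Each iteration adds H new constraints, so the feasible set shrinks monotonically; the choice T = Õ(dH) in the theorem below bounds how many iterations are needed.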

The following theorem shows the sample complexity of Algorithm 3. For simplicity, we directly consider the case where Π, 𝒢, ℱ are all discrete.

Assumption 3 (Uniform Convergence).

Fix h ∈ [H]. Let 𝒟′_h be a set of m i.i.d. tuples obtained by executing a_{1:M(h)−1} ∼ π^t, a_{M(h):h} ∼ π^e(π^t). With probability 1−δ,

supπΠ,g𝒢,f|(𝔼𝒟h𝔼)[lh(τh,ah,rh,oh+1;f,π,g)]|ϵgen,h(m,Π,𝒢,,δ)\displaystyle\textstyle\sup_{\pi\in\Pi,g\in\mathcal{G},f\in\mathcal{F}}|(\mathbb{E}_{\mathcal{D}^{\prime}_{h}}-\mathbb{E})[l_{h}(\tau_{h},a_{h},r_{h},o_{h+1};f,\pi,g)]|\leq\epsilon_{gen,h}(m,\Pi,\mathcal{G},\mathcal{F},\delta)

For h=1h=1, we also require

supg1𝒢1|𝔼𝒟1[g1(o1)]𝔼[𝔼𝒟1[g1(o1)]]|ϵini,1(m,𝒢,δ).\sup_{g_{1}\in\mathcal{G}_{1}}|\mathbb{E}_{\mathcal{D}^{\prime}_{1}}[g_{1}(o_{1})]-\mathbb{E}[\mathbb{E}_{\mathcal{D}^{\prime}_{1}}[g_{1}(o_{1})]]|\leq\epsilon_{ini,1}(m,\mathcal{G},\delta).
Theorem 8 (Sample complexity of Algorithm 3).

Suppose we have a PO-bilinear AC class with rank dd in Definition 7. Suppose Assumption 3, supπΠXh(π)BX\sup_{\pi\in\Pi}\|X_{h}(\pi)\|\leq B_{X} and supπΠ,g𝒢Wh(π,g)BW\sup_{\pi\in\Pi,g\in\mathcal{G}}\|W_{h}(\pi,g)\|\leq B_{W} for any h[H].h\in[H].
By setting T=2Hdln(4Hd(BX2BW2ζ2(ϵ~gen)+1)),R=ϵgenT=2Hd\ln\left(4Hd\left(\frac{B^{2}_{X}B^{2}_{W}}{\zeta^{2}(\tilde{\epsilon}_{gen})}+1\right)\right),R=\epsilon_{gen} where

ϵgen:=maxhϵgen,h(m,Π,𝒢,,δ/(TH+1)),ϵ~gen:=maxhϵgen,h(m,Π,𝒢,,δ/H).\displaystyle\textstyle\epsilon_{gen}:=\max_{h}\epsilon_{gen,h}(m,\Pi,\mathcal{G},\mathcal{F},\delta/(TH+1)),\tilde{\epsilon}_{gen}:=\max_{h}\epsilon_{gen,h}(m,\Pi,\mathcal{G},\mathcal{F},\delta/H).

With probability at least 1δ1-\delta, letting π=argmaxπΠJ(π)\pi^{\star}=\mathop{\mathrm{argmax}}_{\pi\in\Pi}J(\pi^{\star}), we have

J(π)J(π^)H1/2[4ζ(ϵgen)2+2Tζ(2ϵgen)2Hdln(4Hd(BX2BW2/ζ2(ϵ~gen)+1))]1/2+2ϵini.\displaystyle\textstyle J(\pi^{\star})-J(\hat{\pi})\leq H^{1/2}\left[4\zeta(\epsilon_{gen})^{2}+2T\zeta(2\epsilon_{gen})^{2}Hd\ln(4Hd(B^{2}_{X}B^{2}_{W}/\zeta^{2}(\tilde{\epsilon}_{gen})+1))\right]^{1/2}+2\epsilon_{ini}.

The total number of samples used in the algorithm is mTHmTH.

This reduces to Theorem 1 when we set ζ to be the identity function and π^e(π′) = π′. When ζ^{−1}(⋅) is strongly convex, we can obtain refined rates. For example, when ζ(x) = √x, i.e., ζ^{−1}(x) = x², with ϵ_gen = O(1/√m), the above theorem implies a slow sample complexity rate of 1/ϵ⁴. However, by leveraging the strong convexity of the square function ζ^{−1}(x) := x², a refined analysis gives the fast rate 1/ϵ². We will see two such examples in the next section.

8 Examples for Generalized PO-Bilinear AC Class

We demonstrate that our generalized framework captures two models: (1) MM-step decodable POMDPs, and (2) observable POMDPs with the latent low-rank transition. In this section, we assume rh[0,1]r_{h}\in[0,1] for any h[H]h\in[H].

8.1 MM-step decodable POMDPs

The example we include here is a model that involves nonlinear function approximation but relies on a distinctive assumption of exact identifiability of the latent states.

Example 8 (MM-step decodable POMDPs [20]).

There exists an unknown decoder ιh:𝒵¯h𝒮\iota_{h}:\bar{\mathcal{Z}}_{h}\to{\mathcal{S}}, such that for every reachable trajectory (s1:h,a1:h1,o1:h)(s_{1:h},a_{1:h-1},o_{1:h}), we have sh=ιh(z¯h)s_{h}=\iota_{h}(\bar{z}_{h}) for all h[H]h\in[H].

Note that when M=0M=0, this model is reduced to the well-known Block MDP model [16, 48, 78].

Existence of value link functions.

From the definition, using a value function Vhπ(zh1,sh)V^{\pi}_{h}(z_{h-1},s_{h}) over zh1𝒵h1,sh𝒮z_{h-1}\in\mathcal{Z}_{h-1},s_{h}\in{\mathcal{S}}, we can define a value link function vhπ:𝒵h1×𝒪v^{\pi}_{h}:\mathcal{Z}_{h-1}\times\mathcal{O}\to\mathbb{R} as

vhπ(zh1,oh)=Vhπ(zh1,ιh(z¯h))\displaystyle v^{\pi}_{h}(z_{h-1},o_{h})=V^{\pi}_{h}(z_{h-1},\iota_{h}(\bar{z}_{h}))

since it satisfies

𝔼oh𝕆(sh)[vhπ(zh1,oh)zh1,sh]=𝔼oh𝕆(sh)[Vhπ(zh1,ιh(z¯h))zh1,sh]=Vhπ(zh1,sh).\displaystyle\mathbb{E}_{o_{h}\sim\mathbb{O}(s_{h})}[v^{\pi}_{h}(z_{h-1},o_{h})\mid z_{h-1},s_{h}]=\mathbb{E}_{o_{h}\sim\mathbb{O}(s_{h})}[V^{\pi}_{h}(z_{h-1},\iota_{h}(\bar{z}_{h}))\mid z_{h-1},s_{h}]=V^{\pi}_{h}(z_{h-1},s_{h}).

This is summarized in the following lemma.

Lemma 9 (Existence of link functions in MM-step decodable POMDPs).

In MM-step decodable POMDPs, link functions exist.

MM-step decodable POMDPs showcase the generality of value link functions, which not only capture standard observability conditions where future observations and actions are used to replace belief states (e.g., observable tabular POMDPs and observable LQG), but also capture a model where history is used to replace latent states.

PO-Bilinear Rank.

Next, we calculate the PO-bilinear rank based on Definition 7. In the tabular case, we can naïvely obtain the PO-bilinear decomposition with rank |𝒪|M|𝒜|M|𝒮||\mathcal{O}|^{M}|\mathcal{A}|^{M}|{\mathcal{S}}| following Example 1. Here, we consider the nontabular case where function approximation is used and |𝒪||\mathcal{O}| can be extremely large. We define the following Bellman operator associated with π\pi at step hh:

hπ:𝒢[𝒵¯h];\displaystyle\mathcal{B}^{\pi}_{h}:\mathcal{G}\to[\bar{\mathcal{Z}}_{h}\to\mathbb{R}]; (4)
z¯h:[hπg](z¯h):=𝔼ahπ(z¯h)[rh(ιh(z¯h),ah)+𝔼oh+1𝕆𝕋(ιh(z¯h),ah)[gh+1(z¯h+1)]].\displaystyle\forall\bar{z}_{h}:\left[\mathcal{B}^{\pi}_{h}g\right](\bar{z}_{h}):=\mathbb{E}_{a_{h}\sim\pi(\bar{z}_{h})}\left[r_{h}(\iota_{h}(\bar{z}_{h}),a_{h})+\mathbb{E}_{o_{h+1}\sim\mathbb{O}\circ\mathbb{T}(\iota_{h}(\bar{z}_{h}),a_{h})}[g_{h+1}(\bar{z}_{h+1})]\right].

Note that above we use the ground-truth decoder ι_h to decode z̄_h into its associated latent state s_h. The existence of this Bellman operator ℬ^π_h crucially depends on the existence of such a decoder ι_h.

We show that M-step decodable POMDPs satisfy Definition 7. We assume that the latent transition model is low-rank, an assumption widely used in MDPs [76, 36, 2, 71]. Note that the algorithm does not need to know μ, ϕ.

Assumption 4 (Low-rankness of latent transition).

Suppose 𝕋\mathbb{T} is low-rank, i.e., 𝕋(ss,a)=ϕ(s,a),μ(s)((s,a,s))\mathbb{T}(s^{\prime}\mid s,a)=\langle\phi(s,a),\mu(s^{\prime})\rangle(\forall(s,a,s^{\prime})) where ϕ,μ\phi,\mu are (unknown) dd-dimensional features. As technical conditions, we suppose ϕ(s,a)1\|\phi(s,a)\|\leq 1 for any (s,a)(s,a) and |μ(s)v(s)d(s)|d|\int\mu(s)v(s)d(s)|\leq\sqrt{d} for any v1\|v\|_{\infty}\leq 1.

Lemma 10 (Bilinear decomposition of low-rank M-step decodable POMDPs).

Suppose Assumption 4 holds, ∥𝒢_h∥_∞ ≤ H, ∥ℱ_h∥_∞ ≤ H, and r_h ∈ [0,1] for any h ∈ [H]. Assume the discriminator class is Bellman complete, i.e.,

πΠ,g𝒢:(hπg)ghh,\forall\pi\in\Pi,\forall g\in\mathcal{G}:(\mathcal{B}^{\pi}_{h}g)-g_{h}\in\mathcal{F}_{h},

for any h[H]h\in[H]. The loss function is designed as

lh(τh,ah,rh,oh+1;f,π,g):=πh(ahz¯h)|𝒜|f(z¯h)(gh(z¯h)rhgh+1(z¯h+1))0.5f(z¯h)2.\displaystyle l_{h}(\tau_{h},a_{h},r_{h},o_{h+1};f,\pi,g):=\pi_{h}(a_{h}\mid\bar{z}_{h})|\mathcal{A}|f(\bar{z}_{h})(g_{h}(\bar{z}_{h})-r_{h}-g_{h+1}(\bar{z}_{h+1}))-0.5f(\bar{z}_{h})^{2}. (5)

Then, there exist Wh(π,g),Xh(π)W_{h}(\pi,g),X_{h}(\pi^{\prime}) so that the PO-bilinear rank is at most dd such that

|𝔼[g_h(z̄_h) − r_h − g_{h+1}(z̄_{h+1}); a_{1:h} ∼ π]| = |⟨W_h(π,g), X_h(π)⟩|, (6)
|max_{f∈ℱ_h} 𝔼[l_h(τ_h, a_h, r_h, o_{h+1}; f, π, g); a_{1:M(h)−1} ∼ π′, a_{M(h):h} ∼ 𝒰(𝒜)]| ≥ 0.5⟨W_h(π,g), X_h(π′)⟩² / |𝒜|^M, (7)

and

|maxfh𝔼[lh(τh,ah,rh,oh+1;f,π,gπ);a1:M(h)1π,aM(h):h𝒰(𝒜)]|=0.\displaystyle\left\lvert\max_{f\in\mathcal{F}_{h}}\mathbb{E}[l_{h}(\tau_{h},a_{h},r_{h},o_{h+1};f,\pi,g^{\pi});a_{1:M(h)-1}\sim\pi^{\prime},a_{M(h):h}\sim\mathcal{U}(\mathcal{A})]\right\rvert=0. (8)
Proof.

The proof is deferred to Section N.2. Note that (6), (7), (8) correspond to (a), (b), (c) in Definition 7. ∎

We use the most general bilinear class definition from Definition 7, with ζ(a) = |𝒜|^{M/2} a^{1/2} for scalar a ∈ ℝ⁺, which is non-decreasing on ℝ⁺. The proof of the above lemma leverages the novel moment-matching-policy trick introduced by [20]. When the latent state and action spaces are discrete, the lemma states that the bilinear rank is |𝒮||𝒜|, which is much smaller than |𝒪|^M|𝒜|^M|𝒮|. Note that we introduce the term −0.5 f(z̄_h)² in the loss function (5) to induce strong convexity w.r.t. f, as in [70, 18, 9], which is important for obtaining the fast rate later.
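To see informally where the fast rate comes from, fix z̄_h and write δ(z̄_h) := 𝔼[π_h(a_h|z̄_h)|𝒜|(g_h(z̄_h) − r_h − g_{h+1}(z̄_{h+1})) | z̄_h] for the conditional mean of the importance-weighted Bellman residual (with a_h ∼ 𝒰(𝒜)). If the discriminator class contains the pointwise maximizer, which is what Bellman completeness is designed to ensure, then

max_{f∈ℱ_h} 𝔼[l_h(⋅; f, π, g)] = max_f 𝔼[f(z̄_h)δ(z̄_h) − 0.5 f(z̄_h)²] = 0.5 𝔼[δ(z̄_h)²] ≥ 0.5 (𝔼[δ(z̄_h)])²,

where the maximizer is f = δ and the last step is Jensen's inequality. Squaring the average Bellman error in this way (up to the |𝒜|^M factor coming from the moment-matching argument) is exactly what produces ζ(a) ∝ a^{1/2} in (7), and the strong-convexity-based analysis of this squared loss then recovers the fast 1/ϵ² rate.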

The concrete sample complexity of DisProvable (Algorithm 3) for this model is summarized below. Recall that the PO-bilinear rank is d, where d is the rank of the latent transition matrix. We set 𝒢_h ⊂ [𝒵̄_h → [0,H]]. Then, we have the following result.

Theorem 9 (Sample complexity for MM-step decodable POMDPs (Informal)).

Suppose Assumption 4 holds, Bellman completeness holds, ∥𝒢_h∥_∞ ≤ H, ∥ℱ_h∥_∞ ≤ H, and r_h ∈ [0,1] for any h ∈ [H]. With probability 1−δ, we can achieve J(π⋆) − J(π̂) ≤ ϵ when we use samples at most

O~(d2H6|𝒜|2+Mln(|Πmax||max||𝒢max|/δ)ϵ2).\displaystyle\tilde{O}\left(\frac{d^{2}H^{6}|\mathcal{A}|^{2+M}\ln(|\Pi_{\max}||\mathcal{F}_{\max}||\mathcal{G}_{\max}|/\delta)}{\epsilon^{2}}\right).

Here, polylog(d,H,|𝒜|,1/ϵ,ln(|Πmax|),ln(|max|),ln(|𝒢max|),ln(1/δ))\mathrm{polylog}(d,H,|\mathcal{A}|,1/\epsilon,\ln(|\Pi_{\max}|),\ln(|\mathcal{F}_{\max}|),\ln(|\mathcal{G}_{\max}|),\ln(1/\delta)) are omitted.

The following are several implications. First, the sample complexity scales as O(1/ϵ²). As promised, by leveraging the strong convexity of the loss function, we obtain the rate O(1/ϵ²), which is faster than the O(1/ϵ⁴) attained by naively invoking Theorem 8 with ζ(x) ∝ √x. Secondly, the error bound incurs |𝒜|^M. As shown in [20], this is inevitable in M-step decodable POMDPs. Thirdly, in the tabular case, when we use naive function classes for 𝒢, ℱ, Π, i.e., 𝒢_h = {𝒵̄_h → [0,H]}, ℱ_h = {𝒵̄_h → [0,H]}, Π_h = {𝒵̄_h → Δ(𝒜)}, the bound could incur an additional |𝒪|^M, since the complexity of the function classes can scale with (|𝒪||𝒜|)^M (e.g., log(|𝒢_h|) can be of order O(|𝒪|^M|𝒜|^M), and similarly for log(|ℱ_h|) and log(|Π_h|)). However, when we start from a realizable model class that captures the ground-truth transition and emission distributions, we can remove |𝒪|^M; see Section 8.2.4 for an example.

Note that [20] uses a different function-class setup: they assume an M-memory-action-dependent Q-function class 𝒬_h: 𝒵̄_h × 𝒜 → ℝ that contains Q⋆_h(z̄_h, a), while we use the actor-critic pair (v^π_h, π). The two function-class setups are not directly comparable. In general, such an optimal Q⋆ with truncated history does not exist when exact decodability fails (e.g., no such Q⋆ with truncated history exists in LQG). This displays the potential generality of the actor-critic framework we propose here.

8.2 Observable POMDPs with Latent Low-rank Transition: a model-based perspective

The final example we include in this work is a POMDP with the latent low-rank transition. We first introduce the model, and then we introduce our function approximation setup and show the sample complexity. Finally, we revisit the sample complexity for observable tabular POMDPs and MM-step decodable tabular POMDPs using the improved algorithm that elaborates on the model-based approach in this section.

Example 9 (Observable POMDPs with latent low-rank transition).

The latent transition 𝕋(s|s,a)\mathbb{T}(s^{\prime}|s,a) is factorized as 𝕋(s|s,a)=μ(s)ϕ(s,a),s,a,s\mathbb{T}(s^{\prime}|s,a)=\mu^{\star}(s^{\prime})^{\top}\phi^{\star}(s,a),\forall s,a,s^{\prime} where μ:𝒮d\mu^{\star}:{\mathcal{S}}\to\mathbb{R}^{d} and ϕ:𝒮×𝒜d\phi^{\star}:{\mathcal{S}}\times\mathcal{A}\to\mathbb{R}^{d}. The observation |𝒪|×|𝒮||\mathcal{O}|\times|{\mathcal{S}}| matrix 𝕆\mathbb{O} has full-column rank.

In the tabular POMDP example, we have d ≤ |𝒮|; in general, however, d can be much smaller than |𝒮|. Note that in this section we focus on the setting where 𝒮, 𝒪 are discrete to avoid measure-theoretic language, but their sizes can be extremely large. In particular, our sample complexity will have no explicit polynomial or logarithmic dependence on |𝒪|, |𝒮|; instead, it will scale polynomially with the complexity of the hypothesis class and the rank d.

Model-based function approximation.

Our function approximation class consists of a set of models ℳ = {(μ, ϕ, O)}, where μ, ϕ together model the latent transition as μ(⋅)^⊤ ϕ(s,a) ∈ Δ(𝒮), and O: 𝒮 → Δ(𝒪) models 𝕆, with O full column rank. For notational simplicity, we often write θ := (μ, ϕ, O) ∈ ℳ to denote a model (μ, ϕ, O). We impose the following assumption.

Assumption 5 (Realizability).

We assume realizability, i.e., (μ,ϕ,𝕆)(\mu^{\star},\phi^{\star},\mathbb{O})\in\mathcal{M}.

We assume ℳ is discrete, but |ℳ| can be so large that a linear dependence on |ℳ| in the sample complexity is unacceptable. Our goal is a bound that scales polynomially with ln(|ℳ|), the standard statistical complexity of a discrete hypothesis class ℳ.

Next, we construct Π,𝒢,\Pi,\mathcal{G},\mathcal{F} using the model class \mathcal{M}. Given θ:=(μ,ϕ,O)\theta:=(\mu,\phi,O), we denote πθ\pi^{\theta} as the optimal MM-memory policy, i.e., the MM-memory policy that maximizes the total expected reward. We set

Π={πθ:θ}.\Pi=\{\pi^{\theta}:\theta\in\mathcal{M}\}.

We consider the value function class for θ := (μ, ϕ, O) with O being full column rank. For each θ, we can define the value function of policy π at step h ∈ [H]: V^π_{θ;h}(z_{h−1}, s_h): 𝒵_{h−1} × 𝒮 → ℝ. Then, since O is full column rank, as we saw in the proof of Lemma 1, a corresponding value link function is

g^π_{θ;h}(z, o) = ⟨f^π_{θ;h}, 𝟏(z) ⊗ O†𝟏(o)⟩

where Vθ;hπ(zh1,sh)=fθ;hπ,𝟏(z)𝟏(s)V^{\pi}_{\theta;h}(z_{h-1},s_{h})=\langle f^{\pi}_{\theta;h},\mathbf{1}(z)\otimes\mathbf{1}(s)\rangle. Then, we construct 𝒢={𝒢h}\mathcal{G}=\{\mathcal{G}_{h}\} as:

∀h ∈ [H]: 𝒢_h = {𝒵̄_h ∋ z̄_h ↦ g^π_{θ;h}(z̄_h) ∈ ℝ : π ∈ Π, θ ∈ ℳ}. (9)

By construction, since θ:=(μ,ϕ,𝕆)\theta^{\star}:=(\mu^{\star},\phi^{\star},\mathbb{O})\in\mathcal{M}, we must have gπ𝒢,πΠg^{\pi}\in\mathcal{G},\forall\pi\in\Pi, which implies 𝒢\mathcal{G} is realizable (note ghπ=gθ;hπg^{\pi}_{h}=g^{\pi}_{\theta^{\star};h}). Here, from the construction and the assumption rh[0,1]r_{h}\in[0,1] for any h[H]h\in[H], we have |𝒢h|||2|\mathcal{G}_{h}|\leq|\mathcal{M}|^{2} and 𝒢hH/σ1\|\mathcal{G}_{h}\|_{\infty}\leq H/\sigma_{1}, which can be seen from

(z,o);fθ;hπ,𝟏(z)O𝟏(o)fθ;hπ𝟏(z)O𝟏(o)1H×O𝟏(o)1H/σ1\displaystyle\forall(z,o);\langle f^{\pi}_{\theta;h},\mathbf{1}(z)\otimes O^{\dagger}\mathbf{1}(o)\rangle\leq\|f^{\pi}_{\theta;h}\|_{\infty}\|\mathbf{1}(z)\otimes O^{\dagger}\mathbf{1}(o)\|_{1}\leq H\times\|O^{\dagger}\mathbf{1}(o)\|_{1}\leq H/\sigma_{1}

by assuming O11/σ1\|O^{\dagger}\|_{1}\leq 1/\sigma_{1} and fθ;hπH\|f^{\pi}_{\theta;h}\|_{\infty}\leq H.

To construct a discriminator class \mathcal{F}, we first define the Bellman operator θ;hπ\mathcal{B}^{\pi}_{\theta;h} for πΠ,h[H],θ\pi\in\Pi,h\in[H],\theta\in\mathcal{M}:

θ;hπ:𝒢[h];\displaystyle\mathcal{B}^{\pi}_{\theta;h}:\mathcal{G}\to[\mathcal{H}_{h}\to\mathbb{R}];
τh;(θ;hπg)(τh)=𝔼ahπh(z¯h)[rh+𝔼oh+1θ(|τh,ah)gh+1(z¯h+1)],\displaystyle\forall\tau_{h};\left(\mathcal{B}^{\pi}_{\theta;h}g\right)(\tau_{h})=\mathbb{E}_{a_{h}\sim\pi_{h}(\bar{z}_{h})}\left[r_{h}+\mathbb{E}_{o_{h+1}\sim\mathbb{P}_{\theta}(\cdot|\tau_{h},a_{h})}g_{h+1}(\bar{z}_{h+1})\right],

where h\mathcal{H}_{h} is the whole history space up to hh (τh=(a1:h1,o1:h)\tau_{h}=(a_{1:h-1},o_{1:h}), and z¯h\bar{z}_{h} is just part of this history) and θ(oh+1|τh,ah)\mathbb{P}_{\theta}(o_{h+1}|\tau_{h},a_{h}) is the probability of generating oh+1o_{h+1} conditioned on τh,ah\tau_{h},a_{h} under model θ\theta. Then, we construct ={h}\mathcal{F}=\{\mathcal{F}_{h}\} such that

h[H]:h={hτh{ghθ;hπg}(τh):πΠ,g𝒢,θ}.\displaystyle\forall h\in[H]:\mathcal{F}_{h}=\{\mathcal{H}_{h}\ni\tau_{h}\mapsto\{g_{h}-\mathcal{B}^{\pi}_{\theta;h}g\}(\tau_{h})\in\mathbb{R}:\pi\in\Pi,g\in\mathcal{G},\theta\in\mathcal{M}\}. (10)

so that Bellman completeness is ensured:

𝒢_h − (ℬ^π_h 𝒢) ⊂ ℱ_h,

noting that ℬ^π_{θ⋆;h} = ℬ^π_h. From the construction, |ℱ_h| ≤ |ℳ|² × |ℳ|² × |ℳ|² = |ℳ|⁶ and ∥ℱ_h∥_∞ ≤ 3H/σ₁.

We define the loss to be the same as the one used for M-step decodable POMDPs, except that the discriminators now take the entire history as input:

lh(τh,ah,rh,oh+1;f,π,g):=πh(ahz¯h)|𝒜|f(τh)(gh(z¯h)rhgh+1(z¯h+1))0.5f(τh)2.\displaystyle l_{h}(\tau_{h},a_{h},r_{h},o_{h+1};f,\pi,g):=\pi_{h}(a_{h}\mid\bar{z}_{h})|\mathcal{A}|f(\tau_{h})(g_{h}(\bar{z}_{h})-r_{h}-g_{h+1}(\bar{z}_{h+1}))-0.5f(\tau_{h})^{2}. (11)

Finally, as in the case of MM-step decodable POMDPs (Lemma 10), we get the following lemma that states that our model is a PO-bilinear AC class (Definition 7) under the following model assumption.

Assumption 6.

We assume O11/σ1\|O^{\dagger}\|_{1}\leq 1/\sigma_{1} for any OO in the model. Suppose μ()ϕ(s,a)Δ(𝒮)\mu(\cdot)^{\top}\phi(s,a)\in\Delta({\mathcal{S}}) for any (s,a)(s,a), μ()\mu(\cdot) and ϕ()\phi(\cdot) in the model. Suppose ϕ(s,a)1\|\phi(s,a)\|\leq 1 for any ϕ\phi in the model and (s,a)𝒮×𝒜(s,a)\in{\mathcal{S}}\times\mathcal{A}. Suppose for any v:𝒮[0,1]v:{\mathcal{S}}\to[0,1] and for any μ\mu in the model, we have v(s)μ(s)d(s)2d\|\int v(s)\mu(s)\mathrm{d}(s)\|_{2}\leq\sqrt{d}.

Lemma 11 (PO-bilinear decomposition for Observable POMDPs with low-rank transition).

Suppose Assumptions 5 and 6 hold. Consider observable POMDPs with latent low-rank transition. Set 𝒢 as in (9), ℱ as in (10), and l as in (11). Then, there exist W_h(π,g), X_h(π′) that admit the PO-bilinear rank decomposition in Definition 7 with rank d.

The above lemma ensures that the PO-bilinear rank only depends on dd, and is independent of the length of the memory. For example, in the tabular case, it is |𝒮||{\mathcal{S}}|.

Next, we show that the output of DisProvable competes with the best-in-class M-memory policy, as follows.

Theorem 10 (Sample complexity of DisProvable for observable POMDPs with latent low-rank transition).

Consider observable POMDPs with latent low-rank transition. Suppose Assumptions 5 and 6 hold. With probability 1−δ, we can achieve J(π⋆) − J(π̂) ≤ ϵ when we use samples at most

O~(d2H6|𝒜|2+Mln(||/δ)ϵ2σ12).\displaystyle\tilde{O}\left(\frac{d^{2}H^{6}|\mathcal{A}|^{2+M}\ln(|\mathcal{M}|/\delta)}{\epsilon^{2}\sigma^{2}_{1}}\right).

Here, we omit polylog(d,H,|𝒜|,ln(1/δ),ln(||),1/σ1,1/ϵ)\mathrm{polylog}(d,H,|\mathcal{A}|,\ln(1/\delta),\ln(|\mathcal{M}|),1/\sigma_{1},1/\epsilon).

Here, we emphasize that there is no explicit polynomial or logarithmic dependence on |𝒮| and |𝒪|, which permits learning with large state and observation spaces. We also have no explicit polynomial dependence on |𝒪|^M, as we construct Π and 𝒢 from the model class ℳ, which ensures that the complexities of Π and 𝒢 are of the same order as that of ℳ.

8.2.1 Global Optimality

We show a quasipolynomial sample complexity bound for competing against the globally optimal policy π⋆_gl. To compete against π⋆_gl, we need to set M properly; we use the following lemma, whose proof is given in Section O.

Lemma 12 (Near global optimality of M-memory policies).

Consider ϵ ∈ (0, H], and a POMDP with low-rank latent transition and 𝕆 full column rank with ∥𝕆†∥₁ ≤ 1/σ₁. When M = Θ(C σ₁^{−4} ln(dH/ϵ)) (with C some absolute constant), there must exist an M-memory policy π⋆ such that J(π⋆_gl) − J(π⋆) ≤ ϵ.

Note that the memory length M above is independent of |𝒮|; instead, it depends only on the rank d. To prove the above lemma, we first show a new result on belief contraction for low-rank POMDPs under ℓ₁-based observability. The proof of the belief contraction borrows a key lemma from [24] but extends the original result from small tabular POMDPs to low-rank POMDPs. We leverage the linear structure of the problem and the G-optimal design to construct an initial distribution over 𝒮 that serves as a starting point for belief propagation along the memory.

We conclude the study of POMDPs with low-rank latent transition with the following theorem, which demonstrates a quasipolynomial sample complexity for learning the globally optimal policy.

Theorem 11 (Sample complexity of DisProvable for POMDPs with low-rank latent transition — competing against πgl\pi^{\star}_{\mathrm{gl}}).

Consider observable POMDPs with latent low-rank transition. Fix ϵ ∈ (0, H), δ ∈ (0, 1). Suppose Assumptions 5 and 6 hold. We construct Π, 𝒢, ℱ, and the loss l as described above. With probability at least 1−δ, when M = Θ(C σ₁^{−4} ln(dH/ϵ)), DisProvable outputs a π̂ such that J(π⋆_gl) − J(π̂) ≤ ϵ, with the number of samples scaling as

Õ( (d² H⁶ |𝒜|² ln(|ℳ|/δ)) / (ϵ² σ₁²) · |𝒜|^{ln(dH/ϵ)/σ₁⁴} ).
Remark 4 (Comparison to [73]).

We compare our results to the very recent work [73], which studies POMDPs with low-rank latent transition. The results are in general not directly comparable, but we note several key differences. First, [73] considers a special instance of low-rank transition: it assumes 𝕋 has low non-negative rank, which can be exponentially larger than the usual rank [2]. Second, [73] additionally assumes short-past sufficiency, a condition which intuitively says that, for any roll-in policy, the sufficient statistics of a short memory are enough to recover the belief over the latent states; their sample complexity depends exponentially on the length of that memory. While our result also relies on the fact that the globally optimal policy can be approximated by an M-memory policy with small M, this fact is derived directly from the standard observability condition.

8.2.2 Revisiting Observable Undercomplete Tabular POMDPs

We reconsider the sample complexity of undercomplete tabular POMDPs using Theorem 10. In this case, we start from a model class that captures the ground-truth latent transition 𝕋 and emission distribution 𝕆. By constructing ϵ-nets over the model class, we can set ln(|ℳ|) = Õ(|𝒮|³|𝒪||𝒜|), since 𝕋 and 𝕆 have |𝒮|²|𝒜| and |𝒪||𝒮| parameters, respectively. Besides, the PO-bilinear rank is d = |𝒮|. Therefore, the sample complexity is

O~(|𝒮|5|𝒪|H6|𝒜|2+Mln(1/δ)ϵ2σ12).\displaystyle\tilde{O}\left(\frac{|{\mathcal{S}}|^{5}|\mathcal{O}|H^{6}|\mathcal{A}|^{2+M}\ln(1/\delta)}{\epsilon^{2}\sigma^{2}_{1}}\right).

We leave the formal analysis to future work.

Compared to the results in Section 5.3.2, there is no |𝒪|^M term. This is due to two improvements. First, we refine the rank from |𝒪|^M|𝒜|^M|𝒮| to |𝒮|. Second, we build the value link function class and the policy class from the model class, whose complexity does not depend on the memory length M (note that previously, from a purely model-free perspective, the statistical complexity of 𝒢 could scale as |𝒪|^M|𝒜|^M|𝒮| in the worst case).

8.2.3 Revisiting Observable Overcomplete POMDPs

We reconsider the sample complexity of overcomplete tabular POMDPs using Theorem 10, with a slight modification to incorporate multi-step futures. Suppose ∥{𝕆^K}†∥₁ ≤ 1/σ₁ (recall 𝕆^K is defined in Lemma 3 in Section 4.2). Then, we can achieve a sample complexity

O~(|𝒮|5|𝒪|H6|𝒜|2+Mln(1/δ)ϵ2σ12)\displaystyle\tilde{O}\left(\frac{|{\mathcal{S}}|^{5}|\mathcal{O}|H^{6}|\mathcal{A}|^{2+M}\ln(1/\delta)}{\epsilon^{2}\sigma^{2}_{1}}\right)

since the PO-bilinear rank is |𝒮||{\mathcal{S}}|. Note that there is no |𝒪|M+K|\mathcal{O}|^{M+K} dependence, since both the policy class and the value link function class are built from the model class whose complexity has nothing to do with M,KM,K.

Note that due to our definition of 𝕆^K, there is no |𝒜|^K term. However, under a different definition, for instance min_{a′_{h:h+K−2}∈𝒜^{K−1}} ∥{𝕆^K(a′_{h:h+K−2})}†∥₁ ≤ 1/α₁ (recall 𝕆^K(a′_{h:h+K−2}) is defined in Section 4.2), we would incur |𝒜|^K. This is because, if we only know that there exists some unknown sequence of actions a′_{h:h+K−2} such that 𝕆^K(a′_{h:h+K−2}) is full column rank, we need uniform sampling over the action sequences (incurring |𝒜|^K) in the importance sampling step to identify such a sequence. More formally, we can see that

|𝒜|Kminah:h+K2𝒜K1{𝕆K(ah:h+K2)}1{𝕆K}1.\displaystyle|\mathcal{A}|^{K}\min_{a^{\prime}_{h:h+K-2}\in\mathcal{A}^{K-1}}\|\{\mathbb{O}^{K}(a^{\prime}_{h:h+K-2})\}^{\dagger}\|_{1}\geq\|\{\mathbb{O}^{K}\}^{\dagger}\|_{1}. (12)

8.2.4 Revisiting MM-step Decodable Tabular POMDPs

We reconsider the sample complexity of tabular MM-step decodable POMDPs by constructing ,𝒢,Π\mathcal{F},\mathcal{G},\Pi from the model class \mathcal{M} as we did for the low-rank POMDP. In this case, by constructing ϵ\epsilon-nets, we can set ln(||)=O~(|𝒮|3|𝒪||𝒜|)\ln(|\mathcal{M}|)=\tilde{O}(|{\mathcal{S}}|^{3}|\mathcal{O}||\mathcal{A}|) since 𝕋,𝕆\mathbb{T},\mathbb{O} have |𝒮|2|𝒜||{\mathcal{S}}|^{2}|\mathcal{A}| and |𝒪||𝒮||\mathcal{O}||{\mathcal{S}}| parameters, respectively. Therefore, the sample complexity is

O~(H6|𝒮|5|𝒪||𝒜|2+Mln(1/δ)ϵ2).\displaystyle\tilde{O}\left(\frac{H^{6}|{\mathcal{S}}|^{5}|\mathcal{O}||\mathcal{A}|^{2+M}\ln(1/\delta)}{\epsilon^{2}}\right).

Again, we leave the formal analysis to future work. Compared to the naive result mentioned after Theorem 9, where ln(|𝒢|) and ln(|Π|) could scale on the order of |𝒪|^M in the tabular case, we have no |𝒪|^M dependence here.

9 Summary

We propose the PO-bilinear actor-critic framework, the first unified framework for provably efficient RL in large-scale partially observable dynamical systems. Our framework captures not only many models where provably efficient learning was already known, such as tabular POMDPs, LQG, and M-step decodable POMDPs, but also models for which provably efficient RL was not previously known, such as HSE-POMDPs, general PSRs, and observable POMDPs with low-rank latent transition. Our unified actor-critic algorithm, Provable, performs agnostic learning by searching for the best memory-based policy. For special models such as observable tabular POMDPs, LQG, and POMDPs with low-rank latent transition, by leveraging their special properties, namely the exponential stability of Bayesian filters in tabular and low-rank POMDPs and the existence of a compact featurization of histories in LQG, we can directly compete against the globally optimal policy without paying an exponential dependence on the horizon.

Acknowledgement

We thank Nan Jiang for valuable discussions on PSRs.

References

  • AHKS [20] Alekh Agarwal, Mikael Henaff, Sham Kakade, and Wen Sun. Pc-pg: Policy cover directed exploration for provable policy gradient learning. Advances in Neural Information Processing Systems, 33:13399–13412, 2020.
  • AKKS [20] Alekh Agarwal, Sham Kakade, Akshay Krishnamurthy, and Wen Sun. Flambe: Structural complexity and representation learning of low rank mdps. Advances in neural information processing systems, 33:20095–20107, 2020.
  • ALA [16] Kamyar Azizzadenesheli, Alessandro Lazaric, and Animashree Anandkumar. Reinforcement learning of pomdps using spectral methods. In Conference on Learning Theory, pages 193–256. PMLR, 2016.
  • BBC+ [19] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
  • Ber [12] Dimitri Bertsekas. Dynamic programming and optimal control: Volume I, volume 1. Athena scientific, 2012.
  • BGG [13] Byron Boots, Geoffrey Gordon, and Arthur Gretton. Hilbert space embeddings of predictive state representations. arXiv preprint arXiv:1309.6819, 2013.
  • BK [21] Andrew Bennett and Nathan Kallus. Proximal reinforcement learning: Efficient off-policy evaluation in partially observed markov decision processes. 2021.
  • BSG [11] Byron Boots, Sajid M Siddiqi, and Geoffrey J Gordon. Closing the learning-planning loop with predictive state representations. The International Journal of Robotics Research, 30(7):954–966, 2011.
  • CJ [19] Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, pages 1042–1051. PMLR, 2019.
  • CO [20] Sayak Ray Chowdhury and Rafael Oliveira. No-regret reinforcement learning with value function approximation: a kernel embedding approach. arXiv preprint arXiv:2011.07881, 2020.
  • CPS+ [20] Yifan Cui, Hongming Pu, Xu Shi, Wang Miao, and Eric Tchetgen Tchetgen. Semiparametric proximal causal inference. arXiv preprint arXiv:2011.08411, 2020.
  • CYW [22] Qi Cai, Zhuoran Yang, and Zhaoran Wang. Sample-efficient reinforcement learning for pomdps with linear function approximations. arXiv preprint arXiv:2204.09787, 2022.
  • Dea [18] Ben Deaner. Proxy controls and panel data. arXiv preprint arXiv:1810.00283, 2018.
  • DHB+ [17] Carlton Downey, Ahmed Hefny, Byron Boots, Geoffrey J Gordon, and Boyue Li. Predictive state recurrent neural networks. Advances in Neural Information Processing Systems, 30, 2017.
  • DJW [20] Yaqi Duan, Zeyu Jia, and Mengdi Wang. Minimax-optimal off-policy evaluation with linear function approximation. In International Conference on Machine Learning, pages 2701–2709. PMLR, 2020.
  • DKJ+ [19] Simon Du, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal, Miroslav Dudik, and John Langford. Provably efficient rl with rich observations via latent state decoding. In International Conference on Machine Learning, pages 1665–1674. PMLR, 2019.
  • DKL+ [21] Simon Du, Sham Kakade, Jason Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, and Ruosong Wang. Bilinear classes: A structural framework for provable generalization in rl. In International Conference on Machine Learning, pages 2826–2836. PMLR, 2021.
  • DLMS [20] Nishanth Dikkala, Greg Lewis, Lester Mackey, and Vasilis Syrgkanis. Minimax estimation of conditional moment models. In Advances in Neural Information Processing Systems, volume 33, pages 12248–12262, 2020.
  • EDKM [05] Eyal Even-Dar, Sham M Kakade, and Yishay Mansour. Reinforcement learning in pomdps without resets. 2005.
  • EJKM [22] Yonathan Efroni, Chi Jin, Akshay Krishnamurthy, and Sobhan Miryoosefi. Provable reinforcement learning with a short-term memory. arXiv preprint arXiv:2202.03983, 2022.
  • FKQR [21] Dylan J Foster, Sham M Kakade, Jian Qian, and Alexander Rakhlin. The statistical complexity of interactive decision making. arXiv preprint arXiv:2112.13487, 2021.
  • GDB [16] Zhaohan Daniel Guo, Shayan Doroudi, and Emma Brunskill. A pac rl algorithm for episodic pomdps. In Artificial Intelligence and Statistics, pages 510–518. PMLR, 2016.
  • [23] Noah Golowich, Ankur Moitra, and Dhruv Rohatgi. Learning in observable pomdps, without computationally intractable oracles. arXiv preprint arXiv:2206.03446, 2022.
  • [24] Noah Golowich, Ankur Moitra, and Dhruv Rohatgi. Planning in observable pomdps in quasipolynomial time. arXiv preprint arXiv:2201.04735, 2022.
  • HDG [15] Ahmed Hefny, Carlton Downey, and Geoffrey J Gordon. Supervised learning for dynamical system learning. Advances in neural information processing systems, 28, 2015.
  • HDL+ [21] Botao Hao, Yaqi Duan, Tor Lattimore, Csaba Szepesvári, and Mengdi Wang. Sparse feature selection makes batch reinforcement learning more sample efficient. In International Conference on Machine Learning, pages 4063–4073. PMLR, 2021.
  • Hes [18] Joao P Hespanha. Linear systems theory. Princeton university press, 2018.
  • HFP [13] William L Hamilton, Mahdi Milani Fard, and Joelle Pineau. Modelling sparse dynamical systems with compressed predictive state representations. In International Conference on Machine Learning, pages 178–186. PMLR, 2013.
  • HKZ [12] Daniel Hsu, Sham M Kakade, and Tong Zhang. A spectral algorithm for learning hidden markov models. Journal of Computer and System Sciences, 78(5):1460–1480, 2012.
  • IP [08] Masoumeh T Izadi and Doina Precup. Point-based planning for predictive state representations. In Conference of the Canadian Society for Computational Studies of Intelligence, pages 126–137. Springer, 2008.
  • Jae [98] Herbert Jaeger. Discrete-time, discrete-valued observable operator models: a tutorial. GMD-Forschungszentrum Informationstechnik Darmstadt, Germany, 1998.
  • Jae [00] Herbert Jaeger. Observable operator models for discrete stochastic time series. Neural computation, 12(6):1371–1398, 2000.
  • JKA+ [17] Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. Contextual decision processes with low bellman rank are pac-learnable. In International Conference on Machine Learning, pages 1704–1713. PMLR, 2017.
  • JKKL [20] Chi Jin, Sham Kakade, Akshay Krishnamurthy, and Qinghua Liu. Sample-efficient reinforcement learning of undercomplete pomdps. Advances in Neural Information Processing Systems, 33:18530–18539, 2020.
  • JLM [21] Chi Jin, Qinghua Liu, and Sobhan Miryoosefi. Bellman eluder dimension: New rich classes of rl problems, and sample-efficient algorithms. Advances in Neural Information Processing Systems, 34, 2021.
  • JYWJ [20] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
  • KECM [21] Jeongyeol Kwon, Yonathan Efroni, Constantine Caramanis, and Shie Mannor. Rl for latent mdps: Regret guarantees and a lower bound. Advances in Neural Information Processing Systems, 34, 2021.
  • KJS [15] Alex Kulesza, Nan Jiang, and Satinder Singh. Spectral learning of predictive state representations with insufficient statistics. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
  • KMN [99] Michael Kearns, Yishay Mansour, and Andrew Ng. Approximate planning in large pomdps via reusable trajectories. Advances in Neural Information Processing Systems, 12, 1999.
  • KMU [21] Nathan Kallus, Xiaojie Mao, and Masatoshi Uehara. Causal inference under unmeasured confounding with negative controls: A minimax learning approach. arXiv preprint arXiv:2103.14029, 2021.
  • KW [60] Jack Kiefer and Jacob Wolfowitz. The equivalence of two extremum problems. Canadian Journal of Mathematics, 12:363–366, 1960.
  • LAHA [20] Sahin Lale, Kamyar Azizzadenesheli, Babak Hassibi, and Anima Anandkumar. Regret minimization in partially observable linear quadratic control. arXiv preprint arXiv:2002.00082, 2020.
  • LCSJ [22] Qinghua Liu, Alan Chung, Csaba Szepesvári, and Chi Jin. When is partially observable reinforcement learning not scary? arXiv preprint arXiv:2204.08967, 2022.
  • Lit [96] Michael Lederman Littman. Algorithms for sequential decision-making. Brown University, 1996.
  • LMPR [20] Tianyu Li, Bogdan Mazoure, Doina Precup, and Guillaume Rabusseau. Efficient planning under partial observability with unnormalized q functions and spectral learning. In International Conference on Artificial Intelligence and Statistics, pages 2852–2862. PMLR, 2020.
  • LS [01] Michael Littman and Richard S Sutton. Predictive representations of state. Advances in neural information processing systems, 14, 2001.
  • LS [20] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
  • MHKL [20] Dipendra Misra, Mikael Henaff, Akshay Krishnamurthy, and John Langford. Kinematic state abstraction and provably efficient rich-observation reinforcement learning. In International conference on machine learning, pages 6961–6971. PMLR, 2020.
  • MKS+ [13] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • MST [18] Wang Miao, Xu Shi, and Eric Tchetgen Tchetgen. A confounding bridge approach for double negative control inference on causal effects. arXiv preprint arXiv:1808.04945, 2018.
  • MTR [19] Horia Mania, Stephen Tu, and Benjamin Recht. Certainty equivalent control of lqr is efficient. arXiv preprint arXiv:1902.07826, 2019.
  • Mur [00] Kevin P Murphy. A survey of pomdp solution techniques. environment, 2(10), 2000.
  • MZG+ [21] Afsaneh Mastouri, Yuchen Zhu, Limor Gultchin, Anna Korba, Ricardo Silva, Matt J Kusner, Arthur Gretton, and Krikamol Muandet. Proximal causal learning with kernels: Two-stage estimation and moment restriction. arXiv preprint arXiv:2105.04544, 2021.
  • NBGF [12] Yu Nishiyama, Abdeslam Boularias, Arthur Gretton, and Kenji Fukumizu. Hilbert space embeddings of pomdps. arXiv preprint arXiv:1210.4887, 2012.
  • PT [87] Christos H Papadimitriou and John N Tsitsiklis. The complexity of markov decision processes. Mathematics of operations research, 12(3):441–450, 1987.
  • PVSP [06] Josep M Porta, Nikos Vlassis, Matthijs TJ Spaan, and Pascal Poupart. Point-based value iteration for continuous pomdps. 2006.
  • RGT [04] Matthew Rosencrantz, Geoff Gordon, and Sebastian Thrun. Learning low dimensional predictive representations. In Proceedings of the twenty-first international conference on Machine learning, page 88, 2004.
  • SBS+ [10] Le Song, Byron Boots, Sajid Siddiqi, Geoffrey J Gordon, and Alex Smola. Hilbert space embeddings of hidden markov models. 2010.
  • SHSF [09] Le Song, Jonathan Huang, Alex Smola, and Kenji Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 961–968, 2009.
  • Sin [21] Rahul Singh. A finite sample theorem for longitudinal causal inference with machine learning: Long term, dynamic, and mediated effects. arXiv preprint arXiv:2112.14249, 2021.
  • SJK+ [19] Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Model-based rl in contextual decision processes: Pac bounds and exponential improvements over model-free approaches. In Conference on learning theory, pages 2898–2933. PMLR, 2019.
  • SJR [04] Satinder Singh, Michael R James, and Matthew R Rudary. Predictive state representations: a new theory for modeling dynamical systems. In Proceedings of the 20th conference on Uncertainty in artificial intelligence, pages 512–519, 2004.
  • SKKS [09] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.
  • SPK [13] Guy Shani, Joelle Pineau, and Robert Kaplow. A survey of point-based pomdp solvers. Autonomous Agents and Multi-Agent Systems, 27(1):1–51, 2013.
  • SSH [20] Max Simchowitz, Karan Singh, and Elad Hazan. Improper learning for non-stochastic control. In Conference on Learning Theory, pages 3320–3436. PMLR, 2020.
  • SUJ [21] Chengchun Shi, Masatoshi Uehara, and Nan Jiang. A minimax learning approach to off-policy evaluation in partially observable markov decision processes. arXiv preprint arXiv:2111.06784, 2021.
  • SVBB [16] Wen Sun, Arun Venkatraman, Byron Boots, and J Andrew Bagnell. Learning to filter with predictive state inference machines. In International conference on machine learning, pages 1197–1205. PMLR, 2016.
  • TJ [15] Michael R Thon and Herbert Jaeger. Links between multiplicity automata, observable operator models and predictive state representations: a unified learning framework. J. Mach. Learn. Res., 16:103–147, 2015.
  • TSM [20] Guy Tennenholtz, Uri Shalit, and Shie Mannor. Off-policy evaluation in partially observable environments. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 10276–10283, 2020.
  • UIJ+ [21] Masatoshi Uehara, Masaaki Imaizumi, Nan Jiang, Nathan Kallus, Wen Sun, and Tengyang Xie. Finite sample analysis of minimax offline reinforcement learning: Completeness, fast rates and first-order efficiency. arXiv preprint arXiv:2102.02981, 2021.
  • UZS [21] Masatoshi Uehara, Xuezhou Zhang, and Wen Sun. Representation learning for online and offline rl in low-rank mdps. arXiv preprint arXiv:2110.04652, 2021.
  • VODM [12] Peter Van Overschee and Bart De Moor. Subspace identification for linear systems: Theory—Implementation—Applications. Springer Science & Business Media, 2012.
  • WCYW [22] Lingxiao Wang, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Embed to control partially observed systems: Representation learning with provable sample efficiency. arXiv preprint arXiv:2205.13476, 2022.
  • XCGZ [21] Yi Xiong, Ningyuan Chen, Xuefeng Gao, and Xiang Zhou. Sublinear regret for learning pomdps. arXiv preprint arXiv:2107.03635, 2021.
  • XKG [21] Liyuan Xu, Heishiro Kanagawa, and Arthur Gretton. Deep proxy causal learning and its application to confounded bandit policy evaluation. Advances in Neural Information Processing Systems, 34:26264–26275, 2021.
  • YW [20] Lin Yang and Mengdi Wang. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. In International Conference on Machine Learning, pages 10746–10756. PMLR, 2020.
  • ZLKB [20] Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, and Emma Brunskill. Learning near optimal policies with low inherent bellman error. In International Conference on Machine Learning, pages 10978–10989. PMLR, 2020.
  • ZSU+ [22] Xuezhou Zhang, Yuda Song, Masatoshi Uehara, Mengdi Wang, Wen Sun, and Alekh Agarwal. Efficient reinforcement learning in block mdps: A model-free representation learning approach. arXiv preprint arXiv:2202.00063, 2022.

Appendix A Supplement for Section 3

We generalize Definition 3 to capture more models. The first extension is to use multi-step link functions. This extension is essential to capture overcomplete POMDPs and multi-step PSRs. The second extension is to use minimax loss functions with discriminators so that we can use not only absolute value loss functions but also squared loss functions. This extension is important to capture M-step decodable POMDPs.

Appendix B Supplement for Section 4

B.1 Observable Undercomplete Tabular POMDPs

We need to prove Lemma 2. In the tabular case, we set

\psi_h(z,o)=\mathbf{1}(z)\otimes\mathbf{1}(o),\qquad \phi_h(z,s)=\mathbf{1}(z)\otimes\mathbf{1}(s),\qquad K_h=I_{|\mathcal{Z}_{h-1}|}\otimes\mathbb{O},

where $\mathbf{1}(z),\mathbf{1}(o),\mathbf{1}(s)$ are one-hot encoding vectors over $\mathcal{Z}_{h-1},\mathcal{O},\mathcal{S}$, respectively. With this choice, we can regard the tabular model as an HSE-POMDP and directly invoke Lemma 6.
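To make the reduction concrete, here is a minimal NumPy sketch (our own illustration with made-up sizes and a made-up emission matrix; nothing here comes from the paper) that builds these features and checks the defining HSE-POMDP relation $\mathbb{E}_{o\sim\mathbb{O}(s)}[\psi_h(z,o)]=K_h\phi_h(z,s)$.

```python
import numpy as np

# Hypothetical sizes: |Z_{h-1}| memories, |O| observations, |S| latent states.
n_z, n_o, n_s = 4, 3, 2

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

# A made-up full-column-rank emission matrix O in R^{|O| x |S|} (columns sum to 1).
Obs = np.array([[0.7, 0.1],
                [0.2, 0.2],
                [0.1, 0.7]])

# Features from the proof: psi_h(z,o) = 1(z) (x) 1(o), phi_h(z,s) = 1(z) (x) 1(s),
# and the conditional-mean operator K_h = I_{|Z|} (x) O.
def psi(z, o):
    return np.kron(one_hot(z, n_z), one_hot(o, n_o))

def phi(z, s):
    return np.kron(one_hot(z, n_z), one_hot(s, n_s))

K = np.kron(np.eye(n_z), Obs)

# Check the HSE-POMDP identity E_{o ~ O(s)}[psi_h(z,o)] = K_h phi_h(z,s).
for z in range(n_z):
    for s in range(n_s):
        lhs = sum(Obs[o, s] * psi(z, o) for o in range(n_o))
        assert np.allclose(lhs, K @ phi(z, s))
print("tabular POMDP embeds as an HSE-POMDP")
```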

B.2 Observable Overcomplete POMDPs

We consider overcomplete POMDPs with multi-step futures; the following results generalize Lemma 1.

Proof of Lemma 3

Consider any function $f:\mathcal{Z}_{h-1}\times\mathcal{S}\to\mathbb{R}$ (thus, this captures all possible $V^{\pi}_{h}$). Denote by ${\textbf{1}}(z)$ the one-hot encoding of $z_{h-1}$ over $\mathcal{Z}_{h-1}$ (similarly ${\textbf{1}}(s)$ over $\mathcal{S}$ and ${\textbf{1}}(t)$ over $\mathcal{T}^{K}$). We have $f(z,s)=\langle f,{\textbf{1}}(z)\otimes{\textbf{1}}(s)\rangle=\langle f,{\textbf{1}}(z)\otimes((\mathbb{O}^{K})^{\dagger}\mathbb{O}^{K}{\textbf{1}}(s))\rangle$, where we use the assumption that ${\rm rank}(\mathbb{O}^{K})=|\mathcal{S}|$ and thus $(\mathbb{O}^{K})^{\dagger}\mathbb{O}^{K}=I$. Then,

\begin{align*}
f(z_{h-1},s_{h}) &= \langle f,\ {\textbf{1}}(z_{h-1})\otimes(\mathbb{O}^{K})^{\dagger}\,\mathbb{E}[{\textbf{1}}(o_{h:h+K-1},a_{h:h+K-2})\mid s_{h};\ a_{h:h+K-2}\sim\pi^{out}]\rangle\\
&= \mathbb{E}[\langle f,\ {\textbf{1}}(z_{h-1})\otimes(\mathbb{O}^{K})^{\dagger}{\textbf{1}}(o_{h:h+K-1},a_{h:h+K-2})\rangle\mid z_{h-1},s_{h};\ a_{h:h+K-2}\sim\pi^{out}],
\end{align*}

which means that the value bridge function corresponding to $f(\cdot)$ is

g(z,t):=\langle f,\ {\textbf{1}}(z)\otimes(\mathbb{O}^{K})^{\dagger}{\textbf{1}}(t)\rangle. ∎
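As a sanity check of this proof, the following sketch (assumed sizes; a random full-column-rank stand-in for $\mathbb{O}^K$; the memory $z$ is fixed and dropped) verifies numerically that $g(t)=\langle f,(\mathbb{O}^{K})^{\dagger}\mathbf{1}(t)\rangle$ is a value bridge function, i.e., its conditional mean over tests drawn given $s_h$ recovers $f(s_h)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_t = 3, 6   # |S| latent states, |T^K| multi-step tests (hypothetical sizes)

# A made-up K-step "observation" matrix O^K (|T^K| x |S|): column s is the law of
# the tests given s_h under uniform test actions; we only need full column rank.
OK = rng.random((n_t, n_s))
OK /= OK.sum(axis=0)          # each column is a distribution over tests
OK_pinv = np.linalg.pinv(OK)  # (O^K)^dagger satisfies (O^K)^dagger O^K = I here

f = rng.standard_normal(n_s)  # an arbitrary function f(s) (memory z is fixed)

# Bridge function from the proof: g(t) = <f, (O^K)^dagger 1(t)>, i.e. the t-th
# entry of (OK_pinv)^T f. Its conditional mean given s recovers f(s).
g = OK_pinv.T @ f
for s in range(n_s):
    assert np.isclose(OK[:, s] @ g, f[s])
print("E[g(t) | s] = f(s): bridge function verified")
```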

Proof of Lemma 4

Recall that we want to show the low-rank property of the following loss function:

\begin{align*}
&\mathbb{E}[g_{h+1}(\bar{z}^{K}_{h+1});\ a_{1:h-1}\sim\pi^{\prime},\ a_{h}\sim\pi,\ a_{h+1:h+K-1}\sim\pi^{out}]+\mathbb{E}[r_{h};\ a_{1:h-1}\sim\pi^{\prime},\ a_{h}\sim\pi]\\
&\quad-\mathbb{E}[g_{h}(\bar{z}^{K}_{h});\ a_{1:h-1}\sim\pi^{\prime},\ a_{h:h+K-1}\sim\pi^{out}].
\end{align*}

We take the expectation conditional on $z_{h-1}$ and $s_{h}$. For some vector $\theta_{\pi,g}\in\mathbb{R}^{|\mathcal{Z}_{h-1}|\times|\mathcal{S}|}$, which depends on $\pi$ and $g$, we can write this conditional expectation in the form $\langle\theta_{\pi,g},{\textbf{1}}(z_{h-1},s_{h})\rangle$, where ${\textbf{1}}(z_{h-1},s_{h})$ is the one-hot encoding vector over $\mathcal{Z}_{h-1}\times\mathcal{S}$. Then the loss for $(\pi,g)$ is equal to

\langle\theta_{\pi,g},\ \mathbb{E}[{\textbf{1}}(z_{h-1},s_{h});\ a_{1:h-1}\sim\pi^{\prime}]\rangle.

Hence, we can take $X_{h}(\pi^{\prime})=\mathbb{E}[{\textbf{1}}(z_{h-1},s_{h});\ a_{1:h-1}\sim\pi^{\prime}]$ and $W_{h}(\pi,g)=\theta_{\pi,g}$. ∎

B.3 Observable Linear Quadratic Gaussian

We need to prove Lemma 5; the proof is deferred to Section J.

B.4 Observable HSE-POMDPs

We first provide the proof of Lemma 6. Then, we briefly mention how we extend to the infinite-dimensional setting.

Proof of the first statement in Lemma 6

First, we need to show that value bridge functions exist. This is proved by noting

\mathbb{E}_{o\sim\mathbb{O}(s)}[\langle(K^{\dagger}_{h})^{\top}\theta^{\pi}_{h},\psi_{h}(\bar{z}_{h})\rangle]=\langle(K^{\dagger}_{h})^{\top}\theta^{\pi}_{h},K_{h}\phi_{h}(z_{h-1},s_{h})\rangle=\langle\theta^{\pi}_{h},\phi_{h}(z_{h-1},s_{h})\rangle=V^{\pi}_{h}(z_{h-1},s_{h}).

Thus, $\langle(K^{\dagger}_{h})^{\top}\theta^{\pi}_{h},\psi_{h}(\bar{z}_{h})\rangle$ is a value bridge function. ∎
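This chain of equalities can be checked numerically. The sketch below (made-up feature dimensions and a random full-column-rank operator $K_h$, so that $K_h^{\dagger}K_h=I$) is our own illustration, not part of the paper's algorithm.

```python
import numpy as np

# Made-up feature dimensions; K_h is a random full-column-rank operator with
# E_{o ~ O(s)}[psi_h(zbar_h)] = K_h phi_h(z_{h-1}, s_h), so K^dagger K = I.
rng = np.random.default_rng(4)
d_phi, d_psi = 3, 5
K = rng.standard_normal((d_psi, d_phi))   # generically full column rank
theta = rng.standard_normal(d_phi)        # theta_h^pi with V^pi = <theta, phi>

g_coef = np.linalg.pinv(K).T @ theta      # (K_h^dagger)^T theta_h^pi
phi = rng.standard_normal(d_phi)          # phi_h(z_{h-1}, s_h) for some (z, s)
mean_psi = K @ phi                        # E_{o ~ O(s)}[psi_h(zbar_h)]

# <(K^dagger)^T theta, E[psi]> = theta^T (K^dagger K) phi = <theta, phi>.
assert np.isclose(g_coef @ mean_psi, theta @ phi)
print("bridge property verified numerically")
```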

Proof of the second statement in Lemma 6

Consider a triple $(\pi^{\prime},\pi,g)\in\Pi\times\Pi\times\mathcal{G}$ with $g_{h}(\cdot)=\theta_{h}^{\top}\psi_{h}(\cdot)$ and $g^{\pi}_{h}=\langle\theta^{\star}_{h},\psi_{h}(\cdot)\rangle$. We have:

\begin{align*}
\mathrm{Br}_{h}(\pi,g;\pi^{\prime})
&=\mathbb{E}\left[\theta_{h}^{\top}\psi_{h}(\bar{z}_{h})-r_{h}-\theta_{h+1}^{\top}\psi_{h+1}(\bar{z}_{h+1});\ a_{1:h-1}\sim\pi^{\prime},\ a_{h}\sim\pi\right]\\
&=\mathbb{E}\left[\theta_{h}^{\top}K_{h}\phi_{h}(z_{h-1},s_{h})-r_{h}-\theta_{h+1}^{\top}K_{h+1}\phi_{h+1}(z_{h},s_{h+1});\ a_{1:h-1}\sim\pi^{\prime},\ a_{h}\sim\pi\right]\\
&=\mathbb{E}\left[(\theta_{h}-\theta^{\star}_{h})^{\top}K_{h}\phi_{h}(z_{h-1},s_{h})-(\theta_{h+1}-\theta^{\star}_{h+1})^{\top}K_{h+1}T_{\pi;h}\phi_{h}(z_{h-1},s_{h});\ a_{1:h-1}\sim\pi^{\prime}\right]\\
&=\left\langle\mathbb{E}[\phi_{h}(z_{h-1},s_{h});\ a_{1:h-1}\sim\pi^{\prime}],\ K_{h}^{\top}(\theta_{h}-\theta^{\star}_{h})-T_{\pi;h}^{\top}K_{h+1}^{\top}(\theta_{h+1}-\theta^{\star}_{h+1})\right\rangle,
\end{align*}

which verifies the bilinear structure with $X_{h}(\pi^{\prime})=\mathbb{E}[\phi_{h}(z_{h-1},s_{h});\ a_{1:h-1}\sim\pi^{\prime}]$ and $W_{h}(\pi,g)=K_{h}^{\top}(\theta_{h}-\theta^{\star}_{h})-T_{\pi;h}^{\top}K_{h+1}^{\top}(\theta_{h+1}-\theta^{\star}_{h+1})$, and shows that the bilinear rank is at most $\max_{h}d_{\psi_{h}}$. ∎

Infinite dimensional HSE-POMDPs

Consider the case where $\phi_{h}$ and $\psi_{h}$ are features in an infinite-dimensional RKHS. By assuming that the spectrum of the operator $K_{h}$ decays at a certain rate, we can still ensure the existence of value bridge functions even when $d_{\phi_{h}}$ and $d_{\psi_{h}}$ are infinite.

Next, we consider the PO-bilinear rank. We can still use the decomposition in the proof above. While the PO-bilinear rank itself is infinite under the current definition, when we derive the PAC result later, the dependence on the PO-bilinear rank enters through the information gain based on $X_{h}(\pi)$, which captures the intrinsic dimension of $X_{h}(\pi)$. Thus, we can easily obtain the sample complexity result by replacing $d_{\psi_{h}}$ with the information gain over $\psi_{h}(\cdot)$ [63]. More generally, to accommodate infinite-dimensional models, the PO-bilinear rank in Definition 3 can be generalized using the critical information gain [17].

Appendix C Supplement for Section 5 (Algorithm for LQG with Continuous Action)

In this section, we present a modification of the algorithm that handles LQG, whose action space is continuous, within the framework of Definition 3.

Our algorithm so far samples $a_{h}$ from $\mathcal{U}(\mathcal{A})$ and performs importance weighting in designing the loss $\sigma^{t}_{h}$, which incurs a polynomial dependence on $|\mathcal{A}|$ as we will see in the next section. However, among the examples that we consider in Section 4, LQG has continuous actions. If we naïvely sample $a_{h}$ from a ball in $\mathbb{R}^{d_{a}}$ and perform (nonparametric) importance weighting, we will pay $\exp(d_{a})$ in our sample complexity bound, which is not ideal for high-dimensional control problems. To avoid the exponential dependence on $d_{a}$, here we replace $\mathcal{U}(\mathcal{A})$ with a d-optimal design over the action's quadratic feature space.

Here, we want to evaluate the Bellman error of a $(\pi,g)$ pair under a roll-in policy $\pi^{\prime}$:

\mathrm{Br}_{h}(\pi,g;\pi^{\prime}):=\mathbb{E}[u_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta);\ a_{1:h-1}\sim\pi^{\prime},\ a_{h}\sim\pi(\bar{z}_{h})],

where $u_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta)=\theta^{\top}_{h}\psi_{h}(\bar{z}_{h})-r_{h}(s_{h},a_{h})-\theta^{\top}_{h+1}\psi_{h+1}(\bar{z}_{h+1})$ for any linear deterministic policy $\pi\in\Pi$ (here $g_{h}(\cdot):=\theta_{h}^{\top}\psi_{h}(\cdot)$), using a single data-collection policy. In other words, we would like to design a loss $l_{h}$ such that

\mathrm{Br}_{h}(\pi,g;\pi^{\prime})=\mathbb{E}[l_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta,\pi);\ a_{1:h-1}\sim\pi^{\prime},\ a_{h}\sim\pi^{e}]

for some policy $\pi^{e}$, without incurring an exponential dependence on $d_{a}$. We explain how to design such a loss function $l_{h}(\cdot;\pi,g)$ step by step.

First Step

The first step is to compute the conditional expectation given $(\bar{z}_{h},s_{h},a_{h})$. Using the quadratic form of $\psi$, we can show that there exist $c_{0}:\bar{\mathcal{Z}}_{h}\times\mathcal{S}\to\mathbb{R}$, $c_{1}:\bar{\mathcal{Z}}_{h}\times\mathcal{S}\to\mathbb{R}^{d_{a}+d_{a}^{2}}$, and a vector $c_{2}(\theta)\in\mathbb{R}^{1+(d_{a}+d_{s}+d_{\bar{z}_{h}})^{2}}$ such that:

\begin{align*}
\mathbb{E}[u_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta)\mid\bar{z}_{h},s_{h},a_{h}]
&=\langle c_{2}(\theta),[1,[\bar{z}^{\top}_{h},s^{\top}_{h},a^{\top}_{h}]\otimes[\bar{z}^{\top}_{h},s^{\top}_{h},a^{\top}_{h}]]^{\top}\rangle\\
&=c_{0}(\bar{z}_{h},s_{h};\theta)+c^{\top}_{1}(\bar{z}_{h},s_{h};\theta)\kappa(a_{h}),
\end{align*}

where $\kappa(a)=[a^{\top},(a\otimes a)^{\top}]^{\top}$. Then, the Bellman error we want to evaluate can be written in the form

\mathbb{E}[u_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta);\ a_{1:h-1}\sim\pi^{\prime},\ a_{h}\sim\pi(\bar{z}_{h})]
=\mathbb{E}[c_{0}(\bar{z}_{h},s_{h};\theta)+c^{\top}_{1}(\bar{z}_{h},s_{h};\theta)\kappa(\pi(\bar{z}_{h}));\ a_{1:h-1}\sim\pi^{\prime}].
Second Step

The second step is to compute a d-optimal design for the set $\{\kappa(a):a\in\mathbb{R}^{d_{a}},\|a\|_{2}\leq Z\}$ for a sufficiently large $Z\in\mathbb{R}$, and denote by $a^{1},\dots,a^{d^{\diamond}}$ the support points of the d-optimal design. Note that in LQG, although we cannot ensure that the action lies in a compact set almost surely, it does with high probability, and this suffices in our setting as we will see. Since the dimension of $\kappa(a)$ is $d_{a}+d^{2}_{a}$, we can ensure $d^{\diamond}\leq(d_{a}+d_{a}^{2})(d_{a}+d_{a}^{2}+1)/2$ [47, 41]. Here is the concrete theorem we invoke.

Theorem 12 (Property of G-optimal design).

Suppose $\mathcal{X}\subset\mathbb{R}^{d}$ is a compact set whose span is $\mathbb{R}^{d}$. There exists a distribution $\rho$ over $\mathcal{X}$ such that:

  • $\rho$ is supported on at most $d(d+1)/2$ points;

  • for any $x^{\prime}\in\mathcal{X}$, we have $x^{\prime\top}\mathbb{E}_{x\sim\rho}[xx^{\top}]^{-1}x^{\prime}\leq d$.

We have the following handy lemma, stating that any $\kappa(a)$ lies in the span of $\{\kappa(a^{i})\}_{i=1}^{d^{\diamond}}$.

Lemma 13.

Let $K=[\rho^{1/2}(a^{1})\kappa(a^{1}),\rho^{1/2}(a^{2})\kappa(a^{2}),\cdots,\rho^{1/2}(a^{d^{\diamond}})\kappa(a^{d^{\diamond}})]$ and $\alpha(a)=K^{\top}(KK^{\top})^{-1}\kappa(a)$. Then,

\kappa(a)=K\alpha(a),\qquad\|\alpha(a)\|\leq(d_{a}+d_{a}^{2})^{1/2},\qquad\alpha_{i}(a)/\rho^{1/2}(a^{i})\leq(d_{a}+d_{a}^{2}).
Proof.

Since $K$ is full row rank by the construction of the G-optimal design, $KK^{\top}$ is invertible. Then, we have

\sum_{i=1}^{d^{\diamond}}\alpha_{i}(a)\rho^{1/2}(a^{i})\kappa(a^{i})=KK^{\top}(KK^{\top})^{-1}\kappa(a)=\kappa(a).

For the second statement, we have

\langle K^{\top}(KK^{\top})^{-1}\kappa(a),K^{\top}(KK^{\top})^{-1}\kappa(a)\rangle=\kappa(a)^{\top}(KK^{\top})^{-1}\kappa(a)\leq(d_{a}+d_{a}^{2}),

where we use the G-optimal design property from Theorem 12.

For the last statement, we have

\alpha_{i}(a)/\rho^{1/2}(a^{i})=\kappa(a^{i})^{\top}(KK^{\top})^{-1}\kappa(a)\leq\|\kappa(a^{i})\|_{(KK^{\top})^{-1}}\|\kappa(a)\|_{(KK^{\top})^{-1}}\leq(d_{a}+d_{a}^{2})

by the Cauchy–Schwarz inequality. ∎
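For intuition, here is a minimal sketch (entirely our own illustration: a one-dimensional action, so $\kappa(a)=[a,a^{2}]^{\top}$ and $d=d_a+d_a^2=2$, with a made-up grid standing in for the compact set) that computes an approximate design with the classical Frank–Wolfe/Fedorov vertex-direction update and checks the G-optimality bound of Theorem 12; by the Kiefer–Wolfowitz equivalence [41], the D-optimal objective used here yields the G-optimal guarantee.

```python
import numpy as np

# One-dimensional action (d_a = 1), so kappa(a) = [a, a^2] and d = d_a + d_a^2 = 2.
Z, d = 2.0, 2
actions = np.linspace(-Z, Z, 201)            # made-up grid for the compact set
X = np.stack([actions, actions**2], axis=1)  # rows are kappa(a)

# Frank-Wolfe / Fedorov vertex-direction updates for a D-optimal design.
rho = np.full(len(actions), 1.0 / len(actions))
for _ in range(500):
    V = X.T @ (rho[:, None] * X)             # E_{a~rho}[kappa(a) kappa(a)^T]
    lev = np.einsum("ij,jk,ik->i", X, np.linalg.inv(V), X)  # leverage scores
    i = int(np.argmax(lev))                  # currently most informative action
    step = (lev[i] / d - 1.0) / (lev[i] - 1.0)  # exact line-search step size
    rho *= 1.0 - step
    rho[i] += step

V = X.T @ (rho[:, None] * X)
lev = np.einsum("ij,jk,ik->i", X, np.linalg.inv(V), X)
print("max leverage:", round(lev.max(), 4), "-> close to d =", d)
print("weights > 1e-3:", int((rho > 1e-3).sum()),
      "(Theorem 12: a design on at most d(d+1)/2 = 3 points exists)")
```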

Third Step

The third step combines the facts established so far. Recall that we want to evaluate

\mathbb{E}[u_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta);\ a_{1:h-1}\sim\pi^{\prime},\ a_{h}\sim\pi(\bar{z}_{h})]=\mathbb{E}[c_{0}(\bar{z}_{h},s_{h};\theta)+c^{\top}_{1}(\bar{z}_{h},s_{h};\theta)\kappa(\pi(\bar{z}_{h}));\ a_{1:h-1}\sim\pi^{\prime}].

In addition, the following also holds:

\begin{align*}
\mathbb{E}[u_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta);\ a_{1:h-1}\sim\pi^{\prime},\ a_{h}\sim do(a^{i})]&=\mathbb{E}[c_{0}(\bar{z}_{h},s_{h};\theta)+c^{\top}_{1}(\bar{z}_{h},s_{h};\theta)\kappa(a^{i});\ a_{1:h-1}\sim\pi^{\prime}],\\
\mathbb{E}[u_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta);\ a_{1:h-1}\sim\pi^{\prime},\ a_{h}\sim do(0)]&=\mathbb{E}[c_{0}(\bar{z}_{h},s_{h};\theta);\ a_{1:h-1}\sim\pi^{\prime}],
\end{align*}

where we use $\kappa(0)=0$. It follows that

\begin{align*}
&\mathbb{E}[u_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta);\ a_{1:h-1}\sim\pi^{\prime},\ a_{h}\sim\pi(\bar{z}_{h})]\\
&=\mathbb{E}[c_{0}(\bar{z}_{h},s_{h};\theta)+c^{\top}_{1}(\bar{z}_{h},s_{h};\theta)\kappa(\pi(\bar{z}_{h}));\ a_{1:h-1}\sim\pi^{\prime}]\\
&=\mathbb{E}\Big[c_{0}(\bar{z}_{h},s_{h};\theta)+c^{\top}_{1}(\bar{z}_{h},s_{h};\theta)\Big\{\sum_{i=1}^{d^{\diamond}}\alpha_{i}(\pi(\bar{z}_{h}))\kappa(a^{i})\Big\};\ a_{1:h-1}\sim\pi^{\prime}\Big]\\
&=\mathbb{E}\Big[c_{0}(\bar{z}_{h},s_{h};\theta)\Big(1-\sum_{i=1}^{d^{\diamond}}\alpha_{i}(\pi(\bar{z}_{h}))\Big)+\sum_{i=1}^{d^{\diamond}}\alpha_{i}(\pi(\bar{z}_{h}))\Big(c^{\top}_{1}(\bar{z}_{h},s_{h};\theta)\kappa(a^{i})+c_{0}(\bar{z}_{h},s_{h};\theta)\Big);\ a_{1:h-1}\sim\pi^{\prime}\Big]\\
&=\mathbb{E}\Big[\Big(1-\sum_{i=1}^{d^{\diamond}}\alpha_{i}(\pi(\bar{z}_{h}))\Big)u_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta);\ a_{1:h-1}\sim\pi^{\prime},\ a_{h}\sim do(0)\Big]\\
&\quad+\sum_{i=1}^{d^{\diamond}}\mathbb{E}\Big[\alpha_{i}(\pi(\bar{z}_{h}))u_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta);\ a_{1:h-1}\sim\pi^{\prime},\ a_{h}\sim do(a^{i})\Big].
\end{align*}

Thus, we can evaluate the policy $\pi$ as long as we can intervene with $do(0),do(a^{1}),\cdots,do(a^{d^{\diamond}})$.

Fourth Step

The fourth step replaces $do(0),do(a^{1}),\cdots,do(a^{d^{\diamond}})$ with a single policy that uniformly randomly selects an action from the set $\{0,a^{1},\dots,a^{d^{\diamond}}\}$, which we denote as $a\sim U(1+d^{\diamond})$. Using importance weighting, we define the loss function for $(\pi,\theta)$ as follows:

\mathbb{E}[f_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta,\pi);\ a_{1:h-1}\sim\pi^{\prime},\ a_{h}\sim U(1+d^{\diamond})], \tag{13}

where $U(1+d^{\diamond})$ is the uniform distribution over $\{0,a^{1},\cdots,a^{d^{\diamond}}\}$ and

f_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta,\pi)
=|1+d^{\diamond}|\Big(\mathbb{I}(a_{h}=0)\Big(1-\sum_{i=1}^{d^{\diamond}}\alpha_{i}(\pi(\bar{z}_{h}))\Big)+\sum_{i=1}^{d^{\diamond}}\mathbb{I}(a_{h}=a^{i})\alpha_{i}(\pi(\bar{z}_{h}))\Big)u_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta).

The term (13) is then equal to the Bellman error $\mathrm{Br}_{h}(\pi,g;\pi^{\prime})$ we want to evaluate.
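The algebra behind steps three and four can be checked numerically. The sketch below (our own illustration with a made-up one-dimensional action and a made-up design; $u$ stands in for the conditional mean of $u_h$, and the $\rho^{1/2}$ scaling in $K$ is dropped, which leaves the identity $K\alpha(a)=\kappa(a)$ unchanged) confirms that reweighting the interventions $do(0),do(a^1),\dots$ by $\alpha(\pi(\bar z_h))$ recovers the on-policy quantity.

```python
import numpy as np

# Check that for any u(a) = c0 + c1 . kappa(a) (the conditional mean of u_h),
# reweighting the interventions {0, a^1, ..., a^d} by alpha(pi(zbar)) recovers
# u at the policy's action. One-dimensional action, so kappa(a) = [a, a^2].
rng = np.random.default_rng(1)

def kappa(a):
    return np.array([a, a * a])

design = np.array([-2.0, 0.5, 2.0])               # made-up design supports a^1..a^3
K = np.stack([kappa(a) for a in design], axis=1)  # columns kappa(a^i)

def alpha(a):                                     # barycentric weights of Lemma 13
    return K.T @ np.linalg.solve(K @ K.T, kappa(a))

c0, c1 = 0.7, rng.standard_normal(2)              # arbitrary c_0, c_1 coefficients
def u(a):                                         # stand-in for E[u_h | a_h = a]
    return c0 + c1 @ kappa(a)

a_pi = 1.3                                        # the action pi(zbar_h) would take
w = alpha(a_pi)
reweighted = (1.0 - w.sum()) * u(0.0) + sum(
    w[i] * u(design[i]) for i in range(len(design)))
assert np.isclose(reweighted, u(a_pi))
print("reweighted evaluation matches the on-policy value:", reweighted)
```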

Summary

To summarize, we just need to use the following loss function in line 7 of Algorithm 1:

\mathbb{E}_{\mathcal{D}^{t}_{h}}[l_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta,\pi)],

where $l_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta,\pi)$ is

\mathbb{I}(\|\bar{z}_{h}\|\leq Z_{1})\,\mathbb{I}(\|r_{h}\|\leq Z_{2})\,\mathbb{I}(\|o_{h+1}\|\leq Z_{3})\,f_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta,\pi)

and $\mathcal{D}^{t}_{h}$ is a set of $m$ i.i.d. samples from the distribution induced by executing $a_{1:h-1}\sim\pi^{\prime},a_{h}\sim U(1+d^{\diamond})$. The values $Z_{1},Z_{2},Z_{3}$ in the indicator functions are large constants selected appropriately later. Due to the unbounded Gaussian noise in LQG, these truncation indicators are introduced for technical reasons, so as to obtain valid concentration in Assumption 2.

Appendix D Supplement for Section 6

We first give several additional discussions to explain core tests in detail. Next, we show the existence and form of link functions. Finally, we calculate the PO-bilinear rank. In this section, we focus on the general case where tests can span multiple steps.

D.1 Definition of PSRs

We first define core tests and predictive states [46, 62]. This definition is a generalization of Definition 5 with multi-step futures.

We slightly abuse notation and denote $\tau^{a}_{h}:=(o_{1},a_{1},\dots,o_{h-1},a_{h-1})$ throughout this whole section; note that $\tau^{a}_{h}$ here does not include $o_{h}$.

Definition 8 (Core test sets and PSRs).

A set ${\mathcal{T}}\subset\cup_{C\in\mathbb{N}^{+}}\mathcal{O}^{C}\times\mathcal{A}^{C-1}$ is called a core test set if for any $h\in[H]$, any $W\in\mathbb{N}^{+}$, any possible future (i.e., test) $t_{h}=(o_{h:h+W-1},a_{h:h+W-2})\in\mathcal{O}^{W}\times\mathcal{A}^{W-1}$, and any history $\tau^{a}_{h}$, there exists $m_{t_{h}}\in\mathbb{R}^{|{\mathcal{T}}|}$ such that

\mathbb{P}(o_{h:h+W-1}\mid\tau^{a}_{h};\ do(a_{h:h+W-2}))=\langle m_{t_{h}},[\mathbb{P}(t\mid\tau^{a}_{h})]_{t\in{\mathcal{T}}}\rangle.

The vector $[\mathbb{P}(t\mid\tau^{a}_{h})]_{t\in{\mathcal{T}}}\in\mathbb{R}^{|{\mathcal{T}}|}$ is referred to as the predictive state.

We often denote $\mathbf{q}_{\tau^{a}_{h}}=[\mathbb{P}(t\mid\tau^{a}_{h})]_{t\in{\mathcal{T}}}$. To understand the above definition, we revisit observable undercomplete POMDPs and overcomplete POMDPs.

Example 1 (Observable undercomplete POMDPs).

In undercomplete POMDPs, when $\mathbb{O}$ is full-column rank, $\mathcal{O}$ is a core test set. Recall that $\mathbb{O}$ is the matrix in $\mathbb{R}^{|\mathcal{O}|\times|\mathcal{S}|}$ whose entry indexed by $(o_{i},s_{j})\in\mathcal{O}\times\mathcal{S}$ equals $\mathbb{O}(o_{i}\mid s_{j})$.

Lemma 14 (Core tests in undercomplete POMDPs).

When $\mathbb{O}$ is full-column rank, $\mathcal{O}$ is a core test set.

Proof.

Consider any $h\in[H]$. Given the $|\mathcal{S}|$-dimensional belief state $\mathbf{s}_{\tau^{a}_{h}}$ with entries $\mathbb{P}(s_{h}\mid\tau^{a}_{h})$, for any future $t=(o_{h:h+W},a_{h:h+W-1})$ there exists a $|\mathcal{S}|$-dimensional vector $\mathbf{m}^{\prime}_{t}$ such that $\mathbb{P}(o_{h:h+W}\mid\tau^{a}_{h};\ do(a_{h:h+W-1}))=\langle\mathbf{m}^{\prime}_{t},\mathbf{s}_{\tau^{a}_{h}}\rangle$. More specifically, $\mathbf{m}^{\prime}_{t}$ can be written as

(\mathbf{m}^{\prime}_{t})^{\top}=\mathbb{O}(o_{h+W}\mid\cdot)^{\top}\prod_{\tau=h}^{h+W-1}\mathbb{T}_{a_{\tau}}{\rm diag}(\mathbb{O}(o_{\tau}\mid\cdot)),

where $\mathbb{O}(o\mid\cdot)\in\mathbb{R}^{|\mathcal{S}|}$ is the vector with the entry indexed by $s$ equal to $\mathbb{O}(o\mid s)$, and $\mathbb{T}_{a}\in\mathbb{R}^{|\mathcal{S}|\times|\mathcal{S}|}$ is the matrix with the entry indexed by $(s^{\prime},s)$ equal to $\mathbb{T}(s^{\prime}\mid s,a)$. Here, given a vector $C$, ${\rm diag}(C)$ is defined as the $|C|\times|C|$ diagonal matrix whose diagonal is $C$. Thus, we have

\mathbb{P}(o_{h:h+W}\mid\tau^{a}_{h};\ do(a_{h:h+W-1}))=\langle\mathbf{m}^{\prime}_{t},\mathbf{s}_{\tau^{a}_{h}}\rangle=\langle\mathbf{m}^{\prime}_{t},\mathbb{O}^{\dagger}\mathbf{q}_{\tau^{a}_{h}}\rangle=\langle(\mathbb{O}^{\dagger})^{\top}\mathbf{m}^{\prime}_{t},\mathbf{q}_{\tau^{a}_{h}}\rangle,

where $\mathbf{q}_{\tau^{a}_{h}}\in\mathbb{R}^{|\mathcal{O}|}$ and $\mathbf{q}_{\tau^{a}_{h}}(o)=\mathbb{P}(o\mid\tau^{a}_{h})$. This concludes the proof. ∎

Example 2 (Overcomplete POMDPs).

We now consider overcomplete POMDPs, where we permit $|\mathcal{S}|\geq|\mathcal{O}|$.

Lemma 15 (Core tests in overcomplete POMDPs).

Recall ${\mathcal{T}}^{K}=\mathcal{O}\times(\mathcal{O}\times\mathcal{A})^{K-1}$. Define the $|{\mathcal{T}}^{K}|\times|\mathcal{S}|$-dimensional matrix $\mathbb{O}^{K}$ whose entry indexed by $((o_{h:h+K-1},a_{h:h+K-2}),s_{h})\in{\mathcal{T}}^{K}\times\mathcal{S}$ equals $\mathbb{P}(o_{h:h+K-1},a_{h:h+K-2}\mid s_{h};\ a_{h:h+K-2}\sim\mathcal{U}(\mathcal{A}))$. When this matrix is full-column rank for all $h$, ${\mathcal{T}}^{K}$ is a core test set.

Proof.

Fix a test $t=(o^{\prime}_{h:h+K-1},a^{\prime}_{h:h+K-2})$ and consider a step $h\in[H]$. Then,

\begin{align*}
&\mathbb{P}(o^{\prime}_{h:h+K-1},a^{\prime}_{h:h+K-2}\mid s_{h};\ a_{h:h+K-2}\sim\mathcal{U}(\mathcal{A}))\\
&=\mathbb{E}[{\textbf{1}}(o_{h:h+K-1}=o^{\prime}_{h:h+K-1},a_{h:h+K-2}=a^{\prime}_{h:h+K-2})\mid s_{h};\ a_{h:h+K-2}\sim\mathcal{U}(\mathcal{A})]\\
&=\mathbb{E}[(1/|\mathcal{A}|^{K-1})\,{\textbf{1}}(o_{h:h+K-1}=o^{\prime}_{h:h+K-1},a_{h:h+K-2}=a^{\prime}_{h:h+K-2})\mid s_{h};\ a_{h:h+K-2}\sim do(a^{\prime}_{h:h+K-2})]\\
&=\mathbb{E}[(1/|\mathcal{A}|^{K-1})\,{\textbf{1}}(o_{h:h+K-1}=o^{\prime}_{h:h+K-1})\mid s_{h};\ a_{h:h+K-2}\sim do(a^{\prime}_{h:h+K-2})]\\
&=(1/|\mathcal{A}|^{K-1})\,\mathbb{P}(o^{\prime}_{h:h+K-1}\mid s_{h};\ do(a^{\prime}_{h:h+K-2})).
\end{align*}

Thus, the assumption that $\mathbb{O}^{K}$ is full-column rank implies that the matrix $\mathbb{\bar{J}}_{h}\in\mathbb{R}^{|{\mathcal{T}}^{K}|\times|\mathcal{S}|}$, with the entry indexed by $(t,s_{h})$ equal to $\mathbb{P}(o^{\prime}_{h:h+K-1}\mid s_{h};\ do(a^{\prime}_{h:h+K-2}))$, is full-column rank.

Define the $|{\mathcal{T}}^{K}|$-dimensional predictive state $\mathbf{q}_{\tau^{a}_{h}}=[\mathbb{P}(t\mid\tau^{a}_{h})]_{t\in{\mathcal{T}}^{K}}$ given history $\tau^{a}_{h}$. By definition, we have

\mathbf{q}_{\tau^{a}_{h}}=\mathbb{\bar{J}}_{h}\mathbf{s}_{\tau^{a}_{h}}.

Since $\mathbb{\bar{J}}_{h}$ is full-column rank, we have $\mathbf{s}_{\tau^{a}_{h}}=\mathbb{\bar{J}}^{\dagger}_{h}\mathbf{q}_{\tau^{a}_{h}}$. Thus, using the form of $\mathbf{m}^{\prime}_{t}$ from the proof of Lemma 14, we conclude that for any test $t=(o_{h:h+W},a_{h:h+W-1})$, we have $\mathbb{P}(o_{h:h+W}\mid\tau;\ do(a_{h:h+W-1}))=\langle(\mathbb{\bar{J}}^{\dagger}_{h})^{\top}\mathbf{m}^{\prime}_{t},\mathbf{q}_{\tau^{a}_{h}}\rangle$. This concludes that ${\mathcal{T}}^{K}$ is a core test set. ∎

Finally, we present an important property of predictive states, which corresponds to the Bayesian filter in POMDPs.

Lemma 16 (Forward dynamics of predictive states).

We have

\mathbb{P}(t\mid\tau^{a}_{h},a,o)=\mathbf{m}^{\top}_{o,a,t}\mathbf{q}_{\tau^{a}_{h}}/\mathbf{m}^{\top}_{o}\mathbf{q}_{\tau^{a}_{h}}.

When we define $M_{o,a}\in\mathbb{R}^{|{\mathcal{T}}|\times|{\mathcal{T}}|}$ whose rows are $\mathbf{m}^{\top}_{o,a,t}$ for $t\in{\mathcal{T}}$, we can express the forward update rule of predictive states as follows:

\mathbf{q}_{\tau^{a}_{h},a,o}=M_{o,a}\mathbf{q}_{\tau^{a}_{h}}/(\mathbf{m}^{\top}_{o}\mathbf{q}_{\tau^{a}_{h}}).
Proof.

The proof is an application of Bayes' rule. We denote the observation part of $t$ by $t^{\mathcal{O}}$ and the action part by $t^{\mathcal{A}}$. We have

\begin{align*}
\mathbb{P}(t\mid\tau^{a}_{h},a,o)&=\mathbb{P}(t^{\mathcal{O}}\mid\tau^{a}_{h},o;\ do(a,t^{\mathcal{A}})) &&\text{(by definition)}\\
&=\frac{\mathbb{P}(o,t^{\mathcal{O}}\mid\tau^{a}_{h};\ do(a,t^{\mathcal{A}}))}{\mathbb{P}(o\mid\tau^{a}_{h})} &&\text{(Bayes' rule)}\\
&=\mathbf{m}^{\top}_{o,a,t}\mathbf{q}_{\tau^{a}_{h}}/\mathbf{m}^{\top}_{o}\mathbf{q}_{\tau^{a}_{h}}. &&\text{(by definition)}
\end{align*}

This concludes the proof. ∎
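The update rule in Lemma 16 can be cross-checked against the ordinary belief-state filter. The sketch below (a made-up two-state, two-observation POMDP with a single action, so all numbers are ours) instantiates $M_{o,a}$ and $\mathbf{m}_o$ via the POMDP expressions recalled in the next paragraph and verifies that the PSR forward update reproduces the Bayes filter.

```python
import numpy as np

# A made-up two-state, two-observation POMDP with a single action.
rng = np.random.default_rng(2)
n_s = 2
Obs = np.array([[0.9, 0.2],
                [0.1, 0.8]])                 # emission O(o|s), full column rank
T = rng.random((n_s, n_s)); T /= T.sum(0)    # transition, columns sum to one
s = np.array([0.5, 0.5])                     # current belief state

O_pinv = np.linalg.pinv(Obs)
def M(o):                                    # M_{o,a} = O T_a diag(O(o|.)) O^dagger
    return Obs @ T @ np.diag(Obs[o]) @ O_pinv
def m_o(o):                                  # m_o^T = 1^T diag(O(o|.)) O^dagger
    return np.ones(n_s) @ np.diag(Obs[o]) @ O_pinv

q = Obs @ s                                  # predictive state q_tau = O s_tau
o = 1                                        # an observed symbol
q_next = M(o) @ q / (m_o(o) @ q)             # PSR forward update (Lemma 16)

# The same step through the belief-state (Bayes) filter must agree.
s_post = np.diag(Obs[o]) @ s / (Obs[o] @ s)  # condition the belief on o
s_next = T @ s_post                          # then apply the transition
assert np.allclose(q_next, Obs @ s_next)
print("PSR update matches the Bayes filter:", q_next)
```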

To further understand why PSRs generalize POMDPs, let us revisit undercomplete POMDPs (i.e., $\mathbb{O}$ being full column rank) again. Set ${\mathcal{T}}=\mathcal{O}$. As we saw in the proof of Lemma 14, the belief state $\mathbf{s}_{\tau}\in\Delta(\mathcal{S})$ together with $\mathbb{O}$ defines the predictive state, i.e., $\mathbf{q}_{\tau^{a}_{h}}=\mathbb{O}\mathbf{s}_{\tau^{a}_{h}}$, with $M_{o,a}=\mathbb{O}\mathbb{T}_{a}{\rm diag}(\mathbb{O}(o\mid\cdot))\mathbb{O}^{\dagger}$ and $\mathbf{m}^{\top}_{o}={\textbf{1}}^{\top}{\rm diag}(\mathbb{O}(o\mid\cdot))\mathbb{O}^{\dagger}$. Note that in POMDPs, the underlying operators $\mathbb{T}_{a}{\rm diag}(\mathbb{O}(o\mid\cdot))$ acting on belief states contain only non-negative entries. On the other hand, in PSRs, $M_{o,a}$ and $\mathbf{m}_{o}$ could contain negative entries. This is the intuitive reason why PSRs are more expressive than POMDPs [46]. For a formal instance of a finite-dimensional PSR that cannot be expressed as a finite-dimensional POMDP, refer to [62, 31].

D.2 Existence of link functions

We discuss the existence and the form of link functions. First, we define general value link functions with multi-step futures. For notational simplicity, we assume here that the tests $t\in{\mathcal{T}}$ have the same length, i.e., there is a $K\in\mathbb{N}^{+}$ such that ${\mathcal{T}}\subset\mathcal{O}^{K}\times\mathcal{A}^{K-1}$.

Definition 9 (General value link functions in dynamical systems).

Recall ${\mathcal{T}}\subset\mathcal{O}^{K}\times\mathcal{A}^{K-1}$ is the set of tests. At time step $h$, general value link functions $g^{\pi}_{h}:\mathcal{Z}_{h-1}\times\mathcal{O}^{K}\times\mathcal{A}^{K-1}\to\mathbb{R}$ are defined as solutions to the following:

\mathcal{V}^{\pi}_{h}(\tau^{a}_{h})=\mathbb{E}[g^{\pi}_{h}(z_{h-1},o_{h:h+K-1},a_{h:h+K-2})\mid\tau^{a}_{h};\ (a_{h:h+K-2})\sim\rho^{out}], \tag{14}

where $\rho^{out}$ is some distribution over the set of action sequences ${\mathcal{T}}^{\mathcal{A}}$ induced by the test set, i.e., $\{t^{\mathcal{A}}:t\in{\mathcal{T}}\}$. Here, for $t=(o_{h:h+K-1},a_{h:h+K-2})$, we often denote $o_{h:h+K-1}$ and $a_{h:h+K-2}$ by $t^{\mathcal{O}}$ and $t^{\mathcal{A}}$, respectively.

To show the existence of general value link functions for PSRs, we first study the form of value functions in PSRs. The following lemma states that value functions of $M$-memory policies have a bilinear form.

Lemma 17 (Bilinear form of value functions for MM-memory policies).

Let $\phi(\cdot)\in\mathbb{R}^{|\mathcal{Z}_{h-1}|}$ be the one-hot encoding vector over $\mathcal{Z}_{h-1}$. Suppose ${\mathcal{T}}$ is a core test set. Then, for any $M$-memory policy $\pi$, there exists $\mathbb{J}^{\pi}_{h}\in\mathbb{R}^{|\mathcal{Z}_{h-1}|\times|{\mathcal{T}}|}$ such that

\mathcal{V}^{\pi}_{h}(\tau^{a}_{h})=\phi^{\top}(z_{h-1})\mathbb{J}^{\pi}_{h}\mathbf{q}_{\tau^{a}_{h}}.
Proof.

From Lemma 16, there exists a matrix $M_{o,a}\in\mathbb{R}^{|{\mathcal{T}}|\times|{\mathcal{T}}|}$ such that, via Bayes' rule,

\mathbf{q}_{\tau^{a}_{h},a,o}=M_{o,a}\mathbf{q}_{\tau^{a}_{h}}/\mathbb{P}(o\mid\tau^{a}_{h}). \tag{15}

We prove the claim by induction. The base case clearly holds, so we assume

\mathcal{V}^{\pi}_{h+1}(\tau^{a}_{h+1})=\phi^{\top}(z_{h})\mathbb{J}^{\pi}_{h+1}\mathbf{q}_{\tau^{a}_{h+1}}.

We have

\begin{align*}
\mathcal{V}^{\pi}_{h}(\tau^{a}_{h})&=\mathbb{E}[r_{h}+\mathcal{V}^{\pi}_{h+1}(\tau^{a}_{h},o_{h},a_{h})\mid\tau^{a}_{h};\ a_{h}\sim\pi(\bar{z}_{h})]\\
&=\underbrace{\sum_{o_{h},a_{h}}\mathbb{P}(o_{h}\mid\tau^{a}_{h})\,\pi_{h}(a_{h}\mid o_{h},z_{h-1})\,r(o_{h},a_{h})}_{(a)}
+\underbrace{\sum_{o_{h},a_{h}}\mathbb{P}(o_{h}\mid\tau^{a}_{h})\,\pi_{h}(a_{h}\mid o_{h},z_{h-1})\,\{\phi^{\top}(z_{h})\mathbb{J}^{\pi}_{h+1}\mathbf{q}_{\tau^{a}_{h},o_{h},a_{h}}\}}_{(b)}.
\end{align*}

Note that we use the assumption that, conditional on $(\tau^{a}_{h},o_{h},a_{h})$, the reward is a function of $(o_{h},a_{h})$.

We first examine term (a), which contains the reward. Using the fact that $\mathbb{P}(o\mid\tau^{a}_{h})=\mathbf{m}_{o}^{\top}\mathbf{q}_{\tau^{a}_{h}}$, it is equal to

\sum_{o_{h},a_{h}}\langle\mathbf{m}_{o_{h}},\mathbf{q}_{\tau^{a}_{h}}\rangle\pi_{h}(a_{h}\mid o_{h},z_{h-1})r(o_{h},a_{h})=\Big\langle\sum_{o_{h},a_{h}}\mathbf{m}_{o_{h}}\pi_{h}(a_{h}\mid o_{h},z_{h-1})r(o_{h},a_{h}),\ \mathbf{q}_{\tau^{a}_{h}}\Big\rangle.

Thus, it has a bilinear form, i.e., there exists some matrix $\mathbb{J}^{\pi}_{1}$ such that

\Big\langle\sum_{o_{h},a_{h}}\mathbf{m}_{o_{h}}\pi_{h}(a_{h}\mid o_{h},z_{h-1})r(o_{h},a_{h}),\ \mathbf{q}_{\tau^{a}_{h}}\Big\rangle=\phi^{\top}(z_{h-1})\mathbb{J}^{\pi}_{1}\mathbf{q}_{\tau^{a}_{h}},

where $\mathbb{J}^{\pi}_{1}$ is the matrix whose row indexed by $z_{h-1}$ equals $\sum_{o,a}\mathbf{m}_{o}^{\top}\pi_{h}(a\mid o,z_{h-1})r(o,a)$.

Next, we examine term (b). Using (15), it is equal to

\sum_{o_{h},a_{h}}\pi_{h}(a_{h}\mid o_{h},z_{h-1})\,\phi^{\top}(z_{h-1}\oplus(o_{h},a_{h}))\,\mathbb{J}^{\pi}_{h+1}M_{o_{h},a_{h}}\mathbf{q}_{\tau^{a}_{h}},

where we use the notation $z_{h-1}\oplus(o,a)$ to represent appending the pair $(o,a)$ to the memory while maintaining the proper memory length by truncating away the oldest observation-action pair. Thus, term (b) again has a bilinear form $\phi^{\top}(z_{h-1})\mathbb{J}^{\pi}_{2}\mathbf{q}_{\tau^{a}_{h}}$, where the row of $\mathbb{J}^{\pi}_{2}$ indexed by $z_{h-1}$ equals $\sum_{o,a}\pi_{h}(a\mid o,z_{h-1})\phi^{\top}(z_{h-1}\oplus(o,a))\mathbb{J}^{\pi}_{h+1}M_{o,a}$. This concludes the proof. ∎

Next, we give sufficient conditions ensuring the existence of general $K$-step link functions. Given ${\mathcal{T}}$, we define the corresponding set of action sequences ${\mathcal{T}}^{\mathcal{A}}:=\{t^{\mathcal{A}}:t\in{\mathcal{T}}\}$. We set $\rho^{out}$ in (14) to be the uniform distribution over the set ${\mathcal{T}}^{\mathcal{A}}$, denoted by $\mathcal{U}({\mathcal{T}}^{\mathcal{A}})$. Namely, $\mathcal{U}({\mathcal{T}}^{\mathcal{A}})$ uniformly randomly selects a sequence of test actions from ${\mathcal{T}}^{\mathcal{A}}$.

Lemma 18 (Existence of link functions in PSRs).

Suppose ${\mathcal{T}}$ is a core test set. Then there exists $g^{\pi}_{h}:\mathcal{Z}_{h-1}\times{\mathcal{T}}\to\mathbb{R}$ such that

\mathbb{E}[g^{\pi}_{h}(z_{h-1},o_{h:h+K-1},a_{h:h+K-2})\mid\tau^{a}_{h};\ a_{h:h+K-2}\sim\mathcal{U}({\mathcal{T}}^{\mathcal{A}})]=\mathcal{V}^{\pi}_{h}(\tau^{a}_{h}).
Proof.

We mainly need to design an unbiased estimator of the predictive state $\mathbf{q}_{\tau^{a}_{h}}$, which we do via importance weighting. Given $a_{h:h+K-2}\sim\mathcal{U}({\mathcal{T}}^{\mathcal{A}})$ and the resulting random observations $o_{h:h+K-1}$, we define the estimator $\hat{\mathbf{q}}_{\tau^{a}_{h}}(o_{h:h+K-1},a_{h:h+K-2})\in\mathbb{R}^{|{\mathcal{T}}|}$ such that its entry indexed by a test $t\in{\mathcal{T}}$ is equal to:

\hat{\mathbf{q}}_{\tau^{a}_{h}}(o_{h:h+K-1},a_{h:h+K-2})[t]=\frac{{\textbf{1}}(t^{\mathcal{O}}=o_{h:h+K-1},\,t^{\mathcal{A}}=a_{h:h+K-2})}{1/|{\mathcal{T}}^{\mathcal{A}}|}.

We can verify that

\begin{align*}
&\mathbb{E}[\hat{\mathbf{q}}_{\tau^{a}_{h}}(o_{h:h+K-1},a_{h:h+K-2})[t]\mid\tau^{a}_{h};\ a_{h:h+K-2}\sim\mathcal{U}({\mathcal{T}}^{\mathcal{A}})]\\
&=|{\mathcal{T}}^{\mathcal{A}}|\,\mathbb{E}[{\textbf{1}}(t^{\mathcal{O}}=o_{h:h+K-1},t^{\mathcal{A}}=a_{h:h+K-2})\mid\tau^{a}_{h};\ a_{h:h+K-2}\sim\mathcal{U}({\mathcal{T}}^{\mathcal{A}})]\\
&=\mathbb{E}[{\textbf{1}}(t^{\mathcal{O}}=o_{h:h+K-1})\mid\tau^{a}_{h};\ a_{h:h+K-2}\sim do(t^{\mathcal{A}})]=\mathbf{q}_{\tau^{a}_{h}}[t].
\end{align*}

Then,

\mathbb{E}[\hat{\mathbf{q}}_{\tau^{a}_{h}}(o_{h:h+K-1},a_{h:h+K-2})\mid\tau^{a}_{h};\ a_{h:h+K-2}\sim\mathcal{U}({\mathcal{T}}^{\mathcal{A}})]=\mathbf{q}_{\tau^{a}_{h}}.

With this estimator, we can now define the link function using the bilinear form of $\mathcal{V}^{\pi}_{h}(\tau^{a}_{h})$, i.e.,

g^{\pi}_{h}(z_{h-1},o_{h:h+K-1},a_{h:h+K-2})=\phi(z_{h-1})^{\top}\mathbb{J}^{\pi}_{h}\hat{\mathbf{q}}_{\tau^{a}_{h}}(o_{h:h+K-1},a_{h:h+K-2}).

Using the fact that $\hat{\mathbf{q}}_{\tau^{a}_{h}}(o_{h:h+K-1},a_{h:h+K-2})$ is an unbiased estimate of $\mathbf{q}_{\tau^{a}_{h}}$, we conclude the proof. ∎
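The importance-weighting argument admits a quick Monte-Carlo check. In the sketch below (our own illustration: a made-up two-step system with $K=2$, so each test is $(o_h,o_{h+1},a_h)$, and all probabilities are invented), drawing the test actions uniformly from ${\mathcal{T}}^{\mathcal{A}}$ and reweighting by $|{\mathcal{T}}^{\mathcal{A}}|$ gives an unbiased estimate of the predictive state.

```python
import numpy as np

rng = np.random.default_rng(3)

# Tests t = (o_h, o_{h+1}, a_h): two observations, one intermediate action (K = 2).
P_o1 = np.array([0.6, 0.4])                  # made-up P(o_h | tau)
P_o2 = np.array([[[0.8, 0.2], [0.5, 0.5]],   # made-up P(o_{h+1} | a_h, o_h),
                 [[0.3, 0.7], [0.9, 0.1]]])  # indexed as [a][o1][o2]
tests = [(o1, o2, a) for o1 in (0, 1) for o2 in (0, 1) for a in (0, 1)]
q_true = np.array([P_o1[o1] * P_o2[a, o1, o2] for (o1, o2, a) in tests])

n = 100_000
q_hat = np.zeros(len(tests))
for _ in range(n):
    a = rng.integers(2)                      # a_h ~ U(T^A) with |T^A| = 2
    o1 = rng.choice(2, p=P_o1)               # environment emits o_h
    o2 = rng.choice(2, p=P_o2[a, o1])        # then o_{h+1} after action a_h
    for j, (t1, t2, ta) in enumerate(tests):
        # hat{q}_tau[t] = 1{t^O = (o1, o2), t^A = a} / (1/|T^A|)
        q_hat[j] += 2.0 * float(t1 == o1 and t2 == o2 and ta == a)
q_hat /= n
print("max |q_hat - q_true| =", np.abs(q_hat - q_true).max())  # small MC error
```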

Since PSR models capture POMDP models, the above result directly implies the existence of link functions in observable POMDPs as well, using the facts obtained in Examples 1 and 2.

D.3 PO-Bilinear Rank Decomposition

Finally, we calculate the PO-bilinear rank. Here, the function class is

g_{h}\in\{\mathcal{Z}_{h-1}\times(\mathcal{O}^{K}\times\mathcal{A}^{K-1})\ni(z_{h-1},t)\mapsto\phi(z_{h-1})^{\top}\mathbb{J}_{h}\hat{\mathbf{q}}_{\tau^{a}_{h}}(t)\in\mathbb{R}\ :\ \mathbb{J}_{h}\in\mathbb{R}^{|\mathcal{Z}_{h-1}|\times|{\mathcal{T}}|}\}.

The Bellman error for $(\pi,g)$ under a roll-in $\pi^{\prime}$, denoted by $\mathrm{Br}_{h}(\pi,g;\pi^{\prime})$, is defined as

\begin{align*}
&-\mathbb{E}\big[\mathbb{E}[g_{h+1}(z_{h},t^{\mathcal{A}}_{h+1},t^{\mathcal{O}}_{h+1})\mid\tau^{a}_{h+1};\ t^{\mathcal{A}}_{h+1}\sim U({\mathcal{T}}^{\mathcal{A}}_{h+1})]+r_{h};\ a_{1:h-1}\sim\pi^{\prime},\ a_{h}\sim\pi\big]\\
&+\mathbb{E}\big[\mathbb{E}[g_{h}(z_{h-1},t^{\mathcal{A}}_{h},t^{\mathcal{O}}_{h})\mid\tau^{a}_{h};\ t^{\mathcal{A}}_{h}\sim U({\mathcal{T}}^{\mathcal{A}}_{h})];\ a_{1:h-1}\sim\pi^{\prime},\ a_{h}\sim\pi\big].
\end{align*}

In fact, $\mathrm{Br}_{h}(\pi,g^{\pi};\pi^{\prime})=0$ for any general value link function $g^{\pi}$.

Our goal is to design a loss function $l_{h}(\cdot)$ such that we can estimate the above Bellman error $\mathrm{Br}_{h}(\pi,g;\pi^{\prime})$ using data from a single policy. To do so, we design the following randomized action-selection strategy.

Given an action sequence $t^{\mathcal{A}}$ from a test $t$, let us denote by $\bar{t}^{\mathcal{A}}$ the copy of $t^{\mathcal{A}}$ that starts from the second action of $t^{\mathcal{A}}$, i.e., if $t^{\mathcal{A}}=\{a_{1},a_{2},a_{3}\}$, then $\bar{t}^{\mathcal{A}}=\{a_{2},a_{3}\}$. Denote $\bar{\mathcal{T}}^{\mathcal{A}}=\{\bar{t}^{\mathcal{A}}:t\in{\mathcal{T}}\}$. Our randomized action-selection strategy first selects $a_{h}\sim U(\mathcal{A})$ uniformly randomly from $\mathcal{A}$, and then selects a sequence of actions $\bar{\mathbf{a}}$ uniformly randomly from ${\mathcal{T}}^{\mathcal{A}}\cup\bar{\mathcal{T}}^{\mathcal{A}}$. We remark that the length of the output is not fixed (an $\bar{\mathbf{a}}\in{\mathcal{T}}^{\mathcal{A}}$ is longer than an $\bar{\mathbf{a}}\in\bar{\mathcal{T}}^{\mathcal{A}}$).

As a first step, we define two unbiased estimators, one for $\mathbf{q}_{\tau^{a}_{h}}$ and one for $\mathbf{q}_{\tau^{a}_{h+1}}$. Conditioning on history $\tau^{a}_{h}$, given action $a_{h}\sim U(\mathcal{A})$ followed by the action sequence $\bar{\mathbf{a}}_{h+1}\sim U({\mathcal{T}}^{\mathcal{A}}\cup\bar{\mathcal{T}}^{\mathcal{A}})$, denote the corresponding observations as $o_{h},o_{h+1},\dots,o_{h+|\bar{\mathbf{a}}_{h+1}|+1}$. As an unbiased estimator of $\mathbf{q}_{\tau^{a}_{h}}$, we define $\hat{\mathbf{q}}_{\tau^{a}_{h}}$ with the entry indexed by test $t^{\prime}\in{\mathcal{T}}$ as follows:

\hat{\mathbf{q}}_{\tau^{a}_{h}}(a_{h},\bar{\mathbf{a}}_{h+1},o_{h:h+|\bar{\mathbf{a}}_{h+1}|+1})[t^{\prime}]=\frac{{\textbf{1}}(\bar{\mathbf{a}}_{h+1}\in\bar{\mathcal{T}}^{\mathcal{A}},\ (a_{h},\bar{\mathbf{a}}_{h+1})=t^{\prime\mathcal{A}},\ o_{h:h+|\bar{\mathbf{a}}_{h+1}|+1}=t^{\prime\mathcal{O}})}{1/(2|\mathcal{A}||{\mathcal{T}}^{\mathcal{A}}|)}. \tag{16}

Similarly, as an unbiased estimator of $\mathbf{q}_{\tau^{a}_{h+1}}$, we define $\hat{\mathbf{q}}_{\tau^{a}_{h+1}}$ with the entry indexed by test $t^{\prime}\in{\mathcal{T}}$ as follows:

\hat{\mathbf{q}}_{\tau^{a}_{h+1}}(\bar{\mathbf{a}}_{h+1},o_{h+1:h+|\bar{\mathbf{a}}_{h+1}|+1})[t^{\prime}]=\frac{{\textbf{1}}(\bar{\mathbf{a}}_{h+1}\in{\mathcal{T}}^{\mathcal{A}},\ \bar{\mathbf{a}}_{h+1}=t^{\prime\mathcal{A}},\ o_{h+1:h+|\bar{\mathbf{a}}_{h+1}|+1}=t^{\prime\mathcal{O}})}{1/(2|{\mathcal{T}}^{\mathcal{A}}|)}. \tag{17}

We remark that the lengths of $\bar{\mathbf{a}}_{h+1}$ in (16) and (17) are different.

Then, by importance sampling, we can verify

\begin{align*}
\mathbb{E}[\hat{\mathbf{q}}_{\tau^{a}_{h}}(a_{h},\bar{\mathbf{a}}_{h+1},o_{h:h+|\bar{\mathbf{a}}_{h+1}|+1})\mid\tau^{a}_{h};\ a_{h}\sim U(\mathcal{A}),\ \bar{\mathbf{a}}_{h+1}\sim U({\mathcal{T}}^{\mathcal{A}}\cup\bar{\mathcal{T}}^{\mathcal{A}})]&=\mathbf{q}_{\tau^{a}_{h}},\\
\mathbb{E}[\hat{\mathbf{q}}_{\tau^{a}_{h+1}}(\bar{\mathbf{a}}_{h+1},o_{h+1:h+|\bar{\mathbf{a}}_{h+1}|+1})\mid\tau^{a}_{h+1};\ \bar{\mathbf{a}}_{h+1}\sim U({\mathcal{T}}^{\mathcal{A}}\cup\bar{\mathcal{T}}^{\mathcal{A}})]&=\mathbf{q}_{\tau^{a}_{h+1}}.
\end{align*}

With the above setup, we can construct the loss function $l_{h}$ for estimating the Bellman error. We set the loss as follows:

\begin{align*}
&l_{h}(z_{h-1},a_{h},r_{h},\bar{\mathbf{a}}_{h+1},o_{h:h+|\bar{\mathbf{a}}_{h+1}|+1};\pi,g)\\
&=\phi(z_{h-1})^{\top}\mathbb{J}_{h}\hat{\mathbf{q}}_{\tau^{a}_{h}}(a_{h},\bar{\mathbf{a}}_{h+1},o_{h:h+|\bar{\mathbf{a}}_{h+1}|+1})
-\frac{{\textbf{1}}\{a_{h}=\pi_{h}(\bar{z}_{h})\}}{1/|\mathcal{A}|}\left(r_{h}+\phi(z_{h})^{\top}\mathbb{J}_{h+1}\hat{\mathbf{q}}_{\tau^{a}_{h+1}}(\bar{\mathbf{a}}_{h+1},o_{h+1:h+|\bar{\mathbf{a}}_{h+1}|+1})\right). \tag{18}
\end{align*}

Since we have shown that $\hat{\mathbf{q}}_{\tau^{a}_{h}}$ and $\hat{\mathbf{q}}_{\tau^{a}_{h+1}}$ are unbiased estimators of $\mathbf{q}_{\tau^{a}_{h}}$ and $\mathbf{q}_{\tau^{a}_{h+1}}$, respectively, we can show that for any roll-in policy $\pi^{\prime}$:

\begin{align*}
\mathrm{Br}_{h}(\pi,g;\pi^{\prime})
&=-\mathbb{E}\big[\mathbb{E}[g_{h+1}(z_{h},t^{\mathcal{A}}_{h+1},t^{\mathcal{O}}_{h+1})\mid\tau^{a}_{h+1};\ t^{\mathcal{A}}_{h+1}\sim U({\mathcal{T}}^{\mathcal{A}}_{h+1})]+r_{h};\ a_{1:h-1}\sim\pi^{\prime},\ a_{h}\sim\pi\big]\\
&\quad+\mathbb{E}\big[\mathbb{E}[g_{h}(z_{h-1},t^{\mathcal{A}}_{h},t^{\mathcal{O}}_{h})\mid\tau^{a}_{h};\ t^{\mathcal{A}}_{h}\sim U({\mathcal{T}}^{\mathcal{A}}_{h})];\ a_{1:h-1}\sim\pi^{\prime},\ a_{h}\sim\pi\big]\\
&=\mathbb{E}[-\phi(z_{h})^{\top}\mathbb{J}^{\pi}_{h+1}\mathbf{q}_{\tau^{a}_{h+1}}-r_{h}+\phi(z_{h-1})^{\top}\mathbb{J}^{\pi}_{h}\mathbf{q}_{\tau^{a}_{h}};\ a_{1:h-1}\sim\pi^{\prime},\ a_{h}\sim\pi]\\
&=\mathbb{E}\left[l_{h}(z_{h-1},a_{h},r_{h},\bar{\mathbf{a}}_{h+1},o_{h:h+|\bar{\mathbf{a}}_{h+1}|+1};\pi,g);\ a_{1:h-1}\sim\pi^{\prime},\ a_{h}\sim U(\mathcal{A}),\ \bar{\mathbf{a}}_{h+1}\sim U({\mathcal{T}}^{\mathcal{A}}\cup\bar{\mathcal{T}}^{\mathcal{A}})\right].
\end{align*}

The above shows that we can use lh()l_{h}(\cdot) as a loss function.

Summary

We can use almost the same algorithm as Algorithm 1. The sole difference is that we need to replace $\sigma^{t}_{h}(\pi,g)$ with

\mathbb{E}_{\mathcal{D}^{t}_{h}}\left[l_{h}(z_{h-1},a_{h},r_{h},\bar{\mathbf{a}}_{h+1},o_{h:h+|\bar{\mathbf{a}}_{h+1}|+1};\pi,g)\right],

where $\mathcal{D}^{t}_{h}$ is an empirical approximation obtained by executing $a_{1:h-1}\sim\pi^{t},\ a_{h}\sim U(\mathcal{A}),\ \bar{\mathbf{a}}_{h+1}\sim U({\mathcal{T}}^{\mathcal{A}}\cup\bar{\mathcal{T}}^{\mathcal{A}})$.

Calculation of PO-bilinear rank

Finally, we prove that a PSR belongs to the PO-bilinear class.

Lemma 19 (PO-bilinear decomposition).

Let $\mathcal{Q}$ be a minimum core test set contained in ${\mathcal{T}}$. The PSR model has PO-bilinear rank at most $|\mathcal{O}|^{M}|\mathcal{A}|^{M}|\mathcal{Q}|$, i.e., there exist two $|\mathcal{O}|^{M}|\mathcal{A}|^{M}|\mathcal{Q}|$-dimensional mappings $W_{h}:\Pi\times\mathcal{G}\to\mathbb{R}^{|\mathcal{O}|^{M}|\mathcal{A}|^{M}|\mathcal{Q}|}$ and $X_{h}:\Pi\to\mathbb{R}^{|\mathcal{O}|^{M}|\mathcal{A}|^{M}|\mathcal{Q}|}$ such that for any triple $(\pi,g;\pi^{\prime})$, we have:

\begin{align*}
\mathrm{Br}_{h}(\pi,g;\pi^{\prime})&=\mathbb{E}\left[\phi(z_{h-1})^{\top}\mathbb{J}_{h}\mathbf{q}_{\tau^{a}_{h}}-r_{h}-\phi(z_{h})^{\top}\mathbb{J}_{h+1}\mathbf{q}_{\tau^{a}_{h+1}};\ a_{1:h-1}\sim\pi^{\prime},\ a_{h}\sim\pi\right]\\
&=\left\langle X_{h}(\pi^{\prime}),W_{h}(\pi,g)\right\rangle.
\end{align*}
Proof.

We first take expectation conditional on τha\tau^{a}_{h}. Then, we have

ϕ(zh1)𝕁h𝐪τha𝔼[rh+ϕ(zh)𝕁h+1𝐪τh+1aτha;ahπ]\displaystyle\phi(z_{h-1})^{\top}\mathbb{J}_{h}{\bf q}_{\tau^{a}_{h}}-\mathbb{E}\left[r_{h}+\phi(z_{h})^{\top}\mathbb{J}_{h+1}{\bf q}_{\tau^{a}_{h+1}}\mid\tau^{a}_{h};a_{h}\sim\pi\right]
=ϕ(zh1)𝕁h𝐪τha+(ϕ(zh1)𝕁1π𝐪τha+ϕ(zh1)𝕁2π𝐪τha),\displaystyle=\phi(z_{h-1})^{\top}\mathbb{J}_{h}{\bf q}_{\tau^{a}_{h}}+\left(\phi(z_{h-1})^{\top}\mathbb{J}^{\pi}_{1}{\bf q}_{\tau^{a}_{h}}+\phi(z_{h-1})^{\top}\mathbb{J}^{\pi}_{2}{\bf q}_{\tau^{a}_{h}}\right),

where 𝕁1π\mathbb{J}^{\pi}_{1} and 𝕁2π\mathbb{J}^{\pi}_{2} are some two matrices as defined in the proof of Lemma 17 from where we have already known that the π\pi-induced Bellman backup on a value function which has a bilinear form gives back a bilinear form value function. Rearrange terms, we get:

ϕ(zh1)𝕁h𝐪τha𝔼[rh+ϕ(zh)𝕁h+1𝐪τh+1aτha;ahπ]=ϕ(zh1),(𝕁h+𝕁1π+𝕁2π)𝐪τ𝐡𝐚.\displaystyle\phi(z_{h-1})^{\top}\mathbb{J}_{h}{\bf q}_{\tau^{a}_{h}}-\mathbb{E}\left[r_{h}+\phi(z_{h})^{\top}\mathbb{J}_{h+1}{\bf q}_{\tau^{a}_{h+1}}\mid\tau^{a}_{h};a_{h}\sim\pi\right]=\left\langle\phi(z_{h-1}),(\mathbb{J}_{h}+\mathbb{J}^{\pi}_{1}+\mathbb{J}^{\pi}_{2})\bf{q}_{\tau^{a}_{h}}\right\rangle.

Now recall that the minimum core test set is 𝒬𝒯\mathcal{Q}\subset{\mathcal{T}}. The final step is to argue that 𝐪τ{\bf q}_{\tau} lives in a subspace whose dimension is |𝒬||\mathcal{Q}|. Since 𝒬\mathcal{Q} is a core test set, by definition, we can express 𝐪τha{\bf q}_{\tau^{a}_{h}} using [(t|τha)]t𝒬[\mathbb{P}(t|\tau^{a}_{h})]_{t\in\mathcal{Q}}, i.e.,

K|𝒯|×|𝒬|,𝐪τha=K[(t|τha)]t𝒬,\displaystyle\exists K\in\mathbb{R}^{|{\mathcal{T}}|\times|\mathcal{Q}|},\quad{\bf q}_{\tau^{a}_{h}}=K[\mathbb{P}(t|\tau^{a}_{h})]_{t\in\mathcal{Q}},

where the row of KK indexed by t𝒯t\in{\mathcal{T}} is equal to 𝐤t{\bf k}_{t}, where 𝐤t{\bf k}_{t} is the vector that is used to predict (t|τha)=𝐤t[(t|τha)]t𝒬\mathbb{P}(t|\tau^{a}_{h})={\bf k}_{t}^{\top}[\mathbb{P}(t|\tau^{a}_{h})]_{t\in\mathcal{Q}} whose existences is ensured by the definition of PSRs. This implies that

ϕ(zh1),(𝕁h+𝕁1π+𝕁2π)𝐪τ𝐡𝐚\displaystyle\left\langle\phi(z_{h-1}),(\mathbb{J}_{h}+\mathbb{J}^{\pi}_{1}+\mathbb{J}^{\pi}_{2})\bf{q}_{\tau^{a}_{h}}\right\rangle =ϕ(zh1),(𝕁h+𝕁1π+𝕁2π)K[(t|τha)]t𝒬\displaystyle=\left\langle\phi(z_{h-1}),(\mathbb{J}_{h}+\mathbb{J}^{\pi}_{1}+\mathbb{J}^{\pi}_{2})K[\mathbb{P}(t|\tau^{a}_{h})]_{t\in\mathcal{Q}}\right\rangle
\displaystyle=\left\langle\phi(z_{h-1})\otimes[\mathbb{P}(t|\tau^{a}_{h})]_{t\in\mathcal{Q}},\ \text{vec}((\mathbb{J}_{h}+\mathbb{J}^{\pi}_{1}+\mathbb{J}^{\pi}_{2})K)\right\rangle.

Finally, taking the expectation with respect to $\tau^{a}_{h}$, we get $\mathrm{Br}_{h}(\pi,g;\pi^{\prime})=\langle X_{h}(\pi^{\prime}),W_{h}(\pi,g)\rangle$ with

\displaystyle X_{h}(\pi^{\prime})=\mathbb{E}[\phi(z_{h-1})\otimes[\mathbb{P}(t|\tau^{a}_{h})]_{t\in\mathcal{Q}};a_{1:h-1}\sim\pi^{\prime}],\quad W_{h}(\pi,g)=\text{vec}((\mathbb{J}_{h}+\mathbb{J}^{\pi}_{1}+\mathbb{J}^{\pi}_{2})K).

The key observation here is that the bilinear rank scales with $|\mathcal{Q}|$, not $|{\mathcal{T}}|$. This is good news since we often cannot identify an exact minimal core test set; however, it is easy to find core test sets that contain a minimal one. Thus, even if we do not know the linear dimension of the dynamical system a priori, the resulting bilinear rank equals the linear dimension of the system as long as the core test set is large enough to contain a minimal core test set. This benefits the sample complexity, as we will see in Section K.

Appendix E Proof of Theorem 1

We fix the parameters as in Theorem 1. Let

lh(z¯h,ah,rh,oh+1)=|𝒜|πh(ahz¯h){rh+gh+1(z¯h+1)}gh(z¯h).\displaystyle l_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1})=|\mathcal{A}|\pi_{h}(a_{h}\mid\bar{z}_{h})\{r_{h}+g_{h+1}(\bar{z}_{h+1})\}-g_{h}(\bar{z}_{h}).
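As a quick numerical illustration of the importance weight $|\mathcal{A}|\pi_{h}(a_{h}\mid\bar{z}_{h})$ above (a Monte-Carlo sketch, not part of the analysis; the action distribution pi and the integrand f below are arbitrary placeholder values), reweighting data collected with $a_{h}\sim U(\mathcal{A})$ recovers an expectation under $a_{h}\sim\pi$:

import numpy as np

# a_h is logged uniformly; the weight |A| * pi(a_h) converts the sample
# average into an estimate of E_{a ~ pi}[f(a)].
rng = np.random.default_rng(0)
nA = 4
pi = np.array([0.7, 0.1, 0.1, 0.1])      # target policy over actions (placeholder)
f = np.array([1.0, -2.0, 0.5, 3.0])      # any per-action integrand (placeholder)
a = rng.integers(0, nA, size=200_000)    # a_h ~ U(A)
print(np.mean(nA * pi[a] * f[a]), pi @ f)  # both are approximately E_{a~pi}[f(a)]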

We define

ϵgen\displaystyle\epsilon_{gen} =maxhϵgen,h(m,Π,𝒢,δ/(TH+1)),ϵini=ϵini(𝒢,δ/(TH+1)),\displaystyle=\max_{h}\epsilon_{gen,h}(m,\Pi,\mathcal{G},\delta/(TH+1)),\quad\epsilon_{ini}=\epsilon_{ini}(\mathcal{G},\delta/(TH+1)),
\displaystyle\tilde{\epsilon}_{gen}=\max_{h}\epsilon_{gen,h}(m,\Pi,\mathcal{G},\delta/H).

Then, by Assumption 2, with probability $1-\delta$, for all $t\in[T]$ and $h\in[H]$ we have

supπΠ,g𝒢|𝔼𝒟ht[lh(z¯h,ah,rh,oh+1;π,g)]𝔼[𝔼𝒟ht[lh(z¯h,ah,rh,oh+1;π,g)]]|ϵgen,\displaystyle\sup_{\pi\in\Pi,g\in\mathcal{G}}|\mathbb{E}_{\mathcal{D}^{t}_{h}}[l_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\pi,g)]-\mathbb{E}[\mathbb{E}_{\mathcal{D}^{t}_{h}}[l_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\pi,g)]]|\leq\epsilon_{gen}, (19)
supg1𝒢1|𝔼𝒟0[g1(o1)]𝔼[𝔼𝒟0[g1(o1)]]|ϵini.\displaystyle\sup_{g_{1}\in\mathcal{G}_{1}}|\mathbb{E}_{\mathcal{D}^{0}}[g_{1}(o_{1})]-\mathbb{E}[\mathbb{E}_{\mathcal{D}^{0}}[g_{1}(o_{1})]]|\leq\epsilon_{ini}. (20)

Hereafter, we condition on the above events.

We first show the following lemma. Recall

π=argmaxπΠJ(π).\displaystyle\pi^{\star}=\mathop{\mathrm{argmax}}_{\pi\in\Pi}J(\pi).
Lemma 20 (Optimism).

Set R:=ϵgen2R:=\epsilon^{2}_{gen}. For all t[T]t\in[T], (π,gπ)(\pi^{\star},g^{\pi^{\star}}) is a feasible solution of the constrained program. Furthermore, we have J(π)𝔼[g1t(o1)]+2ϵiniJ(\pi^{\star})\leq\mathbb{E}[g^{t}_{1}(o_{1})]+2\epsilon_{ini} for any t[T]t\in[T], where gtg^{t} is the value link function selected by the algorithm in iteration tt.

Proof.

For any π\pi, we have

\displaystyle\mathbb{E}[\mathbb{E}_{\mathcal{D}^{t}_{h}}[l_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\pi,g^{\pi})]]=0

since gπg^{\pi} is a value link function in 𝒢\mathcal{G}. This is because

\displaystyle\mathbb{E}[\mathbb{E}_{\mathcal{D}^{t}_{h}}[l_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\pi,g^{\pi})]]
=𝔼[gh(z¯h)rhgh+1(z¯h+1);a1:h1πt,ahπ]\displaystyle=\mathbb{E}[g_{h}(\bar{z}_{h})-r_{h}-g_{h+1}(\bar{z}_{h+1});a_{1:h-1}\sim\pi^{t},a_{h}\sim\pi] (IS sampling)
=Wh(π,gπ),Xh(πt)\displaystyle=\langle W_{h}(\pi,g^{\pi}),X_{h}(\pi^{t})\rangle (First assumption in Definition 3)
=0.\displaystyle=0. (Second assumption in Definition 3)

Thus,

|𝔼𝒟ht[lh(z¯h,ah,rh,oh+1;π,gπ)]|ϵgen.\displaystyle|\mathbb{E}_{\mathcal{D}^{t}_{h}}[l_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\pi^{\star},g^{\pi^{\star}})]|\leq\epsilon_{gen}.

using (19) and noting $\pi^{\star}\in\Pi,g^{\pi^{\star}}\in\mathcal{G}$. This implies

t[T],h[H];(𝔼𝒟ht[lh(z¯h,ah,rh,oh+1;π,gπ)])2ϵgen2.\forall t\in[T],\forall h\in[H];(\mathbb{E}_{\mathcal{D}^{t}_{h}}[l_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\pi^{\star},g^{\pi^{\star}})])^{2}\leq\epsilon^{2}_{gen}.

Hence, $(\pi^{\star},g^{\pi^{\star}})$ is a feasible solution for any $t\in[T]$.

Then, we have

J(π)\displaystyle J(\pi^{\star}) =𝔼[g1π(o1)]𝔼𝒟0[g1π(o1)]+ϵini\displaystyle=\mathbb{E}[g^{\pi^{\star}}_{1}(o_{1})]\leq\mathbb{E}_{\mathcal{D}^{0}}[g^{\pi^{\star}}_{1}(o_{1})]+\epsilon_{ini} (Uniform convergence result)
𝔼𝒟0[g1t(o1)]+ϵini\displaystyle\leq\mathbb{E}_{\mathcal{D}^{0}}[g^{t}_{1}(o_{1})]+\epsilon_{ini} (Using the construction of algorithm)
𝔼[g1t(o1)]+2ϵini.\displaystyle\leq\mathbb{E}[g^{t}_{1}(o_{1})]+2\epsilon_{ini}. (Uniform convergence)

Remark 5.

Note that

𝔼[𝔼𝒟ht[lh(z¯h,ah,rh,oh+1;π,gπ)]]=0\displaystyle\mathbb{E}[\mathbb{E}_{\mathcal{D}^{t}_{h}}[l_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\pi,g^{\pi})]]=0

holds for general link functions $g^{\pi}$ in Definition 6. Thus, the statement goes through even if we use Definition 6.

Next, we prove the following lemma to upper bound the per step regret.

Lemma 21.

For any t[T]t\in[T], we have

\displaystyle J(\pi^{\star})-J(\pi^{t})\leq\sum_{h=1}^{H}|\langle W_{h}(\pi^{t},g^{t}),X_{h}(\pi^{t})\rangle|+2\epsilon_{ini}.
Proof.
\displaystyle J(\pi^{\star})-J(\pi^{t})
2ϵini+𝔼[g1t(o1)]J(πt)\displaystyle\leq 2\epsilon_{ini}+\mathbb{E}[g^{t}_{1}(o_{1})]-J(\pi^{t}) (From optimism)
=2ϵini+h=1H𝔼[ght(z¯h){rh+gh+1t(z¯h+1)};a1:hπt]\displaystyle=2\epsilon_{ini}+\sum_{h=1}^{H}\mathbb{E}[g^{t}_{h}(\bar{z}_{h})-\{r_{h}+g^{t}_{h+1}(\bar{z}_{h+1})\};a_{1:h}\sim\pi^{t}] (Performance difference lemma)
2ϵini+h=1H|𝔼[ght(z¯h){rh+gh+1t(z¯h+1)};a1:hπt]|\displaystyle\leq 2\epsilon_{ini}+\sum_{h=1}^{H}|\mathbb{E}[g^{t}_{h}(\bar{z}_{h})-\{r_{h}+g^{t}_{h+1}(\bar{z}_{h+1})\};a_{1:h}\sim\pi^{t}]|
=2ϵini+h=1H|Wh(πt,gt),Xh(πt)|.\displaystyle=2\epsilon_{ini}+\sum_{h=1}^{H}|\langle W_{h}(\pi^{t},g^{t}),X_{h}(\pi^{t})\rangle|. (First assumption in Definition 3)

Lemma 22.

Let Σt,h=λI+τ=0t1Xh(πτ)Xh(πτ)\Sigma_{t,h}=\lambda I+\sum_{\tau=0}^{t-1}X_{h}(\pi^{\tau})X_{h}(\pi^{\tau})^{\top}. We have

1Tt=0T1h=1HXh(πt)Σt,h1HdTln(1+TBX2dλ).\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\sum_{h=1}^{H}\|X_{h}(\pi^{t})\|_{\Sigma^{-1}_{t,h}}\leq H\sqrt{\frac{d}{T}\ln\left(1+\frac{TB^{2}_{X}}{d\lambda}\right)}.
Proof.

We fix h[H]h\in[H]. Here, we have Σt,h=λI+τ=0t1Xh(πτ)Xh(πτ)\Sigma_{t,h}=\lambda I+\sum_{\tau=0}^{t-1}X_{h}(\pi^{\tau})X_{h}(\pi^{\tau})^{\top}. From the elliptical potential lemma in [1, Lemma G.2], we have

\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\|X_{h}(\pi^{t})\|_{\Sigma^{-1}_{t,h}}\leq\sqrt{\frac{1}{T}\sum_{t=0}^{T-1}\|X_{h}(\pi^{t})\|^{2}_{\Sigma^{-1}_{t,h}}}\leq\sqrt{\frac{1}{T}\ln\frac{\det(\Sigma_{T,h})}{\det(\lambda I)}}\leq\sqrt{\frac{d}{T}\ln\left(1+\frac{TB^{2}_{X}}{d\lambda}\right)}.

Then,

\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\sum_{h=1}^{H}\|X_{h}(\pi^{t})\|_{\Sigma^{-1}_{t,h}}\leq H\sqrt{\frac{d}{T}\ln\left(1+\frac{TB^{2}_{X}}{d\lambda}\right)}.
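As a numerical sanity check of this bound (a simulation sketch, not part of the analysis; the dimensions and the random unit-norm features below are arbitrary placeholder choices, with $B_{X}=\lambda=1$):

import numpy as np

rng = np.random.default_rng(0)
d, T, lam, B_X = 5, 2000, 1.0, 1.0
Sigma = lam * np.eye(d)
lhs = 0.0
for t in range(T):
    x = rng.standard_normal(d)
    x *= B_X / np.linalg.norm(x)                   # enforce ||x||_2 = B_X
    lhs += np.sqrt(x @ np.linalg.solve(Sigma, x))  # ||x||_{Sigma_{t,h}^{-1}}
    Sigma += np.outer(x, x)                        # update to Sigma_{t+1,h}
lhs /= T
rhs = np.sqrt(d / T * np.log(1 + T * B_X**2 / (d * lam)))
print(lhs, rhs)  # lhs stays below rhs in this simulation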

Lemma 23.
Wh(πt,gt)Σt,h22λBW2+4Tϵgen2.\displaystyle\|W_{h}(\pi^{t},g^{t})\|^{2}_{\Sigma_{t,h}}\leq 2\lambda B^{2}_{W}+4T\epsilon^{2}_{gen}.
Proof.

We have

Wh(πt,gt)Σt,h2=λWh(πt,gt)22+τ=0t1Wh(πt,gt),Xh(πτ)2.\displaystyle\|W_{h}(\pi^{t},g^{t})\|^{2}_{\Sigma_{t,h}}=\lambda\|W_{h}(\pi^{t},g^{t})\|^{2}_{2}+\sum_{\tau=0}^{t-1}\langle W_{h}(\pi^{t},g^{t}),X_{h}(\pi^{\tau})\rangle^{2}.

The first term is upper-bounded by λBW2\lambda B^{2}_{W}. The second term is upper-bounded by

τ=0t1Wh(πt,gt),Xh(πτ)2\displaystyle\sum_{\tau=0}^{t-1}\langle W_{h}(\pi^{t},g^{t}),X_{h}(\pi^{\tau})\rangle^{2}
=τ=0t1(𝔼[lh(z¯h,ah,rh,oh+1;πt,gt);a1:h1πτ,ahU(𝒜)])2\displaystyle=\sum_{\tau=0}^{t-1}\left(\mathbb{E}[l_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\pi^{t},g^{t});a_{1:h-1}\sim\pi^{\tau},a_{h}\sim U(\mathcal{A})]\right)^{2} (First assumption in Definition 3)
2τ=0t1𝔼𝒟hτ[lh(z¯h,ah,rh,oh+1;πt,gt)]2+2tϵgen24Tϵgen2.\displaystyle\leq 2\sum_{\tau=0}^{t-1}\mathbb{E}_{\mathcal{D}^{\tau}_{h}}[l_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\pi^{t},g^{t})]^{2}+2t\epsilon^{2}_{gen}\leq 4T\epsilon^{2}_{gen}.

From the first line to the second line, we use the definition of bilinear rank models. From the second line to the third line, we use (a+b)22a2+2b2(a+b)^{2}\leq 2a^{2}+2b^{2}. In the last line, we use the constraint on (πt,gt)(\pi^{t},g^{t}).

Combining lemmas so far, we have

J(π)J(π^)\displaystyle J(\pi^{\star})-J(\hat{\pi}) 1Tt=0T1h=1H|Wh(πt,gt),Xh(πt)|+2ϵini\displaystyle\leq\frac{1}{T}\sum_{t=0}^{T-1}\sum_{h=1}^{H}|\langle W_{h}(\pi^{t},g^{t}),X_{h}(\pi^{t})\rangle|+2\epsilon_{ini} (Use Lemma 21)
1Tt=0T1h=1HWh(πt,gt)Σt,hXh(πt)Σt,h1+2ϵini\displaystyle\leq\frac{1}{T}\sum_{t=0}^{T-1}\sum_{h=1}^{H}\|W_{h}(\pi^{t},g^{t})\|_{\Sigma_{t,h}}\|X_{h}(\pi^{t})\|_{\Sigma^{-1}_{t,h}}+2\epsilon_{ini} (CS inequality)
H1/2[2λBW2+4Tϵgen2]1/2(dHTln(1+TBX2dλ))1/2+2ϵini.\displaystyle\leq H^{1/2}\left[2\lambda B^{2}_{W}+4T\epsilon^{2}_{gen}\right]^{1/2}\left(\frac{dH}{T}\ln\left(1+\frac{TB^{2}_{X}}{d\lambda}\right)\right)^{1/2}+2\epsilon_{ini}. (Use Lemma 22 and Lemma 23 )

We set λ\lambda such that BX2/λ=BW2BX2/ϵgen2+1B^{2}_{X}/\lambda=B^{2}_{W}B^{2}_{X}/\epsilon^{2}_{gen}+1 and T=2Hdln(4Hd(BX2BW2/ϵ~gen+1))T=\left\lceil 2Hd\ln(4Hd(B^{2}_{X}B^{2}_{W}/\tilde{\epsilon}_{gen}+1))\right\rceil. Then,

HdTln(1+TBX2dλ)\displaystyle\frac{Hd}{T}\ln\left(1+\frac{TB^{2}_{X}}{d\lambda}\right) HdTln(1+Td(BW2BX2ϵgen2+1))\displaystyle\leq\frac{Hd}{T}\ln\left(1+\frac{T}{d}\left(\frac{B^{2}_{W}B^{2}_{X}}{\epsilon^{2}_{gen}}+1\right)\right)
HdTln(1+Td(BW2BX2ϵ~gen2+1))\displaystyle\leq\frac{Hd}{T}\ln\left(1+\frac{T}{d}\left(\frac{B^{2}_{W}B^{2}_{X}}{\tilde{\epsilon}^{2}_{gen}}+1\right)\right)
HdTln(2Td(BW2BX2ϵ~gen2+1))1\displaystyle\leq\frac{Hd}{T}\ln\left(\frac{2T}{d}\left(\frac{B^{2}_{W}B^{2}_{X}}{\tilde{\epsilon}^{2}_{gen}}+1\right)\right)\leq 1

since aln(bT)/T1a\ln(bT)/T\leq 1 when T=2aln(2ab)T=2a\ln(2ab).
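For completeness, this can be verified directly (assuming $2ab\geq e$ so that the logarithms below are at least one): substituting $T=2a\ln(2ab)$,

\frac{a\ln(bT)}{T}=\frac{\ln\left(2ab\ln(2ab)\right)}{2\ln(2ab)}\leq\frac{\ln\left((2ab)^{2}\right)}{2\ln(2ab)}=1,

where the middle step uses $\ln(2ab)\leq 2ab$.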

Finally, the following holds

J(π)J(πT)\displaystyle J(\pi^{\star})-J(\pi^{T}) H1/2[4λBW2+8Tϵgen2]1/2+2ϵini\displaystyle\leq H^{1/2}\left[4\lambda B^{2}_{W}+8T\epsilon^{2}_{gen}\right]^{1/2}+2\epsilon_{ini}
H1/2[4λBW2+16ϵgen2Hdln(4Hd(BX2BW2/ϵ~gen+1))]1/2+2ϵini\displaystyle\leq H^{1/2}\left[4\lambda B^{2}_{W}+16\epsilon^{2}_{gen}Hd\ln(4Hd(B^{2}_{X}B^{2}_{W}/\tilde{\epsilon}_{gen}+1))\right]^{1/2}+2\epsilon_{ini} (Plug in TT )
\displaystyle\leq H^{1/2}\left[8\epsilon^{2}_{gen}+16\epsilon^{2}_{gen}Hd\ln(4Hd(B^{2}_{X}B^{2}_{W}/\tilde{\epsilon}_{gen}+1))\right]^{1/2}+2\epsilon_{ini} (Plug in $\lambda$)
5ϵgen[H2dln(4Hd(BX2BW2/ϵ~gen+1))]1/2+2ϵini.\displaystyle\leq 5\epsilon_{gen}\left[H^{2}d\ln(4Hd(B^{2}_{X}B^{2}_{W}/\tilde{\epsilon}_{gen}+1))\right]^{1/2}+2\epsilon_{ini}.

Appendix F Sample Complexity for Finite Function Classes

Consider cases where $\Pi$ and $\mathcal{G}$ are finite and the PO-bilinear rank assumption is satisfied. When $\Pi$ and $\mathcal{G}$ are infinite hypothesis classes, $|\Pi|$ and $|\mathcal{G}|$ are replaced with their $L^{\infty}$-covering numbers, respectively.

Theorem 13 (Sample complexity for discrete Π\Pi and 𝒢\mathcal{G}).

Let 𝒢hC𝒢,rh[0,1]\|\mathcal{G}_{h}\|_{\infty}\leq C_{\mathcal{G}},r_{h}\in[0,1] for any h[H]h\in[H] and the PO-bilinear rank assumption holds with PO-bilinear rank dd. By letting |Πmax|=maxh|Πh|,|𝒢max|=maxh|𝒢h||\Pi_{\max}|=\max_{h}|\Pi_{h}|,|\mathcal{G}_{\max}|=\max_{h}|\mathcal{G}_{h}|, with probability 1δ1-\delta, we can achieve J(π)J(π^)ϵJ(\pi^{\star})-J(\hat{\pi})\leq\epsilon when we use samples at most

\displaystyle\tilde{O}\left(d^{2}H^{4}\max(C_{\mathcal{G}},1)^{2}|\mathcal{A}|^{2}\ln(|\mathcal{G}_{\max}||\Pi_{\max}|/\delta)\ln^{2}(B_{X}B_{W})(1/\epsilon)^{2}\right).

Here, $\mathrm{polylog}(d,H,|\mathcal{A}|,\ln(|\mathcal{G}_{\max}|),\ln(|\Pi_{\max}|),\ln(1/\delta),\ln(B_{X}),\ln(B_{W}),1/\epsilon)$ factors are omitted.

Proof.

We derive the above result. First, from the uniform convergence result, we have

ϵgen=cmax(C𝒢,1)|𝒜|ln(|𝒢max||Πmax|TH/δ)/m.\displaystyle\epsilon_{gen}=c\max(C_{\mathcal{G}},1)|\mathcal{A}|\sqrt{\ln(|\mathcal{G}_{\max}||\Pi_{\max}|TH/\delta)/m}.

Thus, we need to set mm such that

J(π)J(π^)cmax(C𝒢,1)|𝒜|ln(|𝒢max||Πmax|TH/δ)/mdH2ln(H3dBX2BW2m+1)ϵ\displaystyle J(\pi^{\star})-J(\hat{\pi})\leq c\max(C_{\mathcal{G}},1)|\mathcal{A}|\sqrt{\ln(|\mathcal{G}_{\max}||\Pi_{\max}|TH/\delta)/m}\sqrt{dH^{2}\ln(H^{3}dB^{2}_{X}B^{2}_{W}m+1)}\leq\epsilon

where cc is some constant and

T=cHdln(HdBX2BW2m+1).\displaystyle T=cHd\ln(HdB^{2}_{X}B^{2}_{W}m+1).

By reorganizing terms, the following $m$ is sufficient:

cdH2max(C𝒢,1)2|𝒜|2ln(|𝒢max||Πmax|H2d/δ)ln(H3dBX2BW2m)mϵ\displaystyle c\sqrt{\frac{dH^{2}\max(C_{\mathcal{G}},1)^{2}|\mathcal{A}|^{2}\ln(|\mathcal{G}_{\max}||\Pi_{\max}|H^{2}d/\delta)\ln(H^{3}dB^{2}_{X}B^{2}_{W}m)}{m}}\leq\epsilon

Using Lemma 44, the following mm satisfies the condition:

m=cB1(lnB1B2)2ϵ2,B1=dH2max(C𝒢,1)2|𝒜|2ln(|𝒢max||Πmax|H2d/δ),B2=H3dBX2BW2.\displaystyle m=c\frac{B_{1}(\ln B_{1}B_{2})^{2}}{\epsilon^{2}},B_{1}=dH^{2}\max(C_{\mathcal{G}},1)^{2}|\mathcal{A}|^{2}\ln(|\mathcal{G}_{\max}||\Pi_{\max}|H^{2}d/\delta),B_{2}=H^{3}dB^{2}_{X}B^{2}_{W}.
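As an aside, this self-referential choice of $m$ can be computed numerically (a sketch with placeholder values of $B_{1},B_{2},\epsilon$; the fixed-point iteration contracts because the derivative of $\ln^{2}(mB_{1}B_{2})$ in $m$ decays like $1/m$):

import numpy as np

def solve_m(B1, B2, eps, c=1.0, iters=50):
    # fixed-point iteration for m = c * B1 * ln(m * B1 * B2)^2 / eps^2
    m = c * B1 / eps**2               # start by ignoring the log factor
    for _ in range(iters):
        m = c * B1 * np.log(m * B1 * B2) ** 2 / eps**2
    return m

print(solve_m(B1=1e4, B2=1e6, eps=0.1))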

Combining all together, the sample complexity is mTHmTH, i.e.,

O~(d2H4max(C𝒢,1)2|𝒜|2ln(|𝒢max||Πmax|/δ)ln2(BXBW)ϵ2).\displaystyle\tilde{O}\left(\frac{d^{2}H^{4}\max(C_{\mathcal{G}},1)^{2}|\mathcal{A}|^{2}\ln(|\mathcal{G}_{\max}||\Pi_{\max}|/\delta)\ln^{2}(B_{X}B_{W})}{\epsilon^{2}}\right).\quad\blacksquare

Appendix G Sample Complexity in Observable HSE POMDPs

We revisit the existence of link functions by taking the norm constraint into account. Then, we consider the PO-bilinear decomposition with certain BXB_{X}\in\mathbb{R} and BWB_{W}\in\mathbb{R}. Next, we calculate the uniform convergence result. Finally, we show the sample complexity result.

We use the following assumptions

Assumption 7.

For any h[H]h\in[H], the following holds:

  1. $V^{\pi}_{h}(z_{h-1},s)=\langle\theta^{\pi}_{h},\phi_{h}(z_{h-1},s)\rangle$.

  2. There exists a matrix $K_{h}$ such that $\mathbb{E}_{o\sim\mathbb{O}(s)}[\psi_{h}(z_{h-1},o)]=K_{h}\phi_{h}(z_{h-1},s)$ (i.e., conditional embedding of the emission distribution).

  3. $\|\phi_{h}(\cdot)\|\leq 1,\|\psi_{h}(\cdot)\|\leq 1,\|\theta^{\pi}_{h}\|\leq\Theta_{V},0\leq r_{h}\leq 1$.

  4. There exists a matrix $T_{\pi;h}$ such that $\mathbb{E}[\phi_{h}(z_{h},s_{h+1})\mid z_{h-1},s_{h};a_{h}\sim\pi]=T_{\pi;h}\phi_{h}(z_{h-1},s_{h})$ (i.e., conditional embedding of the transition).

  5. $\Pi$ is finite.

We define

σmin(K)=minh[H]1/Kh,σmax(K)=maxh[H]Kh,σmax(T)=maxh[H]Tπ:h,\displaystyle\sigma_{\min}(K)=\min_{h\in[H]}1/\|K_{h}^{\dagger}\|,\ \sigma_{\max}(K)=\max_{h\in[H]}\|K_{h}\|,\ \sigma_{\max}(T)=\max_{h\in[H]}\|T_{\pi:h}\|,
dϕ=maxh[H]dϕh,dψ=maxh[H]dψh.\displaystyle d_{\phi}=\max_{h\in[H]}d_{\phi_{h}},\quad d_{\psi}=\max_{h\in[H]}d_{\psi_{h}}.
Existence of link functions.

We show value link functions exist. This is proved by noting

𝔼o𝕆(s)[(Kh)θhπ,ψh(z¯h)]=(Kh)θhπ,Khϕh(zh1,sh)=θhπ,ϕh(zh1,sh)=Vhπ(zh1,sh).\displaystyle\mathbb{E}_{o\sim\mathbb{O}(s)}[\langle(K^{\dagger}_{h})^{\top}\theta^{\pi}_{h},\psi_{h}(\bar{z}_{h})\rangle]=\langle(K^{\dagger}_{h})^{\top}\theta^{\pi}_{h},K_{h}\phi_{h}(z_{h-1},s_{h})\rangle=\langle\theta^{\pi}_{h},\phi_{h}(z_{h-1},s_{h})\rangle=V^{\pi}_{h}(z_{h-1},s_{h}).

Thus, (Kh)θhπ,ψh(z¯h)\langle(K^{\dagger}_{h})^{\top}\theta^{\pi}_{h},\psi_{h}(\bar{z}_{h})\rangle is a value link function. The radius of the parameter space is upper-bounded by ΘV/σmin(K)\Theta_{V}/\sigma_{\min}(K). Hence, we set

𝒢h={θ,ψh():θΘV/σmin(K)}.\mathcal{G}_{h}=\{\langle\theta,\psi_{h}(\cdot)\rangle:\|\theta\|\leq\Theta_{V}/\sigma_{\min}(K)\}.

Then, the realizability holds.

PO-bilinear decomposition.

Recall that we derived the PO-bilinear decomposition in Section B.4. Considering a triple $(\pi^{\prime},\pi,g)$ with $g_{h}(\cdot)=\theta_{h}^{\top}\psi_{h}(\cdot)$ and $g^{\pi}_{h}=\langle\theta^{\star}_{h},\psi_{h}(\cdot)\rangle$, we have:

\displaystyle\mathbb{E}\left[\theta_{h}^{\top}\psi_{h}(\bar{z}_{h})-r_{h}-\theta_{h+1}^{\top}\psi_{h+1}(\bar{z}_{h+1});a_{1:h-1}\sim\pi^{\prime},a_{h}\sim\pi\right]
=𝔼[ϕh(zh1,sh);a1:h1π],Kh(θhθh)Tπ;hKh+1(θh+1θh+1),\displaystyle=\left\langle\mathbb{E}[\phi_{h}(z_{h-1},s_{h});a_{1:h-1}\sim\pi^{\prime}],\quad K_{h}^{\top}(\theta_{h}-\theta^{\star}_{h})-T_{\pi;h}^{\top}K_{h+1}^{\top}(\theta_{h+1}-\theta^{\star}_{h+1})\right\rangle,

which verifies the PO-bilinear structure, i.e.,

Xh(π)=𝔼[ϕh(zh1,sh);a1:h1π],Wh(π,g)=Kh(θhθh)Tπ;hKh+1(θh+1θh+1),X_{h}(\pi^{\prime})=\mathbb{E}[\phi_{h}(z_{h-1},s_{h});a_{1:h-1}\sim\pi^{\prime}],\quad W_{h}(\pi,g)=K_{h}^{\top}(\theta_{h}-\theta^{\star}_{h})-T_{\pi;h}^{\top}K_{h+1}^{\top}(\theta_{h+1}-\theta^{\star}_{h+1}),

and shows that the PO-bilinear rank is at most $d_{\phi}=\max_{h}d_{\phi_{h}}$. Thus, based on the above PO-bilinear decomposition, we set $B_{X}=1$ and $B_{W}=2(1+\sigma_{\max}(T))\sigma_{\max}(K)\Theta_{V}/\sigma_{\min}(K)$. This is because

Kh(θhθh)Tπ;hKh+1(θh+1θh+1)\displaystyle\|K_{h}^{\top}(\theta_{h}-\theta^{\star}_{h})-T_{\pi;h}^{\top}K_{h+1}^{\top}(\theta_{h+1}-\theta^{\star}_{h+1})\|
\displaystyle\leq\|K_{h}^{\top}\|(\|\theta_{h}\|+\|\theta^{\star}_{h}\|)+\|T_{\pi;h}^{\top}\|\|K_{h+1}^{\top}\|(\|\theta_{h+1}\|+\|\theta^{\star}_{h+1}\|)
2(1+σmax(T))σmax(K)ΘV/σmin(K).\displaystyle\leq 2(1+\sigma_{\max}(T))\sigma_{\max}(K)\Theta_{V}/\sigma_{\min}(K).

and

𝔼[ϕh(zh1,sh);a1:h1π]𝔼[ϕh(zh1,sh);a1:h1π]1.\displaystyle\|\mathbb{E}[\phi_{h}(z_{h-1},s_{h});a_{1:h-1}\sim\pi^{\prime}]\|\leq\mathbb{E}[\|\phi_{h}(z_{h-1},s_{h})\|;a_{1:h-1}\sim\pi^{\prime}]\leq 1.

In the above, we use Jensen’s inequality.

Uniform convergence.

To invoke Theorem 1, we show the uniform convergence result.

Lemma 24 (Uniform convergence of loss functions).

Let C=ΘV/(σmin(K))C=\Theta_{V}/(\sigma_{\min}(K)). Then, with probability 1δ1-\delta,

supπΠ,g𝒢|{𝔼𝒟𝔼}[|𝒜|πh(ahz¯h){gh(z¯h)rhgh+1(z¯h+1)}]|\displaystyle\sup_{\pi\in\Pi,g\in\mathcal{G}}\left|\{\mathbb{E}_{\mathcal{D}}-\mathbb{E}\}\left[|\mathcal{A}|\pi_{h}(a_{h}\mid\bar{z}_{h})\left\{g_{h}(\bar{z}_{h})-r_{h}-g_{h+1}(\bar{z}_{h+1})\right\}\right]\right|
5|𝒜|{1+2C}{2dψln(1+Cm)+ln(|Πmax|/δ)}m\displaystyle\leq 5|\mathcal{A}|\{1+2C\}\sqrt{\frac{\{2d_{\psi}\ln(1+Cm)+\ln(|\Pi_{\max}|/\delta)\}}{m}}

and

\displaystyle\sup_{g_{1}\in\mathcal{G}_{1}}|\{\mathbb{E}_{\mathcal{D}}-\mathbb{E}\}[g_{1}(\bar{z}_{1})]|\leq 5C\sqrt{\frac{\{d_{\psi}\ln(1+Cm)+\ln(|\Pi_{\max}|/\delta)\}}{m}}.
Proof.

Let $C=\Theta_{V}/\sigma_{\min}(K)$. Define $\mathcal{N}_{\epsilon,h}$ as an $\epsilon$-net for $\mathcal{G}_{h}$; then $|\mathcal{N}_{\epsilon,h}|\leq(1+C/\epsilon)^{d_{\psi}}$. We have

|lh(;π,g)lh(;π,g)|\displaystyle|l_{h}(\cdot;\pi,g)-l_{h}(\cdot;\pi^{\diamond},g^{\diamond})| |𝒜|{ghgh+gh+1gh+1}\displaystyle\leq|\mathcal{A}|\{\|g_{h}-g^{\diamond}_{h}\|_{\infty}+\|g_{h+1}-g^{\diamond}_{h+1}\|_{\infty}\}
|𝒜|{θhθh2+θh+1θh+12}2|𝒜|ϵ.\displaystyle\leq|\mathcal{A}|\{\|\theta_{h}-\theta^{\diamond}_{h}\|_{2}+\|\theta_{h+1}-\theta^{\diamond}_{h+1}\|_{2}\}\leq 2|\mathcal{A}|\epsilon.

Besides, for fixed πΠ,θh𝒩ϵ,h\pi\in\Pi,\theta_{h}\in\mathcal{N}_{\epsilon,h}, θh+1𝒩ϵ,h+1\theta_{h+1}\in\mathcal{N}_{\epsilon,{h+1}}, we have

|{𝔼𝒟𝔼}[|𝒜|πh(ahz¯h){gh(z¯h;θh)rhgh+1(z¯h+1;θh+1)}]||𝒜|(1+2C)ln(|Πh|/δ)m.\displaystyle\left\lvert\{\mathbb{E}_{\mathcal{D}}-\mathbb{E}\}\left[|\mathcal{A}|\pi_{h}(a_{h}\mid\bar{z}_{h})\left\{g_{h}(\bar{z}_{h};\theta_{h})-r_{h}-g_{h+1}(\bar{z}_{h+1};\theta_{h+1})\right\}\right]\right\rvert\leq|\mathcal{A}|\left(1+2C\right)\sqrt{\frac{\ln(|\Pi_{h}|/\delta)}{m}}.

Then, for πΠ,θh𝒩ϵ,h\forall\pi\in\Pi,\forall\theta_{h}\in\mathcal{N}_{\epsilon,h}, θh+1𝒩ϵ,h+1\forall\theta_{h+1}\in\mathcal{N}_{\epsilon,{h+1}}, we have

|{𝔼𝒟𝔼}[|𝒜|πh(ahz¯h){gh(z¯h;θh)rhgh+1(z¯h+1;θh+1)}]||𝒜|(1+2C)ln(|Πh||𝒩ϵ,h||𝒩ϵ,h+1|/δ)m.\displaystyle\left\lvert\{\mathbb{E}_{\mathcal{D}}-\mathbb{E}\}\left[|\mathcal{A}|\pi_{h}(a_{h}\mid\bar{z}_{h})\left\{g_{h}(\bar{z}_{h};\theta_{h})-r_{h}-g_{h+1}(\bar{z}_{h+1};\theta_{h+1})\right\}\right]\right\rvert\leq|\mathcal{A}|\left(1+2C\right)\sqrt{\frac{\ln(|\Pi_{h}||\mathcal{N}_{\epsilon,h}||\mathcal{N}_{\epsilon,h+1}|/\delta)}{m}}.

Hence, for any gh=θh,ψh𝒢h,gh+1=θh+1,ψh+1𝒢h+1g_{h}=\langle\theta_{h},\psi_{h}\rangle\in\mathcal{G}_{h},g_{h+1}=\langle\theta_{h+1},\psi_{h+1}\rangle\in\mathcal{G}_{h+1},

|{𝔼𝒟𝔼}[|𝒜|πh(ahz¯h){gh(z¯h;θh)rhgh+1(z¯h+1;θh+1)}]|\displaystyle\left\lvert\{\mathbb{E}_{\mathcal{D}}-\mathbb{E}\}\left[|\mathcal{A}|\pi_{h}(a_{h}\mid\bar{z}_{h})\left\{g_{h}(\bar{z}_{h};\theta_{h})-r_{h}-g_{h+1}(\bar{z}_{h+1};\theta_{h+1})\right\}\right]\right\rvert
|𝒜|(1+2C)ln(|Πh||𝒩ϵ,h||𝒩ϵ,h+1|/δ)m+4|𝒜|ϵ.\displaystyle\leq|\mathcal{A}|\left(1+2C\right)\sqrt{\frac{\ln(|\Pi_{h}||\mathcal{N}_{\epsilon,h}||\mathcal{N}_{\epsilon,h+1}|/\delta)}{m}}+4|\mathcal{A}|\epsilon.

By taking ϵ=1/m\epsilon=1/m, we have πΠ,gh𝒢h,gh+1𝒢h+1\forall\pi\in\Pi,\forall g_{h}\in\mathcal{G}_{h},\forall g_{h+1}\in\mathcal{G}_{h+1}:

|{𝔼𝒟𝔼}[|𝒜|πh(ahz¯h){gh(z¯h)rhgh+1(z¯h+1)}]|\displaystyle|\{\mathbb{E}_{\mathcal{D}}-\mathbb{E}\}[|\mathcal{A}|\pi_{h}(a_{h}\mid\bar{z}_{h})\{g_{h}(\bar{z}_{h})-r_{h}-g_{h+1}(\bar{z}_{h+1})\}]|
\displaystyle\leq|\mathcal{A}|\{1+2C\}\sqrt{\frac{\{2d_{\psi}\ln(1+Cm)+\ln(|\Pi_{h}|/\delta)\}}{m}}+\frac{4|\mathcal{A}|}{m}
\displaystyle\leq 5|\mathcal{A}|\{1+2C\}\sqrt{\frac{\{2d_{\psi}\ln(1+Cm)+\ln(|\Pi_{h}|/\delta)\}}{m}}.

Similarly,

\displaystyle\forall g_{1}\in\mathcal{G}_{1};\ |\{\mathbb{E}_{\mathcal{D}}-\mathbb{E}\}[g_{1}(\bar{z}_{1})]|\leq C\sqrt{\frac{\{d_{\psi}\ln(1+Cm)+\ln(|\Pi_{h}|/\delta)\}}{m}}+\frac{4}{m}
\displaystyle\leq 5C\sqrt{\frac{\{d_{\psi}\ln(1+Cm)+\ln(|\Pi_{h}|/\delta)\}}{m}}.

Finally, to obtain the PAC bound, we need to find $m$ such that

c|𝒜|max(C,1)dψln(max(C,1)m)+ln(|Πmax|TH/δ)mdϕH2ln(HdϕBX2BW2m+1)ϵ.\displaystyle c|\mathcal{A}|\max(C,1)\sqrt{\frac{d_{\psi}\ln(\max(C,1)m)+\ln(|\Pi_{\max}|TH/\delta)}{m}}\sqrt{d_{\phi}H^{2}\ln\left(Hd_{\phi}B^{2}_{X}B^{2}_{W}m+1\right)}\leq\epsilon.

where cc is some constant and

\displaystyle T=cHd_{\phi}\ln(Hd_{\phi}B^{2}_{X}B^{2}_{W}m+1).

By reorganizing terms, the following $m$ is sufficient:

\displaystyle c\sqrt{\frac{\{d_{\psi}+\ln(d_{\phi}|\Pi_{\max}|H^{2}/\delta)\}d_{\phi}H^{2}|\mathcal{A}|^{2}\max(C,1)^{2}\ln^{2}((C+Hd_{\phi}B^{2}_{X}B^{2}_{W}+1)m)}{m}}\leq\epsilon.

By using Lemma 44, we can set

m=B1ϵ2ln(mB1B2)2,\displaystyle m=\frac{B_{1}}{\epsilon^{2}}\ln(mB_{1}B_{2})^{2},
B1={dψ+ln(dϕ|Πmax|H2/δ)}dϕH2|𝒜|2max(C,1)2,B2=C+HdϕBX2BW2+1.\displaystyle B_{1}=\{d_{\psi}+\ln(d_{\phi}|\Pi_{\max}|H^{2}/\delta)\}d_{\phi}H^{2}|\mathcal{A}|^{2}\max(C,1)^{2},B_{2}=C+Hd_{\phi}B^{2}_{X}B^{2}_{W}+1.

Thus, the final sample complexity is

O~(dϕ2{dψ+ln(|Πmax|/δ)}H4|𝒜|2max(C,1)2ϵ2)\displaystyle\tilde{O}\left(\frac{d^{2}_{\phi}\{d_{\psi}+\ln(|\Pi_{\max}|/\delta)\}H^{4}|\mathcal{A}|^{2}\max(C,1)^{2}}{\epsilon^{2}}\right)

where C=ΘV/σmin(K)C=\Theta_{V}/\sigma_{\min}(K).

Appendix H Sample Complexity in Observable Undercomplete Tabular POMDPs

We revisit the existence of value link functions. Then, we show the PO-bilinear rank decomposition. After showing the uniform convergence lemma, we calculate the sample complexity.

Existence of value link functions.

In the tabular case, by setting

ψh(z,o)=𝟏(z)𝟏(o),ϕh(z,s)=𝟏(z)𝟏(s),Kh=𝕀|𝒵h1|𝕆.\displaystyle\psi_{h}(z,o)=\mathbf{1}(z)\otimes\mathbf{1}(o),\phi_{h}(z,s)=\mathbf{1}(z)\otimes\mathbf{1}(s),K_{h}=\mathbb{I}_{|\mathcal{Z}_{h-1}|}\otimes\mathbb{O}.

where 𝟏(z)\mathbf{1}(z) is a one-hot encoding vector over 𝒵h1\mathcal{Z}_{h-1}, we can regard the tabular model as an HSE-POMDP. Here is our assumption.

Assumption 8.

(a) 0rh10\leq r_{h}\leq 1, (b) 𝕆\mathbb{O} is full-column rank and 𝕆11/σ1\|\mathbb{O}^{\dagger}\|_{1}\leq 1/\sigma_{1} for any h[H]h\in[H].

Note we use the 11-norm since this choice is more amenable in the tabular setting. However, even if the norm bound is given in terms of 22-norm, we can still ensure the PAC guarantee (this is because 𝕆1/|𝒮|𝕆2𝕆1|𝒪|\|\mathbb{O}^{\dagger}\|_{1}/\sqrt{|{\mathcal{S}}|}\leq\|\mathbb{O}^{\dagger}\|_{2}\leq\|\mathbb{O}^{\dagger}\|_{1}\sqrt{|\mathcal{O}|}).
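The construction $K_{h}=\mathbb{I}_{|\mathcal{Z}_{h-1}|}\otimes\mathbb{O}$ can be checked exactly on a toy instance (a sketch; the random emission matrix below is a placeholder, not a particular POMDP from the paper):

import numpy as np

rng = np.random.default_rng(0)
nZ, nS, nO = 4, 3, 5
O = rng.dirichlet(np.ones(nO), size=nS).T   # O[o, s] = P(o | s); columns sum to 1
K = np.kron(np.eye(nZ), O)                  # K_h = I_{|Z_{h-1}|} ⊗ O

z, s = 2, 1
phi = np.kron(np.eye(nZ)[z], np.eye(nS)[s])   # phi_h(z, s) = 1(z) ⊗ 1(s)
psi_mean = np.kron(np.eye(nZ)[z], O[:, s])    # E_{o ~ O(s)}[1(z) ⊗ 1(o)]
assert np.allclose(K @ phi, psi_mean)         # E[psi_h] = K_h phi_h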

Since we assume the reward lies in $[0,1]$, value functions on the latent state belong to $\{\langle\theta,\phi_{h}(\cdot)\rangle:\|\theta\|_{\infty}\leq H\}$. Letting $V^{\pi}_{h}=\langle\theta^{\pi}_{h},\phi_{h}\rangle$, value link functions exist by taking $\langle\theta^{\pi}_{h},\mathbf{1}(z)\otimes\mathbb{O}^{\dagger}\mathbf{1}(o)\rangle$. Hence, we take

𝒢h={(z,o)θ,𝟏(z)𝕆𝟏(o);θH}\displaystyle\mathcal{G}_{h}=\left\{(z,o)\mapsto\langle\theta,\mathbf{1}(z)\otimes\mathbb{O}^{\dagger}\mathbf{1}(o)\rangle;\|\theta\|_{\infty}\leq H\right\}

so that the realizability holds. Importantly, we can ensure 𝒢hH/σ1\|\mathcal{G}_{h}\|_{\infty}\leq H/\sigma_{1} since

|θ,𝟏(z)𝕆𝟏(o)|θ𝟏(z)𝕆𝟏(o)1θ𝕆𝟏(o)1H/σ1\displaystyle|\langle\theta,\mathbf{1}(z)\otimes\mathbb{O}^{\dagger}\mathbf{1}(o)\rangle|\leq\|\theta\|_{\infty}\|\mathbf{1}(z)\otimes\mathbb{O}^{\dagger}\mathbf{1}(o)\|_{1}\leq\|\theta\|_{\infty}\|\mathbb{O}^{\dagger}\mathbf{1}(o)\|_{1}\leq H/\sigma_{1}

for any (z,o)𝒵h1×𝒪(z,o)\in\mathcal{Z}_{h-1}\times\mathcal{O}. Note 𝒢h\mathcal{G}_{h} is contained in

{θ,𝟏(z)𝟏(o);θ2H|𝒪|M+1|𝒜|M/σ1}\displaystyle\left\{\langle\theta,\mathbf{1}(z)\otimes\mathbf{1}(o)\rangle;\|\theta\|_{2}\leq H|\mathcal{O}|^{M+1}|\mathcal{A}|^{M}/\sigma_{1}\right\} (21)

This is because each $\langle\theta,\mathbf{1}(z)\otimes\mathbb{O}^{\dagger}\mathbf{1}(o)\rangle$ is equal to $\langle\theta^{\prime},\mathbf{1}(z)\otimes\mathbf{1}(o)\rangle$ for some vector $\theta^{\prime}\in\mathbb{R}^{|\mathcal{Z}_{h-1}|\times|\mathcal{O}|}$. Here, denoting the component of $\theta$ corresponding to $z\in\mathcal{Z}_{h-1}$ by $\theta_{z}\in\mathbb{R}^{|{\mathcal{S}}|}$, $\theta^{\prime}$ is the vector stacking $(\mathbb{O}^{\dagger})^{\top}\theta_{z}$ for each $z\in\mathcal{Z}_{h-1}$. Then, we have

\displaystyle\|(\mathbb{O}^{\dagger})^{\top}\theta_{z}\|_{2}\leq\|\mathbb{O}^{\dagger}\|_{2}\|\theta_{z}\|_{2}\leq\|\mathbb{O}^{\dagger}\|_{1}\sqrt{|\mathcal{O}|}\cdot H\sqrt{|{\mathcal{S}}|}\leq H|\mathcal{O}|/\sigma_{1},

Hence, θ2|𝒪|M|𝒜|M×H|𝒪|/σ1\|\theta^{\prime}\|_{2}\leq|\mathcal{O}|^{M}|\mathcal{A}|^{M}\times H|\mathcal{O}|/\sigma_{1}.

PO-Bilinear decomposition.

Next, recall we derive the PO-bilinear decomposition:

𝔼[θhϕh(z¯h)rhθh+1ϕh+1(z¯h+1);a1:h1π,ahπ]\displaystyle\mathbb{E}[\theta^{\top}_{h}\phi_{h}(\bar{z}_{h})-r_{h}-\theta^{\top}_{h+1}\phi_{h+1}(\bar{z}_{h+1});a_{1:h-1}\sim\pi^{\prime},a_{h}\sim\pi]
=Kh{θhθhπ}{Tπ:h}Kh+1{θh+1θh+1π},𝔼[ϕh(zh1,sh);a1:h1π].\displaystyle=\langle K^{\top}_{h}\{\theta_{h}-\theta^{\pi}_{h}\}-\{T_{\pi:h}\}^{\top}K^{\top}_{h+1}\{\theta_{h+1}-\theta^{\pi}_{h+1}\},\mathbb{E}[\phi_{h}(z_{h-1},s_{h});a_{1:h-1}\sim\pi^{\prime}]\rangle.

Then, BX=1B_{X}=1 and BW=4H|𝒪|M+1|𝒜|M/σ1B_{W}=4H|\mathcal{O}|^{M+1}|\mathcal{A}|^{M}/\sigma_{1}. We use Kh2=𝕆h21,Tπ:h21\|K^{\top}_{h}\|_{2}=\|\mathbb{O}_{h}\|_{2}\leq 1,\|T^{\top}_{\pi:h}\|_{2}\leq 1. This is because

Kh{θhθhπ}{Tπ:h}Kh+1{θh+1θh+1π}2\displaystyle\|K^{\top}_{h}\{\theta_{h}-\theta^{\pi}_{h}\}-\{T_{\pi:h}\}^{\top}K^{\top}_{h+1}\{\theta_{h+1}-\theta^{\pi}_{h+1}\}\|_{2}
θh2+θhπ2+θh+12+θh+1π24H|𝒪|M+1|𝒜|M/σ1.\displaystyle\leq\|\theta_{h}\|_{2}+\|\theta^{\pi}_{h}\|_{2}+\|\theta_{h+1}\|_{2}+\|\theta^{\pi}_{h+1}\|_{2}\leq 4H|\mathcal{O}|^{M+1}|\mathcal{A}|^{M}/\sigma_{1}.

In the last line, we use (21).

Uniform convergence.

Then, we can obtain the following uniform convergence lemma.

Lemma 25.

Let C=H/σ1C=H/\sigma_{1} and dψ=|𝒪|M+1|𝒜|Md_{\psi}=|\mathcal{O}|^{M+1}|\mathcal{A}|^{M}. Then, with probability 1δ1-\delta,

supπΠ,g𝒢|{𝔼𝒟𝔼}[|𝒜|πh(ahz¯h){gh(z¯h)rhgh+1(z¯h+1)}]|\displaystyle\sup_{\pi\in\Pi,g\in\mathcal{G}}\left|\{\mathbb{E}_{\mathcal{D}}-\mathbb{E}\}\left[|\mathcal{A}|\pi_{h}(a_{h}\mid\bar{z}_{h})\left\{g_{h}(\bar{z}_{h})-r_{h}-g_{h+1}(\bar{z}_{h+1})\right\}\right]\right|
5|𝒜|{1+2C}{dψ2ln(1+Cm)+ln(|Πh|/δ)}m\displaystyle\leq 5|\mathcal{A}|\{1+2C\}\sqrt{\frac{\{d^{2}_{\psi}\ln(1+Cm)+\ln(|\Pi_{h}|/\delta)\}}{m}}

and

supg1𝒢1|{𝔼𝒟𝔼}[g1(z¯1)]|5C{dψln(1+Cm)+ln(|Πh|/δ)}m.\displaystyle\sup_{g_{1}\in\mathcal{G}_{1}}|\{\mathbb{E}_{\mathcal{D}}-\mathbb{E}\}[g_{1}(\bar{z}_{1})]|\leq 5C\sqrt{\frac{\{d_{\psi}\ln(1+Cm)+\ln(|\Pi_{h}|/\delta)\}}{m}}.
Proof.

Let dϕ=|𝒮||𝒪|M|𝒜|M,dψ=|𝒪|M+1|𝒜|Md_{\phi}=|{\mathcal{S}}||\mathcal{O}|^{M}|\mathcal{A}|^{M},d_{\psi}=|\mathcal{O}|^{M+1}|\mathcal{A}|^{M}.

Define 𝒩ϵ,h\mathcal{N}_{\epsilon,h} as an ϵ\epsilon-net for {θ:θ2C}\{\theta:\|\theta\|_{2}\leq C\} with respect to L2L^{2}-norm. Define 𝒩ϵ,h\mathcal{N}^{\prime}_{\epsilon,h} as an ϵ\epsilon-net for Πh:𝒵¯hΔ(𝒜)\Pi_{h}:\bar{\mathcal{Z}}_{h}\to\Delta(\mathcal{A}) with respect to the following norm:

d(π,π)=maxz¯h1Z¯h1π(z¯h1)π(z¯h1)1.\displaystyle d(\pi,\pi^{\prime})=\max_{\bar{z}_{h-1}\in\bar{Z}_{h-1}}\|\pi(\cdot\mid\bar{z}_{h-1})-\pi^{\prime}(\cdot\mid\bar{z}_{h-1})\|_{1}.

Then, $|\mathcal{N}_{\epsilon,h}|\leq(1+C/\epsilon)^{d_{\psi}}$ and $|\mathcal{N}^{\prime}_{\epsilon,h}|\leq(1+1/\epsilon)^{d_{\psi}|\mathcal{A}|}$.

Let gh=θh,ψh,gh=θh,ψhg_{h}=\langle\theta_{h},\psi_{h}\rangle,g^{\diamond}_{h}=\langle\theta^{\diamond}_{h},\psi_{h}\rangle where ψh\psi_{h} is a one-hot encoding vector over 𝒵¯h\bar{\mathcal{Z}}_{h}. Then, when θhθh2ϵ,θh+1θh+12ϵ,πhπh1ϵ\|\theta_{h}-\theta^{\diamond}_{h}\|_{2}\leq\epsilon,\|\theta_{h+1}-\theta^{\diamond}_{h+1}\|_{2}\leq\epsilon,\|\pi_{h}-\pi^{\diamond}_{h}\|_{1}\leq\epsilon, we have

|lh(;π,g)lh(;π,g)|\displaystyle|l_{h}(\cdot;\pi,g)-l_{h}(\cdot;\pi^{\diamond},g^{\diamond})| |𝒜|{πhπhC+ghgh+gh+1gh+1}\displaystyle\leq|\mathcal{A}|\{\|\pi_{h}-\pi^{\diamond}_{h}\|_{\infty}C+\|g_{h}-g^{\diamond}_{h}\|_{\infty}+\|g_{h+1}-g^{\diamond}_{h+1}\|_{\infty}\}
|𝒜|{ϵC+θhθh2+θh+1θh+12}\displaystyle\leq|\mathcal{A}|\{\epsilon C+\|\theta_{h}-\theta^{\diamond}_{h}\|_{2}+\|\theta_{h+1}-\theta^{\diamond}_{h+1}\|_{2}\}
3|𝒜|Cϵ.\displaystyle\leq 3|\mathcal{A}|C\epsilon.

Besides, for fixed π𝒩ϵ,h,θh𝒩ϵ,h\pi\in\mathcal{N}^{\prime}_{\epsilon,h},\theta_{h}\in\mathcal{N}_{\epsilon,h}, θh+1𝒩ϵ,h+1\theta_{h+1}\in\mathcal{N}_{\epsilon,{h+1}}, we have

|{𝔼𝒟𝔼}[|𝒜|πh(ahz¯h){gh(z¯h;θh)rhgh+1(z¯h+1;θh+1)}]||𝒜|(1+2C)ln(1/δ)m.\displaystyle\left\lvert\{\mathbb{E}_{\mathcal{D}}-\mathbb{E}\}\left[|\mathcal{A}|\pi_{h}(a_{h}\mid\bar{z}_{h})\left\{g_{h}(\bar{z}_{h};\theta_{h})-r_{h}-g_{h+1}(\bar{z}_{h+1};\theta_{h+1})\right\}\right]\right\rvert\leq|\mathcal{A}|\left(1+2C\right)\sqrt{\frac{\ln(1/\delta)}{m}}.

Then, for π𝒩ϵ,h,θh𝒩ϵ,h\forall\pi\in\mathcal{N}^{\prime}_{\epsilon,h},\forall\theta_{h}\in\mathcal{N}_{\epsilon,h}, θh+1𝒩ϵ,h+1\forall\theta_{h+1}\in\mathcal{N}_{\epsilon,{h+1}}, we have

|{𝔼𝒟𝔼}[|𝒜|πh(ahz¯h){gh(z¯h;θh)rhgh+1(z¯h+1;θh+1)}]||𝒜|(1+2C)ln(|𝒩ϵ,h||𝒩ϵ,h||𝒩ϵ,h+1|/δ)m.\displaystyle\left\lvert\{\mathbb{E}_{\mathcal{D}}-\mathbb{E}\}\left[|\mathcal{A}|\pi_{h}(a_{h}\mid\bar{z}_{h})\left\{g_{h}(\bar{z}_{h};\theta_{h})-r_{h}-g_{h+1}(\bar{z}_{h+1};\theta_{h+1})\right\}\right]\right\rvert\leq|\mathcal{A}|\left(1+2C\right)\sqrt{\frac{\ln(|\mathcal{N}^{\prime}_{\epsilon,h}||\mathcal{N}_{\epsilon,h}||\mathcal{N}_{\epsilon,h+1}|/\delta)}{m}}.

Hence, for any πhΠh,gh=θh,ψh𝒢h,gh+1=θh+1,ψh+1𝒢h+1\pi_{h}\in\Pi_{h},g_{h}=\langle\theta_{h},\psi_{h}\rangle\in\mathcal{G}_{h},g_{h+1}=\langle\theta_{h+1},\psi_{h+1}\rangle\in\mathcal{G}_{h+1},

|{𝔼𝒟𝔼}[|𝒜|πh(ahz¯h){gh(z¯h;θh)rhgh+1(z¯h+1;θh+1)}]|\displaystyle\left\lvert\{\mathbb{E}_{\mathcal{D}}-\mathbb{E}\}\left[|\mathcal{A}|\pi_{h}(a_{h}\mid\bar{z}_{h})\left\{g_{h}(\bar{z}_{h};\theta_{h})-r_{h}-g_{h+1}(\bar{z}_{h+1};\theta_{h+1})\right\}\right]\right\rvert
|𝒜|(1+2C)ln(|𝒩ϵ,h||𝒩ϵ,h||𝒩ϵ,h+1|/δ)m+3|𝒜|Cϵ.\displaystyle\leq|\mathcal{A}|\left(1+2C\right)\sqrt{\frac{\ln(|\mathcal{N}^{\prime}_{\epsilon,h}||\mathcal{N}_{\epsilon,h}||\mathcal{N}_{\epsilon,h+1}|/\delta)}{m}}+3|\mathcal{A}|C\epsilon.

By taking ϵ=1/m\epsilon=1/m, we have πΠ,gh𝒢h,gh+1𝒢h+1;\forall\pi\in\Pi,\forall g_{h}\in\mathcal{G}_{h},\forall g_{h+1}\in\mathcal{G}_{h+1};

|{𝔼𝒟𝔼}[|𝒜|πh(ahz¯h){gh(z¯h)rhgh+1(z¯h+1)}]|\displaystyle|\{\mathbb{E}_{\mathcal{D}}-\mathbb{E}\}[|\mathcal{A}|\pi_{h}(a_{h}\mid\bar{z}_{h})\{g_{h}(\bar{z}_{h})-r_{h}-g_{h+1}(\bar{z}_{h+1})\}]|
|𝒜|{1+2C}{2dψln(1+Cm)+dψ|𝒜|ln(1+m)+ln(1/δ)}m+3|𝒜|Cm\displaystyle\leq|\mathcal{A}|\{1+2C\}\sqrt{\frac{\{2d_{\psi}\ln(1+Cm)+d_{\psi}|\mathcal{A}|\ln(1+m)+\ln(1/\delta)\}}{m}}+\frac{3|\mathcal{A}|C}{m}
10|𝒜|C{dψ|𝒜|ln(1+Cm)+ln(1/δ)}m.\displaystyle\leq 10|\mathcal{A}|C\sqrt{\frac{\{d_{\psi}|\mathcal{A}|\ln(1+Cm)+\ln(1/\delta)\}}{m}}.

Sample Complexity.

Finally, we obtain the PAC bound. We need to find mm such that

c|𝒜|C{dψ|𝒜|ln(1+Cm)+ln(TH/δ)}mdϕH2ln(HdϕBX2BW2m+1)ϵ.\displaystyle c|\mathcal{A}|C\sqrt{\frac{\{d_{\psi}|\mathcal{A}|\ln(1+Cm)+\ln(TH/\delta)\}}{m}}\sqrt{d_{\phi}H^{2}\ln(Hd_{\phi}B^{2}_{X}B^{2}_{W}m+1)}\leq\epsilon.

where cc is some constant and

\displaystyle T=cHd_{\phi}\ln(Hd_{\phi}B^{2}_{X}B^{2}_{W}m+1).

By organizing terms, we get

\displaystyle\sqrt{\frac{|\mathcal{A}|^{3}C^{2}d_{\phi}d_{\psi}H^{2}\ln(H^{2}d_{\phi}/\delta)\ln^{2}((C+d_{\psi}+Hd_{\phi}B^{2}_{X}B^{2}_{W})m)}{m}}\leq\epsilon.

Thus, we need to set

m=O~(|𝒜|3C2dϕdψH2ln(1/δ)ϵ2)\displaystyle m=\tilde{O}\left(\frac{|\mathcal{A}|^{3}C^{2}d_{\phi}d_{\psi}H^{2}\ln(1/\delta)}{\epsilon^{2}}\right)

Hence, the sample complexity is

O~(|𝒜|3C2dϕ2dψH4ln(1/δ)ϵ2).\displaystyle\tilde{O}\left(\frac{|\mathcal{A}|^{3}C^{2}d^{2}_{\phi}d_{\psi}H^{4}\ln(1/\delta)}{\epsilon^{2}}\right).

By some algebra, it is

O~(|𝒜|3M+3|𝒪|3M+1|𝒮|2H6ln(1/δ)ϵ2σ12).\displaystyle\tilde{O}\left(\frac{|\mathcal{A}|^{3M+3}|\mathcal{O}|^{3M+1}|{\mathcal{S}}|^{2}H^{6}\ln(1/\delta)}{\epsilon^{2}\sigma^{2}_{1}}\right).
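For completeness, the algebra substitutes $C=H/\sigma_{1}$, $d_{\phi}=|{\mathcal{S}}||\mathcal{O}|^{M}|\mathcal{A}|^{M}$, and $d_{\psi}=|\mathcal{O}|^{M+1}|\mathcal{A}|^{M}$:

|\mathcal{A}|^{3}C^{2}d^{2}_{\phi}d_{\psi}H^{4}=|\mathcal{A}|^{3}\cdot\frac{H^{2}}{\sigma^{2}_{1}}\cdot|{\mathcal{S}}|^{2}|\mathcal{O}|^{2M}|\mathcal{A}|^{2M}\cdot|\mathcal{O}|^{M+1}|\mathcal{A}|^{M}\cdot H^{4}=\frac{|\mathcal{A}|^{3M+3}|\mathcal{O}|^{3M+1}|{\mathcal{S}}|^{2}H^{6}}{\sigma^{2}_{1}}.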

Later, we prove that we can remove $|\mathcal{O}|^{M}$ using a more refined analysis in Section N.

Global optimality.

We use a result in the proof of [24, Theorem 1.2]. We just set $M=C(1/\sigma_{1})^{4}\ln(SH/\epsilon)$. Note that their Assumption 1 is satisfied when $\|\mathbb{O}^{\dagger}\|_{1}\leq 1/\sigma_{1}$. More specifically, Assumption 1 in [24] requires that for any $b$ and $b^{\prime}$, we have

\displaystyle\|\mathbb{O}b-\mathbb{O}b^{\prime}\|_{1}\geq\sigma_{1}\|b-b^{\prime}\|_{1}.

This is proved as follows. Note for any e,ee,e^{\prime},

𝕆e𝕆e1𝕆1ee1.\displaystyle\|\mathbb{O}^{\dagger}e-\mathbb{O}^{\dagger}e^{\prime}\|_{1}\leq\|\mathbb{O}^{\dagger}\|_{1}\|e-e^{\prime}\|_{1}.

Then, by setting $e=\mathbb{O}b$ and $e^{\prime}=\mathbb{O}b^{\prime}$, Assumption 1 is ensured. Here, we use $\mathbb{O}^{\dagger}\mathbb{O}=I$.

Appendix I Sample Complexity in Observable Overcomplete Tabular POMDPs

To simplify the presentation, we focus on the case when πout=U(𝒜)\pi^{out}=U(\mathcal{A}).

Existence of value link functions.

In the tabular case, by setting

ψh(z,tK)=𝟏(z)𝟏(tK),ϕh(z,s)=𝟏(z)𝟏(s),Kh=𝕀|𝒵h1|𝕆K.\displaystyle\psi_{h}(z,t^{K})=\mathbf{1}(z)\otimes\mathbf{1}(t^{K}),\phi_{h}(z,s)=\mathbf{1}(z)\otimes\mathbf{1}(s),K_{h}=\mathbb{I}_{|\mathcal{Z}_{h-1}|}\otimes\mathbb{O}^{K}.

where 𝟏(z)\mathbf{1}(z) is a one-hot encoding vector over 𝒵h1\mathcal{Z}_{h-1} and 𝟏(tK)\mathbf{1}(t^{K}) is a one-hot encoding vector over 𝒵K=𝒪K×𝒜K1\mathcal{Z}^{K}=\mathcal{O}^{K}\times\mathcal{A}^{K-1}, we can regard the tabular model as an HSE-POMDP. Here is our assumption.

Assumption 9.

(a) 0rh10\leq r_{h}\leq 1, (b) 𝕆K\mathbb{O}^{K} is full-column rank and {𝕆K}11/σ1\|\{\mathbb{O}^{K}\}^{\dagger}\|_{1}\leq 1/\sigma_{1}.

Recall we define 𝕆K\mathbb{O}^{K} in Lemma 3. Since we assume the reward lies in [0,1][0,1], value functions on the latent state belong to {θ,ϕh():θH}\{\langle\theta,\phi_{h}(\cdot)\rangle:\|\theta\|_{\infty}\leq H\}. Here, letting Vhπ()=θhπ,ϕh()V^{\pi}_{h}(\cdot)=\langle\theta^{\pi}_{h},\phi_{h}(\cdot)\rangle, value link functions exist by taking θhπ,𝟏(z){𝕆K}𝟏(tK)\langle\theta^{\pi}_{h},\mathbf{1}(z)\otimes\{\mathbb{O}^{K}\}^{\dagger}\mathbf{1}(t^{K})\rangle. Hence, we take

𝒢h={(z,tK)θhπ,𝟏(z){𝕆K}𝟏(tK);θhπH}\displaystyle\mathcal{G}_{h}=\{(z,t^{K})\mapsto\langle\theta^{\pi}_{h},\mathbf{1}(z)\otimes\{\mathbb{O}^{K}\}^{\dagger}\mathbf{1}(t^{K})\rangle;\|\theta^{\pi}_{h}\|_{\infty}\leq H\}

so that the realizability holds. Importantly, we can ensure $\|\mathcal{G}_{h}\|_{\infty}\leq H/\sigma_{1}$ as in Section H. Then, $\mathcal{G}_{h}$ is contained in

\displaystyle\left\{\langle\theta,\mathbf{1}(z)\otimes\mathbf{1}(t^{K})\rangle;\|\theta\|_{2}\leq H|\mathcal{O}|^{M+1}|\mathcal{A}|^{M}/\sigma_{1}\right\}.
PO-bilinear decomposition.

Next, we derive the PO-bilinear decomposition:

\displaystyle\mathbb{E}[\theta^{\top}_{h}\psi_{h}(z_{h-1},t^{K}_{h})-r_{h}-\theta^{\top}_{h+1}\psi_{h+1}(z_{h},t^{K}_{h+1});a_{1:h-1}\sim\pi^{\prime},a_{h}\sim\pi,a_{h+1:h+K-1}\sim U(\mathcal{A})]
={Kh}{θhθhπ}{Tπ:h}{Kh+1}{θh+1θh+1π},𝔼[ϕh(zh1,sh);a1:h1π].\displaystyle=\langle\{K_{h}\}^{\top}\{\theta_{h}-\theta^{\pi}_{h}\}-\{T_{\pi:h}\}^{\top}\{K_{h+1}\}^{\top}\{\theta_{h+1}-\theta^{\pi}_{h+1}\},\mathbb{E}[\phi_{h}(z_{h-1},s_{h});a_{1:h-1}\sim\pi^{\prime}]\rangle.

Then, BX=1B_{X}=1 and BW=4H|𝒪|M+1|𝒜|M/σ1B_{W}=4H|\mathcal{O}|^{M+1}|\mathcal{A}|^{M}/\sigma_{1}. We use Kh2=𝕆K21,Tπ:h21\|K_{h}\|_{2}=\|\mathbb{O}^{K}\|_{2}\leq 1,\|T^{\top}_{\pi:h}\|_{2}\leq 1.

Sample Complexity.

We can follow the same procedure as in the proof in Section H. Let $d_{\phi}=|{\mathcal{S}}||\mathcal{O}|^{M}|\mathcal{A}|^{M}$ and $d_{\psi}=|\mathcal{O}|^{M+K}|\mathcal{A}|^{M+K-1}$. Hence, with $C=H/\sigma_{1}$, the sample complexity is

\displaystyle\tilde{O}\left(\frac{|\mathcal{A}|^{3}C^{2}d^{2}_{\phi}d_{\psi}H^{4}\ln(1/\delta)}{\epsilon^{2}}\right).

By some algebra, the above is

O~(|𝒜|3M+K+2|𝒪|3M+K|𝒮|2H6ln(1/δ)ϵ2σ12).\displaystyle\tilde{O}\left(\frac{|\mathcal{A}|^{3M+K+2}|\mathcal{O}|^{3M+K}|{\mathcal{S}}|^{2}H^{6}\ln(1/\delta)}{\epsilon^{2}\sigma^{2}_{1}}\right).

Using a more refined analysis, we show that we can remove the $|\mathcal{O}|^{3M+K}$ factor in Section N.

Appendix J Sample Complexity in LQG

In this section, we derive the sample complexity in LQG. We first explain the setting. Then, we prove the existence of link functions and show the PO-bilinear rank decomposition in LQG; Lemma 5 is proved there. Next, we show the uniform convergence result in LQG. Finally, by invoking Theorem 1, we calculate the sample complexity.

We study a finite-horizon discrete time LQG governed by the following equation:

s1=ϵ1,sh+1=Ash+Bah+ϵh,rh=shQsh+ahRah,oh=Osh+τh.\displaystyle s_{1}=\epsilon_{1},s_{h+1}=As_{h}+Ba_{h}+\epsilon_{h},r_{h}=s^{\top}_{h}Qs_{h}+a^{\top}_{h}Ra_{h},o_{h}=Os_{h}+\tau_{h}.

where $\epsilon_{h}$ is Gaussian noise with mean $0$ and covariance $\Sigma_{\epsilon}$, and $\tau_{h}$ is Gaussian noise with mean $0$ and covariance $\Sigma_{\tau}$. We use a matrix $O$ instead of $C$ to avoid notational confusion later. With a linear policy $\pi_{h}(a_{h}\mid o_{h},z_{h-1})=\delta(a_{h}=\mathbf{U}_{1h}o_{h}+\mathbf{U}_{2h}z_{h-1})$, this induces the following system:

[zhohahsh+1]=Ξ1h(π)[zh1sh]+Ξ2h(π),Ξ2h(π)=[0τ𝐔1hτB𝐔1hτ+ϵ],Ξ1h(π)=[I00O𝐔2h𝐔1hOB𝐔2hA+B𝐔1hO]\displaystyle\begin{bmatrix}z^{\prime}_{h}\\ o_{h}\\ a_{h}\\ s_{h+1}\end{bmatrix}=\Xi_{1h}(\pi)\begin{bmatrix}z_{h-1}\\ s_{h}\end{bmatrix}+\Xi_{2h}(\pi),\Xi_{2h}(\pi)=\begin{bmatrix}0\\ \tau\\ \mathbf{U}_{1h}\tau\\ B\mathbf{U}_{1h}\tau+\epsilon\end{bmatrix},\,\Xi_{1h}(\pi)=\begin{bmatrix}I^{\prime}&0\\ 0&O\\ \mathbf{U}_{2h}&\mathbf{U}_{1h}O\\ B\mathbf{U}_{2h}&A+B\mathbf{U}_{1h}O\end{bmatrix}

where $z^{\prime}_{h}$ is the vector obtained by removing $(o_{h},a_{h})$ from $z_{h}$ and $I^{\prime}$ is the matrix mapping $z_{h-1}$ to $z^{\prime}_{h}$. This is derived by

sh+1\displaystyle s_{h+1} =Ash+Bah+ϵ=Ash+B{𝐔1hoh+𝐔2hzh1}+ϵ\displaystyle=As_{h}+Ba_{h}+\epsilon=As_{h}+B\{\mathbf{U}_{1h}o_{h}+\mathbf{U}_{2h}z_{h-1}\}+\epsilon
=(A+B𝐔1hO)sh+B𝐔2hzh1+ϵ+B𝐔1hτ,\displaystyle=(A+B\mathbf{U}_{1h}O)s_{h}+B\mathbf{U}_{2h}z_{h-1}+\epsilon+B\mathbf{U}_{1h}\tau,
ah\displaystyle a_{h} =𝐔1hoh+𝐔2hzh1=𝐔1hOsh+𝐔2hzh1+𝐔1hτ,\displaystyle=\mathbf{U}_{1h}o_{h}+\mathbf{U}_{2h}z_{h-1}=\mathbf{U}_{1h}Os_{h}+\mathbf{U}_{2h}z_{h-1}+\mathbf{U}_{1h}\tau,
oh\displaystyle o_{h} =Osh+τ.\displaystyle=Os_{h}+\tau.
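This closed-loop derivation can be verified numerically (a simulation sketch; the toy dimensions and random system matrices below are placeholder assumptions and are not required to satisfy Assumption 10, since the one-step identities are exact algebra):

import numpy as np

rng = np.random.default_rng(0)
ds, da, do, dz = 3, 2, 3, 4
A, B = rng.standard_normal((ds, ds)), rng.standard_normal((ds, da))
O = rng.standard_normal((do, ds))
U1, U2 = rng.standard_normal((da, do)), rng.standard_normal((da, dz))

s, z = rng.standard_normal(ds), rng.standard_normal(dz)
tau, eps = rng.standard_normal(do), rng.standard_normal(ds)
o = O @ s + tau                 # o_h = O s_h + tau
a = U1 @ o + U2 @ z             # a_h = U_{1h} o_h + U_{2h} z_{h-1}
s_next = A @ s + B @ a + eps    # s_{h+1} = A s_h + B a_h + eps
# the expanded forms displayed above
assert np.allclose(a, U1 @ O @ s + U2 @ z + U1 @ tau)
assert np.allclose(s_next, (A + B @ U1 @ O) @ s + B @ U2 @ z + eps + B @ U1 @ tau)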

We suppose the system is always stable in the sense that the operator norm of Ξ1h(π)\Xi_{1h}(\pi) is upper-bounded by 11. Here is the assumption we introduce throughout this section.

Assumption 10.

Suppose $\max(\|A\|,\|B\|,\|O\|,\|Q\|,\|R\|)\leq\mathbb{C}$ and $\|\Xi_{1h}(\pi)\|\leq 1$ for any $\pi$, and suppose $O$ is full-column rank.

We present the form of linear mean embedding operators in LQGs.

Lemma 26 (Linear mean embedding operator).

Let z𝒵h1,o𝒪,s𝒮z\in\mathcal{Z}_{h-1},o\in\mathcal{O},s\in{\mathcal{S}}. We have

\displaystyle\mathbb{E}_{o\sim\mathbb{O}(s)}\left[\begin{bmatrix}1\\ \begin{bmatrix}z\\ o\end{bmatrix}\otimes\begin{bmatrix}z\\ o\end{bmatrix}\end{bmatrix}\right]=K_{h}\begin{bmatrix}1\\ \begin{bmatrix}z\\ s\end{bmatrix}\otimes\begin{bmatrix}z\\ s\end{bmatrix}\end{bmatrix},\quad K_{h}=\begin{bmatrix}1&\mathbf{0}\\ \mathrm{vec}\left(\begin{bmatrix}0&0\\ 0&\Sigma_{\tau}\end{bmatrix}\right)&\begin{bmatrix}\mathbb{I}&0\\ 0&O\end{bmatrix}\otimes\begin{bmatrix}\mathbb{I}&0\\ 0&O\end{bmatrix}\end{bmatrix}.
Proof.

Here, we have

\displaystyle\mathbb{E}_{o\sim\mathbb{O}(s)}\left[\begin{bmatrix}z\\ o\end{bmatrix}\otimes\begin{bmatrix}z\\ o\end{bmatrix}\right]=\mathrm{Vec}\left[\begin{bmatrix}zz^{\top}&zo^{\top}\\ oz^{\top}&oo^{\top}\end{bmatrix}\right]=\mathrm{Vec}\left[\begin{bmatrix}zz^{\top}&zs^{\top}O^{\top}\\ Osz^{\top}&Oss^{\top}O^{\top}+\Sigma_{\tau}\end{bmatrix}\right]
\displaystyle=\mathrm{Vec}\left[\begin{bmatrix}0&0\\ 0&\Sigma_{\tau}\end{bmatrix}\right]+\begin{bmatrix}\mathbb{I}&0\\ 0&O\end{bmatrix}\otimes\begin{bmatrix}\mathbb{I}&0\\ 0&O\end{bmatrix}\times\mathrm{Vec}\left[\begin{bmatrix}zz^{\top}&zs^{\top}\\ sz^{\top}&ss^{\top}\end{bmatrix}\right].

From the second line to the third line, we use the formula $\mathrm{vec}[A_{1}A_{2}A_{3}]=(A^{\top}_{3}\otimes A_{1})\mathrm{vec}(A_{2})$. This immediately concludes the result.
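The lemma can also be checked by Monte Carlo on a toy instance (a sketch; $O$, $\Sigma_{\tau}$, $z$, $s$ below are arbitrary placeholder values, and the check covers the nontrivial block of $K_{h}$):

import numpy as np

rng = np.random.default_rng(1)
dz, ds, do = 2, 2, 3
O = rng.standard_normal((do, ds))
Sigma_tau = 0.3 * np.eye(do)
z, s = rng.standard_normal(dz), rng.standard_normal(ds)

n = 200_000
taus = rng.multivariate_normal(np.zeros(do), Sigma_tau, size=n)
V = np.hstack([np.tile(z, (n, 1)), (O @ s)[None, :] + taus])   # samples of [z; o]
lhs = np.einsum('ni,nj->ij', V, V).reshape(-1) / n             # E[[z;o] ⊗ [z;o]]

M = np.block([[np.eye(dz), np.zeros((dz, ds))],
              [np.zeros((do, dz)), O]])                        # [[I, 0], [0, O]]
pad = np.zeros((dz + do, dz + do)); pad[dz:, dz:] = Sigma_tau  # the vec([[0,0],[0,Sigma_tau]]) offset
w = np.concatenate([z, s])
rhs = np.kron(M, M) @ np.kron(w, w) + pad.reshape(-1)
print(np.max(np.abs(lhs - rhs)))  # close to 0 up to Monte-Carlo error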

Thus, the matrix $K_{h}$ has a left inverse when $O$ is full-column rank, as follows:

Kh=[1𝟎[𝕀00O][𝕀00O]vec([000Στ])[𝕀00O][𝕀00O]].\displaystyle K^{\dagger}_{h}=\begin{bmatrix}1&\mathbf{0}\\ -\begin{bmatrix}\mathbb{I}&0\\ 0&O^{\dagger}\end{bmatrix}\otimes\begin{bmatrix}\mathbb{I}&0\\ 0&O^{\dagger}\end{bmatrix}\mathrm{vec}\left(\begin{bmatrix}0&0\\ 0&\Sigma_{\tau}\end{bmatrix}\right)&\begin{bmatrix}\mathbb{I}&0\\ 0&O^{\dagger}\end{bmatrix}\otimes\begin{bmatrix}\mathbb{I}&0\\ 0&O^{\dagger}\end{bmatrix}\end{bmatrix}.

We use a block matrix inversion formula:

[A110A3A2A11A3][A10A2A3]=I.\displaystyle\begin{bmatrix}A^{-1}_{1}&0\\ -A^{\dagger}_{3}A_{2}A^{-1}_{1}&A^{\dagger}_{3}\end{bmatrix}\begin{bmatrix}A_{1}&0\\ A_{2}&A_{3}\end{bmatrix}=I.
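This formula is easy to verify numerically (a sketch; the random blocks of generic sizes below are placeholder assumptions, and $A_{3}$ drawn this way is full column rank almost surely, so $A_{3}^{\dagger}A_{3}=I$):

import numpy as np

rng = np.random.default_rng(2)
A1 = rng.standard_normal((3, 3))
A2 = rng.standard_normal((4, 3))
A3 = rng.standard_normal((4, 2))
A1_inv, A3_dag = np.linalg.inv(A1), np.linalg.pinv(A3)
L = np.block([[A1_inv, np.zeros((3, 4))],
              [-A3_dag @ A2 @ A1_inv, A3_dag]])   # the displayed left inverse
Mb = np.block([[A1, np.zeros((3, 2))],
               [A2, A3]])                          # [[A1, 0], [A2, A3]]
assert np.allclose(L @ Mb, np.eye(5))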

J.1 Existence of Link Functions

Lemma 27 (Value functions in LQGs).

Let πh(ao,z)=δ(a=𝐔1ho+𝐔2hz)\pi_{h}(a\mid o,z)=\delta(a=\mathbf{U}_{1h}o+\mathbf{U}_{2h}z) for z𝒵h1,o𝒪z\in\mathcal{Z}_{h-1},o\in\mathcal{O}. Then, a value function has a bilinear form:

Vhπ(z,s)=[z,s]Λh[z,s]+Γh.\displaystyle V^{\pi}_{h}(z,s)=[z^{\top},s^{\top}]\Lambda_{h}[z^{\top},s^{\top}]^{\top}+\Gamma_{h}.

For any $h\in[H]$, the parameters $\Lambda_{h},\Gamma_{h}$ are defined recursively by

\displaystyle\Lambda_{H}=\begin{bmatrix}\mathbf{U}^{\top}_{2H}R\mathbf{U}_{2H}&\mathbf{U}^{\top}_{2H}R\mathbf{U}_{1H}O\\ \{\mathbf{U}^{\top}_{2H}R\mathbf{U}_{1H}O\}^{\top}&Q+O^{\top}\mathbf{U}^{\top}_{1H}R\mathbf{U}_{1H}O\end{bmatrix},\quad\Gamma_{H}=\mathop{\mathrm{tr}}(\mathbf{U}^{\top}_{1H}R\mathbf{U}_{1H}\Sigma_{\tau}),
\displaystyle\Lambda_{h}=\Xi^{\top}_{1h}(\pi)\Lambda_{h+1}\Xi_{1h}(\pi)+\Sigma_{\Lambda_{h1}},\quad\Sigma_{\Lambda_{h1}}=\begin{bmatrix}\mathbf{U}^{\top}_{2h}R\mathbf{U}_{2h}&\mathbf{U}^{\top}_{2h}R\mathbf{U}_{1h}O\\ \{\mathbf{U}^{\top}_{2h}R\mathbf{U}_{1h}O\}^{\top}&Q+O^{\top}\mathbf{U}^{\top}_{1h}R\mathbf{U}_{1h}O\end{bmatrix},
\displaystyle\Gamma_{h}=\mathop{\mathrm{tr}}\left(\Lambda_{h+1}\Sigma_{\Lambda_{h2}}(\pi)\right)+\Gamma_{h+1},\quad\Sigma_{\Lambda_{h2}}(\pi)=\begin{bmatrix}0&0&0&0\\ 0&\Sigma_{\tau}&\Sigma_{\tau}\mathbf{U}^{\top}_{1h}&\Sigma_{\tau}\mathbf{U}^{\top}_{1h}B^{\top}\\ 0&\mathbf{U}_{1h}\Sigma_{\tau}&\mathbf{U}_{1h}\Sigma_{\tau}\mathbf{U}^{\top}_{1h}&\mathbf{U}_{1h}\Sigma_{\tau}\mathbf{U}^{\top}_{1h}B^{\top}\\ 0&B\mathbf{U}_{1h}\Sigma_{\tau}&B\mathbf{U}_{1h}\Sigma_{\tau}\mathbf{U}^{\top}_{1h}&B\mathbf{U}_{1h}\Sigma_{\tau}\mathbf{U}^{\top}_{1h}B^{\top}+\Sigma_{\epsilon}\end{bmatrix}.
Proof.

The proof proceeds by backward induction on $h$, starting from level $H$. First, we have

\displaystyle V^{\pi}_{H}(z,s)=s^{\top}Qs+\mathbb{E}_{o\sim O(s)}[\{\mathbf{U}_{1H}o+\mathbf{U}_{2H}z\}^{\top}R\{\mathbf{U}_{1H}o+\mathbf{U}_{2H}z\}]
\displaystyle=s^{\top}Qs+\mathbb{E}_{o\sim O(s)}[\{\mathbf{U}_{1H}Os+\mathbf{U}_{1H}\tau+\mathbf{U}_{2H}z\}^{\top}R\{\mathbf{U}_{1H}Os+\mathbf{U}_{1H}\tau+\mathbf{U}_{2H}z\}]
\displaystyle=s^{\top}\{Q+O^{\top}\mathbf{U}^{\top}_{1H}R\mathbf{U}_{1H}O\}s+z^{\top}\mathbf{U}^{\top}_{2H}R\mathbf{U}_{2H}z+2z^{\top}\mathbf{U}^{\top}_{2H}R\mathbf{U}_{1H}Os+\mathop{\mathrm{tr}}(\mathbf{U}^{\top}_{1H}R\mathbf{U}_{1H}\Sigma_{\tau})
\displaystyle=[z^{\top},s^{\top}]\begin{bmatrix}\mathbf{U}^{\top}_{2H}R\mathbf{U}_{2H}&\mathbf{U}^{\top}_{2H}R\mathbf{U}_{1H}O\\ \{\mathbf{U}^{\top}_{2H}R\mathbf{U}_{1H}O\}^{\top}&Q+O^{\top}\mathbf{U}^{\top}_{1H}R\mathbf{U}_{1H}O\end{bmatrix}[z^{\top},s^{\top}]^{\top}+\mathop{\mathrm{tr}}(\mathbf{U}^{\top}_{1H}R\mathbf{U}_{1H}\Sigma_{\tau}).

We now apply the inductive step. Supposing the statement is true at horizon $h+1$, we have

\displaystyle V^{\pi}_{h}(z,s)=\Gamma_{h+1}+s^{\top}Qs+\mathbb{E}_{o\sim O(s)}[\{\mathbf{U}_{1h}o+\mathbf{U}_{2h}z\}^{\top}R\{\mathbf{U}_{1h}o+\mathbf{U}_{2h}z\}]
+𝔼oO(s),aπ(o,z),s𝕋(s,a)[[z1,o,a,s]Λh+1[z1,o,a,s]]\displaystyle+\mathbb{E}_{o\sim O(s),a\sim\pi(o,z),s^{\prime}\sim\mathbb{T}(s,a)}[[z^{\top}_{-1},o^{\top},a^{\top},s^{\prime\top}]\Lambda_{h+1}[z^{\top}_{-1},o^{\top},a^{\top},s^{\prime\top}]^{\top}]

where $z_{-1}$ (written $z^{\prime}$ below) is the vector obtained by removing the oldest pair $(o,a)$ from $z$, so that $z_{h}=(z_{-1},o,a)$, and $s^{\prime}$ is the state at $h+1$. Here, recall we have

[(z),o,a,s]=Ξ1h(π)[z,s]+Ξ2h(π).\displaystyle[(z^{\prime})^{\top},o^{\top},a^{\top},s^{\prime\top}]^{\top}=\Xi_{1h}(\pi)[z^{\top},s^{\top}]^{\top}+\Xi_{2h}(\pi).

Then, the statement is concluded by some algebra.

Lemma 28 (Norm constraints on value functions).

We can set ΛhΛ,h,ΓhΓ,h\|\Lambda_{h}\|\leq\mathbb{C}_{\Lambda,h},\|\Gamma_{h}\|\leq\mathbb{C}_{\Gamma,h} such that

Λ,h=poly(,H),Γ,h=poly(do,ds,da,,H).\displaystyle\mathbb{C}_{\Lambda,h}=\mathrm{poly}(\mathbb{C},H),\mathbb{C}_{\Gamma,h}=\mathrm{poly}(d_{o},d_{s},d_{a},\mathbb{C},H).
Proof.

We have

ΛHpoly(,H),ΓHpoly(,H).\displaystyle\|\Lambda_{H}\|\leq\mathrm{poly}(\mathbb{C},H),\quad\|\Gamma_{H}\|\leq\mathrm{poly}(\mathbb{C},H).

Then,

ΛhΞ1h(π)Λh+1Ξ1h(π)+poly(,H).\displaystyle\|\Lambda_{h}\|\leq\|\Xi_{1h}(\pi)\|\|\Lambda_{h+1}\|\|\Xi_{1h}(\pi)\|+\mathrm{poly}(\mathbb{C},H).

Since we assume Ξ1h(π)1\|\Xi_{1h}(\pi)\|\leq 1, this immediately leads to

Λhpoly(,H).\displaystyle\|\Lambda_{h}\|\leq\mathrm{poly}(\mathbb{C},H).

Besides,

Γhpoly(H,do,ds,da,)Λh+1+Γh+1.\displaystyle\|\Gamma_{h}\|\leq\mathrm{poly}(H,d_{o},d_{s},d_{a},\mathbb{C})\|\Lambda_{h+1}\|+\|\Gamma_{h+1}\|.

Thus,

Γhpoly(H,do,ds,da,).\displaystyle\|\Gamma_{h}\|\leq\mathrm{poly}(H,d_{o},d_{s},d_{a},\mathbb{C}).

Next, we set the norm on the function class 𝒢h\mathcal{G}_{h}.

Lemma 29 (Realizability on LQGs).

We set

𝒢h={Γ¯h+(z,o)Λ¯h(z,o)Λ¯hCΛ¯,h,|Γ¯h|CΓ¯,h,zZh1,o𝒪},\displaystyle\mathcal{G}_{h}=\left\{\bar{\Gamma}_{h}+(z^{\top},o^{\top})\bar{\Lambda}_{h}(z^{\top},o^{\top})^{\top}\mid\|\bar{\Lambda}_{h}\|\leq C_{\bar{\Lambda},h},|\bar{\Gamma}_{h}|\leq C_{\bar{\Gamma},h},z\in Z_{h-1},o\in\mathcal{O}\right\},
CΛ¯,h=poly(H,do,ds,da,,O),CΓ¯,h=poly(H,do,ds,da,,O).\displaystyle C_{\bar{\Lambda},h}=\mathrm{poly}(H,d_{o},d_{s},d_{a},\mathbb{C},\|O^{\dagger}\|),C_{\bar{\Gamma},h}=\mathrm{poly}(H,d_{o},d_{s},d_{a},\mathbb{C},\|O^{\dagger}\|).

The function class $\mathcal{G}_{h}$ includes at least one value link function for any linear policy $\pi=\delta(a=\mathbf{U}_{1h}o+\mathbf{U}_{2h}z)$ with $\|\mathbf{U}_{1h}\|\leq\mathbb{C},\|\mathbf{U}_{2h}\|\leq\mathbb{C}$.

Proof.

Here, we have

Vhπ()\displaystyle V^{\pi}_{h}(\cdot) =Γh+tr{Λh[zzzsszss]}\displaystyle=\Gamma_{h}+\mathop{\mathrm{tr}}\left\{\Lambda_{h}\begin{bmatrix}zz^{\top}&zs^{\top}\\ sz^{\top}&ss^{\top}\end{bmatrix}\right\}
=Γh+𝔼o𝕆(s)[tr{Λh[zzzo{O}OozO{ooΣτ}{O}]}]\displaystyle=\Gamma_{h}+\mathbb{E}_{o\sim\mathbb{O}(s)}\left[\mathop{\mathrm{tr}}\left\{\Lambda_{h}\begin{bmatrix}zz^{\top}&zo^{\top}\{O^{\dagger}\}^{\top}\\ O^{\dagger}oz^{\top}&O^{\dagger}\left\{oo^{\top}-\Sigma_{\tau}\right\}\{O^{\dagger}\}^{\top}\end{bmatrix}\right\}\right]
=Γhtr{Λh[000OΣτ{O}]}\displaystyle=\Gamma_{h}-\mathop{\mathrm{tr}}\left\{\Lambda_{h}\begin{bmatrix}0&0\\ 0&O^{\dagger}\Sigma_{\tau}\{O^{\dagger}\}^{\top}\end{bmatrix}\right\}
+𝔼o𝕆(s)[[z,o][I00{O}]Λh[I00O][zo]].\displaystyle+\mathbb{E}_{o\sim\mathbb{O}(s)}\left[[z^{\top},o^{\top}]\begin{bmatrix}I&0\\ 0&\{O^{\dagger}\}^{\top}\end{bmatrix}\Lambda_{h}\begin{bmatrix}I&0\\ 0&O^{\dagger}\end{bmatrix}\begin{bmatrix}z\\ o\end{bmatrix}\right].

The norm constraint on Λ¯h\bar{\Lambda}_{h} is decided by the following calculation:

\displaystyle\left\|\begin{bmatrix}I&0\\ 0&\{O^{\dagger}\}^{\top}\end{bmatrix}\Lambda_{h}\begin{bmatrix}I&0\\ 0&O^{\dagger}\end{bmatrix}\right\|\leq\|O^{\dagger}\|^{2}_{2}\|\Lambda_{h}\|\leq\mathrm{poly}(H,d_{o},d_{s},d_{a},\mathbb{C},\|O^{\dagger}\|).

Then, the norm on Γ¯h\bar{\Gamma}_{h} is decided by the following calculation:

|Γhtr{Λh[000OΣτ{O}]}|\displaystyle\left\lvert\Gamma_{h}-\mathop{\mathrm{tr}}\left\{\Lambda_{h}\begin{bmatrix}0&0\\ 0&O^{\dagger}\Sigma_{\tau}\{O^{\dagger}\}^{\top}\end{bmatrix}\right\}\right\rvert |Γh|+|tr{Λh[000OΣτ{O}]}|\displaystyle\leq|\Gamma_{h}|+\left\lvert\mathop{\mathrm{tr}}\left\{\Lambda_{h}\begin{bmatrix}0&0\\ 0&O^{\dagger}\Sigma_{\tau}\{O^{\dagger}\}^{\top}\end{bmatrix}\right\}\right\rvert
\displaystyle\leq|\Gamma_{h}|+\|\Lambda_{h}\|_{2}\|O^{\dagger}\|^{2}_{2}\mathrm{Tr}(\Sigma_{\tau})
\displaystyle\leq|\Gamma_{h}|+\|\Lambda_{h}\|_{2}\|O^{\dagger}\|^{2}_{2}\mathbb{C}d_{o}
=poly(H,do,ds,da,,O).\displaystyle=\mathrm{poly}(H,d_{o},d_{s},d_{a},\mathbb{C},\|O^{\dagger}\|).

From the first line to the second line, we use Lemma 43.

J.2 PO-bilinear Rank Decomposition

Lemma 30 (Bilinear rank decomposition for LQG).

For any gh+1𝒢h+1,gh𝒢hg_{h+1}\in\mathcal{G}_{h+1},g_{h}\in\mathcal{G}_{h}, we have the following bilinear rank decomposition:

𝔼[gh+1(zh,oh+1)+rhgh(zh1,oh);a1:h1π,ahπ]=X(π),W(π)\displaystyle\mathbb{E}[g_{h+1}(z_{h},o_{h+1})+r_{h}-g_{h}(z_{h-1},o_{h});a_{1:h-1}\sim\pi^{\prime},a_{h}\sim\pi]=\langle X(\pi^{\prime}),W(\pi)\rangle

where

Xh(π)=(1,𝔼[[zh1,sh][zh1,sh];a1:h1π]),\displaystyle X_{h}(\pi^{\prime})=(1,\mathbb{E}[[z^{\top}_{h-1},s^{\top}_{h}]\otimes[z^{\top}_{h-1},s^{\top}_{h}];a_{1:h-1}\sim\pi^{\prime}])^{\top},
Wh(π)=[tr({Λ¯hΛ¯h}[000Στ]+[I00O]{Λ¯h+1Λ¯h+1}[I00O]ΣΛh2(π))vec[[I00O]{Λ¯hΛ¯h}[I00O]+Ξ1h(π)[I00O]{Λ¯h+1Λ¯h+1}[I00O]Ξ1h(π)]].\displaystyle W_{h}(\pi)=\begin{bmatrix}\mathop{\mathrm{tr}}\left(\{\bar{\Lambda}_{h}-\bar{\Lambda}^{\star}_{h}\}\begin{bmatrix}0&0\\ 0&\Sigma_{\tau}\end{bmatrix}+\begin{bmatrix}I&0\\ 0&O^{\top}\end{bmatrix}\{\bar{\Lambda}_{h+1}-\bar{\Lambda}^{\star}_{h+1}\}\begin{bmatrix}I&0\\ 0&O\end{bmatrix}\Sigma_{\Lambda_{h2}}(\pi)\right)\\ \mathrm{vec}\left[\begin{bmatrix}I&0\\ 0&O^{\top}\end{bmatrix}\{\bar{\Lambda}_{h}-\bar{\Lambda}^{\star}_{h}\}\begin{bmatrix}I&0\\ 0&O\end{bmatrix}+\Xi^{\top}_{1h}(\pi)\begin{bmatrix}I&0\\ 0&O^{\top}\end{bmatrix}\{\bar{\Lambda}^{\star}_{h+1}-\bar{\Lambda}_{h+1}\}\begin{bmatrix}I&0\\ 0&O\end{bmatrix}\Xi_{1h}(\pi)\right]\end{bmatrix}.

Here, Ξ1h(π)\Xi_{1h}(\pi) and ΣΛh2(π)\Sigma_{\Lambda_{h2}}(\pi) depend on a policy π\pi. The following norm constraints hold:

Xh(π)2poly(H,do,ds,da,,O),Wh(π)2poly(H,do,ds,da,,O).\displaystyle\|X_{h}(\pi^{\prime})\|_{2}\leq\mathrm{poly}(H,d_{o},d_{s},d_{a},\mathbb{C},\|O^{\dagger}\|),\|W_{h}(\pi)\|_{2}\leq\mathrm{poly}(H,d_{o},d_{s},d_{a},\mathbb{C},\|O^{\dagger}\|).
Proof.

We have

𝔼[gh(zh1,oh)rh(zh1,sh)gh+1(zh,oh+1);a1:h1π,ahπ]\displaystyle\mathbb{E}[g_{h}(z_{h-1},o_{h})-r_{h}(z_{h-1},s_{h})-g_{h+1}(z_{h},o_{h+1});a_{1:h-1}\sim\pi^{\prime},a_{h}\sim\pi]
\displaystyle=-\mathbb{E}[r_{h}(z_{h-1},s_{h});a_{1:h-1}\sim\pi^{\prime},a_{h}\sim\pi]
+𝔼[Γ¯h+(zh1,oh)Λ¯h(zh1,oh)Γ¯h+1(zh,oh+1)Λ¯h+1(zh,oh+1);a1:h1π,ahπ].\displaystyle+\mathbb{E}\left[\bar{\Gamma}_{h}+(z_{h-1}^{\top},o^{\top}_{h})\bar{\Lambda}_{h}(z_{h-1}^{\top},o^{\top}_{h})^{\top}-\bar{\Gamma}_{h+1}-(z_{h}^{\top},o^{\top}_{h+1})\bar{\Lambda}_{h+1}(z_{h}^{\top},o^{\top}_{h+1})^{\top};a_{1:h-1}\sim\pi^{\prime},a_{h}\sim\pi\right]. (22)

Since we have

𝔼[rh(zh1,sh);a1:h1π,ahπ]\displaystyle\mathbb{E}[r_{h}(z_{h-1},s_{h});a_{1:h-1}\sim\pi^{\prime},a_{h}\sim\pi]
\displaystyle=\mathbb{E}\left[\bar{\Gamma}^{\star}_{h}+(z_{h-1}^{\top},o^{\top}_{h})\bar{\Lambda}^{\star}_{h}(z_{h-1}^{\top},o^{\top}_{h})^{\top}-\bar{\Gamma}^{\star}_{h+1}-(z_{h}^{\top},o^{\top}_{h+1})\bar{\Lambda}^{\star}_{h+1}(z_{h}^{\top},o^{\top}_{h+1})^{\top};a_{1:h-1}\sim\pi^{\prime},a_{h}\sim\pi\right],

we focus on the term (22).

Hereafter, we suppose the expectation is always taken under $a_{1:h-1}\sim\pi^{\prime},a_{h}\sim\pi$. We also denote $z=z_{h-1},z^{\prime}=z_{h},o_{h}=o,o_{h+1}=o^{\prime},s_{h}=s,s_{h+1}=s^{\prime}$ to simplify the presentation. Using this simplified notation, we get

𝔼[(z,o)Λ¯h(z,o)]=𝔼[(z,(Os+τ))Λ¯h(z,(Os+τ))]\displaystyle\mathbb{E}\left[(z^{\top},o^{\top})\bar{\Lambda}_{h}(z^{\top},o^{\top})^{\top}\right]=\mathbb{E}\left[(z^{\top},(Os+\tau)^{\top})\bar{\Lambda}_{h}(z^{\top},(Os+\tau)^{\top})^{\top}\right]
=𝔼[[z,s][I00O]Λ¯h[I00O][zs]]+tr(Λ¯h[000Στ]).\displaystyle=\mathbb{E}\left[[z^{\top},s^{\top}]\begin{bmatrix}I&0\\ 0&O^{\top}\end{bmatrix}\bar{\Lambda}_{h}\begin{bmatrix}I&0\\ 0&O\end{bmatrix}\begin{bmatrix}z\\ s\end{bmatrix}\right]+\mathop{\mathrm{tr}}\left(\bar{\Lambda}_{h}\begin{bmatrix}0&0\\ 0&\Sigma_{\tau}\end{bmatrix}\right).

Besides,

𝔼[(z,o)Λ¯h+1(z,o)]\displaystyle\mathbb{E}\left[(z^{\prime\top},o^{\prime\top})\bar{\Lambda}_{h+1}(z^{\prime\top},o^{\prime\top})^{\top}\right]
\displaystyle=\mathbb{E}\left[[z^{\prime\top},s^{\prime\top}]\begin{bmatrix}I&0\\ 0&O^{\top}\end{bmatrix}\bar{\Lambda}_{h+1}\begin{bmatrix}I&0\\ 0&O\end{bmatrix}\begin{bmatrix}z^{\prime}\\ s^{\prime}\end{bmatrix}\right]+\mathop{\mathrm{tr}}\left(\bar{\Lambda}_{h+1}\begin{bmatrix}0&0\\ 0&\Sigma_{\tau}\end{bmatrix}\right)
\displaystyle=\mathbb{E}\left[[z^{\top},s^{\top}]\Xi^{\top}_{1h}(\pi)\begin{bmatrix}I&0\\ 0&O^{\top}\end{bmatrix}\bar{\Lambda}_{h+1}\begin{bmatrix}I&0\\ 0&O\end{bmatrix}\Xi_{1h}(\pi)\begin{bmatrix}z\\ s\end{bmatrix}\right]+\mathop{\mathrm{tr}}\left(\bar{\Lambda}_{h+1}\begin{bmatrix}0&0\\ 0&\Sigma_{\tau}\end{bmatrix}\right)+
+tr([I00O]Λ¯h+1[I00O]ΣΛh2(π)).\displaystyle+\mathop{\mathrm{tr}}\left(\begin{bmatrix}I&0\\ 0&O^{\top}\end{bmatrix}\bar{\Lambda}_{h+1}\begin{bmatrix}I&0\\ 0&O\end{bmatrix}\Sigma_{\Lambda_{h2}}(\pi)\right).

Then, the bilinear decomposition is clear by using

A2A1A2=tr(A1A2A2)=vec(A1)vec(A2A2)=vec(A1),A2A2.\displaystyle A^{\top}_{2}A_{1}A_{2}=\mathrm{tr}(A_{1}A_{2}A^{\top}_{2})=\mathrm{vec}(A^{\top}_{1})^{\top}\mathrm{vec}(A_{2}A^{\top}_{2})=\langle\mathrm{vec}(A^{\top}_{1}),A_{2}\otimes A_{2}\rangle.

where A2A_{2} is any vector and A1A_{1} is any matrix.
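As a quick numerical sanity check of this identity, the following minimal sketch (assuming numpy; all variable names are ours) compares the three expressions:

import numpy as np

rng = np.random.default_rng(0)
d = 4
A1 = rng.standard_normal((d, d))   # any matrix A_1
A2 = rng.standard_normal(d)        # any vector A_2
quad = A2 @ A1 @ A2                          # A_2^T A_1 A_2
tr = np.trace(A1 @ np.outer(A2, A2))         # tr(A_1 A_2 A_2^T)
ip = A1.ravel() @ np.kron(A2, A2)            # <vec(A_1^T), A_2 (x) A_2>
# (the row-major ravel of A_1 coincides with the column-major vec of A_1^T)
assert np.isclose(quad, tr) and np.isclose(tr, ip)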

Next, we calculate upper bounds on the norms.

Xh(π)22\displaystyle\|X_{h}(\pi^{\prime})\|^{2}_{2} =1+𝔼(z,s)dhπ[[zzzsszss]]F2\displaystyle=1+\left\|\mathbb{E}_{(z,s)\sim d^{\pi^{\prime}}_{h}}\left[\begin{bmatrix}zz^{\top}&zs^{\top}\\ sz^{\top}&ss^{\top}\end{bmatrix}\right]\right\|^{2}_{F}
=1+𝔼(z,s)dh1π[Ξ1h(π)[zzzsszss]Ξ1h(π)]+ΣΛh2(π)F2\displaystyle=1+\left\|\mathbb{E}_{(z,s)\sim d^{\pi^{\prime}}_{h-1}}\left[\Xi_{1h}(\pi)\begin{bmatrix}zz^{\top}&zs^{\top}\\ sz^{\top}&ss^{\top}\end{bmatrix}\Xi^{\top}_{1h}(\pi)\right]+\Sigma_{\Lambda_{h2}}(\pi)\right\|^{2}_{F}
1+Ξ1h(π)24𝔼(z,s)dh1π[[zzzsszss]]F2+ΣΛh2(π)F2\displaystyle\leq 1+\|\Xi_{1h}(\pi)\|^{4}_{2}\left\|\mathbb{E}_{(z,s)\sim d^{\pi^{\prime}}_{h-1}}\left[\begin{bmatrix}zz^{\top}&zs^{\top}\\ sz^{\top}&ss^{\top}\end{bmatrix}\right]\right\|^{2}_{F}+\|\Sigma_{\Lambda_{h2}}(\pi)\|^{2}_{F}
1+Xh1(π)22+ΣΛh2(π)F2.\displaystyle\leq 1+\|X_{h-1}(\pi^{\prime})\|^{2}_{2}+\|\Sigma_{\Lambda_{h2}}(\pi)\|^{2}_{F}.

From the third line to the fourth line, we use Ξ1h(π)21\|\Xi_{1h}(\pi)\|_{2}\leq 1. Thus, Xh(π)2poly(H,do,ds,da,O,)\|X_{h}(\pi^{\prime})\|_{2}\leq\mathrm{poly}(H,d_{o},d_{s},d_{a},\|O^{\dagger}\|,\mathbb{C}).

Next, we consider W_{h}(\pi). By some algebra, we can see

\displaystyle\|W_{h}(\pi)\|_{2}\leq\mathrm{poly}(\|\bar{\Lambda}_{h}\|,\|\bar{\Lambda}_{h+1}\|,\mathbb{C},d_{o},d_{s},d_{a},\|\Xi_{1h}(\pi)\|)
\displaystyle\leq\mathrm{poly}(H,d_{o},d_{s},d_{a},\|O^{\dagger}\|,\mathbb{C}). ∎

Lemma 31 (Variance of marginal distribution).

Recall d^{\pi}_{h}(z_{h-1},s_{h}) is the marginal distribution over \mathcal{Z}_{h-1}\times{\mathcal{S}} at step h when we execute a_{1:h-1}\sim\pi. The distribution d^{\pi}_{h}(z_{h-1},s_{h}) is Gaussian with mean 0, and the operator norm of its covariance is upper-bounded by \mathrm{poly}(H,d_{o},d_{a},d_{s},\mathbb{C}).

Proof.

We first calculate the operator norm of the covariance of d^{\pi}_{h}(z_{h-1},s_{h}). The covariance is

i=1h(t=i+1hΞ1t(π))ΣΛi2(π)(t=i+1hΞ1t(π)).\displaystyle\sum_{i=1}^{h}\left(\prod_{t=i+1}^{h}\Xi_{1t}(\pi)\right)\Sigma_{\Lambda_{i2}}(\pi)\left(\prod_{t=i+1}^{h}\Xi^{\top}_{1t}(\pi)\right).

Since \|\Xi_{1t}(\pi)\|_{2}\leq 1 and \|\Sigma_{\Lambda_{i2}}(\pi)\|_{2}\leq\mathrm{poly}(H,d_{o},d_{a},d_{s},\mathbb{C}), each of the at most H summands has operator norm \mathrm{poly}(H,d_{o},d_{a},d_{s},\mathbb{C}), and the statement is concluded. ∎

Let uh(z¯h,rh,ah,oh+1;θ)=θhψh(z¯h)rhθh+1ψh+1(z¯h+1)u_{h}(\bar{z}_{h},r_{h},a_{h},o_{h+1};\theta)=\theta^{\top}_{h}\psi_{h}(\bar{z}_{h})-r_{h}-\theta^{\top}_{h+1}\psi_{h+1}(\bar{z}_{h+1}). Recall ψh(z¯h)=[1,z¯hz¯h]\psi_{h}(\bar{z}_{h})=[1,\bar{z}^{\top}_{h}\otimes\bar{z}^{\top}_{h}]^{\top}. We define

y^h(a[i])\displaystyle\hat{y}_{h}(a^{[i]}) =𝔼𝒟{αi(π(z¯h))𝕀(z¯hZ1)𝕀(rhZ2)𝕀(oh+1Z3)𝕀(ah=a[i])(1+d)\displaystyle=\mathbb{E}_{\mathcal{D}}\{\alpha_{i}(\pi(\bar{z}_{h}))\mathbb{I}(\|\bar{z}_{h}\|\leq Z_{1})\mathbb{I}(\|r_{h}\|\leq Z_{2})\mathbb{I}(\|o_{h+1}\|\leq Z_{3})\mathbb{I}(a_{h}=a^{[i]})(1+d^{\diamond})
uh(z¯h,rh,ah,oh+1;θ);a1:h1π,ahU(1+d)}\displaystyle u_{h}(\bar{z}_{h},r_{h},a_{h},o_{h+1};\theta);a_{1:h-1}\sim\pi^{\prime},a_{h}\sim U(1+d^{\diamond})\}
\displaystyle\hat{y}_{h}(a^{[0]})=\mathbb{E}_{\mathcal{D}}\{\{1-\sum_{i}\alpha_{i}(\pi(\bar{z}_{h}))\}\mathbb{I}(a_{h}=0)\mathbb{I}(\|\bar{z}_{h}\|\leq Z_{1})\mathbb{I}(\|r_{h}\|\leq Z_{2})\mathbb{I}(\|o_{h+1}\|\leq Z_{3})
(1+d)uh(z¯h,rh,ah,oh+1;θ);a1:h1π,ahU(1+d)}.\displaystyle(1+d^{\diamond})u_{h}(\bar{z}_{h},r_{h},a_{h},o_{h+1};\theta);a_{1:h-1}\sim\pi^{\prime},a_{h}\sim U(1+d^{\diamond})\}.

Then, the final estimator is constructed by

y^h(a[0])+i=1dy^h(a[i]).\displaystyle\hat{y}_{h}(a^{[0]})+\sum_{i=1}^{d^{\diamond}}\hat{y}_{h}(a^{[i]}).

This is equal to

𝔼𝒟[lh(z¯h,ah,rh,oh+1;θ,π)]\displaystyle\mathbb{E}_{\mathcal{D}}[l_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta,\pi)]

where

lh(z¯h,ah,rh,oh+1;θ,π)=[iαi(π(z¯h))𝕀(ah=a[i])+{1iαi(π(z¯h))}𝕀(ah=0)]×\displaystyle l_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta,\pi)=\left[\sum_{i}\alpha_{i}(\pi(\bar{z}_{h}))\mathbb{I}(a_{h}=a^{[i]})+\{1-\sum_{i}\alpha_{i}(\pi(\bar{z}_{h}))\}\mathbb{I}(a_{h}=0)\right]\times
𝕀(z¯hZ1)𝕀(rhZ2)𝕀(oh+1Z3)(1+d)uh(z¯h,rh,ah,oh+1;θ).\displaystyle\mathbb{I}(\|\bar{z}_{h}\|\leq Z_{1})\mathbb{I}(\|r_{h}\|\leq Z_{2})\mathbb{I}(\|o_{h+1}\|\leq Z_{3})(1+d^{\diamond})u_{h}(\bar{z}_{h},r_{h},a_{h},o_{h+1};\theta).
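For concreteness, here is a minimal sketch of how one sample of l_{h} could be evaluated (assuming numpy; alpha stands for the precomputed coefficients \alpha_{i}(\pi(\bar{z}_{h})), actions for \{a^{[i]}\}, and psi for \psi_{h}, all placeholder names for this sketch):

import numpy as np

def psi(zbar):
    # psi_h(zbar_h) = [1, zbar_h (x) zbar_h]
    return np.concatenate(([1.0], np.kron(zbar, zbar)))

def u_h(zbar_h, zbar_next, r_h, theta_h, theta_next):
    # u_h = theta_h^T psi_h(zbar_h) - r_h - theta_{h+1}^T psi_{h+1}(zbar_{h+1})
    return theta_h @ psi(zbar_h) - r_h - theta_next @ psi(zbar_next)

def l_h(zbar_h, a_h, r_h, o_next, zbar_next, theta_h, theta_next,
        alpha, actions, Z1, Z2, Z3):
    # action-matching weight:
    # sum_i alpha_i 1(a_h = a^[i]) + (1 - sum_i alpha_i) 1(a_h = 0)
    w = sum(al * float(np.allclose(a_h, a)) for al, a in zip(alpha, actions))
    w += (1.0 - sum(alpha)) * float(np.allclose(a_h, 0.0))
    # truncation indicators with thresholds Z1, Z2, Z3
    ind = float(np.linalg.norm(zbar_h) <= Z1) \
        * float(abs(r_h) <= Z2) * float(np.linalg.norm(o_next) <= Z3)
    return w * ind * (1 + len(actions)) * u_h(zbar_h, zbar_next, r_h,
                                              theta_h, theta_next)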

We set

Zi=poly(ln(m),ds,do,da,,H,O).\displaystyle Z_{i}=\mathrm{poly}(\ln(m),d_{s},d_{o},d_{a},\mathbb{C},H,\|O^{\dagger}\|).

for any i[3]i\in[3].

J.3 Uniform Convergence

Recall that

Π={δ(a=𝐔1hz+𝐔2ho)𝐔1h,𝐔2h}.\displaystyle\Pi=\{\delta(a=\mathbf{U}_{1h}z+\mathbf{U}_{2h}o)\mid\|\mathbf{U}_{1h}\|\leq\mathbb{C},\|\mathbf{U}_{2h}\|\leq\mathbb{C}\}.

Besides, 𝒢h\mathcal{G}_{h} is included in

{θ,ψh()θpoly(H,do,ds,da,,O)}.\displaystyle\{\langle\theta,\psi_{h}(\cdot)\rangle\mid\|\theta\|\leq\mathrm{poly}(H,d_{o},d_{s},d_{a},\mathbb{C},\|O^{\dagger}\|)\}.
Lemma 32 (Concentration of loss functions).

With probability 1δ1-\delta,

supπΠ,θΘ|(𝔼𝒟𝔼){lh(z¯h,ah,rh,oh+1;θ,π)}|\displaystyle\sup_{\pi\in\Pi,\theta\in\Theta}|(\mathbb{E}_{\mathcal{D}}-\mathbb{E})\{l_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta,\pi)\}|

is upper-bounded by

poly(ln(m),ds,do,da,,H,O)×ln(1/δ)/m.\displaystyle\mathrm{poly}(\ln(m),d_{s},d_{o},d_{a},\mathbb{C},H,\|O^{\dagger}\|)\times\sqrt{\ln(1/\delta)/m}.
Proof.

Due to the indicator functions, l_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta,\pi) is bounded for any \pi,\theta by

poly(ln(m),ds,da,do,,H,O).\displaystyle\mathrm{poly}(\ln(m),d_{s},d_{a},d_{o},\mathbb{C},H,\|O^{\dagger}\|).

Thus, for fixed \pi and \theta, a standard concentration bound for bounded random variables (Hoeffding's inequality) implies that with probability 1-\delta, the deviation |(\mathbb{E}_{\mathcal{D}}-\mathbb{E})\{l_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta,\pi)\}| is at most

poly(ln(m),ds,do,da,,H,O,ln(1/δ))×1/m.\displaystyle\mathrm{poly}(\ln(m),d_{s},d_{o},d_{a},\mathbb{C},H,\|O^{\dagger}\|,\ln(1/\delta))\times\sqrt{1/m}.

Besides, we can consider a covering number with respect to the l^{\infty}-norm for the spaces of the policy parameters (\mathbf{U}_{1h},\mathbf{U}_{2h}) and \theta, since both are bounded. The radius of each space is upper-bounded by

poly(ln(m),ds,do,da,,H,O).\displaystyle\mathrm{poly}(\ln(m),d_{s},d_{o},d_{a},\mathbb{C},H,\|O^{\dagger}\|).

Thus, by taking a union bound over the covers and accounting for the bias term due to the discretization as in the proof of Lemma 24, the statement is concluded. ∎

Lemma 33 (Bias terms 1).

The expectations of \hat{y}_{h}(a^{[i]}) and \hat{y}_{h}(a^{[0]}) are equal to

yh(a[i])+Error1,yh(a[0])+Error2.\displaystyle y_{h}(a^{[i]})+\mathrm{Error}_{1},\quad y_{h}(a^{[0]})+\mathrm{Error}_{2}.

where

yh(a[i])\displaystyle y_{h}(a^{[i]}) =𝔼[αi(π(z¯h))𝕀(z¯hZ1)uh(z¯h,ah,rh,oh+1;θ);a1:h1π,ahdo(a[i])],\displaystyle=\mathbb{E}\left[\alpha_{i}(\pi(\bar{z}_{h}))\mathbb{I}(\|\bar{z}_{h}\|\leq Z_{1})u_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta);a_{1:h-1}\sim\pi^{\prime},a_{h}\sim do(a^{[i]})\right],
yh(a[0])\displaystyle y_{h}(a^{[0]}) =𝔼[{1iαi(π(z¯h))}𝕀(z¯hZ1)uh(z¯h,ah,rh,oh+1;θ);a1:h1π,ahdo(0)],\displaystyle=\mathbb{E}\left[\{1-\sum_{i}\alpha_{i}(\pi(\bar{z}_{h}))\}\mathbb{I}(\|\bar{z}_{h}\|\leq Z_{1})u_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta);a_{1:h-1}\sim\pi^{\prime},a_{h}\sim do(0)\right],
Error1\displaystyle\mathrm{Error}_{1} =m1poly(ln(m),ds,do,da,,H,O),Error2=m1poly(ln(m),ds,do,da,,H,O).\displaystyle=m^{-1}\mathrm{poly}(\ln(m),d_{s},d_{o},d_{a},\mathbb{C},H,\|O^{\dagger}\|),\quad\mathrm{Error}_{2}=m^{-1}\mathrm{poly}(\ln(m),d_{s},d_{o},d_{a},\mathbb{C},H,\|O^{\dagger}\|).
Proof.

We want to upper-bound the difference between

𝔼[αi(π(z¯h))𝕀(z¯hZ1)uh(z¯h,ah,rh,oh+1;θ);a1:h1π,ahdo(a[i])]\displaystyle\mathbb{E}\left[\alpha_{i}(\pi(\bar{z}_{h}))\mathbb{I}(\|\bar{z}_{h}\|\leq Z_{1})u_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta);a_{1:h-1}\sim\pi^{\prime},a_{h}\sim do(a^{[i]})\right]

and

𝔼[αi(π(z¯h))𝕀(z¯hZ1)𝕀(rhZ2)𝕀(oh+1Z3)uh(z¯h,ah,rh,oh+1;θ);a1:h1π,ahdo(a[i])].\displaystyle\mathbb{E}\left[\alpha_{i}(\pi(\bar{z}_{h}))\mathbb{I}(\|\bar{z}_{h}\|\leq Z_{1})\mathbb{I}(\|r_{h}\|\leq Z_{2})\mathbb{I}(\|o_{h+1}\|\leq Z_{3})u_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta);a_{1:h-1}\sim\pi^{\prime},a_{h}\sim do(a^{[i]})\right].

By the Cauchy--Schwarz inequality, we have

\displaystyle|\mathbb{E}\left[\alpha_{i}(\pi(\bar{z}_{h}))\mathbb{I}(\|\bar{z}_{h}\|\leq Z_{1})\{\mathbb{I}(\|r_{h}\|\leq Z_{2})\mathbb{I}(\|o_{h+1}\|\leq Z_{3})-1\}u_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta);a_{1:h-1}\sim\pi^{\prime},a_{h}\sim do(a^{[i]})\right]|
\displaystyle\leq\underbrace{\left\lvert\mathbb{E}\left[\{\mathbb{I}(\|r_{h}\|\leq Z_{2})\mathbb{I}(\|o_{h+1}\|\leq Z_{3})-1\}^{2};a_{1:h-1}\sim\pi^{\prime},a_{h}\sim do(a^{[i]})\right]\right\rvert^{1/2}}_{(a)}
\displaystyle\times\underbrace{\left\lvert\mathbb{E}\left[\alpha^{2}_{i}(\pi(\bar{z}_{h}))u^{2}_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta);a_{1:h-1}\sim\pi^{\prime},a_{h}\sim do(a^{[i]})\right]\right\rvert^{1/2}}_{(b)}.

We analyze the terms (a) and (b). Before starting the analysis, note that (\bar{z}^{\top}_{h},a^{\top}_{h},r_{h},o^{\top}_{h+1}) follows a Gaussian distribution with mean 0 and covariance whose operator norm is upper-bounded by

poly(,ds,do,da,H)\displaystyle\mathrm{poly}(\mathbb{C},d_{s},d_{o},d_{a},H)

using Lemma 31. Besides, αi2(πh(z¯h))poly(ds,do,da,H)\alpha^{2}_{i}(\pi_{h}(\bar{z}_{h}))\leq\mathrm{poly}(d_{s},d_{o},d_{a},H) from Lemma 13. Note we can use a G-optimal design since we have a norm constraint on z¯1\bar{z}_{1}.

Regarding the term (a), by setting Z2=poly(,ds,do,da,ln(m),H,O)Z_{2}=\mathrm{poly}(\mathbb{C},d_{s},d_{o},d_{a},\ln(m),H,\|O^{\dagger}\|) and Z3=poly(,ds,do,da,ln(m),H,O)Z_{3}=\mathrm{poly}(\mathbb{C},d_{s},d_{o},d_{a},\ln(m),H,\|O^{\dagger}\|) properly, we can ensure it is upper-bounded by

poly(,ds,do,da,H,O,ln(m))m.\displaystyle\frac{\mathrm{poly}(\mathbb{C},d_{s},d_{o},d_{a},H,\|O^{\dagger}\|,\ln(m))}{m}.
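For completeness, the choice of Z_{2} and Z_{3} is driven by the standard Gaussian tail bound: for a centered one-dimensional Gaussian X with variance \sigma^{2},

\displaystyle\mathbb{P}(|X|>Z)\leq 2\exp\left(-\frac{Z^{2}}{2\sigma^{2}}\right),

so taking Z=\sigma\sqrt{4\ln(m)} (up to \mathrm{poly}(d_{s},d_{o},d_{a}) factors in the vector case) makes the tail probability at most 2m^{-2}, and hence the term (a), which is the square root of this tail probability, at most \mathrm{poly}(\cdot)/m.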

Regarding the term (b), noting that higher-order moments of Gaussian distributions are always bounded by polynomials of their variances, the term (b) is upper-bounded by \mathrm{poly}(\mathbb{C},d_{s},d_{o},d_{a},H,\|O^{\dagger}\|,\ln(m)). This concludes the statement. ∎

Lemma 34 (Bias terms 2).

Recall we define yh(a[i])y_{h}(a^{[i]}) and yh(a[0])y_{h}(a^{[0]}) in Lemma 33. Then, we have

\displaystyle\mathbb{E}[\mathbb{I}(\|\bar{z}_{h}\|\leq Z_{1})u_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta);a_{1:h-1}\sim\pi^{\prime},a_{h}\sim\pi(\bar{z}_{h})]=y_{h}(a^{[0]})+\sum_{i}y_{h}(a^{[i]}).

Thus,

\displaystyle\mathbb{E}[l_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta,\pi);a_{1:h-1}\sim\pi^{\prime},a_{h}\sim U(1+d^{\diamond})]
\displaystyle=\mathbb{E}[u_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta);a_{1:h-1}\sim\pi^{\prime},a_{h}\sim\pi(\bar{z}_{h})]+\frac{\mathrm{poly}(\mathbb{C},d_{s},d_{a},d_{o},H,\|O^{\dagger}\|,\ln(m))}{m}.
Proof.
First Statement

We have

\displaystyle\mathbb{E}[\mathbb{I}(\|\bar{z}_{h}\|\leq Z_{1})u_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta);a_{1:h-1}\sim\pi^{\prime},a_{h}\sim\pi(\bar{z}_{h})]
\displaystyle=\mathbb{E}[\mathbb{I}(\|\bar{z}_{h}\|\leq Z_{1})\mathbb{E}[u_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta)\mid\bar{z}_{h},s_{h},a_{h}];a_{1:h-1}\sim\pi^{\prime},a_{h}\sim\pi(\bar{z}_{h})]
\displaystyle=\mathbb{E}[\mathbb{I}(\|\bar{z}_{h}\|\leq Z_{1})\mathbb{E}[u_{h}(\bar{z}_{h},\pi_{h}(\bar{z}_{h}),r_{h},o_{h+1};\theta)\mid\bar{z}_{h},s_{h},a_{h}=\pi_{h}(\bar{z}_{h})];a_{1:h-1}\sim\pi^{\prime}].

Here, by some algebra, there exists a vector c_{2} such that

𝔼[uh(z¯h,ah,rh,oh+1;θ)z¯h,sh,ah]=c2,[1,[z¯h,sh,ah][z¯h,sh,ah]].\displaystyle\mathbb{E}[u_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta)\mid\bar{z}_{h},s_{h},a_{h}]=\langle c_{2},[1,[\bar{z}^{\top}_{h},s^{\top}_{h},a^{\top}_{h}]\otimes[\bar{z}^{\top}_{h},s^{\top}_{h},a^{\top}_{h}]]^{\top}\rangle.

Thus, there exist a function c_{0} and a vector-valued function c_{1} such that

𝔼[uh(z¯h,ah,rh,oh+1;θ)z¯h,sh,ah]=c0(z¯h,sh)+c1(z¯h,sh)κ(ah)\displaystyle\mathbb{E}[u_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta)\mid\bar{z}_{h},s_{h},a_{h}]=c_{0}(\bar{z}_{h},s_{h})+c^{\top}_{1}(\bar{z}_{h},s_{h})\kappa(a_{h})

Recall that we can write

\displaystyle\kappa(\pi_{h}(\bar{z}_{h}))=\sum_{i=1}^{d^{\diamond}}\alpha_{i}(\pi_{h}(\bar{z}_{h}))\kappa(a^{[i]}).

Using the above,

\displaystyle\mathbb{E}[\mathbb{I}(\|\bar{z}_{h}\|\leq Z_{1})u_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta);a_{1:h-1}\sim\pi^{\prime},a_{h}\sim\pi(\bar{z}_{h})]
\displaystyle=\mathbb{E}[\mathbb{I}(\|\bar{z}_{h}\|\leq Z_{1})\{c_{0}(\bar{z}_{h},s_{h})+c^{\top}_{1}(\bar{z}_{h},s_{h})\kappa(\pi_{h}(\bar{z}_{h}))\};a_{1:h-1}\sim\pi^{\prime}]
\displaystyle=\mathbb{E}[\mathbb{I}(\|\bar{z}_{h}\|\leq Z_{1})\{c_{0}(\bar{z}_{h},s_{h})+\sum_{i}\alpha_{i}(\pi_{h}(\bar{z}_{h}))c^{\top}_{1}(\bar{z}_{h},s_{h})\kappa(a^{[i]})\};a_{1:h-1}\sim\pi^{\prime}]
\displaystyle=\mathbb{E}[\mathbb{I}(\|\bar{z}_{h}\|\leq Z_{1})[c_{0}(\bar{z}_{h},s_{h})+
\displaystyle+\sum_{i}\alpha_{i}(\pi_{h}(\bar{z}_{h}))\{\mathbb{E}[u_{h}(\bar{z}_{h},a^{[i]},r_{h},o_{h+1};\theta)\mid\bar{z}_{h},s_{h},a_{h}=a^{[i]}]-c_{0}(\bar{z}_{h},s_{h})\}];a_{1:h-1}\sim\pi^{\prime}]
\displaystyle=\mathbb{E}\left[\mathbb{I}(\|\bar{z}_{h}\|\leq Z_{1})\left[c_{0}(\bar{z}_{h},s_{h})-\sum_{i}\alpha_{i}(\pi_{h}(\bar{z}_{h}))c_{0}(\bar{z}_{h},s_{h})\right];a_{1:h-1}\sim\pi^{\prime}\right]+\sum_{i}y_{h}(a^{[i]}).

Besides,

c0(z¯h,sh)=𝔼[uh(z¯h,ah,rh,oh+1;θ)z¯h,sh,ah=do(0)].\displaystyle c_{0}(\bar{z}_{h},s_{h})=\mathbb{E}[u_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta)\mid\bar{z}_{h},s_{h},a_{h}=do(0)].

Thus,

𝔼[𝕀(z¯hZ1)[c0(z¯h,sh)iαi(πh(z¯h))c0(z¯h,sh)]]\displaystyle\mathbb{E}\left[\mathbb{I}(\|\bar{z}_{h}\|\leq Z_{1})\left[c_{0}(\bar{z}_{h},s_{h})-\sum_{i}\alpha_{i}(\pi_{h}(\bar{z}_{h}))c_{0}(\bar{z}_{h},s_{h})\right]\right]
=𝔼[𝕀(z¯hZ1){1iαi(πh(z¯h))}𝔼[uh(z¯h,ah,rh,oh+1;θ)z¯h,sh,ah=0];a1:h1π]\displaystyle=\mathbb{E}[\mathbb{I}(\|\bar{z}_{h}\|\leq Z_{1})\{1-\sum_{i}\alpha_{i}(\pi_{h}(\bar{z}_{h}))\}\mathbb{E}[u_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta)\mid\bar{z}_{h},s_{h},a_{h}=0];a_{1:h-1}\sim\pi^{\prime}]
=𝔼[𝕀(z¯hZ1){1iαi(πh(z¯h))}uh(z¯h,ah,rh,oh+1;θ);a1:h1π,ah=do(0)]\displaystyle=\mathbb{E}[\mathbb{I}(\|\bar{z}_{h}\|\leq Z_{1})\{1-\sum_{i}\alpha_{i}(\pi_{h}(\bar{z}_{h}))\}u_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta);a_{1:h-1}\sim\pi^{\prime},a_{h}=do(0)]
=yh(a[0]).\displaystyle=y_{h}(a^{[0]}).

In conclusion,

\displaystyle\mathbb{E}[\mathbb{I}(\|\bar{z}_{h}\|\leq Z_{1})u_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta);a_{1:h-1}\sim\pi^{\prime},a_{h}\sim\pi(\bar{z}_{h})]=y_{h}(a^{[0]})+\sum_{i}y_{h}(a^{[i]}).
Second Statement

As seen in the proof of Lemma 33, the absolute value of the following term

\displaystyle\mathbb{E}[\{\mathbb{I}(\|\bar{z}_{h}\|\leq Z_{1})-1\}u_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta);a_{1:h-1}\sim\pi^{\prime},a_{h}\sim\pi(\bar{z}_{h})]

is upper-bounded by poly(,ds,do,da,H,ln(m))/m.\mathrm{poly}(\mathbb{C},d_{s},d_{o},d_{a},H,\ln(m))/m. Hence,

\displaystyle\mathbb{E}[l_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta,\pi);a_{1:h-1}\sim\pi^{\prime},a_{h}\sim U(1+d^{\diamond})]
=𝔼[y^h(a[0])]+i=1d𝔼[y^h(a[i])]\displaystyle=\mathbb{E}[\hat{y}_{h}(a^{[0]})]+\sum_{i=1}^{d^{\diamond}}\mathbb{E}[\hat{y}_{h}(a^{[i]})] (Definition)
=yh(a[0])+i=1dyh(a[i])+poly(,ds,do,da,H,ln(m))/m\displaystyle=y_{h}(a^{[0]})+\sum_{i=1}^{d^{\diamond}}y_{h}(a^{[i]})+\mathrm{poly}(\mathbb{C},d_{s},d_{o},d_{a},H,\ln(m))/m (Statement of Lemma 33)
=𝔼[𝕀(z¯hZ1)uh(z¯h,ah,rh,oh+1;θ);a1:h1π,ahπh(z¯h)]+poly(,ds,do,da,H,ln(m))/m\displaystyle=\mathbb{E}[\mathbb{I}(\|\bar{z}_{h}\|\leq Z_{1})u_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta);a_{1:h-1}\sim\pi^{\prime},a_{h}\sim\pi_{h}(\bar{z}_{h})]+\mathrm{poly}(\mathbb{C},d_{s},d_{o},d_{a},H,\ln(m))/m (First statement)
=𝔼[uh(z¯h,ah,rh,oh+1;θ);a1:h1π,ahπh(z¯h)]+poly(,ds,do,da,H,ln(m))/m\displaystyle=\mathbb{E}[u_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta);a_{1:h-1}\sim\pi^{\prime},a_{h}\sim\pi_{h}(\bar{z}_{h})]+\mathrm{poly}(\mathbb{C},d_{s},d_{o},d_{a},H,\ln(m))/m
=Brh(π,θ;π)+poly(,ds,do,da,H,ln(m))/m.\displaystyle=\mathrm{Br}_{h}(\pi,\theta;\pi^{\prime})+\mathrm{poly}(\mathbb{C},d_{s},d_{o},d_{a},H,\ln(m))/m.

J.4 Sample Complexity

Summarizing results so far, we have

\displaystyle\sup_{\pi\in\Pi,\theta\in\Theta}|\mathbb{E}_{\mathcal{D}}[l_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};\theta,\pi)]-\mathrm{Br}_{h}(\pi,\theta;\pi^{\prime})|
poly(ln(m),ds,do,da,,H,O)×ln(1/δ)/m.\displaystyle\leq\mathrm{poly}(\ln(m),d_{s},d_{o},d_{a},\mathbb{C},H,\|O^{\dagger}\|)\times\sqrt{\ln(1/\delta)/m}.

This is enough to invoke Theorem 1. Here, recall we have

Xh(π)poly(H,do,da,ds,,Θ,O),Wh(π)poly(H,do,da,ds,,Θ,O).\displaystyle\|X_{h}(\pi)\|\leq\mathrm{poly}(H,d_{o},d_{a},d_{s},\mathbb{C},\Theta,\|O^{\dagger}\|),\quad\|W_{h}(\pi)\|\leq\mathrm{poly}(H,d_{o},d_{a},d_{s},\mathbb{C},\Theta,\|O^{\dagger}\|).

for any πΠ\pi\in\Pi using Lemma 30. In addition, we showed the PO-bilinear rank is

poly(H,do,da,ds).\displaystyle\mathrm{poly}(H,d_{o},d_{a},d_{s}).

Then, using Theorem 1, the sample complexity is

O~(poly(ln(m),ds,do,da,,Θ,H,O,ln(1/δ))×1ϵ2).\displaystyle\tilde{O}\left(\mathrm{poly}(\ln(m),d_{s},d_{o},d_{a},\mathbb{C},\Theta,H,\|O^{\dagger}\|,\ln(1/\delta))\times\frac{1}{\epsilon^{2}}\right).

Appendix K Sample Complexity in PSRs

To focus on the main point, we just use a one-step future. We first show the form of link functions to set a proper class for 𝒢h\mathcal{G}_{h}. Next, we show the PO-bilinear decomposition.

We make the following assumptions.

Assumption 11.

(a) {\mathcal{T}}\subset\mathcal{O} is a core test and \mathcal{Q} is a minimum core test, (b) \|\mathrm{vec}(\mathbb{J}^{\pi}_{h})\|\leq\Theta for any \pi\in\Pi, where \mathbb{J}^{\pi}_{h} is the matrix in \mathcal{V}^{\pi}_{h}(\tau_{h})={\textbf{1}}(z_{h-1})^{\top}\mathbb{J}_{h}^{\pi}{\bf q}_{\tau_{h}}.

K.1 Existence of Link Functions

Recall Vhπ(τh)=1(zh1)𝕁hπ𝐪τhV^{\pi}_{h}(\tau_{h})={\textbf{1}}(z_{h-1})^{\top}\mathbb{J}_{h}^{\pi}{\bf q}_{\tau_{h}}, where we use 1(z)|𝒪|M|𝒜|M{\textbf{1}}(z)\in\mathbb{R}^{|\mathcal{O}|^{M}|\mathcal{A}|^{M}} to denote the one-hot encoding vector over 𝒵h1\mathcal{Z}_{h-1}, and 𝕁hπ\mathbb{J}_{h}^{\pi} is a matrix in |𝒪|M|𝒜|M×|𝒯|\mathbb{R}^{|\mathcal{O}|^{M}|\mathcal{A}|^{M}\times|{\mathcal{T}}|}.

Then, ghπ(zh1,o):=1(zh1)𝕁hπ[1(t=o)]t𝒯g^{\pi}_{h}(z_{h-1},o):={\textbf{1}}(z_{h-1})^{\top}\mathbb{J}^{\pi}_{h}[{\textbf{1}}(t=o)]_{t\in{\mathcal{T}}} is a value link function. This is because

𝔼[gh(zh1,o)τh]\displaystyle\mathbb{E}[g_{h}(z_{h-1},o)\mid\tau_{h}] =𝔼[1(zh1)𝕁hπ[1(t=o)]t𝒯τh]\displaystyle=\mathbb{E}[{\textbf{1}}(z_{h-1})^{\top}\mathbb{J}^{\pi}_{h}[{\textbf{1}}(t=o)]_{t\in{\mathcal{T}}}\mid\tau_{h}]
=1(zh1)𝕁hπ𝐪τh.\displaystyle={\textbf{1}}(z_{h-1})^{\top}\mathbb{J}_{h}^{\pi}{\bf q}_{\tau_{h}}.

Hence, we set 𝒢h\mathcal{G}_{h} to be

{(zh1,o)1(zh1)𝕁[1(t=o)]t𝒯:vec(𝕁)Θ}\displaystyle\{(z_{h-1},o)\mapsto{\textbf{1}}(z_{h-1})^{\top}\mathbb{J}[{\textbf{1}}(t=o)]_{t\in{\mathcal{T}}}:\|\mathrm{vec}(\mathbb{J})\|\leq\Theta\}

so that the realizability holds.
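As an illustration, evaluating a member of \mathcal{G}_{h} reduces to a table lookup because both encodings are one-hot; a minimal sketch (assuming numpy; the index arguments are placeholders for this sketch):

import numpy as np

def g_value(J, z_index, t_index):
    # g(z_{h-1}, o) = 1(z_{h-1})^T J [1(t = o)]_{t in T}: the one-hot vectors
    # on both sides simply select a single entry of J
    return J[z_index, t_index]

def project(J, Theta):
    # keep the constraint ||vec(J)||_2 <= Theta by rescaling
    n = np.linalg.norm(J)
    return J if n <= Theta else J * (Theta / n)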

K.2 PO-bilinear Rank Decomposition

We show that PSR admits PO-bilinear rank decomposition (Definition 6). Here is the Bellman loss:

𝔼[{gh+1(zh,oh+1)+rh}gh(zh1,oh);a1:h1π,ahπ].\displaystyle\mathbb{E}[\{g_{h+1}(z_{h},o_{h+1})+r_{h}\}-g_{h}(z_{h-1},o_{h});a_{1:h-1}\sim\pi^{\prime},a_{h}\sim\pi].

To analyze the above, we decompose the above into three terms:

𝔼[gh+1(zh,oh+1);a1:h1π,ahπ](a)+𝔼[rh;a1:h1π,ahπ](b)+𝔼[gh(zh1,oh);a1:h1π,ahπ](c).\displaystyle\underbrace{\mathbb{E}[g_{h+1}(z_{h},o_{h+1});a_{1:h-1}\sim\pi^{\prime},a_{h}\sim\pi]}_{(a)}+\underbrace{\mathbb{E}[r_{h};a_{1:h-1}\sim\pi^{\prime},a_{h}\sim\pi]}_{(b)}+\underbrace{\mathbb{E}[-g_{h}(z_{h-1},o_{h});a_{1:h-1}\sim\pi^{\prime},a_{h}\sim\pi]}_{(c)}.

Let \mathcal{Q} be a minimum core test. Then, for any future t, there exists \tilde{m}_{t} such that \mathbb{P}(t\mid\tau_{h})=\langle\tilde{m}_{t},\tilde{\bf q}_{\tau_{h}}\rangle, where \tilde{\bf q}_{\tau_{h}}:=[\mathbb{P}(q\mid\tau_{h})]_{q\in\mathcal{Q}} is the |\mathcal{Q}|-dimensional predictive state. This satisfies

(ohτh;ah)𝐪~τh,ah,oh=M~oh,ah𝐪~τh.\displaystyle\mathbb{P}(o_{h}\mid\tau_{h};a_{h})\tilde{\bf q}_{\tau_{h},a_{h},o_{h}}=\tilde{M}_{o_{h},a_{h}}\tilde{\bf q}_{\tau_{h}}. (23)

where M~oh,ah\tilde{M}_{o_{h},a_{h}} is a matrix whose ii-th row is m~oh,ah\tilde{m}^{\top}_{o_{h},a_{h}} as we see in Section D.
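Equation (23) is the usual predictive-state update written multiplicatively; a minimal sketch of the induced one-step update (assuming numpy, with M[o][a] and m[o][a] as placeholder lookups for \tilde{M}_{o,a} and \tilde{m}_{o,a}):

import numpy as np

def update_predictive_state(q, M, m, a, o):
    # P(o | tau; a) = <m_{o,a}, q>, so (23) gives
    # q_{tau, a, o} = M_{o,a} q / P(o | tau; a)
    p_o = m[o][a] @ q
    return M[o][a] @ q / p_o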

Term (c).

We have

𝔼[gh(zh1,oh)τh]\displaystyle\mathbb{E}[g_{h}(z_{h-1},o_{h})\mid\tau_{h}] =1(zh1)𝕁𝔼[[1(t=oh)]t𝒯τh]\displaystyle={\textbf{1}}(z_{h-1})^{\top}\mathbb{J}\mathbb{E}[[{\textbf{1}}(t=o_{h})]_{t\in{\mathcal{T}}}\mid\tau_{h}]
=1(zh1)𝕁𝕁1𝐪~τh\displaystyle={\textbf{1}}(z_{h-1})^{\top}\mathbb{J}\mathbb{J}_{1}\tilde{\bf q}_{\tau_{h}}

where \mathbb{J}_{1}\in\mathbb{R}^{|{\mathcal{T}}|\times|\mathcal{Q}|} is a matrix whose i-th row is \tilde{m}^{\top}_{t_{i}} for the i-th test t_{i}\in{\mathcal{T}}. The existence of \mathbb{J}_{1} is ensured since \mathcal{Q} is a core test.

Term (b).

We have

𝔼[rhτh;ahπ]\displaystyle\mathbb{E}[r_{h}\mid\tau_{h};a_{h}\sim\pi] =oh,ahπ(ahoh,zh1)rh(ah,oh)(ohτh;ah)\displaystyle=\sum_{o_{h},a_{h}}\pi(a_{h}\mid o_{h},z_{h-1})r_{h}(a_{h},o_{h})\mathbb{P}(o_{h}\mid\tau_{h};a_{h})
=oh,ahπ(ahoh,zh1)rh(ah,oh)m~oh,ah,𝐪~τh\displaystyle=\sum_{o_{h},a_{h}}\pi(a_{h}\mid o_{h},z_{h-1})r_{h}(a_{h},o_{h})\langle\tilde{m}_{o_{h},a_{h}},\tilde{\bf q}_{\tau_{h}}\rangle
=1(zh1)𝕁2π𝐪~τh\displaystyle={\textbf{1}}(z_{h-1})^{\top}\mathbb{J}^{\pi}_{2}\tilde{\bf q}_{\tau_{h}}

for some matrix \mathbb{J}^{\pi}_{2}. In the first equality, we use that the reward is a function of o_{h} and a_{h} conditioned on the whole history. From the first line to the second line, we use a property of core tests.

Term (a).

We have

𝔼[gh+1(zh,oh+1)τh;ahπ]=𝔼[1(zh)𝕁[1(t=oh+1)]t𝒯τh;ahπ]\displaystyle\mathbb{E}[g_{h+1}(z_{h},o_{h+1})\mid\tau_{h};a_{h}\sim\pi]=\mathbb{E}[{\textbf{1}}(z_{h})^{\top}\mathbb{J}[{\textbf{1}}(t=o_{h+1})]_{t\in{\mathcal{T}}}\mid\tau_{h};a_{h}\sim\pi]
=𝔼[1(zh)𝕁𝕁3q~τh,ah,ohτh;ahπ]\displaystyle=\mathbb{E}[{\textbf{1}}(z_{h})^{\top}\mathbb{J}\mathbb{J}_{3}\tilde{q}_{\tau_{h},a_{h},o_{h}}\mid\tau_{h};a_{h}\sim\pi]

for some matrix 𝕁3\mathbb{J}_{3}. Then, the above is further equal to

ah,oh1(zh)𝕁𝕁3π(ahzh1,oh)(ohτh;ah)𝐪~τh,ah,oh\displaystyle\sum_{a_{h},o_{h}}{\textbf{1}}(z_{h})^{\top}\mathbb{J}\mathbb{J}_{3}\pi(a_{h}\mid z_{h-1},o_{h})\mathbb{P}(o_{h}\mid\tau_{h};a_{h})\tilde{\bf q}_{\tau_{h},a_{h},o_{h}}
=ah,oh1(zh)𝕁𝕁3π(ahzh1,oh)M~oh,ah𝐪~τh\displaystyle=\sum_{a_{h},o_{h}}{\textbf{1}}(z_{h})^{\top}\mathbb{J}\mathbb{J}_{3}\pi(a_{h}\mid z_{h-1},o_{h})\tilde{M}_{o_{h},a_{h}}\tilde{\bf q}_{\tau_{h}}
=1(zh1)𝕁4π𝐪~τh\displaystyle={\textbf{1}}(z_{h-1})^{\top}\mathbb{J}^{\pi}_{4}\tilde{\bf q}_{\tau_{h}}

for some matrix 𝕁4π\mathbb{J}^{\pi}_{4}. From the first line to the second line, we use (ohτh;ah)𝐪~τh,ah,oh=M~oh,ah𝐪~τh\mathbb{P}(o_{h}\mid\tau_{h};a_{h})\tilde{\bf q}_{\tau_{h},a_{h},o_{h}}=\tilde{M}_{o_{h},a_{h}}\tilde{\bf q}_{\tau_{h}} in (23).

Summary.

Combining all terms, there exists a matrix 𝕁5π\mathbb{J}^{\pi}_{5} such that

𝔼[{gh+1(zh,oh+1)+rh}gh(zh1,oh);a1:h1π,ahπ]\displaystyle\mathbb{E}[\{g_{h+1}(z_{h},o_{h+1})+r_{h}\}-g_{h}(z_{h-1},o_{h});a_{1:h-1}\sim\pi^{\prime},a_{h}\sim\pi]
=1(zh1)𝕁5π𝔼[𝐪~τh;a1:h1π]\displaystyle={\textbf{1}}(z_{h-1})^{\top}\mathbb{J}^{\pi}_{5}\mathbb{E}[\tilde{\bf q}_{\tau_{h}};a_{1:h-1}\sim\pi^{\prime}]
=Vec(𝕁5π),1(zh1)𝔼[𝐪~τh;a1:h1π]\displaystyle=\langle\mathrm{Vec}(\mathbb{J}^{\pi}_{5}),{\textbf{1}}(z_{h-1})\otimes\mathbb{E}[\tilde{\bf q}_{\tau_{h}};a_{1:h-1}\sim\pi^{\prime}]\rangle

Here, we suppose Vec(𝕁5π)ΘW\|\mathrm{Vec}(\mathbb{J}^{\pi}_{5})\|\leq\Theta_{W} for any π\pi. Besides,

1(zh1)𝔼[𝐪~τh;a1:h1π]2𝔼[𝐪~τh;a1:h1π]2𝔼[𝐪~τh2;a1:h1π]\displaystyle\|{\textbf{1}}(z_{h-1})\otimes\mathbb{E}[\tilde{\bf q}_{\tau_{h}};a_{1:h-1}\sim\pi^{\prime}]\|_{2}\leq\|\mathbb{E}[\tilde{\bf q}_{\tau_{h}};a_{1:h-1}\sim\pi^{\prime}]\|_{2}\leq\mathbb{E}[\|\tilde{\bf q}_{\tau_{h}}\|_{2};a_{1:h-1}\sim\pi^{\prime}]
𝔼[𝐪~τh1;a1:h1π]=1.\displaystyle\leq\mathbb{E}[\|\tilde{\bf q}_{\tau_{h}}\|_{1};a_{1:h-1}\sim\pi^{\prime}]=1.

Thus, we can set BX=1B_{X}=1.

K.3 Sample Complexity

Suppose Π,𝒢\Pi,\mathcal{G} are finite and rewards at hh lie in [0,1][0,1]. Assume the realizability holds. Then,

ϵgen=cmax(Θ,1)|𝒜|ln(|𝒢max||Πmax|TH/δ)/m.\displaystyle\epsilon_{gen}=c\max(\Theta,1)|\mathcal{A}|\sqrt{\ln(|\mathcal{G}_{\max}||\Pi_{\max}|TH/\delta)/m}.

Following the calculation in Section F, the sample complexity is

O~(|𝒪|2(M1)|𝒜|2(M1)|𝒬|2max(Θ,1)H4|𝒜|2ln(|𝒢max||Πmax|/δ)ln(ΘW)2ϵ2).\displaystyle\tilde{O}\left(\frac{|\mathcal{O}|^{2(M-1)}|\mathcal{A}|^{2(M-1)}|\mathcal{Q}|^{2}\max(\Theta,1)H^{4}|\mathcal{A}|^{2}\ln(|\mathcal{G}_{\max}||\Pi_{\max}|/\delta)\ln(\Theta_{W})^{2}}{\epsilon^{2}}\right).

Here, there is no explicit dependence on |{\mathcal{T}}|. Note that the worst case of \ln|\mathcal{G}_{\max}| is O(|\mathcal{Z}_{h-1}||{\mathcal{T}}|) and the worst case of \ln|\Pi_{\max}| is O(|\mathcal{Z}_{h-1}||\mathcal{O}||\mathcal{A}|).

K.4 Most General Case

We consider the general case in Section D. Let \mathcal{G}_{h} be a function class consisting of \mathbf{1}(z_{h-1})^{\top}\mathbb{J}_{h}\mathbf{1}(t) where \mathbb{J}_{h} satisfies \mathbb{J}_{h}\in\mathbb{R}^{|\mathcal{Z}_{h-1}|\times|{\mathcal{T}}|} and \|\mathrm{vec}(\mathbb{J}_{h})\|\leq\Theta. When the realizability holds, we would get

O~(|𝒪|2(M1)|𝒜|2(M1)|𝒬|2|𝒯𝒜|2max(Θ,1)H4|𝒜|2ln(|𝒢max||Πmax|/δ)ln(BXBW)2ϵ2).\displaystyle\tilde{O}\left(\frac{|\mathcal{O}|^{2(M-1)}|\mathcal{A}|^{2(M-1)}|\mathcal{Q}|^{2}|{\mathcal{T}}^{\mathcal{A}}|^{2}\max(\Theta,1)H^{4}|\mathcal{A}|^{2}\ln(|\mathcal{G}_{\max}||\Pi_{\max}|/\delta)\ln(B_{X}B_{W})^{2}}{\epsilon^{2}}\right).

Here, there is no explicit dependence on |{\mathcal{T}}^{\mathcal{O}}|. Note that the worst case of \ln|\mathcal{G}_{\max}| is O(|\mathcal{Z}_{h-1}||{\mathcal{T}}|) and the worst case of \ln|\Pi_{\max}| is O(|\mathcal{Z}_{h-1}||\mathcal{O}||\mathcal{A}|).

Appendix L Proof of Theorem 8

We fix the parameters as in Theorem 8. Let

lh(τh,ah,rh,oh+1;f,π,g)=|𝒜|πh(ahz¯h){rh+gh+1(z¯h+1)gh(z¯h)}f(τh)0.5f(τh)2.\displaystyle l_{h}(\tau_{h},a_{h},r_{h},o_{h+1};f,\pi,g)=|\mathcal{A}|\pi_{h}(a_{h}\mid\bar{z}_{h})\{r_{h}+g_{h+1}(\bar{z}_{h+1})-g_{h}(\bar{z}_{h})\}f(\tau_{h})-0.5f(\tau_{h})^{2}.
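For concreteness, a minimal sketch of the empirical constraint \max_{f\in\mathcal{F}_{h}}|\mathbb{E}_{\mathcal{D}}[l_{h}]| used by the constrained program (assuming numpy; pi_h, g_h, g_next, and f_class are placeholder names for this sketch):

import numpy as np

def l_h(tau_h, zbar_h, a_h, r_h, zbar_next, f, pi_h, g_h, g_next, n_actions):
    # |A| pi_h(a_h | zbar_h) {r_h + g_{h+1}(zbar_{h+1}) - g_h(zbar_h)} f(tau_h)
    #   - 0.5 f(tau_h)^2
    td = r_h + g_next(zbar_next) - g_h(zbar_h)
    return n_actions * pi_h(a_h, zbar_h) * td * f(tau_h) - 0.5 * f(tau_h) ** 2

def empirical_constraint(data, f_class, pi_h, g_h, g_next, n_actions):
    # max_{f in F_h} |E_D[l_h(.; f, pi, g)]|, required to be <= R in the program
    return max(abs(np.mean([l_h(*sample, f, pi_h, g_h, g_next, n_actions)
                            for sample in data])) for f in f_class)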

From the assumption, with probability 1-\delta, we have for all t\in[T] and h\in[H]:

supπΠ,g𝒢,f|𝔼𝒟ht[lh(τh,ah,rh,oh+1;f,π,g)]𝔼[𝔼𝒟ht[lh(τh,ah,rh,oh+1;f,π,g)]]|ϵgen,\displaystyle\sup_{\pi\in\Pi,g\in\mathcal{G},f\in\mathcal{F}}|\mathbb{E}_{\mathcal{D}^{t}_{h}}[l_{h}(\tau_{h},a_{h},r_{h},o_{h+1};f,\pi,g)]-\mathbb{E}[\mathbb{E}_{\mathcal{D}^{t}_{h}}[l_{h}(\tau_{h},a_{h},r_{h},o_{h+1};f,\pi,g)]]|\leq\epsilon_{gen}, (24)
supg1𝒢1|𝔼𝒟0[g1(o1)]𝔼[𝔼𝒟0[g1(o1)]]|ϵini.\displaystyle\sup_{g_{1}\in\mathcal{G}_{1}}|\mathbb{E}_{\mathcal{D}^{0}}[g_{1}(o_{1})]-\mathbb{E}[\mathbb{E}_{\mathcal{D}^{0}}[g_{1}(o_{1})]]|\leq\epsilon_{ini}. (25)

We first show the following lemma. Recall π=argmaxπΠJ(π).\pi^{\star}=\mathop{\mathrm{argmax}}_{\pi\in\Pi}J(\pi).

Lemma 35 (Optimism).

Set R:=ϵgenR:=\epsilon_{gen}. For all t[T]t\in[T], (π,gπ)(\pi^{\star},g^{\pi^{\star}}) is a feasible solution of the constrained program. Furthermore, we have J(π)𝔼[g1t(o1)]+2ϵiniJ(\pi^{\star})\leq\mathbb{E}[g^{t}_{1}(o_{1})]+2\epsilon_{ini} for any t[T]t\in[T], where gtg^{t} is the value link function selected by the algorithm in iteration tt.

Proof.

For any π\pi, we have

maxfh|𝔼[𝔼𝒟ht[lh(τh,ah,rh,oh+1;f,π,gπ)]]|=0\displaystyle\max_{f\in\mathcal{F}_{h}}|\mathbb{E}[\mathbb{E}_{\mathcal{D}^{t}_{h}}[l_{h}(\tau_{h},a_{h},r_{h},o_{h+1};f,\pi,g^{\pi})]]|=0

since gπg^{\pi} is a value link function in 𝒢\mathcal{G} noting the condition (c) in Definition 7. Thus,

maxfh|𝔼𝒟ht[lh(τh,ah,rh,oh+1;f,π,gπ)]|ϵgen\displaystyle\max_{f\in\mathcal{F}_{h}}|\mathbb{E}_{\mathcal{D}^{t}_{h}}[l_{h}(\tau_{h},a_{h},r_{h},o_{h+1};f,\pi^{\star},g^{\pi^{\star}})]|\leq\epsilon_{gen}

using (24) and noting \pi^{\star}\in\Pi,g^{\pi^{\star}}\in\mathcal{G}. Hence, (\pi^{\star},g^{\pi^{\star}}) is a feasible solution for any t\in[T] and any h\in[H].

Then, we have

J(π)\displaystyle J(\pi^{\star}) =𝔼[g1π(o1)]𝔼𝒟0[g1π(o1)]+ϵini\displaystyle=\mathbb{E}[g^{\pi^{\star}}_{1}(o_{1})]\leq\mathbb{E}_{\mathcal{D}^{0}}[g^{\pi^{\star}}_{1}(o_{1})]+\epsilon_{ini} (Uniform convergence result)
𝔼𝒟0[g1t(o1)]+ϵini\displaystyle\leq\mathbb{E}_{\mathcal{D}^{0}}[g^{t}_{1}(o_{1})]+\epsilon_{ini} (Using the construction of algorithm)
𝔼[g1t(o1)]+2ϵini.\displaystyle\leq\mathbb{E}[g^{t}_{1}(o_{1})]+2\epsilon_{ini}. (Uniform convergence)

Next, we prove the following lemma to upper bound the per step regret.

Lemma 36.

For any t[T]t\in[T], we have

\displaystyle J(\pi^{\star})-J(\pi^{t})\leq\sum_{h=1}^{H}\left(|\langle W_{h}(\pi^{t},g^{t}),X_{h}(\pi^{t})\rangle|\right)+2\epsilon_{ini}.
Proof.
\displaystyle J(\pi^{\star})-J(\pi^{t})
2ϵini+𝔼[g1t(o1)]J(πt)\displaystyle\leq 2\epsilon_{ini}+\mathbb{E}[g^{t}_{1}(o_{1})]-J(\pi^{t}) (From optimism)
=2ϵini+h=1H𝔼[ght(z¯h){rh+gh+1t(z¯h+1)};a1:hπt]\displaystyle=2\epsilon_{ini}+\sum_{h=1}^{H}\mathbb{E}[g^{t}_{h}(\bar{z}_{h})-\{r_{h}+g^{t}_{h+1}(\bar{z}_{h+1})\};a_{1:h}\sim\pi^{t}] (Performance difference lemma)
2ϵini+h=1H|𝔼[ght(z¯h){rh+gh+1t(z¯h+1)};a1:hπt]|\displaystyle\leq 2\epsilon_{ini}+\sum_{h=1}^{H}|\mathbb{E}[g^{t}_{h}(\bar{z}_{h})-\{r_{h}+g^{t}_{h+1}(\bar{z}_{h+1})\};a_{1:h}\sim\pi^{t}]|
2ϵini+h=1H|Wh(πt,gt),Xh(πt)|.\displaystyle\leq 2\epsilon_{ini}+\sum_{h=1}^{H}|\langle W_{h}(\pi^{t},g^{t}),X_{h}(\pi^{t})\rangle|. (From (a) in Definition 3)

From Lemma 22, we have

1Tt=0T1h=1HXh(πt)Σt,h1HdTln(1+TBX2dλ).\displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\sum_{h=1}^{H}\|X_{h}(\pi^{t})\|_{\Sigma^{-1}_{t,h}}\leq H\sqrt{\frac{d}{T}\ln\left(1+\frac{TB^{2}_{X}}{d\lambda}\right)}.
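Lemma 22 is the standard elliptical potential argument; the following minimal numerical sketch illustrates the bound (assuming numpy, with \lambda\geq B^{2}_{X} so that the usual log-determinant potential applies; the factor 2 below is the standard constant):

import numpy as np

rng = np.random.default_rng(0)
d, T, lam, B_X = 5, 500, 1.0, 1.0          # lam >= B_X**2
Sigma = lam * np.eye(d)
total = 0.0
for t in range(T):
    x = rng.standard_normal(d)
    x *= B_X / max(np.linalg.norm(x), 1e-12)         # ||x_t||_2 <= B_X
    total += np.sqrt(x @ np.linalg.solve(Sigma, x))  # ||X_h(pi^t)||_{Sigma_{t,h}^{-1}}
    Sigma += np.outer(x, x)
# Cauchy-Schwarz plus the log-determinant potential bound:
assert total <= np.sqrt(2 * d * T * np.log(1 + T * B_X**2 / (d * lam)))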
Lemma 37.
\displaystyle\|W_{h}(\pi^{t},g^{t})\|^{2}_{\Sigma_{t,h}}\leq 2\lambda B^{2}_{W}+T\zeta^{2}(2\epsilon_{gen}).
Proof.

We have

Wh(πt,gt)Σt,h2=λWh(πt,gt)22+τ=0t1Wh(πt,gt),Xh(πτ)2.\displaystyle\|W_{h}(\pi^{t},g^{t})\|^{2}_{\Sigma_{t,h}}=\lambda\|W_{h}(\pi^{t},g^{t})\|^{2}_{2}+\sum_{\tau=0}^{t-1}\langle W_{h}(\pi^{t},g^{t}),X_{h}(\pi^{\tau})\rangle^{2}.

The first term is upper-bounded by λBW2\lambda B^{2}_{W}. The second term is upper-bounded by

τ=0t1Wh(πt,gt),Xh(πτ)2\displaystyle\sum_{\tau=0}^{t-1}\langle W_{h}(\pi^{t},g^{t}),X_{h}(\pi^{\tau})\rangle^{2}
k=0t1ζ(maxfh|𝔼[lh(τh,ah,rh,oh+1;f,πt,gt);a1:M(h)1πk,aM(h):hπe(π)]|)2\displaystyle\leq\sum_{k=0}^{t-1}\zeta\left(\max_{f\in\mathcal{F}_{h}}\left|\mathbb{E}[l_{h}(\tau_{h},a_{h},r_{h},o_{h+1};f,\pi^{t},g^{t});a_{1:M(h)-1}\sim\pi^{k},a_{M(h):h}\sim\pi^{e}(\pi)]\right|\right)^{2}
k=0t1ζ(maxfh|𝔼𝒟hk[lh(z¯h,ah,rh,oh+1;f,πt,gt)]|+ϵgen)2\displaystyle\leq\sum_{k=0}^{t-1}\zeta\left(\max_{f\in\mathcal{F}_{h}}\left|\mathbb{E}_{\mathcal{D}^{k}_{h}}[l_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};f,\pi^{t},g^{t})]\right|+\epsilon_{gen}\right)^{2}
tζ(2ϵgen)2.\displaystyle\leq t\zeta(2\epsilon_{gen})^{2}.

From the first line to the second line, we use (b) in Definition 7. From the second line to the third line, we use that \zeta is a non-decreasing function. In the last line, we use the constraint on (\pi^{t},g^{t}). ∎

Combining lemmas so far, we have

J(π)J(π^)\displaystyle J(\pi^{\star})-J(\hat{\pi}) 1Tt=0T1h=1H|Wh(πt,gt),Xh(πt)|+2ϵini\displaystyle\leq\frac{1}{T}\sum_{t=0}^{T-1}\sum_{h=1}^{H}|\langle W_{h}(\pi^{t},g^{t}),X_{h}(\pi^{t})\rangle|+2\epsilon_{ini}
1Tt=0T1h=1HWh(πt,gt)Σt,hXh(πt)Σt,h1+2ϵini\displaystyle\leq\frac{1}{T}\sum_{t=0}^{T-1}\sum_{h=1}^{H}\|W_{h}(\pi^{t},g^{t})\|_{\Sigma_{t,h}}\|X_{h}(\pi^{t})\|_{\Sigma^{-1}_{t,h}}+2\epsilon_{ini} (CS inequality)
H1/2[2λBW2+Tζ2(2ϵgen)]1/2(dHTln(1+TBX2dλ))1/2+2ϵini.\displaystyle\leq H^{1/2}\left[2\lambda B^{2}_{W}+T\zeta^{2}(2\epsilon_{gen})\right]^{1/2}\left(\frac{dH}{T}\ln\left(1+\frac{TB^{2}_{X}}{d\lambda}\right)\right)^{1/2}+2\epsilon_{ini}.

We set λ\lambda such that BX2/λ=BW2BX2/ζ2(ϵgen)+1B^{2}_{X}/\lambda=B^{2}_{W}B^{2}_{X}/\zeta^{2}(\epsilon_{gen})+1 and T=2Hdln(4Hd(BX2BW2/ζ2(ϵ~gen)+1))T=\left\lceil 2Hd\ln(4Hd(B^{2}_{X}B^{2}_{W}/\zeta^{2}(\tilde{\epsilon}_{gen})+1))\right\rceil. Then,

HdTln(1+TBX2dλ)\displaystyle\frac{Hd}{T}\ln\left(1+\frac{TB^{2}_{X}}{d\lambda}\right) HdTln(1+Td(BW2BX2ζ2(ϵgen)+1))\displaystyle\leq\frac{Hd}{T}\ln\left(1+\frac{T}{d}\left(\frac{B^{2}_{W}B^{2}_{X}}{\zeta^{2}(\epsilon_{gen})}+1\right)\right)
HdTln(1+Td(BW2BX2ζ2(ϵ~gen)+1))\displaystyle\leq\frac{Hd}{T}\ln\left(1+\frac{T}{d}\left(\frac{B^{2}_{W}B^{2}_{X}}{\zeta^{2}(\tilde{\epsilon}_{gen})}+1\right)\right)
HdTln(2Td(BW2BX2ζ2(ϵ~gen)+1))1\displaystyle\leq\frac{Hd}{T}\ln\left(\frac{2T}{d}\left(\frac{B^{2}_{W}B^{2}_{X}}{\zeta^{2}(\tilde{\epsilon}_{gen})}+1\right)\right)\leq 1

since aln(bT)/T1a\ln(bT)/T\leq 1 when T=2aln(2ab)T=2a\ln(2ab).
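For completeness, with T=2a\ln(2ab) and assuming 2ab\geq e,

\displaystyle\frac{a\ln(bT)}{T}=\frac{\ln(2ab\ln(2ab))}{2\ln(2ab)}=\frac{\ln(2ab)+\ln\ln(2ab)}{2\ln(2ab)}\leq 1

since \ln\ln(2ab)\leq\ln(2ab). Here a=Hd and b=\frac{2}{d}\left(\frac{B^{2}_{W}B^{2}_{X}}{\zeta^{2}(\tilde{\epsilon}_{gen})}+1\right); the T chosen above replaces 4H by 4Hd inside the logarithm, which only increases T and, since a\ln(bT)/T is decreasing for bT\geq e, only helps.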

Finally, the following holds

J(π)J(πT)\displaystyle J(\pi^{\star})-J(\pi^{T})
H1/2[2λBW2+Tζ2(2ϵgen)]1/2+2ϵini\displaystyle\leq H^{1/2}\left[2\lambda B^{2}_{W}+T\zeta^{2}(2\epsilon_{gen})\right]^{1/2}+2\epsilon_{ini}
H1/2[2λBW2+2ζ2(2ϵgen)Hdln(4Hd(BX2BW2/ζ2(ϵ~gen)+1))]1/2+2ϵini\displaystyle\leq H^{1/2}\left[2\lambda B^{2}_{W}+2\zeta^{2}(2\epsilon_{gen})Hd\ln(4Hd(B^{2}_{X}B^{2}_{W}/\zeta^{2}(\tilde{\epsilon}_{gen})+1))\right]^{1/2}+2\epsilon_{ini} (Plug in TT )
H1/2[4ζ2(ϵgen)+2ζ2(2ϵgen)Hdln(4Hd(BX2BW2/ζ2(ϵ~gen)+1))]1/2+2ϵini.\displaystyle\leq H^{1/2}\left[4\zeta^{2}(\epsilon_{gen})+2\zeta^{2}(2\epsilon_{gen})Hd\ln(4Hd(B^{2}_{X}B^{2}_{W}/\zeta^{2}(\tilde{\epsilon}_{gen})+1))\right]^{1/2}+2\epsilon_{ini}. (Plug in λ\lambda )

Appendix M Sample Complexity in MM-step Decodable POMDPs

We first give a summary of our results. Then, we show that an MM-step decodable POMDP is a PO-bilinear rank model. After showing the uniform convergence of the loss function with fast rates, we calculate the sample complexity. Since we use squared loss functions, we need to modify the proof of Theorem 1.

M.1 PO-bilinear Rank Decomposition (Proof of Lemma 10)

In this section, we derive the PO-bilinear decomposition of M-step decodable POMDPs (Lemma 10).

First, we define moment matching policies following [20]. We denote M(h)=hMM(h)=h-M.

Definition 10 (Moment Matching Policies).

For h[M(h),h]h^{\prime}\in[M(h),h], we define

xh=(sM(h):h,oM(h):h,aM(h):h1)𝒳l\displaystyle x_{h^{\prime}}=(s_{M(h):{h^{\prime}}},o_{M(h):{h^{\prime}}},a_{M(h):h^{\prime}-1})\in\mathcal{X}_{l}

where 𝒳l=Sl×𝒪l×𝒜l1\mathcal{X}_{l}=S^{l}\times\mathcal{O}^{l}\times\mathcal{A}^{l-1} and l=hM(h)+1l=h^{\prime}-M(h)+1. For an MM-step policy π\pi and h[H]h\in[H], we define the moment matching policy μπ,h={μhπ,h:𝒳hM(h)+1Δ(𝒜)}h=M(h)h1\mu^{\pi,h}=\{\mu^{\pi,h}_{h^{\prime}}:\mathcal{X}_{h^{\prime}-M(h)+1}\to\Delta(\mathcal{A})\}^{h-1}_{h^{\prime}=M(h)}:

μhπ,h(ahxh):=𝔼[πh(ahz¯h)xh;π].\displaystyle\mu^{\pi,h}_{h^{\prime}}(a_{h^{\prime}}\mid x_{h^{\prime}}):=\mathbb{E}[\pi_{h^{\prime}}(a_{h^{\prime}}\mid\bar{z}_{h^{\prime}})\mid x_{h^{\prime}};\pi].

Note that the expectation on the right-hand side is taken under the policy \pi.

Using [20, Lemma B.2], we have

Br(π,g;π)\displaystyle\mathrm{Br}(\pi,g;\pi^{\prime}) =𝔼[{gh(z¯h)rhgh+1(z¯h+1)};a1:M(h)1π,aM(h):hπ]\displaystyle=\mathbb{E}[\{g_{h}(\bar{z}_{h})-r_{h}-g_{h+1}(\bar{z}_{h+1})\};a_{1:M(h)-1}\sim\pi^{\prime},a_{M(h):h}\sim\pi]
=𝔼[{gh(z¯h)rhgh+1(z¯h+1)};a1:M(h)1π,aM(h):h1μπ,h,ahπ]\displaystyle=\mathbb{E}[\{g_{h}(\bar{z}_{h})-r_{h}-g_{h+1}(\bar{z}_{h+1})\};a_{1:M(h)-1}\sim\pi^{\prime},a_{M(h):h-1}\sim\mu^{\pi,h},a_{h}\sim\pi]
=𝔼[𝔼[{gh(z¯h)rhgh+1(z¯h+1)}sM(h);aM(h):h1μπ,h,ahπ];a1:M(h)1π]\displaystyle=\mathbb{E}[\mathbb{E}[\{g_{h}(\bar{z}_{h})-r_{h}-g_{h+1}(\bar{z}_{h+1})\}\mid s_{M(h)};a_{M(h):h-1}\sim\mu^{\pi,h},a_{h}\sim\pi];a_{1:M(h)-1}\sim\pi^{\prime}]
=Xh(π),Wh(π,g).\displaystyle=\left\langle X_{h}(\pi^{\prime}),W_{h}(\pi,g)\right\rangle.

where

Wh(π,g)\displaystyle W_{h}(\pi,g) =𝔼[{gh(z¯h)rhgh+1(z¯h+1)}sM(h);aM(h):h1μπ,h,ahπ]μ(sM(h))d(sM(h)),\displaystyle=\int\mathbb{E}[\{g_{h}(\bar{z}_{h})-r_{h}-g_{h+1}(\bar{z}_{h+1})\}\mid s_{M(h)};a_{M(h):h-1}\sim\mu^{\pi,h},a_{h}\sim\pi]\mu(s_{M(h)})\mathrm{d}(s_{M(h)}),
Xh(π)\displaystyle X_{h}(\pi^{\prime}) =𝔼[ϕ(sM(h)1,aM(h)1);a1:M(h)1π].\displaystyle=\mathbb{E}[\phi(s_{M(h)-1},a_{M(h)-1});a_{1:M(h)-1}\sim\pi^{\prime}].

Thus, the first condition in Definition 7 ((6)) is satisfied.

Next, we show the second condition in Definition 7 ((7)). This is proved as follows

0.5|𝒜|MXh(π),Wh(π,g)2\displaystyle\frac{0.5}{|\mathcal{A}|^{M}}\left\langle X_{h}(\pi^{\prime}),W_{h}(\pi,g)\right\rangle^{2} (26)
=0.5|𝒜|M(𝔼[(gh(z¯h)(hπgh+1)(z¯h));a1:M(h)1π,aM(h):h1μπ,h])2\displaystyle=\frac{0.5}{|\mathcal{A}|^{M}}\left(\mathbb{E}\left[\left(g_{h}(\bar{z}_{h})-(\mathcal{B}^{\pi}_{h}g_{h+1})(\bar{z}_{h})\right);a_{1:M(h)-1}\sim\pi^{\prime},a_{M(h):h-1}\sim\mu^{\pi,h}\right]\right)^{2}
0.5|𝒜|M𝔼[(gh(z¯h)(hπgh+1)(z¯h))2;a1:M(h)1π,aM(h):h1μπ,h]\displaystyle\leq\frac{0.5}{|\mathcal{A}|^{M}}\mathbb{E}\left[\left(g_{h}(\bar{z}_{h})-(\mathcal{B}^{\pi}_{h}g_{h+1})(\bar{z}_{h})\right)^{2};a_{1:M(h)-1}\sim\pi^{\prime},a_{M(h):h-1}\sim\mu^{\pi,h}\right] (Jensen’s inequality)
1|𝒜|Mmaxfh𝔼[(gh(z¯h)(hπgh+1)(z¯h))f(z¯h)0.5f(z¯h)2;a1:M(h)1π,aM(h):h1μπ,h]\displaystyle\leq\frac{1}{|\mathcal{A}|^{M}}\max_{f\in\mathcal{F}_{h}}\mathbb{E}\left[\left(g_{h}(\bar{z}_{h})-(\mathcal{B}^{\pi}_{h}g_{h+1})(\bar{z}_{h})\right)f(\bar{z}_{h})-0.5f(\bar{z}_{h})^{2};a_{1:M(h)-1}\sim\pi^{\prime},a_{M(h):h-1}\sim\mu^{\pi,h}\right]
=1|𝒜|Mmaxfh𝔼[|𝒜|πh(ah|z¯h)(gh(z¯h)rhgh+1(z¯h+1))f(z¯h)0.5f(z¯h)2;a1:M(h)1π,aM(h):h1μπ,h,ah𝒰(𝒜)]\displaystyle=\frac{1}{|\mathcal{A}|^{M}}\max_{f\in\mathcal{F}_{h}}\mathbb{E}\left[|\mathcal{A}|\pi_{h}(a_{h}|\bar{z}_{h})\left(g_{h}(\bar{z}_{h})-r_{h}-g_{h+1}(\bar{z}_{h+1})\right)f(\bar{z}_{h})-0.5f(\bar{z}_{h})^{2};a_{1:M(h)-1}\sim\pi^{\prime},a_{M(h):h-1}\sim\mu^{\pi,h},a_{h}\sim\mathcal{U}(\mathcal{A})\right]
maxfh𝔼[|𝒜|πh(ah|z¯h)(gh(z¯h)rhgh+1(z¯h+1))f(z¯h)0.5f(z¯h)2;a1:M(h)1π,aM(h):h𝒰(𝒜)]\displaystyle\leq\max_{f\in\mathcal{F}_{h}}\mathbb{E}\left[|\mathcal{A}|\pi_{h}(a_{h}|\bar{z}_{h})\left(g_{h}(\bar{z}_{h})-r_{h}-g_{h+1}(\bar{z}_{h+1})\right)f(\bar{z}_{h})-0.5f(\bar{z}_{h})^{2};a_{1:M(h)-1}\sim\pi^{\prime},a_{M(h):h}\sim\mathcal{U}(\mathcal{A})\right]
=maxfh𝔼[lh(z¯h,ah,rhoh+1;f,π,g);a1:M(h)1π,aM(h):h𝒰(𝒜)].\displaystyle=\max_{f\in\mathcal{F}_{h}}\mathbb{E}\left[l_{h}(\bar{z}_{h},a_{h},r_{h}o_{h+1};f,\pi,g);a_{1:M(h)-1}\sim\pi^{\prime},a_{M(h):h}\sim\mathcal{U}(\mathcal{A})\right]. (27)

From the first line to the second line, we use [20, Lemma B.2]. From the third to the fourth line, we use the Bellman completeness assumption: -(\mathcal{B}^{\pi}_{h}\mathcal{G})+\mathcal{G}_{h}\subset\mathcal{F}_{h}. From the fourth line to the fifth line, we use importance sampling.

Finally, we show the third condition in Definition 7 (8):

|maxfh𝔼[lh(τh,ah,rh,oh+1;f,π,gπ);a1:M(h)1π,aM(h):h𝒰(𝒜)]|=0.\displaystyle\left\lvert\max_{f\in\mathcal{F}_{h}}\mathbb{E}[l_{h}(\tau_{h},a_{h},r_{h},o_{h+1};f,\pi,g^{\pi});a_{1:M(h)-1}\sim\pi^{\prime},a_{M(h):h}\sim\mathcal{U}(\mathcal{A})]\right\rvert=0. (28)

This follows since

𝔼[lh(τh,ah,rh,oh+1;f,π,gπ);a1:M(h)1π,aM(h):h𝒰(𝒜)]\displaystyle\mathbb{E}[l_{h}(\tau_{h},a_{h},r_{h},o_{h+1};f,\pi,g^{\pi});a_{1:M(h)-1}\sim\pi^{\prime},a_{M(h):h}\sim\mathcal{U}(\mathcal{A})]
=𝔼[|𝒜|πh(ah|z¯h)(ghπ(z¯h)rhgh+1π(z¯h+1))f(z¯h)0.5f(z¯h)2;a1:M(h)1π,aM(h):h𝒰(𝒜)]\displaystyle=\mathbb{E}\left[|\mathcal{A}|\pi_{h}(a_{h}|\bar{z}_{h})\left(g^{\pi}_{h}(\bar{z}_{h})-r_{h}-g^{\pi}_{h+1}(\bar{z}_{h+1})\right)f(\bar{z}_{h})-0.5f(\bar{z}_{h})^{2};a_{1:M(h)-1}\sim\pi^{\prime},a_{M(h):h}\sim\mathcal{U}(\mathcal{A})\right]
=𝔼[𝔼[|𝒜|πh(ah|z¯h)(ghπ(z¯h)rhgh+1π(z¯h+1))z¯h]f(z¯h)0.5f(z¯h)2;a1:M(h)1π,aM(h):h𝒰(𝒜)]\displaystyle=\mathbb{E}\left[\mathbb{E}[|\mathcal{A}|\pi_{h}(a_{h}|\bar{z}_{h})\left(g^{\pi}_{h}(\bar{z}_{h})-r_{h}-g^{\pi}_{h+1}(\bar{z}_{h+1})\right)\mid\bar{z}_{h}]f(\bar{z}_{h})-0.5f(\bar{z}_{h})^{2};a_{1:M(h)-1}\sim\pi^{\prime},a_{M(h):h}\sim\mathcal{U}(\mathcal{A})\right]
=𝔼[0.5f(z¯h)2;a1:M(h)1π,aM(h):h𝒰(𝒜)].\displaystyle=\mathbb{E}\left[-0.5f(\bar{z}_{h})^{2};a_{1:M(h)-1}\sim\pi^{\prime},a_{M(h):h}\sim\mathcal{U}(\mathcal{A})\right].

M.2 Uniform Convergence

We define the operator

(hπg)(z¯h):=𝔼[rh+gh+1(z¯h+1)z¯h;ahπ].\displaystyle(\mathcal{B}^{\pi}_{h}g)(\bar{z}_{h}):=\mathbb{E}[r_{h}+g_{h+1}(\bar{z}_{h+1})\mid\bar{z}_{h};a_{h}\sim\pi].

and

\displaystyle(\bar{\mathcal{B}}^{\pi}_{h}g)(\bar{z}_{h}):=-(\mathcal{B}^{\pi}_{h}g)(\bar{z}_{h})+g_{h}(\bar{z}_{h}).
Lemma 38 (Uniform Convergence).

Let |𝒟|=m|\mathcal{D}|=m. Suppose h3H\|\mathcal{F}_{h}\|_{\infty}\leq 3H for h[H]h\in[H]. Fix πΠ\pi^{\prime}\in\Pi.

  1.

    Take a true link function gπ𝒢g^{\pi}\in\mathcal{G}. Then, it satisfies

    \displaystyle\max_{f_{h}\in\mathcal{F}_{h}}|\mathbb{E}_{\mathcal{D}}[|\mathcal{A}|\pi_{h}(a_{h}\mid\bar{z}_{h})\{g^{\pi}_{h}(\bar{z}_{h})-r_{h}-g^{\pi}_{h+1}(\bar{z}_{h+1})\}f_{h}(\bar{z}_{h})-0.5f_{h}(\bar{z}_{h})^{2};a_{1:M(h)-1}\sim\pi^{\prime},a_{M(h):h}\sim U(\mathcal{A})]|
    c1(H|𝒜|)2ln(|Πmax||max||𝒢max|/δ)m.\displaystyle\leq c_{1}\frac{(H|\mathcal{A}|)^{2}\ln(|\Pi_{\max}||\mathcal{F}_{\max}||\mathcal{G}_{\max}|/\delta)}{m}.
  2.

    Suppose g(π)g(\pi) satisfies

    maxfhh|𝔼𝒟[|𝒜|πh(ahz¯h){gh(π)(z¯h)rhgh+1(π)(z¯h+1)}fh(z¯h)0.5fh(z¯h)2;a1:M(h)1π,aM(h):hU(𝒜)]|\displaystyle\max_{f_{h}\in\mathcal{F}_{h}}|\mathbb{E}_{\mathcal{D}}[|\mathcal{A}|\pi_{h}(a_{h}\mid\bar{z}_{h})\{g_{h}(\pi)(\bar{z}_{h})-r_{h}-g_{h+1}(\pi)(\bar{z}_{h+1})\}f_{h}(\bar{z}_{h})-0.5f_{h}(\bar{z}_{h})^{2};a_{1:M(h)-1}\sim\pi^{\prime},a_{M(h):h}\sim U(\mathcal{A})]|
    Λ,\displaystyle\leq\Lambda,

    and the Bellman completeness ¯hπ𝒢h(πΠ)\bar{\mathcal{B}}^{\pi}_{h}\mathcal{G}\subset\mathcal{F}_{h}(\forall\pi\in\Pi) holds. Then, with probability 1δ1-\delta, we have

    𝔼[(¯hπg(π))2(z¯h);a1:M(h)1π,aM(h):h1U(𝒜)]\displaystyle\mathbb{E}[(\bar{\mathcal{B}}^{\pi}_{h}g(\pi))^{2}(\bar{z}_{h});a_{1:M(h)-1}\sim\pi^{\prime},a_{M(h):h-1}\sim U(\mathcal{A})]
    Λ+c2(H|𝒜|)2ln(|Πmax||max||𝒢max|/δ)m.\displaystyle\leq\Lambda+c_{2}\frac{(H|\mathcal{A}|)^{2}\ln(|\Pi_{\max}||\mathcal{F}_{\max}||\mathcal{G}_{\max}|/\delta)}{m}.
Proof.

To simplify the notation, we define

αh(z¯h,ah,rh,oh+1;g)=πh(ahz¯h)|𝒜|{gh(z¯h)rhgh+1(z¯h+1)}.\alpha_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};g)=\pi_{h}(a_{h}\mid\bar{z}_{h})|\mathcal{A}|\{g_{h}(\bar{z}_{h})-r_{h}-g_{h+1}(\bar{z}_{h+1})\}.

Given g𝒢g\in\mathcal{G}, we define f^h(;g)\hat{f}_{h}(\cdot;g) as the maximizer:

argmaxfhh|𝔼𝒟[|𝒜|πh(ahz¯h){gh(z¯h)rhgh+1(z¯h+1)}fh(z¯h)0.5fh(z¯h)2;a1:M(h)1π,aM(h):hU(𝒜)]|.\displaystyle\mathop{\mathrm{argmax}}_{f_{h}\in\mathcal{F}_{h}}|\mathbb{E}_{\mathcal{D}}[|\mathcal{A}|\pi_{h}(a_{h}\mid\bar{z}_{h})\{g_{h}(\bar{z}_{h})-r_{h}-g_{h+1}(\bar{z}_{h+1})\}f_{h}(\bar{z}_{h})-0.5f_{h}(\bar{z}_{h})^{2};a_{1:M(h)-1}\sim\pi^{\prime},a_{M(h):h}\sim U(\mathcal{A})]|.

In this proof, the expectation is always taken with respect to the data-generating process of \mathcal{D}. We first observe

𝔼𝒟[αh(z¯h,ah,rh,oh+1;g)fh(z¯h)0.5fh(z¯h)2]\displaystyle\mathbb{E}_{\mathcal{D}}[\alpha_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};g)f_{h}(\bar{z}_{h})-0.5f_{h}(\bar{z}_{h})^{2}]
=0.5𝔼𝒟[αh(z¯h,ah,rh,oh+1;g)2{αh(z¯h,ah,rh,oh+1;g)fh(z¯h)}2].\displaystyle=0.5\mathbb{E}_{\mathcal{D}}[\alpha_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};g)^{2}-\{\alpha_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};g)-f_{h}(\bar{z}_{h})\}^{2}].

Then, we define

\displaystyle\mathrm{Er}_{h}(f,g):=0.5\{\alpha_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};g)-f_{h}(\bar{z}_{h})\}^{2}-0.5\{\alpha_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};g)-(\bar{\mathcal{B}}^{\pi}_{h}g)(\bar{z}_{h})\}^{2}.

As the first step, we prove with probability 1δ1-\delta

\displaystyle\forall g;\quad|\mathbb{E}_{\mathcal{D}}[\mathrm{Er}_{h}(\hat{f}_{h}(\cdot;g),g)]|\leq\frac{c(H|\mathcal{A}|)^{2}\ln(2|\mathcal{F}_{h}||\mathcal{G}_{h}||\mathcal{G}_{h+1}|/\delta)}{m}. (29)

We first fix gg. Then, from the definition of f^h(;g)\hat{f}_{h}(\cdot;g) and the Bellman completeness ¯hπ𝒢h\bar{\mathcal{B}}^{\pi}_{h}\mathcal{G}\subset\mathcal{F}_{h}, we have

𝔼𝒟[Erh(f^h(;g),g)]0.\displaystyle\mathbb{E}_{\mathcal{D}}[\mathrm{Er}_{h}(\hat{f}_{h}(\cdot;g),g)]\leq 0. (30)

Here, we invoke Bernstein’s inequality:

\displaystyle\forall f\in\mathcal{F}_{h};\quad|(\mathbb{E}-\mathbb{E}_{\mathcal{D}})\mathrm{Er}_{h}(f,g)|\leq\sqrt{\mathbb{E}[\mathrm{Er}_{h}(f,g)]\frac{(6H|\mathcal{A}|)^{2}\ln(2|\mathcal{F}_{h}|/\delta)}{m}}+\frac{(6H|\mathcal{A}|)^{2}\ln(2|\mathcal{F}_{h}|/\delta)}{m}. (31)

Hereafter, we condition on the above event. Then, combining (30) and (31), we have

𝔼[Erh(f^h(;g),g)]\displaystyle\mathbb{E}[\mathrm{Er}_{h}(\hat{f}_{h}(\cdot;g),g)] 𝔼𝒟[Erh(f^h(;g),g)]+|(𝔼𝔼𝒟)Erh(f^h(;g),g)|\displaystyle\leq\mathbb{E}_{\mathcal{D}}[\mathrm{Er}_{h}(\hat{f}_{h}(\cdot;g),g)]+|(\mathbb{E}-\mathbb{E}_{\mathcal{D}})\mathrm{Er}_{h}(\hat{f}_{h}(\cdot;g),g)|
\displaystyle\leq\sqrt{\frac{\mathbb{E}[\mathrm{Er}_{h}(\hat{f}_{h}(\cdot;g),g)](6H|\mathcal{A}|)^{2}\ln(2|\mathcal{F}_{h}|/\delta)}{m}}+\frac{(6H|\mathcal{A}|)^{2}\ln(2|\mathcal{F}_{h}|/\delta)}{m}.

Here, we use

\displaystyle\mathbb{E}[\mathrm{Er}_{h}(\hat{f}_{h}(\cdot;g),g)]=0.5\mathbb{E}[\{\hat{f}_{h}(\bar{z}_{h};g)-(\bar{\mathcal{B}}^{\pi}_{h}g)(\bar{z}_{h})\}^{2}],
\displaystyle\mathbb{E}[\mathrm{Er}_{h}(\hat{f}_{h}(\cdot;g),g)^{2}]\leq\mathbb{E}[\{\hat{f}_{h}(\bar{z}_{h};g)-(\bar{\mathcal{B}}^{\pi}_{h}g)(\bar{z}_{h})\}^{2}](6H|\mathcal{A}|)^{2}\leq 2\mathbb{E}[\mathrm{Er}_{h}(\hat{f}_{h}(\cdot;g),g)](6H|\mathcal{A}|)^{2}.

Thus, by some algebra,

𝔼[Erh(f^h(;g),g)](12H|𝒜|)2ln(2|h|/δ)m.\displaystyle\mathbb{E}[\mathrm{Er}_{h}(\hat{f}_{h}(\cdot;g),g)]\leq\frac{(12H|\mathcal{A}|)^{2}\ln(2|\mathcal{F}_{h}|/\delta)}{m}.
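The omitted algebra is the standard quadratic-inequality step: if x\leq\sqrt{ax}+a for some a>0, then

\displaystyle\sqrt{x}\leq\frac{\sqrt{a}+\sqrt{5a}}{2},\quad\text{so}\quad x\leq\left(\frac{1+\sqrt{5}}{2}\right)^{2}a\leq 4a;

applying this with x=\mathbb{E}[\mathrm{Er}_{h}(\hat{f}_{h}(\cdot;g),g)] and a=(6H|\mathcal{A}|)^{2}\ln(2|\mathcal{F}_{h}|/\delta)/m gives the display, since 4(6H|\mathcal{A}|)^{2}=(12H|\mathcal{A}|)^{2}.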

Besides,

|𝔼𝒟[Erh(f^h(;g),g)]|\displaystyle|\mathbb{E}_{\mathcal{D}}[\mathrm{Er}_{h}(\hat{f}_{h}(\cdot;g),g)]|
\displaystyle\leq\mathbb{E}[\mathrm{Er}_{h}(\hat{f}_{h}(\cdot;g),g)]+|(\mathbb{E}-\mathbb{E}_{\mathcal{D}})[\mathrm{Er}_{h}(\hat{f}_{h}(\cdot;g),g)]|
\displaystyle\leq\mathbb{E}[\mathrm{Er}_{h}(\hat{f}_{h}(\cdot;g),g)]+\sqrt{\frac{\mathbb{E}[\mathrm{Er}_{h}(\hat{f}_{h}(\cdot;g),g)](6H|\mathcal{A}|)^{2}\ln(2|\mathcal{F}_{h}|/\delta)}{m}}+\frac{(6H|\mathcal{A}|)^{2}\ln(2|\mathcal{F}_{h}|/\delta)}{m}
3(12H|𝒜|)2ln(2|h|/δ)m+27H|𝒜|ln(2|h|/δ)m.\displaystyle\leq\frac{3(12H|\mathcal{A}|)^{2}\ln(2|\mathcal{F}_{h}|/\delta)}{m}+\frac{27H|\mathcal{A}|\ln(2|\mathcal{F}_{h}|/\delta)}{m}.

Lastly, by a union bound over \mathcal{G}_{h},\mathcal{G}_{h+1}, the statement (29) is proved. Note that \bar{\mathcal{B}}^{\pi}_{h}g^{\pi}=0.

First Statement.
|𝔼𝒟[αh(z¯h,ah,rh,oh+1;gπ)f^h(z¯h;gπ)0.5f^h(z¯h;gπ)2]|\displaystyle|\mathbb{E}_{\mathcal{D}}[\alpha_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};g^{\pi})\hat{f}_{h}(\bar{z}_{h};g^{\pi})-0.5\hat{f}_{h}(\bar{z}_{h};g^{\pi})^{2}]|
\displaystyle=|0.5\mathbb{E}_{\mathcal{D}}[\alpha_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};g^{\pi})^{2}]-0.5\mathbb{E}_{\mathcal{D}}[\{\alpha_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};g^{\pi})-\hat{f}_{h}(\bar{z}_{h};g^{\pi})\}^{2}]|=|\mathbb{E}_{\mathcal{D}}[\mathrm{Er}_{h}(\hat{f}_{h}(\cdot;g^{\pi}),g^{\pi})]|
\displaystyle\leq c\frac{(H|\mathcal{A}|)^{2}\ln(|\mathcal{F}_{h}||\mathcal{G}_{h}||\mathcal{G}_{h+1}|/\delta)}{m}.

From the second line to the third line, we use (29).

Second Statement.

Now, we use the assumption on g(π)g(\pi):

𝔼𝒟[αh(z¯h,ah,rh,oh+1;g(π))f^h(z¯h;g(π))0.5f^h(z¯h;g(π))2]Λ.\displaystyle\mathbb{E}_{\mathcal{D}}[\alpha_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};g(\pi))\hat{f}_{h}(\bar{z}_{h};g(\pi))-0.5\hat{f}_{h}(\bar{z}_{h};g(\pi))^{2}]\leq\Lambda.

From what we showed in (29), this implies

𝔼𝒟[αh(z¯h,ah,rh,oh+1;g(π))(¯hπg(π))(z¯h)0.5(¯hπg(π))2(z¯h)]Λ+3(12H|𝒜|)2ln(|h||𝒢h||𝒢h+1|/δ)m.\displaystyle\mathbb{E}_{\mathcal{D}}[\alpha_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};g(\pi))(\bar{\mathcal{B}}^{\pi}_{h}g(\pi))(\bar{z}_{h})-0.5(\bar{\mathcal{B}}^{\pi}_{h}g(\pi))^{2}(\bar{z}_{h})]\leq\Lambda+\frac{3(12H|\mathcal{A}|)^{2}\ln(|\mathcal{F}_{h}||\mathcal{G}_{h}||\mathcal{G}_{h+1}|/\delta)}{m}.

Recall we want to upper-bound the error for 𝔼[0.5(¯hπg(π))2(z¯h)].\mathbb{E}[0.5(\bar{\mathcal{B}}^{\pi}_{h}g(\pi))^{2}(\bar{z}_{h})]. Here, we use the following observation later:

𝔼[αh(z¯h,ah,rh,oh+1;g(π))(¯hπg(π))(z¯h)0.5(¯hπg(π))2(z¯h)]=𝔼[0.5(¯hπg(π))2(z¯h)].\displaystyle\mathbb{E}[\alpha_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};g(\pi))(\bar{\mathcal{B}}^{\pi}_{h}g(\pi))(\bar{z}_{h})-0.5(\bar{\mathcal{B}}^{\pi}_{h}g(\pi))^{2}(\bar{z}_{h})]=\mathbb{E}[0.5(\bar{\mathcal{B}}^{\pi}_{h}g(\pi))^{2}(\bar{z}_{h})].

We use Bernstein’s inequality: with probability 1δ1-\delta, for any g𝒢g\in\mathcal{G},

\displaystyle|(\mathbb{E}-\mathbb{E}_{\mathcal{D}})[\alpha_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};g)(\bar{\mathcal{B}}^{\pi}_{h}g)(\bar{z}_{h})-0.5(\bar{\mathcal{B}}^{\pi}_{h}g)^{2}(\bar{z}_{h})]|
\displaystyle\leq\sqrt{\frac{\mathbb{E}[(3|\mathcal{A}|H)^{2}(\bar{\mathcal{B}}^{\pi}_{h}g)^{2}(\bar{z}_{h})]\ln(2|\mathcal{G}_{h}||\mathcal{G}_{h+1}|/\delta)}{m}}+\frac{(3|\mathcal{A}|H)^{2}\ln(2|\mathcal{G}_{h}||\mathcal{G}_{h+1}|/\delta)}{m}.

Here, we use

\displaystyle\mathbb{E}[\{\alpha_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};g)(\bar{\mathcal{B}}^{\pi}_{h}g)(\bar{z}_{h})-0.5(\bar{\mathcal{B}}^{\pi}_{h}g)^{2}(\bar{z}_{h})\}^{2}]
\displaystyle\leq\mathbb{E}[\{\alpha_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};g)(\bar{\mathcal{B}}^{\pi}_{h}g)(\bar{z}_{h})-0.5(\bar{\mathcal{B}}^{\pi}_{h}g)^{2}(\bar{z}_{h})\}](6|\mathcal{A}|H)^{2}.

Hereafter, we condition on the above event.

Finally, we have

𝔼[0.5(¯hπg(π))2(z¯h)]\displaystyle\mathbb{E}[0.5(\bar{\mathcal{B}}^{\pi}_{h}g(\pi))^{2}(\bar{z}_{h})]
𝔼𝒟[αh(z¯h,ah,rh,oh+1;g(π))(¯hπg(π))(z¯h)0.5(¯hπg(π))2(z¯h)]+\displaystyle\leq\mathbb{E}_{\mathcal{D}}[\alpha_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};g(\pi))(\bar{\mathcal{B}}^{\pi}_{h}g(\pi))(\bar{z}_{h})-0.5(\bar{\mathcal{B}}^{\pi}_{h}g(\pi))^{2}(\bar{z}_{h})]+
+|(𝔼𝔼𝒟)[αh(z¯h,ah,rh,oh+1;g(π))(¯hπg(π))(z¯h)0.5(¯hπg(π))2(z¯h)]|\displaystyle+|(\mathbb{E}-\mathbb{E}_{\mathcal{D}})[\alpha_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};g(\pi))(\bar{\mathcal{B}}^{\pi}_{h}g(\pi))(\bar{z}_{h})-0.5(\bar{\mathcal{B}}^{\pi}_{h}g(\pi))^{2}(\bar{z}_{h})]|
Λ+3(12H|𝒜|)2ln(4|h||𝒢h||𝒢h+1|/δ)m\displaystyle\leq\Lambda+\frac{3(12H|\mathcal{A}|)^{2}\ln(4|\mathcal{F}_{h}||\mathcal{G}_{h}||\mathcal{G}_{h+1}|/\delta)}{m}
\displaystyle+|(\mathbb{E}-\mathbb{E}_{\mathcal{D}})[\alpha_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};g(\pi))(\bar{\mathcal{B}}^{\pi}_{h}g(\pi))(\bar{z}_{h})-0.5(\bar{\mathcal{B}}^{\pi}_{h}g(\pi))^{2}(\bar{z}_{h})]|
\displaystyle\leq\Lambda+\frac{3(12H|\mathcal{A}|)^{2}\ln(4|\mathcal{F}_{h}||\mathcal{G}_{h}||\mathcal{G}_{h+1}|/\delta)}{m}+\sqrt{\frac{\mathbb{E}[(3|\mathcal{A}|H)^{2}(\bar{\mathcal{B}}^{\pi}_{h}g(\pi))^{2}(\bar{z}_{h})]\ln(4|\mathcal{G}_{h}||\mathcal{G}_{h+1}|/\delta)}{m}}+\frac{(3|\mathcal{A}|H)^{2}\ln(4|\mathcal{G}_{h}||\mathcal{G}_{h+1}|/\delta)}{m}.

Hence,

\displaystyle\forall\pi\in\Pi,\forall g(\pi);\quad\mathbb{E}[0.5(\bar{\mathcal{B}}^{\pi}_{h}g(\pi))^{2}(\bar{z}_{h})]\leq\Lambda+c\frac{(H|\mathcal{A}|)^{2}\ln(|\mathcal{F}_{h}||\mathcal{G}_{h}||\mathcal{G}_{h+1}|/\delta)}{m}. ∎

M.3 Proof of Main Statement

We define

|max|=maxh[H]|h|,|Πmax|=maxh[H]|Πh|,|𝒢max|=maxh[H]|𝒢h|.\displaystyle|\mathcal{F}_{\max}|=\max_{h\in[H]}|\mathcal{F}_{h}|,\,|\Pi_{\max}|=\max_{h\in[H]}|\Pi_{h}|,|\mathcal{G}_{\max}|=\max_{h\in[H]}|\mathcal{G}_{h}|.

Let

ϵgen2\displaystyle\epsilon^{2}_{gen} =c1(H|𝒜|)2ln(|Πmax||max||𝒢max|T(H+1)/δ)m,\displaystyle=c_{1}\frac{(H|\mathcal{A}|)^{2}\ln(|\Pi_{\max}||\mathcal{F}_{\max}||\mathcal{G}_{\max}|T(H+1)/\delta)}{m},
ϵ~gen2\displaystyle\tilde{\epsilon}^{2}_{gen} =c1(H|𝒜|)2ln(|Πmax||max||𝒢max|(H+1)/δ)m,\displaystyle=c_{1}\frac{(H|\mathcal{A}|)^{2}\ln(|\Pi_{\max}||\mathcal{F}_{\max}||\mathcal{G}_{\max}|(H+1)/\delta)}{m},
ϵini\displaystyle\epsilon_{ini} =c3(H|𝒜|)2ln(|𝒢1|T(H+1)/δ)m,\displaystyle=c_{3}\sqrt{\frac{(H|\mathcal{A}|)^{2}\ln(|\mathcal{G}_{1}|T(H+1)/\delta)}{m}},
\displaystyle T=2Hd\ln\left(4Hd\left(\frac{B^{2}_{X}B^{2}_{W}}{\tilde{\epsilon}^{2}_{gen}}+1\right)\right),\quad R=\epsilon^{2}_{gen}.

Then, from the first statement in Lemma 38, with probability 1δ1-\delta, t[T],h[H],πΠ\forall t\in[T],\forall h\in[H],\forall\pi\in\Pi

maxfhh|𝔼𝒟ht[|𝒜|πh(ahz¯h){ghπ(z¯h)rhgh+1π(z¯h+1)}fh(z¯h)0.5fh(z¯h)2;a1:M(h)1πt,aM(h):hU(𝒜)]|\displaystyle\max_{f_{h}\in\mathcal{F}_{h}}|\mathbb{E}_{\mathcal{D}^{t}_{h}}[|\mathcal{A}|\pi_{h}(a_{h}\mid\bar{z}_{h})\{g^{\pi}_{h}(\bar{z}_{h})-r_{h}-g^{\pi}_{h+1}(\bar{z}_{h+1})\}f_{h}(\bar{z}_{h})-0.5f_{h}(\bar{z}_{h})^{2};a_{1:M(h)-1}\sim\pi^{t},a_{M(h):h}\sim U(\mathcal{A})]| (32)
c1(H|𝒜|)2ln(|Πmax||max||𝒢max|T(H+1)/δ)m.\displaystyle\leq c_{1}\frac{(H|\mathcal{A}|)^{2}\ln(|\Pi_{\max}||\mathcal{F}_{\max}||\mathcal{G}_{\max}|T(H+1)/\delta)}{m}.

Besides, from the second statement in Lemma 38, for πΠ,t[T],h[H]\pi\in\Pi,\forall t\in[T],\forall h\in[H], when g(π)g(\pi) satisfies

maxfhh|𝔼𝒟ht[|𝒜|πh(ahz¯h){gh(π)(z¯h)rhgh+1(π)(z¯h+1)}fh(z¯h)0.5fh(z¯h)2;a1:M(h)1πt,aM(h):hU(𝒜)]|\displaystyle\max_{f_{h}\in\mathcal{F}_{h}}|\mathbb{E}_{\mathcal{D}^{t}_{h}}[|\mathcal{A}|\pi_{h}(a_{h}\mid\bar{z}_{h})\{g_{h}(\pi)(\bar{z}_{h})-r_{h}-g_{h+1}(\pi)(\bar{z}_{h+1})\}f_{h}(\bar{z}_{h})-0.5f_{h}(\bar{z}_{h})^{2};a_{1:M(h)-1}\sim\pi^{t},a_{M(h):h}\sim U(\mathcal{A})]|
c1(H|𝒜|)2ln(|Πmax||max||𝒢max|T(H+1)/δ)m,\displaystyle\leq c_{1}\frac{(H|\mathcal{A}|)^{2}\ln(|\Pi_{\max}||\mathcal{F}_{\max}||\mathcal{G}_{\max}|T(H+1)/\delta)}{m},

we have

𝔼[(¯hπg(π))2(z¯h);a1:M(h)1πt,aM(h):h1U(𝒜)](c1+c2)(H|𝒜|)2ln(|Πmax||max||𝒢max|TH/δ)m.\displaystyle\mathbb{E}[(\bar{\mathcal{B}}^{\pi}_{h}g(\pi))^{2}(\bar{z}_{h});a_{1:M(h)-1}\sim\pi^{t},a_{M(h):h-1}\sim U(\mathcal{A})]\leq(c_{1}+c_{2})\frac{(H|\mathcal{A}|)^{2}\ln(|\Pi_{\max}||\mathcal{F}_{\max}||\mathcal{G}_{\max}|TH/\delta)}{m}. (33)

We first show optimism. Recall \pi^{\star}=\mathop{\mathrm{argmax}}_{\pi\in\Pi}J(\pi).

Lemma 39 (Optimism).

Set R=ϵgen2R=\epsilon^{2}_{gen}. For all t[T]t\in[T], (π,gπ)(\pi^{\star},g^{\pi^{\star}}) is a feasible solution of the constrained program. Furthermore, we have J(π)𝔼[g1t(o1)]+2ϵiniJ(\pi^{\star})\leq\mathbb{E}[g^{t}_{1}(o_{1})]+2\epsilon_{ini} for any t[T]t\in[T].

Proof.

For any πΠ\pi\in\Pi, letting gπ𝒢g^{\pi}\in\mathcal{G} be a corresponding value link function, we have

maxfh|𝔼𝒟ht[lh(z¯h,ah,rh,oh+1;f,π,gπ)]|ϵgen2.\displaystyle\max_{f\in\mathcal{F}_{h}}|\mathbb{E}_{\mathcal{D}^{t}_{h}}[l_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};f,\pi,g^{\pi})]|\leq\epsilon^{2}_{gen}.

by (32). Specializing to \pi^{\star}, this implies

t[T],h[H],maxfh|𝔼𝒟ht[lh(z¯h,ah,rh,oh+1;f,π,gπ)]|ϵgen2.\forall t\in[T],\forall h\in[H],\max_{f\in\mathcal{F}_{h}}|\mathbb{E}_{\mathcal{D}^{t}_{h}}[l_{h}(\bar{z}_{h},a_{h},r_{h},o_{h+1};f,\pi^{\star},g^{\pi^{\star}})]|\leq\epsilon^{2}_{gen}.

Hence, (\pi^{\star},g^{\pi^{\star}}) is a feasible solution of the constrained program for any t\in[T]. Then, we have

J(π)\displaystyle J(\pi^{\star}) =𝔼[g1π(o1)]𝔼𝒟1t[g1π(o1)]+ϵini\displaystyle=\mathbb{E}[g^{\pi^{\star}}_{1}(o_{1})]\leq\mathbb{E}_{\mathcal{D}^{t}_{1}}[g^{\pi^{\star}}_{1}(o_{1})]+\epsilon_{ini}
𝔼𝒟1t[g1t(o1)]+ϵini𝔼[g1t(o1)]+2ϵini.\displaystyle\leq\mathbb{E}_{\mathcal{D}^{t}_{1}}[g^{t}_{1}(o_{1})]+\epsilon_{ini}\leq\mathbb{E}[g^{t}_{1}(o_{1})]+2\epsilon_{ini}.

Next, recall the following two statements, which are proved as in the proof of Theorem 1.

  • For any t[T]t\in[T],

    J(π)J(πt)h=1H|Wh(πt,gt),Xh(πt)|+2ϵini.\displaystyle J(\pi^{\star})-J(\pi^{t})\leq\sum_{h=1}^{H}|\langle W_{h}(\pi^{t},g^{t}),X_{h}(\pi^{t})\rangle|+2\epsilon_{ini}.
  • Let Σt,h=λI+τ=0t1Xh(πτ)Xh(πτ)\Sigma_{t,h}=\lambda I+\sum_{\tau=0}^{t-1}X_{h}(\pi^{\tau})X_{h}(\pi^{\tau})^{\top}. We have

    \displaystyle\frac{1}{T}\sum_{t=0}^{T-1}\sum_{h=1}^{H}\|X_{h}(\pi^{t})\|_{\Sigma^{-1}_{t,h}}\leq H\sqrt{\frac{d}{T}\ln\left(1+\frac{TB^{2}_{X}}{d\lambda}\right)}.
Lemma 40.
Wh(πt,gt)Σt,h22λBW2+T|𝒜|Mϵgen2.\displaystyle\|W_{h}(\pi^{t},g^{t})\|^{2}_{\Sigma_{t,h}}\leq 2\lambda B^{2}_{W}+T|\mathcal{A}|^{M}\epsilon^{2}_{gen}.
Proof.

We have

Wh(πt,gt)Σt,h2=λWh(πt,gt)22+τ=0t1Wh(πt,gt),Xh(πτ)2.\displaystyle\|W_{h}(\pi^{t},g^{t})\|^{2}_{\Sigma_{t,h}}=\lambda\|W_{h}(\pi^{t},g^{t})\|^{2}_{2}+\sum_{\tau=0}^{t-1}\langle W_{h}(\pi^{t},g^{t}),X_{h}(\pi^{\tau})\rangle^{2}.

The first term is upper-bounded by λBW2\lambda B^{2}_{W}. The second term is upper-bounded by

τ=0t1Wh(πt,gt),Xh(πτ)2\displaystyle\sum_{\tau=0}^{t-1}\langle W_{h}(\pi^{t},g^{t}),X_{h}(\pi^{\tau})\rangle^{2}
\displaystyle\leq|\mathcal{A}|^{M}\sum_{\tau=0}^{t-1}\mathbb{E}[\mathbb{E}[|\mathcal{A}|\pi^{t}_{h}(a_{h}\mid\bar{z}_{h})\{g^{t}_{h}(\bar{z}_{h})-r_{h}-g^{t}_{h+1}(\bar{z}_{h+1})\}\mid\bar{z}_{h};a_{h}\sim U(\mathcal{A})]^{2};a_{1:M(h)-1}\sim\pi^{\tau},a_{M(h):h-1}\sim U(\mathcal{A})]
\displaystyle=|\mathcal{A}|^{M}\sum_{\tau=0}^{t-1}\mathbb{E}[(\bar{\mathcal{B}}^{\pi^{t}}_{h}g^{t})^{2}(\bar{z}_{h});a_{1:M(h)-1}\sim\pi^{\tau},a_{M(h):h-1}\sim U(\mathcal{A})]
|𝒜|MT(c1+c2)(H|𝒜|)2ln(|Πmax||max||𝒢max|TH/δ)mT|𝒜|Mϵgen2.\displaystyle\leq|\mathcal{A}|^{M}T(c_{1}+c_{2})\frac{(H|\mathcal{A}|)^{2}\ln(|\Pi_{\max}||\mathcal{F}_{\max}||\mathcal{G}_{\max}|TH/\delta)}{m}\leq T|\mathcal{A}|^{M}\epsilon^{2}_{gen}.

From the first line to the second line, we use (26). From the third line to the fourth line, we use (33). ∎

The rest of the argument is the same as in the proof of Theorem 1. Finally, the following holds:

J(π)J(π^)5ϵgen|𝒜|M/2[H2dln(4Hd(BX2BW2/ϵ~gen+1))]1/2+2ϵini.\displaystyle J(\pi^{\star})-J(\hat{\pi})\leq 5\epsilon_{gen}|\mathcal{A}|^{M/2}\left[H^{2}d\ln(4Hd(B^{2}_{X}B^{2}_{W}/\tilde{\epsilon}_{gen}+1))\right]^{1/2}+2\epsilon_{ini}.
Sample Complexity Result.

We want to find mm such that

H2|𝒜|2+Mln(|Πmax||max||𝒢max|TH/δ)m[H2dln(HdBX2BW2m)]1/2ϵ.\displaystyle\sqrt{\frac{H^{2}|\mathcal{A}|^{2+M}\ln(|\Pi_{\max}||\mathcal{F}_{\max}||\mathcal{G}_{\max}|TH/\delta)}{m}}[H^{2}d\ln(HdB^{2}_{X}B^{2}_{W}m)]^{1/2}\leq\epsilon.

where

T=Hdln(HdBX2BW2m).\displaystyle T=Hd\ln(HdB^{2}_{X}B^{2}_{W}m).

By organizing terms, we have

H4d|𝒜|2+Mln(|Πmax||max||𝒢max|Hd/δ)ln(HdBX2BW2m)mϵ.\displaystyle\sqrt{\frac{H^{4}d|\mathcal{A}|^{2+M}\ln(|\Pi_{\max}||\mathcal{F}_{\max}||\mathcal{G}_{\max}|Hd/\delta)\ln(HdB^{2}_{X}B^{2}_{W}m)}{m}}\leq\epsilon.

Thus, setting the following mm is enough:

m=O~(H4d|𝒜|2+Mln(|Πmax||max||𝒢max|/δ)ϵ2).\displaystyle m=\tilde{O}\left(\frac{H^{4}d|\mathcal{A}|^{2+M}\ln(|\Pi_{\max}||\mathcal{F}_{\max}||\mathcal{G}_{\max}|/\delta)}{\epsilon^{2}}\right).

The total number of samples we use, mTH, is

O~(d2H6|𝒜|2+Mln(|Πmax||max||𝒢max|/δ)ϵ2).\displaystyle\tilde{O}\left(\frac{d^{2}H^{6}|\mathcal{A}|^{2+M}\ln(|\Pi_{\max}||\mathcal{F}_{\max}||\mathcal{G}_{\max}|/\delta)}{\epsilon^{2}}\right).

Appendix N Sample Complexity in Observable POMDPs with Latent Low-rank Transition

This section largely follows Section M.

N.1 Existence of Value Link Functions

Since we consider the discrete setting, we can take the value link function class as in Section H. Hence, we set

𝒢h={θ,𝟏(z)𝕆𝟏(o);θH}.\displaystyle\mathcal{G}_{h}=\left\{\langle\theta,\mathbf{1}(z)\otimes\mathbb{O}^{\dagger}\mathbf{1}(o)\rangle;\|\theta\|_{\infty}\leq H\right\}.

This ensures \|\mathcal{G}_{h}\|\leq H/\sigma_{1}. From the construction of \mathcal{F}_{h}, we can also ensure \|\mathcal{F}_{h}\|\leq 4H/\sigma_{1}.

N.2 PO-bilinear Rank Decomposition (Proof of Lemma 11)

In this section, we derive the PO-bilinear rank decomposition of observable POMDPs with latent low-rank transition, proving Lemma 11. Recall M(h)=\max(h-M,1).

Using [20, Lemma B.2], we have

Br(π,g;π)=𝔼[{gh(z¯h)rhgh+1(z¯h+1)};a1:M(h)1π,aM(h):hπ]\displaystyle\mathrm{Br}(\pi,g;\pi^{\prime})=\mathbb{E}[\{g_{h}(\bar{z}_{h})-r_{h}-g_{h+1}(\bar{z}_{h+1})\};a_{1:M(h)-1}\sim\pi^{\prime},a_{M(h):h}\sim\pi]
=𝔼[{gh(z¯h)rhgh+1(z¯h+1)};a1:M(h)1π,aM(h):h1μπ,h,ahπ]\displaystyle=\mathbb{E}[\{g_{h}(\bar{z}_{h})-r_{h}-g_{h+1}(\bar{z}_{h+1})\};a_{1:M(h)-1}\sim\pi^{\prime},a_{M(h):h-1}\sim\mu^{\pi,h},a_{h}\sim\pi]
=𝔼[𝔼[{gh(z¯h)rhgh+1(z¯h+1)}sM(h);aM(h):h1μπ,h,ahπ];a1:M(h)1π]\displaystyle=\mathbb{E}[\mathbb{E}[\{g_{h}(\bar{z}_{h})-r_{h}-g_{h+1}(\bar{z}_{h+1})\}\mid s_{M(h)};a_{M(h):h-1}\sim\mu^{\pi,h},a_{h}\sim\pi];a_{1:M(h)-1}\sim\pi^{\prime}]
=Xh(π),Wh(π,g)\displaystyle=\left\langle X_{h}(\pi^{\prime}),W_{h}(\pi,g)\right\rangle

where

Wh(π,g)\displaystyle W_{h}(\pi,g) =𝔼[{gh(z¯h)rhgh+1(z¯h+1)}sM(h);aM(h):h1μπ,h,ahπ]μ(sM(h))d(sM(h)),\displaystyle=\int\mathbb{E}[\{g_{h}(\bar{z}_{h})-r_{h}-g_{h+1}(\bar{z}_{h+1})\}\mid s_{M(h)};a_{M(h):h-1}\sim\mu^{\pi,h},a_{h}\sim\pi]\mu(s_{M(h)})\mathrm{d}(s_{M(h)}),
Xh(π)\displaystyle X_{h}(\pi^{\prime}) =𝔼[ϕ(sM(h)1,aM(h)1);a1:M(h)1π].\displaystyle=\mathbb{E}[\phi(s_{M(h)-1},a_{M(h)-1});a_{1:M(h)-1}\sim\pi^{\prime}].

Thus, the first condition in Definition 7 is satisfied.

Next, we show the second condition in Definition 7. This is proved as follows:

0.5|𝒜|MXh(π),Wh(π,g)2\displaystyle\frac{0.5}{|\mathcal{A}|^{M}}\left\langle X_{h}(\pi^{\prime}),W_{h}(\pi,g)\right\rangle^{2}
=0.5|𝒜|M𝔼[(gh(z¯h)(hπgh+1)(τh));a1:M(h)1π,aM(h):h1μπ,h]2\displaystyle=\frac{0.5}{|\mathcal{A}|^{M}}\mathbb{E}\left[\left(g_{h}(\bar{z}_{h})-(\mathcal{B}^{\pi}_{h}g_{h+1})(\tau_{h})\right);a_{1:M(h)-1}\sim\pi^{\prime},a_{M(h):h-1}\sim\mu^{\pi,h}\right]^{2}
0.5|𝒜|M𝔼[(gh(z¯h)(hπgh+1)(τh))2;a1:M(h)1π,aM(h):h1μπ,h]\displaystyle\leq\frac{0.5}{|\mathcal{A}|^{M}}\mathbb{E}\left[\left(g_{h}(\bar{z}_{h})-(\mathcal{B}^{\pi}_{h}g_{h+1})(\tau_{h})\right)^{2};a_{1:M(h)-1}\sim\pi^{\prime},a_{M(h):h-1}\sim\mu^{\pi,h}\right]
1|𝒜|Mmaxfh𝔼[(gh(z¯h)(hπgh+1)(τh))f(τh)0.5f(τh)2;a1:M(h)1π,aM(h):h1μπ,h]\displaystyle\leq\frac{1}{|\mathcal{A}|^{M}}\max_{f\in\mathcal{F}_{h}}\mathbb{E}\left[\left(g_{h}(\bar{z}_{h})-(\mathcal{B}^{\pi}_{h}g_{h+1})(\tau_{h})\right)f(\tau_{h})-0.5f(\tau_{h})^{2};a_{1:M(h)-1}\sim\pi^{\prime},a_{M(h):h-1}\sim\mu^{\pi,h}\right]
=1|𝒜|Mmaxfh𝔼[|𝒜|πh(ah|z¯h)(gh(z¯h)rhgh+1(z¯h+1))f(τh)0.5f(τh)2;a1:M(h)1π,aM(h):h1μπ,h,ah𝒰(𝒜)]\displaystyle=\frac{1}{|\mathcal{A}|^{M}}\max_{f\in\mathcal{F}_{h}}\mathbb{E}\left[|\mathcal{A}|\pi_{h}(a_{h}|\bar{z}_{h})\left(g_{h}(\bar{z}_{h})-r_{h}-g_{h+1}(\bar{z}_{h+1})\right)f(\tau_{h})-0.5f(\tau_{h})^{2};a_{1:M(h)-1}\sim\pi^{\prime},a_{M(h):h-1}\sim\mu^{\pi,h},a_{h}\sim\mathcal{U}(\mathcal{A})\right]
maxfh𝔼[|𝒜|πh(ah|z¯h)(gh(z¯h)rhgh+1(z¯h+1))f(τh)0.5f(τh)2;a1:M(h)1π,aM(h):h𝒰(𝒜)]\displaystyle\leq\max_{f\in\mathcal{F}_{h}}\mathbb{E}\left[|\mathcal{A}|\pi_{h}(a_{h}|\bar{z}_{h})\left(g_{h}(\bar{z}_{h})-r_{h}-g_{h+1}(\bar{z}_{h+1})\right)f(\tau_{h})-0.5f(\tau_{h})^{2};a_{1:M(h)-1}\sim\pi^{\prime},a_{M(h):h}\sim\mathcal{U}(\mathcal{A})\right]
=maxfh𝔼[lh(τh,ah,rh,oh+1;f,π,g);a1:M(h)1π,aM(h):h𝒰(𝒜)].\displaystyle=\max_{f\in\mathcal{F}_{h}}\mathbb{E}\left[l_{h}(\tau_{h},a_{h},r_{h},o_{h+1};f,\pi,g);a_{1:M(h)-1}\sim\pi^{\prime},a_{M(h):h}\sim\mathcal{U}(\mathcal{A})\right].

From the first line to the second line, we use [20, Lemma B.2]. From the second line to the third line, we use Jensen's inequality. From the third line to the fourth line, we use the Bellman completeness assumption: -(\mathcal{B}^{\pi}_{h}\mathcal{G})+\mathcal{G}_{h}\subset\mathcal{F}_{h}. From the fourth line to the fifth line, we use importance sampling.

The third condition

|maxfh𝔼[lh(τh,ah,rh,oh+1;f,π,gπ);a1:M(h)1π,aM(h):hU(𝒜)]|=0.\displaystyle\left\lvert\max_{f\in\mathcal{F}_{h}}\mathbb{E}[l_{h}(\tau_{h},a_{h},r_{h},o_{h+1};f,\pi,g^{\pi});a_{1:M(h)-1}\sim\pi^{\prime},a_{M(h):h}\sim U(\mathcal{A})]\right\rvert=0.

is easily proved.

Finally, the following norm constraints hold:

Wh(π,g)3C𝒢d,Xh(π)1.\displaystyle\|W_{h}(\pi,g)\|\leq 3C_{\mathcal{G}}\sqrt{d},\quad\|X_{h}(\pi^{\prime})\|\leq 1.
Sample Complexity Result.

Following the same procedure as in Section M, we want to find m such that

C𝒢2|𝒜|2+Mln(|Πmax||max||𝒢max|TH/δ)m[H2dln(HdBX2BW2m)]1/2ϵ.\displaystyle\sqrt{\frac{C^{2}_{\mathcal{G}}|\mathcal{A}|^{2+M}\ln(|\Pi_{\max}||\mathcal{F}_{\max}||\mathcal{G}_{\max}|TH/\delta)}{m}}[H^{2}d\ln(HdB^{2}_{X}B^{2}_{W}m)]^{1/2}\leq\epsilon.

where

T=Hdln(HdBX2BW2m).\displaystyle T=Hd\ln(HdB^{2}_{X}B^{2}_{W}m).

By organizing terms, we have

C𝒢2H2d|𝒜|2+Mln(|Πmax||max||𝒢max|Hd/δ)ln(HdBX2BW2m)mϵ.\displaystyle\sqrt{\frac{C^{2}_{\mathcal{G}}H^{2}d|\mathcal{A}|^{2+M}\ln(|\Pi_{\max}||\mathcal{F}_{\max}||\mathcal{G}_{\max}|Hd/\delta)\ln(HdB^{2}_{X}B^{2}_{W}m)}{m}}\leq\epsilon.

Thus, setting the following m is enough:

m=O~(H4d|𝒜|2+Mln(|Πmax||max||𝒢max|/δ)ϵ2σ12).\displaystyle m=\tilde{O}\left(\frac{H^{4}d|\mathcal{A}|^{2+M}\ln(|\Pi_{\max}||\mathcal{F}_{\max}||\mathcal{G}_{\max}|/\delta)}{\epsilon^{2}\sigma^{2}_{1}}\right).

The total number of samples we use, mTH, is

O~(d2H6|𝒜|2+Mln(|Πmax||max||𝒢max|/δ)ϵ2σ12).\displaystyle\tilde{O}\left(\frac{d^{2}H^{6}|\mathcal{A}|^{2+M}\ln(|\Pi_{\max}||\mathcal{F}_{\max}||\mathcal{G}_{\max}|/\delta)}{\epsilon^{2}\sigma^{2}_{1}}\right).

Finally, we plug in \ln(|\Pi_{\max}||\mathcal{F}_{\max}||\mathcal{G}_{\max}|/\delta)=\ln(|\mathcal{M}|).

Appendix O Exponential Stability for POMDPs with Low-rank Transition

In this section, we prove that a short-memory policy is globally near optimal in POMDPs with low-rank latent transition. We first introduce some notation. Next, we prove the exponential stability of Bayesian filters, which immediately leads to the main statement.

Notation.

Given a belief b\in\Delta({\mathcal{S}}) and an action-observation pair (a,o), we define the Bayesian update as follows. We define B(b,o)\in\Delta({\mathcal{S}}) as the operation that incorporates the observation o, i.e., b^{\prime}=B(b,o) with b^{\prime}(s)=O(o|s)b(s)/(\sum_{\bar{s}}O(o|\bar{s})b(\bar{s})), and \mathbb{T}_{a}b as the operation that incorporates the transition, i.e., (\mathbb{T}_{a}b)(s^{\prime})=\sum_{s}b(s)\mathbb{T}(s^{\prime}|s,a). Finally, we denote U(b,a,o) as the full Bayesian filter, i.e.,

U(b,a,o)=B(𝕋ab,o).U(b,a,o)=B(\mathbb{T}_{a}b,o).

Let us denote b_{0}\in\Delta({\mathcal{S}}) as the initial latent state distribution. Given the first observation o_{1}\sim\mathbb{O}(\cdot|s),s\sim b_{0}, we denote b_{1}=B(b_{0},o_{1}) as the initial belief of the system conditioned on the first observation o_{1}. Given two beliefs b,b^{\prime}, we define the distance D_{2}(b,b^{\prime}):=\log\mathbb{E}_{s\sim b}[b(s)/b^{\prime}(s)].

Consider a POMDP whose latent transition is low rank, i.e., \mathbb{T}(s^{\prime}|s,a)=\mu(s^{\prime})^{\top}\phi(s,a). For notational simplicity, we still consider discrete state, action, and observation spaces to avoid measure-theoretic language.
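To make these operators concrete, the following is a minimal numpy sketch of the filter updates and the D_{2} distance. It is a hypothetical illustration: the matrix conventions O[o, s] = O(o|s) and T[a][s, s'] = T(s'|s,a) are assumptions of this sketch, not objects defined in the paper.

import numpy as np

def B(b, o, O):
    """Incorporate observation o: b'(s) = O(o|s) b(s) / sum_s' O(o|s') b(s')."""
    w = O[o] * b
    return w / w.sum()

def T_a(b, a, T):
    """Incorporate the transition: (T_a b)(s') = sum_s b(s) T(s'|s, a)."""
    return T[a].T @ b

def U(b, a, o, T, O):
    """Full Bayesian filter: U(b, a, o) = B(T_a b, o)."""
    return B(T_a(b, a, T), o, O)

def D2(b, bp):
    """D2(b, b') = log E_{s ~ b}[b(s)/b'(s)]; requires supp(b) within supp(b')."""
    mask = b > 0
    return np.log(np.sum(b[mask] ** 2 / bp[mask]))

Note that D_{2} is asymmetric; it equals the logarithm of one plus the chi-square divergence of b from b^{\prime}.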

Design of initial distribution.

We want to design a good initial distribution for an artificial Bayesian filter that ignores all history except the short memory.

The following lemma, from [24, Lemma 4.9], quantifies the contraction of the Bayesian map.

Lemma 41 (Contraction property of beliefs).

Suppose b,bΔ(𝒮)b,b^{\prime}\in\Delta({\mathcal{S}}) and b/b<\|b/b^{\prime}\|_{\infty}<\infty. Then we have:

\displaystyle\mathbb{E}_{s\sim b,o\sim\mathbb{O}(s)}\left[\sqrt{\exp\left(\frac{D_{2}(B(b,o),B(b^{\prime},o))}{4}\right)-1}\right]\leq\left(1-\sigma^{4}_{1}/2^{40}\right)\sqrt{\exp\left(\frac{D_{2}(b,b^{\prime})}{4}\right)-1}.
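As a sanity check, the inequality can be spot-checked numerically on random tabular instances. The sketch below is a hypothetical check, not part of the proof: it takes \sigma_{1} to be the smallest singular value of a randomly drawn emission matrix and draws random beliefs; with the huge constant 2^{40} the bound is extremely loose, so the check passes comfortably on typical draws.

import numpy as np

rng = np.random.default_rng(0)
nS, nO = 4, 5
O = rng.random((nO, nS))
O /= O.sum(axis=0, keepdims=True)        # columns O(.|s) are distributions over o
sigma1 = np.linalg.svd(O, compute_uv=False).min()

def Bayes(b, o):
    w = O[o] * b
    return w / w.sum()

def Y(b, bp):                            # sqrt(exp(D2(b, b')/4) - 1)
    return np.sqrt(np.exp(np.log(np.sum(b ** 2 / bp)) / 4.0) - 1.0)

b, bp = rng.dirichlet(np.ones(nS)), rng.dirichlet(np.ones(nS))
# LHS: expectation over s ~ b, o ~ O(.|s), i.e. marginally o ~ O b
lhs = sum((O @ b)[o] * Y(Bayes(b, o), Bayes(bp, o)) for o in range(nO))
rhs = (1.0 - sigma1 ** 4 / 2 ** 40) * Y(b, bp)
print(lhs <= rhs, lhs, rhs)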

Next, we compute the G-optimal design using the feature \phi(s,a):{\mathcal{S}}\times\mathcal{A}\to\mathbb{R}^{d}. Denote the G-optimal design as \rho\in\Delta({\mathcal{S}}\times\mathcal{A}). Here, we use the assumption \|\phi(s,a)\|\leq 1 for all (s,a) from Assumption 6, which ensures that \phi(s,a) lives in a compact set. The relevant property is given in Theorem 12. In summary, the support of \rho (denoted by S_{\rho}) consists of at most d(d+1)/2 points, and for any \phi(s,a), there exists \alpha(s,a) such that

ϕ(s,a)=i=1|Sρ|αi(s,a)ϕ(si,ai)ρ1/2(si,ai),αi(s,a)/ρ1/2(si,ai)d\displaystyle\phi(s,a)=\sum_{i=1}^{|S_{\rho}|}\alpha_{i}(s,a)\phi(s^{i},a^{i})\rho^{1/2}(s^{i},a^{i}),\quad\alpha_{i}(s,a)/\rho^{1/2}(s^{i},a^{i})\leq d (34)

where we denote the points on the support SρS_{\rho} as {si,ai}i=1|Sρ|\{s^{i},a^{i}\}_{i=1}^{|S_{\rho}|}.

We set our “empty” belief as follows:

b~0():=s~,a~ρ(s~,a~)𝕋(|s~,a~)=i=1|Sρ|ρ(si,ai)𝕋(|si,ai).\displaystyle\tilde{b}_{0}(\cdot):=\sum_{\tilde{s},\tilde{a}}\rho(\tilde{s},\tilde{a})\mathbb{T}(\cdot|\tilde{s},\tilde{a})=\sum_{i=1}^{|S_{\rho}|}\rho(s^{i},a^{i})\mathbb{T}(\cdot|s^{i},a^{i}).

Note that this belief \tilde{b}_{0} does not depend on any history. We aim to bound D_{2}(b,\tilde{b}_{0}) using the following lemma, where b is any belief resulting from applying \mathbb{T}_{a}, for any a, to a belief \tilde{b}\in\Delta({\mathcal{S}}). This lemma is newly introduced.

Lemma 42 (Distance between the actual belief and the designed initial distribution).

For any distribution b\in\Delta({\mathcal{S}}) that results from a previous belief \tilde{b} and a one-step latent transition under action a, i.e., b=\mathbb{T}_{a}\tilde{b}, we have:

D2(b,b~0)ln(d3).\displaystyle D_{2}(b,\tilde{b}_{0})\leq\ln(d^{3}).
Proof.

For any bΔ(𝒮)b\in\Delta({\mathcal{S}}), using its definition, we have:

b(s)\displaystyle b(s) =s~b~(s~)ϕ(s~,a)μ(s)\displaystyle=\sum_{\tilde{s}}\tilde{b}(\tilde{s})\phi(\tilde{s},a)^{\top}\mu(s) (Definition)
=s~b~(s~)i=1Sραi(s~,a)ρ1/2(si,ai)ϕ(si,ai)μ(s)\displaystyle=\sum_{\tilde{s}}\tilde{b}(\tilde{s})\sum_{i=1}^{S_{\rho}}\alpha_{i}(\tilde{s},a)\rho^{1/2}(s^{i},a^{i})\phi(s^{i},a^{i})^{\top}\mu(s) (Property of G-optimal design)
=i=1Sρ(s~b~(s~)αi(s~,a)ρ1/2(si,ai)):=βiϕ(si,ai)μ(s)\displaystyle=\sum_{i=1}^{S_{\rho}}\underbrace{\left(\sum_{\tilde{s}}\tilde{b}(\tilde{s})\alpha_{i}(\tilde{s},a)\rho^{1/2}(s^{i},a^{i})\right)}_{:=\beta_{i}}\phi(s^{i},a^{i})^{\top}\mu(s)

Similarly, the construction of b~0\tilde{b}_{0} implies that b~0(s)=i=1Sρρ(si,ai)ϕ(si,ai)μ(s)\tilde{b}_{0}(s)=\sum_{i=1}^{S_{\rho}}\rho(s^{i},a^{i})\phi(s^{i},a^{i})^{\top}\mu(s), thus, we have:

b(s)/b~0(s)\displaystyle b(s)/\tilde{b}_{0}(s) =i=1Sρβiϕ(si,ai)μ(s)j=1Sρρ(sj,aj)ϕ(sj,aj)μ(s)i=1Sρβiϕ(si,ai)μ(s)ρ(si,ai)ϕ(si,ai)μ(s)\displaystyle=\sum_{i=1}^{S_{\rho}}\frac{\beta_{i}\phi(s^{i},a^{i})^{\top}\mu(s)}{\sum_{j=1}^{S_{\rho}}\rho(s^{j},a^{j})\phi(s^{j},a^{j})^{\top}\mu(s)}\leq\sum_{i=1}^{S_{\rho}}\frac{\beta_{i}\phi(s^{i},a^{i})^{\top}\mu(s)}{\rho(s^{i},a^{i})\phi(s^{i},a^{i})^{\top}\mu(s)}
=i=1Sρβiρ(si,ai)=i=1Sρs~b~(s~)αi(s~,a)ρ1/2(si,ai)=s~b~(s~)i=1Sραi(s~,a)ρ1/2(si,ai)\displaystyle=\sum_{i=1}^{S_{\rho}}\frac{\beta_{i}}{\rho(s^{i},a^{i})}=\sum_{i=1}^{S_{\rho}}{\sum_{\tilde{s}}\tilde{b}(\tilde{s})\frac{\alpha_{i}(\tilde{s},a)}{\rho^{1/2}(s^{i},a^{i})}}=\sum_{\tilde{s}}\tilde{b}(\tilde{s})\sum_{i=1}^{S_{\rho}}\frac{\alpha_{i}(\tilde{s},a)}{\rho^{1/2}(s^{i},a^{i})}
\displaystyle\leq\sum_{\tilde{s}}\tilde{b}(\tilde{s})d^{3}=d^{3}. (property of the G-optimal design (34), together with |S_{\rho}|\leq d(d+1)/2\leq d^{2})

Thus, D2(b,b~0)=ln(𝔼sbb(s)b~0(s))lnd3D_{2}(b,\tilde{b}_{0})=\ln\left(\mathbb{E}_{s\sim b}\frac{b(s)}{\tilde{b}_{0}(s)}\right)\leq\ln d^{3}. ∎
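To illustrate the construction end to end, the following sketch computes an approximate G-optimal design with the classical Frank-Wolfe (Kiefer-Wolfowitz) iteration, forms the "empty" belief \tilde{b}_{0}, and spot-checks the d^{3} ratio bound of Lemma 42. The solver, the simplex-valued features (a convenient way to make \mathbb{T} a valid transition), and all instance sizes are assumptions of this sketch; the paper only invokes the design abstractly via Theorem 12.

import numpy as np

def g_optimal_design(Phi, iters=2000, tol=1e-6):
    """Frank-Wolfe (Kiefer-Wolfowitz) iteration for an approximate G-optimal
    design over the rows of Phi, where Phi has shape (n, d)."""
    n, d = Phi.shape
    rho = np.full(n, 1.0 / n)
    for _ in range(iters):
        Sigma = Phi.T @ (rho[:, None] * Phi)
        lev = np.einsum("ij,jk,ik->i", Phi, np.linalg.pinv(Sigma), Phi)
        i, g = lev.argmax(), lev.max()
        if g <= d * (1.0 + tol):          # Kiefer-Wolfowitz optimality condition
            break
        step = (g / d - 1.0) / (g - 1.0)  # exact line-search step
        rho *= 1.0 - step
        rho[i] += step
    return rho

rng = np.random.default_rng(0)
nS, nA, d = 6, 3, 4
Phi = rng.dirichlet(np.ones(d), size=(nS, nA))   # phi(s, a) on the simplex, so ||phi|| <= 1
Mu = rng.dirichlet(np.ones(nS), size=d)          # mu_k(.) are distributions over s'
T = Phi @ Mu                                     # T[s, a] = T(.|s, a) = mu(.)^T phi(s, a)

rho = g_optimal_design(Phi.reshape(-1, d))
b_tilde0 = rho @ T.reshape(-1, nS)               # "empty" belief: design mixture of one-step transitions

# Spot check of Lemma 42: a one-step pushforward b = T_a b_prev should satisfy
# max_s b(s) / b_tilde0(s) <= d^3.
b_prev = rng.dirichlet(np.ones(nS))
b = b_prev @ T[:, 0]                             # action a = 0
print(np.max(b / b_tilde0) <= d ** 3)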

Now we prove the exponential stability by leveraging Lemma 42 and Lemma 41.

Theorem 14 (Exponential stability for POMDPs with Low-rank Latent Transition).

Consider t\geq C\sigma_{1}^{-4}\ln(d/\epsilon). Consider any (full-history-dependent) policy \pi and a trajectory a_{1:h+t-1},o_{1:h+t}\sim\pi for h\geq 1. Denote b_{h+t} as the (true) belief conditioned on a_{1:h+t-1},o_{1:h+t}. For the approximate belief, first, for h=1, we define \bar{b}_{h+t} as:

\displaystyle\bar{b}_{1}=b_{1},\quad\bar{b}_{1+\tau}(o_{1:1+\tau},a_{1:\tau})=U(\bar{b}_{\tau}(o_{1:\tau},a_{1:\tau-1}),a_{\tau},o_{1+\tau}),\quad 1\leq\tau\leq t;

for h2h\geq 2, we define b¯h+t\bar{b}_{h+t} as:

\displaystyle\bar{b}_{h}=B(\tilde{b}_{0},o_{h}),\quad\bar{b}_{h+\tau}(o_{h:h+\tau},a_{h:h+\tau-1})=U(\bar{b}_{h+\tau-1}(o_{h:h+\tau-1},a_{h:h+\tau-2}),a_{h+\tau-1},o_{h+\tau}),\quad 1\leq\tau\leq t.

Then we have:

h1:𝔼[bh+t(o1:h+t,a1:h+t1)b¯h+t(oh:h+t,ah:h+t1)1;a1:h+t1π]ϵ.\displaystyle\forall h\geq 1:\quad\mathbb{E}[\left\|b_{h+t}(o_{1:h+t},a_{1:h+t-1})-\bar{b}_{h+t}(o_{h:h+t},a_{h:h+t-1})\right\|_{1};a_{1:h+t-1}\sim\pi]\leq\epsilon.
Proof.

We define

Yh+n(o1:h+n,a1:h+n1)=exp(D2(bh+n(o1:h+n,a1:h+n1),b¯h+n(oh:h+n,ah:h+n1))/4)1.Y_{h+n}(o_{1:h+n},a_{1:h+n-1})=\sqrt{\exp(D_{2}(b_{h+n}(o_{1:h+n},a_{1:h+n-1}),\bar{b}_{h+n}(o_{h:h+n},a_{h:h+n-1}))/4)-1}.

Hereafter, we omit (o1:h+n,a1:h+n1)(o_{1:h+n},a_{1:h+n-1}) to simplify the notation.

We start from the base case YhY_{h} (i.e., n=0n=0).

First, consider the case h>1, where b_{h}=U(b_{h-1},a_{h-1},o_{h}). Denote b^{\prime}_{h}=\mathbb{T}_{a_{h-1}}b_{h-1}. From Lemma 42, we know that:

𝔼[D2(bh,b~0)o1:h1,ah1;a1:h1π]ln(d3).\displaystyle\mathbb{E}[D_{2}(b^{\prime}_{h},\tilde{b}_{0})\mid o_{1:h-1},a_{h-1};a_{1:h-1}\sim\pi]\leq\ln(d^{3}).

Thus, noting b_{h}=B(b^{\prime}_{h},o_{h}) and \bar{b}_{h}=B(\tilde{b}_{0},o_{h}), we have:

\displaystyle\mathbb{E}_{o_{h}\sim\mathbb{O}b^{\prime}_{h}}\left[\sqrt{\exp(D_{2}(b_{h},B(\tilde{b}_{0},o_{h}))/4)-1}\mid o_{1:h-1},a_{1:h-1};a_{1:h-1}\sim\pi\right]
\displaystyle\leq(1-\sigma^{4}_{1}/2^{40})\sqrt{\exp(D_{2}(b^{\prime}_{h},\tilde{b}_{0})/4)-1} (From Lemma 41, since b_{h}=B(b^{\prime}_{h},o_{h}))
\displaystyle\leq(1-\sigma^{4}_{1}/2^{40})d^{3/2} (From Lemma 42)

which implies the base case:

𝔼[Yho1:h1,ah1;a1:h1π](1σ14/240)d3/2.\displaystyle\mathbb{E}[Y_{h}\mid o_{1:h-1},a_{h-1};a_{1:h-1}\sim\pi]\leq(1-\sigma^{4}_{1}/2^{40})d^{3/2}.

Now for any n1n\geq 1, we have:

𝔼[Yh+no1:h1,a1:h1;a1:h+n1π]\displaystyle\mathbb{E}[Y_{h+n}\mid o_{1:h-1},a_{1:h-1};a_{1:h+n-1}\sim\pi]
=𝔼[exp(D2(bh+n,b¯h+n)/4)1o1:h1,a1:h1;a1:h+n1π]\displaystyle=\mathbb{E}\left[\sqrt{\exp\left(D_{2}(b_{h+n},\bar{b}_{h+n})/4\right)-1}\mid o_{1:h-1},a_{1:h-1};a_{1:h+n-1}\sim\pi\right]
\displaystyle\leq(1-\sigma^{4}_{1}/2^{40})\mathbb{E}\left[\sqrt{\exp\left(D_{2}\left(\mathbb{T}_{a_{h+n-1}}b_{h+n-1},\mathbb{T}_{a_{h+n-1}}\bar{b}_{h+n-1}\right)/4\right)-1}\mid o_{1:h-1},a_{1:h-1};a_{1:h+n-1}\sim\pi\right] (From Lemma 41)
\displaystyle\leq(1-\sigma^{4}_{1}/2^{40})\mathbb{E}\left[\sqrt{\exp\left(D_{2}\left(b_{h+n-1},\bar{b}_{h+n-1}\right)/4\right)-1}\mid o_{1:h-1},a_{1:h-1};a_{1:h+n-1}\sim\pi\right] (Data processing inequality from [24, Lemma 2.7])
=(1σ14/240)𝔼[Yh+n1o1:h1,a1:h1;a1:h+n1π].\displaystyle=(1-\sigma^{4}_{1}/2^{40})\mathbb{E}[Y_{h+n-1}\mid o_{1:h-1},a_{1:h-1};a_{1:h+n-1}\sim\pi].

This completes the induction step. Unrolling the recursion down to the base case gives \mathbb{E}[Y_{h+t}]\leq(1-\sigma^{4}_{1}/2^{40})^{t}d^{3/2}, which is small enough once t\geq C\sigma_{1}^{-4}\ln(d/\epsilon); translating this D_{2}-based bound into the \ell_{1} bound follows [24]. Adding the expectation with respect to the history a_{1:h-1},o_{1:h-1} back, we conclude the proof.

When h=1, we start with the original belief b_{1}: for any 0\leq n\leq t we have \bar{b}_{1+n}=b_{1+n}, so the conclusion trivially holds.

The above Theorem 14 indicates that to approximate the ground-truth belief b_{h+t} conditioned on the entire history, we only need to apply the Bayesian filter on the M-step memory \bar{z}_{h+t}, starting from a fixed distribution \tilde{b}_{0}. The existence of such a \tilde{b}_{0} is proven by construction, where we rely on the low-rankness of the latent transition and a G-optimal design over {\mathcal{S}}\times\mathcal{A} using the feature \phi.

The above Theorem 14, together with the proof of Theorem 1.2 in [24], immediately implies that for M=\Theta(C\sigma_{1}^{-4}\ln(dH/\epsilon)) (with C being some absolute constant), there must exist an M-memory policy \pi^{\star} such that J(\pi^{\star}_{gl})-J(\pi^{\star})\leq\epsilon; thus, a globally optimal policy can be approximated by a policy that only relies on short memories.
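To see the statement in action, below is a small hypothetical simulation on a random tabular POMDP: it compares the true belief (filtered on the whole history) against the short-memory belief that restarts from a stand-in for \tilde{b}_{0} and only sees the last M steps. The uniform mixture over one-step transitions replaces the G-optimal design here, and all instance parameters are assumptions of this sketch; averaging the printed gap over many seeds while sweeping M would exhibit the exponential decay.

import numpy as np

rng = np.random.default_rng(1)
nS, nA, nO, H, M = 5, 2, 6, 40, 10
T = rng.dirichlet(np.ones(nS), size=(nA, nS))   # T[a, s] = T(.|s, a)
O = rng.dirichlet(np.ones(nO), size=nS).T       # O[o, s] = O(o|s)
b0 = np.full(nS, 1.0 / nS)
b_tilde0 = T.reshape(-1, nS).mean(axis=0)       # uniform-design stand-in for the G-optimal mixture

def Bayes(b, o):
    w = O[o] * b
    return w / w.sum()

def U(b, a, o):                                  # full filter: transition, then observation
    return Bayes(T[a].T @ b, o)

# generate o_1, (a_1, o_2), ..., (a_{H-1}, o_H) under uniform random actions
s = rng.choice(nS, p=b0)
obs, acts = [rng.choice(nO, p=O[:, s])], []
for _ in range(H - 1):
    a = rng.integers(nA)
    s = rng.choice(nS, p=T[a, s])
    acts.append(a)
    obs.append(rng.choice(nO, p=O[:, s]))

# true belief b_H conditioned on the whole history
b_true = Bayes(b0, obs[0])
for a, o in zip(acts, obs[1:]):
    b_true = U(b_true, a, o)

# approximate belief: forget everything before step h = H - M, restart from b_tilde0
h = H - M
b_bar = Bayes(b_tilde0, obs[h - 1])              # \bar b_h = B(\tilde b_0, o_h)
for a, o in zip(acts[h - 1:], obs[h:]):          # apply (a_h, o_{h+1}), ..., (a_{H-1}, o_H)
    b_bar = U(b_bar, a, o)

print("L1 gap between true and short-memory beliefs:", np.abs(b_true - b_bar).sum())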

Appendix P Auxiliary Lemmas

We use the following in Section 4.3.

Lemma 43 (Useful inequalities).
  • ABAB,ABFABF\displaystyle\|AB\|\leq\|A\|\|B\|,\|AB\|_{F}\leq\|A\|\|B\|_{F}
    vec(aa)=aa,vec(A)2=AF,Tr(AB)=vec(A)vec(B).\displaystyle\mathrm{vec}(aa^{\top})=a\otimes a,\|\mathrm{vec}(A)\|_{2}=\|A\|_{F},\mathrm{Tr}(AB)=\mathrm{vec}(A^{\top})^{\top}\mathrm{vec}(B).
  • When A and B are positive semidefinite matrices, we have

    Tr(AB)ATr(B).\displaystyle\mathrm{Tr}(AB)\leq\|A\|\mathrm{Tr}(B).

The following lemma is useful when we calculate the sample complexity.

Lemma 44.

The following inequality is satisfied:

B1mln2(B2m+B3)cϵ\displaystyle\sqrt{\frac{B_{1}}{m}\ln^{2}(B_{2}m+B_{3})}\leq c\epsilon

when

\displaystyle m=c\frac{B_{1}}{\epsilon^{2}}\{\ln(m(B_{2}+B_{3}+1))\}^{2}

for some constant cc.
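Since m appears on both sides of this relation, such a choice can be computed numerically by fixed-point iteration. A minimal sketch is below, with illustrative values of B_{1}, B_{2}, B_{3}, \epsilon, and c=1 (all assumptions of this sketch); the final print confirms that the left-hand side of Lemma 44 is approximately c\epsilon.

import numpy as np

def solve_m(B1, B2, B3, eps, c=1.0, iters=50):
    """Fixed-point iteration for m = c * (B1/eps^2) * ln(m*(B2+B3+1))^2."""
    m = max(c * B1 / eps ** 2, 2.0)              # crude initialization
    for _ in range(iters):
        m = c * (B1 / eps ** 2) * np.log(m * (B2 + B3 + 1.0)) ** 2
    return m

B1, B2, B3, eps = 100.0, 10.0, 1.0, 0.1
m = solve_m(B1, B2, B3, eps)
lhs = np.sqrt(B1 / m * np.log(B2 * m + B3) ** 2)
print(m, lhs)                                    # lhs is approximately c * eps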