Regularization Guarantees Generalization in Bayesian Reinforcement Learning through Algorithmic Stability

Aviv Tamar, Daniel Soudry, Ev Zisselman
Abstract

In the Bayesian reinforcement learning (RL) setting, a prior distribution over the unknown problem parameters – the rewards and transitions – is assumed, and a policy that optimizes the (posterior) expected return is sought. A common approximation, which has been recently popularized as meta-RL, is to train the agent on a sample of $N$ problem instances from the prior, with the hope that for large enough $N$, good generalization behavior to an unseen test instance will be obtained. In this work, we study generalization in Bayesian RL under the probably approximately correct (PAC) framework, using the method of algorithmic stability. Our main contribution is showing that by adding regularization, the optimal policy becomes stable in an appropriate sense. Most stability results in the literature build on strong convexity of the regularized loss – an approach that is not suitable for RL as Markov decision processes (MDPs) are not convex. Instead, building on recent results of fast convergence rates for mirror descent in regularized MDPs, we show that regularized MDPs satisfy a certain quadratic growth criterion, which is sufficient to establish stability. This result, which may be of independent interest, allows us to study the effect of regularization on generalization in the Bayesian RL setting.

1 Introduction

How can an agent learn to quickly perform well in an unknown task? This is the basic question in reinforcement learning (RL). The most popular RL algorithms are designed in a minimax approach – seeking a procedure that will eventually learn to perform well in any task (Strehl et al. 2006; Jaksch, Ortner, and Auer 2010; Jin et al. 2018). Lacking prior information about the task, such methods must invest considerable efforts in uninformed exploration, typically requiring many samples to reach adequate performance. In contrast, when a prior distribution over possible tasks is known in advance, an agent can direct its exploration much more effectively. This is the Bayesian RL (BRL, Ghavamzadeh et al. 2016) setting. A Bayes-optimal policy – the optimal policy in BRL – can be orders of magnitude more sample efficient than a minimax approach, and indeed, recent studies demonstrated a quick solution of novel tasks, sometimes in just a handful of trials (Duan et al. 2016; Zintgraf et al. 2020; Dorfman, Shenfeld, and Tamar 2020).

For high-dimensional problems, and when the task distribution does not obey a very simple structure, solving BRL is intractable, and one must resort to approximations. A common approximation, which has been popularized under the term meta-RL (Duan et al. 2016; Finn, Abbeel, and Levine 2017), is to replace the distribution over tasks with an empirical sample of tasks, and seek an optimal policy with respect to the sample, henceforth termed the empirical risk minimization policy (ERM policy). In this paper, we investigate the performance of the ERM policy on a novel task from the task distribution, that is – we ask how well the policy generalizes.

We focus on the Probably Approximately Correct (PAC) framework, which is popular in supervised learning, and adapt it to the BRL setting. Since the space of deterministic history-dependent policies is finite (in a finite horizon setting), a trivial generalization bound for a finite hypothesis space can be formulated. However, the size of the policy space, which such a naive bound depends on, leads us to seek alternative methods for controlling generalization. In particular, regularization is a well-established method in supervised learning that can be used to trade-off training error and test error. In RL, regularized MDPs are popular in practice (Schulman et al. 2017), and have also received interest lately due to their favorable optimization properties (Shani, Efroni, and Mannor 2020; Neu, Jonsson, and Gómez 2017).

The main contribution of this work is making the connection between regularized MDPs and PAC generalization, as described above. We build on the classical analysis of Bousquet and Elisseeff (2002), which bounds generalization through algorithmic stability. Establishing algorithmic stability results for regularized MDPs, however, is not trivial, as the loss function in MDPs is not convex in the policy. Our key insight is that while not convex, regularized MDPs satisfy a certain quadratic growth criterion, which is sufficient to establish stability. To show this, we build on the recently discovered fast convergence rates for mirror descent in regularized MDPs (Shani, Efroni, and Mannor 2020). Our result, which may be of independent interest, allows us to derive generalization bounds that can be controlled by the regularization magnitude. Furthermore, we show that when the MDP prior obeys certain structural properties, our results significantly improve the trivial finite hypothesis space bound.

To our knowledge, this is the first work to formally study generalization in the BRL setting. While not explicitly mentioned as such, the BRL setting has been widely used by many empirical studies on generalization in RL (Tamar et al. 2016; Cobbe et al. 2020). In fact, whenever the MDPs come from a distribution, BRL is the relevant formulation. Our results therefore also establish a formal basis for studying generalization in RL.

This paper is structured as follows. After surveying related work in Section 2, we begin with background on MDPs, BRL, and algorithmic stability in Section 3, and then present our problem formulation and straightforward upper and lower bounds for the ERM policy in Section 4. In Section 5 we discuss fundamental properties of regularized MDPs. In Section 6 we describe a general connection between a certain rate result for mirror descent and a quadratic growth condition, and in Section 7 we apply this connection to regularized MDPs, and derive corresponding generalization bounds. We discuss our results and future directions in Section 8.

2 Related Work

Generalization to novel tasks in RL has been studied extensively, often without making the explicit connection to Bayesian RL. Empirical studies can largely be classified into three paradigms. The first increases the number of training tasks, either using procedurally generated domains (Cobbe et al. 2020), or by means such as image augmentation (Kostrikov, Yarats, and Fergus 2020) and task interpolation (Yao, Zhang, and Finn 2021). The second paradigm adds inductive bias to the neural network, such as a differentiable planning or learning computation (Tamar et al. 2016; Boutilier et al. 2020), or graph neural networks (Rivlin, Hazan, and Karpas 2020). The third is meta-RL, where an agent is explicitly trained to generalize, either using a Bayesian RL objective (Duan et al. 2016; Zintgraf et al. 2020), or through gradient based meta learning (Finn, Abbeel, and Levine 2017). We are not aware of theoretical studies of generalization in Bayesian RL.

The Bayesian RL algorithms of Guez, Silver, and Dayan (2012) and Grover, Basu, and Dimitrakakis (2020) perform online planning in the belief space by sampling MDPs from the posterior at each time step, and optimizing over this sample. The performance bounds for these algorithms require the correct posterior at each step, implicitly assuming a correct prior, while we focus on learning the prior from data. The lower bound in our Proposition 2 demonstrates how errors in the prior can severely impact performance.

Our stability-based approach to PAC learning is based on the seminal work of Bousquet and Elisseeff (2002). More recent works investigated stability of generalized learning (Shalev-Shwartz et al. 2010), multi-task learning (Zhang 2015), and stochastic gradient descent (Hardt, Recht, and Singer 2016). To our knowledge, we provide the first stability result for regularized MDPs, which, due to their non-convex nature, requires a new methodology. The stability results of Charles and Papailiopoulos (2018) build on quadratic growth, a property we use as well. However, showing that this property holds for regularized MDPs is not trivial, and is a major part of our contribution.

Finally, there is recent interest in PAC-Bayes theory for meta learning (Amit and Meir 2018; Rothfuss et al. 2021; Farid and Majumdar 2021). To our knowledge, this theory has not yet been developed for meta RL.

3 Background

We give background on BRL and algorithmic stability.

3.1 MDPs and Bayesian RL

A stationary Markov decision process (MDP, Bertsekas 2006) is defined by a tuple $M=(S,A,P_{\textrm{init}},C,P,H)$, where $S$ and $A$ are the state and action spaces, $P_{\textrm{init}}$ is an initial state distribution, $C:S\times A\to[0,C_{\mathrm{max}}]$ is a bounded cost function, $P$ is the transition kernel, and $H$ is the episode horizon, meaning that after $H$ steps of interaction, the state is reset to $s\sim P_{\textrm{init}}$. We make the additional assumption that the cost $C(s,a)\in\mathcal{C}$, where $\mathcal{C}$ is a finite set.¹

¹This assumption is non-standard, and is required to guarantee a finite set of possible histories in the Bayesian RL setting. In practice, the reward can be discretized to satisfy the assumption.

In the Bayesian RL setting (BRL, Ghavamzadeh et al. 2016), there is a distribution over MDPs $P(M)$, defined over some space of MDPs $\mathcal{M}$. For simplicity, we assume that $S$, $A$, $P_{\textrm{init}}$, and $H$ are fixed for all MDPs in $\mathcal{M}$, and thus the only factors that vary between different MDPs are the costs and transitions, denoted $C_{M}$ and $P_{M}$.

A simulator for an MDP $M$ is a sequential algorithm that at time $t$ outputs $s_{t}$ and, given input $a_{t}$, outputs $c_{t}=C(s_{t},a_{t})$ and transitions the state according to $s_{t+1}\sim P(\cdot|s_{t},a_{t})$. After every $H$ steps of interaction, the state is reset to $s\sim P_{\textrm{init}}$. Let the history at time $t$ be $h_{t}=\left\{s_{0},a_{0},c_{0},s_{1},a_{1},c_{1},\dots,s_{t}\right\}$. A policy $\pi$ is a stochastic mapping from the history to a probability over actions, $\pi(a|h_{t})=P(a_{t}=a|h_{t})$.
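
For concreteness, the following is a minimal sketch of the simulator interface described above; the class and method names are our own illustrative choices, not part of the paper.

```python
import numpy as np

class MDPSimulator:
    """Minimal sketch of the simulator interface described above (names are ours)."""

    def __init__(self, P_init, P, C, H, rng=None):
        # P_init: (|S|,) initial state distribution; P: (|S|, |A|, |S|) transition kernel;
        # C: (|S|, |A|) cost table; H: episode horizon.
        self.P_init, self.P, self.C, self.H = P_init, P, C, H
        self.rng = rng or np.random.default_rng(0)
        self.t = 0
        self.s = self.rng.choice(len(P_init), p=P_init)

    def step(self, a):
        """Return (cost, next state); reset the state every H interaction steps."""
        c = self.C[self.s, a]
        self.s = self.rng.choice(self.P.shape[2], p=self.P[self.s, a])
        self.t += 1
        if self.t % self.H == 0:
            self.s = self.rng.choice(len(self.P_init), p=self.P_init)
        return c, self.s
```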

A typical MDP objective is to minimize the $T$-horizon expected return $\mathbb{E}_{\pi;M}\left[\sum_{t=0}^{T}C_{M}(s_{t},a_{t})\right]$, where the expectation is with respect to the policy $\pi$ and state transitions prescribed by $M$. In BRL, the objective is an average over the possible MDPs in the prior:

\mathcal{L}(\pi)=\mathbb{E}_{M\sim P}\,\mathbb{E}_{\pi;M}\left[\sum_{t=0}^{T}C_{M}(s_{t},a_{t})\right].   (1)

We denote by $\mathcal{H}$ the space of $T$-length histories. Note that by our definitions above, $\mathcal{H}$ is finite. Also note that $T$ is not necessarily equal to $H$.

3.2 PAC Generalization and Algorithmic Stability

Statistical learning theory (Vapnik 2013) studies the generalization performance of a prediction algorithm trained on a finite data sample. Let $S=\left\{z_{1},\dots,z_{N}\right\}$ denote a sample of $N\geq 1$ i.i.d. elements from some space $\mathcal{Z}$ with distribution $P(z)$. A learning algorithm $\mathcal{A}$ takes as input $S$ and outputs a prediction function $\mathcal{A}_{S}$. Let $0\leq\ell(\mathcal{A}_{S},z)\leq B$, where $z\in\mathcal{Z}$, denote the loss of the prediction on a sample $z$. The population risk is $R(\mathcal{A},S)=\mathbb{E}_{z}\left[\ell(\mathcal{A}_{S},z)\right]$, and the empirical risk is $\hat{R}(\mathcal{A},S)=\frac{1}{N}\sum_{i=1}^{N}\ell(\mathcal{A}_{S},z_{i})$. Typically, algorithms are trained by minimizing the empirical risk. Probably approximately correct (PAC) learning algorithms are guaranteed to produce predictions with a population risk that is close to the empirical risk with high probability, and thus generalize. We recall fundamental results due to Bousquet and Elisseeff (2002) that connect algorithmic stability and PAC bounds.

Let $S^{\backslash i}$ denote the set $S$ with element $i$ removed. An algorithm satisfies uniform stability $\beta$ if the following holds:

\forall S\in\mathcal{Z}^{N},\ \forall i\in\{1,\dots,N\},\quad\|\ell(\mathcal{A}_{S},\cdot)-\ell(\mathcal{A}_{S^{\backslash i}},\cdot)\|_{\infty}\leq\beta.

An algorithm is said to satisfy pointwise hypothesis stability $\beta$ if the following holds:

\forall i\in\left\{1,\dots,N\right\},\quad\mathbb{E}_{S}\left[\left|\ell(\mathcal{A}_{S},z_{i})-\ell(\mathcal{A}_{S^{\backslash i}},z_{i})\right|\right]\leq\beta.
Theorem 1 (Theorem 11 in Bousquet and Elisseeff 2002).

Let $\mathcal{A}$ be an algorithm with pointwise hypothesis stability $\beta$. Then, for any $\delta\in(0,1)$, with probability at least $1-\delta$ over the random draw of $S$,

R(\mathcal{A},S)\leq\hat{R}(\mathcal{A},S)+\sqrt{\frac{B^{2}+12BN\beta}{2N\delta}}.
Theorem 2 (Theorem 12 in Bousquet and Elisseeff 2002).

Let $\mathcal{A}$ be an algorithm with uniform stability $\beta$. Then, for any $\delta\in(0,1)$, with probability at least $1-\delta$ over the random draw of $S$,

R(\mathcal{A},S)\leq\hat{R}(\mathcal{A},S)+2\beta+(4N\beta+B)\sqrt{\frac{\ln(1/\delta)}{2N}}.

The bounds in Theorems 1 and 2 are useful if one can show that for a particular problem, $\beta$ scales as $o(1/\sqrt{N})$. Indeed, Bousquet and Elisseeff (2002) showed such results for several supervised learning problems. For example, $L_{2}$ regularized kernel regression has stability $\mathcal{O}(1/\lambda N)$, where $\lambda$ – the regularization weight in the loss – can be chosen to satisfy the $o(1/\sqrt{N})$ condition on $\beta$.
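
To make the role of $\beta$ concrete, the following sketch (ours) evaluates the bound of Theorem 2 when $\beta=c/(\lambda N)$, as in the kernel-regression example; all constants are illustrative.

```python
import numpy as np

def uniform_stability_bound(emp_risk, beta, B, N, delta):
    """Theorem 2: R <= R_hat + 2*beta + (4*N*beta + B) * sqrt(ln(1/delta) / (2N))."""
    return emp_risk + 2 * beta + (4 * N * beta + B) * np.sqrt(np.log(1 / delta) / (2 * N))

B, delta, c, lam = 1.0, 0.05, 1.0, 0.1
for N in [100, 1_000, 10_000]:
    beta = c / (lam * N)          # O(1/N) stability, hence o(1/sqrt(N))
    print(N, uniform_stability_bound(emp_risk=0.1, beta=beta, B=B, N=N, delta=delta))
```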

4 Problem Formulation

We next describe our learning problem. We are given a training set of $N$ simulators for $N$ independently sampled MDPs, $\left\{M_{1},\dots,M_{N}\right\}$, where each $M_{i}\sim P(M)$; in the following, we will sometimes refer to this training set as the training data. We are allowed to interact with these simulators as we wish for an unrestricted amount of time. From this interaction, our goal is to compute a policy $\pi$ that obtains a low expected $T$-horizon cost for a test simulator $M\sim P(M)$, i.e., we wish to minimize the population risk (1). It is well known (e.g., Ghavamzadeh et al. 2016) that there exists a deterministic history dependent policy that minimizes (1), also known as the Bayes-optimal policy, and we denote it by $\pi_{\mathrm{BO}}$. Our performance measure is the $T$-horizon average regret,

\mathcal{R}_{T}(\pi)=\mathbb{E}_{M\sim P}\left[\mathbb{E}_{\pi;M}\left[\sum_{t=0}^{T}C_{M}(s_{t},a_{t})\right]-\mathbb{E}_{\pi_{\mathrm{BO}};M}\left[\sum_{t=0}^{T}C_{M}(s_{t},a_{t})\right]\right]=\mathcal{L}(\pi)-\mathcal{L}(\pi_{\mathrm{BO}}).   (2)
Remark 1.

The BRL formulation generalizes several special cases that were explored before in the context of generalization in RL. When $T=kH$, this setting is often referred to as $k$-shot learning, and in particular, for $T=H$, the learned policy is evaluated on solving a test task in a single shot. Another popular setting is the contextual MDP (Hallak, Di Castro, and Mannor 2015), where, in addition to the state, each task $M$ is identified using some task identifier $id_{M}$, which is observed. By adding $id_{M}$ to the state space, and modifying the dynamics such that $id_{M}$ does not change throughout the episode, this setting is a special case of our formalism. Finally, many previous studies (e.g., Tamar et al. 2016) considered the same performance objective, but limited the optimization to Markov policies (i.e., policies that depend only on the current state and not on the full history). In this work, we specifically consider history dependent policies, as it allows us to meaningfully compare the learned policy with the optimum.

4.1 Analysis of an ERM Approach

Our goal is to study the generalization properties of learning algorithms in the BRL setting. An intuitive approach, in the spirit of the empirical risk minimization (ERM, Vapnik 2013) principle, is to minimize the empirical risk,

\hat{\mathcal{L}}(\pi)=\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{\pi;M_{i}}\left[\sum_{t=0}^{T}C_{M_{i}}(s_{t},a_{t})\right]\equiv\mathbb{E}_{M\sim\hat{P}_{N}}\mathbb{E}_{\pi;M}\left[\sum_{t=0}^{T}C_{M}(s_{t},a_{t})\right],   (3)

where $\hat{P}_{N}$ is the empirical distribution of the $N$ sampled MDPs. Let $\hat{\pi}^{*}\in\operatorname{arg\,min}_{\pi}\hat{\mathcal{L}}(\pi)$ denote the ERM policy.

Since the hypothesis space of deterministic history dependent policies is finite, and the loss is bounded by $C_{\max}T$, a trivial generalization bound can be formulated as follows (following PAC bounds for a finite hypothesis class, e.g., Shalev-Shwartz and Ben-David 2014).

Proposition 1.

Consider the ERM policy $\hat{\pi}^{*}$, and let $\bar{\mathcal{H}}$ denote the set of deterministic $T$-length history dependent policies. Then with probability at least $1-\delta$,

\mathcal{R}_{T}(\hat{\pi}^{*})\leq\sqrt{\frac{2\log(2|\bar{\mathcal{H}}|/\delta)C_{\max}^{2}T^{2}}{N}}.

Note that $|\bar{\mathcal{H}}|=|A|^{|\mathcal{H}|}\approx|A|^{(|S||A||\mathcal{C}|)^{T}}$, so $\log|\bar{\mathcal{H}}|=\mathcal{O}((|S||A||\mathcal{C}|)^{T})$. The exponential dependence on $T$ in the bound is not very satisfying, and one may ask whether a more favorable upper bound can be established for the ERM policy. To answer this, we next give a lower bound, showing that without additional assumptions on the problem, the exponential dependence on $T$ is necessary.
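
The blow-up is easy to see numerically; the sketch below (our own illustration) plugs $\log|\bar{\mathcal{H}}|\approx(|S||A||\mathcal{C}|)^{T}$ into Proposition 1 for small, illustrative problem sizes.

```python
import numpy as np

def proposition1_bound(S, A, C, T, Cmax, N, delta):
    """Proposition 1, with log|H_bar| approximated by (|S||A||C|)^T."""
    log_H_bar = (S * A * C) ** T
    return np.sqrt(2 * (np.log(2 / delta) + log_H_bar) * Cmax ** 2 * T ** 2 / N)

for T in [2, 4, 6, 8]:
    b = proposition1_bound(S=5, A=3, C=2, T=T, Cmax=1.0, N=10_000, delta=0.05)
    # The maximum possible regret is Cmax * T, so anything above that is vacuous.
    print(T, round(b, 2), "(vacuous)" if b > 1.0 * T else "")
```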

Proposition 2.

For any $0\leq\delta<1$, there is an $\epsilon>0$, and a problem, such that for $N=2^{T}$, with probability larger than $\delta$ we have $\mathcal{R}_{T}(\hat{\pi}^{*})>\epsilon$.

Proof.

(sketch; full proof in Section E.) Let $T=H$, and consider an MDP space $\mathcal{M}$ of size $2^{H}$, where the state space has $2H+1$ states that we label $s_{0},s_{1}^{0},s_{1}^{1},\dots,s_{t}^{0},s_{t}^{1},\dots,s_{H}^{0},s_{H}^{1}$. The initial state for all MDPs in $\mathcal{M}$ is $s_{0}$. A cost is only obtained at the last time step, and depends only on the last action. Each MDP $M\in\mathcal{M}$ corresponds to a unique binary number of size $H$, denoted $x$, and the transitions for each MDP correspond to the digits in its identifier $x$: there is a high probability to transition to $s_{t}^{0}$ from either $s_{t-1}^{0}$ or $s_{t-1}^{1}$ only if the $t$-th digit of $x$ is zero, and similarly, there is a high probability to transition to $s_{t}^{1}$ from either $s_{t-1}^{0}$ or $s_{t-1}^{1}$ only if the $t$-th digit of $x$ is one. Thus, with high probability, a trajectory in the MDP traces the digits of its identifier $x$. Given a finite data sample, there is a non-negligible set of MDPs that will not appear in the data. For any trajectory that corresponds to an $x$ from this set, the ERM policy at time $H$ will not be able to correctly identify the most probable MDP, and will choose an incorrect action with non-negligible probability. ∎

The results above motivate us to seek alternatives to the ERM approach, with the hope of providing more favorable generalization bounds. In the remainder of this paper, we focus on methods that add a regularization term to the loss.

5 Regularized MDPs

In supervised learning, a well-established method for controlling generalization is to add a regularization term, such as the $L_{2}$ norm of the parameters, to the objective function that is minimized. The works of Bousquet and Elisseeff (2002); Shalev-Shwartz et al. (2010) showed that for convex loss functions, adding a strongly convex regularizer such as the $L_{2}$ norm leads to algorithmic stability, which can be used to derive generalization bounds that are controlled by the amount of regularization. In this work, we ask whether a similar approach of adding regularization to the BRL objective (3) can be used to control generalization.

We focus on the following regularization scheme. For some $\lambda>0$, consider a regularized ERM of the form:

\hat{\mathcal{L}}^{\lambda}(\pi)=\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{\pi;M_{i}}\left[\sum_{t=0}^{T}C_{M_{i}}(s_{t},a_{t})+\lambda\mathcal{R}(\pi(\cdot|h_{t}))\right],

where $\mathcal{R}$ is some regularization function applied to the policy. In particular, we will be interested in $L_{2}$ regularization, where $\mathcal{R}(\pi(\cdot|h_{t}))=\|\pi(\cdot|h_{t})\|_{2}$. We also define the regularized population risk,

\mathcal{L}^{\lambda}(\pi)=\mathbb{E}_{M\sim P}\,\mathbb{E}_{\pi;M}\left[\sum_{t=0}^{T}C_{M}(s_{t},a_{t})+\lambda\mathcal{R}(\pi(\cdot|h_{t}))\right].
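
A Monte Carlo sketch (ours) of estimating the regularized empirical loss $\hat{\mathcal{L}}^{\lambda}(\pi)$: roll out $\pi$ on each training simulator and add $\lambda\|\pi(\cdot|h_{t})\|_{2}$ at every visited history. The policy representation (a function from history tuples to action distributions) and the simulator interface (as in the sketch of Section 3.1, exposing its current state as sim.s) are assumptions made for illustration.

```python
import numpy as np

def rollout_cost(pi, sim, T, lam, rng):
    """One-trajectory estimate of E_{pi;M}[sum_{t=0}^T C(s_t, a_t) + lam * ||pi(.|h_t)||_2]."""
    h, total = (sim.s,), 0.0
    for t in range(T + 1):
        p = pi(h)                              # action distribution pi(.|h_t)
        a = rng.choice(len(p), p=p)
        total += lam * np.linalg.norm(p)       # L2 regularization term R(pi(.|h_t))
        c, s_next = sim.step(a)
        total += c
        h = h + (a, c, s_next)                 # h_{t+1} = {h_t, a_t, c_t, s_{t+1}}
    return total

def regularized_empirical_loss(pi, make_sims, T, lam, n_rollouts=100, seed=0):
    """Average over the N training MDPs; make_sims is a list of factories, one per
    training MDP, each returning a fresh simulator."""
    rng = np.random.default_rng(seed)
    return float(np.mean([rollout_cost(pi, make_sim(), T, lam, rng)
                          for make_sim in make_sims for _ in range(n_rollouts)]))
```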

In standard (non-Bayesian) RL, regularized MDPs have been studied extensively (Neu, Jonsson, and Gómez 2017). A popular motivation has been to use the regularization to induce exploration (Fox, Pakman, and Tishby 2015; Schulman et al. 2017). Recently, Shani, Efroni, and Mannor (2020) showed that optimizing a policy using $k$ iterations of mirror descent (equivalent to trust region policy optimization, Schulman et al. 2015) with $L_{2}$ or entropy regularization enables a fast $O(1/k)$ convergence rate, similar to convergence rates for strongly convex functions, although the MDP objective is not convex. In our work, we build on these results to show a stability property for regularization in the BRL setting described above. We begin by adapting a central result of Shani, Efroni, and Mannor (2020), which was proved for discounted MDPs, to our finite horizon and history dependent policy setting.

The BRL objectives in Eq. (1) (similarly, Eq. (3)) can be interpreted as follows: we first choose a history dependent policy $\pi(h_{t})$, and then nature draws an MDP $M\sim P(M)$ (similarly, $M\sim\hat{P}_{N}$), and we then evaluate $\pi(h_{t})$ on $M$. The expected cost (over the draws of $M$) is the BRL performance. In the following discussion, for simplicity, the results are given for the prior $P(M)$, but they hold for $\hat{P}_{N}$ as well.

Let $P(M|h_{t};\pi)$ denote the posterior probability of nature having drawn the MDP $M$, given that we have seen the history $h_{t}$ under policy $\pi$. From Bayes' rule, we have that

P(M|h_{t};\pi)\propto P(h_{t}|M;\pi)P(M).

Let us define the regularized expected cost,

C_{\lambda}(h_{t},a_{t};\pi)=\mathbb{E}_{M|h_{t};\pi}\,C_{M}(s_{t},a_{t})+\lambda\mathcal{R}(\pi_{t}(\cdot|h_{t})),

and the value function,

V_{t}^{\pi}(h_{t})=\mathbb{E}_{\pi;M|h_{t}}\left[\left.\sum_{t^{\prime}=t}^{T}C_{\lambda}(h_{t^{\prime}},a_{t^{\prime}};\pi)\,\right|\,h_{t}\right].

The value function satisfies Bellman’s equation. Let

P(c_{t},s_{t+1}|h_{t},a_{t})=\sum_{M}P(M|h_{t})P(c_{t}|M,s_{t},a_{t})P(s_{t+1}|M,s_{t},a_{t})

denote the posterior probability of observing $c_{t},s_{t+1}$ at time $t$. Then

V_{T}^{\pi}(h_{T})=\sum_{a_{T}}\pi(a_{T}|h_{T})C_{\lambda}(h_{T},a_{T};\pi),

and, letting $h_{t+1}=\left\{h_{t},a_{t},c_{t},s_{t+1}\right\}$,

V_{t}^{\pi}(h_{t})=\sum_{a_{t}}\pi(a_{t}|h_{t})\left(C_{\lambda}(h_{t},a_{t};\pi)+\sum_{c_{t},s_{t+1}}P(c_{t},s_{t+1}|h_{t},a_{t})\,V_{t+1}^{\pi}(\left\{h_{t},a_{t},c_{t},s_{t+1}\right\})\right).
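
A brute-force sketch (ours) of this recursion for a tiny finite problem: each MDP is a dict with a transition table P[s, a, s'] and a (deterministic, for simplicity) cost table C[s, a]; the posterior $P(M|h_{t})$ is computed by Bayes' rule (the policy terms cancel in the normalization), and the Bellman equations above are applied recursively from $t=T$ backwards. All names and data structures are illustrative assumptions, and the computation is exponential in $T$.

```python
import numpy as np

def posterior(h, mdps, prior):
    """P(M | h_t) ∝ P(h_t | M; pi) P(M), for h = (s0, a0, c0, s1, ..., s_t).
    The factors pi(a|h) are shared by all MDPs and cancel in the normalization."""
    w = np.array(prior, dtype=float)
    for i in range(0, len(h) - 1, 3):
        s, a, c, s_next = h[i], h[i + 1], h[i + 2], h[i + 3]
        w *= np.array([M["P"][s, a, s_next] * float(M["C"][s, a] == c) for M in mdps])
    return w / w.sum()

def value(h, t, pi, mdps, prior, T, lam):
    """V_t^pi(h_t), assuming each MDP has a deterministic cost table C[s, a]."""
    post, s, p = posterior(h, mdps, prior), h[-1], pi(h)
    reg = lam * np.linalg.norm(p)                       # lam * ||pi(.|h_t)||_2
    v = 0.0
    for a, pa in enumerate(p):
        c_lam = sum(w * M["C"][s, a] for w, M in zip(post, mdps)) + reg
        v += pa * c_lam
        if t < T:
            for w, M in zip(post, mdps):
                for s_next in range(M["P"].shape[2]):
                    pr = w * M["P"][s, a, s_next]       # contribution to P(c_t, s_{t+1} | h_t, a_t)
                    if pr > 0:
                        v += pa * pr * value(h + (a, M["C"][s, a], s_next),
                                             t + 1, pi, mdps, prior, T, lam)
    return v
```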

Consider two histories $h_{t},\bar{h}_{\bar{t}}\in\mathcal{H}$, and let

\mathbf{P}^{\pi}(\bar{h}_{\bar{t}}|h_{t})=\begin{cases}\sum_{a_{t}}\pi(a_{t}|h_{t})P(\bar{c}_{t},\bar{s}_{t+1}|h_{t},a_{t}),&\text{if }\bar{t}=t+1,\\ 0,&\text{otherwise}\end{cases}

denote the transition probability between histories. Also, define

C^{\pi}(h_{t})=\sum_{a_{t}}\pi(a_{t}|h_{t})C_{\lambda}(h_{t},a_{t};\pi).

We can write the Bellman equation in matrix form as follows

V^{\pi}=C^{\pi}+\mathbf{P}^{\pi}V^{\pi},   (4)

where $V^{\pi}$ and $C^{\pi}$ are vectors in $\mathbb{R}^{|\mathcal{H}|}$, and $\mathbf{P}^{\pi}$ is a matrix in $\mathbb{R}^{|\mathcal{H}|\times|\mathcal{H}|}$.
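
Because $\mathbf{P}^{\pi}$ only moves probability mass from $t$-step histories to $(t+1)$-step histories, $I-\mathbf{P}^{\pi}$ is invertible and (4) can be solved directly. A numpy sketch (ours), assuming the matrix and cost vector have already been assembled:

```python
import numpy as np

def solve_history_bellman(P_pi, C_pi):
    """Solve V = C^pi + P^pi V, i.e. (I - P^pi) V = C^pi, for the history-indexed value vector."""
    n = len(C_pi)
    return np.linalg.solve(np.eye(n) - P_pi, C_pi)

# Tiny illustration: 3 histories, h0 -> h1 -> h2 (terminal), unit costs.
P_pi = np.array([[0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0],
                 [0.0, 0.0, 0.0]])
C_pi = np.ones(3)
print(solve_history_bellman(P_pi, C_pi))   # [3., 2., 1.]
```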

The uniform trust region policy optimization algorithm of Shani, Efroni, and Mannor (2020) is a type of mirror descent algorithm applied to the policy in regularized MDPs. An adaptation of this algorithm for our setting is given in Sec. B.4 of the supplementary material. The next result provides a fundamental inequality that the policy updates of this algorithm satisfy, in the spirit of an inequality that is used to establish convergence rates for mirror descent (cf. Lemma 8.11 in Beck 2017). The proof follows Lemma 10 in Shani, Efroni, and Mannor (2020), with technical differences due to the finite horizon setting; it is given in Sec. B.4.

Proposition 3.

Let $\{\pi_{k}\}$ be the sequence generated by uniform trust region policy optimization with step sizes $\{\alpha_{k}\}$ and $L_{2}$ regularization. Then for every $\pi$ and $k\geq 0$,

\alpha_{k}(I-\mathbf{P}^{\pi})(V^{\pi_{k}}-V^{\pi})\leq\frac{(1-\alpha_{k}\lambda)}{2}\|\pi-\pi_{k}\|_{2}^{2}-\frac{1}{2}\|\pi-\pi_{k+1}\|_{2}^{2}+\frac{\lambda\alpha_{k}}{2}\left(\|\pi_{k}\|_{2}^{2}-\|\pi_{k+1}\|_{2}^{2}\right)+\frac{\alpha_{k}^{2}L^{2}}{2}e,

where $e$ is a vector of ones, $L=C_{\max}T|A|$, and $\|\pi\|_{2}\in\mathbb{R}^{|\mathcal{H}|}$ denotes the $L_{2}$ norm of the policy element-wise, for each history.
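
For intuition only, here is a heavily simplified sketch (ours) of a single policy update in the spirit of this algorithm: a Euclidean mirror-descent (projected gradient) step on the action distribution of one history, descending on current cost-to-go estimates plus an $L_{2}$ term (we differentiate $\frac{1}{2}\|\cdot\|_{2}^{2}$ for simplicity). The exact uniform TRPO update, adapted to our setting, is the one analyzed in Sec. B.4 of the supplementary; none of the names below come from the paper.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def l2_regularized_md_step(pi_h, q_h, alpha, lam):
    """One Euclidean mirror-descent step on pi(.|h): descend on <q_h, pi> + (lam/2)*||pi||_2^2
    around the current iterate and project back onto the simplex."""
    return project_to_simplex(pi_h - alpha * (q_h + lam * pi_h))

lam = 0.5
pi_h = np.ones(4) / 4                         # uniform initial policy at some history h
q_h = np.array([1.0, 0.2, 0.6, 0.9])          # illustrative cost-to-go estimates (lower is better)
for k in range(500):
    pi_h = l2_regularized_md_step(pi_h, q_h, alpha=1.0 / (lam * (k + 2)), lam=lam)
print(pi_h)   # most of the probability mass ends up on the lowest-cost action (index 1)
```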

In the following, we shall show that Proposition 3 can be used to derive stability bounds in the regularized BRL setting. To simplify our presentation, we first present a key technique that our approach builds on in a general optimization setting, and only then come back to MDPs.

6 Stability based on the Fundamental Inequality for Mirror Descent

Standard stability results, such as in Bousquet and Elisseeff (2002); Shalev-Shwartz et al. (2010), depend on convexity of the loss function, and strong convexity of the regularizing function (Bousquet and Elisseeff 2002). While our $L_{2}$ regularization is strongly convex, the MDP objective is not convex in the policy.² In this work, we show that nevertheless, algorithmic stability can be established. To simplify our presentation, we first present the essence of our technique in a general form, without the complexity of MDPs. In the next section, we adapt the technique to the BRL setting.

²The linear programming formulation is not suitable for establishing stability in our BRL setting, as changing the prior would change the constraints in the linear program.

Our key insight is that the fundamental inequality of mirror descent (cf. Prop. 3) actually prescribes a quadratic growth condition. The next lemma shows this for a general iterative algorithm, but it may be useful to think about mirror descent when reading it. In the sequel, we will show that similar conditions hold for regularized MDPs.

Lemma 1.

Let $f:\mathcal{X}\to\mathbb{R}$ be some function that attains a minimum: $f(x^{*})\leq f(x)\ \forall x\in\mathcal{X}$. Consider a sequence of step sizes $\alpha_{0},\alpha_{1},\dots\in\mathbb{R}^{+}$ and a corresponding sequence of iterates $x_{0},x_{1},\dots\in\mathcal{X}$. Assume that $f(x_{k+1})\leq f(x_{k})$ for all $k\geq 0$. Also consider a sequence of values $z_{0},z_{1},\dots\in\mathbb{R}^{+}$ that satisfy $|z_{k}-z_{0}|\leq B$ for all $k\geq 0$. Assume that there exist $\lambda>0$ and $L\geq 0$ such that the following holds for any step size sequence, all $k\geq 0$, and any $x\in\mathcal{X}$:

\alpha_{k}\left(f(x_{k})-f(x)\right)\leq\left(1-\lambda\alpha_{k}\right)\|x_{k}-x\|^{2}-\|x_{k+1}-x\|^{2}+\lambda\alpha_{k}\left(z_{k}-z_{k+1}\right)+\frac{\alpha_{k}^{2}L^{2}}{2}.   (5)

Then the following statements hold true.

  1. For step sizes $\alpha_{k}=\frac{1}{\lambda(k+2)}$, the sequence converges to $x^{*}$ at rate

     f(x_{k})-f(x^{*})\leq\frac{L^{2}\log k}{\lambda k}.

  2. Quadratic growth: $\lambda\|x^{*}-x_{0}\|^{2}\leq f(x_{0})-f(x^{*})$.

Proof.

The first claim is similar to Theorem 2 of Shani, Efroni, and Mannor (2020); for completeness we give a full proof in Sec. A of the supplementary. We prove the second claim. Let $\alpha_{k}=\frac{1}{\lambda(k+2)}$, set $x=x_{0}$ in (5), and multiply by $\lambda(k+2)$:

f(x_{k})-f(x_{0})\leq\lambda\left(k+1\right)\|x_{k}-x_{0}\|^{2}-\lambda(k+2)\|x_{k+1}-x_{0}\|^{2}+\lambda\left(z_{k}-z_{k+1}\right)+\frac{L^{2}}{2\lambda(k+2)}.

Summing over $k$ and observing the telescoping sums:

\sum_{k=0}^{N}\left(f(x_{k})-f(x_{0})\right)\leq-\lambda(N+2)\|x_{N+1}-x_{0}\|^{2}+\lambda\left(z_{0}-z_{N+1}\right)+\frac{L^{2}}{2\lambda}\sum_{k=0}^{N}\frac{1}{k+2}\leq-\lambda(N+2)\|x_{N+1}-x_{0}\|^{2}+\lambda B+\frac{L^{2}\log(N+2)}{2\lambda}.

Since $f(x_{k})$ is decreasing, $\sum_{k=0}^{N}\left(f(x_{N})-f(x_{0})\right)\leq\sum_{k=0}^{N}\left(f(x_{k})-f(x_{0})\right)$, and

N\left(f(x_{N})-f(x_{0})\right)\leq-\lambda(N+2)\|x_{N+1}-x_{0}\|^{2}+\lambda B+\frac{L^{2}\log(N+2)}{2\lambda}.

Dividing by $N$, taking $N\to\infty$, and using the first part of the lemma:

f(x^{*})-f(x_{0})\leq-\lambda\|x^{*}-x_{0}\|^{2}.

Rearranging gives the result. ∎
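
As a toy numerical illustration (ours, not from the paper) of the lemma's two conclusions, consider projected gradient descent with the schedule $\alpha_{k}=\frac{1}{\lambda(k+2)}$ on a $2\lambda$-strongly-convex objective over a box, for which the $\lambda$-quadratic-growth property indeed holds, and with $z_{k}\equiv 0$.

```python
import numpy as np

# f(x) = <c, x> + lam*||x||^2 over the box [0, 1]^d is 2*lam-strongly convex, so
# f(x0) - f(x*) >= lam*||x0 - x*||^2 for any feasible x0 (the lemma's second conclusion).
lam, d = 0.5, 5
rng = np.random.default_rng(0)
c = rng.normal(size=d)
f = lambda x: c @ x + lam * x @ x
proj = lambda x: np.clip(x, 0.0, 1.0)         # projection onto the feasible box

x_star = proj(-c / (2 * lam))                 # coordinate-wise closed-form minimizer
x0 = np.ones(d)
x = x0.copy()
for k in range(20_000):
    x = proj(x - (c + 2 * lam * x) / (lam * (k + 2)))   # alpha_k = 1/(lam*(k+2))

print("optimality gap f(x_k) - f(x*):", f(x) - f(x_star))                            # close to 0
print("quadratic growth holds       :", lam * np.sum((x_star - x0) ** 2) <= f(x0) - f(x_star))
```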

We now present a stability result for a regularized ERM objective. The proof resembles Shalev-Shwartz et al. (2010), but replaces strong convexity with quadratic growth.

Proposition 4.

Let $z_{0},z_{1},\dots\in\mathcal{Z}$ denote a sequence of independent samples, and let $\ell:\mathcal{X}\times\mathcal{Z}\to\mathbb{R}$ be a loss for a predictor $x\in\mathcal{X}$ and sample $z\in\mathcal{Z}$. Consider a regularized ERM objective $L_{N}(x)=\frac{1}{N}\sum_{i=1}^{N}\ell(x,z_{i})+\lambda\mathcal{R}(x)$, and let $L_{N}^{\backslash j}(x)=\frac{1}{N}\sum_{i\neq j}\ell(x,z_{i})+\lambda\mathcal{R}(x)$. Assume that $\ell$ is $\beta$-Lipschitz: for any $z\in\mathcal{Z}$ and any $x,x^{\prime}$, $\left|\ell(x,z)-\ell(x^{\prime},z)\right|\leq\beta\|x-x^{\prime}\|$. Assume that $L_{N}(x)$ and $L_{N}^{\backslash j}(x)$ have unique minimizers, and denote them $x^{*}$ and $x^{*,\backslash j}$, respectively. Further assume quadratic growth: $\lambda\|x^{*}-x\|^{2}\leq L_{N}(x)-L_{N}(x^{*})$ for any $x\in\mathcal{X}$. Then, we have that

\|x^{*}-x^{*,\backslash j}\|\leq\frac{\beta}{\lambda N},

and for all $z\in\mathcal{Z}$,

\ell(x^{*},z)-\ell(x^{*,\backslash j},z)\leq\frac{\beta^{2}}{\lambda N}.
Proof.

(sketch; full proof in Sec. A.) Let $\Delta=L_{N}(x^{*,\backslash j})-L_{N}(x^{*})$. From quadratic growth, we have that

\Delta\geq\lambda\|x^{*}-x^{*,\backslash j}\|^{2}.

On the other hand, by taking out the $j$'th element from the loss terms $L_{N}$, and observing that $x^{*,\backslash j}$ minimizes $L_{N}^{\backslash j}$, we have that

\Delta=\frac{1}{N}\sum_{i\neq j}\ell(x^{*,\backslash j},z_{i})+\lambda\mathcal{R}(x^{*,\backslash j})-\frac{1}{N}\sum_{i\neq j}\ell(x^{*},z_{i})-\lambda\mathcal{R}(x^{*})+\frac{\ell(x^{*,\backslash j},z_{j})-\ell(x^{*},z_{j})}{N}\leq\frac{\ell(x^{*,\backslash j},z_{j})-\ell(x^{*},z_{j})}{N},

and from the Lipschitz condition, $\Delta\leq\frac{\beta\|x^{*}-x^{*,\backslash j}\|}{N}$. Combining the above inequalities for $\Delta$ gives $\|x^{*}-x^{*,\backslash j}\|\leq\frac{\beta}{\lambda N}$, and the final result is obtained by using the Lipschitz condition one more time. ∎
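
A small numerical check (ours) of the $\beta/(\lambda N)$ bound of Proposition 4, using the one-dimensional 1-Lipschitz loss $\ell(x,z)=|x-z|$ with $\mathcal{R}(x)=x^{2}$; the resulting objective is $2\lambda$-strongly convex, so the quadratic growth assumption holds with constant $\lambda$. Minimization is done by grid search for simplicity.

```python
import numpy as np

rng = np.random.default_rng(1)
N, lam, beta = 50, 1.0, 1.0                       # beta: Lipschitz constant of |x - z|
z = rng.uniform(-1.0, 1.0, size=N)
xs = np.linspace(-1.0, 1.0, 40_001)               # fine grid for the 1-d minimization

def argmin_loss(keep):
    """Grid minimizer of (1/N) * sum_{i in keep} |x - z_i| + lam * x^2 (note: 1/N, not 1/|keep|)."""
    L = np.abs(xs[:, None] - z[None, keep]).sum(axis=1) / N + lam * xs ** 2
    return xs[np.argmin(L)]

x_full = argmin_loss(np.arange(N))
gap = max(abs(x_full - argmin_loss(np.delete(np.arange(N), j))) for j in range(N))
print(gap, "<=", beta / (lam * N))                # Proposition 4: ||x* - x*\j|| <= beta/(lam*N)
```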

7 Stability for Regularized Bayesian RL

We are now ready to present stability bounds for the regularized Bayesian RL setting. Let $\mu\in\mathbb{R}^{|\mathcal{H}|}$ denote the distribution over $h_{0}$, the initial history (we assume that all elements in the vector that correspond to histories of length greater than 0 are zero). Recall the regularized ERM loss $\hat{\mathcal{L}}^{\lambda}(\pi)$, and let $\pi^{*}$ denote its minimizer. Define the leave-one-out ERM loss,

\hat{\mathcal{L}}^{\lambda,\backslash j}(\pi)=\frac{1}{N}\sum_{i\neq j}\mathbb{E}_{\pi;M_{i}}\left[\sum_{t=0}^{T}C_{M_{i}}(s_{t},a_{t})+\lambda\mathcal{R}(\pi(\cdot|h_{t}))\right],

and let $\pi^{\backslash j,*}$ denote its minimizer. Recall the definition of $\mathbf{P}^{\pi}$ – the transition probability between histories under policy $\pi$, which depends on the prior. We use the following notation: $\mathbf{P}^{\pi}$ refers to the empirical prior $\hat{P}_{N}$, while $\mathbf{P}_{M_{j}}^{\pi}$ refers to a prior that has all its mass on a single MDP $M_{j}$. The following theorem will be used to derive our stability results. The proof is in Sec. C.

Theorem 3.

Let $\Delta=\hat{\mathcal{L}}^{\lambda}(\pi^{\backslash j,*})-\hat{\mathcal{L}}^{\lambda}(\pi^{*})$. We have that

\Delta\geq\frac{\lambda}{2}\mu^{\top}(I-\mathbf{P}^{\pi^{\backslash j,*}})^{-1}\|\pi^{\backslash j,*}-\pi^{*}\|^{2}_{2},

and

\Delta\leq\frac{1}{N}C_{\max}T\sqrt{|A|}\,\mu^{\top}(I-\mathbf{P}_{M_{j}}^{\pi^{\backslash j,*}})^{-1}\left\|\pi^{\backslash j,*}-\pi^{*}\right\|_{2}.

Following the proof of Proposition 4, we would like to use the two expressions in Theorem 3 to bound $\left\|\pi^{\backslash j,*}-\pi^{*}\right\|_{2}$, which would directly lead to a stability result. This is complicated by the fact that different factors, $(I-\mathbf{P}^{\pi^{\backslash j,*}})^{-1}$ and $(I-\mathbf{P}_{M_{j}}^{\pi^{\backslash j,*}})^{-1}$, appear in the two expressions. Our first result assumes that these expressions cannot be too different; a discussion of this assumption follows.

Assumption 1.

For any two MDPs $M,M^{\prime}\in\mathcal{M}$, and any policy $\pi$, let $\mathbf{P}^{\pi}_{M}$ and $\mathbf{P}^{\pi}_{M^{\prime}}$ denote their respective history transition probabilities. There exists some $D<\infty$ such that for any $x\in\mathbb{R}^{|\mathcal{H}|}$,

\mu^{\top}(I-\mathbf{P}^{\pi}_{M})^{-1}x\leq D\,\mu^{\top}(I-\mathbf{P}^{\pi}_{M^{\prime}})^{-1}x.

Let us define the regularized loss for MDP $M$, $\mathcal{L}_{M}^{\lambda}(\pi)=\mathbb{E}_{\pi;M}\left[\sum_{t=0}^{T}C_{M}(s_{t},a_{t})+\lambda\mathcal{R}(\pi(\cdot|h_{t}))\right]$. We have the following result.

Corollary 1.

Let Assumption 1 hold, and let $\kappa=2D^{2}C_{\max}^{2}T^{2}|A|$. Then, for any MDP $M^{\prime}\in\mathcal{M}$,

\mathcal{L}_{M^{\prime}}^{\lambda}(\pi^{\backslash j,*})-\mathcal{L}_{M^{\prime}}^{\lambda}(\pi^{*})\leq\frac{\kappa}{\lambda N},

and with probability at least $1-\delta$,

\mathcal{R}_{T}(\hat{\pi}^{*})\leq 2\lambda T+\frac{2\kappa}{\lambda N}+\left(\frac{4\kappa}{\lambda}+3C_{\max}T\right)\sqrt{\frac{\ln(1/\delta)}{2N}}.

Note that each element that corresponds to history $h_{t}$ in the vector $\mu^{\top}(I-\mathbf{P}^{\pi}_{M})^{-1}$ is equivalent to $P(h_{t}|M;\pi)$, the probability of observing $h_{t}$ under policy $\pi$ and MDP $M$ (see Sec. B.1 for a formal proof). Thus, Assumption 1 essentially states that two different MDPs under the prior cannot visit completely different histories given the same policy. With our regularization scheme, such an assumption is required for uniform stability: if the test MDP can reach completely different states than those possible during training, it is impossible to guarantee anything about the performance of the policy in those states. Unfortunately, the constant $D$ can be very large. Let

q=\sup_{M,M^{\prime}\in\mathcal{M},\,s,s^{\prime}\in S,\,a\in A,\,c\in\mathcal{C}}\frac{P_{M}(s^{\prime},c|s,a)}{P_{M^{\prime}}(s^{\prime},c|s,a)},

where we assume that $0/0=1$. Then, $P(h_{t}|M;\pi)/P(h_{t}|M^{\prime};\pi)=\prod_{t}\frac{P_{M}(s_{t+1},c_{t}|s_{t},a_{t})}{P_{M^{\prime}}(s_{t+1},c_{t}|s_{t},a_{t})}$ is at most $q^{T}$, and therefore $D$ can be on the order of $q^{T}$. One way to guarantee that $D$ is finite is to add a small amount of noise to every state transition. The following example estimates $q$ in such a case.

Example 1.

Consider modifying each MDP $M$ in $\mathcal{M}$ such that $P_{M}(s^{\prime},c|s,a)\to(1-\alpha)P_{M}(s^{\prime},c|s,a)+\alpha/|S||\mathcal{C}|$, where $\alpha\in(0,1)$. In this case, $q\leq 1+\frac{(1-\alpha)|S||\mathcal{C}|}{\alpha}$.

Let us now compare Corollary 1 with the trivial bound in Proposition 1. First, Corollary 1 allows us to control generalization by increasing the regularization weight $\lambda$. The term $2\lambda T$ is a bias, incurred by adding the regularization to the objective, and can be reduced by decreasing $\lambda$. Comparing the constants of the $\mathcal{O}(1/\sqrt{N})$ term, the dominant terms are $D^{2}$ vs. $(|S||\mathcal{C}||A|)^{T}$. Since $D$ does not depend on $|A|$, the bound in Corollary 1 is important for problems with large $|A|$. The example above shows that in the worst case, $D^{2}$ can be $\mathcal{O}((|S||\mathcal{C}|)^{2T})$. There are, of course, more favorable cases, where the structure of $\mathcal{M}$ is such that $D$ is better behaved.
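
The bias/stability trade-off in Corollary 1 can be made concrete by minimizing the bound over $\lambda$; a sketch (ours) with purely illustrative constants:

```python
import numpy as np

def corollary1_bound(lam, D, Cmax, T, A, N, delta):
    """2*lam*T + 2*kappa/(lam*N) + (4*kappa/lam + 3*Cmax*T) * sqrt(ln(1/delta) / (2N))."""
    kappa = 2 * D ** 2 * Cmax ** 2 * T ** 2 * A
    return (2 * lam * T + 2 * kappa / (lam * N)
            + (4 * kappa / lam + 3 * Cmax * T) * np.sqrt(np.log(1 / delta) / (2 * N)))

lams = np.logspace(-3, 1, 400)
vals = [corollary1_bound(l, D=2.0, Cmax=1.0, T=5, A=4, N=10 ** 8, delta=0.05) for l in lams]
k = int(np.argmin(vals))
print("best lambda ~", lams[k], " bound ~", vals[k])   # below the trivial maximum Cmax*T here
```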

Example 2.

Consider a hypothesis set $\mathcal{M}$ such that all MDPs in $\mathcal{M}$ differ only on a set of states that cannot be visited more than $k$ times in an episode. In this case, $D$ would be on the order of $q^{kT/H}$.

Another case is where the set $\mathcal{M}$ is finite. In this case, we can show that the pointwise hypothesis stability does not depend on $D$, and we obtain a bound that does not depend exponentially on $T$, as we now show.

Corollary 2.

Let $\mathcal{M}$ be a finite set, and let $P_{\mathrm{min}}=\min_{M\in\mathcal{M}}P(M)$. Then

\mathbb{E}\left[\mathcal{L}_{M_{j}}^{\lambda}(\pi^{\backslash j,*})-\mathcal{L}_{M_{j}}^{\lambda}(\pi^{*})\right]\leq\frac{4C_{\max}^{2}T^{2}|A|}{\lambda NP_{\mathrm{min}}}+\exp\left(\frac{-NP_{\mathrm{min}}}{8}\right)C_{\max}T,

and with probability at least $1-\delta$ (ignoring exponentially small terms),

\mathcal{R}_{T}(\hat{\pi}^{*})\leq 2\lambda T+\sqrt{\frac{C_{\max}^{2}T^{2}}{2N\delta}+\frac{48C_{\max}^{3}T^{3}|A|}{2\delta\lambda NP_{\mathrm{min}}}}.

In the generalization bounds of both Corollary 1 and Corollary 2, reducing $\lambda$ and increasing $N$ at a rate such that the stability is $o(1/\sqrt{N})$ will guarantee learnability, i.e., convergence to $\pi_{\mathrm{BO}}$ as $N\to\infty$.

Example 3.

Under the setting of Corollary 2, letting $\lambda=N^{-1/3}$ gives that, with probability at least $1-\delta$ (ignoring exponentially small terms),

\mathcal{R}_{T}(\hat{\pi}^{*})\leq\frac{2T}{N^{1/3}}+\sqrt{\frac{C_{\max}^{2}T^{2}}{2N\delta}+\frac{48C_{\max}^{3}T^{3}|A|}{2\delta N^{2/3}P_{\mathrm{min}}}}.

For a finite $N$, and when there is structure in the hypothesis space $\mathcal{M}$, as reflected for example in the constant $D$, the bounds in Corollaries 1 and 2 allow setting $\lambda$ to obtain bounds that are more optimistic than the trivial bound in Proposition 1. In these cases, our results show that regularization allows for improved generalization.

Example 4.

Set $\lambda=1$. Then the bound in Corollary 2 becomes

\mathcal{R}_{T}(\hat{\pi}^{*})\leq 2T+\sqrt{\frac{C_{\max}^{2}T^{2}}{2N\delta}+\frac{48C_{\max}^{3}T^{3}|A|}{2\delta NP_{\mathrm{min}}}},

while the naive bound is

\mathcal{R}_{T}(\hat{\pi}^{*})\leq\sqrt{\frac{\ln(2/\delta)+(|S||A||\mathcal{C}|)^{T}C_{\max}^{2}T^{2}}{N}}.

For a finite $N$ that is much smaller than $(|S||A||\mathcal{C}|)^{T}$, and for reasonable values of $P_{\mathrm{min}}$ and $\delta$, the naive bound can be completely vacuous (larger than $C_{\max}T$ – the maximum regret possible), while the bound in Corollary 2 can be significantly smaller.
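
Example 4, evaluated numerically (a sketch, ours; all constants illustrative):

```python
import numpy as np

def corollary2_bound(Cmax, T, A, N, delta, Pmin, lam=1.0):
    return 2 * lam * T + np.sqrt(Cmax ** 2 * T ** 2 / (2 * N * delta)
                                 + 48 * Cmax ** 3 * T ** 3 * A / (2 * delta * lam * N * Pmin))

def naive_bound_example4(S, A, C, T, Cmax, N, delta):
    return np.sqrt((np.log(2 / delta) + (S * A * C) ** T * Cmax ** 2 * T ** 2) / N)

S, A, C, T, Cmax, delta, Pmin = 10, 5, 2, 10, 1.0, 0.05, 1e-3
for N in [10 ** 4, 10 ** 6, 10 ** 8]:
    print(N, corollary2_bound(Cmax, T, A, N, delta, Pmin),
          naive_bound_example4(S, A, C, T, Cmax, N, delta))
# For N far smaller than (|S||A||C|)^T the naive bound is vacuous by many orders of
# magnitude, while the Corollary 2 bound approaches its 2*lam*T bias term as N grows.
```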

8 Discussion

In this work, we analyzed generalization in Bayesian RL, focusing on algorithmic stability and a specific form of policy regularization. The bounds we derived can be controlled by the amount of regularization, and under some structural assumptions on the space of possible MDPs, compare favorably to a trivial bound based on the finite policy space. We next outline several future directions.

Specialized regularization for $k$-shot learning:

One can view our results as somewhat pessimistic – at worst, they require that every history has a non-zero probability of being visited, and even then, the dependence on $T$ can be exponential. One may ask whether alternative regularization methods could relax the dependence on $T$. We believe this is true, based on the following observation. Recall the example in the lower bound of Proposition 2. Let $T=kH$, and consider the policy that at time step $t=H$ chooses an action, and based on the observed cost chooses which action to use at time steps $t=2H,3H,\dots$. Note that after observing the first cost, it is clear which action is optimal, and therefore the policy obtains at most a $\frac{k-1}{k}$ fraction of the optimal total cost on both training and test, regardless of the training data. More generally, there exist deterministic policies, such as the Q-learning algorithm of Jin et al. (2018), that achieve $\mathcal{O}\left(\sqrt{H^{3}|S||A|T}\right)$ regret for any MDP. Thus, we believe that in the $k$-shot learning setting, regularization methods that induce efficient exploration can be devised. We leave this as an open problem.

Continuous MDPs:

Another important direction is developing PAC algorithms for continuous state, cost, and action spaces. It is clear that without the finite hypothesis space assumption, overfitting is a much more serious concern; in Sec. F of the supplementary material we provide a simple example of this, when only the costs do not belong to a finite set. We hypothesize that regularization techniques can be important in such settings, in concordance with known results for supervised learning. We believe that the tools for stability analysis in MDPs that we developed in this work may be useful for this problem, which we leave to future work.

Implicit regularization:

Finally, all the results in this work considered optimal solutions of the regularized Bayesian RL problem. In practice, due to the size of the state space that scales exponentially with the horizon, computing such policies is intractable even for medium-sized problems. Interestingly, approximate solutions do not necessarily hurt generalization: implicit regularization, for example as implied by using stochastic gradient descent for optimization, is known to improve generalization, at least in supervised learning (Hardt, Recht, and Singer 2016; Zou et al. 2021). We hypothesize that stability results similar to Hardt, Recht, and Singer (2016) may be developed for Bayesian RL as well, using the quadratic growth property established here.

Acknowledgements

Aviv Tamar thanks Kfir Levy for insightful discussions on optimization and stability, and Shie Mannor for advising on the presentation of this work. This research was partly funded by the Israel Science Foundation (ISF-759/19).

References

  • Amit and Meir (2018) Amit, R.; and Meir, R. 2018. Meta-learning by adjusting priors based on extended PAC-Bayes theory. In International Conference on Machine Learning, 205–214. PMLR.
  • Beck (2017) Beck, A. 2017. First-order methods in optimization. SIAM.
  • Bertsekas (2006) Bertsekas, D. P. 2006. Dynamic programming and optimal control, volume 2. Athena Scientific.
  • Bousquet and Elisseeff (2002) Bousquet, O.; and Elisseeff, A. 2002. Stability and generalization. The Journal of Machine Learning Research, 2: 499–526.
  • Boutilier et al. (2020) Boutilier, C.; Hsu, C.-w.; Kveton, B.; Mladenov, M.; Szepesvari, C.; and Zaheer, M. 2020. Differentiable Meta-Learning of Bandit Policies. Advances in Neural Information Processing Systems, 33.
  • Charles and Papailiopoulos (2018) Charles, Z.; and Papailiopoulos, D. 2018. Stability and generalization of learning algorithms that converge to global optima. In International Conference on Machine Learning, 745–754. PMLR.
  • Cobbe et al. (2020) Cobbe, K.; Hesse, C.; Hilton, J.; and Schulman, J. 2020. Leveraging procedural generation to benchmark reinforcement learning. In International conference on machine learning, 2048–2056. PMLR.
  • Dorfman, Shenfeld, and Tamar (2020) Dorfman, R.; Shenfeld, I.; and Tamar, A. 2020. Offline Meta Learning of Exploration. arXiv preprint arXiv:2008.02598.
  • Duan et al. (2016) Duan, Y.; Schulman, J.; Chen, X.; Bartlett, P. L.; Sutskever, I.; and Abbeel, P. 2016. RL$^2$: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779.
  • Farid and Majumdar (2021) Farid, A.; and Majumdar, A. 2021. PAC-BUS: Meta-Learning Bounds via PAC-Bayes and Uniform Stability. arXiv preprint arXiv:2102.06589.
  • Finn, Abbeel, and Levine (2017) Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, 1126–1135. JMLR. org.
  • Fox, Pakman, and Tishby (2015) Fox, R.; Pakman, A.; and Tishby, N. 2015. Taming the noise in reinforcement learning via soft updates. arXiv preprint arXiv:1512.08562.
  • Ghavamzadeh et al. (2016) Ghavamzadeh, M.; Mannor, S.; Pineau, J.; and Tamar, A. 2016. Bayesian reinforcement learning: A survey. arXiv preprint arXiv:1609.04436.
  • Grover, Basu, and Dimitrakakis (2020) Grover, D.; Basu, D.; and Dimitrakakis, C. 2020. Bayesian reinforcement learning via deep, sparse sampling. In International Conference on Artificial Intelligence and Statistics, 3036–3045. PMLR.
  • Guez, Silver, and Dayan (2012) Guez, A.; Silver, D.; and Dayan, P. 2012. Efficient Bayes-adaptive reinforcement learning using sample-based search. In Advances in neural information processing systems, 1025–1033.
  • Hallak, Di Castro, and Mannor (2015) Hallak, A.; Di Castro, D.; and Mannor, S. 2015. Contextual markov decision processes. arXiv preprint arXiv:1502.02259.
  • Hardt, Recht, and Singer (2016) Hardt, M.; Recht, B.; and Singer, Y. 2016. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, 1225–1234. PMLR.
  • Jaksch, Ortner, and Auer (2010) Jaksch, T.; Ortner, R.; and Auer, P. 2010. Near-optimal Regret Bounds for Reinforcement Learning. Journal of Machine Learning Research, 11(4).
  • Jin et al. (2018) Jin, C.; Allen-Zhu, Z.; Bubeck, S.; and Jordan, M. I. 2018. Is Q-learning provably efficient? arXiv preprint arXiv:1807.03765.
  • Kostrikov, Yarats, and Fergus (2020) Kostrikov, I.; Yarats, D.; and Fergus, R. 2020. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. arXiv preprint arXiv:2004.13649.
  • Mendelson et al. (2016) Mendelson, A. F.; Zuluaga, M. A.; Hutton, B. F.; and Ourselin, S. 2016. What is the distribution of the number of unique original items in a bootstrap sample? arXiv preprint arXiv:1602.05822.
  • Neu, Jonsson, and Gómez (2017) Neu, G.; Jonsson, A.; and Gómez, V. 2017. A unified view of entropy-regularized markov decision processes. arXiv preprint arXiv:1705.07798.
  • Rivlin, Hazan, and Karpas (2020) Rivlin, O.; Hazan, T.; and Karpas, E. 2020. Generalized planning with deep reinforcement learning. arXiv preprint arXiv:2005.02305.
  • Rothfuss et al. (2021) Rothfuss, J.; Fortuin, V.; Josifoski, M.; and Krause, A. 2021. PACOH: Bayes-optimal meta-learning with PAC-guarantees. In International Conference on Machine Learning, 9116–9126. PMLR.
  • Schulman et al. (2015) Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; and Moritz, P. 2015. Trust region policy optimization. In International conference on machine learning, 1889–1897. PMLR.
  • Schulman et al. (2017) Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  • Shalev-Shwartz and Ben-David (2014) Shalev-Shwartz, S.; and Ben-David, S. 2014. Understanding machine learning: From theory to algorithms. Cambridge university press.
  • Shalev-Shwartz et al. (2010) Shalev-Shwartz, S.; Shamir, O.; Srebro, N.; and Sridharan, K. 2010. Learnability, stability and uniform convergence. The Journal of Machine Learning Research, 11: 2635–2670.
  • Shani, Efroni, and Mannor (2020) Shani, L.; Efroni, Y.; and Mannor, S. 2020. Adaptive trust region policy optimization: Global convergence and faster rates for regularized mdps. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 5668–5675.
  • Strehl et al. (2006) Strehl, A. L.; Li, L.; Wiewiora, E.; Langford, J.; and Littman, M. L. 2006. PAC model-free reinforcement learning. In Proceedings of the 23rd international conference on Machine learning, 881–888.
  • Tamar et al. (2016) Tamar, A.; Wu, Y.; Thomas, G.; Levine, S.; and Abbeel, P. 2016. Value iteration networks. In Advances in Neural Information Processing Systems, 2154–2162.
  • Vapnik (2013) Vapnik, V. 2013. The nature of statistical learning theory. Springer science & business media.
  • Yao, Zhang, and Finn (2021) Yao, H.; Zhang, L.; and Finn, C. 2021. Meta-Learning with Fewer Tasks through Task Interpolation. arXiv preprint arXiv:2106.02695.
  • Zhang (2015) Zhang, Y. 2015. Multi-task learning and algorithmic stability. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
  • Zintgraf et al. (2020) Zintgraf, L.; Shiarlis, K.; Igl, M.; Schulze, S.; Gal, Y.; Hofmann, K.; and Whiteson, S. 2020. VariBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning. In International Conference on Learning Representation (ICLR).
  • Zou et al. (2021) Zou, D.; Wu, J.; Braverman, V.; Gu, Q.; Foster, D. P.; and Kakade, S. M. 2021. The Benefits of Implicit Regularization from SGD in Least Squares Problems. arXiv preprint arXiv:2108.04552.

Appendix A A Stability Result for a Particular Family of Non-Convex Functions

Standard uniform stability results depend on convexity of the loss function, and strong convexity of the regularizing function (Bousquet and Elisseeff 2002). For MDPs, however, the loss is not convex. In this work, we show that nevertheless, uniform stability for the Bayesian RL setting can be established. Our results combine standard techniques for establishing uniform stability with results on MDP optimization. To simplify our presentation, we first present the essence of our technique in this section in a general form, without the complexity of the MDP setting. In the next section, we adapt the proof technique to the MDP setting.

In the following we present a stability result that does not require convexity. The next lemma considers a general iterative algorithm, but it may be useful to think about projected gradient descent when reading it.

Lemma 2.

Let $f:\mathcal{X}\to\mathbb{R}$ be some function that attains a minimum: $f(x^{*})\leq f(x)\ \forall x\in\mathcal{X}$. Consider a sequence of step sizes $\alpha_{0},\alpha_{1},\dots\in\mathbb{R}$ and a corresponding sequence of iterates $x_{0},x_{1},\dots\in\mathcal{X}$. Assume that $f(x_{k+1})\leq f(x_{k})$ for all $k\geq 0$. Also consider a sequence of values $z_{0},z_{1},\dots\in\mathbb{R}^{+}$ that satisfy $|z_{k}-z_{0}|\leq B$ for all $k\geq 0$. Assume that there exist $\lambda>0$ and $L\geq 0$ such that the following holds for any step size sequence, all $k\geq 0$, and any $x\in\mathcal{X}$:

\alpha_{k}\left(f(x_{k})-f(x)\right)\leq\left(1-\lambda\alpha_{k}\right)\|x_{k}-x\|^{2}-\|x_{k+1}-x\|^{2}+\lambda\alpha_{k}\left(z_{k}-z_{k+1}\right)+\frac{\alpha_{k}^{2}L^{2}}{2}.

Then the following statements hold true for step sizes $\alpha_{k}=\frac{1}{\lambda(k+2)}$:

  1. The sequence converges to $x^{*}$ and satisfies

     f(x_{k})-f(x^{*})\leq\frac{L^{2}\log k}{\lambda k}.

  2. It holds that $\lambda\|x^{*}-x_{0}\|^{2}\leq f(x_{0})-f(x^{*})$.

Proof.

For the first claim, we follow the proof of Theorem 2 in Shani, Efroni, and Mannor (2020). Note that by the assumption, we have that

\alpha_{k}\left(f(x_{k})-f(x^{*})\right)\leq\left(1-\lambda\alpha_{k}\right)\|x_{k}-x^{*}\|^{2}-\|x_{k+1}-x^{*}\|^{2}+\lambda\alpha_{k}\left(z_{k}-z_{k+1}\right)+\frac{\alpha_{k}^{2}L^{2}}{2}.

Letting $\alpha_{k}=\frac{1}{\lambda(k+2)}$, and multiplying by $\lambda(k+2)$:

f(x_{k})-f(x^{*})\leq\lambda\left(k+1\right)\|x_{k}-x^{*}\|^{2}-\lambda(k+2)\|x_{k+1}-x^{*}\|^{2}+\lambda\left(z_{k}-z_{k+1}\right)+\frac{L^{2}}{2\lambda(k+2)}.

Summing over $k$ and observing the telescoping sums:

\sum_{k=0}^{N}\left(f(x_{k})-f(x^{*})\right)\leq\lambda\|x_{0}-x^{*}\|^{2}-\lambda(N+2)\|x_{N+1}-x^{*}\|^{2}+\lambda\left(z_{0}-z_{N+1}\right)+\frac{L^{2}}{2\lambda}\sum_{k=0}^{N}\frac{1}{k+2}\leq\lambda\|x_{0}-x^{*}\|^{2}+\lambda B+\frac{L^{2}\log(N+2)}{2\lambda}.

Noting that $f(x_{k})$ is decreasing, we have that $\sum_{k=0}^{N}\left(f(x_{N})-f(x^{*})\right)\leq\sum_{k=0}^{N}\left(f(x_{k})-f(x^{*})\right)$. Therefore

N\left(f(x_{N})-f(x^{*})\right)\leq\lambda\|x_{0}-x^{*}\|^{2}+\lambda B+\frac{L^{2}\log(N+2)}{2\lambda},

and

f(x_{N})-f(x^{*})\leq\frac{\lambda}{N}\|x_{0}-x^{*}\|^{2}+\frac{\lambda B}{N}+\frac{L^{2}\log(N+2)}{2\lambda N}.

Taking $N\to\infty$, we see that the sequence converges to $x^{*}$.

We next prove the second claim. By our assumption we have:

αk(f(xk)f(x0))(1λαk)xkx02xk+1x02+λαk(zkzk+1)+αk2L22.\alpha_{k}\left(f(x_{k})-f(x_{0})\right)\leq\left(1-\lambda\alpha_{k}\right)\|x_{k}-x_{0}\|^{2}-\|x_{k+1}-x_{0}\|^{2}+\lambda\alpha_{k}\left(z_{k}-z_{k+1}\right)+\frac{\alpha_{k}^{2}L^{2}}{2}.

Letting αk=1λ(k+2)\alpha_{k}=\frac{1}{\lambda(k+2)}, and multiplying by λ(k+2)\lambda(k+2):

(f(xk)f(x0))λ(k+1)xkx02λ(k+2)xk+1x02+λ(zkzk+1)+L22λ(k+2).\left(f(x_{k})-f(x_{0})\right)\leq\lambda\left(k+1\right)\|x_{k}-x_{0}\|^{2}-\lambda(k+2)\|x_{k+1}-x_{0}\|^{2}+\lambda\left(z_{k}-z_{k+1}\right)+\frac{L^{2}}{2\lambda(k+2)}.

Summing over kk, and observing the telescoping sums:

k=0N(f(xk)f(x0))λ(N+2)xN+1x02+λ(z0zN+1)+L22λk=0N1(k+2)λ(N+2)xN+1x02+λB+L2log(N+2)2λ.\begin{split}\sum_{k=0}^{N}\left(f(x_{k})-f(x_{0})\right)&\leq-\lambda(N+2)\|x_{N+1}-x_{0}\|^{2}+\lambda\left(z_{0}-z_{N+1}\right)+\frac{L^{2}}{2\lambda}\sum_{k=0}^{N}\frac{1}{(k+2)}\\ &\leq-\lambda(N+2)\|x_{N+1}-x_{0}\|^{2}+\lambda B+\frac{L^{2}\log(N+2)}{2\lambda}.\end{split}

Proceeding similarly as above, we have:

N(f(xN)f(x0))λ(N+2)xN+1x02+λB+L2log(N+2)2λ,N\left(f(x_{N})-f(x_{0})\right)\leq-\lambda(N+2)\|x_{N+1}-x_{0}\|^{2}+\lambda B+\frac{L^{2}\log(N+2)}{2\lambda},

and

f(xN)f(x0)λ(N+2)NxN+1x02+λBN+L2log(N+2)2λN.f(x_{N})-f(x_{0})\leq-\lambda\frac{(N+2)}{N}\|x_{N+1}-x_{0}\|^{2}+\frac{\lambda B}{N}+\frac{L^{2}\log(N+2)}{2\lambda N}.

Taking N\to\infty, and using the first part of the lemma:

f(x^{*})-f(x_{0})\leq-\lambda\|x^{*}-x_{0}\|^{2},

so

\lambda\|x^{*}-x_{0}\|^{2}\leq f(x_{0})-f(x^{*}). ∎
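To make the role of the step-size schedule concrete, the following is a minimal numerical sketch (our own illustration, not part of the analysis): it runs gradient descent with \alpha_{k}=\frac{1}{\lambda(k+2)} on a \lambda-strongly convex quadratic, a setting in which an inequality of the above form holds up to constants and the iterates decrease monotonically, and compares f(x_{k})-f(x^{*}) with the shape of the \frac{L^{2}\log k}{\lambda k} rate. The function, dimension, and constants are arbitrary choices.

```python
import numpy as np

# Toy check of the 1/(lam*(k+2)) schedule on a lam-strongly convex quadratic.
# f(x) = (lam/2) * ||x - b||^2, minimized at x* = b with f(x*) = 0.
lam, dim, iters = 0.5, 5, 10000
rng = np.random.default_rng(0)
b = rng.normal(size=dim)

x = np.zeros(dim)
for k in range(iters):
    alpha = 1.0 / (lam * (k + 2))   # the step size used in the lemma
    grad = lam * (x - b)            # gradient of f at x
    x = x - alpha * grad            # monotone decrease here, since alpha * lam <= 1/2

f_gap = 0.5 * lam * np.linalg.norm(x - b) ** 2   # f(x_k) - f(x*)
rate = np.log(iters) / (lam * iters)             # shape of the O(log k / (lam k)) rate
print(f"f(x_k) - f(x*) = {f_gap:.2e},  log(k)/(lam*k) = {rate:.2e}")
```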

We now present a stability result. This result is similar to a result in Shalev-Shwartz et al. (2010), but does not require convexity.

Proposition 5.

Let y_{0},y_{1},\dots\in\mathcal{Y} denote a sequence of samples, and let \ell:\mathcal{Y}\times\mathcal{X}\to\mathbb{R}. Consider a loss function L_{N}(x)=\frac{1}{N}\sum_{i=1}^{N}\ell(y_{i},x)+\lambda\mathcal{R}(x), and let L_{N}^{\backslash j}(x)=\frac{1}{N}\sum_{\begin{subarray}{c}i=1\\ i\neq j\end{subarray}}^{N}\ell(y_{i},x)+\lambda\mathcal{R}(x). Assume that for any y\in\mathcal{Y}, and any x,x^{\prime}, \left|\ell(y,x)-\ell(y,x^{\prime})\right|\leq\beta\|x-x^{\prime}\|. Assume that L_{N}(x) and L_{N}^{\backslash j}(x) have unique minimizers, and denote them x^{*} and x^{*,\backslash j}, respectively. Further assume that \lambda\|x^{*}-x\|^{2}\leq L_{N}(x)-L_{N}(x^{*}) for any x\in\mathcal{X}. Then, we have that

xx,\jβλN,\|x^{*}-x^{*,\backslash j}\|\leq\frac{\beta}{\lambda N},

and for any y𝒴y\in\mathcal{Y},

(y,x)(y,x,\j)β2λN.\ell(y,x^{*})-\ell(y,x^{*,\backslash j})\leq\frac{\beta^{2}}{\lambda N}.
Proof.

By the quadratic growth assumption, applied at x=x^{*,\backslash j}, we have \lambda\|x^{*}-x^{*,\backslash j}\|^{2}\leq L_{N}(x^{*,\backslash j})-L_{N}(x^{*}).

On the other hand,

LN(x,\j)LN(x)=1Ni=1N(yi,x,\j)+λ(x,\j)1Ni=1N(yi,x)λ(x)=1Ni=1ijN(yi,x,\j)+λ(x,\j)1Ni=1ijN(yi,x)λ(x)+(yj,x,\j)(yj,x)N(yj,x,\j)(yj,x)Nβxx,\jN,\begin{split}L_{N}(x^{*,\backslash j})-L_{N}(x^{*})&=\frac{1}{N}\sum_{i=1}^{N}\ell(y_{i},x^{*,\backslash j})+\lambda\mathcal{R}(x^{*,\backslash j})-\frac{1}{N}\sum_{i=1}^{N}\ell(y_{i},x^{*})-\lambda\mathcal{R}(x^{*})\\ &=\frac{1}{N}\sum_{\begin{subarray}{c}i=1\\ i\neq j\end{subarray}}^{N}\ell(y_{i},x^{*,\backslash j})+\lambda\mathcal{R}(x^{*,\backslash j})-\frac{1}{N}\sum_{\begin{subarray}{c}i=1\\ i\neq j\end{subarray}}^{N}\ell(y_{i},x^{*})-\lambda\mathcal{R}(x^{*})+\frac{\ell(y_{j},x^{*,\backslash j})-\ell(y_{j},x^{*})}{N}\\ &\leq\frac{\ell(y_{j},x^{*,\backslash j})-\ell(y_{j},x^{*})}{N}\\ &\leq\frac{\beta\|x^{*}-x^{*,\backslash j}\|}{N},\end{split}

where the first inequality is since x,\jx^{*,\backslash j} minimizes LN\jL_{N}^{\backslash j}, and the second inequality is by the Lipschitz assumption. Combining the two results above, we have

λxx,\j2βxx,\jN,\lambda\|x^{*}-x^{*,\backslash j}\|^{2}\leq\frac{\beta\|x^{*}-x^{*,\backslash j}\|}{N},

and therefore xx,\jβλN\|x^{*}-x^{*,\backslash j}\|\leq\frac{\beta}{\lambda N}. From the Lipschitz assumption, we therefore have (y,x)(y,x,\j)β2λN\ell(y,x^{*})-\ell(y,x^{*,\backslash j})\leq\frac{\beta^{2}}{\lambda N}. ∎
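As a sanity check of Proposition 5, here is a small numerical sketch (our own illustration): it minimizes a regularized empirical absolute loss, which is 1-Lipschitz in x and whose \lambda x^{2} regularizer yields the quadratic growth condition with the same \lambda, and compares \|x^{*}-x^{*,\backslash j}\| with the \frac{\beta}{\lambda N} bound for \beta=1. The grid-search minimizer and all constants are arbitrary choices.

```python
import numpy as np

# Leave-one-out stability of regularized ERM with an absolute (1-Lipschitz) loss:
#   L_N(x)      = (1/N) * sum_i |y_i - x| + lam * x**2
#   L_N^{\j}(x) = (1/N) * sum_{i != j} |y_i - x| + lam * x**2
rng = np.random.default_rng(1)
N, lam, beta = 50, 0.1, 1.0
y = rng.normal(size=N)
grid = np.linspace(-5.0, 5.0, 20001)           # crude grid-search minimizer

def minimizer(mask):
    loss = np.abs(y[mask, None] - grid[None, :]).sum(axis=0) / N + lam * grid ** 2
    return grid[np.argmin(loss)]

x_star = minimizer(np.ones(N, dtype=bool))
gaps = [abs(x_star - minimizer(np.arange(N) != j)) for j in range(N)]

print(f"max_j |x* - x*\\j| = {max(gaps):.4f},  bound beta/(lam*N) = {beta / (lam * N):.4f}")
```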

Appendix B Regularized MDPs and BRL

In this section we investigate regularized MDPs in the BRL setting. We will want to show that regularized MDPs are uniformly stable, using a result similar to Proposition 5. To do that, following (Shani, Efroni, and Mannor 2020), we consider the iterates of a mirror descent algorithm, and we shall show that conditions similar to Lemma 2 hold. We then use this property to derive the stability result.

In our proof we heavily build on techniques from Shani, Efroni, and Mannor (2020). One major difference is that Shani, Efroni, and Mannor (2020) consider discounted MDPs, while we study the finite horizon, Bayesian setting.

We consider history dependent policies of the form \left\{\pi_{0}(h_{0}),\dots,\pi_{T}(h_{T})\right\}, where we recall that h_{t}=\left\{s_{0},a_{0},c_{0},s_{1},a_{1},c_{1},\dots,s_{t}\right\}. Note that by definition, h_{t+1}=\left\{h_{t},a_{t},c_{t},s_{t+1}\right\}. The BRL objective (1) can be interpreted as follows: we first choose a history dependent policy \pi(h_{t}), then nature draws an MDP M\sim P(M), and we then evaluate \pi(h_{t}) on M. The expected cost, over the draws of M, is the BRL performance objective.

Let P(M|h_{t}) denote the posterior probability of nature having drawn the MDP M, given that we have seen the history h_{t}. From Bayes rule, we have that:

P(M|h_{t};\pi)\propto P(h_{t}|M;\pi)P(M).

Also, from the law of total probability and the Markov transitions in each possible MDP, we have that (for notational simplicity, we assume that the set \mathcal{M} is finite; for infinite sets, replace the sums with integrals):

P(st+1|ht,at)=MP(M|ht)P(st+1|M,ht,at)=MP(M|ht)P(st+1|M,st,at).P(s_{t+1}|h_{t},a_{t})=\sum_{M}P(M|h_{t})P(s_{t+1}|M,h_{t},a_{t})=\sum_{M}P(M|h_{t})P(s_{t+1}|M,s_{t},a_{t}).

Similarly,

P(ct|ht,at)=MP(M|ht)P(ct|M,st,at),P(c_{t}|h_{t},a_{t})=\sum_{M}P(M|h_{t})P(c_{t}|M,s_{t},a_{t}),

and

P(c_{t},s_{t+1}|h_{t},a_{t})=\sum_{M}P(M|h_{t})P(c_{t}|M,s_{t},a_{t})P(s_{t+1}|M,s_{t},a_{t}).
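To make these posterior and predictive computations concrete, here is a minimal sketch for a finite set of MDPs (the two-MDP family and all numbers are invented for illustration); it maintains P(M|h_{t}) by multiplying the per-step likelihoods from Bayes rule above, and forms the one-step predictive P(c_{t},s_{t+1}|h_{t},a_{t}).

```python
import numpy as np

# Two toy MDPs over 2 states and 2 actions; P_trans[m, s, a, s'] and P_cost[m, s, a, c]
# with binary costs c in {0, 1}. The prior over MDPs is uniform.
P_trans = np.array([
    [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.6, 0.4]]],   # MDP 0
    [[[0.1, 0.9], [0.7, 0.3]], [[0.4, 0.6], [0.3, 0.7]]],   # MDP 1
])
P_cost = np.array([
    [[[0.8, 0.2], [0.5, 0.5]], [[0.9, 0.1], [0.2, 0.8]]],   # MDP 0
    [[[0.3, 0.7], [0.6, 0.4]], [[0.1, 0.9], [0.7, 0.3]]],   # MDP 1
])
posterior = np.array([0.5, 0.5])                  # P(M | h_0), i.e., the prior

def update_posterior(posterior, s, a, c, s_next):
    """Bayes rule: P(M | h_{t+1}) is proportional to P(c, s' | M, s, a) P(M | h_t)."""
    lik = P_cost[:, s, a, c] * P_trans[:, s, a, s_next]
    post = lik * posterior
    return post / post.sum()

def predictive(posterior, s, a):
    """P(c, s' | h_t, a_t) = sum_M P(M | h_t) P(c | M, s, a) P(s' | M, s, a)."""
    return np.einsum('m,mc,mn->cn', posterior, P_cost[:, s, a, :], P_trans[:, s, a, :])

# Observe one transition (s=0, a=1, c=1, s'=1) and inspect the updated belief.
posterior = update_posterior(posterior, s=0, a=1, c=1, s_next=1)
print("P(M | h_1) =", posterior)
print("P(c, s' | h_1, a=0) =\n", predictive(posterior, s=1, a=0))
```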

We also consider a regularized cost,

Cλ(ht,at;π)=𝔼M|htC(st,at)+λ(πt(|ht)),C_{\lambda}(h_{t},a_{t};\pi)=\mathbb{E}_{M|h_{t}}C(s_{t},a_{t})+\lambda\mathcal{R}(\pi_{t}(\cdot|h_{t})),

where we assume that \mathcal{R} is strongly convex.

B.1 Finite horizon dynamic programming

The value of a policy satisfies a dynamic programming principle. Define the value function

V_{t}^{\pi}(h_{t})=\mathbb{E}_{\pi;M}\left[\left.\sum_{t^{\prime}=t}^{T}C_{\lambda}(h_{t^{\prime}},a_{t^{\prime}};\pi)\right|h_{t}\right],

and the optimal value function

V_{t}^{*}(h_{t})=\min_{\pi}\mathbb{E}_{\pi;M}\left[\left.\sum_{t^{\prime}=t}^{T}C_{\lambda}(h_{t^{\prime}},a_{t^{\prime}};\pi)\right|h_{t}\right].

Then we have that

V_{T}^{\pi}(h_{T})=\sum_{a_{T}}\pi(a_{T}|h_{T})C_{\lambda}(h_{T},a_{T};\pi),

and

Vtπ(ht)=atπ(at|ht)(Cλ(ht,at;π)+ct,st+1P(ct,st+1|ht,at)Vt+1π({ht,at,ct,st+1}))atπ(at|ht)Qtπ(ht,at).V_{t}^{\pi}(h_{t})=\sum_{a_{t}}\pi(a_{t}|h_{t})\left(C_{\lambda}(h_{t},a_{t};\pi)+\sum_{c_{t},s_{t+1}}P(c_{t},s_{t+1}|h_{t},a_{t})V_{t+1}^{\pi}(\left\{h_{t},a_{t},c_{t},s_{t+1}\right\})\right)\equiv\sum_{a_{t}}\pi(a_{t}|h_{t})Q_{t}^{\pi}(h_{t},a_{t}).

For the optimal value, we have

V_{T}^{*}(h_{T})=\min_{\pi(\cdot|h_{T})}\sum_{a_{T}}\pi(a_{T}|h_{T})C_{\lambda}(h_{T},a_{T};\pi),

and

V_{t}^{*}(h_{t})=\min_{\pi(\cdot|h_{t})}\sum_{a_{t}}\pi(a_{t}|h_{t})\left(C_{\lambda}(h_{t},a_{t};\pi)+\sum_{c_{t},s_{t+1}}P(c_{t},s_{t+1}|h_{t},a_{t})V_{t+1}^{*}(\left\{h_{t},a_{t},c_{t},s_{t+1}\right\})\right).

Note that for \lambda=0 we obtain a standard (unregularized) finite horizon MDP, where the optimal policy is deterministic. For the regularized setting, the optimal policy may be stochastic. In the following, we denote V^{\pi}=\left\{V_{0}^{\pi},V_{1}^{\pi},\dots,V_{T}^{\pi}\right\}, and similarly for V^{*}. Note that V^{\pi}\in\mathbb{R}^{|\mathcal{H}|}.
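For intuition, here is a minimal sketch of this backward recursion in the degenerate case of a single known MDP (so that histories can be collapsed to the current state) and with the negative-entropy regularizer \mathcal{R}(\pi)=\sum_{a}\pi(a)\log\pi(a), for which the per-step minimization over \pi(\cdot|h_{t}) has a soft-min closed form. The MDP and all constants are invented for illustration.

```python
import numpy as np

# Regularized finite-horizon backward induction for a single known MDP
# (the degenerate BRL case where the posterior is a point mass), with the
# negative-entropy regularizer. For each state, the minimization
#   min_pi <pi, Q> + lam * sum_a pi(a) log pi(a)
# is solved in closed form by the soft-min: V = -lam * log sum_a exp(-Q(a) / lam).
S, A, T, lam = 3, 2, 5, 0.5
rng = np.random.default_rng(2)
C = rng.uniform(0.0, 1.0, size=(S, A))                # expected costs C(s, a)
P = rng.dirichlet(np.ones(S), size=(S, A))            # transitions P(s' | s, a)

V = np.zeros((T + 2, S))                              # V[T+1] = 0 (terminal)
policy = np.zeros((T + 1, S, A))
for t in range(T, -1, -1):
    Q = C + P @ V[t + 1]                              # Q_t(s,a) = C(s,a) + sum_s' P V_{t+1}
    V[t] = -lam * np.log(np.exp(-Q / lam).sum(axis=1))    # soft-min over actions
    policy[t] = np.exp(-(Q - V[t][:, None]) / lam)        # optimal stochastic policy

print("V_0 =", np.round(V[0], 3))
print("pi_0(.|s) =\n", np.round(policy[0], 3))
```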

It will be convenient to write the dynamic programming property in operator notation. For a particular time step t, let

\mathcal{T}^{\pi^{\prime}}_{t}V_{t}^{\pi}(h_{t})=\sum_{a_{t}}\pi^{\prime}(a_{t}|h_{t})\left(C_{\lambda}(h_{t},a_{t};\pi^{\prime})+\sum_{c_{t},s_{t+1}}P(c_{t},s_{t+1}|h_{t},a_{t})V_{t+1}^{\pi}(\left\{h_{t},a_{t},c_{t},s_{t+1}\right\})\right).

Consider two histories h_{t},\bar{h}_{\bar{t}}\in\mathcal{H}, and let \mathbf{P}^{\pi}(\bar{h}_{\bar{t}}|h_{t})=\pi(\bar{a}_{t}|h_{t})P(\bar{c}_{t},\bar{s}_{t+1}|h_{t},\bar{a}_{t}) if \bar{t}=t+1 and \bar{h}_{\bar{t}}=\left\{h_{t},\bar{a}_{t},\bar{c}_{t},\bar{s}_{t+1}\right\}, and 0 otherwise. Also, define C^{\pi}(h_{t})=\sum_{a_{t}}\pi(a_{t}|h_{t})C_{\lambda}(h_{t},a_{t};\pi). We can write the Bellman equation as follows

V^{\pi}=C^{\pi}+\mathbf{P}^{\pi}V^{\pi}=\mathcal{T}^{\pi}V^{\pi}. (6)

The next proposition establishes several important dynamic-programming properties.

Proposition 6.

The following holds.

  1. The matrix I-\mathbf{P}^{\pi} is invertible, and V^{\pi}=(I-\mathbf{P}^{\pi})^{-1}C^{\pi}.

  2. For two policies \pi,\pi^{\prime}, we have

\begin{split}&V^{\pi^{\prime}}-V^{\pi}=(I-\mathbf{P}^{\pi^{\prime}})^{-1}(\mathcal{T}^{\pi^{\prime}}V^{\pi}-V^{\pi}),\\ &\mathcal{T}^{\pi^{\prime}}V^{\pi}-V^{\pi}=(I-\mathbf{P}^{\pi^{\prime}})\left(V^{\pi^{\prime}}-V^{\pi}\right).\end{split}

  3. For any vector b\in\mathbb{R}^{|\mathcal{H}|}, we have V^{\pi}=(\mathcal{T}^{\pi})^{T}b. Furthermore, letting e denote a vector of ones, we have that (I-\mathbf{P}^{\pi})^{-1}e\leq Te.

  4. Let \mathcal{P}^{\pi}\left(\left.\bar{h}_{\bar{t}}\right|h_{t}\right) denote the probability of observing \bar{h}_{\bar{t}} after h_{t} has been observed, under policy \pi. We have that \mathcal{P}^{\pi}=(I-\mathbf{P}^{\pi})^{-1}.

Proof.

We prove each argument.

  1. 1.

    The finite horizon problem is a special case of the stochastic shortest path problem, where each history of length TT leads to termination. Proposition 2.2.1 in (Bertsekas 2006) shows that I𝐏πI-\mathbf{P}^{\pi} is invertible for the stochastic shortest path problem, and therefore also for our special case.

  2. 2.

    Now,

    VπVπ=(I𝐏π)1Cπ(I𝐏π)1(I𝐏π)Vπ=(I𝐏π)1(Cπ+𝐏πVπVπ)=(I𝐏π)1(𝒯πVπVπ).\begin{split}V^{\pi^{\prime}}-V^{\pi}&=(I-\mathbf{P}^{\pi^{\prime}})^{-1}C^{\pi^{\prime}}-(I-\mathbf{P}^{\pi^{\prime}})^{-1}(I-\mathbf{P}^{\pi^{\prime}})V^{\pi}\\ &=(I-\mathbf{P}^{\pi^{\prime}})^{-1}\left(C^{\pi^{\prime}}+\mathbf{P}^{\pi^{\prime}}V^{\pi}-V^{\pi}\right)\\ &=(I-\mathbf{P}^{\pi^{\prime}})^{-1}\left(\mathcal{T}^{\pi^{\prime}}V^{\pi}-V^{\pi}\right).\end{split}

    The second result is obtained by multiplying both sides by (I𝐏π)(I-\mathbf{P}^{\pi^{\prime}}).

  3. 3.

    Note that by definition of the finite horizon problem, for k>Tk>T, we have that (𝐏π)k=0(\mathbf{P}^{\pi})^{k}=0, since after TT time steps we must terminate. Rolling out (6) we have

    Vπ=Cπ+𝐏πVπ=Cπ+𝐏πCπ+(𝐏π)2Vπ==k=0T(𝐏π)kCπ.V^{\pi}=C^{\pi}+\mathbf{P}^{\pi}V^{\pi}=C^{\pi}+\mathbf{P}^{\pi}C^{\pi}+(\mathbf{P}^{\pi})^{2}V^{\pi}=\dots=\sum_{k=0}^{T}(\mathbf{P}^{\pi})^{k}C^{\pi}.

    On the other hand, using a similar argument, we have that (𝒯π)Tb=k=0T(𝐏π)kCπ=Vπ(\mathcal{T}^{\pi})^{T}b=\sum_{k=0}^{T}(\mathbf{P}^{\pi})^{k}C^{\pi}=V^{\pi}. Observe that k=0T(𝐏π)k=k=0(𝐏π)k=(I𝐏π)1\sum_{k=0}^{T}(\mathbf{P}^{\pi})^{k}=\sum_{k=0}^{\infty}(\mathbf{P}^{\pi})^{k}=(I-\mathbf{P}^{\pi})^{-1}. Therefore Vπ=k=0T(𝐏π)kCπ=(I𝐏π)1CπV^{\pi}=\sum_{k=0}^{T}(\mathbf{P}^{\pi})^{k}C^{\pi}=(I-\mathbf{P}^{\pi})^{-1}C^{\pi}. If Cπ=eC^{\pi}=e, the maximal cost in TT time steps is at most TT, therefore (I𝐏π)1eTe(I-\mathbf{P}^{\pi})^{-1}e\leq Te.

  4. 4.

    By definition, we have \mathcal{P}^{\pi}=I+\mathbf{P}^{\pi}+(\mathbf{P}^{\pi})^{2}+\dots=\sum_{k=0}^{T}(\mathbf{P}^{\pi})^{k}=(I-\mathbf{P}^{\pi})^{-1}.
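The third and fourth claims rest on \mathbf{P}^{\pi} being nilpotent, i.e., (\mathbf{P}^{\pi})^{k}=0 for k>T. The following toy sketch (with an invented chain standing in for the history transition matrix) checks numerically that the truncated Neumann series coincides with (I-\mathbf{P}^{\pi})^{-1} in this case.

```python
import numpy as np

# A toy "history transition" matrix over 4 history-nodes arranged in a chain
# h0 -> h1 -> h2 -> h3 (horizon T = 3), so P is nilpotent: P^(T+1) = 0.
P = np.zeros((4, 4))
P[0, 1], P[1, 2], P[2, 3] = 1.0, 1.0, 1.0

T = 3
neumann = sum(np.linalg.matrix_power(P, k) for k in range(T + 1))
inverse = np.linalg.inv(np.eye(4) - P)

print("max |sum_k P^k - (I-P)^(-1)| =", np.abs(neumann - inverse).max())
print("(I-P)^(-1) @ ones =", inverse @ np.ones(4))   # counts each node plus its descendants
```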

B.2 Policy gradient

We have that \pi\in\Delta^{\mathcal{H}}_{A}, and therefore \nabla_{\pi}V^{\pi}\in\mathbb{R}^{|\mathcal{H}|\times|\mathcal{H}|\times|A|}. We write \nabla_{\pi}V_{t}^{\pi}(h_{t},\bar{h}_{\bar{t}},\bar{a}_{\bar{t}})=\partial_{\pi(\bar{a}_{\bar{t}}|\bar{h}_{\bar{t}})}V^{\pi}(h_{t}).

Proposition 7.

We have that

\nabla_{\pi}V_{t}^{\pi}(h_{t},\bar{h}_{\bar{t}},\bar{a}_{\bar{t}})=\left\{\begin{array}{lr}\mathcal{P}^{\pi}\left(\left.\bar{h}_{\bar{t}}\right|h_{t}\right)\left(Q_{\bar{t}}^{\pi}(\bar{h}_{\bar{t}},\bar{a}_{\bar{t}})+\lambda\partial_{\pi(\bar{a}_{\bar{t}}|\bar{h}_{\bar{t}})}\mathcal{R}(\pi(\cdot|\bar{h}_{\bar{t}}))\right),&\text{for }\bar{t}\geq t\\ 0,&\text{else }\end{array}\right..
Proof.

Since the history h_{t} is given, the policy at times earlier than t cannot affect the costs from time t onward; therefore the gradient is zero for \bar{t}<t. Otherwise, we have that

\begin{split}V_{t}^{\pi}(h_{t})&=\mathbb{E}_{\pi;M}\left[\left.\sum_{t^{\prime}=t}^{T}C_{\lambda}(h_{t^{\prime}},a_{t^{\prime}};\pi)\right|h_{t}\right]\\ &=\mathbb{E}_{\pi;M}\left[\left.\sum_{t^{\prime}=t}^{\bar{t}-1}C_{\lambda}(h_{t^{\prime}},a_{t^{\prime}};\pi)\right|h_{t}\right]+\mathbb{E}_{\pi;M}\left[\left.\sum_{t^{\prime}=\bar{t}}^{T}C_{\lambda}(h_{t^{\prime}},a_{t^{\prime}};\pi)\right|h_{t}\right]\\ &=\mathbb{E}_{\pi;M}\left[\left.\sum_{t^{\prime}=t}^{\bar{t}-1}C_{\lambda}(h_{t^{\prime}},a_{t^{\prime}};\pi)\right|h_{t}\right]+\sum_{\bar{h}_{\bar{t}}}\mathcal{P}^{\pi}\left(\left.\bar{h}_{\bar{t}}\right|h_{t}\right)V_{\bar{t}}^{\pi}(\bar{h}_{\bar{t}})\\ &=\mathbb{E}_{\pi;M}\left[\left.\sum_{t^{\prime}=t}^{\bar{t}-1}C_{\lambda}(h_{t^{\prime}},a_{t^{\prime}};\pi)\right|h_{t}\right]+\sum_{\bar{h}_{\bar{t}}}\mathcal{P}^{\pi}\left(\left.\bar{h}_{\bar{t}}\right|h_{t}\right)\sum_{a}\pi(a|\bar{h}_{\bar{t}})Q_{\bar{t}}^{\pi}(\bar{h}_{\bar{t}},a).\end{split}

Taking a gradient,

πVtπ(ht,h¯t¯,a¯t¯)=𝒫π(h¯t¯|ht)(Qt¯π(h¯t¯,a¯t¯)+π(a¯t¯|h¯t¯)Qt¯π(h¯t¯,a¯t¯))\nabla_{\pi}V_{t}^{\pi}(h_{t},\bar{h}_{\bar{t}},\bar{a}_{\bar{t}})=\mathcal{P}^{\pi}\left(\left.\bar{h}_{\bar{t}}\right|h_{t}\right)\left(Q_{\bar{t}}^{\pi}(\bar{h}_{\bar{t}},\bar{a}_{\bar{t}})+\partial_{\pi(\bar{a}_{\bar{t}}|\bar{h}_{\bar{t}})}Q_{\bar{t}}^{\pi}(\bar{h}_{\bar{t}},\bar{a}_{\bar{t}})\right)

Finally, observe that the only dependence of Q_{\bar{t}}^{\pi}(\bar{h}_{\bar{t}},\bar{a}_{\bar{t}}) on \pi(\bar{a}_{\bar{t}}|\bar{h}_{\bar{t}}) is through C_{\lambda}(\bar{h}_{\bar{t}},\bar{a}_{\bar{t}};\pi). Plugging in the expression for C_{\lambda}(\bar{h}_{\bar{t}},\bar{a}_{\bar{t}};\pi), whose regularization term contributes \lambda\partial_{\pi(\bar{a}_{\bar{t}}|\bar{h}_{\bar{t}})}\mathcal{R}(\pi(\cdot|\bar{h}_{\bar{t}})), gives the result. ∎

B.3 Linear approximation of a policy’s value

The linear approximation of the value of a policy \pi^{\prime} around the policy \pi is given by

V^{\pi^{\prime}}\approx V^{\pi}+\left<\nabla_{\pi}V^{\pi},\pi^{\prime}-\pi\right>.
Proposition 8.

Let π,πΔA\pi,\pi^{\prime}\in\Delta^{\mathcal{H}}_{A}. Then,

πVπ(ht),ππ=h¯t¯𝒫π(h¯t¯|ht)(𝒯t¯πVπ(h¯t¯)Vπ(h¯t¯)λ(h¯t¯;π,π)),\left<\nabla_{\pi}V^{\pi}(h_{t}),\pi^{\prime}-\pi\right>=\sum_{\bar{h}_{\bar{t}}\in\mathcal{H}}\mathcal{P}^{\pi}\left(\left.\bar{h}_{\bar{t}}\right|h_{t}\right)\left(\mathcal{T}^{\pi^{\prime}}_{\bar{t}}V^{\pi}(\bar{h}_{\bar{t}})-V^{\pi}(\bar{h}_{\bar{t}})-\lambda\mathcal{B}_{\mathcal{R}}(\bar{h}_{\bar{t}};\pi^{\prime},\pi)\right),

where \mathcal{T}^{\pi^{\prime}}_{t}V^{\pi}(h_{t})=\sum_{a_{t}}\pi^{\prime}(a_{t}|h_{t})\left(C_{\lambda}(h_{t},a_{t};\pi^{\prime})+\sum_{c_{t},s_{t+1}}P(c_{t},s_{t+1}|h_{t},a_{t})V_{t+1}^{\pi}(\left\{h_{t},a_{t},c_{t},s_{t+1}\right\})\right), and \mathcal{B}_{\mathcal{R}}(\bar{h}_{\bar{t}};\pi^{\prime},\pi)=\mathcal{R}(\pi^{\prime}_{\bar{t}}(\cdot|\bar{h}_{\bar{t}}))-\mathcal{R}(\pi_{\bar{t}}(\cdot|\bar{h}_{\bar{t}}))-\left<\partial_{\pi(\cdot|\bar{h}_{\bar{t}})}\mathcal{R}(\pi(\cdot|\bar{h}_{\bar{t}})),\pi^{\prime}(\bar{h}_{\bar{t}})-\pi(\bar{h}_{\bar{t}})\right> is the Bregman distance associated with the regularization function \mathcal{R}(\cdot).

Proof.

Consider the inner product π(|h¯t¯)Vπ(ht),π(h¯t¯)π(h¯t¯)\left<\nabla_{\pi(\cdot|\bar{h}_{\bar{t}})}V^{\pi}(h_{t}),\pi^{\prime}(\bar{h}_{\bar{t}})-\pi(\bar{h}_{\bar{t}})\right>. From Proposition 7, we have

\begin{split}\left<\nabla_{\pi(\cdot|\bar{h}_{\bar{t}})}V^{\pi}(h_{t}),\pi^{\prime}(\bar{h}_{\bar{t}})-\pi(\bar{h}_{\bar{t}})\right>&=\left<\mathcal{P}^{\pi}\left(\left.\bar{h}_{\bar{t}}\right|h_{t}\right)\left(Q_{\bar{t}}^{\pi}(\bar{h}_{\bar{t}},\cdot)+\lambda\partial_{\pi(\cdot|\bar{h}_{\bar{t}})}\mathcal{R}(\pi(\cdot|\bar{h}_{\bar{t}}))\right),\pi^{\prime}(\bar{h}_{\bar{t}})-\pi(\bar{h}_{\bar{t}})\right>.\end{split} (7)

where we used the fact that 𝒫π(h¯t¯|ht)=0\mathcal{P}^{\pi}\left(\left.\bar{h}_{\bar{t}}\right|h_{t}\right)=0 for t¯<t\bar{t}<t. Note that the following holds:

Qtπ(ht,),π(ht)π(ht)=Cλ(ht,;π)+ct,st+1P(ct,st+1|ht,)Vt+1π({ht,,ct,st+1}),π(ht)π(ht)=atπ(at|ht)(Cλ(ht,at;π)+ct,st+1P(ct,st+1|ht,at)Vt+1π({ht,at,ct,st+1}))atπ(at|ht)(Cλ(ht,at;π)+ct,st+1P(ct,st+1|ht,at)Vt+1π({ht,at,ct,st+1}))=atπ(at|ht)(𝔼M|htC(st,at)+λ(πt(|ht))+ct,st+1P(ct,st+1|ht,at)Vt+1π({ht,at,ct,st+1}))Vπ(ht)=atπ(at|ht)(𝔼M|htC(st,at)+λ(πt(|ht))+ct,st+1P(ct,st+1|ht,at)Vt+1π({ht,at,ct,st+1}))+λ((πt(|ht))(πt(|ht)))Vπ(ht)=atπ(at|ht)(Cλ(ht,at;π)+ct,st+1P(ct,st+1|ht,at)Vt+1π({ht,at,ct,st+1}))+λ((πt(|ht))(πt(|ht)))Vπ(ht)𝒯tπVπ(ht)+λ((πt(|ht))(πt(|ht)))Vπ(ht)\begin{split}&\left<Q_{t}^{\pi}(h_{t},\cdot),\pi^{\prime}(h_{t})-\pi(h_{t})\right>=\\ &\left<C_{\lambda}(h_{t},\cdot;\pi)+\sum_{c_{t},s_{t+1}}P(c_{t},s_{t+1}|h_{t},\cdot)V_{t+1}^{\pi}(\left\{h_{t},\cdot,c_{t},s_{t+1}\right\}),\pi^{\prime}(h_{t})-\pi(h_{t})\right>\\ =&\sum_{a_{t}}\pi^{\prime}(a_{t}|h_{t})\left(C_{\lambda}(h_{t},a_{t};\pi)+\sum_{c_{t},s_{t+1}}P(c_{t},s_{t+1}|h_{t},a_{t})V_{t+1}^{\pi}(\left\{h_{t},a_{t},c_{t},s_{t+1}\right\})\right)\\ &-\sum_{a_{t}}\pi(a_{t}|h_{t})\left(C_{\lambda}(h_{t},a_{t};\pi)+\sum_{c_{t},s_{t+1}}P(c_{t},s_{t+1}|h_{t},a_{t})V_{t+1}^{\pi}(\left\{h_{t},a_{t},c_{t},s_{t+1}\right\})\right)\\ =&\sum_{a_{t}}\pi^{\prime}(a_{t}|h_{t})\left(\mathbb{E}_{M|h_{t}}C(s_{t},a_{t})+\lambda\mathcal{R}(\pi_{t}(\cdot|h_{t}))+\sum_{c_{t},s_{t+1}}P(c_{t},s_{t+1}|h_{t},a_{t})V_{t+1}^{\pi}(\left\{h_{t},a_{t},c_{t},s_{t+1}\right\})\right)\\ &-V^{\pi}(h_{t})\\ =&\sum_{a_{t}}\pi^{\prime}(a_{t}|h_{t})\left(\mathbb{E}_{M|h_{t}}C(s_{t},a_{t})+\lambda\mathcal{R}(\pi^{\prime}_{t}(\cdot|h_{t}))+\sum_{c_{t},s_{t+1}}P(c_{t},s_{t+1}|h_{t},a_{t})V_{t+1}^{\pi}(\left\{h_{t},a_{t},c_{t},s_{t+1}\right\})\right)\\ &+\lambda\left(\mathcal{R}(\pi_{t}(\cdot|h_{t}))-\mathcal{R}(\pi^{\prime}_{t}(\cdot|h_{t}))\right)-V^{\pi}(h_{t})\\ =&\sum_{a_{t}}\pi^{\prime}(a_{t}|h_{t})\left(C_{\lambda}(h_{t},a_{t};\pi^{\prime})+\sum_{c_{t},s_{t+1}}P(c_{t},s_{t+1}|h_{t},a_{t})V_{t+1}^{\pi}(\left\{h_{t},a_{t},c_{t},s_{t+1}\right\})\right)\\ &+\lambda\left(\mathcal{R}(\pi_{t}(\cdot|h_{t}))-\mathcal{R}(\pi^{\prime}_{t}(\cdot|h_{t}))\right)-V^{\pi}(h_{t})\\ \equiv&\mathcal{T}^{\pi^{\prime}}_{t}V^{\pi}(h_{t})+\lambda\left(\mathcal{R}(\pi_{t}(\cdot|h_{t}))-\mathcal{R}(\pi^{\prime}_{t}(\cdot|h_{t}))\right)-V^{\pi}(h_{t})\end{split}

Plugging this back in (7), we have

π(|h¯t¯)Vπ(ht),π(h¯t¯)π(h¯t¯)=𝒫π(h¯t¯|ht)(𝒯t¯πVπ(h¯t¯)Vπ(h¯t¯))λ𝒫π(h¯t¯|ht)((πt¯(|h¯t¯))(πt¯(|h¯t¯))π(|h¯t¯)(π(|h¯t¯)),π(h¯t¯)π(h¯t¯))=𝒫π(h¯t¯|ht)(𝒯t¯πVπ(h¯t¯)Vπ(h¯t¯)λ(h¯t¯;π,π)).\begin{split}&\left<\nabla_{\pi(\cdot|\bar{h}_{\bar{t}})}V^{\pi}(h_{t}),\pi^{\prime}(\bar{h}_{\bar{t}})-\pi(\bar{h}_{\bar{t}})\right>\\ =&\mathcal{P}^{\pi}\left(\left.\bar{h}_{\bar{t}}\right|h_{t}\right)\left(\mathcal{T}^{\pi^{\prime}}_{\bar{t}}V^{\pi}(\bar{h}_{\bar{t}})-V^{\pi}(\bar{h}_{\bar{t}})\right)\\ &-\lambda\mathcal{P}^{\pi}\left(\left.\bar{h}_{\bar{t}}\right|h_{t}\right)\left(\mathcal{R}(\pi^{\prime}_{\bar{t}}(\cdot|\bar{h}_{\bar{t}}))-\mathcal{R}(\pi_{\bar{t}}(\cdot|\bar{h}_{\bar{t}}))-\left<\partial_{\pi(\cdot|\bar{h}_{\bar{t}})}\mathcal{R}(\pi(\cdot|\bar{h}_{\bar{t}})),\pi^{\prime}(\bar{h}_{\bar{t}})-\pi(\bar{h}_{\bar{t}})\right>\right)\\ =&\mathcal{P}^{\pi}\left(\left.\bar{h}_{\bar{t}}\right|h_{t}\right)\left(\mathcal{T}^{\pi^{\prime}}_{\bar{t}}V^{\pi}(\bar{h}_{\bar{t}})-V^{\pi}(\bar{h}_{\bar{t}})-\lambda\mathcal{B}_{\mathcal{R}}(\bar{h}_{\bar{t}};\pi^{\prime},\pi)\right).\end{split}

Finally,

πVπ(ht),ππ=h¯t¯𝒫π(h¯t¯|ht)(𝒯t¯πVπ(h¯t¯)Vπ(h¯t¯)λ(h¯t¯;π,π)).\left<\nabla_{\pi}V^{\pi}(h_{t}),\pi^{\prime}-\pi\right>=\sum_{\bar{h}_{\bar{t}}\in\mathcal{H}}\mathcal{P}^{\pi}\left(\left.\bar{h}_{\bar{t}}\right|h_{t}\right)\left(\mathcal{T}^{\pi^{\prime}}_{\bar{t}}V^{\pi}(\bar{h}_{\bar{t}})-V^{\pi}(\bar{h}_{\bar{t}})-\lambda\mathcal{B}_{\mathcal{R}}(\bar{h}_{\bar{t}};\pi^{\prime},\pi)\right).

B.4 Uniform trust region policy optimization

We consider updates of the following form.

πk+1argminπΔA{Vπk,ππk+1αk𝒫πk(π,πk)}.\pi_{k+1}\in\operatorname*{arg\,min}_{\pi\in\Delta^{\mathcal{H}}_{A}}\left\{\left<\nabla V^{\pi_{k}},\pi-\pi_{k}\right>+\frac{1}{\alpha_{k}}\mathcal{P}^{\pi_{k}}\mathcal{B}_{\mathcal{R}}(\pi,\pi_{k})\right\}.

Applying Proposition 8 we have that the update is equivalent to

πk+1argminπΔA{𝒫πk(𝒯πVπkVπk+(1αkλ)(π,πk))},\pi_{k+1}\in\operatorname*{arg\,min}_{\pi\in\Delta^{\mathcal{H}}_{A}}\left\{\mathcal{P}^{\pi_{k}}\left(\mathcal{T}^{\pi}V^{\pi_{k}}-V^{\pi_{k}}+\left(\frac{1}{\alpha_{k}}-\lambda\right)\mathcal{B}_{\mathcal{R}}(\pi,\pi_{k})\right)\right\},

and since \mathcal{P}^{\pi_{k}}\geq 0 component-wise, this is equivalent to minimizing for each h_{t}:

\pi_{k+1}(h_{t})\in\operatorname*{arg\,min}_{\pi\in\Delta_{A}}\left\{\alpha_{k}\mathcal{T}^{\pi}V^{\pi_{k}}(h_{t})+\left(1-\alpha_{k}\lambda\right)\mathcal{B}_{\mathcal{R}}(h_{t};\pi,\pi_{k})\right\}.
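For the common choice of \mathcal{R} as the negative entropy, so that \mathcal{B}_{\mathcal{R}} is the KL divergence, the per-history minimization above admits the closed form \pi_{k+1}(a|h_{t})\propto\pi_{k}(a|h_{t})^{1-\alpha_{k}\lambda}\exp(-\alpha_{k}\tilde{Q}^{\pi_{k}}(h_{t},a)), where \tilde{Q}^{\pi_{k}}(h_{t},a) denotes the cost-to-go term multiplying \pi(a|h_{t}) in \mathcal{T}^{\pi}V^{\pi_{k}}(h_{t}). Here is a minimal sketch of this single-history update (the Q values and constants are invented; this illustrates only the update rule, not the full algorithm).

```python
import numpy as np

# One uniform trust-region step at a single history h_t, assuming the
# negative-entropy regularizer (so B_R is the KL divergence). In that case
#   pi_{k+1}(a) is proportional to pi_k(a)^(1 - alpha*lam) * exp(-alpha * Q(a)),
# where Q(a) is the cost-to-go term multiplying pi(a) in T^pi V^{pi_k}(h_t).
def trust_region_step(pi_k, Q, alpha, lam):
    logits = (1.0 - alpha * lam) * np.log(pi_k) - alpha * Q
    w = np.exp(logits - logits.max())        # subtract max for numerical stability
    return w / w.sum()

# Invented numbers for illustration.
lam, k = 0.5, 3
alpha = 1.0 / (lam * (k + 2))                # the step size used in the analysis
pi_k = np.array([0.25, 0.25, 0.5])
Q = np.array([1.0, 0.2, 0.7])                # e.g., E[C] + sum P * V_{t+1}^{pi_k}
print("pi_{k+1}(.|h_t) =", np.round(trust_region_step(pi_k, Q, alpha, lam), 3))
```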

Fundamental inequality:

we claim that the following holds for uniform trust region policy optimization.

Proposition 9.

Let \{\pi_{k}\} be the sequence generated by uniform trust region policy optimization with step sizes \{\alpha_{k}\}. Then for every \pi and k\geq 0,

\alpha_{k}(I-P^{\pi})(V^{\pi_{k}}-V^{\pi})\leq(1-\alpha_{k}\lambda)\mathcal{B}_{\mathcal{R}}(\pi,\pi_{k})-\mathcal{B}_{\mathcal{R}}(\pi,\pi_{k+1})+\lambda\alpha_{k}(\mathcal{R}(\pi_{k})-\mathcal{R}(\pi_{k+1}))+\frac{\alpha_{k}^{2}L^{2}}{2}e,

where e is a vector of ones, and L=C_{max}T for the L_{1} norm and L=C_{max}T|A| for the Euclidean norm.

Proof.

We follow the proof of Lemma 10 in (Shani, Efroni, and Mannor 2020).

Define ψ(π)=αk(𝒫πk)1Vπk,π+δΔA(π)\psi(\pi)=\alpha_{k}(\mathcal{P}^{\pi_{k}})^{-1}\left<\nabla V^{\pi_{k}},\pi\right>+\delta_{\Delta^{\mathcal{H}}_{A}}(\pi), where δΔA(π)=0\delta_{\Delta^{\mathcal{H}}_{A}}(\pi)=0 if πΔA\pi\in\Delta^{\mathcal{H}}_{A} and infinite otherwise, and note that ψ\psi is convex. Applying the non-Euclidean second prox theorem (Theorem 31 in Shani, Efroni, and Mannor (2020)), with a=πka=\pi_{k}, b=πk+1b=\pi_{k+1}, we get that for any πΔA\pi\in\Delta^{\mathcal{H}}_{A},

(πk)(πk+1),ππk+1αk(𝒫πk)1Vπk,ππk+1.\left<\nabla\mathcal{R}(\pi_{k})-\nabla\mathcal{R}(\pi_{k+1}),\pi-\pi_{k+1}\right>\leq\alpha_{k}(\mathcal{P}^{\pi_{k}})^{-1}\left<\nabla V^{\pi_{k}},\pi-\pi_{k+1}\right>.

By the three points lemma (Lemma 30 in Shani, Efroni, and Mannor (2020)), we have

(πk)(πk+1),ππk+1=(π,πk+1)+(πk+1,πk)(π,πk).\left<\nabla\mathcal{R}(\pi_{k})-\nabla\mathcal{R}(\pi_{k+1}),\pi-\pi_{k+1}\right>=\mathcal{B}_{\mathcal{R}}(\pi,\pi_{k+1})+\mathcal{B}_{\mathcal{R}}(\pi_{k+1},\pi_{k})-\mathcal{B}_{\mathcal{R}}(\pi,\pi_{k}).

By adding and subtracting \alpha_{k}(\mathcal{P}^{\pi_{k}})^{-1}\left<\nabla V^{\pi_{k}},\pi_{k}\right>, we have

αk(𝒫πk)1Vπk,πkπ(π,πk+1)(πk+1,πk)+(π,πk)+αk(𝒫πk)1Vπk,πkπk+1=(π,πk+1)(πk+1,πk)+(π,πk)αk(𝒫πk)1𝒫πk(𝒯πk+1VπkVπkλ(πk+1,πk))=(π,πk+1)(πk+1,πk)+(π,πk)+αk(Vπk𝒯πk+1Vπk)+λαk(πk+1,πk)(π,πk)(π,πk+1)(1λαk)2πk+1πk2+αk(Vπk𝒯πk+1Vπk),\begin{split}&\alpha_{k}(\mathcal{P}^{\pi_{k}})^{-1}\left<\nabla V^{\pi_{k}},\pi_{k}-\pi\right>\leq-\mathcal{B}_{\mathcal{R}}(\pi,\pi_{k+1})-\mathcal{B}_{\mathcal{R}}(\pi_{k+1},\pi_{k})+\mathcal{B}_{\mathcal{R}}(\pi,\pi_{k})+\alpha_{k}(\mathcal{P}^{\pi_{k}})^{-1}\left<\nabla V^{\pi_{k}},\pi_{k}-\pi_{k+1}\right>\\ &=-\mathcal{B}_{\mathcal{R}}(\pi,\pi_{k+1})-\mathcal{B}_{\mathcal{R}}(\pi_{k+1},\pi_{k})+\mathcal{B}_{\mathcal{R}}(\pi,\pi_{k})-\alpha_{k}(\mathcal{P}^{\pi_{k}})^{-1}\mathcal{P}^{\pi_{k}}\left(\mathcal{T}^{\pi_{k+1}}V^{\pi_{k}}-V^{\pi_{k}}-\lambda\mathcal{B}_{\mathcal{R}}(\pi_{k+1},\pi_{k})\right)\\ &=-\mathcal{B}_{\mathcal{R}}(\pi,\pi_{k+1})-\mathcal{B}_{\mathcal{R}}(\pi_{k+1},\pi_{k})+\mathcal{B}_{\mathcal{R}}(\pi,\pi_{k})+\alpha_{k}\left(V^{\pi_{k}}-\mathcal{T}^{\pi_{k+1}}V^{\pi_{k}}\right)+\lambda\alpha_{k}\mathcal{B}_{\mathcal{R}}(\pi_{k+1},\pi_{k})\\ &\leq\mathcal{B}_{\mathcal{R}}(\pi,\pi_{k})-\mathcal{B}_{\mathcal{R}}(\pi,\pi_{k+1})-\frac{(1-\lambda\alpha_{k})}{2}\|\pi_{k+1}-\pi_{k}\|^{2}+\alpha_{k}\left(V^{\pi_{k}}-\mathcal{T}^{\pi_{k+1}}V^{\pi_{k}}\right),\end{split} (8)

where the last inequality is since the Bregman distance is 1-strongly convex for our choice of \mathcal{R}.

Let L=CmaxT|A|L=C_{max}T|A|. Now, for every hth_{t}, we have that

αk(Vπk𝒯πk+1Vπk)(ht)=αk(𝒯πkVπk𝒯πk+1Vπk)(ht)=αkλ((πk(|ht))(πk+1(|ht)))+atαk(πk(at|ht)πk+1(at|ht))(𝔼M|htC(st,at)+ct,st+1P(ct,st+1|ht,at)Vt+1πk({ht,at,ct,st+1}))=αkλ((πk(|ht))(πk+1(|ht)))+αk1λαk(𝔼M|htC(st,)+ct,st+1P(ct,st+1|ht,)Vt+1πk({ht,,ct,st+1})),1λαk(πk(|ht)πk+1(|ht))λαk((πk(|ht))(πk+1(|ht)))+αk22(1λαk)𝔼M|htC(st,)+ct,st+1P(ct,st+1|ht,)Vt+1πk({ht,,ct,st+1})2+1λαk2πk(|ht)πk+1(|ht)2λαk((πk(|ht))(πk+1(|ht)))+αk22(1λαk)L2+1λαk2πk(|ht)πk+1(|ht)2.\begin{split}&\alpha_{k}\left(V^{\pi_{k}}-\mathcal{T}^{\pi_{k+1}}V^{\pi_{k}}\right)(h_{t})\\ =&\alpha_{k}\left(\mathcal{T}^{\pi_{k}}V^{\pi_{k}}-\mathcal{T}^{\pi_{k+1}}V^{\pi_{k}}\right)(h_{t})\\ =&\alpha_{k}\lambda(\mathcal{R}(\pi_{k}(\cdot|h_{t}))-\mathcal{R}(\pi_{k+1}(\cdot|h_{t})))\\ &+\sum_{a_{t}}\alpha_{k}(\pi_{k}(a_{t}|h_{t})-\pi_{k+1}(a_{t}|h_{t}))\left(\mathbb{E}_{M|h_{t}}C(s_{t},a_{t})+\sum_{c_{t},s_{t+1}}P(c_{t},s_{t+1}|h_{t},a_{t})V_{t+1}^{\pi_{k}}(\left\{h_{t},a_{t},c_{t},s_{t+1}\right\})\right)\\ =&\alpha_{k}\lambda(\mathcal{R}(\pi_{k}(\cdot|h_{t}))-\mathcal{R}(\pi_{k+1}(\cdot|h_{t})))\\ &+\left<\frac{\alpha_{k}}{\sqrt{1-\lambda\alpha_{k}}}\left(\mathbb{E}_{M|h_{t}}C(s_{t},\cdot)+\sum_{c_{t},s_{t+1}}P(c_{t},s_{t+1}|h_{t},\cdot)V_{t+1}^{\pi_{k}}(\left\{h_{t},\cdot,c_{t},s_{t+1}\right\})\right),\sqrt{1-\lambda\alpha_{k}}(\pi_{k}(\cdot|h_{t})-\pi_{k+1}(\cdot|h_{t}))\right>\\ \leq&\lambda\alpha_{k}(\mathcal{R}(\pi_{k}(\cdot|h_{t}))-\mathcal{R}(\pi_{k+1}(\cdot|h_{t})))\\ &+\frac{\alpha_{k}^{2}}{2(1-\lambda\alpha_{k})}\left\|\mathbb{E}_{M|h_{t}}C(s_{t},\cdot)+\sum_{c_{t},s_{t+1}}P(c_{t},s_{t+1}|h_{t},\cdot)V_{t+1}^{\pi_{k}}(\left\{h_{t},\cdot,c_{t},s_{t+1}\right\})\right\|^{2}_{*}\\ &+\frac{1-\lambda\alpha_{k}}{2}\left\|\pi_{k}(\cdot|h_{t})-\pi_{k+1}(\cdot|h_{t})\right\|^{2}\\ \leq&\lambda\alpha_{k}(\mathcal{R}(\pi_{k}(\cdot|h_{t}))-\mathcal{R}(\pi_{k+1}(\cdot|h_{t})))+\frac{\alpha_{k}^{2}}{2(1-\lambda\alpha_{k})}L^{2}+\frac{1-\lambda\alpha_{k}}{2}\left\|\pi_{k}(\cdot|h_{t})-\pi_{k+1}(\cdot|h_{t})\right\|^{2}.\end{split}

The first inequality is by Fenchel’s inequality on the convex 2\|\cdot\|^{2} and its convex conjugate 2\|\cdot\|^{2}_{*}. Plugging into (8), we obtain

αk(𝒫πk)1Vπk,πkπ(π,πk)(π,πk+1)(1λαk)2πk+1πk2+λαk((πk(|ht))(πk+1(|ht)))+αk22(1λαk)L2e+1λαk2πk(|ht)πk+1(|ht)2=(π,πk)(π,πk+1)+λαk((πk(|ht))(πk+1(|ht)))+αk22(1λαk)L2e.\begin{split}&\alpha_{k}(\mathcal{P}^{\pi_{k}})^{-1}\left<\nabla V^{\pi_{k}},\pi_{k}-\pi\right>\\ &\leq\mathcal{B}_{\mathcal{R}}(\pi,\pi_{k})-\mathcal{B}_{\mathcal{R}}(\pi,\pi_{k+1})-\frac{(1-\lambda\alpha_{k})}{2}\|\pi_{k+1}-\pi_{k}\|^{2}+\lambda\alpha_{k}(\mathcal{R}(\pi_{k}(\cdot|h_{t}))-\mathcal{R}(\pi_{k+1}(\cdot|h_{t})))\\ &+\frac{\alpha_{k}^{2}}{2(1-\lambda\alpha_{k})}L^{2}e+\frac{1-\lambda\alpha_{k}}{2}\left\|\pi_{k}(\cdot|h_{t})-\pi_{k+1}(\cdot|h_{t})\right\|^{2}\\ &=\mathcal{B}_{\mathcal{R}}(\pi,\pi_{k})-\mathcal{B}_{\mathcal{R}}(\pi,\pi_{k+1})+\lambda\alpha_{k}(\mathcal{R}(\pi_{k}(\cdot|h_{t}))-\mathcal{R}(\pi_{k+1}(\cdot|h_{t})))+\frac{\alpha_{k}^{2}}{2(1-\lambda\alpha_{k})}L^{2}e.\end{split}

Using Proposition 8, we have

αk(𝒯πVπkVπkλ(π,πk))(π,πk)(π,πk+1)+λαk((πk(|ht))(πk+1(|ht)))+αk2L22(1λαk)e,\begin{split}&-\alpha_{k}\left(\mathcal{T}^{\pi}V^{\pi_{k}}-V^{\pi_{k}}-\lambda\mathcal{B}_{\mathcal{R}}(\pi,\pi_{k})\right)\\ &\leq\mathcal{B}_{\mathcal{R}}(\pi,\pi_{k})-\mathcal{B}_{\mathcal{R}}(\pi,\pi_{k+1})+\lambda\alpha_{k}(\mathcal{R}(\pi_{k}(\cdot|h_{t}))-\mathcal{R}(\pi_{k+1}(\cdot|h_{t})))+\frac{\alpha_{k}^{2}L^{2}}{2(1-\lambda\alpha_{k})}e,\end{split}

therefore

αk(𝒯πVπkVπk)(1αkλ)(π,πk)(π,πk+1)+λαk((πk(|ht))(πk+1(|ht)))+αk2L22(1λαk)e.\begin{split}&-\alpha_{k}\left(\mathcal{T}^{\pi}V^{\pi_{k}}-V^{\pi_{k}}\right)\\ &\leq(1-\alpha_{k}\lambda)\mathcal{B}_{\mathcal{R}}(\pi,\pi_{k})-\mathcal{B}_{\mathcal{R}}(\pi,\pi_{k+1})+\lambda\alpha_{k}(\mathcal{R}(\pi_{k}(\cdot|h_{t}))-\mathcal{R}(\pi_{k+1}(\cdot|h_{t})))+\frac{\alpha_{k}^{2}L^{2}}{2(1-\lambda\alpha_{k})}e.\end{split}

Finally, by Proposition 6

αk(IPπ)(VπkVπ)=αk(𝒯πVπkVπk)(1αkλ)(π,πk)(π,πk+1)+λαk((πk(|ht))(πk+1(|ht)))+αk2L22(1λαk)e.\begin{split}&\alpha_{k}(I-P^{\pi})(V^{\pi_{k}}-V^{\pi})=-\alpha_{k}(\mathcal{T}^{\pi}V^{\pi_{k}}-V^{\pi_{k}})\\ &\leq(1-\alpha_{k}\lambda)\mathcal{B}_{\mathcal{R}}(\pi,\pi_{k})-\mathcal{B}_{\mathcal{R}}(\pi,\pi_{k+1})+\lambda\alpha_{k}(\mathcal{R}(\pi_{k}(\cdot|h_{t}))-\mathcal{R}(\pi_{k+1}(\cdot|h_{t})))+\frac{\alpha_{k}^{2}L^{2}}{2(1-\lambda\alpha_{k})}e.\end{split}

We next claim the value functions are decreasing.

Proposition 10.

We have that Vπk+1VπkV^{\pi_{k+1}}\leq V^{\pi_{k}}.

Proof.

The proof is similar to the proof of Lemma 11 in (Shani, Efroni, and Mannor 2020). ∎

We next present a main result.

Proposition 11.

Let \{\pi_{k}\} be the sequence generated by uniform trust region policy optimization with step sizes \alpha_{k}=\frac{1}{\lambda(k+2)}. Then the following statements hold:

  1. The sequence converges to V^{*} and satisfies

V^{\pi_{k}}-V^{*}\leq\frac{\left(\lambda^{2}B+C_{max}^{2}T^{3}\right)\log k}{\lambda k}.

  2. It holds that \lambda\mathcal{B}_{\mathcal{R}}(\pi_{0},\pi^{*})\leq(I-\mathbf{P}^{\pi_{0}})(V^{\pi_{0}}-V^{\pi^{*}}).

Proof.

The convergence part is equivalent to Theorem 2 in (Shani, Efroni, and Mannor 2020), with the only difference being that (I-P^{\pi})^{-1}e\leq Te, giving a factor of T instead of \frac{1}{1-\gamma} (where \gamma is the discount factor in (Shani, Efroni, and Mannor 2020)).

We next prove the second claim. By Proposition 9, applied with \pi=\pi_{0}, we have:

αk(I𝐏π0)(VπkVπ0)(1αkλ)(π0,πk)(π0,πk+1)+λαk((πk)(πk+1))+αk2L22e.\alpha_{k}(I-\mathbf{P}^{\pi_{0}})(V^{\pi_{k}}-V^{\pi_{0}})\leq(1-\alpha_{k}\lambda)\mathcal{B}_{\mathcal{R}}(\pi_{0},\pi_{k})-\mathcal{B}_{\mathcal{R}}(\pi_{0},\pi_{k+1})+\lambda\alpha_{k}(\mathcal{R}(\pi_{k})-\mathcal{R}(\pi_{k+1}))+\frac{\alpha_{k}^{2}L^{2}}{2}e.

Letting αk=1λ(k+2)\alpha_{k}=\frac{1}{\lambda(k+2)}, and multiplying by λ(k+2)\lambda(k+2):

(I𝐏π0)(VπkVπ0)λ(k+1)(π0,πk)λ(k+2)(π0,πk+1)+λ((πk)(πk+1))+L2e2λ(k+2).(I-\mathbf{P}^{\pi_{0}})(V^{\pi_{k}}-V^{\pi_{0}})\leq\lambda\left(k+1\right)\mathcal{B}_{\mathcal{R}}(\pi_{0},\pi_{k})-\lambda(k+2)\mathcal{B}_{\mathcal{R}}(\pi_{0},\pi_{k+1})+\lambda\left(\mathcal{R}(\pi_{k})-\mathcal{R}(\pi_{k+1})\right)+\frac{L^{2}e}{2\lambda(k+2)}.

Summing over kk, and observing the telescoping sums:

(I𝐏π0)k=0N(VπkVπ0)λ(N+2)(π0,πN+1)+λ((π0)(πN+1))+L2e2λk=0N1(k+2)λ(N+2)(π0,πN+1)+λBe+L2log(N+2)e2λ.\begin{split}(I-\mathbf{P}^{\pi_{0}})\sum_{k=0}^{N}\left(V^{\pi_{k}}-V^{\pi_{0}}\right)&\leq-\lambda(N+2)\mathcal{B}_{\mathcal{R}}(\pi_{0},\pi_{N+1})+\lambda\left(\mathcal{R}(\pi_{0})-\mathcal{R}(\pi_{N+1})\right)+\frac{L^{2}e}{2\lambda}\sum_{k=0}^{N}\frac{1}{(k+2)}\\ &\leq-\lambda(N+2)\mathcal{B}_{\mathcal{R}}(\pi_{0},\pi_{N+1})+\lambda Be+\frac{L^{2}\log(N+2)e}{2\lambda}.\end{split}

We therefore have:

N(VπNVπ0)λ(N+2)(I𝐏π0)1(π0,πN+1)+λBTe+TL2log(N+2)e2λ,N\left(V^{\pi_{N}}-V^{\pi_{0}}\right)\leq-\lambda(N+2)(I-\mathbf{P}^{\pi_{0}})^{-1}\mathcal{B}_{\mathcal{R}}(\pi_{0},\pi_{N+1})+\lambda BTe+\frac{TL^{2}\log(N+2)e}{2\lambda},

and

VπNVπ0λ(N+2)N(I𝐏π0)1(π0,πN+1)+λBTeN+TL2log(N+2)e2λN.V^{\pi_{N}}-V^{\pi_{0}}\leq-\lambda\frac{(N+2)}{N}(I-\mathbf{P}^{\pi_{0}})^{-1}\mathcal{B}_{\mathcal{R}}(\pi_{0},\pi_{N+1})+\frac{\lambda BTe}{N}+\frac{TL^{2}\log(N+2)e}{2\lambda N}.

Taking NN\to\infty, and using the first part of the lemma:

VπVπ0λ(I𝐏π0)1(π0,π),V^{\pi^{*}}-V^{\pi_{0}}\leq-\lambda(I-\mathbf{P}^{\pi_{0}})^{-1}\mathcal{B}_{\mathcal{R}}(\pi_{0},\pi^{*}),

so

λ(π0,π)(I𝐏π0)(Vπ0Vπ).\lambda\mathcal{B}_{\mathcal{R}}(\pi_{0},\pi^{*})\leq(I-\mathbf{P}^{\pi_{0}})(V^{\pi_{0}}-V^{\pi^{*}}).

Appendix C Stability Analysis for Regularized MDPs

We are finally ready to prove the stability result.

Consider a regularized Bayes adaptive MDP, with some regularization function \mathcal{R}(\pi). The empirical regularized losses (denoted \hat{\mathcal{L}}^{\lambda} and \hat{\mathcal{L}}^{\lambda,\backslash j} in Theorem 4 below) are:

λ(π)=1Ni=1N𝔼π;Mi[t=0TC(st,at)+λ(π(|ht))],\mathcal{L}^{\lambda}(\pi)=\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{\pi;M_{i}}\left[\sum_{t=0}^{T}C(s_{t},a_{t})+\lambda\mathcal{R}(\pi(\cdot|h_{t}))\right],

and

\mathcal{L}^{\lambda,\backslash j}(\pi)=\frac{1}{N}\sum_{\begin{subarray}{c}i=1\\ i\neq j\end{subarray}}^{N}\mathbb{E}_{\pi;M_{i}}\left[\sum_{t=0}^{T}C(s_{t},a_{t})+\lambda\mathcal{R}(\pi(\cdot|h_{t}))\right].

Let \pi^{*} and \pi^{\backslash j,*} denote the optimal policies for the losses above. Let P(s_{0}) denote the initial state distribution, and let \mu\in\mathbb{R}^{\mathcal{H}} equal P(s_{0}) on the elements that correspond to h_{0}=\{s_{0}\}, and 0 elsewhere. Observe that

λ(π\j,)=μVπ\j,,λ(π)=μVπ.\mathcal{L}^{\lambda}(\pi^{\backslash j,*})=\mu^{\top}V^{\pi^{\backslash j,*}},\quad\mathcal{L}^{\lambda}(\pi^{*})=\mu^{\top}V^{\pi^{*}}.

We make the following assumption.

Assumption 2.

For any two MDPs M,MM,M^{\prime}\in\mathcal{M}, and any policy π\pi, let 𝐏Mπ\mathbf{P}^{\pi}_{M} and 𝐏Mπ\mathbf{P}^{\pi}_{M^{\prime}} denote their respective transition matrices (cf. Proposition 6). There exists some D<D<\infty such that for any xx\in\mathbb{R}^{\mathcal{H}}

μ(I𝐏Mπ)1xDμ(I𝐏Mπ)1x.\mu^{\top}(I-\mathbf{P}^{\pi}_{M})^{-1}x\leq D\mu^{\top}(I-\mathbf{P}^{\pi}_{M^{\prime}})^{-1}x.

Assumption 2 essentially requires that, under the same policy, two different MDPs from the prior cannot visit completely different histories. Such an assumption appears necessary for establishing uniform stability: if the test MDP can reach histories that are never visited in the training MDPs, it is impossible to guarantee anything about the performance of the policy there.

We will need the following lemma.

Lemma 3.

Lipschitz property: we have that

\mu^{\top}(V^{\pi^{\prime}}-V^{\pi})\leq C_{max}T\sqrt{A}\mu^{\top}(I-\mathbf{P}^{\pi^{\prime}})^{-1}\left\|\pi^{\prime}(\cdot|h_{t})-\pi(\cdot|h_{t})\right\|_{2},

where \left\|\pi^{\prime}(\cdot|h_{t})-\pi(\cdot|h_{t})\right\|_{2} is understood as a vector in \mathbb{R}^{\mathcal{H}}, indexed by h_{t}.
Proof.

We have that

atπ(at|ht)(Cλ(ht,at;π)+ct,st+1P(ct,st+1|ht,at)Vt+1π({ht,at,ct,st+1}))atπ(at|ht)(Cλ(ht,at;π)+ct,st+1P(ct,st+1|ht,at)Vt+1π({ht,at,ct,st+1}))at|π(at|ht)π(at|ht)|CmaxTCmaxTAπ(at|ht)π(at|ht)2\begin{split}&\sum_{a_{t}}\pi^{\prime}(a_{t}|h_{t})\left(C_{\lambda}(h_{t},a_{t};\pi^{\prime})+\sum_{c_{t},s_{t+1}}P(c_{t},s_{t+1}|h_{t},a_{t})V_{t+1}^{\pi}(\left\{h_{t},a_{t},c_{t},s_{t+1}\right\})\right)\\ &-\sum_{a_{t}}\pi(a_{t}|h_{t})\left(C_{\lambda}(h_{t},a_{t};\pi)+\sum_{c_{t},s_{t+1}}P(c_{t},s_{t+1}|h_{t},a_{t})V_{t+1}^{\pi}(\left\{h_{t},a_{t},c_{t},s_{t+1}\right\})\right)\\ \leq&\sum_{a_{t}}\left|\pi^{\prime}(a_{t}|h_{t})-\pi(a_{t}|h_{t})\right|C_{max}T\\ \leq&C_{max}T\sqrt{A}\left\|\pi^{\prime}(a_{t}|h_{t})-\pi(a_{t}|h_{t})\right\|_{2}\end{split}

From Proposition 6, we have

μ(VπVπ)=μ(I𝐏π)1(𝒯πVπ𝒯πVπ)CmaxTAμ(I𝐏π)1π(at|ht)π(at|ht)2.\begin{split}\mu^{\top}(V^{\pi^{\prime}}-V^{\pi})&=\mu^{\top}(I-\mathbf{P}^{\pi^{\prime}})^{-1}(\mathcal{T}^{\pi^{\prime}}V^{\pi}-\mathcal{T}^{\pi}V^{\pi})\\ &\leq C_{max}T\sqrt{A}\mu^{\top}(I-\mathbf{P}^{\pi^{\prime}})^{-1}\left\|\pi^{\prime}(a_{t}|h_{t})-\pi(a_{t}|h_{t})\right\|_{2}.\end{split}

We are now ready to prove our main theorem.

Theorem 4.

Let \Delta=\hat{\mathcal{L}}^{\lambda}(\pi^{\backslash j,*})-\hat{\mathcal{L}}^{\lambda}(\pi^{*}). We have that

\Delta\geq\frac{\lambda}{2}\mu^{\top}(I-\mathbf{P}^{\pi^{\backslash j,*}})^{-1}\|\pi^{\backslash j,*}-\pi^{*}\|^{2}_{2},

and

\Delta\leq\frac{1}{N}C_{\max}T\sqrt{A}\mu^{\top}(I-\mathbf{P}_{M_{j}}^{\pi^{\backslash j,*}})^{-1}\left\|\pi^{\backslash j,*}-\pi^{*}\right\|_{2}.
Proof.

From Proposition 11, applied with the prior that corresponds to the empirical MDP distribution and with \pi_{0}=\pi^{\backslash j,*}, and using the 1-strong convexity of \mathcal{R} (so that \mathcal{B}_{\mathcal{R}}(\pi^{\backslash j,*},\pi^{*})\geq\frac{1}{2}\|\pi^{\backslash j,*}-\pi^{*}\|^{2}_{2}), we have that

Δ=^λ(π\j,)^λ(π)=μ(Vπ\j,Vπ)λ2μ(I𝐏π\j,)1π\j,π22.\Delta=\hat{\mathcal{L}}^{\lambda}(\pi^{\backslash j,*})-\hat{\mathcal{L}}^{\lambda}(\pi^{*})=\mu^{\top}\left(V^{\pi^{\backslash j,*}}-V^{\pi^{*}}\right)\geq\frac{\lambda}{2}\mu^{\top}(I-\mathbf{P}^{\pi^{\backslash j,*}})^{-1}\|\pi^{\backslash j,*}-\pi^{*}\|^{2}_{2}.

Noting that μ(I𝐏π\j,)1\mu^{\top}(I-\mathbf{P}^{\pi^{\backslash j,*}})^{-1} is a probability distribution, from Jensen’s inequality we have

λ2(μ(I𝐏π\j,)1π\j,π2)2λ2μ(I𝐏π\j,)1π\j,π22.\frac{\lambda}{2}(\mu^{\top}(I-\mathbf{P}^{\pi^{\backslash j,*}})^{-1}\|\pi^{\backslash j,*}-\pi^{*}\|_{2})^{2}\leq\frac{\lambda}{2}\mu^{\top}(I-\mathbf{P}^{\pi^{\backslash j,*}})^{-1}\|\pi^{\backslash j,*}-\pi^{*}\|^{2}_{2}. (9)

On the other hand,

Δ=^λ(π\j,)^λ(π)=1Ni=1N𝔼π\j,;Mi[t=0TC(st,at)+λ(π\j,(|ht))]1Ni=1N𝔼π;Mi[t=0TC(st,at)+λ(π(|ht))]=1Ni=1ijN𝔼π\j,;Mi[t=0TC(st,at)+λ(π\j,(|ht))]1Ni=1ijN𝔼π;Mi[t=0TC(st,at)+λ(π(|ht))]+1N𝔼π\j,;Mj[t=0TC(st,at)+λ(π\j,(|ht))]1N𝔼π;Mj[t=0TC(st,at)+λ(π(|ht))]1N(𝔼π\j,;Mj[t=0TC(st,at)+λ(π\j,(|ht))]𝔼π;Mj[t=0TC(st,at)+λ(π(|ht))])1NCmaxTAμ(I𝐏Mjπ\j,)1π\j,π2\begin{split}\Delta=&\hat{\mathcal{L}}^{\lambda}(\pi^{\backslash j,*})-\hat{\mathcal{L}}^{\lambda}(\pi^{*})\\ =&\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{\pi^{\backslash j,*};M_{i}}\left[\sum_{t=0}^{T}C(s_{t},a_{t})+\lambda\mathcal{R}(\pi^{\backslash j,*}(\cdot|h_{t}))\right]-\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{\pi^{*};M_{i}}\left[\sum_{t=0}^{T}C(s_{t},a_{t})+\lambda\mathcal{R}(\pi^{*}(\cdot|h_{t}))\right]\\ =&\frac{1}{N}\sum_{\begin{subarray}{c}i=1\\ i\neq j\end{subarray}}^{N}\mathbb{E}_{\pi^{\backslash j,*};M_{i}}\left[\sum_{t=0}^{T}C(s_{t},a_{t})+\lambda\mathcal{R}(\pi^{\backslash j,*}(\cdot|h_{t}))\right]-\frac{1}{N}\sum_{\begin{subarray}{c}i=1\\ i\neq j\end{subarray}}^{N}\mathbb{E}_{\pi^{*};M_{i}}\left[\sum_{t=0}^{T}C(s_{t},a_{t})+\lambda\mathcal{R}(\pi^{*}(\cdot|h_{t}))\right]\\ &+\frac{1}{N}\mathbb{E}_{\pi^{\backslash j,*};M_{j}}\left[\sum_{t=0}^{T}C(s_{t},a_{t})+\lambda\mathcal{R}(\pi^{\backslash j,*}(\cdot|h_{t}))\right]-\frac{1}{N}\mathbb{E}_{\pi^{*};M_{j}}\left[\sum_{t=0}^{T}C(s_{t},a_{t})+\lambda\mathcal{R}(\pi^{*}(\cdot|h_{t}))\right]\\ \leq&\frac{1}{N}\left(\mathbb{E}_{\pi^{\backslash j,*};M_{j}}\left[\sum_{t=0}^{T}C(s_{t},a_{t})+\lambda\mathcal{R}(\pi^{\backslash j,*}(\cdot|h_{t}))\right]-\mathbb{E}_{\pi^{*};M_{j}}\left[\sum_{t=0}^{T}C(s_{t},a_{t})+\lambda\mathcal{R}(\pi^{*}(\cdot|h_{t}))\right]\right)\\ \leq&\frac{1}{N}C_{max}T\sqrt{A}\mu^{\top}(I-\mathbf{P}_{M_{j}}^{\pi^{\backslash j,*}})^{-1}\left\|\pi^{\backslash j,*}-\pi^{*}\right\|_{2}\\ \end{split}

where the first inequality is since π,\j\pi^{*,\backslash j} minimizes ^λ,\j\hat{\mathcal{L}}^{\lambda,\backslash j}, and the second inequality is by the Lipschitz property of Lemma 3

Appendix D Proofs for Corollaries

Corollary 3.

Let Assumption 1 hold, and let κ=2D2Cmax2T2A\kappa=2D^{2}C_{\max}^{2}T^{2}A. Then, for any MDP MM^{\prime}\in\mathcal{M},

Mλ(π\j,)Mλ(π)κλN,\mathcal{L}_{M^{\prime}}^{\lambda}(\pi^{\backslash j,*})-\mathcal{L}_{M^{\prime}}^{\lambda}(\pi^{*})\leq\frac{\kappa}{\lambda N},

Then with probability at least 1δ1-\delta,

T(π^)=T(π^)T(πBO)2λT+2κλN+(4κλ+3CmaxT)ln(1/δ)2N\mathcal{R}_{T}(\hat{\pi}^{*})=\mathcal{L}_{T}(\hat{\pi}^{*})-\mathcal{L}_{T}(\pi_{\mathrm{BO}})\leq 2\lambda T+\frac{2\kappa}{\lambda N}+\left(\frac{4\kappa}{\lambda}+3C_{\max}T\right)\sqrt{\frac{\ln(1/\delta)}{2N}}
Proof.

For the first part, from Theorem 3

Δ1NCmaxTAμ(I𝐏Mjπ\j,)1π\j,π21NDCmaxTAμ(I𝐏π\j,)1π\j,π2,\begin{split}\Delta\leq&\frac{1}{N}C_{\max}T\sqrt{A}\mu^{\top}(I-\mathbf{P}_{M_{j}}^{\pi^{\backslash j,*}})^{-1}\left\|\pi^{\backslash j,*}-\pi^{*}\right\|_{2}\\ \leq&\frac{1}{N}DC_{\max}T\sqrt{A}\mu^{\top}(I-\mathbf{P}^{\pi^{\backslash j,*}})^{-1}\left\|\pi^{\backslash j,*}-\pi^{*}\right\|_{2},\end{split}

where the second inequality is by Assumption 2. Combining with the lower bound in Theorem 4 and (9), we have

λ2(μ(I𝐏π\j,)1π\j,π2)21NDCmaxTAμ(I𝐏π\j,)1π\j,π2,\frac{\lambda}{2}(\mu^{\top}(I-\mathbf{P}^{\pi^{\backslash j,*}})^{-1}\|\pi^{\backslash j,*}-\pi^{*}\|_{2})^{2}\leq\frac{1}{N}DC_{\max}T\sqrt{A}\mu^{\top}(I-\mathbf{P}^{\pi^{\backslash j,*}})^{-1}\left\|\pi^{\backslash j,*}-\pi^{*}\right\|_{2},

and therefore

μ(I𝐏π\j,)1π\j,π22DCmaxTAλN.\mu^{\top}(I-\mathbf{P}^{\pi^{\backslash j,*}})^{-1}\left\|\pi^{\backslash j,*}-\pi^{*}\right\|_{2}\leq\frac{2DC_{\max}T\sqrt{A}}{\lambda N}. (10)

For any MDP MM^{\prime}, we have that

Mλ(π\j,)Mλ(π)CmaxTAμ(I𝐏Mπ\j,)1π\j,π2DCmaxTAμ(I𝐏π\j,)1π\j,π22D2Cmax2T2AλN,\begin{split}\mathcal{L}_{M^{\prime}}^{\lambda}(\pi^{\backslash j,*})-\mathcal{L}_{M^{\prime}}^{\lambda}(\pi^{*})&\leq C_{\max}T\sqrt{A}\mu^{\top}(I-\mathbf{P}_{M^{\prime}}^{\pi^{\backslash j,*}})^{-1}\left\|\pi^{\backslash j,*}-\pi^{*}\right\|_{2}\\ &\leq DC_{\max}T\sqrt{A}\mu^{\top}(I-\mathbf{P}^{\pi^{\backslash j,*}})^{-1}\left\|\pi^{\backslash j,*}-\pi^{*}\right\|_{2}\\ &\leq\frac{2D^{2}C_{\max}^{2}T^{2}A}{\lambda N},\end{split}

where the first inequality is by the Lipschitz property of Lemma 3, the second inequality is by Assumption 2, and the third is using (10).

For the second part, using Theorem 2 we have

λ(π^)T(π^)+T(π^)T(πBO)^λ(π^)T(πBO)+2β+(4Nβ+B)ln(1/δ)2N^λ(π^)^T(πBO)+2β+(4Nβ+3B)ln(1/δ)2N^λ(πBO)^T(πBO)+2β+(4Nβ+3B)ln(1/δ)2N,\begin{split}\mathcal{L}^{\lambda}(\hat{\pi}^{*})-\mathcal{L}_{T}(\hat{\pi}^{*})+\mathcal{L}_{T}(\hat{\pi}^{*})-\mathcal{L}_{T}(\pi_{\mathrm{BO}})\leq\hat{\mathcal{L}}^{\lambda}(\hat{\pi}^{*})-\mathcal{L}_{T}(\pi_{\mathrm{BO}})+2\beta+(4N\beta+B)\sqrt{\frac{\ln(1/\delta)}{2N}}\\ \leq\hat{\mathcal{L}}^{\lambda}(\hat{\pi}^{*})-\hat{\mathcal{L}}_{T}(\pi_{\mathrm{BO}})+2\beta+(4N\beta+3B)\sqrt{\frac{\ln(1/\delta)}{2N}}\\ \leq\hat{\mathcal{L}}^{\lambda}({\pi_{\mathrm{BO}}})-\hat{\mathcal{L}}_{T}({\pi_{\mathrm{BO}}})+2\beta+(4N\beta+3B)\sqrt{\frac{\ln(1/\delta)}{2N}},\\ \end{split}

where in the second inequality we used a standard Hoeffding bound, and the third inequality is since π^\hat{\pi}^{*} minimizes ^λ\hat{\mathcal{L}}^{\lambda}.

So,

T(π^)T(πBO)T(π^)λ(π^)+^λ(πBO)^T(πBO)+2β+(4Nβ+3B)ln(1/δ)2N.\begin{split}\mathcal{L}_{T}(\hat{\pi}^{*})-\mathcal{L}_{T}(\pi_{\mathrm{BO}})\leq\mathcal{L}_{T}(\hat{\pi}^{*})-\mathcal{L}^{\lambda}(\hat{\pi}^{*})+\hat{\mathcal{L}}^{\lambda}({\pi_{\mathrm{BO}}})-\hat{\mathcal{L}}_{T}({\pi_{\mathrm{BO}}})+2\beta+(4N\beta+3B)\sqrt{\frac{\ln(1/\delta)}{2N}}.\end{split}

Note that

T(π^)λ(π^)=λ𝔼MP𝔼π^;M[t=0T(π^(|ht))]λT,\mathcal{L}_{T}(\hat{\pi}^{*})-\mathcal{L}^{\lambda}(\hat{\pi}^{*})=\lambda\mathbb{E}_{M\sim P}\mathbb{E}_{\hat{\pi}^{*};M}\left[\sum_{t=0}^{T}\mathcal{R}(\hat{\pi}^{*}(\cdot|h_{t}))\right]\leq\lambda T,

since (π^(|ht))1\mathcal{R}(\hat{\pi}^{*}(\cdot|h_{t}))\leq 1. Similarly, ^λ(πBO)^T(πBO)λT\hat{\mathcal{L}}^{\lambda}({\pi_{\mathrm{BO}}})-\hat{\mathcal{L}}_{T}({\pi_{\mathrm{BO}}})\leq\lambda T. Plugging in the result for β\beta from the first part of the corollary, and using the bound B=CmaxTB=C_{\max}T gives the result. ∎

Corollary 4.

Let \mathcal{M} be a finite set, and let Pmin=minMP(M)P_{\mathrm{min}}=\min_{M\in\mathcal{M}}P(M). Then

𝔼[Mjλ(π\j,)Mjλ(π)]4Cmax2T2|A|λNPmin+exp(NPmin8)CmaxT,\begin{split}&\mathbb{E}\left[\mathcal{L}_{M_{j}}^{\lambda}(\pi^{\backslash j,*})-\mathcal{L}_{M_{j}}^{\lambda}(\pi^{*})\right]\\ &\leq\frac{4C_{\max}^{2}T^{2}|A|}{\lambda NP_{\mathrm{min}}}+\exp\left(\frac{-NP_{\mathrm{min}}}{8}\right)C_{\max}T,\end{split}

and with probability at least 1δ1-\delta, (ignoring exponential terms)

T(π^)T(πBO)2λT+Cmax2T22Nδ+48Cmax3T3|A|2δλNPmin.\mathcal{L}_{T}(\hat{\pi}^{*})-\mathcal{L}_{T}(\pi_{\mathrm{BO}})\leq 2\lambda T+\sqrt{\frac{C_{\max}^{2}T^{2}}{2N\delta}+\frac{48C_{\max}^{3}T^{3}|A|}{2\delta\lambda NP_{\mathrm{min}}}}.
Proof.

We begin with the first part. Consider the case where \mathcal{M} is a finite set, and |\mathcal{M}|=\mathcal{K}. Let \hat{P}(M) denote the empirical distribution of M in the sample, for all M\in\mathcal{M}. For large N, we claim that the ratio \hat{P}(M)/P(M) will be close to 1 with high probability; let us call this the ‘good’ event. This can be seen from the multiplicative Chernoff bound, where we have that

P(P^(M)<12P(M))exp(NP(M)8).P\left(\hat{P}(M)<\frac{1}{2}P(M)\right)\leq\exp\left(\frac{-NP(M)}{8}\right).

Recall that \mu^{\top}(I-\mathbf{P}_{M_{j}}^{\pi^{\backslash j,*}})^{-1} is the vector of visitation frequencies of histories under M_{j}, while \mu^{\top}(I-\mathbf{P}^{\pi^{\backslash j,*}})^{-1} is the vector of visitation frequencies of histories under \hat{P}(M). Therefore, for any non-negative vector x, we have that \mu^{\top}(I-\mathbf{P}_{M_{j}}^{\pi^{\backslash j,*}})^{-1}x\leq\hat{P}(M_{j})^{-1}\mu^{\top}(I-\mathbf{P}^{\pi^{\backslash j,*}})^{-1}x, and with probability at least 1-\exp\left(\frac{-NP(M_{j})}{8}\right) we have \mu^{\top}(I-\mathbf{P}_{M_{j}}^{\pi^{\backslash j,*}})^{-1}x\leq 2{P}(M_{j})^{-1}\mu^{\top}(I-\mathbf{P}^{\pi^{\backslash j,*}})^{-1}x.

Following Theorem 3, we have that under the good event:

λ(π\j,)λ(π)λ4P(Mj)(μ(I𝐏Mjπ\j,)1π\j,π2)2,\mathcal{L}^{\lambda}(\pi^{\backslash j,*})-\mathcal{L}^{\lambda}(\pi^{*})\geq\frac{\lambda}{4}{P}(M_{j})(\mu^{\top}(I-\mathbf{P}_{M_{j}}^{\pi^{\backslash j,*}})^{-1}\|\pi^{\backslash j,*}-\pi^{*}\|_{2})^{2},

and, for any MjM_{j} in the sample,

λ(π\j,)λ(π)1NCmaxTAμ(I𝐏Mjπ\j,)1π\j,π2.\mathcal{L}^{\lambda}(\pi^{\backslash j,*})-\mathcal{L}^{\lambda}(\pi^{*})\leq\frac{1}{N}C_{\max}T\sqrt{A}\mu^{\top}(I-\mathbf{P}_{M_{j}}^{\pi^{\backslash j,*}})^{-1}\left\|\pi^{\backslash j,*}-\pi^{*}\right\|_{2}.

Thus, under the good event, we have that

μ(I𝐏Mjπ\j,)1π\j,π24λNP(Mj)CmaxT|A|,\mu^{\top}(I-\mathbf{P}_{M_{j}}^{\pi^{\backslash j,*}})^{-1}\left\|\pi^{\backslash j,*}-\pi^{*}\right\|_{2}\leq\frac{4}{\lambda N{P}(M_{j})}C_{\max}T\sqrt{|A|},

and

Mjλ(π\j,)Mjλ(π)4Cmax2T2|A|λNPmin,\mathcal{L}_{M_{j}}^{\lambda}(\pi^{\backslash j,*})-\mathcal{L}_{M_{j}}^{\lambda}(\pi^{*})\leq\frac{4C_{\max}^{2}T^{2}|A|}{\lambda N{P}_{\mathrm{min}}},

where {P}_{\mathrm{min}}=\min_{M\in\mathcal{M}}P(M). Now, we have that

\mathbb{E}\left[\mathcal{L}_{M_{j}}^{\lambda}(\pi^{\backslash j,*})-\mathcal{L}_{M_{j}}^{\lambda}(\pi^{*})\right]\leq\frac{4C_{\max}^{2}T^{2}|A|}{\lambda N{P}_{\mathrm{min}}}+\exp\left(\frac{-NP_{\mathrm{min}}}{8}\right)C_{\max}T,

where the second term is the maximum performance difference possible, multiplied by the probability that the ‘good’ event did not happen.

The second part is similar to the proof of Corollary 1. Using Theorem 1 we have

λ(π^)(π^)+(π^)(πBO)^λ(π^)T(πBO)+B2+12BNβ2Nδ^λ(π^)^(πBO)+B2+12BNβ2Nδ+2Bln(1/δ)2N^λ(πBO)^(πBO)+B2+12BNβ2Nδ+2Bln(1/δ)2N,\begin{split}\mathcal{L}^{\lambda}(\hat{\pi}^{*})-\mathcal{L}(\hat{\pi}^{*})+\mathcal{L}(\hat{\pi}^{*})-\mathcal{L}(\pi_{\mathrm{BO}})\leq\hat{\mathcal{L}}^{\lambda}(\hat{\pi}^{*})-\mathcal{L}_{T}(\pi_{\mathrm{BO}})+\sqrt{\frac{B^{2}+12BN\beta}{2N\delta}}\\ \leq\hat{\mathcal{L}}^{\lambda}(\hat{\pi}^{*})-\hat{\mathcal{L}}(\pi_{\mathrm{BO}})+\sqrt{\frac{B^{2}+12BN\beta}{2N\delta}}+2B\sqrt{\frac{\ln(1/\delta)}{2N}}\\ \leq\hat{\mathcal{L}}^{\lambda}({\pi_{\mathrm{BO}}})-\hat{\mathcal{L}}({\pi_{\mathrm{BO}}})+\sqrt{\frac{B^{2}+12BN\beta}{2N\delta}}+2B\sqrt{\frac{\ln(1/\delta)}{2N}},\end{split}

where in the second inequality we used a standard Hoeffding bound, and the third inequality is since π^\hat{\pi}^{*} minimizes ^λ\hat{\mathcal{L}}^{\lambda}.

So,

(π^)(πBO)2λT+B2+12BNβ2Nδ+2Bln(1/δ)2N.\begin{split}\mathcal{L}(\hat{\pi}^{*})-\mathcal{L}(\pi_{\mathrm{BO}})\leq 2\lambda T+\sqrt{\frac{B^{2}+12BN\beta}{2N\delta}}+2B\sqrt{\frac{\ln(1/\delta)}{2N}}.\end{split}

From the first part of the corollary, we plug in \beta=\frac{4C_{\max}^{2}T^{2}|A|}{\lambda NP_{\mathrm{min}}} (ignoring the exponential term \exp\left(\frac{-NP_{\mathrm{min}}}{8}\right)C_{\max}T), and use the bound B=C_{\max}T. We also drop the lower-order (in \delta) term 2B\sqrt{\frac{\ln(1/\delta)}{2N}}. This gives

(π^)(πBO)2λT+Cmax2T22Nδ+48Cmax3T3|A|2δλNPmin.\begin{split}\mathcal{L}(\hat{\pi}^{*})-\mathcal{L}(\pi_{\mathrm{BO}})\leq 2\lambda T+\sqrt{\frac{C_{\max}^{2}T^{2}}{2N\delta}+\frac{48C_{\max}^{3}T^{3}|A|}{2\delta\lambda NP_{\mathrm{min}}}}.\end{split}

Appendix E Lower Bound

We show that in the worst case, the number of samples required for generalization has an exponential dependence on the horizon T. To do this, we will show a problem where for N=2^{T} there is a constant generalization error with a non-negligible probability.

Consider an MDP space \mathcal{M} of size 2^{T}, where the state space has 2T+1 states that we label s_{0},s_{1}^{0},s_{1}^{1},\dots,s_{t}^{0},s_{t}^{1},\dots,s_{T}^{0},s_{T}^{1}. Only at the last time step can the agent choose an action from A=\{a_{0},a_{1}\} and incur a cost (during all previous time steps the cost is 0 and actions have no effect). Each MDP M\in\mathcal{M} has a unique identifier x -- a binary number of length T, i.e., x=[x_{1},\dots,x_{T}]. The initial state for all MDPs in \mathcal{M} is s_{0}. The transitions for the MDP identified by x are:

P(s=st+10|s=st0)=δxt=0(1ϵ)+δxt=1ϵ,P(s=st+10|s=st1)=δxt=0(1ϵ)+δxt=1ϵ,P(s=st+10|s=s0)=δxt=0(1ϵ)+δxt=1ϵ,P(s=st+11|s=st0)=δxt=0(ϵ)+δxt=1(1ϵ),P(s=st+11|s=st1)=δxt=0(ϵ)+δxt=1(1ϵ),P(s=st+11|s=s0)=δxt=0(ϵ)+δxt=1(1ϵ).\begin{split}P(s^{\prime}=s_{t+1}^{0}|s=s_{t}^{0})&=\delta_{x_{t}=0}(1-\epsilon)+\delta_{x_{t}=1}\epsilon,\\ P(s^{\prime}=s_{t+1}^{0}|s=s_{t}^{1})&=\delta_{x_{t}=0}(1-\epsilon)+\delta_{x_{t}=1}\epsilon,\\ P(s^{\prime}=s_{t+1}^{0}|s=s_{0})&=\delta_{x_{t}=0}(1-\epsilon)+\delta_{x_{t}=1}\epsilon,\\ P(s^{\prime}=s_{t+1}^{1}|s=s_{t}^{0})&=\delta_{x_{t}=0}(\epsilon)+\delta_{x_{t}=1}(1-\epsilon),\\ P(s^{\prime}=s_{t+1}^{1}|s=s_{t}^{1})&=\delta_{x_{t}=0}(\epsilon)+\delta_{x_{t}=1}(1-\epsilon),\\ P(s^{\prime}=s_{t+1}^{1}|s=s_{0})&=\delta_{x_{t}=0}(\epsilon)+\delta_{x_{t}=1}(1-\epsilon).\\ \end{split}

Let \epsilon^{\prime}<1. In the following, we assume that \epsilon=\frac{\epsilon^{\prime}}{2^{T}}. Let f(x)\in\{0,1\} be some binary function. The cost function for the MDP identified by x is C(a_{0})=f(x), C(a_{1})=1-f(x). We assume that P(M=x) is a uniform distribution. Given a history h_{T}=s_{0},s_{1},\dots,s_{T}, the posterior distribution is P(M=x|h_{T})\propto\Pi_{t=1}^{T}((1-\epsilon)\delta_{x_{t}=s_{t}}+\epsilon\delta_{x_{t}\neq s_{t}}), the most likely estimate is x^{*}=s_{1},\dots,s_{T}, and the expected cost is minimized by choosing the action attaining \operatorname*{arg\,min}\{f(x^{*}),1-f(x^{*})\}, i.e., a^{*}(h_{T})=a_{0} if f(x^{*})=0 and a_{1} otherwise. The optimal expected cost therefore satisfies

0P(M=x|hT)+1P(Mx|hT)=xxP(M=x|hT)xxϵ=(2T1)ϵ=ϵϵϵ,\mathcal{L}^{*}\leq 0\cdot P(M=x^{*}|h_{T})+1\cdot P(M\neq x^{*}|h_{T})=\sum_{x\neq x^{*}}P(M=x|h_{T})\leq\sum_{x\neq x^{*}}\epsilon=(2^{T}-1)\epsilon=\epsilon^{\prime}-\epsilon\leq\epsilon^{\prime},

where we compared the optimal cost to a suboptimal policy that obtains the worst case cost of 11 on every MDP that is not xx^{*} and a cost of zero for xx^{*}.

Now, given N=2^{T} sampled MDPs (taking T\to\infty), the distribution of the fraction of unique MDPs in the sample converges to a Gaussian distribution \mathcal{N}\left(1-e^{-1},\,e^{-1}-2e^{-2}\right) (Mendelson et al. 2016). Therefore, for each \delta>0 there is a \phi>0 such that with probability of at least \delta, a fraction \phi of the MDPs is not sampled. For these MDPs, the empirical prior is zero. Given a history h_{T}=s_{0},s_{1},\dots,s_{T} that corresponds to an unvisited x, the expected cost is minimized for some action that depends on the data and not on f(x); therefore, there exists an f such that for each unvisited x, the expected cost is at least 0.5(1-\epsilon^{\prime}), since 0.5\cdot P(x\not\in\mathcal{M})=0.5\cdot(1-\sum_{x\in\mathcal{M}}P(M=x))\geq 0.5(1-\epsilon^{\prime}). The total expected cost is therefore larger than 0.5\phi(1-\epsilon^{\prime}), and the regret is larger than 0.5\phi-\epsilon^{\prime}(1+0.5\phi). We can thus always choose \epsilon^{\prime} small enough to obtain a positive regret.
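The missing-mass phenomenon used above can be visualized with a short simulation (our own illustration): sample N=2^{T} MDPs uniformly from the 2^{T} identifiers and measure the fraction that never appears in the sample, which concentrates around e^{-1}\approx 0.37.

```python
import numpy as np

# Fraction of the 2^T MDP identifiers that are missed when sampling N = 2^T
# tasks uniformly with replacement; it concentrates around 1/e as T grows.
rng = np.random.default_rng(3)
for T in (8, 12, 16):
    n = 2 ** T
    sample = rng.integers(0, n, size=n)          # N = 2^T draws from the uniform prior
    missed = 1.0 - np.unique(sample).size / n
    print(f"T = {T:2d}: fraction of unseen MDPs = {missed:.3f} (1/e = {np.exp(-1.0):.3f})")
```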

Appendix F Overfitting Example

For a finite N, if the hypothesis set \mathcal{H} is not restricted, the ERM policy can be arbitrarily bad. To see this, consider MDPs with a single state, 2 actions, and two corresponding rewards r_{1} and r_{2}. The set \mathcal{M} corresponds to a distribution over the 2-dimensional continuous vector [r_{1},r_{2}], and let this distribution be uniform in [0,1]^{2}. For T>1, an optimal policy is to try out each action, and then choose the action corresponding to the highest reward. For any finite N, we can devise a policy that, after trying out each action once, maps the rewards of the training MDPs to their highest reward action, while mapping every other possible reward pair to its lowest reward action. Such a policy will obtain a high return on training MDPs and the lowest return on test MDPs. (The policy above is actually not the optimal ERM policy, as for a finite N and a uniform reward distribution, there is 0 probability of having two training MDPs with the same reward for at least one of the actions, and therefore trying out just one arm is enough to identify the MDP and select the best action. A similar argument can still be used to devise a policy that settles on the worst action for every MDP that is not in the training set.)
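The construction can be made concrete with a short simulation (a toy illustration of the argument, not an experiment from the paper): a lookup-table policy that memorizes the training reward pairs attains near-optimal return on the training tasks and the worst possible return on fresh test tasks.

```python
import numpy as np

# Single-state, two-action MDPs with rewards (r1, r2) ~ Uniform[0,1]^2 and horizon T.
# The memorizing policy spends 2 steps trying both actions; if the observed pair is a
# training pair it exploits the best action, otherwise it deliberately picks the worst.
rng = np.random.default_rng(4)
N, T = 20, 10
train = rng.uniform(size=(N, 2))
test = rng.uniform(size=(1000, 2))
train_set = {tuple(r) for r in train}

def episode_return(r):
    explore = r[0] + r[1]                             # steps 0 and 1: try each action once
    a = np.argmax(r) if tuple(r) in train_set else np.argmin(r)
    return explore + (T - 2) * r[a]                   # remaining T - 2 steps

print("avg return on training MDPs:", np.mean([episode_return(r) for r in train]).round(3))
print("avg return on test MDPs:    ", np.mean([episode_return(r) for r in test]).round(3))
```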