This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Provable Hierarchical Imitation Learning via EM

Zhiyu Zhang
Boston University
[email protected]
   Ioannis Ch. Paschalidis
Boston University
[email protected]
Abstract

Due to recent empirical successes, the options framework for hierarchical reinforcement learning is gaining increasing popularity. Rather than learning from rewards, we consider learning an options-type hierarchical policy from expert demonstrations. Such a problem is referred to as hierarchical imitation learning. Converting this problem to parameter inference in a latent variable model, we develop convergence guarantees for the EM approach proposed by [10]. The population level algorithm is analyzed as an intermediate step, which is nontrivial due to the samples being correlated. If the expert policy can be parameterized by a variant of the options framework, then, under regularity conditions, we prove that the proposed algorithm converges with high probability to a norm ball around the true parameter. To our knowledge, this is the first performance guarantee for an hierarchical imitation learning algorithm that only observes primitive state-action pairs.

1 Introduction

Recent empirical studies [25, 31, 36, 38] have shown that the scalability of Reinforcement Learning (RL) algorithms can be improved by incorporating hierarchical structures. As an example, consider the options framework [35] representing a two-level hierarchical policy: with a set of multi-step low level procedures (options), the high level policy selects an option, which, in turn, decides the primitive action applied at each time step until the option terminates. Learning such a hierarchical policy from environmental feedback effectively breaks the overall task into sub-tasks, each easier to solve.

Researchers have investigated the hierarchical RL problem under various settings. Existing theoretical analyses [6, 18, 19, 28] typically assume that the options are given. As a result, only the high-level policy needs to be learned. Recent advances in deep hierarchical RL (e.g., [2]) focus on concurrently learning the full options framework, but still the initialization of the options is critical. A promising practical approach is to learn an initial hierarchical policy from expert demonstrations. Then, deep hierarchical RL algorithms can be applied for policy improvement. The former step is named as Hierarchical Imitation Learning (HIL).

Due to its practicality, HIL has been extensively studied within the deep learning and robotics communities. However, existing works typically suffer from the following limitations. First, the considered HIL formulations often lack rigor and clarity. Second, existing works are mostly empirical, only testing on a few specific benchmarks. Without theoretical justification, it remains unclear whether the proposed methods can be generalized beyond their experimental settings.

In this paper, we investigate HIL from a theoretical perspective. Our problem formulation is concise while retaining the essential difficulty of HIL: we need to learn a complete hierarchical policy from an unsegmented sequence of state-action pairs. Under this setting, HIL becomes an inference problem in a latent variable model. Such a transformation was first proposed by [10], where the Expectation-Maximization (EM) algorithm [14] was applied for policy learning. Empirical results for this algorithm and its gradient variants [17, 24] demonstrate good performance, but the theoretical analysis remains open. By bridging this gap, we aim to solidify the foundation of HIL and provide some high level guidance for its practice.

1.1 Related work

Due to its intrinsic difficulty, existing works on HIL typically consider its easier variants for practicality. If the expert options are observed, standard imitation learning algorithms can be applied to learn the high and low level policies separately [26]. If those are not available, a popular idea [7, 29, 32, 33] is to first divide the expert demonstration into segments using domain knowledge or heuristics, learn the individual option corresponding to each segment, and finally learn the high level policy. With additional supervision, these steps can be unified [34]. In this regard, the EM approach [10, 17, 24] is this particular idea pushed to an extreme: without any other forms of supervision, we simultaneously segment the demonstration and learn from it, by exploiting the latent variable structure.

From the theoretical perspective, inference in parametric latent variable models is a long-standing problem in statistics. For many years the EM algorithm has been considered the standard approach, but performance guarantees [30, 40] were generally weak, only characterizing the convergence of parameter estimates to stationary points of the finite sample likelihood function. Under additional local assumptions, convergence to the Maximum Likelihood Estimate (MLE) can be further established. However, due to the randomness in sampling, the finite sample likelihood function is usually highly non-concave, leading to stringent requirements on initialization. Another weakness is that converging to the finite sample MLE does not directly characterize the distance to the maximizer of the population likelihood function which is the true parameter.

Recent ideas on EM algorithms [3, 39, 42, 43] focus on the convergence to the true parameter directly, relying on an instrumental object named as the population EM algorithm. It has the same two-stage iterative procedure as the standard EM algorithm, but its QQ-function, the maximization objective in the M-step, is defined as the infinite sample limit of the finite sample QQ-function. Under regularity conditions, the population EM algorithm converges to the true parameter. The standard EM algorithm is then analyzed as its perturbed version, converging with high probability to a norm ball around the true parameter. The main advantage of this approach is that the true parameter usually has a large basin of attraction in the population EM algorithm. Therefore, the requirement on initialization is less stringent. See [42, Figure 1] for an illustration.

The QQ-function adopted in the population EM algorithm is named as the population QQ-function. To properly define such a quantity, the stochastic convergence of the finite sample QQ-function needs to be constructed. When the samples are i.i.d., such as in Gaussian Mixture Models (GMMs) [3, 11, 41], the required convergence follows directly from the law of large numbers. However, this argument is less straightforward in time-series models such as Hidden Markov Models (HMMs) and the model considered in HIL. For HMMs, [42] showed that the expectation of the QQ-function converges, but both the stochastic convergence analysis and the analytical expression of the population QQ-function are not provided. The missing techniques could be borrowed from a body of work [8, 13, 27, 37] analyzing the asymptotic behavior of HMMs. Most notably, [27] provided a rigorous treatment of the population EM algorithm via sufficient statistics, assuming the HMM is parameterized by an exponential family.

Finally, apart from the EM algorithm, a separate line of research [1, 21] applies spectral methods for tractable inference in latent variable models. However, such methods are mainly complementary to the EM algorithm since better performance can usually be obtained by initializing the EM algorithm with the solution of the spectral methods [23].

1.2 Our contributions

In this paper, we establish the first known performance guarantee for a HIL algorithm that only observes primitive state-action pairs. Specifically, we first fix and reformulate the original EM approach by [10] in a rigorous manner. The lack of mixing is identified as a technical difficulty in learning the standard options framework, and a novel options with failure framework is proposed to circumvent this issue.

Inspired by [3] and [42], the population version of our algorithm is analyzed as an intermediate step. We prove that if the expert policy can be parameterized by the options with failure framework, then, under regularity conditions, the population version algorithm converges to the true parameter, and the finite sample version converges with high probability to a norm ball around the true parameter. Our analysis directly constructs the stochastic convergence of the finite sample QQ-function, and an analytical expression of the resulting population QQ-function is provided. Finally, we qualitatively validate our theoretical results using a numerical example.

2 Problem settings

Notation.

Throughout this paper, we use uppercase letters (e.g., StS_{t}) for random variables and lowercase letters (e.g., sts_{t}) for values of random variables. Let [t1:t2][t_{1}:t_{2}] be the set of integers tt such that t1tt2t_{1}\leq t\leq t_{2}. When used in the subscript, the brackets are removed (e.g., St1:t2={St}t1tt2S_{t_{1}:t_{2}}={\left\{S_{t}\right\}}_{t_{1}\leq t\leq t_{2}}).

2.1 Definition of the hierarchical policy

Refer to caption
Figure 1: A graphical model for hierarchical reinforcement learning.

In this section, we first introduce the options framework for hierarchical reinforcement learning [4, 35], captured by the probabilistic graphical model shown in Figure 1. The index tt represents the time; (St,At,Ot,Bt)(S_{t},A_{t},O_{t},B_{t}) respectively represent the state, the action, the option and the termination indicator. For all tt, StS_{t}, AtA_{t} and OtO_{t} are defined on the finite state space 𝒮\mathcal{S}, the finite action space 𝒜\mathcal{A} and the finite option space 𝒪\mathcal{O}; BtB_{t} is a binary random variable. Define the parameter θ:=(θhi,θlo,θb)\theta\mathrel{\mathop{:}}=(\theta_{hi},\theta_{lo},\theta_{b}) where θhiΘhi\theta_{hi}\in\mathit{\Theta}_{hi}, θloΘlo\theta_{lo}\in\mathit{\Theta}_{lo}, and θbΘb\theta_{b}\in\mathit{\Theta}_{b}. The parameter space Θ:=Θhi×Θlo×Θb\mathit{\Theta}\mathrel{\mathop{:}}=\mathit{\Theta}_{hi}\times\mathit{\Theta}_{lo}\times\mathit{\Theta}_{b} is a convex and compact subset of a Euclidean space.

For any (o0,s1)𝒪×𝒮(o_{0},s_{1})\in\mathcal{O}\times\mathcal{S}, if we fix (O0,S1)=(o0,s1)(O_{0},S_{1})=(o_{0},s_{1}) and consider a given θ\theta, the joint distribution on the rest of the graphical model is determined by the following components: an unknown environment transition probability PP, a high level policy πhi\pi_{hi} parameterized by θhi\theta_{hi}, a low level policy πlo\pi_{lo} parameterized by θlo\theta_{lo} and a termination policy πb\pi_{b} parameterized by θb\theta_{b}. Sampling a tuple (s2:T,a1:T,o1:T,b1:T)(s_{2:T},a_{1:T},o_{1:T},b_{1:T}) from such a joint distribution, or equivalently, implementing the hierarchical decision process, follows the following procedure. Starting from the first time step, the decision making agent first determines whether or not to terminate the current option o0o_{0}. The decision is encoded in a termination indicator b1b_{1} sampled from πb(|s1,o0;θb)\pi_{b}(\cdot|s_{1},o_{0};\theta_{b}). b1=1b_{1}=1 indicates that the option o0o_{0} terminates and the next option o1o_{1} is sampled from πhi(|s1;θhi)\pi_{hi}(\cdot|s_{1};\theta_{hi}); b1=0b_{1}=0 indicates that the option o0o_{0} continues and o1=o0o_{1}=o_{0}. Next, the primitive action a1a_{1} is sampled from πlo(|s1,o1;θlo)\pi_{lo}(\cdot|s_{1},o_{1};\theta_{lo}), applying the low level policy associated with the option o1o_{1}. Using the environment, the next state s2s_{2} is sampled from P(|s1,a1)P(\cdot|s_{1},a_{1}). The rest of the samples (s3:T,a2:T,o2:T,b2:T)(s_{3:T},a_{2:T},o_{2:T},b_{2:T}) are generated analogously.

The options framework corresponds to the above hierarchical policy structure and the policy triple {πhi,πlo,πb}\{\pi_{hi},\pi_{lo},\pi_{b}\}. However, due to a technicality identified at the end of this subsection, we consider a novel options with failure framework for the remainder of this paper, which adds an extra failure mechanism to the graphical model in the case of bt=0b_{t}=0. Specifically, there exists a constant 0<ζ<10<\zeta<1 such that when the termination indicator bt=0b_{t}=0, with probability 1ζ1-\zeta the next option oto_{t} is assigned to ot1o_{t-1}, whereas with probability ζ\zeta the next option oto_{t} is sampled uniformly from the set of options 𝒪\mathcal{O}. Notice that if ζ=0\zeta=0, we recover the standard options framework.

To simplify the notation, we define π¯hi\bar{\pi}_{hi} as the combination of πhi\pi_{hi} and the failure mechanism. For any θhi\theta_{hi}, with any other input arguments,

π¯hi(ot|st,ot1,bt;θhi):={πhi(ot|st;θhi),if bt=1,1ζ+ζ|𝒪|,if bt=0ot=ot1,ζ|𝒪|,if bt=0otot1.\bar{\pi}_{hi}(o_{t}|s_{t},o_{t-1},b_{t};\theta_{hi})\mathrel{\mathop{:}}=\begin{cases}\pi_{hi}(o_{t}|s_{t};\theta_{hi}),&\text{if $b_{t}=1$},\\ 1-\zeta+\frac{\zeta}{|\mathcal{O}|},&\text{if $b_{t}=0$, $o_{t}=o_{t-1}$},\\ \frac{\zeta}{|\mathcal{O}|},&\text{if $b_{t}=0$, $o_{t}\neq o_{t-1}$.}\end{cases}

Formally, the options with failure framework is defined as the class of policy triples {π¯hi,πlo,πb}\{\bar{\pi}_{hi},\pi_{lo},\pi_{b}\} parameterized by ζ\zeta and θ\theta. With (O0,S1)=(o0,s1)(O_{0},S_{1})=(o_{0},s_{1}) and a given θ\theta, let θ,o0,s1\mathbb{P}_{\theta,o_{0},s_{1}} be the joint distribution of {S2:T,A1:T,O1:T,B1:T}\{S_{2:T},A_{1:T},O_{1:T},B_{1:T}\}. With any input arguments,

θ,o0,s1(S2:T=s2:T,A1:T=a1:T,O1:T=o1:T,B1:T=b1:T)=[t=1Tπb(bt|st,ot1;θb)π¯hi(ot|st,ot1,bt;θhi)πlo(at|st,ot;θlo)][t=1T1P(st+1|st,at)].\mathbb{P}_{\theta,o_{0},s_{1}}(S_{2:T}=s_{2:T},A_{1:T}=a_{1:T},O_{1:T}=o_{1:T},B_{1:T}=b_{1:T})=\\ \left[\prod_{t=1}^{T}\pi_{b}(b_{t}|s_{t},o_{t-1};\theta_{b})\bar{\pi}_{hi}(o_{t}|s_{t},o_{t-1},b_{t};\theta_{hi})\pi_{lo}(a_{t}|s_{t},o_{t};\theta_{lo})\right]\left[\prod_{t=1}^{T-1}P(s_{t+1}|s_{t},a_{t})\right].
On the policy framework.

The options with failure framework is adopted to simplify the construction of the mixing condition (Lemma D.1). It is possible that our analysis could be extended to learn the standard options framework. In that case, instead of constructing the usual one step mixing condition, one could target the multi-step mixing condition similar to [8, Chap. 4.3].

2.2 The imitation learning problem

Suppose an expert uses an options with failure policy with true parameters ζ\zeta and θ=(θhi,θlo,θb)\theta^{*}=(\theta^{*}_{hi},\theta^{*}_{lo},\theta^{*}_{b}); its initial condition (o0,s1)(o_{0},s_{1}) is sampled from a distribution ν\nu^{*}. A finite length observation sequence {s1:T,a1:T}={st,at}t=1T\{s_{1:T},a_{1:T}\}=\{s_{t},a_{t}\}_{t=1}^{T} with T2T\geq 2 is observed from the expert. ζ\zeta and the parametric structure of the expert policy are known, but ν\nu^{*} is unknown. Our objective is to estimate θ\theta^{*} from {s1:T,a1:T}\{s_{1:T},a_{1:T}\}.

On the practicality of our setting.

Two comments need to made here. First, it is common in practice to observe not one, but a set of independent observation sequences. In that case, the problem essentially becomes easier. Second, the cardinality of the option space and the parameterization of the expert policy are usually unknown. A popular solution is to assume an expressive parameterization (e.g., a neural network) in the algorithm and select card(𝒪)\textrm{card}(\mathcal{O}) through cross-validation. Theoretical analysis of EM under this setting is challenging, even when samples are i.i.d. [15, 16]. Therefore, we only consider the domain of correct-specification.

Throughout this paper, the following assumptions are imposed for simplicity.

Assumption 1 (Non-degeneracy).

With any other input arguments, the domain of πhi\pi_{hi}, πlo\pi_{lo} and πb\pi_{b} as functions of θ\theta can be extended to an open set Θ~\tilde{\mathit{\Theta}} that contains Θ\mathit{\Theta}. Moreover, for all θΘ~\theta\in\tilde{\mathit{\Theta}}, πhi\pi_{hi}, πlo\pi_{lo} and πb\pi_{b} parameterized by θ\theta are strictly positive.

Assumption 2 (Differentiability).

With any other input arguments, πhi\pi_{hi}, πlo\pi_{lo} and πb\pi_{b} as functions of θ\theta are continuously differentiable on Θ~\tilde{\mathit{\Theta}}.

Next, consider the stochastic process {Ot1,St}t=1\{O_{t-1},S_{t}\}_{t=1}^{\infty} induced by ν\nu^{*} and the expert policy. Based on the graphical model, it is a Markov chain with finite state space 𝒪×𝒮\mathcal{O}\times\mathcal{S}. Let Πθ\Pi_{\theta^{*}} be its set of stationary distributions, which is nonempty and convex.

Assumption 3 (Stationary initial distribution).

ν\nu^{*} is an extreme point of Πθ\Pi_{\theta^{*}}. That is, νΠθ\nu^{*}\in\Pi_{\theta^{*}}, and it cannot be written as the convex combination of two elements of Πθ\Pi_{\theta^{*}}.

On the assumptions.

The first two assumptions are generally mild and therefore hold for many policy parameterizations. The third assumption is a bit more restrictive, but it is essential for our theoretical analysis. In Appendix A, we provide further justification of this assumption in a particular class of environments: st,st+1𝒮\forall s_{t},s_{t+1}\in\mathcal{S}, there exists at𝒜a_{t}\in\mathcal{A} such that P(st+1|st,at)>0P(s_{t+1}|s_{t},a_{t})>0. In such environments, Πθ\Pi_{\theta^{*}} contains a unique element which is also the limiting distribution. If we start sampling the observation sequence late enough, Assumption 3 is approximately satisfied.

3 A Baum-Welch type algorithm

Adopting the EM approach, we present Algorithm 1 for the estimation of θ\theta^{*}. It reformulates the algorithm by [10] in a rigorous manner, and an error in the latter is fixed: when defining the posterior distribution of latent variables, at any time t<Tt<T, the original algorithm neglects the dependency of future states St+1:TS_{t+1:T} on the current option OtO_{t}. A detailed discussion is provided in Appendix B.1.

Algorithm 1 A Baum-Welch type algorithm for provable hierarchical imitation learning
0:  Observation sequence {s1:T,a1:T}\{s_{1:T},a_{1:T}\}; a probability mass function μ(o0|s1)\mu(o_{0}|s_{1}) on o0𝒪o_{0}\in\mathcal{O}; N+N\in\mathbb{N}_{+}; θ(0)Θ\theta^{(0)}\in\mathit{\Theta}.
1:  for n=1,,Nn=1,\ldots,N do
2:     Compute the forward message {αμ,tθ(n1)}t=1T\{\alpha^{\theta^{(n-1)}}_{\mu,t}\}_{t=1}^{T} and the backward message {βt|Tθ(n1)}t=1T\{\beta^{\theta^{(n-1)}}_{t|T}\}_{t=1}^{T} according to (1), (2), (3) and (4).
3:     Compute the smoothing distributions {γμ,t|Tθ(n1)}t=1T\{\gamma^{\theta^{(n-1)}}_{\mu,t|T}\}_{t=1}^{T} and {γ~μ,t|Tθ(n1)}t=2T\{\tilde{\gamma}^{\theta^{(n-1)}}_{\mu,t|T}\}_{t=2}^{T} according to (5) and (6).
4:     Update the parameter estimate θ(n)argmaxθΘQμ,T(θ|θ(n1))\theta^{(n)}\in\operatorname*{arg\,max}_{\theta\in\mathit{\Theta}}Q_{\mu,T}(\theta|\theta^{(n-1)}) according to (7).
5:  end for

Since our graphical model resembles an HMM, Algorithm 1 is intuitively similar to the classical Baum-Welch algorithm [5] for HMM parameter inference. Analogously, it iterates between forward-backward smoothing and parameter update. In each iteration, the algorithm first estimates certain marginal distributions of the latent variables (O1:T,B1:T)(O_{1:T},B_{1:T}) conditioned on the observation sequence {s1:T,a1:T}\{s_{1:T},a_{1:T}\}, assuming the current estimate of θ\theta is correct. Such conditional distributions are named as smoothing distributions, and they are used to compute the QQ-function, which is a surrogate of the likelihood function. The next estimate of θ\theta is assigned as one of the maximizing arguments of the QQ-function.

From the structure of our graphical model, a prior distribution of (O0,S1)(O_{0},S_{1}) is required to compute the smoothing distributions. Since the true prior distribution ν\nu^{*} is unknown, ν^\hat{\nu}, defined next, is used as its approximation: o0𝒪\forall o_{0}\in\mathcal{O}, ν^(o0,s1):=μ(o0|s1)\hat{\nu}(o_{0},s_{1})\mathrel{\mathop{:}}=\mu(o_{0}|s_{1}); s1s1\forall s^{\prime}_{1}\neq s_{1}, ν^(o0,s1):=0\hat{\nu}(o_{0},s^{\prime}_{1})\mathrel{\mathop{:}}=0. Theorem 2 shows that the additional estimation error introduced by this approximation vanishes as TT\rightarrow\infty, regardless of the choice of μ\mu. Let \mathcal{M} be the set of μ\mu allowed by Algorithm 1.

3.1 Latent variable estimation

In the following, we define the forward message, the backward message and the smoothing distribution for all θ\theta, μ\mu and all t[1:T]t\in[1:T]. All of these quantities are probability mass functions over 𝒪×𝒮\mathcal{O}\times\mathcal{S}, and normalizing constants zα,μ,tθz_{\alpha,\mu,t}^{\theta}, zβ,tθz_{\beta,t}^{\theta} and zγ,μθz_{\gamma,\mu}^{\theta} are adopted to enforce this. With any input arguments oto_{t} and btb_{t}, the forward message is defined as

αμ,tθ(ot,bt):=zα,μ,tθ𝔼O0μ(|s1)[θ,O0,s1(S2:t=s2:t,A1:t=a1:t,Ot=ot,Bt=bt)].\alpha^{\theta}_{\mu,t}(o_{t},b_{t})\mathrel{\mathop{:}}=z_{\alpha,\mu,t}^{\theta}\mathbb{E}_{O_{0}\sim\mu(\cdot|s_{1})}[\mathbb{P}_{\theta,O_{0},s_{1}}(S_{2:t}=s_{2:t},A_{1:t}=a_{1:t},O_{t}=o_{t},B_{t}=b_{t})].

On the LHS, the dependency on {s1:T,a1:T}\{s_{1:T},a_{1:T}\} is omitted for a cleaner notation. By convention, αμ,1θ\alpha^{\theta}_{\mu,1} is equivalent to

αμ,1θ(o1,b1)=zα,μ,1θ𝔼O0μ(|s1)[θ,O0,s1(A1=a1,O1=o1,B1=b1)].\alpha^{\theta}_{\mu,1}(o_{1},b_{1})=z_{\alpha,\mu,1}^{\theta}\mathbb{E}_{O_{0}\sim\mu(\cdot|s_{1})}[\mathbb{P}_{\theta,O_{0},s_{1}}(A_{1}=a_{1},O_{1}=o_{1},B_{1}=b_{1})].

The backward message is defined as

βt|Tθ(ot,bt):=zβ,tθθ,o0,s1(St+1:T=st+1:T,At+1:T=at+1:T|St=st,At=at,Ot=ot,Bt=bt),\beta^{\theta}_{t|T}(o_{t},b_{t})\mathrel{\mathop{:}}=z_{\beta,t}^{\theta}\mathbb{P}_{\theta,o_{0},s_{1}}(S_{t+1:T}=s_{t+1:T},A_{t+1:T}=a_{t+1:T}|S_{t}=s_{t},A_{t}=a_{t},O_{t}=o_{t},B_{t}=b_{t}),

where the value of o0o_{0} on the RHS is arbitrary. By convention, the boundary condition is

βT|Tθ(oT,bT)=(2|𝒪|)1.\beta^{\theta}_{T|T}(o_{T},b_{T})=(2\left|\mathcal{O}\right|)^{-1}. (1)

The smoothing distribution is defined as

γμ,t|Tθ(ot,bt):=zγ,μθ𝔼O0μ(|s1)[θ,O0,s1(S2:T=s2:T,A1:T=a1:T,Ot=ot,Bt=bt)].\gamma^{\theta}_{\mu,t|T}(o_{t},b_{t})\mathrel{\mathop{:}}=z_{\gamma,\mu}^{\theta}\mathbb{E}_{O_{0}\sim\mu(\cdot|s_{1})}[\mathbb{P}_{\theta,O_{0},s_{1}}(S_{2:T}=s_{2:T},A_{1:T}=a_{1:T},O_{t}=o_{t},B_{t}=b_{t})].

It can be easily verified that the normalizing constant does not depend on tt.

Finally, for all θ\theta, μ\mu and t[2:T]t\in[2:T], with any input arguments ot1o_{t-1} and btb_{t}, we define the two-step smoothing distribution as

γ~μ,t|Tθ(ot1,bt):=zγ,μθ𝔼O0μ(|s1)[θ,O0,s1(S2:T=s2:T,A1:T=a1:T,Ot1=ot1,Bt=bt)],\tilde{\gamma}^{\theta}_{\mu,t|T}(o_{t-1},b_{t})\mathrel{\mathop{:}}=z_{\gamma,\mu}^{\theta}\mathbb{E}_{O_{0}\sim\mu(\cdot|s_{1})}[\mathbb{P}_{\theta,O_{0},s_{1}}(S_{2:T}=s_{2:T},A_{1:T}=a_{1:T},O_{t-1}=o_{t-1},B_{t}=b_{t})],

where zγ,μθz_{\gamma,\mu}^{\theta} is the same normalizing constant as the one for the smoothing distribution γμ,t|Tθ\gamma^{\theta}_{\mu,t|T}.

The quantities above can be computed using the forward-backward recursion. For simplicity, we omit normalizing constants by using the proportional symbol \propto. The proof is deferred to Appendix B.2.

Theorem 1 (Forward-backward smoothing).

For all θΘ\theta\in\mathit{\Theta} and μ\mu\in\mathcal{M}, with any input arguments on the LHS,

  1. 1.

    (Forward recursion) t[2:T]\forall t\in[2:T],

    αμ,tθ(ot,bt)ot1,bt1πb(bt|st,ot1;θb)π¯hi(ot|st,ot1,bt;θhi)πlo(at|st,ot;θlo)αμ,t1θ(ot1,bt1).\alpha^{\theta}_{\mu,t}(o_{t},b_{t})\propto\sum_{o_{t-1},b_{t-1}}\pi_{b}(b_{t}|s_{t},o_{t-1};\theta_{b})\bar{\pi}_{hi}(o_{t}|s_{t},o_{t-1},b_{t};\theta_{hi})\pi_{lo}(a_{t}|s_{t},o_{t};\theta_{lo})\alpha^{\theta}_{\mu,t-1}(o_{t-1},b_{t-1}). (2)

    When t=1t=1,

    αμ,1θ(o1,b1)𝔼O0μ(|s1)[πb(b1|s1,O0;θb)π¯hi(o1|s1,O0,b1;θhi)πlo(a1|s1,o1;θlo)].\alpha^{\theta}_{\mu,1}(o_{1},b_{1})\propto\mathbb{E}_{O_{0}\sim\mu(\cdot|s_{1})}[\pi_{b}(b_{1}|s_{1},O_{0};\theta_{b})\bar{\pi}_{hi}(o_{1}|s_{1},O_{0},b_{1};\theta_{hi})\pi_{lo}(a_{1}|s_{1},o_{1};\theta_{lo})]. (3)
  2. 2.

    (Backward recursion) t[1:T1]\forall t\in[1:T-1],

    βt|Tθ(ot,bt)ot+1,bt+1πb(bt+1|st+1,ot;θb)π¯hi(ot+1|st+1,ot,bt+1;θhi)×πlo(at+1|st+1,ot+1;θlo)βt+1|Tθ(ot+1,bt+1).\beta^{\theta}_{t|T}(o_{t},b_{t})\propto\sum_{o_{t+1},b_{t+1}}\pi_{b}(b_{t+1}|s_{t+1},o_{t};\theta_{b})\bar{\pi}_{hi}(o_{t+1}|s_{t+1},o_{t},b_{t+1};\theta_{hi})\\ \times\pi_{lo}(a_{t+1}|s_{t+1},o_{t+1};\theta_{lo})\beta^{\theta}_{t+1|T}(o_{t+1},b_{t+1}). (4)
  3. 3.

    (Smoothing) t[1:T]\forall t\in[1:T],

    γμ,t|Tθ(ot,bt)αμ,tθ(ot,bt)βt|Tθ(ot,bt).\gamma^{\theta}_{\mu,t|T}(o_{t},b_{t})\propto\alpha^{\theta}_{\mu,t}(o_{t},b_{t})\beta^{\theta}_{t|T}(o_{t},b_{t}). (5)
  4. 4.

    (Two-step smoothing) t[2:T]\forall t\in[2:T],

    γ~μ,t|Tθ(ot1,bt)πb(bt|st,ot1;θb)[otπ¯hi(ot|st,ot1,bt;θhi)πlo(at|st,ot;θlo)βt|Tθ(ot,bt)]×[bt1αμ,t1θ(ot1,bt1)].\tilde{\gamma}^{\theta}_{\mu,t|T}(o_{t-1},b_{t})\propto\pi_{b}(b_{t}|s_{t},o_{t-1};\theta_{b})\left[\sum_{o_{t}}\bar{\pi}_{hi}(o_{t}|s_{t},o_{t-1},b_{t};\theta_{hi})\pi_{lo}(a_{t}|s_{t},o_{t};\theta_{lo})\beta^{\theta}_{t|T}(o_{t},b_{t})\right]\\ \times\left[\sum_{b_{t-1}}\alpha^{\theta}_{\mu,t-1}(o_{t-1},b_{t-1})\right]. (6)

3.2 Parameter update

For all θ,θΘ\theta,\theta^{\prime}\in\mathit{\Theta} and μ\mu\in\mathcal{M}, the (finite sample) QQ-function is defined as

Qμ,T(θ|θ):=1T{t=2Tot1,btγ~μ,t|Tθ(ot1,bt)[logπb(bt|st,ot1;θb)]+t=1Tot,btγμ,t|Tθ(ot,bt)×[logπlo(at|st,ot;θlo)]+t=1Totγμ,t|Tθ(ot,bt=1)[logπhi(ot|st;θhi)]}.Q_{\mu,T}(\theta^{\prime}|\theta)\mathrel{\mathop{:}}=\frac{1}{T}\Bigg{\{}\sum_{t=2}^{T}\sum_{o_{t-1},b_{t}}\tilde{\gamma}^{\theta}_{\mu,t|T}(o_{t-1},b_{t})\left[\log\pi_{b}(b_{t}|s_{t},o_{t-1};\theta^{\prime}_{b})\right]+\sum_{t=1}^{T}\sum_{o_{t},b_{t}}\gamma^{\theta}_{\mu,t|T}(o_{t},b_{t})\\ \times[\log\pi_{lo}(a_{t}|s_{t},o_{t};\theta^{\prime}_{lo})]+\sum_{t=1}^{T}\sum_{o_{t}}\gamma^{\theta}_{\mu,t|T}(o_{t},b_{t}=1)[\log\pi_{hi}(o_{t}|s_{t};\theta^{\prime}_{hi})]\Bigg{\}}. (7)

The parameter update is performed as θ(n)argmaxθΘQμ,T(θ|θ(n1))\theta^{(n)}\in\operatorname*{arg\,max}_{\theta\in\mathit{\Theta}}Q_{\mu,T}(\theta|\theta^{(n-1)}), which may not be unique. Since Θ\mathit{\Theta} is compact and Qμ,T(θ|θ)Q_{\mu,T}(\theta^{\prime}|\theta) is continuous with respect to θ\theta^{\prime}, the maximization is well-posed. Note that our definition of Qμ,T(θ|θ)Q_{\mu,T}(\theta^{\prime}|\theta) is an approximation of the standard definition of QQ-function in the EM literature. See Appendix B.3 for a detailed discussion.

3.3 Generalization to continuous spaces

Although we require finite state and action space for our theoretical analysis, Algorithm 1 can be readily generalized to continuous 𝒮\mathcal{S} and 𝒜\mathcal{A}: we only need to replace πlo\pi_{lo} by a density function. However, generalization to continuous option space requires a substantially different algorithm. The forward-backward smoothing procedure in Theorem 1 involves integrals rather than sums, and Sequential Monte Carlo (SMC) techniques need to be applied. Fortunately, it is widely accepted that a finite option space is reasonable in the options framework, since the options need to be distinct and separate [9].

4 Performance guarantee

Our analysis of Algorithm 1 has the following structure. We first prove the stochastic convergence of the QQ-function Qμ,T(θ|θ)Q_{\mu,T}(\theta^{\prime}|\theta) to a population QQ-function Q¯(θ|θ)\bar{Q}(\theta^{\prime}|\theta), leading to a well-posed definition of the population version algorithm. This step is our major theoretical contribution. With additional assumptions, the first-order stability condition is constructed, and techniques in [3] can be applied to show the convergence of the population version algorithm. The remaining step is to analyze Algorithm 1 as a perturbed form of its population version, which requires a high probability bound on the distance between their parameter updates. We can establish the strong consistency of the parameter update of Algorithm 1 as an estimator of the parameter update of the population version algorithm. Therefore, the existence of such a high probability bound can be proved for large enough TT. However, the analytical expression of this bound requires knowledge of the specific parameterization of {π¯hi,πlo,πb}\{\bar{\pi}_{hi},\pi_{lo},\pi_{b}\}, which is not available in this general context of discussion.

Concretely, we first analyze the asymptotic behavior of the QQ-function Qμ,T(θ|θ)Q_{\mu,T}(\theta^{\prime}|\theta) as TT\rightarrow\infty. From Assumption 3, the observation sequence {s1:T,a1:T}\{s_{1:T},a_{1:T}\} is generated from a stationary Markov chain {Xt}t=1:={St,At,Ot,Bt}t=1\{X_{t}\}_{t=1}^{\infty}\mathrel{\mathop{:}}=\{S_{t},A_{t},O_{t},B_{t}\}_{t=1}^{\infty}. Let 𝒳=𝒮×𝒜×𝒪×{0,1}\mathcal{X}=\mathcal{S}\times\mathcal{A}\times\mathcal{O}\times\{0,1\} be its state space. Using Kolmogorov’s extension theorem, we can extend this one-sided Markov chain to the index set \mathbb{Z} and define a unique probability measure θ,ν\mathbb{P}_{\theta^{*},\nu^{*}} over the sample space 𝒳\mathcal{X}^{\mathbb{Z}}. Any observation sequence {s1:T,a1:T}\{s_{1:T},a_{1:T}\} can be regarded as a segment of an infinite length sample path ω𝒳\omega\in\mathcal{X}^{\mathbb{Z}}. Therefore, if the observation sequence is not specified, Qμ,T(θ|θ)Q_{\mu,T}(\theta^{\prime}|\theta) is a random variable with underlying probability measure θ,ν\mathbb{P}_{\theta^{*},\nu^{*}}.

One caveat is that the definition of Qμ,T(θ|θ)Q_{\mu,T}(\theta^{\prime}|\theta) from Section 3 fails for some ω𝒳\omega\in\mathcal{X}^{\mathbb{Z}}. To fix this issue, define the set of proper sample paths as

Ω={ω𝒳;P(st+1|st,at)>0,t}.\Omega=\left\{\omega\in\mathcal{X}^{\mathbb{Z}};P(s_{t+1}|s_{t},a_{t})>0,\forall t\in\mathbb{Z}\right\}. (8)

Note that θ,ν(Ω)=1\mathbb{P}_{\theta^{*},\nu^{*}}(\Omega)=1; therefore, working on Ω\Omega is probabilistically equivalent to working on 𝒳\mathcal{X}^{\mathbb{Z}}. For all ωΩ\omega\in\Omega, Qμ,T(θ|θ)Q_{\mu,T}(\theta^{\prime}|\theta) follows the definition from Section 3; for other sample paths, Qμ,T(θ|θ)Q_{\mu,T}(\theta^{\prime}|\theta) is defined arbitrarily. In this way, Qμ,T(θ|θ)Q_{\mu,T}(\theta^{\prime}|\theta) becomes a well-defined random variable. Its stochastic convergence is characterized in the following theorem.

Theorem 2 (The stochastic convergence of the QQ-function).

With Assumption 1, 2 and 3, there exists a real-valued function Q¯(θ|θ)\bar{Q}(\theta^{\prime}|\theta) defined on the domain θΘ~\theta^{\prime}\in\tilde{\mathit{\Theta}} and θΘ\theta\in\mathit{\Theta} such that

  1. 1.

    For all θΘ\theta\in\mathit{\Theta}, Q¯(θ|θ)\bar{Q}(\theta^{\prime}|\theta) is continuously differentiable with respect to θΘ~\theta^{\prime}\in\tilde{\mathit{\Theta}}. Moreover, the set argmaxθΘQ¯(θ|θ)\operatorname*{arg\,max}_{\theta^{\prime}\in\mathit{\Theta}}\bar{Q}(\theta^{\prime}|\theta) is nonempty.

  2. 2.

    As TT\rightarrow\infty,

    supθ,θΘsupμ|Qμ,T(θ|θ;ω)Q¯(θ|θ)|0,Pθ,ν-a.s.\sup_{\theta,\theta^{\prime}\in\mathit{\Theta}}\sup_{\mu\in\mathcal{M}}\left|Q_{\mu,T}(\theta^{\prime}|\theta;\omega)-\bar{Q}(\theta^{\prime}|\theta)\right|\rightarrow 0,~{}P_{\theta^{*},\nu^{*}}\text{-a.s.}

We name Q¯(θ|θ)\bar{Q}(\theta^{\prime}|\theta) as the population QQ-function. The analytical expressions of Q¯(θ|θ)\bar{Q}(\theta^{\prime}|\theta) and Q¯(θ|θ)\nabla\bar{Q}(\theta^{\prime}|\theta) are provided in Appendix C.2, where the complete version of the above theorem (Theorem 7) is proved. In the following, we provide a high level sketch of the main idea.

Proof Sketch.

The main difficulty of the proof is that, Qμ,T(θ|θ)Q_{\mu,T}(\theta^{\prime}|\theta) defined in (7) is (roughly) the average of TT terms, with each term dependent on the entire observation sequence; as TT\rightarrow\infty, all the terms keep changing such that the law of large numbers cannot be applied directly. As a solution, we approximate γμ,t|Tθ\gamma^{\theta}_{\mu,t|T} and γ~μ,t|Tθ\tilde{\gamma}^{\theta}_{\mu,t|T} with smoothing distributions in an infinitely extended graphical model independent of TT, resulting in an approximated QQ-function (still depends on TT). The techniques adopted in this step are analogous to Markovian decomposition and uniform forgetting in the HMM literature [8, 37]. The limiting behavior of the approximated QQ-function is the same as that of Qμ,T(θ|θ)Q_{\mu,T}(\theta^{\prime}|\theta), since their difference vanishes as TT\rightarrow\infty. For the approximated QQ-function, we can apply the ergodic theorem since the smoothing distributions no longer depend on TT. ∎

The population version of Algorithm 1 has parameter updates θ(n)argmaxθΘQ¯(θ|θ(n1))\theta^{(n)}\in\operatorname*{arg\,max}_{\theta\in\mathit{\Theta}}\bar{Q}(\theta|\theta^{(n-1)}). To characterize the local convergence of Algorithm 1 and its population version, we impose the following assumptions for the remainder of Section 4.

Assumption 4 (Strong concavity).

There exists λ>0\lambda>0 such that for all θ1,θ2Θ\theta_{1},\theta_{2}\in\mathit{\Theta},

Q¯(θ1|θ)Q¯(θ2|θ)Q¯(θ2|θ),θ1θ2λ2θ1θ222.\bar{Q}(\theta_{1}|\theta^{*})-\bar{Q}(\theta_{2}|\theta^{*})-\langle\nabla\bar{Q}(\theta_{2}|\theta^{*}),\theta_{1}-\theta_{2}\rangle\leq-\frac{\lambda}{2}\left\|{\theta_{1}-\theta_{2}}\right\|_{2}^{2}.

For any r>0r>0, let Θr:={θ;θΘ,θθ2r}\mathit{\Theta}_{r}\mathrel{\mathop{:}}=\{\theta;\theta\in\mathit{\Theta},\|{\theta-\theta^{*}}\|_{2}\leq r\}.

Assumption 5 (Additional local assumptions).

There exists r>0r>0 such that

  1. 1.

    (Identifiability) For all θΘr\theta\in\mathit{\Theta}_{r}, the set argmaxθΘQ¯(θ|θ)\operatorname*{arg\,max}_{\theta^{\prime}\in\mathit{\Theta}}\bar{Q}(\theta^{\prime}|\theta) has a unique element M¯(θ)\bar{M}(\theta). Moreover, for all ε>0\varepsilon>0, with the convention that supθQ¯(θ|θ)=\sup_{\theta^{\prime}\in\varnothing}\bar{Q}(\theta^{\prime}|\theta)=-\infty, we have

    infθΘr[Q¯(M¯(θ)|θ)supθΘ;θM¯(θ)2εQ¯(θ|θ)]>0.\inf_{\theta\in\mathit{\Theta}_{r}}\bigg{[}\bar{Q}(\bar{M}(\theta)|\theta)-\sup_{\theta^{\prime}\in\mathit{\Theta};\|{\theta^{\prime}-\bar{M}(\theta)}\|_{2}\geq\varepsilon}\bar{Q}(\theta^{\prime}|\theta)\bigg{]}>0.
  2. 2.

    (Uniqueness of finite sample parameter updates) For all θΘr\theta\in\mathit{\Theta}_{r}, T2T\geq 2 and μ\mu\in\mathcal{M}, Pθ,νP_{\theta^{*},\nu^{*}}-almost surely, the set argmaxθΘQμ,T(θ|θ;ω)\operatorname*{arg\,max}_{\theta^{\prime}\in\mathit{\Theta}}Q_{\mu,T}(\theta^{\prime}|\theta;\omega) has a unique element Mμ,T(θ;ω)M_{\mu,T}(\theta;\omega).

On the additional assumptions.

In Assumption 4, we require the strong concavity of Q¯(|θ)\bar{Q}(\cdot|\theta^{*}) over the entire parameter space since the maximization step in our algorithm is global. Such a requirement could be avoided: if the maximization step is replaced by a gradient update (Gradient EM), then Q¯(|θ)\bar{Q}(\cdot|\theta^{*}) only needs to be strongly concave in a small region around θ\theta^{*}. The price to pay is to assume knowledge on structural constants of Q¯(|θ)\bar{Q}(\cdot|\theta^{*}) (Lipschitz constant and strong concavity constant). See [3] for an analysis of the gradient EM algorithm.

Nonetheless, we expect the following to hold in certain cases of tabular parameterization: for all θΘ\theta\in\mathit{\Theta}, the function Q¯(|θ)\bar{Q}(\cdot|\theta) is strongly concave over Θ\mathit{\Theta} (see the end of Appendix C.2). From this condition, Assumption 4 and 5.1 directly follow. Assumption 5.2 holds as well; in fact, it is a quite mild assumption due to the sample-based nature of Qμ,T(θ|θ;ω)Q_{\mu,T}(\theta^{\prime}|\theta;\omega).

The next step is to characterize the convergence of the population version algorithm.

Theorem 3 (Convergence of the population version algorithm).

With all the assumptions,

  1. 1.

    (First-order stability) There exists γ>0\gamma>0 such that for all θΘr\theta\in\mathit{\Theta}_{r},

    Q¯(M¯(θ)|θ)Q¯(M¯(θ)|θ)2γθθ2.\left\|{\nabla\bar{Q}(\bar{M}(\theta)|\theta)-\nabla\bar{Q}(\bar{M}(\theta)|\theta^{*})}\right\|_{2}\leq\gamma\left\|{\theta-\theta^{*}}\right\|_{2}.
  2. 2.

    (Contraction) Let κ=γ/λ\kappa=\gamma/\lambda. For all θΘr\theta\in\mathit{\Theta}_{r},

    M¯(θ)θ2κθθ2.\left\|{\bar{M}(\theta)-\theta^{*}}\right\|_{2}\leq\kappa\left\|{\theta-\theta^{*}}\right\|_{2}.

    If κ<1\kappa<1, the population version algorithm converges linearly to the true parameter θ\theta^{*}.

The proof is given in Appendix C.3, where we also show an upper bound on γ\gamma. The idea mirrors that of [3, Theorem 4] with problem-specific modifications. Algorithm 1 can be regarded as a perturbed form of this population version algorithm, with convergence characterized in the following theorem.

Theorem 4 (Performance guarantee for Algorithm 1).

With all the assumptions, if κ<1\kappa<1 we have

  1. 1.

    For all Δ(0,(1κ)r]\Delta\in(0,(1-\kappa)r] and q(0,1)q\in(0,1), there exists T¯(Δ,q)+\underline{T}(\Delta,q)\in\mathbb{N}_{+} such that the following statement is true. If the observation length TT¯(Δ,q)T\geq\underline{T}(\Delta,q), then with probability at least 1q1-q,

    supθΘrsupμMμ,T(θ;ω)M¯(θ)2Δ.\sup_{\theta\in\mathit{\Theta}_{r}}\sup_{\mu\in\mathcal{M}}\left\|{M_{\mu,T}(\theta;\omega)-\bar{M}(\theta)}\right\|_{2}\leq\Delta.
  2. 2.

    If TT¯(Δ,q)T\geq\underline{T}(\Delta,q), Algorithm 1 with any μ\mu\in\mathcal{M} has the following performance guarantee. If θ(0)Θr\theta^{(0)}\in\mathit{\Theta}_{r}, then with probability at least 1q1-q, for all n+n\in\mathbb{N}_{+},

    θ(n)θ2κnθ(0)θ2+(1κ)1Δ.\|{\theta^{(n)}-\theta^{*}}\|_{2}\leq\kappa^{n}\|{\theta^{(0)}-\theta^{*}}\|_{2}+(1-\kappa)^{-1}\Delta.

The proof is provided in Appendix C.4. Essentially, we use Theorem 2 to show the uniform (in θ\theta and μ\mu) strong consistency of Mμ,T(θ;ω)M_{\mu,T}(\theta;\omega) as an estimator of M¯(θ)\bar{M}(\theta), following the standard analysis of M-estimators. A direct corollary of this argument is the high probability bound on the difference between Mμ,T(θ;ω)M_{\mu,T}(\theta;\omega) and M¯(θ)\bar{M}(\theta), as shown in the first part of the theorem. Combining this high probability bound with Theorem 3 and [3, Theorem 5] yields the final performance guarantee.

Theorem 4 has two practical implications. First, under regularity conditions, with large enough TT, Algorithm 1 can converge with arbitrarily high probability to an arbitrarily small norm ball around the true parameter. In other words, with enough samples, the EM approach can recover the true parameter of the expert policy arbitrarily well. Second, the estimation error (upper bound) decreases exponentially in the initial phase of the algorithm. In this regard, a practitioner can allocate his computational budget accordingly.

One limitation of our analysis is that the condition κ<1\kappa<1 is hard to verify for a practical parameterization of the expert policy. This is typical in the theory of EM algorithms: even in the case of i.i.d. samples, characterizing the contraction coefficient is intractable except for a few simple parametric models. Nonetheless, such a condition strengthens our intuition on when the EM approach to HIL works: Q¯(θ|θ)\bar{Q}(\theta^{\prime}|\theta) should have a large curvature with respect to θ\theta^{\prime}, and the function should not change much with respect to θ\theta around θ\theta^{*}. In the next section, we present a numerical example to qualitatively demonstrate our result.

5 Numerical example

In this section, we qualitatively demonstrate our theoretical result through an example. Here, we value clarity over completeness, therefore large-scale experiments are deferred to future works.

Refer to caption
Figure 2: The state transition structure.

Consider the Markov Decision Process (MDP) illustrated in Figure 2. There are four states, numbered from left to right as 1 to 4. At any state st[1:4]s_{t}\in[1:4], there are two allowable actions: LEFT and RIGHT. If at=RIGHTa_{t}=\textrm{RIGHT}, then the next state is sampled uniformly from the states on the right of state sts_{t} (including sts_{t} itself). Symmetrically, if at=LEFTa_{t}=\textrm{LEFT}, then the next state is sampled uniformly from the states on the left of state sts_{t} (including sts_{t}).

Suppose an expert applies the following options with failure policy with parameters (θhi,θlo,θb)=(0.6,0.7,0.8)(\theta^{*}_{hi},\theta^{*}_{lo},\theta^{*}_{b})=(0.6,0.7,0.8) and ζ=0.1\zeta=0.1. The option space has two elements: LEFTEND and RIGHTEND. πhi(ot=LEFTEND|st;θhi)\pi_{hi}(o_{t}=\textrm{LEFTEND}|s_{t};\theta_{hi}) equals θhi\theta_{hi} if st=1,2s_{t}=1,2, and 1θhi1-\theta_{hi} if st=3,4s_{t}=3,4. For all sts_{t}, πlo(at=LEFT|st,ot=LEFTEND;θlo)=πlo(at=RIGHT|st,ot=RIGHTEND;θlo)=θlo\pi_{lo}(a_{t}=\textrm{LEFT}|s_{t},o_{t}=\textrm{LEFTEND};\theta_{lo})=\pi_{lo}(a_{t}=\textrm{RIGHT}|s_{t},o_{t}=\textrm{RIGHTEND};\theta_{lo})=\theta_{lo}. πb(bt=1|st,ot=LEFTEND;θb)\pi_{b}(b_{t}=1|s_{t},o_{t}=\textrm{LEFTEND};\theta_{b}) equals θb\theta_{b} if st=1s_{t}=1, and 1θb1-\theta_{b} otherwise. Symmetrically, πb(bt=1|st,ot=RIGHTEND;θb)\pi_{b}(b_{t}=1|s_{t},o_{t}=\textrm{RIGHTEND};\theta_{b}) equals θb\theta_{b} if st=4s_{t}=4, and 1θb1-\theta_{b} otherwise. Intuitively, the high level policy directs the agent to states 1 and 4, and the option terminates with high probability when the corresponding target state is reached.

In our experiment, the parameter spaces Θhi\mathit{\Theta}_{hi}, Θlo\mathit{\Theta}_{lo} and Θb\mathit{\Theta}_{b} are all equal to the interval [0.1,0.9][0.1,0.9]. The initial parameter estimate (θhi(0),θlo(0),θb(0))=(0.5,0.6,0.7)(\theta^{(0)}_{hi},\theta^{(0)}_{lo},\theta^{(0)}_{b})=(0.5,0.6,0.7). For all s1s_{1}, μ(o0=RIGHTEND|s1)=1\mu(o_{0}=\textrm{RIGHTEND}|s_{1})=1.

We investigate the behavior of θ(n)θ2\|{\theta^{(n)}-\theta^{*}}\|_{2} as a random variable dependent on nn and TT. 50 sample paths of length TT are sampled from (approximately) the stationary Markov chain induced by the expert policy, with T{5000,8000,10000}T\in\{5000,8000,10000\}. After running Algorithm 1 with any sample path ω\omega and any TT, we obtain a sequence {θ(n)θ2;ω,T}n[0:N]\{\|{\theta^{(n)}-\theta^{*}}\|_{2};\omega,T\}_{n\in[0:N]}. Let err(n,T)err(n,T) be the average of θ(n)θ2\|{\theta^{(n)}-\theta^{*}}\|_{2} for fixed nn and TT, over the 50 sample paths. The result is shown in Figure 3.

Refer to caption
Figure 3: Plots of err(n,T)err(n,T) and logerr(n,T)\log err(n,T) with varying nn and TT.

Assumption 1, 2, 3 and 5.2 hold in this example, and we speculate that Assumption 4 and 5.1 hold as well. The condition κ<1\kappa<1 cannot be verified, but the empirical result exhibits patterns consistent with the performance guarantee, even though rigorously Theorem 4 is not applicable. First, err(n,T)err(n,T) decreases exponentially in the early phase of the algorithm. Second, as TT increases, Algorithm 1 achieves better performance.

An observation is worth mentioning as a separate note: for n>300n>300, err(n,T)err(n,T) first slightly increases, then levels off. This is due to the parameter estimate on some sample paths converging to bad stationary points of the finite sample likelihood function, which suggests that early stopping could be helpful in practice. Omitted details and additional experiments are provided in Appendix E, where we also investigate, for example, the effect of μ\mu and random initialization on the performance of Algorithm 1.

6 Conclusions

In this paper, we investigate the EM approach to HIL from a theoretical perspective. We prove that under regularity conditions, the proposed algorithm converges with high probability to a norm ball around the true parameter. To our knowledge, this is the first performance guarantee for an HIL algorithm that only observes primitive state-action pairs. Future works could further investigate the practical performance of this approach, especially its scalability in complicated environments.

Acknowledgements

We thank the anonymous reviewers for their constructive comments. Z.Z. thanks Tianrui Chen for helpful discussions. The research was partially supported by the NSF under grants DMS-1664644, CNS-1645681, and IIS-1914792, by the ONR under grant N00014-19-1-2571, by the NIH under grants R01 GM135930 and UL54 TR004130, and by the DOE under grant DE-AR-0001282.

References

  • Anandkumar et al. [2014] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15(80):2773–2832, 2014.
  • Bacon et al. [2017] P.-L. Bacon, J. Harb, and D. Precup. The option-critic architecture. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 1726–1734, 2017.
  • Balakrishnan et al. [2017] S. Balakrishnan, M. J. Wainwright, and B. Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. The Annals of Statistics, 45(1):77–120, 2017.
  • Barto and Mahadevan [2003] A. G. Barto and S. Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete event dynamic systems, 13(1-2):41–77, 2003.
  • Baum et al. [1970] L. E. Baum, T. Petrie, G. Soules, and N. Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The annals of mathematical statistics, 41(1):164–171, 1970.
  • Brunskill and Li [2014] E. Brunskill and L. Li. PAC-inspired option discovery in lifelong reinforcement learning. In Proceedings of the 31st International Conference on Machine Learning, pages 316–324, 2014.
  • Butterfield et al. [2010] J. Butterfield, S. Osentoski, G. Jay, and O. C. Jenkins. Learning from demonstration using a multi-valued function regressor for time-series data. In 2010 10th IEEE-RAS International Conference on Humanoid Robots, pages 328–333. IEEE, 2010.
  • Cappé et al. [2006] O. Cappé, E. Moulines, and T. Rydén. Inference in hidden Markov models. Springer Science & Business Media, 2006.
  • Daniel et al. [2016a] C. Daniel, G. Neumann, O. Kroemer, and J. Peters. Hierarchical relative entropy policy search. Journal of Machine Learning Research, 17(1):3190–3239, 2016a.
  • Daniel et al. [2016b] C. Daniel, H. Van Hoof, J. Peters, and G. Neumann. Probabilistic inference for determining options in reinforcement learning. Machine Learning, 104(2-3):337–357, 2016b.
  • Daskalakis et al. [2017] C. Daskalakis, C. Tzamos, and M. Zampetakis. Ten steps of EM suffice for mixtures of two Gaussians. In Conference on Learning Theory, pages 704–710, 2017.
  • Davidson [1994] J. Davidson. Stochastic limit theory: An introduction for econometricians. OUP Oxford, 1994.
  • De Castro et al. [2017] Y. De Castro, E. Gassiat, and S. Le Corff. Consistent estimation of the filtering and marginal smoothing distributions in nonparametric hidden Markov models. IEEE Transactions on Information Theory, 63(8):4758–4777, 2017.
  • Dempster et al. [1977] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.
  • Dwivedi et al. [2018a] R. Dwivedi, N. Ho, K. Khamaru, M. I. Jordan, M. J. Wainwright, and B. Yu. Singularity, misspecification, and the convergence rate of EM. arXiv preprint arXiv:1810.00828, 2018a.
  • Dwivedi et al. [2018b] R. Dwivedi, N. Ho, K. Khamaru, M. J. Wainwright, and M. I. Jordan. Theoretical guarantees for EM under mis-specified gaussian mixture models. In Advances in Neural Information Processing Systems 31, pages 9681–9689, 2018b.
  • Fox et al. [2017] R. Fox, S. Krishnan, I. Stoica, and K. Goldberg. Multi-level discovery of deep options. arXiv preprint arXiv:1703.08294, 2017.
  • Fruit and Lazaric [2017] R. Fruit and A. Lazaric. Exploration–exploitation in MDPs with options. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 576–584, 2017.
  • Fruit et al. [2017] R. Fruit, M. Pirotta, A. Lazaric, and E. Brunskill. Regret minimization in MDPs with options without prior knowledge. In Advances in Neural Information Processing Systems 30, pages 3166–3176, 2017.
  • Hairer [2006] M. Hairer. Ergodic properties of Markov processes. Unpublished lecture notes, 2006. URL http://www.hairer.org/notes/Markov.pdf.
  • Hsu et al. [2012] D. Hsu, S. M. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences, 78(5):1460–1480, 2012.
  • Jain and Kar [2017] P. Jain and P. Kar. Non-convex optimization for machine learning. Foundations and Trends® in Machine Learning, 10(3-4):142–336, 2017.
  • Kontorovich et al. [2013] A. Kontorovich, B. Nadler, and R. Weiss. On learning parametric-output HMMs. In Proceedings of the 30th International Conference on Machine Learning, pages 702–710, 2013.
  • Krishnan et al. [2017] S. Krishnan, R. Fox, I. Stoica, and K. Goldberg. DDCO: Discovery of deep continuous options for robot learning from demonstrations. In Conference on Robot Learning, pages 418–437, 2017.
  • Kulkarni et al. [2016] T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems 29, pages 3675–3683, 2016.
  • Le et al. [2018] H. Le, N. Jiang, A. Agarwal, M. Dudik, Y. Yue, and H. Daumé. Hierarchical imitation and reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, pages 2917–2926, 2018.
  • Le Corff and Fort [2013] S. Le Corff and G. Fort. Online expectation maximization based algorithms for inference in hidden Markov models. Electronic Journal of Statistics, 7:763–792, 2013.
  • Mann and Mannor [2014] T. Mann and S. Mannor. Scaling up approximate value iteration with options: Better policies with fewer iterations. In Proceedings of the 31th International Conference on Machine Learning, pages 127–135, 2014.
  • Manschitz et al. [2014] S. Manschitz, J. Kober, M. Gienger, and J. Peters. Learning to sequence movement primitives from demonstrations. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4414–4421. IEEE, 2014.
  • McLachlan and Krishnan [2007] G. J. McLachlan and T. Krishnan. The EM algorithm and extensions, volume 382. John Wiley & Sons, 2007.
  • Nachum et al. [2018] O. Nachum, S. S. Gu, H. Lee, and S. Levine. Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems 31, pages 3303–3313, 2018.
  • Niekum et al. [2012] S. Niekum, S. Osentoski, G. Konidaris, and A. G. Barto. Learning and generalization of complex tasks from unstructured demonstrations. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5239–5246. IEEE, 2012.
  • Niekum et al. [2015] S. Niekum, S. Osentoski, G. Konidaris, S. Chitta, B. Marthi, and A. G. Barto. Learning grounded finite-state representations from unstructured demonstrations. The International Journal of Robotics Research, 34(2):131–157, 2015.
  • Shiarlis et al. [2018] K. Shiarlis, M. Wulfmeier, S. Salter, S. Whiteson, and I. Posner. TACO: Learning task decomposition via temporal alignment for control. In Proceedings of the 35th International Conference on Machine Learning, pages 4654–4663, 2018.
  • Sutton et al. [1999] R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999.
  • Tessler et al. [2017] C. Tessler, S. Givony, T. Zahavy, D. J. Mankowitz, and S. Mannor. A deep hierarchical approach to lifelong learning in minecraft. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 1553–1561, 2017.
  • van Handel [2008] R. van Handel. Hidden Markov models. Unpublished lecture notes, 2008. URL https://web.math.princeton.edu/~rvan/orf557/hmm080728.pdf.
  • Vezhnevets et al. [2017] A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pages 3540–3549, 2017.
  • Wang et al. [2015] Z. Wang, Q. Gu, Y. Ning, and H. Liu. High dimensional EM algorithm: Statistical optimization and asymptotic normality. In Advances in Neural Information Processing Systems 28, pages 2521–2529, 2015.
  • Wu [1983] C. J. Wu. On the convergence properties of the EM algorithm. The Annals of statistics, 11(1):95–103, 1983.
  • Xu et al. [2016] J. Xu, D. J. Hsu, and A. Maleki. Global analysis of expectation maximization for mixtures of two Gaussians. In Advances in Neural Information Processing Systems 29, pages 2676–2684, 2016.
  • Yang et al. [2017] F. Yang, S. Balakrishnan, and M. J. Wainwright. Statistical and computational guarantees for the Baum-Welch algorithm. Journal of Machine Learning Research, 18(1):4528–4580, 2017.
  • Yi and Caramanis [2015] X. Yi and C. Caramanis. Regularized EM algorithms: A unified framework and statistical guarantees. In Advances in Neural Information Processing Systems 28, pages 1567–1575, 2015.

Appendix

Organization.

Appendix A presents discussions that motivate Assumption 3. In particular, we show that Assumption 3 approximately holds in a particular class of environment. Appendix B provides details on Algorithm 1, including the comparison with the existing algorithm from [10], the forward-backward implementation and the derivation of the QQ-function from (7). In Appendix C, we prove our theoretical results from Section 4. Technical lemmas involved in the proofs are deferred to Appendix D. Finally, Appendix E presents details of our numerical example omitted from Section 5.

Appendix A Discussion on Assumption 3

In this section we justify Assumption 3 in a particular class of environment. Consider the stochastic process {Xt;θ}t=1={St,At,Ot,Bt;θ}t=1\{X_{t};\theta\}_{t=1}^{\infty}=\{S_{t},A_{t},O_{t},B_{t};\theta\}_{t=1}^{\infty} generated by any (o0,s1)(o_{0},s_{1}) and an options with failure hierarchical policy with parameter θ\theta. It is a Markov chain with its transition kernel parameterized by θ\theta, and its state space 𝒳=𝒮×𝒜×𝒪×{0,1}\mathcal{X}=\mathcal{S}\times\mathcal{A}\times\mathcal{O}\times\{0,1\} is finite. Denote its one step transition kernel as QθQ_{\theta} and its tt step transition kernel as QθtQ^{t}_{\theta}. In the following, we show that {Xt;θ}t=1\{X_{t};\theta\}_{t=1}^{\infty} is uniformly ergodic when the environment meets the reachability assumption: st,st+1𝒮\forall s_{t},s_{t+1}\in\mathcal{S}, there exists at𝒜a_{t}\in\mathcal{A} such that P(st+1|st,at)>0P(s_{t+1}|s_{t},a_{t})>0.

Proposition 5 (Ergodicity).

With Assumption 1, 2 and the reachability assumption stated above, for all θΘ\theta\in\mathit{\Theta}, a Markov chain with transition kernel QθQ_{\theta} has a unique stationary distribution νθ\nu_{\theta}. There exist constants α(0,1)\alpha\in(0,1) and C>0C>0 such that for all θΘ\theta\in\mathit{\Theta} and t+t\in\mathbb{N}_{+},

supθΘmaxx𝒳Qθt(x,)νθTVCαt.\sup_{\theta\in\mathit{\Theta}}\max_{x\in\mathcal{X}}\left\|{Q_{\theta}^{t}(x,\cdot)-\nu_{\theta}}\right\|_{\rm TV}\leq C\alpha^{t}.
Proof of Proposition 5.

We start by analyzing the irreducibility of the Markov chain {Xt;θ}t=1\{X_{t};\theta\}_{t=1}^{\infty} with any θ\theta. Denote the probability measure on the natural filtered space as X\mathbb{P}_{X}. The dependency on θ\theta is dropped for a cleaner notation, since the following proof holds for all θΘ\theta\in\mathit{\Theta}. For any x,x~𝒳x,\tilde{x}\in\mathcal{X}, let x=(s,a,o,b)x=(s,a,o,b) and x~=(s~,a~,o~,b~)\tilde{x}=(\tilde{s},\tilde{a},\tilde{o},\tilde{b}). For any time tt,

X(Xt+2=x~|Xt=x)=s¯𝒮,a¯𝒜X(Xt+2=x~|Xt=x,St+1=s¯,At+1=a¯)×X(St+1=s¯,At+1=a¯|Xt=x).\mathbb{P}_{X}(X_{t+2}=\tilde{x}|X_{t}=x)=\sum_{\bar{s}\in\mathcal{S},\bar{a}\in\mathcal{A}}\mathbb{P}_{X}(X_{t+2}=\tilde{x}|X_{t}=x,S_{t+1}=\bar{s},A_{t+1}=\bar{a})\\ \times\mathbb{P}_{X}(S_{t+1}=\bar{s},A_{t+1}=\bar{a}|X_{t}=x).

From Assumption 1, there exists a state s¯\bar{s} such that a¯𝒜\forall\bar{a}\in\mathcal{A}, X(St+1=s¯,At+1=a¯|Xt=x)>0\mathbb{P}_{X}(S_{t+1}=\bar{s},A_{t+1}=\bar{a}|X_{t}=x)>0. Consider the first factor in the sum,

X(Xt+2=x~|Xt=x,St+1=s¯,At+1=a¯)=X(St+2=s~|St+1=s¯,At+1=a¯)×X(Bt+2=b~,Ot+2=o~,At+2=a~|Xt=x,St+1=s¯,At+1=a¯,St+2=s~).\mathbb{P}_{X}(X_{t+2}=\tilde{x}|X_{t}=x,S_{t+1}=\bar{s},A_{t+1}=\bar{a})=\mathbb{P}_{X}(S_{t+2}=\tilde{s}|S_{t+1}=\bar{s},A_{t+1}=\bar{a})\\ \times\mathbb{P}_{X}(B_{t+2}=\tilde{b},O_{t+2}=\tilde{o},A_{t+2}=\tilde{a}|X_{t}=x,S_{t+1}=\bar{s},A_{t+1}=\bar{a},S_{t+2}=\tilde{s}).

From Assumption 1, the second term on the RHS is positive for all s¯𝒮\bar{s}\in\mathcal{S} and a¯𝒜\bar{a}\in\mathcal{A}. From the reachability assumption, for any s¯\bar{s} there exists an action a¯\bar{a} such that X(St+2=s~|St+1=s¯,At+1=a¯)>0\mathbb{P}_{X}(S_{t+2}=\tilde{s}|S_{t+1}=\bar{s},A_{t+1}=\bar{a})>0. As a result, for any x,x~𝒳x,\tilde{x}\in\mathcal{X}, X(Xt+2=x~|Xt=x)>0\mathbb{P}_{X}(X_{t+2}=\tilde{x}|X_{t}=x)>0, and the considered Markov chain is irreducible.

As shown above, for all θΘ\theta\in\mathit{\Theta}, minx,x~𝒳Qθ2(x,x~)>0\min_{x,\tilde{x}\in\mathcal{X}}Q^{2}_{\theta}(x,\tilde{x})>0 where Qθ2Q^{2}_{\theta} is the two step transition kernel of the Markov chain {Xt;θ}t=1\{X_{t};\theta\}_{t=1}^{\infty}. Due to Assumption 2, minx,x~𝒳Qθ2(x,x~)\min_{x,\tilde{x}\in\mathcal{X}}Q^{2}_{\theta}(x,\tilde{x}) is continuous with respect to θ\theta. Moreover, since Θ\mathit{\Theta} is compact, if we let δ=infθΘminx,x~𝒳Qθ2(x,x~)\delta=\inf_{\theta\in\mathit{\Theta}}\min_{x,\tilde{x}\in\mathcal{X}}Q^{2}_{\theta}(x,\tilde{x}) we have δ>0\delta>0. The classical Doeblin-type condition can be constructed as follows. For all θΘ\theta\in\mathit{\Theta} and x,x~𝒳x,\tilde{x}\in\mathcal{X}, with any probability measure ν\nu over the finite sample space 𝒳\mathcal{X},

Qθ2(x,x~)δν(x~).Q^{2}_{\theta}(x,\tilde{x})\geq\delta\nu(\tilde{x}). (9)

A Markov chain convergence result is restated in the following lemma, tailored to our need.

Lemma A.1 ([8], Theorem 4.3.16 restated).

With the Doeblin-type condition in (9), the Markov chain {Xt;θ}t=1\{X_{t};\theta\}_{t=1}^{\infty} with any θΘ\theta\in\mathit{\Theta} has a unique stationary distribution νθ\nu_{\theta}. Moreover, for all θΘ\theta\in\mathit{\Theta}, x𝒳x\in\mathcal{X} and t+t\in\mathbb{N}_{+},

Qθt(x,)νθTV(1δ)t/2.\left\|{Q^{t}_{\theta}(x,\cdot)-\nu_{\theta}}\right\|_{\rm TV}\leq(1-\delta)^{\left\lfloor{t/2}\right\rfloor}.

Letting C=(1δ)1C=(1-\delta)^{-1} and α=(1δ)1/2\alpha=(1-\delta)^{1/2}, we have

supθΘmaxx1𝒳Qθt(x1,)νθTV(1δ)t/2Cαt.\sup_{\theta\in\mathit{\Theta}}\max_{x_{1}\in\mathcal{X}}\left\|{Q_{\theta}^{t}(x_{1},\cdot)-\nu_{\theta}}\right\|_{\rm TV}\leq(1-\delta)^{\left\lfloor{t/2}\right\rfloor}\leq C\alpha^{t}.\qed

Proposition 5 shows that in {Xt;θ}t=1\{X_{t};\theta\}_{t=1}^{\infty}, the initial distribution (of X1X_{1}) is not very important since the distribution of XtX_{t} converges to νθ\nu_{\theta} uniformly with respect to X1X_{1} and θ\theta. As a result, {Ot1,St}t=1\{O_{t-1},S_{t}\}_{t=1}^{\infty} also converges to the unique limiting distribution, regardless of the initial distribution. When sampling the observation sequence from the expert, we can always start sampling late enough such that Assumption 3 is approximately satisfied. Note that the proof of Proposition 5 does not use the failure mechanism imposed on the hierarchical policy, implying that the result also holds for the standard options framework.

Appendix B Details of the algorithm

B.1 An error in the existing algorithm

First, we point out a technicality when comparing Algorithm 1 to the algorithm from [10]. The algorithm from [10] learns a hierarchical policy following the standard options framework, not the options with failure framework considered in Algorithm 1. To draw direct comparison, we need to let ζ=0\zeta=0 in Algorithm 1. However, an error in the existing algorithm can be demonstrated without referring to ζ\zeta.

For simplicity, consider O0O_{0} fixed to o0𝒪o_{0}\in\mathcal{O}; let 2tT12\leq t\leq T-1. Then, according to the definitions in [10], the (unnormalized) forward message is defined as

αtθ(ot,bt)=θ,o0,s1(A1:t=a1:t,Ot=ot,Bt=bt|S2:t=s2:t).\alpha^{\theta}_{t}(o_{t},b_{t})=\mathbb{P}_{\theta,o_{0},s_{1}}(A_{1:t}=a_{1:t},O_{t}=o_{t},B_{t}=b_{t}|S_{2:t}=s_{2:t}).

The (unnormalized) backward message is defined as

βt|Tθ(ot,bt)=θ,o0,s1(At+1:T=at+1:T|St+1:T=st+1:T,Ot=ot,Bt=bt).\beta^{\theta}_{t|T}(o_{t},b_{t})=\mathbb{P}_{\theta,o_{0},s_{1}}(A_{t+1:T}=a_{t+1:T}|S_{t+1:T}=s_{t+1:T},O_{t}=o_{t},B_{t}=b_{t}).

The smoothing distribution is defined as

γt|Tθ(ot,bt)=θ,o0,s1(Ot=ot,Bt=bt|S2:T=s2:T,A1:T=a1:T).\gamma^{\theta}_{t|T}(o_{t},b_{t})=\mathbb{P}_{\theta,o_{0},s_{1}}(O_{t}=o_{t},B_{t}=b_{t}|S_{2:T}=s_{2:T},A_{1:T}=a_{1:T}).

We use the proportional symbol \propto to represent normalizing constants independent of oto_{t} and btb_{t}. [10] claims that, for any oto_{t} and btb_{t},

γt|Tθ(ot,bt)αtθ(ot,bt)βt|Tθ(ot,bt).\gamma^{\theta}_{t|T}(o_{t},b_{t})\propto\alpha^{\theta}_{t}(o_{t},b_{t})\beta^{\theta}_{t|T}(o_{t},b_{t}).

However, applying Bayes’ formula, it follows that

γt|Tθ(ot,bt)θ,o0,s1(A1:T=a1:T|S2:T=s2:T,Ot=ot,Bt=bt)θ,o0,s1(Ot=ot,Bt=bt|S2:T=s2:T).\gamma^{\theta}_{t|T}(o_{t},b_{t})\propto\mathbb{P}_{\theta,o_{0},s_{1}}(A_{1:T}=a_{1:T}|S_{2:T}=s_{2:T},O_{t}=o_{t},B_{t}=b_{t})\mathbb{P}_{\theta,o_{0},s_{1}}(O_{t}=o_{t},B_{t}=b_{t}|S_{2:T}=s_{2:T}).

Using the Markov property,

θ,o0,s1(A1:T=a1:T|S2:T=s2:T,Ot=ot,Bt=bt)=θ,o0,s1(A1:t=a1:t|S2:T=s2:T,Ot=ot,Bt=bt)×θ,o0,s1(At+1:T=at+1:T|S2:T=s2:T,Ot=ot,Bt=bt).\mathbb{P}_{\theta,o_{0},s_{1}}(A_{1:T}=a_{1:T}|S_{2:T}=s_{2:T},O_{t}=o_{t},B_{t}=b_{t})=\\ \mathbb{P}_{\theta,o_{0},s_{1}}(A_{1:t}=a_{1:t}|S_{2:T}=s_{2:T},O_{t}=o_{t},B_{t}=b_{t})\\ \times\mathbb{P}_{\theta,o_{0},s_{1}}(A_{t+1:T}=a_{t+1:T}|S_{2:T}=s_{2:T},O_{t}=o_{t},B_{t}=b_{t}).

Therefore,

γt|Tθ(ot,bt)θ,o0,s1(A1:t=a1:t,Ot=ot,Bt=bt|S2:T=s2:T)βt|Tθ(ot,bt).\gamma^{\theta}_{t|T}(o_{t},b_{t})\propto\mathbb{P}_{\theta,o_{0},s_{1}}(A_{1:t}=a_{1:t},O_{t}=o_{t},B_{t}=b_{t}|S_{2:T}=s_{2:T})\beta^{\theta}_{t|T}(o_{t},b_{t}).

Applying Bayes’ formula again, it follows that

θ,o0,s1(A1:t=a1:t,Ot=ot,Bt=bt|S2:T=s2:T)\displaystyle\mathbb{P}_{\theta,o_{0},s_{1}}(A_{1:t}=a_{1:t},O_{t}=o_{t},B_{t}=b_{t}|S_{2:T}=s_{2:T})
\displaystyle\propto~{} θ,o0,s1(A1:t=a1:t,Ot=ot,Bt=bt|S2:t=s2:t)\displaystyle\mathbb{P}_{\theta,o_{0},s_{1}}(A_{1:t}=a_{1:t},O_{t}=o_{t},B_{t}=b_{t}|S_{2:t}=s_{2:t})
×θ,o0,s1(St+1:T=st+1:T|S2:t=s2:t,A1:t=a1:t,Ot=ot,Bt=bt)\displaystyle\hskip 60.00009pt\times\mathbb{P}_{\theta,o_{0},s_{1}}(S_{t+1:T}=s_{t+1:T}|S_{2:t}=s_{2:t},A_{1:t}=a_{1:t},O_{t}=o_{t},B_{t}=b_{t})
=\displaystyle=~{} αtθ(ot,bt)θ,o0,s1(St+1:T=st+1:T|St=st,At=at,Ot=ot,Bt=bt).\displaystyle\alpha^{\theta}_{t}(o_{t},b_{t})\mathbb{P}_{\theta,o_{0},s_{1}}(S_{t+1:T}=s_{t+1:T}|S_{t}=s_{t},A_{t}=a_{t},O_{t}=o_{t},B_{t}=b_{t}).

For the claim in [10] to be true, θ,o0,s1(St+1:T=st+1:T|St=st,At=at,Ot=ot,Bt=bt)\mathbb{P}_{\theta,o_{0},s_{1}}(S_{t+1:T}=s_{t+1:T}|S_{t}=s_{t},A_{t}=a_{t},O_{t}=o_{t},B_{t}=b_{t}) should not depend on oto_{t} and btb_{t}. Clearly this requirement does not hold in most cases, since the likelihood of the future observation sequence should depend on the currently applied option.

B.2 Proof of Theorem 1

We drop the dependency on θ\theta, since the following proof holds for all θΘ\theta\in\mathit{\Theta}. The proportional symbol \propto is used to replace a multiplier term that depends on the context.

1. (Forward recursion)

First consider any fixed o0o_{0}. For a cleaner notation, we use pp as an abbreviation of θ,o0,s1\mathbb{P}_{\theta,o_{0},s_{1}}. Let H1H_{1}, H2H_{2} be any two subsets of {St,At,Ot,Bt}t=1T\{S_{t},A_{t},O_{t},B_{t}\}_{t=1}^{T}, and let h1h_{1}, h2h_{2} be the sets of values generated from H1H_{1} and H2H_{2}, respectively, such that the uppercase symbols are replaced by the lowercase symbols. (H1H_{1} and H2H_{2} are two sets of random variables; h1h_{1} and h2h_{2} are two sets of values of random variables.) Then, for all (o0,s1)(o_{0},s_{1}), pp is defined as

p(h1|h2,o0,s1):=θ,o0,s1(H1=h1|H2=h2).p(h_{1}|h_{2},o_{0},s_{1})\mathrel{\mathop{:}}=\mathbb{P}_{\theta,o_{0},s_{1}}(H_{1}=h_{1}|H_{2}=h_{2}).

If the RHS does not depend on o0o_{0} and s1s_{1}, we can omit it on the LHS by using p(h1|h2)p(h_{1}|h_{2}). t[2:T]\forall t\in[2:T],

p(s2:t,a1:t,ot,bt|o0,s1)\displaystyle p(s_{2:t},a_{1:t},o_{t},b_{t}|o_{0},s_{1})
=\displaystyle=~{} p(s2:t,a1:t1,ot,bt|o0,s1)πlo(at|st,ot)\displaystyle p(s_{2:t},a_{1:t-1},o_{t},b_{t}|o_{0},s_{1})\pi_{lo}(a_{t}|s_{t},o_{t})
=\displaystyle=~{} ot1p(s2:t,a1:t1,ot,bt,ot1|o0,s1)πlo(at|st,ot)\displaystyle\sum_{o_{t-1}}p(s_{2:t},a_{1:t-1},o_{t},b_{t},o_{t-1}|o_{0},s_{1})\pi_{lo}(a_{t}|s_{t},o_{t})
=\displaystyle=~{} ot1p(s2:t,a1:t1,ot1|o0,s1)πb(bt|st,ot1)π¯hi(ot|st,ot1,bt)πlo(at|st,ot).\displaystyle\sum_{o_{t-1}}p(s_{2:t},a_{1:t-1},o_{t-1}|o_{0},s_{1})\pi_{b}(b_{t}|s_{t},o_{t-1})\bar{\pi}_{hi}(o_{t}|s_{t},o_{t-1},b_{t})\pi_{lo}(a_{t}|s_{t},o_{t}).

Furthermore,

p(s2:t,a1:t1,ot1|o0,s1)\displaystyle p(s_{2:t},a_{1:t-1},o_{t-1}|o_{0},s_{1}) =p(s2:t1,a1:t1,ot1|o0,s1)P(st|st1,at1)\displaystyle=p(s_{2:t-1},a_{1:t-1},o_{t-1}|o_{0},s_{1})P(s_{t}|s_{t-1},a_{t-1})
bt1p(s2:t1,a1:t1,ot1,bt1|o0,s1),\displaystyle\propto\sum_{b_{t-1}}p(s_{2:t-1},a_{1:t-1},o_{t-1},b_{t-1}|o_{0},s_{1}),

where \propto replaces a multiplier that does not depend on ot1o_{t-1}. Taking expectation with respect to O0O_{0} gives the desirable forward recursion result. For the case of t=1t=1, the proof is analogous.

2. (Backward recursion)

For any o0o_{0}, t[1:T1]\forall t\in[1:T-1],

βt|Tθ(ot,bt)\displaystyle\beta^{\theta}_{t|T}(o_{t},b_{t}) p(st+1:T,at+1:T|st,at,ot,bt)\displaystyle\propto p(s_{t+1:T},a_{t+1:T}|s_{t},a_{t},o_{t},b_{t})
=p(st+2:T,at+1:T|st+1,ot)P(st+1|st,at)\displaystyle=p(s_{t+2:T},a_{t+1:T}|s_{t+1},o_{t})P(s_{t+1}|s_{t},a_{t})
ot+1,bt+1p(st+2:T,at+1:T|st+1,ot,ot+1,bt+1)p(ot+1,bt+1|st+1,ot),\displaystyle\propto\sum_{o_{t+1},b_{t+1}}p(s_{t+2:T},a_{t+1:T}|s_{t+1},o_{t},o_{t+1},b_{t+1})p(o_{t+1},b_{t+1}|s_{t+1},o_{t}),

where the multipliers replaced by \propto are independent of oto_{t} and btb_{t}. Moreover,

p(st+2:T,at+1:T|st+1,ot,ot+1,bt+1)\displaystyle p(s_{t+2:T},a_{t+1:T}|s_{t+1},o_{t},o_{t+1},b_{t+1})
=\displaystyle=~{} p(st+2:T,at+2:T|st+1,ot,ot+1,bt+1,at+1)p(at+1|st+1,ot,ot+1,bt+1)\displaystyle p(s_{t+2:T},a_{t+2:T}|s_{t+1},o_{t},o_{t+1},b_{t+1},a_{t+1})p(a_{t+1}|s_{t+1},o_{t},o_{t+1},b_{t+1})
=\displaystyle=~{} βt+1|Tθ(ot+1,bt+1)p(at+1|st+1,ot,ot+1,bt+1).\displaystyle\beta^{\theta}_{t+1|T}(o_{t+1},b_{t+1})p(a_{t+1}|s_{t+1},o_{t},o_{t+1},b_{t+1}).

Plugging in the structure of the policy gives the desirable result.

3. (Smoothing)

Consider any fixed o0o_{0}. For any t[2:T]t\in[2:T],

p(s2:T,a1:T,ot,bt|o0,s1)\displaystyle p(s_{2:T},a_{1:T},o_{t},b_{t}|o_{0},s_{1}) =p(s2:t,a1:t,ot,bt|o0,s1)p(st+1:T,at+1:T|s1:t,a1:t,ot,bt,o0)\displaystyle=p(s_{2:t},a_{1:t},o_{t},b_{t}|o_{0},s_{1})p(s_{t+1:T},a_{t+1:T}|s_{1:t},a_{1:t},o_{t},b_{t},o_{0})
=p(s2:t,a1:t,ot,bt|o0,s1)p(st+1:T,at+1:T|st,at,ot,bt).\displaystyle=p(s_{2:t},a_{1:t},o_{t},b_{t}|o_{0},s_{1})p(s_{t+1:T},a_{t+1:T}|s_{t},a_{t},o_{t},b_{t}).

Taking expectation with respect to O0O_{0} on both sides yields the desirable result. Notice that the second term on the RHS does not depend on O0O_{0}, therefore is not involved in the expectation. For the case of t=1t=1 the proof is analogous.

4. (Two-step smoothing)

For any t[3:T]t\in[3:T], consider any fixed o0o_{0},

p(s2:T,a1:T,ot1,bt|o0,s1)\displaystyle p(s_{2:T},a_{1:T},o_{t-1},b_{t}|o_{0},s_{1})
=\displaystyle=~{} bt1p(s2:T,a1:T,ot1,bt,bt1|o0,s1)\displaystyle\sum_{b_{t-1}}p(s_{2:T},a_{1:T},o_{t-1},b_{t},b_{t-1}|o_{0},s_{1})
=\displaystyle=~{} bt1p(s2:t1,a1:t1,ot1,bt1|o0,s1)p(st:T,at:T,bt|s1:t1,a1:t1,ot1,bt1,o0)\displaystyle\sum_{b_{t-1}}p(s_{2:t-1},a_{1:t-1},o_{t-1},b_{t-1}|o_{0},s_{1})p(s_{t:T},a_{t:T},b_{t}|s_{1:t-1},a_{1:t-1},o_{t-1},b_{t-1},o_{0})
=\displaystyle=~{} bt1p(s2:t1,a1:t1,ot1,bt1|o0,s1)P(st|st1,at1)p(st+1:T,at:T,bt|st,ot1).\displaystyle\sum_{b_{t-1}}p(s_{2:t-1},a_{1:t-1},o_{t-1},b_{t-1}|o_{0},s_{1})P(s_{t}|s_{t-1},a_{t-1})p(s_{t+1:T},a_{t:T},b_{t}|s_{t},o_{t-1}).

Take expectation with respect to O0O_{0} on both sides. Notice that only the first term on the RHS depends on o0o_{0}. We have

γ~μ,t|T(ot1,bt)\displaystyle\tilde{\gamma}_{\mu,t|T}(o_{t-1},b_{t})
\displaystyle\propto~{} bt1αμ,t1(ot1,bt1)P(st|st1,at1)p(st+1:T,at:T,bt|st,ot1)\displaystyle\sum_{b_{t-1}}\alpha_{\mu,t-1}(o_{t-1},b_{t-1})P(s_{t}|s_{t-1},a_{t-1})p(s_{t+1:T},a_{t:T},b_{t}|s_{t},o_{t-1})
\displaystyle\propto~{} πb(bt|st,ot1)p(st+1:T,at:T|st,bt,ot1)bt1αμ,t1(ot1,bt1)\displaystyle\pi_{b}(b_{t}|s_{t},o_{t-1})p(s_{t+1:T},a_{t:T}|s_{t},b_{t},o_{t-1})\sum_{b_{t-1}}\alpha_{\mu,t-1}(o_{t-1},b_{t-1})
=\displaystyle=~{} πb(bt|st,ot1)[otp(st+1:T,at:T,ot|st,bt,ot1)]bt1αμ,t1(ot1,bt1)\displaystyle\pi_{b}(b_{t}|s_{t},o_{t-1})\left[\sum_{o_{t}}p(s_{t+1:T},a_{t:T},o_{t}|s_{t},b_{t},o_{t-1})\right]\sum_{b_{t-1}}\alpha_{\mu,t-1}(o_{t-1},b_{t-1})
\displaystyle\propto~{} πb(bt|st,ot1)[otπ¯hi(ot|st,ot1,bt)πlo(at|st,ot)βt|T(ot,bt)]bt1αμ,t1(ot1,bt1),\displaystyle\pi_{b}(b_{t}|s_{t},o_{t-1})\left[\sum_{o_{t}}\bar{\pi}_{hi}(o_{t}|s_{t},o_{t-1},b_{t})\pi_{lo}(a_{t}|s_{t},o_{t})\beta_{t|T}(o_{t},b_{t})\right]\sum_{b_{t-1}}\alpha_{\mu,t-1}(o_{t-1},b_{t-1}),

where the multipliers replaced by \propto are independent of ot1o_{t-1} and btb_{t}. For the case of t=2t=2 the proof is analogous. ∎

B.3 Discussion on the QQ-function

In our algorithm, as motivated by Section 3, we effectively consider the following joint distribution on the graphical model shown in Figure 1: the prior distribution of (O0,S1)(O_{0},S_{1}) is ν^\hat{\nu}, and the distribution of the rest of the graphical model is determined by an options with failure policy with parameters ζ\zeta and θ\theta. From the EM literature [3, 22], the complete likelihood function is

L(s1:T,a1:T,o0:T,b1:T;θ)=ν^(o0,s1)θ,o0,s1(S2:T=s2:T,A1:T=a1:T,O1:T=o1:T,B1:T=b1:T).L(s_{1:T},a_{1:T},o_{0:T},b_{1:T};\theta)=\hat{\nu}(o_{0},s_{1})\mathbb{P}_{\theta,o_{0},s_{1}}(S_{2:T}=s_{2:T},A_{1:T}=a_{1:T},O_{1:T}=o_{1:T},B_{1:T}=b_{1:T}).

The marginal likelihood function is

Lm(s1:T,a1:T;θ)=o0:T,b1:Tν^(o0,s1)θ,o0,s1(S2:T=s2:T,A1:T=a1:T,O1:T=o1:T,B1:T=b1:T),L^{m}(s_{1:T},a_{1:T};\theta)=\sum_{o_{0:T},b_{1:T}}\hat{\nu}(o_{0},s_{1})\mathbb{P}_{\theta,o_{0},s_{1}}(S_{2:T}=s_{2:T},A_{1:T}=a_{1:T},O_{1:T}=o_{1:T},B_{1:T}=b_{1:T}),

where the superscript mm means marginal. From the definition of smoothing distributions, we can verify that Lm(s1:T,a1:T;θ)=(zγ,μθ)1L^{m}(s_{1:T},a_{1:T};\theta)=(z^{\theta}_{\gamma,\mu})^{-1}.

The standard MLE approach maximizes the logarithm of the marginal likelihood function (marginal log-likelihood) with respect to θ\theta. However, such an optimization objective is hard to evaluate for time series models (e.g., HMMs and our graphical model). As an alternative, the marginal log-likelihood can be lower bounded [22, Chap. 5.4] as

logLm(s1:T,a1:T;θ)o0:T,b1:TL(s1:T,a1:T,o0:T,b1:T;θ)Lm(s1:T,a1:T;θ)logL(s1:T,a1:T,o0:T,b1:T;θ),\log L^{m}(s_{1:T},a_{1:T};\theta^{\prime})\geq\sum_{o_{0:T},b_{1:T}}\frac{L(s_{1:T},a_{1:T},o_{0:T},b_{1:T};\theta)}{L^{m}(s_{1:T},a_{1:T};\theta)}\log L(s_{1:T},a_{1:T},o_{0:T},b_{1:T};\theta^{\prime}),

where θ\theta on the RHS is arbitrary. The RHS is usually called the (unnormalized) QQ-function. For our graphical model, it is denoted as Q~μ,T(θ|θ)\tilde{Q}_{\mu,T}(\theta^{\prime}|\theta).

Q~μ,T(θ|θ)=o0:T,b1:Tν^(o0,s1)θ,o0,s1(S2:T=s2:T,A1:T=a1:T,O1:T=o1:T,B1:T=b1:T)×zγ,μθlog[ν^(o0,s1)θ,o0,s1(S2:T=s2:T,A1:T=a1:T,O1:T=o1:T,B1:T=b1:T)].\tilde{Q}_{\mu,T}(\theta^{\prime}|\theta)=\sum_{o_{0:T},b_{1:T}}\hat{\nu}(o_{0},s_{1})\mathbb{P}_{\theta,o_{0},s_{1}}(S_{2:T}=s_{2:T},A_{1:T}=a_{1:T},O_{1:T}=o_{1:T},B_{1:T}=b_{1:T})\\ \times z^{\theta}_{\gamma,\mu}\log[\hat{\nu}(o_{0},s_{1})\mathbb{P}_{\theta^{\prime},o_{0},s_{1}}(S_{2:T}=s_{2:T},A_{1:T}=a_{1:T},O_{1:T}=o_{1:T},B_{1:T}=b_{1:T})].

The RHS is well-defined from the non-degeneracy assumption. From the classical monotonicity property of EM updates [22, Chap. 5.7], maximizing the (unnormalized) QQ-function Q~μ,T(θ|θ)\tilde{Q}_{\mu,T}(\theta^{\prime}|\theta) with respect to θ\theta^{\prime} guarantees non-negative improvement on the marginal log-likelihood. Therefore, improvements on parameter inference can be achieved via iteratively maximizing the (unnormalized) QQ-function.

Using the structure of the hierarchical policy, Q~μ,T\tilde{Q}_{\mu,T} can be rewritten as

Q~μ,T(θ|θ)=t=2Tot1,btγ~μ,t|Tθ(ot1,bt)[logπb(bt|st,ot1;θb)]+t=1Tot,btγμ,t|Tθ(ot,bt)[logπlo(at|st,ot;θlo)]+t=1Totγμ,t|Tθ(ot,bt=1)[logπhi(ot|st;θhi)]+zγ,μθo0,b1μ(o0|s1)θ,o0,s1(S2:T=s2:T,A1:T=a1:T,B1=b1)[logπb(b1|s1,o0;θb)]+C,\tilde{Q}_{\mu,T}(\theta^{\prime}|\theta)=\sum_{t=2}^{T}\sum_{o_{t-1},b_{t}}\tilde{\gamma}^{\theta}_{\mu,t|T}(o_{t-1},b_{t})[\log\pi_{b}(b_{t}|s_{t},o_{t-1};\theta^{\prime}_{b})]\\ +\sum_{t=1}^{T}\sum_{o_{t},b_{t}}\gamma^{\theta}_{\mu,t|T}(o_{t},b_{t})[\log\pi_{lo}(a_{t}|s_{t},o_{t};\theta^{\prime}_{lo})]+\sum_{t=1}^{T}\sum_{o_{t}}\gamma^{\theta}_{\mu,t|T}(o_{t},b_{t}=1)[\log\pi_{hi}(o_{t}|s_{t};\theta^{\prime}_{hi})]\\ +z^{\theta}_{\gamma,\mu}\sum_{o_{0},b_{1}}\mu(o_{0}|s_{1})\mathbb{P}_{\theta,o_{0},s_{1}}(S_{2:T}=s_{2:T},A_{1:T}=a_{1:T},B_{1}=b_{1})[\log\pi_{b}(b_{1}|s_{1},o_{0};\theta^{\prime}_{b})]+C,

where CC contains terms unrelated to θ\theta^{\prime}. Consider the first term on the last line, which partially captures the effect of assuming ν^\hat{\nu} on the parameter inference. Since this term is upper bounded by maxb1,s1,o0|logπb(b1|s1,o0;θb)|\max_{b_{1},s_{1},o_{0}}|\log\pi_{b}(b_{1}|s_{1},o_{0};\theta^{\prime}_{b})|, when TT is large enough this term becomes negligible. The precise argument is similar to the proof of Lemma C.2. Therefore, after dropping the last line and normalizing, we arrive at our definition of the (normalized) QQ-function in (7).

Appendix C Details of the performance guarantee

C.1 Smoothing in an extended graphical model

Before providing the proofs, we first introduce a few definitions. Consider the extended graphical model shown in Figure 4 with a parameter kk; k+k\in\mathbb{N}_{+}.

Refer to caption
Figure 4: An extended graphical model for hierarchical imitation learning.

Let the joint distribution of (Ok,S1k)(O_{-k},S_{1-k}) be ν\nu^{*}. Define the distribution of the rest of the graphical model using an options with failure hierarchical policy with parameters ζ\zeta and θ\theta, analogous to our settings so far. With these two components, the joint distribution on the graphical model is determined. Let θ,k\mathbb{P}_{\theta,k} be such a joint distribution; ν\nu^{*} is omitted for conciseness.

We emphasize the comparison between θ,k\mathbb{P}_{\theta,k} and θ,o0,s1\mathbb{P}_{\theta,o_{0},s_{1}}. The sample space of θ,k\mathbb{P}_{\theta,k} is the domain of {S1k:T+k,A1k:T+k,Ok:T+k,B1k:T+k}\{S_{1-k:T+k},A_{1-k:T+k},O_{-k:T+k},B_{1-k:T+k}\}, whereas the sample space of θ,o0,s1\mathbb{P}_{\theta,o_{0},s_{1}} is the domain of {S2:T,A1:T,O1:T,B1:T}\{S_{2:T},A_{1:T},O_{1:T},B_{1:T}\} since (O0,S1)(O_{0},S_{1}) is fixed to (o0,s1)(o_{0},s_{1}).

Consider the infinite length observation sequence {st,at}t\{s_{t},a_{t}\}_{t\in\mathbb{Z}} corresponding to any ωΩ\omega\in\Omega, where Ω\Omega is defined in (8). Analogous to the non-extended model (Figure 1), we can define smoothing distributions for the extended model with any parameter kk. For all θΘ\theta\in\mathit{\Theta} and t[1:T]t\in[1:T], with any input arguments oto_{t} and btb_{t}, the forward message is defined as

αk,tθ(ot,bt):=zα,k,tθθ,k(S1k:t=s1k:t,A1k:t=a1k:t,Ot=ot,Bt=bt).\alpha^{\theta}_{k,t}(o_{t},b_{t})\mathrel{\mathop{:}}=z_{\alpha,k,t}^{\theta}\mathbb{P}_{\theta,k}(S_{1-k:t}=s_{1-k:t},A_{1-k:t}=a_{1-k:t},O_{t}=o_{t},B_{t}=b_{t}).

The backward message is defined as

βk,tθ(ot,bt):=zβ,k,tθθ,k(St+1:T+k=st+1:T+k,At+1:T+k=at+1:T+k|St=st,At=at,Ot=ot,Bt=bt).\beta^{\theta}_{k,t}(o_{t},b_{t})\mathrel{\mathop{:}}=\\ z_{\beta,k,t}^{\theta}\mathbb{P}_{\theta,k}(S_{t+1:T+k}=s_{t+1:T+k},A_{t+1:T+k}=a_{t+1:T+k}|S_{t}=s_{t},A_{t}=a_{t},O_{t}=o_{t},B_{t}=b_{t}).

The smoothing distribution is defined as

γk,tθ(ot,bt):=zγ,kθθ,k(S1k:T+k=s1k:T+k,A1k:T+k=a1k:T+k,Ot=ot,Bt=bt).\gamma^{\theta}_{k,t}(o_{t},b_{t})\mathrel{\mathop{:}}=z_{\gamma,k}^{\theta}\mathbb{P}_{\theta,k}(S_{1-k:T+k}=s_{1-k:T+k},A_{1-k:T+k}=a_{1-k:T+k},O_{t}=o_{t},B_{t}=b_{t}).

The two-step smoothing distribution is defined as

γ~k,tθ(ot1,bt):=zγ,kθθ,k(S1k:T+k=s1k:T+k,A1k:T+k=a1k:T+k,Ot1=ot1,Bt=bt).\tilde{\gamma}^{\theta}_{k,t}(o_{t-1},b_{t})\mathrel{\mathop{:}}=z_{\gamma,k}^{\theta}\mathbb{P}_{\theta,k}(S_{1-k:T+k}=s_{1-k:T+k},A_{1-k:T+k}=a_{1-k:T+k},O_{t-1}=o_{t-1},B_{t}=b_{t}).

The quantities zα,k,tθz^{\theta}_{\alpha,k,t}, zβ,k,tθz^{\theta}_{\beta,k,t} and zγ,kθz^{\theta}_{\gamma,k} are normalizing constants such that the LHS of the expressions above are probability mass functions. In particular, since k>0k>0, we can define αk,tθ\alpha^{\theta}_{k,t} for t=0t=0 in the same way as t[1:T]t\in[1:T]. The dependency on TT in the smoothing distributions is dropped for a cleaner notation.

Recursive results similar to Theorem 1 can be established; the proof is analogous and therefore omitted. As in Theorem 1, we make extensive use of the proportional symbol \propto which stands for, the LHS equals the RHS multiplied by a normalizing constant. Moreover, the normalizing constant does not depend on the input arguments of the LHS.

Corollary 6 (Forward-backward smoothing for the extended model).

For all θΘ\theta\in\mathit{\Theta} and k+k\in\mathbb{N}_{+}, with any input arguments,

  1. 1.

    (Forward recursion) t[1:T]\forall t\in[1:T],

    αk,tθ(ot,bt)ot1,bt1πb(bt|st,ot1;θb)π¯hi(ot|st,ot1,bt;θhi)πlo(at|st,ot;θlo)αk,t1θ(ot1,bt1).\alpha^{\theta}_{k,t}(o_{t},b_{t})\propto\sum_{o_{t-1},b_{t-1}}\pi_{b}(b_{t}|s_{t},o_{t-1};\theta_{b})\bar{\pi}_{hi}(o_{t}|s_{t},o_{t-1},b_{t};\theta_{hi})\pi_{lo}(a_{t}|s_{t},o_{t};\theta_{lo})\alpha^{\theta}_{k,t-1}(o_{t-1},b_{t-1}). (10)
  2. 2.

    (Backward recursion) t[1:T1]\forall t\in[1:T-1],

    βk,tθ(ot,bt)ot+1,bt+1πb(bt+1|st+1,ot;θb)π¯hi(ot+1|st+1,ot,bt+1;θhi)×πlo(at+1|st+1,ot+1;θlo)βk,t+1θ(ot+1,bt+1).\beta^{\theta}_{k,t}(o_{t},b_{t})\propto\sum_{o_{t+1},b_{t+1}}\pi_{b}(b_{t+1}|s_{t+1},o_{t};\theta_{b})\bar{\pi}_{hi}(o_{t+1}|s_{t+1},o_{t},b_{t+1};\theta_{hi})\\ \times\pi_{lo}(a_{t+1}|s_{t+1},o_{t+1};\theta_{lo})\beta^{\theta}_{k,t+1}(o_{t+1},b_{t+1}). (11)
  3. 3.

    (Smoothing) t[1:T]\forall t\in[1:T],

    γk,tθ(ot,bt)αk,tθ(ot,bt)βk,tθ(ot,bt).\gamma^{\theta}_{k,t}(o_{t},b_{t})\propto\alpha^{\theta}_{k,t}(o_{t},b_{t})\beta^{\theta}_{k,t}(o_{t},b_{t}). (12)
  4. 4.

    (Two-step smoothing) t[1:T]\forall t\in[1:T],

    γ~k,tθ(ot1,bt)πb(bt|st,ot1;θb)[otπ¯hi(ot|st,ot1,bt;θhi)πlo(at|st,ot;θlo)βk,tθ(ot,bt)]×[bt1αk,t1θ(ot1,bt1)].\tilde{\gamma}^{\theta}_{k,t}(o_{t-1},b_{t})\propto\pi_{b}(b_{t}|s_{t},o_{t-1};\theta_{b})\bigg{[}\sum_{o_{t}}\bar{\pi}_{hi}(o_{t}|s_{t},o_{t-1},b_{t};\theta_{hi})\pi_{lo}(a_{t}|s_{t},o_{t};\theta_{lo})\beta^{\theta}_{k,t}(o_{t},b_{t})\bigg{]}\\ \times\bigg{[}\sum_{b_{t-1}}\alpha^{\theta}_{k,t-1}(o_{t-1},b_{t-1})\bigg{]}. (13)

The following lemma characterizes the limiting behavior of γk,tθ\gamma^{\theta}_{k,t} and γ~k,tθ\tilde{\gamma}^{\theta}_{k,t} as kk\rightarrow\infty.

Lemma C.1 (Limits of smoothing distributions).

With Assumption 1, 2 and 3, for all T2T\geq 2, θΘ\theta\in\mathit{\Theta}, ωΩ\omega\in\Omega and t[1:T]t\in[1:T], the limits of {γk,tθ}k+\{\gamma^{\theta}_{k,t}\}_{k\in\mathbb{N}_{+}} and {γ~k,tθ}k+\{\tilde{\gamma}^{\theta}_{k,t}\}_{k\in\mathbb{N}_{+}} as kk\rightarrow\infty exist with respect to the total variation distance. Let γ,tθ:=limkγk,tθ\gamma^{\theta}_{\infty,t}\mathrel{\mathop{:}}=\lim_{k\rightarrow\infty}\gamma^{\theta}_{k,t} and γ~,tθ:=limkγ~k,tθ\tilde{\gamma}^{\theta}_{\infty,t}\mathrel{\mathop{:}}=\lim_{k\rightarrow\infty}\tilde{\gamma}^{\theta}_{k,t}. They have the following properties:

  1. 1.

    γ,tθ\gamma^{\theta}_{\infty,t} and γ~,tθ\tilde{\gamma}^{\theta}_{\infty,t} do not depend on TT.

  2. 2.

    γ,tθ\gamma^{\theta}_{\infty,t} and γ~,tθ\tilde{\gamma}^{\theta}_{\infty,t} are entry-wise Lipschitz continuous with respect to θΘ\theta\in\mathit{\Theta}.

The proof is given in Appendix D.4. The dependency of γ,tθ\gamma^{\theta}_{\infty,t} and γ~,tθ\tilde{\gamma}^{\theta}_{\infty,t} on ω\omega is omitted for a cleaner notation.

C.2 The stochastic convergence of the QQ-function

In this subsection, we present the proof of Theorem 2.

First, consider γ,tθ\gamma^{\theta}_{\infty,t} and γ~,tθ\tilde{\gamma}^{\theta}_{\infty,t} defined in Lemma C.1. Using the arguments from Section 4, they can also be analyzed in the infinitely extended probability space (𝒳,𝒫(𝒳),θ,ν)(\mathcal{X}^{\mathbb{Z}},\mathcal{P}(\mathcal{X}^{\mathbb{Z}}),\mathbb{P}_{\theta^{*},\nu^{*}}), where 𝒫()\mathcal{P}(\cdot) denotes the power set. We only define γ,tθ\gamma^{\theta}_{\infty,t} and γ~,tθ\tilde{\gamma}^{\theta}_{\infty,t} for ωΩ\omega\in\Omega; for other sample paths, they are defined arbitrarily. Since θ,ν(Ω)=1\mathbb{P}_{\theta^{*},\nu^{*}}(\Omega)=1, such a restriction from 𝒳\mathcal{X}^{\mathbb{Z}} to Ω\Omega does not change our probabilistic results.

For any sample path ω\omega, let ω(st)\omega(s_{t}) and ω(at)\omega(a_{t}) be the values of StS_{t} and AtA_{t} corresponding to ω\omega. With a slight overload of notation, let ω(t)={ω(st),ω(at),ω(ot),ω(bt)}\omega(t)=\{\omega(s_{t}),\omega(a_{t}),\omega(o_{t}),\omega(b_{t})\}, which is the set of components in ω\omega corresponding to time tt.

For all θΘ\theta\in\mathit{\Theta}, θΘ~\theta^{\prime}\in\tilde{\mathit{\Theta}}, ωΩ\omega\in\Omega and t+t\in\mathbb{N}_{+}, define

ft(θ|θ;ω):=ot1,btγ~,tθ(ot1,bt;ω)[logπb(bt|ω(st),ot1;θb)]+ot,btγ,tθ(ot,bt;ω)×[logπlo(ω(at)|ω(st),ot;θlo)]+otγ,tθ(ot,bt=1;ω)[logπhi(ot|ω(st);θhi)],f_{t}(\theta^{\prime}|\theta;\omega)\mathrel{\mathop{:}}=\sum_{o_{t-1},b_{t}}\tilde{\gamma}^{\theta}_{\infty,t}(o_{t-1},b_{t};\omega)\left[\log\pi_{b}(b_{t}|\omega(s_{t}),o_{t-1};\theta^{\prime}_{b})\right]+\sum_{o_{t},b_{t}}\gamma^{\theta}_{\infty,t}(o_{t},b_{t};\omega)\\ \times\left[\log\pi_{lo}(\omega(a_{t})|\omega(s_{t}),o_{t};\theta^{\prime}_{lo})\right]+\sum_{o_{t}}\gamma^{\theta}_{\infty,t}(o_{t},b_{t}=1;\omega)\left[\log\pi_{hi}(o_{t}|\omega(s_{t});\theta^{\prime}_{hi})\right],

where the dependency of the RHS on ω\omega is shown explicitly for clarity. |ft(θ|θ;ω)||f_{t}(\theta^{\prime}|\theta;\omega)| is upper bounded by a constant that does not depend on θ\theta, θ\theta^{\prime}, ω\omega and tt, due to Assumption 1 and 2. Moreover, for all θ\theta, ω\omega and tt, ft(θ|θ;ω)f_{t}(\theta^{\prime}|\theta;\omega) is continuously differentiable with respect to θΘ~\theta^{\prime}\in\tilde{\mathit{\Theta}}; for all θ\theta^{\prime}, ω\omega and tt, ft(θ|θ;ω)f_{t}(\theta^{\prime}|\theta;\omega) is Lipschitz continuous with respect to θΘ\theta\in\mathit{\Theta}, due to Lemma C.1.

Next, define

Q¯(θ|θ):=𝔼θ,ν[f1(θ|θ;ω)].\bar{Q}(\theta^{\prime}|\theta)\mathrel{\mathop{:}}=\mathbb{E}_{\theta^{*},\nu^{*}}[f_{1}(\theta^{\prime}|\theta;\omega)]. (14)

The subscripts θ\theta^{*} and ν\nu^{*} in 𝔼θ,ν\mathbb{E}_{\theta^{*},\nu^{*}} denote that the expectation is taken with respect to the probability measure θ,ν\mathbb{P}_{\theta^{*},\nu^{*}}.

With the above definitions, we state the complete version of Theorem 2. The QQ-function defined in (7) is written as Qμ,T(θ|θ;ω)Q_{\mu,T}(\theta^{\prime}|\theta;\omega), showing its dependency on the sample path.

Theorem 7 (The complete version of Theorem 2).

With Assumption 1, 2 and 3, consider Q¯(θ|θ)\bar{Q}(\theta^{\prime}|\theta) defined in (14), we have

  1. 1.

    For all θΘ\theta\in\mathit{\Theta}, Q¯(θ|θ)\bar{Q}(\theta^{\prime}|\theta) is continuously differentiable with respect to θΘ~\theta^{\prime}\in\tilde{\mathit{\Theta}}, where Θ~\tilde{\mathit{\Theta}} is defined in Assumption 1. The gradient is

    Q¯(θ|θ)=𝔼θ,ν[f1(θ|θ;ω)].\nabla\bar{Q}(\theta^{\prime}|\theta)=\mathbb{E}_{\theta^{*},\nu^{*}}[\nabla f_{1}(\theta^{\prime}|\theta;\omega)].

    Moreover, as the set of maximizing arguments, argmaxθΘQ¯(θ|θ)\operatorname*{arg\,max}_{\theta^{\prime}\in\mathit{\Theta}}\bar{Q}(\theta^{\prime}|\theta) is nonempty.

  2. 2.

    As TT\rightarrow\infty,

    supθ,θΘsupμ|Qμ,T(θ|θ;ω)Q¯(θ|θ)|0,Pθ,ν-a.s.\sup_{\theta,\theta^{\prime}\in\mathit{\Theta}}\sup_{\mu\in\mathcal{M}}\left|Q_{\mu,T}(\theta^{\prime}|\theta;\omega)-\bar{Q}(\theta^{\prime}|\theta)\right|\rightarrow 0,~{}P_{\theta^{*},\nu^{*}}\text{-a.s.}

Before proving Theorem 7, we state the following definition and an auxiliary lemma required for the proof. For all θ,θΘ\theta,\theta^{\prime}\in\mathit{\Theta}, ωΩ\omega\in\Omega and T2T\geq 2, the sample-path-based population QQ-function Q,Ts(θ|θ;ω)Q^{s}_{\infty,T}(\theta^{\prime}|\theta;\omega) is defined as

Q,Ts(θ|θ;ω):=1Tt=1Tft(θ|θ;ω).Q^{s}_{\infty,T}(\theta^{\prime}|\theta;\omega)\mathrel{\mathop{:}}=\frac{1}{T}\sum_{t=1}^{T}f_{t}(\theta^{\prime}|\theta;\omega). (15)

The superscript s in Q,TsQ^{s}_{\infty,T} stands for sample-path-based. If the sample path ω\omega is not specified, Q,Ts(θ|θ)Q^{s}_{\infty,T}(\theta^{\prime}|\theta) is a random variable associated with probability measure θ,ν\mathbb{P}_{\theta^{*},\nu^{*}}. Note that due to stationarity, for any θ\theta, θ\theta^{\prime} and TT, Q¯(θ|θ)=𝔼θ,ν[Q,Ts(θ|θ;ω)]\bar{Q}(\theta^{\prime}|\theta)=\mathbb{E}_{\theta^{*},\nu^{*}}[Q^{s}_{\infty,T}(\theta^{\prime}|\theta;\omega)].

The difference between Q,TsQ^{s}_{\infty,T} and Qμ,TQ_{\mu,T} is bounded in the following lemma.

Lemma C.2 (Bounding the difference between the QQ-function and the sample-path-based population QQ-function).

With Assumption 1, 2 and 3, for all T2T\geq 2 and ωΩ\omega\in\Omega,

supθ,θΘsupμ|Q,Ts(θ|θ;ω)Qμ,T(θ|θ;ω)|constT1,\sup_{\theta,\theta^{\prime}\in\mathit{\Theta}}\sup_{\mu\in\mathcal{M}}\left|Q^{s}_{\infty,T}(\theta^{\prime}|\theta;\omega)-Q_{\mu,T}(\theta^{\prime}|\theta;\omega)\right|\leq const\cdot T^{-1},

where constconst is a constant independent of TT and ω\omega.

The proof is provided in Appendix D.5. Now we are ready to present the proof of Theorem 7 step-by-step. The structure of this proof is similar to the standard analysis of HMM maximum likelihood estimators [8, Chap. 12].

Proof of Theorem 7.

We prove the two parts of the theorem separately.

1. For all θΘ~\theta^{\prime}\in\tilde{\mathit{\Theta}}, there exists δθ>0\delta_{\theta^{\prime}}>0 such that the set {θ~;θ~θ2δθ}Θ~\{\tilde{\theta};\|{\tilde{\theta}-\theta^{\prime}}\|_{2}\leq\delta_{\theta^{\prime}}\}\subseteq\tilde{\mathit{\Theta}}. For all θΘ\theta\in\mathit{\Theta} and ωΩ\omega\in\Omega, due to the differentiability of f1(θ|θ;ω)f_{1}(\theta^{\prime}|\theta;\omega) with respect to θ\theta^{\prime}, there exists a gradient f1(θ|θ;ω)\nabla f_{1}(\theta^{\prime}|\theta;\omega) at any θΘ~\theta^{\prime}\in\tilde{\mathit{\Theta}} such that

limδ0supθ~Θ~;θ~θ2δ|f1(θ~|θ;ω)f1(θ|θ;ω)f1(θ|θ;ω),θ~θ|θ~θ2=0.\lim_{\delta\rightarrow 0}\sup_{\tilde{\theta}\in\tilde{\mathit{\Theta}};\|{\tilde{\theta}-\theta^{\prime}}\|_{2}\leq\delta}\frac{|f_{1}(\tilde{\theta}|\theta;\omega)-f_{1}(\theta^{\prime}|\theta;\omega)-\langle\nabla f_{1}(\theta^{\prime}|\theta;\omega),\tilde{\theta}-\theta^{\prime}\rangle|}{\|{\tilde{\theta}-\theta^{\prime}}\|_{2}}=0.

We need to transform the above almost surely (in ω\omega) convergence to the convergence of expectation, using the dominated convergence theorem. As a requirement, the quantity inside the limit on the LHS needs to be upper-bounded. For all θΘ\theta\in\mathit{\Theta}, θΘ~\theta^{\prime}\in\tilde{\mathit{\Theta}}, ωΩ\omega\in\Omega and 0<δδθ0<\delta\leq\delta_{\theta^{\prime}},

supθ~Θ~;θ~θ2δ|f1(θ~|θ;ω)f1(θ|θ;ω)f1(θ|θ;ω),θ~θ|θ~θ2supθ~;θ~θ2δθ|f1(θ~|θ;ω)f1(θ|θ;ω)|θ~θ2+supθ~;θ~θ2δθ|f1(θ|θ;ω),θ~θ|θ~θ2.\sup_{\tilde{\theta}\in\tilde{\mathit{\Theta}};\|{\tilde{\theta}-\theta^{\prime}}\|_{2}\leq\delta}\frac{|f_{1}(\tilde{\theta}|\theta;\omega)-f_{1}(\theta^{\prime}|\theta;\omega)-\langle\nabla f_{1}(\theta^{\prime}|\theta;\omega),\tilde{\theta}-\theta^{\prime}\rangle|}{\|{\tilde{\theta}-\theta^{\prime}}\|_{2}}\leq\\ \sup_{\tilde{\theta};\|{\tilde{\theta}-\theta^{\prime}}\|_{2}\leq\delta_{\theta^{\prime}}}\frac{|f_{1}(\tilde{\theta}|\theta;\omega)-f_{1}(\theta^{\prime}|\theta;\omega)|}{\|{\tilde{\theta}-\theta^{\prime}}\|_{2}}+\sup_{\tilde{\theta};\|{\tilde{\theta}-\theta^{\prime}}\|_{2}\leq\delta_{\theta^{\prime}}}\frac{|\langle\nabla f_{1}(\theta^{\prime}|\theta;\omega),\tilde{\theta}-\theta^{\prime}\rangle|}{\|{\tilde{\theta}-\theta^{\prime}}\|_{2}}. (16)

Since continuously differentiable functions are Lipschitz continuous on convex and compact subsets, πhi\pi_{hi}, πlo\pi_{lo} and πb\pi_{b} as functions of θ~Θ~\tilde{\theta}\in\tilde{\mathit{\Theta}} are Lipschitz continuous on {θ~;θ~θ2δθ}\{\tilde{\theta};\|{\tilde{\theta}-\theta^{\prime}}\|_{2}\leq\delta_{\theta^{\prime}}\}, with any other input arguments. From the expression of f1f_{1}, we can verify that for any fixed θ\theta and ω\omega, f1(θ~|θ;ω)f_{1}(\tilde{\theta}|\theta;\omega) as a function of θ~\tilde{\theta} is Lipschitz continuous on {θ~;θ~θ2δθ}\{\tilde{\theta};\|{\tilde{\theta}-\theta^{\prime}}\|_{2}\leq\delta_{\theta^{\prime}}\}, and the Lipschitz constant only depends on θ\theta^{\prime} and δθ\delta_{\theta^{\prime}}. Consequently, the RHS of (16) can be upper-bounded for all ωΩ\omega\in\Omega. Applying the dominated convergence theorem, we have

limδ0𝔼θ,ν[supθ~Θ~;θ~θ2δ|f1(θ~|θ;ω)f1(θ|θ;ω)f1(θ|θ;ω),θ~θ|θ~θ2]=0.\lim_{\delta\rightarrow 0}\mathbb{E}_{\theta^{*},\nu^{*}}\left[\sup_{\tilde{\theta}\in\tilde{\mathit{\Theta}};\|{\tilde{\theta}-\theta^{\prime}}\|_{2}\leq\delta}\frac{|f_{1}(\tilde{\theta}|\theta;\omega)-f_{1}(\theta^{\prime}|\theta;\omega)-\langle\nabla f_{1}(\theta^{\prime}|\theta;\omega),\tilde{\theta}-\theta^{\prime}\rangle|}{\|{\tilde{\theta}-\theta^{\prime}}\|_{2}}\right]=0. (17)

On the other hand, notice that for all θΘ\theta\in\mathit{\Theta}, θΘ~\theta^{\prime}\in\tilde{\mathit{\Theta}} and δ>0\delta>0,

supθ~Θ~;θ~θ2δ|Q¯(θ~|θ)Q¯(θ|θ)𝔼θ,ν[f1(θ|θ;ω)],θ~θ|θ~θ2\displaystyle\sup_{\tilde{\theta}\in\tilde{\mathit{\Theta}};\|{\tilde{\theta}-\theta^{\prime}}\|_{2}\leq\delta}\frac{|\bar{Q}(\tilde{\theta}|\theta)-\bar{Q}(\theta^{\prime}|\theta)-\langle\mathbb{E}_{\theta^{*},\nu^{*}}[\nabla f_{1}(\theta^{\prime}|\theta;\omega)],\tilde{\theta}-\theta^{\prime}\rangle|}{\|{\tilde{\theta}-\theta^{\prime}}\|_{2}}
=\displaystyle=~{} supθ~Θ~;θ~θ2δ|𝔼θ,ν[f1(θ~|θ;ω)f1(θ|θ;ω)f1(θ|θ;ω),θ~θ]|θ~θ2\displaystyle\sup_{\tilde{\theta}\in\tilde{\mathit{\Theta}};\|{\tilde{\theta}-\theta^{\prime}}\|_{2}\leq\delta}\frac{|\mathbb{E}_{\theta^{*},\nu^{*}}[f_{1}(\tilde{\theta}|\theta;\omega)-f_{1}(\theta^{\prime}|\theta;\omega)-\langle\nabla f_{1}(\theta^{\prime}|\theta;\omega),\tilde{\theta}-\theta^{\prime}\rangle]|}{\|{\tilde{\theta}-\theta^{\prime}}\|_{2}}
\displaystyle\leq~{} 𝔼θ,ν[supθ~Θ~;θ~θ2δ|f1(θ~|θ;ω)f1(θ|θ;ω)f1(θ|θ;ω),θ~θ|θ~θ2].\displaystyle\mathbb{E}_{\theta^{*},\nu^{*}}\left[\sup_{\tilde{\theta}\in\tilde{\mathit{\Theta}};\|{\tilde{\theta}-\theta^{\prime}}\|_{2}\leq\delta}\frac{|f_{1}(\tilde{\theta}|\theta;\omega)-f_{1}(\theta^{\prime}|\theta;\omega)-\langle\nabla f_{1}(\theta^{\prime}|\theta;\omega),\tilde{\theta}-\theta^{\prime}\rangle|}{\|{\tilde{\theta}-\theta^{\prime}}\|_{2}}\right].

Combining with (17) proves the differentiability of Q¯(θ|θ)\bar{Q}(\theta^{\prime}|\theta) with respect to θΘ~\theta^{\prime}\in\tilde{\mathit{\Theta}} for any fixed θ\theta. The gradient is

Q¯(θ|θ)=𝔼θ,ν[f1(θ|θ;ω)].\nabla\bar{Q}(\theta^{\prime}|\theta)=\mathbb{E}_{\theta^{*},\nu^{*}}[\nabla f_{1}(\theta^{\prime}|\theta;\omega)].

Analogously, using the dominated convergence theorem we can also show that the gradient Q¯(θ|θ)\nabla\bar{Q}(\theta^{\prime}|\theta) is continuous with respect to θΘ~\theta^{\prime}\in\tilde{\mathit{\Theta}}. Details are omitted due to the similarity with the above procedure. It is worth noting that we let θΘ~\theta^{\prime}\in\tilde{\mathit{\Theta}} instead of Θ\mathit{\Theta}. In this way, the gradient Q¯(θ|θ)\nabla\bar{Q}(\theta^{\prime}|\theta) can be naturally defined when θ\theta^{\prime} is not an interior point of Θ\mathit{\Theta}.

From differentiability and ΘΘ~\mathit{\Theta}\subseteq\tilde{\mathit{\Theta}}, Q¯(θ|θ)\bar{Q}(\theta^{\prime}|\theta) is also continuous with respect to θΘ\theta^{\prime}\in\mathit{\Theta}. Since Θ\mathit{\Theta} is compact, the set of maximizing arguments argmaxθΘQ¯(θ|θ)\operatorname*{arg\,max}_{\theta^{\prime}\in\mathit{\Theta}}\bar{Q}(\theta^{\prime}|\theta) is nonempty.

2. We need to prove the uniform (in θ,θΘ\theta,\theta^{\prime}\in\mathit{\Theta} and μ\mu\in\mathcal{M}) almost sure convergence of the QQ-function Qμ,T(θ|θ;ω)Q_{\mu,T}(\theta^{\prime}|\theta;\omega) to the population QQ-function Q¯(θ|θ)\bar{Q}(\theta^{\prime}|\theta). The proof is separated into three steps. First, we show the almost sure convergence of Q,Ts(θ|θ;ω)Q^{s}_{\infty,T}(\theta^{\prime}|\theta;\omega) to Q¯(θ|θ)\bar{Q}(\theta^{\prime}|\theta) for all θ,θΘ\theta,\theta^{\prime}\in\mathit{\Theta} using the ergodic theorem. Second, we extend this pointwise convergence to uniform (in θ,θ\theta,\theta^{\prime}) convergence using a version of the Arzelà-Ascoli theorem [12, Chap. 21]. Finally, from Lemma C.2, the difference between Qμ,T(θ|θ;ω)Q_{\mu,T}(\theta^{\prime}|\theta;\omega) and Q,Ts(θ|θ;ω)Q^{s}_{\infty,T}(\theta^{\prime}|\theta;\omega) vanishes uniformly in μ\mu as TT\rightarrow\infty.

Concretely, for the pointwise (in θ,θ\theta,\theta^{\prime}) almost sure convergence of Q,Ts(θ|θ;ω)Q^{s}_{\infty,T}(\theta^{\prime}|\theta;\omega) as TT\rightarrow\infty, we apply Birkhoff’s ergodic theorem. Let 𝒯:𝒳𝒳\mathcal{T}:\mathcal{X}^{\mathbb{Z}}\rightarrow\mathcal{X}^{\mathbb{Z}} be the standard shift operator. That is, for any tt\in\mathbb{Z}, 𝒯ω(t)=ω(t+1)\mathcal{T}\omega(t)=\omega(t+1). Due to stationarity, 𝒯\mathcal{T} is a measure-preserving map, i.e., θ,ν(𝒯1F)=θ,ν(F)\mathbb{P}_{\theta^{*},\nu^{*}}(\mathcal{T}^{-1}F)=\mathbb{P}_{\theta^{*},\nu^{*}}(F) for all F𝒫(𝒳)F\in\mathcal{P}(\mathcal{X}^{\mathbb{Z}}). Therefore, the quadruple {𝒳,𝒫(𝒳),θ,ν,𝒯}\{\mathcal{X}^{\mathbb{Z}},\mathcal{P}(\mathcal{X}^{\mathbb{Z}}),\mathbb{P}_{\theta^{*},\nu^{*}},\mathcal{T}\} defines a dynamical system.

Here, we need some clarification on some concepts and notations. Consider the Markov chain {Xt}t=1={St,At,Ot,Bt}t=1\{X_{t}\}_{t=1}^{\infty}=\{S_{t},A_{t},O_{t},B_{t}\}_{t=1}^{\infty} induced by the expert policy, let ΠX,θ\Pi_{X,\theta^{*}} be its set of all stationary distributions. Comparing ΠX,θ\Pi_{X,\theta^{*}} to Πθ\Pi_{\theta^{*}} from Assumption 3, they both depend on the true parameter θ\theta^{*}; the former corresponds to the chain {St,At,Ot,Bt}t=1\{S_{t},A_{t},O_{t},B_{t}\}_{t=1}^{\infty}, while the latter corresponds to the chain {Ot1,St}t=1\{O_{t-1},S_{t}\}_{t=1}^{\infty}. From the structure of our graphical model, they are equivalent by some transformation.

From Section 4, θ,ν\mathbb{P}_{\theta^{*},\nu^{*}} is defined from an element of ΠX,θ\Pi_{X,\theta^{*}} that depends on ν\nu^{*}. Denote this stationary distribution as ψ\psi. Since ν\nu^{*} is an extreme point of Πθ\Pi_{\theta^{*}} (Assumption 3), ψ\psi is also an extreme point of ΠX,θ\Pi_{X,\theta^{*}}. Then, we can apply a standard Markov chain ergodicity result. From [20, Theorem 5.7], the dynamical system {𝒳,𝒫(𝒳),θ,ν,𝒯}\{\mathcal{X}^{\mathbb{Z}},\mathcal{P}(\mathcal{X}^{\mathbb{Z}}),\mathbb{P}_{\theta^{*},\nu^{*}},\mathcal{T}\} is ergodic. For our case, Birkhoff’s ergodic theorem is restated as follows.

Lemma C.3 ([20], Corollary 5.3 restated).

If a dynamical system {𝒳,𝒫(𝒳),θ,ν,𝒯}\{\mathcal{X}^{\mathbb{Z}},\mathcal{P}(\mathcal{X}^{\mathbb{Z}}),\mathbb{P}_{\theta^{*},\nu^{*}},\mathcal{T}\} is ergodic and f:𝒳f:\mathcal{X}^{\mathbb{Z}}\rightarrow\mathbb{R} satisfies 𝔼θ,ν[f(ω)]<\mathbb{E}_{\theta^{*},\nu^{*}}[f(\omega)]<\infty, then as TT\rightarrow\infty,

1Tt=0T1f(𝒯tω)𝔼θ,ν[f(ω)],Pθ,ν-a.s.\frac{1}{T}\sum_{t=0}^{T-1}f(\mathcal{T}^{t}\omega)\rightarrow\mathbb{E}_{\theta^{*},\nu^{*}}[f(\omega)],~{}P_{\theta^{*},\nu^{*}}\text{-a.s.}

For our purpose, observe that for any θ,θΘ\theta,\theta^{\prime}\in\mathit{\Theta}, ft(θ|θ;ω)=f1(θ|θ;𝒯t1ω)f_{t}(\theta^{\prime}|\theta;\omega)=f_{1}(\theta^{\prime}|\theta;\mathcal{T}^{t-1}\omega). Therefore, applying the ergodic theorem to Q,Ts(θ|θ)Q^{s}_{\infty,T}(\theta^{\prime}|\theta), as TT\rightarrow\infty,

Q,Ts(θ|θ;ω)Q¯(θ|θ),Pθ,ν-a.s.Q^{s}_{\infty,T}(\theta^{\prime}|\theta;\omega)\rightarrow\bar{Q}(\theta^{\prime}|\theta),~{}P_{\theta^{*},\nu^{*}}\text{-a.s.} (18)

To extend the pointwise convergence in (18) to uniform (in θ,θ\theta,\theta^{\prime}) convergence, the following concept is required. The sequence {Q,Ts(θ|θ)}\{Q^{s}_{\infty,T}(\theta^{\prime}|\theta)\} indexed by TT as functions of θ\theta and θ\theta^{\prime} is strongly stochastically equicontinuous [12, Equation 21.43] if for any ε>0\varepsilon>0 there exists δ>0\delta>0 such that

lim supTsupθ1,θ1,θ2,θ2Θ;θ1θ22+θ1θ22δ|Q,Ts(θ1|θ1;ω)Q,Ts(θ2|θ2;ω)|<ε,Pθ,ν-a.s.\limsup_{T\rightarrow\infty}\sup_{\theta_{1},\theta^{\prime}_{1},\theta_{2},\theta^{\prime}_{2}\in\mathit{\Theta};\|{\theta_{1}-\theta_{2}}\|_{2}+\|{\theta^{\prime}_{1}-\theta^{\prime}_{2}}\|_{2}\leq\delta}\left|Q^{s}_{\infty,T}(\theta^{\prime}_{1}|\theta_{1};\omega)-Q^{s}_{\infty,T}(\theta^{\prime}_{2}|\theta_{2};\omega)\right|<\varepsilon,~{}P_{\theta^{*},\nu^{*}}\text{-a.s.} (19)

Indeed this property holds for {Q,Ts(θ|θ)}\{Q^{s}_{\infty,T}(\theta^{\prime}|\theta)\}, as shown in Appendix D.6. The version of the Arzelà-Ascoli theorem we use is restated as follows, tailored to our need.

Lemma C.4 ([12], Theorem 21.8 restated).

Given (18) and (19), as TT\rightarrow\infty we have

supθ,θΘ|Q,Ts(θ|θ;ω)Q¯(θ|θ)|0,Pθ,ν-a.s.\sup_{\theta,\theta^{\prime}\in\mathit{\Theta}}\left|Q^{s}_{\infty,T}(\theta^{\prime}|\theta;\omega)-\bar{Q}(\theta^{\prime}|\theta)\right|\rightarrow 0,~{}P_{\theta^{*},\nu^{*}}\text{-a.s.}

Combining Lemma C.2 and Lemma C.4 concludes the proof of the second part. ∎

On the concavity of Q¯(|θ)\bar{Q}(\cdot|\theta).

As discussed after introducing Assumption 4, we expect the following to hold in certain cases of tabular parameterization: for all θΘ\theta\in\mathit{\Theta}, the function Q¯(|θ)\bar{Q}(\cdot|\theta) is strongly concave over Θ\mathit{\Theta}. Details are presented below.

Consider θb\theta^{\prime}_{b} for example, we need to provide sufficient conditions such that the following function is strongly concave with respect to θbΘb\theta^{\prime}_{b}\in\mathit{\Theta}_{b}, given any θΘ\theta\in\mathit{\Theta}.

Q¯b(θb|θ)=o0,b1𝔼θ,ν[γ~,tθ(o0,b1;ω)logπb(b1|ω(s1),o0;θb)].\bar{Q}_{b}(\theta^{\prime}_{b}|\theta)=\sum_{o_{0},b_{1}}\mathbb{E}_{\theta^{*},\nu^{*}}\left[\tilde{\gamma}^{\theta}_{\infty,t}(o_{0},b_{1};\omega)\log\pi_{b}(b_{1}|\omega(s_{1}),o_{0};\theta^{\prime}_{b})\right].

Let the marginal distribution of ν\nu^{*} on S1S_{1} be νS1\nu^{*}_{S_{1}}. If νS1\nu^{*}_{S_{1}} is strictly positive on 𝒮\mathcal{S}, then we rewrite Q¯b(θb|θ)\bar{Q}_{b}(\theta^{\prime}_{b}|\theta) as

Q¯b(θb|θ)=o0,b1s1𝒮νS1(s1)𝔼θ,ν|S1=s1[γ~,tθ(o0,b1;ω)]logπb(b1|s1,o0;θb).\bar{Q}_{b}(\theta^{\prime}_{b}|\theta)=\sum_{o_{0},b_{1}}\sum_{s_{1}\in\mathcal{S}}\nu^{*}_{S_{1}}(s_{1})\mathbb{E}_{\theta^{*},\nu^{*}|S_{1}=s_{1}}\left[\tilde{\gamma}^{\theta}_{\infty,t}(o_{0},b_{1};\omega)\right]\log\pi_{b}(b_{1}|s_{1},o_{0};\theta^{\prime}_{b}).

In the case of tabular parameterization, πb(b1|s1,o0;θb)\pi_{b}(b_{1}|s_{1},o_{0};\theta^{\prime}_{b}) is an entry of θb\theta^{\prime}_{b} indexed as θb(b1,s1,o0)\theta^{\prime}_{b}(b_{1},s_{1},o_{0}); its logarithm is 1-strongly concave on the interval [0,1][0,1]. Q¯b(θb|θ)\bar{Q}_{b}(\theta^{\prime}_{b}|\theta) is strongly concave with respect to θb\theta^{\prime}_{b} if 𝔼θ,ν|S1=s1[γ~,tθ(o0,b1;ω)]\mathbb{E}_{\theta^{*},\nu^{*}|S_{1}=s_{1}}[\tilde{\gamma}^{\theta}_{\infty,t}(o_{0},b_{1};\omega)] is strictly positive for all o0o_{0} and b1b_{1}. We speculate that this requirement is mild, but a rigorous characterization is quite challenging.

C.3 The convergence of the population version algorithm

We first present the complete version of Theorem 3, where an upper bound on γ\gamma is also shown. Notice that we assume all the assumptions, including Assumption 4 and 5.

Theorem 8 (The complete version of Theorem 3).

With all the assumptions,

  1. 1.

    (First-order stability) There exists 0<γγ¯0<\gamma\leq\bar{\gamma} such that for all θΘr\theta\in\mathit{\Theta}_{r},

    Q¯(M¯(θ)|θ)Q¯(M¯(θ)|θ)2γθθ2.\left\|{\nabla\bar{Q}(\bar{M}(\theta)|\theta)-\nabla\bar{Q}(\bar{M}(\theta)|\theta^{*})}\right\|_{2}\leq\gamma\left\|{\theta-\theta^{*}}\right\|_{2}.

    Specifically, the upper bound γ¯\bar{\gamma} is given by

    γ¯=4|𝒪|Lθ,rεb2ζ(supθΘrzθ,θ)(2maxo0,s1,b1supθbΘblogπb(b1|s1,o0;θb)2+maxs1,a1,o1supθloΘlologπlo(a1|s1,o1;θlo)2+maxs1,o1supθhiΘhilogπhi(o1|s1;θhi)2).\bar{\gamma}=\frac{4|\mathcal{O}|L_{\theta^{*},r}}{\varepsilon^{2}_{b}\zeta}\left(\sup_{\theta^{\prime}\in\mathit{\Theta}_{r}}z_{\theta^{\prime},\theta^{*}}\right)\bigg{(}2\max_{o_{0},s_{1},b_{1}}\sup_{\theta^{\prime}_{b}\in\mathit{\Theta}_{b}}\left\|{\nabla\log\pi_{b}(b_{1}|s_{1},o_{0};\theta^{\prime}_{b})}\right\|_{2}\\ +\max_{s_{1},a_{1},o_{1}}\sup_{\theta^{\prime}_{lo}\in\mathit{\Theta}_{lo}}\left\|{\nabla\log\pi_{lo}(a_{1}|s_{1},o_{1};\theta^{\prime}_{lo})}\right\|_{2}+\max_{s_{1},o_{1}}\sup_{\theta^{\prime}_{hi}\in\mathit{\Theta}_{hi}}\left\|{\nabla\log\pi_{hi}(o_{1}|s_{1};\theta^{\prime}_{hi})}\right\|_{2}\bigg{)}.

    ζ\zeta is the failure parameter in the options with failure framework; εb\varepsilon_{b} is a mixing constant defined in Lemma D.1; Lθ,rL_{\theta^{*},r} is a Lipschitz constant defined in Lemma D.2; zθ,θz_{\theta^{\prime},\theta^{*}} is defined in Lemma D.5.

  2. 2.

    (Contraction) Let κ=γ/λ\kappa=\gamma/\lambda. For all θΘr\theta\in\mathit{\Theta}_{r},

    M¯(θ)θ2κθθ2.\left\|{\bar{M}(\theta)-\theta^{*}}\right\|_{2}\leq\kappa\left\|{\theta-\theta^{*}}\right\|_{2}.

    If κ<1\kappa<1, the population version algorithm converges linearly to the true parameter θ\theta^{*}.

Proof of Theorem 8.

We prove the two parts separately in the following.

1. For convenience of notation, let Q¯(θ|θ)=[bQ¯(θ|θ),loQ¯(θ|θ),hiQ¯(θ|θ)]\nabla\bar{Q}(\theta^{\prime}|\theta)=[\nabla_{b}\bar{Q}(\theta^{\prime}|\theta),\nabla_{lo}\bar{Q}(\theta^{\prime}|\theta),\nabla_{hi}\bar{Q}(\theta^{\prime}|\theta)] such that, for example, bQ¯(θ|θ)\nabla_{b}\bar{Q}(\theta^{\prime}|\theta) is the gradient of Q¯(θ|θ)\bar{Q}(\theta^{\prime}|\theta) with respect to θb\theta^{\prime}_{b}. Using the expressions of Q¯(θ|θ)\nabla\bar{Q}(\theta^{\prime}|\theta) from Theorem 7, we have

Q¯(M¯(θ)|θ)Q¯(M¯(θ)|θ)2bQ¯(M¯(θ)|θ)bQ¯(M¯(θ)|θ)2+loQ¯(M¯(θ)|θ)loQ¯(M¯(θ)|θ)2+hiQ¯(M¯(θ)|θ)hiQ¯(M¯(θ)|θ)2.\left\|{\nabla\bar{Q}(\bar{M}(\theta)|\theta)-\nabla\bar{Q}(\bar{M}(\theta)|\theta^{*})}\right\|_{2}\leq\left\|{\nabla_{b}\bar{Q}(\bar{M}(\theta)|\theta)-\nabla_{b}\bar{Q}(\bar{M}(\theta)|\theta^{*})}\right\|_{2}\\ +\left\|{\nabla_{lo}\bar{Q}(\bar{M}(\theta)|\theta)-\nabla_{lo}\bar{Q}(\bar{M}(\theta)|\theta^{*})}\right\|_{2}+\left\|{\nabla_{hi}\bar{Q}(\bar{M}(\theta)|\theta)-\nabla_{hi}\bar{Q}(\bar{M}(\theta)|\theta^{*})}\right\|_{2}.

Consider the first term,

bQ¯(M¯(θ)|θ)bQ¯(M¯(θ)|θ)2\displaystyle\left\|{\nabla_{b}\bar{Q}(\bar{M}(\theta)|\theta)-\nabla_{b}\bar{Q}(\bar{M}(\theta)|\theta^{*})}\right\|_{2}
=\displaystyle=~{} 𝔼θ,ν{o0,b1[γ~,1θ(o0,b1;ω)γ~,1θ(o0,b1;ω)][logπb(b1|ω(s1),o0;M¯(θ)b)]}2\displaystyle\left\|{\mathbb{E}_{\theta^{*},\nu^{*}}\bigg{\{}\sum_{o_{0},b_{1}}\left[\tilde{\gamma}^{\theta}_{\infty,1}(o_{0},b_{1};\omega)-\tilde{\gamma}^{\theta^{*}}_{\infty,1}(o_{0},b_{1};\omega)\right]\left[\nabla\log\pi_{b}(b_{1}|\omega(s_{1}),o_{0};\bar{M}(\theta)_{b})\right]\bigg{\}}}\right\|_{2}
\displaystyle\leq~{} o0,b1𝔼θ,ν{[γ~θ,1(o0,b1;ω)γ~θ,1(o0,b1;ω)][logπb(b1|ω(s1),o0;M¯(θ)b)]}2\displaystyle\sum_{o_{0},b_{1}}\left\|{\mathbb{E}_{\theta^{*},\nu^{*}}\bigg{\{}\left[\tilde{\gamma}^{\theta}_{\infty,1}(o_{0},b_{1};\omega)-\tilde{\gamma}^{\theta^{*}}_{\infty,1}(o_{0},b_{1};\omega)\right]\left[\nabla\log\pi_{b}(b_{1}|\omega(s_{1}),o_{0};\bar{M}(\theta)_{b})\right]\bigg{\}}}\right\|_{2}
\displaystyle\leq~{} o0,b1𝔼θ,ν{|γ~θ,1(o0,b1;ω)γ~θ,1(o0,b1;ω)|logπb(b1|ω(s1),o0;M¯(θ)b)2}\displaystyle\sum_{o_{0},b_{1}}\mathbb{E}_{\theta^{*},\nu^{*}}\bigg{\{}\left|\tilde{\gamma}^{\theta}_{\infty,1}(o_{0},b_{1};\omega)-\tilde{\gamma}^{\theta^{*}}_{\infty,1}(o_{0},b_{1};\omega)\right|\left\|{\nabla\log\pi_{b}(b_{1}|\omega(s_{1}),o_{0};\bar{M}(\theta)_{b})}\right\|_{2}\bigg{\}}
\displaystyle\leq~{} maxo0,s1,b1supθbΘblogπb(b1|s1,o0;θb)2𝔼θ,ν{o0,b1|γ~θ,1(o0,b1;ω)γ~θ,1(o0,b1;ω)|}\displaystyle\max_{o_{0},s_{1},b_{1}}\sup_{\theta^{\prime}_{b}\in\mathit{\Theta}_{b}}\left\|{\nabla\log\pi_{b}(b_{1}|s_{1},o_{0};\theta^{\prime}_{b})}\right\|_{2}\mathbb{E}_{\theta^{*},\nu^{*}}\bigg{\{}\sum_{o_{0},b_{1}}\left|\tilde{\gamma}^{\theta}_{\infty,1}(o_{0},b_{1};\omega)-\tilde{\gamma}^{\theta^{*}}_{\infty,1}(o_{0},b_{1};\omega)\right|\bigg{\}}
\displaystyle\leq~{} 2maxo0,s1,b1supθbΘblogπb(b1|s1,o0;θb)2×supωΩγ~θ,1(ω)γ~θ,1(ω)TV\displaystyle 2\max_{o_{0},s_{1},b_{1}}\sup_{\theta^{\prime}_{b}\in\mathit{\Theta}_{b}}\left\|{\nabla\log\pi_{b}(b_{1}|s_{1},o_{0};\theta^{\prime}_{b})}\right\|_{2}\times\sup_{\omega\in\Omega}\left\|{\tilde{\gamma}^{\theta}_{\infty,1}(\omega)-\tilde{\gamma}^{\theta^{*}}_{\infty,1}(\omega)}\right\|_{\rm TV}
\displaystyle\leq~{} 8|𝒪|Lθ,rε2bζ(supθΘrzθ,θ)(maxo0,s1,b1supθbΘblogπb(b1|s1,o0;θb)2)θθ2.\displaystyle\frac{8|\mathcal{O}|L_{\theta^{*},r}}{\varepsilon^{2}_{b}\zeta}\left(\sup_{\theta^{\prime}\in\mathit{\Theta}_{r}}z_{\theta^{\prime},\theta^{*}}\right)\left(\max_{o_{0},s_{1},b_{1}}\sup_{\theta^{\prime}_{b}\in\mathit{\Theta}_{b}}\left\|{\nabla\log\pi_{b}(b_{1}|s_{1},o_{0};\theta^{\prime}_{b})}\right\|_{2}\right)\left\|{\theta-\theta^{*}}\right\|_{2}.

We use the triangle inequality and the Jensen’s inequality in the third and the fourth line respectively. The fifth line is finite due to θb\theta_{b} being compact and the continuity of the gradient (Assumption 2). The last line is due to the limit form of Lemma D.7, similar to the argument in Appendix D.4. Notice that the coefficient of θθ2\|{\theta-\theta^{*}}\|_{2} on the last line does not depend on θ\theta.

Analogously, we have

loQ¯(M¯(θ)|θ)loQ¯(M¯(θ)|θ)24|𝒪|Lθ,rε2bζ(supθΘrzθ,θ)(maxs1,a1,o1supθloΘlologπlo(a1|s1,o1;θlo)2)θθ2,\left\|{\nabla_{lo}\bar{Q}(\bar{M}(\theta)|\theta)-\nabla_{lo}\bar{Q}(\bar{M}(\theta)|\theta^{*})}\right\|_{2}\leq\\ \frac{4|\mathcal{O}|L_{\theta^{*},r}}{\varepsilon^{2}_{b}\zeta}\left(\sup_{\theta^{\prime}\in\mathit{\Theta}_{r}}z_{\theta^{\prime},\theta^{*}}\right)\left(\max_{s_{1},a_{1},o_{1}}\sup_{\theta^{\prime}_{lo}\in\mathit{\Theta}_{lo}}\left\|{\nabla\log\pi_{lo}(a_{1}|s_{1},o_{1};\theta^{\prime}_{lo})}\right\|_{2}\right)\left\|{\theta-\theta^{*}}\right\|_{2},
hiQ¯(M¯(θ)|θ)hiQ¯(M¯(θ)|θ)24|𝒪|Lθ,rε2bζ(supθΘrzθ,θ)(maxs1,o1supθhiΘhilogπhi(o1|s1;θhi)2)θθ2.\left\|{\nabla_{hi}\bar{Q}(\bar{M}(\theta)|\theta)-\nabla_{hi}\bar{Q}(\bar{M}(\theta)|\theta^{*})}\right\|_{2}\leq\\ \frac{4|\mathcal{O}|L_{\theta^{*},r}}{\varepsilon^{2}_{b}\zeta}\left(\sup_{\theta^{\prime}\in\mathit{\Theta}_{r}}z_{\theta^{\prime},\theta^{*}}\right)\left(\max_{s_{1},o_{1}}\sup_{\theta^{\prime}_{hi}\in\mathit{\Theta}_{hi}}\left\|{\nabla\log\pi_{hi}(o_{1}|s_{1};\theta^{\prime}_{hi})}\right\|_{2}\right)\left\|{\theta-\theta^{*}}\right\|_{2}.

Combining everything, we have the upper bound on γ\gamma.

2. The proof of the second part mirrors the proof of [3, Theorem 4]. The main difference is the construction of the following self-consistency (a.k.a. fixed-point) condition.

Lemma C.5 (Self-consistency).

With all the assumptions, θ=M¯(θ)\theta^{*}=\bar{M}(\theta^{*}).

The proof of this lemma is presented in Appendix D.7. Such a condition is used without proof in [3] since it only considers i.i.d. samples, and the self-consistency condition for EM with i.i.d. samples is a well-known result. However, for the case of dependent samples like our graphical model, such a condition results from the stochastic convergence of the QQ-function which is not immediate.

For the rest of the proof, we present a brief sketch here for completeness. Due to concavity, we have the first order optimality conditions: for all θ,θΘ\theta,\theta^{\prime}\in\mathit{\Theta}, Q¯(M¯(θ)|θ),θM¯(θ)0\langle\nabla\bar{Q}(\bar{M}(\theta^{*})|\theta^{*}),\theta-\bar{M}(\theta^{*})\rangle\leq 0 and Q¯(M¯(θ)|θ),θM¯(θ)0\langle\nabla\bar{Q}(\bar{M}(\theta)|\theta),\theta^{\prime}-\bar{M}(\theta)\rangle\leq 0. Using θ=M¯(θ)\theta^{*}=\bar{M}(\theta^{*}), we can combine the two optimality conditions together and obtain the following. For all θΘ\theta\in\mathit{\Theta},

Q¯(M¯(θ)|θ)Q¯(θ|θ),θM¯(θ)Q¯(M¯(θ)|θ)Q¯(M¯(θ)|θ),θM¯(θ).\langle\nabla\bar{Q}(\bar{M}(\theta)|\theta^{*})-\nabla\bar{Q}(\theta^{*}|\theta^{*}),\theta^{*}-\bar{M}(\theta)\rangle\leq\langle\nabla\bar{Q}(\bar{M}(\theta)|\theta^{*})-\nabla\bar{Q}(\bar{M}(\theta)|\theta),\theta^{*}-\bar{M}(\theta)\rangle.

From Assumption 4, LHSλθM¯(θ)22\text{LHS}\geq\lambda\|{\theta^{*}-\bar{M}(\theta)}\|_{2}^{2}. From Cauchy-Schwarz and the first part of this theorem, RHSγθM¯(θ)2θθ2\text{RHS}\leq\gamma\|{\theta^{*}-\bar{M}(\theta)}\|_{2}\|{\theta-\theta^{*}}\|_{2}. Canceling θM¯(θ)2\|{\theta^{*}-\bar{M}(\theta)}\|_{2} on both sides completes the proof. ∎

C.4 Proof of Theorem 4

1. We first show the strong consistency of Mμ,T(θ;ω)M_{\mu,T}(\theta;\omega), the parameter update of Algorithm 1, as an estimator of M¯(θ)\bar{M}(\theta). This follows from standard techniques in the analysis of M-estimators. In particular, consider the set of sample paths ω\omega such that ωΩ\omega\in\Omega and argmaxθΘQμ,T(θ|θ;ω)\operatorname*{arg\,max}_{\theta^{\prime}\in\mathit{\Theta}}Q_{\mu,T}(\theta^{\prime}|\theta;\omega) has a unique element Mμ,T(θ;ω)M_{\mu,T}(\theta;\omega). Such a set of sample paths has probability measure 1.

For all θΘ\theta\in\mathit{\Theta}, T2T\geq 2 and μ\mu\in\mathcal{M}, with one of the above sample path ω\omega,

0\displaystyle 0 Q¯(M¯(θ)|θ)Q¯(Mμ,T(θ;ω)|θ)\displaystyle\leq\bar{Q}(\bar{M}(\theta)|\theta)-\bar{Q}(M_{\mu,T}(\theta;\omega)|\theta)
Q¯(M¯(θ)|θ)Qμ,T(M¯(θ)|θ;ω)+Qμ,T(M¯(θ)|θ;ω)Qμ,T(Mμ,T(θ;ω)|θ;ω)\displaystyle\leq\bar{Q}(\bar{M}(\theta)|\theta)-Q_{\mu,T}(\bar{M}(\theta)|\theta;\omega)+Q_{\mu,T}(\bar{M}(\theta)|\theta;\omega)-Q_{\mu,T}(M_{\mu,T}(\theta;\omega)|\theta;\omega)
+Qμ,T(Mμ,T(θ;ω)|θ;ω)Q¯(MT(θ;ω)|θ)\displaystyle\hskip 200.0003pt+Q_{\mu,T}(M_{\mu,T}(\theta;\omega)|\theta;\omega)-\bar{Q}(M_{T}(\theta;\omega)|\theta)
2supθΘ|Q¯(θ|θ)Qμ,T(θ|θ;ω)|.\displaystyle\leq 2\sup_{\theta^{\prime}\in\mathit{\Theta}}\left|\bar{Q}(\theta^{\prime}|\theta)-Q_{\mu,T}(\theta^{\prime}|\theta;\omega)\right|.

From Theorem 7, θ,ν\mathbb{P}_{\theta^{*},\nu^{*}}-almost surely, supθ,θΘsupμ|Q¯(θ|θ)Qμ,T(θ|θ;ω)|0\sup_{\theta,\theta^{\prime}\in\mathit{\Theta}}\sup_{\mu\in\mathcal{M}}|\bar{Q}(\theta^{\prime}|\theta)-Q_{\mu,T}(\theta^{\prime}|\theta;\omega)|\rightarrow 0 as TT\rightarrow\infty. Therefore,

supθΘrsupμ[Q¯(M¯(θ)|θ)Q¯(Mμ,T(θ;ω)|θ)]0,Pθ,ν-a.s.\sup_{\theta\in\mathit{\Theta}_{r}}\sup_{\mu\in\mathcal{M}}\left[\bar{Q}(\bar{M}(\theta)|\theta)-\bar{Q}(M_{\mu,T}(\theta;\omega)|\theta)\right]\rightarrow 0,~{}P_{\theta^{*},\nu^{*}}\text{-a.s.}

An equivalent argument is the following. θ,ν\mathbb{P}_{\theta^{*},\nu^{*}}-almost surely, for any δ>0\delta>0 there exists Tω+T_{\omega}\in\mathbb{N}_{+} such that for all TTωT\geq T_{\omega}, supθΘrsupμ[Q¯(M¯(θ)|θ)Q¯(Mμ,T(θ;ω)|θ)]δ\sup_{\theta\in\mathit{\Theta}_{r}}\sup_{\mu\in\mathcal{M}}[\bar{Q}(\bar{M}(\theta)|\theta)-\bar{Q}(M_{\mu,T}(\theta;\omega)|\theta)]\leq\delta. In particular, for any ε>0\varepsilon>0, let

δ=12infθΘr[Q¯(M¯(θ)|θ)supθΘ;θM¯(θ)2εQ¯(θ|θ)].\delta=\frac{1}{2}\inf_{\theta\in\mathit{\Theta}_{r}}\bigg{[}\bar{Q}(\bar{M}(\theta)|\theta)-\sup_{\theta^{\prime}\in\mathit{\Theta};\|{\theta^{\prime}-\bar{M}(\theta)}\|_{2}\geq\varepsilon}\bar{Q}(\theta^{\prime}|\theta)\bigg{]}.

From the identifiability assumption (Assumption 5), the RHS is positive. Therefore, such an assignment of δ\delta is valid. Consequently, for all TTωT\geq T_{\omega}, θΘr\theta\in\mathit{\Theta}_{r} and μ\mu\in\mathcal{M},

Q¯(M¯(θ)|θ)Q¯(Mμ,T(θ;ω)|θ)<Q¯(M¯(θ)|θ)supθΘ;θM¯(θ)2εQ¯(θ|θ),\bar{Q}(\bar{M}(\theta)|\theta)-\bar{Q}(M_{\mu,T}(\theta;\omega)|\theta)<\bar{Q}(\bar{M}(\theta)|\theta)-\sup_{\theta^{\prime}\in\mathit{\Theta};\|{\theta^{\prime}-\bar{M}(\theta)}\|_{2}\geq\varepsilon}\bar{Q}(\theta^{\prime}|\theta),

which means that Mμ,T(θ;ω)M¯(θ)2<ε\|{M_{\mu,T}(\theta;\omega)-\bar{M}(\theta)}\|_{2}<\varepsilon. Taking supremum over θΘr\theta\in\mathit{\Theta}_{r} and μ\mu\in\mathcal{M}, we summarize the argument as the following. θ,ν\mathbb{P}_{\theta^{*},\nu^{*}}-almost surely, for any ε>0\varepsilon>0 there exists Tω+T_{\omega}\in\mathbb{N}_{+} such that for all TTωT\geq T_{\omega},

supθΘrsupμMμ,T(θ;ω)M¯(θ)2<ε.\sup_{\theta\in\mathit{\Theta}_{r}}\sup_{\mu\in\mathcal{M}}\left\|{M_{\mu,T}(\theta;\omega)-\bar{M}(\theta)}\right\|_{2}<\varepsilon.

Such a result is equivalent to the uniform (in θ\theta and μ\mu) strong consistency of Mμ,T(θ;ω)M_{\mu,T}(\theta;\omega) as an estimator of M¯(θ)\bar{M}(\theta). As TT\rightarrow\infty,

supθΘrsupμMμ,T(θ;ω)M¯(θ)20,Pθ,ν-a.s.\sup_{\theta\in\mathit{\Theta}_{r}}\sup_{\mu\in\mathcal{M}}\left\|{M_{\mu,T}(\theta;\omega)-\bar{M}(\theta)}\right\|_{2}\rightarrow 0,~{}P_{\theta^{*},\nu^{*}}\text{-a.s.}

This result is insufficient for Part 1, since TωT_{\omega} is sample path dependent. To get rid of this sample path dependency, we use the dominated convergence theorem. Notice that θ,ν\mathbb{P}_{\theta^{*},\nu^{*}}-almost surely, for all T2T\geq 2, supθΘrsupμMμ,T(θ;ω)M¯(θ)2\sup_{\theta\in\mathit{\Theta}_{r}}\sup_{\mu\in\mathcal{M}}\|{M_{\mu,T}(\theta;\omega)-\bar{M}(\theta)}\|_{2} is bounded due to the compactness of Θ\mathit{\Theta}. Therefore we have

limT𝔼θ,ν[supθΘrsupμMμ,T(θ;ω)M¯(θ)2]=0.\lim_{T\rightarrow\infty}\mathbb{E}_{\theta^{*},\nu^{*}}\left[\sup_{\theta\in\mathit{\Theta}_{r}}\sup_{\mu\in\mathcal{M}}\left\|{M_{\mu,T}(\theta;\omega)-\bar{M}(\theta)}\right\|_{2}\right]=0.

For any q>0q>0, there exists T¯(q)+\underline{T}(q)\in\mathbb{N}_{+} such that for all TT¯(q)T\geq\underline{T}(q),

𝔼θ,ν[supθΘrsupμMμ,T(θ;ω)M¯(θ)2]q.\mathbb{E}_{\theta^{*},\nu^{*}}\left[\sup_{\theta\in\mathit{\Theta}_{r}}\sup_{\mu\in\mathcal{M}}\left\|{M_{\mu,T}(\theta;\omega)-\bar{M}(\theta)}\right\|_{2}\right]\leq q.

Applying Markov’s inequality, for any Δ>0\Delta>0,

θ,ν(supθΘrsupμMμ,T(θ;ω)M¯(θ)2Δ)1Δ𝔼θ,ν[supθΘrsupμMμ,T(θ;ω)M¯(θ)2]qΔ.\mathbb{P}_{\theta^{*},\nu^{*}}\left(\sup_{\theta\in\mathit{\Theta}_{r}}\sup_{\mu\in\mathcal{M}}\left\|{M_{\mu,T}(\theta;\omega)-\bar{M}(\theta)}\right\|_{2}\geq\Delta\right)\leq\frac{1}{\Delta}\mathbb{E}_{\theta^{*},\nu^{*}}\left[\sup_{\theta\in\mathit{\Theta}_{r}}\sup_{\mu\in\mathcal{M}}\left\|{M_{\mu,T}(\theta;\omega)-\bar{M}(\theta)}\right\|_{2}\right]\leq\frac{q}{\Delta}.

Scaling qq yields the desirable result.

2. The proof of Part 2 is the same as [3, Theorem 5]. We present a sketch for completeness. For all TT¯(Δ,q)T\geq\underline{T}(\Delta,q), condition the following proof on the high probability event that supθΘrsupμMμ,T(θ;ω)M¯(θ)2Δ\sup_{\theta\in\mathit{\Theta}_{r}}\sup_{\mu\in\mathcal{M}}\left\|{M_{\mu,T}(\theta;\omega)-\bar{M}(\theta)}\right\|_{2}\leq\Delta.

Assume θ(n1)θ2r\|{\theta^{(n-1)}-\theta^{*}}\|_{2}\leq r, which holds for n=1n=1. Then, using the triangle inequality, the result from Theorem 3, the above concentration and Δ(1κ)r\Delta\leq(1-\kappa)r, we have the following for any μ\mu.

θ(n)θ2\displaystyle\left\|{\theta^{(n)}-\theta^{*}}\right\|_{2} M¯(θ(n1))θ2+Mμ,T(θ(n1))M¯(θ(n1))2\displaystyle\leq\left\|{\bar{M}(\theta^{(n-1)})-\theta^{*}}\right\|_{2}+\left\|{M_{\mu,T}(\theta^{(n-1)})-\bar{M}(\theta^{(n-1)})}\right\|_{2}
κθ(n1)θ2+Δ,\displaystyle\leq\kappa\|{\theta^{(n-1)}-\theta^{*}}\|_{2}+\Delta, (20)

and θ(n)θ2κr+(1κ)r=r\|{\theta^{(n)}-\theta^{*}}\|_{2}\leq\kappa r+(1-\kappa)r=r. From induction, the one step relation (20) holds for all n+n\in\mathbb{N}_{+}. Unrolling (20) and regrouping the terms completes the proof. ∎

Appendix D Proofs of auxiliary lemmas

This section presents proofs omitted in earlier sections. Assumptions 1, 2 and 3 are assumed.

In particular, the first three subsections develop a few essential lemmas required for the proofs in later subsections. In Appendix D.1, we show an important mixing property of the options with failure framework. In Appendix D.2, such a mixing property is used to prove a general contraction result of our forward-backward smoothing procedure (Theorem 1 and Corollary 6), similar to the concept of filtering stability in the HMM literature. At a high level, considering the forward-backward recursion in the extended graphical model (Corollary 6), this result characterizes the effect of changing θ\theta and the boundary conditions αθk,0\alpha^{\theta}_{k,0} and βθk,T\beta^{\theta}_{k,T} on the smoothing distribution γθk,t\gamma^{\theta}_{k,t}, given any observation sequence {st,at}t\{s_{t},a_{t}\}_{t\in\mathbb{Z}}. Due to this high level reasoning, we name this result as the smoothing stability lemma. Appendix D.3 provides concrete applications of this lemma to quantities defined in earlier sections.

D.1 Mixing

Recall that ζ\zeta is the auxiliary parameter in the options with failure framework.

Lemma D.1 (Mixing).

There exists a constant εb>0\varepsilon_{b}>0 and a conditional distribution π¯o,b(ot,bt|st;θ)\bar{\pi}_{o,b}(o_{t},b_{t}|s_{t};\theta) parameterized by θ\theta such that for all θΘ\theta\in\mathit{\Theta}, with any input arguments btb_{t}, sts_{t}, ot1o_{t-1} and oto_{t},

0<εbζπ¯o,b(ot,bt|st;θ)πb(bt|st,ot1;θb)π¯hi(ot|st,ot1,bt;θhi)εb1|𝒪|π¯o,b(ot,bt|st;θ).0<\varepsilon_{b}\zeta\bar{\pi}_{o,b}(o_{t},b_{t}|s_{t};\theta)\leq\pi_{b}(b_{t}|s_{t},o_{t-1};\theta_{b})\bar{\pi}_{hi}(o_{t}|s_{t},o_{t-1},b_{t};\theta_{hi})\leq\varepsilon_{b}^{-1}|\mathcal{O}|\bar{\pi}_{o,b}(o_{t},b_{t}|s_{t};\theta).
Proof of Lemma D.1.

The proof is separated into two parts.

1. We first show an intermediate result: there exists a constant εb>0\varepsilon_{b}>0 and a conditional distribution π¯b(bt|st;θb)\bar{\pi}_{b}(b_{t}|s_{t};\theta_{b}) parameterized by θb\theta_{b} such that for all θbΘb\theta_{b}\in\mathit{\Theta}_{b}, with any input arguments btb_{t}, sts_{t} and ot1o_{t-1},

0<εbπ¯b(bt|st;θb)πb(bt|st,ot1;θb)εb1π¯b(bt|st;θb).0<\varepsilon_{b}\bar{\pi}_{b}(b_{t}|s_{t};\theta_{b})\leq\pi_{b}(b_{t}|s_{t},o_{t-1};\theta_{b})\leq\varepsilon_{b}^{-1}\bar{\pi}_{b}(b_{t}|s_{t};\theta_{b}).

This can be proved as follows. Let cb=infθbΘbminbt,st,ot1πb(bt|st,ot1;θb)c_{b}=\inf_{\theta_{b}\in\mathit{\Theta}_{b}}\min_{b_{t},s_{t},o_{t-1}}\pi_{b}(b_{t}|s_{t},o_{t-1};\theta_{b}). Similar to the procedure in Appendix A, from the non-degeneracy assumption, the differentiabiilty assumption and Θ\mathit{\Theta} being compact, we have cb>0c_{b}>0. For any θbΘb\theta_{b}\in\Theta_{b}, with any input arguments btb_{t} and sts_{t}, let f(bt,st;θb)=minot1𝒪πb(bt|st,ot1;θb)f(b_{t},s_{t};\theta_{b})=\min_{o_{t-1}\in\mathcal{O}}\pi_{b}(b_{t}|s_{t},o_{t-1};\theta_{b}). Observe that cbf(bt,st;θb)1c_{b}\leq f(b_{t},s_{t};\theta_{b})\leq 1. Let εb=cb/2\varepsilon_{b}=c_{b}/2 and

π¯b(bt|st;θb)=f(bt,st;θb)bt{0,1}f(bt,st;θb).\bar{\pi}_{b}(b_{t}|s_{t};\theta_{b})=\frac{f(b_{t},s_{t};\theta_{b})}{\sum_{b^{\prime}_{t}\in\{0,1\}}f(b^{\prime}_{t},s_{t};\theta_{b})}.

Clearly εbπ¯b(bt|st;θb)>0\varepsilon_{b}\bar{\pi}_{b}(b_{t}|s_{t};\theta_{b})>0. Moreover, for any ot1o_{t-1}, εbπ¯b(bt|st;θb)<2cbπ¯b(bt|st;θb)f(bt,st;θb)πb(bt|st,ot1;θb)\varepsilon_{b}\bar{\pi}_{b}(b_{t}|s_{t};\theta_{b})<2c_{b}\bar{\pi}_{b}(b_{t}|s_{t};\theta_{b})\leq f(b_{t},s_{t};\theta_{b})\leq\pi_{b}(b_{t}|s_{t},o_{t-1};\theta_{b}).

On the other hand, with any input arguments,

εb1π¯b(bt|st;θb)εb1cb/2=1πb(bt|st,ot1;θb),\varepsilon_{b}^{-1}\bar{\pi}_{b}(b_{t}|s_{t};\theta_{b})\geq\varepsilon_{b}^{-1}c_{b}/2=1\geq\pi_{b}(b_{t}|s_{t},o_{t-1};\theta_{b}),

which completes the proof of the first part.

2. Define π¯o,b(ot,bt|st;θ)\bar{\pi}_{o,b}(o_{t},b_{t}|s_{t};\theta) as follows. With any input arguments, let

π¯o,b(ot,bt=0|st;θ):=π¯b(bt=0|st;θb)/|𝒪|,\displaystyle\bar{\pi}_{o,b}(o_{t},b_{t}=0|s_{t};\theta)\mathrel{\mathop{:}}=\bar{\pi}_{b}(b_{t}=0|s_{t};\theta_{b})/|\mathcal{O}|,
π¯o,b(ot,bt=1|st;θ):=π¯b(bt=1|st;θb)πhi(ot|st;θhi).\displaystyle\bar{\pi}_{o,b}(o_{t},b_{t}=1|s_{t};\theta)\mathrel{\mathop{:}}=\bar{\pi}_{b}(b_{t}=1|s_{t};\theta_{b})\pi_{hi}(o_{t}|s_{t};\theta_{hi}).

Clearly εbζπ¯o,b(ot,bt|st;θ)>0\varepsilon_{b}\zeta\bar{\pi}_{o,b}(o_{t},b_{t}|s_{t};\theta)>0. Omit the dependency on θ\theta for a cleaner notation since every term is parameterized by θ\theta. When bt=1b_{t}=1, with any other input arguments,

εbπ¯b(bt=1|st)πhi(ot|st)πb(bt=1|st,ot1)π¯hi(ot|st,ot1,bt=1)εb1π¯b(bt=1|st)πhi(ot|st).\varepsilon_{b}\bar{\pi}_{b}(b_{t}=1|s_{t})\pi_{hi}(o_{t}|s_{t})\leq\pi_{b}(b_{t}=1|s_{t},o_{t-1})\bar{\pi}_{hi}(o_{t}|s_{t},o_{t-1},b_{t}=1)\leq\varepsilon_{b}^{-1}\bar{\pi}_{b}(b_{t}=1|s_{t})\pi_{hi}(o_{t}|s_{t}).

Similarly, when bt=0b_{t}=0 and ot=ot1o_{t}=o_{t-1},

εbπ¯b(bt=0|st)ζ/|𝒪|\displaystyle\varepsilon_{b}\bar{\pi}_{b}(b_{t}=0|s_{t})\zeta/|\mathcal{O}| εbπ¯b(bt=0|st)(1|𝒪|1|𝒪|ζ)\displaystyle\leq\varepsilon_{b}\bar{\pi}_{b}(b_{t}=0|s_{t})\bigg{(}1-\frac{|\mathcal{O}|-1}{|\mathcal{O}|}\zeta\bigg{)}
πb(bt=0|st,ot1)π¯hi(ot=ot1|st,ot1,bt=0)\displaystyle\leq\pi_{b}(b_{t}=0|s_{t},o_{t-1})\bar{\pi}_{hi}(o_{t}=o_{t-1}|s_{t},o_{t-1},b_{t}=0)
εb1π¯b(bt=0|st).\displaystyle\leq\varepsilon_{b}^{-1}\bar{\pi}_{b}(b_{t}=0|s_{t}).

Finally, when bt=0b_{t}=0 and otot1o_{t}\neq o_{t-1},

εbπ¯b(bt=0|st)ζ/|𝒪|πb(bt=0|st,ot1)π¯hi(ot|st,ot1,bt=0)εb1π¯b(bt=0|st)ζ/|𝒪|.\varepsilon_{b}\bar{\pi}_{b}(b_{t}=0|s_{t})\zeta/|\mathcal{O}|\leq\pi_{b}(b_{t}=0|s_{t},o_{t-1})\bar{\pi}_{hi}(o_{t}|s_{t},o_{t-1},b_{t}=0)\leq\varepsilon_{b}^{-1}\bar{\pi}_{b}(b_{t}=0|s_{t})\zeta/|\mathcal{O}|.

Combining the above cases and the definition of π¯o,b(ot,bt|st;θ)\bar{\pi}_{o,b}(o_{t},b_{t}|s_{t};\theta) completes the proof. ∎

D.2 Smoothing stability

Before stating the smoothing stability lemma, we introduce a few definitions. The quantities defined in this subsection depend on an observation sequence {st,at}t\{s_{t},a_{t}\}_{t\in\mathbb{Z}}, but such a dependency is usually omitted to simplify the notation, unless specified otherwise. Consistent with our notations so far, in the following we make extensive use of the proportional symbol \propto.

D.2.1 Forward and backward recursion operators

With any given observation sequence {st,at}t\{s_{t},a_{t}\}_{t\in\mathbb{Z}} and any θΘ\theta\in\mathit{\Theta}, define the filtering operator FθtF^{\theta}_{t} as the following. For any probability measure φ\varphi over 𝒪×{0,1}\mathcal{O}\times\{0,1\}, FθtφF^{\theta}_{t}\varphi is also a probability measure such that with any input arguments oto_{t} and btb_{t},

Fθtφ(ot,bt)ot1,bt1πb(bt|st,ot1;θb)π¯hi(ot|st,ot1,bt;θhi)πlo(at|st,ot;θlo)φ(ot1,bt1).F^{\theta}_{t}\varphi(o_{t},b_{t})\propto\sum_{o_{t-1},b_{t-1}}\pi_{b}(b_{t}|s_{t},o_{t-1};\theta_{b})\bar{\pi}_{hi}(o_{t}|s_{t},o_{t-1},b_{t};\theta_{hi})\pi_{lo}(a_{t}|s_{t},o_{t};\theta_{lo})\varphi(o_{t-1},b_{t-1}). (21)

The RHS has exactly the form of the forward recursion, therefore the recursion on both αθk,t\alpha^{\theta}_{k,t} in (2) and αθμ,t\alpha^{\theta}_{\mu,t} in (10) can be expressed using FθtF^{\theta}_{t}. For generality, let {φθt}t\{\varphi^{\theta}_{t}\}_{t\in\mathbb{Z}} and {φ^θ^t}t\{\hat{\varphi}^{\hat{\theta}}_{t}\}_{t\in\mathbb{Z}} be any two indexed sets of probability measures such that Fθtφθt1=φθtF^{\theta}_{t}\varphi^{\theta}_{t-1}=\varphi^{\theta}_{t} and Fθ^tφ^θ^t1=φ^θ^tF^{\hat{\theta}}_{t}\hat{\varphi}^{\hat{\theta}}_{t-1}=\hat{\varphi}^{\hat{\theta}}_{t}. We restrict {φθt}t\{\varphi^{\theta}_{t}\}_{t\in\mathbb{Z}} and {φ^θ^t}t\{\hat{\varphi}^{\hat{\theta}}_{t}\}_{t\in\mathbb{Z}} to be strictly positive. Due to Assumption 1, such a restriction is valid. Notice that θ\theta and θ^\hat{\theta} here can be equal. We use the seemingly more complicated notation {φ^θ^t}t\{\hat{\varphi}^{\hat{\theta}}_{t}\}_{t\in\mathbb{Z}} because even if θ=θ^\theta=\hat{\theta}, {φθt}t\{\varphi^{\theta}_{t}\}_{t\in\mathbb{Z}} and {φ^θ^t}t\{\hat{\varphi}^{\hat{\theta}}_{t}\}_{t\in\mathbb{Z}} are still different; in this case they are just two different sets of probability measures satisfying the same recursion FθtF^{\theta}_{t}.

Similarly, we define the backward recursion operator BθtB^{\theta}_{t} as follows. For any probability measure ρ\rho over 𝒪×{0,1}\mathcal{O}\times\{0,1\}, BθtρB^{\theta}_{t}\rho is also a probability measure such that with any input arguments oto_{t} and btb_{t},

Bθtρ(ot,bt)ot+1,bt+1πb(bt+1|st+1,ot;θb)π¯hi(ot+1|st+1,ot,bt+1;θhi)×πlo(at+1|st+1,ot+1;θlo)ρ(ot+1,bt+1).B^{\theta}_{t}\rho(o_{t},b_{t})\propto\sum_{o_{t+1},b_{t+1}}\pi_{b}(b_{t+1}|s_{t+1},o_{t};\theta_{b})\bar{\pi}_{hi}(o_{t+1}|s_{t+1},o_{t},b_{t+1};\theta_{hi})\\ \times\pi_{lo}(a_{t+1}|s_{t+1},o_{t+1};\theta_{lo})\rho(o_{t+1},b_{t+1}). (22)

The recursion on both βθt|T\beta^{\theta}_{t|T} in (4) and βθk,t\beta^{\theta}_{k,t} in (11) can be expressed using BθtB^{\theta}_{t}. Let {ρθt}t\{\rho^{\theta}_{t}\}_{t\in\mathbb{Z}} and {ρ^θ^t}t\{\hat{\rho}^{\hat{\theta}}_{t}\}_{t\in\mathbb{Z}} be any two indexed sets of probability measures such that Bθtρθt+1=ρθtB^{\theta}_{t}\rho^{\theta}_{t+1}=\rho^{\theta}_{t} and Bθ^tρ^θ^t+1=ρ^θ^tB^{\hat{\theta}}_{t}\hat{\rho}^{\hat{\theta}}_{t+1}=\hat{\rho}^{\hat{\theta}}_{t}. We restrict {ρθt}t\{\rho^{\theta}_{t}\}_{t\in\mathbb{Z}} and {ρ^θ^t}t\{\hat{\rho}^{\hat{\theta}}_{t}\}_{t\in\mathbb{Z}} to be strictly positive.

The operation \otimes is defined as follows: {(φθρ^θ^)t}t\{(\varphi^{\theta}\otimes\hat{\rho}^{\hat{\theta}})_{t}\}_{t\in\mathbb{Z}} is an indexed set of probability measures such that for any input arguments oto_{t} and btb_{t},

(φθρ^θ^)t(ot,bt)φθt(ot,bt)ρ^θ^t(ot,bt).(\varphi^{\theta}\otimes\hat{\rho}^{\hat{\theta}})_{t}(o_{t},b_{t})\propto\varphi^{\theta}_{t}(o_{t},b_{t})\hat{\rho}^{\hat{\theta}}_{t}(o_{t},b_{t}). (23)

Finally, we clarify the use of \propto in the above definitions. In (21), (22) and (23), the normalizing constants replaced by \propto are independent of the input arguments (ot,bt)(o_{t},b_{t}).

D.2.2 Forward and backward smoothing operators

For any θ,θ^Θ\theta,\hat{\theta}\in\mathit{\Theta} and any tt, with any observation sequence {st,at}t\{s_{t},a_{t}\}_{t\in\mathbb{Z}} and any input arguments oto_{t} and btb_{t}, observe that

(φ^θ^ρθ)t(ot,bt)ot1,bt1πb(bt|st,ot1;θ^b)π¯hi(ot|st,ot1,bt;θ^hi)πlo(at|st,ot;θ^lo)×ρθt(ot,bt)(φ^θ^ρθ)t1(ot1,bt1)ρθt1(ot1,bt1),(\hat{\varphi}^{\hat{\theta}}\otimes\rho^{\theta})_{t}(o_{t},b_{t})\propto\sum_{o_{t-1},b_{t-1}}\pi_{b}(b_{t}|s_{t},o_{t-1};\hat{\theta}_{b})\bar{\pi}_{hi}(o_{t}|s_{t},o_{t-1},b_{t};\hat{\theta}_{hi})\pi_{lo}(a_{t}|s_{t},o_{t};\hat{\theta}_{lo})\\ \times\rho^{\theta}_{t}(o_{t},b_{t})\frac{(\hat{\varphi}^{\hat{\theta}}\otimes\rho^{\theta})_{t-1}(o_{t-1},b_{t-1})}{\rho^{\theta}_{t-1}(o_{t-1},b_{t-1})},

and

ρθt1(ot1,bt1)ot,btπb(bt|st,ot1;θb)π¯hi(ot|st,ot1,bt;θhi)πlo(at|st,ot;θlo)ρθt(ot,bt).\rho^{\theta}_{t-1}(o_{t-1},b_{t-1})\propto\sum_{o^{\prime}_{t},b^{\prime}_{t}}\pi_{b}(b^{\prime}_{t}|s_{t},o_{t-1};\theta_{b})\bar{\pi}_{hi}(o^{\prime}_{t}|s_{t},o_{t-1},b^{\prime}_{t};\theta_{hi})\pi_{lo}(a_{t}|s_{t},o^{\prime}_{t};\theta_{lo})\rho^{\theta}_{t}(o^{\prime}_{t},b^{\prime}_{t}).

To simplify notation, let

h(θ;ot1,st,at,ot,bt)=πb(bt|st,ot1;θb)π¯hi(ot|st,ot1,bt;θhi)πlo(at|st,ot;θlo).h(\theta;o_{t-1},s_{t},a_{t},o_{t},b_{t})=\pi_{b}(b_{t}|s_{t},o_{t-1};\theta_{b})\bar{\pi}_{hi}(o_{t}|s_{t},o_{t-1},b_{t};\theta_{hi})\pi_{lo}(a_{t}|s_{t},o_{t};\theta_{lo}). (24)

Then,

(φ^θ^ρθ)t(ot,bt)=Cθ^,θFot1,bt1h(θ^;ot1,st,at,ot,bt)ρθt(ot,bt)(φ^θ^ρθ)t1(ot1,bt1)ot,bth(θ;ot1,st,at,ot,bt)ρθt(ot,bt),(\hat{\varphi}^{\hat{\theta}}\otimes\rho^{\theta})_{t}(o_{t},b_{t})=C^{\hat{\theta},\theta}_{F}\sum_{o_{t-1},b_{t-1}}\frac{h(\hat{\theta};o_{t-1},s_{t},a_{t},o_{t},b_{t})\rho^{\theta}_{t}(o_{t},b_{t})(\hat{\varphi}^{\hat{\theta}}\otimes\rho^{\theta})_{t-1}(o_{t-1},b_{t-1})}{\sum_{o^{\prime}_{t},b^{\prime}_{t}}h(\theta;o_{t-1},s_{t},a_{t},o^{\prime}_{t},b^{\prime}_{t})\rho^{\theta}_{t}(o^{\prime}_{t},b^{\prime}_{t})}, (25)

where Cθ^,θFC^{\hat{\theta},\theta}_{F} is a normalizing constant such that

(Cθ^,θF)1=ot1,bt1ot,bth(θ^;ot1,st,at,ot,bt)ρθt(ot,bt)ot,bth(θ;ot1,st,at,ot,bt)ρθt(ot,bt)(φ^θ^ρθ)t1(ot1,bt1).\left(C^{\hat{\theta},\theta}_{F}\right)^{-1}=\sum_{o_{t-1},b_{t-1}}\frac{\sum_{o_{t},b_{t}}h(\hat{\theta};o_{t-1},s_{t},a_{t},o_{t},b_{t})\rho^{\theta}_{t}(o_{t},b_{t})}{\sum_{o^{\prime}_{t},b^{\prime}_{t}}h(\theta;o_{t-1},s_{t},a_{t},o^{\prime}_{t},b^{\prime}_{t})\rho^{\theta}_{t}(o^{\prime}_{t},b^{\prime}_{t})}(\hat{\varphi}^{\hat{\theta}}\otimes\rho^{\theta})_{t-1}(o_{t-1},b_{t-1}).

From (25), we define the forward smoothing operator Kθ^,θF,tK^{\hat{\theta},\theta}_{F,t} on the probability measure (φ^θ^ρθ)t1(\hat{\varphi}^{\hat{\theta}}\otimes\rho^{\theta})_{t-1} such that as probability measures,

(φ^θ^ρθ)t1Kθ^,θF,t=(φ^θ^ρθ)t.(\hat{\varphi}^{\hat{\theta}}\otimes\rho^{\theta})_{t-1}K^{\hat{\theta},\theta}_{F,t}=(\hat{\varphi}^{\hat{\theta}}\otimes\rho^{\theta})_{t}.

The subscript FF in Kθ^,θF,tK^{\hat{\theta},\theta}_{F,t} stands for forward. Kθ^,θF,tK^{\hat{\theta},\theta}_{F,t} depends on the the parameters θ\theta and θ^\hat{\theta}, the observation {st,at}t\{s_{t},a_{t}\}_{t\in\mathbb{Z}} and the specific choice of {ρθt}t\{\rho^{\theta}_{t}\}_{t\in\mathbb{Z}}. In the general case of θθ^\theta\neq\hat{\theta}, Kθ^,θF,tK^{\hat{\theta},\theta}_{F,t} is a nonlinear operator which requires rather sophisticated analysis. However, when θ=θ^\theta=\hat{\theta}, it is straightforward to verify that the normalizing constant Cθ,θF=1C^{\theta,\theta}_{F}=1, and Kθ,θF,tK^{\theta,\theta}_{F,t} becomes a linear operator.

In fact, the linear operator Kθ,θF,tK^{\theta,\theta}_{F,t} can be regarded as the standard operation of a Markov transition kernel on probability measures. With a slight overload of notation, define such a Markov transition kernel on 𝒪×{0,1}\mathcal{O}\times\{0,1\}, entry-wise, as the following. For any (ot,bt)(o_{t},b_{t}) and (ot1,bt1)(o_{t-1},b_{t-1}) in 𝒪×{0,1}\mathcal{O}\times\{0,1\},

Kθ,θF,t(ot,bt|ot1,bt1):=h(θ;ot1,st,at,ot,bt)ρθt(ot,bt)ot,bth(θ;ot1,st,at,ot,bt)ρθt(ot,bt).K^{\theta,\theta}_{F,t}(o_{t},b_{t}|o_{t-1},b_{t-1})\mathrel{\mathop{:}}=\frac{h(\theta;o_{t-1},s_{t},a_{t},o_{t},b_{t})\rho^{\theta}_{t}(o_{t},b_{t})}{\sum_{o^{\prime}_{t},b^{\prime}_{t}}h(\theta;o_{t-1},s_{t},a_{t},o^{\prime}_{t},b^{\prime}_{t})\rho^{\theta}_{t}(o^{\prime}_{t},b^{\prime}_{t})}. (26)

We name this Markov transition kernel as the forward smoothing kernel. Such a definition is analogous to Markovian decomposition in the HMM literature [8]. The only caveat here is that we also allow perturbations on the parameter. The resulting operator Kθ^,θF,tK^{\hat{\theta},\theta}_{F,t} is nonlinear and no longer corresponds to a Markov transition kernel.

To proceed, we characterize the difference between operators Kθ^,θF,tK^{\hat{\theta},\theta}_{F,t} and Kθ,θF,tK^{\theta,\theta}_{F,t} when θ^\hat{\theta} and θ\theta are close. First, we show a version of Lipschitz continuity for the options with failure framework.

Lemma D.2 (Lipschitz continuity).

For all θΘ\theta\in\mathit{\Theta} and δ>0\delta>0, there exists a real number Lθ,δL_{\theta,\delta} such that with any input arguments ot1o_{t-1}, sts_{t}, ata_{t}, oto_{t} and btb_{t}, the function h(θ~;ot1,st,at,ot,bt)h(\tilde{\theta};o_{t-1},s_{t},a_{t},o_{t},b_{t}) defined in (24) is Lθ,δL_{\theta,\delta}-Lipschitz with respect to θ~\tilde{\theta} on the set {θ~;θ~Θ,θ~θ2δ}\{\tilde{\theta};\tilde{\theta}\in\mathit{\Theta},\|{\tilde{\theta}-\theta}\|_{2}\leq\delta\}. Moreover, Lθ,δL_{\theta,\delta} is upper bounded by a constant that does not depend on θ\theta and δ\delta.

Proof of Lemma D.2.

Due to Assumption 2, with any input arguments ot1o_{t-1}, sts_{t}, ata_{t}, oto_{t} and btb_{t}, h(θ~;ot1,st,at,ot,bt)h(\tilde{\theta};o_{t-1},s_{t},a_{t},o_{t},b_{t}) is continuously differentiable with respect to θ~Θ~\tilde{\theta}\in\tilde{\mathit{\Theta}}. As continuously differentiable functions are Lipschitz continuous on convex and compact subsets, h(θ~;ot1,st,at,ot,bt)h(\tilde{\theta};o_{t-1},s_{t},a_{t},o_{t},b_{t}) is Lipschitz continuous on Θ\mathit{\Theta}, hence also on {θ~;θ~Θ,θ~θ2δ}\{\tilde{\theta};\tilde{\theta}\in\mathit{\Theta},\|{\tilde{\theta}-\theta}\|_{2}\leq\delta\}. The Lipschitz constants depend on the choice of input arguments ot1o_{t-1}, sts_{t}, ata_{t}, oto_{t} and btb_{t}.

We can let Lθ,δL_{\theta,\delta} be the smallest Lipschitz constant on {θ~;θ~Θ,θ~θ2δ}\{\tilde{\theta};\tilde{\theta}\in\mathit{\Theta},\|{\tilde{\theta}-\theta}\|_{2}\leq\delta\} that holds for all input arguments ot1o_{t-1}, sts_{t}, ata_{t}, oto_{t} and btb_{t}. Clearly Lθ,δL_{\theta,\delta} is upper bounded by any Lipschitz constant on Θ\mathit{\Theta} that holds for all input arguments, which does not depend on θ\theta and δ\delta. ∎

Next, we bound the difference between operators Kθ^,θF,tK^{\hat{\theta},\theta}_{F,t} and Kθ,θF,tK^{\theta,\theta}_{F,t}.

Lemma D.3 (Perturbation on the forward smoothing kernel).

Let φ\varphi be any probability measure on 𝒪×{0,1}\mathcal{O}\times\{0,1\}. Let Kθ^,θF,tK^{\hat{\theta},\theta}_{F,t} and Kθ,θF,tK^{\theta,\theta}_{F,t} be defined with the same observation sequence {st,at}t\{s_{t},a_{t}\}_{t\in\mathbb{Z}} and the same choice of {ρθt}t\{\rho^{\theta}_{t}\}_{t\in\mathbb{Z}}. Their difference is only in the first entry of the superscript (θ^\hat{\theta} in Kθ^,θF,tK^{\hat{\theta},\theta}_{F,t}; θ\theta in Kθ,θF,tK^{\theta,\theta}_{F,t}). Then, for all tt, φ\varphi, θ\theta, θ^\hat{\theta}, {st,at}t\{s_{t},a_{t}\}_{t\in\mathbb{Z}} and {ρθt}t\{\rho^{\theta}_{t}\}_{t\in\mathbb{Z}},

φKθ^,θF,tφKθ,θF,tTVmaxot1,ot,bth(θ;ot1,st,at,ot,bt)minot1,ot,bth(θ;ot1,st,at,ot,bt)Lθ,θ^θ2θ^θ2minot1,ot,bth(θ^;ot1,st,at,ot,bt).\left\|{\varphi K^{\hat{\theta},\theta}_{F,t}-\varphi K^{\theta,\theta}_{F,t}}\right\|_{\rm TV}\leq\frac{\max_{o_{t-1},o_{t},b_{t}}h(\theta;o_{t-1},s_{t},a_{t},o_{t},b_{t})}{\min_{o_{t-1},o_{t},b_{t}}h(\theta;o_{t-1},s_{t},a_{t},o_{t},b_{t})}\frac{L_{\theta,\|{\hat{\theta}-\theta}\|_{2}}\|{\hat{\theta}-\theta}\|_{2}}{\min_{o_{t-1},o_{t},b_{t}}h(\hat{\theta};o_{t-1},s_{t},a_{t},o_{t},b_{t})}.
Proof of Lemma D.3.

From the definitions, for any tt, φ\varphi, θ\theta, θ^\hat{\theta}, {st,at}t\{s_{t},a_{t}\}_{t\in\mathbb{Z}} and {ρθt}t\{\rho^{\theta}_{t}\}_{t\in\mathbb{Z}},

φKθ^,θF,tφKθ,θF,tTV\displaystyle\left\|{\varphi K^{\hat{\theta},\theta}_{F,t}-\varphi K^{\theta,\theta}_{F,t}}\right\|_{\rm TV}
=\displaystyle=~{} 12ot,bt|ot1,bt1[Cθ^,θFh(θ^;ot1,st,at,ot,bt)h(θ;ot1,st,at,ot,bt)]ot,bth(θ;ot1,st,at,ot,bt)ρθt(ot,bt)ρθt(ot,bt)φ(ot1,bt1)|\displaystyle\frac{1}{2}\sum_{o_{t},b_{t}}\left|\sum_{o_{t-1},b_{t-1}}\frac{\left[C^{\hat{\theta},\theta}_{F}h(\hat{\theta};o_{t-1},s_{t},a_{t},o_{t},b_{t})-h(\theta;o_{t-1},s_{t},a_{t},o_{t},b_{t})\right]}{\sum_{o^{\prime}_{t},b^{\prime}_{t}}h(\theta;o_{t-1},s_{t},a_{t},o^{\prime}_{t},b^{\prime}_{t})\rho^{\theta}_{t}(o^{\prime}_{t},b^{\prime}_{t})}\rho^{\theta}_{t}(o_{t},b_{t})\varphi(o_{t-1},b_{t-1})\right|
\displaystyle\leq~{} 12ot1,bt1ot,bt|Cθ^,θFh(θ^;ot1,st,at,ot,bt)h(θ;ot1,st,at,ot,bt)|ρθt(ot,bt)ot,bth(θ;ot1,st,at,ot,bt)ρθt(ot,bt)φ(ot1,bt1).\displaystyle\frac{1}{2}\sum_{o_{t-1},b_{t-1}}\frac{\sum_{o_{t},b_{t}}\left|C^{\hat{\theta},\theta}_{F}h(\hat{\theta};o_{t-1},s_{t},a_{t},o_{t},b_{t})-h(\theta;o_{t-1},s_{t},a_{t},o_{t},b_{t})\right|\rho^{\theta}_{t}(o_{t},b_{t})}{\sum_{o^{\prime}_{t},b^{\prime}_{t}}h(\theta;o_{t-1},s_{t},a_{t},o^{\prime}_{t},b^{\prime}_{t})\rho^{\theta}_{t}(o^{\prime}_{t},b^{\prime}_{t})}\varphi(o_{t-1},b_{t-1}).

From the definition of the normalizing constant Cθ^,θFC^{\hat{\theta},\theta}_{F}, we have

(Cθ^,θF)1=ot1,bt1ot,bth(θ^;ot1,st,at,ot,bt)ρθt(ot,bt)ot,bth(θ;ot1,st,at,ot,bt)ρθt(ot,bt)φ(ot1,bt1).\left(C^{\hat{\theta},\theta}_{F}\right)^{-1}=\sum_{o_{t-1},b_{t-1}}\frac{\sum_{o_{t},b_{t}}h(\hat{\theta};o_{t-1},s_{t},a_{t},o_{t},b_{t})\rho^{\theta}_{t}(o_{t},b_{t})}{\sum_{o^{\prime}_{t},b^{\prime}_{t}}h(\theta;o_{t-1},s_{t},a_{t},o^{\prime}_{t},b^{\prime}_{t})\rho^{\theta}_{t}(o^{\prime}_{t},b^{\prime}_{t})}\varphi(o_{t-1},b_{t-1}).

Therefore,

Cθ^,θFmaxot1ot,bth(θ;ot1,st,at,ot,bt)ρθt(ot,bt)ot,bth(θ^;ot1,st,at,ot,bt)ρθt(ot,bt),C^{\hat{\theta},\theta}_{F}\leq\max_{o_{t-1}}\frac{\sum_{o_{t},b_{t}}h(\theta;o_{t-1},s_{t},a_{t},o_{t},b_{t})\rho^{\theta}_{t}(o_{t},b_{t})}{\sum_{o_{t},b_{t}}h(\hat{\theta};o_{t-1},s_{t},a_{t},o_{t},b_{t})\rho^{\theta}_{t}(o_{t},b_{t})},

and

|Cθ^,θF1|\displaystyle\left|C^{\hat{\theta},\theta}_{F}-1\right|
=\displaystyle=~{} |ot1,bt1ot,bt[h(θ^;ot1,st,at,ot,bt)h(θ;ot1,st,at,ot,bt)]ρθt(ot,bt)ot,bth(θ;ot1,st,at,ot,bt)ρθt(ot,bt)φ(ot1,bt1)|Cθ^,θF\displaystyle\left|\sum_{o_{t-1},b_{t-1}}\frac{\sum_{o_{t},b_{t}}[h(\hat{\theta};o_{t-1},s_{t},a_{t},o_{t},b_{t})-h(\theta;o_{t-1},s_{t},a_{t},o_{t},b_{t})]\rho^{\theta}_{t}(o_{t},b_{t})}{\sum_{o_{t},b_{t}}h(\theta;o_{t-1},s_{t},a_{t},o_{t},b_{t})\rho^{\theta}_{t}(o_{t},b_{t})}\varphi(o_{t-1},b_{t-1})\right|C^{\hat{\theta},\theta}_{F}
\displaystyle\leq~{} Lθ,θ^θ2θ^θ2Cθ^,θFminot1ot,bth(θ;ot1,st,at,ot,bt)ρθt(ot,bt).\displaystyle\frac{L_{\theta,\|{\hat{\theta}-\theta}\|_{2}}\|{\hat{\theta}-\theta}\|_{2}C^{\hat{\theta},\theta}_{F}}{\min_{o_{t-1}}\sum_{o_{t},b_{t}}h(\theta;o_{t-1},s_{t},a_{t},o_{t},b_{t})\rho^{\theta}_{t}(o_{t},b_{t})}.

As a result, for any given ot1o_{t-1}, oto_{t} and btb_{t},

|Cθ^,θFh(θ^;ot1,st,at,ot,bt)h(θ;ot1,st,at,ot,bt)|\displaystyle\left|C^{\hat{\theta},\theta}_{F}h(\hat{\theta};o_{t-1},s_{t},a_{t},o_{t},b_{t})-h(\theta;o_{t-1},s_{t},a_{t},o_{t},b_{t})\right|
\displaystyle\leq~{} Cθ^,θF|h(θ^;ot1,st,at,ot,bt)h(θ;ot1,st,at,ot,bt)|+|Cθ^,θF1|h(θ;ot1,st,at,ot,bt)\displaystyle C^{\hat{\theta},\theta}_{F}\left|h(\hat{\theta};o_{t-1},s_{t},a_{t},o_{t},b_{t})-h(\theta;o_{t-1},s_{t},a_{t},o_{t},b_{t})\right|+\left|C^{\hat{\theta},\theta}_{F}-1\right|h(\theta;o_{t-1},s_{t},a_{t},o_{t},b_{t})
\displaystyle\leq~{} [1+h(θ;ot1,st,at,ot,bt)minot1ot,bth(θ;ot1,st,at,ot,bt)ρθt(ot,bt)]Lθ,θ^θ2θ^θ2Cθ^,θF.\displaystyle\left[1+\frac{h(\theta;o_{t-1},s_{t},a_{t},o_{t},b_{t})}{\min_{o^{\prime}_{t-1}}\sum_{o^{\prime}_{t},b^{\prime}_{t}}h(\theta;o^{\prime}_{t-1},s_{t},a_{t},o^{\prime}_{t},b^{\prime}_{t})\rho^{\theta}_{t}(o^{\prime}_{t},b^{\prime}_{t})}\right]L_{\theta,\|{\hat{\theta}-\theta}\|_{2}}\left\|{\hat{\theta}-\theta}\right\|_{2}C^{\hat{\theta},\theta}_{F}.

Combining everything together,

φKθ^,θF,tφKθ,θF,tTV\displaystyle\left\|{\varphi K^{\hat{\theta},\theta}_{F,t}-\varphi K^{\theta,\theta}_{F,t}}\right\|_{\rm TV}
\displaystyle\leq~{} Lθ,θ^θ2θ^θ2Cθ^,θF×maxot11+ot,bth(θ;ot1,st,at,ot,bt)ρθt(ot,bt)minot1ot,bth(θ;ot1,st,at,ot,bt)ρθt(ot,bt)2ot,bth(θ;ot1,st,at,ot,bt)ρθt(ot,bt)\displaystyle L_{\theta,\|{\hat{\theta}-\theta}\|_{2}}\left\|{\hat{\theta}-\theta}\right\|_{2}C^{\hat{\theta},\theta}_{F}\times\max_{o_{t-1}}\frac{1+\frac{\sum_{o_{t},b_{t}}h(\theta;o_{t-1},s_{t},a_{t},o_{t},b_{t})\rho^{\theta}_{t}(o_{t},b_{t})}{\min_{o^{\prime}_{t-1}}\sum_{o^{\prime}_{t},b^{\prime}_{t}}h(\theta;o^{\prime}_{t-1},s_{t},a_{t},o^{\prime}_{t},b^{\prime}_{t})\rho^{\theta}_{t}(o^{\prime}_{t},b^{\prime}_{t})}}{2\sum_{o^{\prime}_{t},b^{\prime}_{t}}h(\theta;o_{t-1},s_{t},a_{t},o^{\prime}_{t},b^{\prime}_{t})\rho^{\theta}_{t}(o^{\prime}_{t},b^{\prime}_{t})}
=\displaystyle=~{} Lθ,θ^θ2θ^θ2Cθ^,θFminot1ot,bth(θ;ot1,st,at,ot,bt)ρθt(ot,bt)\displaystyle\frac{L_{\theta,\|{\hat{\theta}-\theta}\|_{2}}\|{\hat{\theta}-\theta}\|_{2}C^{\hat{\theta},\theta}_{F}}{\min_{o^{\prime}_{t-1}}\sum_{o^{\prime}_{t},b^{\prime}_{t}}h(\theta;o^{\prime}_{t-1},s_{t},a_{t},o^{\prime}_{t},b^{\prime}_{t})\rho^{\theta}_{t}(o^{\prime}_{t},b^{\prime}_{t})}
\displaystyle\leq~{} maxot1,ot,bth(θ;ot1,st,at,ot,bt)minot1,ot,bth(θ;ot1,st,at,ot,bt)Lθ,θ^θ2θ^θ2minot1,ot,bth(θ^;ot1,st,at,ot,bt).\displaystyle\frac{\max_{o_{t-1},o_{t},b_{t}}h(\theta;o_{t-1},s_{t},a_{t},o_{t},b_{t})}{\min_{o_{t-1},o_{t},b_{t}}h(\theta;o_{t-1},s_{t},a_{t},o_{t},b_{t})}\frac{L_{\theta,\|{\hat{\theta}-\theta}\|_{2}}\|{\hat{\theta}-\theta}\|_{2}}{\min_{o_{t-1},o_{t},b_{t}}h(\hat{\theta};o_{t-1},s_{t},a_{t},o_{t},b_{t})}.\qed

On the other hand, we can formulate a backward smoothing recursion as

(φθρ^θ^)t(ot,bt)=Cθ,θ^Bot+1,bt+1h(θ^;ot,st+1,at+1,ot+1,bt+1)φθt(ot,bt)(φθρ^θ^)t+1(ot+1,bt+1)ot,bth(θ;ot,st+1,at+1,ot+1,bt+1)φθt(ot,bt),(\varphi^{\theta}\otimes\hat{\rho}^{\hat{\theta}})_{t}(o_{t},b_{t})=C^{\theta,\hat{\theta}}_{B}\sum_{o_{t+1},b_{t+1}}\frac{h(\hat{\theta};o_{t},s_{t+1},a_{t+1},o_{t+1},b_{t+1})\varphi^{\theta}_{t}(o_{t},b_{t})(\varphi^{\theta}\otimes\hat{\rho}^{\hat{\theta}})_{t+1}(o_{t+1},b_{t+1})}{\sum_{o^{\prime}_{t},b^{\prime}_{t}}h(\theta;o^{\prime}_{t},s_{t+1},a_{t+1},o_{t+1},b_{t+1})\varphi^{\theta}_{t}(o^{\prime}_{t},b^{\prime}_{t})}, (27)

where Cθ,θ^BC^{\theta,\hat{\theta}}_{B} is a normalizing constant such that

(Cθ,θ^B)1=ot+1,bt+1ot,bth(θ^;ot,st+1,at+1,ot+1,bt+1)φθt(ot,bt)ot,bth(θ;ot,st+1,at+1,ot+1,bt+1)φθt(ot,bt)(φθρ^θ^)t+1(ot+1,bt+1).\left(C^{\theta,\hat{\theta}}_{B}\right)^{-1}=\sum_{o_{t+1},b_{t+1}}\frac{\sum_{o_{t},b_{t}}h(\hat{\theta};o_{t},s_{t+1},a_{t+1},o_{t+1},b_{t+1})\varphi^{\theta}_{t}(o_{t},b_{t})}{\sum_{o^{\prime}_{t},b^{\prime}_{t}}h(\theta;o^{\prime}_{t},s_{t+1},a_{t+1},o_{t+1},b_{t+1})\varphi^{\theta}_{t}(o^{\prime}_{t},b^{\prime}_{t})}(\varphi^{\theta}\otimes\hat{\rho}^{\hat{\theta}})_{t+1}(o_{t+1},b_{t+1}).

The subscript BB in Kθ,θ^B,tK^{\theta,\hat{\theta}}_{B,t} stands for backward. Similar to the forward smoothing operator Kθ^,θF,tK^{\hat{\theta},\theta}_{F,t}, we can define the backward smoothing operator Kθ,θ^B,tK^{\theta,\hat{\theta}}_{B,t} from (27) such that as probability measures,

(φθρ^θ^)t+1Kθ,θ^B,t=(φθρ^θ^)t.(\varphi^{\theta}\otimes\hat{\rho}^{\hat{\theta}})_{t+1}K^{\theta,\hat{\theta}}_{B,t}=(\varphi^{\theta}\otimes\hat{\rho}^{\hat{\theta}})_{t}.

Analogous to Kθ^,θF,tK^{\hat{\theta},\theta}_{F,t}, in the general case of θθ^\theta\neq\hat{\theta}, Kθ,θ^B,tK^{\theta,\hat{\theta}}_{B,t} is a nonlinear operator. However, if θ=θ^\theta=\hat{\theta}, Kθ,θ^B,tK^{\theta,\hat{\theta}}_{B,t} becomes a linear operator and induces a Markov transition kernel. The following lemma is similar to Lemma D.3. We state it without proof.

Lemma D.4 (Perturbation on the backward smoothing kernel).

Let ρ\rho be any probability measure on 𝒪×{0,1}\mathcal{O}\times\{0,1\}. Let Kθ,θ^B,tK^{\theta,\hat{\theta}}_{B,t} and Kθ,θB,tK^{\theta,\theta}_{B,t} be defined with the same observation sequence {st,at}t\{s_{t},a_{t}\}_{t\in\mathbb{Z}} and the same choice of {φθt}t\{\varphi^{\theta}_{t}\}_{t\in\mathbb{Z}}. Then, for any tt, ρ\rho, θ\theta, θ^\hat{\theta}, {st,at}t\{s_{t},a_{t}\}_{t\in\mathbb{Z}} and {φθt}t\{\varphi^{\theta}_{t}\}_{t\in\mathbb{Z}},

ρKθ,θ^B,tρKθ,θB,tTVmaxot,ot+1,bt+1h(θ;ot,st+1,at+1,ot+1,bt+1)minot,ot+1,bt+1h(θ;ot,st+1,at+1,ot+1,bt+1)×Lθ^,θ^θ2θ^θ2minot,ot+1,bt+1h(θ^;ot,st+1,at+1,ot+1,bt+1).\left\|{\rho K^{\theta,\hat{\theta}}_{B,t}-\rho K^{\theta,\theta}_{B,t}}\right\|_{\rm TV}\leq\frac{\max_{o_{t},o_{t+1},b_{t+1}}h(\theta;o_{t},s_{t+1},a_{t+1},o_{t+1},b_{t+1})}{\min_{o_{t},o_{t+1},b_{t+1}}h(\theta;o_{t},s_{t+1},a_{t+1},o_{t+1},b_{t+1})}\\ \times\frac{L_{\hat{\theta},\|{\hat{\theta}-\theta}\|_{2}}\|{\hat{\theta}-\theta}\|_{2}}{\min_{o_{t},o_{t+1},b_{t+1}}h(\hat{\theta};o_{t},s_{t+1},a_{t+1},o_{t+1},b_{t+1})}.

Notice that the bounds in both Lemma D.3 and Lemma D.4 depend on the observation sequence {st,at}t\{s_{t},a_{t}\}_{t\in\mathbb{Z}}.

D.2.3 A perturbed contraction result for smoothing stability

For any t1,t2t_{1},t_{2}\in\mathbb{Z} with t1t2t_{1}\leq t_{2}, let 𝕀=[t1:t2]\mathbb{I}=[t_{1}:t_{2}]. Remember the following definition from Appendix D.2.1, with the index set restricted to 𝕀\mathbb{I}: for any θ,θ^Θ\theta,\hat{\theta}\in\mathit{\Theta}, {φθt}t𝕀\{\varphi^{\theta}_{t}\}_{t\in\mathbb{I}} and {φ^θ^t}t𝕀\{\hat{\varphi}^{\hat{\theta}}_{t}\}_{t\in\mathbb{I}} are two indexed sets of probability measures defined on 𝒪×{0,1}\mathcal{O}\times\{0,1\} such that, for all t𝕀t\in\mathbb{I}, (1) if tt1t\neq t_{1}, Fθtφθt1=φθtF^{\theta}_{t}\varphi^{\theta}_{t-1}=\varphi^{\theta}_{t} and Fθ^tφ^θ^t1=φ^θ^tF^{\hat{\theta}}_{t}\hat{\varphi}^{\hat{\theta}}_{t-1}=\hat{\varphi}^{\hat{\theta}}_{t}; (2) φθt\varphi^{\theta}_{t} and φ^θ^t\hat{\varphi}^{\hat{\theta}}_{t} are strictly positive on their domains. {ρθt}t𝕀\{\rho^{\theta}_{t}\}_{t\in\mathbb{I}} and {ρ^θ^t}t𝕀\{\hat{\rho}^{\hat{\theta}}_{t}\}_{t\in\mathbb{I}} are two indexed sets of probability measures defined on 𝒪×{0,1}\mathcal{O}\times\{0,1\} such that for all t𝕀t\in\mathbb{I}, (1) if tt2t\neq t_{2}, Bθtρθt+1=ρθtB^{\theta}_{t}\rho^{\theta}_{t+1}=\rho^{\theta}_{t} and Bθ^tρ^θ^t+1=ρ^θ^tB^{\hat{\theta}}_{t}\hat{\rho}^{\hat{\theta}}_{t+1}=\hat{\rho}^{\hat{\theta}}_{t}; (2) ρθt\rho^{\theta}_{t} and ρ^θ^t\hat{\rho}^{\hat{\theta}}_{t} are strictly positive on their domains. θ\theta and θ^\hat{\theta} are allowed to be equal.

The smoothing stability lemma is stated as follows.

Lemma D.5 (Smoothing stability).

With {φθt}t𝕀\{\varphi^{\theta}_{t}\}_{t\in\mathbb{I}}, {φ^θ^t}t𝕀\{\hat{\varphi}^{\hat{\theta}}_{t}\}_{t\in\mathbb{I}}, {ρθt}t𝕀\{\rho^{\theta}_{t}\}_{t\in\mathbb{I}} and {ρ^θ^t}t𝕀\{\hat{\rho}^{\hat{\theta}}_{t}\}_{t\in\mathbb{I}} defined above,

(φθρθ)t2(φ^θ^ρθ)t2TV(1ε2bζ|𝒪|)t2t1+|𝒪|zθ,θ^Lθ,θ^θ2ε2bζθ^θ2,\left\|{(\varphi^{\theta}\otimes\rho^{\theta})_{t_{2}}-(\hat{\varphi}^{\hat{\theta}}\otimes\rho^{\theta})_{t_{2}}}\right\|_{\rm TV}\leq\bigg{(}1-\frac{\varepsilon^{2}_{b}\zeta}{|\mathcal{O}|}\bigg{)}^{t_{2}-t_{1}}+\frac{|\mathcal{O}|z_{\theta,\hat{\theta}}L_{\theta,\|{\hat{\theta}-\theta}\|_{2}}}{\varepsilon^{2}_{b}\zeta}\left\|{\hat{\theta}-\theta}\right\|_{2},
(φ^θ^ρθ)t1(φ^θ^ρ^θ^)t1TV(1ε2bζ|𝒪|)t2t1+|𝒪|zθ,θ^Lθ,θ^θ2ε2bζθ^θ2,\left\|{(\hat{\varphi}^{\hat{\theta}}\otimes\rho^{\theta})_{t_{1}}-(\hat{\varphi}^{\hat{\theta}}\otimes\hat{\rho}^{\hat{\theta}})_{t_{1}}}\right\|_{\rm TV}\leq\bigg{(}1-\frac{\varepsilon^{2}_{b}\zeta}{|\mathcal{O}|}\bigg{)}^{t_{2}-t_{1}}+\frac{|\mathcal{O}|z_{\theta,\hat{\theta}}L_{\theta,\|{\hat{\theta}-\theta}\|_{2}}}{\varepsilon^{2}_{b}\zeta}\left\|{\hat{\theta}-\theta}\right\|_{2},

where zθ,θz_{\theta,\theta^{\prime}} is a positive real number dependent only on θ\theta and θ^\hat{\theta}. Specifically,

zθ,θ=maxst,at[maxot1,ot,bth(θ;ot1,st,at,ot,bt)][maxot1,ot,bth(θ^;ot1,st,at,ot,bt)][minot1,ot,bth(θ;ot1,st,at,ot,bt)][minot1,ot,bth(θ^;ot1,st,at,ot,bt)].z_{\theta,\theta^{\prime}}=\max_{s^{\prime}_{t},a^{\prime}_{t}}\frac{[\max_{o_{t-1},o_{t},b_{t}}h(\theta;o_{t-1},s^{\prime}_{t},a^{\prime}_{t},o_{t},b_{t})]\vee[\max_{o_{t-1},o_{t},b_{t}}h(\hat{\theta};o_{t-1},s^{\prime}_{t},a^{\prime}_{t},o_{t},b_{t})]}{[\min_{o_{t-1},o_{t},b_{t}}h(\theta;o_{t-1},s^{\prime}_{t},a^{\prime}_{t},o_{t},b_{t})][\min_{o_{t-1},o_{t},b_{t}}h(\hat{\theta};o_{t-1},s^{\prime}_{t},a^{\prime}_{t},o_{t},b_{t})]}.

Intuitively, if θ^=θ\hat{\theta}=\theta, Lemma D.5 has the form of an exact contraction, which is similar to the standard filtering stability result for HMMs. Indeed, our proof uses the classical techniques of uniform forgetting from the HMM literature [8]. If θ^\hat{\theta} is different from θ\theta, such a contraction is perturbed. For HMMs, similar results are provided in [13, Proposition 2.2, Theorem 2.3].

Proof of Lemma D.5.

Consider the first bound. It holds trivially when t2=t1t_{2}=t_{1}. Now consider only t2>t1t_{2}>t_{1}. Using the forward smoothing operators, for any t1<tt2t_{1}<t\leq t_{2},

(φθρθ)t1Kθ,θF,t(φ^θ^ρθ)t1Kθ^,θF,t=(φθρθ)t(φ^θ^ρθ)t.(\varphi^{\theta}\otimes\rho^{\theta})_{t-1}K^{\theta,\theta}_{F,t}-(\hat{\varphi}^{\hat{\theta}}\otimes\rho^{\theta})_{t-1}K^{\hat{\theta},\theta}_{F,t}=(\varphi^{\theta}\otimes\rho^{\theta})_{t}-(\hat{\varphi}^{\hat{\theta}}\otimes\rho^{\theta})_{t}.

Therefore,

(φθρθ)t(φ^θ^ρθ)tTV[(φθρθ)t1(φ^θ^ρθ)t1]Kθ,θF,tTV+(φ^θ^ρθ)t1Kθ,θF,t(φ^θ^ρθ)t1Kθ^,θF,tTV,\left\|{(\varphi^{\theta}\otimes\rho^{\theta})_{t}-(\hat{\varphi}^{\hat{\theta}}\otimes\rho^{\theta})_{t}}\right\|_{\rm TV}\leq\left\|{\left[(\varphi^{\theta}\otimes\rho^{\theta})_{t-1}-(\hat{\varphi}^{\hat{\theta}}\otimes\rho^{\theta})_{t-1}\right]K^{\theta,\theta}_{F,t}}\right\|_{\rm TV}\\ +\left\|{(\hat{\varphi}^{\hat{\theta}}\otimes\rho^{\theta})_{t-1}K^{\theta,\theta}_{F,t}-(\hat{\varphi}^{\hat{\theta}}\otimes\rho^{\theta})_{t-1}K^{\hat{\theta},\theta}_{F,t}}\right\|_{\rm TV},

where the first term is due to Kθ,θF,tK^{\theta,\theta}_{F,t} being a linear operator.

From Lemma D.3, the second term on the RHS is upper bounded by zθ,θ^Lθ,θ^θ2θ^θ2z_{\theta,\hat{\theta}}L_{\theta,\|{\hat{\theta}-\theta}\|_{2}}\|{\hat{\theta}-\theta}\|_{2}. As for the first term, we can construct the classical Doeblin-type minorization condition [8, Chap. 4.3]. Applying Lemma D.1 in the definition of the Markov transition kernel Kθ,θF,tK^{\theta,\theta}_{F,t} (26), we have

Kθ,θF,t(ot,bt|ot1,bt1)ε2bζ|𝒪|π¯o,b(ot,bt|st;θ)πlo(at|st,ot;θlo)ρθt(ot,bt)ot,btπ¯o,b(ot,bt|st;θ)πlo(at|st,ot;θlo)ρθt(ot,bt)=:ε2bζ|𝒪|π¯θF,t(ot,bt).K^{\theta,\theta}_{F,t}(o_{t},b_{t}|o_{t-1},b_{t-1})\geq\frac{\varepsilon^{2}_{b}\zeta}{|\mathcal{O}|}\frac{\bar{\pi}_{o,b}(o_{t},b_{t}|s_{t};\theta)\pi_{lo}(a_{t}|s_{t},o_{t};\theta_{lo})\rho^{\theta}_{t}(o_{t},b_{t})}{\sum_{o^{\prime}_{t},b^{\prime}_{t}}\bar{\pi}_{o,b}(o^{\prime}_{t},b^{\prime}_{t}|s_{t};\theta)\pi_{lo}(a_{t}|s_{t},o^{\prime}_{t};\theta_{lo})\rho^{\theta}_{t}(o^{\prime}_{t},b^{\prime}_{t})}=\mathrel{\mathop{:}}\frac{\varepsilon^{2}_{b}\zeta}{|\mathcal{O}|}\bar{\pi}^{\theta}_{F,t}(o_{t},b_{t}). (28)

Observe that π¯θF,t\bar{\pi}^{\theta}_{F,t} just defined is a probability measure. Further define K¯θ,θF,t\bar{K}^{\theta,\theta}_{F,t} entry-wise as

K¯θ,θF,t(ot,bt|ot1,bt1):=(1ε2bζ|𝒪|)1(Kθ,θF,t(ot,bt|ot1,bt1)ε2bζ|𝒪|π¯θF,t(ot,bt)).\bar{K}^{\theta,\theta}_{F,t}(o_{t},b_{t}|o_{t-1},b_{t-1})\mathrel{\mathop{:}}=\bigg{(}1-\frac{\varepsilon^{2}_{b}\zeta}{|\mathcal{O}|}\bigg{)}^{-1}\bigg{(}K^{\theta,\theta}_{F,t}(o_{t},b_{t}|o_{t-1},b_{t-1})-\frac{\varepsilon^{2}_{b}\zeta}{|\mathcal{O}|}\bar{\pi}^{\theta}_{F,t}(o_{t},b_{t})\bigg{)}.

We can verify that K¯θ,θF,t\bar{K}^{\theta,\theta}_{F,t} is also a Markov transition kernel. Moreover,

[(φθρθ)t1(φ^θ^ρθ)t1]Kθ,θF,t=(1ε2bζ|𝒪|)[(φθρθ)t1(φ^θ^ρθ)t1]K¯θ,θF,t.\left[(\varphi^{\theta}\otimes\rho^{\theta})_{t-1}-(\hat{\varphi}^{\hat{\theta}}\otimes\rho^{\theta})_{t-1}\right]K^{\theta,\theta}_{F,t}=\bigg{(}1-\frac{\varepsilon^{2}_{b}\zeta}{|\mathcal{O}|}\bigg{)}\left[(\varphi^{\theta}\otimes\rho^{\theta})_{t-1}-(\hat{\varphi}^{\hat{\theta}}\otimes\rho^{\theta})_{t-1}\right]\bar{K}^{\theta,\theta}_{F,t}.

To proceed, the standard approach is to use the fact that the Dobrushin coefficient of K¯θ,θF,t\bar{K}^{\theta,\theta}_{F,t} is upper bounded by one. For clarity, we avoid such definitions and take a more direct approach here, which requires the extension of the total variation distance for two probability measures to the total variation norm for a finite signed measure. For a finite signed measure ν\nu over a finite set Ω\Omega, let the total variation norm of ν\nu be

νTV:=12ωΩ|ν(ω)|.\left\|{\nu}\right\|_{\rm TV}\mathrel{\mathop{:}}=\frac{1}{2}\sum_{\omega\in\Omega}\left|\nu(\omega)\right|.

When ν\nu is the difference between two probability measures ν1ν2\nu_{1}-\nu_{2}, the total variation norm of ν\nu coincides with the total variation distance between ν1\nu_{1} and ν2\nu_{2}. Therefore, the same notation TV\|{\cdot}\|_{\rm TV} is adopted here.

Let ¯(𝒪×{0,1})\mathcal{\bar{M}}(\mathcal{O}\times\{0,1\}) be the set of finite signed measures over the finite set 𝒪×{0,1}\mathcal{O}\times\{0,1\}. From [8, Chap. 4.3.1], ¯(𝒪×{0,1})\mathcal{\bar{M}}(\mathcal{O}\times\{0,1\}) is a Banach space. Define an operator norm op\|{\cdot}\|_{\rm op} for K¯θ,θF,t\bar{K}^{\theta,\theta}_{F,t} as

K¯θ,θF,top:=sup{νK¯θ,θF,tTV;νTV=1,ν¯(𝒪×{0,1})}.\left\|{\bar{K}^{\theta,\theta}_{F,t}}\right\|_{\rm op}\mathrel{\mathop{:}}=\sup\left\{\left\|{\nu\bar{K}^{\theta,\theta}_{F,t}}\right\|_{\rm TV};\left\|{\nu}\right\|_{\rm TV}=1,\nu\in\mathcal{\bar{M}}(\mathcal{O}\times\{0,1\})\right\}.

Since K¯θ,θF,t\bar{K}^{\theta,\theta}_{F,t} is a Markov transition kernel, K¯θ,θF,top=1\|{\bar{K}^{\theta,\theta}_{F,t}}\|_{\rm op}=1 [8, Lemma 4.3.6]. Therefore,

(φθρθ)t2(φ^θ^ρθ)t2TV\displaystyle\left\|{(\varphi^{\theta}\otimes\rho^{\theta})_{t_{2}}-(\hat{\varphi}^{\hat{\theta}}\otimes\rho^{\theta})_{t_{2}}}\right\|_{\rm TV}
\displaystyle\leq~{} [(φθρθ)t21(φ^θ^ρθ)t21]Kθ,θF,t2TV+(φ^θ^ρθ)t21(Kθ,θF,t2Kθ^,θF,t2)TV\displaystyle\left\|{\left[(\varphi^{\theta}\otimes\rho^{\theta})_{t_{2}-1}-(\hat{\varphi}^{\hat{\theta}}\otimes\rho^{\theta})_{t_{2}-1}\right]K^{\theta,\theta}_{F,t_{2}}}\right\|_{\rm TV}+\left\|{(\hat{\varphi}^{\hat{\theta}}\otimes\rho^{\theta})_{t_{2}-1}\left(K^{\theta,\theta}_{F,t_{2}}-K^{\hat{\theta},\theta}_{F,t_{2}}\right)}\right\|_{\rm TV}
=\displaystyle=~{} (1ε2bζ|𝒪|)[(φθρθ)t21(φ^θ^ρθ)t21]K¯θ,θF,t2TV+zθ,θ^Lθ,θ^θ2θ^θ2\displaystyle\bigg{(}1-\frac{\varepsilon^{2}_{b}\zeta}{|\mathcal{O}|}\bigg{)}\left\|{\left[(\varphi^{\theta}\otimes\rho^{\theta})_{t_{2}-1}-(\hat{\varphi}^{\hat{\theta}}\otimes\rho^{\theta})_{t_{2}-1}\right]\bar{K}^{\theta,\theta}_{F,t_{2}}}\right\|_{\rm TV}+z_{\theta,\hat{\theta}}L_{\theta,\|{\hat{\theta}-\theta}\|_{2}}\|{\hat{\theta}-\theta}\|_{2}
\displaystyle\leq~{} (1ε2bζ|𝒪|)(φθρθ)t21(φ^θ^ρθ)t21TVK¯θ,θF,t2op+zθ,θ^Lθ,θ^θ2θ^θ2\displaystyle\bigg{(}1-\frac{\varepsilon^{2}_{b}\zeta}{|\mathcal{O}|}\bigg{)}\left\|{(\varphi^{\theta}\otimes\rho^{\theta})_{t_{2}-1}-(\hat{\varphi}^{\hat{\theta}}\otimes\rho^{\theta})_{t_{2}-1}}\right\|_{\rm TV}\left\|{\bar{K}^{\theta,\theta}_{F,t_{2}}}\right\|_{\rm op}+z_{\theta,\hat{\theta}}L_{\theta,\|{\hat{\theta}-\theta}\|_{2}}\|{\hat{\theta}-\theta}\|_{2}
=\displaystyle=~{} (1ε2bζ|𝒪|)(φθρθ)t21(φ^θ^ρθ)t21TV+zθ,θ^Lθ,θ^θ2θ^θ2.\displaystyle\bigg{(}1-\frac{\varepsilon^{2}_{b}\zeta}{|\mathcal{O}|}\bigg{)}\left\|{(\varphi^{\theta}\otimes\rho^{\theta})_{t_{2}-1}-(\hat{\varphi}^{\hat{\theta}}\otimes\rho^{\theta})_{t_{2}-1}}\right\|_{\rm TV}+z_{\theta,\hat{\theta}}L_{\theta,\|{\hat{\theta}-\theta}\|_{2}}\|{\hat{\theta}-\theta}\|_{2}.

The second inequality is due to the sub-multiplicativity of the operator norm. Finally, the desirable result follows from unrolling the summation and identifying the geometric series.

The proof of the second bound is analogous, using the backward smoothing operators instead of the forward smoothing operators. Details are omitted. ∎

Note that Lemma D.5 only holds when considering the options with failure framework. For the standard options framework, the one-step Doeblin-type minorization condition (28) we construct in the proof does not hold anymore, due to the failure of Lemma D.1. Instead, one could target the two-step minorization condition: define a two step smoothing kernel similar to Kθ,θF,tK^{\theta,\theta}_{F,t} and lower bound it similar to (28). Notations are much more complicated. For simplicity, this extension is not considered in this paper.

D.3 The approximation lemmas

This subsection applies Lemma D.5 to quantities defined in earlier sections.

First, we bound the difference of smoothing distributions in the non-extended graphical model (as in Theorem 1) and the extended one with parameter kk (as in Corollary 6). The parameter θ\theta in the two models can be different. The bounds use quantities defined in Appendix D.1 and Appendix D.2. Recall the definition of Ω\Omega from 8.

Lemma D.6 (Bounding the difference of smoothing distributions, Part I).

For all θ,θ^Θ\theta,\hat{\theta}\in\mathit{\Theta}, k+k\in\mathbb{N}_{+} and μ\mu\in\mathcal{M}, with the observation sequence {st,at}t\{s_{t},a_{t}\}_{t\in\mathbb{Z}} corresponding to any ωΩ\omega\in\Omega, we have

  1. 1.

    t[1:T]\forall t\in[1:T],

    γθμ,t|Tγθ^k,tTV(1ε2bζ|𝒪|)t1+(1ε2bζ|𝒪|)Tt+2|𝒪|zθ,θ^Lθ,θ^θ2ε2bζθ^θ2.\left\|{\gamma^{\theta}_{\mu,t|T}-\gamma^{\hat{\theta}}_{k,t}}\right\|_{\rm TV}\leq\bigg{(}1-\frac{\varepsilon^{2}_{b}\zeta}{|\mathcal{O}|}\bigg{)}^{t-1}+\bigg{(}1-\frac{\varepsilon^{2}_{b}\zeta}{|\mathcal{O}|}\bigg{)}^{T-t}+\frac{2|\mathcal{O}|z_{\theta,\hat{\theta}}L_{\theta,\|{\hat{\theta}-\theta}\|_{2}}}{\varepsilon^{2}_{b}\zeta}\left\|{\hat{\theta}-\theta}\right\|_{2}.
  2. 2.

    t[2:T]\forall t\in[2:T],

    γ~θμ,t|Tγ~θ^k,tTV2(1ε2bζ|𝒪|)t2+(1ε2bζ|𝒪|)Tt+4|𝒪|zθ,θ^Lθ,θ^θ2ε2bζθ^θ2.\left\|{\tilde{\gamma}^{\theta}_{\mu,t|T}-\tilde{\gamma}^{\hat{\theta}}_{k,t}}\right\|_{\rm TV}\leq 2\bigg{(}1-\frac{\varepsilon^{2}_{b}\zeta}{|\mathcal{O}|}\bigg{)}^{t-2}+\bigg{(}1-\frac{\varepsilon^{2}_{b}\zeta}{|\mathcal{O}|}\bigg{)}^{T-t}+\frac{4|\mathcal{O}|z_{\theta,\hat{\theta}}L_{\theta,\|{\hat{\theta}-\theta}\|_{2}}}{\varepsilon^{2}_{b}\zeta}\left\|{\hat{\theta}-\theta}\right\|_{2}.

Similarly, we can bound the difference of smoothing distributions in two extended graphical models with different kk and different parameter θ\theta.

Lemma D.7 (Bounding the difference of smoothing distributions, Part II).

For all θ,θ^Θ\theta,\hat{\theta}\in\mathit{\Theta} and t[1:T]t\in[1:T], with any two integers k2>k1>0k_{2}>k_{1}>0 and the observation sequence {st,at}t\{s_{t},a_{t}\}_{t\in\mathbb{Z}} corresponding to any ωΩ\omega\in\Omega, we have

γθk1,tγθ^k2,tTV(1ε2bζ|𝒪|)t+k11+(1ε2bζ|𝒪|)T+k1t+2|𝒪|zθ,θ^Lθ,θ^θ2ε2bζθ^θ2,\left\|{\gamma^{\theta}_{k_{1},t}-\gamma^{\hat{\theta}}_{k_{2},t}}\right\|_{\rm TV}\leq\bigg{(}1-\frac{\varepsilon^{2}_{b}\zeta}{|\mathcal{O}|}\bigg{)}^{t+k_{1}-1}+\bigg{(}1-\frac{\varepsilon^{2}_{b}\zeta}{|\mathcal{O}|}\bigg{)}^{T+k_{1}-t}+\frac{2|\mathcal{O}|z_{\theta,\hat{\theta}}L_{\theta,\|{\hat{\theta}-\theta}\|_{2}}}{\varepsilon^{2}_{b}\zeta}\left\|{\hat{\theta}-\theta}\right\|_{2},
γ~θk1,tγ~θ^k2,tTV2(1ε2bζ|𝒪|)t+k12+(1ε2bζ|𝒪|)T+k1t+4|𝒪|zθ,θ^Lθ,θ^θ2ε2bζθ^θ2.\left\|{\tilde{\gamma}^{\theta}_{k_{1},t}-\tilde{\gamma}^{\hat{\theta}}_{k_{2},t}}\right\|_{\rm TV}\leq 2\bigg{(}1-\frac{\varepsilon^{2}_{b}\zeta}{|\mathcal{O}|}\bigg{)}^{t+k_{1}-2}+\bigg{(}1-\frac{\varepsilon^{2}_{b}\zeta}{|\mathcal{O}|}\bigg{)}^{T+k_{1}-t}+\frac{4|\mathcal{O}|z_{\theta,\hat{\theta}}L_{\theta,\|{\hat{\theta}-\theta}\|_{2}}}{\varepsilon^{2}_{b}\zeta}\left\|{\hat{\theta}-\theta}\right\|_{2}.

It can be easily verified that in Lemma D.6 and Lemma D.7, the bounds still hold if θ\theta and θ^\hat{\theta} on the LHS are interchanged. We only present the proof of Lemma D.6. As for Lemma D.7, the proof is analogous therefore omitted. Our proof essentially relies on the smoothing stability lemma (Lemma D.5).

Proof of Lemma D.6.

Consider the first bound. For a cleaner notation, let

Δθ,θ^=|𝒪|zθ,θ^Lθ,θ^θ2ε2bζθ^θ2.\Delta_{\theta,\hat{\theta}}=\frac{|\mathcal{O}|z_{\theta,\hat{\theta}}L_{\theta,\|{\hat{\theta}-\theta}\|_{2}}}{\varepsilon^{2}_{b}\zeta}\left\|{\hat{\theta}-\theta}\right\|_{2}.

Apply Lemma D.5 as follows: t[1:T]\forall t\in[1:T], let φθt=αθμ,t\varphi^{\theta}_{t}=\alpha^{\theta}_{\mu,t} and φ^θ^t=αθ^k,t\hat{\varphi}^{\hat{\theta}}_{t}=\alpha^{\hat{\theta}}_{k,t}; let ρθt=βθt|T\rho^{\theta}_{t}=\beta^{\theta}_{t|T} and ρ^θ^t=βθ^k,t\hat{\rho}^{\hat{\theta}}_{t}=\beta^{\hat{\theta}}_{k,t}. Due to Assumption 1, the strictly positive requirement is satisfied. Then, we have

αθμ,tβθt|Tαθμ,t,βθt|Tαθ^k,tβθt|Tαθ^k,t,βθt|TTV(1ε2bζ|𝒪|)t1+Δθ,θ^,\left\|{\frac{\alpha^{\theta}_{\mu,t}\cdot\beta^{\theta}_{t|T}}{\langle\alpha^{\theta}_{\mu,t},\beta^{\theta}_{t|T}\rangle}-\frac{\alpha^{\hat{\theta}}_{k,t}\cdot\beta^{\theta}_{t|T}}{\langle\alpha^{\hat{\theta}}_{k,t},\beta^{\theta}_{t|T}\rangle}}\right\|_{\rm TV}\leq\bigg{(}1-\frac{\varepsilon^{2}_{b}\zeta}{|\mathcal{O}|}\bigg{)}^{t-1}+\Delta_{\theta,\hat{\theta}},
αθ^k,tβθt|Tαθ^k,t,βθt|Tαθ^k,tβθ^k,tαθ^k,t,βθ^k,tTV(1ε2bζ|𝒪|)Tt+Δθ,θ^,\left\|{\frac{\alpha^{\hat{\theta}}_{k,t}\cdot\beta^{\theta}_{t|T}}{\langle\alpha^{\hat{\theta}}_{k,t},\beta^{\theta}_{t|T}\rangle}-\frac{\alpha^{\hat{\theta}}_{k,t}\cdot\beta^{\hat{\theta}}_{k,t}}{\langle\alpha^{\hat{\theta}}_{k,t},\beta^{\hat{\theta}}_{k,t}\rangle}}\right\|_{\rm TV}\leq\bigg{(}1-\frac{\varepsilon^{2}_{b}\zeta}{|\mathcal{O}|}\bigg{)}^{T-t}+\Delta_{\theta,\hat{\theta}},

where \cdot denotes element-wise product and ,\langle\cdot,\cdot\rangle denotes Euclidean inner product. Therefore,

γθμ,t|Tγθ^k,tTV\displaystyle\left\|{\gamma^{\theta}_{\mu,t|T}-\gamma^{\hat{\theta}}_{k,t}}\right\|_{\rm TV} =αθμ,tβθt|Tαθμ,t,βθt|Tαθ^k,tβθ^k,tαθ^k,t,βθ^k,tTV\displaystyle=\left\|{\frac{\alpha^{\theta}_{\mu,t}\cdot\beta^{\theta}_{t|T}}{\langle\alpha^{\theta}_{\mu,t},\beta^{\theta}_{t|T}\rangle}-\frac{\alpha^{\hat{\theta}}_{k,t}\cdot\beta^{\hat{\theta}}_{k,t}}{\langle\alpha^{\hat{\theta}}_{k,t},\beta^{\hat{\theta}}_{k,t}\rangle}}\right\|_{\rm TV}
αθμ,tβθt|Tαθμ,t,βθt|Tαθ^k,tβθt|Tαθ^k,t,βθt|TTV+αθ^k,tβθt|Tαθ^k,t,βθt|Tαθ^k,tβθ^k,tαθ^k,t,βθ^k,tTV\displaystyle\leq\left\|{\frac{\alpha^{\theta}_{\mu,t}\cdot\beta^{\theta}_{t|T}}{\langle\alpha^{\theta}_{\mu,t},\beta^{\theta}_{t|T}\rangle}-\frac{\alpha^{\hat{\theta}}_{k,t}\cdot\beta^{\theta}_{t|T}}{\langle\alpha^{\hat{\theta}}_{k,t},\beta^{\theta}_{t|T}\rangle}}\right\|_{\rm TV}+\left\|{\frac{\alpha^{\hat{\theta}}_{k,t}\cdot\beta^{\theta}_{t|T}}{\langle\alpha^{\hat{\theta}}_{k,t},\beta^{\theta}_{t|T}\rangle}-\frac{\alpha^{\hat{\theta}}_{k,t}\cdot\beta^{\hat{\theta}}_{k,t}}{\langle\alpha^{\hat{\theta}}_{k,t},\beta^{\hat{\theta}}_{k,t}\rangle}}\right\|_{\rm TV}
(1ε2bζ|𝒪|)t1+(1ε2bζ|𝒪|)Tt+2Δθ,θ^.\displaystyle\leq\bigg{(}1-\frac{\varepsilon^{2}_{b}\zeta}{|\mathcal{O}|}\bigg{)}^{t-1}+\bigg{(}1-\frac{\varepsilon^{2}_{b}\zeta}{|\mathcal{O}|}\bigg{)}^{T-t}+2\Delta_{\theta,\hat{\theta}}.

Next, we bound the difference of two-step smoothing distributions γ~θμ,t|Tγ~θ^k,tTV\|{\tilde{\gamma}^{\theta}_{\mu,t|T}-\tilde{\gamma}^{\hat{\theta}}_{k,t}}\|_{\rm TV}. Although the idea is straightforward, the details are tedious. For any t[2:T]t\in[2:T], from (6) we have

γ~θμ,t|T(ot1,bt)\displaystyle\tilde{\gamma}^{\theta}_{\mu,t|T}(o_{t-1},b_{t})
\displaystyle\propto~{} πb(bt|st,ot1;θb)[otπ¯hi(ot|st,ot1,bt;θhi)πlo(at|st,ot;θlo)γθμ,t|T(ot,bt)αθμ,t(ot,bt)][bt1αθμ,t1(ot1,bt1)]\displaystyle\pi_{b}(b_{t}|s_{t},o_{t-1};\theta_{b})\left[\sum_{o_{t}}\bar{\pi}_{hi}(o_{t}|s_{t},o_{t-1},b_{t};\theta_{hi})\pi_{lo}(a_{t}|s_{t},o_{t};\theta_{lo})\frac{\gamma^{\theta}_{\mu,t|T}(o_{t},b_{t})}{\alpha^{\theta}_{\mu,t}(o_{t},b_{t})}\right]\left[\sum_{b_{t-1}}\alpha^{\theta}_{\mu,t-1}(o_{t-1},b_{t-1})\right]
\displaystyle\propto~{} otπ¯hi(ot|st,ot1,bt;θhi)πlo(at|st,ot;θlo)γθμ,t|T(ot,bt)[bt1αθμ,t1(ot1,bt1)]πb(bt|st,ot1;θb)ot1,bt1πb(bt|st,ot1;θb)π¯hi(ot|st,ot1,bt;θhi)πlo(at|st,ot;θlo)αθμ,t1(ot1,bt1)\displaystyle\sum_{o_{t}}\frac{\bar{\pi}_{hi}(o_{t}|s_{t},o_{t-1},b_{t};\theta_{hi})\pi_{lo}(a_{t}|s_{t},o_{t};\theta_{lo})\gamma^{\theta}_{\mu,t|T}(o_{t},b_{t})[\sum_{b_{t-1}}\alpha^{\theta}_{\mu,t-1}(o_{t-1},b_{t-1})]\pi_{b}(b_{t}|s_{t},o_{t-1};\theta_{b})}{\sum_{o^{\prime}_{t-1},b_{t-1}}\pi_{b}(b_{t}|s_{t},o^{\prime}_{t-1};\theta_{b})\bar{\pi}_{hi}(o_{t}|s_{t},o^{\prime}_{t-1},b_{t};\theta_{hi})\pi_{lo}(a_{t}|s_{t},o_{t};\theta_{lo})\alpha^{\theta}_{\mu,t-1}(o^{\prime}_{t-1},b_{t-1})}
=\displaystyle=~{} otπb(bt|st,ot1;θb)π¯hi(ot|st,ot1,bt;θhi)[bt1αθμ,t1(ot1,bt1)]ot1πb(bt|st,ot1;θb)π¯hi(ot|st,ot1,bt;θhi)[bt1αθμ,t1(ot1,bt1)]γθμ,t|T(ot,bt).\displaystyle\sum_{o_{t}}\frac{\pi_{b}(b_{t}|s_{t},o_{t-1};\theta_{b})\bar{\pi}_{hi}(o_{t}|s_{t},o_{t-1},b_{t};\theta_{hi})[\sum_{b_{t-1}}\alpha^{\theta}_{\mu,t-1}(o_{t-1},b_{t-1})]}{\sum_{o^{\prime}_{t-1}}\pi_{b}(b_{t}|s_{t},o^{\prime}_{t-1};\theta_{b})\bar{\pi}_{hi}(o_{t}|s_{t},o^{\prime}_{t-1},b_{t};\theta_{hi})[\sum_{b_{t-1}}\alpha^{\theta}_{\mu,t-1}(o^{\prime}_{t-1},b_{t-1})]}\gamma^{\theta}_{\mu,t|T}(o_{t},b_{t}).

The denominators are all positive due to the non-degeneracy assumption. It can be easily verified that the normalizing constants involved in the second and the third line cancel each other. As abbreviations, define

gθ(ot1,st,ot,bt):=πb(bt|st,ot1;θb)π¯hi(ot|st,ot1,bt;θhi),\displaystyle g^{\theta}(o_{t-1},s_{t},o_{t},b_{t})\mathrel{\mathop{:}}=\pi_{b}(b_{t}|s_{t},o_{t-1};\theta_{b})\bar{\pi}_{hi}(o_{t}|s_{t},o_{t-1},b_{t};\theta_{hi}),
gθ^(ot1,st,ot,bt):=πb(bt|st,ot1;θ^b)π¯hi(ot|st,ot1,bt;θ^hi),\displaystyle g^{\hat{\theta}}(o_{t-1},s_{t},o_{t},b_{t})\mathrel{\mathop{:}}=\pi_{b}(b_{t}|s_{t},o_{t-1};\hat{\theta}_{b})\bar{\pi}_{hi}(o_{t}|s_{t},o_{t-1},b_{t};\hat{\theta}_{hi}),
fθμ,t(ot1,st,ot,bt):=gθ(ot1,st,ot,bt)[bt1αθμ,t1(ot1,bt1)]ot1gθ(ot1,st,ot,bt)[bt1αθμ,t1(ot1,bt1)],\displaystyle f^{\theta}_{\mu,t}(o_{t-1},s_{t},o_{t},b_{t})\mathrel{\mathop{:}}=\frac{g^{\theta}(o_{t-1},s_{t},o_{t},b_{t})[\sum_{b_{t-1}}\alpha^{\theta}_{\mu,t-1}(o_{t-1},b_{t-1})]}{\sum_{o^{\prime}_{t-1}}g^{\theta}(o^{\prime}_{t-1},s_{t},o_{t},b_{t})[\sum_{b_{t-1}}\alpha^{\theta}_{\mu,t-1}(o^{\prime}_{t-1},b_{t-1})]},
fθ^k,t(ot1,st,ot,bt):=gθ^(ot1,st,ot,bt)[bt1αθ^k,t1(ot1,bt1)]ot1gθ^(ot1,st,ot,bt)[bt1αθ^k,t1(ot1,bt1)].\displaystyle f^{\hat{\theta}}_{k,t}(o_{t-1},s_{t},o_{t},b_{t})\mathrel{\mathop{:}}=\frac{g^{\hat{\theta}}(o_{t-1},s_{t},o_{t},b_{t})[\sum_{b_{t-1}}\alpha^{\hat{\theta}}_{k,t-1}(o_{t-1},b_{t-1})]}{\sum_{o^{\prime}_{t-1}}g^{\hat{\theta}}(o^{\prime}_{t-1},s_{t},o_{t},b_{t})[\sum_{b_{t-1}}\alpha^{\hat{\theta}}_{k,t-1}(o^{\prime}_{t-1},b_{t-1})]}.

Then,

γ~θμ,t|Tγ~θ^k,tTV=\displaystyle\left\|{\tilde{\gamma}^{\theta}_{\mu,t|T}-\tilde{\gamma}^{\hat{\theta}}_{k,t}}\right\|_{\rm TV}=~{} 12ot1,bt|ot[fθμ,t(ot1,st,ot,bt)γθμ,t|T(ot,bt)fθ^k,t(ot1,st,ot,bt)γθ^k,t|T(ot,bt)]|\displaystyle\frac{1}{2}\sum_{o_{t-1},b_{t}}\bigg{|}\sum_{o_{t}}[f^{\theta}_{\mu,t}(o_{t-1},s_{t},o_{t},b_{t})\gamma^{\theta}_{\mu,t|T}(o_{t},b_{t})-f^{\hat{\theta}}_{k,t}(o_{t-1},s_{t},o_{t},b_{t})\gamma^{\hat{\theta}}_{k,t|T}(o_{t},b_{t})]\bigg{|}
\displaystyle\leq~{} 12ot1,bt,ot|fθμ,t(ot1,st,ot,bt)fθ^k,t(ot1,st,ot,bt)|γθμ,t|T(ot,bt)\displaystyle\frac{1}{2}\sum_{o_{t-1},b_{t},o_{t}}\left|f^{\theta}_{\mu,t}(o_{t-1},s_{t},o_{t},b_{t})-f^{\hat{\theta}}_{k,t}(o_{t-1},s_{t},o_{t},b_{t})\right|\gamma^{\theta}_{\mu,t|T}(o_{t},b_{t})
+12ot1,bt,otfθ^k,t(ot1,st,ot,bt)|γθμ,t|T(ot,bt)γθ^k,t|T(ot,bt)|.\displaystyle\hskip 50.00008pt+\frac{1}{2}\sum_{o_{t-1},b_{t},o_{t}}f^{\hat{\theta}}_{k,t}(o_{t-1},s_{t},o_{t},b_{t})\left|\gamma^{\theta}_{\mu,t|T}(o_{t},b_{t})-\gamma^{\hat{\theta}}_{k,t|T}(o_{t},b_{t})\right|. (29)

Now, we bound the two terms on the RHS separately. Consider the first term in (29),

12ot1,ot,bt|fθμ,t(ot1,st,ot,bt)fθ^k,t(ot1,st,ot,bt)|γθμ,t|T(ot,bt)\displaystyle\frac{1}{2}\sum_{o_{t-1},o_{t},b_{t}}\left|f^{\theta}_{\mu,t}(o_{t-1},s_{t},o_{t},b_{t})-f^{\hat{\theta}}_{k,t}(o_{t-1},s_{t},o_{t},b_{t})\right|\gamma^{\theta}_{\mu,t|T}(o_{t},b_{t})
\displaystyle\leq~{} 12maxot,btot1,bt1|gθ(ot1,st,ot,bt)αθμ,t1(ot1,bt1)ot1,bt1gθ(ot1,st,ot,bt)αθμ,t1(ot1,bt1)\displaystyle\frac{1}{2}\max_{o_{t},b_{t}}\sum_{o_{t-1},b_{t-1}}\bigg{|}\frac{g^{\theta}(o_{t-1},s_{t},o_{t},b_{t})\alpha^{\theta}_{\mu,t-1}(o_{t-1},b_{t-1})}{\sum_{o^{\prime}_{t-1},b^{\prime}_{t-1}}g^{\theta}(o^{\prime}_{t-1},s_{t},o_{t},b_{t})\alpha^{\theta}_{\mu,t-1}(o^{\prime}_{t-1},b^{\prime}_{t-1})}
gθ(ot1,st,ot,bt)αθ^k,t1(ot1,bt1)ot1,bt1gθ(ot1,st,ot,bt)αθ^k,t1(ot1,bt1)|\displaystyle\hskip 80.00012pt-\frac{g^{\theta}(o_{t-1},s_{t},o_{t},b_{t})\alpha^{\hat{\theta}}_{k,t-1}(o_{t-1},b_{t-1})}{\sum_{o^{\prime}_{t-1},b^{\prime}_{t-1}}g^{\theta}(o^{\prime}_{t-1},s_{t},o_{t},b_{t})\alpha^{\hat{\theta}}_{k,t-1}(o^{\prime}_{t-1},b^{\prime}_{t-1})}\bigg{|}
+12maxot,btot1,bt1αθ^k,t1(ot1,bt1)|gθ(ot1,st,ot,bt)ot1,bt1gθ(ot1,st,ot,bt)αθ^k,t1(ot1,bt1)\displaystyle\hskip 20.00003pt+\frac{1}{2}\max_{o_{t},b_{t}}\sum_{o_{t-1},b_{t-1}}\alpha^{\hat{\theta}}_{k,t-1}(o_{t-1},b_{t-1})\bigg{|}\frac{g^{\theta}(o_{t-1},s_{t},o_{t},b_{t})}{\sum_{o^{\prime}_{t-1},b^{\prime}_{t-1}}g^{\theta}(o^{\prime}_{t-1},s_{t},o_{t},b_{t})\alpha^{\hat{\theta}}_{k,t-1}(o^{\prime}_{t-1},b^{\prime}_{t-1})}
gθ^(ot1,st,ot,bt)ot1,bt1gθ^(ot1,st,ot,bt)αθ^k,t1(ot1,bt1)|.\displaystyle\hskip 100.00015pt-\frac{g^{\hat{\theta}}(o_{t-1},s_{t},o_{t},b_{t})}{\sum_{o^{\prime}_{t-1},b^{\prime}_{t-1}}g^{\hat{\theta}}(o^{\prime}_{t-1},s_{t},o_{t},b_{t})\alpha^{\hat{\theta}}_{k,t-1}(o^{\prime}_{t-1},b^{\prime}_{t-1})}\bigg{|}. (30)

Denote the two terms on the RHS of (30) as Δ1\Delta_{1} and Δ2\Delta_{2} respectively. To bound Δ1\Delta_{1}, we can apply Lemma D.5 on the index set [1:t1][1:t-1] as follows, assuming t>2t>2. For any t[1:t1]t^{\prime}\in[1:t-1], let φθt=αθμ,t\varphi^{\theta}_{t^{\prime}}=\alpha^{\theta}_{\mu,t^{\prime}} and φ^θ^t=αθ^k,t\hat{\varphi}^{\hat{\theta}}_{t^{\prime}}=\alpha^{\hat{\theta}}_{k,t^{\prime}}. For any (ot,bt)(o_{t},b_{t}), let ρθt1(ot1,bt1)=z1θgθ(ot1,st,ot,bt)\rho^{\theta}_{t-1}(o_{t-1},b_{t-1})=z^{-1}_{\theta}g^{\theta}(o_{t-1},s_{t},o_{t},b_{t}), where zθz_{\theta} is a normalizing constant. For 1t<t11\leq t^{\prime}<t-1, let ρθt=Bθtρθt+1\rho^{\theta}_{t^{\prime}}=B^{\theta}_{t^{\prime}}\rho^{\theta}_{t^{\prime}+1}. Then,

Δ1(1ε2bζ|𝒪|)t2+Δθ,θ^.\Delta_{1}\leq\bigg{(}1-\frac{\varepsilon^{2}_{b}\zeta}{|\mathcal{O}|}\bigg{)}^{t-2}+\Delta_{\theta,\hat{\theta}}.

Such a bound holds trivially if t2t\leq 2.

Next, we bound Δ2\Delta_{2} as follows. Straightforward computation yields the following result.

Δ2\displaystyle\Delta_{2} =12maxot,btot1,bt1αθ^k,t1(ot1,bt1)|h(θ;ot1,st,at,ot,bt)ot1,bt1h(θ;ot1,st,at,ot,bt)αθ^k,t1(ot1,bt1)\displaystyle=\frac{1}{2}\max_{o_{t},b_{t}}\sum_{o_{t-1},b_{t-1}}\alpha^{\hat{\theta}}_{k,t-1}(o_{t-1},b_{t-1})\bigg{|}\frac{h(\theta;o_{t-1},s_{t},a_{t},o_{t},b_{t})}{\sum_{o^{\prime}_{t-1},b^{\prime}_{t-1}}h(\theta;o^{\prime}_{t-1},s_{t},a_{t},o_{t},b_{t})\alpha^{\hat{\theta}}_{k,t-1}(o^{\prime}_{t-1},b^{\prime}_{t-1})}
h(θ^;ot1,st,at,ot,bt)ot1,bt1h(θ^;ot1,st,at,ot,bt)αθ^k,t1(ot1,bt1)|\displaystyle\hskip 70.0001pt-\frac{h(\hat{\theta};o_{t-1},s_{t},a_{t},o_{t},b_{t})}{\sum_{o^{\prime}_{t-1},b^{\prime}_{t-1}}h(\hat{\theta};o^{\prime}_{t-1},s_{t},a_{t},o_{t},b_{t})\alpha^{\hat{\theta}}_{k,t-1}(o^{\prime}_{t-1},b^{\prime}_{t-1})}\bigg{|}
maxot,btot1,bt1|h(θ;ot1,st,at,ot,bt)h(θ^;ot1,st,at,ot,bt)|αθ^k,t1(ot1,bt1)ot1,bt1h(θ;ot1,st,at,ot,bt)αθ^k,t1(ot1,bt1)\displaystyle\leq\max_{o_{t},b_{t}}\frac{\sum_{o_{t-1},b_{t-1}}\left|h(\theta;o_{t-1},s_{t},a_{t},o_{t},b_{t})-h(\hat{\theta};o_{t-1},s_{t},a_{t},o_{t},b_{t})\right|\alpha^{\hat{\theta}}_{k,t-1}(o_{t-1},b_{t-1})}{\sum_{o^{\prime}_{t-1},b^{\prime}_{t-1}}h(\theta;o^{\prime}_{t-1},s_{t},a_{t},o_{t},b_{t})\alpha^{\hat{\theta}}_{k,t-1}(o^{\prime}_{t-1},b^{\prime}_{t-1})}
maxot1,ot,bt|h(θ;ot1,st,at,ot,bt)h(θ^;ot1,st,at,ot,bt)|minot1,ot,bth(θ;ot1,st,at,ot,bt)Δθ,θ^,\displaystyle\leq\frac{\max_{o_{t-1},o_{t},b_{t}}\left|h(\theta;o_{t-1},s_{t},a_{t},o_{t},b_{t})-h(\hat{\theta};o_{t-1},s_{t},a_{t},o_{t},b_{t})\right|}{\min_{o_{t-1},o_{t},b_{t}}h(\theta;o_{t-1},s_{t},a_{t},o_{t},b_{t})}\leq\Delta_{\theta,\hat{\theta}},

where we use the definition of h(θ;ot1,st,at,ot,bt)h(\theta;o_{t-1},s_{t},a_{t},o_{t},b_{t}) in (24).

As for the second term in (29),

12ot1,bt,otfθk,t(ot1,st,ot,bt)|γθμ,t|T(ot,bt)γθk,t|T(ot,bt)|\displaystyle\frac{1}{2}\sum_{o_{t-1},b_{t},o_{t}}f^{\theta}_{k,t}(o_{t-1},s_{t},o_{t},b_{t})\left|\gamma^{\theta}_{\mu,t|T}(o_{t},b_{t})-\gamma^{\theta}_{k,t|T}(o_{t},b_{t})\right|
=\displaystyle=~{} γθμ,t|Tγθk,tTV(1ε2bζ|𝒪|)t1+(1ε2bζ|𝒪|)Tt+2Δθ,θ^.\displaystyle\left\|{\gamma^{\theta}_{\mu,t|T}-\gamma^{\theta}_{k,t}}\right\|_{\rm TV}\leq\bigg{(}1-\frac{\varepsilon^{2}_{b}\zeta}{|\mathcal{O}|}\bigg{)}^{t-1}+\bigg{(}1-\frac{\varepsilon^{2}_{b}\zeta}{|\mathcal{O}|}\bigg{)}^{T-t}+2\Delta_{\theta,\hat{\theta}}.

Combining the above gives the desirable result. ∎

D.4 Proof of Lemma C.1

Based on Lemma D.7, for all T2T\geq 2, θΘ\theta\in\mathit{\Theta} and t[1:T]t\in[1:T], with any observation sequence, both the sequences {γθk,t}k+\{\gamma^{\theta}_{k,t}\}_{k\in\mathbb{N}_{+}} and {γ~θk,t}k+\{\tilde{\gamma}^{\theta}_{k,t}\}_{k\in\mathbb{N}_{+}} are Cauchy sequences associated with the total variation distance. Moreover, the set of probability measures over the finite sample space 𝒪×{0,1}\mathcal{O}\times\{0,1\} is complete. Therefore, the limits of both {γθk,t}k+\{\gamma^{\theta}_{k,t}\}_{k\in\mathbb{N}_{+}} and {γ~θk,t}k+\{\tilde{\gamma}^{\theta}_{k,t}\}_{k\in\mathbb{N}_{+}} as kk\rightarrow\infty exist with respect to the total variation distance. From the definitions of {γθk,t}k+\{\gamma^{\theta}_{k,t}\}_{k\in\mathbb{N}_{+}} and {γ~θk,t}k+\{\tilde{\gamma}^{\theta}_{k,t}\}_{k\in\mathbb{N}_{+}} in Appendix C.1, it is clear that their limits as kk\rightarrow\infty do not depend on TT.

The Lipschitz continuity of γθ,t\gamma^{\theta}_{\infty,t} also follows from Lemma D.7. Specifically, for all θ,θ^Θ\theta,\hat{\theta}\in\mathit{\Theta} and t[1:T]t\in[1:T], with any observation sequence,

γθ,tγθ^,tTV2|𝒪|zθ,θ^Lθ,θ^θ2ε2bζθ^θ2.\left\|{\gamma^{\theta}_{\infty,t}-\gamma^{\hat{\theta}}_{\infty,t}}\right\|_{\rm TV}\leq\frac{2|\mathcal{O}|z_{\theta,\hat{\theta}}L_{\theta,\|{\hat{\theta}-\theta}\|_{2}}}{\varepsilon^{2}_{b}\zeta}\left\|{\hat{\theta}-\theta}\right\|_{2}.

The coefficient of θ^θ2\|{\hat{\theta}-\theta}\|_{2} on the RHS can be upper bounded by a constant that does not depend on θ\theta and θ^\hat{\theta}. The same argument holds for γ~θ,t\tilde{\gamma}^{\theta}_{\infty,t}. ∎

D.5 Proof of Lemma C.2

For a cleaner notation, we omit the dependency on ω\omega in the following analysis. From the definitions, for all θ,θΘ\theta,\theta^{\prime}\in\mathit{\Theta} and μ\mu\in\mathcal{M},

Qs,T(θ|θ)Qμ,T(θ|θ)\displaystyle Q^{s}_{\infty,T}(\theta^{\prime}|\theta)-Q_{\mu,T}(\theta^{\prime}|\theta)
=\displaystyle=~{} 1T{t=2Tot1,bt[γ~θ,t(ot1,bt)γ~θμ,t|T(ot1,bt)][logπb(bt|st,ot1;θb)]\displaystyle\frac{1}{T}\bigg{\{}\sum_{t=2}^{T}\sum_{o_{t-1},b_{t}}\left[\tilde{\gamma}^{\theta}_{\infty,t}(o_{t-1},b_{t})-\tilde{\gamma}^{\theta}_{\mu,t|T}(o_{t-1},b_{t})\right][\log\pi_{b}(b_{t}|s_{t},o_{t-1};\theta^{\prime}_{b})]
+t=1Tot,bt[γθ,t(ot,bt)γθμ,t|T(ot,bt)][logπlo(at|st,ot;θlo)]\displaystyle+\sum_{t=1}^{T}\sum_{o_{t},b_{t}}\left[\gamma^{\theta}_{\infty,t}(o_{t},b_{t})-\gamma^{\theta}_{\mu,t|T}(o_{t},b_{t})\right][\log\pi_{lo}(a_{t}|s_{t},o_{t};\theta^{\prime}_{lo})]
+t=1Tot[γθ,t(ot,bt=1)γθμ,t|T(ot,bt=1)][logπhi(ot|st;θhi)]+err},\displaystyle+\sum_{t=1}^{T}\sum_{o_{t}}\left[\gamma^{\theta}_{\infty,t}(o_{t},b_{t}=1)-\gamma^{\theta}_{\mu,t|T}(o_{t},b_{t}=1)\right][\log\pi_{hi}(o_{t}|s_{t};\theta^{\prime}_{hi})]+err\bigg{\}},

where the last term is a small error term associated with t=1t=1 such that,

|err|=|o0,b1γ~θ,1(o0,b1)[logπb(b1|s1,o0;θb)]|maxb1,s1,o0|logπb(b1|s1,o0;θb)|.\left|err\right|=\bigg{|}\sum_{o_{0},b_{1}}\tilde{\gamma}^{\theta}_{\infty,1}(o_{0},b_{1})\left[\log\pi_{b}(b_{1}|s_{1},o_{0};\theta^{\prime}_{b})\right]\bigg{|}\leq\max_{b_{1},s_{1},o_{0}}|\log\pi_{b}(b_{1}|s_{1},o_{0};\theta^{\prime}_{b})|.

The maximum on the RHS is finite due to the non-degeneracy assumption. Furthermore,

|Qs,T(θ|θ)Qμ,T(θ|θ)|\displaystyle\left|Q^{s}_{\infty,T}(\theta^{\prime}|\theta)-Q_{\mu,T}(\theta^{\prime}|\theta)\right|
\displaystyle\leq~{} 1T{t=2Tmaxbt,st,ot1|logπb(bt|st,ot1;θb)|ot1,bt|γ~θ,t(ot1,bt)γ~θμ,t|T(ot1,bt)|\displaystyle\frac{1}{T}\bigg{\{}\sum_{t=2}^{T}\max_{b_{t},s_{t},o_{t-1}}\left|\log\pi_{b}(b_{t}|s_{t},o_{t-1};\theta^{\prime}_{b})\right|\sum_{o_{t-1},b_{t}}\left|\tilde{\gamma}^{\theta}_{\infty,t}(o_{t-1},b_{t})-\tilde{\gamma}^{\theta}_{\mu,t|T}(o_{t-1},b_{t})\right|
+t=1Tmaxat,st,ot|logπlo(at|st,ot;θlo)|ot,bt|γθ,t(ot,bt)γθμ,t|T(ot,bt)|\displaystyle+\sum_{t=1}^{T}\max_{a_{t},s_{t},o_{t}}\left|\log\pi_{lo}(a_{t}|s_{t},o_{t};\theta^{\prime}_{lo})\right|\sum_{o_{t},b_{t}}\left|\gamma^{\theta}_{\infty,t}(o_{t},b_{t})-\gamma^{\theta}_{\mu,t|T}(o_{t},b_{t})\right|
+t=1Tmaxst,ot|logπhi(ot|st;θhi)|ot|γθ,t(ot,bt=1)γθμ,t|T(ot,bt=1)|+|err|}.\displaystyle+\sum_{t=1}^{T}\max_{s_{t},o_{t}}\left|\log\pi_{hi}(o_{t}|s_{t};\theta^{\prime}_{hi})\right|\sum_{o_{t}}\left|\gamma^{\theta}_{\infty,t}(o_{t},b_{t}=1)-\gamma^{\theta}_{\mu,t|T}(o_{t},b_{t}=1)\right|+|err|\bigg{\}}.

Since the bounds in Lemma D.6 hold for any k>0k>0, they also hold in the limit as kk\rightarrow\infty. Therefore, for any θ\theta, μ\mu and any t[1:T]t\in[1:T],

γθμ,t|Tγθ,tTV(1ε2bζ|𝒪|)t1+(1ε2bζ|𝒪|)Tt.\left\|{\gamma^{\theta}_{\mu,t|T}-\gamma^{\theta}_{\infty,t}}\right\|_{\rm TV}\leq\bigg{(}1-\frac{\varepsilon^{2}_{b}\zeta}{|\mathcal{O}|}\bigg{)}^{t-1}+\bigg{(}1-\frac{\varepsilon^{2}_{b}\zeta}{|\mathcal{O}|}\bigg{)}^{T-t}.

For any θ\theta, μ\mu and any t[2:T]t\in[2:T],

γ~θμ,t|Tγ~θ,tTV2(1ε2bζ|𝒪|)t2+(1ε2bζ|𝒪|)Tt.\left\|{\tilde{\gamma}^{\theta}_{\mu,t|T}-\tilde{\gamma}^{\theta}_{\infty,t}}\right\|_{\rm TV}\leq 2\bigg{(}1-\frac{\varepsilon^{2}_{b}\zeta}{|\mathcal{O}|}\bigg{)}^{t-2}+\bigg{(}1-\frac{\varepsilon^{2}_{b}\zeta}{|\mathcal{O}|}\bigg{)}^{T-t}.

Combining everything above,

|Qs,T(θ|θ)Qμ,T(θ|θ)|\displaystyle\left|Q^{s}_{\infty,T}(\theta^{\prime}|\theta)-Q_{\mu,T}(\theta^{\prime}|\theta)\right|
\displaystyle\leq~{} 1T{maxbt,st,ot1|logπb(bt|st,ot1;θb)|[1+2t=2Tγ~θμ,t|Tγ~θ,tTV]\displaystyle\frac{1}{T}\bigg{\{}\max_{b_{t},s_{t},o_{t-1}}\left|\log\pi_{b}(b_{t}|s_{t},o_{t-1};\theta^{\prime}_{b})\right|\bigg{[}1+2\sum_{t=2}^{T}\left\|{\tilde{\gamma}^{\theta}_{\mu,t|T}-\tilde{\gamma}^{\theta}_{\infty,t}}\right\|_{\rm TV}\bigg{]}
+2[maxat,st,ot|logπlo(at|st,ot;θlo)|+maxst,ot|logπhi(ot|st;θhi)|]t=1Tγθμ,t|Tγθ,tTV}\displaystyle+2\left[\max_{a_{t},s_{t},o_{t}}\left|\log\pi_{lo}(a_{t}|s_{t},o_{t};\theta^{\prime}_{lo})\right|+\max_{s_{t},o_{t}}\left|\log\pi_{hi}(o_{t}|s_{t};\theta^{\prime}_{hi})\right|\right]\sum_{t=1}^{T}\left\|{\gamma^{\theta}_{\mu,t|T}-\gamma^{\theta}_{\infty,t}}\right\|_{\rm TV}\bigg{\}}
\displaystyle\leq~{} 1T{(1+6|O|εb2ζ)maxbt,st,ot1|logπb(bt|st,ot1;θb)|\displaystyle\frac{1}{T}\bigg{\{}\bigg{(}1+\frac{6|O|}{\varepsilon_{b}^{2}\zeta}\bigg{)}\max_{b_{t},s_{t},o_{t-1}}\left|\log\pi_{b}(b_{t}|s_{t},o_{t-1};\theta^{\prime}_{b})\right|
+4|O|εb2ζ[maxat,st,ot|logπlo(at|st,ot;θlo)|+maxst,ot|logπhi(ot|st;θhi)|]}=C(θ)T,\displaystyle\hskip 50.00008pt+\frac{4|O|}{\varepsilon_{b}^{2}\zeta}\left[\max_{a_{t},s_{t},o_{t}}\left|\log\pi_{lo}(a_{t}|s_{t},o_{t};\theta^{\prime}_{lo})\right|+\max_{s_{t},o_{t}}\left|\log\pi_{hi}(o_{t}|s_{t};\theta^{\prime}_{hi})\right|\right]\bigg{\}}=\frac{C(\theta^{\prime})}{T},

where C(θ)C(\theta^{\prime}) is a positive real number that only depends on θ\theta^{\prime} and the structural constants |𝒪||\mathcal{O}|, ζ\zeta and εb\varepsilon_{b}. Due to Assumption 2, C(θ)C(\theta^{\prime}) is continuous with respect to θ\theta^{\prime}. Since Θ\mathit{\Theta} is compact, supθΘC(θ)<\sup_{\theta^{\prime}\in\mathit{\Theta}}C(\theta^{\prime})<\infty. Therefore,

|Qs,T(θ|θ)Qμ,T(θ|θ)|1TsupθΘC(θ).\left|Q^{s}_{\infty,T}(\theta^{\prime}|\theta)-Q_{\mu,T}(\theta^{\prime}|\theta)\right|\leq\frac{1}{T}\sup_{\theta^{\prime}\in\mathit{\Theta}}C(\theta^{\prime}).

Taking supremum with respect to θ\theta, θ\theta^{\prime} and μ\mu completes the proof. ∎

D.6 Proof of the strong stochastic equicontinuity condition (19)

First, for all δ>0\delta>0 and ωΩ\omega\in\Omega,

lim supTsupθ1,θ1,θ2,θ2Θ;θ1θ22+θ1θ22δ|Qs,T(θ1|θ1;ω)Qs,T(θ2|θ2;ω)|\displaystyle\limsup_{T\rightarrow\infty}\sup_{\theta_{1},\theta^{\prime}_{1},\theta_{2},\theta^{\prime}_{2}\in\mathit{\Theta};\|{\theta_{1}-\theta_{2}}\|_{2}+\|{\theta^{\prime}_{1}-\theta^{\prime}_{2}}\|_{2}\leq\delta}\left|Q^{s}_{\infty,T}(\theta^{\prime}_{1}|\theta_{1};\omega)-Q^{s}_{\infty,T}(\theta^{\prime}_{2}|\theta_{2};\omega)\right|
\displaystyle\leq~{} lim supT1Tsupθ1,θ1,θ2,θ2Θ;θ1θ22+θ1θ22δ|ft(θ1|θ1;ω)ft(θ2|θ2;ω)|.\displaystyle\limsup_{T\rightarrow\infty}\frac{1}{T}\sup_{\theta_{1},\theta^{\prime}_{1},\theta_{2},\theta^{\prime}_{2}\in\mathit{\Theta};\|{\theta_{1}-\theta_{2}}\|_{2}+\|{\theta^{\prime}_{1}-\theta^{\prime}_{2}}\|_{2}\leq\delta}\left|f_{t}(\theta^{\prime}_{1}|\theta_{1};\omega)-f_{t}(\theta^{\prime}_{2}|\theta_{2};\omega)\right|.

Due to the boundedness of ft(θ|θ;ω)f_{t}(\theta^{\prime}|\theta;\omega) from Appendix C.2, we can apply the ergodic theorem (Lemma C.3). θ,ν\mathbb{P}_{\theta^{*},\nu^{*}} almost surely,

lim supT1Tt=1Tsupθ1,θ1,θ2,θ2Θ;θ1θ22+θ1θ22δ|ft(θ1|θ1;ω)ft(θ2|θ2;ω)|\displaystyle\limsup_{T\rightarrow\infty}\frac{1}{T}\sum_{t=1}^{T}\sup_{\theta_{1},\theta^{\prime}_{1},\theta_{2},\theta^{\prime}_{2}\in\mathit{\Theta};\|{\theta_{1}-\theta_{2}}\|_{2}+\|{\theta^{\prime}_{1}-\theta^{\prime}_{2}}\|_{2}\leq\delta}\left|f_{t}(\theta^{\prime}_{1}|\theta_{1};\omega)-f_{t}(\theta^{\prime}_{2}|\theta_{2};\omega)\right|
=\displaystyle=~{} 𝔼θ,ν[supθ1,θ1,θ2,θ2Θ;θ1θ22+θ1θ22δ|f1(θ1|θ1;ω)f1(θ2|θ2;ω)|]\displaystyle\mathbb{E}_{\theta^{*},\nu^{*}}\bigg{[}\sup_{\theta_{1},\theta^{\prime}_{1},\theta_{2},\theta^{\prime}_{2}\in\mathit{\Theta};\|{\theta_{1}-\theta_{2}}\|_{2}+\|{\theta^{\prime}_{1}-\theta^{\prime}_{2}}\|_{2}\leq\delta}\left|f_{1}(\theta^{\prime}_{1}|\theta_{1};\omega)-f_{1}(\theta^{\prime}_{2}|\theta_{2};\omega)\right|\bigg{]}
\displaystyle\leq~{} 𝔼θ,ν[supθ1,θ1,θ2Θ;θ1θ22δ|f1(θ1|θ1;ω)f1(θ2|θ1;ω)|]\displaystyle\mathbb{E}_{\theta^{*},\nu^{*}}\bigg{[}\sup_{\theta_{1},\theta^{\prime}_{1},\theta^{\prime}_{2}\in\mathit{\Theta};\|{\theta^{\prime}_{1}-\theta^{\prime}_{2}}\|_{2}\leq\delta}\left|f_{1}(\theta^{\prime}_{1}|\theta_{1};\omega)-f_{1}(\theta^{\prime}_{2}|\theta_{1};\omega)\right|\bigg{]}
+𝔼θ,ν[supθ1,θ2,θ2Θ;θ1θ22δ|f1(θ2|θ1;ω)f1(θ2|θ2;ω)|].\displaystyle\hskip 50.00008pt+\mathbb{E}_{\theta^{*},\nu^{*}}\bigg{[}\sup_{\theta_{1},\theta_{2},\theta^{\prime}_{2}\in\mathit{\Theta};\|{\theta_{1}-\theta_{2}}\|_{2}\leq\delta}\left|f_{1}(\theta^{\prime}_{2}|\theta_{1};\omega)-f_{1}(\theta^{\prime}_{2}|\theta_{2};\omega)\right|\bigg{]}.

Notice that for all θ1\theta_{1}, θ1\theta^{\prime}_{1}, θ2\theta^{\prime}_{2} and ω\omega,

|f1(θ1|θ1;ω)f1(θ2|θ1;ω)|maxot|logπhi(ot|ω(st);θ1,hi)logπhi(ot|ω(st);θ2,hi)|+maxot|logπlo(ω(at)|ω(st),ot;θ1,lo)logπlo(ω(at)|ω(st),ot;θ2,lo)|+maxot1,bt|logπb(bt|ω(st),ot1;θ1,b)logπb(bt|ω(st),ot1;θ2,b)|.\left|f_{1}(\theta^{\prime}_{1}|\theta_{1};\omega)-f_{1}(\theta^{\prime}_{2}|\theta_{1};\omega)\right|\leq\max_{o_{t}}\left|\log\pi_{hi}(o_{t}|\omega(s_{t});\theta^{\prime}_{1,hi})-\log\pi_{hi}(o_{t}|\omega(s_{t});\theta^{\prime}_{2,hi})\right|\\ +\max_{o_{t}}\left|\log\pi_{lo}(\omega(a_{t})|\omega(s_{t}),o_{t};\theta^{\prime}_{1,lo})-\log\pi_{lo}(\omega(a_{t})|\omega(s_{t}),o_{t};\theta^{\prime}_{2,lo})\right|\\ +\max_{o_{t-1},b_{t}}\left|\log\pi_{b}(b_{t}|\omega(s_{t}),o_{t-1};\theta^{\prime}_{1,b})-\log\pi_{b}(b_{t}|\omega(s_{t}),o_{t-1};\theta^{\prime}_{2,b})\right|.

The RHS does not depend on θ1\theta_{1}. Due to Assumption 2, πhi\pi_{hi}, πlo\pi_{lo} and πb\pi_{b} as functions of the parameter θ\theta are uniformly continuous on Θ\mathit{\Theta}, with any other input arguments. Therefore it is straightforward to verify that, for any ωΩ\omega\in\Omega,

limδ0supθ1,θ1,θ2Θ;θ1θ22δ|f1(θ1|θ1;ω)f1(θ2|θ1;ω)|=0.\lim_{\delta\rightarrow 0}\sup_{\theta_{1},\theta^{\prime}_{1},\theta^{\prime}_{2}\in\mathit{\Theta};\|{\theta^{\prime}_{1}-\theta^{\prime}_{2}}\|_{2}\leq\delta}\left|f_{1}(\theta^{\prime}_{1}|\theta_{1};\omega)-f_{1}(\theta^{\prime}_{2}|\theta_{1};\omega)\right|=0.

Applying the dominated convergence theorem,

limδ0𝔼θ,ν[supθ1,θ1,θ2Θ;θ1θ22δ|f1(θ1|θ1;ω)f1(θ2|θ1;ω)|]=0.\lim_{\delta\rightarrow 0}\mathbb{E}_{\theta^{*},\nu^{*}}\bigg{[}\sup_{\theta_{1},\theta^{\prime}_{1},\theta^{\prime}_{2}\in\mathit{\Theta};\|{\theta^{\prime}_{1}-\theta^{\prime}_{2}}\|_{2}\leq\delta}\left|f_{1}(\theta^{\prime}_{1}|\theta_{1};\omega)-f_{1}(\theta^{\prime}_{2}|\theta_{1};\omega)\right|\bigg{]}=0.

Similarly, using Lemma C.1 we can show that for any ωΩ\omega\in\Omega,

limδ0supθ1,θ2,θ2Θ;θ1θ22δ|f1(θ2|θ1;ω)f1(θ2|θ2;ω)|=0.\lim_{\delta\rightarrow 0}\sup_{\theta_{1},\theta_{2},\theta^{\prime}_{2}\in\mathit{\Theta};\|{\theta_{1}-\theta_{2}}\|_{2}\leq\delta}\left|f_{1}(\theta^{\prime}_{2}|\theta_{1};\omega)-f_{1}(\theta^{\prime}_{2}|\theta_{2};\omega)\right|=0.

Using the dominated convergence theorem gives the convergence of the expectation as well. Combining the above gives the strong stochastic equicontinuity condition (19). ∎

D.7 Proof of Lemma C.5

Consider the following joint distribution on the graphical model shown in Figure 1: the prior distribution of (O0,S1)(O_{0},S_{1}) is ν\nu^{*}, and the joint distribution of the rest of the graphical model is determined by an options with failure policy with parameters ζ\zeta and θ\theta. Notice that this is the correct graphical model for the inference of the true parameter θ\theta^{*}, since the assumed prior distribution of (O0,S1)(O_{0},S_{1}) coincides with the correct one.

For clarity, we use the same notations as in Appendix B.3 for the complete likelihood function, the marginal likelihood function and the (unnormalized) QQ-function. Specifically, such quantities used in this proof have the same symbols as those defined in Appendix B.3, but mathematically they are not the same.

Parallel to Appendix B.3, the complete likelihood function is

L(s1:T,a1:T,o0:T,b1:T;θ)=ν(o0,s1)θ,o0,s1(S2:T=s1:T,A1:T=a1:T,O1:T=o1:T,B1:T=b1:T).L(s_{1:T},a_{1:T},o_{0:T},b_{1:T};\theta)=\nu^{*}(o_{0},s_{1})\mathbb{P}_{\theta,o_{0},s_{1}}(S_{2:T}=s_{1:T},A_{1:T}=a_{1:T},O_{1:T}=o_{1:T},B_{1:T}=b_{1:T}).

The marginal likelihood function is

Lm(s1:T,a1:T;θ)=o0ν(o0,s1)θ,o0,s1(S2:T=s1:T,A1:T=a1:T).L^{m}(s_{1:T},a_{1:T};\theta)=\sum_{o_{0}}\nu^{*}(o_{0},s_{1})\mathbb{P}_{\theta,o_{0},s_{1}}(S_{2:T}=s_{1:T},A_{1:T}=a_{1:T}).

Let μ\mu^{*} be the conditional distribution of O0O_{0} given s1s_{1}. For any o0𝒪o_{0}\in\mathcal{O},

μ(o0|s1)=ν(o0,s1)o0𝒪ν(o0,s1).\mu^{*}(o_{0}|s_{1})=\frac{\nu^{*}(o_{0},s_{1})}{\sum_{o^{\prime}_{0}\in\mathcal{O}}\nu^{*}(o^{\prime}_{0},s_{1})}.

Therefore, for the inference of θ\theta^{*} considered in this proof, the (unnormalized) QQ-function can be expressed as

Q~μ,T(θ|θ)\displaystyle\tilde{Q}_{\mu^{*},T}(\theta^{\prime}|\theta)
=\displaystyle=~{} o0:T,b1:TL(s1:T,a1:T,o0:T,b1:T;θ)Lm(s1:T,a1:T;θ)logL(s1:T,a1:T,o0:T,b1:T;θ)\displaystyle\sum_{o_{0:T},b_{1:T}}\frac{L(s_{1:T},a_{1:T},o_{0:T},b_{1:T};\theta)}{L^{m}(s_{1:T},a_{1:T};\theta)}\log L(s_{1:T},a_{1:T},o_{0:T},b_{1:T};\theta^{\prime})
=\displaystyle=~{} o0:T,b1:Tμ(o0|s1)θ,o0,s1(S2:T=s2:T,A1:T=a1:T,O1:T=o1:T,B1:T=b1:T)\displaystyle\sum_{o_{0:T},b_{1:T}}\mu^{*}(o_{0}|s_{1})\mathbb{P}_{\theta,o_{0},s_{1}}(S_{2:T}=s_{2:T},A_{1:T}=a_{1:T},O_{1:T}=o_{1:T},B_{1:T}=b_{1:T})
×zθγ,μlog[ν(o0,s1)θ,o0,s1(S2:T=s1:T,A1:T=a1:T,O1:T=o1:T,B1:T=b1:T)].\displaystyle\hskip 20.00003pt\times z^{\theta}_{\gamma,\mu^{*}}\log[\nu^{*}(o_{0},s_{1})\mathbb{P}_{\theta^{\prime},o^{\prime}_{0},s_{1}}(S_{2:T}=s_{1:T},A_{1:T}=a_{1:T},O_{1:T}=o_{1:T},B_{1:T}=b_{1:T})].

We can rewrite Q~μ,T(θ|θ)\tilde{Q}_{\mu^{*},T}(\theta^{\prime}|\theta) using the structure of the options with failure framework, drop the terms irrelevant to θ\theta^{\prime} and normalize using TT. The result is the following definition of the (normalized) QQ-function:

QT(θ|θ):=\displaystyle Q^{*}_{T}(\theta^{\prime}|\theta)\mathrel{\mathop{:}}= o0,b1ν(o0|s1)θ,o0,s1(S2:T=s2:T,A1:T=a1:T,B1=b1)[logπb(b1|s1,o0;θb)]To0ν(o0,s1)θ,o0,s1(S2:T=s1:T,A1:T=a1:T)\displaystyle\frac{\sum_{o_{0},b_{1}}\nu^{*}(o_{0}|s_{1})\mathbb{P}_{\theta,o_{0},s_{1}}(S_{2:T}=s_{2:T},A_{1:T}=a_{1:T},B_{1}=b_{1})[\log\pi_{b}(b_{1}|s_{1},o_{0};\theta^{\prime}_{b})]}{T\sum_{o_{0}}\nu^{*}(o_{0},s_{1})\mathbb{P}_{\theta,o_{0},s_{1}}(S_{2:T}=s_{1:T},A_{1:T}=a_{1:T})}
+1Tt=1Tot,btγθμ,t|T(ot,bt)[logπlo(at|st,ot;θlo)]\displaystyle\hskip 50.00008pt+\frac{1}{T}\sum_{t=1}^{T}\sum_{o_{t},b_{t}}\gamma^{\theta}_{\mu^{*},t|T}(o_{t},b_{t})[\log\pi_{lo}(a_{t}|s_{t},o_{t};\theta^{\prime}_{lo})]
+1Tt=1Totγθμ,t|T(ot,bt=1)[logπhi(ot|st;θhi)]\displaystyle\hskip 50.00008pt+\frac{1}{T}\sum_{t=1}^{T}\sum_{o_{t}}\gamma^{\theta}_{\mu^{*},t|T}(o_{t},b_{t}=1)[\log\pi_{hi}(o_{t}|s_{t};\theta^{\prime}_{hi})]
+1Tt=2Tot1,btγ~θμ,t|T(ot1,bt)[logπb(bt|st,ot1;θb)].\displaystyle\hskip 100.00015pt+\frac{1}{T}\sum_{t=2}^{T}\sum_{o_{t-1},b_{t}}\tilde{\gamma}^{\theta}_{\mu^{*},t|T}(o_{t-1},b_{t})[\log\pi_{b}(b_{t}|s_{t},o_{t-1};\theta^{\prime}_{b})].

We draw a comparison between QT(θ|θ)Q^{*}_{T}(\theta^{\prime}|\theta) and Qμ,T(θ|θ)Q_{\mu^{*},T}(\theta^{\prime}|\theta) defined in (7): their difference is in the first term of QT(θ|θ)Q^{*}_{T}(\theta^{\prime}|\theta). Maximizing QT(θ|θ)Q^{*}_{T}(\theta^{\prime}|\theta) with respect to θ\theta^{\prime} is equivalent to maximizing the (unnormalized) QQ-function Q~μ,T(θ|θ)\tilde{Q}_{\mu^{*},T}(\theta^{\prime}|\theta). In Algorithm 1, since QT(θ|θ)Q^{*}_{T}(\theta^{\prime}|\theta) is unavailable, we use Qμ,T(θ|θ)Q_{\mu^{*},T}(\theta^{\prime}|\theta) as its approximation.

QT(θ|θ)Q^{*}_{T}(\theta^{\prime}|\theta) depends on the observation sequence, therefore it is a function of a sample path ωΩ\omega\in\Omega. In the following we explicitly show this dependency by writing QT(θ|θ;ω)Q^{*}_{T}(\theta^{\prime}|\theta;\omega). Clearly, for all θ,θΘ\theta,\theta^{\prime}\in\mathit{\Theta}, ωΩ\omega\in\Omega and T2T\geq 2,

|QT(θ|θ;ω)Qμ,T(θ|θ;ω)|1TsupθΘmaxb1,s1,o0|logπb(b1|s1,o0;θb)|.\left|Q^{*}_{T}(\theta^{\prime}|\theta;\omega)-Q_{\mu^{*},T}(\theta^{\prime}|\theta;\omega)\right|\leq\frac{1}{T}\sup_{\theta^{\prime}\in\mathit{\Theta}}\max_{b_{1},s_{1},o_{0}}\left|\log\pi_{b}(b_{1}|s_{1},o_{0};\theta^{\prime}_{b})\right|.

Combining this with the stochastic convergence of Qμ,TQ_{\mu^{*},T} as shown in Theorem 2, we have, that for any θΘ\theta\in\mathit{\Theta}, as TT\rightarrow\infty,

|QT(θ|θ;ω)Q¯(θ|θ)|0,Pθ,ν-a.s.\left|Q^{*}_{T}(\theta|\theta^{*};\omega)-\bar{Q}(\theta|\theta^{*})\right|\rightarrow 0,~{}P_{\theta^{*},\nu^{*}}\text{-a.s.}

Using the dominated convergence theorem, such a convergence holds in expectation as well. For any θΘ\theta\in\mathit{\Theta},

limT𝔼θ,ν[QT(θ|θ;ω)]=Q¯(θ|θ).\lim_{T\rightarrow\infty}\mathbb{E}_{\theta^{*},\nu^{*}}\left[Q^{*}_{T}(\theta|\theta^{*};\omega)\right]=\bar{Q}(\theta|\theta^{*}).

Since maximizing QT(θ|θ)Q^{*}_{T}(\theta|\theta^{*}) with respect to θ\theta is equivalent to maximizing the (unnormalized) QQ-function Q~μ,T(θ|θ)\tilde{Q}_{\mu^{*},T}(\theta|\theta^{*}), the standard monotonicity property of the EM update holds as well. For all θΘ\theta\in\mathit{\Theta}, ωΩ\omega\in\Omega and T2T\geq 2,

logLm[ω(s1:T),ω(a1:T);θ]logLm[ω(s1:T),ω(a1:T);θ]T[QT(θ|θ;ω)QT(θ|θ;ω)].\log L^{m}[\omega(s_{1:T}),\omega(a_{1:T});\theta]-\log L^{m}[\omega(s_{1:T}),\omega(a_{1:T});\theta^{*}]\geq T\left[Q^{*}_{T}(\theta|\theta^{*};\omega)-Q^{*}_{T}(\theta^{*}|\theta^{*};\omega)\right].

Taking expectation on both sides, we have

𝔼θ,ν[LHS]=s1:T,a1:TLm(s1:T,a1:T;θ)logLm(s1:T,a1:T;θ)Lm(s1:T,a1:T;θ)0,\mathbb{E}_{\theta^{*},\nu^{*}}[\text{LHS}]=\sum_{s_{1:T},a_{1:T}}L^{m}(s_{1:T},a_{1:T};\theta^{*})\log\frac{L^{m}(s_{1:T},a_{1:T};\theta)}{L^{m}(s_{1:T},a_{1:T};\theta^{*})}\leq 0,

due to the non-negativity of the Kullback-Leibler divergence. Therefore, 𝔼θ,ν[QT(θ|θ;ω)]𝔼θ,ν[QT(θ|θ;ω)]\mathbb{E}_{\theta^{*},\nu^{*}}[Q^{*}_{T}(\theta|\theta^{*};\omega)]\leq\mathbb{E}_{\theta^{*},\nu^{*}}[Q^{*}_{T}(\theta^{*}|\theta^{*};\omega)], and in the limit we have Q¯(θ|θ)Q¯(θ|θ)\bar{Q}(\theta|\theta^{*})\leq\bar{Q}(\theta^{*}|\theta^{*}) for all θΘ\theta\in\mathit{\Theta}. Applying the identifiability assumption for the uniqueness of M¯(θ)\bar{M}(\theta^{*}) completes the proof. ∎

Appendix E Additional experiments and details omitted in Section 5

E.1 Generation of the observation sequences

We first introduce the method to sample observation sequences from the stationary Markov chain induced by the expert policy. Using the expert policy and a fixed (o0,s1)(o_{0},s_{1}) pair, we generate 50 sample paths of length 20,000. Then, the first 10,000 time steps in each sample path are discarded, and the rest state-action pairs are saved as the observation sequences used in the algorithm. For different TT, we just take the first TT time steps in each observation sequence.

Such a procedure is motivated by Proposition 5: it can be easily verified that Assumption 1 and 2 hold in our numerical example. Therefore, from Proposition 5, the distribution of XtX_{t} approaches the unique stationary distribution regardless of the initial (o0,s1)(o_{0},s_{1}) pair. In this way, Assumption 3 is approximately satisfied.

E.2 Analytical expression of the parameter update

For our numerical example, the parameter update of Algorithm 1 has a unique analytical solution. For all θΘ\theta\in\mathit{\Theta}, ωΩ\omega\in\Omega, T2T\geq 2 and μ\mu\in\mathcal{M}, we first derive the analytical expression of Mμ,T(θ;ω)hiM_{\mu,T}(\theta;\omega)_{hi} which is the updated parameter for πhi\pi_{hi} based on the previous parameter θ\theta. Such a notation for parameter updates is borrowed from Assumption 5. Using the expression of the QQ-function (7), we have

Mμ,T(θ;ω)hiargmaxθhiΘhit=1Totγθμ,t|T(ot,bt=1)[logπhi(ot|st;θhi)],M_{\mu,T}(\theta;\omega)_{hi}\in\operatorname*{arg\,max}_{\theta^{\prime}_{hi}\in\mathit{\Theta}_{hi}}\sum_{t=1}^{T}\sum_{o_{t}}\gamma^{\theta}_{\mu,t|T}(o_{t},b_{t}=1)[\log\pi_{hi}(o_{t}|s_{t};\theta^{\prime}_{hi})],

where sts_{t} on the RHS is the state value ω(st)\omega(s_{t}) from the sample path ω\omega. We omit ω\omega on the RHS for a cleaner notation. Let f(θhi)f(\theta^{\prime}_{hi}) denote the sum inside the argmax. Then,

f(θhi)\displaystyle f(\theta^{\prime}_{hi}) =t=1T{γθμ,t|T(ot=LEFTEND,bt=1)[logπhi(ot=LEFTEND|st;θhi)]\displaystyle=\sum_{t=1}^{T}\bigg{\{}\gamma^{\theta}_{\mu,t|T}(o_{t}=\textrm{LEFTEND},b_{t}=1)[\log\pi_{hi}(o_{t}=\textrm{LEFTEND}|s_{t};\theta^{\prime}_{hi})]
+γθμ,t|T(ot=RIGHTEND,bt=1)[logπhi(ot=RIGHTEND|st;θhi)]}\displaystyle\hskip 50.00008pt+\gamma^{\theta}_{\mu,t|T}(o_{t}=\textrm{RIGHTEND},b_{t}=1)[\log\pi_{hi}(o_{t}=\textrm{RIGHTEND}|s_{t};\theta^{\prime}_{hi})]\bigg{\}}
=t=1T{γθμ,t|T(ot=LEFTEND,bt=1)[𝟙[st=1,2]logθhi+𝟙[st=3,4]log(1θhi)]\displaystyle=\sum_{t=1}^{T}\bigg{\{}\gamma^{\theta}_{\mu,t|T}(o_{t}=\textrm{LEFTEND},b_{t}=1)\Big{[}\mathbbm{1}[s_{t}=1,2]\log\theta^{\prime}_{hi}+\mathbbm{1}[s_{t}=3,4]\log(1-\theta^{\prime}_{hi})\Big{]}
+γθμ,t|T(ot=RIGHTEND,bt=1)[𝟙[st=3,4]logθhi+𝟙[st=1,2]log(1θhi)]}.\displaystyle\hskip 10.00002pt+\gamma^{\theta}_{\mu,t|T}(o_{t}=\textrm{RIGHTEND},b_{t}=1)\Big{[}\mathbbm{1}[s_{t}=3,4]\log\theta^{\prime}_{hi}+\mathbbm{1}[s_{t}=1,2]\log(1-\theta^{\prime}_{hi})\Big{]}\bigg{\}}.

Taking the derivative of f(θhi)f(\theta^{\prime}_{hi}), we can verify that f(θhi)f(\theta^{\prime}_{hi}) is strongly concave. Therefore, the parameter update for πhi\pi_{hi} is unique.

Mμ,T(θ;ω)hi={0.1,if M~μ,T(θ;ω)hi<0.1,M~μ,T(θ;ω)hi,if 0.1M~μ,T(θ;ω)hi0.9,0.9,if M~μ,T(θ;ω)hi>0.9,M_{\mu,T}(\theta;\omega)_{hi}=\begin{cases}0.1,&\text{if $\tilde{M}_{\mu,T}(\theta;\omega)_{hi}<0.1$},\\ \tilde{M}_{\mu,T}(\theta;\omega)_{hi},&\text{if $0.1\leq\tilde{M}_{\mu,T}(\theta;\omega)_{hi}\leq 0.9$},\\ 0.9,&\text{if $\tilde{M}_{\mu,T}(\theta;\omega)_{hi}>0.9$,}\end{cases}

where M~μ,T(θ;ω)hi\tilde{M}_{\mu,T}(\theta;\omega)_{hi} is the unconstrained parameter update given as

M~μ,T(θ;ω)hi=t=1Tγθμ,t|T(ot=LEFTEND,bt=1)𝟙[st=1,2]t=1Totγθμ,t|T(ot,bt=1)+t=1Tγθμ,t|T(ot=RIGHTEND,bt=1)𝟙[st=3,4]t=1Totγθμ,t|T(ot,bt=1).\tilde{M}_{\mu,T}(\theta;\omega)_{hi}=\frac{\sum_{t=1}^{T}\gamma^{\theta}_{\mu,t|T}(o_{t}=\textrm{LEFTEND},b_{t}=1)\mathbbm{1}[s_{t}=1,2]}{\sum_{t=1}^{T}\sum_{o_{t}}\gamma^{\theta}_{\mu,t|T}(o_{t},b_{t}=1)}\\ +\frac{\sum_{t=1}^{T}\gamma^{\theta}_{\mu,t|T}(o_{t}=\textrm{RIGHTEND},b_{t}=1)\mathbbm{1}[s_{t}=3,4]}{\sum_{t=1}^{T}\sum_{o_{t}}\gamma^{\theta}_{\mu,t|T}(o_{t},b_{t}=1)}.

Similarly, the unconstrained parameter updates for πlo\pi_{lo} and πb\pi_{b} are the following:

M~μ,T(θ;ω)lo=1Tt=1Tbt{γθμ,t|T(ot=LEFTEND,bt)𝟙[at=LEFT]+γθμ,t|T(ot=RIGHTEND,bt)𝟙[at=RIGHT]}.\tilde{M}_{\mu,T}(\theta;\omega)_{lo}=\frac{1}{T}\sum_{t=1}^{T}\sum_{b_{t}}\bigg{\{}\gamma^{\theta}_{\mu,t|T}(o_{t}=\textrm{LEFTEND},b_{t})\mathbbm{1}[a_{t}=\textrm{LEFT}]\\ +\gamma^{\theta}_{\mu,t|T}(o_{t}=\textrm{RIGHTEND},b_{t})\mathbbm{1}[a_{t}=\textrm{RIGHT}]\bigg{\}}.
M~μ,T(θ;ω)b=1T1t=2Tot1{γ~θμ,t|T(ot1,bt=1)𝟙[event]+γ~θμ,t|T(ot1,bt=0)𝟙[¬event]},\tilde{M}_{\mu,T}(\theta;\omega)_{b}=\frac{1}{T-1}\sum_{t=2}^{T}\sum_{o_{t-1}}\bigg{\{}\tilde{\gamma}^{\theta}_{\mu,t|T}(o_{t-1},b_{t}=1)\mathbbm{1}[\textrm{event}]+\tilde{\gamma}^{\theta}_{\mu,t|T}(o_{t-1},b_{t}=0)\mathbbm{1}[\neg\textrm{event}]\bigg{\}},

where the event={(st=1,ot1=LEFTEND)(st=4,ot1=RIGHTEND)}\textrm{event}=\{(s_{t}=1,o_{t-1}=\textrm{LEFTEND})\vee(s_{t}=4,o_{t-1}=\textrm{RIGHTEND})\}. The parameter updates Mμ,T(θ;ω)loM_{\mu,T}(\theta;\omega)_{lo} and Mμ,T(θ;ω)bM_{\mu,T}(\theta;\omega)_{b} are the projections of M~μ,T(θ;ω)lo\tilde{M}_{\mu,T}(\theta;\omega)_{lo} and M~μ,T(θ;ω)b\tilde{M}_{\mu,T}(\theta;\omega)_{b} onto [0.1,0.9][0.1,0.9], respectively.

E.3 Supplementary results to Figure 3

In this subsection we present supplementary results to Figure 3. In Figure 3, err(n,T)err(n,T) is defined as the average of θ(n)θ2\|{\theta^{(n)}-\theta^{*}}\|_{2} over all the 50 sample paths. Here, we divide the set of sample paths into smaller sets and evaluate the average of θ(n)θ2\|{\theta^{(n)}-\theta^{*}}\|_{2} over these smaller sets separately. The settings for the computation of parameter estimates are the same as in Section 5. The following procedure serves as the post-processing step of the obtained parameter estimates.

Concretely, as defined in Section 5, we obtain a sequence {θ(n)θ2;ω,T}n[0:N]\{\|{\theta^{(n)}-\theta^{*}}\|_{2};\omega,T\}_{n\in[0:N]} after running Algorithm 1 with any sample path ω\omega and any TT. After fixing TT and letting n=Nn=N, θ(N)θ2\|{\theta^{(N)}-\theta^{*}}\|_{2} is a function of ω\omega only. With a given threshold interval I=[I1,I2]I=[I_{1},I_{2}], we define a smaller set of sample paths as the set of ω\omega with θ(N)θ2\|{\theta^{(N)}-\theta^{*}}\|_{2} greater than the I1I_{1}-th percentile and less than the I2I_{2}-th percentile. Let err(n,T,I)err(n,T,I) be the average of θ(n)θ2\|{\theta^{(n)}-\theta^{*}}\|_{2} over this smaller set of sample path specified by interval II. For T=8000T=8000, the values of err(n,T,I)err(n,T,I) with specific choices of II are plotted below. If I=[0,100]I=[0,100], err(n,T,I)err(n,T,I) is equivalent to err(n,T)err(n,T) investigated in Section 5.

Refer to caption
Figure 5: Plots of err(n,T,I)err(n,T,I) with varying nn and II; TT is fixed as 8000.

Figure 5 suggests that with probability around 0.6, our algorithm with the particular choice of TT and θ(0)\theta^{(0)} achieves decent performance, decreasing the original estimation error by at least a half. A worth-noting observation is that, for all the choices of II (including I=[90,100]I=[90,100] representing the failed sample paths), err(n,T,I)err(n,T,I) roughly follows the same exponential decay in the early stage of the algorithm (roughly the first 10 iterations). The same behavior can be observed for T=5000T=5000 and T=10000T=10000 as well. It is not clear whether this behavior is general or specific to our numerical example. Detailed investigation is required in future work.

E.4 Varying μ\mu

Refer to caption
Figure 6: Plots of err(n,T)err(n,T) with varying nn and μ\mu; TT is fixed to 5000.

In this subsection we investigate the effect of μ\mu on the performance of Algorithm 1. Intuitively, from the uniform forgetting analysis throughout this paper, it is reasonable to expect that at each iteration, the effect of μ\mu on the parameter update is negligible if TT is large. However, such a negligible error could accumulate if NN is large. The effect of μ\mu on the final parameter estimate is not clear without experiments.

We use the same observation sequences as in Section 5. TT is fixed as 5000. θ(0)=(0.5,0.6,0.7)\theta^{(0)}=(0.5,0.6,0.7), and the parameter space for all the three parameters remains the same as [0.1,0.9][0.1,0.9]. For all s1s_{1}, μ(o0=RIGHTEND|s1){0.2,0.5,0.8}\mu(o_{0}=\textrm{RIGHTEND}|s_{1})\in\{0.2,0.5,0.8\}. The performance of the algorithm is evaluated by err(n,T)err(n,T) defined in Section 5. The result is presented in Figure 6, which shows that indeed, the effect of μ\mu on the final performance of the algorithm is negligible. For n=1000n=1000, maxμerr(n,T)\max_{\mu}err(n,T) is 0.7%0.7\% higher than minμerr(n,T)\min_{\mu}err(n,T).

E.5 Random initialization

Up to this point, all the empirical results use the same initial parameter estimate θ(0)=(0.5,0.6,0.7)\theta^{(0)}=(0.5,0.6,0.7) on all the 50 sample paths. In this subsection, we evaluate the effect of the initial estimation error {θ(0)θ}2\{\theta^{(0)}-\theta^{*}\}_{2} on the performance of the algorithm, by applying random θ(0)\theta^{(0)}. Such a randomization is not considered in Section 5 since more explanations are required.

In this experiment, we use the same observation sequences as in Section 5. TT is fixed to 8000. For all s1s_{1}, μ(o0=RIGHTEND|s1)=1\mu(o_{0}=\textrm{RIGHTEND}|s_{1})=1. The parameter space for all the three parameters remains the same as [0.1,0.9][0.1,0.9]. For each observation sequence, we first generate three independent samples xhix_{hi}, xlox_{lo} and xbx_{b} uniformly from the interval [0,1][0,1]. Then, θ(0)\theta^{(0)} is generated as follows: with a scale factor w{0.1,0.2,0.3}w\in\{0.1,0.2,0.3\}, let θ(0)hi=θhiwxhi\theta^{(0)}_{hi}=\theta^{*}_{hi}-wx_{hi}, θ(0)lo=θlowxlo\theta^{(0)}_{lo}=\theta^{*}_{lo}-wx_{lo} and θ(0)b=θbwxb\theta^{(0)}_{b}=\theta^{*}_{b}-wx_{b}. As a result, θ(0)\theta^{(0)} dependent on ww is different for different observation sequences. The choices of θ(0)\theta^{(0)} are not symmetrical with respect to θ\theta^{*} due to the restriction of the bounded parameter space. For the parameter estimates obtained from the computation, err(n,T)err(n,T) is defined as in Section 5. The result is shown in Figure 7.

Refer to caption
Figure 7: Plots of err(n,T)err(n,T) with varying nn and θ(0)\theta^{(0)}; TT is fixed to 8000.

From Figure 7, the curves corresponding to w=0.1w=0.1 and w=0.2w=0.2 qualitatively match the performance guarantee in Theorem 4. The algorithm achieves decent performance when {θ(0)θ}2\{\theta^{(0)}-\theta^{*}\}_{2} is intermediate (the case of w=0.2w=0.2), where the average estimation error err(n,T)err(n,T) is reduced by at least a half. If {θ(0)θ}2\{\theta^{(0)}-\theta^{*}\}_{2} is small (the case of w=0.1w=0.1), the parameter estimates cannot improve much from θ(0)\theta^{(0)}. If {θ(0)θ}2\{\theta^{(0)}-\theta^{*}\}_{2} is large (the case of w=0.3w=0.3), the algorithm cannot converge to the vicinity of the true parameter, which is consistent with our local convergence analysis.