Provable Hierarchical Imitation Learning via EM
Abstract
Due to recent empirical successes, the options framework for hierarchical reinforcement learning is gaining increasing popularity. Rather than learning from rewards, we consider learning an options-type hierarchical policy from expert demonstrations. Such a problem is referred to as hierarchical imitation learning. Converting this problem to parameter inference in a latent variable model, we develop convergence guarantees for the EM approach proposed by [10]. The population level algorithm is analyzed as an intermediate step, which is nontrivial because the samples are correlated. If the expert policy can be parameterized by a variant of the options framework, then, under regularity conditions, we prove that the proposed algorithm converges with high probability to a norm ball around the true parameter. To our knowledge, this is the first performance guarantee for a hierarchical imitation learning algorithm that only observes primitive state-action pairs.
1 Introduction
Recent empirical studies [25, 31, 36, 38] have shown that the scalability of Reinforcement Learning (RL) algorithms can be improved by incorporating hierarchical structures. As an example, consider the options framework [35] representing a two-level hierarchical policy: with a set of multi-step low level procedures (options), the high level policy selects an option, which, in turn, decides the primitive action applied at each time step until the option terminates. Learning such a hierarchical policy from environmental feedback effectively breaks the overall task into sub-tasks, each easier to solve.
Researchers have investigated the hierarchical RL problem under various settings. Existing theoretical analyses [6, 18, 19, 28] typically assume that the options are given. As a result, only the high-level policy needs to be learned. Recent advances in deep hierarchical RL (e.g., [2]) focus on concurrently learning the full options framework, but the initialization of the options remains critical. A promising practical approach is to learn an initial hierarchical policy from expert demonstrations. Then, deep hierarchical RL algorithms can be applied for policy improvement. The former step is known as Hierarchical Imitation Learning (HIL).
Due to its practicality, HIL has been extensively studied within the deep learning and robotics communities. However, existing works typically suffer from the following limitations. First, the considered HIL formulations often lack rigor and clarity. Second, existing works are mostly empirical, only testing on a few specific benchmarks. Without theoretical justification, it remains unclear whether the proposed methods can be generalized beyond their experimental settings.
In this paper, we investigate HIL from a theoretical perspective. Our problem formulation is concise while retaining the essential difficulty of HIL: we need to learn a complete hierarchical policy from an unsegmented sequence of state-action pairs. Under this setting, HIL becomes an inference problem in a latent variable model. Such a transformation was first proposed by [10], where the Expectation-Maximization (EM) algorithm [14] was applied for policy learning. Empirical results for this algorithm and its gradient variants [17, 24] demonstrate good performance, but the theoretical analysis remains open. By bridging this gap, we aim to solidify the foundation of HIL and provide some high level guidance for its practice.
1.1 Related work
Due to its intrinsic difficulty, existing works on HIL typically consider easier variants of the problem for practicality. If the expert options are observed, standard imitation learning algorithms can be applied to learn the high and low level policies separately [26]. If the options are not available, a popular idea [7, 29, 32, 33] is to first divide the expert demonstration into segments using domain knowledge or heuristics, learn the individual option corresponding to each segment, and finally learn the high level policy. With additional supervision, these steps can be unified [34]. In this regard, the EM approach [10, 17, 24] pushes this idea to its extreme: without any other form of supervision, the demonstration is segmented and learned from simultaneously, by exploiting the latent variable structure.
From the theoretical perspective, inference in parametric latent variable models is a long-standing problem in statistics. For many years the EM algorithm has been considered the standard approach, but performance guarantees [30, 40] were generally weak, only characterizing the convergence of parameter estimates to stationary points of the finite sample likelihood function. Under additional local assumptions, convergence to the Maximum Likelihood Estimate (MLE) can be further established. However, due to the randomness in sampling, the finite sample likelihood function is usually highly non-concave, leading to stringent requirements on initialization. Another weakness is that converging to the finite sample MLE does not directly characterize the distance to the maximizer of the population likelihood function, which is the true parameter.
Recent analyses of EM algorithms [3, 39, 42, 43] focus on convergence to the true parameter directly, relying on an instrumental object known as the population EM algorithm. It has the same two-stage iterative procedure as the standard EM algorithm, but its Q-function, the maximization objective in the M-step, is defined as the infinite sample limit of the finite sample Q-function. Under regularity conditions, the population EM algorithm converges to the true parameter. The standard EM algorithm is then analyzed as its perturbed version, converging with high probability to a norm ball around the true parameter. The main advantage of this approach is that the true parameter usually has a large basin of attraction in the population EM algorithm. Therefore, the requirement on initialization is less stringent. See [42, Figure 1] for an illustration.
The Q-function adopted in the population EM algorithm is known as the population Q-function. To properly define such a quantity, the stochastic convergence of the finite sample Q-function needs to be established. When the samples are i.i.d., such as in Gaussian Mixture Models (GMMs) [3, 11, 41], the required convergence follows directly from the law of large numbers. However, this argument is less straightforward in time-series models such as Hidden Markov Models (HMMs) and the model considered in HIL. For HMMs, [42] showed that the expectation of the Q-function converges, but neither the stochastic convergence analysis nor the analytical expression of the population Q-function is provided. The missing techniques could be borrowed from a body of work [8, 13, 27, 37] analyzing the asymptotic behavior of HMMs. Most notably, [27] provided a rigorous treatment of the population EM algorithm via sufficient statistics, assuming the HMM is parameterized by an exponential family.
Finally, apart from the EM algorithm, a separate line of research [1, 21] applies spectral methods for tractable inference in latent variable models. However, such methods are mainly complementary to the EM algorithm since better performance can usually be obtained by initializing the EM algorithm with the solution of the spectral methods [23].
1.2 Our contributions
In this paper, we establish the first known performance guarantee for a HIL algorithm that only observes primitive state-action pairs. Specifically, we first fix and reformulate the original EM approach by [10] in a rigorous manner. The lack of mixing is identified as a technical difficulty in learning the standard options framework, and a novel options with failure framework is proposed to circumvent this issue.
Inspired by [3] and [42], the population version of our algorithm is analyzed as an intermediate step. We prove that if the expert policy can be parameterized by the options with failure framework, then, under regularity conditions, the population version algorithm converges to the true parameter, and the finite sample version converges with high probability to a norm ball around the true parameter. Our analysis directly constructs the stochastic convergence of the finite sample Q-function, and an analytical expression of the resulting population Q-function is provided. Finally, we qualitatively validate our theoretical results using a numerical example.
2 Problem settings
Notation.
Throughout this paper, we use uppercase letters (e.g., ) for random variables and lowercase letters (e.g., ) for values of random variables. Let be the set of integers such that . When used in the subscript, the brackets are removed (e.g., ).
2.1 Definition of the hierarchical policy
[Figure 1: the probabilistic graphical model of the options framework.]
In this section, we first introduce the options framework for hierarchical reinforcement learning [4, 35], captured by the probabilistic graphical model shown in Figure 1. The index represents the time; respectively represent the state, the action, the option and the termination indicator. For all , , and are defined on the finite state space , the finite action space and the finite option space ; is a binary random variable. Define the parameter where , , and . The parameter space is a convex and compact subset of a Euclidean space.
For any , if we fix and consider a given , the joint distribution on the rest of the graphical model is determined by the following components: an unknown environment transition probability , a high level policy parameterized by , a low level policy parameterized by and a termination policy parameterized by . Sampling a tuple from such a joint distribution, or equivalently, implementing the hierarchical decision process, proceeds as follows. Starting from the first time step, the decision making agent first determines whether or not to terminate the current option . The decision is encoded in a termination indicator sampled from . indicates that the option terminates and the next option is sampled from ; indicates that the option continues and . Next, the primitive action is sampled from , applying the low level policy associated with the option . Using the environment, the next state is sampled from . The rest of the samples are generated analogously.
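To make this generative procedure concrete, the following sketch samples a trajectory from an options-type policy in a finite environment. It is purely illustrative: the callables env_step, pi_hi, pi_lo and pi_term are placeholders for the environment transition and the three policy components, and the interface is ours rather than the paper's.

```python
import numpy as np

def rollout(env_step, pi_hi, pi_lo, pi_term, s0, o0, T, seed=0):
    """Sample T steps of (state, action, option, termination) from an options-type policy.

    env_step(s, a), pi_hi(s) and pi_lo(s, o) return probability vectors over the finite
    state, option and action spaces respectively; pi_term(s, o) returns the probability
    that option o terminates in state s. All of these are hypothetical callables.
    """
    rng = np.random.default_rng(seed)
    s, o, traj = s0, o0, []
    for _ in range(T):
        b = int(rng.random() < pi_term(s, o))     # decide whether the current option terminates
        if b:
            p = pi_hi(s)
            o = rng.choice(len(p), p=p)           # a new option is drawn from the high level policy
        # if b == 0 the current option simply continues (standard options framework)
        p = pi_lo(s, o)
        a = rng.choice(len(p), p=p)               # primitive action from the option's low level policy
        traj.append((s, a, o, b))
        p = env_step(s, a)
        s = rng.choice(len(p), p=p)               # environment transition to the next state
    return traj
```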
The options framework corresponds to the above hierarchical policy structure and the policy triple . However, due to a technicality identified at the end of this subsection, we consider a novel options with failure framework for the remainder of this paper, which adds an extra failure mechanism to the graphical model in the case of . Specifically, there exists a constant such that when the termination indicator , with probability the next option is assigned to , whereas with probability the next option is sampled uniformly from the set of options . Notice that if , we recover the standard options framework.
To simplify the notation, we define a combined high level policy that merges the original high level policy with the failure mechanism.
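A plausible form of this combined policy, written with symbols we introduce only for illustration ($\pi_{hi}$ for the original high level policy, $\mathcal{O}$ for the option space, and $\epsilon$ for the failure probability), is
\[
\bar{\pi}_{hi}\big(o_{t+1} \mid s_{t+1}, o_t, b_{t+1}\big) \;=\;
\begin{cases}
\pi_{hi}\big(o_{t+1} \mid s_{t+1}\big), & b_{t+1} = 1 \quad \text{(the option has terminated)},\\[3pt]
(1-\epsilon)\,\mathbb{1}\{o_{t+1} = o_t\} + \epsilon/|\mathcal{O}|, & b_{t+1} = 0 \quad \text{(the option continues, up to failure)},
\end{cases}
\]
so that setting $\epsilon = 0$ recovers the standard options framework, as noted above.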
Formally, the options with failure framework is defined as the class of policy triples parameterized by and . With and a given , let be the joint distribution of . With any input arguments,
On the policy framework.
The options with failure framework is adopted to simplify the construction of the mixing condition (Lemma D.1). It is possible that our analysis could be extended to learn the standard options framework. In that case, instead of constructing the usual one step mixing condition, one could target the multi-step mixing condition similar to [8, Chap. 4.3].
2.2 The imitation learning problem
Suppose an expert uses an options with failure policy with true parameters and ; its initial condition is sampled from a distribution . A finite length observation sequence with is observed from the expert. and the parametric structure of the expert policy are known, but is unknown. Our objective is to estimate from .
On the practicality of our setting.
Two comments need to be made here. First, it is common in practice to observe not one, but a set of independent observation sequences. In that case, the problem essentially becomes easier. Second, the cardinality of the option space and the parameterization of the expert policy are usually unknown. A popular solution is to assume an expressive parameterization (e.g., a neural network) in the algorithm and select through cross-validation. Theoretical analysis of EM under this setting is challenging, even when samples are i.i.d. [15, 16]. Therefore, we only consider the correctly-specified regime.
Throughout this paper, the following assumptions are imposed for simplicity.
Assumption 1 (Non-degeneracy).
With any other input arguments, the domain of , and as functions of can be extended to an open set that contains . Moreover, for all , , and parameterized by are strictly positive.
Assumption 2 (Differentiability).
With any other input arguments, , and as functions of are continuously differentiable on .
Next, consider the stochastic process induced by and the expert policy. Based on the graphical model, it is a Markov chain with finite state space . Let be its set of stationary distributions, which is nonempty and convex.
Assumption 3 (Stationary initial distribution).
is an extreme point of . That is, , and it cannot be written as the convex combination of two elements of .
On the assumptions.
The first two assumptions are generally mild and therefore hold for many policy parameterizations. The third assumption is a bit more restrictive, but it is essential for our theoretical analysis. In Appendix A, we provide further justification of this assumption in a particular class of environments: , there exists such that . In such environments, contains a unique element which is also the limiting distribution. If we start sampling the observation sequence late enough, Assumption 3 is approximately satisfied.
3 A Baum-Welch type algorithm
Adopting the EM approach, we present Algorithm 1 for the estimation of . It reformulates the algorithm by [10] in a rigorous manner, and an error in the latter is fixed: when defining the posterior distribution of latent variables, at any time , the original algorithm neglects the dependency of future states on the current option . A detailed discussion is provided in Appendix B.1.
Since our graphical model resembles an HMM, Algorithm 1 is intuitively similar to the classical Baum-Welch algorithm [5] for HMM parameter inference. Analogously, it iterates between forward-backward smoothing and parameter updates. In each iteration, the algorithm first estimates certain marginal distributions of the latent variables conditioned on the observation sequence , assuming the current estimate of is correct. Such conditional distributions are known as smoothing distributions, and they are used to compute the Q-function, which is a surrogate of the likelihood function. The next estimate of is assigned as one of the maximizing arguments of the Q-function.
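Schematically, the iteration can be sketched as follows; e_step and m_step are placeholder callables standing for the smoothing computation of Section 3.1 and the maximization of Section 3.2, not the paper's exact routines.

```python
def run_em(theta_init, obs, n_iters, e_step, m_step):
    """Baum-Welch style iteration: alternate forward-backward smoothing and parameter update.

    e_step(theta, obs) -> smoothing and two-step smoothing distributions under theta;
    m_step(smoothing, obs) -> a maximizer of the resulting Q-function over the parameter space.
    """
    theta, estimates = theta_init, [theta_init]
    for _ in range(n_iters):
        smoothing = e_step(theta, obs)      # posterior over latent options/terminations given obs
        theta = m_step(smoothing, obs)      # maximize the surrogate (Q-function) built from it
        estimates.append(theta)
    return estimates
```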
From the structure of our graphical model, a prior distribution of is required to compute the smoothing distributions. Since the true prior distribution is unknown, , defined next, is used as its approximation: , ; , . Theorem 2 shows that the additional estimation error introduced by this approximation vanishes as the observation length grows, regardless of the choice of . Let be the set of allowed by Algorithm 1.
3.1 Latent variable estimation
In the following, we define the forward message, the backward message and the smoothing distribution for all , and all . All of these quantities are probability mass functions over , and normalizing constants , and are adopted to enforce this. With any input arguments and , the forward message is defined as
On the LHS, the dependency on is omitted for a cleaner notation. By convention, is equivalent to
The backward message is defined as
where the value of on the RHS is arbitrary. By convention, the boundary condition is
(1)
The smoothing distribution is defined as
It can be easily verified that the normalizing constant does not depend on .
Finally, for all , and , with any input arguments and , we define the two-step smoothing distribution as
where is the same normalizing constant as the one for the smoothing distribution .
The quantities above can be computed using the forward-backward recursion. For simplicity, we omit normalizing constants by using the proportional symbol . The proof is deferred to Appendix B.2.
Theorem 1 (Forward-backward smoothing).
For all and , with any input arguments on the LHS,
1. (Forward recursion) the recursion in (2) holds; for the boundary case, the expression in (3) holds.
2. (Backward recursion) the recursion in (4) holds.
3. (Smoothing) the expression in (5) holds.
4. (Two-step smoothing) the expression in (6) holds.
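Since the displayed recursions are specific to the graphical model above, we only illustrate the computational pattern they share with HMM smoothing. The sketch below is a generic forward-backward smoother over a finite latent space; in our setting the latent value at time t would play the role of the option-termination pair, and the inputs trans, emit and prior (our placeholders, not the paper's quantities) would be assembled from the current policy estimate and the observed state-action pairs.

```python
import numpy as np

def forward_backward(trans, emit, prior):
    """Generic forward-backward smoothing over a finite latent space.

    trans[t]: latent transition matrix between times t and t+1 (length T-1 list);
    emit[t]:  likelihood of the observation at time t under each latent value (length T list);
    prior:    initial distribution over latent values.
    Returns smoothing marginals and two-step (pairwise) smoothing distributions.
    """
    T, K = len(emit), len(prior)
    alpha = np.zeros((T, K))                  # forward messages
    beta = np.ones((T, K))                    # backward messages, boundary condition of all ones
    alpha[0] = prior * emit[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans[t - 1]) * emit[t]
        alpha[t] /= alpha[t].sum()            # normalize each message for numerical stability
    for t in range(T - 2, -1, -1):
        beta[t] = trans[t] @ (emit[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta                      # smoothing distributions, up to normalization
    gamma /= gamma.sum(axis=1, keepdims=True)
    xi = [alpha[t][:, None] * trans[t] * (emit[t + 1] * beta[t + 1])[None, :] for t in range(T - 1)]
    xi = [x / x.sum() for x in xi]            # two-step smoothing distributions
    return gamma, xi
```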
3.2 Parameter update
For all and , the (finite sample) Q-function is defined as
(7)
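For orientation, the definition in (7) plausibly takes the following shape; the symbols ($X_{1:T}$ for the observed state-action sequence, $\pi_b$, $\bar\pi_{hi}$, $\pi_{lo}$ for the termination, combined high level and low level policies, and the $1/T$ normalization) are our reading of the surrounding text, not necessarily the paper's exact notation:
\[
Q_T(\theta' \mid \theta) \;=\; \frac{1}{T}\sum_{t=1}^{T}
\mathbb{E}_{\theta}\!\left[\,\log \pi_b^{\theta'}(B_t \mid S_t, O_{t-1})
+ \log \bar{\pi}_{hi}^{\theta'}(O_t \mid S_t, O_{t-1}, B_t)
+ \log \pi_{lo}^{\theta'}(A_t \mid S_t, O_t) \,\middle|\, X_{1:T}\right],
\]
where the conditional expectation over the latent options and termination indicators is evaluated with the smoothing and two-step smoothing distributions computed under the current estimate $\theta$.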
The parameter update is performed as , which may not be unique. Since is compact and is continuous with respect to , the maximization is well-posed. Note that our definition of is an approximation of the standard definition of the Q-function in the EM literature. See Appendix B.3 for a detailed discussion.
3.3 Generalization to continuous spaces
Although we require finite state and action spaces for our theoretical analysis, Algorithm 1 can be readily generalized to continuous and : we only need to replace by a density function. However, generalization to a continuous option space requires a substantially different algorithm. The forward-backward smoothing procedure in Theorem 1 involves integrals rather than sums, and Sequential Monte Carlo (SMC) techniques need to be applied. Fortunately, it is widely accepted that a finite option space is reasonable in the options framework, since the options need to be distinct and separate [9].
4 Performance guarantee
Our analysis of Algorithm 1 has the following structure. We first prove the stochastic convergence of the Q-function to a population Q-function , leading to a well-posed definition of the population version algorithm. This step is our major theoretical contribution. With additional assumptions, the first-order stability condition is constructed, and techniques in [3] can be applied to show the convergence of the population version algorithm. The remaining step is to analyze Algorithm 1 as a perturbed form of its population version, which requires a high probability bound on the distance between their parameter updates. We can establish the strong consistency of the parameter update of Algorithm 1 as an estimator of the parameter update of the population version algorithm. Therefore, the existence of such a high probability bound can be proved for large enough . However, the analytical expression of this bound requires knowledge of the specific parameterization of , which is not available in this general setting.
Concretely, we first analyze the asymptotic behavior of the Q-function as . From Assumption 3, the observation sequence is generated from a stationary Markov chain . Let be its state space. Using Kolmogorov’s extension theorem, we can extend this one-sided Markov chain to the index set and define a unique probability measure over the sample space . Any observation sequence can be regarded as a segment of an infinite length sample path . Therefore, if the observation sequence is not specified, is a random variable with underlying probability measure .
One caveat is that the definition of from Section 3 fails for some . To fix this issue, define the set of proper sample paths as
(8)
Note that ; therefore, working on is probabilistically equivalent to working on . For all , follows the definition from Section 3; for other sample paths, is defined arbitrarily. In this way, becomes a well-defined random variable. Its stochastic convergence is characterized in the following theorem.
Theorem 2 (The stochastic convergence of the Q-function).
We name as the population Q-function. The analytical expressions of and are provided in Appendix C.2, where the complete version of the above theorem (Theorem 7) is proved. In the following, we provide a high level sketch of the main idea.
Proof Sketch.
The main difficulty of the proof is that , defined in (7), is (roughly) an average of terms, with each term dependent on the entire observation sequence; as , all the terms keep changing, so the law of large numbers cannot be applied directly. As a solution, we approximate and with smoothing distributions in an infinitely extended graphical model independent of , resulting in an approximated Q-function (which still depends on ). The techniques adopted in this step are analogous to Markovian decomposition and uniform forgetting in the HMM literature [8, 37]. The limiting behavior of the approximated Q-function is the same as that of , since their difference vanishes as . For the approximated Q-function, we can apply the ergodic theorem since the smoothing distributions no longer depend on . ∎
The population version of Algorithm 1 has parameter updates . To characterize the local convergence of Algorithm 1 and its population version, we impose the following assumptions for the remainder of Section 4.
Assumption 4 (Strong concavity).
There exists such that for all ,
For any , let .
Assumption 5 (Additional local assumptions).
There exists such that
1. (Identifiability) For all , the set has a unique element . Moreover, for all , with the convention that , we have
2. (Uniqueness of finite sample parameter updates) For all , and , -almost surely, the set has a unique element .
On the additional assumptions.
In Assumption 4, we require the strong concavity of over the entire parameter space since the maximization step in our algorithm is global. Such a requirement could be avoided: if the maximization step is replaced by a gradient update (Gradient EM), then only needs to be strongly concave in a small region around . The price to pay is to assume knowledge of structural constants of (the Lipschitz constant and the strong concavity constant). See [3] for an analysis of the gradient EM algorithm.
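For reference, the gradient EM update analyzed in [3] replaces the global maximization with a projected ascent step of the form (notation ours: $\Pi_{\Omega}$ denotes projection onto the parameter space and $\alpha$ a step size)
\[
\theta^{k+1} \;=\; \Pi_{\Omega}\Big(\theta^{k} + \alpha\, \nabla_{\theta'} Q\big(\theta' \mid \theta^{k}\big)\big|_{\theta' = \theta^{k}}\Big),
\]
which only requires the Q-function to be well behaved locally, at the price of choosing $\alpha$ from such structural constants.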
Nonetheless, we expect the following to hold in certain cases of tabular parameterization: for all , the function is strongly concave over (see the end of Appendix C.2). From this condition, Assumptions 4 and 5.1 directly follow. Assumption 5.2 holds as well; in fact, it is quite a mild assumption due to the sample-based nature of .
The next step is to characterize the convergence of the population version algorithm.
Theorem 3 (Convergence of the population version algorithm).
With all the assumptions,
1. (First-order stability) There exists such that for all ,
2. (Contraction) Let . For all ,
If , the population version algorithm converges linearly to the true parameter .
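In the notation of [3], with $M(\theta)$ denoting the population parameter update, $\lambda$ the strong concavity constant from Assumption 4, and $\theta^{\star}$ the true parameter (symbol names are ours), these two conditions typically read
\[
\big\|\nabla Q\big(M(\theta) \mid \theta^{\star}\big) - \nabla Q\big(M(\theta) \mid \theta\big)\big\|_2 \;\le\; \gamma\,\|\theta - \theta^{\star}\|_2,
\qquad
\|M(\theta) - \theta^{\star}\|_2 \;\le\; \frac{\gamma}{\lambda}\,\|\theta - \theta^{\star}\|_2,
\]
for all $\theta$ in a neighborhood of $\theta^{\star}$, so that $\gamma/\lambda < 1$ yields linear convergence.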
The proof is given in Appendix C.3, where we also show an upper bound on . The idea mirrors that of [3, Theorem 4] with problem-specific modifications. Algorithm 1 can be regarded as a perturbed form of this population version algorithm, with convergence characterized in the following theorem.
Theorem 4 (Performance guarantee for Algorithm 1).
With all the assumptions, we have the following.
1. For all and , there exists such that the following statement is true. If the observation length , then with probability at least ,
2. If , Algorithm 1 with any has the following performance guarantee. If , then with probability at least , for all ,
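The second part plausibly takes the standard perturbation form of [3, Theorem 5]; with $\theta^{k}$ the $k$-th iterate of Algorithm 1, $\kappa < 1$ the contraction coefficient from Theorem 3, and $\varepsilon_{Q}(T,\delta)$ the high probability bound from Part 1 (names chosen here for illustration), it would read
\[
\big\|\theta^{k} - \theta^{\star}\big\|_2 \;\le\; \kappa^{k}\,\big\|\theta^{0} - \theta^{\star}\big\|_2 \;+\; \frac{1}{1-\kappa}\,\varepsilon_{Q}(T,\delta)
\quad \text{for all } k, \text{ with probability at least } 1-\delta .
\]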
The proof is provided in Appendix C.4. Essentially, we use Theorem 2 to show the uniform (in and ) strong consistency of as an estimator of , following the standard analysis of M-estimators. A direct corollary of this argument is the high probability bound on the difference between and , as shown in the first part of the theorem. Combining this high probability bound with Theorem 3 and [3, Theorem 5] yields the final performance guarantee.
Theorem 4 has two practical implications. First, under regularity conditions, with large enough , Algorithm 1 can converge with arbitrarily high probability to an arbitrarily small norm ball around the true parameter. In other words, with enough samples, the EM approach can recover the true parameter of the expert policy arbitrarily well. Second, the estimation error (upper bound) decreases exponentially in the initial phase of the algorithm. In this regard, a practitioner can allocate their computational budget accordingly.
One limitation of our analysis is that the condition is hard to verify for a practical parameterization of the expert policy. This is typical in the theory of EM algorithms: even in the case of i.i.d. samples, characterizing the contraction coefficient is intractable except for a few simple parametric models. Nonetheless, such a condition strengthens our intuition on when the EM approach to HIL works: should have a large curvature with respect to , and the function should not change much with respect to around . In the next section, we present a numerical example to qualitatively demonstrate our result.
5 Numerical example
In this section, we qualitatively demonstrate our theoretical result through an example. Here, we value clarity over completeness; large-scale experiments are therefore deferred to future work.
[Figure 2: the four-state MDP used in the numerical example.]
Consider the Markov Decision Process (MDP) illustrated in Figure 2. There are four states, numbered from left to right as 1 to 4. At any state , there are two allowable actions: LEFT and RIGHT. If the action is RIGHT, then the next state is sampled uniformly from the states to the right of the current state (including the current state itself). Symmetrically, if the action is LEFT, then the next state is sampled uniformly from the states to the left of the current state (including the current state itself).
Suppose an expert applies the following options with failure policy with parameters and . The option space has two elements: LEFTEND and RIGHTEND. equals if , and if . For all , . equals if , and otherwise. Symmetrically, equals if , and otherwise. Intuitively, the high level policy directs the agent to states 1 and 4, and the option terminates with high probability when the corresponding target state is reached.
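The following sketch instantiates this environment and a policy of the expert's general shape. The function names and the numerical values 0.9 / 0.1 are placeholders chosen for illustration; the exact parameter values used in the experiment are among the details deferred to Appendix E.

```python
import numpy as np

N_STATES = 4
LEFT, RIGHT = 0, 1
LEFTEND, RIGHTEND = 0, 1

def env_transition(s, a):
    """Probability vector over next states 1..4: RIGHT is uniform on {s, ..., 4}, LEFT on {1, ..., s}."""
    targets = list(range(s, N_STATES + 1)) if a == RIGHT else list(range(1, s + 1))
    p = np.zeros(N_STATES + 1)               # index 0 is unused; states are numbered 1..4
    p[targets] = 1.0 / len(targets)
    return p

def pi_lo(s, o):
    """Each option walks toward its end of the chain with high probability (placeholder values)."""
    p_right = 0.9 if o == RIGHTEND else 0.1
    return np.array([1.0 - p_right, p_right])            # [P(LEFT), P(RIGHT)]

def pi_term(s, o):
    """An option terminates with high probability once its target state is reached (placeholder values)."""
    at_target = (o == RIGHTEND and s == N_STATES) or (o == LEFTEND and s == 1)
    return 0.9 if at_target else 0.1

def pi_hi(s):
    """Placeholder high level policy directing the agent toward the two ends of the chain."""
    p_rightend = 0.9 if s <= 2 else 0.1
    return np.array([1.0 - p_rightend, p_rightend])      # [P(LEFTEND), P(RIGHTEND)]
```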
In our experiment, the parameter spaces , and are all equal to the interval . The initial parameter estimate . For all , .
We investigate the behavior of as a random variable dependent on and . 50 sample paths of length are sampled from (approximately) the stationary Markov chain induced by the expert policy, with . After running Algorithm 1 with any sample path and any , we obtain a sequence . Let be the average of for fixed and , over the 50 sample paths. The result is shown in Figure 3.
[Figure 3: the average estimation error of Algorithm 1 over iterations, for different observation lengths.]
Assumptions 1, 2, 3 and 5.2 hold in this example, and we speculate that Assumptions 4 and 5.1 hold as well. The condition cannot be verified, but the empirical result exhibits patterns consistent with the performance guarantee, even though Theorem 4 is not rigorously applicable. First, decreases exponentially in the early phase of the algorithm. Second, as increases, Algorithm 1 achieves better performance.
An observation is worth mentioning as a separate note: for , first slightly increases, then levels off. This is due to the parameter estimate on some sample paths converging to bad stationary points of the finite sample likelihood function, which suggests that early stopping could be helpful in practice. Omitted details and additional experiments are provided in Appendix E, where we also investigate, for example, the effect of and random initialization on the performance of Algorithm 1.
6 Conclusions
In this paper, we investigate the EM approach to HIL from a theoretical perspective. We prove that under regularity conditions, the proposed algorithm converges with high probability to a norm ball around the true parameter. To our knowledge, this is the first performance guarantee for an HIL algorithm that only observes primitive state-action pairs. Future work could further investigate the practical performance of this approach, especially its scalability in complicated environments.
Acknowledgements
We thank the anonymous reviewers for their constructive comments. Z.Z. thanks Tianrui Chen for helpful discussions. The research was partially supported by the NSF under grants DMS-1664644, CNS-1645681, and IIS-1914792, by the ONR under grant N00014-19-1-2571, by the NIH under grants R01 GM135930 and UL54 TR004130, and by the DOE under grant DE-AR-0001282.
References
- Anandkumar et al. [2014] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15(80):2773–2832, 2014.
- Bacon et al. [2017] P.-L. Bacon, J. Harb, and D. Precup. The option-critic architecture. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 1726–1734, 2017.
- Balakrishnan et al. [2017] S. Balakrishnan, M. J. Wainwright, and B. Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. The Annals of Statistics, 45(1):77–120, 2017.
- Barto and Mahadevan [2003] A. G. Barto and S. Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete event dynamic systems, 13(1-2):41–77, 2003.
- Baum et al. [1970] L. E. Baum, T. Petrie, G. Soules, and N. Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The annals of mathematical statistics, 41(1):164–171, 1970.
- Brunskill and Li [2014] E. Brunskill and L. Li. PAC-inspired option discovery in lifelong reinforcement learning. In Proceedings of the 31st International Conference on Machine Learning, pages 316–324, 2014.
- Butterfield et al. [2010] J. Butterfield, S. Osentoski, G. Jay, and O. C. Jenkins. Learning from demonstration using a multi-valued function regressor for time-series data. In 2010 10th IEEE-RAS International Conference on Humanoid Robots, pages 328–333. IEEE, 2010.
- Cappé et al. [2006] O. Cappé, E. Moulines, and T. Rydén. Inference in hidden Markov models. Springer Science & Business Media, 2006.
- Daniel et al. [2016a] C. Daniel, G. Neumann, O. Kroemer, and J. Peters. Hierarchical relative entropy policy search. Journal of Machine Learning Research, 17(1):3190–3239, 2016a.
- Daniel et al. [2016b] C. Daniel, H. Van Hoof, J. Peters, and G. Neumann. Probabilistic inference for determining options in reinforcement learning. Machine Learning, 104(2-3):337–357, 2016b.
- Daskalakis et al. [2017] C. Daskalakis, C. Tzamos, and M. Zampetakis. Ten steps of EM suffice for mixtures of two Gaussians. In Conference on Learning Theory, pages 704–710, 2017.
- Davidson [1994] J. Davidson. Stochastic limit theory: An introduction for econometricians. OUP Oxford, 1994.
- De Castro et al. [2017] Y. De Castro, E. Gassiat, and S. Le Corff. Consistent estimation of the filtering and marginal smoothing distributions in nonparametric hidden Markov models. IEEE Transactions on Information Theory, 63(8):4758–4777, 2017.
- Dempster et al. [1977] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.
- Dwivedi et al. [2018a] R. Dwivedi, N. Ho, K. Khamaru, M. I. Jordan, M. J. Wainwright, and B. Yu. Singularity, misspecification, and the convergence rate of EM. arXiv preprint arXiv:1810.00828, 2018a.
- Dwivedi et al. [2018b] R. Dwivedi, N. Ho, K. Khamaru, M. J. Wainwright, and M. I. Jordan. Theoretical guarantees for EM under mis-specified Gaussian mixture models. In Advances in Neural Information Processing Systems 31, pages 9681–9689, 2018b.
- Fox et al. [2017] R. Fox, S. Krishnan, I. Stoica, and K. Goldberg. Multi-level discovery of deep options. arXiv preprint arXiv:1703.08294, 2017.
- Fruit and Lazaric [2017] R. Fruit and A. Lazaric. Exploration–exploitation in MDPs with options. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 576–584, 2017.
- Fruit et al. [2017] R. Fruit, M. Pirotta, A. Lazaric, and E. Brunskill. Regret minimization in MDPs with options without prior knowledge. In Advances in Neural Information Processing Systems 30, pages 3166–3176, 2017.
- Hairer [2006] M. Hairer. Ergodic properties of Markov processes. Unpublished lecture notes, 2006. URL http://www.hairer.org/notes/Markov.pdf.
- Hsu et al. [2012] D. Hsu, S. M. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences, 78(5):1460–1480, 2012.
- Jain and Kar [2017] P. Jain and P. Kar. Non-convex optimization for machine learning. Foundations and Trends® in Machine Learning, 10(3-4):142–336, 2017.
- Kontorovich et al. [2013] A. Kontorovich, B. Nadler, and R. Weiss. On learning parametric-output HMMs. In Proceedings of the 30th International Conference on Machine Learning, pages 702–710, 2013.
- Krishnan et al. [2017] S. Krishnan, R. Fox, I. Stoica, and K. Goldberg. DDCO: Discovery of deep continuous options for robot learning from demonstrations. In Conference on Robot Learning, pages 418–437, 2017.
- Kulkarni et al. [2016] T. D. Kulkarni, K. Narasimhan, A. Saeedi, and J. Tenenbaum. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems 29, pages 3675–3683, 2016.
- Le et al. [2018] H. Le, N. Jiang, A. Agarwal, M. Dudik, Y. Yue, and H. Daumé. Hierarchical imitation and reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, pages 2917–2926, 2018.
- Le Corff and Fort [2013] S. Le Corff and G. Fort. Online expectation maximization based algorithms for inference in hidden Markov models. Electronic Journal of Statistics, 7:763–792, 2013.
- Mann and Mannor [2014] T. Mann and S. Mannor. Scaling up approximate value iteration with options: Better policies with fewer iterations. In Proceedings of the 31st International Conference on Machine Learning, pages 127–135, 2014.
- Manschitz et al. [2014] S. Manschitz, J. Kober, M. Gienger, and J. Peters. Learning to sequence movement primitives from demonstrations. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4414–4421. IEEE, 2014.
- McLachlan and Krishnan [2007] G. J. McLachlan and T. Krishnan. The EM algorithm and extensions, volume 382. John Wiley & Sons, 2007.
- Nachum et al. [2018] O. Nachum, S. S. Gu, H. Lee, and S. Levine. Data-efficient hierarchical reinforcement learning. In Advances in Neural Information Processing Systems 31, pages 3303–3313, 2018.
- Niekum et al. [2012] S. Niekum, S. Osentoski, G. Konidaris, and A. G. Barto. Learning and generalization of complex tasks from unstructured demonstrations. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5239–5246. IEEE, 2012.
- Niekum et al. [2015] S. Niekum, S. Osentoski, G. Konidaris, S. Chitta, B. Marthi, and A. G. Barto. Learning grounded finite-state representations from unstructured demonstrations. The International Journal of Robotics Research, 34(2):131–157, 2015.
- Shiarlis et al. [2018] K. Shiarlis, M. Wulfmeier, S. Salter, S. Whiteson, and I. Posner. TACO: Learning task decomposition via temporal alignment for control. In Proceedings of the 35th International Conference on Machine Learning, pages 4654–4663, 2018.
- Sutton et al. [1999] R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999.
- Tessler et al. [2017] C. Tessler, S. Givony, T. Zahavy, D. J. Mankowitz, and S. Mannor. A deep hierarchical approach to lifelong learning in minecraft. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 1553–1561, 2017.
- van Handel [2008] R. van Handel. Hidden Markov models. Unpublished lecture notes, 2008. URL https://web.math.princeton.edu/~rvan/orf557/hmm080728.pdf.
- Vezhnevets et al. [2017] A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu. Feudal networks for hierarchical reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pages 3540–3549, 2017.
- Wang et al. [2015] Z. Wang, Q. Gu, Y. Ning, and H. Liu. High dimensional EM algorithm: Statistical optimization and asymptotic normality. In Advances in Neural Information Processing Systems 28, pages 2521–2529, 2015.
- Wu [1983] C. J. Wu. On the convergence properties of the EM algorithm. The Annals of statistics, 11(1):95–103, 1983.
- Xu et al. [2016] J. Xu, D. J. Hsu, and A. Maleki. Global analysis of expectation maximization for mixtures of two Gaussians. In Advances in Neural Information Processing Systems 29, pages 2676–2684, 2016.
- Yang et al. [2017] F. Yang, S. Balakrishnan, and M. J. Wainwright. Statistical and computational guarantees for the Baum-Welch algorithm. Journal of Machine Learning Research, 18(1):4528–4580, 2017.
- Yi and Caramanis [2015] X. Yi and C. Caramanis. Regularized EM algorithms: A unified framework and statistical guarantees. In Advances in Neural Information Processing Systems 28, pages 1567–1575, 2015.
Appendix
Organization.
Appendix A presents discussions that motivate Assumption 3. In particular, we show that Assumption 3 approximately holds in a particular class of environments. Appendix B provides details on Algorithm 1, including the comparison with the existing algorithm from [10], the forward-backward implementation and the derivation of the Q-function from (7). In Appendix C, we prove our theoretical results from Section 4. Technical lemmas involved in the proofs are deferred to Appendix D. Finally, Appendix E presents details of our numerical example omitted from Section 5.
Appendix A Discussion on Assumption 3
In this section we justify Assumption 3 in a particular class of environments. Consider the stochastic process generated by any and an options with failure hierarchical policy with parameter . It is a Markov chain with its transition kernel parameterized by , and its state space is finite. Denote its one step transition kernel as and its step transition kernel as . In the following, we show that is uniformly ergodic when the environment meets the reachability assumption: , there exists such that .
Proposition 5 (Ergodicity).
Proof of Proposition 5.
We start by analyzing the irreducibility of the Markov chain with any . Denote the probability measure on the natural filtered space as . The dependency on is dropped for a cleaner notation, since the following proof holds for all . For any , let and . For any time ,
From Assumption 1, there exists a state such that , . Consider the first factor in the sum,
From Assumption 1, the second term on the RHS is positive for all and . From the reachability assumption, for any there exists an action such that . As a result, for any , , and the considered Markov chain is irreducible.
As shown above, for all , where is the two step transition kernel of the Markov chain . Due to Assumption 2, is continuous with respect to . Moreover, since is compact, if we let we have . The classical Doeblin-type condition can be constructed as follows. For all and , with any probability measure over the finite sample space ,
(9)
A Markov chain convergence result is restated in the following lemma, tailored to our need.
Lemma A.1 ([8], Theorem 4.3.16 restated).
With the Doeblin-type condition in (9), the Markov chain with any has a unique stationary distribution . Moreover, for all , and ,
Letting and , we have
Proposition 5 shows that in , the initial distribution (of ) is not very important since the distribution of converges to uniformly with respect to and . As a result, also converges to the unique limiting distribution, regardless of the initial distribution. When sampling the observation sequence from the expert, we can always start sampling late enough such that Assumption 3 is approximately satisfied. Note that the proof of Proposition 5 does not use the failure mechanism imposed on the hierarchical policy, implying that the result also holds for the standard options framework.
Appendix B Details of the algorithm
B.1 An error in the existing algorithm
First, we point out a technicality when comparing Algorithm 1 to the algorithm from [10]. The algorithm from [10] learns a hierarchical policy following the standard options framework, not the options with failure framework considered in Algorithm 1. To draw a direct comparison, we need to let in Algorithm 1. However, an error in the existing algorithm can be demonstrated without referring to .
For simplicity, consider fixed to ; let . Then, according to the definitions in [10], the (unnormalized) forward message is defined as
The (unnormalized) backward message is defined as
The smoothing distribution is defined as
We use the proportional symbol to represent normalizing constants independent of and . [10] claims that, for any and ,
However, applying Bayes’ formula, it follows that
Using the Markov property,
Therefore,
Applying Bayes’ formula again, it follows that
For the claim in [10] to be true, should not depend on and . Clearly this requirement does not hold in most cases, since the likelihood of the future observation sequence should depend on the currently applied option.
B.2 Proof of Theorem 1
We drop the dependency on , since the following proof holds for all . The proportional symbol is used to replace a multiplier term that depends on the context.
1. (Forward recursion)
First consider any fixed . For a cleaner notation, we use as an abbreviation of . Let , be any two subsets of , and let , be the sets of values generated from and , respectively, such that the uppercase symbols are replaced by the lowercase symbols. ( and are two sets of random variables; and are two sets of values of random variables.) Then, for all , is defined as
If the RHS does not depend on and , we can omit it on the LHS by using . ,
Furthermore,
where replaces a multiplier that does not depend on . Taking expectation with respect to gives the desired forward recursion result. For the case of , the proof is analogous.
2. (Backward recursion)
For any , ,
where the multipliers replaced by are independent of and . Moreover,
Plugging in the structure of the policy gives the desired result.
3. (Smoothing)
Consider any fixed . For any ,
Taking expectation with respect to on both sides yields the desired result. Notice that the second term on the RHS does not depend on ; therefore, it is not involved in the expectation. For the case of , the proof is analogous.
4. (Two-step smoothing)
For any , consider any fixed ,
Take expectation with respect to on both sides. Notice that only the first term on the RHS depends on . We have
where the multipliers replaced by are independent of and . For the case of the proof is analogous. ∎
B.3 Discussion on the Q-function
In our algorithm, as motivated by Section 3, we effectively consider the following joint distribution on the graphical model shown in Figure 1: the prior distribution of is , and the distribution of the rest of the graphical model is determined by an options with failure policy with parameters and . From the EM literature [3, 22], the complete likelihood function is
The marginal likelihood function is
where the superscript means marginal. From the definition of smoothing distributions, we can verify that .
The standard MLE approach maximizes the logarithm of the marginal likelihood function (marginal log-likelihood) with respect to . However, such an optimization objective is hard to evaluate for time series models (e.g., HMMs and our graphical model). As an alternative, the marginal log-likelihood can be lower bounded [22, Chap. 5.4] as
where on the RHS is arbitrary. The RHS is usually called the (unnormalized) Q-function. For our graphical model, it is denoted as .
The RHS is well-defined from the non-degeneracy assumption. From the classical monotonicity property of EM updates [22, Chap. 5.7], maximizing the (unnormalized) Q-function with respect to guarantees non-negative improvement on the marginal log-likelihood. Therefore, improvements on parameter inference can be achieved via iteratively maximizing the (unnormalized) Q-function.
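To recall the standard argument in our context (with $x_{1:T}$ the observations, $z_{1:T}$ the latent options and termination indicators, and $p^{\theta}$ the joint distribution; symbols ours, introduced only for illustration), Jensen's inequality gives, for any pair of parameters $\theta, \theta'$,
\[
\log p^{\theta'}(x_{1:T}) \;=\; \log \sum_{z_{1:T}} p^{\theta'}(x_{1:T}, z_{1:T})
\;\ge\; \sum_{z_{1:T}} p^{\theta}(z_{1:T} \mid x_{1:T})\,
\log \frac{p^{\theta'}(x_{1:T}, z_{1:T})}{p^{\theta}(z_{1:T} \mid x_{1:T})},
\]
with equality at $\theta' = \theta$; the terms of the right-hand side that depend on $\theta'$ form the (unnormalized) Q-function, so maximizing it over $\theta'$ cannot decrease the marginal log-likelihood.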
Using the structure of the hierarchical policy, can be rewritten as
where contains terms unrelated to . Consider the first term on the last line, which partially captures the effect of assuming on the parameter inference. Since this term is upper bounded by , when is large enough this term becomes negligible. The precise argument is similar to the proof of Lemma C.2. Therefore, after dropping the last line and normalizing, we arrive at our definition of the (normalized) Q-function in (7).
Appendix C Details of the performance guarantee
C.1 Smoothing in an extended graphical model
Before providing the proofs, we first introduce a few definitions. Consider the extended graphical model shown in Figure 4 with a parameter ; .
[Figure 4: the extended graphical model used in Appendix C.1.]
Let the joint distribution of be . Define the distribution of the rest of the graphical model using an options with failure hierarchical policy with parameters and , analogous to our settings so far. With these two components, the joint distribution on the graphical model is determined. Let be such a joint distribution; is omitted for conciseness.
We emphasize the comparison between and . The sample space of is the domain of , whereas the sample space of is the domain of since is fixed to .
Consider the infinite length observation sequence corresponding to any , where is defined in (8). Analogous to the non-extended model (Figure 1), we can define smoothing distributions for the extended model with any parameter . For all and , with any input arguments and , the forward message is defined as
The backward message is defined as
The smoothing distribution is defined as
The two-step smoothing distribution is defined as
The quantities , and are normalizing constants such that the LHS of the expressions above are probability mass functions. In particular, since , we can define for in the same way as . The dependency on in the smoothing distributions is dropped for a cleaner notation.
Recursive results similar to Theorem 1 can be established; the proof is analogous and therefore omitted. As in Theorem 1, we make extensive use of the proportional symbol , which means that the LHS equals the RHS multiplied by a normalizing constant. Moreover, the normalizing constant does not depend on the input arguments of the LHS.
Corollary 6 (Forward-backward smoothing for the extended model).
For all and , with any input arguments,
1. (Forward recursion) the recursion in (10) holds.
2. (Backward recursion) the recursion in (11) holds.
3. (Smoothing) the expression in (12) holds.
4. (Two-step smoothing) the expression in (13) holds.
The following lemma characterizes the limiting behavior of and as .
Lemma C.1 (Limits of smoothing distributions).
The proof is given in Appendix D.4. The dependency of and on is omitted for a cleaner notation.
C.2 The stochastic convergence of the Q-function
In this subsection, we present the proof of Theorem 2.
First, consider and defined in Lemma C.1. Using the arguments from Section 4, they can also be analyzed in the infinitely extended probability space , where denotes the power set. We only define and for ; for other sample paths, they are defined arbitrarily. Since , such a restriction from to does not change our probabilistic results.
For any sample path , let and be the values of and corresponding to . With a slight overload of notation, let , which is the set of components in corresponding to time .
For all , , and , define
where the dependency of the RHS on is shown explicitly for clarity. is upper bounded by a constant that does not depend on , , and , due to Assumptions 1 and 2. Moreover, for all , and , is continuously differentiable with respect to ; for all , and , is Lipschitz continuous with respect to , due to Lemma C.1.
Next, define
(14)
The subscripts and in denote that the expectation is taken with respect to the probability measure .
With the above definitions, we state the complete version of Theorem 2. The Q-function defined in (7) is written as , showing its dependency on the sample path.
Theorem 7 (The complete version of Theorem 2).
Before proving Theorem 7, we state the following definition and an auxiliary lemma required for the proof. For all , and , the sample-path-based population Q-function is defined as
(15)
The superscript s in stands for sample-path-based. If the sample path is not specified, is a random variable associated with probability measure . Note that due to stationarity, for any , and , .
The difference between and is bounded in the following lemma.
Lemma C.2 (Bounding the difference between the Q-function and the sample-path-based population Q-function).
The proof is provided in Appendix D.5. Now we are ready to present the proof of Theorem 7 step-by-step. The structure of this proof is similar to the standard analysis of HMM maximum likelihood estimators [8, Chap. 12].
Proof of Theorem 7.
We prove the two parts of the theorem separately.
1. For all , there exists such that the set . For all and , due to the differentiability of with respect to , there exists a gradient at any such that
We need to transform the above almost surely (in ) convergence to the convergence of expectation, using the dominated convergence theorem. As a requirement, the quantity inside the limit on the LHS needs to be upper-bounded. For all , , and ,
(16)
Since continuously differentiable functions are Lipschitz continuous on convex and compact subsets, , and as functions of are Lipschitz continuous on , with any other input arguments. From the expression of , we can verify that for any fixed and , as a function of is Lipschitz continuous on , and the Lipschitz constant only depends on and . Consequently, the RHS of (16) can be upper-bounded for all . Applying the dominated convergence theorem, we have
(17)
On the other hand, notice that for all , and ,
Combining with (17) proves the differentiability of with respect to for any fixed . The gradient is
Analogously, using the dominated convergence theorem we can also show that the gradient is continuous with respect to . Details are omitted due to the similarity with the above procedure. It is worth noting that we let instead of . In this way, the gradient can be naturally defined when is not an interior point of .
From differentiability and , is also continuous with respect to . Since is compact, the set of maximizing arguments is nonempty.
2. We need to prove the uniform (in and ) almost sure convergence of the Q-function to the population Q-function . The proof is separated into three steps. First, we show the almost sure convergence of to for all using the ergodic theorem. Second, we extend this pointwise convergence to uniform (in ) convergence using a version of the Arzelà-Ascoli theorem [12, Chap. 21]. Finally, from Lemma C.2, the difference between and vanishes uniformly in as .
Concretely, for the pointwise (in ) almost sure convergence of as , we apply Birkhoff’s ergodic theorem. Let be the standard shift operator. That is, for any , . Due to stationarity, is a measure-preserving map, i.e., for all . Therefore, the quadruple defines a dynamical system.
Here, we clarify some concepts and notation. Consider the Markov chain induced by the expert policy, and let be its set of all stationary distributions. Compared to from Assumption 3, both depend on the true parameter ; the former corresponds to the chain , while the latter corresponds to the chain . From the structure of our graphical model, they are equivalent up to a transformation.
From Section 4, is defined from an element of that depends on . Denote this stationary distribution as . Since is an extreme point of (Assumption 3), is also an extreme point of . Then, we can apply a standard Markov chain ergodicity result. From [20, Theorem 5.7], the dynamical system is ergodic. For our case, Birkhoff’s ergodic theorem is restated as follows.
Lemma C.3 ([20], Corollary 5.3 restated).
If a dynamical system is ergodic and satisfies , then as ,
For our purpose, observe that for any , . Therefore, applying the ergodic theorem to , as ,
(18)
To extend the pointwise convergence in (18) to uniform (in ) convergence, the following concept is required. The sequence indexed by as functions of and is strongly stochastically equicontinuous [12, Equation 21.43] if for any there exists such that
(19)
Indeed this property holds for , as shown in Appendix D.6. The version of the Arzelà-Ascoli theorem we use is restated as follows, tailored to our need.
On the concavity of .
As discussed after introducing Assumption 4, we expect the following to hold in certain cases of tabular parameterization: for all , the function is strongly concave over . Details are presented below.
Consider for example; we need to provide sufficient conditions such that the following function is strongly concave with respect to , given any .
Let the marginal distribution of on be . If is strictly positive on , then we rewrite as
In the case of tabular parameterization, is an entry of indexed as ; its logarithm is 1-strongly concave on the interval . is strongly concave with respect to if is strictly positive for all and . We speculate that this requirement is mild, but a rigorous characterization is quite challenging.
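The elementary fact behind the 1-strong concavity claim is that, for a probability value $x$,
\[
\frac{d^{2}}{dx^{2}} \log x \;=\; -\frac{1}{x^{2}} \;\le\; -1 \qquad \text{for all } x \in (0, 1],
\]
so $x \mapsto \log x$ is 1-strongly concave on $(0,1]$ and hence on any subinterval of it.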
C.3 The convergence of the population version algorithm
We first present the complete version of Theorem 3, where an upper bound on is also shown. Notice that we impose all the assumptions, including Assumptions 4 and 5.
Theorem 8 (The complete version of Theorem 3).
With all the assumptions,
1. (First-order stability)
2. (Contraction) Let . For all ,
If , the population version algorithm converges linearly to the true parameter .
Proof of Theorem 8.
We prove the two parts separately in the following.
1. For convenience of notation, let such that, for example, is the gradient of with respect to . Using the expressions of from Theorem 7, we have
Consider the first term,
We use the triangle inequality and Jensen's inequality in the third and fourth lines, respectively. The fifth line is finite due to being compact and the continuity of the gradient (Assumption 2). The last line is due to the limit form of Lemma D.7, similar to the argument in Appendix D.4. Notice that the coefficient of on the last line does not depend on .
Analogously, we have
Combining everything, we have the upper bound on .
2. The proof of the second part mirrors the proof of [3, Theorem 4]. The main difference is the construction of the following self-consistency (a.k.a. fixed-point) condition.
Lemma C.5 (Self-consistency).
With all the assumptions, .
The proof of this lemma is presented in Appendix D.7. Such a condition is used without proof in [3], since that work only considers i.i.d. samples, and the self-consistency condition for EM with i.i.d. samples is a well-known result. However, for the case of dependent samples, as in our graphical model, such a condition results from the stochastic convergence of the Q-function, which is not immediate.
For the rest of the proof, we present a brief sketch here for completeness. Due to concavity, we have the first order optimality conditions: for all , and . Using , we can combine the two optimality conditions together and obtain the following. For all ,
From Assumption 4, . From Cauchy-Schwarz and the first part of this theorem, . Canceling on both sides completes the proof. ∎
C.4 Proof of Theorem 4
1. We first show the strong consistency of , the parameter update of Algorithm 1, as an estimator of . This follows from standard techniques in the analysis of M-estimators. In particular, consider the set of sample paths such that and has a unique element . Such a set of sample paths has probability measure 1.
For all , and , with one of the above sample paths ,
From Theorem 7, -almost surely, as . Therefore,
An equivalent argument is the following. -almost surely, for any there exists such that for all , . In particular, for any , let
From the identifiability assumption (Assumption 5), the RHS is positive. Therefore, such an assignment of is valid. Consequently, for all , and ,
which means that . Taking supremum over and , we summarize the argument as the following. -almost surely, for any there exists such that for all ,
Such a result is equivalent to the uniform (in and ) strong consistency of as an estimator of . As ,
This result is insufficient for Part 1, since is sample path dependent. To get rid of this sample path dependency, we use the dominated convergence theorem. Notice that -almost surely, for all , is bounded due to the compactness of . Therefore we have
For any , there exists such that for all ,
Applying Markov’s inequality, for any ,
Scaling yields the desired result.
2. The proof of Part 2 is the same as [3, Theorem 5]. We present a sketch for completeness. For all , condition the following proof on the high probability event that .
Appendix D Proofs of auxiliary lemmas
The first three subsections develop a few essential lemmas required for the proofs in the later subsections. In Appendix D.1, we show an important mixing property of the options with failure framework. In Appendix D.2, this mixing property is used to prove a general contraction result for our forward-backward smoothing procedure (Theorem 1 and Corollary 6), similar to the concept of filtering stability in the HMM literature. At a high level, considering the forward-backward recursion in the extended graphical model (Corollary 6), this result characterizes the effect of changing and the boundary conditions and on the smoothing distribution , given any observation sequence . For this reason, we refer to this result as the smoothing stability lemma. Appendix D.3 provides concrete applications of this lemma to quantities defined in earlier sections.
D.1 Mixing
Recall that is the auxiliary parameter in the options with failure framework.
Lemma D.1 (Mixing).
There exists a constant and a conditional distribution parameterized by such that for all , with any input arguments , , and ,
Proof of Lemma D.1.
The proof is separated into two parts.
1. We first show an intermediate result: there exists a constant and a conditional distribution parameterized by such that for all , with any input arguments , and ,
This can be proved as follows. Let . Similar to the procedure in Appendix A, from the non-degeneracy assumption, the differentiability assumption, and being compact, we have . For any , with any input arguments and , let . Observe that . Let and
Clearly . Moreover, for any , .
On the other hand, with any input arguments,
which completes the proof of the first part.
2. Define as follows. With any input arguments, let
Clearly . We omit the dependency on for a cleaner notation, since every term is parameterized by . When , with any other input arguments,
Similarly, when and ,
Finally, when and ,
Combining the above cases and the definition of completes the proof. ∎
D.2 Smoothing stability
Before stating the smoothing stability lemma, we introduce a few definitions. The quantities defined in this subsection depend on an observation sequence , but such a dependency is usually omitted to simplify the notation, unless specified otherwise. Consistent with our notation so far, in the following we make extensive use of the proportionality symbol .
D.2.1 Forward and backward recursion operators
With any given observation sequence and any , define the filtering operator as follows. For any probability measure over , is also a probability measure such that with any input arguments and ,
(21)
The RHS has exactly the form of the forward recursion; therefore, the recursion on both in (2) and in (10) can be expressed using . For generality, let and be any two indexed sets of probability measures such that and . We restrict and to be strictly positive. Due to Assumption 1, such a restriction is valid. Notice that and here can be equal. We use the seemingly more complicated notation because even if , and are still different; in this case they are just two different sets of probability measures satisfying the same recursion .
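For orientation, a generic forward (filtering) recursion of the kind encoded by (21), written here for an abstract latent state $z_t$ and observation $x_t$ (a schematic stand-in only, not the paper's exact option and termination variables), has the form
\[
\alpha_t(z) \;\propto\; \sum_{z'} \alpha_{t-1}(z')\, p(z \mid z')\, p(x_t \mid z),
\]
so that repeated application of the operator propagates the filtering distribution forward in time; the recursions in (2) and (10) follow the same pattern, with the latent state replaced by the option and termination variables and the transition and emission terms replaced accordingly.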
Similarly, we define the backward recursion operator as follows. For any probability measure over , is also a probability measure such that with any input arguments and ,
(22)
The recursion on both in (4) and in (11) can be expressed using . Let and be any two indexed sets of probability measures such that and . We restrict and to be strictly positive.
The operation is defined as follows: is an indexed set of probability measures such that for any input arguments and ,
(23)
D.2.2 Forward and backward smoothing operators
For any and any , with any observation sequence and any input arguments and , observe that
and
To simplify notation, let
(24)
Then,
(25)
where is a normalizing constant such that
From (25), we define the forward smoothing operator on the probability measure such that as probability measures,
The subscript in stands for forward. depends on the parameters and , the observation , and the specific choice of . In the general case of , is a nonlinear operator which requires rather sophisticated analysis. However, when , it is straightforward to verify that the normalizing constant , and becomes a linear operator.
In fact, the linear operator can be regarded as the standard operation of a Markov transition kernel on probability measures. With a slight overload of notation, define such a Markov transition kernel on , entry-wise, as follows. For any and in ,
(26)
We call this Markov transition kernel the forward smoothing kernel. Such a definition is analogous to the Markovian decomposition in the HMM literature [8]. The only caveat here is that we also allow perturbations of the parameter. The resulting operator is nonlinear and no longer corresponds to a Markov transition kernel.
To proceed, we characterize the difference between operators and when and are close. First, we show a version of Lipschitz continuity for the options with failure framework.
Lemma D.2 (Lipschitz continuity).
For all and , there exists a real number such that with any input arguments , , , and , the function defined in (24) is -Lipschitz with respect to on the set . Moreover, is upper bounded by a constant that does not depend on and .
Proof of Lemma D.2.
Due to Assumption 2, with any input arguments , , , and , is continuously differentiable with respect to . As continuously differentiable functions are Lipschitz continuous on convex and compact subsets, is Lipschitz continuous on , hence also on . The Lipschitz constants depend on the choice of input arguments , , , and .
We can let be the smallest Lipschitz constant on that holds for all input arguments , , , and . Clearly is upper bounded by any Lipschitz constant on that holds for all input arguments, which does not depend on and . ∎
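A standard way to exhibit such a constant, under the stated smoothness and compactness and written with generic symbols, is via the mean value theorem: for a continuously differentiable function $f$ on a convex, compact set $\Theta$,
\[
|f(\theta_1) - f(\theta_2)| \;\le\; \Bigl(\sup_{\theta \in \Theta} \|\nabla_\theta f(\theta)\|\Bigr)\, \|\theta_1 - \theta_2\| \qquad \text{for all } \theta_1, \theta_2 \in \Theta,
\]
where the supremum is finite by continuity of the gradient and compactness of $\Theta$; a further supremum over the input arguments gives a constant independent of them.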
Next, we bound the difference between operators and .
Lemma D.3 (Perturbation on the forward smoothing kernel).
Let be any probability measure on . Let and be defined with the same observation sequence and the same choice of . Their difference is only in the first entry of the superscript ( in ; in ). Then, for all , , , , and ,
Proof of Lemma D.3.
From the definitions, for any , , , , and ,
From the definition of the normalizing constant , we have
Therefore,
and
As a result, for any given , and ,
Combining everything together,
On the other hand, we can formulate a backward smoothing recursion as
(27)
where is a normalizing constant such that
The subscript in stands for backward. Similar to the forward smoothing operator , we can define the backward smoothing operator from (27) such that as probability measures,
Analogous to , in the general case of , is a nonlinear operator. However, if , becomes a linear operator and induces a Markov transition kernel. The following lemma is similar to Lemma D.3. We state it without proof.
Lemma D.4 (Perturbation on the backward smoothing kernel).
Let be any probability measure on . Let and be defined with the same observation sequence and the same choice of . Then, for any , , , , and ,
D.2.3 A perturbed contraction result for smoothing stability
For any with , let . Remember the following definition from Appendix D.2.1, with the index set restricted to : for any , and are two indexed sets of probability measures defined on such that, for all , (1) if , and ; (2) and are strictly positive on their domains. and are two indexed sets of probability measures defined on such that for all , (1) if , and ; (2) and are strictly positive on their domains. and are allowed to be equal.
The smoothing stability lemma is stated as follows.
Lemma D.5 (Smoothing stability).
With , , and defined above,
where is a positive real number dependent only on and . Specifically,
Intuitively, if , Lemma D.5 has the form of an exact contraction, which is similar to the standard filtering stability result for HMMs. Indeed, our proof uses the classical techniques of uniform forgetting from the HMM literature [8]. If is different from , such a contraction is perturbed. For HMMs, similar results are provided in [13, Proposition 2.2, Theorem 2.3].
Proof of Lemma D.5.
Consider the first bound. It holds trivially when . Now consider only . Using the forward smoothing operators, for any ,
Therefore,
where the first term is due to being a linear operator.
From Lemma D.3, the second term on the RHS is upper bounded by . As for the first term, we can construct the classical Doeblin-type minorization condition [8, Chap. 4.3]. Applying Lemma D.1 in the definition of the Markov transition kernel (26), we have
(28)
Observe that just defined is a probability measure. Further define entry-wise as
We can verify that is also a Markov transition kernel. Moreover,
To proceed, the standard approach is to use the fact that the Dobrushin coefficient of is upper bounded by one. For clarity, we avoid such definitions and take a more direct approach here, which requires the extension of the total variation distance for two probability measures to the total variation norm for a finite signed measure. For a finite signed measure over a finite set , let the total variation norm of be
When is the difference between two probability measures , the total variation norm of coincides with the total variation distance between and . Therefore, the same notation is adopted here.
Let be the set of finite signed measures over the finite set . From [8, Chap. 4.3.1], is a Banach space. Define an operator norm for as
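For concreteness, one common convention for these norms (written with generic symbols; the normalization used in [8] may differ by a constant factor) is
\[
\|\xi\|_{\mathrm{TV}} \;=\; \tfrac{1}{2} \sum_{x \in \mathcal{X}} |\xi(x)|, \qquad (\xi K)(y) = \sum_{x \in \mathcal{X}} \xi(x) K(x, y), \qquad \|K\|_{\mathrm{op}} \;=\; \sup_{\xi \neq 0} \frac{\|\xi K\|_{\mathrm{TV}}}{\|\xi\|_{\mathrm{TV}}},
\]
under which any Markov transition kernel satisfies $\|\xi K\|_{\mathrm{TV}} \le \|\xi\|_{\mathrm{TV}}$ for every finite signed measure $\xi$, and hence $\|K\|_{\mathrm{op}} \le 1$.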
Since is a Markov transition kernel, [8, Lemma 4.3.6]. Therefore,
The second inequality is due to the sub-multiplicativity of the operator norm. Finally, the desired result follows from unrolling the summation and identifying the geometric series.
The proof of the second bound is analogous, using the backward smoothing operators instead of the forward smoothing operators. Details are omitted. ∎
Note that Lemma D.5 only holds for the options with failure framework. For the standard options framework, the one-step Doeblin-type minorization condition (28) constructed in the proof no longer holds, due to the failure of Lemma D.1. Instead, one could target a two-step minorization condition: define a two-step smoothing kernel similar to and lower bound it as in (28). The notation becomes much more complicated, so for simplicity this extension is not considered in this paper.
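For reference, the generic one-step Doeblin-type minorization and the contraction it induces (stated with placeholder symbols $K$, $\epsilon$, $\nu$) read
\[
K(x, \cdot) \;\ge\; \epsilon\, \nu(\cdot) \ \text{ for all } x
\quad\Longrightarrow\quad
\|\mu K - \mu' K\|_{\mathrm{TV}} \;\le\; (1 - \epsilon)\, \|\mu - \mu'\|_{\mathrm{TV}}
\]
for any probability measures $\mu, \mu'$; the two-step variant mentioned above replaces $K$ by the two-step kernel $K^2$ and applies the same implication.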
D.3 The approximation lemmas
This subsection applies Lemma D.5 to quantities defined in earlier sections.
First, we bound the difference of smoothing distributions in the non-extended graphical model (as in Theorem 1) and the extended one with parameter (as in Corollary 6). The parameter in the two models can be different. The bounds use quantities defined in Appendix D.1 and Appendix D.2. Recall the definition of from (8).
Lemma D.6 (Bounding the difference of smoothing distributions, Part I).
For all , and , with the observation sequence corresponding to any , we have
1. ,
2. .
Similarly, we can bound the difference of smoothing distributions in two extended graphical models with different and different parameter .
Lemma D.7 (Bounding the difference of smoothing distributions, Part II).
For all and , with any two integers and the observation sequence corresponding to any , we have
It can be easily verified that in Lemma D.6 and Lemma D.7, the bounds still hold if and on the LHS are interchanged. We only present the proof of Lemma D.6; the proof of Lemma D.7 is analogous and therefore omitted. Our proof essentially relies on the smoothing stability lemma (Lemma D.5).
Proof of Lemma D.6.
Consider the first bound. For a cleaner notation, let
Apply Lemma D.5 as follows: , let and ; let and . Due to Assumption 1, the strict positivity requirement is satisfied. Then, we have
where denotes the element-wise product and denotes the Euclidean inner product. Therefore,
Next, we bound the difference of two-step smoothing distributions . Although the idea is straightforward, the details are tedious. For any , from (6) we have
The denominators are all positive due to the non-degeneracy assumption. It can be easily verified that the normalizing constants involved in the second and third lines cancel each other. As abbreviations, define
Then,
(29)
Now, we bound the two terms on the RHS separately. Consider the first term in (29),
(30)
Denote the two terms on the RHS of (30) as and respectively. To bound , we can apply Lemma D.5 on the index set as follows, assuming . For any , let and . For any , let , where is a normalizing constant. For , let . Then,
Such a bound holds trivially if .
Next, we bound as follows. Straightforward computation yields the following result.
where we use the definition of in (24).
D.4 Proof of Lemma C.1
Based on Lemma D.7, for all , and , with any observation sequence, both the sequences and are Cauchy sequences with respect to the total variation distance. Moreover, the set of probability measures over the finite sample space is complete under this distance. Therefore, the limits of both and as exist with respect to the total variation distance. From the definitions of and in Appendix C.1, it is clear that their limits as do not depend on .
The Lipschitz continuity of also follows from Lemma D.7. Specifically, for all and , with any observation sequence,
The coefficient of on the RHS can be upper bounded by a constant that does not depend on and . The same argument holds for . ∎
D.5 Proof of Lemma C.2
For a cleaner notation, we omit the dependency on in the following analysis. From the definitions, for all and ,
where the last term is a small error term associated with such that,
The maximum on the RHS is finite due to the non-degeneracy assumption. Furthermore,
Since the bounds in Lemma D.6 hold for any , they also hold in the limit as . Therefore, for any , and any ,
For any , and any ,
Combining everything above,
where is a positive real number that only depends on and the structural constants , and . Due to Assumption 2, is continuous with respect to . Since is compact, . Therefore,
Taking the supremum with respect to , and completes the proof. ∎
D.6 Proof of the strong stochastic equicontinuity condition (19)
First, for all and ,
Due to the boundedness of from Appendix C.2, we can apply the ergodic theorem (Lemma C.3): almost surely,
Notice that for all , , and ,
The RHS does not depend on . Due to Assumption 2, , and as functions of the parameter are uniformly continuous on , with any other input arguments. Therefore it is straightforward to verify that, for any ,
Applying the dominated convergence theorem,
D.7 Proof of Lemma C.5
Consider the following joint distribution on the graphical model shown in Figure 1: the prior distribution of is , and the joint distribution of the rest of the graphical model is determined by an options with failure policy with parameters and . Notice that this is the correct graphical model for the inference of the true parameter , since the assumed prior distribution of coincides with the correct one.
For clarity, we use the same notation as in Appendix B.3 for the complete likelihood function, the marginal likelihood function, and the (unnormalized) -function. Specifically, the quantities used in this proof share the symbols of those defined in Appendix B.3, but they are not mathematically identical.
Parallel to Appendix B.3, the complete likelihood function is
The marginal likelihood function is
Let be the conditional distribution of given . For any ,
Therefore, for the inference of considered in this proof, the (unnormalized) -function can be expressed as
We can rewrite using the structure of the options with failure framework, drop the terms irrelevant to and normalize using . The result is the following definition of the (normalized) -function:
We draw a comparison between and defined in (7): their difference is in the first term of . Maximizing with respect to is equivalent to maximizing the (unnormalized) -function . In Algorithm 1, since is unavailable, we use as its approximation.
depends on the observation sequence; therefore, it is a function of the sample path . In the following we explicitly show this dependency by writing . Clearly, for all , and ,
Combining this with the stochastic convergence of as shown in Theorem 2, we have that for any , as ,
Using the dominated convergence theorem, such a convergence holds in expectation as well. For any ,
Since maximizing with respect to is equivalent to maximizing the (unnormalized) -function , the standard monotonicity property of the EM update holds as well. For all , and ,
Taking expectation on both sides, we have
due to the non-negativity of the Kullback-Leibler divergence. Therefore, , and in the limit we have for all . Applying the identifiability assumption for the uniqueness of completes the proof. ∎
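For completeness, the classical decomposition behind this monotonicity argument (written with generic symbols for an observation $x$, latent variable $z$, and parameter $\theta$) is
\[
\log p_{\theta}(x) \;=\; Q(\theta \mid \theta_k) + H\bigl(p_{\theta_k}(\cdot \mid x)\bigr) + \mathrm{KL}\bigl(p_{\theta_k}(\cdot \mid x) \,\|\, p_{\theta}(\cdot \mid x)\bigr),
\]
where $Q(\theta \mid \theta_k) = \mathbb{E}_{z \sim p_{\theta_k}(\cdot \mid x)}[\log p_{\theta}(x, z)]$ and $H$ denotes the entropy. Evaluating at $\theta = \theta_{k+1}$ and $\theta = \theta_k$ and subtracting, the entropy term cancels, the KL term is non-negative, and $Q(\theta_{k+1} \mid \theta_k) \ge Q(\theta_k \mid \theta_k)$ by the definition of the M-step, which yields $\log p_{\theta_{k+1}}(x) \ge \log p_{\theta_k}(x)$.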
Appendix E Additional experiments and details omitted in Section 5
E.1 Generation of the observation sequences
We first introduce the method used to sample observation sequences from the stationary Markov chain induced by the expert policy. Using the expert policy and a fixed pair, we generate 50 sample paths of length 20,000. Then, the first 10,000 time steps in each sample path are discarded, and the remaining state-action pairs are saved as the observation sequences used by the algorithm. For different , we simply take the first time steps of each observation sequence.
Such a procedure is motivated by Proposition 5: it can be easily verified that Assumptions 1 and 2 hold in our numerical example. Therefore, from Proposition 5, the distribution of approaches the unique stationary distribution regardless of the initial pair. In this way, Assumption 3 is approximately satisfied.
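A minimal sketch of this sampling procedure is given below, assuming a hypothetical helper expert_step that applies the hierarchical expert policy and the environment dynamics for one time step; the helper name, the environment interface, and the initial pair are illustrative placeholders, not part of the paper's code.

N_PATHS, HORIZON, BURN_IN = 50, 20_000, 10_000

def sample_observation_sequences(expert_step, init_state, init_option, rng):
    """Generate burn-in-discarded state-action sequences from the expert policy."""
    sequences = []
    for _ in range(N_PATHS):
        state, option = init_state, init_option   # fixed initial pair (illustrative)
        path = []
        for _ in range(HORIZON):
            option, action, next_state = expert_step(state, option, rng)
            path.append((state, action))
            state = next_state
        # Discard the first 10,000 steps so the remainder is approximately stationary.
        sequences.append(path[BURN_IN:])
    return sequences

# For a run with a shorter length T, keep only the first T pairs of each sequence:
# truncated = [seq[:T] for seq in sequences]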
E.2 Analytical expression of the parameter update
For our numerical example, the parameter update of Algorithm 1 has a unique analytical solution. For all , , and , we first derive the analytical expression of , which is the updated parameter for based on the previous parameter . Such a notation for parameter updates is borrowed from Assumption 5. Using the expression of the -function (7), we have
where on the RHS is the state value from the sample path . We omit on the RHS for a cleaner notation. Let denote the sum inside the argmax. Then,
Taking the derivative of , we can verify that is strongly concave. Therefore, the parameter update for is unique.
where is the unconstrained parameter update given as
Similarly, the unconstrained parameter updates for and are the following:
where the . The parameter updates and are the projections of and onto , respectively.
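A minimal sketch of the projection step, assuming (for illustration only) that the parameter space is a box with endpoints lo and hi in each coordinate, in which case the Euclidean projection reduces to coordinate-wise clipping:

import numpy as np

def project_update(theta_unconstrained, lo, hi):
    """Project an unconstrained M-step update onto a box-shaped parameter space
    by coordinate-wise clipping (the Euclidean projection onto a box)."""
    return np.clip(theta_unconstrained, lo, hi)

# Illustrative usage with hypothetical bounds:
# theta_next = project_update(theta_unconstrained, lo=0.05, hi=0.95)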
E.3 Supplementary results to Figure 3
In this subsection we present supplementary results to Figure 3. In Figure 3, is defined as the average of over all the 50 sample paths. Here, we divide the set of sample paths into smaller sets and evaluate the average of over these smaller sets separately. The settings for the computation of parameter estimates are the same as in Section 5. The following procedure serves as the post-processing step of the obtained parameter estimates.
Concretely, as defined in Section 5, we obtain a sequence after running Algorithm 1 with any sample path and any . After fixing and letting , is a function of only. With a given threshold interval , we define a smaller set of sample paths as the set of with greater than the -th percentile and less than the -th percentile. Let be the average of over this smaller set of sample paths specified by the interval . For , the values of with specific choices of are plotted below. If , is equivalent to investigated in Section 5.
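A minimal sketch of this post-processing, assuming errors_by_iter is an array of shape (num_paths, num_iters) holding the per-iteration error of each sample path and final_errors holds the corresponding final-iteration errors; the array and function names are illustrative only.

import numpy as np

def percentile_subset_average(errors_by_iter, final_errors, a, b):
    """Average the per-iteration error over the sample paths whose final error lies
    strictly between the a-th and b-th percentiles of final_errors."""
    lo, hi = np.percentile(final_errors, a), np.percentile(final_errors, b)
    mask = (final_errors > lo) & (final_errors < hi)
    return errors_by_iter[mask].mean(axis=0)

# With (a, b) = (0, 100) this reduces (up to the strict inequalities at the endpoints)
# to the plain average over all 50 sample paths.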

Figure 5 suggests that with probability around 0.6, our algorithm with this particular choice of and achieves decent performance, decreasing the original estimation error by at least half. A noteworthy observation is that, for all the choices of (including representing the failed sample paths), roughly follows the same exponential decay in the early stage of the algorithm (roughly the first 10 iterations). The same behavior can be observed for and as well. Whether this behavior is general or specific to our numerical example is unclear; a detailed investigation is left for future work.
E.4 Varying

In this subsection we investigate the effect of on the performance of Algorithm 1. Intuitively, from the uniform forgetting analysis throughout this paper, it is reasonable to expect that at each iteration, the effect of on the parameter update is negligible if is large. However, such a negligible error could accumulate if is large. The effect of on the final parameter estimate is not clear without experiments.
We use the same observation sequences as in Section 5. is fixed at 5000. , and the parameter space for all three parameters remains the same as . For all , . The performance of the algorithm is evaluated by defined in Section 5. The result is presented in Figure 6, which shows that, indeed, the effect of on the final performance of the algorithm is negligible. For , is higher than .
E.5 Random initialization
Up to this point, all the empirical results use the same initial parameter estimate on all 50 sample paths. In this subsection, we evaluate the effect of the initial estimation error on the performance of the algorithm by using a random . Such a randomization is not considered in Section 5 since it requires additional explanation.
In this experiment, we use the same observation sequences as in Section 5. is fixed at 8000. For all , . The parameter space for all three parameters remains the same as . For each observation sequence, we first generate three independent samples , and uniformly from the interval . Then, is generated as follows: with a scale factor , let , and . As a result, , which depends on , differs across observation sequences. The choices of are not symmetric with respect to due to the restriction of the bounded parameter space. For the parameter estimates obtained from the computation, is defined as in Section 5. The result is shown in Figure 7.
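A minimal sketch of this randomization, with a hypothetical sampling interval and a hypothetical additive combination rule standing in for the exact (elided) expressions above; theta_star is assumed to be an array with one entry per parameter.

import numpy as np

def random_initialization(theta_star, r, rng, low=0.0, high=1.0):
    """Draw three independent uniform samples and form a perturbed initial estimate.
    The interval (low, high) and the additive rule below are illustrative placeholders."""
    u = rng.uniform(low, high, size=3)    # one independent sample per parameter
    return theta_star + r * u             # the scale factor r controls the initial error

# Illustrative usage:
# rng = np.random.default_rng(0)
# theta_0 = random_initialization(np.asarray(theta_star), r=0.5, rng=rng)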

From Figure 7, the curves corresponding to and qualitatively match the performance guarantee in Theorem 4. The algorithm achieves decent performance when is intermediate (the case of ), where the average estimation error is reduced by at least half. If is small (the case of ), the parameter estimates do not improve much over . If is large (the case of ), the algorithm cannot converge to the vicinity of the true parameter, which is consistent with our local convergence analysis.