
Off-Policy Evaluation with Irregularly-Spaced, Outcome-Dependent Observation Times

Xin Chen   
School of Statistics and Mathematics,
Shanghai Lixin University of Accounting and Finance
Wenbin Lu
Department of Statistics, North Carolina State University
Shu Yang
Department of Statistics, North Carolina State University
and
Dipankar Bandyopadhyay
Department of Biostatistics, Virginia Commonwealth University
Abstract

While the classic off-policy evaluation (OPE) literature commonly assumes decision time points to be evenly spaced for simplicity, in many real-world scenarios, such as those involving user-initiated visits, decisions are made at irregularly-spaced and potentially outcome-dependent time points. For a more principled evaluation of dynamic policies, this paper constructs a novel OPE framework that concerns not only the state-action process but also an observation process dictating the time points at which decisions are made. The framework is closely connected to the Markov decision process in computer science and to the renewal process in the statistical literature. Within the framework, two distinct value functions, derived from cumulative reward and integrated reward respectively, are considered, and statistical inference for each value function is developed under revised Markov and time-homogeneous assumptions. The validity of the proposed method is further supported by theoretical results, simulation studies, and a real-world application from electronic health records (EHR) evaluating periodontal disease treatments.


Keywords: Irregular decision time points; Markov decision process; Reinforcement learning; Statistical inference.

1 Introduction

A dynamic policy is a sequence of decision rules guiding an agent to select time-varying actions based on updated information at each decision stage throughout its interaction with the environment. The frameworks for analyzing dynamic policies can be broadly categorized into two settings: (a) finite-horizon and (b) long/infinite-horizon, based on how they handle the number of decision stages. Traditional research in dynamic policies predominantly focused on finite-horizon settings with a limited number of decision stages (Murphy et al., 2001; Murphy, 2003; Robins, 2004; Zhang et al., 2013; Zhao et al., 2015), where the potential outcome framework is commonly employed to describe the target of interest, and the classical causal inference assumptions, such as unconfoundedness and consistency, are sufficient to guarantee target estimability. Applications of these methods appear in multi-stage treatment settings for cancer, HIV infection, depression, and other chronic conditions. By contrast, infinite-horizon dynamic policies involve a relatively large number of decision stages. For example, consider our motivating periodontal disease (PD) treatment data extracted from the HealthPartners (HP) electronic health records (EHR), where patients are generally recommended for a regular check-up every 6 months (Clarkson et al., 2020). In this dataset, half of the patients visit their dentists for dental treatments more than 8 times, with the maximum number of treatment stages reaching 31 during the 8-year follow-up period. Similar decision policies also arise in the long-haul treatment of chronic diseases, intensive interventions provided during ICU stays, and other sequential decision-making applications, such as mobile health, gaming, robotics, and ride-sharing. In such cases, the backward induction algorithms used in finite-horizon settings may suffer from the curse of the horizon (Uehara et al., 2022). To mitigate this, off-policy evaluation methods developed within the reinforcement learning (RL) framework (Sutton and Barto, 2018) are preferred. Specifically, instead of directly estimating the target of interest, such as the value of a dynamic policy, researchers reformulate and identify the value function through an action-value function. Under time-homogeneous and Markov assumptions, the action-value function can be estimated by solving the Bellman equation, followed by the estimation of the quantity of interest. Recently, interest has shifted to dynamic policies in long- or infinite-horizon settings, and statistical inference for off-policy evaluation has also been developed (Ertefaie and Strawderman, 2018; Luckett et al., 2019; Kallus and Uehara, 2020; Shi et al., 2021, 2022).

This paper considers the evaluation of dynamic policies in an infinite-horizon setting. As a variation on the aforementioned literature, our framework includes not only the sequential decision-making process but also an observation process which dictates the time points $\{T_{k}\}_{k\geq 1}$ at which decisions are to be made. In the existing literature, it is common to assume that decisions are made at regular, pre-assigned time points $T_{k}=k$ for simplicity. However, in many real-world scenarios, such as those involving user-initiated visits, the observation process is often irregularly spaced and may depend on both the decision-making process and the outcomes (Yang et al., 2019; Yang, 2021). For example, in the PD treatment study, the gap times between dental revisits, even for the same patient, can range from 3 to 24 months. Additionally, the length of recall intervals is correlated not only with the actions made at the previous visit (such as the clinic visit recommendations made by dentists) but also with the outcomes (such as the patient's oral health status after treatment). The necessity of including such an irregularly spaced and outcome-dependent observation process in the analysis of dynamic policies arises for at least two reasons: (a) the gap times between decision stages might be useful for the decision-maker in choosing actions and, therefore, could be included as inputs of dynamic policies along with the state; and, more importantly, (b) replacing the regular decision stage number $k$ with its specific decision time point $T_{k}$ allows for more sensible and principled evaluations of dynamic policies (as will be discussed throughout the paper), regardless of whether the dynamic policy involves gap times.

Within the realm of developing dynamic policies with irregularly spaced and outcome-dependent observation times, this work advances the field in multiple ways. Our first contribution is offering two options for evaluating dynamic policies: one based on cumulative reward and the other on integrated reward. Following the OPE approaches with regular decision points, a simple way to define the value of a dynamic policy is to use the expected discounted cumulative reward, assuming all actions are taken according to the policy. However, because the value of a cumulative sum is typically highly correlated with the number of accumulators (especially when rewards are consistently non-negative or non-positive), the policy value tends to increase or decrease with the number of decision stages. This observation motivates us to consider a more principled evaluation, where the value of dynamic policies could be independent of the number of observed decision stages. Therefore, by treating the rewards observed at $\{T_{k}\}_{k\geq 1}$ as discrete observations from a continuous reward process potentially available throughout the entire follow-up period, we define an alternative value function for the dynamic policies based on the integration of the reward function over the full follow-up time. The value function, when derived from integrated reward, differs in magnitude, significance, and application contexts compared to one based on cumulative reward. This difference necessitates additional effort in the policy evaluation process. Our next key contribution is developing a novel RL framework that addresses both the state-action and observation processes. The complex and often unknown relationship between these two processes implies that the value of a dynamic policy, whether based on cumulative or integrated reward, may vary with the time $T_{k}$ at which the policy is evaluated. To address this complexity, we introduce a revised assumption of time-homogeneity and Markov properties for states, actions, and gap times. Under these assumptions, we demonstrate that the value function at $T_{k}=t$ can be independent of both $k$ and $t$, provided the current state and gap time are known. Furthermore, we validate the broad applicability of our assumptions using two data-generating schemes. These schemes show that our framework is closely related to Markov decision process models widely used in computer science and renewal process models developed in statistics. Finally, under the proposed RL framework and for each of the value functions, we develop two policy evaluation methods: one based on the classic Bellman equation and the other on a Bellman equation modulated by an observation process model. We demonstrate that the proposed observation process model is consistent with the revised time-homogeneity and Markov assumptions for states and gap times. The model also allows us to formalize the connection between the value function based on cumulative reward and that based on integrated reward, thus playing a crucial role in the policy evaluation procedure based on integrated reward.

The rest of the paper is organized as follows. In Section 2, we introduce the notations, value functions, assumptions about the data structure, and two data-generating schemes. Sections 3 and 4 present the off-policy evaluation methods for the value function based on cumulative and integrated reward, respectively. Section 5 conducts simulation studies to evaluate the empirical performance of the proposed methods. An application to the PD treatment dataset is given in Section 6. Section 7 concludes the paper with a discussion.

2 Markovian decision process with irregularly-spaced and outcome-dependent observation times

2.1 Notations and value functions

Consider a study where treatment decisions are made at irregularly-spaced time points $\{T_{0}=0,T_{1},T_{2},\ldots,T_{k},\ldots\}$. For each $k\geq 1$, let $X_{k}=T_{k}-T_{k-1}$ denote the gap time, and let $\bm{S}_{k}\in\mathbb{S}$ be a vector of state variables that summarizes the subject's information collected up to and including time $T_{k}$. At each $T_{k}$, investigators observe the subject's status $\bm{S}_{k}=S(T_{k})$ and the gap time $X_{k}=T_{k}-T_{k-1}$, and then take an action $A_{k}=A(T_{k})\in\mathbb{A}$ based on $\bm{S}_{k}$ and $X_{k}$. The reward of action $A_{k}$ is observed at $T_{k+1}$ and is thus denoted as $R(T_{k+1})$. For simplicity, we assume that the state space $\mathbb{S}$ is a subspace of $\mathbb{R}^{d}$, where $d$ is the dimension of the state vector, and that the action space $\mathbb{A}$ is a discrete space $\{0,1,\ldots,m-1\}$ with $m$ denoting the number of actions. We also assume that $\bm{S}_{0}$, $X_{0}$, and $A_{0}$ exist and are observable at $T_{0}$. An illustration of the data structure is given in Figure 1.

Figure 1: Basic structure of data.

Let $\pi(\cdot|\cdot,\cdot)$ denote a policy that maps the state and gap time $(\bm{s},x)\in\mathbb{S}\times[0,\infty]$ to a probability mass function $\pi(\cdot|\bm{s},x)$ on $\mathbb{A}$. Under such a policy $\pi$, a decision maker will take action $A_{k}=a$ with probability $\pi(a|\bm{S}_{k},X_{k})$ for $a\in\mathbb{A}$ and $k\geq 0$. Aiming at off-policy evaluation for any given dynamic policy $\pi$, we first define the corresponding value functions. Unlike the existing literature (Shi et al., 2021), we treat $\{T_{k}\}_{k\geq 0}$ as jump points of a counting process $N(t)$ and propose the following two value functions

$$V_{k}^{\pi}(\bm{s},x,t)=\operatorname{E}^{\pi}\left[\int_{t}^{\infty}\gamma^{u-t}R(u)\,\mathrm{d}N(u)\;\middle|\;\bm{S}_{t}=\bm{s},\,X_{t}=x,\,T_{k}=t\right]\quad(1)$$
$$V_{\mathcal{I},k}^{\pi}(\bm{s},x,t)=\operatorname{E}^{\pi}\left[\int_{t}^{\infty}\gamma^{u-t}R(u)\,\mathrm{d}u\;\middle|\;\bm{S}_{t}=\bm{s},\,X_{t}=x,\,T_{k}=t\right].\quad(2)$$

Here, $0<\gamma<1$ is a discount factor that balances the immediate and long-term effects of treatments; $R(t)$, an extension of the aforementioned notation $R(T_{k})$, is redefined as a reward process that is only observed at $\{T_{k}\}_{k\geq 1}$ but is potentially available at any time $t\geq 0$; and $\operatorname{E}^{\pi}$ denotes the expectation when all the actions follow policy $\pi$. By the definition of $N(t)$, $V_{k}^{\pi}(\bm{s},x,t)$ defined in (1) can be rewritten as an expectation of the discounted cumulative reward, given as

$$V^{\pi}_{k}(\bm{s},x,t)\equiv\operatorname{E}^{\pi}\left[\sum_{j\geq k+1}\gamma^{T_{j}-T_{k}}R(T_{j})\;\middle|\;\bm{S}_{k}=\bm{s},\,X_{k}=x,\,T_{k}=t\right].\quad(3)$$

Therefore, we refer to $V^{\pi}_{k}(\bm{s},x,t)$ as the value function based on cumulative reward and to $V^{\pi}_{\mathcal{I},k}(\bm{s},x,t)$ as the value function based on integrated reward. Similarly, we define $Q_{k}^{\pi}(\bm{s},x,a,t)=\operatorname{E}^{\pi}\left[\sum_{j\geq k+1}\gamma^{T_{j}-T_{k}}R(T_{j})\,\middle|\,\bm{S}_{k}=\bm{s},X_{k}=x,A_{k}=a,T_{k}=t\right]$ as the action-value function (or Q-function) based on cumulative reward, while the one based on integrated reward is defined as $Q_{\mathcal{I},k}^{\pi}(\bm{s},x,a,t)=\operatorname{E}^{\pi}\left[\int_{t}^{\infty}\gamma^{u-t}R(u)\,\mathrm{d}u\,\middle|\,\bm{S}_{k}=\bm{s},X_{k}=x,A_{k}=a,T_{k}=t\right]$.
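For intuition, the following sketch contrasts the two targets on a single observed trajectory using simple Monte Carlo plug-ins. The function names, array conventions, and the piecewise-constant carry-back of $R(T_j)$ over $(T_{j-1},T_j]$ are our own illustrative choices; the unobservability of $R(u)$ between visits is handled formally in Section 4.

```python
import numpy as np

def cumulative_value(times, rewards, gamma, t0=0.0):
    """Discounted cumulative reward sum_j gamma^(T_j - t0) R(T_j), cf. (3)."""
    times, rewards = np.asarray(times), np.asarray(rewards)
    keep = times > t0
    return np.sum(gamma ** (times[keep] - t0) * rewards[keep])

def integrated_value(times, rewards, gamma, t0=0.0):
    """Crude plug-in for int_{t0}^inf gamma^(u - t0) R(u) du, cf. (2), treating
    R(u) as constant and equal to R(T_j) on each interval (T_{j-1}, T_j]."""
    times, rewards = np.asarray(times), np.asarray(rewards)
    grid = np.concatenate(([t0], times))
    total = 0.0
    for lo, hi, r in zip(grid[:-1], grid[1:], rewards):
        # closed-form integral of gamma^(u - t0) over [lo, hi]
        total += r * (gamma ** (lo - t0) - gamma ** (hi - t0)) / (-np.log(gamma))
    return total

# toy trajectory observed at irregular visit times
T = [0.6, 1.9, 2.4, 4.0]
R = [1.0, 0.5, 0.8, 0.3]
print(cumulative_value(T, R, gamma=0.7), integrated_value(T, R, gamma=0.7))
```

As the sketch makes explicit, the cumulative target grows with the number of visits recorded in the trajectory, whereas the integrated target depends on how long each reward level persists over calendar time.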

Remark 1. Both value functions, whether based on cumulative or integrated reward, may depend not only on the current value of the state $\bm{S}_{k}=\bm{s}$, but also on the time $T_{k}=t$ at which the policy is evaluated. This differs from the value functions developed under regularly-spaced observations and introduces challenges to the policy evaluation process. To address this issue, we incorporate $X_{k}$ into the value functions and will demonstrate in subsequent sections that, under modified Markov and time-homogeneous assumptions, the value function at $T_{k}=t$ can be independent of $k$ and $t$, given the current state $\bm{S}_{k}$ and gap time $X_{k}$.

Remark 2. The two value functions differ in magnitude, interpretation, application scenarios, and estimation procedures. First, the value function based on cumulative reward, $V^{\pi}_{k}(\bm{s},x,t)$, can be highly correlated with the rate of the observation process $N(t)$. For example, assuming $R(t)$ is a non-negative function, a higher rate of $N(t)$ would yield more decision stages $T_{j}$, leading to a higher policy value. In contrast, $V^{\pi}_{\mathcal{I},k}(\bm{s},x,t)$, which integrates the reward function over the follow-up period, provides a more principled evaluation of the policy as it remains independent of the number of observed decision stages. As a result, the two value functions are suited to different application scenarios, and the choice of value function should depend on how the reward function is defined. Taking our motivating PD treatment study as an example, when the outcome of interest is whether a patient's probing pocket depths (PPD) are reduced after treatment, the reward $R(T_{k})$ is defined as an indicator of PPD reduction, making the value function based on cumulative reward preferable. Conversely, when the reward function $R(t)$ is defined as the PPD measurement at time $t$ for $t\in[0,\infty]$, policy evaluation should be independent of the measurement times, and the value function based on integrated reward provides a comprehensive assessment of the patient's oral health status. Finally, since $R(t)$ is unobservable when $t\neq T_{j}$, additional assumptions about $R(t)$ and $N(t)$ are required to ensure that the expectation of $\int_{t}^{\infty}\gamma^{u-t}R(u)\,\mathrm{d}u$ is estimable. This makes the evaluation process for $V^{\pi}_{\mathcal{I},k}(\bm{s},x,t)$ different from that for $V^{\pi}_{k}(\bm{s},x,t)$; further details are provided in Section 4.

2.2 Assumptions

Now, we posit assumptions on the states, actions, rewards, and gap times.

  • (A1)

    (Markov and time-homogeneous) For all $k\geq 0$, and for any measurable set $\mathcal{B}\subset\mathbb{S}\times\mathbb{R}^{+}$, with $\bm{S}_{k}$ and $X_{k}$ as the current state and gap time, respectively, the next state $\bm{S}_{k+1}$ and gap time $X_{k+1}$ satisfy

    $$P\left[\begin{pmatrix}\bm{S}_{k+1}\\ X_{k+1}\end{pmatrix}\in\mathcal{B}\;\middle|\;\{\bm{S}_{j}\}_{0\leq j\leq k},\{X_{j}\}_{0\leq j\leq k},\{A_{j}\}_{0\leq j\leq k},\{T_{j}\}_{0\leq j\leq k}\right]=\mathcal{P}(\mathcal{B};\bm{S}_{k},X_{k},A_{k})\,.$$

    Here, $\mathcal{P}$ denotes the joint distribution of the next state and next gap time conditional on the current state, action, and gap time.

  • (A2)

    (Conditional mean independence) For all $k\geq 0$ and for some bounded function $r$, the reward $R(T_{k+1})$ satisfies

    $$E\left[R(T_{k+1})\;\middle|\;\{R(T_{j}),\bm{S}_{j},X_{j},A_{j}\}_{1\leq j\leq k},\bm{S}_{k+1},X_{k+1}\right]=r(\bm{S}_{k},X_{k},A_{k},\bm{S}_{k+1},X_{k+1})\,.$$

Assumption A1 implies that, given the current state, action, and gap time, the joint distribution of $(\bm{S}_{k+1},X_{k+1})$ does not depend on the historical observations $\{\bm{S}_{j},X_{j},A_{j},T_{j}\}_{j<k}$ or the current observation time $T_{k}$. Instead of assuming the marginal distributions of the states and the gap times separately, we present assumption A1 based on the joint distribution of the next state and the next gap time. This indicates that not only the marginal distributions of $\bm{S}_{k+1}$ and $X_{k+1}$, but also the dependence between $\bm{S}_{k+1}$ and $X_{k+1}$, should be independent of the historical observations and the current observation time $T_{k}$, conditionally on $\{\bm{S}_{k},X_{k},A_{k}\}$. For instance, in the PD treatment study, the recall interval $X_{k+1}$ often correlates strongly with the PD status at the subsequent visit, $\bm{S}_{k+1}$. This correlation, along with the values of $\bm{S}_{k+1}$ and $X_{k+1}$, may be influenced by the current PD status of patients and the treatments provided by the dentists. However, it is reasonable to assume conditional independence of these factors from historical states, treatments, and recall intervals, given the latest observations $\{\bm{S}_{k},X_{k},A_{k}\}$.

By denoting $\widetilde{\bm{S}}_{k+1}=(\bm{S}_{k+1}^{\top},X_{k+1})^{\top}$ as an extended vector of state variables, it is not hard to see that A1 takes a similar form to the Markov and time-homogeneous assumption commonly adopted in the RL literature. However, in the following sections, we will demonstrate that the gap times $\{X_{k}\}_{k\geq 1}$ and the state variables $\{\bm{S}_{k}\}_{k\geq 1}$ play different roles in the estimation procedure and theoretical properties. This distinction explains why we do not simply include $X_{k+1}$ in the state vector.

2.3 Example schemes

For illustration, we present two data-generating schemes that satisfy assumption A1; note that A1 itself does not specify any structure on the relationship between $\bm{S}_{k+1}$ and $X_{k+1}$. The two schemes bridge the proposed framework with modulated renewal process models and Markov decision process models, and provide guidelines for constructing specific models for the states, actions, and gap times.

Scheme 1. Consider a stochastic process where, given $\bm{S}_{k}=\bm{s}$, $X_{k}=x$ and $A_{k}=a$, the next state $\bm{S}_{k+1}$ is generated as

$$\text{(S1-s)}\quad P\left(\bm{S}_{k+1}\in\mathcal{B}_{s}\,\middle|\,\bm{S}_{k}=\bm{s},X_{k}=x,A_{k}=a,T_{k},\{\bm{S}_{j},X_{j},A_{j},T_{j}\}_{j<k}\right)=\mathcal{P}_{S}(\mathcal{B}_{s};\bm{s},x,a)$$

for any measurable set $\mathcal{B}_{s}\subset\mathbb{S}$, while given the value of the next state $\bm{S}_{k+1}=\bm{s}^{\prime}$, the next gap time $X_{k+1}$ is generated as

$$\text{(S1-x)}\quad P\left(X_{k+1}\leq x^{\prime}\,\middle|\,\bm{S}_{k+1}=\bm{s}^{\prime},\bm{S}_{k}=\bm{s},X_{k}=x,A_{k}=a,T_{k},\{\bm{S}_{j},X_{j},A_{j},T_{j}\}_{j<k}\right)=\mathcal{P}_{X}(x^{\prime};\bm{s}^{\prime},\bm{s},x,a)$$

for any $x^{\prime}\in\mathbb{R}^{+}$. Here, $\mathcal{P}_{S}$ denotes the transition function of the next state conditional on the current state, action, and gap time, and $\mathcal{P}_{X}$ denotes the cumulative distribution function of the next gap time conditional on the current state, gap time, action, as well as the next state.

Remark 3. As a special case, when $\mathbb{S}$ consists of a finite or countable number of states and $\mathcal{P}_{S}(\mathcal{B}_{s};\bm{s},x,a)$ does not depend on the value of the current gap time $x$, (S1-s) and (S1-x) together form a time-homogeneous semi-Markov decision process (Bradtke and Duff, 1994; Parr, 1998; Du et al., 2020), where $\{\bm{S}_{k}\}_{k\geq 0}$ is the Markov chain and $X_{k+1}$ is the sojourn time in the $k$th state $\bm{S}_{k}$.

Remark 4. Considering that $\{X_{k}\}_{k\geq 1}$ are the gap times of a counting process, we can restate the generation procedure in model (S1-x) using the notation of counting processes. Let $\{N(t)\}_{t\geq 0}$ denote a right-continuous counting process with jumps at $\{T_{k}\}_{k\geq 1}$. Then, we have $T_{N(t)}=T_{k}$, $X_{N(t)}=X_{k}$, $A_{N(t)}=A_{k}$, $\bm{S}_{N(t)}=\bm{S}_{k}$, and $\bm{S}_{N(t)+1}=\bm{S}_{k+1}$ for all $T_{k}\leq t<T_{k+1}$. According to (S1-s), the distribution/generation of the next state $\bm{S}_{N(t)+1}$ is fully determined by $\{\bm{S}_{N(t)},X_{N(t)},A_{N(t)}\}$, and therefore could be assumed available at time $t$. Let $\mathcal{F}(t)=\left\{\{N(u)\}_{u\leq t},\{\bm{S}_{N(u)}\}_{u\leq t},\{X_{N(u)}\}_{u\leq t},\{A_{N(u)}\}_{u\leq t},\bm{S}_{N(t)+1}\right\}$ denote all information available at $t$. It is not hard to observe that if $N(t)$ satisfies

$$\text{(S1-x$'$)}\qquad E\left[\mathrm{d}N(t)\,\middle|\,\mathcal{F}(t-)\right]=\lambda\{t-T_{N(t-)};\bm{S}_{N(t-)+1},\bm{S}_{N(t-)},X_{N(t-)},A_{N(t-)}\}\,\mathrm{d}t=\lambda\{t-T_{k};\bm{S}_{k+1},\bm{S}_{k},X_{k},A_{k}\}\,\mathrm{d}t\,,\quad\text{for }k=\max\{j;T_{j}<t\}$$

for some intensity function $\lambda$, then for each $k\geq 0$, the gap time $X_{k+1}$ with hazard function $h(x)=\lambda\{x;\bm{S}_{k+1},\bm{S}_{k},X_{k},A_{k}\}$ satisfies model (S1-x) too. On the other hand, if the distribution function $\mathcal{P}_{X}$ in model (S1-x) is absolutely continuous with respect to the Lebesgue measure, then the counting process defined by $N(t)=\sum_{j\geq 1}I(T_{j}\leq t)$ also satisfies (S1-x$'$). As a matter of fact, by viewing $\{\bm{S}_{k},A_{k},\bm{S}_{k+1}\}_{k\geq 0}$ as the covariates available at $\{T_{k}\}_{k\geq 0}$, (S1-x$'$) takes a similar form to the modulated renewal process model. The assumption is natural for user-initiated visits and has been popularly adopted in the statistical literature (Cox, 1972; Lin et al., 2013; Chen et al., 2018).
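To make Scheme 1 concrete, the sketch below draws a single transition $(\bm{S}_{k+1},X_{k+1})$, assuming a toy linear model for (S1-s) and a constant baseline intensity in (S1-x$'$) so that the next gap time is exponential given the covariates; the model forms and the coefficient vector `beta` are illustrative assumptions, not the paper's specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def step_scheme1(s, x, a, beta, lam0=1.0):
    """One Scheme-1 transition: draw S_{k+1} from a toy version of (S1-s),
    then X_{k+1} from (S1-x') with constant baseline intensity lam0,
    i.e. an exponential gap time with rate lam0 * exp(beta' z)."""
    s_next = 0.75 * (2 * a - 1) * s + rng.normal(0.0, 0.5)   # toy (S1-s)
    z = np.array([s_next, s, x, a])                          # covariates (S_{k+1}, S_k, X_k, A_k)
    rate = lam0 * np.exp(beta @ z)                           # hazard of the next gap time
    x_next = rng.exponential(1.0 / rate)
    return s_next, x_next

s_next, x_next = step_scheme1(s=0.3, x=1.2, a=1, beta=np.array([0.5, -1.0, 0.5, 1.0]))
```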

Scheme 2. As an alternative, Scheme 2 generates the gap time first and then determines the value of the next state. Specifically, assume that the next gap time $X_{k+1}$ satisfies

$$\text{(S2-x)}\quad P\left(X_{k+1}\leq x^{\prime}\,\middle|\,\bm{S}_{k}=\bm{s},X_{k}=x,A_{k}=a,T_{k},\{\bm{S}_{j},X_{j},A_{j},T_{j}\}_{j<k}\right)=\widetilde{\mathcal{P}}_{X}(x^{\prime};\bm{s},x,a)$$

for $x^{\prime}\in\mathbb{R}^{+}$. Then, given the value $X_{k+1}=x^{\prime}$, the next state $\bm{S}_{k+1}$ satisfies

$$\text{(S2-s)}\quad P\left(\bm{S}_{k+1}\in\mathcal{B}_{s}\,\middle|\,X_{k+1}=x^{\prime},\bm{S}_{k}=\bm{s},X_{k}=x,A_{k}=a,T_{k},\{\bm{S}_{j},X_{j},A_{j},T_{j}\}_{j<k}\right)=\widetilde{\mathcal{P}}_{S}(\mathcal{B}_{s};x^{\prime},\bm{s},x,a)\,,$$

for any measurable set $\mathcal{B}_{s}\subset\mathbb{S}$. Here, $\widetilde{\mathcal{P}}_{X}$ denotes the cumulative distribution function of the next gap time conditional on the current state, gap time, and action, while $\widetilde{\mathcal{P}}_{S}$ denotes the transition function of the next state conditional on the current state, action, gap time, as well as the next gap time. Let $\mathcal{\widetilde{F}}(t)=\left\{\{N(u)\}_{u\leq t},\{\bm{S}_{N(u)}\}_{u\leq t},\{X_{N(u)}\}_{u\leq t},\{A_{N(u)}\}_{u\leq t}\right\}$. With arguments similar to Remark 4, it can be shown that (S2-x) is equivalent to

$$\text{(S2-x$'$)}\quad E\left[\mathrm{d}N(t)\,\middle|\,\mathcal{\widetilde{F}}(t-)\right]=\lambda(t-T_{N(t-)};\bm{S}_{N(t-)},X_{N(t-)},A_{N(t-)})\,\mathrm{d}t=\lambda(t-T_{k};\bm{S}_{k},X_{k},A_{k})\,\mathrm{d}t\,,$$

for $k=\max\{j;T_{j}<t\}$, and takes the form of a modulated renewal process.

3 Policy evaluation based on cumulative reward

In the following subsections, we initially propose a policy evaluation procedure, requiring only assumptions A1 and A2. Then, under a counting process model for the observation process, we develop an alternative policy evaluation method. The asymptotic properties of all the proposed estimates are given in the last subsection.

3.1 Utilizing the classic Bellman equation

To estimate $V^{\pi}_{k}(\bm{s},x,t)$ and $Q^{\pi}_{k}(\bm{s},x,a,t)$, we first present the following two lemmas, whose proofs are available in Section A.1 of the Supplementary Material.

Lemma 1. Under assumptions A1 and A2, the action-value function based on cumulative reward $Q^{\pi}_{k}(\bm{s},x,a,t)$ exists and satisfies $Q^{\pi}_{k}(\bm{s},x,a,t)=Q^{\pi}_{0}(\bm{s},x,a,0)$ for all $k\geq 0$, $t\geq 0$, $\bm{s}\in\mathbb{S}$, $x\in\mathbb{R}^{+}$ and $a\in\mathbb{A}$.

By Lemma 1, the value of $Q^{\pi}_{k}(\bm{s},x,a,t)$ is time-homogeneous, meaning the expected value of a policy $\pi$ does not depend on the number of past observations $k$ or the current observation time $t$. Therefore, we can omit both $k$ and $t$, and denote the action-value function simply as $Q^{\pi}(\bm{s},x,a)$ for brevity. Similarly, the state-value function of policy $\pi$ can be presented as $V^{\pi}(\bm{s},x)=\sum_{a\in\mathbb{A}}Q^{\pi}(\bm{s},x,a)\pi(a|\bm{s},x)$, and the value function of policy $\pi$ under some reference distribution $\mathcal{G}$ can be presented as $V^{\pi}(\mathcal{G})=\int\sum_{a\in\mathbb{A}}Q^{\pi}(\bm{s},x,a)\pi(a|\bm{s},x)\,\mathrm{d}\mathcal{G}(\bm{s},x)$, both of which are independent of $k$ and $t$. Furthermore, we derive a Bellman equation as formulated in Lemma 2.

Lemma 2. Under assumptions A1 and A2, the action-value function based on cumulative reward $Q^{\pi}(\bm{s},x,a)$ satisfies a Bellman equation

$$Q^{\pi}(\bm{S}_{k},X_{k},A_{k})=E\left[\gamma^{X_{k+1}}R(T_{k+1})+\gamma^{X_{k+1}}\sum_{a\in\mathbb{A}}Q^{\pi}(\bm{S}_{k+1},X_{k+1},a)\pi(a|\bm{S}_{k+1},X_{k+1})\;\middle|\;\bm{S}_{k},X_{k},A_{k}\right]$$

for all $k\geq 0$.

To solve the Bellman equation, let $Q(\bm{s},x,a;\bm{\theta})$ denote a model for $Q^{\pi}(\bm{s},x,a)$, indexed by $\bm{\theta}\in\Theta$, which satisfies $Q(\bm{s},x,a;\bm{\theta}_{\pi})=Q^{\pi}(\bm{s},x,a)$ for some $\bm{\theta}_{\pi}\in\Theta$. Define

$$\delta_{k+1}(\bm{\theta})=\gamma^{X_{k+1}}R(T_{k+1})+\gamma^{X_{k+1}}\sum_{a\in\mathbb{A}}Q(\bm{S}_{k+1},X_{k+1},a;\bm{\theta})\pi(a|\bm{S}_{k+1},X_{k+1})-Q(\bm{S}_{k},X_{k},A_{k};\bm{\theta})\,.$$

According to Lemma 2, we have $E[\delta_{k+1}(\bm{\theta}_{\pi})|\bm{S}_{k},X_{k},A_{k}]=0$ for all $k\geq 0$. Assume the map $\bm{\theta}\rightarrow Q(\bm{s},x,a;\bm{\theta})$ is differentiable everywhere for each fixed $\bm{s}$, $x$ and $a$, and let $\bm{\dot{Q}}_{\bm{\theta}}$ denote the gradient of $Q$ with respect to $\bm{\theta}$. Then, we construct an estimating function for $\bm{\theta}_{\pi}$ as

$$\bm{U}(\bm{\theta})=\frac{1}{\sum_{i}K_{i}}\sum_{i=1}^{n}\sum_{k=0}^{K_{i}-1}\bm{\dot{Q}}_{\bm{\theta}}(\bm{S}_{i,k},X_{i,k},A_{i,k};\bm{\theta})\,\delta_{i,k+1}(\bm{\theta})\,,\quad(4)$$

where $\{X_{i,k},\bm{S}_{i,k},A_{i,k},R_{i}(T_{i,k+1});1\leq k\leq K_{i},1\leq i\leq n\}$ are $n$ independent and identically distributed trajectories of $\{\bm{S}_{k},X_{k},A_{k},R(T_{k+1})\}_{k\geq 0}$, $K_{i}\in\mathbb{Z}_{+}$ denotes the total number of decisions observed on the $i$th sequence, and $\delta_{i,k+1}(\bm{\theta})=\gamma^{X_{i,k+1}}R_{i}(T_{i,k+1})+\gamma^{X_{i,k+1}}\sum_{a\in\mathbb{A}}Q(\bm{S}_{i,k+1},X_{i,k+1},a;\bm{\theta})\pi(a|\bm{S}_{i,k+1},X_{i,k+1})-Q(\bm{S}_{i,k},X_{i,k},A_{i,k};\bm{\theta})$.

Generally, (4) can be solved by numerical methods, but in certain cases it may have a closed form. For example, consider a B-spline model $Q(\bm{s},x,a;\bm{\theta})=\bm{\xi}(\bm{s},x,a)^{\top}\bm{\theta}$, where $\bm{\xi}(\bm{s},x,a)=\{\Phi_{L}^{\top}(\bm{s},x)\mathbb{I}(a=0),\Phi_{L}^{\top}(\bm{s},x)\mathbb{I}(a=1),\ldots,\Phi_{L}^{\top}(\bm{s},x)\mathbb{I}(a=m-1)\}^{\top}$, $\bm{\theta}=(\bm{\theta}_{0}^{\top},\bm{\theta}_{1}^{\top},\cdots,\bm{\theta}_{m-1}^{\top})^{\top}$, $\Phi_{L}(\cdot)=\{\Phi_{L,1}(\cdot),\cdots,\Phi_{L,L}(\cdot)\}^{\top}$ is a vector consisting of $L$ B-spline basis functions, and $\bm{\theta}_{a}$ is an $L$-dimensional parameter vector for each $a\in\mathbb{A}$. Denote $\bm{\xi}_{i,k}=\bm{\xi}(\bm{S}_{i,k},X_{i,k},A_{i,k})$ and $\bm{\zeta}_{\pi,i,k+1}=\sum_{a\in\mathbb{A}}\bm{\xi}(\bm{S}_{i,k+1},X_{i,k+1},a)\pi(a|\bm{S}_{i,k+1},X_{i,k+1})$. Then, the estimating equation takes the form $\bm{U}(\bm{\theta})=\frac{1}{\sum_{i}K_{i}}\sum_{i=1}^{n}\sum_{k=0}^{K_{i}-1}\bm{\xi}_{i,k}\big\{\gamma^{X_{i,k+1}}R_{i}(T_{i,k+1})+(\gamma^{X_{i,k+1}}\bm{\zeta}_{\pi,i,k+1}-\bm{\xi}_{i,k})^{\top}\bm{\theta}\big\}=0$, and can be solved as

$$\widehat{\bm{\theta}}_{\pi}=\biggl\{\frac{1}{\sum_{i}K_{i}}\sum_{i=1}^{n}\sum_{k=0}^{K_{i}-1}\bm{\xi}_{i,k}(\bm{\xi}_{i,k}-\gamma^{X_{i,k+1}}\bm{\zeta}_{\pi,i,k+1})^{\top}\biggr\}^{-1}\biggl\{\frac{1}{\sum_{i}K_{i}}\sum_{i=1}^{n}\sum_{k=0}^{K_{i}-1}\bm{\xi}_{i,k}\gamma^{X_{i,k+1}}R_{i}(T_{i,k+1})\biggr\}\,.$$

Therefore, we propose to estimate $Q^{\pi}(\bm{s},x,a)$ by $\widehat{Q}^{\pi}(\bm{s},x,a)=Q(\bm{s},x,a;\widehat{\bm{\theta}}_{\pi})=\bm{\xi}(\bm{s},x,a)^{\top}\widehat{\bm{\theta}}_{\pi}$, the state-value function $V^{\pi}(\bm{s},x)$ by $\widehat{V}^{\pi}(\bm{s},x)=\sum_{a\in\mathbb{A}}\pi(a|\bm{s},x)\bm{\xi}(\bm{s},x,a)^{\top}\widehat{\bm{\theta}}_{\pi}$, and $V^{\pi}(\mathcal{G})$ by $\widehat{V}^{\pi}(\mathcal{G})=\bm{\zeta}_{\pi,\mathcal{G}}^{\top}\widehat{\bm{\theta}}_{\pi}$, where $\bm{\zeta}_{\pi,\mathcal{G}}=\sum_{a\in\mathbb{A}}\int\pi(a|\bm{s},x)\bm{\xi}(\bm{s},x,a)\,\mathrm{d}\mathcal{G}(\bm{s},x)$.
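The closed form above amounts to a single linear solve once all transitions are stacked; a minimal numpy sketch (the array names and the stacking convention are ours) is:

```python
import numpy as np

def fit_theta_hat(xi, zeta_pi, gaps, rewards, gamma=0.7):
    """Closed-form solution of U(theta) = 0 in (4) for the linear Q-model
    Q(s, x, a; theta) = xi(s, x, a)' theta.
    xi[t]      : basis vector xi_{i,k} for transition t,
    zeta_pi[t] : sum_a xi(S_{i,k+1}, X_{i,k+1}, a) pi(a | S_{i,k+1}, X_{i,k+1}),
    gaps[t]    : X_{i,k+1},   rewards[t] : R_i(T_{i,k+1}),
    with all (i, k) transitions stacked along the first axis."""
    disc = (gamma ** np.asarray(gaps))[:, None]               # gamma^{X_{i,k+1}}
    A = xi.T @ (xi - disc * zeta_pi) / len(xi)                # left-hand matrix
    b = xi.T @ (disc.ravel() * np.asarray(rewards)) / len(xi) # right-hand side
    return np.linalg.solve(A, b)                              # theta_hat_pi
```

The same routine yields $\widehat{V}^{\pi}(\mathcal{G})$ by taking the inner product of the returned coefficient vector with $\bm{\zeta}_{\pi,\mathcal{G}}$.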

3.2 Utilizing the modulated Bellman equation via an observation process model

In this subsection, we further propose an alternative policy evaluation procedure that relies on a specified model for the observation process. Here, the motivation for modeling the observation process arises from two aspects. First, as demonstrated in Subsection 2.3, under the revised time homogeneity and Markov assumptions for states and gap times, the irregularly spaced and outcome-dependent observation times can be represented as the occurrence times of a renewal process modulated by the state and action sequences (Cox, 1972). Therefore, positing a modulated renewal process model on the observation process is consistent with the structure of the data and helps us make more accurate inferences. Second, by employing an intensity model for the observation process, the integration of rewards can be transformed into a cumulative form. This enables us to leverage policy evaluation techniques based on cumulative reward for estimating the value function based on integrated reward. More details are given in Section 4.

For brevity, we focus on Scheme 1 given in Section 2.3; similar derivations can also be applied to Scheme 2. The modification begins with the Bellman equation outlined in Lemma 2. Under assumption (S1-x), we can express the equation as

$$\begin{aligned}
Q^{\pi}(\bm{S}_{k},X_{k},A_{k})=&\ E\left[\gamma^{X_{k+1}}R(T_{k+1})+E\Big[\gamma^{X_{k+1}}\sum_{a\in\mathbb{A}}\pi(a|\bm{S}_{k+1},X_{k+1})Q^{\pi}(\bm{S}_{k+1},X_{k+1},a)\,\Big|\,\bm{S}_{k+1},\bm{S}_{k},X_{k},A_{k}\Big]\,\bigg|\,\bm{S}_{k},X_{k},A_{k}\right]\\
=&\ E\left[\gamma^{X_{k+1}}R(T_{k+1})+\int\gamma^{x^{\prime}}\sum_{a\in\mathbb{A}}\pi(a|\bm{S}_{k+1},x^{\prime})Q^{\pi}(\bm{S}_{k+1},x^{\prime},a)\,\mathrm{d}\mathcal{P}_{X}(x^{\prime};\bm{S}_{k+1},\bm{S}_{k},X_{k},A_{k})\,\bigg|\,\bm{S}_{k},X_{k},A_{k}\right].
\end{aligned}$$

Let $\widehat{\mathcal{P}}_{X}$ denote an estimate of the distribution function $\mathcal{P}_{X}$. Similar to (4), we construct $\widetilde{\bm{U}}(\bm{\theta})=\frac{1}{\sum_{i}K_{i}}\sum_{i=1}^{n}\sum_{k=0}^{K_{i}-1}\bm{\dot{Q}}_{\bm{\theta}}(\bm{S}_{i,k},X_{i,k},A_{i,k};\bm{\theta})\widetilde{\delta}_{i,k+1}(\bm{\theta})$, with $\widetilde{\delta}_{i,k+1}(\bm{\theta})=\gamma^{X_{i,k+1}}R_{i}(T_{i,k+1})+\int\gamma^{x^{\prime}}\sum_{a\in\mathbb{A}}\pi(a|\bm{S}_{i,k+1},x^{\prime})Q(\bm{S}_{i,k+1},x^{\prime},a;\bm{\theta})\,\mathrm{d}\widehat{\mathcal{P}}_{X}(x^{\prime};\bm{S}_{i,k+1},\bm{S}_{i,k},X_{i,k},A_{i,k})-Q(\bm{S}_{i,k},X_{i,k},A_{i,k};\bm{\theta})$. Further approximating the action-value function with B-splines, $Q(\bm{s},x,a;\bm{\theta})=\bm{\xi}(\bm{s},x,a)^{\top}\bm{\theta}$, the solution to $\widetilde{\bm{U}}(\bm{\theta})=0$ takes the form

$$\widetilde{\bm{\theta}}_{\pi}=\biggl\{\frac{1}{\sum_{i}K_{i}}\sum_{i=1}^{n}\sum_{k=0}^{K_{i}-1}\bm{\xi}_{i,k}(\bm{\xi}_{i,k}-\widetilde{\mathcal{U}}_{\pi,i,k+1})^{\top}\biggr\}^{-1}\biggl\{\frac{1}{\sum_{i}K_{i}}\sum_{i=1}^{n}\sum_{k=0}^{K_{i}-1}\bm{\xi}_{i,k}\gamma^{X_{i,k+1}}R_{i}(T_{i,k+1})\biggr\}\quad(5)$$

with $\widetilde{\mathcal{U}}_{\pi,i,k+1}=\int\gamma^{x^{\prime}}\sum_{a\in\mathbb{A}}\pi(a|\bm{S}_{i,k+1},x^{\prime})\bm{\xi}(\bm{S}_{i,k+1},x^{\prime},a)\,\mathrm{d}\widehat{\mathcal{P}}_{X}(x^{\prime};\bm{S}_{i,k+1},\bm{S}_{i,k},X_{i,k},A_{i,k})$.

Then, the estimator for $Q^{\pi}(\bm{s},x,a)$ is $\widetilde{Q}^{\pi}(\bm{s},x,a)=\bm{\xi}(\bm{s},x,a)^{\top}\widetilde{\bm{\theta}}_{\pi}$, the estimator for the state-value function $V^{\pi}(\bm{s},x)$ is $\widetilde{V}^{\pi}(\bm{s},x)=\sum_{a\in\mathbb{A}}\pi(a|\bm{s},x)\bm{\xi}(\bm{s},x,a)^{\top}\widetilde{\bm{\theta}}_{\pi}$, and the estimator for the value function $V^{\pi}(\mathcal{G})$ under a given reference distribution $\mathcal{G}$ is $\widetilde{V}^{\pi}(\mathcal{G})=\bm{\zeta}_{\pi,\mathcal{G}}^{\top}\widetilde{\bm{\theta}}_{\pi}$. Here, extra effort is required in estimating $\mathcal{P}_{X}$, the distribution of the next gap time $X_{k+1}$ conditional on $\bm{S}_{k+1},\bm{S}_{k},X_{k}$ and $A_{k}$. As an illustration, we propose a Cox-type counting process model that satisfies (S1-x$'$) and takes the form

$$E[\mathrm{d}N(t)\,|\,\mathcal{F}(t-)]=\lambda(t-T_{k};\bm{S}_{k+1},\bm{S}_{k},X_{k},A_{k})\,\mathrm{d}t=\lambda_{0}(t-T_{k})\exp(\bm{\beta}_{0}^{\top}\bm{Z}_{k})\,\mathrm{d}t\,.\quad(6)$$

Here, $k=\max\{j;T_{j}<t\}$, $\bm{Z}_{k}=\bm{\phi}(\bm{S}_{k+1},\bm{S}_{k},X_{k},A_{k})\in\mathbb{R}^{q}$ for some pre-specified $q$-dimensional function $\bm{\phi}$, $\bm{\beta}_{0}$ is a vector of unknown parameters, and $\lambda_{0}$ is an unspecified baseline intensity function. Let $\{N_{i}(t)\}_{1\leq i\leq n}$ denote $n$ independent and identically distributed trajectories of $N(t)$. For $i=1,\ldots,n$ and $k=1,\ldots,K_{i}$, define $N_{i,k}(x)=N_{i}(T_{i,k-1}+x)-N_{i}(T_{i,k-1})$ and $\bm{Z}_{i,k}=\bm{\phi}(\bm{S}_{i,k+1},\bm{S}_{i,k},X_{i,k},A_{i,k})$. Then, we propose to estimate $\bm{\beta}_{0}$ by solving the equation

$$\bm{U}_{z}(\bm{\beta})=\frac{1}{\sum_{i}K_{i}}\sum_{i=1}^{n}\sum_{k=0}^{K_{i}-1}\int_{0}^{\tau}\{\bm{Z}_{i,k}-\overline{\bm{Z}}(x;\bm{\beta})\}\,\mathrm{d}N_{i,k+1}(x)=0\,,\quad(7)$$

where $\overline{\bm{Z}}(x;\bm{\beta})=\{\sum_{i,k}I(X_{i,k+1}\geq x)\exp(\bm{\beta}^{\top}\bm{Z}_{i,k})\}^{-1}\{\sum_{i,k}I(X_{i,k+1}\geq x)\exp(\bm{\beta}^{\top}\bm{Z}_{i,k})\bm{Z}_{i,k}\}$, and $\tau$ is a positive constant satisfying $P(X_{k}\geq\tau)>0$ for each $k$. Let $\widehat{\bm{\beta}}$ denote the resulting estimator. Then, we can further estimate $\Lambda_{0}(x)=\int_{0}^{x}\lambda_{0}(u)\,\mathrm{d}u$ by $\widehat{\Lambda}_{0}(x)=\int_{0}^{x}\{\sum_{i}\sum_{k}I(X_{i,k+1}\geq u)\exp(\widehat{\bm{\beta}}^{\top}\bm{Z}_{i,k})\}^{-1}\{\sum_{i}\sum_{k}\mathrm{d}N_{i,k+1}(u)\}$, and can estimate $\mathcal{P}_{X}$ by $\widehat{\mathcal{P}}_{X}(x^{\prime};\bm{s}^{\prime},\bm{s},x,a)=1-\exp\{-\widehat{\Lambda}_{0}(x^{\prime})\exp(\widehat{\bm{\beta}}^{\top}\bm{z})\}$ for $\bm{z}=\bm{\phi}(\bm{s}^{\prime},\bm{s},x,a)$.
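For concreteness, the sketch below implements the Breslow-type step and the plug-in $\widehat{\mathcal{P}}_{X}$, given an estimate $\widehat{\bm{\beta}}$ (obtained, e.g., by solving (7) with a standard Cox partial-likelihood routine). It is a simplified sketch: every stacked gap time is treated as an observed event, and ties and the truncation at $\tau$ are ignored.

```python
import numpy as np

def breslow_PX(gaps_next, Z, beta_hat, x_grid):
    """Breslow-type estimate of Lambda_0 from the stacked gap times X_{i,k+1}
    with covariates Z_{i,k}, plus the implied plug-in
    P_X_hat(x'; s', s, x, a) = 1 - exp{-Lambda_0_hat(x') * exp(beta_hat' z)}."""
    gaps_next, Z = np.asarray(gaps_next), np.asarray(Z)
    risk = np.exp(Z @ beta_hat)                      # exp(beta' Z_{i,k})
    order = np.argsort(gaps_next)
    t_sorted, risk_sorted = gaps_next[order], risk[order]
    at_risk = np.cumsum(risk_sorted[::-1])[::-1]     # sum over the risk set {X_{i,k+1} >= t}
    jumps = 1.0 / at_risk                            # increment of Lambda_0 at each gap time
    Lambda0 = np.array([jumps[t_sorted <= x].sum() for x in x_grid])

    def P_X_hat(x_prime, z):                         # plug-in conditional CDF of X_{k+1}
        L = np.interp(x_prime, x_grid, Lambda0)
        return 1.0 - np.exp(-L * np.exp(beta_hat @ z))
    return t_sorted, jumps, Lambda0, P_X_hat
```

The returned jump locations and sizes are reused in Section 4 when the increments of $\widehat{\Lambda}_{0}$ are kernel-smoothed into $\widehat{\lambda}_{0}$.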

3.3 Asymptotic properties

We first introduce additional notation and impose necessary assumptions. Denote $r(\bm{s},x,a)=E[\gamma^{X_{k+1}}R(T_{k+1})|\bm{S}_{k}=\bm{s},X_{k}=x,A_{k}=a]$. Let $p(\bm{s}^{\prime},x^{\prime};\bm{s},x,a)$ be the conditional density function satisfying $\mathcal{P}(\mathcal{B}|\bm{s},x,a)=\int_{\mathcal{B}}p(\bm{s}^{\prime},x^{\prime};\bm{s},x,a)\,\mathrm{d}\bm{s}^{\prime}\mathrm{d}x^{\prime}$ for the joint distribution $\mathcal{P}(\cdot|\bm{s},x,a)$ defined in (A1). Then, (A3) is a regularity condition ensuring that the action-value function $Q^{\pi}(\bm{s},x,a)$ is continuous (Shi et al., 2021, Lemma 1).

  • (A3)

    There exists some $p_{0}>0$ such that $r(\cdot,\cdot,a)$ and $p(\cdot,\cdot;\bm{s},x,a)$ are Hölder continuous with exponent $p_{0}$ with respect to $(\bm{s},x)\in\mathbb{S}\times\mathbb{R}^{+}$.

As discussed in Section 2.2, the transition trajectories of states and gap times $\{\widetilde{\bm{S}}_{k}=(\bm{S}_{k}^{\top},X_{k})^{\top}\}_{k\geq 0}$ form a time-homogeneous Markov chain under assumption (A1). Suppose $\{(\bm{S}_{k}^{\top},X_{k})^{\top}\}_{k\geq 0}$ has a unique invariant joint distribution with density function $\mu(\cdot,\cdot)$ under some behavior policy $b$. Similarly, when the reference distribution $\mathcal{G}$ is absolutely continuous with respect to the Lebesgue measure, there exists a density function $v_{0}(\bm{s},x)$ satisfying $\mathrm{d}\mathcal{G}(\bm{s},x)=v_{0}(\bm{s},x)\,\mathrm{d}\bm{s}\,\mathrm{d}x$, which is also a joint distribution for the states and gap time. We posit the following assumptions on $\mu$ and $v_{0}$.

  • (A4)

    $\mu(\cdot,\cdot)$ and $v_{0}(\cdot,\cdot)$ are uniformly bounded away from $0$ and $\infty$.

Recall that $\widehat{\bm{\theta}}_{\pi}$ and $\widetilde{\bm{\theta}}_{\pi}$ are obtained by solving $\bm{U}(\bm{\theta})=0$ and $\widetilde{\bm{U}}(\bm{\theta})=0$, respectively. Define $\mathcal{U}_{\pi,i,k+1}=\int\gamma^{x^{\prime}}\sum_{a\in\mathbb{A}}\pi(a|\bm{S}_{i,k+1},x^{\prime})\bm{\xi}(\bm{S}_{i,k+1},x^{\prime},a)\,\mathrm{d}\mathcal{P}_{X}(x^{\prime};\bm{S}_{i,k+1},\bm{S}_{i,k},X_{i,k},A_{i,k})$, $\bar{\bm{\zeta}}_{\pi}(\bm{s},x,a)=E[\bm{\zeta}_{\pi,i,k+1}|\bm{S}_{i,k}=\bm{s},X_{i,k}=x,A_{i,k}=a]$, and $\bar{\bm{u}}_{\pi}(\bm{s},x,a)=E[\mathcal{U}_{\pi,i,k+1}|\bm{S}_{i,k}=\bm{s},X_{i,k}=x,A_{i,k}=a]$.

The following assumption ensures the existence of $\widehat{\bm{\theta}}_{\pi}$ and $\widetilde{\bm{\theta}}_{\pi}$.

  • (A5)

    There exists a constant $\bar{c}$ such that $\lambda_{\min}\big(\sum_{k=0}^{K-1}E[\bm{\xi}_{i,k}^{\otimes 2}-\gamma^{2X_{i,k+1}}\bar{\bm{\zeta}}_{\pi}(\bm{S}_{i,k},X_{i,k},A_{i,k})^{\otimes 2}]\big)\geq K\bar{c}$ and $\lambda_{\min}\big(\sum_{k=0}^{K-1}E[\bm{\xi}_{i,k}^{\otimes 2}-\bar{\bm{u}}_{\pi}(\bm{S}_{i,k},X_{i,k},A_{i,k})^{\otimes 2}]\big)\geq K\bar{c}$, where $\lambda_{\min}(\cdot)$ denotes the minimum eigenvalue of a matrix.

The following assumption is required for constructing asymptotic approximations when $K_{i}\rightarrow\infty$. Specifically, the geometric ergodicity in A6(i) is part of the sufficient conditions for the Markov chain central limit theorem to hold, and A6(ii) guarantees the validity of estimating equation (7) under the renewal process model (6).

  • (A6)

    When $K_{i}\rightarrow\infty$, (i) the Markov chain $\{(\bm{S}_{i,k}^{\top},X_{i,k})^{\top}\}_{k\geq 0}$ is geometrically ergodic, and (ii) there exists a probability measure $v(\cdot)$ such that for any measurable set $B_{z}$ on $\mathbb{R}^{q}$, $\{\bm{Z}_{i,k}\}_{k\geq 0}$ satisfies $\frac{1}{K_{i}}\sum_{k=0}^{K_{i}-1}I(\bm{Z}_{i,k}\in B_{z})\stackrel{p}{\rightarrow}v(B_{z})$.

Now, we provide bidirectional asymptotic properties for the estimated values $\widehat{V}^{\pi}(\mathcal{G})$ and $\widetilde{V}^{\pi}(\mathcal{G})$. The properties of the estimated state-value functions follow directly by setting $\mathcal{G}$ to be a Dirac delta function at a fixed $(\bm{s},x)\in\mathbb{S}\times\mathbb{R}^{+}$. For any $\bm{s}\in\mathbb{S}$, $x\in\mathbb{R}^{+}$, $a\in\mathbb{A}$, define $\omega_{\pi}(\bm{s},x,a)=E[\{\gamma^{X_{1}}R(T_{1})+\gamma^{X_{1}}\sum_{a^{\prime}\in\mathbb{A}}\pi(a^{\prime}|\bm{S}_{1},X_{1})Q^{\pi}(\bm{S}_{1},X_{1},a^{\prime})-Q^{\pi}(\bm{S}_{0},X_{0},A_{0})\}^{2}|\bm{S}_{0}=\bm{s},X_{0}=x,A_{0}=a]$.

Theorem 1.

Assume assumptions A1-A6 hold. Suppose $L=o\{n_{K}^{1/2}/\log(n_{K})\}$ and $L^{\frac{2p}{d+1}}\gg n_{K}(1+\|\bm{\zeta}_{\pi,\mathcal{G}}\|_{2}^{-2})$ for $n_{K}=\sum_{i=1}^{n}K_{i}$, and there exists some constant $c_{R}\geq 1$ such that $P(\max_{k}|R(T_{k})|\leq c_{R})=1$ and $\omega_{\pi}(\bm{s},x,a)\geq c_{R}^{-1}$ for any $\bm{s}\in\mathbb{S}$, $x\in\mathbb{R}^{+}$, $a\in\mathbb{A}$. Then, as $n_{K}\rightarrow\infty$, we have $n_{K}^{1/2}\sigma_{\pi,\mathcal{G},1}^{-1}\{\widehat{V}^{\pi}(\mathcal{G})-V^{\pi}(\mathcal{G})\}\stackrel{d}{\rightarrow}N(0,1)$, where $\sigma_{\pi,\mathcal{G},1}$ is given in Subsection A.2 of the Supplementary Material.

Theorem 2.

Assume assumptions A1-A6, conditions C1-C6, and model (6) hold. Suppose $L=o\{n_{K}^{1/2}/\log(n_{K})\}$, $L^{\frac{2p}{d+1}}\gg n_{K}(1+\|\bm{\zeta}_{\pi,\mathcal{G}}\|_{2}^{-2})$, and there exists some constant $c_{R}\geq 0$ such that $P(\max_{k}|R(T_{k})|\leq c_{R})=1$. Then, as $n_{K}\rightarrow\infty$, we have $n_{K}^{1/2}\sigma_{\pi,\mathcal{G},2}^{-1}\{\widetilde{V}^{\pi}(\mathcal{G})-V^{\pi}(\mathcal{G})\}\stackrel{d}{\rightarrow}N(0,1)$, where $\sigma_{\pi,\mathcal{G},2}$ is given in Subsection A.2 of the Supplementary Material.

Here, conditions C1-C6 are regularity conditions for establishing the asymptotic properties of $\widehat{\mathcal{P}}_{X}$ under model (6). They are commonly adopted in studies of modulated renewal processes (Pons and de Turckheim, 1988; Lin et al., 2013), and we relegate them, together with the proofs, to Subsection A.2 of the Supplementary Material for brevity.

4 Policy evaluation based on integrated reward

As remarked in Section 2.1, additional effort is required for policy evaluation based on integrated reward. The key insight is to transform the integral in $Q_{\mathcal{I}}^{\pi}(\bm{s},x,a)$ into a cumulative reward, thereby enabling the application of the methods developed in Section 3 to $Q_{\mathcal{I},k}^{\pi}(\bm{s},x,a,t)$. Assuming $N(t)$ satisfies (S1-x$'$) and $R(t)\in\mathcal{F}(t-)$, we have

$$\begin{aligned}
&\mathbb{E}^{\pi}\left[\int_{T_{k}}^{\infty}\gamma^{u-T_{k}}R(u)\,\mathrm{d}u\,\bigg|\,\bm{S}_{k},X_{k},A_{k},T_{k}\right]\\
=&\ \mathbb{E}^{\pi}\left[\int_{T_{k}}^{\infty}\frac{\gamma^{u-T_{k}}R(u)\,E[\mathrm{d}N(u)|\mathcal{F}(u-)]}{\lambda(u-T_{N(u-)};\bm{S}_{N(u-)+1},\bm{S}_{N(u-)},X_{N(u-)},A_{N(u-)})}\,\bigg|\,\bm{S}_{k},X_{k},A_{k},T_{k}\right]\\
=&\ \mathbb{E}^{\pi}\left[\int_{T_{k}}^{\infty}E\bigg[\frac{\gamma^{u-T_{k}}R(u)\,\mathrm{d}N(u)}{\lambda(u-T_{N(u-)};\bm{S}_{N(u-)+1},\bm{S}_{N(u-)},X_{N(u-)},A_{N(u-)})}\,\bigg|\,\mathcal{F}(u-)\bigg]\,\bigg|\,\bm{S}_{k},X_{k},A_{k},T_{k}\right]\\
=&\ \mathbb{E}^{\pi}\left[\int_{T_{k}}^{\infty}\frac{\gamma^{u-T_{k}}R(u)\,\mathrm{d}N(u)}{\lambda(u-T_{N(u-)};\bm{S}_{N(u-)+1},\bm{S}_{N(u-)},X_{N(u-)},A_{N(u-)})}\,\bigg|\,\bm{S}_{k},X_{k},A_{k},T_{k}\right]\\
=&\ \mathbb{E}^{\pi}\left[\sum_{j\geq k+1}\frac{\gamma^{T_{j}-T_{k}}R(T_{j})}{\lambda(X_{j};\bm{S}_{j},\bm{S}_{j-1},X_{j-1},A_{j-1})}\,\bigg|\,\bm{S}_{k},X_{k},A_{k},T_{k}\right]\\
=&\ \mathbb{E}^{\pi}\left[\sum_{j\geq k+1}\gamma^{T_{j}-T_{k}}R_{\mathcal{I}}(T_{j})\,\bigg|\,\bm{S}_{k},X_{k},A_{k},T_{k}\right],
\end{aligned}$$

where $R_{\mathcal{I}}(T_{j})=R(T_{j})/\lambda(X_{j};\bm{S}_{j},\bm{S}_{j-1},X_{j-1},A_{j-1})$ is an inverse-intensity-weighted reward.

It is not hard to verify that $\{R_{\mathcal{I}}(T_{k})\}_{k\geq 1}$ satisfies assumption (A2), and thus the policy evaluation methods proposed in Section 3 are applicable. For example, by Lemma 1, $Q_{\mathcal{I},k}^{\pi}(\bm{s},x,a,t)$ is also free of $k$ and $t$, and hence we can rewrite the Q-function as $Q_{\mathcal{I}}^{\pi}(\bm{s},x,a)$, the value function as $V^{\pi}_{\mathcal{I}}(\bm{s},x)=\sum_{a\in\mathbb{A}}Q^{\pi}_{\mathcal{I}}(\bm{s},x,a)\pi(a|\bm{s},x)$, and the value function under some reference distribution $\mathcal{G}$ as $V^{\pi}_{\mathcal{I}}(\mathcal{G})=\int\sum_{a\in\mathbb{A}}Q^{\pi}_{\mathcal{I}}(\bm{s},x,a)\pi(a|\bm{s},x)\,\mathrm{d}\mathcal{G}(\bm{s},x)$, all free of $k$ and $t$. Furthermore, by assuming $Q_{\mathcal{I}}^{\pi}(\bm{s},x,a)=\bm{\xi}(\bm{s},x,a)^{\top}\bm{\theta}_{\pi,\mathcal{I}}$ for some B-spline basis $\{\bm{\xi}(\bm{s},x,a),\bm{s}\in\mathbb{S},x\in\mathbb{R}^{+},a\in\mathbb{A}\}$, we can also obtain two estimates of $\bm{\theta}_{\pi,\mathcal{I}}$ as

$$\widehat{\bm{\theta}}_{\pi,\mathcal{I}}=\biggl\{\frac{1}{\sum_{i}K_{i}}\sum_{i=1}^{n}\sum_{k=0}^{K_{i}-1}\bm{\xi}_{i,k}(\bm{\xi}_{i,k}-\gamma^{X_{i,k+1}}\bm{\zeta}_{\pi,i,k+1})^{\top}\biggr\}^{-1}\biggl\{\frac{1}{\sum_{i}K_{i}}\sum_{i=1}^{n}\sum_{k=0}^{K_{i}-1}\bm{\xi}_{i,k}\gamma^{X_{i,k+1}}\widehat{R}_{\mathcal{I},i}(T_{i,k+1})\biggr\}$$

and

$$\widetilde{\bm{\theta}}_{\pi,\mathcal{I}}=\biggl\{\frac{1}{\sum_{i}K_{i}}\sum_{i=1}^{n}\sum_{k=0}^{K_{i}-1}\bm{\xi}_{i,k}(\bm{\xi}_{i,k}-\widetilde{\mathcal{U}}_{\pi,i,k+1})^{\top}\biggr\}^{-1}\biggl\{\frac{1}{\sum_{i}K_{i}}\sum_{i=1}^{n}\sum_{k=0}^{K_{i}-1}\bm{\xi}_{i,k}\gamma^{X_{i,k+1}}\widehat{R}_{\mathcal{I},i}(T_{i,k+1})\biggr\}\,,$$

respectively. Here, $\widehat{R}_{\mathcal{I},i}(T_{i,k+1})=R_{i}(T_{i,k+1})/\widehat{\lambda}(X_{i,k+1};\bm{S}_{i,k+1},\bm{S}_{i,k},X_{i,k},A_{i,k})$, and $\widehat{\lambda}(x^{\prime};\bm{s}^{\prime},\bm{s},x,a)$ denotes an estimator of $\lambda(x^{\prime};\bm{s}^{\prime},\bm{s},x,a)$ for $\bm{s}^{\prime},\bm{s}\in\mathbb{S}$, $x^{\prime},x\in\mathbb{R}^{+}$ and $a\in\mathbb{A}$. For example, under the Cox-type model (6), the estimated intensity function is $\widehat{\lambda}(x^{\prime};\bm{s}^{\prime},\bm{s},x,a)=\widehat{\lambda}_{0}(x^{\prime})\exp(\widehat{\bm{\beta}}^{\top}\bm{z})$ with $\bm{z}=\bm{\phi}(\bm{s}^{\prime},\bm{s},x,a)$ and $\widehat{\lambda}_{0}(x)=\frac{1}{b_{n}}\int_{0}^{\tau}K\big(\frac{u-x}{b_{n}}\big)\,\mathrm{d}\widehat{\Lambda}_{0}(u)$, where $\widehat{\bm{\beta}}$ and $\widehat{\Lambda}_{0}(u)$ are given in Subsection 3.2, $K(\cdot)$ is a kernel function, and $b_{n}$ is the kernel bandwidth.
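A small sketch of this inverse-intensity weighting under model (6), reusing the Breslow increments (`t_sorted`, `jumps`) from the sketch in Subsection 3.2 and a Gaussian kernel for $K(\cdot)$ (both the kernel choice and the helper names are ours):

```python
import numpy as np

def inverse_intensity_rewards(rewards, gaps_next, Z, beta_hat,
                              t_sorted, jumps, bandwidth):
    """R_I_hat(T_{i,k+1}) = R_i(T_{i,k+1}) / lambda_hat(X_{i,k+1}; ...), where
    lambda_hat_0(x) = (1/b_n) * sum_e K((t_e - x)/b_n) * dLambda_0_hat(t_e)
    smooths the Breslow increments with a Gaussian kernel K."""
    def lam0_hat(x):
        K = np.exp(-0.5 * ((t_sorted - x) / bandwidth) ** 2) / np.sqrt(2 * np.pi)
        return np.sum(K * jumps) / bandwidth
    lam_hat = np.array([lam0_hat(x) for x in gaps_next]) * np.exp(np.asarray(Z) @ beta_hat)
    return np.asarray(rewards) / lam_hat
```

Once the weighted rewards are formed, they are simply substituted for $R_i(T_{i,k+1})$ in the estimators of Section 3.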

Furthermore, we derive two estimates of the action-value function $Q_{\mathcal{I}}^{\pi}(\bm{s},x,a)$ as $\widehat{Q}_{\mathcal{I}}^{\pi}(\bm{s},x,a)=\bm{\xi}(\bm{s},x,a)^{\top}\widehat{\bm{\theta}}_{\pi,\mathcal{I}}$ and $\widetilde{Q}_{\mathcal{I}}^{\pi}(\bm{s},x,a)=\bm{\xi}(\bm{s},x,a)^{\top}\widetilde{\bm{\theta}}_{\pi,\mathcal{I}}$. The estimates of the value function $V_{\mathcal{I}}^{\pi}(\bm{s},x)$ are $\widehat{V}_{\mathcal{I}}^{\pi}(\bm{s},x)=\sum_{a\in\mathbb{A}}\pi(a|\bm{s},x)\bm{\xi}(\bm{s},x,a)^{\top}\widehat{\bm{\theta}}_{\pi,\mathcal{I}}$ and $\widetilde{V}_{\mathcal{I}}^{\pi}(\bm{s},x)=\sum_{a\in\mathbb{A}}\pi(a|\bm{s},x)\bm{\xi}(\bm{s},x,a)^{\top}\widetilde{\bm{\theta}}_{\pi,\mathcal{I}}$, and the estimates of $V_{\mathcal{I}}^{\pi}(\mathcal{G})$ are $\widehat{V}^{\pi}_{\mathcal{I}}(\mathcal{G})=\bm{\zeta}_{\pi,\mathcal{G}}^{\top}\widehat{\bm{\theta}}_{\pi,\mathcal{I}}$ and $\widetilde{V}^{\pi}_{\mathcal{I}}(\mathcal{G})=\bm{\zeta}_{\pi,\mathcal{G}}^{\top}\widetilde{\bm{\theta}}_{\pi,\mathcal{I}}$. The following theorem presents the asymptotic properties of $\widehat{V}^{\pi}_{\mathcal{I}}(\mathcal{G})$ and $\widetilde{V}^{\pi}_{\mathcal{I}}(\mathcal{G})$, whose proof is given in Subsection A.2 of the Supplementary Material.

Theorem 3.

Assume assumptions A1-A6, conditions C1-C6, and model (6) hold. Suppose $L=o\{n_{K}^{1/2}/\log(n_{K})\}$, $L^{\frac{2p}{d+1}}\gg n_{K}(1+\|\bm{\zeta}_{\pi,\mathcal{G}}\|_{2}^{-2})$, $n_{K}b_{n}^{4}\rightarrow 0$, $n_{K}b_{n}^{2}\rightarrow\infty$, and there exists some constant $c_{R}\geq 0$ such that $P(\max_{k}|R(T_{k})|\leq c_{R})=1$. Then, as $n_{K}=\sum_{i=1}^{n}K_{i}\rightarrow\infty$, we have (i) $n_{K}^{1/2}\sigma_{\pi,\mathcal{G},3}^{-1}\{\widehat{V}_{\mathcal{I}}^{\pi}(\mathcal{G})-V_{\mathcal{I}}^{\pi}(\mathcal{G})\}\stackrel{d}{\rightarrow}N(0,1)$ and (ii) $n_{K}^{1/2}\sigma_{\pi,\mathcal{G},4}^{-1}\{\widetilde{V}_{\mathcal{I}}^{\pi}(\mathcal{G})-V_{\mathcal{I}}^{\pi}(\mathcal{G})\}\stackrel{d}{\rightarrow}N(0,1)$. Here, $\sigma_{\pi,\mathcal{G},3}$ and $\sigma_{\pi,\mathcal{G},4}$ are given in Subsection A.2 of the Supplementary Material.

5 Simulation

In this section, we conduct simulation studies to examine the finite-sample performance of the proposed methods. We begin by describing our setup. First, for the initial values, we generate $S_{0}$ from a Uniform distribution on $[-3/2,3/2]$, $X_{0}$ from an Exponential distribution with intensity $1/2$, and $A_{0}$ from a Bernoulli distribution with expectation $1/2$. Then, given the current state $S_{k}$, gap time $X_{k}$, and action $A_{k}$, the next state is generated from one of three models: (S1) $S_{k+1}=\frac{3}{4}(2A_{k}-1)S_{k}+\epsilon_{k}$; (S2) $S_{k+1}=\{\frac{3}{4}-\frac{1}{4}I(X_{k}<\frac{1}{2})\}(2A_{k}-1)S_{k}+\epsilon_{k}$; (S3) $S_{k+1}=\{\frac{3}{4}-\frac{1}{4}I(X_{k}<\frac{1}{2})+\frac{1}{4}I(X_{k+1}>1)\}(2A_{k}-1)S_{k}+\epsilon_{k}$, where $\{\epsilon_{k}\}_{k\geq 1}\stackrel{i.i.d.}{\sim}\mathcal{N}(0,\frac{1}{4})$. The next gap time $X_{k+1}$ is generated from one of the following intensity models: (X1) $\lambda(x;S_{k},X_{k},A_{k})=\lambda_{0}(x)$; (X2) $\lambda(x;S_{k},X_{k},A_{k})=\lambda_{0}(x)\exp(-S_{k}+\frac{1}{2}X_{k}+A_{k}-\frac{1}{2}S_{k}A_{k})$; (X3) $\lambda(x;S_{k},X_{k},A_{k},S_{k+1})=\lambda_{0}(x)\exp(-S_{k}+\frac{1}{2}X_{k}+A_{k}-\frac{1}{2}S_{k}A_{k}+\frac{1}{2}S_{k+1})$. All the actions $\{A_{k},k\geq 1\}$ are independently generated from a Bernoulli distribution with mean $0.5$, and the reward of each action $A_{k}$, observed at $T_{k+1}$, is defined as $R(T_{k+1})=\{S_{k+1}-S_{k}-\frac{1}{2}(2A_{k}-1)\}X_{k+1}+r_{k}$ with $\{r_{k}\}_{k\geq 1}\stackrel{i.i.d.}{\sim}\mathcal{N}(0,\frac{1}{4})$.
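As an illustration of this data-generating mechanism, the sketch below simulates one trajectory under the (S2, X2) combination (used as Scenario 2 below), assuming for simplicity a constant baseline intensity $\lambda_0\equiv 1$, which the text leaves unspecified:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_scenario2(K, lam0=1.0):
    """One trajectory of K transitions under state model (S2) and gap-time
    model (X2), with behaviour actions A_k ~ Bernoulli(1/2); lam0 = 1 is an
    illustrative choice for the unspecified baseline intensity."""
    S = rng.uniform(-1.5, 1.5)              # S_0 ~ Uniform[-3/2, 3/2]
    X = rng.exponential(2.0)                # X_0 ~ Exponential with intensity 1/2
    A = rng.integers(0, 2)                  # A_0 ~ Bernoulli(1/2)
    traj = []
    for _ in range(K):
        # (X2): Cox intensity lam0 * exp(-S + X/2 + A - S*A/2) => exponential gap time
        rate = lam0 * np.exp(-S + 0.5 * X + A - 0.5 * S * A)
        X_next = rng.exponential(1.0 / rate)
        # (S2): next state given the current (S, X, A)
        S_next = (0.75 - 0.25 * (X < 0.5)) * (2 * A - 1) * S + rng.normal(0.0, 0.5)
        # reward observed at T_{k+1}
        R = (S_next - S - 0.5 * (2 * A - 1)) * X_next + rng.normal(0.0, 0.5)
        traj.append((S, X, A, X_next, S_next, R))
        S, X, A = S_next, X_next, rng.integers(0, 2)
    return traj
```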

Although there could be numerous combinations of the generative models, we choose four representative scenarios for a brief illustration. The settings of the four scenarios are listed in Table 1, and the corresponding data structures are plotted in Figure 2. In general, all four datasets satisfy the time-homogeneous and Markov assumptions given in (A1), but they vary in the dependence structure among the states, actions, and gap times. In Scenario 1, the generation of $X_{k+1}$ does not depend on the values of $S_{k}$ or $A_{k}$, and the value of $X_{k}$ does not affect the generation of $(S_{k+1},A_{k+1})$ either. This implies that the observation process $N(t)$ is independent of the state-action transition process. In Scenario 2, we allow the generation of gap times to be correlated with the actions and states, but $X_{k+1}$ and $S_{k+1}$ are conditionally independent of each other given $(S_{k},X_{k},A_{k})$. More general settings are given in Scenarios 3 and 4. As noted in Section 2.2, the Markov assumption in (A1) allows $X_{k+1}$ and $S_{k+1}$ to be correlated with each other, with the only restriction that their dependence structure satisfies the Markov property. Two example schemes are developed in Section 2.3 illustrating two different dependence structures satisfying (A1). Here, Scenarios 3 and 4 represent Schemes 1 and 2, respectively. Specifically, in Scenario 3, we first generate $S_{k+1}$, and then generate $X_{k+1}$ based on $(S_{k},X_{k},A_{k},S_{k+1})$; while in Scenario 4, we generate $X_{k+1}$ first, and then $S_{k+1}$ based on $(S_{k},X_{k},A_{k},X_{k+1})$. A comparison of the data structures in Scenarios 1-4 is provided in Table 1.

Table 1: Settings in Scenarios 1-4.
            Settings          Properties
            State  Gap time   A1 & A2   X_{k+1} \perp S_{k+1}   X_{k+1} \perp S_{k+1} | \mathcal{F}(T_{k})   model (6)
Scenario 1  S1     X1         √         √                       √                                            √
Scenario 2  S2     X2         √         ×                       √                                            √
Scenario 3  S2     X3         √         ×                       ×                                            √
Scenario 4  S3     X2         √         ×                       ×                                            ×

Figure 2: Data structure under Scenarios 1-4 (four panels: Scenario 1 and Scenario 2 in the top row, Scenario 3 and Scenario 4 in the bottom row).

In the policy evaluation procedure, the discount factor is set as \gamma=0.7 for all settings. For the reference distribution, we consider \mathcal{G}(s,x)=(s+1)x/4 with -1\leq s\leq 1 and 0\leq x\leq 2 for policy evaluation under cumulative rewards, and \mathcal{G}(s,x)=(5s+1)x/2 with -1/5\leq s\leq 1/5 and 0\leq x\leq 1 for policy evaluation under integrated rewards. For the target policy, we consider a class of linear deterministic policies \pi(a;s,x)=I(\alpha_{0}+\alpha_{1}s+\alpha_{2}x\leq 0), where \alpha=(\alpha_{0},\alpha_{1},\alpha_{2}) is set as (0,-1,0) in Scenario 1 and as (-1,-1,1) in Scenarios 2-4. To calculate the true values, we simulate N=5\times 10^{5} independent trajectories with (S_{0},X_{0}) distributed according to \mathcal{G} and \{A_{k}\}_{k\geq 0} chosen according to the target policy \pi. Then, V^{\pi}(\mathcal{G}) is approximated by \frac{1}{N}\sum_{j=1}^{N}\sum_{l=1}^{K_{j}}\gamma^{T_{j,l}}R_{j}(T_{j,l}) and V^{\pi}_{\mathcal{I}}(\mathcal{G}) is approximated by \frac{1}{N}\sum_{j=1}^{N}\sum_{l=1}^{K_{j}}\gamma^{T_{j,l}}R_{\mathcal{I},j}(T_{j,l}).
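For concreteness, a short sketch of this Monte Carlo approximation of V^{\pi}(\mathcal{G}) is given below; the trajectory simulator passed in is a hypothetical stand-in for a generator that draws (S_{0},X_{0}) from \mathcal{G} and follows the target policy \pi.

```python
def approximate_value(simulate_under_policy, gamma=0.7, N=500_000, K=10):
    """Monte Carlo approximation of V^pi(G) = E[ sum_l gamma^{T_l} R(T_l) ].

    `simulate_under_policy` is a hypothetical simulator returning, for one
    trajectory with (S_0, X_0) ~ G and actions chosen by the target policy pi,
    a list of (T_l, R_l) pairs of decision times and observed rewards.
    """
    total = 0.0
    for _ in range(N):
        total += sum(gamma ** T_l * R_l for T_l, R_l in simulate_under_policy(K))
    return total / N
```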

Since both the state S_{k} and the gap time X_{k} might not have bounded supports in our settings, we transform the data by S_{k}^{*}=\Phi(S_{k}) and X_{k}^{*}=1-\exp(-X_{k}), where \Phi(\cdot) is the cumulative distribution function of a standard normal random variable. Then, the basis functions are constructed from the tensor product of two cubic B-spline sets, whose knots are placed at equally spaced sample quantiles of the transformed states and gap times. When estimating the B-spline coefficients \bm{\theta}, three methods are considered, namely, the naive method, the standard method, and the modulated method. When the value of the policy is defined by the cumulative reward, the standard method leads to the estimate \widehat{\bm{\theta}}_{\pi}. The modulated method, which depends on model (6) for the observation process, leads to the estimate \widetilde{\bm{\theta}}_{\pi}. Finally, the naive estimate is \widehat{\bm{\theta}}_{\pi,N}=\bigl{[}\sum_{i,k}\bm{\xi}_{i,k}\{\bm{\xi}_{i,k}-\gamma\bm{\xi}(S_{i,k+1},X_{i,k+1},\pi)\}^{\top}\bigr{]}^{-1}\bigl{\{}\sum_{i,k}\bm{\xi}_{i,k}\gamma R_{i}(T_{i,k+1})\bigr{\}}, which is directly obtained from Shi et al. (2021) by treating (S_{k},X_{k}) as the state variable. Similarly, when the value of the policy is defined by the integrated reward, we adopt the standard estimate \widehat{\bm{\theta}}_{\pi,\mathcal{I}}, the modulated estimate \widetilde{\bm{\theta}}_{\pi,\mathcal{I}}, and the naive estimate

\widehat{\bm{\theta}}_{\pi,\mathcal{I},N}=\bigl{[}\sum_{i,k}\bm{\xi}_{i,k}\{\bm{\xi}_{i,k}-\gamma\bm{\xi}(S_{i,k+1},X_{i,k+1},\pi)\}^{\top}\bigr{]}^{-1}\bigl{\{}\sum_{i,k}\bm{\xi}_{i,k}\gamma\widehat{R}_{\mathcal{I},i}(T_{i,k+1})\bigr{\}}\,.
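Both naive estimates are single linear solves. A minimal sketch, assuming the basis evaluations and rewards have been stacked into arrays (names are ours), is:

```python
import numpy as np

def naive_theta(xi, xi_next_pi, reward, gamma=0.7):
    """Naive B-spline coefficient estimate with a constant discount gamma,
    mirroring the displayed formula.  Replacing `gamma` by the vector
    gamma**X_{i,k+1} of gap-time discounts yields the analogous linear
    solve for the standard estimate.

    xi         : (nK, mL) basis vectors at (S_{i,k}, X_{i,k}, A_{i,k})
    xi_next_pi : (nK, mL) basis vectors at (S_{i,k+1}, X_{i,k+1}, pi(S_{i,k+1}, X_{i,k+1}))
    reward     : (nK,)    rewards R_i(T_{i,k+1}) (or their integrated analogues)
    """
    lhs = xi.T @ (xi - gamma * xi_next_pi)   # sum_{i,k} xi_{i,k} (xi_{i,k} - gamma xi')^T
    rhs = xi.T @ (gamma * reward)            # sum_{i,k} xi_{i,k} gamma R_i(T_{i,k+1})
    return np.linalg.solve(lhs, rhs)
```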

The performance of the three estimation methods is summarized in Tables 2 and 3. Within each scenario, we further consider six cases: sample sizes n=100,200,400 with trajectory length K=10, and trajectory lengths K=100,200,400 with the number of trajectories n=10. In each case, we generate 1000 simulation replicates to approximate the bias and standard deviation of the naive estimators (Bias_N, SD_N), and the bias, standard deviation, estimated standard error, and empirical coverage probability of 95% confidence intervals for the standard estimators (Bias_S, SD_S, SE_S, CP_S) and the modulated estimators (Bias_M, SD_M, SE_M, CP_M). From the results, we observe that both the standard method and the modulated method perform well in Scenarios 1-3: the estimators are asymptotically unbiased, the estimated standard errors are close to the standard deviations of the estimators, and the coverage probabilities of the 95% confidence intervals match the nominal level when either n or K is large enough. In comparison, the naive estimate is biased across all scenarios, including Scenario 1, where the observations follow a Poisson process of rate one. This implies that the observation process is informative and should be included in the policy evaluation, even if it is independent of the state-action process. Moreover, by comparing SD_S and SD_M in Scenarios 2-3, we observe that the modulated estimates are more efficient than the standard ones when n_K=nK is relatively small. This suggests that the value function estimated using the modulated Bellman equation may lead to reduced variance in policy evaluation under small sample sizes, especially when the distribution of gap times depends on the state-action process.

Table 2: Policy evaluation results based on the cumulative reward, including the bias and standard deviation of the naive estimators (Bias_N, SD_N), and the bias, standard deviation, estimated standard error, and empirical coverage probability of 95% confidence intervals for the standard estimators (Bias_S, SD_S, SE_S, CP_S) and the modulated estimators (Bias_M, SD_M, SE_M, CP_M).
Naive Standard Modulated
n K Bias_N SD_N Bias_S SD_S SE_S CP_S Bias_M SD_M SE_M CP_M
Scenario 1   True value = -0.641
100 10 -0.099 0.079 -0.002 0.063 0.059 0.958 0.002 0.062 0.063 0.954
200 10 -0.096 0.053 0.000 0.039 0.041 0.960 0.002 0.041 0.045 0.968
400 10 -0.095 0.036 -0.002 0.028 0.029 0.952 -0.001 0.028 0.032 0.958
10 100 -0.075 0.100 -0.030 0.073 0.065 0.950 0.001 0.080 0.071 0.938
10 200 -0.091 0.065 0.001 0.049 0.044 0.934 0.003 0.047 0.049 0.952
10 400 -0.094 0.043 -0.002 0.030 0.031 0.944 -0.001 0.030 0.034 0.968
Scenario 2   True value = 0.637
100 10 -0.388 0.105 0.002 0.094 0.082 0.930 -0.010 0.092 0.078 0.934
200 10 -0.397 0.064 -0.003 0.058 0.055 0.936 -0.009 0.053 0.054 0.940
400 10 -0.397 0.042 -0.002 0.038 0.039 0.952 -0.005 0.034 0.038 0.956
10 100 -0.387 0.147 -0.002 0.151 0.117 0.896 -0.016 0.135 0.106 0.906
10 200 -0.385 0.091 -0.002 0.077 0.072 0.924 -0.008 0.074 0.066 0.932
10 400 -0.389 0.068 -0.004 0.056 0.049 0.928 -0.006 0.051 0.046 0.936
Scenario 3   True value = 0.510
100 10 -0.441 0.151 0.004 0.110 0.085 0.894 -0.006 0.093 0.079 0.898
200 10 -0.442 0.093 -0.002 0.064 0.055 0.932 -0.006 0.055 0.053 0.930
400 10 -0.437 0.063 0.000 0.042 0.039 0.936 -0.002 0.039 0.038 0.956
10 100 -0.410 0.234 -0.011 0.200 0.141 0.884 0.002 0.159 0.124 0.888
10 200 -0.423 0.126 -0.001 0.092 0.079 0.902 -0.005 0.084 0.071 0.920
10 400 -0.431 0.095 -0.004 0.062 0.053 0.924 -0.005 0.058 0.048 0.938
Table 3: Policy evaluation results based on the integrated reward, including the bias and standard deviation of the naive estimators (Bias_N, SD_N), and the bias, standard deviation, estimated standard error, and empirical coverage probability of 95% confidence intervals for the standard estimators (Bias_S, SD_S, SE_S, CP_S) and the modulated estimators (Bias_M, SD_M, SE_M, CP_M).
Naive Standard Modulated
n K Bias_N SD_N Bias_S SD_S SE_S CP_S Bias_M SD_M SE_M CP_M
Scenario 1   True value = -0.413
100 10 -0.019 0.105 0.001 0.080 0.071 0.934 0.005 0.083 0.068 0.910
200 10 -0.023 0.072 0.000 0.056 0.049 0.929 0.003 0.053 0.048 0.908
400 10 -0.028 0.053 -0.003 0.040 0.034 0.929 -0.001 0.041 0.034 0.912
10 100 -0.020 0.099 -0.001 0.079 0.070 0.935 -0.012 0.078 0.066 0.904
10 200 -0.023 0.072 0.002 0.054 0.048 0.929 0.004 0.054 0.046 0.898
10 400 -0.027 0.049 0.000 0.038 0.034 0.925 0.001 0.038 0.033 0.924
Scenario 2   True value = 0.383
100 10 -0.045 0.098 -0.002 0.122 0.104 0.943 -0.003 0.111 0.096 0.938
200 10 -0.044 0.066 0.001 0.091 0.078 0.948 0.000 0.085 0.074 0.960
400 10 -0.048 0.051 -0.004 0.069 0.061 0.947 -0.004 0.066 0.059 0.944
10 100 -0.047 0.094 0.000 0.120 0.107 0.949 -0.006 0.110 0.098 0.934
10 200 -0.044 0.066 -0.001 0.086 0.076 0.934 -0.003 0.084 0.071 0.934
10 400 -0.046 0.048 -0.002 0.064 0.058 0.949 -0.004 0.063 0.056 0.957
Scenario 3   True value = 0.460
100 10 -0.138 0.069 0.004 0.076 0.073 0.950 -0.001 0.076 0.069 0.940
200 10 -0.150 0.051 0.002 0.050 0.051 0.945 0.000 0.049 0.049 0.949
400 10 -0.160 0.042 -0.001 0.037 0.036 0.944 -0.002 0.035 0.035 0.945
10 100 -0.134 0.066 0.007 0.077 0.071 0.956 0.001 0.073 0.067 0.945
10 200 -0.140 0.047 0.003 0.053 0.049 0.946 0.001 0.054 0.048 0.939
10 400 -0.142 0.034 0.003 0.037 0.035 0.936 0.002 0.037 0.034 0.931

Finally, since the modulated estimates require modeling the observation process N(t), we also investigate the robustness of the proposed estimates when model (6) is misspecified under Scenario 4. Specifically, we generate data following Scheme 2, calculate the estimates using the methods developed under Scheme 1, and report the results in Table 4 of the Supplementary Material. The results show that the standard estimator under the cumulative reward, \widehat{V}^{\pi}(\mathcal{G}), which only requires A1 and A2, still performs well in all settings. Somewhat unexpectedly, the estimates \widetilde{V}^{\pi}(\mathcal{G}), \widehat{V}^{\pi}_{\mathcal{I}}(\mathcal{G}) and \widetilde{V}^{\pi}_{\mathcal{I}}(\mathcal{G}), which are based on model (6), also exhibit negligible bias and provide satisfactory variance estimation. Hence, the proposed methods are fairly robust to mild misspecification of the observation process model.

6 Application: HealthPartners Data

In this section, we illustrate the proposed methodology by applying it to the periodontal recall interval selection problem from a HealthPartners database (henceforth, HP data) of longitudinal electronic health records tracking PD status and progression, along with other covariates, among subjects around the Minneapolis, Minnesota area (Guan et al., 2020). The length of recall intervals continues to be a topic of debate and research, while the current, insurance-mandated standard of care is a recall interval of 6 months for all patients. However, recall frequencies are intimately related to patient outcomes, provider workload, and dental healthcare costs. Recent NICE guidelines (Clarkson et al., 2018) recommend that recall intervals vary with time and be determined based on disease levels and oral disease risk. Overall, the selection of an appropriate personalized recall interval is a multifaceted clinical decision that is challenging to evaluate mechanistically. Our HP dataset consists of 7654 dental visit records from 1056 patients enrolled within the HP EHR from January 1, 2007, to December 31, 2014. For each visit record, we have the recall visit date, probed pocket depth (PPD, measured in mm) from all available tooth-sites as the PD assessment endpoint, and the clinician-recommended recall interval for the follow-up visit. The total follow-up time for these subjects has a median of 5.95 years, with a maximum of 7.97 years (roughly 8 years of data). The number of recall visits has a median of 8, with a range from 3 to 29. The gap time (actual recall interval) has a median of 181 days (approximately 6 months), with a range of 1 to 2598 days. Due to the unavailability of full-mouth, site-level measures of clinical attachment level (another important biomarker for PD) in the HP database, we refrain from using the ACES framework (Holtfreter et al., 2024), which adapts the 2018 periodontal status classification scheme to epidemiological survey data, and instead consider the mean PPD (averaged across all tooth-sites at a specific visit) as a plausible subject-level PD endpoint, with a median of 2.51 mm (range: 0.0–6.2 mm).

To apply our methods, let T_{k} denote the record time (in days), X_{k} denote the gap time (in days) between T_{k} and T_{k-1}, S_{k} denote the mean PPD measurement at T_{k}, and A_{k} be an indicator of the recommended recall interval at T_{k}, where A_{k}=1 implies the patient is advised to revisit within 6 months, and A_{k}=0 otherwise. Within the HP database, the mean PPD measurements (S_{k}) are significantly correlated (p-value <0.001) with the (true) recommended recall intervals, suggesting that the mean PPD measurements play a crucial role in the dentists' decisions regarding recall intervals. On the other hand, although other covariates were available in the dataset, only PPD and clinical attachment loss (CAL) had a significant correlation with the treatment decisions made by dentists. Given the high correlation between PPD and CAL, we selected PPD as the only state variable. To measure the effects of the action A_{k}, we consider two definitions for the reward R(T_{k+1}). First, considering that the PPD around teeth deepens in the presence of gum disease, and that a reduction of PPD is one of the desired outcomes of periodontal therapy, we define the reward as an indicator of PPD reduction, i.e., R(T_{k+1})=I(S_{k+1}\leq S_{k}). Based on this definition, the performance of any given policy \pi can be evaluated in terms of the value function V^{\pi}(s,x) defined in (1), which is based on the cumulative sum of rewards. Moreover, since PPD \leq 3 mm is usually considered a threshold to distinguish periodontally healthy sites from diseased ones, we also consider another definition of reward, which takes the form R(T_{k+1})=3-S_{k+1}. In this case, R(\cdot) can be viewed as a continuous function, where R(t) is determined by the potential PPD measurement at time t, and therefore a more reasonable evaluation of policies is given by the value function V^{\pi}_{\mathcal{I}}(s,x) defined in (2), which is based on the integrated reward.
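As a sketch of this preprocessing step, the two reward definitions can be attached to a patient's visit history as follows; the column names 'patient_id' and 'mean_ppd' are hypothetical and do not refer to actual fields of the HP database.

```python
import numpy as np
import pandas as pd

def add_rewards(visits: pd.DataFrame) -> pd.DataFrame:
    """Attach the two reward definitions to a patient's visit history.

    `visits` is assumed to be sorted by visit date within patient and to carry
    hypothetical columns 'patient_id' and 'mean_ppd' (mean PPD in mm at each
    visit); these names are illustrative and not actual HP database fields.
    """
    out = visits.copy()
    next_ppd = out.groupby("patient_id")["mean_ppd"].shift(-1)   # S_{k+1}
    # Reward 1: indicator of PPD reduction, I(S_{k+1} <= S_k), used with V^pi
    out["reward_indicator"] = (next_ppd <= out["mean_ppd"]).astype(float)
    # Reward 2: distance below the 3 mm health threshold, 3 - S_{k+1}, used with V_I^pi
    out["reward_threshold"] = 3.0 - next_ppd
    # the last visit of each patient has no observed next state
    out.loc[next_ppd.isna(), ["reward_indicator", "reward_threshold"]] = np.nan
    return out
```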

Under each value function, we consider three policies for comparison, namely, the observed stochastic policy \pi_{os}, the observed fixed policy \pi_{of}, and the optimal policy. Since the estimation of the optimal policy depends on the choice of the value function, we consider two optimal policies, \pi_{op} and \pi_{op,\mathcal{I}}, which correspond to V^{\pi} and V^{\pi}_{\mathcal{I}}, respectively. The plots of the obtained policies are shown in Figure 5 of the Supplementary Material. Under the observed policies \pi_{of} and \pi_{os}, patients with deeper pocket depth and more frequent recalls are more likely to be recommended a shorter recall interval, while under the estimated optimal policies, there is no such linear relationship among actions, states, and gap times.

Figure 3: Plots of the estimated value (solid line) and confidence intervals (dashed lines) of the observed fixed policy \pi_{of}, the observed stochastic policy \pi_{os}, and the estimated optimal policy \pi_{op}, with the gap time fixed at 360 days (upper panel) and the PPD fixed at 2.4 mm (lower panel), where the reward is an indicator of PPD reduction, i.e., R(T_{k+1})=I(S_{k+1}\leq S_{k}), obtained under the naive method (red line), the standard method (green line), and the modulated method (blue line).
Figure 4: Plots of the estimated value (solid line) and confidence intervals (dashed lines) of the observed fixed policy \pi_{of}, the observed stochastic policy \pi_{os}, and the estimated optimal policy \pi_{op}, with the gap time fixed at 360 days (upper panel) and the PPD fixed at 2.4 mm (lower panel), where the reward is based on the PPD measurement, i.e., R(T_{k+1})=3-S_{k+1}, obtained under the naive method (red line), the standard method (green line), and the modulated method (blue line).

The data analysis results are summarized in Figures 3-4. In general, the estimated values change not only with the initial state but also with the initial gap time, which indicates that the frequency of recall visits does affect the performance of the policies and should be taken into account in the evaluation process. Moreover, by comparing the estimates obtained from the standard, modulated, and naive methods, we observe that the estimates from the standard and modulated methods are similar, while the results from the naive method differ markedly from the former two in both the magnitude of the values and the pattern of the value function. For example, in Figure 3, when the PPD is fixed at 2.4 mm, the value functions estimated by the naive method remain around 5 as the gap time varies, even though both the standard and modulated methods reveal a significant decrease in the value as the gap time increases. Furthermore, the confidence intervals provided by the modulated method are generally narrower than those given by the standard method (for example, when the gap time is 360 days). This result is consistent with the findings in the simulation studies.

We also compare the estimated value functions of the observed stochastic policy, the observed fixed policy, and the optimal policy under the modulated method, based on the cumulative reward and the integrated reward, respectively. The plots of the confidence intervals for the values of the three policies are given in Figure 6 of the Supplementary Material. From the results, it is evident that the alterations in the action assignment of the optimal policy, in comparison to the two observed policies, effectively enhance the policy's value. A more detailed discussion of the results is given in Section C of the Supplementary Material.

7 Discussion

In this paper, we construct a framework for dynamic policies with an irregularly-spaced and outcome-dependent observation process and develop an off-policy evaluation procedure under revised time-homogeneous and Markov assumptions. Similar Markov assumptions are commonly posited in the off-policy evaluation literature and, in real-world data applications, can be tested from the observed data using a forward-backward learning algorithm (Shi et al., 2020). We also provide example schemes to demonstrate the connection between the proposed assumptions and existing ones. In particular, we show that the proposed time-homogeneous and Markov assumptions require the observation process to satisfy a modulated renewal process assumption (see S1-x'), and we introduce an intensity-based modulated renewal process model (6) in the estimation procedure. It would be an interesting but challenging task to generalize the intensity model to a rate model, under which the existing Markov assumption no longer holds, or to generalize the modulated renewal process to a general counting process, under which the time-homogeneous assumption may not hold. Finally, although this paper focuses on the off-policy evaluation of dynamic policies, it is not hard to adapt the proposed method to search for an optimal dynamic policy, as well as to make inferences on the value of the optimal dynamic policy.

Supplementary Material

This supplementary material is organized as follows. In Section A, we present proofs for Lemmas 1-2, Lemmas A.1-A.5 and Theorems 1-4. In Section B, we present additional simulation results. In Section C, we present additional figures from the application, together with some discussion on the results.

Appendix A Proofs for lemmas and theorems

For notational brevity, we assume \pi to be a deterministic policy, with \pi(a|s,x)\in\{0,1\} for any a\in\mathbb{A}, s\in\mathbb{S} and x\in\mathbb{R}^{+}. Denote by \pi(s,x)\in\mathbb{A} the selected action under policy \pi given state s and gap time x. All derivations and results can be straightforwardly generalized to the evaluation of stochastic policies.

A.1 Proof of Lemmas 1-2

Proof of Lemma 1 : Under assumptions A1 and A2, the existence of Qkπ(s,x,a,t)Q_{k}^{\pi}(s,x,a,t) follows directly from the boundedness of rr and limjTj=\lim_{j\rightarrow\infty}T_{j}=\infty. Then, we can rewrite Qkπ(s,x,a,t)Q_{k}^{\pi}(s,x,a,t) as

jk+1Eπ[γTjTkR(Tj)|Sk=s,Xk=x,Ak=a,Tk=t]\displaystyle\sum_{j\geq k+1}E^{\pi}\left[\gamma^{T_{j}-T_{k}}R(T_{j})|S_{k}=s,X_{k}=x,A_{k}=a,T_{k}=t\right]
=\displaystyle= limMj=k+1MEπ[γXk+1++XjR(Tj)|Sk=s,Xk=x,Ak=a,Tk=t]\displaystyle\lim_{M\rightarrow\infty}\sum_{j=k+1}^{M}E^{\pi}\left[\gamma^{X_{k+1}+\cdots+X_{j}}R(T_{j})|S_{k}=s,X_{k}=x,A_{k}=a,T_{k}=t\right]
=\displaystyle= limM[E{γXk+1R(Tk+1)|Sk=s,Xk=x,Ak=a,Tk=t}\displaystyle\lim_{M\rightarrow\infty}\bigg{[}E\{\gamma^{X_{k+1}}R(T_{k+1})|S_{k}=s,X_{k}=x,A_{k}=a,T_{k}=t\}
+j=k+2MEπ{γXk+1++Xj1γXjR(Tj)|Sk=s,Xk=x,Ak=a,Tk=t}]\displaystyle+\sum_{j=k+2}^{M}E^{\pi}\{\gamma^{X_{k+1}+\cdots+X_{j-1}}\ \gamma^{X_{j}}R(T_{j})|S_{k}=s,X_{k}=x,A_{k}=a,T_{k}=t\}\bigg{]}
=\displaystyle= limM(E[γXk+1E{R(Tk+1)|Sk,Xk,Ak,Sk+1,Xk+1}|Sk=s,Xk=x,Ak=a,Tk=t]\displaystyle\lim_{M\rightarrow\infty}\bigg{(}E\big{[}\gamma^{X_{k+1}}E\{R(T_{k+1})|S_{k},X_{k},A_{k},S_{k+1},X_{k+1}\}\big{|}S_{k}=s,X_{k}=x,A_{k}=a,T_{k}=t\big{]}
+j=k+2MEπ[γXk+1++Xj1γXjE{R(Tj)|Sj1,Xj1,Aj1,Xj,Sj}|Sk=s,Xk=x,Ak=a,Tk=t])\displaystyle+\sum_{j=k+2}^{M}E^{\pi}\big{[}\gamma^{X_{k+1}+\cdots+X_{j-1}}\ \gamma^{X_{j}}E\{R(T_{j})|S_{j-1},X_{j-1},A_{j-1},X_{j},S_{j}\}|S_{k}=s,X_{k}=x,A_{k}=a,T_{k}=t\big{]}\bigg{)}
=\displaystyle= limM(E{γXk+1r(Sk,Xk,Ak,Sk+1,Xk+1)|Sk=s,Xk=x,Ak=a,Tk=t}+j=k+2MEπ[γXk+1++Xj1\displaystyle\lim_{M\rightarrow\infty}\bigg{(}E\{\gamma^{X_{k+1}}r(S_{k},X_{k},A_{k},S_{k+1},X_{k+1})|S_{k}=s,X_{k}=x,A_{k}=a,T_{k}=t\}+\sum_{j=k+2}^{M}E^{\pi}\big{[}\gamma^{X_{k+1}+\cdots+X_{j-1}}\
E{γXjr(Sj1,Xj1,Aj1,Sj,Xj)|Sj1,Xj1,Aj1}|Sk=s,Xk=x,Ak=a,Tk=t])\displaystyle E\{\gamma^{X_{j}}r(S_{j-1},X_{j-1},A_{j-1},S_{j},X_{j})|S_{j-1},X_{j-1},A_{j-1}\}|S_{k}=s,X_{k}=x,A_{k}=a,T_{k}=t\big{]}\bigg{)}

Note that \mathcal{P} denotes the joint distribution of the next gap time and next state. Under assumption (A1), for all k\geq 1,

E[γXk+1r(Sk,Xk,Ak,Sk+1,Xk+1)|Sk=s,Xk=x,Ak=a,Tk]\displaystyle E\left[\gamma^{X_{k+1}}r(S_{k},X_{k},A_{k},S_{k+1},X_{k+1})|S_{k}=s,X_{k}=x,A_{k}=a,T_{k}\right]
=\displaystyle= E[γXk+1r(Sk,Xk,Ak,Sk+1,Xk+1)|Sk=s,Xk=x,Ak=a]\displaystyle E\left[\gamma^{X_{k+1}}r(S_{k},X_{k},A_{k},S_{k+1},X_{k+1})|S_{k}=s,X_{k}=x,A_{k}=a\right]
=\displaystyle= 𝕊×[0,)γxr(s,x,a,s,x)d𝒫(s,x;s,x,a)=:𝒈1(s,x,a).\displaystyle\iint_{\mathbb{S}\times[0,\infty)}\gamma^{x^{\prime}}r(s,x,a,s^{\prime},x^{\prime})d\mathcal{P}(s^{\prime},x^{\prime};s,x,a)=:\bm{g}_{1}(s,x,a)\,.

Thus,

Qkπ(s,x,a,t)\displaystyle Q_{k}^{\pi}(s,x,a,t)
=\displaystyle= limM(𝒈1(s,x,a)+j=k+2MEπ[γXk+1++Xj1𝒈1{Sj1,Xj1,π(Sj1,Xj1)}|Sk=s,Xk=x,\displaystyle\lim_{M\rightarrow\infty}\bigg{(}\bm{g}_{1}(s,x,a)+\sum_{j=k+2}^{M}E^{\pi}\big{[}\gamma^{X_{k+1}+\cdots+X_{j-1}}\ \bm{g}_{1}\{S_{j-1},X_{j-1},\pi(S_{j-1},X_{j-1})\}\big{|}S_{k}=s,X_{k}=x,
Ak=a,Tk=t])\displaystyle A_{k}=a,T_{k}=t\big{]}\bigg{)}
=\displaystyle= limM(𝒈1(s,x,a)+E[γXk+1𝒈1{Sk+1,Xk+1,π(Sk+1,Xk+1)}|Sk=s,Xk=x,Ak=a,Tk=t]\displaystyle\lim_{M\rightarrow\infty}\bigg{(}\bm{g}_{1}(s,x,a)+E\big{[}\gamma^{X_{k+1}}\bm{g}_{1}\{S_{k+1},X_{k+1},\pi(S_{k+1},X_{k+1})\}\big{|}S_{k}=s,X_{k}=x,A_{k}=a,T_{k}=t\big{]}
+j=k+3MEπ[γXk+1++Xj2γXj1𝒈1{Sj1,Xj1,π(Sj1,Xj1)}|Sk=s,Xk=x,Ak=a,Tk=t])\displaystyle+\sum_{j=k+3}^{M}E^{\pi}\big{[}\gamma^{X_{k+1}+\cdots+X_{j-2}}\gamma^{X_{j-1}}\bm{g}_{1}\{S_{j-1},X_{j-1},\pi(S_{j-1},X_{j-1})\}\big{|}S_{k}=s,X_{k}=x,A_{k}=a,T_{k}=t\big{]}\bigg{)}
=\displaystyle= limM(𝒈1(s,x,a)+E[γXk+1𝒈1{Sk+1,Xk+1,π(Sk+1,Xk+1)}|Sk=s,Xk=x,Ak=a,Tk=t]\displaystyle\lim_{M\rightarrow\infty}\bigg{(}\bm{g}_{1}(s,x,a)+E\big{[}\gamma^{X_{k+1}}\bm{g}_{1}\{S_{k+1},X_{k+1},\pi(S_{k+1},X_{k+1})\}\big{|}S_{k}=s,X_{k}=x,A_{k}=a,T_{k}=t\big{]}
+j=k+3MEπ[γXk+1++Xj2E[γXj1𝒈1{Sj1,Xj1,π(Sj1,Xj1)}|Sj2,Xj2,Aj2,Tj2]\displaystyle+\sum_{j=k+3}^{M}E^{\pi}\big{[}\gamma^{X_{k+1}+\cdots+X_{j-2}}E\big{[}\gamma^{X_{j-1}}\bm{g}_{1}\{S_{j-1},X_{j-1},\pi(S_{j-1},X_{j-1})\}\big{|}S_{j-2},X_{j-2},A_{j-2},T_{j-2}\big{]}
|Sk=s,Xk=x,Ak=a,Tk=t])\displaystyle\big{|}S_{k}=s,X_{k}=x,A_{k}=a,T_{k}=t\big{]}\bigg{)}

By similar arguments, we have

E[γXk+1𝒈1{Sk+1,Xk+1,π(Sk+1,Xk+1)}|Sk=s,Xk=x,Ak=a,Tk]\displaystyle E\left[\gamma^{X_{k+1}}\bm{g}_{1}\{S_{k+1},X_{k+1},\pi(S_{k+1},X_{k+1})\}|S_{k}=s,X_{k}=x,A_{k}=a,T_{k}\right]
=\displaystyle= 𝕊×[0,)γx𝒈1{s,x,π(s,x)}d𝒫(s,x;s,x,a)=:𝒈2π(s,x,a)\displaystyle\iint_{\mathbb{S}\times[0,\infty)}\gamma^{x^{\prime}}\bm{g}_{1}\{s^{\prime},x^{\prime},\pi(s^{\prime},x^{\prime})\}d\mathcal{P}(s^{\prime},x^{\prime};s,x,a)=:\bm{g}_{2}^{\pi}(s,x,a)

which does not depend on TkT_{k} or kk for all k1k\geq 1. For all j2j\geq 2, define

𝒈j+1π(s,x,a)𝕊×[0,)γx𝒈jπ{s,x,π(s,x)}𝑑𝒫(s,x;s,x,a).\displaystyle\bm{g}_{j+1}^{\pi}(s,x,a)\equiv\iint_{\mathbb{S}\times[0,\infty)}\gamma^{x^{\prime}}\bm{g}_{j}^{\pi}\{s^{\prime},x^{\prime},\pi(s^{\prime},x^{\prime})\}d\mathcal{P}(s^{\prime},x^{\prime};s,x,a)\,.

Then, we have

Qkπ(s,x,a,t)\displaystyle Q_{k}^{\pi}(s,x,a,t)
=\displaystyle= limM(𝒈1(s,x,a)+𝒈2π(s,x,a)+j=k+3MEπ[γXk+1++Xj2𝒈2π{Sj2,Xj2,π(Sj2,Xj2)}\displaystyle\lim_{M\rightarrow\infty}\bigg{(}\bm{g}_{1}(s,x,a)+\bm{g}_{2}^{\pi}(s,x,a)+\sum_{j=k+3}^{M}E^{\pi}\big{[}\gamma^{X_{k+1}+\cdots+X_{j-2}}\bm{g}_{2}^{\pi}\{S_{j-2},X_{j-2},\pi(S_{j-2},X_{j-2})\}
|Sk=s,Xk=x,Ak=a,Tk=t])\displaystyle\big{|}S_{k}=s,X_{k}=x,A_{k}=a,T_{k}=t\big{]}\bigg{)}
=\displaystyle= limM(𝒈1(s,x,a)+𝒈2π(s,x,a)+𝒈3π(s,x,a)+j=k+4MEπ[γXk+1++Xj3𝒈3π{Sj3,Xj3,π(Sj3,Xj3)}\displaystyle\lim_{M\rightarrow\infty}\bigg{(}\bm{g}_{1}(s,x,a)+\bm{g}_{2}^{\pi}(s,x,a)+\bm{g}_{3}^{\pi}(s,x,a)+\sum_{j=k+4}^{M}E^{\pi}\big{[}\gamma^{X_{k+1}+\cdots+X_{j-3}}\bm{g}_{3}^{\pi}\{S_{j-3},X_{j-3},\pi(S_{j-3},X_{j-3})\}
|Sk=s,Xk=x,Ak=a,Tk=t])\displaystyle\big{|}S_{k}=s,X_{k}=x,A_{k}=a,T_{k}=t\big{]}\bigg{)}
=\displaystyle= \displaystyle\cdots
=\displaystyle= limM{𝒈1(s,x,a)+j=2Mk𝒈jπ(s,x,a)}=𝒈1(s,x,a)+j2𝒈jπ(s,x,a).\displaystyle\lim_{M\rightarrow\infty}\bigg{\{}\bm{g}_{1}(s,x,a)+\sum_{j=2}^{M-k}\bm{g}_{j}^{\pi}(s,x,a)\bigg{\}}=\bm{g}_{1}(s,x,a)+\sum_{j\geq 2}\bm{g}_{j}^{\pi}(s,x,a)\,.

Therefore, Qkπ(s,x,a,t)=Q0π(s,x,a,0)Q_{k}^{\pi}(s,x,a,t)=Q_{0}^{\pi}(s,x,a,0) for all k0k\geq 0, t0t\geq 0, s𝕊s\in\mathbb{S}, x+x\in\mathbb{R}^{+} and a𝔸a\in\mathbb{A}. \blacksquare

Proof of Lemma 2 (Bellman Equation)

Under assumptions A1-A2, Qπ(Sk,Xk,Ak)Q^{\pi}(S_{k},X_{k},A_{k}) can be rewritten as

Eπ{jk+1γXk+1++XjR(Tj)|Sk,Xk,Ak}\displaystyle E^{\pi}\big{\{}\sum_{j\geq k+1}\gamma^{X_{k+1}+\cdots+X_{j}}R(T_{j})\big{|}S_{k},X_{k},A_{k}\big{\}}
=\displaystyle= E{γXk+1R(Tk+1)|Sk,Xk,Ak}+Eπ[γXk+1Eπ{jk+2γXk+2++XjR(Tj)|Sk+1,Xk+1,Ak+1,\displaystyle E\big{\{}\gamma^{X_{k+1}}R(T_{k+1})\big{|}S_{k},X_{k},A_{k}\big{\}}+E^{\pi}\big{[}\gamma^{X_{k+1}}E^{\pi}\big{\{}\sum_{j\geq k+2}\gamma^{X_{k+2}+\cdots+X_{j}}R(T_{j})\big{|}S_{k+1},X_{k+1},A_{k+1},
Sk,Xk,Ak}|Sk,Xk,Ak]\displaystyle S_{k},X_{k},A_{k}\big{\}}\big{|}S_{k},X_{k},A_{k}\big{]}
=\displaystyle= E{γXk+1R(Tk+1)|Sk,Xk,Ak}+Eπ[γXk+1Eπ{jk+2γXk+2++XjR(Tj)|Sk+1,Xk+1,Ak+1}\displaystyle E\big{\{}\gamma^{X_{k+1}}R(T_{k+1})\big{|}S_{k},X_{k},A_{k}\big{\}}+E^{\pi}\big{[}\gamma^{X_{k+1}}E^{\pi}\big{\{}\sum_{j\geq k+2}\gamma^{X_{k+2}+\cdots+X_{j}}R(T_{j})\big{|}S_{k+1},X_{k+1},A_{k+1}\big{\}}
|Sk,Xk,Ak]\displaystyle\big{|}S_{k},X_{k},A_{k}\big{]}
=\displaystyle= E{γXk+1R(Tk+1)|Sk,Xk,Ak}+Eπ[γXk+1Qπ{Sk+1,Xk+1,Ak+1}|Sk,Xk,Ak]\displaystyle E\big{\{}\gamma^{X_{k+1}}R(T_{k+1})\big{|}S_{k},X_{k},A_{k}\big{\}}+E^{\pi}\big{[}\gamma^{X_{k+1}}Q^{\pi}\{S_{k+1},X_{k+1},A_{k+1}\}\big{|}S_{k},X_{k},A_{k}\big{]}
=\displaystyle= E{γXk+1R(Tk+1)|Sk,Xk,Ak}+E[γXk+1Qπ{Sk+1,Xk+1,π(Sk+1,Xk+1)}|Sk,Xk,Ak].\displaystyle E\big{\{}\gamma^{X_{k+1}}R(T_{k+1})\big{|}S_{k},X_{k},A_{k}\big{\}}+E\big{[}\gamma^{X_{k+1}}Q^{\pi}\{S_{k+1},X_{k+1},\pi(S_{k+1},X_{k+1})\}\big{|}S_{k},X_{k},A_{k}\big{]}\,.

Therefore, the Bellman Equation in Lemma 2 holds for all k1k\geq 1. \blacksquare

A.2 Proofs of Theorems 1-4.

For notational brevity, let Q^{\pi}(s,x,\pi) denote Q^{\pi}\{s,x,\pi(s,x)\}, let Q(s,x,\pi;\bm{\theta}) denote Q\{s,x,\pi(s,x);\bm{\theta}\}, let \bm{\xi}(s,x,\pi;\bm{\theta}) denote \bm{\xi}\{s,x,\pi(s,x);\bm{\theta}\}, and let \sum_{i,k} denote \sum_{i=1}^{n}\sum_{k=0}^{K_{i}-1}. For a set of l-dimensional vectors, \mathcal{V}=\{v_{i,k}\in\mathbb{R}^{l}\}_{1\leq i\leq n,0\leq k\leq K_{i}-1}, let array(\mathcal{V}) denote an n_{K}\times l dimensional array whose (\sum_{j<i}K_{j}+k+1)th row is n_{K}^{-1/2}v_{i,k}, taking the form

array(\mathcal{V})=n_{K}^{-1/2}(v_{1,0},v_{1,1},\cdots,v_{i,k},\cdots,v_{n,K_{n}-1})^{\top}.

Moreover, the following notation and conditions are required in the proofs of Theorems 2-4. For j=0,1,2, define S_{z,j}(x;\bm{\beta})=\frac{1}{\sum_{i}K_{i}}\sum_{i=1}^{n}\sum_{k=0}^{K_{i}-1}I(X_{i,k+1}\geq x)\exp(\bm{\beta}^{\top}\bm{Z}_{i,k})\bm{Z}_{i,k}^{\otimes j}, with \bm{a}^{\otimes j}=1,\bm{a},\bm{a}\bm{a}^{\top} for j=0,1,2, respectively. The convergence of S_{z,j}(x;\bm{\beta}) holds under assumption A6, so we denote the limit by s_{z,j}(x;\bm{\beta}) for each x and \bm{\beta}, and write s_{z,j}(x;\bm{\beta}_{0}) as s_{z,j}(x) for notational brevity. By Slutsky's theorem, we can further denote \overline{z}(x)=s_{z,1}(x)/s_{z,0}(x) as the limit of \overline{Z}(x;\bm{\beta}_{0}). Conditions C1-C5 are regularity conditions commonly adopted in modulated renewal process studies (Pons and de Turckheim, 1988; Lin and Fine, 2009; Lin et al., 2013), and are used here to establish the asymptotic properties of \widehat{\mathcal{P}}_{X} under model (6) in Section 3.2. Condition C6 ensures the existence of \sigma_{\pi,\mathcal{G},2}. A numerical sketch of the averages S_{z,j}(x;\bm{\beta}) is given after Condition C6.

  • C1

    There exists a compact neighborhood \overline{\mathcal{B}} of \bm{\beta}_{0} such that s_{z,0}(x;\bm{\beta}) is bounded away from zero on \overline{\mathcal{B}}\times[0,\tau].

  • C2

    𝛀z=0τ{sz,01(x)sz,2(x)z¯2(x)}sz,0(x)𝑑Λ0(x)\bm{\Omega}_{z}=\int_{0}^{\tau}\{s_{z,0}^{-1}(x)s_{z,2}(x)-\overline{z}^{\otimes 2}(x)\}s_{z,0}(x)d\Lambda_{0}(x) is positive definite.

  • C3

    nK1/2{ESz,j(x,𝜷0)sz,j(x)}n_{K}^{1/2}\{ES_{z,j}(x,\bm{\beta}_{0})-s_{z,j}(x)\} is finite for j=0,1,2j=0,1,2 and for each xx in a dense subset of [0,τ][0,\tau] including 0 and τ\tau.

  • C4

    For any ϵ>0\epsilon>0, nK1i,k𝒁i,k22I{𝒁i,k22>nKϵ}exp(𝜷0𝒁i,k)Λ0(τ)p0.n_{K}^{-1}\sum_{i,k}\|\bm{Z}_{i,k}\|_{2}^{2}I\{\|\bm{Z}_{i,k}\|_{2}^{2}>n_{K}\epsilon\}\exp(\bm{\beta}_{0}^{\top}\bm{Z}_{i,k})\Lambda_{0}(\tau)\stackrel{{\scriptstyle p}}{{\rightarrow}}0\,.

  • C5

    nK1/2{Sz,j(x,𝜷0)sz,j(x,𝜷0)}n_{K}^{1/2}\{S_{z,j}(x,\bm{\beta}_{0})-s_{z,j}(x,\bm{\beta}_{0})\} is tight on [0,τ][0,\tau] for j=0,1,2j=0,1,2 and nK1n_{K}\geq 1.

  • C6

    \bm{\Omega}_{\pi,2} defined in (12), \bm{\Omega}_{\pi,3} defined in (16), and \bm{\Omega}_{\pi,4} defined in (17) are positive definite.
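For concreteness, a minimal numerical sketch of the at-risk averages S_{z,j}(x;\bm{\beta}) defined above is given below, assuming the covariates \bm{Z}_{i,k} and gap times X_{i,k+1} have been stacked into arrays; all names are illustrative.

```python
import numpy as np

def at_risk_averages(x, Z, X_next, beta):
    """At-risk averages S_{z,j}(x; beta) for j = 0, 1, 2 at a single point x.

    Z      : (nK, q) stacked covariates Z_{i,k}
    X_next : (nK,)   stacked gap times X_{i,k+1}
    beta   : (q,)    regression coefficient
    Returns S0 (scalar), S1 (q,), S2 (q, q).
    """
    w = (X_next >= x) * np.exp(Z @ beta)   # I(X_{i,k+1} >= x) exp(beta^T Z_{i,k})
    nK = len(X_next)
    S0 = w.sum() / nK
    S1 = (w[:, None] * Z).sum(axis=0) / nK
    S2 = (Z.T * w) @ Z / nK                # sum_{i,k} w_{i,k} Z_{i,k} Z_{i,k}^T / nK
    return S0, S1, S2
```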

Proof of Theorem 1:

The proof of Theorem 1 is similar to Appendix E.1 in Shi et al. (2021), so we omit the technical details and give an outline of the proof.

Firstly, by Lemma 1 of Shi et al. (2021) and Section 2.2 of Huang (1998), there exists 𝜽πmL\bm{\theta}^{*}_{\pi}\in\mathbb{R}^{mL} that satisfies

sups𝕊,x+,a𝔸|Qπ(s,x,a)ξ(s,x,a)𝜽π|C1Lp/(d+1)\displaystyle\sup_{s\in\mathbb{S},x\in\mathbb{R}^{+},a\in\mathbb{A}}|Q^{\pi}(s,x,a)-\xi(s,x,a)^{\top}\bm{\theta}_{\pi}^{*}|\leq C_{1}L^{-p/(d+1)} (8)

for some constant C1C_{1}. Then by the definition of 𝜽^π\widehat{\bm{\theta}}_{\pi}, we can rewrite 𝜽^π𝜽π\widehat{\bm{\theta}}_{\pi}-{\bm{\theta}}_{\pi}^{*} as

𝑫^π1[1nKi,k𝝃i,k{γXi,k+1Ri(Ti,k+1)(𝝃i,kγXi,k+1𝜻π,i,k+1)𝜽π}]\displaystyle\widehat{\bm{D}}_{\pi}^{-1}\big{[}\frac{1}{n_{K}}\sum_{i,k}\bm{\xi}_{i,k}\left\{\gamma^{X_{i,k+1}}R_{i}(T_{i,k+1})-(\bm{\xi}_{i,k}-\gamma^{X_{i,k+1}}\bm{\zeta}_{\pi,i,k+1})^{\top}{\bm{\theta}}_{\pi}^{*}\right\}\big{]}
=\displaystyle= 𝑫^π1[1nKi,k𝝃i,k{γXi,k+1Ri(Ti,k+1)Qπ(Si,k,Xi,k,Ai,k)+γXi,k+1Qπ(Si,k+1,Xi,k+1,π)ϕi,k,1+ϵi,k,1}]\displaystyle\widehat{\bm{D}}_{\pi}^{-1}\big{[}\frac{1}{n_{K}}\sum_{i,k}\bm{\xi}_{i,k}\{\underbrace{\gamma^{X_{i,k+1}}R_{i}(T_{i,k+1})-Q^{\pi}(S_{i,k},X_{i,k},A_{i,k})+\gamma^{X_{i,k+1}}Q^{\pi}(S_{i,k+1},X_{i,k+1},\pi)}_{\phi_{i,k,1}}+\epsilon_{i,k,1}\}\big{]}
=\displaystyle= 𝑫π1𝑨ξϕ1+(𝑫^π1𝑫π1)𝑨ξϕ1+𝑫^π1𝑨ξϵ1,\displaystyle\bm{D}_{\pi}^{-1}\bm{A}_{\xi}^{\top}\bm{\phi}_{1}+(\widehat{\bm{D}}_{\pi}^{-1}-\bm{D}_{\pi}^{-1})\bm{A}_{\xi}^{\top}\bm{\phi}_{1}+\widehat{\bm{D}}_{\pi}^{-1}\bm{A}_{\xi}^{\top}\bm{\epsilon}_{1}\,,

where ϵi,k,1=Qπ(Si,k,Xi,k,Ai,k)𝝃(Si,k,Xi,k,Ai,k)𝜽πγXi,k+1{Qπ(Si,k+1,Xi,k+1,π)𝝃(Si,k+1,Xi,k+1,π)𝜽π},\epsilon_{i,k,1}=Q^{\pi}(S_{i,k},X_{i,k},A_{i,k})-\bm{\xi}(S_{i,k},X_{i,k},A_{i,k})^{\top}\bm{\theta}_{\pi}^{*}-\gamma^{X_{i,k+1}}\{Q^{\pi}(S_{i,k+1},X_{i,k+1},\pi)-\bm{\xi}(S_{i,k+1},X_{i,k+1},\pi)^{\top}\bm{\theta}_{\pi}^{*}\}\,, ϕ1=array({ϕi,k,1}1in,1kKi),\bm{\phi}_{1}=array(\{\phi_{i,k,1}\}_{1\leq i\leq n,1\leq k\leq K_{i}})\,, ϵ1=array({ϵi,k,1}1in,1kKi),\bm{\epsilon}_{1}=array(\{\epsilon_{i,k,1}\}_{1\leq i\leq n,1\leq k\leq K_{i}})\,, and 𝑨ξ=array({𝝃i,k}1in,1kKi)nK×mL,\bm{A}_{\xi}=array(\{\bm{\xi}_{i,k}\}_{1\leq i\leq n,1\leq k\leq K_{i}})\in\mathbb{R}^{n_{K}\times mL}\,, and 𝑫π=E(𝑫^π)\bm{D}_{\pi}=E(\widehat{\bm{D}}_{\pi}). Lemma A.1 gives the properties of 𝝃i,k\bm{\xi}_{i,k}, ϕi,k,1\phi_{i,k,1}, ϵi,k,1\epsilon_{i,k,1}, 𝑫π\bm{D}_{\pi} and 𝑫^π\widehat{\bm{D}}_{\pi}.

Lemma A.1 Under the conditions given in Theorem 1, we have

  • (i)

    For any s𝕊,x+s\in\mathbb{S},x\in\mathbb{R}^{+} and a𝔸a\in\mathbb{A}, |Qπ(s,x,a)|CQ|Q^{\pi}(s,x,a)|\leq C_{Q} for some constant CQC_{Q}.

  • (ii)

    E(\bm{\xi}_{i,k}^{\top}\bm{\xi}_{i,k})=O(L). As n_{K}\rightarrow\infty, \lambda_{max}\{E(\frac{1}{n_{K}}\sum_{i,k}\bm{\xi}_{i,k}\bm{\xi}_{i,k}^{\top})\}=O(1), \lambda_{max}(\frac{1}{n_{K}}\sum_{i,k}\bm{\xi}_{i,k}\bm{\xi}_{i,k}^{\top})=O_{p}(1), \lambda_{min}\{E(\frac{1}{n_{K}}\sum_{i,k}\bm{\xi}_{i,k}\bm{\xi}_{i,k}^{\top})\}\geq\overline{c}_{1}/2, and \lambda_{min}(\frac{1}{n_{K}}\sum_{i,k}\bm{\xi}_{i,k}\bm{\xi}_{i,k}^{\top})\geq\overline{c}_{1}/3 wpa 1.

  • (iii)

    max1in,1kKi|ϕi,k,1|CR+2CQ\max_{1\leq i\leq n,1\leq k\leq K_{i}}|\phi_{i,k,1}|\leq C_{R}+2C_{Q}, max1in,1kKi|ϵi,k,1|2C1Lpd+1\max_{1\leq i\leq n,1\leq k\leq K_{i}}|\epsilon_{i,k,1}|\leq 2C_{1}L^{-\frac{p}{d+1}}.

  • (iv)

    As nKn_{K}\rightarrow\infty, 𝑫π2=O(1)\|\bm{D}_{\pi}\|_{2}=O(1), 𝑫π123c¯1\|\bm{D}_{\pi}^{-1}\|_{2}\leq 3\overline{c}^{-1}, 𝑫^π𝑫π2=Op{L1/2nK1/2log(nK)}\|\widehat{\bm{D}}_{\pi}-\bm{D}_{\pi}\|_{2}=O_{p}\{L^{1/2}n_{K}^{-1/2}\log(n_{K})\}, 𝑫^π1𝑫π12=Op{L1/2nK1/2log(nK)}\|\widehat{\bm{D}}_{\pi}^{-1}-\bm{D}_{\pi}^{-1}\|_{2}=O_{p}\{L^{1/2}n_{K}^{-1/2}\log(n_{K})\}, 𝑫^π126c¯1\|\widehat{\bm{D}}_{\pi}^{-1}\|_{2}\leq 6\overline{c}^{-1} wpa 1.

Together with inequality (8), Markov's inequality and the Cauchy-Schwarz inequality, we can obtain \|\bm{D}_{\pi}^{-1}\bm{A}_{\xi}^{\top}\bm{\phi}_{1}\|_{2}=O_{p}(L^{1/2}n_{K}^{-1/2}), \|(\widehat{\bm{D}}_{\pi}^{-1}-\bm{D}_{\pi}^{-1})\bm{A}_{\xi}^{\top}\bm{\phi}_{1}\|_{2}=O_{p}\{Ln_{K}^{-1}\log(n_{K})\}, \|\widehat{\bm{D}}_{\pi}^{-1}\bm{A}_{\xi}^{\top}\bm{\epsilon}_{1}\|_{2}=O_{p}(L^{-\frac{p}{d+1}}), and therefore

𝜽^π𝜽π=𝑫π1𝑨ξϕ1+Op{LnK1log(nK)}+Op(Lpd+1).\displaystyle\widehat{\bm{\theta}}_{\pi}-{\bm{\theta}}_{\pi}^{*}=\bm{D}_{\pi}^{-1}\bm{A}_{\xi}^{\top}\bm{\phi}_{1}+O_{p}\{Ln_{K}^{-1}\log(n_{K})\}+O_{p}(L^{-\frac{p}{d+1}})\,. (9)

Next, we study the asymptotic properties of \widehat{V}^{\pi}(\mathcal{G})-V^{\pi}(\mathcal{G}). Recall that \widehat{V}^{\pi}(\mathcal{G})=\bm{\zeta}_{\pi,\mathcal{G}}^{\top}\widehat{\bm{\theta}}_{\pi}, and define \sigma^{2}_{\pi,\mathcal{G},1}=\bm{\zeta}_{\pi,\mathcal{G}}^{\top}\bm{D}_{\pi}^{-1}\bm{\Omega}_{\pi,1}(\bm{D}_{\pi}^{\top})^{-1}\bm{\zeta}_{\pi,\mathcal{G}} with \bm{\Omega}_{\pi,1}=E(\bm{A}_{\xi}^{\top}\bm{\phi}_{1}\bm{\phi}_{1}^{\top}\bm{A}_{\xi}). Lemma A.2 gives the properties of \sigma_{\pi,\mathcal{G},1}^{2} and \bm{\Omega}_{\pi,1}.

Lemma A.2 Under the conditions given in Theorem 1, we have: (i) 𝛀π,12=O(1)\|\bm{\Omega}_{\pi,1}\|_{2}=O(1), (ii) σπ,𝒢,12C2𝛇π,𝒢22\sigma_{\pi,\mathcal{G},1}^{2}\geq C_{2}\|\bm{\zeta}_{\pi,\mathcal{G}}\|_{2}^{2} for some constant C2C_{2}.
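For reference, the plug-in variance \sigma^{2}_{\pi,\mathcal{G},1}=\bm{\zeta}_{\pi,\mathcal{G}}^{\top}\bm{D}_{\pi}^{-1}\bm{\Omega}_{\pi,1}(\bm{D}_{\pi}^{\top})^{-1}\bm{\zeta}_{\pi,\mathcal{G}} reduces to a single sandwich product; a minimal sketch, assuming estimates of \bm{\zeta}_{\pi,\mathcal{G}}, \bm{D}_{\pi} and \bm{\Omega}_{\pi,1} are available as arrays, is:

```python
import numpy as np

def sandwich_variance(zeta, D, Omega):
    """Plug-in sandwich variance zeta^T D^{-1} Omega (D^T)^{-1} zeta."""
    u = np.linalg.solve(D.T, zeta)   # u = (D^T)^{-1} zeta, so zeta^T D^{-1} = u^T
    return float(u @ Omega @ u)
```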

By (8), (9) and Lemma A.2, it can be obtained that

σπ,𝒢,11{V^π(𝒢)Vπ(𝒢)}=\displaystyle\sigma_{\pi,\mathcal{G},1}^{-1}\{\widehat{V}^{\pi}(\mathcal{G})-V^{\pi}(\mathcal{G})\}= σπ,𝒢,11[𝜻π,𝒢(𝜽^π𝜽π){Qπ(s,x,π)ξ(s,x,π)𝜽π}𝑑𝒢(s,x)]\displaystyle\sigma_{\pi,\mathcal{G},1}^{-1}\left[\bm{\zeta}_{\pi,\mathcal{G}}^{\top}(\widehat{\bm{\theta}}_{\pi}-\bm{\theta}_{\pi}^{*})-\int\{Q^{\pi}(s,x,\pi)-\xi(s,x,\pi)^{\top}\bm{\theta}_{\pi}^{*}\}d\mathcal{G}(s,x)\right]
=\displaystyle= σπ,𝒢,11𝜻π,𝒢𝑫π1𝑨ξϕ1+Op{LnK1log(nK)}+Op{Lpd+1(1+𝜻π,𝒢21)}\displaystyle\sigma_{\pi,\mathcal{G},1}^{-1}\bm{\zeta}_{\pi,\mathcal{G}}^{\top}\bm{D}_{\pi}^{-1}\bm{A}_{\xi}^{\top}\bm{\phi}_{1}+O_{p}\{Ln_{K}^{-1}\log(n_{K})\}+O_{p}\{L^{-\frac{p}{d+1}}(1+\|\bm{\zeta}_{\pi,\mathcal{G}}\|_{2}^{-1})\}
=\displaystyle= σπ,𝒢,11𝜻π,𝒢𝑫π1𝑨ξϕ1+op(nK1/2),\displaystyle\sigma_{\pi,\mathcal{G},1}^{-1}\bm{\zeta}_{\pi,\mathcal{G}}^{\top}\bm{D}_{\pi}^{-1}\bm{A}_{\xi}^{\top}\bm{\phi}_{1}+o_{p}(n_{K}^{-1/2})\,,

when L\ll n_{K}^{1/2}/\log(n_{K}) and L^{\frac{2p}{d+1}}\gg n_{K}(1+\|\bm{\zeta}_{\pi,\mathcal{G}}\|_{2}^{-2}). Using the martingale central limit theorem (McLeish, 1974, Corollary 2.8) and following the arguments in Section E.5 of Shi et al. (2021), we have n_{K}^{1/2}\sigma_{\pi,\mathcal{G},1}^{-1}\{\widehat{V}^{\pi}(\mathcal{G})-V^{\pi}(\mathcal{G})\}\stackrel{d}{\rightarrow}N(0,1). \blacksquare

Proof of Theorem 2:

Theorem 2 will be proved in three steps. In the first step, we establish the asymptotic properties of \widehat{\mathcal{P}}_{X}(x;\bm{z}) under the modulated renewal process model (6) in Section 3.2. In the second step, we use the properties of \{\widehat{\mathcal{P}}_{X}(x;\bm{z})-\mathcal{P}_{X}(x;\bm{z})\} to obtain \widetilde{\bm{\theta}}_{\pi}-{\bm{\theta}}_{\pi}^{*}=\bm{D}_{\pi}^{-1}(\bm{A}_{\xi}^{\top}\bm{\phi}_{2}+\bm{A}_{I}^{\top}\bm{\varphi}_{2})+O_{p}\{Ln_{K}^{-1}\log(n_{K})\}+O_{p}(L^{-\frac{p}{d+1}}). Lastly, we establish the asymptotic properties of n_{K}^{1/2}\sigma_{\pi,\mathcal{G},2}^{-1}\{\widetilde{V}^{\pi}(\mathcal{G})-V^{\pi}(\mathcal{G})\} and complete the proof.

Step 1. Asymptotic properties of \widehat{\mathcal{P}}_{X}.

Define Mi,k+1(x)=0x{dNi,k+1(u)I(Xi,k+1u)exp(𝜷0𝒁i,k)dΛ0(u)}M_{i,k+1}(x)=\int_{0}^{x}\{dN_{i,k+1}(u)-I(X_{i,k+1}\geq u)\exp(\bm{\beta}_{0}^{\top}\bm{Z}_{i,k})d\Lambda_{0}(u)\} and ϕi,k,z=0τ{𝒁i,k𝒛¯(u)}𝑑Mi,k+1(u).\bm{\phi}_{i,k,z}=\int_{0}^{\tau}\{\bm{Z}_{i,k}-\bar{\bm{z}}(u)\}dM_{i,k+1}(u)\,. Then Lemma A.3 gives the asymptotic properties for 𝜷^\widehat{\bm{\beta}} and Λ^0(x)\widehat{\Lambda}_{0}(x).

Lemma A.3. Under the conditions given in Theorem 2, we have: (i) n_{K}^{1/2}(\widehat{\bm{\beta}}-\bm{\beta}_{0}) converges weakly to a normal distribution \mathcal{N}(0,\bm{\Omega}_{z}^{-1}) and can be asymptotically written as

nK1/2(𝜷^𝜷0)=nK1/2i,k𝛀z1ϕi,k,z+op(1).n_{K}^{1/2}(\widehat{\bm{\beta}}-\bm{\beta}_{0})=n_{K}^{-1/2}\sum_{i,k}\bm{\Omega}_{z}^{-1}\bm{\phi}_{i,k,z}+o_{p}(1)\,.

(ii) nK1/2{Λ^0(x)Λ0(x)}n_{K}^{1/2}\{\widehat{\Lambda}_{0}(x)-\Lambda_{0}(x)\} converges weakly to a mean zero Gaussian process, and can be asymptotically written as

nK1/2{Λ^0(x)Λ0(x)}=nK1/2i,k[0xsz,01(u)𝑑Mi,k+1(u)0x𝒛¯(u)𝑑Λ0(u)𝛀z1ϕi,k,z]+op(1)n_{K}^{1/2}\{\widehat{\Lambda}_{0}(x)-\Lambda_{0}(x)\}=n_{K}^{-1/2}\sum_{i,k}\big{[}\int_{0}^{x}s_{z,0}^{-1}(u)dM_{i,k+1}(u)-\int_{0}^{x}\overline{\bm{z}}(u)^{\top}d\Lambda_{0}(u)\bm{\Omega}_{z}^{-1}\bm{\phi}_{i,k,z}\big{]}+o_{p}(1)

for each x[0,τ]x\in[0,\tau].

Furthermore, by Taylor’s expansion, we can rewrite nK1/2{Λ^(x;𝒛)Λ(x;𝒛)}n_{K}^{1/2}\{\widehat{\Lambda}(x;\bm{z})-\Lambda(x;\bm{z})\} as

nK1/2{Λ^0(x)exp(𝜷^𝒛)Λ0(x)exp(𝜷0𝒛)}\displaystyle n_{K}^{1/2}\{\widehat{\Lambda}_{0}(x)\exp(\widehat{\bm{\beta}}^{\top}\bm{z})-\Lambda_{0}(x)\exp(\bm{\beta}_{0}^{\top}\bm{z})\}
=\displaystyle= nK1/2{Λ^0(x)Λ0(x)}exp(𝜷0𝒛)+Λ0(x)exp(𝜷0𝒛)nK1/2(𝜷^𝜷0)𝒛+Op(nK1/2)\displaystyle n_{K}^{1/2}\{\widehat{\Lambda}_{0}(x)-\Lambda_{0}(x)\}\exp(\bm{\beta}_{0}^{\top}\bm{z})+\Lambda_{0}(x)\exp(\bm{\beta}_{0}^{\top}\bm{z})n_{K}^{1/2}(\widehat{\bm{\beta}}-\bm{\beta}_{0})^{\top}\bm{z}+O_{p}(n_{K}^{-1/2})
=\displaystyle= nK1/2i,kexp(𝜷0𝒛)[0xsz,01(u)𝑑Mi,k+1(u)+0x{𝒛𝒛¯(u)}𝑑Λ0(u)𝛀z1ϕi,k,z]+op(1),\displaystyle n_{K}^{-1/2}\sum_{i,k}\exp(\bm{\beta}_{0}^{\top}\bm{z})\big{[}\int_{0}^{x}s^{-1}_{z,0}(u)dM_{i,k+1}(u)+\int_{0}^{x}\{\bm{z}-\overline{\bm{z}}(u)\}^{\top}d\Lambda_{0}(u)\bm{\Omega}_{z}^{-1}\bm{\phi}_{i,k,z}\big{]}+o_{p}(1)\,,

which is also O_{p}(1) for each fixed x and \bm{z}. Recall that \mathcal{P}_{X}(x;\bm{z})=1-\exp\{-\Lambda(x;\bm{z})\}. Therefore, n_{K}^{1/2}\{\widehat{\mathcal{P}}_{X}(x;\bm{z})-\mathcal{P}_{X}(x;\bm{z})\} is asymptotically equivalent to

nK1/2i,kexp{Λ(x;𝒛)+𝜷0𝒛}[0xsz,01(u)𝑑Mi,k+1(u)+0x{𝒛𝒛¯(u)}𝑑Λ0(u)𝛀z1ϕi,k,z].\displaystyle n_{K}^{-1/2}\sum_{i,k}\exp\{-\Lambda(x;\bm{z})+\bm{\beta}_{0}^{\top}\bm{z}\}[\int_{0}^{x}s_{z,0}^{-1}(u)dM_{i,k+1}(u)+\int_{0}^{x}\{\bm{z}-\overline{\bm{z}}(u)\}^{\top}d\Lambda_{0}(u)\bm{\Omega}_{z}^{-1}\bm{\phi}_{i,k,z}]\,. (10)

Step 2. Asymptotic properties of 𝛉~π\widetilde{\bm{\theta}}_{\pi}

Denote 𝑫~π=1iKii=1nk=0Ki1𝝃i,k(𝝃i,k𝒰~π,i,k+1)\widetilde{\bm{D}}_{\pi}=\frac{1}{\sum_{i}K_{i}}\sum_{i=1}^{n}\sum_{k=0}^{K_{i}-1}\bm{\xi}_{i,k}(\bm{\xi}_{i,k}-\widetilde{\mathcal{U}}_{\pi,i,k+1})^{\top} and define

𝒰π,i,k+1=γx𝝃(Si,k+1,x,π)𝑑𝒫X(x;Si,k+1,Si,k,Xi,k,Ai,k).\mathcal{U}_{\pi,i,k+1}=\int\gamma^{x}\bm{\xi}(S_{i,k+1},x,\pi)d\mathcal{P}_{X}(x;S_{i,k+1},S_{i,k},X_{i,k},A_{i,k})\,.

By (8) and the definition of \widetilde{\mathcal{U}}_{\pi,i,k+1} given in Section 3.3, (\widetilde{\bm{\theta}}_{\pi}-{\bm{\theta}}_{\pi}^{*}) can be rewritten as

𝑫~π1[1nKi,k𝝃i,k{γXi,k+1Ri(Ti,k+1)(𝝃i,k𝒰~π,i,k+1)𝜽π}]\displaystyle\ \widetilde{\bm{D}}_{\pi}^{-1}\big{[}\frac{1}{n_{K}}\sum_{i,k}\bm{\xi}_{i,k}\{\gamma^{X_{i,k+1}}R_{i}(T_{i,k+1})-(\bm{\xi}_{i,k}-\widetilde{\mathcal{U}}_{\pi,i,k+1})^{\top}{\bm{\theta}}_{\pi}^{*}\}\big{]}
=\displaystyle= 𝑫~π1[1nKi,k𝝃i,k{γXi,k+1Ri(Ti,k+1)Qπ(Si,k,Xi,k,Ai,k)+γxQπ(Si,k+1,x,π)𝑑𝒫^X(x;𝒁i,k)+ϵi,k,2}]\displaystyle\ \widetilde{\bm{D}}_{\pi}^{-1}\big{[}\frac{1}{n_{K}}\sum_{i,k}\bm{\xi}_{i,k}\{\gamma^{X_{i,k+1}}R_{i}(T_{i,k+1})-Q^{\pi}(S_{i,k},X_{i,k},A_{i,k})+\int\gamma^{x}Q^{\pi}(S_{i,k+1},x,\pi)d\widehat{\mathcal{P}}_{X}(x;\bm{Z}_{i,k})+\epsilon_{i,k,2}\}\big{]}
=\displaystyle= 𝑫~π1[1nKi,k𝝃i,k{γXi,k+1Ri(Ti,k+1)Qπ(Si,k,Xi,k,Ai,k)+γxQπ(Si,k+1,x,π)𝑑𝒫X(x;𝒁i,k)ϕi,k,2+ϵi,k,2}\displaystyle\ \widetilde{\bm{D}}_{\pi}^{-1}\big{[}\frac{1}{n_{K}}\sum_{i,k}\bm{\xi}_{i,k}\{\underbrace{\gamma^{X_{i,k+1}}R_{i}(T_{i,k+1})-Q^{\pi}(S_{i,k},X_{i,k},A_{i,k})+\int\gamma^{x}Q^{\pi}(S_{i,k+1},x,\pi)d\mathcal{P}_{X}(x;\bm{Z}_{i,k})}_{\phi_{i,k,2}}+\epsilon_{i,k,2}\}
+1nKi,k𝝃i,kγxQπ(Si,k+1,x,π)d{𝒫^X(x;𝒁i,k)𝒫X(x;𝒁i,k)}],\displaystyle+\frac{1}{n_{K}}\sum_{i,k}\bm{\xi}_{i,k}\int\gamma^{x}Q^{\pi}(S_{i,k+1},x,\pi)d\{\widehat{\mathcal{P}}_{X}(x;\bm{Z}_{i,k})-\mathcal{P}_{X}(x;\bm{Z}_{i,k})\}\big{]}\,,

where ϵi,k,2={Qπ(Si,k,Xi,k,Ai,k)𝝃i,k𝜽π}+γx{𝝃(Si,k+1,x,π)𝜽πQπ(Si,k+1,x,π)}𝑑𝒫^X(x;𝒁i,k).\epsilon_{i,k,2}=\{Q^{\pi}(S_{i,k},X_{i,k},A_{i,k})-\bm{\xi}_{i,k}^{\top}\bm{\theta}_{\pi}^{*}\}+\int\gamma^{x}\{\bm{\xi}(S_{i,k+1},x,\pi)^{\top}\bm{\theta}_{\pi}^{*}-Q^{\pi}(S_{i,k+1},x,\pi)\}d\widehat{\mathcal{P}}_{X}(x;\bm{Z}_{i,k})\,. Lemma A.4 gives the properties of ϕi,k,2\phi_{i,k,2}, ϵi,k,2\epsilon_{i,k,2} and 𝑫~π\widetilde{\bm{D}}_{\pi}.

Lemma A.4 Under the conditions given in Theorem 2, we have: (i) max1in,1kKi|ϕi,k,2|CR+2CQ\max_{1\leq i\leq n,1\leq k\leq K_{i}}|\phi_{i,k,2}|\leq C_{R}+2C_{Q}, max1in,1kKi|ϵi,k,2|3C1Lpd+1\max_{1\leq i\leq n,1\leq k\leq K_{i}}|\epsilon_{i,k,2}|\leq 3C_{1}L^{-\frac{p}{d+1}} wpa 1. (ii) 𝐃~π𝐃π2=Op{L1/2nK1/2log(nK)}\|\widetilde{\bm{D}}_{\pi}-\bm{D}_{\pi}\|_{2}=O_{p}\{L^{1/2}n_{K}^{-1/2}\log(n_{K})\}, 𝐃~π126c¯1\|\widetilde{\bm{D}}_{\pi}^{-1}\|_{2}\leq 6\overline{c}^{-1}, 𝐃~π1𝐃π12=Op{L1/2nK1/2log(nK)}\|\widetilde{\bm{D}}_{\pi}^{-1}-\bm{D}_{\pi}^{-1}\|_{2}=O_{p}\{L^{1/2}n_{K}^{-1/2}\log(n_{K})\}.

Similar to the proof of Theorem 1, the first term in the above equation can be written as 𝑫π1𝑨ξϕ2+Op{LnK1log(nK)}+Op(Lpd+1)\bm{D}_{\pi}^{-1}\bm{A}_{\xi}^{\top}\bm{\phi}_{2}+O_{p}\{Ln_{K}^{-1}\log(n_{K})\}+O_{p}(L^{-\frac{p}{d+1}}) with ϕ2=array({ϕi,k,2}1in,1kKi).\bm{\phi}_{2}=array(\{\phi_{i,k,2}\}_{1\leq i\leq n,1\leq k\leq K_{i}})\,. For the second term, by the asymptotic expression given in (10), we have

1nKj,l𝝃j,l0τγxQπ(Sj,l+1,x,π)d{𝒫^X(x;𝒁j,l)𝒫X(x;𝒁j,l)}\displaystyle\ \frac{1}{n_{K}}\sum_{j,l}\bm{\xi}_{j,l}\int_{0}^{\tau}\gamma^{x}Q^{\pi}(S_{j,l+1},x,\pi)\ d\big{\{}\widehat{\mathcal{P}}_{X}(x;\bm{Z}_{j,l})-{\mathcal{P}}_{X}(x;\bm{Z}_{j,l})\big{\}}
=\displaystyle= 0τnK1j,l𝝃j,lγxQπ(Sj,l+1,x,π)exp{Λ(x;𝒁j,l)+𝜷0𝒁j,l}𝑮π,1(x)sz,01(x)nK1i,kdMi,k+1(x)\displaystyle\ \int_{0}^{\tau}\underbrace{{n_{K}}^{-1}\sum_{j,l}\bm{\xi}_{j,l}\gamma^{x}Q^{\pi}(S_{j,l+1},x,\pi)\exp\{-\Lambda(x;\bm{Z}_{j,l})+\bm{\beta}_{0}^{\top}\bm{Z}_{j,l}\}}_{\bm{G}_{\pi,1}(x)}s_{z,0}^{-1}(x)n_{K}^{-1}\sum_{i,k}dM_{i,k+1}(x)
+nK1j,l0τ𝝃j,lγxQπ(Sj,l+1,x,π){𝒁j,l𝒛¯(x)}𝑑Px(x;Zj,l)𝑮π,3𝛀z1nK1i,kϕi,k,z\displaystyle+\underbrace{{n_{K}}^{-1}\sum_{j,l}\int_{0}^{\tau}\bm{\xi}_{j,l}\gamma^{x}Q^{\pi}(S_{j,l+1},x,\pi)\{\bm{Z}_{j,l}-\overline{\bm{z}}(x)\}^{\top}dP_{x}(x;Z_{j,l})}_{\bm{G}_{\pi,3}}\bm{\Omega}_{z}^{-1}n_{K}^{-1}\sum_{i,k}\bm{\phi}_{i,k,z}
0τxτnK1j,l𝝃j,lγuQπ(Sj,l+1,u,π)exp{Λ(u;𝒁j,l)+2𝜷0𝒁j,l}𝑮π,2(u)𝑑Λ0(u)sz,01(x)nK1i,kdMi,k+1(x)\displaystyle-\int_{0}^{\tau}\int_{x}^{\tau}\underbrace{{n_{K}}^{-1}\sum_{j,l}\bm{\xi}_{j,l}\gamma^{u}Q^{\pi}(S_{j,l+1},u,\pi)\exp\{-\Lambda(u;\bm{Z}_{j,l})+2\bm{\beta}_{0}^{\top}\bm{Z}_{j,l}\}}_{\bm{G}_{\pi,2}(u)}d\Lambda_{0}(u)s_{z,0}^{-1}(x)n_{K}^{-1}\sum_{i,k}dM_{i,k+1}(x)
nK1j,l0τ𝝃j,lγxQπ(Sj,l+1,x,π)exp(𝜷0𝒁j,l){𝒁j,lΛ0(x)0xz¯(u)𝑑Λ0(u)}𝑑PX(x;𝒁j,l)𝑮π,4\displaystyle-\underbrace{{n_{K}}^{-1}\sum_{j,l}\int_{0}^{\tau}\bm{\xi}_{j,l}\gamma^{x}Q^{\pi}(S_{j,l+1},x,\pi)\exp(\bm{\beta}_{0}^{\top}\bm{Z}_{j,l})\{\bm{Z}_{j,l}\Lambda_{0}(x)-\int_{0}^{x}\overline{z}(u)d\Lambda_{0}(u)\}^{\top}dP_{X}(x;\bm{Z}_{j,l})}_{\bm{G}_{\pi,4}}
×𝛀z1nK1i,kϕi,k,z\displaystyle\times\bm{\Omega}_{z}^{-1}n_{K}^{-1}\sum_{i,k}\bm{\phi}_{i,k,z}
=\displaystyle= 0τ{𝑮π,1(x)xτ𝑮π,2(u)𝑑Λ0(u)}sz,01(x)nK1i,kdMi,k+1(x)+(𝑮π,3𝑮π,4)𝛀z1nK1i,kϕi,k,z\displaystyle\int_{0}^{\tau}\{\bm{G}_{\pi,1}(x)-\int_{x}^{\tau}\bm{G}_{\pi,2}(u)d\Lambda_{0}(u)\}s_{z,0}^{-1}(x)n_{K}^{-1}\sum_{i,k}dM_{i,k+1}(x)+(\bm{G}_{\pi,3}-\bm{G}_{\pi,4})\bm{\Omega}_{z}^{-1}n_{K}^{-1}\sum_{i,k}\bm{\phi}_{i,k,z}
=\displaystyle= nK1i,k[0τ{𝒈π,1(x)xτ𝒈π,2(u)𝑑Λ0(u)}sz,01(x)𝑑Mi,k+1(x)+(𝒈π,3𝒈π,4)𝛀z1ϕi,k,z]φi,k,2+op(nK1/2).\displaystyle\ n_{K}^{-1}\sum_{i,k}\underbrace{\big{[}\int_{0}^{\tau}\{\bm{g}_{\pi,1}(x)-\int_{x}^{\tau}\bm{g}_{\pi,2}(u)d\Lambda_{0}(u)\}s_{z,0}^{-1}(x)dM_{i,k+1}(x)+(\bm{g}_{\pi,3}-\bm{g}_{\pi,4})\bm{\Omega}_{z}^{-1}\bm{\phi}_{i,k,z}\big{]}}_{\varphi_{i,k,2}}+o_{p}(n_{K}^{-1/2})\,.

Here, 𝒈π,k=E(𝑮π,k)\bm{g}_{\pi,k}=E(\bm{G}_{\pi,k}) for k=1,2,3,4k=1,2,3,4, and the last equation follows from the convergence of 𝑮π,k\bm{G}_{\pi,k}, which could be obtained by Assumption A6. Lastly, denote 𝝋2=array({𝝋i,k,2}1in,1kKi)\bm{\varphi}_{2}=array(\{\bm{\varphi}_{i,k,2}\}_{1\leq i\leq n,1\leq k\leq K_{i}}) and 𝑨I=array({1}1in,1kKi)\bm{A}_{I}=array(\{1\}_{1\leq i\leq n,1\leq k\leq K_{i}}), then it is not hard to obtain

𝜽~π𝜽π=𝑫π1(𝑨ξϕ2+𝑨I𝝋2)+Op{LnK1log(nK)}+Op(Lpd+1).\displaystyle\widetilde{\bm{\theta}}_{\pi}-{\bm{\theta}}_{\pi}^{*}=\bm{D}_{\pi}^{-1}(\bm{A}_{\xi}^{\top}\bm{\phi}_{2}+\bm{A}_{I}^{\top}\bm{\varphi}_{2})+O_{p}\{Ln_{K}^{-1}\log(n_{K})\}+O_{p}(L^{-\frac{p}{d+1}})\,. (11)

Step 3. Asymptotic properties of V~π(𝒢)\widetilde{V}^{\pi}(\mathcal{G})

Define σπ,𝒢,22=𝜻π,𝒢𝑫π1𝛀π,2(𝑫π)1𝜻π,𝒢\sigma^{2}_{\pi,\mathcal{G},2}=\bm{\zeta}_{\pi,\mathcal{G}}^{\top}\bm{D}_{\pi}^{-1}\bm{\Omega}_{\pi,2}(\bm{D}_{\pi}^{\top})^{-1}\bm{\zeta}_{\pi,\mathcal{G}} with

𝛀π,2=E{(𝑨ξϕ2+𝑨I𝝋2)(𝑨ξϕ2+𝑨I𝝋2)}.\displaystyle\bm{\Omega}_{\pi,2}=E\{(\bm{A}_{\xi}^{\top}\bm{\phi}_{2}+\bm{A}_{I}^{\top}\bm{\varphi}_{2})(\bm{A}_{\xi}^{\top}\bm{\phi}_{2}+\bm{A}_{I}^{\top}\bm{\varphi}_{2})^{\top}\}\,. (12)

Then, similarly to Lemma A.2, we can obtain \|\bm{\Omega}_{\pi,2}\|_{2}=O(1) and \sigma_{\pi,\mathcal{G},2}^{2}\geq C_{4}\|\bm{\zeta}_{\pi,\mathcal{G}}\|_{2}^{2} for some constant C_{4}. Together with (11), \sigma_{\pi,\mathcal{G},2}^{-1}\{\widetilde{V}^{\pi}(\mathcal{G})-V^{\pi}(\mathcal{G})\} can be written as

σπ,𝒢,21[𝜻π,𝒢(𝜽~π𝜽π){Qπ(s,x,π)ξ(s,x,π)𝜽π}𝑑𝒢(s,x)]\displaystyle\sigma_{\pi,\mathcal{G},2}^{-1}\left[\bm{\zeta}_{\pi,\mathcal{G}}^{\top}(\widetilde{\bm{\theta}}_{\pi}-\bm{\theta}_{\pi}^{*})-\int\{Q^{\pi}(s,x,\pi)-\xi(s,x,\pi)^{\top}\bm{\theta}_{\pi}^{*}\}d\mathcal{G}(s,x)\right]
=\displaystyle= σπ,𝒢,21𝜻π,𝒢𝑫π1(𝑨ξϕ2+𝑨I𝝋2)+Op{LnK1log(nK)}+Op{Lpd+1(1+𝜻π,𝒢21)}\displaystyle\sigma_{\pi,\mathcal{G},2}^{-1}\bm{\zeta}_{\pi,\mathcal{G}}^{\top}\bm{D}_{\pi}^{-1}(\bm{A}_{\xi}^{\top}\bm{\phi}_{2}+\bm{A}_{I}^{\top}\bm{\varphi}_{2})+O_{p}\{Ln_{K}^{-1}\log(n_{K})\}+O_{p}\{L^{-\frac{p}{d+1}}(1+\|\bm{\zeta}_{\pi,\mathcal{G}}\|_{2}^{-1})\}
=\displaystyle= σπ,𝒢,21𝜻π,𝒢𝑫π1(𝑨ξϕ2+𝑨I𝝋2)+op(nK1/2),\displaystyle\sigma_{\pi,\mathcal{G},2}^{-1}\bm{\zeta}_{\pi,\mathcal{G}}^{\top}\bm{D}_{\pi}^{-1}(\bm{A}_{\xi}^{\top}\bm{\phi}_{2}+\bm{A}_{I}^{\top}\bm{\varphi}_{2})+o_{p}(n_{K}^{-1/2})\,,

when L\ll n_{K}^{1/2}/\log(n_{K}) and L^{\frac{2p}{d+1}}\gg n_{K}(1+\|\bm{\zeta}_{\pi,\mathcal{G}}\|_{2}^{-2}). Similar to the proof of Theorem 1, we can use the martingale central limit theorem (McLeish, 1974, Corollary 2.8) to obtain that n_{K}^{1/2}\sigma_{\pi,\mathcal{G},2}^{-1}\{\widetilde{V}^{\pi}(\mathcal{G})-V^{\pi}(\mathcal{G})\}\stackrel{d}{\rightarrow}N(0,1). \blacksquare

Proof of Theorem 3(i):

We first give the following lemma, which is required for the kernel-based estimates. Lemma A.5. For any function g(x), define g^{*}(x)=\int_{0}^{\tau}\frac{1}{b_{n}}K(\frac{u-x}{b_{n}})g(u)du. Then |g^{*}(x)-g(x)|=O(b_{n}^{2}) if g(u) has a bounded second-order derivative in a neighborhood of x.
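As an illustration of the kernel smoothing used in Lemma A.5 and in Step 1 below, a minimal sketch of \widehat{\lambda}_{0}(x)=\int_{0}^{\tau}b_{n}^{-1}K\{(u-x)/b_{n}\}d\widehat{\Lambda}_{0}(u), computed from the jump points of \widehat{\Lambda}_{0}, is given here; the Gaussian kernel and all names are illustrative assumptions.

```python
import numpy as np

def smoothed_baseline_hazard(x, jump_times, jump_sizes, bn):
    """Kernel-smoothed baseline hazard
        lambda_hat_0(x) = sum_j b_n^{-1} K((t_j - x)/b_n) * dLambda_hat_0(t_j),
    i.e. the kernel applied to the increments of the estimated cumulative
    baseline hazard.  A Gaussian kernel is used purely for illustration.

    jump_times : (J,) jump points of Lambda_hat_0
    jump_sizes : (J,) corresponding increments dLambda_hat_0(t_j)
    bn         : bandwidth b_n
    """
    u = (np.asarray(jump_times) - x) / bn
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)   # standard normal kernel
    return float(np.sum(K * np.asarray(jump_sizes)) / bn)
```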

Theorem 3 will be proved in two steps. In the first step, we obtain the asymptotic properties of \widehat{\lambda}_{0}(x) and \widehat{\lambda}(x;\bm{z}). Then, using the properties of \widehat{\lambda}(x;\bm{z}), we derive the asymptotic properties of \widehat{\bm{\theta}}_{\pi,\mathcal{I}} and \widehat{V}^{\pi}_{\mathcal{I}}(\mathcal{G}).

Step 1. Asymptotic properties of λ^(x;𝐳)\widehat{\lambda}(x;\bm{z})

Rewrite

λ^(x)λ(x)=\displaystyle\widehat{\lambda}(x)-\lambda(x)= λ^0(x)exp(𝜷^𝒛)λ0(x)exp(𝜷0𝒛)\displaystyle\widehat{\lambda}_{0}(x)\exp(\widehat{\bm{\beta}}^{\top}\bm{z})-\lambda_{0}(x)\exp(\bm{\beta}_{0}^{\top}\bm{z})
=\displaystyle= {λ^0(x)λ0(x)}exp{𝜷0𝒛}+λ0(x){exp(𝜷^𝒛)exp(𝜷0𝒛)}+rn(x;𝒛)\displaystyle\{\widehat{\lambda}_{0}(x)-\lambda_{0}(x)\}\exp\{\bm{\beta}_{0}^{\top}\bm{z}\}+\lambda_{0}(x)\{\exp(\widehat{\bm{\beta}}^{\top}\bm{z})-\exp(\bm{\beta}_{0}^{\top}\bm{z})\}+r_{n}(x;\bm{z})

with rn(x;𝒛)={λ^0(x)λ0(x)}{exp(𝜷^𝒛)exp(𝜷0𝒛)}r_{n}(x;\bm{z})=\{\widehat{\lambda}_{0}(x)-\lambda_{0}(x)\}\{\exp(\widehat{\bm{\beta}}^{\top}\bm{z})-\exp(\bm{\beta}_{0}^{\top}\bm{z})\}.

For the first term, we first rewrite

λ^0(x)λ0(x)=\displaystyle\widehat{\lambda}_{0}(x)-\lambda_{0}(x)= 0τ1bnK(uxbn)d{Λ^0(u)Λ0(u)}+0τ1bnK(uxbn)𝑑Λ0(u)λ0(x)λ0(x)\displaystyle\int_{0}^{\tau}\frac{1}{b_{n}}K(\frac{u-x}{b_{n}})d\{\widehat{\Lambda}_{0}(u)-\Lambda_{0}(u)\}+\underbrace{\int_{0}^{\tau}\frac{1}{b_{n}}K(\frac{u-x}{b_{n}})d\Lambda_{0}(u)}_{\lambda_{0}^{*}(x)}-\lambda_{0}(x)
=\displaystyle= {λ^0(x)λ0(x)}+{λ0(x)λ0(x)}\displaystyle\{\widehat{\lambda}_{0}(x)-\lambda_{0}^{*}(x)\}+\{\lambda_{0}^{*}(x)-\lambda_{0}(x)\}

Following Theorem 4.2.2 of Ramlau-Hansen (1983), we can show that (n_{K}b_{n})^{1/2}\{\widehat{\lambda}_{0}(x)-\lambda_{0}^{*}(x)\} converges weakly to a normal distribution for each x\in[0,\tau], and therefore |\widehat{\lambda}_{0}(x)-\lambda_{0}^{*}(x)|=O_{p}(n_{K}^{-1/2}b_{n}^{-1/2}). Moreover, by Lemma A.5, |\lambda_{0}^{*}(x)-\lambda_{0}(x)|=O(b_{n}^{2}) if \lambda_{0}(x) has a bounded second derivative in a neighborhood of x. Therefore, together with Lemma A.3(ii) and Lemma A.5, n_{K}^{1/2}\{\widehat{\lambda}_{0}(x)-\lambda_{0}(x)\} is asymptotically equivalent to

0τ1bnK(uxbn)d[nK1/2{Λ^0(u)Λ0(u)}]+O(nK1/2bn2)\displaystyle\int_{0}^{\tau}\frac{1}{b_{n}}K(\frac{u-x}{b_{n}})d[n_{K}^{1/2}\{\widehat{\Lambda}_{0}(u)-\Lambda_{0}(u)\}]+O(n_{K}^{1/2}b_{n}^{2})
=\displaystyle= 0τ1bnK(uxbn)nK1/2i,k[sz,01(u)dMi,k+1(u)z¯(u)dΛ0(u)𝛀z1ϕi,k,z]+O(nK1/2bn2)\displaystyle\int_{0}^{\tau}\frac{1}{b_{n}}K(\frac{u-x}{b_{n}})n_{K}^{-1/2}\sum_{i,k}[s_{z,0}^{-1}(u)dM_{i,k+1}(u)-\overline{z}^{\top}(u)d\Lambda_{0}(u)\bm{\Omega}_{z}^{-1}\phi_{i,k,z}]+O(n_{K}^{1/2}b_{n}^{2})
=\displaystyle= nK1/2i,k[0τ1bnK(uxbn)sz,01(u)𝑑Mi,k+1(u)λ0(x)z¯(x)𝛀z1ϕi,k,z]+O(nK1/2bn2).\displaystyle n_{K}^{-1/2}\sum_{i,k}\big{[}\int_{0}^{\tau}\frac{1}{b_{n}}K(\frac{u-x}{b_{n}})s_{z,0}^{-1}(u)dM_{i,k+1}(u)-\lambda_{0}(x)\overline{z}^{\top}(x)\bm{\Omega}_{z}^{-1}\phi_{i,k,z}\big{]}+O(n_{K}^{1/2}b_{n}^{2})\,. (13)

On the other hand, by Taylor expansion and Lemma A.3(i), we have

nK1/2λ0(x){exp(𝜷^𝒛)exp(𝜷0𝒛)}=\displaystyle n_{K}^{1/2}\lambda_{0}(x)\{\exp(\widehat{\bm{\beta}}^{\top}\bm{z})-\exp(\bm{\beta}_{0}^{\top}\bm{z})\}= λ0(x)exp(𝜷0𝒛)𝒛nK1/2(β^β0)+Op(β^β022nK1/2)\displaystyle\lambda_{0}(x)\exp(\bm{\beta}_{0}^{\top}\bm{z})\bm{z}^{\top}n_{K}^{1/2}(\widehat{\beta}-\beta_{0})+O_{p}(\|\widehat{\beta}-\beta_{0}\|_{2}^{2}n_{K}^{1/2})
=\displaystyle= λ0(x;𝒛)𝒛𝛀z1nK1/2i,kϕi,k,z+op(1).\displaystyle\lambda_{0}(x;\bm{z})\bm{z}^{\top}\bm{\Omega}_{z}^{-1}n_{K}^{-1/2}\sum_{i,k}\phi_{i,k,z}+o_{p}(1)\,. (14)

Combining (13) and (14), we have rn=Op(nK1bn1/2)r_{n}=O_{p}(n_{K}^{-1}b_{n}^{-1/2}), and therefore

nK1/2{λ^(x;𝒛)λ(x;𝒛)}=\displaystyle n_{K}^{1/2}\{\widehat{\lambda}(x;\bm{z})-\lambda(x;\bm{z})\}= nK1/2i,k[exp(𝜷0𝒛)0τ1bnK(uxbn)sz,01(u)dMi,k+1(u)\displaystyle\ n_{K}^{-1/2}\sum_{i,k}\big{[}\exp(\bm{\beta}_{0}^{\top}\bm{z})\int_{0}^{\tau}\frac{1}{b_{n}}K(\frac{u-x}{b_{n}})s_{z,0}^{-1}(u)dM_{i,k+1}(u)
+λ(x;𝒛){𝒛𝒛¯(x)}𝛀z1ϕi,k,z]+Op(nK1/2bn2+nK1/2bn1/2).\displaystyle+\lambda(x;\bm{z})\{\bm{z}-\overline{\bm{z}}(x)\}^{\top}\bm{\Omega}_{z}^{-1}\phi_{i,k,z}\big{]}+O_{p}(n_{K}^{1/2}b_{n}^{2}+n_{K}^{-1/2}b_{n}^{-1/2})\,. (15)

Step 2. Asymptotic properties of 𝛉^\widehat{\bm{\theta}}_{\mathcal{I}}

Similar to (8), there exists \bm{\theta}^{*}_{\pi,\mathcal{I}}\in\mathbb{R}^{mL} that satisfies \sup_{s\in\mathbb{S},x\in\mathbb{R}^{+},a\in\mathbb{A}}|Q_{\mathcal{I}}^{\pi}(s,x,a)-\xi(s,x,a)^{\top}\bm{\theta}_{\pi,\mathcal{I}}^{*}|\leq C_{1}L^{-p/(d+1)} for some constant C_{1}. Recall that R_{\mathcal{I},i}(T_{i,k+1})=R_{i}(T_{i,k+1})/\lambda(X_{i,k+1};\bm{Z}_{ik}) and \widehat{R}_{\mathcal{I},i}(T_{i,k+1})=R_{i}(T_{i,k+1})/\widehat{\lambda}(X_{i,k+1};\bm{Z}_{ik}); we then have

\begin{align*}
&\widehat{\bm{D}}_{\pi}(\widehat{\bm{\theta}}_{\pi,\mathcal{I}}^{*}-\bm{\theta}_{\pi,\mathcal{I}}^{*})\\
=\ &n_{K}^{-1}\sum_{i,k}\bm{\xi}_{i,k}\big\{\gamma^{X_{i,k+1}}\widehat{R}_{\mathcal{I},i}(T_{i,k+1})-(\bm{\xi}_{i,k}-\gamma^{X_{i,k+1}}\bm{\zeta}_{\pi,i,k+1})^{\top}\bm{\theta}_{\pi,\mathcal{I}}^{*}\big\}\\
=\ &\underbrace{n_{K}^{-1}\sum_{i,k}\bm{\xi}_{i,k}\big\{\gamma^{X_{i,k+1}}R_{\mathcal{I},i}(T_{i,k+1})-(\bm{\xi}_{i,k}-\gamma^{X_{i,k+1}}\bm{\zeta}_{\pi,i,k+1})^{\top}\bm{\theta}_{\pi,\mathcal{I}}^{*}\big\}}_{J_{1}}\\
&+n_{K}^{-1}\sum_{i,k}\frac{\bm{\xi}_{i,k}\gamma^{X_{i,k+1}}R_{\mathcal{I},i}(T_{i,k+1})}{\lambda(X_{i,k+1};\bm{Z}_{i,k})}\big\{\widehat{\lambda}(X_{i,k+1};\bm{Z}_{i,k})-\lambda(X_{i,k+1};\bm{Z}_{i,k})\big\}+O_{p}(n_{K}^{-1}b_{n}^{-1})\\
=\ &J_{1}+\int_{0}^{\tau}\underbrace{\frac{1}{n_{K}}\sum_{i,k}\bm{\xi}_{i,k}\gamma^{X_{i,k+1}}R_{\mathcal{I},i}(T_{i,k+1})\{\lambda_{0}(X_{i,k+1})\}^{-1}\frac{1}{b_{n}}K\Big(\frac{u-X_{i,k+1}}{b_{n}}\Big)}_{\bm{G}_{5}(u)}\,s_{z,0}^{-1}(u)\,n_{K}^{-1}\sum_{j,l}dM_{j,l+1}(u)\\
&+\underbrace{\frac{1}{n_{K}}\sum_{i,k}\bm{\xi}_{i,k}\gamma^{X_{i,k+1}}R_{\mathcal{I},i}(T_{i,k+1})\{\bm{Z}_{i,k}-\overline{\bm{z}}(X_{i,k+1})\}^{\top}}_{\bm{G}_{6}}\bm{\Omega}_{z}^{-1}n_{K}^{-1}\sum_{j,l}\bm{\phi}_{j,l,z}+O_{p}(b_{n}^{2})+O_{p}(n_{K}^{-1}b_{n}^{-1})\\
=\ &J_{1}+J_{2}+J_{3}+O_{p}(b_{n}^{2})+O_{p}(n_{K}^{-1}b_{n}^{-1})\,.
\end{align*}

As shown in the proof of Theorem 1,

\begin{align*}
J_{1}=\ &\frac{1}{n_{K}}\sum_{i,k}\bm{\xi}_{i,k}\Big\{\underbrace{\gamma^{X_{i,k+1}}R_{\mathcal{I},i}(T_{i,k+1})-Q^{\pi}_{\mathcal{I}}(S_{i,k},X_{i,k},A_{i,k})+\gamma^{X_{i,k+1}}Q^{\pi}_{\mathcal{I}}(S_{i,k+1},X_{i,k+1},\pi)}_{\phi_{i,k,3}}+\epsilon_{i,k,3}\Big\}\\
=\ &\frac{1}{n_{K}}\sum_{i,k}\bm{\xi}_{i,k}\phi_{i,k,3}+O_{p}(L^{-\frac{p}{d+1}})\,,
\end{align*}

where
\[
\epsilon_{i,k,3}=Q_{\mathcal{I}}^{\pi}(S_{i,k},X_{i,k},A_{i,k})-\bm{\xi}(S_{i,k},X_{i,k},A_{i,k})^{\top}\bm{\theta}_{\pi,\mathcal{I}}^{*}-\gamma^{X_{i,k+1}}\{Q^{\pi}_{\mathcal{I}}(S_{i,k+1},X_{i,k+1},\pi)-\bm{\xi}(S_{i,k+1},X_{i,k+1},\pi)^{\top}\bm{\theta}_{\pi,\mathcal{I}}^{*}\}\,.
\]

For the second term $J_{2}$, since

\begin{align*}
&E\Big\{\frac{1}{n_{K}}\sum_{i,k}\bm{\xi}_{i,k}\gamma^{X_{i,k+1}}R_{\mathcal{I},i}(T_{i,k+1})\{\lambda_{0}(X_{i,k+1})\}^{-1}\frac{1}{b_{n}}K\Big(\frac{u-X_{i,k+1}}{b_{n}}\Big)\Big\}\\
=\ &E\Big\{\frac{1}{n_{K}}\sum_{i,k}\bm{\xi}_{i,k}\int_{0}^{\infty}\frac{1}{b_{n}}K\Big(\frac{x-u}{b_{n}}\Big)\frac{\gamma^{x}r(S_{k},X_{k},A_{k},S_{k+1},x)}{\lambda(x;\bm{Z}_{ik})\lambda_{0}(x)}\,d\mathcal{P}_{X}(x;\bm{Z}_{i,k})\Big\}\\
=\ &E\Big[\frac{1}{n_{K}}\sum_{i,k}\bm{\xi}_{i,k}\int_{0}^{\infty}\frac{1}{b_{n}}K\Big(\frac{x-u}{b_{n}}\Big)\gamma^{x}r(S_{k},X_{k},A_{k},S_{k+1},x)\{\lambda_{0}(x)\}^{-1}\exp\{-\Lambda(x;\bm{Z}_{ik})\}\,dx\Big]\\
=\ &\underbrace{E\Big[\frac{1}{n_{K}}\sum_{i,k}\bm{\xi}_{i,k}\gamma^{u}r(S_{k},X_{k},A_{k},S_{k+1},u)\{\lambda_{0}(u)\}^{-1}\exp\{-\Lambda(u;\bm{Z}_{ik})\}\Big]}_{\bm{g}_{5}(u)}+O(b_{n}^{2})\,,
\end{align*}

when $r(s,x,a,s^{\prime},\cdot)$ has a bounded second-order derivative in a neighborhood of $u$, $J_{2}$ can be written asymptotically as

\[
n_{K}^{-1}\sum_{i,k}\int_{0}^{\tau}\bm{g}_{5}(u)\,s_{z,0}^{-1}(u)\,dM_{i,k+1}(u)+O_{p}(b_{n}^{2})\,.
\]

For the third term $J_{3}$, denote $\bm{g}_{6}=E(\bm{G}_{6})$; then $J_{3}=n_{K}^{-1}\sum_{i,k}\bm{g}_{6}\bm{\Omega}_{z}^{-1}\bm{\phi}_{i,k,z}+o_{p}(n_{K}^{-1/2})$. Define $\bm{\varphi}_{i,k,3}=\int_{0}^{\tau}\bm{g}_{5}(u)\,s_{z,0}^{-1}(u)\,dM_{i,k+1}(u)+\bm{g}_{6}\bm{\Omega}_{z}^{-1}\bm{\phi}_{i,k,z}$; then $\widehat{\bm{D}}_{\pi}(\widehat{\bm{\theta}}_{\pi,\mathcal{I}}-\bm{\theta}_{\pi,\mathcal{I}}^{*})$ is asymptotically equivalent to $n_{K}^{-1}\sum_{i,k}(\bm{\xi}_{i,k}\phi_{i,k,3}+\bm{\varphi}_{i,k,3})$. Together with Lemma A.1(iv), we have

\[
\widehat{\bm{\theta}}_{\pi,\mathcal{I}}^{*}-\bm{\theta}_{\pi,\mathcal{I}}^{*}=\bm{D}_{\pi}^{-1}(\bm{A}_{\xi}^{\top}\bm{\phi}_{3}+\bm{A}_{I}^{\top}\bm{\varphi}_{3})+O_{p}(L^{-\frac{p}{d+1}})+O_{p}(b_{n}^{2})+O_{p}(b_{n}^{-1}n_{K}^{-1})\,,
\]

where $\bm{\phi}_{3}=\mathrm{array}(\{\phi_{i,k,3}\}_{1\leq i\leq n,1\leq k\leq K_{i}})^{\top}$ and $\bm{\varphi}_{3}=\mathrm{array}(\{\bm{\varphi}_{i,k,3}\}_{1\leq i\leq n,1\leq k\leq K_{i}})^{\top}$. Define $\sigma^{2}_{\pi,\mathcal{G},3}=\bm{\zeta}_{\pi,\mathcal{G}}^{\top}\bm{D}_{\pi}^{-1}\bm{\Omega}_{\pi,3}(\bm{D}_{\pi}^{\top})^{-1}\bm{\zeta}_{\pi,\mathcal{G}}$ with

\begin{equation}
\bm{\Omega}_{\pi,3}=E\{(\bm{A}_{\xi}^{\top}\bm{\phi}_{3}+\bm{A}_{I}^{\top}\bm{\varphi}_{3})(\bm{A}_{\xi}^{\top}\bm{\phi}_{3}+\bm{A}_{I}^{\top}\bm{\varphi}_{3})^{\top}\}\,.\tag{16}
\end{equation}

Then $n_{K}^{1/2}\sigma_{\pi,\mathcal{G},3}^{-1}\{\widehat{V}^{\pi}_{\mathcal{I}}(\mathcal{G})-V^{\pi}_{\mathcal{I}}(\mathcal{G})\}$ is asymptotically equivalent to

\[
n_{K}^{1/2}\sigma_{\pi,\mathcal{G},3}^{-1}\bm{\zeta}_{\pi,\mathcal{G}}^{\top}\bm{D}_{\pi}^{-1}(\bm{A}_{\xi}^{\top}\bm{\phi}_{3}+\bm{A}_{I}^{\top}\bm{\varphi}_{3})+o_{p}(1)\,,
\]

when $L^{\frac{2p}{d+1}}\gg n_{K}(1+\|\bm{\zeta}_{\pi,\mathcal{G}}\|_{2}^{-2})$, $n_{K}b_{n}^{4}\rightarrow 0$ and $n_{K}b_{n}^{2}\rightarrow\infty$. Lastly, by the martingale central limit theorem, $n_{K}^{1/2}\sigma_{\pi,\mathcal{G},3}^{-1}\{\widehat{V}^{\pi}_{\mathcal{I}}(\mathcal{G})-V^{\pi}_{\mathcal{I}}(\mathcal{G})\}$ converges weakly to a normal variable with mean 0 and variance 1. $\blacksquare$
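As a practical illustration of how this limiting distribution could be used for inference, the following minimal Python sketch computes a Wald-type confidence interval from the sandwich variance $\sigma^{2}_{\pi,\mathcal{G},3}=\bm{\zeta}_{\pi,\mathcal{G}}^{\top}\bm{D}_{\pi}^{-1}\bm{\Omega}_{\pi,3}(\bm{D}_{\pi}^{\top})^{-1}\bm{\zeta}_{\pi,\mathcal{G}}$, with $\bm{\Omega}_{\pi,3}$ replaced by the empirical second moment of estimated influence contributions; the inputs (`zeta`, `D_hat`, `psi`, and the point estimate) are hypothetical stand-ins, not output of the authors' code.

```python
import numpy as np
from scipy import stats

def value_ci(v_hat, zeta, D_hat, psi, level=0.95):
    """Wald-type confidence interval for the policy value, based on the sandwich
    variance sigma^2 = zeta' D^{-1} Omega (D^{-1})' zeta, with Omega estimated by the
    empirical second moment of the per-record influence contributions psi
    (an n_K x (mL) array, e.g. rows xi_{ik} phi_{ik,3} + varphi_{ik,3})."""
    n_K = psi.shape[0]
    D_inv = np.linalg.inv(D_hat)
    Omega_hat = psi.T @ psi / n_K                 # plug-in estimate of Omega_{pi,3}
    sigma = np.sqrt(zeta @ D_inv @ Omega_hat @ D_inv.T @ zeta)
    half_width = stats.norm.ppf(0.5 + level / 2) * sigma / np.sqrt(n_K)
    return v_hat - half_width, v_hat + half_width

# usage with hypothetical dimensions (mL = 12 basis functions, n_K = 3000 transitions)
rng = np.random.default_rng(1)
mL, n_K = 12, 3000
zeta = rng.normal(size=mL)                        # stand-in for zeta_{pi,G}
D_hat = np.eye(mL) + 0.05 * rng.normal(size=(mL, mL))
psi = rng.normal(size=(n_K, mL))                  # stand-in for the influence contributions
print(value_ci(0.57, zeta, D_hat, psi))           # hypothetical point estimate V_hat = 0.57
```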

Proof of Theorem 3(ii): For $1\leq j\leq 4$, let $\bm{G}_{\mathcal{I},\pi,j}$ denote the quantities obtained by replacing $Q^{\pi}$ in $\bm{G}_{\pi,j}(x)$ with $Q^{\pi}_{\mathcal{I}}$, and let $\bm{g}_{\mathcal{I},\pi,j}(x)$ denote the expectation of $\bm{G}_{\mathcal{I},\pi,j}(x)$. Then, following the proofs of Theorems 1-3, $\widetilde{\bm{\theta}}_{\pi,\mathcal{I}}^{*}-\bm{\theta}_{\pi,\mathcal{I}}^{*}$ can be rewritten as

\begin{align*}
&\widetilde{\bm{D}}_{\pi}^{-1}\Big[\frac{1}{n_{K}}\sum_{i,k}\bm{\xi}_{i,k}\{\gamma^{X_{i,k+1}}\widehat{R}_{\mathcal{I},i}(T_{i,k+1})-(\bm{\xi}_{i,k}-\gamma^{X_{i,k+1}}\widetilde{\mathcal{U}}_{\pi,i,k+1})^{\top}\bm{\theta}_{\pi,\mathcal{I}}^{*}\}\Big]\\
=\ &\widetilde{\bm{D}}_{\pi}^{-1}\Big[\frac{1}{n_{K}}\sum_{i,k}\bm{\xi}_{i,k}\{\gamma^{X_{i,k+1}}R_{\mathcal{I},i}(T_{i,k+1})-(\bm{\xi}_{i,k}-\gamma^{X_{i,k+1}}\mathcal{U}_{\pi,i,k+1})^{\top}\bm{\theta}_{\pi,\mathcal{I}}^{*}\}\\
&+\frac{1}{n_{K}}\sum_{i,k}\frac{\bm{\xi}_{i,k}\gamma^{X_{i,k+1}}R_{\mathcal{I},i}(T_{i,k+1})}{\lambda(X_{i,k+1};\bm{Z}_{i,k})}\big\{\widehat{\lambda}(X_{i,k+1};\bm{Z}_{i,k})-\lambda(X_{i,k+1};\bm{Z}_{i,k})\big\}\\
&+\frac{1}{n_{K}}\sum_{i,k}\bm{\xi}_{i,k}(\widetilde{\mathcal{U}}_{\pi,i,k+1}-\mathcal{U}_{\pi,i,k+1})^{\top}\bm{\theta}_{\pi,\mathcal{I}}^{*}\Big]+O_{p}(n_{K}^{-1}b_{n}^{-1})\\
=\ &\bm{D}_{\pi}^{-1}\Big\{\frac{1}{n_{K}}\sum_{i,k}(\bm{\xi}_{i,k}\phi_{i,k,4}+\bm{\varphi}_{i,k,3}+\bm{\varphi}_{i,k,4})\Big\}+O_{p}\{Ln_{K}^{-1}\log(n_{K})\}+O_{p}(L^{-\frac{p}{d+1}})+O_{p}(b_{n}^{2})+O_{p}(b_{n}^{-1}n_{K}^{-1})\,,
\end{align*}

where $\phi_{i,k,4}=\gamma^{X_{i,k+1}}R_{\mathcal{I},i}(T_{i,k+1})-Q^{\pi}_{\mathcal{I}}(S_{i,k},X_{i,k},A_{i,k})+\int\gamma^{x}Q^{\pi}_{\mathcal{I}}(S_{i,k+1},x,\pi)\,d\mathcal{P}_{X}(x;\bm{Z}_{i,k})$ and $\bm{\varphi}_{i,k,4}=\int_{0}^{\tau}\{\bm{g}_{\mathcal{I},\pi,1}(x)-\int_{x}^{\tau}\bm{g}_{\mathcal{I},\pi,2}(u)\,d\Lambda_{0}(u)\}s_{z,0}^{-1}(x)\,dM_{i,k+1}(x)+(\bm{g}_{\mathcal{I},\pi,3}-\bm{g}_{\mathcal{I},\pi,4})\bm{\Omega}_{z}^{-1}\bm{\phi}_{i,k,z}$. Here, $\bm{\varphi}_{i,k,4}$ takes a form similar to $\bm{\varphi}_{i,k,2}$, except that the $\bm{g}_{\pi,j}(x)$ are replaced by $\bm{g}_{\mathcal{I},\pi,j}(x)$.

Furthermore, define $\bm{\phi}_{4}=\mathrm{array}(\{\phi_{i,k,4}\}_{1\leq i\leq n,1\leq k\leq K_{i}})$, $\bm{\varphi}_{4}=\mathrm{array}(\{\bm{\varphi}_{i,k,4}\}_{1\leq i\leq n,1\leq k\leq K_{i}})$, and $\sigma^{2}_{\pi,\mathcal{G},4}=\bm{\zeta}_{\pi,\mathcal{G}}^{\top}\bm{D}_{\pi}^{-1}\bm{\Omega}_{\pi,4}(\bm{D}_{\pi}^{\top})^{-1}\bm{\zeta}_{\pi,\mathcal{G}}$ with

\begin{equation}
\bm{\Omega}_{\pi,4}=E[\{\bm{A}_{\xi}^{\top}\bm{\phi}_{4}+\bm{A}_{I}^{\top}(\bm{\varphi}_{3}+\bm{\varphi}_{4})\}\{\bm{A}_{\xi}^{\top}\bm{\phi}_{4}+\bm{A}_{I}^{\top}(\bm{\varphi}_{3}+\bm{\varphi}_{4})\}^{\top}]\,.\tag{17}
\end{equation}

Then, when $L^{\frac{2p}{d+1}}\gg n_{K}(1+\|\bm{\zeta}_{\pi,\mathcal{G}}\|_{2}^{-2})$, $n_{K}b_{n}^{4}\rightarrow 0$ and $n_{K}b_{n}^{2}\rightarrow\infty$,

\[
n_{K}^{1/2}\sigma_{\pi,\mathcal{G},4}^{-1}\{\widetilde{V}^{\pi}_{\mathcal{I}}(\mathcal{G})-V^{\pi}_{\mathcal{I}}(\mathcal{G})\}=n_{K}^{1/2}\sigma_{\pi,\mathcal{G},4}^{-1}\bm{\zeta}_{\pi,\mathcal{G}}^{\top}\bm{D}_{\pi}^{-1}\{\bm{A}_{\xi}^{\top}\bm{\phi}_{4}+\bm{A}_{I}^{\top}(\bm{\varphi}_{3}+\bm{\varphi}_{4})\}+o_{p}(1)\,,
\]

which converges weakly to a normal distribution with mean 0 and variance 1. $\blacksquare$

A.3 Proof of Lemmas A.1-A.5

Proof of Lemma A.1

By assumption A3 and Lemma 1 in Shi et al. (2021), $Q^{\pi}(\cdot,\cdot,a)$ is a H\"{o}lder continuous function with exponent $p_{0}$ with respect to $(s,x)\in\mathbb{S}\times\mathbb{R}^{+}$, and thus (i) holds. For (ii), by Lemma E.1 in Shi et al. (2021), there exists some constant $c^{*}$ such that $\sup_{s,x}\|\bm{\Phi}_{L}(s,x)\|_{2}^{2}\leq c^{*}L$. Therefore, $E(\bm{\xi}_{i,k}^{\top}\bm{\xi}_{i,k})\leq m\sup_{s,x}\|\bm{\Phi}_{L}(s,x)\|_{2}^{2}\leq c^{*}mL$, which is $O(L)$. The other results are proved in Lemma E.3 of Shi et al. (2021). For (iii), the boundedness of $\phi_{i,k,1}$ follows from $|Q^{\pi}(s,x,a)|\leq C_{Q}$ given in Lemma A.1(i) and $P(\max_{k}|R(T_{k})|\leq c_{R})=1$ given in the conditions of Theorem 1, while the boundedness of $\epsilon_{i,k,1}$ follows from (8). Lastly, the properties of $\bm{D}_{\pi}$ and $\widehat{\bm{D}}_{\pi}$ given in (iv) are consistent with Lemma E.2 of Shi et al. (2021) and can be obtained with similar arguments. $\blacksquare$

Proof of Lemma A.2

By Lemma A.1(iii), $\bm{\Omega}_{\pi,1}=E(\frac{1}{n_{K}}\sum_{i,k}\bm{\xi}_{i,k}\bm{\xi}_{i,k}^{\top}\phi_{i,k,1}^{2})\leq(C_{R}+2C_{Q})\,E(\frac{1}{n_{K}}\sum_{i,k}\bm{\xi}_{i,k}\bm{\xi}_{i,k}^{\top})$. Then $\|\bm{\Omega}_{\pi,1}\|_{2}=O(1)$ follows from Lemma A.1(iii).

Now we prove Lemma A.2(ii). Recall that $\omega_{\pi}(s,x,a)=E(\phi_{i,k,1}^{2}\,|\,S_{k}=s,X_{k}=x,A_{k}=a)$. Then, by $\omega_{\pi}(s,x,a)>c_{R}^{-1}$, it can be shown that

\begin{align*}
\bm{\Omega}_{\pi,1}&=n_{K}^{-1}\sum_{i,k}E(\phi_{i,k,1}^{2}\,\bm{\xi}_{i,k}\bm{\xi}_{i,k}^{\top})=n_{K}^{-1}\sum_{i,k}E\{\omega_{\pi}(S_{i,k},X_{i,k},A_{i,k})\,\bm{\xi}_{i,k}\bm{\xi}_{i,k}^{\top}\}\\
&\geq c_{R}^{-1}n_{K}^{-1}\sum_{i,k}E(\bm{\xi}_{i,k}\bm{\xi}_{i,k}^{\top})\,.
\end{align*}

By Lemma A.1(ii), we have $\lambda_{\min}(\bm{\Omega}_{\pi,1})\geq c_{R}^{-1}\overline{c}_{1}/2$. Moreover, by $\|\bm{D}_{\pi}\|_{2}=O(1)$ given in Lemma A.1(iv), there exists some constant $\overline{C}>0$ such that $\lambda_{\min}\{\bm{D}_{\pi}^{-1}(\bm{D}_{\pi}^{\top})^{-1}\}\geq\overline{C}$. Therefore, $\sigma^{2}_{\pi,\mathcal{G},1}=\bm{\zeta}_{\pi,\mathcal{G}}^{\top}\bm{D}_{\pi}^{-1}\bm{\Omega}_{\pi,1}(\bm{D}_{\pi}^{\top})^{-1}\bm{\zeta}_{\pi,\mathcal{G}}\geq C_{2}\|\bm{\zeta}_{\pi,\mathcal{G}}\|_{2}^{2}$ with $C_{2}=\overline{C}c_{R}^{-1}\overline{c}_{1}/2$. $\blacksquare$

Proof of Lemma A.3

We first introduce some notation. For $1\leq i\leq n$ and $0\leq k\leq K_{i}-1$, denote $\{X_{i,k+1},\bm{Z}_{i,k},N_{i,k+1}\}$ as $\{X_{(g)},\bm{Z}_{(g)},N_{(g)}\}$ with $g=\sum_{l<i}K_{l}+k+1$. Then, for $j=0,1,2$, $S_{z,j}(x;\bm{\beta})$ can be rewritten as $S_{z,j}(x;\bm{\beta})=n_{K}^{-1}\sum_{g=1}^{n_{K}}I(X_{(g)}\geq x)\exp(\bm{\beta}^{\top}\bm{Z}_{(g)})\bm{Z}_{(g)}^{\otimes j}$, and $\widehat{\bm{\beta}}$ is the solution to $\bm{U}_{z}(\bm{\beta})=n_{K}^{-1}\sum_{g=1}^{n_{K}}\int_{0}^{\tau}\{\bm{Z}_{(g)}-\overline{\bm{Z}}(x;\bm{\beta})\}\,dN_{(g)}(x)=0$. Under assumption A6 and conditions C1-C5, and by Theorem 3.2 of Pons and de Turckheim (1988), $n_{K}^{1/2}(\widehat{\bm{\beta}}-\bm{\beta}_{0})$ converges weakly to a Gaussian variable with mean 0 and variance $\bm{\Omega}_{z}^{-1}$, where $\bm{\Omega}_{z}$ is defined in condition C2. Moreover, $n_{K}^{1/2}(\widehat{\bm{\beta}}-\bm{\beta}_{0})$ can be asymptotically written as

\[
n_{K}^{-1/2}\sum_{g=1}^{n_{K}}\bm{\Omega}_{z}^{-1}\int_{0}^{\tau}\{\bm{Z}_{(g)}-\overline{\bm{z}}(x)\}\,dM_{(g)}(x)+o_{p}(1)\,,
\]

with $M_{(g)}(x)=\int_{0}^{x}\{dN_{(g)}(u)-I(X_{(g)}\geq u)\exp(\bm{\beta}_{0}^{\top}\bm{Z}_{(g)})\,d\Lambda_{0}(u)\}$. Rewriting $M_{(g)}(x)$ as $M_{i,k+1}(x)$ and defining $\bm{\phi}_{i,k,z}=\int_{0}^{\tau}\{\bm{Z}_{i,k}-\overline{\bm{z}}(u)\}\,dM_{i,k+1}(u)$, Lemma A.3(i) follows directly.
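For illustration, the estimating equation $\bm{U}_{z}(\bm{\beta})=0$ above is a Cox-type partial-likelihood score computed on the gap times $X_{(g)}$ with covariates $\bm{Z}_{(g)}$. The sketch below (a minimal Newton-Raphson implementation on simulated data, ignoring ties and without step-halving) is one way such an estimate of $\bm{\beta}_{0}$ could be obtained; in practice, a standard Cox regression routine applied to the gap times would serve the same purpose.

```python
import numpy as np

def fit_gap_time_cox(x_gap, z, n_iter=25):
    """Newton-Raphson solution of U_z(beta) = sum_g int {Z_g - Zbar(x; beta)} dN_g(x) = 0,
    treating each gap time as one event; a minimal sketch without ties handling."""
    n, d = z.shape
    beta = np.zeros(d)
    for _ in range(n_iter):
        score = np.zeros(d)
        info = np.zeros((d, d))
        risk = np.exp(z @ beta)
        for g in range(n):                                 # each gap time contributes one event
            w = risk * (x_gap >= x_gap[g])                 # risk set weights at time X_(g)
            s0 = w.sum()
            zbar = (w[:, None] * z).sum(axis=0) / s0       # Zbar(x; beta)
            s2 = (w[:, None, None] * z[:, :, None] * z[:, None, :]).sum(axis=0) / s0
            score += z[g] - zbar
            info += s2 - np.outer(zbar, zbar)
        beta = beta + np.linalg.solve(info, score)         # Newton step
        if np.linalg.norm(score) < 1e-8:
            break
    return beta

# toy check: one covariate with true beta = 0.5
rng = np.random.default_rng(2)
n = 400
z = rng.normal(size=(n, 1))
x_gap = rng.exponential(scale=1.0 / np.exp(0.5 * z[:, 0]))
print(fit_gap_time_cox(x_gap, z))                          # should be close to [0.5]
```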

Now we derive the asymptotic properties of $\widehat{\Lambda}_{0}(x)$. The convergence of $n_{K}^{1/2}\{\widehat{\Lambda}_{0}(x)-\Lambda_{0}(x)\}$ follows from Theorem 3.3 of Pons and de Turckheim (1988). To obtain the asymptotic expression in Lemma A.3(ii), we first rewrite $n_{K}^{1/2}\{\widehat{\Lambda}_{0}(x)-\Lambda_{0}(x)\}$ as

\begin{align*}
&n_{K}^{1/2}\{\widehat{\Lambda}_{0}(x;\widehat{\bm{\beta}})-\widehat{\Lambda}_{0}(x;\bm{\beta}_{0})\}+n_{K}^{1/2}\{\widehat{\Lambda}_{0}(x;\bm{\beta}_{0})-\Lambda_{0}(x)\}\\
=\ &-\int_{0}^{x}\overline{\bm{z}}(u)^{\top}\,d\Lambda_{0}(u)\,n_{K}^{1/2}(\widehat{\bm{\beta}}-\bm{\beta}_{0})+n_{K}^{1/2}\{\widehat{\Lambda}_{0}(x;\bm{\beta}_{0})-\Lambda_{0}(x)\}+o_{p}(1)\,,
\end{align*}

where $n_{K}^{1/2}\{\widehat{\Lambda}_{0}(x;\bm{\beta}_{0})-\Lambda_{0}(x)\}=n_{K}^{-1/2}\sum_{g=1}^{n_{K}}\int_{0}^{x}s_{z,0}(u)^{-1}\,dM_{(g)}(u)+o_{p}(1)$. Then, together with Lemma A.3(i), we have

\[
n_{K}^{1/2}\{\widehat{\Lambda}_{0}(x)-\Lambda_{0}(x)\}=n_{K}^{-1/2}\sum_{i,k}\Big[\int_{0}^{x}s_{z,0}(u)^{-1}\,dM_{i,k+1}(u)-\int_{0}^{x}\overline{\bm{z}}(u)^{\top}\,d\Lambda_{0}(u)\,\bm{\Omega}_{z}^{-1}\bm{\phi}_{i,k,z}\Big]+o_{p}(1)\,.
\]

\blacksquare

Proof of Lemma A.4

We first prove Lemma A.4(i). The boundedness of $\phi_{i,k,2}$ is analogous to that of $\phi_{i,k,1}$ and can be obtained by the arguments in the proof of Lemma A.1(iii). For the boundedness of $\epsilon_{i,k,2}$, rewrite it as $\epsilon_{i,k,2,1}+\epsilon_{i,k,2,2}$, where $\epsilon_{i,k,2,1}=Q^{\pi}(S_{i,k},X_{i,k},A_{i,k})-\bm{\xi}_{i,k}^{\top}\bm{\theta}_{\pi}^{*}+\int\gamma^{x}\{\bm{\xi}(S_{i,k+1},x,\pi)^{\top}\bm{\theta}_{\pi}^{*}-Q^{\pi}(S_{i,k+1},x,\pi)\}\,d\mathcal{P}_{X}(x;\bm{Z}_{i,k})$ and $\epsilon_{i,k,2,2}=\int\gamma^{x}\{\bm{\xi}(S_{i,k+1},x,\pi)^{\top}\bm{\theta}_{\pi}^{*}-Q^{\pi}(S_{i,k+1},x,\pi)\}\,d\{\widehat{\mathcal{P}}_{X}(x;\bm{Z}_{i,k})-\mathcal{P}_{X}(x;\bm{Z}_{i,k})\}$. By (8), $\epsilon_{i,k,2,1}$ is bounded by $2C_{1}L^{-\frac{p}{d+1}}$. For $\epsilon_{i,k,2,2}$, since $n_{K}^{1/2}\{\widehat{\Lambda}_{0}(x)-\Lambda_{0}(x)\}$ converges to a Gaussian process with zero mean and continuous sample paths (see Theorem 3.3 in Pons and de Turckheim (1988)), together with the convergence of $n_{K}^{1/2}(\widehat{\bm{\beta}}-\bm{\beta}_{0})$, it follows that $n_{K}^{1/2}\{\widehat{\mathcal{P}}_{X}(x;\bm{z})-\mathcal{P}_{X}(x;\bm{z})\}$ also converges weakly to a stochastic process with zero mean and continuous sample paths for each fixed $\bm{z}$. This implies that $\sup_{x\in[0,\tau]}|\widehat{\mathcal{P}}_{X}(x;\bm{z})-\mathcal{P}_{X}(x;\bm{z})|\stackrel{p}{\rightarrow}0$ for each fixed $\bm{z}$. Then $\epsilon_{i,k,2,2}$ is also bounded by $C_{1}L^{-\frac{p}{d+1}}$ when $n_{K}$ is large enough, which completes the proof.

To prove Lemma A.4(ii), rewrite

\begin{align*}
\widetilde{\bm{D}}_{\pi}=\ &n_{K}^{-1}\sum_{i,k}\bm{\xi}_{i,k}(\bm{\xi}_{i,k}-\widetilde{\bm{U}}_{\pi,i,k+1})^{\top}\\
=\ &\underbrace{n_{K}^{-1}\sum_{i,k}\bm{\xi}_{i,k}(\bm{\xi}_{i,k}-\bm{U}_{\pi,i,k+1})^{\top}}_{\widetilde{\bm{D}}_{\pi,1}}+\underbrace{n_{K}^{-1}\sum_{i,k}\bm{\xi}_{i,k}(\bm{U}_{\pi,i,k+1}-\widetilde{\bm{U}}_{\pi,i,k+1})^{\top}}_{\widetilde{\bm{D}}_{\pi,2}}\,.
\end{align*}

Note that $E(\widetilde{\bm{D}}_{\pi,1})=\bm{D}_{\pi}$; then, similarly to Lemma A.1(iv), we have $\|\widetilde{\bm{D}}_{\pi,1}-\bm{D}_{\pi}\|=O_{p}\{L^{1/2}n_{K}^{-1/2}\log(n_{K})\}$ and $\|\widetilde{\bm{D}}_{\pi,1}\|=O(1)$.

On the other hand, rewrite $\widetilde{\bm{D}}_{\pi,2}$ as $n_{K}^{-1}\sum_{i,k}\bm{\xi}_{i,k}\int_{0}^{\tau}\gamma^{x}\bm{\xi}(S_{i,k+1},x,\pi)^{\top}\bm{\theta}_{\pi}^{*}\,d\{\widehat{\mathcal{P}}_{X}(x;\bm{Z}_{i,k})-\mathcal{P}_{X}(x;\bm{Z}_{i,k})\}$. Then, following (10) and with arguments similar to those in the proof of Theorem 2 (Step 2), we can obtain $\|\widetilde{\bm{D}}_{\pi,2}\|=O_{p}(n_{K}^{-1/2})$; therefore, $\|\widetilde{\bm{D}}_{\pi}-\bm{D}_{\pi}\|_{2}\leq\|\widetilde{\bm{D}}_{\pi,1}-\bm{D}_{\pi}\|_{2}+\|\widetilde{\bm{D}}_{\pi,2}\|=O_{p}\{L^{1/2}n_{K}^{-1/2}\log(n_{K})\}$.

Furthermore, since $\|\widetilde{\bm{D}}_{\pi}-\bm{D}_{\pi}\|_{2}=o_{p}(1)$, we have $\|\widetilde{\bm{D}}_{\pi}-\bm{D}_{\pi}\|_{2}\leq\overline{c}/6$ with probability approaching one. Together with $\|\bm{D}_{\pi}^{-1}\|_{2}\leq 3\overline{c}^{-1}$ given in Lemma A.1(iv), we have

\[
\bm{v}^{\top}\widetilde{\bm{D}}_{\pi}\bm{v}=\bm{v}^{\top}\bm{D}_{\pi}\bm{v}+\bm{v}^{\top}(\widetilde{\bm{D}}_{\pi}-\bm{D}_{\pi})\bm{v}\geq\frac{\overline{c}}{3}\|\bm{v}\|^{2}-\frac{\overline{c}}{6}\|\bm{v}\|^{2}=\frac{\overline{c}}{6}\|\bm{v}\|^{2}\,,
\]

for any $\bm{v}\in\mathbb{R}^{mL}$. This implies $\|\widetilde{\bm{D}}_{\pi}^{-1}\|_{2}\leq 6\overline{c}^{-1}$ with probability approaching one. Therefore,

\[
\|\widetilde{\bm{D}}_{\pi}^{-1}-\bm{D}_{\pi}^{-1}\|_{2}=\|\widetilde{\bm{D}}_{\pi}^{-1}(\widetilde{\bm{D}}_{\pi}-\bm{D}_{\pi})\bm{D}_{\pi}^{-1}\|_{2}\leq\|\widetilde{\bm{D}}_{\pi}^{-1}\|_{2}\,\|\widetilde{\bm{D}}_{\pi}-\bm{D}_{\pi}\|_{2}\,\|\bm{D}_{\pi}^{-1}\|_{2}\leq 18\overline{c}^{-2}\|\widetilde{\bm{D}}_{\pi}-\bm{D}_{\pi}\|_{2}
\]

is also $O_{p}\{L^{1/2}n_{K}^{-1/2}\log(n_{K})\}$. $\blacksquare$
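The last display relies on the matrix-inverse perturbation identity and the sub-multiplicativity of the operator norm. The short numerical check below, with arbitrary invertible stand-ins for $\bm{D}_{\pi}$ and $\widetilde{\bm{D}}_{\pi}$, is included only to illustrate these two facts.

```python
import numpy as np

# Check the identity D_t^{-1} - D^{-1} = D_t^{-1} (D - D_t) D^{-1} and the operator-norm
# bound used above; D and D_t are arbitrary invertible stand-ins, not estimates.
rng = np.random.default_rng(3)
p = 20
D = np.eye(p) + 0.1 * rng.normal(size=(p, p))
D_t = D + 0.01 * rng.normal(size=(p, p))                # a small perturbation of D

D_inv, Dt_inv = np.linalg.inv(D), np.linalg.inv(D_t)
lhs = Dt_inv - D_inv
print(np.allclose(lhs, Dt_inv @ (D - D_t) @ D_inv))     # perturbation identity

op = lambda A: np.linalg.norm(A, 2)                     # spectral (operator) norm
print(op(lhs) <= op(Dt_inv) * op(D_t - D) * op(D_inv))  # sub-multiplicativity bound
```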

Proof of Lemma A.5

As in Ramlau-Hansen (1983), we assume that the kernel has support within $[-1,1]$ and that $0\leq b_{n}\leq 1/2$, to simplify the derivation. Then, by a Taylor expansion,

\begin{align*}
g^{*}(x)=\ &\int_{0}^{\tau}\frac{1}{b_{n}}K\Big(\frac{u-x}{b_{n}}\Big)g(u)\,du\\
=\ &\int_{-1}^{1}K(t)\,g(x+tb_{n})\,dt\\
=\ &\int_{-1}^{1}K(t)\Big\{g(x)+g^{\prime}(x)tb_{n}+\frac{1}{2}g^{\prime\prime}(x+v_{t}b_{n})t^{2}b_{n}^{2}\Big\}\,dt\\
=\ &g(x)+\frac{b_{n}^{2}}{2}\int_{-1}^{1}K(t)\,g^{\prime\prime}(x+v_{t}b_{n})t^{2}\,dt\,,
\end{align*}

for some $v_{t}$ between $0$ and $t$, where the last equality uses $\int_{-1}^{1}K(t)\,dt=1$ and $\int_{-1}^{1}tK(t)\,dt=0$. Therefore, $|g^{*}(x)-g(x)|\leq\frac{1}{2}\int_{-1}^{1}K(t)t^{2}\,dt\;\sup_{|u-x|\leq 1/2}|g^{\prime\prime}(u)|\,b_{n}^{2}$, which is $O(b_{n}^{2})$ when $g$ has a bounded second-order derivative in a neighborhood of $x$. $\blacksquare$
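As a quick numerical illustration of the $O(b_{n}^{2})$ bound, the sketch below smooths an arbitrary smooth test function with an Epanechnikov kernel and shows that the bias shrinks roughly by a factor of four when the bandwidth is halved; the choices of $g$, $x$, and the kernel are ours, for illustration only.

```python
import numpy as np

# Numerical check of the O(b_n^2) smoothing bias in Lemma A.5: the bias of
# g*(x) = int K(t) g(x + t*b_n) dt shrinks by roughly 4x when b_n is halved.
g = np.cos                                             # smooth test function with bounded g''
x = 0.7
K = lambda t: 0.75 * (1.0 - t ** 2)                    # Epanechnikov kernel on [-1, 1]

t = np.linspace(-1.0, 1.0, 20001)
dt = t[1] - t[0]
for bn in (0.2, 0.1, 0.05):
    g_star = np.sum(K(t) * g(x + t * bn)) * dt         # Riemann approximation of the integral
    print(bn, abs(g_star - g(x)))                      # approx (b_n^2 / 2)|g''(x)| int K(t) t^2 dt
```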

Appendix B Additional Simulation Results

In this section, we present the simulation studies under scenario 4 to investigate the robustness of the proposed estimates. Specifically, the dataset is generated following scheme 2, while the estimates are calculated by the methods developed under scheme 1, so that model (6) in Section 3.2 is misspecified. The results are reported in Table 4. They show that the naive method remains biased, the standard estimator under the cumulative reward, $\widehat{V}^{\pi}(\mathcal{G})$, still performs well in all settings, and the estimates $\widetilde{V}^{\pi}(\mathcal{G})$, $\widehat{V}^{\pi}_{\mathcal{I}}(\mathcal{G})$ and $\widetilde{V}^{\pi}_{\mathcal{I}}(\mathcal{G})$, which are based on model (6), exhibit negligible bias and provide satisfactory variance estimation. Hence, the proposed methods are fairly robust to mild misspecification of the observation process model.

Table 4: Policy evaluation results when the observation process model is misspecified, including the bias and standard deviation of the naive estimators (Bias$_{N}$, SD$_{N}$), and the bias, standard deviation, estimated standard error, and empirical coverage probability of 95\% confidence intervals for the standard estimators (Bias$_{S}$, SD$_{S}$, SE$_{S}$, CP$_{S}$) and the modulated estimators (Bias$_{M}$, SD$_{M}$, SE$_{M}$, CP$_{M}$).
Naive Standard Modulated
$n$ $K$ Bias$_{N}$ SD$_{N}$ Bias$_{S}$ SD$_{S}$ SE$_{S}$ CP$_{S}$ Bias$_{M}$ SD$_{M}$ SE$_{M}$ CP$_{M}$
Scenario 4   Cumulative Rewards   True value = 0.594
100 10 -0.426 0.103 0.006 0.082 0.081 0.924 -0.017 0.080 0.078 0.922
200 10 -0.424 0.064 0.001 0.055 0.057 0.940 -0.013 0.053 0.055 0.930
400 10 -0.428 0.042 0.001 0.041 0.040 0.940 -0.009 0.041 0.039 0.938
10 100 -0.413 0.142 0.012 0.123 0.099 0.910 -0.008 0.109 0.093 0.922
10 200 -0.430 0.087 0.001 0.073 0.065 0.920 -0.014 0.071 0.062 0.924
10 400 -0.425 0.067 0.000 0.051 0.046 0.930 -0.011 0.045 0.044 0.926
Scenario 4   Integrated Rewards   True value = 0.569
100 10 -0.172 0.077 0.001 0.084 0.074 0.944 -0.015 0.077 0.069 0.930
200 10 -0.176 0.056 0.000 0.056 0.051 0.935 -0.013 0.057 0.049 0.919
400 10 -0.181 0.042 -0.003 0.041 0.036 0.924 -0.014 0.039 0.035 0.910
10 100 -0.170 0.068 0.000 0.073 0.072 0.938 -0.015 0.071 0.068 0.938
10 200 -0.165 0.051 0.002 0.054 0.050 0.942 -0.010 0.052 0.048 0.938
10 400 -0.168 0.034 0.001 0.037 0.035 0.929 -0.009 0.035 0.034 0.927

Appendix C Additional Application Results

In this section, we present additional application results comparing the observed stochastic policy, the observed fixed policy, and the optimal policy. The policies are estimated from 2488 visit records of 352 patients (approximately 33\% of the study subjects), randomly selected from the full dataset, while the policy evaluation procedure uses the remaining 5166 records from 704 patients. To obtain the observed stochastic policy, we first fit a logistic regression model with the binary response $A_{k}$ and the explanatory variables $S_{k}$ and $X_{k}$. Letting the estimated probability of $A_{k}=a$ given $(s,x)$ be $\pi_{os}(a;s,x)$, we denote $\pi_{os}$ as the observed stochastic policy. Moreover, by setting the probability threshold to 0.48, we further obtain a deterministic mapping from $(s,x)$ to $\{0,1\}$, denoted as the observed fixed policy $\pi_{of}(a;s,x)$. Finally, the optimal policy is estimated by double-fitted Q-learning (Shi et al., 2021), where the Bellman equation for the optimal policy can be obtained by a trivial extension of Lemma 2. Plots of the obtained policies are shown in Figure 5. The plots show that, under the observed policies $\pi_{of}$ and $\pi_{os}$, patients with deeper pocket depth and more frequent recalls are more likely to be recommended a shorter recall interval, whereas under the estimated optimal policies there is no such linear relationship among actions, states, and gap times.
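The construction of the two observed policies can be summarized in a few lines. The following minimal sketch (Python, with simulated stand-ins for the state, gap time, and action, and hypothetical variable names) mirrors the description above: a logistic regression of $A_{k}$ on $(S_{k},X_{k})$ gives $\pi_{os}$, and thresholding the fitted probability at 0.48 gives $\pi_{of}$. It is an illustration, not the code used for the reported results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def observed_policies(S, X, A, threshold=0.48):
    """Fit a logistic regression of the binary action A_k on the state S_k and gap time
    X_k; pi_os returns the fitted probability of A = 1, and pi_of thresholds it at
    `threshold` to give a deterministic (fixed) policy. Data layout is illustrative."""
    design = np.column_stack([S, X])
    model = LogisticRegression().fit(design, A)

    def pi_os(s, x):                                  # observed stochastic policy: P(A = 1 | s, x)
        return model.predict_proba(np.column_stack([np.atleast_2d(s), np.atleast_2d(x)]))[:, 1]

    def pi_of(s, x):                                  # observed fixed policy: 1{P(A = 1 | s, x) > threshold}
        return (pi_os(s, x) > threshold).astype(int)

    return pi_os, pi_of

# toy usage with a one-dimensional state (e.g., a PD-like measurement) and gap time in days
rng = np.random.default_rng(4)
n = 1000
S = rng.normal(3.0, 0.5, size=(n, 1))                 # hypothetical state values
X = rng.integers(90, 400, size=(n, 1)).astype(float)  # hypothetical gap times
logit = -4.0 + 1.0 * S[:, 0] + 0.003 * X[:, 0]
A = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
pi_os, pi_of = observed_policies(S, X, A)
print(pi_os(np.array([[3.2]]), np.array([[180.0]])), pi_of(np.array([[3.2]]), np.array([[180.0]])))
```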

Figure 5: Plots of policies, including (a) the observed fixed policy $\pi_{of}$; (b) the observed stochastic policy $\pi_{os}$; (c) the estimated optimal policy based on the discrete reward, $\pi_{op}$; and (d) the estimated optimal policy based on the continuous reward, $\pi_{op,\mathcal{I}}$.

Furthermore, under the modulated method, we compared the confidence intervals of the value functions for these policies when the value function is defined by the cumulative reward and by the integrated reward, respectively. The estimated results are presented in Figure 6. Together with Figure 5, they show that the changes in action assignment under the optimal policy, compared with the observed policies, indeed improve the value of the policy. For example, when the reward is defined as the PD measurement, the optimal policy suggests that patients with a gap time of three months and PD values less than 2.7 mm may schedule their subsequent visit at six months or later; correspondingly, the value of the optimal policy at a gap time of 90 days and PD $<$ 2.2 mm is significantly higher than that of the two observed policies.

Figure 6: Under the modulated method, the estimated value (solid lines) and confidence intervals (dashed lines) of the observed fixed policy (red), the observed stochastic policy (green), and the estimated optimal policy (blue) when the reward is defined as a PD reduction indicator (upper panels) and as the PD measurement (lower panels), respectively.

References

  • Bradtke and Duff (1994) Bradtke, S., and Duff, M. (1994), “Reinforcement Learning Methods for Continuous-Time Markov Decision Problems,” Advances in Neural Information Processing Systems, 7, 393–400.
  • Chen et al. (2018) Chen, X., Ding, J., and Sun, L. (2018), “A Semiparametric Additive Rate Model for a Modulated Renewal Process,” Lifetime Data Analysis, 24, 675–698.
  • Clarkson et al. (2018) Clarkson, J. E., Pitts, N. B., Bonetti, D., Boyers, D., Braid, H., Elford, R., Fee, P. A., Floate, R., Goulão, B., Humphris, G. et al. (2018), “INTERVAL (investigation of NICE technologies for enabling risk-variable-adjusted-length) dental recalls trial: a multicentre randomised controlled trial investigating the best dental recall interval for optimum, cost-effective maintenance of oral health in dentate adults attending dental primary care,” BMC Oral Health, 18, 1–10.
  • Clarkson et al. (2020) Clarkson, J. E., Pitts, N. B., Goulao, B., Boyers, D., Ramsay, C. R., Floate, R., Braid, H. J., Fee, P. A., Ord, F. S., Worthington, H. V. et al. (2020), “Risk-based, 6-monthly and 24-monthly dental check-ups for adults: the INTERVAL three-arm RCT.,” Health Technology Assessment (Winchester, England), 24(60), 1–138.
  • Cox (1972) Cox, D. R. (1972), “The Statistical Analysis of Dependencies in Point Processes,” in Stochastic Point Processes, New York: Wiley, pp. 55–66.
  • Du et al. (2020) Du, J., Futoma, J., and Doshi-Velez, F. (2020), “Model-based Reinforcement Learning for Semi-Markov Decision Processes with Neural ODEs,” Advances in Neural Information Processing Systems, 33, 19805–19816.
  • Ertefaie and Strawderman (2018) Ertefaie, A., and Strawderman, R. L. (2018), “Constructing Dynamic Treatment Regimes over Indefinite Time Horizons,” Biometrika, 105, 963–977.
  • Guan et al. (2020) Guan, Q., Reich, B. J., Laber, E. B., and Bandyopadhyay, D. (2020), “Bayesian nonparametric policy search with application to periodontal recall intervals,” Journal of the American Statistical Association, 115(531), 1066–1078.
  • Holtfreter et al. (2024) Holtfreter, B., Kuhr, K., Borof, K., Tonetti, M. S., Sanz, M., Kornman, K., Jepsen, S., Aarabi, G., Völzke, H., Kocher, T. et al. (2024), “ACES: A new framework for the application of the 2018 periodontal status classification scheme to epidemiological survey data,” Journal of Clinical Periodontology, 51(5), 512–521.
  • Huang (1998) Huang, J. Z. (1998), “Projection Estimation in Multiple Regression with Application to Functional Anova Models,” The Annals of Statistics, 26, 242–272.
  • Kallus and Uehara (2020) Kallus, N., and Uehara, M. (2020), “Double Reinforcement Learning for Efficient Off-Policy Evaluation in Markov Decision Processes,” Journal of Machine Learning Research, 21, 1–63.
  • Lin et al. (2013) Lin, F.-C., Truong, Y. K., and Fine, J. P. (2013), “Robust Analysis of Semiparametric Renewal Process Models,” Biometrika, 100, 709–726.
  • Lin and Fine (2009) Lin, F., and Fine, J. P. (2009), “Pseudomartingale Estimating Equations for Modulated Renewal Process Models,” Journal of the Royal Statistical Society. Series B (Statistical Methodology), 71, 3–23.
  • Luckett et al. (2019) Luckett, D. J., Laber, E. B., Kahkoska, A. R., Maahs, D. M., Mayer-Davis, E., and Kosorok, M. R. (2019), “Estimating Dynamic Treatment Regimes in Mobile Health Using V-Learning,” Journal of the American Statistical Association, 115, 692–706.
  • McLeish (1974) McLeish, D. L. (1974), “Dependent Central Limit Theorems and Invariance Principles,” The Annals of Probability, 2.
  • Murphy (2003) Murphy, S. A. (2003), “Optimal Dynamic Treatment Regimes,” Journal of the Royal Statistical Society Series B: Statistical Methodology, 65, 331–355.
  • Murphy et al. (2001) Murphy, S. A., van der Laan, M. J., and Robins, J. M. (2001), “Marginal Mean Models for Dynamic Regimes,” Journal of the American Statistical Association, 96, 1410–1423.
  • Parr (1998) Parr, R. E. (1998), Hierarchical Control and Learning for Markov Decision Processes, PhD thesis, University of California, Berkeley.
  • Pons and de Turckheim (1988) Pons, O., and de Turckheim, E. (1988), “Cox’s Periodic Regression Model,” The Annals of Statistics, 16, 678–693.
  • Ramlau-Hansen (1983) Ramlau-Hansen, H. (1983), “Smoothing Counting Process Intensities by Means of Kernel Functions,” The Annals of Statistics, 11, 453–466.
  • Robins (2004) Robins, J. M. (2004), Optimal Structural Nested Models for Optimal Sequential Decisions, in Proceedings of the Second Seattle Symposium in Biostatistics: Analysis of Correlated Data, eds. D. Y. Lin and P. J. Heagerty, Springer, New York, NY, pp. 189–326.
  • Shi et al. (2022) Shi, C., Luo, S., Le, Y., Zhu, H., and Song, R. (2022), “Statistically Efficient Advantage Learning for Offline Reinforcement Learning in Infinite Horizons,” Journal of the American Statistical Association, pp. 1–14.
  • Shi et al. (2020) Shi, C., Wan, R., Song, R., Lu, W., and Leng, L. (2020), Does the Markov Decision Process Fit the Data: Testing for the Markov Property in Sequential Decision Making, in Proceedings of the 37th International Conference on Machine Learning, eds. H. Daumé III and A. Singh, Vol. 119 of Proceedings of Machine Learning Research, PMLR, pp. 8807–8817.
  • Shi et al. (2021) Shi, C., Zhang, S., Lu, W., and Song, R. (2021), “Statistical Inference of the Value Function for Reinforcement Learning in Infinite-Horizon Settings,” Journal of the Royal Statistical Society Series B: Statistical Methodology, 84, 765–793.
  • Sutton and Barto (2018) Sutton, R. S., and Barto, A. G. (2018), Reinforcement learning: An introduction, Cambridge: MIT press.
  • Uehara et al. (2022) Uehara, M., Shi, C., and Kallus, N. (2022), “A Review of Off-Policy Evaluation in Reinforcement Learning,” arXiv preprint arXiv:2212.06355.
  • Yang (2021) Yang, S. (2021), “Semiparametric Estimation of Structural Nested Mean Models with Irregularly Spaced Longitudinal Observations,” Biometrics, 78, 937–949.
  • Yang et al. (2019) Yang, S., Pieper, K., and Cools, F. (2019), “Semiparametric Estimation of Structural Failure Time Models in Continuous-Time Processes,” Biometrika, 107, 123–136.
  • Zhang et al. (2013) Zhang, B., Tsiatis, A. A., Laber, E. B., and Davidian, M. (2013), “Robust Estimation of Optimal Dynamic Treatment Regimes for Sequential Treatment Decisions,” Biometrika, 100, 681–694.
  • Zhao et al. (2015) Zhao, Y.-Q., Zeng, D., Laber, E. B., and Kosorok, M. R. (2015), “New Statistical Learning Methods for Estimating Optimal Dynamic Treatment Regimes,” Journal of the American Statistical Association, 110, 583–598.