
Deep Spectral Q-learning with Application to Mobile Health

Yuhe Gao [email protected] Department of Statistics, North Carolina State University, Raleigh, USA Chengchun Shi [email protected] Department of Statistics, London School of Economics and Political Science, London, UK Rui Song [email protected] Department of Statistics, North Carolina State University, Raleigh, USA
Abstract

Dynamic treatment regimes assign personalized treatments to patients sequentially over time based on their baseline information and time-varying covariates. In mobile health applications, these covariates are typically collected at different frequencies over a long time horizon. In this paper, we propose a deep spectral Q-learning algorithm, which integrates principal component analysis (PCA) with deep Q-learning to handle the mixed frequency data. In theory, we prove that the mean return under the estimated optimal policy converges to that under the optimal one and establish its rate of convergence. The usefulness of our proposal is further illustrated via simulations and an application to a diabetes dataset.


Keywords: Dynamic Treatment Regimes, Mixed Frequency Data, Principal Component Analysis, Reinforcement Learning

1 Introduction

Precision medicine focuses on providing personalized treatment to patients by taking their personal information into consideration (see e.g., Kosorok and Laber, 2019; Tsiatis et al., 2019). It has found applications in numerous studies, ranging from cardiovascular disease to cancer treatment and gene therapy (Jameson and Longo, 2015). A dynamic treatment regime (DTR) consists of a sequence of treatment decision rules tailored to each individual patient's status at each time, mathematically formulating the idea behind precision medicine. One of the major objectives in precision medicine is to identify the optimal dynamic treatment regime that yields the most favorable outcome on average.

With the rapid development of mobile health (mHealth) technology, it has become feasible to collect rich longitudinal data through mobile apps in medical studies. A motivating data example is the OhioT1DM dataset (Marling and Bunescu, 2020), which contains data from 12 patients suffering from type-I diabetes measured via fitness bands over 8 weeks. Data-driven decision rules estimated from these data have the potential to improve these patients' health (see e.g., Shi et al., 2020; Zhu et al., 2020; Zhou et al., 2022). However, it remains challenging to estimate the optimal DTR in these mHealth studies. First, the number of treatment stages (i.e., the horizon) is no longer fixed, whereas the number of patients can be limited. For instance, in the OhioT1DM dataset, only 12 patients are enrolled in the study; yet if treatment decisions are made on an hourly basis, the horizon exceeds 1000. Existing proposals in the DTR literature (Murphy, 2003; Zhang et al., 2013; Zhao et al., 2015; Song et al., 2015; Shi et al., 2018; Nie et al., 2021; Fang et al., 2022; Qi and Liu, 2018; Mo et al., 2021; Ertefaie et al., 2021; Guan et al., 2020; Zhang et al., 2018) become inefficient in these long or infinite horizon settings and require a large number of patients to be consistent. Second, patients' time-varying covariates typically contain mixed frequency data. In the OhioT1DM dataset, some of the variables, such as the continuous glucose monitoring (CGM) blood glucose levels, are recorded every 5 minutes. Meanwhile, other variables, such as the carbohydrate estimate for the meal and the exercise intensity, are recorded with a much lower frequency. Concatenating these high-frequency variables over each one-hour interval produces a high-dimensional state vector, and directly using these states as input of the treatment policy would yield very noisy decision rules. A naive approach is to use some ad-hoc summaries of the high-frequency data for policy learning. However, this might produce a sub-optimal policy due to the information loss.

Recently, there has been a growing line of research in the statistics literature on policy learning and/or evaluation in infinite horizons. Some references include Ertefaie and Strawderman (2018); Luckett et al. (2020); Shi et al. (2022); Liao et al. (2020, 2021); Shi et al. (2021c); Chen et al. (2022); Ramprasad et al. (2022); Li et al. (2022); Xu et al. (2020). In the computer science literature, there is a huge body of work on developing reinforcement learning (RL) algorithms in infinite horizons. These algorithms can be categorized into model-free and model-based algorithms. Popular model-free RL methods include value-based methods, which model the expected return starting from a given state (or state-action pair) and compute the optimal policy as the greedy policy with respect to the value function (Ernst et al., 2005; Riedmiller, 2005; Mnih et al., 2015; Van Hasselt et al., 2016; Dabney et al., 2018), and policy-based methods, which directly search for the optimal policy among a parameterized class via policy gradient or actor-critic methods (Schulman et al., 2015a, b; Koutník et al., 2013; Mnih et al., 2016; Wang et al., 2017b). Model-based algorithms differ from model-free algorithms in that they model the transition dynamics of the environment and use this model to derive or improve policies. Popular model-based RL methods include Guestrin et al. (2002); Janner et al. (2019); Lai et al. (2020); Li et al. (2020), to name a few. See also Arulkumaran et al. (2017); Sutton and Barto (2018); Luo et al. (2022) for more details. These methods cannot be directly applied to datasets such as OhioT1DM as they do not consider mixed frequency data.

In the RL literature, a few works have considered dimension reduction to handle high-dimensional state systems. In particular, Murao and Kitamura (1997) proposed to segment the state space and learn a cluster representation of the states. Whiteson et al. (2007) proposed to divide the state space into tilings to represent each state. Both papers proposed to discretize the state space for dimension reduction. However, this can lead to considerable information loss (Wang et al., 2017a). Sprague (2007) proposed an iterative dimension reduction method using neighborhood components analysis. Their method uses a linear basis function to model the Q-function and does not allow more general nonlinear function approximation. Recently, a few works have employed principal component analysis (PCA) for dimension reduction in RL (Curran et al., 2015, 2016; Parisi et al., 2017). However, none of the aforementioned papers formally established theoretical guarantees for their proposals. Moreover, these methods are motivated by applications in games or robotics, and their generalization to mHealth applications with mixed frequency data remains unknown.

In the DTR literature, a few works considered mixed frequency data which include both scalar and functional covariates. Specifically, McKeague and Qian (2014) proposed a functional regression model for optimal decision making with one functional covariate. Ciarleglio et al. (2015) and Ciarleglio et al. (2016) extended their proposal to a more general setting with multiple scalar and functional covariates. Ciarleglio et al. (2018) considered variable selection to handle the mixed frequency data. Laber and Staicu (2018) applied functional PCA to the functional covariates for dimension reduction. All these works considered single stage decision making. Their methods are not directly applicable to the infinite horizon settings.

Our contributions are as follows. Scientifically, mixed frequency data frequently arise in mHealth applications. Nonetheless, it has been less explored in the infinite horizon settings. Our proposal thus fills a crucial gap, and greatly extends the scope of existing approaches to learning DTRs. Methodologically, we propose a deep spectral Q-learning algorithm for dimension reduction. The proposed algorithm achieves a better empirical performance than those that either directly use the original mixed frequency data or its ad-hoc summaries as input of the treatment policy. Theoretically, we derive an upper error bound for the regret of the proposed policy and decompose this bound into the sum of approximation error and estimation error. Our theories offer some general guidelines for practitioners to select the number of principal components in the proposed algorithm.

The rest of this paper is organized as follows. We introduce some background about DTR and the mixed frequency data in Section 2. We introduce the proposed method to estimate the optimal DTR in Section 3 and study its theoretical properties in Section 4. Empirical studies are conducted in Section 5. In Section 5.3, we apply the proposed method to the OhioT1DM dataset. Finally, we conclude our paper in Section 6.

2 Preliminary

2.1 Data and Notations

Suppose the study enrolls $N$ patients. The dataset for the $i$-th patient can be summarized as $O_{i}\equiv\{(S_{i,t},A_{i,t},Y_{i,t}):1\leq t\leq T_{i}\}$. For simplicity, we assume $T_{i}=T$ for any $i$. Each state $S_{i,t}=(X_{i,t},\{Z_{i,t,j}\}_{j=1}^{J})\in\mathcal{S}$ is composed of a set of low frequency covariates $X_{i,t}\in\mathbb{R}^{m_{0}}$ and a set of $J$ high frequency covariates $\{Z_{i,t,j}\}_{j=1}^{J}$, where $\mathcal{S}$ is the state space. For each high frequency variable $Z_{i,t,j}$, $j\in\{1,2,\ldots,J\}$, we have $Z_{i,t,j}=(Z_{i,t,j}^{(1)},Z_{i,t,j}^{(2)},\ldots,Z_{i,t,j}^{(m_{j})})^{T}\in\mathbb{R}^{m_{j}}$ for some large integer $m_{j}$. Let $\tau$ denote the length of a time unit. The low frequency variables are recorded at time points $\tau,2\tau,\cdots,t\tau,\cdots$. The $j$th high frequency variable, however, is recorded more frequently at time points $m_{j}^{-1}\tau,2m_{j}^{-1}\tau,\cdots$. Notice that we allow the $J$ high-frequency variables to be recorded with different frequencies. Let $m=\sum_{j=1}^{J}m_{j}$ and let $Z_{i,t}=[Z_{i,t,1}^{T},Z_{i,t,2}^{T},\ldots,Z_{i,t,J}^{T}]^{T}\in\mathbb{R}^{m}$ denote the high-dimensional vector that concatenates all the high frequency covariates. As such, the state $S_{i,t}$ can be represented as $(X_{i,t},Z_{i,t})\in\mathbb{R}^{m_{0}+m}$. In addition, $A_{i,t}$ denotes the treatment indicator at the $t$th time point and $\mathcal{A}$ denotes the set of all possible treatment options. The reward $Y_{i,t}$ corresponds to the $i$th patient's response obtained after the $t$th decision point. By convention, we assume a larger value of $Y_{i,t}$ indicates a better outcome. We require $|Y_{i,t}|$ to be uniformly bounded by some constant $R_{\max}>0$, and assume $O_{1},O_{2},\ldots,O_{N}$ are i.i.d.; both conditions are commonly imposed in the RL literature (see e.g., Sutton and Barto, 2018). Finally, we denote the $l_{p}$-norm of a function aggregated over a given distribution function $\sigma$ by $\|\cdot\|_{p,\sigma}$. We use $[q]$ to represent the index set $\{1,2,3,\ldots,q\}$ for any integer $q\in\mathbb{N}$.

2.2 Assumptions, Policies and Value Functions

We will require the system to satisfy the Markov assumption such that

$$P(S_{0,t+1}\in\mathbb{S}\mid S_{0,t}=s,A_{0,t}=a,\{S_{0,t^{\prime}},A_{0,t^{\prime}}\}_{0\leq t^{\prime}<t})=P(S_{0,t+1}\in\mathbb{S}\mid S_{0,t}=s,A_{0,t}=a),\quad\forall t,$$

for any $s,a$ and Borel set $\mathbb{S}\subseteq\mathcal{S}$. In other words, the distribution of the next state depends on the past data history only through the current state-action pair. We assume the transition kernel is absolutely continuous, with a uniformly bounded transition density function $q(s^{\prime}|s,a)$ such that $\sup_{s,a,s^{\prime}}|q(s^{\prime}|s,a)|\leq c_{q}$ for some constant $c_{q}>0$.

In addition, we also impose the following conditional mean independence assumption:

$$E(Y_{0,t}\mid S_{0,t}=s,A_{0,t}=a,\{Y_{0,t^{\prime}},S_{0,t^{\prime}},A_{0,t^{\prime}}\}_{0\leq t^{\prime}<t})=E(Y_{0,t}\mid S_{0,t}=s,A_{0,t}=a)=r(s,a),\quad\forall t,$$

where we refer to $r(s,a)=r(x,\{z_{j}\}_{j=1}^{J},a)$ as the immediate reward function.

Next, define a policy $\pi:\mathcal{S}\rightarrow P(\mathcal{A})$ as a function that maps a patient's state at each time point into a probability distribution on the action space. Given $\pi$, we define its (state) value function as

$$V^{\pi}(s)=\sum_{t=0}^{\infty}\gamma^{t}E^{\pi}\{Y_{0,t}\mid S_{0,0}=s\},$$

with $\gamma\in[0,1)$ being a discount factor that balances the immediate and long-term rewards. By definition, the state value function characterizes the expected return assuming the decision process follows a given target policy $\pi$. In addition, we define the state-action value function (or Q-function) as

$$Q^{\pi}(s,a)=\sum_{t=0}^{\infty}\gamma^{t}E^{\pi}\{Y_{0,t}\mid S_{0,0}=s,A_{0,0}=a\},$$

which is the expected discounted cumulative rewards given an initial state-action pair.

Under the Markov assumption and the conditional mean independence assumption, there exists an optimal policy $\pi^{*}$ such that $V^{\pi^{*}}(s)\geq V^{\pi}(s)$ for all $\pi$ and $s\in\mathcal{S}$ (see e.g., Puterman, 2014). Moreover, $\pi^{*}$ satisfies the following Bellman optimality equation:

$$Q^{\pi^{*}}(s,a)=E\{Y_{0,t}+\gamma\max_{a^{\prime}\in\mathcal{A}}Q^{\pi^{*}}(S_{0,t+1},a^{\prime})\mid S_{0,t}=s,A_{0,t}=a\},\qquad(1)$$

where $Q^{\pi^{*}}$ denotes the optimal Q-function.

2.3 ReLU Network

In this paper, we use value-based methods that learn the optimal policy $\pi^{*}$ by estimating the optimal Q-function. We use the class of sparse neural networks with the Rectified Linear Unit (ReLU) activation function, i.e., $\sigma(x)=\max(x,0)$, to model the Q-function. The advantage of using a neural network over a simple parametric model is that the neural network can better capture the potential non-linearity in the high-dimensional state system.

Formally, the class of sparse ReLU network is defined as

$$\mathcal{F}_{SReLU}(L,\{d_{j}\}_{j=0}^{L+1},s,V_{\max})=\big\{f:\mathbb{R}^{d_{0}}\rightarrow\mathbb{R}^{d_{L+1}}:f(x)=W_{L}\,g_{L-1}\circ g_{L-2}\circ\cdots\circ g_{1}\circ g_{0}(x),$$
$$g_{j}(x)=\sigma(W_{j}x+v_{j}),\ W_{j}\in\mathbb{R}^{d_{j+1}\times d_{j}},\ v_{j}\in\mathbb{R}^{d_{j+1}},\ j\in\{0,1,\ldots,L\},$$
$$\max_{j=0,1,\ldots,L}\{\max(\|W_{j}\|_{\infty},|v_{j}|_{\infty})\}\leq 1,\ \sum_{j=0}^{L}(\|W_{j}\|_{0}+|v_{j}|_{0})\leq s,\ \max_{k\in\{1,2,\ldots,d_{L+1}\}}\|f_{k}\|_{\infty}\leq V_{\max}\big\}.$$

Here, $L$ is the number of hidden layers of the neural network and $d_{j}$ is the width of each layer. The output dimension $d_{L+1}$ is set to 1 since the Q-function output is a scalar. The parameters in $\mathcal{F}_{SReLU}$ are the weight matrices $W_{j}$ and bias vectors $v_{j}$. The sparsity level $s$ upper bounds the total number of nonzero parameters in the model. This constraint can be satisfied using dropout layers in the implementation (Srivastava et al., 2014). In theory, sparse ReLU networks can fit smooth functions with a minimax optimal rate of convergence (Schmidt-Hieber, 2020). The main theorems in Section 4 rely on this property. An illustration of a sparse ReLU network is given in Figure 1.
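To make the construction above concrete, the following is a minimal PyTorch sketch of a ReLU network of the kind considered here, with dropout layers used as a practical surrogate for the sparsity constraint $s$; the layer widths, dropout rate, and input dimension are illustrative choices of ours, not values prescribed by the theory.

```python
import torch.nn as nn

class ReLUQNetwork(nn.Module):
    """ReLU network mapping a (reduced) state vector to a scalar Q-value for one action.

    Dropout is used here as a practical surrogate for the sparsity constraint s in
    F_SReLU; the widths below are illustrative only.
    """
    def __init__(self, input_dim, hidden_dims=(15, 5, 5), dropout=0.1):
        super().__init__()
        layers, d_in = [], input_dim
        for d_out in hidden_dims:
            layers += [nn.Linear(d_in, d_out), nn.ReLU(), nn.Dropout(dropout)]
            d_in = d_out
        layers.append(nn.Linear(d_in, 1))  # output dimension d_{L+1} = 1 (scalar Q-value)
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# One network per action a in A, as in the construction above (input_dim is arbitrary here).
q_nets = {a: ReLUQNetwork(input_dim=7) for a in (0, 1)}
```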

3 Spectral Fitted Q-Iteration

Neural networks with the ReLU activation function introduced in Section 2.3 are commonly used in value-based reinforcement learning algorithms. However, in medical studies, the training dataset is often of limited size, with a few thousand or tens of thousands of observations in total (see e.g., Marling and Bunescu, 2020; Liao et al., 2021). Meanwhile, the data contain high frequency state variables, which yields a high-dimensional state system. Directly using these states as input will produce a very noisy policy. This motivates us to consider dimension reduction in RL.

A naive approach for dimension reduction is to use some summary statistics of the high frequency state as input for policy learning. For instance, on the OhioT1DM dataset, the average of CGM blood glucose levels between two treatment decision points can be used as the summary statistic, as in Zhu et al. (2020); Shi et al. (2022); Zhou et al. (2022). In this paper, we propose to use principal component analysis to reduce the dimensionality of $\{Z_{i,t,j}\}_{j=1}^{J}$. We expect that using PCA can preserve more information than ad-hoc summaries (e.g., the average).

To apply PCA in the infinite horizon setting, we need to impose some stationarity assumptions on the concatenated high dimensional variable $Z_{i,t}\in\mathbb{R}^{m}$: $E[Z_{i,t}]=\mathbf{\mu}$ and $Cov[Z_{i,t}]=\mathbf{G}$ for some mean vector $\mathbf{\mu}\in\mathbb{R}^{m}$ and covariance matrix $\mathbf{G}\in\mathbb{R}^{m\times m}$ that are independent of $t$. In real data applications, we can test whether the concatenated high-frequency variable $Z_{i,t}$ is weakly stationary (see e.g., Dickey and Fuller, 1979; Said and Dickey, 1984; Kwiatkowski et al., 1992). If it is weakly stationary, the concatenated high frequency covariate $Z_{i,t}$ automatically satisfies the two assumptions above. Similar assumptions have been widely imposed in the literature (see e.g., Shi et al., 2021b; Kallus and Uehara, 2022). Without loss of generality, we assume $\mathbf{\mu}=0_{m}$ for notational simplicity. The covariance matrix $\mathbf{G}$ is generally unknown; in practice, we recommend using the sample covariance estimator $\hat{\mathbf{G}}$.

We describe our procedure as follows. By the spectral decomposition, we have $\hat{\mathbf{G}}=\sum_{k=1}^{m}\hat{\lambda}_{k}\hat{U}_{k}\hat{U}_{k}^{T}$, where $\hat{\lambda}_{1}\geq\hat{\lambda}_{2}\geq\cdots\geq 0$ are the eigenvalues and the $\hat{U}_{k}$'s are the corresponding eigenvectors. This allows us to represent $Z_{i,t}$ as $\sum_{k=1}^{m}\hat{\lambda}_{k}^{1/2}\hat{V}_{k}^{(i,t)}\hat{U}_{k}$, where the $\hat{V}_{k}^{(i,t)}$'s are the estimated principal component scores, given by $\hat{\lambda}_{k}^{-1/2}Z_{i,t}^{T}\hat{U}_{k}$. For any $\kappa\leq m$, the estimated principal component scores $\hat{\mathbf{V}}_{i,t,\kappa}=(\hat{V}_{1}^{(i,t)},\hat{V}_{2}^{(i,t)},\ldots,\hat{V}_{\kappa}^{(i,t)})$ correspond to the $\kappa$ largest eigenvalues of the concatenated high frequency variable $Z_{i,t}$. When $\kappa=m$, using these principal component scores is equivalent to using the original high-frequency variable $Z_{i,t}$. We approximate $Q^{\pi}(X_{i,t},\mathbf{V}_{i,t,m},A_{i,t})$ by $Q^{\pi}(X_{i,t},\hat{\mathbf{V}}_{i,t,\kappa},A_{i,t})$ and propose to use the neural fitted Q-iteration algorithm of Riedmiller (2005) to learn the estimated optimal policy. We detail our procedure in Algorithm 1.
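As a concrete illustration of this step, the sketch below computes the estimated principal component scores $\hat{\mathbf{V}}_{i,t,\kappa}$ from the sample covariance of the concatenated high-frequency vectors. The function name and interface are ours, and the observations are assumed to be centered, consistent with the convention $\mathbf{\mu}=0_{m}$ above.

```python
import numpy as np

def pca_scores(Z, kappa):
    """Compute the first kappa estimated principal component scores.

    Z     : (n_obs, m) array of centered concatenated high-frequency vectors Z_{i,t}
    kappa : number of principal components to retain (kappa <= m)
    Returns an (n_obs, kappa) array whose rows are the score vectors V_hat_{i,t,kappa}.
    """
    G_hat = np.cov(Z, rowvar=False)            # sample covariance estimator G_hat
    eigvals, eigvecs = np.linalg.eigh(G_hat)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:kappa]  # indices of the kappa largest eigenvalues
    lam, U = eigvals[order], eigvecs[:, order]
    # V_hat_k^{(i,t)} = lambda_hat_k^{-1/2} Z_{i,t}^T U_hat_k, computed for all rows at once
    return (Z @ U) / np.sqrt(np.maximum(lam, 1e-12))
```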

Algorithm 1 Spectral Fitted Q-Iteration
Input: $\{S_{i,t}=(X_{i,t},Z_{i,t}),A_{i,t},Y_{i,t},\gamma\}$ with $i\in\{1,2,\ldots,N\}$, $t\in\{1,2,\ldots,T\}$; ReLU network function class $\mathcal{F}_{SReLU}=\mathcal{F}_{SReLU}(L,\{d_{j}\}_{j=0}^{L+1},s,V_{\max})$; sampling distribution $\sigma$; sample size $n$; number of principal components $\kappa$; number of iterations $K$; estimated covariance matrix $\hat{\mathbf{G}}$ for $Z_{i,t}$;
Compute the first $\kappa$ principal component scores $\hat{\mathbf{V}}_{i,t,\kappa}$ of the high frequency part $Z_{i,t}$;
For each $a\in\mathcal{A}$, initialize a sparse ReLU network $\tilde{Q}_{0}(x,v_{\kappa},a)\in\mathcal{F}_{SReLU}$;
for $k=0$ to $K-1$ do
   Sample $n$ observations $I_{k}\subseteq\{(i,t):1\leq i\leq N,1\leq t\leq T-1\}$ according to $\sigma$ from the data;
   Define a response $R_{i,t}$ based on $\tilde{Q}_{k}$: $R_{i,t}(\tilde{Q}_{k})=Y_{i,t}+\gamma\max_{a\in\mathcal{A}}\tilde{Q}_{k}(X_{i,t+1},\hat{\mathbf{V}}_{i,t+1,\kappa},a)$;
   Update $\tilde{Q}_{k}$ to $\tilde{Q}_{k+1}$:
$$\tilde{Q}_{k+1}\leftarrow\operatorname*{argmin}_{f(\cdot,a)\in\mathcal{F}_{SReLU}}\frac{1}{n}\sum_{(i,t)\in I_{k}}[R_{i,t}(\tilde{Q}_{k})-f(X_{i,t},\hat{\mathbf{V}}_{i,t,\kappa},A_{i,t})]^{2}$$
end for
Return: the greedy policy $\pi_{K}$, defined by $\pi_{K}(a|x,v_{\kappa})=0$ if $a\notin\operatorname*{argmax}_{a^{\prime}}\tilde{Q}_{K}(x,v_{\kappa},a^{\prime})$, for all $x,v_{\kappa},a$.

In Algorithm 1, we fit $|\mathcal{A}|$ neural networks, one for each $a$ in $Q(s,a)$. This is reasonable in settings where the action space is small. When the dataset is small (such as the OhioT1DM dataset), we recommend setting $n$ to $N(T-1)$ so that all the data transitions (instead of a random subsample) are used in each iteration.

Similar to the original neural fitted Q-iteration algorithm in Riedmiller (2005), the intuition behind this algorithm is based on the Bellman optimality equation (1). At step $k$ of Algorithm 1, $\tilde{Q}_{k}$ estimates $Q^{\pi^{*}}$, and the response $R_{i,t}(\tilde{Q}_{k})=Y_{i,t}+\gamma\max_{a\in\mathcal{A}}\tilde{Q}_{k}(X_{i,t+1},\hat{\mathbf{V}}_{i,t+1,\kappa},a)$ corresponds to the right-hand side of equation (1). Therefore, regressing $R_{i,t}$ to obtain $\tilde{Q}_{k+1}$ amounts to solving the Bellman optimality equation. The key difference between Algorithm 1 and the original neural fitted Q-iteration algorithm is that the high dimensional input $Z_{i,t}=[Z_{i,t,1}^{T},Z_{i,t,2}^{T},\ldots,Z_{i,t,J}^{T}]^{T}$ in the state is mapped to a lower dimensional vector $\hat{\mathbf{V}}_{i,t,\kappa}$ during the learning process, so the neural networks $\tilde{Q}_{0}(x,v_{\kappa},a)$ take the principal components $\hat{\mathbf{V}}_{i,t,\kappa}$ rather than the original high dimensional $Z_{i,t}$ as input.
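For readers who prefer code to pseudocode, the following is a schematic and much simplified rendering of the iteration in Algorithm 1. The regression routine `fit_q_regression` is a placeholder for the least-squares fit over the sparse ReLU class, and all names and the data layout are ours rather than part of the algorithm's specification.

```python
import numpy as np

def spectral_fitted_q_iteration(S, S_next, A, Y, actions, gamma, K, fit_q_regression):
    """Schematic version of Algorithm 1, operating on reduced states (x, v_kappa).

    S, S_next        : (n, m0 + kappa) current and next reduced states per transition
    A, Y             : (n,) observed actions and rewards
    fit_q_regression : (S, A, targets) -> callable Q(S, a); placeholder for the
                       least-squares fit over the sparse ReLU class
    """
    Q = lambda s, a: np.zeros(len(s))  # Q_tilde_0 initialised at zero for illustration
    for _ in range(K):
        # Bellman targets R_{i,t}(Q_tilde_k) = Y_{i,t} + gamma * max_a Q_tilde_k(s', a)
        targets = Y + gamma * np.max(
            np.column_stack([Q(S_next, a) for a in actions]), axis=1)
        Q = fit_q_regression(S, A, targets)  # regression step yielding Q_tilde_{k+1}
    # greedy policy pi_K with respect to Q_tilde_K
    return lambda s: actions[int(np.argmax([Q(s[None, :], a)[0] for a in actions]))]
```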

4 Asymptotic Properties

Before discussing the asymptotic properties of our proposed Q-learning method, we introduce some notations.

Definition 4.1.

Define

$$\mathcal{F}_{0}(L,\{d_{i}\}_{i=0}^{L+1},s,V_{\max})=\{f:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R},\ f(\cdot,a)\in\mathcal{F}_{SReLU}(L,\{d_{i}\}_{i=0}^{L+1},s,V_{\max}),\ \forall a\in\mathcal{A}\},$$

where $\mathcal{F}_{SReLU}(L,\{d_{i}\}_{i=0}^{L+1},s,V_{\max})$ is the class of sparse ReLU networks with $L$ layers, sparsity parameter $s$, and $V_{\max}=\frac{R_{\max}}{1-\gamma}$, the uniform upper bound on the cumulative reward.

We define $L_{\mathcal{F}_{0}}=\sup_{f\in\mathcal{F}_{0}}\sup_{x\neq y}\frac{|f(y)-f(x)|^{2}}{\|y-x\|^{2}}$ as the Lipschitz constant for the sparse ReLU class $\mathcal{F}_{SReLU}(L,\{d_{i}\}_{i=0}^{L+1},s,V_{\max})$ used in $\mathcal{F}_{0}$. Note that the class $\mathcal{F}_{0}$ is used in the original neural fitted Q-iteration algorithm to model the Q-function, where the dimension of the high frequency part $Z$ of the state is not reduced through PCA. We further define a function class $\mathcal{F}_{2}$ that models the Q-function by first converting the high frequency part $Z$ into its principal component scores and then applying a sparse ReLU neural network to obtain the resulting Q-function. More specifically, $\mathcal{F}_{2}$ is the set of functions $\{f_{2}\}$ such that $f_{2}(x,z,a)=f_{0}(x,\hat{v}_{\kappa},a)$, where $f_{0}\in\mathcal{F}_{0}$ and $\hat{v}_{\kappa}$ is the vector containing the first $\kappa$ principal components of $z$. Note that $\mathcal{F}_{2}$ is the function class we use in Algorithm 1 to model the Q-function; that is, $\tilde{Q}_{k}\in\mathcal{F}_{2}$ for $k\in\{1,2,\ldots,K\}$ (the formal definition of $\mathcal{F}_{2}$, and of another function class $\mathcal{F}_{1}$ not mentioned here, is given in Section A of the Appendix).

In addition, we introduce the Hölder smooth function class $\mathcal{C}_{r}(\mathcal{D},\beta,H)$, where $\mathcal{D}\subseteq\mathbb{R}^{r}$ is the input domain. The definition is as follows:

Definition 4.2.

Define

$$\mathcal{C}_{r}(\mathcal{D},\beta,H)=\Big\{f:\mathcal{D}\rightarrow\mathbb{R}:\sum_{\gamma<\beta}\|\partial^{\gamma}f\|_{\infty}+\sum_{\alpha\in\mathbb{N}_{0}^{r}:\|\alpha\|=\lfloor\beta\rfloor}\sup_{x,y\in\mathcal{D},x\neq y}\frac{|\partial^{\alpha}f(x)-\partial^{\alpha}f(y)|}{\|x-y\|_{\infty}^{\beta-\lfloor\beta\rfloor}}\leq H\Big\},$$

where $\alpha\in\mathbb{N}_{0}^{r}$ is an $r$-tuple multi-index for partial derivatives.

We next construct a network structure $\mathcal{G}(\{p_{j},t_{j},\beta_{j},H_{j}\}_{j\in[q]})$ whose component functions on each layer belong to a Hölder smooth function class $\mathcal{C}(\mathcal{D},\beta,H)$; we refer to it as a composition of Hölder smooth functions. This composition network contains $q$ layers, with each layer being $g_{j}:[a_{j},b_{j}]^{p_{j}}\rightarrow[a_{j+1},b_{j+1}]^{p_{j+1}}$, such that $g_{jk}$, the $k$th component ($k\in[p_{j+1}]$) in layer $j$, satisfies $g_{jk}\in\mathcal{C}_{t_{j}}([a_{j},b_{j}]^{t_{j}},\beta_{j},H_{j})$ with $t_{j}\leq p_{j}$ inputs. Similar to the definition of $\mathcal{F}_{0}$, we can define a function class $\mathcal{G}_{0}(\{p_{j},t_{j},\beta_{j},H_{j}\}_{j\in[q]})$ (simply denoted by $\mathcal{G}_{0}$) of functions from $\mathcal{S}\times\mathcal{A}$ to $\mathbb{R}$ such that each $g\in\mathcal{G}_{0}$ satisfies $g(\cdot,a)\in\mathcal{G}(\{p_{j},t_{j},\beta_{j},H_{j}\}_{j\in[q]})$ for all $a\in\mathcal{A}$. The relation between the function class $\mathcal{G}_{0}$ and the network structure $\mathcal{G}$ is analogous to the relation between the function class $\mathcal{F}_{0}$ and the neural network class $\mathcal{F}_{SReLU}$ in Definition 4.1. See Definition 4.1 of Fan et al. (2020) for more details on $\mathcal{G}_{0}(\{p_{j},t_{j},\beta_{j},H_{j}\}_{j\in[q]})$.

Next we will introduce the three major assumptions for our theorems:

Assumption 4.3.

The eigenvalues of $Cov(Z)$ decay exponentially, i.e., $\lambda_{k}=O(e^{-\zeta k})$ for $k=1,2,\ldots,m$ and some constant $\zeta>0$.

Assumption 4.4.

The estimator $\hat{\mathbf{G}}=\sum_{k=1}^{m}\hat{\lambda}_{k}\hat{U}_{k}\hat{U}_{k}^{T}$ satisfies $\|\hat{U}_{k}-U_{k}\|_{2}=O_{p}(n^{-\Delta})$ for $1\leq k\leq m$, where $\Delta>0$ is some constant.

Assumption 4.5.

First, we define the Bellman optimality operator $\mathcal{T}$ as

$$\mathcal{T}f(x,z,a)=E\{Y_{0,t}+\gamma\max_{a^{\prime}\in\mathcal{A}}f(X_{0,t+1},Z_{0,t+1},a^{\prime})\mid A_{0,t}=a,X_{0,t}=x,Z_{0,t}=z\}.$$

Then we assume $\mathcal{T}f\in\mathcal{G}_{0}$ for any $f\in\mathcal{F}_{2}$.

Among the three assumptions, the exponentially decaying eigenvalue structure in Assumption 4.3 is commonly found in the literature on high-dimensional and functional data analysis (see e.g., Reiß and Wahl, 2020; Crambes and Mas, 2013; Jirak, 2016). This assumption controls the information loss caused by using only the first $\kappa$ principal component scores of $Z_{i,t}$. Assumption 4.4 concerns the consistency of the estimator $\hat{\mathbf{G}}$, and similar assumptions are imposed in the functional data analysis literature (see e.g., Laber and Staicu, 2018; Staicu et al., 2014). Using arguments similar to those in the proof of Theorem 5.2 of Zhang and Wang (2016), we can show that such an assumption holds in our setting as well. It is used to bound the error caused by the estimation of the covariance matrix. Assumption 4.5 is referred to as the completeness assumption in the literature (see e.g., Chen and Jiang, 2019; Uehara et al., 2022, 2021). This assumption is automatically satisfied when the transition kernel and the reward function satisfy certain smoothness conditions.

Our first theorem is concerned with the nonasymptotic convergence rate of $\tilde{Q}_{K}$ in Algorithm 1.

Theorem 4.6 (Convergence of Estimated Q-Function).

Let $\mu$ be some distribution on $\mathcal{S}$ such that $\frac{d\mu(s)}{ds}$ is bounded away from 0. Under Assumptions 4.3 to 4.5 and with sufficiently large $n$, there exists a sparse ReLU network structure for the function class $\mathcal{F}_{2}$ modeling $\tilde{Q}(s,a)$ such that $\tilde{Q}_{K}$ obtained from Algorithm 1 satisfies:

$$\frac{1}{|\mathcal{A}|}\sum_{a\in\mathcal{A}}\big\|Q^{\pi^{*}}(\cdot,a)-\tilde{Q}_{K}(\cdot,a)\big\|_{2,\mu}^{2}=O_{p}\Big(|\mathcal{A}|(n^{\alpha^{*}}\log^{\xi^{*}}{n}+d_{1}^{*}\kappa)n^{-1}\log^{\xi^{*}+1}{n}+L_{\mathcal{F}_{0}}(e^{-\zeta\kappa}-e^{-\zeta m}+n^{-2\Delta})+\frac{\gamma^{2K}}{(1-\gamma)^{2}}R_{\max}^{2}\Big),$$

where $\xi^{*}>1$ and $0<\alpha^{*}<1$ are some constants, $|\mathcal{A}|$ is the number of treatment options, and $d_{1}^{*}$ is the width of the first layer of the sparse ReLU network used in $\mathcal{F}_{2}$, satisfying the bound $m_{0}+m\leq d_{1}^{*}\leq n^{\alpha^{*}}$.

More details of the neural network structure and the sample size assumptions for Theorem 4.6 can be found in Section B of the Appendix. Theorem 4.6 provides an error bound on the estimated Q-function $\tilde{Q}_{K}$. Based on this theorem, we further establish the regret bound of the estimated policy $\pi_{K}$ obtained via Algorithm 1. Toward that end, we need another assumption:

Assumption 4.7.

Assume there exist $\eta>0$ and $\delta_{0}>0$ such that

$$P\Big(s:\max_{a}Q^{\pi^{*}}(s,a)-\max_{a\in\mathcal{A}\setminus\operatorname*{arg\,max}_{a^{\prime}}Q^{\pi^{*}}(s,a^{\prime})}Q^{\pi^{*}}(s,a)\leq\epsilon\Big)=O(\epsilon^{\eta})$$

for $0<\epsilon\leq\delta_{0}$.

The margin-type condition in Assumption 4.7 is commonly used in the literature. Specifically, in classification, margin conditions are imposed to bound the excess risk (Tsybakov, 2004; Audibert and Tsybakov, 2007). In dynamic treatment regimes, a similar assumption is introduced to prove the convergence of the state-value function in finite horizon settings (Qian and Murphy, 2011; Luedtke and van der Laan, 2016). In RL, such an assumption was introduced by Shi et al. (2022) to obtain a sharper regret bound for the estimated optimal policy.

Theorem 4.8 (Convergence of State-Value Function).

Under Assumptions 4.3 to 4.7 and the conditions on $\mu$, $\mathcal{F}_{2}$ and $n$ in Theorem 4.6, we have:

$$E_{\mu}[V^{\pi^{*}}(s)-V^{\pi_{K}}(s)]=O_{p}\Big(\frac{1}{1-\gamma}\Big\{|\mathcal{A}|(n^{\alpha^{*}}\log^{\xi^{*}}{n}+d_{1}^{*}\kappa)n^{-1}\log^{\xi^{*}+1}{n}+L_{\mathcal{F}_{0}}(e^{-\zeta\kappa}-e^{-\zeta m}+n^{-2\Delta})+\frac{\gamma^{2K}}{(1-\gamma)^{2}}R_{\max}^{2}\Big\}^{\frac{\eta+1}{\eta+2}}\Big).$$

The proofs of the two theorems are included in Section C of the Appendix. We summarize our theoretical findings below. First, we notice that the convergence rate of the regret in Theorem 4.8 is faster than the convergence rate of the estimated Q-function in Theorem 4.6. This is due to the margin-type Assumption 4.7, which enables us to obtain a sharper error bound. Similar results have been established in the literature; see e.g., Theorem 3.3 in Audibert and Tsybakov (2007), Theorem 3.1 in Qian and Murphy (2011), and Theorems 3 and 4 in Shi et al. (2022).

From Theorem 4.8, it can be seen that the regret bound is mainly determined by three parameters: the sample size $n$, the number of principal components $\kappa$ and the number of iterations $K$. Here, the first term $|\mathcal{A}|(n^{\alpha^{*}}\log^{\xi^{*}}{n}+d_{1}^{*}\kappa)n^{-1}\log^{\xi^{*}+1}{n}$ on the right-hand side corresponds to the estimation error, which decreases with $n$ and increases with $\kappa$. The second term $L_{\mathcal{F}_{0}}(e^{-\zeta\kappa}-e^{-\zeta m}+n^{-2\Delta})$ corresponds to the approximation error, which decreases with both $n$ and $\kappa$. The remaining term $\frac{\gamma^{2K}}{(1-\gamma)^{2}}R_{\max}^{2}$ is the optimization error, which decreases as the iteration number $K$ in Algorithm 1 grows.

Compared with existing results on the convergence rate of the deep fitted Q-iteration algorithm (see Theorem 4.4 of Fan et al., 2020), our theorems additionally characterize the dependence on the number of principal components. Specifically, selecting the first $\kappa$ principal components induces an information loss (e.g., bias) of the order $e^{-\zeta\kappa}-e^{-\zeta m}$, but reduces the model complexity caused by the high frequency variables from $d_{1}^{*}m$ to $d_{1}^{*}\kappa$, and hence the variance of the policy estimator. This represents a bias-variance trade-off. Since the bias decays at an exponential rate, reducing the model complexity can be more beneficial when the training dataset is small. Thus, our algorithm is expected to perform better than the original fitted Q-iteration algorithm in small samples, as shown in our numerical studies.

Finally, the number of principal components should diverge with $n$ to ensure the consistency of the proposed algorithm. Based on the two theorems, the optimal $\kappa^{*}$ that balances the bias-variance trade-off satisfies $\kappa^{*}\asymp\log(n)$ (details are given in Section D of the Appendix). Thus, as $n$ goes to infinity, we will eventually take $\kappa^{*}=m$ and Algorithm 1 becomes equivalent to the original neural fitted Q-iteration. This is only an asymptotic guideline for selecting the number of principal components; we provide some practical guidelines in the next section.

5 Empirical Studies

5.1 Practical Guidelines for Number of Principal Components

In functional data analysis, several criteria have been developed to select the number of principal components, including the percentage of variance explained, the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) (see e.g., Yao et al., 2005; Li et al., 2013). In our setting, it is difficult to apply AIC/BIC, since there does not exist a natural objective function (e.g., a likelihood) for Q-function estimation. One possibility is to extend the value information criterion (Shi et al., 2021a) developed in single-stage decision making to infinite horizons. Nonetheless, it remains unclear how to determine the penalty parameter for consistent tuning parameter selection.

Here, we select $\kappa$ based on the percentage of variance explained. That is, we look for the minimum value of $\kappa$ such that the total variance explained by PCA reaches a certain level (e.g., $95\%$). This method is also employed by Laber and Staicu (2018) in single-stage decision making; a minimal sketch of the rule is given below. To illustrate the empirical performance of this method, we apply Algorithm 1 with $\kappa\in\{2,6,10,14,\ldots,74\}$ and evaluate the expected return of these policies in the following numerical study.
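The selection rule just described can be written in a few lines. The sketch below assumes the eigenvalues of the sample covariance $\hat{\mathbf{G}}$ have already been computed as in Section 3; the function name and default threshold are ours.

```python
import numpy as np

def select_kappa(eigenvalues, threshold=0.95):
    """Smallest kappa whose leading eigenvalues explain at least `threshold`
    of the total variance of the concatenated high-frequency vector Z."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    explained = np.cumsum(lam) / np.sum(lam)      # cumulative proportion of variance
    return int(np.searchsorted(explained, threshold) + 1)
```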

The data generating process can be described as follows. We set the low frequency covariate $X_{i,t}$ to be a 2-dimensional vector and the high frequency variable $Z_{i,t}$ to be a 108-dimensional vector ($m=108$). Both are sampled from mean-zero normal distributions. The covariance matrix of $Z$ is set to satisfy Assumption 4.3. The action space is binary and the behavior policy used to generate actions in the training data is a uniform random policy. The reward function is set to $r(x,z,a)=x\beta_{1,a}+z\beta_{2,a}+c\max\{(x\beta_{1,a}+z\beta_{2,a}),0\}$ for some constant $c$ and coefficient vectors $\beta_{1,a},\beta_{2,a}$, so that it is a mixture of a linear function and a single-layer neural network with the ReLU activation function. The next state $(X_{i,t+1},Z_{i,t+1})$ is generated from a normal distribution whose mean is a linear function of the state $(X_{i,t},Z_{i,t})$ and action $A_{i,t}$. The number of trajectories $N$ is fixed to 6 and the length of each trajectory $T$ is set to 80.

The ReLU network is constructed with 3 hidden layers of widths $d_{1}=15$, $d_{2}=d_{3}=5$. Dropout layers with a $10\%$ dropout rate are added between layer 1, layer 2 and layer 3. During training, the dropout layers randomly set the outputs from the previous layer to 0 with probability $10\%$, which introduces sparsity to the neural network and reduces over-fitting (Srivastava et al., 2014). The hyperparameters of the neural network structure can be tuned via cross-validation. The discount factor $\gamma$ is fixed to 0.5.

To evaluate the policy performance, we use a Monte Carlo method to approximate the expected return under each estimated policy. Specifically, for each estimated policy $\pi$, we generate $N_{mc}=100$ trajectories, each of length $T_{mc}=20$ (in our setting with $\gamma=0.5$, the cumulative reward after $T_{mc}=20$ is negligible). The initial state distribution is the same as that in the training dataset. The actions are assigned according to $\pi$. The expected return is then approximated by the average of the empirical cumulative rewards over the 100 trajectories.
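This Monte Carlo approximation can be summarized by the short sketch below, where `policy`, `sample_initial_state` and `env_step` are placeholders of ours for the estimated policy and the simulator of the data generating process.

```python
import numpy as np

def monte_carlo_value(policy, sample_initial_state, env_step,
                      gamma=0.5, n_traj=100, horizon=20):
    """Approximate the expected return of `policy` by averaging discounted
    cumulative rewards over simulated trajectories.
    env_step(s, a) -> (next_state, reward) is the simulator of the environment."""
    returns = []
    for _ in range(n_traj):
        s, total, discount = sample_initial_state(), 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            s, r = env_step(s, a)
            total += discount * r
            discount *= gamma
        returns.append(total)
    return np.mean(returns), np.std(returns) / np.sqrt(n_traj)  # estimate and its s.e.
```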

For each $\kappa$ in the list $\{2,6,10,14,\ldots,74\}$, we apply Algorithm 1 to learn the optimal policy over 80 random seeds and evaluate the expected returns using the Monte Carlo method. We then take the sample average and standard error of these 80 expected returns to estimate the value of the policy and construct the margin of error. Figure 2 depicts the estimated values of these expected returns as well as their confidence intervals. It can be seen that increasing $\kappa$ from 2 to 6 leads to a significant improvement. However, further increasing $\kappa$ worsens the performance. This trend is consistent with our theory, since the bias term dominates the estimation error for small values of $\kappa$. When $\kappa$ increases, the bias decays exponentially fast and the model complexity term becomes the leading term. Meanwhile, the percentage of variance explained increases quickly when $\kappa\leq 6$ and remains stable afterwards. As such, it makes sense to use this criterion for selecting $\kappa$. In our implementation, we select the smallest $\kappa$ such that the variance explained is at least $95\%$.

5.2 Simulation Study

In the simulation study, we compare the proposed policy $\pi_{K}^{PCA}$ against two baseline policies obtained by directly using the original high frequency variables $Z_{i,t,j}$ (denoted by $\pi_{K}^{ALL}$) or their averages (denoted by $\pi_{K}^{AVE}$) as input. Both policies are computed in a similar manner based on the deep fitted Q-iteration algorithm. We additionally include one more baseline policy, denoted by $\pi_{K}^{BOTTLE}$, which adds a bottleneck layer after the input of the original $Z$ in the neural network architecture such that the width of this bottleneck layer equals the input dimension of the proposed policy $\pi_{K}^{PCA}$. This policy differs from the proposed policy in that it uses the bottleneck layer for dimension reduction instead of PCA. Both the data generating setting and the neural network structure are the same as in Section 5.1. However, in this simulation study we vary the sample size and the dimension of the high frequency variables. Specifically, we consider 9 cases of training size ($N=6$ with $T=60,75,90,105,120$, and $N=7,8,9,10$ with $T=120$) and 5 different dimensions of the high frequency part, $m=27,54,108,162,216$. Furthermore, we have two settings for generating the high frequency variables, which are discussed below. We similarly compare the proposed policy against $\pi_{K}^{ALL}$, $\pi_{K}^{BOTTLE}$, and $\pi_{K}^{AVE}$ and use the Monte Carlo method to evaluate their values.

In the first setting, we consider 5 cases with $J$ (the number of high frequency variables) equal to $1,2,4,6,8$, where each high frequency variable $Z_{j}$, $j\in\{1,2,\ldots,J\}$, is of dimension 27. In this setting, all $J$ high frequency variables are dependent and the eigenvalues of the concatenated high frequency variable $Z$ decay at an exponential rate. We find that the first 5 principal components explain over $95\%$ of the variance in all 5 cases, as shown in Figure 3. Therefore, we set the number of principal components to $\kappa=5$ and plot the results in Figure 4.

In the second setting, the $J$ high frequency variables are independent of each other. For each $j$, all the elements in $Z_{j}$ are dependent and the eigenvalues of $Z_{j}$ decay at the same exponential rate as the eigenvalues of the concatenated high frequency variable $Z$ in the first setting. Therefore, as $J$ increases, more principal components are needed to guarantee that the percentage of variance explained exceeds $95\%$. Specifically, for $J=1,2,4,6,8$ high frequency variables, the corresponding $\kappa$ is $5,10,20,30,40$, respectively. See Figure 3 for details. The expected returns of all estimated optimal policies are plotted in Figure 5.

From Figures 4 and 5, it can be seen that the proposed policy $\pi_{K}^{PCA}$ always achieves a larger value than the three baseline policies. Meanwhile, $\pi_{K}^{ALL}$ and $\pi_{K}^{BOTTLE}$ perform comparably. The values of both are significantly affected by the training size $n$. In addition, $\pi_{K}^{AVE}$ outperforms $\pi_{K}^{ALL}$ and $\pi_{K}^{BOTTLE}$ in small samples, but performs worse than the two policies when the sample size is large. In the second setting, $\pi_{K}^{ALL}$ and $\pi_{K}^{BOTTLE}$ tend to perform much better than $\pi_{K}^{AVE}$ when $J=4,6,8$, since averaging over several high frequency variables loses more information relevant to policy learning.

Finally, we conduct an additional simulation study with large training datasets where $N=200$ or $4000$ and $T=120$. This setting might be unrealistic in an mHealth dataset; it is included only to test the performance of $\pi_{K}^{ALL}$. As $\pi_{K}^{ALL}$ is consistent as well, we anticipate that the difference between the value under $\pi_{K}^{ALL}$ and that under the proposed policy will be negligible as the sample size grows to infinity. Figure 6 depicts the results. As expected, we observe no significant difference between $\pi^{PCA}$ and $\pi^{ALL}$ when $N\geq 200$.

5.3 Application to the OhioT1DM Dataset

We apply the proposed Algorithm 1 to the updated OhioT1DM dataset of Marling and Bunescu (2020). In this real data application, we again compare the behaviors of the four policies: $\pi_{K}^{PCA}$, $\pi_{K}^{ALL}$, $\pi_{K}^{BOTTLE}$ and $\pi_{K}^{AVE}$. The OhioT1DM dataset contains medical information on 12 type-I diabetes patients, including the CGM blood glucose levels of the patients, insulin doses applied during this period, self-reported information on meals and exercises, and other variables recorded by mobile phone apps and physiological sensors. The high frequency variables in the OhioT1DM dataset, such as the CGM blood glucose levels, are recorded every 5 minutes. The data on exercises and meals are collected with a much lower frequency, say every few hours. Moreover, although the basal insulin rate in this dataset is also collected every 5 minutes, it usually remains constant for several hours within a day. Thus, the basal insulin rate can also be regarded as a low-frequency scalar variable by taking its average. The time period between two decision points is set to 3 hours, as we only consider non-emergency situations where patients do not need to take a bolus injection promptly. In other studies using the OhioT1DM dataset, the treatment decision frequency is also set to be much lower than the recording frequency of the CGM blood glucose levels (see e.g., Shi et al., 2022; Zhou et al., 2021; Zhu et al., 2020).

For the low frequency covariate $X_{i,t}=(X_{i,t}^{(1)},X_{i,t}^{(2)})$, $X_{i,t}^{(1)}$ is constructed based on the $i$th patient's self-reported carbohydrate estimate for the meal during the past 3-hour interval $[t-1,t)$. The second scalar variable in $X_{i,t}$ is defined as the average basal rate of insulin dose during the past interval $[t-1,t)$. We consider one high-frequency element, $Z_{i,t}$, which contains the CGM blood glucose levels recorded every 5 minutes during the past 3 hours (its dimension is $m=36$). The action variable $A_{i,t}$ is set to 1 when the total amount of insulin delivered to the $i$-th patient is greater than one unit in the past interval. The response variable $Y_{i,t}$ is defined according to the Index of Glycemic Control (IGC, Rodbard, 2009), which is a non-linear function of the blood glucose levels in the following stage. A higher IGC value indicates that the blood glucose level of this patient stays close to or falls into the proper range of glucose levels.

In this study, $\kappa=5$ is selected when training $\pi_{K}^{PCA}$, as the proportion of variance explained by the first 5 principal components is over $99\%$, as shown in Figure 7. The ReLU network here has 2 hidden layers of width $d_{i}=6$, $i=1,2$ (dropout layers with a $10\%$ dropout rate are added between layer 1, layer 2 and the output layer). To estimate the value $V^{\pi}(x,z)$ of the four policies, we use the Fitted Q Evaluation algorithm proposed by Le et al. (2019). When applying the Fitted Q Evaluation algorithm, a random forest model is used to fit the estimated Q-function of the policy to be evaluated. Dividing the 12 patients into a training set of 9 patients and a testing set of 3 patients yields 220 repetitions with different patient combinations. In each repetition, the data of the 9 training patients are used to train the policy and fit the random forest for the Fitted Q Evaluation of this policy. The data of the other 3 patients are used to approximate the value of the policy using the estimated Q-function from Fitted Q Evaluation. The sample mean of the estimated values over all 220 repetitions is taken as our main result, and the standard errors are used to construct the margin of error. To compare the performance of our proposed policy $\pi_{K}^{PCA}$ against $\pi_{K}^{ALL}$, $\pi_{K}^{AVE}$, and $\pi_{K}^{BOTTLE}$, we present the differences between the estimated value of $\pi^{PCA}$ and those of the three other policies in Table 1, where the margin of error is the standard error of the mean difference multiplied by the critical value 1.96. The estimated values of the four policies are given in Table 2.
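For completeness, a simplified sketch of the Fitted Q Evaluation step with a random forest regressor (scikit-learn), in the spirit of Le et al. (2019), is given below. The interface, number of trees and number of iterations are illustrative assumptions of ours, not the exact settings used to produce Tables 1 and 2.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fitted_q_evaluation(S, A, Y, S_next, policy, gamma=0.5, n_iter=50):
    """Estimate Q^pi by iterating regressions of Y + gamma * Q(s', pi(s')) on (s, a)."""
    SA = np.hstack([S, A[:, None]])
    A_next = np.array([policy(s) for s in S_next])    # actions chosen by the target policy
    SA_next = np.hstack([S_next, A_next[:, None]])
    model = RandomForestRegressor(n_estimators=200).fit(SA, Y)
    for _ in range(n_iter):
        targets = Y + gamma * model.predict(SA_next)  # Bellman targets under pi
        model = RandomForestRegressor(n_estimators=200).fit(SA, targets)
    return model  # evaluating model on (s, pi(s)) approximates V^pi(s)
```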

Based on the results, the estimated value of $\pi_{K}^{PCA}$ is higher than those of all three baselines. The policy $\pi_{K}^{AVE}$, obtained by using the average of the CGM blood glucose levels, is commonly used in the literature (Zhou et al., 2021; Zhu et al., 2020). The weaker performance of $\pi_{K}^{AVE}$ is probably due to the information loss from simply replacing the CGM blood glucose levels with their average. On the other hand, the size of the training data is relatively small, as we did not use all the data recorded over the eight weeks due to large chunks of missing values; eventually, data from about 5 consecutive weeks are used for training the DTR policies. In such scenarios, using the original high frequency vector $Z_{i,t}$ significantly increases the complexity of the ReLU network structure, so that the number of parameters to be trained is too large compared to the size of the training data. Thus, $\pi_{K}^{ALL}$ and $\pi_{K}^{BOTTLE}$ cannot outperform $\pi_{K}^{PCA}$, whose input dimension is reduced by PCA. The results shown in Tables 1 and 2 agree with the findings in Section 5.2.

6 Discussions

In summary, we propose a deep spectral fitted Q-iteration algorithm to handle mixed frequency data in infinite horizon settings. The algorithm relies on PCA for dimension reduction and on deep neural networks to capture the non-linearity in the high-dimensional system. In theory, we establish a regret bound for the estimated optimal policy. Our theorem provides an asymptotic guideline for selecting the number of principal components. In empirical studies, we demonstrate the superiority of the proposed algorithm over baseline methods that either apply no dimension reduction or use ad-hoc summaries of the state. We further offer practical guidelines for selecting the number of principal components. The proposed method is built upon PCA and deep Q-learning. It is worthwhile to investigate the performance of other dimension reduction and RL methods. We leave this for future research.

References

  • Arulkumaran et al. (2017) Arulkumaran, K., Deisenroth, M. P., Brundage, M., and Bharath, A. A. “Deep reinforcement learning: A brief survey.” IEEE Signal Processing Magazine, 34(6):26–38 (2017).
  • Audibert and Tsybakov (2007) Audibert, J. and Tsybakov, A. “Fast learning rates for plug-in classifiers.” The Annals of statistics, 35(2):608–633 (2007).
  • Chen et al. (2022) Chen, E. Y., Song, R., and Jordan, M. I. “Reinforcement Learning with Heterogeneous Data: Estimation and Inference.” arXiv preprint arXiv:2202.00088 (2022).
  • Chen and Jiang (2019) Chen, J. and Jiang, N. “Information-theoretic considerations in batch reinforcement learning.” International Conference on Machine Learning, 1042–1051 (2019).
  • Ciarleglio et al. (2015) Ciarleglio, A., Petkova, E., Ogden, R. T., and Tarpey, T. “Treatment decisions based on scalar and functional baseline covariates.” Biometrics, 71(4):884–894 (2015).
  • Ciarleglio et al. (2018) Ciarleglio, A., Petkova, E., Ogden, T., and Tarpey, T. “Constructing treatment decision rules based on scalar and functional predictors when moderators of treatment effect are unknown.” Journal of the Royal Statistical Society. Series C, Applied statistics, 67(5):1331 (2018).
  • Ciarleglio et al. (2016) Ciarleglio, A., Petkova, E., Tarpey, T., and Ogden, R. T. “Flexible functional regression methods for estimating individualized treatment rules.” Stat (Int Stat Inst), 5(1):185–199 (2016).
  • Crambes and Mas (2013) Crambes, C. and Mas, A. “Asymptotics of prediction in functional linear regression with functional outputs.” Bernoulli, 2627–2651 (2013).
  • Curran et al. (2016) Curran, W., Brys, T., Aha, D., Taylor, M., and Smart, W. D. “Dimensionality reduced reinforcement learning for assistive robots.” In 2016 AAAI Fall Symposium Series (2016).
  • Curran et al. (2015) Curran, W., Brys, T., Taylor, M., and Smart, W. “Using PCA to efficiently represent state spaces.” arXiv preprint arXiv:1505.00322 (2015).
  • Dabney et al. (2018) Dabney, W., Rowland, M., Bellemare, M., and Munos, R. “Distributional reinforcement learning with quantile regression.” Thirty-Second AAAI Conference on Artificial Intelligence (2018).
  • Dickey and Fuller (1979) Dickey, D. A. and Fuller, W. A. “Distribution of the estimators for autoregressive time series with a unit root.” Journal of the American statistical association, 74(366a):427–431 (1979).
  • Ernst et al. (2005) Ernst, D., Geurts, P., and Wehenkel, L. “Tree-based batch mode reinforcement learning.” Journal of Machine Learning Research, 6:503–556 (2005).
  • Ertefaie et al. (2021) Ertefaie, A., McKay, J. R., Oslin, D., and Strawderman, R. L. “Robust Q-learning.” Journal of the American Statistical Association, 116(533):368–381 (2021).
  • Ertefaie and Strawderman (2018) Ertefaie, A. and Strawderman, R. L. “Constructing dynamic treatment regimes over indefinite time horizons.” Biometrika, 105(4):963–977 (2018).
  • Fan et al. (2020) Fan, J., Wang, Z., Xie, Y., and Yang, Z. “A theoretical analysis of deep Q-learning.” Learning for Dynamics and Control, 486–489 (2020).
  • Fang et al. (2022) Fang, E. X., Wang, Z., and Wang, L. “Fairness-Oriented Learning for Optimal Individualized Treatment Rules.” Journal of the American Statistical Association, 1–14 (2022).
  • Guan et al. (2020) Guan, Q., Reich, B. J., Laber, E. B., and Bandyopadhyay, D. “Bayesian nonparametric policy search with application to periodontal recall intervals.” Journal of the American Statistical Association, 115(531):1066–1078 (2020).
  • Guestrin et al. (2002) Guestrin, C., Patrascu, R., and Schuurmans, D. “Algorithm-directed exploration for model-based reinforcement learning in factored MDPs.” In ICML, 235–242. Citeseer (2002).
  • Jameson and Longo (2015) Jameson, J. L. and Longo, D. L. “Precision medicine—personalized, problematic, and promising.” Obstetrical & gynecological survey, 70(10):612–614 (2015).
  • Janner et al. (2019) Janner, M., Fu, J., Zhang, M., and Levine, S. “When to trust your model: Model-based policy optimization.” Advances in Neural Information Processing Systems, 32 (2019).
  • Jirak (2016) Jirak, M. “Optimal eigen expansions and uniform bounds.” Probability Theory and Related Fields, 166(3):753–799 (2016).
  • Kallus and Uehara (2022) Kallus, N. and Uehara, M. “Efficiently breaking the curse of horizon in off-policy evaluation with double reinforcement learning.” Operations Research (2022).
  • Kosorok and Laber (2019) Kosorok, M. R. and Laber, E. B. “Precision medicine.” Annual review of statistics and its application, 6:263–286 (2019).
  • Koutník et al. (2013) Koutník, J., Cuccu, G., Schmidhuber, J., and Gomez, F. “Evolving large-scale neural networks for vision-based reinforcement learning.” Proceedings of the 15th annual conference on Genetic and evolutionary computation, 1061–1068 (2013).
  • Kwiatkowski et al. (1992) Kwiatkowski, D., Phillips, P. C., Schmidt, P., and Shin, Y. “Testing the null hypothesis of stationarity against the alternative of a unit root: How sure are we that economic time series have a unit root?” Journal of econometrics, 54(1-3):159–178 (1992).
  • Laber and Staicu (2018) Laber, E. and Staicu, A. “Functional feature construction for individualized treatment regimes.” Journal of the American Statistical Association, 113(523):1219–1227 (2018).
  • Lai et al. (2020) Lai, H., Shen, J., Zhang, W., and Yu, Y. “Bidirectional model-based policy optimization.” In International Conference on Machine Learning, 5618–5627. PMLR (2020).
  • Lazaric et al. (2016) Lazaric, A., Ghavamzadeh, M., and Munos, R. “Analysis of classification-based policy iteration algorithms.” Journal of Machine Learning Research, 17:583–612 (2016).
  • Le et al. (2019) Le, H., Voloshin, C., and Yue, Y. “Batch policy learning under constraints.” International Conference on Machine Learning, 3703–3712 (2019).
  • Li et al. (2020) Li, G., Wei, Y., Chi, Y., Gu, Y., and Chen, Y. “Breaking the sample size barrier in model-based reinforcement learning with a generative model.” Advances in neural information processing systems, 33:12861–12872 (2020).
  • Li et al. (2022) Li, Y., Wang, C.-h., Cheng, G., and Sun, W. W. “Rate-Optimal Contextual Online Matching Bandit.” arXiv preprint arXiv:2205.03699 (2022).
  • Li et al. (2013) Li, Y., Wang, N., and Carroll, R. J. “Selecting the number of principal components in functional data.” Journal of the American Statistical Association, 108(504):1284–1294 (2013).
  • Liao et al. (2021) Liao, P., Klasnja, P., and Murphy, S. “Off-policy estimation of long-term average outcomes with applications to mobile health.” Journal of the American Statistical Association, 116(533):382–391 (2021).
  • Liao et al. (2020) Liao, P., Qi, Z., Klasnja, P., and Murphy, S. “Batch policy learning in average reward markov decision processes.” arXiv preprint arXiv:2007.11771 (2020).
  • Luckett et al. (2020) Luckett, D. J., Laber, E. B., Kahkoska, A. R., Maahs, D. M., Mayer-Davis, E., and Kosorok, M. R. “Estimating Dynamic Treatment Regimes in Mobile Health Using V-Learning.” Journal of the American Statistical Association, 115(530):692–706 (2020). PMID: 32952236.
    URL https://doi.org/10.1080/01621459.2018.1537919
  • Luedtke and van der Laan (2016) Luedtke, A. R. and van der Laan, M. J. “Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy.” The Annals of Statistics, 44(2):713 – 742 (2016).
    URL https://doi.org/10.1214/15-AOS1384
  • Luo et al. (2022) Luo, F.-M., Xu, T., Lai, H., Chen, X.-H., Zhang, W., and Yu, Y. “A survey on model-based reinforcement learning.” arXiv preprint arXiv:2206.09328 (2022).
  • Marling and Bunescu (2020) Marling, C. and Bunescu, R. “The OhioT1DM Dataset for Blood Glucose Level Prediction: Update 2020.” CEUR workshop proceedings, 2675:71–74 (2020).
  • McKeague and Qian (2014) McKeague, I. W. and Qian, M. “Estimation of treatment policies based on functional predictors.” Statistica Sinica, 24(3):1461 (2014).
  • Mnih et al. (2016) Mnih, V., Badia, A., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. “Asynchronous methods for deep reinforcement learning.” International conference on machine learning, 1928–1937 (2016).
  • Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., et al. “Human-level control through deep reinforcement learning.” Nature, 518(7540):529–533 (2015).
  • Mo et al. (2021) Mo, W., Qi, Z., and Liu, Y. “Learning optimal distributionally robust individualized treatment rules.” Journal of the American Statistical Association, 116(534):659–674 (2021).
  • Murao and Kitamura (1997) Murao, H. and Kitamura, S. “Q-learning with adaptive state segmentation (QLASS).” In Proceedings 1997 IEEE International Symposium on Computational Intelligence in Robotics and Automation CIRA’97.’Towards New Computational Principles for Robotics and Automation’, 179–184 (1997).
  • Murphy (2003) Murphy, S. A. “Optimal dynamic treatment regimes.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(2):331–355 (2003).
  • Nie et al. (2021) Nie, X., Brunskill, E., and Wager, S. “Learning when-to-treat policies.” Journal of the American Statistical Association, 116(533):392–409 (2021).
  • Parisi et al. (2017) Parisi, S., Ramstedt, S., and Peters, J. “Goal-driven dimensionality reduction for reinforcement learning.” In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 4634–4639. IEEE (2017).
  • Puterman (2014) Puterman, M. L. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons (2014).
  • Qi and Liu (2018) Qi, Z. and Liu, Y. “D-learning to estimate optimal individual treatment rules.” Electronic Journal of Statistics, 12(2):3601–3638 (2018).
  • Qian and Murphy (2011) Qian, M. and Murphy, S. A. “Performance guarantees for individualized treatment rules.” Annals of statistics, 39(2):1180 (2011).
  • Ramprasad et al. (2022) Ramprasad, P., Li, Y., Yang, Z., Wang, Z., Sun, W. W., and Cheng, G. “Online bootstrap inference for policy evaluation in reinforcement learning.” Journal of the American Statistical Association, 1–14 (2022).
  • Reiß and Wahl (2020) Reiß, M. and Wahl, M. “Nonasymptotic upper bounds for the reconstruction error of PCA.” The Annals of Statistics, 48(2):1098–1123 (2020).
  • Riedmiller (2005) Riedmiller, M. “Neural fitted Q iteration–first experiences with a data efficient neural reinforcement learning method.” European Conference on Machine Learning, 317–328 (2005).
  • Rodbard (2009) Rodbard, D. “Interpretation of continuous glucose monitoring data: glycemic variability and quality of glycemic control.” Diabetes technology & therapeutics, 11(S1):S–55 (2009).
  • Said and Dickey (1984) Said, S. E. and Dickey, D. A. “Testing for unit roots in autoregressive-moving average models of unknown order.” Biometrika, 71(3):599–607 (1984).
  • Schmidt-Hieber (2020) Schmidt-Hieber, J. “Nonparametric regression using deep neural networks with ReLU activation function.” The Annals of Statistics, 48(4):1875–1897 (2020).
  • Schulman et al. (2015a) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. “Trust region policy optimization.” International conference on machine learning, 1889–1897 (2015a).
  • Schulman et al. (2015b) Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. “High-dimensional continuous control using generalized advantage estimation.” arXiv preprint arXiv:1506.02438 (2015b).
  • Shi et al. (2021a) Shi, C., Song, R., and Lu, W. “Concordance and value information criteria for optimal treatment decision.” The Annals of Statistics, 49(1):49–75 (2021a).
  • Shi et al. (2018) Shi, C., Song, R., Lu, W., and Fu, B. “Maximin projection learning for optimal treatment decision with heterogeneous individualized treatment effects.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(4):681–702 (2018).
  • Shi et al. (2021b) Shi, C., Wan, R., Chernozhukov, V., and Song, R. “Deeply-debiased off-policy interval estimation.” In International Conference on Machine Learning, 9580–9591. PMLR (2021b).
  • Shi et al. (2020) Shi, C., Wan, R., Song, R., Lu, W., and Leng, L. “Does the Markov Decision Process Fit the Data: Testing for the Markov Property in Sequential Decision Making.” In International Conference on Machine Learning, 8807–8817. PMLR (2020).
  • Shi et al. (2021c) Shi, C., Wang, X., Luo, S., Zhu, H., Ye, J., and Song, R. “Dynamic causal effects evaluation in A/B testing with a reinforcement learning framework.” Journal of the American Statistical Association, accepted (2021c).
  • Shi et al. (2022) Shi, C., Zhang, S., Lu, W., and Song, R. “Statistical inference of the value function for reinforcement learning in infinite horizon settings.” Journal of the Royal Statistical Society: Series B, 84:765–793 (2022).
  • Song et al. (2015) Song, R., Wang, W., Zeng, D., and Kosorok, M. R. “Penalized q-learning for dynamic treatment regimens.” Statistica Sinica, 25(3):901 (2015).
  • Sprague (2007) Sprague, N. “Basis iteration for reward based dimensionality reduction.” 2007 IEEE 6th International Conference on Development and Learning, 187–192 (2007).
  • Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. “Dropout: a simple way to prevent neural networks from overfitting.” The journal of machine learning research, 15(1):1929–1958 (2014).
  • Staicu et al. (2014) Staicu, A.-M., Li, Y., Crainiceanu, C. M., and Ruppert, D. “Likelihood ratio tests for dependent data with applications to longitudinal and functional data analysis.” Scandinavian Journal of Statistics, 41(4):932–949 (2014).
  • Sutton and Barto (2018) Sutton, R. S. and Barto, A. G. Reinforcement learning: an introduction. MIT Press (2018).
  • Tsiatis et al. (2019) Tsiatis, A. A., Davidian, M., Holloway, S. T., and Laber, E. B. Dynamic treatment regimes: Statistical methods for precision medicine. Chapman and Hall/CRC (2019).
  • Tsybakov (2004) Tsybakov, A. “Optimal aggregation of classifiers in statistical learning.” The Annals of Statistics, 32(1):135–166 (2004).
  • Uehara et al. (2021) Uehara, M., Imaizumi, M., Jiang, N., Kallus, N., Sun, W., and Xie, T. “Finite sample analysis of minimax offline reinforcement learning: Completeness, fast rates and first-order efficiency.” arXiv preprint arXiv:2102.02981 (2021).
  • Uehara et al. (2022) Uehara, M., Kiyohara, H., Bennett, A., Chernozhukov, V., Jiang, N., Kallus, N., Shi, C., and Sun, W. “Future-dependent value-based off-policy evaluation in pomdps.” arXiv preprint arXiv:2207.13081 (2022).
  • Van Hasselt et al. (2016) Van Hasselt, H., Guez, A., and Silver, D. “Deep reinforcement learning with double q-learning.” Proceedings of the AAAI conference on artificial intelligence, 30(1) (2016).
  • Wang et al. (2017a) Wang, L., Laber, E. B., and Witkiewitz, K. “Sufficient Markov decision processes with alternating deep neural networks.” arXiv preprint, arXiv:1704.07531 (2017a).
  • Wang et al. (2017b) Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., and de Freitas, N. “Sample efficient actor-critic with experience replay.” ICLR (2017b).
  • Whiteson et al. (2007) Whiteson, S., Taylor, M. E., and Stone, P. “Adaptive Tile Coding for Value Function Approximation.” (2007).
  • Xu et al. (2020) Xu, Z., Laber, E., Staicu, A.-M., and Severus, E. “Latent-state models for precision medicine.” arXiv preprint arXiv:2005.13001 (2020).
  • Yao et al. (2005) Yao, F., Müller, H.-G., and Wang, J.-L. “Functional data analysis for sparse longitudinal data.” Journal of the American statistical association, 100(470):577–590 (2005).
  • Zhang et al. (2013) Zhang, B., Tsiatis, A. A., Laber, E. B., and Davidian, M. “Robust estimation of optimal dynamic treatment regimes for sequential treatment decisions.” Biometrika, 100(3):681–694 (2013).
  • Zhang and Wang (2016) Zhang, X. and Wang, J. “From sparse to dense functional data and beyond.” The Annals of Statistics, 44(5):2281–2321 (2016).
  • Zhang et al. (2018) Zhang, Y., Laber, E. B., Davidian, M., and Tsiatis, A. A. “Interpretable dynamic treatment regimes.” Journal of the American Statistical Association, 113(524):1541–1549 (2018).
  • Zhao et al. (2015) Zhao, Y.-Q., Zeng, D., Laber, E. B., and Kosorok, M. R. “New statistical learning methods for estimating optimal dynamic treatment regimes.” Journal of the American Statistical Association, 110(510):583–598 (2015).
  • Zhou et al. (2021) Zhou, W., Zhu, R., and Qu, A. “Estimating Optimal Infinite Horizon Dynamic Treatment Regimes via pT-Learning.” arXiv preprint arXiv:2110.10719 (2021).
  • Zhou et al. (2022) —. “Estimating Optimal Infinite Horizon Dynamic Treatment Regimes via pT-Learning.” Journal of the American Statistical Association, accepted (2022).
  • Zhu et al. (2020) Zhu, L., Lu, W., and Song, R. “Causal effect estimation and optimal dose suggestions in mobile health.” In International Conference on Machine Learning, 11588–11598. PMLR (2020).

Appendix A Supplementary Definitions

In this section, we introduce additional definitions that will be used in the rest of the supporting information.

Note that $\mathcal{F}_{0}$ is a function class that models the Q-function without using PCA. We define two more function classes based on PCA of the high-dimensional input $z$:

Definition A.1.
\displaystyle\mathcal{F}_{1}(L,\{d_{i}\}_{i=0}^{L+1},s,V_{max},G,\kappa)=\{f_{1}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R},\;f_{1}(x,z,a)=
\displaystyle f_{0}(x,(\sum_{k=1}^{\kappa}{U_{k}{U_{k}}^{T}})z,a)=f_{0}(x,z^{*},a),\;f_{0}\in\mathcal{F}_{0}(L,\{d_{i}\}_{i=0}^{L+1},s,V_{max})\},

where $\mathcal{F}_{0}$ is defined in Section 4 and $G=\sum_{k=1}^{m}{\lambda_{k}U_{k}U_{k}^{T}}$ is the covariance matrix of $z$. Here $z^{*}=(\sum_{k=1}^{\kappa}{U_{k}U_{k}^{T}})z$ is the vector recovered from the first $\kappa$ principal components of the original high-frequency vector $z$.

The function class $\mathcal{F}_{2}$ is already mentioned in Section 4. Here we define it formally:

\displaystyle\mathcal{F}_{2}(L,\{d_{i}\}_{i=0}^{L+1},s,V_{max},G,\kappa)=\{f_{2}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R},\;f_{2}(x,z,a)=
\displaystyle f_{0}(x,\begin{pmatrix}I_{\kappa}&0_{\kappa\times(m-\kappa)}\end{pmatrix}{\Sigma}^{-\frac{1}{2}}U^{T}z,a)=f_{0}(x,v_{\kappa},a),\;f_{0}\in\mathcal{F}_{0}(L,\{d_{i}\}_{i=0}^{L+1},s,V_{max})\},

where $v_{\kappa}$ is the vector containing the first $\kappa$ principal component scores of the original high-frequency vector $z$.

Note that $\mathcal{F}_{2}(L,\{d_{i}\}_{i=0}^{L+1},s,V_{max},G,\kappa)$ is the function class we use in the Spectrum Q-Iteration algorithm to model the Q-function. In practice, we use the estimated $\hat{G}$ and the corresponding PCA vector $\hat{v}_{\kappa}$ in $\mathcal{F}_{2}$. The relationship between the two classes $\mathcal{F}_{1}$ and $\mathcal{F}_{2}$ is used in the proofs of the main theorems (Proposition C.2 and Lemma C.3 in Section C of the Appendix).
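As a concrete illustration of these two representations, the following minimal Python sketch (using synthetic data and illustrative dimensions; not the implementation used in the paper) computes the PCA score vector $\hat{v}_{\kappa}$ fed to $\mathcal{F}_{2}$ and the rank-$\kappa$ reconstruction used by $\mathcal{F}_{1}$ from an estimated covariance $\hat{G}$.

import numpy as np

# Minimal sketch (not the authors' code): build the inputs of F_1 and F_2
# from an estimated covariance G_hat of the high-frequency vector z.
rng = np.random.default_rng(0)
n, m, kappa = 500, 12, 3                                 # illustrative sizes
Z = rng.normal(size=(n, m)) @ rng.normal(size=(m, m))    # synthetic high-frequency data

G_hat = np.cov(Z, rowvar=False)                          # estimated covariance matrix
lam, U_hat = np.linalg.eigh(G_hat)
order = np.argsort(lam)[::-1]                            # eigenpairs in decreasing order
lam, U_hat = lam[order], U_hat[:, order]

z = Z[0]                                                 # one observation of z
v_kappa = (U_hat.T @ z)[:kappa] / np.sqrt(lam[:kappa])   # first kappa PCA scores, input of f_0 in F_2
z_star = U_hat[:, :kappa] @ (U_hat[:, :kappa].T @ z)     # rank-kappa recovery of z, input of f_0 in F_1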

Next, we give the definition of the concentration coefficient:

Definition A.2.

Let $\sigma_{1}$ and $\sigma_{2}$ be two absolutely continuous distributions on $\mathcal{S}\times\mathcal{A}$. Let $\{\pi_{t}\}_{t\geq 1}=\{\pi_{1},\dots,\pi_{m}\}$ be a sequence of policies. For a state-action pair $(S_{0},A_{0})$ drawn from $\sigma_{1}$, denote the state at the $t$-th decision point by $S_{t}$, $t=1,2,\dots,m$, and suppose the action $A_{t}$ at the $t$-th decision point is taken according to $\pi_{t}$. Denote the marginal distribution of $(S_{m},A_{m})$ by $P^{\pi_{m}}P^{\pi_{m-1}}\cdots P^{\pi_{1}}\sigma_{1}$. The concentration coefficient is then defined as

\displaystyle\omega_{\infty}(m;\sigma_{1},\sigma_{2})=\sup_{\pi_{1},\dots,\pi_{m}}{\left\|\frac{dP^{\pi_{m}}P^{\pi_{m-1}}\cdots P^{\pi_{1}}\sigma_{1}(s,a)}{d\sigma_{2}(s,a)}\right\|_{\infty,\sigma_{2}}}. (2)

Similar definitions of concentration coefficients are common in the reinforcement learning literature (Fan et al., 2020; Lazaric et al., 2016).
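For intuition, the following toy sketch computes the finite-MDP analogue of this quantity for a single fixed policy applied three times; the definition above additionally takes a supremum over policy sequences and works with densities on a general state space, so all numbers and names here are purely illustrative.

import numpy as np

# Toy finite-MDP analogue of omega_inf(m; sigma_1, sigma_2): push sigma_1
# forward m times under a fixed policy and compare with sigma_2 pointwise.
rng = np.random.default_rng(1)
nS, nA, m = 3, 2, 3
P = rng.dirichlet(np.ones(nS), size=(nS, nA))     # P[s, a, s']: transition kernel
pi = rng.dirichlet(np.ones(nA), size=nS)          # pi[s, a]: a single stationary policy
sigma1 = np.full((nS, nA), 1.0 / (nS * nA))       # initial distribution on S x A
sigma2 = np.full((nS, nA), 1.0 / (nS * nA))       # reference distribution on S x A

mu = sigma1
for _ in range(m):                                # m applications of P^pi
    s_next = np.einsum('sa,sap->p', mu, P)        # marginal of the next state
    mu = s_next[:, None] * pi                     # joint law of (S_t, A_t)
print(np.max(mu / sigma2))                        # density-ratio bound, analogue of omega_inf(m)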

Appendix B Detailed Version of the First Theorem

The detailed version of Theorem 4.6 is: Let $\mu$ be a distribution on $\mathcal{S}\times\mathcal{A}$ such that $\mu(\mathcal{S}\times\mathcal{A})=\mu_{1}(\mathcal{S})\times\mu_{2}(\mathcal{A})$ with $\mu_{2}(a)>0$ for all $a\in\mathcal{A}$ and $\frac{d\mu_{1}(s)}{ds}$ bounded away from 0 (here we allow a more general measure $\mu_{2}$ on the action space $\mathcal{A}$). Define $\beta_{j}^{*}=\beta_{j}\times\prod_{l=j+1}^{q}{\min{(\beta_{l},1)}}$ and $\alpha^{*}=\max_{j\in[q]}{\frac{t_{j}}{2\beta_{j}^{*}+t_{j}}}$, where $\beta_{j},q,t_{j},p_{j}$ come from Definition 4.2 and the definition of $\mathcal{G}_{0}(\{p_{j},t_{j},\beta_{j},H_{j}\}_{j\in[q]})$ in Section 4.

Under Assumptions 4.3 to 4.5, and provided that $n$ is large enough that $\max\{\sum_{j}{(t_{j}+\beta_{j}+1)}^{3+t_{j}},\max_{j}{p_{j}}\}\leq C_{\xi}{(\log(n))}^{\xi}$ for some $\xi>0,C_{\xi}>0$, there exists a function class $\mathcal{F}_{2}(L^{*},\{d_{j}^{*}\}_{j=0}^{L+1},s^{*},V_{max},\hat{G},\kappa)$ modeling $Q(s,a)$ whose components are sparse ReLU networks with hyperparameters satisfying: number of layers $L^{*}=O(\log^{\xi^{*}}{n})$, first-layer width $d_{1}^{*}\leq n^{\alpha^{*}}$, output dimension $d^{*}_{L^{*}+1}=1$, remaining layer widths satisfying $d_{0}^{*}=m_{0}+m\leq\min_{j\in\{1,2,\dots,L\}}{d_{j}^{*}}\leq\max_{j\in\{1,2,\dots,L\}}{d_{j}^{*}}=O(n^{\xi^{*}})$, and sparsity of order $s^{*}\asymp d_{1}^{*}\kappa+n^{\alpha^{*}}\log^{\xi^{*}}(n)$ for some $\xi^{*}>1+2\xi$. In this case, the estimator $\tilde{Q}_{K}$ (modeled by $\mathcal{F}_{2}$) obtained from Algorithm 1 satisfies

\displaystyle\left\|Q^{\pi^{*}}-\tilde{Q}_{K}\right\|_{2,\mu}^{2}=O_{p}(|\mathcal{A}|(n^{\alpha^{*}}\log^{\xi^{*}}{n}+d_{1}^{*}\kappa)n^{-1}\log^{\xi^{*}+1}{n}
\displaystyle+L_{\mathcal{F}_{0}}(e^{-\zeta\kappa}-e^{-\zeta m}+n^{-2\Delta})+\frac{\gamma^{2K}}{(1-\gamma)^{2}}R_{max}^{2}),

where $|\mathcal{A}|$ is the number of treatment options.

Appendix C Proof of theorems

C.1 Proof Sketch of Theorem 4.6

Recall the detailed statement of Theorem 4.6 given in Appendix B above.

First, for brevity, we write

$\mathcal{F}_{1}(L^{*},\{\{d_{i}^{*}\}^{L+1}_{i=1},d_{0}=m_{0}+m\},s^{*}-d_{1}\kappa,V_{max},\hat{G},\kappa)$ as $\mathcal{F}_{1}$,

$\mathcal{F}_{2}(L^{*},\{d_{i}^{*}\}^{L+1}_{i=0},s^{*},V_{max},\hat{G},\kappa)$ as $\mathcal{F}_{2}$, and

$\mathcal{F}_{0}(L^{*},\{\{d_{i}^{*}\}^{L+1}_{i=1},d_{0}=m_{0}+m\},s^{*}-d_{1}\kappa,V_{max})$ as $\mathcal{F}_{0}$.

Recall that the transition density function $q(s^{\prime}|s,a)$ is assumed to satisfy $\sup_{s^{\prime},s,a}{q(s^{\prime}|s,a)}\leq c_{q}$ (as stated in Section 2.2). Let $q^{j}(s^{\prime}|s,a)$ denote the density of the state reached from $(s,a)$ after following the sequence of policies $\pi_{1},\pi_{2},\dots,\pi_{j-1}$. Then

\displaystyle q^{j}(s^{\prime}|s,a)=\int_{x\in\mathcal{S}}{\sum_{a_{x}\in\mathcal{A}}{q(s^{\prime}|x,a_{x})\pi_{j-1}(a_{x}|x)q^{j-1}(x|s,a)dx}}
\displaystyle\leq c_{q}\int_{x\in\mathcal{S}}{\sum_{a_{x}\in\mathcal{A}}{\pi_{j-1}(a_{x}|x)q^{j-1}(x|s,a)dx}}=c_{q}.

Consequently, the density of the joint distribution of $(s^{\prime},a^{\prime})$ after applying $\{\pi_{1},\dots,\pi_{m}\}$ satisfies

\displaystyle\int_{(s,a)\in\mathcal{S}\times\mathcal{A}}{\pi_{m}(a^{\prime}|s^{\prime})q^{m}(s^{\prime}|s,a)d\mu(s,a)}\leq c_{q}\int_{(s,a)\in\mathcal{S}\times\mathcal{A}}{d\mu(s,a)}=c_{q}.

Thus, $\frac{dP^{\pi_{m}}P^{\pi_{m-1}}\cdots P^{\pi_{1}}\mu(s,a)}{d\sigma(s,a)}$ is bounded as long as the sampling distribution $\sigma$ is bounded away from 0. That is, $\omega_{\infty}(m;\mu,\sigma)\leq c_{\mu,\sigma}$ for some constant $c_{\mu,\sigma}>0$ (recall that $\omega_{\infty}(m;\mu,\sigma)$ is given in Definition A.2).

Then it can be shown that

\displaystyle\frac{1}{(1-\gamma)^{2}}\sum_{m=0}^{\infty}{\gamma^{m-1}(m+1)[\omega_{\infty}(m;\mu,\sigma)]^{\frac{1}{2}}}\leq\frac{1}{(1-\gamma)^{2}}\sum_{m=0}^{\infty}{\gamma^{m-1}(m+1){c_{\mu,\sigma}}^{\frac{1}{2}}}=\phi_{\mu,\sigma},

for some constant $\phi_{\mu,\sigma}>0$.

We can now state a lemma on the approximation error of $\tilde{Q}_{K}$:

Lemma C.1.

For the estimated Q-function $\tilde{Q}_{K}$ obtained at iteration $K$ of Algorithm 1, we have

\displaystyle\left\|Q^{*}-\tilde{Q}_{K}\right\|_{2,\mu}^{2}\leq 2\gamma^{2}(1-\gamma)^{4}\epsilon_{max}^{2}\phi_{\mu,\sigma}^{2}+8\frac{\gamma^{2K}R_{max}^{2}}{(1-\gamma)^{2}}, (3)

where $\epsilon_{max}=\max_{k\in\{1,2,\dots,K\}}{\left\|T\tilde{Q}_{k-1}-\tilde{Q}_{k}\right\|_{\sigma}}$.

That is, the distance between the $K$-th iterate $\tilde{Q}_{K}$ of our algorithm and the optimal $Q^{*}$ is bounded by the largest approximation error $\left\|T\tilde{Q}_{k-1}-\tilde{Q}_{k}\right\|_{\sigma}^{2}$ over the $K$ iterations of the Spectrum Q-Iteration algorithm. Recall that $\tilde{Q}_{k-1},\tilde{Q}_{k}\in\mathcal{F}_{2}$.

To bound the term $\left\|T\tilde{Q}_{k-1}-\tilde{Q}_{k}\right\|_{\sigma}^{2}$ in each iteration, we apply Theorem 6.2 of Fan et al. (2020), which gives

\left\|T\tilde{Q}_{k-1}-\tilde{Q}_{k}\right\|_{\sigma}^{2}\leq(1+\epsilon)^{2}\omega(\mathcal{F}_{2})+C\frac{V_{max}^{2}}{n\epsilon}\log{N_{\delta,2}}+C^{\prime}V_{max}\delta

for all $\epsilon\in(0,1]$, where $\omega(\mathcal{F}_{2})=\sup_{g\in\mathcal{F}_{2}}{\inf_{f\in\mathcal{F}_{2}}{\left\|f-Tg\right\|_{\sigma}^{2}}}$. Here $N_{\delta,2}$ is the cardinality of the minimal $\delta$-covering set of the function class $\mathcal{F}_{2}$ with respect to the $\ell_{\infty}$ norm.

Taking $\epsilon=1$ and fixing $\delta=\frac{1}{n}$, we obtain

\left\|T\tilde{Q}_{k-1}-\tilde{Q}_{k}\right\|_{\sigma}^{2}\leq 4\omega(\mathcal{F}_{2})+C\frac{V_{max}^{2}}{n}\log{N_{\frac{1}{n},2}}+\frac{C^{\prime}V_{max}}{n}. (4)

It remains to bound $\omega(\mathcal{F}_{2})=\sup_{g\in\mathcal{F}_{2}}{\inf_{f\in\mathcal{F}_{2}}{\left\|f-Tg\right\|_{\sigma}^{2}}}$ and $\log{N_{\frac{1}{n},2}}$ separately.

The term $\sup_{g\in\mathcal{F}_{2}}{\inf_{f\in\mathcal{F}_{2}}{\left\|f-Tg\right\|_{\sigma}^{2}}}$ captures the error of approximating, by a function $f\in\mathcal{F}_{2}$, a function $g$ from the same class after applying the optimal Bellman operator. Note that in each iteration of the Q-iteration algorithm, we fit $\tilde{Q}_{k}$ from a ReLU network class to the responses $Y_{i}=R_{i}+\gamma\max_{a\in\mathcal{A}}{\tilde{Q}_{k-1}(S_{i}^{\prime},a)}$. Taking the expectation of $Y_{i}$, the true regression function underlying the responses belongs to the class $\{Tg:g\in\mathcal{F}_{2}\}$. As argued in Fan et al. (2020), the term $\sup_{g\in\mathcal{F}_{2}}{\inf_{f\in\mathcal{F}_{2}}{\left\|f-Tg\right\|_{\sigma}^{2}}}$ can therefore be viewed as the bias term when fitting $Y_{i}$ with $\tilde{Q}_{k}\in\mathcal{F}_{2}$.
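For concreteness, the following minimal sketch carries out one such regression step on synthetic data; the MLPRegressor is only a stand-in for the sparse ReLU class $\mathcal{F}_{2}$, and all names and sizes are illustrative rather than the paper's implementation.

import numpy as np
from sklearn.neural_network import MLPRegressor

# One fitted-Q iteration step: regress Y_i = R_i + gamma * max_a Q_{k-1}(S_i', a)
# on the (reduced) state, separately for each action.
rng = np.random.default_rng(2)
n, d_state, n_actions, gamma = 200, 5, 2, 0.9
S = rng.normal(size=(n, d_state))          # current states, e.g. (x_t, v_kappa)
A = rng.integers(n_actions, size=n)        # actions taken in the batch
R = rng.normal(size=n)                     # observed rewards
S_next = rng.normal(size=(n, d_state))     # next states

def q_prev(states, action):                # previous iterate Q_{k-1} (placeholder)
    return np.zeros(len(states))

Y = R + gamma * np.max(
    np.column_stack([q_prev(S_next, a) for a in range(n_actions)]), axis=1)

# Q_k: one ReLU-network regression per action (surrogate for the sparse ReLU class)
q_k = {a: MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=500).fit(S[A == a], Y[A == a])
       for a in range(n_actions)}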

We first bound this bias-related term, for which we need the following proposition:

Proposition C.2.

The function class $\mathcal{F}_{1}(L,\{\{d_{i}\}^{L+1}_{i=1},d_{0}=m_{0}+m\},s,V_{max},\hat{G},\kappa)$ is a subset of $\mathcal{F}_{2}(L,\{\{d_{i}\}^{L+1}_{i=1},d_{0}=\kappa+m_{0}\},s+d_{1}\kappa,V_{max},\hat{G},\kappa)$.

This proposition implies $\mathcal{F}_{1}\subset\mathcal{F}_{2}$. We can then state the following lemma, which relies on Proposition C.2:

Lemma C.3.

Under the same assumptions as Theorem 4.6,

\sup_{g\in\mathcal{F}_{2}}{\inf_{f\in\mathcal{F}_{2}}{\left\|f-Tg\right\|_{\sigma}^{2}}}\leq 2\sup_{f^{\prime}\in\mathcal{G}_{0}}{\inf_{f_{0}\in\mathcal{F}_{0}}{\left\|f_{0}-f^{\prime}\right\|_{\sigma}^{2}}}+2L_{\mathcal{F}_{0}}{\left\|Z-Z^{*}\right\|_{\sigma}^{2}},

where $L_{\mathcal{F}_{0}}=\sup_{f\in\mathcal{F}_{0}}{\sup_{x\neq y}{\frac{{|f(y)-f(x)|}^{2}}{\left\|y-x\right\|^{2}}}}$, $Z$ is the high-frequency vector drawn from the distribution $\sigma$, and $Z^{*}=(\sum_{k=1}^{\kappa}{\hat{U}_{k}{\hat{U}_{k}}^{T}})Z$ is the vector recovered from the first $\kappa$ principal components of $Z$.

By Lemma C.3, the term $\sup_{g\in\mathcal{F}_{2}}{\inf_{f\in\mathcal{F}_{2}}{\left\|f-Tg\right\|_{\sigma}^{2}}}$ is bounded in terms of $\sup_{f^{\prime}\in\mathcal{G}_{0}}{\inf_{f_{0}\in\mathcal{F}_{0}}{\left\|f_{0}-f^{\prime}\right\|_{\sigma}^{2}}}$ and ${\left\|Z-Z^{*}\right\|_{\sigma}^{2}}$.

Recall that $\mathcal{F}_{0}$ is shorthand for

$\mathcal{F}_{0}(L^{*},\{\{d_{i}^{*}\}^{L+1}_{i=1},d_{0}=m_{0}+m\},s^{*}-d_{1}\kappa,V_{max})$,

as specified at the beginning of this section. The sparsity of $\mathcal{F}_{0}$ satisfies $s^{*}-d_{1}\kappa\asymp n^{\alpha^{*}}\log^{\xi^{*}}(n)$ and its architecture satisfies $d_{1}^{*}\leq n$ and $m_{0}+m\leq\min_{j\in\{1,2,\dots,L\}}{d_{j}^{*}}\leq\max_{j\in\{1,2,\dots,L\}}{d_{j}^{*}}=O(n^{\xi^{*}})$. Thus, we can use equation (4.18) of Fan et al. (2020) to bound the term $\sup_{f^{\prime}\in\mathcal{G}_{0}}{\inf_{f_{0}\in\mathcal{F}_{0}}{\left\|f_{0}-f^{\prime}\right\|_{\sigma}^{2}}}$:

\sup_{f^{\prime}\in\mathcal{G}_{0}}{\inf_{f_{0}\in\mathcal{F}_{0}}{\left\|f_{0}-f^{\prime}\right\|_{\sigma}^{2}}}\leq n^{\alpha^{*}-1},

where the conditions on the neural networks in $\mathcal{F}_{0}$ match the conditions required for equation (4.18) of Fan et al. (2020).

We now bound the term $\left\|Z-Z^{*}\right\|_{\sigma}^{2}$ via the next lemma.

Lemma C.4.

For the high-frequency vector $Z$, assume $E(Z)=0$ and that the eigenvalues of $Cov(Z)$ decay exponentially, $\lambda_{k}=O(e^{-\zeta k})$. Assume further that the estimate of $Cov(Z)$ satisfies $\left\|\hat{U}-U\right\|=O_{p}(n^{-\Delta})$. Then we have

\left\|Z-Z^{*}\right\|_{\sigma}^{2}=O_{p}(e^{-\zeta\kappa}-e^{-\zeta m}+n^{-2\Delta}).

Combining these two lemmas, we obtain

\sup_{g\in\mathcal{F}_{2}}{\inf_{f\in\mathcal{F}_{2}}{\left\|f-Tg\right\|_{\sigma}^{2}}}=O_{p}(n^{\alpha^{*}-1}+L_{\mathcal{F}_{0}}{(e^{-\zeta\kappa}-e^{-\zeta m}+n^{-2\Delta})}).
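As a quick numerical sanity check of the rate in Lemma C.4 (with illustrative values of $\zeta$, $m$ and $\kappa$; this is not part of the proof): when the eigenvectors are known exactly, the expected reconstruction error equals the discarded eigenvalue sum $\sum_{k>\kappa}\lambda_{k}$, which for $\lambda_{k}\asymp e^{-\zeta k}$ is of the same order as $e^{-\zeta\kappa}-e^{-\zeta m}$.

import numpy as np

# Tail eigenvalue sum versus the rate e^{-zeta*kappa} - e^{-zeta*m} (illustrative).
zeta, m = 0.8, 20
lam = np.exp(-zeta * np.arange(1, m + 1))          # exponentially decaying eigenvalues
for kappa in (2, 5, 10):
    tail = lam[kappa:].sum()                        # reconstruction error with exact eigenvectors
    rate = np.exp(-zeta * kappa) - np.exp(-zeta * m)
    print(kappa, tail, rate, tail / rate)           # the ratio stays bounded in kappa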

Next, with $\delta=\frac{1}{n}$, a bound for $\log{N_{\delta,2}}$ in equation (4) is required. This is given by the following proposition:

Proposition C.5.

When $\delta=\frac{1}{n}$, $\log{N_{\delta,2}}\leq C_{1}|\mathcal{A}|(n^{\alpha^{*}}(\log{n})^{\xi^{*}}+d_{1}^{*}\kappa)(\log{n})^{1+\xi^{*}}$ for some constant $C_{1}$.

Then from equation (4), we have

\left\|T\tilde{Q}_{k-1}-\tilde{Q}_{k}\right\|_{\sigma}^{2}=O_{p}(|\mathcal{A}|n^{-1}(n^{\alpha^{*}}{(\log{n})}^{\xi^{*}}+d_{1}^{*}\kappa){(\log{n})}^{\xi^{*}+1}+L_{\mathcal{F}_{0}}{(e^{-\zeta\kappa}-e^{-\zeta m}+n^{-2\Delta})}). (5)

Combining equations (5) and (3) completes the proof of Theorem 4.6.

C.2 Proof of Theorem 4.8

Theorem 4.8: Under Assumptions 4.3 to 4.7 and the conditions on $\mu,\mathcal{F}_{2},n$ in Theorem 4.6, we have

\displaystyle E_{\mu_{1}}[V^{\pi^{*}}(s)-V^{\pi_{K}}(s)]=O_{p}(\frac{1}{1-\gamma}\{|\mathcal{A}|(n^{\alpha^{*}}{\log^{\xi^{*}}{n}}+d_{1}^{*}\kappa)n^{-1}{\log^{\xi^{*}+1}{n}}
\displaystyle+L_{\mathcal{F}_{0}}(e^{-\zeta\kappa}-e^{-\zeta m}+n^{-2\Delta})+\frac{\gamma^{2K}}{(1-\gamma)^{2}}R_{max}^{2}\}^{\frac{\eta+1}{\eta+2}}).
Proof.

The policy $\pi_{K}$ obtained in Algorithm 1 is the greedy policy with respect to $\tilde{Q}_{K}$.

First, a proposition from Shi et al. (2022) can be applied:

Proposition C.6 (Inequalities E.10 and E.11 in Shi et al. (2022)).

\displaystyle E_{\mu_{1}}[V^{\pi^{*}}(s)-V^{\pi_{K}}(s)]\leq\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}d\mu_{1}(s)}} (6)
\displaystyle+\frac{c\gamma}{1-\gamma}\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}ds}}.

It therefore suffices to bound the two terms $\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}d\mu_{1}(s)}}$ and $\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}ds}}$. We first show how to bound the term

\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}d\mu_{1}(s)}}.

Define $\hat{a}_{\pi^{K}}(s)=\operatorname*{arg\,max}_{a\in\mathcal{A}}{\tilde{Q}_{K}(s,a)}$. For $\epsilon>0$, let $A_{*}=\{0<\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})\leq\epsilon\}$, so that its complement is $A_{*}^{C}=\{\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})>\epsilon\}\cup\{\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})=0\}$. Then

s𝒮a𝒜Qπ(s,a){π(a|s)πK(a|s)}dμ1(s)=\displaystyle\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}d\mu_{1}(s)}}= (7)
s𝒮a𝒜Qπ(s,a){π(a|s)πK(a|s)}𝟙(sA)dμ1(s)+\displaystyle\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}\mathbbm{1}(s\in A_{*})d\mu_{1}(s)}}+
s𝒮a𝒜Qπ(s,a){π(a|s)πK(a|s)}𝟙(sAC)dμ1(s).\displaystyle\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}\mathbbm{1}(s\in A_{*}^{C})d\mu_{1}(s)}}.

The first step is to bound the first part of equation (7),

s𝒮a𝒜Qπ(s,a){π(a|s)πK(a|s)}𝟙(sA)dμ1(s).\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}\mathbbm{1}(s\in A_{*})d\mu_{1}(s)}}.

It is known that

maxa𝒜Qπ(s,a)=a𝒜Qπ(s,a)π(a|s)\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}=\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\pi^{*}(a|s)}

and

Qπ(s,a^πK)=a𝒜Qπ(s,a)πK(a|s),Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})=\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\pi^{K}(a|s)},

as both $\pi^{*}$ and $\pi^{K}$ are greedy policies ($\pi^{*}$ can be viewed as the greedy policy with respect to $Q^{\pi^{*}}$). Then $\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}}=\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})\leq\epsilon$ when $s\in A_{*}$. That is,

s𝒮a𝒜Qπ(s,a){π(a|s)πK(a|s)}𝟙(sA)dμ1(s)ϵs𝒮𝟙(sA)𝑑μ1(s).\displaystyle\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}}\mathbbm{1}(s\in A_{*})d\mu_{1}(s)}\leq\epsilon\int_{s\in\mathcal{S}}{\mathbbm{1}(s\in A_{*})d\mu_{1}(s)}. (8)

Also, when $s\in A_{*}$, the action $\hat{a}_{\pi^{K}}$ can never belong to $\operatorname*{arg\,max}_{a^{\prime}\in\mathcal{A}}{Q^{\pi^{*}}(s,a^{\prime})}$, since $\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})>0$ always holds on $A_{*}$. Thus, $\max_{a\in{\mathcal{A}-\operatorname*{arg\,max}_{a^{\prime}}{Q^{\pi^{*}}(s,a^{\prime})}}}{Q^{\pi^{*}}(s,a)}\geq Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})$. Hence, $\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})\leq\epsilon$ implies $\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-\max_{a\in{\mathcal{A}-\operatorname*{arg\,max}_{a^{\prime}}{Q^{\pi^{*}}(s,a^{\prime})}}}{Q^{\pi^{*}}(s,a)}\leq\epsilon$.

Therefore, it can be obtained that

s𝒮𝟙(sA)dμ1(s)s𝒮𝟙{s:maxa𝒜Qπ(s,a)\displaystyle\int_{s\in\mathcal{S}}{\mathbbm{1}(s\in A_{*})d\mu_{1}(s)}\leq\int_{s\in\mathcal{S}}\mathbbm{1}\{s:\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}
maxa𝒜argmaxaQπ(x,a)Qπ(s,a)ϵ}dμ1(s)=O(ϵη).\displaystyle-\max_{a\in{\mathcal{A}-\operatorname*{arg\,max}_{a^{\prime}}{Q^{\pi^{*}}(x,a^{\prime})}}}{Q^{\pi^{*}}(s,a)}\leq\epsilon\}d\mu_{1}(s)=O(\epsilon^{\eta}).

by assumption 4.7. Together with equation (8), we know that

s𝒮a𝒜Qπ(s,a){π(a|s)πK(a|s)}𝟙(sA)dμ1(s)=O(ϵη+1).\displaystyle\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}}\mathbbm{1}(s\in A_{*})d\mu_{1}(s)}=O(\epsilon^{\eta+1}). (9)

Then we can bound the second part of equation (7),

s𝒮a𝒜Qπ(s,a){π(a|s)πK(a|s)}𝟙(sAC)dμ1(s).\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}\mathbbm{1}(s\in A_{*}^{C})d\mu_{1}(s)}}.

Recall the definition of AC={maxa𝒜Qπ(s,a)Qπ(s,a^πK)>ϵ}{maxa𝒜Qπ(s,a)Qπ(s,a^πK)=0}A_{*}^{C}=\{\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})>\epsilon\}\cup\{\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})=0\}. We know {maxa𝒜Qπ(s,a)Qπ(s,a^πK)=0}\{\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})=0\} is a trivial case. It is easy to show that

s𝒮a𝒜Qπ(s,a){π(a|s)πK(a|s)}𝟙{s:maxa𝒜Qπ(s,a)Qπ(s,a^πK)=0}dμ1(s)=0,\displaystyle\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}\mathbbm{1}\{s:\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})=0\}d\mu_{1}(s)}}=0,

as π(a|s)=πK(a|s)\pi^{*}(a|s)=\pi^{K}(a|s) when maxa𝒜Qπ(s,a)Qπ(s,a^πK)=0\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})=0 always holds.

Therefore,

s𝒮a𝒜Qπ(s,a){π(a|s)πK(a|s)}𝟙(sAC)dμ1(s)=\displaystyle\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}\mathbbm{1}(s\in A_{*}^{C})d\mu_{1}(s)}}=
s𝒮a𝒜Qπ(s,a){π(a|s)πK(a|s)}𝟙{s:maxa𝒜Qπ(s,a)Qπ(s,a^πK)>ϵ}dμ1(s).\displaystyle\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}\mathbbm{1}\{s:\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})>\epsilon\}d\mu_{1}(s)}}.

We can show that when maxa𝒜Qπ(s,a)Qπ(s,a^πK)>ϵ\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})>\epsilon holds,

maxa𝒜Qπ(s,a)Qπ(s,a^πK)2maxa𝒜|Qπ(s,a)Q~K(s,a)|.\displaystyle\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})\leq 2\max_{a\in\mathcal{A}}{|Q^{\pi^{*}}(s,a)-\tilde{Q}_{K}(s,a)|}.

To show this, note that maxa𝒜Qπ(s,a)Qπ(s,a^πK)=maxa𝒜Qπ(s,a)Q~K(s,a^πK)+Q~K(s,a^πK)Qπ(s,a^πK)\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})=\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-\tilde{Q}_{K}(s,\hat{a}_{\pi^{K}})+\tilde{Q}_{K}(s,\hat{a}_{\pi^{K}})-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}}). It is trivial to show

Q~K(s,a^πK)Qπ(s,a^πK)maxa𝒜|Qπ(s,a)Q~K(s,a)|.\tilde{Q}_{K}(s,\hat{a}_{\pi^{K}})-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})\leq\max_{a\in\mathcal{A}}{|Q^{\pi^{*}}(s,a)-\tilde{Q}_{K}(s,a)|}.

Furthermore, we can get maxa𝒜Qπ(s,a)Q~K(s,a^πK)=maxa𝒜Qπ(s,a)Q~K(s,aargmaxa𝒜Q(s,a))+Q~K(s,aargmaxa𝒜Q(s,a))Q~K(s,a^πK)\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-\tilde{Q}_{K}(s,\hat{a}_{\pi^{K}})=\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-\tilde{Q}_{K}(s,a\in{\operatorname*{arg\,max}_{a\in\mathcal{A}}{{Q^{*}(s,a)}}})+\tilde{Q}_{K}(s,a\in{\operatorname*{arg\,max}_{a\in\mathcal{A}}{{Q^{*}(s,a)}}})-\tilde{Q}_{K}(s,\hat{a}_{\pi^{K}}). Then we know that maxa𝒜Qπ(s,a)Q~K(s,aargmaxa𝒜Q(s,a))maxa𝒜|Qπ(s,a)Q~K(s,a)|\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-\tilde{Q}_{K}(s,a\in{\operatorname*{arg\,max}_{a\in\mathcal{A}}{{Q^{*}(s,a)}}})\leq\max_{a\in\mathcal{A}}{|Q^{\pi^{*}}(s,a)-\tilde{Q}_{K}(s,a)|} and Q~K(s,aargmaxa𝒜Q(s,a))Q~K(s,a^πK)0\tilde{Q}_{K}(s,a\in{\operatorname*{arg\,max}_{a\in\mathcal{A}}{{Q^{*}(s,a)}}})-\tilde{Q}_{K}(s,\hat{a}_{\pi^{K}})\leq 0, which implies that maxa𝒜Qπ(s,a)Q~K(s,a^πK)maxa𝒜|Qπ(s,a)Q~K(s,a)|\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-\tilde{Q}_{K}(s,\hat{a}_{\pi^{K}})\leq\max_{a\in\mathcal{A}}{|Q^{\pi^{*}}(s,a)-\tilde{Q}_{K}(s,a)|}.

To bound maxa𝒜|Qπ(s,a)Q~K(s,a)|\max_{a\in\mathcal{A}}{|Q^{\pi^{*}}(s,a)-\tilde{Q}_{K}(s,a)|}, we know that

maxa𝒜|Qπ(s,a)Q~K(s,a)|2a𝒜|Qπ(s,a)Q~K(s,a)|2\displaystyle\max_{a\in\mathcal{A}}{{|Q^{\pi^{*}}(s,a)-\tilde{Q}_{K}(s,a)|}^{2}}\leq\sum_{a\in\mathcal{A}}{{|Q^{\pi^{*}}(s,a)-\tilde{Q}_{K}(s,a)|}^{2}}\leq
1mina𝒜μ2(a)a𝒜|Qπ(s,a)Q~K(s,a)|2μ2(a).\displaystyle\frac{1}{\min_{a\in\mathcal{A}}{\mu_{2}(a)}}\sum_{a\in\mathcal{A}}{{|Q^{\pi^{*}}(s,a)-\tilde{Q}_{K}(s,a)|}^{2}\mu_{2}(a)}.

Then, it can be obtained that

s𝒮a𝒜Qπ(s,a){π(a|s)πK(a|s)}𝟙(sAC)dμ1(s)=\displaystyle\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}}\mathbbm{1}(s\in A_{*}^{C})d\mu_{1}(s)}= (10)
s𝒮(maxa𝒜Qπ(s,a)Qπ(s,a^πK))𝟙(sAC)𝑑μ1(s)\displaystyle\int_{s\in\mathcal{S}}{(\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}}))\mathbbm{1}(s\in A_{*}^{C})d\mu_{1}(s)}\leq
s𝒮(2maxa𝒜|Qπ(s,a)Q~K(s,a)|)2(maxa𝒜Qπ(s,a)Qπ(s,a^πK))𝟙(sAC)𝑑μ1(s)\displaystyle\int_{s\in\mathcal{S}}{\frac{{(2\max_{a\in\mathcal{A}}{|Q^{\pi^{*}}(s,a)-\tilde{Q}_{K}(s,a)|})}^{2}}{(\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}}))}\mathbbm{1}(s\in A_{*}^{C})d\mu_{1}(s)}\leq
4ϵmina𝒜μ2(a)s𝒮a𝒜|Qπ(s,a)Q~K(s,a)|2μ2(a)𝟙(sAC)dμ1(s)\displaystyle\frac{4}{\epsilon\min_{a\in\mathcal{A}}{\mu_{2}(a)}}\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{{|Q^{\pi^{*}}(s,a)-\tilde{Q}_{K}(s,a)|}^{2}\mu_{2}(a)}\mathbbm{1}(s\in A_{*}^{C})d\mu_{1}(s)}\leq
4ϵmina𝒜μ2(a)s𝒮a𝒜|Qπ(s,a)Q~K(s,a)|2μ2(a)dμ1(s)=\displaystyle\frac{4}{\epsilon\min_{a\in\mathcal{A}}{\mu_{2}(a)}}\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{{|Q^{\pi^{*}}(s,a)-\tilde{Q}_{K}(s,a)|}^{2}\mu_{2}(a)}d\mu_{1}(s)}=
4ϵmina𝒜μ2(a)Qπ(s,a)Q~K(s,a)μ,22.\displaystyle\frac{4}{\epsilon\min_{a\in\mathcal{A}}{\mu_{2}(a)}}\left\|Q^{\pi^{*}}(s,a)-\tilde{Q}_{K}(s,a)\right\|_{\mu,2}^{2}.

From Theorem 4.6, we already have

QπQ~K2,μ2=Op(γ2(1γ)4×[|𝒜|n1(nα(logn)ξ+d1κ)(logn)ξ+1\displaystyle\left\|Q^{\pi^{*}}-\tilde{Q}_{K}\right\|_{2,\mu}^{2}=O_{p}(\gamma^{2}(1-\gamma)^{4}\times[|\mathcal{A}|n^{-1}(n^{\alpha^{*}}{(\log{n})}^{\xi^{*}}+d_{1}^{*}{\kappa}){(\log{n})}^{\xi^{*}+1} (11)
+L0(eζκeζm+n2Δ)]+γ2K(1γ)2Rmax2).\displaystyle+L_{\mathcal{F}_{0}}{(e^{-\zeta\kappa}-e^{-\zeta m}+n^{-2\Delta})}]+\frac{\gamma^{2K}}{(1-\gamma)^{2}}R_{max}^{2}).

Then plug in equation (11) to equation (10) and it leads to

s𝒮a𝒜Qπ(s,a){π(a|s)πK(a|s)}𝟙(sAC)dμ1(s)\displaystyle\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}}\mathbbm{1}(s\in A_{*}^{C})d\mu_{1}(s)} (12)
=Op(1ϵmina𝒜μ2(a)γ2(1γ)4×[|𝒜|n1(nα(logn)ξ\displaystyle=O_{p}(\frac{1}{\epsilon\min_{a\in\mathcal{A}}{\mu_{2}(a)}}\gamma^{2}(1-\gamma)^{4}\times[|\mathcal{A}|n^{-1}(n^{\alpha^{*}}{(\log{n})}^{\xi^{*}}
+d1κ)(logn)ξ+1+L0(eζκeζm+n2Δ)]+γ2K(1γ)2Rmax2).\displaystyle+d_{1}^{*}{\kappa}){(\log{n})}^{\xi^{*}+1}+L_{\mathcal{F}_{0}}{(e^{-\zeta\kappa}-e^{-\zeta m}+n^{-2\Delta})}]+\frac{\gamma^{2K}}{(1-\gamma)^{2}}R_{max}^{2}).

Then from equation (7), equation (9) and equation (12), we can get

s𝒮a𝒜Qπ(s,a){π(a|s)πK(a|s)}dμ1(s)\displaystyle\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}d\mu_{1}(s)}}
=Op(ϵη+1+1ϵmina𝒜μ2(a)γ2(1γ)4×[|𝒜|n1(nα(logn)ξ\displaystyle=O_{p}(\epsilon^{\eta+1}+\frac{1}{\epsilon\min_{a\in\mathcal{A}}{\mu_{2}(a)}}\gamma^{2}(1-\gamma)^{4}\times[|\mathcal{A}|n^{-1}(n^{\alpha^{*}}{(\log{n})}^{\xi^{*}}
+d1κ)(logn)ξ+1+L0(eζκeζm+n2Δ)]+γ2K(1γ)2Rmax2).\displaystyle+d_{1}^{*}{\kappa}){(\log{n})}^{\xi^{*}+1}+L_{\mathcal{F}_{0}}{(e^{-\zeta\kappa}-e^{-\zeta m}+n^{-2\Delta})}]+\frac{\gamma^{2K}}{(1-\gamma)^{2}}R_{max}^{2}).

If we take

ϵη+1=1ϵmina𝒜μ2(a)γ2(1γ)4×[|𝒜|n1(nα(logn)ξ+d1κ)(logn)ξ+1+\displaystyle\epsilon^{\eta+1}=\frac{1}{\epsilon\min_{a\in\mathcal{A}}{\mu_{2}(a)}}\gamma^{2}(1-\gamma)^{4}\times[|\mathcal{A}|n^{-1}(n^{\alpha^{*}}{(\log{n})}^{\xi^{*}}+d_{1}^{*}{\kappa}){(\log{n})}^{\xi^{*}+1}+
L0(eζκeζm+n2Δ)]+γ2K(1γ)2Rmax2,\displaystyle L_{\mathcal{F}_{0}}{(e^{-\zeta\kappa}-e^{-\zeta m}+n^{-2\Delta})}]+\frac{\gamma^{2K}}{(1-\gamma)^{2}}R_{max}^{2},

then the following result is obtained:

s𝒮a𝒜Qπ(s,a){π(a|s)πK(a|s)}dμ1(s)=Op({γ2(1γ)4×[|𝒜|n1(nα(logn)ξ\displaystyle\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}d\mu_{1}(s)}}=O_{p}(\{\gamma^{2}(1-\gamma)^{4}\times[|\mathcal{A}|n^{-1}(n^{\alpha^{*}}{(\log{n})}^{\xi^{*}} (13)
+d1κ)(logn)ξ+1+L0(eζκeζm+n2Δ)]+γ2K(1γ)2Rmax2}η+1η+2).\displaystyle+d_{1}^{*}{\kappa}){(\log{n})}^{\xi^{*}+1}+L_{\mathcal{F}_{0}}{(e^{-\zeta\kappa}-e^{-\zeta m}+n^{-2\Delta})}]+\frac{\gamma^{2K}}{(1-\gamma)^{2}}R_{max}^{2}\}^{\frac{\eta+1}{\eta+2}}).

Then, using a similar argument, we can show the analogous result under the Lebesgue measure,

s𝒮a𝒜Qπ(s,a){π(a|s)πK(a|s)}ds=Op({γ2(1γ)4×[|𝒜|n1(nα(logn)ξ\displaystyle\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}ds}}=O_{p}(\{\gamma^{2}(1-\gamma)^{4}\times[|\mathcal{A}|n^{-1}(n^{\alpha^{*}}{(\log{n})}^{\xi^{*}} (14)
+d1κ)(logn)ξ+1+L0(eζκeζm+n2Δ)]+γ2K(1γ)2Rmax2}η+1η+2).\displaystyle+d_{1}^{*}{\kappa}){(\log{n})}^{\xi^{*}+1}+L_{\mathcal{F}_{0}}{(e^{-\zeta\kappa}-e^{-\zeta m}+n^{-2\Delta})}]+\frac{\gamma^{2K}}{(1-\gamma)^{2}}R_{max}^{2}\}^{\frac{\eta+1}{\eta+2}}).

Combining equations (6), (13) and (14) yields the result of Theorem 4.8.

C.3 Proof of Lemma C.1

The Lemma C.1 is: For the estimated Q-function obtained in iteration KK, Q~K\tilde{Q}_{K} in Algorithm 1, we have

QQ~K2,μ22γ2(1γ)4ϵmax2ϕμ,σ2+8γ2KRmax2(1γ)2,\displaystyle\left\|Q^{*}-\tilde{Q}_{K}\right\|_{2,\mu}^{2}\leq 2\gamma^{2}(1-\gamma)^{4}\epsilon_{max}^{2}\phi_{\mu,\sigma}^{2}+8\frac{\gamma^{2K}R_{max}^{2}}{(1-\gamma)^{2}},

where ϵmax=maxk{1,2,K}TQ~k1Q~kσ\epsilon_{max}=\max_{k\in\{1,2...,K\}}{\left\|T\tilde{Q}_{k-1}-\tilde{Q}_{k}\right\|_{\sigma}}.

Proof.

First, denote $\rho_{k}=T\tilde{Q}_{k-1}-\tilde{Q}_{k}$ and define the operator $P^{\pi}$ as follows:

\displaystyle(P^{\pi}Q)(s,a)=E[Q(S^{\prime},A^{\prime})|S^{\prime}\sim P(.|s,a),A^{\prime}\sim\pi(.|S^{\prime})]. (15)

We rely on the following lemma from Fan et al. (2020):

Lemma C.7 (Lemma C.2 in Fan et al. (2020)).

\displaystyle Q^{*}-\tilde{Q}_{l}\leq\sum_{i=k}^{l-1}{\gamma^{l-1-i}(P^{\pi^{*}})^{l-1-i}\rho_{i+1}}+\gamma^{l-k}(P^{\pi^{*}})^{l-k}(Q^{*}-\tilde{Q}_{k})

for all $k,l\in\mathbb{Z}$ with $0\leq k<l\leq K$.

Plugging in $l=K,k=0$, we have

\displaystyle Q^{*}-\tilde{Q}_{K}\leq\sum_{i=0}^{K-1}{\gamma^{K-1-i}(P^{\pi^{*}})^{K-1-i}\rho_{i+1}}+\gamma^{K}(P^{\pi^{*}})^{K}(Q^{*}-\tilde{Q}_{0}).

We can denote Ui=(Pπ)K1iU_{i}=(P^{\pi^{*}})^{K-1-i} for i=0,1,,K1i=0,1,...,K-1 and UK=(Pπ)KU_{K}=(P^{\pi^{*}})^{K}. Then we have

|Q(s,a)Q~K(s,a)|i=0K1γK1i(Ui|ρi+1|)(s,a)+γK(UK|QQ~0|)(s,a).\displaystyle|Q^{*}(s,a)-\tilde{Q}_{K}(s,a)|\leq\sum_{i=0}^{K-1}{\gamma^{K-1-i}(U_{i}|\rho_{i+1}|)(s,a)}+\gamma^{K}(U_{K}|Q^{*}-\tilde{Q}_{0}|)(s,a).

Squaring both sides, we have

|Q(s,a)Q~K(s,a)|2[i=0K1γK1i(Ui|ρi+1|)(s,a)]2+2i=0K1γK1i(Ui|ρi+1|)(s,a)\displaystyle{|Q^{*}(s,a)-\tilde{Q}_{K}(s,a)|}^{2}\leq[\sum_{i=0}^{K-1}{\gamma^{K-1-i}(U_{i}|\rho_{i+1}|)(s,a)}]^{2}+2\sum_{i=0}^{K-1}{\gamma^{K-1-i}(U_{i}|\rho_{i+1}|)(s,a)}
×γK(UK|QQ~0|)(s,a)+γ2K[(UK|QQ~0|)(s,a)]2.\displaystyle\times\gamma^{K}(U_{K}|Q^{*}-\tilde{Q}_{0}|)(s,a)+\gamma^{2K}[(U_{K}|Q^{*}-\tilde{Q}_{0}|)(s,a)]^{2}.

That is,

Eμ[|Q(s,a)Q~K(s,a)|2]Eμ{[i=0K1γK1i(Ui|ρi+1|)(s,a)]2}\displaystyle E_{\mu}[{|Q^{*}(s,a)-\tilde{Q}_{K}(s,a)|}^{2}]\leq E_{\mu}\{[\sum_{i=0}^{K-1}{\gamma^{K-1-i}(U_{i}|\rho_{i+1}|)(s,a)}]^{2}\} (16)
+Eμ{2i=0K1γK1i(Ui|ρi+1|)(s,a)×γK(UK|QQ~0|)(s,a)}\displaystyle+E_{\mu}\{2\sum_{i=0}^{K-1}{\gamma^{K-1-i}(U_{i}|\rho_{i+1}|)(s,a)}\times\gamma^{K}(U_{K}|Q^{*}-\tilde{Q}_{0}|)(s,a)\}
+Eμ{γ2K[(UK|QQ~0|)(s,a)]2}.\displaystyle+E_{\mu}\{\gamma^{2K}[(U_{K}|Q^{*}-\tilde{Q}_{0}|)(s,a)]^{2}\}.

For the third term Eμ{γ2K[(UK|QQ~0|)(s,a)]2}E_{\mu}\{\gamma^{2K}[(U_{K}|Q^{*}-\tilde{Q}_{0}|)(s,a)]^{2}\}, we can use the bound

Eμ{γ2K[(UK|QQ~0|)(s,a)]2}4γ2KVmax2=4γ2KRmax2(1γ)2.\displaystyle E_{\mu}\{\gamma^{2K}[(U_{K}|Q^{*}-\tilde{Q}_{0}|)(s,a)]^{2}\}\leq 4\gamma^{2K}V_{max}^{2}=\frac{4\gamma^{2K}R_{max}^{2}}{(1-\gamma)^{2}}. (17)

This equation (17) holds because of the following reason: we know |Q(s,a)Q~0(s,a)|2Vmax,(s,a)|Q^{*}(s,a)-\tilde{Q}_{0}(s,a)|\leq 2V_{max},\forall(s,a) and this implies that |(UK|QQ~0|)(s,a)|=|((Pπ)K|QQ~0|)(s,a)|2Vmax,(s,a)|(U_{K}|Q^{*}-\tilde{Q}_{0}|)(s,a)|=|({(P^{\pi^{*}})}^{K}|Q^{*}-\tilde{Q}_{0}|)(s,a)|\leq 2V_{max},\forall(s,a).

For the cross term in (16), we have

Eμ{2i=0K1γK1i(Ui|ρi+1|)(s,a)×γK(UK|QQ~0|)(s,a)}\displaystyle E_{\mu}\{2\sum_{i=0}^{K-1}{\gamma^{K-1-i}(U_{i}|\rho_{i+1}|)(s,a)}\times\gamma^{K}(U_{K}|Q^{*}-\tilde{Q}_{0}|)(s,a)\}\leq (18)
Eμ{[i=0K1γK1i(Ui|ρi+1|)(s,a)]2}+Eμ{γ2K[(UK|QQ~0|)(s,a)]2}.\displaystyle E_{\mu}\{[\sum_{i=0}^{K-1}{\gamma^{K-1-i}(U_{i}|\rho_{i+1}|)(s,a)}]^{2}\}+E_{\mu}\{\gamma^{2K}[(U_{K}|Q^{*}-\tilde{Q}_{0}|)(s,a)]^{2}\}.

Then the main focus should be the bound on Eμ{[i=0K1γK1i(Ui|ρi+1|)(s,a)]2}E_{\mu}\{[\sum_{i=0}^{K-1}{\gamma^{K-1-i}(U_{i}|\rho_{i+1}|)(s,a)}]^{2}\}. We have

Eμ{[i=0K1γK1i(Ui|ρi+1|)(s,a)]2}=0i,jK1γ2K2ijEμ[(Ui|ρi+1|)(s,a)(Uj|ρj+1|)(s,a)].\displaystyle E_{\mu}\{[\sum_{i=0}^{K-1}{\gamma^{K-1-i}(U_{i}|\rho_{i+1}|)(s,a)}]^{2}\}=\sum_{0\leq i,j\leq K-1}{\gamma^{2K-2-i-j}E_{\mu}[(U_{i}|\rho_{i+1}|)(s,a)(U_{j}|\rho_{j+1}|)(s,a)]}.

Then we need to handle Eμ[(Ui|ρi+1|)(s,a)(Uj|ρj+1|)(s,a)]E_{\mu}[(U_{i}|\rho_{i+1}|)(s,a)(U_{j}|\rho_{j+1}|)(s,a)] for 0i,jK10\leq i,j\leq K-1.

We have

Eμ[(Ui|ρi+1|)(s,a)(Uj|ρj+1|)(s,a)]=Eμ[(Pπ)K1i|ρi+1|(s,a)(Pπ)K1j|ρj+1|(s,a)].E_{\mu}[(U_{i}|\rho_{i+1}|)(s,a)(U_{j}|\rho_{j+1}|)(s,a)]=E_{\mu}[{(P^{\pi^{*}})}^{K-1-i}|\rho_{i+1}|(s,a){(P^{\pi^{*}})}^{K-1-j}|\rho_{j+1}|(s,a)].

Recall that $(P^{\pi}Q)(s,a)=E[Q(S^{\prime},A^{\prime})|S^{\prime}\sim P(.|s,a),A^{\prime}\sim\pi(.|S^{\prime})]$. One key observation is that

\displaystyle{[(P^{\pi}Q)(s,a)]}^{2}=E[Q(S^{\prime},A^{\prime})|S^{\prime}\sim P(.|s,a),A^{\prime}\sim\pi(.|S^{\prime})]
\displaystyle\times E[Q(S^{\prime},A^{\prime})|S^{\prime}\sim P(.|s,a),A^{\prime}\sim\pi(.|S^{\prime})]
\displaystyle\leq E[Q^{2}(S^{\prime},A^{\prime})|S^{\prime}\sim P(.|s,a),A^{\prime}\sim\pi(.|S^{\prime})],

where the inequality follows from Jensen's inequality.

Here we would abuse the notation PπμP^{\pi}\mu to represent the distribution of next stage (S,A)(S^{\prime},A^{\prime}) such that SP(.|s,a),Aπ(.|S)S^{\prime}\sim P(.|s,a),A^{\prime}\sim\pi(.|S^{\prime}) (when notation PπP^{\pi} is followed by a function on 𝒮×𝒜\mathcal{S}\times\mathcal{A}, it is regarded as the operator defined in equation (15)); when it is followed by a distribution on 𝒮×𝒜\mathcal{S}\times\mathcal{A}, it is regarded as a transformation of distribution as is discussed here.

Then it can be shown that

Eμ{[(PπQ)(s,a)]2}Eμ{E[Q2(S,A)|SP(.|s,a),Aπ(.|S)]}=EPπμ[Q2(s,a)].\displaystyle E_{\mu}\{[(P^{\pi}Q)(s,a)]^{2}\}\leq E_{\mu}\{E[Q^{2}(S^{\prime},A^{\prime})|S^{\prime}\sim P(.|s,a),A^{\prime}\sim\pi(.|S^{\prime})]\}=E_{P^{\pi}\mu}[Q^{2}(s,a)].

Iterating this argument, we further obtain

\displaystyle E_{\mu}\{[(P^{\pi_{1}}P^{\pi_{2}}...P^{\pi_{l}}Q)(s,a)]^{2}\}\leq E_{P^{\pi_{1}}P^{\pi_{2}}...P^{\pi_{l}}\mu}[Q^{2}(s,a)].

For simplicity, we denote $P^{\pi_{1}}P^{\pi_{2}}...P^{\pi_{l}}\mu$ by $\tilde{\mu}_{l}$.

For two functions f1,f2f_{1},f_{2} on 𝒮×𝒜\mathcal{S}\times\mathcal{A}, we know

Eμ[(Pπ)K1if1(Pπ)K1jf2]{Eμ{[(Pπ)K1if1]2}Eμ{[(Pπ)K1jf2]2}}12.\displaystyle E_{\mu}[(P^{\pi^{*}})^{K-1-i}f_{1}(P^{\pi^{*}})^{K-1-j}f_{2}]\leq\{E_{\mu}\{[(P^{\pi^{*}})^{K-1-i}f_{1}]^{2}\}E_{\mu}\{[(P^{\pi^{*}})^{K-1-j}f_{2}]^{2}\}\}^{\frac{1}{2}}.

It can be obtained that

Eμ{[(Pπ)K1if1]2}E(Pπ)K1iμ[f12]=𝒮×𝒜f12(s,a)𝑑μ~K1i(s,a)\displaystyle E_{\mu}\{[(P^{\pi^{*}})^{K-1-i}f_{1}]^{2}\}\leq E_{(P^{\pi^{*}})^{K-1-i}\mu}[f_{1}^{2}]=\int_{\mathcal{S}\times\mathcal{A}}{f_{1}^{2}(s,a)d\tilde{\mu}_{K-1-i}(s,a)}
=𝒮×𝒜|f12(s,a)dμ~K1i(s,a)dσ(s,a)|𝑑σ(s,a)𝒮×𝒜|f12(s,a)|𝑑σ(s,a)×dμ~K1i(s,a)dσ(s,a),σ\displaystyle=\int_{\mathcal{S}\times\mathcal{A}}{|f_{1}^{2}(s,a)\frac{d\tilde{\mu}_{K-1-i}(s,a)}{d\sigma(s,a)}|d\sigma(s,a)}\leq\int_{\mathcal{S}\times\mathcal{A}}{|f_{1}^{2}(s,a)|d\sigma(s,a)}\times\left\|\frac{d\tilde{\mu}_{K-1-i}(s,a)}{d\sigma(s,a)}\right\|_{\infty,\sigma}
=f12,σ2×ω(K1i;μ,σ).\displaystyle=\left\|f_{1}\right\|_{2,\sigma}^{2}\times\omega_{\infty}(K-1-i;\mu,\sigma).

Recall the definition of $\omega_{\infty}(i;\mu,\sigma)$ in equation (2). Similarly, we have

Eμ{[(Pπ)K1jf2]2}f22,σ2×ω(K1j;μ,σ).\displaystyle E_{\mu}\{[(P^{\pi^{*}})^{K-1-j}f_{2}]^{2}\}\leq\left\|f_{2}\right\|_{2,\sigma}^{2}\times\omega_{\infty}(K-1-j;\mu,\sigma).

If we plug in f1=|ρi+1|f_{1}=|\rho_{i+1}| and f2=|ρj+1|f_{2}=|\rho_{j+1}|, we can get

Eμ[(Pπ)K1i|ρi+1|(s,a)(Pπ)K1j|ρj+1|(s,a)]\displaystyle E_{\mu}[{(P^{\pi^{*}})}^{K-1-i}|\rho_{i+1}|(s,a){(P^{\pi^{*}})}^{K-1-j}|\rho_{j+1}|(s,a)]
{Eμ{[(Pπ)K1i|ρi+1|]2}Eμ{[(Pπ)K1j|ρj+1|]2}}12\displaystyle\leq\{E_{\mu}\{[(P^{\pi^{*}})^{K-1-i}|\rho_{i+1}|]^{2}\}E_{\mu}\{[(P^{\pi^{*}})^{K-1-j}|\rho_{j+1}|]^{2}\}\}^{\frac{1}{2}}
ρi+12,σρj+12,σ[ω(K1i;μ,σ)ω(K1j;μ,σ)]12.\displaystyle\leq\left\|\rho_{i+1}\right\|_{2,\sigma}\left\|\rho_{j+1}\right\|_{2,\sigma}[\omega_{\infty}(K-1-i;\mu,\sigma)\omega_{\infty}(K-1-j;\mu,\sigma)]^{\frac{1}{2}}.

Then,

Eμ{[i=0K1γK1i(Ui|ρi+1|)(s,a)]2}0i,jK1γ2K2ijρi+12,σρj+12,σ\displaystyle E_{\mu}\{[\sum_{i=0}^{K-1}{\gamma^{K-1-i}(U_{i}|\rho_{i+1}|)(s,a)}]^{2}\}\leq\sum_{0\leq i,j\leq K-1}\gamma^{2K-2-i-j}\left\|\rho_{i+1}\right\|_{2,\sigma}\left\|\rho_{j+1}\right\|_{2,\sigma}
×[ω(K1i;μ,σ)ω(K1j;μ,σ)]12.\displaystyle\times[\omega_{\infty}(K-1-i;\mu,\sigma)\omega_{\infty}(K-1-j;\mu,\sigma)]^{\frac{1}{2}}.

We define

\displaystyle\epsilon_{max}=\max_{i\in\{1,2,\dots,K\}}{\left\|\rho_{i}\right\|_{2,\sigma}}=\max_{i\in\{1,2,\dots,K\}}{\left\|T\tilde{Q}_{i-1}-\tilde{Q}_{i}\right\|_{2,\sigma}},

so that $\epsilon_{max}^{2}=\max_{i\in\{1,2,\dots,K\}}{\left\|\rho_{i}\right\|_{2,\sigma}^{2}}$. Thus, it can be shown that

Eμ{[i=0K1γK1i(Ui|ρi+1|)(s,a)]2}ϵmax2{i=0K1γi[ω(i;μ,σ)]12}2.\displaystyle E_{\mu}\{[\sum_{i=0}^{K-1}{\gamma^{K-1-i}(U_{i}|\rho_{i+1}|)(s,a)}]^{2}\}\leq\epsilon_{max}^{2}\{\sum_{i=0}^{K-1}{\gamma^{i}[\omega_{\infty}(i;\mu,\sigma)]^{\frac{1}{2}}}\}^{2}.

We have already shown that $\frac{1}{(1-\gamma)^{2}}\sum_{m=0}^{\infty}{\gamma^{m-1}(m+1)[\omega_{\infty}(m;\mu,\sigma)]^{\frac{1}{2}}}\leq\phi_{\mu,\sigma}$. Therefore,

\displaystyle\sum_{i=0}^{K-1}{\gamma^{i}[\omega_{\infty}(i;\mu,\sigma)]^{\frac{1}{2}}}<(1-\gamma)^{2}\gamma\times\frac{1}{(1-\gamma)^{2}}\sum_{i=0}^{\infty}{\gamma^{i-1}(i+1)[\omega_{\infty}(i;\mu,\sigma)]^{\frac{1}{2}}}\leq(1-\gamma)^{2}\gamma\phi_{\mu,\sigma}.

Hence,

Eμ{[i=0K1γK1i(Ui|ρi+1|)(s,a)]2}<ϵmax2(1γ)4γ2ϕμ,σ2.\displaystyle E_{\mu}\{[\sum_{i=0}^{K-1}{\gamma^{K-1-i}(U_{i}|\rho_{i+1}|)(s,a)}]^{2}\}<\epsilon_{max}^{2}(1-\gamma)^{4}\gamma^{2}\phi_{\mu,\sigma}^{2}. (19)

Combining equations (16), (17), (18) and (19) yields the result of Lemma C.1.

C.4 Proof of Proposition C.2

The Proposition C.2 is: The function class

1(L,{{di}i=1L+1,d0=m+m0},s,Vmax,G^,κ)\mathcal{F}_{1}(L,\{\{d_{i}\}^{L+1}_{i=1},d_{0}=m+m_{0}\},s,V_{max},\hat{G},\kappa)

is a subset of

2(L,{{di}i=1L+1,d0=κ+m0},s+d1κ,Vmax,G^,κ).\mathcal{F}_{2}(L,\{\{d_{i}\}^{L+1}_{i=1},d_{0}=\kappa+m_{0}\},s+d_{1}\kappa,V_{max},\hat{G},\kappa).
Proof.

Note that the vector $z^{*}$ recovered from the first $\kappa$ principal components of $z$ is related to the corresponding vector of $\kappa$ PCA scores through $z^{*}=\hat{U}{\hat{\Sigma}}^{\frac{1}{2}}\begin{pmatrix}I_{\kappa}\\ 0_{(m-\kappa)\times\kappa}\end{pmatrix}\hat{v}$. By Definition A.1 in Section A, for every $f_{1}\in\mathcal{F}_{1}(L,\{\{d_{i}\}^{L+1}_{i=1},d_{0}=m+m_{0}\},s,V_{max},\hat{G},\kappa)$ we have $f_{1}(x,z,a)=f_{0}^{1}(x,z^{*},a)$ with $f_{0}^{1}(.,a)\in\mathcal{F}_{SReLU}(L,\{d_{i}\}_{i=0}^{L+1},s,V_{max})$. Denote the first layer of the sparse ReLU network $f_{0}^{1}(.,a)$ by $\sigma(W_{1}^{1}\begin{pmatrix}x\\ z^{*}\end{pmatrix}+b_{1}^{1})$, where $W_{1}^{1}\begin{pmatrix}x\\ z^{*}\end{pmatrix}=\begin{pmatrix}W_{11}^{1}&W_{12}^{1}\end{pmatrix}\begin{pmatrix}x\\ z^{*}\end{pmatrix}=W_{11}^{1}x+W_{12}^{1}z^{*}$, with $W_{12}^{1}z^{*}=W_{12}^{1}\hat{U}{\hat{\Sigma}}^{\frac{1}{2}}\begin{pmatrix}I_{\kappa}\\ 0_{(m-\kappa)\times\kappa}\end{pmatrix}\hat{v}$.

We now show that there exists $f_{2}\in\mathcal{F}_{2}(L,\{\{d_{j}\}^{L+1}_{j=1},d_{0}=\kappa+m_{0}\},s+d_{1}\kappa,V_{max},\hat{G},\kappa)$ such that $f_{2}(x,z,a)=f_{1}(x,z,a)$. By definition, $f_{2}(x,z,a)=f_{0}^{2}(x,\hat{v},a)$ with

$f_{0}^{2}(.,a)\in\mathcal{F}_{SReLU}(L,\{d_{i}\}_{i=0}^{L+1},s+\kappa d_{1},V_{max})$.

The first layer of $f_{0}^{2}$ can be written analogously as $\sigma(W_{1}^{2}\begin{pmatrix}x\\ \hat{v}\end{pmatrix}+b_{1}^{2})$, with $W_{1}^{2}\begin{pmatrix}x\\ \hat{v}\end{pmatrix}=\begin{pmatrix}W_{11}^{2}&W_{12}^{2}\end{pmatrix}\begin{pmatrix}x\\ \hat{v}\end{pmatrix}=W_{11}^{2}x+W_{12}^{2}\hat{v}$. Setting $W_{11}^{2}=W_{11}^{1}$, $b_{1}^{2}=b_{1}^{1}$ and $W_{12}^{2}=W_{12}^{1}\hat{U}{\hat{\Sigma}}^{\frac{1}{2}}\begin{pmatrix}I_{\kappa}\\ 0_{(m-\kappa)\times\kappa}\end{pmatrix}$ gives $\sigma(W_{1}^{1}\begin{pmatrix}x\\ z^{*}\end{pmatrix}+b_{1}^{1})=\sigma(W_{1}^{2}\begin{pmatrix}x\\ \hat{v}\end{pmatrix}+b_{1}^{2})$. The parameters in the remaining layers of $f_{0}^{2}$ are set equal to those of $f_{0}^{1}$.

Therefore, for every $f_{1}\in\mathcal{F}_{1}(L,\{\{d_{j}\}^{L+1}_{j=1},d_{0}=m+m_{0}\},s,V_{max},\hat{G},\kappa)$, there exists $f_{2}\in\mathcal{F}_{2}(L,\{\{d_{j}\}^{L+1}_{j=1},d_{0}=\kappa+m_{0}\},s+d_{1}\kappa,V_{max},\hat{G},\kappa)$ such that

$f_{1}(x,z,a)=f_{0}^{1}(x,z^{*},a)=f_{0}^{2}(x,\hat{v},a)=f_{2}(x,z,a)$.

The sparsity of $f_{0}^{2}$ is increased to $s+\kappa d_{1}$ because the weight matrix $W_{12}^{2}=W_{12}^{1}\hat{U}{\hat{\Sigma}}^{\frac{1}{2}}\begin{pmatrix}I_{\kappa}\\ 0_{(m-\kappa)\times\kappa}\end{pmatrix}\in\mathbb{R}^{d_{1}\times\kappa}$ may no longer be sparse, even though $W_{12}^{1}$ is sparse.
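The following minimal numerical sketch (synthetic dimensions, not part of the proof) illustrates this construction: the PCA recovery map is absorbed into the first-layer weights, and the two first-layer outputs coincide.

import numpy as np

# Check that absorbing U_hat * Sigma_hat^{1/2} (restricted to the first kappa
# columns) into the first-layer weights reproduces the layer acting on z*.
rng = np.random.default_rng(3)
m0, m, kappa, d1 = 3, 8, 2, 6
A = rng.normal(size=(m, m))
G_hat = A @ A.T                                       # estimated covariance of z
lam, U_hat = np.linalg.eigh(G_hat)
order = np.argsort(lam)[::-1]
lam, U_hat = lam[order], U_hat[:, order]

x, z = rng.normal(size=m0), rng.normal(size=m)
v_hat = (U_hat.T @ z)[:kappa] / np.sqrt(lam[:kappa])  # first kappa PCA scores
z_star = U_hat[:, :kappa] @ (U_hat[:, :kappa].T @ z)  # rank-kappa recovery of z

W11 = rng.normal(size=(d1, m0))
W12 = rng.normal(size=(d1, m))
b1 = rng.normal(size=d1)
W12_new = (W12 @ U_hat[:, :kappa]) * np.sqrt(lam[:kappa])   # W12 * U_hat * Sigma^{1/2}, first kappa columns

h1 = np.maximum(W11 @ x + W12 @ z_star + b1, 0)       # first layer of f_0^1 on (x, z*)
h2 = np.maximum(W11 @ x + W12_new @ v_hat + b1, 0)    # first layer of f_0^2 on (x, v_hat)
assert np.allclose(h1, h2)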

C.5 Proof of Lemma C.3

The Lemma C.3 is: Under the same assumptions as Theorem 4.6,

\sup_{g\in\mathcal{F}_{2}}{\inf_{f\in\mathcal{F}_{2}}{\left\|f-Tg\right\|_{\sigma}^{2}}}\leq 2\sup_{f^{\prime}\in\mathcal{G}_{0}}{\inf_{f_{0}\in\mathcal{F}_{0}}{\left\|f_{0}-f^{\prime}\right\|_{\sigma}^{2}}}+2L_{\mathcal{F}_{0}}{\left\|Z-Z^{*}\right\|_{\sigma}^{2}},

where $L_{\mathcal{F}_{0}}=\sup_{f\in\mathcal{F}_{0}}{\sup_{x\neq y}{\frac{{|f(y)-f(x)|}^{2}}{\left\|y-x\right\|^{2}}}}$, $Z$ is the high-frequency vector drawn from the distribution $\sigma$, and $Z^{*}=(\sum_{k=1}^{\kappa}{\hat{U}_{k}{\hat{U}_{k}}^{T}})Z$ is the vector recovered from the first $\kappa$ principal components of $Z$.

Proof.

We have supg2inff2fTgσ2supf𝒢0inff22f2fσ2\sup_{g\in\mathcal{F}_{2}}{\inf_{f\in\mathcal{F}_{2}}{\left\|f-Tg\right\|_{\sigma}^{2}}}\leq\sup_{f^{\prime}\in\mathcal{G}_{0}}{\inf_{f_{2}\in\mathcal{F}_{2}}{\left\|f_{2}-f^{\prime}\right\|_{\sigma}^{2}}}, which relies on the assumption: Tg𝒢0,g2Tg\in\mathcal{G}_{0},\forall g\in\mathcal{F}_{2}. From the Proposition C.2, we have 12\mathcal{F}_{1}\subset\mathcal{F}_{2}. Then we have supf𝒢0inff22f2fσ2supf𝒢0inff11f1fσ2\sup_{f^{\prime}\in\mathcal{G}_{0}}{\inf_{f_{2}\in\mathcal{F}_{2}}{\left\|f_{2}-f^{\prime}\right\|_{\sigma}^{2}}}\leq\sup_{f^{\prime}\in\mathcal{G}_{0}}{\inf_{f_{1}\in\mathcal{F}_{1}}{\left\|f_{1}-f^{\prime}\right\|_{\sigma}^{2}}}.

By the definition of $\mathcal{F}_{1}$,

\[\sup_{f^{\prime}\in\mathcal{G}_{0}}\inf_{f_{1}\in\mathcal{F}_{1}}\left\|f_{1}-f^{\prime}\right\|_{\sigma}^{2}=\sup_{f^{\prime}\in\mathcal{G}_{0}}\inf_{f_{1}\in\mathcal{F}_{1}}E_{\sigma}\{[f_{1}(x,z,a)-f^{\prime}(x,z,a)]^{2}\}=\sup_{f^{\prime}\in\mathcal{G}_{0}}\inf_{f_{0}\in\mathcal{F}_{0}}E_{\sigma}\{[f_{0}(x,z^{*},a)-f^{\prime}(x,z,a)]^{2}\}.\]

Next,

\[E_{\sigma}\{[f_{0}(x,z^{*},a)-f^{\prime}(x,z,a)]^{2}\}=E_{\sigma}\{[f_{0}(x,z^{*},a)-f_{0}(x,z,a)+f_{0}(x,z,a)-f^{\prime}(x,z,a)]^{2}\}\]
\[\leq 2E_{\sigma}\{[f_{0}(x,z^{*},a)-f_{0}(x,z,a)]^{2}\}+2E_{\sigma}\{[f_{0}(x,z,a)-f^{\prime}(x,z,a)]^{2}\}.\]

Furthermore, the term $E_{\sigma}\{[f_{0}(x,z^{*},a)-f_{0}(x,z,a)]^{2}\}$ satisfies

\[E_{\sigma}\{[f_{0}(x,z^{*},a)-f_{0}(x,z,a)]^{2}\}\leq L_{\mathcal{F}_{0}}\left\|Z-Z^{*}\right\|_{\sigma}^{2}.\]

Note that $L_{\mathcal{F}_{0}}$ is the squared Lipschitz constant of the ReLU networks in the class $\mathcal{F}_{0}$. By the chain rule and the fact that $\left\|W\right\|_{2}^{2}\leq m_{W}\left\|W\right\|_{\infty}^{2}$ for any matrix $W$ with $m_{W}$ columns, $L_{\mathcal{F}_{0}}$ is always bounded: $L_{\mathcal{F}_{0}}\leq\prod_{l=1}^{L}d_{l-1}$.
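As an aside, the sketch below (hypothetical widths and random weights, not the paper's code) numerically illustrates this crude bound: it evaluates the squared difference ratio of a small ReLU network whose weight entries are bounded by one over random input pairs, and compares it with the product of the input widths of the affine layers.

```python
import numpy as np

rng = np.random.default_rng(1)
dims = [6, 8, 8, 1]   # hypothetical widths d_0, d_1, d_2, d_3

# Weight entries bounded by one in absolute value, as in the sparse ReLU class;
# biases are set to zero for simplicity.
Ws = [rng.uniform(-1.0, 1.0, size=(dims[l + 1], dims[l]))
      for l in range(len(dims) - 1)]

def relu_net(x):
    h = x
    for W in Ws[:-1]:
        h = np.maximum(W @ h, 0.0)   # hidden layers with ReLU activation
    return Ws[-1] @ h                # linear output layer

# Empirical squared difference ratios over random input pairs
# (a lower bound on the true squared Lipschitz constant).
ratios = []
for _ in range(2000):
    x, y = rng.normal(size=dims[0]), rng.normal(size=dims[0])
    num = np.sum((relu_net(y) - relu_net(x)) ** 2)
    den = np.sum((y - x) ** 2)
    ratios.append(num / den)

crude_bound = np.prod(dims[:-1])     # product of the input widths of the layers
print(f"max observed ratio {max(ratios):.2f} <= crude bound {crude_bound}")
```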

It follows that

\[E_{\sigma}\{[f_{0}(x,z^{*},a)-f^{\prime}(x,z,a)]^{2}\}\leq 2L_{\mathcal{F}_{0}}\left\|Z-Z^{*}\right\|_{\sigma}^{2}+2E_{\sigma}\{[f_{0}(x,z,a)-f^{\prime}(x,z,a)]^{2}\}\]

for all $f_{0}\in\mathcal{F}_{0}$ and $f^{\prime}\in\mathcal{G}_{0}$. Taking $\sup$ and $\inf$ on both sides yields

\[\sup_{f^{\prime}\in\mathcal{G}_{0}}\inf_{f_{0}\in\mathcal{F}_{0}}E_{\sigma}\{[f_{0}(x,z^{*},a)-f^{\prime}(x,z,a)]^{2}\}\leq 2L_{\mathcal{F}_{0}}\left\|Z-Z^{*}\right\|_{\sigma}^{2}+2\sup_{f^{\prime}\in\mathcal{G}_{0}}\inf_{f_{0}\in\mathcal{F}_{0}}E_{\sigma}\{[f_{0}(x,z,a)-f^{\prime}(x,z,a)]^{2}\},\]

where $E_{\sigma}\{[f_{0}(x,z,a)-f^{\prime}(x,z,a)]^{2}\}=\left\|f_{0}-f^{\prime}\right\|_{\sigma}^{2}$.

Combining the above displays, we conclude that

\[\sup_{g\in\mathcal{F}_{2}}\inf_{f\in\mathcal{F}_{2}}\left\|f-Tg\right\|_{\sigma}^{2}\leq 2L_{\mathcal{F}_{0}}\left\|Z-Z^{*}\right\|_{\sigma}^{2}+2\sup_{f^{\prime}\in\mathcal{G}_{0}}\inf_{f_{0}\in\mathcal{F}_{0}}\left\|f_{0}-f^{\prime}\right\|_{\sigma}^{2}.\]

C.6 Proof of Lemma C.4

Lemma C.4 states: for the high-frequency vector $Z$, assume $E(Z)=0$ and that the eigenvalues of $Cov(Z)$ decay exponentially, $\lambda_{j}=O(e^{-\zeta j})$. Assume further that the estimated eigenvectors of $Cov(Z)$ satisfy $\|\hat{U}_{j}-U_{j}\|=O_{p}(n^{-\Delta})$. Then

\[\left\|Z-Z^{*}\right\|_{\sigma}^{2}=O_{p}(e^{-\zeta\kappa}-e^{-\zeta m}+n^{-2\Delta}).\]
Proof.

Note that

\[\left\|Z-Z^{*}\right\|_{\sigma}^{2}=E[(Z-Z^{*})^{T}(Z-Z^{*})]=E\Big[Z^{T}\Big(\sum_{j=\kappa+1}^{m}\hat{U}_{j}\hat{U}_{j}^{T}\Big)^{2}Z\Big]=\sum_{j=\kappa+1}^{m}E[tr(Z^{T}\hat{U}_{j}\hat{U}_{j}^{T}Z)]\]
\[=\sum_{j=\kappa+1}^{m}tr[E(ZZ^{T})\hat{U}_{j}\hat{U}_{j}^{T}]=tr\Big[E(ZZ^{T})\sum_{j=\kappa+1}^{m}\hat{U}_{j}\hat{U}_{j}^{T}\Big],\]

where the second equality uses $Z-Z^{*}=(\sum_{j=\kappa+1}^{m}\hat{U}_{j}\hat{U}_{j}^{T})Z$ and the third uses the orthonormality of $\{\hat{U}_{j}\}$, so that the projection matrix $\sum_{j=\kappa+1}^{m}\hat{U}_{j}\hat{U}_{j}^{T}$ equals its own square.

We know that $E(ZZ^{T})=Cov(Z)=\sum_{i=1}^{m}\lambda_{i}U_{i}U_{i}^{T}$, so that

\[\left\|Z-Z^{*}\right\|_{\sigma}^{2}=tr\Big[\sum_{i=1}^{m}\lambda_{i}U_{i}U_{i}^{T}\sum_{j=\kappa+1}^{m}\hat{U}_{j}\hat{U}_{j}^{T}\Big]=tr\Big[\sum_{i=1}^{m}\lambda_{i}U_{i}U_{i}^{T}\sum_{j=\kappa+1}^{m}(\hat{U}_{j}-U_{j}+U_{j})(\hat{U}_{j}-U_{j}+U_{j})^{T}\Big]\]
\[=tr\Big[\sum_{i=1}^{m}\lambda_{i}U_{i}U_{i}^{T}\sum_{j=\kappa+1}^{m}U_{j}U_{j}^{T}\Big]+tr\Big[\sum_{i=1}^{m}\lambda_{i}U_{i}U_{i}^{T}\sum_{j=\kappa+1}^{m}(\hat{U}_{j}-U_{j})U_{j}^{T}\Big]+tr\Big[\sum_{i=1}^{m}\lambda_{i}U_{i}U_{i}^{T}\sum_{j=\kappa+1}^{m}U_{j}(\hat{U}_{j}-U_{j})^{T}\Big]+tr\Big[\sum_{i=1}^{m}\lambda_{i}U_{i}U_{i}^{T}\sum_{j=\kappa+1}^{m}(\hat{U}_{j}-U_{j})(\hat{U}_{j}-U_{j})^{T}\Big].\]

Hence we need to bound the four terms on the right-hand side.

The first term satisfies

\[tr\Big[\sum_{i=1}^{m}\lambda_{i}U_{i}U_{i}^{T}\sum_{j=\kappa+1}^{m}U_{j}U_{j}^{T}\Big]=tr\Big[\sum_{j=\kappa+1}^{m}\lambda_{j}U_{j}U_{j}^{T}U_{j}U_{j}^{T}\Big]=\sum_{j=\kappa+1}^{m}tr(\lambda_{j}U_{j}U_{j}^{T})=\sum_{j=\kappa+1}^{m}\lambda_{j}.\]

Since the eigenvalues satisfy $\lambda_{j}=O(e^{-\zeta j})$, summing the geometric series gives $\sum_{j=\kappa+1}^{m}\lambda_{j}=O\big(e^{-\zeta(\kappa+1)}\frac{1-e^{-\zeta(m-\kappa)}}{1-e^{-\zeta}}\big)=O(e^{-\zeta\kappa}-e^{-\zeta m})$. Therefore the first term is

\[tr\Big[\sum_{i=1}^{m}\lambda_{i}U_{i}U_{i}^{T}\sum_{j=\kappa+1}^{m}U_{j}U_{j}^{T}\Big]=O(e^{-\zeta\kappa}-e^{-\zeta m}).\]

The second term is

\[tr\Big[\sum_{i=1}^{m}\lambda_{i}U_{i}U_{i}^{T}\sum_{j=\kappa+1}^{m}(\hat{U}_{j}-U_{j})U_{j}^{T}\Big]=\sum_{i=1}^{m}\sum_{j=\kappa+1}^{m}tr[\lambda_{i}U_{i}U_{i}^{T}(\hat{U}_{j}-U_{j})U_{j}^{T}]\]
\[=\sum_{i=1}^{m}\sum_{j=\kappa+1}^{m}tr[\lambda_{i}U_{j}^{T}U_{i}U_{i}^{T}(\hat{U}_{j}-U_{j})]=\sum_{j=\kappa+1}^{m}\lambda_{j}U_{j}^{T}(\hat{U}_{j}-U_{j}).\]

Since $\left\|U_{j}\right\|=1$, we have $|U_{j}^{T}(\hat{U}_{j}-U_{j})|\leq\left\|U_{j}\right\|\left\|\hat{U}_{j}-U_{j}\right\|=O_{p}(n^{-\Delta})$, and hence $\sum_{j=\kappa+1}^{m}\lambda_{j}U_{j}^{T}(\hat{U}_{j}-U_{j})=O_{p}(n^{-\Delta}(e^{-\zeta\kappa}-e^{-\zeta m}))$.

The third term is handled in the same way as the second:

\[tr\Big[\sum_{i=1}^{m}\lambda_{i}U_{i}U_{i}^{T}\sum_{j=\kappa+1}^{m}U_{j}(\hat{U}_{j}-U_{j})^{T}\Big]=\sum_{i=1}^{m}\sum_{j=\kappa+1}^{m}tr[\lambda_{i}U_{i}U_{i}^{T}U_{j}(\hat{U}_{j}-U_{j})^{T}]=\sum_{j=\kappa+1}^{m}\lambda_{j}(\hat{U}_{j}-U_{j})^{T}U_{j}=O_{p}(n^{-\Delta}(e^{-\zeta\kappa}-e^{-\zeta m})).\]

Finally, the fourth term is

\[tr\Big[\sum_{i=1}^{m}\lambda_{i}U_{i}U_{i}^{T}\sum_{j=\kappa+1}^{m}(\hat{U}_{j}-U_{j})(\hat{U}_{j}-U_{j})^{T}\Big]=\sum_{i=1}^{m}\sum_{j=\kappa+1}^{m}tr[\lambda_{i}U_{i}U_{i}^{T}(\hat{U}_{j}-U_{j})(\hat{U}_{j}-U_{j})^{T}]\]
\[=\sum_{i=1}^{m}\sum_{j=\kappa+1}^{m}tr[\lambda_{i}U_{i}^{T}(\hat{U}_{j}-U_{j})(\hat{U}_{j}-U_{j})^{T}U_{i}]=\sum_{i=1}^{m}\sum_{j=\kappa+1}^{m}\lambda_{i}[U_{i}^{T}(\hat{U}_{j}-U_{j})]^{2}=O_{p}(n^{-2\Delta}e^{-\zeta})=O_{p}(n^{-2\Delta}).\]

Putting the four bounds together,

\[\left\|Z-Z^{*}\right\|_{\sigma}^{2}=O_{p}\big((e^{-\zeta\kappa}-e^{-\zeta m})+n^{-\Delta}(e^{-\zeta\kappa}-e^{-\zeta m})+n^{-2\Delta}\big)=O_{p}(e^{-\zeta\kappa}-e^{-\zeta m}+n^{-2\Delta}).\]
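The leading term of this bound is simply the sum of the discarded eigenvalues. The sketch below (hypothetical dimensions, not the paper's code) draws $Z$ from a distribution with an exponentially decaying spectrum, estimates $\hat{U}$ from a sample, and compares the empirical reconstruction error with $\sum_{j>\kappa}\lambda_{j}$.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, kappa, zeta = 20, 500, 5, 0.6          # hypothetical sizes

# Population covariance with eigenvalues lambda_j = exp(-zeta * j).
lam = np.exp(-zeta * np.arange(1, m + 1))
Q, _ = np.linalg.qr(rng.normal(size=(m, m)))
cov = Q @ np.diag(lam) @ Q.T

Z = rng.multivariate_normal(np.zeros(m), cov, size=n)   # rows are draws of Z

# Estimated eigenvectors from the sample covariance, in decreasing order.
evals, evecs = np.linalg.eigh(np.cov(Z, rowvar=False))
U_hat = evecs[:, np.argsort(evals)[::-1]]

# Reconstruction from the first kappa estimated principal directions.
P = U_hat[:, :kappa] @ U_hat[:, :kappa].T
Z_star = Z @ P
emp_err = np.mean(np.sum((Z - Z_star) ** 2, axis=1))    # approximates E||Z - Z*||^2

tail = lam[kappa:].sum()                                 # sum_{j > kappa} lambda_j
print(emp_err, "vs tail eigenvalue sum", tail)
```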

C.7 Proof of Proposition C.5

Proposition C.5 states: when $\delta=\frac{1}{n}$, $\log N_{\delta,2}\leq C_{1}|\mathcal{A}|(n^{\alpha^{*}}(\log n)^{\xi^{*}}+d_{1}^{*}\kappa)(\log n)^{1+\xi^{*}}$ for some constant $C_{1}$.

Proof.

To prove this result, consider the function class $\mathcal{F}_{3}=\{f_{3}:S\rightarrow\mathbb{R}:f_{3}(x,z)=f(x,\begin{pmatrix}I_{\kappa}&0_{\kappa\times(m-\kappa)}\end{pmatrix}\hat{\Sigma}^{-\frac{1}{2}}\hat{U}^{T}z)=f(x,\hat{v}),\ f\in\mathcal{F}_{SReLU}(L^{*},\{d_{j}^{*}\}_{j=0}^{L^{*}+1},s^{*},V_{max}),\ d_{0}=\kappa+m_{0}\}$. Recalling the definition of $\mathcal{F}_{2}$ in Section A, for every $f\in\mathcal{F}_{2}$ we have $f(\cdot,a)\in\mathcal{F}_{3}$; that is, $\mathcal{F}_{3}$ is the function class used to model each action component $a$ of $Q(s,a)$. There is a relation between $N_{\frac{1}{n},2}$, the cardinality of the minimal $\delta$-covering set of $\mathcal{F}_{2}$, and $N_{\frac{1}{n},3}$, the cardinality of the minimal $\delta$-covering set of $\mathcal{F}_{3}$. With arguments similar to the step $|\mathcal{N}(\delta,\mathcal{F}_{0},\left\|\cdot\right\|_{\infty})|\leq|\mathcal{N}_{\delta}|^{|\mathcal{A}|}$ in equation (6.26) of Fan et al. (2020), we can show that

\[\log N_{\frac{1}{n},2}\leq|\mathcal{A}|\log N_{\frac{1}{n},3}. \qquad (20)\]

We now argue that $\mathcal{F}_{3}$ is itself a type of sparse ReLU network, so that Lemma 6.4 in Fan et al. (2020) applies directly. Specifically, $\mathcal{F}_{3}\subset\mathcal{F}_{SReLU,3}$, where $\mathcal{F}_{SReLU,3}$ is a simplified notation for $\mathcal{F}_{SReLU}(L^{*}+1,\{\{d_{i}\}_{i=0}^{L^{*}+2}:d_{0}=m,\ d_{i+1}=d_{i}^{*},\ i\in\{0,1,2,\dots,L^{*}+1\}\},s^{*}+m\kappa,V_{max})$. That is, the one extra layer of $\mathcal{F}_{3}$ is exactly the linear transformation $\hat{v}=\begin{pmatrix}I_{\kappa}&0_{\kappa\times(m-\kappa)}\end{pmatrix}\hat{\Sigma}^{-\frac{1}{2}}\hat{U}^{T}z$. Since we do not know whether the weight matrix $\begin{pmatrix}I_{\kappa}&0_{\kappa\times(m-\kappa)}\end{pmatrix}\hat{\Sigma}^{-\frac{1}{2}}\hat{U}^{T}$ is sparse, the original sparsity level is increased from $s^{*}$ to $s^{*}+m\kappa$.

The cardinality of the $\delta$-covering set of a sparse ReLU network class is bounded in Lemma 6.4 of Fan et al. (2020), which we apply here. The cardinality of the $\delta$-covering set of $\mathcal{F}_{3}$, $N_{\frac{1}{n},3}$, satisfies

\[\log N_{\frac{1}{n},3}\leq\log|\mathcal{N}(\delta=\tfrac{1}{n},\mathcal{F}_{SReLU,3},\left\|\cdot\right\|_{\infty})|\leq(s^{*}+m\kappa+1)\log\Big[2n(L^{*}+2)\prod_{l=0}^{L^{*}+1}(d_{l}^{*}+1)\Big].\]

Here $L^{*}=O((\log n)^{\xi^{*}})$, $m+m_{0}\leq\min_{j\in\{1,2,\dots,L^{*}\}}d_{j}^{*}\leq\max_{j\in\{1,2,\dots,L^{*}\}}d_{j}^{*}=O(n^{\xi^{*}})$ and $s^{*}\asymp n^{\alpha^{*}}(\log n)^{\xi^{*}}+d_{1}^{*}\kappa$ for some constant $\xi^{*}>1+2\xi$. Therefore

\[\log N_{\frac{1}{n},3}\leq C(n^{\alpha^{*}}(\log n)^{\xi^{*}}+d_{1}^{*}\kappa+m\kappa)\log\Big[2n(L^{*}+2)\prod_{l=0}^{L^{*}+1}(d_{l}^{*}+1)\Big]\leq C^{\prime}(n^{\alpha^{*}}(\log n)^{\xi^{*}}+d_{1}^{*}\kappa)\log\Big[2n(L^{*}+2)\prod_{l=0}^{L^{*}+1}(d_{l}^{*}+1)\Big]\]
\[\leq C^{\prime\prime}(n^{\alpha^{*}}(\log n)^{\xi^{*}}+d_{1}^{*}\kappa)[\log n+\log L^{*}+L^{*}\log(n^{\xi^{*}})]\leq C_{1}(n^{\alpha^{*}}(\log n)^{\xi^{*}}+d_{1}^{*}\kappa)(\log n)^{\xi^{*}+1},\]

where $C,C^{\prime},C^{\prime\prime},C_{1}$ are constants.

Then from equation (20), we obtain $\log N_{\frac{1}{n},2}\leq C_{1}|\mathcal{A}|(n^{\alpha^{*}}(\log n)^{\xi^{*}}+d_{1}^{*}\kappa)(\log n)^{\xi^{*}+1}$. ∎

Appendix D Details of Selecting Number of Principal Components

The optimal number of principal components $\kappa^{*}$ that balances the bias and variance satisfies $\kappa^{*}\asymp\log(n)$.

Proof.

The term involving $\kappa$ in the bound of Theorem 4.8 is $\kappa|\mathcal{A}|d_{1}^{*}n^{-1}(\log n)^{\xi^{*}+1}+CL_{\mathcal{F}_{0}}e^{-\zeta\kappa}$ for some constant $C>0$. Taking the derivative of this term with respect to $\kappa$ and setting it to zero shows that the term is minimized at $\kappa^{*}=\frac{1}{\zeta}[\log(C\zeta L_{\mathcal{F}_{0}}|\mathcal{A}|^{-1}{d_{1}^{*}}^{-1})+\log n-(\xi^{*}+1)\log(\log n)]$. Hence the optimal $\kappa^{*}$ that balances the bias and variance satisfies $\kappa^{*}\asymp\log(n)$.
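For completeness, the first-order condition behind this choice of $\kappa^{*}$ is

\[|\mathcal{A}|d_{1}^{*}n^{-1}(\log n)^{\xi^{*}+1}-\zeta CL_{\mathcal{F}_{0}}e^{-\zeta\kappa^{*}}=0,\qquad\text{i.e.,}\qquad e^{-\zeta\kappa^{*}}=\frac{|\mathcal{A}|d_{1}^{*}(\log n)^{\xi^{*}+1}}{\zeta CL_{\mathcal{F}_{0}}\,n},\]

and taking logarithms gives the expression for $\kappa^{*}$ displayed above.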

This $\log n$ guideline applies when $n$ is large, as it is derived from the asymptotic results above. In addition, $\kappa\leq m$ must hold, since the number of principal components cannot exceed the original dimension. Thus, as $n$ grows to infinity, we eventually take $\kappa^{*}=m$. In other words, when the training sample size is extremely large, there is no longer any need to reduce the dimension of the high-frequency part, and Algorithm 1 becomes equivalent to the original neural fitted Q-iteration.
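In practice, as in Figures 3 and 7, $\kappa$ can be chosen as the smallest number of principal components whose cumulative proportion of explained variance reaches a pre-specified threshold. A minimal sketch of this rule, assuming a data matrix with observations of $Z$ in its rows and a hypothetical 95% threshold (not the code used in the paper):

```python
import numpy as np

def select_kappa(Z, threshold=0.95):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `threshold`."""
    Zc = Z - Z.mean(axis=0)                      # center the columns
    evals = np.linalg.eigvalsh(np.cov(Zc, rowvar=False))[::-1]
    ratio = np.cumsum(evals) / evals.sum()
    return int(np.searchsorted(ratio, threshold) + 1)

# Example with an exponentially decaying spectrum (zeta = 0.6, m = 27).
rng = np.random.default_rng(3)
m = 27
lam = np.exp(-0.6 * np.arange(1, m + 1))
Q, _ = np.linalg.qr(rng.normal(size=(m, m)))
Z = rng.multivariate_normal(np.zeros(m), Q @ np.diag(lam) @ Q.T, size=300)
print(select_kappa(Z))    # typically a small number, e.g. 5 or 6
```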

Figure 1: Illustration of a ReLU network.
Figure 2: Left: estimated value of the policies produced by Algorithm 1 as $\kappa$ varies over $\{2,6,10,14,\dots,74\}$ (the shaded area is the 95% confidence interval). Right: proportion of variance explained by the first $\kappa$ principal components.
Figure 3: Upper: variance explained by the first $\kappa$ principal component scores when there are $J=1,2,4,6,8$ high-frequency variables in the first simulation setting (corresponding to Figure 4). The horizontal dashed line marks 95% of variance explained; $\kappa=5$ explains 95% of the variance in all cases $J=1,2,4,6,8$ of the first setting. Lower: variance explained by the first $\kappa$ principal component scores when there are $J=1,2,4,6,8$ high-frequency variables in the second simulation setting (corresponding to Figure 5). Here the numbers of principal components are $\kappa=5,10,20,30,40$ for the five cases $J=1,2,4,6,8$, respectively, to ensure 95% of variance explained. The eigenvalues of the concatenated high-frequency vector $Z$ decay at an exponential rate $\lambda_{k}=O(e^{-\zeta k})$ with $\zeta=0.6$.
Figure 4: First simulation setting. Left: training data with $N=6$ and $T=60,75,90,105,120$. Right: training data with $T=120$ and $N=6,7,8,9,10$, when there are $1,2,4,6,8$ high-frequency variables of dimension $27$ (the shaded area is the 95% confidence interval). In the legend, “PCA” refers to $\pi_{K}^{PCA}$; “all z” refers to $\pi_{K}^{ALL}$; “bottleneck z” refers to $\pi_{K}^{BOTTLE}$; “mean z” refers to $\pi_{K}^{AVE}$.
Figure 5: Second simulation setting. Left: training data with $N=6$ and $T=60,75,90,105,120$. Right: training data with $T=120$ and $N=6,7,8,9,10$, when there are $1,2,4,6,8$ high-frequency variables of dimension $27$ (the shaded area is the 95% confidence interval). In the legend, “PCA” refers to $\pi_{K}^{PCA}$; “all z” refers to $\pi_{K}^{ALL}$; “bottleneck z” refers to $\pi_{K}^{BOTTLE}$; “mean z” refers to $\pi_{K}^{AVE}$.
Figure 6: Performance of $\pi^{PCA}$ and $\pi^{ALL}$ as the training sample size becomes extremely large ($T$ fixed at $120$; $N$ from $10$ to $200$, and $4000$). The x-axis shows $\log N$, and the shaded area is the 95% confidence interval. In the legend, “PCA” refers to $\pi_{K}^{PCA}$; “all z” refers to $\pi_{K}^{ALL}$.
Figure 7: Proportion of variance explained by the first $\kappa$ principal components of the CGM blood glucose level in the OhioT1DM data.
Table 1: Estimates of the value differences between $\pi_{k}^{PCA}$ and the other three policies.

Difference         $\pi_{k}^{PCA}-\pi_{k}^{ALL}$   $\pi_{k}^{PCA}-\pi_{k}^{AVE}$   $\pi_{k}^{PCA}-\pi_{k}^{BOTTLE}$
Mean               1.163                           1.399                           1.153
Margin of Error    0.181                           0.224                           0.145
Table 2: Value estimates of the four policies.

Value Estimate     $\pi_{k}^{PCA}$   $\pi_{k}^{ALL}$   $\pi_{k}^{AVE}$   $\pi_{k}^{BOTTLE}$
Mean               -8.762            -9.926            -10.162           -9.916
Standard Error     0.055             0.096             0.121             0.086