
Deep Spectral Q-learning with Application to Mobile Health

Yuhe Gao [email protected] Department of Statistics, North Carolina State University, Raleigh, USA Chengchun Shi [email protected] Department of Statistics, London School of Economics and Political Science, London, UK Rui Song [email protected] Department of Statistics, North Carolina State University, Raleigh, USA
Abstract

Dynamic treatment regimes assign personalized treatments to patients sequentially over time based on their baseline information and time-varying covariates. In mobile health applications, these covariates are typically collected at different frequencies over a long time horizon. In this paper, we propose a deep spectral Q-learning algorithm, which integrates principal component analysis (PCA) with deep Q-learning to handle the mixed frequency data. In theory, we prove that the mean return under the estimated optimal policy converges to that under the optimal one and establish its rate of convergence. The usefulness of our proposal is further illustrated via simulations and an application to a diabetes dataset.


Keywords: Dynamic Treatment Regimes, Mixed Frequency Data, Principal Component Analysis, Reinforcement Learning

1 Introduction

Precision medicine focuses on providing personalized treatment to patients by taking their personal information into consideration (see e.g., Kosorok and Laber, 2019; Tsiatis et al., 2019). It has found applications in numerous studies, ranging from cardiovascular disease to cancer treatment and gene therapy (Jameson and Longo, 2015). A dynamic treatment regime (DTR) consists of a sequence of treatment decision rules tailored to each individual patient's status at each time, mathematically formulating the idea behind precision medicine. One of the major objectives in precision medicine is to identify the optimal dynamic treatment regime that yields the most favorable outcome on average.

With the rapid development of mobile health (mHealth) technology, it has become feasible to collect rich longitudinal data through mobile apps in medical studies. A motivating data example is the OhioT1DM dataset (Marling and Bunescu, 2020), which contains data from 12 patients suffering from type-I diabetes measured via fitness bands over 8 weeks. Data-driven decision rules estimated from these data have the potential to improve these patients' health (see e.g., Shi et al., 2020; Zhu et al., 2020; Zhou et al., 2022). However, it remains challenging to estimate the optimal DTR in these mHealth studies. First, the number of treatment stages (i.e., the horizon) is no longer fixed, whereas the number of patients can be limited. For instance, in the OhioT1DM dataset, only 12 patients are enrolled in the study; yet if treatment decisions are made on an hourly basis, the horizon exceeds 1000. Existing proposals in the DTR literature (Murphy, 2003; Zhang et al., 2013; Zhao et al., 2015; Song et al., 2015; Shi et al., 2018; Nie et al., 2021; Fang et al., 2022; Qi and Liu, 2018; Mo et al., 2021; Ertefaie et al., 2021; Guan et al., 2020; Zhang et al., 2018) become inefficient in these long or infinite horizon settings and require a large number of patients to be consistent. Second, patients' time-varying covariates typically contain mixed frequency data. In the OhioT1DM dataset, some of the variables, such as the continuous glucose monitoring (CGM) blood glucose levels, are recorded every 5 minutes. Meanwhile, other variables, such as the carbohydrate estimate for the meal and the exercise intensity, are recorded with a much lower frequency. Concatenating these high-frequency variables over each one-hour interval produces a high-dimensional state vector, and directly using these states as input of the treatment policy would yield very noisy decision rules. A naive approach is to use some ad-hoc summaries of the high-frequency data for policy learning. However, this might produce a sub-optimal policy due to the information loss.

Recently, there has been a growing line of research in the statistics literature on policy learning and/or evaluation in infinite horizons. Some references include Ertefaie and Strawderman (2018); Luckett et al. (2020); Shi et al. (2022); Liao et al. (2020, 2021); Shi et al. (2021c); Chen et al. (2022); Ramprasad et al. (2022); Li et al. (2022); Xu et al. (2020). In the computer science literature, there is a huge body of work on developing reinforcement learning (RL) algorithms in infinite horizons. These algorithms can be categorized into model-free and model-based algorithms. Popular model-free RL methods include value-based methods, which model the expected return starting from a given state (or state-action pair) and compute the optimal policy as the greedy policy with respect to the value function (Ernst et al., 2005; Riedmiller, 2005; Mnih et al., 2015; Van Hasselt et al., 2016; Dabney et al., 2018), and policy-based methods, which directly search for the optimal policy among a parameterized class via policy gradient or actor-critic methods (Schulman et al., 2015a, b; Koutník et al., 2013; Mnih et al., 2016; Wang et al., 2017b). Model-based algorithms differ from model-free algorithms in that they model the transition dynamics of the environment and use this model to derive or improve policies. Popular model-based RL methods include Guestrin et al. (2002); Janner et al. (2019); Lai et al. (2020); Li et al. (2020), to name a few. See also Arulkumaran et al. (2017); Sutton and Barto (2018); Luo et al. (2022) for more details. These methods cannot be directly applied to datasets such as OhioT1DM as they do not consider mixed frequency data.

In the RL literature, a few works have considered dimension reduction to handle high-dimensional state systems. In particular, Murao and Kitamura (1997) proposed to segment the state space and learn a cluster representation of the states. Whiteson et al. (2007) proposed to divide the state space into tilings to represent each state. Both papers proposed to discretize the state space for dimension reduction. However, this can lead to considerable information loss (Wang et al., 2017a). Sprague (2007) proposed an iterative dimension reduction method using neighborhood components analysis. Their method uses a linear basis function to model the Q-function and does not allow more general nonlinear function approximation. Recently, a few works have employed principal component analysis (PCA) for dimension reduction in RL (Curran et al., 2015, 2016; Parisi et al., 2017). However, none of the aforementioned papers formally established theoretical guarantees for their proposals. Moreover, these methods are motivated by applications in games or robotics, and their generalization to mHealth applications with mixed frequency data remains unknown.

In the DTR literature, a few works considered mixed frequency data which include both scalar and functional covariates. Specifically, McKeague and Qian (2014) proposed a functional regression model for optimal decision making with one functional covariate. Ciarleglio et al. (2015) and Ciarleglio et al. (2016) extended their proposal to a more general setting with multiple scalar and functional covariates. Ciarleglio et al. (2018) considered variable selection to handle the mixed frequency data. Laber and Staicu (2018) applied functional PCA to the functional covariates for dimension reduction. All these works considered single stage decision making. Their methods are not directly applicable to the infinite horizon settings.

Our contributions are as follows. Scientifically, mixed frequency data frequently arise in mHealth applications. Nonetheless, it has been less explored in the infinite horizon settings. Our proposal thus fills a crucial gap, and greatly extends the scope of existing approaches to learning DTRs. Methodologically, we propose a deep spectral Q-learning algorithm for dimension reduction. The proposed algorithm achieves a better empirical performance than those that either directly use the original mixed frequency data or its ad-hoc summaries as input of the treatment policy. Theoretically, we derive an upper error bound for the regret of the proposed policy and decompose this bound into the sum of approximation error and estimation error. Our theories offer some general guidelines for practitioners to select the number of principal components in the proposed algorithm.

The rest of this paper is organized as follows. We introduce some background about DTR and the mixed frequency data in Section 2. We introduce the proposed method to estimate the optimal DTR in Section 3 and study its theoretical properties in Section 4. Empirical studies are conducted in Section 5. In Section 5.3, we apply the proposed method to the OhioT1DM dataset. Finally, we conclude our paper in Section 6.

2 Preliminary

2.1 Data and Notations

Suppose the study enrolls $N$ patients. The dataset for the $i$-th patient can be summarized as $O_{i}\equiv\{(S_{i,t},A_{i,t},Y_{i,t}):1\leq t\leq T_{i}\}$. For simplicity, we assume $T_{i}=T$ for any $i$. Each state $S_{i,t}=(X_{i,t},\{Z_{i,t,j}\}_{j=1}^{J})\in\mathcal{S}$ is composed of a set of low frequency covariates $X_{i,t}\in\mathbb{R}^{m_{0}}$ and a set of $J$ high frequency covariates $\{Z_{i,t,j}\}_{j=1}^{J}$, where $\mathcal{S}$ is the state space. For each high frequency variable $Z_{i,t,j}$, $j\in\{1,2,\ldots,J\}$, we have $Z_{i,t,j}=(Z_{i,t,j}^{(1)},Z_{i,t,j}^{(2)},\ldots,Z_{i,t,j}^{(m_{j})})^{T}\in\mathbb{R}^{m_{j}}$ for some large integer $m_{j}$. Let $\tau$ denote the length of a time unit. The low frequency variables are recorded at time points $\tau,2\tau,\cdots,t\tau,\cdots$. The $j$th high frequency variable, however, is recorded more frequently at time points $m_{j}^{-1}\tau,2m_{j}^{-1}\tau,\cdots$. Notice that we allow the $J$ high-frequency variables to be recorded with different frequencies. Let $m=\sum_{j=1}^{J}m_{j}$ and let $Z_{i,t}=[Z_{i,t,1}^{T},Z_{i,t,2}^{T},\ldots,Z_{i,t,J}^{T}]^{T}\in\mathbb{R}^{m}$ denote the high-dimensional vector that concatenates all the high frequency covariates. As such, the state $S_{i,t}$ can be represented as $(X_{i,t},Z_{i,t})\in\mathbb{R}^{m_{0}+m}$. In addition, $A_{i,t}$ denotes the treatment indicator at the $t$th time point and $\mathcal{A}$ denotes the set of all possible treatment options. The reward $Y_{i,t}$ corresponds to the $i$th patient's response obtained after the $t$th decision point. By convention, we assume a larger value of $Y_{i,t}$ indicates a better outcome. We require $|Y_{i,t}|$ to be uniformly bounded by some constant $R_{\max}>0$, and assume $O_{1},O_{2},\ldots,O_{N}$ are i.i.d.; both conditions are commonly imposed in the RL literature (see e.g., Sutton and Barto, 2018). Finally, we denote the $l_{p}$-norm of a function aggregated over a given distribution function $\sigma$ by $\|\cdot\|_{p,\sigma}$. We use $[q]$ to represent the index set $\{1,2,3,\ldots,q\}$ for any integer $q\in\mathbb{N}$.

2.2 Assumptions, Policies and Value Functions

We will require the system to satisfy the Markov assumption such that

$$P(S_{0,t+1}\in\mathbb{S}\mid S_{0,t}=s,A_{0,t}=a,\{S_{0,t^{\prime}},A_{0,t^{\prime}}\}_{0\leq t^{\prime}<t})=P(S_{0,t+1}\in\mathbb{S}\mid S_{0,t}=s,A_{0,t}=a),\quad\forall t,$$

for any $s,a$ and Borel set $\mathbb{S}\subseteq\mathcal{S}$. In other words, the distribution of the next state depends on the past data history only through the current state-action pair. We assume the transition kernel is absolutely continuous, with a uniformly bounded transition density function $q(s^{\prime}|s,a)$ such that $\sup_{s,a,s^{\prime}}|q(s^{\prime}|s,a)|\leq c_{q}$ for some constant $c_{q}>0$.

In addition, we also impose the following conditional mean independence assumption:

$$E(Y_{0,t}\mid S_{0,t}=s,A_{0,t}=a,\{Y_{0,t^{\prime}},S_{0,t^{\prime}},A_{0,t^{\prime}}\}_{0\leq t^{\prime}<t})=E(Y_{0,t}\mid S_{0,t}=s,A_{0,t}=a)=r(s,a),\quad\forall t,$$

where we refer to $r(s,a)=r(x,\{z_{j}\}_{j=1}^{J},a)$ as the immediate reward function.

Next, define a policy $\pi:\mathcal{S}\rightarrow P(\mathcal{A})$ as a function that maps a patient's state at each time point into a probability distribution on the action space. Given $\pi$, we define its (state) value function as

$$V^{\pi}(s)=\sum_{t=0}^{\infty}\gamma^{t}E^{\pi}\{Y_{0,t}\mid S_{0,0}=s\},$$

with $\gamma\in[0,1)$ being a discount factor that balances the immediate and long-term rewards. By definition, the state value function characterizes the expected return assuming the decision process follows a given target policy $\pi$. In addition, we define the state-action value function (or Q-function) as

$$Q^{\pi}(s,a)=\sum_{t=0}^{\infty}\gamma^{t}E^{\pi}\{Y_{0,t}\mid S_{0,0}=s,A_{0,0}=a\},$$

which is the expected discounted cumulative rewards given an initial state-action pair.

Under the Markov assumption and the conditional mean independence assumption, there exists an optimal policy $\pi^{*}$ such that $V^{\pi^{*}}(s)\geq V^{\pi}(s)$ for all $\pi$ and $s\in\mathcal{S}$ (see e.g., Puterman, 2014). Moreover, $\pi^{*}$ satisfies the following Bellman optimality equation:

$$Q^{\pi^{*}}(s,a)=E\{Y_{0,t}+\gamma\max_{a^{\prime}\in\mathcal{A}}Q^{\pi^{*}}(S_{0,t+1},a^{\prime})\mid S_{0,t}=s,A_{0,t}=a\},\qquad(1)$$

where $Q^{\pi^{*}}$ denotes the optimal Q-function.

2.3 ReLU Network

In this paper, we use value-based methods that learn the optimal policy $\pi^{*}$ by estimating the optimal Q-function. We use the class of sparse neural networks with the Rectified Linear Unit (ReLU) activation function, i.e., $\sigma(x)=\max(x,0)$, to model the Q-function. The advantage of using a neural network over a simple parametric model is that the neural network can better capture the potential non-linearity in the high-dimensional state system.

Formally, the class of sparse ReLU network is defined as

$$\mathcal{F}_{SReLU}(L,\{d_{j}\}_{j=0}^{L+1},s,V_{\max})=\big\{f:\mathbb{R}^{d_{0}}\rightarrow\mathbb{R}^{d_{L+1}}:f(x)=W_{L}\,g_{L-1}\circ g_{L-2}\circ\cdots\circ g_{1}\circ g_{0}(x),$$
$$g_{j}(x)=\sigma(W_{j}x+v_{j}),\ W_{j}\in\mathbb{R}^{d_{j+1}\times d_{j}},\ v_{j}\in\mathbb{R}^{d_{j+1}},\ j\in\{0,1,\ldots,L\},$$
$$\max_{j=0,1,\ldots,L}\{\max(\|W_{j}\|_{\infty},|v_{j}|_{\infty})\}\leq 1,\ \sum_{j=0}^{L}(\|W_{j}\|_{0}+|v_{j}|_{0})\leq s,\ \max_{k\in\{1,2,\ldots,d_{L+1}\}}\|f_{k}\|_{\infty}\leq V_{\max}\big\}.$$

Here, $L$ is the number of hidden layers of the neural network and $d_{j}$ is the width of each layer. The output dimension $d_{L+1}$ is set to 1 since the Q-function output is a scalar. The parameters in $\mathcal{F}_{SReLU}$ are the weight matrices $W_{j}$ and bias vectors $v_{j}$. The sparsity level $s$ upper bounds the total number of nonzero parameters in the model. This constraint can be satisfied using dropout layers in the implementation (Srivastava et al., 2014). In theory, sparse ReLU networks can fit smooth functions with a minimax optimal rate of convergence (Schmidt-Hieber, 2020). The main theorems in Section 4 rely on this property. An illustration of a sparse ReLU network is given in Figure 1.
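To make the construction above concrete, the following is a minimal PyTorch sketch of a ReLU network of the kind considered here, with dropout layers used as a practical surrogate for the sparsity constraint $s$; the layer widths, dropout rate, and input dimension are illustrative choices of ours, not values prescribed by the theory.

```python
import torch.nn as nn

class ReLUQNetwork(nn.Module):
    """ReLU network mapping a (reduced) state vector to a scalar Q-value for one action.

    Dropout is used here as a practical surrogate for the sparsity constraint s in
    F_SReLU; the widths below are illustrative only.
    """
    def __init__(self, input_dim, hidden_dims=(15, 5, 5), dropout=0.1):
        super().__init__()
        layers, d_in = [], input_dim
        for d_out in hidden_dims:
            layers += [nn.Linear(d_in, d_out), nn.ReLU(), nn.Dropout(dropout)]
            d_in = d_out
        layers.append(nn.Linear(d_in, 1))  # output dimension d_{L+1} = 1 (scalar Q-value)
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# One network per action a in A, as in the construction above (input_dim is arbitrary here).
q_nets = {a: ReLUQNetwork(input_dim=7) for a in (0, 1)}
```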

3 Spectral Fitted Q-Iteration

Neural networks with the ReLU activation function introduced in Section 2.3 are commonly used in value-based reinforcement learning algorithms. However, in medical studies, the training dataset is often of limited size, with a few thousand or tens of thousands of observations in total (see e.g., Marling and Bunescu, 2020; Liao et al., 2021). Meanwhile, the data contain high frequency state variables, which yields a high-dimensional state system. Directly using these states as input will produce a very noisy policy. This motivates us to consider dimension reduction in RL.

A naive approach for dimension reduction is to use some summary statistics of the high frequency state as input for policy learning. For instance, on the OhioT1DM dataset, the average of CGM blood glucose levels between two treatment decision points can be used as the summary statistic, as in Zhu et al. (2020); Shi et al. (2022); Zhou et al. (2022). In this paper, we propose to use principal component analysis to reduce the dimensionality of $\{Z_{i,t,j}\}_{j=1}^{J}$. We expect that using PCA can preserve more information than ad-hoc summaries (e.g., the average).

To apply PCA in the infinite horizon setting, we need to impose some stationarity assumptions on the concatenated high dimensional variable $Z_{i,t}\in\mathbb{R}^{m}$: $E[Z_{i,t}]=\mathbf{\mu}$ and $Cov[Z_{i,t}]=\mathbf{G}$ for some mean vector $\mathbf{\mu}\in\mathbb{R}^{m}$ and covariance matrix $\mathbf{G}\in\mathbb{R}^{m\times m}$ that are independent of $t$. In real data applications, we can test whether the concatenated high-frequency variable $Z_{i,t}$ is weakly stationary (see e.g., Dickey and Fuller, 1979; Said and Dickey, 1984; Kwiatkowski et al., 1992). If it is weakly stationary, the concatenated high frequency covariate $Z_{i,t}$ automatically satisfies the two assumptions above. Similar assumptions have been widely imposed in the literature (see e.g., Shi et al., 2021b; Kallus and Uehara, 2022). Without loss of generality, we assume $\mathbf{\mu}=0_{m}$ for notational simplicity. The covariance matrix $\mathbf{G}$ is generally unknown; in practice, we recommend using the sample covariance estimator $\hat{\mathbf{G}}$.

We describe our procedure as follows. By the spectral decomposition, we have $\hat{\mathbf{G}}=\sum_{k=1}^{m}\hat{\lambda}_{k}\hat{U}_{k}\hat{U}_{k}^{T}$, where $\hat{\lambda}_{1}\geq\hat{\lambda}_{2}\geq\cdots\geq 0$ are the eigenvalues and the $\hat{U}_{k}$'s are the corresponding eigenvectors. This allows us to represent $Z_{i,t}$ as $\sum_{k=1}^{m}\hat{\lambda}_{k}^{1/2}\hat{V}_{k}^{(i,t)}\hat{U}_{k}$, where the $\hat{V}_{k}^{(i,t)}$'s are the estimated principal component scores, given by $\hat{\lambda}_{k}^{-1/2}Z_{i,t}^{T}\hat{U}_{k}$. For any $\kappa\leq m$, the estimated principal component scores $\hat{\mathbf{V}}_{i,t,\kappa}=(\hat{V}_{1}^{(i,t)},\hat{V}_{2}^{(i,t)},\ldots,\hat{V}_{\kappa}^{(i,t)})$ correspond to the $\kappa$ largest eigenvalues of the concatenated high frequency variable $Z_{i,t}$. When $\kappa=m$, using these principal component scores is equivalent to using the original high-frequency variable $Z_{i,t}$. We approximate $Q^{\pi}(X_{i,t},\mathbf{V}_{i,t,m},A_{i,t})$ by $Q^{\pi}(X_{i,t},\hat{\mathbf{V}}_{i,t,\kappa},A_{i,t})$ and propose to use the neural fitted Q-iteration algorithm of Riedmiller (2005) to learn the estimated optimal policy. We detail our procedure in Algorithm 1.
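As a concrete illustration of this step, the sketch below computes the estimated principal component scores $\hat{\mathbf{V}}_{i,t,\kappa}$ from the sample covariance of the concatenated high-frequency vectors. The function name and interface are ours, and the observations are assumed to be centered, consistent with the convention $\mathbf{\mu}=0_{m}$ above.

```python
import numpy as np

def pca_scores(Z, kappa):
    """Compute the first kappa estimated principal component scores.

    Z     : (n_obs, m) array of centered concatenated high-frequency vectors Z_{i,t}
    kappa : number of principal components to retain (kappa <= m)
    Returns an (n_obs, kappa) array whose rows are the score vectors V_hat_{i,t,kappa}.
    """
    G_hat = np.cov(Z, rowvar=False)            # sample covariance estimator G_hat
    eigvals, eigvecs = np.linalg.eigh(G_hat)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:kappa]  # indices of the kappa largest eigenvalues
    lam, U = eigvals[order], eigvecs[:, order]
    # V_hat_k^{(i,t)} = lambda_hat_k^{-1/2} Z_{i,t}^T U_hat_k, computed for all rows at once
    return (Z @ U) / np.sqrt(np.maximum(lam, 1e-12))
```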

Algorithm 1 Spectral Fitted Q-Iteration
Input: $\{S_{i,t}=(X_{i,t},Z_{i,t}),A_{i,t},Y_{i,t},\gamma\}$ with $i\in\{1,2,\ldots,N\}$, $t\in\{1,2,\ldots,T\}$; ReLU network function class $\mathcal{F}_{SReLU}=\mathcal{F}_{SReLU}(L,\{d_{j}\}_{j=0}^{L+1},s,V_{\max})$; sampling distribution $\sigma$; sample size $n$; number of principal components $\kappa$; number of iterations $K$; estimated covariance matrix $\hat{\mathbf{G}}$ for $Z_{i,t}$;
Compute the first $\kappa$ principal component scores $\hat{\mathbf{V}}_{i,t,\kappa}$ of the high frequency part $Z_{i,t}$;
For each $a\in\mathcal{A}$, initialize a sparse ReLU network $\tilde{Q}_{0}(x,v_{\kappa},a)\in\mathcal{F}_{SReLU}$;
for $k=0$ to $K-1$ do
   Sample $n$ observations $I_{k}\subseteq\{(i,t):1\leq i\leq N,1\leq t\leq T-1\}$ according to $\sigma$ from the data;
   Define a response $R_{i,t}$ based on $\tilde{Q}_{k}$: $R_{i,t}(\tilde{Q}_{k})=Y_{i,t}+\gamma\max_{a\in\mathcal{A}}\tilde{Q}_{k}(X_{i,t+1},\hat{\mathbf{V}}_{i,t+1,\kappa},a)$;
   Update $\tilde{Q}_{k}$ to $\tilde{Q}_{k+1}$:
$$\tilde{Q}_{k+1}\leftarrow\operatorname*{argmin}_{f(\cdot,a)\in\mathcal{F}_{SReLU}}\frac{1}{n}\sum_{(i,t)\in I_{k}}[R_{i,t}(\tilde{Q}_{k})-f(X_{i,t},\hat{\mathbf{V}}_{i,t,\kappa},A_{i,t})]^{2}$$
end for
Return: the greedy policy $\pi_{K}$, defined by $\pi_{K}(a|x,v_{\kappa})=0$ if $a\notin\operatorname*{argmax}_{a^{\prime}}\tilde{Q}_{K}(x,v_{\kappa},a^{\prime})$, for all $x,v_{\kappa},a$.

In Algorithm 1, we fit $|\mathcal{A}|$ neural networks, one for each $a$ in $Q(s,a)$. This is reasonable in settings where the action space is small. When the dataset is small (such as the OhioT1DM dataset), we recommend setting $n$ to $N(T-1)$ so that all the data transitions (instead of a random subsample) are used in each iteration.

Similar to the original neural fitted Q-iteration algorithm in Riedmiller (2005), the intuition behind this algorithm is based on the Bellman optimality equation (1). At step $k$ of Algorithm 1, $\tilde{Q}_{k}$ estimates $Q^{\pi^{*}}$, and the response $R_{i,t}(\tilde{Q}_{k})=Y_{i,t}+\gamma\max_{a\in\mathcal{A}}\tilde{Q}_{k}(X_{i,t+1},\hat{\mathbf{V}}_{i,t+1,\kappa},a)$ corresponds to the right-hand side of equation (1). Therefore, regressing $R_{i,t}$ to obtain $\tilde{Q}_{k+1}$ amounts to solving the Bellman optimality equation. The key difference between Algorithm 1 and the original neural fitted Q-iteration algorithm is that the high dimensional input $Z_{i,t}=[Z_{i,t,1}^{T},Z_{i,t,2}^{T},\ldots,Z_{i,t,J}^{T}]^{T}$ in the state is mapped to a lower dimensional vector $\hat{\mathbf{V}}_{i,t,\kappa}$ during the learning process, so the neural networks $\tilde{Q}_{0}(x,v_{\kappa},a)$ take the principal components $\hat{\mathbf{V}}_{i,t,\kappa}$ rather than the original high dimensional $Z_{i,t}$ as input.
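For readers who prefer code to pseudocode, the following is a schematic and much simplified rendering of the iteration in Algorithm 1. The regression routine `fit_q_regression` is a placeholder for the least-squares fit over the sparse ReLU class, and all names and the data layout are ours rather than part of the algorithm's specification.

```python
import numpy as np

def spectral_fitted_q_iteration(S, S_next, A, Y, actions, gamma, K, fit_q_regression):
    """Schematic version of Algorithm 1, operating on reduced states (x, v_kappa).

    S, S_next        : (n, m0 + kappa) current and next reduced states per transition
    A, Y             : (n,) observed actions and rewards
    fit_q_regression : (S, A, targets) -> callable Q(S, a); placeholder for the
                       least-squares fit over the sparse ReLU class
    """
    Q = lambda s, a: np.zeros(len(s))  # Q_tilde_0 initialised at zero for illustration
    for _ in range(K):
        # Bellman targets R_{i,t}(Q_tilde_k) = Y_{i,t} + gamma * max_a Q_tilde_k(s', a)
        targets = Y + gamma * np.max(
            np.column_stack([Q(S_next, a) for a in actions]), axis=1)
        Q = fit_q_regression(S, A, targets)  # regression step yielding Q_tilde_{k+1}
    # greedy policy pi_K with respect to Q_tilde_K
    return lambda s: actions[int(np.argmax([Q(s[None, :], a)[0] for a in actions]))]
```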

4 Asymptotic Properties

Before discussing the asymptotic properties of our proposed Q-learning method, we introduce some notations.

Definition 4.1.

Define

$$\mathcal{F}_{0}(L,\{d_{i}\}_{i=0}^{L+1},s,V_{\max})=\{f:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R},\ f(\cdot,a)\in\mathcal{F}_{SReLU}(L,\{d_{i}\}_{i=0}^{L+1},s,V_{\max}),\ \forall a\in\mathcal{A}\},$$

where $\mathcal{F}_{SReLU}(L,\{d_{i}\}_{i=0}^{L+1},s,V_{\max})$ is the class of sparse ReLU networks with $L$ layers, sparsity parameter $s$, and $V_{\max}=\frac{R_{\max}}{1-\gamma}$, the uniform upper bound on the cumulative reward.

We define $L_{\mathcal{F}_{0}}=\sup_{f\in\mathcal{F}_{0}}\sup_{x\neq y}\frac{|f(y)-f(x)|^{2}}{\|y-x\|^{2}}$ as the Lipschitz constant for the sparse ReLU class $\mathcal{F}_{SReLU}(L,\{d_{i}\}_{i=0}^{L+1},s,V_{\max})$ used in $\mathcal{F}_{0}$. Note that the class $\mathcal{F}_{0}$ is used in the original neural fitted Q-iteration algorithm to model the Q-function, where the dimension of the high frequency part $Z$ of the state is not reduced through PCA. We further define a function class $\mathcal{F}_{2}$ that models the Q-function by first converting the high frequency part $Z$ into its principal component scores and then applying a sparse ReLU neural network to obtain the resulting Q-function. More specifically, $\mathcal{F}_{2}$ is the set of functions $\{f_{2}\}$ such that $f_{2}(x,z,a)=f_{0}(x,\hat{v}_{\kappa},a)$, where $f_{0}\in\mathcal{F}_{0}$ and $\hat{v}_{\kappa}$ is the vector containing the first $\kappa$ principal components of $z$. Note that $\mathcal{F}_{2}$ is the function class we use in Algorithm 1 to model the Q-function; that is, $\tilde{Q}_{k}\in\mathcal{F}_{2}$ for $k\in\{1,2,\ldots,K\}$ (the formal definition of $\mathcal{F}_{2}$, and of another function class $\mathcal{F}_{1}$ not mentioned here, is given in Section A of the Appendix).

In addition, we introduce the Hölder smooth function class $\mathcal{C}_{r}(\mathcal{D},\beta,H)$, where $\mathcal{D}\subseteq\mathbb{R}^{r}$ is the input domain. The definition is as follows:

Definition 4.2.

Define

$$\mathcal{C}_{r}(\mathcal{D},\beta,H)=\Big\{f:\mathcal{D}\rightarrow\mathbb{R}:\sum_{\gamma<\beta}\|\partial^{\gamma}f\|_{\infty}+\sum_{\alpha\in\mathbb{N}_{0}^{r}:\|\alpha\|=\lfloor\beta\rfloor}\sup_{x,y\in\mathcal{D},x\neq y}\frac{|\partial^{\alpha}f(x)-\partial^{\alpha}f(y)|}{\|x-y\|_{\infty}^{\beta-\lfloor\beta\rfloor}}\leq H\Big\},$$

where $\alpha\in\mathbb{N}_{0}^{r}$ is an $r$-tuple multi-index for partial derivatives.

We next construct a network structure $\mathcal{G}(\{p_{j},t_{j},\beta_{j},H_{j}\}_{j\in[q]})$ whose component functions on each layer belong to a Hölder smooth function class $\mathcal{C}(\mathcal{D},\beta,H)$; we refer to it as a composition of Hölder smooth functions. This composition network contains $q$ layers, with each layer being $g_{j}:[a_{j},b_{j}]^{p_{j}}\rightarrow[a_{j+1},b_{j+1}]^{p_{j+1}}$, such that $g_{jk}$, the $k$th component ($k\in[p_{j+1}]$) in layer $j$, satisfies $g_{jk}\in\mathcal{C}_{t_{j}}([a_{j},b_{j}]^{t_{j}},\beta_{j},H_{j})$ with $t_{j}\leq p_{j}$ inputs. Similar to the definition of $\mathcal{F}_{0}$, we can define a function class $\mathcal{G}_{0}(\{p_{j},t_{j},\beta_{j},H_{j}\}_{j\in[q]})$ (simply denoted by $\mathcal{G}_{0}$) of functions from $\mathcal{S}\times\mathcal{A}$ to $\mathbb{R}$ such that each $g\in\mathcal{G}_{0}$ satisfies $g(\cdot,a)\in\mathcal{G}(\{p_{j},t_{j},\beta_{j},H_{j}\}_{j\in[q]})$ for all $a\in\mathcal{A}$. The relation between the function class $\mathcal{G}_{0}$ and the network structure $\mathcal{G}$ is analogous to the relation between the function class $\mathcal{F}_{0}$ and the neural network class $\mathcal{F}_{SReLU}$ in Definition 4.1. See Definition 4.1 of Fan et al. (2020) for more details on $\mathcal{G}_{0}(\{p_{j},t_{j},\beta_{j},H_{j}\}_{j\in[q]})$.

Next we will introduce the three major assumptions for our theorems:

Assumption 4.3.

The eigenvalues of $Cov(Z)$ decay exponentially, i.e., $\lambda_{k}=O(e^{-\zeta k})$ for $k=1,2,\ldots,m$ and some constant $\zeta>0$.

Assumption 4.4.

The estimator $\hat{\mathbf{G}}=\sum_{k=1}^{m}\hat{\lambda}_{k}\hat{U}_{k}\hat{U}_{k}^{T}$ satisfies $\|\hat{U}_{k}-U_{k}\|_{2}=O_{p}(n^{-\Delta})$ for $1\leq k\leq m$, where $\Delta>0$ is some constant.

Assumption 4.5.

First, we define the Bellman optimality operator $\mathcal{T}$ as

$$\mathcal{T}f(x,z,a)=E\{Y_{0,t}+\gamma\max_{a^{\prime}\in\mathcal{A}}f(X_{0,t+1},Z_{0,t+1},a^{\prime})\mid A_{0,t}=a,X_{0,t}=x,Z_{0,t}=z\}.$$

Then we assume $\mathcal{T}f\in\mathcal{G}_{0}$ for any $f\in\mathcal{F}_{2}$.

Among the three assumptions, the exponentially decaying eigenvalue structure in Assumption 4.3 is commonly found in the literature on high-dimensional and functional data analysis (see e.g., Reiß and Wahl, 2020; Crambes and Mas, 2013; Jirak, 2016). This assumption controls the information loss caused by using only the first $\kappa$ principal component scores of $Z_{i,t}$. Assumption 4.4 concerns the consistency of the estimator $\hat{\mathbf{G}}$, and similar assumptions are imposed in the functional data analysis literature (see e.g., Laber and Staicu, 2018; Staicu et al., 2014). Using arguments similar to those in the proof of Theorem 5.2 of Zhang and Wang (2016), we can show that such an assumption holds in our setting as well. It is used to bound the error caused by the estimation of the covariance matrix. Assumption 4.5 is referred to as the completeness assumption in the literature (see e.g., Chen and Jiang, 2019; Uehara et al., 2022, 2021). This assumption is automatically satisfied when the transition kernel and the reward function satisfy certain smoothness conditions.

Our first theorem is concerned with the nonasymptotic convergence rate of $\tilde{Q}_{K}$ in Algorithm 1.

Theorem 4.6 (Convergence of Estimated Q-Function).

Let $\mu$ be some distribution on $\mathcal{S}$ such that $\frac{d\mu(s)}{ds}$ is bounded away from 0. Under Assumptions 4.3 to 4.5 and with sufficiently large $n$, there exists a sparse ReLU network structure for the function class $\mathcal{F}_{2}$ modeling $\tilde{Q}(s,a)$ such that $\tilde{Q}_{K}$ obtained from Algorithm 1 satisfies:

$$\frac{1}{|\mathcal{A}|}\sum_{a\in\mathcal{A}}\big\|Q^{\pi^{*}}(\cdot,a)-\tilde{Q}_{K}(\cdot,a)\big\|_{2,\mu}^{2}=O_{p}\Big(|\mathcal{A}|(n^{\alpha^{*}}\log^{\xi^{*}}{n}+d_{1}^{*}\kappa)n^{-1}\log^{\xi^{*}+1}{n}+L_{\mathcal{F}_{0}}(e^{-\zeta\kappa}-e^{-\zeta m}+n^{-2\Delta})+\frac{\gamma^{2K}}{(1-\gamma)^{2}}R_{\max}^{2}\Big),$$

where $\xi^{*}>1$ and $0<\alpha^{*}<1$ are some constants, $|\mathcal{A}|$ is the number of treatment options, and $d_{1}^{*}$ is the width of the first layer of the sparse ReLU network used in $\mathcal{F}_{2}$, satisfying the bound $m_{0}+m\leq d_{1}^{*}\leq n^{\alpha^{*}}$.

More details of the neural network structure and the sample size assumptions for Theorem 4.6 can be found in Section B of the Appendix. Theorem 4.6 provides an error bound on the estimated Q-function $\tilde{Q}_{K}$. Based on this theorem, we further establish the regret bound of the estimated policy $\pi_{K}$ obtained via Algorithm 1. Toward that end, we need another assumption:

Assumption 4.7.

Assume there exist $\eta>0$ and $\delta_{0}>0$ such that

$$P\Big(s:\max_{a}Q^{\pi^{*}}(s,a)-\max_{a\in\mathcal{A}\setminus\operatorname*{arg\,max}_{a^{\prime}}Q^{\pi^{*}}(s,a^{\prime})}Q^{\pi^{*}}(s,a)\leq\epsilon\Big)=O(\epsilon^{\eta})$$

for $0<\epsilon\leq\delta_{0}$.

The margin-type condition in Assumption 4.7 is commonly used in the literature. Specifically, in classification, margin conditions are imposed to bound the excess risk (Tsybakov, 2004; Audibert and Tsybakov, 2007). In dynamic treatment regimes, a similar assumption is introduced to prove the convergence of the state-value function in finite horizon settings (Qian and Murphy, 2011; Luedtke and van der Laan, 2016). In RL, such an assumption was introduced by Shi et al. (2022) to obtain a sharper regret bound for the estimated optimal policy.

Theorem 4.8 (Convergence of State-Value Function).

Under Assumptions 4.3 to 4.7 and the conditions on $\mu$, $\mathcal{F}_{2}$ and $n$ in Theorem 4.6, we have:

$$E_{\mu}[V^{\pi^{*}}(s)-V^{\pi_{K}}(s)]=O_{p}\Big(\frac{1}{1-\gamma}\Big\{|\mathcal{A}|(n^{\alpha^{*}}\log^{\xi^{*}}{n}+d_{1}^{*}\kappa)n^{-1}\log^{\xi^{*}+1}{n}+L_{\mathcal{F}_{0}}(e^{-\zeta\kappa}-e^{-\zeta m}+n^{-2\Delta})+\frac{\gamma^{2K}}{(1-\gamma)^{2}}R_{\max}^{2}\Big\}^{\frac{\eta+1}{\eta+2}}\Big).$$

The proofs of the two theorems are included in Section C of the Appendix. We summarize our theoretical findings below. First, we notice that the convergence rate of the regret in Theorem 4.8 is faster than the convergence rate of the estimated Q-function in Theorem 4.6. This is due to the margin-type Assumption 4.7, which enables us to obtain a sharper error bound. Similar results have been established in the literature; see e.g., Theorem 3.3 in Audibert and Tsybakov (2007), Theorem 3.1 in Qian and Murphy (2011), and Theorems 3 and 4 in Shi et al. (2022).

From Theorem 4.8, it can be seen that the regret bound is mainly determined by three parameters: the sample size $n$, the number of principal components $\kappa$ and the number of iterations $K$. Here, the first term $|\mathcal{A}|(n^{\alpha^{*}}\log^{\xi^{*}}{n}+d_{1}^{*}\kappa)n^{-1}\log^{\xi^{*}+1}{n}$ on the right-hand side corresponds to the estimation error, which decreases with $n$ and increases with $\kappa$. The second term $L_{\mathcal{F}_{0}}(e^{-\zeta\kappa}-e^{-\zeta m}+n^{-2\Delta})$ corresponds to the approximation error, which decreases with both $n$ and $\kappa$. The remaining term $\frac{\gamma^{2K}}{(1-\gamma)^{2}}R_{\max}^{2}$ is the optimization error, which decreases as the iteration number $K$ in Algorithm 1 grows.

Compared with existing results on the convergence rate of the deep fitted Q-iteration algorithm (see Theorem 4.4 of Fan et al., 2020), our theorems additionally characterize the dependence on the number of principal components. Specifically, selecting the first $\kappa$ principal components induces an information loss (e.g., bias) of the order $e^{-\zeta\kappa}-e^{-\zeta m}$, but reduces the model complexity caused by the high frequency variables from $d_{1}^{*}m$ to $d_{1}^{*}\kappa$, and hence the variance of the policy estimator. This represents a bias-variance trade-off. Since the bias decays at an exponential rate, reducing the model complexity can be more beneficial when the training dataset is small. Thus, our algorithm is expected to perform better than the original fitted Q-iteration algorithm in small samples, as shown in our numerical studies.

Finally, the number of principal components should diverge with $n$ to ensure the consistency of the proposed algorithm. Based on the two theorems, the optimal $\kappa^{*}$ that balances the bias-variance trade-off satisfies $\kappa^{*}\asymp\log(n)$ (details are given in Section D of the Appendix). Thus, as $n$ goes to infinity, we will eventually take $\kappa^{*}=m$ and Algorithm 1 becomes equivalent to the original neural fitted Q-iteration. This is only an asymptotic guideline for selecting the number of principal components; we provide some practical guidelines in the next section.

5 Empirical Studies

5.1 Practical Guidelines for Number of Principal Components

In functional data analysis, several criteria have been developed to select the number of principal components, including the percentage of variance explained, the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) (see e.g., Yao et al., 2005; Li et al., 2013). In our setting, it is difficult to apply AIC/BIC, since there does not exist a natural objective function (e.g., a likelihood) for Q-function estimation. One possibility is to extend the value information criterion (Shi et al., 2021a) developed in single-stage decision making to infinite horizons. Nonetheless, it remains unclear how to determine the penalty parameter for consistent tuning parameter selection.

Here, we select $\kappa$ based on the percentage of variance explained. That is, we look for the minimum value of $\kappa$ such that the total variance explained by PCA reaches a certain level (e.g., $95\%$). This method is also employed by Laber and Staicu (2018) in single-stage decision making; a minimal sketch of the rule is given below. To illustrate the empirical performance of this method, we apply Algorithm 1 with $\kappa\in\{2,6,10,14,\ldots,74\}$ and evaluate the expected return of these policies in the following numerical study.
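The selection rule just described can be written in a few lines. The sketch below assumes the eigenvalues of the sample covariance $\hat{\mathbf{G}}$ have already been computed as in Section 3; the function name and default threshold are ours.

```python
import numpy as np

def select_kappa(eigenvalues, threshold=0.95):
    """Smallest kappa whose leading eigenvalues explain at least `threshold`
    of the total variance of the concatenated high-frequency vector Z."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    explained = np.cumsum(lam) / np.sum(lam)      # cumulative proportion of variance
    return int(np.searchsorted(explained, threshold) + 1)
```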

The data generating process can be described as follows. We set the low frequency covariate $X_{i,t}$ to be a 2-dimensional vector and the high frequency variable $Z_{i,t}$ to be a 108-dimensional vector ($m=108$). Both are sampled from mean-zero normal distributions. The covariance matrix of $Z$ is set to satisfy Assumption 4.3. The action space is binary and the behavior policy used to generate actions in the training data is a uniform random policy. The reward function is set to $r(x,z,a)=x\beta_{1,a}+z\beta_{2,a}+c\max\{(x\beta_{1,a}+z\beta_{2,a}),0\}$ for some constant $c$ and coefficient vectors $\beta_{1,a},\beta_{2,a}$, so that it is a mixture of a linear function and a single-layer neural network with the ReLU activation function. The next state $(X_{i,t+1},Z_{i,t+1})$ is generated from a normal distribution whose mean is a linear function of the state $(X_{i,t},Z_{i,t})$ and action $A_{i,t}$. The number of trajectories $N$ is fixed to 6 and the length of each trajectory $T$ is set to 80.

The ReLU network is constructed with 3 hidden layers of widths $d_{1}=15$, $d_{2}=d_{3}=5$. Dropout layers with a $10\%$ dropout rate are added between layer 1, layer 2 and layer 3. During training, the dropout layers randomly set the outputs from the previous layer to 0 with probability $10\%$, which introduces sparsity to the neural network and reduces over-fitting (Srivastava et al., 2014). The hyperparameters of the neural network structure can be tuned via cross-validation. The discount factor $\gamma$ is fixed to 0.5.

To evaluate the policy performance, we use a Monte Carlo method to approximate the expected return under each estimated policy. Specifically, for each estimated policy $\pi$, we generate $N_{mc}=100$ trajectories, each of length $T_{mc}=20$ (in our setting with $\gamma=0.5$, the cumulative reward after $T_{mc}=20$ is negligible). The initial state distribution is the same as that in the training dataset. The actions are assigned according to $\pi$. The expected return is then approximated by the average of the empirical cumulative rewards over the 100 trajectories.
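This Monte Carlo approximation can be summarized by the short sketch below, where `policy`, `sample_initial_state` and `env_step` are placeholders of ours for the estimated policy and the simulator of the data generating process.

```python
import numpy as np

def monte_carlo_value(policy, sample_initial_state, env_step,
                      gamma=0.5, n_traj=100, horizon=20):
    """Approximate the expected return of `policy` by averaging discounted
    cumulative rewards over simulated trajectories.
    env_step(s, a) -> (next_state, reward) is the simulator of the environment."""
    returns = []
    for _ in range(n_traj):
        s, total, discount = sample_initial_state(), 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            s, r = env_step(s, a)
            total += discount * r
            discount *= gamma
        returns.append(total)
    return np.mean(returns), np.std(returns) / np.sqrt(n_traj)  # estimate and its s.e.
```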

For each $\kappa$ in the list $\{2,6,10,14,\ldots,74\}$, we apply Algorithm 1 to learn the optimal policy over 80 random seeds and evaluate the expected returns using the Monte Carlo method. We then take the sample average and standard error of these 80 expected returns to estimate the value of the policy and construct the margin of error. Figure 2 depicts the estimated values of these expected returns as well as their confidence intervals. It can be seen that increasing $\kappa$ from 2 to 6 leads to a significant improvement. However, further increasing $\kappa$ worsens the performance. This trend is consistent with our theory, since the bias term dominates the estimation error for small values of $\kappa$. When $\kappa$ increases, the bias decays exponentially fast and the model complexity term becomes the leading term. Meanwhile, the percentage of variance explained increases quickly when $\kappa\leq 6$ and remains stable afterwards. As such, it makes sense to use this criterion for selecting $\kappa$. In our implementation, we select the smallest $\kappa$ such that the variance explained is at least $95\%$.

5.2 Simulation Study

In the simulation study, we compare the proposed policy $\pi_{K}^{PCA}$ against two baseline policies obtained by directly using the original high frequency variables $Z_{i,t,j}$ (denoted by $\pi_{K}^{ALL}$) or their averages (denoted by $\pi_{K}^{AVE}$) as input. Both policies are computed in a similar manner based on the deep fitted Q-iteration algorithm. We additionally include one more baseline policy, denoted by $\pi_{K}^{BOTTLE}$, which adds a bottleneck layer after the input of the original $Z$ in the neural network architecture such that the width of this bottleneck layer equals the input dimension of the proposed policy $\pi_{K}^{PCA}$. This policy differs from the proposed policy in that it uses the bottleneck layer for dimension reduction instead of PCA. Both the data generating setting and the neural network structure are the same as in Section 5.1. However, in this simulation study we vary the sample size and the dimension of the high frequency variables. Specifically, we consider 9 cases of training size ($N=6$ with $T=60,75,90,105,120$, and $N=7,8,9,10$ with $T=120$) and 5 different dimensions of the high frequency part, $m=27,54,108,162,216$. Furthermore, we have two settings for generating the high frequency variables, which are discussed below. We similarly compare the proposed policy against $\pi_{K}^{ALL}$, $\pi_{K}^{BOTTLE}$, and $\pi_{K}^{AVE}$ and use the Monte Carlo method to evaluate their values.

In the first setting, we consider 5 cases with $J$ (the number of high frequency variables) equal to $1,2,4,6,8$, where each high frequency variable $Z_{j}$, $j\in\{1,2,\ldots,J\}$, is of dimension 27. In this setting, all $J$ high frequency variables are dependent and the eigenvalues of the concatenated high frequency variable $Z$ decay at an exponential rate. We find that the first 5 principal components explain over $95\%$ of the variance in all 5 cases, as shown in Figure 3. Therefore, we set the number of principal components to $\kappa=5$ and plot the results in Figure 4.

In the second setting, the $J$ high frequency variables are independent of each other. For each $j$, all the elements in $Z_{j}$ are dependent and the eigenvalues of $Z_{j}$ decay at the same exponential rate as the eigenvalues of the concatenated high frequency variable $Z$ in the first setting. Therefore, as $J$ increases, more principal components are needed to guarantee that the percentage of variance explained exceeds $95\%$. Specifically, for $J=1,2,4,6,8$ high frequency variables, the corresponding $\kappa$ is $5,10,20,30,40$, respectively. See Figure 3 for details. The expected returns of all estimated optimal policies are plotted in Figure 5.

From Figures 4 and 5, it can be seen that the proposed policy $\pi_{K}^{PCA}$ always achieves a larger value than the three baseline policies. Meanwhile, $\pi_{K}^{ALL}$ and $\pi_{K}^{BOTTLE}$ perform comparably. The values of both are significantly affected by the training size $n$. In addition, $\pi_{K}^{AVE}$ outperforms $\pi_{K}^{ALL}$ and $\pi_{K}^{BOTTLE}$ in small samples, but performs worse than the two policies when the sample size is large. In the second setting, $\pi_{K}^{ALL}$ and $\pi_{K}^{BOTTLE}$ tend to perform much better than $\pi_{K}^{AVE}$ when $J=4,6,8$, since averaging over several high frequency variables loses more information relevant to policy learning.

Finally, we conduct an additional simulation study with large training datasets where $N=200$ or $4000$ and $T=120$. This setting might be unrealistic in an mHealth dataset; it is included only to test the performance of $\pi_{K}^{ALL}$. As $\pi_{K}^{ALL}$ is consistent as well, we anticipate that the difference between the value under $\pi_{K}^{ALL}$ and that under the proposed policy will be negligible as the sample size grows to infinity. Figure 6 depicts the results. As expected, we observe no significant difference between $\pi^{PCA}$ and $\pi^{ALL}$ when $N\geq 200$.

5.3 Application to the OhioT1DM Dataset

We apply the proposed Algorithm 1 to the updated OhioT1DM dataset of Marling and Bunescu (2020). In this real data application, we again compare the behaviors of the four policies: $\pi_{K}^{PCA}$, $\pi_{K}^{ALL}$, $\pi_{K}^{BOTTLE}$ and $\pi_{K}^{AVE}$. The OhioT1DM dataset contains medical information on 12 type-I diabetes patients, including the CGM blood glucose levels of the patients, insulin doses applied during this period, self-reported information on meals and exercises, and other variables recorded by mobile phone apps and physiological sensors. The high frequency variables in the OhioT1DM dataset, such as the CGM blood glucose levels, are recorded every 5 minutes. The data on exercises and meals are collected with a much lower frequency, say every few hours. Moreover, although the basal insulin rate in this dataset is also collected every 5 minutes, it usually remains constant for several hours within a day. Thus, the basal insulin rate can also be regarded as a low-frequency scalar variable by taking its average. The time period between two decision points is set to 3 hours, as we only consider non-emergency situations where patients do not need to take a bolus injection promptly. In other studies using the OhioT1DM dataset, the treatment decision frequency is also set to be much lower than the recording frequency of the CGM blood glucose levels (see e.g., Shi et al., 2022; Zhou et al., 2021; Zhu et al., 2020).

For the low frequency covariate $X_{i,t}=(X_{i,t}^{(1)},X_{i,t}^{(2)})$, $X_{i,t}^{(1)}$ is constructed based on the $i$th patient's self-reported carbohydrate estimate for the meal during the past 3-hour interval $[t-1,t)$. The second scalar variable in $X_{i,t}$ is defined as the average basal rate of insulin dose during the past interval $[t-1,t)$. We consider one high-frequency element, $Z_{i,t}$, which contains the CGM blood glucose levels recorded every 5 minutes during the past 3 hours (its dimension is $m=36$). The action variable $A_{i,t}$ is set to 1 when the total amount of insulin delivered to the $i$-th patient is greater than one unit in the past interval. The response variable $Y_{i,t}$ is defined according to the Index of Glycemic Control (IGC, Rodbard, 2009), which is a non-linear function of the blood glucose levels in the following stage. A higher IGC value indicates that the blood glucose level of this patient stays close to or falls into the proper range of glucose levels.

In this study, $\kappa=5$ is selected when training $\pi_{K}^{PCA}$, as the proportion of variance explained by the first 5 principal components is over $99\%$, as shown in Figure 7. The ReLU network here has 2 hidden layers of width $d_{i}=6$, $i=1,2$ (dropout layers with a $10\%$ dropout rate are added between layer 1, layer 2 and the output layer). To estimate the value $V^{\pi}(x,z)$ of the four policies, we use the Fitted Q Evaluation algorithm proposed by Le et al. (2019). When applying the Fitted Q Evaluation algorithm, a random forest model is used to fit the estimated Q-function of the policy to be evaluated. Dividing the 12 patients into a training set of 9 patients and a testing set of 3 patients yields 220 repetitions with different patient combinations. In each repetition, the data of the 9 training patients are used to train the policy and fit the random forest for the Fitted Q Evaluation of this policy. The data of the other 3 patients are used to approximate the value of the policy using the estimated Q-function from Fitted Q Evaluation. The sample mean of the estimated values over all 220 repetitions is taken as our main result, and the standard errors are used to construct the margin of error. To compare the performance of our proposed policy $\pi_{K}^{PCA}$ against $\pi_{K}^{ALL}$, $\pi_{K}^{AVE}$, and $\pi_{K}^{BOTTLE}$, we present the differences between the estimated value of $\pi^{PCA}$ and those of the three other policies in Table 1, where the margin of error is the standard error of the mean difference multiplied by the critical value 1.96. The estimated values of the four policies are given in Table 2.
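For completeness, a simplified sketch of the Fitted Q Evaluation step with a random forest regressor (scikit-learn), in the spirit of Le et al. (2019), is given below. The interface, number of trees and number of iterations are illustrative assumptions of ours, not the exact settings used to produce Tables 1 and 2.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fitted_q_evaluation(S, A, Y, S_next, policy, gamma=0.5, n_iter=50):
    """Estimate Q^pi by iterating regressions of Y + gamma * Q(s', pi(s')) on (s, a)."""
    SA = np.hstack([S, A[:, None]])
    A_next = np.array([policy(s) for s in S_next])    # actions chosen by the target policy
    SA_next = np.hstack([S_next, A_next[:, None]])
    model = RandomForestRegressor(n_estimators=200).fit(SA, Y)
    for _ in range(n_iter):
        targets = Y + gamma * model.predict(SA_next)  # Bellman targets under pi
        model = RandomForestRegressor(n_estimators=200).fit(SA, targets)
    return model  # evaluating model on (s, pi(s)) approximates V^pi(s)
```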

Based on the results, the estimated value of $\pi_{K}^{PCA}$ is higher than those of all three baselines. The policy $\pi_{K}^{AVE}$, obtained by using the average of the CGM blood glucose levels, is commonly used in the literature (Zhou et al., 2021; Zhu et al., 2020). The weaker performance of $\pi_{K}^{AVE}$ is probably due to the information loss from simply replacing the CGM blood glucose levels with their average. On the other hand, the size of the training data is relatively small, as we did not use all the data recorded over the eight weeks due to large chunks of missing values; eventually, data from about 5 consecutive weeks are used for training the DTR policies. In such scenarios, using the original high frequency vector $Z_{i,t}$ significantly increases the complexity of the ReLU network structure, so that the number of parameters to be trained is too large compared to the size of the training data. Thus, $\pi_{K}^{ALL}$ and $\pi_{K}^{BOTTLE}$ cannot outperform $\pi_{K}^{PCA}$, whose input dimension is reduced by PCA. The results shown in Tables 1 and 2 agree with the findings in Section 5.2.

6 Discussions

In summary, we propose a deep spectral fitted Q-iteration algorithm to handle mixed frequency data in infinite horizon settings. The algorithm relies on PCA for dimension reduction and on deep neural networks to capture the non-linearity in the high-dimensional system. In theory, we establish a regret bound for the estimated optimal policy. Our theorem provides an asymptotic guideline for selecting the number of principal components. In empirical studies, we demonstrate the superiority of the proposed algorithm over baseline methods that either apply no dimension reduction or use ad-hoc summaries of the state. We further offer practical guidelines for selecting the number of principal components. The proposed method is built upon PCA and deep Q-learning. It is worthwhile to investigate the performance of other dimension reduction and RL methods. We leave this for future research.

References

  • Arulkumaran et al. (2017) Arulkumaran, K., Deisenroth, M. P., Brundage, M., and Bharath, A. A. “Deep reinforcement learning: A brief survey.” IEEE Signal Processing Magazine, 34(6):26–38 (2017).
  • Audibert and Tsybakov (2007) Audibert, J. and Tsybakov, A. “Fast learning rates for plug-in classifiers.” The Annals of statistics, 35(2):608–633 (2007).
  • Chen et al. (2022) Chen, E. Y., Song, R., and Jordan, M. I. “Reinforcement Learning with Heterogeneous Data: Estimation and Inference.” arXiv preprint arXiv:2202.00088 (2022).
  • Chen and Jiang (2019) Chen, J. and Jiang, N. “Information-theoretic considerations in batch reinforcement learning.” International Conference on Machine Learning, 1042–1051 (2019).
  • Ciarleglio et al. (2015) Ciarleglio, A., Petkova, E., Ogden, R. T., and Tarpey, T. “Treatment decisions based on scalar and functional baseline covariates.” Biometrics, 71(4):884–894 (2015).
  • Ciarleglio et al. (2018) Ciarleglio, A., Petkova, E., Ogden, T., and Tarpey, T. “Constructing treatment decision rules based on scalar and functional predictors when moderators of treatment effect are unknown.” Journal of the Royal Statistical Society. Series C, Applied statistics, 67(5):1331 (2018).
  • Ciarleglio et al. (2016) Ciarleglio, A., Petkova, E., Tarpey, T., and Ogden, R. T. “Flexible functional regression methods for estimating individualized treatment rules.” Stat (Int Stat Inst), 5(1):185–199 (2016).
  • Crambes and Mas (2013) Crambes, C. and Mas, A. “Asymptotics of prediction in functional linear regression with functional outputs.” Bernoulli, 2627–2651 (2013).
  • Curran et al. (2016) Curran, W., Brys, T., Aha, D., Taylor, M., and Smart, W. D. “Dimensionality reduced reinforcement learning for assistive robots.” In 2016 AAAI Fall Symposium Series (2016).
  • Curran et al. (2015) Curran, W., Brys, T., Taylor, M., and Smart, W. “Using PCA to efficiently represent state spaces.” arXiv preprint arXiv:1505.00322 (2015).
  • Dabney et al. (2018) Dabney, W., Rowland, M., Bellemare, M., and Munos, R. “Distributional reinforcement learning with quantile regression.” Thirty-Second AAAI Conference on Artificial Intelligence (2018).
  • Dickey and Fuller (1979) Dickey, D. A. and Fuller, W. A. “Distribution of the estimators for autoregressive time series with a unit root.” Journal of the American statistical association, 74(366a):427–431 (1979).
  • Ernst et al. (2005) Ernst, D., Geurts, P., and Wehenkel, L. “Tree-based batch mode reinforcement learning.” Journal of Machine Learning Research, 6:503–556 (2005).
  • Ertefaie et al. (2021) Ertefaie, A., McKay, J. R., Oslin, D., and Strawderman, R. L. “Robust Q-learning.” Journal of the American Statistical Association, 116(533):368–381 (2021).
  • Ertefaie and Strawderman (2018) Ertefaie, A. and Strawderman, R. L. “Constructing dynamic treatment regimes over indefinite time horizons.” Biometrika, 105(4):963–977 (2018).
  • Fan et al. (2020) Fan, J., Wang, Z., Xie, Y., and Yang, Z. “A theoretical analysis of deep Q-learning.” Learning for Dynamics and Control, 486–489 (2020).
  • Fang et al. (2022) Fang, E. X., Wang, Z., and Wang, L. “Fairness-Oriented Learning for Optimal Individualized Treatment Rules.” Journal of the American Statistical Association, 1–14 (2022).
  • Guan et al. (2020) Guan, Q., Reich, B. J., Laber, E. B., and Bandyopadhyay, D. “Bayesian nonparametric policy search with application to periodontal recall intervals.” Journal of the American Statistical Association, 115(531):1066–1078 (2020).
  • Guestrin et al. (2002) Guestrin, C., Patrascu, R., and Schuurmans, D. “Algorithm-directed exploration for model-based reinforcement learning in factored MDPs.” In ICML, 235–242. Citeseer (2002).
  • Jameson and Longo (2015) Jameson, J. L. and Longo, D. L. “Precision medicine—personalized, problematic, and promising.” Obstetrical & gynecological survey, 70(10):612–614 (2015).
  • Janner et al. (2019) Janner, M., Fu, J., Zhang, M., and Levine, S. “When to trust your model: Model-based policy optimization.” Advances in Neural Information Processing Systems, 32 (2019).
  • Jirak (2016) Jirak, M. “Optimal eigen expansions and uniform bounds.” Probability Theory and Related Fields, 166(3):753–799 (2016).
  • Kallus and Uehara (2022) Kallus, N. and Uehara, M. “Efficiently breaking the curse of horizon in off-policy evaluation with double reinforcement learning.” Operations Research (2022).
  • Kosorok and Laber (2019) Kosorok, M. R. and Laber, E. B. “Precision medicine.” Annual review of statistics and its application, 6:263–286 (2019).
  • Koutník et al. (2013) Koutník, J., Cuccu, G., Schmidhuber, J., and Gomez, F. “Evolving large-scale neural networks for vision-based reinforcement learning.” Proceedings of the 15th annual conference on Genetic and evolutionary computation, 1061–1068 (2013).
  • Kwiatkowski et al. (1992) Kwiatkowski, D., Phillips, P. C., Schmidt, P., and Shin, Y. “Testing the null hypothesis of stationarity against the alternative of a unit root: How sure are we that economic time series have a unit root?” Journal of econometrics, 54(1-3):159–178 (1992).
  • Laber and Staicu (2018) Laber, E. and Staicu, A. “Functional feature construction for individualized treatment regimes.” Journal of the American Statistical Association, 113(523):1219–1227 (2018).
  • Lai et al. (2020) Lai, H., Shen, J., Zhang, W., and Yu, Y. “Bidirectional model-based policy optimization.” In International Conference on Machine Learning, 5618–5627. PMLR (2020).
  • Lazaric et al. (2016) Lazaric, A., Ghavamzadeh, M., and Munos, R. “Analysis of classification-based policy iteration algorithms.” Journal of Machine Learning Research, 17:583–612 (2016).
  • Le et al. (2019) Le, H., Voloshin, C., and Yue, Y. “Batch policy learning under constraints.” International Conference on Machine Learning, 3703–3712 (2019).
  • Li et al. (2020) Li, G., Wei, Y., Chi, Y., Gu, Y., and Chen, Y. “Breaking the sample size barrier in model-based reinforcement learning with a generative model.” Advances in neural information processing systems, 33:12861–12872 (2020).
  • Li et al. (2022) Li, Y., Wang, C.-h., Cheng, G., and Sun, W. W. “Rate-Optimal Contextual Online Matching Bandit.” arXiv preprint arXiv:2205.03699 (2022).
  • Li et al. (2013) Li, Y., Wang, N., and Carroll, R. J. “Selecting the number of principal components in functional data.” Journal of the American Statistical Association, 108(504):1284–1294 (2013).
  • Liao et al. (2021) Liao, P., Klasnja, P., and Murphy, S. “Off-policy estimation of long-term average outcomes with applications to mobile health.” Journal of the American Statistical Association, 116(533):382–391 (2021).
  • Liao et al. (2020) Liao, P., Qi, Z., Klasnja, P., and Murphy, S. “Batch policy learning in average reward markov decision processes.” arXiv preprint arXiv:2007.11771 (2020).
  • Luckett et al. (2020) Luckett, D. J., Laber, E. B., Kahkoska, A. R., Maahs, D. M., Mayer-Davis, E., and Kosorok, M. R. “Estimating Dynamic Treatment Regimes in Mobile Health Using V-Learning.” Journal of the American Statistical Association, 115(530):692–706 (2020). PMID: 32952236.
    URL https://doi.org/10.1080/01621459.2018.1537919
  • Luedtke and van der Laan (2016) Luedtke, A. R. and van der Laan, M. J. “Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy.” The Annals of Statistics, 44(2):713 – 742 (2016).
    URL https://doi.org/10.1214/15-AOS1384
  • Luo et al. (2022) Luo, F.-M., Xu, T., Lai, H., Chen, X.-H., Zhang, W., and Yu, Y. “A survey on model-based reinforcement learning.” arXiv preprint arXiv:2206.09328 (2022).
  • Marling and Bunescu (2020) Marling, C. and Bunescu, R. “The OhioT1DM Dataset for Blood Glucose Level Prediction: Update 2020.” CEUR workshop proceedings, 2675:71–74 (2020).
  • McKeague and Qian (2014) McKeague, I. W. and Qian, M. “Estimation of treatment policies based on functional predictors.” Statistica Sinica, 24(3):1461 (2014).
  • Mnih et al. (2016) Mnih, V., Badia, A., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. “Asynchronous methods for deep reinforcement learning.” International conference on machine learning, 1928–1937 (2016).
  • Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., et al. “Human-level control through deep reinforcement learning.” Nature, 518(7540):529–533 (2015).
  • Mo et al. (2021) Mo, W., Qi, Z., and Liu, Y. “Learning optimal distributionally robust individualized treatment rules.” Journal of the American Statistical Association, 116(534):659–674 (2021).
  • Murao and Kitamura (1997) Murao, H. and Kitamura, S. “Q-learning with adaptive state segmentation (QLASS).” In Proceedings 1997 IEEE International Symposium on Computational Intelligence in Robotics and Automation CIRA’97.’Towards New Computational Principles for Robotics and Automation’, 179–184 (1997).
  • Murphy (2003) Murphy, S. A. “Optimal dynamic treatment regimes.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(2):331–355 (2003).
  • Nie et al. (2021) Nie, X., Brunskill, E., and Wager, S. “Learning when-to-treat policies.” Journal of the American Statistical Association, 116(533):392–409 (2021).
  • Parisi et al. (2017) Parisi, S., Ramstedt, S., and Peters, J. “Goal-driven dimensionality reduction for reinforcement learning.” In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 4634–4639. IEEE (2017).
  • Puterman (2014) Puterman, M. L. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons (2014).
  • Qi and Liu (2018) Qi, Z. and Liu, Y. “D-learning to estimate optimal individual treatment rules.” Electronic Journal of Statistics, 12(2):3601–3638 (2018).
  • Qian and Murphy (2011) Qian, M. and Murphy, S. A. “Performance guarantees for individualized treatment rules.” Annals of statistics, 39(2):1180 (2011).
  • Ramprasad et al. (2022) Ramprasad, P., Li, Y., Yang, Z., Wang, Z., Sun, W. W., and Cheng, G. “Online bootstrap inference for policy evaluation in reinforcement learning.” Journal of the American Statistical Association, 1–14 (2022).
  • Reiß and Wahl (2020) Reiß, M. and Wahl, M. “Nonasymptotic upper bounds for the reconstruction error of PCA.” The Annals of Statistics, 48(2):1098–1123 (2020).
  • Riedmiller (2005) Riedmiller, M. “Neural fitted Q iteration–first experiences with a data efficient neural reinforcement learning method.” European Conference on Machine Learning, 317–328 (2005).
  • Rodbard (2009) Rodbard, D. “Interpretation of continuous glucose monitoring data: glycemic variability and quality of glycemic control.” Diabetes technology & therapeutics, 11(S1):S–55 (2009).
  • Said and Dickey (1984) Said, S. E. and Dickey, D. A. “Testing for unit roots in autoregressive-moving average models of unknown order.” Biometrika, 71(3):599–607 (1984).
  • Schmidt-Hieber (2020) Schmidt-Hieber, J. “Nonparametric regression using deep neural networks with ReLU activation function.” The Annals of Statistics, 48(4):1875–1897 (2020).
  • Schulman et al. (2015a) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. “Trust region policy optimization.” International conference on machine learning, 1889–1897 (2015a).
  • Schulman et al. (2015b) Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. “High-dimensional continuous control using generalized advantage estimation.” arXiv preprint arXiv:1506.02438 (2015b).
  • Shi et al. (2021a) Shi, C., Song, R., and Lu, W. “Concordance and value information criteria for optimal treatment decision.” The Annals of Statistics, 49(1):49–75 (2021a).
  • Shi et al. (2018) Shi, C., Song, R., Lu, W., and Fu, B. “Maximin projection learning for optimal treatment decision with heterogeneous individualized treatment effects.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(4):681–702 (2018).
  • Shi et al. (2021b) Shi, C., Wan, R., Chernozhukov, V., and Song, R. “Deeply-debiased off-policy interval estimation.” In International Conference on Machine Learning, 9580–9591. PMLR (2021b).
  • Shi et al. (2020) Shi, C., Wan, R., Song, R., Lu, W., and Leng, L. “Does the Markov Decision Process Fit the Data: Testing for the Markov Property in Sequential Decision Making.” In International Conference on Machine Learning, 8807–8817. PMLR (2020).
  • Shi et al. (2021c) Shi, C., Wang, X., Luo, S., Zhu, H., Ye, J., and Song, R. “Dynamic causal effects evaluation in A/B testing with a reinforcement learning framework.” Journal of the American Statistical Association, accepted (2021c).
  • Shi et al. (2022) Shi, C., Zhang, S., Lu, W., and Song, R. “Statistical inference of the value function for reinforcement learning in infinite horizon settings.” Journal of the Royal Statistical Society: Series B, 84:765–793 (2022).
  • Song et al. (2015) Song, R., Wang, W., Zeng, D., and Kosorok, M. R. “Penalized q-learning for dynamic treatment regimens.” Statistica Sinica, 25(3):901 (2015).
  • Sprague (2007) Sprague, N. “Basis iteration for reward based dimensionality reduction.” 2007 IEEE 6th International Conference on Development and Learning, 187–192 (2007).
  • Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. “Dropout: a simple way to prevent neural networks from overfitting.” The journal of machine learning research, 15(1):1929–1958 (2014).
  • Staicu et al. (2014) Staicu, A.-M., Li, Y., Crainiceanu, C. M., and Ruppert, D. “Likelihood ratio tests for dependent data with applications to longitudinal and functional data analysis.” Scandinavian Journal of Statistics, 41(4):932–949 (2014).
  • Sutton and Barto (2018) Sutton, R. S. and Barto, A. G. Reinforcement learning: an introduction. MIT Press (2018).
  • Tsiatis et al. (2019) Tsiatis, A. A., Davidian, M., Holloway, S. T., and Laber, E. B. Dynamic treatment regimes: Statistical methods for precision medicine. Chapman and Hall/CRC (2019).
  • Tsybakov (2004) Tsybakov, A. “Optimal aggregation of classifiers in statistical learning.” The Annals of Statistics, 32(1):135–166 (2004).
  • Uehara et al. (2021) Uehara, M., Imaizumi, M., Jiang, N., Kallus, N., Sun, W., and Xie, T. “Finite sample analysis of minimax offline reinforcement learning: Completeness, fast rates and first-order efficiency.” arXiv preprint arXiv:2102.02981 (2021).
  • Uehara et al. (2022) Uehara, M., Kiyohara, H., Bennett, A., Chernozhukov, V., Jiang, N., Kallus, N., Shi, C., and Sun, W. “Future-dependent value-based off-policy evaluation in pomdps.” arXiv preprint arXiv:2207.13081 (2022).
  • Van Hasselt et al. (2016) Van Hasselt, H., Guez, A., and Silver, D. “Deep reinforcement learning with double q-learning.” Proceedings of the AAAI conference on artificial intelligence, 30(1) (2016).
  • Wang et al. (2017a) Wang, L., Laber, E. B., and Witkiewitz, K. “Sufficient Markov decision processes with alternating deep neural networks.” arXiv preprint, arXiv:1704.07531 (2017a).
  • Wang et al. (2017b) Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., and de Freitas, N. “Sample efficient actor-critic with experience replay.” ICLR (2017b).
  • Whiteson et al. (2007) Whiteson, S., Taylor, M. E., and Stone, P. “Adaptive Tile Coding for Value Function Approximation.” (2007).
  • Xu et al. (2020) Xu, Z., Laber, E., Staicu, A.-M., and Severus, E. “Latent-state models for precision medicine.” arXiv preprint arXiv:2005.13001 (2020).
  • Yao et al. (2005) Yao, F., Müller, H.-G., and Wang, J.-L. “Functional data analysis for sparse longitudinal data.” Journal of the American statistical association, 100(470):577–590 (2005).
  • Zhang et al. (2013) Zhang, B., Tsiatis, A. A., Laber, E. B., and Davidian, M. “Robust estimation of optimal dynamic treatment regimes for sequential treatment decisions.” Biometrika, 100(3):681–694 (2013).
  • Zhang and Wang (2016) Zhang, X. and Wang, J. “From sparse to dense functional data and beyond.” The Annals of Statistics, 44(5):2281–2321 (2016).
  • Zhang et al. (2018) Zhang, Y., Laber, E. B., Davidian, M., and Tsiatis, A. A. “Interpretable dynamic treatment regimes.” Journal of the American Statistical Association, 113(524):1541–1549 (2018).
  • Zhao et al. (2015) Zhao, Y.-Q., Zeng, D., Laber, E. B., and Kosorok, M. R. “New statistical learning methods for estimating optimal dynamic treatment regimes.” Journal of the American Statistical Association, 110(510):583–598 (2015).
  • Zhou et al. (2021) Zhou, W., Zhu, R., and Qu, A. “Estimating Optimal Infinite Horizon Dynamic Treatment Regimes via pT-Learning.” arXiv preprint arXiv:2110.10719 (2021).
  • Zhou et al. (2022) —. “Estimating Optimal Infinite Horizon Dynamic Treatment Regimes via pT-Learning.” Journal of the American Statistical Association, accepted (2022).
  • Zhu et al. (2020) Zhu, L., Lu, W., and Song, R. “Causal effect estimation and optimal dose suggestions in mobile health.” In International Conference on Machine Learning, 11588–11598. PMLR (2020).

Appendix A Supplementary Definitions

In this section, we introduce additional definitions that will be used in the rest of the supporting information.

Note that $\mathcal{F}_{0}$ is a function class that models the Q-function without using PCA. We define two more function classes based on PCA of the high-dimensional input $z$:

Definition A.1.
\displaystyle\mathcal{F}_{1}(L,\{d_{i}\}_{i=0}^{L+1},s,V_{max},G,\kappa)=\{f_{1}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R},\;f_{1}(x,z,a)=
\displaystyle f_{0}(x,(\sum_{k=1}^{\kappa}{U_{k}{U_{k}}^{T}})z,a)=f_{0}(x,z^{*},a),\;f_{0}\in\mathcal{F}_{0}(L,\{d_{i}\}_{i=0}^{L+1},s,V_{max})\},

where $\mathcal{F}_{0}$ is defined in Section 4 and $G=\sum_{k=1}^{m}{\lambda_{k}U_{k}U_{k}^{T}}$ is the covariance matrix of $z$. Here $z^{*}=(\sum_{k=1}^{\kappa}{U_{k}U_{k}^{T}})z$ is the vector recovered from the first $\kappa$ principal components of the original high-frequency vector $z$.

The function class $\mathcal{F}_{2}$ is already mentioned in Section 4. Here we define it formally:

\displaystyle\mathcal{F}_{2}(L,\{d_{i}\}_{i=0}^{L+1},s,V_{max},G,\kappa)=\{f_{2}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R},\;f_{2}(x,z,a)=
\displaystyle f_{0}(x,\begin{pmatrix}I_{\kappa}&0_{\kappa\times(m-\kappa)}\end{pmatrix}{\Sigma}^{-\frac{1}{2}}U^{T}z,a)=f_{0}(x,v_{\kappa},a),\;f_{0}\in\mathcal{F}_{0}(L,\{d_{i}\}_{i=0}^{L+1},s,V_{max})\},

where $v_{\kappa}$ is the vector containing the first $\kappa$ principal component scores of the original high-frequency vector $z$.

Note that $\mathcal{F}_{2}(L,\{d_{i}\}_{i=0}^{L+1},s,V_{max},G,\kappa)$ is the function class we use in the Spectrum Q-Iteration algorithm to model the Q-function. In practice, we use the estimated $\hat{G}$ and the corresponding PCA vector $\hat{v}_{\kappa}$ in $\mathcal{F}_{2}$. The relationship between the two classes $\mathcal{F}_{1}$ and $\mathcal{F}_{2}$ is used in the proofs of the main theorems (Proposition C.2 and Lemma C.3 in Section C of the Appendix).
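As a concrete illustration of these two representations, the following minimal Python sketch (using synthetic data and illustrative dimensions; not the implementation used in the paper) computes the PCA score vector $\hat{v}_{\kappa}$ fed to $\mathcal{F}_{2}$ and the rank-$\kappa$ reconstruction used by $\mathcal{F}_{1}$ from an estimated covariance $\hat{G}$.

import numpy as np

# Minimal sketch (not the authors' code): build the inputs of F_1 and F_2
# from an estimated covariance G_hat of the high-frequency vector z.
rng = np.random.default_rng(0)
n, m, kappa = 500, 12, 3                                 # illustrative sizes
Z = rng.normal(size=(n, m)) @ rng.normal(size=(m, m))    # synthetic high-frequency data

G_hat = np.cov(Z, rowvar=False)                          # estimated covariance matrix
lam, U_hat = np.linalg.eigh(G_hat)
order = np.argsort(lam)[::-1]                            # eigenpairs in decreasing order
lam, U_hat = lam[order], U_hat[:, order]

z = Z[0]                                                 # one observation of z
v_kappa = (U_hat.T @ z)[:kappa] / np.sqrt(lam[:kappa])   # first kappa PCA scores, input of f_0 in F_2
z_star = U_hat[:, :kappa] @ (U_hat[:, :kappa].T @ z)     # rank-kappa recovery of z, input of f_0 in F_1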

Next, we give the definition of the concentration coefficient:

Definition A.2.

Let $\sigma_{1}$ and $\sigma_{2}$ be two absolutely continuous distributions on $\mathcal{S}\times\mathcal{A}$. Let $\{\pi_{t}\}_{t\geq 1}=\{\pi_{1},\dots,\pi_{m}\}$ be a sequence of policies. For a state-action pair $(S_{0},A_{0})$ drawn from $\sigma_{1}$, denote the state at the $t$-th decision point by $S_{t}$, $t=1,2,\dots,m$, and suppose the action $A_{t}$ at the $t$-th decision point is taken according to $\pi_{t}$. Denote the marginal distribution of $(S_{m},A_{m})$ by $P^{\pi_{m}}P^{\pi_{m-1}}\cdots P^{\pi_{1}}\sigma_{1}$. The concentration coefficient is then defined as

\displaystyle\omega_{\infty}(m;\sigma_{1},\sigma_{2})=\sup_{\pi_{1},\dots,\pi_{m}}{\left\|\frac{dP^{\pi_{m}}P^{\pi_{m-1}}\cdots P^{\pi_{1}}\sigma_{1}(s,a)}{d\sigma_{2}(s,a)}\right\|_{\infty,\sigma_{2}}}. (2)

Similar definitions of concentration coefficients are common in the reinforcement learning literature (Fan et al., 2020; Lazaric et al., 2016).
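For intuition, the following toy sketch computes the finite-MDP analogue of this quantity for a single fixed policy applied three times; the definition above additionally takes a supremum over policy sequences and works with densities on a general state space, so all numbers and names here are purely illustrative.

import numpy as np

# Toy finite-MDP analogue of omega_inf(m; sigma_1, sigma_2): push sigma_1
# forward m times under a fixed policy and compare with sigma_2 pointwise.
rng = np.random.default_rng(1)
nS, nA, m = 3, 2, 3
P = rng.dirichlet(np.ones(nS), size=(nS, nA))     # P[s, a, s']: transition kernel
pi = rng.dirichlet(np.ones(nA), size=nS)          # pi[s, a]: a single stationary policy
sigma1 = np.full((nS, nA), 1.0 / (nS * nA))       # initial distribution on S x A
sigma2 = np.full((nS, nA), 1.0 / (nS * nA))       # reference distribution on S x A

mu = sigma1
for _ in range(m):                                # m applications of P^pi
    s_next = np.einsum('sa,sap->p', mu, P)        # marginal of the next state
    mu = s_next[:, None] * pi                     # joint law of (S_t, A_t)
print(np.max(mu / sigma2))                        # density-ratio bound, analogue of omega_inf(m)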

Appendix B Detailed Version of the First Theorem

The detailed version of Theorem 4.6 is: Let $\mu$ be a distribution on $\mathcal{S}\times\mathcal{A}$ such that $\mu(\mathcal{S}\times\mathcal{A})=\mu_{1}(\mathcal{S})\times\mu_{2}(\mathcal{A})$ with $\mu_{2}(a)>0$ for all $a\in\mathcal{A}$ and $\frac{d\mu_{1}(s)}{ds}$ bounded away from 0 (here we allow a more general measure $\mu_{2}$ on the action space $\mathcal{A}$). Define $\beta_{j}^{*}=\beta_{j}\times\prod_{l=j+1}^{q}{\min{(\beta_{l},1)}}$ and $\alpha^{*}=\max_{j\in[q]}{\frac{t_{j}}{2\beta_{j}^{*}+t_{j}}}$, where $\beta_{j},q,t_{j},p_{j}$ come from Definition 4.2 and the definition of $\mathcal{G}_{0}(\{p_{j},t_{j},\beta_{j},H_{j}\}_{j\in[q]})$ in Section 4.

Under Assumptions 4.3 to 4.5, and provided that $n$ is large enough that $\max\{\sum_{j}{(t_{j}+\beta_{j}+1)}^{3+t_{j}},\max_{j}{p_{j}}\}\leq C_{\xi}{(\log(n))}^{\xi}$ for some $\xi>0,C_{\xi}>0$, there exists a function class $\mathcal{F}_{2}(L^{*},\{d_{j}^{*}\}_{j=0}^{L+1},s^{*},V_{max},\hat{G},\kappa)$ modeling $Q(s,a)$ whose components are sparse ReLU networks with hyperparameters satisfying: number of layers $L^{*}=O(\log^{\xi^{*}}{n})$, first-layer width $d_{1}^{*}\leq n^{\alpha^{*}}$, output dimension $d^{*}_{L^{*}+1}=1$, remaining layer widths satisfying $d_{0}^{*}=m_{0}+m\leq\min_{j\in\{1,2,\dots,L\}}{d_{j}^{*}}\leq\max_{j\in\{1,2,\dots,L\}}{d_{j}^{*}}=O(n^{\xi^{*}})$, and sparsity of order $s^{*}\asymp d_{1}^{*}\kappa+n^{\alpha^{*}}\log^{\xi^{*}}(n)$ for some $\xi^{*}>1+2\xi$. In this case, the estimator $\tilde{Q}_{K}$ (modeled by $\mathcal{F}_{2}$) obtained from Algorithm 1 satisfies

\displaystyle\left\|Q^{\pi^{*}}-\tilde{Q}_{K}\right\|_{2,\mu}^{2}=O_{p}(|\mathcal{A}|(n^{\alpha^{*}}\log^{\xi^{*}}{n}+d_{1}^{*}\kappa)n^{-1}\log^{\xi^{*}+1}{n}
\displaystyle+L_{\mathcal{F}_{0}}(e^{-\zeta\kappa}-e^{-\zeta m}+n^{-2\Delta})+\frac{\gamma^{2K}}{(1-\gamma)^{2}}R_{max}^{2}),

where $|\mathcal{A}|$ is the number of treatment options.

Appendix C Proof of theorems

C.1 Proof Sketch of Theorem 4.6

Recall the detailed statement of Theorem 4.6 given in Appendix B above.

First, for brevity, we write

$\mathcal{F}_{1}(L^{*},\{\{d_{i}^{*}\}^{L+1}_{i=1},d_{0}=m_{0}+m\},s^{*}-d_{1}\kappa,V_{max},\hat{G},\kappa)$ as $\mathcal{F}_{1}$,

$\mathcal{F}_{2}(L^{*},\{d_{i}^{*}\}^{L+1}_{i=0},s^{*},V_{max},\hat{G},\kappa)$ as $\mathcal{F}_{2}$, and

$\mathcal{F}_{0}(L^{*},\{\{d_{i}^{*}\}^{L+1}_{i=1},d_{0}=m_{0}+m\},s^{*}-d_{1}\kappa,V_{max})$ as $\mathcal{F}_{0}$.

Recall that the transition density function $q(s^{\prime}|s,a)$ is assumed to satisfy $\sup_{s^{\prime},s,a}{q(s^{\prime}|s,a)}\leq c_{q}$ (as stated in Section 2.2). Let $q^{j}(s^{\prime}|s,a)$ denote the density of the state reached from $(s,a)$ after following the sequence of policies $\pi_{1},\pi_{2},\dots,\pi_{j-1}$. Then

\displaystyle q^{j}(s^{\prime}|s,a)=\int_{x\in\mathcal{S}}{\sum_{a_{x}\in\mathcal{A}}{q(s^{\prime}|x,a_{x})\pi_{j-1}(a_{x}|x)q^{j-1}(x|s,a)dx}}
\displaystyle\leq c_{q}\int_{x\in\mathcal{S}}{\sum_{a_{x}\in\mathcal{A}}{\pi_{j-1}(a_{x}|x)q^{j-1}(x|s,a)dx}}=c_{q}.

Consequently, the density of the joint distribution of $(s^{\prime},a^{\prime})$ after applying $\{\pi_{1},\dots,\pi_{m}\}$ satisfies

\displaystyle\int_{(s,a)\in\mathcal{S}\times\mathcal{A}}{\pi_{m}(a^{\prime}|s^{\prime})q^{m}(s^{\prime}|s,a)d\mu(s,a)}\leq c_{q}\int_{(s,a)\in\mathcal{S}\times\mathcal{A}}{d\mu(s,a)}=c_{q}.

Thus, $\frac{dP^{\pi_{m}}P^{\pi_{m-1}}\cdots P^{\pi_{1}}\mu(s,a)}{d\sigma(s,a)}$ is bounded as long as the sampling distribution $\sigma$ is bounded away from 0. That is, $\omega_{\infty}(m;\mu,\sigma)\leq c_{\mu,\sigma}$ for some constant $c_{\mu,\sigma}>0$ (recall that $\omega_{\infty}(m;\mu,\sigma)$ is given in Definition A.2).

Then it can be shown that

\displaystyle\frac{1}{(1-\gamma)^{2}}\sum_{m=0}^{\infty}{\gamma^{m-1}(m+1)[\omega_{\infty}(m;\mu,\sigma)]^{\frac{1}{2}}}\leq\frac{1}{(1-\gamma)^{2}}\sum_{m=0}^{\infty}{\gamma^{m-1}(m+1){c_{\mu,\sigma}}^{\frac{1}{2}}}=\phi_{\mu,\sigma},

for some constant $\phi_{\mu,\sigma}>0$.

We can now state a lemma on the approximation error of $\tilde{Q}_{K}$:

Lemma C.1.

For the estimated Q-function $\tilde{Q}_{K}$ obtained at iteration $K$ of Algorithm 1, we have

\displaystyle\left\|Q^{*}-\tilde{Q}_{K}\right\|_{2,\mu}^{2}\leq 2\gamma^{2}(1-\gamma)^{4}\epsilon_{max}^{2}\phi_{\mu,\sigma}^{2}+8\frac{\gamma^{2K}R_{max}^{2}}{(1-\gamma)^{2}}, (3)

where $\epsilon_{max}=\max_{k\in\{1,2,\dots,K\}}{\left\|T\tilde{Q}_{k-1}-\tilde{Q}_{k}\right\|_{\sigma}}$.

That is, the distance between the $K$-th iterate $\tilde{Q}_{K}$ of our algorithm and the optimal $Q^{*}$ is bounded by the largest approximation error $\left\|T\tilde{Q}_{k-1}-\tilde{Q}_{k}\right\|_{\sigma}^{2}$ over the $K$ iterations of the Spectrum Q-Iteration algorithm. Recall that $\tilde{Q}_{k-1},\tilde{Q}_{k}\in\mathcal{F}_{2}$.

To bound the term $\left\|T\tilde{Q}_{k-1}-\tilde{Q}_{k}\right\|_{\sigma}^{2}$ in each iteration, we apply Theorem 6.2 of Fan et al. (2020), which gives

\left\|T\tilde{Q}_{k-1}-\tilde{Q}_{k}\right\|_{\sigma}^{2}\leq(1+\epsilon)^{2}\omega(\mathcal{F}_{2})+C\frac{V_{max}^{2}}{n\epsilon}\log{N_{\delta,2}}+C^{\prime}V_{max}\delta

for all $\epsilon\in(0,1]$, where $\omega(\mathcal{F}_{2})=\sup_{g\in\mathcal{F}_{2}}{\inf_{f\in\mathcal{F}_{2}}{\left\|f-Tg\right\|_{\sigma}^{2}}}$. Here $N_{\delta,2}$ is the cardinality of the minimal $\delta$-covering set of the function class $\mathcal{F}_{2}$ with respect to the $\ell_{\infty}$ norm.

Taking $\epsilon=1$ and fixing $\delta=\frac{1}{n}$, we obtain

\left\|T\tilde{Q}_{k-1}-\tilde{Q}_{k}\right\|_{\sigma}^{2}\leq 4\omega(\mathcal{F}_{2})+C\frac{V_{max}^{2}}{n}\log{N_{\frac{1}{n},2}}+\frac{C^{\prime}V_{max}}{n}. (4)

It remains to bound $\omega(\mathcal{F}_{2})=\sup_{g\in\mathcal{F}_{2}}{\inf_{f\in\mathcal{F}_{2}}{\left\|f-Tg\right\|_{\sigma}^{2}}}$ and $\log{N_{\frac{1}{n},2}}$ separately.

The term $\sup_{g\in\mathcal{F}_{2}}{\inf_{f\in\mathcal{F}_{2}}{\left\|f-Tg\right\|_{\sigma}^{2}}}$ captures the error of approximating, by a function $f\in\mathcal{F}_{2}$, a function $g$ from the same class after applying the optimal Bellman operator. Note that in each iteration of the Q-iteration algorithm, we fit $\tilde{Q}_{k}$ from a ReLU network class to the responses $Y_{i}=R_{i}+\gamma\max_{a\in\mathcal{A}}{\tilde{Q}_{k-1}(S_{i}^{\prime},a)}$. Taking the expectation of $Y_{i}$, the true regression function underlying the responses belongs to the class $\{Tg:g\in\mathcal{F}_{2}\}$. As argued in Fan et al. (2020), the term $\sup_{g\in\mathcal{F}_{2}}{\inf_{f\in\mathcal{F}_{2}}{\left\|f-Tg\right\|_{\sigma}^{2}}}$ can therefore be viewed as the bias term when fitting $Y_{i}$ with $\tilde{Q}_{k}\in\mathcal{F}_{2}$.
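For concreteness, the following minimal sketch carries out one such regression step on synthetic data; the MLPRegressor is only a stand-in for the sparse ReLU class $\mathcal{F}_{2}$, and all names and sizes are illustrative rather than the paper's implementation.

import numpy as np
from sklearn.neural_network import MLPRegressor

# One fitted-Q iteration step: regress Y_i = R_i + gamma * max_a Q_{k-1}(S_i', a)
# on the (reduced) state, separately for each action.
rng = np.random.default_rng(2)
n, d_state, n_actions, gamma = 200, 5, 2, 0.9
S = rng.normal(size=(n, d_state))          # current states, e.g. (x_t, v_kappa)
A = rng.integers(n_actions, size=n)        # actions taken in the batch
R = rng.normal(size=n)                     # observed rewards
S_next = rng.normal(size=(n, d_state))     # next states

def q_prev(states, action):                # previous iterate Q_{k-1} (placeholder)
    return np.zeros(len(states))

Y = R + gamma * np.max(
    np.column_stack([q_prev(S_next, a) for a in range(n_actions)]), axis=1)

# Q_k: one ReLU-network regression per action (surrogate for the sparse ReLU class)
q_k = {a: MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=500).fit(S[A == a], Y[A == a])
       for a in range(n_actions)}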

We first bound this bias-related term, for which we need the following proposition:

Proposition C.2.

The function class $\mathcal{F}_{1}(L,\{\{d_{i}\}^{L+1}_{i=1},d_{0}=m_{0}+m\},s,V_{max},\hat{G},\kappa)$ is a subset of $\mathcal{F}_{2}(L,\{\{d_{i}\}^{L+1}_{i=1},d_{0}=\kappa+m_{0}\},s+d_{1}\kappa,V_{max},\hat{G},\kappa)$.

This proposition implies $\mathcal{F}_{1}\subset\mathcal{F}_{2}$. We can then state the following lemma, which relies on Proposition C.2:

Lemma C.3.

Under the same assumptions as Theorem 4.6,

\sup_{g\in\mathcal{F}_{2}}{\inf_{f\in\mathcal{F}_{2}}{\left\|f-Tg\right\|_{\sigma}^{2}}}\leq 2\sup_{f^{\prime}\in\mathcal{G}_{0}}{\inf_{f_{0}\in\mathcal{F}_{0}}{\left\|f_{0}-f^{\prime}\right\|_{\sigma}^{2}}}+2L_{\mathcal{F}_{0}}{\left\|Z-Z^{*}\right\|_{\sigma}^{2}},

where $L_{\mathcal{F}_{0}}=\sup_{f\in\mathcal{F}_{0}}{\sup_{x\neq y}{\frac{{|f(y)-f(x)|}^{2}}{\left\|y-x\right\|^{2}}}}$, $Z$ is the high-frequency vector drawn from the distribution $\sigma$, and $Z^{*}=(\sum_{k=1}^{\kappa}{\hat{U}_{k}{\hat{U}_{k}}^{T}})Z$ is the vector recovered from the first $\kappa$ principal components of $Z$.

By Lemma C.3, the term $\sup_{g\in\mathcal{F}_{2}}{\inf_{f\in\mathcal{F}_{2}}{\left\|f-Tg\right\|_{\sigma}^{2}}}$ is bounded in terms of $\sup_{f^{\prime}\in\mathcal{G}_{0}}{\inf_{f_{0}\in\mathcal{F}_{0}}{\left\|f_{0}-f^{\prime}\right\|_{\sigma}^{2}}}$ and ${\left\|Z-Z^{*}\right\|_{\sigma}^{2}}$.

Recall that $\mathcal{F}_{0}$ is shorthand for

$\mathcal{F}_{0}(L^{*},\{\{d_{i}^{*}\}^{L+1}_{i=1},d_{0}=m_{0}+m\},s^{*}-d_{1}\kappa,V_{max})$,

as specified at the beginning of this section. The sparsity of $\mathcal{F}_{0}$ satisfies $s^{*}-d_{1}\kappa\asymp n^{\alpha^{*}}\log^{\xi^{*}}(n)$ and its architecture satisfies $d_{1}^{*}\leq n$ and $m_{0}+m\leq\min_{j\in\{1,2,\dots,L\}}{d_{j}^{*}}\leq\max_{j\in\{1,2,\dots,L\}}{d_{j}^{*}}=O(n^{\xi^{*}})$. Thus, we can use equation (4.18) of Fan et al. (2020) to bound the term $\sup_{f^{\prime}\in\mathcal{G}_{0}}{\inf_{f_{0}\in\mathcal{F}_{0}}{\left\|f_{0}-f^{\prime}\right\|_{\sigma}^{2}}}$:

\sup_{f^{\prime}\in\mathcal{G}_{0}}{\inf_{f_{0}\in\mathcal{F}_{0}}{\left\|f_{0}-f^{\prime}\right\|_{\sigma}^{2}}}\leq n^{\alpha^{*}-1},

where the conditions on the neural networks in $\mathcal{F}_{0}$ match the conditions required for equation (4.18) of Fan et al. (2020).

We now bound the term $\left\|Z-Z^{*}\right\|_{\sigma}^{2}$ via the next lemma.

Lemma C.4.

For the high-frequency vector $Z$, assume $E(Z)=0$ and that the eigenvalues of $Cov(Z)$ decay exponentially, $\lambda_{k}=O(e^{-\zeta k})$. Assume further that the estimate of $Cov(Z)$ satisfies $\left\|\hat{U}-U\right\|=O_{p}(n^{-\Delta})$. Then we have

\left\|Z-Z^{*}\right\|_{\sigma}^{2}=O_{p}(e^{-\zeta\kappa}-e^{-\zeta m}+n^{-2\Delta}).

Combining these two lemmas, we obtain

\sup_{g\in\mathcal{F}_{2}}{\inf_{f\in\mathcal{F}_{2}}{\left\|f-Tg\right\|_{\sigma}^{2}}}=O_{p}(n^{\alpha^{*}-1}+L_{\mathcal{F}_{0}}{(e^{-\zeta\kappa}-e^{-\zeta m}+n^{-2\Delta})}).
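As a quick numerical sanity check of the rate in Lemma C.4 (with illustrative values of $\zeta$, $m$ and $\kappa$; this is not part of the proof): when the eigenvectors are known exactly, the expected reconstruction error equals the discarded eigenvalue sum $\sum_{k>\kappa}\lambda_{k}$, which for $\lambda_{k}\asymp e^{-\zeta k}$ is of the same order as $e^{-\zeta\kappa}-e^{-\zeta m}$.

import numpy as np

# Tail eigenvalue sum versus the rate e^{-zeta*kappa} - e^{-zeta*m} (illustrative).
zeta, m = 0.8, 20
lam = np.exp(-zeta * np.arange(1, m + 1))          # exponentially decaying eigenvalues
for kappa in (2, 5, 10):
    tail = lam[kappa:].sum()                        # reconstruction error with exact eigenvectors
    rate = np.exp(-zeta * kappa) - np.exp(-zeta * m)
    print(kappa, tail, rate, tail / rate)           # the ratio stays bounded in kappa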

Next, with $\delta=\frac{1}{n}$, a bound for $\log{N_{\delta,2}}$ in equation (4) is required. This is given by the following proposition:

Proposition C.5.

When $\delta=\frac{1}{n}$, $\log{N_{\delta,2}}\leq C_{1}|\mathcal{A}|(n^{\alpha^{*}}(\log{n})^{\xi^{*}}+d_{1}^{*}\kappa)(\log{n})^{1+\xi^{*}}$ for some constant $C_{1}$.

Then from equation (4), we have

\left\|T\tilde{Q}_{k-1}-\tilde{Q}_{k}\right\|_{\sigma}^{2}=O_{p}(|\mathcal{A}|n^{-1}(n^{\alpha^{*}}{(\log{n})}^{\xi^{*}}+d_{1}^{*}\kappa){(\log{n})}^{\xi^{*}+1}+L_{\mathcal{F}_{0}}{(e^{-\zeta\kappa}-e^{-\zeta m}+n^{-2\Delta})}). (5)

Combining equations (5) and (3) completes the proof of Theorem 4.6.

C.2 Proof of Theorem 4.8

Theorem 4.8: Under Assumptions 4.3 to 4.7 and the conditions on $\mu,\mathcal{F}_{2},n$ in Theorem 4.6, we have

\displaystyle E_{\mu_{1}}[V^{\pi^{*}}(s)-V^{\pi_{K}}(s)]=O_{p}(\frac{1}{1-\gamma}\{|\mathcal{A}|(n^{\alpha^{*}}{\log^{\xi^{*}}{n}}+d_{1}^{*}\kappa)n^{-1}{\log^{\xi^{*}+1}{n}}
\displaystyle+L_{\mathcal{F}_{0}}(e^{-\zeta\kappa}-e^{-\zeta m}+n^{-2\Delta})+\frac{\gamma^{2K}}{(1-\gamma)^{2}}R_{max}^{2}\}^{\frac{\eta+1}{\eta+2}}).
Proof.

The policy $\pi_{K}$ obtained in Algorithm 1 is the greedy policy with respect to $\tilde{Q}_{K}$.

First, a proposition from Shi et al. (2022) can be applied:

Proposition C.6 (Inequalities E.10 and E.11 in Shi et al. (2022)).

\displaystyle E_{\mu_{1}}[V^{\pi^{*}}(s)-V^{\pi_{K}}(s)]\leq\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}d\mu_{1}(s)}} (6)
\displaystyle+\frac{c\gamma}{1-\gamma}\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}ds}}.

It therefore suffices to bound the two terms $\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}d\mu_{1}(s)}}$ and $\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}ds}}$. We first show how to bound the term

\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}d\mu_{1}(s)}}.

Define $\hat{a}_{\pi^{K}}(s)=\operatorname*{arg\,max}_{a\in\mathcal{A}}{\tilde{Q}_{K}(s,a)}$. For $\epsilon>0$, let $A_{*}=\{0<\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})\leq\epsilon\}$, so that its complement is $A_{*}^{C}=\{\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})>\epsilon\}\cup\{\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})=0\}$. Then

s𝒮a𝒜Qπ(s,a){π(a|s)πK(a|s)}dμ1(s)=\displaystyle\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}d\mu_{1}(s)}}= (7)
s𝒮a𝒜Qπ(s,a){π(a|s)πK(a|s)}𝟙(sA)dμ1(s)+\displaystyle\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}\mathbbm{1}(s\in A_{*})d\mu_{1}(s)}}+
s𝒮a𝒜Qπ(s,a){π(a|s)πK(a|s)}𝟙(sAC)dμ1(s).\displaystyle\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}\mathbbm{1}(s\in A_{*}^{C})d\mu_{1}(s)}}.

The first step is to bound the first part of equation (7),

s𝒮a𝒜Qπ(s,a){π(a|s)πK(a|s)}𝟙(sA)dμ1(s).\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}\mathbbm{1}(s\in A_{*})d\mu_{1}(s)}}.

It is known that

maxa𝒜Qπ(s,a)=a𝒜Qπ(s,a)π(a|s)\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}=\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\pi^{*}(a|s)}

and

Qπ(s,a^πK)=a𝒜Qπ(s,a)πK(a|s),Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})=\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\pi^{K}(a|s)},

as both $\pi^{*}$ and $\pi^{K}$ are greedy policies ($\pi^{*}$ can be viewed as the greedy policy with respect to $Q^{\pi^{*}}$). Then $\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}}=\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})\leq\epsilon$ when $s\in A_{*}$. That is,

s𝒮a𝒜Qπ(s,a){π(a|s)πK(a|s)}𝟙(sA)dμ1(s)ϵs𝒮𝟙(sA)𝑑μ1(s).\displaystyle\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}}\mathbbm{1}(s\in A_{*})d\mu_{1}(s)}\leq\epsilon\int_{s\in\mathcal{S}}{\mathbbm{1}(s\in A_{*})d\mu_{1}(s)}. (8)

Also, when $s\in A_{*}$, the action $\hat{a}_{\pi^{K}}$ can never belong to $\operatorname*{arg\,max}_{a^{\prime}\in\mathcal{A}}{Q^{\pi^{*}}(s,a^{\prime})}$, since $\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})>0$ always holds on $A_{*}$. Thus, $\max_{a\in{\mathcal{A}-\operatorname*{arg\,max}_{a^{\prime}}{Q^{\pi^{*}}(s,a^{\prime})}}}{Q^{\pi^{*}}(s,a)}\geq Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})$. Hence, $\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})\leq\epsilon$ implies $\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-\max_{a\in{\mathcal{A}-\operatorname*{arg\,max}_{a^{\prime}}{Q^{\pi^{*}}(s,a^{\prime})}}}{Q^{\pi^{*}}(s,a)}\leq\epsilon$.

Therefore, it can be obtained that

s𝒮𝟙(sA)dμ1(s)s𝒮𝟙{s:maxa𝒜Qπ(s,a)\displaystyle\int_{s\in\mathcal{S}}{\mathbbm{1}(s\in A_{*})d\mu_{1}(s)}\leq\int_{s\in\mathcal{S}}\mathbbm{1}\{s:\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}
maxa𝒜argmaxaQπ(x,a)Qπ(s,a)ϵ}dμ1(s)=O(ϵη).\displaystyle-\max_{a\in{\mathcal{A}-\operatorname*{arg\,max}_{a^{\prime}}{Q^{\pi^{*}}(x,a^{\prime})}}}{Q^{\pi^{*}}(s,a)}\leq\epsilon\}d\mu_{1}(s)=O(\epsilon^{\eta}).

by assumption 4.7. Together with equation (8), we know that

s𝒮a𝒜Qπ(s,a){π(a|s)πK(a|s)}𝟙(sA)dμ1(s)=O(ϵη+1).\displaystyle\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}}\mathbbm{1}(s\in A_{*})d\mu_{1}(s)}=O(\epsilon^{\eta+1}). (9)

Then we can bound the second part of equation (7),

s𝒮a𝒜Qπ(s,a){π(a|s)πK(a|s)}𝟙(sAC)dμ1(s).\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}\mathbbm{1}(s\in A_{*}^{C})d\mu_{1}(s)}}.

Recall the definition of AC={maxa𝒜Qπ(s,a)Qπ(s,a^πK)>ϵ}{maxa𝒜Qπ(s,a)Qπ(s,a^πK)=0}A_{*}^{C}=\{\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})>\epsilon\}\cup\{\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})=0\}. We know {maxa𝒜Qπ(s,a)Qπ(s,a^πK)=0}\{\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})=0\} is a trivial case. It is easy to show that

s𝒮a𝒜Qπ(s,a){π(a|s)πK(a|s)}𝟙{s:maxa𝒜Qπ(s,a)Qπ(s,a^πK)=0}dμ1(s)=0,\displaystyle\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}\mathbbm{1}\{s:\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})=0\}d\mu_{1}(s)}}=0,

as π(a|s)=πK(a|s)\pi^{*}(a|s)=\pi^{K}(a|s) when maxa𝒜Qπ(s,a)Qπ(s,a^πK)=0\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})=0 always holds.

Therefore,

s𝒮a𝒜Qπ(s,a){π(a|s)πK(a|s)}𝟙(sAC)dμ1(s)=\displaystyle\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}\mathbbm{1}(s\in A_{*}^{C})d\mu_{1}(s)}}=
s𝒮a𝒜Qπ(s,a){π(a|s)πK(a|s)}𝟙{s:maxa𝒜Qπ(s,a)Qπ(s,a^πK)>ϵ}dμ1(s).\displaystyle\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}\mathbbm{1}\{s:\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})>\epsilon\}d\mu_{1}(s)}}.

We can show that when maxa𝒜Qπ(s,a)Qπ(s,a^πK)>ϵ\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})>\epsilon holds,

maxa𝒜Qπ(s,a)Qπ(s,a^πK)2maxa𝒜|Qπ(s,a)Q~K(s,a)|.\displaystyle\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})\leq 2\max_{a\in\mathcal{A}}{|Q^{\pi^{*}}(s,a)-\tilde{Q}_{K}(s,a)|}.

To show this, note that maxa𝒜Qπ(s,a)Qπ(s,a^πK)=maxa𝒜Qπ(s,a)Q~K(s,a^πK)+Q~K(s,a^πK)Qπ(s,a^πK)\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})=\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-\tilde{Q}_{K}(s,\hat{a}_{\pi^{K}})+\tilde{Q}_{K}(s,\hat{a}_{\pi^{K}})-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}}). It is trivial to show

Q~K(s,a^πK)Qπ(s,a^πK)maxa𝒜|Qπ(s,a)Q~K(s,a)|.\tilde{Q}_{K}(s,\hat{a}_{\pi^{K}})-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}})\leq\max_{a\in\mathcal{A}}{|Q^{\pi^{*}}(s,a)-\tilde{Q}_{K}(s,a)|}.

Furthermore, we can get maxa𝒜Qπ(s,a)Q~K(s,a^πK)=maxa𝒜Qπ(s,a)Q~K(s,aargmaxa𝒜Q(s,a))+Q~K(s,aargmaxa𝒜Q(s,a))Q~K(s,a^πK)\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-\tilde{Q}_{K}(s,\hat{a}_{\pi^{K}})=\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-\tilde{Q}_{K}(s,a\in{\operatorname*{arg\,max}_{a\in\mathcal{A}}{{Q^{*}(s,a)}}})+\tilde{Q}_{K}(s,a\in{\operatorname*{arg\,max}_{a\in\mathcal{A}}{{Q^{*}(s,a)}}})-\tilde{Q}_{K}(s,\hat{a}_{\pi^{K}}). Then we know that maxa𝒜Qπ(s,a)Q~K(s,aargmaxa𝒜Q(s,a))maxa𝒜|Qπ(s,a)Q~K(s,a)|\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-\tilde{Q}_{K}(s,a\in{\operatorname*{arg\,max}_{a\in\mathcal{A}}{{Q^{*}(s,a)}}})\leq\max_{a\in\mathcal{A}}{|Q^{\pi^{*}}(s,a)-\tilde{Q}_{K}(s,a)|} and Q~K(s,aargmaxa𝒜Q(s,a))Q~K(s,a^πK)0\tilde{Q}_{K}(s,a\in{\operatorname*{arg\,max}_{a\in\mathcal{A}}{{Q^{*}(s,a)}}})-\tilde{Q}_{K}(s,\hat{a}_{\pi^{K}})\leq 0, which implies that maxa𝒜Qπ(s,a)Q~K(s,a^πK)maxa𝒜|Qπ(s,a)Q~K(s,a)|\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-\tilde{Q}_{K}(s,\hat{a}_{\pi^{K}})\leq\max_{a\in\mathcal{A}}{|Q^{\pi^{*}}(s,a)-\tilde{Q}_{K}(s,a)|}.

To bound maxa𝒜|Qπ(s,a)Q~K(s,a)|\max_{a\in\mathcal{A}}{|Q^{\pi^{*}}(s,a)-\tilde{Q}_{K}(s,a)|}, we know that

maxa𝒜|Qπ(s,a)Q~K(s,a)|2a𝒜|Qπ(s,a)Q~K(s,a)|2\displaystyle\max_{a\in\mathcal{A}}{{|Q^{\pi^{*}}(s,a)-\tilde{Q}_{K}(s,a)|}^{2}}\leq\sum_{a\in\mathcal{A}}{{|Q^{\pi^{*}}(s,a)-\tilde{Q}_{K}(s,a)|}^{2}}\leq
1mina𝒜μ2(a)a𝒜|Qπ(s,a)Q~K(s,a)|2μ2(a).\displaystyle\frac{1}{\min_{a\in\mathcal{A}}{\mu_{2}(a)}}\sum_{a\in\mathcal{A}}{{|Q^{\pi^{*}}(s,a)-\tilde{Q}_{K}(s,a)|}^{2}\mu_{2}(a)}.

Then, it can be obtained that

s𝒮a𝒜Qπ(s,a){π(a|s)πK(a|s)}𝟙(sAC)dμ1(s)=\displaystyle\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}}\mathbbm{1}(s\in A_{*}^{C})d\mu_{1}(s)}= (10)
s𝒮(maxa𝒜Qπ(s,a)Qπ(s,a^πK))𝟙(sAC)𝑑μ1(s)\displaystyle\int_{s\in\mathcal{S}}{(\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}}))\mathbbm{1}(s\in A_{*}^{C})d\mu_{1}(s)}\leq
s𝒮(2maxa𝒜|Qπ(s,a)Q~K(s,a)|)2(maxa𝒜Qπ(s,a)Qπ(s,a^πK))𝟙(sAC)𝑑μ1(s)\displaystyle\int_{s\in\mathcal{S}}{\frac{{(2\max_{a\in\mathcal{A}}{|Q^{\pi^{*}}(s,a)-\tilde{Q}_{K}(s,a)|})}^{2}}{(\max_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)}-Q^{\pi^{*}}(s,\hat{a}_{\pi^{K}}))}\mathbbm{1}(s\in A_{*}^{C})d\mu_{1}(s)}\leq
4ϵmina𝒜μ2(a)s𝒮a𝒜|Qπ(s,a)Q~K(s,a)|2μ2(a)𝟙(sAC)dμ1(s)\displaystyle\frac{4}{\epsilon\min_{a\in\mathcal{A}}{\mu_{2}(a)}}\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{{|Q^{\pi^{*}}(s,a)-\tilde{Q}_{K}(s,a)|}^{2}\mu_{2}(a)}\mathbbm{1}(s\in A_{*}^{C})d\mu_{1}(s)}\leq
4ϵmina𝒜μ2(a)s𝒮a𝒜|Qπ(s,a)Q~K(s,a)|2μ2(a)dμ1(s)=\displaystyle\frac{4}{\epsilon\min_{a\in\mathcal{A}}{\mu_{2}(a)}}\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{{|Q^{\pi^{*}}(s,a)-\tilde{Q}_{K}(s,a)|}^{2}\mu_{2}(a)}d\mu_{1}(s)}=
4ϵmina𝒜μ2(a)Qπ(s,a)Q~K(s,a)μ,22.\displaystyle\frac{4}{\epsilon\min_{a\in\mathcal{A}}{\mu_{2}(a)}}\left\|Q^{\pi^{*}}(s,a)-\tilde{Q}_{K}(s,a)\right\|_{\mu,2}^{2}.

From Theorem 4.6, we already have

QπQ~K2,μ2=Op(γ2(1γ)4×[|𝒜|n1(nα(logn)ξ+d1κ)(logn)ξ+1\displaystyle\left\|Q^{\pi^{*}}-\tilde{Q}_{K}\right\|_{2,\mu}^{2}=O_{p}(\gamma^{2}(1-\gamma)^{4}\times[|\mathcal{A}|n^{-1}(n^{\alpha^{*}}{(\log{n})}^{\xi^{*}}+d_{1}^{*}{\kappa}){(\log{n})}^{\xi^{*}+1} (11)
+L0(eζκeζm+n2Δ)]+γ2K(1γ)2Rmax2).\displaystyle+L_{\mathcal{F}_{0}}{(e^{-\zeta\kappa}-e^{-\zeta m}+n^{-2\Delta})}]+\frac{\gamma^{2K}}{(1-\gamma)^{2}}R_{max}^{2}).

Then plug in equation (11) to equation (10) and it leads to

s𝒮a𝒜Qπ(s,a){π(a|s)πK(a|s)}𝟙(sAC)dμ1(s)\displaystyle\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}}\mathbbm{1}(s\in A_{*}^{C})d\mu_{1}(s)} (12)
=Op(1ϵmina𝒜μ2(a)γ2(1γ)4×[|𝒜|n1(nα(logn)ξ\displaystyle=O_{p}(\frac{1}{\epsilon\min_{a\in\mathcal{A}}{\mu_{2}(a)}}\gamma^{2}(1-\gamma)^{4}\times[|\mathcal{A}|n^{-1}(n^{\alpha^{*}}{(\log{n})}^{\xi^{*}}
+d1κ)(logn)ξ+1+L0(eζκeζm+n2Δ)]+γ2K(1γ)2Rmax2).\displaystyle+d_{1}^{*}{\kappa}){(\log{n})}^{\xi^{*}+1}+L_{\mathcal{F}_{0}}{(e^{-\zeta\kappa}-e^{-\zeta m}+n^{-2\Delta})}]+\frac{\gamma^{2K}}{(1-\gamma)^{2}}R_{max}^{2}).

Then from equation (7), equation (9) and equation (12), we can get

s𝒮a𝒜Qπ(s,a){π(a|s)πK(a|s)}dμ1(s)\displaystyle\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}d\mu_{1}(s)}}
=Op(ϵη+1+1ϵmina𝒜μ2(a)γ2(1γ)4×[|𝒜|n1(nα(logn)ξ\displaystyle=O_{p}(\epsilon^{\eta+1}+\frac{1}{\epsilon\min_{a\in\mathcal{A}}{\mu_{2}(a)}}\gamma^{2}(1-\gamma)^{4}\times[|\mathcal{A}|n^{-1}(n^{\alpha^{*}}{(\log{n})}^{\xi^{*}}
+d1κ)(logn)ξ+1+L0(eζκeζm+n2Δ)]+γ2K(1γ)2Rmax2).\displaystyle+d_{1}^{*}{\kappa}){(\log{n})}^{\xi^{*}+1}+L_{\mathcal{F}_{0}}{(e^{-\zeta\kappa}-e^{-\zeta m}+n^{-2\Delta})}]+\frac{\gamma^{2K}}{(1-\gamma)^{2}}R_{max}^{2}).

If we take

ϵη+1=1ϵmina𝒜μ2(a)γ2(1γ)4×[|𝒜|n1(nα(logn)ξ+d1κ)(logn)ξ+1+\displaystyle\epsilon^{\eta+1}=\frac{1}{\epsilon\min_{a\in\mathcal{A}}{\mu_{2}(a)}}\gamma^{2}(1-\gamma)^{4}\times[|\mathcal{A}|n^{-1}(n^{\alpha^{*}}{(\log{n})}^{\xi^{*}}+d_{1}^{*}{\kappa}){(\log{n})}^{\xi^{*}+1}+
L0(eζκeζm+n2Δ)]+γ2K(1γ)2Rmax2,\displaystyle L_{\mathcal{F}_{0}}{(e^{-\zeta\kappa}-e^{-\zeta m}+n^{-2\Delta})}]+\frac{\gamma^{2K}}{(1-\gamma)^{2}}R_{max}^{2},

then the following result is obtained:

s𝒮a𝒜Qπ(s,a){π(a|s)πK(a|s)}dμ1(s)=Op({γ2(1γ)4×[|𝒜|n1(nα(logn)ξ\displaystyle\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}d\mu_{1}(s)}}=O_{p}(\{\gamma^{2}(1-\gamma)^{4}\times[|\mathcal{A}|n^{-1}(n^{\alpha^{*}}{(\log{n})}^{\xi^{*}} (13)
+d1κ)(logn)ξ+1+L0(eζκeζm+n2Δ)]+γ2K(1γ)2Rmax2}η+1η+2).\displaystyle+d_{1}^{*}{\kappa}){(\log{n})}^{\xi^{*}+1}+L_{\mathcal{F}_{0}}{(e^{-\zeta\kappa}-e^{-\zeta m}+n^{-2\Delta})}]+\frac{\gamma^{2K}}{(1-\gamma)^{2}}R_{max}^{2}\}^{\frac{\eta+1}{\eta+2}}).

Then, using a similar argument, we can show the analogous result under the Lebesgue measure,

s𝒮a𝒜Qπ(s,a){π(a|s)πK(a|s)}ds=Op({γ2(1γ)4×[|𝒜|n1(nα(logn)ξ\displaystyle\int_{s\in\mathcal{S}}{\sum_{a\in\mathcal{A}}{Q^{\pi^{*}}(s,a)\{\pi^{*}(a|s)-\pi_{K}(a|s)\}ds}}=O_{p}(\{\gamma^{2}(1-\gamma)^{4}\times[|\mathcal{A}|n^{-1}(n^{\alpha^{*}}{(\log{n})}^{\xi^{*}} (14)
+d1κ)(logn)ξ+1+L0(eζκeζm+n2Δ)]+γ2K(1γ)2Rmax2}η+1η+2).\displaystyle+d_{1}^{*}{\kappa}){(\log{n})}^{\xi^{*}+1}+L_{\mathcal{F}_{0}}{(e^{-\zeta\kappa}-e^{-\zeta m}+n^{-2\Delta})}]+\frac{\gamma^{2K}}{(1-\gamma)^{2}}R_{max}^{2}\}^{\frac{\eta+1}{\eta+2}}).

Combining equations (6), (13) and (14) yields the result of Theorem 4.8.

C.3 Proof of Lemma C.1

The Lemma C.1 is: For the estimated Q-function obtained in iteration KK, Q~K\tilde{Q}_{K} in Algorithm 1, we have

QQ~K2,μ22γ2(1γ)4ϵmax2ϕμ,σ2+8γ2KRmax2(1γ)2,\displaystyle\left\|Q^{*}-\tilde{Q}_{K}\right\|_{2,\mu}^{2}\leq 2\gamma^{2}(1-\gamma)^{4}\epsilon_{max}^{2}\phi_{\mu,\sigma}^{2}+8\frac{\gamma^{2K}R_{max}^{2}}{(1-\gamma)^{2}},

where ϵmax=maxk{1,2,K}TQ~k1Q~kσ\epsilon_{max}=\max_{k\in\{1,2...,K\}}{\left\|T\tilde{Q}_{k-1}-\tilde{Q}_{k}\right\|_{\sigma}}.

Proof.

First, denote $\rho_{k}=T\tilde{Q}_{k-1}-\tilde{Q}_{k}$ and define the operator $P^{\pi}$ as follows:

\displaystyle(P^{\pi}Q)(s,a)=E[Q(S^{\prime},A^{\prime})|S^{\prime}\sim P(.|s,a),A^{\prime}\sim\pi(.|S^{\prime})]. (15)

We rely on the following lemma from Fan et al. (2020):

Lemma C.7 (Lemma C.2 in Fan et al. (2020)).

\displaystyle Q^{*}-\tilde{Q}_{l}\leq\sum_{i=k}^{l-1}{\gamma^{l-1-i}(P^{\pi^{*}})^{l-1-i}\rho_{i+1}}+\gamma^{l-k}(P^{\pi^{*}})^{l-k}(Q^{*}-\tilde{Q}_{k})

for all $k,l\in\mathbb{Z}$ with $0\leq k<l\leq K$.

Plugging in $l=K,k=0$, we have

\displaystyle Q^{*}-\tilde{Q}_{K}\leq\sum_{i=0}^{K-1}{\gamma^{K-1-i}(P^{\pi^{*}})^{K-1-i}\rho_{i+1}}+\gamma^{K}(P^{\pi^{*}})^{K}(Q^{*}-\tilde{Q}_{0}).

We can denote Ui=(Pπ)K1iU_{i}=(P^{\pi^{*}})^{K-1-i} for i=0,1,,K1i=0,1,...,K-1 and UK=(Pπ)KU_{K}=(P^{\pi^{*}})^{K}. Then we have

|Q(s,a)Q~K(s,a)|i=0K1γK1i(Ui|ρi+1|)(s,a)+γK(UK|QQ~0|)(s,a).\displaystyle|Q^{*}(s,a)-\tilde{Q}_{K}(s,a)|\leq\sum_{i=0}^{K-1}{\gamma^{K-1-i}(U_{i}|\rho_{i+1}|)(s,a)}+\gamma^{K}(U_{K}|Q^{*}-\tilde{Q}_{0}|)(s,a).

Squaring both sides, we have

|Q(s,a)Q~K(s,a)|2[i=0K1γK1i(Ui|ρi+1|)(s,a)]2+2i=0K1γK1i(Ui|ρi+1|)(s,a)\displaystyle{|Q^{*}(s,a)-\tilde{Q}_{K}(s,a)|}^{2}\leq[\sum_{i=0}^{K-1}{\gamma^{K-1-i}(U_{i}|\rho_{i+1}|)(s,a)}]^{2}+2\sum_{i=0}^{K-1}{\gamma^{K-1-i}(U_{i}|\rho_{i+1}|)(s,a)}
×γK(UK|QQ~0|)(s,a)+γ2K[(UK|QQ~0|)(s,a)]2.\displaystyle\times\gamma^{K}(U_{K}|Q^{*}-\tilde{Q}_{0}|)(s,a)+\gamma^{2K}[(U_{K}|Q^{*}-\tilde{Q}_{0}|)(s,a)]^{2}.

That is,

Eμ[|Q(s,a)Q~K(s,a)|2]Eμ{[i=0K1γK1i(Ui|ρi+1|)(s,a)]2}\displaystyle E_{\mu}[{|Q^{*}(s,a)-\tilde{Q}_{K}(s,a)|}^{2}]\leq E_{\mu}\{[\sum_{i=0}^{K-1}{\gamma^{K-1-i}(U_{i}|\rho_{i+1}|)(s,a)}]^{2}\} (16)
+Eμ{2i=0K1γK1i(Ui|ρi+1|)(s,a)×γK(UK|QQ~0|)(s,a)}\displaystyle+E_{\mu}\{2\sum_{i=0}^{K-1}{\gamma^{K-1-i}(U_{i}|\rho_{i+1}|)(s,a)}\times\gamma^{K}(U_{K}|Q^{*}-\tilde{Q}_{0}|)(s,a)\}
+Eμ{γ2K[(UK|QQ~0|)(s,a)]2}.\displaystyle+E_{\mu}\{\gamma^{2K}[(U_{K}|Q^{*}-\tilde{Q}_{0}|)(s,a)]^{2}\}.

For the third term Eμ{γ2K[(UK|QQ~0|)(s,a)]2}E_{\mu}\{\gamma^{2K}[(U_{K}|Q^{*}-\tilde{Q}_{0}|)(s,a)]^{2}\}, we can use the bound

Eμ{γ2K[(UK|QQ~0|)(s,a)]2}4γ2KVmax2=4γ2KRmax2(1γ)2.\displaystyle E_{\mu}\{\gamma^{2K}[(U_{K}|Q^{*}-\tilde{Q}_{0}|)(s,a)]^{2}\}\leq 4\gamma^{2K}V_{max}^{2}=\frac{4\gamma^{2K}R_{max}^{2}}{(1-\gamma)^{2}}. (17)

This equation (17) holds because of the following reason: we know |Q(s,a)Q~0(s,a)|2Vmax,(s,a)|Q^{*}(s,a)-\tilde{Q}_{0}(s,a)|\leq 2V_{max},\forall(s,a) and this implies that |(UK|QQ~0|)(s,a)|=|((Pπ)K|QQ~0|)(s,a)|2Vmax,(s,a)|(U_{K}|Q^{*}-\tilde{Q}_{0}|)(s,a)|=|({(P^{\pi^{*}})}^{K}|Q^{*}-\tilde{Q}_{0}|)(s,a)|\leq 2V_{max},\forall(s,a).

For the cross term in (16), we have

Eμ{2i=0K1γK1i(Ui|ρi+1|)(s,a)×γK(UK|QQ~0|)(s,a)}\displaystyle E_{\mu}\{2\sum_{i=0}^{K-1}{\gamma^{K-1-i}(U_{i}|\rho_{i+1}|)(s,a)}\times\gamma^{K}(U_{K}|Q^{*}-\tilde{Q}_{0}|)(s,a)\}\leq (18)
Eμ{[i=0K1γK1i(Ui|ρi+1|)(s,a)]2}+Eμ{γ2K[(UK|QQ~0|)(s,a)]2}.\displaystyle E_{\mu}\{[\sum_{i=0}^{K-1}{\gamma^{K-1-i}(U_{i}|\rho_{i+1}|)(s,a)}]^{2}\}+E_{\mu}\{\gamma^{2K}[(U_{K}|Q^{*}-\tilde{Q}_{0}|)(s,a)]^{2}\}.

Then the main focus should be the bound on Eμ{[i=0K1γK1i(Ui|ρi+1|)(s,a)]2}E_{\mu}\{[\sum_{i=0}^{K-1}{\gamma^{K-1-i}(U_{i}|\rho_{i+1}|)(s,a)}]^{2}\}. We have

Eμ{[i=0K1γK1i(Ui|ρi+1|)(s,a)]2}=0i,jK1γ2K2ijEμ[(Ui|ρi+1|)(s,a)(Uj|ρj+1|)(s,a)].\displaystyle E_{\mu}\{[\sum_{i=0}^{K-1}{\gamma^{K-1-i}(U_{i}|\rho_{i+1}|)(s,a)}]^{2}\}=\sum_{0\leq i,j\leq K-1}{\gamma^{2K-2-i-j}E_{\mu}[(U_{i}|\rho_{i+1}|)(s,a)(U_{j}|\rho_{j+1}|)(s,a)]}.

Then we need to handle Eμ[(Ui|ρi+1|)(s,a)(Uj|ρj+1|)(s,a)]E_{\mu}[(U_{i}|\rho_{i+1}|)(s,a)(U_{j}|\rho_{j+1}|)(s,a)] for 0i,jK10\leq i,j\leq K-1.

We have

Eμ[(Ui|ρi+1|)(s,a)(Uj|ρj+1|)(s,a)]=Eμ[(Pπ)K1i|ρi+1|(s,a)(Pπ)K1j|ρj+1|(s,a)].E_{\mu}[(U_{i}|\rho_{i+1}|)(s,a)(U_{j}|\rho_{j+1}|)(s,a)]=E_{\mu}[{(P^{\pi^{*}})}^{K-1-i}|\rho_{i+1}|(s,a){(P^{\pi^{*}})}^{K-1-j}|\rho_{j+1}|(s,a)].

Recall that $(P^{\pi}Q)(s,a)=E[Q(S^{\prime},A^{\prime})|S^{\prime}\sim P(.|s,a),A^{\prime}\sim\pi(.|S^{\prime})]$. One key observation is that

\displaystyle{[(P^{\pi}Q)(s,a)]}^{2}=E[Q(S^{\prime},A^{\prime})|S^{\prime}\sim P(.|s,a),A^{\prime}\sim\pi(.|S^{\prime})]
\displaystyle\times E[Q(S^{\prime},A^{\prime})|S^{\prime}\sim P(.|s,a),A^{\prime}\sim\pi(.|S^{\prime})]
\displaystyle\leq E[Q^{2}(S^{\prime},A^{\prime})|S^{\prime}\sim P(.|s,a),A^{\prime}\sim\pi(.|S^{\prime})],

where the inequality follows from Jensen's inequality.

Here we would abuse the notation PπμP^{\pi}\mu to represent the distribution of next stage (S,A)(S^{\prime},A^{\prime}) such that SP(.|s,a),Aπ(.|S)S^{\prime}\sim P(.|s,a),A^{\prime}\sim\pi(.|S^{\prime}) (when notation PπP^{\pi} is followed by a function on 𝒮×𝒜\mathcal{S}\times\mathcal{A}, it is regarded as the operator defined in equation (15)); when it is followed by a distribution on 𝒮×𝒜\mathcal{S}\times\mathcal{A}, it is regarded as a transformation of distribution as is discussed here.

Then it can be shown that

Eμ{[(PπQ)(s,a)]2}Eμ{E[Q2(S,A)|SP(.|s,a),Aπ(.|S)]}=EPπμ[Q2(s,a)].\displaystyle E_{\mu}\{[(P^{\pi}Q)(s,a)]^{2}\}\leq E_{\mu}\{E[Q^{2}(S^{\prime},A^{\prime})|S^{\prime}\sim P(.|s,a),A^{\prime}\sim\pi(.|S^{\prime})]\}=E_{P^{\pi}\mu}[Q^{2}(s,a)].

Iterating this argument, we further obtain

\displaystyle E_{\mu}\{[(P^{\pi_{1}}P^{\pi_{2}}...P^{\pi_{l}}Q)(s,a)]^{2}\}\leq E_{P^{\pi_{1}}P^{\pi_{2}}...P^{\pi_{l}}\mu}[Q^{2}(s,a)].

For simplicity, we denote $P^{\pi_{1}}P^{\pi_{2}}...P^{\pi_{l}}\mu$ by $\tilde{\mu}_{l}$.

For two functions f1,f2f_{1},f_{2} on 𝒮×𝒜\mathcal{S}\times\mathcal{A}, we know

Eμ[(Pπ)K1if1(Pπ)K1jf2]{Eμ{[(Pπ)K1if1]2}Eμ{[(Pπ)K1jf2]2}}12.\displaystyle E_{\mu}[(P^{\pi^{*}})^{K-1-i}f_{1}(P^{\pi^{*}})^{K-1-j}f_{2}]\leq\{E_{\mu}\{[(P^{\pi^{*}})^{K-1-i}f_{1}]^{2}\}E_{\mu}\{[(P^{\pi^{*}})^{K-1-j}f_{2}]^{2}\}\}^{\frac{1}{2}}.

It can be obtained that

Eμ{[(Pπ)K1if1]2}E(Pπ)K1iμ[f12]=𝒮×𝒜f12(s,a)𝑑μ~K1i(s,a)\displaystyle E_{\mu}\{[(P^{\pi^{*}})^{K-1-i}f_{1}]^{2}\}\leq E_{(P^{\pi^{*}})^{K-1-i}\mu}[f_{1}^{2}]=\int_{\mathcal{S}\times\mathcal{A}}{f_{1}^{2}(s,a)d\tilde{\mu}_{K-1-i}(s,a)}
=𝒮×𝒜|f12(s,a)dμ~K1i(s,a)dσ(s,a)|𝑑σ(s,a)𝒮×𝒜|f12(s,a)|𝑑σ(s,a)×dμ~K1i(s,a)dσ(s,a),σ\displaystyle=\int_{\mathcal{S}\times\mathcal{A}}{|f_{1}^{2}(s,a)\frac{d\tilde{\mu}_{K-1-i}(s,a)}{d\sigma(s,a)}|d\sigma(s,a)}\leq\int_{\mathcal{S}\times\mathcal{A}}{|f_{1}^{2}(s,a)|d\sigma(s,a)}\times\left\|\frac{d\tilde{\mu}_{K-1-i}(s,a)}{d\sigma(s,a)}\right\|_{\infty,\sigma}
=f12,σ2×ω(K1i;μ,σ).\displaystyle=\left\|f_{1}\right\|_{2,\sigma}^{2}\times\omega_{\infty}(K-1-i;\mu,\sigma).

Recall the definition of $\omega_{\infty}(i;\mu,\sigma)$ in equation (2). Similarly, we have

Eμ{[(Pπ)K1jf2]2}f22,σ2×ω(K1j;μ,σ).\displaystyle E_{\mu}\{[(P^{\pi^{*}})^{K-1-j}f_{2}]^{2}\}\leq\left\|f_{2}\right\|_{2,\sigma}^{2}\times\omega_{\infty}(K-1-j;\mu,\sigma).

If we plug in f1=|ρi+1|f_{1}=|\rho_{i+1}| and f2=|ρj+1|f_{2}=|\rho_{j+1}|, we can get

Eμ[(Pπ)K1i|ρi+1|(s,a)(Pπ)K1j|ρj+1|(s,a)]\displaystyle E_{\mu}[{(P^{\pi^{*}})}^{K-1-i}|\rho_{i+1}|(s,a){(P^{\pi^{*}})}^{K-1-j}|\rho_{j+1}|(s,a)]
{Eμ{[(Pπ)K1i|ρi+1|]2}Eμ{[(Pπ)K1j|ρj+1|]2}}12\displaystyle\leq\{E_{\mu}\{[(P^{\pi^{*}})^{K-1-i}|\rho_{i+1}|]^{2}\}E_{\mu}\{[(P^{\pi^{*}})^{K-1-j}|\rho_{j+1}|]^{2}\}\}^{\frac{1}{2}}
ρi+12,σρj+12,σ[ω(K1i;μ,σ)ω(K1j;μ,σ)]12.\displaystyle\leq\left\|\rho_{i+1}\right\|_{2,\sigma}\left\|\rho_{j+1}\right\|_{2,\sigma}[\omega_{\infty}(K-1-i;\mu,\sigma)\omega_{\infty}(K-1-j;\mu,\sigma)]^{\frac{1}{2}}.

Then,

Eμ{[i=0K1γK1i(Ui|ρi+1|)(s,a)]2}0i,jK1γ2K2ijρi+12,σρj+12,σ\displaystyle E_{\mu}\{[\sum_{i=0}^{K-1}{\gamma^{K-1-i}(U_{i}|\rho_{i+1}|)(s,a)}]^{2}\}\leq\sum_{0\leq i,j\leq K-1}\gamma^{2K-2-i-j}\left\|\rho_{i+1}\right\|_{2,\sigma}\left\|\rho_{j+1}\right\|_{2,\sigma}
×[ω(K1i;μ,σ)ω(K1j;μ,σ)]12.\displaystyle\times[\omega_{\infty}(K-1-i;\mu,\sigma)\omega_{\infty}(K-1-j;\mu,\sigma)]^{\frac{1}{2}}.

We define

\displaystyle\epsilon_{max}=\max_{i\in\{1,2,\dots,K\}}{\left\|\rho_{i}\right\|_{2,\sigma}}=\max_{i\in\{1,2,\dots,K\}}{\left\|T\tilde{Q}_{i-1}-\tilde{Q}_{i}\right\|_{2,\sigma}},

so that $\epsilon_{max}^{2}=\max_{i\in\{1,2,\dots,K\}}{\left\|\rho_{i}\right\|_{2,\sigma}^{2}}$. Thus, it can be shown that

Eμ{[i=0K1γK1i(Ui|ρi+1|)(s,a)]2}ϵmax2{i=0K1γi[ω(i;μ,σ)]12}2.\displaystyle E_{\mu}\{[\sum_{i=0}^{K-1}{\gamma^{K-1-i}(U_{i}|\rho_{i+1}|)(s,a)}]^{2}\}\leq\epsilon_{max}^{2}\{\sum_{i=0}^{K-1}{\gamma^{i}[\omega_{\infty}(i;\mu,\sigma)]^{\frac{1}{2}}}\}^{2}.

We have already shown that $\frac{1}{(1-\gamma)^{2}}\sum_{m=0}^{\infty}{\gamma^{m-1}(m+1)[\omega_{\infty}(m;\mu,\sigma)]^{\frac{1}{2}}}\leq\phi_{\mu,\sigma}$. Therefore,

\displaystyle\sum_{i=0}^{K-1}{\gamma^{i}[\omega_{\infty}(i;\mu,\sigma)]^{\frac{1}{2}}}<(1-\gamma)^{2}\gamma\times\frac{1}{(1-\gamma)^{2}}\sum_{i=0}^{\infty}{\gamma^{i-1}(i+1)[\omega_{\infty}(i;\mu,\sigma)]^{\frac{1}{2}}}\leq(1-\gamma)^{2}\gamma\phi_{\mu,\sigma}.

Hence,

Eμ{[i=0K1γK1i(Ui|ρi+1|)(s,a)]2}<ϵmax2(1γ)4γ2ϕμ,σ2.\displaystyle E_{\mu}\{[\sum_{i=0}^{K-1}{\gamma^{K-1-i}(U_{i}|\rho_{i+1}|)(s,a)}]^{2}\}<\epsilon_{max}^{2}(1-\gamma)^{4}\gamma^{2}\phi_{\mu,\sigma}^{2}. (19)

Combining equations (16), (17), (18) and (19) yields the result of Lemma C.1.

C.4 Proof of Proposition C.2

The Proposition C.2 is: The function class

1(L,{{di}i=1L+1,d0=m+m0},s,Vmax,G^,κ)\mathcal{F}_{1}(L,\{\{d_{i}\}^{L+1}_{i=1},d_{0}=m+m_{0}\},s,V_{max},\hat{G},\kappa)

is a subset of

2(L,{{di}i=1L+1,d0=κ+m0},s+d1κ,Vmax,G^,κ).\mathcal{F}_{2}(L,\{\{d_{i}\}^{L+1}_{i=1},d_{0}=\kappa+m_{0}\},s+d_{1}\kappa,V_{max},\hat{G},\kappa).
Proof.

Note that the vector $z^{*}$ recovered from the first $\kappa$ principal components of $z$ is related to the corresponding vector of $\kappa$ PCA scores through $z^{*}=\hat{U}{\hat{\Sigma}}^{\frac{1}{2}}\begin{pmatrix}I_{\kappa}\\ 0_{(m-\kappa)\times\kappa}\end{pmatrix}\hat{v}$. By Definition A.1 in Section A, for every $f_{1}\in\mathcal{F}_{1}(L,\{\{d_{i}\}^{L+1}_{i=1},d_{0}=m+m_{0}\},s,V_{max},\hat{G},\kappa)$ we have $f_{1}(x,z,a)=f_{0}^{1}(x,z^{*},a)$ with $f_{0}^{1}(.,a)\in\mathcal{F}_{SReLU}(L,\{d_{i}\}_{i=0}^{L+1},s,V_{max})$. Denote the first layer of the sparse ReLU network $f_{0}^{1}(.,a)$ by $\sigma(W_{1}^{1}\begin{pmatrix}x\\ z^{*}\end{pmatrix}+b_{1}^{1})$, where $W_{1}^{1}\begin{pmatrix}x\\ z^{*}\end{pmatrix}=\begin{pmatrix}W_{11}^{1}&W_{12}^{1}\end{pmatrix}\begin{pmatrix}x\\ z^{*}\end{pmatrix}=W_{11}^{1}x+W_{12}^{1}z^{*}$, with $W_{12}^{1}z^{*}=W_{12}^{1}\hat{U}{\hat{\Sigma}}^{\frac{1}{2}}\begin{pmatrix}I_{\kappa}\\ 0_{(m-\kappa)\times\kappa}\end{pmatrix}\hat{v}$.

We now show that there exists $f_{2}\in\mathcal{F}_{2}(L,\{\{d_{j}\}^{L+1}_{j=1},d_{0}=\kappa+m_{0}\},s+d_{1}\kappa,V_{max},\hat{G},\kappa)$ such that $f_{2}(x,z,a)=f_{1}(x,z,a)$. By definition, $f_{2}(x,z,a)=f_{0}^{2}(x,\hat{v},a)$ with

$f_{0}^{2}(.,a)\in\mathcal{F}_{SReLU}(L,\{d_{i}\}_{i=0}^{L+1},s+\kappa d_{1},V_{max})$.

The first layer of $f_{0}^{2}$ can be written analogously as $\sigma(W_{1}^{2}\begin{pmatrix}x\\ \hat{v}\end{pmatrix}+b_{1}^{2})$, with $W_{1}^{2}\begin{pmatrix}x\\ \hat{v}\end{pmatrix}=\begin{pmatrix}W_{11}^{2}&W_{12}^{2}\end{pmatrix}\begin{pmatrix}x\\ \hat{v}\end{pmatrix}=W_{11}^{2}x+W_{12}^{2}\hat{v}$. Setting $W_{11}^{2}=W_{11}^{1}$, $b_{1}^{2}=b_{1}^{1}$ and $W_{12}^{2}=W_{12}^{1}\hat{U}{\hat{\Sigma}}^{\frac{1}{2}}\begin{pmatrix}I_{\kappa}\\ 0_{(m-\kappa)\times\kappa}\end{pmatrix}$ gives $\sigma(W_{1}^{1}\begin{pmatrix}x\\ z^{*}\end{pmatrix}+b_{1}^{1})=\sigma(W_{1}^{2}\begin{pmatrix}x\\ \hat{v}\end{pmatrix}+b_{1}^{2})$. The parameters in the remaining layers of $f_{0}^{2}$ are set equal to those of $f_{0}^{1}$.

Therefore, for every $f_{1}\in\mathcal{F}_{1}(L,\{\{d_{j}\}^{L+1}_{j=1},d_{0}=m+m_{0}\},s,V_{max},\hat{G},\kappa)$, there exists $f_{2}\in\mathcal{F}_{2}(L,\{\{d_{j}\}^{L+1}_{j=1},d_{0}=\kappa+m_{0}\},s+d_{1}\kappa,V_{max},\hat{G},\kappa)$ such that

$f_{1}(x,z,a)=f_{0}^{1}(x,z^{*},a)=f_{0}^{2}(x,\hat{v},a)=f_{2}(x,z,a)$.

The sparsity of $f_{0}^{2}$ is increased to $s+\kappa d_{1}$ because the weight matrix $W_{12}^{2}=W_{12}^{1}\hat{U}{\hat{\Sigma}}^{\frac{1}{2}}\begin{pmatrix}I_{\kappa}\\ 0_{(m-\kappa)\times\kappa}\end{pmatrix}\in\mathbb{R}^{d_{1}\times\kappa}$ may no longer be sparse, even though $W_{12}^{1}$ is sparse.
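The following minimal numerical sketch (synthetic dimensions, not part of the proof) illustrates this construction: the PCA recovery map is absorbed into the first-layer weights, and the two first-layer outputs coincide.

import numpy as np

# Check that absorbing U_hat * Sigma_hat^{1/2} (restricted to the first kappa
# columns) into the first-layer weights reproduces the layer acting on z*.
rng = np.random.default_rng(3)
m0, m, kappa, d1 = 3, 8, 2, 6
A = rng.normal(size=(m, m))
G_hat = A @ A.T                                       # estimated covariance of z
lam, U_hat = np.linalg.eigh(G_hat)
order = np.argsort(lam)[::-1]
lam, U_hat = lam[order], U_hat[:, order]

x, z = rng.normal(size=m0), rng.normal(size=m)
v_hat = (U_hat.T @ z)[:kappa] / np.sqrt(lam[:kappa])  # first kappa PCA scores
z_star = U_hat[:, :kappa] @ (U_hat[:, :kappa].T @ z)  # rank-kappa recovery of z

W11 = rng.normal(size=(d1, m0))
W12 = rng.normal(size=(d1, m))
b1 = rng.normal(size=d1)
W12_new = (W12 @ U_hat[:, :kappa]) * np.sqrt(lam[:kappa])   # W12 * U_hat * Sigma^{1/2}, first kappa columns

h1 = np.maximum(W11 @ x + W12 @ z_star + b1, 0)       # first layer of f_0^1 on (x, z*)
h2 = np.maximum(W11 @ x + W12_new @ v_hat + b1, 0)    # first layer of f_0^2 on (x, v_hat)
assert np.allclose(h1, h2)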

C.5 Proof of Lemma C.3

The Lemma C.3 is: Under the same assumptions as Theorem 4.6,

\sup_{g\in\mathcal{F}_{2}}{\inf_{f\in\mathcal{F}_{2}}{\left\|f-Tg\right\|_{\sigma}^{2}}}\leq 2\sup_{f^{\prime}\in\mathcal{G}_{0}}{\inf_{f_{0}\in\mathcal{F}_{0}}{\left\|f_{0}-f^{\prime}\right\|_{\sigma}^{2}}}+2L_{\mathcal{F}_{0}}{\left\|Z-Z^{*}\right\|_{\sigma}^{2}},

where $L_{\mathcal{F}_{0}}=\sup_{f\in\mathcal{F}_{0}}{\sup_{x\neq y}{\frac{{|f(y)-f(x)|}^{2}}{\left\|y-x\right\|^{2}}}}$, $Z$ is the high-frequency vector drawn from the distribution $\sigma$, and $Z^{*}=(\sum_{k=1}^{\kappa}{\hat{U}_{k}{\hat{U}_{k}}^{T}})Z$ is the vector recovered from the first $\kappa$ principal components of $Z$.

Proof.

We have supg2inff2fTgσ2supf𝒢0inff22f2fσ2\sup_{g\in\mathcal{F}_{2}}{\inf_{f\in\mathcal{F}_{2}}{\left\|f-Tg\right\|_{\sigma}^{2}}}\leq\sup_{f^{\prime}\in\mathcal{G}_{0}}{\inf_{f_{2}\in\mathcal{F}_{2}}{\left\|f_{2}-f^{\prime}\right\|_{\sigma}^{2}}}, which relies on the assumption: Tg𝒢0,g2Tg\in\mathcal{G}_{0},\forall g\in\mathcal{F}_{2}. From the Proposition C.2, we have 12\mathcal{F}_{1}\subset\mathcal{F}_{2}. Then we have supf𝒢0inff22f2fσ2supf𝒢0inff11f1fσ2\sup_{f^{\prime}\in\mathcal{G}_{0}}{\inf_{f_{2}\in\mathcal{F}_{2}}{\left\|f_{2}-f^{\prime}\right\|_{\sigma}^{2}}}\leq\sup_{f^{\prime}\in\mathcal{G}_{0}}{\inf_{f_{1}\in\mathcal{F}_{1}}{\left\|f_{1}-f^{\prime}\right\|_{\sigma}^{2}}}.

By the definition of $\mathcal{F}_{1}$,

\[\sup_{f^{\prime}\in\mathcal{G}_{0}}\inf_{f_{1}\in\mathcal{F}_{1}}\left\|f_{1}-f^{\prime}\right\|_{\sigma}^{2}=\sup_{f^{\prime}\in\mathcal{G}_{0}}\inf_{f_{1}\in\mathcal{F}_{1}}E_{\sigma}\{[f_{1}(x,z,a)-f^{\prime}(x,z,a)]^{2}\}=\sup_{f^{\prime}\in\mathcal{G}_{0}}\inf_{f_{0}\in\mathcal{F}_{0}}E_{\sigma}\{[f_{0}(x,z^{*},a)-f^{\prime}(x,z,a)]^{2}\}.\]

Next,

\[E_{\sigma}\{[f_{0}(x,z^{*},a)-f^{\prime}(x,z,a)]^{2}\}=E_{\sigma}\{[f_{0}(x,z^{*},a)-f_{0}(x,z,a)+f_{0}(x,z,a)-f^{\prime}(x,z,a)]^{2}\}\]
\[\leq 2E_{\sigma}\{[f_{0}(x,z^{*},a)-f_{0}(x,z,a)]^{2}\}+2E_{\sigma}\{[f_{0}(x,z,a)-f^{\prime}(x,z,a)]^{2}\}.\]

Furthermore, the term $E_{\sigma}\{[f_{0}(x,z^{*},a)-f_{0}(x,z,a)]^{2}\}$ satisfies

\[E_{\sigma}\{[f_{0}(x,z^{*},a)-f_{0}(x,z,a)]^{2}\}\leq L_{\mathcal{F}_{0}}\left\|Z-Z^{*}\right\|_{\sigma}^{2}.\]

Note that $L_{\mathcal{F}_{0}}$ is the squared Lipschitz constant of the ReLU networks in the class $\mathcal{F}_{0}$. By the chain rule and the fact that $\left\|W\right\|_{2}^{2}\leq m_{W}\left\|W\right\|_{\infty}^{2}$ for any matrix $W$ with $m_{W}$ columns, $L_{\mathcal{F}_{0}}$ is always bounded: $L_{\mathcal{F}_{0}}\leq\prod_{l=1}^{L}d_{l-1}$.
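As an aside, the sketch below (hypothetical widths and random weights, not the paper's code) numerically illustrates this crude bound: it evaluates the squared difference ratio of a small ReLU network whose weight entries are bounded by one over random input pairs, and compares it with the product of the input widths of the affine layers.

```python
import numpy as np

rng = np.random.default_rng(1)
dims = [6, 8, 8, 1]   # hypothetical widths d_0, d_1, d_2, d_3

# Weight entries bounded by one in absolute value, as in the sparse ReLU class;
# biases are set to zero for simplicity.
Ws = [rng.uniform(-1.0, 1.0, size=(dims[l + 1], dims[l]))
      for l in range(len(dims) - 1)]

def relu_net(x):
    h = x
    for W in Ws[:-1]:
        h = np.maximum(W @ h, 0.0)   # hidden layers with ReLU activation
    return Ws[-1] @ h                # linear output layer

# Empirical squared difference ratios over random input pairs
# (a lower bound on the true squared Lipschitz constant).
ratios = []
for _ in range(2000):
    x, y = rng.normal(size=dims[0]), rng.normal(size=dims[0])
    num = np.sum((relu_net(y) - relu_net(x)) ** 2)
    den = np.sum((y - x) ** 2)
    ratios.append(num / den)

crude_bound = np.prod(dims[:-1])     # product of the input widths of the layers
print(f"max observed ratio {max(ratios):.2f} <= crude bound {crude_bound}")
```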

It follows that

\[E_{\sigma}\{[f_{0}(x,z^{*},a)-f^{\prime}(x,z,a)]^{2}\}\leq 2L_{\mathcal{F}_{0}}\left\|Z-Z^{*}\right\|_{\sigma}^{2}+2E_{\sigma}\{[f_{0}(x,z,a)-f^{\prime}(x,z,a)]^{2}\}\]

for all $f_{0}\in\mathcal{F}_{0}$ and $f^{\prime}\in\mathcal{G}_{0}$. Taking $\sup$ and $\inf$ on both sides yields

\[\sup_{f^{\prime}\in\mathcal{G}_{0}}\inf_{f_{0}\in\mathcal{F}_{0}}E_{\sigma}\{[f_{0}(x,z^{*},a)-f^{\prime}(x,z,a)]^{2}\}\leq 2L_{\mathcal{F}_{0}}\left\|Z-Z^{*}\right\|_{\sigma}^{2}+2\sup_{f^{\prime}\in\mathcal{G}_{0}}\inf_{f_{0}\in\mathcal{F}_{0}}E_{\sigma}\{[f_{0}(x,z,a)-f^{\prime}(x,z,a)]^{2}\},\]

where $E_{\sigma}\{[f_{0}(x,z,a)-f^{\prime}(x,z,a)]^{2}\}=\left\|f_{0}-f^{\prime}\right\|_{\sigma}^{2}$.

Combining the above displays, we conclude that

\[\sup_{g\in\mathcal{F}_{2}}\inf_{f\in\mathcal{F}_{2}}\left\|f-Tg\right\|_{\sigma}^{2}\leq 2L_{\mathcal{F}_{0}}\left\|Z-Z^{*}\right\|_{\sigma}^{2}+2\sup_{f^{\prime}\in\mathcal{G}_{0}}\inf_{f_{0}\in\mathcal{F}_{0}}\left\|f_{0}-f^{\prime}\right\|_{\sigma}^{2}.\]

C.6 Proof of Lemma C.4

Lemma C.4 states: for the high-frequency vector $Z$, assume $E(Z)=0$ and that the eigenvalues of $Cov(Z)$ decay exponentially, $\lambda_{j}=O(e^{-\zeta j})$. Assume further that the estimated eigenvectors of $Cov(Z)$ satisfy $\|\hat{U}_{j}-U_{j}\|=O_{p}(n^{-\Delta})$. Then

\[\left\|Z-Z^{*}\right\|_{\sigma}^{2}=O_{p}(e^{-\zeta\kappa}-e^{-\zeta m}+n^{-2\Delta}).\]
Proof.

Note that

\[\left\|Z-Z^{*}\right\|_{\sigma}^{2}=E[(Z-Z^{*})^{T}(Z-Z^{*})]=E\Big[Z^{T}\Big(\sum_{j=\kappa+1}^{m}\hat{U}_{j}\hat{U}_{j}^{T}\Big)^{2}Z\Big]=\sum_{j=\kappa+1}^{m}E[tr(Z^{T}\hat{U}_{j}\hat{U}_{j}^{T}Z)]\]
\[=\sum_{j=\kappa+1}^{m}tr[E(ZZ^{T})\hat{U}_{j}\hat{U}_{j}^{T}]=tr\Big[E(ZZ^{T})\sum_{j=\kappa+1}^{m}\hat{U}_{j}\hat{U}_{j}^{T}\Big],\]

where the second equality uses $Z-Z^{*}=(\sum_{j=\kappa+1}^{m}\hat{U}_{j}\hat{U}_{j}^{T})Z$ and the third uses the orthonormality of $\{\hat{U}_{j}\}$, so that the projection matrix $\sum_{j=\kappa+1}^{m}\hat{U}_{j}\hat{U}_{j}^{T}$ equals its own square.

We know that $E(ZZ^{T})=Cov(Z)=\sum_{i=1}^{m}\lambda_{i}U_{i}U_{i}^{T}$, so that

\[\left\|Z-Z^{*}\right\|_{\sigma}^{2}=tr\Big[\sum_{i=1}^{m}\lambda_{i}U_{i}U_{i}^{T}\sum_{j=\kappa+1}^{m}\hat{U}_{j}\hat{U}_{j}^{T}\Big]=tr\Big[\sum_{i=1}^{m}\lambda_{i}U_{i}U_{i}^{T}\sum_{j=\kappa+1}^{m}(\hat{U}_{j}-U_{j}+U_{j})(\hat{U}_{j}-U_{j}+U_{j})^{T}\Big]\]
\[=tr\Big[\sum_{i=1}^{m}\lambda_{i}U_{i}U_{i}^{T}\sum_{j=\kappa+1}^{m}U_{j}U_{j}^{T}\Big]+tr\Big[\sum_{i=1}^{m}\lambda_{i}U_{i}U_{i}^{T}\sum_{j=\kappa+1}^{m}(\hat{U}_{j}-U_{j})U_{j}^{T}\Big]+tr\Big[\sum_{i=1}^{m}\lambda_{i}U_{i}U_{i}^{T}\sum_{j=\kappa+1}^{m}U_{j}(\hat{U}_{j}-U_{j})^{T}\Big]+tr\Big[\sum_{i=1}^{m}\lambda_{i}U_{i}U_{i}^{T}\sum_{j=\kappa+1}^{m}(\hat{U}_{j}-U_{j})(\hat{U}_{j}-U_{j})^{T}\Big].\]

Hence we need to bound the four terms on the right-hand side.

The first term satisfies

\[tr\Big[\sum_{i=1}^{m}\lambda_{i}U_{i}U_{i}^{T}\sum_{j=\kappa+1}^{m}U_{j}U_{j}^{T}\Big]=tr\Big[\sum_{j=\kappa+1}^{m}\lambda_{j}U_{j}U_{j}^{T}U_{j}U_{j}^{T}\Big]=\sum_{j=\kappa+1}^{m}tr(\lambda_{j}U_{j}U_{j}^{T})=\sum_{j=\kappa+1}^{m}\lambda_{j}.\]

Since the eigenvalues satisfy $\lambda_{j}=O(e^{-\zeta j})$, summing the geometric series gives $\sum_{j=\kappa+1}^{m}\lambda_{j}=O\big(e^{-\zeta(\kappa+1)}\frac{1-e^{-\zeta(m-\kappa)}}{1-e^{-\zeta}}\big)=O(e^{-\zeta\kappa}-e^{-\zeta m})$. Therefore the first term is

\[tr\Big[\sum_{i=1}^{m}\lambda_{i}U_{i}U_{i}^{T}\sum_{j=\kappa+1}^{m}U_{j}U_{j}^{T}\Big]=O(e^{-\zeta\kappa}-e^{-\zeta m}).\]

The second term is

\[tr\Big[\sum_{i=1}^{m}\lambda_{i}U_{i}U_{i}^{T}\sum_{j=\kappa+1}^{m}(\hat{U}_{j}-U_{j})U_{j}^{T}\Big]=\sum_{i=1}^{m}\sum_{j=\kappa+1}^{m}tr[\lambda_{i}U_{i}U_{i}^{T}(\hat{U}_{j}-U_{j})U_{j}^{T}]\]
\[=\sum_{i=1}^{m}\sum_{j=\kappa+1}^{m}tr[\lambda_{i}U_{j}^{T}U_{i}U_{i}^{T}(\hat{U}_{j}-U_{j})]=\sum_{j=\kappa+1}^{m}\lambda_{j}U_{j}^{T}(\hat{U}_{j}-U_{j}).\]

Since $\left\|U_{j}\right\|=1$, we have $|U_{j}^{T}(\hat{U}_{j}-U_{j})|\leq\left\|U_{j}\right\|\left\|\hat{U}_{j}-U_{j}\right\|=O_{p}(n^{-\Delta})$, and hence $\sum_{j=\kappa+1}^{m}\lambda_{j}U_{j}^{T}(\hat{U}_{j}-U_{j})=O_{p}(n^{-\Delta}(e^{-\zeta\kappa}-e^{-\zeta m}))$.

The third term is handled in the same way as the second:

\[tr\Big[\sum_{i=1}^{m}\lambda_{i}U_{i}U_{i}^{T}\sum_{j=\kappa+1}^{m}U_{j}(\hat{U}_{j}-U_{j})^{T}\Big]=\sum_{i=1}^{m}\sum_{j=\kappa+1}^{m}tr[\lambda_{i}U_{i}U_{i}^{T}U_{j}(\hat{U}_{j}-U_{j})^{T}]=\sum_{j=\kappa+1}^{m}\lambda_{j}(\hat{U}_{j}-U_{j})^{T}U_{j}=O_{p}(n^{-\Delta}(e^{-\zeta\kappa}-e^{-\zeta m})).\]

Finally, the fourth term is

\[tr\Big[\sum_{i=1}^{m}\lambda_{i}U_{i}U_{i}^{T}\sum_{j=\kappa+1}^{m}(\hat{U}_{j}-U_{j})(\hat{U}_{j}-U_{j})^{T}\Big]=\sum_{i=1}^{m}\sum_{j=\kappa+1}^{m}tr[\lambda_{i}U_{i}U_{i}^{T}(\hat{U}_{j}-U_{j})(\hat{U}_{j}-U_{j})^{T}]\]
\[=\sum_{i=1}^{m}\sum_{j=\kappa+1}^{m}tr[\lambda_{i}U_{i}^{T}(\hat{U}_{j}-U_{j})(\hat{U}_{j}-U_{j})^{T}U_{i}]=\sum_{i=1}^{m}\sum_{j=\kappa+1}^{m}\lambda_{i}[U_{i}^{T}(\hat{U}_{j}-U_{j})]^{2}=O_{p}(n^{-2\Delta}e^{-\zeta})=O_{p}(n^{-2\Delta}).\]

Putting the four bounds together,

\[\left\|Z-Z^{*}\right\|_{\sigma}^{2}=O_{p}\big((e^{-\zeta\kappa}-e^{-\zeta m})+n^{-\Delta}(e^{-\zeta\kappa}-e^{-\zeta m})+n^{-2\Delta}\big)=O_{p}(e^{-\zeta\kappa}-e^{-\zeta m}+n^{-2\Delta}).\]
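The leading term of this bound is simply the sum of the discarded eigenvalues. The sketch below (hypothetical dimensions, not the paper's code) draws $Z$ from a distribution with an exponentially decaying spectrum, estimates $\hat{U}$ from a sample, and compares the empirical reconstruction error with $\sum_{j>\kappa}\lambda_{j}$.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, kappa, zeta = 20, 500, 5, 0.6          # hypothetical sizes

# Population covariance with eigenvalues lambda_j = exp(-zeta * j).
lam = np.exp(-zeta * np.arange(1, m + 1))
Q, _ = np.linalg.qr(rng.normal(size=(m, m)))
cov = Q @ np.diag(lam) @ Q.T

Z = rng.multivariate_normal(np.zeros(m), cov, size=n)   # rows are draws of Z

# Estimated eigenvectors from the sample covariance, in decreasing order.
evals, evecs = np.linalg.eigh(np.cov(Z, rowvar=False))
U_hat = evecs[:, np.argsort(evals)[::-1]]

# Reconstruction from the first kappa estimated principal directions.
P = U_hat[:, :kappa] @ U_hat[:, :kappa].T
Z_star = Z @ P
emp_err = np.mean(np.sum((Z - Z_star) ** 2, axis=1))    # approximates E||Z - Z*||^2

tail = lam[kappa:].sum()                                 # sum_{j > kappa} lambda_j
print(emp_err, "vs tail eigenvalue sum", tail)
```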

C.7 Proof of Proposition C.5

Proposition C.5 states: when $\delta=\frac{1}{n}$, $\log N_{\delta,2}\leq C_{1}|\mathcal{A}|(n^{\alpha^{*}}(\log n)^{\xi^{*}}+d_{1}^{*}\kappa)(\log n)^{1+\xi^{*}}$ for some constant $C_{1}$.

Proof.

To prove this result, consider the function class $\mathcal{F}_{3}=\{f_{3}:S\rightarrow\mathbb{R}:f_{3}(x,z)=f(x,\begin{pmatrix}I_{\kappa}&0_{\kappa\times(m-\kappa)}\end{pmatrix}\hat{\Sigma}^{-\frac{1}{2}}\hat{U}^{T}z)=f(x,\hat{v}),\ f\in\mathcal{F}_{SReLU}(L^{*},\{d_{j}^{*}\}_{j=0}^{L^{*}+1},s^{*},V_{max}),\ d_{0}=\kappa+m_{0}\}$. Recalling the definition of $\mathcal{F}_{2}$ in Section A, for every $f\in\mathcal{F}_{2}$ we have $f(\cdot,a)\in\mathcal{F}_{3}$; that is, $\mathcal{F}_{3}$ is the function class used to model each action component $a$ of $Q(s,a)$. There is a relation between $N_{\frac{1}{n},2}$, the cardinality of the minimal $\delta$-covering set of $\mathcal{F}_{2}$, and $N_{\frac{1}{n},3}$, the cardinality of the minimal $\delta$-covering set of $\mathcal{F}_{3}$. With arguments similar to the step $|\mathcal{N}(\delta,\mathcal{F}_{0},\left\|\cdot\right\|_{\infty})|\leq|\mathcal{N}_{\delta}|^{|\mathcal{A}|}$ in equation (6.26) of Fan et al. (2020), we can show that

\[\log N_{\frac{1}{n},2}\leq|\mathcal{A}|\log N_{\frac{1}{n},3}. \qquad (20)\]

We now argue that $\mathcal{F}_{3}$ is itself a type of sparse ReLU network, so that Lemma 6.4 in Fan et al. (2020) applies directly. Specifically, $\mathcal{F}_{3}\subset\mathcal{F}_{SReLU,3}$, where $\mathcal{F}_{SReLU,3}$ is a simplified notation for $\mathcal{F}_{SReLU}(L^{*}+1,\{\{d_{i}\}_{i=0}^{L^{*}+2}:d_{0}=m,\ d_{i+1}=d_{i}^{*},\ i\in\{0,1,2,\dots,L^{*}+1\}\},s^{*}+m\kappa,V_{max})$. That is, the one extra layer of $\mathcal{F}_{3}$ is exactly the linear transformation $\hat{v}=\begin{pmatrix}I_{\kappa}&0_{\kappa\times(m-\kappa)}\end{pmatrix}\hat{\Sigma}^{-\frac{1}{2}}\hat{U}^{T}z$. Since we do not know whether the weight matrix $\begin{pmatrix}I_{\kappa}&0_{\kappa\times(m-\kappa)}\end{pmatrix}\hat{\Sigma}^{-\frac{1}{2}}\hat{U}^{T}$ is sparse, the original sparsity level is increased from $s^{*}$ to $s^{*}+m\kappa$.

The cardinality of the $\delta$-covering set of a sparse ReLU network class is bounded in Lemma 6.4 of Fan et al. (2020), which we apply here. The cardinality of the $\delta$-covering set of $\mathcal{F}_{3}$, $N_{\frac{1}{n},3}$, satisfies

\[\log N_{\frac{1}{n},3}\leq\log|\mathcal{N}(\delta=\tfrac{1}{n},\mathcal{F}_{SReLU,3},\left\|\cdot\right\|_{\infty})|\leq(s^{*}+m\kappa+1)\log\Big[2n(L^{*}+2)\prod_{l=0}^{L^{*}+1}(d_{l}^{*}+1)\Big].\]

Here $L^{*}=O((\log n)^{\xi^{*}})$, $m+m_{0}\leq\min_{j\in\{1,2,\dots,L^{*}\}}d_{j}^{*}\leq\max_{j\in\{1,2,\dots,L^{*}\}}d_{j}^{*}=O(n^{\xi^{*}})$ and $s^{*}\asymp n^{\alpha^{*}}(\log n)^{\xi^{*}}+d_{1}^{*}\kappa$ for some constant $\xi^{*}>1+2\xi$. Therefore

\[\log N_{\frac{1}{n},3}\leq C(n^{\alpha^{*}}(\log n)^{\xi^{*}}+d_{1}^{*}\kappa+m\kappa)\log\Big[2n(L^{*}+2)\prod_{l=0}^{L^{*}+1}(d_{l}^{*}+1)\Big]\leq C^{\prime}(n^{\alpha^{*}}(\log n)^{\xi^{*}}+d_{1}^{*}\kappa)\log\Big[2n(L^{*}+2)\prod_{l=0}^{L^{*}+1}(d_{l}^{*}+1)\Big]\]
\[\leq C^{\prime\prime}(n^{\alpha^{*}}(\log n)^{\xi^{*}}+d_{1}^{*}\kappa)[\log n+\log L^{*}+L^{*}\log(n^{\xi^{*}})]\leq C_{1}(n^{\alpha^{*}}(\log n)^{\xi^{*}}+d_{1}^{*}\kappa)(\log n)^{\xi^{*}+1},\]

where $C,C^{\prime},C^{\prime\prime},C_{1}$ are constants.

Then from equation (20), we obtain $\log N_{\frac{1}{n},2}\leq C_{1}|\mathcal{A}|(n^{\alpha^{*}}(\log n)^{\xi^{*}}+d_{1}^{*}\kappa)(\log n)^{\xi^{*}+1}$. ∎

Appendix D Details of Selecting Number of Principal Components

The optimal number of principal components $\kappa^{*}$ that balances the bias and variance satisfies $\kappa^{*}\asymp\log(n)$.

Proof.

The term involving $\kappa$ in the bound of Theorem 4.8 is $\kappa|\mathcal{A}|d_{1}^{*}n^{-1}(\log n)^{\xi^{*}+1}+CL_{\mathcal{F}_{0}}e^{-\zeta\kappa}$ for some constant $C>0$. Taking the derivative of this term with respect to $\kappa$ and setting it to zero shows that the term is minimized at $\kappa^{*}=\frac{1}{\zeta}[\log(C\zeta L_{\mathcal{F}_{0}}|\mathcal{A}|^{-1}{d_{1}^{*}}^{-1})+\log n-(\xi^{*}+1)\log(\log n)]$. Hence the optimal $\kappa^{*}$ that balances the bias and variance satisfies $\kappa^{*}\asymp\log(n)$.
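For completeness, the first-order condition behind this choice of $\kappa^{*}$ is

\[|\mathcal{A}|d_{1}^{*}n^{-1}(\log n)^{\xi^{*}+1}-\zeta CL_{\mathcal{F}_{0}}e^{-\zeta\kappa^{*}}=0,\qquad\text{i.e.,}\qquad e^{-\zeta\kappa^{*}}=\frac{|\mathcal{A}|d_{1}^{*}(\log n)^{\xi^{*}+1}}{\zeta CL_{\mathcal{F}_{0}}\,n},\]

and taking logarithms gives the expression for $\kappa^{*}$ displayed above.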

This $\log n$ guideline applies when $n$ is large, as it is derived from the asymptotic results above. In addition, $\kappa\leq m$ must hold, since the number of principal components cannot exceed the original dimension. Thus, as $n$ grows to infinity, we eventually take $\kappa^{*}=m$. In other words, when the training sample size is extremely large, there is no longer any need to reduce the dimension of the high-frequency part, and Algorithm 1 becomes equivalent to the original neural fitted Q-iteration.
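In practice, as in Figures 3 and 7, $\kappa$ can be chosen as the smallest number of principal components whose cumulative proportion of explained variance reaches a pre-specified threshold. A minimal sketch of this rule, assuming a data matrix with observations of $Z$ in its rows and a hypothetical 95% threshold (not the code used in the paper):

```python
import numpy as np

def select_kappa(Z, threshold=0.95):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `threshold`."""
    Zc = Z - Z.mean(axis=0)                      # center the columns
    evals = np.linalg.eigvalsh(np.cov(Zc, rowvar=False))[::-1]
    ratio = np.cumsum(evals) / evals.sum()
    return int(np.searchsorted(ratio, threshold) + 1)

# Example with an exponentially decaying spectrum (zeta = 0.6, m = 27).
rng = np.random.default_rng(3)
m = 27
lam = np.exp(-0.6 * np.arange(1, m + 1))
Q, _ = np.linalg.qr(rng.normal(size=(m, m)))
Z = rng.multivariate_normal(np.zeros(m), Q @ np.diag(lam) @ Q.T, size=300)
print(select_kappa(Z))    # typically a small number, e.g. 5 or 6
```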

Figure 1: Illustration of a ReLU network.
Figure 2: Left: estimated value of the policies produced by Algorithm 1 as $\kappa$ varies over $\{2,6,10,14,\dots,74\}$ (the shaded area is the 95% confidence interval). Right: proportion of variance explained by the first $\kappa$ principal components.
Figure 3: Upper: variance explained by the first $\kappa$ principal component scores when there are $J=1,2,4,6,8$ high-frequency variables in the first simulation setting (corresponding to Figure 4). The horizontal dashed line marks 95% of variance explained; $\kappa=5$ explains 95% of the variance in all cases $J=1,2,4,6,8$ of the first setting. Lower: variance explained by the first $\kappa$ principal component scores when there are $J=1,2,4,6,8$ high-frequency variables in the second simulation setting (corresponding to Figure 5). Here the numbers of principal components are $\kappa=5,10,20,30,40$ for the five cases $J=1,2,4,6,8$, respectively, to ensure 95% of variance explained. The eigenvalues of the concatenated high-frequency vector $Z$ decay at an exponential rate $\lambda_{k}=O(e^{-\zeta k})$ with $\zeta=0.6$.
Figure 4: First simulation setting. Left: training data with $N=6$ and $T=60,75,90,105,120$. Right: training data with $T=120$ and $N=6,7,8,9,10$, when there are $1,2,4,6,8$ high-frequency variables of dimension $27$ (the shaded area is the 95% confidence interval). In the legend, “PCA” refers to $\pi_{K}^{PCA}$; “all z” refers to $\pi_{K}^{ALL}$; “bottleneck z” refers to $\pi_{K}^{BOTTLE}$; “mean z” refers to $\pi_{K}^{AVE}$.
Figure 5: Second simulation setting. Left: training data with $N=6$ and $T=60,75,90,105,120$. Right: training data with $T=120$ and $N=6,7,8,9,10$, when there are $1,2,4,6,8$ high-frequency variables of dimension $27$ (the shaded area is the 95% confidence interval). In the legend, “PCA” refers to $\pi_{K}^{PCA}$; “all z” refers to $\pi_{K}^{ALL}$; “bottleneck z” refers to $\pi_{K}^{BOTTLE}$; “mean z” refers to $\pi_{K}^{AVE}$.
Figure 6: Performance of $\pi^{PCA}$ and $\pi^{ALL}$ as the training sample size becomes extremely large ($T$ fixed at $120$; $N$ from $10$ to $200$, and $4000$). The x-axis shows $\log N$, and the shaded area is the 95% confidence interval. In the legend, “PCA” refers to $\pi_{K}^{PCA}$; “all z” refers to $\pi_{K}^{ALL}$.
Figure 7: Proportion of variance explained by the first $\kappa$ principal components of the CGM blood glucose level in the OhioT1DM data.
Table 1: Estimates of the value differences between $\pi_{k}^{PCA}$ and the other three policies.

Difference         $\pi_{k}^{PCA}-\pi_{k}^{ALL}$   $\pi_{k}^{PCA}-\pi_{k}^{AVE}$   $\pi_{k}^{PCA}-\pi_{k}^{BOTTLE}$
Mean               1.163                           1.399                           1.153
Margin of Error    0.181                           0.224                           0.145
Table 2: Value estimates of the four policies.

Value Estimate     $\pi_{k}^{PCA}$   $\pi_{k}^{ALL}$   $\pi_{k}^{AVE}$   $\pi_{k}^{BOTTLE}$
Mean               -8.762            -9.926            -10.162           -9.916
Standard Error     0.055             0.096             0.121             0.086