Online Estimation and Inference for Robust Policy Evaluation in Reinforcement Learning
Abstract
Recently, reinforcement learning has gained prominence in modern statistics, with policy evaluation being a key component. Unlike the traditional machine learning literature on this topic, our work emphasizes statistical inference for the parameter estimates computed by reinforcement learning algorithms. While most existing analyses assume that random rewards follow standard distributions, which limits their applicability, we embrace the concept of robust statistics in reinforcement learning by simultaneously addressing outlier contamination and heavy-tailed rewards within a unified framework. In this paper, we develop an online robust policy evaluation procedure and establish the limiting distribution of our estimator based on its Bahadur representation. Furthermore, we develop a fully online procedure to efficiently conduct statistical inference based on the asymptotic distribution. This paper bridges the gap between robust statistics and statistical inference in reinforcement learning, offering a more versatile and reliable approach to policy evaluation. Finally, we validate the efficacy of our algorithm through numerical experiments conducted in real-world reinforcement learning environments.
Keywords: Statistical inference; dependent samples; policy evaluation; reinforcement learning; online learning.
1 Introduction
Reinforcement learning has achieved immense success and remarkable breakthroughs in a variety of application domains, including autonomous driving, precision medicine, recommendation systems, and robotics (to name a few, e.g., Murphy, 2003; Kormushev et al., 2013; Mnih et al., 2015; Shi et al., 2018). From recommendation systems to mobile health (mHealth) interventions, reinforcement learning can be used to adaptively make personalized recommendations and optimize intervention strategies learned from retrospective behavioral and physiological data. While the achievements of reinforcement learning algorithms in applications are undisputed, the reproducibility and reliability of their results remain in many ways nascent.
These recommendation and health applications enjoy great flexibility and affordability thanks to the development of reinforcement learning algorithms, yet they call for reliable and trustworthy uncertainty quantification of such implementations. The reliability of such implementations can even be a matter of life and death in emerging applications. For example, in autonomous driving, it is critical to avoid dangerous exploration, guided by suitable uncertainty measures, in the trial-and-error learning procedure. This concern also extends to other applications, including precision medicine and autonomous robotics. From the statistical perspective, it is important to quantify the uncertainty of a point estimate, with complementary hypothesis testing, to reveal or justify the reliability of the learning procedure.
Policy evaluation plays a cornerstone role in typical reinforcement learning (RL) algorithms such as temporal difference (TD) learning. As one of the most commonly adopted algorithms for policy evaluation in RL, TD learning iteratively provides an estimator of the value function of a given policy based on samples from a Markov chain. In large-scale RL tasks where the state space is prohibitively large, a typical procedure for scalable yet efficient estimation of the value function is linear function approximation. This procedure can be formulated as a linear stochastic approximation problem (Sutton, 1988; Tsitsiklis and Van Roy, 1997; Sutton et al., 2009; Ramprasad et al., 2022), which sequentially solves a deterministic equation defined by a matrix-vector pair using a sequence of unbiased random observations governed by an ergodic Markov chain.
The earliest and most prototypical stochastic approximation algorithm is the Robbins-Monro algorithm introduced by Robbins and Monro (1951) for solving a root-finding problem in which the function is available only as an expected value of noisy observations. The algorithm has generated profound interest in stochastic optimization and machine learning for minimizing a loss function using random samples. For an optimization problem, the first-order condition can be cast as such a root-finding problem, and the corresponding Robbins-Monro algorithm is often referred to as a first-order method, more widely known as stochastic gradient descent (SGD) in the machine learning literature. It is well established in the literature that its averaged version (Ruppert, 1988; Polyak and Juditsky, 1992), as an online first-order method, achieves optimal statistical efficiency when estimating model parameters in statistical models, which seemingly diminishes the interest in developing second-order methods that use additional information to aid convergence. That being said, it is often observed in practice that first-order algorithms suffer significant accuracy loss in non-asymptotic convergence as well as severe instability with respect to the choice of hyperparameters, specifically the stepsizes (a.k.a. learning rates). In addition, stepsize tuning further complicates the quantification of uncertainty associated with the algorithm output. Despite these known drawbacks, first-order stochastic methods have historically been favored in machine learning tasks due to their computational efficiency, as such tasks primarily focus on estimation. On the other hand, when the emphasis of the task lies on statistical inference, some computation of second-order information is generally inevitable during the inferential procedure, which shakes the supremacy of first-order methods over second-order algorithms.
In light of this, we propose a second-order online algorithm that utilizes second-order information to perform policy evaluation sequentially. Meanwhile, our algorithm can be used to conduct statistical inference in an online fashion, allowing for the characterization of uncertainty in the estimation of the value function. Such a procedure incurs no extra per-iteration computation or storage cost beyond that of typical first-order stochastic methods featuring statistical inference (see, e.g., Chen et al., 2020 for SGD and Ramprasad et al., 2022 for TD). More importantly, we show theoretically that the proposed algorithm converges faster in terms of the remainder term than first-order stochastic approximation methods, and we observe significant differences in numerical experiments. In addition, the proposed algorithm is free from stepsize tuning, which has long been a substantial criticism of first-order algorithms.
Another challenge to the reliability of reinforcement learning algorithms lies in the modeling assumptions. Most algorithms in RL follow the "optimism in the face of uncertainty" paradigm, and such procedures are vulnerable to manipulation (see some earlier explorations in, e.g., Everitt et al., 2017; Wang et al., 2020). In practice, it is often unrealistic to believe that rewards along the entire trajectory follow exactly the same underlying model. Indeed, non-standard behavior of the rewards occurs from time to time in practice. Model misspecification and the presence of outliers are very common in an RL environment, especially one with a large time horizon. It is therefore of substantial interest to design a robust policy evaluation procedure. In pursuit of this, our proposed algorithm uses a smoothed Huber loss to replace the least-squares loss function used in classical TD learning, which is tailored to handle both outliers and heavy-tailed rewards. To model outlier observations of rewards in reinforcement learning, we bring the static ε-contamination model (Huber, 1992) to an online setting with dependent samples. In the static offline robust estimation problem, one aims to learn a distribution of interest P, where a sample of size n is drawn i.i.d. from the mixture distribution (1 − ε)P + εQ, and Q denotes an arbitrary outlier distribution. We adapt robust estimation to an online environment, where the observations are no longer independent and the occurrence times of outliers are unknown. In contrast to the offline setting, future observations cannot be used in earlier periods in an online setting. Therefore, in earlier periods, there is very limited information to help determine whether an observation is an outlier. In addition to this discrepancy between the online decision process and offline estimation, we further allow the outlier reward models to differ across time instead of coming from a fixed distribution, and such rewards may be arbitrary and even adversarially adaptive to historical information. Besides the outlier model, our framework also accommodates rewards with heavy-tailed distributions. This substantially relaxes the boundedness condition commonly assumed in the policy evaluation literature.
We summarize the challenges and contributions of this paper in the following facets.
• We propose an online policy evaluation method with dependent samples and simultaneously conduct statistical inference for the model parameters in a fully online fashion. Furthermore, we establish a Bahadur representation of the proposed estimator, which includes a main term corresponding to the asymptotic normal distribution and a higher-order remainder term. To the best of our knowledge, there is no literature establishing the Bahadur representation for online policy evaluation. Moreover, the representation shows that our algorithm matches the offline oracle and converges strictly faster to the asymptotic distribution than a prototypical first-order stochastic method such as TD learning.
• Compared to the existing reinforcement learning literature, our proposed algorithm features an online generalization of the ε-contamination model in which the rewards contain outliers or arbitrary corruptions. Our proposed algorithm is robust to adversarial corruptions that can be adaptive to the trajectory, as well as to heavy-tailed distributions of the rewards. Due to the existence of outliers, we use a smoothed Huber loss whose thresholding parameter is carefully specified to change over time to accommodate the online streaming data. From a theoretical standpoint, a robust policy evaluation procedure forces the update step to be a non-linear function of the current iterate, which brings additional technical challenges compared to the analysis of classical TD learning algorithms (see, e.g., Ramprasad et al., 2022) based on linear stochastic approximation.
• Our proposed algorithm is based on a dedicated averaged version of a second-order method, in which a surrogate Hessian is obtained and used in each update step. This second-order information frees the proposed algorithm from stepsize tuning while still ensuring efficient implementation. Furthermore, our proposed algorithm stands out distinctly from conventional first-order stochastic approximation approaches, which fall short of attaining the optimal offline remainder rate. On the other hand, while deterministic second-order methods do excel in offline scenarios, they lack the online adaptability crucial for real-time applications.
1.1 Related Works
Conducting statistical inference for model parameters in stochastic approximation has attracted great interest in the past decade, building on the asymptotic distribution of averaged stochastic approximation first established in Ruppert (1988) and Polyak and Juditsky (1992). This asymptotic distribution has since been used to conduct online statistical inference. For example, Fang et al. (2018) presented a perturbation-based resampling procedure for inference. Chen et al. (2020) proposed two online procedures to estimate the asymptotic covariance matrix in order to conduct inference. Beyond works focused on inference procedures, Shao and Zhang (2022) established the remainder term and Berry-Esseen bounds of the averaged SGD estimator. Givchi and Palhang (2015) and Mou et al. (2021) analyzed averaged SGD with Markovian data. Second-order stochastic algorithms were analyzed in Ruppert (1985), Schraudolph et al. (2007), and Byrd et al. (2016), and applied to TD learning in Givchi and Palhang (2015).
Under online decision-making settings, including bandit algorithms and reinforcement learning, a few existing works have focused on statistical inference of the model parameters or uncertainty quantification of value functions. Deshpande et al. (2018), Chen et al. (2021a, b), Zhan et al. (2021), and Zhang et al. (2021, 2022) studied statistical inference for linear models, M-estimation, and non-Markovian environments with data collected via a bandit algorithm, respectively. Thomas et al. (2015) proposed high-confidence off-policy evaluation based on a Bernstein inequality. Hanna et al. (2017) presented two bootstrap methods to compute confidence bounds for off-policy value estimates. Dai et al. (2020) and Feng et al. (2021) constructed confidence intervals for value functions based on optimization formulations, and Jiang and Huang (2020) derived a minimax value interval, both with i.i.d. samples. Shi et al. (2021a) developed online inference procedures for high-dimensional problems. Shi et al. (2021b) proposed inference procedures for Q-functions in RL via sieve approximations. Hao et al. (2021) studied multiplier bootstrap algorithms to offer uncertainty quantification for exploration in fitted Q-evaluation. Syrgkanis and Zhan (2023) studied a re-weighted Z-estimator on episodic RL data and conducted inference on the structural parameter.
The work most relevant to ours is Ramprasad et al. (2022), who studied a bootstrap online statistical inference procedure under Markov noise using a quadratic SGD and demonstrated its application in the classical TD (and gradient TD) algorithms in RL. Our proposed procedure and analysis differ in at least two aspects. First, our proposed estimator is a Newton-type second-order approach that enjoys faster convergence and an optimal remainder rate. In addition, we show both analytically and numerically that the computational cost of our procedure is typically lower than that of Ramprasad et al. (2022) for an inference task. Second, our proposed algorithm is a robust alternative to TD algorithms, featuring a non-quadratic loss function to handle potential outliers and heavy-tailed rewards. There exists limited RL literature on either outliers or heavy-tailed rewards. In a recent work, Li and Sun (2023) studied online linear stochastic bandits and linear Markov decision processes in the presence of heavy-tailed rewards. While both our work and theirs apply the pseudo-Huber loss, our work focuses on policy evaluation with statistical inference guarantees.
1.2 Paper Organization and Notations
The remainder of this paper is organized as follows. In Sections 2 and 3, we present and discuss our proposed algorithm for robust policy evaluation in reinforcement learning. Theoretical results on convergence rates, asymptotic normality, and the Bahadur representation are presented in Section 3. In Section 4, we develop an estimator for the long-run covariance matrix to construct confidence intervals in an online fashion and provide its theoretical guarantee. Simulation experiments are provided in Section 5 to demonstrate the effectiveness of our method. Concluding remarks are given in Section 6. All proofs are deferred to the Appendix.
For every vector , denote , , and . For simplicity, we denote and as the unit sphere and unit ball in centered at . Moreover, we use as the support of the vector . For every matrix , define as the matrix operator norms, and as the largest and smallest singular values of respectively. The symbols () denote the greatest integer (the smallest integer) not larger than (not less than) . We denote . For two sequences , we say when and hold at the same time. We say if . For a sequence of random variables , we denote if there holds , and denote if there holds . Lastly, the generic constants are assumed to be independent of and .
2 Robust Policy Evaluation in Reinforcement Learning
We first review least-squares temporal difference methods in RL. Consider a Markov decision process specified by a tuple comprising the state space, the action (control) set, the transition kernel, and the reward function; the state space is taken to be finite. One of the core steps in RL is to estimate the cumulative reward (also called the state value function) for a given policy:
where γ ∈ (0, 1) is a given discount factor and s is any state. Here the environment states are usually modeled by a Markov chain. In real-world RL applications, the state space is often so large that one cannot directly compute value estimates for every state. A common approach in modern RL is to approximate the value function, i.e., to let
be a linear approximation of the value function, where the vector θ contains the model parameters and φ(s) denotes the feature vector corresponding to the state s. The feature components are taken to be linearly independent, so that we use a low-dimensional linear approximation (with the feature vectors as basis) of the value function. Collecting the feature vectors of all states as the rows of a matrix Φ, the approximated value function can be written compactly as Φθ.
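As a concrete illustration of linear value-function approximation, the following is a minimal sketch; the state indexing, the feature dimension, and the randomly drawn features are placeholders rather than the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
num_states, d = 10, 3                    # placeholder sizes
Phi = rng.normal(size=(num_states, d))   # feature matrix: one row phi(s) per state

def value_approx(theta, state):
    """Approximate the value of a state by the inner product of its features with theta."""
    return Phi[state] @ theta

theta = np.zeros(d)
print(value_approx(theta, 0))            # 0.0 for the zero parameter vector
```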
The state value function satisfies the Bellman equation

V^{\pi}(s) = \mathbb{E}_{\pi}\big[\, r(s, a) + \gamma\, V^{\pi}(s') \,\big|\, s \,\big], \qquad (1)

where s' is the next state transitioned from s. Let P denote the probability transition matrix induced by the policy and define the Bellman operator accordingly. The state value function is the unique fixed point of the Bellman operator. When the value function is replaced by its linear approximation, the Bellman equation may not hold exactly.
The classical temporal difference method attempts to find a parameter vector such that the linear approximation well approximates the value function. In particular, the TD algorithm updates

\theta_{t+1} = \theta_t + \eta_t \big( r_t + \gamma\, \phi(s_{t+1})^{\top}\theta_t - \phi(s_t)^{\top}\theta_t \big)\, \phi(s_t), \qquad (2)
where \eta_t is the step size, which often requires careful tuning. We next illustrate the key point that (2) leads to a good estimator whose value approximation is close to the true value function. It can be shown that the iterate converges to an unknown population parameter that minimizes the expected squared difference in the Bellman equation,
(3)
It is easy to see that such a minimizer exists and satisfies the following first-order condition,
\mathbb{E}\big[ \phi(s)\,\big( r(s, a) + \gamma\, \phi(s')^{\top}\theta - \phi(s)^{\top}\theta \big) \big] = 0. \qquad (4)
Under the condition that the associated population matrix is positive definite (see Tsitsiklis and Van Roy, 1997), the solution of (4) admits a closed form.
The update (2) is a first-order stochastic algorithm for solving the stochastic root-finding problem (4) using a sequence of observations. In this paper, we refer to the resulting estimator as the least-squares temporal difference estimator. See Kolter and Ng (2009) for details on the properties of the least-squares TD estimator.
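For readers who prefer code, the following is a minimal sketch of the classical linear TD(0) update in (2), assuming the standard temporal-difference form with stepsize eta; the function and variable names are illustrative.

```python
import numpy as np

def td0_update(theta, phi_s, phi_s_next, reward, gamma, eta):
    """One linear TD(0) step: move theta along the temporal-difference error."""
    td_error = reward + gamma * (phi_s_next @ theta) - phi_s @ theta
    return theta + eta * td_error * phi_s
```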
Least-squares-based methods are often criticized for their sensitivity to outliers in the data. When some observations of the reward may contain outliers, it is natural to design a robust estimator of the value function. Following the widely known classical literature (Huber, 1964; Charbonnier et al., 1994, 1997; Hastie et al., 2009), we replace the squared loss with a smoothed (pseudo-)Huber loss, parametrized by a thresholding parameter. We define a similar fixed-point equation with the smoothed Huber loss by
(5)
In this section, when we motivate the algorithm, we assume that the fixed point exists. (This assumption is used for motivation only; our algorithm and theory do not depend on the existence of the fixed point.) As the thresholding parameter goes to infinity, the objective in (5) approaches the least-squares loss in (3), and the corresponding fixed point should be close to that of (3). When the thresholding parameter tends to zero, problem (5) becomes one with a nonsmooth least absolute deviation (LAD) loss, which is beyond the scope of this paper. In this paper, we carefully specify the thresholding parameter to balance statistical efficiency against the effect of potential outliers in an online fashion (see Theorem 1).
By the first-order condition of (5), we obtain an estimation function similar to (4),
(6)
where the function inside the expectation is the derivative of the smoothed Huber loss. Instead of using a first-order iteration with the estimation equation (6) as in TD (2), we propose a Newton-type iterative estimator, which avoids tuning of the learning rate. Newton-type estimators are often referred to as second-order methods in convex optimization. Nonetheless, equation (5) is for illustration purposes only and cannot be directly optimized, since the fixed point itself appears in the objective function to be minimized. Instead, our proposed method is a Newton-type method for solving the root-finding problem (6) using a sequence of observations.
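For concreteness, here is a minimal sketch of a pseudo-Huber (Charbonnier-type) loss and its derivative, assuming the common parametrization tau^2 * (sqrt(1 + (x/tau)^2) - 1); the paper's exact parametrization of the smoothed Huber loss may differ.

```python
import numpy as np

def pseudo_huber_loss(x, tau):
    """Smoothed Huber loss: quadratic near zero, linear in the tails."""
    return tau**2 * (np.sqrt(1.0 + (x / tau) ** 2) - 1.0)

def pseudo_huber_grad(x, tau):
    """Derivative of the pseudo-Huber loss; bounded by tau in absolute value."""
    return x / np.sqrt(1.0 + (x / tau) ** 2)
```

The derivative behaves like the identity for small residuals and saturates for large ones, which is what caps the influence of outlying rewards.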
In this paper, we model two types of noise in the observed rewards. The first is the classical Huber contamination model (Huber, 1992, 2004), in which an ε-fraction of rewards comes from arbitrary distributions. The second is the heavy-tailed model, in which the reward may admit a heavy-tailed distribution. In the following section, we show that our proposed estimator is robust to both types of noise.
3 Online Newton-type Method for Parameter Estimation
In the following, we introduce an online Newton-type method for estimating the parameter in the presence of outliers and heavy-tailed noise in the rewards. For ease of presentation, at each time t we denote the observation by the feature of the current state, the feature of the next state, and the observed reward, where s_t is the state at time t. Our objective is to construct an online estimator from the estimation function (6), which can be rewritten as
(7)
based on a sequence of dependent observations .
At iteration t, our proposed Newton-type estimator updates the parameter by
(8)
Here the following matrix serves as an empirical information matrix of the estimation equation (7):
(9)
where the thresholding parameter of the Huber loss is allowed to change over time. We let it tend to infinity to eliminate the bias introduced by the smoothed Huber loss.
It is worth noting that this matrix is not the Hessian of the objective function on the right-hand side of (5). As discussed in the previous section, (5) cannot be directly optimized, nor does it lead to an M-estimation problem. Indeed, our proposed update (8) is a Newton-type method for finding the root of (7) using a sequence of observations. In the following section, we show that the use of this matrix leads to desirable convergence properties.
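The following is a schematic sketch of one robust Newton-type step for the root-finding problem (7), not the exact Algorithm 1: it assumes the surrogate information matrix is a running average of derivative-weighted feature outer products and that the robust score terms are averaged online; all names and bookkeeping details are illustrative.

```python
import numpy as np

def rope_step(theta, A_bar, z_bar, phi_s, phi_s_next, reward, gamma, tau, t):
    """One schematic robust Newton-type step (an illustration, not Algorithm 1)."""
    residual = reward + gamma * (phi_s_next @ theta) - phi_s @ theta
    psi = residual / np.sqrt(1.0 + (residual / tau) ** 2)         # pseudo-Huber score
    psi_prime = (1.0 + (residual / tau) ** 2) ** (-1.5)           # its derivative
    # running averages of the robust score and the surrogate information matrix
    z_bar = ((t - 1) * z_bar + psi * phi_s) / t
    A_bar = ((t - 1) * A_bar + psi_prime * np.outer(phi_s, phi_s - gamma * phi_s_next)) / t
    theta_new = theta + np.linalg.solve(A_bar, z_bar)             # Newton-type correction
    return theta_new, A_bar, z_bar
```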
It should be noted that (8) can be implemented efficiently in a fully online manner, i.e., without storing the trajectory of historical information. Specifically, we write (8) so that each item on the right-hand side is a running average. It is easy to see that the averaged estimator and the vector
(10)
can both be updated online. In addition, the inverse can be computed directly and efficiently online via a recursive inverse formulation. By the Sherman-Morrison formula, we have
(11)
Here we note that both terms inside the brackets on the right-hand side of (11) are scalars, not matrices.
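For reference, below is a generic rank-one inverse update via the Sherman-Morrison formula; in (11) the rank-one term would be the derivative-weighted outer product of the current features with the 1/t scaling folded in, which is an assumption about the bookkeeping rather than the paper's exact recursion.

```python
import numpy as np

def sherman_morrison_update(A_inv, u, v):
    """Return (A + u v^T)^{-1} given A^{-1}, using the Sherman-Morrison formula."""
    Au = A_inv @ u
    vA = v @ A_inv
    denom = 1.0 + v @ Au            # a scalar; the update is valid when it is nonzero
    return A_inv - np.outer(Au, vA) / denom
```

Maintaining the inverse this way keeps the per-iteration cost at a few matrix-vector products instead of a full matrix inversion.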
The complete algorithm is presented in Algorithm 1. We refer to it as Robust Online Policy Evaluation (ROPE). Compared with existing stochastic approximation algorithms for TD learning (Durmus et al., 2021; Mou et al., 2021; Ramprasad et al., 2022), our ROPE algorithm does not require tuning a step size.
Since performing the iterations in (11) requires an initial invertible matrix, we compute from the first few samples the quantities
(12)
which serve as the initial quantities of (10)-(11) for computing the subsequent iterations. Here a given initial parameter and a specified initial threshold level are used.
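A hedged sketch of the initialization step (12): it assumes the initial surrogate matrix is assembled from the first few transitions with a given initial parameter and a constant threshold; the derivative weighting and all names below are illustrative.

```python
import numpy as np

def initialize_surrogate(initial_samples, theta0, gamma, tau0):
    """Build an initial surrogate information matrix from a small warm-up sample."""
    d = len(theta0)
    A0 = np.zeros((d, d))
    for phi_s, phi_s_next, reward in initial_samples:
        residual = reward + gamma * (phi_s_next @ theta0) - phi_s @ theta0
        weight = (1.0 + (residual / tau0) ** 2) ** (-1.5)   # derivative of the pseudo-Huber score
        A0 += weight * np.outer(phi_s, phi_s - gamma * phi_s_next)
    return A0 / len(initial_samples)
```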
3.1 Convergence Rate of ROPE
Before illustrating how to conduct statistical inference on the parameter , we first provide theoretical results for our proposed ROPE method.
To characterize the weak dependence of the sequence of environment states, we use the concept of α-mixing. More precisely, we assume the state sequence is α-mixing, i.e., it satisfies

\big| \mathbb{P}(A \cap B) - \mathbb{P}(A)\,\mathbb{P}(B) \big| \le \alpha(k), \qquad (13)

for all events A measurable with respect to the states up to time t, all events B measurable with respect to the states from time t + k onward, and all t and k, where α(k) → 0 as k → ∞. The α-mixing dependence covers irreducible and aperiodic Markov chains, which are typically used to model state sampling in reinforcement learning (RL) (Rosenblatt, 1956; Rio, 2017, and others). The mixing rate α(k) quantifies the strength of dependence in the sequence. In this paper, we assume the following condition on the mixing rate.
Condition (C1).
The sequence is a stationary -mixing sequence, where the mixing rate satisfies for some .
It is easy to see that (C1) holds when the state sequence is an irreducible and aperiodic finite-state Markov chain. In addition, we assume the following boundedness condition on the feature vectors, which is commonly required in the RL literature (Sutton and Barto, 2018; Ramprasad et al., 2022).
Condition (C2).
There exists a constant such that .
Next, we assume the matrix has a bounded condition number.
Condition (C3).
There exists a constant such that , where and denote the largest and the smallest singular values, respectively.
When the state sequence is an irreducible and aperiodic Markov chain, Tsitsiklis and Van Roy (1997) proved that the corresponding matrix is positive definite. This indicates that it is nonsingular, and hence (C3) holds. Finally, we provide conditions on the temporal-difference error.
Condition (C4).
The temporal-difference error comes from the distribution , where . The distribution is an arbitrary distribution, and satisfies that holds uniformly for some constants and .
We assume that the temporal-difference error comes from the Huber contamination model. The outlier distribution can be arbitrary, and the true distribution is only required to have a finite moment of some low order, which does not necessarily imply that a variance exists. Therefore, our assumption largely relaxes the boundedness condition on the reward function in the RL literature (Sutton and Barto, 2018; Ramprasad et al., 2022). In Condition (C4), the contamination level controls the fraction of outliers among the samples; in particular, it determines the number of outliers present in the sample. Given the above conditions, by selecting the thresholding parameter as defined in (8), we obtain the following statistical rate for our proposed ROPE estimator.
Theorem 1.
The error bound in (14) contains five terms, which we discuss individually. The first term comes from the impact of outliers among the samples. Here the contamination ratio should tend to zero; otherwise, it is impossible to obtain a consistent estimator. The second term is due to the bias of the smoothed Huber loss for mean estimation. The third and fourth terms in (14) are the classical statistical rates due to the variance and the boundedness incurred by thresholding, respectively; such terms commonly appear in Bernstein-type inequalities (see, e.g., Bennett, 1962). As we can see from the third term, the convergence rate exhibits a phase transition between the regimes of finite variance and infinite variance (see also Corollary 3 below). This phenomenon has also been observed in other estimation problems with the Huber loss; see Sun et al. (2020) and Fan et al. (2022). In the presence of a higher moment condition, we can eliminate the fourth term by a more delicate analysis (see Proposition 5 below and Proposition 8 in the Appendix for more details). The last term represents how quickly the proposed iterative algorithm converges. Given an initial value with a sufficiently small error, the proposed second-order method enjoys local quadratic convergence, so this term decreases super-exponentially fast, a convergence rate characteristic of second-order methods (see, e.g., Nesterov, 2003). In particular, when the sample size, which equals the number of iterations, is reasonably large, the last term is dominated by the statistical error in the other terms of (14). The assumption on the initial error is mild: a valid initial value can be obtained by solving a fixed estimation equation with a specified constant threshold on a subsample of fixed size before running Algorithm 1, and such an equation can be solved efficiently by classical first-order root-finding algorithms.
To highlight the relationship between the convergence rate and the outlier rewards, we present the following corollary which gives a clear statement on how the rate depends on the number of outliers .
Corollary 2.
Suppose the conditions in Theorem 1 hold with . Let the thresholding parameter with some and tend to infinity. We have
where is the number of outliers among the samples of size .
The corollary shows that, even in the presence of outliers, ROPE can still estimate the parameter consistently. Note that the exponent in the threshold specification is chosen by the practitioner, and a smaller exponent allows more outliers among the samples. Furthermore, when the number of outliers is suitably small, the ROPE estimator achieves the optimal rate up to a logarithmic term.
The next corollary indicates the impact of the tail of rewards on the convergence rate. For a clear presentation, we discuss the impact under the case without outliers, i.e., .
Corollary 3.
Suppose the conditions in Theorem 1 hold with the contamination rate and let tend to infinity.
• When , we specify . Then
• When , we specify . Then
Corollary 3 illustrates a sharp phase transition between the regimes of infinite variance and finite variance, and the transition is smooth and optimal. The rate of convergence established in Corollary 3 matches the offline oracle with independent samples for the estimation of a linear regression model (Sun et al., 2020), ignoring the logarithmic term. A similar phase transition of the convergence rate can be established when the contamination rate is nonzero.
3.2 Asymptotic Normality and Bahadur Representation
In the following theorem, we give the Bahadur representation for the proposed estimator . The Bahadur representation provides a more refined rate for the estimation error. Moreover, the asymptotic normality can be established by applying the central limit theorem to the main term in the representation.
Theorem 4.
Theorem 4 provides a Bahadur representation for the proposed estimator, which contains a main term corresponding to an asymptotic normal distribution in (16) and a higher-order remainder term. The asymptotic normality and the concrete form of the covariance matrix pave the way for conducting statistical inference efficiently in an online fashion, as demonstrated in Section 4 below. To achieve the asymptotic normality, additional conditions are required on the specification of the thresholding parameters; these conditions easily hold when the practitioner specifies a polynomially growing threshold and the number of outliers is suitably small. We further note that to establish the Bahadur representation in the above theorem, we require the sixth moment of the error to exist. This condition may be weakened, and we believe a similar Bahadur representation holds under weaker moment conditions. We leave this theoretical question to future investigation.
It is worth noting that even in the case with no contamination, our result is new: as far as we know, there is no literature establishing the Bahadur representation for the TD method. To highlight the rate of convergence of the remainder term concisely, we provide the following result in the uncontaminated case.
Proposition 5.
Suppose the conditions of Theorem 4 hold with and for and some , we have for any nonzero with ,
In this proposition, we specify the thresholding parameter so that the remainder term possesses the stated order. Notably, this result is not a direct corollary of Theorems 1 and 4, since the rate in (14) of Theorem 1 is not sharp enough to guarantee that the corresponding term in (15) converges as fast as in Proposition 5. To reach this fast rate for the remainder, we establish an improved version of Theorem 1 under stronger assumptions by eliminating the fourth term of (14). The formal statement of the improved rate is relegated to Proposition 8 in the Appendix.
Remark 1.
We provide some discussion of the remainder term in the Bahadur representation. Since we have not found any result on the Bahadur representation for the TD algorithm in the literature, we compare with SGD instead, as the theoretical properties of TD are expected to be similar to those of SGD. For i.i.d. samples, Polyak and Juditsky (1992) show that the average of SGD iterates with a polynomially decaying learning rate is consistent. Their proof indicates the order of the corresponding remainder term (treating the dimension as a constant for simplicity). It is easy to see that SGD with any choice of learning rate leads to a strictly slower remainder rate than that of our proposed method.
4 Estimation of Long-Run Covariance Matrix and Online Statistical Inference
In the previous section, we provided an online Newton-type algorithm to estimate the parameter. As shown in Theorem 4, the proposed estimator has an asymptotic variance with a sandwich structure. To conduct statistical inference simultaneously with the ROPE method proposed in Section 3, we propose an online plug-in estimator for this sandwich structure. Note that an online estimator of the outer matrix is already computed and utilized in (11) of the estimation algorithm. It remains to construct an online estimator for the long-run covariance matrix in (17).
Developing an online estimator of this covariance matrix with dependent samples is more intricate than in the case of independent samples. As the samples are dependent, the covariance matrix is an infinite sum of time-lag covariance matrices. To estimate this long-run covariance matrix, we first rewrite its definition in (17) in the following form
Next, we replace each term in (17) with its empirical counterpart using the samples. Meanwhile, to handle outliers, we replace the raw error by its thresholded counterpart. In addition, the infinite sum in (17) is not feasible for direct estimation. Nonetheless, Condition (C1) on the mixing rate allows us to approximate it by estimating only the leading lag terms, up to a pre-specified constant number of lags. In summary, we can construct the following estimator for the covariance matrix,
(18)
To perform a fully online update for the estimator, we define
and the above equation (18) can be rewritten as,
In practice, at the t-th step, we keep only the most recent lag terms in memory, and the covariance matrix estimate can be updated by
(19)
The proposed estimator complements the estimator in (11) to construct an estimator of the sandwich form that appears in the asymptotic distribution in Theorem 4.
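Below is a minimal sketch of a lag-window plug-in estimator of the long-run covariance of the robust score terms, in the spirit of (18)-(19); the buffer length, the rectangular lag window, and the class interface are illustrative assumptions rather than the paper's exact update.

```python
import numpy as np
from collections import deque

class LongRunCovEstimator:
    """Online lag-window estimate of Sigma = sum_k E[g_0 g_k^T] (illustrative sketch)."""

    def __init__(self, dim, lag):
        self.buffer = deque(maxlen=lag)   # keep only the last `lag` score vectors
        self.S = np.zeros((dim, dim))     # running sum of g_i g_j^T with |i - j| <= lag
        self.t = 0

    def update(self, g):
        self.t += 1
        self.S += np.outer(g, g)
        for h in self.buffer:             # cross terms with the recent past
            self.S += np.outer(g, h) + np.outer(h, g)
        self.buffer.append(g)

    def estimate(self):
        return self.S / max(self.t, 1)
```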
Specifically, we can construct a confidence interval for the parameter using the asymptotic distribution in (16) as follows. For any unit vector, a confidence interval with the nominal level is
(20)
where the plug-in standard error is used and the critical value is the corresponding quantile of a standard normal distribution.
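A hedged sketch of the plug-in confidence interval (20), assuming a sandwich variance of the form x^T A^{-1} Sigma A^{-T} x and a standard normal critical value; the exact scaling used in the paper may differ.

```python
import numpy as np
from scipy.stats import norm

def confidence_interval(x, theta_bar, A_inv, Sigma_hat, T, q=0.05):
    """Plug-in CI for x^T theta based on the asymptotic normal approximation."""
    var_x = x @ A_inv @ Sigma_hat @ A_inv.T @ x     # sandwich variance estimate
    half_width = norm.ppf(1.0 - q / 2.0) * np.sqrt(var_x / T)
    center = x @ theta_bar
    return center - half_width, center + half_width
```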
Remark 2.
We provide some insights into the computational complexity of our inference procedure. Both the outer-matrix estimator and the covariance estimator can be computed at a low per-iteration cost, as discussed in (11). In addition, constructing the interval requires only three vector-matrix multiplications of the same complexity.
In contrast, the online bootstrap method proposed in Ramprasad et al. (2022) has a per-iteration complexity that scales with the bootstrap resampling size, which is usually much larger in practice.
In the following theorem, we provide a theoretical guarantee for the above construction of confidence intervals by showing that our online estimator of the covariance matrix is consistent.
Theorem 6.
Theorem 6 provides an upper bound on the estimation error of the covariance matrix. To achieve consistency, we specify the thresholding parameter such that
In this case, as long as the fraction of outliers is sufficiently small, we obtain a consistent estimator of the covariance matrix. In other words, our proposed robust algorithm allows a certain number of outliers, ignoring the dimension. We summarize the result in the following corollary.
Corollary 7.
Suppose the conditions of Theorem 4 hold, and the fraction of outliers satisfies , where we specify and . Then given a vector and a pre-specified confidence level , we have
where and denotes the -th quantile of a standard normal distribution.
5 Numerical Experiments
In this section, we assess the performance of our ROPE algorithm through experiments under various settings. All confidence intervals are constructed with the same nominal coverage level. Throughout the majority of the experiments, we compare our method with the online bootstrap method for linear stochastic approximation proposed in Ramprasad et al. (2022), which we refer to as LSA. The hyperparameters for LSA are selected based on the settings described in Sections 5.1 and 5.2 of Ramprasad et al. (2022).
5.1 Parameter Inference for Infinite-Horizon MDP
In the first experiment, we focus on an infinite-horizon Markov decision process (MDP) setting. Specifically, we create an environment with a finite state space and action space, and a fixed feature dimension. The transition probability kernel of the MDP, the state features, and the policy under evaluation are all randomly generated. By employing the Bellman equation (1), we can compute the expected rewards at each state under the policy. To introduce variability, we add noise to the expected rewards, drawn from different distributions such as the standard normal distribution, a t distribution with a fixed degree of freedom, and the standard Cauchy distribution. The main objective of this experiment is to construct a confidence interval for the first coordinate of the true parameter.
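A small sketch of the reward-noise generation described above; the degrees of freedom and the random seed are placeholders rather than the exact values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward_noise(kind, size, df=3):
    """Draw reward noise: light-tailed normal, heavy-tailed t, or Cauchy."""
    if kind == "normal":
        return rng.standard_normal(size)
    if kind == "t":
        return rng.standard_t(df, size)      # df is a placeholder value
    if kind == "cauchy":
        return rng.standard_cauchy(size)
    raise ValueError(f"unknown noise kind: {kind}")
```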
We set the thresholding level as a power of the sample size multiplied by a positive constant. We begin with an investigation of the influence of these two quantities on the coverage probability and width of the confidence interval produced by our ROPE algorithm. Figure 1 displays the results, revealing that the performance is relatively robust to variations in both quantities, irrespective of the type of noise applied. Notably, although our theoretical guarantees require the noise to have a finite moment of a certain order (see Theorem 4), the experiment indicates that our method exhibits a wider range of applicability.
[Figure 1]
In subsequent experiments, we fix the thresholding constants for the ROPE algorithm and compare it with the LSA method. Alongside the coverage probability and width of the confidence interval, we also assess the average running time of both methods. The findings are depicted in Figure 2. It is evident that, under light-tailed normal noise, the two methods yield comparable results, with the ROPE method consistently outperforming LSA in terms of confidence interval width. When the noise exhibits heavy-tailed characteristics, the confidence interval width of LSA is notably larger than that of ROPE. Additionally, the runtime of ROPE is roughly a quarter of that of LSA.
[Figure 2]
In the LSA method, the step size is determined by a schedule that relies on two positive hyperparameters. In this experiment, we investigate the sensitivity of the LSA method to these parameters. For comparison, we also present the result of our ROPE algorithm, which does not require any hyperparameter tuning. In this particular experiment, we specify the noise distribution to be standard normal and directly apply the online Newton step to the squared loss (as opposed to the pseudo-Huber loss). Notably, Figure 3 illustrates that the LSA method is sensitive to its step-size parameters. Specifically, for larger values of the step-size constant, the LSA method generates significantly wider confidence intervals than LSA with well-tuned hyperparameters, which is undesirable in practical applications. Meanwhile, our ROPE method always generates confidence intervals with a reasonable width and a comparable coverage rate.
[Figure 3]
5.2 Value Inference for FrozenLake RL Environment
In this section, we consider the FrozenLake environment provided by OpenAI Gym, which involves a character navigating a grid world. The character's objective is to start from the first tile of the grid and reach the end tile within each episode. If the character successfully reaches the target tile, a positive reward is obtained; otherwise, no reward is obtained. In our setup, we generate the state features randomly with a fixed dimensionality. The policy is trained using Q-learning, and the true parameter can be computed explicitly using the transition probability matrix. Under a contaminated reward model, we introduce random perturbations by replacing the true reward, with a certain probability, with a value uniformly sampled from a fixed range. Our main goal is to infer the value at the initial state.
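A sketch of the contamination mechanism described above; the contamination probability and the uniform sampling range are placeholders for the values used in the experiment.

```python
import numpy as np

rng = np.random.default_rng(1)

def contaminate_reward(reward, eps, low=-10.0, high=10.0):
    """With probability eps, replace the true reward by a uniform outlier."""
    if rng.random() < eps:
        return rng.uniform(low, high)
    return reward
```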
Following the approach in Section 5.1, we examine the sensitivity of ROPE to the thresholding parameters. We present the results in Figure 4 and observe that the performance of ROPE is influenced by both the contamination rate and the chosen thresholding parameter. Specifically, when the contamination rate is small, a larger thresholding parameter tends to yield better outcomes. However, as the contamination rate increases, the use of a large thresholding parameter can negatively impact performance.
[Figure 4]
Subsequently, we set the thresholding values according to the contamination level for the ROPE algorithm and compare its performance with that of the LSA method. Figure 5 displays the results, which clearly demonstrate that our ROPE method consistently outperforms LSA in terms of coverage rates, CI widths, and running time. The advantage of ROPE becomes more pronounced as the contamination rate increases.
[Figure 5]
6 Concluding Remarks
This paper introduces a robust online Newton-type algorithm designed for policy evaluation in reinforcement learning under the influence of heavy-tailed noise and outliers. We demonstrate the estimation consistency as well as the asymptotic normality of our estimator. We further establish the Bahadur representation and show that its remainder term converges strictly faster than that of a prototypical first-order method such as TD. To further conduct statistical inference, we propose an online approach for constructing confidence intervals. Our experimental results highlight the efficiency and robustness of our approach when compared to the existing online bootstrap method. In future research, it would be intriguing to explore online statistical inference for off-policy evaluation.
References
- Bennett (1962) Bennett, G. (1962), “Probability inequalities for the sum of independent random variables,” Journal of the American Statistical Association, 57, 33–45.
- Berkes and Philipp (1979) Berkes, I. and Philipp, W. (1979), “Approximation theorems for independent and weakly dependent random vectors,” The Annals of Probability, 7, 29–54.
- Billingsley (1968) Billingsley, P. (1968), Convergence of Probability Measures, Wiley Series in Probability and Mathematical Statistics, Wiley.
- Byrd et al. (2016) Byrd, R. H., Hansen, S. L., Nocedal, J., and Singer, Y. (2016), “A stochastic quasi-Newton method for large-scale optimization,” SIAM Journal on Optimization, 26, 1008–1031.
- Charbonnier et al. (1994) Charbonnier, P., Blanc-Feraud, L., Aubert, G., and Barlaud, M. (1994), “Two deterministic half-quadratic regularization algorithms for computed imaging,” in Proceedings of 1st International Conference on Image Processing.
- Charbonnier et al. (1997) — (1997), “Deterministic edge-preserving regularization in computed imaging,” IEEE Transactions on Image Processing, 6, 298–311.
- Chen et al. (2021a) Chen, H., Lu, W., and Song, R. (2021a), “Statistical inference for online decision making: In a contextual bandit setting,” Journal of the American Statistical Association, 116, 240–255.
- Chen et al. (2021b) — (2021b), “Statistical inference for online decision making via stochastic gradient descent,” Journal of the American Statistical Association, 116, 708–719.
- Chen et al. (2020) Chen, X., Lee, J. D., Tong, X. T., and Zhang, Y. (2020), “Statistical inference for model parameters in stochastic gradient descent,” The Annals of Statistics, 48, 251–273.
- Dai et al. (2020) Dai, B., Nachum, O., Chow, Y., Li, L., Szepesvári, C., and Schuurmans, D. (2020), “Coindice: Off-policy confidence interval estimation,” Advances in Neural Information Processing Systems.
- Deshpande et al. (2018) Deshpande, Y., Mackey, L., Syrgkanis, V., and Taddy, M. (2018), “Accurate inference for adaptive linear models,” in International Conference on Machine Learning, PMLR, pp. 1194–1203.
- Doukhan et al. (1994) Doukhan, P., Massart, P., and Rio, E. (1994), “The functional central limit theorem for strongly mixing processes,” in Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, vol. 30, pp. 63–82.
- Durmus et al. (2021) Durmus, A., Moulines, E., Naumov, A., Samsonov, S., and Wai, H.-T. (2021), “On the stability of random matrix product with Markovian noise: Application to linear stochastic approximation and TD learning,” in Proceedings of Thirty Fourth Conference on Learning Theory.
- Everitt et al. (2017) Everitt, T., Krakovna, V., Orseau, L., and Legg, S. (2017), “Reinforcement learning with a corrupted reward channel,” in Proceedings of the 26th International Joint Conference on Artificial Intelligence.
- Fan et al. (2022) Fan, J., Guo, Y., and Jiang, B. (2022), “Adaptive Huber regression on Markov-dependent data,” Stochastic Processes and their Applications, 150, 802–818.
- Fang et al. (2018) Fang, Y., Xu, J., and Yang, L. (2018), “Online bootstrap confidence intervals for the stochastic gradient descent estimator,” Journal of Machine Learning Research, 19, 1–21.
- Feng et al. (2021) Feng, Y., Tang, Z., Zhang, N., and Liu, Q. (2021), “Non-asymptotic confidence intervals of off-policy evaluation: Primal and dual bounds,” in International Conference on Learning Representations.
- Freedman (1975) Freedman, D. A. (1975), “On tail probabilities for martingales,” The Annals of Probability, 3, 100–118.
- Givchi and Palhang (2015) Givchi, A. and Palhang, M. (2015), “Quasi newton temporal difference learning,” in Proceedings of the Sixth Asian Conference on Machine Learning.
- Hanna et al. (2017) Hanna, J., Stone, P., and Niekum, S. (2017), “Bootstrapping with models: Confidence intervals for off-policy evaluation,” in Proceedings of the AAAI Conference on Artificial Intelligence.
- Hao et al. (2021) Hao, B., Ji, X., Duan, Y., Lu, H., Szepesvari, C., and Wang, M. (2021), “Bootstrapping fitted q-evaluation for off-policy inference,” in International Conference on Machine Learning.
- Hastie et al. (2009) Hastie, T., Tibshirani, R., Friedman, J. H., and Friedman, J. H. (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, vol. 2, Springer.
- Huber (1964) Huber, P. J. (1964), “Robust estimation of a location parameter,” The Annals of Mathematical Statistics, 35, 73–101.
- Huber (1992) — (1992), “Robust estimation of a location parameter,” in Breakthroughs in Statistics, Springer.
- Huber (2004) — (2004), Robust Statistics, vol. 523, John Wiley & Sons.
- Jiang and Huang (2020) Jiang, N. and Huang, J. (2020), “Minimax value interval for off-policy evaluation and policy optimization,” Advances in Neural Information Processing Systems.
- Kolter and Ng (2009) Kolter, J. Z. and Ng, A. Y. (2009), “Regularization and feature selection in least-squares temporal difference learning,” in Proceedings of the 26th Annual International Conference on Machine Learning.
- Kormushev et al. (2013) Kormushev, P., Calinon, S., and Caldwell, D. G. (2013), “Reinforcement learning in robotics: Applications and real-world challenges,” Robotics, 2, 122–148.
- Li and Sun (2023) Li, X. and Sun, Q. (2023), “Variance-aware robust reinforcement learning with linear function approximation under heavy-tailed rewards,” arXiv e-prints, arXiv:2303.05606.
- Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015), “Human-level control through deep reinforcement learning,” Nature, 518, 529–533.
- Mou et al. (2021) Mou, W., Pananjady, A., Wainwright, M. J., and Bartlett, P. L. (2021), “Optimal and instance-dependent guarantees for Markovian linear stochastic approximation,” arXiv e-prints, arXiv:2112.12770.
- Murphy (2003) Murphy, S. A. (2003), “Optimal dynamic treatment regimes,” Journal of the Royal Statistical Society: Series B: Statistical Methodology, 65, 331–355.
- Nesterov (2003) Nesterov, Y. (2003), Introductory Lectures on Convex Optimization: A Basic Course, vol. 87, Springer Science & Business Media.
- Polyak and Juditsky (1992) Polyak, B. T. and Juditsky, A. B. (1992), “Acceleration of stochastic approximation by averaging,” SIAM Journal on Control and Optimization, 30, 838–855.
- Ramprasad et al. (2022) Ramprasad, P., Li, Y., Yang, Z., Wang, Z., Sun, W. W., and Cheng, G. (2022), “Online bootstrap inference for policy evaluation in reinforcement learning,” Journal of the American Statistical Association, To appear.
- Rio (2017) Rio, E. (2017), Asymptotic Theory of Weakly Dependent Random Processes, vol. 80, Springer.
- Robbins and Monro (1951) Robbins, H. and Monro, S. (1951), “A stochastic approximation method,” The Annals of Mathematical Statistics, 22, 400–407.
- Rosenblatt (1956) Rosenblatt, M. (1956), “A central limit theorem and a strong mixing condition,” Proceedings of the National Academy of Sciences, 42, 43–47.
- Ruppert (1985) Ruppert, D. (1985), “A Newton-Raphson Version of the Multivariate Robbins-Monro Procedure,” The Annals of Statistics, 13, 236–245.
- Ruppert (1988) — (1988), “Efficient estimations from a slowly convergent Robbins-Monro process,” Tech. rep., Cornell University Operations Research and Industrial Engineering.
- Schraudolph et al. (2007) Schraudolph, N. N., Yu, J., and Günter, S. (2007), “A stochastic quasi-Newton method for online convex optimization,” in Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics.
- Shao and Zhang (2022) Shao, Q.-M. and Zhang, Z.-S. (2022), “Berry–Esseen bounds for multivariate nonlinear statistics with applications to M-estimators and stochastic gradient descent algorithms,” Bernoulli, 28, 1548–1576.
- Shi et al. (2018) Shi, C., Fan, A., Song, R., and Lu, W. (2018), “High-dimensional A-learning for optimal dynamic treatment regimes,” The Annals of Statistics, 46, 925–957.
- Shi et al. (2021a) Shi, C., Song, R., Lu, W., and Li, R. (2021a), “Statistical inference for high-dimensional models via recursive online-score estimation,” Journal of the American Statistical Association, 116, 1307–1318.
- Shi et al. (2021b) Shi, C., Zhang, S., Lu, W., and Song, R. (2021b), “Statistical inference of the value function for reinforcement learning in infinite-horizon settings,” Journal of the Royal Statistical Society. Series B: Statistical Methodology, 84, 765–793.
- Sun et al. (2020) Sun, Q., Zhou, W.-X., and Fan, J. (2020), “Adaptive Huber Regression,” Journal of the American Statistical Association, 115, 254–265.
- Sutton (1988) Sutton, R. (1988), “Learning to predict by the method of temporal differences,” Machine Learning, 3, 9–44.
- Sutton and Barto (2018) Sutton, R. and Barto, A. (2018), Reinforcement Learning, second edition: An Introduction, Adaptive Computation and Machine Learning series, MIT Press.
- Sutton et al. (2009) Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., and Wiewiora, E. (2009), “Fast gradient-descent methods for temporal-difference learning with linear function approximation,” in Proceedings of the 26th Annual International Conference on Machine Learning.
- Syrgkanis and Zhan (2023) Syrgkanis, V. and Zhan, R. (2023), “Post-Episodic Reinforcement Learning Inference,” arXiv e-prints, arXiv:2302.08854.
- Thomas et al. (2015) Thomas, P., Theocharous, G., and Ghavamzadeh, M. (2015), “High confidence policy improvement,” in International Conference on Machine Learning, PMLR.
- Tsitsiklis and Van Roy (1997) Tsitsiklis, J. and Van Roy, B. (1997), “An analysis of temporal-difference learning with function approximation,” IEEE Transactions on Automatic Control, 42, 674–690.
- Vershynin (2010) Vershynin, R. (2010), “Introduction to the non-asymptotic analysis of random matrices,” arXiv preprint arXiv:1011.3027.
- Wang et al. (2020) Wang, J., Liu, Y., and Li, B. (2020), “Reinforcement learning with perturbed rewards,” in Proceedings of the AAAI conference on Artificial Intelligence.
- Zhan et al. (2021) Zhan, R., Hadad, V., Hirshberg, D. A., and Athey, S. (2021), “Off-policy evaluation via adaptive weighting with data from contextual bandits,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 2125–2135.
- Zhang et al. (2021) Zhang, K., Janson, L., and Murphy, S. (2021), “Statistical inference with M-estimators on adaptively collected data,” Advances in Neural Information Processing Systems, 34, 7460–7471.
- Zhang et al. (2022) — (2022), “Statistical inference after adaptive sampling in non-Markovian environments,” arXiv preprint arXiv:2202.07098.
7 Appendix
7.1 Technical Lemmas
Lemma 1.
Let be -mixing sequence with . Assume that and . Then for any fixed and , we have some constant such that
Proof.
We divide the -tuple into different subsets , where . Here for , and . is a sufficiently large constant which will be specified later. Then we have . Without loss of generality, we assume that is an even integer.
Let , then we know and holds. For all and , there holds for all . By Theorem 2 of Berkes and Philipp (1979), there exists a sequence of independent variables , with having the same distribution as , and
For any , we can take so that
(21)
where is some constant. Next, we bound the variance of . From equation (20.23) of Billingsley (1968) we know that for arbitrary , there is
Then we compute that
(22)
Notice that ’s are all uniformly bounded by , we can apply Bernstein’s inequality (Bennett, 1962) and yield
We take for some large enough (note that ). Then we have that
(23)
Combining (21) and (23) we have that
(24)
Next we denote . We can similarly show that
(25)
Lemma 2.
Let be random vectors in . Let . Assume that
Then for every , there is
for some sufficiently large.
Proof.
Lemma 3.
Let be a random matrix with , , be the -field generated by . Then for any -field , there holds
where
Proof.
By elementary inequality of matrix, we know that
For every element of , by Lemma 4.4.1 of Berkes and Philipp (1979) we have that
Therefore, we have that
The proof is complete. ∎
Lemma 4.
(Approximation of pseudo-Huber gradient) The gradient of pseudo-Huber loss satisfies
(26)
uniformly for . And
(27)
holds for all .
Proof.
It is easy to compute that
It is not hard to see that
This proves (27). Furthermore, when , we have
for all . Similarly,
for . The proof is complete. ∎
Lemma 5.
For any sequence where and , there hold
Here the constant only depends on the parameter .
7.2 Proof of Results in Section 2
Proof of Theorem 1.
For each , when , we denote
(29) |
where . Given a sufficiently large constant , we define the event
(30)
where the constant is a constant that does not depend on . We will show that , where is some constant and is some constant only depends on . Let us prove it by induction on . By the choice of initial parameter , we know (30) holds for . For , from (8) we have that
(31) |
For the second term on the right-hand side of (31), we have that
where the second equality holds because of the fact that , the last equality uses the inductive hypothesis. Substitute it into (31), we have
(32) |
where and .
Under event , by Lemma 5 there exists a constant such that
(33) |
Similarly, by Lemma 5 there exists a constant such that
(34) |
By Lemma 6, (34) and (42) we know that under event , for every there exist constants (which only depend on ), such that
By Lemma 8 we know that
(35) |
In this part, we mainly consider the case where and can be arbitrary. A special case where and will be presented in Proposition 8 below. By Lemma 7 we have that when , there is
By substituting the above inequalities into (32), we have that
(36) |
where can be taken to be sufficiently large, given fixed. Here the second inequality holds because can be sufficiently small for by taking sufficiently large. Meanwhile, as , the last two terms both have order of . Obviously, the rest two events of (30) are contained in (36). Therefore, the inductive hypothesis that , is proved.
For , there holds
which proves the theorem. ∎
Lemma 6.
(Bound of the pseudo-Huber Hessian matrix) Under the same assumptions as in Theorem 1, there exists uniform constants and , such that the Hessian matrix satisfies
Proof.
We note that
For the first term, there is
holds for all and . Therefore we have that
(37)
For the second term, we first prove that
(38) |
Let be the -net of the unit ball . By Lemma 5.2 of Vershynin (2010) we know that . Then we have that
Applying Lemma 1 to every we can obtain that
for some constant . Therefore
which proves (38). Next we prove that when are normal data, for every ,
Indeed, denote , we have that
where the third line uses Lemma 4, and . Then clearly we have that . Therefore, by the choice of and Lemma 5, we have
(39) |
Under event , from (34) we know there is
(40)
Therefore combining (38), (39) and (40) we have that
(41)
Combining (37) and (41), we have that
which proves the lemma.
As a corollary, we can show that under event ,
(42) |
∎
Lemma 7.
Proof.
Proof of i). We prove the bound coordinate-wisely. When has no outlier, for the -th coordinate, we first prove that for every , there is
Indeed, , we have that
Therefore, by the choice of and Lemma 5, we have
(43) |
Next, when all data are normal, we prove the rate of
We essentially follow the proof of Lemma 1.
Case 1: . We divide the -tuple into different subsets , where . Here for , and . Then we have . Without loss of generality, we assume that is an even integer.
Let and , then we know (since ) and holds. For all and , there holds for all . By Theorem 2 of Berkes and Philipp (1979), there exists a sequence of independent variables , with having the same distribution as , and
For any , we can take so that
(44)
where is some constant. Next, we bound the variance of . From equation (20.23) of Billingsley (1968) we know that for arbitrary , there is
Then we compute that
Notice that ’s are all uniformly bounded by , we can apply Bernstein’s inequality (Bennett, 1962) and yield
We take for some large enough (note that ). Then we have that
(45) |
Combining (44) and (45) we have that
(46) |
Next we denote , then we can similarly show that
(47) |
Combining (46) and (47) we can prove that
(48) |
By combining it with (43) we have that
Proof of ii). When and , we define the thresholding level , and consider the truncated random variables . Here is a sufficiently large constant. Then we have that
(49) |
Similarly as in (43), we have that
(50) |
Similarly as in the proof of (48), for every , there exists such that
(51) |
By the choice of , we know that
Combining (49), (50) and (51) we have that
When there are fraction of outliers, denote as the index set, we have
which proves the lemma. ∎
Lemma 8.
(Bound of the mixed term) Under the same assumptions as in Theorem 1, for every , there exist constants and , such that
Proof.
Firstly, from (39) and Lemma 5, under the event , we can bound the term
(52) |
For ease of notation, we first consider the case when there is no outlier. Similar as in the proof of Lemma 1, for each , we evenly divide the tuple into subsets , where . Here for , and . is a sufficiently large constant which will be specified later. Then we have . We further denote . Without loss of generality, we assume that is an even integer. For each , suppose , we construct the following random variable
(53) |
When , we take the sum for the terms in . For , we take . Then by Lemma 9 below we have that,
(54) |
Moreover, from the proof of Lemma 9, we can obtain that under , there is
(55) |
It remains to bound the term. To be more precise, we define the corresponding σ-field and construct the random variable
Then by (55) we know that
We further set
then there is
(56) |
Notice that are martingale differences, and there is for . Therefore we have
By Lemma 3 there is
for some large enough. Then Markov’s inequality yields
(57) |
It is direct to verify that
Then we can apply Lemma 2 and yield
Combining it with (57) and (56), we have that
A similar result holds for the even term. Note that , and together with (54) we have
When taking the outliers into consideration, we have that
under the event , which proves the lemma by combining it with (52). ∎
Lemma 9.
Let be defined in (53), for every , there exist constants and , such that
Proof.
Under event , using Lemma 5, we have
(58) |
For the first term, by Lemma 6 and (42) we have that under event ,
On the other hand, there is
Therefore, we have that
(59) |
Denote
It is direct to verify that
(60) |
Therefore, it remains to bound
To this end, we further denote
We define the -field , by Lemma 3 we have that
Therefore, by Markov’s inequality
(61) |
It is direct to verify that
Then we can apply Lemma 2 and yield
Combining it with (60) and (61), we have that
A similar result holds for the average of , where . Substitute it into (59), we have that
which proves the lemma. ∎
7.3 Proof of Results in Section 3.2
Proof of Theorem 4.
From the proof of Theorem 1, as tends to infinity, we can obtain that
(62) |
where . Denote , we first bound the term
(63)
where the last inequality uses the moment assumption on and similar argument as in (43). To bound the first term, we directly compute its variance and use Chebyshev’s inequality. More precisely, by equation (20.23) of Billingsley (1968), for arbitrary and each coordinate , there is
Therefore we have that
for some constant . Therefore, by Chebyshev’s inequality, we have that
(64) |
Combining (62), (63) and (64), given a unit vector , we have
where the remainder term is of higher order under some rate constraints. Here the main term is the average of a strictly stationary and strongly mixing sequence. By Lemma 10, the conditions of Theorem 1 in Doukhan et al. (1994) are fulfilled. We can then apply that theorem to obtain
where
and is defined in (17). Therefore the theorem is proved. ∎
Lemma 10.
Let the random variable satisfies for some , and (where ). Then we have that
where is the inverse function of , i.e. , , and .
Proof.
By Markov’s inequality, we have that
Therefore, we have that
as long as , which proves the lemma. ∎
Proposition 8.
Proof.
7.4 Proof of Results in Section 4
Proof of Theorem 6.
For simplicity we denote , . Then by (17) we know
Under event (30), for , we have that
Therefore we have that
(66) |
We first prove
(67) |
for some . By equation (20.23) of Billingsley (1968), for each , there is
Therefore we can obtain that
for , which proves (67). Next we prove that for , there holds
(68)
By the proof of Lemma 6, we know that
where is a -net of . For , Denote , then by (13) we know that
for all such sets and all indices. We essentially follow the proof of Lemma 1 and obtain that
Here the only difference is that in (22), and is bounded by . Therefore (68) can be proved.
Last, we prove that
(69) |
To see this, we compute that
For the second term,
for some constants . Therefore (69) is proved.