This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Online Estimation and Inference for Robust Policy Evaluation in Reinforcement Learning

Weidong Liu Jiyuan Tu Yichen Zhang Xi Chen School of Mathematical Sciences and MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong UniversitySchool of Mathematics, Shanghai University of Finance and EconomicsDaniels School of Business, Purdue UniversityStern School of Business, New York University.
Abstract

Recently, reinforcement learning has gained prominence in modern statistics, with policy evaluation being a key component. Unlike traditional machine learning literature on this topic, our work places emphasis on statistical inference for the parameter estimates computed using reinforcement learning algorithms. While most existing analyses assume random rewards to follow standard distributions, limiting their applicability, we embrace the concept of robust statistics in reinforcement learning by simultaneously addressing issues of outlier contamination and heavy-tailed rewards within a unified framework. In this paper, we develop an online robust policy evaluation procedure, and establish the limiting distribution of our estimator, based on its Bahadur representation. Furthermore, we develop a fully-online procedure to efficiently conduct statistical inference based on the asymptotic distribution. This paper bridges the gap between robust statistics and statistical inference in reinforcement learning, offering a more versatile and reliable approach to policy evaluation. Finally, we validate the efficacy of our algorithm through numerical experiments conducted in real-world reinforcement learning experiments.

Keywords: Statistical inference; dependent samples; policy evaluation; reinforcement learning; online learning.

1 Introduction

Reinforcement learning has offered immense success and remarkable breakthroughs in a variety of application domains, including autonomous driving, precision medicine, recommendation systems, and robotics (to name a few, e.g., Murphy, 2003; Kormushev et al., 2013; Mnih et al., 2015; Shi et al., 2018). From recommendation systems to mobile health (mHealth) intervention, reinforcement learning can be used to adaptively make personalized recommendations and optimize intervention strategies learned from retrospective behavioral and physiology data. While the achievements of reinforcement learning algorithms in applications are undisputed, the reproducibility of its results and reliability is still in many ways nascent.

Those recommendation and health applications enjoy great flexibility and affordability due to the development of reinforcement algorithms, despite calling for critical needs for a reliable and trustworthy uncertainty quantification for such implementation. The reliability of such implementations sometimes plays a life-threatening role in emerging applications. For example, in autonomous driving, it is critical to avoid deadly explorations based upon some uncertainty measures in the trial-and-error learning procedure. This substance also extends to other applications including precision medicine and autonomous robotics. From the statistical perspective, it is important to quantify the uncertainty of a point estimate with complementary hypothesis testing to reveal or justify the reliability of the learning procedure.

Policy evaluation plays a cornerstone role in typical reinforcement learning (RL) algorithms such as Temporal Difference (TD) learning. As one of the most commonly adopted algorithms for policy evaluation in RL, TD learning provides an estimator of the value function iteratively with regard to a given policy based on samples from a Markov chain. In large-scale RL tasks where the state space is infinitely expansive, a typical procedure to provide a scalable yet efficient estimation of the value function is via linear function approximation. This procedure can be formulated in a linear stochastic approximation problem (Sutton, 1988; Tsitsiklis and Van Roy, 1997; Sutton et al., 2009; Ramprasad et al., 2022), which is designed to sequentially solve a deterministic equation Aθ=bA\theta=b by a matrix-vector pair of a sequence of unbiased random observations of (At,bt)(A_{t},b_{t}) governed by an ergodic Markov chain.

The earliest and most prototypical stochastic approximation algorithm is the Robbins-Monro algorithm introduced by Robbins and Monro (1951) for solving a root-finding problem, where the function is represented as an expected value, e.g., 𝔼[f(θ)]=0{\mathbb{E}}[f(\theta)]=0. The algorithm has generated profound interest in the field of stochastic optimization and machine learning to minimize a loss function using random samples. When referring to an optimization problem, its first-order condition can be represented as 𝔼[f(θ)]=0{\mathbb{E}}[f(\theta)]=0, and the corresponding Robbins-Monro algorithm is often referred to as first-order methods, or more widely known as stochastic gradient descent (SGD) in machine learning literature. It is well established in the literature that its averaged version (Ruppert, 1988; Polyak and Juditsky, 1992), as an online first-order method, achieves optimal statistical efficiency when estimating the model parameters in statistical models, which apparently kills the interest in developing second-order methods that use additional information to help the convergence. That being said, it is often observed in practice that first-order algorithms entail significant accuracy loss on non-asymptotic convergence as well as severe instability in the choice of hyperparameters, specifically, the stepsizes (a.k.a. learning rate). In addition, the stepsize tuning further complicates the quantification of uncertainty associated with the algorithm output. Despite the known drawbacks above, first-order stochastic methods are historically favored in machine learning tasks due to their computational efficiency while they primarily focus on estimation. On the other hand, when the emphasis of the task lies on statistical inference, certain computation of the second-order information is generally inevitable during the inferential procedure, which shakes the supremacy of first-order methods over second-order algorithms.

In light of that, we propose a second-order online algorithm which utilizes second-order information to perform the policy evaluation sequentially. Meanwhile, our algorithm can be used for conducting statistical inference in an online fashion, allowing for the characterization of uncertainty in the estimation of the value function. Such a procedure generates no extra per-unit computation and storage cost beyond O(d2)O(d^{2}), which is at least the same as typical first-order stochastic methods featuring statistical inference (see, e.g., Chen et al., 2020 for SGD and Ramprasad et al., 2022 for TD). More importantly, we show theoretically that the proposed algorithm converges faster in terms of the remainder term compared with first-order stochastic approximation methods, and revealed significant discrepancies in numerical experiments. In addition, the proposed algorithm is free from tuning stepsizes, which has been well-established as a substantial criticism of first-order algorithms.

Another challenge to the reliability of reinforcement learning algorithms lies in the modeling assumptions. Most algorithms in RL have been in the “optimism in the face of uncertainty” paradigm where such procedures are vulnerable to manipulation (see some earlier exploration in e.g., Everitt et al., 2017; Wang et al., 2020). In practice, it is often unrealistic to believe that rewards on the entire trajectory follow exactly the same underlying model. Indeed, non-standard behavior of the rewards happens from time to time in practice. The model-misspecification and presence of outliers are indeed very common in an RL environment, especially that with a large time horizon TT. It is of substantial interest to design a robust policy evaluation procedure. In pursuit of this, our proposed algorithm uses a smoothed Huber loss to replace the least-squares loss function used in classical TD learning, which is tailored to handle both outliers and heavy-tailed rewards. To model outlier observations of rewards in reinforcement learning, we bring the static α\alpha-contamination model (Huber, 1992) to an online setting with dependent samples. In a static offline robust estimation problem, one aims to learn the distribution of interest, where a sample of size nn is drawn i.i.d. from a mixture distribution (1αn)P+αnQ(1-\alpha_{n})P+\alpha_{n}Q, and QQ denotes an arbitrary outlier distribution. We adopt robust estimation in an online environment, where the observations are no longer independent, and the occurrence time of outliers is unknown. In contrast to the offline setting, future observations cannot be used in earlier periods in an online setting. Therefore, in the earlier periods, there is very limited information to help determine whether an observation is an outlier. In addition to this discrepancy between the online decision process and offline estimation, we further allow the outlier reward models to be potentially different for different time tt instead of being from a fixed distribution, and such rewards may be arbitrary and even adversarially adaptive to historical information. Besides the outlier model, our model also incorporates rewards with heavy-tailed distributions. This substantially relaxes the boundness condition that is commonly assumed in policy evaluation literature.

We summarize the challenges and contributions of this paper in the following facets.

  • We propose an online policy evaluation method with dependent samples and simultaneously conduct statistical inference for the model parameters in a fully online fashion. Furthermore, we build a Bahadur representation of the proposed estimator, which includes the main term corresponding to the asymptotic normal distribution and a higher-order remainder term. To the best of our knowledge, there is no literature establishing the Bahadur representation for online policy evaluation. Moreover, it shows that our algorithm matches the offline oracle and converges strictly faster to the asymptotic distribution than that of a prototypical first-order stochastic method such as TD learning.

  • Compared to existing reinforcement learning literature, our proposed algorithm features an online generalization of the αn\alpha_{n}-contamination model where the rewards contain outliers or arbitrary corruptions. Our proposed algorithm is robust to adversarial corruptions which can be adaptive to the trajectory, as well as heavy-tailed distribution of the rewards. Due to the existence of outliers, we use a smooth Huber loss where the thresholding parameter is carefully specified to change over time to accommodate the online streaming data. From a theoretical standpoint, a robust policy evaluation procedure forces the update step from θt\theta_{t} to θt+1\theta_{t+1} to be a non-linear function of θt\theta_{t}, which brings in additional technical challenges compared to the analysis of classical TD learning algorithms (see e.g., Ramprasad et al., 2022) based on linear stochastic approximation.

  • Our proposed algorithm is based on a dedicated averaged version of the second-order method, where in each iteration a surrogate Hessian is obtained and used in the update step. This second-order information enables the proposed algorithm to be free from stepsize tuning while still ensuring efficient implementation. Furthermore, our proposed algorithm stands out distinctly from conventional first-order stochastic approximation approaches which fall short of attaining the optimal offline remainder rate. On the other hand, while deterministic second-order methods do excel in offline scenarios, they lack the online adaptability crucial for real-time applications.

1.1 Related Works

Conducting statistical inference for model parameters in stochastic approximation has attracted great interest in the past decade, with a building foundation of the asymptotic distribution of the averaged version of stochastic approximation first established in Ruppert (1988); Polyak and Juditsky (1992). The established asymptotic distribution has been brought to conduct online statistical inference. For example, Fang et al. (2018) presented a perturbation-based resampling procedure for inference. Chen et al. (2020) proposed two online procedures to estimate the asymptotic covariance matrix to conduct inference. Other than those focused on the inference procedures, Shao and Zhang (2022) established the remainder term and Berry-Esseen bounds of the averaged SGD estimator. Givchi and Palhang (2015); Mou et al. (2021) analyzed averaged SGD with Markovian data. Second-order stochastic algorithms were analyzed in Ruppert (1985); Schraudolph et al. (2007); Byrd et al. (2016) and applied to TD learning in Givchi and Palhang (2015).

Under the online decision-making settings including bandit algorithms and reinforcement learning, a few existing works focused on statistical inference of the model parameters or uncertainty quantification of value functions. Deshpande et al. (2018); Chen et al. (2021a, b); Zhan et al. (2021); Zhang et al. (2021, 2022) studied statistical inference for linear models, MM-estimation, and the non-Markovian environment with data collected via a bandit algorithm, respectively. Thomas et al. (2015) proposed high-confidence off-policy evaluation based on Bernstein inequality. Hanna et al. (2017) presented two bootstrap methods to compute confidence bounds for off-policy value estimates. Dai et al. (2020); Feng et al. (2021) construct confidence intervals for value functions based on optimization formulations, and Jiang and Huang (2020) derived a minimax value interval, both with i.i.d. sample. Shi et al. (2021a) developed online inference procedures for high-dimensional problems. Shi et al. (2021b) proposed inference procedures for Q functions in RL via sieve approximations. Hao et al. (2021) studied multiplier bootstrap algorithms to offer uncertainty quantification for exploration in fitted Q-evaluation. Syrgkanis and Zhan (2023) studied a re-weighted Z-estimator on episodic RL data and conducted inference on the structural parameter.

The most relevant literature to ours is Ramprasad et al. (2022), who studied a bootstrap online statistical inference procedure under Markov noise using a quadratic SGD and demonstrated its application in the classical TD (and Gradient TD) algorithms in RL. Our proposed procedure and analysis differ in at least two aspects. First, our proposed estimator is a Newton-type second-order approach that enjoys a faster convergence and optimality in the remainder rate. In addition, we show both analytically and numerically that the computation cost of our procedure is typically lower than Ramprasad et al. (2022) for an inference task. Second, our proposed algorithm is a robust alternative to TD algorithms, featuring a non-quadratic loss function to handle the potential outliers and heavy-tailed rewards. There exists limited RL literature on either outliers or heavy-tailed rewards. In a recent work Li and Sun (2023), the author studied online linear stochastic bandits and linear Markov decision processes in the presence of heavy-tailed rewards. While both our work and theirs apply pseudo-Huber loss, our work focuses on policy evaluation with statistical inference guarantees.

1.2 Paper Organization and Notations

The remainder of this paper is organized as follows. In Sections 2 and 3, we present and discuss our proposed algorithm for robust policy evaluation in reinforcement learning. Theoretical results on convergence rates, asymptotic normality, and the Bahadur representation are presented in Section 3. In Section 4, we develop an estimator for the long-run covariance matrix to construct confidence intervals in an online fashion and provide its theoretical guarantee. Simulation experiments are provided in Section 5 to demonstrate the effectiveness of our method. Concluding remarks are given in Section 6. All proofs are deferred to the Appendix.

For every vector 𝒗=(v1,,vd)\boldsymbol{v}=(v_{1},...,v_{d})^{{\top}}, denote |𝒗|2=l=1dvl2|\boldsymbol{v}|_{2}=\sqrt{\sum_{l=1}^{d}v_{l}^{2}}, |𝒗|1=l=1d|vl||\boldsymbol{v}|_{1}=\sum_{l=1}^{d}|v_{l}|, and |𝒗|=max1ld|vl||\boldsymbol{v}|_{\infty}=\max_{1\leq l\leq d}|v_{l}|. For simplicity, we denote 𝕊d1{\mathbb{S}}^{d-1} and 𝔹d{\mathbb{B}}^{d} as the unit sphere and unit ball in d\mathbb{R}^{d} centered at 𝟎\boldsymbol{0}. Moreover, we use supp(𝒗)={1ldvl0}\mathrm{supp}(\boldsymbol{v})=\{1\leq l\leq d\mid v_{l}\neq 0\} as the support of the vector 𝒗\boldsymbol{v}. For every matrix 𝑨d1×d2\boldsymbol{A}\in{\mathbb{R}}^{d_{1}\times d_{2}}, define 𝑨=sup|𝒗|2=1|𝑨𝒗|2\left\|\boldsymbol{A}\right\|=\sup_{|\boldsymbol{v}|_{2}=1}|\boldsymbol{A}\boldsymbol{v}|_{2} as the matrix operator norms, Λmax(𝑨)\Lambda_{\max}(\boldsymbol{A}) and Λmin(𝑨)\Lambda_{\min}(\boldsymbol{A}) as the largest and smallest singular values of 𝑨\boldsymbol{A} respectively. The symbols x\lfloor x\rfloor (x\lceil x\rceil) denote the greatest integer (the smallest integer) not larger than (not less than) xx. We denote (x)+=max(0,x)(x)_{+}=\max(0,x). For two sequences an,bna_{n},b_{n}, we say anbna_{n}\asymp b_{n} when an=O(bn)a_{n}=O(b_{n}) and bn=O(an)b_{n}=O(a_{n}) hold at the same time. We say anbna_{n}\approx b_{n} if limnan/bn=1\lim_{n\rightarrow\infty}a_{n}/b_{n}=1. For a sequence of random variables {Xn}n=1\{X_{n}\}_{n=1}^{\infty}, we denote Xn=O(an)X_{n}=O_{{\mathbb{P}}}(a_{n}) if there holds limClim supn(|Xn|>Can)=0\lim_{C\rightarrow\infty}\limsup_{n\rightarrow\infty}{\mathbb{P}}(|X_{n}|>Ca_{n})=0, and denote Xn=o(an)X_{n}=o_{{\mathbb{P}}}(a_{n}) if there holds limC0lim supn(|Xn|>Can)=0\lim_{C\rightarrow 0}\limsup_{n\rightarrow\infty}{\mathbb{P}}(|X_{n}|>Ca_{n})=0. Lastly, the generic constants are assumed to be independent of nn and dd.

2 Robust Policy Evaluation in Reinforcement Learning

We first review the Least-squares temporal difference methods in RL. Consider a 44-tuple (𝒮,𝒰,𝒫,)({\mathcal{S}},{\mathcal{U}},{\mathcal{P}},{\mathcal{R}}). Here 𝒮={1,2,,N}{\mathcal{S}}=\{1,2,...,N\} is the global finite state space, 𝒰{\mathcal{U}} is the set of control (action), 𝒫{\mathcal{P}} is the transition kernel, and {\mathcal{R}} is the reward function. One of the core steps in RL is to estimate the accumulative reward JJ^{*} (which is also called the state value function) for a given policy:

J(s)=𝔼[k=0γk(sk)|s0=s],J^{*}(s)={\mathbb{E}}\Big{[}\sum_{k=0}^{\infty}\gamma^{k}{\mathcal{R}}(s_{k})\;\Big{|}\;s_{0}=s\Big{]},

where γ(0,1]\gamma\in(0,1] is a given discount factor and s𝒮s\in{\mathcal{S}} is any state. Here {sk}\{s_{k}\} denote the environment states which are usually modeled by a Markov chain. In real-world RL applications, the state space is often very large such that one cannot directly compute value estimates for every state in the state space. A common approach in the modern RL is to approximate the value function J()J^{*}(\cdot), i.e., let

J~(s,𝜽)=𝜽ϕ(s)=l=1dθlϕl(s),\widetilde{J}(s,\boldsymbol{\theta})=\boldsymbol{\theta}^{{\top}}\boldsymbol{\phi}(s)=\sum_{l=1}^{d}\theta_{l}\phi_{l}(s),

be a linear approximation of J()J^{*}(\cdot), where 𝜽=(θ1,,θd)d\boldsymbol{\theta}=(\theta_{1},...,\theta_{d})^{\top}\in\mathbb{R}^{d} contains the model parameters, and ϕ(s)=(ϕ1(s),,ϕd(s))d\boldsymbol{\phi}(s)=(\phi_{1}(s),...,\phi_{d}(s))^{\top}\in\mathbb{R}^{d} is a set of feature vectors that corresponds to the state s𝒮s\in{\mathcal{S}}. Here we write ϕl=(ϕl(1),,ϕl(N))\boldsymbol{\phi}_{l}=(\phi_{l}(1),...,\phi_{l}(N))^{\top}, 1ld1\leq l\leq d are linearly independent vectors in N{\mathbb{R}}^{N} and dNd\ll N. That is, we use a low dimensional linear approximation (with the basis vectors {ϕ1,,ϕd}\{\boldsymbol{\phi}_{1},...,\boldsymbol{\phi}_{d}\}) for J()J^{*}(\cdot). Let the matrix 𝚽=(ϕ1,,ϕd)\boldsymbol{\Phi}=(\boldsymbol{\phi}_{1},...,\boldsymbol{\phi}_{d}) and then J~=𝚽𝜽\widetilde{J}=\boldsymbol{\Phi}\boldsymbol{\theta}.

The state value function JJ^{*} satisfies the Bellman equation

J(s)=(s)+γ𝔼[k=1γk1(sk)|s0=s]=(s)+γs=1NpssJ(s),J^{*}(s)={\mathcal{R}}(s)+\gamma{\mathbb{E}}\Big{[}\sum_{k=1}^{\infty}\gamma^{k-1}{\mathcal{R}}(s_{k})\Big{|}s_{0}=s\Big{]}={\mathcal{R}}(s)+\gamma\sum_{s^{\prime}=1}^{N}p_{ss^{\prime}}J^{*}(s^{\prime}), (1)

where ss^{\prime} is the next state transferred from ss. Let 𝒫=(pss)N×N{\mathcal{P}}=(p_{ss^{\prime}})_{N\times N} be the probability transition matrix and define the Bellman operator 𝒯\mathcal{T} by 𝒯(Q)=+γ𝒫Q\mathcal{T}(Q)={\mathcal{R}}+\gamma{\mathcal{P}}Q. The state function JJ^{*} is the unique fixed point of the Bellman operator 𝒯\mathcal{T}. When JJ^{*} is replaced by J~=𝚽𝜽\widetilde{J}=\boldsymbol{\Phi}\boldsymbol{\theta}, the Bellman equation may not hold.

The classical temporal difference attempts to find 𝜽\boldsymbol{\theta} such that J~(s,𝜽)\widetilde{J}(s,\boldsymbol{\theta}) well approximates the value function J(s)J^{*}(s). Particularly, the TD algorithm updates

𝜽t+1=𝜽tηtϕ(st)[(ϕ(st)γϕ(st+1))𝜽t(st)],t0,\displaystyle\boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_{t}-\eta_{t}\boldsymbol{\phi}(s_{t})[(\boldsymbol{\phi}^{{\top}}(s_{t})-\gamma\boldsymbol{\phi}^{{\top}}(s_{t+1}))\boldsymbol{\theta}_{t}-{\mathcal{R}}(s_{t})],\quad t\geq 0, (2)

where ηt\eta_{t} is the step size which often requires careful tuning. We next illustrate the key point that (2) leads to a good estimator such that J~(s,𝜽t)\widetilde{J}(s,\boldsymbol{\theta}_{t}) is close to J(s)J^{*}(s). It can be shown that 𝜽t\boldsymbol{\theta}_{t} converges to an unknown population parameter 𝜽\boldsymbol{\theta}^{*} that minimizes the expected squared difference of the Bellman equation,

𝜽=argmin𝒖d𝔼|ϕ(s)𝒖((s)+γϕ(s)𝜽)|2;\boldsymbol{\theta}^{*}=\mathop{\rm{argmin}}_{\boldsymbol{u}\in{\mathbb{R}}^{d}}{\mathbb{E}}|\boldsymbol{\phi}^{{\top}}(s)\boldsymbol{u}-({\mathcal{R}}(s)+\gamma\boldsymbol{\phi}^{{\top}}(s^{\prime})\boldsymbol{\theta}^{*})|^{2}; (3)

It is easy to see that such 𝜽\boldsymbol{\theta}^{*} exists and satisfies the following first-order condition,

𝔼ϕ(s)[(ϕ(s)γϕ(s))𝜽(s)]=0.{\mathbb{E}}\boldsymbol{\phi}(s)[(\boldsymbol{\phi}^{{\top}}(s)-\gamma\boldsymbol{\phi}^{{\top}}(s^{\prime}))\boldsymbol{\theta}^{*}-{\mathcal{R}}(s)]=0. (4)

Under the condition that H:=𝔼ϕ(s)(ϕ(s)γϕ(s))\textbf{H}:={\mathbb{E}}\boldsymbol{\phi}(s)(\boldsymbol{\phi}^{{\top}}(s)-\gamma\boldsymbol{\phi}^{{\top}}(s^{\prime})) is positive definite in the sense that xHx>0x^{{\top}}\textbf{H}x>0 for all xdx\in{\mathbb{R}}^{d}; see Tsitsiklis and Van Roy (1997), the solution of (4) writes

𝜽=(𝔼ϕ(s)(ϕ(s)γϕ(s)))1𝔼ϕ(s)(s).\boldsymbol{\theta}^{*}=({\mathbb{E}}\boldsymbol{\phi}(s)(\boldsymbol{\phi}^{{\top}}(s)-\gamma\boldsymbol{\phi}^{{\top}}(s^{\prime})))^{-1}{\mathbb{E}}\boldsymbol{\phi}(s){\mathcal{R}}(s).

The estimation equation (2) is a first-order stochastic algorithm that converges to the stochastic root-finding problem (4) using a sequence of observations {st,rt}t1\{s_{t},r_{t}\}_{t\geq 1}. In this paper, we refer to it as the Least-squares temporal difference estimator. See Kolter and Ng (2009) for details on the properties of the Least-squares TD estimator.

Least-squares-based methods are oftentimes criticized due to their sensitivity to outliers in data. When there may exist outliers in some observations of the reward (s){\mathcal{R}}(s), it is a natural call for interest to design a robust estimator of 𝜽\boldsymbol{\theta}^{*}. Following the widely-known classical literature (Huber, 1964; Charbonnier et al., 1994, 1997; Hastie et al., 2009) , we replace the square loss ||2|\cdot|^{2} by a smoothed (Pseudo) Huber loss fτ(x)=τ2(1+(x/τ)21)f_{\tau}(x)=\tau^{2}(\sqrt{1+(x/\tau)^{2}}-1), parametrized by a thresholding parameter τ\tau. We define a similar fixed point equation with the smoothed Huber loss by

𝜽τ=argmin𝒖d𝔼fτ(ϕ(s)𝒖((s)+γϕ(s)𝜽τ)).\boldsymbol{\theta}^{*}_{\tau}=\mathop{\rm{argmin}}_{\boldsymbol{u}\in{\mathbb{R}}^{d}}{\mathbb{E}}f_{\tau}\big{(}\boldsymbol{\phi}^{{\top}}(s)\boldsymbol{u}-({\mathcal{R}}(s)+\gamma\boldsymbol{\phi}^{{\top}}(s^{\prime})\boldsymbol{\theta}^{*}_{\tau})\big{)}. (5)

In this section, when we motivate the algorithm, we assume that the fixed point 𝜽τ\boldsymbol{\theta}^{*}_{\tau} exists 111Here 𝜽τ\boldsymbol{\theta}^{*}_{\tau} is used for the motivation only. Our algorithm and theory do not depend on the existence of 𝜽τ\boldsymbol{\theta}^{*}_{\tau}. . As the thresholding parameter τ\tau goes to infinity, the objective equation in (5) is close to the least-squares loss in (3), and 𝜽τ\boldsymbol{\theta}_{\tau}^{*} should be close to 𝜽\boldsymbol{\theta}^{*}. When τ\tau tends to 0, the problem (5) becomes 𝜽0=argmin𝒖d𝔼|ϕ(s)𝒖((s)+γϕ(s)𝜽0)|,\boldsymbol{\theta}^{*}_{0}=\mathop{\rm{argmin}}_{\boldsymbol{u}\in{\mathbb{R}}^{d}}{\mathbb{E}}|\boldsymbol{\phi}^{{\top}}(s)\boldsymbol{u}-({\mathcal{R}}(s)+\gamma\boldsymbol{\phi}^{{\top}}(s^{\prime})\boldsymbol{\theta}^{*}_{0})|, with a nonsmooth least absolute deviation (LAD) loss, which is out of the scope of this paper. In this paper, we carefully specify τ\tau to balance the statistical efficiency and the effect of potential outliers in an online fashion (see Theorem 1).

By the first-order condition of (5), we obtain a similar estimation function as (4),

𝔼[ϕ(s)gτ((ϕ(s)γϕ(s))𝜽τ(s))]=0,{\mathbb{E}}\big{[}\boldsymbol{\phi}(s)g_{\tau}\big{(}(\boldsymbol{\phi}^{{\top}}(s)-\gamma\boldsymbol{\phi}^{{\top}}(s^{\prime}))\boldsymbol{\theta}_{\tau}^{*}-{\mathcal{R}}(s)\big{)}\big{]}=0, (6)

where the function gτ(x)=fτ(x)g_{\tau}(x)=f^{\prime}_{\tau}(x) is the differential of the smoothed Huber loss fτ(x)f_{\tau}(x). Instead of using the first-order iteration with the estimation equation (6) as in TD (2), we propose a Newton-type iterative estimator, which avoids the tuning of the learning rate. Newton-type estimators are often referred to as second-order methods when discussed in convex optimization. Nonetheless, the equation (5) is just for illustration purposes, which cannot be directly optimized since 𝜽τ\boldsymbol{\theta}_{\tau}^{*} appears in the objective function for minimization. On the contrary, our proposed method is a Newton-type method for solving the root-finding problem (6) using a sequence of the observations.

In this paper, we model two types of noise in observed rewards. The first is the classical Huber contamination model (Huber, 1992, 2004), where an αn\alpha_{n}-fraction of rewards comes from arbitrary distributions. The second is the heavy-tailed model, where the reward function may admit heavy-tailed distributions. In the following section, we show that our proposed estimator is robust to the aforementioned two noises.

3 Online Newton-type Method for Parameter Estimation

In the following, we introduce an online Newton-type method for estimating the parameter 𝜽\boldsymbol{\theta}^{*} in the presence of outliers and heavy-tailed noise in the reward function (s){\mathcal{R}}(s). For ease of presentation, we denote the observations by 𝑿i=ϕ(si)\boldsymbol{X}_{i}=\boldsymbol{\phi}(s_{i}), 𝒁i=ϕ(si)γϕ(si+1)\boldsymbol{Z}_{i}=\boldsymbol{\phi}(s_{i})-\gamma\boldsymbol{\phi}(s_{i+1}), and bi=(si)b_{i}={\mathcal{R}}(s_{i}), where sis_{i} is the state at time ii. Here (𝑿i,𝒁i,bi)d×d×(\boldsymbol{X}_{i},\boldsymbol{Z}_{i},b_{i})\in\mathbb{R}^{d}\times\mathbb{R}^{d}\times\mathbb{R} for each observation in the sample. Our objective is to propose an online estimator by the estimation function (6), which can be rewritten as

𝔼[𝑿gτ(𝒁𝜽τb)]=0,{\mathbb{E}}\big{[}\boldsymbol{X}g_{\tau}(\boldsymbol{Z}^{{\top}}\boldsymbol{\theta}_{\tau}^{*}-b)\big{]}=0, (7)

based on a sequence of dependent observations {(𝑿i,𝒁i,bi)}i1\{(\boldsymbol{X}_{i},\boldsymbol{Z}_{i},b_{i})\}_{i\geq 1}.

At iteration n+1n+1, our proposed Newton-type estimator updates 𝜽^n+1\widehat{\boldsymbol{\theta}}_{n+1} by

𝜽^n+1=1n+1i=0n𝜽^i𝑯^n+111n+1i=0n𝑿i+1gτi+1(𝒁i+1𝜽^ibi+1).\widehat{\boldsymbol{\theta}}_{n+1}=\frac{1}{n+1}\sum_{i=0}^{n}\widehat{\boldsymbol{\theta}}_{i}-\widehat{\boldsymbol{H}}_{n+1}^{-1}\frac{1}{n+1}\sum_{i=0}^{n}\boldsymbol{X}_{i+1}g_{\tau_{i+1}}(\boldsymbol{Z}_{i+1}^{{\top}}\widehat{\boldsymbol{\theta}}_{i}-b_{i+1}). (8)

Here 𝑯^n+1\widehat{\boldsymbol{H}}_{n+1} is an empirical information matrix of the estimation equation (7), as

𝑯^n+1=1n+1i=0n𝑿i+1𝒁i+1gτi+1(𝒁i+1𝜽^ibi+1).\widehat{\boldsymbol{H}}_{n+1}=\frac{1}{n+1}\sum_{i=0}^{n}\boldsymbol{X}_{i+1}\boldsymbol{Z}_{i+1}^{{\top}}g_{\tau_{i+1}}^{\prime}(\boldsymbol{Z}_{i+1}^{{\top}}\widehat{\boldsymbol{\theta}}_{i}-b_{i+1}). (9)

where τi\tau_{i} is the thresholding parameter in the Huber loss. We let τn\tau_{n} tend to infinity to eliminate the bias generated by the smoothed Huber loss.

It is noteworthy to mention that the matrix 𝑯^n+1\widehat{\boldsymbol{H}}_{n+1} is not the Hessian matrix of the objective function on the right-hand side of (5). As discussed in the previous section, (5) cannot be directly optimized, nor do they lead to M-estimation problems. Indeed, our proposed update (8) is a Newton-type method to find the root of (7) using a sequence of observations {(𝑿i,𝒁i,bi)}i1\{(\boldsymbol{X}_{i},\boldsymbol{Z}_{i},b_{i})\}_{i\geq 1}. In the following section, we will show the use of the matrix 𝑯^n+1\widehat{\boldsymbol{H}}_{n+1} helps for desirable convergence properties.

It should be noted that (8) can be implemented efficiently in a fully-online manner, i.e., without storage of the trajectory of historical information. Specifically, we write (8) as 𝜽^n+1=  θ n+1𝑯^n+11𝑮n+1\widehat{\boldsymbol{\theta}}_{n+1}=\,\hbox{\kern 0.09995pt\vbox{\hrule height=0.5pt\kern 1.42082pt\hbox{\kern-1.00006pt$\boldsymbol{\theta}$\kern-1.00006pt}}}\,_{n+1}-\widehat{\boldsymbol{H}}_{n+1}^{-1}\boldsymbol{G}_{n+1}, with each of the item on the right-hand side being a running average. It is easy to see that the averaged estimator  θ n+1=1n+1(n  θ n+𝜽^n)\,\hbox{\kern 0.09995pt\vbox{\hrule height=0.5pt\kern 1.42082pt\hbox{\kern-1.00006pt$\boldsymbol{\theta}$\kern-1.00006pt}}}\,_{n+1}=\frac{1}{n+1}(n\,\hbox{\kern 0.09995pt\vbox{\hrule height=0.5pt\kern 1.42082pt\hbox{\kern-1.00006pt$\boldsymbol{\theta}$\kern-1.00006pt}}}\,_{n}+\widehat{\boldsymbol{\theta}}_{n}) and the vector

𝑮n+1=1n+1i=0n𝑿i+1gτi+1(𝒁i+1𝜽^ibi+1)=nn+1𝑮n+1n+1𝑿n+1gτn+1(𝒁n+1𝜽^nbn+1)\boldsymbol{G}_{n+1}=\frac{1}{n+1}\sum_{i=0}^{n}\boldsymbol{X}_{i+1}g_{\tau_{i+1}}(\boldsymbol{Z}_{i+1}^{{\top}}\widehat{\boldsymbol{\theta}}_{i}-b_{i+1})=\frac{n}{n+1}\boldsymbol{G}_{n}+\frac{1}{n+1}\boldsymbol{X}_{n+1}g_{\tau_{n+1}}(\boldsymbol{Z}_{n+1}^{{\top}}\widehat{\boldsymbol{\theta}}_{n}-b_{n+1}) (10)

can both be updated online. In addition, the inverse 𝑯^n+11\widehat{\boldsymbol{H}}_{n+1}^{-1} can be directly and efficiently computed online by the inverse recursion formulation. By the Sherman-Morrison formula, we have

𝑯^n+11=\displaystyle\widehat{\boldsymbol{H}}_{n+1}^{-1}= n+1n𝑯^n1n+1n2𝑯^n1𝑿n+1\displaystyle\frac{n+1}{n}\widehat{\boldsymbol{H}}_{n}^{-1}-\frac{n+1}{n^{2}}\widehat{\boldsymbol{H}}_{n}^{-1}\boldsymbol{X}_{n+1} (11)
×[1n𝒁n+1𝑯^n1𝑿n+1+{gτn+1(𝒁n+1𝜽^nbn+1)}1]1𝒁n+1𝑯^n1.\displaystyle\times\Big{[}\frac{1}{n}\boldsymbol{Z}_{n+1}^{{\top}}\widehat{\boldsymbol{H}}_{n}^{-1}\boldsymbol{X}_{n+1}+\{g_{\tau_{n+1}}^{\prime}(\boldsymbol{Z}_{n+1}^{{\top}}\widehat{\boldsymbol{\theta}}_{n}-b_{n+1})\}^{-1}\Big{]}^{-1}\boldsymbol{Z}_{n+1}^{{\top}}\widehat{\boldsymbol{H}}_{n}^{-1}.

Here we note that both terms 1n𝒁n+1𝑯^n1𝑿n+1\frac{1}{n}\boldsymbol{Z}_{n+1}^{{\top}}\widehat{\boldsymbol{H}}_{n}^{-1}\boldsymbol{X}_{n+1} and {gτn+1(𝒁n+1𝜽^nbn+1)}1\{g_{\tau_{n+1}}^{\prime}(\boldsymbol{Z}_{n+1}^{{\top}}\widehat{\boldsymbol{\theta}}_{n}-b_{n+1})\}^{-1} inside the brackets on the right-hand side of (11) are scalars in 1\mathbb{R}^{1}, not matrices.

The complete algorithm is presented in Algorithm 1. We refer to it as the Robust Online Policy Evaluation (ROPE). Compared with existing stochastic approximation algorithms for TD learning (Durmus et al., 2021; Mou et al., 2021; Ramprasad et al., 2022), our ROPE algorithm does not need to tune the step size.

Algorithm 1 Robust Online Policy Evaluation (ROPE\mathrm{ROPE}).

Input: Online streaming data {(𝑿i,𝒁i,bi)i1}\{(\boldsymbol{X}_{i},\boldsymbol{Z}_{i},b_{i})\mid i\geq 1\}

1:  Compute 𝑯^n01\widehat{\boldsymbol{H}}_{n_{0}}^{-1} and 𝑮n0\boldsymbol{G}_{n_{0}} according to (12) respectively.
2:  for n=n0+1,n0+2,n=n_{0}+1,n_{0}+2,\dots do
3:     Specify the thresholding parameter τn\tau_{n}.
4:     Compute  θ n=(n1)  θ n1/n+𝜽^n1/n\,\hbox{\kern 0.09995pt\vbox{\hrule height=0.5pt\kern 1.42082pt\hbox{\kern-1.00006pt$\boldsymbol{\theta}$\kern-1.00006pt}}}\,_{n}=(n-1)\,\hbox{\kern 0.09995pt\vbox{\hrule height=0.5pt\kern 1.42082pt\hbox{\kern-1.00006pt$\boldsymbol{\theta}$\kern-1.00006pt}}}\,_{n-1}/n+\widehat{\boldsymbol{\theta}}_{n-1}/n, and (𝑯^n1,𝑮n)(\widehat{\boldsymbol{H}}_{n}^{-1},\boldsymbol{G}_{n}) by (10) and (11).
5:     Update the parameter by
𝜽^n=  θ n𝑯^n1𝑮n.\widehat{\boldsymbol{\theta}}_{n}=\,\hbox{\kern 0.09995pt\vbox{\hrule height=0.5pt\kern 1.42082pt\hbox{\kern-1.00006pt$\boldsymbol{\theta}$\kern-1.00006pt}}}\,_{n}-\widehat{\boldsymbol{H}}_{n}^{-1}\boldsymbol{G}_{n}.
6:  end for

Output: The final estimator 𝜽^n\widehat{\boldsymbol{\theta}}_{n}.

Since performing iterations of 𝑯^n+11\widehat{\boldsymbol{H}}_{n+1}^{-1} in (11) requires an initial invertible 𝑯^n0\widehat{\boldsymbol{H}}_{n_{0}}, we shall compute from the first n0n_{0} samples that

𝑯^n0=1n0i=1n0𝑿i𝒁igτ0(𝒁i𝜽^0bi),𝑮n0=1n0i=1n0𝑿igτ0(𝒁i𝜽^0bi),\widehat{\boldsymbol{H}}_{n_{0}}=\frac{1}{n_{0}}\sum_{i=1}^{n_{0}}\boldsymbol{X}_{i}\boldsymbol{Z}_{i}^{\top}g^{\prime}_{\tau_{0}}(\boldsymbol{Z}_{i}^{\top}\widehat{\boldsymbol{\theta}}_{0}-b_{i}),\quad\boldsymbol{G}_{n_{0}}=\frac{1}{n_{0}}\sum_{i=1}^{n_{0}}\boldsymbol{X}_{i}g_{\tau_{0}}(\boldsymbol{Z}_{i}^{\top}\widehat{\boldsymbol{\theta}}_{0}-b_{i}), (12)

which serves as the initial quantities of (10)–(11) in order to compute the forthcoming iterations. Here 𝜽^0\widehat{\boldsymbol{\theta}}_{0} is a given initial parameter, and τ0\tau_{0} is a specified initial threshold level.

3.1 Convergence Rate of ROPE

Before illustrating how to conduct statistical inference on the parameter 𝜽\boldsymbol{\theta}^{*}, we first provide theoretical results for our proposed ROPE method.

To characterize the weak dependence of the sequence of environment states, we use the concept of ϕ\phi-mixing. More precisely, we assume {si,i1}\{s_{i},i\geq 1\} is a sequence of ϕ\phi-mixing dependent variables, which satisfy

|(B|A)(B)|ϕ(k)|{\mathbb{P}}(B|A)-{\mathbb{P}}(B)|\leq\phi(k) (13)

for all A1n,Bn+kA\in{\mathcal{F}}_{1}^{n},B\in{\mathcal{F}}_{n+k}^{\infty} and all n,k1n,k\geq 1, where ab=σ(si,aib){\mathcal{F}}_{a}^{b}=\sigma(s_{i},a\leq i\leq b). The ϕ\phi-mixing dependence covers the irreducible and aperiodic Markov chain, which is typically used to model the states sampling in reinforcement learning (RL) (Rosenblatt, 1956; Rio, 2017, and others). The mixing rate ϕ(k)\phi(k) identifies the strength of dependence of the sequence. In this paper, we assume the following condition on the mixing rate.

Condition (C1).

The sequence {(𝐗i,𝐙i)}\{(\boldsymbol{X}_{i},\boldsymbol{Z}_{i})\} is a stationary ϕ\phi-mixing sequence, where the mixing rate satisfies ϕ(k)=O(ρk)\phi(k)=O(\rho^{k}) for some 0<ρ<10<\rho<1.

It is easy to see that (C1) holds when {si}\{s_{i}\} is an irreducible and aperiodic finite-state Markov chain. In addition, we also assume the following boundedness condition on 𝑿=ϕ(s)\boldsymbol{X}=\boldsymbol{\phi}(s) that is commonly required by the RL literature (Sutton and Barto, 2018; Ramprasad et al., 2022).

Condition (C2).

There exists a constant CX>0C_{X}>0 such that max{|ϕ(s)|2|s𝒮}CX\max\big{\{}|\boldsymbol{\phi}(s)|_{2}\;\big{|}\;s\in{\mathcal{S}}\big{\}}\leq C_{X}.

Next, we assume the matrix 𝑯=𝔼[𝑿𝒁]\boldsymbol{H}={\mathbb{E}}[\boldsymbol{X}\boldsymbol{Z}^{{\top}}] has a bounded condition number.

Condition (C3).

There exists a constant c>0c>0 such that cΛmin(𝐇)Λmax(𝐇)c1c\leq\Lambda_{\min}(\boldsymbol{H})\leq\Lambda_{\max}(\boldsymbol{H})\leq c^{-1}, where Λmax\Lambda_{\max} and Λmin\Lambda_{\min} denote the largest and the smallest singular values, respectively.

When {si}\{s_{i}\} is an irreducible and aperiodic Markov chain, Tsitsiklis and Van Roy (1997) proved that 𝑯\boldsymbol{H} is positive definite, i.e., x𝑯x>0x^{{\top}}\boldsymbol{H}x>0 for all x0x\neq 0. This indicates that 𝑯\boldsymbol{H} is nonsingular and hence (C3) holds. Finally, we provide conditions on the temporal difference error 𝒁𝜽b\boldsymbol{Z}^{{\top}}\boldsymbol{\theta}^{*}-b.

Condition (C4).

The temporal-difference error 𝐙𝛉b\boldsymbol{Z}^{{\top}}\boldsymbol{\theta}^{*}-b comes from the distribution (1αn)P+αnQ(1-\alpha_{n})P+\alpha_{n}Q, where αn[0,1)\alpha_{n}\in[0,1). The distribution QQ is an arbitrary distribution, and PP satisfies that 𝔼P[|𝐙𝛉b|1+δ|𝐗,𝐙]Cb{\mathbb{E}}_{P}\big{[}|\boldsymbol{Z}^{{\top}}\boldsymbol{\theta}^{*}-b|^{1+\delta}\big{|}\boldsymbol{X},\boldsymbol{Z}\big{]}\leq C_{b} holds uniformly for some constants δ>0\delta>0 and Cb>0C_{b}>0.

We assume that the temporal-difference error comes from the Huber contamination model. The outlier distribution QQ can be arbitrary and the true distribution PP has (1+δ)(1+\delta)-th order of moment (δ>0\delta>0), which does not necessarily indicate a variance exists when δ(0,1)\delta\in(0,1). Therefore, our assumption largely extends the boundedness condition of the reward function (s){\mathcal{R}}(s) in RL literature (Sutton and Barto, 2018; Ramprasad et al., 2022). In Condition (C4), αn\alpha_{n} controls the ratio of outliers among nn samples. Particularly, αn=mn/n\alpha_{n}=m_{n}/n means that there are mnm_{n} outliers among nn samples {(𝑿i,𝒁i,bi),1in}\{(\boldsymbol{X}_{i},\boldsymbol{Z}_{i},b_{i}),1\leq i\leq n\}. Given the above conditions, by selecting the thresholding parameter τi\tau_{i} as defined in (8), we can obtain the following statistical rate for our proposed ROPE estimator.

Theorem 1.

Suppose that (C1) to (C4) hold and the thresholding parameter τi=Cτmax(1,iβ1/(logi)β2)\tau_{i}=C_{\tau}\max(1,i^{\beta_{1}}/(\log i)^{\beta_{2}}) (where β1[0,1),β20\beta_{1}\in[0,1),\beta_{2}\geq 0, and Cτ>0C_{\tau}>0). Assume n0n_{0} is sufficiently large and the initial value |𝛉^0𝛉|2c0|\widehat{\boldsymbol{\theta}}_{0}-\boldsymbol{\theta}^{*}|_{2}\leq c_{0} for some c0<1c_{0}<1. Then for every ν>0\nu>0, there exist constants C,c>0C,c>0 which are independent of the dimension dd such that

(i=n0n{|𝜽^i𝜽|2Cdei})1cn0ν,{\mathbb{P}}\Big{(}\cap_{i=n_{0}}^{n}\big{\{}|\widehat{\boldsymbol{\theta}}_{i}-\boldsymbol{\theta}^{*}|_{2}\geq C\sqrt{d}e_{i}\big{\}}\Big{)}\geq 1-cn_{0}^{-\nu},

where

en=αnτn+τnmin(δ,2)+τn(1δ)+lognn+τnlog2nn+1d(c0)2nn0.e_{n}=\alpha_{n}\tau_{n}+\tau_{n}^{-\min(\delta,2)}+\sqrt{\frac{\tau_{n}^{(1-\delta)_{+}}\log n}{n}}+\frac{\tau_{n}\log^{2}n}{n}+\frac{1}{\sqrt{d}}(c_{0})^{2^{n-n_{0}}}. (14)

Here δ\delta is defined in the moment condition in (C4).

The error bounds ene_{n} in (14) contains five terms. In the sequel, we will discuss each of these terms individually. The first term αnτn\alpha_{n}\tau_{n} comes from the impact of outliers among the samples. Here the contamination ratio αn=mn/n\alpha_{n}=m_{n}/n should tend to zero; otherwise, it is impossible to obtain a consistent estimator for 𝜽\boldsymbol{\theta}^{*}. The second term τnmin(δ,2)\tau_{n}^{-\min(\delta,2)} is due to the bias from smoothed Huber’s loss for mean estimation. The third and the fourth term in (14) is the classical statistical rate due to the variance and boundedness incurred by thresholding, respectively. These two terms commonly appear in the Bernstein-type inequalities (see, e.g. Bennett, 1962). As we can see from the third term, the convergence rate has a phase transition between the regimes of finite variance δ>1\delta>1 and infinite variance 0<δ10<\delta\leq 1 (see also Corollary 3 below). This phenomenon has also been observed in different estimation problems with Huber loss; see Sun et al. (2020) and Fan et al. (2022). In the presence of a higher moment condition, specifically δ5\delta\geq 5, we can eliminate the fourth term by applying a more delicate analysis (See Proposition 5 below and Proposition 8 in Appendix for more details). The last term represents how quickly the proposed iterative algorithm converges. Given an initial value 𝜽^0\widehat{\boldsymbol{\theta}}_{0} with an error c0<1c_{0}<1, the proposed second-order method enjoys a local quadratic convergence, i.e., the error evolves from c0c_{0} to (c0)2nn0(c_{0})^{2^{n-n_{0}}} after nn0n-n_{0} iterations. This term decreases super-exponentially fast, a characteristic convergence rate often encountered in the realm of second-order methods (see, e.g., Nesterov, 2003). Specifically, when the sample size, which is equivalent to the number of iterations, is reasonably large, the last term (c0)2nn0=O(1/n)(c_{0})^{2^{n-n_{0}}}=O(1/\sqrt{n}) is dominated by the statistical error in the other terms of (14). The assumption on the initial error |𝜽^0𝜽|2c0|\widehat{\boldsymbol{\theta}}_{0}-\boldsymbol{\theta}^{*}|_{2}\leq c_{0} for some c0<1c_{0}<1 is mild. A valid initial value 𝜽^0\widehat{\boldsymbol{\theta}}_{0} can be obtained by the solution to the following estimation equation i=0n0𝑿igτ(𝒁i𝜽bi)=0\sum_{i=0}^{n_{0}}\boldsymbol{X}_{i}g_{\tau}(\boldsymbol{Z}^{{\top}}_{i}\boldsymbol{\theta}-b_{i})=0 using a subsample with fixed size n0n_{0} before running Algorithm 1 with a specified constant threshold τ\tau. The above equation can be solved efficiently by classical first-order root-finding algorithms.

To highlight the relationship between the convergence rate and the outlier rewards, we present the following corollary which gives a clear statement on how the rate depends on the number of outliers mnm_{n}.

Corollary 2.

Suppose the conditions in Theorem 1 hold with δ2\delta\geq 2. Let the thresholding parameter τi=Cτiβ\tau_{i}=C_{\tau}i^{\beta} with some β,Cτ>0\beta,C_{\tau}>0 and n0n_{0} tend to infinity. We have

|𝜽^n𝜽|2=O(d(lognn+log2nn1β+mnn1β+1n2β)),|\widehat{\boldsymbol{\theta}}_{n}-\boldsymbol{\theta}^{*}|_{2}=O_{{\mathbb{P}}}\left(\sqrt{d}\Big{(}\sqrt{\frac{\log n}{n}}+\frac{\log^{2}n}{n^{1-\beta}}+\frac{m_{n}}{n^{1-\beta}}+\frac{1}{n^{2\beta}}\Big{)}\right),

where mnm_{n} is the number of outliers among the samples of size nn.

The corollary shows that, when there are mn=o(n1β)m_{n}=o(n^{1-\beta}) outliers, our ROPE can still estimate 𝜽\boldsymbol{\theta}^{*} consistently. Note that β\beta in τi\tau_{i} is the exponent specified by the practitioner. A smaller β\beta allows more outliers mnm_{n} among the samples. Furthermore, when there are mn=o(n12β)m_{n}=o(n^{\frac{1}{2}-\beta}) outliers and 14β12\frac{1}{4}\leq\beta\leq\frac{1}{2}, the ROPE estimator achieves the optimal rate |𝜽^n𝜽|2=O(dlognn)|\widehat{\boldsymbol{\theta}}_{n}-\boldsymbol{\theta}^{*}|_{2}=O_{{\mathbb{P}}}\Big{(}\sqrt{\frac{d\log n}{n}}\Big{)}, up to a logarithm term.

The next corollary indicates the impact of the tail of rewards (s)\mathcal{R}(s) on the convergence rate. For a clear presentation, we discuss the impact under the case without outliers, i.e., αn=0\alpha_{n}=0.

Corollary 3.

Suppose the conditions in Theorem 1 hold with the contamination rate αn=0\alpha_{n}=0 and let n0n_{0} tend to infinity.

  • When δ(0,1]\delta\in(0,1], we specify τi=Cτ(i/logi)1/(1+δ)\tau_{i}=C_{\tau}(i/\log i)^{1/(1+\delta)}. Then

    |𝜽^n𝜽|2=O(d(lognn)δ1+δ).|\widehat{\boldsymbol{\theta}}_{n}-\boldsymbol{\theta}^{*}|_{2}=O_{{\mathbb{P}}}\left(\sqrt{d}\Big{(}\frac{\log n}{n}\Big{)}^{\frac{\delta}{1+\delta}}\right).
  • When δ>1\delta>1, we specify τi=Cτ(i/logi)1/min(δ+1,2)\tau_{i}=C_{\tau}(i/\log i)^{1/\min(\delta+1,2)}. Then

    |𝜽^n𝜽|2=O(d(lognn)1/2).|\widehat{\boldsymbol{\theta}}_{n}-\boldsymbol{\theta}^{*}|_{2}=O_{{\mathbb{P}}}\left(\sqrt{d}\Big{(}\frac{\log n}{n}\Big{)}^{1/2}\right).

Corollary 3 illustrates a sharp phase transition between the regimes of infinite variance δ(0,1]\delta\in(0,1] and finite variance δ>1\delta>1, and such transition is smooth and optimal. The rate of convergence established in Corollary 3 matches the offline oracle with independent samples in the estimation of linear regression model (Sun et al., 2020), ignoring the logarithm term. A similar corollary can be established for this phase transition of the convergence rate when the contamination rate αn>0\alpha_{n}>0.

3.2 Asymptotic Normality and Bahadur Representation

In the following theorem, we give the Bahadur representation for the proposed estimator 𝜽^n\widehat{\boldsymbol{\theta}}_{n}. The Bahadur representation provides a more refined rate for the estimation error. Moreover, the asymptotic normality can be established by applying the central limit theorem to the main term in the representation.

Theorem 4.

Suppose the conditions in Theorem 1 hold, and (C4) holds with δ5\delta\geq 5. Let n0n_{0} tend to infinity. We have for any nonzero 𝐯d\boldsymbol{v}\in{\mathbb{R}}^{d} with |𝐯|21|\boldsymbol{v}|_{2}\leq 1,

𝒗(𝜽^n𝜽)=𝒗𝑯11ni𝒬n𝑿i(𝒁i𝜽bi)+O(dτnαn+den12log2n+dτn2+d(nτn2)2/5).\boldsymbol{v}^{{\top}}(\widehat{\boldsymbol{\theta}}_{n}-\boldsymbol{\theta}^{*})=\boldsymbol{v}^{{\top}}\boldsymbol{H}^{-1}\frac{1}{n}\sum_{i\notin{\mathcal{Q}}_{n}}\boldsymbol{X}_{i}(\boldsymbol{Z}^{{\top}}_{i}\boldsymbol{\theta}^{*}-b_{i})+O_{{\mathbb{P}}}\Big{(}\sqrt{d}\tau_{n}\alpha_{n}+de_{n-1}^{2}\log^{2}n+\sqrt{d}\tau_{n}^{-2}+\frac{\sqrt{d}}{(n\tau_{n}^{2})^{2/5}}\Big{)}. (15)

Here ene_{n} is defined in (14), and 𝒬n{\mathcal{Q}}_{n} denotes the index set of the outliers. Moreover, if the contamination rate αn=o(1/(nτn))\alpha_{n}=o(1/(\sqrt{n}\tau_{n})) and n1/4=o(τn)n^{1/4}=o(\tau_{n}), then

nσ𝒗𝒗(𝜽^n𝜽)𝑑𝒩(0,1), where σ𝒗2=𝒗𝑯1𝚺(𝑯)1𝒗.\frac{\sqrt{n}}{\sigma_{\boldsymbol{v}}}\boldsymbol{v}^{{\top}}(\widehat{\boldsymbol{\theta}}_{n}-\boldsymbol{\theta}^{*})\xrightarrow{d}{\mathcal{N}}(0,1),\text{ where }\sigma_{\boldsymbol{v}}^{2}=\boldsymbol{v}^{{\top}}\boldsymbol{H}^{-1}\boldsymbol{\Sigma}(\boldsymbol{H}^{{\top}})^{-1}\boldsymbol{v}. (16)

Here

𝚺=k=𝔼[𝑿0𝑿k(𝒁0𝜽b0)(𝒁k𝜽bk)].\boldsymbol{\Sigma}=\sum_{k=-\infty}^{\infty}{\mathbb{E}}\big{[}\boldsymbol{X}_{0}\boldsymbol{X}^{{\top}}_{k}(\boldsymbol{Z}^{{\top}}_{0}\boldsymbol{\theta}^{*}-b_{0})(\boldsymbol{Z}^{{\top}}_{k}\boldsymbol{\theta}^{*}-b_{k})\big{]}. (17)

Theorem 4 provides a Bahadur representation for the proposed estimator 𝜽^n\widehat{\boldsymbol{\theta}}_{n}, which contains a main term corresponding to an asymptotic normal distribution in (16), and a higher-order remainder term. The asymptotic normality and its concrete form of the covariance matrix pave the way for further research on conducting statistical inference efficiently in an online fashion, demonstrated in Section 4 below. To achieve the asymptotic normality, additional conditions αn=o(1/(nτn))\alpha_{n}=o(1/(\sqrt{n}\tau_{n})) and n1/4=o(τn)n^{1/4}=o(\tau_{n}) are required on the specification of the thresholding parameters τn\tau_{n}. These conditions easily hold when the practitioner specifies τn=Cτnβ\tau_{n}=C_{\tau}n^{\beta} with β>1/4\beta>1/4 and the number of outliers satisfies mn=o(n1/2β)m_{n}=o(n^{1/2-\beta}). We further note that to establish the Bahadur representation in the above theorem, we require the sixth moment to exist (i.e., δ=5\delta=5). This condition may be weakened and we believe a similar Bahadur representation holds for δ>1\delta>1. We leave this theoretical question to future investigation.

It is worthwhile to note that even for the case where there is no contamination, i.e., αn=0\alpha_{n}=0, our result is still new. As far as we know, there is no literature establishing the Bahadur representation for the TD method. To highlight the rate of convergence concisely in the remainder term, we provide the following corollary under αn=0\alpha_{n}=0.

Proposition 5.

Suppose the conditions of Theorem 4 hold with αn=0\alpha_{n}=0 and τi=Cτiβ\tau_{i}=C_{\tau}i^{\beta} for β3/4\beta\geq 3/4 and some Cτ>0C_{\tau}>0, we have for any nonzero 𝐯d\boldsymbol{v}\in{\mathbb{R}}^{d} with |𝐯|21|\boldsymbol{v}|_{2}\leq 1,

𝒗(𝜽^n𝜽)=𝒗𝑯11ni=1n𝑿i(𝒁i𝜽bi)+O(dlognn).\boldsymbol{v}^{{\top}}(\widehat{\boldsymbol{\theta}}_{n}-\boldsymbol{\theta}^{*})=\boldsymbol{v}^{{\top}}\boldsymbol{H}^{-1}\frac{1}{n}\sum_{i=1}^{n}\boldsymbol{X}_{i}(\boldsymbol{Z}^{{\top}}_{i}\boldsymbol{\theta}^{*}-b_{i})+O_{{\mathbb{P}}}\Big{(}\frac{d\log n}{n}\Big{)}.

In this proposition, we specify β3/4\beta\geq 3/4 to ensure that the remainder term possesses an order of O(dlogn/n)O_{{\mathbb{P}}}\big{(}d\log n/n\big{)}. Notably, this result is not a direct corollary of Theorems 1 and 4, since the rate of en1e_{n-1} in (14) of Theorem 1 is not sharp enough to guarantee the term den12log2nde_{n-1}^{2}\log^{2}n in (15) converges as fast as O(dlogn/n)O\big{(}d\log n/n\big{)} in Proposition 5. To reach this fast rate of the remainder, we establish an improved rate of Theorem 1 under the stronger assumptions, achieved by eliminating the fourth term of (14). The detailed formal result of the improved rate of ene_{n} is relegated to Proposition 8 in Appendix.

Remark 1.

We provide some discussion on the remainder term in the Bahadur representation. Since we have not found any result on the Bahadur representation for TD algorithm in literature, we compare it with SGD instead, as the theoretical properties of TD are expected to be similar to SGD. For i.i.d. samples, Polyak and Juditsky (1992) shows that the average of SGD with learning rate cnηcn^{-\eta} is consistent when 0<η10<\eta\leq 1. Their proof indicates that the remainder term is at the order of O(1/n1η/2+1/nη)O_{{\mathbb{P}}}(1/n^{1-\eta/2}+1/n^{\eta}) (assuming that the dimension dd is a constant for simplicity). It is easy to see that SGD with any choice of 0<η10<\eta\leq 1 leads to a strictly slower rate of the remainder than that of our proposed method, O(n1logn)O(n^{-1}\log n).

4 Estimation of Long-Run Covariance Matrix and Online Statistical Inference

In the above section, we provide an online Newton-type algorithm to estimate the parameter 𝜽\boldsymbol{\theta}^{*}. As we can see in Theorem 4, the proposed estimator has an asymptotic variance with a sandwich structure 𝑯1𝚺(𝑯)1\boldsymbol{H}^{-1}\boldsymbol{\Sigma}(\boldsymbol{H}^{{\top}})^{-1}. To conduct statistical inference simultaneously with ROPE method we proposed in Section 3, we propose an online plug-in estimator for this sandwich structure. Note that the online estimator 𝑯^n1\widehat{\boldsymbol{H}}^{-1}_{n} of 𝑯1\boldsymbol{H}^{-1} is proposed and utilized in (11) of the estimation algorithm. It remains to construct an online estimator for 𝚺\boldsymbol{\Sigma} in (17).

Developing an online estimator of 𝚺\boldsymbol{\Sigma} with dependent samples is intricate compared to the case of independent ones. As the samples are dependent, the covariance matrix 𝚺\boldsymbol{\Sigma} is an infinite sum of the series of time-lag covariance matrices. To estimate the above long-run covariance matrix 𝚺\boldsymbol{\Sigma}, we first rewrite the definition of 𝚺\boldsymbol{\Sigma} in (17) into the following form

𝚺=\displaystyle\boldsymbol{\Sigma}= 𝔼[𝑿0𝑿0(𝒁0𝜽b0)2]+k=1𝔼[𝑿0𝑿k(𝒁0𝜽b0)(𝒁k𝜽bk)]\displaystyle{\mathbb{E}}\big{[}\boldsymbol{X}_{0}\boldsymbol{X}^{{\top}}_{0}(\boldsymbol{Z}^{{\top}}_{0}\boldsymbol{\theta}^{*}-b_{0})^{2}\big{]}+\sum_{k=-\infty}^{-1}{\mathbb{E}}\big{[}\boldsymbol{X}_{0}\boldsymbol{X}^{{\top}}_{k}(\boldsymbol{Z}^{{\top}}_{0}\boldsymbol{\theta}^{*}-b_{0})(\boldsymbol{Z}^{{\top}}_{k}\boldsymbol{\theta}^{*}-b_{k})\big{]}
+k=1𝔼[𝑿0𝑿k(𝒁0𝜽b0)(𝒁k𝜽bk)].\displaystyle+\sum_{k=1}^{\infty}{\mathbb{E}}\big{[}\boldsymbol{X}_{0}\boldsymbol{X}^{{\top}}_{k}(\boldsymbol{Z}^{{\top}}_{0}\boldsymbol{\theta}^{*}-b_{0})(\boldsymbol{Z}^{{\top}}_{k}\boldsymbol{\theta}^{*}-b_{k})\big{]}.

Next, we replace each term 𝔼[𝑿0𝑿k(𝒁0𝜽b0)(𝒁k𝜽bk)]{\mathbb{E}}\big{[}\boldsymbol{X}_{0}\boldsymbol{X}^{{\top}}_{k}(\boldsymbol{Z}^{{\top}}_{0}\boldsymbol{\theta}^{*}-b_{0})(\boldsymbol{Z}^{{\top}}_{k}\boldsymbol{\theta}^{*}-b_{k})\big{]} in (17) with its empirical counterpart using the nn samples. Meanwhile, to handle outliers, we replace (𝒁i𝜽bi)(\boldsymbol{Z}^{{\top}}_{i}\boldsymbol{\theta}^{*}-b_{i}) by gτi(𝒁i𝜽^i1bi)g_{\tau_{i}}(\boldsymbol{Z}^{{\top}}_{i}\widehat{\boldsymbol{\theta}}_{i-1}-b_{i}). In addition, the infinite sum in (17) is not feasible for direct estimation. Nonetheless, Condition (C1) on the mixing rate of {(𝑿t,𝒁t)}\{(\boldsymbol{X}_{t},\boldsymbol{Z}_{t})\} allows us to approximate it by effectively estimating the first λlogn\lceil\lambda\log n\rceil terms only, where λ\lambda is a pre-specified constant. In summary, we can construct the following estimator for the covariance matrix 𝚺\boldsymbol{\Sigma},

𝚺^n=\displaystyle\widehat{\boldsymbol{\Sigma}}_{n}= 1ni=1n𝑿i𝑿i(gτi(𝒁i𝜽^i1bi))2\displaystyle\frac{1}{n}\sum_{i=1}^{n}\boldsymbol{X}_{i}\boldsymbol{X}^{{\top}}_{i}\big{(}g_{\tau_{i}}(\boldsymbol{Z}^{{\top}}_{i}\widehat{\boldsymbol{\theta}}_{i-1}-b_{i})\big{)}^{2} (18)
+1ni=1nk=1λlogi(i1)𝑿i𝑿ikgτi(𝒁i𝜽^i1bi)gτik(𝒁ik𝜽^ik1bik)\displaystyle+\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{\lceil\lambda\log i\rceil\wedge(i-1)}\boldsymbol{X}_{i}\boldsymbol{X}^{{\top}}_{i-k}g_{\tau_{i}}(\boldsymbol{Z}^{{\top}}_{i}\widehat{\boldsymbol{\theta}}_{i-1}-b_{i})g_{\tau_{i-k}}(\boldsymbol{Z}^{{\top}}_{i-k}\widehat{\boldsymbol{\theta}}_{i-k-1}-b_{i-k})
+1ni=1nk=1λlogi(i1)𝑿ik𝑿igτik(𝒁ik𝜽^ik1bik)gτi(𝒁i𝜽^i1bi).\displaystyle+\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{\lceil\lambda\log i\rceil\wedge(i-1)}\boldsymbol{X}_{i-k}\boldsymbol{X}^{{\top}}_{i}g_{\tau_{i-k}}(\boldsymbol{Z}^{{\top}}_{i-k}\widehat{\boldsymbol{\theta}}_{i-k-1}-b_{i-k})g_{\tau_{i}}(\boldsymbol{Z}^{{\top}}_{i}\widehat{\boldsymbol{\theta}}_{i-1}-b_{i}).

To perform a fully-online update for the estimator 𝚺^n\widehat{\boldsymbol{\Sigma}}_{n}, we define

𝑺j=i=1j𝑿igτi(𝒁i𝜽^i1bi),j1,\boldsymbol{S}_{j}=\sum_{i=1}^{j}\boldsymbol{X}^{{\top}}_{i}g_{\tau_{i}}(\boldsymbol{Z}^{{\top}}_{i}\widehat{\boldsymbol{\theta}}_{i-1}-b_{i}),\quad j\geq 1,

and the above equation (18) can be rewritten as,

𝚺^n=\displaystyle\widehat{\boldsymbol{\Sigma}}_{n}= 1ni=1n𝑿i𝑿igτi2(𝒁i𝜽^i1bi)+1ni=1n𝑿igτi(𝒁i𝜽^i1bi)(𝑺i1𝑺iλlogi(i1))\displaystyle\frac{1}{n}\sum_{i=1}^{n}\boldsymbol{X}_{i}\boldsymbol{X}^{{\top}}_{i}g_{\tau_{i}}^{2}(\boldsymbol{Z}^{{\top}}_{i}\widehat{\boldsymbol{\theta}}_{i-1}-b_{i})+\frac{1}{n}\sum_{i=1}^{n}\boldsymbol{X}_{i}g_{\tau_{i}}(\boldsymbol{Z}^{{\top}}_{i}\widehat{\boldsymbol{\theta}}_{i-1}-b_{i})(\boldsymbol{S}_{i-1}-\boldsymbol{S}_{i-\lceil\lambda\log i\rceil\wedge(i-1)})
+1ni=1n(𝑺i1𝑺iλlogi(i1))𝑿igτi(𝒁i𝜽^i1bi).\displaystyle+\frac{1}{n}\sum_{i=1}^{n}(\boldsymbol{S}_{i-1}-\boldsymbol{S}_{i-\lceil\lambda\log i\rceil\wedge(i-1)})^{{\top}}\boldsymbol{X}^{{\top}}_{i}g_{\tau_{i}}(\boldsymbol{Z}^{{\top}}_{i}\widehat{\boldsymbol{\theta}}_{i-1}-b_{i}).

In practice, at the nn-th step, we keep only the terms {𝑺n,,𝑺nλlogn(n1)}\{\boldsymbol{S}_{n},...,\boldsymbol{S}_{n-\lceil\lambda\log n\rceil\wedge(n-1)}\} in memory, and the covariance matrix can be updated by

𝚺^n=n1n𝚺^n1+1n[\displaystyle\widehat{\boldsymbol{\Sigma}}_{n}=\frac{n-1}{n}\widehat{\boldsymbol{\Sigma}}_{n-1}+\frac{1}{n}\Big{[} 𝑿n𝑿ngτn2(𝒁n𝜽^n1bn)+𝑿ngτn(𝒁n𝜽^n1bn)(𝑺n1𝑺nλlogn(n1))\displaystyle\boldsymbol{X}_{n}\boldsymbol{X}^{{\top}}_{n}g_{\tau_{n}}^{2}(\boldsymbol{Z}^{{\top}}_{n}\widehat{\boldsymbol{\theta}}_{n-1}-b_{n})+\boldsymbol{X}_{n}g_{\tau_{n}}(\boldsymbol{Z}^{{\top}}_{n}\widehat{\boldsymbol{\theta}}_{n-1}-b_{n})(\boldsymbol{S}_{n-1}-\boldsymbol{S}_{n-\lceil\lambda\log n\rceil\wedge(n-1)})
+(𝑺n1𝑺nλlogn(n1))𝑿ngτn(𝒁n𝜽^n1bn)].\displaystyle+(\boldsymbol{S}_{n-1}-\boldsymbol{S}_{n-\lceil\lambda\log n\rceil\wedge(n-1)})^{{\top}}\boldsymbol{X}^{{\top}}_{n}g_{\tau_{n}}(\boldsymbol{Z}^{{\top}}_{n}\widehat{\boldsymbol{\theta}}_{n-1}-b_{n})\Big{]}. (19)

The proposed estimator 𝚺^n\widehat{\boldsymbol{\Sigma}}_{n} complements the estimator 𝑯^n1\widehat{\boldsymbol{H}}_{n}^{-1} in (11) to construct an estimator of the sandwich form 𝑯1𝚺(𝑯)1\boldsymbol{H}^{-1}\boldsymbol{\Sigma}(\boldsymbol{H}^{{\top}})^{-1} that appears in the asymptotic distribution in Theorem 4.

Specifically, we can construct the confidence interval for the parameter 𝒗𝜽\boldsymbol{v}^{{\top}}\boldsymbol{\theta}^{*} using the asymptotic distribution in (16) in the following way. For any unit vector 𝒗\boldsymbol{v}, a confidence interval with nominal level (1ξ)(1-\xi) is

[𝒗𝜽^nq1ξ/2σ^𝒗,𝒗𝜽^n+q1ξ/2σ^𝒗],\Big{[}\boldsymbol{v}^{{\top}}\widehat{\boldsymbol{\theta}}_{n}-q_{1-\xi/2}\widehat{\sigma}_{\boldsymbol{v}},\boldsymbol{v}^{{\top}}\widehat{\boldsymbol{\theta}}_{n}+q_{1-\xi/2}\widehat{\sigma}_{\boldsymbol{v}}\Big{]}, (20)

where σ^𝒗2=𝒗𝑯^n1𝚺^n(𝑯^n)1𝒗\widehat{\sigma}_{\boldsymbol{v}}^{2}=\boldsymbol{v}^{{\top}}\widehat{\boldsymbol{H}}_{n}^{-1}\widehat{\boldsymbol{\Sigma}}_{n}(\widehat{\boldsymbol{H}}_{n}^{{\top}})^{-1}\boldsymbol{v} and q1ξ/2q_{1-\xi/2} denotes the (1ξ/2)(1-\xi/2)-th quantile of a standard normal distribution.

Remark 2.

We provide some insights into the computational complexity of our inference procedure. Both estimators 𝐇^n1\widehat{\boldsymbol{H}}_{n}^{-1} and 𝚺^n\widehat{\boldsymbol{\Sigma}}_{n} can be computed in an O(d2)O(d^{2}) per-iteration computation cost, as discussed in (11). In addition, the computation σ^𝐯2=𝐯𝐇^n1𝚺^n(𝐇^n)1𝐯\widehat{\sigma}_{\boldsymbol{v}}^{2}=\boldsymbol{v}^{{\top}}\widehat{\boldsymbol{H}}_{n}^{-1}\widehat{\boldsymbol{\Sigma}}_{n}(\widehat{\boldsymbol{H}}_{n}^{{\top}})^{-1}\boldsymbol{v} requires three vector-matrix multiplication with the same computation complexity O(d2)O(d^{2}).

On the contrary, the online bootstrap method proposed in Ramprasad et al. (2022) has a per-iteration complexity of at least O(Bd)O(Bd), where the resampling size BB in the bootstrap is usually much larger than dd in practice.

In the following theorem, we provide the theoretical guarantee of the above construction of the confidence intervals by showing that our online estimator of the covariance matrix, 𝚺^n\widehat{\boldsymbol{\Sigma}}_{n}, is consistent.

Theorem 6.

Suppose the conditions in Theorem 4 hold. The covariance estimator 𝚺^n\widehat{\boldsymbol{\Sigma}}_{n} defined in (19) satisfies

𝚺^n𝚺=O(dτn2αn+dτn1+τndlognn+dτn2log2nn).\|\widehat{\boldsymbol{\Sigma}}_{n}-\boldsymbol{\Sigma}\|=O_{{\mathbb{P}}}\left(\sqrt{d}\tau_{n}^{2}\alpha_{n}+\sqrt{d}\tau_{n}^{-1}+\tau_{n}\sqrt{\frac{d\log n}{n}}+\frac{d\tau_{n}^{2}\log^{2}n}{n}\right).

Theorem 6 provides an upper bound on the estimation error 𝚺^n𝚺\widehat{\boldsymbol{\Sigma}}_{n}-\boldsymbol{\Sigma}. To achieve the consistency on 𝚺^n\widehat{\boldsymbol{\Sigma}}_{n}, we specify the thresholding parameter τn=Cτnβ\tau_{n}=C_{\tau}n^{\beta} for 1/4<β<1/21/4<\beta<1/2, such that

𝚺^n𝚺=O(dτn2αn+dlognn1/2β).\|\widehat{\boldsymbol{\Sigma}}_{n}-\boldsymbol{\Sigma}\|=O_{{\mathbb{P}}}\left(\sqrt{d}\tau_{n}^{2}\alpha_{n}+\frac{\sqrt{d\log n}}{n^{1/2-\beta}}\right).

In this case, as long as the fraction of outliers αn\alpha_{n} satisfies αn=o(1/(n2βd))\alpha_{n}=o(1/(n^{2\beta}\sqrt{d})), we obtain a consistent estimator of the matrix 𝚺\boldsymbol{\Sigma}. In other words, our proposed robust algorithm allows o(n12β)o(n^{1-2\beta}) outliers, ignoring the dimension. We summarize the result in the following corollary.

Corollary 7.

Suppose the conditions of Theorem 4 hold, and the fraction of outliers satisfies αn=o(1/(n2βd))\alpha_{n}=o(1/(n^{2\beta}\sqrt{d})), where we specify τi=Cτiβ\tau_{i}=C_{\tau}i^{\beta} and 1/4<β<1/21/4<\beta<1/2. Then given a vector 𝐯d\boldsymbol{v}\in{\mathbb{R}}^{d} and a pre-specified confidence level 1ξ1-\xi, we have

limn(𝒗𝜽[𝒗𝜽^nq1ξ/2σ^𝒗,𝒗𝜽^n+q1ξ/2σ^𝒗])=1ξ,\lim_{n\rightarrow\infty}{\mathbb{P}}\left(\boldsymbol{v}^{{\top}}\boldsymbol{\theta}^{*}\in\Big{[}\boldsymbol{v}^{{\top}}\widehat{\boldsymbol{\theta}}_{n}-q_{1-\xi/2}\widehat{\sigma}_{\boldsymbol{v}},\boldsymbol{v}^{{\top}}\widehat{\boldsymbol{\theta}}_{n}+q_{1-\xi/2}\widehat{\sigma}_{\boldsymbol{v}}\Big{]}\right)=1-\xi,

where σ^𝐯2=𝐯𝐇^n1𝚺^n(𝐇^n)1𝐯\widehat{\sigma}_{\boldsymbol{v}}^{2}=\boldsymbol{v}^{{\top}}\widehat{\boldsymbol{H}}_{n}^{-1}\widehat{\boldsymbol{\Sigma}}_{n}(\widehat{\boldsymbol{H}}_{n}^{{\top}})^{-1}\boldsymbol{v} and q1ξ/2q_{1-\xi/2} denotes the (1ξ/2)(1-\xi/2)-th quantile of a standard normal distribution.

5 Numerical Experiments

In this section, we assess the performance of our ROPE algorithm by conducting experiments under various settings. We construct all confidence intervals with a nominal coverage of 95%95\%. Throughout the majority of experiments, we compare our method with the online bootstrap method with the linear stochastic approximation proposed in Ramprasad et al. (2022), which we refer to as LSA. The hyperparameters for LSA are selected based on the settings described in Sections 5.1 and 5.2 of Ramprasad et al. (2022).

5.1 Parameter Inference for Infinite-Horizon MDP

In the first experiment, we focus on an infinite-horizon Markov Decision Process (MDP) setting. Specifically, we create an environment with a state space of size 5050 and an action space of size 55. The dimension of the features is fixed at 1010. The transition probability kernel of the MDP, the state features, and the policy under evaluation are all randomly generated. By employing the Bellman equation (1), we can compute the expected rewards at each state under the policy. To introduce variability, we add noise to the expected rewards, drawn from different distributions such as the standard normal distribution, tt distribution with a degree of freedom of 1.51.5, and standard Cauchy distribution. The main objective of this experiment is to construct a confidence interval for the first coordinate of the true parameter 𝜽\boldsymbol{\theta}^{*}.

We set the thresholding level as τi=C(i/(logi)2)β\tau_{i}=C(i/(\log i)^{2})^{\beta} where β\beta is a positive constant. We begin with an investigation of the influence of the parameters CC and β\beta on the coverage probability and width of the confidence interval in our ROPE algorithm. Figure 1 displays the results, revealing that the performance is relatively robust to variations in CC and β\beta, irrespective of the type of noise applied. Notably, although our theoretical guarantees are based on the noises with a 66-th order moment (See Theorem 4), the experiment indicates that our method exhibits a wider range of applicability.

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 1: Coverage probability (the first row) and the width of confidence interval (the second row) of ROPE\mathrm{ROPE} of various CC and β\beta. We specify the noise distribution as standard normal (the first column), Student’s t1.5t_{1.5} (the second column), and Cauchy (the third column).

In subsequent experiments, we specify C=0.5C=0.5 and β=1/3\beta=1/3 for the ROPE algorithm and compare it with the LSA method. Alongside the coverage probability and width of the confidence interval, we also assess the average running time for both methods. The findings are depicted in Figure 2. It is evident that, under a light-tailed noise that admits normal distribution, the two methods yield comparable results, with the ROPE method consistently outperforming LSA in terms of the confidence interval width. When the noise exhibits heavy-tailed characteristics, the confidence interval width of LSA is notably larger than that of ROPE. Additionally, the runtime of ROPE is approximately four times shorter than that of LSA.

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 2: Coverage probability (the first row), the width of confidence interval (the second row), and computing time (the third row) of ROPE\mathrm{ROPE} and LSA\mathrm{LSA}. We specify the noise distribution as standard normal (the first column), Student’s tt (the second column), and Cauchy (the third column).

In the LSA method, the step size is determined by the expression αiη\alpha i^{-\eta}, which relies on two positive hyperparameters α\alpha and η\eta. In this experiment, we investigate the sensitivity of LSA method on these parameters. For comparison, we also present the result of our ROPE algorithm, which does not require any hyperparameter tuning. In this particular experiment, we specify the noise distribution to be standard normal, and directly apply the online Newton step on the square loss (as opposed to the pseudo-Huber loss). Notably, Figure 3 illustrates that the LSA method is sensitive to the parameter α\alpha. Specifically, when α\alpha takes on larger values, the LSA method generates significantly wider confidence intervals than that of LSA with well-tuned hyperparameters, which is undesirable in practical applications. Meanwhile, our ROPE method always generates confidence intervals with a valid width and comparable coverage rate.

Refer to caption
Refer to caption
Figure 3: Coverage probability (left), and the width of confidence interval (right) of ROPE\mathrm{ROPE} and LSA\mathrm{LSA} of various α\alpha and η\eta. We specify the noise distribution as standard normal.

5.2 Value Inference for FrozenLake RL Environment

In this section, we consider the Frozenlake environment provided by OpenAI gym, which involves a character navigating an 8×88\times 8 grid world. The character’s objective is to start from the first tile of the grid and reach the end tile within each episode. If the character successfully reaches the target tile, a reward of 11 is obtained; otherwise, the reward is 0. In our setup, we generate the state features randomly, with a dimensionality of 44. The policy is trained using QQ-learning, and the true parameter can be explicitly computed using the transition probability matrix. Under a contaminated reward model, we introduce random perturbations by replacing the true reward with a value uniformly sampled from the range [0,100][0,100] with probability α{0,n1,0.05n1/2}\alpha\in\{0,n^{-1},0.05n^{-1/2}\}. Our main goal is to infer the value at the initial state.

Following the approach in Section 5.1, we proceed to examine the sensitivity of ROPE to the thresholding parameters CC and β\beta. We present the results in Figure 4, and observe that the performance of ROPE is influenced by both the contamination rate αn\alpha_{n} and the chosen thresholding parameter. Specifically, when αn\alpha_{n} is small, a larger thresholding parameter tends to yield better outcomes. However, as αn\alpha_{n} increases, the use of a large thresholding parameter can negatively impact performance.

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 4: Coverage probability (the first row) and the width of confidence interval (the second row) of ROPE\mathrm{ROPE} of various CC and β\beta. We set the contamination rate to be 0 (the first column), n1n^{-1} (the second column), and 0.05n1/20.05n^{-1/2} (the third column), respectively.

Subsequently, we set the values C=0.5C=0.5, β=1/3\beta=1/3 when αn=0\alpha_{n}=0 or n1n^{-1}, and C=0.1,β=1/3C=0.1,\beta=1/3 when αn=0.05n1/2\alpha_{n}=0.05n^{-1/2} for ROPE algorithm, and compare its performance with that of the LSA method. Figure 5 displays the results, which clearly demonstrates that our ROPE method consistently outperforms LSA in terms of coverage rates, CI widths, and running time. The advantage of ROPE becomes more pronounced as the contamination rate αn\alpha_{n} increases.

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 5: Coverage probability (the first row), the width of confidence interval (the second row), and computing time (the third row) of ROPE\mathrm{ROPE} and LSA\mathrm{LSA}. We set the contamination rate to be 0 (the first column), n1n^{-1} (the second column), and 0.05n1/20.05n^{-1/2} (the third column), respectively.

6 Concluding Remarks

This paper introduces a robust online Newton-type algorithm designed for policy evaluation in reinforcement learning under the influence of heavy-tailed noise and outliers. We demonstrate the estimation consistency as well as the asymptotic normality of our estimator. We further establish the Bahadur representation and show that it converges strictly faster than that of a prototypical first-order method such as TD. To further conduct statistical inference, we propose an online approach for constructing confidence intervals. Our experimental results highlight the efficiency and robustness of our approach when compared to the existing online bootstrap method. In future research, it would be intriguing to explore online statistical inference for off-policy evaluation.

References

  • Bennett (1962) Bennett, G. (1962), “Probability inequalities for the sum of independent random variables,” Journal of the American Statistical Association, 57, 33–45.
  • Berkes and Philipp (1979) Berkes, I. and Philipp, W. (1979), “Approximation theorems for independent and weakly dependent random vectors,” The Annals of Probability, 7, 29–54.
  • Billingsley (1968) Billingsley, P. (1968), Convergence of Probability Measures, Wiley Series in Probability and Mathematical Statistics, Wiley.
  • Byrd et al. (2016) Byrd, R. H., Hansen, S. L., Nocedal, J., and Singer, Y. (2016), “A stochastic quasi-Newton method for large-scale optimization,” SIAM Journal on Optimization, 26, 1008–1031.
  • Charbonnier et al. (1994) Charbonnier, P., Blanc-Feraud, L., Aubert, G., and Barlaud, M. (1994), “Two deterministic half-quadratic regularization algorithms for computed imaging,” in Proceedings of 1st International Conference on Image Processing.
  • Charbonnier et al. (1997) — (1997), “Deterministic edge-preserving regularization in computed imaging,” IEEE Transactions on Image Processing, 6, 298–311.
  • Chen et al. (2021a) Chen, H., Lu, W., and Song, R. (2021a), “Statistical inference for online decision making: In a contextual bandit setting,” Journal of the American Statistical Association, 116, 240–255.
  • Chen et al. (2021b) — (2021b), “Statistical inference for online decision making via stochastic gradient descent,” Journal of the American Statistical Association, 116, 708–719.
  • Chen et al. (2020) Chen, X., Lee, J. D., Tong, X. T., and Zhang, Y. (2020), “Statistical inference for model parameters in stochastic gradient descent,” The Annals of Statistics, 48, 251–273.
  • Dai et al. (2020) Dai, B., Nachum, O., Chow, Y., Li, L., Szepesvári, C., and Schuurmans, D. (2020), “Coindice: Off-policy confidence interval estimation,” Advances in Neural Information Processing Systems.
  • Deshpande et al. (2018) Deshpande, Y., Mackey, L., Syrgkanis, V., and Taddy, M. (2018), “Accurate inference for adaptive linear models,” in International Conference on Machine Learning, PMLR, pp. 1194–1203.
  • Doukhan et al. (1994) Doukhan, P., Massart, P., and Rio, E. (1994), “The functional central limit theorem for strongly mixing processes,” in Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, vol. 30, pp. 63–82.
  • Durmus et al. (2021) Durmus, A., Moulines, E., Naumov, A., Samsonov, S., and Wai, H.-T. (2021), “On the stability of random matrix product with Markovian noise: Application to linear stochastic approximation and TD learning,” in Proceedings of Thirty Fourth Conference on Learning Theory.
  • Everitt et al. (2017) Everitt, T., Krakovna, V., Orseau, L., and Legg, S. (2017), “Reinforcement learning with a corrupted reward channel,” in Proceedings of the 26th International Joint Conference on Artificial Intelligence.
  • Fan et al. (2022) Fan, J., Guo, Y., and Jiang, B. (2022), “Adaptive Huber regression on Markov-dependent data,” Stochastic Processes and their Applications, 150, 802–818.
  • Fang et al. (2018) Fang, Y., Xu, J., and Yang, L. (2018), “Online bootstrap confidence intervals for the stochastic gradient descent estimator,” Journal of Machine Learning Research, 19, 1–21.
  • Feng et al. (2021) Feng, Y., Tang, Z., Zhang, N., and Liu, Q. (2021), “Non-asymptotic confidence intervals of off-policy evaluation: Primal and dual bounds,” in International Conference on Learning Representations.
  • Freedman (1975) Freedman, D. A. (1975), “On tail probabilities for martingales,” The Annals of Probability, 3, 100–118.
  • Givchi and Palhang (2015) Givchi, A. and Palhang, M. (2015), “Quasi newton temporal difference learning,” in Proceedings of the Sixth Asian Conference on Machine Learning.
  • Hanna et al. (2017) Hanna, J., Stone, P., and Niekum, S. (2017), “Bootstrapping with models: Confidence intervals for off-policy evaluation,” in Proceedings of the AAAI Conference on Artificial Intelligence.
  • Hao et al. (2021) Hao, B., Ji, X., Duan, Y., Lu, H., Szepesvari, C., and Wang, M. (2021), “Bootstrapping fitted q-evaluation for off-policy inference,” in International Conference on Machine Learning.
  • Hastie et al. (2009) Hastie, T., Tibshirani, R., Friedman, J. H., and Friedman, J. H. (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, vol. 2, Springer.
  • Huber (1964) Huber, P. J. (1964), “Robust estimation of a location parameter,” The Annals of Mathematical Statistics, 35, 73–101.
  • Huber (1992) — (1992), “Robust estimation of a location parameter,” in Breakthroughs in Statistics, Springer.
  • Huber (2004) — (2004), Robust Statistics, vol. 523, John Wiley & Sons.
  • Jiang and Huang (2020) Jiang, N. and Huang, J. (2020), “Minimax value interval for off-policy evaluation and policy optimization,” Advances in Neural Information Processing Systems.
  • Kolter and Ng (2009) Kolter, J. Z. and Ng, A. Y. (2009), “Regularization and feature selection in least-squares temporal difference learning,” in Proceedings of the 26th Annual International Conference on Machine Learning.
  • Kormushev et al. (2013) Kormushev, P., Calinon, S., and Caldwell, D. G. (2013), “Reinforcement learning in robotics: Applications and real-world challenges,” Robotics, 2, 122–148.
  • Li and Sun (2023) Li, X. and Sun, Q. (2023), “Variance-aware robust reinforcement learning with linear function approximation under heavy-tailed rewards,” arXiv e-prints, arXiv:2303.05606.
  • Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015), “Human-level control through deep reinforcement learning,” Nature, 518, 529–533.
  • Mou et al. (2021) Mou, W., Pananjady, A., Wainwright, M. J., and Bartlett, P. L. (2021), “Optimal and instance-dependent guarantees for Markovian linear stochastic approximation,” arXiv e-prints, arXiv:2112.12770.
  • Murphy (2003) Murphy, S. A. (2003), “Optimal dynamic treatment regimes,” Journal of the Royal Statistical Society: Series B: Statistical Methodology, 65, 331–355.
  • Nesterov (2003) Nesterov, Y. (2003), Introductory Lectures on Convex Optimization: A Basic Course, vol. 87, Springer Science & Business Media.
  • Polyak and Juditsky (1992) Polyak, B. T. and Juditsky, A. B. (1992), “Acceleration of stochastic approximation by averaging,” SIAM Journal on Control and Optimization, 30, 838–855.
  • Ramprasad et al. (2022) Ramprasad, P., Li, Y., Yang, Z., Wang, Z., Sun, W. W., and Cheng, G. (2022), “Online bootstrap inference for policy evaluation in reinforcement learning,” Journal of the American Statistical Association, To appear.
  • Rio (2017) Rio, E. (2017), Asymptotic Theory of Weakly Dependent Random Processes, vol. 80, Springer.
  • Robbins and Monro (1951) Robbins, H. and Monro, S. (1951), “A stochastic approximation method,” The Annals of Mathematical Statistics, 22, 400–407.
  • Rosenblatt (1956) Rosenblatt, M. (1956), “A central limit theorem and a strong mixing condition,” Proceedings of the National Academy of Sciences, 42, 43–47.
  • Ruppert (1985) Ruppert, D. (1985), “A Newton-Raphson Version of the Multivariate Robbins-Monro Procedure,” The Annals of Statistics, 13, 236–245.
  • Ruppert (1988) — (1988), “Efficient estimations from a slowly convergent Robbins-Monro process,” Tech. rep., Cornell University Operations Research and Industrial Engineering.
  • Schraudolph et al. (2007) Schraudolph, N. N., Yu, J., and Günter, S. (2007), “A stochastic quasi-Newton method for online convex optimization,” in Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics.
  • Shao and Zhang (2022) Shao, Q.-M. and Zhang, Z.-S. (2022), “Berry–Esseen bounds for multivariate nonlinear statistics with applications to M-estimators and stochastic gradient descent algorithms,” Bernoulli, 28, 1548–1576.
  • Shi et al. (2018) Shi, C., Fan, A., Song, R., and Lu, W. (2018), “High-dimensional A-learning for optimal dynamic treatment regimes,” The Annals of Statistics, 46, 925–957.
  • Shi et al. (2021a) Shi, C., Song, R., Lu, W., and Li, R. (2021a), “Statistical inference for high-dimensional models via recursive online-score estimation,” Journal of the American Statistical Association, 116, 1307–1318.
  • Shi et al. (2021b) Shi, C., Zhang, S., Lu, W., and Song, R. (2021b), “Statistical inference of the value function for reinforcement learning in infinite-horizon settings,” Journal of the Royal Statistical Society. Series B: Statistical Methodology, 84, 765–793.
  • Sun et al. (2020) Sun, Q., Zhou, W.-X., and Fan, J. (2020), “Adaptive Huber Regression,” Journal of the American Statistical Association, 115, 254–265.
  • Sutton (1988) Sutton, R. (1988), “Learning to predict by the method of temporal differences,” Machine Learning, 3, 9–44.
  • Sutton and Barto (2018) Sutton, R. and Barto, A. (2018), Reinforcement Learning, second edition: An Introduction, Adaptive Computation and Machine Learning series, MIT Press.
  • Sutton et al. (2009) Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., and Wiewiora, E. (2009), “Fast gradient-descent methods for temporal-difference learning with linear function approximation,” in Proceedings of the 26th Annual International Conference on Machine Learning.
  • Syrgkanis and Zhan (2023) Syrgkanis, V. and Zhan, R. (2023), “Post-Episodic Reinforcement Learning Inference,” arXiv e-prints, arXiv:2302.08854.
  • Thomas et al. (2015) Thomas, P., Theocharous, G., and Ghavamzadeh, M. (2015), “High confidence policy improvement,” in International Conference on Machine Learning, PMLR.
  • Tsitsiklis and Van Roy (1997) Tsitsiklis, J. and Van Roy, B. (1997), “An analysis of temporal-difference learning with function approximation,” IEEE Transactions on Automatic Control, 42, 674–690.
  • Vershynin (2010) Vershynin, R. (2010), “Introduction to the non-asymptotic analysis of random matrices,” arXiv preprint arXiv:1011.3027.
  • Wang et al. (2020) Wang, J., Liu, Y., and Li, B. (2020), “Reinforcement learning with perturbed rewards,” in Proceedings of the AAAI conference on Artificial Intelligence.
  • Zhan et al. (2021) Zhan, R., Hadad, V., Hirshberg, D. A., and Athey, S. (2021), “Off-policy evaluation via adaptive weighting with data from contextual bandits,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 2125–2135.
  • Zhang et al. (2021) Zhang, K., Janson, L., and Murphy, S. (2021), “Statistical inference with M-estimators on adaptively collected data,” Advances in Neural Information Processing Systems, 34, 7460–7471.
  • Zhang et al. (2022) — (2022), “Statistical inference after adaptive sampling in non-Markovian environments,” arXiv preprint arXiv:2202.07098.

7 Appendix

7.1 Technical Lemmas

Lemma 1.

Let Y1,,YnY_{1},...,Y_{n} be ϕ\phi-mixing sequence with ϕ(k)=O(ρk)\phi(k)=O(\rho^{k}). Assume that |Yi|M|Y_{i}|\leq M and 𝔼Yi=0{\mathbb{E}}Y_{i}=0. Then for any fixed ν>0\nu>0 and 1kn1\leq k\leq n, we have some constant CC such that

(|i=1kYik|Clognk+Clog2nk)=O(nν)\displaystyle{\mathbb{P}}\Bigg{(}\Big{|}\frac{\sum_{i=1}^{k}Y_{i}}{k}\Big{|}\geq C\sqrt{\frac{\log n}{k}}+C\frac{\log^{2}n}{k}\Bigg{)}=O(n^{-\nu})
Proof.

We divide the kk-tuple (1,,k)(1,\dots,k) into mkm_{k} different subsets H1,,HmkH_{1},\dots,H_{m_{k}}, where mk=k/λlognm_{k}=\lceil k/\lceil\lambda\log n\rceil\rceil. Here |Hi|=λlogn|H_{i}|=\lceil\lambda\log n\rceil for 1imk11\leq i\leq m_{k}-1, and |Hmj|λlogn|H_{m_{j}}|\leq\lceil\lambda\log n\rceil. λ\lambda is a sufficiently large constant which will be specified later. Then we have mkk/(λlogn)m_{k}\approx k/(\lambda\log n). Without loss of generality, we assume that mkm_{k} is an even integer.

Let ξl=1λlogniH2l1Yi\xi_{l}=\frac{1}{\lceil\lambda\log n\rceil}\sum_{i\in H_{2l-1}}Y_{i}, then we know |ξl|M|\xi_{l}|\leq M and 𝔼[ξl]=0{\mathbb{E}}[\xi_{l}]=0 holds. For all B1σ(ξ1,,ξl)B_{1}\in\sigma(\xi_{1},\dots,\xi_{l}) and B2σ(ξl+1)B_{2}\in\sigma(\xi_{l+1}), there holds |(B2|B1)(B2)|ϕ(λlogn)|{\mathbb{P}}(B_{2}|B_{1})-{\mathbb{P}}(B_{2})|\leq\phi(\lceil\lambda\log n\rceil) for all ll. By Theorem 2 of Berkes and Philipp (1979), there exists a sequence of independent variables ηl\eta_{l}, l1l\geq 1 with ηl\eta_{l} having the same distribution as ξl\xi_{l}, and

(|ξlηl|6ϕ(λlogn))6ϕ(λlogn).{\mathbb{P}}\Big{(}|\xi_{l}-\eta_{l}|\geq 6\phi(\lceil\lambda\log n\rceil)\Big{)}\leq 6\phi(\lceil\lambda\log n\rceil).

For any ν>0\nu>0, we can take λ(ν+1)/|logρ|\lambda\geq(\nu+1)/|\log\rho| so that

(|2mkl=1mk/2(ξlηl)|C1nν)\displaystyle{\mathbb{P}}\Big{(}\Big{|}\frac{2}{m_{k}}\sum_{l=1}^{m_{k}/2}(\xi_{l}-\eta_{l})\Big{|}\geq C_{1}n^{-\nu}\Big{)} (21)
\displaystyle\leq (|2mkl=1mk/2(ξlηl)|6ϕ(λlogn))3mkϕ(λlogn)C1nν,\displaystyle{\mathbb{P}}\Big{(}\Big{|}\frac{2}{m_{k}}\sum_{l=1}^{m_{k}/2}(\xi_{l}-\eta_{l})\Big{|}\geq 6\phi(\lceil\lambda\log n\rceil)\Big{)}\leq 3m_{k}\phi(\lceil\lambda\log n\rceil)\leq C_{1}n^{-\nu},

where C1>0C_{1}>0 is some constant. Next, we bound the variance of ξl\xi_{l}. From equation (20.23) of Billingsley (1968) we know that for arbitrary i,ji,j, there is

|𝔼[YiYj]𝔼[Yi]𝔼[Yj]|2ϕ(|ij|)𝔼[Yi2]𝔼[Yj2]2ρ|ij|/2M2.\big{|}{\mathbb{E}}[Y_{i}Y_{j}]-{\mathbb{E}}[Y_{i}]{\mathbb{E}}[Y_{j}]\big{|}\leq 2\sqrt{\phi(|i-j|)}\sqrt{{\mathbb{E}}[Y_{i}^{2}]{\mathbb{E}}[Y_{j}^{2}]}\leq 2\rho^{|i-j|/2}M^{2}.

Then we compute that

Var(ηl)=Var(ξl)=\displaystyle{\rm Var}(\eta_{l})={\rm Var}(\xi_{l})= 1λlogn2[i,jH2l1{𝔼(YiYj)𝔼(Yi)𝔼(Yj)}]\displaystyle\frac{1}{\lceil\lambda\log n\rceil^{2}}\Big{[}\sum_{i,j\in H_{2l-1}}\{{\mathbb{E}}(Y_{i}Y_{j})-{\mathbb{E}}(Y_{i}){\mathbb{E}}(Y_{j})\}\Big{]} (22)
\displaystyle\leq 2λlogn2i,j=1λlognρ|ij|/2M2\displaystyle\frac{2}{\lceil\lambda\log n\rceil^{2}}\sum_{i,j=1}^{\lceil\lambda\log n\rceil}\rho^{|i-j|/2}M^{2}
\displaystyle\leq 2λlognλlogn2(21ρ1)M24(1ρ)λlognM2.\displaystyle\frac{2\lceil\lambda\log n\rceil}{\lceil\lambda\log n\rceil^{2}}\Big{(}\frac{2}{1-\sqrt{\rho}}-1\Big{)}M^{2}\leq\frac{4}{(1-\sqrt{\rho})\lceil\lambda\log n\rceil}M^{2}.

Notice that ηl\eta_{l}’s are all uniformly bounded by MM, we can apply Bernstein’s inequality (Bennett, 1962) and yield

(|2mkl=1mk/2ηl|t)2exp(mkt2/4Var(ηl)/2+Mt/3)\displaystyle{\mathbb{P}}\Big{(}\Big{|}\frac{2}{m_{k}}\sum_{l=1}^{m_{k}/2}\eta_{l}\Big{|}\geq t\Big{)}\leq 2\exp\Big{(}-\frac{m_{k}t^{2}/4}{{\rm Var}(\eta_{l})/2+Mt/3}\Big{)}
\displaystyle\leq exp(mkλlognt2/42M2/(1ρ)+λlognMt/3).\displaystyle\exp\Big{(}-\frac{m_{k}\lceil\lambda\log n\rceil t^{2}/4}{2M^{2}/(1-\sqrt{\rho})+\lceil\lambda\log n\rceil Mt/3}\Big{)}.

We take t=C(logn/k+log2n/k)t=C(\sqrt{\log n/k}+\log^{2}n/k) for some CC large enough (note that kmkλlognk\approx m_{k}\lceil\lambda\log n\rceil). Then we have that

(|2mkl=1mk/2ηl|t)=O(nν).{\mathbb{P}}\Big{(}\Big{|}\frac{2}{m_{k}}\sum_{l=1}^{m_{k}/2}\eta_{l}\Big{|}\geq t\Big{)}=O(n^{-\nu}). (23)

Combining (21) and (23) we have that

(|2mkl=1mk/2ξl|Clognk+Clog2nk)=O(nν).\displaystyle{\mathbb{P}}\Big{(}\Big{|}\frac{2}{m_{k}}\sum_{l=1}^{m_{k}/2}\xi_{l}\Big{|}\geq C\sqrt{\frac{\log n}{k}}+C\frac{\log^{2}n}{k}\Big{)}=O(n^{-\nu}). (24)

Next we denote ξ~=1λlogniH2lYi\widetilde{\xi}=\frac{1}{\lceil\lambda\log n\rceil}\sum_{i\in H_{2l}}Y_{i}. We can similarly show that

(|2mkl=1mk/2ξ~l|Clognk+Clog2nk)=O(nν).\displaystyle{\mathbb{P}}\Big{(}\Big{|}\frac{2}{m_{k}}\sum_{l=1}^{m_{k}/2}\widetilde{\xi}_{l}\Big{|}\geq C\sqrt{\frac{\log n}{k}}+C\frac{\log^{2}n}{k}\Big{)}=O(n^{-\nu}). (25)

Combining (24) and (25) we prove the desired result. ∎

Lemma 2.

Let 𝐘1,,𝐘n\boldsymbol{Y}_{1},\dots,\boldsymbol{Y}_{n} be random vectors in p{\mathbb{R}}^{p}. Let 𝒢i=σ(𝐘1,,𝐘i){\mathcal{G}}_{i}=\sigma(\boldsymbol{Y}_{1},\dots,\boldsymbol{Y}_{i}). Assume that

𝔼[𝒀i|𝒢i1]=𝟎,max1insup𝒗𝕊p|𝒗𝒀i|1,\displaystyle{\mathbb{E}}[\boldsymbol{Y}_{i}|{\mathcal{G}}_{i-1}]=\boldsymbol{0},\quad\max_{1\leq i\leq n}\sup_{\boldsymbol{v}\in{\mathbb{S}}^{p}}|\boldsymbol{v}^{{\top}}\boldsymbol{Y}_{i}|\leq 1,
sup𝒗𝕊pi=1nVar[𝒗𝒀i|𝒢i1]bn.\displaystyle\sup_{\boldsymbol{v}\in{\mathbb{S}}^{p}}\sum_{i=1}^{n}{\rm Var}[\boldsymbol{v}^{{\top}}\boldsymbol{Y}_{i}|{\mathcal{G}}_{i-1}]\leq b_{n}.

Then for every ν>0\nu>0, there is

(|i=1n𝒀i|2C(dlogn+dbnlogn))=O(nνd),{\mathbb{P}}\Big{(}\big{|}\sum_{i=1}^{n}\boldsymbol{Y}_{i}\big{|}_{2}\geq C\big{(}d\log n+\sqrt{db_{n}\log n}\big{)}\Big{)}=O(n^{-\nu d}),

for some CC sufficiently large.

Proof.

Let 𝔑{\mathfrak{N}} be the 1/21/2-net of the unit ball 𝕊d1{\mathbb{S}}^{d-1}. By Lemma 5.2 of Vershynin (2010) we know that |𝔑|5d|{\mathfrak{N}}|\leq 5^{d}. Then we have that

|i=1n𝒀i|2=\displaystyle\Big{|}\sum_{i=1}^{n}\boldsymbol{Y}_{i}\Big{|}_{2}= sup𝒗𝕊d1|i=1n𝒗𝒀i|\displaystyle\sup_{\boldsymbol{v}\in{\mathbb{S}}^{d-1}}\Big{|}\sum_{i=1}^{n}\boldsymbol{v}^{{\top}}\boldsymbol{Y}_{i}\Big{|}
\displaystyle\leq sup𝒗~𝔑|i=1n𝒗~𝒀i|+sup|𝒗𝒗~|21/2|i=1n(𝒗𝒗~)𝒀i|,\displaystyle\sup_{\widetilde{\boldsymbol{v}}\in{\mathfrak{N}}}\Big{|}\sum_{i=1}^{n}\widetilde{\boldsymbol{v}}^{{\top}}\boldsymbol{Y}_{i}\Big{|}+\sup_{|\boldsymbol{v}-\widetilde{\boldsymbol{v}}|_{2}\leq 1/2}\Big{|}\sum_{i=1}^{n}(\boldsymbol{v}-\widetilde{\boldsymbol{v}})^{{\top}}\boldsymbol{Y}_{i}\Big{|},
|i=1n𝒀i|2\displaystyle\Rightarrow\Big{|}\sum_{i=1}^{n}\boldsymbol{Y}_{i}\Big{|}_{2}\leq 2sup𝒗~𝔑|i=1n𝒗~𝒀i|.\displaystyle 2\sup_{\widetilde{\boldsymbol{v}}\in{\mathfrak{N}}}\Big{|}\sum_{i=1}^{n}\widetilde{\boldsymbol{v}}^{{\top}}\boldsymbol{Y}_{i}\Big{|}.

By Theorem 1.6 of Freedman (1975) we know that

sup𝒗~𝕊d1(|i=1n𝒗~𝒀i|C1(dlogn+dbnlogn))=O(n(ν+log5)d)\sup_{\widetilde{\boldsymbol{v}}\in{\mathbb{S}}^{d-1}}{\mathbb{P}}\Big{(}\big{|}\sum_{i=1}^{n}\widetilde{\boldsymbol{v}}^{{\top}}\boldsymbol{Y}_{i}\big{|}\geq C_{1}\big{(}d\log n+\sqrt{db_{n}\log n}\big{)}\Big{)}=O(n^{-(\nu+\log 5)d})

for some large constant C>0C^{\prime}>0. Therefore

(|i=1n𝒀i|22C1(dlogn+dbnlogn))\displaystyle{\mathbb{P}}\Big{(}\big{|}\sum_{i=1}^{n}\boldsymbol{Y}_{i}\big{|}_{2}\geq 2C_{1}\big{(}d\log n+\sqrt{db_{n}\log n}\big{)}\Big{)}
\displaystyle\leq (sup𝒗~𝔑|i=1n𝒗~𝒀i|C1(dlogn+dbnlogn))\displaystyle{\mathbb{P}}\Big{(}\sup_{\widetilde{\boldsymbol{v}}\in{\mathfrak{N}}}\big{|}\sum_{i=1}^{n}\widetilde{\boldsymbol{v}}^{{\top}}\boldsymbol{Y}_{i}\big{|}\geq C_{1}\big{(}d\log n+\sqrt{db_{n}\log n}\big{)}\Big{)}
\displaystyle\leq 5dsup𝒗~𝔑(|i=1n𝒗~𝒀i|C1(dlogn+dbnlogn))=O(nνd),\displaystyle 5^{d}\sup_{\widetilde{\boldsymbol{v}}\in{\mathfrak{N}}}{\mathbb{P}}\Big{(}\big{|}\sum_{i=1}^{n}\widetilde{\boldsymbol{v}}^{{\top}}\boldsymbol{Y}_{i}\big{|}\geq C_{1}\big{(}d\log n+\sqrt{db_{n}\log n}\big{)}\Big{)}=O(n^{-\nu d}),

which proves the lemma. ∎

Lemma 3.

Let 𝐘\boldsymbol{Y} be a d×dd\times d random matrix with 𝔼[𝐘]=𝟎{\mathbb{E}}[\boldsymbol{Y}]=\boldsymbol{0}, 𝐘1\|\boldsymbol{Y}\|\leq 1, {\mathcal{F}} be the σ\sigma-field generated by 𝐘\boldsymbol{Y}. Then for any σ\sigma-field 𝒢{\mathcal{G}}, there holds

𝔼[𝔼[𝒀|𝒢]]d2πϕ,{\mathbb{E}}\Big{[}\|{\mathbb{E}}[\boldsymbol{Y}|{\mathcal{G}}]\|\Big{]}\leq d\sqrt{2\pi\phi},

where

ϕ=supA,B𝒢|(AB)(A)(B)|.\phi=\sup_{A\in{\mathcal{F}},B\in{\mathcal{G}}}|{\mathbb{P}}(AB)-{\mathbb{P}}(A){\mathbb{P}}(B)|.
Proof.

By elementary inequality of matrix, we know that

|𝒀|𝒀1, and 𝒀𝒀F.|\boldsymbol{Y}|_{\infty}\leq\|\boldsymbol{Y}\|\leq 1,\quad\text{ and }\quad\|\boldsymbol{Y}\|\leq\|\boldsymbol{Y}\|_{F}.

For every element Yi,jY_{i,j} of 𝒀\boldsymbol{Y}, by Lemma 4.4.1 of Berkes and Philipp (1979) we have that

𝔼[|𝔼[Yij|𝒢]|2]𝔼[|𝔼[Yij|𝒢]|]2πϕ.{\mathbb{E}}\Big{[}\big{|}{\mathbb{E}}[Y_{ij}|{\mathcal{G}}]\big{|}^{2}\Big{]}\leq{\mathbb{E}}\Big{[}\big{|}{\mathbb{E}}[Y_{ij}|{\mathcal{G}}]\big{|}\Big{]}\leq 2\pi\phi.

Therefore, we have that

𝔼[𝔼[𝒀|𝒢]]𝔼[𝔼[𝒀|𝒢]F],\displaystyle{\mathbb{E}}\Big{[}\|{\mathbb{E}}[\boldsymbol{Y}|{\mathcal{G}}]\|\Big{]}\leq{\mathbb{E}}\Big{[}\|{\mathbb{E}}[\boldsymbol{Y}|{\mathcal{G}}]\|_{F}\Big{]},
\displaystyle\leq {𝔼[𝔼[𝒀|𝒢]F2]}1/2\displaystyle\Big{\{}{\mathbb{E}}\Big{[}\|{\mathbb{E}}[\boldsymbol{Y}|{\mathcal{G}}]\|^{2}_{F}\Big{]}\Big{\}}^{1/2}
\displaystyle\leq {𝔼[i,j=1d|𝔼[Yij|𝒢]|2]}1/2d22πϕ=d2πϕ.\displaystyle\Big{\{}{\mathbb{E}}\Big{[}\sum_{i,j=1}^{d}\big{|}{\mathbb{E}}[Y_{ij}|{\mathcal{G}}]\big{|}^{2}\Big{]}\Big{\}}^{1/2}\leq\sqrt{d^{2}2\pi\phi}=d\sqrt{2\pi\phi}.

The proof is complete. ∎

Lemma 4.

(Approximation of pseudo-Huber gradient) The gradient of pseudo-Huber loss gτ(x)g_{\tau}(x) satisfies

|gτ(x)x|\displaystyle|g_{\tau}(x)-x| 12τδ|x|1+δ, for δ(0,2]\displaystyle\leq\frac{1}{2}\tau^{-\delta}|x|^{1+\delta},\quad\text{ for }\delta\in(0,2] (26)
|gτ(x)1|\displaystyle|g^{\prime}_{\tau}(x)-1| 52τ1δ|x|1+δ for δ(0,1]\displaystyle\leq\frac{5}{2}\tau^{-1-\delta}|x|^{1+\delta}\quad\text{ for }\delta\in(0,1]

uniformly for |x|τ|x|\leq\tau. And

|gτ(x)x||x|,|gτ(x)1|1,|g_{\tau}(x)-x|\leq|x|,\quad|g^{\prime}_{\tau}(x)-1|\leq 1, (27)

holds for all xx.

Proof.

It is easy to compute that

gτ(x)=x1+x2/τ2,gτ(x)=1(1+x2/τ2)3/2.g_{\tau}(x)=\frac{x}{\sqrt{1+x^{2}/\tau^{2}}},\quad g^{\prime}_{\tau}(x)=\frac{1}{(1+x^{2}/\tau^{2})^{3/2}}.

It is not hard to see that

|gτ(x)x||x|,|gτ(x)1|1.|g_{\tau}(x)-x|\leq|x|,\quad|g^{\prime}_{\tau}(x)-1|\leq 1.

This proves (27). Furthermore, when |x|τ|x|\leq\tau, we have

|gτ(x)x|=\displaystyle|g_{\tau}(x)-x|= x2/τ2(1+1+x2/τ2)1+x2/τ|x|\displaystyle\frac{x^{2}/\tau^{2}}{(1+\sqrt{1+x^{2}/\tau^{2}})\sqrt{1+x^{2}/\tau}}\cdot|x|
\displaystyle\leq 12τδ|x|1+δ\displaystyle\frac{1}{2}\tau^{-\delta}|x|^{1+\delta}

for all 0δ20\leq\delta\leq 2. Similarly,

|gτ(x)1|=\displaystyle|g^{\prime}_{\tau}(x)-1|= x2/τ2(2+1+x2/τ2+x2/τ2)(1+1+x2/τ2)(1+x2/τ2)3/2\displaystyle\frac{x^{2}/\tau^{2}(2+\sqrt{1+x^{2}/\tau^{2}}+x^{2}/\tau^{2})}{(1+\sqrt{1+x^{2}/\tau^{2}})(1+x^{2}/\tau^{2})^{3/2}}
\displaystyle\leq 52τ1δ|x|1+δ,\displaystyle\frac{5}{2}\tau^{-1-\delta}|x|^{1+\delta},

for |x|τ|x|\leq\tau. The proof is complete. ∎

Lemma 5.

For any sequence {ak}\{a_{k}\} where ak=(logk)β1/kβ2a_{k}=(\log k)^{\beta_{1}}/k^{\beta_{2}} and β10\beta_{1}\geq 0, there hold

n0an0+k=n0+1nak{Cnanlognif β21;C(logn)β1/n0β21if β2>1.n_{0}a_{n_{0}}+\sum_{k=n_{0}+1}^{n}a_{k}\leq\begin{cases}Cna_{n}\log n\quad&\text{if }\beta_{2}\leq 1;\\ C(\log n)^{\beta_{1}}/n_{0}^{\beta_{2}-1}\quad&\text{if }\beta_{2}>1.\end{cases}

Here the constant CC only depends on the parameter β2\beta_{2}.

Proof.

Directly compute that

n0an0+k=n0+1nak(logn)β1(n0n0β2+k=n0+1n1kβ2).n_{0}a_{n_{0}}+\sum_{k=n_{0}+1}^{n}a_{k}\leq(\log n)^{\beta_{1}}\Big{(}\frac{n_{0}}{n_{0}^{\beta_{2}}}+\sum_{k=n_{0}+1}^{n}\frac{1}{k^{\beta_{2}}}\Big{)}. (28)

We know that

k=n0+1n1kβ2n0n1xβ2dx=11β2(n1β2n01β2).\sum_{k=n_{0}+1}^{n}\frac{1}{k^{\beta_{2}}}\leq\int_{n_{0}}^{n}\frac{1}{x^{\beta_{2}}}\mathrm{d}x=\frac{1}{1-\beta_{2}}\Big{(}n^{1-\beta_{2}}-n_{0}^{1-\beta_{2}}\Big{)}.

When β2<1\beta_{2}<1, (28) can be bounded by

n0an0+k=n0+1nak\displaystyle n_{0}a_{n_{0}}+\sum_{k=n_{0}+1}^{n}a_{k}\leq (logn)β1(n01β2+11β2(n1β2n01β2))\displaystyle(\log n)^{\beta_{1}}\Big{(}n_{0}^{1-\beta_{2}}+\frac{1}{1-\beta_{2}}(n^{1-\beta_{2}}-n_{0}^{1-\beta_{2}})\Big{)}
\displaystyle\leq (logn)β1n1β2(β21β2(n0n)1β2+11β2)\displaystyle(\log n)^{\beta_{1}}n^{1-\beta_{2}}\Big{(}-\frac{\beta_{2}}{1-\beta_{2}}\Big{(}\frac{n_{0}}{n}\Big{)}^{1-\beta_{2}}+\frac{1}{1-\beta_{2}}\Big{)}
\displaystyle\leq 11β2nan.\displaystyle\frac{1}{1-\beta_{2}}na_{n}.

When β2>1\beta_{2}>1, (28) can be bounded by

n0an0+k=n0+1nak\displaystyle n_{0}a_{n_{0}}+\sum_{k=n_{0}+1}^{n}a_{k}\leq (logn)β1(β2β21×1n0β211β21×1nβ21)\displaystyle(\log n)^{\beta_{1}}\Big{(}\frac{\beta_{2}}{\beta_{2}-1}\times\frac{1}{n_{0}^{\beta_{2}-1}}-\frac{1}{\beta_{2}-1}\times\frac{1}{n^{\beta_{2}-1}}\Big{)}
\displaystyle\leq β2β21(logn)β11n0β21.\displaystyle\frac{\beta_{2}}{\beta_{2}-1}(\log n)^{\beta_{1}}\frac{1}{n_{0}^{\beta_{2}-1}}.

When β2=1\beta_{2}=1, (28) can be bounded by

n0an0+k=n0+1nak\displaystyle n_{0}a_{n_{0}}+\sum_{k=n_{0}+1}^{n}a_{k}\leq (logn)β1(1+logn)\displaystyle(\log n)^{\beta_{1}}\big{(}1+\log n\big{)}
\displaystyle\leq 2nanlogn.\displaystyle 2na_{n}\log n.

7.2 Proof of Results in Section 2

Proof of Theorem 1.

For each kn0k\geq n_{0}, when δ>0\delta>0, we denote

ek=αkτk+τkmin(δ,2)+τk(1δ)+logkk+log2kτkk+1d(c0)2kn0,e_{k}=\alpha_{k}\tau_{k}+\tau_{k}^{-\min(\delta,2)}+\sqrt{\frac{\tau_{k}^{(1-\delta)_{+}}\log k}{k}}+\frac{\log^{2}k\tau_{k}}{k}+\frac{1}{\sqrt{d}}(c_{0})^{2^{k-n_{0}}}, (29)

where δ=min{2,δ}\delta^{\prime}=\min\{2,\delta\}. Given a sufficiently large constant Ψ>0\Psi>0, we define the event

Ek=\displaystyle E_{k}= {|𝜽^k𝜽|2Ψdek}{𝑯^k𝑯C0Ψdeklogk}\displaystyle\Big{\{}|\widehat{\boldsymbol{\theta}}_{k}-\boldsymbol{\theta}^{*}|_{2}\leq\Psi\sqrt{d}e_{k}\Big{\}}\cap\Big{\{}\|\widehat{\boldsymbol{H}}_{k}-\boldsymbol{H}\|\leq C_{0}\Psi\sqrt{d}e_{k}\log k\Big{\}} (30)
{|1ki=1k𝑿igτi(𝒁i𝜽bi)|2C0dek},\displaystyle\cap\Big{\{}\Big{|}\frac{1}{k}\sum_{i=1}^{k}\boldsymbol{X}_{i}g_{\tau_{i}}(\boldsymbol{Z}_{i}^{{\top}}\boldsymbol{\theta}^{*}-b_{i})\Big{|}_{2}\leq C_{0}\sqrt{d}e_{k}\Big{\}},

where the constant C0C_{0} is a constant that does not depend on Ψ\Psi. We will show that (Ekc,i=n0k1Ei)C0kν{\mathbb{P}}(E_{k}^{c},\cap_{i=n_{0}}^{k-1}E_{i})\leq C_{0}k^{-\nu}, where ν>1\nu>1 is some constant and C0>0C_{0}>0 is some constant only depends on ν\nu. Let us prove it by induction on kk. By the choice of initial parameter 𝜽^0\widehat{\boldsymbol{\theta}}_{0}, we know (30) holds for k=n0k=n_{0}. For n=k+1n=k+1, from (8) we have that

𝜽^k+1𝜽=\displaystyle\widehat{\boldsymbol{\theta}}_{k+1}-\boldsymbol{\theta}^{*}= 1k+1i=1k+1(𝜽^i1𝜽)𝑯^k+111k+1i=1k+1{𝑿igτi(𝒁i𝜽^i1bi)𝑿igτi(𝒁i𝜽bi)}\displaystyle\frac{1}{k+1}\sum_{i=1}^{k+1}(\widehat{\boldsymbol{\theta}}_{i-1}-\boldsymbol{\theta}^{*})-\widehat{\boldsymbol{H}}_{k+1}^{-1}\frac{1}{k+1}\sum_{i=1}^{k+1}\big{\{}\boldsymbol{X}_{i}g_{\tau_{i}}(\boldsymbol{Z}_{i}^{{\top}}\widehat{\boldsymbol{\theta}}_{i-1}-b_{i})-\boldsymbol{X}_{i}g_{\tau_{i}}(\boldsymbol{Z}_{i}^{{\top}}\boldsymbol{\theta}^{*}-b_{i})\big{\}}
𝑯^k+111k+1i=1k+1{𝑿igτi(𝒁i𝜽bi)}.\displaystyle-\widehat{\boldsymbol{H}}_{k+1}^{-1}\frac{1}{k+1}\sum_{i=1}^{k+1}\big{\{}\boldsymbol{X}_{i}g_{\tau_{i}}(\boldsymbol{Z}_{i}^{{\top}}\boldsymbol{\theta}^{*}-b_{i})\big{\}}. (31)

For the second term on the right-hand side of (31), we have that

1k+1i=1k+1{𝑿igτi(𝒁i𝜽^i1bi)𝑿igτi(𝒁i𝜽bi)}\displaystyle\frac{1}{k+1}\sum_{i=1}^{k+1}\big{\{}\boldsymbol{X}_{i}g_{\tau_{i}}(\boldsymbol{Z}_{i}^{{\top}}\widehat{\boldsymbol{\theta}}_{i-1}-b_{i})-\boldsymbol{X}_{i}g_{\tau_{i}}(\boldsymbol{Z}_{i}^{{\top}}\boldsymbol{\theta}^{*}-b_{i})\big{\}}
=\displaystyle= 1k+1i=1k+1𝑿i𝒁igτi(𝒁i𝜽bi)(𝜽^i1𝜽)\displaystyle\frac{1}{k+1}\sum_{i=1}^{k+1}\boldsymbol{X}_{i}\boldsymbol{Z}_{i}^{{\top}}g_{\tau_{i}}^{\prime}(\boldsymbol{Z}_{i}^{{\top}}\boldsymbol{\theta}^{*}-b_{i})(\widehat{\boldsymbol{\theta}}_{i-1}-\boldsymbol{\theta}^{*})
+1k+1i=1k+101(1t)𝑿igτi′′{𝒁i(𝜽+t(𝜽^i1𝜽))bi}(𝒁i(𝜽^i1𝜽))2dt\displaystyle+\frac{1}{k+1}\sum_{i=1}^{k+1}\int_{0}^{1}(1-t)\boldsymbol{X}_{i}g_{\tau_{i}}^{\prime\prime}\big{\{}\boldsymbol{Z}_{i}^{{\top}}(\boldsymbol{\theta}^{*}+t(\widehat{\boldsymbol{\theta}}_{i-1}-\boldsymbol{\theta}^{*}))-b_{i}\big{\}}(\boldsymbol{Z}_{i}^{{\top}}(\widehat{\boldsymbol{\theta}}_{i-1}-\boldsymbol{\theta}^{*}))^{2}\mathrm{d}t
=\displaystyle= 1k+1i=1k+1𝑿i𝒁igτi(𝒁i𝜽bi)(𝜽^i1𝜽)+1k+1i=1k+1O(1)|𝜽^i1𝜽|22\displaystyle\frac{1}{k+1}\sum_{i=1}^{k+1}\boldsymbol{X}_{i}\boldsymbol{Z}_{i}^{{\top}}g_{\tau_{i}}^{\prime}(\boldsymbol{Z}_{i}^{{\top}}\boldsymbol{\theta}^{*}-b_{i})(\widehat{\boldsymbol{\theta}}_{i-1}-\boldsymbol{\theta}^{*})+\frac{1}{k+1}\sum_{i=1}^{k+1}O(1)|\widehat{\boldsymbol{\theta}}_{i-1}-\boldsymbol{\theta}^{*}|_{2}^{2}
=\displaystyle= 1k+1i=1k+1𝑿i𝒁igτi(𝒁i𝜽bi)(𝜽^i1𝜽)+O(1)1k+1i=1k+1|𝜽^i1𝜽|22,\displaystyle\frac{1}{k+1}\sum_{i=1}^{k+1}\boldsymbol{X}_{i}\boldsymbol{Z}_{i}^{{\top}}g_{\tau_{i}}^{\prime}(\boldsymbol{Z}_{i}^{{\top}}\boldsymbol{\theta}^{*}-b_{i})(\widehat{\boldsymbol{\theta}}_{i-1}-\boldsymbol{\theta}^{*})+O(1)\frac{1}{k+1}\sum_{i=1}^{k+1}|\widehat{\boldsymbol{\theta}}_{i-1}-\boldsymbol{\theta}^{*}|_{2}^{2},

where the second equality holds because of the fact that |gτi′′(x)|3/τi=O(1)|g_{\tau_{i}}^{\prime\prime}(x)|\leq 3/\tau_{i}=O(1), the last equality uses the inductive hypothesis. Substitute it into (31), we have

𝜽^k+1𝜽=\displaystyle\widehat{\boldsymbol{\theta}}_{k+1}-\boldsymbol{\theta}^{*}= (𝑰𝑯^k+11𝑯)1k+1i=1k+1(𝜽^i1𝜽)𝑯^k+111k+1i=1k+1(𝑨i𝑯)(𝜽^i1𝜽)\displaystyle(\boldsymbol{I}-\widehat{\boldsymbol{H}}_{k+1}^{-1}\boldsymbol{H})\frac{1}{k+1}\sum_{i=1}^{k+1}(\widehat{\boldsymbol{\theta}}_{i-1}-\boldsymbol{\theta}^{*})-\widehat{\boldsymbol{H}}_{k+1}^{-1}\frac{1}{k+1}\sum_{i=1}^{k+1}(\boldsymbol{A}_{i}-\boldsymbol{H})(\widehat{\boldsymbol{\theta}}_{i-1}-\boldsymbol{\theta}^{*})
𝑯^k+111k+1i=1k+1{𝑿igτi(𝒁i𝜽bi)}+O(1)1k+1i=1k+1|𝜽^i1𝜽|22,\displaystyle-\widehat{\boldsymbol{H}}_{k+1}^{-1}\frac{1}{k+1}\sum_{i=1}^{k+1}\big{\{}\boldsymbol{X}_{i}g_{\tau_{i}}(\boldsymbol{Z}_{i}^{{\top}}\boldsymbol{\theta}^{*}-b_{i})\big{\}}+O(1)\frac{1}{k+1}\sum_{i=1}^{k+1}|\widehat{\boldsymbol{\theta}}_{i-1}-\boldsymbol{\theta}^{*}|_{2}^{2}, (32)

where 𝑨i=𝑿i𝒁igτi(𝒁i𝜽bi)\boldsymbol{A}_{i}=\boldsymbol{X}_{i}\boldsymbol{Z}_{i}^{{\top}}g_{\tau_{i}}^{\prime}(\boldsymbol{Z}_{i}^{{\top}}\boldsymbol{\theta}^{*}-b_{i}) and 𝑯=𝔼[𝑿𝒁]\boldsymbol{H}={\mathbb{E}}[\boldsymbol{X}\boldsymbol{Z}^{{\top}}].

Under event i=n0kEi\cap_{i=n_{0}}^{k}E_{i}, by Lemma 5 there exists a constant C1>0C_{1}>0 such that

(1k+1i=1k+1|𝜽^i1𝜽|22C1Ψ2dek2logk,i=n0kEi)\displaystyle{\mathbb{P}}\Big{(}\frac{1}{k+1}\sum_{i=1}^{k+1}|\widehat{\boldsymbol{\theta}}_{i-1}-\boldsymbol{\theta}^{*}|_{2}^{2}\geq C_{1}\Psi^{2}de_{k}^{2}\log k,\cap_{i=n_{0}}^{k}E_{i}\Big{)}
\displaystyle\geq (1k+1i=1k+1|𝜽^i1𝜽|221k+1i=0kΨ2dei2,i=n0kEi)=0.\displaystyle{\mathbb{P}}\Big{(}\frac{1}{k+1}\sum_{i=1}^{k+1}|\widehat{\boldsymbol{\theta}}_{i-1}-\boldsymbol{\theta}^{*}|_{2}^{2}\geq\frac{1}{k+1}\sum_{i=0}^{k}\Psi^{2}de_{i}^{2},\cap_{i=n_{0}}^{k}E_{i}\Big{)}=0. (33)

Similarly, by Lemma 5 there exists a constant C2>0C_{2}>0 such that

(1k+1i=1k+1|𝜽^i1𝜽|2C2Ψdeklogk,i=n0kEi)\displaystyle{\mathbb{P}}\Big{(}\frac{1}{k+1}\sum_{i=1}^{k+1}|\widehat{\boldsymbol{\theta}}_{i-1}-\boldsymbol{\theta}^{*}|_{2}\geq C_{2}\Psi\sqrt{d}e_{k}\log k,\cap_{i=n_{0}}^{k}E_{i}\Big{)}
\displaystyle\geq (1k+1i=1k+1|𝜽^i1𝜽|21k+1i=0kΨdei,i=n0kEi)=0.\displaystyle{\mathbb{P}}\Big{(}\frac{1}{k+1}\sum_{i=1}^{k+1}|\widehat{\boldsymbol{\theta}}_{i-1}-\boldsymbol{\theta}^{*}|_{2}\geq\frac{1}{k+1}\sum_{i=0}^{k}\Psi\sqrt{d}e_{i},\cap_{i=n_{0}}^{k}E_{i}\Big{)}=0. (34)

By Lemma 6, (34) and (42) we know that under event i=n0kEi\cap_{i=n_{0}}^{k}E_{i}, for every ν>0\nu>0 there exist constants C1,C3>0C_{1},C_{3}>0 (which only depend on ν\nu), such that

(|(𝑰𝑯^k+11𝑯)1k+1i=0k(𝜽^i𝜽)|2C1Ψ2dek2log2k,i=n0kEi)\displaystyle{\mathbb{P}}\Big{(}\Big{|}(\boldsymbol{I}-\widehat{\boldsymbol{H}}_{k+1}^{-1}\boldsymbol{H})\frac{1}{k+1}\sum_{i=0}^{k}(\widehat{\boldsymbol{\theta}}_{i}-\boldsymbol{\theta}^{*})\Big{|}_{2}\geq C_{1}\Psi^{2}de^{2}_{k}\log^{2}k,\cap_{i=n_{0}}^{k}E_{i}\Big{)}
\displaystyle\leq (𝑯^k+11𝑯^k+1𝑯1k+1i=0k|𝜽^i𝜽|2C1Ψ2dek2log2k,i=n0kEi)C3(k+1)ν.\displaystyle{\mathbb{P}}\Big{(}\|\widehat{\boldsymbol{H}}_{k+1}^{-1}\|\cdot\|\widehat{\boldsymbol{H}}_{k+1}-\boldsymbol{H}\|\cdot\frac{1}{k+1}\sum_{i=0}^{k}|\widehat{\boldsymbol{\theta}}_{i}-\boldsymbol{\theta}^{*}|_{2}\geq C_{1}\Psi^{2}de_{k}^{2}\log^{2}k,\cap_{i=n_{0}}^{k}E_{i}\Big{)}\leq C_{3}(k+1)^{-\nu}.

By Lemma 8 we know that

(|1k+1i=1k+1(𝑨i𝑯)(𝜽^i1𝜽)|2C1(αk+Ψ2dek2logk),i=n0kEi)C3(k+1)ν.{\mathbb{P}}\Big{(}\Big{|}\frac{1}{k+1}\sum_{i=1}^{k+1}(\boldsymbol{A}_{i}-\boldsymbol{H})(\widehat{\boldsymbol{\theta}}_{i-1}-\boldsymbol{\theta}^{*})\Big{|}_{2}\geq C_{1}(\alpha_{k}+\Psi^{2}de_{k}^{2}\log k),\cap_{i=n_{0}}^{k}E_{i}\Big{)}\leq C_{3}(k+1)^{-\nu}. (35)

In this part, we mainly consider the case where δ>0\delta>0 and τk\tau_{k} can be arbitrary. A special case where δ>4\delta>4 and i/log3i=O(τi)\sqrt{i/\log^{3}i}=O(\tau_{i}) will be presented in Proposition 8 below. By Lemma 7 we have that when δ>0\delta>0, there is

(|𝑯^k+111k+1i=1k+1{𝑿ig(𝒁i𝜽bi)}|2C1dek+1,i=n0kEi)C3(k+1)ν;{\mathbb{P}}\Big{(}\Big{|}\widehat{\boldsymbol{H}}_{k+1}^{-1}\frac{1}{k+1}\sum_{i=1}^{k+1}\big{\{}\boldsymbol{X}_{i}g(\boldsymbol{Z}_{i}^{{\top}}\boldsymbol{\theta}^{*}-b_{i})\big{\}}\Big{|}_{2}\geq C_{1}\sqrt{d}e_{k+1},\cap_{i=n_{0}}^{k}E_{i}\Big{)}\leq C_{3}(k+1)^{-\nu};

By substituting the above inequalities into (32), we have that

(|𝜽^k+1𝜽|2Ψdek+1,i=n0kEi)\displaystyle{\mathbb{P}}\Big{(}|\widehat{\boldsymbol{\theta}}_{k+1}-\boldsymbol{\theta}^{*}|_{2}\geq\Psi\sqrt{d}e_{k+1},\cap_{i=n_{0}}^{k}E_{i}\Big{)}
\displaystyle\leq (|𝜽^k+1𝜽|2C1dek+1+C1αk+3C1Ψ2dek2log2k,i=n0kEi)\displaystyle{\mathbb{P}}\Big{(}|\widehat{\boldsymbol{\theta}}_{k+1}-\boldsymbol{\theta}^{*}|_{2}\geq C_{1}\sqrt{d}e_{k+1}+C_{1}\alpha_{k}+3C_{1}\Psi^{2}de_{k}^{2}\log^{2}k,\cap_{i=n_{0}}^{k}E_{i}\Big{)}
\displaystyle\leq 3C3(k+1)ν,\displaystyle 3C_{3}(k+1)^{-\nu}, (36)

where Ψ\Psi can be taken to be sufficiently large, given C1C_{1} fixed. Here the second inequality holds because dek\sqrt{d}e_{k} can be sufficiently small for kn0k\geq n_{0} by taking n0n_{0} sufficiently large. Meanwhile, as αk(k+1)αk+1/k(n0+1)αk+1/n0=o(τk+1αk+1)=o(ek+1)\alpha_{k}\leq(k+1)\alpha_{k+1}/k\leq(n_{0}+1)\alpha_{k+1}/n_{0}=o(\tau_{k+1}\alpha_{k+1})=o(e_{k+1}), the last two terms both have order of o(ek+1)o(e_{k+1}). Obviously, the rest two events of (30) are contained in (36). Therefore, the inductive hypothesis that (Ek+1c,j=n0kEj)3C3(k+1)ν{\mathbb{P}}(E_{k+1}^{c},\cap_{j=n_{0}}^{k}E_{j})\leq 3C_{3}(k+1)^{-\nu}, is proved.

For kn0k\geq n_{0}, there holds

(i=n0kEi)\displaystyle{\mathbb{P}}\Big{(}\cap_{i=n_{0}}^{k}E_{i}\Big{)}
\displaystyle\geq (En0)i=n0+1k(Eic,j=n0i1Ej)\displaystyle{\mathbb{P}}\Big{(}E_{n_{0}}\Big{)}-\sum_{i=n_{0}+1}^{k}{\mathbb{P}}\Big{(}E_{i}^{c},\cap_{j=n_{0}}^{i-1}E_{j}\Big{)}
\displaystyle\geq 1i=n0+1k3C3iν13C3ν11n0ν1,\displaystyle 1-\sum_{i=n_{0}+1}^{k}3C_{3}i^{-\nu}\geq 1-\frac{3C_{3}}{\nu-1}\frac{1}{n_{0}^{\nu-1}},

which proves the theorem. ∎

Lemma 6.

(Bound of the pseudo-Huber Hessian matrix) Under the same assumptions as in Theorem 1, there exists uniform constants CC and cc, such that the Hessian matrix 𝐇^k+1\widehat{\boldsymbol{H}}_{k+1} satisfies

(𝑯^k+1𝑯CΨdeklogk,i=n0kEi)c(k+1)νd.{\mathbb{P}}\Big{(}\|\widehat{\boldsymbol{H}}_{k+1}-\boldsymbol{H}\|\geq C\Psi\sqrt{d}e_{k}\log k,\cap_{i=n_{0}}^{k}E_{i}\Big{)}\leq c(k+1)^{-\nu d}.
Proof.

We note that

𝑯^k+1𝑯1k+1i𝒬k𝑿i𝒁igτi(𝒁i𝜽^i1bi)𝑯+1k+1i𝒬k𝑿i𝒁igτi(𝒁i𝜽^i1bi)𝑯.\displaystyle\|\widehat{\boldsymbol{H}}_{k+1}-\boldsymbol{H}\|\leq\frac{1}{k+1}\Big{\|}\sum_{i\in{\mathcal{Q}}_{k}}\boldsymbol{X}_{i}\boldsymbol{Z}_{i}g^{\prime}_{\tau_{i}}(\boldsymbol{Z}_{i}^{{\top}}\widehat{\boldsymbol{\theta}}_{i-1}-b_{i})-\boldsymbol{H}\Big{\|}+\frac{1}{k+1}\Big{\|}\sum_{i\notin{\mathcal{Q}}_{k}}\boldsymbol{X}_{i}\boldsymbol{Z}_{i}g^{\prime}_{\tau_{i}}(\boldsymbol{Z}_{i}^{{\top}}\widehat{\boldsymbol{\theta}}_{i-1}-b_{i})-\boldsymbol{H}\Big{\|}.

For the first term, there is

𝑿𝒁gτ(𝒁𝜽b)\displaystyle\Big{\|}\boldsymbol{X}\boldsymbol{Z}^{{\top}}g^{\prime}_{\tau}(\boldsymbol{Z}^{{\top}}\boldsymbol{\theta}-b)\Big{\|}
=\displaystyle= sup𝒖,𝒗𝕊d1𝒖𝑿𝒁𝒗|gτ(𝒁𝜽b)|M2,\displaystyle\sup_{\boldsymbol{u},\boldsymbol{v}\in{\mathbb{S}}^{d-1}}\boldsymbol{u}^{{\top}}\boldsymbol{X}\boldsymbol{Z}^{{\top}}\boldsymbol{v}|g^{\prime}_{\tau}(\boldsymbol{Z}^{{\top}}\boldsymbol{\theta}-b)|\leq M^{2},

holds for all 𝜽\boldsymbol{\theta} and τ1\tau\geq 1. Therefore we have that

1k+1i𝒬k𝑿i𝒁igτi(𝒁i𝜽^i1bi)𝑯\displaystyle\frac{1}{k+1}\Big{\|}\sum_{i\in{\mathcal{Q}}_{k}}\boldsymbol{X}_{i}\boldsymbol{Z}_{i}g^{\prime}_{\tau_{i}}(\boldsymbol{Z}_{i}^{{\top}}\widehat{\boldsymbol{\theta}}_{i-1}-b_{i})-\boldsymbol{H}\Big{\|} (37)
\displaystyle\leq 1k+1i𝒬k𝑿i𝒁igτi(𝒁i𝜽^i1bi)+1k+1i𝒬i𝑯2M2αk.\displaystyle\frac{1}{k+1}\sum_{i\in{\mathcal{Q}}_{k}}\|\boldsymbol{X}_{i}\boldsymbol{Z}_{i}g^{\prime}_{\tau_{i}}(\boldsymbol{Z}_{i}^{{\top}}\widehat{\boldsymbol{\theta}}_{i-1}-b_{i})\|+\frac{1}{k+1}\sum_{i\in{\mathcal{Q}}_{i}}\|\boldsymbol{H}\|\leq 2M^{2}\alpha_{k}.

For the second term, we first prove that

(1k+1i𝒬k(𝑨i𝔼[𝑨i])C1dlogkk+1)=O((k+1)νd).{\mathbb{P}}\Big{(}\Big{\|}\frac{1}{k+1}\sum_{i\notin{\mathcal{Q}}_{k}}(\boldsymbol{A}_{i}-{\mathbb{E}}[\boldsymbol{A}_{i}])\Big{\|}\geq C_{1}\sqrt{\frac{d\log k}{k+1}}\Big{)}=O((k+1)^{-\nu d}). (38)

Let 𝔑{\mathfrak{N}} be the 1/41/4-net of the unit ball 𝕊d1{\mathbb{S}}^{d-1}. By Lemma 5.2 of Vershynin (2010) we know that |𝔑|9d|{\mathfrak{N}}|\leq 9^{d}. Then we have that

1k+1i𝒬k(𝑨i𝔼[𝑨i])=\displaystyle\Big{\|}\frac{1}{k+1}\sum_{i\notin{\mathcal{Q}}_{k}}(\boldsymbol{A}_{i}-{\mathbb{E}}[\boldsymbol{A}_{i}])\Big{\|}= sup𝒖,𝒗𝕊d1|1k+1i𝒬k𝒖(𝑨i𝔼[𝑨i])𝒗|\displaystyle\sup_{\boldsymbol{u},\boldsymbol{v}\in{\mathbb{S}}^{d-1}}\Big{|}\frac{1}{k+1}\sum_{i\notin{\mathcal{Q}}_{k}}\boldsymbol{u}^{{\top}}(\boldsymbol{A}_{i}-{\mathbb{E}}[\boldsymbol{A}_{i}])\boldsymbol{v}\Big{|}
\displaystyle\leq sup𝒖~,𝒗~𝔑|1k+1i𝒬k𝒖~(𝑨i𝔼[𝑨i])𝒗~|\displaystyle\sup_{\widetilde{\boldsymbol{u}},\widetilde{\boldsymbol{v}}\in{\mathfrak{N}}}\Big{|}\frac{1}{k+1}\sum_{i\notin{\mathcal{Q}}_{k}}\widetilde{\boldsymbol{u}}^{{\top}}(\boldsymbol{A}_{i}-{\mathbb{E}}[\boldsymbol{A}_{i}])\widetilde{\boldsymbol{v}}\Big{|}
+sup𝒖~𝔑sup|𝒗𝒗~|21/4|1k+1i𝒬k𝒖~(𝑨i𝔼[𝑨i])(𝒗𝒗~)|2\displaystyle+\sup_{\widetilde{\boldsymbol{u}}\in{\mathfrak{N}}}\sup_{|\boldsymbol{v}-\widetilde{\boldsymbol{v}}|_{2}\leq 1/4}\Big{|}\frac{1}{k+1}\sum_{i\notin{\mathcal{Q}}_{k}}\widetilde{\boldsymbol{u}}^{{\top}}(\boldsymbol{A}_{i}-{\mathbb{E}}[\boldsymbol{A}_{i}])(\boldsymbol{v}-\widetilde{\boldsymbol{v}})\Big{|}_{2}
+sup𝒗𝕊d1sup|𝒖𝒖~|21/4|1k+1i𝒬k(𝒖𝒖~)(𝑨i𝔼[𝑨i])𝒗|2\displaystyle+\sup_{\boldsymbol{v}\in{\mathbb{S}}^{d-1}}\sup_{|\boldsymbol{u}-\widetilde{\boldsymbol{u}}|_{2}\leq 1/4}\Big{|}\frac{1}{k+1}\sum_{i\notin{\mathcal{Q}}_{k}}(\boldsymbol{u}-\widetilde{\boldsymbol{u}})^{{\top}}(\boldsymbol{A}_{i}-{\mathbb{E}}[\boldsymbol{A}_{i}])\boldsymbol{v}\Big{|}_{2}
1k+1i𝒬k(𝑨i𝔼[𝑨i])\displaystyle\Rightarrow\Big{\|}\frac{1}{k+1}\sum_{i\notin{\mathcal{Q}}_{k}}(\boldsymbol{A}_{i}-{\mathbb{E}}[\boldsymbol{A}_{i}])\Big{\|}\leq 2sup𝒖~,𝒗~𝔑|1k+1i𝒬k𝒖~(𝑨i𝔼[𝑨i])𝒗~|.\displaystyle 2\sup_{\widetilde{\boldsymbol{u}},\widetilde{\boldsymbol{v}}\in{\mathfrak{N}}}\Big{|}\frac{1}{k+1}\sum_{i\notin{\mathcal{Q}}_{k}}\widetilde{\boldsymbol{u}}^{{\top}}(\boldsymbol{A}_{i}-{\mathbb{E}}[\boldsymbol{A}_{i}])\widetilde{\boldsymbol{v}}\Big{|}.

Applying Lemma 1 to every 1k+1i𝒬k𝒖~(𝑨i𝔼[𝑨i])𝒗~\frac{1}{k+1}\sum_{i\notin{\mathcal{Q}}_{k}}\widetilde{\boldsymbol{u}}^{{\top}}(\boldsymbol{A}_{i}-{\mathbb{E}}[\boldsymbol{A}_{i}])\widetilde{\boldsymbol{v}} we can obtain that

supu~,𝒗~𝔑(|1k+1i𝒬k𝒖~(𝑨i𝔼[𝑨i])𝒗~|2C1dlogkk+1)=O((k+1)(ν+log9)d)\sup_{\widetilde{u},\widetilde{\boldsymbol{v}}\in{\mathfrak{N}}}{\mathbb{P}}\Big{(}\Big{|}\frac{1}{k+1}\sum_{i\notin{\mathcal{Q}}_{k}}\widetilde{\boldsymbol{u}}^{{\top}}(\boldsymbol{A}_{i}-{\mathbb{E}}[\boldsymbol{A}_{i}])\widetilde{\boldsymbol{v}}\Big{|}_{2}\geq C_{1}\sqrt{\frac{d\log k}{k+1}}\Big{)}=O((k+1)^{-(\nu+\log 9)d})

for some constant C1>0C_{1}>0. Therefore

(1k+1i𝒬k(𝑨i𝔼[𝑨i])2C1dlogkk+1)\displaystyle{\mathbb{P}}\Big{(}\Big{\|}\frac{1}{k+1}\sum_{i\notin{\mathcal{Q}}_{k}}(\boldsymbol{A}_{i}-{\mathbb{E}}[\boldsymbol{A}_{i}])\Big{\|}\geq 2C_{1}\sqrt{\frac{d\log k}{k+1}}\Big{)}
\displaystyle\leq (supu~,𝒗~𝔑|1k+1i𝒬k𝒖~(𝑨i𝔼[𝑨i])𝒗~|2C1dlogkk+1)\displaystyle{\mathbb{P}}\Big{(}\sup_{\widetilde{u},\widetilde{\boldsymbol{v}}\in{\mathfrak{N}}}\Big{|}\frac{1}{k+1}\sum_{i\notin{\mathcal{Q}}_{k}}\widetilde{\boldsymbol{u}}^{{\top}}(\boldsymbol{A}_{i}-{\mathbb{E}}[\boldsymbol{A}_{i}])\widetilde{\boldsymbol{v}}\Big{|}_{2}\geq C_{1}\sqrt{\frac{d\log k}{k+1}}\Big{)}
\displaystyle\leq 92dsup𝒖~,𝒗~𝔑(|1k+1i𝒬k𝒖~(𝑨i𝔼[𝑨i])𝒗~|2C1dlogkk+1)=O((k+1)νd),\displaystyle 9^{2d}\sup_{\widetilde{\boldsymbol{u}},\widetilde{\boldsymbol{v}}\in{\mathfrak{N}}}{\mathbb{P}}\Big{(}\Big{|}\frac{1}{k+1}\sum_{i\notin{\mathcal{Q}}_{k}}\widetilde{\boldsymbol{u}}^{{\top}}(\boldsymbol{A}_{i}-{\mathbb{E}}[\boldsymbol{A}_{i}])\widetilde{\boldsymbol{v}}\Big{|}_{2}\geq C_{1}\sqrt{\frac{d\log k}{k+1}}\Big{)}=O((k+1)^{-\nu d}),

which proves (38). Next we prove that when (𝑿,𝒁,b)(\boldsymbol{X},\boldsymbol{Z},b) are normal data, for every τ>0\tau>0,

𝔼[𝑿𝒁gτ(𝒁𝜽b)]𝔼[𝑿𝒁]C2τδ.\Big{\|}{\mathbb{E}}[\boldsymbol{X}\boldsymbol{Z}^{{\top}}g^{\prime}_{\tau}(\boldsymbol{Z}^{{\top}}\boldsymbol{\theta}^{*}-b)]-{\mathbb{E}}[\boldsymbol{X}\boldsymbol{Z}^{{\top}}]\Big{\|}\leq C_{2}\tau^{-\delta^{\prime}}.

Indeed, denote ϵ=𝒁𝜽b\epsilon=\boldsymbol{Z}^{{\top}}\boldsymbol{\theta}^{*}-b, we have that

𝔼[𝑿𝒁gτ(𝒁𝜽b)]𝔼[𝑿𝒁]\displaystyle\Big{\|}{\mathbb{E}}[\boldsymbol{X}\boldsymbol{Z}^{{\top}}g^{\prime}_{\tau}(\boldsymbol{Z}^{{\top}}\boldsymbol{\theta}^{*}-b)]-{\mathbb{E}}[\boldsymbol{X}\boldsymbol{Z}^{{\top}}]\Big{\|}
\displaystyle\leq sup𝒖,𝒗𝔼[𝒖𝑿𝒁𝒗|gτ(ϵ)1|]\displaystyle\sup_{\boldsymbol{u},\boldsymbol{v}}{\mathbb{E}}[\boldsymbol{u}^{{\top}}\boldsymbol{X}\boldsymbol{Z}^{{\top}}\boldsymbol{v}|g^{\prime}_{\tau}(\epsilon)-1|]
\displaystyle\leq sup𝒖,𝒗𝔼[𝒖𝑿𝒁𝒗𝕀(|ϵ|>τ)+52𝒖𝑿𝒁𝒗τ1δ′′|ϵ|1+δ′′𝕀(|ϵ|τ)]\displaystyle\sup_{\boldsymbol{u},\boldsymbol{v}}{\mathbb{E}}[\boldsymbol{u}^{{\top}}\boldsymbol{X}\boldsymbol{Z}^{{\top}}\boldsymbol{v}{\mathbb{I}}(|\epsilon|>\tau)+\frac{5}{2}\boldsymbol{u}^{{\top}}\boldsymbol{X}\boldsymbol{Z}^{{\top}}\boldsymbol{v}\tau^{-1-\delta^{\prime\prime}}|\epsilon|^{1+\delta^{\prime\prime}}{\mathbb{I}}(|\epsilon|\leq\tau)]
\displaystyle\leq sup𝒖,𝒗𝔼[𝒖𝑿𝒁𝒗(|ϵ|>τ|𝑿,𝒁)]+52τ1δ′′sup𝒖,𝒗𝔼[𝒖𝑿𝒁𝒗𝔼[|ϵ|1+δ′′|𝑿,𝒁]]C2τ1δ′′C2τδ,\displaystyle\sup_{\boldsymbol{u},\boldsymbol{v}}{\mathbb{E}}\big{[}\boldsymbol{u}^{{\top}}\boldsymbol{X}\boldsymbol{Z}^{{\top}}\boldsymbol{v}{\mathbb{P}}(|\epsilon|>\tau|\boldsymbol{X},\boldsymbol{Z})\big{]}+\frac{5}{2}\tau^{-1-\delta^{\prime\prime}}\sup_{\boldsymbol{u},\boldsymbol{v}}{\mathbb{E}}\big{[}\boldsymbol{u}^{{\top}}\boldsymbol{X}\boldsymbol{Z}^{{\top}}\boldsymbol{v}{\mathbb{E}}[|\epsilon|^{1+\delta^{\prime\prime}}|\boldsymbol{X},\boldsymbol{Z}]\big{]}\leq C_{2}\tau^{-1-\delta^{\prime\prime}}\leq C_{2}\tau^{-\delta^{\prime}},

where the third line uses Lemma 4, and δ′′=min(1,δ)\delta^{\prime\prime}=\min(1,\delta). Then clearly we have that 1+δ′′δ1+\delta^{\prime\prime}\geq\delta^{\prime}. Therefore, by the choice of τi\tau_{i} and Lemma 5, we have

1k+1i𝒬k𝔼[𝑨i]𝑯1k+1i𝒬k𝔼[𝑨i]𝑯C2τk+1δ.\Big{\|}\frac{1}{k+1}\sum_{i\notin{\mathcal{Q}}_{k}}{\mathbb{E}}[\boldsymbol{A}_{i}]-\boldsymbol{H}\Big{\|}\leq\frac{1}{k+1}\sum_{i\notin{\mathcal{Q}}_{k}}\Big{\|}{\mathbb{E}}[\boldsymbol{A}_{i}]-\boldsymbol{H}\Big{\|}\leq C_{2}\tau_{k+1}^{-\delta^{\prime}}. (39)

Under event i=n0kEi\cap_{i=n_{0}}^{k}E_{i}, from (34) we know there is

1k+1i𝒬k𝑨i𝑯^k=\displaystyle\Big{\|}\frac{1}{k+1}\sum_{i\notin{\mathcal{Q}}_{k}}\boldsymbol{A}_{i}-\widehat{\boldsymbol{H}}_{k}\Big{\|}= 1k+1i𝒬k𝑿i𝒁i{gτi(𝒁i𝜽bi)gτi(𝒁i𝜽^i1bi)}\displaystyle\Big{\|}\frac{1}{k+1}\sum_{i\notin{\mathcal{Q}}_{k}}\boldsymbol{X}_{i}\boldsymbol{Z}_{i}^{{\top}}\big{\{}g_{\tau_{i}}^{\prime}(\boldsymbol{Z}_{i}^{{\top}}\boldsymbol{\theta}^{*}-b_{i})-g_{\tau_{i}}^{\prime}(\boldsymbol{Z}_{i}^{{\top}}\widehat{\boldsymbol{\theta}}_{i-1}-b_{i})\big{\}}\Big{\|} (40)
\displaystyle\leq M21k+1i𝒬k|𝜽^i1𝜽|2C3Ψdeklogk,\displaystyle M^{2}\frac{1}{k+1}\sum_{i\notin{\mathcal{Q}}_{k}}|\widehat{\boldsymbol{\theta}}_{i-1}-\boldsymbol{\theta}^{*}|_{2}\leq C_{3}\Psi\sqrt{d}e_{k}\log k,

Therefore combining (38), (39) and (40) we have that

. (1k+1i𝒬k𝑿i𝒁igτi(𝒁i𝜽^i1bi)𝑯C4(dlogkk+1+τk+1δ+Ψdeklogk))\displaystyle{\mathbb{P}}\Bigg{(}\frac{1}{k+1}\Big{\|}\sum_{i\notin{\mathcal{Q}}_{k}}\boldsymbol{X}_{i}\boldsymbol{Z}_{i}g^{\prime}_{\tau_{i}}(\boldsymbol{Z}_{i}^{{\top}}\widehat{\boldsymbol{\theta}}_{i-1}-b_{i})-\boldsymbol{H}\Big{\|}\geq C_{4}\Big{(}\sqrt{\frac{d\log k}{k+1}}+\tau_{k+1}^{-\delta^{\prime}}+\Psi\sqrt{d}e_{k}\log k\Big{)}\Bigg{)} (41)
=\displaystyle= O((k+1)νd).\displaystyle O((k+1)^{-\nu d}).

Combining (37) and (41), we have that

(𝑯^k+1𝑯C5(αk+dlogkk+1+τk+1δ+deklogk))=O((k+1)νd),\displaystyle{\mathbb{P}}\Bigg{(}\|\widehat{\boldsymbol{H}}_{k+1}-\boldsymbol{H}\|\geq C_{5}\Big{(}\alpha_{k}+\sqrt{\frac{d\log k}{k+1}}+\tau_{k+1}^{-\delta^{\prime}}+\sqrt{d}e_{k}\log k\Big{)}\Bigg{)}=O((k+1)^{-\nu d}),

which proves the lemma.

As a corollary, we can show that under event i=n0kEi\cap_{i=n_{0}}^{k}E_{i},

𝑯^k+11=(Λmin(𝑯^k+1))1(Λmin(𝑯)𝑯^k+1𝑯)12/Λmin(𝑯).\|\widehat{\boldsymbol{H}}_{k+1}^{-1}\|=(\Lambda_{\min}(\widehat{\boldsymbol{H}}_{k+1}))^{-1}\leq\big{(}\Lambda_{\min}(\boldsymbol{H})-\|\widehat{\boldsymbol{H}}_{k+1}-\boldsymbol{H}\|\big{)}^{-1}\leq 2/\Lambda_{\min}(\boldsymbol{H}). (42)

Lemma 7.

Under the same assumptions as in Theorem 1, for every ν>0\nu>0, there exist constants CC and cc such that:

  • i)

    If δ>0\delta>0 and τk\tau_{k} can be arbitrary, then

    (|1k+1i=1k+1{𝑿igτi(𝒁i𝜽bi)}|2Cdek+1)c(k+1)ν.{\mathbb{P}}\Big{(}\Big{|}\frac{1}{k+1}\sum_{i=1}^{k+1}\big{\{}\boldsymbol{X}_{i}g_{\tau_{i}}(\boldsymbol{Z}_{i}^{{\top}}\boldsymbol{\theta}^{*}-b_{i})\big{\}}\Big{|}_{2}\geq C\sqrt{d}e_{k+1}\Big{)}\leq c(k+1)^{-\nu}.

    Here ek+1e_{k+1} is defined in (29).

  • ii)

    If δ>1\delta>1 and k/log3k=O(τk)\sqrt{k/\log^{3}k}=O(\tau_{k}), then

    (|1k+1i=1k+1{𝑿igτi(𝒁i𝜽bi)}|2Cdek+1)c(k+1)min{ν,(δ1)/3}.{\mathbb{P}}\Big{(}\Big{|}\frac{1}{k+1}\sum_{i=1}^{k+1}\big{\{}\boldsymbol{X}_{i}g_{\tau_{i}}(\boldsymbol{Z}_{i}^{{\top}}\boldsymbol{\theta}^{*}-b_{i})\big{\}}\Big{|}_{2}\geq C\sqrt{d}e_{k+1}\Big{)}\leq c(k+1)^{-\min\{\nu,(\delta-1)/3\}}.

    Here ek+1e_{k+1} is defined in (65).

Proof.

Proof of i). We prove the bound coordinate-wisely. When (𝑿,𝒁,b)(\boldsymbol{X},\boldsymbol{Z},b) has no outlier, for the ll-th coordinate, we first prove that for every τ>0\tau>0, there is

|𝔼[Xlgτ(𝒁𝜽b)]|C1τδ.|{\mathbb{E}}[X_{l}g_{\tau}(\boldsymbol{Z}^{{\top}}\boldsymbol{\theta}^{*}-b)]|\leq C_{1}\tau^{-\delta^{\prime}}.

Indeed, ϵ=𝒁𝜽b\epsilon=\boldsymbol{Z}^{{\top}}\boldsymbol{\theta}^{*}-b, we have that

|𝔼[Xlgτ(ϵ)]|\displaystyle|{\mathbb{E}}[X_{l}g_{\tau}(\epsilon)]|\leq |𝔼[Xlϵ]|+|𝔼[Xl|ϵgτ(ϵ)|𝕀(|ϵ|τ)]|+|𝔼[Xl|gτ(ϵ)ϵ|𝕀(|ϵ|>τ)]|\displaystyle|{\mathbb{E}}[X_{l}\epsilon]|+\big{|}{\mathbb{E}}[X_{l}|\epsilon-g_{\tau}(\epsilon)|{\mathbb{I}}(|\epsilon|\leq\tau)]\big{|}+\big{|}{\mathbb{E}}[X_{l}|g_{\tau}(\epsilon)-\epsilon|{\mathbb{I}}(|\epsilon|>\tau)]\big{|}
\displaystyle\leq 12τδ|𝔼[Xl|ϵ|1+δ𝕀(|ϵ|τ)]|+|𝔼[Xl|ϵ|𝕀(|ϵ|>τ)]|\displaystyle\frac{1}{2}\tau^{-\delta^{\prime}}\big{|}{\mathbb{E}}[X_{l}|\epsilon|^{1+\delta^{\prime}}{\mathbb{I}}(|\epsilon|\leq\tau)]\big{|}+\big{|}{\mathbb{E}}[X_{l}|\epsilon|{\mathbb{I}}(|\epsilon|>\tau)]\big{|}
\displaystyle\leq 12τδ|𝔼[Xl𝔼[|ϵ|1+δ𝕀(|ϵ|τ)|𝑿,𝒁]]|+|𝔼[Xl𝔼[|ϵ|𝕀(|ϵ|>τ)|𝑿,𝒁]]|\displaystyle\frac{1}{2}\tau^{-\delta^{\prime}}\big{|}{\mathbb{E}}\big{[}X_{l}{\mathbb{E}}[|\epsilon|^{1+\delta^{\prime}}{\mathbb{I}}(|\epsilon|\leq\tau)|\boldsymbol{X},\boldsymbol{Z}]\big{]}\big{|}+\big{|}{\mathbb{E}}\big{[}X_{l}{\mathbb{E}}[|\epsilon|{\mathbb{I}}(|\epsilon|>\tau)|\boldsymbol{X},\boldsymbol{Z}]\big{]}\big{|}
\displaystyle\leq C1τδ.\displaystyle C_{1}\tau^{-\delta^{\prime}}.

Therefore, by the choice of τi\tau_{i} and Lemma 5, we have

|1k+1i=1k+1𝔼[Xi,lgτi(𝒁i𝜽bi)]|1k+1i=1k+1|𝔼[Xi,lgτi(𝒁i𝜽bi)]|C2τk+1δ.\Big{|}\frac{1}{k+1}\sum_{i=1}^{k+1}{\mathbb{E}}[X_{i,l}g_{\tau_{i}}(\boldsymbol{Z}^{{\top}}_{i}\boldsymbol{\theta}^{*}-b_{i})]\Big{|}\leq\frac{1}{k+1}\sum_{i=1}^{k+1}\Big{|}{\mathbb{E}}[X_{i,l}g_{\tau_{i}}(\boldsymbol{Z}^{{\top}}_{i}\boldsymbol{\theta}^{*}-b_{i})]\Big{|}\leq C_{2}\tau_{k+1}^{-\delta^{\prime}}. (43)

Next, when all data are normal, we prove the rate of

|1k+1i=1k+1{Xi,lgτi(ϵi)𝔼[Xi,lgτi(ϵi)]}|.\Big{|}\frac{1}{k+1}\sum_{i=1}^{k+1}\big{\{}X_{i,l}g_{\tau_{i}}(\epsilon_{i})-{\mathbb{E}}[X_{i,l}g_{\tau_{i}}(\epsilon_{i})]\big{\}}\Big{|}.

We basically rehash the proof in Lemma 1.

Case 1: δ(0,1]\delta\in(0,1]. We divide the kk-tuple (1,,k)(1,\dots,k) into mkm_{k} different subsets H1,,HmkH_{1},\dots,H_{m_{k}}, where mk=k/λlogkm_{k}=\lceil k/\lceil\lambda\log k\rceil\rceil. Here |Hi|=λlogk|H_{i}|=\lceil\lambda\log k\rceil for 1imk11\leq i\leq m_{k}-1, and |Hmj|λlogk|H_{m_{j}}|\leq\lceil\lambda\log k\rceil. Then we have mkk/(λlogk)m_{k}\approx k/(\lambda\log k). Without loss of generality, we assume that mkm_{k} is an even integer.

Let Yi=Xi,lgτi(ϵi)𝔼[Xi,lgτi(ϵi)]Y_{i}=X_{i,l}g_{\tau_{i}}(\epsilon_{i})-{\mathbb{E}}[X_{i,l}g_{\tau_{i}}(\epsilon_{i})] and ξq=1λlogkiH2q1Yi\xi_{q}=\frac{1}{\lceil\lambda\log k\rceil}\sum_{i\in H_{2q-1}}Y_{i}, then we know |ξq|2Mτk+1|\xi_{q}|\leq 2M\tau_{k+1} (since Xi,lgτi(ϵ)|𝑿|2τiMτk+1X_{i,l}g_{\tau_{i}}(\epsilon)\leq|\boldsymbol{X}|_{2}\tau_{i}\leq M\tau_{k+1}) and 𝔼[ξl]=0{\mathbb{E}}[\xi_{l}]=0 holds. For all B1σ(ξ1,,ξq)B_{1}\in\sigma(\xi_{1},\dots,\xi_{q}) and B2σ(ξq+1)B_{2}\in\sigma(\xi_{q+1}), there holds |(B2|B1)(B2)|ϕ(λlogk)|{\mathbb{P}}(B_{2}|B_{1})-{\mathbb{P}}(B_{2})|\leq\phi(\lceil\lambda\log k\rceil) for all qq. By Theorem 2 of Berkes and Philipp (1979), there exists a sequence of independent variables ηl\eta_{l}, l1l\geq 1 with ηl\eta_{l} having the same distribution as ξl\xi_{l}, and

(|ξqηq|6ϕ(λlogk))6ϕ(λlogk).{\mathbb{P}}\Big{(}|\xi_{q}-\eta_{q}|\geq 6\phi(\lceil\lambda\log k\rceil)\Big{)}\leq 6\phi(\lceil\lambda\log k\rceil).

For any ν>0\nu>0, we can take λ(ν+1)/|logρ|\lambda\geq(\nu+1)/|\log\rho| so that

(|2mkq=1mk/2(ξqηq)|C3kν)\displaystyle{\mathbb{P}}\Big{(}\Big{|}\frac{2}{m_{k}}\sum_{q=1}^{m_{k}/2}(\xi_{q}-\eta_{q})\Big{|}\geq C_{3}k^{-\nu}\Big{)} (44)
\displaystyle\leq (|2mkq=1mk/2(ξqηq)|6ϕ(λlogk))3mkϕ(λlogk)C3kν,\displaystyle{\mathbb{P}}\Big{(}\Big{|}\frac{2}{m_{k}}\sum_{q=1}^{m_{k}/2}(\xi_{q}-\eta_{q})\Big{|}\geq 6\phi(\lceil\lambda\log k\rceil)\Big{)}\leq 3m_{k}\phi(\lceil\lambda\log k\rceil)\leq C_{3}k^{-\nu},

where C3>0C_{3}>0 is some constant. Next, we bound the variance of ξq\xi_{q}. From equation (20.23) of Billingsley (1968) we know that for arbitrary i,ji,j, there is

|𝔼[YiYj]|2ϕ(|ij|)𝔼[Yi2]𝔼[Yj2]2C4ρ|ij|/2M2τk+11δ.\displaystyle\big{|}{\mathbb{E}}\big{[}Y_{i}Y_{j}\big{]}\big{|}\leq 2\sqrt{\phi(|i-j|)}\sqrt{{\mathbb{E}}[Y_{i}^{2}]{\mathbb{E}}[Y_{j}^{2}]}\leq 2C_{4}\rho^{|i-j|/2}M^{2}\tau_{k+1}^{1-\delta}.

Then we compute that

Var(ηq)=Var(ξq)=\displaystyle{\rm Var}(\eta_{q})={\rm Var}(\xi_{q})= 1λlogk2[i,jH2q1{𝔼(YiYj)}]\displaystyle\frac{1}{\lceil\lambda\log k\rceil^{2}}\Big{[}\sum_{i,j\in H_{2q-1}}\{{\mathbb{E}}(Y_{i}Y_{j})\}\Big{]}
\displaystyle\leq 2C4λlogk2i,j=1λlogkρ|ij|/2M2τk+11δ\displaystyle\frac{2C_{4}}{\lceil\lambda\log k\rceil^{2}}\sum_{i,j=1}^{\lceil\lambda\log k\rceil}\rho^{|i-j|/2}M^{2}\tau_{k+1}^{1-\delta}
\displaystyle\leq 2λlogkC4λlogk2(21ρ1)M2τk+11δ4C4(1ρ)λlogkM2τk+11δ.\displaystyle\frac{2\lceil\lambda\log k\rceil C_{4}}{\lceil\lambda\log k\rceil^{2}}\Big{(}\frac{2}{1-\sqrt{\rho}}-1\Big{)}M^{2}\tau_{k+1}^{1-\delta}\leq\frac{4C_{4}}{(1-\sqrt{\rho})\lceil\lambda\log k\rceil}M^{2}\tau_{k+1}^{1-\delta}.

Notice that ηq\eta_{q}’s are all uniformly bounded by 2Mτk+12M\tau_{k+1}, we can apply Bernstein’s inequality (Bennett, 1962) and yield

(|2mkq=1mk/2ηq|t)2exp(mkt2/4Var(ηq)/2+Mτkt/3)\displaystyle{\mathbb{P}}\Big{(}\Big{|}\frac{2}{m_{k}}\sum_{q=1}^{m_{k}/2}\eta_{q}\Big{|}\geq t\Big{)}\leq 2\exp\Big{(}-\frac{m_{k}t^{2}/4}{{\rm Var}(\eta_{q})/2+M\tau_{k}t/3}\Big{)}
\displaystyle\leq exp(mkλlogkt2/42M2Cϵτk+11δ/(1ρ)+2λlogkMτk+1t/3).\displaystyle\exp\Big{(}-\frac{m_{k}\lceil\lambda\log k\rceil t^{2}/4}{2M^{2}C_{\epsilon}\tau_{k+1}^{1-\delta}/(1-\sqrt{\rho})+2\lceil\lambda\log k\rceil M\tau_{k+1}t/3}\Big{)}.

We take t=C(τk+11δlogk/(k+1)+τk+1log2k/(k+1))t=C(\sqrt{\tau_{k+1}^{1-\delta}\log k/(k+1)}+\tau_{k+1}\log^{2}k/(k+1)) for some CC large enough (note that kmkλlogkk\approx m_{k}\lceil\lambda\log k\rceil). Then we have that

(|2mkq=1mk/2ηq|t)=O(kν).{\mathbb{P}}\Big{(}\Big{|}\frac{2}{m_{k}}\sum_{q=1}^{m_{k}/2}\eta_{q}\Big{|}\geq t\Big{)}=O(k^{-\nu}). (45)

Combining (44) and (45) we have that

(|2mkq=1mk/2ξq|Cτk+11δlogkk+1+Clog2kτk+1k+1)=O(kν).\displaystyle{\mathbb{P}}\Big{(}\Big{|}\frac{2}{m_{k}}\sum_{q=1}^{m_{k}/2}\xi_{q}\Big{|}\geq C\sqrt{\frac{\tau^{1-\delta}_{k+1}\log k}{k+1}}+C\frac{\log^{2}k\tau_{k+1}}{k+1}\Big{)}=O(k^{-\nu}). (46)

Next we denote ξ~=1λlogkiH2qYi\widetilde{\xi}=\frac{1}{\lceil\lambda\log k\rceil}\sum_{i\in H_{2q}}Y_{i}, then we can similarly show that

(|2mkq=1mk/2ξ~q|Cτk+11δlogkk+1+Clog2kτk+1k+1)=O(kν),\displaystyle{\mathbb{P}}\Big{(}\Big{|}\frac{2}{m_{k}}\sum_{q=1}^{m_{k}/2}\widetilde{\xi}_{q}\Big{|}\geq C\sqrt{\frac{\tau^{1-\delta}_{k+1}\log k}{k+1}}+C\frac{\log^{2}k\tau_{k+1}}{k+1}\Big{)}=O(k^{-\nu}), (47)

Combining (46) and (47) we can prove that

(|1k+1i=1k+1{Xi,lgτi(ϵi)𝔼[Xi,lgτi(ϵi)]}|C(τk+11δlogkk+1+log2kτk+1k+1))=O(kν).{\mathbb{P}}\Bigg{(}\Big{|}\frac{1}{k+1}\sum_{i=1}^{k+1}\big{\{}X_{i,l}g_{\tau_{i}}(\epsilon_{i})-{\mathbb{E}}[X_{i,l}g_{\tau_{i}}(\epsilon_{i})]\big{\}}\Big{|}\geq C\Big{(}\sqrt{\frac{\tau^{1-\delta}_{k+1}\log k}{k+1}}+\frac{\log^{2}k\tau_{k+1}}{k+1}\Big{)}\Bigg{)}=O(k^{-\nu}). (48)

By combining it with (43) we have that

(|1k+1i=1k+1{𝑿igτi(𝒁i𝜽bi)}|2C(dτk+1δ+dτk+11δlogkk+1+dlog2kτk+1k+1))=O(kν).{\mathbb{P}}\Bigg{(}\Big{|}\frac{1}{k+1}\sum_{i=1}^{k+1}\big{\{}\boldsymbol{X}_{i}g_{\tau_{i}}(\boldsymbol{Z}_{i}^{{\top}}\boldsymbol{\theta}^{*}-b_{i})\big{\}}\Big{|}_{2}\geq C\Big{(}\sqrt{d}\tau_{k+1}^{-\delta^{\prime}}+\sqrt{\frac{d\tau^{1-\delta}_{k+1}\log k}{k+1}}+\frac{\sqrt{d}\log^{2}k\tau_{k+1}}{k+1}\Big{)}\Bigg{)}=O(k^{-\nu}).

Case 2: δ>1\delta>1. Similarly, we can obtain the rate

(|1k+1i=1k+1{Xi,lgτi(ϵi)𝔼[Xi,lgτi(ϵi)]}|C(logkk+1+log2kτk+1k+1))=O(kν),{\mathbb{P}}\Bigg{(}\Big{|}\frac{1}{k+1}\sum_{i=1}^{k+1}\big{\{}X_{i,l}g_{\tau_{i}}(\epsilon_{i})-{\mathbb{E}}[X_{i,l}g_{\tau_{i}}(\epsilon_{i})]\big{\}}\Big{|}\geq C\Big{(}\sqrt{\frac{\log k}{k+1}}+\frac{\log^{2}k\tau_{k+1}}{k+1}\Big{)}\Bigg{)}=O(k^{-\nu}),

by directly plugging τk\tau_{k} into (48). Then we have that

(|1k+1i=1k+1{𝑿igτi(𝒁i𝜽bi)}|2C(τk+1δ+logkk+1+log2kτk+1k+1))=O(kν).{\mathbb{P}}\Bigg{(}\Big{|}\frac{1}{k+1}\sum_{i=1}^{k+1}\big{\{}\boldsymbol{X}_{i}g_{\tau_{i}}(\boldsymbol{Z}_{i}^{{\top}}\boldsymbol{\theta}^{*}-b_{i})\big{\}}\Big{|}_{2}\geq C\Big{(}\tau_{k+1}^{-\delta^{\prime}}+\sqrt{\frac{\log k}{k+1}}+\frac{\log^{2}k\tau_{k+1}}{k+1}\Big{)}\Bigg{)}=O(k^{-\nu}).

Proof of ii). When δ>1\delta>1 and k/log3k=O(τk)\sqrt{k/\log^{3}k}=O(\tau_{k}), we define the thresholding level τ~=Cτk+1/log3/2k\widetilde{\tau}=C_{\tau}\sqrt{k+1}/\log^{3/2}k, and consider the truncated random variables 𝑿igτi(ϵi)𝕀(|ϵi|τ~)\boldsymbol{X}_{i}g_{\tau_{i}}(\epsilon_{i}){\mathbb{I}}(|\epsilon_{i}|\leq\widetilde{\tau}). Here CτC_{\tau} is a sufficiently large constant. Then we have that

(i=1k+1𝑿igτi(ϵi)i=1k+1𝑿igτi(ϵi)𝕀(|ϵi|τ~))\displaystyle{\mathbb{P}}\Big{(}\sum_{i=1}^{k+1}\boldsymbol{X}_{i}g_{\tau_{i}}(\epsilon_{i})\neq\sum_{i=1}^{k+1}\boldsymbol{X}_{i}g_{\tau_{i}}(\epsilon_{i}){\mathbb{I}}(|\epsilon_{i}|\leq\widetilde{\tau})\Big{)}
\displaystyle\leq (i=1k+1{𝑿igτi(ϵi)𝑿igτi(ϵi)𝕀(|ϵi|τ~)})\displaystyle{\mathbb{P}}\Big{(}\cup_{i=1}^{k+1}\{\boldsymbol{X}_{i}g_{\tau_{i}}(\epsilon_{i})\neq\boldsymbol{X}_{i}g_{\tau_{i}}(\epsilon_{i}){\mathbb{I}}(|\epsilon_{i}|\leq\widetilde{\tau})\}\Big{)}
\displaystyle\leq (k+1)max1ik+1(|ϵi|>τ~)=O((k+1)(δ1)/3).\displaystyle(k+1)\max_{1\leq i\leq k+1}{\mathbb{P}}\Big{(}|\epsilon_{i}|>\widetilde{\tau}\Big{)}=O((k+1)^{-(\delta-1)/3}). (49)

Similarly as in (43), we have that

|1k+1i=1k+1𝔼[Xi,lgτi(ϵi)𝕀(|ϵi|τ~)]|1k+1i=1k+1|𝔼[Xi,lgτi(ϵi)𝕀(|ϵi|τ~)]|C(τk+1δ+τ~δ).\Big{|}\frac{1}{k+1}\sum_{i=1}^{k+1}{\mathbb{E}}[X_{i,l}g_{\tau_{i}}(\epsilon_{i}){\mathbb{I}}(|\epsilon_{i}|\leq\widetilde{\tau})]\Big{|}\leq\frac{1}{k+1}\sum_{i=1}^{k+1}\Big{|}{\mathbb{E}}[X_{i,l}g_{\tau_{i}}(\epsilon_{i}){\mathbb{I}}(|\epsilon_{i}|\leq\widetilde{\tau})]\Big{|}\leq C(\tau_{k+1}^{-\delta^{\prime}}+\widetilde{\tau}^{-\delta}). (50)

Similarly as in the proof of (48), for every ν>0\nu>0, there exists C>0C>0 such that

(|1k+1i=1k+1{Xi,lgτi(ϵi)𝕀(|ϵi|τ~)𝔼[Xi,lgτi(ϵi)𝕀(|ϵi|τ~)]}|C(logkk+1+log2kτ~k+1))=O(kν)\displaystyle{\mathbb{P}}\Bigg{(}\Big{|}\frac{1}{k+1}\sum_{i=1}^{k+1}\big{\{}X_{i,l}g_{\tau_{i}}(\epsilon_{i}){\mathbb{I}}(|\epsilon_{i}|\leq\widetilde{\tau})-{\mathbb{E}}[X_{i,l}g_{\tau_{i}}(\epsilon_{i}){\mathbb{I}}(|\epsilon_{i}|\leq\widetilde{\tau})]\big{\}}\Big{|}\geq C\big{(}\sqrt{\frac{\log k}{k+1}}+\frac{\log^{2}k\widetilde{\tau}}{k+1}\big{)}\Bigg{)}=O(k^{-\nu}) (51)

By the choice of τ~\widetilde{\tau}, we know that

logkk+1+log2kτ~k+1=O(logkk+1).\sqrt{\frac{\log k}{k+1}}+\frac{\log^{2}k\widetilde{\tau}}{k+1}=O\Big{(}\sqrt{\frac{\log k}{k+1}}\Big{)}.

Combining (49), (50) and (51) we have that

(|1k+1i=1k+1{𝑿igτi(𝒁i𝜽bi)}|2Cd(τk+1δ+τ~δ+logkk+1))=O((k+1)min{ν,(δ1)/3}).{\mathbb{P}}\Bigg{(}\Big{|}\frac{1}{k+1}\sum_{i=1}^{k+1}\big{\{}\boldsymbol{X}_{i}g_{\tau_{i}}(\boldsymbol{Z}_{i}^{{\top}}\boldsymbol{\theta}^{*}-b_{i})\big{\}}\Big{|}_{2}\geq C\sqrt{d}\Big{(}\tau_{k+1}^{-\delta^{\prime}}+\widetilde{\tau}^{-\delta}+\sqrt{\frac{\log k}{k+1}}\Big{)}\Bigg{)}=O((k+1)^{-\min\{\nu,(\delta-1)/3\}}).

When there are αk\alpha_{k} fraction of outliers, denote 𝒬k{\mathcal{Q}}_{k} as the index set, we have

|1k+1i=1k+1{𝑿igτi(𝒁i𝜽bi)}|2\displaystyle\Big{|}\frac{1}{k+1}\sum_{i=1}^{k+1}\big{\{}\boldsymbol{X}_{i}g_{\tau_{i}}(\boldsymbol{Z}_{i}^{{\top}}\boldsymbol{\theta}^{*}-b_{i})\big{\}}\Big{|}_{2}
\displaystyle\leq 1k+1|i𝒬k{𝑿igτi(𝒁i𝜽bi)}|2+1k+1|i𝒬k{𝑿igτi(𝒁i𝜽bi)}|2\displaystyle\frac{1}{k+1}\Big{|}\sum_{i\in{\mathcal{Q}}_{k}}\big{\{}\boldsymbol{X}_{i}g_{\tau_{i}}(\boldsymbol{Z}_{i}^{{\top}}\boldsymbol{\theta}^{*}-b_{i})\big{\}}\Big{|}_{2}+\frac{1}{k+1}\Big{|}\sum_{i\notin{\mathcal{Q}}_{k}}\big{\{}\boldsymbol{X}_{i}g_{\tau_{i}}(\boldsymbol{Z}_{i}^{{\top}}\boldsymbol{\theta}^{*}-b_{i})\big{\}}\Big{|}_{2}
=\displaystyle= O(dαkτk+1)+1k+1|i𝒬k{𝑿igτi(𝒁i𝜽bi)}|2,\displaystyle O\big{(}\sqrt{d}\alpha_{k}\tau_{k+1}\big{)}+\frac{1}{k+1}\Big{|}\sum_{i\notin{\mathcal{Q}}_{k}}\big{\{}\boldsymbol{X}_{i}g_{\tau_{i}}(\boldsymbol{Z}_{i}^{{\top}}\boldsymbol{\theta}^{*}-b_{i})\big{\}}\Big{|}_{2},

which proves the lemma. ∎

Lemma 8.

(Bound of the mixed term) Under the same assumptions as in Theorem 1, for every ν>0\nu>0, there exist constants CC and cc, such that

(|1k+1i=1k+1(𝑨i𝑯)(𝜽^i1𝜽)|2C(αk+Ψ2dek2logk),i=n0kEi)c(k+1)ν.{\mathbb{P}}\Big{(}\Big{|}\frac{1}{k+1}\sum_{i=1}^{k+1}(\boldsymbol{A}_{i}-\boldsymbol{H})(\widehat{\boldsymbol{\theta}}_{i-1}-\boldsymbol{\theta}^{*})\Big{|}_{2}\geq C(\alpha_{k}+\Psi^{2}de_{k}^{2}\log k),\cap_{i=n_{0}}^{k}E_{i}\Big{)}\leq c(k+1)^{-\nu}.
Proof.

Firstly, from (39) and Lemma 5, under the event i=n0kEi\cap_{i=n_{0}}^{k}E_{i}, we can bound the term

|1k+1i=1k+1(𝔼[𝑨i]𝑯)(𝜽^i1𝜽)|2\displaystyle\Big{|}\frac{1}{k+1}\sum_{i=1}^{k+1}({\mathbb{E}}[\boldsymbol{A}_{i}]-\boldsymbol{H})(\widehat{\boldsymbol{\theta}}_{i-1}-\boldsymbol{\theta}^{*})\Big{|}_{2}
\displaystyle\leq 1k+1i=1k+1𝑬[𝑨i]𝑯|𝜽^i1𝜽|2\displaystyle\frac{1}{k+1}\sum_{i=1}^{k+1}\|\boldsymbol{E}[\boldsymbol{A}_{i}]-\boldsymbol{H}\|\cdot|\widehat{\boldsymbol{\theta}}_{i-1}-\boldsymbol{\theta}^{*}|_{2}
\displaystyle\leq 1k+1i=1k+1CΨτiδei1=O(τk+1δdeklogk)=O(dek2logk).\displaystyle\frac{1}{k+1}\sum_{i=1}^{k+1}C\Psi\tau_{i}^{-\delta^{\prime}}e_{i-1}=O(\tau_{k+1}^{-\delta^{\prime}}\sqrt{d}e_{k}\log k)=O(\sqrt{d}e_{k}^{2}\log k). (52)

For ease of notation, we first consider the case when there is no outlier. Similar as in the proof of Lemma 1, for each kk, we evenly divide the tuple {n0,,k}\{n_{0},\dots,k\} into mkm_{k} subsets H1,,HmkH_{1},\dots,H_{m_{k}}, where mk=(kn0)/λlogkm_{k}=\lceil(k-n_{0})/\lceil\lambda\log k\rceil\rceil. Here |Hq|=λlogk|H_{q}|=\lceil\lambda\log k\rceil for 1qmk11\leq q\leq m_{k}-1, and |Hmk|λlogk|H_{m_{k}}|\leq\lceil\lambda\log k\rceil. λ\lambda is a sufficiently large constant which will be specified later. Then we have mk(kn0)/(λlogk)m_{k}\approx(k-n_{0})/(\lambda\log k). We further denote H0={1,,n0}H_{0}=\{1,\dots,n_{0}\}. Without loss of generality, we assume that mkm_{k} is an even integer. For each i{n0,,k}i\in\{n_{0},\dots,k\}, suppose iHlii\in H_{l_{i}}, we construct the following random variable

𝜽~i=1iq=0li2jHq(𝜽^j𝜽)+𝑯11iq=0li2jHq𝑿j+1gτj+1(𝒁j+1𝜽^jbj+1).\widetilde{\boldsymbol{\theta}}_{i}=\frac{1}{i}\sum_{q=0}^{l_{i}-2}\sum_{j\in H_{q}}(\widehat{\boldsymbol{\theta}}_{j}-\boldsymbol{\theta}^{*})+\boldsymbol{H}^{-1}\frac{1}{i}\sum_{q=0}^{l_{i}-2}\sum_{j\in H_{q}}\boldsymbol{X}_{j+1}g_{\tau_{j+1}}(\boldsymbol{Z}_{j+1}^{{\top}}\widehat{\boldsymbol{\theta}}_{j}-b_{j+1}). (53)

When lj=1l_{j}=1, we take the sum for the terms in H0H_{0}. For iH0i\in H_{0}, we take 𝜽~i=𝜽^i𝜽=𝜽^0𝜽\widetilde{\boldsymbol{\theta}}_{i}=\widehat{\boldsymbol{\theta}}_{i}-\boldsymbol{\theta}^{*}=\widehat{\boldsymbol{\theta}}_{0}-\boldsymbol{\theta}^{*}. Then by Lemma 9 below we have that,

`(|1k+1i=1k+1(𝑨i𝔼[𝑨i])(𝜽^i1𝜽)1k+1i=1k+1(𝑨i𝔼[𝑨i])𝜽~i1|2CΨ2dek2,i=n0kEi)=O((k+1)ν).`\begin{aligned} &{\mathbb{P}}\Big{(}\Big{|}\frac{1}{k+1}\sum_{i=1}^{k+1}(\boldsymbol{A}_{i}-{\mathbb{E}}[\boldsymbol{A}_{i}])(\widehat{\boldsymbol{\theta}}_{i-1}-\boldsymbol{\theta}^{*})-\frac{1}{k+1}\sum_{i=1}^{k+1}(\boldsymbol{A}_{i}-{\mathbb{E}}[\boldsymbol{A}_{i}])\widetilde{\boldsymbol{\theta}}_{i-1}\Big{|}_{2}\geq C\Psi^{2}de_{k}^{2},\cap_{i=n_{0}}^{k}E_{i}\Big{)}\\ =&O((k+1)^{-\nu}).\end{aligned} (54)

Moreover, from the proof of Lemma 9, we can obtain that under i=n0kEi\cap_{i=n_{0}}^{k}E_{i}, there is

|𝜽^i𝜽𝜽~i|2Ψdei.|\widehat{\boldsymbol{\theta}}_{i}-\boldsymbol{\theta}^{*}-\widetilde{\boldsymbol{\theta}}_{i}|_{2}\leq\Psi\sqrt{d}e_{i}. (55)

It left to bound the term 1k+1j=0k(𝑨i𝔼[𝑨i])𝜽~i\frac{1}{k+1}\sum_{j=0}^{k}(\boldsymbol{A}_{i}-{\mathbb{E}}[\boldsymbol{A}_{i}])\widetilde{\boldsymbol{\theta}}_{i}. To be more precise, we define the σ\sigma-field 𝒢l=σ((𝑿i,𝒁i,bi):ij=12l+1Hj){\mathcal{G}}_{l}=\sigma((\boldsymbol{X}_{i},\boldsymbol{Z}_{i},b_{i}):i\in\cup_{j=1}^{2l+1}H_{j}), and construct the random variable

𝝃l=iH2l+1(𝑨i+1𝔼[𝑨i+1])𝜽~i\boldsymbol{\xi}_{l}=\sum_{i\in H_{2l+1}}(\boldsymbol{A}_{i+1}-{\mathbb{E}}[\boldsymbol{A}_{i+1}])\widetilde{\boldsymbol{\theta}}_{i}

Then by (55) we know that

(i{|𝜽~i|22Ψdei},j=n0kEj)=O((k+1)ν).{\mathbb{P}}\Big{(}\cup_{i}\Big{\{}|\widetilde{\boldsymbol{\theta}}_{i}|_{2}\geq 2\Psi\sqrt{d}e_{i}\Big{\}},\cap_{j=n_{0}}^{k}E_{j}\Big{)}=O((k+1)^{-\nu}).

We further set

𝝃~l=iH2l+1(𝑨i+1𝔼[𝑨i+1])𝜽~i𝕀{|𝜽~i|22Ψdei},\widetilde{\boldsymbol{\xi}}_{l}=\sum_{i\in H_{2l+1}}(\boldsymbol{A}_{i+1}-{\mathbb{E}}[\boldsymbol{A}_{i+1}])\widetilde{\boldsymbol{\theta}}_{i}{\mathbb{I}}\Big{\{}|\widetilde{\boldsymbol{\theta}}_{i}|_{2}\leq 2\Psi\sqrt{d}e_{i}\Big{\}},

then there is

(l=1mk/2𝝃ll=1mk/2𝝃~l,i=n0kEi)=O(kν).{\mathbb{P}}\Big{(}\sum_{l=1}^{m_{k}/2}\boldsymbol{\xi}_{l}\neq\sum_{l=1}^{m_{k}/2}\widetilde{\boldsymbol{\xi}}_{l},\cap_{i=n_{0}}^{k}E_{i}\Big{)}=O(k^{-\nu}). (56)

Notice that {𝝃~l𝔼[𝝃~l|𝒢l1],l1}\{\widetilde{\boldsymbol{\xi}}_{l}-{\mathbb{E}}[\widetilde{\boldsymbol{\xi}}_{l}|{\mathcal{G}}_{l-1}],l\geq 1\} are martingale differences, and there is 𝔼[𝜽~i|𝒢l1]=𝜽~i{\mathbb{E}}[\widetilde{\boldsymbol{\theta}}_{i}|{\mathcal{G}}_{l-1}]=\widetilde{\boldsymbol{\theta}}_{i} for iH2l+1i\in H_{2l+1}. Therefore we have

𝔼[𝝃~l|𝒢l1]=iH2l+1𝔼[(𝑨i+1𝔼[𝑨i+1])|𝒢l1]𝜽~i𝕀{|𝜽~i|22Ψdei}.{\mathbb{E}}[\widetilde{\boldsymbol{\xi}}_{l}|{\mathcal{G}}_{l-1}]=\sum_{i\in H_{2l+1}}{\mathbb{E}}[(\boldsymbol{A}_{i+1}-{\mathbb{E}}[\boldsymbol{A}_{i+1}])|{\mathcal{G}}_{l-1}]\widetilde{\boldsymbol{\theta}}_{i}{\mathbb{I}}\Big{\{}|\widetilde{\boldsymbol{\theta}}_{i}|_{2}\leq 2\Psi\sqrt{d}e_{i}\Big{\}}.

By Lemma 3 there is

𝔼[𝔼[𝑨i+1𝔼[𝑨i+1]|𝒢l1]]Cdϕ(λlogk)=O(kν2),{\mathbb{E}}\Big{[}\big{\|}{\mathbb{E}}[\boldsymbol{A}_{i+1}-{\mathbb{E}}[\boldsymbol{A}_{i+1}]|{\mathcal{G}}_{l-1}]\big{\|}\Big{]}\leq Cd\sqrt{\phi(\lceil\lambda\log k\rceil)}=O(k^{-\nu-2}),

for some λ\lambda large enough. Then Markov’s inequality yields

(|1kl=1mk/2𝔼[𝝃~l|𝒢l1]|2Cdekk,i=n0kEi)=O(kν).{\mathbb{P}}\Big{(}\Big{|}\frac{1}{k}\sum_{l=1}^{m_{k}/2}{\mathbb{E}}[\widetilde{\boldsymbol{\xi}}_{l}|{\mathcal{G}}_{l-1}]\Big{|}_{2}\geq C\frac{\sqrt{d}e_{k}}{k},\cap_{i=n_{0}}^{k}E_{i}\Big{)}=O(k^{-\nu}). (57)

It is direct to verify that

sup𝒗𝕊d1|𝒗(𝝃~l𝔼[𝝃~l|𝒢l1])|CΨλlogk,\displaystyle\sup_{\boldsymbol{v}\in{\mathbb{S}}^{d-1}}|\boldsymbol{v}^{{\top}}(\widetilde{\boldsymbol{\xi}}_{l}-{\mathbb{E}}[\widetilde{\boldsymbol{\xi}}_{l}|{\mathcal{G}}_{l-1}])|\leq C\Psi\lambda\log k,
sup𝒗𝕊d1l=1mk/2Var[|𝒗𝝃~l||𝒢l1]sup𝒗𝕊d1l=1mk/2𝔼[|𝒗𝝃~l|2|𝒢l1]CΨ2kdek2logk.\displaystyle\sup_{\boldsymbol{v}\in{\mathbb{S}}^{d-1}}\sum_{l=1}^{m_{k}/2}{\rm Var}[|\boldsymbol{v}^{{\top}}\widetilde{\boldsymbol{\xi}}_{l}||{\mathcal{G}}_{l-1}]\leq\sup_{\boldsymbol{v}\in{\mathbb{S}}^{d-1}}\sum_{l=1}^{m_{k}/2}{\mathbb{E}}[|\boldsymbol{v}^{{\top}}\widetilde{\boldsymbol{\xi}}_{l}|^{2}|{\mathcal{G}}_{l-1}]\leq C\Psi^{2}kde_{k}^{2}\log k.

Then we can apply Lemma 2 and yield

(|l=1mk/2(𝝃~l𝔼[𝝃~l|𝒢l1])|2CΨ(dlog2k+dekklog2k))=O(kνd).\displaystyle{\mathbb{P}}\Big{(}\Big{|}\sum_{l=1}^{m_{k}/2}(\widetilde{\boldsymbol{\xi}}_{l}-{\mathbb{E}}[\widetilde{\boldsymbol{\xi}}_{l}|{\mathcal{G}}_{l-1}])\Big{|}_{2}\geq C\Psi(d\log^{2}k+de_{k}\sqrt{k\log^{2}k})\Big{)}=O(k^{-\nu d}).

Combining it with (57) and (56), we have that

(|1k+1l=1mk/2𝝃l|2CΨ(dlog2kk+deklog2kk),i=n0kEi)=O((k+1)ν){\mathbb{P}}\Big{(}\Big{|}\frac{1}{k+1}\sum_{l=1}^{m_{k}/2}\boldsymbol{\xi}_{l}\Big{|}_{2}\geq C\Psi\Big{(}\frac{d\log^{2}k}{k}+de_{k}\sqrt{\frac{\log^{2}k}{k}}\Big{)},\cap_{i=n_{0}}^{k}E_{i}\Big{)}=O((k+1)^{-\nu})

A similar result holds for the even term. Note that dlog2kk+deklog2kk=O(dek2)\frac{d\log^{2}k}{k}+de_{k}\sqrt{\frac{\log^{2}k}{k}}=O(de_{k}^{2}), and together with (54) we have

(|1k+1j=0k(𝑨i+1𝔼[𝑨i+1])(𝜽^i𝜽)|2CΨ2dek2logk,i=n0kEi)=O((k+1)ν).{\mathbb{P}}\Big{(}\Big{|}\frac{1}{k+1}\sum_{j=0}^{k}(\boldsymbol{A}_{i+1}-{\mathbb{E}}[\boldsymbol{A}_{i+1}])(\widehat{\boldsymbol{\theta}}_{i}-\boldsymbol{\theta}^{*})\Big{|}_{2}\geq C\Psi^{2}de_{k}^{2}\log k,\cap_{i=n_{0}}^{k}E_{i}\Big{)}=O((k+1)^{-\nu}).

When taking the outliers into consideration, we have that

|1k+1j=0k(𝑨i+1𝔼[𝑨i+1])(𝜽^i𝜽)|2\displaystyle\Big{|}\frac{1}{k+1}\sum_{j=0}^{k}(\boldsymbol{A}_{i+1}-{\mathbb{E}}[\boldsymbol{A}_{i+1}])(\widehat{\boldsymbol{\theta}}_{i}-\boldsymbol{\theta}^{*})\Big{|}_{2}
=\displaystyle= |1k+1j𝒬k(𝑨i+1𝔼[𝑨i+1])(𝜽^i𝜽)|2+|1k+1j𝒬k(𝑨i+1𝔼[𝑨i+1])(𝜽^i𝜽)|2\displaystyle\Big{|}\frac{1}{k+1}\sum_{j\in{\mathcal{Q}}_{k}}(\boldsymbol{A}_{i+1}-{\mathbb{E}}[\boldsymbol{A}_{i+1}])(\widehat{\boldsymbol{\theta}}_{i}-\boldsymbol{\theta}^{*})\Big{|}_{2}+\Big{|}\frac{1}{k+1}\sum_{j\notin{\mathcal{Q}}_{k}}(\boldsymbol{A}_{i+1}-{\mathbb{E}}[\boldsymbol{A}_{i+1}])(\widehat{\boldsymbol{\theta}}_{i}-\boldsymbol{\theta}^{*})\Big{|}_{2}
=\displaystyle= |1k+1j𝒬k(𝑨i+1𝔼[𝑨i+1])(𝜽^i𝜽)|2+O(αk),\displaystyle\Big{|}\frac{1}{k+1}\sum_{j\notin{\mathcal{Q}}_{k}}(\boldsymbol{A}_{i+1}-{\mathbb{E}}[\boldsymbol{A}_{i+1}])(\widehat{\boldsymbol{\theta}}_{i}-\boldsymbol{\theta}^{*})\Big{|}_{2}+O\big{(}\alpha_{k}),

under the event i=n0kEi\cap_{i=n_{0}}^{k}E_{i}, which proves the lemma by combining it with (52). ∎

Lemma 9.

Let 𝛉~i\widetilde{\boldsymbol{\theta}}_{i} be defined in (53), for every ν>0\nu>0, there exist constants CC and cc, such that

(|1k+1i=1k+1(𝑨i𝔼[𝑨i])(𝜽^i1𝜽)1k+1i=1k+1(𝑨i𝔼[𝑨i])𝜽~i1|2CΨ2dek2,i=n0kEi)\displaystyle{\mathbb{P}}\Big{(}\Big{|}\frac{1}{k+1}\sum_{i=1}^{k+1}(\boldsymbol{A}_{i}-{\mathbb{E}}[\boldsymbol{A}_{i}])(\widehat{\boldsymbol{\theta}}_{i-1}-\boldsymbol{\theta}^{*})-\frac{1}{k+1}\sum_{i=1}^{k+1}(\boldsymbol{A}_{i}-{\mathbb{E}}[\boldsymbol{A}_{i}])\widetilde{\boldsymbol{\theta}}_{i-1}\Big{|}_{2}\geq C\Psi^{2}de_{k}^{2},\cap_{i=n_{0}}^{k}E_{i}\Big{)}
\displaystyle\leq c(k+1)ν.\displaystyle c(k+1)^{-\nu}.
Proof.

Under event i=n0kEi\cap_{i=n_{0}}^{k}E_{i}, using Lemma 5, we have

|𝜽^i𝜽𝜽~i|2\displaystyle\big{|}\widehat{\boldsymbol{\theta}}_{i}-\boldsymbol{\theta}^{*}-\widetilde{\boldsymbol{\theta}}_{i}\big{|}_{2}
=\displaystyle= |1ij=(li2)λlogk+n0+1i1(𝜽^j𝜽)|2+|𝑯11ij=(li2)λlogk+n0+1i1𝑿j+1gτj+1(𝒁j+1𝜽^jbj+1)|2\displaystyle\Big{|}\frac{1}{i}\sum_{j=(l_{i}-2)\lceil\lambda\log k\rceil+n_{0}+1}^{i-1}(\widehat{\boldsymbol{\theta}}_{j}-\boldsymbol{\theta}^{*})\Big{|}_{2}+\Big{|}\boldsymbol{H}^{-1}\frac{1}{i}\sum_{j=(l_{i}-2)\lceil\lambda\log k\rceil+n_{0}+1}^{i-1}\boldsymbol{X}_{j+1}g_{\tau_{j+1}}(\boldsymbol{Z}_{j+1}^{{\top}}\widehat{\boldsymbol{\theta}}_{j}-b_{j+1})\Big{|}_{2}
+|(𝑯^i11𝑯1)1ij=0i1𝑿j+1gτj+1(𝒁j+1𝜽^jbj+1)|2\displaystyle+\Big{|}(\widehat{\boldsymbol{H}}_{i-1}^{-1}-\boldsymbol{H}^{-1})\frac{1}{i}\sum_{j=0}^{i-1}\boldsymbol{X}_{j+1}g_{\tau_{j+1}}(\boldsymbol{Z}_{j+1}^{{\top}}\widehat{\boldsymbol{\theta}}_{j}-b_{j+1})\Big{|}_{2}
=\displaystyle= |(𝑯^i11𝑯1)1ij=0i1𝑿j+1gτj+1(𝒁j+1𝜽^jbj+1)|2\displaystyle\Big{|}(\widehat{\boldsymbol{H}}_{i-1}^{-1}-\boldsymbol{H}^{-1})\frac{1}{i}\sum_{j=0}^{i-1}\boldsymbol{X}_{j+1}g_{\tau_{j+1}}(\boldsymbol{Z}_{j+1}^{{\top}}\widehat{\boldsymbol{\theta}}_{j}-b_{j+1})\Big{|}_{2}
+|𝑯11ij=(li2)λlogk+n0+1i1𝑿j+1gτj+1(𝒁j+1𝜽bj+1)|2+O(eilogki).\displaystyle+\Big{|}\boldsymbol{H}^{-1}\frac{1}{i}\sum_{j=(l_{i}-2)\lceil\lambda\log k\rceil+n_{0}+1}^{i-1}\boldsymbol{X}_{j+1}g_{\tau_{j+1}}(\boldsymbol{Z}_{j+1}^{{\top}}\boldsymbol{\theta}^{*}-b_{j+1})\Big{|}_{2}+O\Big{(}\frac{e_{i}\log k}{i}\Big{)}. (58)

For the first term, by Lemma 6 and (42) we have that under event i=n0kEi\cap_{i=n_{0}}^{k}E_{i},

𝑯^i11𝑯1𝑯^i11𝑯^i1𝑯𝑯1CΨdei.\|\widehat{\boldsymbol{H}}_{i-1}^{-1}-\boldsymbol{H}^{-1}\|\leq\|\widehat{\boldsymbol{H}}_{i-1}^{-1}\|\cdot\|\widehat{\boldsymbol{H}}_{i-1}-\boldsymbol{H}\|\cdot\|\boldsymbol{H}^{-1}\|\leq C\Psi\sqrt{d}e_{i}.

On the other hand, there is

|1ij=0i1𝑿j+1gτj+1(𝒁j+1𝜽^jbj+1)|2\displaystyle\Big{|}\frac{1}{i}\sum_{j=0}^{i-1}\boldsymbol{X}_{j+1}g_{\tau_{j+1}}(\boldsymbol{Z}_{j+1}^{{\top}}\widehat{\boldsymbol{\theta}}_{j}-b_{j+1})\Big{|}_{2}
\displaystyle\leq |1ij=0i1𝑿j+1gτj+1(𝒁j+1𝜽bj+1)|2+M21ij=0i1|𝜽^j𝜽|2CΨdeilogi.\displaystyle\Big{|}\frac{1}{i}\sum_{j=0}^{i-1}\boldsymbol{X}_{j+1}g_{\tau_{j+1}}(\boldsymbol{Z}_{j+1}^{{\top}}\boldsymbol{\theta}^{*}-b_{j+1})\Big{|}_{2}+M^{2}\frac{1}{i}\sum_{j=0}^{i-1}|\widehat{\boldsymbol{\theta}}_{j}-\boldsymbol{\theta}^{*}|_{2}\leq C\Psi\sqrt{d}e_{i}\log i.

Therefore, we have that

1k+1i=1k+1(𝑨i𝔼[𝑨i])(𝜽^i1𝜽)1k+1i=1k+1(𝑨i𝔼[𝑨i])𝜽~i1\displaystyle\frac{1}{k+1}\sum_{i=1}^{k+1}(\boldsymbol{A}_{i}-{\mathbb{E}}[\boldsymbol{A}_{i}])(\widehat{\boldsymbol{\theta}}_{i-1}-\boldsymbol{\theta}^{*})-\frac{1}{k+1}\sum_{i=1}^{k+1}(\boldsymbol{A}_{i}-{\mathbb{E}}[\boldsymbol{A}_{i}])\widetilde{\boldsymbol{\theta}}_{i-1}
=\displaystyle= 1k+1i=1k+1(𝑨i𝔼[𝑨i])𝑯11ij=(li2)λlogk+n0+1i1𝑿j+1gτj+1(𝒁j+1𝜽bj+1)+O(Ψ2dei2)\displaystyle\frac{1}{k+1}\sum_{i=1}^{k+1}(\boldsymbol{A}_{i}-{\mathbb{E}}[\boldsymbol{A}_{i}])\boldsymbol{H}^{-1}\frac{1}{i}\sum_{j=(l_{i}-2)\lceil\lambda\log k\rceil+n_{0}+1}^{i-1}\boldsymbol{X}_{j+1}g_{\tau_{j+1}}(\boldsymbol{Z}_{j+1}^{{\top}}\boldsymbol{\theta}^{*}-b_{j+1})+O(\Psi^{2}de_{i}^{2})
=\displaystyle= 1k+1i=1k+1{j=(li2)λlogk+n0+1i1j(𝑨j𝔼[𝑨j])}𝑯1𝑿igτi(𝒁i𝜽bi)+O(Ψ2dei2).\displaystyle\frac{1}{k+1}\sum_{i=1}^{k+1}\Big{\{}\sum_{j=(l_{i}-2)\lceil\lambda\log k\rceil+n_{0}+1}^{i}\frac{1}{j}(\boldsymbol{A}_{j}-{\mathbb{E}}[\boldsymbol{A}_{j}])\Big{\}}\boldsymbol{H}^{-1}\boldsymbol{X}_{i}g_{\tau_{i}}(\boldsymbol{Z}_{i}^{{\top}}\boldsymbol{\theta}^{*}-b_{i})+O(\Psi^{2}de_{i}^{2}). (59)

Denote

𝒀i={j=(li2)λlogk+n0+1i1j(𝑨j𝔼[𝑨j])}𝑯1𝑿igτi(𝒁i𝜽bi).\boldsymbol{Y}_{i}=\Big{\{}\sum_{j=(l_{i}-2)\lceil\lambda\log k\rceil+n_{0}+1}^{i}\frac{1}{j}(\boldsymbol{A}_{j}-{\mathbb{E}}[\boldsymbol{A}_{j}])\Big{\}}\boldsymbol{H}^{-1}\boldsymbol{X}_{i}g_{\tau_{i}}(\boldsymbol{Z}_{i}^{{\top}}\boldsymbol{\theta}^{*}-b_{i}).

It is direct to verify that

|𝔼[Yi]|2\displaystyle|{\mathbb{E}}[Y_{i}]|_{2}\leq 𝔼j=(li2)λlogk+n0+1i1j(𝑨j𝔼[𝑨j])2𝔼|𝑯1𝑿igτi(𝒁i𝜽bi)|22\displaystyle\sqrt{{\mathbb{E}}\Big{\|}\sum_{j=(l_{i}-2)\lceil\lambda\log k\rceil+n_{0}+1}^{i}\frac{1}{j}(\boldsymbol{A}_{j}-{\mathbb{E}}[\boldsymbol{A}_{j}])\Big{\|}^{2}{\mathbb{E}}\big{|}\boldsymbol{H}^{-1}\boldsymbol{X}_{i}g_{\tau_{i}}(\boldsymbol{Z}_{i}^{{\top}}\boldsymbol{\theta}^{*}-b_{i})\big{|}_{2}^{2}}
=\displaystyle= O(dlogkiτi(1δ)+/2)=O(dei2).\displaystyle O\Big{(}\frac{\sqrt{d}\log k}{i}\tau_{i}^{(1-\delta)_{+}/2}\Big{)}=O(\sqrt{d}e_{i}^{2}). (60)

Therefore, it left bound

|1k+1i=1k+1(Yi𝔼[Yi])|2.\displaystyle\Big{|}\frac{1}{k+1}\sum_{i=1}^{k+1}(Y_{i}-{\mathbb{E}}[Y_{i}])\Big{|}_{2}.

To this end, we further denote

𝝃l=iH4l+1(𝒀i𝔼[𝒀i]),\displaystyle\boldsymbol{\xi}_{l}=\sum_{i\in H_{4l+1}}(\boldsymbol{Y}_{i}-{\mathbb{E}}[\boldsymbol{Y}_{i}]),

We define the σ\sigma-field 𝒢l=σ((𝑿i,𝒁i,bi):ij=14l+1Hj){\mathcal{G}}_{l}=\sigma((\boldsymbol{X}_{i},\boldsymbol{Z}_{i},b_{i}):i\in\cup_{j=1}^{4l+1}H_{j}), by Lemma 3 we have that

𝔼[|𝔼[𝝃l|𝒢l1]|2]=O(dτllogklkν+2).{\mathbb{E}}\Big{[}\Big{|}{\mathbb{E}}[\boldsymbol{\xi}_{l}|{\mathcal{G}}_{l-1}]\Big{|}_{2}\Big{]}=O\Big{(}\frac{\sqrt{d}\tau_{l}\log k}{lk^{\nu+2}}\Big{)}.

Therefore, by Markov’s inequality

(|1kl=1mk/4𝔼[𝝃l|𝒢l1]|2Cdτklogkk2,i=n0kEi)=O(kν).{\mathbb{P}}\Big{(}\Big{|}\frac{1}{k}\sum_{l=1}^{m_{k}/4}{\mathbb{E}}[\boldsymbol{\xi}_{l}|{\mathcal{G}}_{l-1}]\Big{|}_{2}\geq C\frac{\sqrt{d}\tau_{k}\log k}{k^{2}},\cap_{i=n_{0}}^{k}E_{i}\Big{)}=O(k^{-\nu}). (61)

It is direct to verify that

sup𝒗𝕊d1|𝒗(𝝃l𝔼[𝝃l|𝒢l1])|Cdlogk,\displaystyle\sup_{\boldsymbol{v}\in{\mathbb{S}}^{d-1}}|\boldsymbol{v}^{{\top}}(\boldsymbol{\xi}_{l}-{\mathbb{E}}[\boldsymbol{\xi}_{l}|{\mathcal{G}}_{l-1}])|\leq C\sqrt{d}\log k,
sup𝒗𝕊d1l=1mk/2Var[|𝒗𝝃l||𝒢l1]sup𝒗𝕊d1l=1mk/2𝔼[|𝒗𝝃l|2|𝒢l1]\displaystyle\sup_{\boldsymbol{v}\in{\mathbb{S}}^{d-1}}\sum_{l=1}^{m_{k}/2}{\rm Var}[|\boldsymbol{v}^{{\top}}\boldsymbol{\xi}_{l}||{\mathcal{G}}_{l-1}]\leq\sup_{\boldsymbol{v}\in{\mathbb{S}}^{d-1}}\sum_{l=1}^{m_{k}/2}{\mathbb{E}}[|\boldsymbol{v}^{{\top}}\boldsymbol{\xi}_{l}|^{2}|{\mathcal{G}}_{l-1}]
\displaystyle\leq C1l=1mk/4dτl(1δ)+log2k/l2Cdlog3k.\displaystyle C_{1}\sum_{l=1}^{m_{k}/4}d\tau_{l}^{(1-\delta)_{+}}\log^{2}k/l^{2}\leq Cd\log^{3}k.

Then we can apply Lemma 2 and yield

(|l=1mk/4(𝝃l𝔼[𝝃l|𝒢l1])|2C(d3/2log2k+dlog2k))=O(kνd).\displaystyle{\mathbb{P}}\Big{(}\Big{|}\sum_{l=1}^{m_{k}/4}(\boldsymbol{\xi}_{l}-{\mathbb{E}}[\boldsymbol{\xi}_{l}|{\mathcal{G}}_{l-1}])\Big{|}_{2}\geq C(d^{3/2}\log^{2}k+d\log^{2}k)\Big{)}=O(k^{-\nu d}).

Combining it with (60) and (61), we have that

(|1k+1l=1mk/2𝝃l|2Cd3/2log2kk,i=n0kEi)=O((k+1)ν){\mathbb{P}}\Big{(}\Big{|}\frac{1}{k+1}\sum_{l=1}^{m_{k}/2}\boldsymbol{\xi}_{l}\Big{|}_{2}\geq C\frac{d^{3/2}\log^{2}k}{k},\cap_{i=n_{0}}^{k}E_{i}\Big{)}=O((k+1)^{-\nu})

A similar result holds for the average of 𝝃l=iH4l+q(𝒀i𝔼[𝒀i])\boldsymbol{\xi}_{l}=\sum_{i\in H_{4l+q}}(\boldsymbol{Y}_{i}-{\mathbb{E}}[\boldsymbol{Y}_{i}]), where q=0,2,3q=0,2,3. Substitute it into (59), we have that

(|1k+1i=1k+1(𝑨i𝔼[𝑨i])(𝜽^i1𝜽)1k+1i=1k+1(𝑨i𝔼[𝑨i])𝜽~i1|2CΨ2dek2,i=n0kEi)\displaystyle{\mathbb{P}}\Big{(}\Big{|}\frac{1}{k+1}\sum_{i=1}^{k+1}(\boldsymbol{A}_{i}-{\mathbb{E}}[\boldsymbol{A}_{i}])(\widehat{\boldsymbol{\theta}}_{i-1}-\boldsymbol{\theta}^{*})-\frac{1}{k+1}\sum_{i=1}^{k+1}(\boldsymbol{A}_{i}-{\mathbb{E}}[\boldsymbol{A}_{i}])\widetilde{\boldsymbol{\theta}}_{i-1}\Big{|}_{2}\geq C\Psi^{2}de_{k}^{2},\cap_{i=n_{0}}^{k}E_{i}\Big{)}
=\displaystyle= O((k+1)ν),\displaystyle O((k+1)^{-\nu}),

which proves the lemma. ∎

7.3 Proof of Results in Section 3.2

Proof of Theorem 4.

From the proof of Theorem 1, as n0n_{0} tends to infinity, we can obtain that

𝜽^n𝜽=𝑯11ni𝒬n𝑿igτi(𝒁i𝜽bi)+O(dτnαn+den12log2n),\widehat{\boldsymbol{\theta}}_{n}-\boldsymbol{\theta}^{*}=\boldsymbol{H}^{-1}\frac{1}{n}\sum_{i\notin{\mathcal{Q}}_{n}}\boldsymbol{X}_{i}g_{\tau_{i}}(\boldsymbol{Z}_{i}^{{\top}}\boldsymbol{\theta}^{*}-b_{i})+O_{{\mathbb{P}}}\Big{(}\sqrt{d}\tau_{n}\alpha_{n}+de_{n-1}^{2}\log^{2}n\Big{)}, (62)

where 𝑯=𝔼[𝑿𝒁]\boldsymbol{H}={\mathbb{E}}[\boldsymbol{X}\boldsymbol{Z}^{{\top}}]. Denote ϵi=𝒁i𝜽bi\epsilon_{i}=\boldsymbol{Z}_{i}^{{\top}}\boldsymbol{\theta}^{*}-b_{i}, we first bound the term

|1ni𝒬n𝑿i(gτi(ϵi)ϵi)|2\displaystyle\Big{|}\frac{1}{n}\sum_{i\notin{\mathcal{Q}}_{n}}\boldsymbol{X}_{i}(g_{\tau_{i}}(\epsilon_{i})-\epsilon_{i})\Big{|}_{2} (63)
\displaystyle\leq |1ni𝒬n𝑿i(gτi(ϵi)ϵi)1ni𝒬n𝔼[𝑿i(gτi(ϵi)ϵi)]|2+1ni𝒬n|𝔼[𝑿i(gτi(ϵi)ϵi)]|2\displaystyle\Big{|}\frac{1}{n}\sum_{i\notin{\mathcal{Q}}_{n}}\boldsymbol{X}_{i}(g_{\tau_{i}}(\epsilon_{i})-\epsilon_{i})-\frac{1}{n}\sum_{i\notin{\mathcal{Q}}_{n}}{\mathbb{E}}[\boldsymbol{X}_{i}(g_{\tau_{i}}(\epsilon_{i})-\epsilon_{i})]\Big{|}_{2}+\frac{1}{n}\sum_{i\notin{\mathcal{Q}}_{n}}\Big{|}{\mathbb{E}}[\boldsymbol{X}_{i}(g_{\tau_{i}}(\epsilon_{i})-\epsilon_{i})]\Big{|}_{2}
=\displaystyle= |1ni𝒬n𝑿i(gτi(ϵi)ϵi)1ni𝒬n𝔼[𝑿i(gτi(ϵi)ϵi)]|2+O(dτn2),\displaystyle\Big{|}\frac{1}{n}\sum_{i\notin{\mathcal{Q}}_{n}}\boldsymbol{X}_{i}(g_{\tau_{i}}(\epsilon_{i})-\epsilon_{i})-\frac{1}{n}\sum_{i\notin{\mathcal{Q}}_{n}}{\mathbb{E}}[\boldsymbol{X}_{i}(g_{\tau_{i}}(\epsilon_{i})-\epsilon_{i})]\Big{|}_{2}+O(\sqrt{d}\tau_{n}^{-2}),

where the last inequality uses the moment assumption on ϵ\epsilon and similar argument as in (43). To bound the first term, we directly compute its variance and use Chebyshev’s inequality. More precisely, by equation (20.23) of Billingsley (1968), for arbitrary i,ji,j and each coordinate ll, there is

|𝔼[{Xi,l(gτi(ϵi)ϵi)𝔼[Xi,l(gτi(ϵi)ϵi)]}{Xj,l(gτj(ϵj)ϵj)𝔼[Xj,l(gτj(ϵj)ϵj)]}]|\displaystyle\Big{|}{\mathbb{E}}\Big{[}\big{\{}X_{i,l}(g_{\tau_{i}}(\epsilon_{i})-\epsilon_{i})-{\mathbb{E}}[X_{i,l}(g_{\tau_{i}}(\epsilon_{i})-\epsilon_{i})]\big{\}}\big{\{}X_{j,l}(g_{\tau_{j}}(\epsilon_{j})-\epsilon_{j})-{\mathbb{E}}[X_{j,l}(g_{\tau_{j}}(\epsilon_{j})-\epsilon_{j})]\big{\}}\Big{]}\Big{|}
\displaystyle\leq 2ρ|ij|/2𝔼[Xi,l2(gτi(ϵi)ϵi)2]𝔼[Xj,l2(gτj(ϵj)ϵj)2]\displaystyle 2\rho^{|i-j|/2}\sqrt{{\mathbb{E}}[X_{i,l}^{2}(g_{\tau_{i}}(\epsilon_{i})-\epsilon_{i})^{2}]{\mathbb{E}}[X_{j,l}^{2}(g_{\tau_{j}}(\epsilon_{j})-\epsilon_{j})^{2}]}
\displaystyle\leq 2C0ρ|ij|/21τi2τj2𝔼[|ϵi|6]𝔼[|ϵj|6]C1ρ|ij|/2τi2τj2.\displaystyle 2C_{0}\rho^{|i-j|/2}\frac{1}{\tau^{2}_{i}\tau^{2}_{j}}\sqrt{{\mathbb{E}}[|\epsilon_{i}|^{6}]{\mathbb{E}}[|\epsilon_{j}|^{6}]}\leq C_{1}\frac{\rho^{|i-j|/2}}{\tau^{2}_{i}\tau^{2}_{j}}.

Therefore we have that

Var[1ni𝒬n𝑿i(gτi(ϵi)ϵi)]C1n2i,j=1nρ|ij|/2τi2τj2\displaystyle{\rm Var}\Big{[}\frac{1}{n}\sum_{i\notin{\mathcal{Q}}_{n}}\boldsymbol{X}_{i}(g_{\tau_{i}}(\epsilon_{i})-\epsilon_{i})\Big{]}\leq\frac{C_{1}}{n^{2}}\sum_{i,j=1}^{n}\frac{\rho^{|i-j|/2}}{\tau_{i}^{2}\tau_{j}^{2}}
\displaystyle\leq C1n2i=1n2τi2(j=0nρj/2)C21nτn2,\displaystyle\frac{C_{1}}{n^{2}}\sum_{i=1}^{n}\frac{2}{\tau_{i}^{2}}\Big{(}\sum_{j=0}^{n}\rho^{j/2}\Big{)}\leq C_{2}\frac{1}{n\tau_{n}^{2}},

for some constant C2>0C_{2}>0. Therefore, by Chebyshev’s inequality, we have that

|1ni𝒬n𝑿i(gτi(ϵi)ϵi)1ni𝒬n𝔼[𝑿i(gτi(ϵi)ϵi)]|2=O(d(nτn2)2/5).\Big{|}\frac{1}{n}\sum_{i\notin{\mathcal{Q}}_{n}}\boldsymbol{X}_{i}(g_{\tau_{i}}(\epsilon_{i})-\epsilon_{i})-\frac{1}{n}\sum_{i\notin{\mathcal{Q}}_{n}}{\mathbb{E}}[\boldsymbol{X}_{i}(g_{\tau_{i}}(\epsilon_{i})-\epsilon_{i})]\Big{|}_{2}=O_{{\mathbb{P}}}\Big{(}\frac{\sqrt{d}}{(n\tau_{n}^{2})^{2/5}}\Big{)}. (64)

Combining (62), (63) and (64), given a unit vector 𝒗𝕊d1\boldsymbol{v}\in{\mathbb{S}}^{d-1}, we have

𝒗(𝜽^n𝜽)=\displaystyle\boldsymbol{v}^{{\top}}(\widehat{\boldsymbol{\theta}}_{n}-\boldsymbol{\theta}^{*})= 𝒗𝑯11ni𝒬n𝑿iϵi+O(dτnαn+den12log2n+dτn2+d(nτn2)2/5),\displaystyle\boldsymbol{v}^{{\top}}\boldsymbol{H}^{-1}\frac{1}{n}\sum_{i\notin{\mathcal{Q}}_{n}}\boldsymbol{X}_{i}\epsilon_{i}+O_{{\mathbb{P}}}\Big{(}\sqrt{d}\tau_{n}\alpha_{n}+de_{n-1}^{2}\log^{2}n+\sqrt{d}\tau_{n}^{-2}+\frac{\sqrt{d}}{(n\tau_{n}^{2})^{2/5}}\Big{)},
=\displaystyle= 1ni𝒬n𝒗𝑯1𝑿iϵi+o(1n),\displaystyle\frac{1}{n}\sum_{i\notin{\mathcal{Q}}_{n}}\boldsymbol{v}^{{\top}}\boldsymbol{H}^{-1}\boldsymbol{X}_{i}\epsilon_{i}+o_{{\mathbb{P}}}\Big{(}\frac{1}{\sqrt{n}}\Big{)},

where the remainder becomes o(1/n)o_{{\mathbb{P}}}(1/\sqrt{n}) under some rate constraints. Here the main term is the average of strict stationary and strong mixing sequence 𝒗𝑯1𝑿iϵi\boldsymbol{v}^{{\top}}\boldsymbol{H}^{-1}\boldsymbol{X}_{i}\epsilon_{i}. By Lemma 10, the conditions in Theorem 1 of Doukhan et al. (1994) is fulfilled. Then we can apply Theorem 1 of Doukhan et al. (1994) and yield

nσ𝒗(𝜽^n𝜽)𝒩(0,1),\frac{\sqrt{n}}{\sigma_{\boldsymbol{v}}}(\widehat{\boldsymbol{\theta}}_{n}-\boldsymbol{\theta}^{*})\rightarrow{\mathcal{N}}(0,1),

where

σ𝒗2=\displaystyle\sigma^{2}_{\boldsymbol{v}}= i=Cov(𝒗𝑯1𝑿0ϵ0,𝒗𝑯1𝑿iϵi)\displaystyle\sum_{i=-\infty}^{\infty}{\rm Cov}\big{(}\boldsymbol{v}^{{\top}}\boldsymbol{H}^{-1}\boldsymbol{X}_{0}\epsilon_{0},\boldsymbol{v}^{{\top}}\boldsymbol{H}^{-1}\boldsymbol{X}_{i}^{{\top}}\epsilon_{i}\big{)}
=\displaystyle= 𝒗𝑯1𝚺(𝑯)1𝒗,\displaystyle\boldsymbol{v}^{{\top}}\boldsymbol{H}^{-1}\boldsymbol{\Sigma}(\boldsymbol{H}^{{\top}})^{-1}\boldsymbol{v},

and 𝚺\boldsymbol{\Sigma} is defined in (17). Therefore the theorem is proved. ∎

Lemma 10.

Let the random variable XX satisfies 𝔼[|X|1+δ]{\mathbb{E}}[|X|^{1+\delta}] for some δ>1\delta>1, and ϕ(k)=O(ρk)\phi(k)=O(\rho^{k}) (where ρ<1\rho<1). Then we have that

01ϕ1(u)QX2(u)du<,\int_{0}^{1}\phi^{-1}(u)Q_{X}^{2}(u)\mathrm{d}u<\infty,

where ϕ1(u)\phi^{-1}(u) is the inverse function of ϕ(k)\phi(k), i.e. , ϕ1(u)=logu/logρ\phi^{-1}(u)=\log u/\log\rho, and QX(u)=inf{t:(|X|t)u}Q_{X}(u)=\inf\{t:{\mathbb{P}}(|X|\geq t)\leq u\}.

Proof.

By Markov’s inequality, we have that

(|X|QX(u))u𝔼[|X|1+δ](QX(u))1+δ,\displaystyle{\mathbb{P}}(|X|\geq Q_{X}(u))\leq u\leq\frac{{\mathbb{E}}[|X|^{1+\delta}]}{(Q_{X}(u))^{1+\delta}},
\displaystyle\Rightarrow QX(u)(𝔼[|X|1+δ]u)1/(1+δ).\displaystyle Q_{X}(u)\leq\Big{(}\frac{{\mathbb{E}}[|X|^{1+\delta}]}{u}\Big{)}^{1/(1+\delta)}.

Therefore, we have that

01ϕ1(u)QX2(u)du01logulogρ(𝔼[|X|1+δ]u)2/(1+δ)du<,\displaystyle\int_{0}^{1}\phi^{-1}(u)Q_{X}^{2}(u)\mathrm{d}u\leq\int_{0}^{1}\frac{\log u}{\log\rho}\Big{(}\frac{{\mathbb{E}}[|X|^{1+\delta}]}{u}\Big{)}^{2/(1+\delta)}\mathrm{d}u<\infty,

as long as δ>1\delta>1, which proves the lemma. ∎

Proposition 8.

Suppose the conditions in Theorem 1 hold. When δ>4\delta>4 and i/log3i=O(τi)\sqrt{i/\log^{3}i}=O(\tau_{i}), for every ν>0\nu>0, there exists constants C,c>0C,c>0 such that

(i=n0n{|𝜽^i𝜽|2Cdei})1cn0min{ν,(δ4)/3},{\mathbb{P}}\Big{(}\cap_{i=n_{0}}^{n}\big{\{}|\widehat{\boldsymbol{\theta}}_{i}-\boldsymbol{\theta}^{*}|_{2}\geq C\sqrt{d}e_{i}\big{\}}\Big{)}\geq 1-cn_{0}^{-\min\{\nu,(\delta-4)/3\}},

where

en=αnτn+lognn+1d(c0)2nn0.e_{n}=\alpha_{n}\tau_{n}+\sqrt{\frac{\log n}{n}}+\frac{1}{\sqrt{d}}(c_{0})^{2^{n-n_{0}}}. (65)

Here δ\delta is defined in the moment condition in (C4).

Proof.

When δ>4\delta>4 and k=O(τk)\sqrt{k}=O(\tau_{k}), we denote

ek=αkτk+logkk+1d(c0)2kn0,e_{k}=\alpha_{k}\tau_{k}+\sqrt{\frac{\log k}{k}}+\frac{1}{\sqrt{d}}(c_{0})^{2^{k-n_{0}}},

for kn0k\geq n_{0}. We continue from equation (33) in the proof of Theorem 1.

By ii) of Lemma 7, there is

(|𝑯^k+111k+1i=1k+1{𝑿ig(𝒁i𝜽bi)}|2C1dek+1,i=n0kEi)C3(k+1)(δ1)/3.{\mathbb{P}}\Big{(}\Big{|}\widehat{\boldsymbol{H}}_{k+1}^{-1}\frac{1}{k+1}\sum_{i=1}^{k+1}\big{\{}\boldsymbol{X}_{i}g(\boldsymbol{Z}_{i}^{{\top}}\boldsymbol{\theta}^{*}-b_{i})\big{\}}\Big{|}_{2}\geq C_{1}\sqrt{d}e_{k+1},\cap_{i=n_{0}}^{k}E_{i}\Big{)}\leq C_{3}(k+1)^{-(\delta-1)/3}.

Then we have that

(|𝜽^k+1𝜽|2Ψdek+1,i=n0kEi)3C3(k+1)min{ν,(δ1)/3},\displaystyle{\mathbb{P}}\Big{(}|\widehat{\boldsymbol{\theta}}_{k+1}-\boldsymbol{\theta}^{*}|_{2}\geq\Psi\sqrt{d}e_{k+1},\cap_{i=n_{0}}^{k}E_{i}\Big{)}\leq 3C_{3}(k+1)^{-\min\{\nu,(\delta-1)/3\}},

which yields

(i=n0kEi)\displaystyle{\mathbb{P}}\Big{(}\cap_{i=n_{0}}^{k}E_{i}\Big{)}\geq 1i=n0+1k3C3imin{ν,(δ1)/3}\displaystyle 1-\sum_{i=n_{0}+1}^{k}3C_{3}i^{-\min\{\nu,(\delta-1)/3\}}
\displaystyle\geq 13C3max{1ν1,2δ4}n0min{ν1,(δ4)/3}.\displaystyle 1-3C_{3}\max\Big{\{}\frac{1}{\nu-1},\frac{2}{\delta-4}\Big{\}}n_{0}^{-\min\{\nu-1,(\delta-4)/3\}}.

Therefore, the theorem is proved. ∎

Proof of Proposition 5.

By Proposition 8 and Theorem 4, we know that 𝜽^n\widehat{\boldsymbol{\theta}}_{n} admits the Bahadur representation

𝒗(𝜽^n𝜽)=𝒗𝑯11ni𝒬n𝑿i(𝒁i𝜽bi)+O(den12log2n+dτn2+d(nτn2)2/5),\boldsymbol{v}^{{\top}}(\widehat{\boldsymbol{\theta}}_{n}-\boldsymbol{\theta}^{*})=\boldsymbol{v}^{{\top}}\boldsymbol{H}^{-1}\frac{1}{n}\sum_{i\notin{\mathcal{Q}}_{n}}\boldsymbol{X}_{i}(\boldsymbol{Z}^{{\top}}_{i}\boldsymbol{\theta}^{*}-b_{i})+O_{{\mathbb{P}}}\Big{(}de_{n-1}^{2}\log^{2}n+\sqrt{d}\tau_{n}^{-2}+\frac{\sqrt{d}}{(n\tau_{n}^{2})^{2/5}}\Big{)},

where αn=0\alpha_{n}=0 and en1e_{n-1} is given in (65). When τi=Ciβ\tau_{i}=Ci^{\beta} for β3/4\beta\geq 3/4 we have that the remainder term has an order O(dlogn/n)O_{{\mathbb{P}}}(d\log n/n) when n0n_{0}\rightarrow\infty, which proves the proposition. ∎

7.4 Proof of Results in Section 4

Proof of Theorem 6.

For simplicity we denote 𝚪k=Cov(𝑿0ϵ0,𝑿kϵk)\boldsymbol{\Gamma}_{k}={\rm Cov}(\boldsymbol{X}_{0}\epsilon_{0},\boldsymbol{X}_{-k}\epsilon_{-k}), 𝒀i,k=𝑿igτi(ϵi)𝑿ikgτik(ϵik)\boldsymbol{Y}_{i,k}=\boldsymbol{X}_{i}g_{\tau_{i}}(\epsilon_{i})\boldsymbol{X}^{{\top}}_{i-k}g_{\tau_{i-k}}(\epsilon_{i-k}). Then by (17) we know

𝚺=𝚪0+k=1(𝚪k+𝚪k).\boldsymbol{\Sigma}=\boldsymbol{\Gamma}_{0}+\sum_{k=1}^{\infty}(\boldsymbol{\Gamma}_{k}+\boldsymbol{\Gamma}^{{\top}}_{k}).

Under event (30), for k=0,,λlognk=0,\dots,\lceil\lambda\log n\rceil, we have that

1ni=ek/λn𝒀i,k1ni=ek/λn𝑿ikgτik(𝒁ik𝜽^ik1bik)𝑿igτi(𝒁i𝜽^i1bi)\displaystyle\Big{\|}\frac{1}{n}\sum_{i=\lceil e^{k/\lambda}\rceil}^{n}\boldsymbol{Y}_{i,k}-\frac{1}{n}\sum_{i=\lceil e^{k/\lambda}\rceil}^{n}\boldsymbol{X}_{i-k}g_{\tau_{i-k}}(\boldsymbol{Z}^{\top}_{i-k}\widehat{\boldsymbol{\theta}}_{i-k-1}-b_{i-k})\boldsymbol{X}^{\top}_{i}g_{\tau_{i}}(\boldsymbol{Z}^{\top}_{i}\widehat{\boldsymbol{\theta}}_{i-1}-b_{i})\Big{\|}
=\displaystyle= O(1ni=ek/λnτi|𝜽^i𝜽|2)=O(dτnen).\displaystyle O\Big{(}\frac{1}{n}\sum_{i=\lceil e^{k/\lambda}\rceil}^{n}\tau_{i}|\widehat{\boldsymbol{\theta}}_{i}-\boldsymbol{\theta}^{*}|_{2}\Big{)}=O_{{\mathbb{P}}}\Big{(}\sqrt{d}\tau_{n}e_{n}\Big{)}.

Therefore we have that

𝚺^n(1ni=1n𝒀i,0+1ni=1nk=1λlogi(𝒀i,k+𝒀i,k))=O(dτnen).\Big{\|}\widehat{\boldsymbol{\Sigma}}_{n}-\Big{(}\frac{1}{n}\sum_{i=1}^{n}\boldsymbol{Y}_{i,0}+\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{\lceil\lambda\log i\rceil}(\boldsymbol{Y}_{i,k}+\boldsymbol{Y}_{i,k}^{{\top}})\Big{)}\Big{\|}=O_{{\mathbb{P}}}\Big{(}\sqrt{d}\tau_{n}e_{n}\Big{)}. (66)

We first prove

𝚺(𝚪0+k=1λlogn/2(𝚪k+𝚪k))=O(nν),\big{\|}\boldsymbol{\Sigma}-\big{(}\boldsymbol{\Gamma}_{0}+\sum_{k=1}^{\lceil\lambda\log n/2\rceil}(\boldsymbol{\Gamma}_{k}+\boldsymbol{\Gamma}_{k}^{{\top}})\big{)}\big{\|}=O(n^{-\nu}), (67)

for some ν>0\nu>0. By equation (20.23) of Billingsley (1968), for each kk, there is

𝚪k=\displaystyle\|\boldsymbol{\Gamma}_{k}\|= sup𝒖,𝒗𝕊d1𝔼[|𝒖𝑿0ϵ0𝒗𝑿kϵk|]\displaystyle\sup_{\boldsymbol{u},\boldsymbol{v}\in{\mathbb{S}}^{d-1}}{\mathbb{E}}\Big{[}|\boldsymbol{u}^{{\top}}\boldsymbol{X}_{0}\epsilon_{0}\boldsymbol{v}^{{\top}}\boldsymbol{X}_{-k}\epsilon_{-k}|\Big{]}
\displaystyle\leq 2ϕ(|k|)sup𝒖,𝒗𝕊d1𝔼[|𝒖𝑿0ϵ0|2]𝔼[|𝒗𝑿kϵk|2]=O(ρ|k|/2).\displaystyle 2\sqrt{\phi(|k|)}\sup_{\boldsymbol{u},\boldsymbol{v}\in{\mathbb{S}}^{d-1}}\sqrt{{\mathbb{E}}[|\boldsymbol{u}^{{\top}}\boldsymbol{X}_{0}\epsilon_{0}|^{2}]{\mathbb{E}}[|\boldsymbol{v}^{{\top}}\boldsymbol{X}_{-k}\epsilon_{-k}|^{2}]}=O(\rho^{|k|/2}).

Therefore we can obtain that

𝚺(𝚪0+k=1λlogn/2(𝚪k+𝚪k))\displaystyle\big{\|}\boldsymbol{\Sigma}-\big{(}\boldsymbol{\Gamma}_{0}+\sum_{k=1}^{\lceil\lambda\log n/2\rceil}(\boldsymbol{\Gamma}_{k}+\boldsymbol{\Gamma}_{k}^{{\top}})\big{)}\big{\|}
\displaystyle\leq 2k=λlogn/2+1𝚪k=O(k=λlogn/2+1ρ|k|/2)=O(ρλlogn/4)=O(nν),\displaystyle 2\sum_{k=\lceil\lambda\log n/2\rceil+1}^{\infty}\|\boldsymbol{\Gamma}_{k}\|=O(\sum_{k=\lceil\lambda\log n/2\rceil+1}^{\infty}\rho^{|k|/2})=O(\rho^{\lceil\lambda\log n\rceil/4})=O(n^{-\nu}),

for λ4ν/|logρ|\lambda\geq 4\nu/|\log\rho|, which proves (67). Next we prove that for k=0,,λlognk=0,\dots,\lceil\lambda\log n\rceil, there holds

1ni=ek/λn𝒀i,k1ni,ik𝒬n𝔼[𝒀i,k]\displaystyle\Big{\|}\frac{1}{n}\sum_{i=\lceil e^{k/\lambda}\rceil}^{n}\boldsymbol{Y}_{i,k}-\frac{1}{n}\sum_{i,i-k\notin{\mathcal{Q}}_{n}}{\mathbb{E}}[\boldsymbol{Y}_{i,k}]\Big{\|} (68)
=\displaystyle= 1ni,ik𝒬n𝒀i,k1ni,ik𝒬n𝔼[𝒀i,k]+O(τn2αn)\displaystyle\Big{\|}\frac{1}{n}\sum_{i,i-k\notin{\mathcal{Q}}_{n}}\boldsymbol{Y}_{i,k}-\frac{1}{n}\sum_{i,i-k\notin{\mathcal{Q}}_{n}}{\mathbb{E}}[\boldsymbol{Y}_{i,k}]\Big{\|}+O_{{\mathbb{P}}}\big{(}\tau_{n}^{2}\alpha_{n}\big{)}
=\displaystyle= O(τn2αn+dlog2nn+dτn2log2nn).\displaystyle O_{{\mathbb{P}}}\Big{(}\tau_{n}^{2}\alpha_{n}+\sqrt{\frac{d\log^{2}n}{n}}+\frac{d\tau_{n}^{2}\log^{2}n}{n}\Big{)}.

By the proof of Lemma 6, we know that

1ni,ik𝒬n𝒀i,k1ni,ik𝒬n𝔼[𝒀i,k]2sup𝒖,𝒗𝔑|1ni,ik𝒬n𝒖(𝒀i,k𝔼[𝒀i,k])𝒗|,\Big{\|}\frac{1}{n}\sum_{i,i-k\notin{\mathcal{Q}}_{n}}\boldsymbol{Y}_{i,k}-\frac{1}{n}\sum_{i,i-k\notin{\mathcal{Q}}_{n}}{\mathbb{E}}[\boldsymbol{Y}_{i,k}]\Big{\|}\leq 2\sup_{\boldsymbol{u},\boldsymbol{v}\in{\mathfrak{N}}}\Big{|}\frac{1}{n}\sum_{i,i-k\notin{\mathcal{Q}}_{n}}\boldsymbol{u}^{{\top}}(\boldsymbol{Y}_{i,k}-{\mathbb{E}}[\boldsymbol{Y}_{i,k}])\boldsymbol{v}\Big{|},

where 𝔑{\mathfrak{N}} is a 1/41/4-net of 𝕊d1{\mathbb{S}}^{d-1}. For k=0,,λlogn1k=0,\dots,\lceil\lambda\log n\rceil-1, Denote ~k,ab=σ(𝒀i,k,aib)\widetilde{{\mathcal{F}}}_{k,a}^{b}=\sigma(\boldsymbol{Y}_{i,k},a\leq i\leq b), then by (13) we know that

|(B|A)(B)|ϕ((jk)+),|{\mathbb{P}}(B|A)-{\mathbb{P}}(B)|\leq\phi((j-k)_{+}),

for all A~k,1n,B~k,n+jA\in\widetilde{{\mathcal{F}}}_{k,1}^{n},B\in\widetilde{{\mathcal{F}}}_{k,n+j}^{\infty} for all n,j0n,j\geq 0. We basically rehash the proof in Lemma 1 and obtain that

sup𝒖,𝒗𝔑(|1ni,ik𝒬n𝒖(𝒀i,k𝔼[𝒀i,k])𝒗|C(dlog2nn+dτn2log2nn))=O(n(γ+2log9)d).\sup_{\boldsymbol{u},\boldsymbol{v}\in{\mathfrak{N}}}{\mathbb{P}}\Bigg{(}\Big{|}\frac{1}{n}\sum_{i,i-k\notin{\mathcal{Q}}_{n}}\boldsymbol{u}^{{\top}}(\boldsymbol{Y}_{i,k}-{\mathbb{E}}[\boldsymbol{Y}_{i,k}])\boldsymbol{v}\Big{|}\geq C\Big{(}\sqrt{\frac{d\log^{2}n}{n}}+\frac{d\tau_{n}^{2}\log^{2}n}{n}\Big{)}\Bigg{)}=O(n^{-(\gamma+2\log 9)d}).

Here the only difference is that Var(ηl)=O(1){\rm Var}(\eta_{l})=O(1) in (22), and 𝒖𝒀i,k𝒗\boldsymbol{u}^{{\top}}\boldsymbol{Y}_{i,k}\boldsymbol{v} is bounded by τn2\tau_{n}^{2}. Therefore (68) can be proved.

Last, we prove that

1ni,ik𝒬n𝔼[𝒀i,k]𝚪k=O(αn+τn2)\Big{\|}\frac{1}{n}\sum_{i,i-k\notin{\mathcal{Q}}_{n}}{\mathbb{E}}[\boldsymbol{Y}_{i,k}]-\boldsymbol{\Gamma}_{k}\Big{\|}=O\Big{(}\alpha_{n}+\tau_{n}^{-2}\Big{)} (69)

To see this, we compute that

𝔼[𝒀i,k]𝚪k\displaystyle\Big{\|}{\mathbb{E}}[\boldsymbol{Y}_{i,k}]-\boldsymbol{\Gamma}_{k}\Big{\|}
\displaystyle\leq 𝔼[𝑿igτi(ϵi)𝑿ikgτik(ϵik)]𝔼[𝑿0ϵ0𝑿kgτik(ϵk)]\displaystyle\Big{\|}{\mathbb{E}}[\boldsymbol{X}_{i}g_{\tau_{i}}(\epsilon_{i})\boldsymbol{X}^{{\top}}_{i-k}g_{\tau_{i-k}}(\epsilon_{i-k})]-{\mathbb{E}}[\boldsymbol{X}_{0}\epsilon_{0}\boldsymbol{X}_{-k}g_{\tau_{i-k}}(\epsilon_{-k})]\Big{\|}
+𝔼[𝑿iϵi𝑿ikgτik(ϵik)]𝔼[𝑿0ϵ0𝑿kϵk]\displaystyle+\Big{\|}{\mathbb{E}}[\boldsymbol{X}_{i}\epsilon_{i}\boldsymbol{X}^{{\top}}_{i-k}g_{\tau_{i-k}}(\epsilon_{i-k})]-{\mathbb{E}}[\boldsymbol{X}_{0}\epsilon_{0}\boldsymbol{X}_{-k}\epsilon_{-k}]\Big{\|}

For the second term,

𝔼[𝑿iϵi𝑿ikgτik(ϵik)]𝔼[𝑿0ϵ0𝑿kϵk]\displaystyle\Big{\|}{\mathbb{E}}[\boldsymbol{X}_{i}\epsilon_{i}\boldsymbol{X}^{{\top}}_{i-k}g_{\tau_{i-k}}(\epsilon_{i-k})]-{\mathbb{E}}[\boldsymbol{X}_{0}\epsilon_{0}\boldsymbol{X}_{-k}\epsilon_{-k}]\Big{\|}
=\displaystyle= sup𝒖,𝒗𝕊d1|𝔼[𝒖𝑿iϵi𝑿ik𝒗{gτik(ϵik)ϵik}]|\displaystyle\sup_{\boldsymbol{u},\boldsymbol{v}\in{\mathbb{S}}^{d-1}}\Big{|}{\mathbb{E}}[\boldsymbol{u}^{{\top}}\boldsymbol{X}_{i}\epsilon_{i}\boldsymbol{X}^{{\top}}_{i-k}\boldsymbol{v}\big{\{}g_{\tau_{i-k}}(\epsilon_{i-k})-\epsilon_{i-k}\big{\}}]\Big{|}
\displaystyle\leq sup𝒖,𝒗𝕊d1𝔼[|𝒖𝑿iϵi|2]𝔼[|𝑿ik𝒗{gτik(ϵik)ϵik}|2]C1τik2,\displaystyle\sup_{\boldsymbol{u},\boldsymbol{v}\in{\mathbb{S}}^{d-1}}\sqrt{{\mathbb{E}}[|\boldsymbol{u}^{{\top}}\boldsymbol{X}_{i}\epsilon_{i}|^{2}]{\mathbb{E}}[|\boldsymbol{X}^{{\top}}_{i-k}\boldsymbol{v}\big{\{}g_{\tau_{i-k}}(\epsilon_{i-k})-\epsilon_{i-k}\big{\}}|^{2}]}\leq C_{1}\tau_{i-k}^{-2},

for some constants C1>0C_{1}>0. Therefore (69) is proved.

Combining (66), (67), (68) and (69) we have that

𝚺^n𝚺\displaystyle\|\widehat{\boldsymbol{\Sigma}}_{n}-\boldsymbol{\Sigma}\|
\displaystyle\leq 𝚺^n(1ni=1n𝒀i,0+1ni=1nk=1λlogi(𝒀i,k+𝒀i,k))+𝚺(𝚪0+k=1λlogn/2(𝚪k+𝚪k))\displaystyle\Big{\|}\widehat{\boldsymbol{\Sigma}}_{n}-\Big{(}\frac{1}{n}\sum_{i=1}^{n}\boldsymbol{Y}_{i,0}+\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{\lceil\lambda\log i\rceil}(\boldsymbol{Y}_{i,k}+\boldsymbol{Y}_{i,k}^{{\top}})\Big{)}\Big{\|}+\big{\|}\boldsymbol{\Sigma}-\big{(}\boldsymbol{\Gamma}_{0}+\sum_{k=1}^{\lceil\lambda\log n/2\rceil}(\boldsymbol{\Gamma}_{k}+\boldsymbol{\Gamma}_{k}^{{\top}})\big{)}\big{\|}
+2k=0λlogn/21ni=ek/λn𝒀i,k1ni,ik𝒬n𝔼[𝒀i,k]+2k=0λlogn/21ni,ik𝒬n𝔼[𝒀i,k]𝚪k\displaystyle+2\sum_{k=0}^{\lceil\lambda\log n/2\rceil}\Big{\|}\frac{1}{n}\sum_{i=\lceil e^{k/\lambda}\rceil}^{n}\boldsymbol{Y}_{i,k}-\frac{1}{n}\sum_{i,i-k\notin{\mathcal{Q}}_{n}}{\mathbb{E}}[\boldsymbol{Y}_{i,k}]\Big{\|}+2\sum_{k=0}^{\lceil\lambda\log n/2\rceil}\Big{\|}\frac{1}{n}\sum_{i,i-k\notin{\mathcal{Q}}_{n}}{\mathbb{E}}[\boldsymbol{Y}_{i,k}]-\boldsymbol{\Gamma}_{k}\Big{\|}
+2k=λlogn/2+1λlogn1ni=ek/λn𝒀i,k1ni,ik𝒬n𝔼[𝒀i,k]+2k=λlogn/2+1λlogn1ni,ik𝒬n𝔼[𝒀i,k]𝚪k\displaystyle+2\sum_{k=\lceil\lambda\log n/2\rceil+1}^{\lceil\lambda\log n\rceil}\Big{\|}\frac{1}{n}\sum_{i=\lceil e^{k/\lambda}\rceil}^{n}\boldsymbol{Y}_{i,k}-\frac{1}{n}\sum_{i,i-k\notin{\mathcal{Q}}_{n}}{\mathbb{E}}[\boldsymbol{Y}_{i,k}]\Big{\|}+2\sum_{k=\lceil\lambda\log n/2\rceil+1}^{\lceil\lambda\log n\rceil}\Big{\|}\frac{1}{n}\sum_{i,i-k\notin{\mathcal{Q}}_{n}}{\mathbb{E}}[\boldsymbol{Y}_{i,k}]-\boldsymbol{\Gamma}_{k}\Big{\|}
+2k=λlogn/2+1λlogn𝚪k\displaystyle+2\sum_{k=\lceil\lambda\log n/2\rceil+1}^{\lceil\lambda\log n\rceil}\Big{\|}\boldsymbol{\Gamma}_{k}\Big{\|}
=\displaystyle= O(dτnen+nν+τn2αn+dlog2nn+dτn2log2nn+αn+τn2)\displaystyle O_{{\mathbb{P}}}\Big{(}\sqrt{d}\tau_{n}e_{n}+n^{-\nu}+\tau_{n}^{2}\alpha_{n}+\sqrt{\frac{d\log^{2}n}{n}}+\frac{d\tau_{n}^{2}\log^{2}n}{n}+\alpha_{n}+\tau_{n}^{-2}\Big{)}
=\displaystyle= O(dτn2αn+dτn1+τndlognn+dτn2log2nn),\displaystyle O_{{\mathbb{P}}}\Big{(}\sqrt{d}\tau_{n}^{2}\alpha_{n}+\sqrt{d}\tau_{n}^{-1}+\tau_{n}\sqrt{\frac{d\log n}{n}}+\frac{d\tau_{n}^{2}\log^{2}n}{n}\Big{)},

which proves the theorem. ∎