
Scalable Statistical Inference in Non-parametric Least Squares

Meimei Liu, Zuofeng Shang, and Yun Yang
Department of Statistics, Virginia Tech, Blacksburg, VA. Email: [email protected]
Department of Mathematical Sciences, NJIT, Newark, NJ. Email: [email protected]
Department of Statistics, UIUC, Champaign, IL. Email: [email protected]
Abstract

Stochastic approximation (SA) is a powerful and scalable computational method for iteratively estimating the solution of optimization problems in the presence of randomness, particularly well-suited for large-scale and streaming data settings. In this work, we propose a theoretical framework for stochastic approximation (SA) applied to non-parametric least squares in reproducing kernel Hilbert spaces (RKHS), enabling online statistical inference in non-parametric regression models. We achieve this by constructing asymptotically valid pointwise (and simultaneous) confidence intervals (bands) for local (and global) inference of the nonlinear regression function, via employing an online multiplier bootstrap approach to a functional stochastic gradient descent (SGD) algorithm in the RKHS. Our main theoretical contributions consist of a unified framework for characterizing the non-asymptotic behavior of the functional SGD estimator and demonstrating the consistency of the multiplier bootstrap method. The proof techniques involve the development of a higher-order expansion of the functional SGD estimator under the supremum norm metric and the Gaussian approximation of suprema of weighted and non-identically distributed empirical processes. Our theory specifically reveals an interesting relationship between the tuning of step sizes in SGD for estimation and the accuracy of uncertainty quantification.

1 Introduction

Stochastic approximation (SA) [1, 2, 3] is a class of iterative stochastic algorithms to solve the stochastic optimization problem minθΘ{(θ):=𝔼Z[(θ;Z)]}\min_{\theta\in\Theta}\big{\{}\mathcal{L}(\theta):\,=\mathbb{E}_{Z}[\ell(\theta;Z)]\big{\}}, where (θ;z)\ell(\theta;z) is some loss function, ZZ denotes the internal random variable, and Θ\Theta is the domain of the loss function. Statistical inference, such as parameter estimation, can be viewed as a special case of stochastic optimization where the goal is to estimate the minimizer θ=argminθΘ(θ)\theta^{\ast}=\mathop{\mathrm{argmin}}_{\theta\in\Theta}\mathcal{L}(\theta) of the expected loss function (θ)\mathcal{L}(\theta) based on a finite number of i.i.d. observations {Z1,,Zn}\{Z_{1},\ldots,Z_{n}\}. Classical estimation procedures based on minimizing an empirical version n(θ)=n1i=1n(θ;Zi)\mathcal{L}_{n}(\theta)=n^{-1}\sum_{i=1}^{n}\ell(\theta;Z_{i}) of the loss correspond to the sample average approximation (SAA) [4, 5] for solving the stochastic optimization problem. However, directly minimizing LnL_{n} with massive data is computationally wasteful in both time and space, and may pose numerical challenges. For example, in applications involving streaming data where new and dynamic observations are generated on a continuous basis, it may not be necessary or feasible to store all historical data. Instead, stochastic gradient descent (SGD), or Robbins-Monro type SA algorithm [1], is a scalable approximation algorithm for parameter estimation with constant per-iteration time and space complexity. SGD can be viewed as a stochastic version of the gradient descent method that uses a noisy gradient, such as (;Z)\nabla\ell(\cdot\,;Z) based on a single ZZ, to replace the true gradient ()\nabla\mathcal{L}(\cdot). In this work, we explore the use of SA for statistical inference in infinite-dimensional models where Θ\Theta is a functional space or, more precisely, in solving non-parametric least squares in reproducing kernel Hilbert spaces (RKHS).

Consider the standard random-design non-parametric regression model

Yi=f(Xi)+ϵi,ϵiN(0,σ2)fori=1,,n,Y_{i}=f^{\ast}(X_{i})+\epsilon_{i},\quad\epsilon_{i}\sim N(0,\sigma^{2})\quad\;\textrm{for}\;i=1,\cdots,n, (1.1)

with Xi𝒳X_{i}\in\mathcal{X} denoting the ii-th copy of random covariate XX, YiY_{i} the ii-th copy of response YY, and ff^{\ast} the unknown regression function in a reproducing kernel Hilbert space (RKHS, [6, 7]) \mathbb{H} to be estimated. For simplicity, we assume that 𝒳=[0,1]d\mathcal{X}=[0,1]^{d} is the unit cube in d\mathbb{R}^{d}. Since ff^{\ast} minimizes the population-level expected squared error loss objective (f)=𝔼[(f;(X,Y))]\mathcal{L}(f)=\mathbb{E}\big{[}\ell\big{(}f;\,(X,Y)\big{)}\big{]} over all functions f:𝒳f:\,\mathcal{X}\to\mathbb{R}, with (f;(X,Y))=(f(X)Y)2\ell\big{(}f;\,(X,Y)\big{)}=(f(X)-Y)^{2} representing the squared loss function, one can adopt the SAA approach to estimate ff^{\ast} by minimizing a penalized sample-level squared error loss objective. Given a sample {(Xi,Yi)}i=1n\{(X_{i},Y_{i})\}_{i=1}^{n} of size nn, a commonly used SAA approach for estimating ff is kernel ridge regression (KRR). KRR incorporates a penalty term that depends on the norm \|\cdot\|_{\mathbb{H}} associated with the RKHS \mathbb{H}. Although the KRR estimator enjoys many attractive statistical properties [8, 9, 10], its computational complexity of 𝒪(n3)\mathcal{O}(n^{3}) time and 𝒪(n2)\mathcal{O}(n^{2}) space hinders its practicality in large-scale problems [11]. In this work, we instead consider an SA-type approach for directly minimizing the functional \mathcal{L}(f) over the infinite-dimensional RKHS. By operating SGD in this non-parametric setting (see Section 2.2 for details), the resulting algorithm achieves 𝒪(n2)\mathcal{O}(n^{2}) time complexity and 𝒪(n)\mathcal{O}(n) space complexity. In a recent study [12], the authors demonstrate that the online estimator of ff resulting from the SGD achieves optimal rates of convergence for a variety of ff\in\mathbb{H}. It is interesting to note that since the functional gradient is defined with respect to the RKHS norm \|\cdot\|_{\mathbb{H}}, the functional SGD implicitly induces an algorithmic regularization due to the “early-stopping” in the RKHS, which is controlled by the accumulated step sizes. Therefore, with a proper step size decaying scheme, no explicit regularization is needed to achieve optimal convergence rates.

The aim of this research is to take a step further by constructing a new inferential framework for quantifying the estimation uncertainty in the SA procedure. This will be achieved through the construction of pointwise confidence intervals and simultaneous confidence bands for the functional SGD estimator of ff. Previous SGD algorithms and their variants, such as those discussed in [3, 13, 14, 15, 16, 17], are mainly utilized to solve finite-dimensional parametric learning problems with a root-nn convergence rate. In the parametric setting, asymptotic properties of estimators arising in SGD, such as consistency and asymptotic normality, have been well established in literature; for example, see [12, 18, 19, 20]. However, the problem of uncertainty quantification for functional SGD estimators in non-parametric settings is rarely addressed in the literature.

In the parametric setting, several methods have been proposed to conduct uncertainty quantification in SGD. [21, 22] appear to be among the first to formally characterize the magnitudes of random fluctuations in SA; however, their notion of confidence level is based on the large deviation properties of the solution and can be quite conservative. More recently, [19] proposes applying a multiplier bootstrap method for the construction of SGD confidence intervals, whose asymptotic confidence level is shown to exactly match the nominal level. [20] proposes a batch mean method to estimate the asymptotic covariance matrix of the estimator based on a single SGD trajectory. Due to the limited information from a single run of SGD, the best achievable error of their confidence interval (in terms of coverage probability) is of the order 𝒪(n1/8)\mathcal{O}(n^{-1/8}), which is worse than the error of the order 𝒪(n1/3)\mathcal{O}(n^{-1/3}) achieved by the multiplier bootstrap. [23] proposes a different method called Higrad. Higrad constructs a hierarchical tree of a number of SGD estimators and uses their outputs in the leaves to construct a confidence interval.

In this work, we develop a multiplier bootstrap method for uncertainty quantification in SA for solving online non-parametric least squares. Bootstrap methods [24, 25] are widely used in statistics to estimate the sampling distribution of a statistic for uncertainty quantification. Traditional resampling-based bootstrap methods are unsuitable for streaming data inference as the resampling step necessitates storing all historical data, which contradicts the objective of maintaining constant space and time complexity in online learning. Instead, we extend the parametric online multiplier bootstrap method from [19] to the non-parametric setting. We achieve this by employing a perturbed stochastic functional gradient, which is represented as an element in the RKHS evaluated upon the arrival of each new covariate-response pair (Xi,Yi)(X_{i},Y_{i}), to capture the stochastic fluctuation arising from the random streaming data.

To theoretically justify the use of the proposed multiplier bootstrap method, we make two main contributions. First, we build a novel theoretical framework to characterize the non-asymptotic behavior of the infinite-dimensional functional SGD estimator via expanding it into higher-orders under the supremum norm metric. This framework enables us to perform local inference to construct pointwise confidence intervals for ff and global inference to construct a simultaneous confidence band. Second, we demonstrate the consistency of the multiplier bootstrap method by proving that the perturbation injected into the stochastic functional gradient accurately mimics the randomness pattern in the online estimation procedure, so that the conditional law of the bootstrapped functional SGD estimator given the data asymptotically coincides with the sampling law of the functional SGD estimator. Our proof is non-trivial and contains several major improvements that refine the best (to our knowledge) convergence analysis of SGD for non-parametric least squares in [12], and also advances consistency analysis of the multiplier bootstrap in a non-parametric setting. Concretely, in [12], the authors derive the convergence rate of the functional SGD estimator relative to the L2L_{2} norm metric. Their theory only concerns the L2L_{2} convergence rate of the estimation; hence, the proof involves decomposing the SGD recursion into a leading first-order recursion and the remaining higher-order recursions; and bounding their L2L_{2} norms respectively by directly bounding their expectations. In comparison, our analysis for statistical inference in online non-parametric regression requires a functional central limit theorem type result and calls for several substantial refinements in proof techniques.

Our first improvement is to refine the SGD recursion analysis by using a stronger supremum norm metric. This enables us to accurately characterize the stochastic fluctuation of the functional estimator uniformly across all locations. As a result, we can study the coverage probability of simultaneous confidence bands in our subsequent inference tasks. Analyzing the supremum convergence is significantly more intricate than analyzing the L2L_{2} convergence. In the proof, we introduce an augmented RKHS different from \mathbb{H} as a bridge in order to better align its induced norm with the supremum metric; see Remark 3.2 or equation (6.1) in Section 6 for further details. Additionally, we have to employ uniform laws of large numbers and leverage ideas from empirical processes to uniformly control certain stochastic terms that emerge in the expansions. Our second improvement comes from the need to characterize the large-sample distributional limit of the functional SGD estimator. By using the same recursion decomposition, we must now analyze a high-probability supremum norm bound for all orders of the recursions and determine the large-sample distributional limit of the leading term in the expansion. It is worth noting that the second-order recursion is the most complicated and challenging one to analyze. This recursion requires specialized treatment that involves substantially more effort than the remaining higher-order recursions. A loose analysis, achieved by directly converting an L2L_{2} norm bound into the supremum norm bound using the reproducing kernel property of the original RKHS \mathbb{H} — which suffices for bounding the higher-order recursions — might result in a bound whose order is comparable to that of the leading term. This is where we introduce an augmented RKHS and directly analyze the supremum norm using empirical process tools.

Last but not least, in order to analyze the distributional limit of the leading bias and variance terms appearing in the expansion of the functional SGD estimator, we develop new tools by extending the recent technique of Gaussian approximation of suprema of empirical processes [26] from equally weighted sum to a weighted sum. This extension is important and unique for analyzing functional SGD, since earlier-arrived data points will have larger weights in the leading bias and variance terms than later-arrived data points; see Remark 3.3 for more discussions. Towards the analysis of our bootstrap procedure, we further develop Gaussian approximation bounds for multiplier bootstraps for suprema of weighted and non-identically distributed empirical process, which can be used to control the Kolmogorov distance between the sampling distributions of the pointwise evaluation (local inference) of the functional SGD estimator or its supremum norm (global inference), and their bootstrapping counterparts. Our results also elucidate the interplay between early stopping (controlled by the step size) for optimal estimation and the accuracy of uncertainty quantification.

The rest of the article is organized as follows. In Section 2 we introduce the background of RKHS and the functional stochastic gradient descent algorithms in the RKHS; in Section 3, we establish the distributional convergence of SGD for non-parametric least squares; in Section 4, we develop the scalable uncertainty quantification in RKHS via multiplier bootstrapped SGD estimators; Section 5 includes extensive numerical studies to demonstrate the performance of the proposed SGD inference. Section 6 presents a sketched proof highlighting some important technical details and key steps; Section 7 provides an overview and future direction for our work. Section 8 includes some key proofs for the theorems.

Notation: In this paper, we use C,C,C1,C2,C,C^{\prime},C_{1},C_{2},\dots to denote generic positive constants whose values may change from one line to another, but are independent of everything else. We use the notation f\|f\|_{\infty} to denote the supremum norm of a function ff, defined as f=supx𝒳|f(x)|\|f\|_{\infty}=\sup_{x\in\mathcal{X}}|f(x)|, where 𝒳\mathcal{X} is the domain of ff. The notations aba\lesssim b and aba\gtrsim b denote inequalities up to a constant multiple; we write aba\asymp b when both aba\lesssim b and aba\gtrsim b hold. For k>0k>0, let k\lfloor k\rfloor denote the largest integer smaller than or equal to kk. For two operators MM and NN, we write M\preccurlyeq N if NMN-M is positive semi-definite.

2 Background and Problem Formulation

We begin by introducing some background on reproducing kernel Hilbert space (RKHS) and functional stochastic gradient descent algorithms in the RKHS.

2.1 Reproducing kernel Hilbert spaces

To describe the structure of the regression function ff in the non-parametric regression model (1.1), we adopt the standard framework of a reproducing kernel Hilbert space (RKHS, [7, 27, 28]) by assuming f=argminf𝔼[(f(X)Y)2]f^{\ast}=\mathop{\mathrm{argmin}}_{f}\mathbb{E}\big{[}(f(X)-Y)^{2}\big{]} to belong to an RKHS \mathbb{H}. Let \mathbb{P}_{X} denote the marginal distribution of the random design XX, and L^{2}(\mathbb{P}_{X})=\big{\{}f:\,\mathcal{X}\to\mathbb{R}\,\big{|}\,\int_{\mathcal{X}}f^{2}(x)\,\mathbb{P}_{X}(dx)<\infty\big{\}} denote the space of all square-integrable functions over 𝒳\mathcal{X} with respect to X\mathbb{P}_{X}. Briefly speaking, an RKHS is a Hilbert space L2(X)\mathbb{H}\subset L^{2}(\mathbb{P}_{X}) of functions defined over a set 𝒳\mathcal{X}, equipped with inner product ,\langle\cdot,\,\cdot\rangle_{\mathbb{H}}, so that for any x𝒳x\in\mathcal{X}, the evaluation functional at xx defined by Lx(f)=f(x)L_{x}(f)=f(x) is a continuous linear functional on the RKHS. Uniquely associated with \mathbb{H} is a positive-definite function K:𝒳×𝒳K:\mathcal{X}\times\mathcal{X}\to\mathbb{R}, called the reproducing kernel. The key property of the reproducing kernel is that it satisfies the reproducing property: the evaluation functional LxL_{x} can be represented by the reproducing kernel function Kx:=K(x,)K_{x}:\,=K(x,\,\cdot) so that f(x)=Lx(f)=Kx,ff(x)=L_{x}(f)=\langle K_{x},\,f\rangle_{\mathbb{H}}. According to Mercer’s theorem [6], the kernel function KK has the following spectral decomposition:

K(x,x)=j=1μjϕj(x)ϕj(x),x,x𝒳,K(x,x^{\prime})=\sum_{j=1}^{\infty}\mu_{j}\,\phi_{j}(x)\,\phi_{j}(x^{\prime}),\,\,\,\,x,x^{\prime}\in\mathcal{X}, (2.1)

where the convergence is absolute and uniform on 𝒳×𝒳\mathcal{X}\times\mathcal{X}. Here, μ1μ20\mu_{1}\geq\mu_{2}\geq\cdots\geq 0 is the sequence of eigenvalues, and {ϕj}j=1\{\phi_{j}\}_{j=1}^{\infty} are the corresponding eigenfunctions forming an orthonormal basis in L2(X)L^{2}(\mathbb{P}_{X}), with the following property: for any j,kj,k\in\mathbb{N},

\langle\phi_{j},\phi_{k}\rangle_{L^{2}(\mathbb{P}_{X})}=\delta_{jk}\quad\mbox{and}\quad\langle\phi_{j},\phi_{k}\rangle_{\mathbb{H}}=\delta_{jk}/\mu_{j},

where δjk=1\delta_{jk}=1 if j=kj=k and δjk=0\delta_{jk}=0 otherwise. Moreover, any ff\in\mathbb{H} can be decomposed into f=j=1fjϕjf=\sum_{j=1}^{\infty}f_{j}\phi_{j} with fj=f,ϕjL2(X)f_{j}=\langle f,\phi_{j}\rangle_{L_{2}(\mathbb{P}_{X})}, and its RKHS norm can be computed via f2=j=1μj1fj2\|f\|_{\mathbb{H}}^{2}=\sum_{j=1}^{\infty}\mu_{j}^{-1}f_{j}^{2}.

We introduce some technical conditions on the reproducing kernel KK in terms of its spectral decomposition.

Assumption A1.

The eigenfunctions \{\phi_{k}\}_{k=1}^{\infty} of KK are uniformly bounded on 𝒳\mathcal{X}, i.e., there exists a finite constant cϕ>0c_{\phi}>0 such that supk1ϕkcϕ\sup_{k\geq 1}\|\phi_{k}\|_{\infty}\leq c_{\phi}. Moreover, they satisfy the Lipschitz condition |ϕk(s)ϕk(t)|Lk|st||\phi_{k}(s)-\phi_{k}(t)|\leq L\,k\,|s-t| for any s,t[0,1]s,t\in[0,1], where LL is a finite constant.

Assumption A2.

The eigenvalues {μk}k=1\{\mu_{k}\}_{k=1}^{\infty} of KK satisfy μkkα\mu_{k}\asymp k^{-\alpha} for some α>1\alpha>1.

The uniform boundedness condition in Assumption A1 is common in the literature [29]. Assumption A2 assumes the kernel to have polynomially decaying eigenvalues. Assumptions A1-A2 together also imply that the kernel function is bounded as supxK(x,x)cϕ2k=1kα:=R2\sup_{x}K(x,x)\leq c^{2}_{\phi}\sum_{k=1}^{\infty}k^{-\alpha}:=R^{2}. One special class of kernels satisfying Assumptions A1-A2 is composed of translation-invariant kernels K(t,s)=g(ts)K(t,s)=g(t-s) for some even function gg of period one. In fact, by utilizing the Fourier series expansion of the kernel function gg, we observe that the eigenfunctions of the corresponding kernel KK are the trigonometric functions

ϕ2k1(x)=sin(πkx),ϕ2k(x)=cos(πkx),k=1,2,\phi_{2k-1}(x)=\sin(\pi kx),\quad\phi_{2k}(x)=\cos(\pi kx),\quad k=1,2,\dots

on 𝒳=[0,1]\mathcal{X}=[0,1]. It is easy to see that we can choose cϕ=1c_{\phi}=1 and L=πL=\pi to satisfy Assumption A1. Although we primarily consider kernels with eigenvalues that decay polynomially for the sake of clarity in this paper, it is worth mentioning that our theory extends to other kernel classes, such as squared exponential kernels and polynomial kernels [30].
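
To make these assumptions concrete, the following short sketch (our own Python illustration, not part of the paper) builds a truncated translation-invariant kernel whose Fourier coefficients decay like k^{-\alpha} and numerically checks that the eigenvalues of the scaled empirical kernel matrix decay at the same polynomial rate, as required by Assumption A2.

import numpy as np

# Illustration only: a truncated translation-invariant kernel with Fourier
# coefficients decaying like k^{-alpha}; the eigenvalues of n^{-1} * (kernel matrix)
# then approximate the Mercer eigenvalues and should decay polynomially in the index.
alpha, n_terms = 2.0, 200

def kernel(t, s):
    k = np.arange(1, n_terms + 1)
    return np.sum(k ** (-alpha) * np.cos(np.pi * k * (t - s)))

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(size=n)
K_mat = np.array([[kernel(t, s) for s in X] for t in X])
emp_eig = np.sort(np.linalg.eigvalsh(K_mat))[::-1] / n   # approximates mu_1 >= mu_2 >= ...

j = np.arange(1, 21, dtype=float)
print(np.round(emp_eig[:20] * j ** alpha, 3))            # slowly varying if mu_j is of order j^{-alpha}

Because the sine and cosine eigenfunctions come in pairs sharing the same coefficient, consecutive empirical eigenvalues appear in nearly equal pairs, so the printed products vary within a bounded factor rather than being exactly constant.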

2.2 Stochastic gradient descent in RKHS

To motivate functional SGD in RKHS, we first review SGD in the Euclidean setting for minimizing the expected loss function (θ)=𝔼Z[(θ;Z)]\mathcal{L}(\theta)=\mathbb{E}_{Z}[\ell(\theta;Z)], where θd\theta\in\mathbb{R}^{d} is the parameter of interest, :d×𝒵\ell:\,\mathbb{R}^{d}\times\mathcal{Z}\to\mathbb{R} is the loss function and ZZ denotes a generic random sample, e.g. Z=(X,Y)Z=(X,Y) in the non-parametric regression setting (1.1). By first-order Taylor’s expansion, one can locally approximate (θ+s)\mathcal{L}(\theta+s) for any small deviation ss by (θ+s)(θ)+(θ),s\mathcal{L}(\theta+s)\approx\mathcal{L}(\theta)+\langle\nabla\mathcal{L}(\theta),\,s\rangle, where (θ)\nabla\mathcal{L}(\theta) denotes the gradient (vector) of ()\mathcal{L}(\cdot) evaluated at θ\theta. The gradient (θ)\nabla\mathcal{L}(\theta) therefore encodes the (infinitesimal) steepest descent direction of \mathcal{L} at θ\theta, leading to the following gradient descent (GD) updating formula:

\displaystyle\widehat{\theta}_{i}=\widehat{\theta}_{i-1}-\gamma_{i}\,\nabla\mathcal{L}(\widehat{\theta}_{i-1}),\quad\mbox{for}\quad i=1,2,\ldots,

starting from some initial value θ^0\widehat{\theta}_{0}, where γi>0\gamma_{i}>0 is the step size (also called learning rate) at iteration ii. GD typically requires the computation of the full gradient (θ)\nabla\mathcal{L}(\theta), which is unavailable due to the unknown data distribution of ZZ. In stochastic approximation, SGD takes a more efficient approach by using an unbiased estimate of the gradient as Gi(θ)=(θ,Zi)G_{i}(\theta)=\nabla\ell(\theta,Z_{i}) based on one sample ZiZ_{i} to substitute (θ)\nabla\mathcal{L}(\theta) in the updating rule.

Accordingly, the SGD updating rule takes the form of

θ^i=θ^i1γiGi(θ^i1),fori=1,2,.\displaystyle\widehat{\theta}_{i}=\widehat{\theta}_{i-1}-\gamma_{i}\,G_{i}(\widehat{\theta}_{i-1}),\quad\mbox{for}\quad i=1,2,\ldots.
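
For concreteness, the following minimal sketch (our own, assuming a linear model with the squared loss \ell(\theta;(x,y))=(x^{\top}\theta-y)^{2}; this specific choice is not from the paper) implements the SGD updating rule above, with the stochastic gradient G_{i}(\theta)=2(x_{i}^{\top}\theta-y_{i})\,x_{i} computed from a single observation.

import numpy as np

def sgd_least_squares(stream, dim, step_size):
    # SGD update theta_i = theta_{i-1} - gamma_i * G_i(theta_{i-1}) with the squared
    # loss, so G_i(theta) = 2 * (x_i^T theta - y_i) * x_i is an unbiased gradient estimate.
    theta = np.zeros(dim)                       # initial value theta_0 = 0
    for i, (x, y) in enumerate(stream, start=1):
        grad = 2.0 * (x @ theta - y) * x        # noisy gradient from a single sample
        theta -= step_size(i) * grad            # step size gamma_i at iteration i
    return theta

# Usage on a simulated stream (hypothetical data-generating process, for illustration).
rng = np.random.default_rng(0)
theta_star = np.array([1.0, -2.0, 0.5])
stream = ((x, x @ theta_star + 0.1 * rng.standard_normal())
          for x in rng.standard_normal((1000, 3)))
theta_hat = sgd_least_squares(stream, dim=3, step_size=lambda i: 0.1 / np.sqrt(i))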

Let us now extend the concept of SGD from minimizing an expected loss function in Euclidean space to minimizing an expected loss functional in function space. Here for concreteness, we develop SGD for minimizing the expected squared error loss (f)=𝔼[(f(X)Y)2]\mathcal{L}(f)=\mathbb{E}\big{[}(f(X)-Y)^{2}\big{]} over an RKHS \mathbb{H} equipped with inner product ,\langle\cdot,\cdot\rangle_{\mathbb{H}}. Let us begin by extending the concept of the “gradient”. By identifying the gradient (operator) L:\nabla L:\mathbb{H}\to\mathbb{H} of functional ()\mathcal{L}(\cdot) as a steepest descent “direction” in \mathbb{H} through the following first-order “Taylor expansion”

(f)=(g)+(g),fg+𝒪(fg2),as fg,\displaystyle\mathcal{L}(f)=\mathcal{L}(g)+\langle\nabla\mathcal{L}(g),\,f-g\rangle_{\mathbb{H}}+\mathcal{O}\big{(}\|f-g\|_{\mathbb{H}}^{2}\big{)},\ \ \mbox{as }f\to g,

we obtain after some simple algebra that

(g),fg+𝒪(fg2)\displaystyle\langle\nabla\mathcal{L}(g),\,f-g\rangle_{\mathbb{H}}+\mathcal{O}\big{(}\|f-g\|_{\mathbb{H}}^{2}\big{)} =(f)(g)\displaystyle\,=\mathcal{L}(f)-\mathcal{L}(g)
=𝔼[(f(X)g(X))(g(X)Y)]+𝔼[(f(X)g(X))2].\displaystyle\,=\mathbb{E}\big{[}\big{(}f(X)-g(X)\big{)}\cdot\big{(}g(X)-Y\big{)}\big{]}+\mathbb{E}\big{[}(f(X)-g(X))^{2}\big{]}.

Now by using the reproducing property h(x)=h,Kxh(x)=\langle h,\,K_{x}\rangle_{\mathbb{H}} for any hh\in\mathbb{H}, we further obtain

(g),fg=𝔼[(g(X)Y)KX],fg+𝒪(fg2).\displaystyle\langle\nabla\mathcal{L}(g),\,f-g\rangle_{\mathbb{H}}=\big{\langle}\mathbb{E}\big{[}\big{(}g(X)-Y\big{)}K_{X}\big{]},\,f-g\big{\rangle}_{\mathbb{H}}+\mathcal{O}\big{(}\|f-g\|_{\mathbb{H}}^{2}\big{)}. (2.2)

Here, we have used the fact that by Cauchy-Schwarz inequality,

(f(x)g(x))2=fg,Kx2fg2Kx2=K(x,x)fg2=𝒪(fg2),(f(x)-g(x))^{2}=\langle f-g,\,K_{x}\rangle^{2}\leq\|f-g\|_{\mathbb{H}}^{2}\cdot\|K_{x}\|_{\mathbb{H}}^{2}=K(x,x)\,\|f-g\|_{\mathbb{H}}^{2}=\mathcal{O}\big{(}\|f-g\|_{\mathbb{H}}^{2}\big{)},

since Assumptions A1-A2 together with Mercer’s expansion (2.1) imply that KK is uniformly bounded, with K(x,x)\leq c^{2}_{\phi}\sum_{j=1}^{\infty}\mu_{j}\leq C\sum_{j=1}^{\infty}j^{-\alpha}\leq C^{\prime}, as long as α>1\alpha>1. From equation (2.2), we can identify the gradient (g)\nabla\mathcal{L}(g) at gg\in\mathbb{H} as

(g)=𝔼[(g(X)Y)KX].\displaystyle\nabla\mathcal{L}(g)=\mathbb{E}\big{[}\big{(}g(X)-Y\big{)}K_{X}\big{]}\in\mathbb{H}.

Throughout the rest of the paper, we will refer to the above (g)\nabla\mathcal{L}(g) as the RKHS gradient of the functional \mathcal{L} at gg.

Upon the arrival of the iith data point (Xi,Yi)(X_{i},Y_{i}), we can form an unbiased estimator Gi(g)G_{i}(g) of the RKHS gradient (g)\nabla\mathcal{L}(g) as Gi(g)=(g(Xi)Yi)KXiG_{i}(g)=\big{(}g(X_{i})-Y_{i}\big{)}K_{X_{i}}. This leads to the following SGD in RKHS for solving non-parametric least squares: for a given initial estimate f^0\widehat{f}_{0}, the SGD recursively updates the estimate of ff upon the arrival of each data point as

f^i=f^i1γiGi(f^i1)=f^i1+γi(Yif^i1(Xi))KXi,for i=1,2,.\widehat{f}_{i}=\widehat{f}_{i-1}-\gamma_{i}\,G_{i}(\widehat{f}_{i-1})=\widehat{f}_{i-1}+\gamma_{i}\big{(}Y_{i}-\widehat{f}_{i-1}(X_{i})\big{)}K_{X_{i}},\quad\mbox{for }i=1,2,\ldots. (2.3)

By utilizing the reproducing property, the above iterative updating formula can be rewritten as

f^i=f^i1+γi(Yif^i1,KXi)KXi=(IγiKXiKXi)f^i1+γiYiKXi,\displaystyle\widehat{f}_{i}=\widehat{f}_{i-1}+\gamma_{i}\,\big{(}Y_{i}-\langle\widehat{f}_{i-1},K_{X_{i}}\rangle_{\mathbb{H}}\big{)}\,K_{X_{i}}=(I-\gamma_{i}\,K_{X_{i}}\otimes K_{X_{i}})\,\widehat{f}_{i-1}+\gamma_{i}\,Y_{i}\,K_{X_{i}}, (2.4)

where II denotes the identity map on \mathbb{H}, and \otimes is the tensor product operator defined through gh(f)=f,hgg\otimes h(f)=\langle f,h\rangle_{\mathbb{H}}\,g for all g,h,fg,h,f\in\mathbb{H}. Formula (2.3) is more straightforward to use for practical implementation, while formula (2.4) provides a more tractable expression that will facilitate our theoretical analysis. Following [2] and [31], we consider the so-called Polyak averaging scheme to further improve the estimation accuracy by averaging over the entire updating trajectory, i.e. we use f¯n=n1i=1nf^i\bar{f}_{n}=n^{-1}\sum_{i=1}^{n}\,\widehat{f}_{i} as the final functional SGD estimator of ff based on a dataset of sample size nn. Note that this averaged estimator can be efficiently computed without storing all past estimators by using the recursively updating formula f¯i=(1i1)f¯i1+i1f^i\bar{f}_{i}=(1-i^{-1})\,\bar{f}_{i-1}\,+\,i^{-1}\,\widehat{f}_{i} for i=1,,ni=1,\dots,n on the fly. We will refer to the above SGD as functional SGD in order to differentiate it from the SGD in Euclidean space, and f¯n\bar{f}_{n} as the functional SGD estimator (using nn samples). Throughout the remainder of the paper, we consider a zero initialization, f^0=0\widehat{f}_{0}=0, without loss of generality.

In functional SGD with total sample size (time horizon) nn, the only adjustable component is the step size scheme {γi:i=1,2,,n}\{\gamma_{i}:\,i=1,2,\ldots,n\}, which is crucial for achieving fast convergence and accurate estimations (c.f. Remark 3.1). We examine two common schemes [15, 32]: (1) constant step size scheme where γiγ=γ(n)\gamma_{i}\equiv\gamma=\gamma(n) only depends on the total sample size nn; (2) non-constant step size scheme where γi=iξ\gamma_{i}=i^{-\xi} decays polynomially in ii for i=1,2,,ni=1,2,\ldots,n and some ξ>0\xi>0. While the constant step scheme is more amenable to theoretical analysis, it suffers from two notable drawbacks: (1) it assumes prior knowledge of the sample size nn, which is typically unavailable in streaming data scenarios, and (2) the optimal estimation error is only achieved at the nn-th iteration, leading to suboptimal performance before that time point. In contrast, the non-constant step size scheme, despite significantly complicating our theoretical analysis, overcomes the aforementioned limitations and leads to a truly online algorithm that achieves rate-optimal estimation at any intermediate time point (c.f. Theorem 3.1). Due to this characteristic, we will also refer to the non-constant step size scheme as the online scheme.

Although functional SGD operates in the infinite-dimensional RKHS, it can be implemented using a finite-dimensional representation enabled by the kernel trick. Concretely, upon the arrival of the ii-th observation (Xi,Yi)(X_{i},Y_{i}), we can express the time-ii intermediate estimator f^i\widehat{f}_{i} as f^i=j=1iβ^jKXj\widehat{f}_{i}=\sum_{j=1}^{i}\widehat{\beta}_{j}\,K_{X_{j}} due to equation (2.3) and the zero initialization condition \widehat{f}_{0}=0, where only the last entry β^i\widehat{\beta}_{i} in the coefficient vector (β^1,β^2,,β^i)(\widehat{\beta}_{1},\,\widehat{\beta}_{2},\,\dots,\,\widehat{\beta}_{i})^{\top} needs to be updated,

β^i=γi(Yif^i1(Xi))=γiYiγij=1i1β^jK(Xj,Xi).\displaystyle\widehat{\beta}_{i}=\gamma_{i}\,\big{(}Y_{i}-\widehat{f}_{i-1}(X_{i})\big{)}=\gamma_{i}\,Y_{i}-\gamma_{i}\sum_{j=1}^{i-1}\widehat{\beta}_{j}\,K(X_{j},\,X_{i}).

Note that the computational complexity at time ii is 𝒪(i)\mathcal{O}(i) for i=1,2,,ni=1,2,\ldots,n. Correspondingly, the functional SGD estimator at time ii can be computed through f¯i=(1i1)f¯i1+i1f^i=j=1iβ¯jKXj\bar{f}_{i}=(1-i^{-1})\,\bar{f}_{i-1}+i^{-1}\widehat{f}_{i}=\sum_{j=1}^{i}\bar{\beta}_{j}\,K_{X_{j}}, where (as can be shown by induction)

β¯j=(1j1i)β^j,forj=1,2,,i.\displaystyle\bar{\beta}_{j}=\Big{(}1-\frac{j-1}{i}\Big{)}\,\widehat{\beta}_{j},\quad\mbox{for}\quad j=1,2,\ldots,i.

Consequently, the overall time complexity of the resulting algorithm is 𝒪(n2)\mathcal{O}(n^{2}), and the space complexity is 𝒪(n)\mathcal{O}(n).
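
The following sketch (our own Python illustration of the kernel-trick implementation described above; the specific kernel and data-generating process are assumptions made for the example) maintains the coefficient vector \widehat{\beta}, updates only its last entry at each arrival, and forms the Polyak-averaged estimator through the rescaled coefficients \bar{\beta}_{j}=(1-(j-1)/n)\,\widehat{\beta}_{j}.

import numpy as np

def functional_sgd(X, Y, kernel, step_sizes):
    # Functional SGD via the kernel trick: the time-i iterate is
    # f_hat_i = sum_{j<=i} beta_hat[j] * K(X_j, .), and only beta_hat[i] changes at time i.
    n = len(X)
    beta_hat = np.zeros(n)
    for i in range(n):                                    # arrival of the (i+1)-th pair (0-indexed)
        pred = sum(beta_hat[j] * kernel(X[j], X[i]) for j in range(i))   # f_hat_{i-1}(X_i), O(i) cost
        beta_hat[i] = step_sizes[i] * (Y[i] - pred)
    beta_bar = (1.0 - np.arange(n) / n) * beta_hat        # averaging weights (1 - (j-1)/n) for j = 1..n
    return lambda x: sum(beta_bar[j] * kernel(X[j], x) for j in range(n))

# Example with an assumed kernel on [0, 1] and the non-constant step size
# gamma_i = i^{-1/(alpha+1)} with alpha = 2 (both choices are for illustration only).
rng = np.random.default_rng(1)
n, alpha = 500, 2
X = rng.uniform(size=n)
Y = np.sin(2 * np.pi * X) + 0.3 * rng.standard_normal(n)
kernel = lambda s, t: 1.0 + min(s, t)
gammas = np.arange(1, n + 1) ** (-1.0 / (alpha + 1))
f_bar = functional_sgd(X, Y, kernel, gammas)              # evaluate with f_bar(x) for x in [0, 1]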

2.3 Problem formulation

Our objective is to develop online inference for the non-parametric regression function ff^{\ast} based on the functional SGD estimator f¯n\bar{f}_{n}. Specifically, we aim to construct level-β\beta pointwise confidence intervals (local inference) CIn(x;β)=[Un(x;β),Vn(x;β)]CI_{n}(x;\,\beta)=[U_{n}(x;\,\beta),\,V_{n}(x;\,\beta)] for f(x)f^{\ast}(x), where x𝒳x\in\mathcal{X}, and a level-β\beta simultaneous confidence band (global inference) CBn(β)={g:𝒳|g(x)[f¯n(x)bn(β),f¯n(x)+bn(β)],x𝒳}CB_{n}(\beta)=\big{\{}g:\,\mathcal{X}\to\mathbb{R}\,\big{|}\,g(x)\in[\bar{f}_{n}(x)-b_{n}(\beta),\,\bar{f}_{n}(x)+b_{n}(\beta)],\ \forall x\in\mathcal{X}\big{\}} for ff^{\ast}. We require these intervals and band to be asymptotically valid, meaning that the coverage probabilities, i.e., the probabilities of f(x)f^{\ast}(x) or ff^{\ast} falling within CIn(x;β)CI_{n}(x;\,\beta) or CBn(β)CB_{n}(\beta) respectively, are close to their nominal level β\beta. Mathematically, this means [f(x)CIn(x;β)]=β+o(1)\mathbb{P}[f^{\ast}(x)\in CI_{n}(x;\,\beta)]=\beta+o(1) and [fCBn(β)]=β+o(1)\mathbb{P}[f^{\ast}\in CB_{n}(\beta)]=\beta+o(1) as nn\to\infty.

The coverage probability analysis of these intervals and band requires us to examine and prove the distributional convergence of two random quantities (with appropriate rescaling) based on the functional SGD estimator f¯n\bar{f}_{n}: the pointwise difference f¯n(x)f(x)\bar{f}_{n}(x)-f^{\ast}(x) for x𝒳x\in\mathcal{X} and the supremum norm f¯nf\|\bar{f}_{n}-f^{\ast}\|_{\infty} of f¯nf\bar{f}_{n}-f^{\ast}. In particular, the appropriate rescaling choice determines a precise convergence rate of \bar{f}_{n} towards ff^{\ast} under the supremum norm metric. The characterization of the convergence rate of a non-parametric regression estimator under the supremum norm metric is a challenging and important problem in its own right. We note that the distribution of the supremum norm f¯nf\|\bar{f}_{n}-f^{\ast}\|_{\infty} after a proper re-scaling behaves like the supremum norm of a Gaussian process in the asymptotic sense, which is not practically feasible to estimate. Therefore, for inference purposes, it is not necessary to explicitly characterize this distributional limit; instead, we will prove a bootstrap consistency by showing that the Kolmogorov distance between the sampling distributions of this supremum norm and its bootstrapping counterpart converges to zero as nn\to\infty.

In our theoretical development to address these problems, we will utilize a recursive expansion of the functional SGD updating formula to construct a higher-order expansion of f¯n\bar{f}_{n} under the \|\cdot\|_{\infty} norm metric. Building upon this expansion, we will establish in Section 3 the distributional convergence of the two aforementioned random quantities and characterize their limiting distributions with an explicit representation of the limiting variance for f¯n(x)f(x)\bar{f}_{n}(x)-f^{\ast}(x) in the large-sample setting. However, these limiting distributions and variances depend on the spectral decomposition of the kernel KK, the marginal distribution of the design variable XX, and the unknown noise variance σ2\sigma^{2}, which are either inaccessible or computationally expensive to evaluate in an online learning scenario. To overcome this challenge, we will propose a scalable bootstrap-based inference method in Section 4, enabling efficient online inference for ff^{\ast}.

3 Finite-Sample Analysis of Functional SGD Estimator

In this section, we start by deriving a higher-order expansion of f¯n\bar{f}_{n} under the \|\cdot\|_{\infty} norm metric. We then proceed to establish the distributional convergence of \bar{f}_{n}(x)-f^{\ast}(x) for any x𝒳x\in\mathcal{X} by characterizing the leading term in the expansion. These results will be useful for motivating our online local and global inference for ff^{\ast} in the following section.

3.1 Higher-order expansion under the supremum norm

We begin by decomposing the functional SGD update of f^nf\widehat{f}_{n}-f^{\ast} into two leading recursive formulas and a higher-order remainder term. This decomposition allows us to distinguish between the deterministic term responsible for the estimation bias and the stochastic fluctuation term contributing to the estimation variance. Concretely, we obtain the following by plugging Yi=f(Xi)+ϵiY_{i}=f^{\ast}(X_{i})+\epsilon_{i} into the recursive updating formula (2.4),

f^if=(IγiKXiKXi)(f^i1f)+γiϵiKXi.\widehat{f}_{i}-f^{\ast}=(I-\gamma_{i}\,K_{X_{i}}\otimes K_{X_{i}})\,(\widehat{f}_{i-1}-f^{\ast})+\gamma_{i}\,\epsilon_{i}\,K_{X_{i}}. (3.1)

Let Σ:=𝔼[KX1KX1]:\Sigma:\,=\mathbb{E}[K_{X_{1}}\otimes K_{X_{1}}]:\,\mathbb{H}\to\mathbb{H} denote the population-level covariance operator, so that for any ff, gg\in\mathbb{H} we have f,Σg=𝔼[f(X1)g(X1)]\langle f,\,\Sigma\,g\rangle_{\mathbb{H}}=\mathbb{E}[f(X_{1})\,g(X_{1})]. Now we recursively define the leading bias term through

η0bias,0=f^0f=fandηibias,0=(IγiΣ)ηi1bias,0fori=1,2,\eta_{0}^{bias,0}=\widehat{f}_{0}-f^{\ast}=-f^{\ast}\quad\mbox{and}\quad\eta_{i}^{bias,0}=(I-\gamma_{i}\,\Sigma)\,\eta_{i-1}^{bias,0}\quad\mbox{for}\quad i=1,2,\ldots (3.2)

that collects the leading deterministic component in (3.1); and the leading noise term through

η0noise,0=0andηinoise,0=(IγiΣ)ηi1noise,0+γiϵiKXifori=1,2,\displaystyle\eta_{0}^{noise,0}=0\quad\mbox{and}\quad\eta_{i}^{noise,0}=(I-\gamma_{i}\,\Sigma)\,\eta^{noise,0}_{i-1}+\gamma_{i}\,\epsilon_{i}\,K_{X_{i}}\quad\mbox{for}\quad i=1,2,\ldots (3.3)

that collects the leading stochastic fluctuation component in (3.1); so that we have the following decomposition for the recursion:

f^if=ηibias,0leading bias+ηinoise,0leading noise+(f^ifηibias,0ηinoise,0)remainder termfori=1,2,.\widehat{f}_{i}-f^{\ast}=\underbrace{\eta_{i}^{bias,0}}_{\text{leading bias}}+\underbrace{\eta_{i}^{noise,0}}_{\text{leading noise}}+\ \ \underbrace{\big{(}\widehat{f}_{i}-f^{\ast}-\eta_{i}^{bias,0}-\eta_{i}^{noise,0}\big{)}}_{\text{remainder term}}\quad\mbox{for}\quad i=1,2,\ldots. (3.4)

Correspondingly, we define η¯ibias,0=i1j=1iηjbias,0\bar{\eta}_{i}^{bias,0}=i^{-1}\sum_{j=1}^{i}\eta_{j}^{bias,0} and η¯inoise,0=i1j=1iηjnoise,0\bar{\eta}_{i}^{noise,0}=i^{-1}\sum_{j=1}^{i}\eta_{j}^{noise,0} as the leading bias and noise terms, respectively, in the functional SGD estimator (after averaging). The following Theorem 3.1 presents finite-sample bounds for the two leading terms and the remainder term associated with f¯n\bar{f}_{n} under the supremum norm metric. The results indicate that the remainder term is of strictly higher order (in terms of dependence on nn) compared to the two leading terms, validating the term “leading” for them.

Theorem 3.1 (Finite-sample error bound under the supremum norm).

Suppose that the kernel KK satisfies Assumptions A1-A2. Assume ff^{\ast}\in\mathbb{H} satisfies ν=1f,ϕνL2μ1/2<\sum_{\nu=1}^{\infty}\langle f^{\ast},\phi_{\nu}\rangle_{L_{2}}\mu_{\nu}^{-1/2}<\infty.

  1. 1.

    (constant step size) Assume that the step size γiγ\gamma_{i}\equiv\gamma satisfies γ(0,μ11)\gamma\in(0,\,\mu_{1}^{-1}), then we have

    supx𝒳|η¯nbias,0(x)|C1nγ,andsupx𝒳Var(η¯nnoise,0(x))C(nγ)1/αn,\sup_{x\in\mathcal{X}}|\bar{\eta}_{n}^{bias,0}(x)|\leq C\frac{1}{\sqrt{n\gamma}},\quad\textrm{and}\;\sup_{x\in\mathcal{X}}\operatorname{{\rm Var}}(\bar{\eta}_{n}^{noise,0}(x))\leq C^{\prime}\frac{(n\gamma)^{1/\alpha}}{n},

    where C,CC,\,C^{\prime} are constants independent of (n,γ)(n,\gamma). Furthermore, assuming that the step size satisfies 0<\gamma<n^{-\frac{2}{2+3\alpha}}, we have

    (f¯nfη¯nbias,0η¯nnoise,02γ1/2(nγ)1+γ1/4(nγ)1/αn1logn)C/n+Cγ1/4,\mathbb{P}\Big{(}\|\bar{f}_{n}-f^{\ast}-\bar{\eta}_{n}^{bias,0}-\bar{\eta}_{n}^{noise,0}\|^{2}_{\infty}\geq\gamma^{1/2}(n\gamma)^{-1}+\gamma^{1/4}(n\gamma)^{1/\alpha}n^{-1}\log n\Big{)}\leq C/n+C\gamma^{1/4},

    where the randomness is with respect to the randomness in {(Xi,ϵi)}i=1n\{(X_{i},\epsilon_{i})\}_{i=1}^{n}.

  2. 2.

    (non-constant step size) Assume that the step size satisfies γi=iξ\gamma_{i}=i^{-\xi} for some ξ(0, 1/2)\xi\in(0,\,1/2); then we have

    supx𝒳|η¯nbias,0(x)|C1nγn,andsupx𝒳Var(η¯nnoise,0(x))C(nγn)1/αn,\sup_{x\in\mathcal{X}}|\bar{\eta}_{n}^{bias,0}(x)|\leq C\frac{1}{\sqrt{n\gamma_{n}}},\quad\textrm{and}\;\sup_{x\in\mathcal{X}}\operatorname{{\rm Var}}(\bar{\eta}_{n}^{noise,0}(x))\leq C^{\prime}\frac{(n\gamma_{n})^{1/\alpha}}{n},

    where C,CC,\,C^{\prime} are constants independent of (n,γn)(n,\gamma_{n}). For the special choice of ξ=1α+1\xi=\frac{1}{\alpha+1}, we have

    (f¯nfη¯nbias,0η¯nnoise,02γn1/2(nγn)1+γn1/2(nγn)1/αn1logn)C/n+Cγn1/2.\mathbb{P}\Big{(}\|\bar{f}_{n}-f^{\ast}-\bar{\eta}_{n}^{bias,0}-\bar{\eta}_{n}^{noise,0}\|^{2}_{\infty}\geq\gamma_{n}^{1/2}(n\gamma_{n})^{-1}+\gamma_{n}^{1/2}(n\gamma_{n})^{1/\alpha}n^{-1}\log n\Big{)}\leq C/n+C\gamma_{n}^{1/2}.

The proof of this theorem is based on a higher-order recursion expansion and a careful supremum norm analysis of the recursive formula; see Remark 3.2 and the proof sketch in Section 6. The detailed proof is provided in [33].

Remark 3.1.

As demonstrated in Theorem 3.1, the selection of the step size γ\gamma (or γn\gamma_{n} for non-constant step size) in the SGD estimator entails a trade-off between bias and variance. A larger γ\gamma (or γn\gamma_{n}) increases bias while reducing variance, and vice versa. This trade-off can be optimized by choosing the (optimal) step size γn=n1α+1\gamma_{n}=n^{-\frac{1}{\alpha+1}}. This is why we specifically focus on this particular choice in the non-constant step size setting in the theorem, which also significantly simplifies the proof. Interestingly, the step size (scheme) in the functional SGD plays a similar role as the regularization parameter in regularization-based approaches in preventing overfitting according to Theorem 3.1. To see this, let us consider the classic kernel ridge regression (KRR), where the estimator f^n,λ\widehat{f}_{n,\lambda} is constructed as

f^n,λ=argminf{1ni=1n(Yif(Xi))2+λf2},\widehat{f}_{n,\lambda}=\mathop{\mathrm{argmin}}_{f\in\mathbb{H}}\Big{\{}\frac{1}{n}\sum_{i=1}^{n}\big{(}Y_{i}-f(X_{i})\big{)}^{2}+\lambda\|f\|_{\mathbb{H}}^{2}\Big{\}},

where λ\lambda serves as the regularization parameter to avoid overfitting. It can be shown (e.g., [10]) that the squared bias of f^n,λ\widehat{f}_{n,\lambda} has an order of λ\lambda, while the variance has an order of dλ/nd_{\lambda}/n, where d_{\lambda}=\sum_{\nu=1}^{\infty}\min\{1,\mu_{\nu}/\lambda\} represents the effective dimension of the model and is of order λ1/α\lambda^{-1/\alpha} under Assumption A2. In comparison, the squared bias and variance of the functional SGD estimator f¯n\bar{f}_{n} are of order (nγn)1(n\gamma_{n})^{-1} and (nγn)1/α/n(n\gamma_{n})^{1/\alpha}/n respectively. Therefore, (nγn)1(n\gamma_{n})^{-1} and (nγn)1/α(n\gamma_{n})^{1/\alpha} respectively play the same role as the regularization parameter λ\lambda and effective dimension dλd_{\lambda} in KRR. More generally, a step size scheme {γi}i=1n\{\gamma_{i}\}_{i=1}^{n} corresponds to an effective regularization parameter of the order λ=(i=1nγi)1\lambda=\big{(}\sum_{i=1}^{n}\gamma_{i}\big{)}^{-1}, which in our considered settings is of order (nγn)1(n\gamma_{n})^{-1}. Note that the accumulated step size i=1nγi\sum_{i=1}^{n}\gamma_{i} can be interpreted as the total path length in the functional SGD algorithm. This total path length determines the early stopping of the algorithm, effectively controlling the complexity of the learned model and preventing overfitting.
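
As a quick check of this correspondence (our own calculation, under the polynomially decaying scheme considered above), taking \gamma_{i}=i^{-\frac{1}{\alpha+1}} gives

\sum_{i=1}^{n}\gamma_{i}=\sum_{i=1}^{n}i^{-\frac{1}{\alpha+1}}\asymp\int_{1}^{n}t^{-\frac{1}{\alpha+1}}\,dt\asymp n^{\frac{\alpha}{\alpha+1}}=n\gamma_{n},\qquad\text{so that}\qquad\lambda=\Big{(}\sum_{i=1}^{n}\gamma_{i}\Big{)}^{-1}\asymp(n\gamma_{n})^{-1}=n^{-\frac{\alpha}{\alpha+1}},

which is exactly the order of λ\lambda that balances the KRR squared bias λ\lambda against the variance λ1/α/n\lambda^{-1/\alpha}/n under Assumption A2.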

Remark 3.2.

The higher-order recursion expansion and the supremum norm bound in the theorem provide a finer insight into the distributional behavior of f¯n\bar{f}_{n} and pave the way for inference. That is, we only need to focus on the leading noise recursive term for statistical inference. In our proof of bounding the supremum norm for the remainder term in equation (3.4), we further decompose the remainder f¯nfη¯nbias,0η¯nnoise,0\bar{f}_{n}-f^{\ast}-\bar{\eta}_{n}^{bias,0}-\bar{\eta}_{n}^{noise,0} into two parts: the bias remainder and the noise remainder. Note that a loose analysis of bounding the noise remainder under the \|\cdot\|_{\infty} metric by directly converting an L2L_{2} norm bound into the supremum norm bound using the reproducing kernel property of the original RKHS \mathbb{H} would result in a bound whose order is comparable to that of the leading term. This motivates us to introduce an augmented RKHS a={f=ν=1fνϕνν=1fν2μν2a1<}\mathbb{H}_{a}=\{f=\sum_{\nu=1}^{\infty}f_{\nu}\phi_{\nu}\mid\sum_{\nu=1}^{\infty}f_{\nu}^{2}\mu_{\nu}^{2a-1}<\infty\} with 0a1/21/(2α)0\leq a\leq 1/2-1/(2\alpha) equipped with the kernel function K^{a}(x,y)=\sum_{\nu=1}^{\infty}\phi_{\nu}(x)\phi_{\nu}(y)\mu_{\nu}^{1-2a} and norm fa=(ν=1fν2μν2a1)1/2\|f\|_{a}=\big{(}\sum_{\nu=1}^{\infty}f_{\nu}^{2}\mu_{\nu}^{2a-1}\big{)}^{1/2} for any f=ν=1fνϕνf=\sum_{\nu=1}^{\infty}f_{\nu}\phi_{\nu}\in\mathbb{H}. This augmented RKHS norm weakens the impact of high-frequency components compared to the norm f=(ν=1fν2μν1)1/2\|f\|_{\mathbb{H}}=\big{(}\sum_{\nu=1}^{\infty}f_{\nu}^{2}\mu_{\nu}^{-1}\big{)}^{1/2}, and it turns out to align better with the functional supremum norm in our context. As a result, we have fcafackf\|f\|_{\infty}\leq c_{a}\|f\|_{a}\leq c_{k}\|f\|_{\mathbb{H}} for any ff\in\mathbb{H}, where (ca,ck)(c_{a},\,c_{k}) are constants. In particular, a supremum norm bound based on controlling the fa\|f\|_{a} norm with appropriate choice of aa could be substantially better than that based on f\|f\|_{\mathbb{H}}; see Section 6 and Section 8.2 for further details.

As we discussed in Section 2.3, for inference purposes, it is not necessary to explicitly characterize the distributional limit of the supremum norm f¯nf\|\bar{f}_{n}-f^{\ast}\|_{\infty}; instead, we will prove a bootstrap consistency by showing that the Kolmogorov distance between the sampling distributions of this supremum norm and its bootstrapping counterpart converges to zero as nn\to\infty. However, the pointwise convergence limit of f¯n(z0)f(z0)\bar{f}_{n}(z_{0})-f^{\ast}(z_{0}) for fixed z0[0,1]z_{0}\in[0,1] has an easy characterization. Therefore, we present the pointwise convergence limit and use it to discuss the impact of online estimation in the non-parametric regression model in the following subsection.

3.2 Pointwise distributional convergence

According to Theorem 3.1, the large-sample behavior of the functional SGD estimator f¯n\bar{f}_{n} is completely determined by the two leading processes: bias term and noise term. According to (3.2), under the constant step size γ\gamma, the leading bias term has an explicit expression as

η¯nbias,0(x)=\displaystyle\bar{\eta}_{n}^{bias,0}(x)= 1nγ1Σ1(IγΣ)[I(IγΣ)n]f(x)\displaystyle\frac{1}{n}\gamma^{-1}\Sigma^{-1}\,(I-\gamma\Sigma)\,[I-(I-\gamma\Sigma)^{n}\,]f^{\ast}(x) (3.5)
=\displaystyle= \frac{1}{\sqrt{\gamma}n}\sum_{k=1}^{n}\sum_{\nu=1}^{\infty}\langle f^{\ast},\phi_{\nu}\rangle_{L_{2}}\mu_{\nu}^{-1/2}(1-\gamma\mu_{\nu})^{k}(\gamma\mu_{\nu})^{1/2}\phi_{\nu}(x),\quad\forall x\in\mathcal{X},

and the leading noise term is

η¯nnoise,0(x)=\displaystyle\bar{\eta}_{n}^{noise,0}(x)= 1nk=1nΣ1[I(IγΣ)n+1k]K(Xk,x)ϵk\displaystyle\,\frac{1}{n}\sum_{k=1}^{n}\Sigma^{-1}\big{[}I-(I-\gamma\Sigma)^{n+1-k}\big{]}\,K(X_{k},\,x)\,\epsilon_{k} (3.6)
=\displaystyle= 1nk=1nϵk{ν=1[1(1γμν)n+1k]ϕν(Xk)ϕν(x)}Ωn,k(x),x𝒳.\displaystyle\,\frac{1}{n}\sum_{k=1}^{n}\ \epsilon_{k}\,\cdot\,\underbrace{\bigg{\{}\sum_{\nu=1}^{\infty}\big{[}1-(1-\gamma\mu_{\nu})^{n+1-k}\big{]}\,\phi_{\nu}(X_{k})\,\phi_{\nu}(x)\bigg{\}}}_{\Omega_{n,k}(x)},\quad\forall x\in\mathcal{X}.

For each fixed z0𝒳z_{0}\in\mathcal{X}, conditioning on the design {Xi}i=1n\{X_{i}\}_{i=1}^{n}, the leading noise term η¯nnoise,0(z0)\bar{\eta}_{n}^{noise,0}(z_{0}) is a weighted average of nn independent and centered normally distributed random variables. This representation enables us to identify the limiting distribution of η¯nnoise,0(z0)\bar{\eta}_{n}^{noise,0}(z_{0}) (this subsection) and conduct local inference (i.e. pointwise confidence intervals) by a bootstrap method (next section). Under Assumption A2, the weight Ωn,k(z0)\Omega_{n,k}(z_{0}) associated with the kk-th observation pair (Xk,Yk)(X_{k},\,Y_{k}) is of order ν=1[1(1γμν)n+1k][(n+1k)γ]1/α\sum_{\nu=1}^{\infty}\big{[}1-(1-\gamma\mu_{\nu})^{n+1-k}\big{]}\asymp\big{[}(n+1-k)\gamma\big{]}^{1/\alpha}, which decreases in kk. This diminishing impact trend is inherent to online learning, as later observations tend to have a smaller influence compared to earlier observations. This characteristic is radically different from offline estimation settings, where all observations contribute equally to the final estimator, and will change the asymptotic variance (i.e., the σz02\sigma^{2}_{z_{0}} in Theorem 3.2).

Furthermore, the entire leading noise process η¯nnoise,0()\bar{\eta}_{n}^{noise,0}(\cdot) can be viewed as a weighted and non-identically distributed empirical process indexed by the spatial location. This characterization enables us to conduct global inference (i.e. simultaneous confidence band) for non-parametric online learning by borrowing and extending the recent developments [26, 34, 35] on Gaussian approximation and multiplier bootstraps for suprema of (equally-weighted and identically distributed) empirical processes, which will be the main focus of the next section.

In the following Theorem 3.2, we prove, by analyzing the leading noise term η¯nnoise,0\bar{\eta}_{n}^{noise,0}, a finite-sample upper bound on the Kolmogorov distance between the sampling distribution of f¯n(z0)f(z0)\bar{f}_{n}(z_{0})-f^{\ast}(z_{0}) and the distribution of a standard normal random variable (i.e., the supremum distance between the two cumulative distribution functions) for any z0𝒳z_{0}\in\mathcal{X}.

Theorem 3.2 (Pointwise convergence).

Assume that the kernel KK satisfies Assumptions A1-A2.

  1. 1.

    (Constant step size) Consider the step size γ(n)=γ\gamma(n)=\gamma with 0<γ<n22+3α0<\gamma<n^{-\frac{2}{2+3\alpha}}. For any fixed z0[0,1]z_{0}\in[0,1], we have

    supu|(σz01n(nγ)1/α(f¯n(z0)f(z0)η¯nbias,0(z0))u)Φ(u)|C1n(nγ)1/α+κn,\sup_{u\in\mathbb{R}}\Big{|}\,\mathbb{P}\Big{(}\sigma^{-1}_{z_{0}}\sqrt{n(n\gamma)^{-1/\alpha}}\big{(}\bar{f}_{n}(z_{0})-f^{\ast}(z_{0})-\bar{\eta}_{n}^{bias,0}(z_{0})\big{)}\leq u\Big{)}-\Phi(u)\Big{|}\leq\frac{C_{1}}{\sqrt{n(n\gamma)^{-1/\alpha}}}+\kappa_{n},

    where κn=C2γ1/2(nγ)1+γ1/2(nγ)1/αn1\kappa_{n}=C_{2}\sqrt{\gamma^{1/2}(n\gamma)^{-1}}+\sqrt{\gamma^{1/2}(n\gamma)^{1/\alpha}n^{-1}}. Here, the bias term has an explicit expression as given in (3.5), and the (limiting) variance is

    σz02=σ2(nγ)1/αn1k=1nν=1[(1(1γμν)n+1k)2]ϕν2(z0).\sigma_{z_{0}}^{2}=\sigma^{2}(n\gamma)^{-1/\alpha}n^{-1}\sum_{k=1}^{n}\sum_{\nu=1}^{\infty}\big{[}\big{(}1-(1-\gamma\mu_{\nu})^{n+1-k}\big{)}^{2}\big{]}\,\phi_{\nu}^{2}(z_{0}).
  2. 2.

    (Non-constant step size) Consider the step size γi=i1α+1\gamma_{i}=i^{-\frac{1}{\alpha+1}} for i=1,,ni=1,\dots,n. For any fixed z0[0,1]z_{0}\in[0,1], we have

    supu|(σz01n(nγn)1/α(f¯n(z0)f(z0)η¯nbias,0(z0))u)Φ(u)|C1n(nγn)1/α.\sup_{u\in\mathbb{R}}\big{|}\mathbb{P}\Big{(}\sigma^{-1}_{z_{0}}\sqrt{n(n\gamma_{n})^{-1/\alpha}}\big{(}\bar{f}_{n}(z_{0})-f^{\ast}(z_{0})-\bar{\eta}_{n}^{bias,0}(z_{0})\big{)}\leq u\Big{)}-\Phi(u)\big{|}\leq\frac{C_{1}}{\sqrt{n(n\gamma_{n})^{-1/\alpha}}}.

    Here, the bias term takes an explicit expression as η¯nbias,0(z0)=n1k=1ni=1k(IγiΣ)f(z0)\bar{\eta}_{n}^{bias,0}(z_{0})=n^{-1}\sum_{k=1}^{n}\prod_{i=1}^{k}(I-\gamma_{i}\Sigma)\,f^{\ast}(z_{0}), and the variance is

    σz02=σ2n2k=1nγk2ν=1μν2ϕν2(z0)(j=kni=k+1j(1γiμν))2.\sigma_{z_{0}}^{2}=\frac{\sigma^{2}}{n^{2}}\sum_{k=1}^{n}\gamma_{k}^{2}\,\sum_{\nu=1}^{\infty}\mu_{\nu}^{2}\,\phi_{\nu}^{2}(z_{0})\Big{(}\sum_{j=k}^{n}\prod_{i=k+1}^{j}(1-\gamma_{i}\mu_{\nu})\Big{)}^{2}.

Theorem 3.2 establishes that the sampling distribution of f¯nf\bar{f}_{n}-f^{*} at any fixed z0z_{0} can be approximated by a normal distribution N(η¯nbias,0(z0),n1(nγn)1/ασz02)N(\bar{\eta}_{n}^{bias,0}(z_{0}),n^{-1}(n\gamma_{n})^{1/\alpha}\sigma^{2}_{z_{0}}). According to Theorem 3.1, the bias η¯nbias,0(z0)\bar{\eta}_{n}^{bias,0}(z_{0}) has the order of (nγn)1/2(n\gamma_{n})^{-1/2} while the variance has the order of n1(nγn)1/αn^{-1}(n\gamma_{n})^{1/\alpha}; Theorem 3.2 also implies that the minimax convergence rate n^{-\frac{\alpha}{2(\alpha+1)}} of estimating ff^{\ast} can be achieved with γ=γn=n1α+1\gamma=\gamma_{n}=n^{-\frac{1}{\alpha+1}}, which attains an optimal bias-variance tradeoff. In practice, the bias term can be suppressed by applying an undersmoothing technique; see Remark 4.3 for details.

Remark 3.3.

From the theorem, we see that the (limiting) variance σz02\sigma_{z_{0}}^{2} is precisely the variance of the scaled leading noise n(nγn)1/αη¯nnoise,0(z0)\sqrt{n(n\gamma_{n})^{-1/\alpha}}\bar{\eta}_{n}^{noise,0}(z_{0}) at z0z_{0}, that is, Var(n(nγn)1/αη¯nnoise,0(z0))\operatorname{{\rm Var}}\big{(}\sqrt{n(n\gamma_{n})^{-1/\alpha}}\bar{\eta}_{n}^{noise,0}(z_{0})\big{)}; and σz02\sigma^{2}_{z_{0}} has the same 𝒪(1)\mathcal{O}(1) order for both the constant and non-constant cases. The contribution of each data point to the variance differs between the constant and non-constant step size cases. Concretely, in the constant step size case, let 𝐂=(c1,,cn){\bf{C}}=(c_{1},\dots,c_{n}) be the vector of variation, where ckc_{k} (k=1,,n)(k=1,\dots,n) represents the contribution to σz02\sigma^{2}_{z_{0}} from the kk-th arrival observation (Xk,Yk)(X_{k},Y_{k}). According to equation (3.6), ck=𝔼Ωn,k2(z0)(nγ)1/αn1((n+1k)γ)1/αc_{k}=\mathbb{E}\Omega^{2}_{n,k}(z_{0})\asymp(n\gamma)^{-1/\alpha}n^{-1}\big{(}(n+1-k)\gamma\big{)}^{1/\alpha} and is of order (nk)1/α(n-k)^{1/\alpha} in the observation index kk, which decreases monotonically to nearly 0 as kk grows to nn. In comparison, in the online (nonconstant) step case, we denote {\bf{O}}=(o_{1},\dots,o_{n}) as the vector of variation with oko_{k} being the contribution from the kk-th observation. A careful calculation shows that ok=n2γk2ν=1μν2ϕν2(z0)(j=kni=k+1j(1γiμν))2o_{k}=n^{-2}\gamma^{2}_{k}\sum_{\nu=1}^{\infty}\mu_{\nu}^{2}\,\phi_{\nu}^{2}(z_{0})\Big{(}\sum_{j=k}^{n}\prod_{i=k+1}^{j}(1-\gamma_{i}\mu_{\nu})\Big{)}^{2}, which has order n2γk2γn+1k2((n+1k)γn+1k)1/α+((n+1k)γk)1/αn^{-2}\gamma^{2}_{k}\gamma^{-2}_{n+1-k}\big{(}(n+1-k)\gamma_{n+1-k}\big{)}^{1/\alpha}+\big{(}(n+1-k)\gamma_{k}\big{)}^{1/\alpha} and decreases more slowly than in the constant step size case. This means that the nonconstant step scheme yields a more balanced weighted average over the entire dataset, which tends to lead to a smaller asymptotic variance.

Figure 1 compares the individual variation contribution for both the constant and non-constant step cases. We keep the total step size budget the same for both cases (which also makes the two leading bias terms roughly equal); that is, we choose constant BB in the nonconstant step size γi=Bi1α+1\gamma_{i}=B\cdot i^{-\frac{1}{\alpha+1}} so that nγ=i=1nγin\gamma=\sum_{i=1}^{n}\gamma_{i} with γ=n1α+1\gamma=n^{-\frac{1}{\alpha+1}} being the constant step size. The data index kk is plotted on the xx axis of Figure 1 (A), with the variation contribution summarized by the yy axis. As we can see, the variation contribution from each observation decreases as observations arrive later in both cases. However, the pattern is flatter in the non-constant step case. Figure 1 (B) is a violin plot visualizing the distributions of the components in 𝐂\bf{C} and 𝐎\bf{O}. Specifically, the variation among {ok}k=1n\{o_{k}\}_{k=1}^{n} (depicted by the short blue interval) is smaller in the non-constant case, suggesting reduced fluctuation in individual variation for this setting. As detailed in Section 5, our numerical analysis further confirms that using a nonconstant learning rate outperforms that using a constant learning rate (e.g., Figure 2). An interesting direction for future research might be to identify an optimal learning rate decaying scheme by minimizing the variance σz02\sigma^{2}_{z_{0}} as a function of {γi}i=1n\{\gamma_{i}\}_{i=1}^{n}. It is also interesting to determine whether this scheme results in an equal contribution from each observation. However, this is beyond the scope of this paper.

Figure 1: Comparison of the individual variation contribution of each observation in two cases: the constant step size case (red curve) and the non-constant step size case (blue curve). In (A), the x-axis is the observation index, and the y-axis is the variance contributed by the k-th observation. (B) is the violin plot of the individual variance contributions for the two cases; the solid dots represent means while the intervals represent variances.
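
The qualitative pattern in Figure 1(A) can be reproduced with the following sketch (our own illustration; it assumes the polynomial spectrum \mu_{\nu}=\nu^{-\alpha} and the trigonometric eigenfunctions from Section 2.1, and compares the normalized per-observation contributions to \operatorname{Var}(\bar{\eta}_{n}^{noise,0}(z_{0})) implied by the two variance formulas in Theorem 3.2).

import numpy as np

alpha, n, z0, n_eig = 2.0, 200, 0.3, 300
nu = np.arange(1, n_eig + 1)
mu = nu ** (-alpha)                                       # assumed spectrum mu_nu = nu^{-alpha}
k_idx = np.where(nu % 2 == 1, (nu + 1) // 2, nu // 2)
phi2 = np.where(nu % 2 == 1, np.sin(np.pi * k_idx * z0),
                np.cos(np.pi * k_idx * z0)) ** 2          # phi_nu(z0)^2, trigonometric basis

# Constant step size gamma = n^{-1/(alpha+1)}: contribution c_k of the k-th observation.
gamma = n ** (-1.0 / (alpha + 1))
c = np.array([np.sum((1 - (1 - gamma * mu) ** (n + 1 - k)) ** 2 * phi2)
              for k in range(1, n + 1)]) / n ** 2

# Non-constant step size gamma_i = B * i^{-1/(alpha+1)} with the same total step size budget.
gam = np.arange(1, n + 1) ** (-1.0 / (alpha + 1))
gam *= n * gamma / gam.sum()                              # rescale so that sum_i gamma_i = n * gamma
o = np.zeros(n)
for k in range(1, n + 1):
    # sum over j = k..n of the products prod_{i=k+1}^{j} (1 - gamma_i * mu_nu)
    rows = np.vstack([np.ones(n_eig), 1 - np.outer(gam[k:], mu)])
    s = np.cumprod(rows, axis=0).sum(axis=0)
    o[k - 1] = gam[k - 1] ** 2 * np.sum(mu ** 2 * phi2 * s ** 2) / n ** 2

c, o = c / c.sum(), o / o.sum()                           # normalized profiles, as compared in Figure 1

Plotting c and o against the observation index should reproduce the qualitative behavior described above: both profiles decrease with the index, and the non-constant profile is flatter.
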
Remark 3.4.

The Kolmogorov distance bound between the sampling distribution of \sigma^{-1}_{z_{0}}\sqrt{n(n\gamma)^{-1/\alpha}}\big{(}\bar{f}_{n}(z_{0})-f^{\ast}(z_{0})-\bar{\eta}_{n}^{bias,0}(z_{0})\big{)} and the standard normal distribution depends on the step size γ\gamma (or γn\gamma_{n}) and the sample size nn. In particular, κn\kappa_{n} is the remainder bound stated in Theorem 3.1, which is negligible compared to C1n(nγ)1/α\frac{C_{1}}{\sqrt{n(n\gamma)^{-1/\alpha}}} when γ>n2α+2\gamma>n^{-\frac{2}{\alpha+2}} in the constant step size case. Consequently, a smaller γ\gamma or larger sample size nn leads to a smaller Kolmogorov distance. The same conclusion also applies to the non-constant step size case if we choose γi=i1α+1\gamma_{i}=i^{-\frac{1}{\alpha+1}}.

Although Theorem 3.2 explicitly characterizes the distribution of the SGD estimator, the expression of the standard deviation σz0\sigma_{z_{0}} depends on the eigenvalues and eigenfunctions of \mathbb{H}, the underlying distribution of the design XX, and the unknown noise variance σ2\sigma^{2}, which are typically unknown in practice. One approach is to use plug-in estimators for these unknown quantities, such as empirical eigenvalues and eigenfunctions obtained through SVD decomposition of the empirical kernel matrix 𝐊n×n\mathbf{K}\in\mathbb{R}^{n\times n}, whose ijij-th element is 𝐊ij=K(Xi,Xj)\mathbf{K}_{ij}=K(X_{i},X_{j}). However, computing these plug-in estimators requires access to all observed data points {(Xi,Yi)}i=1n\{(X_{i},\,Y_{i})\}_{i=1}^{n} and has a computational complexity of 𝒪(n3)\mathcal{O}(n^{3}), which undermines the sequential updating advantages of SGD. In the following section, we develop a scalable inference framework that uses multiplier-type bootstraps to generate randomly perturbed SGD estimators upon arrival of each observation. This approach enables us to bypass the evaluation of σz0\sigma_{z_{0}} when constructing confidence intervals.

4 Online Statistical Inference via Multiplier Bootstrap

In this section, we first propose a multiplier bootstrap method for inference based on the functional SGD estimator. After that, we study the theoretical properties of the proposed method, which serve as the cornerstone for proving bootstrap consistency for the local inference of constructing pointwise confidence intervals and the global inference of constructing simultaneous confidence bands. Finally, we describe the resulting online inference algorithm for non-parametric regression based on the functional SGD estimator.

4.1 Multiplier bootstrap for functional SGD

Recall that Theorem 3.1 provides a high-probability decomposition of the functional SGD estimator \bar{f}_{n} (relative to the supremum norm metric) into the following sum

f¯n=f+η¯nbias,0+η¯nnoise,0+smaller remainder term,\displaystyle\bar{f}_{n}\ =\ f^{\ast}\ +\ \bar{\eta}_{n}^{bias,0}\ +\ \bar{\eta}_{n}^{noise,0}\ +\ \mbox{smaller remainder term},

where η¯nbias,0\bar{\eta}_{n}^{bias,0} is the leading bias process defined in equation (3.5) and η¯nnoise,0\bar{\eta}_{n}^{noise,0} is the leading noise process defined in equation (3.6). Motivated by this result, we propose in this section a multiplier bootstrap method to mimic and capture the random fluctuation from this leading noise process η¯nnoise,0()=n1k=1nϵkΩn,k()\bar{\eta}_{n}^{noise,0}(\cdot)=n^{-1}\sum_{k=1}^{n}\epsilon_{k}\cdot\Omega_{n,k}(\cdot), where recall that term Ωn,k\Omega_{n,k} only depends on the kk-th design point XkX_{k}, and the primary source of randomness in η¯nnoise,0\bar{\eta}_{n}^{noise,0} is coming from random noises {ϵk}k=1n\{\epsilon_{k}\}_{k=1}^{n} that are i.i.d. normally distributed under a standard non-parametric regression setting.

Our online inference approach is inspired by the multiplier bootstrap idea proposed in [19] for online inference of parametric models using SGD. Remarkably, we demonstrate that their development can be naturally adapted to enable online inference of non-parametric models based on functional SGD. The key idea is to perturb the stochastic gradient in the functional SGD by incorporating a random multiplier upon the arrival of each data point. Specifically, let w_{1}, w_{2}, \ldots denote a sequence of i.i.d. random bootstrap multipliers, whose mean and variance are both equal to one. At time i with the observed data point (X_{i},\,Y_{i}), we use the following randomly perturbed functional SGD updating formula:

f^ib=\displaystyle\widehat{f}^{b}_{i}= f^i1b+γiwiGi(f^i1b)=f^i1b+γiwi(Yif^i1b,KXi)KXi,\displaystyle\,\widehat{f}_{i-1}^{b}+\gamma_{i}\,w_{i}\,G_{i}(\widehat{f}^{b}_{i-1})=\widehat{f}_{i-1}^{b}+\gamma_{i}\,w_{i}\,(Y_{i}-\langle\widehat{f}^{b}_{i-1},\,K_{X_{i}}\rangle_{\mathbb{H}})\,K_{X_{i}}, (4.1)
=\displaystyle= (IγiwiKXiKXi)f^i1b+γiwiYiKXifori=1,2,,\displaystyle\,(I-\gamma_{i}\,w_{i}\,K_{X_{i}}\otimes K_{X_{i}})\,\widehat{f}^{b}_{i-1}+\gamma_{i}\,w_{i}\,Y_{i}\,K_{X_{i}}\quad\mbox{for}\quad i=1,2,\ldots,

which modifies equations (2.3) and (2.4) for functional SGD by multiplying the stochastic gradient G_{i}(\widehat{f}^{b}_{i-1}) with the random multiplier w_{i}. We adopt the same zero initialization \widehat{f}^{b}_{0}=\widehat{f}_{0}=0 and call the (Polyak) averaged estimator \bar{f}_{n}^{b}=n^{-1}\sum_{i=1}^{n}\widehat{f}_{i}^{b} the bootstrapped functional SGD estimator (with n samples).
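For intuition, the perturbed update (4.1) can be carried out by storing the expansion coefficients of \widehat{f}^{b}_{i} over the observed design points. The sketch below is our own illustration (the Gaussian kernel and the coefficient representation are placeholder choices, not prescribed by the theory beyond Assumption A3):

import numpy as np

def kernel(x, y, h=0.1):
    # placeholder Gaussian kernel; any kernel satisfying A1-A2 could be substituted
    return np.exp(-(x - y) ** 2 / (2 * h ** 2))

def bootstrap_sgd_path(X, Y, gammas, weights):
    # One bootstrapped functional-SGD path, equation (4.1).
    # f_hat is stored via coefficients beta: f_hat(x) = sum_j beta[j] * K(X[j], x).
    n = len(X)
    beta = np.zeros(n)        # zero initialization f_0^b = 0
    beta_bar = np.zeros(n)    # coefficients of the Polyak-averaged estimator
    for i in range(n):
        resid = Y[i] - np.dot(beta[:i], kernel(X[:i], X[i]))   # Y_i - f^b_{i-1}(X_i)
        beta[i] += gammas[i] * weights[i] * resid               # add gamma_i w_i resid K_{X_i}
        beta_bar += (beta - beta_bar) / (i + 1)                 # running average over iterates
    return beta_bar

This mirrors the inner loop of Algorithm 1 below; each bootstrap replicate simply reuses the same data stream with its own multiplier sequence.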

4.2 Bootstrap consistency

Let us now proceed to derive a higher-order expansion of f¯nb\bar{f}_{n}^{b} analogous to Section 3.1 and compare its leading terms with those associated with the original functional SGD estimator f¯n\bar{f}_{n}. Utilizing equation (4.1) and plugging in Yi=f(Xi)+ϵiY_{i}=f^{\ast}(X_{i})+\epsilon_{i}, we obtain the following expression:

f^ibf=(IγiwiKXiKXi)(f^i1bf)+γiwiϵiKXi.\widehat{f}^{b}_{i}-f^{\ast}=(I-\gamma_{i}\,w_{i}\,K_{X_{i}}\otimes K_{X_{i}})\,(\widehat{f}^{b}_{i-1}-f^{\ast})+\gamma_{i}\,w_{i}\,\epsilon_{i}\,K_{X_{i}}.

Since wiw_{i} has a unit mean, we have an important identity Σ=𝔼(wnKXnKXn)\Sigma=\mathbb{E}(w_{n}K_{X_{n}}\otimes K_{X_{n}}). Similar to equation (3.1)-(3.4), due to this key identity, we can still recursively define the leading bootstrapped bias term through

\eta_{0}^{b,bias,0}=\widehat{f}_{0}^{b}-f^{\ast}=-f^{\ast}\quad\mbox{and}\quad\eta_{i}^{b,bias,0}=(I-\gamma_{i}\,\Sigma)\,\eta_{i-1}^{b,bias,0}\quad\mbox{for}\quad i=1,2,\ldots, (4.2)

which coincides with the original leading bias term, i.e. ηib,bias,0ηibias,0{\eta}_{i}^{b,bias,0}\equiv{\eta}_{i}^{bias,0}; and the leading bootstrapped noise term through

η0b,noise,0=0andηib,noise,0=(IγiΣ)ηi1b,noise,0+γiwiϵiKXi,fori=1,2,,\displaystyle\eta_{0}^{b,noise,0}=0\quad\mbox{and}\quad\eta_{i}^{b,noise,0}=(I-\gamma_{i}\,\Sigma)\eta^{b,noise,0}_{i-1}+\gamma_{i}\,w_{i}\,\epsilon_{i}\,K_{X_{i}},\quad\mbox{for}\quad i=1,2,\ldots,

so that a similar decomposition as in equation (3.4) holds,

f^ibf=ηib,bias,0leading bias+ηib,noise,0leading noise+(f^ibfηib,bias,0ηib,noise,0)remainder termfori=1,2,.\widehat{f}^{b}_{i}-f^{\ast}=\ \underbrace{\eta_{i}^{b,bias,0}}_{\text{leading bias}}\ +\ \underbrace{\eta_{i}^{b,noise,0}}_{\text{leading noise}}\ +\ \ \underbrace{\big{(}\widehat{f}^{b}_{i}-f^{\ast}-\eta_{i}^{b,bias,0}-\eta_{i}^{b,noise,0}\big{)}}_{\text{remainder term}}\quad\mbox{for}\quad i=1,2,\ldots. (4.3)

Correspondingly, we define \bar{\eta}_{i}^{b,bias,0}=i^{-1}\sum_{j=1}^{i}\eta_{j}^{b,bias,0} and \bar{\eta}_{i}^{b,noise,0}=i^{-1}\sum_{j=1}^{i}\eta_{j}^{b,noise,0} as the leading bootstrapped bias and noise terms, respectively, in the bootstrapped functional SGD estimator.

Notice that η¯ib,bias,0\bar{\eta}_{i}^{b,bias,0} also coincides with the original leading bias term η¯ibias,0\bar{\eta}_{i}^{bias,0}, i.e. η¯ib,bias,0η¯ibias,0\bar{\eta}_{i}^{b,bias,0}\equiv\bar{\eta}_{i}^{bias,0}. Therefore, η¯ib,bias,0\bar{\eta}_{i}^{b,bias,0} has the same explicit expression as equation (3.5); while the leading bootstrapped noise term η¯ib,noise,0\bar{\eta}_{i}^{b,noise,0} has a slightly different expression that incorporates the bootstrap multipliers as

η¯nb,noise,0(x)=\displaystyle\bar{\eta}_{n}^{b,noise,0}(x)= 1nk=1nwkϵkΩn,k(x),x𝒳,\displaystyle\,\frac{1}{n}\sum_{k=1}^{n}w_{k}\cdot\epsilon_{k}\cdot\Omega_{n,k}(x),\quad\forall x\in\mathcal{X}, (4.4)

where recall that Ωn,k()\Omega_{n,k}(\cdot) is defined in equation (3.6) and only depends on XkX_{k}. By taking the difference between η¯nb,noise,0\bar{\eta}_{n}^{b,noise,0} and η¯nnoise,0\bar{\eta}_{n}^{noise,0}, we obtain

η¯nb,noise,0(x)η¯nnoise,0(x)=1nk=1n(wk1)ϵkΩn,k(x),x𝒳.\displaystyle\bar{\eta}_{n}^{b,noise,0}(x)-\bar{\eta}_{n}^{noise,0}(x)=\frac{1}{n}\sum_{k=1}^{n}(w_{k}-1)\cdot\epsilon_{k}\cdot\Omega_{n,k}(x),\quad\forall x\in\mathcal{X}. (4.5)

This expression also takes the form of a weighted and non-identically distributed empirical process with “effective” noises \big{\{}(w_{i}-1)\epsilon_{i}\big{\}}_{i=1}^{n}. Since w_{i} has unit mean and variance, these effective noises have the same first two moments as the original noises \{\epsilon_{i}\}_{i=1}^{n}, suggesting that the difference \bar{\eta}_{n}^{b,noise,0}(\cdot)-\bar{\eta}_{n}^{noise,0}(\cdot), conditioning on the data \{(X_{i},\,Y_{i})\}_{i=1}^{n}, tends to capture the random pattern of the original leading noise term \bar{\eta}_{n}^{noise,0}(\cdot), leading to the so-called bootstrap consistency as formally stated in the theorem below.

Assumption A3.

For i=1,,ni=1,\dots,n, bootstrap multipliers wiw_{i}s are i.i.d. samples of a random variable WW that satisfies 𝔼(W)=1\mathbb{E}(W)=1, Var(W)=1\operatorname{{\rm Var}}(W)=1 and (|W|t)2exp(t2/C)\mathbb{P}(|W|\geq t)\leq 2\exp(-t^{2}/C) for all t0t\geq 0 with a constant C>0C>0.

One simple example that satisfies Assumption A3 is W\sim N(1,1). A second example is a bounded random variable, such as a uniform random variable on the interval [1-\sqrt{3},\,1+\sqrt{3}], which has mean one and variance one. Another popular choice in practice is a discrete random variable, for example W with \mathbb{P}(W=2)=\mathbb{P}(W=0)=1/2, which also has unit mean and unit variance.
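As a quick sanity check of Assumption A3 for these choices (our own illustrative snippet, not part of the procedure), one can verify the unit mean and unit variance empirically:

import numpy as np

rng = np.random.default_rng(1)
m = 10 ** 6
candidates = {
    "normal N(1,1)":                rng.normal(1.0, 1.0, m),
    "uniform on [1-sqrt3,1+sqrt3]": rng.uniform(1 - np.sqrt(3), 1 + np.sqrt(3), m),
    "two-point {0,2}":              rng.choice([0.0, 2.0], m),
}
for name, w in candidates.items():
    print(name, round(w.mean(), 3), round(w.var(), 3))   # both should be close to 1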

Let 𝒟n:={Xi,Yi}i=1n\mathcal{D}_{n}:\,=\{X_{i},Y_{i}\}_{i=1}^{n} denote the data of sample size nn, and ()=(|𝒟n)\mathbb{P}^{*}(\cdot)=\mathbb{P}(\,\cdot\,|\,\mathcal{D}_{n}) denote the conditional probability measure given 𝒟n\mathcal{D}_{n}. We first establish the bootstrap consistency for local inference of leading noise term in the following Theorem 4.1.

Theorem 4.1 (Bootstrap consistency for local inference of leading noise term).

Assume that kernel KK satisfies Assumptions A1-A2 and multiplier weights {wi}i=1n\{w_{i}\}_{i=1}^{n} satisfy Assumption A3.

  1. 1.

    (Constant step size) Consider the step size γ(n)=γ\gamma(n)=\gamma with γ(0,nα33)\gamma\in(0,\,n^{-\frac{\alpha-3}{3}}) for some α>1\alpha>1. Then for any z0[0,1]z_{0}\in[0,1], we have with probability at least 1Cn11-Cn^{-1},

    supu|(n(nγ)1α(η¯nb,noise,0(z0)η¯nnoise,0(z0))u)(n(nγ)1αη¯nnoise,0(z0)u)|\displaystyle\sup_{u\in\mathbb{R}}\Big{|}\,\mathbb{P}^{*}\Big{(}\sqrt{n(n\gamma)^{-\frac{1}{\alpha}}}\,\big{(}\bar{\eta}_{n}^{b,noise,0}(z_{0})-\bar{\eta}_{n}^{noise,0}(z_{0})\big{)}\leq u\Big{)}-\,\mathbb{P}\Big{(}\sqrt{n(n\gamma)^{-\frac{1}{\alpha}}}\,\bar{\eta}_{n}^{noise,0}(z_{0})\leq u\Big{)}\Big{|}
    C(logn)3/2(n(nγ)1/α)1/6,\displaystyle\qquad\leq C^{\prime}(\log n)^{3/2}(n(n\gamma)^{-1/\alpha})^{-1/6},

    where C,CC,C^{\prime} are constants independent of nn.

  2. 2.

    (Non-constant step size) Consider the step size γi=iξ\gamma_{i}=i^{-\xi}, i=1,,ni=1,\dots,n, for some ξ(min{0,1α/3}, 1/2)\xi\in(\min\{0,1-\alpha/3\},\,1/2). Then the following bound holds with probability at least 12n11-2n^{-1},

    supu|(n(nγn)1α(η¯nb,noise,0(z0)η¯nnoise,0(z0))u)\displaystyle\sup_{u\in\mathbb{R}}\Big{|}\,\mathbb{P}^{*}\Big{(}\sqrt{n(n\gamma_{n})^{-\frac{1}{\alpha}}}\,\big{(}\bar{\eta}_{n}^{b,noise,0}(z_{0})-\bar{\eta}_{n}^{noise,0}(z_{0})\big{)}\leq u\Big{)}- (n(nγn)1αη¯nnoise,0(z0)u)|\displaystyle\,\mathbb{P}\Big{(}\sqrt{n(n\gamma_{n})^{-\frac{1}{\alpha}}}\,\bar{\eta}_{n}^{noise,0}(z_{0})\leq u\Big{)}\Big{|}
    C(logn)3/2n(nγn)3/(2α).\displaystyle\qquad\leq\frac{C^{\prime}(\log n)^{3/2}}{\sqrt{n(n\gamma_{n})^{-3/(2\alpha)}}}. (4.6)
Remark 4.1.

Recall that from (3.6) and (4.5), we can express η¯nnoise,0(z0)=1nk=1nϵkΩn,k(z0)\bar{\eta}_{n}^{noise,0}(z_{0})=\frac{1}{n}\sum_{k=1}^{n}\epsilon_{k}\cdot\Omega_{n,k}(z_{0}) and η¯nb,noise,0(z0)η¯nnoise,0(z0)=1nk=1n(wk1)ϵkΩn,k(z0)\bar{\eta}_{n}^{b,noise,0}(z_{0})-\bar{\eta}_{n}^{noise,0}(z_{0})=\frac{1}{n}\sum_{k=1}^{n}(w_{k}-1)\cdot\epsilon_{k}\cdot\Omega_{n,k}(z_{0}). Theorem 3.2 shows that k=1nϵkΩn,k(z0)\sum_{k=1}^{n}\epsilon_{k}\cdot\Omega_{n,k}(z_{0}) can be approximated by a normal distribution Φ(0,n1(nγn)1/ασz02)\Phi\big{(}0,n^{-1}(n\gamma_{n})^{1/\alpha}\sigma^{2}_{z_{0}}\big{)}. To prove Theorem 4.1, we introduce an intermediate empirical process evaluated at z0z_{0} as k=1n(ek1)ϵkΩn,k(z0)\sum_{k=1}^{n}(e_{k}-1)\cdot\epsilon_{k}\cdot\Omega_{n,k}(z_{0}) where eke_{k}’s are independent and identically distributed standard normal random variables, such that k=1n(ek1)ϵkΩn,k(z0)𝒟n\sum_{k=1}^{n}(e_{k}-1)\cdot\epsilon_{k}\cdot\Omega_{n,k}(z_{0})\mid\mathcal{D}_{n} has the same (conditional) variance as the (conditional) variance of (η¯nb,noise,0(z0)η¯nnoise,0(z0))𝒟n\big{(}\bar{\eta}_{n}^{b,noise,0}(z_{0})-\bar{\eta}_{n}^{noise,0}(z_{0})\big{)}\mid\mathcal{D}_{n}.

Theorem 4.2 (Bootstrap consistency for global inference of leading noise term).

Assume that kernel KK satisfies Assumptions A1-A2 and multiplier weights {wi}i=1n\{w_{i}\}_{i=1}^{n} satisfy Assumption A3.

  1. 1.

    (Constant step size) Consider the step size γ(n)=γ\gamma(n)=\gamma with γ(0,nα33)\gamma\in(0,\,n^{\frac{\alpha-3}{3}}) for some α>2\alpha>2. Then the following bound holds with probability at least 15n11-5n^{-1} (with respect to the randomness in data 𝒟n\mathcal{D}_{n})

    \sup_{u\in\mathbb{R}}\Big{|}\,\mathbb{P}^{*}\Big{(}\sqrt{n(n\gamma)^{-\frac{1}{\alpha}}}\,\|\,\bar{\eta}_{n}^{b,noise,0}-\bar{\eta}_{n}^{noise,0}\,\|_{\infty}\leq u\Big{)}-\,\mathbb{P}\Big{(}\sqrt{n(n\gamma)^{-\frac{1}{\alpha}}}\,\|\,\bar{\eta}_{n}^{noise,0}\,\|_{\infty}\leq u\Big{)}\Big{|}\leq C(\log n)^{3/2}\big{(}n(n\gamma)^{-3/\alpha}\big{)}^{-1/8}.
  2. 2.

    (Non-constant step size) Consider the step size γi=iξ\gamma_{i}=i^{-\xi}, i=1,,ni=1,\dots,n, for some ξ(min{0,1α/3}, 1/2)\xi\in(\min\{0,1-\alpha/3\},\,1/2). Then the following bound holds with probability at least 15n11-5n^{-1},

    supu|(n(nγn)1αη¯nb,noise,0η¯nnoise,0u)\displaystyle\sup_{u\in\mathbb{R}}\Big{|}\,\mathbb{P}^{*}\Big{(}\sqrt{n(n\gamma_{n})^{-\frac{1}{\alpha}}}\,\|\,\bar{\eta}_{n}^{b,noise,0}-\bar{\eta}_{n}^{noise,0}\,\|_{\infty}\leq u\Big{)}- (n(nγn)1αη¯nnoise,0u)|\displaystyle\,\mathbb{P}\Big{(}\sqrt{n(n\gamma_{n})^{-\frac{1}{\alpha}}}\,\|\,\bar{\eta}_{n}^{noise,0}\,\|_{\infty}\leq u\Big{)}\Big{|}
    C(logn)3/2(n(nγn)3/α)1/8.\displaystyle\quad\leq C(\log n)^{3/2}\big{(}n(n\gamma_{n})^{-3/\alpha}\big{)}^{-1/8}.
Remark 4.2.

Theorem 4.2 demonstrates that the sampling distribution of n(nγn)1/αη¯nnoise,0\sqrt{n(n\gamma_{n})^{-1/\alpha}}\|\bar{\eta}_{n}^{noise,0}\|_{\infty} can be approximated closely by the conditional distribution of n(nγn)1/αη¯nb,noise,0η¯nnoise,0\sqrt{n(n\gamma_{n})^{-1/\alpha}}\|\bar{\eta}_{n}^{b,noise,0}-\bar{\eta}_{n}^{noise,0}\|_{\infty} given data set 𝒟n\mathcal{D}_{n}. This theorem serves as the theoretical foundation for adopting the multiplier bootstrap method detailed in Section 4.1 for global inference. Recall that the optimal step size for achieving the minimax optimal estimation error is γ=n1α+1\gamma=n^{-\frac{1}{\alpha+1}} for the constant step size and γi=i1α+1\gamma_{i}=i^{-\frac{1}{\alpha+1}} for the non-constant step size (Theorem 3.2). To ensure that the Kolmogorov distance bound in Theorem 4.2 decays to 0 as nn\to\infty under these step sizes, we require α>2\alpha>2. It is likely that our current Kolmogorov distance bound, which is dominated by an error term that arises from applying the Gaussian approximation to analyze η¯nnoise,0\|\bar{\eta}_{n}^{noise,0}\|_{\infty} and η¯nb,noise,0η¯nnoise,0\|\bar{\eta}_{n}^{b,noise,0}-\bar{\eta}_{n}^{noise,0}\|_{\infty} through space-discretization (see Section 6.2), can be substantially refined. We leave this improvement of the Kolmogorov distance bound, which would consequently lead to a weaker requirement on α\alpha, to future research.

Since the leading noise terms η¯nnoise,0\bar{\eta}_{n}^{noise,0} and η¯nb,noise,0\bar{\eta}_{n}^{b,noise,0} contribute to the primary source of randomness in the functional SGD and its bootstrapped counterpart (Theorem 3.1), Theorem 4.2 then implies the bootstrap consistency for statistical inference of ff^{\ast} based on bootstrapped functional SGD. Particularly, we present the following Corollary, which establishes a high probability supremum norm bound for the remainder term in the bootstrapped functional SGD decomposition (4.3). Such a bound further implies that the sampling distribution of n(nγn)1/αf¯nf\sqrt{n(n\gamma_{n})^{-1/\alpha}}\|\bar{f}_{n}-f^{\ast}\|_{\infty} can be effectively approximated by the conditional distribution of n(nγn)1/αf¯nbf¯n\sqrt{n(n\gamma_{n})^{-1/\alpha}}\|\bar{f}^{b}_{n}-\bar{f}_{n}\|_{\infty} given data 𝒟n\mathcal{D}_{n}. Recall that we use ()=(|𝒟n)\mathbb{P}^{*}(\cdot)=\mathbb{P}(\,\cdot\,|\,\mathcal{D}_{n}) to denote the conditional probability measure given 𝒟n={Xi,Yi}i=1n\mathcal{D}_{n}=\{X_{i},Y_{i}\}_{i=1}^{n}.

Corollary 4.3 (Bootstrap consistency for functional SGD inference).

Assume that kernel KK satisfies Assumptions A1-A2 and multiplier weights {wi}i=1n\{w_{i}\}_{i=1}^{n} satisfies Assumption A3.

  1. 1.

    (Constant step size) Consider the step size γ(n)=γ\gamma(n)=\gamma with γ(0,nα33)\gamma\in(0,\,n^{\frac{\alpha-3}{3}}) for some α>2\alpha>2. Then it holds with probability at least 1γ1/4γ1/21/n1-\gamma^{1/4}-\gamma^{1/2}-1/n with respect to the randomness of 𝒟n\mathcal{D}_{n} that

    (f¯nbfnη¯nb,bias,0η¯nb,noise,0η¯nbias,0+η¯nnoise,02γ1/4(nγ)1/αn1)γ1/4+γ1/2+1/n.\mathbb{P}^{*}\Big{(}\|\,\bar{f}^{b}_{n}-f_{n}-\bar{\eta}_{n}^{b,bias,0}-\bar{\eta}_{n}^{b,noise,0}-\bar{\eta}_{n}^{bias,0}+\bar{\eta}_{n}^{noise,0}\,\|^{2}_{\infty}\geq\gamma^{1/4}(n\gamma)^{1/\alpha}n^{-1}\Big{)}\leq\gamma^{1/4}+\gamma^{1/2}+1/n. (4.7)

    Furthermore, for 0<\gamma<n^{-\frac{4}{7\alpha+1}}(\log n)^{-3/2}, it holds with probability at least 1-5n^{-1}-3\gamma^{1/2}-\gamma^{1/4} that

    supu|(n(nγ)1/αf¯nb\displaystyle\sup_{u\in\mathbb{R}}\Big{|}\,\mathbb{P}^{*}\Big{(}\sqrt{n(n\gamma)^{-1/\alpha}}\,\|\,\bar{f}^{b}_{n}\,- f¯nu)(n(nγ)1/αf¯nfBias(f)u)|\displaystyle\,\bar{f}_{n}\,\|_{\infty}\leq u\Big{)}-\mathbb{P}\Big{(}\sqrt{n(n\gamma)^{-1/\alpha}}\,\|\,\bar{f}_{n}-f^{\ast}-\mbox{Bias}(f^{\ast})\,\|_{\infty}\leq u\Big{)}\Big{|}
    C1(logn)3/2n1/8(nγ)3/(8α)+Cγ1/4\displaystyle\qquad\leq C_{1}(\log n)^{3/2}n^{-1/8}(n\gamma)^{3/(8\alpha)}+C\gamma^{1/4}

    where Bias(f)=η¯nbias,0\mbox{Bias}(f^{\ast})=\bar{\eta}_{n}^{bias,0} denotes the bias term, C1,C>0C_{1},C>0 are constants.

  2. 2.

    (Non-constant step size) Consider the step size γi=i1α+1\gamma_{i}=i^{-\frac{1}{\alpha+1}} for i=1,,ni=1,\dots,n. Then it holds with probability at least 1γn1/4γn1/21/n1-\gamma_{n}^{1/4}-\gamma_{n}^{1/2}-1/n that

    (f¯nbfη¯nb,bias,0η¯nb,noise,02γn1/4(nγn)1/αn1)γn1/4+γn1/2+1/n.\mathbb{P}^{*}\Big{(}\|\,\bar{f}^{b}_{n}-f^{\ast}-\bar{\eta}_{n}^{b,bias,0}-\bar{\eta}_{n}^{b,noise,0}\,\|^{2}_{\infty}\geq\gamma_{n}^{1/4}(n\gamma_{n})^{1/\alpha}n^{-1}\Big{)}\leq\gamma_{n}^{1/4}+\gamma_{n}^{1/2}+1/n.

    Furthermore, it holds with probability at least 15n1γn1/2γn1/41-5n^{-1}-\gamma_{n}^{1/2}-\gamma_{n}^{1/4} that

    supu|(n(nγn)1/αf¯nb\displaystyle\sup_{u\in\mathbb{R}}\Big{|}\,\mathbb{P}^{*}\Big{(}\sqrt{n(n\gamma_{n})^{-1/\alpha}}\,\|\,\bar{f}^{b}_{n}\,- f¯nu)(n(nγn)1/αf¯nfBias(f)u)|\displaystyle\,\bar{f}_{n}\,\|_{\infty}\leq u\Big{)}-\mathbb{P}\Big{(}\sqrt{n(n\gamma_{n})^{-1/\alpha}}\,\|\,\bar{f}_{n}-f^{\ast}-\mbox{Bias}(f^{\ast})\,\|_{\infty}\leq u\Big{)}\Big{|}
    (logn)3/2n1/8(nγn)38α+γn1/4.\displaystyle\qquad\lesssim(\log n)^{3/2}n^{-1/8}(n\gamma_{n})^{\frac{3}{8\alpha}}+\gamma_{n}^{1/4}.
Remark 4.3.

Corollary 4.3 suggests that a smaller step size γ\gamma (or γn\gamma_{n}) and a larger sample size nn result in more accurate uncertainty quantification. As discussed in Section 4.2, the functional SGD estimator and its bootstrap counterpart share the same leading bias term, which eliminates the bias in the conditional distribution of f¯nbf¯n\bar{f}_{n}^{b}-\bar{f}_{n} given 𝒟n\mathcal{D}_{n}. However, the bias term Bias(f)\mbox{Bias}(f^{\ast}) still exists in the sampling distribution of f¯nf\bar{f}_{n}-f^{\ast}. According to Theorem 3.1, this bias term can be bounded by O(1/nγ)O(1/\sqrt{n\gamma}) with high probability, while the convergence rate of the leading noise term under the supremum norm metric is of order O(1/n(nγ)1/α)O(1/\sqrt{n(n\gamma)^{-1/\alpha}}). Therefore, to make the bias term asymptotically negligible, we can adopt the common practice of “undersmoothing” [36, 37]. In our context, this means slightly enlarging the step size as γ=γ(n)=n1α+1+ε\gamma=\gamma(n)=n^{-\frac{1}{\alpha+1}+\varepsilon} (constant step size) or γi=i1α+1+ε\gamma_{i}=i^{-\frac{1}{\alpha+1}+\varepsilon} for i=1,,ni=1,\dots,n (non-constant step size), where ε\varepsilon is any small positive constant.

4.3 Online inference algorithm

Data: Number of bootstrap samples JJ, initial step size γ0>0\gamma_{0}>0, initial estimates f^0b,j=f^0\widehat{f}^{b,j}_{0}=\widehat{f}_{0}, j=1,,Jj=1,\dots,J, confidence level (1α)(1-\alpha).
for  i=1,2,,ni=1,2,\dots,n do
       Update f^i=f^i1γii(f^i1)\widehat{f}_{i}=\widehat{f}_{i-1}-\gamma_{i}\nabla\ell_{i}(\widehat{f}_{i-1})
       Update f¯i=(i1)f¯i1/i+f^i/i\bar{f}_{i}=(i-1)\bar{f}_{i-1}/i+\widehat{f}_{i}/i
       for j=1,,Jj=1,\dots,J do
             Update f^ib,j=f^i1b,jγiwi,ji(f^i1b,j)\widehat{f}^{b,j}_{i}=\widehat{f}^{b,j}_{i-1}-\gamma_{i}w_{i,j}\nabla\ell_{i}(\widehat{f}^{b,j}_{i-1})
             Update f¯ib,j=(i1)f¯i1b,j/i+f^ib,j/i\bar{f}^{b,j}_{i}=(i-1)\bar{f}^{b,j}_{i-1}/i+\widehat{f}^{b,j}_{i}/i.
       end for
      
end for
Output: SGD estimators f¯n\bar{f}_{n} and the Bootstrap estimates {f¯nb,j}j=1J\{\bar{f}^{b,j}_{n}\}_{j=1}^{J}. Calculate {f¯nb,jf¯n}j=1J\{\bar{f}^{b,j}_{n}-\bar{f}_{n}\}_{j=1}^{J}.
Construct the 100(1α)%100(1-\alpha)\% confidence interval for ff evaluated at any fixed z0z_{0} via
  1. 1.

    Normal CI: (f¯n(z0)zα/2Tnb(z0),f¯n(z0)+zα/2Tnb(z0))(\bar{f}_{n}(z_{0})-z_{\alpha/2}\sqrt{T_{n}^{b}(z_{0})},\,\bar{f}_{n}(z_{0})+z_{\alpha/2}\sqrt{T_{n}^{b}(z_{0})}), where Tnb(z0)=1J1j=1J(f¯nb,j(z0)f¯n(z0))2T_{n}^{b}(z_{0})=\frac{1}{J-1}\sum_{j=1}^{J}\big{(}\bar{f}_{n}^{b,j}(z_{0})-\bar{f}_{n}(z_{0})\big{)}^{2}.

  2. 2.

    Percentile CI: (f¯n(z0)Cα/2,f¯n(z0)+C1α/2)\big{(}\bar{f}_{n}(z_{0})-C_{\alpha/2},\,\bar{f}_{n}(z_{0})+C_{1-\alpha/2}\big{)}, where Cα/2C_{\alpha/2} and C1α/2C_{1-\alpha/2} are the sample α/2\alpha/2-th and (1α/2)(1-\alpha/2)-th quantile of {f¯nb,j(z0)f¯n(z0)}j=1J\{\bar{f}^{b,j}_{n}(z_{0})-\bar{f}_{n}(z_{0})\}_{j=1}^{J}.

Construct the 100(1α)%100(1-\alpha)\% confidence band for ff at any x𝒳x\in\mathcal{X}:
Step 1: Evenly choose t1,,tM𝒳t_{1},\dots,t_{M}\in\mathcal{X}.
Step 2: For j1,,Jj\in 1,\dots,J, calculate max1mM|f¯nb,j(tm)f¯n(tm)|.\max_{1\leq m\leq M}\big{|}\bar{f}_{n}^{b,j}(t_{m})-\bar{f}_{n}(t_{m})\big{|}.
Step 3: Calculate the sample α/2\alpha/2-th and the (1α/2)(1-\alpha/2)-th quantiles of
max1mM|f¯nb,1(tm)f¯n(tm)|,,max1mM|f¯nb,J(tm)f¯n(tm)|,\max_{1\leq m\leq M}\big{|}\bar{f}_{n}^{b,1}(t_{m})-\bar{f}_{n}(t_{m})\big{|},\dots,\max_{1\leq m\leq M}\big{|}\bar{f}_{n}^{b,J}(t_{m})-\bar{f}_{n}(t_{m})\big{|},
and denote them by Qα/2Q_{\alpha/2} and Q1α/2Q_{1-\alpha/2}.
Step 4: Construct the 100(1α)%100(1-\alpha)\% confidence band as {g:𝒳|g(x)[f¯n(x)Qα/2,f¯n(x)+Q1α/2],x𝒳}\big{\{}g:\,\mathcal{X}\to\mathbb{R}\,\big{|}\,g(x)\in[\bar{f}_{n}(x)-Q_{\alpha/2},\,\bar{f}_{n}(x)+Q_{1-\alpha/2}],\ \forall x\in\mathcal{X}\big{\}}.
Algorithm 1 (Online Bootstrap Confidence Band for Non-parametric Regression)

As we demonstrated in Corollary 4.3, the sampling distribution of \sqrt{n(n\gamma_{n})^{-1/\alpha}}(\bar{f}_{n}-f^{\ast}) can be effectively approximated by the conditional distribution of \sqrt{n(n\gamma_{n})^{-1/\alpha}}(\bar{f}_{n}^{b}-\bar{f}_{n}) given data \mathcal{D}_{n} using the bootstrapped functional SGD. This result provides a strong foundation for conducting online statistical inference based on the bootstrap. Specifically, we can run J bootstrapped functional SGD paths in parallel, producing J estimators \bar{f}_{n}^{b,j}=\frac{1}{n}\sum_{i=1}^{n}\widehat{f}_{i}^{b,j} for j=1,\dots,J with

\widehat{f}^{b,j}_{i}=\widehat{f}_{i-1}^{b,j}+\gamma_{i}w_{i,j}(Y_{i}-\langle\widehat{f}^{b,j}_{i-1},K_{X_{i}}\rangle_{\mathbb{H}})K_{X_{i}},\quad\mbox{for}\quad i=1,2,\ldots,

where w_{i,j} are i.i.d. bootstrap weights satisfying Assumption A3. Then we can approximate the sampling distribution of (\bar{f}_{n}-f^{\ast}) using the empirical distribution of \{\bar{f}^{b,j}_{n}-\bar{f}_{n},\,j=1,\dots,J\} conditioning on \mathcal{D}_{n}, and further construct the point-wise confidence intervals and simultaneous confidence band for f^{\ast}. We can also use the empirical variance of \{\bar{f}_{n}^{b,j},\,j=1,\dots,J\} to approximate the variance of \bar{f}_{n}. Based on these quantities, we can construct the point-wise confidence interval for f^{\ast}(x) at any fixed x\in\mathcal{X} in two ways:

  1. 1.

    Normal CI - given the sequence of bootstrapped estimators \bar{f}_{n}^{b,j}(x) for j=1,\dots,J, we calculate the variance as T_{n}^{b}(x)=\frac{1}{J-1}\sum_{j=1}^{J}\big{(}\bar{f}_{n}^{b,j}(x)-\bar{f}_{n}(x)\big{)}^{2}, and construct the 100(1-\alpha)\% confidence interval for f^{\ast}(x) as (\bar{f}_{n}(x)-z_{\alpha/2}\sqrt{T_{n}^{b}(x)},\,\bar{f}_{n}(x)+z_{\alpha/2}\sqrt{T_{n}^{b}(x)});

  2. 2.

    Percentile CI - given the sequence of bootstrapped estimators \bar{f}_{n}^{b,j}(x) for j=1,\dots,J, we calculate \{\bar{f}_{n}^{b,j}(x)-\bar{f}_{n}(x)\}_{j=1}^{J}, take its \alpha/2-th and (1-\alpha/2)-th quantiles as C_{\alpha/2} and C_{1-\alpha/2}, and then construct the 100(1-\alpha)\% CI for f^{\ast}(x) as \big{(}\bar{f}_{n}(x)-C_{\alpha/2},\,\bar{f}_{n}(x)+C_{1-\alpha/2}\big{)}.

To construct the simultaneous confidence band, we first choose a dense grid points t1,,tM𝒳t_{1},\dots,t_{M}\in\mathcal{X}; then for each j{1,,J}j\in\{1,\dots,J\}, we calculate max1mM|f¯nb,j(tm)f¯n(tm)|\max_{1\leq m\leq M}\big{|}\bar{f}_{n}^{b,j}(t_{m})-\bar{f}_{n}(t_{m})\big{|} to approximate supt|f¯nb,j(t)f¯n(t)|\sup_{t}|\bar{f}_{n}^{b,j}(t)-\bar{f}_{n}(t)|. Accordingly, we obtain the following JJ bootstrapped supremum norms:

max1mM|f¯nb,1(tm)f¯n(tm)|,max1mM|f¯nb,2(tm)f¯n(tm)|,,andmax1mM|f¯nb,J(tm)f¯n(tm)|.\max_{1\leq m\leq M}\big{|}\bar{f}_{n}^{b,1}(t_{m})-\bar{f}_{n}(t_{m})\big{|}\ ,\ \max_{1\leq m\leq M}\big{|}\bar{f}_{n}^{b,2}(t_{m})-\bar{f}_{n}(t_{m})\big{|}\ ,\ \dots\ ,\ \mbox{and}\ \max_{1\leq m\leq M}\big{|}\bar{f}_{n}^{b,J}(t_{m})-\bar{f}_{n}(t_{m})\big{|}. (4.8)

Denote the sample α/2\alpha/2-th and the (1α/2)(1-\alpha/2)-th quantiles of (4.8) as Qα/2Q_{\alpha/2} and Q1α/2Q_{1-\alpha/2}. Then we construct a 100(1α)%100(1-\alpha)\% confidence band for ff^{\ast} as {g:𝒳|g(x)[f¯n(x)Qα/2,f¯n(x)+Q1α/2],x𝒳}\big{\{}g:\,\mathcal{X}\to\mathbb{R}\,\big{|}\,g(x)\in[\bar{f}_{n}(x)-Q_{\alpha/2},\,\bar{f}_{n}(x)+Q_{1-\alpha/2}],\ \forall x\in\mathcal{X}\big{\}}.

Our online inference algorithm is computationally efficient, as it only requires one pass over the data, and the bootstrapped functional SGD can be computed in parallel. The detailed algorithm is summarized in Algorithm 1.
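To make Algorithm 1 concrete, a minimal self-contained sketch is given below. It is an illustration rather than a reference implementation: the Gaussian kernel, the N(1,1) multipliers, and the step-size exponent are placeholder choices, and the quantile conventions follow Algorithm 1 as stated.

import numpy as np

rng = np.random.default_rng(0)

def kernel(x, y, h=0.1):
    # placeholder Gaussian kernel
    return np.exp(-(x - y) ** 2 / (2 * h ** 2))

def online_bootstrap(X, Y, alpha_decay=2.0, J=200):
    # Online functional SGD with J multiplier-bootstrap paths (Algorithm 1).
    # All estimators are stored as kernel-expansion coefficients over X.
    n = len(X)
    gammas = np.arange(1, n + 1) ** (-1.0 / (alpha_decay + 1))   # non-constant steps
    beta = np.zeros(n)          # coefficients of f_hat
    beta_bar = np.zeros(n)      # coefficients of the averaged f_bar
    B = np.zeros((J, n))        # bootstrap coefficients, one row per replicate
    B_bar = np.zeros((J, n))
    for i in range(n):
        k_prev = kernel(X[:i], X[i])
        # original SGD path
        resid = Y[i] - np.dot(beta[:i], k_prev)
        beta[i] += gammas[i] * resid
        beta_bar += (beta - beta_bar) / (i + 1)
        # J bootstrapped paths, perturbed by multipliers w ~ N(1,1)
        w = rng.normal(1.0, 1.0, J)
        resid_b = Y[i] - B[:, :i] @ k_prev
        B[:, i] += gammas[i] * w * resid_b
        B_bar += (B - B_bar) / (i + 1)
    return beta_bar, B_bar

def percentile_ci(X, beta_bar, B_bar, z0, level=0.95):
    # Percentile CI at z0 from the J bootstrap deviations (Algorithm 1).
    k0 = kernel(X, z0)
    f_hat = np.dot(beta_bar, k0)
    dev = B_bar @ k0 - f_hat
    c_lo, c_hi = np.quantile(dev, [(1 - level) / 2, 1 - (1 - level) / 2])
    return f_hat - c_lo, f_hat + c_hi          # convention stated in Algorithm 1

def simultaneous_band(X, beta_bar, B_bar, grid, level=0.95):
    # Simultaneous band from the maximal absolute bootstrap deviation on a grid.
    Kg = kernel(X[:, None], grid[None, :])     # n x M kernel matrix
    f_hat = beta_bar @ Kg                      # f_bar evaluated on the grid
    max_dev = np.max(np.abs(B_bar @ Kg - f_hat), axis=1)
    q_lo, q_hi = np.quantile(max_dev, [(1 - level) / 2, 1 - (1 - level) / 2])
    return f_hat - q_lo, f_hat + q_hi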

5 Numerical Study

In this section, we test our proposed online inference approach via simulations. Concretely, we generate synthetic data in a streaming setting with a total sample size of nn. We use (Xt,Yt)(X_{t},Y_{t}) to represent the tt-th observed data point for t=1,,nt=1,\dots,n. We evaluate the performance of our proposed method as described in Algorithm 1 for constructing confidence intervals for f(x)f(x) at x=Xtx=X_{t} for t=501,1000,1500,2000,2500,3000,3500,4000t=501,1000,1500,2000,2500,3000,3500,4000, and compare our method with three existing alternative approaches, which we refer to as “offline” methods. “Offline” methods involve calculating the confidence intervals after all data have been collected, up to the tt-th observation’s arrival, which necessitates refitting the model each time new data arrive. We also evaluate the coverage probabilities of the simultaneous confidence bands constructed in Algorithm 1. We first enumerate the compared offline confidence interval methods as follows:

  1. (i)

    Offline Bayesian confidence interval (Offline BA) proposed in [38]: According to [39], a smoothing spline method corresponds to a Bayesian procedure when using a partially improper prior. Given this relationship between smoothing splines and Bayes estimates, confidence intervals can be derived from the posterior covariance function of the estimation. In practice, we implement Offline BA using the “gss” R package [28].

  2. (ii)

    Offline bootstrap normal interval (Offline BN) proposed in [40]: Let f^λ\widehat{f}_{\lambda} and σ^\widehat{\sigma} denote the estimates of ff and σ\sigma respectively, achieved by minimizing (5.1) with {Xi,Yi}i=1t\{X_{i},Y_{i}\}_{i=1}^{t} as below.

    i=1t(Yif(Xi))2+t2λ01(f′′(u))2𝑑u\sum_{i=1}^{t}(Y_{i}-f(X_{i}))^{2}+\frac{t}{2}\lambda\int_{0}^{1}(f^{{}^{\prime\prime}}(u))^{2}du (5.1)

    where λ\lambda is the roughness penalty and f′′(u)f^{{}^{\prime\prime}}(u) is the second derivative evaluated at uu. A bootstrap sample is generated from

    Y_{i}^{\dagger}=\widehat{f}_{\lambda}(X_{i})+\epsilon^{\dagger}_{i},\quad i=1,\dots,t

    where \epsilon^{\dagger}_{i}s are i.i.d. Gaussian white noise with variance \widehat{\sigma}^{2}. Based on the bootstrap sample, we calculate the bootstrap estimate \widehat{f}_{\lambda}^{\dagger}. Repeating this J times, we obtain a sequence of offline bootstrap estimates \widehat{f}_{\lambda}^{{\dagger},1},\dots,\widehat{f}_{\lambda}^{{\dagger},J}. We estimate the variance of \widehat{f}_{\lambda}(X_{t}) as T^{\dagger}_{t}=\frac{1}{J-1}\sum_{j=1}^{J}\big{(}\widehat{f}_{\lambda}^{{\dagger},j}(X_{t})-\widehat{f}_{\lambda}(X_{t})\big{)}^{2}. A 100(1-\alpha)\% offline normal bootstrap confidence interval for f(X_{t}) is then constructed as \big{(}\,\widehat{f}_{\lambda}(X_{t})-z_{\alpha/2}\sqrt{T^{\dagger}_{t}},\,\widehat{f}_{\lambda}(X_{t})+z_{\alpha/2}\sqrt{T^{\dagger}_{t}}\,\big{)}; a schematic sketch of this residual bootstrap is given right after this list.

  3. (iii)

    Offline bootstrap percentile interval (Offline BP): We apply the same data bootstrapping procedure as in Offline BN, which produces the estimate \widehat{f}^{\dagger}_{\lambda}(X_{t}) based on each bootstrap sample. The confidence interval is then constructed using the percentile method suggested in [41]. Specifically, let C^{\dagger}_{\alpha/2}(X_{t}) and C^{\dagger}_{1-\alpha/2}(X_{t}) represent the \alpha/2-th quantile and the (1-\alpha/2)-th quantile of the empirical distribution of \big{\{}\widehat{f}_{\lambda}^{{\dagger},j}(X_{t})-\widehat{f}_{\lambda}(X_{t})\big{\}}_{j=1}^{J}, respectively. A 100(1-\alpha)\% confidence interval for f(X_{t}) is then constructed as \big{(}\,\widehat{f}_{\lambda}(X_{t})-C^{\dagger}_{\alpha/2}(X_{t}),\,\widehat{f}_{\lambda}(X_{t})+C^{\dagger}_{1-\alpha/2}(X_{t})\,\big{)}.
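For concreteness, the residual-bootstrap mechanics behind Offline BN can be sketched as follows. This is our own schematic, not the implementation used in the experiments: scipy's UnivariateSpline stands in for the penalized fit in (5.1), and the choice of the smoothing parameter is omitted.

import numpy as np
from scipy.interpolate import UnivariateSpline
from scipy.stats import norm

def offline_bn_ci(X, Y, x0, J=500, level=0.95, s=None):
    # Residual bootstrap normal CI at x0; UnivariateSpline is a smoothing-spline surrogate.
    order = np.argsort(X)
    Xs, Ys = np.asarray(X)[order], np.asarray(Y)[order]
    fit = UnivariateSpline(Xs, Ys, s=s)
    sigma_hat = np.std(Ys - fit(Xs), ddof=1)          # estimated noise level
    boot = np.empty(J)
    for j in range(J):                                # refit on each synthetic sample
        Yb = fit(Xs) + np.random.normal(0.0, sigma_hat, len(Xs))
        boot[j] = UnivariateSpline(Xs, Yb, s=s)(x0)
    se = np.std(boot - fit(x0), ddof=1)
    z = norm.ppf(1 - (1 - level) / 2)
    return fit(x0) - z * se, fit(x0) + z * se

The loop makes the cost explicit: each of the J replicates requires a full refit on all t observations, which is the source of the cubic-in-t per-step cost discussed below.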

As tt increases, offline methods lead to a considerable increase in computational cost. For instance, Offline BA/BN theoretically has a total time complexity of order 𝒪(t4)\mathcal{O}(t^{4}) (with an 𝒪(t3)\mathcal{O}(t^{3}) cost at time tt). In contrast, online bootstrap confidence intervals are computed sequentially as new data points become available, making them well-suited for streaming data settings. They have a theoretical complexity of at most 𝒪(t2)\mathcal{O}(t^{2}) (with an 𝒪(t)\mathcal{O}(t) cost at time tt). We examine both the normal CI and percentile CI, as outlined in Algorithm 1, when constructing the confidence interval.

We examine the effects of various step size schemes. Specifically, we consider a constant step size γ=γ(t)=t1α+1\gamma=\gamma(t)=t^{-\frac{1}{\alpha+1}}, where tt represents the total sample size at which the CIs are constructed, and an online step size γi=i1α+1\gamma_{i}=i^{-\frac{1}{\alpha+1}} for i=1,,ti=1,\dots,t. A limitation of the constant-step size method is its dependency on prior knowledge of the total time horizon tt. Consequently, the estimator is only rate-optimal at the tt-th step. We assess our proposed online bootstrap confidence intervals in four different scenarios: (i) Online BNC, which uses a constant step size for the normal interval; (ii) Online BPC, which uses a constant step size for the percentile interval; (iii) Online BNN, which employs a non-constant step size for the normal interval; and (iv) Online BPN, which utilizes a non-constant step size for the percentile interval.

We generate our data as i.i.d. copies of random variables (X,Y)(X,Y), where XX is drawn from a uniform distribution in the interval (0,1)(0,1), and Y=f(X)+ϵY=f(X)+\epsilon. Here ff is the unknown regression function to be estimated, ϵ\epsilon represents Gaussian white noise with a variance of 0.20.2. We consider the following three cases of f=ff=f_{\ell}, =1,2,3\ell=1,2,3:

Case 1: f1(x)=sin(3πx/2),\displaystyle f_{1}(x)=\sin(3\pi x/2),
Case 2: f2(x)=13β10,5(x)+13β7,7(x)+13β5,10(x),\displaystyle f_{2}(x)=\frac{1}{3}\beta_{10,5}(x)+\frac{1}{3}\beta_{7,7}(x)+\frac{1}{3}\beta_{5,10}(x),
Case 3: f3(x)=619β30,17(x)+410β3,11(x).\displaystyle f_{3}(x)=\frac{6}{19}\beta_{30,17}(x)+\frac{4}{10}\beta_{3,11}(x).

Here, \beta_{p,q}(x)=\frac{x^{p-1}(1-x)^{q-1}}{B(p,q)} with B(p,q)=\frac{\Gamma(p)\Gamma(q)}{\Gamma(p+q)} denoting the beta function, and \Gamma is the gamma function, with \Gamma(p)=(p-1)! when p\in\mathbb{N}_{+}. Cases 2 and 3 are designed to mimic the increasingly complex “truth” scenarios similar to the settings in [38, 40].
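The three regression functions above can be reproduced directly from beta densities; the short snippet below is our own sketch of the simulation design (scipy's beta.pdf supplies \beta_{p,q}, and the noise variance 0.2 matches the setting described earlier):

import numpy as np
from scipy.stats import beta

f1 = lambda x: np.sin(3 * np.pi * x / 2)
f2 = lambda x: (beta.pdf(x, 10, 5) + beta.pdf(x, 7, 7) + beta.pdf(x, 5, 10)) / 3
f3 = lambda x: (6 / 19) * beta.pdf(x, 30, 17) + (4 / 10) * beta.pdf(x, 3, 11)

def simulate(n, f, noise_var=0.2, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, n)
    Y = f(X) + rng.normal(0.0, np.sqrt(noise_var), n)
    return X, Y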

We draw training data of size n=3000n=3000 from these models. In our online approaches, we first use 500 data points to build an initial estimate and then employ SGD to derive online estimates from the 501st to the 3000th data point. Given that our framework is designed for online settings, we can construct the confidence band based on the datasets of size 501501, 10001000, 15001500, 20002000, 25002500 and 30003000, i.e., using the averaged estimators f¯t\bar{f}_{t} at t=501,1000,1500,2000,2500,3000t=501,1000,1500,2000,2500,3000. We repeat the data generation process 200200 times for each case. For each replicate, upon the arrival of a new data point, we apply the proposed multiplier bootstrap method for online inference, using 500500 bootstrap samples (i.e., J=500J=500 in Algorithm 1) with bootstrap weight WW generated from a normal distribution with mean 11 and standard deviation 11. We then construct 95%95\% confidence intervals based on Algorithm 1. Our results will show the coverage and distribution of the lengths of the confidence intervals built at t=501,1000,1500,2000,2500t=501,1000,1500,2000,2500, and 30003000.

[Figure 2 panels. Case 1: (A1) Coverage, (A2) Length of Confidence Interval. Case 2: (B1) Coverage, (B2) Length of Confidence Interval. Case 3: (C1) Coverage, (C2) Length of Confidence Interval.]
Figure 2: We compare seven CI construction approaches: Offline Bayesian confidence interval (Offline BA), Offline bootstrap normal interval (Offline BN), Offline bootstrap percentile interval (Offline BP), Online bootstrap normal interval with constant step size (Online BNC), Online bootstrap percentile interval with constant step size (Online BPC), Online bootstrap normal interval with non-constant step size (Online BNN), and Online bootstrap percentile interval with non-constant step size (Online BPN). The coverage of the CIs is shown in (A1), (B1), (C1). The mean and variance of the CI lengths are represented by the solid center and the colored interval in (A2), (B2), (C2), respectively.
Figure 3: Confidence band constructed using an online bootstrap approach with a non-constant step size. Data are generated in three cases with sample sizes of 1000 (red), 2000 (blue), and 3000 (yellow). The colored band represents the confidence band, the solid black curve is the true function curve, and the colored curve is the estimated function curve based on SGD.

As shown in Figure 2, the coverage of all methods approaches the predetermined level of 95\% as t increases. The offline Bayesian method exhibits the lowest coverage of all. While it has the longest average confidence interval length in Cases 1-3, it also has the smallest variance in confidence interval lengths. The offline bootstrap-based methods demonstrate higher coverage and shorter average confidence interval lengths than the offline Bayesian method. The variance in confidence interval lengths for the bootstrap-based methods is larger, due to the bootstrap multiplier resampling procedure and the step sizes used in our proposed online bootstrap procedures. As the sample size grows, the variance in confidence interval length diminishes for all methods. Our online bootstrap procedure with a non-constant step size outperforms the others in both the average length and the variance of the confidence interval: it offers the shortest average confidence interval length and the smallest variance, compared with the Bayesian confidence interval, the offline bootstrap methods, and the online bootstrap procedure with a constant step size. Moreover, the online bootstrap method with a non-constant step size reaches the predetermined coverage level of 95\% more quickly than the other methods. Due to computational costs, we only tested our own methods (Online BNN and Online BPN) at the increased horizons t=3500 and t=4000. As observed in Figure 2 (A1), (B1), and (C1), the coverage stabilizes at the predetermined level of 95\%. We also use our proposed online bootstrap method, as outlined in Algorithm 1, to construct a 95\% confidence band with step size \gamma_{i}=i^{-\frac{1}{\alpha+1}} at n=1000,2000,3000. As seen in Figure 3, the average width of the confidence band decreases as the sample size increases for Cases 1-3, and all bands cover the true function curve represented by the solid black curve, indicating that the accuracy of our confidence band estimates improves with a larger sample size.

[Figure 4 panels: (A) Cumulative Computation Time; (B) Current Computation Time.]
Figure 4: (A) Cumulative computation time recorded as t increases from 501 to 4000. (B) Computation time for constructing the confidence interval recorded at different time points; the computation time is shown on the y-axis on a scaled interval to differentiate the blue and green curves.

Finally, we compared the computational time of the various methods for constructing confidence intervals on a workstation equipped with a 32-core 3.50 GHz AMD Threadripper Pro 3975WX CPU and 64GB RAM. We recorded the computational times as data points 501,1000,\dots,4000 arrived and calculated the cumulative computational times up to t=501,1000,\cdots,4000 for both offline and online algorithms. The normal and percentile bootstrap methods displayed similar computational times, so we report the computational time of the percentile bootstrap interval for both offline and online approaches. Despite leveraging parallel computing to accelerate the bootstrap procedures, the offline bootstrap algorithms still demanded significant computation time. This is attributed to the need to refit the model each time a new data point arrives, which substantially raises the computational expense. The computational complexity of offline methods for computing the estimate of f at time t is \mathcal{O}(t^{3}), leading to a cumulative computational complexity of order \mathcal{O}(t^{4}). Including the bootstrap cost over J replicates, the total computational complexity at time t becomes \mathcal{O}(Jt^{3}), leading to a cumulative computational complexity of order \mathcal{O}(Jt^{4}). As shown in Figure 4, the cumulative computational time reaches approximately 60 hours for the offline bootstrap method and around 8 hours for the offline Bayesian method. Conversely, the cumulative computational time for our proposed bootstrap method grows almost linearly with t, and requires less than 30 minutes up to t=4000. At t=4000, the offline bootstrap methods take about 200 seconds, and the Bayesian confidence interval requires roughly 30 seconds to construct the confidence interval. Our proposed online bootstrap method requires fewer than 3 seconds, demonstrating its potential for time-sensitive applications such as medical diagnosis and treatment, financial trading, and traffic management, where real-time decision-making is essential as data continuously flow in.

6 Proof Sketch of Main Results

In this section, we present sketched proofs for the expansion of the functional SGD estimator relative to the supremum norm (Theorem 3.1) and the bootstrap consistency for global inference (Theorem 4.2), while highlighting some important technical details and key steps.

6.1 Proof sketch for estimator expansion under supremum norm metric

Theorem 3.1 establishes a high-probability supremum norm bound for the higher-order expansion of the SGD estimator. This result is crucial for the inference framework, as it allows us to focus on the distributional behavior of the leading terms given that the remainders are negligible. In the sketched proof, we denote \eta_{n}=\widehat{f}_{n}-f^{\ast}. According to (3.1), we have

ηn=(IγnKXnKXn)ηn1+γnϵnKXn.\eta_{n}=(I-\gamma_{n}K_{X_{n}}\otimes K_{X_{n}})\eta_{n-1}+\gamma_{n}\epsilon_{n}K_{X_{n}}.

We split the recursion of ηn\eta_{n} into two finer recursions: bias recursion of ηnbias\eta_{n}^{bias} and noise recursion of ηnnoise\eta_{n}^{noise} such that ηn=ηnbias+ηnnoise\eta_{n}=\eta_{n}^{bias}+\eta_{n}^{noise}, where

bias recursion: ηnbias=(IγnKXnKXn)ηn1biaswithη0bias=f;\displaystyle\eta_{n}^{bias}=(I-\gamma_{n}K_{X_{n}}\otimes K_{X_{n}})\eta^{bias}_{n-1}\quad\quad\textrm{with}\quad\eta_{0}^{bias}=-f^{*};
noise recursion: ηnnoise=(IγnKXnKXn)ηn1noise+γnϵnKXnwithη0noise=0.\displaystyle\eta^{noise}_{n}=(I-\gamma_{n}K_{X_{n}}\otimes K_{X_{n}})\eta^{noise}_{n-1}+\gamma_{n}\epsilon_{n}K_{X_{n}}\quad\quad\textrm{with}\quad\eta_{0}^{noise}=0.

To proceed, we further decompose the bias recursion into two parts: (1) the leading bias recursion ηnbias,0\eta_{n}^{bias,0}; and (2) the remainder bias recursion ηnbiasηnbias,0\eta_{n}^{bias}-\eta_{n}^{bias,0} as follows:

ηnbias,0=\displaystyle\eta_{n}^{bias,0}= (IγnΣ)ηn1bias,0withη0bias=f;\displaystyle\,(I-\gamma_{n}\Sigma)\eta_{n-1}^{bias,0}\quad\quad\textrm{with}\quad\eta_{0}^{bias}=-f^{*};
ηnbiasηnbias,0=\displaystyle\eta_{n}^{bias}-\eta_{n}^{bias,0}= (IγnKXnKXn)(ηn1biasηn1bias,0)+γn(ΣKXnKXn)ηn1bias,0.\displaystyle\,(I-\gamma_{n}K_{X_{n}}\otimes K_{X_{n}})(\eta_{n-1}^{bias}-\eta_{n-1}^{bias,0})+\gamma_{n}(\Sigma-K_{X_{n}}\otimes K_{X_{n}})\eta_{n-1}^{bias,0}.

It is worth noting that the leading bias recursion essentially replaces KXnKXnK_{X_{n}}\otimes K_{X_{n}} by its expectation Σ=𝔼[KXnKXn]\Sigma=\mathbb{E}[K_{X_{n}}\otimes K_{X_{n}}].

To bound the residual term η¯nbiasη¯nbias,0\|\bar{\eta}_{n}^{bias}-\bar{\eta}_{n}^{bias,0}\|_{\infty} associated with the leading bias term of the averaged estimator, we introduce an augmented RKHS space (with a[0, 1/21/(2α)]a\in[0,\,1/2-1/(2\alpha)])

a={f=ν=1fνϕνν=1fν2μν2a1<}\mathbb{H}_{a}=\Big{\{}f=\sum_{\nu=1}^{\infty}f_{\nu}\phi_{\nu}\,\mid\,\sum_{\nu=1}^{\infty}f_{\nu}^{2}\mu_{\nu}^{2a-1}<\infty\Big{\}} (6.1)

equipped with the kernel function K^{a}(x,y)=\sum_{\nu=1}^{\infty}\phi_{\nu}(x)\phi_{\nu}(y)\mu_{\nu}^{1-2a}. To verify that K^{a}(\cdot,\cdot) is the reproducing kernel of \mathbb{H}_{a}, we notice that

Kxaa2=ν=1μν12aϕν(x)ϕνa2=ν=1(ϕν(x)μν12a)2μν2a1=ν=1ϕν2(x)μν12a<ca2,\|K_{x}^{a}\|_{a}^{2}=\|\sum_{\nu=1}^{\infty}\mu_{\nu}^{1-2a}\phi_{\nu}(x)\phi_{\nu}\|_{a}^{2}=\sum_{\nu=1}^{\infty}(\phi_{\nu}(x)\mu_{\nu}^{1-2a})^{2}\mu_{\nu}^{2a-1}=\sum_{\nu=1}^{\infty}\phi^{2}_{\nu}(x)\mu_{\nu}^{1-2a}<c^{2}_{a},

where cac_{a} is a constant. Moreover, Kxa()K_{x}^{a}(\cdot) also satisfies the reproducing property since

Kxa,fa=\displaystyle\langle K_{x}^{a},f\rangle_{a}= ν=1μν12aϕν(x)ϕν,fa=ν=1ϕν(x)μν12afνϕν,ϕνa=ν=1fνϕν(x)=f(x).\displaystyle\langle\sum_{\nu=1}^{\infty}\mu_{\nu}^{1-2a}\phi_{\nu}(x)\phi_{\nu},f\rangle_{a}=\sum_{\nu=1}^{\infty}\phi_{\nu}(x)\mu_{\nu}^{1-2a}f_{\nu}\langle\phi_{\nu},\phi_{\nu}\rangle_{a}=\sum_{\nu=1}^{\infty}f_{\nu}\phi_{\nu}(x)=f(x).

For any f\in\mathbb{H}\subset\mathbb{H}_{a}, we can use the above reproducing property to bound the supremum norm of f as \|f\|_{\infty}=\sup_{x\in[0,1]}|f(x)|=|\langle K_{x}^{a},f\rangle_{a}|\leq\|f\|_{a}\cdot\|K_{x}^{a}\|_{a}<c_{a}\|f\|_{a}. Also note that for any f\in\mathbb{H}, \|f\|^{2}_{a}=\sum_{\nu=1}^{\infty}f_{\nu}^{2}\mu_{\nu}^{2a-1}\leq\sum_{\nu=1}^{\infty}f_{\nu}^{2}\mu_{\nu}^{-1}=\|f\|^{2}_{\mathbb{H}} for a\geq 0; therefore, we have the relationship \|f\|_{\infty}\leq c_{a}\|f\|_{a}\leq c_{a}\|f\|_{\mathbb{H}}, meaning that \|\cdot\|_{a} provides a tighter bound for the supremum norm than \|\cdot\|_{\mathbb{H}}. In Section 8.2 (Lemma 8.1), we use this augmented RKHS to show that the bias remainder term satisfies \|\bar{\eta}_{n}^{bias}-\bar{\eta}_{n}^{bias,0}\|^{2}_{\infty}=o(\|\bar{\eta}_{n}^{bias,0}\|^{2}_{\infty}) by computing the expectation \mathbb{E}\big{[}\|\bar{\eta}_{n}^{bias}-\bar{\eta}_{n}^{bias,0}\|^{2}_{a}\big{]} and applying the Markov inequality.
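To see why \|K_{x}^{a}\|_{a} is finite in the stated range of a, a short calculation helps; here we assume, as is standard under Assumptions A1-A2, polynomial eigendecay \mu_{\nu}\asymp\nu^{-\alpha} and uniformly bounded eigenfunctions \sup_{\nu}\|\phi_{\nu}\|_{\infty}\leq c_{\phi} (these exact forms are our reading of the assumptions, not a quotation of them):

\|K_{x}^{a}\|_{a}^{2}=\sum_{\nu=1}^{\infty}\phi_{\nu}^{2}(x)\,\mu_{\nu}^{1-2a}\;\lesssim\;c_{\phi}^{2}\sum_{\nu=1}^{\infty}\nu^{-\alpha(1-2a)}\;<\;\infty\quad\text{whenever}\quad\alpha(1-2a)>1,\ \text{i.e.}\ a<\tfrac{1}{2}-\tfrac{1}{2\alpha},

so for a strictly below the endpoint 1/2-1/(2\alpha) the constant c_{a} can be taken proportional to c_{\phi}\big{(}\sum_{\nu}\nu^{-\alpha(1-2a)}\big{)}^{1/2}.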

For the noise recursion of ηnnoise\eta_{n}^{noise}, we can similarly split it into the leading noise recursion term and residual noise recursion term as

ηnnoise,0=\displaystyle\eta_{n}^{noise,0}= (IγnΣ)ηn1noise,0+γnϵnKXnwithη0noise,0=0;\displaystyle\,(I-\gamma_{n}\Sigma)\eta^{noise,0}_{n-1}+\gamma_{n}\epsilon_{n}K_{X_{n}}\quad\quad\textrm{with}\quad\eta_{0}^{noise,0}=0;
ηnnoiseηnnoise,0=\displaystyle\eta_{n}^{noise}-\eta_{n}^{noise,0}= (IγnKXnKXn)(ηn1noiseηn1noise,0)+γn(ΣKXnKXn)ηn1noise,0.\displaystyle\,(I-\gamma_{n}K_{X_{n}}\otimes K_{X_{n}})(\eta_{n-1}^{noise}-\eta_{n-1}^{noise,0})+\gamma_{n}(\Sigma-K_{X_{n}}\otimes K_{X_{n}})\eta_{n-1}^{noise,0}.

The leading noise recursion is described as a “semi-stochastic” recursion induced by \eta_{n}^{noise} in [12], since it keeps the randomness in the noise recursion \eta_{n}^{noise} due to the noise \{\epsilon_{i}\}_{i=1}^{n}, but gets rid of the randomness arising from K_{X_{n}}\otimes K_{X_{n}}, which is due to the random design \{X_{i}\}_{i=1}^{n}.

For the residual noise recursion, directly bounding \|\bar{\eta}_{n}^{noise}-\bar{\eta}_{n}^{noise,0}\|_{\infty} is difficult. Instead, we follow [12] by further decomposing \eta_{n}^{noise}-\eta_{n}^{noise,0} into a sequence of higher-order “semi-stochastic” recursions as follows. We first define a semi-stochastic recursion induced by \eta_{n}^{noise}-\eta_{n}^{noise,0}, denoted as \eta_{n}^{noise,1}:

ηnnoise,1=(IγnΣ)ηn1noise,1+γn(ΣKXnKXn)ηn1noise,0.\eta_{n}^{noise,1}=(I-\gamma_{n}\Sigma)\eta_{n-1}^{noise,1}+\gamma_{n}(\Sigma-K_{X_{n}}\otimes K_{X_{n}})\eta_{n-1}^{noise,0}. (6.2)

Here, ηnnoise,1\eta_{n}^{noise,1} replaces the random operator KXnKXnK_{X_{n}}\otimes K_{X_{n}} with its expectation Σ\Sigma in the residual noise recursion for ηnnoiseηnnoise,0\eta_{n}^{noise}-\eta_{n}^{noise,0}, and can be viewed as a second-order term in the expansion of the noise recursion, or the leading remainder noise term. The rest noise remainder parts can be expressed as

ηnnoiseηnnoise,0ηnnoise,1=(IγnKXnKXn)(ηn1noiseηn1noise,0)(IγnΣ)ηn1noise,1\displaystyle\,\eta_{n}^{noise}-\eta_{n}^{noise,0}-\eta_{n}^{noise,1}=(I-\gamma_{n}K_{X_{n}}\otimes K_{X_{n}})(\eta_{n-1}^{noise}-\eta_{n-1}^{noise,0})-(I-\gamma_{n}\Sigma)\eta_{n-1}^{noise,1}
=(IγnKXnKXn)(ηn1noiseηn1noise,0ηn1noise,1)+γn(ΣKXnKXn)ηn1noise,1.\displaystyle\qquad=(I-\gamma_{n}K_{X_{n}}\otimes K_{X_{n}})(\eta_{n-1}^{noise}-\eta_{n-1}^{noise,0}-\eta_{n-1}^{noise,1})+\gamma_{n}(\Sigma-K_{X_{n}}\otimes K_{X_{n}})\eta_{n-1}^{noise,1}.

Then we can further define a semi-stochastic recursion induced by ηnnoiseηnnoise,0ηnnoise,1\eta_{n}^{noise}-\eta_{n}^{noise,0}-\eta_{n}^{noise,1}, and repeat this process. If we define nr=(ΣKXnKXn)ηn1noise,r1\mathcal{E}_{n}^{r}=(\Sigma-K_{X_{n}}\otimes K_{X_{n}})\,\eta_{n-1}^{noise,r-1} for r1r\geq 1, then we can expand ηnnoise\eta_{n}^{noise} into (r+2)(r+2) terms as

ηnnoise=ηnnoise,0+ηnnoise,1+ηnnoise,2++ηnnoise,r+Remainder,\eta_{n}^{noise}=\eta_{n}^{noise,0}+\eta_{n}^{noise,1}+\eta_{n}^{noise,2}+\cdots+\eta_{n}^{noise,r}+\textrm{Remainder},

where \eta_{n}^{noise,d}=(I-\gamma_{n}\Sigma)\eta_{n-1}^{noise,d}+\gamma_{n}\mathcal{E}_{n}^{d} for 1\leq d\leq r. The Remainder term \eta_{n}^{noise}-\sum_{d=0}^{r}\eta_{n}^{noise,d} also has a recursive characterization:

\eta_{n}^{noise}-\sum_{d=0}^{r}\eta_{n}^{noise,d}=(I-\gamma_{n}K_{X_{n}}\otimes K_{X_{n}})\Big{(}\eta_{n-1}^{noise}-\sum_{d=0}^{r}\eta_{n-1}^{noise,d}\Big{)}+\gamma_{n}\mathcal{E}_{n}^{r+1}. (6.3)

To establish the supremum norm bound of \bar{\eta}_{n}^{noise}-\bar{\eta}_{n}^{noise,0}, the idea is to show that \bar{\eta}_{n}^{noise,r} decays as r increases, that is, to prove

η¯nnoise=η¯nnoise,0+η¯nnoise,1=o(η¯nnoise,0)+η¯nnoise,2=o(η¯nnoise,1)++η¯nnoise,r=o(η¯nnoise,r1)+η¯nnoisei=0rη¯nnoise,inegligible.\bar{\eta}_{n}^{noise}=\bar{\eta}_{n}^{noise,0}+\underbrace{\bar{\eta}_{n}^{noise,1}}_{=o(\bar{\eta}_{n}^{noise,0})}+\underbrace{\bar{\eta}_{n}^{noise,2}}_{=o(\bar{\eta}_{n}^{noise,1})}+\dots+\underbrace{\bar{\eta}_{n}^{noise,r}}_{=o(\bar{\eta}_{n}^{noise,r-1})}+\underbrace{\bar{\eta}_{n}^{noise}-\sum_{i=0}^{r}\bar{\eta}_{n}^{noise,i}}_{negligible}.

Concretely, we consider the constant step size case for simplicity of presentation. By accumulating the effects of the iterations, we can further express \eta_{n}^{noise,1} as

ηnnoise,1=\displaystyle\eta_{n}^{noise,1}= γi=1n1(IγΣ)ni1(ΣKXi+1KXi+1)ηinoise,0\displaystyle\gamma\sum_{i=1}^{n-1}(I-\gamma\Sigma)^{n-i-1}\big{(}\Sigma-K_{X_{i+1}}\otimes K_{X_{i+1}}\big{)}\eta_{i}^{noise,0}
=\displaystyle= γ2i=1n1j=1iϵj(IγΣ)ni1(ΣKXi+1KXi+1)(IγΣ)ijKXj,\displaystyle\gamma^{2}\sum_{i=1}^{n-1}\sum_{j=1}^{i}\epsilon_{j}(I-\gamma\Sigma)^{n-i-1}\big{(}\Sigma-K_{X_{i+1}}\otimes K_{X_{i+1}}\big{)}(I-\gamma\Sigma)^{i-j}K_{X_{j}},

and accordingly, the averaged version is

η¯nnoise,1=1nj=1n1ϵjγ2[i=jn1(=in1(IγΣ)i)(ΣKXi+1KXi+1)(IγΣ)ijKXj]gj.\bar{\eta}_{n}^{noise,1}=\frac{1}{n}\sum_{j=1}^{n-1}\,\epsilon_{j}\cdot\underbrace{\gamma^{2}\,\Big{[}\sum_{i=j}^{n-1}(\sum_{\ell=i}^{n-1}(I-\gamma\Sigma)^{\ell-i})\big{(}\Sigma-K_{X_{i+1}}\otimes K_{X_{i+1}}\big{)}(I-\gamma\Sigma)^{i-j}K_{X_{j}}\Big{]}}_{g_{j}}.

This implies that, conditioning on the covariates \{X_{1},\dots,X_{n}\}, the empirical process \bar{\eta}_{n}^{noise,1}(\cdot)=\frac{1}{n}\sum_{j=1}^{n-1}\epsilon_{j}\cdot g_{j}(\cdot) over [0,1] is a Gaussian process with (function) weights \{g_{j}\}_{j=1}^{n-1}. We can then prove a bound on \|\bar{\eta}_{n}^{noise,1}\|_{\infty} by carefully analyzing the random function \sum_{j=1}^{n-1}g_{j}^{2}(\cdot); see Appendix 8.3 for further details. A complete proof of Theorem 3.1 under constant step size is included in [33]; see Figure 5 for a flow chart explaining the relationship among the different components of its proof. The proof for the non-constant step size case is conceptually similar but considerably more involved, requiring a much more refined analysis of the accumulated step size effect on the iterations of the recursions in [33].

[Flow chart: \bar{\eta}_{n}=\bar{f}_{n}-f^{*} splits into a bias part \bar{\eta}_{n}^{bias} and a noise part \bar{\eta}_{n}^{noise}; the bias part splits into its leading term \bar{\eta}_{n}^{bias,0} (Section 8.1) and the remainder \bar{\eta}_{n}^{bias}-\bar{\eta}_{n}^{bias,0} (Lemma 8.1); the noise part splits into its leading term \bar{\eta}_{n}^{noise,0} (Section 8.1), the second-order term \bar{\eta}_{n}^{noise,1}, and the remainder \bar{\eta}_{n}^{noise}-\bar{\eta}_{n}^{noise,0}-\bar{\eta}_{n}^{noise,1} (Lemma 8.2).]
Figure 5: Flow chart for the proof of Theorem 3.1 under constant step size.

6.2 Proof sketch for Bootstrap consistency of global inference

Recall that 𝒟n={Xi,Yi}i=1n\mathcal{D}_{n}=\{X_{i},Y_{i}\}_{i=1}^{n} represents the data. The goal is to bound the difference between the sampling distribution of n(nγ)1/αη¯nnoise,0\sqrt{n(n\gamma)^{-1/\alpha}}\,\|\bar{\eta}_{n}^{noise,0}\|_{\infty} and the conditional distribution of n(nγ)1/αη¯nb,noise,0η¯nnoise,0\sqrt{n(n\gamma)^{-1/\alpha}}\,\|\bar{\eta}_{n}^{b,noise,0}-\bar{\eta}_{n}^{noise,0}\|_{\infty} given 𝒟n\mathcal{D}_{n}; see Section 4.2 for detailed definitions of these quantities. We sketch the proof idea under the constant step size scheme.

We will use the shorthand α¯n=n(nγ)1/αη¯nnoise,0\bar{\alpha}_{n}=\sqrt{n(n\gamma)^{-1/\alpha}}\,\bar{\eta}_{n}^{noise,0} and α¯nb=n(nγ)1/α(η¯nb,noise,0η¯nnoise,0)\bar{\alpha}_{n}^{b}=\sqrt{n(n\gamma)^{-1/\alpha}}\,\big{(}\bar{\eta}_{n}^{b,noise,0}-\bar{\eta}_{n}^{noise,0}\big{)}. Recall that from equations (3.6) and (4.5), we have

α¯n()=1n(nγ)1/αi=1nϵiΩn,i()andα¯nb()=1n(nγ)1/αi=1n(wi1)ϵiΩn,i().\displaystyle\bar{\alpha}_{n}(\cdot)=\frac{1}{\sqrt{n(n\gamma)^{1/\alpha}}}\sum_{i=1}^{n}\epsilon_{i}\cdot\Omega_{n,i}(\cdot)\quad\mbox{and}\quad\bar{\alpha}_{n}^{b}(\cdot)=\frac{1}{\sqrt{n(n\gamma)^{1/\alpha}}}\sum_{i=1}^{n}(w_{i}-1)\cdot\epsilon_{i}\cdot\Omega_{n,i}(\cdot).

From this display, we see that for any t𝒳t\in\mathcal{X}, α¯n(t)\bar{\alpha}_{n}(t) is a weighted sum of Gaussian random variables, with the weights being functions of covariates {Xi}i=1n\{X_{i}\}_{i=1}^{n}; conditioning on 𝒟n\mathcal{D}_{n}, α¯nb(t)\bar{\alpha}_{n}^{b}(t) is a weighted sum of sub-Gaussian random variables. In the proof, we also require a sufficiently dense space discretization given by 0=t1<t2<<tN=10=t_{1}<t_{2}<\cdots<t_{N}=1. This discretization forms an ε\varepsilon-covering for some ε\varepsilon with respect to a specific distance metric that will be detailed later.

To bound the difference between the distribution of α¯n\|\bar{\alpha}_{n}\|_{\infty} and the conditional distribution of α¯nb\|\bar{\alpha}_{n}^{b}\|_{\infty} given 𝒟n\mathcal{D}_{n}, we introduce two intermediate processes: (1) α¯ne()=1n(nγ)1/αi=1neiϵiΩn,i()\bar{\alpha}_{n}^{e}(\cdot)=\frac{1}{\sqrt{n(n\gamma)^{1/\alpha}}}\sum_{i=1}^{n}e_{i}\cdot\epsilon_{i}\cdot\Omega_{n,i}(\cdot) with eie_{i} being i.i.d. standard normal random variables for i=1,,ni=1,\cdots,n; (2) an NN-dimensional multivariate normal random vector (Z¯n(tk)=1ni=1nZi(tk),k=1,2,,N)\big{(}\bar{Z}_{n}(t_{k})=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}Z_{i}(t_{k}),\,k=1,2,\ldots,N\big{)} (recall that 0=t1<t2<<tN=10=t_{1}<t_{2}<\cdots<t_{N}=1 is the space discretization we defined earlier), where {(Z1(t1),Z1(t2),,Z1(tN))}i=1n\big{\{}\big{(}Z_{1}(t_{1}),Z_{1}(t_{2}),\ldots,Z_{1}(t_{N})\big{)}\big{\}}_{i=1}^{n} are i.i.d. (zero mean) normally distributed random vectors having the same covariance structure as (α¯n(t1),α¯n(t2),,α¯n(tN))\big{(}\bar{\alpha}_{n}(t_{1}),\bar{\alpha}_{n}(t_{2}),\ldots,\bar{\alpha}_{n}(t_{N})); that is, Zi(tk)N(0,(nγ)1/αν=1(1(1γμν)ni)2ϕν2(tk))Z_{i}(t_{k})\sim N\big{(}0,(n\gamma)^{-1/\alpha}\sum_{\nu=1}^{\infty}(1-(1-\gamma\mu_{\nu})^{n-i})^{2}\phi_{\nu}^{2}(t_{k})\big{)}, 𝔼(Zi(tk)Zi(t))=(nγ)1/αν=1(1(1γμν)ni)2ϕν(tk)ϕν(t)\mathbb{E}\big{(}Z_{i}(t_{k})\cdot Z_{i}(t_{\ell})\big{)}=(n\gamma)^{-1/\alpha}\sum_{\nu=1}^{\infty}(1-(1-\gamma\mu_{\nu})^{n-i})^{2}\phi_{\nu}(t_{k})\phi_{\nu}(t_{\ell}), and 𝔼(Zi(tk)Zj(t))=0\mathbb{E}\big{(}Z_{i}(t_{k})\cdot Z_{j}(t_{\ell})\big{)}=0 for (k,)[N]2(k,\ell)\in[N]^{2} and (i,j)[n]2(i,j)\in[n]^{2}, iji\neq j. These two intermediate processes are introduced so that the conditional distribution of max1jNα¯ne(tj)\max_{1\leq j\leq N}\bar{\alpha}_{n}^{e}(t_{j}) given 𝒟n\mathcal{D}_{n} will be used to approximate the conditional distribution of α¯nb\|\bar{\alpha}_{n}^{b}\|_{\infty} given 𝒟n\mathcal{D}_{n}; while the distribution of max1jNZ¯n(tj)\max_{1\leq j\leq N}\bar{Z}_{n}(t_{j}) will be used to approximate the distribution of α¯n\|\bar{\alpha}_{n}\|_{\infty}. Since both the distribution of (Z¯n(t1),Z¯n(t2),,Z¯n(tN))\big{(}\bar{Z}_{n}(t_{1}),\bar{Z}_{n}(t_{2}),\ldots,\bar{Z}_{n}(t_{N})\big{)} and the conditional distribution of (α¯ne(t1),α¯ne(t2),,α¯ne(tN))\big{(}\bar{\alpha}_{n}^{e}(t_{1}),\bar{\alpha}_{n}^{e}(t_{2}),\ldots,\bar{\alpha}_{n}^{e}(t_{N})\big{)} given 𝒟n\mathcal{D}_{n} are centered multivariate normal distributions, we can use a Gaussian comparison inequality to bound the difference between them by bounding the difference between their covariances.

[Flow chart: the approximation chain \|\bar{\alpha}_{n}\|_{\infty}\to\max_{1\leq j\leq N}\bar{\alpha}_{n}(t_{j})\to\max_{1\leq j\leq N}\bar{Z}_{n}(t_{j})\to\max_{1\leq j\leq N}\bar{\alpha}^{e}_{n}(t_{j})\mid\mathcal{D}_{n}\to\max_{1\leq j\leq N}\bar{\alpha}^{b}_{n}(t_{j})\mid\mathcal{D}_{n}\to\|\bar{\alpha}^{b}_{n}\|_{\infty}\mid\mathcal{D}_{n}, via Steps I-VI and Lemmas 8.3-8.5.]
Figure 6: Flow chart of the bootstrap consistency proof.

The actual proof is even more complicated, as we also need to control the discretization error. See Figure 6 for a flow chart that summarizes all the intermediate approximation steps and the corresponding lemmas in the appendix. For Steps I and V in Figure 6, we approximate the continuous supremum norms of \bar{\alpha}_{n} and \bar{\alpha}_{n}^{b} by the finite maxima of \big{(}\bar{\alpha}_{n}(t_{1}),\ldots,\bar{\alpha}_{n}(t_{N})\big{)} and \big{(}\bar{\alpha}_{n}^{b}(t_{1}),\ldots,\bar{\alpha}_{n}^{b}(t_{N})\big{)}, respectively. Here, N is chosen as the \varepsilon-covering number of the unit interval [0,1] with respect to the metric defined by e_{P}^{2}(t,s)=\mathbb{E}\big{[}\big{(}\bar{\alpha}_{n}(t)-\bar{\alpha}_{n}(s)\big{)}^{2}\big{]} for (t,s)\in[0,1]^{2}; that is, there exist t_{1},\dots,t_{N}\in[0,1] such that for every t\in[0,1] there exists 1\leq j\leq N with e_{P}(t,t_{j})<\varepsilon. We refer the reader to the Supplementary Material for the detailed proofs of Steps I and V. Notice that \bar{\alpha}_{n} is a weighted and non-identically distributed empirical process. In Step II, we further develop Gaussian approximation bounds to control the Kolmogorov distance between the sampling distribution of \max_{1\leq j\leq N}\bar{\alpha}_{n}(t_{j}) and the distribution of \max_{1\leq j\leq N}\bar{Z}_{n}(t_{j}); see the proof in Lemma 8.3. In Step IV, by noticing that, conditional on \mathcal{D}_{n}, \bar{\alpha}_{n}^{b} is a weighted and non-identically distributed sub-Gaussian process with randomness coming from the bootstrap multipliers \{w_{i}\}_{i=1}^{n}, we adopt a similar argument as in Step II to bound the Kolmogorov distance between the distributions of \max_{1\leq j\leq N}\bar{\alpha}_{n}^{e}(t_{j}) and \max_{1\leq j\leq N}\bar{\alpha}_{n}^{b}(t_{j}) given \mathcal{D}_{n}.

7 Discussion

Quantifying uncertainty (UQ) for large-scale streaming data is a central challenge in statistical inference. In this work, we develop a multiplier bootstrap-based inferential framework for UQ in online non-parametric least squares regression. We propose using perturbed stochastic functional gradients to generate a sequence of bootstrapped functional SGD estimators for constructing pointwise confidence intervals (local inference) and simultaneous confidence bands (global inference) for the function parameter in an RKHS. Theoretically, we establish a framework to derive the non-asymptotic law of the infinite-dimensional SGD estimator and demonstrate the consistency of the multiplier bootstrap method.
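To make the procedure concrete, the following is a minimal Python sketch (not the paper's exact implementation): the Gaussian kernel, the constant step size, the mean-one Gaussian multiplier weights, and all variable names are illustrative assumptions on our part. It runs functional SGD together with bootstrapped copies driven by perturbed stochastic functional gradients, from which pointwise intervals or a rough simultaneous band can be read off.

import numpy as np

# A minimal, self-contained sketch: constant-step functional SGD in an RKHS with a
# Gaussian kernel, plus B multiplier-bootstrap copies driven by mean-one random weights.
# Each iterate f_n = sum_i c_i K(X_i, .) is stored via its coefficients c_i; the averaged
# estimator f_bar_n = (1/n) sum_{k<=n} f_k corresponds to reweighting c_i by (n - i + 1)/n.

rng = np.random.default_rng(0)

def kernel(x, y, bandwidth=0.2):
    # Gaussian (RBF) kernel on [0, 1]; the bandwidth is an illustrative choice.
    return np.exp(-(x - y) ** 2 / (2 * bandwidth ** 2))

n, B, gamma = 500, 50, 0.5                 # sample size, bootstrap copies, constant step size
f_star = lambda x: np.sin(2 * np.pi * x)   # illustrative true regression function

X = rng.uniform(0, 1, n)
y = f_star(X) + 0.3 * rng.standard_normal(n)

coef = np.zeros((B + 1, n))                # row 0: original SGD path; rows 1..B: bootstrap paths
W = np.column_stack([np.ones(n), 1.0 + rng.standard_normal((n, B))]).T  # multiplier weights w_i

for i in range(n):
    k_past = kernel(X[:i], X[i])                              # K(X_1, X_i), ..., K(X_{i-1}, X_i)
    preds = coef[:, :i] @ k_past if i > 0 else np.zeros(B + 1)
    # perturbed stochastic functional gradient step:
    # f_i = f_{i-1} + gamma * w_i * (y_i - f_{i-1}(X_i)) K(X_i, .)
    coef[:, i] = gamma * W[:, i] * (y[i] - preds)

avg_w = (n - np.arange(n)) / n             # weights turning the SGD path into its running average
coef_bar = coef * avg_w

grid = np.linspace(0, 1, 101)
curves = coef_bar @ kernel(X[:, None], grid[None, :])         # (B+1) fitted curves on the grid
point_est, boot = curves[0], curves[1:]
sup_dev = np.abs(boot - point_est).max(axis=1)                # sup-deviation of each bootstrap copy
half_width = np.quantile(sup_dev, 0.95)                       # rough simultaneous band half-width
print(half_width)

Representing each iterate through the coefficients of $K(X_{i},\cdot)$ keeps the per-iteration cost linear in the number of processed observations, which is the usual trade-off for kernelized SGD in this streaming setting.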

This work assumes that the random errors in the non-parametric regression model follow a Gaussian distribution. However, in many real-world applications heavy-tailed error distributions are more common and better capture outlying behavior. One future research direction is to extend the current methods to heavy-tailed errors, thereby offering a more robust approach to online non-parametric inference. Another direction is to generalize the multiplier bootstrap weights to independent sub-exponential random variables and even to exchangeable weights. Finally, a promising direction is online non-parametric inference for dependent data. Such an extension is needed to address problems such as multi-armed bandits and reinforcement learning, where data dependence is intrinsic and real-time updates are essential. Adapting our methods to these problems could provide deeper insights into the interplay between statistical inference and online decision-making.

8 Some Key Proofs

8.1 Proof of the leading terms in Theorem 3.1 in the constant step size case

Recall that in Section 6.1 we split the recursion of $\eta_{n}=\widehat{f}_{n}-f^{\ast}$ into a bias recursion and a noise recursion, that is, $\eta_{n}=\eta_{n}^{bias}+\eta_{n}^{noise}$. Here $\eta_{n}^{bias}$ can be further decomposed into its leading term $\eta_{n}^{bias,0}$ and the remainder $\eta_{n}^{bias}-\eta_{n}^{bias,0}$, satisfying the recursions

ηnbias,0=\displaystyle\eta_{n}^{bias,0}= (IγnΣ)ηn1bias,0withη0bias,0=f\displaystyle(I-\gamma_{n}\Sigma)\eta_{n-1}^{bias,0}\quad\quad\textrm{with}\quad\eta_{0}^{bias,0}=f^{\ast} (8.1)
ηnbiasηnbias,0=\displaystyle\eta_{n}^{bias}-\eta_{n}^{bias,0}= (IγnKXnKXn)(ηn1biasηn1bias,0)+γn(ΣKXnKXn)ηn1bias,0.\displaystyle(I-\gamma_{n}K_{X_{n}}\otimes K_{X_{n}})(\eta_{n-1}^{bias}-\eta_{n-1}^{bias,0})+\gamma_{n}(\Sigma-K_{X_{n}}\otimes K_{X_{n}})\eta_{n-1}^{bias,0}. (8.2)

We further decompose $\eta_{n}^{noise}$ into its main recursion term and a residual recursion term as

ηnnoise,0=\displaystyle\eta_{n}^{noise,0}= (IγnΣ)ηn1noise,0+γnϵnKXnwithη0noise,0=0\displaystyle(I-\gamma_{n}\Sigma)\eta^{noise,0}_{n-1}+\gamma_{n}\epsilon_{n}K_{X_{n}}\quad\quad\textrm{with}\quad\eta_{0}^{noise,0}=0 (8.3)
ηnnoiseηnnoise,0=\displaystyle\eta_{n}^{noise}-\eta_{n}^{noise,0}= (IγnKXnKXn)(ηn1noiseηn1noise,0)+γn(ΣKXnKXn)ηn1noise,0\displaystyle(I-\gamma_{n}K_{X_{n}}\otimes K_{X_{n}})(\eta_{n-1}^{noise}-\eta_{n-1}^{noise,0})+\gamma_{n}(\Sigma-K_{X_{n}}\otimes K_{X_{n}})\eta_{n-1}^{noise,0} (8.4)

We focus on the averaged version η¯n=f¯nf\bar{\eta}_{n}=\bar{f}_{n}-f^{\ast} with

η¯n=\displaystyle\bar{\eta}_{n}= η¯nbias,0+η¯nnoise,0+Remnoise+Rembias,\displaystyle\bar{\eta}_{n}^{bias,0}+\bar{\eta}_{n}^{noise,0}+Rem_{noise}+Rem_{bias},

where Remnoise=η¯nnoiseη¯nnoise,0Rem_{noise}=\bar{\eta}_{n}^{noise}-\bar{\eta}_{n}^{noise,0}, Rembias=η¯nbiasη¯nbias,0Rem_{bias}=\bar{\eta}_{n}^{bias}-\bar{\eta}_{n}^{bias,0}.

Theorem 3.1 for the constant step case includes three results as follows:

supz0𝒳|η¯nbias,0(z0)|1nγ\displaystyle\sup_{z_{0}\in\mathcal{X}}|\bar{\eta}_{n}^{bias,0}(z_{0})|\lesssim\frac{1}{\sqrt{n\gamma}} (8.5)
supz0𝒳Var(η¯nnoise,0(z0))(nγ)1/αn\displaystyle\sup_{z_{0}\in\mathcal{X}}\operatorname{{\rm Var}}\big{(}\bar{\eta}_{n}^{noise,0}(z_{0})\big{)}\lesssim\frac{(n\gamma)^{1/\alpha}}{n} (8.6)
(η¯nη¯nbias,0η¯nnoise,02γ1/2(nγ)1+γ1/2(nγ)1/αn1logn)1/n+γ1/2.\displaystyle\mathbb{P}\Big{(}\|\bar{\eta}_{n}-\bar{\eta}_{n}^{bias,0}-\bar{\eta}_{n}^{noise,0}\|^{2}_{\infty}\geq\gamma^{1/2}(n\gamma)^{-1}+\gamma^{1/2}(n\gamma)^{1/\alpha}n^{-1}\log n\Big{)}\leq 1/n+\gamma^{1/2}. (8.7)

In this section, we bound the sup-norm of the leading bias term $\bar{\eta}_{n}^{bias,0}$ and of the leading noise term $\bar{\eta}_{n}^{noise,0}$. To complete the proof of (8.7), we bound $\|Rem_{bias}\|_{\infty}$ in Section 8.2 and $\|Rem_{noise}\|_{\infty}$ in Section 8.3.

We first provide explicit expressions for $\eta_{n}^{bias,0}$ and $\eta_{n}^{noise,0}$.

Denote

D(k,n,γi)=\displaystyle D(k,n,\gamma_{i})= i=kn(IγiΣ)andD(k,n,γ)=i=kn(IγΣ)=(IγΣ)nk+1\displaystyle\prod_{i=k}^{n}(I-\gamma_{i}\Sigma)\quad\textrm{and}\quad D(k,n,\gamma)=\prod_{i=k}^{n}(I-\gamma\Sigma)=(I-\gamma\Sigma)^{n-k+1}
M(k,n,γi)=\displaystyle M(k,n,\gamma_{i})= i=kn(IγiKXiKXi)andM(k,n,γ)=i=kn(IγKXiKXi),\displaystyle\prod_{i=k}^{n}(I-\gamma_{i}K_{X_{i}}\otimes K_{X_{i}})\quad\textrm{and}\quad M(k,n,\gamma)=\prod_{i=k}^{n}(I-\gamma K_{X_{i}}\otimes K_{X_{i}}),

with the convention $D(n+1,n,\gamma_{i})=D(n+1,n,\gamma)=I$. We have

ηnbias,0=D(1,n,γi)fandη¯nbias,0=1nk=1nD(1,k,γi)f;\eta_{n}^{bias,0}=D(1,n,\gamma_{i})f^{\ast}\quad\textrm{and}\quad\bar{\eta}_{n}^{bias,0}=\frac{1}{n}\sum_{k=1}^{n}D(1,k,\gamma_{i})f^{\ast}; (8.8)
ηnnoise,0=i=1nD(i+1,n,γi)γiϵiKXiandη¯nnoise,0=1ni=1n(j=inD(i+1,j,γi))γiϵiKXi.\eta_{n}^{noise,0}=\sum_{i=1}^{n}D(i+1,n,\gamma_{i})\gamma_{i}\epsilon_{i}K_{X_{i}}\quad\textrm{and}\quad\bar{\eta}_{n}^{noise,0}=\frac{1}{n}\sum_{i=1}^{n}\big{(}\sum_{j=i}^{n}D(i+1,j,\gamma_{i})\big{)}\gamma_{i}\epsilon_{i}K_{X_{i}}. (8.9)
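As a quick sanity check, the expressions in (8.8) and (8.9) indeed solve the recursions (8.1) and (8.3): since $D(i+1,n,\gamma_{i})=(I-\gamma_{n}\Sigma)D(i+1,n-1,\gamma_{i})$ for $i\leq n-1$ and $D(n+1,n,\gamma_{i})=I$,
\[
\sum_{i=1}^{n}D(i+1,n,\gamma_{i})\gamma_{i}\epsilon_{i}K_{X_{i}}=(I-\gamma_{n}\Sigma)\sum_{i=1}^{n-1}D(i+1,n-1,\gamma_{i})\gamma_{i}\epsilon_{i}K_{X_{i}}+\gamma_{n}\epsilon_{n}K_{X_{n}},
\]
which is exactly $\eta_{n}^{noise,0}=(I-\gamma_{n}\Sigma)\eta_{n-1}^{noise,0}+\gamma_{n}\epsilon_{n}K_{X_{n}}$; the bias expression is verified in the same way.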

Bound the leading bias term (8.5) For the case of constant step size, based on (8.8), we have

η¯nbias,0(z0)=\displaystyle\bar{\eta}_{n}^{bias,0}(z_{0})= 1nk=1n(IγΣ)nk+1f(z0).\displaystyle\frac{1}{n}\sum_{k=1}^{n}(I-\gamma\Sigma)^{n-k+1}f^{\ast}(z_{0}).

Note that any $f\in\mathbb{H}$ can be represented as $f=\sum_{\nu=1}^{\infty}\langle f,\phi_{\nu}\rangle_{L_{2}}\phi_{\nu}$, where $\{\phi_{\nu}\}_{\nu=1}^{\infty}$ satisfies $\|\phi_{\nu}\|_{L_{2}}^{2}=1=\mathbb{E}(\phi_{\nu}^{2}(x))$, $\langle\phi_{\nu},\phi_{\nu}\rangle_{\mathbb{H}}=\mu_{\nu}^{-1}$, $\Sigma\phi_{\nu}=\mu_{\nu}\phi_{\nu}$, and $\Sigma^{-1}\phi_{\nu}=\mu_{\nu}^{-1}\phi_{\nu}$. Then for any $z_{0}\in\mathcal{X}$, $f^{\ast}(z_{0})=\sum_{\nu=1}^{\infty}\langle f^{\ast},\phi_{\nu}\rangle_{L_{2}}\phi_{\nu}(z_{0})$. By the assumption that $f^{\ast}\in\mathbb{H}$ satisfies $\sum_{\nu=1}^{\infty}\langle f^{\ast},\phi_{\nu}\rangle_{L_{2}}\mu_{\nu}^{-1/2}<\infty$, we have

η¯nbias,0(z0)=\displaystyle\bar{\eta}_{n}^{bias,0}(z_{0})= 1γnk=1nν=1f,ϕνL2μν1/2(1γμν)k(γμν)1/2ϕν(z0)\displaystyle\frac{1}{\sqrt{\gamma}n}\sum_{k=1}^{n}\sum_{\nu=1}^{\infty}\langle f^{\ast},\phi_{\nu}\rangle_{L_{2}}\mu_{\nu}^{-1/2}(1-\gamma\mu_{\nu})^{k}(\gamma\mu_{\nu})^{1/2}\phi_{\nu}(z_{0})
\displaystyle\leq cϕ1γ1/2nν=1f,ϕνL2μν1/2(sup0x1(k=1n(1x)kx1/2))\displaystyle c_{\phi}\frac{1}{\gamma^{1/2}n}\sum_{\nu=1}^{\infty}\langle f^{\ast},\phi_{\nu}\rangle_{L_{2}}\mu_{\nu}^{-1/2}\Big{(}\sup_{0\leq x\leq 1}\big{(}\sum_{k=1}^{n}(1-x)^{k}x^{1/2}\big{)}\Big{)}
\displaystyle\leq cϕ1nγν=1f,ϕνL2μν1/21nγ,\displaystyle c_{\phi}\frac{1}{\sqrt{n\gamma}}\sum_{\nu=1}^{\infty}\langle f^{\ast},\phi_{\nu}\rangle_{L_{2}}\mu_{\nu}^{-1/2}\lesssim\frac{1}{\sqrt{n\gamma}},

where the last inequality uses the bound $\sup_{0\leq x\leq 1}\big(\sum_{k=1}^{n}(1-x)^{k}x^{1/2}\big)\leq\sqrt{n}$.
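This elementary bound can be verified directly: for $0\leq x\leq 1$,
\[
\sum_{k=1}^{n}(1-x)^{k}\leq\min\{n,x^{-1}\},\qquad\text{so}\qquad\sum_{k=1}^{n}(1-x)^{k}x^{1/2}\leq\min\{nx^{1/2},x^{-1/2}\}\leq\sqrt{n},
\]
with the two branches of the minimum crossing at $x=1/n$.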

Bound the leading noise term (8.6) We first derive the explicit expression of $\bar{\eta}_{n}^{noise,0}(z_{0})$ and its variance. Based on (8.9), in the constant step case we have, for any $z_{0}\in\mathcal{X}$,

η¯nnoise,0(z0)=1nk=1nΣ1(I(IγΣ)n+1k)K(Xk,z0)ϵk.\bar{\eta}_{n}^{noise,0}(z_{0})=\frac{1}{n}\sum_{k=1}^{n}\Sigma^{-1}\big{(}I-(I-\gamma\Sigma)^{n+1-k}\big{)}K(X_{k},z_{0})\epsilon_{k}.

Note that for any x,zx,z, K(x,z)=ν=1μνϕν(x)ϕν(z)K(x,z)=\sum_{\nu=1}^{\infty}\mu_{\nu}\phi_{\nu}(x)\phi_{\nu}(z). Then

Σ1(I(IγΣ)n+1k)K(Xk,z0)=\displaystyle\Sigma^{-1}\big{(}I-(I-\gamma\Sigma)^{n+1-k}\big{)}K(X_{k},z_{0})= Σ1(I(IγΣ)n+1k)(ν=1μνϕν(Xk)ϕν(z0))\displaystyle\Sigma^{-1}\big{(}I-(I-\gamma\Sigma)^{n+1-k}\big{)}\big{(}\sum_{\nu=1}^{\infty}\mu_{\nu}\phi_{\nu}(X_{k})\phi_{\nu}(z_{0})\big{)}
=\displaystyle= ν=1(1(1γμν)n+1k)ϕν(Xk)ϕν(z0).\displaystyle\sum_{\nu=1}^{\infty}(1-(1-\gamma\mu_{\nu})^{n+1-k})\phi_{\nu}(X_{k})\phi_{\nu}(z_{0}).

Therefore, η¯nnoise,0(z0)=1nk=1nν=1(1(1γμν)n+1k)ϕν(Xk)ϕν(z0)ϵk\bar{\eta}_{n}^{noise,0}(z_{0})=\frac{1}{n}\sum_{k=1}^{n}\sum_{\nu=1}^{\infty}(1-(1-\gamma\mu_{\nu})^{n+1-k})\phi_{\nu}(X_{k})\phi_{\nu}(z_{0})\epsilon_{k} with 𝔼(η¯nnoise,0(z0))=0\mathbb{E}(\bar{\eta}_{n}^{noise,0}(z_{0}))=0, and

Var(η¯nnoise,0(z0))=σ2n2ν=1ϕν2(z0)k=1n[1(1γμν)n+1k]2.\operatorname{{\rm Var}}\big{(}\bar{\eta}_{n}^{noise,0}(z_{0})\big{)}=\frac{\sigma^{2}}{n^{2}}\sum_{\nu=1}^{\infty}\phi_{\nu}^{2}(z_{0})\sum_{k=1}^{n}[1-(1-\gamma\mu_{\nu})^{n+1-k}]^{2}.
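The display above follows from independence across $k$ and the $L_{2}$-orthonormality of the eigenfunctions: since the $\epsilon_{k}$ are mean zero with variance $\sigma^{2}$ and independent of the $X_{k}$, and $\mathbb{E}[\phi_{\nu}(X)\phi_{\nu^{\prime}}(X)]=\delta_{\nu\nu^{\prime}}$,
\[
\operatorname{Var}\big(\bar{\eta}_{n}^{noise,0}(z_{0})\big)=\frac{\sigma^{2}}{n^{2}}\sum_{k=1}^{n}\mathbb{E}\Big[\Big(\sum_{\nu=1}^{\infty}\big(1-(1-\gamma\mu_{\nu})^{n+1-k}\big)\phi_{\nu}(X_{k})\phi_{\nu}(z_{0})\Big)^{2}\Big]=\frac{\sigma^{2}}{n^{2}}\sum_{k=1}^{n}\sum_{\nu=1}^{\infty}\big(1-(1-\gamma\mu_{\nu})^{n+1-k}\big)^{2}\phi_{\nu}^{2}(z_{0}),
\]
where the cross terms in $\nu\neq\nu^{\prime}$ vanish.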

Note that

k=1nν=1(1(1γμν)k)2k=1nν=1min{1,(kγμν)2}.\sum_{k=1}^{n}\sum_{\nu=1}^{\infty}\big{(}1-(1-\gamma\mu_{\nu})^{k}\big{)}^{2}\asymp\sum_{k=1}^{n}\sum_{\nu=1}^{\infty}\min\bigl{\{}1,(k\gamma\mu_{\nu})^{2}\bigr{\}}.
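The equivalence in the last display rests on the elementary two-sided bound: for $x\in[0,1]$ and $k\geq 1$,
\[
(1-e^{-1})\min\{1,kx\}\;\leq\;1-e^{-kx}\;\leq\;1-(1-x)^{k}\;\leq\;\min\{1,kx\},
\]
applied with $x=\gamma\mu_{\nu}$ (using $\gamma\mu_{\nu}\leq 1$) and then squared.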

On the other hand, $\sum_{\nu=1}^{\infty}\min\{1,(k\gamma\mu_{\nu})^{2}\}\lesssim(k\gamma)^{1/\alpha}+\sum_{\nu=(k\gamma)^{1/\alpha}+1}^{\infty}(k\gamma\mu_{\nu})^{2}$. Since

\[
\sum_{\nu=(k\gamma)^{1/\alpha}+1}^{\infty}(k\gamma\mu_{\nu})^{2}\leq\sum_{\nu=(k\gamma)^{1/\alpha}+1}^{\infty}k\gamma\mu_{\nu}=k\gamma\sum_{\nu=(k\gamma)^{1/\alpha}+1}^{\infty}\nu^{-\alpha}\leq k\gamma\int_{(k\gamma)^{1/\alpha}}^{\infty}x^{-\alpha}\,dx\lesssim(k\gamma)^{1/\alpha},
\]

we have

k=1nν=1(1(1γμν)k)2k=1n(kγ)1/αγ1/αn(α+1)/α=(nγ)1/αn.\sum_{k=1}^{n}\sum_{\nu=1}^{\infty}\big{(}1-(1-\gamma\mu_{\nu})^{k}\big{)}^{2}\asymp\sum_{k=1}^{n}(k\gamma)^{1/\alpha}\asymp\gamma^{1/\alpha}n^{(\alpha+1)/\alpha}=(n\gamma)^{1/\alpha}n.
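The second equivalence in the display above is a standard integral comparison:
\[
\sum_{k=1}^{n}(k\gamma)^{1/\alpha}=\gamma^{1/\alpha}\sum_{k=1}^{n}k^{1/\alpha}\asymp\gamma^{1/\alpha}\int_{0}^{n}x^{1/\alpha}\,dx=\frac{\alpha}{\alpha+1}\,\gamma^{1/\alpha}n^{(\alpha+1)/\alpha}.
\]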

Accordingly, Var(η¯nnoise,0(z0))(nγ)1/αn\operatorname{{\rm Var}}(\bar{\eta}_{n}^{noise,0}(z_{0}))\lesssim\frac{(n\gamma)^{1/\alpha}}{n}.

Meanwhile, $\sum_{\nu=1}^{\infty}\min\{1,(k\gamma\mu_{\nu})^{2}\}\gtrsim(k\gamma)^{1/\alpha}$ leads to $\sum_{k=1}^{n}\sum_{\nu=1}^{\infty}\big(1-(1-\gamma\mu_{\nu})^{k}\big)^{2}\gtrsim n(n\gamma)^{1/\alpha}$, and thus $\operatorname{Var}(\bar{\eta}_{n}^{noise,0}(z_{0}))\gtrsim\frac{(n\gamma)^{1/\alpha}}{n}$. Therefore, $\operatorname{Var}(\bar{\eta}_{n}^{noise,0}(z_{0}))\asymp\frac{(n\gamma)^{1/\alpha}}{n}$.

Bound the remainder term (8.7) Recall

f¯nf=\displaystyle\bar{f}_{n}-f^{\ast}= η¯n=η¯nbias,0+η¯nnoise,0+(η¯nbiasη¯nbias,0)+(η¯nnoiseη¯nnoise,0).\displaystyle\bar{\eta}_{n}=\bar{\eta}_{n}^{bias,0}+\bar{\eta}_{n}^{noise,0}+\big{(}\bar{\eta}_{n}^{bias}-\bar{\eta}_{n}^{bias,0}\big{)}+\big{(}\bar{\eta}_{n}^{noise}-\bar{\eta}_{n}^{noise,0}\big{)}.

To prove (8.7) in Theorem 3.1, we bound η¯nbiasη¯nbias,0\|\bar{\eta}_{n}^{bias}-\bar{\eta}_{n}^{bias,0}\|_{\infty} and η¯nnoiseη¯nnoise,0\|\bar{\eta}_{n}^{noise}-\bar{\eta}_{n}^{noise,0}\|_{\infty} separately in Section 8.2 and Section 8.3.

8.2 Bounding the bias remainder in the constant step case

Recall from (8.2) that the bias remainder satisfies the recursion

ηnbiasηnbias,0=\displaystyle\eta_{n}^{bias}-\eta_{n}^{bias,0}= (IγKXnKXn)(ηn1biasηn1bias,0)+γ(ΣKXnKXn)ηn1bias,0.\displaystyle(I-\gamma K_{X_{n}}\otimes K_{X_{n}})(\eta_{n-1}^{bias}-\eta_{n-1}^{bias,0})+\gamma(\Sigma-K_{X_{n}}\otimes K_{X_{n}})\eta_{n-1}^{bias,0}.

Our goal is to bound $\|\bar{\eta}_{n}^{bias}-\bar{\eta}_{n}^{bias,0}\|_{\infty}$. Letting $\beta_{n}=\eta_{n}^{bias}-\eta_{n}^{bias,0}$ with $\beta_{0}=0$, we have

βn=(IγKXnKXn)βn1+γ(ΣKXnKXn)ηn1bias,0\beta_{n}=(I-\gamma K_{X_{n}}\otimes K_{X_{n}})\beta_{n-1}+\gamma(\Sigma-K_{X_{n}}\otimes K_{X_{n}})\eta_{n-1}^{bias,0}

with ηnbias,0=(IγΣ)ηn1bias,0\eta_{n}^{bias,0}=(I-\gamma\Sigma)\eta_{n-1}^{bias,0} and η0bias,0=f\eta_{0}^{bias,0}=f^{\ast}. We first express βn\beta_{n} in an explicit form as follows.

Let $S_{n}=I-\gamma K_{X_{n}}\otimes K_{X_{n}}$, $T_{n}=\Sigma-K_{X_{n}}\otimes K_{X_{n}}$, and $T=I-\gamma\Sigma$; then $\beta_{n}=S_{n}\beta_{n-1}+\gamma T_{n}\eta_{n-1}^{bias,0}$. We can further represent $\beta_{n}$ as

βn=γ(Tnηn1bias,0+SnTn1ηn2bias,0++SnSn1S2T1η0bias,0);\beta_{n}=\gamma(T_{n}\eta_{n-1}^{bias,0}+S_{n}T_{n-1}\eta_{n-2}^{bias,0}+\cdots+S_{n}S_{n-1}\dots S_{2}T_{1}\eta_{0}^{bias,0});

on the other hand, ηibias,0=(IγΣ)if\eta_{i}^{bias,0}=(I-\gamma\Sigma)^{i}f^{\ast}. Therefore, for any 1in1\leq i\leq n, we have

βi=γ(TiTi1+SiTi1Ti2++SiSi1S2T1)fγUi.\beta_{i}=\gamma(T_{i}T^{i-1}+S_{i}T_{i-1}T^{i-2}+\cdots+S_{i}S_{i-1}\cdots S_{2}T_{1})\cdot f^{\ast}\equiv\gamma U_{i}. (8.10)
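As a small worked example, for $i=2$ the recursion gives $\beta_{1}=S_{1}\beta_{0}+\gamma T_{1}\eta_{0}^{bias,0}=\gamma T_{1}f^{\ast}=\gamma U_{1}$ and
\[
\beta_{2}=S_{2}\beta_{1}+\gamma T_{2}\eta_{1}^{bias,0}=\gamma\big(T_{2}T+S_{2}T_{1}\big)f^{\ast}=\gamma U_{2},
\]
matching (8.10); the general case follows by iterating in the same way.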

Note that $\|\bar{\beta}_{n}\|_{\infty}\leq\|\Sigma^{a}\bar{\beta}_{n}\|_{\mathbb{H}}$. In the following Lemma 8.1, we bound $\|\bar{\beta}_{n}\|_{\infty}$ through $\mathbb{E}\langle\bar{\beta}_{n},\Sigma^{2a}\bar{\beta}_{n}\rangle$, and show that $\|\bar{\eta}_{n}^{bias}-\bar{\eta}_{n}^{bias,0}\|^{2}_{\infty}$ is of smaller order than $\|\bar{\eta}_{n}^{bias,0}\|_{\infty}^{2}$ with high probability.

Lemma 8.1.

Suppose the step size γ(n)=γ\gamma(n)=\gamma with 0<γ<μ110<\gamma<\mu_{1}^{-1}. Then

(η¯nbiasη¯nbias,02γ1/2(nγ)1)γ1/2.\mathbb{P}\Big{(}\|\bar{\eta}_{n}^{bias}-\bar{\eta}_{n}^{bias,0}\|^{2}_{\infty}\geq\gamma^{1/2}(n\gamma)^{-1}\Big{)}\leq\gamma^{1/2}.
Proof.

To simplify the notation, we write $\langle\cdot,\cdot\rangle$ for $\langle\cdot,\cdot\rangle_{\mathbb{H}}$. For $\bar{\beta}_{n}=\bar{\eta}_{n}^{bias}-\bar{\eta}_{n}^{bias,0}=\frac{1}{n}\sum_{i=1}^{n}\beta_{i}$, by (8.10) we have

𝔼β¯n,Σ2aβ¯n=𝔼1ni=1nγUi,Σ2a1ni=1nγUi=γ2n2i=1n𝔼Ui,Σ2aUi+2γ2n2i<j𝔼Ui,Σ2aUj.\mathbb{E}\langle\bar{\beta}_{n},\Sigma^{2a}\bar{\beta}_{n}\rangle=\mathbb{E}\langle\frac{1}{n}\sum_{i=1}^{n}\gamma U_{i},\Sigma^{2a}\frac{1}{n}\sum_{i=1}^{n}\gamma U_{i}\rangle=\frac{\gamma^{2}}{n^{2}}\sum_{i=1}^{n}\mathbb{E}\langle U_{i},\Sigma^{2a}U_{i}\rangle+\frac{2\gamma^{2}}{n^{2}}\sum_{i<j}\mathbb{E}\langle U_{i},\Sigma^{2a}U_{j}\rangle. (8.11)

That is, we split 𝔼β¯n,Σ2aβ¯n\mathbb{E}\langle\bar{\beta}_{n},\Sigma^{2a}\bar{\beta}_{n}\rangle into two parts, and will bound each part separately.

We first bound γ2n2i=1n𝔼Ui,Σ2aUi\frac{\gamma^{2}}{n^{2}}\sum_{i=1}^{n}\mathbb{E}\langle U_{i},\Sigma^{2a}U_{i}\rangle. Denote Hi=SiSi1S+1TT1fH_{i\ell}=S_{i}S_{i-1}\cdots S_{\ell+1}T_{\ell}T^{\ell-1}f^{\ast} with Hii=TiTi1fH_{ii}=T_{i}T^{i-1}f^{\ast}, then Ui=Hii+Hi(i1)++Hi1U_{i}=H_{ii}+H_{i(i-1)}+\cdots+H_{i1}.

𝔼Ui,Σ2aUi=\displaystyle\mathbb{E}\langle U_{i},\Sigma^{2a}U_{i}\rangle= 𝔼Hii+Hi(i1)++Hi1,Σ2a(Hii+Hi(i1)++Hi1)\displaystyle\mathbb{E}\langle H_{ii}+H_{i(i-1)}+\cdots+H_{i1},\Sigma^{2a}(H_{ii}+H_{i(i-1)}+\cdots+H_{i1})\rangle
=\displaystyle= j,k=1i𝔼Hij,Σ2aHik=j=1i𝔼Hij,Σ2aHij+jk𝔼Hij,Σ2aHik.\displaystyle\sum_{j,k=1}^{i}\mathbb{E}\langle H_{ij},\Sigma^{2a}H_{ik}\rangle=\sum_{j=1}^{i}\mathbb{E}\langle H_{ij},\Sigma^{2a}H_{ij}\rangle+\sum_{j\neq k}\mathbb{E}\langle H_{ij},\Sigma^{2a}H_{ik}\rangle.

If jkj\neq k, suppose ij>k1i\geq j>k\geq 1, then

\mathbb{E}\langle H_{ij},\Sigma^{2a}H_{ik}\rangle=\mathbb{E}\langle S_{i}S_{i-1}\cdots S_{j+1}T_{j}T^{j-1}f^{\ast},\Sigma^{2a}S_{i}S_{i-1}\cdots S_{k+1}T_{k}T^{k-1}f^{\ast}\rangle
=\mathbb{E}\Big[\mathbb{E}\big[\langle S_{i}S_{i-1}\cdots S_{j+1}T_{j}T^{j-1}f^{\ast},\Sigma^{2a}S_{i}S_{i-1}\cdots S_{k+1}T_{k}T^{k-1}f^{\ast}\rangle\,\big|\,X_{j},\dots,X_{i}\big]\Big]
=\mathbb{E}\big(\langle S_{i}S_{i-1}\cdots S_{j+1}T_{j}T^{j-1}f^{\ast},\Sigma^{2a}S_{i}S_{i-1}\dots S_{j}\,\mathbb{E}(S_{j-1}\cdots S_{k+1}T_{k})\,T^{k-1}f^{\ast}\rangle\big)=0,

where the last step is due to 𝔼(Sj1Sk+1Tk)=𝔼Sj1𝔼Sk+1𝔼Tk=0\mathbb{E}(S_{j-1}\cdots S_{k+1}T_{k})=\mathbb{E}S_{j-1}\cdots\mathbb{E}S_{k+1}\mathbb{E}T_{k}=0 with the fact that 𝔼Tk=0\mathbb{E}T_{k}=0. Therefore, we have 𝔼Ui,Σ2aUi=j=1i𝔼Hij,Σ2aHij\mathbb{E}\langle U_{i},\Sigma^{2a}U_{i}\rangle=\sum_{j=1}^{i}\mathbb{E}\langle H_{ij},\Sigma^{2a}H_{ij}\rangle. Furthermore,

𝔼Hij,Σ2aHij=𝔼SiSi1Sj+1TjTj1f,Σ2aSiSi1Sj+1TjTj1f\displaystyle\mathbb{E}\langle H_{ij},\Sigma^{2a}H_{ij}\rangle=\mathbb{E}\langle S_{i}S_{i-1}\cdots S_{j+1}T_{j}T^{j-1}f^{\ast},\Sigma^{2a}S_{i}S_{i-1}\cdots S_{j+1}T_{j}T^{j-1}f^{\ast}\rangle
=\displaystyle= f,𝔼(Tj1TjSj+1SiΣ2aSiSi1Sj+1TjTj1)f=f,Δf.\displaystyle\langle f^{\ast},\mathbb{E}(T^{j-1}T_{j}S_{j+1}\cdots S_{i}\Sigma^{2a}S_{i}S_{i-1}\cdots S_{j+1}T_{j}T^{j-1})f^{\ast}\rangle=\langle f^{\ast},\Delta f^{\ast}\rangle.

Note that Δ=𝔼(Tj1TjSj+1𝔼(SiΣ2aSi)Si1Sj+1TjTj1)\Delta=\mathbb{E}\big{(}T^{j-1}T_{j}S_{j+1}\cdots\mathbb{E}(S_{i}\Sigma^{2a}S_{i})S_{i-1}\cdots S_{j+1}T_{j}T^{j-1}\big{)}, with

𝔼(SiΣ2aSi)=\displaystyle\mathbb{E}(S_{i}\Sigma^{2a}S_{i})= 𝔼((IγKXiKXi)Σ2a(IγKXiKXi))\displaystyle\mathbb{E}\big{(}(I-\gamma K_{X_{i}}\otimes K_{X_{i}})\Sigma^{2a}(I-\gamma K_{X_{i}}\otimes K_{X_{i}})\big{)}
=\Sigma^{2a}-\gamma(\Sigma\cdot\Sigma^{2a}+\Sigma^{2a}\cdot\Sigma-\gamma S\Sigma^{2a})=\Sigma^{2a}-\gamma G\Sigma^{2a}, (8.12)

where $G\Sigma^{2a}=\Sigma\cdot\Sigma^{2a}+\Sigma^{2a}\cdot\Sigma-\gamma S\Sigma^{2a}$ with $S\Sigma^{2a}=\mathbb{E}\big((K_{X}\otimes K_{X})\Sigma^{2a}(K_{X}\otimes K_{X})\big)$.

More generally, for any operator $A$, $\mathbb{E}S_{i}AS_{i}=A-\gamma(\Sigma A+A\Sigma-\gamma SA)=A-\gamma GA=(I-\gamma G)A$, where $GA=\Sigma A+A\Sigma-\gamma SA$. Then $\Delta$ can be written as

\Delta=\mathbb{E}\big(T^{j-1}T_{j}S_{j+1}\cdots S_{i-1}\big((I-\gamma G)\Sigma^{2a}\big)S_{i-1}\cdots S_{j+1}T_{j}T^{j-1}\big)=\mathbb{E}\big(T^{j-1}T_{j}\big((I-\gamma G)^{i-j}\Sigma^{2a}\big)T_{j}T^{j-1}\big).

Furthermore, for any AA,

𝔼TjATj=\displaystyle\mathbb{E}T_{j}AT_{j}= 𝔼(KXjKXjΣ)A(KXjKXjΣ)=𝔼(KXjKXj)A(KXjKXj)ΣAΣ\displaystyle\mathbb{E}(K_{X_{j}}\otimes K_{X_{j}}-\Sigma)A(K_{X_{j}}\otimes K_{X_{j}}-\Sigma)=\mathbb{E}(K_{X_{j}}\otimes K_{X_{j}})A(K_{X_{j}}\otimes K_{X_{j}})-\Sigma A\Sigma (8.13)
\displaystyle\leq 2𝔼(KXjKXj)A(KXjKXj)=2SA.\displaystyle 2\mathbb{E}(K_{X_{j}}\otimes K_{X_{j}})A(K_{X_{j}}\otimes K_{X_{j}})=2SA.

Therefore, by (8.13), $\Delta\preceq 2\,T^{j-1}S\big((I-\gamma G)^{i-j}\Sigma^{2a}\big)T^{j-1}$, and hence $\mathbb{E}\langle H_{ij},\Sigma^{2a}H_{ij}\rangle\leq 2\langle f^{\ast},T^{j-1}S\big((I-\gamma G)^{i-j}\Sigma^{2a}\big)T^{j-1}f^{\ast}\rangle$. Then we have

\sum_{i=1}^{n}\mathbb{E}\langle U_{i},\Sigma^{2a}U_{i}\rangle=\sum_{i=1}^{n}\sum_{j=1}^{i}\mathbb{E}\langle H_{ij},\Sigma^{2a}H_{ij}\rangle\leq 2\Big\langle f^{\ast},\sum_{i=1}^{n}\sum_{j=1}^{i}T^{j-1}S\big((I-\gamma G)^{i-j}\Sigma^{2a}\big)T^{j-1}f^{\ast}\Big\rangle.

Denote $P=\sum_{i=1}^{n}\sum_{j=1}^{i}T^{j-1}S\big((I-\gamma G)^{i-j}\Sigma^{2a}\big)T^{j-1}$; then

P=\displaystyle P= i=1nj=1iTj1S(IγG)ijΣ2aTj1=j=1nTj1Si=jn(IγG)ijΣ2aTj1nj=1nTj1SΣ2aTj1.\displaystyle\sum_{i=1}^{n}\sum_{j=1}^{i}T^{j-1}S(I-\gamma G)^{i-j}\Sigma^{2a}T^{j-1}=\sum_{j=1}^{n}T^{j-1}S\sum_{i=j}^{n}(I-\gamma G)^{i-j}\Sigma^{2a}T^{j-1}\leq n\sum_{j=1}^{n}T^{j-1}S\Sigma^{2a}T^{j-1}.

Recalling $S\Sigma^{2a}=\mathbb{E}\big((K_{X}\otimes K_{X})\Sigma^{2a}(K_{X}\otimes K_{X})\big)$, we can bound $S\Sigma^{2a}\preceq c_{k}\Sigma$ as follows.

(SΣ2a)f,f=\displaystyle\langle(S\Sigma^{2a})f,f\rangle= 𝔼((KXKX)Σ2a(KXKX))f,f=𝔼f(X)Σ2aKX(X)KX,f\displaystyle\langle\mathbb{E}\big{(}(K_{X}\otimes K_{X})\Sigma^{2a}(K_{X}\otimes K_{X})\big{)}f,f\rangle=\langle\mathbb{E}f(X)\Sigma^{2a}K_{X}(X)K_{X},f\rangle
=\displaystyle= 𝔼f2(X)(Σ2aKX)(X)ck𝔼f2(X)=ckΣf,f,\displaystyle\mathbb{E}f^{2}(X)(\Sigma^{2a}K_{X})(X)\leq c_{k}\mathbb{E}f^{2}(X)=c_{k}\langle\Sigma f,f\rangle,

where the last inequality is due to the fact that

(\Sigma^{2a}K_{X})(X)=\sum_{\nu=1}^{\infty}\mu_{\nu}^{2a+1}\phi_{\nu}^{2}(X)\leq c_{\phi}^{2}\sum_{\nu=1}^{\infty}\mu_{\nu}^{2a+1}=:c_{k}<\infty.

Accordingly, Pnckj=1nT2(j1)Σnck(IT2)1Σnckγ1IP\leq nc_{k}\sum_{j=1}^{n}T^{2(j-1)}\Sigma\leq nc_{k}(I-T^{2})^{-1}\Sigma\leq nc_{k}\gamma^{-1}I; and

\frac{\gamma^{2}}{n^{2}}\sum_{i=1}^{n}\mathbb{E}\langle U_{i},\Sigma^{2a}U_{i}\rangle\leq\frac{2\gamma^{2}}{n^{2}}\langle f^{\ast},Pf^{\ast}\rangle\lesssim\frac{\gamma}{n}\|f^{\ast}\|^{2}_{\mathbb{H}}. (8.14)
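The operator bounds used for $P$ can be checked directly: under $0<\gamma<\mu_{1}^{-1}$ we have $0\preceq T=I-\gamma\Sigma\prec I$, so
\[
\sum_{j=1}^{n}T^{2(j-1)}\preceq(I-T^{2})^{-1},\qquad I-T^{2}=\gamma\Sigma(2I-\gamma\Sigma)\succeq\gamma\Sigma,\qquad\text{hence}\qquad(I-T^{2})^{-1}\Sigma=\gamma^{-1}(2I-\gamma\Sigma)^{-1}\preceq\gamma^{-1}I,
\]
which yields the chain of inequalities leading to (8.14).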

Next, we analyze $\mathbb{E}\langle U_{i},\Sigma^{2a}U_{j}\rangle$ in (8.11) for $1\leq i<j\leq n$. Note that

\mathbb{E}\langle U_{i},\Sigma^{2a}U_{j}\rangle=\mathbb{E}\langle H_{ii}+\cdots+H_{i1},\Sigma^{2a}(H_{jj}+\cdots+H_{j1})\rangle=\sum_{\ell=1}^{i}\sum_{k=1}^{j}\mathbb{E}\langle H_{i\ell},\Sigma^{2a}H_{jk}\rangle.

We first consider k\ell\neq k and assume >k\ell>k, note that i<ji<j, then

𝔼Hi,Σ2aHjk=𝔼SiSi1S+1TT1f,Σ2aSjSj1Sk+1TkTk1f\displaystyle\mathbb{E}\langle H_{i\ell},\Sigma^{2a}H_{jk}\rangle=\mathbb{E}\langle S_{i}S_{i-1}\cdots S_{\ell+1}T_{\ell}T^{\ell-1}f^{\ast},\Sigma^{2a}S_{j}S_{j-1}\cdots S_{k+1}T_{k}T^{k-1}f^{\ast}\rangle
=\displaystyle= 𝔼SiSi1S+1TT1f,Σ2aSjS𝔼(S1Sk+1Tk)Tk1f=0.\displaystyle\mathbb{E}\langle S_{i}S_{i-1}\cdots S_{\ell+1}T_{\ell}T^{\ell-1}f^{\ast},\Sigma^{2a}S_{j}\cdots S_{\ell}\mathbb{E}(S_{\ell-1}\cdots S_{k+1}T_{k})T^{k-1}f^{\ast}\rangle=0. (8.15)

Similarly, for <k\ell<k, 𝔼Hi,Σ2aHjk=𝔼[𝔼Hi,Σ2aHjk|Xj,Xk]=0.\mathbb{E}\langle H_{i\ell},\Sigma^{2a}H_{jk}\rangle=\mathbb{E}[\mathbb{E}\langle H_{i\ell},\Sigma^{2a}H_{jk}\rangle|X_{j},\cdots X_{k}]=0. Therefore,

𝔼Ui,Σ2aUj=\displaystyle\mathbb{E}\langle U_{i},\Sigma^{2a}U_{j}\rangle= =1i𝔼Hi,Σ2aHj==1i𝔼SiS+1TT1f,Σ2aSjS+1TT1f\displaystyle\sum_{\ell=1}^{i}\mathbb{E}\langle H_{i\ell},\Sigma^{2a}H_{j\ell}\rangle=\sum_{\ell=1}^{i}\mathbb{E}\langle S_{i}\cdots S_{\ell+1}T_{\ell}T^{\ell-1}f^{\ast},\Sigma^{2a}S_{j}\cdots S_{\ell+1}T_{\ell}T^{\ell-1}f^{\ast}\rangle
=\displaystyle= =1i𝔼f,T1TS+1SiΣ2aSjSiSi1S+1TT1f\displaystyle\sum_{\ell=1}^{i}\mathbb{E}\langle f^{\ast},T^{\ell-1}T_{\ell}S_{\ell+1}\cdots S_{i}\Sigma^{2a}S_{j}\cdots S_{i}S_{i-1}\cdots S_{\ell+1}T_{\ell}T^{\ell-1}f^{\ast}\rangle
=\displaystyle= =1i𝔼f,T1TS+1SiΣ2a(IγΣ)jiSiS+1TT1f.\displaystyle\sum_{\ell=1}^{i}\mathbb{E}\langle f^{\ast},T^{\ell-1}T_{\ell}S_{\ell+1}\cdots S_{i}\Sigma^{2a}(I-\gamma\Sigma)^{j-i}S_{i}\cdots S_{\ell+1}T_{\ell}T^{\ell-1}f^{\ast}\rangle.

And $\sum_{i<j}\mathbb{E}\langle U_{i},\Sigma^{2a}U_{j}\rangle=\sum_{i=1}^{n-1}\sum_{\ell=1}^{i}\mathbb{E}\langle f^{\ast},T^{\ell-1}T_{\ell}S_{\ell+1}\cdots S_{i}\Sigma^{2a}\big(\sum_{j=i+1}^{n}(I-\gamma\Sigma)^{j-i}\big)S_{i}\cdots S_{\ell+1}T_{\ell}T^{\ell-1}f^{\ast}\rangle$. Since $\Sigma^{2a}\sum_{j=i+1}^{n}(I-\gamma\Sigma)^{j-i}=\Sigma^{2a}\sum_{m=1}^{n-i}(I-\gamma\Sigma)^{m}\preceq\Sigma^{2a}\sum_{m=1}^{n-1}(I-\gamma\Sigma)^{m}\equiv A$, we have

i<j𝔼Ui,Σ2aUji=1n1=1i𝔼f,T1TS+1SiASiS+1TT1f\displaystyle\sum_{i<j}\mathbb{E}\langle U_{i},\Sigma^{2a}U_{j}\rangle\leq\sum_{i=1}^{n-1}\sum_{\ell=1}^{i}\mathbb{E}\langle f^{\ast},T^{\ell-1}T_{\ell}S_{\ell+1}\cdots S_{i}AS_{i}\cdots S_{\ell+1}T_{\ell}T^{\ell-1}f^{\ast}\rangle
=\displaystyle= =1n1i=n1f,T1𝔼(TS+1SiASiS+1T)T1f=1n1f,T1S(i=n1(IγG)i)AT1f\displaystyle\sum_{\ell=1}^{n-1}\sum_{i=\ell}^{n-1}\langle f^{\ast},T^{\ell-1}\mathbb{E}(T_{\ell}S_{\ell+1}\cdots S_{i}AS_{i}\cdots S_{\ell+1}T_{\ell})T^{\ell-1}f^{\ast}\rangle\leq\sum_{\ell=1}^{n-1}\langle f^{\ast},T^{\ell-1}S(\sum_{i=\ell}^{n-1}(I-\gamma G)^{i-\ell})AT^{\ell-1}f^{\ast}\rangle
\displaystyle\leq =1n1f,T1BAT1f,\displaystyle\sum_{\ell=1}^{n-1}\langle f^{\ast},T^{\ell-1}BAT^{\ell-1}f^{\ast}\rangle, (8.16)

where $B=S\sum_{i=\ell}^{n-1}(I-\gamma G)^{i-\ell}$, and $BA\preceq S\big(\sum_{i=0}^{n-1}(I-\gamma G)^{i}\big)A\preceq nSA=n\,\mathbb{E}\big((K_{X}\otimes K_{X})A(K_{X}\otimes K_{X})\big)\preceq n\gamma^{-1}\mathbb{E}\big((K_{X}\otimes K_{X})\Sigma^{-1+2a}(K_{X}\otimes K_{X})\big)\preceq n\gamma^{-1}c_{k}\Sigma$, where the last step is due to the fact that

𝔼((KXKX)Σ1+2a(KXKX))f,f=𝔼(KXKX)Σ1+2aKXf(X),f\displaystyle\langle\mathbb{E}\big{(}(K_{X}\otimes K_{X})\Sigma^{-1+2a}(K_{X}\otimes K_{X})\big{)}f,f\rangle=\mathbb{E}\langle(K_{X}\otimes K_{X})\Sigma^{-1+2a}K_{X}f(X),f\rangle
=\displaystyle= 𝔼f(X)KXΣ1+2aKX(X),f=𝔼f2(X)Σ1+2aKX,KXCΣf,f\displaystyle\mathbb{E}f(X)\langle K_{X}\Sigma^{-1+2a}K_{X}(X),f\rangle=\mathbb{E}f^{2}(X)\langle\Sigma^{-1+2a}K_{X},K_{X}\rangle\leq C\langle\Sigma f,f\rangle

with Σ1+2aKX,KX=ν=1ϕν(X)μν2aϕν(X)cϕ2ν=1ν2aα<\langle\Sigma^{-1+2a}K_{X},K_{X}\rangle=\sum_{\nu=1}^{\infty}\phi_{\nu}(X)\mu_{\nu}^{2a}\phi_{\nu}(X)\leq c_{\phi}^{2}\sum_{\nu=1}^{\infty}\nu^{-2a\alpha}<\infty for 2aα>12a\alpha>1.

By equation (8.16), we have $\sum_{i<j}\mathbb{E}\langle U_{i},\Sigma^{2a}U_{j}\rangle\leq\sum_{\ell=1}^{n-1}\langle f^{\ast},T^{\ell-1}BAT^{\ell-1}f^{\ast}\rangle$. Recall $T=I-\gamma\Sigma$. For notational simplicity, let $C=BA$; then $TCT$ can be written as $(I-\gamma\Sigma)C(I-\gamma\Sigma)=C-\gamma\Sigma C-\gamma C\Sigma+\gamma^{2}\Sigma C\Sigma=C-\gamma\Theta C=(I-\gamma\Theta)C$, where $\Theta$ is the operator defined by $\Theta C=\Sigma C+C\Sigma-\gamma\Sigma C\Sigma$ for any $C$. Replacing $C$ with $BA$, we have $T^{\ell-1}BAT^{\ell-1}=(I-\gamma\Theta)^{\ell-1}BA$, and

i<j𝔼Ui,Σ2aUjf,=1n1(IγΘ)1BAf.\sum_{i<j}\mathbb{E}\langle U_{i},\Sigma^{2a}U_{j}\rangle\leq\langle f^{\ast},\sum_{\ell=1}^{n-1}(I-\gamma\Theta)^{\ell-1}BAf^{\ast}\rangle.

Since =1n1(IγΘ)1γ1Θ1\sum_{\ell=1}^{n-1}(I-\gamma\Theta)^{\ell-1}\leq\gamma^{-1}\Theta^{-1}, we further need to bound Θ1\Theta^{-1}. Let C=Θ1C=\Theta^{-1}, then I=ΣΘ1+Θ1ΣγΣΘ1ΣI=\Sigma\Theta^{-1}+\Theta^{-1}\Sigma-\gamma\Sigma\Theta^{-1}\Sigma. Note that ΣΘ1Σtr(Σ)Θ1ΣcΘ1Σ\Sigma\Theta^{-1}\Sigma\leq tr(\Sigma)\Theta^{-1}\Sigma\leq c\Theta^{-1}\Sigma, where cc is a constant. Then

I\displaystyle I\succeq ΣΘ1+Θ1ΣcγΘ1Σ=ΣΘ1+(1cγ)Θ1Σ=(ΣI+(1cγ)IΣ)Θ1.\displaystyle\Sigma\Theta^{-1}+\Theta^{-1}\Sigma-c\gamma\Theta^{-1}\Sigma=\Sigma\Theta^{-1}+(1-c\gamma)\Theta^{-1}\Sigma=(\Sigma\otimes I+(1-c\gamma)I\otimes\Sigma)\Theta^{-1}.

Therefore, Θ1(ΣI+(1cγ)IΣ)1I\Theta^{-1}\preceq(\Sigma\otimes I+(1-c\gamma)I\otimes\Sigma)^{-1}I, and

=1n1(IγΘ)1BA\displaystyle\sum_{\ell=1}^{n-1}(I-\gamma\Theta)^{\ell-1}BA\preceq γ1Θ1nγ1Σ11+(1cγ)nγ2\displaystyle\gamma^{-1}\Theta^{-1}n\gamma^{-1}\Sigma\preceq\frac{1}{1+(1-c\gamma)}n\gamma^{-2}

Accordingly, we have γ2n2i<j𝔼Ui,Σ2aUjγnγ\frac{\gamma^{2}}{n^{2}}\sum_{i<j}\mathbb{E}\langle U_{i},\Sigma^{2a}U_{j}\rangle\lesssim\frac{\gamma}{n\gamma}.

Therefore,

\mathbb{E}\langle\bar{\beta}_{n},\Sigma^{2a}\bar{\beta}_{n}\rangle=\frac{\gamma^{2}}{n^{2}}\sum_{i=1}^{n}\mathbb{E}\langle U_{i},\Sigma^{2a}U_{i}\rangle+\frac{2\gamma^{2}}{n^{2}}\sum_{i<j}\mathbb{E}\langle U_{i},\Sigma^{2a}U_{j}\rangle\lesssim\frac{\gamma}{n\gamma}\|f^{\ast}\|^{2}_{\mathbb{H}}.

Then by Markov’s inequality, we have

\mathbb{P}\Big(\|\Sigma^{a}\big(\bar{\eta}_{n}^{bias}-\bar{\eta}_{n}^{bias,0}\big)\|^{2}_{\mathbb{H}}>\gamma^{-1/2}\,\mathbb{E}\|\Sigma^{a}\bar{\beta}_{n}\|^{2}_{\mathbb{H}}\Big)\leq\gamma^{1/2}.

That is, η¯nbiasη¯nbias,02γ1/2nγ\|\bar{\eta}_{n}^{bias}-\bar{\eta}_{n}^{bias,0}\|^{2}_{\infty}\leq\frac{\gamma^{1/2}}{n\gamma} with probability at least 1γ1/21-\gamma^{1/2}. ∎

8.3 Proof of the sup-norm bound of the noise remainder in the constant step case

Recall the noise remainder recursion follows

\eta_{n}^{noise}-\eta_{n}^{noise,0}=(I-\gamma K_{X_{n}}\otimes K_{X_{n}})(\eta_{n-1}^{noise}-\eta_{n-1}^{noise,0})+\gamma(\Sigma-K_{X_{n}}\otimes K_{X_{n}})\eta_{n-1}^{noise,0}.

Following the recursion decomposition in Section 6.1, we can split $\eta_{n}^{noise}-\eta_{n}^{noise,0}$ into a higher-order expansion as

ηnnoise=ηnnoise,0+ηnnoise,1+ηnnoise,2++ηnnoise,r+Remainder,\eta_{n}^{noise}=\eta_{n}^{noise,0}+\eta_{n}^{noise,1}+\eta_{n}^{noise,2}+\cdots+\eta_{n}^{noise,r}+\textrm{Remainder},

where ηnnoise,d\eta_{n}^{noise,d} can be viewed as ηnnoise,d=(IγΣ)ηn1noise,d+γnd\eta_{n}^{noise,d}=(I-\gamma\Sigma)\eta_{n-1}^{noise,d}+\gamma\mathcal{E}_{n}^{d} and nd=(ΣKXnKXn)ηn1noise,d1\mathcal{E}_{n}^{d}=(\Sigma-K_{X_{n}}\otimes K_{X_{n}})\eta_{n-1}^{noise,d-1} for 1dr1\leq d\leq r and r1r\geq 1. The remainder term follows the recursion as

\eta_{n}^{noise}-\sum_{d=0}^{r}\eta_{n}^{noise,d}=(I-\gamma K_{X_{n}}\otimes K_{X_{n}})\Big(\eta_{n-1}^{noise}-\sum_{d=0}^{r}\eta_{n-1}^{noise,d}\Big)+\gamma\mathcal{E}_{n}^{r+1}.
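This remainder recursion can be verified by writing $I-\gamma\Sigma=(I-\gamma K_{X_{n}}\otimes K_{X_{n}})-\gamma(\Sigma-K_{X_{n}}\otimes K_{X_{n}})$ and subtracting the recursions of $\eta_{n}^{noise}$ and $\eta_{n}^{noise,d}$, $0\leq d\leq r$:
\[
\eta_{n}^{noise}-\sum_{d=0}^{r}\eta_{n}^{noise,d}=(I-\gamma K_{X_{n}}\otimes K_{X_{n}})\Big(\eta_{n-1}^{noise}-\sum_{d=0}^{r}\eta_{n-1}^{noise,d}\Big)+\gamma(\Sigma-K_{X_{n}}\otimes K_{X_{n}})\Big(\sum_{d=0}^{r}\eta_{n-1}^{noise,d}-\sum_{d=0}^{r-1}\eta_{n-1}^{noise,d}\Big),
\]
and the last factor equals $\eta_{n-1}^{noise,r}$, giving the term $\gamma\mathcal{E}_{n}^{r+1}$.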

The following Lemma 8.2 demonstrates that the higher-order expansion terms $\bar{\eta}_{n}^{noise,d}$ (for $d\geq 1$) decrease as $d$ increases. In particular, we first characterize the behavior of $\|\bar{\eta}_{n}^{noise,1}\|_{\infty}$ by representing it as a weighted empirical process, and establish the convergence rate $\|\bar{\eta}_{n}^{noise,1}\|_{\infty}=o(\|\bar{\eta}_{n}^{noise,0}\|_{\infty})$ with high probability. Next, we show that $\|\bar{\eta}_{n}^{noise,d+1}\|_{\infty}=o(\|\bar{\eta}_{n}^{noise,d}\|_{\infty})$ for $d\geq 1$ by induction. Finally, we bound $\bar{\eta}_{n}^{noise}-\sum_{d=0}^{r}\bar{\eta}_{n}^{noise,d}$ through its $\mathbb{H}$-norm, using the property that $\|\bar{\eta}_{n}^{noise}-\sum_{d=0}^{r}\bar{\eta}_{n}^{noise,d}\|_{\infty}\leq\|\bar{\eta}_{n}^{noise}-\sum_{d=0}^{r}\bar{\eta}_{n}^{noise,d}\|_{\mathbb{H}}$.

Lemma 8.2.

Suppose the step size γ(n)=γ\gamma(n)=\gamma with 0<γ<n22+3α0<\gamma<n^{-\frac{2}{2+3\alpha}}. Then

(a) $\mathbb{P}\Big(\|\bar{\eta}_{n}^{noise,1}\|_{\infty}>\sqrt{\gamma^{1/2}(n\gamma)^{1/\alpha}n^{-1}\log n}\Big)\leq 2\gamma^{1/2}$.

(b) $\mathbb{P}\Big(\|\bar{\eta}_{n}^{noise,d}\|^{2}_{\infty}\geq\gamma^{1/4}(n\gamma)^{1/\alpha}n^{-1}\Big)\leq(n\gamma)^{1/\alpha+2\varepsilon}\gamma^{d-1/4}$. Furthermore, for $d\geq 2$ and $0<\gamma<n^{-\frac{2}{2+3\alpha}}$, we have $(n\gamma)^{1/\alpha+2\varepsilon}\gamma^{d-1/4}\leq\gamma^{1/4}$.

(c) For $r$ large enough, $\mathbb{P}\Big(\|\bar{\eta}_{n}^{noise}-\sum_{d=0}^{r}\bar{\eta}_{n}^{noise,d}\|^{2}_{\infty}\geq\gamma^{1/4}(n\gamma)^{1/\alpha}n^{-1}\Big)\leq n^{-1}$.

Furthermore, combining (a)–(c), we have

(η¯nnoiseη¯nnoise,02Cγ1/4(nγ)1/αn1)γ1/4,\mathbb{P}\Big{(}\|\bar{\eta}_{n}^{noise}-\bar{\eta}_{n}^{noise,0}\|^{2}_{\infty}\geq C\gamma^{1/4}(n\gamma)^{1/\alpha}n^{-1}\Big{)}\leq\gamma^{1/4},

where CC is a constant.

Proof.

Proof of Lemma 8.2 (a) by analyzing η¯nnoise,1\|\bar{\eta}_{n}^{noise,1}\|_{\infty}. First, we calculate the explicit expression of ηnnoise,1\eta_{n}^{noise,1}. Let T=IγΣT=I-\gamma\Sigma and Tn=ΣKXnKXnT_{n}=\Sigma-K_{X_{n}}\otimes K_{X_{n}}, then ηnnoise,1=Tηn1noise,1+γTnηn1noise,0\eta_{n}^{noise,1}=T\eta_{n-1}^{noise,1}+\gamma T_{n}\eta_{n-1}^{noise,0} with η0noise,1=0\eta_{0}^{noise,1}=0. Therefore,

ηnnoise,1=\displaystyle\eta_{n}^{noise,1}= γi=1n1Tni1Ti+1ηinoise,0=γ2i=1n1j=1iϵjTni1Ti+1TijKXj,\displaystyle\gamma\sum_{i=1}^{n-1}T^{n-i-1}T_{i+1}\eta_{i}^{noise,0}=\gamma^{2}\sum_{i=1}^{n-1}\sum_{j=1}^{i}\epsilon_{j}T^{n-i-1}T_{i+1}T^{i-j}K_{X_{j}},

where the last step follows by plugging in $\eta_{i}^{noise,0}=\gamma\sum_{j=1}^{i}T^{i-j}\epsilon_{j}K_{X_{j}}$ from (8.9) with constant step size $\gamma$. Accordingly,

η¯nnoise,1=\displaystyle\bar{\eta}_{n}^{noise,1}= γ2n=1n1i=1j=1iϵjTiTi+1TijKXj=γ2nj=1n1(i=jn1(=in1Ti)Ti+1TijKXj)ϵj.\displaystyle\frac{\gamma^{2}}{n}\sum_{\ell=1}^{n-1}\sum_{i=1}^{\ell}\sum_{j=1}^{i}\epsilon_{j}T^{\ell-i}T_{i+1}T^{i-j}K_{X_{j}}=\frac{\gamma^{2}}{n}\sum_{j=1}^{n-1}\big{(}\sum_{i=j}^{n-1}(\sum_{\ell=i}^{n-1}T^{\ell-i})T_{i+1}T^{i-j}K_{X_{j}}\big{)}\epsilon_{j}. (8.17)

Let $g_{j}=\sum_{i=j}^{n-1}\big(\sum_{\ell=i}^{n-1}T^{\ell-i}\big)T_{i+1}T^{i-j}K_{X_{j}}$, where the randomness of $g_{j}$ involves $X_{j},X_{j+1},\dots,X_{n}$. Then $\bar{\eta}_{n}^{noise,1}(\cdot)=\frac{\gamma^{2}}{n}\sum_{j=1}^{n-1}\epsilon_{j}\,g_{j}(\cdot)$, which is a Gaussian process conditional on $(X_{1},\dots,X_{n})$.

We can further express $g_{j}(\cdot)$ in terms of the eigenvalues and eigenfunctions as follows:

gj()=γ1ν,k=1μνi=jn1(1γμν)ij(1(1γμk)ni)ϕiνkϕν(Xj)ϕk()g_{j}(\cdot)=\gamma^{-1}\sum_{\nu,k=1}^{\infty}\mu_{\nu}\sum_{i=j}^{n-1}(1-\gamma\mu_{\nu})^{i-j}(1-(1-\gamma\mu_{k})^{n-i})\phi_{i\nu k}\phi_{\nu}(X_{j})\phi_{k}(\cdot) (8.18)

with $\phi_{i\nu k}=\phi_{\nu}(X_{i+1})\phi_{k}(X_{i+1})-\delta_{\nu k}$; we refer to [33] for the proof. This expression facilitates the downstream analysis of $\bar{\eta}_{n}^{noise,1}$. Denote $a_{ij\nu}=(1-\gamma\mu_{\nu})^{i-j}$ and $b_{ik}=1-(1-\gamma\mu_{k})^{n-i}$. Then $g_{j}$ can be simplified as $g_{j}=\gamma^{-1}\sum_{\nu,k=1}^{\infty}\mu_{\nu}\big(\sum_{i=j}^{n-1}a_{ij\nu}b_{ik}\phi_{i\nu k}\big)\phi_{\nu}(X_{j})\phi_{k}$.

We are ready to prove that $\|\bar{\eta}_{n}^{noise,1}\|_{\infty}\lesssim\gamma^{\frac{1}{4}}n^{-\frac{1}{2}}(n\gamma)^{\frac{1}{2\alpha}}\sqrt{\log n}$ with high probability, where $\bar{\eta}_{n}^{noise,1}(\cdot)=\frac{\gamma^{2}}{n}\sum_{j=1}^{n-1}\epsilon_{j}\,g_{j}(\cdot)$. This involves two steps: (1) for any fixed $s$, $\bar{\eta}_{n}^{noise,1}(s)=\frac{\gamma^{2}}{n}\sum_{j=1}^{n-1}\epsilon_{j}\,g_{j}(s)$ is a weighted Gaussian random variable with variance proportional to $\frac{\gamma^{4}}{n^{2}}\sum_{j=1}^{n-1}g^{2}_{j}(s)$ conditional on $X_{1:n}=(X_{1},\dots,X_{n})$; we therefore first bound $\bar{\eta}_{n}^{noise,1}(s)$ with an exponentially decaying probability by characterizing $\sum_{j=1}^{n-1}g^{2}_{j}(s)$; (2) we then bridge $\bar{\eta}_{n}^{noise,1}(s)$ to $\|\bar{\eta}_{n}^{noise,1}\|_{\infty}$. The details are as follows.

Conditional on X1:nX_{1:n}, η¯nnoise,1(s)=γ2nj=1n1ϵjgj(s)\bar{\eta}_{n}^{noise,1}(s)=\frac{\gamma^{2}}{n}\sum_{j=1}^{n-1}\epsilon_{j}\cdot g_{j}(s) is a weighted Gaussian random variable; by Hoeffding’s inequality,

(γ2n|j=1n1ϵjgj(s)|>uX1:n)exp(u2n2γ4j=1n1gj2(s)).\mathbb{P}\Big{(}\frac{\gamma^{2}}{n}|\sum_{j=1}^{n-1}\epsilon_{j}\cdot g_{j}(s)|>u\mid X_{1:n}\Big{)}\leq\exp\Big{(}-\frac{u^{2}n^{2}}{\gamma^{4}\sum_{j=1}^{n-1}g_{j}^{2}(s)}\Big{)}. (8.19)

We then bound $\mathbb{E}\sum_{j=1}^{n-1}g_{j}^{2}(s)$. We separate $\sum_{j=1}^{n-1}g^{2}_{j}(s)$ into two parts as follows:

j=1n1gj2(s)\displaystyle\sum_{j=1}^{n-1}g^{2}_{j}(s)
\displaystyle\leq γ2ν,ν=1j=1n1μνμν(ϕν(Xj)ϕν(Xj)δνν)i,=jn1aijνajνk,k=1bikbkϕiνkϕνkϕk(s)ϕk(s)\displaystyle\gamma^{-2}\sum_{\nu,\nu^{\prime}=1}^{\infty}\sum_{j=1}^{n-1}\mu_{\nu}\mu_{\nu^{\prime}}(\phi_{\nu}(X_{j})\phi_{\nu^{\prime}}(X_{j})-\delta_{\nu\nu^{\prime}})\sum_{i,\ell=j}^{n-1}a_{ij\nu}a_{\ell j\nu^{\prime}}\sum_{k,k^{\prime}=1}^{\infty}b_{ik}b_{\ell k^{\prime}}\phi_{i\nu k}\phi_{\ell\nu^{\prime}k^{\prime}}\phi_{k}(s)\phi_{k^{\prime}}(s)
+γ2ν=1μν2j=1n1i,=jn1aijνajνk,k=1bikbkϕiνkϕνkϕk(s)ϕk(s)\displaystyle+\gamma^{-2}\sum_{\nu=1}^{\infty}\mu_{\nu}^{2}\sum_{j=1}^{n-1}\sum_{i,\ell=j}^{n-1}a_{ij\nu}a_{\ell j\nu}\sum_{k,k^{\prime}=1}^{\infty}b_{ik}b_{\ell k^{\prime}}\phi_{i\nu k}\phi_{\ell\nu k^{\prime}}\phi_{k}(s)\phi_{k^{\prime}}(s)
=\displaystyle= Δ1+Δ2,\displaystyle\Delta_{1}+\Delta_{2},

where $\Delta_{1}$ involves the interaction terms indexed by $\nu,\nu^{\prime}$ and $\Delta_{2}$ collects the terms with $\nu=\nu^{\prime}$. Recall $b_{ik}=1-(1-\gamma\mu_{k})^{n-i}$. Then $b_{ik}\leq 1-(1-\gamma\mu_{k})^{n}\equiv b_{k}$ for $1\leq i\leq n$. For $\Delta_{1}$, we have

Δ1\displaystyle\Delta_{1}\leq γ2k,k=1bkbkϕk(s)ϕk(s)ν,ν=1μνμνj=1n1(ϕν(Xj)ϕν(Xj)δνν)i,=jn1aijνajνϕiνkϕνk,\displaystyle\gamma^{-2}\sum_{k,k^{\prime}=1}^{\infty}b_{k}b_{k^{\prime}}\phi_{k}(s)\phi_{k^{\prime}}(s)\sum_{\nu,\nu^{\prime}=1}^{\infty}\mu_{\nu}\mu_{\nu^{\prime}}\sum_{j=1}^{n-1}\big{(}\phi_{\nu}(X_{j})\phi_{\nu^{\prime}}(X_{j})-\delta_{\nu\nu^{\prime}}\big{)}\sum_{i,\ell=j}^{n-1}a_{ij\nu}a_{\ell j\nu^{\prime}}\phi_{i\nu k}\phi_{\ell\nu^{\prime}k^{\prime}},

To control the expectation of $\Delta_{1}$, we bound the second moment of the inner sum:

𝔼|ν,ν=1μνμνj=1n1(ϕν(Xj)ϕν(Xj)δνν)i,=jn1aijνajνϕiνkϕνk|2\displaystyle\mathbb{E}|\sum_{\nu,\nu^{\prime}=1}^{\infty}\mu_{\nu}\mu_{\nu^{\prime}}\sum_{j=1}^{n-1}\big{(}\phi_{\nu}(X_{j})\phi_{\nu^{\prime}}(X_{j})-\delta_{\nu\nu^{\prime}}\big{)}\sum_{i,\ell=j}^{n-1}a_{ij\nu}a_{\ell j\nu^{\prime}}\phi_{i\nu k}\phi_{\ell\nu^{\prime}k^{\prime}}|^{2} (8.20)
\displaystyle\leq (ν,ν=1μν1+εαμν1+εα)(ν,ν=1μν21+εαμν21+εα𝔼|j=1n1(ϕν(Xj)ϕν(Xj)δνν)i,=jn1aijνajνϕiνkϕνk|2)\displaystyle\big{(}\sum_{\nu,\nu^{\prime}=1}^{\infty}\mu_{\nu}^{\frac{1+\varepsilon}{\alpha}}\mu_{\nu^{\prime}}^{\frac{1+\varepsilon}{\alpha}}\big{)}\big{(}\sum_{\nu,\nu^{\prime}=1}^{\infty}\mu_{\nu}^{2-\frac{1+\varepsilon}{\alpha}}\mu_{\nu^{\prime}}^{2-\frac{1+\varepsilon}{\alpha}}\mathbb{E}|\sum_{j=1}^{n-1}\big{(}\phi_{\nu}(X_{j})\phi_{\nu^{\prime}}(X_{j})-\delta_{\nu\nu^{\prime}}\big{)}\sum_{i,\ell=j}^{n-1}a_{ij\nu}a_{\ell j\nu^{\prime}}\phi_{i\nu k}\phi_{\ell\nu^{\prime}k^{\prime}}|^{2}\big{)}
\displaystyle\lesssim (ν,ν=1μν1+εα)2(ν,ν=1μν21+εαμν21+εαj=1n1𝔼(ϕν(Xj)ϕν(Xj)δνν)2𝔼|i,=jn1aijνajνϕiνkϕνk|2),\displaystyle\big{(}\sum_{\nu,\nu^{\prime}=1}^{\infty}\mu_{\nu}^{\frac{1+\varepsilon}{\alpha}}\big{)}^{2}\big{(}\sum_{\nu,\nu^{\prime}=1}^{\infty}\mu_{\nu}^{2-\frac{1+\varepsilon}{\alpha}}\mu_{\nu^{\prime}}^{2-\frac{1+\varepsilon}{\alpha}}\sum_{j=1}^{n-1}\mathbb{E}\big{(}\phi_{\nu}(X_{j})\phi_{\nu^{\prime}}(X_{j})-\delta_{\nu\nu^{\prime}}\big{)}^{2}\cdot\mathbb{E}|\sum_{i,\ell=j}^{n-1}a_{ij\nu}a_{\ell j\nu^{\prime}}\phi_{i\nu k}\phi_{\ell\nu^{\prime}k^{\prime}}|^{2}\big{)},

where the last step is due to the calculation that

𝔼|j=1n1(ϕν(Xj)ϕν(Xj)δνν)i,=jn1aijνajνϕiνkϕνk|2\displaystyle\mathbb{E}|\sum_{j=1}^{n-1}\big{(}\phi_{\nu}(X_{j})\phi_{\nu^{\prime}}(X_{j})-\delta_{\nu\nu^{\prime}}\big{)}\sum_{i,\ell=j}^{n-1}a_{ij\nu}a_{\ell j\nu^{\prime}}\phi_{i\nu k}\phi_{\ell\nu^{\prime}k^{\prime}}|^{2}
=\displaystyle= j=1n1𝔼(ϕν(Xj)ϕν(Xj)δνν)2𝔼|i,=jn1aijνajνϕiνkϕνk|2\displaystyle\sum_{j=1}^{n-1}\mathbb{E}\big{(}\phi_{\nu}(X_{j})\phi_{\nu^{\prime}}(X_{j})-\delta_{\nu\nu^{\prime}}\big{)}^{2}\cdot\mathbb{E}|\sum_{i,\ell=j}^{n-1}a_{ij\nu}a_{\ell j\nu^{\prime}}\phi_{i\nu k}\phi_{\ell\nu^{\prime}k^{\prime}}|^{2}
+2j1<j2𝔼((ϕν(xj1)ϕν(xj1)δνν))𝔼((ϕν(xj2)ϕν(xj2)δνν)(i,=j1n1aij1νaj1νϕiνkϕνk)\displaystyle+2\sum_{j_{1}<j_{2}}\mathbb{E}\big{(}\big{(}\phi_{\nu}(x_{j_{1}})\phi_{\nu^{\prime}}(x_{j_{1}})-\delta_{\nu\nu^{\prime}}\big{)}\big{)}\cdot\mathbb{E}\big{(}\big{(}\phi_{\nu}(x_{j_{2}})\phi_{\nu^{\prime}}(x_{j_{2}})-\delta_{\nu\nu^{\prime}}\big{)}\cdot(\sum_{i,\ell=j_{1}}^{n-1}a_{ij_{1}\nu}a_{\ell j_{1}\nu^{\prime}}\phi_{i\nu k}\phi_{\ell\nu^{\prime}k^{\prime}})
(i,=j2n1aij2νaj2νϕiνkϕνk))=j=1n1𝔼(ϕν(Xj)ϕν(Xj)δνν)2𝔼|i,=jn1aijνajνϕiνkϕνk|2,\displaystyle\cdot(\sum_{i,\ell=j_{2}}^{n-1}a_{ij_{2}\nu}a_{\ell j_{2}\nu^{\prime}}\phi_{i\nu k}\phi_{\ell\nu^{\prime}k^{\prime}})\big{)}=\sum_{j=1}^{n-1}\mathbb{E}\big{(}\phi_{\nu}(X_{j})\phi_{\nu^{\prime}}(X_{j})-\delta_{\nu\nu^{\prime}}\big{)}^{2}\cdot\mathbb{E}|\sum_{i,\ell=j}^{n-1}a_{ij\nu}a_{\ell j\nu^{\prime}}\phi_{i\nu k}\phi_{\ell\nu^{\prime}k^{\prime}}|^{2},

with 𝔼(ϕν(xj1)ϕν(xj1)δνν)=0\mathbb{E}\big{(}\phi_{\nu}(x_{j_{1}})\phi_{\nu^{\prime}}(x_{j_{1}})-\delta_{\nu\nu^{\prime}}\big{)}=0. Note that

j=1n1(𝔼(ϕν(Xj)ϕν(Xj)δνν)2𝔼|i,=jn1aijνajνϕiνkϕνk|2)j=1n1𝔼|i,=jn1aijνajνϕiνkϕνk|2\displaystyle\sum_{j=1}^{n-1}\big{(}\mathbb{E}\big{(}\phi_{\nu}(X_{j})\phi_{\nu^{\prime}}(X_{j})-\delta_{\nu\nu^{\prime}}\big{)}^{2}\cdot\mathbb{E}|\sum_{i,\ell=j}^{n-1}a_{ij\nu}a_{\ell j\nu^{\prime}}\phi_{i\nu k}\phi_{\ell\nu^{\prime}k^{\prime}}|^{2}\big{)}\leq\sum_{j=1}^{n-1}\mathbb{E}|\sum_{i,\ell=j}^{n-1}a_{ij\nu}a_{\ell j\nu^{\prime}}\phi_{i\nu k}\phi_{\ell\nu^{\prime}k^{\prime}}|^{2}
=\displaystyle= j=1n1i1,i2=jn11,2=jn1ai1jνa1jνai2jνa2jν𝔼(ϕi1νkϕi2νkϕ1νkϕ2νk)\displaystyle\sum_{j=1}^{n-1}\sum_{i_{1},i_{2}=j}^{n-1}\sum_{\ell_{1},\ell_{2}=j}^{n-1}a_{i_{1}j\nu}a_{\ell_{1}j\nu^{\prime}}a_{i_{2}j\nu}a_{\ell_{2}j\nu^{\prime}}\mathbb{E}(\phi_{i_{1}\nu k}\phi_{i_{2}\nu k}\phi_{\ell_{1}\nu^{\prime}k^{\prime}}\phi_{\ell_{2}\nu^{\prime}k^{\prime}})
(i)\displaystyle\overset{(i)}{\lesssim} j=1n(i=jn1aijν2aijν2+i,=jn1aijν2ajν2+i,=jn1aijνaijνajνajν)\displaystyle\sum_{j=1}^{n}\big{(}\sum_{i=j}^{n-1}a_{ij\nu}^{2}a_{ij\nu^{\prime}}^{2}+\sum_{i,\ell=j}^{n-1}a_{ij\nu}^{2}a_{\ell j\nu^{\prime}}^{2}+\sum_{i,\ell=j}^{n-1}a_{ij\nu}a_{ij\nu^{\prime}}a_{\ell j\nu}a_{\ell j\nu^{\prime}}\big{)}
\displaystyle\leq j=1n1(i=jn1aijν2)(i=jn1aijν2)+(i=jn1aijνaijν)2.\displaystyle\sum_{j=1}^{n-1}\big{(}\sum_{i=j}^{n-1}a_{ij\nu}^{2}\big{)}\big{(}\sum_{i=j}^{n-1}a^{2}_{ij\nu^{\prime}}\big{)}+\big{(}\sum_{i=j}^{n-1}a_{ij\nu}a_{ij\nu^{\prime}}\big{)}^{2}.

In the $(i)$-step, $\mathbb{E}(\phi_{i_{1}\nu k}\phi_{i_{2}\nu k}\phi_{\ell_{1}\nu^{\prime}k^{\prime}}\phi_{\ell_{2}\nu^{\prime}k^{\prime}})\neq 0$ only if one of the following cases holds (up to symmetry): (1) $i_{1}=i_{2}=\ell_{1}=\ell_{2}$; (2) $i_{1}=i_{2}$ and $\ell_{1}=\ell_{2}$; (3) $i_{1}=\ell_{1}$ and $i_{2}=\ell_{2}$. Recall $a_{ij\nu}=(1-\gamma\mu_{\nu})^{i-j}$. Then we have

i=jn1aijνaijν=\displaystyle\sum_{i=j}^{n-1}a_{ij\nu}a_{ij\nu^{\prime}}= i=jn1[(1γμν)(1γμν)]ij(1(1γμν)(1γμν))1γ1(μν+μν)1.\displaystyle\sum_{i=j}^{n-1}[(1-\gamma\mu_{\nu})(1-\gamma\mu_{\nu^{\prime}})]^{i-j}\leq(1-(1-\gamma\mu_{\nu})(1-\gamma\mu_{\nu^{\prime}}))^{-1}\leq\gamma^{-1}(\mu_{\nu}+\mu_{\nu^{\prime}})^{-1}.

For i=jn1aijν2\sum_{i=j}^{n-1}a^{2}_{ij\nu}, we have i=jn1aijν2=i=jn1(1γμν)2(ij)γ1μν1.\sum_{i=j}^{n-1}a^{2}_{ij\nu}=\sum_{i=j}^{n-1}(1-\gamma\mu_{\nu})^{2(i-j)}\lesssim\gamma^{-1}\mu_{\nu}^{-1}. Therefore,

𝔼|ν,ν=1μνμνj=1n1(ϕν(Xj)ϕν(Xj)δνν)i,=jn1aijνajνϕiνkϕνk|2\displaystyle\mathbb{E}|\sum_{\nu,\nu^{\prime}=1}^{\infty}\mu_{\nu}\mu_{\nu^{\prime}}\sum_{j=1}^{n-1}\big{(}\phi_{\nu}(X_{j})\phi_{\nu^{\prime}}(X_{j})-\delta_{\nu\nu^{\prime}}\big{)}\sum_{i,\ell=j}^{n-1}a_{ij\nu}a_{\ell j\nu^{\prime}}\phi_{i\nu k}\phi_{\ell\nu^{\prime}k^{\prime}}|^{2}
\displaystyle\lesssim (ν=1μν1+εα)2ν,ν=1μν21+εαμν21+εαj=1n1(γ2(μν+μν)2+γ1μν1γ1μν1)\displaystyle(\sum_{\nu=1}^{\infty}\mu_{\nu}^{\frac{1+\varepsilon}{\alpha}})^{2}\sum_{\nu,\nu^{\prime}=1}^{\infty}\mu_{\nu}^{2-\frac{1+\varepsilon}{\alpha}}\mu_{\nu^{\prime}}^{2-\frac{1+\varepsilon}{\alpha}}\sum_{j=1}^{n-1}\big{(}\gamma^{-2}(\mu_{\nu}+\mu_{\nu}^{\prime})^{-2}+\gamma^{-1}\mu_{\nu}^{-1}\gamma^{-1}\mu_{\nu^{\prime}}^{-1}\big{)}
\displaystyle\lesssim nγ2(ν,ν=1μν11+εαμν11+εα+ν,ν=1μν21+εαμν21+εα(μν+μν)2)nγ2,\displaystyle n\gamma^{-2}\big{(}\sum_{\nu,\nu^{\prime}=1}^{\infty}\mu_{\nu}^{1-\frac{1+\varepsilon}{\alpha}}\mu_{\nu^{\prime}}^{1-\frac{1+\varepsilon}{\alpha}}+\sum_{\nu,\nu^{\prime}=1}^{\infty}\mu_{\nu}^{2-\frac{1+\varepsilon}{\alpha}}\mu_{\nu^{\prime}}^{2-\frac{1+\varepsilon}{\alpha}}(\mu_{\nu}+\mu_{\nu^{\prime}})^{-2}\big{)}\lesssim n\gamma^{-2},

with ε0\varepsilon\to 0. The final step is due to the fact that

ν,ν=1μν21+εαμν21+εα(μν+μν)2\displaystyle\sum_{\nu,\nu^{\prime}=1}^{\infty}\mu_{\nu}^{2-\frac{1+\varepsilon}{\alpha}}\mu_{\nu^{\prime}}^{2-\frac{1+\varepsilon}{\alpha}}(\mu_{\nu}+\mu_{\nu^{\prime}})^{-2}
=\displaystyle= ν,ν=1μνμν(μν+μν)2μν11+εαμν11+εαν,ν=1μν11+εαμν11+εα=(ν=1μν11+εα)2C.\displaystyle\sum_{\nu,\nu^{\prime}=1}^{\infty}\frac{\mu_{\nu}\mu_{\nu^{\prime}}}{(\mu_{\nu}+\mu_{\nu^{\prime}})^{2}}\mu_{\nu}^{1-\frac{1+\varepsilon}{\alpha}}\mu_{\nu^{\prime}}^{1-\frac{1+\varepsilon}{\alpha}}\leq\sum_{\nu,\nu^{\prime}=1}^{\infty}\mu_{\nu}^{1-\frac{1+\varepsilon}{\alpha}}\mu_{\nu^{\prime}}^{1-\frac{1+\varepsilon}{\alpha}}=(\sum_{\nu=1}^{\infty}\mu_{\nu}^{1-\frac{1+\varepsilon}{\alpha}})^{2}\leq C.

Since $b_{k}\leq\min\{1,n\gamma\mu_{k}\}$, we have $\sum_{k,k^{\prime}=1}^{\infty}b_{k}b_{k^{\prime}}=\big(\sum_{k=1}^{\infty}(1-(1-\gamma\mu_{k})^{n})\big)^{2}\lesssim(n\gamma)^{\frac{2}{\alpha}}$. Therefore, we have

𝔼Δ1𝔼Δ12nγ3(nγ)2α.\mathbb{E}\Delta_{1}\leq\sqrt{\mathbb{E}\Delta_{1}^{2}}\leq\sqrt{n}\gamma^{-3}(n\gamma)^{\frac{2}{\alpha}}. (8.21)

For Δ2\Delta_{2}, we rewrite Δ2\Delta_{2} as

Δ2=\displaystyle\Delta_{2}= γ2ν=1μν2j=1n1i,=jn1aijνajνk,k=1bikbkϕiνkϕνk\displaystyle\gamma^{-2}\sum_{\nu=1}^{\infty}\mu_{\nu}^{2}\sum_{j=1}^{n-1}\sum_{i,\ell=j}^{n-1}a_{ij\nu}a_{\ell j\nu}\sum_{k,k^{\prime}=1}^{\infty}b_{ik}b_{\ell k^{\prime}}\phi_{i\nu k}\phi_{\ell\nu k^{\prime}}
=\displaystyle= γ2ν=1μν2j=1n1ji<n1aijνajνk,k=1bikbkϕiνkϕνk\displaystyle\gamma^{-2}\sum_{\nu=1}^{\infty}\mu_{\nu}^{2}\sum_{j=1}^{n-1}\sum_{j\leq i<\ell\leq n-1}a_{ij\nu}a_{\ell j\nu}\sum_{k,k^{\prime}=1}^{\infty}b_{ik}b_{\ell k^{\prime}}\phi_{i\nu k}\phi_{\ell\nu k^{\prime}}
\quad+\gamma^{-2}\sum_{\nu=1}^{\infty}\mu_{\nu}^{2}\sum_{j=1}^{n-1}\sum_{i=j}^{n-1}a^{2}_{ij\nu}\sum_{k,k^{\prime}=1}^{\infty}b_{ik}b_{ik^{\prime}}\phi_{i\nu k}\phi_{i\nu k^{\prime}}
=\displaystyle= Δ21+Δ22,\displaystyle\Delta_{21}+\Delta_{22}, (8.22)

where $\Delta_{21}$ collects the terms with $i\neq\ell$ and $\Delta_{22}$ the terms with $i=\ell$. For $\Delta_{21}$, for any small $\varepsilon>0$, we have

Δ21\displaystyle\Delta_{21}\leq 2γ2k,k=1bkbkν=1μν2j=1n1ji<n1aijνajνϕiνkϕνk\displaystyle 2\gamma^{-2}\sum_{k,k^{\prime}=1}^{\infty}b_{k}b_{k^{\prime}}\sum_{\nu=1}^{\infty}\mu_{\nu}^{2}\sum_{j=1}^{n-1}\sum_{j\leq i<\ell\leq n-1}a_{ij\nu}a_{\ell j\nu}\phi_{i\nu k}\phi_{\ell\nu k^{\prime}}
=\displaystyle= 2γ2k,k=1bkbkν=1μν1+ε2αμν21+ε2αj=1n1ji<n1aijνajνϕiνkϕνk\displaystyle 2\gamma^{-2}\sum_{k,k^{\prime}=1}^{\infty}b_{k}b_{k^{\prime}}\sum_{\nu=1}^{\infty}\mu_{\nu}^{\frac{1+\varepsilon}{2\alpha}}\mu_{\nu}^{2-\frac{1+\varepsilon}{2\alpha}}\sum_{j=1}^{n-1}\sum_{j\leq i<\ell\leq n-1}a_{ij\nu}a_{\ell j\nu}\phi_{i\nu k}\phi_{\ell\nu k^{\prime}}
\displaystyle\leq 2γ2k,k=1bkbkν=1μν1+2εαν=1μν41+2εα(j=1n1ji<n1aijνajνϕiνkϕνk)2.\displaystyle 2\gamma^{-2}\sum_{k,k^{\prime}=1}^{\infty}b_{k}b_{k^{\prime}}\sqrt{\sum_{\nu=1}^{\infty}\mu_{\nu}^{\frac{1+2\varepsilon}{\alpha}}}\sqrt{\sum_{\nu=1}^{\infty}\mu_{\nu}^{4-\frac{1+2\varepsilon}{\alpha}}\big{(}\sum_{j=1}^{n-1}\sum_{j\leq i<\ell\leq n-1}a_{ij\nu}a_{\ell j\nu}\phi_{i\nu k}\phi_{\ell\nu k^{\prime}}\big{)}^{2}}.

To bound the expectation of Δ21\Delta_{21}, we need to bound 𝔼|j=1n1ji<n1aijνajνϕiνkϕνk|2\mathbb{E}|\sum_{j=1}^{n-1}\sum_{j\leq i<\ell\leq n-1}a_{ij\nu}a_{\ell j\nu}\phi_{i\nu k}\phi_{\ell\nu k^{\prime}}|^{2}. Note that

\mathbb{E}\Big|\sum_{j=1}^{n-1}\sum_{j\leq i<\ell\leq n-1}a_{ij\nu}a_{\ell j\nu}\phi_{i\nu k}\phi_{\ell\nu k^{\prime}}\Big|^{2}
=\displaystyle= j,d=1n1𝔼(ji<n1aijνajνϕiνkϕνk)(di<n1aidνadνϕiνkϕνk)\displaystyle\sum_{j,d=1}^{n-1}\mathbb{E}\big{(}\sum_{j\leq i<\ell\leq n-1}a_{ij\nu}a_{\ell j\nu}\phi_{i\nu k}\phi_{\ell\nu k^{\prime}}\big{)}\big{(}\sum_{d\leq i<\ell\leq n-1}a_{id\nu}a_{\ell d\nu}\phi_{i\nu k}\phi_{\ell\nu k^{\prime}}\big{)}
=\displaystyle= j=1n𝔼|ji<n1aijνajνϕiνkϕνk|2\displaystyle\sum_{j=1}^{n}\mathbb{E}|\sum_{j\leq i<\ell\leq n-1}a_{ij\nu}a_{\ell j\nu}\phi_{i\nu k}\phi_{\ell\nu k^{\prime}}|^{2}
+2d=1n1j=1d1𝔼(di<n1aijνajνϕiνkϕνk)(di<n1aidνadνϕiνkϕνk)\displaystyle+2\sum_{d=1}^{n-1}\sum_{j=1}^{d-1}\mathbb{E}\big{(}\sum_{d\leq i<\ell\leq n-1}a_{ij\nu}a_{\ell j\nu}\phi_{i\nu k}\phi_{\ell\nu k^{\prime}}\big{)}\big{(}\sum_{d\leq i<\ell\leq n-1}a_{id\nu}a_{\ell d\nu}\phi_{i\nu k}\phi_{\ell\nu k^{\prime}}\big{)}
+(i)2d=1n1j=1d1𝔼(ji<d1aijνajνϕiνkϕνk)(di<n1aidνadνϕiνkϕνk),\displaystyle\overset{(i)}{+}2\sum_{d=1}^{n-1}\sum_{j=1}^{d-1}\mathbb{E}\big{(}\sum_{j\leq i<\ell\leq d-1}a_{ij\nu}a_{\ell j\nu}\phi_{i\nu k}\phi_{\ell\nu k^{\prime}}\big{)}\big{(}\sum_{d\leq i<\ell\leq n-1}a_{id\nu}a_{\ell d\nu}\phi_{i\nu k}\phi_{\ell\nu k^{\prime}}\big{)},

where the last term (i)(i) is 0. Then we have

d=1n1j=1d1𝔼(di<n1aijνajνϕiνkϕνk)(di<n1aidνadνϕiνkϕνk)\displaystyle\sum_{d=1}^{n-1}\sum_{j=1}^{d-1}\mathbb{E}\big{(}\sum_{d\leq i<\ell\leq n-1}a_{ij\nu}a_{\ell j\nu}\phi_{i\nu k}\phi_{\ell\nu k^{\prime}}\big{)}\big{(}\sum_{d\leq i<\ell\leq n-1}a_{id\nu}a_{\ell d\nu}\phi_{i\nu k}\phi_{\ell\nu k^{\prime}}\big{)}
=\displaystyle= d=1n1j=1d1di<n1aijνajνaidνadν𝔼ϕiνk2ϕνk2j<ddi<n1aijνajνaidνadν\displaystyle\sum_{d=1}^{n-1}\sum_{j=1}^{d-1}\sum_{d\leq i<\ell\leq n-1}a_{ij\nu}a_{\ell j\nu}a_{id\nu}a_{\ell d\nu}\mathbb{E}\phi^{2}_{i\nu k}\phi^{2}_{\ell\nu k^{\prime}}\lesssim\sum_{j<d}\sum_{d\leq i<\ell\leq n-1}a_{ij\nu}a_{\ell j\nu}a_{id\nu}a_{\ell d\nu}
=\displaystyle= d=1n1j=1d1di<n1(1γμν)ij(1γμν)j(1γμν)id(1γμν)d\displaystyle\sum_{d=1}^{n-1}\sum_{j=1}^{d-1}\sum_{d\leq i<\ell\leq n-1}(1-\gamma\mu_{\nu})^{i-j}(1-\gamma\mu_{\nu})^{\ell-j}(1-\gamma\mu_{\nu})^{i-d}(1-\gamma\mu_{\nu})^{\ell-d}
=\displaystyle= 2d=1n1j=1d1[di<n1(1γμν)2(id)(1γμν)2(d)](1γμν)2(dj)\displaystyle 2\sum_{d=1}^{n-1}\sum_{j=1}^{d-1}\big{[}\sum_{d\leq i<\ell\leq n-1}(1-\gamma\mu_{\nu})^{2(i-d)}(1-\gamma\mu_{\nu})^{2(\ell-d)}\big{]}(1-\gamma\mu_{\nu})^{2(d-j)}
\displaystyle\leq 2(d=1n1j=1d1(1γμν)2(dj))(i=dn1(1γμν)2(id))(=dn1(1γμν)2(d))n(γμν)3.\displaystyle 2\big{(}\sum_{d=1}^{n-1}\sum_{j=1}^{d-1}(1-\gamma\mu_{\nu})^{2(d-j)}\big{)}\big{(}\sum_{i=d}^{n-1}(1-\gamma\mu_{\nu})^{2(i-d)}\big{)}\big{(}\sum_{\ell=d}^{n-1}(1-\gamma\mu_{\nu})^{2(\ell-d)}\big{)}\lesssim n(\gamma\mu_{\nu})^{-3}.

Accordingly,

\mathbb{E}\Delta_{21}\leq\gamma^{-2}\sum_{k,k^{\prime}=1}^{\infty}b_{k}b_{k^{\prime}}\sqrt{\sum_{\nu=1}^{\infty}\mu_{\nu}^{\frac{1+2\epsilon}{\alpha}}}\cdot\sqrt{\sum_{\nu=1}^{\infty}\mu_{\nu}^{4-\frac{1+2\epsilon}{\alpha}}\,\mathbb{E}\Big|\sum_{j=1}^{n-1}\sum_{j\leq i<\ell\leq n-1}a_{ij\nu}a_{\ell j\nu}\phi_{i\nu k}\phi_{\ell\nu k^{\prime}}\Big|^{2}}\lesssim(n\gamma)^{\frac{2}{\alpha}}\sqrt{n}\gamma^{-\frac{7}{2}}. (8.23)

For Δ22\Delta_{22}, we have

γ2k,k=1bkbkν=1μν2j=1n1i=jn1aijν2ϕiνkϕiνk\displaystyle\gamma^{-2}\sum_{k,k^{\prime}=1}^{\infty}b_{k}b_{k^{\prime}}\sum_{\nu=1}^{\infty}\mu_{\nu}^{2}\sum_{j=1}^{n-1}\sum_{i=j}^{n-1}a^{2}_{ij\nu}\phi_{i\nu k}\phi_{i\nu k^{\prime}}
=\displaystyle= γ2k,k=1bkbkν=1μν2j=1n1i=jn1aijν2(ϕiνkϕiνk𝔼(ϕiνkϕiνk))\displaystyle\gamma^{-2}\sum_{k,k^{\prime}=1}^{\infty}b_{k}b_{k^{\prime}}\sum_{\nu=1}^{\infty}\mu_{\nu}^{2}\sum_{j=1}^{n-1}\sum_{i=j}^{n-1}a^{2}_{ij\nu}(\phi_{i\nu k}\phi_{i\nu k^{\prime}}-\mathbb{E}(\phi_{i\nu k}\phi_{i\nu k^{\prime}}))
+\displaystyle+ γ2k,k=1bkbkν=1μν2j=1n1i=jn1aijν2𝔼(ϕiνkϕiνk)\displaystyle\gamma^{-2}\sum_{k,k^{\prime}=1}^{\infty}b_{k}b_{k^{\prime}}\sum_{\nu=1}^{\infty}\mu_{\nu}^{2}\sum_{j=1}^{n-1}\sum_{i=j}^{n-1}a^{2}_{ij\nu}\mathbb{E}(\phi_{i\nu k}\phi_{i\nu k^{\prime}})
=\displaystyle= Δ22(1)+Δ22(2).\displaystyle\Delta_{22}^{(1)}+\Delta_{22}^{(2)}.

We first bound |Δ22(1)||\Delta_{22}^{(1)}|.

|Δ22(1)|\displaystyle|\Delta_{22}^{(1)}|\leq γ2k,k=1bkbkν=1μν1+2ϵαν=1μν41+2ϵα(j=1n1i=jn1aijν2(ϕiνkϕiνk𝔼(ϕiνkϕiνk)))2\displaystyle\gamma^{-2}\sum_{k,k^{\prime}=1}^{\infty}b_{k}b_{k^{\prime}}\sqrt{\sum_{\nu=1}^{\infty}\mu_{\nu}^{\frac{1+2\epsilon}{\alpha}}}\sqrt{\sum_{\nu=1}^{\infty}\mu_{\nu}^{4-\frac{1+2\epsilon}{\alpha}}\big{(}\sum_{j=1}^{n-1}\sum_{i=j}^{n-1}a^{2}_{ij\nu}(\phi_{i\nu k}\phi_{i\nu k^{\prime}}-\mathbb{E}(\phi_{i\nu k}\phi_{i\nu k^{\prime}}))\big{)}^{2}}

Notice that

\mathbb{E}\Big(\sum_{j=1}^{n-1}\sum_{i=j}^{n-1}a^{2}_{ij\nu}\big(\phi_{i\nu k}\phi_{i\nu k^{\prime}}-\mathbb{E}(\phi_{i\nu k}\phi_{i\nu k^{\prime}})\big)\Big)^{2}
=\sum_{i=1}^{n-1}\mathbb{E}\Big(\sum_{j=1}^{i}a^{2}_{ij\nu}\big(\phi_{i\nu k}\phi_{i\nu k^{\prime}}-\mathbb{E}(\phi_{i\nu k}\phi_{i\nu k^{\prime}})\big)\Big)^{2}
\quad+2\sum_{1\leq i_{1}<i_{2}\leq n-1}\mathbb{E}\Big[\Big(\sum_{j=1}^{i_{1}}a^{2}_{i_{1}j\nu}\big(\phi_{i_{1}\nu k}\phi_{i_{1}\nu k^{\prime}}-\mathbb{E}(\phi_{i_{1}\nu k}\phi_{i_{1}\nu k^{\prime}})\big)\Big)\Big(\sum_{j=1}^{i_{2}}a^{2}_{i_{2}j\nu}\big(\phi_{i_{2}\nu k}\phi_{i_{2}\nu k^{\prime}}-\mathbb{E}(\phi_{i_{2}\nu k}\phi_{i_{2}\nu k^{\prime}})\big)\Big)\Big]
=\sum_{i=1}^{n-1}\mathbb{E}\Big(\sum_{j=1}^{i}a^{2}_{ij\nu}\big(\phi_{i\nu k}\phi_{i\nu k^{\prime}}-\mathbb{E}(\phi_{i\nu k}\phi_{i\nu k^{\prime}})\big)\Big)^{2},

since the $i_{1}$ and $i_{2}$ summands depend on the independent inputs $X_{i_{1}+1}$ and $X_{i_{2}+1}$, respectively, and $\mathbb{E}\big(\phi_{i_{1}\nu k}\phi_{i_{1}\nu k^{\prime}}-\mathbb{E}(\phi_{i_{1}\nu k}\phi_{i_{1}\nu k^{\prime}})\big)=0$. Then we have

𝔼(j=1n1i=jn1aijν2(ϕiνkϕiνk𝔼(ϕiνkϕiνk)))2i=1n1(j=1iaijν2)2𝔼(ϕiνkϕiνk𝔼(ϕiνkϕiνk))2n(γ1μν1)2\mathbb{E}\big{(}\sum_{j=1}^{n-1}\sum_{i=j}^{n-1}a^{2}_{ij\nu}(\phi_{i\nu k}\phi_{i\nu k^{\prime}}-\mathbb{E}(\phi_{i\nu k}\phi_{i\nu k^{\prime}}))\big{)}^{2}\lesssim\sum_{i=1}^{n-1}(\sum_{j=1}^{i}a^{2}_{ij\nu})^{2}\mathbb{E}(\phi_{i\nu k}\phi_{i\nu k^{\prime}}-\mathbb{E}(\phi_{i\nu k}\phi_{i\nu k^{\prime}}))^{2}\lesssim n(\gamma^{-1}\mu_{\nu}^{-1})^{2}

due to the property that j=1iaijν2=j=1i(1γμν)2(ij)γ1μν1\sum_{j=1}^{i}a^{2}_{ij\nu}=\sum_{j=1}^{i}(1-\gamma\mu_{\nu})^{2(i-j)}\leq\gamma^{-1}\mu_{\nu}^{-1}. Accordingly, we have

k,k=1bkbkν=1μν1+2ϵαν=1μν41+2ϵα(j=1n1i=jn1aijν2(ϕiνkϕiνk𝔼(ϕiνkϕiνk)))2=OP(nγ1(nγ)2α).\sum_{k,k^{\prime}=1}^{\infty}b_{k}b_{k^{\prime}}\sqrt{\sum_{\nu=1}^{\infty}\mu_{\nu}^{\frac{1+2\epsilon}{\alpha}}}\sqrt{\sum_{\nu=1}^{\infty}\mu_{\nu}^{4-\frac{1+2\epsilon}{\alpha}}\big{(}\sum_{j=1}^{n-1}\sum_{i=j}^{n-1}a^{2}_{ij\nu}(\phi_{i\nu k}\phi_{i\nu k^{\prime}}-\mathbb{E}(\phi_{i\nu k}\phi_{i\nu k^{\prime}}))\big{)}^{2}}=O_{P}(\sqrt{n}\gamma^{-1}(n\gamma)^{\frac{2}{\alpha}}).

Therefore,

𝔼Δ22(1)nγ3(nγ)2α.\mathbb{E}\Delta^{(1)}_{22}\lesssim\sqrt{n}\gamma^{-3}(n\gamma)^{\frac{2}{\alpha}}. (8.24)

We next deal with Δ22(2)\Delta^{(2)}_{22}.

𝔼Δ22(2)=γ2k,k=1ν=1μν2j=1n1i=jn1aijν2bikbik𝔼(ϕiνkϕiνk)\displaystyle\mathbb{E}\Delta^{(2)}_{22}=\gamma^{-2}\sum_{k,k^{\prime}=1}^{\infty}\sum_{\nu=1}^{\infty}\mu_{\nu}^{2}\sum_{j=1}^{n-1}\sum_{i=j}^{n-1}a^{2}_{ij\nu}b_{ik}b_{ik^{\prime}}\mathbb{E}(\phi_{i\nu k}\phi_{i\nu k^{\prime}})
\displaystyle\leq γ2k,k=1ν=1μν2j=1n1i=jn1aijν2bikbik𝔼(ϕν2(Xi+1)ϕk(Xi+1)ϕk(Xi+1))\displaystyle\gamma^{-2}\sum_{k,k^{\prime}=1}^{\infty}\sum_{\nu=1}^{\infty}\mu_{\nu}^{2}\sum_{j=1}^{n-1}\sum_{i=j}^{n-1}a^{2}_{ij\nu}b_{ik}b_{ik^{\prime}}\mathbb{E}(\phi^{2}_{\nu}(X_{i+1})\phi_{k}(X_{i+1})\phi_{k^{\prime}}(X_{i+1}))
=\displaystyle= γ2ν=1μν2j=1n1i=jn1aijν2𝔼(ϕν2(Xi+1)(k=1bikϕk(Xi+1))2)\displaystyle\gamma^{-2}\sum_{\nu=1}^{\infty}\mu_{\nu}^{2}\sum_{j=1}^{n-1}\sum_{i=j}^{n-1}a^{2}_{ij\nu}\mathbb{E}\big{(}\phi^{2}_{\nu}(X_{i+1})\cdot\big{(}\sum_{k=1}^{\infty}b_{ik}\phi_{k}(X_{i+1})\big{)}^{2}\big{)}
\displaystyle\leq cϕ2γ2ν=1μν2j=1n1i=jn1aijν2𝔼(k=1bikϕk(Xi+1))2=cϕ2γ2ν=1μν2j=1n1i=jn1aijν2k=1bik2γ3n(nγ)1α\displaystyle c_{\phi}^{2}\gamma^{-2}\sum_{\nu=1}^{\infty}\mu_{\nu}^{2}\sum_{j=1}^{n-1}\sum_{i=j}^{n-1}a^{2}_{ij\nu}\mathbb{E}\big{(}\sum_{k=1}^{\infty}b_{ik}\phi_{k}(X_{i+1})\big{)}^{2}=c_{\phi}^{2}\gamma^{-2}\sum_{\nu=1}^{\infty}\mu_{\nu}^{2}\sum_{j=1}^{n-1}\sum_{i=j}^{n-1}a^{2}_{ij\nu}\sum_{k=1}^{\infty}b^{2}_{ik}\lesssim\gamma^{-3}n(n\gamma)^{\frac{1}{\alpha}} (8.25)

Combining equations (8.21), (8.23), (8.24), and (8.25), and noticing that \sqrt{n}\gamma^{-\frac{7}{2}}(n\gamma)^{\frac{2}{\alpha}}\leq n\gamma^{-3}(n\gamma)^{\frac{1}{\alpha}} for \gamma\geq n^{-1}, we have

𝔼j=1n1gj2(s)nγ3(nγ)1α.\mathbb{E}\sum_{j=1}^{n-1}g^{2}_{j}(s)\leq n\gamma^{-3}(n\gamma)^{\frac{1}{\alpha}}.

Define the event \mathcal{E}_{1}=\{\sum_{j=1}^{n-1}g^{2}_{j}(s)\leq\gamma^{-7/2}n(n\gamma)^{1/\alpha}\}. By Markov's inequality, \mathbb{P}\big{(}\mathcal{E}_{1}\big{)}>1-\gamma^{1/2}. Conditioning on the event \mathcal{E}_{1} and taking u=Cn^{-\frac{1}{2}}\gamma^{\frac{1}{4}}(n\gamma)^{\frac{1}{2\alpha}}\sqrt{\log n} in equation (8.19), we have

(γ2n|j=1n1ϵjgj(s)|>Cn12γ14(nγ)12αlogn|1)exp(Clogn).\mathbb{P}\Big{(}\frac{\gamma^{2}}{n}\bigl{|}\sum_{j=1}^{n-1}\epsilon_{j}\cdot g_{j}(s)\bigr{|}>Cn^{-\frac{1}{2}}\gamma^{\frac{1}{4}}(n\gamma)^{\frac{1}{2\alpha}}\sqrt{\log n}\bigm{|}\mathcal{E}_{1}\Big{)}\leq\exp\Big{(}-C^{\prime}\log n\Big{)}. (8.26)

Combining this with the lemma bridging \bar{\eta}_{n}^{noise,1}(t) and \|\bar{\eta}_{n}^{noise,1}\|_{\infty} in the Supplementary Material [33], we obtain the result.

Next, we prove Lemma 8.2 (b) and analyze η¯nnoise,d\|\bar{\eta}_{n}^{noise,d}\|_{\infty} for d2d\geq 2.

Note that \|\bar{\eta}_{n}^{noise,d}\|_{\infty}\leq\|\Sigma^{a}\bar{\eta}_{n}^{noise,d}\|_{\mathbb{H}}. In what follows, we focus on \mathbb{E}\|\Sigma^{a}\bar{\eta}_{n}^{noise,d}\|^{2}_{\mathbb{H}}. Recall from Section 6 that \eta_{n}^{noise,d} follows the recursion \eta_{n}^{noise,d}=(I-\gamma\Sigma)\eta_{n-1}^{noise,d}+\gamma\mathcal{E}_{n}^{d}, where \mathcal{E}_{n}^{d}=(\Sigma-K_{X_{n}}\otimes K_{X_{n}})\eta_{n-1}^{noise,d-1} for d\geq 1 and \mathcal{E}_{n}^{0}=\varepsilon_{n}K_{X_{n}} for d=0.

Let T=IγΣT=I-\gamma\Sigma, then ηjnoise,d=γk=1jTjkkd\eta_{j}^{noise,d}=\gamma\sum_{k=1}^{j}T^{j-k}\mathcal{E}_{k}^{d}, and η¯nnoise,d=γ1nj=1nk=1jTjkkd\bar{\eta}_{n}^{noise,d}=\gamma\frac{1}{n}\sum_{j=1}^{n}\sum_{k=1}^{j}T^{j-k}\mathcal{E}_{k}^{d}.

𝔼η¯nnoise,d,Σ2aη¯nnoise,d\displaystyle\mathbb{E}\langle\bar{\eta}_{n}^{noise,d},\Sigma^{2a}\bar{\eta}_{n}^{noise,d}\rangle
=\displaystyle= γ2n2𝔼j=1nk=1jTjkkd,Σ2aj=1nk=1jTjkkd\displaystyle\frac{\gamma^{2}}{n^{2}}\mathbb{E}\langle\sum_{j=1}^{n}\sum_{k=1}^{j}T^{j-k}\mathcal{E}_{k}^{d},\Sigma^{2a}\sum_{j=1}^{n}\sum_{k=1}^{j}T^{j-k}\mathcal{E}_{k}^{d}\rangle
=\displaystyle= \frac{\gamma^{2}}{n^{2}}\mathbb{E}\langle\sum_{k=1}^{n}(\sum_{j=k}^{n}T^{j-k})\mathcal{E}_{k}^{d},\Sigma^{2a}\sum_{k=1}^{n}(\sum_{j=k}^{n}T^{j-k})\mathcal{E}_{k}^{d}\rangle
=\displaystyle= γ2n2k=1n𝔼Mn,kkd,Σ2aMn,kkd=γ2n2k=1n𝔼tr(kdMn,kΣ2aMn,kkd)=γ2n2k=1n𝔼tr(Mn,kΣ2aMn,kkdkd)\displaystyle\frac{\gamma^{2}}{n^{2}}\sum_{k=1}^{n}\mathbb{E}\langle M_{n,k}\mathcal{E}_{k}^{d},\Sigma^{2a}M_{n,k}\mathcal{E}_{k}^{d}\rangle=\frac{\gamma^{2}}{n^{2}}\sum_{k=1}^{n}\mathbb{E}\operatorname{tr}(\mathcal{E}_{k}^{d}M_{n,k}\Sigma^{2a}M_{n,k}\mathcal{E}_{k}^{d})=\frac{\gamma^{2}}{n^{2}}\sum_{k=1}^{n}\mathbb{E}\operatorname{tr}(M_{n,k}\Sigma^{2a}M_{n,k}\mathcal{E}_{k}^{d}\otimes\mathcal{E}_{k}^{d})
=\displaystyle= γ2n2k=1ntr(Mn,kΣ2aMn,k𝔼(kdkd))γ2+dn2k=1ntr(Mn,kΣ2aMn,kΣ)\displaystyle\frac{\gamma^{2}}{n^{2}}\sum_{k=1}^{n}\operatorname{tr}(M_{n,k}\Sigma^{2a}M_{n,k}\mathbb{E}\big{(}\mathcal{E}_{k}^{d}\otimes\mathcal{E}_{k}^{d})\big{)}\lesssim\frac{\gamma^{2+d}}{n^{2}}\sum_{k=1}^{n}\operatorname{tr}\big{(}M_{n,k}\Sigma^{2a}M_{n,k}\Sigma\big{)}

where we use the property that \mathbb{E}\big{(}\mathcal{E}_{k}^{d}\otimes\mathcal{E}_{k}^{d}\big{)}\lesssim\gamma^{d}\Sigma. Since M_{n,k}=\sum_{j=k}^{n}T^{j-k}=I+T+T^{2}+\cdots+T^{n-k}\preceq nI, we have M_{n,k}\Sigma^{2a}\preceq n\Sigma^{2a}. On the other hand, M_{n,k}\Sigma^{2a}=\gamma^{-1}\Sigma^{-1}(I-T^{n-k+1})\Sigma^{2a}\preceq\gamma^{-1}\Sigma^{2a-1}. Therefore, we have

Mn,kΣ2a(nΣ2a)q(γ1Σ2a1)1qM_{n,k}\Sigma^{2a}\preceq(n\Sigma^{2a})^{q}(\gamma^{-1}\Sigma^{2a-1})^{1-q}

with 0\leq q\leq 1. Also, M_{n,k}\Sigma=\gamma^{-1}\Sigma^{-1}(I-T^{n-k+1})\Sigma\preceq\gamma^{-1}I. Then

tr(Mn,kΣ2aMn,kΣ)nqγq1γ1ν=1μν2aq+(2a1)(1q).\operatorname{tr}\big{(}M_{n,k}\Sigma^{2a}M_{n,k}\Sigma\big{)}\leq n^{q}\gamma^{q-1}\gamma^{-1}\sum_{\nu=1}^{\infty}\mu_{\nu}^{2aq+(2a-1)(1-q)}.

Therefore, 𝔼η¯nnoise,d,Σ2aη¯nnoise,dγ2+dn2nγ1nqγq1ν=1μν2a1+q(nγ)qn1γdν=1μν2a1+q.\mathbb{E}\langle\bar{\eta}_{n}^{noise,d},\Sigma^{2a}\bar{\eta}_{n}^{noise,d}\rangle\lesssim\frac{\gamma^{2+d}}{n^{2}}n\gamma^{-1}n^{q}\gamma^{q-1}\sum_{\nu=1}^{\infty}\mu_{\nu}^{2a-1+q}\leq(n\gamma)^{q}n^{-1}\gamma^{d}\sum_{\nu=1}^{\infty}\mu_{\nu}^{2a-1+q}. Let 2a1+q=1/α+ε2a-1+q=1/\alpha+\varepsilon with a=1/21/(2α)εa=1/2-1/(2\alpha)-\varepsilon and ε0\varepsilon\to 0, then we have ν=1μν2a1+q=j=1μj1/α+ε=j=1j1αε<\sum_{\nu=1}^{\infty}\mu_{\nu}^{2a-1+q}=\sum_{j=1}^{\infty}\mu_{j}^{1/\alpha+\varepsilon}=\sum_{j=1}^{\infty}j^{-1-\alpha\varepsilon}<\infty and

𝔼η¯nnoise,d,Σ2aη¯nnoise,d(nγ)1/αn1(nγ)1/α+2εγd.\mathbb{E}\langle\bar{\eta}_{n}^{noise,d},\Sigma^{2a}\bar{\eta}_{n}^{noise,d}\rangle\lesssim(n\gamma)^{1/\alpha}n^{-1}(n\gamma)^{1/\alpha+2\varepsilon}\gamma^{d}.

By Markov's inequality, we have

(η¯nnoise,d2γ1/4(nγ)1/αn1)\displaystyle\mathbb{P}\Big{(}\|\bar{\eta}_{n}^{noise,d}\|^{2}_{\infty}\geq\gamma^{1/4}(n\gamma)^{1/\alpha}n^{-1}\Big{)}\leq 𝔼Σaη¯nnoise,d2γ1/4(nγ)1/αn1(nγ)1/α+2εγd1/4.\displaystyle\frac{\mathbb{E}\|\Sigma^{a}\bar{\eta}_{n}^{noise,d}\|^{2}_{\mathbb{H}}}{\gamma^{1/4}(n\gamma)^{1/\alpha}n^{-1}}\leq(n\gamma)^{1/\alpha+2\varepsilon}\gamma^{d-1/4}.

For d\geq 2 and 0<\gamma<n^{-\frac{2}{2+3\alpha}}, we have (n\gamma)^{1/\alpha+2\varepsilon}\gamma^{d-1/4}\leq\frac{1}{2}\gamma^{1/4}.
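For completeness, here is a quick side calculation showing where the exponent \frac{2}{2+3\alpha} comes from (ignoring the arbitrarily small \varepsilon and constant factors): with \gamma=n^{-\xi},

(n\gamma)^{1/\alpha}\gamma^{d-1/4}\leq\gamma^{1/4}\iff n^{1/\alpha}\gamma^{1/\alpha+d-1/2}\leq 1\iff\xi\geq\frac{1/\alpha}{1/\alpha+d-1/2}=\frac{2}{2+\alpha(2d-1)},

and for d=2 the right-hand side equals \frac{2}{2+3\alpha}, while for d>2 the requirement is weaker; hence the stated range of \gamma suffices for all d\geq 2.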

Proof of Lemma 8.2 (c) - the remainder term \bar{\eta}_{n}^{noise}-\sum_{d=0}^{r}\bar{\eta}_{n}^{noise,d}. Note that for any f\in\mathbb{H}, |f(x)|=|\langle f,K_{x}\rangle_{\mathbb{H}}|\leq\|K_{x}\|_{\mathbb{H}}\|f\|_{\mathbb{H}}\leq C\|f\|_{\mathbb{H}}. Therefore, \|\bar{\eta}_{n}^{noise}-\sum_{d=0}^{r}\bar{\eta}_{n}^{noise,d}\|_{\infty}\leq\|\bar{\eta}_{n}^{noise}-\sum_{d=0}^{r}\bar{\eta}_{n}^{noise,d}\|_{\mathbb{H}}. Next, we bound \|\bar{\eta}_{n}^{noise}-\sum_{d=0}^{r}\bar{\eta}_{n}^{noise,d}\|_{\mathbb{H}}.

For i=1,\dots,n, recall that \eta_{i}^{noise}-\sum_{d=0}^{r}\eta_{i}^{noise,d}=(I-\gamma K_{X_{i}}\otimes K_{X_{i}})(\eta_{i-1}^{noise}-\sum_{d=0}^{r}\eta_{i-1}^{noise,d})+\gamma\mathcal{E}_{i}^{r+1}. We then have

ηinoised=0rηinoise,dηi1noised=0rηi1noise,d+γir+1j=1iγjr+1.\|\eta_{i}^{noise}-\sum_{d=0}^{r}\eta_{i}^{noise,d}\|_{\mathbb{H}}\leq\|\eta_{i-1}^{noise}-\sum_{d=0}^{r}\eta_{i-1}^{noise,d}\|_{\mathbb{H}}+\gamma\|\mathcal{E}_{i}^{r+1}\|_{\mathbb{H}}\leq\sum_{j=1}^{i}\gamma\|\mathcal{E}_{j}^{r+1}\|_{\mathbb{H}}.

Accordingly, by the Cauchy-Schwarz inequality, \mathbb{E}\|\eta_{i}^{noise}-\sum_{d=0}^{r}\eta_{i}^{noise,d}\|_{\mathbb{H}}^{2}\leq\gamma^{2}\,i\sum_{j=1}^{i}\mathbb{E}\|\mathcal{E}_{j}^{r+1}\|^{2}_{\mathbb{H}}. Since \mathbb{E}\|\mathcal{E}_{j}^{r+1}\|_{\mathbb{H}}^{2}=\mathbb{E}\operatorname{tr}(\mathcal{E}_{j}^{r+1}\otimes\mathcal{E}_{j}^{r+1})=\operatorname{tr}\mathbb{E}(\mathcal{E}_{j}^{r+1}\otimes\mathcal{E}_{j}^{r+1})\leq\sigma^{2}\gamma^{r+1}R^{2r+2}\operatorname{tr}(\Sigma), we have

𝔼ηinoised=0rηinoise,d2γ2i2σ2γr+1R2r+2tr(Σ),\displaystyle\mathbb{E}\|\eta_{i}^{noise}-\sum_{d=0}^{r}\eta_{i}^{noise,d}\|_{\mathbb{H}}^{2}\leq\gamma^{2}i^{2}\sigma^{2}\gamma^{r+1}R^{2r+2}\operatorname{tr}(\Sigma),

and accordingly

𝔼η¯nnoised=0rη¯nnoise,d2\displaystyle\mathbb{E}\|\bar{\eta}_{n}^{noise}-\sum_{d=0}^{r}\bar{\eta}_{n}^{noise,d}\|_{\mathbb{H}}^{2}\leq 2ni=1n𝔼ηinoised=0rηinoise,d2\displaystyle\frac{2}{n}\sum_{i=1}^{n}\mathbb{E}\|\eta_{i}^{noise}-\sum_{d=0}^{r}\eta_{i}^{noise,d}\|_{\mathbb{H}}^{2}
\displaystyle\leq σ2γr+3R2r+2tr(Σ)1ni=1ni2σ2γr+3R2r+2tr(Σ)n2.\displaystyle\sigma^{2}\gamma^{r+3}R^{2r+2}\operatorname{tr}(\Sigma)\frac{1}{n}\sum_{i=1}^{n}i^{2}\leq\sigma^{2}\gamma^{r+3}R^{2r+2}\operatorname{tr}(\Sigma)n^{2}. (8.27)

By Markov's inequality,

\mathbb{P}\Big{(}\|\bar{\eta}_{n}^{noise}-\sum_{d=0}^{r}\bar{\eta}_{n}^{noise,d}\|^{2}_{\infty}\geq\gamma^{1/4}(n\gamma)^{1/\alpha}n^{-1}\Big{)}\leq\frac{\mathbb{E}\|\bar{\eta}_{n}^{noise}-\sum_{d=0}^{r}\bar{\eta}_{n}^{noise,d}\|^{2}_{\mathbb{H}}}{\gamma^{1/4}(n\gamma)^{1/\alpha}n^{-1}}\leq 1/n

provided the constant r is chosen large enough.

Finally, we have

(η¯nnoiseη¯nnoise,02(r+1)γ1/4(nγ)1/αn1)\displaystyle\mathbb{P}\Big{(}\|\bar{\eta}_{n}^{noise}-\bar{\eta}_{n}^{noise,0}\|^{2}_{\infty}\geq(r+1)\gamma^{1/4}(n\gamma)^{1/\alpha}n^{-1}\Big{)}
\displaystyle\leq \sum_{d=1}^{r}\mathbb{P}\Big{(}\|\bar{\eta}_{n}^{noise,d}\|^{2}_{\infty}\geq\gamma^{1/4}(n\gamma)^{1/\alpha}n^{-1}\Big{)}+\mathbb{P}\Big{(}\|\bar{\eta}_{n}^{noise}-\sum_{d=0}^{r}\bar{\eta}_{n}^{noise,d}\|^{2}_{\infty}\geq\gamma^{1/4}(n\gamma)^{1/\alpha}n^{-1}\Big{)}\leq\gamma^{1/4}.

8.4 Bootstrap SGD decomposition

Similar to the SGD recursion decomposition in Section 6, we define the Bootstrap SGD recursion decomposition as follows. Based on (4.1), denote ηnb=f^nbf\eta_{n}^{b}=\widehat{f}_{n}^{b}-f^{\ast}, then

\eta_{n}^{b}=(I-\gamma_{n}w_{n}K_{X_{n}}\otimes K_{X_{n}})(\widehat{f}_{n-1}^{b}-f^{\ast})+\gamma_{n}w_{n}\epsilon_{n}K_{X_{n}}. (8.28)

We split the recursion (8.28) into two recursions \eta_{n}^{b,bias} and \eta_{n}^{b,noise} such that \eta_{n}^{b}=\eta_{n}^{b,bias}+\eta_{n}^{b,noise}. Specifically,

ηnb,bias=\displaystyle\eta_{n}^{b,bias}= (IγnwnKXnKXn)ηn1b,biaswithη0b,bias=f,\displaystyle(I-\gamma_{n}w_{n}K_{X_{n}}\otimes K_{X_{n}})\eta^{b,bias}_{n-1}\quad\textrm{with}\quad\eta_{0}^{b,bias}=f^{\ast}, (8.29)
ηnb,noise=\displaystyle\eta_{n}^{b,noise}= (IγnwnKXnKXn)ηn1b,noise+γnwnϵnKXnwithη0b,noise=0.\displaystyle(I-\gamma_{n}w_{n}K_{X_{n}}\otimes K_{X_{n}})\eta^{b,noise}_{n-1}+\gamma_{n}w_{n}\epsilon_{n}K_{X_{n}}\quad\textrm{with}\quad\eta_{0}^{b,noise}=0. (8.30)

Since \mathbb{E}[w_{n}K_{X_{n}}\otimes K_{X_{n}}]=\Sigma, we further decompose \eta_{n}^{b,bias} into two parts: (1) a main recursion term, which determines the order of the bias; and (2) a residual recursion term. That is,

ηnb,bias,0=\displaystyle\eta_{n}^{b,bias,0}= (IγnΣ)ηn1b,bias,0withη0b,bias,0=f\displaystyle(I-\gamma_{n}\Sigma)\eta_{n-1}^{b,bias,0}\quad\quad\textrm{with}\quad\eta_{0}^{b,bias,0}=f^{\ast}
ηnb,biasηnb,bias,0=\displaystyle\eta_{n}^{b,bias}-\eta_{n}^{b,bias,0}= (IγnwnKXnKXn)(ηn1b,biasηn1b,bias,0)+γn(ΣwnKXnKXn)ηn1b,bias,0,\displaystyle(I-\gamma_{n}w_{n}K_{X_{n}}\otimes K_{X_{n}})(\eta_{n-1}^{b,bias}-\eta_{n-1}^{b,bias,0})+\gamma_{n}(\Sigma-w_{n}K_{X_{n}}\otimes K_{X_{n}})\eta_{n-1}^{b,bias,0},

Similarly, we decompose \eta_{n}^{b,noise} into its main recursion term, which dominates the variation, and a residual recursion term, as

ηnb,noise,0=\displaystyle\eta_{n}^{b,noise,0}= (IγnΣ)ηn1b,noise,0+γnwnϵnKXn\displaystyle(I-\gamma_{n}\Sigma)\eta^{b,noise,0}_{n-1}+\gamma_{n}w_{n}\epsilon_{n}K_{X_{n}} (8.31)
ηnb,noiseηnb,noise,0=\displaystyle\eta_{n}^{b,noise}-\eta_{n}^{b,noise,0}= (IγnwnKXnKXn)(ηn1b,noiseηn1b,noise,0)+γn(ΣwnKXnKXn)ηn1b,noise,0,\displaystyle(I-\gamma_{n}w_{n}K_{X_{n}}\otimes K_{X_{n}})(\eta_{n-1}^{b,noise}-\eta_{n-1}^{b,noise,0})+\gamma_{n}(\Sigma-w_{n}K_{X_{n}}\otimes K_{X_{n}})\eta_{n-1}^{b,noise,0},

with η0b,noise,0=0\eta_{0}^{b,noise,0}=0.
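As a purely illustrative aid (not part of the proof), the following Python sketch runs the bootstrap recursion (8.28) with a constant step size, alongside the original functional SGD path obtained by setting w_{i}\equiv 1. The kernel, design distribution, noise level, step size, and exponential(1) multiplier weights are assumptions made here for concreteness only; the iterates are stored through their kernel-expansion coefficients, and the averaged iterate is recovered from the same coefficients.

import numpy as np

def gauss_kernel(x, y, bw=0.2):
    # illustrative kernel choice; any bounded reproducing kernel could be used here
    return np.exp(-(x - y) ** 2 / (2.0 * bw ** 2))

def functional_sgd(X, Y, gamma, w=None):
    """One pass of the (bootstrapped) functional SGD recursion (8.28):
    f_i = f_{i-1} - gamma * w_i * (f_{i-1}(X_i) - Y_i) * K_{X_i},  f_0 = 0.
    Returns the coefficients c of f_n = sum_i c[i] K_{X_i} and the coefficients
    c_bar of the averaged iterate f_bar_n = (1/n) sum_{i<=n} f_i."""
    n = len(X)
    if w is None:
        w = np.ones(n)                        # w_i = 1 recovers the original SGD path
    c = np.zeros(n)
    for i in range(n):
        fx = c[:i] @ gauss_kernel(X[:i], X[i])   # f_{i-1}(X_i)
        c[i] = gamma * w[i] * (Y[i] - fx)        # rank-one update adds one coefficient
    c_bar = c * (n - np.arange(n)) / n           # c[i] appears in f_i, ..., f_n
    return c, c_bar

def evaluate(c, X, grid):
    # f(t) = sum_i c[i] K(X_i, t) evaluated on a grid of points t
    return gauss_kernel(X[:, None], grid[None, :]).T @ c

rng = np.random.default_rng(0)
n = 2000
gamma = n ** (-1.0 / 3.0)                        # illustrative constant step size
X = rng.uniform(0.0, 1.0, n)
Y = np.sin(2.0 * np.pi * X) + 0.3 * rng.standard_normal(n)

_, c_bar = functional_sgd(X, Y, gamma)                              # f_bar_n
_, cb_bar = functional_sgd(X, Y, gamma, w=rng.exponential(1.0, n))  # one bootstrap draw
grid = np.linspace(0.0, 1.0, 5)
print(evaluate(c_bar, X, grid))
print(evaluate(cb_bar, X, grid))

This dense-coefficient sketch has a per-iteration cost that grows with the number of stored points; it is meant only to make the recursion concrete, not to reflect the complexity of a production implementation.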

We aim to quantify the distributional behavior of \bar{f}_{n}^{b}-\bar{f}_{n} conditional on \mathcal{D}_{n}. Denote \bar{\eta}_{n}^{b}=\frac{1}{n}\sum_{i=1}^{n}(\widehat{f}_{i}^{b}-f^{\ast}). Then

f¯nbf¯n=\displaystyle\bar{f}_{n}^{b}-\bar{f}_{n}= η¯nbη¯n=1ni=1n(f^ibf)1ni=1n(f^if)\displaystyle\bar{\eta}_{n}^{b}-\bar{\eta}_{n}=\frac{1}{n}\sum_{i=1}^{n}\big{(}\widehat{f}_{i}^{b}-f^{\ast}\big{)}-\frac{1}{n}\sum_{i=1}^{n}\big{(}\widehat{f}_{i}-f^{\ast}\big{)}
=\displaystyle= η¯nb,bias,0η¯nbias,0leading bias+η¯nb,noise,0η¯nnoise,0leading noise+Remnoiseb+RembiasbRemnoiseRembiasnegligible terms,\displaystyle\underbrace{\bar{\eta}_{n}^{b,bias,0}-\bar{\eta}_{n}^{bias,0}}_{\textrm{leading bias}}+\underbrace{\bar{\eta}_{n}^{b,noise,0}-\bar{\eta}_{n}^{noise,0}}_{\textrm{leading noise}}+\underbrace{Rem_{noise}^{b}+Rem_{bias}^{b}-Rem_{noise}-Rem_{bias}}_{\textrm{negligible terms}},

where Rem^{b}_{noise}=\bar{\eta}_{n}^{b,noise}-\bar{\eta}_{n}^{b,noise,0}, Rem^{b}_{bias}=\bar{\eta}_{n}^{b,bias}-\bar{\eta}_{n}^{b,bias,0}, and Rem_{noise},Rem_{bias} are the remainder terms in the original SGD recursion, with Rem_{noise}=\bar{\eta}_{n}^{noise}-\bar{\eta}_{n}^{noise,0} and Rem_{bias}=\bar{\eta}_{n}^{bias}-\bar{\eta}_{n}^{bias,0} (both bounded in Section 8.2).

Since \bar{\eta}_{n}^{b,bias,0} and \bar{\eta}_{n}^{bias,0} follow the same recursion, the leading bias of \bar{f}_{n}^{b}-\bar{f}_{n} is 0. We next need to: (1) characterize the distributional behavior of \bar{\eta}_{n}^{b,noise,0}-\bar{\eta}_{n}^{noise,0} conditional on \mathcal{D}_{n}; and (2) prove that the term Rem_{noise}^{b}+Rem_{bias}^{b}-Rem_{noise}-Rem_{bias} is negligible.

In the following, we provide an explicit expression for \bar{\eta}_{n}^{b,noise,0}-\bar{\eta}_{n}^{noise,0}. Similar to the expression \eta_{n}^{noise,0}=\sum_{i=1}^{n}D(i+1,n,\gamma_{i})\gamma_{i}\epsilon_{i}K_{X_{i}} in (8.9), a simple calculation from the recursion (8.31) shows that \eta_{n}^{b,noise,0}=\sum_{i=1}^{n}D(i+1,n,\gamma_{i})\gamma_{i}w_{i}\epsilon_{i}K_{X_{i}}. Accordingly,

ηnb,noise,0ηnnoise,0=i=1nD(i+1,n,γi)γi(wi1)ϵiKXi.\eta_{n}^{b,noise,0}-\eta_{n}^{noise,0}=\sum_{i=1}^{n}D(i+1,n,\gamma_{i})\gamma_{i}(w_{i}-1)\epsilon_{i}K_{X_{i}}.

Then

η¯nb,noise,0η¯nnoise,0\displaystyle\bar{\eta}_{n}^{b,noise,0}-\bar{\eta}_{n}^{noise,0}
=\displaystyle= 1nj=1ni=1jD(i+1,j,γi)γi(wi1)ϵiKXi=1ni=1n(j=inD(i+1,j,γi))γi(wi1)ϵiKXi.\displaystyle\frac{1}{n}\sum_{j=1}^{n}\sum_{i=1}^{j}D(i+1,j,\gamma_{i})\gamma_{i}(w_{i}-1)\epsilon_{i}K_{X_{i}}=\frac{1}{n}\sum_{i=1}^{n}\big{(}\sum_{j=i}^{n}D(i+1,j,\gamma_{i})\big{)}\gamma_{i}(w_{i}-1)\epsilon_{i}K_{X_{i}}. (8.32)

8.5 Proof of the Bootstrap consistency in Theorem 4.2 for the constant step size case

We follow the proof sketch in Section 6.2 and complete the proofs of Steps II, III, and IV in this section.

For the reader's convenience, we restate the following notation. Denote

α¯n()=1n(nγ)1/αi=1nϵiΩn,i()α¯nb()=1n(nγ)1/αi=1n(wi1)ϵiΩn,i()α¯ne()=1n(nγ)1/αi=1neiϵiΩn,i()Z¯n()=1n(nγ)1/αi=1nZi()\begin{array}[]{rcl@{\qquad}rcl}\bar{\alpha}_{n}(\cdot)&=&\frac{1}{\sqrt{n(n\gamma)^{1/\alpha}}}\sum_{i=1}^{n}\epsilon_{i}\cdot\Omega_{n,i}(\cdot)&\bar{\alpha}_{n}^{b}(\cdot)&=&\frac{1}{\sqrt{n(n\gamma)^{1/\alpha}}}\sum_{i=1}^{n}(w_{i}-1)\cdot\epsilon_{i}\cdot\Omega_{n,i}(\cdot)\\ \bar{\alpha}_{n}^{e}(\cdot)&=&\frac{1}{\sqrt{n(n\gamma)^{1/\alpha}}}\sum_{i=1}^{n}e_{i}\cdot\epsilon_{i}\cdot\Omega_{n,i}(\cdot)&\bar{Z}_{n}(\cdot)&=&\frac{1}{\sqrt{n(n\gamma)^{1/\alpha}}}\sum_{i=1}^{n}Z_{i}(\cdot)\end{array}

where eie_{i}’s, for i=1,,ni=1,\cdots,n, are i.i.d. standard normal random variables, and Zi(t)N(0,(nγ)1/αν=1(1(1γμν)ni)2ϕν2(t))Z_{i}(t)\sim N\big{(}0,(n\gamma)^{-1/\alpha}\sum_{\nu=1}^{\infty}(1-(1-\gamma\mu_{\nu})^{n-i})^{2}\phi_{\nu}^{2}(t)\big{)} satisfying 𝔼(Zi(tk)Zi(t))=(nγ)1/αν=1(1(1γμν)ni)2ϕν(tk)ϕν(t)\mathbb{E}\big{(}Z_{i}(t_{k})\cdot Z_{i}(t_{\ell})\big{)}=(n\gamma)^{-1/\alpha}\sum_{\nu=1}^{\infty}(1-(1-\gamma\mu_{\nu})^{n-i})^{2}\phi_{\nu}(t_{k})\phi_{\nu}(t_{\ell}), and 𝔼(Zi(tk)Zj(t))=0\mathbb{E}\big{(}Z_{i}(t_{k})\cdot Z_{j}(t_{\ell})\big{)}=0 for iji\neq j.
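To make the four processes above concrete, the following Python sketch simulates one draw of each on a grid, under assumptions chosen purely for illustration: eigenvalues \mu_{\nu}=\nu^{-\alpha} with \alpha=3, the cosine basis \phi_{1}\equiv 1, \phi_{\nu}(x)=\sqrt{2}\cos((\nu-1)\pi x) for \nu\geq 2, truncation of the eigen-expansion, uniform design, Gaussian noise with \sigma=1, and exponential(1) bootstrap weights w_{i}.

import numpy as np

rng = np.random.default_rng(1)
alpha, n, N, nu_max = 3.0, 500, 20, 200
gamma = n ** (-1.0 / 3.0)                      # illustrative constant step size
t = np.linspace(0.0, 1.0, N)                   # grid t_1, ..., t_N
X = rng.uniform(0.0, 1.0, n)
eps = rng.standard_normal(n)                   # epsilon_i, with sigma = 1

mu = np.arange(1, nu_max + 1, dtype=float) ** (-alpha)

def phi(x):
    # cosine basis: phi_1 = 1, phi_nu(x) = sqrt(2) cos((nu - 1) pi x)
    nu = np.arange(nu_max)
    out = np.sqrt(2.0) * np.cos(np.pi * np.outer(np.atleast_1d(x), nu))
    out[:, 0] = 1.0
    return out                                  # shape (len(x), nu_max)

# Omega_{n,i}(t_m) = sum_nu (1 - (1 - gamma mu_nu)^(n - i)) phi_nu(X_i) phi_nu(t_m)
damp = 1.0 - (1.0 - gamma * mu) ** (n - np.arange(1, n + 1))[:, None]   # (n, nu_max)
Omega = (damp * phi(X)) @ phi(t).T              # shape (n, N)

scale = 1.0 / np.sqrt(n * (n * gamma) ** (1.0 / alpha))
w = rng.exponential(1.0, n)                     # bootstrap multipliers, mean one
e = rng.standard_normal(n)                      # Gaussian multipliers

alpha_bar   = scale * (eps[:, None] * Omega).sum(axis=0)                # one draw of alpha_bar_n
alpha_bar_b = scale * (((w - 1.0) * eps)[:, None] * Omega).sum(axis=0)  # alpha_bar_n^b
alpha_bar_e = scale * ((e * eps)[:, None] * Omega).sum(axis=0)          # alpha_bar_n^e

# one draw of Z_bar_n: Z_i has the stated covariance across grid points
Z_coef = damp * rng.standard_normal((n, nu_max))
Z_bar = ((n * gamma) ** (-1.0 / (2 * alpha)) / np.sqrt(n)) * (Z_coef @ phi(t).T).sum(axis=0)

print(np.max(alpha_bar), np.max(alpha_bar_b), np.max(alpha_bar_e), np.max(Z_bar))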

Lemma 8.3.

(Proof of Step II) Suppose α>2\alpha>2 and γ=nξ\gamma=n^{-\xi} with ξ>max{1α/3,0}\xi>\max\{1-\alpha/3,0\}. We have

supν|(max1kNα¯n(tk)ν)(max1kNZ¯n(tk)ν)|(logN)3/2(n(nγ)3/α)1/8,\sup_{\nu\in\mathbb{R}}\Big{|}\mathbb{P}(\max_{1\leq k\leq N}\bar{\alpha}_{n}(t_{k})\leq\nu)-\mathbb{P}(\max_{1\leq k\leq N}\bar{Z}_{n}(t_{k})\leq\nu)\Big{|}\leq\frac{(\log N)^{3/2}}{\big{(}n(n\gamma)^{-3/\alpha}\big{)}^{1/8}}, (8.33)

which converges to 0 as n increases.

Lemma 8.4.

(Proof of Step III) Suppose α>2\alpha>2 and γ=nξ\gamma=n^{-\xi} with ξ>max{1α/3,0}\xi>\max\{1-\alpha/3,0\}. With probability at least 1exp(Clogn)1-\exp(-C\log n),

supν|(max1jNα¯ne(tj)ν)(max1jNZ¯n(tj)ν)|((nγ)1/αn1)1/6(logn)1/3(logN)2/3.\sup_{\nu\in\mathbb{R}}\Big{|}\mathbb{P}^{*}\Big{(}\max_{1\leq j\leq N}\bar{\alpha}^{e}_{n}(t_{j})\leq\nu\Big{)}-\mathbb{P}\Big{(}\max_{1\leq j\leq N}\bar{Z}_{n}(t_{j})\leq\nu\Big{)}\Big{|}\preceq\big{(}(n\gamma)^{1/\alpha}n^{-1}\big{)}^{1/6}(\log n)^{1/3}(\log N)^{2/3}.
Lemma 8.5.

(Proof of Step IV) Suppose α>2\alpha>2 and γ=nξ\gamma=n^{-\xi} with ξ>max{1α/3,0}\xi>\max\{1-\alpha/3,0\}. With probability at least 14/n1-4/n,

supζ|(max1kNα¯nb(tk)ζ)(max1kNα¯ne(tk)ζ)|(logN)3/2(n(nγ)3/α)1/8.\sup_{\zeta\in\mathbb{R}}\Big{|}\mathbb{P}^{*}(\max_{1\leq k\leq N}\bar{\alpha}_{n}^{b}(t_{k})\leq\zeta)-\mathbb{P}^{*}(\max_{1\leq k\leq N}\bar{\alpha}_{n}^{e}(t_{k})\leq\zeta)\Big{|}\leq\frac{(\log N)^{3/2}}{\big{(}n(n\gamma)^{-3/\alpha}\big{)}^{1/8}}.

Proof of Lemma 8.3

Proof.

We define g_{m}(i,X_{i},\epsilon_{i})=\frac{1}{\sqrt{(n\gamma)^{1/\alpha}}}\epsilon_{i}\cdot\Omega_{n,i}(t_{m}) for t_{m}\in\{t_{1},\dots,t_{N}\}. With a slight abuse of notation, we use g_{i,m} to represent g_{m}(i,X_{i},\epsilon_{i}). Then \bar{\alpha}_{n}(t_{m})=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}g_{i,m}. Define \bm{g}_{i}=(g_{i,1},\cdots,g_{i,N})^{\top} and \bar{\bm{\alpha}}_{n}=\big{(}\bar{\alpha}_{n}(t_{1}),\cdots,\bar{\alpha}_{n}(t_{N})\big{)}^{\top}\in\mathbb{R}^{N}; then \bar{\bm{\alpha}}_{n}=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\bm{g}_{i}. For 1\leq m\leq k\leq N,

\displaystyle\mathbb{E}(g_{i,m}\cdot g_{i,k})= \frac{\sigma^{2}}{(n\gamma)^{1/\alpha}}\mathbb{E}\big{[}\big{(}\sum_{\nu=1}^{\infty}(1-(1-\gamma\mu_{\nu})^{n-i})\phi_{\nu}(t_{m})\phi_{\nu}(X_{i})\big{)}\big{(}\sum_{\nu=1}^{\infty}(1-(1-\gamma\mu_{\nu})^{n-i})\phi_{\nu}(t_{k})\phi_{\nu}(X_{i})\big{)}\big{]}
=\displaystyle= σ2(nγ)1/αν=1(1(1γμν)ni)2ϕν(tm)ϕν(tk).\displaystyle\frac{\sigma^{2}}{(n\gamma)^{1/\alpha}}\sum_{\nu=1}^{\infty}(1-(1-\gamma\mu_{\nu})^{n-i})^{2}\phi_{\nu}(t_{m})\phi_{\nu}(t_{k}).

When m=km=k, 𝔼(gi,mgi,m)=1(nγ)1/αν=1(1(1γμν)ni)2ϕν2(tm)\mathbb{E}(g_{i,m}\cdot g_{i,m})=\frac{1}{(n\gamma)^{1/\alpha}}\sum_{\nu=1}^{\infty}(1-(1-\gamma\mu_{\nu})^{n-i})^{2}\phi_{\nu}^{2}(t_{m}). We also have 𝔼(gi,mgj,m)=0\mathbb{E}(g_{i,m}\cdot g_{j,m})=0 for iji\neq j.

We also use the notation Z_{i,m} to represent Z_{i}(t_{m}) defined in Section 6.2. Let \bm{Z}_{i}=(Z_{i,1},\cdots,Z_{i,N})^{\top}\in\mathbb{R}^{N} for i=1,\dots,n, and \bar{\bm{Z}}_{n}=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\bm{Z}_{i}=(\bar{Z}_{n}(t_{1}),\cdots,\bar{Z}_{n}(t_{N}))^{\top}\in\mathbb{R}^{N}. We remark that \bar{\bm{\alpha}}_{n} has the same mean and covariance structure as \bar{\bm{Z}}_{n}.

For a scalar q depending on n, to be determined later, and \bm{\beta}=(\beta_{1},\ldots,\beta_{N})^{\top}\in\mathbb{R}^{N}, define F_{q}(\bm{\beta})=q^{-1}\log(\sum_{l=1}^{N}\exp(q\beta_{l})). It follows from [42] that F_{q}(\bm{\beta}) satisfies 0\leq F_{q}(\bm{\beta})-\max_{1\leq l\leq N}\beta_{l}\leq q^{-1}\log{N}. Let U_{0}:\mathbb{R}\rightarrow[0,1] be a C^{3}-function such that U_{0}(s)=1 for s\leq 0 and U_{0}(s)=0 for s\geq 1. Let U_{\zeta}(s)=U_{0}(\psi_{n}(s-\zeta-q^{-1}\log{N})), for \zeta\in\mathbb{R}, where \psi_{n} is also to be determined later. Then

(max1mN1ni=1ngi,mζ)(Fq(𝜶¯n)ζ+q1logN)𝔼{Uζ(Fq(𝜶¯n))}.\mathbb{P}(\max_{1\leq m\leq N}\frac{1}{\sqrt{n}}\sum_{i=1}^{n}g_{i,m}\leq\zeta)\leq\mathbb{P}(F_{q}(\bar{\bm{\alpha}}_{n})\leq\zeta+q^{-1}\log{N})\leq\mathbb{E}\{U_{\zeta}(F_{q}(\bar{\bm{\alpha}}_{n}))\}.
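As a small numerical aside (not used in the argument), the following Python snippet checks the two-sided log-sum-exp bound 0\leq F_{q}(\bm{\beta})-\max_{l}\beta_{l}\leq q^{-1}\log N on a random vector:

import numpy as np

rng = np.random.default_rng(0)
N, q = 50, 10.0
beta = rng.standard_normal(N)

F_q = np.log(np.sum(np.exp(q * beta))) / q   # F_q(beta) = q^{-1} log sum_l exp(q beta_l)
gap = F_q - beta.max()
assert 0.0 <= gap <= np.log(N) / q           # the smooth max overshoots by at most q^{-1} log N
print(F_q, beta.max(), np.log(N) / q)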

To proceed, we approximate \mathbb{E}\{U_{\zeta}(F_{q}(\bar{\bm{\alpha}}_{n}))-U_{\zeta}(F_{q}(\bar{\bm{Z}}_{n}))\} using the techniques of [42]. Let G=U_{\zeta}\circ F_{q}. Define \Psi(t)=\mathbb{E}\{G(\sqrt{t}\bar{\bm{\alpha}}_{n}+\sqrt{1-t}\bar{\bm{Z}}_{n})\}, W(t)=\sqrt{t}\bar{\bm{\alpha}}_{n}+\sqrt{1-t}\bar{\bm{Z}}_{n}, W_{i}(t)=\frac{1}{\sqrt{n}}(\sqrt{t}\bm{g}_{i}+\sqrt{1-t}\bm{Z}_{i}) and W_{-i}(t)=W(t)-W_{i}(t), for i=1,\ldots,n. Let G_{k}(\bm{\beta})=\frac{\partial}{\partial\beta_{k}}G(\bm{\beta}), G_{kl}(\bm{\beta})=\frac{\partial^{2}}{\partial\beta_{k}\partial\beta_{l}}G(\bm{\beta}) and G_{kld}(\bm{\beta})=\frac{\partial^{3}}{\partial\beta_{k}\partial\beta_{l}\partial\beta_{d}}G(\bm{\beta}), for 1\leq k,l,d\leq N. Then W^{\prime}_{ik}(t)=\frac{1}{2\sqrt{n}}(g_{i,k}/\sqrt{t}-Z_{i,k}/\sqrt{1-t}).

Then

𝔼{G(𝜶¯n)G(𝒁¯n)}=𝔼{Uζ(Fq(𝜶¯n))Uζ(Fq(𝒁¯n))}=Ψ(1)Ψ(0)=01Ψ(t)𝑑t\displaystyle\mathbb{E}\{G(\bar{\bm{\alpha}}_{n})-G(\bar{\bm{Z}}_{n})\}=\mathbb{E}\{U_{\zeta}(F_{q}(\bar{\bm{\alpha}}_{n}))-U_{\zeta}(F_{q}(\bar{\bm{Z}}_{n}))\}=\Psi(1)-\Psi(0)=\int_{0}^{1}\Psi^{\prime}(t)dt
=\displaystyle= 12nk=1N01𝔼{Gk(W(t))(i=1ngi,k/ti=1nZi,k/1t)}𝑑t\displaystyle\frac{1}{2\sqrt{n}}\sum_{k=1}^{N}\int_{0}^{1}\mathbb{E}\{G_{k}(W(t))(\sum_{i=1}^{n}g_{i,k}/\sqrt{t}-\sum_{i=1}^{n}Z_{i,k}/\sqrt{1-t})\}dt
=\displaystyle= 12nk=1Ni=1n01𝔼{Gk(W(t))(gi,k/tZi,k/1t)}𝑑t\displaystyle\frac{1}{2\sqrt{n}}\sum_{k=1}^{N}\sum_{i=1}^{n}\int_{0}^{1}\mathbb{E}\big{\{}G_{k}(W(t))(g_{i,k}/\sqrt{t}-Z_{i,k}/\sqrt{1-t})\big{\}}dt
=\displaystyle= 12nk=1Ni=1n01𝔼{[Gk(Wi(t))+1nl=1NGkl(Wi(t))(tgi,l+1tZi,l)\displaystyle\frac{1}{2\sqrt{n}}\sum_{k=1}^{N}\sum_{i=1}^{n}\int_{0}^{1}\mathbb{E}\big{\{}\big{[}G_{k}(W_{-i}(t))+\frac{1}{\sqrt{n}}\sum_{l=1}^{N}G_{kl}(W_{-i}(t))(\sqrt{t}g_{i,l}+\sqrt{1-t}Z_{i,l})
+1nl=1Nd=1N01(1t)Gkld(Wi(t)+tWi(t))(tgi,l+1tZi,l)(tgi,d+1tZi,d)dt]\displaystyle+\frac{1}{n}\sum_{l=1}^{N}\sum_{d=1}^{N}\int_{0}^{1}(1-t^{\prime})G_{kld}(W_{-i}(t)+t^{\prime}W_{i}(t))(\sqrt{t}g_{i,l}+\sqrt{1-t}Z_{i,l})(\sqrt{t}g_{i,d}+\sqrt{1-t}Z_{i,d})dt^{\prime}\big{]}
×(gi,k/tZi,k/1t)}dt\displaystyle\times(g_{i,k}/\sqrt{t}-Z_{i,k}/\sqrt{1-t})\big{\}}dt
=\displaystyle= 12nk=1Ni=1n01𝔼{Gk(Wi(t))}𝔼{gi,k/tZi,k/1t}𝑑t\displaystyle\frac{1}{2\sqrt{n}}\sum_{k=1}^{N}\sum_{i=1}^{n}\int_{0}^{1}\mathbb{E}\{G_{k}(W_{-i}(t))\}\mathbb{E}\{g_{i,k}/\sqrt{t}-Z_{i,k}/\sqrt{1-t}\}dt
+12nk,l=1Ni=1n01𝔼{Gkl(Wi(t))}×𝔼{(tgi,l+1tZi,l)(gi,k/tZi,k/1t)}𝑑t\displaystyle+\frac{1}{2n}\sum_{k,l=1}^{N}\sum_{i=1}^{n}\int_{0}^{1}\mathbb{E}\{G_{kl}(W_{-i}(t))\}\times\mathbb{E}\{(\sqrt{t}g_{i,l}+\sqrt{1-t}Z_{i,l})(g_{i,k}/\sqrt{t}-Z_{i,k}/\sqrt{1-t})\}dt
+12n3/2k,l,d=1Ni=1n0101(1t)𝔼{Gkld(Wi(t)+tWi(t))(tgi,l+1tZi,l)\displaystyle+\frac{1}{2n^{3/2}}\sum_{k,l,d=1}^{N}\sum_{i=1}^{n}\int_{0}^{1}\int_{0}^{1}(1-t^{\prime})\mathbb{E}\{G_{kld}(W_{-i}(t)+t^{\prime}W_{i}(t))(\sqrt{t}g_{i,l}+\sqrt{1-t}Z_{i,l})
(tgi,d+1tZi,d)(gi,k/tZi,k/1t)}dtdt\displaystyle(\sqrt{t}g_{i,d}+\sqrt{1-t}Z_{i,d})(g_{i,k}/\sqrt{t}-Z_{i,k}/\sqrt{1-t})\}dtdt^{\prime}
\displaystyle\equiv J1/2+J2/2+J3/2,\displaystyle J_{1}/2+J_{2}/2+J_{3}/2,

where

J1\displaystyle J_{1} =\displaystyle= 1nk=1Ni=1n01𝔼{Gk(Wi(t))}𝔼{gi,k/tZi,k/1t}𝑑t=0\displaystyle\frac{1}{\sqrt{n}}\sum_{k=1}^{N}\sum_{i=1}^{n}\int_{0}^{1}\mathbb{E}\{G_{k}(W_{-i}(t))\}\mathbb{E}\{g_{i,k}/\sqrt{t}-Z_{i,k}/\sqrt{1-t}\}dt=0
J2\displaystyle J_{2} =\displaystyle= 1nk,l=1Ni=1n01𝔼{Gkl(Wi(t))}×𝔼{(tgi,l+1tZi,l)(gi,k/tZi,k/1t)}𝑑t\displaystyle\frac{1}{n}\sum_{k,l=1}^{N}\sum_{i=1}^{n}\int_{0}^{1}\mathbb{E}\{G_{kl}(W_{-i}(t))\}\times\mathbb{E}\{(\sqrt{t}g_{i,l}+\sqrt{1-t}Z_{i,l})(g_{i,k}/\sqrt{t}-Z_{i,k}/\sqrt{1-t})\}dt
J3\displaystyle J_{3} =\displaystyle= 1n3/2k,l,d=1Ni=1n0101(1t)𝔼{Gkld(Wi(t)+tWi(t))(tgi,l+1tZi,l)(tgi,d+1tZi,d)\displaystyle\frac{1}{n^{3/2}}\sum_{k,l,d=1}^{N}\sum_{i=1}^{n}\int_{0}^{1}\int_{0}^{1}(1-t^{\prime})\mathbb{E}\{G_{kld}(W_{-i}(t)+t^{\prime}W_{i}(t))(\sqrt{t}g_{i,l}+\sqrt{1-t}Z_{i,l})(\sqrt{t}g_{i,d}+\sqrt{1-t}Z_{i,d})
(g_{i,k}/\sqrt{t}-Z_{i,k}/\sqrt{1-t})\}dtdt^{\prime}

We further note that J2=0J_{2}=0 since 𝔼{(tgi,l+1tZi,l)(gi,k/tZi,k/1t)}=𝔼(gi,lgi,k)𝔼(Zi,lZi,k)=0\mathbb{E}\{(\sqrt{t}g_{i,l}+\sqrt{1-t}Z_{i,l})(g_{i,k}/\sqrt{t}-Z_{i,k}/\sqrt{1-t})\}=\mathbb{E}(g_{i,l}g_{i,k})-\mathbb{E}(Z_{i,l}Z_{i,k})=0. For J3J_{3}, it follows from [42] that for any zNz\in\mathbb{R}^{N},

k,l,d=1N|Gkld(z)|(C3ψn3+6C2qψn2+6C1q2ψn),\sum_{k,l,d=1}^{N}|G_{kld}(z)|\leq(C_{3}\psi_{n}^{3}+6C_{2}q\psi_{n}^{2}+6C_{1}q^{2}\psi_{n}),

where C_{1}=\|U_{0}^{\prime}\|_{\infty}, C_{2}=\|U_{0}^{\prime\prime}\|_{\infty}, and C_{3}=\|U_{0}^{\prime\prime\prime}\|_{\infty} are finite constants. Then

|J3|\displaystyle|J_{3}|\leq 1n3/2k,l,d=1Ni=1n0101𝔼{|Gkld(Wi(t)+tWi(t))|max1kN(|gi,k|+|Zi,k|)3}\displaystyle\frac{1}{n^{3/2}}\sum_{k,l,d=1}^{N}\sum_{i=1}^{n}\int_{0}^{1}\int_{0}^{1}\mathbb{E}\{|G_{kld}(W_{-i}(t)+t^{\prime}W_{i}(t))|\max_{1\leq k\leq N}(|g_{i,k}|+|Z_{i,k}|)^{3}\}
×(1/t+1/1t)dtdt\displaystyle\times(1/\sqrt{t}+1/\sqrt{1-t})dtdt^{\prime}
\displaystyle\leq 1n3/24(C3ψn3+6C2qψn2+6C1q2ψn)i=1n𝔼{max1kN(|gi,k|+|Zi,k|)3}\displaystyle\frac{1}{n^{3/2}}4(C_{3}\psi_{n}^{3}+6C_{2}q\psi_{n}^{2}+6C_{1}q^{2}\psi_{n})\sum_{i=1}^{n}\mathbb{E}\{\max_{1\leq k\leq N}(|g_{i,k}|+|Z_{i,k}|)^{3}\}
\displaystyle\leq 1n3/232(C3ψn3+6C2qψn2+6C1q2ψn){i=1n(𝔼{max1kN|gi,k|3}+𝔼{max1kN|Zi,k|3})}\displaystyle\frac{1}{n^{3/2}}32(C_{3}\psi_{n}^{3}+6C_{2}q\psi_{n}^{2}+6C_{1}q^{2}\psi_{n})\big{\{}\sum_{i=1}^{n}(\mathbb{E}\{\max_{1\leq k\leq N}|g_{i,k}|^{3}\}+\mathbb{E}\{\max_{1\leq k\leq N}|Z_{i,k}|^{3}\})\big{\}} (8.34)

We need to bound \sum_{i=1}^{n}\mathbb{E}\{\max\limits_{1\leq k\leq N}|g_{i,k}|^{3}\} and \mathbb{E}\{\max\limits_{1\leq k\leq N}|Z_{i,k}|^{3}\}.

i=1n𝔼max1kN|gi,k|3=\displaystyle\sum_{i=1}^{n}\mathbb{E}\max_{1\leq k\leq N}|g_{i,k}|^{3}= 1(nγ)3/(2α)i=1n𝔼max1kN|ϵiΩn,i(tk)|3\displaystyle\frac{1}{(n\gamma)^{3/(2\alpha)}}\sum_{i=1}^{n}\mathbb{E}\max_{1\leq k\leq N}|\epsilon_{i}\cdot\Omega_{n,i}(t_{k})|^{3}
\displaystyle\leq 1(nγ)3/(2α)i=1n𝔼|ϵi|3𝔼max1kN|Ωn,i(tk)|3\displaystyle\frac{1}{(n\gamma)^{3/(2\alpha)}}\sum_{i=1}^{n}\mathbb{E}|\epsilon_{i}|^{3}\cdot\mathbb{E}\max_{1\leq k\leq N}|\Omega_{n,i}(t_{k})|^{3}
\displaystyle\lesssim σ3(nγ)3/(2α)i=1n𝔼max1kN|Ωn,i(tk)|3cϕ6σ3n(nγ)3/(2α),\displaystyle\frac{\sigma^{3}}{(n\gamma)^{3/(2\alpha)}}\sum_{i=1}^{n}\mathbb{E}\max_{1\leq k\leq N}|\Omega_{n,i}(t_{k})|^{3}\leq c_{\phi}^{6}\sigma^{3}n(n\gamma)^{3/(2\alpha)},

where the last step is due to the property that |\Omega_{n,i}(t_{k})|\leq\sum_{\nu=1}^{\infty}(1-(1-\gamma\mu_{\nu})^{n-i})\cdot|\phi_{\nu}(X_{i})|\cdot|\phi_{\nu}(t_{k})|\leq c_{\phi}^{2}(n\gamma)^{1/\alpha}, and hence \max\limits_{1\leq k\leq N}|\Omega_{n,i}(t_{k})|^{3}\leq c_{\phi}^{6}(n\gamma)^{3/\alpha}.

Next we deal with 𝔼max1kN|Zi,k|3\mathbb{E}\max\limits_{1\leq k\leq N}|Z_{i,k}|^{3}, where Zi,kN(0,1(nγ)1/αν=1(1(1γμν)ni)2ϕν2(tk))Z_{i,k}\sim N\big{(}0,\frac{1}{(n\gamma)^{1/\alpha}}\sum_{\nu=1}^{\infty}(1-(1-\gamma\mu_{\nu})^{n-i})^{2}\phi_{\nu}^{2}(t_{k})\big{)}, and 𝔼(Zi,kZi,l)=1(nγ)1/αν=1(1(1γμν)ni)2ϕν(tk)ϕν(tl)\mathbb{E}(Z_{i,k}\cdot Z_{i,l})=\frac{1}{(n\gamma)^{1/\alpha}}\sum_{\nu=1}^{\infty}(1-(1-\gamma\mu_{\nu})^{n-i})^{2}\phi_{\nu}(t_{k})\phi_{\nu}(t_{l}). For p>3p>3, we have

𝔼max1kN|Zi,k|3=\displaystyle\mathbb{E}\max_{1\leq k\leq N}|Z_{i,k}|^{3}= 𝔼max1kN(|Zi,k|p)3/p(𝔼max1kN|Zi,k|p)3/p(k=1N𝔼|Zi,k|p)3/p\displaystyle\mathbb{E}\max_{1\leq k\leq N}(|Z_{i,k}|^{p})^{3/p}\leq\big{(}\mathbb{E}\max_{1\leq k\leq N}|Z_{i,k}|^{p}\big{)}^{3/p}\leq\big{(}\sum_{k=1}^{N}\mathbb{E}|Z_{i,k}|^{p}\big{)}^{3/p}
=\displaystyle= 1(nγ)3/(2α)[k=1N(ν=1(1(1γμν)ni)2ϕν2(tk))p/2]3/p((p1)!!)3/p\displaystyle\frac{1}{(n\gamma)^{3/(2\alpha)}}\big{[}\sum_{k=1}^{N}\big{(}\sum_{\nu=1}^{\infty}(1-(1-\gamma\mu_{\nu})^{n-i})^{2}\phi_{\nu}^{2}(t_{k})\big{)}^{p/2}\big{]}^{3/p}((p-1)!!)^{3/p}
\displaystyle\leq cϕ2((p1)!!)3/p1(nγ)3/(2α)N3/p[(ni)γ]32α\displaystyle c_{\phi}^{2}((p-1)!!)^{3/p}\frac{1}{(n\gamma)^{3/(2\alpha)}}N^{3/p}[(n-i)\gamma]^{\frac{3}{2\alpha}}

Then we have

\displaystyle J_{3}\leq 32\sigma^{3}(C_{3}\psi_{n}^{3}+6C_{2}q\psi_{n}^{2}+6C_{1}q^{2}\psi_{n})\big{(}n^{-3/2}n(n\gamma)^{3/(2\alpha)}+n^{-3/2}c_{\phi}^{2}((p-1)!!)^{3/p}N^{3/p}n\big{)}
\displaystyle\leq C(C3ψn3+6C2qψn2+6C1q2ψn)(n1/2(nγ)32α+((p1)!!)3/pN3/pn1/2)\displaystyle C^{\prime}(C_{3}\psi_{n}^{3}+6C_{2}q\psi_{n}^{2}+6C_{1}q^{2}\psi_{n})\big{(}n^{-1/2}(n\gamma)^{\frac{3}{2\alpha}}+((p-1)!!)^{3/p}N^{3/p}n^{-1/2}\big{)}

Therefore,

|𝔼{Uζ(Fq(𝜶¯n))Uζ(Fq(𝒁¯n))}|\displaystyle|\mathbb{E}\{U_{\zeta}(F_{q}(\bar{\bm{\alpha}}_{n}))-U_{\zeta}(F_{q}(\bar{\bm{Z}}_{n}))\}|
\displaystyle\leq C(C3ψn3+6C2qψn2+6C1q2ψn)(n1/2(nγ)32α+((p1)!!)3/pN3/pn1/2)\displaystyle C^{\prime}(C_{3}\psi_{n}^{3}+6C_{2}q\psi_{n}^{2}+6C_{1}q^{2}\psi_{n})\big{(}n^{-1/2}(n\gamma)^{\frac{3}{2\alpha}}+((p-1)!!)^{3/p}N^{3/p}n^{-1/2}\big{)} (8.35)

Meanwhile, it follows from Lemma 2.1 of [42] that

\displaystyle\mathbb{E}\{U_{\zeta}(F_{q}(\bar{\bm{Z}}_{n}))\} \displaystyle\leq \mathbb{P}(\max_{1\leq k\leq N}\frac{1}{\sqrt{n}}\sum_{i=1}^{n}Z_{i,k}\leq\zeta+q^{-1}\log{N}+\psi_{n}^{-1})
\displaystyle\leq \mathbb{P}(\max_{1\leq k\leq N}\frac{1}{\sqrt{n}}\sum_{i=1}^{n}Z_{i,k}\leq\zeta)+C^{\prime}(q^{-1}\log{N}+\psi_{n}^{-1})(1+\sqrt{2\log{N}}),

where C>0C^{\prime}>0 is a universal constant. Therefore, for any ζ\zeta\in\mathbb{R},

(max1kNα¯n(tk)ζ)(max1kNZ¯n(tk)ζ)\displaystyle\mathbb{P}(\max_{1\leq k\leq N}\bar{\alpha}_{n}(t_{k})\leq\zeta)-\mathbb{P}(\max_{1\leq k\leq N}\bar{Z}_{n}(t_{k})\leq\zeta)
\displaystyle\leq C4(C3ψn3+6C2qψn2+6C1q2ψn)(n1/2(nγ)32α+((p1)!!)3/pN3/pn1/2)\displaystyle C_{4}(C_{3}\psi_{n}^{3}+6C_{2}q\psi_{n}^{2}+6C_{1}q^{2}\psi_{n})\big{(}n^{-1/2}(n\gamma)^{\frac{3}{2\alpha}}+((p-1)!!)^{3/p}N^{3/p}n^{-1/2}\big{)}
\quad\quad+c^{\prime}(q^{-1}\log N+\psi_{n}^{-1})(1+\sqrt{2\log N}).

On the other hand, let Vζ(s)=U0(ψn(sζ)+1)V_{\zeta}(s)=U_{0}(\psi_{n}(s-\zeta)+1). Then

(max1kN1ni=1ngi,kζ)(Fq(𝜶¯n)ζ)𝔼{Vζ(Fq(𝜶¯n))}.\mathbb{P}(\max_{1\leq k\leq N}\frac{1}{\sqrt{n}}\sum_{i=1}^{n}g_{i,k}\leq\zeta)\geq\mathbb{P}(F_{q}(\bar{\bm{\alpha}}_{n})\leq\zeta)\geq\mathbb{E}\{V_{\zeta}(F_{q}(\bar{\bm{\alpha}}_{n}))\}.

Using the same arguments, it can be shown that |\mathbb{E}\{V_{\zeta}(F_{q}(\bar{\bm{\alpha}}_{n}))-V_{\zeta}(F_{q}(\bar{\bm{Z}}_{n}))\}| has the same upper bound as specified in (8.35). Furthermore, by Lemma 2.1 of [42] and direct calculations, we have

\displaystyle\mathbb{E}\{V_{\zeta}(F_{q}(\bar{\bm{Z}}_{n}))\} \displaystyle\geq \mathbb{P}(F_{q}(\bar{\bm{Z}}_{n})\leq\zeta-\psi_{n}^{-1})\geq\mathbb{P}(\max_{1\leq k\leq N}\frac{1}{\sqrt{n}}\sum_{i=1}^{n}Z_{i,k}\leq\zeta-(\psi_{n}^{-1}+q^{-1}\log{N}))
\displaystyle\geq \mathbb{P}(\max_{1\leq k\leq N}\frac{1}{\sqrt{n}}\sum_{i=1}^{n}Z_{i,k}\leq\zeta)-C^{\prime}(\psi_{n}^{-1}+q^{-1}\log{N})(1+\sqrt{2\log{N}}).

Therefore,

(max1kNα¯n(tk)ζ)(max1kNZ¯n(tk)ζ)\displaystyle\mathbb{P}(\max_{1\leq k\leq N}\bar{\alpha}_{n}(t_{k})\leq\zeta)-\mathbb{P}(\max_{1\leq k\leq N}\bar{Z}_{n}(t_{k})\leq\zeta)
\displaystyle\geq C0′′(C3ψn3+C2qψn2+6C1q2ψn)(n1/2(nγ)32α+((p1)!!)3/pN3/pn1/2)\displaystyle-C_{0}^{{}^{\prime\prime}}(C_{3}\psi_{n}^{3}+C_{2}q\psi_{n}^{2}+6C_{1}q^{2}\psi_{n})\big{(}n^{-1/2}(n\gamma)^{\frac{3}{2\alpha}}+((p-1)!!)^{3/p}N^{3/p}n^{-1/2}\big{)}
C′′(ψn1+q1logN)(1+2logN).\displaystyle-C^{{}^{\prime\prime}}(\psi_{n}^{-1}+q^{-1}\log N)(1+\sqrt{2\log N}).

Consequently, taking \psi_{n}=q=\big{(}n(n\gamma)^{-3/\alpha}\big{)}^{1/8} and p large enough, we have

supζ|(max1kNα¯n(tk)ζ)(max1kNZ¯n(tk)ζ)|(logN)3/2(n(nγ)3/α)1/8,\sup_{\zeta\in\mathbb{R}}\Big{|}\mathbb{P}(\max_{1\leq k\leq N}\bar{\alpha}_{n}(t_{k})\leq\zeta)-\mathbb{P}(\max_{1\leq k\leq N}\bar{Z}_{n}(t_{k})\leq\zeta)\Big{|}\leq\frac{(\log N)^{3/2}}{\big{(}n(n\gamma)^{-3/\alpha}\big{)}^{1/8}}, (8.36)

which converges to 0 as n increases when \alpha>2 and \gamma=n^{-\xi} with \xi>\max\{1-\alpha/3,0\}. ∎
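As a side remark on the rate in (8.36): under the assumption \gamma=n^{-\xi},

n(n\gamma)^{-3/\alpha}=n^{1-\frac{3(1-\xi)}{\alpha}},

so \big{(}n(n\gamma)^{-3/\alpha}\big{)}^{1/8}\to\infty exactly when 1-3(1-\xi)/\alpha>0, i.e., \xi>1-\alpha/3. Combined with \xi>0, this is the condition \xi>\max\{1-\alpha/3,0\} in the statement of Lemma 8.3, and the numerator (\log N)^{3/2} is then dominated whenever, for example, N grows at most polynomially in n.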

Proof of Lemma 8.4

Proof.

Let 𝜶¯ne=(α¯ne(t1),,α¯ne(tN))\bar{\bm{\alpha}}^{e}_{n}=\big{(}\bar{\alpha}^{e}_{n}(t_{1}),\cdots,\bar{\alpha}^{e}_{n}(t_{N})\big{)}^{\top} and 𝒁¯n=1ni=1n𝒁i=(Z¯n(t1),,Z¯n(tN))\bar{\bm{Z}}_{n}=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\bm{Z}_{i}=(\bar{Z}_{n}(t_{1}),\cdots,\bar{Z}_{n}(t_{N}))^{\top}. Then 𝜶¯ne𝒟nN(0,Σα¯ne)\bar{\bm{\alpha}}^{e}_{n}\mid\mathcal{D}_{n}\sim N(0,\Sigma^{\bar{\alpha}^{e}_{n}}) and 𝒁¯nN(0,ΣZ¯n)\bar{\bm{Z}}_{n}\sim N(0,\Sigma^{\bar{Z}_{n}}). Denote the jkjk-th element of the covariance matrices as Σj,kα¯ne\Sigma_{j,k}^{\bar{\alpha}^{e}_{n}} and Σj,kZ¯n\Sigma_{j,k}^{\bar{Z}_{n}}, respectively. Set biν=(1(1γμν)ni)b_{i\nu}=(1-(1-\gamma\mu_{\nu})^{n-i}). Then

Σj,kα¯ne=\displaystyle\Sigma_{j,k}^{\bar{\alpha}^{e}_{n}}= 1n(nγ)1/αi=1nϵi2(ν=1biνϕν(Xi)ϕν(tj))(ν=1biνϕν(Xi)ϕν(tk)),\displaystyle\frac{1}{n(n\gamma)^{1/\alpha}}\sum_{i=1}^{n}\epsilon_{i}^{2}\big{(}\sum_{\nu=1}^{\infty}b_{i\nu}\phi_{\nu}(X_{i})\phi_{\nu}(t_{j})\big{)}\cdot\big{(}\sum_{\nu=1}^{\infty}b_{i\nu}\phi_{\nu}(X_{i})\phi_{\nu}(t_{k})\big{)},

and Σj,kZ¯n=1n(nγ)1/αi=1nν=1biν2ϕν(tk)ϕν(tj)\Sigma_{j,k}^{\bar{Z}_{n}}=\frac{1}{n(n\gamma)^{1/\alpha}}\sum_{i=1}^{n}\sum_{\nu=1}^{\infty}b_{i\nu}^{2}\phi_{\nu}(t_{k})\phi_{\nu}(t_{j}).

Following a lemma in the Supplementary Material [33], we have

(|Σj,kα¯neΣj,kZ¯n|C(nγ)1/(2α)n1/2logn)exp(C1logn).\mathbb{P}\Big{(}|\Sigma_{j,k}^{\bar{\alpha}^{e}_{n}}-\Sigma_{j,k}^{\bar{Z}_{n}}|\geq C(n\gamma)^{1/(2\alpha)}n^{-1/2}\log n\Big{)}\leq\exp(-C_{1}\log n).

Then

(max1j,kN|Σj,kα¯neΣj,kZ¯n|C(nγ)1/(2α)n1/2logn)N2exp{C1logn}.\displaystyle\mathbb{P}\Big{(}\max_{1\leq j,k\leq N}|\Sigma_{j,k}^{\bar{\alpha}^{e}_{n}}-\Sigma_{j,k}^{\bar{Z}_{n}}|\geq C(n\gamma)^{1/(2\alpha)}n^{-1/2}\log n\Big{)}\leq N^{2}\exp\{-C_{1}\log n\}.

Consequently, applying the Gaussian comparison inequality of [42] (which bounds the Kolmogorov distance between the maxima of two Gaussian vectors in terms of the maximal entrywise difference of their covariance matrices) to \bar{\bm{\alpha}}^{e}_{n}\mid\mathcal{D}_{n} and \bar{\bm{Z}}_{n}, we have, with probability at least 1-\exp\{-C\log n\},

supν|(max1jNα¯ne(tj)ν)(max1jNZ¯n(tj)ν)|((nγ)1/αn1)1/6(logn)1/3(logN)2/3.\sup_{\nu\in\mathbb{R}}\Big{|}\mathbb{P}^{*}\Big{(}\max_{1\leq j\leq N}\bar{\alpha}^{e}_{n}(t_{j})\leq\nu\Big{)}-\mathbb{P}\Big{(}\max_{1\leq j\leq N}\bar{Z}_{n}(t_{j})\leq\nu\Big{)}\Big{|}\preceq\big{(}(n\gamma)^{1/\alpha}n^{-1}\big{)}^{1/6}(\log n)^{1/3}(\log N)^{2/3}.

Proof of Lemma 8.5

Proof.

Define \alpha_{i,j}^{e}=e_{i}\cdot g_{j}(i,X_{i},\epsilon_{i}) with g_{j}(i,X_{i},\epsilon_{i})=\frac{1}{\sqrt{(n\gamma)^{1/\alpha}}}\epsilon_{i}\cdot\Omega_{n,i}(t_{j})=\frac{1}{\sqrt{(n\gamma)^{1/\alpha}}}\sum_{\nu=1}^{\infty}\big{(}1-(1-\gamma\mu_{\nu})^{n-i}\big{)}\phi_{\nu}(X_{i})\phi_{\nu}(t_{j})\epsilon_{i}. We have \mathbb{E}^{*}(\alpha_{i,j}^{e}\cdot\alpha_{i,\ell}^{e})=g_{j}(i,X_{i},\epsilon_{i})g_{\ell}(i,X_{i},\epsilon_{i}) and \mathbb{E}^{*}(\alpha_{i,j}^{e}\cdot\alpha_{k,j}^{e})=0 for i\neq k. Define \bm{\alpha}_{i}^{e}=(\alpha_{i,1}^{e},\dots,\alpha_{i,N}^{e})^{\top}; then \bm{\alpha}_{i}^{e} and \bm{\alpha}_{k}^{e} are independent for i\neq k and i,k=1,\dots,n. Let \bar{\bm{\alpha}}_{n}^{e}=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\bm{\alpha}_{i}^{e}=\big{(}\bar{\alpha}_{n}^{e}(t_{1}),\dots,\bar{\alpha}_{n}^{e}(t_{N})\big{)}^{\top} with \bar{\alpha}_{n}^{e}(t_{j})=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\alpha^{e}_{i,j}=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}e_{i}\cdot g_{j}(i,X_{i},\epsilon_{i}) for j=1,\dots,N.

Similarly, denote \alpha_{i,j}^{b}=(w_{i}-1)\cdot g_{j}(i,X_{i},\epsilon_{i}) and \bm{\alpha}_{i}^{b}=(\alpha_{i,1}^{b},\dots,\alpha_{i,N}^{b})^{\top}. Then we have \mathbb{E}^{*}(\alpha_{i,j}^{b}\cdot\alpha_{i,\ell}^{b})=g_{j}(i,X_{i},\epsilon_{i})g_{\ell}(i,X_{i},\epsilon_{i}), and \mathbb{E}^{*}(\alpha_{i,j}^{b}\cdot\alpha_{k,j}^{b})=0 for i\neq k. Denote \bar{\bm{\alpha}}_{n}^{b}=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\bm{\alpha}_{i}^{b}=\big{(}\bar{\alpha}_{n}^{b}(t_{1}),\dots,\bar{\alpha}_{n}^{b}(t_{N})\big{)}^{\top} with \bar{\alpha}_{n}^{b}(t_{j})=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\alpha^{b}_{i,j}.

The proof of Lemma 8.5 follows the proof of Lemma 8.3. We adopt the notation and follow the proof of Lemma 8.3 step by step, with only the following changes: (1) replacing \bar{\bm{\alpha}}_{n} with \bar{\bm{\alpha}}_{n}^{b}; (2) replacing \bar{\bm{Z}}_{n} with \bar{\bm{\alpha}}_{n}^{e}; (3) replacing the probability \mathbb{P}(\cdot) and expectation \mathbb{E}(\cdot) with the conditional probability \mathbb{P}^{*}(\cdot)=\mathbb{P}(\cdot\mid\mathcal{D}_{n}) and conditional expectation \mathbb{E}^{*}(\cdot)=\mathbb{E}(\cdot\mid\mathcal{D}_{n}). Then equation (8.34) is adapted to

|J3|1n3/232(C3ψn3+6C2qψn2+6C1q2ψn)(i=1n(𝔼{max1kN|αi,kb|3}+𝔼{max1kN|αi,ke|3}))|J_{3}|\leq\frac{1}{n^{3/2}}32(C_{3}\psi_{n}^{3}+6C_{2}q\psi_{n}^{2}+6C_{1}q^{2}\psi_{n})\Big{(}\sum_{i=1}^{n}(\mathbb{E}^{*}\{\max_{1\leq k\leq N}|\alpha^{b}_{i,k}|^{3}\}+\mathbb{E}^{*}\{\max_{1\leq k\leq N}|\alpha^{e}_{i,k}|^{3}\})\Big{)} (8.37)

Note that

𝔼max1kN|αi,kb|3\displaystyle\mathbb{E}^{*}\max_{1\leq k\leq N}|\alpha^{b}_{i,k}|^{3}\leq 1(nγ)3/(2α)max1kN|ϵiΩn,i(tk)|3𝔼|wi1|3|ϵi|3(nγ)3/(2α),\displaystyle\frac{1}{(n\gamma)^{3/(2\alpha)}}\max_{1\leq k\leq N}|\epsilon_{i}\cdot\Omega_{n,i}(t_{k})|^{3}\cdot\mathbb{E}|w_{i}-1|^{3}\lesssim|\epsilon_{i}|^{3}(n\gamma)^{3/(2\alpha)},

where the last inequality follows from the bound \max\limits_{1\leq k\leq N}|\Omega_{n,i}(t_{k})|^{3}\leq c_{\phi}^{6}(n\gamma)^{3/\alpha} established in the proof of Lemma 8.3. Then, with probability at least 1-n^{-1}, we have

1n3/2i=1n𝔼(max1kN|αi,kb|3)\displaystyle\frac{1}{n^{3/2}}\sum_{i=1}^{n}\mathbb{E}^{*}\big{(}\max_{1\leq k\leq N}|\alpha^{b}_{i,k}|^{3}\big{)}\leq n1/2(nγ)3/(2α)1ni=1n|ϵi|3Cn1/2(nγ)3/(2α).\displaystyle n^{-1/2}(n\gamma)^{3/(2\alpha)}\frac{1}{n}\sum_{i=1}^{n}|\epsilon_{i}|^{3}\leq Cn^{-1/2}(n\gamma)^{3/(2\alpha)}.

Similarly, with probability at least 1n11-n^{-1}, 1n3/2i=1n𝔼(max1kN|αi,ke|3)Cn1/2(nγ)3/(2α),\frac{1}{n^{3/2}}\sum_{i=1}^{n}\mathbb{E}^{*}\big{(}\max_{1\leq k\leq N}|\alpha^{e}_{i,k}|^{3}\big{)}\leq Cn^{-1/2}(n\gamma)^{3/(2\alpha)}, where CC is a constant independent of nn. Then we have with probability at least 12n11-2n^{-1},

J3\displaystyle J_{3}\leq C(C3ψn3+6C2qψn2+6C1q2ψn)n1/2(nγ)3/(2α).\displaystyle C(C_{3}\psi_{n}^{3}+6C_{2}q\psi_{n}^{2}+6C_{1}q^{2}\psi_{n})n^{-1/2}(n\gamma)^{3/(2\alpha)}.

Therefore, following the proof of Lemma 8.3, we have

|(max1kNα¯nb(tk)ζ)(max1kNα¯ne(tk)ζ)|\displaystyle|\mathbb{P}^{*}(\max_{1\leq k\leq N}\bar{\alpha}_{n}^{b}(t_{k})\leq\zeta)-\mathbb{P}^{*}(\max_{1\leq k\leq N}\bar{\alpha}_{n}^{e}(t_{k})\leq\zeta)|
\displaystyle\leq C0(C3ψn3+C2qψn2+6C1q2ψn)n1/2(nγ)3/(2α)+C′′(ψn1+q1logN)(1+2logN).\displaystyle C_{0}(C_{3}\psi_{n}^{3}+C_{2}q\psi_{n}^{2}+6C_{1}q^{2}\psi_{n})n^{-1/2}(n\gamma)^{3/(2\alpha)}+C^{{}^{\prime\prime}}(\psi_{n}^{-1}+q^{-1}\log N)(1+\sqrt{2\log N}).

Consequently, taking \psi_{n}=q=\big{(}n(n\gamma)^{-3/\alpha}\big{)}^{1/8}, we have, with probability at least 1-4/n,

supζ|(max1kNα¯nb(tk)ζ)(max1kNα¯ne(tk)ζ)|(logN)3/2(n(nγ)3/α)1/8.\sup_{\zeta\in\mathbb{R}}\Big{|}\mathbb{P}^{*}(\max_{1\leq k\leq N}\bar{\alpha}_{n}^{b}(t_{k})\leq\zeta)-\mathbb{P}^{*}(\max_{1\leq k\leq N}\bar{\alpha}_{n}^{e}(t_{k})\leq\zeta)\Big{|}\leq\frac{(\log N)^{3/2}}{\big{(}n(n\gamma)^{-3/\alpha}\big{)}^{1/8}}.
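As an illustrative numerical companion to Lemma 8.5 (optional, and not part of the proof), the Python sketch below fixes one realization of \mathcal{D}_{n} and compares by Monte Carlo the conditional distributions of \max_{k}\bar{\alpha}_{n}^{b}(t_{k}) and \max_{k}\bar{\alpha}_{n}^{e}(t_{k}). The spectral model (\mu_{\nu}=\nu^{-\alpha}, cosine basis, truncated expansion) and the exponential(1) multipliers are assumptions made for illustration only.

import numpy as np

rng = np.random.default_rng(2)
alpha, n, N, nu_max, B = 3.0, 500, 20, 200, 2000
gamma = n ** (-1.0 / 3.0)
t = np.linspace(0.0, 1.0, N)
X = rng.uniform(0.0, 1.0, n)
eps = rng.standard_normal(n)                     # one fixed realization of the data D_n

mu = np.arange(1, nu_max + 1, dtype=float) ** (-alpha)

def phi(x):
    # cosine basis: phi_1 = 1, phi_nu(x) = sqrt(2) cos((nu - 1) pi x)
    nu = np.arange(nu_max)
    out = np.sqrt(2.0) * np.cos(np.pi * np.outer(np.atleast_1d(x), nu))
    out[:, 0] = 1.0
    return out

damp = 1.0 - (1.0 - gamma * mu) ** (n - np.arange(1, n + 1))[:, None]
Omega = (damp * phi(X)) @ phi(t).T               # Omega_{n,i}(t_k), shape (n, N)
G = eps[:, None] * Omega / np.sqrt(n * (n * gamma) ** (1.0 / alpha))
# G[i, k] = g_{i,k} / sqrt(n), so that alpha_bar^b(t_k) = sum_i (w_i - 1) G[i, k]
#                            and alpha_bar^e(t_k) = sum_i  e_i      G[i, k]

max_b = np.empty(B)
max_e = np.empty(B)
for b in range(B):
    w = rng.exponential(1.0, n)                  # bootstrap multipliers, mean one
    e = rng.standard_normal(n)                   # Gaussian multipliers
    max_b[b] = ((w - 1.0) @ G).max()
    max_e[b] = (e @ G).max()

for p in (0.5, 0.9, 0.95):                       # the two conditional laws should be close
    print(p, np.quantile(max_b, p), np.quantile(max_e, p))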

References

  • [1] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22:400–407, 1951.
  • [2] David Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering, Ithaca, NY, 1988.
  • [3] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 161–168. Curran Associates, Inc., 2008.
  • [4] Anton J. Kleywegt, Alexander Shapiro, and Tito Homem de Mello. The sample average approximation method for stochastic discrete optimization. SIAM Journal on Optimization, 12(2):479–502, 2002.
  • [5] Sujin Kim, Raghu Pasupathy, and Shane G. Henderson. A guide to sample average approximation. In Handbook of Simulation Optimization, pages 207–243. Springer, 2015.
  • [6] Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404, 1950.
  • [7] Grace Wahba. Spline Models for Observational Data. Society for Industrial and Applied Mathematics, 1990.
  • [8] Vladimir Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34:2593–2656, 2006.
  • [9] Shahar Mendelson. Geometric parameters of kernel machines. In International Conference on Computational Learning Theory, pages 29–43. Springer, 2002.
  • [10] Yun Yang, Mert Pilanci, and Martin J. Wainwright. Randomized sketches for kernels: Fast and optimal nonparametric regression. The Annals of Statistics, 45(3):991–1023, 2017.
  • [11] Craig Saunders, Alexander Gammerman, and Volodya Vovk. Ridge regression learning algorithm in dual variables. In International Conference on Machine Learning, pages 515–521. Springer, 1998.
  • [12] Aymeric Dieuleveut and Francis Bach. Nonparametric stochastic approximation with large step-sizes. The Annals of Statistics, 44(4):1363–1399, 2016.
  • [13] Léon Bottou. Online learning and stochastic approximations. Online Learning in Neural Networks, 17(9):142, 1998.
  • [14] Quoc V. Le, Jiquan Ngiam, Adam Coates, Abhik Lahiri, Bobby Prochnow, and Andrew Y. Ng. On optimization methods for deep learning. In Proceedings of the 28th International Conference on Machine Learning, pages 265–272, 2011.
  • [15] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In 19th International Conference on Computational Statistics, pages 177–186. Springer, 2010.
  • [16] Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. Advances in Neural Information Processing Systems, 32, 2019.
  • [17] Patrick Cheridito, Arnulf Jentzen, and Florian Rossmannek. Non-convergence of stochastic gradient descent in the training of deep neural networks. Journal of Complexity, 64:101540, 2021.
  • [18] Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
  • [19] Yixin Fang, Jinfeng Xu, and Lei Yang. Online bootstrap confidence intervals for the stochastic gradient descent estimator. Journal of Machine Learning Research, 19(78):1–21, 2018.
  • [20] Xi Chen, Jason D Lee, Xin T Tong, and Yichen Zhang. Statistical inference for model parameters in stochastic gradient descent. The Annals of Statistics, 48(1):251–273, 2020.
  • [21] Yu Nesterov and J-Ph Vial. Confidence level solutions for stochastic programming. Automatica, 44(6):1559–1568, 2008.
  • [22] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
  • [23] Weijie J Su and Yuancheng Zhu. Higrad: Uncertainty quantification for online learning and stochastic approximation. Journal of Machine Learning Research, 24(124):1–53, 2023.
  • [24] Bradley Efron and Robert J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall/CRC, 1993.
  • [25] Thomas J Diciccio and Joseph P Romano. A review of bootstrap confidence intervals. Journal of the Royal Statistical Society: Series B (Methodological), 50(3):338–354, 1988.
  • [26] Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Gaussian approximation of suprema of empirical processes. The Annals of Statistics, 42(4):1564–1597, 2014.
  • [27] Alain Berlinet and Christine Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011.
  • [28] Chong Gu. Smoothing spline ANOVA models, volume 297. Springer, 2013.
  • [29] Shahar Mendelson and Joseph Neeman. Regularization in kernel learning. The Annals of Statistics, 38(1):526–565, 2010.
  • [30] Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. The Journal of Machine Learning Research, 18(1):714–751, 2017.
  • [31] Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
  • [32] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, volume 20, 2007.
  • [33] Meimei Liu, Zuofeng Shang, and Yun Yang. Supplementary: Scalable statistical inference in non-parametric least squares, 2023. Supplementary material.
  • [34] Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Empirical and multiplier bootstraps for suprema of empirical processes of increasing complexity, and related Gaussian couplings. Stochastic Processes and their Applications, 126(12):3632–3651, 2016.
  • [35] Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Anti-concentration and honest, adaptive confidence bands. The Annals of Statistics, 42(5):1787–1818, 2014.
  • [36] Michael H Neumann and Jörg Polzehl. Simultaneous bootstrap confidence bands in nonparametric regression. Journal of Nonparametric Statistics, 9(4):307–333, 1998.
  • [37] Timothy B Armstrong and Michal Kolesár. Simple and honest confidence intervals in nonparametric regression. Quantitative Economics, 11(1):1–39, 2020.
  • [38] Grace Wahba. Bayesian “confidence intervals” for the cross-validated smoothing spline. Journal of the Royal Statistical Society: Series B (Methodological), 45(1):133–150, 1983.
  • [39] Grace Wahba. Improper priors, spline smoothing and the problem of guarding against model errors in regression. Journal of the Royal Statistical Society Series B: Statistical Methodology, 40(3):364–372, 1978.
  • [40] Yuedong Wang and Grace Wahba. Bootstrap confidence intervals for smoothing splines and their comparison to bayesian confidence intervals. Journal of Statistical Computation and Simulation, 51(2-4):263–279, 1995.
  • [41] Bradley Efron. The jackknife, the bootstrap and other resampling plans. SIAM, 1982.
  • [42] Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. The Annals of Statistics, 41(6):2786–2819, 2013.