Scalable Statistical Inference in Non-parametric Least Squares
Abstract
Stochastic approximation (SA) is a powerful and scalable computational method for iteratively estimating the solution of optimization problems in the presence of randomness, and is particularly well-suited for large-scale and streaming-data settings. In this work, we propose a theoretical framework for SA applied to non-parametric least squares in reproducing kernel Hilbert spaces (RKHS), enabling online statistical inference in non-parametric regression models. We achieve this by constructing asymptotically valid pointwise (and simultaneous) confidence intervals (bands) for local (and global) inference of the nonlinear regression function, by applying an online multiplier bootstrap approach to a functional stochastic gradient descent (SGD) algorithm in the RKHS. Our main theoretical contributions are a unified framework for characterizing the non-asymptotic behavior of the functional SGD estimator and a proof of the consistency of the multiplier bootstrap method. The proof techniques involve the development of a higher-order expansion of the functional SGD estimator under the supremum norm metric and the Gaussian approximation of suprema of weighted and non-identically distributed empirical processes. Our theory specifically reveals an interesting relationship between the tuning of the SGD step sizes for estimation and the accuracy of uncertainty quantification.
1 Introduction
Stochastic approximation (SA) [1, 2, 3] is a class of iterative stochastic algorithms for solving the stochastic optimization problem $\min_{\theta \in \Theta} \mathbb{E}_{\zeta}[\ell(\theta; \zeta)]$, where $\ell$ is a loss function, $\zeta$ denotes the internal random variable, and $\Theta$ is the domain of the loss function. Statistical inference, such as parameter estimation, can be viewed as a special case of stochastic optimization where the goal is to estimate the minimizer $\theta^\ast$ of the expected loss function based on a finite number of i.i.d. observations $\zeta_1, \ldots, \zeta_n$. Classical estimation procedures based on minimizing an empirical version of the loss correspond to the sample average approximation (SAA) [4, 5] for solving the stochastic optimization problem. However, directly minimizing the empirical loss with massive data is computationally wasteful in both time and space, and may pose numerical challenges. For example, in applications involving streaming data, where new and dynamic observations are generated on a continuous basis, it may not be necessary or feasible to store all historical data. Instead, stochastic gradient descent (SGD), or the Robbins-Monro type SA algorithm [1], is a scalable approximation algorithm for parameter estimation with constant per-iteration time and space complexity. SGD can be viewed as a stochastic version of the gradient descent method that uses a noisy gradient, such as $\nabla_\theta \ell(\theta; \zeta_i)$ based on a single observation $\zeta_i$, to replace the true gradient $\nabla_\theta \mathbb{E}_{\zeta}[\ell(\theta; \zeta)]$. In this work, we explore the use of SA for statistical inference in infinite-dimensional models where $\Theta$ is a function space or, more precisely, in solving non-parametric least squares in reproducing kernel Hilbert spaces (RKHS).
Consider the standard random-design non-parametric regression model
$$y_i = f^\ast(x_i) + \epsilon_i, \qquad \epsilon_i \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2), \qquad i = 1, 2, \ldots \qquad (1.1)$$
with $x_i$ denoting the $i$-th copy of the random covariate $X$, $y_i$ the $i$-th copy of the response $Y$, and $f^\ast$ the unknown regression function, assumed to lie in a reproducing kernel Hilbert space (RKHS, [6, 7]) $\mathbb{H}$, to be estimated. For simplicity, we assume that the covariate domain $\mathcal{X}$ is the unit cube $[0,1]^d$. Since $f^\ast$ minimizes the population-level expected squared error loss objective $\mathbb{E}[(Y - f(X))^2]$ over all functions $f$, one can adopt the SAA approach to estimate $f^\ast$ by minimizing a penalized sample-level squared error loss objective. Given a sample of size $n$, a commonly used SAA approach for estimating $f^\ast$ is kernel ridge regression (KRR). KRR incorporates a penalty term that depends on the norm associated with the RKHS $\mathbb{H}$. Although the KRR estimator enjoys many attractive statistical properties [8, 9, 10], its computational complexity ($O(n^3)$ time and $O(n^2)$ space in a direct implementation) hinders its practicality in large-scale problems [11]. In this work, we instead consider an SA-type approach for directly minimizing the expected loss functional over the infinite-dimensional RKHS. By operating SGD in this non-parametric setting (see Section 2.2 for details), the resulting algorithm achieves $O(n^2)$ time complexity and $O(n)$ space complexity. In a recent study [12], the authors demonstrate that the online estimator of $f^\ast$ resulting from the SGD achieves optimal rates of convergence for a variety of kernel classes. It is interesting to note that since the functional gradient is defined with respect to the RKHS norm $\|\cdot\|_{\mathbb{H}}$, the functional SGD implicitly induces an algorithmic regularization due to "early stopping" in the RKHS, which is controlled by the accumulated step sizes. Therefore, with a proper step size decaying scheme, no explicit regularization is needed to achieve optimal convergence rates.
The aim of this research is to take a step further by constructing a new inferential framework for quantifying the estimation uncertainty in the SA procedure. This will be achieved through the construction of pointwise confidence intervals and simultaneous confidence bands for the functional SGD estimator of $f^\ast$. Previous SGD algorithms and their variants, such as those discussed in [3, 13, 14, 15, 16, 17], are mainly utilized to solve finite-dimensional parametric learning problems with a root-$n$ convergence rate. In the parametric setting, asymptotic properties of estimators arising from SGD, such as consistency and asymptotic normality, have been well established in the literature; for example, see [12, 18, 19, 20]. However, the problem of uncertainty quantification for functional SGD estimators in non-parametric settings is rarely addressed in the literature.
In the parametric setting, several methods have been proposed to conduct uncertainty quantification in SGD. [21, 22] appear to be among the first to formally characterize the magnitudes of random fluctuations in SA; however, their notion of confidence level is based on the large deviation properties of the solution and can be quite conservative. More recently, [19] proposes applying a multiplier bootstrap method for the construction of SGD confidence intervals, whose asymptotic confidence level is shown to exactly match the nominal level. [20] proposes a batch mean method to estimate the asymptotic covariance matrix of the estimator based on a single SGD trajectory. Due to the limited information from a single run of SGD, the best achievable coverage error of their confidence interval is of strictly larger order than that achieved by the multiplier bootstrap. [23] proposes a different method called HiGrad, which constructs a hierarchical tree of SGD estimators and uses the outputs in its leaves to construct a confidence interval.
In this work, we develop a multiplier bootstrap method for uncertainty quantification in SA for solving online non-parametric least squares. Bootstrap methods [24, 25] are widely used in statistics to estimate the sampling distribution of a statistic for uncertainty quantification. Traditional resampling-based bootstrap methods are unsuitable for streaming data inference, as the resampling step necessitates storing all historical data, which contradicts the objective of maintaining constant space and time complexity in online learning. Instead, we extend the parametric online multiplier bootstrap method of [19] to the non-parametric setting. We achieve this by employing a perturbed stochastic functional gradient, represented as an element of the RKHS and evaluated upon the arrival of each new covariate-response pair $(x_i, y_i)$, to capture the stochastic fluctuation arising from the random streaming data.
To theoretically justify the use of the proposed multiplier bootstrap method, we make two main contributions. First, we build a novel theoretical framework to characterize the non-asymptotic behavior of the infinite-dimensional functional SGD estimator by expanding it into higher-order terms under the supremum norm metric. This framework enables us to perform local inference to construct pointwise confidence intervals for $f^\ast$ and global inference to construct a simultaneous confidence band. Second, we demonstrate the consistency of the multiplier bootstrap method by proving that the perturbation injected into the stochastic functional gradient accurately mimics the randomness pattern in the online estimation procedure, so that the conditional law of the bootstrapped functional SGD estimator given the data asymptotically coincides with the sampling law of the functional SGD estimator. Our proof is non-trivial and contains several major improvements that refine the best (to our knowledge) convergence analysis of SGD for non-parametric least squares in [12], and it also advances the consistency analysis of the multiplier bootstrap in a non-parametric setting. Concretely, in [12], the authors derive the convergence rate of the functional SGD estimator relative to the $L_2$ norm metric. Their theory only concerns the convergence rate of the estimator; hence, their proof involves decomposing the SGD recursion into a leading first-order recursion and the remaining higher-order recursions, and bounding their $L_2$ norms by directly bounding their expectations. In comparison, our analysis for statistical inference in online non-parametric regression requires a functional central limit theorem type result and calls for several substantial refinements in proof techniques.
Our first improvement is to refine the SGD recursion analysis by using the stronger supremum norm metric. This enables us to accurately characterize the stochastic fluctuation of the functional estimator uniformly across all locations. As a result, we can study the coverage probability of simultaneous confidence bands in our subsequent inference tasks. Analyzing the supremum norm convergence is significantly more intricate than analyzing the $L_2$ convergence. In the proof, we introduce an augmented RKHS different from $\mathbb{H}$ as a bridge, in order to better align its induced norm with the supremum norm metric; see Remark 3.2 or equation (6.1) in Section 6 for further details. Additionally, we have to employ uniform laws of large numbers and leverage ideas from empirical process theory to uniformly control certain stochastic terms that emerge in the expansions. Our second improvement comes from the need to characterize the large-sample distributional limit of the functional SGD estimator. Using the same recursion decomposition, we must now derive a high-probability supremum norm bound for all orders of the recursion and determine the large-sample distributional limit of the leading term in the expansion. It is worth noting that the second-order recursion is the most complicated and challenging one to analyze. This recursion requires specialized treatment that involves substantially more effort than the remaining higher-order recursions. A loose analysis, obtained by directly converting an $L_2$ norm bound into a supremum norm bound using the reproducing kernel property of the original RKHS (which suffices for bounding the higher-order recursions), might result in a bound whose order is comparable to that of the leading term. This is where we introduce an augmented RKHS and directly analyze the supremum norm using empirical process tools.
Last but not least, in order to analyze the distributional limits of the leading bias and variance terms appearing in the expansion of the functional SGD estimator, we develop new tools by extending the recent technique of Gaussian approximation of suprema of empirical processes [26] from equally weighted sums to weighted sums. This extension is important and unique to the analysis of functional SGD, since earlier-arriving data points have larger weights in the leading bias and variance terms than later-arriving data points; see Remark 3.3 for more discussion. Towards the analysis of our bootstrap procedure, we further develop Gaussian approximation bounds for multiplier bootstraps for suprema of weighted and non-identically distributed empirical processes, which can be used to control the Kolmogorov distance between the sampling distribution of a pointwise evaluation of the functional SGD estimator (local inference), or of its supremum norm (global inference), and its bootstrap counterpart. Our results also elucidate the interplay between early stopping (controlled by the step size) for optimal estimation and the accuracy of uncertainty quantification.
The rest of the article is organized as follows. In Section 2, we introduce the background on RKHSs and functional stochastic gradient descent algorithms in the RKHS; in Section 3, we establish the distributional convergence of SGD for non-parametric least squares; in Section 4, we develop scalable uncertainty quantification in the RKHS via multiplier bootstrapped SGD estimators; Section 5 includes extensive numerical studies demonstrating the performance of the proposed SGD inference; Section 6 presents a proof sketch highlighting some important technical details and key steps; Section 7 provides an overview and future directions for our work; and Section 8 includes some key proofs of the theorems.
Notation: In this paper, we use $c, c', C, C', \ldots$ to denote generic positive constants whose values may change from line to line but are independent of everything else. We use $\|f\|_\infty$ to denote the supremum norm of a function $f$, defined as $\|f\|_\infty = \sup_{x \in \mathcal{X}} |f(x)|$, where $\mathcal{X}$ is the domain of $f$. The notations $\lesssim$ and $\gtrsim$ denote inequalities up to a constant multiple; we write $a \asymp b$ when both $a \lesssim b$ and $a \gtrsim b$ hold. For $a \in \mathbb{R}$, let $\lfloor a \rfloor$ denote the largest integer smaller than or equal to $a$. For two self-adjoint operators $A$ and $B$, we write $A \preceq B$ if $B - A$ is positive semi-definite.
2 Background and Problem Formulation
We begin by introducing some background on reproducing kernel Hilbert space (RKHS) and functional stochastic gradient descent algorithms in the RKHS.
2.1 Reproducing kernel Hilbert spaces
To describe the structure of the regression function $f^\ast$ in the non-parametric regression model (1.1), we adopt the standard framework of a reproducing kernel Hilbert space (RKHS, [7, 27, 28]) by assuming $f^\ast$ to belong to an RKHS $\mathbb{H}$. Let $\mathbb{P}_X$ denote the marginal distribution of the random design $X$, and let $L_2(\mathbb{P}_X)$ denote the space of all square-integrable functions over $\mathcal{X}$ with respect to $\mathbb{P}_X$. Briefly speaking, an RKHS $\mathbb{H}$ is a Hilbert space of functions defined over a set $\mathcal{X}$, equipped with an inner product $\langle \cdot, \cdot \rangle_{\mathbb{H}}$, such that for any $x \in \mathcal{X}$, the evaluation functional at $x$, defined by $F_x(f) = f(x)$, is a continuous linear functional on the RKHS. Uniquely associated with $\mathbb{H}$ is a positive-definite function $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, called the reproducing kernel. The key property of the reproducing kernel is the reproducing property: the evaluation functional can be represented by the reproducing kernel function, so that $f(x) = \langle f, K(x, \cdot) \rangle_{\mathbb{H}}$ for all $f \in \mathbb{H}$. According to Mercer's theorem [6], the kernel function $K$ has the following spectral decomposition:
$$K(x, x') = \sum_{\nu=1}^{\infty} \mu_\nu\, \phi_\nu(x)\, \phi_\nu(x'), \qquad (2.1)$$
where the convergence is absolute and uniform on $\mathcal{X} \times \mathcal{X}$. Here, $\{\mu_\nu\}_{\nu \ge 1}$ is the non-increasing sequence of eigenvalues, and $\{\phi_\nu\}_{\nu \ge 1}$ are the corresponding eigenfunctions, forming an orthonormal basis in $L_2(\mathbb{P}_X)$ with the following property: for any $\nu, \nu' \ge 1$,
$$\langle \phi_\nu, \phi_{\nu'} \rangle_{L_2(\mathbb{P}_X)} = \delta_{\nu\nu'} \quad \text{and} \quad \langle \phi_\nu, \phi_{\nu'} \rangle_{\mathbb{H}} = \delta_{\nu\nu'} / \mu_\nu,$$
where $\delta_{\nu\nu'} = 1$ if $\nu = \nu'$ and $\delta_{\nu\nu'} = 0$ otherwise. Moreover, any $f \in \mathbb{H}$ can be decomposed into $f = \sum_{\nu} \theta_\nu \phi_\nu$ with $\theta_\nu = \langle f, \phi_\nu \rangle_{L_2(\mathbb{P}_X)}$, and its RKHS norm can be computed via $\|f\|_{\mathbb{H}}^2 = \sum_{\nu} \theta_\nu^2 / \mu_\nu$.
We introduce some technical conditions on the reproducing kernel in terms of its spectral decomposition.
Assumption A1.
The eigenfunctions $\{\phi_\nu\}_{\nu \ge 1}$ of $K$ are uniformly bounded on $\mathcal{X}$, i.e., there exists a finite constant $c_\phi$ such that $\sup_{\nu \ge 1} \|\phi_\nu\|_\infty \le c_\phi$. Moreover, they satisfy the Lipschitz condition $|\phi_\nu(x) - \phi_\nu(x')| \le c_L\, \nu\, \|x - x'\|$ for any $x, x' \in \mathcal{X}$, where $c_L$ is a finite constant.
Assumption A2.
The eigenvalues $\{\mu_\nu\}_{\nu \ge 1}$ of $K$ satisfy $\mu_\nu \asymp \nu^{-2\alpha}$ for some $\alpha > 1/2$.
The uniform boundedness condition in Assumption A1 is common in the literature [29]. Assumption A2 assumes the kernel to have polynomially decaying eigenvalues. Assumptions A1-A2 together also imply that the kernel function is uniformly bounded, since $K(x, x) = \sum_{\nu} \mu_\nu \phi_\nu^2(x) \le c_\phi^2 \sum_{\nu} \mu_\nu < \infty$. One special class of kernels satisfying Assumptions A1-A2 is composed of translation-invariant kernels $K(x, x') = g(x - x')$ for some even function $g$ of period one. In fact, by utilizing the Fourier series expansion of the kernel function $g$, we observe that the eigenfunctions of the corresponding kernel operator are the trigonometric functions
$$\phi_1(x) = 1, \qquad \phi_{2\nu}(x) = \sqrt{2}\cos(2\pi\nu x), \qquad \phi_{2\nu+1}(x) = \sqrt{2}\sin(2\pi\nu x), \qquad \nu = 1, 2, \ldots,$$
on $[0, 1]$. It is easy to see that the constants in Assumption A1 can be chosen so that these eigenfunctions satisfy it. Although we primarily consider kernels with polynomially decaying eigenvalues for the sake of clarity in this paper, it is worth mentioning that our theory extends to other kernel classes, such as squared exponential kernels and polynomial kernels [30].
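To make the spectral picture concrete, the following sketch builds a translation-invariant kernel on $[0,1]$ directly from a truncated Mercer expansion with trigonometric eigenfunctions and polynomially decaying eigenvalues; the decay exponent, truncation level, and evaluation grid are illustrative assumptions rather than quantities fixed by the text.

```python
import numpy as np

# A translation-invariant kernel on [0, 1] assembled from a truncated Mercer
# sum with eigenvalues mu_nu = nu^(-2*alpha). alpha = 1.0 and num_freq = 50
# are illustrative choices.

def trig_features(x, num_freq):
    """Orthonormal basis in L2[0,1]: 1, sqrt(2)cos(2*pi*nu*x), sqrt(2)sin(2*pi*nu*x)."""
    feats = [np.ones_like(x)]
    for nu in range(1, num_freq + 1):
        feats.append(np.sqrt(2) * np.cos(2 * np.pi * nu * x))
        feats.append(np.sqrt(2) * np.sin(2 * np.pi * nu * x))
    return np.stack(feats, axis=1)                 # shape (len(x), 2*num_freq + 1)

def periodic_kernel(x, y, alpha=1.0, num_freq=50):
    """Truncated Mercer sum K(x, y) = sum_nu mu_nu * phi_nu(x) * phi_nu(y)."""
    mu = [1.0]                                     # eigenvalue of the constant eigenfunction
    for nu in range(1, num_freq + 1):
        mu += [float(nu) ** (-2 * alpha)] * 2      # cos and sin share an eigenvalue
    mu = np.array(mu)
    return trig_features(x, num_freq) @ (mu[:, None] * trig_features(y, num_freq).T)

x = np.linspace(0.0, 1.0, 200)
Kmat = periodic_kernel(x, x)
# Translation invariance: K(x, y) depends only on (x - y) mod 1.
print(np.allclose(Kmat[0, 10], Kmat[5, 15]))       # -> True
```

Because each cosine-sine pair at frequency $\nu$ shares the eigenvalue $\nu^{-2\alpha}$, the sum collapses to a function of $x - y$, which is the translation-invariance property claimed above.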
2.2 Stochastic gradient descent in RKHS
To motivate functional SGD in an RKHS, we first review SGD in the Euclidean setting for minimizing the expected loss function $L(\theta) = \mathbb{E}_Z[\ell(\theta; Z)]$, where $\theta$ is the parameter of interest, $\ell$ is the loss function, and $Z$ denotes a generic random sample, e.g. $Z = (X, Y)$ in the non-parametric regression setting (1.1). By a first-order Taylor expansion, one can locally approximate $L(\theta + \delta)$ for any small deviation $\delta$ by $L(\theta) + \langle \nabla L(\theta), \delta \rangle$, where $\nabla L(\theta)$ denotes the gradient (vector) of $L$ evaluated at $\theta$. The gradient therefore encodes the (infinitesimal) steepest descent direction of $L$ at $\theta$, leading to the following gradient descent (GD) updating formula:
$$\theta_i = \theta_{i-1} - \gamma_i\, \nabla L(\theta_{i-1}), \qquad i = 1, 2, \ldots,$$
starting from some initial value $\theta_0$, where $\gamma_i$ is the step size (also called the learning rate) at iteration $i$. GD requires the computation of the full gradient $\nabla L(\theta)$, which is unavailable due to the unknown data distribution of $Z$. In stochastic approximation, SGD takes a more efficient approach by using an unbiased estimate of the gradient, $\nabla_\theta \ell(\theta; Z_i)$ based on the single sample $Z_i$, to substitute for $\nabla L(\theta)$ in the updating rule. Accordingly, the SGD updating rule takes the form
$$\theta_i = \theta_{i-1} - \gamma_i\, \nabla_\theta \ell(\theta_{i-1}; Z_i).$$
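As a concrete instance of the update above, the following sketch runs SGD with Polyak averaging on a streaming linear regression with squared loss; the true parameter, noise level, and step-size constants are illustrative assumptions.

```python
import numpy as np

# SGD for the squared loss l(theta; (x, y)) = 0.5 * (y - x @ theta)^2 on a
# simulated data stream. The model theta_star, noise level, and step sizes
# are illustrative assumptions, not values from the text.

rng = np.random.default_rng(0)
theta_star = np.array([1.0, -2.0])
theta = np.zeros(2)                   # initial value theta_0
theta_bar = np.zeros(2)               # Polyak average of the trajectory

n = 20000
for i in range(1, n + 1):
    x = rng.normal(size=2)                       # one streaming covariate
    y = x @ theta_star + 0.1 * rng.normal()      # one streaming response
    grad = (x @ theta - y) * x                   # unbiased estimate of the true gradient
    gamma_i = 0.3 / np.sqrt(i)                   # polynomially decaying step size
    theta = theta - gamma_i * grad               # SGD update
    theta_bar += (theta - theta_bar) / i         # running Polyak average

print(np.round(theta_bar, 2))                    # close to theta_star = [1.0, -2.0]
```

The running-average line shows why averaging needs no extra storage: only the current average and the current iterate are kept, mirroring the constant per-iteration space complexity emphasized in the introduction.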
Let us now extend the concept of SGD from minimizing an expected loss function in Euclidean space to minimizing an expected loss functional in function space. Here, for concreteness, we develop SGD for minimizing the expected squared error loss $\mathcal{L}(f) = \frac{1}{2}\, \mathbb{E}\big[(Y - f(X))^2\big]$ over an RKHS $\mathbb{H}$ equipped with inner product $\langle \cdot, \cdot \rangle_{\mathbb{H}}$. Let us begin by extending the concept of the "gradient". We identify the gradient (operator) of the functional $\mathcal{L}$ as a steepest descent "direction" in $\mathbb{H}$ through the following first-order "Taylor expansion":
$$\mathcal{L}(f + \Delta f) = \mathcal{L}(f) + \langle \nabla \mathcal{L}(f),\, \Delta f \rangle_{\mathbb{H}} + O\big(\|\Delta f\|_{\mathbb{H}}^2\big).$$
We obtain after some simple algebra that
$$\mathcal{L}(f + \Delta f) - \mathcal{L}(f) = \mathbb{E}\big[(f(X) - Y)\, \Delta f(X)\big] + \tfrac{1}{2}\, \mathbb{E}\big[\Delta f(X)^2\big].$$
Now, by using the reproducing property $\Delta f(X) = \langle \Delta f, K_X \rangle_{\mathbb{H}}$ for any $\Delta f \in \mathbb{H}$, where $K_x = K(x, \cdot)$, we further obtain
$$\mathcal{L}(f + \Delta f) - \mathcal{L}(f) = \big\langle\, \mathbb{E}\big[(f(X) - Y)\, K_X\big],\; \Delta f \,\big\rangle_{\mathbb{H}} + O\big(\|\Delta f\|_{\mathbb{H}}^2\big). \qquad (2.2)$$
Here, we have used the fact that, by the Cauchy-Schwarz inequality, $\mathbb{E}[\Delta f(X)^2] \le \sup_{x} K(x, x)\, \|\Delta f\|_{\mathbb{H}}^2$, since Assumptions A1-A2 together with Mercer's expansion (2.1) imply $K$ to be uniformly bounded as long as $\alpha > 1/2$. From equation (2.2), we can identify the gradient of $\mathcal{L}$ at $f$ as
$$\nabla \mathcal{L}(f) = \mathbb{E}\big[(f(X) - Y)\, K_X\big].$$
Throughout the rest of the paper, we will refer to $\nabla \mathcal{L}(f)$ above as the RKHS gradient of the functional $\mathcal{L}$ at $f$.
Upon the arrival of the $i$-th data point $(x_i, y_i)$, we can form an unbiased estimator of the RKHS gradient at $f$ as $(f(x_i) - y_i)\, K_{x_i}$. This leads to the following SGD in the RKHS for solving non-parametric least squares: for a given initial estimate $\hat{f}_0$, the SGD recursively updates the estimate of $f^\ast$ upon the arrival of each data point as
$$\hat{f}_i = \hat{f}_{i-1} - \gamma_i\, \big(\hat{f}_{i-1}(x_i) - y_i\big)\, K_{x_i}. \qquad (2.3)$$
By utilizing the reproducing property, the above iterative updating formula can be rewritten as
$$\hat{f}_i = \big(I - \gamma_i\, K_{x_i} \otimes K_{x_i}\big)\, \hat{f}_{i-1} + \gamma_i\, y_i\, K_{x_i}, \qquad (2.4)$$
where $I$ denotes the identity map on $\mathbb{H}$, and $\otimes$ is the tensor product operator defined through $(f \otimes g)\, h = \langle g, h \rangle_{\mathbb{H}}\, f$ for all $f, g, h \in \mathbb{H}$. Formula (2.3) is more straightforward for practical implementation, while formula (2.4) provides a more tractable expression that will facilitate our theoretical analysis. Following [2] and [31], we consider the so-called Polyak averaging scheme to further improve the estimation accuracy by averaging over the entire updating trajectory; i.e., we use $\bar{f}_n = \frac{1}{n}\sum_{i=1}^{n} \hat{f}_i$ as the final functional SGD estimator of $f^\ast$ based on a dataset of sample size $n$. Note that this averaged estimator can be efficiently computed without storing all past estimators by using the recursive updating formula $\bar{f}_i = \frac{i-1}{i}\, \bar{f}_{i-1} + \frac{1}{i}\, \hat{f}_i$ on the fly. We will refer to the above SGD as functional SGD in order to differentiate it from SGD in Euclidean space, and to $\bar{f}_n$ as the functional SGD estimator (using $n$ samples). Throughout the remainder of the paper, we consider a zero initialization, $\hat{f}_0 = 0$, without loss of generality.
In functional SGD with total sample size (time horizon) $n$, the only adjustable component is the step size scheme $\{\gamma_i\}_{i=1}^{n}$, which is crucial for achieving fast convergence and accurate estimation (c.f. Remark 3.1). We examine two common schemes [15, 32]: (1) the constant step size scheme, where $\gamma_i \equiv \gamma(n)$ depends only on the total sample size $n$; (2) the non-constant step size scheme, where $\gamma_i = \gamma_0\, i^{-\zeta}$ decays polynomially in $i$ for some exponent $\zeta \in (0, 1)$ and constant $\gamma_0 > 0$. While the constant step scheme is more amenable to theoretical analysis, it suffers from two notable drawbacks: (1) it assumes prior knowledge of the sample size $n$, which is typically unavailable in streaming data scenarios, and (2) the optimal estimation error is only achieved at the $n$-th iteration, leading to suboptimal performance before that time point. In contrast, the non-constant step size scheme, despite significantly complicating our theoretical analysis, overcomes the aforementioned limitations and leads to a truly online algorithm that achieves rate-optimal estimation at any intermediate time point (c.f. Theorem 3.1). Due to this characteristic, we will also refer to the non-constant step size scheme as the online scheme.
Although functional SGD operates in the infinite-dimensional RKHS, it can be implemented using a finite-dimensional representation enabled by the kernel trick. Concretely, upon the arrival of the $i$-th observation $(x_i, y_i)$, we can express the time-$i$ intermediate estimator as $\hat{f}_i = \sum_{j=1}^{i} \beta_j\, K_{x_j}$ due to equation (2.3) and the zero initialization condition, where only the last entry in the coefficient vector needs to be updated:
$$\beta_i = \gamma_i\, \Big( y_i - \sum_{j=1}^{i-1} \beta_j\, K(x_j, x_i) \Big).$$
Note that the computational complexity at time $i$ is $O(i)$. Correspondingly, the functional SGD estimator at time $n$ can be computed through $\bar{f}_n = \sum_{j=1}^{n} \bar{\beta}_j\, K_{x_j}$, where (as can be proved by induction)
$$\bar{\beta}_j = \frac{n - j + 1}{n}\, \beta_j, \qquad j = 1, \ldots, n.$$
Consequently, the overall time complexity of the resulting algorithm is $O(n^2)$, and the space complexity is $O(n)$.
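As an illustration of this finite-dimensional implementation, the following sketch runs functional SGD with the kernel trick and forms the Polyak-averaged estimator from the coefficients; the Gaussian kernel, target function, noise level, and step-size constants are illustrative assumptions.

```python
import numpy as np

# Functional SGD via the kernel trick: the time-i iterate is
# f_i = sum_{j<=i} beta_j K(x_j, .), with only the newest coefficient updated,
# and the averaged estimator reweights beta_j by (n - j + 1)/n. The Gaussian
# kernel, target sin function, noise level and step sizes are illustrative.

def kernel(a, b, bandwidth=0.2):
    return np.exp(-(a - b) ** 2 / (2 * bandwidth ** 2))

def functional_sgd(x, y, gamma0=1.0, zeta=0.5):
    """One pass over the stream; returns the representer coefficients beta."""
    n = len(x)
    beta = np.zeros(n)
    for i in range(n):
        pred = beta[:i] @ kernel(x[:i], x[i])      # f_{i-1}(x_i): O(i) work
        gamma_i = gamma0 * (i + 1) ** (-zeta)      # polynomially decaying step
        beta[i] = gamma_i * (y[i] - pred)          # only the last entry is new
    return beta

def averaged_predict(beta, x_train, x_new):
    """Polyak-averaged estimator bar f_n = sum_j ((n - j + 1)/n) beta_j K(x_j, .)."""
    n = len(beta)
    w = (n - np.arange(n)) / n
    return (w * beta) @ kernel(x_train[:, None], x_new[None, :])

rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(size=n)
f_star = lambda t: np.sin(2 * np.pi * t)
y = f_star(x) + 0.3 * rng.normal(size=n)

beta = functional_sgd(x, y)
grid = np.linspace(0.05, 0.95, 50)
fit = averaged_predict(beta, x, grid)
print(np.max(np.abs(fit - f_star(grid))))          # sup-norm error on the grid
```

The inner evaluation at step $i$ touches $i$ stored coefficients, so a single pass costs $O(n^2)$ time and $O(n)$ space, matching the complexity statement above.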
2.3 Problem formulation
Our objective is to develop online inference for the non-parametric regression function $f^\ast$ based on the functional SGD estimator $\bar{f}_n$. Specifically, we aim to construct pointwise confidence intervals (local inference) for $f^\ast(x)$ at any fixed $x \in \mathcal{X}$, and a simultaneous confidence band (global inference) for $f^\ast$. We require these intervals and bands to be asymptotically valid, meaning that the coverage probabilities, i.e., the probability of $f^\ast(x)$ falling within the interval and the probability of $f^\ast$ lying within the band, converge to their common nominal level as $n \to \infty$.
The coverage probability analysis of these intervals and bands requires us to examine and prove the distributional convergence of two random quantities (with appropriate rescaling) based on the functional SGD estimator $\bar{f}_n$: the pointwise difference $\bar{f}_n(x) - f^\ast(x)$ for fixed $x$, and the supremum norm $\|\bar{f}_n - f^\ast\|_\infty$. In particular, the appropriate rescaling choice determines a precise convergence rate of $\bar{f}_n$ towards $f^\ast$ under the supremum norm metric. The characterization of the convergence rate of a non-parametric regression estimator under the supremum norm metric is a challenging and important problem in its own right. We note that the distribution of the supremum norm, after a proper re-scaling, behaves asymptotically like the supremum norm of a Gaussian process, which is not practically feasible to estimate. Therefore, for inference purposes, it is not necessary to explicitly characterize this distributional limit; instead, we will prove a bootstrap consistency by showing that the Kolmogorov distance between the sampling distribution of this supremum norm and its bootstrap counterpart converges to zero as $n \to \infty$.
In our theoretical development to address these problems, we will utilize a recursive expansion of the functional SGD updating formula to construct a higher-order expansion of $\bar{f}_n - f^\ast$ under the supremum norm metric. Building upon this expansion, we will establish in Section 3 the distributional convergence of the two aforementioned random quantities and characterize their limiting distributions, with an explicit representation of the limiting variance of $\bar{f}_n(x) - f^\ast(x)$ in the large-sample setting. However, these limiting distributions and variances depend on the spectral decomposition of the kernel $K$, the marginal distribution $\mathbb{P}_X$ of the design variable, and the unknown noise variance $\sigma^2$, which are either inaccessible or computationally expensive to evaluate in an online learning scenario. To overcome this challenge, we will propose a scalable bootstrap-based inference method in Section 4, enabling efficient online inference for $f^\ast$.
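To preview the bootstrap-based method proposed in Section 4, the following sketch runs several perturbed functional SGD trajectories alongside the original one, each scaling its stochastic gradient by an i.i.d. multiplier with unit mean and unit variance, and reads off a pointwise interval from the bootstrap quantiles. The multiplier distribution, kernel, and step sizes are illustrative assumptions, not the paper's prescriptions.

```python
import numpy as np

def rbf(a, b, bw=0.2):
    return np.exp(-(a - b) ** 2 / (2 * bw ** 2))

def sgd_with_bootstrap(x, y, B=100, gamma0=1.0, zeta=0.5, seed=0):
    """Original trajectory plus B multiplier-bootstrap trajectories on one stream."""
    rng = np.random.default_rng(seed)
    n = len(x)
    beta = np.zeros(n)               # original coefficients
    beta_b = np.zeros((B, n))        # bootstrap coefficients
    for i in range(n):
        k_hist = rbf(x[:i], x[i])
        gamma_i = gamma0 * (i + 1) ** (-zeta)
        beta[i] = gamma_i * (y[i] - beta[:i] @ k_hist)
        w = rng.exponential(1.0, size=B)            # multipliers: mean 1, variance 1
        beta_b[:, i] = gamma_i * w * (y[i] - beta_b[:, :i] @ k_hist)
    return beta, beta_b

def avg_eval(coef, x_train, x0):
    """Evaluate the Polyak-averaged estimator(s) at a single point x0."""
    n = x_train.shape[0]
    wts = (n - np.arange(n)) / n
    return (coef * wts) @ rbf(x_train, x0)

rng = np.random.default_rng(3)
n = 1000
x = rng.uniform(size=n)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=n)
beta, beta_b = sgd_with_bootstrap(x, y)

x0 = 0.5
f_hat = avg_eval(beta, x, x0)
f_boot = avg_eval(beta_b, x, x0)                    # B bootstrap evaluations at x0
lo, hi = np.quantile(f_boot - f_hat, [0.025, 0.975])
print(f"95% pointwise interval at x0 = 0.5: [{f_hat - hi:.3f}, {f_hat - lo:.3f}]")
```

Because all trajectories share the single data stream and maintain only their own coefficient vectors, the procedure preserves the online character of the algorithm: no resampling of historical data is ever required.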
3 Finite-Sample Analysis of Functional SGD Estimator
In this section, we start by deriving a higher-order expansion of $\bar{f}_n - f^\ast$ under the supremum norm metric. We then proceed to establish the distributional convergence of $\bar{f}_n(x) - f^\ast(x)$ for any fixed $x$ by characterizing the leading term in the expansion. These results will be useful for motivating our online local and global inference for $f^\ast$ in the following section.
3.1 Higher-order expansion under supremum norm
We begin by decomposing the functional SGD update of $\bar{f}_n$ into two leading recursive formulas and a higher-order remainder term. This decomposition allows us to distinguish between the deterministic term responsible for the estimation bias and the stochastic fluctuation term contributing to the estimation variance. Concretely, writing $\eta_i = \hat{f}_i - f^\ast$ and plugging $y_i = f^\ast(x_i) + \epsilon_i$ into the recursive updating formula (2.4), we obtain
$$\eta_i = \big(I - \gamma_i\, K_{x_i} \otimes K_{x_i}\big)\, \eta_{i-1} + \gamma_i\, \epsilon_i\, K_{x_i}, \qquad \eta_0 = -f^\ast. \qquad (3.1)$$
Let $\Sigma = \mathbb{E}[K_X \otimes K_X]$ denote the population-level covariance operator, so that for any $f, g \in \mathbb{H}$, we have $\langle f, \Sigma g \rangle_{\mathbb{H}} = \mathbb{E}[f(X)\, g(X)]$. Now we recursively define the leading bias term $\eta_i^{\mathrm{bias}}$ through
$$\eta_i^{\mathrm{bias}} = (I - \gamma_i \Sigma)\, \eta_{i-1}^{\mathrm{bias}}, \qquad \eta_0^{\mathrm{bias}} = -f^\ast, \qquad (3.2)$$
which collects the leading deterministic component in (3.1), and the leading noise term $\eta_i^{\mathrm{noise}}$ through
$$\eta_i^{\mathrm{noise}} = (I - \gamma_i \Sigma)\, \eta_{i-1}^{\mathrm{noise}} + \gamma_i\, \epsilon_i\, K_{x_i}, \qquad \eta_0^{\mathrm{noise}} = 0, \qquad (3.3)$$
which collects the leading stochastic fluctuation component in (3.1), so that we have the following decomposition for the recursion:
$$\eta_i = \eta_i^{\mathrm{bias}} + \eta_i^{\mathrm{noise}} + \mathrm{remainder}_i. \qquad (3.4)$$
Correspondingly, we define $\bar{\eta}_n^{\mathrm{bias}} = \frac{1}{n}\sum_{i=1}^{n} \eta_i^{\mathrm{bias}}$ and $\bar{\eta}_n^{\mathrm{noise}} = \frac{1}{n}\sum_{i=1}^{n} \eta_i^{\mathrm{noise}}$ as the leading bias and noise terms, respectively, in the functional SGD estimator $\bar{f}_n$ (after averaging). The following Theorem 3.1 presents finite-sample bounds for the two leading terms and the remainder term associated with $\bar{f}_n - f^\ast$ under the supremum norm metric. The results indicate that the remainder term is of strictly higher order (in terms of dependence on $n$) compared to the two leading terms, validating the term "leading" for them.
Theorem 3.1 (Finite-sample error bound under supremum norm).
Suppose that the kernel $K$ satisfies Assumptions A1-A2, and that $f^\ast \in \mathbb{H}$ satisfies a boundedness condition.
-
1.
(constant step size) Assume that the step size satisfies , then we have
where are constants independent of . Furthermore, assume that the step size , we have
where the randomness is with respect to the randomness in .
-
2.
(non-constant step size) Assume the step size to satisfy for some , then we have
where are constants independent of . For the special choice of , we have
A proof of this theorem is based on a higher-order recursion expansion and a careful supremum norm analysis of the recursive formula; see Remark 3.2 and the proof sketch in Section 6. The detailed proof is given in [33].
Remark 3.1.
As demonstrated in Theorem 3.1, the selection of the step size (the constant $\gamma$, or the decay exponent in the non-constant case) in the SGD estimator entails a trade-off between bias and variance: tilting the scheme toward a smaller accumulated step size increases bias while reducing variance, and vice versa. This trade-off can be optimized by choosing the (optimal) step size. This is why we specifically focus on this particular choice in the non-constant step size setting in the theorem, which also significantly simplifies the proof. Interestingly, according to Theorem 3.1, the step size scheme in functional SGD plays a role similar to that of the regularization parameter in regularization-based approaches in preventing overfitting. To see this, let us consider classic kernel ridge regression (KRR), where the estimator is constructed as
$$\hat{f}_{\mathrm{KRR}} = \arg\min_{f \in \mathbb{H}} \Big\{ \frac{1}{n}\sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2 + \lambda\, \|f\|_{\mathbb{H}}^2 \Big\},$$
where $\lambda$ serves as the regularization parameter to avoid overfitting. It can be shown (e.g., [10]) that the squared bias of $\hat{f}_{\mathrm{KRR}}$ has an order of $\lambda$, while the variance has an order of $N(\lambda)/n$, where $N(\lambda) = \sum_{\nu} \mu_\nu / (\mu_\nu + \lambda)$ represents the effective dimension of the model and is of order $\lambda^{-1/(2\alpha)}$ under Assumption A2. In comparison, the squared bias and variance of the functional SGD estimator are of order $(n\gamma)^{-1}$ and $(n\gamma)^{1/(2\alpha)}/n$, respectively. Therefore, $(n\gamma)^{-1}$ and $(n\gamma)^{1/(2\alpha)}$ respectively play the same role as the regularization parameter $\lambda$ and the effective dimension $N(\lambda)$ in KRR. More generally, a step size scheme $\{\gamma_i\}$ corresponds to an effective regularization parameter of the order $\big(\sum_{i=1}^{n} \gamma_i\big)^{-1}$. Note that the accumulated step size $\sum_{i=1}^{n} \gamma_i$ can be interpreted as the total path length of the functional SGD algorithm. This total path length determines the early stopping of the algorithm, effectively controlling the complexity of the learned model and preventing overfitting.
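For comparison with the implicit regularization of functional SGD, the sketch below computes the KRR estimator by solving its normal equations and evaluates an empirical effective dimension from the eigenvalues of the kernel matrix; the kernel, sample size, and value of lambda are illustrative assumptions.

```python
import numpy as np

# KRR comparison: solve (K + n*lambda*I) c = y for the representer coefficients,
# then compute an empirical effective dimension N(lambda) from the eigenvalues
# of K / n. The Gaussian kernel and lambda = 0.01 are illustrative choices.

def rbf_gram(x, bw=0.2):
    return np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * bw ** 2))

rng = np.random.default_rng(7)
n = 400
x = rng.uniform(size=n)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=n)

K = rbf_gram(x)
lam = 1e-2                                    # regularization parameter lambda
c = np.linalg.solve(K + n * lam * np.eye(n), y)
f_hat = K @ c                                 # fitted values at the design points

mu = np.linalg.eigvalsh(K / n)                # empirical eigenvalues of the kernel operator
eff_dim = float(np.sum(mu / (mu + lam)))      # effective dimension N(lambda)
mse = float(np.mean((f_hat - np.sin(2 * np.pi * x)) ** 2))
print(round(eff_dim, 1), round(mse, 4))
```

Solving the linear system costs $O(n^3)$ time and $O(n^2)$ memory, in contrast with the $O(n^2)$ time and $O(n)$ space of functional SGD, which is precisely the scalability gap motivating the SA approach.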
Remark 3.2.
The higher-order recursion expansion and the supremum norm bound in the theorem provide finer insight into the distributional behavior of $\bar{f}_n - f^\ast$ and pave the way for inference. That is, we only need to focus on the leading noise recursive term for statistical inference. In our proof of bounding the supremum norm of the remainder term in equation (3.4), we further decompose the remainder into two parts: the bias remainder and the noise remainder. Note that a loose analysis of the noise remainder, obtained by directly converting an $L_2$ norm bound into a supremum norm bound using the reproducing kernel property of the original RKHS $\mathbb{H}$, would result in a bound whose order is comparable to that of the leading term. This motivates us to introduce an augmented RKHS equipped with a modified kernel and norm. The augmented RKHS norm weakens the impact of high-frequency components compared to the $\mathbb{H}$-norm, and its induced norm turns out to be better aligned with the functional supremum norm in our context. As a result, the supremum norm of a function can be bounded, up to constants, by its augmented RKHS norm. In particular, a supremum norm bound based on controlling the augmented norm, with an appropriate choice of the augmentation level, can be substantially better than one based on the $\mathbb{H}$-norm; see Section 6 and Section 8.2 for further details.
As we discussed in Section 2.3, for inference purposes it is not necessary to explicitly characterize the distributional limit of the supremum norm $\|\bar{f}_n - f^\ast\|_\infty$; instead, we will prove a bootstrap consistency by showing that the Kolmogorov distance between the sampling distribution of this supremum norm and its bootstrap counterpart converges to zero as $n \to \infty$. However, the pointwise limit of $\bar{f}_n(x) - f^\ast(x)$ for fixed $x$ has an easy characterization. Therefore, we present the pointwise convergence limit and use it to discuss the impact of online estimation in the non-parametric regression model in the following subsection.
3.2 Pointwise distributional convergence
According to Theorem 3.1, the large-sample behavior of the functional SGD estimator is completely determined by the two leading processes: the bias term and the noise term. According to (3.2), under the constant step size $\gamma_i \equiv \gamma$, the leading bias term has the explicit expression
$$\bar{\eta}_n^{\mathrm{bias}} = -\frac{1}{n}\sum_{i=1}^{n} (I - \gamma\Sigma)^i f^\ast, \qquad (3.5)$$
and the leading noise term is
$$\bar{\eta}_n^{\mathrm{noise}} = \frac{\gamma}{n}\sum_{j=1}^{n} \Big[ \sum_{i=j}^{n} (I - \gamma\Sigma)^{\,i-j} \Big]\, \epsilon_j\, K_{x_j}. \qquad (3.6)$$
For each fixed $x$, conditioning on the design $\{x_i\}_{i=1}^{n}$, the leading noise term evaluated at $x$ is a weighted average of independent and centered normally distributed random variables. This representation enables us to identify the limiting distribution of $\bar{f}_n(x) - f^\ast(x)$ (this subsection) and to conduct local inference (i.e. pointwise confidence intervals) by a bootstrap method (next section). Under Assumption A2, the weight associated with the $j$-th observation pair $(x_j, y_j)$ decreases in $j$. This diminishing impact trend is inherent to online learning, as later observations tend to have a smaller influence than earlier observations. This characteristic is radically different from offline estimation settings, where all observations contribute equally to the final estimator, and it changes the asymptotic variance appearing in Theorem 3.2.
Furthermore, the entire leading noise process can be viewed as a weighted and non-identically distributed empirical process indexed by the spatial location. This characterization enables us to conduct global inference (i.e. simultaneous confidence bands) for non-parametric online learning by borrowing and extending the recent developments [26, 34, 35] on Gaussian approximation and multiplier bootstraps for suprema of (equally weighted and identically distributed) empirical processes, which will be the main focus of the next section.
In the following Theorem 3.2, we prove, by analyzing the leading noise term $\bar{\eta}_n^{\mathrm{noise}}$, a finite-sample upper bound on the Kolmogorov distance between the sampling distribution of the suitably centered and scaled error $\bar{f}_n(x) - f^\ast(x)$ and the standard normal distribution (i.e. the supremum distance between the two cumulative distribution functions) for any fixed $x$.
Theorem 3.2 (Pointwise convergence).
Assume that the kernel satisfies Assumptions A1-A2.
-
1.
(Constant step size) Consider the step size with . For any fixed , we have
where . Here, the bias term has an explicit expression as given in (3.5), and the (limiting) variance is
-
2.
(Non-constant step size) Consider the step size for . For any fixed , we have
Here, the bias term takes an explicit expression as , and the variance is
Theorem 3.2 establishes that the sampling distribution of at any fixed can be approximated by a normal distribution . According to Theorem 3.1, the bias has the order of while the variance has the order of ; Theorem 3.2 also implies that the minimax convergence rate of estimating can be achieved with , which attains an optimal bias-variance tradeoff. In practice, the bias term can be suppressed by applying an undersmoothing technique; see Remark 4.3 for details.
Remark 3.3.
From the theorem, we see that the (limiting) variance is precisely the variance of the scaled leading noise at , that is, ; and has the same order in both the constant and non-constant cases. The contribution of each data point to the variance, however, differs between the two cases. Concretely, in the constant step size case, let be the vector of variation, where represents the contribution to from the -th arriving observation . According to equation (3.6), and is of order in the observation index , which decreases monotonically to nearly as grows to . In comparison, in the online (non-constant) step case, we denote by the vector of variation, with being the contribution from the -th observation. A careful calculation shows that , which has order and decreases more slowly than in the constant step size case. This means that the non-constant step scheme yields a more balanced weighted average over the entire dataset, which tends to lead to a smaller asymptotic variance.
Figure 1 compares the individual variation contributions for the constant and non-constant step cases. We keep the total step size budget the same in both cases (which also makes the two leading bias terms roughly equal); that is, we choose the constant in the non-constant step size so that with being the constant step size. The data index is plotted on the horizontal axis of Figure 1 (A), with the variation contribution on the vertical axis. As we can see, the variation contribution from each observation decreases as observations arrive later in both cases; however, the pattern is flatter in the non-constant step case. Figure 1 (B) is a violin plot visualizing the distributions of the components in and . Specifically, the variation among (depicted by the short blue interval) is smaller in the non-constant case, suggesting reduced fluctuation in individual variation for this setting. As detailed in Section 5, our numerical analysis further confirms that a non-constant learning rate outperforms a constant one (e.g., Figure 2). An interesting direction for future research is to identify an optimal learning-rate decay scheme by minimizing the variance as a function of , and to determine whether such a scheme results in an equal contribution from each observation; this is beyond the scope of this paper.
Figure 1: Individual variation contributions under constant and non-constant step sizes (panel A) and their distributions (panel B).
Remark 3.4.
The Kolmogorov distance bound between the sampling distribution of and the standard normal distribution depends on the step size and sample size . In particular, is the remainder bound stated in Theorem 3.1, which is negligible compared to when in the constant step size case. Consequently, a smaller or larger sample size leads to a smaller Kolmogorov distance. The same conclusion also applies to the non-constant step size case if we choose .
Although Theorem 3.2 explicitly characterizes the distribution of the SGD estimator, the expression for the standard deviation depends on the eigenvalues and eigenfunctions of , the underlying distribution of the design , and the unknown noise variance , which are typically unknown in practice. One approach is to use plug-in estimators for these unknown quantities, such as the empirical eigenvalues and eigenfunctions obtained through an SVD of the empirical kernel matrix , whose -th element is . However, computing these plug-in estimators requires access to all observed data points and has a computational complexity of , which undermines the sequential updating advantage of SGD. In the following section, we develop a scalable inference framework that uses multiplier-type bootstraps to generate randomly perturbed SGD estimators upon the arrival of each observation. This approach enables us to bypass the evaluation of when constructing confidence intervals.
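To make the scalability point concrete, here is a minimal sketch of the plug-in route (with an assumed Gaussian kernel and our own function names; the paper's kernel is generic): it must materialize the full empirical kernel matrix and eigendecompose it, which requires all data points at once rather than a single streaming pass.

```python
import numpy as np

def plugin_eigensystem(xs, bw=0.25):
    """Plug-in estimates of kernel eigenvalues/eigenvectors from the empirical
    kernel matrix. Needs all n points at once and at least quadratic memory,
    in contrast to the O(1)-per-step bootstrap approach of Section 4."""
    # Empirical kernel matrix K_ij = K(x_i, x_j) / n for an illustrative Gaussian kernel.
    K = np.exp(-(xs[:, None] - xs[None, :]) ** 2 / (2 * bw ** 2)) / len(xs)
    evals, evecs = np.linalg.eigh(K)        # full symmetric eigendecomposition
    return evals[::-1], evecs[:, ::-1]      # sorted from largest to smallest

xs = np.linspace(0.0, 1.0, 300)
evals, _ = plugin_eigensystem(xs)
```

The eigendecomposition alone is cubic in the sample size, which is why the multiplier bootstrap of the next section avoids this route entirely.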
4 Online Statistical Inference via Multiplier Bootstrap
In this section, we first propose a multiplier bootstrap method for inference based on the functional SGD estimator. After that, we study the theoretical properties of the proposed method, which serve as the cornerstone for proving bootstrap consistency for the local inference of constructing pointwise confidence intervals and the global inference of constructing simultaneous confidence bands. Finally, we describe the resulting online inference algorithm for non-parametric regression based on the functional SGD estimator.
4.1 Multiplier bootstrap for functional SGD
Recall that Theorem 3.1 provides a high-probability decomposition of the functional SGD estimator (relative to the supremum norm metric) into the following sum
where is the leading bias process defined in equation (3.5) and is the leading noise process defined in equation (3.6). Motivated by this result, we propose in this section a multiplier bootstrap method to mimic and capture the random fluctuation of this leading noise process , where recall that the term only depends on the -th design point , and the primary source of randomness in comes from the random noises , which are i.i.d. normally distributed under a standard non-parametric regression setting.
Our online inference approach is inspired by the multiplier bootstrap idea proposed in [19] for online inference of parametric models using SGD. Remarkably, we demonstrate that their development can be naturally adapted to enable online inference of non-parametric models based on functional SGD. The key idea is to perturb the stochastic gradient in the functional SGD by incorporating a random multiplier upon the arrival of each data point. Specifically, let , , denote a sequence of i.i.d. random bootstrap multipliers whose mean and variance are both equal to one. At time , with the observed data point , we use the randomly perturbed functional SGD updating formula:
(4.1) | ||||
which modifies equations (2.3) and (2.4) for functional SGD by multiplying the stochastic gradient by the random multiplier . We adopt the same zero initialization and call the (Polyak) averaged estimator the bootstrapped functional SGD estimator (with samples).
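As a concrete illustration, the sketch below gives one plausible implementation of the perturbed update (4.1) for scalar inputs. The Gaussian kernel, its bandwidth, and all function names are our illustrative assumptions; the paper's kernel only needs to satisfy Assumptions A1-A2.

```python
import numpy as np

def gauss_kernel(xs, x, bw=0.25):
    # Illustrative kernel choice; the paper's K is generic.
    return np.exp(-(xs - x) ** 2 / (2 * bw ** 2))

def functional_sgd(xs, ys, steps, multipliers=None):
    """One pass of (bootstrapped) functional SGD.

    Iterate: f_t = f_{t-1} + steps[t] * w_t * (y_t - f_{t-1}(x_t)) * K(x_t, .),
    stored through representer coefficients c so that f_t(x) = sum_i c[i] K(x_i, x).
    With multipliers w_t (mean = variance = 1) this is the perturbed update;
    with w_t = 1 it reduces to the plain functional SGD update.
    Returns the coefficients of the Polyak-averaged estimator.
    """
    n = len(xs)
    w = np.ones(n) if multipliers is None else multipliers
    coef = np.zeros(n)   # representer coefficients of the running iterate f_t
    cum = np.zeros(n)    # running sum of all iterates' coefficients
    for t in range(n):
        resid = ys[t] - coef @ gauss_kernel(xs, xs[t])   # y_t - f_{t-1}(x_t)
        coef[t] = steps[t] * w[t] * resid                # (perturbed) gradient step
        cum += coef
    return cum / n

def predict(avg_coef, xs, x):
    return avg_coef @ gauss_kernel(xs, x)
```

Drawing B independent multiplier sequences and running this update B times (in parallel) yields the B bootstrapped estimators used for inference in Section 4.3.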
4.2 Bootstrap consistency
Let us now proceed to derive a higher-order expansion of analogous to Section 3.1 and compare its leading terms with those associated with the original functional SGD estimator . Utilizing equation (4.1) and plugging in , we obtain the following expression:
Since has unit mean, we have the important identity . Similarly to equations (3.1)-(3.4), due to this key identity we can still recursively define the leading bootstrapped bias term through
(4.2) |
which coincides with the original leading bias term, i.e. ; and the leading bootstrapped noise term through
so that a similar decomposition as in equation (3.4) holds,
(4.3) |
Correspondingly, we define and as the leading bootstrapped bias and noise terms, respectively, in the bootstrapped functional SGD estimator.
Notice that also coincides with the original leading bias term , i.e. . Therefore, has the same explicit expression as equation (3.5); while the leading bootstrapped noise term has a slightly different expression that incorporates the bootstrap multipliers as
(4.4) |
where recall that is defined in equation (3.6) and only depends on . By taking the difference between and , we obtain
(4.5) |
This expression also takes the form of a weighted and non-identically distributed empirical process with “effective” noises . Since has unit mean and variance, these effective noises have the same first two moments as the original noises , suggesting that the difference , conditional on the data , captures the random pattern of the original leading noise term ; this leads to the so-called bootstrap consistency, formally stated in the theorem below.
Assumption A3.
For , the bootstrap multipliers are i.i.d. samples of a random variable that satisfies , and for all , with a constant .
One simple example that satisfies Assumption A3 is . A second example is a bounded random variable, such as a scaled and shifted uniform random variable on the interval . A popular choice in practice is a discrete random variable such that .
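These choices can be sketched as follows; the specific parameters (e.g., the two-point law on {0, 2} with equal probabilities) are common examples we assume for illustration, since any distribution with unit mean and unit variance satisfying Assumption A3 qualifies:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Three multiplier laws, each with mean 1 and variance 1 (Assumption A3):
w_normal  = rng.normal(1.0, 1.0, size=n)                     # N(1, 1)
w_uniform = rng.uniform(1 - 3 ** 0.5, 1 + 3 ** 0.5, size=n)  # bounded; var = (2*sqrt(3))**2 / 12 = 1
w_binary  = rng.choice([0.0, 2.0], size=n)                   # two-point; P(0) = P(2) = 1/2

for w in (w_normal, w_uniform, w_binary):
    assert abs(w.mean() - 1.0) < 0.02 and abs(w.var() - 1.0) < 0.03
```

The bounded and two-point choices also satisfy the moment-growth condition in A3 trivially, since they are almost surely bounded.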
Let denote the data of sample size , and let denote the conditional probability measure given . We first establish the bootstrap consistency for local inference of the leading noise term in the following Theorem 4.1.
Theorem 4.1 (Bootstrap consistency for local inference of leading noise term).
Assume that the kernel satisfies Assumptions A1-A2 and the multiplier weights satisfy Assumption A3.
-
1.
(Constant step size) Consider the step size with for some . Then for any , we have with probability at least ,
where are constants independent of .
-
2.
(Non-constant step size) Consider the step size , , for some . Then the following bound holds with probability at least ,
(4.6)
Remark 4.1.
Recall that from (3.6) and (4.5), we can express and . Theorem 3.2 shows that can be approximated by a normal distribution . To prove Theorem 4.1, we introduce an intermediate empirical process evaluated at , defined as , where the ’s are i.i.d. standard normal random variables, such that the (conditional) variance of matches the (conditional) variance of .
Theorem 4.2 (Bootstrap consistency for global inference of leading noise term).
Assume that the kernel satisfies Assumptions A1-A2 and the multiplier weights satisfy Assumption A3.
-
1.
(Constant step size) Consider the step size with for some . Then the following bound holds with probability at least (with respect to the randomness in data )
-
2.
(Non-constant step size) Consider the step size , , for some . Then the following bound holds with probability at least ,
Remark 4.2.
Theorem 4.2 demonstrates that the sampling distribution of can be closely approximated by the conditional distribution of given the data set . This theorem serves as the theoretical foundation for adopting the multiplier bootstrap method detailed in Section 4.1 for global inference. Recall that the optimal step size for achieving the minimax optimal estimation error is for the constant step size and for the non-constant step size (Theorem 3.2). To ensure that the Kolmogorov distance bound in Theorem 4.2 decays to as under these step sizes, we require . It is likely that our current Kolmogorov distance bound, which is dominated by an error term arising from applying the Gaussian approximation to analyze and through space discretization (see Section 6.2), can be substantially refined. We leave this improvement, which would consequently lead to a weaker requirement on , to future research.
Since the leading noise terms and contribute the primary source of randomness in the functional SGD estimator and its bootstrapped counterpart (Theorem 3.1), Theorem 4.2 implies the bootstrap consistency of statistical inference for based on the bootstrapped functional SGD. In particular, we present the following corollary, which establishes a high-probability supremum norm bound for the remainder term in the bootstrapped functional SGD decomposition (4.3). Such a bound further implies that the sampling distribution of can be effectively approximated by the conditional distribution of given the data . Recall that we use to denote the conditional probability measure given .
Corollary 4.3 (Bootstrap consistency for functional SGD inference).
Assume that the kernel satisfies Assumptions A1-A2 and the multiplier weights satisfy Assumption A3.
-
1.
(Constant step size) Consider the step size with for some . Then it holds with probability at least with respect to the randomness of that
(4.7) Furthermore, for , it holds with probability at least , that
where denotes the bias term, are constants.
-
2.
(Non-constant step size) Consider the step size for . Then it holds with probability at least that
Furthermore, it holds with probability at least that
Remark 4.3.
Corollary 4.3 suggests that a smaller step size (or ) and a larger sample size result in more accurate uncertainty quantification. As discussed in Section 4.2, the functional SGD estimator and its bootstrap counterpart share the same leading bias term, which eliminates the bias in the conditional distribution of given . However, the bias term still exists in the sampling distribution of . According to Theorem 3.1, this bias term can be bounded by with high probability, while the convergence rate of the leading noise term under the supremum norm metric is of order . Therefore, to make the bias term asymptotically negligible, we can adopt the common practice of “undersmoothing” [36, 37]. In our context, this means slightly enlarging the step size as (constant step size) or for (non-constant step size), where is any small positive constant.
4.3 Online inference algorithm
-
1.
Normal CI: , where .
-
2.
Percentile CI: , where and are the sample -th and -th quantile of .
As we demonstrated in Theorem 4.2, the sampling distribution of can be effectively approximated by the conditional distribution of given the data , using the bootstrapped functional SGD. This result provides a strong foundation for conducting online statistical inference based on the bootstrap. Specifically, we can run bootstrapped functional SGD in parallel, producing estimators for with
where are i.i.d. bootstrap weights satisfying Assumption A3. We then approximate the sampling distribution of by the empirical distribution of conditional on , and further construct pointwise confidence intervals and a simultaneous confidence band for . We can also use the empirical variance of to approximate the variance of . Based on these quantities, we can construct the pointwise confidence interval for at any fixed in two ways:
-
1.
Normal CI - given the sequence of bootstrapped estimators for , we calculate the variance as , and construct the confidence interval for as ;
-
2.
Percentile CI - given the sequence of bootstrapped estimators for , we calculate , and its -th and -th quantiles as and , then construct the CI for as .
To construct the simultaneous confidence band, we first choose a dense set of grid points ; then for each , we calculate to approximate . Accordingly, we obtain the following bootstrapped supremum norms:
(4.8) |
Denote the sample -th and the -th quantiles of (4.8) as and . Then we construct a confidence band for as .
Our online inference algorithm is computationally efficient, as it only requires one pass over the data, and the bootstrapped functional SGD can be computed in parallel. The detailed algorithm is summarized in Algorithm 1.
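A minimal sketch of the interval and band constructions above, assuming the averaged estimate and B bootstrapped curves are already available on a grid (the percentile interval here uses plain bootstrap quantiles, one common variant; all names are ours):

```python
import numpy as np
from statistics import NormalDist

def normal_ci(f_hat, boot, alpha=0.05):
    """Normal CI at a fixed x: f_hat +/- z_{1-alpha/2} * bootstrap standard error."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = boot.std(ddof=1)           # boot: length-B bootstrap draws at x
    return f_hat - z * se, f_hat + z * se

def percentile_ci(boot, alpha=0.05):
    """Percentile CI at a fixed x from the empirical bootstrap quantiles."""
    return tuple(np.quantile(boot, [alpha / 2, 1 - alpha / 2]))

def simultaneous_band(f_hat_grid, boot_grid, alpha=0.05):
    """Band of half-width q, where q is the (1-alpha) quantile of the
    bootstrapped supremum deviations over the grid, as in (4.8).

    f_hat_grid: (m,) averaged estimate on the grid; boot_grid: (B, m) curves.
    """
    sup_dev = np.max(np.abs(boot_grid - f_hat_grid), axis=1)
    q = np.quantile(sup_dev, 1 - alpha)
    return f_hat_grid - q, f_hat_grid + q
```

Because the sup-deviation quantile dominates any single-point quantile, the simultaneous band is wider than the pointwise intervals, as expected for global inference.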
5 Numerical Study
In this section, we test our proposed online inference approach via simulations. Concretely, we generate synthetic data in a streaming setting with a total sample size of . We use to represent the -th observed data point for . We evaluate the performance of our proposed method as described in Algorithm 1 for constructing confidence intervals for at for , and compare our method with three existing alternative approaches, which we refer to as “offline” methods. “Offline” methods involve calculating the confidence intervals after all data have been collected, up to the -th observation’s arrival, which necessitates refitting the model each time new data arrive. We also evaluate the coverage probabilities of the simultaneous confidence bands constructed in Algorithm 1. We first enumerate the compared offline confidence interval methods as follows:
-
(i)
Offline Bayesian confidence interval (Offline BA) proposed in [38]: According to [39], a smoothing spline estimator corresponds to a Bayesian procedure with a partially improper prior. Given this connection between smoothing splines and Bayes estimates, confidence intervals can be derived from the posterior covariance function of the estimate. In practice, we implement Offline BA using the “gss” R package [28].
-
(ii)
Offline bootstrap normal interval (Offline BN) proposed in [40]: Let and denote the estimates of and , respectively, obtained by minimizing (5.1) with as below.
(5.1) where is the roughness penalty and is the second derivative evaluated at . A bootstrap sample is generated from
where the are i.i.d. Gaussian white noise with variance . Based on the bootstrap sample, we calculate the bootstrap estimate as . Repeating this times, we obtain a sequence of offline bootstrap estimates . We estimate the variance of as . An offline normal bootstrap confidence interval for is then constructed as .
-
(iii)
Offline bootstrap percentile interval (Offline BP): We apply the same data bootstrapping procedure as in Offline BN, which produces the estimate based on the bootstrap sample. The confidence interval is then constructed using the percentile method suggested in [41]. Specifically, let and represent the -th quantile and the -th quantile of the empirical distribution of , respectively. A confidence interval for is then constructed as .
As increases, offline methods lead to a considerable increase in computational cost. For instance, Offline BA/BN theoretically has a total time complexity of order (with an cost at time ). In contrast, online bootstrap confidence intervals are computed sequentially as new data points become available, making them well-suited for streaming data settings. They have a theoretical complexity of at most (with an cost at time ). We examine both the normal CI and percentile CI, as outlined in Algorithm 1, when constructing the confidence interval.
We examine the effects of various step size schemes. Specifically, we consider a constant step size , where represents the total sample size at which the CIs are constructed, and an online step size for . A limitation of the constant step size method is its dependence on prior knowledge of the total time horizon ; consequently, the estimator is only rate-optimal at the -th step. We assess our proposed online bootstrap confidence intervals in four scenarios: (i) Online BNC, which uses a constant step size for the normal interval; (ii) Online BPC, which uses a constant step size for the percentile interval; (iii) Online BNN, which employs a non-constant step size for the normal interval; and (iv) Online BPN, which uses a non-constant step size for the percentile interval.
We generate our data as i.i.d. copies of the random pair , where is drawn from the uniform distribution on the interval , and . Here, is the unknown regression function to be estimated, and represents Gaussian white noise with variance . We consider the following three cases of , :
Case 1: | |||
Case 2: | |||
Case 3: |
Here, with denoting the beta function, and is the gamma function with when . Cases 2 and 3 are designed to mimic increasingly complex “truth” scenarios, similar to the settings in [38, 40].
We draw training data of size from these models. In our online approaches, we first use 500 data points to build an initial estimate and then employ SGD to derive online estimates from the 501st to the 3000th data point. Given that our framework is designed for online settings, we can construct the confidence band based on the datasets of size , , , , and , i.e., using the averaged estimators at . We repeat the data generation process times for each case. For each replicate, upon the arrival of a new data point, we apply the proposed multiplier bootstrap method for online inference, using bootstrap samples (i.e., in Algorithm 1) with bootstrap weight generated from a normal distribution with mean and standard deviation . We then construct confidence intervals based on Algorithm 1. Our results will show the coverage and distribution of the lengths of the confidence intervals built at , and .
Figure 2: Coverage (panels A1, B1, C1) and confidence interval lengths (panels A2, B2, C2) for Cases 1-3.
Figure 3: Simultaneous confidence bands for Cases 1-3 at increasing sample sizes.
As shown in Figure 2, the coverage of all methods approaches the predetermined level of as increases. The offline Bayesian method exhibits the lowest coverage of all; while it has the longest average confidence interval length in Cases 1-3, it also has the smallest variance in confidence interval lengths. The offline bootstrap-based methods demonstrate higher coverage and shorter average confidence interval lengths than the offline Bayesian method. The variance in confidence interval lengths for these bootstrap-based methods is larger, due to the bootstrap multiplier resampling procedure or the random step size used in our proposed online bootstrap procedures. As the sample size grows, the variance of the confidence interval length diminishes for all methods. Our online bootstrap procedure with a non-constant step size outperforms the others in both the average length and the variance of the confidence interval: it offers the shortest average length and the smallest variance compared with the Bayesian confidence interval, the offline bootstrap methods, and the online bootstrap procedure with a constant step size. Moreover, the online bootstrap method with a non-constant step size reaches the predetermined coverage level of more quickly than the other methods. We only tested our methods (Online BNN and Online BPN) with an increased at and , due to computational costs. As observed in Figure 2 (A1), (B1), and (C1), the coverage stabilizes at the predetermined level of . We also use our proposed online bootstrap method, as outlined in Algorithm 1, to construct a confidence band of level with a step size of at . As seen in Figure 3, the average width of the confidence band decreases as the sample size increases for Cases 1-3, and all bands cover the true function curve (the solid black curve), indicating that the accuracy of our confidence band estimates improves with a larger sample size.
Figure 4: (A) Cumulative computation time and (B) current computation time for the compared methods.
Finally, we compared the computational time of various methods in constructing confidence intervals on a computer workstation equipped with a 32-core 3.50 GHz AMD Threadripper Pro 3975WX CPU and 64GB RAM. We recorded the computational times as data points arrived and calculated the cumulative computational times up to for both offline and online algorithms. The normal and percentile Bootstrap methods displayed similar computational times, so we chose to report the computational time of the percentile bootstrap interval for both offline and online approaches. Despite leveraging parallel computing to accelerate the bootstrap procedures, the offline bootstrap algorithms still demanded significant computation time. This is attributed to the need to refit the model each time a new data point arrived, which substantially raises the computational expense. The computational complexity of offline methods for computing the estimate of at time is , leading to a cumulative computational complexity of order . Including the bootstrap cost, the total computational complexity at becomes , leading to a cumulative computational complexity of order . As shown in Figure 4, the cumulative computational time reaches approximately hours for the offline bootstrap method and around hours for the Bayesian bootstrap method. Conversely, the cumulative computational time for our proposed bootstrap method grows almost linearly with , and requires less than minutes up to . At , offline bootstrap methods take about seconds, and the Bayesian confidence interval necessitates roughly seconds to construct the confidence interval. Our proposed online bootstrap method requires fewer than seconds, demonstrating its potential for time-sensitive applications such as medical diagnosis and treatment, financial trading, and traffic management, where real-time decision-making is essential as data continuously flows in.
6 Proof Sketch of Main Results
In this section, we present sketched proofs for the expansion of the functional SGD estimator relative to the supremum norm (Theorem 3.1) and the bootstrap consistency for global inference (Theorem 4.2), while highlighting some important technical details and key steps.
6.1 Proof sketch for estimator expansion under supremum norm metric
Theorem 3.1 establishes a high-probability supremum norm bound for the higher-order expansion of the SGD estimator. This result is crucial for the inference framework, as it allows us to focus on the distributional behavior of the leading terms, given that the remainders are negligible. In the sketched proof, we denote . According to (3.1), we have
We split the recursion of into two finer recursions: a bias recursion for and a noise recursion for , such that , where
bias recursion: | |||
noise recursion: |
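By linearity, this split is exact at every iteration. As an illustration, the following finite-dimensional surrogate (a Euclidean least-squares recursion standing in for the RKHS one; the dimensions, step size, and constants are our illustrative choices) verifies the decomposition numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, gamma = 5, 200, 0.1
theta_star = rng.normal(size=d)   # surrogate "true" parameter

eta  = -theta_star.copy()   # full error recursion, eta_0 = 0 - theta_star
bias = -theta_star.copy()   # bias recursion: driven only by the initialization
nois = np.zeros(d)          # noise recursion: driven only by the noises eps_t
for _ in range(n):
    x = rng.normal(size=d) / d ** 0.5
    eps = rng.normal()
    P = np.eye(d) - gamma * np.outer(x, x)   # random contraction I - gamma * x x^T
    eta  = P @ eta + gamma * eps * x         # full recursion
    bias = P @ bias                          # noiseless part
    nois = P @ nois + gamma * eps * x        # zero-initialized noisy part

# exact decomposition: error = bias part + noise part at every step
assert np.allclose(eta, bias + nois)
```

The same linearity argument underlies the further decompositions of the bias and noise recursions described next.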
To proceed, we further decompose the bias recursion into two parts: (1) the leading bias recursion ; and (2) the remainder bias recursion as follows:
It is worth noting that the leading bias recursion essentially replaces by its expectation .
To bound the residual term associated with the leading bias term of the averaged estimator, we introduce an augmented RKHS space (with )
(6.1) |
equipped with the kernel function . To verify that is the reproducing kernel of , we notice that
where is a constant. Moreover, also satisfies the reproducing property since
For any , we can use the above reproducing property to bound the supremum norm of as . Also note that for any , for ; therefore , meaning that provides a tighter bound on the supremum norm than . In Section 8.2 (Lemma 8.1), we use this augmented RKHS to show that the bias remainder term satisfies , by computing the expectation and applying Markov's inequality.
For the noise recursion of , we can similarly split it into the leading noise recursion term and residual noise recursion term as
The leading noise recursion is described as a “semi-stochastic” recursion induced by in [12], since it keeps the randomness in the noise recursion due to the noise but gets rid of the randomness arising from , which is due to the random design .
For the residual noise recursion , a direct bound is difficult. Instead, we follow [12] by further decomposing into a sequence of higher-order “semi-stochastic” recursions as follows. We first define the semi-stochastic recursion induced by , denoted :
(6.2) |
Here, replaces the random operator with its expectation in the residual noise recursion for , and can be viewed as a second-order term in the expansion of the noise recursion, or the leading remainder noise term. The remaining noise parts can be expressed as
Then we can further define a semi-stochastic recursion induced by , and repeat this process. If we define for , then we can expand into terms as
where for , . The remainder term also has a recursive characterization:
(6.3) |
To establish the supremum norm bound of , the idea is to show that decays as increases, that is, to prove
Concretely, we consider the constant step case for simplicity of presentation. By accumulating the effects of the iterations, we can further express as
and accordingly, the averaged version is
This implies that, conditional on the covariates , the empirical process over is a Gaussian process with (function) weights . We can then prove a bound on by carefully analyzing the random function ; see Appendix 8.3 for further details. A complete proof of Theorem 3.1 under constant step size is included in [33]; see Figure 5 for a flow chart explaining the relationships among the different components of its proof. The proof for the non-constant step size case is conceptually similar but considerably more involved, requiring a much more refined analysis of the accumulated step size effect on the iterations of the recursions; see [33].
6.2 Proof sketch for Bootstrap consistency of global inference
Recall that represents the data. The goal is to bound the difference between the sampling distribution of and the conditional distribution of given ; see Section 4.2 for detailed definitions of these quantities. We sketch the proof idea under the constant step size scheme.
We will use the shorthand and . Recall that from equations (3.6) and (4.5), we have
From this display, we see that for any , is a weighted sum of Gaussian random variables, with the weights being functions of the covariates ; conditional on , is a weighted sum of sub-Gaussian random variables. In the proof, we also require a sufficiently dense space discretization given by . This discretization forms an -covering for some with respect to a specific distance metric detailed later.
To bound the difference between the distribution of and the conditional distribution of given , we introduce two intermediate processes: (1) with being i.i.d. standard normal random variables for ; (2) an -dimensional multivariate normal random vector (recall that is the space discretization we defined earlier), where are i.i.d. (zero mean) normally distributed random vectors having the same covariance structure as ; that is, , , and for and , . These two intermediate processes are introduced so that the conditional distribution of given will be used to approximate the conditional distribution of given ; while the distribution of will be used to approximate the distribution of . Since both the distribution of and the conditional distribution of given are centered multivariate normal distributions, we can use a Gaussian comparison inequality to bound the difference between them by bounding the difference between their covariances.
The actual proof is even more complicated, as we also need to control the discretization error. See Figure 6 for a flow chart that summarizes all the intermediate approximation steps and the corresponding lemmas in the appendix. For Steps I and V in Figure 6, we approximate the continuous supremum norms of and by the finite maxima of and , respectively. Here, is chosen as the -covering number of the unit interval with respect to the metric defined by for ; that is, there exist such that for every , there exists with . We refer the reader to the Supplementary Material for the detailed proofs of Steps I and V. Notice that is a weighted and non-identically distributed empirical process. In Step II, we develop Gaussian approximation bounds to control the Kolmogorov distance between the sampling distribution of and the distribution of ; see the proof of Lemma 8.3. In Step IV, noticing that, conditional on , is a weighted and non-identically distributed sub-Gaussian process whose randomness comes from the bootstrap multipliers , we adopt an argument similar to that of Step II to bound the Kolmogorov distance between the distributions of and given .
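The Gaussian comparison step can be illustrated numerically: two centered Gaussian vectors with close covariance matrices have close distributions of coordinate-wise maxima. The simulation below (dimension, perturbation size, and sample counts are arbitrary illustrative choices) estimates the Kolmogorov distance between the two maxima by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
m, B = 20, 50_000

A = rng.normal(size=(m, m)) / m ** 0.5
S1 = A @ A.T + np.eye(m)          # a generic covariance matrix
S2 = S1 + 0.01 * np.eye(m)        # small perturbation of the covariance

L1, L2 = np.linalg.cholesky(S1), np.linalg.cholesky(S2)
Z = rng.normal(size=(B, m))       # common standard normals (a coupling)
max1 = (Z @ L1.T).max(axis=1)     # maxima of N(0, S1) draws
max2 = (Z @ L2.T).max(axis=1)     # maxima of N(0, S2) draws

# Empirical Kolmogorov distance between the two max distributions:
grid = np.linspace(-2.0, 8.0, 200)
F1 = (max1[:, None] <= grid).mean(axis=0)
F2 = (max2[:, None] <= grid).mean(axis=0)
kol = np.abs(F1 - F2).max()
assert kol < 0.05   # small, consistent with a covariance-difference bound
```

In the proof, this role is played by a formal Gaussian comparison inequality that bounds the Kolmogorov distance between the maxima by the supremum norm difference of the two covariance structures.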
7 Discussion
Quantifying uncertainty (UQ) for large-scale streaming data is a central challenge in statistical inference. We develop a multiplier bootstrap-based inferential framework for UQ in online non-parametric least squares regression. We propose using perturbed stochastic functional gradients to generate a sequence of bootstrapped functional SGD estimators for constructing pointwise confidence intervals (local inference) and simultaneous confidence bands (global inference) for the function parameter in an RKHS. Theoretically, we establish a framework to derive the non-asymptotic law of the infinite-dimensional SGD estimator and demonstrate the consistency of the multiplier bootstrap method.
This work assumes that random errors in non-parametric regression follow a Gaussian distribution. However, in many real-world applications, heavy-tailed distributions are more common and better suited to capturing outlier behavior. One future research direction is to extend the current methods to heavy-tailed errors, thereby offering a more robust approach to online non-parametric inference. Another direction to explore is the generalization of the multiplier bootstrap weights to independent sub-exponential random variables and even exchangeable weights. Finally, a promising direction is online non-parametric inference for dependent data. Such an extension is necessary to address problems such as multi-armed bandits and reinforcement learning, where data dependencies are frequent and real-time updates are essential. Adapting our methods to these problems could provide deeper insights into the interplay between statistical inference and online decision-making.
8 Some Key Proofs
8.1 Proof of the leading terms in Theorem 3.1 in the constant step-size case
Recall that in Section 6.1 we split the recursion of into the bias recursion and the noise recursion; that is, . Here can be further decomposed into its leading bias term and a remainder satisfying the recursions
(8.1)
(8.2) |
We further decompose into its main recursion terms and residual recursion terms as
(8.3)
(8.4) |
We focus on the averaged version with
where , .
Theorem 3.1 for the constant step-size case comprises the following three results:
(8.5)
(8.6)
(8.7) |
In this section, we bound the sup-norms of the leading bias term and the leading variance term . To complete the proof of (8.7), we bound in Section 8.2 and in Section 8.3.
We first provide explicit expressions for and .
Denote
with . We have
(8.8) |
(8.9) |
Bounding the leading bias term (8.5). In the constant step-size case, based on (8.8), we have
Note that any can be represented as , where satisfies , , and , . Then for any , . By the assumption that satisfies , we have
where the inequality holds based on the bound that .
Bounding the leading noise term (8.6). We first derive the explicit expression of and its variance. Based on (8.9), in the constant step-size case, we have, for any ,
Note that for any , . Then
Therefore, with , and
Note that
On the other hand, . Since
we have
Accordingly, .
Meanwhile, leads to the result that thus . Therefore, .
8.2 Bounding the bias remainder in the constant step-size case
Recall in (8.2), the bias remainder recursion follows
Our goal is to bound . Letting with , we have
with and . We first express in an explicit form as follows.
Let and ; we have . We can further represent as
On the other hand, . Therefore, for any , we have
(8.10) |
Note that . In Lemma 8.1 below, we bound through and show that with high probability.
Lemma 8.1.
Suppose the step size with . Then
Proof.
To simplify the notation, we set as . For , by (8.10), we have
(8.11) |
That is, we split into two parts, and will bound each part separately.
We first bound . Denote with , then .
If , suppose , then
where the last step follows from the fact that . Therefore, we have . Furthermore,
Note that , with
(8.12) |
where with .
More abstractly, for any , , where . Then can be written as
Furthermore, for any ,
(8.13)
Therefore, , and in (8.13), we have . Then we have
Denote , then
Recalling , we can bound as follows.
where the last inequality is due to the fact that
Accordingly, ; and
(8.14) |
Next, we analyze in (8.10) for . Note that
We first consider and assume ; noting that , we have
(8.15) |
Similarly, for , Therefore,
Since , we have
(8.16) |
where , and where the last step is due to the fact that
with for .
By equation (8.16), we have . Recall . For notational simplicity, let ; then can be written as , where is an operator such that for any , . Replacing with , we have , and
Since , we further need to bound . Let , then . Note that , where is a constant. Then
Therefore, , and
Accordingly, we have .
Therefore,
Then by Markov’s inequality, we have
That is, with probability at least . ∎
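The Markov step at the end of the proof has the following generic shape (written with placeholder symbols $Z$ and $r_n$, since the specific quantities are those of the lemma): if $\mathbb{E}[Z] \le C_1 r_n$ for a nonnegative random variable $Z$, then for any $a > 0$,

```latex
\mathbb{P}\left( Z \ge a\, r_n \right)
\;\le\; \frac{\mathbb{E}[Z]}{a\, r_n}
\;\le\; \frac{C_1}{a},
```

so the event $\{ Z < a\, r_n \}$ holds with probability at least $1 - C_1/a$, which is the form of the high-probability bound stated in Lemma 8.1.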
8.3 Proof of the sup-norm bound of the noise remainder in the constant step-size case
Recall the noise remainder recursion follows
Following the recursion decomposition in Section 6.1, we can split into higher-order expansions as
where can be viewed as and for and . The remainder term follows the recursion as
The following lemma (Lemma 8.2) demonstrates that the higher-order expansion terms (for ) decrease as increases. In particular, we first characterize the behavior of by representing it as a weighted empirical process and establish its convergence rate, namely that with high probability. Next, we show that for using mathematical induction. Finally, we bound through its -norm based on the property that .
Lemma 8.2.
Suppose the step size with . Then
-
-
Furthermore, for and , we have .
-
with large enough.
Furthermore, combining (a)-(c), we have
where is a constant.
Proof.
Proof of Lemma 8.2 (a) by analyzing . First, we calculate the explicit expression of . Let and , then with . Therefore,
where the last step is by plugging in in (8.9) with . Accordingly,
(8.17) |
Let , where the randomness of involves . Then , which is a Gaussian process conditional on .
We can further express in terms of the eigenvalues and eigenfunctions as
(8.18) |
with ; we refer to [33] for the proof. This expression facilitates the downstream analysis of . Denote and . Then can be simplified as .
We are ready to prove that , where . The proof involves two steps: (1) for any fixed , conditional on , is a weighted Gaussian random variable with variance ; we first bound with exponentially decaying tail probability by characterizing . (2) We then bridge to . The details are as follows.
Conditional on , is a weighted Gaussian random variable; by Hoeffding’s inequality,
(8.19) |
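The tail inequality invoked in (8.19) is the standard Gaussian (Hoeffding-type) bound: if, conditional on the data, $Z$ is a centered Gaussian random variable with conditional variance $\sigma^2$, then for all $t > 0$,

```latex
\mathbb{P}\left( |Z| \ge t \mid \text{data} \right)
\;\le\; 2 \exp\!\left( -\frac{t^2}{2 \sigma^2} \right).
```

Bounding the conditional variance, as is done next, therefore yields an exponentially decaying conditional tail.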
We then bound . We separate into two parts as follows:
where involves the interaction terms indexed by and includes the terms with . Recall . Then for . For , we have
Taking the expectation on , we see that
(8.20)
where the last step is due to the calculation that
with . Note that
In the -step, if and only if the following cases hold: (1); (2) and ; (3) and . Recall . Then we have
For , we have Therefore,
with . The final step is due to the fact that
Since and accordingly, Therefore, we have
(8.21) |
For , we rewrite as
(8.22) |
where includes the terms with and includes the terms with . For , for any positive , we have
To bound the expectation of , we need to bound . Note that
where the last term is . Then we have
Then accordingly,
(8.23) |
For , we have
We first bound .
Notice that
since . Then we have
due to the property that . Accordingly, we have
Therefore,
(8.24) |
We next deal with .
(8.25) |
Combining equations (8.21), (8.23), (8.24), and (8.25), and noticing that for , we have
Define the event ; by Markov's inequality, . Conditional on the event , and letting in equation (8.19), we have
(8.26) |
Combined with the lemma in the Supplementary Material [33] that bridges and , this yields the result.
Next, we prove Lemma 8.2 (b) and analyze for .
Note that . In what follows, we focus on . Recall from Section 6 that follows the recursion , where for and for .
Let , then , and .
where we use the property that . Since , then . On the other hand, . Therefore, we have
with . Also, . Then
Therefore, . Let with and ; then we have and
By Markov's inequality, we have
For and , we have .
Proof of Lemma 8.2 (c) - the remainder term . Note that for any , . Therefore, . Next, we bound .
For , recall we have
Accordingly, . Since , we have
and accordingly
(8.27) |
By Markov's inequality,
with a large enough constant .
Finally, we have
∎
8.4 Bootstrap SGD decomposition
Similar to the SGD recursion decomposition in Section 6, we define the bootstrap SGD recursion decomposition as follows. Based on (4.1), denote ; then
(8.28) |
We split the recursion (8.28) into two recursions, and , such that . Specifically,
(8.29)
(8.30) |
Since , we further decompose into two parts: (1) its main recursion terms, which determine the bias order; and (2) residual recursion terms. That is,
Similarly, we decompose into its main recursion term, which dominates the variation, and residual recursion terms as
(8.31)
with .
We aim to quantify the distributional behavior of given . Denote . Then
where , , and are remainder terms in the original SGD recursion, with (bounded in Section 8.2) and (bounded in Section 8.3).
Since and follow the same recursion, the leading bias of is 0. We next need to: (1) characterize the distributional behavior of conditional on ; and (2) prove that the remainder terms are negligible.
8.5 Proof of the bootstrap consistency in Theorem 4.2 for the constant step-size case
We follow the proof sketch in Section 6.2 and complete the proofs of Steps II, III, and IV in this section.
For the reader’s convenience, we restate the following notation. Denote
where the ’s, for , are i.i.d. standard normal random variables, and satisfies , and for .
Lemma 8.3.
(Proof of Step II) Suppose and with . We have
(8.33) |
which converges to as increases.
Lemma 8.4.
(Proof of Step III) Suppose and with . With probability at least ,
Lemma 8.5.
(Proof of Step IV) Suppose and with . With probability at least ,
Proof of Lemma 8.3
Proof.
We define for . With a slight abuse of notation, we use to represent . Then . Define and ; then . For ,
When , . We also have for .
We also use the notation to represent defined in Section 6.2. Let for , and . We remark that has the same mean and covariance structure as .
For a scalar, to be determined, that depends on , , and , define . It follows from [42] that satisfies . Let be a -function such that for and for . Let , for , where is to be determined. Then
To proceed, we approximate using the techniques used in [42]. Let . Define , , and , for . Let , and , for . Then .
Then
where
We further note that since . For , it follows from [42] that for any ,
where is a finite constant. Then
(8.34) |
We need to bound and .
where the last step is due to the property that , and .
Next we deal with , where , and . For , we have
Then we have
Therefore,
(8.35) |
Meanwhile, it follows from Lemma 2.1 of [42] that
where is a universal constant. Therefore, for any ,
On the other hand, let . Then
Using the same arguments, it can be shown that has the same upper bound as that specified in (8.35). Furthermore, by Lemma 2.1 of [42] and direct calculations, we have
Therefore,
Consequently, letting and be large enough, we have
(8.36) |
which converges to as increases, when and with . ∎
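For reference, the smooth-max device from [42] used in the proof above is usually taken to be the log-sum-exp function: for $z \in \mathbb{R}^N$ and a smoothing parameter $\beta > 0$,

```latex
F_\beta(z) \;=\; \beta^{-1} \log\left( \sum_{j=1}^{N} \exp(\beta z_j) \right),
\qquad
0 \;\le\; F_\beta(z) - \max_{1 \le j \le N} z_j \;\le\; \beta^{-1} \log N,
```

so a large $\beta$ makes $F_\beta$ a smooth surrogate for the finite maximum at the cost of larger derivatives, which is the trade-off tracked by the constants in the proof.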
Proof of Lemma 8.4
Proof.
Let and . Then and . Denote the -th element of the covariance matrices as and , respectively. Set . Then
and .
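The covariance-comparison argument sketched here rests on a Gaussian comparison inequality in the style of Chernozhukov, Chetverikov, and Kato: if $X$ and $Y$ are centered $N$-dimensional Gaussian vectors with covariance matrices $\Sigma^X$ and $\Sigma^Y$, and $\Delta = \max_{j,k} |\Sigma^X_{jk} - \Sigma^Y_{jk}|$, then, in one standard form,

```latex
\sup_{t \in \mathbb{R}}
\left| \mathbb{P}\Bigl( \max_{1 \le j \le N} X_j \le t \Bigr)
     - \mathbb{P}\Bigl( \max_{1 \le j \le N} Y_j \le t \Bigr) \right|
\;\lesssim\; \Delta^{1/3} \bigl( 1 \vee \log(N/\Delta) \bigr)^{2/3},
```

so bounding the entrywise covariance difference $\Delta$ controls the Kolmogorov distance between the two maxima.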
Proof of Lemma 8.5
Proof.
Define with . We have and . Define ; then and are independent for and . Let with for .
Similarly, denote and . Then we have , and Denote with .
The proof of Lemma 8.5 follows that of Lemma 8.3. We adopt the notation and follow the proof of Lemma 8.3 step by step, with only the following changes: (1) replacing with ; (2) replacing with ; (3) replacing the probability and expectation with the conditional probability and conditional expectation . Then equation (8.34) is adapted to
(8.37) |
Since
where the last inequality follows from the proof of Lemma 8.3, which shows that . Then, with probability at least , we have
Similarly, with probability at least , where is a constant independent of . Then we have with probability at least ,
Therefore, following the proof of Lemma 8.3, we have
Consequently, letting , we have, with probability at least ,
∎
References
- [1] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22:400–407, 1951.
- [2] David Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering, Ithaca, NY, 1988.
- [3] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 161–168. Curran Associates, Inc., 2008.
- [4] Anton J. Kleywegt, Alexander Shapiro, and Tito Homem-de-Mello. The sample average approximation method for stochastic discrete optimization. SIAM Journal on Optimization, 12(2):479–502, 2002.
- [5] Sujin Kim, Raghu Pasupathy, and Shane G. Henderson. A guide to sample average approximation. In Handbook of Simulation Optimization, pages 207–243. Springer, 2015.
- [6] Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404, 1950.
- [7] Grace Wahba. Spline Models for Observational Data. Society for Industrial and Applied Mathematics, 1990.
- [8] Vladimir Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34:2593–2656, 2006.
- [9] Shahar Mendelson. Geometric parameters of kernel machines. In International Conference on Computational Learning Theory, pages 29–43. Springer, 2002.
- [10] Yun Yang, Mert Pilanci, and Martin J. Wainwright. Randomized sketches for kernels: Fast and optimal nonparametric regression. The Annals of Statistics, 45(3):991–1023, 2017.
- [11] Craig Saunders, Alexander Gammerman, and Volodya Vovk. Ridge regression learning algorithm in dual variables. In International Conference on Machine Learning, pages 515–521. Springer, 1998.
- [12] Aymeric Dieuleveut and Francis Bach. Nonparametric stochastic approximation with large step-sizes. The Annals of Statistics, 44(4):1363–1399, 2016.
- [13] Léon Bottou. Online learning and stochastic approximations. Online Learning in Neural Networks, 17(9):142, 1998.
- [14] Quoc V. Le, Jiquan Ngiam, Adam Coates, Abhik Lahiri, Bobby Prochnow, and Andrew Y. Ng. On optimization methods for deep learning. In Proceedings of the 28th International Conference on Machine Learning, pages 265–272, 2011.
- [15] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In 19th International Conference on Computational Statistics, pages 177–186. Springer, 2010.
- [16] Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. Advances in Neural Information Processing Systems, 32, 2019.
- [17] Patrick Cheridito, Arnulf Jentzen, and Florian Rossmannek. Non-convergence of stochastic gradient descent in the training of deep neural networks. Journal of Complexity, 64:101540, 2021.
- [18] Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
- [19] Yixin Fang, Jinfeng Xu, and Lei Yang. Online bootstrap confidence intervals for the stochastic gradient descent estimator. Journal of Machine Learning Research, 19(78):1–21, 2018.
- [20] Xi Chen, Jason D Lee, Xin T Tong, and Yichen Zhang. Statistical inference for model parameters in stochastic gradient descent. The Annals of Statistics, 48(1):251–273, 2020.
- [21] Yu Nesterov and J-Ph Vial. Confidence level solutions for stochastic programming. Automatica, 44(6):1559–1568, 2008.
- [22] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
- [23] Weijie J Su and Yuancheng Zhu. HiGrad: Uncertainty quantification for online learning and stochastic approximation. Journal of Machine Learning Research, 24(124):1–53, 2023.
- [24] Bradley Efron and Robert J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall/CRC, 1993.
- [25] Thomas J Diciccio and Joseph P Romano. A review of bootstrap confidence intervals. Journal of the Royal Statistical Society: Series B (Methodological), 50(3):338–354, 1988.
- [26] Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Gaussian approximation of suprema of empirical processes. The Annals of Statistics, 42(4):1564–1597, 2014.
- [27] Alain Berlinet and Christine Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011.
- [28] Chong Gu. Smoothing spline ANOVA models, volume 297. Springer, 2013.
- [29] Shahar Mendelson and Joseph Neeman. Regularization in kernel learning. The Annals of Statistics, 38(1):526–565, 2010.
- [30] Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. The Journal of Machine Learning Research, 18(1):714–751, 2017.
- [31] Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
- [32] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, volume 20, 2007.
- [33] Meimei Liu, Zuofeng Shang, and Yun Yang. Supplementary: Scalable statistical inference in non-parametric least squares, 2023. Supplementary material.
- [34] Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Empirical and multiplier bootstraps for suprema of empirical processes of increasing complexity, and related Gaussian couplings. Stochastic Processes and their Applications, 126(12):3632–3651, 2016.
- [35] Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Anti-concentration and honest, adaptive confidence bands. The Annals of Statistics, 42(5):1787–1818, 2014.
- [36] Michael H Neumann and Jörg Polzehl. Simultaneous bootstrap confidence bands in nonparametric regression. Journal of Nonparametric Statistics, 9(4):307–333, 1998.
- [37] Timothy B Armstrong and Michal Kolesár. Simple and honest confidence intervals in nonparametric regression. Quantitative Economics, 11(1):1–39, 2020.
- [38] Grace Wahba. Bayesian “confidence intervals” for the cross-validated smoothing spline. Journal of the Royal Statistical Society: Series B (Methodological), 45(1):133–150, 1983.
- [39] Grace Wahba. Improper priors, spline smoothing and the problem of guarding against model errors in regression. Journal of the Royal Statistical Society Series B: Statistical Methodology, 40(3):364–372, 1978.
- [40] Yuedong Wang and Grace Wahba. Bootstrap confidence intervals for smoothing splines and their comparison to bayesian confidence intervals. Journal of Statistical Computation and Simulation, 51(2-4):263–279, 1995.
- [41] Bradley Efron. The jackknife, the bootstrap and other resampling plans. SIAM, 1982.
- [42] Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. The Annals of Statistics, 41(6):2786–2819, 2013.