Scalable Statistical Inference in Non-parametric Least Squares
Abstract
Stochastic approximation (SA) is a powerful and scalable computational method for iteratively estimating the solution of optimization problems in the presence of randomness, and is particularly well-suited for large-scale and streaming-data settings. In this work, we propose a theoretical framework for SA applied to non-parametric least squares in reproducing kernel Hilbert spaces (RKHS), enabling online statistical inference in non-parametric regression models. We achieve this by constructing asymptotically valid pointwise (and simultaneous) confidence intervals (bands) for local (and global) inference of the nonlinear regression function, by applying an online multiplier bootstrap approach to a functional stochastic gradient descent (SGD) algorithm in the RKHS. Our main theoretical contributions are a unified framework for characterizing the non-asymptotic behavior of the functional SGD estimator and a proof of the consistency of the multiplier bootstrap method. The proof techniques involve the development of a higher-order expansion of the functional SGD estimator under the supremum norm metric and the Gaussian approximation of suprema of weighted and non-identically distributed empirical processes. Our theory specifically reveals an interesting relationship between the tuning of the SGD step sizes for estimation and the accuracy of uncertainty quantification.
1 Introduction
Stochastic approximation (SA) [1, 2, 3] is a class of iterative stochastic algorithms for solving the stochastic optimization problem $\min_{\theta \in \Theta} \mathbb{E}_{\zeta}[\ell(\theta; \zeta)]$, where $\ell$ is a loss function, $\zeta$ denotes the internal random variable, and $\Theta$ is the domain of the loss function. Statistical inference, such as parameter estimation, can be viewed as a special case of stochastic optimization where the goal is to estimate the minimizer $\theta^\ast$ of the expected loss function based on a finite number of i.i.d. observations $\zeta_1, \ldots, \zeta_n$. Classical estimation procedures based on minimizing an empirical version of the loss correspond to the sample average approximation (SAA) [4, 5] for solving the stochastic optimization problem. However, directly minimizing the empirical loss with massive data is computationally wasteful in both time and space, and may pose numerical challenges. For example, in applications involving streaming data, where new and dynamic observations are generated on a continuous basis, it may not be necessary or feasible to store all historical data. Instead, stochastic gradient descent (SGD), or the Robbins-Monro type SA algorithm [1], is a scalable approximation algorithm for parameter estimation with constant per-iteration time and space complexity. SGD can be viewed as a stochastic version of the gradient descent method that uses a noisy gradient, such as $\nabla_\theta \ell(\theta; \zeta_i)$ based on a single observation $\zeta_i$, to replace the true gradient $\nabla_\theta \mathbb{E}_{\zeta}[\ell(\theta; \zeta)]$. In this work, we explore the use of SA for statistical inference in infinite-dimensional models where $\Theta$ is a function space or, more precisely, in solving non-parametric least squares in reproducing kernel Hilbert spaces (RKHS).
Consider the standard random-design non-parametric regression model
$$y_i = f^\ast(x_i) + \epsilon_i, \qquad \epsilon_i \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2), \qquad i = 1, 2, \ldots \qquad (1.1)$$
with $x_i$ denoting the $i$-th copy of the random covariate $X$, $y_i$ the $i$-th copy of the response $Y$, and $f^\ast$ the unknown regression function, assumed to lie in a reproducing kernel Hilbert space (RKHS, [6, 7]) $\mathbb{H}$, to be estimated. For simplicity, we assume that the covariate domain $\mathcal{X}$ is the unit cube $[0,1]^d$. Since $f^\ast$ minimizes the population-level expected squared error loss objective $\mathbb{E}[(Y - f(X))^2]$ over all functions $f$, one can adopt the SAA approach to estimate $f^\ast$ by minimizing a penalized sample-level squared error loss objective. Given a sample of size $n$, a commonly used SAA approach for estimating $f^\ast$ is kernel ridge regression (KRR). KRR incorporates a penalty term that depends on the norm associated with the RKHS $\mathbb{H}$. Although the KRR estimator enjoys many attractive statistical properties [8, 9, 10], its computational complexity ($O(n^3)$ time and $O(n^2)$ space in a direct implementation) hinders its practicality in large-scale problems [11]. In this work, we instead consider an SA-type approach for directly minimizing the expected loss functional over the infinite-dimensional RKHS. By operating SGD in this non-parametric setting (see Section 2.2 for details), the resulting algorithm achieves $O(n^2)$ time complexity and $O(n)$ space complexity. In a recent study [12], the authors demonstrate that the online estimator of $f^\ast$ resulting from the SGD achieves optimal rates of convergence for a variety of kernel classes. It is interesting to note that since the functional gradient is defined with respect to the RKHS norm $\|\cdot\|_{\mathbb{H}}$, the functional SGD implicitly induces an algorithmic regularization due to "early stopping" in the RKHS, which is controlled by the accumulated step sizes. Therefore, with a proper step size decaying scheme, no explicit regularization is needed to achieve optimal convergence rates.
The aim of this research is to take a step further by constructing a new inferential framework for quantifying the estimation uncertainty in the SA procedure. This will be achieved through the construction of pointwise confidence intervals and simultaneous confidence bands for the functional SGD estimator of $f^\ast$. Previous SGD algorithms and their variants, such as those discussed in [3, 13, 14, 15, 16, 17], are mainly utilized to solve finite-dimensional parametric learning problems with a root-$n$ convergence rate. In the parametric setting, asymptotic properties of estimators arising from SGD, such as consistency and asymptotic normality, have been well established in the literature; for example, see [12, 18, 19, 20]. However, the problem of uncertainty quantification for functional SGD estimators in non-parametric settings is rarely addressed in the literature.
In the parametric setting, several methods have been proposed to conduct uncertainty quantification in SGD. [21, 22] appear to be among the first to formally characterize the magnitudes of random fluctuations in SA; however, their notion of confidence level is based on the large deviation properties of the solution and can be quite conservative. More recently, [19] proposes applying a multiplier bootstrap method for the construction of SGD confidence intervals, whose asymptotic confidence level is shown to exactly match the nominal level. [20] proposes a batch mean method to estimate the asymptotic covariance matrix of the estimator based on a single SGD trajectory. Due to the limited information from a single run of SGD, the best achievable coverage error of their confidence interval is of strictly larger order than that achieved by the multiplier bootstrap. [23] proposes a different method called HiGrad, which constructs a hierarchical tree of SGD estimators and uses the outputs in its leaves to construct a confidence interval.
In this work, we develop a multiplier bootstrap method for uncertainty quantification in SA for solving online non-parametric least squares. Bootstrap methods [24, 25] are widely used in statistics to estimate the sampling distribution of a statistic for uncertainty quantification. Traditional resampling-based bootstrap methods are unsuitable for streaming data inference, as the resampling step necessitates storing all historical data, which contradicts the objective of maintaining constant space and time complexity in online learning. Instead, we extend the parametric online multiplier bootstrap method of [19] to the non-parametric setting. We achieve this by employing a perturbed stochastic functional gradient, represented as an element of the RKHS and evaluated upon the arrival of each new covariate-response pair $(x_i, y_i)$, to capture the stochastic fluctuation arising from the random streaming data.
To theoretically justify the use of the proposed multiplier bootstrap method, we make two main contributions. First, we build a novel theoretical framework to characterize the non-asymptotic behavior of the infinite-dimensional functional SGD estimator by expanding it into higher-order terms under the supremum norm metric. This framework enables us to perform local inference to construct pointwise confidence intervals for $f^\ast$ and global inference to construct a simultaneous confidence band. Second, we demonstrate the consistency of the multiplier bootstrap method by proving that the perturbation injected into the stochastic functional gradient accurately mimics the randomness pattern in the online estimation procedure, so that the conditional law of the bootstrapped functional SGD estimator given the data asymptotically coincides with the sampling law of the functional SGD estimator. Our proof is non-trivial and contains several major improvements that refine the best (to our knowledge) convergence analysis of SGD for non-parametric least squares in [12], and it also advances the consistency analysis of the multiplier bootstrap in a non-parametric setting. Concretely, in [12], the authors derive the convergence rate of the functional SGD estimator relative to the $L_2$ norm metric. Their theory only concerns the convergence rate of the estimator; hence, their proof involves decomposing the SGD recursion into a leading first-order recursion and the remaining higher-order recursions, and bounding their $L_2$ norms by directly bounding their expectations. In comparison, our analysis for statistical inference in online non-parametric regression requires a functional central limit theorem type result and calls for several substantial refinements in proof techniques.
Our first improvement is to refine the SGD recursion analysis by using the stronger supremum norm metric. This enables us to accurately characterize the stochastic fluctuation of the functional estimator uniformly across all locations. As a result, we can study the coverage probability of simultaneous confidence bands in our subsequent inference tasks. Analyzing the supremum norm convergence is significantly more intricate than analyzing the $L_2$ convergence. In the proof, we introduce an augmented RKHS different from $\mathbb{H}$ as a bridge, in order to better align its induced norm with the supremum norm metric; see Remark 3.2 or equation (6.1) in Section 6 for further details. Additionally, we have to employ uniform laws of large numbers and leverage ideas from empirical process theory to uniformly control certain stochastic terms that emerge in the expansions. Our second improvement comes from the need to characterize the large-sample distributional limit of the functional SGD estimator. Using the same recursion decomposition, we must now derive a high-probability supremum norm bound for all orders of the recursion and determine the large-sample distributional limit of the leading term in the expansion. It is worth noting that the second-order recursion is the most complicated and challenging one to analyze. This recursion requires specialized treatment that involves substantially more effort than the remaining higher-order recursions. A loose analysis, obtained by directly converting an $L_2$ norm bound into a supremum norm bound using the reproducing kernel property of the original RKHS (which suffices for bounding the higher-order recursions), might result in a bound whose order is comparable to that of the leading term. This is where we introduce an augmented RKHS and directly analyze the supremum norm using empirical process tools.
Last but not least, in order to analyze the distributional limits of the leading bias and variance terms appearing in the expansion of the functional SGD estimator, we develop new tools by extending the recent technique of Gaussian approximation of suprema of empirical processes [26] from equally weighted sums to weighted sums. This extension is important and unique to the analysis of functional SGD, since earlier-arriving data points have larger weights in the leading bias and variance terms than later-arriving data points; see Remark 3.3 for more discussion. Towards the analysis of our bootstrap procedure, we further develop Gaussian approximation bounds for multiplier bootstraps for suprema of weighted and non-identically distributed empirical processes, which can be used to control the Kolmogorov distance between the sampling distribution of a pointwise evaluation of the functional SGD estimator (local inference), or of its supremum norm (global inference), and its bootstrap counterpart. Our results also elucidate the interplay between early stopping (controlled by the step size) for optimal estimation and the accuracy of uncertainty quantification.
The rest of the article is organized as follows. In Section 2, we introduce the background on RKHSs and functional stochastic gradient descent algorithms in the RKHS; in Section 3, we establish the distributional convergence of SGD for non-parametric least squares; in Section 4, we develop scalable uncertainty quantification in the RKHS via multiplier bootstrapped SGD estimators; Section 5 includes extensive numerical studies demonstrating the performance of the proposed SGD inference; Section 6 presents a proof sketch highlighting some important technical details and key steps; Section 7 provides an overview and future directions for our work; and Section 8 includes some key proofs of the theorems.
Notation: In this paper, we use $c, c', C, C', \ldots$ to denote generic positive constants whose values may change from line to line but are independent of everything else. We use $\|f\|_\infty$ to denote the supremum norm of a function $f$, defined as $\|f\|_\infty = \sup_{x \in \mathcal{X}} |f(x)|$, where $\mathcal{X}$ is the domain of $f$. The notations $\lesssim$ and $\gtrsim$ denote inequalities up to a constant multiple; we write $a \asymp b$ when both $a \lesssim b$ and $a \gtrsim b$ hold. For $a \in \mathbb{R}$, let $\lfloor a \rfloor$ denote the largest integer smaller than or equal to $a$. For two self-adjoint operators $A$ and $B$, we write $A \preceq B$ if $B - A$ is positive semi-definite.
2 Background and Problem Formulation
We begin by introducing some background on reproducing kernel Hilbert space (RKHS) and functional stochastic gradient descent algorithms in the RKHS.
2.1 Reproducing kernel Hilbert spaces
To describe the structure of the regression function $f^\ast$ in the non-parametric regression model (1.1), we adopt the standard framework of a reproducing kernel Hilbert space (RKHS, [7, 27, 28]) by assuming $f^\ast$ to belong to an RKHS $\mathbb{H}$. Let $\mathbb{P}_X$ denote the marginal distribution of the random design $X$, and let $L_2(\mathbb{P}_X)$ denote the space of all square-integrable functions over $\mathcal{X}$ with respect to $\mathbb{P}_X$. Briefly speaking, an RKHS $\mathbb{H}$ is a Hilbert space of functions defined over a set $\mathcal{X}$, equipped with an inner product $\langle \cdot, \cdot \rangle_{\mathbb{H}}$, such that for any $x \in \mathcal{X}$, the evaluation functional at $x$, defined by $F_x(f) = f(x)$, is a continuous linear functional on the RKHS. Uniquely associated with $\mathbb{H}$ is a positive-definite function $K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, called the reproducing kernel. The key property of the reproducing kernel is the reproducing property: the evaluation functional can be represented by the reproducing kernel function, so that $f(x) = \langle f, K(x, \cdot) \rangle_{\mathbb{H}}$ for all $f \in \mathbb{H}$. According to Mercer's theorem [6], the kernel function $K$ has the following spectral decomposition:
$$K(x, x') = \sum_{\nu=1}^{\infty} \mu_\nu\, \phi_\nu(x)\, \phi_\nu(x'), \qquad (2.1)$$
where the convergence is absolute and uniform on $\mathcal{X} \times \mathcal{X}$. Here, $\{\mu_\nu\}_{\nu \ge 1}$ is the non-increasing sequence of eigenvalues, and $\{\phi_\nu\}_{\nu \ge 1}$ are the corresponding eigenfunctions, forming an orthonormal basis in $L_2(\mathbb{P}_X)$ with the following property: for any $\nu, \nu' \ge 1$,
$$\langle \phi_\nu, \phi_{\nu'} \rangle_{L_2(\mathbb{P}_X)} = \delta_{\nu\nu'} \quad \text{and} \quad \langle \phi_\nu, \phi_{\nu'} \rangle_{\mathbb{H}} = \delta_{\nu\nu'} / \mu_\nu,$$
where $\delta_{\nu\nu'} = 1$ if $\nu = \nu'$ and $\delta_{\nu\nu'} = 0$ otherwise. Moreover, any $f \in \mathbb{H}$ can be decomposed into $f = \sum_{\nu} \theta_\nu \phi_\nu$ with $\theta_\nu = \langle f, \phi_\nu \rangle_{L_2(\mathbb{P}_X)}$, and its RKHS norm can be computed via $\|f\|_{\mathbb{H}}^2 = \sum_{\nu} \theta_\nu^2 / \mu_\nu$.
We introduce some technical conditions on the reproducing kernel in terms of its spectral decomposition.
Assumption A1.
The eigenfunctions $\{\phi_\nu\}_{\nu \ge 1}$ of $K$ are uniformly bounded on $\mathcal{X}$, i.e., there exists a finite constant $c_\phi$ such that $\sup_{\nu \ge 1} \|\phi_\nu\|_\infty \le c_\phi$. Moreover, they satisfy the Lipschitz condition $|\phi_\nu(x) - \phi_\nu(x')| \le c_L\, \nu\, \|x - x'\|$ for any $x, x' \in \mathcal{X}$, where $c_L$ is a finite constant.
Assumption A2.
The eigenvalues $\{\mu_\nu\}_{\nu \ge 1}$ of $K$ satisfy $\mu_\nu \asymp \nu^{-2\alpha}$ for some $\alpha > 1/2$.
The uniform boundedness condition in Assumption A1 is common in the literature [29]. Assumption A2 assumes the kernel to have polynomially decaying eigenvalues. Assumptions A1-A2 together also imply that the kernel function is uniformly bounded, since $K(x, x) = \sum_{\nu} \mu_\nu \phi_\nu^2(x) \le c_\phi^2 \sum_{\nu} \mu_\nu < \infty$. One special class of kernels satisfying Assumptions A1-A2 is composed of translation-invariant kernels $K(x, x') = g(x - x')$ for some even function $g$ of period one. In fact, by utilizing the Fourier series expansion of the kernel function $g$, we observe that the eigenfunctions of the corresponding kernel operator are the trigonometric functions
$$\phi_1(x) = 1, \qquad \phi_{2\nu}(x) = \sqrt{2}\cos(2\pi\nu x), \qquad \phi_{2\nu+1}(x) = \sqrt{2}\sin(2\pi\nu x), \qquad \nu = 1, 2, \ldots,$$
on $[0, 1]$. It is easy to see that the constants in Assumption A1 can be chosen so that these eigenfunctions satisfy it. Although we primarily consider kernels with polynomially decaying eigenvalues for the sake of clarity in this paper, it is worth mentioning that our theory extends to other kernel classes, such as squared exponential kernels and polynomial kernels [30].
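To make the spectral picture concrete, the following sketch builds a translation-invariant kernel on $[0,1]$ directly from a truncated Mercer expansion with trigonometric eigenfunctions and polynomially decaying eigenvalues; the decay exponent, truncation level, and evaluation grid are illustrative assumptions rather than quantities fixed by the text.

```python
import numpy as np

# A translation-invariant kernel on [0, 1] assembled from a truncated Mercer
# sum with eigenvalues mu_nu = nu^(-2*alpha). alpha = 1.0 and num_freq = 50
# are illustrative choices.

def trig_features(x, num_freq):
    """Orthonormal basis in L2[0,1]: 1, sqrt(2)cos(2*pi*nu*x), sqrt(2)sin(2*pi*nu*x)."""
    feats = [np.ones_like(x)]
    for nu in range(1, num_freq + 1):
        feats.append(np.sqrt(2) * np.cos(2 * np.pi * nu * x))
        feats.append(np.sqrt(2) * np.sin(2 * np.pi * nu * x))
    return np.stack(feats, axis=1)                 # shape (len(x), 2*num_freq + 1)

def periodic_kernel(x, y, alpha=1.0, num_freq=50):
    """Truncated Mercer sum K(x, y) = sum_nu mu_nu * phi_nu(x) * phi_nu(y)."""
    mu = [1.0]                                     # eigenvalue of the constant eigenfunction
    for nu in range(1, num_freq + 1):
        mu += [float(nu) ** (-2 * alpha)] * 2      # cos and sin share an eigenvalue
    mu = np.array(mu)
    return trig_features(x, num_freq) @ (mu[:, None] * trig_features(y, num_freq).T)

x = np.linspace(0.0, 1.0, 200)
Kmat = periodic_kernel(x, x)
# Translation invariance: K(x, y) depends only on (x - y) mod 1.
print(np.allclose(Kmat[0, 10], Kmat[5, 15]))       # -> True
```

Because each cosine-sine pair at frequency $\nu$ shares the eigenvalue $\nu^{-2\alpha}$, the sum collapses to a function of $x - y$, which is the translation-invariance property claimed above.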
2.2 Stochastic gradient descent in RKHS
To motivate functional SGD in an RKHS, we first review SGD in the Euclidean setting for minimizing the expected loss function $L(\theta) = \mathbb{E}_Z[\ell(\theta; Z)]$, where $\theta$ is the parameter of interest, $\ell$ is the loss function, and $Z$ denotes a generic random sample, e.g. $Z = (X, Y)$ in the non-parametric regression setting (1.1). By a first-order Taylor expansion, one can locally approximate $L(\theta + \delta)$ for any small deviation $\delta$ by $L(\theta) + \langle \nabla L(\theta), \delta \rangle$, where $\nabla L(\theta)$ denotes the gradient (vector) of $L$ evaluated at $\theta$. The gradient therefore encodes the (infinitesimal) steepest descent direction of $L$ at $\theta$, leading to the following gradient descent (GD) updating formula:
$$\theta_i = \theta_{i-1} - \gamma_i\, \nabla L(\theta_{i-1}), \qquad i = 1, 2, \ldots,$$
starting from some initial value $\theta_0$, where $\gamma_i$ is the step size (also called the learning rate) at iteration $i$. GD requires the computation of the full gradient $\nabla L(\theta)$, which is unavailable due to the unknown data distribution of $Z$. In stochastic approximation, SGD takes a more efficient approach by using an unbiased estimate of the gradient, $\nabla_\theta \ell(\theta; Z_i)$ based on the single sample $Z_i$, to substitute for $\nabla L(\theta)$ in the updating rule. Accordingly, the SGD updating rule takes the form
$$\theta_i = \theta_{i-1} - \gamma_i\, \nabla_\theta \ell(\theta_{i-1}; Z_i).$$
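As a concrete instance of the update above, the following sketch runs SGD with Polyak averaging on a streaming linear regression with squared loss; the true parameter, noise level, and step-size constants are illustrative assumptions.

```python
import numpy as np

# SGD for the squared loss l(theta; (x, y)) = 0.5 * (y - x @ theta)^2 on a
# simulated data stream. The model theta_star, noise level, and step sizes
# are illustrative assumptions, not values from the text.

rng = np.random.default_rng(0)
theta_star = np.array([1.0, -2.0])
theta = np.zeros(2)                   # initial value theta_0
theta_bar = np.zeros(2)               # Polyak average of the trajectory

n = 20000
for i in range(1, n + 1):
    x = rng.normal(size=2)                       # one streaming covariate
    y = x @ theta_star + 0.1 * rng.normal()      # one streaming response
    grad = (x @ theta - y) * x                   # unbiased estimate of the true gradient
    gamma_i = 0.3 / np.sqrt(i)                   # polynomially decaying step size
    theta = theta - gamma_i * grad               # SGD update
    theta_bar += (theta - theta_bar) / i         # running Polyak average

print(np.round(theta_bar, 2))                    # close to theta_star = [1.0, -2.0]
```

The running-average line shows why averaging needs no extra storage: only the current average and the current iterate are kept, mirroring the constant per-iteration space complexity emphasized in the introduction.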
Let us now extend the concept of SGD from minimizing an expected loss function in Euclidean space to minimizing an expected loss functional in function space. Here, for concreteness, we develop SGD for minimizing the expected squared error loss $\mathcal{L}(f) = \frac{1}{2}\, \mathbb{E}\big[(Y - f(X))^2\big]$ over an RKHS $\mathbb{H}$ equipped with inner product $\langle \cdot, \cdot \rangle_{\mathbb{H}}$. Let us begin by extending the concept of the "gradient". We identify the gradient (operator) of the functional $\mathcal{L}$ as a steepest descent "direction" in $\mathbb{H}$ through the following first-order "Taylor expansion":
$$\mathcal{L}(f + \Delta f) = \mathcal{L}(f) + \langle \nabla \mathcal{L}(f),\, \Delta f \rangle_{\mathbb{H}} + O\big(\|\Delta f\|_{\mathbb{H}}^2\big).$$
We obtain after some simple algebra that
$$\mathcal{L}(f + \Delta f) - \mathcal{L}(f) = \mathbb{E}\big[(f(X) - Y)\, \Delta f(X)\big] + \tfrac{1}{2}\, \mathbb{E}\big[\Delta f(X)^2\big].$$
Now, by using the reproducing property $\Delta f(X) = \langle \Delta f, K_X \rangle_{\mathbb{H}}$ for any $\Delta f \in \mathbb{H}$, where $K_x = K(x, \cdot)$, we further obtain
$$\mathcal{L}(f + \Delta f) - \mathcal{L}(f) = \big\langle\, \mathbb{E}\big[(f(X) - Y)\, K_X\big],\; \Delta f \,\big\rangle_{\mathbb{H}} + O\big(\|\Delta f\|_{\mathbb{H}}^2\big). \qquad (2.2)$$
Here, we have used the fact that, by the Cauchy-Schwarz inequality, $\mathbb{E}[\Delta f(X)^2] \le \sup_{x} K(x, x)\, \|\Delta f\|_{\mathbb{H}}^2$, since Assumptions A1-A2 together with Mercer's expansion (2.1) imply $K$ to be uniformly bounded as long as $\alpha > 1/2$. From equation (2.2), we can identify the gradient of $\mathcal{L}$ at $f$ as
$$\nabla \mathcal{L}(f) = \mathbb{E}\big[(f(X) - Y)\, K_X\big].$$
Throughout the rest of the paper, we will refer to $\nabla \mathcal{L}(f)$ above as the RKHS gradient of the functional $\mathcal{L}$ at $f$.
Upon the arrival of the $i$-th data point $(x_i, y_i)$, we can form an unbiased estimator of the RKHS gradient at $f$ as $(f(x_i) - y_i)\, K_{x_i}$. This leads to the following SGD in the RKHS for solving non-parametric least squares: for a given initial estimate $\hat{f}_0$, the SGD recursively updates the estimate of $f^\ast$ upon the arrival of each data point as
$$\hat{f}_i = \hat{f}_{i-1} - \gamma_i\, \big(\hat{f}_{i-1}(x_i) - y_i\big)\, K_{x_i}. \qquad (2.3)$$
By utilizing the reproducing property, the above iterative updating formula can be rewritten as
$$\hat{f}_i = \big(I - \gamma_i\, K_{x_i} \otimes K_{x_i}\big)\, \hat{f}_{i-1} + \gamma_i\, y_i\, K_{x_i}, \qquad (2.4)$$
where $I$ denotes the identity map on $\mathbb{H}$, and $\otimes$ is the tensor product operator defined through $(f \otimes g)\, h = \langle g, h \rangle_{\mathbb{H}}\, f$ for all $f, g, h \in \mathbb{H}$. Formula (2.3) is more straightforward for practical implementation, while formula (2.4) provides a more tractable expression that will facilitate our theoretical analysis. Following [2] and [31], we consider the so-called Polyak averaging scheme to further improve the estimation accuracy by averaging over the entire updating trajectory; i.e., we use $\bar{f}_n = \frac{1}{n}\sum_{i=1}^{n} \hat{f}_i$ as the final functional SGD estimator of $f^\ast$ based on a dataset of sample size $n$. Note that this averaged estimator can be efficiently computed without storing all past estimators by using the recursive updating formula $\bar{f}_i = \frac{i-1}{i}\, \bar{f}_{i-1} + \frac{1}{i}\, \hat{f}_i$ on the fly. We will refer to the above SGD as functional SGD in order to differentiate it from SGD in Euclidean space, and to $\bar{f}_n$ as the functional SGD estimator (using $n$ samples). Throughout the remainder of the paper, we consider a zero initialization, $\hat{f}_0 = 0$, without loss of generality.
In functional SGD with total sample size (time horizon) $n$, the only adjustable component is the step size scheme $\{\gamma_i\}_{i=1}^{n}$, which is crucial for achieving fast convergence and accurate estimation (c.f. Remark 3.1). We examine two common schemes [15, 32]: (1) the constant step size scheme, where $\gamma_i \equiv \gamma(n)$ depends only on the total sample size $n$; (2) the non-constant step size scheme, where $\gamma_i = \gamma_0\, i^{-\zeta}$ decays polynomially in $i$ for some exponent $\zeta \in (0, 1)$ and constant $\gamma_0 > 0$. While the constant step scheme is more amenable to theoretical analysis, it suffers from two notable drawbacks: (1) it assumes prior knowledge of the sample size $n$, which is typically unavailable in streaming data scenarios, and (2) the optimal estimation error is only achieved at the $n$-th iteration, leading to suboptimal performance before that time point. In contrast, the non-constant step size scheme, despite significantly complicating our theoretical analysis, overcomes the aforementioned limitations and leads to a truly online algorithm that achieves rate-optimal estimation at any intermediate time point (c.f. Theorem 3.1). Due to this characteristic, we will also refer to the non-constant step size scheme as the online scheme.
Although functional SGD operates in the infinite-dimensional RKHS, it can be implemented using a finite-dimensional representation enabled by the kernel trick. Concretely, upon the arrival of the $i$-th observation $(x_i, y_i)$, we can express the time-$i$ intermediate estimator as $\hat{f}_i = \sum_{j=1}^{i} \beta_j\, K_{x_j}$ due to equation (2.3) and the zero initialization condition, where only the last entry in the coefficient vector needs to be updated:
$$\beta_i = \gamma_i\, \Big( y_i - \sum_{j=1}^{i-1} \beta_j\, K(x_j, x_i) \Big).$$
Note that the computational complexity at time $i$ is $O(i)$. Correspondingly, the functional SGD estimator at time $n$ can be computed through $\bar{f}_n = \sum_{j=1}^{n} \bar{\beta}_j\, K_{x_j}$, where (as can be proved by induction)
$$\bar{\beta}_j = \frac{n - j + 1}{n}\, \beta_j, \qquad j = 1, \ldots, n.$$
Consequently, the overall time complexity of the resulting algorithm is $O(n^2)$, and the space complexity is $O(n)$.
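As an illustration of this finite-dimensional implementation, the following sketch runs functional SGD with the kernel trick and forms the Polyak-averaged estimator from the coefficients; the Gaussian kernel, target function, noise level, and step-size constants are illustrative assumptions.

```python
import numpy as np

# Functional SGD via the kernel trick: the time-i iterate is
# f_i = sum_{j<=i} beta_j K(x_j, .), with only the newest coefficient updated,
# and the averaged estimator reweights beta_j by (n - j + 1)/n. The Gaussian
# kernel, target sin function, noise level and step sizes are illustrative.

def kernel(a, b, bandwidth=0.2):
    return np.exp(-(a - b) ** 2 / (2 * bandwidth ** 2))

def functional_sgd(x, y, gamma0=1.0, zeta=0.5):
    """One pass over the stream; returns the representer coefficients beta."""
    n = len(x)
    beta = np.zeros(n)
    for i in range(n):
        pred = beta[:i] @ kernel(x[:i], x[i])      # f_{i-1}(x_i): O(i) work
        gamma_i = gamma0 * (i + 1) ** (-zeta)      # polynomially decaying step
        beta[i] = gamma_i * (y[i] - pred)          # only the last entry is new
    return beta

def averaged_predict(beta, x_train, x_new):
    """Polyak-averaged estimator bar f_n = sum_j ((n - j + 1)/n) beta_j K(x_j, .)."""
    n = len(beta)
    w = (n - np.arange(n)) / n
    return (w * beta) @ kernel(x_train[:, None], x_new[None, :])

rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(size=n)
f_star = lambda t: np.sin(2 * np.pi * t)
y = f_star(x) + 0.3 * rng.normal(size=n)

beta = functional_sgd(x, y)
grid = np.linspace(0.05, 0.95, 50)
fit = averaged_predict(beta, x, grid)
print(np.max(np.abs(fit - f_star(grid))))          # sup-norm error on the grid
```

The inner evaluation at step $i$ touches $i$ stored coefficients, so a single pass costs $O(n^2)$ time and $O(n)$ space, matching the complexity statement above.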
2.3 Problem formulation
Our objective is to develop online inference for the non-parametric regression function $f^\ast$ based on the functional SGD estimator $\bar{f}_n$. Specifically, we aim to construct pointwise confidence intervals (local inference) for $f^\ast(x)$ at any fixed $x \in \mathcal{X}$, and a simultaneous confidence band (global inference) for $f^\ast$. We require these intervals and bands to be asymptotically valid, meaning that the coverage probabilities, i.e., the probability of $f^\ast(x)$ falling within the interval and the probability of $f^\ast$ lying within the band, converge to their common nominal level as $n \to \infty$.
The coverage probability analysis of these intervals and bands requires us to examine and prove the distributional convergence of two random quantities (with appropriate rescaling) based on the functional SGD estimator $\bar{f}_n$: the pointwise difference $\bar{f}_n(x) - f^\ast(x)$ for fixed $x$, and the supremum norm $\|\bar{f}_n - f^\ast\|_\infty$. In particular, the appropriate rescaling choice determines a precise convergence rate of $\bar{f}_n$ towards $f^\ast$ under the supremum norm metric. The characterization of the convergence rate of a non-parametric regression estimator under the supremum norm metric is a challenging and important problem in its own right. We note that the distribution of the supremum norm, after a proper re-scaling, behaves asymptotically like the supremum norm of a Gaussian process, which is not practically feasible to estimate. Therefore, for inference purposes, it is not necessary to explicitly characterize this distributional limit; instead, we will prove a bootstrap consistency by showing that the Kolmogorov distance between the sampling distribution of this supremum norm and its bootstrap counterpart converges to zero as $n \to \infty$.
In our theoretical development to address these problems, we will utilize a recursive expansion of the functional SGD updating formula to construct a higher-order expansion of $\bar{f}_n - f^\ast$ under the supremum norm metric. Building upon this expansion, we will establish in Section 3 the distributional convergence of the two aforementioned random quantities and characterize their limiting distributions, with an explicit representation of the limiting variance of $\bar{f}_n(x) - f^\ast(x)$ in the large-sample setting. However, these limiting distributions and variances depend on the spectral decomposition of the kernel $K$, the marginal distribution $\mathbb{P}_X$ of the design variable, and the unknown noise variance $\sigma^2$, which are either inaccessible or computationally expensive to evaluate in an online learning scenario. To overcome this challenge, we will propose a scalable bootstrap-based inference method in Section 4, enabling efficient online inference for $f^\ast$.
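To preview the bootstrap-based method proposed in Section 4, the following sketch runs several perturbed functional SGD trajectories alongside the original one, each scaling its stochastic gradient by an i.i.d. multiplier with unit mean and unit variance, and reads off a pointwise interval from the bootstrap quantiles. The multiplier distribution, kernel, and step sizes are illustrative assumptions, not the paper's prescriptions.

```python
import numpy as np

def rbf(a, b, bw=0.2):
    return np.exp(-(a - b) ** 2 / (2 * bw ** 2))

def sgd_with_bootstrap(x, y, B=100, gamma0=1.0, zeta=0.5, seed=0):
    """Original trajectory plus B multiplier-bootstrap trajectories on one stream."""
    rng = np.random.default_rng(seed)
    n = len(x)
    beta = np.zeros(n)               # original coefficients
    beta_b = np.zeros((B, n))        # bootstrap coefficients
    for i in range(n):
        k_hist = rbf(x[:i], x[i])
        gamma_i = gamma0 * (i + 1) ** (-zeta)
        beta[i] = gamma_i * (y[i] - beta[:i] @ k_hist)
        w = rng.exponential(1.0, size=B)            # multipliers: mean 1, variance 1
        beta_b[:, i] = gamma_i * w * (y[i] - beta_b[:, :i] @ k_hist)
    return beta, beta_b

def avg_eval(coef, x_train, x0):
    """Evaluate the Polyak-averaged estimator(s) at a single point x0."""
    n = x_train.shape[0]
    wts = (n - np.arange(n)) / n
    return (coef * wts) @ rbf(x_train, x0)

rng = np.random.default_rng(3)
n = 1000
x = rng.uniform(size=n)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=n)
beta, beta_b = sgd_with_bootstrap(x, y)

x0 = 0.5
f_hat = avg_eval(beta, x, x0)
f_boot = avg_eval(beta_b, x, x0)                    # B bootstrap evaluations at x0
lo, hi = np.quantile(f_boot - f_hat, [0.025, 0.975])
print(f"95% pointwise interval at x0 = 0.5: [{f_hat - hi:.3f}, {f_hat - lo:.3f}]")
```

Because all trajectories share the single data stream and maintain only their own coefficient vectors, the procedure preserves the online character of the algorithm: no resampling of historical data is ever required.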
3 Finite-Sample Analysis of Functional SGD Estimator
In this section, we start by deriving a higher-order expansion of $\bar{f}_n - f^\ast$ under the supremum norm metric. We then proceed to establish the distributional convergence of $\bar{f}_n(x) - f^\ast(x)$ for any fixed $x$ by characterizing the leading term in the expansion. These results will be useful for motivating our online local and global inference for $f^\ast$ in the following section.
3.1 Higher-order expansion under supremum norm
We begin by decomposing the functional SGD update of $\bar{f}_n$ into two leading recursive formulas and a higher-order remainder term. This decomposition allows us to distinguish between the deterministic term responsible for the estimation bias and the stochastic fluctuation term contributing to the estimation variance. Concretely, writing $\eta_i = \hat{f}_i - f^\ast$ and plugging $y_i = f^\ast(x_i) + \epsilon_i$ into the recursive updating formula (2.4), we obtain
$$\eta_i = \big(I - \gamma_i\, K_{x_i} \otimes K_{x_i}\big)\, \eta_{i-1} + \gamma_i\, \epsilon_i\, K_{x_i}, \qquad \eta_0 = -f^\ast. \qquad (3.1)$$
Let $\Sigma = \mathbb{E}[K_X \otimes K_X]$ denote the population-level covariance operator, so that for any $f, g \in \mathbb{H}$, we have $\langle f, \Sigma g \rangle_{\mathbb{H}} = \mathbb{E}[f(X)\, g(X)]$. Now we recursively define the leading bias term $\eta_i^{\mathrm{bias}}$ through
$$\eta_i^{\mathrm{bias}} = (I - \gamma_i \Sigma)\, \eta_{i-1}^{\mathrm{bias}}, \qquad \eta_0^{\mathrm{bias}} = -f^\ast, \qquad (3.2)$$
which collects the leading deterministic component in (3.1), and the leading noise term $\eta_i^{\mathrm{noise}}$ through
$$\eta_i^{\mathrm{noise}} = (I - \gamma_i \Sigma)\, \eta_{i-1}^{\mathrm{noise}} + \gamma_i\, \epsilon_i\, K_{x_i}, \qquad \eta_0^{\mathrm{noise}} = 0, \qquad (3.3)$$
which collects the leading stochastic fluctuation component in (3.1), so that we have the following decomposition for the recursion:
$$\eta_i = \eta_i^{\mathrm{bias}} + \eta_i^{\mathrm{noise}} + \mathrm{remainder}_i. \qquad (3.4)$$
Correspondingly, we define $\bar{\eta}_n^{\mathrm{bias}} = \frac{1}{n}\sum_{i=1}^{n} \eta_i^{\mathrm{bias}}$ and $\bar{\eta}_n^{\mathrm{noise}} = \frac{1}{n}\sum_{i=1}^{n} \eta_i^{\mathrm{noise}}$ as the leading bias and noise terms, respectively, in the functional SGD estimator $\bar{f}_n$ (after averaging). The following Theorem 3.1 presents finite-sample bounds for the two leading terms and the remainder term associated with $\bar{f}_n - f^\ast$ under the supremum norm metric. The results indicate that the remainder term is of strictly higher order (in terms of dependence on $n$) compared to the two leading terms, validating the term "leading" for them.
Theorem 3.1 (Finite-sample error bound under supremum norm).
Suppose that the kernel $K$ satisfies Assumptions A1-A2, and that $f^\ast \in \mathbb{H}$ satisfies a boundedness condition.
-
1.
(constant step size) Assume that the step size satisfies , then we have
where are constants independent of . Furthermore, assume that the step size , we have
where the randomness is with respect to the randomness in .
-
2.
(non-constant step size) Assume the step size to satisfy for some , then we have
where are constants independent of . For the special choice of , we have
A proof of this theorem is based on a higher-order recursion expansion and a careful supremum norm analysis of the recursive formula; see Remark 3.2 and the proof sketch in Section 6. The detailed proof is given in [33].
Remark 3.1.
As demonstrated in Theorem 3.1, the selection of the step size (the constant $\gamma$, or the decay exponent in the non-constant case) in the SGD estimator entails a trade-off between bias and variance: tilting the scheme toward a smaller accumulated step size increases bias while reducing variance, and vice versa. This trade-off can be optimized by choosing the (optimal) step size. This is why we specifically focus on this particular choice in the non-constant step size setting in the theorem, which also significantly simplifies the proof. Interestingly, according to Theorem 3.1, the step size scheme in functional SGD plays a role similar to that of the regularization parameter in regularization-based approaches in preventing overfitting. To see this, let us consider classic kernel ridge regression (KRR), where the estimator is constructed as
$$\hat{f}_{\mathrm{KRR}} = \arg\min_{f \in \mathbb{H}} \Big\{ \frac{1}{n}\sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2 + \lambda\, \|f\|_{\mathbb{H}}^2 \Big\},$$
where $\lambda$ serves as the regularization parameter to avoid overfitting. It can be shown (e.g., [10]) that the squared bias of $\hat{f}_{\mathrm{KRR}}$ has an order of $\lambda$, while the variance has an order of $N(\lambda)/n$, where $N(\lambda) = \sum_{\nu} \mu_\nu / (\mu_\nu + \lambda)$ represents the effective dimension of the model and is of order $\lambda^{-1/(2\alpha)}$ under Assumption A2. In comparison, the squared bias and variance of the functional SGD estimator are of order $(n\gamma)^{-1}$ and $(n\gamma)^{1/(2\alpha)}/n$, respectively. Therefore, $(n\gamma)^{-1}$ and $(n\gamma)^{1/(2\alpha)}$ respectively play the same role as the regularization parameter $\lambda$ and the effective dimension $N(\lambda)$ in KRR. More generally, a step size scheme $\{\gamma_i\}$ corresponds to an effective regularization parameter of the order $\big(\sum_{i=1}^{n} \gamma_i\big)^{-1}$. Note that the accumulated step size $\sum_{i=1}^{n} \gamma_i$ can be interpreted as the total path length of the functional SGD algorithm. This total path length determines the early stopping of the algorithm, effectively controlling the complexity of the learned model and preventing overfitting.
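For comparison with the implicit regularization of functional SGD, the sketch below computes the KRR estimator by solving its normal equations and evaluates an empirical effective dimension from the eigenvalues of the kernel matrix; the kernel, sample size, and value of lambda are illustrative assumptions.

```python
import numpy as np

# KRR comparison: solve (K + n*lambda*I) c = y for the representer coefficients,
# then compute an empirical effective dimension N(lambda) from the eigenvalues
# of K / n. The Gaussian kernel and lambda = 0.01 are illustrative choices.

def rbf_gram(x, bw=0.2):
    return np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * bw ** 2))

rng = np.random.default_rng(7)
n = 400
x = rng.uniform(size=n)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=n)

K = rbf_gram(x)
lam = 1e-2                                    # regularization parameter lambda
c = np.linalg.solve(K + n * lam * np.eye(n), y)
f_hat = K @ c                                 # fitted values at the design points

mu = np.linalg.eigvalsh(K / n)                # empirical eigenvalues of the kernel operator
eff_dim = float(np.sum(mu / (mu + lam)))      # effective dimension N(lambda)
mse = float(np.mean((f_hat - np.sin(2 * np.pi * x)) ** 2))
print(round(eff_dim, 1), round(mse, 4))
```

Solving the linear system costs $O(n^3)$ time and $O(n^2)$ memory, in contrast with the $O(n^2)$ time and $O(n)$ space of functional SGD, which is precisely the scalability gap motivating the SA approach.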
Remark 3.2.
The higher-order recursion expansion and the supremum norm bound in the theorem provide finer insight into the distributional behavior of $\bar{f}_n - f^\ast$ and pave the way for inference. That is, we only need to focus on the leading noise recursive term for statistical inference. In our proof of bounding the supremum norm of the remainder term in equation (3.4), we further decompose the remainder into two parts: the bias remainder and the noise remainder. Note that a loose analysis of the noise remainder, obtained by directly converting an $L_2$ norm bound into a supremum norm bound using the reproducing kernel property of the original RKHS $\mathbb{H}$, would result in a bound whose order is comparable to that of the leading term. This motivates us to introduce an augmented RKHS equipped with a modified kernel and norm. The augmented RKHS norm weakens the impact of high-frequency components compared to the $\mathbb{H}$-norm, and its induced norm turns out to be better aligned with the functional supremum norm in our context. As a result, the supremum norm of a function can be bounded, up to constants, by its augmented RKHS norm. In particular, a supremum norm bound based on controlling the augmented norm, with an appropriate choice of the augmentation level, can be substantially better than one based on the $\mathbb{H}$-norm; see Section 6 and Section 8.2 for further details.
As we discussed in Section 2.3, for inference purposes it is not necessary to explicitly characterize the distributional limit of the supremum norm $\|\bar{f}_n - f^\ast\|_\infty$; instead, we will prove a bootstrap consistency by showing that the Kolmogorov distance between the sampling distribution of this supremum norm and its bootstrap counterpart converges to zero as $n \to \infty$. However, the pointwise limit of $\bar{f}_n(x) - f^\ast(x)$ for fixed $x$ has an easy characterization. Therefore, we present the pointwise convergence limit and use it to discuss the impact of online estimation in the non-parametric regression model in the following subsection.
3.2 Pointwise distributional convergence
According to Theorem 3.1, the large-sample behavior of the functional SGD estimator is completely determined by the two leading processes: the bias term and the noise term. According to (3.2), under the constant step size $\gamma_i \equiv \gamma$, the leading bias term has the explicit expression
$$\bar{\eta}_n^{\mathrm{bias}} = -\frac{1}{n}\sum_{i=1}^{n} (I - \gamma\Sigma)^i f^\ast, \qquad (3.5)$$
and the leading noise term is
$$\bar{\eta}_n^{\mathrm{noise}} = \frac{\gamma}{n}\sum_{j=1}^{n} \Big[ \sum_{i=j}^{n} (I - \gamma\Sigma)^{\,i-j} \Big]\, \epsilon_j\, K_{x_j}. \qquad (3.6)$$
For each fixed $x$, conditioning on the design $\{x_i\}_{i=1}^{n}$, the leading noise term evaluated at $x$ is a weighted average of independent and centered normally distributed random variables. This representation enables us to identify the limiting distribution of $\bar{f}_n(x) - f^\ast(x)$ (this subsection) and to conduct local inference (i.e. pointwise confidence intervals) by a bootstrap method (next section). Under Assumption A2, the weight associated with the $j$-th observation pair $(x_j, y_j)$ decreases in $j$. This diminishing impact trend is inherent to online learning, as later observations tend to have a smaller influence than earlier observations. This characteristic is radically different from offline estimation settings, where all observations contribute equally to the final estimator, and it changes the asymptotic variance appearing in Theorem 3.2.
Furthermore, the entire leading noise process can be viewed as a weighted and non-identically distributed empirical process indexed by the spatial location. This characterization enables us to conduct global inference (i.e. simultaneous confidence bands) for non-parametric online learning by borrowing and extending the recent developments [26, 34, 35] on Gaussian approximation and multiplier bootstraps for suprema of (equally weighted and identically distributed) empirical processes, which will be the main focus of the next section.
In the following Theorem 3.2, we prove, by analyzing the leading noise term $\bar{\eta}_n^{\mathrm{noise}}$, a finite-sample upper bound on the Kolmogorov distance between the sampling distribution of the suitably centered and scaled error $\bar{f}_n(x) - f^\ast(x)$ and the standard normal distribution (i.e. the supremum distance between the two cumulative distribution functions) for any fixed $x$.
Theorem 3.2 (Pointwise convergence).
Assume that the kernel satisfies Assumptions A1-A2.
-
1.
(Constant step size) Consider the step size with . For any fixed , we have
where . Here, the bias term has an explicit expression as given in (3.5), and the (limiting) variance is
-
2.
(Non-constant step size) Consider the step size for . For any fixed , we have
Here, the bias term takes an explicit expression as , and the variance is
Theorem 3.2 establishes that the sampling distribution of at any fixed can be approximated by a normal distribution . According to Theorem 3.1, the bias has the order of while the variance has the order of ; Theorem 3.2 also implies that the minimax convergence rate of estimating can be achieved with , which attains an optimal bias-variance tradeoff. In practice, the bias term can be suppressed by applying an undersmoothing technique; see Remark 4.3 for details.
Remark 3.3.
From the theorem, we see that the (limiting) variance is precisely the variance of the scaled leading noise at , that is, ; and has the same order in both the constant and non-constant cases. The contribution of each data point to the variance, however, differs between the two cases. Concretely, in the constant step size case, let be the vector of variation, where represents the contribution to from the -th arriving observation . According to equation (3.6), and is of order in the observation index , which decreases monotonically to nearly as grows to . In comparison, in the online (non-constant) step case, we denote by the vector of variation, with being the contribution from the -th observation. A careful calculation shows that , which has order and decreases more slowly than in the constant step size case. This means that the non-constant step scheme yields a more balanced weighted average over the entire dataset, which tends to lead to a smaller asymptotic variance.
Figure 1 compares the individual variation contributions for the constant and non-constant step cases. We keep the total step size budget the same in both cases (which also makes the two leading bias terms roughly equal); that is, we choose the constant in the non-constant step size so that with being the constant step size. The data index is plotted on the horizontal axis of Figure 1 (A), with the variation contribution on the vertical axis. As we can see, the variation contribution from each observation decreases as observations arrive later in both cases; however, the pattern is flatter in the non-constant step case. Figure 1 (B) is a violin plot visualizing the distributions of the components in and . Specifically, the variation among (depicted by the short blue interval) is smaller in the non-constant case, suggesting reduced fluctuation in individual variation for this setting. As detailed in Section 5, our numerical analysis further confirms that a non-constant learning rate outperforms a constant one (e.g., Figure 2). An interesting direction for future research is to identify an optimal learning-rate decay scheme by minimizing the variance as a function of , and to determine whether such a scheme results in an equal contribution from each observation; this is beyond the scope of this paper.
Figure 1: Individual variation contributions under constant and non-constant step sizes (panel A) and their distributions (panel B).
Remark 3.4.
The Kolmogorov distance bound between the sampling distribution of and the standard normal distribution depends on the step size and sample size . In particular, is the remainder bound stated in Theorem 3.1, which is negligible compared to when in the constant step size case. Consequently, a smaller or larger sample size leads to a smaller Kolmogorov distance. The same conclusion also applies to the non-constant step size case if we choose .
Although Theorem 3.2 explicitly characterizes the distribution of the SGD estimator, the expression for the standard deviation depends on the eigenvalues and eigenfunctions of , the underlying distribution of the design , and the unknown noise variance , which are typically unknown in practice. One approach is to use plug-in estimators for these unknown quantities, such as the empirical eigenvalues and eigenfunctions obtained through an SVD of the empirical kernel matrix , whose -th element is . However, computing these plug-in estimators requires access to all observed data points and has a computational complexity of , which undermines the sequential updating advantage of SGD. In the following section, we develop a scalable inference framework that uses multiplier-type bootstraps to generate randomly perturbed SGD estimators upon the arrival of each observation. This approach enables us to bypass the evaluation of when constructing confidence intervals.
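To make the scalability point concrete, here is a minimal sketch of the plug-in route (with an assumed Gaussian kernel and our own function names; the paper's kernel is generic): it must materialize the full empirical kernel matrix and eigendecompose it, which requires all data points at once rather than a single streaming pass.

```python
import numpy as np

def plugin_eigensystem(xs, bw=0.25):
    """Plug-in estimates of kernel eigenvalues/eigenvectors from the empirical
    kernel matrix. Needs all n points at once and at least quadratic memory,
    in contrast to the O(1)-per-step bootstrap approach of Section 4."""
    # Empirical kernel matrix K_ij = K(x_i, x_j) / n for an illustrative Gaussian kernel.
    K = np.exp(-(xs[:, None] - xs[None, :]) ** 2 / (2 * bw ** 2)) / len(xs)
    evals, evecs = np.linalg.eigh(K)        # full symmetric eigendecomposition
    return evals[::-1], evecs[:, ::-1]      # sorted from largest to smallest

xs = np.linspace(0.0, 1.0, 300)
evals, _ = plugin_eigensystem(xs)
```

The eigendecomposition alone is cubic in the sample size, which is why the multiplier bootstrap of the next section avoids this route entirely.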
4 Online Statistical Inference via Multiplier Bootstrap
In this section, we first propose a multiplier bootstrap method for inference based on the functional SGD estimator. After that, we study the theoretical properties of the proposed method, which serve as the cornerstone for proving bootstrap consistency for the local inference of constructing pointwise confidence intervals and the global inference of constructing simultaneous confidence bands. Finally, we describe the resulting online inference algorithm for non-parametric regression based on the functional SGD estimator.
4.1 Multiplier bootstrap for functional SGD
Recall that Theorem 3.1 provides a high-probability decomposition of the functional SGD estimator (relative to the supremum norm metric) into the following sum
where is the leading bias process defined in equation (3.5) and is the leading noise process defined in equation (3.6). Motivated by this result, we propose in this section a multiplier bootstrap method to mimic and capture the random fluctuation of this leading noise process , where recall that the term only depends on the -th design point , and the primary source of randomness in comes from the random noises , which are i.i.d. normally distributed under a standard non-parametric regression setting.
Our online inference approach is inspired by the multiplier bootstrap idea proposed in [19] for online inference of parametric models using SGD. Remarkably, we demonstrate that their development can be naturally adapted to enable online inference of non-parametric models based on functional SGD. The key idea is to perturb the stochastic gradient in the functional SGD by incorporating a random multiplier upon the arrival of each data point. Specifically, let , , denote a sequence of i.i.d. random bootstrap multipliers whose mean and variance are both equal to one. At time , with the observed data point , we use the randomly perturbed functional SGD updating formula:
(4.1) | ||||
which modifies equations (2.3) and (2.4) for functional SGD by multiplying the stochastic gradient by the random multiplier . We adopt the same zero initialization and call the (Polyak) averaged estimator the bootstrapped functional SGD estimator (with samples).
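As a concrete illustration, the sketch below gives one plausible implementation of the perturbed update (4.1) for scalar inputs. The Gaussian kernel, its bandwidth, and all function names are our illustrative assumptions; the paper's kernel only needs to satisfy Assumptions A1-A2.

```python
import numpy as np

def gauss_kernel(xs, x, bw=0.25):
    # Illustrative kernel choice; the paper's K is generic.
    return np.exp(-(xs - x) ** 2 / (2 * bw ** 2))

def functional_sgd(xs, ys, steps, multipliers=None):
    """One pass of (bootstrapped) functional SGD.

    Iterate: f_t = f_{t-1} + steps[t] * w_t * (y_t - f_{t-1}(x_t)) * K(x_t, .),
    stored through representer coefficients c so that f_t(x) = sum_i c[i] K(x_i, x).
    With multipliers w_t (mean = variance = 1) this is the perturbed update;
    with w_t = 1 it reduces to the plain functional SGD update.
    Returns the coefficients of the Polyak-averaged estimator.
    """
    n = len(xs)
    w = np.ones(n) if multipliers is None else multipliers
    coef = np.zeros(n)   # representer coefficients of the running iterate f_t
    cum = np.zeros(n)    # running sum of all iterates' coefficients
    for t in range(n):
        resid = ys[t] - coef @ gauss_kernel(xs, xs[t])   # y_t - f_{t-1}(x_t)
        coef[t] = steps[t] * w[t] * resid                # (perturbed) gradient step
        cum += coef
    return cum / n

def predict(avg_coef, xs, x):
    return avg_coef @ gauss_kernel(xs, x)
```

Drawing B independent multiplier sequences and running this update B times (in parallel) yields the B bootstrapped estimators used for inference in Section 4.3.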
4.2 Bootstrap consistency
Let us now proceed to derive a higher-order expansion of analogous to Section 3.1 and compare its leading terms with those associated with the original functional SGD estimator . Utilizing equation (4.1) and plugging in , we obtain the following expression:
Since has unit mean, we have the important identity . Similarly to equations (3.1)-(3.4), due to this key identity we can still recursively define the leading bootstrapped bias term through
(4.2) |
which coincides with the original leading bias term, i.e. ; and the leading bootstrapped noise term through
so that a similar decomposition as in equation (3.4) holds,
(4.3) |
Correspondingly, we define and as the leading bootstrapped bias and noise terms, respectively, in the bootstrapped functional SGD estimator.
Notice that also coincides with the original leading bias term , i.e. . Therefore, has the same explicit expression as equation (3.5); while the leading bootstrapped noise term has a slightly different expression that incorporates the bootstrap multipliers as
(4.4) |
where recall that is defined in equation (3.6) and only depends on . By taking the difference between and , we obtain
(4.5) |
This expression also takes the form of a weighted and non-identically distributed empirical process with “effective” noises . Since has unit mean and variance, these effective noises have the same first two moments as the original noises , suggesting that the difference , conditional on the data , captures the random pattern of the original leading noise term ; this leads to the so-called bootstrap consistency, formally stated in the theorem below.
Assumption A3.
For , the bootstrap multipliers are i.i.d. samples of a random variable that satisfies , and for all , with a constant .
One simple example that satisfies Assumption A3 is . A second example is a bounded random variable, such as a scaled and shifted uniform random variable on the interval . A popular choice in practice is a discrete random variable such that .
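These choices can be sketched as follows; the specific parameters (e.g., the two-point law on {0, 2} with equal probabilities) are common examples we assume for illustration, since any distribution with unit mean and unit variance satisfying Assumption A3 qualifies:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Three multiplier laws, each with mean 1 and variance 1 (Assumption A3):
w_normal  = rng.normal(1.0, 1.0, size=n)                     # N(1, 1)
w_uniform = rng.uniform(1 - 3 ** 0.5, 1 + 3 ** 0.5, size=n)  # bounded; var = (2*sqrt(3))**2 / 12 = 1
w_binary  = rng.choice([0.0, 2.0], size=n)                   # two-point; P(0) = P(2) = 1/2

for w in (w_normal, w_uniform, w_binary):
    assert abs(w.mean() - 1.0) < 0.02 and abs(w.var() - 1.0) < 0.03
```

The bounded and two-point choices also satisfy the moment-growth condition in A3 trivially, since they are almost surely bounded.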
Let denote the data of sample size , and let denote the conditional probability measure given . We first establish the bootstrap consistency for local inference of the leading noise term in the following Theorem 4.1.
Theorem 4.1 (Bootstrap consistency for local inference of leading noise term).
Assume that the kernel satisfies Assumptions A1-A2 and the multiplier weights satisfy Assumption A3.
-
1.
(Constant step size) Consider the step size with for some . Then for any , we have with probability at least ,
where are constants independent of .
-
2.
(Non-constant step size) Consider the step size , , for some . Then the following bound holds with probability at least ,
(4.6)
Remark 4.1.
Recall that from (3.6) and (4.5), we can express and . Theorem 3.2 shows that can be approximated by a normal distribution . To prove Theorem 4.1, we introduce an intermediate empirical process evaluated at , defined as , where the ’s are i.i.d. standard normal random variables, such that the (conditional) variance of matches the (conditional) variance of .
Theorem 4.2 (Bootstrap consistency for global inference of leading noise term).
Assume that the kernel satisfies Assumptions A1-A2 and the multiplier weights satisfy Assumption A3.
-
1.
(Constant step size) Consider the step size with for some . Then the following bound holds with probability at least (with respect to the randomness in data )
-
2.
(Non-constant step size) Consider the step size , , for some . Then the following bound holds with probability at least ,
Remark 4.2.
Theorem 4.2 demonstrates that the sampling distribution of can be closely approximated by the conditional distribution of given the data set . This theorem serves as the theoretical foundation for adopting the multiplier bootstrap method detailed in Section 4.1 for global inference. Recall that the optimal step size for achieving the minimax optimal estimation error is for the constant step size and for the non-constant step size (Theorem 3.2). To ensure that the Kolmogorov distance bound in Theorem 4.2 decays to as under these step sizes, we require . It is likely that our current Kolmogorov distance bound, which is dominated by an error term arising from applying the Gaussian approximation to analyze and through space discretization (see Section 6.2), can be substantially refined. We leave this improvement, which would consequently lead to a weaker requirement on , to future research.
Since the leading noise terms and contribute the primary source of randomness in the functional SGD estimator and its bootstrapped counterpart (Theorem 3.1), Theorem 4.2 implies the bootstrap consistency of statistical inference for based on the bootstrapped functional SGD. In particular, we present the following corollary, which establishes a high-probability supremum norm bound for the remainder term in the bootstrapped functional SGD decomposition (4.3). Such a bound further implies that the sampling distribution of can be effectively approximated by the conditional distribution of given the data . Recall that we use to denote the conditional probability measure given .
Corollary 4.3 (Bootstrap consistency for functional SGD inference).
Assume that the kernel satisfies Assumptions A1-A2 and the multiplier weights satisfy Assumption A3.
-
1.
(Constant step size) Consider the step size with for some . Then it holds with probability at least with respect to the randomness of that
(4.7) Furthermore, for , it holds with probability at least , that
where denotes the bias term, are constants.
-
2.
(Non-constant step size) Consider the step size for . Then it holds with probability at least that
Furthermore, it holds with probability at least that
Remark 4.3.
Corollary 4.3 suggests that a smaller step size (or ) and a larger sample size result in more accurate uncertainty quantification. As discussed in Section 4.2, the functional SGD estimator and its bootstrap counterpart share the same leading bias term, which eliminates the bias in the conditional distribution of given . However, the bias term still exists in the sampling distribution of . According to Theorem 3.1, this bias term can be bounded by with high probability, while the convergence rate of the leading noise term under the supremum norm metric is of order . Therefore, to make the bias term asymptotically negligible, we can adopt the common practice of “undersmoothing” [36, 37]. In our context, this means slightly enlarging the step size as (constant step size) or for (non-constant step size), where is any small positive constant.
4.3 Online inference algorithm
-
1.
Normal CI: , where .
-
2.
Percentile CI: , where and are the sample -th and -th quantile of .
As we demonstrated in Theorem 4.2, the sampling distribution of can be effectively approximated by the conditional distribution of given the data , using the bootstrapped functional SGD. This result provides a strong foundation for conducting online statistical inference based on the bootstrap. Specifically, we can run bootstrapped functional SGD in parallel, producing estimators for with
where are i.i.d. bootstrap weights satisfying Assumption A3. We then approximate the sampling distribution of by the empirical distribution of conditional on , and further construct pointwise confidence intervals and a simultaneous confidence band for . We can also use the empirical variance of to approximate the variance of . Based on these quantities, we can construct the pointwise confidence interval for at any fixed in two ways:
-
1.
Normal CI - given the sequence of bootstrapped estimators for , we calculate the variance as , and construct the confidence interval for as ;
-
2.
Percentile CI - given the sequence of bootstrapped estimators for , we calculate , and its -th and -th quantiles as and , then construct the CI for as .
To construct the simultaneous confidence band, we first choose a dense set of grid points ; then for each , we calculate to approximate . Accordingly, we obtain the following bootstrapped supremum norms:
(4.8) |
Denote the sample -th and the -th quantiles of (4.8) as and . Then we construct a confidence band for as .
Our online inference algorithm is computationally efficient, as it only requires one pass over the data, and the bootstrapped functional SGD can be computed in parallel. The detailed algorithm is summarized in Algorithm 1.
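A minimal sketch of the interval and band constructions above, assuming the averaged estimate and B bootstrapped curves are already available on a grid (the percentile interval here uses plain bootstrap quantiles, one common variant; all names are ours):

```python
import numpy as np
from statistics import NormalDist

def normal_ci(f_hat, boot, alpha=0.05):
    """Normal CI at a fixed x: f_hat +/- z_{1-alpha/2} * bootstrap standard error."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = boot.std(ddof=1)           # boot: length-B bootstrap draws at x
    return f_hat - z * se, f_hat + z * se

def percentile_ci(boot, alpha=0.05):
    """Percentile CI at a fixed x from the empirical bootstrap quantiles."""
    return tuple(np.quantile(boot, [alpha / 2, 1 - alpha / 2]))

def simultaneous_band(f_hat_grid, boot_grid, alpha=0.05):
    """Band of half-width q, where q is the (1-alpha) quantile of the
    bootstrapped supremum deviations over the grid, as in (4.8).

    f_hat_grid: (m,) averaged estimate on the grid; boot_grid: (B, m) curves.
    """
    sup_dev = np.max(np.abs(boot_grid - f_hat_grid), axis=1)
    q = np.quantile(sup_dev, 1 - alpha)
    return f_hat_grid - q, f_hat_grid + q
```

Because the sup-deviation quantile dominates any single-point quantile, the simultaneous band is wider than the pointwise intervals, as expected for global inference.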
5 Numerical Study
In this section, we test our proposed online inference approach via simulations. Concretely, we generate synthetic data in a streaming setting with a total sample size of . We use to represent the -th observed data point for . We evaluate the performance of our proposed method as described in Algorithm 1 for constructing confidence intervals for at for , and compare our method with three existing alternative approaches, which we refer to as “offline” methods. “Offline” methods involve calculating the confidence intervals after all data have been collected, up to the -th observation’s arrival, which necessitates refitting the model each time new data arrive. We also evaluate the coverage probabilities of the simultaneous confidence bands constructed in Algorithm 1. We first enumerate the compared offline confidence interval methods as follows:
-
(i)
Offline Bayesian confidence interval (Offline BA) proposed in [38]: According to [39], a smoothing spline estimator corresponds to a Bayesian procedure with a partially improper prior. Given this connection between smoothing splines and Bayes estimates, confidence intervals can be derived from the posterior covariance function of the estimate. In practice, we implement Offline BA using the “gss” R package [28].
-
(ii)
Offline bootstrap normal interval (Offline BN) proposed in [40]: Let and denote the estimates of and , respectively, obtained by minimizing (5.1) with as below.
(5.1) where is the roughness penalty and is the second derivative evaluated at . A bootstrap sample is generated from
where the are i.i.d. Gaussian white noise with variance . Based on the bootstrap sample, we calculate the bootstrap estimate as . Repeating this times, we obtain a sequence of offline bootstrap estimates . We estimate the variance of as . An offline normal bootstrap confidence interval for is then constructed as .
-
(iii)
Offline bootstrap percentile interval (Offline BP): We apply the same data bootstrapping procedure as in Offline BN, which produces the estimate based on the bootstrap sample. The confidence interval is then constructed using the percentile method suggested in [41]. Specifically, let and represent the -th quantile and the -th quantile of the empirical distribution of , respectively. A confidence interval for is then constructed as .
As increases, offline methods lead to a considerable increase in computational cost. For instance, Offline BA/BN theoretically has a total time complexity of order (with an cost at time ). In contrast, online bootstrap confidence intervals are computed sequentially as new data points become available, making them well-suited for streaming data settings. They have a theoretical complexity of at most (with an cost at time ). We examine both the normal CI and percentile CI, as outlined in Algorithm 1, when constructing the confidence interval.
We examine the effects of various step size schemes. Specifically, we consider a constant step size , where represents the total sample size at which the CIs are constructed, and an online step size for . A limitation of the constant step size method is its dependence on prior knowledge of the total time horizon ; consequently, the estimator is only rate-optimal at the -th step. We assess our proposed online bootstrap confidence intervals in four scenarios: (i) Online BNC, which uses a constant step size for the normal interval; (ii) Online BPC, which uses a constant step size for the percentile interval; (iii) Online BNN, which employs a non-constant step size for the normal interval; and (iv) Online BPN, which uses a non-constant step size for the percentile interval.
We generate our data as i.i.d. copies of the random pair , where is drawn from the uniform distribution on the interval , and . Here, is the unknown regression function to be estimated, and represents Gaussian white noise with variance . We consider the following three cases of , :
Case 1: | |||
Case 2: | |||
Case 3: |
Here, with denoting the beta function, and is the gamma function with when . Cases 2 and 3 are designed to mimic increasingly complex “truth” scenarios, similar to the settings in [38, 40].
We draw training data of size from these models. In our online approaches, we first use 500 data points to build an initial estimate and then employ SGD to derive online estimates from the 501st to the 3000th data point. Given that our framework is designed for online settings, we can construct the confidence band based on the datasets of size , , , , and , i.e., using the averaged estimators at . We repeat the data generation process times for each case. For each replicate, upon the arrival of a new data point, we apply the proposed multiplier bootstrap method for online inference, using bootstrap samples (i.e., in Algorithm 1) with bootstrap weight generated from a normal distribution with mean and standard deviation . We then construct confidence intervals based on Algorithm 1. Our results will show the coverage and distribution of the lengths of the confidence intervals built at , and .
Figure 2: Coverage (panels A1, B1, C1) and confidence interval lengths (panels A2, B2, C2) for Cases 1-3.
Figure 3: Simultaneous confidence bands for Cases 1-3 at increasing sample sizes.
As shown in Figure 2, the coverage of all methods approaches the predetermined level of as increases. The offline Bayesian method exhibits the lowest coverage of all; while it has the longest average confidence interval length in Cases 1-3, it also has the smallest variance in confidence interval lengths. The offline bootstrap-based methods demonstrate higher coverage and shorter average confidence interval lengths than the offline Bayesian method. The variance in confidence interval lengths for these bootstrap-based methods is larger, due to the bootstrap multiplier resampling procedure or the random step size used in our proposed online bootstrap procedures. As the sample size grows, the variance of the confidence interval length diminishes for all methods. Our online bootstrap procedure with a non-constant step size outperforms the others in both the average length and the variance of the confidence interval: it offers the shortest average length and the smallest variance compared with the Bayesian confidence interval, the offline bootstrap methods, and the online bootstrap procedure with a constant step size. Moreover, the online bootstrap method with a non-constant step size reaches the predetermined coverage level of more quickly than the other methods. We only tested our methods (Online BNN and Online BPN) with an increased at and , due to computational costs. As observed in Figure 2 (A1), (B1), and (C1), the coverage stabilizes at the predetermined level of . We also use our proposed online bootstrap method, as outlined in Algorithm 1, to construct a confidence band of level with a step size of at . As seen in Figure 3, the average width of the confidence band decreases as the sample size increases for Cases 1-3, and all bands cover the true function curve (the solid black curve), indicating that the accuracy of our confidence band estimates improves with a larger sample size.
Figure 4: (A) Cumulative computation time and (B) current computation time for the compared methods.
Finally, we compared the computational time of various methods in constructing confidence intervals on a computer workstation equipped with a 32-core 3.50 GHz AMD Threadripper Pro 3975WX CPU and 64GB RAM. We recorded the computational times as data points arrived and calculated the cumulative computational times up to for both offline and online algorithms. The normal and percentile Bootstrap methods displayed similar computational times, so we chose to report the computational time of the percentile bootstrap interval for both offline and online approaches. Despite leveraging parallel computing to accelerate the bootstrap procedures, the offline bootstrap algorithms still demanded significant computation time. This is attributed to the need to refit the model each time a new data point arrived, which substantially raises the computational expense. The computational complexity of offline methods for computing the estimate of at time is , leading to a cumulative computational complexity of order . Including the bootstrap cost, the total computational complexity at becomes , leading to a cumulative computational complexity of order . As shown in Figure 4, the cumulative computational time reaches approximately hours for the offline bootstrap method and around hours for the Bayesian bootstrap method. Conversely, the cumulative computational time for our proposed bootstrap method grows almost linearly with , and requires less than minutes up to . At , offline bootstrap methods take about seconds, and the Bayesian confidence interval necessitates roughly seconds to construct the confidence interval. Our proposed online bootstrap method requires fewer than seconds, demonstrating its potential for time-sensitive applications such as medical diagnosis and treatment, financial trading, and traffic management, where real-time decision-making is essential as data continuously flows in.
6 Proof Sketch of Main Results
In this section, we present sketched proofs for the expansion of the functional SGD estimator relative to the supremum norm (Theorem 3.1) and the bootstrap consistency for global inference (Theorem 4.2), while highlighting some important technical details and key steps.
6.1 Proof sketch for estimator expansion under supremum norm metric
Theorem 3.1 establishes a high-probability supremum norm bound for the higher-order expansion of the SGD estimator. This result is crucial for the inference framework, as it allows us to focus on the distributional behavior of the leading terms, given that the remainders are negligible. In the sketched proof, we denote . According to (3.1), we have
We split the recursion of into two finer recursions: a bias recursion for and a noise recursion for , such that , where
bias recursion: | |||
noise recursion: |
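By linearity, this split is exact at every iteration. As an illustration, the following finite-dimensional surrogate (a Euclidean least-squares recursion standing in for the RKHS one; the dimensions, step size, and constants are our illustrative choices) verifies the decomposition numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, gamma = 5, 200, 0.1
theta_star = rng.normal(size=d)   # surrogate "true" parameter

eta  = -theta_star.copy()   # full error recursion, eta_0 = 0 - theta_star
bias = -theta_star.copy()   # bias recursion: driven only by the initialization
nois = np.zeros(d)          # noise recursion: driven only by the noises eps_t
for _ in range(n):
    x = rng.normal(size=d) / d ** 0.5
    eps = rng.normal()
    P = np.eye(d) - gamma * np.outer(x, x)   # random contraction I - gamma * x x^T
    eta  = P @ eta + gamma * eps * x         # full recursion
    bias = P @ bias                          # noiseless part
    nois = P @ nois + gamma * eps * x        # zero-initialized noisy part

# exact decomposition: error = bias part + noise part at every step
assert np.allclose(eta, bias + nois)
```

The same linearity argument underlies the further decompositions of the bias and noise recursions described next.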
To proceed, we further decompose the bias recursion into two parts: (1) the leading bias recursion ; and (2) the remainder bias recursion as follows:
It is worth noting that the leading bias recursion essentially replaces by its expectation .
To bound the residual term associated with the leading bias term of the averaged estimator, we introduce an augmented RKHS space (with )
(6.1) |
equipped with the kernel function . To verify that is the reproducing kernel of , we notice that
where is a constant. Moreover, also satisfies the reproducing property since
For any , we can use the above reproducing property to bound the supremum norm of as . Also note that for any , for ; therefore , meaning that provides a tighter bound on the supremum norm than . In Section 8.2 (Lemma 8.1), we use this augmented RKHS to show that the bias remainder term satisfies , by computing the expectation and applying Markov's inequality.
For the noise recursion of , we can similarly split it into the leading noise recursion term and residual noise recursion term as
The leading noise recursion is described as a “semi-stochastic” recursion induced by in [12], since it keeps the randomness in the noise recursion due to the noise but gets rid of the randomness arising from , which is due to the random design .
For the residual noise recursion , a direct bound is difficult. Instead, we follow [12] by further decomposing into a sequence of higher-order “semi-stochastic” recursions as follows. We first define the semi-stochastic recursion induced by , denoted :
(6.2) |
Here, replaces the random operator with its expectation in the residual noise recursion for , and can be viewed as a second-order term in the expansion of the noise recursion, or the leading remainder noise term. The remaining noise parts can be expressed as
Then we can further define a semi-stochastic recursion induced by , and repeat this process. If we define for , then we can expand into terms as
where for , . The remainder term also has a recursive characterization:
(6.3) |
To establish the supremum norm bound of , the idea is to show that decays as increases, that is, to prove
Concretely, we consider the constant step case for simplicity of presentation. By accumulating the effects of the iterations, we can further express as
and accordingly, the averaged version is
This implies that, conditional on the covariates , the empirical process over is a Gaussian process with (function) weights . We can then prove a bound on by carefully analyzing the random function ; see Appendix 8.3 for further details. A complete proof of Theorem 3.1 under constant step size is included in [33]; see Figure 5 for a flow chart explaining the relationships among the different components of its proof. The proof for the non-constant step size case is conceptually similar but considerably more involved, requiring a much more refined analysis of the accumulated step size effect on the iterations of the recursions; see [33].
6.2 Proof sketch for Bootstrap consistency of global inference
Recall that represents the data. The goal is to bound the difference between the sampling distribution of and the conditional distribution of given ; see Section 4.2 for detailed definitions of these quantities. We sketch the proof idea under the constant step size scheme.
We will use the shorthand and . Recall that from equations (3.6) and (4.5), we have
From this display, we see that for any , is a weighted sum of Gaussian random variables, with the weights being functions of the covariates ; conditional on , is a weighted sum of sub-Gaussian random variables. In the proof, we also require a sufficiently dense space discretization given by . This discretization forms an -covering for some with respect to a specific distance metric detailed later.
To bound the difference between the distribution of and the conditional distribution of given , we introduce two intermediate processes: (1) with being i.i.d. standard normal random variables for ; (2) an -dimensional multivariate normal random vector (recall that is the space discretization we defined earlier), where are i.i.d. (zero mean) normally distributed random vectors having the same covariance structure as ; that is, , , and for and , . These two intermediate processes are introduced so that the conditional distribution of given will be used to approximate the conditional distribution of given ; while the distribution of will be used to approximate the distribution of . Since both the distribution of and the conditional distribution of given are centered multivariate normal distributions, we can use a Gaussian comparison inequality to bound the difference between them by bounding the difference between their covariances.
The actual proof is even more complicated, as we also need to control the discretization error. See Figure 6 for a flow chart that summarizes all the intermediate approximation steps and the corresponding lemmas in the appendix. For Steps I and V in Figure 6, we approximate the continuous supremum norms of and by the finite maxima of and , respectively. Here, is chosen as the -covering number of the unit interval with respect to the metric defined by for ; that is, there exist such that for every , there exists with . We refer the reader to the Supplementary Material for the detailed proofs of Steps I and V. Notice that is a weighted and non-identically distributed empirical process. In Step II, we develop Gaussian approximation bounds to control the Kolmogorov distance between the sampling distribution of and the distribution of ; see the proof of Lemma 8.3. In Step IV, noticing that, conditional on , is a weighted and non-identically distributed sub-Gaussian process whose randomness comes from the bootstrap multipliers , we adopt an argument similar to that of Step II to bound the Kolmogorov distance between the distributions of and given .
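The Gaussian comparison step can be illustrated numerically: two centered Gaussian vectors with close covariance matrices have close distributions of coordinate-wise maxima. The simulation below (dimension, perturbation size, and sample counts are arbitrary illustrative choices) estimates the Kolmogorov distance between the two maxima by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
m, B = 20, 50_000

A = rng.normal(size=(m, m)) / m ** 0.5
S1 = A @ A.T + np.eye(m)          # a generic covariance matrix
S2 = S1 + 0.01 * np.eye(m)        # small perturbation of the covariance

L1, L2 = np.linalg.cholesky(S1), np.linalg.cholesky(S2)
Z = rng.normal(size=(B, m))       # common standard normals (a coupling)
max1 = (Z @ L1.T).max(axis=1)     # maxima of N(0, S1) draws
max2 = (Z @ L2.T).max(axis=1)     # maxima of N(0, S2) draws

# Empirical Kolmogorov distance between the two max distributions:
grid = np.linspace(-2.0, 8.0, 200)
F1 = (max1[:, None] <= grid).mean(axis=0)
F2 = (max2[:, None] <= grid).mean(axis=0)
kol = np.abs(F1 - F2).max()
assert kol < 0.05   # small, consistent with a covariance-difference bound
```

In the proof, this role is played by a formal Gaussian comparison inequality that bounds the Kolmogorov distance between the maxima by the supremum norm difference of the two covariance structures.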
7 Discussion
Quantifying uncertainty (UQ) for large-scale streaming data is a central challenge in statistical inference. We develop a multiplier bootstrap-based inferential framework for UQ in online non-parametric least squares regression. We propose using perturbed stochastic functional gradients to generate a sequence of bootstrapped functional SGD estimators for constructing pointwise confidence intervals (local inference) and simultaneous confidence bands (global inference) for the function parameter in an RKHS. Theoretically, we establish a framework to derive the non-asymptotic law of the infinite-dimensional SGD estimator and demonstrate the consistency of the multiplier bootstrap method.
This work assumes that random errors in non-parametric regression follow a Gaussian distribution. However, in many real-world applications, heavy-tailed distributions are more common and better suited to capturing outlier behavior. One future research direction is to extend the current methods to heavy-tailed errors, thereby offering a more robust approach to online non-parametric inference. Another direction to explore is the generalization of the multiplier bootstrap weights to independent sub-exponential random variables and even exchangeable weights. Finally, a promising direction is online non-parametric inference for dependent data. Such an extension is necessary to address problems such as multi-armed bandits and reinforcement learning, where data dependencies are frequent and real-time updates are essential. Adapting our methods to these problems could provide deeper insights into the interplay between statistical inference and online decision-making.
8 Some Key Proofs
8.1 Proof of the leading terms in Theorem 3.1 in the constant step-size case
Recall that in Section 6.1 we split the recursion of into the bias recursion and the noise recursion; that is, . Here can be further decomposed into its leading bias term and a remainder satisfying the recursions
(8.1)
(8.2) |
We further decompose into its main recursion terms and residual recursion terms as
(8.3)
(8.4) |
We focus on the averaged version with
where , .
Theorem 3.1 for the constant step-size case comprises the following three results:
(8.5)
(8.6)
(8.7) |
In this section, we bound the sup-norms of the leading bias term and the leading variance term . To complete the proof of (8.7), we bound in Section 8.2 and in Section 8.3.
We first provide explicit expressions for and .
Denote
with . We have
(8.8) |
(8.9) |
Bounding the leading bias term (8.5). In the constant step-size case, based on (8.8), we have
Note that any can be represented as , where satisfies , , and , . Then for any , . By the assumption that satisfies , we have
where the inequality holds based on the bound that .
Bounding the leading noise term (8.6). We first derive the explicit expression of and its variance. Based on (8.9), in the constant step-size case, we have, for any ,
Note that for any , . Then
Therefore, with , and
Note that
On the other hand, . Since
we have
Accordingly, .
Meanwhile, leads to the result that thus . Therefore, .
8.2 Bounding the bias remainder in the constant step-size case
Recall in (8.2), the bias remainder recursion follows
Our goal is to bound . Letting with , we have
with and . We first express in an explicit form as follows.
Let and ; we have . We can further represent as
On the other hand, . Therefore, for any , we have
(8.10) |
Note that . In Lemma 8.1 below, we bound through and show that with high probability.
Lemma 8.1.
Suppose the step size with . Then
Proof.
To simplify the notation, we set as . For , by (8.10), we have
(8.11) |
That is, we split into two parts, and will bound each part separately.
We first bound . Denote with , then .
If , suppose , then
where the last step follows from the fact that . Therefore, we have . Furthermore,
Note that , with
(8.12) |
where with .
More abstractly, for any , , where . Then can be written as
Furthermore, for any ,
(8.13)
Therefore, , and in (8.13), we have . Then we have
Denote , then
Recalling , we can bound as follows.
where the last inequality is due to the fact that
Accordingly, ; and
(8.14) |
Next, we analyze in (8.10) for . Note that
We first consider and assume ; noting that , we have
(8.15) |
Similarly, for , Therefore,
Since , we have
(8.16) |
where , and where the last step is due to the fact that
with for .
By equation (8.16), we have . Recall . For notational simplicity, let ; then can be written as , where is an operator such that for any , . Replacing with , we have , and
Since , we further need to bound . Let , then . Note that , where is a constant. Then
Therefore, , and
Accordingly, we have .
Therefore,
Then by Markov’s inequality, we have
That is, with probability at least . ∎
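The Markov step at the end of the proof has the following generic shape (written with placeholder symbols $Z$ and $r_n$, since the specific quantities are those of the lemma): if $\mathbb{E}[Z] \le C_1 r_n$ for a nonnegative random variable $Z$, then for any $a > 0$,

```latex
\mathbb{P}\left( Z \ge a\, r_n \right)
\;\le\; \frac{\mathbb{E}[Z]}{a\, r_n}
\;\le\; \frac{C_1}{a},
```

so the event $\{ Z < a\, r_n \}$ holds with probability at least $1 - C_1/a$, which is the form of the high-probability bound stated in Lemma 8.1.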
8.3 Proof of the sup-norm bound of the noise remainder in the constant step-size case
Recall the noise remainder recursion follows
Following the recursion decomposition in Section 6.1, we can split into higher-order expansions as
where can be viewed as and for and . The remainder term follows the recursion as
The following lemma (Lemma 8.2) demonstrates that the higher-order expansion terms (for ) decrease as increases. In particular, we first characterize the behavior of by representing it as a weighted empirical process and establish its convergence rate, namely that with high probability. Next, we show that for using mathematical induction. Finally, we bound through its -norm based on the property that .
Lemma 8.2.
Suppose the step size with . Then
-
-
Furthermore, for and , we have .
-
with large enough.
Furthermore, combining (a)-(c), we have
where is a constant.
Proof.
Proof of Lemma 8.2 (a) by analyzing . First, we calculate the explicit expression of . Let and , then with . Therefore,
where the last step is by plugging in in (8.9) with . Accordingly,
(8.17) |
Let , where the randomness of involves . Then , which is a Gaussian process conditional on .
We can further express in terms of the eigenvalues and eigenfunctions as
(8.18) |
with ; we refer to [33] for the proof. This expression facilitates the downstream analysis of . Denote and . Then can be simplified as .
We are ready to prove that , where . The proof involves two steps: (1) for any fixed , conditional on , is a weighted Gaussian random variable with variance ; we first bound with exponentially decaying tail probability by characterizing . (2) We then bridge to . The details are as follows.
Conditional on , is a weighted Gaussian random variable; by Hoeffding’s inequality,
(8.19) |
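The tail inequality invoked in (8.19) is the standard Gaussian (Hoeffding-type) bound: if, conditional on the data, $Z$ is a centered Gaussian random variable with conditional variance $\sigma^2$, then for all $t > 0$,

```latex
\mathbb{P}\left( |Z| \ge t \mid \text{data} \right)
\;\le\; 2 \exp\!\left( -\frac{t^2}{2 \sigma^2} \right).
```

Bounding the conditional variance, as is done next, therefore yields an exponentially decaying conditional tail.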
We then bound . We separate into two parts as follows:
where involves the interaction terms indexed by and includes the terms with . Recall . Then for . For , we have
Taking the expectation on , we see that
(8.20)
where the last step is due to the calculation that
with . Note that
In the -step, if and only if the following cases hold: (1); (2) and ; (3) and . Recall . Then we have
For , we have Therefore,
with . The final step is due to the fact that
Since and accordingly, Therefore, we have
(8.21) |
For , we rewrite as
(8.22) |
where includes the terms with and includes the terms with . For , for any positive , we have
To bound the expectation of , we need to bound . Note that
where the last term is . Then we have
Then accordingly,
(8.23) |
For , we have
We first bound .
Notice that
since . Then we have
due to the property that . Accordingly, we have
Therefore,
(8.24) |
We next deal with .
(8.25) |
Combining equations (8.21), (8.23), (8.24), and (8.25), and noticing that for , we have
Define the event ; by Markov's inequality, . Conditional on the event , and letting in equation (8.19), we have
(8.26) |
Combined with the lemma in the Supplementary Material [33] that bridges and , this yields the result.
Next, we prove Lemma 8.2 (b) and analyze for .
Note that . In what follows, we focus on . Recall from Section 6 that follows the recursion , where for and for .
Let , then , and .
where we use the property that . Since , then . On the other hand, . Therefore, we have
with . Also, . Then
Therefore, . Let with and ; then we have and
By Markov's inequality, we have
For and , we have .
Proof of Lemma 8.2 (c) - the remainder term . Note that for any , . Therefore, . Next, we bound .
For , recall we have
Accordingly, . Since , we have
and accordingly
(8.27) |
By Markov's inequality,
with a large enough constant .
Finally, we have
∎
8.4 Bootstrap SGD decomposition
Similar to the SGD recursion decomposition in Section 6, we define the bootstrap SGD recursion decomposition as follows. Based on (4.1), denote ; then
(8.28) |
We split the recursion (8.28) into two recursions, and , such that . Specifically,
(8.29)
(8.30) |
Since , we further decompose into two parts: (1) its main recursion terms, which determine the bias order; and (2) residual recursion terms. That is,
Similarly, we decompose into its main recursion term, which dominates the variation, and residual recursion terms as
(8.31)
with .
We aim to quantify the distributional behavior of given . Denote . Then
where , , and are remainder terms in the original SGD recursion, with (bounded in Section 8.2) and (bounded in Section 8.3).
Since and follow the same recursion, the leading bias of is 0. We next need to: (1) characterize the distributional behavior of conditional on ; and (2) prove that the remainder terms are negligible.
8.5 Proof of the bootstrap consistency in Theorem 4.2 for the constant step-size case
We follow the proof sketch in Section 6.2 and complete the proofs of Steps II, III, and IV in this section.
For the reader’s convenience, we restate the following notation. Denote
where the ’s, for , are i.i.d. standard normal random variables, and satisfies , and for .
Lemma 8.3.
(Proof of Step II) Suppose and with . We have
(8.33) |
which converges to as increases.
Lemma 8.4.
(Proof of Step III) Suppose and with . With probability at least ,
Lemma 8.5.
(Proof of Step IV) Suppose and with . With probability at least ,
Proof of Lemma 8.3
Proof.
We define for . With a slight abuse of notation, we use to represent . Then . Define and ; then . For ,
When , . We also have for .
We also use the notation to represent defined in Section 6.2. Let for , and . We remark that has the same mean and covariance structure as .
For a scalar, to be determined, that depends on , , and , define . It follows from [42] that satisfies . Let be a -function such that for and for . Let , for , where is to be determined. Then
To proceed, we approximate using the techniques used in [42]. Let . Define , , and , for . Let , and , for . Then .
Then
where
We further note that since . For , it follows from [42] that for any ,
where is a finite constant. Then
(8.34) |
We need to bound and .
where the last step is due to the property that , and .
Next we deal with , where , and . For , we have
Then we have
Therefore,
(8.35) |
Meanwhile, it follows from Lemma 2.1 of [42] that
where is a universal constant. Therefore, for any ,
On the other hand, let . Then
Using the same arguments, it can be shown that has the same upper bound as that specified in (8.35). Furthermore, by Lemma 2.1 of [42] and direct calculations, we have
Therefore,
Consequently, letting and be large enough, we have
(8.36) |
which converges to as increases, when and with . ∎
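For reference, the smooth-max device from [42] used in the proof above is usually taken to be the log-sum-exp function: for $z \in \mathbb{R}^N$ and a smoothing parameter $\beta > 0$,

```latex
F_\beta(z) \;=\; \beta^{-1} \log\left( \sum_{j=1}^{N} \exp(\beta z_j) \right),
\qquad
0 \;\le\; F_\beta(z) - \max_{1 \le j \le N} z_j \;\le\; \beta^{-1} \log N,
```

so a large $\beta$ makes $F_\beta$ a smooth surrogate for the finite maximum at the cost of larger derivatives, which is the trade-off tracked by the constants in the proof.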
Proof of Lemma 8.4
Proof.
Let and . Then and . Denote the -th element of the covariance matrices as and , respectively. Set . Then
and .
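The covariance-comparison argument sketched here rests on a Gaussian comparison inequality in the style of Chernozhukov, Chetverikov, and Kato: if $X$ and $Y$ are centered $N$-dimensional Gaussian vectors with covariance matrices $\Sigma^X$ and $\Sigma^Y$, and $\Delta = \max_{j,k} |\Sigma^X_{jk} - \Sigma^Y_{jk}|$, then, in one standard form,

```latex
\sup_{t \in \mathbb{R}}
\left| \mathbb{P}\Bigl( \max_{1 \le j \le N} X_j \le t \Bigr)
     - \mathbb{P}\Bigl( \max_{1 \le j \le N} Y_j \le t \Bigr) \right|
\;\lesssim\; \Delta^{1/3} \bigl( 1 \vee \log(N/\Delta) \bigr)^{2/3},
```

so bounding the entrywise covariance difference $\Delta$ controls the Kolmogorov distance between the two maxima.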
Proof of Lemma 8.5
Proof.
Define with . We have and . Define ; then and are independent for and . Let with for .
Similarly, denote and . Then we have , and Denote with .
The proof of Lemma 8.5 follows that of Lemma 8.3. We adopt the notation and follow the proof of Lemma 8.3 step by step, with only the following changes: (1) replacing with ; (2) replacing with ; (3) replacing the probability and expectation with the conditional probability and conditional expectation . Then equation (8.34) is adapted to
(8.37) |
Since
where the last inequality follows from the proof of Lemma 8.3, which shows that . Then, with probability at least , we have
Similarly, with probability at least , where is a constant independent of . Then we have with probability at least ,
Therefore, following the proof of Lemma 8.3, we have
Consequently, letting , we have, with probability at least ,
∎
References
- [1] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22:400–407, 1951.
- [2] David Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering, Ithaca, NY, 1988.
- [3] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 161–168. Curran Associates, Inc., 2008.
- [4] Anton J. Kleywegt, Alexander Shapiro, and Tito Homem-de-Mello. The sample average approximation method for stochastic discrete optimization. SIAM Journal on Optimization, 12(2):479–502, 2002.
- [5] Sujin Kim, Raghu Pasupathy, and Shane G. Henderson. A guide to sample average approximation. In Handbook of Simulation Optimization, pages 207–243. Springer, 2015.
- [6] Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404, 1950.
- [7] Grace Wahba. Spline Models for Observational Data. Society for Industrial and Applied Mathematics, 1990.
- [8] Vladimir Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34:2593–2656, 2006.
- [9] Shahar Mendelson. Geometric parameters of kernel machines. In International Conference on Computational Learning Theory, pages 29–43. Springer, 2002.
- [10] Yun Yang, Mert Pilanci, and Martin J. Wainwright. Randomized sketches for kernels: Fast and optimal nonparametric regression. The Annals of Statistics, 45(3):991–1023, 2017.
- [11] Craig Saunders, Alexander Gammerman, and Volodya Vovk. Ridge regression learning algorithm in dual variables. In International Conference on Machine Learning, pages 515–521. Springer, 1998.
- [12] Aymeric Dieuleveut and Francis Bach. Nonparametric stochastic approximation with large step-sizes. The Annals of Statistics, 44(4):1363–1399, 2016.
- [13] Léon Bottou. Online learning and stochastic approximations. Online Learning in Neural Networks, 17(9):142, 1998.
- [14] Quoc V. Le, Jiquan Ngiam, Adam Coates, Abhik Lahiri, Bobby Prochnow, and Andrew Y. Ng. On optimization methods for deep learning. In Proceedings of the 28th International Conference on Machine Learning, pages 265–272, 2011.
- [15] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In 19th International Conference on Computational Statistics, pages 177–186. Springer, 2010.
- [16] Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. Advances in Neural Information Processing Systems, 32, 2019.
- [17] Patrick Cheridito, Arnulf Jentzen, and Florian Rossmannek. Non-convergence of stochastic gradient descent in the training of deep neural networks. Journal of Complexity, 64:101540, 2021.
- [18] Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
- [19] Yixin Fang, Jinfeng Xu, and Lei Yang. Online bootstrap confidence intervals for the stochastic gradient descent estimator. Journal of Machine Learning Research, 19(78):1–21, 2018.
- [20] Xi Chen, Jason D Lee, Xin T Tong, and Yichen Zhang. Statistical inference for model parameters in stochastic gradient descent. The Annals of Statistics, 48(1):251–273, 2020.
- [21] Yu Nesterov and J-Ph Vial. Confidence level solutions for stochastic programming. Automatica, 44(6):1559–1568, 2008.
- [22] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
- [23] Weijie J Su and Yuancheng Zhu. HiGrad: Uncertainty quantification for online learning and stochastic approximation. Journal of Machine Learning Research, 24(124):1–53, 2023.
- [24] Bradley Efron and Robert J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall/CRC, 1993.
- [25] Thomas J Diciccio and Joseph P Romano. A review of bootstrap confidence intervals. Journal of the Royal Statistical Society: Series B (Methodological), 50(3):338–354, 1988.
- [26] Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Gaussian approximation of suprema of empirical processes. The Annals of Statistics, 42(4):1564–1597, 2014.
- [27] Alain Berlinet and Christine Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011.
- [28] Chong Gu. Smoothing spline ANOVA models, volume 297. Springer, 2013.
- [29] Shahar Mendelson and Joseph Neeman. Regularization in kernel learning. The Annals of Statistics, 38(1):526–565, 2010.
- [30] Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. The Journal of Machine Learning Research, 18(1):714–751, 2017.
- [31] Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
- [32] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, volume 20, 2007.
- [33] Meimei Liu, Zuofeng Shang, and Yun Yang. Supplementary: Scalable statistical inference in non-parametric least squares, 2023. Supplementary material.
- [34] Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Empirical and multiplier bootstraps for suprema of empirical processes of increasing complexity, and related Gaussian couplings. Stochastic Processes and their Applications, 126(12):3632–3651, 2016.
- [35] Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Anti-concentration and honest, adaptive confidence bands. The Annals of Statistics, 42(5):1787–1818, 2014.
- [36] Michael H Neumann and Jörg Polzehl. Simultaneous bootstrap confidence bands in nonparametric regression. Journal of Nonparametric Statistics, 9(4):307–333, 1998.
- [37] Timothy B Armstrong and Michal Kolesár. Simple and honest confidence intervals in nonparametric regression. Quantitative Economics, 11(1):1–39, 2020.
- [38] Grace Wahba. Bayesian “confidence intervals” for the cross-validated smoothing spline. Journal of the Royal Statistical Society: Series B (Methodological), 45(1):133–150, 1983.
- [39] Grace Wahba. Improper priors, spline smoothing and the problem of guarding against model errors in regression. Journal of the Royal Statistical Society Series B: Statistical Methodology, 40(3):364–372, 1978.
- [40] Yuedong Wang and Grace Wahba. Bootstrap confidence intervals for smoothing splines and their comparison to bayesian confidence intervals. Journal of Statistical Computation and Simulation, 51(2-4):263–279, 1995.
- [41] Bradley Efron. The jackknife, the bootstrap and other resampling plans. SIAM, 1982.
- [42] Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. The Annals of Statistics, 41(6):2786–2819, 2013.