
Statistical Spatially Inhomogeneous Diffusion Inference

Yinuo Ren1, Yiping Lu2, Lexing Ying1,3, Grant M. Rotskoff1,4 (GMR acknowledges support from a Google Research Scholar Award.)
Abstract

Inferring a diffusion equation from discretely-observed measurements is a statistical challenge of significant importance in a variety of fields, from single-molecule tracking in biophysical systems to modeling financial instruments. Assuming that the underlying dynamical process obeys a $d$-dimensional stochastic differential equation of the form

\mathrm{d}\boldsymbol{x}_{t}=\boldsymbol{b}(\boldsymbol{x}_{t})\,\mathrm{d}t+\Sigma(\boldsymbol{x}_{t})\,\mathrm{d}\boldsymbol{w}_{t},

we propose neural network-based estimators of both the drift $\boldsymbol{b}$ and the spatially-inhomogeneous diffusion tensor $D=\Sigma\Sigma^{\top}$ and provide statistical convergence guarantees when $\boldsymbol{b}$ and $D$ are $s$-Hölder continuous. Notably, our bound aligns with the minimax optimal rate $N^{-\frac{2s}{2s+d}}$ for nonparametric function estimation even in the presence of correlation within observational data, which necessitates careful handling when establishing fast-rate generalization bounds. Our theoretical results are bolstered by numerical experiments demonstrating accurate inference of spatially-inhomogeneous diffusion tensors.

1 Introduction

The dynamical evolution of a wide variety of natural processes, from molecular motion within cells to atmospheric systems, involves an interplay between deterministic forces and noise from the surrounding environment. While it is possible to observe time series data from such systems, in general the underlying equation of motion is not known analytically. Stochastic differential equations offer a powerful and versatile framework for modeling these complex systems, but inferring the deterministic drift and diffusion tensor from time series data remains challenging, especially in high-dimensional settings. Among the many strategies proposed (Crommelin and Vanden-Eijnden 2011; Frishman and Ronceray 2020; Nickl 2022), there are few rigorous results on the optimality and convergence properties of estimators of, in particular, spatially-inhomogeneous diffusion tensors.

Many numerical algorithms have been proposed to infer the drift and diffusion, accommodating various settings, including one-dimensional (Sura and Barsugli 2002; Papaspiliopoulos et al. 2012; Davis and Buffett 2022) and multidimensional SDEs (Pokern, Stuart, and Vanden-Eijnden 2009; Frishman and Ronceray 2020; Crommelin and Vanden-Eijnden 2011). Also, the statistical convergence rate has been extensively studied for both the one-dimensional case (Dalalyan 2005; Dalalyan and Reiß 2006; Pokern, Stuart, and van Zanten 2013; Aeckerle-Willems and Strauch 2018) and the multidimensional cases (Van der Meulen, Van Der Vaart, and Van Zanten 2006; Dalalyan and Reiß 2007; van Waaij and van Zanten 2016; Nickl and Söhl 2017; Nickl and Ray 2020; Oga and Koike 2021; Nickl 2022). For parametric estimators using a Fourier or wavelet basis, the statistical limits of estimating the spatially-inhomogeneous diffusion tensor have been rigorously characterized (Hoffmann 1997, 1999a). However, strategies based on such decompositions do not scale to high-dimensional problems, which has motivated the investigation of neural networks as a more flexible representation of the SDE coefficients (Han, Jentzen, and E 2018; Rotskoff, Mitchell, and Vanden-Eijnden 2022; Khoo, Lu, and Ying 2021; Li et al. 2021).

Thus, we adopt nonparametric neural network estimators (Suzuki 2018; Oono and Suzuki 2019; Schmidt-Hieber 2020) as our ansatz function class, which has achieved great empirical success in estimating SDE coefficients (Xie et al. 2007; Zhang et al. 2018; Han, Jentzen, and E 2018; Wang et al. 2022; Lin, Li, and Ren 2023). We aim to establish statistical guarantees for such neural network-based estimators. The most closely related concurrent work is (Gu et al. 2023), where the authors provide a convergence guarantee for the neural network estimation of the drift vector and the homogeneous diffusion tensor of an SDE by solving appropriate supervised learning tasks. However, their approach assumes that the data observed along the trajectory are independently and identically distributed from the stationary distribution. Additionally, the generalization bound used in (Gu et al. 2023) is not a fast-rate generalization bound (Bartlett, Bousquet, and Mendelson 2005; Koltchinskii 2006), resulting in a suboptimal final guarantee. Therefore, we seek to bridge the gap between the i.i.d. setting and the non-i.i.d. ergodic setting using mixing conditions and extend the algorithm and analysis to spatially-inhomogeneous diffusion estimation. We show that neural estimators can achieve the standard minimax optimal nonparametric function estimation rate even when the data are non-i.i.d.

1.1 Contribution

In this paper, we construct a fast-rate error bound for estimating a multi-dimensional spatially-inhomogeneous diffusion process based on non-i.i.d. ergodic data along a single trajectory. Our contributions are as follows:

  • We derive for neural network-based diffusion estimators a convergence rate that matches the minimax optimal nonparametric function estimation rate for the $s$-Hölder continuous function class (Tsybakov and Zaiats 2009);

  • Our analysis exploits the $\upbeta$-mixing condition to address the correlation present among observed data along the trajectory, making our result readily applicable to a wide range of ergodic diffusion processes;

  • We present numerical experiments, providing empirical support for our derived convergence rate and facilitating further applications of neural diffusion estimators in various contexts with theoretical assurance.

Our theoretical bound makes explicit the interplay between the errors of nonparametric regression, numerical discretization, and ergodic approximation, and provides a general guideline for designing data-efficient, scale-minimal, and statistically-optimal neural estimators for diffusion inference.

1.2 Related Works

Inference of diffusion processes from data

The problem of inferring the drift and diffusion coefficients of an SDE from data has been studied extensively in the literature. The setting with access to the whole continuous trajectory is studied by (Dalalyan and Reiß 2006, 2007; Strauch 2015, 2016; Nickl and Ray 2020; Rotskoff and Vanden-Eijnden 2019), in which the diffusion tensor can be exactly identified using quadratic variation arguments, and thus only the drift inference is considered. Many works focus on the numerical recovery of both the drift vector and the diffusion tensor in the more realistic setting when only discrete observations are available, including methods based on local linearization (Ozaki 1992; Shoji and Ozaki 1998), martingale estimating functions (Bibby and Sørensen 1995), maximum likelihood estimation (Pedersen 1995; Aït-Sahalia 2002), and Markov chain Monte Carlo (Elerian, Chib, and Shephard 2001). We refer readers to (Sørensen 2004; López-Pérez, Febrero-Bande, and González-Manteiga 2021) for an overview of parametric approaches. A spectral method that estimates the eigenpairs of the Markov semigroup operator is proposed in (Crommelin and Vanden-Eijnden 2011), and a nonparametric Bayesian inference scheme based on the finite element method is studied in (Papaspiliopoulos et al. 2012). As for the statistical convergence rate of the drift and diffusion inference, a line of pioneering works is by (Hoffmann 1997, 1999a, 1999b), where the minimax convergence rate of the one-dimensional diffusion process is derived for Besov spaces and matched by adaptive wavelet estimators. Alternative analyses mainly follow a Bayesian methodology, with notable results by (Nickl and Söhl 2017; Nickl and Ray 2020; Nickl 2022) in both high- and low-frequency schemes.

Solving high-dimensional PDEs with deep neural networks

The curse of dimensionality has stymied efforts to solve high-dimensional partial differential equations (PDEs) numerically. However, deep learning has demonstrated remarkable flexibility and adaptivity in approximating high-dimensional functions, which has led to significant advances in computer vision and natural language processing. Recently, a series of works (Han, Jentzen, and E 2018; Yu et al. 2018; Karniadakis et al. 2021; Khoo, Lu, and Ying 2021; Long et al. 2018; Zang et al. 2020; Kovachki et al. 2021; Lu, Jin, and Karniadakis 2019; Li et al. 2022; Rotskoff, Mitchell, and Vanden-Eijnden 2022) have explored solving PDEs with deep neural networks, achieving impressive results on a diverse collection of tasks. These approaches rely on representing PDE solutions using neural networks, and the various schemes propose different loss functions to obtain a solution. For instance, (Han, Jentzen, and E 2018) uses the Feynman-Kac formulation to convert PDE solving into a stochastic control problem, (Karniadakis et al. 2021) solves the PDE by minimizing the strong-form residual, while (Zang et al. 2020) solves weak formulations of PDEs with an adversarial approach.

Theoretical guarantees for neural network-based PDE solvers

Statistical learning theory offers a powerful toolkit to prove theoretical convergence results for PDE solvers based on deep learning. For example, (Weinan and Wojtowytsch 2022; Chen, Lu, and Lu 2021; Marwah, Lipton, and Risteski 2021; Marwah et al. 2022) investigated the regularity of PDEs approximated by neural networks, and (Nickl, van de Geer, and Wang 2020; Duan et al. 2021; Lu et al. 2021; Hütter and Rigollet 2021; Lu, Blanchet, and Ying 2022) consider the statistical convergence rate of various machine learning-based PDE solvers. However, most of these optimality results are based on concentration results that assume the sampled data are independent and identically distributed. This i.i.d. assumption is often violated in financial and biophysical applications, for example, time series prediction, complex system analysis, and signal processing. Among the many possible relaxations of the i.i.d. setting, the scenario where data are drawn from a strongly mixing process has been widely adopted (Bradley 2005). Inspired by the first work of this kind (Yu 1994), many authors have exploited a set of mixing concepts such as $\alpha$-mixing (Zhang 2004; Steinwart and Christmann 2009), $\beta$- and $\phi$-mixing (Mohri and Rostamizadeh 2008, 2010; Kuznetsov and Mohri 2017; Ziemann, Sandberg, and Matni 2022), and $\mathcal{C}$-mixing (Hang and Steinwart 2017). We refer readers to (Hang et al. 2016) for an overview of this line of research.

Notations

We write $\lesssim$ and $\gtrsim$ for inequalities that hold up to a constant factor, and $\asymp$ for equality up to a constant factor.

Definition 1 (Hölder space).

We denote the Hölder space of order $s\in\mathbb{R}$ with constant $M>0$ by $\mathcal{C}^{s}(\mathbb{R}^{d},M)$, i.e.

\mathcal{C}^{s}(\mathbb{R}^{d},M)=\bigg\{f:\mathbb{R}^{d}\to\mathbb{R}\ \bigg|\ \sum_{|\boldsymbol{\alpha}|<s}\left\|\partial^{\boldsymbol{\alpha}}f\right\|_{\infty}+\sum_{|\boldsymbol{\alpha}|=\lfloor s\rfloor}\sup_{\boldsymbol{x}\neq\boldsymbol{y}}\frac{\left|\partial^{\boldsymbol{\alpha}}f(\boldsymbol{x})-\partial^{\boldsymbol{\alpha}}f(\boldsymbol{y})\right|}{|\boldsymbol{x}-\boldsymbol{y}|^{s-\lfloor s\rfloor}}<M\bigg\}.

2 Problem Setting

Suppose we have access to a sequence of $N$ discrete position snapshots $(\boldsymbol{x}_{k\tau})_{k=0}^{N}$ along a single trajectory $(\boldsymbol{x}_{t})_{0\leq t\leq T}$, where the time step $\tau=T/N$ and $(\boldsymbol{x}_{t})_{t\geq 0}$ is the solution to the following Itô stochastic differential equation:

\mathrm{d}\boldsymbol{x}_{t}=\boldsymbol{b}(\boldsymbol{x}_{t})\,\mathrm{d}t+\Sigma(\boldsymbol{x}_{t})\,\mathrm{d}\boldsymbol{w}_{t}, \quad (1)

where $\boldsymbol{b}:\mathbb{R}^{d}\to\mathbb{R}^{d}$, $\Sigma:\mathbb{R}^{d}\to\mathbb{R}^{d\times r}$, and $(\boldsymbol{w}_{t})_{t\geq 0}$ is an $r$-dimensional Wiener process. We refer to the vector field $\boldsymbol{b}(\boldsymbol{x})$ as the drift vector, and define the diffusion tensor as $D(\cdot)=\tfrac{1}{2}\Sigma(\cdot)\Sigma(\cdot)^{\top}$. As noted in (Lau and Lubensky 2007), any interpolation between the Itô convention and other conventions for stochastic calculus can be transformed into the Itô convention by adding a correction term to the drift vector; we therefore work with the Itô convention throughout this paper. (The Itô convention, along with the others, represents a different way of extending the Riemann integral to stochastic processes; roughly speaking, Itô uses the left endpoint of each interval for the functional value in the Riemann sum. We adopt the Itô convention due to the martingale properties it introduces, which are mathematically convenient for our statements and proofs.)
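In simulations (as in Section 4), such discrete snapshots are typically generated with the Euler-Maruyama scheme. The following minimal NumPy sketch is our own illustration, assuming placeholder callables `b` and `Sigma` for the drift and noise coefficients; it is not part of the paper's implementation.

```python
import numpy as np

def euler_maruyama(b, Sigma, x0, tau, N, rng=None):
    """Simulate N steps of dx = b(x) dt + Sigma(x) dw with the Euler-Maruyama scheme.

    b     : callable, R^d -> R^d       (drift vector)
    Sigma : callable, R^d -> R^{d x r} (noise coefficient)
    x0    : initial position, shape (d,)
    Returns an array of shape (N + 1, d) holding the snapshots (x_{k tau})_{k=0}^{N}.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = x0.shape[0]
    r = np.atleast_2d(Sigma(x0)).shape[1]
    xs = np.empty((N + 1, d))
    xs[0] = x0
    for k in range(N):
        dw = rng.normal(scale=np.sqrt(tau), size=r)      # Brownian increment over tau
        xs[k + 1] = xs[k] + b(xs[k]) * tau + Sigma(xs[k]) @ dw
    return xs
```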

Remark 1.

Our focus on inhomogeneity in the space variable stems from the fact that when the SDE coefficients are time-dependent, it becomes very challenging to infer them from a single observational trajectory, i.e. with only one observation at each time point; we leave the case with multiple trajectories for future work.

For simplicity, we will be working on $\Omega=[0,1)^{d}$ with periodic boundaries, i.e. the $d$-dimensional torus $\widetilde{\Omega}=\mathbb{R}^{d}/\mathbb{Z}^{d}$. Points on the torus $\widetilde{\Omega}$ are represented by $\widetilde{\boldsymbol{x}}$, where $\widetilde{\cdot}$ denotes the canonical map and $\boldsymbol{x}\in\mathbb{R}^{d}$ is a representative of the equivalence class $\widetilde{\boldsymbol{x}}$. The Borel $\sigma$-algebra on $\widetilde{\Omega}$ coincides with the sub-$\sigma$-algebra of 1-periodic Borel sets of $\mathbb{R}^{d}$. We refer readers to (Papanicolau, Bensoussan, and Lions 1978) for further mathematical details of homogenization on tori. We further assume the drift and diffusion coefficients in (1) satisfy the following regularity assumptions:

Assumption 2a (Periodicity).

$\boldsymbol{b}(\boldsymbol{x})$, $\Sigma(\boldsymbol{x})$, and $D(\boldsymbol{x})$ are 1-periodic in each variable.

Remark 2.

This assumption is primarily for simplicity, and has been adopted in many previous works on the statistical inference of SDE coefficients, e.g. (Nickl and Ray 2020). This allows us to bypass the technicalities concerning boundary conditions, which might detract from our main contributions.

Assumption 2b (Hölder-smoothness).

Each entry $b_{i}(\boldsymbol{x}),\Sigma_{ij}(\boldsymbol{x}),D_{ij}(\boldsymbol{x})\in\mathcal{C}^{s}(\mathbb{R}^{d},M)$ for some $s\geq 2$ and $M>0$.

Assumption 2c (Uniform ellipticity).

It holds that $r\geq d$ and there exists a constant $c>0$ such that $D(\boldsymbol{x})\succ cI$, i.e. $\sum_{i,j=1}^{d}D_{ij}(\boldsymbol{x})\xi_{i}\xi_{j}\geq c\|\xi\|^{2}$ for any $\xi\in\mathbb{R}^{d}$, uniformly for all $\boldsymbol{x}\in\mathbb{R}^{d}$.

Remark 3.

Uniform ellipticity is commonly assumed in analyses of the Fokker-Planck equation. It guarantees that the Fokker-Planck equation has a unique strong solution with regularity properties that are essential for analyzing the asymptotic behavior and the numerical approximation of the solution. We refer readers to (Stroock and Varadhan 1997; Bogachev et al. 2022) for more detailed discussions.

Since $\boldsymbol{b}(\boldsymbol{x})$ and $\Sigma(\boldsymbol{x})$ are 1-periodic, the process $(\boldsymbol{x}_{t})_{t\geq 0}$ in (1) can be viewed as a process $(\widetilde{\boldsymbol{x}}_{t})_{t\geq 0}:=(\widetilde{\boldsymbol{x}_{t}})_{t\geq 0}$ on the torus $\widetilde{\Omega}$. Denoting the transition kernel of the process $\boldsymbol{x}_{t}$ by $\mathbb{P}^{t}(\boldsymbol{x},\cdot):=\mathbb{P}(\boldsymbol{x}_{t}\in\cdot\,|\,\boldsymbol{x}_{0}=\boldsymbol{x})$, the transition kernel of the corresponding process $\widetilde{\boldsymbol{x}}_{t}$ satisfies:

\widetilde{\mathbb{P}}^{t}(\widetilde{\boldsymbol{x}},\cdot)=\sum_{\boldsymbol{k}=(k_{1},\cdots,k_{d})\in\mathbb{Z}^{d}}\mathbb{P}^{t}\left(\boldsymbol{x},\cdot+\sum_{i=1}^{d}k_{i}\boldsymbol{e}_{i}\right), \quad (2)

where $\boldsymbol{e}_{i}$ is the $i$-th standard basis vector in $\mathbb{R}^{d}$. When no confusion arises, we will use $\widetilde{\boldsymbol{x}}$ to denote its representative in the fundamental domain $\Omega$ in the following.

3 Spatially Inhomogeneous Diffusion Estimator

In this section, we aim to build neural estimators of both the drift and diffusion coefficients based on a sequence of $N$ discrete observations $(\boldsymbol{x}_{k\tau})_{k=0}^{N}$ along a single trajectory of the SDE (1). A straightforward neural drift estimator allows us to subsequently construct a simple neural estimator of the diffusion tensor. In what follows, we introduce and prove the convergence of these neural estimators. Without loss of generality, we assume $\tau\leq 1$ and $T\geq 1$, and denote the $\sigma$-algebra generated by all possible sequences $(\boldsymbol{x}_{k\tau})_{k=0}^{N}$ as $\mathcal{F}_{\tau,N}(\boldsymbol{b},D)$.

3.1 Neural Estimators

We define $\mathcal{L}_{T}^{\boldsymbol{b}}(\hat{\boldsymbol{b}})$ and $\mathcal{L}_{T}^{D}(\hat{D})$ as the objective functions for drift and diffusion estimation, respectively, by noticing that the ground truth drift vector $\boldsymbol{b}$ can be represented as the minimizer of the following objective function as the time step $\tau\to 0$ and the time horizon $T\to\infty$:

\mathcal{L}_{T}^{\boldsymbol{b}}(\hat{\boldsymbol{b}};(\boldsymbol{x}_{t})_{0\leq t\leq T}):=\dfrac{1}{T}\int_{0}^{T}\left\|\hat{\boldsymbol{b}}(\boldsymbol{x}_{t})-\dfrac{1}{\tau}\Delta\boldsymbol{x}_{t}\right\|_{2}^{2}\mathrm{d}t, \quad (3)

where $\Delta\boldsymbol{x}_{t}=\boldsymbol{x}_{t+\tau}-\boldsymbol{x}_{t}$. With the ground truth drift vector $\boldsymbol{b}$, the ground truth diffusion tensor can also be represented as the minimizer of the following objective function as $\tau\to 0$ and $T\to\infty$:

\mathcal{L}_{T}^{D}(\hat{D};(\boldsymbol{x}_{t})_{0\leq t\leq T},\boldsymbol{b}):=\dfrac{1}{T}\int_{0}^{T}\bigg\|\hat{D}(\boldsymbol{x}_{t})-\dfrac{\left(\Delta\boldsymbol{x}_{t}-\boldsymbol{b}(\boldsymbol{x}_{t})\tau\right)\left(\Delta\boldsymbol{x}_{t}-\boldsymbol{b}(\boldsymbol{x}_{t})\tau\right)^{\top}}{2\tau}\bigg\|_{F}^{2}\mathrm{d}t, \quad (4)

where $\|\cdot\|_{F}$ is the Frobenius norm of a matrix.

Based on the discussion in the last section, we will only estimate the values of $\hat{\boldsymbol{b}}$ and $\hat{D}$ in the fundamental domain $\Omega$ and then extend them to the whole space by periodicity. Therefore, using our data $(\boldsymbol{x}_{k\tau})_{k=0}^{N}$ as quadrature points, we approximate the objective function for drift estimation (3) as:

\tilde{\mathcal{L}}_{N}^{\boldsymbol{b}}(\hat{\boldsymbol{b}};(\boldsymbol{x}_{k\tau})_{k=0}^{N}):=\dfrac{1}{N}\sum_{k=0}^{N-1}\left\|\dfrac{\boldsymbol{x}_{(k+1)\tau}-\boldsymbol{x}_{k\tau}}{\tau}-\hat{\boldsymbol{b}}(\widetilde{\boldsymbol{x}}_{k\tau})\right\|_{2}^{2}, \quad (5)

and the objective function for diffusion estimation (4) as

\tilde{\mathcal{L}}_{N}^{D}(\hat{D};(\boldsymbol{x}_{k\tau})_{k=0}^{N},\boldsymbol{b}):=\dfrac{1}{N}\sum_{k=0}^{N-1}\left\|\dfrac{\left(\Delta\boldsymbol{x}_{k\tau}-\boldsymbol{b}(\widetilde{\boldsymbol{x}}_{k\tau})\tau\right)\left(\Delta\boldsymbol{x}_{k\tau}-\boldsymbol{b}(\widetilde{\boldsymbol{x}}_{k\tau})\tau\right)^{\top}}{2\tau}-\hat{D}(\widetilde{\boldsymbol{x}}_{k\tau})\right\|_{F}^{2}. \quad (6)

We will refer to $\tilde{\mathcal{L}}_{N}^{\boldsymbol{b}}(\hat{\boldsymbol{b}};(\boldsymbol{x}_{k\tau})_{k=0}^{N})$ and $\tilde{\mathcal{L}}_{N}^{D}(\hat{D};(\boldsymbol{x}_{k\tau})_{k=0}^{N},\boldsymbol{b})$ as the estimated empirical losses for drift and diffusion estimation, respectively.
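As an illustration of how these losses act on a trajectory array, the following minimal sketch evaluates (5) and (6) directly; it is written in NumPy for concreteness, and the vectorized estimators `b_hat` and `D_hat`, as well as the modulo-one reduction to the fundamental domain, are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def empirical_drift_loss(b_hat, xs, tau):
    """Estimated empirical loss (5); b_hat maps an (n, d) array to an (n, d) array."""
    x_tilde = xs[:-1] % 1.0                      # representatives on the torus
    increments = (xs[1:] - xs[:-1]) / tau        # finite-difference drift estimates
    residual = increments - b_hat(x_tilde)
    return np.mean(np.sum(residual**2, axis=1))

def empirical_diffusion_loss(D_hat, b, xs, tau):
    """Estimated empirical loss (6); D_hat maps an (n, d) array to an (n, d, d) array."""
    x_tilde = xs[:-1] % 1.0
    dx = xs[1:] - xs[:-1] - b(x_tilde) * tau     # increments with the drift removed
    outer = np.einsum("ni,nj->nij", dx, dx) / (2.0 * tau)
    residual = outer - D_hat(x_tilde)
    return np.mean(np.sum(residual**2, axis=(1, 2)))
```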

Algorithm 1 Diffusion inference within function class $\mathfrak{G}$
1:  Find the drift estimator
\hat{\boldsymbol{b}}:=\operatorname*{arg\,min}_{\bar{\boldsymbol{b}}\in\mathfrak{G}^{d}}\tilde{\mathcal{L}}^{\boldsymbol{b}}_{N}(\bar{\boldsymbol{b}};(\boldsymbol{x}_{k\tau})_{k=0}^{N});
2:  Find the diffusion estimator
\hat{D}:=\operatorname*{arg\,min}_{\bar{D}\in\mathfrak{G}^{d\times d}}\tilde{\mathcal{L}}^{D}_{N}(\bar{D};(\boldsymbol{x}_{k\tau})_{k=0}^{N},\hat{\boldsymbol{b}}),
where $\hat{\boldsymbol{b}}$ is the drift estimator obtained in the first step as an approximation of the ground truth $\boldsymbol{b}$.
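In practice, the two arg-min steps of Algorithm 1 are solved approximately by gradient-based optimization; the sketch below is one plausible realization (PyTorch is assumed here for concreteness, and `DriftNet`, `DiffusionNet`, `drift_loss`, and `diffusion_loss` are hypothetical stand-ins for the networks and differentiable versions of losses (5) and (6); the paper does not prescribe a specific optimizer).

```python
import torch

def minimize(model, empirical_loss, steps=5000, lr=1e-3):
    """Gradient-descent surrogate for one arg-min step of Algorithm 1."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = empirical_loss(model)   # value of (5) or (6) at the current model
        loss.backward()
        opt.step()
    return model

# Step 1: drift estimator minimizing the estimated empirical loss (5).
# b_hat = minimize(DriftNet(), lambda m: drift_loss(m, xs, tau))
# Step 2: diffusion estimator minimizing (6), with b_hat plugged in for b.
# D_hat = minimize(DiffusionNet(), lambda m: diffusion_loss(m, xs, tau, b_hat))
```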

We then parametrize the drift vector and the diffusion tensor within a hypothesis function class $\mathfrak{G}$ and solve for the estimators by optimizing the corresponding estimated empirical loss, as in Algorithm 1. Following foundational works including (Oono and Suzuki 2019; Schmidt-Hieber 2020; Chen et al. 2022), we adopt sparse neural networks $\mathfrak{N}(L,\boldsymbol{p},S,M)$ as our hypothesis function class $\mathfrak{G}$, defined as follows. A neural network with depth $L$ and width vector $\boldsymbol{p}=(p_{0},\cdots,p_{L+1})$ is a function $f:\mathbb{R}^{p_{0}}\to\mathbb{R}^{p_{L+1}}$ of the form

\boldsymbol{x}\mapsto f(\boldsymbol{x})=W_{L}(\sigma(W_{L-1}(\cdots\sigma(W_{0}\boldsymbol{x}-\boldsymbol{w}_{1})\cdots)-\boldsymbol{w}_{L})), \quad (7)

where $W_{i}\in\mathbb{R}^{p_{i+1}\times p_{i}}$ are the weight matrices, $\boldsymbol{w}_{i}\in\mathbb{R}^{p_{i}}$ are the shift vectors, and $\sigma(\cdot)$ is the element-wise ReLU activation function. We also bound all parameters in the neural network by unity, as in (Schmidt-Hieber 2020; Suzuki 2018).

Definition 3 (Sparse neural network).

Let $\mathfrak{N}(L,\boldsymbol{p},S,M)$ be the function class of ReLU-activated neural networks with depth $L$ and width $\boldsymbol{p}$ that have at most $S$ non-zero entries, with the function value uniformly bounded by $M$ and all parameters bounded by 1, i.e.

\mathfrak{N}(L,\boldsymbol{p},S,M)=\bigg\{f(\boldsymbol{x})\text{ of the form (7)}\ \bigg|\ \sum_{i=0}^{L}\|W_{i}\|_{0}+\sum_{i=1}^{L}\|\boldsymbol{w}_{i}\|_{0}\leq S,\ \|f\|_{\infty}\leq M,\ \max_{i=0,\cdots,L}\|W_{i}\|_{\infty}\vee\max_{i=1,\cdots,L}\|\boldsymbol{w}_{i}\|_{\infty}\leq 1\bigg\},

where $\|\cdot\|_{0}$ is the number of non-zero entries of a matrix (or a vector) and $\|\cdot\|_{\infty}$ is the maximum absolute value of a matrix (or a vector).

Since we use the neural network for nonparametric estimation in $\Omega\subset\mathbb{R}^{d}$, we will assume $p_{0}=d$ and $p_{L+1}=1$ in the following discussion.
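To make the construction concrete, here is a minimal NumPy sketch (purely illustrative, not the paper's code) that evaluates a network of the form (7) and checks the sparsity and parameter-bound constraints of Definition 3; the sup-norm constraint is only checked approximately on a finite set of sample points.

```python
import numpy as np

def forward(x, Ws, ws):
    """Evaluate f(x) = W_L sigma(W_{L-1}(... sigma(W_0 x - w_1) ...) - w_L), cf. (7).
    Ws has length L + 1 and ws has length L."""
    h = x
    for W, w in zip(Ws[:-1], ws):          # hidden layers: h -> ReLU(W_i h - w_{i+1})
        h = np.maximum(W @ h - w, 0.0)
    return Ws[-1] @ h                      # final linear layer W_L

def in_sparse_class(Ws, ws, S, M, sample_points):
    """Check membership in N(L, p, S, M) per Definition 3 (sup norm of f is only
    approximated on sample_points)."""
    nonzeros = sum(np.count_nonzero(W) for W in Ws) + sum(np.count_nonzero(w) for w in ws)
    params_bounded = max(np.abs(W).max() for W in Ws) <= 1.0 and \
                     max(np.abs(w).max() for w in ws) <= 1.0
    sup_f = max(np.abs(forward(x, Ws, ws)).max() for x in sample_points)
    return nonzeros <= S and params_bounded and sup_f <= M
```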

3.2 Ergodicity

Optimal convergence rates of neural network-based PDE solvers, as showcased in (Nickl, van de Geer, and Wang 2020; Lu et al. 2021; Gu et al. 2023), are typically established under the assumption of data independence. However, the presence of time correlations in the observational data $(\boldsymbol{x}_{k\tau})_{k=0}^{N}$ from a single trajectory significantly complicates the task of establishing an upper bound for the convergence of the neural estimators obtained by Algorithm 1. In this context, we fully exploit the ergodicity of the diffusion process, bound the ergodic approximation error by the $\upbeta$-mixing coefficient, and show that the exponential ergodicity condition, which is naturally satisfied by a wide range of diffusion processes, is sufficient for the fast-rate convergence of the proposed neural estimators.

We first introduce the definition of exponential ergodicity:

Definition 4 (Exponential ergodicity (Down, Meyn, and Tweedie 1995)).

A diffusion process $(X_{t})_{t\geq 0}$ with domain $\Omega$ is uniformly exponentially ergodic if there exists a unique stationary distribution $\mu$ such that for any $x\in\Omega$,

\|\mathbb{P}^{t}(x,\cdot)-\mu\|_{\mathrm{TV}}\leq M_{\mu}(x)\exp(-C_{\mu}t),

where $M_{\mu}(x),C_{\mu}>0$.

As a direct consequence of (Papanicolau, Bensoussan, and Lions 1978, Theorem 3.2) and the compactness of the torus $\widetilde{\Omega}$, we have the following result:

Proposition 1 (Exponential ergodicity of $(\widetilde{\boldsymbol{x}}_{t})_{t\geq 0}$).

The diffusion process $(\widetilde{\boldsymbol{x}}_{t})_{t\geq 0}$, the image of $(\boldsymbol{x}_{t})_{t\geq 0}$ in (1) under the quotient map, is uniformly exponentially ergodic with respect to a unique stationary distribution $\widetilde{\Pi}$ on the torus $\widetilde{\Omega}$ under Assumptions 2a, 2b, and 2c. In particular, there exist constants $M_{\widetilde{\Pi}},C_{\widetilde{\Pi}}>0$ that only depend on $c$, $\boldsymbol{b}$, and $D$, such that for any $\widetilde{\boldsymbol{x}}\in\Omega$,

\|\widetilde{\mathbb{P}}^{t}(\widetilde{\boldsymbol{x}},\cdot)-\widetilde{\Pi}\|_{\mathrm{TV}}\leq M_{\widetilde{\Pi}}\exp(-C_{\widetilde{\Pi}}t).

See (Kulik 2017) for further discussions and required regularities for this property beyond the torus setting.

The ergodicity of stochastic processes is closely related to the notion of mixing conditions, which quantify the “asymptotic independence” of random sequences. One of the most commonly used mixing conditions for stochastic processes is the following $\upbeta$-mixing condition:

Definition 5 ($\upbeta$-mixing condition (Kuznetsov and Mohri 2017)).

The $\upbeta$-mixing coefficient of a stochastic process $(X_{t})_{t\geq 0}$ with respect to a probability measure $\mu$ is defined as

\upbeta(t;(X_{t})_{t\geq 0},\mu):=\sup_{s\geq 0}\mathbb{E}_{\mathcal{F}_{0}^{t}}\left[\|\mu-\mathbb{P}_{t+s}^{\infty}(\cdot|\mathcal{F}_{0}^{t})\|_{\mathrm{TV}}\right],

where $\mathcal{F}_{a}^{b}$ is the $\sigma$-algebra generated by $(X_{t})_{a\leq t\leq b}$, and $\mathbb{P}_{a}^{b}$ is the law of $(X_{t})_{a\leq t\leq b}$. In particular, when $\upbeta(t;(X_{t})_{t\geq 0},\mu)\leq M_{\upbeta}\exp(-C_{\upbeta}t)$ for some constants $M_{\upbeta},C_{\upbeta}>0$, we say $X_{t}$ is geometrically $\upbeta$-mixing with respect to $\mu$.

By taking $\mu$ to be the stationary distribution $\widetilde{\Pi}$ in the definition above, the following proposition follows:

Proposition 2 ($\upbeta$-mixing condition of $(\widetilde{\boldsymbol{x}}_{t})_{t\geq 0}$).

$\upbeta(t;(\widetilde{\boldsymbol{x}}_{t})_{t\geq 0},\widetilde{\Pi})\leq M_{\widetilde{\Pi}}\exp(-C_{\widetilde{\Pi}}t)$, i.e. $\widetilde{\boldsymbol{x}}_{t}$ is geometrically $\upbeta$-mixing with respect to $\widetilde{\Pi}$.

With a slight abuse of notation, we also denote by $\widetilde{\Pi}$ the pushforward of the invariant measure $\widetilde{\Pi}$ under the inverse of the canonical map $\iota^{-1}:\widetilde{\Omega}\to\Omega$.

3.3 Convergence Guarantee

In this section, we describe the main upper bounds for the neural estimators in Algorithm 1. We present theoretical guarantees for drift and diffusion estimation in Theorems 3 and 4, respectively. Our main result shows that estimating the drift and diffusion tensor can achieve the standard minimax optimal nonparametric function estimation rate, even with non-i.i.d. data.

Due to the ergodic theorem (Kulik 2017, Theorem 5.3.3) under the exponential ergodicity condition and the properties of Itô processes, the bias parts of the objective functions $\mathcal{L}_{T}^{\boldsymbol{b}}(\hat{\boldsymbol{b}};(\boldsymbol{x}_{t})_{0\leq t\leq T})$ and $\mathcal{L}_{T}^{D}(\hat{D};(\boldsymbol{x}_{t})_{0\leq t\leq T},\boldsymbol{b})$ for drift and diffusion estimation, as defined in (3) and (4), converge to

\mathcal{L}^{\boldsymbol{b}}_{\widetilde{\Pi}}(\hat{\boldsymbol{b}}):=\mathbb{E}_{\widetilde{\boldsymbol{x}}\sim\widetilde{\Pi}}\left[\|\hat{\boldsymbol{b}}(\widetilde{\boldsymbol{x}})-\boldsymbol{b}(\widetilde{\boldsymbol{x}})\|_{2}^{2}\right]\quad\text{and}\quad\mathcal{L}^{D}_{\widetilde{\Pi}}(\hat{D}):=\mathbb{E}_{\widetilde{\boldsymbol{x}}\sim\widetilde{\Pi}}\left[\|\hat{D}(\widetilde{\boldsymbol{x}})-D(\widetilde{\boldsymbol{x}})\|^{2}_{F}\right], \quad (8)

as $\tau\to 0$ and $T\to\infty$, which we will refer to as the population losses for drift and diffusion estimation, respectively. Our convergence guarantee is thus built on these population losses.

Theorem 3 (Upper bound for drift estimation in $\mathfrak{N}(L,\boldsymbol{p},S,M)$).

Suppose the drift vector $\boldsymbol{b}\in\mathcal{C}^{s}(\Omega,M)$, and the hypothesis class $\mathfrak{G}=\mathfrak{N}(L,\boldsymbol{p},S,M)$ with

K\asymp T^{\frac{d}{2s+d}},\quad L\lesssim\log K,\quad\|\boldsymbol{p}\|_{\infty}\lesssim K,\quad S\lesssim K\log K.

Then with high probability the minimizer $\hat{\boldsymbol{b}}$ obtained by Algorithm 1 satisfies

\mathbb{E}_{(\boldsymbol{x}_{k\tau})_{k=0}^{N}\sim\mathcal{F}_{\tau,N}(\boldsymbol{b},D)}\left[\mathcal{L}^{\boldsymbol{b}}_{\widetilde{\Pi}}(\hat{\boldsymbol{b}})\right]\lesssim T^{-\frac{2s}{2s+d}}\log^{3}T+\tau.

Theorem 4 (Upper bound for diffusion estimation in $\mathfrak{N}(L,\boldsymbol{p},S,M)$).

Suppose the diffusion tensor $D\in\mathcal{C}^{s}(\Omega,M)$, and the hypothesis class $\mathfrak{G}=\mathfrak{N}(L,\boldsymbol{p},S,M)$ with

K\asymp N^{\frac{d}{2s+d}},\quad L\lesssim\log K,\quad\|\boldsymbol{p}\|_{\infty}\lesssim K,\quad S\lesssim K\log K.

Then with high probability the minimizer $\hat{D}$ obtained by Algorithm 1 satisfies

\mathbb{E}_{(\boldsymbol{x}_{k\tau})_{k=0}^{N}\sim\mathcal{F}_{\tau,N}(\boldsymbol{b},D)}\left[\mathcal{L}^{D}_{\widetilde{\Pi}}(\hat{D})\right]\lesssim N^{-\frac{2s}{2s+d}}\log^{3}N+\tau+\dfrac{\log^{2}N}{T}. \quad (9)
Remark 4.

In this remark, we explain the meaning of each term in the convergence rate (9):

  • The term $N^{-\frac{2s}{2s+d}}\log^{3}N$ matches the standard minimax optimal rate $N^{-\frac{2s}{2s+d}}$ up to an extra $\mathrm{poly}(\log N)$ factor. This is characteristic of performing nonparametric regression for $s$-Hölder continuous functions with $N$ noisy observations (Tsybakov and Zaiats 2009). This differs from drift estimation (Theorem 3), in which the nonparametric dependency is on $T$ instead of $N$; the additional $\frac{\log^{2}N}{T}$ term is discussed below.

  • The term $\tau$ represents a bias term that arises due to the finite resolution of the observations $(\boldsymbol{x}_{k\tau})_{k=0}^{N}$. Specifically, this term encapsulates the error incurred while approximating the objective function $\mathcal{L}_{T}^{D}(\hat{D};(\boldsymbol{x}_{t})_{0\leq t\leq T},\boldsymbol{b})$ by the estimated empirical loss $\tilde{\mathcal{L}}_{N}^{D}(\hat{D};(\boldsymbol{x}_{k\tau})_{k=0}^{N},\hat{\boldsymbol{b}})$ with numerical quadrature and finite difference computations;

  • The term $\frac{\log^{2}N}{T}$ quantifies the error in approximating the population loss $\mathcal{L}^{D}_{\widetilde{\Pi}}(\hat{D})$ by the objective function $\mathcal{L}_{T}^{D}(\hat{D};(\boldsymbol{x}_{t})_{0\leq t\leq T},\boldsymbol{b})$ by applying the ergodic theorem up to time horizon $T$. This term essentially signifies the portion of the domain that the trajectory has not yet traversed. Refs. (Hoffmann 1997, 1999a) only provide guarantees for $\mathcal{L}_{T}^{D}(\hat{D};(\boldsymbol{x}_{t})_{0\leq t\leq T},\boldsymbol{b})$ and thus this term does not appear there.

3.4 Proof Sketch

In this section, we omit the dependency of the losses on the data $(\boldsymbol{x}_{k\tau})_{k=0}^{N}$ for notational simplicity.

To obtain a unified proof for both drift and diffusion estimation, it is useful to think of our neural estimator as a function regressor with imperfect supervision signals. We consider an estimator $\hat{g}\in\mathfrak{G}$ of an arbitrary ground truth function $g^{0}$, obtained by minimizing the estimated empirical loss

\tilde{\mathcal{L}}_{N}^{g^{0}}(\hat{g})=\dfrac{1}{N}\sum_{k=0}^{N-1}\left(g^{0}(\widetilde{\boldsymbol{x}}_{k\tau})+\Delta Z_{k\tau}-\hat{g}(\widetilde{\boldsymbol{x}}_{k\tau})\right)^{2}, \quad (10)

where the supervision signal is polluted by the noise $\Delta Z_{k\tau}=Z_{(k+1)\tau}-Z_{k\tau}$, with $Z_{t}$ being an $\mathcal{F}_{t}$-adapted continuous semimartingale. Following Doob's decomposition, we write $Z_{t}=A_{t}+M_{t}$, where $(A_{t})_{t\geq 0}$ is a continuous process with finite variation that is deterministic on $[k\tau,(k+1)\tau]$ conditioned on $\mathcal{F}_{0}^{k\tau}$, given by

A_{t}=\sum_{k=0}^{N-1}\left(\mathbb{E}\left[Z_{t\wedge(k+1)\tau}\mid\mathcal{F}_{t\wedge k\tau}\right]-Z_{t\wedge k\tau}\right),

and $(M_{t})_{t\geq 0}$ is a local martingale, given by

M_{t}=\sum_{k=0}^{N-1}\left(Z_{t\wedge(k+1)\tau}-\mathbb{E}\left[Z_{t\wedge(k+1)\tau}\mid\mathcal{F}_{t\wedge k\tau}\right]\right).

The population loss $\mathcal{L}^{g^{0}}_{\widetilde{\Pi}}(\hat{g})$ can be defined similarly as in (8). Additionally, we define the empirical loss for the estimator $\hat{g}$ as

\hat{\mathcal{L}}_{N}^{g^{0}}(\hat{g}):=\dfrac{1}{N}\sum_{k=0}^{N-1}\left(g^{0}(\widetilde{\boldsymbol{x}}_{k\tau})-\hat{g}(\widetilde{\boldsymbol{x}}_{k\tau})\right)^{2}.

In our proof, we first show that as long as the following two conditions hold for the noise $(\Delta Z_{k\tau})_{k=0}^{N}$, the minimax optimal nonparametric function estimation rate is achieved:

Assumption 6.

For any $k$, the continuous finite variation process $(A_{t})_{t\geq 0}$ satisfies

\mathbb{E}\left[\dfrac{1}{N}\sum_{k=0}^{N-1}(\Delta A_{k\tau})^{2}\right]\leq C_{A}\tau.

Assumption 7.

For some $\gamma\leq 1$, the local martingale $(M_{t})_{t\geq 0}$ satisfies

\max_{k}\left|\mathbb{E}\left[\Delta\left\langle M\right\rangle_{k\tau}\,\big|\,\mathcal{F}_{0}^{k\tau}\right]\right|\leq C_{M}\tau^{-\gamma},

where $\langle\cdot\rangle$ denotes the quadratic variation.

Remark 5.

Based on the noise decomposition $\Delta Z_{k\tau}=\Delta A_{k\tau}+\Delta M_{k\tau}$, the term $\Delta A_{k\tau}$ can be intuitively understood as the bias of the data. This bias is caused by the numerical scheme employed for computing $g^{0}_{k\tau}$. On the other hand, the term $\Delta M_{k\tau}$ represents the martingale noise added to the data, which can be considered analogous to the i.i.d. noise in common nonparametric estimation settings. Assumption 6 essentially implies that the estimator $\hat{g}$ is consistent, for its expectation converges to $g^{0}$ as $\tau\to 0$. Meanwhile, Assumption 7 assumes that the variance of the noise present in the data is at most of order $\mathcal{O}(\tau^{-1})$.

To overcome the correlation of the observed data, we adopt the following sub-sampling technique, as in (Yu 1994; Mohri and Rostamizadeh 2008; Hang and Steinwart 2017): for a sufficiently large $l\geq 1$ such that $N=nl$ (we assume $N$ is divisible by $l$ without loss of generality), we split the original $N$ correlated samples $S^{N}:=(\boldsymbol{x}_{k\tau})_{k=0}^{N}$ into $l$ sub-sequences $S^{n}_{(a)}:=(\boldsymbol{x}_{(kl+a)\tau})_{k=0}^{n-1}$ for $a=0,\cdots,l-1$. The main idea of this technique is that under fast $\upbeta$-mixing conditions, each sub-sequence can be treated approximately as $n$ i.i.d. samples from the distribution $\widetilde{\Pi}$, to which the classical generalization results may apply, with an error that can be controlled by the mixing coefficient via the following lemma (a code sketch of the splitting is given after the lemma):

Lemma 5 ((Kuznetsov and Mohri 2017, Proposition 2)).

Let $h$ be any function on $\widetilde{\Omega}^{n}$ with $-M_{1}\leq h\leq M_{2}$ for $M_{1},M_{2}\geq 0$. Then for any $0\leq a\leq l-1$, we have

\left|\mathbb{E}_{\widetilde{S}^{n}_{\widetilde{\Pi}}\sim{\widetilde{\Pi}}^{\otimes n}}\left[h(\widetilde{S}^{n}_{\widetilde{\Pi}})\right]-\mathbb{E}\left[h(\widetilde{S^{n}_{(a)}})\right]\right|\leq(M_{1}+M_{2})n\upbeta(l\tau),

where the second expectation is taken over the sub-$\sigma$-algebra of $\mathcal{F}_{0}^{T}$ generated by the sub-sequence $S^{n}_{(a)}=(\boldsymbol{x}_{(kl+a)\tau})_{k=0}^{n-1}$ and $\widetilde{S^{n}_{(a)}}:=(\widetilde{\boldsymbol{x}}_{(kl+a)\tau})_{k=0}^{n-1}$.
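To illustrate the splitting referenced above, the following minimal NumPy sketch (illustrative only; the array layout and the divisibility assumption are ours) extracts the $l$ interleaved sub-sequences $S^{n}_{(a)}$ from a trajectory array.

```python
import numpy as np

def subsample_blocks(xs, l):
    """Split the samples (x_{k tau})_k stored in the rows of xs into l interleaved
    sub-sequences S^n_(a) = (x_{(k l + a) tau})_{k=0}^{n-1}, a = 0, ..., l - 1.
    Within a fixed sub-sequence, consecutive points are l * tau apart in time, so
    under fast beta-mixing they behave approximately like i.i.d. draws from Pi."""
    N = xs.shape[0]
    assert N % l == 0, "we assume N is divisible by l, as in the paper"
    n = N // l
    return [xs[a::l] for a in range(l)]   # each block has exactly n rows
```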

Based on Lemma 5, we derive the following fast rate generalization bound via local Rademacher complexity arguments (Bartlett, Bousquet, and Mendelson 2005; Koltchinskii 2006). The proof is shown in Appendix A.1.

Theorem 6.

Let $N=nl$. Suppose the localized Rademacher complexity satisfies

\mathfrak{R}_{N}(\{\ell\circ g\,|\,g\in\mathfrak{G},\mathbb{E}_{\Pi}[\ell\circ g]\leq r\})\leq\phi(r),

where $\phi(r)$ is a sub-root function, i.e. a non-negative, non-decreasing function $\phi:\mathbb{R}^{+}\to\mathbb{R}^{+}$ such that $\phi(r)/\sqrt{r}$ is non-increasing for $r>0$, and $\mathfrak{R}_{N}(\mathfrak{F})$ is the Rademacher complexity of a function class $\mathfrak{F}$ with respect to $N$ i.i.d. samples from the stationary distribution $\widetilde{\Pi}$, i.e.

\mathfrak{R}_{N}(\mathfrak{F})=\mathbb{E}_{(\widetilde{X}_{i})_{i=1}^{N}\sim{\widetilde{\Pi}}^{\otimes N},\boldsymbol{\sigma}\sim\mathrm{Unif}(\{\pm 1\}^{N})}\left[\sup_{f\in\mathfrak{F}}\dfrac{1}{N}\sum_{i=1}^{N}\sigma_{i}f(\widetilde{X}_{i})\right]. \quad (11)

Let $r^{*}$ be the unique solution to the fixed-point equation $\phi(r)=r$. Then for any $\delta>N\upbeta(l\tau)$ and $\epsilon>0$, we have with probability $1-\delta$, for any $\hat{g}\in\mathfrak{G}$,

\mathcal{L}_{\widetilde{\Pi}}^{g^{0}}(\hat{g})\leq\dfrac{1}{1-\epsilon}\hat{\mathcal{L}}_{N}^{g^{0}}(\hat{g})+\dfrac{176}{M^{2}\epsilon}r^{*}+\dfrac{\left(44\epsilon+104\right)M^{2}\log\left(\frac{l}{\delta^{\prime}}\right)}{\epsilon n}, \quad (12)

where $\delta^{\prime}=\delta-N\upbeta(l\tau)$.

Bias and noise in the objective function certainly affect the optimization. Thus, we seek an oracle-type inequality for the expectation of the empirical loss $\hat{\mathcal{L}}_{N}^{g^{0}}(\hat{g})$ over the data, which is proved in Appendix A.2. The main technique is a uniform martingale concentration inequality (cf. Lemma 12).

Theorem 7.

Suppose $\mathfrak{G}$ is separable with respect to the $L^{\infty}$ norm with $\rho$-covering number $\mathcal{N}(\rho,\mathfrak{G},\|\cdot\|_{\infty})\geq 2$. Then under Assumptions 6 and 7, we have

\mathbb{E}\left[\hat{\mathcal{L}}_{N}^{g^{0}}(\hat{g})\right]\leq\dfrac{1+\epsilon}{1-\epsilon}\inf_{\bar{g}\in\mathfrak{G}}\mathbb{E}\left[\hat{\mathcal{L}}_{N}^{g^{0}}(\bar{g})\right]+\dfrac{3C_{A}}{\epsilon}\tau+\dfrac{12C_{M}\log\mathcal{N}(\rho,\mathfrak{G},\|\cdot\|_{\infty})}{\epsilon N\tau^{\gamma}}+2\sqrt{\dfrac{4C_{M}\rho^{2}\log 2}{N\tau^{\gamma}}},

where the expectation is taken over the sub-$\sigma$-algebra of $\mathcal{F}_{0}^{T}$ generated by the trajectory $(\boldsymbol{x}_{k\tau})_{k=0}^{N}$, from which $\hat{g}$ is constructed by minimizing the estimated empirical loss $\tilde{\mathcal{L}}_{N}^{g^{0}}(\hat{g};(\boldsymbol{x}_{k\tau})_{k=0}^{N})$.

In particular, when we choose the hypothesis class $\mathfrak{G}$ to be the sparse neural network class $\mathfrak{N}(L,\boldsymbol{p},S,M)$ and combine Theorem 6 and Theorem 7, we obtain the following theorem, with the proof given in Appendix A.3:

Theorem 8.

Suppose Assumptions 6 and 7 are satisfied, the ground truth $g^{0}\in\mathcal{C}^{s}(\Omega,M)$, and the hypothesis class $\mathfrak{G}=\mathfrak{N}(L,\boldsymbol{p},S,M)$ with

K\asymp(N(\tau^{\gamma}\wedge 1))^{\frac{d}{2s+d}},\quad L\lesssim\log K,\quad\|\boldsymbol{p}\|_{\infty}\lesssim K,\quad S\lesssim K\log K.

Then with high probability the minimizer $\hat{g}$ obtained by minimizing the estimated empirical loss $\tilde{\mathcal{L}}_{N}^{g^{0}}(\hat{g};(\boldsymbol{x}_{k\tau})_{k=0}^{N})$ satisfies

\mathbb{E}\left[\mathcal{L}_{\widetilde{\Pi}}^{g^{0}}(\hat{g})\right]\lesssim\left(N(\tau^{\gamma}\wedge 1)\right)^{-\frac{2s}{2s+d}}\log^{3}(N(\tau^{\gamma}\wedge 1))+\tau+\dfrac{\log^{2}N}{N\tau}, \quad (13)

where the expectation has the same interpretation as in Theorem 7.

With Theorem 8 in hand, the detailed proofs of Theorems 3 and 4 are given in Appendices B.1 and B.2, respectively.

Figure 1: Numerical results of the proposed neural diffusion estimator are consistent with the scaling expected from the theoretical bound. We probe this by varying $N$ and $T=N\tau$ with fixed lag time $\tau$ (panel (a): $\tau=10^{-3}$) and by varying $N$ and $\tau=T/N$ with fixed time horizon $T$ (panel (b): $T=10^{3}$).

4 Experiments

In this section, we present numerical results on a two-dimensional example to illustrate the agreement between the empirical convergence rates of our proposed neural diffusion estimator and our theoretical rates. Consider the following SDE in $\mathbb{R}^{2}$:

\mathrm{d}\boldsymbol{x}_{t}=f(\boldsymbol{x}_{t})\nabla f(\boldsymbol{x}_{t})\,\mathrm{d}t+f(\boldsymbol{x}_{t})\,\mathrm{d}\boldsymbol{w}_{t}, \quad (14)

where

f(\boldsymbol{x})=1+\frac{1}{2}\cos(2\pi(x_{1}+x_{2})),

i.e. $\boldsymbol{b}(\boldsymbol{x})=f(\boldsymbol{x})\nabla f(\boldsymbol{x})$ and $D(\boldsymbol{x})=\frac{f(\boldsymbol{x})^{2}}{2}I$, where $I$ is the $2\times 2$ identity matrix. It is straightforward to verify that this diffusion process satisfies Assumptions 2a, 2b, and 2c with smoothness $s=\infty$. Our goal is to estimate the value of the function $f(\boldsymbol{x})$ within $\Omega=[0,1)^{2}$. We employ Algorithm 1 to estimate both $\boldsymbol{b}(\boldsymbol{x})$ and $D(\boldsymbol{x})$ with separate neural networks, treating the two inference tasks entirely independently. One may also prove that the stationary distribution $\widetilde{\Pi}$ of this diffusion process is the Lebesgue measure on the two-dimensional torus, which makes error evaluation easier and more precise.
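For reference, the ground-truth coefficients of (14) are available in closed form; the short NumPy sketch below (illustrative only, with the analytic gradient worked out by hand) implements $f$, $\boldsymbol{b}=f\nabla f$, and $D=\tfrac{f^{2}}{2}I$, and can be combined with an Euler-Maruyama integrator to generate synthetic trajectories as described below.

```python
import numpy as np

def f(x):
    # f(x) = 1 + 0.5 * cos(2 pi (x1 + x2))
    return 1.0 + 0.5 * np.cos(2.0 * np.pi * (x[0] + x[1]))

def grad_f(x):
    # each partial derivative of f equals -pi * sin(2 pi (x1 + x2))
    g = -np.pi * np.sin(2.0 * np.pi * (x[0] + x[1]))
    return np.array([g, g])

def b(x):
    # drift vector b(x) = f(x) * grad f(x)
    return f(x) * grad_f(x)

def D(x):
    # diffusion tensor D(x) = f(x)^2 / 2 * I, since Sigma(x) = f(x) * I
    return 0.5 * f(x) ** 2 * np.eye(2)
```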

To impose the periodic boundary condition, we introduce an explicit regularization term into our training loss,

\mathcal{L}_{\rm per}(\hat{g})=\mathbb{E}_{(\boldsymbol{x},\boldsymbol{y})\sim\mathrm{Unif}(\partial\Omega^{2}),\widetilde{\boldsymbol{x}}=\widetilde{\boldsymbol{y}}}\left[\left(\hat{g}(\boldsymbol{x})-\hat{g}(\boldsymbol{y})\right)^{2}\right],

approximated empirically by $\hat{\mathcal{L}}_{\rm per}(\hat{g})$ with 1000 pairs of random samples. The final training loss is thus $\tilde{\mathcal{L}}_{N}(\hat{g})+\lambda\hat{\mathcal{L}}_{\rm per}(\hat{g})$, where $\lambda$ is a hyperparameter and $\hat{g}$ can be either $\hat{\boldsymbol{b}}$ or $\hat{D}$.
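One plausible way to realize $\hat{\mathcal{L}}_{\rm per}$ in code for the two-dimensional domain is to sample matched boundary pairs explicitly; the PyTorch sketch below is our own illustrative construction (the paper does not specify how the boundary pairs are drawn), producing pairs $(\boldsymbol{x},\boldsymbol{y})$ on opposite faces of $\partial\Omega$ that coincide on the torus.

```python
import torch

def periodic_penalty(g_hat, n_pairs=1000):
    """Empirical surrogate for L_per on Omega = [0, 1)^2: each x lies on one face of
    the boundary and y on the opposite face, so x and y coincide on the torus; the
    penalty is the mean squared mismatch of g_hat over such pairs."""
    u = torch.rand(n_pairs)                     # free coordinate along the face
    axis = torch.randint(0, 2, (n_pairs,))      # which coordinate wraps around
    zero = torch.zeros_like(u)
    x = torch.stack([torch.where(axis == 0, zero, u),
                     torch.where(axis == 0, u, zero)], dim=1)
    y = x.clone()
    y[torch.arange(n_pairs), axis] = 1.0        # move the wrapping coordinate to 1
    return (g_hat(x) - g_hat(y)).pow(2).mean()

# total training loss: estimated empirical loss plus lam * periodic_penalty(g_hat)
```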

We first generate data using the Euler-Maruyama method with a time step $\tau_{0}=2\times 10^{-5}$ up to $T_{0}=10^{4}$, and then sub-sample data at varying time steps $\tau$ and time horizons $T$ for each experiment instance from this common trajectory. We use a ResNet as our neural network structure with two residual blocks, each containing a fully-connected layer with a hidden dimension of 1000. Test data are generated by randomly selecting $5\times 10^{4}$ samples from another sufficiently long trajectory, which are shared by all experiment instances. The training process is executed on one Tesla V100 GPU.

According to our theoretical result (Theorem 4), the convergence rate of this implementation should be approximately of order $N^{-1}+\tau+T^{-1}$ up to $\log$ terms. We thus consider two schemes in our experiment. The first involves a fixed time step $\tau=10^{-3}$ with an expected rate of $\tau+N^{-1}$, and the other maintains a fixed $T=10^{3}$ with an expected rate of $N^{-1}+T^{-1}$. Each of the aforementioned instances is carried out five times. Figure 1 presents the mean values along with their corresponding confidence intervals from these runs. Additionally, reference lines indicating the expected convergence rate $N^{-1}$ are shown in red. Both schemes roughly exhibit the expected decay, aligning with our theoretical expectations. As depicted in Figure 1(a), the decrease of the test error decelerates as $N$ exceeds a certain threshold. This can be attributed to the fact that when $N$ and $T$ are sufficiently large, the bias term $\tau$ arising from the discretization becomes the dominant factor in the error.

5 Conclusion

The ubiquity of correlated data in processes modeled with spatially-inhomogeneous diffusions has created substantial barriers to analysis. In this paper, we construct and analyze a neural network-based numerical algorithm for estimating multidimensional spatially-inhomogeneous diffusion processes based on discretely-observed data obtained from a single trajectory. Utilizing $\upbeta$-mixing conditions and local Rademacher complexity arguments, we establish the convergence rate for our neural diffusion estimator. Our upper bound recovers the minimax optimal nonparametric function estimation rate of the common i.i.d. setting, even with correlated data. We expect our proof techniques to serve as a template for general exponentially ergodic diffusion processes beyond the toroidal setting considered here. Numerical experiments validate our theoretical findings and demonstrate the potential of applying neural diffusion estimators across various contexts with provable accuracy guarantees. Extending our results to typical biophysical settings, e.g. compact domains with reflective boundaries and motion blur due to measurement error, could help establish more rigorous error estimates for physical inference problems.

References

  • Aeckerle-Willems and Strauch (2018) Aeckerle-Willems, C.; and Strauch, C. 2018. Sup-norm adaptive simultaneous drift estimation for ergodic diffusions. arXiv:1808.10660.
  • Aït-Sahalia (2002) Aït-Sahalia, Y. 2002. Maximum likelihood estimation of discretely sampled diffusions: a closed-form approximation approach. Econometrica, 70(1): 223–262.
  • Bartlett, Bousquet, and Mendelson (2005) Bartlett, P. L.; Bousquet, O.; and Mendelson, S. 2005. Local rademacher complexities. The Annals of Statistics, 33(4): 1497 – 1537.
  • Bibby and Sørensen (1995) Bibby, B. M.; and Sørensen, M. 1995. Martingale estimation functions for discretely observed diffusion processes. Bernoulli, 17–39.
  • Bogachev et al. (2022) Bogachev, V. I.; Krylov, N. V.; Röckner, M.; and Shaposhnikov, S. V. 2022. Fokker–Planck–Kolmogorov Equations, volume 207. American Mathematical Society.
  • Boucheron, Lugosi, and Massart (2013) Boucheron, S.; Lugosi, G.; and Massart, P. 2013. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press.
  • Bradley (2005) Bradley, R. C. 2005. Basic properties of strong mixing conditions. A survey and some open questions. Probability Surveys, 2: 107–144.
  • Chen et al. (2022) Chen, M.; Jiang, H.; Liao, W.; and Zhao, T. 2022. Nonparametric regression on low-dimensional manifolds using deep ReLU networks: Function approximation and statistical recovery. Information and Inference: A Journal of the IMA, 11(4): 1203–1253.
  • Chen, Lu, and Lu (2021) Chen, Z.; Lu, J.; and Lu, Y. 2021. On the representation of solutions to elliptic pdes in barron spaces. Advances in neural information processing systems, 34: 6454–6465.
  • Crommelin and Vanden-Eijnden (2011) Crommelin, D.; and Vanden-Eijnden, E. 2011. Diffusion estimation from multiscale data by operator eigenpairs. Multiscale Modeling & Simulation, 9(4): 1588–1623.
  • Dalalyan (2005) Dalalyan, A. 2005. Sharp adaptive estimation of the drift function for ergodic diffusions. The Annals of Statistics, 33(6): 2507 – 2528.
  • Dalalyan and Reiß (2006) Dalalyan, A.; and Reiß, M. 2006. Asymptotic statistical equivalence for scalar ergodic diffusions. Probability theory and related fields, 134: 248–282.
  • Dalalyan and Reiß (2007) Dalalyan, A.; and Reiß, M. 2007. Asymptotic statistical equivalence for ergodic diffusions: the multidimensional case. Probability theory and related fields, 137: 25–47.
  • Davis and Buffett (2022) Davis, W.; and Buffett, B. 2022. Estimation of drift and diffusion functions from unevenly sampled time-series data. Physical Review E, 106(1): 014140.
  • de la Peña, Klass, and Leung Lai (2004) de la Peña, V. H.; Klass, M. J.; and Leung Lai, T. 2004. Self-normalized processes: exponential inequalities, moment bounds and iterated logarithm laws. The Annals of Probability, 32(3): 1902 – 1933.
  • Down, Meyn, and Tweedie (1995) Down, D.; Meyn, S. P.; and Tweedie, R. L. 1995. Exponential and uniform ergodicity of Markov processes. The Annals of Probability, 23(4): 1671–1691.
  • Duan et al. (2021) Duan, C.; Jiao, Y.; Lai, Y.; Lu, X.; and Yang, Z. 2021. Convergence rate analysis for deep ritz method. arXiv:2103.13330.
  • Elerian, Chib, and Shephard (2001) Elerian, O.; Chib, S.; and Shephard, N. 2001. Likelihood inference for discretely observed nonlinear diffusions. Econometrica, 69(4): 959–993.
  • Frishman and Ronceray (2020) Frishman, A.; and Ronceray, P. 2020. Learning force fields from stochastic trajectories. Physical Review X, 10(2): 021009.
  • Gu et al. (2023) Gu, Y.; Harlim, J.; Liang, S.; and Yang, H. 2023. Stationary density estimation of itô diffusions using deep learning. SIAM Journal on Numerical Analysis, 61(1): 45–82.
  • Han, Jentzen, and E (2018) Han, J.; Jentzen, A.; and E, W. 2018. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34): 8505–8510.
  • Hang et al. (2016) Hang, H.; Feng, Y.; Steinwart, I.; and Suykens, J. A. 2016. Learning theory estimates with observations from general stationary stochastic processes. Neural computation, 28(12): 2853–2889.
  • Hang and Steinwart (2017) Hang, H.; and Steinwart, I. 2017. A Bernstein-type inequality for some mixing processes and dynamical systems with an application to learning. The Annals of Statistics, 45(2): 708 – 743.
  • Hoffmann (1997) Hoffmann, M. 1997. Minimax estimation of the diffusion coefficient through irregular samplings. Statistics & probability letters, 32(1): 11–24.
  • Hoffmann (1999a) Hoffmann, M. 1999a. Adaptive estimation in diffusion processes. Stochastic processes and their Applications, 79(1): 135–163.
  • Hoffmann (1999b) Hoffmann, M. 1999b. Lp estimation of the diffusion coefficient. Bernoulli, 447–481.
  • Hütter and Rigollet (2021) Hütter, J.-C.; and Rigollet, P. 2021. Minimax estimation of smooth optimal transport maps. The Annals of Statistics, 49(2): 1166 – 1194.
  • Karniadakis et al. (2021) Karniadakis, G. E.; Kevrekidis, I. G.; Lu, L.; Perdikaris, P.; Wang, S.; and Yang, L. 2021. Physics-informed machine learning. Nature Reviews Physics, 3(6): 422–440.
  • Khoo, Lu, and Ying (2021) Khoo, Y.; Lu, J.; and Ying, L. 2021. Solving parametric PDE problems with artificial neural networks. European Journal of Applied Mathematics, 32(3): 421–435.
  • Koltchinskii (2006) Koltchinskii, V. 2006. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6): 2593 – 2656.
  • Kovachki et al. (2021) Kovachki, N.; Li, Z.; Liu, B.; Azizzadenesheli, K.; Bhattacharya, K.; Stuart, A.; and Anandkumar, A. 2021. Neural operator: Learning maps between function spaces. arXiv:2108.08481.
  • Kulik (2017) Kulik, A. 2017. Ergodic behavior of Markov processes. In Ergodic Behavior of Markov Processes. de Gruyter.
  • Kuznetsov and Mohri (2017) Kuznetsov, V.; and Mohri, M. 2017. Generalization bounds for non-stationary mixing processes. Machine Learning, 106(1): 93–117.
  • Lau and Lubensky (2007) Lau, A. W.; and Lubensky, T. C. 2007. State-dependent diffusion: Thermodynamic consistency and its path integral formulation. Physical Review E, 76(1): 011123.
  • Li et al. (2022) Li, H.; Khoo, Y.; Ren, Y.; and Ying, L. 2022. A semigroup method for high dimensional committor functions based on neural network. In Mathematical and Scientific Machine Learning, 598–618. PMLR.
  • Li et al. (2021) Li, Z.; Kovachki, N.; Azizzadenesheli, K.; Liu, B.; Bhattacharya, K.; Stuart, A.; and Anandkumar, A. 2021. Markov neural operators for learning chaotic systems. arXiv:2106.06898.
  • Lin, Li, and Ren (2023) Lin, B.; Li, Q.; and Ren, W. 2023. Computing high-dimensional invariant distributions from noisy data. Journal of Computational Physics, 474: 111783.
  • Long et al. (2018) Long, Z.; Lu, Y.; Ma, X.; and Dong, B. 2018. Pde-net: Learning pdes from data. In International conference on machine learning, 3208–3216. PMLR.
  • López-Pérez, Febrero-Bande, and González-Manteiga (2021) López-Pérez, A.; Febrero-Bande, M.; and González-Manteiga, W. 2021. Parametric estimation of diffusion processes: A review and comparative study. Mathematics, 9(8): 859.
  • Lu, Jin, and Karniadakis (2019) Lu, L.; Jin, P.; and Karniadakis, G. E. 2019. Deeponet: Learning nonlinear operators for identifying differential equations based on the universal approximation theorem of operators. arXiv:1910.03193.
  • Lu, Blanchet, and Ying (2022) Lu, Y.; Blanchet, J.; and Ying, L. 2022. Sobolev acceleration and statistical optimality for learning elliptic equations via gradient descent. arXiv:2205.07331.
  • Lu et al. (2021) Lu, Y.; Chen, H.; Lu, J.; Ying, L.; and Blanchet, J. 2021. Machine learning for elliptic pdes: fast rate generalization bound, neural scaling law and minimax optimality. arXiv:2110.06897.
  • Marwah, Lipton, and Risteski (2021) Marwah, T.; Lipton, Z.; and Risteski, A. 2021. Parametric complexity bounds for approximating PDEs with neural networks. Advances in Neural Information Processing Systems, 34: 15044–15055.
  • Marwah et al. (2022) Marwah, T.; Lipton, Z. C.; Lu, J.; and Risteski, A. 2022. Neural Network Approximations of PDEs Beyond Linearity: Representational Perspective. arXiv:2210.12101.
  • Mohri and Rostamizadeh (2008) Mohri, M.; and Rostamizadeh, A. 2008. Rademacher complexity bounds for non-iid processes. Advances in Neural Information Processing Systems, 21.
  • Mohri and Rostamizadeh (2010) Mohri, M.; and Rostamizadeh, A. 2010. Stability Bounds for Stationary φ\varphi-mixing and β\beta-mixing Processes. Journal of Machine Learning Research, 11(2).
  • Nickl (2022) Nickl, R. 2022. Inference for diffusions from low frequency measurements. arXiv:2210.13008.
  • Nickl and Ray (2020) Nickl, R.; and Ray, K. 2020. Nonparametric statistical inference for drift vector fields of multi-dimensional diffusions. The Annals of Statistics, 48(3): 1383 – 1408.
  • Nickl and Söhl (2017) Nickl, R.; and Söhl, J. 2017. Nonparametric Bayesian posterior contraction rates for discretely observed scalar diffusions. The Annals of Statistics, 45(4): 1664 – 1693.
  • Nickl, van de Geer, and Wang (2020) Nickl, R.; van de Geer, S.; and Wang, S. 2020. Convergence rates for penalized least squares estimators in PDE constrained regression problems. SIAM/ASA Journal on Uncertainty Quantification, 8(1): 374–413.
  • Oga and Koike (2021) Oga, A.; and Koike, Y. 2021. Drift estimation for a multi-dimensional diffusion process using deep neural networks. arXiv:2112.13332.
  • Oono and Suzuki (2019) Oono, K.; and Suzuki, T. 2019. Approximation and non-parametric estimation of ResNet-type convolutional neural networks. In International conference on machine learning, 4922–4931. PMLR.
  • Ozaki (1992) Ozaki, T. 1992. A bridge between nonlinear time series models and nonlinear stochastic dynamical systems: a local linearization approach. Statistica Sinica, 113–135.
  • Papanicolau, Bensoussan, and Lions (1978) Papanicolau, G.; Bensoussan, A.; and Lions, J.-L. 1978. Asymptotic analysis for periodic structures. Elsevier.
  • Papaspiliopoulos et al. (2012) Papaspiliopoulos, O.; Pokern, Y.; Roberts, G. O.; and Stuart, A. M. 2012. Nonparametric estimation of diffusions: a differential equations approach. Biometrika, 99(3): 511–531.
  • Pedersen (1995) Pedersen, A. R. 1995. Consistency and asymptotic normality of an approximate maximum likelihood estimator for discretely observed diffusion processes. Bernoulli, 257–279.
  • Pokern, Stuart, and van Zanten (2013) Pokern, Y.; Stuart, A. M.; and van Zanten, J. H. 2013. Posterior consistency via precision operators for Bayesian nonparametric drift estimation in SDEs. Stochastic Processes and their Applications, 123(2): 603–628.
  • Pokern, Stuart, and Vanden-Eijnden (2009) Pokern, Y.; Stuart, A. M.; and Vanden-Eijnden, E. 2009. Remarks on drift estimation for diffusion processes. Multiscale modeling & simulation, 8(1): 69–95.
  • Rotskoff, Mitchell, and Vanden-Eijnden (2022) Rotskoff, G. M.; Mitchell, A. R.; and Vanden-Eijnden, E. 2022. Active importance sampling for variational objectives dominated by rare events: Consequences for optimization and generalization. In Mathematical and Scientific Machine Learning, 757–780. PMLR.
  • Rotskoff and Vanden-Eijnden (2019) Rotskoff, G. M.; and Vanden-Eijnden, E. 2019. Dynamical computation of the density of states and Bayes factors using nonequilibrium importance sampling. Physical review letters, 122(15): 150602.
  • Schmidt-Hieber (2020) Schmidt-Hieber, J. 2020. Nonparametric regression using deep neural networks with ReLU activation function. The Annals of Statistics, 48(4): 1875 – 1897.
  • Shoji and Ozaki (1998) Shoji, I.; and Ozaki, T. 1998. Estimation for nonlinear stochastic differential equations by a local linearization method. Stochastic Analysis and Applications, 16(4): 733–752.
  • Sørensen (2004) Sørensen, H. 2004. Parametric inference for diffusion processes observed at discrete points in time: a survey. International Statistical Review, 72(3): 337–354.
  • Steinwart and Christmann (2009) Steinwart, I.; and Christmann, A. 2009. Fast learning from non-iid observations. Advances in neural information processing systems, 22.
  • Strauch (2015) Strauch, C. 2015. Sharp adaptive drift estimation for ergodic diffusions: the multivariate case. Stochastic Processes and their Applications, 125(7): 2562–2602.
  • Strauch (2016) Strauch, C. 2016. Exact adaptive pointwise drift estimation for multidimensional ergodic diffusions. Probability Theory and Related Fields, 164: 361–400.
  • Stroock and Varadhan (1997) Stroock, D. W.; and Varadhan, S. S. 1997. Multidimensional diffusion processes, volume 233. Springer Science & Business Media.
  • Sura and Barsugli (2002) Sura, P.; and Barsugli, J. 2002. A note on estimating drift and diffusion parameters from timeseries. Physics Letters A, 305(5): 304–311.
  • Suzuki (2018) Suzuki, T. 2018. Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality. arXiv:1810.08033.
  • Tsybakov and Zaiats (2009) Tsybakov, A. B.; and Zaiats, V. G. 2009. Introduction to nonparametric estimation. Springer-Verlag New York.
  • Van der Meulen, Van Der Vaart, and Van Zanten (2006) Van der Meulen, F.; Van Der Vaart, A. W.; and Van Zanten, J. 2006. Convergence rates of posterior distributions for Brownian semimartingale models. Bernoulli, 12(5): 863–888.
  • van Waaij and van Zanten (2016) van Waaij, J.; and van Zanten, H. 2016. Gaussian process methods for one-dimensional diffusions: Optimal rates and adaptation. Electronic Journal of Statistics, 10(1): 628 – 645.
  • Wang et al. (2022) Wang, X.; Feng, J.; Liu, Q.; Li, Y.; and Xu, Y. 2022. Neural network-based parameter estimation of stochastic differential equations driven by Lévy noise. Physica A: Statistical Mechanics and its Applications, 606: 128146.
  • Weinan and Wojtowytsch (2022) Weinan, E.; and Wojtowytsch, S. 2022. Some observations on high-dimensional partial differential equations with Barron data. In Mathematical and Scientific Machine Learning, 253–269. PMLR.
  • Xie et al. (2007) Xie, Z.; Kulasiri, D.; Samarasinghe, S.; and Rajanayaka, C. 2007. The estimation of parameters for stochastic differential equations using neural networks. Inverse Problems in Science and Engineering, 15(6): 629–641.
  • Yu (1994) Yu, B. 1994. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 94–116.
  • Yu et al. (2018) Yu, B.; et al. 2018. The deep Ritz method: a deep learning-based numerical algorithm for solving variational problems. Communications in Mathematics and Statistics, 6(1): 1–12.
  • Zang et al. (2020) Zang, Y.; Bao, G.; Ye, X.; and Zhou, H. 2020. Weak adversarial networks for high-dimensional partial differential equations. Journal of Computational Physics, 411: 109409.
  • Zhang et al. (2018) Zhang, L.; Han, J.; Wang, H.; Car, R.; and Weinan, E. 2018. Deep potential molecular dynamics: a scalable model with the accuracy of quantum mechanics. Physical review letters, 120(14): 143001.
  • Zhang (2004) Zhang, T. 2004. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32(1): 56–85.
  • Ziemann, Sandberg, and Matni (2022) Ziemann, I. M.; Sandberg, H.; and Matni, N. 2022. Single trajectory nonparametric learning of nonlinear dynamics. In Conference on Learning Theory, 3333–3364. PMLR.

Appendix A Detailed Proofs of Section 3.4

Besides ^Ng0(g^)\hat{\mathcal{L}}_{N}^{g^{0}}(\hat{g}) defined in the main text, we will use the following notation to denote the empirical loss over a sample set SS other than the sequence SN=(𝒙kτ)k=0NS^{N}=(\boldsymbol{x}_{k\tau})_{k=0}^{N},

^|S|g0(g^;S):=1|S|XkS(g0(X~k)g^(X~k))2,\hat{\mathcal{L}}_{|S|}^{g^{0}}(\hat{g};S):=\dfrac{1}{|S|}\sum_{X_{k}\in S}\left(g^{0}(\widetilde{X}_{k})-\hat{g}(\widetilde{X}_{k})\right)^{2},

where the map ~\widetilde{\cdot} is applied to ensure that each X~kΩ~\widetilde{X}_{k}\in\widetilde{\Omega}. We will denote the squared loss (gg0)2(g-g^{0})^{2} by g\ell\circ g, and, for simplicity, adopt the shorthand notation fkτ=f(𝒙~(kτ))f_{k\tau}=f(\widetilde{\boldsymbol{x}}(k\tau)) for any function f:df:\mathbb{R}^{d}\to\mathbb{R} throughout this section whenever the sequence SN=(𝒙kτ)k=0NS^{N}=(\boldsymbol{x}_{k\tau})_{k=0}^{N} is clear from the context.

A.1 Proof of Theorem 6

In the following, we use S~nΠ~\widetilde{S}^{n}_{\widetilde{\Pi}} to denote a sample set consisting of nn i.i.d. samples drawn from the distribution Π~\widetilde{\Pi}, i.e. S~nΠ~={X~1,X~2,,X~n}Π~n\widetilde{S}^{n}_{\widetilde{\Pi}}=\{\widetilde{X}_{1},\widetilde{X}_{2},\cdots,\widetilde{X}_{n}\}\sim{\widetilde{\Pi}}^{\otimes n}.

To achieve the fast-rate generalization bound, we will make use of the following local Rademacher complexity argument:

Lemma 9 ((Bartlett, Bousquet, and Mendelson 2005, Theorem 3.3)).

Let 𝔉\mathfrak{F} be a function class and for each f𝔉f\in\mathfrak{F}, M1fM2M_{1}\leq f\leq M_{2} and VarΠ~[f]B𝔼Π~[f]\mathrm{Var}_{\widetilde{\Pi}}[f]\leq B\mathbb{E}_{\widetilde{\Pi}}[f] hold. Suppose N({f𝔉|𝔼Π~[f]r})ϕ(r)\mathfrak{R}_{N}(\{f\in\mathfrak{F}|\mathbb{E}_{\widetilde{\Pi}}[f]\leq r\})\leq\phi(r), where ϕ(r)\phi(r) is a sub-root function with a fixed point rr^{*}. Then with probability at least 1δ1-\delta, we have for any f𝔉f\in\mathfrak{F}

𝔼Π~[f]11ϵ1nX~kS~Π~nf(X~k)+704Bϵr+(11(M2M1)ϵ+26B)log1δϵn.\mathbb{E}_{\widetilde{\Pi}}[f]\leq\dfrac{1}{1-\epsilon}\dfrac{1}{n}\sum_{\widetilde{X}_{k}\in\widetilde{S}_{\widetilde{\Pi}}^{n}}f(\widetilde{X}_{k})+\dfrac{704}{B\epsilon}r^{*}+\dfrac{\left(11(M_{2}-M_{1})\epsilon+26B\right)\log\frac{1}{\delta}}{\epsilon n}.
Remark 6.

Compared with the standard Rademacher complexity, the local Rademacher complexity argument enables adaptation to the local quadratic geometry of the loss function and recovers the fast-rate generalization bound (Bartlett, Bousquet, and Mendelson 2005). One of our contributions is showing how to adapt the local Rademacher arguments to the non-i.i.d. data along the diffusion process and providing near-optimal bounds for diffusion estimation.

For completeness, we also provide the proof of Lemma 5:

Proof of Lemma 5.

Notice that each sub-sequence Sn(a)=(𝒙(kl+a)τ)k=0n1S^{n}_{(a)}=(\boldsymbol{x}_{(kl+a)\tau})_{k=0}^{n-1} is a sequence whose consecutive observations are separated by the time interval lτl\tau. The result then follows directly from (Kuznetsov and Mohri 2017, Proposition 2) by setting the distribution Π\Pi and the sub-sample 𝒁(j)\boldsymbol{Z}^{(j)} to be Π~\widetilde{\Pi} and Sn(a)S^{n}_{(a)}, respectively. ∎

Remark 7.

Lemma 5 serves as a crucial tool for addressing the challenges presented by non-i.i.d. data points. To give an intuitive explanation, consider the way we segment the initial observations, represented by SN=(𝐱kτ)k=0NS^{N}=(\boldsymbol{x}_{k\tau})_{k=0}^{N}. We divide them into ll sub-sequences, denoted as Sn(a)=(𝐱(kl+a)τ)k=0n1S^{n}_{(a)}=(\boldsymbol{x}_{(kl+a)\tau})_{k=0}^{n-1}, for a=0,,l1a=0,\cdots,l-1. In each of these sub-sequences, the observational gaps are lτl\tau, in contrast to the τ\tau gap found in the initial sequence. By employing β\upbeta-mixing and strategically choosing a sufficiently large value for ll, we can effectively treat each sub-sequence as though it follows an i.i.d. distribution. The accuracy of this approximation is then determined by the mixing coefficient, as quantified in Lemma 5. With this setup, we can apply local Rademacher arguments to the approximated i.i.d. sequences, paving the way for fast convergence rates.
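To make the blocking construction above concrete, the following minimal Python sketch (the helper name subsample_blocks is ours, not from the paper's code) splits a discretely observed trajectory into l interleaved sub-sequences whose within-sequence sampling gap is lτ instead of τ:

import numpy as np

def subsample_blocks(traj, l):
    # split (x_{k*tau})_{k=0}^{N} into l interleaved sub-sequences
    # (x_{(k*l+a)*tau})_{k=0}^{n-1}, a = 0, ..., l-1
    N = traj.shape[0] - 1     # number of increments
    n = N // l                # length of each sub-sequence
    return [traj[a : a + n * l : l] for a in range(l)]

# toy usage: 13 observations of a 2-d process, split with l = 3
rng = np.random.default_rng(0)
traj = rng.standard_normal((13, 2))
blocks = subsample_blocks(traj, l=3)
print([b.shape for b in blocks])  # each block has n = 4 samples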

With Lemma 5 and Lemma 9, we are ready to present the proof of Theorem 6.

Proof of Theorem 6.

First, it is straightforward to check by Assumption 2b that for any 𝒙~Ω~\widetilde{\boldsymbol{x}}\in\widetilde{\Omega},

|g(𝒙~)𝔼Π~[g(𝒙~)]|4M2,|\ell\circ g^{*}(\widetilde{\boldsymbol{x}})-\mathbb{E}_{\widetilde{\Pi}}[\ell\circ g^{*}(\widetilde{\boldsymbol{x}})]|\leq 4M^{2}, (15)

and

VarΠ~[g]𝔼Π~[(g)2]=𝔼Π~[(gg0)4]4M2𝔼Π~[g],\mathrm{Var}_{\widetilde{\Pi}}\left[\ell\circ g^{*}\right]\leq\mathbb{E}_{\widetilde{\Pi}}\left[\left(\ell\circ g^{*}\right)^{2}\right]=\mathbb{E}_{\widetilde{\Pi}}\left[\left(g^{*}-g^{0}\right)^{4}\right]\leq 4M^{2}\mathbb{E}_{\widetilde{\Pi}}[\ell\circ g^{*}], (16)

Applying Lemma 9 to f=g^f=\ell\circ\hat{g} with B=4M2B=4M^{2}, M1=0M_{1}=0, and M2=4M2M_{2}=4M^{2} yields that, with probability at least 1δ/l1-\delta^{\prime}/l, we have

Π~g0(g^)11ϵ^ng0(g^;S~nΠ~)+176M2ϵr+(44+104ϵ)M2log(lδ)n.\mathcal{L}_{\widetilde{\Pi}}^{g^{0}}(\hat{g})\leq\dfrac{1}{1-\epsilon}\hat{\mathcal{L}}_{n}^{g^{0}}(\hat{g};\widetilde{S}^{n}_{\widetilde{\Pi}})+\dfrac{176}{M^{2}\epsilon}r^{*}+\dfrac{\left(44+\frac{104}{\epsilon}\right)M^{2}\log\left(\frac{l}{\delta^{\prime}}\right)}{n}.

Now we split the empirical loss ^Ng0(g^)\hat{\mathcal{L}}_{N}^{g^{0}}(\hat{g}) into the mean of ll empirical losses ^ng0(g^;Sn(a))\hat{\mathcal{L}}_{n}^{g^{0}}(\hat{g};S^{n}_{(a)}), where Sn(a)S^{n}_{(a)} are the sub-sequences of SN=(𝒙kτ)k=0NS^{N}=(\boldsymbol{x}_{k\tau})_{k=0}^{N} obtained by sub-sampling:

Π~g0(g^)11ϵ^Ng0(g^)\displaystyle\mathcal{L}_{\widetilde{\Pi}}^{g^{0}}(\hat{g})-\dfrac{1}{1-\epsilon}\hat{\mathcal{L}}_{N}^{g^{0}}(\hat{g}) =1Nk=0N1(Π~g0(g^)11ϵg^(𝒙~kτ))\displaystyle=\dfrac{1}{N}\sum_{k=0}^{N-1}\left(\mathcal{L}_{\widetilde{\Pi}}^{g^{0}}(\hat{g})-\dfrac{1}{1-\epsilon}\ell\circ\hat{g}(\widetilde{\boldsymbol{x}}_{k\tau})\right)
=1la=0l1(Π~g0(g^)11ϵ1nk=0n1g^(𝒙~(kl+a)τ))\displaystyle=\dfrac{1}{l}\sum_{a=0}^{l-1}\left(\mathcal{L}_{\widetilde{\Pi}}^{g^{0}}(\hat{g})-\dfrac{1}{1-\epsilon}\dfrac{1}{n}\sum_{k=0}^{n-1}\ell\circ\hat{g}(\widetilde{\boldsymbol{x}}_{(kl+a)\tau})\right)
=1la=0l1(Π~g0(g^)11ϵ^ng0(g^;Sn(a))),\displaystyle=\dfrac{1}{l}\sum_{a=0}^{l-1}\left(\mathcal{L}_{\widetilde{\Pi}}^{g^{0}}(\hat{g})-\dfrac{1}{1-\epsilon}\hat{\mathcal{L}}_{n}^{g^{0}}(\hat{g};S^{n}_{(a)})\right),

and for each summand, we apply Lemma 5 to the indicator function of the event {Π~g0(g^)11ϵ^ng0(g^;Sn(a))>α}\left\{\mathcal{L}_{\widetilde{\Pi}}^{g^{0}}(\hat{g})-\frac{1}{1-\epsilon}\hat{\mathcal{L}}_{n}^{g^{0}}(\hat{g};S^{n}_{(a)})>\alpha\right\} and obtain

(Π~g0(g^)11ϵ^ng0(g^;Sn(a))>α)(Π~g0(g^)11ϵ^ng0(g^;S~nΠ~)>α)+nβ(lτ).\mathbb{P}\left(\mathcal{L}_{\widetilde{\Pi}}^{g^{0}}(\hat{g})-\dfrac{1}{1-\epsilon}\hat{\mathcal{L}}_{n}^{g^{0}}(\hat{g};S^{n}_{(a)})>\alpha\right)\leq\mathbb{P}\left(\mathcal{L}_{\widetilde{\Pi}}^{g^{0}}(\hat{g})-\dfrac{1}{1-\epsilon}\hat{\mathcal{L}}_{n}^{g^{0}}(\hat{g};\widetilde{S}^{n}_{\widetilde{\Pi}})>\alpha\right)+n\upbeta(l\tau).

Then, setting α=176M2ϵr+(44+104ϵ)M2log(lδ)n\alpha=\frac{176}{M^{2}\epsilon}r^{*}+\frac{\left(44+\frac{104}{\epsilon}\right)M^{2}\log\left(\frac{l}{\delta^{\prime}}\right)}{n}, we have by the union bound

(Π~g0(g^)11ϵ^Ng0(g^)>α)\displaystyle\mathbb{P}\left(\mathcal{L}_{\widetilde{\Pi}}^{g^{0}}(\hat{g})-\dfrac{1}{1-\epsilon}\hat{\mathcal{L}}_{N}^{g^{0}}(\hat{g})>\alpha\right) (1la=0l1(Π~g0(g^)11ϵ^ng0(g^;Sn(a)))>α)\displaystyle\leq\mathbb{P}\left(\dfrac{1}{l}\sum_{a=0}^{l-1}\left(\mathcal{L}_{\widetilde{\Pi}}^{g^{0}}(\hat{g})-\dfrac{1}{1-\epsilon}\hat{\mathcal{L}}_{n}^{g^{0}}(\hat{g};S^{n}_{(a)})\right)>\alpha\right)
a=0l1((Π~g0(g^)11ϵ^ng0(g^;Sn(a))>α))\displaystyle\leq\sum_{a=0}^{l-1}\left(\mathbb{P}\left(\mathcal{L}_{\widetilde{\Pi}}^{g^{0}}(\hat{g})-\dfrac{1}{1-\epsilon}\hat{\mathcal{L}}_{n}^{g^{0}}(\hat{g};S^{n}_{(a)})>\alpha\right)\right)
a=0l1((Π~g0(g^)11ϵ^ng0(g^;S~nΠ~)>α)+nβ(lτ))\displaystyle\leq\sum_{a=0}^{l-1}\left(\mathbb{P}\left(\mathcal{L}_{\widetilde{\Pi}}^{g^{0}}(\hat{g})-\dfrac{1}{1-\epsilon}\hat{\mathcal{L}}_{n}^{g^{0}}(\hat{g};\widetilde{S}^{n}_{\widetilde{\Pi}})>\alpha\right)+n\upbeta(l\tau)\right)
l(Π~g0(g^)11ϵ^ng0(g^;S~nΠ~)>α)+Nβ(lτ)\displaystyle\leq l\mathbb{P}\left(\mathcal{L}_{\widetilde{\Pi}}^{g^{0}}(\hat{g})-\dfrac{1}{1-\epsilon}\hat{\mathcal{L}}_{n}^{g^{0}}(\hat{g};\widetilde{S}^{n}_{\widetilde{\Pi}})>\alpha\right)+N\upbeta(l\tau)
δ+Nβ(lτ)=δ,\displaystyle\leq\delta^{\prime}+N\upbeta(l\tau)=\delta,

and the result follows. ∎

A.2 Proof of Theorem 7

In this section, the expectation 𝔼\mathbb{E} is understood to be taken over the sub-σ\sigma-algebra of 0T\mathcal{F}_{0}^{T} generated by the trajectory (𝒙kτ)k=0N(\boldsymbol{x}_{k\tau})_{k=0}^{N}, from which g^\hat{g} is constructed by minimizing the estimated empirical loss ~Ng0(g^;(𝒙kτ)k=0N)\tilde{\mathcal{L}}_{N}^{g^{0}}(\hat{g};(\boldsymbol{x}_{k\tau})_{k=0}^{N}).

We first consider the following decomposition of the expectation of the empirical loss ^Ng0(g^)\hat{\mathcal{L}}_{N}^{g^{0}}(\hat{g}):

Lemma 10.

For an arbitrary g¯𝔊\bar{g}\in\mathfrak{G}, we have

𝔼[^Ng0(g^)]𝔼[^Ng0(g¯)]+2N𝔼[k=0N1ΔAkτ(g^kτg¯kτ)]+2N𝔼[k=0N1ΔMkτ(g^kτg0kτ)].\mathbb{E}\left[\hat{\mathcal{L}}_{N}^{g^{0}}(\hat{g})\right]\leq\mathbb{E}\left[\hat{\mathcal{L}}_{N}^{g^{0}}(\bar{g})\right]+\dfrac{2}{N}\mathbb{E}\left[\sum_{k=0}^{N-1}\Delta A_{k\tau}\left(\hat{g}_{k\tau}-\bar{g}_{k\tau}\right)\right]+\dfrac{2}{N}\mathbb{E}\left[\sum_{k=0}^{N-1}\Delta M_{k\tau}\left(\hat{g}_{k\tau}-g^{0}_{k\tau}\right)\right]. (17)
Proof.

By the definition of the estimator g^\hat{g}, we have ~Ng0(g^;(𝒙kτ)k=0N)~Ng0(g¯;(𝒙kτ)k=0N)\tilde{\mathcal{L}}_{N}^{g^{0}}(\hat{g};(\boldsymbol{x}_{k\tau})_{k=0}^{N})\leq\tilde{\mathcal{L}}_{N}^{g^{0}}(\bar{g};(\boldsymbol{x}_{k\tau})_{k=0}^{N}) for every possible sequence (𝒙kτ)k=0N(\boldsymbol{x}_{k\tau})_{k=0}^{N}. Recall that

~Ng0(g;(𝒙kτ)k=0N)=1Nk=0N1(g0kτ+ΔZkτgkτ)2=^Ng0(g)+2Nk=0N1ΔZkτ(g0kτgkτ)+1Nk=0N1(ΔZkτ)2,\tilde{\mathcal{L}}_{N}^{g^{0}}(g;(\boldsymbol{x}_{k\tau})_{k=0}^{N})=\dfrac{1}{N}\sum_{k=0}^{N-1}\left(g^{0}_{k\tau}+\Delta Z_{k\tau}-g_{k\tau}\right)^{2}=\hat{\mathcal{L}}_{N}^{g^{0}}(g)+\dfrac{2}{N}\sum_{k=0}^{N-1}\Delta Z_{k\tau}(g^{0}_{k\tau}-g_{k\tau})+\dfrac{1}{N}\sum_{k=0}^{N-1}(\Delta Z_{k\tau})^{2},

we have, by taking expectations on both sides,

𝔼[^Ng0(g^)]𝔼[^Ng0(g¯)]\displaystyle\mathbb{E}\left[\hat{\mathcal{L}}_{N}^{g^{0}}(\hat{g})\right]-\mathbb{E}\left[\hat{\mathcal{L}}_{N}^{g^{0}}(\bar{g})\right]
=\displaystyle= 𝔼[~Ng0(g^;(𝒙kτ)k=0N)]+𝔼[2Nk=0N1ΔZkτ(g^kτg0kτ)]𝔼[1Nk=0N1(ΔZkτ)2]\displaystyle\mathbb{E}\left[\tilde{\mathcal{L}}_{N}^{g^{0}}(\hat{g};(\boldsymbol{x}_{k\tau})_{k=0}^{N})\right]+\mathbb{E}\left[\dfrac{2}{N}\sum_{k=0}^{N-1}\Delta Z_{k\tau}\left(\hat{g}_{k\tau}-g^{0}_{k\tau}\right)\right]-\mathbb{E}\left[\dfrac{1}{N}\sum_{k=0}^{N-1}(\Delta Z_{k\tau})^{2}\right]
\displaystyle- 𝔼[~Ng0(g¯;(𝒙kτ)k=0N)]𝔼[2Nk=0N1ΔZkτ(g¯kτg0kτ)]+𝔼[1Nk=0N1(ΔZkτ)2]\displaystyle\mathbb{E}\left[\tilde{\mathcal{L}}_{N}^{g^{0}}(\bar{g};(\boldsymbol{x}_{k\tau})_{k=0}^{N})\right]-\mathbb{E}\left[\dfrac{2}{N}\sum_{k=0}^{N-1}\Delta Z_{k\tau}\left(\bar{g}_{k\tau}-g^{0}_{k\tau}\right)\right]+\mathbb{E}\left[\dfrac{1}{N}\sum_{k=0}^{N-1}(\Delta Z_{k\tau})^{2}\right]
\displaystyle\leq 2N𝔼[k=0N1ΔZkτ(g^kτg¯kτ)],\displaystyle\dfrac{2}{N}\mathbb{E}\left[\sum_{k=0}^{N-1}\Delta Z_{k\tau}\left(\hat{g}_{k\tau}-\bar{g}_{k\tau}\right)\right],

where we used the fact that the noise (ΔZkτ)k=0N1(\Delta Z_{k\tau})_{k=0}^{N-1} only depends on the ground truth g0g^{0} and the sequence (𝒙kτ)k=0N(\boldsymbol{x}_{k\tau})_{k=0}^{N}.

Since (Mkτ)k=0N(M_{k\tau})_{k=0}^{N} forms a martingale, we have

𝔼[k=0N1ΔMkτ(g¯kτg0kτ)]=𝔼[k=0N1𝔼[ΔMkτ(g¯kτg0kτ)|0kτ]]=0,\mathbb{E}\left[\sum_{k=0}^{N-1}\Delta M_{k\tau}\left(\bar{g}_{k\tau}-g^{0}_{k\tau}\right)\right]=\mathbb{E}\left[\sum_{k=0}^{N-1}\mathbb{E}\left[\Delta M_{k\tau}\left(\bar{g}_{k\tau}-g^{0}_{k\tau}\right)|\mathcal{F}_{0}^{k\tau}\right]\right]=0,

and thus

𝔼[k=0N1ΔZkτ(g^kτg¯kτ)]\displaystyle\mathbb{E}\left[\sum_{k=0}^{N-1}\Delta Z_{k\tau}\left(\hat{g}_{k\tau}-\bar{g}_{k\tau}\right)\right]
=\displaystyle= 𝔼[k=0N1ΔAkτ(g^kτg¯kτ)]+𝔼[k=0N1ΔMkτ(g^kτg¯kτ)]+𝔼[k=0N1ΔMkτ(g¯kτg0kτ)]\displaystyle\mathbb{E}\left[\sum_{k=0}^{N-1}\Delta A_{k\tau}\left(\hat{g}_{k\tau}-\bar{g}_{k\tau}\right)\right]+\mathbb{E}\left[\sum_{k=0}^{N-1}\Delta M_{k\tau}\left(\hat{g}_{k\tau}-\bar{g}_{k\tau}\right)\right]+\mathbb{E}\left[\sum_{k=0}^{N-1}\Delta M_{k\tau}\left(\bar{g}_{k\tau}-g^{0}_{k\tau}\right)\right]
=\displaystyle= 𝔼[k=0N1ΔAkτ(g^kτg¯kτ)]+𝔼[k=0N1ΔMkτ(g^kτg0kτ)],\displaystyle\mathbb{E}\left[\sum_{k=0}^{N-1}\Delta A_{k\tau}\left(\hat{g}_{k\tau}-\bar{g}_{k\tau}\right)\right]+\mathbb{E}\left[\sum_{k=0}^{N-1}\Delta M_{k\tau}\left(\hat{g}_{k\tau}-g^{0}_{k\tau}\right)\right],

and the result follows. ∎

Following Remark 5 in the main text, we will refer to the second term on the RHS of (17) as the bias term and to the last term as the martingale noise term.

For the bias term, we have the following simple bound:

Lemma 11.

For any ϵ>0\epsilon>0, we have

𝔼[1Nk=0N1ΔAkτ(g^kτg¯kτ)]ϵ4𝔼[^Ng0(g^)]+ϵ2𝔼[^Ng0(g¯)]+32ϵ𝔼[1Nk=0N1(ΔAkτ)2].\mathbb{E}\left[\dfrac{1}{N}\sum_{k=0}^{N-1}\Delta A_{k\tau}\left(\hat{g}_{k\tau}-\bar{g}_{k\tau}\right)\right]\leq\dfrac{\epsilon}{4}\mathbb{E}\left[\hat{\mathcal{L}}_{N}^{g^{0}}(\hat{g})\right]+\dfrac{\epsilon}{2}\mathbb{E}\left[\hat{\mathcal{L}}_{N}^{g^{0}}(\bar{g})\right]+\dfrac{3}{2\epsilon}\mathbb{E}\left[\dfrac{1}{N}\sum_{k=0}^{N-1}(\Delta A_{k\tau})^{2}\right].
Proof.
𝔼[1Nk=0N1ΔAkτ(g^kτg¯kτ)]\displaystyle\mathbb{E}\left[\dfrac{1}{N}\sum_{k=0}^{N-1}\Delta A_{k\tau}\left(\hat{g}_{k\tau}-\bar{g}_{k\tau}\right)\right]
=\displaystyle= 𝔼[1Nk=0N1(ΔAkτ(g^kτg0kτ)+ΔAkτ(g0kτg¯kτ))]\displaystyle\mathbb{E}\left[\dfrac{1}{N}\sum_{k=0}^{N-1}\left(\Delta A_{k\tau}\left(\hat{g}_{k\tau}-g^{0}_{k\tau}\right)+\Delta A_{k\tau}\left(g^{0}_{k\tau}-\bar{g}_{k\tau}\right)\right)\right]
\displaystyle\leq 𝔼[1Nk=0N1(1ϵ(ΔAkτ)2+ϵ4(g^kτg0kτ)2+12ϵ(ΔAkτ)2+ϵ2(g0kτg¯kτ)2)]\displaystyle\mathbb{E}\left[\dfrac{1}{N}\sum_{k=0}^{N-1}\left(\dfrac{1}{\epsilon}(\Delta A_{k\tau})^{2}+\dfrac{\epsilon}{4}\left(\hat{g}_{k\tau}-g^{0}_{k\tau}\right)^{2}+\dfrac{1}{2\epsilon}(\Delta A_{k\tau})^{2}+\dfrac{\epsilon}{2}\left(g^{0}_{k\tau}-\bar{g}_{k\tau}\right)^{2}\right)\right]
=\displaystyle= ϵ4𝔼[^Ng0(g^)]+ϵ2𝔼[^Ng0(g¯)]+32ϵ𝔼[1Nk=0N1(ΔAkτ)2],\displaystyle\dfrac{\epsilon}{4}\mathbb{E}\left[\hat{\mathcal{L}}_{N}^{g^{0}}(\hat{g})\right]+\dfrac{\epsilon}{2}\mathbb{E}\left[\hat{\mathcal{L}}_{N}^{g^{0}}(\bar{g})\right]+\dfrac{3}{2\epsilon}\mathbb{E}\left[\dfrac{1}{N}\sum_{k=0}^{N-1}(\Delta A_{k\tau})^{2}\right],

where the inequality is by AM-GM and the last equality follows from the definition of the empirical loss ^Ng0\hat{\mathcal{L}}_{N}^{g^{0}}. ∎
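For concreteness, the two weighted AM-GM inequalities used above (a small expansion of the step, not spelled out in the original argument) are, for any a, b ∈ ℝ and ε > 0,

ab\leq\dfrac{\epsilon}{4}a^{2}+\dfrac{1}{\epsilon}b^{2},\qquad ab\leq\dfrac{\epsilon}{2}a^{2}+\dfrac{1}{2\epsilon}b^{2},

applied with b = ΔA_{kτ} and a = ĝ_{kτ} − g⁰_{kτ} (resp. a = g⁰_{kτ} − ḡ_{kτ}).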

The following proof of the martingale noise term bound is inspired by the proof of (Schmidt-Hieber 2020, Lemma 4), where i.i.d. Gaussian noise is considered for nonparametric regression.

Definition 8 (ϵ\epsilon-covering).

We define the ϵ\epsilon-covering 𝔉ϵ\mathfrak{F}_{\epsilon} of a function class 𝔉\mathfrak{F} with respect to the norm \|\cdot\| as the subset of 𝔉\mathfrak{F} with the smallest possible cardinality such that for any f𝔉f\in\mathfrak{F}, there exists an f𝔉ϵf^{\prime}\in\mathfrak{F}_{\epsilon} satisfying ffϵ\|f^{\prime}-f\|\leq\epsilon. The cardinality of 𝔉ϵ\mathfrak{F}_{\epsilon} is denoted by 𝒩(ϵ,𝔉,)\mathcal{N}(\epsilon,\mathfrak{F},\|\cdot\|).

For a fixed ρ>0\rho>0, let 𝔊ρ\mathfrak{G}_{\rho} be the ρ\rho-covering of the function class 𝔊\mathfrak{G} with respect to the norm \|\cdot\|_{\infty}. Without loss of generality, we assume that 𝒩(ρ,𝔊,)2\mathcal{N}(\rho,\mathfrak{G},\|\cdot\|_{\infty})\geq 2.

We now introduce the following exponential inequality for local martingales.

Lemma 12.

Let LtL_{t} be a continuous local martingale and Lt\left\langle L\right\rangle_{t} be the corresponding quadratic variation of LtL_{t}. If 𝔼[Lt]>0\mathbb{E}\left[\sqrt{\langle L\rangle_{t}}\right]>0, then for any t,y>0t,y>0,

𝔼[yLt+y2exp(Lt22(Lt+y2))]1.\mathbb{E}\left[\dfrac{y}{\sqrt{\langle L\rangle_{t}+y^{2}}}\exp\left(\frac{L_{t}^{2}}{2\left(\langle L\rangle_{t}+y^{2}\right)}\right)\right]\leq 1.

Moreover,

𝔼[exp(Lt24(Lt+𝔼[Lt]))]2.\mathbb{E}\left[\exp\left(\dfrac{L_{t}^{2}}{4\left(\langle L\rangle_{t}+\mathbb{E}\left[\langle L\rangle_{t}\right]\right)}\right)\right]\leq\sqrt{2}.
Proof.

The result follows directly from (de la Peña, Klass, and Leung Lai 2004, Theorem 2.1) by setting A=LtA=L_{t} and B=LtB=\sqrt{\left\langle L\right\rangle_{t}}, and noticing that

𝔼[exp(λLtλ22Lt)]𝔼[exp(λL0λ22L0)]=1\mathbb{E}\left[\exp\left(\lambda L_{t}-\dfrac{\lambda^{2}}{2}\left\langle L\right\rangle_{t}\right)\right]\leq\mathbb{E}\left[\exp\left(\lambda L_{0}-\dfrac{\lambda^{2}}{2}\left\langle L\right\rangle_{0}\right)\right]=1

holds for any t0t\geq 0 and λ\lambda\in\mathbb{R} by (de la Peña, Klass, and Leung Lai 2004, Lemma 1.2). In particular, the second inequality of the lemma follows from

𝔼[exp(Lt24(Lt+𝔼[Lt]))]𝔼[exp(Lt24(Lt+(𝔼[Lt])2))]2.\mathbb{E}\left[\exp\left(\dfrac{L_{t}^{2}}{4\left(\langle L\rangle_{t}+\mathbb{E}\left[\langle L\rangle_{t}\right]\right)}\right)\right]\leq\mathbb{E}\left[\exp\left(\dfrac{L_{t}^{2}}{4\left(\langle L\rangle_{t}+\left(\mathbb{E}\left[\sqrt{\left\langle L\right\rangle_{t}}\right]\right)^{2}\right)}\right)\right]\leq\sqrt{2}.

Lemma 12 enables us to give the following concentration inequality for the martingale noise term.

Lemma 13.

Let g𝔊ρg^{\prime}\in\mathfrak{G}_{\rho} be such that g^gρ\|\hat{g}-g^{\prime}\|_{\infty}\leq\rho. Then for any ϵ>0\epsilon>0, we have

𝔼[1Nk=0N1ΔMkτ(gkτg0kτ)]6maxk|𝔼[ΔMkτ|0kτ]|log𝒩(ρ,𝔊,)ϵN+ϵ4𝔼[L^Ng0(g)].\mathbb{E}\left[\dfrac{1}{N}\sum_{k=0}^{N-1}\Delta M_{k\tau}\left(g^{\prime}_{k\tau}-g^{0}_{k\tau}\right)\right]\leq\dfrac{6\max_{k}\left|\mathbb{E}\left[\Delta\left\langle M\right\rangle_{k\tau}\big{|}\mathcal{F}_{0}^{k\tau}\right]\right|\log\mathcal{N}(\rho,\mathfrak{G},\|\cdot\|_{\infty})}{\epsilon N}+\dfrac{\epsilon}{4}\mathbb{E}\left[\hat{L}_{N}^{g^{0}}(g^{\prime})\right]. (18)
Proof.

First, for any g𝔊ρg\in\mathfrak{G}_{\rho}, we define the following local martingale

L[gg0]t=k=0N1(gkτg0kτ)(Mt(k+1)τMtkτ),L[g-g^{0}]_{t}=\sum_{k=0}^{N-1}(g_{k\tau}-g^{0}_{k\tau})\left(M_{t\wedge(k+1)\tau}-M_{t\wedge k\tau}\right),

for any t0t\geq 0, so that the LHS of (18) equals 1N𝔼[L[gg0]T]\frac{1}{N}\mathbb{E}\left[L[g^{\prime}-g^{0}]_{T}\right].

By the martingale representation theorem, (Mt)t0(M_{t})_{t\geq 0}, being a continuous local martingale, is an Itô integral, and thus its quadratic variation (Mt)t0(\left\langle M\right\rangle_{t})_{t\geq 0} exists. As a result, the quadratic variation of L[gg0]L[g-g^{0}] is

L[gg0]t=k=0N1(gkτg0kτ)2(Mt(k+1)τMtkτ).\langle L[g-g^{0}]\rangle_{t}=\sum_{k=0}^{N-1}(g_{k\tau}-g^{0}_{k\tau})^{2}\left(\left\langle M\right\rangle_{t\wedge(k+1)\tau}-\left\langle M\right\rangle_{t\wedge k\tau}\right).

Taking y=𝔼[L[gg0]T]+ηy=\sqrt{\mathbb{E}\left[\left\langle L[g-g^{0}]\right\rangle_{T}\right]+\eta}, where η>0\eta>0, we have by Lemma 12,

𝔼[yL[gg0]T+y2exp(L[gg0]T22(L[gg0]T+y2))]1.\mathbb{E}\left[\dfrac{y}{\sqrt{\left\langle L[g-g^{0}]\right\rangle_{T}+y^{2}}}\exp\left(\dfrac{L[g-g^{0}]_{T}^{2}}{2\left(\left\langle L[g-g^{0}]\right\rangle_{T}+y^{2}\right)}\right)\right]\leq 1.

Therefore, by Jensen’s inequality and Cauchy-Schwarz, we have

exp𝔼[L[gg0]T24(L[gg0]T+y2)]𝔼[exp(L[gg0]T24(L[gg0]T+y2))]\displaystyle\exp\mathbb{E}\left[\dfrac{L[g^{\prime}-g^{0}]_{T}^{2}}{4\left(\left\langle L[g^{\prime}-g^{0}]\right\rangle_{T}+y^{2}\right)}\right]\leq\mathbb{E}\left[\exp\left(\dfrac{L[g^{\prime}-g^{0}]_{T}^{2}}{4\left(\left\langle L[g^{\prime}-g^{0}]\right\rangle_{T}+y^{2}\right)}\right)\right]
\displaystyle\leq 𝔼[yL[gg0]T+y2exp(L[gg0]T22(L[gg0]T+y2))]𝔼[L[gg0]T+y2y]\displaystyle\sqrt{\mathbb{E}\left[\dfrac{y}{\sqrt{\left\langle L[g^{\prime}-g^{0}]\right\rangle_{T}+y^{2}}}\exp\left(\dfrac{L[g^{\prime}-g^{0}]_{T}^{2}}{2\left(\left\langle L[g^{\prime}-g^{0}]\right\rangle_{T}+y^{2}\right)}\right)\right]\mathbb{E}\left[\dfrac{\sqrt{\left\langle L[g^{\prime}-g^{0}]\right\rangle_{T}+y^{2}}}{y}\right]}
\displaystyle\leq g𝔊ρ𝔼[yL[gg0]T+y2exp(L[gg0]T22(L[gg0]T+y2))]𝔼[2y2ηy]\displaystyle\sqrt{\sum_{g\in\mathfrak{G}_{\rho}}\mathbb{E}\left[\dfrac{y}{\sqrt{\left\langle L[g-g^{0}]\right\rangle_{T}+y^{2}}}\exp\left(\dfrac{L[g-g^{0}]_{T}^{2}}{2\left(\left\langle L[g-g^{0}]\right\rangle_{T}+y^{2}\right)}\right)\right]\mathbb{E}\left[\dfrac{\sqrt{2y^{2}-\eta}}{y}\right]}
\displaystyle\leq 𝒩(ρ,𝔊,)2,\displaystyle\sqrt{\mathcal{N}(\rho,\mathfrak{G},\|\cdot\|_{\infty})\sqrt{2}},

where the second-to-last inequality is due to the fact that g𝔊ρg^{\prime}\in\mathfrak{G}_{\rho}, so the first expectation is bounded by the summation over all possible g𝔊ρg\in\mathfrak{G}_{\rho}, while the second expectation is bounded using Jensen's inequality.

Again by Cauchy-Schwarz,

1N𝔼[L[gg0]T]\displaystyle\dfrac{1}{N}\mathbb{E}\left[L[g^{\prime}-g^{0}]_{T}\right] =1N𝔼[L[gg0]T2L[gg0]T+y22L[gg0]T+y2]\displaystyle=\dfrac{1}{N}\mathbb{E}\left[\dfrac{L[g^{\prime}-g^{0}]_{T}}{2\sqrt{\left\langle L[g^{\prime}-g^{0}]\right\rangle_{T}+y^{2}}}2\sqrt{\left\langle L[g^{\prime}-g^{0}]\right\rangle_{T}+y^{2}}\right]
4N2𝔼[L[gg0]T24(L[gg0]T+y2)]𝔼[L[gg0]T+y2]\displaystyle\leq\sqrt{\dfrac{4}{N^{2}}\mathbb{E}\left[\dfrac{L[g^{\prime}-g^{0}]_{T}^{2}}{4\left(\left\langle L[g^{\prime}-g^{0}]\right\rangle_{T}+y^{2}\right)}\right]\mathbb{E}\left[\left\langle L[g^{\prime}-g^{0}]\right\rangle_{T}+y^{2}\right]}
2log𝒩(ρ,𝔊,)+log2N2𝔼[2k=0N1(gkτg0kτ)2ΔMkτ+η].\displaystyle\leq\sqrt{\dfrac{2\log\mathcal{N}(\rho,\mathfrak{G},\|\cdot\|_{\infty})+\log 2}{N^{2}}\mathbb{E}\left[2\sum_{k=0}^{N-1}\left(g^{\prime}_{k\tau}-g^{0}_{k\tau}\right)^{2}\Delta\left\langle M\right\rangle_{k\tau}+\eta\right]}.

Finally, letting η0\eta\to 0,

1N𝔼[L[gg0]T]\displaystyle\dfrac{1}{N}\mathbb{E}\left[L[g^{\prime}-g^{0}]_{T}\right] 3log𝒩(ρ,𝔊,)N22𝔼[k=0N1(gkτg0kτ)2ΔMkτ]\displaystyle\leq\sqrt{\dfrac{3\log\mathcal{N}(\rho,\mathfrak{G},\|\cdot\|_{\infty})}{N^{2}}2\mathbb{E}\left[\sum_{k=0}^{N-1}\left(g^{\prime}_{k\tau}-g^{0}_{k\tau}\right)^{2}\Delta\left\langle M\right\rangle_{k\tau}\right]}
6log𝒩(ρ,𝔊,)Nmaxk|𝔼[ΔMkτ|0kτ]|𝔼[1Nk=0N1(gkτg0kτ)2]\displaystyle\leq\sqrt{\dfrac{6\log\mathcal{N}(\rho,\mathfrak{G},\|\cdot\|_{\infty})}{N}\max_{k}\left|\mathbb{E}\left[\Delta\left\langle M\right\rangle_{k\tau}\big{|}\mathcal{F}_{0}^{k\tau}\right]\right|\mathbb{E}\left[\dfrac{1}{N}\sum_{k=0}^{N-1}\left(g^{\prime}_{k\tau}-g^{0}_{k\tau}\right)^{2}\right]}
6maxk|𝔼[ΔMkτ|0kτ]|log𝒩(ρ,𝔊,)ϵN+ϵ4𝔼[L^Ng0(g)],\displaystyle\leq\dfrac{6\max_{k}\left|\mathbb{E}\left[\Delta\left\langle M\right\rangle_{k\tau}\big{|}\mathcal{F}_{0}^{k\tau}\right]\right|\log\mathcal{N}(\rho,\mathfrak{G},\|\cdot\|_{\infty})}{\epsilon N}+\dfrac{\epsilon}{4}\mathbb{E}\left[\hat{L}_{N}^{g^{0}}(g^{\prime})\right],

where the second inequality is due to the tower property

𝔼[k=0N1(gkτg0kτ)2ΔMkτ]\displaystyle\mathbb{E}\left[\sum_{k=0}^{N-1}\left(g^{\prime}_{k\tau}-g^{0}_{k\tau}\right)^{2}\Delta\left\langle M\right\rangle_{k\tau}\right] =k=0N1𝔼[(gkτg0kτ)2ΔMkτ]\displaystyle=\sum_{k=0}^{N-1}\mathbb{E}\left[\left(g^{\prime}_{k\tau}-g^{0}_{k\tau}\right)^{2}\Delta\left\langle M\right\rangle_{k\tau}\right] (19)
=𝔼[k=0N1𝔼[(gkτg0kτ)2ΔMkτ|0kτ]]\displaystyle=\mathbb{E}\left[\sum_{k=0}^{N-1}\mathbb{E}\left[\left(g^{\prime}_{k\tau}-g^{0}_{k\tau}\right)^{2}\Delta\left\langle M\right\rangle_{k\tau}\big{|}\mathcal{F}_{0}^{k\tau}\right]\right]
=𝔼[k=0N1(gkτg0kτ)2𝔼[ΔMkτ|0kτ]]\displaystyle=\mathbb{E}\left[\sum_{k=0}^{N-1}\left(g^{\prime}_{k\tau}-g^{0}_{k\tau}\right)^{2}\mathbb{E}\left[\Delta\left\langle M\right\rangle_{k\tau}\big{|}\mathcal{F}_{0}^{k\tau}\right]\right]
maxk|𝔼[ΔMkτ|0kτ]|𝔼[k=0N1(gkτg0kτ)2]\displaystyle\leq\max_{k}\left|\mathbb{E}\left[\Delta\left\langle M\right\rangle_{k\tau}\big{|}\mathcal{F}_{0}^{k\tau}\right]\right|\mathbb{E}\left[\sum_{k=0}^{N-1}\left(g^{\prime}_{k\tau}-g^{0}_{k\tau}\right)^{2}\right]

and the last inequality is by AM-GM. ∎
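For the reader's convenience, the final AM-GM step can be made explicit (this expansion is ours): writing

x=\dfrac{6\max_{k}\left|\mathbb{E}\left[\Delta\left\langle M\right\rangle_{k\tau}\big{|}\mathcal{F}_{0}^{k\tau}\right]\right|\log\mathcal{N}(\rho,\mathfrak{G},\|\cdot\|_{\infty})}{N},\qquad y=\mathbb{E}\left[\hat{\mathcal{L}}_{N}^{g^{0}}(g^{\prime})\right],

we have \sqrt{xy}=\sqrt{\dfrac{2x}{\epsilon}\cdot\dfrac{\epsilon y}{2}}\leq\dfrac{x}{\epsilon}+\dfrac{\epsilon y}{4}, which is exactly the RHS of (18).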

It remains to bound the error incurred by projecting the estimator g^\hat{g} onto the ρ\rho-covering 𝔊ρ\mathfrak{G}_{\rho}, which is done in the following lemma.

Lemma 14.

Let g𝔊ρg^{\prime}\in\mathfrak{G}_{\rho} be such that g^gρ\|\hat{g}-g^{\prime}\|_{\infty}\leq\rho. Then

𝔼[1Nk=0N1ΔMkτ(g^kτgkτ)]4maxk|𝔼[ΔMkτ|0kτ]|ρ2log2N.\mathbb{E}\left[\dfrac{1}{N}\sum_{k=0}^{N-1}\Delta M_{k\tau}\left(\hat{g}_{k\tau}-g^{\prime}_{k\tau}\right)\right]\leq\sqrt{\dfrac{4\max_{k}\left|\mathbb{E}\left[\Delta\left\langle M\right\rangle_{k\tau}\big{|}\mathcal{F}_{0}^{k\tau}\right]\right|\rho^{2}\log 2}{N}}.
Proof.

We follow the notation introduced in the proof of Lemma 13. If L[g^g]T=0\left\langle L[\hat{g}-g^{\prime}]\right\rangle_{T}=0, then L[g^g]t=0L[\hat{g}-g^{\prime}]_{t}=0 almost surely and the result holds trivially.

Now suppose L[g^g]T>0\left\langle L[\hat{g}-g^{\prime}]\right\rangle_{T}>0. Applying Lemma 12 and Jensen’s inequality yields

𝔼[(L[g^g]T24(L[g^g]T+𝔼[L[g^g]T]))]\displaystyle\mathbb{E}\left[\left(\dfrac{L[\hat{g}-g^{\prime}]_{T}^{2}}{4\left(\left\langle L[\hat{g}-g^{\prime}]\right\rangle_{T}+\mathbb{E}\left[\left\langle L[\hat{g}-g^{\prime}]\right\rangle_{T}\right]\right)}\right)\right]
\displaystyle\leq log𝔼[exp(L[g^g]T24(L[g^g]T+𝔼[L[g^g]T]))]log22.\displaystyle\log\mathbb{E}\left[\exp\left(\dfrac{L[\hat{g}-g^{\prime}]_{T}^{2}}{4\left(\left\langle L[\hat{g}-g^{\prime}]\right\rangle_{T}+\mathbb{E}\left[\left\langle L[\hat{g}-g^{\prime}]\right\rangle_{T}\right]\right)}\right)\right]\leq\dfrac{\log 2}{2}.

With a similar argument as in (19), we have

𝔼[L[g^g]T]=𝔼[k=0N1(g^kτgkτ)2ΔMkτ]\displaystyle\mathbb{E}\left[\left\langle L[\hat{g}-g^{\prime}]\right\rangle_{T}\right]=\mathbb{E}\left[\sum_{k=0}^{N-1}\left(\hat{g}_{k\tau}-g^{\prime}_{k\tau}\right)^{2}\Delta\left\langle M\right\rangle_{k\tau}\right]
\displaystyle\leq maxk|𝔼[ΔMkτ|0kτ]|𝔼[k=0N1(g^kτgkτ)2]Nmaxk|𝔼[ΔMkτ|0kτ]|ρ2.\displaystyle\max_{k}\left|\mathbb{E}\left[\Delta\left\langle M\right\rangle_{k\tau}\big{|}\mathcal{F}_{0}^{k\tau}\right]\right|\mathbb{E}\left[\sum_{k=0}^{N-1}\left(\hat{g}_{k\tau}-g^{\prime}_{k\tau}\right)^{2}\right]\leq N\max_{k}\left|\mathbb{E}\left[\Delta\left\langle M\right\rangle_{k\tau}\big{|}\mathcal{F}_{0}^{k\tau}\right]\right|\rho^{2}.

Therefore, we have

1N𝔼[L[g^g]T]\displaystyle\dfrac{1}{N}\mathbb{E}\left[L[\hat{g}-g^{\prime}]_{T}\right] =1N𝔼[L[g^g]T2L[g^g]T+𝔼[L[g^g]T]2L[g^g]T+𝔼[L[g^g]T]]\displaystyle=\dfrac{1}{N}\mathbb{E}\left[\dfrac{L[\hat{g}-g^{\prime}]_{T}}{2\sqrt{\left\langle L[\hat{g}-g^{\prime}]\right\rangle_{T}+\mathbb{E}\left[\left\langle L[\hat{g}-g^{\prime}]\right\rangle_{T}\right]}}2\sqrt{\left\langle L[\hat{g}-g^{\prime}]\right\rangle_{T}+\mathbb{E}\left[\left\langle L[\hat{g}-g^{\prime}]\right\rangle_{T}\right]}\right]
2N𝔼[L[g^g]T24(L[g^g]T+𝔼[L[g^g]T])]2𝔼[L[g^g]T]\displaystyle\leq\dfrac{2}{N}\sqrt{\mathbb{E}\left[\dfrac{L[\hat{g}-g^{\prime}]_{T}^{2}}{4\left(\left\langle L[\hat{g}-g^{\prime}]\right\rangle_{T}+\mathbb{E}\left[\left\langle L[\hat{g}-g^{\prime}]\right\rangle_{T}\right]\right)}\right]2\mathbb{E}\left[\left\langle L[\hat{g}-g^{\prime}]\right\rangle_{T}\right]}
4maxk|𝔼[ΔMkτ|0kτ]|ρ2log2N.\displaystyle\leq\sqrt{\dfrac{4\max_{k}\left|\mathbb{E}\left[\Delta\left\langle M\right\rangle_{k\tau}\big{|}\mathcal{F}_{0}^{k\tau}\right]\right|\rho^{2}\log 2}{N}}.

Now we are ready to present the proof of Theorem 7.

Proof of Theorem 7.

Plugging Lemmas 11, 13, and 14 into (17), we have

𝔼[^Ng0(g^)]\displaystyle\mathbb{E}\left[\hat{\mathcal{L}}_{N}^{g^{0}}(\hat{g})\right]\leq 𝔼[^Ng0(g¯)]+2N𝔼[k=0N1ΔAkτ(g^kτg¯kτ)]+2N𝔼[k=0N1ΔMkτ(g^kτg0kτ)]\displaystyle\mathbb{E}\left[\hat{\mathcal{L}}_{N}^{g^{0}}(\bar{g})\right]+\dfrac{2}{N}\mathbb{E}\left[\sum_{k=0}^{N-1}\Delta A_{k\tau}\left(\hat{g}_{k\tau}-\bar{g}_{k\tau}\right)\right]+\dfrac{2}{N}\mathbb{E}\left[\sum_{k=0}^{N-1}\Delta M_{k\tau}\left(\hat{g}_{k\tau}-g^{0}_{k\tau}\right)\right]
\displaystyle\leq 𝔼[^Ng0(g¯)]+2N𝔼[k=0N1ΔAkτ(g^kτg¯kτ)]\displaystyle\mathbb{E}\left[\hat{\mathcal{L}}_{N}^{g^{0}}(\bar{g})\right]+\dfrac{2}{N}\mathbb{E}\left[\sum_{k=0}^{N-1}\Delta A_{k\tau}\left(\hat{g}_{k\tau}-\bar{g}_{k\tau}\right)\right]
+2N𝔼[k=0N1ΔMkτ(g^kτgkτ)]+2N𝔼[k=0N1ΔMkτ(gkτg0kτ)]\displaystyle+\dfrac{2}{N}\mathbb{E}\left[\sum_{k=0}^{N-1}\Delta M_{k\tau}\left(\hat{g}_{k\tau}-g^{\prime}_{k\tau}\right)\right]+\dfrac{2}{N}\mathbb{E}\left[\sum_{k=0}^{N-1}\Delta M_{k\tau}\left(g^{\prime}_{k\tau}-g^{0}_{k\tau}\right)\right]
\displaystyle\leq 𝔼[^Ng0(g¯)]+ϵ2𝔼[^Ng0(g^)]+ϵ𝔼[^Ng0(g¯)]+3ϵ𝔼[1Nk=0N1(ΔAkτ)2]\displaystyle\mathbb{E}\left[\hat{\mathcal{L}}_{N}^{g^{0}}(\bar{g})\right]+\dfrac{\epsilon}{2}\mathbb{E}\left[\hat{\mathcal{L}}_{N}^{g^{0}}(\hat{g})\right]+\epsilon\mathbb{E}\left[\hat{\mathcal{L}}_{N}^{g^{0}}(\bar{g})\right]+\dfrac{3}{\epsilon}\mathbb{E}\left[\dfrac{1}{N}\sum_{k=0}^{N-1}(\Delta A_{k\tau})^{2}\right]
+12maxk|𝔼[ΔMkτ|0kτ]|log𝒩(ρ,𝔊,)ϵN+ϵ2𝔼[^Ng0(g)]\displaystyle+\dfrac{12\max_{k}\left|\mathbb{E}\left[\Delta\left\langle M\right\rangle_{k\tau}\big{|}\mathcal{F}_{0}^{k\tau}\right]\right|\log\mathcal{N}(\rho,\mathfrak{G},\|\cdot\|_{\infty})}{\epsilon N}+\dfrac{\epsilon}{2}\mathbb{E}\left[\hat{\mathcal{L}}_{N}^{g^{0}}(g^{\prime})\right]
+24maxk|𝔼[ΔMkτ|0kτ]|ρ2log2N.\displaystyle+2\sqrt{\dfrac{4\max_{k}\left|\mathbb{E}\left[\Delta\left\langle M\right\rangle_{k\tau}\big{|}\mathcal{F}_{0}^{k\tau}\right]\right|\rho^{2}\log 2}{N}}.

Then we rearrange the terms and take the infimum over all g¯𝔊\bar{g}\in\mathfrak{G}:

𝔼[^Ng0(g^)]1+ϵ1ϵinfg¯𝔊𝔼[^Ng0(g¯)]+3ϵ𝔼[1Nk=0N1(ΔAkτ)2]\displaystyle\mathbb{E}\left[\hat{\mathcal{L}}_{N}^{g^{0}}(\hat{g})\right]\leq\dfrac{1+\epsilon}{1-\epsilon}\inf_{\bar{g}\in\mathfrak{G}}\mathbb{E}\left[\hat{\mathcal{L}}_{N}^{g^{0}}(\bar{g})\right]+\dfrac{3}{\epsilon}\mathbb{E}\left[\dfrac{1}{N}\sum_{k=0}^{N-1}(\Delta A_{k\tau})^{2}\right]
+12maxk|𝔼[ΔMkτ|0kτ]|log𝒩(ρ,𝔊,)ϵN+24maxk|𝔼[ΔMkτ|0kτ]|ρ2log2N\displaystyle+\dfrac{12\max_{k}\left|\mathbb{E}\left[\Delta\left\langle M\right\rangle_{k\tau}\big{|}\mathcal{F}_{0}^{k\tau}\right]\right|\log\mathcal{N}(\rho,\mathfrak{G},\|\cdot\|_{\infty})}{\epsilon N}+2\sqrt{\dfrac{4\max_{k}\left|\mathbb{E}\left[\Delta\left\langle M\right\rangle_{k\tau}\big{|}\mathcal{F}_{0}^{k\tau}\right]\right|\rho^{2}\log 2}{N}}
1+ϵ1ϵinfg¯𝔊𝔼[^Ng0(g¯)]+3CAϵτ+12CMlog𝒩(ρ,𝔊,)ϵNτγ+24CMρ2log2Nτγ,\displaystyle\leq\dfrac{1+\epsilon}{1-\epsilon}\inf_{\bar{g}\in\mathfrak{G}}\mathbb{E}\left[\hat{\mathcal{L}}_{N}^{g^{0}}(\bar{g})\right]+\dfrac{3C_{A}}{\epsilon}\tau+\dfrac{12C_{M}\log\mathcal{N}(\rho,\mathfrak{G},\|\cdot\|_{\infty})}{\epsilon N\tau^{\gamma}}+2\sqrt{\dfrac{4C_{M}\rho^{2}\log 2}{N\tau^{\gamma}}},

where the last inequality is by Assumptions 6 and 7. ∎

A.3 Proof of Theorem 8

Before we present the proof, we introduce the following lemma concerning the complexity of the sparse neural network function class, which will serve as the fundamental building block of our proof via local Rademacher arguments.

Lemma 15 (log\log-covering number of 𝔑(L,𝒑,S,M)\mathfrak{N}(L,\boldsymbol{p},S,M) (Schmidt-Hieber 2020, Lemma 5)).
log𝒩(ρ,𝔑(L,𝒑,S,),)(S+1)log(22L+5ρ1(L+1)d2(S+1)2L).\log\mathcal{N}(\rho,\mathfrak{N}(L,\boldsymbol{p},S,\infty),\|\cdot\|_{\infty})\leq(S+1)\log\left(2^{2L+5}\rho^{-1}(L+1)d^{2}(S+1)^{2L}\right).

In fact, the Rademacher complexity N(𝔉)\mathfrak{R}_{N}(\mathfrak{F}) can be bounded in terms of the log\log-covering number via the following lemma.

Lemma 16 (Localized Dudley’s theorem).

For any function class 𝔉\mathfrak{F},

N(𝔉)𝔼μN[infρ>0(4ρ+12ρlog𝒩(u,𝔉,2,S)Ndu)],\mathfrak{R}_{N}(\mathfrak{F})\leq\mathbb{E}_{\mu^{\otimes N}}\left[\inf_{\rho>0}\left(4\rho+12\int_{\rho}^{\infty}\sqrt{\dfrac{\log\mathcal{N}(u,\mathfrak{F},\|\cdot\|_{2,S})}{N}}\mathrm{d}u\right)\right],

where the expectation in the definition of N(𝔉)\mathfrak{R}_{N}(\mathfrak{F}) is taken over a sample set SS of NN i.i.d. samples X1,,XNX_{1},\cdots,X_{N} drawn from a distribution μ\mu, and 2,S\|\cdot\|_{2,S} denotes the L2L^{2} norm with respect to the empirical measure 1Ni=1Nδ(xXi)\frac{1}{N}\sum_{i=1}^{N}\delta(x-X_{i}).

The following lemma will also be used in the proof of Theorem 8.

Lemma 17 (Talagrand’s contraction lemma).

Let ϕ:\phi:\mathbb{R}\to\mathbb{R} be an LL-Lipschitz continuous function and 𝔉\mathfrak{F} be a function class. Then

N(ϕ𝔉)LN(𝔉),\mathfrak{R}_{N}(\phi\circ\mathfrak{F})\leq L\mathfrak{R}_{N}(\mathfrak{F}),

where ϕ𝔉={ϕf|f𝔉}\phi\circ\mathfrak{F}=\{\phi\circ f|f\in\mathfrak{F}\}.

With all lemmas aforementioned, we first bound the local Rademacher complexity of the sparse neural network class 𝔑(L,𝒑,S,M)\mathfrak{N}(L,\boldsymbol{p},S,M).

Lemma 18 (Local Rademacher complexity of 𝔑(L,𝒑,S,M)\mathfrak{N}(L,\boldsymbol{p},S,M)).

The local Rademacher complexity of the sparse neural network class 𝔑(L,𝐩,S,M)\mathfrak{N}(L,\boldsymbol{p},S,M)

N({g|g𝔑(L,𝒑,S,M),𝔼Π~[g]r})\mathfrak{R}_{N}(\{\ell\circ g|g\in\mathfrak{N}(L,\boldsymbol{p},S,M),\mathbb{E}_{\widetilde{\Pi}}[\ell\circ g]\leq r\})

as appeared in Theorem 6 is bounded by the sub-root function

ϕ(r)\displaystyle\phi(r)\leq 32MN+96Mr(S+1)log(22L+5N(L+1)d2(S+1)2L)N\displaystyle\dfrac{32M}{N}+96M\sqrt{\dfrac{r(S+1)\log\left(2^{2L+5}N(L+1)d^{2}(S+1)^{2L}\right)}{N}}
+4608M2(S+1)log(22L+5N(L+1)d2(S+1)2L)N,\displaystyle+\dfrac{4608M^{2}(S+1)\log\left(2^{2L+5}N(L+1)d^{2}(S+1)^{2L}\right)}{N},

with the fixed point rr^{*} bounded by

r64M+18432M2(S+1)log(22L+5N(L+1)d2(S+1)2L)N.r^{*}\leq\dfrac{64M+18432M^{2}(S+1)\log\left(2^{2L+5}N(L+1)d^{2}(S+1)^{2L}\right)}{N}.
Proof.

First, by applying Talagrand’s contraction lemma with Assumption 2b, and the localized Dudley’s theorem, we have

N({g|g𝔑(L,𝒑,S,M),𝔼Π~[g]r})\displaystyle\mathfrak{R}_{N}(\{\ell\circ g|g\in\mathfrak{N}(L,\boldsymbol{p},S,M),\mathbb{E}_{\widetilde{\Pi}}[\ell\circ g]\leq r\}) (20)
\displaystyle\leq 4MN({gg0|g𝔑(L,𝒑,S,M),𝔼Π~[(gg0)2]r})\displaystyle 4M\mathfrak{R}_{N}\left(\left\{g-g^{0}\big{|}g\in\mathfrak{N}(L,\boldsymbol{p},S,M),\mathbb{E}_{\widetilde{\Pi}}[(g-g^{0})^{2}]\leq r\right\}\right)
\displaystyle\leq 4M𝔼ΠN[infρ>0(4ρ+12ρr^log𝒩(u,{gg0|g𝔑(L,𝒑,S,M),𝔼Π~[(gg0)2]r},2,S)Ndu)],\displaystyle 4M\mathbb{E}_{\Pi^{\otimes N}}\left[\inf_{\rho>0}\left(4\rho+12\int_{\rho}^{\sqrt{\hat{r}}}\sqrt{\dfrac{\log\mathcal{N}\left(u,\left\{g-g^{0}\big{|}g\in\mathfrak{N}(L,\boldsymbol{p},S,M),\mathbb{E}_{\widetilde{\Pi}}[(g-g^{0})^{2}]\leq r\right\},\|\cdot\|_{2,S}\right)}{N}}\mathrm{d}u\right)\right],

where the empirical localization radius depending on the sample set SΠNS\sim\Pi^{\otimes N} is defined as

r^:=supg𝔑(L,𝒑,S,M),𝔼Π~[(gg0)2]rgg02,S2,\hat{r}:=\sup_{g\in\mathfrak{N}(L,\boldsymbol{p},S,M),\mathbb{E}_{\widetilde{\Pi}}\left[(g-g^{0})^{2}\right]\leq r}\|g-g^{0}\|_{2,S}^{2},

and the last inequality is due to the fact that whenever u>r^u>\sqrt{\hat{r}}, the uu-covering number with respect to 2,S\|\cdot\|_{2,S} is exactly 1 and the integrand thus vanishes.

By choosing ρ=1N\rho=\frac{1}{N} on the RHS of (20), we have

infρ>0(4ρ+12ρr^log𝒩(u,{gg0|g𝔑(L,𝒑,S,M),𝔼Π~[(gg0)2]r},2,S)Ndu)\displaystyle\inf_{\rho>0}\left(4\rho+12\int_{\rho}^{\sqrt{\hat{r}}}\sqrt{\dfrac{\log\mathcal{N}\left(u,\left\{g-g^{0}\big{|}g\in\mathfrak{N}(L,\boldsymbol{p},S,M),\mathbb{E}_{\widetilde{\Pi}}[(g-g^{0})^{2}]\leq r\right\},\|\cdot\|_{2,S}\right)}{N}}\mathrm{d}u\right)
\displaystyle\leq 4N+121/Nr^log𝒩(u,{gg0|g𝔑(L,𝒑,S,M),𝔼Π~[(gg0)2]r},2,S)Ndu\displaystyle\dfrac{4}{N}+12\int_{1/N}^{\sqrt{\hat{r}}}\sqrt{\dfrac{\log\mathcal{N}\left(u,\left\{g-g^{0}\big{|}g\in\mathfrak{N}(L,\boldsymbol{p},S,M),\mathbb{E}_{\widetilde{\Pi}}[(g-g^{0})^{2}]\leq r\right\},\|\cdot\|_{2,S}\right)}{N}}\mathrm{d}u
\displaystyle\leq 4N+121/Nr^log𝒩(u,{gg0|g𝔑(L,𝒑,S,M)},2,S)Ndu\displaystyle\dfrac{4}{N}+12\int_{1/N}^{\sqrt{\hat{r}}}\sqrt{\dfrac{\log\mathcal{N}\left(u,\left\{g-g^{0}|g\in\mathfrak{N}(L,\boldsymbol{p},S,M)\right\},\|\cdot\|_{2,S}\right)}{N}}\mathrm{d}u
\displaystyle\leq 4N+121/Nr^log𝒩(u,𝔑(L,𝒑,S,M),)Ndu\displaystyle\dfrac{4}{N}+12\int_{1/N}^{\sqrt{\hat{r}}}\sqrt{\dfrac{\log\mathcal{N}\left(u,\mathfrak{N}(L,\boldsymbol{p},S,M),\|\cdot\|_{\infty}\right)}{N}}\mathrm{d}u
\displaystyle\leq 4N+121/Nr^(S+1)log(22L+5u1(L+1)d2(S+1)2L)Ndu\displaystyle\dfrac{4}{N}+12\int_{1/N}^{\sqrt{\hat{r}}}\sqrt{\dfrac{(S+1)\log\left(2^{2L+5}u^{-1}(L+1)d^{2}(S+1)^{2L}\right)}{N}}\mathrm{d}u
\displaystyle\leq 4N+12r^(S+1)log(22L+5N(L+1)d2(S+1)2L)N,\displaystyle\dfrac{4}{N}+12\sqrt{\hat{r}}\sqrt{\dfrac{(S+1)\log\left(2^{2L+5}N(L+1)d^{2}(S+1)^{2L}\right)}{N}},

where the third inequality holds because 2,S\|\cdot\|_{2,S}\leq\|\cdot\|_{\infty}, and the second-to-last inequality is by Lemma 15.

Now set

ϕ(r)=4M𝔼ΠN[4N+12r^(S+1)log(22L+5N(L+1)d2(S+1)2L)N],\phi(r)=4M\mathbb{E}_{\Pi^{\otimes N}}\left[\dfrac{4}{N}+12\sqrt{\hat{r}}\sqrt{\dfrac{(S+1)\log\left(2^{2L+5}N(L+1)d^{2}(S+1)^{2L}\right)}{N}}\right], (21)

then

N({g|g𝔑(L,𝒑,S,M),𝔼Π~[g]r})ϕ(r)\mathfrak{R}_{N}\left(\left\{\ell\circ g|g\in\mathfrak{N}(L,\boldsymbol{p},S,M),\mathbb{E}_{\widetilde{\Pi}}[\ell\circ g]\leq r\right\}\right)\leq\phi(r)

following the reasoning above.

Now we turn to bounding the empirical localization radius r^\hat{r}, again via the local Rademacher complexity.

𝔼ΠN[r^]\displaystyle\mathbb{E}_{\Pi^{\otimes N}}\left[\hat{r}\right] =𝔼ΠN[supg𝔑(L,𝒑,S,M),𝔼Π~[(gg0)2]rgg02,S2]\displaystyle=\mathbb{E}_{\Pi^{\otimes N}}\left[\sup_{g\in\mathfrak{N}(L,\boldsymbol{p},S,M),\mathbb{E}_{\widetilde{\Pi}}\left[(g-g^{0})^{2}\right]\leq r}\|g-g^{0}\|_{2,S}^{2}\right]
=𝔼ΠN[supg𝔑(L,𝒑,S,M),𝔼Π~[(gg0)2]r1Ni=1N(g(Xi)g0(Xi))2]\displaystyle=\mathbb{E}_{\Pi^{\otimes N}}\left[\sup_{g\in\mathfrak{N}(L,\boldsymbol{p},S,M),\mathbb{E}_{\widetilde{\Pi}}\left[(g-g^{0})^{2}\right]\leq r}\dfrac{1}{N}\sum_{i=1}^{N}\left(g(X_{i})-g^{0}(X_{i})\right)^{2}\right]
𝔼ΠN[supg𝔑(L,𝒑,S,M),𝔼Π~[(gg0)2]r1Ni=1N((g(Xi)g0(Xi))2𝔼Π~[(gg0)2])]+r\displaystyle\leq\mathbb{E}_{\Pi^{\otimes N}}\left[\sup_{g\in\mathfrak{N}(L,\boldsymbol{p},S,M),\mathbb{E}_{\widetilde{\Pi}}\left[(g-g^{0})^{2}\right]\leq r}\dfrac{1}{N}\sum_{i=1}^{N}\left(\left(g(X_{i})-g^{0}(X_{i})\right)^{2}-\mathbb{E}_{\widetilde{\Pi}}\left[(g-g^{0})^{2}\right]\right)\right]+r
2N({(gg0)2|g𝔑(L,𝒑,S,M),𝔼Π~[(gg0)2]r})+r2ϕ(r)+r,\displaystyle\leq 2\mathfrak{R}_{N}\left(\left\{(g-g^{0})^{2}|g\in\mathfrak{N}(L,\boldsymbol{p},S,M),\mathbb{E}_{\widetilde{\Pi}}[(g-g^{0})^{2}]\leq r\right\}\right)+r\leq 2\phi(r)+r,

where the last inequality is by symmetrization (Boucheron, Lugosi, and Massart 2013).

Then from (21), we have by Jensen’s inequality,

ϕ(r)\displaystyle\phi(r)\leq 4M(4N+12𝔼ΠN[r^](S+1)log(22L+5N(L+1)d2(S+1)2L)N)\displaystyle 4M\left(\dfrac{4}{N}+12\sqrt{\mathbb{E}_{\Pi^{\otimes N}}\left[\hat{r}\right]}\sqrt{\dfrac{(S+1)\log\left(2^{2L+5}N(L+1)d^{2}(S+1)^{2L}\right)}{N}}\right)
\displaystyle\leq 4M(4N+122ϕ(r)+r(S+1)log(22L+5N(L+1)d2(S+1)2L)N)\displaystyle 4M\left(\dfrac{4}{N}+12\sqrt{2\phi(r)+r}\sqrt{\dfrac{(S+1)\log\left(2^{2L+5}N(L+1)d^{2}(S+1)^{2L}\right)}{N}}\right)
\displaystyle\leq 4M(4N+12(r+2ϕ(r))(S+1)log(22L+5N(L+1)d2(S+1)2L)N)\displaystyle 4M\left(\dfrac{4}{N}+12\left(\sqrt{r}+\sqrt{2\phi(r)}\right)\sqrt{\dfrac{(S+1)\log\left(2^{2L+5}N(L+1)d^{2}(S+1)^{2L}\right)}{N}}\right)
\displaystyle\leq 16MN+48Mr(S+1)log(22L+5N(L+1)d2(S+1)2L)N\displaystyle\dfrac{16M}{N}+48M\sqrt{\dfrac{r(S+1)\log\left(2^{2L+5}N(L+1)d^{2}(S+1)^{2L}\right)}{N}}
+ϕ(r)2+2304M2(S+1)log(22L+5N(L+1)d2(S+1)2L)N,\displaystyle+\dfrac{\phi(r)}{2}+\dfrac{2304M^{2}(S+1)\log\left(2^{2L+5}N(L+1)d^{2}(S+1)^{2L}\right)}{N},

from which we deduce that

N({g|g𝔑(L,𝒑,S,M),𝔼Π~[g]r})\displaystyle\mathfrak{R}_{N}\left(\left\{\ell\circ g|g\in\mathfrak{N}(L,\boldsymbol{p},S,M),\mathbb{E}_{\widetilde{\Pi}}[\ell\circ g]\leq r\right\}\right)
\displaystyle\leq ϕ(r)32MN+96Mr(S+1)log(22L+5N(L+1)d2(S+1)2L)N\displaystyle\phi(r)\leq\dfrac{32M}{N}+96M\sqrt{\dfrac{r(S+1)\log\left(2^{2L+5}N(L+1)d^{2}(S+1)^{2L}\right)}{N}}
+4608M2(S+1)log(22L+5N(L+1)d2(S+1)2L)N,\displaystyle\ \ \ \ \ \ \ \ \ \ +\dfrac{4608M^{2}(S+1)\log\left(2^{2L+5}N(L+1)d^{2}(S+1)^{2L}\right)}{N},

which is clearly a sub-root function. Furthermore, by direct calculation, the fixed point rr^{*} of ϕ(r)\phi(r) satisfies

r64M+18432M2(S+1)log(22L+5N(L+1)d2(S+1)2L)N,r^{*}\leq\dfrac{64M+18432M^{2}(S+1)\log\left(2^{2L+5}N(L+1)d^{2}(S+1)^{2L}\right)}{N},

which concludes the proof. ∎
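The 'direct calculation' of the fixed point can be spelled out as follows (this expansion is ours). Write ϕ(r) = a + b√r + c with

a=\dfrac{32M}{N},\qquad b=96M\sqrt{\dfrac{(S+1)\log\left(2^{2L+5}N(L+1)d^{2}(S+1)^{2L}\right)}{N}},\qquad c=\dfrac{4608M^{2}(S+1)\log\left(2^{2L+5}N(L+1)d^{2}(S+1)^{2L}\right)}{N}.

Since b\sqrt{r^{*}}\leq\dfrac{r^{*}}{2}+\dfrac{b^{2}}{2}, the fixed point equation r^{*}=a+b\sqrt{r^{*}}+c implies r^{*}\leq 2(a+c)+b^{2}, which is exactly the bound stated above.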

The following lemma describes the approximation capability of 𝔑(L,𝒑,S,)\mathfrak{N}(L,\boldsymbol{p},S,\infty):

Lemma 19 (Approximation error of 𝔑(L,𝒑,S,)\mathfrak{N}(L,\boldsymbol{p},S,\infty) (Schmidt-Hieber 2020, Theorem 5)).

For any g𝒞s(Ω,M)g\in\mathcal{C}^{s}(\Omega,M) and any integer m1m\geq 1 and K(s+1)d(M+1)edK\geq(s+1)^{d}\vee(M+1)e^{d}, there exists g¯𝔑(L,𝐩,S,)\bar{g}\in\mathfrak{N}(L,\boldsymbol{p},S,\infty), where

L\displaystyle L =8+(m+5)(1+log2(ds))\displaystyle=8+(m+5)(1+\lceil\log_{2}(d\vee s)\rceil)
𝒑\displaystyle\boldsymbol{p} =(d,6(d+s)K,,6(d+s)K,1)\displaystyle=\left(d,6(d+\lceil s\rceil)K,\cdots,6(d+\lceil s\rceil)K,1\right)
S\displaystyle S 141(d+s+1)3+rK(m+6),\displaystyle\leq 141(d+s+1)^{3+r}K(m+6),

such that

g¯gL(Ω)(2M+1)(1+d2+s2)6dK2m+M3sKs/d.\|\bar{g}-g\|_{L^{\infty}(\Omega)}\leq(2M+1)(1+d^{2}+s^{2})6^{d}K2^{-m}+M3^{s}K^{-s/d}.

With all the above preparations, we are ready to prove Theorem 8.

Proof of Theorem 8.

Choose mm to be the smallest integer such that K2mKs/dK2^{-m}\leq K^{-s/d}, i.e.

m=(1+s/d)logKlog2logK,m=\left\lceil\frac{(1+s/d)\log K}{\log 2}\right\rceil\lesssim\log K,

by Lemma 19, for any g𝒞s(Ω,M)g\in\mathcal{C}^{s}(\Omega,M), there exists g¯𝔑(L,𝒑,S,)\bar{g}\in\mathfrak{N}(L,\boldsymbol{p},S,\infty), where

LlogK,𝒑K,SKlogK,L\lesssim\log K,\quad\|\boldsymbol{p}\|_{\infty}\lesssim K,\quad S\lesssim K\log K,\ (22)

such that

g¯gL(Ω)Ks/d.\|\bar{g}-g\|_{L^{\infty}(\Omega)}\lesssim K^{-s/d}.

Thus in Lemma 18, we have

r\displaystyle r^{*}\lesssim M+M2(KlogK+1)((2logK+5)log2+logN+log(logK+1)+2logd+2logKlog(KlogK+1))N\displaystyle\dfrac{M+M^{2}(K\log K+1)\left((2\log K+5)\log 2+\log N+\log(\log K+1)+2\log d+2\log K\log(K\log K+1)\right)}{N}
\displaystyle\lesssim KlogK(logKlog(KlogK)+logN)NKlog3K+KlogKlogNN.\displaystyle\dfrac{K\log K(\log K\log(K\log K)+\log N)}{N}\lesssim\dfrac{K\log^{3}K+K\log K\log N}{N}.

In Theorem 6, we take δ=Nβ(lτ)=Nα\delta^{\prime}=N\upbeta(l\tau)=N^{-\alpha} with an arbitrary α>0\alpha>0, i.e.

l=logMΠ~+(1+α)logNCΠ~τlogNτ.l=\dfrac{\log M_{\widetilde{\Pi}}+(1+\alpha)\log N}{C_{\widetilde{\Pi}}\tau}\lesssim\dfrac{\log N}{\tau}.
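(This choice of l is obtained by solving Nβ(lτ) = N^{−α}: assuming, as in the mixing condition of the main text, an exponential bound of the form \upbeta(t)\leq M_{\widetilde{\Pi}}e^{-C_{\widetilde{\Pi}}t}, we get

NM_{\widetilde{\Pi}}e^{-C_{\widetilde{\Pi}}l\tau}=N^{-\alpha}\quad\Longleftrightarrow\quad l\tau=\dfrac{\log M_{\widetilde{\Pi}}+(1+\alpha)\log N}{C_{\widetilde{\Pi}}},

which is the displayed formula.)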

Then with probability at least 12Nα1-2N^{-\alpha}, we have

Π~g0(g^)\displaystyle\mathcal{L}_{\widetilde{\Pi}}^{g^{0}}(\hat{g})\leq 11ϵ^Ng0(g^)+176M2ϵr+l(44ϵ+104)M2log(lNα)ϵN\displaystyle\dfrac{1}{1-\epsilon}\hat{\mathcal{L}}_{N}^{g^{0}}(\hat{g})+\dfrac{176}{M^{2}\epsilon}r^{*}+\dfrac{l\left(44\epsilon+104\right)M^{2}\log\left(lN^{\alpha}\right)}{\epsilon N}
\displaystyle\lesssim ^Ng0(g^)+Klog3K+KlogKlogNN+log2NNτ.\displaystyle\hat{\mathcal{L}}_{N}^{g^{0}}(\hat{g})+\dfrac{K\log^{3}K+K\log K\log N}{N}+\dfrac{\log^{2}N}{N\tau}.

In Lemma 15, taking ρ=1/N\rho=1/N and LL, 𝒑\boldsymbol{p}, and SS as in (22), we have

log𝒩(ρ,𝔑(L,𝒑,S,),)Klog3K+KlogKlogN,\log\mathcal{N}(\rho,\mathfrak{N}(L,\boldsymbol{p},S,\infty),\|\cdot\|_{\infty})\lesssim K\log^{3}K+K\log K\log N,

and thus, applying Theorem 7 with ρ=1/N\rho=1/N as well,

𝔼[^Ng0(g^)]\displaystyle\mathbb{E}\left[\hat{\mathcal{L}}_{N}^{g^{0}}(\hat{g})\right]\lesssim infg¯𝔊𝔼[^Ng0(g¯)]+τ+log𝒩(1/N,𝔑(L,𝒑,S,M),)Nτγ+ρ2Nτγ\inf_{\bar{g}\in\mathfrak{G}}\mathbb{E}\left[\hat{\mathcal{L}}_{N}^{g^{0}}(\bar{g})\right]+\tau+\dfrac{\log\mathcal{N}(1/N,\mathfrak{N}(L,\boldsymbol{p},S,M),\|\cdot\|_{\infty})}{N\tau^{\gamma}}+\sqrt{\dfrac{\rho^{2}}{N\tau^{\gamma}}}
\displaystyle\lesssim infg¯𝔊g¯g02+τ+log𝒩(1/N,𝔑(L,𝒑,S,),)Nτγ+1N3τγ\inf_{\bar{g}\in\mathfrak{G}}\|\bar{g}-g^{0}\|_{\infty}^{2}+\tau+\dfrac{\log\mathcal{N}(1/N,\mathfrak{N}(L,\boldsymbol{p},S,\infty),\|\cdot\|_{\infty})}{N\tau^{\gamma}}+\sqrt{\dfrac{1}{N^{3}\tau^{\gamma}}}
\displaystyle\lesssim K2s/d+τ+Klog3K+KlogKlogNNτγ+1N,\displaystyle K^{-2s/d}+\tau+\dfrac{K\log^{3}K+K\log K\log N}{N\tau^{\gamma}}+\dfrac{1}{N},

where we used NτγNτ1N\tau^{\gamma}\geq N\tau\geq 1.

Finally, we conclude

𝔼[Π~g0(g^)]\displaystyle\mathbb{E}\left[\mathcal{L}_{\widetilde{\Pi}}^{g^{0}}(\hat{g})\right]\lesssim 𝔼[^Ng0(g^)]+Klog3K+KlogKlogNN+log2NNτ\displaystyle\mathbb{E}\left[\hat{\mathcal{L}}_{N}^{g^{0}}(\hat{g})\right]+\dfrac{K\log^{3}K+K\log K\log N}{N}+\dfrac{\log^{2}N}{N\tau}
\displaystyle\lesssim K2s/d+τ+Klog3K+KlogKlogNN(τγ1)+1N+log2NNτ\displaystyle K^{-2s/d}+\tau+\dfrac{K\log^{3}K+K\log K\log N}{N(\tau^{\gamma}\wedge 1)}+\dfrac{1}{N}+\dfrac{\log^{2}N}{N\tau}
\displaystyle\lesssim (N(τγ1))2s2s+dlog3(N(τγ1))+τ+log2NNτ,\displaystyle\left(N(\tau^{\gamma}\wedge 1)\right)^{-\frac{2s}{2s+d}}\log^{3}(N(\tau^{\gamma}\wedge 1))+\tau+\dfrac{\log^{2}N}{N\tau},

by taking K(N(τγ1))d2s+dK\asymp(N(\tau^{\gamma}\wedge 1))^{\frac{d}{2s+d}}.
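We remark that the choice of K above balances, up to logarithmic factors, the approximation term K^{−2s/d} against the dominant stochastic term K/(N(τ^γ∧1)):

K^{-\frac{2s}{d}}\asymp\dfrac{K}{N(\tau^{\gamma}\wedge 1)}\quad\Longleftrightarrow\quad K\asymp\left(N(\tau^{\gamma}\wedge 1)\right)^{\frac{d}{2s+d}},\qquad\text{so that}\qquad K^{-\frac{2s}{d}}\asymp\left(N(\tau^{\gamma}\wedge 1)\right)^{-\frac{2s}{2s+d}}.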

Appendix B Detailed Proofs of Section 3.3

In this section, we present the detailed proofs of the results in Section 3.3, namely Theorem 3 and Theorem 9. Both proofs are based on Theorem 8, which we proved in Appendix A. Specifically, we verify the conditions of Theorem 8 for the drift and diffusion estimation problems, respectively. To this end, we first compute the noise ΔZkτ\Delta Z_{k\tau} of the drift and diffusion estimators as in (10), analyze its decomposition into the bias term ΔAkτ\Delta A_{k\tau} and the martingale noise term ΔMkτ\Delta M_{k\tau}, and then verify Assumption 6 and determine the exponent γ\gamma in Assumption 7 for both the drift and diffusion estimators; an illustrative sketch of the corresponding regression targets is given after this paragraph.
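As an illustration only (the helper names drift_targets and diffusion_targets below are ours, not from the paper's code, and the snippet is a minimal sketch rather than the actual estimator), the regression targets implicit in the empirical losses (5) and (6) can be assembled from a discretely observed trajectory as follows, with Y_{k,τ} as in (23):

import numpy as np

def drift_targets(traj, tau):
    # finite-difference targets (x_{(k+1)tau} - x_{k tau}) / tau, regressed on x_{k tau}
    return (traj[1:] - traj[:-1]) / tau

def diffusion_targets(traj, tau, drift_hat):
    # targets Y_k Y_k^T / (2 tau), with Y_k = x_{(k+1)tau} - x_{k tau} - b_hat(x_{k tau}) * tau
    Y = traj[1:] - traj[:-1] - tau * drift_hat(traj[:-1])   # shape (N, d)
    return np.einsum("ki,kj->kij", Y, Y) / (2.0 * tau)      # shape (N, d, d)

# toy usage on an Euler-Maruyama path of a linear SDE (our own toy example)
rng = np.random.default_rng(0)
d, N, tau = 2, 1000, 1e-2
x = np.zeros((N + 1, d))
for k in range(N):
    x[k + 1] = x[k] - x[k] * tau + np.sqrt(tau) * rng.standard_normal(d)
b_hat = lambda z: -z                                        # pretend the drift is known
print(drift_targets(x, tau).shape)                          # (1000, 2)
print(diffusion_targets(x, tau, b_hat).shape)               # (1000, 2, 2)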

Lemma 20.

For t[kτ,(k+1)τ]t\in[k\tau,(k+1)\tau],

𝔼[𝒙t𝒙kτ20kτ]eτ(M2d+2Md2)(tkτ).\mathbb{E}\left[\|\boldsymbol{x}_{t}-\boldsymbol{x}_{k\tau}\|^{2}\mid\mathcal{F}_{0}^{k\tau}\right]\leq e^{\tau}(M^{2}d+2Md^{2})(t-k\tau).
Proof.

By Itô’s formula, we have for s[kτ,(k+1)τ]s\in[k\tau,(k+1)\tau],

d𝒙s𝒙kτ2\displaystyle\mathrm{d}\|\boldsymbol{x}_{s}-\boldsymbol{x}_{k\tau}\|^{2} =2(𝒙s𝒙kτ)d𝒙s+dd𝒙s𝒙kτ,𝒙s𝒙kτ\displaystyle=2(\boldsymbol{x}_{s}-\boldsymbol{x}_{k\tau})^{\top}\mathrm{d}\boldsymbol{x}_{s}+d\mathrm{d}\langle\boldsymbol{x}_{s}-\boldsymbol{x}_{k\tau},\boldsymbol{x}_{s}-\boldsymbol{x}_{k\tau}\rangle
=2(𝒙s𝒙kτ)(𝒃sds+Σsd𝒘s)+dd𝒙s𝒙kτ,𝒙s𝒙kτ\displaystyle=2(\boldsymbol{x}_{s}-\boldsymbol{x}_{k\tau})^{\top}\left(\boldsymbol{b}_{s}\mathrm{d}s+\Sigma_{s}\mathrm{d}\boldsymbol{w}_{s}\right)+d\mathrm{d}\langle\boldsymbol{x}_{s}-\boldsymbol{x}_{k\tau},\boldsymbol{x}_{s}-\boldsymbol{x}_{k\tau}\rangle
=2(𝒙s𝒙kτ)(𝒃sds+Σsd𝒘s)+dtr(ΣsΣs)ds\displaystyle=2(\boldsymbol{x}_{s}-\boldsymbol{x}_{k\tau})^{\top}\left(\boldsymbol{b}_{s}\mathrm{d}s+\Sigma_{s}\mathrm{d}\boldsymbol{w}_{s}\right)+d\mathrm{tr}\left(\Sigma_{s}^{\top}\Sigma_{s}\right)\mathrm{d}s
=(2(𝒙s𝒙kτ)𝒃s+2dtrDs)ds+2(𝒙s𝒙kτ)Σsd𝒘s.\displaystyle=\left(2(\boldsymbol{x}_{s}-\boldsymbol{x}_{k\tau})^{\top}\boldsymbol{b}_{s}+2d\mathrm{tr}D_{s}\right)\mathrm{d}s+2(\boldsymbol{x}_{s}-\boldsymbol{x}_{k\tau})^{\top}\Sigma_{s}\mathrm{d}\boldsymbol{w}_{s}.

Therefore for any t[kτ,(k+1)τ]t\in[k\tau,(k+1)\tau],

𝔼[𝒙t𝒙kτ20kτ]\displaystyle\mathbb{E}\left[\|\boldsymbol{x}_{t}-\boldsymbol{x}_{k\tau}\|^{2}\mid\mathcal{F}_{0}^{k\tau}\right] =𝔼[kτt(2(𝒙s𝒙kτ)𝒃s+2dtrDs)ds|0kτ]\displaystyle=\mathbb{E}\left[\int_{k\tau}^{t}\left(2(\boldsymbol{x}_{s}-\boldsymbol{x}_{k\tau})^{\top}\boldsymbol{b}_{s}+2d\mathrm{tr}D_{s}\right)\mathrm{d}s\bigg{|}\mathcal{F}_{0}^{k\tau}\right]
(M2d+2Md2)(tkτ)+kτt𝔼[𝒙s𝒙kτ20kτ]ds,\displaystyle\leq(M^{2}d+2Md^{2})(t-k\tau)+\int_{k\tau}^{t}\mathbb{E}\left[\|\boldsymbol{x}_{s}-\boldsymbol{x}_{k\tau}\|^{2}\mid\mathcal{F}_{0}^{k\tau}\right]\mathrm{d}s,

and by Gronwall’s inequality, we have

𝔼[𝒙t𝒙kτ20kτ]eτ(M2d+2Md2)(tkτ).\mathbb{E}\left[\|\boldsymbol{x}_{t}-\boldsymbol{x}_{k\tau}\|^{2}\mid\mathcal{F}_{0}^{k\tau}\right]\leq e^{\tau}(M^{2}d+2Md^{2})(t-k\tau).
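As an illustrative, non-rigorous sanity check of the linear-in-(t − kτ) scaling in Lemma 20, one can simulate a toy one-dimensional SDE with the Euler-Maruyama scheme and monitor E[|x_t − x_0|²]/t; the SDE and all parameters below are our own choices and are not part of the proof.

import numpy as np

# toy SDE: dx = -x dt + (1 + 0.5*sin(x)) dw, all paths started at x_0 = 0
rng = np.random.default_rng(1)
n_paths, n_steps, dt = 200_000, 64, 1e-3
x = np.zeros(n_paths)
for step in range(1, n_steps + 1):
    sigma = 1.0 + 0.5 * np.sin(x)
    x = x + (-x) * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_paths)
    if step in (8, 16, 32, 64):
        t = step * dt
        print(f"t = {t:.3f}:  E|x_t - x_0|^2 / t = {np.mean(x ** 2) / t:.3f}")
# the printed ratios stay roughly constant for small t, consistent with the O(t - k*tau) bound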

B.1 Proof of Theorem 3

As a warm-up for the proof of Theorem 9, we first prove the corresponding upper bound for drift estimation:

Proof of Theorem 3.

Suppose that 1id1\leq i\leq d. Let bitb^{i}_{t}, the ii-th component of the drift 𝒃t\boldsymbol{b}_{t}, be our target function g0g^{0}. We also use ZitZ^{i}_{t}, AitA^{i}_{t}, and MitM^{i}_{t} to denote the corresponding noise terms.

For any 0kN10\leq k\leq N-1, integrating the SDE (1) over [kτ,(k+1)τ][k\tau,(k+1)\tau] yields

xi(k+1)τxikττ=bikτ+1τkτ(k+1)τ(bitbikτ)dt+1τkτ(k+1)τ𝚺tid𝒘t,\dfrac{x^{i}_{(k+1)\tau}-x^{i}_{k\tau}}{\tau}=b^{i}_{k\tau}+\dfrac{1}{\tau}\int_{k\tau}^{(k+1)\tau}\left(b^{i}_{t}-b^{i}_{k\tau}\right)\mathrm{d}t+\dfrac{1}{\tau}\int_{k\tau}^{(k+1)\tau}\boldsymbol{\Sigma}_{t}^{i}\mathrm{d}\boldsymbol{w}_{t},

and comparing the estimated empirical loss for drift inference (5) with the general form (10), we have

ΔZikτ=1τkτ(k+1)τ(bitbikτ)dt+1τkτ(k+1)τ𝚺tid𝒘t.\Delta Z^{i}_{k\tau}=\dfrac{1}{\tau}\int_{k\tau}^{(k+1)\tau}\left(b^{i}_{t}-b^{i}_{k\tau}\right)\mathrm{d}t+\dfrac{1}{\tau}\int_{k\tau}^{(k+1)\tau}\boldsymbol{\Sigma}_{t}^{i}\mathrm{d}\boldsymbol{w}_{t}.

By definition, we also have

ΔAikτ=𝔼[1τkτ(k+1)τ(bitbikτ)dt],ΔMikτ=1τkτ(k+1)τ𝚺tid𝒘t.\Delta A^{i}_{k\tau}=\mathbb{E}\left[\dfrac{1}{\tau}\int_{k\tau}^{(k+1)\tau}\left(b^{i}_{t}-b^{i}_{k\tau}\right)\mathrm{d}t\right],\quad\Delta M^{i}_{k\tau}=\dfrac{1}{\tau}\int_{k\tau}^{(k+1)\tau}\boldsymbol{\Sigma}_{t}^{i}\mathrm{d}\boldsymbol{w}_{t}.

The following simple argument

ΔMikτ=1τ2kτ(k+1)τ𝚺ti22dtM2dτ1\Delta\left\langle M^{i}\right\rangle_{k\tau}=\dfrac{1}{\tau^{2}}\int_{k\tau}^{(k+1)\tau}\|\boldsymbol{\Sigma}_{t}^{i}\|_{2}^{2}\mathrm{d}t\leq M^{2}d\tau^{-1}

validates Assumption 7 with γ=1\gamma=1.

Also, by Cauchy-Schwarz and Lemma 20, we have

𝔼[(ΔAikτ)2|0kτ]\displaystyle\mathbb{E}\left[\left(\Delta A^{i}_{k\tau}\right)^{2}\big{|}\mathcal{F}_{0}^{k\tau}\right]\leq 𝔼[1τ2(kτ(k+1)τ(bitbikτ)dt)2|0kτ]\displaystyle\mathbb{E}\left[\dfrac{1}{\tau^{2}}\left(\int_{k\tau}^{(k+1)\tau}\left(b^{i}_{t}-b^{i}_{k\tau}\right)\mathrm{d}t\right)^{2}\bigg{|}\mathcal{F}_{0}^{k\tau}\right]
\displaystyle\leq 𝔼[1τkτ(k+1)τ(bitbikτ)2dt|0kτ]\displaystyle\mathbb{E}\left[\dfrac{1}{\tau}\int_{k\tau}^{(k+1)\tau}\left(b^{i}_{t}-b^{i}_{k\tau}\right)^{2}\mathrm{d}t\bigg{|}\mathcal{F}_{0}^{k\tau}\right]
\displaystyle\leq 𝔼[M2τkτ(k+1)τ𝒙t𝒙kτ2dt|0kτ]\displaystyle\mathbb{E}\left[\dfrac{M^{2}}{\tau}\int_{k\tau}^{(k+1)\tau}\left\|\boldsymbol{x}_{t}-\boldsymbol{x}_{k\tau}\right\|^{2}\mathrm{d}t\bigg{|}\mathcal{F}_{0}^{k\tau}\right]
\displaystyle\leq M2τkτ(k+1)τeτ(M2d+2Md2)(tkτ)dt\dfrac{M^{2}}{\tau}\int_{k\tau}^{(k+1)\tau}e^{\tau}(M^{2}d+2Md^{2})(t-k\tau)\mathrm{d}t
\displaystyle\leq 12eτM2(M2d+2Md2)τ,\dfrac{1}{2}e^{\tau}M^{2}(M^{2}d+2Md^{2})\tau,

and Assumption 6 is thus satisfied.

Finally, we apply Theorem 8 to g0=big^{0}=b^{i}, and obtain the final rate

𝔼[Π~g0(bi^)]\displaystyle\mathbb{E}\left[\mathcal{L}_{\widetilde{\Pi}}^{g^{0}}(\hat{b^{i}})\right]\lesssim (Nτ)2s2s+dlog3(Nτ)+τ+log2NNτ\left(N\tau\right)^{-\frac{2s}{2s+d}}\log^{3}(N\tau)+\tau+\dfrac{\log^{2}N}{N\tau}
\displaystyle\lesssim T2s2s+dlog3T+τ,\displaystyle T^{-\frac{2s}{2s+d}}\log^{3}T+\tau,

and the proof is thus complete by repeating for all 1id1\leq i\leq d. ∎

B.2 Proof of Theorem 9

The proof for diffusion estimation requires an even more delicate analysis of the dynamics of the SDE (1).

Proof of Theorem 9.

Following the notation introduced in the proof of Theorem 3, for $1\leq i,j\leq d$, let $D^{ij}$ be the target function and $Z^{ij}$, $A^{ij}$, and $M^{ij}$ the corresponding noise terms.

For any $0\leq k\leq N-1$, define the following auxiliary process for $s\in[0,\tau]$,

Y_{k,s}^{i}:=x^{i}_{k\tau+s}-x^{i}_{k\tau}-\hat{b}^{i}_{k\tau}s=\int_{k\tau}^{k\tau+s}\left(b^{i}_{t}-\hat{b}^{i}_{k\tau}\right)\mathrm{d}t+\int_{k\tau}^{k\tau+s}\boldsymbol{\Sigma}_{t}^{i}\mathrm{d}\boldsymbol{w}_{t}, \quad (23)

where $\boldsymbol{\Sigma}^{i}$ denotes the $i$-th row of $\Sigma$ (a row vector). Plugging (23) into the estimated empirical loss for diffusion estimation (6), we have by definition

\Delta Z^{ij}_{k\tau}=(2\tau)^{-1}Y_{k,\tau}^{i}Y_{k,\tau}^{j}-D^{ij}_{k\tau},
\Delta A^{ij}_{k\tau}=(2\tau)^{-1}\mathbb{E}\left[Y_{k,\tau}^{i}Y_{k,\tau}^{j}\right]-D^{ij}_{k\tau},
\Delta\left\langle M^{ij}\right\rangle_{k\tau}=(2\tau)^{-2}\left\langle Y_{k}^{i}Y_{k}^{j}\right\rangle_{\tau}.
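
To illustrate how these quantities are formed in practice, here is a minimal numerical sketch (again not the paper's implementation) of the debiased increments $Y_{k,\tau}$ and of the entrywise targets $(2\tau)^{-1}Y_{k,\tau}^{i}Y_{k,\tau}^{j}$. The example dynamics are assumptions, the true drift stands in for the first-stage estimate $\hat{b}$, and the scaling follows the convention of this proof, in which $\mathbb{E}[Y_{k,\tau}^{i}Y_{k,\tau}^{j}]\approx 2\tau D^{ij}_{k\tau}$ as computed below.

# Minimal sketch (illustrative): debiased increments Y and the diffusion
# targets (2 tau)^{-1} Y^i Y^j from a simulated trajectory. The example
# dynamics and the use of b_true in place of the estimate bhat are assumptions.
import numpy as np

rng = np.random.default_rng(1)
d, N, tau = 2, 20_000, 1e-3

def b_true(x):
    return -x

def sigma_true(x):
    return np.eye(d) * (1.0 + 0.5 * np.tanh(x[0]))

x = np.zeros((N + 1, d))
for k in range(N):
    dw = rng.normal(scale=np.sqrt(tau), size=d)
    x[k + 1] = x[k] + tau * b_true(x[k]) + sigma_true(x[k]) @ dw

# Y_{k,tau} = x_{(k+1)tau} - x_{k tau} - bhat(x_{k tau}) * tau, with bhat = b_true here.
Y = x[1:] - x[:-1] - tau * np.array([b_true(xi) for xi in x[:-1]])

# Entrywise targets (2 tau)^{-1} Y^i Y^j; their average should track the
# trajectory average of Sigma Sigma^T / 2 under the scaling of this proof.
targets = np.einsum('ki,kj->kij', Y, Y) / (2.0 * tau)
D_path = np.array([0.5 * sigma_true(xi) @ sigma_true(xi).T for xi in x[:-1]])
print("mean target:\n", targets.mean(axis=0))
print("mean Sigma Sigma^T / 2 along the path:\n", D_path.mean(axis=0))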

The process $Y_{k,s}^{i}$ satisfies the following SDE:

\mathrm{d}Y_{k,s}^{i}=\left(b_{k\tau+s}^{i}-\hat{b}_{k\tau}^{i}\right)\mathrm{d}s+\boldsymbol{\Sigma}_{k\tau+s}^{i}\mathrm{d}\boldsymbol{w}_{s}.

By Itô’s formula, the product process $\left(Y_{k,s}^{i}Y_{k,s}^{j}\right)_{0\leq s\leq\tau}$ also satisfies an SDE:

\mathrm{d}\left(Y_{k,s}^{i}Y_{k,s}^{j}\right)=Y_{k,s}^{i}\mathrm{d}Y_{k,s}^{j}+Y_{k,s}^{j}\mathrm{d}Y_{k,s}^{i}+\mathrm{d}\left\langle Y_{k}^{i},Y_{k}^{j}\right\rangle_{s}
=\left[Y_{k,s}^{i}\left(b_{k\tau+s}^{j}-\hat{b}_{k\tau}^{j}\right)+Y_{k,s}^{j}\left(b_{k\tau+s}^{i}-\hat{b}_{k\tau}^{i}\right)+\boldsymbol{\Sigma}^{i}_{k\tau+s}\left(\boldsymbol{\Sigma}^{j}_{k\tau+s}\right)^{\top}\right]\mathrm{d}s+\left[Y_{k,s}^{i}\boldsymbol{\Sigma}_{k\tau+s}^{j}+Y_{k,s}^{j}\boldsymbol{\Sigma}_{k\tau+s}^{i}\right]\mathrm{d}\boldsymbol{w}_{s}

with its quadratic variation satisfying

\mathrm{d}\left\langle Y_{k}^{i}Y_{k}^{j}\right\rangle_{s}=\left\|Y_{k,s}^{i}\boldsymbol{\Sigma}_{k\tau+s}^{j}+Y_{k,s}^{j}\boldsymbol{\Sigma}_{k\tau+s}^{i}\right\|^{2}\mathrm{d}s.
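
For reference, the cross-variation appearing in the product formula above is, under the dynamics of $Y_{k,s}^{i}$ and $Y_{k,s}^{j}$,

\mathrm{d}\left\langle Y_{k}^{i},Y_{k}^{j}\right\rangle_{s}=\boldsymbol{\Sigma}_{k\tau+s}^{i}\left(\boldsymbol{\Sigma}_{k\tau+s}^{j}\right)^{\top}\mathrm{d}s,

which is the source of the $\boldsymbol{\Sigma}^{i}_{k\tau+s}\left(\boldsymbol{\Sigma}^{j}_{k\tau+s}\right)^{\top}\mathrm{d}s$ term in the drift part of the product SDE.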

By direct calculation, we have

\mathbb{E}\left[Y_{k,s}^{i}Y_{k,s}^{j}\,\big|\,\mathcal{F}_{0}^{k\tau}\right] (24)
=\mathbb{E}\left[\int_{0}^{s}\left(\int_{0}^{s_{1}}\left(b_{k\tau+s_{2}}^{i}-\hat{b}_{k\tau}^{i}\right)\mathrm{d}s_{2}+\int_{0}^{s_{1}}\boldsymbol{\Sigma}_{k\tau+s_{2}}^{i}\mathrm{d}\boldsymbol{w}_{s_{2}}\right)\left(b_{k\tau+s_{1}}^{j}-\hat{b}_{k\tau}^{j}\right)\mathrm{d}s_{1}\,\bigg|\,\mathcal{F}_{0}^{k\tau}\right]
+\mathbb{E}\left[\int_{0}^{s}\left(\int_{0}^{s_{2}}\left(b_{k\tau+s_{1}}^{j}-\hat{b}_{k\tau}^{j}\right)\mathrm{d}s_{1}+\int_{0}^{s_{2}}\boldsymbol{\Sigma}_{k\tau+s_{1}}^{j}\mathrm{d}\boldsymbol{w}_{s_{1}}\right)\left(b_{k\tau+s_{2}}^{i}-\hat{b}_{k\tau}^{i}\right)\mathrm{d}s_{2}\,\bigg|\,\mathcal{F}_{0}^{k\tau}\right]
+\mathbb{E}\left[\int_{0}^{s}\boldsymbol{\Sigma}^{i}_{k\tau+s^{\prime}}\left(\boldsymbol{\Sigma}^{j}_{k\tau+s^{\prime}}-\boldsymbol{\Sigma}_{k\tau}^{j}\right)^{\top}+\left(\boldsymbol{\Sigma}^{i}_{k\tau+s^{\prime}}-\boldsymbol{\Sigma}^{i}_{k\tau}\right)\left(\boldsymbol{\Sigma}_{k\tau}^{j}\right)^{\top}\mathrm{d}s^{\prime}\,\bigg|\,\mathcal{F}_{0}^{k\tau}\right]+s\boldsymbol{\Sigma}_{k\tau}^{i}\left(\boldsymbol{\Sigma}_{k\tau}^{j}\right)^{\top}
=\int_{0}^{s}\int_{0}^{s}\mathbb{E}\left[\left(b_{k\tau+s_{1}}^{i}-\hat{b}_{k\tau}^{i}\right)\left(b_{k\tau+s_{2}}^{j}-\hat{b}_{k\tau}^{j}\right)\,\bigg|\,\mathcal{F}_{0}^{k\tau}\right]\mathrm{d}s_{1}\mathrm{d}s_{2}
+\mathbb{E}\left[\int_{0}^{s}\boldsymbol{\Sigma}^{i}_{k\tau+s^{\prime}}\left(\boldsymbol{\Sigma}^{j}_{k\tau+s^{\prime}}-\boldsymbol{\Sigma}_{k\tau}^{j}\right)^{\top}+\left(\boldsymbol{\Sigma}^{i}_{k\tau+s^{\prime}}-\boldsymbol{\Sigma}^{i}_{k\tau}\right)\left(\boldsymbol{\Sigma}_{k\tau}^{j}\right)^{\top}\mathrm{d}s^{\prime}\,\bigg|\,\mathcal{F}_{0}^{k\tau}\right]+2sD^{ij}_{k\tau}.

For the first term of (24), we have the following bound by applying Cauchy-Schwarz

\left(\int_{0}^{s}\int_{0}^{s}\mathbb{E}\left[\left(b_{k\tau+s_{1}}^{i}-\hat{b}_{k\tau}^{i}\right)\left(b_{k\tau+s_{2}}^{j}-\hat{b}_{k\tau}^{j}\right)\,\bigg|\,\mathcal{F}_{0}^{k\tau}\right]\mathrm{d}s_{1}\mathrm{d}s_{2}\right)^{2}
\lesssim\left(\int_{0}^{s}\int_{0}^{s}\mathbb{E}\left[\left(b_{k\tau+s_{1}}^{i}-b_{k\tau}^{i}\right)\left(b_{k\tau+s_{2}}^{j}-b_{k\tau}^{j}\right)\,\bigg|\,\mathcal{F}_{0}^{k\tau}\right]\mathrm{d}s_{1}\mathrm{d}s_{2}\right)^{2}
+\left(\int_{0}^{s}\int_{0}^{s}\mathbb{E}\left[\left(b_{k\tau}^{i}-\hat{b}_{k\tau}^{i}\right)\left(b_{k\tau+s_{2}}^{j}-b_{k\tau}^{j}\right)\,\bigg|\,\mathcal{F}_{0}^{k\tau}\right]\mathrm{d}s_{1}\mathrm{d}s_{2}\right)^{2}
+\left(\int_{0}^{s}\int_{0}^{s}\mathbb{E}\left[\left(b_{k\tau+s_{1}}^{i}-b_{k\tau}^{i}\right)\left(b_{k\tau}^{j}-\hat{b}_{k\tau}^{j}\right)\,\bigg|\,\mathcal{F}_{0}^{k\tau}\right]\mathrm{d}s_{1}\mathrm{d}s_{2}\right)^{2}
+\left(\int_{0}^{s}\int_{0}^{s}\mathbb{E}\left[\left(b_{k\tau}^{i}-\hat{b}_{k\tau}^{i}\right)\left(b_{k\tau}^{j}-\hat{b}_{k\tau}^{j}\right)\,\bigg|\,\mathcal{F}_{0}^{k\tau}\right]\mathrm{d}s_{1}\mathrm{d}s_{2}\right)^{2}
\lesssim s^{2}\int_{0}^{s}\int_{0}^{s}\mathbb{E}\left[\|\boldsymbol{x}_{k\tau+s_{1}}-\boldsymbol{x}_{k\tau}\|^{2}\|\boldsymbol{x}_{k\tau+s_{2}}-\boldsymbol{x}_{k\tau}\|^{2}\,\big|\,\mathcal{F}_{0}^{k\tau}\right]\mathrm{d}s_{1}\mathrm{d}s_{2}
+s^{3}\left(b_{k\tau}^{i}-\hat{b}_{k\tau}^{i}\right)^{2}\int_{0}^{s}\mathbb{E}\left[\|\boldsymbol{x}_{k\tau+s_{1}}-\boldsymbol{x}_{k\tau}\|^{2}\,\big|\,\mathcal{F}_{0}^{k\tau}\right]\mathrm{d}s_{1}
+s^{3}\left(b_{k\tau}^{j}-\hat{b}_{k\tau}^{j}\right)^{2}\int_{0}^{s}\mathbb{E}\left[\|\boldsymbol{x}_{k\tau+s_{2}}-\boldsymbol{x}_{k\tau}\|^{2}\,\big|\,\mathcal{F}_{0}^{k\tau}\right]\mathrm{d}s_{2}+s^{4}\left(b_{k\tau}^{i}-\hat{b}_{k\tau}^{i}\right)^{2}\left(b_{k\tau}^{j}-\hat{b}_{k\tau}^{j}\right)^{2}
\lesssim s^{6}+s^{5}\left[\left(b_{k\tau}^{i}-\hat{b}_{k\tau}^{i}\right)^{2}+\left(b_{k\tau}^{j}-\hat{b}_{k\tau}^{j}\right)^{2}\right]+s^{4}\left(b_{k\tau}^{i}-\hat{b}_{k\tau}^{i}\right)^{2}\left(b_{k\tau}^{j}-\hat{b}_{k\tau}^{j}\right)^{2},

where the last inequality follows from Lemma 20.
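
In more detail, the first term of the final bound also requires a fourth-moment analogue of Lemma 20; assuming $\mathbb{E}\left[\|\boldsymbol{x}_{k\tau+s^{\prime}}-\boldsymbol{x}_{k\tau}\|^{4}\mid\mathcal{F}_{0}^{k\tau}\right]\lesssim (s^{\prime})^{2}$, which follows from the same argument under the boundedness of $\boldsymbol{b}$ and $\Sigma$, the Cauchy–Schwarz inequality gives

\mathbb{E}\left[\|\boldsymbol{x}_{k\tau+s_{1}}-\boldsymbol{x}_{k\tau}\|^{2}\|\boldsymbol{x}_{k\tau+s_{2}}-\boldsymbol{x}_{k\tau}\|^{2}\,\big|\,\mathcal{F}_{0}^{k\tau}\right]\lesssim s_{1}s_{2},\qquad\text{so}\qquad s^{2}\int_{0}^{s}\int_{0}^{s}s_{1}s_{2}\,\mathrm{d}s_{1}\mathrm{d}s_{2}=\dfrac{s^{6}}{4}\lesssim s^{6}.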

For the second term of (24),

\left(\mathbb{E}\left[\int_{0}^{s}\boldsymbol{\Sigma}^{i}_{k\tau+s^{\prime}}\left(\boldsymbol{\Sigma}^{j}_{k\tau+s^{\prime}}-\boldsymbol{\Sigma}_{k\tau}^{j}\right)^{\top}+\left(\boldsymbol{\Sigma}^{i}_{k\tau+s^{\prime}}-\boldsymbol{\Sigma}^{i}_{k\tau}\right)\left(\boldsymbol{\Sigma}_{k\tau}^{j}\right)^{\top}\mathrm{d}s^{\prime}\,\bigg|\,\mathcal{F}_{0}^{k\tau}\right]\right)^{2} (25)
\leq\mathbb{E}\left[\left(\int_{0}^{s}\boldsymbol{\Sigma}^{i}_{k\tau+s^{\prime}}\left(\boldsymbol{\Sigma}^{j}_{k\tau+s^{\prime}}-\boldsymbol{\Sigma}_{k\tau}^{j}\right)^{\top}+\left(\boldsymbol{\Sigma}^{i}_{k\tau+s^{\prime}}-\boldsymbol{\Sigma}^{i}_{k\tau}\right)\left(\boldsymbol{\Sigma}_{k\tau}^{j}\right)^{\top}\mathrm{d}s^{\prime}\right)^{2}\,\bigg|\,\mathcal{F}_{0}^{k\tau}\right]
\lesssim s\int_{0}^{s}\mathbb{E}\left[\left\|\boldsymbol{\Sigma}^{j}_{k\tau+s^{\prime}}-\boldsymbol{\Sigma}^{j}_{k\tau}\right\|^{2}\,\bigg|\,\mathcal{F}_{0}^{k\tau}\right]\mathrm{d}s^{\prime}+s\int_{0}^{s}\mathbb{E}\left[\left\|\boldsymbol{\Sigma}^{i}_{k\tau+s^{\prime}}-\boldsymbol{\Sigma}^{i}_{k\tau}\right\|^{2}\,\bigg|\,\mathcal{F}_{0}^{k\tau}\right]\mathrm{d}s^{\prime}
\lesssim s\int_{0}^{s}\mathbb{E}\left[\left\|\boldsymbol{x}_{k\tau+s^{\prime}}-\boldsymbol{x}_{k\tau}\right\|^{2}\,\big|\,\mathcal{F}_{0}^{k\tau}\right]\mathrm{d}s^{\prime}\lesssim s^{3}.
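
The second-to-last inequality uses the regularity of $\Sigma$ in the state variable, as in the corresponding step for the drift, and the final $\lesssim s^{3}$ follows from Lemma 20 together with the elementary integral

s\int_{0}^{s}\mathbb{E}\left[\left\|\boldsymbol{x}_{k\tau+s^{\prime}}-\boldsymbol{x}_{k\tau}\right\|^{2}\,\big|\,\mathcal{F}_{0}^{k\tau}\right]\mathrm{d}s^{\prime}\lesssim s\int_{0}^{s}s^{\prime}\,\mathrm{d}s^{\prime}=\dfrac{s^{3}}{2}.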

Combining the two estimates above, we have

(𝔼[(Yk,siYk,sj)|0kτ])2s6+s5[(bkτib^ (26)