Statistical Spatially Inhomogeneous Diffusion Inference
Abstract
Inferring a diffusion equation from discretely observed measurements is a statistical challenge of significant importance in a variety of fields, from single-molecule tracking in biophysical systems to modeling financial instruments. Assuming that the underlying dynamical process obeys a d-dimensional stochastic differential equation of the form
we propose neural network-based estimators of both the drift and the spatially-inhomogeneous diffusion tensor and provide statistical convergence guarantees when the drift and diffusion coefficients are Hölder continuous. Notably, our bound aligns with the minimax optimal rate for nonparametric function estimation even in the presence of correlation within the observational data, which necessitates careful handling when establishing fast-rate generalization bounds. Our theoretical results are bolstered by numerical experiments demonstrating accurate inference of spatially-inhomogeneous diffusion tensors.
1 Introduction
The dynamical evolution of a wide variety of natural processes, from molecular motion within cells to atmospheric systems, involves an interplay between deterministic forces and noise from the surrounding environment. While it is possible to observe time series data from such systems, in general the underlying equation of motion is not known analytically. Stochastic differential equations offer a powerful and versatile framework for modeling these complex systems, but inferring the deterministic drift and diffusion tensor from time series data remains challenging, especially in high-dimensional settings. Among the many strategies that have been proposed (Crommelin and Vanden-Eijnden 2011; Frishman and Ronceray 2020; Nickl 2022), few come with rigorous results on the optimality and convergence properties of the estimators, in particular for spatially-inhomogeneous diffusion tensors.
Many numerical algorithms have been proposed to infer the drift and diffusion, accommodating various settings, including one-dimensional (Sura and Barsugli 2002; Papaspiliopoulos et al. 2012; Davis and Buffett 2022) and multidimensional SDEs (Pokern, Stuart, and Vanden-Eijnden 2009; Frishman and Ronceray 2020; Crommelin and Vanden-Eijnden 2011). Also, the statistical convergence rate has been extensively studied for both the one-dimensional case (Dalalyan 2005; Dalalyan and Reiß 2006; Pokern, Stuart, and van Zanten 2013; Aeckerle-Willems and Strauch 2018) and the multidimensional cases (Van der Meulen, Van Der Vaart, and Van Zanten 2006; Dalalyan and Reiß 2007; van Waaij and van Zanten 2016; Nickl and Söhl 2017; Nickl and Ray 2020; Oga and Koike 2021; Nickl 2022). For parametric estimators using a Fourier or wavelet basis, the statistical limits of estimating the spatially-inhomogeneous diffusion tensor have been rigorously characterized (Hoffmann 1997, 1999a). However, strategies based on such decompositions do not scale to high-dimensional problems, which has motivated the investigation of neural networks as a more flexible representation of the SDE coefficients (Han, Jentzen, and E 2018; Rotskoff, Mitchell, and Vanden-Eijnden 2022; Khoo, Lu, and Ying 2021; Li et al. 2021).
Thus, we consider the nonparametric neural network estimator (Suzuki 2018; Oono and Suzuki 2019; Schmidt-Hieber 2020) as our ansatz function class, which has achieved great empirical success in estimating SDE coefficients (Xie et al. 2007; Zhang et al. 2018; Han, Jentzen, and E 2018; Wang et al. 2022; Lin, Li, and Ren 2023). We aim to build statistical guarantees for such neural network-based estimators. The most closely related concurrent work is (Gu et al. 2023), where the authors provide a convergence guarantee for neural network estimation of the drift vector and the homogeneous diffusion tensor of an SDE by solving appropriate supervised learning tasks. However, their approach assumes that the data observed along the trajectory are independently and identically distributed according to the stationary distribution. Additionally, the generalization bound used in (Gu et al. 2023) is not a fast-rate generalization bound (Bartlett, Bousquet, and Mendelson 2005; Koltchinskii 2006), resulting in a suboptimal final guarantee. Therefore, we seek to bridge the gap between the i.i.d. setting and the non-i.i.d. ergodic setting using mixing conditions and to extend the algorithm and analysis to spatially-inhomogeneous diffusion estimation. We show that neural estimators can achieve the standard minimax optimal nonparametric function estimation rate even when the data are non-i.i.d.
1.1 Contribution
In this paper, we construct a fast-rate error bound for estimating a multi-dimensional spatially-inhomogeneous diffusion process based on non-i.i.d. ergodic data along a single trajectory. Our contributions are as follows:
•
We derive for neural network-based diffusion estimators a convergence rate that matches the minimax optimal nonparametric function estimation rate for the Hölder continuous function class (Tsybakov and Zaiats 2009);
•
Our analysis exploits the β-mixing condition to address the correlation present among observed data along the trajectory, making our result readily applicable to a wide range of ergodic diffusion processes;
•
We present numerical experiments, providing empirical support for our derived convergence rate and facilitating further applications of neural diffusion estimators in various contexts with theoretical assurance.
Our theoretical bound characterizes the interplay among the errors of nonparametric regression, numerical discretization, and ergodic approximation, and provides a general guideline for designing data-efficient, minimal-scale, and statistically optimal neural estimators for diffusion inference.
1.2 Related Works
Inference of diffusion processes from data
The problem of inferring the drift and diffusion coefficients of an SDE from data has been studied extensively in the literature. The setting with access to the whole continuous trajectory is studied by (Dalalyan and Reiß 2006, 2007; Strauch 2015, 2016; Nickl and Ray 2020; Rotskoff and Vanden-Eijnden 2019), in which the diffusion tensor can be exactly identified using quadratic variation arguments, and thus only drift inference is considered. Many works focus on the numerical recovery of both the drift vector and the diffusion tensor in the more realistic setting when only discrete observations are available, including methods based on local linearization (Ozaki 1992; Shoji and Ozaki 1998), martingale estimating functions (Bibby and Sørensen 1995), maximum likelihood estimation (Pedersen 1995; Aït-Sahalia 2002), and Markov chain Monte Carlo (Elerian, Chib, and Shephard 2001). We refer readers to (Sørensen 2004; López-Pérez, Febrero-Bande, and González-Manteiga 2021) for an overview of parametric approaches. A spectral method that estimates the eigenpairs of the Markov semigroup operator is proposed in (Crommelin and Vanden-Eijnden 2011), and a nonparametric Bayesian inference scheme based on the finite element method is studied in (Papaspiliopoulos et al. 2012). As for the statistical convergence rates of drift and diffusion inference, a pioneering line of work is (Hoffmann 1997, 1999a, 1999b), where the minimax convergence rate for one-dimensional diffusion processes is derived over Besov spaces and matched by adaptive wavelet estimators. Alternative analyses mainly follow a Bayesian methodology, with notable results by (Nickl and Söhl 2017; Nickl and Ray 2020; Nickl 2022) in both high- and low-frequency observation regimes.
Solving high-dimensional PDEs with deep neural networks
The curse of dimensionality has stymied efforts to solve high-dimensional partial differential equations (PDEs) numerically. However, deep learning has demonstrated remarkable flexibility and adaptivity in approximating high-dimensional functions, which has led to significant advances in computer vision and natural language processing. Recently, a series of works (Han, Jentzen, and E 2018; Yu et al. 2018; Karniadakis et al. 2021; Khoo, Lu, and Ying 2021; Long et al. 2018; Zang et al. 2020; Kovachki et al. 2021; Lu, Jin, and Karniadakis 2019; Li et al. 2022; Rotskoff, Mitchell, and Vanden-Eijnden 2022) has explored solving PDEs with deep neural networks, achieving impressive results for a diverse collection of tasks. These approaches rely on representing PDE solutions with neural networks, and the various schemes propose different loss functions to obtain a solution. For instance, (Han, Jentzen, and E 2018) uses the Feynman-Kac formulation to convert PDE solving into a stochastic control problem, (Karniadakis et al. 2021) solves the PDE by minimizing the strong-form residual, while (Zang et al. 2020) solves weak formulations of PDEs with an adversarial approach.
Theoretical guarantees for neural network-based PDE solvers
Statistical learning theory offers a powerful toolkit to prove theoretical convergence results for PDE solvers based on deep learning. For example, (Weinan and Wojtowytsch 2022; Chen, Lu, and Lu 2021; Marwah, Lipton, and Risteski 2021; Marwah et al. 2022) investigated the regularity of PDEs approximated by neural networks, and (Nickl, van de Geer, and Wang 2020; Duan et al. 2021; Lu et al. 2021; Hütter and Rigollet 2021; Lu, Blanchet, and Ying 2022) consider the statistical convergence rate of various machine learning-based PDE solvers. However, most of these optimality results are based on concentration results that assume the sampled data are independent and identically distributed. This i.i.d. assumption is often violated in financial and biophysical applications, for example, time series prediction, complex system analysis, and signal processing. Among the many possible relaxations of the i.i.d. setting, the scenario where data are drawn from a strongly mixing process has been widely adopted (Bradley 2005). Inspired by the first work of this kind (Yu 1994), many authors have exploited a range of mixing concepts such as α-mixing (Zhang 2004; Steinwart and Christmann 2009), β- and φ-mixing (Mohri and Rostamizadeh 2008, 2010; Kuznetsov and Mohri 2017; Ziemann, Sandberg, and Matni 2022), and C-mixing (Hang and Steinwart 2017). We refer readers to (Hang et al. 2016) for an overview of this line of research.
Notations
We will use ≲ and ≍ to denote inequality and equality up to a constant factor, respectively.
Definition 1 (Hölder space).
We denote the Hölder space of order with constant by , i.e.
2 Problem Setting
Suppose we have access to a sequence of discrete position snapshots observed at a fixed time step along a single trajectory, where the underlying process is the solution to the following Itô stochastic differential equation:
dX_t = b(X_t) dt + σ(X_t) dW_t,     (1)
where b denotes the drift coefficient, σ the diffusion coefficient, and W_t is a d-dimensional Wiener process. We refer to the vector field b as the drift vector and define the diffusion tensor as D = σσ^⊤. As noted in (Lau and Lubensky 2007), any interpolation between the Itô convention and other conventions for stochastic calculus can be transformed into the Itô convention by adding a term to the drift vector; therefore, we work with the Itô convention throughout this paper. (The Itô convention and its alternatives represent different ways to extend the Riemann integral to stochastic processes. Roughly speaking, Itô uses the left endpoint of each interval to evaluate the integrand in the Riemann sum. We adopt the Itô convention because of the martingale properties it introduces, which are mathematically convenient for our statements and proofs.)
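For intuition, the following is a minimal sketch of how a trajectory of such an SDE can be simulated with the Euler-Maruyama scheme; the function names, the coefficient callables b and sigma, and all numerical choices are illustrative assumptions rather than the implementation used in this paper.

```python
import numpy as np

def euler_maruyama(b, sigma, x0, dt, n_steps, rng=None):
    """Simulate dX_t = b(X_t) dt + sigma(X_t) dW_t with the Euler-Maruyama scheme.

    The coefficients are evaluated at the left endpoint of each interval,
    matching the Ito convention discussed above.
    """
    rng = np.random.default_rng() if rng is None else rng
    x0 = np.asarray(x0, dtype=float)
    X = np.empty((n_steps + 1, x0.size))
    X[0] = x0
    for i in range(n_steps):
        dW = rng.normal(scale=np.sqrt(dt), size=x0.size)  # Wiener increment over dt
        X[i + 1] = X[i] + b(X[i]) * dt + sigma(X[i]) @ dW
    return X
```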
Remark 1.
Our focus on inhomogeneity in the space variable stems from the fact that when the SDE coefficients are time-dependent, it becomes very challenging to infer them from a single observational trajectory, i.e., with only one observation at each time point; we leave this time-dependent case, which requires multiple trajectories, for future work.
For simplicity, we will work with periodic boundaries, i.e., on the d-dimensional torus. Points on the torus are represented by their equivalence classes under the canonical map, and we work with representatives in the fundamental domain. The Borel σ-algebra on the torus coincides with the sub-σ-algebra of 1-periodic Borel sets of R^d. We refer readers to (Papanicolau, Bensoussan, and Lions 1978) for further mathematical details of homogenization on tori. We further assume the drift and diffusion coefficients in (1) satisfy the following regularity assumptions:
Assumption 2a (Periodicity).
b, σ, and D are 1-periodic in all variables.
Remark 2.
This assumption is primarily for simplicity, and has been adopted in many previous works on the statistical inference of SDE coefficients, e.g. (Nickl and Ray 2020). This allows us to bypass the technicalities concerning boundary conditions, which might detract from our main contributions.
Assumption 2b (Hölder-smoothness).
Each entry of b and D belongs to the Hölder space of Definition 1 for some fixed order and constant.
Assumption 2c (Uniform ellipticity).
The diffusion tensor D(x) is symmetric positive semi-definite for every x, and there exists a constant λ > 0 such that D is uniformly elliptic, i.e., v^⊤ D(x) v ≥ λ|v|² holds for any vector v, uniformly in x.
Remark 3.
This uniform ellipticity is commonly assumed in analyses of the Fokker-Planck equation. It guarantees that the Fokker-Planck equation has a unique strong solution with regularity properties that are essential for the analysis of the asymptotic behavior and numerical approximation of the solution. We refer readers to (Stroock and Varadhan 1997; Bogachev et al. 2022) for more detailed discussions.
Since b and σ are 1-periodic, the process in (1) can thus be viewed as a process on the torus. Denoting the transition kernel of the process by , the transition kernel of the corresponding process on the torus satisfies:
(2)
where e_i is the i-th standard basis vector in R^d. When no confusion arises, we will use the same symbol to denote a point's representative in the fundamental domain in what follows.
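In practice, working in the fundamental domain simply amounts to wrapping coordinates modulo the unit period; a one-line sketch (the function name is ours):

```python
import numpy as np

def to_fundamental_domain(x):
    """Apply the canonical map coordinatewise: represent each point of the torus
    by its representative in the unit cell [0, 1)^d."""
    return np.mod(x, 1.0)
```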
3 Spatially Inhomogeneous Diffusion Estimator
In this section, we aim to build neural estimators of both the drift and diffusion coefficients based on a sequence of discrete observations along a single trajectory of the SDE (1). A straightforward neural drift estimator allows us to subsequently construct a simple neural estimator of the diffusion tensor. In what follows, we introduce and prove the convergence of these neural estimators. Without loss of generality, we assume and , and denote the σ-algebra generated by all possible sequences as .
3.1 Neural Estimators
We define the objective functions for drift and diffusion estimation, respectively, by noticing that the ground-truth drift vector can be represented as the minimizer of the following objective function in the limit where the time step tends to zero and the time horizon tends to infinity:
(3)
With the ground-truth drift vector in hand, the ground-truth diffusion tensor can likewise be represented as the minimizer of the following objective function in the same limit:
(4)
where ‖·‖_F denotes the Frobenius norm of a matrix.
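In the standard finite-difference construction (with Δ denoting the time step and T the horizon in our notation), objectives of this type take the form
$$\ell_b^{\Delta,T}(f)=\frac{1}{T}\int_0^T\Bigl|f(X_t)-\frac{X_{t+\Delta}-X_t}{\Delta}\Bigr|^2\,\mathrm{d}t,\qquad \ell_D^{\Delta,T}(g)=\frac{1}{T}\int_0^T\Bigl\|g(X_t)-\frac{\bigl(X_{t+\Delta}-X_t-b(X_t)\Delta\bigr)\bigl(X_{t+\Delta}-X_t-b(X_t)\Delta\bigr)^{\top}}{\Delta}\Bigr\|_F^2\,\mathrm{d}t,$$
whose minimizers converge to b and D = σσ^⊤, respectively, as Δ → 0 and T → ∞ by Itô calculus and ergodicity; this is a plausible reconstruction of (3) and (4), and the paper's exact normalization may differ.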
Based on the discussion in the last section, we only estimate the values of the drift and diffusion in the fundamental domain and then extend them to the whole space by periodicity. Therefore, using our data as quadrature points, we approximate the objective function for drift estimation (3) as:
(5)
and the objective function for diffusion estimation (4) as
(6)
We will refer to these as the estimated empirical losses for drift and diffusion estimation, respectively.
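To illustrate how such estimated empirical losses can be evaluated from discrete snapshots, here is a minimal NumPy sketch; the function names, array layout, and normalization are our assumptions and not the paper's code.

```python
import numpy as np

def empirical_drift_loss(f, X, dt):
    """Quadrature approximation of the drift objective from snapshots X_0, ..., X_N.

    X : array of shape (N + 1, d); f : callable mapping (N, d) positions to (N, d) drifts.
    """
    increments = (X[1:] - X[:-1]) / dt                 # finite-difference drift signal
    return np.mean(np.sum((f(X[:-1]) - increments) ** 2, axis=1))

def empirical_diffusion_loss(g, b_hat, X, dt):
    """Quadrature approximation of the diffusion objective, using a fitted drift b_hat.

    g : callable mapping (N, d) positions to (N, d, d) diffusion tensors.
    """
    incr = X[1:] - X[:-1] - b_hat(X[:-1]) * dt          # drift-corrected increments
    outer = np.einsum('ni,nj->nij', incr, incr) / dt    # per-step diffusion signal
    return np.mean(np.sum((g(X[:-1]) - outer) ** 2, axis=(1, 2)))  # squared Frobenius norm
```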
We then parametrize the drift vector and the diffusion tensor within a hypothesis function class and solve for the estimators by optimizing the corresponding estimated empirical loss, as in Algorithm 1. Following foundational works including (Oono and Suzuki 2019; Schmidt-Hieber 2020; Chen et al. 2022), we adopt sparse neural networks as our hypothesis function class, defined as follows. A neural network with a given depth and width vector takes the following form:
(7)
where the weight matrices and shift (bias) vectors are the trainable parameters, and the activation is the element-wise ReLU function. We also bound all parameters in the neural network by unity, as in (Schmidt-Hieber 2020; Suzuki 2018).
Definition 3 (Sparse neural network).
Let the sparse network class be the set of ReLU-activated neural networks of the given depth and width that have at most a prescribed number of non-zero parameter entries, function values uniformly bounded, and all parameters bounded by 1, i.e.
where ‖·‖_0 denotes the number of non-zero entries of a matrix (or a vector) and ‖·‖_∞ the maximum absolute value of its entries.
Since we are using the neural network for nonparametric estimation in , we will assume and in the following discussion.
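For concreteness, a bare-bones PyTorch version of such a ReLU network is sketched below; enforcing the sparsity budget and the unit bound on the parameters (e.g., by magnitude pruning and weight clipping during training) is only indicated schematically, and all names and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class ReLUNetwork(nn.Module):
    """A constant-width ReLU network in the spirit of (7)."""
    def __init__(self, dim_in, dim_out, width=1000, depth=3, bound=1.0):
        super().__init__()
        dims = [dim_in] + [width] * (depth - 1) + [dim_out]
        self.layers = nn.ModuleList(nn.Linear(dims[i], dims[i + 1]) for i in range(depth))
        self.bound = bound  # uniform bound on the network output

    def forward(self, x):
        for layer in self.layers[:-1]:
            x = torch.relu(layer(x))
        return torch.clamp(self.layers[-1](x), -self.bound, self.bound)

    def num_nonzero(self):
        # sparsity level: total number of non-zero weights and biases
        return sum(int((p != 0).sum()) for p in self.parameters())
```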
3.2 Ergodicity
Optimal convergence rates of neural network-based PDE solvers, as showcased in (Nickl, van de Geer, and Wang 2020; Lu et al. 2021; Gu et al. 2023), are typically established under the assumption of data independence. However, the presence of time correlations in observational data from a single trajectory significantly complicates the task of establishing an upper bound on the convergence of the neural estimators obtained by Algorithm 1. In this context, we fully exploit the ergodicity of the diffusion process, bound the ergodic approximation error by the β-mixing coefficient, and show that the exponential ergodicity condition, which is naturally satisfied by a wide range of diffusion processes, is sufficient for the fast-rate convergence of the proposed neural estimators.
We first introduce the definition of exponential ergodicity:
Definition 4 (Exponential ergodicity (Down, Meyn, and Tweedie 1995)).
A diffusion process on a given domain is uniformly exponentially ergodic if there exists a unique stationary distribution such that, for any initial condition,
where .
As a direct consequence of (Papanicolau, Bensoussan, and Lions 1978, Theorem 3.2) and the compactness of the torus , we have the following result:
Proposition 1 (Exponential ergodicity of the process on the torus).
See (Kulik 2017) for further discussions and required regularities for this property beyond the torus setting.
The ergodicity of stochastic processes is closely related to the notion of mixing, which quantifies the “asymptotic independence” of random sequences. One of the most widely used mixing conditions for stochastic processes is the following β-mixing condition:
Definition 5 (β-mixing condition (Kuznetsov and Mohri 2017)).
The β-mixing coefficient of a stochastic process with respect to a probability measure is defined as
where is the σ-algebra generated by , and is the law of . In particular, when the coefficient decays exponentially in the time gap for some constants, we say the process is geometrically β-mixing with respect to the measure.
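One standard formulation of the β-mixing coefficient of a process (X_t) with respect to a measure ν, consistent with (Kuznetsov and Mohri 2017), is
$$\beta(k)\;=\;\sup_{t\ge 0}\,\mathbb{E}\Bigl[\,\bigl\|\,\mathbb{P}\bigl(X_{t+k}\in\cdot\mid\mathcal{F}_t\bigr)-\nu\,\bigr\|_{\mathrm{TV}}\Bigr],$$
where F_t is the σ-algebra generated by the process up to time t and ‖·‖_TV is the total variation norm; geometric β-mixing then means β(k) ≤ C_0 e^{-c_0 k} for generic constants C_0, c_0 > 0. This is our paraphrase of the standard definition, and the paper's exact normalization may differ.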
By taking the reference measure to be the stationary distribution in the definition above, the following proposition holds:
Proposition 2 (β-mixing condition of the process on the torus).
i.e., the process is geometrically β-mixing with respect to its stationary distribution.
In the following, we will also denote by the same symbol the pushforward of the invariant measure under the inverse of the canonical map to the fundamental domain.
3.3 Convergence Guarantee
In this section, we describe the main upper bound for the neural estimators in Algorithm 1. We present theoretical guarantees for drift and diffusion estimation in Theorems 3 and 4, respectively. Our main result shows that estimating the drift and diffusion tensor can achieve the standard minimax optimal nonparametric function estimation rate, even with non-i.i.d. data.
By the ergodic theorem (Kulik 2017, Theorem 5.3.3) under the exponential ergodicity condition and the properties of the Itô process, the bias part of the objective functions for drift and diffusion estimation defined in (3) and (4) converges to
(8)
as the time step tends to zero and the time horizon tends to infinity; we will refer to these limits as the population losses for drift and diffusion estimation, respectively. Our convergence guarantees are thus built on these population losses.
Theorem 3 (Upper bound for drift estimation in ).
Suppose the drift vector , and the hypothesis class with
Then with high probability the minimizer obtained by Algorithm 1 satisfies
Theorem 4 (Upper bound for diffusion estimation in ).
Suppose the diffusion tensor , and the hypothesis class with
Then with high probability the minimizer obtained by Algorithm 1 satisfies
(9)
Remark 4.
In this remark, we explain the meaning of each term in the convergence rate (9):
•
The first term matches the standard minimax optimal rate up to an extra logarithmic factor. This is characteristic of nonparametric regression for Hölder continuous functions with noisy observations (Tsybakov and Zaiats 2009). It differs from drift estimation (Theorem 3), in which the relevant sample size is effectively the time horizon rather than the number of observations, together with an additional term discussed below;
•
The second term represents a bias that arises from the finite resolution of the observations. Specifically, it encapsulates the error incurred when approximating the objective function by the estimated empirical loss via numerical quadrature and finite-difference computations;
•
The third term quantifies the error of approximating the population loss by the objective function via the ergodic theorem up to a finite time horizon. It essentially reflects the portion of the domain that the trajectory has not yet traversed. Refs. (Hoffmann 1997, 1999a) only provide guarantees in a regime where this ergodic-approximation error does not arise, and thus this term is not included there.
3.4 Proof Sketch
In this section, we omit the dependency of the losses on the data for notational simplicity.
To obtain a unified proof strategy for both drift and diffusion estimation, it is useful to think of our neural estimator as a function regressor with imperfect supervision signals. We consider an estimator of an arbitrary ground-truth function obtained by minimizing the estimated empirical loss
(10)
where the supervision signal is corrupted by noise given by an adapted continuous semimartingale. Following the Doob decomposition, we write the noise as the sum of a continuous finite-variation process, which is deterministic conditioned on the observed data, given by
and a part that forms a local martingale, given by
The population loss can also be similarly defined as in (8). Additionally, we define the empirical loss for the estimator as
In our proof, we first show that, as long as the following two conditions hold for the noise, the minimax optimal nonparametric function estimation rate is achieved:
Assumption 6.
For any , the continuous finite variation process satisfies
Assumption 7.
For some , the local martingale satisfies
where denotes the quadratic variation.
Remark 5.
Based on the noise decomposition, the finite-variation term can be intuitively understood as the bias of the data; this bias is caused by the numerical scheme employed to compute the supervision signal. The martingale term, on the other hand, represents noise added to the data, which can be considered analogous to the i.i.d. noise in common nonparametric estimation settings. Assumption 6 essentially implies that the supervision is consistent, in that its expectation converges to the ground truth as the time step vanishes. Meanwhile, Assumption 7 requires that the variance of the noise present in the data be at most of the stated order.
To overcome the correlation of the observed data, we adopt the following sub-sampling technique as in (Yu 1994; Mohri and Rostamizadeh 2008; Hang and Steinwart 2017): for a sufficiently large block size (here we assume the number of samples is divisible by the block size without loss of generality), we split the original correlated samples into interleaved sub-sequences; a concrete sketch of this splitting is given after Lemma 5 below. The main idea of this technique is that under fast β-mixing conditions, each sub-sequence can be treated approximately as i.i.d. samples from the stationary distribution, to which classical generalization results apply, with an error that can be controlled by the mixing coefficient via the following lemma:
Lemma 5 ((Kuznetsov and Mohri 2017, Proposition 2)).
Let be any function on with for . Then for any , we have
where the second expectation is taken over the sub-σ-algebra of generated by the sub-sequence and .
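As promised above, the splitting into sub-sequences can be sketched in a few lines; the function name and the argument mu (the number of interleaved blocks) are our notation.

```python
def split_into_subsequences(samples, mu):
    """Split correlated observations X_0, X_1, ..., X_{N-1} into mu interleaved
    sub-sequences; within each sub-sequence consecutive samples are mu steps
    apart, so under geometric beta-mixing they are approximately independent."""
    assert len(samples) % mu == 0, "N is assumed divisible by mu"
    return [samples[j::mu] for j in range(mu)]
```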
Based on Lemma 5, we derive the following fast rate generalization bound via local Rademacher complexity arguments (Bartlett, Bousquet, and Mendelson 2005; Koltchinskii 2006). The proof is shown in Appendix A.1.
Theorem 6.
Let . Suppose the localized Rademacher complexity satisfies
where ψ is a sub-root function (i.e., a non-negative, non-decreasing function such that ψ(r)/√r is non-increasing for r > 0) and the quantity on the right is the Rademacher complexity of a function class with respect to i.i.d. samples from the stationary distribution, i.e.
(11)
Let the fixed point be the unique solution of the fixed-point equation ψ(r) = r. Then, for any confidence level and block size, with high probability we have, for any function in the hypothesis class,
(12)
where .
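For reference, the localized Rademacher complexity appearing in (11) is standardly defined, for i.i.d. samples x_1, ..., x_m from the stationary distribution π and i.i.d. Rademacher signs ε_i, as
$$\mathcal{R}_m\bigl(\{f\in\mathcal{F}:\mathbb{E}_{\pi}f^2\le r\}\bigr)\;=\;\mathbb{E}_{x,\varepsilon}\Bigl[\sup_{f\in\mathcal{F}:\,\mathbb{E}_{\pi}f^2\le r}\frac{1}{m}\sum_{i=1}^{m}\varepsilon_i f(x_i)\Bigr];$$
the symbols m, π, and ε_i here are ours, and the paper's exact normalization may differ.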
Bias and noise in the objective function inevitably affect the optimization. We therefore derive an oracle-type inequality for the expectation of the population loss over the data, which is proved in Appendix A.2. The main technique is a uniform martingale concentration inequality (cf. Lemma 12).
Theorem 7.
In particular, when we choose the hypothesis class to be the sparse neural network class and combine Theorem 6 and Theorem 7, we obtain the following theorem, whose proof is given in Appendix A.3:
Theorem 8.
With Theorem 8 in hand, the detailed proofs of Theorems 3 and 4 are given in Appendices B.1 and B.2, respectively.
[Figure 1: Test errors of the neural diffusion estimator under the two sampling schemes (mean and confidence intervals over five runs), with red reference lines indicating the expected convergence rates.]
4 Experiments
In this section, we present numerical results on a two-dimensional example to illustrate the agreement between our theoretical convergence rates and the empirical behavior of our proposed neural diffusion estimator. Consider the following SDE on the two-dimensional torus:
(14)
where
i.e., the drift and diffusion take the explicit forms above, where I denotes the identity matrix. It is then straightforward to verify that this diffusion process satisfies Assumptions 2a, 2b, and 2c. Our goal is to estimate the values of the coefficients within the fundamental domain. We employ Algorithm 1 to estimate both the drift and the diffusion with separate neural networks and treat the two inference tasks entirely independently. One may also show that the stationary distribution of this diffusion process is the Lebesgue measure on the two-dimensional torus, which makes evaluating errors easier and more precise.
To impose the periodic boundary condition, we introduce an explicit regularization term into our training loss,
which is approximated empirically with 1000 pairs of random samples. The final training loss is thus the estimated empirical loss plus this regularization term weighted by a hyperparameter, where the empirical loss can be either the drift loss or the diffusion loss.
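One possible Monte Carlo implementation of such a periodicity regularizer is sketched below; this is only a reasonable choice with illustrative names and sample sizes, and it may differ from the regularizer actually used.

```python
import torch

def periodicity_penalty(net, n_pairs=1000, d=2):
    """Penalize violations of 1-periodicity: compare network outputs at random
    points x and at x shifted by one period along a randomly chosen coordinate."""
    x = torch.rand(n_pairs, d)                                        # samples in [0, 1)^d
    axis = torch.randint(0, d, (n_pairs,))
    shift = torch.nn.functional.one_hot(axis, num_classes=d).float()  # unit period shift
    return ((net(x) - net(x + shift)) ** 2).mean()
```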
We first generate data using the Euler-Maruyama method with a fine time step over a long horizon, and then sub-sample data at varying time steps and time horizons for each experiment instance from this common trajectory. We use a ResNet as our neural network architecture, with two residual blocks, each containing a fully-connected layer with a hidden dimension of 1000. Test data are generated by randomly selecting samples from another sufficiently long trajectory and are shared across all experiment instances. The training process is executed on one Tesla V100 GPU.
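A minimal training loop putting these pieces together might look as follows; the optimizer, learning rate, epoch count, and regularization weight lam are illustrative assumptions, empirical_loss stands for a differentiable (e.g., PyTorch) version of the estimated empirical loss of Section 3.1, and periodicity_penalty refers to the sketch above.

```python
import torch

def train_estimator(net, empirical_loss, X, dt, lam=1.0, n_epochs=1000, lr=1e-3):
    """Fit a neural drift or diffusion estimator by minimizing the estimated
    empirical loss plus the periodic-boundary regularizer (sketch of Algorithm 1)."""
    X = torch.as_tensor(X, dtype=torch.float32)
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(n_epochs):
        opt.zero_grad()
        loss = empirical_loss(net, X, dt) + lam * periodicity_penalty(net)
        loss.backward()
        opt.step()
    return net
```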
According to our theoretical result (Theorem 4), the convergence rate of this implementation should be approximately of the stated order up to higher-order terms. We thus consider two schemes in our experiment: the first involves a fixed time step, and the other a fixed time horizon, each with its corresponding expected rate. Each of the aforementioned instances is carried out five times. Figure 1 presents the mean values along with their corresponding confidence intervals from these runs, together with red reference lines indicating the expected convergence rates. Both schemes roughly exhibit the predicted decay, aligning with our theoretical expectations. As depicted in Figure 1(a), the decay of the test error slows once the horizon exceeds a certain threshold. This can be attributed to the fact that when the sample size and the time horizon are sufficiently large, the bias term arising from the discretization becomes the dominant contribution to the error.
5 Conclusion
The ubiquity of correlated data in processes modeled with spatially-inhomogeneous diffusions has created substantial barriers to analysis. In this paper, we construct and analyze a neural network-based numerical algorithm for estimating multidimensional spatially-inhomogeneous diffusion processes based on discretely observed data obtained from a single trajectory. Utilizing β-mixing conditions and local Rademacher complexity arguments, we establish the convergence rate for our neural diffusion estimator. Our upper bound recovers the minimax optimal nonparametric function estimation rate of the common i.i.d. setting, even with correlated data. We expect our proof techniques to serve as a model for general exponentially ergodic diffusion processes beyond the toroidal setting considered here. Numerical experiments validate our theoretical findings and demonstrate the potential of applying neural diffusion estimators across various contexts with provable accuracy guarantees. Extending our results to typical biophysical settings, e.g., compact domains with reflective boundaries and motion blur due to measurement error, could help establish more rigorous error estimates for physical inference problems.
References
- Aeckerle-Willems and Strauch (2018) Aeckerle-Willems, C.; and Strauch, C. 2018. Sup-norm adaptive simultaneous drift estimation for ergodic diffusions. arXiv:1808.10660.
- Aït-Sahalia (2002) Aït-Sahalia, Y. 2002. Maximum likelihood estimation of discretely sampled diffusions: a closed-form approximation approach. Econometrica, 70(1): 223–262.
- Bartlett, Bousquet, and Mendelson (2005) Bartlett, P. L.; Bousquet, O.; and Mendelson, S. 2005. Local rademacher complexities. The Annals of Statistics, 33(4): 1497 – 1537.
- Bibby and Sørensen (1995) Bibby, B. M.; and Sørensen, M. 1995. Martingale estimation functions for discretely observed diffusion processes. Bernoulli, 17–39.
- Bogachev et al. (2022) Bogachev, V. I.; Krylov, N. V.; Röckner, M.; and Shaposhnikov, S. V. 2022. Fokker–Planck–Kolmogorov Equations, volume 207. American Mathematical Society.
- Boucheron, Lugosi, and Massart (2013) Boucheron, S.; Lugosi, G.; and Massart, P. 2013. Concentration inequalities: A nonasymptotic theory of independence. Oxford university press.
- Bradley (2005) Bradley, R. C. 2005. Basic properties of strong mixing conditions. A survey and some open questions. Probability Surveys, 2(none): 107 – 144.
- Chen et al. (2022) Chen, M.; Jiang, H.; Liao, W.; and Zhao, T. 2022. Nonparametric regression on low-dimensional manifolds using deep ReLU networks: Function approximation and statistical recovery. Information and Inference: A Journal of the IMA, 11(4): 1203–1253.
- Chen, Lu, and Lu (2021) Chen, Z.; Lu, J.; and Lu, Y. 2021. On the representation of solutions to elliptic pdes in barron spaces. Advances in neural information processing systems, 34: 6454–6465.
- Crommelin and Vanden-Eijnden (2011) Crommelin, D.; and Vanden-Eijnden, E. 2011. Diffusion estimation from multiscale data by operator eigenpairs. Multiscale Modeling & Simulation, 9(4): 1588–1623.
- Dalalyan (2005) Dalalyan, A. 2005. Sharp adaptive estimation of the drift function for ergodic diffusions. The Annals of Statistics, 33(6): 2507 – 2528.
- Dalalyan and Reiß (2006) Dalalyan, A.; and Reiß, M. 2006. Asymptotic statistical equivalence for scalar ergodic diffusions. Probability theory and related fields, 134: 248–282.
- Dalalyan and Reiß (2007) Dalalyan, A.; and Reiß, M. 2007. Asymptotic statistical equivalence for ergodic diffusions: the multidimensional case. Probability theory and related fields, 137: 25–47.
- Davis and Buffett (2022) Davis, W.; and Buffett, B. 2022. Estimation of drift and diffusion functions from unevenly sampled time-series data. Physical Review E, 106(1): 014140.
- de la Peña, Klass, and Leung Lai (2004) de la Peña, V. H.; Klass, M. J.; and Leung Lai, T. 2004. Self-normalized processes: exponential inequalities, moment bounds and iterated logarithm laws. The Annals of Probability, 32(3): 1902 – 1933.
- Down, Meyn, and Tweedie (1995) Down, D.; Meyn, S. P.; and Tweedie, R. L. 1995. Exponential and uniform ergodicity of Markov processes. The Annals of Probability, 23(4): 1671–1691.
- Duan et al. (2021) Duan, C.; Jiao, Y.; Lai, Y.; Lu, X.; and Yang, Z. 2021. Convergence rate analysis for deep ritz method. arXiv:2103.13330.
- Elerian, Chib, and Shephard (2001) Elerian, O.; Chib, S.; and Shephard, N. 2001. Likelihood inference for discretely observed nonlinear diffusions. Econometrica, 69(4): 959–993.
- Frishman and Ronceray (2020) Frishman, A.; and Ronceray, P. 2020. Learning force fields from stochastic trajectories. Physical Review X, 10(2): 021009.
- Gu et al. (2023) Gu, Y.; Harlim, J.; Liang, S.; and Yang, H. 2023. Stationary density estimation of itô diffusions using deep learning. SIAM Journal on Numerical Analysis, 61(1): 45–82.
- Han, Jentzen, and E (2018) Han, J.; Jentzen, A.; and E, W. 2018. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34): 8505–8510.
- Hang et al. (2016) Hang, H.; Feng, Y.; Steinwart, I.; and Suykens, J. A. 2016. Learning theory estimates with observations from general stationary stochastic processes. Neural computation, 28(12): 2853–2889.
- Hang and Steinwart (2017) Hang, H.; and Steinwart, I. 2017. A Bernstein-type inequality for some mixing processes and dynamical systems with an application to learning. The Annals of Statistics, 45(2): 708 – 743.
- Hoffmann (1997) Hoffmann, M. 1997. Minimax estimation of the diffusion coefficient through irregular samplings. Statistics & probability letters, 32(1): 11–24.
- Hoffmann (1999a) Hoffmann, M. 1999a. Adaptive estimation in diffusion processes. Stochastic processes and their Applications, 79(1): 135–163.
- Hoffmann (1999b) Hoffmann, M. 1999b. Lp estimation of the diffusion coefficient. Bernoulli, 447–481.
- Hütter and Rigollet (2021) Hütter, J.-C.; and Rigollet, P. 2021. Minimax estimation of smooth optimal transport maps. The Annals of Statistics, 49(2): 1166 – 1194.
- Karniadakis et al. (2021) Karniadakis, G. E.; Kevrekidis, I. G.; Lu, L.; Perdikaris, P.; Wang, S.; and Yang, L. 2021. Physics-informed machine learning. Nature Reviews Physics, 3(6): 422–440.
- Khoo, Lu, and Ying (2021) Khoo, Y.; Lu, J.; and Ying, L. 2021. Solving parametric PDE problems with artificial neural networks. European Journal of Applied Mathematics, 32(3): 421–435.
- Koltchinskii (2006) Koltchinskii, V. 2006. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34(6): 2593 – 2656.
- Kovachki et al. (2021) Kovachki, N.; Li, Z.; Liu, B.; Azizzadenesheli, K.; Bhattacharya, K.; Stuart, A.; and Anandkumar, A. 2021. Neural operator: Learning maps between function spaces. arXiv:2108.08481.
- Kulik (2017) Kulik, A. 2017. Ergodic behavior of Markov processes. In Ergodic Behavior of Markov Processes. de Gruyter.
- Kuznetsov and Mohri (2017) Kuznetsov, V.; and Mohri, M. 2017. Generalization bounds for non-stationary mixing processes. Machine Learning, 106(1): 93–117.
- Lau and Lubensky (2007) Lau, A. W.; and Lubensky, T. C. 2007. State-dependent diffusion: Thermodynamic consistency and its path integral formulation. Physical Review E, 76(1): 011123.
- Li et al. (2022) Li, H.; Khoo, Y.; Ren, Y.; and Ying, L. 2022. A semigroup method for high dimensional committor functions based on neural network. In Mathematical and Scientific Machine Learning, 598–618. PMLR.
- Li et al. (2021) Li, Z.; Kovachki, N.; Azizzadenesheli, K.; Liu, B.; Bhattacharya, K.; Stuart, A.; and Anandkumar, A. 2021. Markov neural operators for learning chaotic systems. arXiv:2106.06898.
- Lin, Li, and Ren (2023) Lin, B.; Li, Q.; and Ren, W. 2023. Computing high-dimensional invariant distributions from noisy data. Journal of Computational Physics, 474: 111783.
- Long et al. (2018) Long, Z.; Lu, Y.; Ma, X.; and Dong, B. 2018. Pde-net: Learning pdes from data. In International conference on machine learning, 3208–3216. PMLR.
- López-Pérez, Febrero-Bande, and González-Manteiga (2021) López-Pérez, A.; Febrero-Bande, M.; and González-Manteiga, W. 2021. Parametric estimation of diffusion processes: A review and comparative study. Mathematics, 9(8): 859.
- Lu, Jin, and Karniadakis (2019) Lu, L.; Jin, P.; and Karniadakis, G. E. 2019. Deeponet: Learning nonlinear operators for identifying differential equations based on the universal approximation theorem of operators. arXiv:1910.03193.
- Lu, Blanchet, and Ying (2022) Lu, Y.; Blanchet, J.; and Ying, L. 2022. Sobolev acceleration and statistical optimality for learning elliptic equations via gradient descent. arXiv:2205.07331.
- Lu et al. (2021) Lu, Y.; Chen, H.; Lu, J.; Ying, L.; and Blanchet, J. 2021. Machine learning for elliptic pdes: fast rate generalization bound, neural scaling law and minimax optimality. arXiv:2110.06897.
- Marwah, Lipton, and Risteski (2021) Marwah, T.; Lipton, Z.; and Risteski, A. 2021. Parametric complexity bounds for approximating PDEs with neural networks. Advances in Neural Information Processing Systems, 34: 15044–15055.
- Marwah et al. (2022) Marwah, T.; Lipton, Z. C.; Lu, J.; and Risteski, A. 2022. Neural Network Approximations of PDEs Beyond Linearity: Representational Perspective. arXiv:2210.12101.
- Mohri and Rostamizadeh (2008) Mohri, M.; and Rostamizadeh, A. 2008. Rademacher complexity bounds for non-iid processes. Advances in Neural Information Processing Systems, 21.
- Mohri and Rostamizadeh (2010) Mohri, M.; and Rostamizadeh, A. 2010. Stability Bounds for Stationary φ-mixing and β-mixing Processes. Journal of Machine Learning Research, 11(2).
- Nickl (2022) Nickl, R. 2022. Inference for diffusions from low frequency measurements. arXiv:2210.13008.
- Nickl and Ray (2020) Nickl, R.; and Ray, K. 2020. Nonparametric statistical inference for drift vector fields of multi-dimensional diffusions. The Annals of Statistics, 48(3): 1383 – 1408.
- Nickl and Söhl (2017) Nickl, R.; and Söhl, J. 2017. Nonparametric Bayesian posterior contraction rates for discretely observed scalar diffusions. The Annals of Statistics, 45(4): 1664 – 1693.
- Nickl, van de Geer, and Wang (2020) Nickl, R.; van de Geer, S.; and Wang, S. 2020. Convergence rates for penalized least squares estimators in PDE constrained regression problems. SIAM/ASA Journal on Uncertainty Quantification, 8(1): 374–413.
- Oga and Koike (2021) Oga, A.; and Koike, Y. 2021. Drift estimation for a multi-dimensional diffusion process using deep neural networks. arXiv:2112.13332.
- Oono and Suzuki (2019) Oono, K.; and Suzuki, T. 2019. Approximation and non-parametric estimation of ResNet-type convolutional neural networks. In International conference on machine learning, 4922–4931. PMLR.
- Ozaki (1992) Ozaki, T. 1992. A bridge between nonlinear time series models and nonlinear stochastic dynamical systems: a local linearization approach. Statistica Sinica, 113–135.
- Papanicolau, Bensoussan, and Lions (1978) Papanicolau, G.; Bensoussan, A.; and Lions, J.-L. 1978. Asymptotic analysis for periodic structures. Elsevier.
- Papaspiliopoulos et al. (2012) Papaspiliopoulos, O.; Pokern, Y.; Roberts, G. O.; and Stuart, A. M. 2012. Nonparametric estimation of diffusions: a differential equations approach. Biometrika, 99(3): 511–531.
- Pedersen (1995) Pedersen, A. R. 1995. Consistency and asymptotic normality of an approximate maximum likelihood estimator for discretely observed diffusion processes. Bernoulli, 257–279.
- Pokern, Stuart, and van Zanten (2013) Pokern, Y.; Stuart, A. M.; and van Zanten, J. H. 2013. Posterior consistency via precision operators for Bayesian nonparametric drift estimation in SDEs. Stochastic Processes and their Applications, 123(2): 603–628.
- Pokern, Stuart, and Vanden-Eijnden (2009) Pokern, Y.; Stuart, A. M.; and Vanden-Eijnden, E. 2009. Remarks on drift estimation for diffusion processes. Multiscale modeling & simulation, 8(1): 69–95.
- Rotskoff, Mitchell, and Vanden-Eijnden (2022) Rotskoff, G. M.; Mitchell, A. R.; and Vanden-Eijnden, E. 2022. Active importance sampling for variational objectives dominated by rare events: Consequences for optimization and generalization. In Mathematical and Scientific Machine Learning, 757–780. PMLR.
- Rotskoff and Vanden-Eijnden (2019) Rotskoff, G. M.; and Vanden-Eijnden, E. 2019. Dynamical computation of the density of states and Bayes factors using nonequilibrium importance sampling. Physical review letters, 122(15): 150602.
- Schmidt-Hieber (2020) Schmidt-Hieber, J. 2020. Nonparametric regression using deep neural networks with ReLU activation function. The Annals of Statistics, 48(4): 1875 – 1897.
- Shoji and Ozaki (1998) Shoji, I.; and Ozaki, T. 1998. Estimation for nonlinear stochastic differential equations by a local linearization method. Stochastic Analysis and Applications, 16(4): 733–752.
- Sørensen (2004) Sørensen, H. 2004. Parametric inference for diffusion processes observed at discrete points in time: a survey. International Statistical Review, 72(3): 337–354.
- Steinwart and Christmann (2009) Steinwart, I.; and Christmann, A. 2009. Fast learning from non-iid observations. Advances in neural information processing systems, 22.
- Strauch (2015) Strauch, C. 2015. Sharp adaptive drift estimation for ergodic diffusions: the multivariate case. Stochastic Processes and their Applications, 125(7): 2562–2602.
- Strauch (2016) Strauch, C. 2016. Exact adaptive pointwise drift estimation for multidimensional ergodic diffusions. Probability Theory and Related Fields, 164: 361–400.
- Stroock and Varadhan (1997) Stroock, D. W.; and Varadhan, S. S. 1997. Multidimensional diffusion processes, volume 233. Springer Science & Business Media.
- Sura and Barsugli (2002) Sura, P.; and Barsugli, J. 2002. A note on estimating drift and diffusion parameters from timeseries. Physics Letters A, 305(5): 304–311.
- Suzuki (2018) Suzuki, T. 2018. Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality. arXiv:1810.08033.
- Tsybakov and Zaiats (2009) Tsybakov, A. B.; and Zaiats, V. G. 2009. Introduction to nonparametric estimation. Springer-Verlag New York.
- Van der Meulen, Van Der Vaart, and Van Zanten (2006) Van der Meulen, F.; Van Der Vaart, A. W.; and Van Zanten, J. 2006. Convergence rates of posterior distributions for Brownian semimartingale models. Bernoulli, 12(5): 863–888.
- van Waaij and van Zanten (2016) van Waaij, J.; and van Zanten, H. 2016. Gaussian process methods for one-dimensional diffusions: Optimal rates and adaptation. Electronic Journal of Statistics, 10(1): 628 – 645.
- Wang et al. (2022) Wang, X.; Feng, J.; Liu, Q.; Li, Y.; and Xu, Y. 2022. Neural network-based parameter estimation of stochastic differential equations driven by Lévy noise. Physica A: Statistical Mechanics and its Applications, 606: 128146.
- Weinan and Wojtowytsch (2022) Weinan, E.; and Wojtowytsch, S. 2022. Some observations on high-dimensional partial differential equations with Barron data. In Mathematical and Scientific Machine Learning, 253–269. PMLR.
- Xie et al. (2007) Xie, Z.; Kulasiri, D.; Samarasinghe, S.; and Rajanayaka, C. 2007. The estimation of parameters for stochastic differential equations using neural networks. Inverse Problems in Science and Engineering, 15(6): 629–641.
- Yu (1994) Yu, B. 1994. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 94–116.
- Yu et al. (2018) Yu, B.; et al. 2018. The deep Ritz method: a deep learning-based numerical algorithm for solving variational problems. Communications in Mathematics and Statistics, 6(1): 1–12.
- Zang et al. (2020) Zang, Y.; Bao, G.; Ye, X.; and Zhou, H. 2020. Weak adversarial networks for high-dimensional partial differential equations. Journal of Computational Physics, 411: 109409.
- Zhang et al. (2018) Zhang, L.; Han, J.; Wang, H.; Car, R.; and Weinan, E. 2018. Deep potential molecular dynamics: a scalable model with the accuracy of quantum mechanics. Physical review letters, 120(14): 143001.
- Zhang (2004) Zhang, T. 2004. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32(1): 56–85.
- Ziemann, Sandberg, and Matni (2022) Ziemann, I. M.; Sandberg, H.; and Matni, N. 2022. Single trajectory nonparametric learning of nonlinear dynamics. In conference on Learning Theory, 3333–3364. PMLR.
Appendix A Detailed Proofs of Section 3.4
Besides the notation defined in the main text, we will use the following notation to denote the empirical loss over a sample set other than the sequence ,
where the canonical map is applied to ensure that each sample lies in the fundamental domain. We will denote the squared loss by , and adopt the shorthand notation for any function throughout this section for simplicity, when the sample sequence is clear from the context.
A.1 Proof of Theorem 6
In the following, we use to denote a sample set consisting of i.i.d. samples drawn from the distribution , i.e. .
To achieve the fast-rate generalization bound, we will make use of the following local Rademacher complexity argument:
Lemma 9 ((Bartlett, Bousquet, and Mendelson 2005, Theorem 3.3)).
Let be a function class and for each , and hold. Suppose , where is a sub-root function with a fixed point . Then with probability at least , we have for any
Remark 6.
Compared with the (global) Rademacher complexity, the local Rademacher complexity argument enables adaptation to the local quadratic geometry of the loss function and recovers fast-rate generalization bounds (Bartlett, Bousquet, and Mendelson 2005). One of our contributions is to show how to adapt the local Rademacher argument to the non-i.i.d. data along the diffusion process and to provide near-optimal bounds for diffusion estimation.
For completeness, we also provide the proof of Lemma 5:
Proof of Lemma 5.
Notice that each sub-sequence consists of samples separated by a fixed time interval. Then the result follows directly from (Kuznetsov and Mohri 2017, Proposition 2) by setting the distribution and sub-sample as and , respectively. ∎
Remark 7.
Lemma 5 serves as a crucial tool for addressing the challenges presented by non-i.i.d. data points. To give an intuitive explanation, consider the way we segment the initial observations: we divide them into interleaved sub-sequences. In each of these sub-sequences, the gap between consecutive observations is larger than the gap in the initial sequence by a factor equal to the number of sub-sequences. By employing β-mixing and strategically choosing this number sufficiently large, we can effectively treat each sub-sequence as though it were drawn i.i.d. The accuracy of this approximation is then determined by the mixing coefficient, as quantified in Lemma 5. With this setup, we can apply local Rademacher arguments to the approximately i.i.d. sequences, paving the way for fast convergence rates.
Proof of Theorem 6.
First, it is straightforward to check by Assumption 2b that for any ,
(15)
and
(16)
Applying Lemma 9 for with , , and , yields that with probability at least , we have
Now we split the empirical loss into the mean of empirical losses , where are the sub-sequences of obtained by sub-sampling:
and for each summand, we apply Lemma 5 to the indicator function of the event and obtain
Then let , we have by the union bound
and the result follows. ∎
A.2 Proof of Theorem 7
In this section, the expectation is understood to be taken over the sub-σ-algebra generated by the trajectory from which the estimator is constructed by minimizing the estimated empirical loss.
We first consider the following decomposition of the expectation of the empirical loss :
Lemma 10.
For an arbitrary , we have
(17)
Proof.
By the definition of the estimator , we have for every possible sequence . Recall that
we have by taking expectation to both sides,
where we used the fact that the noise only depends on the ground truth and the sequence .
Since forms a martingale, we have
and thus
and the result follows. ∎
Following Remark 5 in the main text, we will refer to the second term in the RHS of (17) as the bias term and the last as the martingale noise term.
For the bias term, we have the following simple bound:
Lemma 11.
For any , we have
Proof.
where the inequality is by AM-GM and the last equality is due to Assumption 6. ∎
The following proof of the martingale noise term bound is inspired by the proof of (Schmidt-Hieber 2020, Lemma 4), where i.i.d. Gaussian noise is considered for nonparametric regression.
Definition 8 (ε-covering).
We define the ε-covering of a function class with respect to a given norm as a subset of the class with the smallest possible cardinality such that every function in the class is within distance ε of some element of the subset; the cardinality of this subset is the ε-covering number.
For a fixed ε, let be the ε-net of the function class with respect to the norm , where . Without loss of generality, we assume that .
We now introduce the following exponential inequality for local martingales.
Lemma 12.
Let be a continuous local martingale and be the corresponding quadratic variation of . If , then for any ,
Moreover, we then have
Proof.
Lemma 12 enables us to give the following concentration inequality for the martingale noise term.
Lemma 13.
Let such that , then for any and , we have
(18)
Proof.
By the martingale representation theorem, the continuous local martingale is an Itô integral, and thus the corresponding quadratic variation exists. As a result, the quadratic variation of is
Take , where , we have by Lemma 12,
Therefore, by Jensen’s inequality and Cauchy-Schwarz, we have
where the second to last inequality is due to the fact and thus the first expectation must be bounded by the summation over all possible .
Again by Cauchy-Schwarz,
Finally, let ,
where the second inequality is due to the tower property
(19)
and the last inequality is by AM-GM. ∎
It remains to bound the error by projecting the estimator to the -covering , which is given by the following lemma.
Lemma 14.
Let such that , then
Proof.
We will follow the notation in the proof of Lemma 13. If the estimator coincides with its projection onto the ε-net, the two agree almost everywhere and the result holds trivially.
Now suppose otherwise; applying Lemma 12 and Jensen's inequality yields
With a similar argument as in (19), we have
Therefore, we have
∎
Now we are ready to present the proof of Theorem 7.
A.3 Proof of Theorem 8
Before we present the proof, we introduce the following lemma concerning the complexity of the sparse neural network function class, which will serve as the fundamental building block of our proof via local Rademacher arguments.
Lemma 15 (ε-covering number of the sparse network class (Schmidt-Hieber 2020, Lemma 5)).
In fact, the Rademacher complexity can be bounded by the ε-covering number via the following lemma.
Lemma 16 (Localized Dudley’s theorem).
For any function class ,
where the expectation in the definition of is taken over a sample set of i.i.d. samples drawn from a distribution , and denotes the norm with respect to the empirical measure .
The following lemma will also be used in the proof of Theorem 8.
Lemma 17 (Talagrand’s contraction lemma).
Let be a -Lipschitz continuous function and be a function class, then
where .
With the lemmas above in hand, we first bound the local Rademacher complexity of the sparse neural network class.
Lemma 18 (Local Rademacher complexity of ).
The local Rademacher complexity of the sparse neural network class
as appeared in Theorem 6 is bounded by the sub-root function
with the fixed point bounded by
Proof.
First, by applying Talagrand's contraction lemma together with Assumption 2b and the localized Dudley theorem, we have
(20)
where the empirical localization radius depending on the sample set is defined as
and the last inequality is due to the fact that whenever ε exceeds the empirical localization radius, the ε-covering number with respect to the empirical norm is exactly 1 and the integrand thus vanishes.
By choosing on the RHS of (20), we have
where the third inequality is because of , and the second to last equality is by Lemma 15.
Now set
(21)
then
following the reasoning above.
We now turn to bounding the empirical localization radius, again in terms of the local Rademacher complexity.
where the last inequality is by symmetrization (Boucheron, Lugosi, and Massart 2013).
Then from (21), we have by Jensen’s inequality,
from which we deduce that
which is clearly a sub-root function. Furthermore, by direct calculation, the fixed point of satisfies
which concludes the proof. ∎
The following lemma describes the approximation capability of :
Lemma 19 (Approximation error of (Schmidt-Hieber 2020, Theorem 5)).
For any and any integer and , there exists , where
such that
With all the above preparations, we are ready to prove Theorem 8.
Appendix B Detailed Proofs of Section 3.3
In this section, we present the detailed proofs of the results in Section 3.3, namely Theorem 3 and Theorem 4. Both proofs are based on Theorem 8, which we proved in Appendix A. Specifically, we verify the conditions of Theorem 8 for the drift and diffusion estimation problems, respectively. To this end, we first compute the noise of the drift and diffusion estimators as in (10), analyze its decomposition into the bias term and the variance term, and then verify Assumption 6 and identify the exponent in Assumption 7 for both drift and diffusion estimators.
Lemma 20.
For ,
Proof.
By Itô’s formula, we have for ,
Therefore for any ,
and by Gronwall’s inequality, we have
∎
B.1 Proof of Theorem 3
As a warm-up for the proof of Theorem 4, we first prove the corresponding upper bound for drift estimation:
Proof of Theorem 3.
Suppose that . Let , the -th component of the drift , be our target function . We also use , , and to denote the corresponding noise terms.
For any , we have by Itô’s formula,
and comparing the estimated empirical loss for drift inference (5) with the general form (10), we have
By definition, we also have
Finally, we apply Theorem 8 to , and obtain the final rate
and the proof is thus complete by repeating the argument for all components. ∎
B.2 Proof of Theorem 4
The proof for diffusion estimation requires an even more delicate analysis of the dynamics of the SDE (1).
Proof of Theorem 4.
Following the notation introduced in the proof of Theorem 3, for , let be the target function and , , be the corresponding noise terms.
For any , define the following auxiliary process for ,
(23)
where denotes the -th row of (a row vector). Plugging (23) into the estimated empirical loss for diffusion estimation (6), we have by definition
The process satisfies the following SDE:
By Itô’s formula, the process as the product of and also satisfies an SDE as follows:
with its quadratic variation satisfying
By direct calculation, we have
(24)