
Unsupervised learning of observation functions in
state-space models by nonparametric moment methods

Qingci An, Yannis Kevrekidis, Fei Lu, Mauro Maggioni
Department of Applied Mathematics and Mathematics, Johns Hopkins University, Baltimore, MD 21218, USA
[email protected], [email protected]
Abstract

We investigate the unsupervised learning of non-invertible observation functions in nonlinear state-space models. Assuming abundant data of the observation process along with the distribution of the state process, we introduce a nonparametric generalized moment method to estimate the observation function via constrained regression. The major challenge comes from the non-invertibility of the observation function and the lack of data pairs between the state and observation. We address the fundamental issue of identifiability from quadratic loss functionals and show that the function space of identifiability is the closure of an RKHS that is intrinsic to the state process. Numerical results show that the first two moments and temporal correlations, along with upper and lower bounds, can identify functions ranging from piecewise polynomials to smooth functions, leading to convergent estimators. The limitations of this method, such as non-identifiability due to symmetry and stationarity, are also discussed.

Key words unsupervised learning, state-space models, nonparametric regression, generalized moment method, RKHS

1 Introduction

We consider the following state-space model for the processes $(X_t,Y_t)$ in $\mathbb{R}\times\mathbb{R}$:

State model: $dX_t = a(X_t)\,dt + b(X_t)\,dB_t$, with $a, b$ known;  (1.1)
Observation model: $Y_t = f_*(X_t)$, with $f_*$ unknown.  (1.2)

Here $(B_t)$ is the standard Brownian motion, and the drift function $a(x)$ and the diffusion coefficient $b(x)$ are given, satisfying linear growth and global Lipschitz conditions. We assume that the initial distribution of $X_{t_0}$ is given. Thus, the state model is known; in other words, the distribution of the process $(X_t)$ is known.

Our goal is to estimate the unknown observation function $f_*$ from data consisting of an ensemble of trajectories of the process $Y_t$, denoted by $\{Y_{t_0:t_L}^{(m)}\}_{m=1}^{M}$, where $m$ indexes trajectories and $t_0<\dots<t_L$ are the times at which the observations are made. In particular, no pairs $(X_t,Y_t)$ are observed, so in the language of machine learning this may be considered an unsupervised learning problem. A case of particular interest in the present work is when the observation function $f_*$ is nonlinear and non-invertible, and lies within a large class of functions, including smooth functions but also, for example, piecewise continuous functions.

We estimate the observation function $f_*$ by matching generalized moments, while constraining the estimator to a suitably chosen finite-dimensional hypothesis (function) space, whose dimension depends on the number of observations, in the spirit of nonparametric statistics. We consider both first- and second-order moments, as well as temporal correlations, of the observation process. The estimator minimizes the discrepancy between the moments over hypothesis spaces spanned by B-spline functions, with upper and lower pointwise constraints estimated from data. The method we propose has several significant strengths:

  • the generalized moments do not require the invertibility of the observation function $f_*$;

  • low-order generalized moments tend to be robust to additive observation noise;

  • generalized moments avoid the need for local constructions, since they depend on the entire distribution of the latent and observed processes;

  • our nonparametric approach does not require a priori information about the observation function, and, for example, it can deal with both continuous and discontinuous functions;

  • the method is computationally efficient because the moments need to be estimated only once, and their computation is easily performed in parallel.

We note that the method we propose readily extends to multivariate state models, with the main statistical and computational bottlenecks coming from the curse of dimensionality in the representation and estimation of a higher-dimensional $f_*$ in terms of the basis functions.

The problem we are considering has been studied in the contexts of nonlinear system identification [2, 23] and filtering and data assimilation [4, 21], albeit typically when observations are in the form of one, or a small number of, long trajectories, and in the case of an invertible or smooth observation function $f_*$. The estimation of the unknown observation function and of the latent dynamics from unlabeled data has been considered in [14, 17, 10, 27] and references therein. Inference for state-space models (SSMs) has been widely studied; most classical approaches focus on estimating the parameters in the SSM from a single trajectory of the observation process, by expectation-maximization methods maximizing the likelihood, or by Bayesian approaches [2, 23, 4, 18, 11], with recent studies estimating the coefficients in a kernel representation [36] or the coefficients of a pre-specified set of basis functions [35].

Our framework combines nonparametric learning [13, 7] with the generalized moment method, which has mainly been studied in the setting of parametric inference [33, 31, 30]. We study the identifiability of the observation function from first-order moments, and show that the first-order generalized moments can identify the function in the $L^2$ closure of a reproducing kernel Hilbert space (RKHS) that is intrinsic to the state model. As far as we know, this is the first result on the function space of identifiability for nonparametric learning of observation functions in SSMs.

When the observation function is invertible, its unsupervised regression has been investigated in [32] by maximizing the likelihood for high-dimensional data. However, in many applications, particularly those involving complex dynamics, the observation functions are non-invertible, for example projections or nonlinear non-invertible transformations (e.g., $f(x)=|x|^2$ with $x\in\mathbb{R}^d$). As a consequence, the resulting observed process may have discontinuous or singular probability densities [16, 12]. In [27], it has been shown empirically that delayed coordinates with principal component analysis may be used to estimate the dimension of the hidden process, and diffusion maps [6] may yield a diffeomorphic copy of the observation function.

The remainder of the paper is organized as follows. We present the nonparametric generalized moments method in Section 2. In Section 3 we study the identifiability of the observation function from first-order moments, and show that the function spaces of identifiability are RKHSs intrinsic to the state model. We present numerical examples to demonstrate the effectiveness, and limitations, of the proposed method in Section 4. Section 5 summarizes this study and discusses directions of future research; we review the basic elements about RKHSs in Appendix A.

2 Non-parametric regression based on generalized moments

Throughout this work, we focus on discrete-time observations of the state-space model (1.1)–(1.2), because data in practice are discrete in time, and the extension to continuous-time trajectories is straightforward. We thereby suppose that the data are in the form $\{Y_{t_0:t_L}^{(m)}\}_{m=1}^{M}$, with $m$ indexing multiple independent trajectories, observed at the vector $t_0:t_L$ of discrete times $(t_0,\cdots,t_L)$.

2.1 Generalized moments method

We estimate the observation function $f_*$ by the generalized moment method (GMM) [33, 31, 30], searching for an observation function $\widehat{f}$, in a suitable finite-dimensional hypothesis (function) space, such that the moments of functionals of the process $(\widehat{f}(X_t))$ are close to the empirical ones (computed from data) of $f_*(X_t)$.

We consider “generalized moments” in the form $\mathbb{E}[\xi(Y_{t_0:t_L})]$, where $\xi:\mathbb{R}^{L+1}\to\mathbb{R}^{K}$ is a functional of the trajectory $Y_{t_0:t_L}$. For example, the functional $\xi$ can be $\xi(Y_{t_0:t_L})=[Y_{t_0:t_L},\,Y_{t_0}Y_{t_1},\ldots,Y_{t_{L-1}}Y_{t_L}]\in\mathbb{R}^{2L+1}$, in which case $\mathbb{E}[\xi(Y_{t_0:t_L})]=\left[\mathbb{E}[Y_{t_0:t_L}],\mathbb{E}[Y_{t_0}Y_{t_1}],\ldots,\mathbb{E}[Y_{t_{L-1}}Y_{t_L}]\right]$ is the vector of the first moments and of temporal correlations at consecutive observation times. The empirical generalized moments are computed from data by Monte Carlo approximation:

\mathbb{E}[\xi(Y_{t_0:t_L})] \approx E_M[\xi(Y_{t_0:t_L})] := \frac{1}{M}\sum_{m=1}^{M}\xi(Y_{t_0:t_L}^{(m)}),  (2.1)

which converges at the rate $M^{-1/2}$ by the Central Limit Theorem, since the $M$ trajectories are independent. Meanwhile, since the state model (hence the distribution of the state process) is known, for any putative observation function $f$, we approximate the moments of the process $(f(X_t))$ by simulating $M'$ independent trajectories of the state process $(X_t)$:

\mathbb{E}[\xi(f(X)_{t_0:t_L})] \approx \frac{1}{M'}\sum_{m=1}^{M'}\xi(f(X)_{t_0:t_L}^{(m)}).  (2.2)

Here, with some abuse of notation, $f(X)_{t_0:t_L}^{(m)}:=(f(X_{t_0}^{(m)}),\dots,f(X_{t_L}^{(m)}))$. The number $M'$ can be as large as we can afford from a computational perspective. In what follows, since $M'$ can be chosen large, only subject to computational constraints, we consider the error in this empirical approximation negligible and work with $\mathbb{E}[\xi(f(X)_{t_0:t_L})]$ directly.
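To make the moment-matching setup concrete, here is a minimal sketch (in Python, with hypothetical array shapes) of the empirical approximations (2.1)–(2.2): the observed moments are averaged over the $M$ data trajectories, while the moments of $f(X_t)$ are averaged over $M'$ simulated state trajectories. The arrays `Y` and `X` and the candidate function `f` are assumptions for illustration.

```python
import numpy as np

def empirical_moments(Y):
    """Empirical generalized moments of the observation data, as in (2.1).

    Y: array of shape (M, L+1), M observed trajectories at times t_0..t_L.
    Returns first moments, second moments, and one-step correlations.
    """
    m1 = Y.mean(axis=0)                         # E[Y_{t_l}]
    m2 = (Y**2).mean(axis=0)                    # E[Y_{t_l}^2]
    corr = (Y[:, :-1] * Y[:, 1:]).mean(axis=0)  # E[Y_{t_{l-1}} Y_{t_l}]
    return m1, m2, corr

def model_moments(f, X):
    """Moments of f(X_t), approximated from M' simulated state trajectories, as in (2.2).

    f: candidate observation function, applied elementwise.
    X: array of shape (M', L+1) of simulated state trajectories.
    """
    return empirical_moments(f(X))

# Usage sketch: compare the two sets of moments for a candidate f.
# X = simulate_state_model(Mprime, L, dt)   # hypothetical simulator of (1.1)
# m_data = empirical_moments(Y)
# m_model = model_moments(np.sin, X)
```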

We estimate the observation function $f_*$ by minimizing a notion of discrepancy between these two empirical generalized moments:

\widehat{f} = \underset{f\in\mathcal{H}}{\arg\min}\;\mathcal{E}^M(f), \quad\text{where } \mathcal{E}^M(f) := \mathrm{dist}\left(E_M[\xi(Y_{t_0:t_L})],\,\mathbb{E}[\xi(f(X)_{t_0:t_L})]\right)^2,  (2.3)

where $f$ is restricted to a suitable hypothesis space $\mathcal{H}$, and $\mathrm{dist}(\cdot,\cdot)$ is a proper distance between the moments, to be specified later. We choose $\mathcal{H}$ to be a subset of an $n$-dimensional function space, spanned by basis functions $\{\phi_i\}$, within which we can write $\widehat{f}=\sum_{i=1}^{n}\widehat{c}_i\phi_i$. By the law of large numbers, $\mathcal{E}^M(f)$ tends almost surely to $\mathcal{E}(f):=\mathrm{dist}\left(\mathbb{E}[\xi(Y_{t_0:t_L})],\,\mathbb{E}[\xi(f(X)_{t_0:t_L})]\right)^2$.

It is desirable to choose the generalized moment functional $\xi$ and the hypothesis space $\mathcal{H}$ so that the minimization in (2.3) can be performed efficiently. We select the functional $\xi$ so that the moments $\mathbb{E}[\xi(f(X)_{t_0:t_L})]$, for $f=\sum_{i=1}^{n}c_i\phi_i$, can be efficiently evaluated for all $(c_1,\ldots,c_n)$. To this end, we choose linear functionals or low-degree polynomials, so that we only need to compute the moments of the basis functions once, and can reuse them during the optimization process, as discussed in Section 2.2. The selection of the hypothesis space is detailed in Section 2.3.

2.2 Loss functional and estimator

The generalized moments we consider include the first and the second moments, as well as the one-step temporal correlation: we let $\xi(Y_{t_0:t_L}):=\left(Y_{t_0:t_L},\,Y_{t_0:t_L}^2,\,Y_{t_0}Y_{t_1},\ldots,Y_{t_{L-1}}Y_{t_L}\right)\in\mathbb{R}^{3L+2}$. The loss functional in (2.3) is then chosen in the following form: for weights $w_1,w_2,w_3>0$,

\mathcal{E}(f) := w_1\underbrace{\frac{1}{L}\sum_{l=1}^{L}\left|\mathbb{E}[f(X_{t_l})]-\mathbb{E}[Y_{t_l}]\right|^2}_{\mathcal{E}_1(f)} + w_2\underbrace{\frac{1}{L}\sum_{l=1}^{L}\left|\mathbb{E}[f(X_{t_l})^2]-\mathbb{E}[Y_{t_l}^2]\right|^2}_{\mathcal{E}_2(f)} + w_3\underbrace{\frac{1}{L}\sum_{l=1}^{L}\left|\mathbb{E}[f(X_{t_l})f(X_{t_{l-1}})]-\mathbb{E}[Y_{t_l}Y_{t_{l-1}}]\right|^2}_{\mathcal{E}_3(f)}.  (2.4)

Let the hypothesis space $\mathcal{H}$ be a subset of the span of a linearly independent set $\{\phi_i\}_{i=1}^n$, which we specify in the next section. For $f=\sum_{i=1}^{n}c_i\phi_i\in\mathcal{H}$, we can write the loss functional $\mathcal{E}_1(f)$ in (2.4) as

\mathcal{E}_1(f) = \frac{1}{L}\sum_{l=1}^{L}\bigg|\sum_{i=1}^{n}c_i\mathbb{E}[\phi_i(X_{t_l})]-\mathbb{E}[Y_{t_l}]\bigg|^2 = c^\top\overline{A}_1 c - 2c^\top\overline{b}_1 + \tilde{b}_1,  (2.5)

where $\tilde{b}_1:=\frac{1}{L}\sum_{l=1}^{L}\mathbb{E}[Y_{t_l}]^2$, and the matrix $\overline{A}_1$ and the vector $\overline{b}_1$ are given by

\overline{A}_1(i,j) := \frac{1}{L}\sum_{l=1}^{L}\underbrace{\mathbb{E}[\phi_i(X_{t_l})]\,\mathbb{E}[\phi_j(X_{t_l})]}_{A_{1,l}(i,j)}, \qquad \overline{b}_1(i) := \frac{1}{L}\sum_{l=1}^{L}\underbrace{\mathbb{E}[\phi_i(X_{t_l})]\,\mathbb{E}[Y_{t_l}]}_{b_{1,l}(i)}.  (2.6)

Similarly, we can write $\mathcal{E}_2(f)$ and $\mathcal{E}_3(f)$ in (2.4) as

\mathcal{E}_2(f) = \frac{1}{L}\sum_{l=1}^{L}\bigg|\sum_{i,j=1}^{n}c_i c_j\underbrace{\mathbb{E}[\phi_i\phi_j(X_{t_l})]}_{A_{2,l}(i,j)} - \underbrace{\mathbb{E}[Y_{t_l}^2]}_{b_{2,l}}\bigg|^2,  (2.7)
\mathcal{E}_3(f) = \frac{1}{L}\sum_{l=1}^{L}\bigg|\sum_{i,j=1}^{n}c_i c_j\underbrace{\mathbb{E}[\phi_i(X_{t_{l-1}})\phi_j(X_{t_l})]}_{A_{3,l}(i,j)} - \underbrace{\mathbb{E}[Y_{t_{l-1}}Y_{t_l}]}_{b_{3,l}}\bigg|^2.

Thus, with the notations in (2.6)–(2.7), the minimizer of the loss functional $\mathcal{E}(f)$ over $\mathcal{H}$ is

\widehat{f}_{\mathcal{H}} := \sum_{i=1}^{n}\widehat{c}_i\phi_i, \qquad \widehat{c} := \underset{c\in\mathbb{R}^n \text{ s.t. } \sum_{i=1}^{n}c_i\phi_i\in\mathcal{H}}{\arg\min}\;\mathcal{E}(c), \quad\text{ where }  (2.8)
\mathcal{E}(c) := w_1\left[c^\top\overline{A}_1 c - 2c^\top\overline{b}_1 + \tilde{b}_1\right] + \sum_{k=2}^{3}w_k\frac{1}{L}\sum_{l=1}^{L}\left|c^\top A_{k,l}c - b_{k,l}\right|^2.

Here, with an abuse of notation, we denote $\mathcal{E}(\sum_{i=1}^{n}c_i\phi_i)$ by $\mathcal{E}(c)$.

In practice, with data $\{Y_{t_0:t_L}^{(m)}\}_{m=1}^{M}$, we approximate the expectations involving the observation process $(Y_t)$ by the corresponding empirical means as in (2.1). Meanwhile, we approximate the expectations involving the state process $(X_t)$ by Monte Carlo as in (2.2), using $M'$ trajectories. We assume that the sampling errors in the expectations of $(X_t)$, i.e. in the terms $\{A_{k,l}\}_{k=1}^{3}$, are negligible, since the basis functions $\{\phi_i\}$ can be chosen to be bounded (such as B-spline polynomials) and $M'$ can be as large as we can afford. We approximate $\{b_{k,l}\}_{k=1}^{3}$ by their empirical means $\{b_{k,l}^M\}_{k=1}^{3}$:

b_{1,l}(i) = \mathbb{E}[\phi_i(X_{t_l})]\,\mathbb{E}[Y_{t_l}] \approx \mathbb{E}[\phi_i(X_{t_l})]\,\frac{1}{M}\sum_{m=1}^{M}Y_{t_l}^{(m)} =: b_{1,l}^M(i),  (2.9)
b_{2,l} = \mathbb{E}[|Y_{t_l}|^2] \approx \frac{1}{M}\sum_{m=1}^{M}|Y_{t_l}^{(m)}|^2 =: b_{2,l}^M,  (2.10)
b_{3,l} = \mathbb{E}[Y_{t_{l-1}}Y_{t_l}] \approx \frac{1}{M}\sum_{m=1}^{M}Y_{t_{l-1}}^{(m)}Y_{t_l}^{(m)} =: b_{3,l}^M.  (2.11)

Then, with $\overline{b}_1^M=\frac{1}{L}\sum_{l=1}^{L}b_{1,l}^M$ and $\widetilde{b}_1^M=\frac{1}{LM}\sum_{l=1}^{L}\sum_{m=1}^{M}\big(Y_{t_l}^{(m)}\big)^2$, the estimator from data is

\widehat{f}_{\mathcal{H},M} = \sum_{i=1}^{n}\widehat{c}_i\phi_i, \qquad \widehat{c} = \underset{c\in\mathbb{R}^n \text{ s.t. } \sum_{i=1}^{n}c_i\phi_i\in\mathcal{H}}{\arg\min}\;\mathcal{E}^M(c), \quad\text{ where }  (2.12)
\mathcal{E}^M(c) = w_1\left[c^\top\overline{A}_1 c - 2c^\top\overline{b}_1^M + \widetilde{b}_1^M\right] + \sum_{k=2}^{3}w_k\frac{1}{L}\sum_{l=1}^{L}\left|c^\top A_{k,l}c - b_{k,l}^M\right|^2.

The minimization of $\mathcal{E}^M(c)$ can be performed with iterative algorithms; each iteration with respect to $c$ is efficient, since the data-based matrices and vectors $\overline{A}_1$, $\overline{b}_1^M$ and $\{A_{k,l},b_{k,l}^M\}_{k=2}^{3}$ only need to be computed once. The main source of sampling error is the empirical approximation of the moments of the process $(Y_t)$. We specify the hypothesis space in the next section and provide a detailed algorithm for the computation of the estimator in Section 2.4.
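As a concrete illustration, the following sketch (a minimal Python example, assuming precomputed Monte Carlo state trajectories and basis functions; the names `basis`, `X`, `Y` are hypothetical inputs) assembles the matrices $\overline{A}_1$, $\{A_{2,l}\}$, $\{A_{3,l}\}$ and the data vectors, and minimizes the quartic loss $\mathcal{E}^M(c)$ in (2.12) with a generic optimizer; the pointwise bound constraints of Section 2.3 can be added on top of this, e.g. via constrained optimization.

```python
import numpy as np
from scipy.optimize import minimize

def assemble_moment_arrays(basis, X, Y):
    """Build the A-matrices and b-vectors of (2.6)-(2.7) and (2.9)-(2.11).

    basis: list of callables phi_i;  X: (Mprime, L+1) simulated states;
    Y: (M, L+1) observed trajectories.
    """
    Phi = np.stack([phi(X) for phi in basis], axis=-1)        # (Mprime, L+1, n)
    u = Phi.mean(axis=0)                                      # E[phi_i(X_{t_l})], (L+1, n)
    L = u.shape[0] - 1
    A1 = np.einsum('li,lj->ij', u[1:], u[1:]) / L             # bar{A}_1
    A2 = np.einsum('mli,mlj->lij', Phi[:, 1:], Phi[:, 1:]) / Phi.shape[0]   # A_{2,l}
    A3 = np.einsum('mli,mlj->lij', Phi[:, :-1], Phi[:, 1:]) / Phi.shape[0]  # A_{3,l}
    b1 = (u[1:] * Y[:, 1:].mean(axis=0)[:, None]).mean(axis=0)              # bar{b}_1^M
    b2 = (Y[:, 1:]**2).mean(axis=0)                                         # b_{2,l}^M
    b3 = (Y[:, :-1] * Y[:, 1:]).mean(axis=0)                                # b_{3,l}^M
    b1tilde = (Y[:, 1:].mean(axis=0)**2).mean()   # constant offset; does not affect the minimizer
    return A1, A2, A3, b1, b2, b3, b1tilde

def loss(c, A1, A2, A3, b1, b2, b3, b1tilde, w=(1.0, 1.0, 1.0)):
    """Quartic loss E^M(c) of (2.12)."""
    e1 = c @ A1 @ c - 2 * c @ b1 + b1tilde
    e2 = np.mean((np.einsum('i,lij,j->l', c, A2, c) - b2)**2)
    e3 = np.mean((np.einsum('i,lij,j->l', c, A3, c) - b3)**2)
    return w[0]*e1 + w[1]*e2 + w[2]*e3

# Usage sketch (multiple random initializations, as in Algorithm 1):
# arrays = assemble_moment_arrays(basis, X, Y)
# res = minimize(loss, np.zeros(len(basis)), args=arrays)
# f_hat = lambda x: sum(ci * phi(x) for ci, phi in zip(res.x, basis))
```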

Remark 2.1 (Moments involving Itô’s formula)

When the data trajectories are continuous in time (or when they are sampled at a high frequency in time), we can utilize additional moments from Itô's formula. Recall that for $f\in C_b^2$, applying Itô's formula to the diffusion process in (1.1), we have

f(X_{t+\Delta t}) - f(X_t) = \int_t^{t+\Delta t}\nabla f\cdot b(X_s)\,dW_s + \int_t^{t+\Delta t}\mathcal{L}f(X_s)\,ds,

where the operator $\mathcal{L}$ is

\mathcal{L}f = \nabla f\cdot a + \frac{1}{2}\mathrm{Hess}(f):b^\top b.  (2.13)

Hence, $\mathbb{E}[\Delta Y_{t_l}]=\mathbb{E}[\mathcal{L}f_*(X_{t_{l-1}})]\Delta t + o(\Delta t)$, where $\Delta Y_{t_l}=Y_{t_l}-Y_{t_{l-1}}$. Thus, when $\Delta t$ is small, we can consider matching the generalized moments via

\mathcal{E}_4(f) = \frac{1}{L}\sum_{l=1}^{L}\bigg|\mathbb{E}[\mathcal{L}f(X_{t_{l-1}})]\Delta t - \mathbb{E}[\Delta Y_{t_l}]\bigg|^2.  (2.14)

Similarly, we can further consider the generalized moments $\mathbb{E}[Y_t\Delta Y_t]$ and $\mathrm{Var}(\Delta Y_t)$ and the corresponding quartic loss functionals. Since they require the moments of the first- and second-order derivatives of the observation function, they are helpful when the observation function is smooth with bounded derivatives.

2.3 Hypothesis space and optimal dimension

We let the hypothesis space $\mathcal{H}$ be a class of bounded functions in $\mathrm{span}\{\phi_i\}_{i=1}^n$,

\mathcal{H} := \Big\{f=\sum_{i=1}^{n}c_i\phi_i \,:\, y_{\min}\leq f(x)\leq y_{\max} \text{ for all } x\in\mathrm{supp}(\widebar{\rho}_T)\Big\},  (2.15)

where the basis functions $\{\phi_i\}$ are to be specified below, and the empirical bounds

y_{\min} := \min\{Y_{t_l}^{(m)}\}_{l,m=1}^{L,M}, \qquad y_{\max} := \max\{Y_{t_l}^{(m)}\}_{l,m=1}^{L,M}

aim to approximate the lower and upper bounds of $f_*$. Note that the hypothesis space $\mathcal{H}$ is a bounded convex subset of the linear space $\mathrm{span}\{\phi_i\}_{i=1}^n$. While the pointwise bound constraints are imposed for all $x$, in practice, for efficient computation, we apply them at representative points, for example at the mesh-grid points used when the basis functions are piecewise polynomials. One may apply stronger constraints, such as requiring time-dependent bounds to hold at all times: $y_{\min}(t)\leq\sum_{i=1}^{n}c_i\phi_i(x)\leq y_{\max}(t)$ for each time $t$, where $y_{\min}(t)$ and $y_{\max}(t)$ are the minimum and maximum of the data set $\{Y_t^{(m)}\}_{m=1}^{M}$.

Basis functions.

As basis functions $\{\phi_i\}$ for the subspace containing $\mathcal{H}$ we choose a B-spline basis consisting of piecewise polynomials (see Appendix B.1 for details). To specify the knots of the B-spline functions, we introduce a density function $\widebar{\rho}_T^L$, which is the average of the probability densities $\{p_{t_l}\}_{l=1}^{L}$ of $\{X_{t_l}\}_{l=1}^{L}$:

\widebar{\rho}_T^L(x) = \frac{1}{L}\sum_{l=1}^{L}p_{t_l}(x) \;\xrightarrow{L\to\infty}\; \widebar{\rho}_T(x) = \frac{1}{T}\int_0^T p_t(x)\,dt,  (2.16)

when $t_L=T$ and $\max_{1\leq l\leq L}|t_l-t_{l-1}|\to 0$. Here $\widebar{\rho}_T^L$ (and its continuous-time limit $\widebar{\rho}_T$) describes the intensity of visits to the regions explored by the process $(X_t)$. The knots of the B-spline functions come from a uniform partition of $[R_{\min},R_{\max}]$, the smallest interval enclosing the support of $\widebar{\rho}_T^L$. Thus, the basis functions $\{\phi_i\}$ are piecewise polynomials with knots adaptive to the state model, which determines $\widebar{\rho}_T^L$.
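For illustration, here is a minimal sketch (in Python, using `scipy.interpolate.BSpline`; the interval $[R_{\min},R_{\max}]$ is taken from the range of assumed simulated state trajectories `X`) of how such a basis can be built on a uniform partition.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(X, n_basis, degree=2):
    """Construct B-spline basis functions with knots on a uniform partition
    of [Rmin, Rmax], an interval covering the sampled support of rho_T^L.

    X: (Mprime, L+1) simulated state trajectories, used to estimate the interval.
    Returns a list of callables phi_i(x); they return NaN outside [Rmin, Rmax].
    """
    Rmin, Rmax = X.min(), X.max()
    # Interior knots from a uniform partition; repeat the endpoints for the spline order.
    n_interior = n_basis - degree + 1
    interior = np.linspace(Rmin, Rmax, n_interior)
    knots = np.concatenate([[Rmin] * degree, interior, [Rmax] * degree])
    basis = []
    for i in range(n_basis):
        coeff = np.zeros(n_basis)
        coeff[i] = 1.0
        basis.append(BSpline(knots, coeff, degree, extrapolate=False))
    return basis

# Usage: basis = bspline_basis(X, n_basis=8, degree=2)
# Each basis[i] is a callable phi_i, evaluated elementwise on arrays.
```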

Dimension of the hypothesis space.

It is important to select a suitable dimension of the hypothesis space to avoid under- or over-fitting. We select the dimension in two steps. First, we introduce an algorithm, Cross-validating Estimation of Dimension Range (CEDR), to estimate the range of the dimension from the quadratic loss functional $\mathcal{E}_1$. Its main idea is to avoid the amplification of sampling error caused by an unsuitably large dimension; the sampling error is estimated by splitting the data into two sets. Then, we select the optimal dimension as the one minimizing the 2-Wasserstein distance between the measures of data and prediction. See Appendix B.1 for details.

2.4 Algorithm

We summarize the above method of nonparametric regression with generalized moments in Algorithm 1. It minimizes a quartic loss function with the upper and lower bound constraints, and we perform the optimization with multiple initial conditions (see Appendix B.2 for the details).

Algorithm 1 Estimating the observation function by the nonparametric generalized moment method
Input: the state model and data $\{Y_{t_0:t_L}^{(m)}\}_{m=1}^{M}$, consisting of multiple trajectories of the observation process.
Output: estimator $\widehat{f}$.
1: Estimate the empirical density $\widebar{\rho}_T$ in (2.16) and find its support $[R_{\min},R_{\max}]$.
2: Select a basis type, Fourier or B-spline, with an estimated dimension range $[1,N]$ (by Algorithm 2), and compute the basis functions as described in Section 2.3.
3: for $n=1:N$ do
4:     Compute the moment matrices in (2.6)–(2.7) and the vectors $b_{k,l}^M$ in (2.9)–(2.11).
5:     Find the estimator $\widehat{c}_n$ by optimization with multiple initial conditions. Compute and record the values of the loss functional and the 2-Wasserstein distances.
6: end for
7: Select the optimal dimension $n^*$ (and degree, if a B-spline basis) with the minimal 2-Wasserstein distance in (B.5). Return the estimator $\widehat{f}=\sum_{i=1}^{n^*}c_{n^*}^{i}\phi_i$.

Computational complexity

The computational complexity is driven by the construction of the normal matrix and vectors and by the evaluation of the 2-Wasserstein distances, which require computations of order $O(n^2LM)$ and $O(nLM)$, respectively. Thus, the total computational complexity is of order $O((n^2+n)LM)$.

2.5 Tolerance to noise in the observations

The (generalized) moment method can tolerate large additive observation noise if the distribution of the noise is known. The estimation error caused by the noise is at the scale of the sampling error, which is negligible when the sample size is large.

More specifically, suppose that we observe $\{Y_{t_0:t_L}^{(m)}\}_{m=1}^{M}$ from the observation model

Y_{t_l} = f_*(X_{t_l}) + \eta_{t_l},  (2.17)

where $\{\eta_{t_l}\}$ is sampled from a process $(\eta_t)$ that is independent of $(X_t)$ and has moments

\mathbb{E}[\eta_t] = 0, \qquad C(s,t) = \mathbb{E}[\eta_t\eta_s], \quad\text{ for } s,t\geq 0.  (2.18)

A typical example is when $\eta$ is independent, identically distributed Gaussian noise $\mathcal{N}(0,\sigma^2)$, which gives $C(s,t)=\sigma^2\delta(t-s)$.

The algorithm in Section 2 applies to the noisy data with only a few changes. First, note that the loss functional in (2.4) involves only the moments $\mathbb{E}[Y_t]$, $\mathbb{E}[Y_t^2]$ and $\mathbb{E}[Y_{t_l}Y_{t_{l-1}}]$, which are moments of $f_*(X_t)$. When $Y_t$ in (2.17) has the observation noise specified above, we have

\mathbb{E}[f_*(X_t)] = \mathbb{E}[Y_t] - \mathbb{E}[\eta_t] = \mathbb{E}[Y_t];
\mathbb{E}[f_*(X_t)f_*(X_s)] = \mathbb{E}[Y_tY_s] - \mathbb{E}[\eta_t\eta_s] = \mathbb{E}[Y_tY_s] - C(t,s)

for all $t,s\geq 0$. Thus, we only need to change the loss functional to

\mathcal{E}(f) = w_1\frac{1}{L}\sum_{l=1}^{L}\left|\mathbb{E}[f(X_{t_l})]-\mathbb{E}[Y_{t_l}]\right|^2 + w_2\frac{1}{L}\sum_{l=1}^{L}\left|\mathbb{E}[f(X_{t_l})^2]-\mathbb{E}[Y_{t_l}^2]+C(t_l,t_l)\right|^2 + w_3\frac{1}{L}\sum_{l=1}^{L}\left|\mathbb{E}[f(X_{t_l})f(X_{t_{l-1}})]-\mathbb{E}[Y_{t_l}Y_{t_{l-1}}]+C(t_l,t_{l-1})\right|^2.  (2.19)

Similar to (2.12), the minimizer of the loss functional can then be computed as

\widehat{f}_{\mathcal{H},M} = \sum_{i=1}^{n}\widehat{c}_i\phi_i, \qquad \widehat{c} = \underset{c\in\mathbb{R}^n \text{ s.t. } \sum_{i=1}^{n}c_i\phi_i\in\mathcal{H}}{\arg\min}\;\mathcal{E}^M(c), \quad\text{ where }  (2.20)
\mathcal{E}^M(c) = w_1\left[c^\top\overline{A}_1 c - 2c^\top\overline{b}_1^M + \widetilde{b}_1^M\right] + w_2\frac{1}{L}\sum_{l=1}^{L}\left|c^\top A_{2,l}c - b_{2,l}^M + C(t_l,t_l)\right|^2 + w_3\frac{1}{L}\sum_{l=1}^{L}\left|c^\top A_{3,l}c - b_{3,l}^M + C(t_{l-1},t_l)\right|^2,

where all the $A$-matrices and $b$-vectors are the same as before (e.g., in (2.6)–(2.7) and (2.9)–(2.11)).

Note that the observation noise introduces sampling errors through $b_{1,l}^M$, $b_{2,l}^M$ and $b_{3,l}^M$, which are at the scale $O(\frac{1}{\sqrt{M}})$. Also, note that the $A$-matrices are independent of the observation noise. Thus, the observation noise affects the estimator only through the sampling error at the scale $O(\frac{1}{\sqrt{M}})$, the same as the sampling error in the estimator from noiseless data.
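As a small illustration of this correction, the sketch below (Python; assumes i.i.d. Gaussian noise with a known variance `sigma2`, and the empirical moment vectors from the noiseless pipeline) adjusts the empirical second-moment and correlation targets before they enter the loss (2.20); for i.i.d. noise the off-diagonal correction vanishes.

```python
import numpy as np

def noise_corrected_targets(b2M, b3M, sigma2):
    """Adjust empirical moment targets for additive i.i.d. noise N(0, sigma2).

    b2M: empirical E[Y_{t_l}^2], shape (L,);  b3M: empirical E[Y_{t_{l-1}} Y_{t_l}], shape (L,).
    Returns targets for E[f(X_{t_l})^2] and E[f(X_{t_{l-1}}) f(X_{t_l})].
    """
    b2_target = np.asarray(b2M) - sigma2   # C(t_l, t_l) = sigma2
    b3_target = np.asarray(b3M) - 0.0      # C(t_{l-1}, t_l) = 0 for independent noise
    return b2_target, b3_target

# Usage: b2_target, b3_target = noise_corrected_targets(b2M, b3M, sigma2=0.1**2)
# These targets replace b2M, b3M in the loss E^M(c) of (2.12)/(2.20).
```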

3 Identifiability

We discuss in this section the identifiability of the observation function by the loss functionals introduced in the previous section. We show that $\mathcal{E}_1$, the quadratic loss functional based on the first-order moments in (2.5), can identify the observation function in the $L^2(\widebar{\rho}_T^L)$-closure of a reproducing kernel Hilbert space (RKHS) that is intrinsic to the state model. In addition, the loss functional $\mathcal{E}_4$ in (2.14), based on the Itô formula, enlarges the function space of identifiability. We also discuss, in Section 3.2, some limitations of the loss functional $\mathcal{E}$ in (2.19), which combines the quadratic and quartic loss functionals; in particular, symmetry and stationarity may prevent us from identifying the observation function when using only generalized moments.

The starting point is a definition of identifiability, which generalizes the uniqueness of the minimizer of a loss function in parametric inference (see e.g., [3, page 431] and [8]).

Definition 3.1 (Identifiability)

We say that the observation function $f_*$ is identifiable by a data-based loss functional $\mathcal{E}$ on a function space $H$ if $f_*$ is the unique minimizer of $\mathcal{E}$ in $H$.

Identifiability consists of three elements: a loss functional $\mathcal{E}$, a function space $H$, and a unique minimizer of the loss functional in $H$. When the loss functional is quadratic (such as $\mathcal{E}_1$ or $\mathcal{E}_4$), it has a unique minimizer in a Hilbert space iff its Fréchet derivative is invertible in the Hilbert space; thus, the main task is to find such function spaces [22, 20, 24]. We will specify such function spaces for $\mathcal{E}_1$ and/or $\mathcal{E}_4$ in Section 3.1. We note that these function spaces do not take into account the constraints of upper and lower bounds, which generically lead to minimizers near or at the boundary of the constrained set. This consideration applies also to the quartic functionals $\mathcal{E}_2$ and $\mathcal{E}_3$, which can be viewed as providing additional constraints (see Section 3.2).

3.1 Identifiability by quadratic loss functionals

We consider the quadratic loss functionals $\mathcal{E}_1$ and $\mathcal{E}_4$, and show that they can identify the observation function in the $L^2(\widebar{\rho}_T^L)$-closure of reproducing kernel Hilbert spaces (RKHSs) that are intrinsic to the state model.

Assumption 3.2

We make the following assumptions on the state-space model.

  • The coefficients in the state model (1.1) satisfy a global Lipschitz condition, and therefore also a linear growth condition: there exists a constant $C>0$ such that $|a(x)-a(y)|+|b(x)-b(y)|\leq C|x-y|$ for all $x,y\in\mathbb{R}$, and $|a(x)|+|b(x)|\leq C(1+|x|)$. Furthermore, we assume that $\inf_{x\in\mathbb{R}}b(x)>0$.

  • The observation function $f_*$ satisfies $\sup_{t\in[0,t_L]}\mathbb{E}[|f_*(X_t)|^2]<\infty$.

Theorem 3.3

Given discrete-time data $\{Y_{t_0:t_L}^{(m)}\}_{m=1}^{M}$ from the state-space model (1.1)–(1.2) satisfying Assumption 3.2, let $\mathcal{E}_1$ and $\mathcal{E}_4$ be the loss functionals defined in (2.4) and (2.14). Denote by $p_t(x)$ the density of the state process $X_t$ at time $t$, and recall that $\widebar{\rho}_T^L$ in (2.16) is the average, in time, of these densities. Let $\mathcal{L}^*$ be the adjoint of the second-order elliptic operator $\mathcal{L}$ in (2.13). Then,

  • (a) $\mathcal{E}_1$ has a unique minimizer in $H_1$, the $L^2(\widebar{\rho}_T^L)$ closure of the RKHS $\mathcal{H}_{K_1}$ with reproducing kernel

    K_1(x,x') = \frac{1}{\widebar{\rho}_T^L(x)\widebar{\rho}_T^L(x')}\frac{1}{L}\sum_{l=1}^{L}p_{t_l}(x)p_{t_l}(x'),  (3.1)

    for $(x,x')$ such that $\widebar{\rho}_T^L(x)\widebar{\rho}_T^L(x')>0$, and $K_1(x,x')=0$ otherwise. When the data is continuous ($L\to\infty$), we have $K_1(x,x')=\frac{1}{\widebar{\rho}_T(x)\widebar{\rho}_T(x')}\frac{1}{T}\int_0^T p_t(x)p_t(x')\,dt$.

  • (b) $\mathcal{E}_4$ has a unique minimizer in $H_4$, the $L^2(\widebar{\rho}_T^L)$ closure of the RKHS $\mathcal{H}_{K_4}$ with reproducing kernel

    K_4(x,x') = \frac{1}{\widebar{\rho}_T^L(x)\widebar{\rho}_T^L(x')}\frac{1}{L}\sum_{l=1}^{L}\mathcal{L}^*p_{t_l}(x)\,\mathcal{L}^*p_{t_l}(x'),  (3.2)

    for $(x,x')$ such that $\widebar{\rho}_T^L(x)\widebar{\rho}_T^L(x')>0$, and $K_4(x,x')=0$ otherwise. When the data is continuous, we have $K_4(x,x')=\frac{1}{\widebar{\rho}_T(x)\widebar{\rho}_T(x')}\frac{1}{T}\int_0^T\mathcal{L}^*p_t(x)\,\mathcal{L}^*p_t(x')\,dt$.

  • (c) $\mathcal{E}_1+\mathcal{E}_4$ has a unique minimizer in $H$, the $L^2(\widebar{\rho}_T^L)$ closure of the RKHS $\mathcal{H}_K$ with reproducing kernel

    K(x,x') = \frac{1}{\widebar{\rho}_T^L(x)\widebar{\rho}_T^L(x')}\frac{1}{L}\sum_{l=1}^{L}\left[p_{t_l}(x)p_{t_l}(x') + \mathcal{L}^*p_{t_l}(x)\,\mathcal{L}^*p_{t_l}(x')\right],  (3.3)

    for $(x,x')$ such that $\widebar{\rho}_T^L(x)\widebar{\rho}_T^L(x')>0$, and $K(x,x')=0$ otherwise. Similarly, for continuous data we have $K(x,x')=\frac{1}{\widebar{\rho}_T(x)\widebar{\rho}_T(x')}\frac{1}{T}\int_0^T\left[p_t(x)p_t(x') + \mathcal{L}^*p_t(x)\,\mathcal{L}^*p_t(x')\right]dt$.

In particular, $f_*$ is the unique minimizer of these loss functionals if $f_*$ is in $H_1$, $H_4$ or $H$, respectively.

To prove this theorem, we first introduce an operator characterization of the RKHS $\mathcal{H}_{K_1}$ in the next lemma. Similar characterizations hold for the RKHSs $\mathcal{H}_{K_4}$ and $\mathcal{H}_K$.

Lemma 3.4

The function $K_1$ in (3.1) is a Mercer kernel, that is, it is continuous, symmetric and positive semi-definite. Furthermore, $K_1$ is square integrable in $L^2(\widebar{\rho}_T^L\times\widebar{\rho}_T^L)$, and it defines a compact positive integral operator $L_{K_1}:L^2(\widebar{\rho}_T^L)\to L^2(\widebar{\rho}_T^L)$:

[L_{K_1}h](x') = \int h(x)K_1(x,x')\widebar{\rho}_T^L(x)\,dx.  (3.4)

Also, the RKHS $\mathcal{H}_{K_1}$ has the operator characterization $\mathcal{H}_{K_1}=L_{K_1}^{1/2}(L^2(\widebar{\rho}_T^L))$, and $\{\sqrt{\lambda_i}\psi_i\}_{i=1}^{\infty}$ is an orthonormal basis of the RKHS $\mathcal{H}_{K_1}$, where $\{\lambda_i,\psi_i\}$ are the pairs of positive eigenvalues and corresponding eigenfunctions of $L_{K_1}$.

Proof. Since the densities of the diffusion process are smooth, the kernel $K_1$ is continuous on the support of $\widebar{\rho}_T^L$, and it is symmetric. It is positive semi-definite (see Appendix A for a definition) because for any $(c_1,\ldots,c_n)\in\mathbb{R}^n$ and $(x_1,\ldots,x_n)$, we have

\sum_{i,j=1}^{n}c_ic_jK_1(x_i,x_j) = \frac{1}{L}\sum_{l=1}^{L}\sum_{i,j=1}^{n}c_ic_j\frac{p_{t_l}(x_i)p_{t_l}(x_j)}{\widebar{\rho}_T^L(x_i)\widebar{\rho}_T^L(x_j)} = \frac{1}{L}\sum_{l=1}^{L}\left(\sum_{i=1}^{n}c_i\frac{p_{t_l}(x_i)}{\widebar{\rho}_T^L(x_i)}\right)^2 \geq 0.

Thus, $K_1$ is a Mercer kernel.

To show that $K_1$ is square integrable, note first that $p_{t_l}(x)\leq\max_{1\leq k\leq L}p_{t_k}(x)\leq L\widebar{\rho}_T^L(x)$ for any $x$. Thus for each $x,x'$, we have

\frac{1}{L}\sum_{l=1}^{L}p_{t_l}(x)p_{t_l}(x') \leq L\,\widebar{\rho}_T^L(x)\widebar{\rho}_T^L(x')

and $K_1(x,x')\leq L$. It follows that $K_1$ is in $L^2(\widebar{\rho}_T^L\times\widebar{\rho}_T^L)$.

Since $K_1$ is positive semi-definite and square integrable, the integral operator $L_{K_1}$ is compact and positive. The operator characterization follows from Theorem A.3. ∎

Remark 3.5

The above lemma applies only to discrete-time observations because it uses the bound $K_1(x,x')\leq L$. When the data is continuous in time on $[0,T]$, we have $K_1\in L^2(\widebar{\rho}_T\times\widebar{\rho}_T)$ if the support of $\widebar{\rho}_T$ is compact. In fact, to show that $K_1$ is square integrable when $\mathrm{supp}(\widebar{\rho}_T)$ is compact, we note that the probability densities are uniformly bounded above, that is, $p_t(x)\leq\max_{y\in\mathbb{R},s\in[0,T]}p_s(y)$. Thus for each $x,x'$, we have

\frac{1}{T}\int_0^T p_t(x)p_t(x')\,dt \leq \left|\frac{1}{T}\int_0^T p_t(x)^2\,dt\right|^{1/2}\left|\frac{1}{T}\int_0^T p_t(x')^2\,dt\right|^{1/2} \leq \widebar{\rho}_T(x)^{1/2}\,\widebar{\rho}_T(x')^{1/2}\max_{y\in\mathbb{R},s\in[0,T]}p_s(y),

where the first inequality is Cauchy-Schwarz. Then,

K_1(x,x') = \frac{1}{\widebar{\rho}_T(x)\widebar{\rho}_T(x')}\frac{1}{T}\int_0^T p_t(x)p_t(x')\,dt \leq \widebar{\rho}_T(x)^{-1/2}\,\widebar{\rho}_T(x')^{-1/2}\max_{y\in\mathbb{R},s\in[0,T]}p_s(y).

It follows that $K_1$ is in $L^2(\widebar{\rho}_T\times\widebar{\rho}_T)$:

\int\int K_1^2(x,x')\widebar{\rho}_T(x)\widebar{\rho}_T(x')\,dx\,dx' \leq |\mathrm{supp}(\widebar{\rho}_T)|^2\max_{y\in\mathbb{R},s\in[0,T]}p_s(y)^2 < \infty.

When $\widebar{\rho}_T$ has non-compact support, it remains to be proved that $K_1\in L^2(\widebar{\rho}_T\times\widebar{\rho}_T)$.

Proof of Theorem 3.3. The proofs of (a)–(c) are similar, so we focus on (a) and only sketch the proofs of (b)–(c).

To prove (a), we only need to show the uniqueness of the minimizer, because Lemma 3.4 has shown that $K_1$ is a Mercer kernel. Furthermore, note that by Lemma 3.4, the $L^2(\widebar{\rho}_T^L)$ closure of the RKHS $\mathcal{H}_{K_1}$ is $H_1=\overline{\mathrm{span}\{\psi_i\}_{i=1}^{\infty}}$, the closure in $L^2(\widebar{\rho}_T^L)$ of the eigenspace of $L_{K_1}$ with non-zero eigenvalues, where $L_{K_1}$ is the operator defined in (3.4).

For any $f\in H_1$, with the notation $h=f-f_*$, we have $\mathbb{E}[f(X_t)]-\mathbb{E}[Y_t]=\mathbb{E}[h(X_t)]$ for each $t$ (recall that $Y_t=f_*(X_t)$). Hence, we can write the loss functional as

\mathcal{E}_1(f) = \frac{1}{L}\sum_{l=1}^{L}\left|\mathbb{E}[f(X_{t_l})]-\mathbb{E}[Y_{t_l}]\right|^2 = \frac{1}{L}\sum_{l=1}^{L}\left|\mathbb{E}[h(X_{t_l})]\right|^2 = \int\int h(x)h(x')\frac{1}{L}\sum_{l=1}^{L}p_{t_l}(x)p_{t_l}(x')\,dx\,dx' = \int\int h(x)h(x')K_1(x,x')\widebar{\rho}_T^L(x)\widebar{\rho}_T^L(x')\,dx\,dx' \geq 0.  (3.5)

Thus, $\mathcal{E}_1$ attains its unique minimizer in $H_1$ at $f_*$ iff $\mathcal{E}_1(f_*+h)=0$ with $h\in H_1$ implies that $h=0$. Note that the second equality in (3.5) implies that $\mathcal{E}_1(f_*+h)=0$ iff $\mathbb{E}[h(X_{t_l})]=0$, i.e. $\int h(x)p_{t_l}(x)\,dx=0$, for all $t_l$. Then, $\int h(x)p_{t_l}(x)\frac{p_{t_l}(x')}{\widebar{\rho}_T^L(x')}\,dx=0$ for each $t_l$ and $x'$. Thus, the sum of these terms is also zero:

0 = \int h(x)\frac{1}{L}\sum_{l=1}^{L}\frac{p_{t_l}(x)p_{t_l}(x')}{\widebar{\rho}_T^L(x')\widebar{\rho}_T^L(x)}\widebar{\rho}_T^L(x)\,dx = \int h(x)K_1(x,x')\widebar{\rho}_T^L(x)\,dx

for each $x'$. By the definition of the operator $L_{K_1}$, this implies that $L_{K_1}h=0$. Thus, $h=0$ because $h\in H_1$.

The above arguments hold true when the kernel $K_1$ comes from continuous-time data: one only has to replace $\frac{1}{L}\sum_{l=1}^{L}$ by the averaged integral in time. This completes the proof of (a).

The proofs of (b) and (c) are the same as above, except for the appearance of the operator $\mathcal{L}^*$. Note that $\mathcal{E}_4$ in (2.14) reads $\mathcal{E}_4(f)=\frac{1}{L}\sum_{l=1}^{L}\left|\mathbb{E}[\mathcal{L}f(X_{t_{l-1}})]\Delta t-\mathbb{E}[\Delta Y_{t_l}]\right|^2$; thus, it differs from $\mathcal{E}_1$ only in the expectation $\mathbb{E}[\mathcal{L}f(X_{t_{l-1}})]$. By integration by parts, we have

\mathbb{E}[\mathcal{L}f(X_s)] = \int \mathcal{L}f(x)p_s(x)\,dx = \int f(x)\mathcal{L}^*p_s(x)\,dx

for any $f\in C_b^2$. Then, the rest of the proof for part (b) follows exactly as above with $K_1$ and $L_{K_1}$ replaced by $K_4$ and $L_{K_4}$. ∎


The following remarks highlight the implications of the above theorem. We consider only $\mathcal{E}_1$, but all the remarks apply also to $\mathcal{E}_4$ and $\mathcal{E}_1+\mathcal{E}_4$.

Remark 3.6 (An operator view of identifiability)

The unique minimizer of $\mathcal{E}_1$ in $H_1$ defined in Theorem 3.3 is the zero of its Fréchet derivative: $\widehat{f}=L_{K_1}^{-1}L_{K_1}f_*$, which equals $f_*$ if $f_*\in H_1$. In fact, with the integral operator $L_{K_1}$, we can write the loss functional $\mathcal{E}_1$ as

\mathcal{E}_1(f) = \langle f-f_*,\,L_{K_1}(f-f_*)\rangle_{L^2(\widebar{\rho}_T^L)}.

Thus, the Fréchet derivative of $\mathcal{E}_1$ over $L^2(\widebar{\rho}_T^L)$ is $\nabla\mathcal{E}_1(f)=L_{K_1}(f-f_*)$, and we obtain the unique minimizer. Furthermore, this operator representation of the minimizer conveys two important messages about the inverse problem of finding the minimizer of $\mathcal{E}_1$: (1) the problem is ill-defined beyond $H_1$; in particular, it is ill-defined on $L^2(\widebar{\rho}_T^L)$ when $L_{K_1}$ is not strictly positive; (2) the inverse problem is ill-posed on $H_1$, because the operator $L_{K_1}$ is compact and its inverse $L_{K_1}^{-1}$ is unbounded.

Remark 3.7 (Identifiability and normal matrix in regression)

Suppose $\mathcal{H}_n=\mathrm{span}\{\phi_i\}_{i=1}^n$ and write $f=\sum_{i=1}^{n}c_i\phi_i$, with the $\phi_i$ being basis functions such as B-splines. As shown in (2.5)–(2.6), the loss functional $\mathcal{E}_1$ becomes a quadratic function with normal matrix $\overline{A}_1=\frac{1}{L}\sum_{l=1}^{L}A_{1,l}$, where $A_{1,l}=\mathbf{u}_l^\top\mathbf{u}_l$ and $\mathbf{u}_l=(\mathbb{E}[\phi_1(X_{t_l})],\ldots,\mathbb{E}[\phi_n(X_{t_l})])\in\mathbb{R}^n$. Thus, the rank of the matrix $\overline{A}_1$ is no larger than $\min\{n,L\}$. Note that $\overline{A}_1$ is the matrix approximation of $L_{K_1}$ on the basis $\{\phi_i\}_{i=1}^n$ in the sense that

\overline{A}_1(i,j) = \langle L_{K_1}\phi_i,\phi_j\rangle_{L^2(\widebar{\rho}_T^L)}

for each $1\leq i,j\leq n$. Thus, the minimal eigenvalue of $\overline{A}_1$ approximates the minimal eigenvalue of $L_{K_1}$ restricted to $\mathcal{H}_n$. In particular, if $\mathcal{H}_n$ contains a nonzero element of the null space of $L_{K_1}$, then the normal matrix is singular; if $\mathcal{H}_n$ is a subspace of the $L^2(\widebar{\rho}_T^L)$ closure of $\mathcal{H}_{K_1}$, then the normal matrix is invertible and we can find a unique minimizer.

Remark 3.8 (Convergence of estimator)

For a fixed hypothesis space, the estimator converges to the projection of $f_*$ onto $\mathcal{H}\cap H_1$ as the data size $M$ increases, at the order $O(M^{-1/2})$, with the error coming from the Monte Carlo estimation of the moments of the observations. With data-adaptive hypothesis spaces, we fall short of proving a minimax rate of convergence as in classical nonparametric regression. This is because of the lack of a coercivity condition [25, 22], since the eigenvalues of the compact operator $L_{K_1}$ converge to zero. A minimax rate would require an estimate of the spectral decay of $L_{K_1}$, and we leave this for future research.

Remark 3.9 (Regularization using the RKHS)

The RKHS $\mathcal{H}_{K_1}$ can be further utilized to provide a regularization norm in Tikhonov regularization (see [24]). It has the advantage of being data adaptive, and it constrains the learning to take place in the function space of identifiability.

Examples of the RKHS.

We emphasize that the reproducing kernel and the RKHS are intrinsic to the state model (including the initial distribution). We demonstrate the kernels by computing them analytically when the process $(X_t)$ is either a Brownian motion or an Ornstein-Uhlenbeck (OU) process. For simplicity, we consider continuous-time data. Recall that when the diffusion coefficient in the state model (1.1) is a constant, the second-order elliptic operator $\mathcal{L}$ is $\mathcal{L}f=\nabla f\cdot a+\frac{1}{2}b^2\Delta f$ and its adjoint operator $\mathcal{L}^*$ is

\mathcal{L}^*p_s = -\nabla\cdot(ap_s) + \frac{1}{2}b^2\Delta p_s,

where $p_s$ denotes the probability density of $X_s$.

Example 3.10 (1D Brownian motion)

Let $a=0$ and $b=1$. Assume $p_0(x)=\delta_{x_0}$, i.e., $X_0=x_0$. Then $X_t$ is the Brownian motion starting from $x_0$ and $p_t(x)=\frac{1}{\sqrt{2\pi t}}e^{-\frac{(x-x_0)^2}{2t}}$ for each $t>0$. We have $\widebar{\rho}_T(x)=\frac{1}{T}\int_0^T p_t(x)\,dt=\frac{x-x_0}{T\sqrt{\pi}}\Gamma\!\left(-\frac{1}{2},\frac{(x-x_0)^2}{2T}\right)$ and

K_1(x,x') = \frac{1}{\widebar{\rho}_T(x)\widebar{\rho}_T(x')}\frac{1}{T}\int_0^T p_s(x)p_s(x')\,ds = \frac{T\,\Gamma\!\left(0,\frac{(x-x_0)^2+(x'-x_0)^2}{2T}\right)}{2(x-x_0)(x'-x_0)\,\Gamma\!\left(-\frac{1}{2},\frac{(x-x_0)^2}{2T}\right)\Gamma\!\left(-\frac{1}{2},\frac{(x'-x_0)^2}{2T}\right)},

where $\Gamma(s,x):=\int_x^{\infty}t^{s-1}e^{-t}\,dt$ is the upper incomplete Gamma function. Also, we have

\mathcal{L}^*p_s(x) = \phi_2(s,x)p_s(x), \quad\text{ with } \phi_2(s,x)=\left(\frac{1}{s^2}(x-x_0)^2-\frac{1}{s}\right).

Thus, the reproducing kernels $K_4$ in (3.2) and $K$ in (3.3) from continuous-time data are

K_4(x,x') = \frac{1}{\widebar{\rho}_T(x)\widebar{\rho}_T(x')}\frac{1}{T}\int_0^T\phi_2(s,x)\phi_2(s,x')p_s(x)p_s(x')\,ds;
K(x,x') = \frac{1}{\widebar{\rho}_T(x)\widebar{\rho}_T(x')}\frac{1}{T}\int_0^T\left(1+\phi_2(s,x)\phi_2(s,x')\right)p_s(x)p_s(x')\,ds.
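These kernels can also be evaluated numerically by discretizing the time average in (3.1); the following sketch (Python, a minimal illustration with assumed parameters $x_0=0$ and $T=1$) computes the time-discretized kernel $K_1$ for Brownian motion directly from its transition densities.

```python
import numpy as np

def bm_density(x, t, x0=0.0):
    """Transition density p_t(x) of Brownian motion started at x0."""
    return np.exp(-(x - x0)**2 / (2.0 * t)) / np.sqrt(2.0 * np.pi * t)

def K1_brownian(x, xp, T=1.0, x0=0.0, L=1000):
    """Time-discretized reproducing kernel K_1(x, x') of (3.1) for Brownian motion.

    Approximates (1/T) int_0^T p_t(x) p_t(x') dt and rho_T(x) by Riemann sums
    over L time points, mirroring the discrete-time kernel in (3.1).
    """
    ts = np.linspace(T / L, T, L)            # t_1, ..., t_L
    px = bm_density(x, ts, x0)               # p_{t_l}(x)
    pxp = bm_density(xp, ts, x0)             # p_{t_l}(x')
    rho_x, rho_xp = px.mean(), pxp.mean()    # bar{rho}_T^L at x and x'
    return (px * pxp).mean() / (rho_x * rho_xp)

# Usage: evaluate the kernel on a small grid.
# xs = np.linspace(-2, 2, 5)
# K = np.array([[K1_brownian(a, b) for b in xs] for a in xs])
```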
Example 3.11 (Ornstein-Uhlenbeck process)

Let $a(x)=-\theta x$ and $b=1$ with $\theta>0$. Assume $p_0(x)=\delta_{x_0}$, i.e., $X_0=x_0$. Then $X_t=e^{-\theta t}x_0+\int_0^t e^{-\theta(t-s)}\,dB_s$. It has distribution $\mathcal{N}\big(e^{-\theta t}x_0,\frac{1}{2\theta}(1-e^{-2\theta t})\big)$, thus $p_t(x)=\frac{1}{\sqrt{2\pi}\sigma_t}\exp\big(-\frac{(x-x_0^t)^2}{2\sigma_t^2}\big)$, where $x_0^t:=e^{-\theta t}x_0$ and $\sigma_t^2:=\frac{1}{2\theta}(1-e^{-2\theta t})$. Computing the spatial derivatives, we have $\mathcal{L}^*p_s(x)=\frac{1}{2}\left[\frac{(x-x_0^s)^2}{\sigma_s^4}-\frac{1}{\sigma_s^2}\right]p_s(x)+(\theta xp_s(x))'=\phi_2(s,x)p_s(x)$, where

\phi_2(s,x) := \left[\frac{(x-x_0^s)^2}{2\sigma_s^4}-\frac{1}{2\sigma_s^2}+\theta-\frac{\theta}{\sigma_s^2}x(x-x_0^s)\right].

The reproducing kernels $K_1$ in (3.1), $K_4$ in (3.2) and $K$ in (3.3) are

K_1(x,x') = \frac{1}{\widebar{\rho}_T(x)\widebar{\rho}_T(x')}\frac{1}{T}\int_0^T p_s(x)p_s(x')\,ds;
K_4(x,x') = \frac{1}{\widebar{\rho}_T(x)\widebar{\rho}_T(x')}\frac{1}{T}\int_0^T\phi_2(s,x)\phi_2(s,x')p_s(x)p_s(x')\,ds;
K(x,x') = \frac{1}{\widebar{\rho}_T(x)\widebar{\rho}_T(x')}\frac{1}{T}\int_0^T\left(1+\phi_2(s,x)\phi_2(s,x')\right)p_s(x)p_s(x')\,ds.

In particular, when the process is stationary, we have $K_1(x,x')\equiv 1$ and $K_4(x,x')=0$, because $\mathcal{L}^*p_s=0$ when $p_s(x)=\frac{\sqrt{2\theta}}{\sqrt{2\pi}}\exp(-\theta x^2)$ is the stationary density.

3.2 Non-identifiability due to stationarity and symmetry

When the hypothesis space $\mathcal{H}$ has a dimension larger than that of the RKHS, the quadratic loss functional $\mathcal{E}_1$ may have multiple minimizers. The constraints of upper and lower bounds, as well as the loss functionals $\mathcal{E}_2$ and $\mathcal{E}_3$, can help to identify the observation function. However, as we show next, identifiability may still fail due to symmetry and/or stationarity.

Stationary processes

When the process $(X_t)$ is stationary, the moments in our loss functionals carry limited information. We have $\mathcal{E}_1(f)=\left|\mathbb{E}[Y_{t_1}]-\mathbb{E}[f(X_{t_1})]\right|^2$ with $K_1(x,x')\equiv 1$, so $\mathcal{E}_1$ can only identify a constant function. Also, the loss functional $\mathcal{E}_4\equiv 0$, because

\mathcal{L}^*p_s = \partial_s p_s = 0 \;\Longleftrightarrow\; \mathbb{E}[\mathcal{L}h(X_s)]=0 \text{ for any } h\in C_b^2.

In other words, the function space of identifiability by $\mathcal{E}_1+\mathcal{E}_4$ is the space of constant functions. Meanwhile, the quartic loss functionals $\mathcal{E}_2$ and $\mathcal{E}_3$ also provide limited information: they become $\mathcal{E}_2=\left|\mathbb{E}[f(X_{t_1})^2]-\mathbb{E}[Y_{t_1}^2]\right|^2$ and $\mathcal{E}_3=\left|\mathbb{E}[f(X_{t_2})f(X_{t_1})]-\mathbb{E}[Y_{t_2}Y_{t_1}]\right|^2$, i.e. the second-order moment and the temporal correlation at a single time instance.

To see the limitations, consider the finite-dimensional hypothesis space $\mathcal{H}$ in (2.15). As in (2.12), with $f=\sum_{i=1}^{n}c_i\phi_i$, the loss functional becomes

\mathcal{E}(f) = c^\top\overline{A}_1 c - 2c^\top\overline{b}_1^M + |\mathbb{E}[Y_{t_1}]|^2 + \sum_{k=2}^{3}\left|c^\top A_{k,1}c - b_{k,1}^M\right|^2,

where $\overline{A}_1$ is a rank-one matrix, and the term $\sum_{k=2}^{3}\left|c^\top A_{k,1}c-b_{k,1}^M\right|^2$ only brings in two additional constraints. Thus, $\mathcal{E}$ has multiple minimizers in a linear space with dimension greater than 3. One has to resort to the upper and lower bounds in (2.15) for additional constraints, which lead to minimizers on the boundary of the resulting convex set.

Symmetry

When the distribution of the state process $X_t$ is symmetric, a moment-based loss functional does not distinguish the true observation function from its symmetric counterpart. More specifically, if a transformation $R:\mathbb{R}\to\mathbb{R}$ preserves the distribution, i.e., $(X_t,t\geq 0)$ and $(R(X_t),t\geq 0)$ have the same distribution, then $\mathbb{E}[f(X_t)]=\mathbb{E}[f\circ R(X_t)]$ and $\mathbb{E}[f(X_t)f(X_s)]=\mathbb{E}[f\circ R(X_t)\,f\circ R(X_s)]$. Thus, our loss functional will not distinguish $f$ from $f\circ R$. However, this is reasonable: the two functions yield the same observation process (in distribution), so the observation data do not provide the information necessary to distinguish $f$ from $f\circ R$.

Example 3.12 (Brownian motion)

Consider the standard Brownian motion $X_t$, whose distribution is symmetric about $x=0$ (because the two processes $(X_t,t\geq 0)$ and $(-X_t,t\geq 0)$ have the same distribution). Let the transformation $R$ be $R(x)=-x$. Then the two functions $f(x)$ and $f(-x)$ lead to the same observation process, and thus they cannot be distinguished from the observations.

4 Numerical Examples

We demonstrate the effectiveness and limitations of our algorithm using synthetic data in representative examples. The algorithm works well when the state model's densities vary appreciably in time, yielding a function space of identifiability whose distance to the true observation function is small. In this case, our algorithm leads to a convergent estimator as the sample size increases. We also demonstrate that when the state process is stationary (e.g., the stationary Ornstein-Uhlenbeck process) or symmetric in distribution (e.g., Brownian motion), the loss functional can have multiple minimizers in the hypothesis space, preventing us from identifying the observation functions (see Section 4.3).

4.1 Numerical settings

We first introduce the numerical settings used in the tests.

Data generation.

The synthetic data $\{Y_{t_0:t_L}^{(m)}\}_{m=1}^{M}$ with $t_l=l\Delta t$ are generated from the state model, which is solved by the Euler-Maruyama scheme with time step $\Delta t=0.01$ for $L=100$ steps. We consider sample sizes $M\in\{\lfloor 10^{3.5+j\Delta}\rfloor : j=0,1,2,3,4,\ \Delta=0.0625\}$ to test the convergence of the estimator.

To estimate the moments in the $A$-matrices and $b$-vectors in (2.6)–(2.7) by Monte Carlo, we generate a new set of independent trajectories $\{X_{t_l}^{(m)}\}_{m=1}^{M'}$ with $M'=10^6$. We emphasize that these $X$ samples are independent of the data $\{Y_{t_0:t_L}^{(m)}\}_{m=1}^{M}$.
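For reference, here is a minimal sketch (Python) of how such state trajectories can be generated with the Euler-Maruyama scheme; the drift of the double-well model (4.1) of Section 4.2 and the mixture initial condition are hardcoded, and interpreting the mixture parameters as variances is an assumption.

```python
import numpy as np

def simulate_double_well(M, L=100, dt=0.01, rng=None):
    """Euler-Maruyama simulation of dX = (X - X^3) dt + dB, as in (4.1).

    Initial condition: an equal mixture of N(-0.5, 0.2) and N(1, 0.5),
    interpreting the second arguments as variances (an assumption).
    Returns an array of shape (M, L+1) with X_{t_0}, ..., X_{t_L}.
    """
    rng = np.random.default_rng() if rng is None else rng
    X = np.empty((M, L + 1))
    mix = rng.random(M) < 0.5
    X[:, 0] = np.where(mix,
                       rng.normal(-0.5, np.sqrt(0.2), M),
                       rng.normal(1.0, np.sqrt(0.5), M))
    for l in range(L):
        drift = X[:, l] - X[:, l]**3
        X[:, l + 1] = X[:, l] + drift * dt + np.sqrt(dt) * rng.normal(size=M)
    return X

# Usage: X = simulate_double_well(M=10**6)  # then Y = f_star(X) gives synthetic observations
```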

Inference algorithm.

We follow Algorithm 1 to search for the global minimum of the loss functionals in (2.12). The weights for the k\mathcal{E}_{k}’s are LM/mkYL\sqrt{M}/\|m_{k}^{Y}\|, where \|\cdot\| is the Euclidean norm on L\mathbb{R}^{L} and

mkY(l)=1Mm=1M(Ytl(m))k for k=1,2 and m3Y(l)=1Mm=1MYtl(m)Ytl+1(m),m_{k}^{Y}(l)=\frac{1}{M}\sum_{m=1}^{M}(Y_{t_{l}}^{(m)})^{k}\text{ for }k=1,2\quad\text{ and }\quad m_{3}^{Y}(l)=\frac{1}{M}\sum_{m=1}^{M}Y_{t_{l}}^{(m)}Y_{t_{l+1}}^{(m)},

for l=0,1,,L1l=0,1,\cdots,L-1.
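Given an array Y of shape (M, L+1) holding the observation trajectories, the empirical moments above and the corresponding weights can be computed as in the following sketch (a direct transcription of the formulas, under the assumption that the data are stored row-wise by trajectory):

import numpy as np

def empirical_moments(Y):
    # m_k^Y(l) for l = 0, ..., L-1 from Y of shape (M, L+1).
    m1 = Y[:, :-1].mean(axis=0)               # first moments
    m2 = (Y[:, :-1] ** 2).mean(axis=0)        # second moments
    m3 = (Y[:, :-1] * Y[:, 1:]).mean(axis=0)  # one-step temporal correlations
    return m1, m2, m3

def loss_weights(Y):
    # Weights L*sqrt(M)/||m_k^Y|| balancing the three terms of the loss.
    M, L = Y.shape[0], Y.shape[1] - 1
    return [L * np.sqrt(M) / np.linalg.norm(mk) for mk in empirical_moments(Y)]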

For each example, we test B-spline hypothesis spaces \mathcal{H} with degrees in \{0,1,2,3\} and dimension in the range [1,N] selected by Algorithm 2. We then select the dimension and degree that minimize the 2-Wasserstein distance between the predicted and true distributions of Y. The details are presented in Section C.

Results assessment and presentation.

We present three aspects of the estimator f^\widehat{f}:

  • Estimated and true functions. We compare the estimator with the true function f_{*}, along with the L^{2}(\widebar{\rho}_{T}^{L}) projection of f_{*} onto the linear space spanned by the elements of \mathcal{H}.

  • 2-Wasserstein distance. We present the 2-Wasserstein distance (see (B.5)) between the distributions of Y_{t_{l}}=f_{*}(X_{t_{l}}) and \widehat{f}(X_{t_{l}}) at each time, computed with both the training data and a newly generated independent data set.

  • Convergence of L^{2}(\widebar{\rho}_{T}^{L}) error. We test the convergence of the estimator in L^{2}(\widebar{\rho}_{T}^{L}) as the sample size M increases. The L^{2}(\widebar{\rho}_{T}^{L}) error is computed by a Riemann sum approximation. We present the mean and standard deviation of the L^{2}(\widebar{\rho}_{T}^{L}) errors from 20 independent simulations. The convergence rate is also highlighted, and we compare it with the minimax convergence rate in classical nonparametric regression (see, e.g., [13, 25]), which is \frac{s}{2s+1} with s-1 being the degree of the B-spline basis. Such a minimax rate is not yet available for our method; see Remark 3.8.

Figure 1: Empirical densities from the data trajectories of the state-model process (X_{t_{l}}) in (4.1) and the observation processes (Y_{t_{l}}) with f_{*}=f_{i}, where the f_{i}'s are the three observation functions in (4.2). Since we do not observe data pairs (X_{t_{l}}^{(m)},Y_{t_{l}}^{(m)}), these empirical densities are the only information available from the data. Our goal is to find the function f in the operator that maps the densities of \{X_{t_{l}}\} to the densities of \{Y_{t_{l}}\}.

4.2 Examples

The state model we consider is a stochastic differential equation with the double-well potential

dXt\displaystyle dX_{t} =(XtXt3)dt+dBt,Xt0pt0\displaystyle=(X_{t}-X_{t}^{3})dt+dB_{t},X_{t_{0}}\sim p_{t_{0}} (4.1)

where the density of X_{t_{0}} is an equal-weight mixture of \mathcal{N}(-0.5,0.2) and \mathcal{N}(1,0.5). The distribution of X_{t_{0}:t_{L}} is non-symmetric and far from stationary (see Figure 1(a)). Thus, the quadratic loss functional \mathcal{E}_{1} provides a rich RKHS for learning.

We consider three observation functions representing typical challenges on the set \mathrm{supp}(\widebar{\rho}_{T}): nearly invertible, non-invertible, and non-invertible and discontinuous:

Sine function: f1(x)=\displaystyle\quad f_{1}(x)= sin(x);\displaystyle\sin(x); (4.2)
Sine-Cosine function: f2(x)=\displaystyle\quad f_{2}(x)= 2sin(x)+cos(6x);\displaystyle 2\sin(x)+\cos(6x);
Arch function: f3(x)=\displaystyle\quad f_{3}(x)= (2(1x)3+1.5(1x)+0.5)𝟏x[0,1].\displaystyle\left(-2(1-x)^{3}+1.5(1-x)+0.5\right)\mathbf{1}_{x\in[0,1]}.

These functions are shown in Figures 2(a)–4(a). They lead to observation processes with dramatically different distributions, as shown in Figure 1(b–d).

Figure 2: Learning results of Sine function f1(x)=sin(x)f_{1}(x)=\sin(x) with model (4.1).
Figure 3: Learning results of Sine-Cosine function f2(x)=2sin(x)+cos(6x)f_{2}(x)=2\sin(x)+\cos(6x) with model (4.1).
Figure 4: Learning results of Arch function f3f_{3} with model (4.1).

The learning results for these three functions are shown in Figures 2–4. For each of the three observation functions, we present the estimator with the optimal hypothesis space, the 2-Wasserstein distances of the prediction, and the convergence of the estimator in L^{2}(\widebar{\rho}_{T}^{L}) (see Section 4.1 for details).

Sine function: Fig. 2(a) shows the estimator with a degree-1 B-spline basis of dimension n=9 for M=10^{6}. The L^{2}(\widebar{\rho}_{T}^{L}) error is 0.0245 and the relative error is 3.47%. Fig. 2(b) shows that the Wasserstein distances are small, at the scale of 10^{-3}. Fig. 2(c) shows that the convergence rate of the L^{2}(\widebar{\rho}_{T}^{L}) error is 0.46. This rate is close to the minimax rate \frac{2}{5}=0.4.

Sine-Cosine function: Fig. 3(a) shows the estimator with a degree-2 B-spline basis of dimension n=13. The L^{2}(\widebar{\rho}_{T}^{L}) error is 0.1596 and the relative error is 9.90\%. Fig. 3(b) shows that the Wasserstein distances are at the scale of 10^{-2}. Fig. 3(c) shows that the convergence rate of the L^{2}(\widebar{\rho}_{T}^{L}) error is 0.26, less than the classical minimax rate \frac{3}{7}\approx 0.43. Note also that the variance of the L^{2} error does not decrease as M increases. In comparison with the results for f_{1} in Fig. 2(a), we attribute the relatively low convergence rate and the large variance to the high-frequency component \cos(6x), which is harder to identify from moments than the low-frequency component \sin(x).

Arch function: Fig. 4(a) shows the estimator with a degree-0 B-spline basis of dimension n=45. The L^{2}(\widebar{\rho}_{T}^{L}) error is 0.0645 and the relative error is 14.44\%. Fig. 4(b) shows that the Wasserstein distances are small, at the scale of 10^{-2}. Fig. 4(c) shows that the convergence rate of the L^{2}(\widebar{\rho}_{T}^{L}) error is 0.17, less than the would-be minimax rate \frac{1}{3}\approx 0.33.

Arch function with observation noise: To demonstrate that our method can tolerate large observation noise, we present the estimation results from noisy observations of the Arch function, which is the most difficult of the three examples. Suppose that the observation noise \xi in (2.17) is i.i.d. \mathcal{N}(0,0.25). Note that the average of \mathbb{E}\left[{|Y_{t}|^{2}}\right] is about 0.2, so the signal-to-noise ratio is about \frac{\mathbb{E}[|Y|^{2}]}{\mathbb{E}[\xi^{2}]}\approx 0.8; thus, the noise is relatively large. Nevertheless, our method can identify the function using the moments of the noise, as discussed in Section 2.5. Fig. 5(a) shows the estimator with a degree-1 B-spline basis of dimension n=24. The L^{2}(\widebar{\rho}_{T}^{L}) error is 0.1220 and the relative error is 27.32\%. Fig. 5(b) shows that the Wasserstein distances are small, at the scale of 10^{-3}. The Wasserstein distances are approximated from samples of the noisy data Y=f_{*}(X)+\xi and the noisy prediction \widehat{Y}=\widehat{f}(X)+\xi. Fig. 5(c) shows that the convergence rate of the L^{2}(\widebar{\rho}_{T}^{L}) error is 0.14. The estimation is not as accurate as in the noise-free case because the noisy observation data lead to milder lower and upper bound constraints in (2.15). We emphasize that this tolerance to noise is exceptional for such an ill-posed inverse problem; the key is our use of moments, which averages the noise so that its effect is of order O(\frac{1}{\sqrt{M}}).

Figure 5: Learning results of Arch function f3f_{3} with model (4.1) and i.i.d Gaussian observation noise.

We have also tested piecewise constant observation functions. Our method has difficulty identifying such functions, due to two issues: (i) the uniform partition often misses the jump discontinuities (so even the projection of f_{*} has a large error); and (ii) the moments we consider depend on the observation function non-locally, and thus provide limited information to distinguish the true function from its local perturbations. We leave it for future research to overcome these difficulties by searching for the jump discontinuities and by introducing moments that detect local information.

4.3 Limitations

We demonstrate by examples the non-identifiability due to symmetry and stationarity.

Symmetric distribution

Let the state model be the Brownian motion with initial distribution Unif(0,1)\text{Unif}(0,1). The state process (Xt)(X_{t}) has a distribution that is symmetric with respect to the line x=12x=\frac{1}{2}, i.e., the processes (Xt)(X_{t}) and (1Xt)(1-X_{t}) have the same distribution. Thus, with the reflection function R(x)=1xR(x)=1-x, the processes f(Xt)f(X_{t}) and fR(Xt)f\circ R(X_{t}) have the same distribution, and the observation data does not provide information for distinguishing ff from fRf\circ R. The loss functional (2.4) has at least two minima.

Figure 6 shows that our algorithm finds the reflection of the true function f=sin(x)f_{*}=\sin(x). The hypothesis space \mathcal{H} has B-spline basis functions with degree 2 and dimension 58. Our estimator is close to fR(x)=sin(1x)f_{*}\circ R(x)=\sin(1-x). Its L2(ρ¯TL)L^{2}(\widebar{\rho}_{T}^{L}) error is 1.12441.1244 and its reflection’s L2(ρ¯TL)L^{2}(\widebar{\rho}_{T}^{L}) error is 0.07900.0790. Both the estimator and its reflection correctly predict the distribution of the observation process (Yt)(Y_{t}).

Figure 6: Learning results of f(x)=sin(x)f_{*}(x)=\sin(x) with the state model being Xt=Bt+X0X_{t}=B_{t}+X_{0} where X0Unif(0,1)X_{0}\sim\mathrm{Unif}(0,1). Due to the symmetry with respect to the line x=12x=\frac{1}{2}, the estimator f^(x)\widehat{f}(x) and its reflection f^(1x)\widehat{f}(1-x) are indistinguishable by the loss functional and they lead to similar prediction of the distribution of {Ytl}\{Y_{t_{l}}\}.

Stationary process

When the diffusion process (X_{t}) is stationary, the loss functional (2.4) provides limited information about the observation function. As discussed in Section 3.2, the matrix \overline{A}_{1} has rank 1, and \mathcal{E}_{2}=0 and \mathcal{E}_{3}=0 lead to only two more constraints. The constraints from the upper and lower bounds in (2.15) play a major role in leading to a minimizer at the boundary of the convex set \mathcal{H}.

Figure 7 shows the learning results with the stationary Ornstein-Uhlenbeck process dX_{t}=-X_{t}dt+dB_{t} and with the observation function f_{*}(x)=\sin(x). The stationary density of (X_{t}) is \mathcal{N}(0,\frac{1}{2}). Due to the limited information, the estimator has a large L^{2}(\widebar{\rho}_{T}^{L}) error of 0.2656, and its prediction has large 2-Wasserstein distances oscillating near 0.1290.

Figure 7: Learning results for f_{*}(x)=\sin(x) with the stationary Ornstein-Uhlenbeck process. Due to the limited information from the moments, the estimator relies heavily on the upper and lower bound constraints and is inaccurate.

5 Discussions and conclusion

We have proposed a nonparametric learning method to estimate the observation functions in nonlinear state-space models. It matches generalized moments via constrained regression. The algorithm is suitable for large sets of unlabeled data. Moreover, it can deal with challenging cases in which the observation function is non-invertible. We address the fundamental issue of identifiability from first-order moments. We show that the function spaces of identifiability are the closures of RKHSs intrinsic to the state model. Numerical examples show that the first two moments and temporal correlations, along with upper and lower bounds, can identify functions ranging from piecewise polynomials to smooth functions, and can tolerate considerable observation noise. The limitations of this method, such as non-identifiability due to symmetry and stationarity, are also discussed.

This study provides a first step in the unsupervised learning of latent dynamics from abundant unlabeled data. Several directions call for further exploration: (i) a mixture of unsupervised and supervised learning that combines unlabeled data with limited labeled data, particularly for high-dimensional functions; (ii) enlarging the function space of learning, either by constructing additional first-order generalized moments or by designing experiments to collect more informative data; (iii) joint estimation of the observation function and the state model.

Appendix A A review of RKHS

Positive definite functions

We review the definitions and properties of positive definite kernels. The following is a real-variable version of the definition in [1, p.67].

Definition A.1 (Positive definite function)

Let X be a nonempty set. A function G:X\times X\rightarrow\mathbb{R} is positive definite if and only if it is symmetric (i.e., G(x,y)=G(y,x)) and \sum_{j,k=1}^{n}c_{j}c_{k}G(x_{j},x_{k})\geq 0 for all n\in\mathbb{N}, \{x_{1},\ldots,x_{n}\}\subset X and \mathbf{c}=(c_{1},\ldots,c_{n})\in\mathbb{R}^{n}. The function G is strictly positive definite if the equality holds only when \mathbf{c}=\mathbf{0}\in\mathbb{R}^{n}.

Theorem A.2 (Properties of positive definite kernels)

Suppose that k,k1,k2:X×Xd×dk,k_{1},k_{2}:X\times X\subset\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R} are positive definite kernels. Then

  1. (a)

    k1k2k_{1}k_{2} is positive definite. ([1, p.69])

  2. (b)

    Inner product u,v=j=1dujvj\langle u,v\rangle=\sum_{j=1}^{d}u_{j}v_{j} is positive definite ([1, p.73])

  3. (c)

    f(u)f(v)f(u)f(v) is positive definite for any function f:Xf:X\to\mathbb{R} ([1, p.69]).

RKHS and positive integral operators

We review the definitions and properties of the Mercer kernel, the RKHS, and the related integral operator; see, e.g., [7] for the compact-domain case and [34] for the non-compact case.

Let (X,d) be a metric space and G:X\times X\to\mathbb{R} be continuous and symmetric. We say that G is a Mercer kernel if it is positive definite (as in Definition A.1). The reproducing kernel Hilbert space (RKHS) \mathcal{H}_{G} associated with G is defined to be the closure of \mathrm{span}\{G(x,\cdot):x\in X\} with the inner product

f,gG=i=1,j=1n,mcidjG(xi,yj)\langle f,g\rangle_{\mathcal{H}_{G}}=\sum_{i=1,j=1}^{n,m}c_{i}d_{j}G(x_{i},y_{j})

for any f=\sum_{i=1}^{n}c_{i}G(x_{i},\cdot) and g=\sum_{j=1}^{m}d_{j}G(y_{j},\cdot). It is the unique Hilbert space such that: (1) the linear space \mathrm{span}\{G(\cdot,y),y\in X\} is dense in it; (2) it has the reproducing property, in the sense that for all f\in\mathcal{H}_{G} and x\in X, f(x)=\langle G(x,\cdot),f\rangle_{\mathcal{H}_{G}} (see [7, Theorem 2.9]).

By means of the Mercer Theorem, we can characterize the RKHS G\mathcal{H}_{G} through the integral operator associated with the kernel. Let μ\mu be a nondegenerate Borel measure on (X,d)(X,d) (that is, μ(U)>0\mu(U)>0 for every open set UXU\subset X). Define the integral operator LGL_{G} on L2(X,μ)L^{2}(X,\mu) by

LGf(x)=XG(x,y)f(y)𝑑μ(y).L_{G}f(x)=\int_{X}G(x,y)f(y)d\mu(y).

The RKHS has the operator characterization (see e.g., [7, Section 4.4] and [34]):

Theorem A.3

Assume that G is a Mercer kernel and G\in L^{2}(X\times X,\mu\otimes\mu). Then

  1. 1.

    LGL_{G} is a compact positive self-adjoint operator. It has countably many positive eigenvalues {λi}i=1\{\lambda_{i}\}_{i=1}^{\infty} and corresponding orthonormal eigenfunctions {ϕi}i=1\{\phi_{i}\}_{i=1}^{\infty}. Note that when zero is an eigenvalue of LGL_{G}, the linear space H=span{ϕi}i=1H=\mathrm{span}\{\phi_{i}\}_{i=1}^{\infty} is a proper subspace of L2(μ)L^{2}(\mu).

  2. 2.

    {λiϕi}i=1\{\sqrt{\lambda_{i}}\phi_{i}\}_{i=1}^{\infty} is an orthonormal basis of the RKHS G\mathcal{H}_{G}.

  3. 3.

    The RKHS is the image of the square root of the integral operator, i.e., G=LG1/2L2(X,μ)\mathcal{H}_{G}=L_{G}^{1/2}L^{2}(X,\mu).
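A minimal numerical illustration of Theorem A.3 is sketched below, under assumed choices (a Gaussian kernel on X=[0,1] and \mu taken as the uniform measure, discretized by a midpoint quadrature rule): the eigen-decomposition of the weighted kernel matrix approximates the eigenpairs of L_{G}, and the resulting eigenfunctions are orthonormal in the discretized L^{2}(X,\mu).

import numpy as np

# Assumed setup: Gaussian kernel on [0, 1], mu = uniform measure,
# discretized on a midpoint grid with equal quadrature weights w_j = 1/N.
N = 400
x = (np.arange(N) + 0.5) / N
w = np.full(N, 1.0 / N)
G = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.1)

# Nystrom approximation of L_G: the symmetrized matrix sqrt(W) G sqrt(W)
# shares its eigenvalues with the discretized integral operator.
sqw = np.sqrt(w)
lam, V = np.linalg.eigh(sqw[:, None] * G * sqw[None, :])
lam, V = lam[::-1], V[:, ::-1]          # non-increasing order
phi = V / sqw[:, None]                  # eigenfunction values on the grid

# Check the discrete L^2(mu)-orthonormality of the eigenfunctions.
print(np.allclose((phi.T * w) @ phi, np.eye(N)))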

Appendix B Algorithm details

B.1 B-spline basis and dimension of the hypothesis space

The choice of hypothesis space is important for nonparametric regression. We choose the basis functions to be B-splines. To select the dimension of the hypothesis space, we introduce a new algorithm to estimate a range for the dimension, and we then select the dimension that minimizes the 2-Wasserstein distance between the distributions of the data and of the prediction.

B-Spline basis functions

We briefly review the definition of B-spline basis functions and we refer to [29, Chapter 2] and [26] for details. Given a nondecreasing sequence of real numbers, called knots, (r0,r1,,rm)(r_{0},r_{1},\ldots,r_{m}), the B-spline basis functions of degree pp, denoted by {Ni,p}i=0mp1\{N_{i,p}\}_{i=0}^{m-p-1}, are defined recursively as

Ni,0(r)={1,rir<ri+10,otherwise,Ni,p(r)=rriri+priNi,p1(r)+ri+p+1rri+p+1ri+1Ni+1,p1(r).\displaystyle N_{i,0}(r)=\left\{\begin{array}[]{lr}1,&r_{i}\leq r<r_{i+1}\\ 0,&\text{otherwise}\end{array}\right.,\qquad N_{i,p}(r)=\frac{r-r_{i}}{r_{i+p}-r_{i}}N_{i,p-1}(r)+\frac{r_{i+p+1}-r}{r_{i+p+1}-r_{i+1}}N_{i+1,p-1}(r).

Each function N_{i,p} is a nonnegative local polynomial of degree p, supported on [r_{i},r_{i+p+1}]. At a knot with multiplicity k, it is p-k times continuously differentiable. Hence, the differentiability increases with the degree and decreases as the knot multiplicity increases. The basis satisfies a partition of unity property: for each r\in[r_{i},r_{i+1}], \sum_{j}N_{j,p}(r)=\sum_{j=i-p}^{i}N_{j,p}(r)=1.

We set the knots of the spline functions to be a uniform partition of [R_{min},R_{max}] (the support of the measure \widebar{\rho}_{T}^{L} in (2.16)): R_{min}=r_{0}\leq r_{1}\leq\cdots\leq r_{m}=R_{max}. For each degree p, we set the basis functions of the hypothesis space \mathcal{H}, which has dimension n=m-p, to be

ϕi(r)=Ni,p(r),i=0,,mp1.\phi_{i}(r)=N_{i,p}(r),\ i=0,\dots,m-p-1.

Thus, the basis functions \{\phi_{i}\} are piecewise degree-p polynomials with knots adapted to the support of \widebar{\rho}_{T}^{L}.
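For completeness, the following sketch evaluates the degree-p B-spline basis on a uniform knot vector by the recursion above (using the convention 0/0=0); it is a direct implementation of the definition, not the code used for the experiments.

import numpy as np

def bspline_basis(r, knots, p):
    # Cox-de Boor recursion: returns an array of shape (len(r), m - p)
    # whose columns are N_{0,p}, ..., N_{m-p-1,p} evaluated at r.
    r = np.atleast_1d(np.asarray(r, dtype=float))
    t = np.asarray(knots, dtype=float)
    # Degree 0: indicators of [t_i, t_{i+1}).
    N = np.array([(t[i] <= r) & (r < t[i + 1]) for i in range(len(t) - 1)],
                 dtype=float).T
    for q in range(1, p + 1):
        Nq = np.zeros((len(r), len(t) - q - 1))
        for i in range(len(t) - q - 1):
            d1, d2 = t[i + q] - t[i], t[i + q + 1] - t[i + 1]
            left = (r - t[i]) / d1 * N[:, i] if d1 > 0 else 0.0
            right = (t[i + q + 1] - r) / d2 * N[:, i + 1] if d2 > 0 else 0.0
            Nq[:, i] = left + right
        N = Nq
    return N

# Uniform knots r_0, ..., r_m on [Rmin, Rmax]; n = m - p basis functions.
Rmin, Rmax, m, p = -2.0, 2.0, 12, 2
knots = np.linspace(Rmin, Rmax, m + 1)
r = np.linspace(knots[p], knots[m - p], 200, endpoint=False)
B = bspline_basis(r, knots, p)
print(B.shape, np.allclose(B.sum(axis=1), 1.0))  # partition of unity on [r_p, r_{m-p})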

Dimension of the hypothesis space.

The choice of the dimension n of \mathcal{H} is important to avoid under- and over-fitting: we choose it by minimizing the 2-Wasserstein distance between the empirical distribution of the observed process (Y_{t}) and that predicted by our estimated observation function. To reduce the computational burden, we proceed in two steps: first we determine a rough range for n, and then within this range we select the dimension with the minimal Wasserstein distance.

Step 1: we introduce an algorithm, called Cross-validating Estimation of Dimension Range (CEDR), to estimate the range [1,N][1,N] for the dimension of the hypothesis space, based on the quadratic loss functional 1\mathcal{E}_{1}. Its main idea is to restrict NN to avoid overly amplifying the estimator’s sampling error, which is estimated by splitting the data into two sets. It incorporates the function space of identifiability in Section 3.1 into the SVD analysis [9, 15] of the normal matrix and vector from 1\mathcal{E}_{1}.

The CEDR algorithm estimates the sampling error in the minimizer of loss functional 1\mathcal{E}_{1} through SVD analysis in three steps. First, we compute the normal matrix A¯1\overline{A}_{1} and vector b¯1\overline{b}_{1} in (2.6) by Monte Carlo; to estimate the sampling error in b¯1\overline{b}_{1}, we compute two copies, bb and bb^{\prime}, of b¯1\overline{b}_{1} from two halves of the data:

b(i)=1Ll=1L𝔼[ϕi(Xtl)]2Mm=1M2Ytl(m),b(i)=1Ll=1L𝔼[ϕi(Xtl)]2Mm=M2+1MYtl(m).\displaystyle b(i)=\frac{1}{L}\sum_{l=1}^{L}\mathbb{E}\left[{\phi_{i}(X_{t_{l}})}\right]\frac{2}{M}\sum_{m=1}^{\lfloor\frac{M}{2}\rfloor}{Y_{t_{l}}^{(m)}},\quad b^{\prime}(i)=\frac{1}{L}\sum_{l=1}^{L}\mathbb{E}\left[{\phi_{i}(X_{t_{l}})}\right]\frac{2}{M}\sum_{m=\lfloor\frac{M}{2}\rfloor+1}^{M}{Y_{t_{l}}^{(m)}}. (B.1)

Second, we perform an eigen-decomposition to find an orthonormal basis of L^{2}(\widebar{\rho}_{T}^{L}), the default function space of learning. We view the matrix \overline{A}_{1} as a representation of the integral operator L_{K_{1}} in Lemma 3.4 on \mathcal{H}=\mathrm{span}\{\phi_{i}\}_{i=1}^{n}. The eigen-decomposition requires solving the generalized eigenvalue problem

A¯1u=λBu, where B=(ϕi,ϕjL2(ρ¯TL))\overline{A}_{1}u=\lambda Bu,\quad\text{ where }B=(\langle{\phi_{i},\phi_{j}}\rangle_{L^{2}(\widebar{\rho}_{T}^{L})}) (B.2)

(see [20, Theorem 5.1]). Denote the eigen-pairs by \{\sigma_{i},u_{i}\}, where the eigenvalues \{\sigma_{i}\} are non-increasingly ordered and the eigenvectors are normalized so that u_{i}^{\top}Bu_{j}=\delta_{i,j}. Thus, we have \overline{A}_{1}=\sum_{i=1}^{n}\sigma_{i}(Bu_{i})(Bu_{i})^{\top} (assuming that all the \sigma_{i}'s are positive; otherwise, we drop the zero eigenvalues). The least-squares estimators from b and b^{\prime} are c=\sum_{i=1}^{n}\frac{u_{i}^{\top}b}{\sigma_{i}}u_{i} and c^{\prime}=\sum_{i=1}^{n}\frac{u_{i}^{\top}b^{\prime}}{\sigma_{i}}u_{i}, respectively. Third, the difference between the corresponding function estimators represents the sampling error (with \Delta c=c-c^{\prime})

g(n):=\displaystyle g(n):= f^f^L2(ρ¯TL)2=k=1nΔckϕkL2(ρ¯TL)2=i,j=1nΔciϕi,ϕjL2(ρ¯TL)Δcj=ΔcBΔc\displaystyle\|\widehat{f}-\widehat{f}^{\prime}\|_{L^{2}(\widebar{\rho}_{T}^{L})}^{2}=\|\sum_{k=1}^{n}\Delta c_{k}\phi_{k}\|_{L^{2}(\widebar{\rho}_{T}^{L})}^{2}=\sum_{i,j=1}^{n}\Delta c_{i}\langle{\phi_{i},\phi_{j}}\rangle_{L^{2}(\widebar{\rho}_{T}^{L})}\Delta c_{j}=\Delta c^{\top}B\Delta c (B.3)
=i,j=1nui(bb)σiuiBujuj(bb)σj=i=1nri2,\displaystyle=\sum_{i,j=1}^{n}\frac{u_{i}^{\top}(b-b^{\prime})}{\sigma_{i}}u_{i}^{\top}Bu_{j}\frac{u_{j}^{\top}(b-b^{\prime})}{\sigma_{j}}=\sum_{i=1}^{n}r_{i}^{2},

where r_{i}=\frac{|u_{i}^{\top}(b-b^{\prime})|}{\sigma_{i}}. The ratio r_{i} is in the same spirit as the Picard projection ratio \frac{|u_{i}^{\top}b|}{\sigma_{i}} in [15], which is used to detect overfitting. Note that the eigenvalues \sigma_{i} decay to zero as n increases because the operator L_{K_{1}} is compact. Clearly, the sampling error g(n) should be smaller than \|f_{*}\|_{L^{2}(\widebar{\rho}_{T}^{L})}^{2}, which is the time-average of the second moments of the observations. Thus, we set N to be

N\displaystyle N =max{k1:g(k)τ}, where τ=1LMl=1,m=1L,M|Ytl(m)|2.\displaystyle=\max\{k\geq 1:g(k)\leq\tau\},\text{ where }\tau=\frac{1}{LM}\sum_{l=1,m=1}^{L,M}|Y_{t_{l}}^{(m)}|^{2}. (B.4)

We note that this threshold is relatively crude and neglects the rich information contained in g; refining it is a subject worthy of further investigation.

Algorithm 2 summarizes the above procedure.

Algorithm 2 Cross-validating Estimation of Dimension Range (CEDR) for hypothesis space
1:The state model and data {Yt0:tL(m)}m=1M\{Y_{t_{0}:t_{L}}^{(m)}\}_{m=1}^{M}.
2:A range [1,N][1,N] for the dimension of the hypothesis space for further selection.
3:Estimate the empirical density ρ¯T\widebar{\rho}_{T} in (2.16) and find its support [Rmin,Rmax][R_{min},R_{max}].
4:Set n=1n=1 and g(n)=0g(n)=0. Estimate the threshold τ\tau in (B.4).
5:while g(n)τg(n)\leq\tau do
6:     Set nn+1n\leftarrow n+1. Update the basis functions, Fourier or B-spline, as in Section 2.3.
7:     Compute normal matrix A¯1\overline{A}_{1} in (2.6) by Monte Carlo. Also, compute bb and bb^{\prime} in (B.1).
8:     Solve the generalized eigenvalue problem (B.2) for \overline{A}_{1}; return the eigen-pairs \{(\sigma_{i},u_{i})\}_{i=1}^{n} with u_{i}^{\top}Bu_{j}=\delta_{i,j}.
9:     Compute the Picard projection ratios: ri=|ui(bb)|σir_{i}=\frac{|u_{i}^{\top}(b-b^{\prime})|}{\sigma_{i}} for i=1,,ni=1,\ldots,n and g(n)=i=1nri2g(n)=\sum_{i=1}^{n}r_{i}^{2}.
10:Return N=nN=n.
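A condensed sketch of the CEDR loop is given below; it uses scipy.linalg.eigh to solve the generalized eigenvalue problem (B.2), and assumes a user-supplied routine build_quantities(n) (a hypothetical helper) that returns the Monte Carlo estimates \overline{A}_{1}, b, b^{\prime}, and B of (2.6), (B.1), and (B.2) for a basis of dimension n.

import numpy as np
from scipy.linalg import eigh

def cedr_upper_bound(build_quantities, tau, n_max=100):
    # Return the largest dimension n with sampling-error indicator g(n) <= tau,
    # following (B.3)-(B.4). build_quantities(n) is an assumed helper returning
    # (Abar1, b, bprime, B) for a basis of dimension n.
    N = 1
    for n in range(2, n_max + 1):
        Abar1, b, bprime, B = build_quantities(n)
        # Generalized eigenvalue problem Abar1 u = sigma B u with u^T B u = 1.
        sigma, U = eigh(Abar1, B)
        sigma, U = sigma[::-1], U[:, ::-1]        # non-increasing order
        keep = sigma > 1e-12 * sigma[0]           # drop (near-)zero eigenvalues
        r = (U[:, keep].T @ (b - bprime)) / sigma[keep]
        g = np.sum(r**2)                          # the indicator g(n) in (B.3)
        if g > tau:
            break
        N = n
    return N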

Step 2: We select the dimension n and degree of the B-spline basis functions to be those with the smallest 2-Wasserstein distance between the distribution of the data and that of the predictions. More precisely, let \mu^{f}_{t_{l}} and \mu^{\widehat{f}}_{t_{l}} denote the distributions of Y_{t_{l}}=f(X_{t_{l}}) and \widehat{f}(X_{t_{l}}), respectively. Let F_{t_{l}} and \widehat{F}_{t_{l}} denote their cumulative distribution functions (CDFs), with F_{t_{l}}^{-1} and \widehat{F}_{t_{l}}^{-1} being their inverses. We compute F_{t_{l}} from the data and \widehat{F}_{t_{l}} from independent simulations. We approximate the inverses by empirical quantiles and compute the 2-Wasserstein distance

(1Ll=1LW2(μtlf,μtlf^)2)1/2 with W2(μtlf,μtlf^)=(01(Ftl1(r)F^tl1(r))2𝑑r)12,\displaystyle\left(\frac{1}{L}\sum_{l=1}^{L}W_{2}(\mu^{f}_{t_{l}},\mu^{\widehat{f}}_{t_{l}})^{2}\right)^{1/2}\quad\text{ with }W_{2}(\mu^{f}_{t_{l}},\mu^{\widehat{f}}_{t_{l}})=\left(\int_{0}^{1}(F^{-1}_{t_{l}}(r)-\widehat{F}^{-1}_{t_{l}}(r))^{2}dr\right)^{\frac{1}{2}}, (B.5)

This method of computing the Wasserstein distance is based on an observation in [5], and it has been used in [28, 19]. Recall that the 2-Wasserstein distance W_{2}(\mu,\nu) of two probability measures \mu and \nu on \Omega with finite second moments is given by W_{2}(\mu,\nu)=\left(\inf\limits_{\gamma\in\Gamma(\mu,\nu)}\int_{\Omega\times\Omega}|x-y|^{2}d\gamma(x,y)\right)^{\frac{1}{2}}, where \Gamma(\mu,\nu) denotes the set of all measures on \Omega\times\Omega with \mu and \nu as marginals. Let F and G be the CDFs of \mu and \nu, respectively, and let F^{-1} and G^{-1} be their quantile functions. Then the L^{2} distance of the quantile functions, d_{2}(\mu,\nu)=\left(\int_{0}^{1}|F^{-1}(r)-G^{-1}(r)|^{2}dr\right)^{\frac{1}{2}}, is equal to the 2-Wasserstein distance W_{2}(\mu,\nu).
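A minimal sketch of the quantile-based computation of (B.5) from samples is given below (using numpy's empirical quantiles; a simplified stand-in for the actual implementation).

import numpy as np

def w2_from_samples(y, yhat, n_quantiles=1000):
    # Approximate W_2 between two one-dimensional empirical distributions
    # via the L^2 distance of their quantile functions.
    r = (np.arange(n_quantiles) + 0.5) / n_quantiles
    return np.sqrt(np.mean((np.quantile(y, r) - np.quantile(yhat, r)) ** 2))

def averaged_w2(Y, Yhat):
    # Time-averaged 2-Wasserstein distance in (B.5); Y and Yhat hold samples
    # of Y_{t_l} = f(X_{t_l}) and of f_hat(X_{t_l}), one column per time t_l.
    L = Y.shape[1] - 1
    d = [w2_from_samples(Y[:, l], Yhat[:, l]) for l in range(1, L + 1)]
    return np.sqrt(np.mean(np.square(d)))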

B.2 Optimization with multiple initial conditions

With the convex hypothesis space in (2.15), the minimization in (2.12) is a constrained optimization problem and may have multiple local minima. Note that the loss functional \mathcal{E}^{M}(c) in (2.12) consists of a quadratic term and two quartic terms. The quadratic term, which represents \mathcal{E}_{1}^{M} in (2.5), has Hessian matrix \overline{A}_{1}, which is often rank deficient because it is an average of rank-one matrices by its definition (2.6). Thus, the quadratic term has a valley of minima along the kernel of \overline{A}_{1}. The two quartic terms have valleys of minima at the intersections of the ellipse-shaped manifolds \{c\in\mathbb{R}^{n}:c^{\top}A_{k,l}c=b_{k,l}^{M}\}_{l=1}^{L} for k=2,3. Symmetry in the distribution of the state process can also lead to multiple minima (see Section 3.2 for further discussion).

To reduce the possibility of obtaining a local minimum, we search for a minimizer from multiple initial conditions. We consider the following initial conditions: (1) the least squares estimator for the quadratic term; (2) the minimizer of the quadratic term in the hypothesis space, solved by least squares with linear constraints using the MATLAB function lsqlin, starting from the least squares estimator; (3) the minimizers of the quartic terms over the hypothesis space, found by constrained optimization with the MATLAB function fmincon using the interior-point algorithm. Then, among the minimizers from these initial conditions, we take the one leading to the smallest 2-Wasserstein distance.
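A schematic of the multi-start strategy is sketched below, using scipy.optimize in place of the MATLAB routines above and simple box constraints on the coefficients as a stand-in for the pointwise constraints in (2.15); loss, bounds, initial_guesses, and w2_of are assumed to be supplied by the surrounding pipeline.

import numpy as np
from scipy.optimize import minimize

def multistart_minimize(loss, bounds, initial_guesses, w2_of):
    # Minimize the loss from several initial coefficient vectors within box
    # constraints, then keep the candidate with the smallest 2-Wasserstein
    # prediction error (computed by the assumed helper w2_of).
    candidates = []
    for c0 in initial_guesses:
        res = minimize(loss, np.asarray(c0, dtype=float),
                       method="L-BFGS-B", bounds=bounds)
        candidates.append(res.x)
    return min(candidates, key=w2_of)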

Appendix C Selection of dimension and degree of the B-spline basis

We demonstrate the selection of the dimension and degree of the B-spline basis functions of the hypothesis space. As described in Section 2.3, we select the dimension and degree in two steps: we first select a rough range for the dimension by the Cross-validating Estimation of Dimension Range (CEDR) algorithm; then we pick the dimension and degree to be the ones with minimal 2-Wasserstein distance between the true and estimated distribution of the observation processes.

The CEDR algorithm helps to reduce the computational cost by estimating the dimension range for the hypothesis space. It is based on an SVD analysis of the normal matrix A¯1\overline{A}_{1} and vector b¯1\overline{b}_{1} from the quadratic loss functional 1\mathcal{E}_{1}. The key idea is to control the sampling error’s effect on the estimator in the metric of the function space of learning. The sampling error is estimated by computing two copies of the normal vector through splitting the data into two halves. The function space of learning plays an important role here: it directs us to use a generalized eigenvalue problem for the SVD analysis. This is different from the classical SVD analysis in [15], where the information of the function space is neglected.

Figure 8: Selection of the dimension and degree of the B-spline basis functions for the Sine-Cosine function. In (a), the 2-Wasserstein distance reaches its minimum among all cases when the degree is 2 and the knot number is 15, at the same dimension at which the L^{2}(\widebar{\rho}_{T}^{L}) error reaches its minimum. Panel (b) shows the cross-validating error indicator g (defined in (B.3)) used for selecting the dimension range, suggesting an upper bound N=60 with the threshold \tau.

Figure 8 shows the dimension selection by the 2-Wasserstein distance and by the CEDR algorithm for the Sine-Cosine example. To confirm the effectiveness of the CEDR algorithm, we compute the 2-Wasserstein distances for all dimensions in (a), side-by-side with the CEDR sampling error indicator g in (b) for relatively large dimensions n=75-deg with deg\in\{0,1,2,3\}. First, panel (a) suggests that the optimal dimension and degree are n=13 and deg=2, where the 2-Wasserstein distance reaches its minimum among all cases, at the same dimension as the minimum of the L^{2}(\widebar{\rho}_{T}^{L}) error. For the other degrees, the minimal 2-Wasserstein distances are reached either before or after the minimum of the L^{2}(\widebar{\rho}_{T}^{L}) error. Thus, the 2-Wasserstein distance correctly selects the optimal dimension and degree for the hypothesis space. Second, panel (b) shows that the CEDR algorithm effectively selects the dimension range. With the threshold in (B.4) being \tau=1.60, which is relatively large (representing a tolerance of 100% relative error), the dimension upper bounds are around N=60 for all degrees, and the range encloses the optimal dimensions selected by the 2-Wasserstein distance in (a).

Here we used a relatively large threshold for a rough estimate of the dimension range. Our cross-validating error indicator g(k) in (B.3) provides rich information about the growth of the sampling error as the dimension increases. A future direction is to extract this information, together with the spectral decay of the integral operator, to balance the sampling error against the approximation error.

Acknowledgments

MM, YGK and FL are partially supported by DE-SC0021361 and FA9550-21-1-0317. FL is partially funded by the NSF Award DMS-1913243.

References

  • [1] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic analysis on semigroups: theory of positive definite and related functions, volume 100. New York: Springer, 1984.
  • [2] S. A. Billings. Nonlinear System Identification. John Wiley & Sons, Ltd, Chichester, UK, 2013.
  • [3] P. Brockwell and R. Davis. Time series: theory and methods. Springer, New York, 2nd edition, 1991.
  • [4] O. Cappé, E. Moulines, and T. Rydén. Inference in Hidden Markov Models. Springer Series in Statistics. Springer, New York ; London, 2005.
  • [5] J. A. Carrillo and G. Toscani. Wasserstein metric and large–time asymptotics of nonlinear diffusion equations. In New Trends in Mathematical Physics: In Honour of the Salvatore Rionero 70th Birthday, pages 234–244. World Scientific, 2004.
  • [6] R. R. Coifman, S. Lafon, A. B. Lee, M. Maggioni, B. Nadler, F. Warner, and S. W. Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. Proceedings of the National Academy of Sciences of the United States of America, 102(21):7426–7431, 2005.
  • [7] F. Cucker and D. X. Zhou. Learning theory: an approximation theory viewpoint, volume 24. Cambridge University Press, 2007.
  • [8] J. Fan and Q. Yao. Nonlinear Time Series: Nonparametric and Parametric Methods. Springer, New York, NY, 2003.
  • [9] R. D. Fierro, G. H. Golub, P. C. Hansen, and D. P. O’Leary. Regularization by Truncated Total Least Squares. SIAM J. Sci. Comput., 18(4):1223–1241, 1997.
  • [10] C. Gelada, S. Kumar, J. Buckman, O. Nachum, and M. G. Bellemare. DeepMDP: Learning Continuous Latent Space Models for Representation Learning. arXiv preprint arXiv:1906.02736, 2019.
  • [11] A. Ghosh, S. Mukhopadhyay, S. Roy, and S. Bhattacharya. Bayesian inference in nonparametric dynamic state-space models. Statistical Methodology, 21:35–48, 2014.
  • [12] N. Guglielmi and E. Hairer. Classification of Hidden Dynamics in Discontinuous Dynamical Systems. SIAM J. Appl. Dyn. Syst., 14(3):1454–1477, 2015.
  • [13] L. Györfi, M. Kohler, A. Krzyzak, and H. Walk. A distribution-free theory of nonparametric regression. Springer Science & Business Media, 2006.
  • [14] D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson. Learning Latent Dynamics for Planning from Pixels. arXiv preprint arXiv:1811.04551, 2019.
  • [15] P. C. Hansen. The L-curve and its use in the numerical treatment of inverse problems. In in Computational Inverse Problems in Electrocardiology, ed. P. Johnston, Advances in Computational Bioengineering, pages 119–142. WIT Press, 2000.
  • [16] M. R. Jeffrey. Hidden Dynamics: The Mathematics of Switches, Decisions and Other Discontinuous Behaviour. Springer International Publishing, Cham, 2018.
  • [17] L. Kaiser, M. Babaeizadeh, P. Milos, B. Osinski, R. H. Campbell, K. Czechowski, D. Erhan, C. Finn, P. Kozakowski, S. Levine, A. Mohiuddin, R. Sepassi, G. Tucker, and H. Michalewski. Model-Based Reinforcement Learning for Atari. arXiv preprint arXiv:1903.00374, 2020.
  • [18] N. Kantas, A. Doucet, S. Singh, and J. Maciejowski. An Overview of Sequential Monte Carlo Methods for Parameter Estimation in General State-Space Models. IFAC Proc. Vol., 42(10):774–785, 2009.
  • [19] N. Kolbe. Wasserstein distance. https://github.com/nklb/wasserstein-distance, 2020.
  • [20] Q. Lang and F. Lu. Identifiability of interaction kernels in mean-field equations of interacting particles. arXiv preprint arXiv:2106.05565, 2021.
  • [21] K. Law, A. Stuart, and K. Zygalakis. Data Assimilation: A Mathematical Introduction. Springer, 2015.
  • [22] Z. Li, F. Lu, M. Maggioni, S. Tang, and C. Zhang. On the identifiability of interaction functions in systems of interacting particles. Stochastic Processes and their Applications, 132:135–163, 2021.
  • [23] L. Ljung. System identification. In Signal analysis and prediction, pages 163–173. Springer, 1998.
  • [24] F. Lu, Q. Lang, and Q. An. Data adaptive RKHS Tikhonov regularization for learning kernels in operators. arXiv preprint arXiv:2203.03791, 2022.
  • [25] F. Lu, M. Zhong, S. Tang, and M. Maggioni. Nonparametric inference of interaction laws in systems of agents from trajectory data. Proc. Natl. Acad. Sci. USA, 116(29):14424–14433, 2019.
  • [26] T. Lyche, C. Manni, and H. Speleers. Foundations of Spline Theory: B-Splines, Spline Approximation, and Hierarchical Refinement, volume 2219, pages 1–76. Springer International Publishing, Cham, 2018.
  • [27] C. Moosmüller, F. Dietrich, and I. G. Kevrekidis. A geometric approach to the transport of discontinuous densities. arXiv preprint arXiv:1907.08260, 2019.
  • [28] V. M. Panaretos and Y. Zemel. Statistical aspects of wasserstein distances. Annual review of statistics and its application, 6:405–431, 2019.
  • [29] L. Piegl and W. Tiller. The NURBS Book. Monographs in Visual Communication. Springer Berlin Heidelberg, Berlin, Heidelberg, 1997.
  • [30] Y. Pokern, A. M. Stuart, and P. Wiberg. Parameter estimation for partially observed hypoelliptic diffusions. J. R. Stat. Soc. Ser. B Stat. Methodol., 71(1):49–73, 2009.
  • [31] B. L. S. Prakasa Rao. Statistical inference from sampled data for stochastic processes. In N. U. Prabhu, editor, Contemporary Mathematics, volume 80, pages 249–284. American Mathematical Society, Providence, Rhode Island, 1988.
  • [32] A. Rahimi and B. Recht. Unsupervised regression with applications to nonlinear system identification. In Advances in Neural Information Processing Systems, pages 1113–1120, 2007.
  • [33] M. Sørensen. Estimating functions for diffusion-type processes. In Statistical Methods for Stochastic Differential Equations, volume 124, pages 1–107. Monogr. Statist. Appl. Probab, 2012.
  • [34] H. Sun. Mercer theorem for RKHS on noncompact sets. Journal of Complexity, 21(3):337 – 349, 2005.
  • [35] A. Svensson and T. B. Schön. A flexible state–space model for learning nonlinear dynamical systems. Automatica, 80:189–199, 2017.
  • [36] F. Tobar, P. M. Djuric, and D. P. Mandic. Unsupervised State-Space Modeling Using Reproducing Kernels. IEEE Trans. Signal Process., 63(19):5210–5221, 2015.