
Unsupervised learning of observation functions in
state-space models by nonparametric moment methods

Qingci An, Yannis Kevrekidis, Fei Lu, Mauro Maggioni
Department of Applied Mathematics and Mathematics, Johns Hopkins University, Baltimore, MD 21218, USA
[email protected], [email protected]
Abstract

We investigate the unsupervised learning of non-invertible observation functions in nonlinear state-space models. Assuming abundant data of the observation process along with the distribution of the state process, we introduce a nonparametric generalized moment method to estimate the observation function via constrained regression. The major challenge comes from the non-invertibility of the observation function and the lack of data pairs between the state and observation. We address the fundamental issue of identifiability from quadratic loss functionals and show that the function space of identifiability is the closure of an RKHS that is intrinsic to the state process. Numerical results show that the first two moments and temporal correlations, along with upper and lower bounds, can identify functions ranging from piecewise polynomials to smooth functions, leading to convergent estimators. The limitations of this method, such as non-identifiability due to symmetry and stationarity, are also discussed.

Key words unsupervised learning, state-space models, nonparametric regression, generalized moment method, RKHS

1 Introduction

We consider the following state-space model for the processes $(X_t,Y_t)$ in $\mathbb{R}\times\mathbb{R}$:

State model: $dX_t = a(X_t)\,dt + b(X_t)\,dB_t$, with $a, b$ known;  (1.1)
Observation model: $Y_t = f_*(X_t)$, with $f_*$ unknown.  (1.2)

Here $(B_t)$ is the standard Brownian motion, and the drift function $a(x)$ and the diffusion coefficient $b(x)$ are given, satisfying linear growth and global Lipschitz conditions. We assume that the initial distribution of $X_{t_0}$ is given. Thus, the state model is known; in other words, the distribution of the process $(X_t)$ is known.

Our goal is to estimate the unknown observation function $f_*$ from data consisting of an ensemble of trajectories of the process $Y_t$, denoted by $\{Y_{t_0:t_L}^{(m)}\}_{m=1}^{M}$, where $m$ indexes trajectories and $t_0<\dots<t_L$ are the times at which the observations are made. In particular, no pairs $(X_t,Y_t)$ are observed, so in the language of machine learning this may be considered an unsupervised learning problem. A case of particular interest in the present work is when the observation function $f_*$ is nonlinear and non-invertible, and lies within a large class of functions, including smooth functions but also, for example, piecewise continuous functions.

We estimate the observation function $f_*$ by matching generalized moments, while constraining the estimator to a suitably chosen finite-dimensional hypothesis (function) space, whose dimension depends on the number of observations, in the spirit of nonparametric statistics. We consider both first- and second-order moments, as well as temporal correlations, of the observation process. The estimator minimizes the discrepancy between the moments over hypothesis spaces spanned by B-spline functions, with upper and lower pointwise constraints estimated from data. The method we propose has several significant strengths:

  • the generalized moments do not require the invertibility of the observation function $f_*$;

  • low-order generalized moments tend to be robust to additive observation noise;

  • generalized moments avoid the need for local constructions, since they depend on the entire distribution of the latent and observed processes;

  • our nonparametric approach does not require a priori information about the observation function, and, for example, it can deal with both continuous and discontinuous functions;

  • the method is computationally efficient because the moments need to be estimated only once, and their computation is easily performed in parallel.

We note that the method we propose readily extends to multivariate state models, with the main statistical and computational bottlenecks coming from the curse of dimensionality in the representation and estimation of a higher-dimensional $f_*$ in terms of the basis functions.

The problem we are considering has been studied in the contexts of nonlinear system identification [2, 23] and filtering and data assimilation [4, 21], albeit typically when observations are in the form of one, or a small number of, long trajectories, and in the case of an invertible or smooth observation function $f_*$. The estimation of the unknown observation function and of the latent dynamics from unlabeled data has been considered in [14, 17, 10, 27] and references therein. Inference for state-space models (SSMs) has been widely studied; most classical approaches focus on estimating the parameters in the SSM from a single trajectory of the observation process, by expectation-maximization methods maximizing the likelihood, or by Bayesian approaches [2, 23, 4, 18, 11], with recent studies estimating the coefficients in a kernel representation [36] or the coefficients of a pre-specified set of basis functions [35].

Our framework combines nonparametric learning [13, 7] with the generalized moment method, which has mainly been studied in the setting of parametric inference [33, 31, 30]. We study the identifiability of the observation function from first-order moments, and show that the first-order generalized moments can identify the function in the $L^2$ closure of a reproducing kernel Hilbert space (RKHS) that is intrinsic to the state model. As far as we know, this is the first result on the function space of identifiability for nonparametric learning of observation functions in SSMs.

When the observation function is invertible, its unsupervised regression has been investigated in [32] by maximizing the likelihood for high-dimensional data. However, in many applications, particularly those involving complex dynamics, the observation functions are non-invertible, for example projections or nonlinear non-invertible transformations (e.g., $f(x)=|x|^2$ with $x\in\mathbb{R}^d$). As a consequence, the resulting observed process may have discontinuous or singular probability densities [16, 12]. In [27], it has been shown empirically that delayed coordinates with principal component analysis may be used to estimate the dimension of the hidden process, and diffusion maps [6] may yield a diffeomorphic copy of the observation function.

The remainder of the paper is organized as follows. We present the nonparametric generalized moments method in Section 2. In Section 3 we study the identifiability of the observation function from first-order moments, and show that the function spaces of identifiability are RKHSs intrinsic to the state model. We present numerical examples to demonstrate the effectiveness, and limitations, of the proposed method in Section 4. Section 5 summarizes this study and discusses directions of future research; we review the basic elements about RKHSs in Appendix A.

2 Non-parametric regression based on generalized moments

Throughout this work, we focus on discrete-time observations of the state-space model (1.1)–(1.2), because data in practice are discrete in time, and the extension to continuous-time trajectories is straightforward. We thereby suppose that the data are in the form $\{Y_{t_0:t_L}^{(m)}\}_{m=1}^{M}$, with $m$ indexing multiple independent trajectories, observed at the vector $t_0:t_L$ of discrete times $(t_0,\cdots,t_L)$.

2.1 Generalized moments method

We estimate the observation function $f_*$ by the generalized moment method (GMM) [33, 31, 30], searching for an observation function $\widehat{f}$, in a suitable finite-dimensional hypothesis (function) space, such that the moments of functionals of the process $(\widehat{f}(X_t))$ are close to the empirical ones (computed from data) of $f_*(X_t)$.

We consider “generalized moments” in the form $\mathbb{E}[\xi(Y_{t_0:t_L})]$, where $\xi:\mathbb{R}^{L+1}\to\mathbb{R}^{K}$ is a functional of the trajectory $Y_{t_0:t_L}$. For example, the functional $\xi$ can be $\xi(Y_{t_0:t_L})=[Y_{t_0:t_L},\,Y_{t_0}Y_{t_1},\ldots,Y_{t_{L-1}}Y_{t_L}]\in\mathbb{R}^{2L+1}$, in which case $\mathbb{E}[\xi(Y_{t_0:t_L})]=\left[\mathbb{E}[Y_{t_0:t_L}],\mathbb{E}[Y_{t_0}Y_{t_1}],\ldots,\mathbb{E}[Y_{t_{L-1}}Y_{t_L}]\right]$ is the vector of the first moments and of temporal correlations at consecutive observation times. The empirical generalized moments are computed from data by Monte Carlo approximation:

\mathbb{E}[\xi(Y_{t_0:t_L})] \approx E_M[\xi(Y_{t_0:t_L})] := \frac{1}{M}\sum_{m=1}^{M}\xi(Y_{t_0:t_L}^{(m)}),  (2.1)

which converges at the rate $M^{-1/2}$ by the Central Limit Theorem, since the $M$ trajectories are independent. Meanwhile, since the state model (hence the distribution of the state process) is known, for any putative observation function $f$, we approximate the moments of the process $(f(X_t))$ by simulating $M'$ independent trajectories of the state process $(X_t)$:

\mathbb{E}[\xi(f(X)_{t_0:t_L})] \approx \frac{1}{M'}\sum_{m=1}^{M'}\xi(f(X)_{t_0:t_L}^{(m)}).  (2.2)

Here, with some abuse of notation, $f(X)_{t_0:t_L}^{(m)}:=(f(X_{t_0}^{(m)}),\dots,f(X_{t_L}^{(m)}))$. The number $M'$ can be as large as we can afford from a computational perspective. In what follows, since $M'$ can be chosen large, only subject to computational constraints, we consider the error in this empirical approximation negligible and work with $\mathbb{E}[\xi(f(X)_{t_0:t_L})]$ directly.
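To make the moment-matching setup concrete, here is a minimal sketch (in Python, with hypothetical array shapes) of the empirical approximations (2.1)–(2.2): the observed moments are averaged over the $M$ data trajectories, while the moments of $f(X_t)$ are averaged over $M'$ simulated state trajectories. The arrays `Y` and `X` and the candidate function `f` are assumptions for illustration.

```python
import numpy as np

def empirical_moments(Y):
    """Empirical generalized moments of the observation data, as in (2.1).

    Y: array of shape (M, L+1), M observed trajectories at times t_0..t_L.
    Returns first moments, second moments, and one-step correlations.
    """
    m1 = Y.mean(axis=0)                         # E[Y_{t_l}]
    m2 = (Y**2).mean(axis=0)                    # E[Y_{t_l}^2]
    corr = (Y[:, :-1] * Y[:, 1:]).mean(axis=0)  # E[Y_{t_{l-1}} Y_{t_l}]
    return m1, m2, corr

def model_moments(f, X):
    """Moments of f(X_t), approximated from M' simulated state trajectories, as in (2.2).

    f: candidate observation function, applied elementwise.
    X: array of shape (M', L+1) of simulated state trajectories.
    """
    return empirical_moments(f(X))

# Usage sketch: compare the two sets of moments for a candidate f.
# X = simulate_state_model(Mprime, L, dt)   # hypothetical simulator of (1.1)
# m_data = empirical_moments(Y)
# m_model = model_moments(np.sin, X)
```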

We estimate the observation function $f_*$ by minimizing a notion of discrepancy between these two empirical generalized moments:

\widehat{f} = \underset{f\in\mathcal{H}}{\arg\min}\;\mathcal{E}^M(f), \quad\text{where } \mathcal{E}^M(f) := \mathrm{dist}\left(E_M[\xi(Y_{t_0:t_L})],\,\mathbb{E}[\xi(f(X)_{t_0:t_L})]\right)^2,  (2.3)

where $f$ is restricted to a suitable hypothesis space $\mathcal{H}$, and $\mathrm{dist}(\cdot,\cdot)$ is a proper distance between the moments, to be specified later. We choose $\mathcal{H}$ to be a subset of an $n$-dimensional function space, spanned by basis functions $\{\phi_i\}$, within which we can write $\widehat{f}=\sum_{i=1}^{n}\widehat{c}_i\phi_i$. By the law of large numbers, $\mathcal{E}^M(f)$ tends almost surely to $\mathcal{E}(f):=\mathrm{dist}\left(\mathbb{E}[\xi(Y_{t_0:t_L})],\,\mathbb{E}[\xi(f(X)_{t_0:t_L})]\right)^2$.

It is desirable to choose the generalized moment functional $\xi$ and the hypothesis space $\mathcal{H}$ so that the minimization in (2.3) can be performed efficiently. We select the functional $\xi$ so that the moments $\mathbb{E}[\xi(f(X)_{t_0:t_L})]$, for $f=\sum_{i=1}^{n}c_i\phi_i$, can be efficiently evaluated for all $(c_1,\ldots,c_n)$. To this end, we choose linear functionals or low-degree polynomials, so that we only need to compute the moments of the basis functions once, and can reuse them during the optimization process, as discussed in Section 2.2. The selection of the hypothesis space is detailed in Section 2.3.

2.2 Loss functional and estimator

The generalized moments we consider include the first and the second moments, as well as the one-step temporal correlation: we let $\xi(Y_{t_0:t_L}):=\left(Y_{t_0:t_L},\,Y_{t_0:t_L}^2,\,Y_{t_0}Y_{t_1},\ldots,Y_{t_{L-1}}Y_{t_L}\right)\in\mathbb{R}^{3L+2}$. The loss functional in (2.3) is then chosen in the following form: for weights $w_1,w_2,w_3>0$,

\mathcal{E}(f) := w_1\underbrace{\frac{1}{L}\sum_{l=1}^{L}\left|\mathbb{E}[f(X_{t_l})]-\mathbb{E}[Y_{t_l}]\right|^2}_{\mathcal{E}_1(f)} + w_2\underbrace{\frac{1}{L}\sum_{l=1}^{L}\left|\mathbb{E}[f(X_{t_l})^2]-\mathbb{E}[Y_{t_l}^2]\right|^2}_{\mathcal{E}_2(f)} + w_3\underbrace{\frac{1}{L}\sum_{l=1}^{L}\left|\mathbb{E}[f(X_{t_l})f(X_{t_{l-1}})]-\mathbb{E}[Y_{t_l}Y_{t_{l-1}}]\right|^2}_{\mathcal{E}_3(f)}.  (2.4)

Let the hypothesis space $\mathcal{H}$ be a subset of the span of a linearly independent set $\{\phi_i\}_{i=1}^n$, which we specify in the next section. For $f=\sum_{i=1}^{n}c_i\phi_i\in\mathcal{H}$, we can write the loss functional $\mathcal{E}_1(f)$ in (2.4) as

\mathcal{E}_1(f) = \frac{1}{L}\sum_{l=1}^{L}\bigg|\sum_{i=1}^{n}c_i\mathbb{E}[\phi_i(X_{t_l})]-\mathbb{E}[Y_{t_l}]\bigg|^2 = c^\top\overline{A}_1 c - 2c^\top\overline{b}_1 + \tilde{b}_1,  (2.5)

where $\tilde{b}_1:=\frac{1}{L}\sum_{l=1}^{L}\mathbb{E}[Y_{t_l}]^2$, and the matrix $\overline{A}_1$ and the vector $\overline{b}_1$ are given by

\overline{A}_1(i,j) := \frac{1}{L}\sum_{l=1}^{L}\underbrace{\mathbb{E}[\phi_i(X_{t_l})]\,\mathbb{E}[\phi_j(X_{t_l})]}_{A_{1,l}(i,j)}, \qquad \overline{b}_1(i) := \frac{1}{L}\sum_{l=1}^{L}\underbrace{\mathbb{E}[\phi_i(X_{t_l})]\,\mathbb{E}[Y_{t_l}]}_{b_{1,l}(i)}.  (2.6)

Similarly, we can write $\mathcal{E}_2(f)$ and $\mathcal{E}_3(f)$ in (2.4) as

\mathcal{E}_2(f) = \frac{1}{L}\sum_{l=1}^{L}\bigg|\sum_{i,j=1}^{n}c_i c_j\underbrace{\mathbb{E}[\phi_i\phi_j(X_{t_l})]}_{A_{2,l}(i,j)} - \underbrace{\mathbb{E}[Y_{t_l}^2]}_{b_{2,l}}\bigg|^2,  (2.7)
\mathcal{E}_3(f) = \frac{1}{L}\sum_{l=1}^{L}\bigg|\sum_{i,j=1}^{n}c_i c_j\underbrace{\mathbb{E}[\phi_i(X_{t_{l-1}})\phi_j(X_{t_l})]}_{A_{3,l}(i,j)} - \underbrace{\mathbb{E}[Y_{t_{l-1}}Y_{t_l}]}_{b_{3,l}}\bigg|^2.

Thus, with the notations in (2.6)–(2.7), the minimizer of the loss functional $\mathcal{E}(f)$ over $\mathcal{H}$ is

\widehat{f}_{\mathcal{H}} := \sum_{i=1}^{n}\widehat{c}_i\phi_i, \qquad \widehat{c} := \underset{c\in\mathbb{R}^n \text{ s.t. } \sum_{i=1}^{n}c_i\phi_i\in\mathcal{H}}{\arg\min}\;\mathcal{E}(c), \quad\text{ where }  (2.8)
\mathcal{E}(c) := w_1\left[c^\top\overline{A}_1 c - 2c^\top\overline{b}_1 + \tilde{b}_1\right] + \sum_{k=2}^{3}w_k\frac{1}{L}\sum_{l=1}^{L}\left|c^\top A_{k,l}c - b_{k,l}\right|^2.

Here, with an abuse of notation, we denote $\mathcal{E}(\sum_{i=1}^{n}c_i\phi_i)$ by $\mathcal{E}(c)$.

In practice, with data $\{Y_{t_0:t_L}^{(m)}\}_{m=1}^{M}$, we approximate the expectations involving the observation process $(Y_t)$ by the corresponding empirical means as in (2.1). Meanwhile, we approximate the expectations involving the state process $(X_t)$ by Monte Carlo as in (2.2), using $M'$ trajectories. We assume that the sampling errors in the expectations of $(X_t)$, i.e. in the terms $\{A_{k,l}\}_{k=1}^{3}$, are negligible, since the basis functions $\{\phi_i\}$ can be chosen to be bounded (such as B-spline polynomials) and $M'$ can be as large as we can afford. We approximate $\{b_{k,l}\}_{k=1}^{3}$ by their empirical means $\{b_{k,l}^M\}_{k=1}^{3}$:

b_{1,l}(i) = \mathbb{E}[\phi_i(X_{t_l})]\,\mathbb{E}[Y_{t_l}] \approx \mathbb{E}[\phi_i(X_{t_l})]\,\frac{1}{M}\sum_{m=1}^{M}Y_{t_l}^{(m)} =: b_{1,l}^M(i),  (2.9)
b_{2,l} = \mathbb{E}[|Y_{t_l}|^2] \approx \frac{1}{M}\sum_{m=1}^{M}|Y_{t_l}^{(m)}|^2 =: b_{2,l}^M,  (2.10)
b_{3,l} = \mathbb{E}[Y_{t_{l-1}}Y_{t_l}] \approx \frac{1}{M}\sum_{m=1}^{M}Y_{t_{l-1}}^{(m)}Y_{t_l}^{(m)} =: b_{3,l}^M.  (2.11)

Then, with $\overline{b}_1^M=\frac{1}{L}\sum_{l=1}^{L}b_{1,l}^M$ and $\widetilde{b}_1^M=\frac{1}{LM}\sum_{l=1}^{L}\sum_{m=1}^{M}\big(Y_{t_l}^{(m)}\big)^2$, the estimator from data is

\widehat{f}_{\mathcal{H},M} = \sum_{i=1}^{n}\widehat{c}_i\phi_i, \qquad \widehat{c} = \underset{c\in\mathbb{R}^n \text{ s.t. } \sum_{i=1}^{n}c_i\phi_i\in\mathcal{H}}{\arg\min}\;\mathcal{E}^M(c), \quad\text{ where }  (2.12)
\mathcal{E}^M(c) = w_1\left[c^\top\overline{A}_1 c - 2c^\top\overline{b}_1^M + \widetilde{b}_1^M\right] + \sum_{k=2}^{3}w_k\frac{1}{L}\sum_{l=1}^{L}\left|c^\top A_{k,l}c - b_{k,l}^M\right|^2.

The minimization of $\mathcal{E}^M(c)$ can be performed with iterative algorithms; each iteration with respect to $c$ is efficient, since the data-based matrices and vectors $\overline{A}_1$, $\overline{b}_1^M$ and $\{A_{k,l},b_{k,l}^M\}_{k=2}^{3}$ only need to be computed once. The main source of sampling error is the empirical approximation of the moments of the process $(Y_t)$. We specify the hypothesis space in the next section and provide a detailed algorithm for the computation of the estimator in Section 2.4.
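As a concrete illustration, the following sketch (a minimal Python example, assuming precomputed Monte Carlo state trajectories and basis functions; the names `basis`, `X`, `Y` are hypothetical inputs) assembles the matrices $\overline{A}_1$, $\{A_{2,l}\}$, $\{A_{3,l}\}$ and the data vectors, and minimizes the quartic loss $\mathcal{E}^M(c)$ in (2.12) with a generic optimizer; the pointwise bound constraints of Section 2.3 can be added on top of this, e.g. via constrained optimization.

```python
import numpy as np
from scipy.optimize import minimize

def assemble_moment_arrays(basis, X, Y):
    """Build the A-matrices and b-vectors of (2.6)-(2.7) and (2.9)-(2.11).

    basis: list of callables phi_i;  X: (Mprime, L+1) simulated states;
    Y: (M, L+1) observed trajectories.
    """
    Phi = np.stack([phi(X) for phi in basis], axis=-1)        # (Mprime, L+1, n)
    u = Phi.mean(axis=0)                                      # E[phi_i(X_{t_l})], (L+1, n)
    L = u.shape[0] - 1
    A1 = np.einsum('li,lj->ij', u[1:], u[1:]) / L             # bar{A}_1
    A2 = np.einsum('mli,mlj->lij', Phi[:, 1:], Phi[:, 1:]) / Phi.shape[0]   # A_{2,l}
    A3 = np.einsum('mli,mlj->lij', Phi[:, :-1], Phi[:, 1:]) / Phi.shape[0]  # A_{3,l}
    b1 = (u[1:] * Y[:, 1:].mean(axis=0)[:, None]).mean(axis=0)              # bar{b}_1^M
    b2 = (Y[:, 1:]**2).mean(axis=0)                                         # b_{2,l}^M
    b3 = (Y[:, :-1] * Y[:, 1:]).mean(axis=0)                                # b_{3,l}^M
    b1tilde = (Y[:, 1:].mean(axis=0)**2).mean()   # constant offset; does not affect the minimizer
    return A1, A2, A3, b1, b2, b3, b1tilde

def loss(c, A1, A2, A3, b1, b2, b3, b1tilde, w=(1.0, 1.0, 1.0)):
    """Quartic loss E^M(c) of (2.12)."""
    e1 = c @ A1 @ c - 2 * c @ b1 + b1tilde
    e2 = np.mean((np.einsum('i,lij,j->l', c, A2, c) - b2)**2)
    e3 = np.mean((np.einsum('i,lij,j->l', c, A3, c) - b3)**2)
    return w[0]*e1 + w[1]*e2 + w[2]*e3

# Usage sketch (multiple random initializations, as in Algorithm 1):
# arrays = assemble_moment_arrays(basis, X, Y)
# res = minimize(loss, np.zeros(len(basis)), args=arrays)
# f_hat = lambda x: sum(ci * phi(x) for ci, phi in zip(res.x, basis))
```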

Remark 2.1 (Moments involving Itô’s formula)

When the data trajectories are continuous in time (or when they are sampled at a high frequency in time), we can utilize additional moments from Itô's formula. Recall that for $f\in C_b^2$, applying Itô's formula to the diffusion process in (1.1), we have

f(X_{t+\Delta t}) - f(X_t) = \int_t^{t+\Delta t}\nabla f\cdot b(X_s)\,dW_s + \int_t^{t+\Delta t}\mathcal{L}f(X_s)\,ds,

where the operator $\mathcal{L}$ is

\mathcal{L}f = \nabla f\cdot a + \frac{1}{2}\mathrm{Hess}(f):b^\top b.  (2.13)

Hence, $\mathbb{E}[\Delta Y_{t_l}]=\mathbb{E}[\mathcal{L}f_*(X_{t_{l-1}})]\Delta t + o(\Delta t)$, where $\Delta Y_{t_l}=Y_{t_l}-Y_{t_{l-1}}$. Thus, when $\Delta t$ is small, we can consider matching the generalized moments via

\mathcal{E}_4(f) = \frac{1}{L}\sum_{l=1}^{L}\bigg|\mathbb{E}[\mathcal{L}f(X_{t_{l-1}})]\Delta t - \mathbb{E}[\Delta Y_{t_l}]\bigg|^2.  (2.14)

Similarly, we can further consider the generalized moments $\mathbb{E}[Y_t\Delta Y_t]$ and $\mathrm{Var}(\Delta Y_t)$ and the corresponding quartic loss functionals. Since they require the moments of the first- and second-order derivatives of the observation function, they are helpful when the observation function is smooth with bounded derivatives.

2.3 Hypothesis space and optimal dimension

We let the hypothesis space $\mathcal{H}$ be a class of bounded functions in $\mathrm{span}\{\phi_i\}_{i=1}^n$,

\mathcal{H} := \Big\{f=\sum_{i=1}^{n}c_i\phi_i \,:\, y_{\min}\leq f(x)\leq y_{\max} \text{ for all } x\in\mathrm{supp}(\widebar{\rho}_T)\Big\},  (2.15)

where the basis functions $\{\phi_i\}$ are to be specified below, and the empirical bounds

y_{\min} := \min\{Y_{t_l}^{(m)}\}_{l,m=1}^{L,M}, \qquad y_{\max} := \max\{Y_{t_l}^{(m)}\}_{l,m=1}^{L,M}

aim to approximate the lower and upper bounds of $f_*$. Note that the hypothesis space $\mathcal{H}$ is a bounded convex subset of the linear space $\mathrm{span}\{\phi_i\}_{i=1}^n$. While the pointwise bound constraints are imposed for all $x$, in practice, for efficient computation, we apply them at representative points, for example at the mesh-grid points used when the basis functions are piecewise polynomials. One may apply stronger constraints, such as requiring time-dependent bounds to hold at all times: $y_{\min}(t)\leq\sum_{i=1}^{n}c_i\phi_i(x)\leq y_{\max}(t)$ for each time $t$, where $y_{\min}(t)$ and $y_{\max}(t)$ are the minimum and maximum of the data set $\{Y_t^{(m)}\}_{m=1}^{M}$.

Basis functions.

As basis functions $\{\phi_i\}$ for the subspace containing $\mathcal{H}$ we choose a B-spline basis consisting of piecewise polynomials (see Appendix B.1 for details). To specify the knots of the B-spline functions, we introduce a density function $\widebar{\rho}_T^L$, which is the average of the probability densities $\{p_{t_l}\}_{l=1}^{L}$ of $\{X_{t_l}\}_{l=1}^{L}$:

\widebar{\rho}_T^L(x) = \frac{1}{L}\sum_{l=1}^{L}p_{t_l}(x) \;\xrightarrow{L\to\infty}\; \widebar{\rho}_T(x) = \frac{1}{T}\int_0^T p_t(x)\,dt,  (2.16)

when $t_L=T$ and $\max_{1\leq l\leq L}|t_l-t_{l-1}|\to 0$. Here $\widebar{\rho}_T^L$ (and its continuous-time limit $\widebar{\rho}_T$) describes the intensity of visits to the regions explored by the process $(X_t)$. The knots of the B-spline functions come from a uniform partition of $[R_{\min},R_{\max}]$, the smallest interval enclosing the support of $\widebar{\rho}_T^L$. Thus, the basis functions $\{\phi_i\}$ are piecewise polynomials with knots adaptive to the state model, which determines $\widebar{\rho}_T^L$.
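For illustration, here is a minimal sketch (in Python, using `scipy.interpolate.BSpline`; the interval $[R_{\min},R_{\max}]$ is taken from the range of assumed simulated state trajectories `X`) of how such a basis can be built on a uniform partition.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(X, n_basis, degree=2):
    """Construct B-spline basis functions with knots on a uniform partition
    of [Rmin, Rmax], an interval covering the sampled support of rho_T^L.

    X: (Mprime, L+1) simulated state trajectories, used to estimate the interval.
    Returns a list of callables phi_i(x); they return NaN outside [Rmin, Rmax].
    """
    Rmin, Rmax = X.min(), X.max()
    # Interior knots from a uniform partition; repeat the endpoints for the spline order.
    n_interior = n_basis - degree + 1
    interior = np.linspace(Rmin, Rmax, n_interior)
    knots = np.concatenate([[Rmin] * degree, interior, [Rmax] * degree])
    basis = []
    for i in range(n_basis):
        coeff = np.zeros(n_basis)
        coeff[i] = 1.0
        basis.append(BSpline(knots, coeff, degree, extrapolate=False))
    return basis

# Usage: basis = bspline_basis(X, n_basis=8, degree=2)
# Each basis[i] is a callable phi_i, evaluated elementwise on arrays.
```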

Dimension of the hypothesis space.

It is important to select a suitable dimension of the hypothesis space to avoid under- or over-fitting. We select the dimension in two steps. First, we introduce an algorithm, Cross-validating Estimation of Dimension Range (CEDR), to estimate the range of the dimension from the quadratic loss functional $\mathcal{E}_1$. Its main idea is to avoid the amplification of sampling error caused by an unsuitably large dimension; the sampling error is estimated by splitting the data into two sets. Then, we select the optimal dimension as the one minimizing the 2-Wasserstein distance between the measures of data and prediction. See Appendix B.1 for details.

2.4 Algorithm

We summarize the above method of nonparametric regression with generalized moments in Algorithm 1. It minimizes a quartic loss function with the upper and lower bound constraints, and we perform the optimization with multiple initial conditions (see Appendix B.2 for the details).

Algorithm 1 Estimating the observation function by the nonparametric generalized moment method
Input: the state model and data $\{Y_{t_0:t_L}^{(m)}\}_{m=1}^{M}$, consisting of multiple trajectories of the observation process.
Output: estimator $\widehat{f}$.
1: Estimate the empirical density $\widebar{\rho}_T$ in (2.16) and find its support $[R_{\min},R_{\max}]$.
2: Select a basis type, Fourier or B-spline, with an estimated dimension range $[1,N]$ (by Algorithm 2), and compute the basis functions as described in Section 2.3.
3: for $n=1:N$ do
4:     Compute the moment matrices in (2.6)–(2.7) and the vectors $b_{k,l}^M$ in (2.9)–(2.11).
5:     Find the estimator $\widehat{c}_n$ by optimization with multiple initial conditions. Compute and record the values of the loss functional and the 2-Wasserstein distances.
6: end for
7: Select the optimal dimension $n^*$ (and degree, if a B-spline basis) with the minimal 2-Wasserstein distance in (B.5). Return the estimator $\widehat{f}=\sum_{i=1}^{n^*}c_{n^*}^{i}\phi_i$.

Computational complexity

The computational complexity is driven by the construction of the normal matrix and vectors and by the evaluation of the 2-Wasserstein distances, which require computations of order $O(n^2LM)$ and $O(nLM)$, respectively. Thus, the total computational complexity is of order $O((n^2+n)LM)$.

2.5 Tolerance to noise in the observations

The (generalized) moment method can tolerate large additive observation noise if the distribution of the noise is known. The estimation error caused by the noise is at the scale of the sampling error, which is negligible when the sample size is large.

More specifically, suppose that we observe $\{Y_{t_0:t_L}^{(m)}\}_{m=1}^{M}$ from the observation model

Y_{t_l} = f_*(X_{t_l}) + \eta_{t_l},  (2.17)

where $\{\eta_{t_l}\}$ is sampled from a process $(\eta_t)$ that is independent of $(X_t)$ and has moments

\mathbb{E}[\eta_t] = 0, \qquad C(s,t) = \mathbb{E}[\eta_t\eta_s], \quad\text{ for } s,t\geq 0.  (2.18)

A typical example is when $\eta$ is independent, identically distributed Gaussian noise $\mathcal{N}(0,\sigma^2)$, which gives $C(s,t)=\sigma^2\delta(t-s)$.

The algorithm in Section 2 applies to the noisy data with only a few changes. First, note that the loss functional in (2.4) involves only the moments $\mathbb{E}[Y_t]$, $\mathbb{E}[Y_t^2]$ and $\mathbb{E}[Y_{t_l}Y_{t_{l-1}}]$, which are moments of $f_*(X_t)$. When $Y_t$ in (2.17) has the observation noise specified above, we have

\mathbb{E}[f_*(X_t)] = \mathbb{E}[Y_t] - \mathbb{E}[\eta_t] = \mathbb{E}[Y_t];
\mathbb{E}[f_*(X_t)f_*(X_s)] = \mathbb{E}[Y_tY_s] - \mathbb{E}[\eta_t\eta_s] = \mathbb{E}[Y_tY_s] - C(t,s)

for all $t,s\geq 0$. Thus, we only need to change the loss functional to

\mathcal{E}(f) = w_1\frac{1}{L}\sum_{l=1}^{L}\left|\mathbb{E}[f(X_{t_l})]-\mathbb{E}[Y_{t_l}]\right|^2 + w_2\frac{1}{L}\sum_{l=1}^{L}\left|\mathbb{E}[f(X_{t_l})^2]-\mathbb{E}[Y_{t_l}^2]+C(t_l,t_l)\right|^2 + w_3\frac{1}{L}\sum_{l=1}^{L}\left|\mathbb{E}[f(X_{t_l})f(X_{t_{l-1}})]-\mathbb{E}[Y_{t_l}Y_{t_{l-1}}]+C(t_l,t_{l-1})\right|^2.  (2.19)

Similar to (2.12), the minimizer of the loss functional can then be computed as

\widehat{f}_{\mathcal{H},M} = \sum_{i=1}^{n}\widehat{c}_i\phi_i, \qquad \widehat{c} = \underset{c\in\mathbb{R}^n \text{ s.t. } \sum_{i=1}^{n}c_i\phi_i\in\mathcal{H}}{\arg\min}\;\mathcal{E}^M(c), \quad\text{ where }  (2.20)
\mathcal{E}^M(c) = w_1\left[c^\top\overline{A}_1 c - 2c^\top\overline{b}_1^M + \widetilde{b}_1^M\right] + w_2\frac{1}{L}\sum_{l=1}^{L}\left|c^\top A_{2,l}c - b_{2,l}^M + C(t_l,t_l)\right|^2 + w_3\frac{1}{L}\sum_{l=1}^{L}\left|c^\top A_{3,l}c - b_{3,l}^M + C(t_{l-1},t_l)\right|^2,

where all the $A$-matrices and $b$-vectors are the same as before (e.g., in (2.6)–(2.7) and (2.9)–(2.11)).

Note that the observation noise introduces sampling errors through $b_{1,l}^M$, $b_{2,l}^M$ and $b_{3,l}^M$, which are at the scale $O(\frac{1}{\sqrt{M}})$. Also, note that the $A$-matrices are independent of the observation noise. Thus, the observation noise affects the estimator only through the sampling error at the scale $O(\frac{1}{\sqrt{M}})$, the same as the sampling error in the estimator from noiseless data.
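As a small illustration of this correction, the sketch below (Python; assumes i.i.d. Gaussian noise with a known variance `sigma2`, and the empirical moment vectors from the noiseless pipeline) adjusts the empirical second-moment and correlation targets before they enter the loss (2.20); for i.i.d. noise the off-diagonal correction vanishes.

```python
import numpy as np

def noise_corrected_targets(b2M, b3M, sigma2):
    """Adjust empirical moment targets for additive i.i.d. noise N(0, sigma2).

    b2M: empirical E[Y_{t_l}^2], shape (L,);  b3M: empirical E[Y_{t_{l-1}} Y_{t_l}], shape (L,).
    Returns targets for E[f(X_{t_l})^2] and E[f(X_{t_{l-1}}) f(X_{t_l})].
    """
    b2_target = np.asarray(b2M) - sigma2   # C(t_l, t_l) = sigma2
    b3_target = np.asarray(b3M) - 0.0      # C(t_{l-1}, t_l) = 0 for independent noise
    return b2_target, b3_target

# Usage: b2_target, b3_target = noise_corrected_targets(b2M, b3M, sigma2=0.1**2)
# These targets replace b2M, b3M in the loss E^M(c) of (2.12)/(2.20).
```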

3 Identifiability

We discuss in this section the identifiability of the observation function by the loss functionals introduced in the previous section. We show that $\mathcal{E}_1$, the quadratic loss functional based on the first-order moments in (2.5), can identify the observation function in the $L^2(\widebar{\rho}_T^L)$-closure of a reproducing kernel Hilbert space (RKHS) that is intrinsic to the state model. In addition, the loss functional $\mathcal{E}_4$ in (2.14), based on the Itô formula, enlarges the function space of identifiability. We also discuss, in Section 3.2, some limitations of the loss functional $\mathcal{E}$ in (2.19), which combines the quadratic and quartic loss functionals; in particular, symmetry and stationarity may prevent us from identifying the observation function when using only generalized moments.

The starting point is a definition of identifiability, which generalizes the uniqueness of the minimizer of a loss function in parametric inference (see e.g., [3, page 431] and [8]).

Definition 3.1 (Identifiability)

We say that the observation function $f_*$ is identifiable by a data-based loss functional $\mathcal{E}$ on a function space $H$ if $f_*$ is the unique minimizer of $\mathcal{E}$ in $H$.

Identifiability consists of three elements: a loss functional $\mathcal{E}$, a function space $H$, and a unique minimizer of the loss functional in $H$. When the loss functional is quadratic (such as $\mathcal{E}_1$ or $\mathcal{E}_4$), it has a unique minimizer in a Hilbert space iff its Fréchet derivative is invertible in the Hilbert space; thus, the main task is to find such function spaces [22, 20, 24]. We will specify such function spaces for $\mathcal{E}_1$ and/or $\mathcal{E}_4$ in Section 3.1. We note that these function spaces do not take into account the constraints of upper and lower bounds, which generically lead to minimizers near or at the boundary of the constrained set. This consideration applies also to the quartic functionals $\mathcal{E}_2$ and $\mathcal{E}_3$, which can be viewed as providing additional constraints (see Section 3.2).

3.1 Identifiability by quadratic loss functionals

We consider the quadratic loss functionals $\mathcal{E}_1$ and $\mathcal{E}_4$, and show that they can identify the observation function in the $L^2(\widebar{\rho}_T^L)$-closure of reproducing kernel Hilbert spaces (RKHSs) that are intrinsic to the state model.

Assumption 3.2

We make the following assumptions on the state-space model.

  • The coefficients in the state model (1.1) satisfy a global Lipschitz condition, and therefore also a linear growth condition: there exists a constant $C>0$ such that $|a(x)-a(y)|+|b(x)-b(y)|\leq C|x-y|$ for all $x,y\in\mathbb{R}$, and $|a(x)|+|b(x)|\leq C(1+|x|)$. Furthermore, we assume that $\inf_{x\in\mathbb{R}}b(x)>0$.

  • The observation function $f_*$ satisfies $\sup_{t\in[0,t_L]}\mathbb{E}[|f_*(X_t)|^2]<\infty$.

Theorem 3.3

Given discrete-time data $\{Y_{t_0:t_L}^{(m)}\}_{m=1}^{M}$ from the state-space model (1.1)–(1.2) satisfying Assumption 3.2, let $\mathcal{E}_1$ and $\mathcal{E}_4$ be the loss functionals defined in (2.4) and (2.14). Denote by $p_t(x)$ the density of the state process $X_t$ at time $t$, and recall that $\widebar{\rho}_T^L$ in (2.16) is the average, in time, of these densities. Let $\mathcal{L}^*$ be the adjoint of the second-order elliptic operator $\mathcal{L}$ in (2.13). Then,

  • (a) $\mathcal{E}_1$ has a unique minimizer in $H_1$, the $L^2(\widebar{\rho}_T^L)$ closure of the RKHS $\mathcal{H}_{K_1}$ with reproducing kernel

    K_1(x,x') = \frac{1}{\widebar{\rho}_T^L(x)\widebar{\rho}_T^L(x')}\frac{1}{L}\sum_{l=1}^{L}p_{t_l}(x)p_{t_l}(x'),  (3.1)

    for $(x,x')$ such that $\widebar{\rho}_T^L(x)\widebar{\rho}_T^L(x')>0$, and $K_1(x,x')=0$ otherwise. When the data is continuous ($L\to\infty$), we have $K_1(x,x')=\frac{1}{\widebar{\rho}_T(x)\widebar{\rho}_T(x')}\frac{1}{T}\int_0^T p_t(x)p_t(x')\,dt$.

  • (b) $\mathcal{E}_4$ has a unique minimizer in $H_4$, the $L^2(\widebar{\rho}_T^L)$ closure of the RKHS $\mathcal{H}_{K_4}$ with reproducing kernel

    K_4(x,x') = \frac{1}{\widebar{\rho}_T^L(x)\widebar{\rho}_T^L(x')}\frac{1}{L}\sum_{l=1}^{L}\mathcal{L}^*p_{t_l}(x)\,\mathcal{L}^*p_{t_l}(x'),  (3.2)

    for $(x,x')$ such that $\widebar{\rho}_T^L(x)\widebar{\rho}_T^L(x')>0$, and $K_4(x,x')=0$ otherwise. When the data is continuous, we have $K_4(x,x')=\frac{1}{\widebar{\rho}_T(x)\widebar{\rho}_T(x')}\frac{1}{T}\int_0^T\mathcal{L}^*p_t(x)\,\mathcal{L}^*p_t(x')\,dt$.

  • (c) $\mathcal{E}_1+\mathcal{E}_4$ has a unique minimizer in $H$, the $L^2(\widebar{\rho}_T^L)$ closure of the RKHS $\mathcal{H}_K$ with reproducing kernel

    K(x,x') = \frac{1}{\widebar{\rho}_T^L(x)\widebar{\rho}_T^L(x')}\frac{1}{L}\sum_{l=1}^{L}\left[p_{t_l}(x)p_{t_l}(x') + \mathcal{L}^*p_{t_l}(x)\,\mathcal{L}^*p_{t_l}(x')\right],  (3.3)

    for $(x,x')$ such that $\widebar{\rho}_T^L(x)\widebar{\rho}_T^L(x')>0$, and $K(x,x')=0$ otherwise. Similarly, for continuous data we have $K(x,x')=\frac{1}{\widebar{\rho}_T(x)\widebar{\rho}_T(x')}\frac{1}{T}\int_0^T\left[p_t(x)p_t(x') + \mathcal{L}^*p_t(x)\,\mathcal{L}^*p_t(x')\right]dt$.

In particular, $f_*$ is the unique minimizer of these loss functionals if $f_*$ is in $H_1$, $H_4$ or $H$, respectively.

To prove this theorem, we first introduce an operator characterization of the RKHS $\mathcal{H}_{K_1}$ in the next lemma. Similar characterizations hold for the RKHSs $\mathcal{H}_{K_4}$ and $\mathcal{H}_K$.

Lemma 3.4

The function $K_1$ in (3.1) is a Mercer kernel, that is, it is continuous, symmetric and positive semi-definite. Furthermore, $K_1$ is square integrable in $L^2(\widebar{\rho}_T^L\times\widebar{\rho}_T^L)$, and it defines a compact positive integral operator $L_{K_1}:L^2(\widebar{\rho}_T^L)\to L^2(\widebar{\rho}_T^L)$:

[L_{K_1}h](x') = \int h(x)K_1(x,x')\widebar{\rho}_T^L(x)\,dx.  (3.4)

Also, the RKHS $\mathcal{H}_{K_1}$ has the operator characterization $\mathcal{H}_{K_1}=L_{K_1}^{1/2}(L^2(\widebar{\rho}_T^L))$, and $\{\sqrt{\lambda_i}\psi_i\}_{i=1}^{\infty}$ is an orthonormal basis of the RKHS $\mathcal{H}_{K_1}$, where $\{\lambda_i,\psi_i\}$ are the pairs of positive eigenvalues and corresponding eigenfunctions of $L_{K_1}$.

Proof. Since the densities of the diffusion process are smooth, the kernel $K_1$ is continuous on the support of $\widebar{\rho}_T^L$, and it is symmetric. It is positive semi-definite (see Appendix A for a definition) because for any $(c_1,\ldots,c_n)\in\mathbb{R}^n$ and $(x_1,\ldots,x_n)$, we have

\sum_{i,j=1}^{n}c_ic_jK_1(x_i,x_j) = \frac{1}{L}\sum_{l=1}^{L}\sum_{i,j=1}^{n}c_ic_j\frac{p_{t_l}(x_i)p_{t_l}(x_j)}{\widebar{\rho}_T^L(x_i)\widebar{\rho}_T^L(x_j)} = \frac{1}{L}\sum_{l=1}^{L}\left(\sum_{i=1}^{n}c_i\frac{p_{t_l}(x_i)}{\widebar{\rho}_T^L(x_i)}\right)^2 \geq 0.

Thus, $K_1$ is a Mercer kernel.

To show that $K_1$ is square integrable, note first that $p_{t_l}(x)\leq\max_{1\leq k\leq L}p_{t_k}(x)\leq L\widebar{\rho}_T^L(x)$ for any $x$. Thus for each $x,x'$, we have

\frac{1}{L}\sum_{l=1}^{L}p_{t_l}(x)p_{t_l}(x') \leq L\,\widebar{\rho}_T^L(x)\widebar{\rho}_T^L(x')

and $K_1(x,x')\leq L$. It follows that $K_1$ is in $L^2(\widebar{\rho}_T^L\times\widebar{\rho}_T^L)$.

Since $K_1$ is positive semi-definite and square integrable, the integral operator $L_{K_1}$ is compact and positive. The operator characterization follows from Theorem A.3. ∎

Remark 3.5

The above lemma applies only to discrete-time observations because it uses the bound $K_1(x,x')\leq L$. When the data is continuous in time on $[0,T]$, we have $K_1\in L^2(\widebar{\rho}_T\times\widebar{\rho}_T)$ if the support of $\widebar{\rho}_T$ is compact. In fact, to show that $K_1$ is square integrable when $\mathrm{supp}(\widebar{\rho}_T)$ is compact, we note that the probability densities are uniformly bounded above, that is, $p_t(x)\leq\max_{y\in\mathbb{R},s\in[0,T]}p_s(y)$. Thus for each $x,x'$, we have

\frac{1}{T}\int_0^T p_t(x)p_t(x')\,dt \leq \left|\frac{1}{T}\int_0^T p_t(x)^2\,dt\right|^{1/2}\left|\frac{1}{T}\int_0^T p_t(x')^2\,dt\right|^{1/2} \leq \widebar{\rho}_T(x)^{1/2}\,\widebar{\rho}_T(x')^{1/2}\max_{y\in\mathbb{R},s\in[0,T]}p_s(y),

where the first inequality is Cauchy-Schwarz. Then,

K_1(x,x') = \frac{1}{\widebar{\rho}_T(x)\widebar{\rho}_T(x')}\frac{1}{T}\int_0^T p_t(x)p_t(x')\,dt \leq \widebar{\rho}_T(x)^{-1/2}\,\widebar{\rho}_T(x')^{-1/2}\max_{y\in\mathbb{R},s\in[0,T]}p_s(y).

It follows that $K_1$ is in $L^2(\widebar{\rho}_T\times\widebar{\rho}_T)$:

\int\int K_1^2(x,x')\widebar{\rho}_T(x)\widebar{\rho}_T(x')\,dx\,dx' \leq |\mathrm{supp}(\widebar{\rho}_T)|^2\max_{y\in\mathbb{R},s\in[0,T]}p_s(y)^2 < \infty.

When $\widebar{\rho}_T$ has non-compact support, it remains to be proved that $K_1\in L^2(\widebar{\rho}_T\times\widebar{\rho}_T)$.

Proof of Theorem 3.3. The proofs of (a)–(c) are similar, so we focus on (a) and only sketch the proofs of (b)–(c).

To prove (a), we only need to show the uniqueness of the minimizer, because Lemma 3.4 has shown that $K_1$ is a Mercer kernel. Furthermore, note that by Lemma 3.4, the $L^2(\widebar{\rho}_T^L)$ closure of the RKHS $\mathcal{H}_{K_1}$ is $H_1=\overline{\mathrm{span}\{\psi_i\}_{i=1}^{\infty}}$, the closure in $L^2(\widebar{\rho}_T^L)$ of the eigenspace of $L_{K_1}$ with non-zero eigenvalues, where $L_{K_1}$ is the operator defined in (3.4).

For any $f\in H_1$, with the notation $h=f-f_*$, we have $\mathbb{E}[f(X_t)]-\mathbb{E}[Y_t]=\mathbb{E}[h(X_t)]$ for each $t$ (recall that $Y_t=f_*(X_t)$). Hence, we can write the loss functional as

\mathcal{E}_1(f) = \frac{1}{L}\sum_{l=1}^{L}\left|\mathbb{E}[f(X_{t_l})]-\mathbb{E}[Y_{t_l}]\right|^2 = \frac{1}{L}\sum_{l=1}^{L}\left|\mathbb{E}[h(X_{t_l})]\right|^2 = \int\int h(x)h(x')\frac{1}{L}\sum_{l=1}^{L}p_{t_l}(x)p_{t_l}(x')\,dx\,dx' = \int\int h(x)h(x')K_1(x,x')\widebar{\rho}_T^L(x)\widebar{\rho}_T^L(x')\,dx\,dx' \geq 0.  (3.5)

Thus, $\mathcal{E}_1$ attains its unique minimizer in $H_1$ at $f_*$ iff $\mathcal{E}_1(f_*+h)=0$ with $h\in H_1$ implies that $h=0$. Note that the second equality in (3.5) implies that $\mathcal{E}_1(f_*+h)=0$ iff $\mathbb{E}[h(X_{t_l})]=0$, i.e. $\int h(x)p_{t_l}(x)\,dx=0$, for all $t_l$. Then, $\int h(x)p_{t_l}(x)\frac{p_{t_l}(x')}{\widebar{\rho}_T^L(x')}\,dx=0$ for each $t_l$ and $x'$. Thus, the sum of these terms is also zero:

0 = \int h(x)\frac{1}{L}\sum_{l=1}^{L}\frac{p_{t_l}(x)p_{t_l}(x')}{\widebar{\rho}_T^L(x')\widebar{\rho}_T^L(x)}\widebar{\rho}_T^L(x)\,dx = \int h(x)K_1(x,x')\widebar{\rho}_T^L(x)\,dx

for each $x'$. By the definition of the operator $L_{K_1}$, this implies that $L_{K_1}h=0$. Thus, $h=0$ because $h\in H_1$.

The above arguments hold true when the kernel $K_1$ comes from continuous-time data: one only has to replace $\frac{1}{L}\sum_{l=1}^{L}$ by the averaged integral in time. This completes the proof of (a).

The proofs of (b) and (c) are the same as above, except for the appearance of the operator $\mathcal{L}^*$. Note that $\mathcal{E}_4$ in (2.14) reads $\mathcal{E}_4(f)=\frac{1}{L}\sum_{l=1}^{L}\left|\mathbb{E}[\mathcal{L}f(X_{t_{l-1}})]\Delta t-\mathbb{E}[\Delta Y_{t_l}]\right|^2$; thus, it differs from $\mathcal{E}_1$ only in the expectation $\mathbb{E}[\mathcal{L}f(X_{t_{l-1}})]$. By integration by parts, we have

\mathbb{E}[\mathcal{L}f(X_s)] = \int \mathcal{L}f(x)p_s(x)\,dx = \int f(x)\mathcal{L}^*p_s(x)\,dx

for any $f\in C_b^2$. Then, the rest of the proof for part (b) follows exactly as above with $K_1$ and $L_{K_1}$ replaced by $K_4$ and $L_{K_4}$. ∎


The following remarks highlight the implications of the above theorem. We consider only $\mathcal{E}_1$, but all the remarks apply also to $\mathcal{E}_4$ and $\mathcal{E}_1+\mathcal{E}_4$.

Remark 3.6 (An operator view of identifiability)

The unique minimizer of $\mathcal{E}_1$ in $H_1$ defined in Theorem 3.3 is the zero of its Fréchet derivative: $\widehat{f}=L_{K_1}^{-1}L_{K_1}f_*$, which equals $f_*$ if $f_*\in H_1$. In fact, with the integral operator $L_{K_1}$, we can write the loss functional $\mathcal{E}_1$ as

\mathcal{E}_1(f) = \langle f-f_*,\,L_{K_1}(f-f_*)\rangle_{L^2(\widebar{\rho}_T^L)}.

Thus, the Fréchet derivative of $\mathcal{E}_1$ over $L^2(\widebar{\rho}_T^L)$ is $\nabla\mathcal{E}_1(f)=L_{K_1}(f-f_*)$, and we obtain the unique minimizer. Furthermore, this operator representation of the minimizer conveys two important messages about the inverse problem of finding the minimizer of $\mathcal{E}_1$: (1) the problem is ill-defined beyond $H_1$; in particular, it is ill-defined on $L^2(\widebar{\rho}_T^L)$ when $L_{K_1}$ is not strictly positive; (2) the inverse problem is ill-posed on $H_1$, because the operator $L_{K_1}$ is compact and its inverse $L_{K_1}^{-1}$ is unbounded.

Remark 3.7 (Identifiability and normal matrix in regression)

Suppose $\mathcal{H}_n=\mathrm{span}\{\phi_i\}_{i=1}^n$ and write $f=\sum_{i=1}^{n}c_i\phi_i$, with the $\phi_i$ being basis functions such as B-splines. As shown in (2.5)–(2.6), the loss functional $\mathcal{E}_1$ becomes a quadratic function with normal matrix $\overline{A}_1=\frac{1}{L}\sum_{l=1}^{L}A_{1,l}$, where $A_{1,l}=\mathbf{u}_l^\top\mathbf{u}_l$ and $\mathbf{u}_l=(\mathbb{E}[\phi_1(X_{t_l})],\ldots,\mathbb{E}[\phi_n(X_{t_l})])\in\mathbb{R}^n$. Thus, the rank of the matrix $\overline{A}_1$ is no larger than $\min\{n,L\}$. Note that $\overline{A}_1$ is the matrix approximation of $L_{K_1}$ on the basis $\{\phi_i\}_{i=1}^n$ in the sense that

\overline{A}_1(i,j) = \langle L_{K_1}\phi_i,\phi_j\rangle_{L^2(\widebar{\rho}_T^L)}

for each $1\leq i,j\leq n$. Thus, the minimal eigenvalue of $\overline{A}_1$ approximates the minimal eigenvalue of $L_{K_1}$ restricted to $\mathcal{H}_n$. In particular, if $\mathcal{H}_n$ contains a nonzero element of the null space of $L_{K_1}$, then the normal matrix is singular; if $\mathcal{H}_n$ is a subspace of the $L^2(\widebar{\rho}_T^L)$ closure of $\mathcal{H}_{K_1}$, then the normal matrix is invertible and we can find a unique minimizer.

Remark 3.8 (Convergence of estimator)

For a fixed hypothesis space, the estimator converges to the projection of $f_*$ onto $\mathcal{H}\cap H_1$ as the data size $M$ increases, at the order $O(M^{-1/2})$, with the error coming from the Monte Carlo estimation of the moments of the observations. With data-adaptive hypothesis spaces, we fall short of proving a minimax rate of convergence as in classical nonparametric regression. This is because of the lack of a coercivity condition [25, 22], since the eigenvalues of the compact operator $L_{K_1}$ converge to zero. A minimax rate would require an estimate of the spectral decay of $L_{K_1}$, and we leave this for future research.

Remark 3.9 (Regularization using the RKHS)

The RKHS $\mathcal{H}_{K_1}$ can be further utilized to provide a regularization norm in Tikhonov regularization (see [24]). It has the advantage of being data adaptive, and it constrains the learning to take place in the function space of identifiability.

Examples of the RKHS.

We emphasize that the reproducing kernel and the RKHS are intrinsic to the state model (including the initial distribution). We demonstrate the kernels by computing them analytically when the process $(X_t)$ is either a Brownian motion or an Ornstein-Uhlenbeck (OU) process. For simplicity, we consider continuous-time data. Recall that when the diffusion coefficient in the state model (1.1) is a constant, the second-order elliptic operator $\mathcal{L}$ is $\mathcal{L}f=\nabla f\cdot a+\frac{1}{2}b^2\Delta f$ and its adjoint operator $\mathcal{L}^*$ is

\mathcal{L}^*p_s = -\nabla\cdot(ap_s) + \frac{1}{2}b^2\Delta p_s,

where $p_s$ denotes the probability density of $X_s$.

Example 3.10 (1D Brownian motion)

Let $a=0$ and $b=1$. Assume $p_0(x)=\delta_{x_0}$, i.e., $X_0=x_0$. Then $X_t$ is the Brownian motion starting from $x_0$ and $p_t(x)=\frac{1}{\sqrt{2\pi t}}e^{-\frac{(x-x_0)^2}{2t}}$ for each $t>0$. We have $\widebar{\rho}_T(x)=\frac{1}{T}\int_0^T p_t(x)\,dt=\frac{x-x_0}{T\sqrt{\pi}}\Gamma\!\left(-\frac{1}{2},\frac{(x-x_0)^2}{2T}\right)$ and

K_1(x,x') = \frac{1}{\widebar{\rho}_T(x)\widebar{\rho}_T(x')}\frac{1}{T}\int_0^T p_s(x)p_s(x')\,ds = \frac{T\,\Gamma\!\left(0,\frac{(x-x_0)^2+(x'-x_0)^2}{2T}\right)}{2(x-x_0)(x'-x_0)\,\Gamma\!\left(-\frac{1}{2},\frac{(x-x_0)^2}{2T}\right)\Gamma\!\left(-\frac{1}{2},\frac{(x'-x_0)^2}{2T}\right)},

where $\Gamma(s,x):=\int_x^{\infty}t^{s-1}e^{-t}\,dt$ is the upper incomplete Gamma function. Also, we have

\mathcal{L}^*p_s(x) = \phi_2(s,x)p_s(x), \quad\text{ with } \phi_2(s,x)=\left(\frac{1}{s^2}(x-x_0)^2-\frac{1}{s}\right).

Thus, the reproducing kernels $K_4$ in (3.2) and $K$ in (3.3) from continuous-time data are

K_4(x,x') = \frac{1}{\widebar{\rho}_T(x)\widebar{\rho}_T(x')}\frac{1}{T}\int_0^T\phi_2(s,x)\phi_2(s,x')p_s(x)p_s(x')\,ds;
K(x,x') = \frac{1}{\widebar{\rho}_T(x)\widebar{\rho}_T(x')}\frac{1}{T}\int_0^T\left(1+\phi_2(s,x)\phi_2(s,x')\right)p_s(x)p_s(x')\,ds.
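These kernels can also be evaluated numerically by discretizing the time average in (3.1); the following sketch (Python, a minimal illustration with assumed parameters $x_0=0$ and $T=1$) computes the time-discretized kernel $K_1$ for Brownian motion directly from its transition densities.

```python
import numpy as np

def bm_density(x, t, x0=0.0):
    """Transition density p_t(x) of Brownian motion started at x0."""
    return np.exp(-(x - x0)**2 / (2.0 * t)) / np.sqrt(2.0 * np.pi * t)

def K1_brownian(x, xp, T=1.0, x0=0.0, L=1000):
    """Time-discretized reproducing kernel K_1(x, x') of (3.1) for Brownian motion.

    Approximates (1/T) int_0^T p_t(x) p_t(x') dt and rho_T(x) by Riemann sums
    over L time points, mirroring the discrete-time kernel in (3.1).
    """
    ts = np.linspace(T / L, T, L)            # t_1, ..., t_L
    px = bm_density(x, ts, x0)               # p_{t_l}(x)
    pxp = bm_density(xp, ts, x0)             # p_{t_l}(x')
    rho_x, rho_xp = px.mean(), pxp.mean()    # bar{rho}_T^L at x and x'
    return (px * pxp).mean() / (rho_x * rho_xp)

# Usage: evaluate the kernel on a small grid.
# xs = np.linspace(-2, 2, 5)
# K = np.array([[K1_brownian(a, b) for b in xs] for a in xs])
```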
Example 3.11 (Ornstein-Uhlenbeck process)

Let $a(x)=-\theta x$ and $b=1$ with $\theta>0$. Assume $p_0(x)=\delta_{x_0}$, i.e., $X_0=x_0$. Then $X_t=e^{-\theta t}x_0+\int_0^t e^{-\theta(t-s)}\,dB_s$. It has distribution $\mathcal{N}\big(e^{-\theta t}x_0,\frac{1}{2\theta}(1-e^{-2\theta t})\big)$, thus $p_t(x)=\frac{1}{\sqrt{2\pi}\sigma_t}\exp\big(-\frac{(x-x_0^t)^2}{2\sigma_t^2}\big)$, where $x_0^t:=e^{-\theta t}x_0$ and $\sigma_t^2:=\frac{1}{2\theta}(1-e^{-2\theta t})$. Computing the spatial derivatives, we have $\mathcal{L}^*p_s(x)=\frac{1}{2}\left[\frac{(x-x_0^s)^2}{\sigma_s^4}-\frac{1}{\sigma_s^2}\right]p_s(x)+(\theta xp_s(x))'=\phi_2(s,x)p_s(x)$, where

\phi_2(s,x) := \left[\frac{(x-x_0^s)^2}{2\sigma_s^4}-\frac{1}{2\sigma_s^2}+\theta-\frac{\theta}{\sigma_s^2}x(x-x_0^s)\right].

The reproducing kernels $K_1$ in (3.1), $K_4$ in (3.2) and $K$ in (3.3) are

K_1(x,x') = \frac{1}{\widebar{\rho}_T(x)\widebar{\rho}_T(x')}\frac{1}{T}\int_0^T p_s(x)p_s(x')\,ds;
K_4(x,x') = \frac{1}{\widebar{\rho}_T(x)\widebar{\rho}_T(x')}\frac{1}{T}\int_0^T\phi_2(s,x)\phi_2(s,x')p_s(x)p_s(x')\,ds;
K(x,x') = \frac{1}{\widebar{\rho}_T(x)\widebar{\rho}_T(x')}\frac{1}{T}\int_0^T\left(1+\phi_2(s,x)\phi_2(s,x')\right)p_s(x)p_s(x')\,ds.

In particular, when the process is stationary, we have $K_1(x,x')\equiv 1$ and $K_4(x,x')=0$, because $\mathcal{L}^*p_s=0$ when $p_s(x)=\frac{\sqrt{2\theta}}{\sqrt{2\pi}}\exp(-\theta x^2)$ is the stationary density.

3.2 Non-identifiability due to stationarity and symmetry

When the hypothesis space $\mathcal{H}$ has a dimension larger than that of the RKHS, the quadratic loss functional $\mathcal{E}_1$ may have multiple minimizers. The constraints of upper and lower bounds, as well as the loss functionals $\mathcal{E}_2$ and $\mathcal{E}_3$, can help to identify the observation function. However, as we show next, identifiability may still fail due to symmetry and/or stationarity.

Stationary processes

When the process $(X_t)$ is stationary, the moments in our loss functionals carry limited information. We have $\mathcal{E}_1(f)=\left|\mathbb{E}[Y_{t_1}]-\mathbb{E}[f(X_{t_1})]\right|^2$ with $K_1(x,x')\equiv 1$, so $\mathcal{E}_1$ can only identify a constant function. Also, the loss functional $\mathcal{E}_4\equiv 0$, because

\mathcal{L}^*p_s = \partial_s p_s = 0 \;\Longleftrightarrow\; \mathbb{E}[\mathcal{L}h(X_s)]=0 \text{ for any } h\in C_b^2.

In other words, the function space of identifiability by $\mathcal{E}_1+\mathcal{E}_4$ is the space of constant functions. Meanwhile, the quartic loss functionals $\mathcal{E}_2$ and $\mathcal{E}_3$ also provide limited information: they become $\mathcal{E}_2=\left|\mathbb{E}[f(X_{t_1})^2]-\mathbb{E}[Y_{t_1}^2]\right|^2$ and $\mathcal{E}_3=\left|\mathbb{E}[f(X_{t_2})f(X_{t_1})]-\mathbb{E}[Y_{t_2}Y_{t_1}]\right|^2$, i.e. the second-order moment and the temporal correlation at a single time instance.

To see the limitations, consider the finite-dimensional hypothesis space $\mathcal{H}$ in (2.15). As in (2.12), with $f=\sum_{i=1}^{n}c_i\phi_i$, the loss functional becomes

\mathcal{E}(f) = c^\top\overline{A}_1 c - 2c^\top\overline{b}_1^M + |\mathbb{E}[Y_{t_1}]|^2 + \sum_{k=2}^{3}\left|c^\top A_{k,1}c - b_{k,1}^M\right|^2,

where $\overline{A}_1$ is a rank-one matrix, and the term $\sum_{k=2}^{3}\left|c^\top A_{k,1}c-b_{k,1}^M\right|^2$ only brings in two additional constraints. Thus, $\mathcal{E}$ has multiple minimizers in a linear space with dimension greater than 3. One has to resort to the upper and lower bounds in (2.15) for additional constraints, which lead to minimizers on the boundary of the resulting convex set.

Symmetry

When the distribution of the state process $X_t$ is symmetric, a moment-based loss functional does not distinguish the true observation function from its symmetric counterpart. More specifically, if a transformation $R:\mathbb{R}\to\mathbb{R}$ preserves the distribution, i.e., $(X_t,t\geq 0)$ and $(R(X_t),t\geq 0)$ have the same distribution, then $\mathbb{E}[f(X_t)]=\mathbb{E}[f\circ R(X_t)]$ and $\mathbb{E}[f(X_t)f(X_s)]=\mathbb{E}[f\circ R(X_t)\,f\circ R(X_s)]$. Thus, our loss functional will not distinguish $f$ from $f\circ R$. However, this is reasonable: the two functions yield the same observation process (in distribution), so the observation data do not provide the information necessary to distinguish $f$ from $f\circ R$.

Example 3.12 (Brownian motion)

Consider the standard Brownian motion $X_t$, whose distribution is symmetric about $x=0$ (because the two processes $(X_t,t\geq 0)$ and $(-X_t,t\geq 0)$ have the same distribution). Let the transformation $R$ be $R(x)=-x$. Then the two functions $f(x)$ and $f(-x)$ lead to the same observation process, and thus they cannot be distinguished from the observations.

4 Numerical Examples

We demonstrate the effectiveness and limitations of our algorithm using synthetic data in representative examples. The algorithm works well when the state model's densities vary appreciably in time, yielding a function space of identifiability whose distance to the true observation function is small. In this case, our algorithm leads to a convergent estimator as the sample size increases. We also demonstrate that when the state process is stationary (e.g., the stationary Ornstein-Uhlenbeck process) or symmetric in distribution (e.g., Brownian motion), the loss functional can have multiple minimizers in the hypothesis space, preventing us from identifying the observation functions (see Section 4.3).

4.1 Numerical settings

We first introduce the numerical settings used in the tests.

Data generation.

The synthetic data $\{Y_{t_0:t_L}^{(m)}\}_{m=1}^{M}$ with $t_l=l\Delta t$ are generated from the state model, which is solved by the Euler-Maruyama scheme with time step $\Delta t=0.01$ for $L=100$ steps. We consider sample sizes $M\in\{\lfloor 10^{3.5+j\Delta}\rfloor : j=0,1,2,3,4,\ \Delta=0.0625\}$ to test the convergence of the estimator.

To estimate the moments in the $A$-matrices and $b$-vectors in (2.6)–(2.7) by Monte Carlo, we generate a new set of independent trajectories $\{X_{t_l}^{(m)}\}_{m=1}^{M'}$ with $M'=10^6$. We emphasize that these $X$ samples are independent of the data $\{Y_{t_0:t_L}^{(m)}\}_{m=1}^{M}$.
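For reference, here is a minimal sketch (Python) of how such state trajectories can be generated with the Euler-Maruyama scheme; the drift of the double-well model (4.1) of Section 4.2 and the mixture initial condition are hardcoded, and interpreting the mixture parameters as variances is an assumption.

```python
import numpy as np

def simulate_double_well(M, L=100, dt=0.01, rng=None):
    """Euler-Maruyama simulation of dX = (X - X^3) dt + dB, as in (4.1).

    Initial condition: an equal mixture of N(-0.5, 0.2) and N(1, 0.5),
    interpreting the second arguments as variances (an assumption).
    Returns an array of shape (M, L+1) with X_{t_0}, ..., X_{t_L}.
    """
    rng = np.random.default_rng() if rng is None else rng
    X = np.empty((M, L + 1))
    mix = rng.random(M) < 0.5
    X[:, 0] = np.where(mix,
                       rng.normal(-0.5, np.sqrt(0.2), M),
                       rng.normal(1.0, np.sqrt(0.5), M))
    for l in range(L):
        drift = X[:, l] - X[:, l]**3
        X[:, l + 1] = X[:, l] + drift * dt + np.sqrt(dt) * rng.normal(size=M)
    return X

# Usage: X = simulate_double_well(M=10**6)  # then Y = f_star(X) gives synthetic observations
```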

Inference algorithm.

We follow Algorithm 1 to search for the global minimum of the loss functionals in (2.12). The weights for the k\mathcal{E}_{k}’s are LM/mkYL\sqrt{M}/\|m_{k}^{Y}\|, where \|\cdot\| is the Euclidean norm on L\mathbb{R}^{L} and

mkY(l)=1Mm=1M(Ytl(m))k for k=1,2 and m3Y(l)=1Mm=1MYtl(m)Ytl+1(m),m_{k}^{Y}(l)=\frac{1}{M}\sum_{m=1}^{M}(Y_{t_{l}}^{(m)})^{k}\text{ for }k=1,2\quad\text{ and }\quad m_{3}^{Y}(l)=\frac{1}{M}\sum_{m=1}^{M}Y_{t_{l}}^{(m)}Y_{t_{l+1}}^{(m)},

for l=0,1,,L1l=0,1,\cdots,L-1.
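Given an array Y of shape (M, L+1) holding the observation trajectories, the empirical moments above and the corresponding weights can be computed as in the following sketch (a direct transcription of the formulas, under the assumption that the data are stored row-wise by trajectory):

import numpy as np

def empirical_moments(Y):
    # m_k^Y(l) for l = 0, ..., L-1 from Y of shape (M, L+1).
    m1 = Y[:, :-1].mean(axis=0)               # first moments
    m2 = (Y[:, :-1] ** 2).mean(axis=0)        # second moments
    m3 = (Y[:, :-1] * Y[:, 1:]).mean(axis=0)  # one-step temporal correlations
    return m1, m2, m3

def loss_weights(Y):
    # Weights L*sqrt(M)/||m_k^Y|| balancing the three terms of the loss.
    M, L = Y.shape[0], Y.shape[1] - 1
    return [L * np.sqrt(M) / np.linalg.norm(mk) for mk in empirical_moments(Y)]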

For each example, we test B-spline hypothesis spaces \mathcal{H} with degrees in \{0,1,2,3\} and dimension in the range [1,N] selected by Algorithm 2. We then select the dimension and degree that minimize the 2-Wasserstein distance between the predicted and true distributions of Y. The details are presented in Section C.

Results assessment and presentation.

We present three aspects of the estimator f^\widehat{f}:

  • Estimated and true functions. We compare the estimator with the true function f_{*}, along with the L^{2}(\widebar{\rho}_{T}^{L}) projection of f_{*} onto the linear space spanned by the elements of \mathcal{H}.

  • 2-Wasserstein distance. We present the 2-Wasserstein distance (see (B.5)) between the distributions of Y_{t_{l}}=f_{*}(X_{t_{l}}) and \widehat{f}(X_{t_{l}}) at each time, computed with both the training data and a newly generated independent data set.

  • Convergence of L^{2}(\widebar{\rho}_{T}^{L}) error. We test the convergence of the estimator in L^{2}(\widebar{\rho}_{T}^{L}) as the sample size M increases. The L^{2}(\widebar{\rho}_{T}^{L}) error is computed by a Riemann sum approximation. We present the mean and standard deviation of the L^{2}(\widebar{\rho}_{T}^{L}) errors from 20 independent simulations. The convergence rate is also highlighted, and we compare it with the minimax convergence rate in classical nonparametric regression (see, e.g., [13, 25]), which is \frac{s}{2s+1} with s-1 being the degree of the B-spline basis. Such a minimax rate is not yet available for our method; see Remark 3.8.

Figure 1: Empirical densities from the data trajectories of the state-model process (X_{t_{l}}) in (4.1) and the observation processes (Y_{t_{l}}) with f_{*}=f_{i}, where the f_{i}'s are the three observation functions in (4.2). Since we do not observe data pairs (X_{t_{l}}^{(m)},Y_{t_{l}}^{(m)}), these empirical densities are the only information available from the data. Our goal is to find the function f in the operator that maps the densities of \{X_{t_{l}}\} to the densities of \{Y_{t_{l}}\}.

4.2 Examples

The state model we consider is a stochastic differential equation with the double-well potential

dXt\displaystyle dX_{t} =(XtXt3)dt+dBt,Xt0pt0\displaystyle=(X_{t}-X_{t}^{3})dt+dB_{t},X_{t_{0}}\sim p_{t_{0}} (4.1)

where the density of X_{t_{0}} is an equal-weight mixture of \mathcal{N}(-0.5,0.2) and \mathcal{N}(1,0.5). The distribution of X_{t_{0}:t_{L}} is non-symmetric and far from stationary (see Figure 1(a)). Thus, the quadratic loss functional \mathcal{E}_{1} provides a rich RKHS for learning.

We consider three observation functions representing typical challenges on the set \mathrm{supp}(\widebar{\rho}_{T}): nearly invertible, non-invertible, and non-invertible and discontinuous:

Sine function: f1(x)=\displaystyle\quad f_{1}(x)= sin(x);\displaystyle\sin(x); (4.2)
Sine-Cosine function: f2(x)=\displaystyle\quad f_{2}(x)= 2sin(x)+cos(6x);\displaystyle 2\sin(x)+\cos(6x);
Arch function: f3(x)=\displaystyle\quad f_{3}(x)= (2(1x)3+1.5(1x)+0.5)𝟏x[0,1].\displaystyle\left(-2(1-x)^{3}+1.5(1-x)+0.5\right)\mathbf{1}_{x\in[0,1]}.

These functions are shown in Figures 2(a)–4(a). They lead to observation processes with dramatically different distributions, as shown in Figure 1(b–d).

Figure 2: Learning results of Sine function f1(x)=sin(x)f_{1}(x)=\sin(x) with model (4.1).
Figure 3: Learning results of Sine-Cosine function f2(x)=2sin(x)+cos(6x)f_{2}(x)=2\sin(x)+\cos(6x) with model (4.1).
Figure 4: Learning results of Arch function f3f_{3} with model (4.1).

The learning results for these three functions are shown in Figures 2–4. For each of the three observation functions, we present the estimator with the optimal hypothesis space, the 2-Wasserstein distances of the prediction, and the convergence of the estimator in L^{2}(\widebar{\rho}_{T}^{L}) (see Section 4.1 for details).

Sine function: Fig. 2(a) shows the estimator with a degree-1 B-spline basis of dimension n=9 for M=10^{6}. The L^{2}(\widebar{\rho}_{T}^{L}) error is 0.0245 and the relative error is 3.47%. Fig. 2(b) shows that the Wasserstein distances are small, at the scale of 10^{-3}. Fig. 2(c) shows that the convergence rate of the L^{2}(\widebar{\rho}_{T}^{L}) error is 0.46. This rate is close to the minimax rate \frac{2}{5}=0.4.

Sine-Cosine function: Fig. 3(a) shows the estimator with a degree-2 B-spline basis of dimension n=13. The L^{2}(\widebar{\rho}_{T}^{L}) error is 0.1596 and the relative error is 9.90\%. Fig. 3(b) shows that the Wasserstein distances are at the scale of 10^{-2}. Fig. 3(c) shows that the convergence rate of the L^{2}(\widebar{\rho}_{T}^{L}) error is 0.26, less than the classical minimax rate \frac{3}{7}\approx 0.43. Note also that the variance of the L^{2} error does not decrease as M increases. In comparison with the results for f_{1} in Fig. 2(a), we attribute the relatively low convergence rate and the large variance to the high-frequency component \cos(6x), which is harder to identify from moments than the low-frequency component \sin(x).

Arch function: Fig. 4(a) shows the estimator with a degree-0 B-spline basis of dimension n=45. The L^{2}(\widebar{\rho}_{T}^{L}) error is 0.0645 and the relative error is 14.44\%. Fig. 4(b) shows that the Wasserstein distances are small, at the scale of 10^{-2}. Fig. 4(c) shows that the convergence rate of the L^{2}(\widebar{\rho}_{T}^{L}) error is 0.17, less than the would-be minimax rate \frac{1}{3}\approx 0.33.

Arch function with observation noise: To demonstrate that our method can tolerate large observation noise, we present the estimation results from noisy observations of the Arch function, which is the most difficult of the three examples. Suppose that the observation noise \xi in (2.17) is i.i.d. \mathcal{N}(0,0.25). Note that the average of \mathbb{E}\left[{|Y_{t}|^{2}}\right] is about 0.2, so the signal-to-noise ratio is about \frac{\mathbb{E}[|Y|^{2}]}{\mathbb{E}[\xi^{2}]}\approx 0.8; thus, the noise is relatively large. Nevertheless, our method can identify the function using the moments of the noise, as discussed in Section 2.5. Fig. 5(a) shows the estimator with a degree-1 B-spline basis of dimension n=24. The L^{2}(\widebar{\rho}_{T}^{L}) error is 0.1220 and the relative error is 27.32\%. Fig. 5(b) shows that the Wasserstein distances are small, at the scale of 10^{-3}. The Wasserstein distances are approximated from samples of the noisy data Y=f_{*}(X)+\xi and the noisy prediction \widehat{Y}=\widehat{f}(X)+\xi. Fig. 5(c) shows that the convergence rate of the L^{2}(\widebar{\rho}_{T}^{L}) error is 0.14. The estimation is not as accurate as in the noise-free case because the noisy observation data lead to milder lower and upper bound constraints in (2.15). We emphasize that this tolerance to noise is exceptional for such an ill-posed inverse problem; the key is our use of moments, which averages the noise so that its effect is of order O(\frac{1}{\sqrt{M}}).

Figure 5: Learning results of Arch function f3f_{3} with model (4.1) and i.i.d Gaussian observation noise.

We have also tested piecewise constant observation functions. Our method has difficulty identifying such functions, due to two issues: (i) the uniform partition often misses the jump discontinuities (so even the projection of f_{*} has a large error); and (ii) the moments we consider depend on the observation function non-locally, and thus provide limited information to distinguish the true function from its local perturbations. We leave it for future research to overcome these difficulties by searching for the jump discontinuities and by introducing moments that detect local information.

4.3 Limitations

We demonstrate by examples the non-identifiability due to symmetry and stationarity.

Symmetric distribution

Let the state model be the Brownian motion with initial distribution Unif(0,1)\text{Unif}(0,1). The state process (Xt)(X_{t}) has a distribution that is symmetric with respect to the line x=12x=\frac{1}{2}, i.e., the processes (Xt)(X_{t}) and (1Xt)(1-X_{t}) have the same distribution. Thus, with the reflection function R(x)=1xR(x)=1-x, the processes f(Xt)f(X_{t}) and fR(Xt)f\circ R(X_{t}) have the same distribution, and the observation data does not provide information for distinguishing ff from fRf\circ R. The loss functional (2.4) has at least two minima.

Figure 6 shows that our algorithm finds the reflection of the true function f=sin(x)f_{*}=\sin(x). The hypothesis space \mathcal{H} has B-spline basis functions with degree 2 and dimension 58. Our estimator is close to fR(x)=sin(1x)f_{*}\circ R(x)=\sin(1-x). Its L2(ρ¯TL)L^{2}(\widebar{\rho}_{T}^{L}) error is 1.12441.1244 and its reflection’s L2(ρ¯TL)L^{2}(\widebar{\rho}_{T}^{L}) error is 0.07900.0790. Both the estimator and its reflection correctly predict the distribution of the observation process (Yt)(Y_{t}).

Figure 6: Learning results of f(x)=sin(x)f_{*}(x)=\sin(x) with the state model being Xt=Bt+X0X_{t}=B_{t}+X_{0} where X0Unif(0,1)X_{0}\sim\mathrm{Unif}(0,1). Due to the symmetry with respect to the line x=12x=\frac{1}{2}, the estimator f^(x)\widehat{f}(x) and its reflection f^(1x)\widehat{f}(1-x) are indistinguishable by the loss functional and they lead to similar prediction of the distribution of {Ytl}\{Y_{t_{l}}\}.

Stationary process

When the diffusion process (X_{t}) is stationary, the loss functional (2.4) provides limited information about the observation function. As discussed in Section 3.2, the matrix \overline{A}_{1} has rank 1, and \mathcal{E}_{2}=0 and \mathcal{E}_{3}=0 lead to only two more constraints. The constraints from the upper and lower bounds in (2.15) play a major role in leading to a minimizer at the boundary of the convex set \mathcal{H}.

Figure 7 shows the learning results with the stationary Ornstein-Uhlenbeck process dX_{t}=-X_{t}dt+dB_{t} and with the observation function f_{*}(x)=\sin(x). The stationary density of (X_{t}) is \mathcal{N}(0,\frac{1}{2}). Due to the limited information, the estimator has a large L^{2}(\widebar{\rho}_{T}^{L}) error of 0.2656, and its prediction has large 2-Wasserstein distances oscillating near 0.1290.

Figure 7: Learning results for f_{*}(x)=\sin(x) with the stationary Ornstein-Uhlenbeck process. Due to the limited information from the moments, the estimator relies heavily on the upper and lower bound constraints and is inaccurate.

5 Discussions and conclusion

We have proposed a nonparametric learning method to estimate the observation functions in nonlinear state-space models. It matches generalized moments via constrained regression. The algorithm is suitable for large sets of unlabeled data. Moreover, it can deal with challenging cases in which the observation function is non-invertible. We address the fundamental issue of identifiability from first-order moments. We show that the function spaces of identifiability are the closures of RKHSs intrinsic to the state model. Numerical examples show that the first two moments and temporal correlations, along with upper and lower bounds, can identify functions ranging from piecewise polynomials to smooth functions, and can tolerate considerable observation noise. The limitations of this method, such as non-identifiability due to symmetry and stationarity, are also discussed.

This study provides a first step in the unsupervised learning of latent dynamics from abundant unlabeled data. Several directions call for further exploration: (i) a mixture of unsupervised and supervised learning that combines unlabeled data with limited labeled data, particularly for high-dimensional functions; (ii) enlarging the function space of learning, either by constructing additional first-order generalized moments or by designing experiments to collect more informative data; (iii) joint estimation of the observation function and the state model.

Appendix A A review of RKHS

Positive definite functions

We review the definitions and properties of positive definite kernels. The following is a real-variable version of the definition in [1, p.67].

Definition A.1 (Positive definite function)

Let X be a nonempty set. A function G:X\times X\rightarrow\mathbb{R} is positive definite if and only if it is symmetric (i.e., G(x,y)=G(y,x)) and \sum_{j,k=1}^{n}c_{j}c_{k}G(x_{j},x_{k})\geq 0 for all n\in\mathbb{N}, \{x_{1},\ldots,x_{n}\}\subset X and \mathbf{c}=(c_{1},\ldots,c_{n})\in\mathbb{R}^{n}. The function G is strictly positive definite if the equality holds only when \mathbf{c}=\mathbf{0}\in\mathbb{R}^{n}.

Theorem A.2 (Properties of positive definite kernels)

Suppose that k,k1,k2:X×Xd×dk,k_{1},k_{2}:X\times X\subset\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R} are positive definite kernels. Then

  1. (a)

    k1k2k_{1}k_{2} is positive definite. ([1, p.69])

  2. (b)

    Inner product u,v=j=1dujvj\langle u,v\rangle=\sum_{j=1}^{d}u_{j}v_{j} is positive definite ([1, p.73])

  3. (c)

    f(u)f(v)f(u)f(v) is positive definite for any function f:Xf:X\to\mathbb{R} ([1, p.69]).

RKHS and positive integral operators

We review the definitions and properties of the Mercer kernel, the RKHS, and the related integral operator; see, e.g., [7] for the compact-domain case and [34] for the non-compact case.

Let (X,d) be a metric space and G:X\times X\to\mathbb{R} be continuous and symmetric. We say that G is a Mercer kernel if it is positive definite (as in Definition A.1). The reproducing kernel Hilbert space (RKHS) \mathcal{H}_{G} associated with G is defined to be the closure of \mathrm{span}\{G(x,\cdot):x\in X\} with the inner product

f,gG=i=1,j=1n,mcidjG(xi,yj)\langle f,g\rangle_{\mathcal{H}_{G}}=\sum_{i=1,j=1}^{n,m}c_{i}d_{j}G(x_{i},y_{j})

for any f=\sum_{i=1}^{n}c_{i}G(x_{i},\cdot) and g=\sum_{j=1}^{m}d_{j}G(y_{j},\cdot). It is the unique Hilbert space such that: (1) the linear space \mathrm{span}\{G(\cdot,y),y\in X\} is dense in it; (2) it has the reproducing property, in the sense that for all f\in\mathcal{H}_{G} and x\in X, f(x)=\langle G(x,\cdot),f\rangle_{\mathcal{H}_{G}} (see [7, Theorem 2.9]).

By means of the Mercer Theorem, we can characterize the RKHS G\mathcal{H}_{G} through the integral operator associated with the kernel. Let μ\mu be a nondegenerate Borel measure on (X,d)(X,d) (that is, μ(U)>0\mu(U)>0 for every open set UXU\subset X). Define the integral operator LGL_{G} on L2(X,μ)L^{2}(X,\mu) by

LGf(x)=XG(x,y)f(y)𝑑μ(y).L_{G}f(x)=\int_{X}G(x,y)f(y)d\mu(y).

The RKHS has the operator characterization (see e.g., [7, Section 4.4] and [34]):

Theorem A.3

Assume that G is a Mercer kernel and G\in L^{2}(X\times X,\mu\otimes\mu). Then

  1. 1.

    LGL_{G} is a compact positive self-adjoint operator. It has countably many positive eigenvalues {λi}i=1\{\lambda_{i}\}_{i=1}^{\infty} and corresponding orthonormal eigenfunctions {ϕi}i=1\{\phi_{i}\}_{i=1}^{\infty}. Note that when zero is an eigenvalue of LGL_{G}, the linear space H=span{ϕi}i=1H=\mathrm{span}\{\phi_{i}\}_{i=1}^{\infty} is a proper subspace of L2(μ)L^{2}(\mu).

  2. 2.

    {λiϕi}i=1\{\sqrt{\lambda_{i}}\phi_{i}\}_{i=1}^{\infty} is an orthonormal basis of the RKHS G\mathcal{H}_{G}.

  3. 3.

    The RKHS is the image of the square root of the integral operator, i.e., G=LG1/2L2(X,μ)\mathcal{H}_{G}=L_{G}^{1/2}L^{2}(X,\mu).
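A minimal numerical illustration of Theorem A.3 is sketched below, under assumed choices (a Gaussian kernel on X=[0,1] and \mu taken as the uniform measure, discretized by a midpoint quadrature rule): the eigen-decomposition of the weighted kernel matrix approximates the eigenpairs of L_{G}, and the resulting eigenfunctions are orthonormal in the discretized L^{2}(X,\mu).

import numpy as np

# Assumed setup: Gaussian kernel on [0, 1], mu = uniform measure,
# discretized on a midpoint grid with equal quadrature weights w_j = 1/N.
N = 400
x = (np.arange(N) + 0.5) / N
w = np.full(N, 1.0 / N)
G = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.1)

# Nystrom approximation of L_G: the symmetrized matrix sqrt(W) G sqrt(W)
# shares its eigenvalues with the discretized integral operator.
sqw = np.sqrt(w)
lam, V = np.linalg.eigh(sqw[:, None] * G * sqw[None, :])
lam, V = lam[::-1], V[:, ::-1]          # non-increasing order
phi = V / sqw[:, None]                  # eigenfunction values on the grid

# Check the discrete L^2(mu)-orthonormality of the eigenfunctions.
print(np.allclose((phi.T * w) @ phi, np.eye(N)))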

Appendix B Algorithm details

B.1 B-spline basis and dimension of the hypothesis space

The choice of hypothesis space is important for nonparametric regression. We choose the basis functions to be B-splines. To select the dimension of the hypothesis space, we introduce a new algorithm to estimate a range for the dimension, and we then select the dimension that minimizes the 2-Wasserstein distance between the distributions of the data and of the prediction.

B-Spline basis functions

We briefly review the definition of B-spline basis functions and we refer to [29, Chapter 2] and [26] for details. Given a nondecreasing sequence of real numbers, called knots, (r0,r1,,rm)(r_{0},r_{1},\ldots,r_{m}), the B-spline basis functions of degree pp, denoted by {Ni,p}i=0mp1\{N_{i,p}\}_{i=0}^{m-p-1}, are defined recursively as

Ni,0(r)={1,rir<ri+10,otherwise,Ni,p(r)=rriri+priNi,p1(r)+ri+p+1rri+p+1ri+1Ni+1,p1(r).\displaystyle N_{i,0}(r)=\left\{\begin{array}[]{lr}1,&r_{i}\leq r<r_{i+1}\\ 0,&\text{otherwise}\end{array}\right.,\qquad N_{i,p}(r)=\frac{r-r_{i}}{r_{i+p}-r_{i}}N_{i,p-1}(r)+\frac{r_{i+p+1}-r}{r_{i+p+1}-r_{i+1}}N_{i+1,p-1}(r).

Each function N_{i,p} is a nonnegative local polynomial of degree p, supported on [r_{i},r_{i+p+1}]. At a knot with multiplicity k, it is p-k times continuously differentiable. Hence, the differentiability increases with the degree and decreases as the knot multiplicity increases. The basis satisfies a partition of unity property: for each r\in[r_{i},r_{i+1}], \sum_{j}N_{j,p}(r)=\sum_{j=i-p}^{i}N_{j,p}(r)=1.

We set the knots of the spline functions to be a uniform partition of [R_{min},R_{max}] (the support of the measure \widebar{\rho}_{T}^{L} in (2.16)): R_{min}=r_{0}\leq r_{1}\leq\cdots\leq r_{m}=R_{max}. For each degree p, we set the basis functions of the hypothesis space \mathcal{H}, which has dimension n=m-p, to be

ϕi(r)=Ni,p(r),i=0,,mp1.\phi_{i}(r)=N_{i,p}(r),\ i=0,\dots,m-p-1.

Thus, the basis functions \{\phi_{i}\} are piecewise degree-p polynomials with knots adapted to the support of \widebar{\rho}_{T}^{L}.
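For completeness, the following sketch evaluates the degree-p B-spline basis on a uniform knot vector by the recursion above (using the convention 0/0=0); it is a direct implementation of the definition, not the code used for the experiments.

import numpy as np

def bspline_basis(r, knots, p):
    # Cox-de Boor recursion: returns an array of shape (len(r), m - p)
    # whose columns are N_{0,p}, ..., N_{m-p-1,p} evaluated at r.
    r = np.atleast_1d(np.asarray(r, dtype=float))
    t = np.asarray(knots, dtype=float)
    # Degree 0: indicators of [t_i, t_{i+1}).
    N = np.array([(t[i] <= r) & (r < t[i + 1]) for i in range(len(t) - 1)],
                 dtype=float).T
    for q in range(1, p + 1):
        Nq = np.zeros((len(r), len(t) - q - 1))
        for i in range(len(t) - q - 1):
            d1, d2 = t[i + q] - t[i], t[i + q + 1] - t[i + 1]
            left = (r - t[i]) / d1 * N[:, i] if d1 > 0 else 0.0
            right = (t[i + q + 1] - r) / d2 * N[:, i + 1] if d2 > 0 else 0.0
            Nq[:, i] = left + right
        N = Nq
    return N

# Uniform knots r_0, ..., r_m on [Rmin, Rmax]; n = m - p basis functions.
Rmin, Rmax, m, p = -2.0, 2.0, 12, 2
knots = np.linspace(Rmin, Rmax, m + 1)
r = np.linspace(knots[p], knots[m - p], 200, endpoint=False)
B = bspline_basis(r, knots, p)
print(B.shape, np.allclose(B.sum(axis=1), 1.0))  # partition of unity on [r_p, r_{m-p})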

Dimension of the hypothesis space.

The choice of the dimension n of \mathcal{H} is important to avoid under- and over-fitting: we choose it by minimizing the 2-Wasserstein distance between the empirical distribution of the observed process (Y_{t}) and that predicted by our estimated observation function. To reduce the computational burden, we proceed in two steps: first we determine a rough range for n, and then within this range we select the dimension with the minimal Wasserstein distance.

Step 1: we introduce an algorithm, called Cross-validating Estimation of Dimension Range (CEDR), to estimate the range [1,N][1,N] for the dimension of the hypothesis space, based on the quadratic loss functional 1\mathcal{E}_{1}. Its main idea is to restrict NN to avoid overly amplifying the estimator’s sampling error, which is estimated by splitting the data into two sets. It incorporates the function space of identifiability in Section 3.1 into the SVD analysis [9, 15] of the normal matrix and vector from 1\mathcal{E}_{1}.

The CEDR algorithm estimates the sampling error in the minimizer of loss functional 1\mathcal{E}_{1} through SVD analysis in three steps. First, we compute the normal matrix A¯1\overline{A}_{1} and vector b¯1\overline{b}_{1} in (2.6) by Monte Carlo; to estimate the sampling error in b¯1\overline{b}_{1}, we compute two copies, bb and bb^{\prime}, of b¯1\overline{b}_{1} from two halves of the data:

b(i)=1Ll=1L𝔼[ϕi(Xtl)]2Mm=1M2Ytl(m),b(i)=1Ll=1L𝔼[ϕi(Xtl)]2Mm=M2+1MYtl(m).\displaystyle b(i)=\frac{1}{L}\sum_{l=1}^{L}\mathbb{E}\left[{\phi_{i}(X_{t_{l}})}\right]\frac{2}{M}\sum_{m=1}^{\lfloor\frac{M}{2}\rfloor}{Y_{t_{l}}^{(m)}},\quad b^{\prime}(i)=\frac{1}{L}\sum_{l=1}^{L}\mathbb{E}\left[{\phi_{i}(X_{t_{l}})}\right]\frac{2}{M}\sum_{m=\lfloor\frac{M}{2}\rfloor+1}^{M}{Y_{t_{l}}^{(m)}}. (B.1)

Second, we perform an eigen-decomposition to find an orthonormal basis of L^{2}(\widebar{\rho}_{T}^{L}), the default function space of learning. We view the matrix \overline{A}_{1} as a representation of the integral operator L_{K_{1}} in Lemma 3.4 on \mathcal{H}=\mathrm{span}\{\phi_{i}\}_{i=1}^{n}. The eigen-decomposition requires solving the generalized eigenvalue problem

A¯1u=λBu, where B=(ϕi,ϕjL2(ρ¯TL))\overline{A}_{1}u=\lambda Bu,\quad\text{ where }B=(\langle{\phi_{i},\phi_{j}}\rangle_{L^{2}(\widebar{\rho}_{T}^{L})}) (B.2)

(see [20, Theorem 5.1]). Denote the eigen-pairs by \{\sigma_{i},u_{i}\}, where the eigenvalues \{\sigma_{i}\} are non-increasingly ordered and the eigenvectors are normalized so that u_{i}^{\top}Bu_{j}=\delta_{i,j}. Thus, we have \overline{A}_{1}=\sum_{i=1}^{n}\sigma_{i}(Bu_{i})(Bu_{i})^{\top} (assuming that all the \sigma_{i}'s are positive; otherwise, we drop the zero eigenvalues). The least-squares estimators from b and b^{\prime} are c=\sum_{i=1}^{n}\frac{u_{i}^{\top}b}{\sigma_{i}}u_{i} and c^{\prime}=\sum_{i=1}^{n}\frac{u_{i}^{\top}b^{\prime}}{\sigma_{i}}u_{i}, respectively. Third, the difference between the corresponding function estimators represents the sampling error (with \Delta c=c-c^{\prime})

g(n):=\displaystyle g(n):= f^f^L2(ρ¯TL)2=k=1nΔckϕkL2(ρ¯TL)2=i,j=1nΔciϕi,ϕjL2(ρ¯TL)Δcj=ΔcBΔc\displaystyle\|\widehat{f}-\widehat{f}^{\prime}\|_{L^{2}(\widebar{\rho}_{T}^{L})}^{2}=\|\sum_{k=1}^{n}\Delta c_{k}\phi_{k}\|_{L^{2}(\widebar{\rho}_{T}^{L})}^{2}=\sum_{i,j=1}^{n}\Delta c_{i}\langle{\phi_{i},\phi_{j}}\rangle_{L^{2}(\widebar{\rho}_{T}^{L})}\Delta c_{j}=\Delta c^{\top}B\Delta c (B.3)
=i,j=1nui(bb)σiuiBujuj(bb)σj=i=1nri2,\displaystyle=\sum_{i,j=1}^{n}\frac{u_{i}^{\top}(b-b^{\prime})}{\sigma_{i}}u_{i}^{\top}Bu_{j}\frac{u_{j}^{\top}(b-b^{\prime})}{\sigma_{j}}=\sum_{i=1}^{n}r_{i}^{2},

where r_{i}=\frac{|u_{i}^{\top}(b-b^{\prime})|}{\sigma_{i}}. The ratio r_{i} is in the same spirit as the Picard projection ratio \frac{|u_{i}^{\top}b|}{\sigma_{i}} in [15], which is used to detect overfitting. Note that the eigenvalues \sigma_{i} decay to zero as n increases because the operator L_{K_{1}} is compact. Clearly, the sampling error g(n) should be smaller than \|f_{*}\|_{L^{2}(\widebar{\rho}_{T}^{L})}^{2}, which is the time-average of the second moments of the observations. Thus, we set N to be

N\displaystyle N =max{k1:g(k)τ}, where τ=1LMl=1,m=1L,M|Ytl(m)|2.\displaystyle=\max\{k\geq 1:g(k)\leq\tau\},\text{ where }\tau=\frac{1}{LM}\sum_{l=1,m=1}^{L,M}|Y_{t_{l}}^{(m)}|^{2}. (B.4)

We note that this threshold is relatively crude and neglects the rich information contained in g; refining it is a subject worthy of further investigation.

Algorithm 2 summarizes the above procedure.

Algorithm 2 Cross-validating Estimation of Dimension Range (CEDR) for hypothesis space
1:The state model and data {Yt0:tL(m)}m=1M\{Y_{t_{0}:t_{L}}^{(m)}\}_{m=1}^{M}.
2:A range [1,N][1,N] for the dimension of the hypothesis space for further selection.
3:Estimate the empirical density ρ¯T\widebar{\rho}_{T} in (2.16) and find its support [Rmin,Rmax][R_{min},R_{max}].
4:Set n=1n=1 and g(n)=0g(n)=0. Estimate the threshold τ\tau in (B.4).
5:while g(n)τg(n)\leq\tau do
6:     Set nn+1n\leftarrow n+1. Update the basis functions, Fourier or B-spline, as in Section 2.3.
7:     Compute normal matrix A¯1\overline{A}_{1} in (2.6) by Monte Carlo. Also, compute bb and bb^{\prime} in (B.1).
8:     Solve the generalized eigenvalue problem (B.2) for \overline{A}_{1}; return the eigen-pairs \{(\sigma_{i},u_{i})\}_{i=1}^{n} with u_{i}^{\top}Bu_{j}=\delta_{i,j}.
9:     Compute the Picard projection ratios: ri=|ui(bb)|σir_{i}=\frac{|u_{i}^{\top}(b-b^{\prime})|}{\sigma_{i}} for i=1,,ni=1,\ldots,n and g(n)=i=1nri2g(n)=\sum_{i=1}^{n}r_{i}^{2}.
10:Return N=nN=n.
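A condensed sketch of the CEDR loop is given below; it uses scipy.linalg.eigh to solve the generalized eigenvalue problem (B.2), and assumes a user-supplied routine build_quantities(n) (a hypothetical helper) that returns the Monte Carlo estimates \overline{A}_{1}, b, b^{\prime}, and B of (2.6), (B.1), and (B.2) for a basis of dimension n.

import numpy as np
from scipy.linalg import eigh

def cedr_upper_bound(build_quantities, tau, n_max=100):
    # Return the largest dimension n with sampling-error indicator g(n) <= tau,
    # following (B.3)-(B.4). build_quantities(n) is an assumed helper returning
    # (Abar1, b, bprime, B) for a basis of dimension n.
    N = 1
    for n in range(2, n_max + 1):
        Abar1, b, bprime, B = build_quantities(n)
        # Generalized eigenvalue problem Abar1 u = sigma B u with u^T B u = 1.
        sigma, U = eigh(Abar1, B)
        sigma, U = sigma[::-1], U[:, ::-1]        # non-increasing order
        keep = sigma > 1e-12 * sigma[0]           # drop (near-)zero eigenvalues
        r = (U[:, keep].T @ (b - bprime)) / sigma[keep]
        g = np.sum(r**2)                          # the indicator g(n) in (B.3)
        if g > tau:
            break
        N = n
    return N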

Step 2: We select the dimension n and degree of the B-spline basis functions to be those with the smallest 2-Wasserstein distance between the distribution of the data and that of the predictions. More precisely, let \mu^{f}_{t_{l}} and \mu^{\widehat{f}}_{t_{l}} denote the distributions of Y_{t_{l}}=f(X_{t_{l}}) and \widehat{f}(X_{t_{l}}), respectively. Let F_{t_{l}} and \widehat{F}_{t_{l}} denote their cumulative distribution functions (CDFs), with F_{t_{l}}^{-1} and \widehat{F}_{t_{l}}^{-1} being their inverses. We compute F_{t_{l}} from the data and \widehat{F}_{t_{l}} from independent simulations. We approximate the inverses by empirical quantiles and compute the 2-Wasserstein distance

(1Ll=1LW2(μtlf,μtlf^)2)1/2 with W2(μtlf,μtlf^)=(01(Ftl1(r)F^tl1(r))2𝑑r)12,\displaystyle\left(\frac{1}{L}\sum_{l=1}^{L}W_{2}(\mu^{f}_{t_{l}},\mu^{\widehat{f}}_{t_{l}})^{2}\right)^{1/2}\quad\text{ with }W_{2}(\mu^{f}_{t_{l}},\mu^{\widehat{f}}_{t_{l}})=\left(\int_{0}^{1}(F^{-1}_{t_{l}}(r)-\widehat{F}^{-1}_{t_{l}}(r))^{2}dr\right)^{\frac{1}{2}}, (B.5)

This method of computing the Wasserstein distance is based on an observation in [5], and it has been used in [28, 19]. Recall that the 2-Wasserstein distance W_{2}(\mu,\nu) of two probability measures \mu and \nu on \Omega with finite second moments is given by W_{2}(\mu,\nu)=\left(\inf\limits_{\gamma\in\Gamma(\mu,\nu)}\int_{\Omega\times\Omega}|x-y|^{2}d\gamma(x,y)\right)^{\frac{1}{2}}, where \Gamma(\mu,\nu) denotes the set of all measures on \Omega\times\Omega with \mu and \nu as marginals. Let F and G be the CDFs of \mu and \nu, respectively, and let F^{-1} and G^{-1} be their quantile functions. Then the L^{2} distance of the quantile functions, d_{2}(\mu,\nu)=\left(\int_{0}^{1}|F^{-1}(r)-G^{-1}(r)|^{2}dr\right)^{\frac{1}{2}}, is equal to the 2-Wasserstein distance W_{2}(\mu,\nu).
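A minimal sketch of the quantile-based computation of (B.5) from samples is given below (using numpy's empirical quantiles; a simplified stand-in for the actual implementation).

import numpy as np

def w2_from_samples(y, yhat, n_quantiles=1000):
    # Approximate W_2 between two one-dimensional empirical distributions
    # via the L^2 distance of their quantile functions.
    r = (np.arange(n_quantiles) + 0.5) / n_quantiles
    return np.sqrt(np.mean((np.quantile(y, r) - np.quantile(yhat, r)) ** 2))

def averaged_w2(Y, Yhat):
    # Time-averaged 2-Wasserstein distance in (B.5); Y and Yhat hold samples
    # of Y_{t_l} = f(X_{t_l}) and of f_hat(X_{t_l}), one column per time t_l.
    L = Y.shape[1] - 1
    d = [w2_from_samples(Y[:, l], Yhat[:, l]) for l in range(1, L + 1)]
    return np.sqrt(np.mean(np.square(d)))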

B.2 Optimization with multiple initial conditions

With the convex hypothesis space in (2.15), the minimization in (2.12) is a constrained optimization problem and may have multiple local minima. Note that the loss functional \mathcal{E}^{M}(c) in (2.12) consists of a quadratic term and two quartic terms. The quadratic term, which represents \mathcal{E}_{1}^{M} in (2.5), has Hessian matrix \overline{A}_{1}, which is often rank deficient because it is an average of rank-one matrices by its definition (2.6). Thus, the quadratic term has a valley of minima along the kernel of \overline{A}_{1}. The two quartic terms have valleys of minima at the intersections of the ellipse-shaped manifolds \{c\in\mathbb{R}^{n}:c^{\top}A_{k,l}c=b_{k,l}^{M}\}_{l=1}^{L} for k=2,3. Symmetry in the distribution of the state process can also lead to multiple minima (see Section 3.2 for further discussion).

To reduce the possibility of obtaining a local minimum, we search for a minimizer from multiple initial conditions. We consider the following initial conditions: (1) the least squares estimator for the quadratic term; (2) the minimizer of the quadratic term in the hypothesis space, solved by least squares with linear constraints using the MATLAB function lsqlin, starting from the least squares estimator; (3) the minimizers of the quartic terms over the hypothesis space, found by constrained optimization with the MATLAB function fmincon using the interior-point algorithm. Then, among the minimizers from these initial conditions, we take the one leading to the smallest 2-Wasserstein distance.
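A schematic of the multi-start strategy is sketched below, using scipy.optimize in place of the MATLAB routines above and simple box constraints on the coefficients as a stand-in for the pointwise constraints in (2.15); loss, bounds, initial_guesses, and w2_of are assumed to be supplied by the surrounding pipeline.

import numpy as np
from scipy.optimize import minimize

def multistart_minimize(loss, bounds, initial_guesses, w2_of):
    # Minimize the loss from several initial coefficient vectors within box
    # constraints, then keep the candidate with the smallest 2-Wasserstein
    # prediction error (computed by the assumed helper w2_of).
    candidates = []
    for c0 in initial_guesses:
        res = minimize(loss, np.asarray(c0, dtype=float),
                       method="L-BFGS-B", bounds=bounds)
        candidates.append(res.x)
    return min(candidates, key=w2_of)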

Appendix C Selection of dimension and degree of the B-spline basis

We demonstrate the selection of the dimension and degree of the B-spline basis functions of the hypothesis space. As described in Section 2.3, we select the dimension and degree in two steps: we first select a rough range for the dimension by the Cross-validating Estimation of Dimension Range (CEDR) algorithm; then we pick the dimension and degree to be the ones with minimal 2-Wasserstein distance between the true and estimated distribution of the observation processes.

The CEDR algorithm helps to reduce the computational cost by estimating the dimension range for the hypothesis space. It is based on an SVD analysis of the normal matrix A¯1\overline{A}_{1} and vector b¯1\overline{b}_{1} from the quadratic loss functional 1\mathcal{E}_{1}. The key idea is to control the sampling error’s effect on the estimator in the metric of the function space of learning. The sampling error is estimated by computing two copies of the normal vector through splitting the data into two halves. The function space of learning plays an important role here: it directs us to use a generalized eigenvalue problem for the SVD analysis. This is different from the classical SVD analysis in [15], where the information of the function space is neglected.

Figure 8: Selection of the dimension and degree of the B-spline basis functions for the Sine-Cosine function. In (a), the 2-Wasserstein distance reaches its minimum among all cases when the degree is 2 and the knot number is 15, at the same dimension at which the L^{2}(\widebar{\rho}_{T}^{L}) error reaches its minimum. Panel (b) shows the cross-validating error indicator g (defined in (B.3)) used for selecting the dimension range, suggesting an upper bound N=60 with the threshold \tau.

Figure 8 shows the dimension selection by the 2-Wasserstein distance and by the CEDR algorithm for the Sine-Cosine example. To confirm the effectiveness of the CEDR algorithm, we compute the 2-Wasserstein distances for all dimensions in (a), side-by-side with the CEDR sampling error indicator g in (b) for relatively large dimensions n=75-deg with deg\in\{0,1,2,3\}. First, panel (a) suggests that the optimal dimension and degree are n=13 and deg=2, where the 2-Wasserstein distance reaches its minimum among all cases, at the same dimension as the minimum of the L^{2}(\widebar{\rho}_{T}^{L}) error. For the other degrees, the minimal 2-Wasserstein distances are reached either before or after the minimum of the L^{2}(\widebar{\rho}_{T}^{L}) error. Thus, the 2-Wasserstein distance correctly selects the optimal dimension and degree for the hypothesis space. Second, panel (b) shows that the CEDR algorithm effectively selects the dimension range. With the threshold in (B.4) being \tau=1.60, which is relatively large (representing a tolerance of 100% relative error), the dimension upper bounds are around N=60 for all degrees, and the range encloses the optimal dimensions selected by the 2-Wasserstein distance in (a).

Here we used a relatively large threshold for a rough estimate of the dimension range. Our cross-validating error indicator g(k) in (B.3) provides rich information about the growth of the sampling error as the dimension increases. A future direction is to extract this information, together with the spectral decay of the integral operator, to balance the sampling error against the approximation error.

Acknowledgments

MM, YGK and FL are partially supported by DE-SC0021361 and FA9550-21-1-0317. FL is partially funded by the NSF Award DMS-1913243.

References

  • [1] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic analysis on semigroups: theory of positive definite and related functions, volume 100. New York: Springer, 1984.
  • [2] S. A. Billings. Nonlinear System Identification. John Wiley & Sons, Ltd, Chichester, UK, 2013.
  • [3] P. Brockwell and R. Davis. Time series: theory and methods. Springer, New York, 2nd edition, 1991.
  • [4] O. Cappé, E. Moulines, and T. Rydén. Inference in Hidden Markov Models. Springer Series in Statistics. Springer, New York ; London, 2005.
  • [5] J. A. Carrillo and G. Toscani. Wasserstein metric and large–time asymptotics of nonlinear diffusion equations. In New Trends in Mathematical Physics: In Honour of the Salvatore Rionero 70th Birthday, pages 234–244. World Scientific, 2004.
  • [6] R. R. Coifman, S. Lafon, A. B. Lee, M. Maggioni, B. Nadler, F. Warner, and S. W. Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. Proceedings of the National Academy of Sciences of the United States of America, 102(21):7426–7431, 2005.
  • [7] F. Cucker and D. X. Zhou. Learning theory: an approximation theory viewpoint, volume 24. Cambridge University Press, 2007.
  • [8] J. Fan and Q. Yao. Nonlinear Time Series: Nonparametric and Parametric Methods. Springer, New York, NY, 2003.
  • [9] R. D. Fierro, G. H. Golub, P. C. Hansen, and D. P. O’Leary. Regularization by Truncated Total Least Squares. SIAM J. Sci. Comput., 18(4):1223–1241, 1997.
  • [10] C. Gelada, S. Kumar, J. Buckman, O. Nachum, and M. G. Bellemare. DeepMDP: Learning Continuous Latent Space Models for Representation Learning. arXiv preprint arXiv:1906.02736, 2019.
  • [11] A. Ghosh, S. Mukhopadhyay, S. Roy, and S. Bhattacharya. Bayesian inference in nonparametric dynamic state-space models. Statistical Methodology, 21:35–48, 2014.
  • [12] N. Guglielmi and E. Hairer. Classification of Hidden Dynamics in Discontinuous Dynamical Systems. SIAM J. Appl. Dyn. Syst., 14(3):1454–1477, 2015.
  • [13] L. Györfi, M. Kohler, A. Krzyzak, and H. Walk. A distribution-free theory of nonparametric regression. Springer Science & Business Media, 2006.
  • [14] D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson. Learning Latent Dynamics for Planning from Pixels. arXiv preprint arXiv:1811.04551, 2019.
  • [15] P. C. Hansen. The L-curve and its use in the numerical treatment of inverse problems. In in Computational Inverse Problems in Electrocardiology, ed. P. Johnston, Advances in Computational Bioengineering, pages 119–142. WIT Press, 2000.
  • [16] M. R. Jeffrey. Hidden Dynamics: The Mathematics of Switches, Decisions and Other Discontinuous Behaviour. Springer International Publishing, Cham, 2018.
  • [17] L. Kaiser, M. Babaeizadeh, P. Milos, B. Osinski, R. H. Campbell, K. Czechowski, D. Erhan, C. Finn, P. Kozakowski, S. Levine, A. Mohiuddin, R. Sepassi, G. Tucker, and H. Michalewski. Model-Based Reinforcement Learning for Atari. arXiv preprint arXiv:1903.00374, 2020.
  • [18] N. Kantas, A. Doucet, S. Singh, and J. Maciejowski. An Overview of Sequential Monte Carlo Methods for Parameter Estimation in General State-Space Models. IFAC Proc. Vol., 42(10):774–785, 2009.
  • [19] N. Kolbe. Wasserstein distance. https://github.com/nklb/wasserstein-distance, 2020.
  • [20] Q. Lang and F. Lu. Identifiability of interaction kernels in mean-field equations of interacting particles. arXiv preprint arXiv:2106.05565, 2021.
  • [21] K. Law, A. Stuart, and K. Zygalakis. Data Assimilation: A Mathematical Introduction. Springer, 2015.
  • [22] Z. Li, F. Lu, M. Maggioni, S. Tang, and C. Zhang. On the identifiability of interaction functions in systems of interacting particles. Stochastic Processes and their Applications, 132:135–163, 2021.
  • [23] L. Ljung. System identification. In Signal analysis and prediction, pages 163–173. Springer, 1998.
  • [24] F. Lu, Q. Lang, and Q. An. Data adaptive RKHS Tikhonov regularization for learning kernels in operators. arXiv preprint arXiv:2203.03791, 2022.
  • [25] F. Lu, M. Zhong, S. Tang, and M. Maggioni. Nonparametric inference of interaction laws in systems of agents from trajectory data. Proc. Natl. Acad. Sci. USA, 116(29):14424–14433, 2019.
  • [26] T. Lyche, C. Manni, and H. Speleers. Foundations of Spline Theory: B-Splines, Spline Approximation, and Hierarchical Refinement, volume 2219, pages 1–76. Springer International Publishing, Cham, 2018.
  • [27] C. Moosmüller, F. Dietrich, and I. G. Kevrekidis. A geometric approach to the transport of discontinuous densities. arXiv preprint arXiv:1907.08260, 2019.
  • [28] V. M. Panaretos and Y. Zemel. Statistical aspects of wasserstein distances. Annual review of statistics and its application, 6:405–431, 2019.
  • [29] L. Piegl and W. Tiller. The NURBS Book. Monographs in Visual Communication. Springer Berlin Heidelberg, Berlin, Heidelberg, 1997.
  • [30] Y. Pokern, A. M. Stuart, and P. Wiberg. Parameter estimation for partially observed hypoelliptic diffusions. J. R. Stat. Soc. Ser. B Stat. Methodol., 71(1):49–73, 2009.
  • [31] B. L. S. Prakasa Rao. Statistical inference from sampled data for stochastic processes. In N. U. Prabhu, editor, Contemporary Mathematics, volume 80, pages 249–284. American Mathematical Society, Providence, Rhode Island, 1988.
  • [32] A. Rahimi and B. Recht. Unsupervised regression with applications to nonlinear system identification. In Advances in Neural Information Processing Systems, pages 1113–1120, 2007.
  • [33] M. Sørensen. Estimating functions for diffusion-type processes. In Statistical Methods for Stochastic Differential Equations, volume 124, pages 1–107. Monogr. Statist. Appl. Probab, 2012.
  • [34] H. Sun. Mercer theorem for RKHS on noncompact sets. Journal of Complexity, 21(3):337 – 349, 2005.
  • [35] A. Svensson and T. B. Schön. A flexible state–space model for learning nonlinear dynamical systems. Automatica, 80:189–199, 2017.
  • [36] F. Tobar, P. M. Djuric, and D. P. Mandic. Unsupervised State-Space Modeling Using Reproducing Kernels. IEEE Trans. Signal Process., 63(19):5210–5221, 2015.