
Numerical Method for Parameter Inference of Nonlinear ODEs with Partial Observations

Yu Chen Centre for Quantitative Analysis and Modeling (CQAM), The Fields Institute for Research in Mathematical Sciences, 222 College Street, Toronto, Ontario, Canada. School of Mathematics, Shanghai University of Finance and Economics, Shanghai, China Jin Cheng School of Mathematical Sciences, Fudan University, Shanghai, 200433, China School of Mathematics, Shanghai University of Finance and Economics, Shanghai, China Arvind Gupta Computer Science, University of Toronto, Toronto, Ontario, Canada. Huaxiong Huang Corresponding author: [email protected] Joint Mathematical Research Centre of Beijing Normal University and BNU-HKBU United International College, Zhuhai, China Centre for Quantitative Analysis and Modeling (CQAM), The Fields Institute for Research in Mathematical Sciences, 222 College Street, Toronto, Ontario, Canada. Computer Science, University of Toronto, Toronto, Ontario, Canada. Department of Mathematics and Statistics, York University, Toronto, Ontario, Canada. Shixin Xu Corresponding author: [email protected] Duke Kunshan University, 8 Duke Ave, Kunshan, Jiangsu, China.

Abstract

Parameter inference for dynamical systems is a challenging task faced by researchers and practitioners in many fields. In many applications, only a subset of the state variables is observable. In this paper, we propose a method for parameter inference in systems of nonlinear coupled ODEs with partial observations. Our method combines fast Gaussian process based gradient matching (FGPGM) with deterministic optimization. Using the initial guess provided by a Bayesian step with a small number of samples, the subsequent deterministic optimization is both accurate and efficient.

Key words: Gaussian Process, Parameter inference, Nonlinear ODEs, Partial observations

1 Introduction

Many problems in science and engineering can be modelled by systems of ordinary differential equations (ODEs). It is often difficult or impossible to measure some parameters of these systems directly. Therefore, various methods have been developed to estimate parameters from available data. Mathematically, such problems are classified as inverse problems, which have been widely studied [1, 2, 21, 13]. They can also be treated as parameter inference problems in statistics [19, 11].

For nonlinear ODEs, standard statistical inference is time consuming because numerical integration is needed after each update of the parameters [5, 14]. Recently, gradient matching techniques have been proposed to circumvent the high computational cost of numerical integration [19, 6, 16, 9]. These techniques are based on minimizing the difference between the state derivatives obtained from a data interpolant and those given by the ODE right-hand side, and usually involve two steps: data interpolation and parameter adaptation. Among them, nonparametric Bayesian modelling with Gaussian processes is one of the most promising approaches. Calderhead et al. [5] proposed an adaptive gradient matching method based on a product-of-experts approach and a marginalization over the derivatives of the state variables, which was later extended by Dondelinger et al. [6]. Barber & Wang [3] proposed a GPODE method in which the state variables are marginalized. Macdonald et al. [14] provided an interpretation of the above paradigms. Wenk et al. [23] proposed a fast Gaussian process based gradient matching (FGPGM) algorithm together with a theoretical framework for systems of nonlinear ODEs, and showed it to be more accurate, robust and efficient than earlier approaches.

For many practical problems, the variables are only partially observable, or not observable at all times. As a consequence, parameter inference is more challenging, even for coupled systems whose parameters are uniquely determined by the partially observed data under certain initial conditions. It is not clear whether gradient matching techniques can be applied when latent variables are present. Markov chain Monte Carlo algorithms can side-step the issue of parameter identifiability in many cases, but convergence remains a serious issue [19]. Therefore, the feasibility, accuracy, robustness and computational cost of numerical computations for such problems deserve careful attention.

In this work, we focus on parameter inference with partially observable data. The main idea is to treat the observable and non-observable variables differently. For observable variables, we use the same approach as Wenk et al. [23]; non-observable variables are obtained by integrating the ODEs. To circumvent the high computational cost of sampling in the Bayesian approach, we combine FGPGM with a least squares optimization. The remainder of the paper is organized as follows. In Section 2 we describe the numerical method for parameter identification with partial observations. Numerical examples are presented in Section 3. Concluding remarks are given in Section 4.

2 Algorithm

The main strategy of FGPGM is to minimize the mismatch between the data and the ODE solutions in a maximum likelihood sense, making use of the property that a Gaussian process is closed under differentiation.

In this work, we estimate the time-independent parameters \boldsymbol{\theta} of a dynamical system described by

\dot{\boldsymbol{X}}=\boldsymbol{f}(\boldsymbol{X},\boldsymbol{\theta}). (1)

Here \dot{\boldsymbol{X}} is the time derivative of the state \boldsymbol{X} and \boldsymbol{f} can be a nonlinear vector-valued function. We assume that only part of the variables are measurable and denote them by \boldsymbol{X}_{M}. They are observed at discrete time points t_{i} (i=1,\dots,N) as \boldsymbol{Y}(t_{i}) with noise \epsilon, such that \boldsymbol{Y}=\boldsymbol{X}_{M}+\epsilon. We assume the noise is Gaussian, \epsilon(t_{i})\sim\mathcal{N}(\boldsymbol{0},\sigma^{2}\boldsymbol{I}), so that

\rho(\boldsymbol{y}|\boldsymbol{x}_{M},\sigma)=\mathcal{N}(\boldsymbol{y}|\boldsymbol{x}_{M},\sigma^{2}\boldsymbol{I}), (2)

where \boldsymbol{x}_{M} and \boldsymbol{y} are the realizations of \boldsymbol{X}_{M} and \boldsymbol{Y}, respectively. The latent (unmeasurable) variables are denoted by \boldsymbol{X}_{L}, with \dim(\boldsymbol{X}_{M})+\dim(\boldsymbol{X}_{L})=\dim(\boldsymbol{X}). The idea of Gaussian process based gradient matching is as follows. First, we put a Gaussian process prior on \boldsymbol{x}_{M},

\rho(\boldsymbol{x}_{M}|\boldsymbol{\mu}_{M},\phi)=\mathcal{N}(\boldsymbol{x}_{M}|\boldsymbol{\mu}_{M},\boldsymbol{C}_{\phi}). (3)

Then, according to Lemma A.8, the conditional distribution of the kth state derivatives is

\rho(\dot{\boldsymbol{x}}_{M,k}|\boldsymbol{x}_{M,k},\phi_{k})=\mathcal{N}(\dot{\boldsymbol{x}}_{M,k}|\boldsymbol{D}_{k}(\boldsymbol{x}_{M,k}-\boldsymbol{\mu}_{M,k}),\boldsymbol{A}_{k}), (4)

where

\boldsymbol{D}_{k}=\boldsymbol{C}_{\phi_{k}}(\dot{\boldsymbol{x}}_{k},\boldsymbol{x}_{k})\boldsymbol{C}_{\phi_{k}}(\boldsymbol{x}_{k},\boldsymbol{x}_{k})^{-1}, (5)
\boldsymbol{A}_{k}=\boldsymbol{C}_{\phi_{k}}(\dot{\boldsymbol{x}}_{k},\dot{\boldsymbol{x}}_{k})-\boldsymbol{C}_{\phi_{k}}(\dot{\boldsymbol{x}}_{k},\boldsymbol{x}_{k})\boldsymbol{C}_{\phi_{k}}(\boldsymbol{x}_{k},\boldsymbol{x}_{k})^{-1}\boldsymbol{C}_{\phi_{k}}(\boldsymbol{x}_{k},\dot{\boldsymbol{x}}_{k}). (6)

Here \boldsymbol{C}_{\phi} denotes the covariance matrix, whose components are given by \boldsymbol{C}_{\phi}(i,j)=k_{\phi}(t_{i},t_{j}) for a kernel function k_{\phi} parameterized by the hyperparameter \phi. For more details we refer to Appendix A. A Gaussian noise with standard deviation \gamma is also introduced to represent the model uncertainty,

\rho(\dot{\boldsymbol{x}}|\boldsymbol{x},\boldsymbol{\theta},\gamma)=\mathcal{N}(\dot{\boldsymbol{x}}|\boldsymbol{f}(\boldsymbol{x},\boldsymbol{\theta}),\gamma\boldsymbol{I}). (7)
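To make the gradient matching quantities concrete, the following minimal Python sketch builds \boldsymbol{D} and \boldsymbol{A} of Eqs. (5)-(6) for an RBF (squared exponential) kernel, using the derivative covariances of Eqs. (39)-(41) in Appendix A. The function names, the jitter term and the default hyperparameters are illustrative choices, not part of the published FGPGM code.

import numpy as np

def rbf_kernel_blocks(t, l=1.0, sf=1.0):
    # Covariance blocks of an RBF GP and its time derivative,
    # k(a, b) = sf^2 * exp(-(a - b)^2 / (2 l^2)), following Eqs. (39)-(41).
    d = t[:, None] - t[None, :]
    C = sf**2 * np.exp(-d**2 / (2 * l**2))         # cov(x, x)
    dC_da = -C * d / l**2                          # cov(xdot, x): dk/da at (t_i, t_j)
    dC_db = C * d / l**2                           # cov(x, xdot): dk/db at (t_i, t_j)
    ddC = C * (1.0 / l**2 - d**2 / l**4)           # cov(xdot, xdot): d^2 k / da db
    return C, dC_da, dC_db, ddC

def derivative_conditional_moments(t, l=1.0, sf=1.0, jitter=1e-8):
    # D and A of Eqs. (5)-(6): moments of p(xdot | x) under the GP prior.
    C, dC_da, dC_db, ddC = rbf_kernel_blocks(t, l, sf)
    C_inv = np.linalg.inv(C + jitter * np.eye(len(t)))
    D = dC_da @ C_inv                              # C(xdot, x) C(x, x)^{-1}
    A = ddC - dC_da @ C_inv @ dC_db                # Schur complement
    return D, A

For example, D, A = derivative_conditional_moments(np.linspace(0, 2, 20), l=0.5) gives the moments entering Eq. (4) on 20 uniformly spaced observation times.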

We set up the graphical probabilistic model in Fig. 1 to show the relationships between the variables. The joint density can then be represented by the following theorem.

Figure 1: Probabilistic model with partially observable variables.
Theorem 2.1.

Given the modeling assumptions summarized in the graphical probabilistic model in Fig. 1,

\rho(\boldsymbol{x}_{M},\boldsymbol{\theta}|\boldsymbol{y},\boldsymbol{\phi},\boldsymbol{\sigma},\boldsymbol{\gamma})
\propto\rho(\boldsymbol{\theta})\mathcal{N}(\boldsymbol{x}_{M}|\boldsymbol{0},\boldsymbol{C}_{\phi})\mathcal{N}(\boldsymbol{y}|\boldsymbol{x}_{M},\sigma^{2}\boldsymbol{I})\mathcal{N}(\boldsymbol{f}_{M}(\boldsymbol{x}_{M},\tilde{\boldsymbol{x}}_{L}(\boldsymbol{x}_{M},\boldsymbol{\theta}),\boldsymbol{\theta})|\boldsymbol{D}\boldsymbol{x}_{M},\boldsymbol{A}+\gamma\boldsymbol{I}), (8)

where \tilde{\boldsymbol{x}}_{L}(\boldsymbol{x}_{M},\boldsymbol{\theta}), which enters \boldsymbol{f}_{M}, is the latent trajectory determined by \boldsymbol{\theta} and \boldsymbol{x}_{M}.

The proof can be found in Appendix B. In our computation, \tilde{\boldsymbol{x}}_{L} is obtained by integrating the ODE system numerically with the proposed \boldsymbol{\theta} and the initial values of \boldsymbol{x}_{M} and \boldsymbol{x}_{L}. The target is then to maximize the likelihood function \rho(\boldsymbol{x}_{M},\boldsymbol{\theta}|\boldsymbol{y},\phi,\sigma,\gamma).
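As a reading aid, the sketch below shows one way the (unnormalized) log of Eq. (8) can be evaluated when the latent states have to be integrated numerically. It assumes a single observed state stored first in the state vector, a known latent initial value, and hypothetical helpers f_full (full right-hand side) and f_M (right-hand side of the observed state only); it is not the authors' implementation.

import numpy as np
from scipy.integrate import solve_ivp
from scipy.stats import multivariate_normal

def log_density(theta, x_M, y, t, f_full, f_M, x_L0, D, A, C_phi, sigma, gamma):
    # Unnormalized log of Eq. (8) for one observed state.
    n = len(t)
    # Latent trajectory x_L(x_M, theta): integrate the full system from the
    # (assumed known) initial condition with the proposed parameters.
    x0 = np.concatenate(([x_M[0]], x_L0))
    sol = solve_ivp(f_full, (t[0], t[-1]), x0, t_eval=t, args=(theta,), rtol=1e-8)
    x_L = sol.y[1:, :].T

    log_p = multivariate_normal.logpdf(x_M, np.zeros(n), C_phi)        # GP prior
    log_p += multivariate_normal.logpdf(y, x_M, sigma**2 * np.eye(n))  # observations
    mean = D @ x_M                                                     # gradient matching mean
    cov = A + gamma * np.eye(n)
    log_p += multivariate_normal.logpdf(f_M(x_M, x_L, theta), mean, cov)
    return log_p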

The present algorithm combines Gaussian process based gradient matching with a least squares optimization. In the gradient matching step, the Gaussian process model is first fitted by inferring the hyperparameters \phi. Next, the states and parameters are inferred using a one-chain MCMC scheme on the density, as in [23]. Finally, the parameters estimated above are used as the initial guess for the least squares optimization. The algorithm can be summarized as follows.

Algorithm
Input: \boldsymbol{y},\boldsymbol{f}(\boldsymbol{x},\boldsymbol{\theta}),\gamma,N_{MCMC},N_{burnin},t,\sigma_{s},\sigma_{p}
Step 1. Fit the GP model to the data.
Step 2. Infer \boldsymbol{x}_{M}, \boldsymbol{x}_{L} and \boldsymbol{\theta} using MCMC.
\mathcal{S}\leftarrow\emptyset
for i=1\rightarrow N_{MCMC}+N_{burnin} do
      for each state do
            Propose a new state value using a Gaussian distribution with standard
            deviation \sigma_{s}
            Accept or reject the proposed value based on the density (Eq. 8)
            Add the current value to \mathcal{S}
      end for
      for each parameter do
            Propose a new parameter value using a Gaussian distribution with standard
            deviation \sigma_{p}
            Integrate \boldsymbol{x}_{L} from the initial values with the proposed parameters of the ODEs
            Accept or reject the proposed value based on the density (Eq. 8)
            Add the current value to \mathcal{S}
      end for
end for
Discard the first N_{burnin} samples of \mathcal{S}
Return \boldsymbol{x}_{M},\boldsymbol{x}_{L},\boldsymbol{\theta}
Step 3. Optimization using \boldsymbol{\theta} from Step 2 as the initial guess.
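A minimal sketch of the Metropolis-within-Gibbs loop of Step 2 is given below. It assumes a log_density(x_M, theta) callable evaluating (the log of) Eq. (8), for instance the sketch above with its remaining arguments fixed; the proposal and bookkeeping details are simplified compared with [23].

import numpy as np

def mcmc_step2(log_density, x_M0, theta0, sigma_s, sigma_p, n_mcmc, n_burnin, rng=None):
    # Metropolis-within-Gibbs sampler for the states and parameters (Step 2).
    rng = np.random.default_rng() if rng is None else rng
    x_M, theta = x_M0.copy(), theta0.copy()
    lp = log_density(x_M, theta)
    samples = []
    for it in range(n_mcmc + n_burnin):
        for i in range(len(x_M)):                      # update each state value
            prop = x_M.copy()
            prop[i] += sigma_s * rng.standard_normal()
            lp_prop = log_density(prop, theta)
            if np.log(rng.uniform()) < lp_prop - lp:   # Metropolis acceptance
                x_M, lp = prop, lp_prop
        for j in range(len(theta)):                    # update each parameter
            prop = theta.copy()
            prop[j] += sigma_p * rng.standard_normal()
            lp_prop = log_density(x_M, prop)           # triggers re-integration of x_L
            if np.log(rng.uniform()) < lp_prop - lp:
                theta, lp = prop, lp_prop
        samples.append((x_M.copy(), theta.copy()))
    return samples[n_burnin:]                          # discard the burn-in samples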

In Step 1, the Gaussian process model is fitted to the data by maximizing the log marginal likelihood of the observations \boldsymbol{y} at times \boldsymbol{t},

\log(\rho(\boldsymbol{y}|\boldsymbol{t},\phi,\sigma))=-\frac{1}{2}\boldsymbol{y}^{T}(\boldsymbol{C}_{\phi}+\sigma^{2}\boldsymbol{I})^{-1}\boldsymbol{y}-\frac{1}{2}\log|\boldsymbol{C}_{\phi}+\sigma^{2}\boldsymbol{I}|-\frac{n}{2}\log 2\pi, (9)

with respect to the hyperparameters \phi and \sigma, where \sigma is the standard deviation of the observation noise and n is the number of observations. The numerical integration for \tilde{\boldsymbol{x}}_{L} in Step 2 is carried out only after each update of \boldsymbol{\theta}. Step 3 solves the following minimization problem,

\min_{\boldsymbol{\theta}}\|\boldsymbol{x}_{M}(\boldsymbol{\theta})-\boldsymbol{y}\|_{L^{2}(0,T)}^{2}. (10)

In the optimization step, a gradient descent method is adopted, with the gradient approximated numerically at each search step. One advantage of the optimization is that it can produce a more accurate result at lower computational cost; with more data, the Gaussian observation noise is averaged out in the cost functional. However, it requires a proper initial guess of the parameters to avoid falling into local minima, whereas FGPGM places relatively few restrictions on the initial guess. On the other hand, FGPGM needs a large number of MCMC samples for the expectations of the random variables to be meaningful, and it is hard to assess the accuracy of the reconstructed solution. By combining the two methods, we can use a small number of MCMC samples to obtain a rough approximation of the parameters and then use it as the initial guess for the least squares optimization.
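A possible realization of Step 3 is sketched below. It uses SciPy's trust region least squares routine in place of the hand-written gradient descent described above; the argument names (f_full for the full right-hand side, obs_idx for the indices of the observed components) are illustrative.

import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import least_squares

def fit_step3(theta_init, x0, t_obs, y_obs, f_full, obs_idx):
    # Step 3: refine the FGPGM estimate by minimizing the discrete form of Eq. (10);
    # theta_init is the parameter estimate returned by the MCMC step.
    def residuals(theta):
        sol = solve_ivp(f_full, (t_obs[0], t_obs[-1]), x0,
                        t_eval=t_obs, args=(theta,), rtol=1e-8)
        return (sol.y[obs_idx, :] - y_obs).ravel()     # mismatch at the observation times

    result = least_squares(residuals, theta_init)
    return result.x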

3 Experiments

For the Gaussian process regression step on the observable variables, the code published alongside Wenk et al. (2019) [23] was used. The MCMC sampling part was then adapted to the partial observation case according to the model described above. In the following, FGPGM refers to this adapted method for partial observations (Steps 1 and 2 of the present algorithm).

3.1 Lotka Volterra

The Lotka-Volterra system was originally proposed by Lotka [12] to model prey-predator interaction; its dynamics are given by

\dot{x}_{1}=\theta_{1}x_{1}(t)-\theta_{2}x_{1}(t)x_{2}(t), (11)
\dot{x}_{2}=-\theta_{3}x_{2}(t)+\theta_{4}x_{1}(t)x_{2}(t), (12)

where \theta_{1},\theta_{2},\theta_{3},\theta_{4}>0. In the present work, the system is observed through one variable together with the initial value of the other. The rest of the setup is the same as in Gorbach et al. (2017) [9] and Wenk et al. (2019) [23]. The observations are located at 20 uniformly spaced times in the interval [0,2]. The initial values of the variables are (5,3). The trajectory of the observable variable is generated by numerical integration of the system with the true parameters \theta_{1}=2,\theta_{2}=1,\theta_{3}=4,\theta_{4}=1 and perturbed by Gaussian noise with standard deviation 0.1. The RBF kernel is used for the Gaussian process, and the model noise is set to \gamma=3\times 10^{-1}. The results with x_{1} observed are shown in Fig. 2; those with x_{2} observed are given in Fig. 3. In the latter case, the optimization step improves the FGPGM results, with the identified parameters being closer to the true values.
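For reproducibility, a short sketch of this data generation is given below; the random seed and the solver tolerance are arbitrary choices.

import numpy as np
from scipy.integrate import solve_ivp

def lotka_volterra(t, x, theta):
    # Right-hand side of Eqs. (11)-(12)
    x1, x2 = x
    th1, th2, th3, th4 = theta
    return [th1 * x1 - th2 * x1 * x2, -th3 * x2 + th4 * x1 * x2]

rng = np.random.default_rng(0)
t_obs = np.linspace(0.0, 2.0, 20)                        # 20 uniform observation times in [0, 2]
theta_true = (2.0, 1.0, 4.0, 1.0)
sol = solve_ivp(lotka_volterra, (0.0, 2.0), [5.0, 3.0],
                t_eval=t_obs, args=(theta_true,), rtol=1e-8)
y1 = sol.y[0] + 0.1 * rng.standard_normal(t_obs.size)    # noisy observations of x1
# x2 enters the inference only through its initial value x2(0) = 3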

The sensitivities of the variables to the parameters are listed in Tab. 1. The normalized sensitivity indices at the true parameter set \boldsymbol{\theta}_{0} are defined as

S_{ij}=\frac{1}{\|x_{i}\|_{L^{2}(T_{1},T_{2})}}\left\|\frac{\partial x_{i}(t;\boldsymbol{\theta})}{\partial\theta_{j}}\right\|_{L^{2}(T_{1},T_{2})}(\boldsymbol{\theta}_{0}), (13)

and are approximated by numerical differences. Near the true parameter set, the variables are less sensitive to \theta_{1} and \theta_{3} than to the other parameters, which explains why \theta_{1} and \theta_{3} are identified less accurately in the numerical tests (see Fig. 2(c) and Fig. 3(c)).

S_{ij}        x_1    x_2
\theta_1      0.20   0.61
\theta_2      0.52   1.13
\theta_3      0.40   0.33
\theta_4      1.27   0.98
Table 1: Sensitivity of each variable to the parameters of the Lotka-Volterra system at \boldsymbol{\theta}=(2,1,4,1). The sensitivity index is defined in Eq. 13.
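The indices in Eq. (13) can be approximated as in the sketch below, which uses a forward difference with an illustrative step size and assumes a uniform time grid. For Table 1 it would be evaluated with the lotka_volterra right-hand side of the previous sketch, the initial values (5,3) and \boldsymbol{\theta}_{0}=(2,1,4,1).

import numpy as np
from scipy.integrate import solve_ivp

def sensitivity_indices(f_full, x0, theta0, t_grid, h=1e-4):
    # Forward-difference approximation of the normalized indices in Eq. (13).
    def trajectory(theta):
        sol = solve_ivp(f_full, (t_grid[0], t_grid[-1]), x0,
                        t_eval=t_grid, args=(theta,), rtol=1e-8)
        return sol.y                                   # shape (n_states, n_times)

    dt = t_grid[1] - t_grid[0]                         # assumes a uniform grid
    x_base = trajectory(theta0)
    norm_x = np.sqrt(np.sum(x_base**2, axis=1) * dt)   # discrete L2 norms of x_i
    S = np.zeros((x_base.shape[0], len(theta0)))
    for j in range(len(theta0)):
        theta_pert = np.array(theta0, dtype=float)
        theta_pert[j] += h
        dx_dtheta = (trajectory(theta_pert) - x_base) / h
        S[:, j] = np.sqrt(np.sum(dx_dtheta**2, axis=1) * dt) / norm_x
    return S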

The cases with a larger noise level (standard deviation 0.5) are shown in Fig. 4 and Fig. 5, corresponding to x_{1} and x_{2} observations respectively. The prediction of the unobserved variable can deviate far from the ground truth if only the FGPGM method is used; the inferred states and parameters are improved after applying the deterministic optimization.

(a) x_{1}  (b) x_{2}  (c) \theta
Figure 2: Reconstruction and inference results for the Lotka-Volterra system, showing the state evolution over time and the parameter distributions. x_{1} is observable and x_{2} is the latent variable. The ground truth, the FGPGM result, and the result of combining FGPGM with optimization are compared.
(a) x_{1}  (b) x_{2}  (c) \theta
Figure 3: The state evolution over time and the parameter inference results for the Lotka-Volterra system. x_{2} is observable and x_{1} is the latent variable. The ground truth, the FGPGM result, and the result of combining FGPGM with optimization are compared.
(a) x_{1}  (b) x_{2}  (c) \theta
Figure 4: Reconstruction and inference results for the Lotka-Volterra system with x_{1} observable and x_{2} latent. The ground truth, the FGPGM result, and the result of combining FGPGM with optimization are compared. The observation noise has standard deviation 0.5 (large noise case).
(a) x_{1}  (b) x_{2}  (c) \theta
Figure 5: The state evolution over time for the Lotka-Volterra system. x_{2} is observable and x_{1} is the latent variable. The ground truth, the FGPGM result, and the result of combining FGPGM with optimization are compared. The observation noise has standard deviation 0.5 (large noise case).

3.2 Spiky Dynamics

This example is the system proposed by FitzHugh (1961) and Nagumo et al. (1962) for modeling spike potentials in giant squid neurons, abbreviated as the FHN system. It consists of two ODEs with three parameters and has notoriously fast-changing dynamics due to its highly nonlinear terms. In the following tests, the Matern 5/2 kernel is used and \gamma is set to 3\times 10^{-1}, as in Wenk et al. (2019) [23]. We assume that one of the two variables is observable; its data were generated with \theta_{1}=0.2, \theta_{2}=0.2, \theta_{3}=3 and perturbed by Gaussian noise with an average signal-to-noise ratio SNR=100. There are 100 data points uniformly spaced in [0,10]. The system reads

\dot{V}=\theta_{1}(V-\frac{V^{3}}{3}+R), (14)
\dot{R}=\frac{1}{\theta_{1}}(V-\theta_{2}+\theta_{3}R). (15)
S_{ij}        x_1    x_2
\theta_1      2.33   1.24
\theta_2      0.44   0.31
\theta_3      1.01   0.55
Table 2: Sensitivity of each variable to the parameters of the FHN system at \boldsymbol{\theta}=(0.2,0.2,3.0). The sensitivity index is defined in Eq. 13.

In this case, if only the FGPGM step is used, the reconstructed solution corresponding to the identified parameters may deviate significantly from the true time series (see Fig. 6, where x_{1} is observed). It has been pointed out [23] that all GP based gradient matching algorithms lead to smoother trajectories than the ground truth, and this effect becomes more severe with sparse observations. A least squares optimization after FGPGM can largely reduce this effect (see Fig. 7).

Figure 6: Results for the FHN system obtained from the FGPGM method alone, without further optimization. x_{1} is observable and x_{2} is the latent variable. The ground truth, the FGPGM result, and the reconstructed solution (integration of the ODEs with the inferred parameters) are compared.

Fig. 7 and Fig. 8 present the results with only x_{1} and only x_{2} observed, respectively. In both cases the identified parameters are more accurate than with FGPGM alone. From the sensitivity check in Tab. 2, \theta_{1} is expected to be identified most accurately because the variables are most sensitive to it, whereas \theta_{2} is the least sensitive and therefore the hardest to identify; the numerical results agree with this. It is worth mentioning that only 3500 samples were taken in the FGPGM step and that the optimization step took much less time than the FGPGM step. The total computation time is therefore greatly reduced compared with Wenk et al. (2019) [23], where 100,000 MCMC samples were used.

(a) x_{1}  (b) x_{2}  (c) \theta
Figure 7: The state evolution over time and the identified parameters for the FHN system. x_{1} is observable and x_{2} is the latent variable. The ground truth, the FGPGM result, and the result of combining FGPGM with optimization are compared.
(a) x_{1}  (b) x_{2}  (c) \theta
Figure 8: The state evolution over time and the identified parameters for the FHN system. x_{2} is observable and x_{1} is the latent variable. The ground truth, the FGPGM result, and the result of combining FGPGM with optimization are compared.

In this example, we also notice that if the least squares optimization is used alone, local minima can lead to reconstructions far from the ground truth, making it even less robust than the FGPGM method. For example, if the initial guess of the parameters is chosen near (\theta_{1},\theta_{2},\theta_{3})=(1.51,2.2,1.78), the cost functional falls into a local minimum during the gradient based search (see Fig. 9). The existence of many local minima in the full observation case has been pointed out in, e.g., [7, 19]. These results clearly illustrate the benefit of combining FGPGM with least squares optimization.

Figure 9: Results for the FHN system with x_{1} latent, obtained by using least squares optimization alone with an initial guess of the parameters near a local minimum.

3.3 Protein Transduction

Finally, the Protein Transduction system proposed in Vyshemirsky and Girolami (2008) [22] is adopted to illustrate the performance of the method on a system with more equations. As mentioned in Dondelinger et al. (2013) [6], it is notoriously difficult to fit and contains unidentifiable parameters. The system is described by

\dot{S}=-\theta_{1}S-\theta_{2}SR+\theta_{3}R_{S}, (16)
\dot{dS}=\theta_{1}S, (17)
\dot{R}=-\theta_{2}SR+\theta_{3}R_{S}+\theta_{5}\frac{R_{pp}}{\theta_{6}+R_{pp}}, (18)
\dot{R}_{S}=\theta_{2}SR-\theta_{3}R_{S}-\theta_{4}R_{S}, (19)
\dot{R}_{pp}=\theta_{4}R_{S}-\theta_{5}\frac{R_{pp}}{\theta_{6}+R_{pp}}. (20)

The parameter \theta_{6} in this system is unidentifiable. We adopt the same experimental setup as Dondelinger et al. (2013) [6] and Wenk et al. (2019) [23]: \gamma=10^{-4} in the FGPGM step, observations at the discrete times [0,1,2,4,5,7,10,15,20,30,40,50,60,80,100], initial condition [1,0,1,0,0], and data generated by numerically integrating the system with \boldsymbol{\theta}=[0.07,0.6,0.05,0.3,0.017,0.3] and adding Gaussian noise with standard deviation 0.01. As in the previous papers, a sigmoid kernel is used to handle the logarithmically spaced observation times and the typically spiky form of the dynamics.
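A sketch of this setup (the right-hand side of Eqs. (16)-(20) and noisy data at the listed observation times) is given below; the seed and solver tolerance are arbitrary.

import numpy as np
from scipy.integrate import solve_ivp

def protein_transduction(t, x, theta):
    # Right-hand side of Eqs. (16)-(20); x = (S, dS, R, R_S, R_pp)
    S, dS, R, R_S, R_pp = x
    th1, th2, th3, th4, th5, th6 = theta
    return [-th1 * S - th2 * S * R + th3 * R_S,
            th1 * S,
            -th2 * S * R + th3 * R_S + th5 * R_pp / (th6 + R_pp),
            th2 * S * R - th3 * R_S - th4 * R_S,
            th4 * R_S - th5 * R_pp / (th6 + R_pp)]

t_obs = np.array([0, 1, 2, 4, 5, 7, 10, 15, 20, 30, 40, 50, 60, 80, 100], dtype=float)
theta_true = [0.07, 0.6, 0.05, 0.3, 0.017, 0.3]
sol = solve_ivp(protein_transduction, (0.0, 100.0), [1, 0, 1, 0, 0],
                t_eval=t_obs, args=(theta_true,), rtol=1e-8)
rng = np.random.default_rng(0)
y = sol.y + 0.01 * rng.standard_normal(sol.y.shape)   # noisy data; the unobserved rows are dropped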

S_{ij}        x_1    x_2    x_3    x_4    x_5
\theta_1      2.86   9.78   1.73   1.77   3.33
\theta_2      0.70   0.98   0.22   0.59   0.41
\theta_3      1.35   2.11   0.47   0.92   0.90
\theta_4      0.26   0.43   0.03   2.64   0.62
\theta_5      1.53   2.58   24.48  0.90   49.38
\theta_6      0.04   0.07   0.60   0.02   1.21
Table 3: Sensitivity of each variable to the parameters of the Protein Transduction system at \boldsymbol{\theta}=[0.07,0.6,0.05,0.3,0.017,0.3]. The sensitivity index is defined in Eq. 13.
(a) x_{1}  (b) x_{2}  (c) x_{3}  (d) x_{4}  (e) x_{5}  (f) \theta
Figure 10: The state evolution over time and the inferred parameters for the Protein Transduction system. x_{3} is unobserved and the other variables are observable. The ground truth, the FGPGM result, and the result of combining FGPGM with optimization are compared.

Fig. 10 gives the results with x_{3} (R) unobserved. The cases in which one of the other variables is unknown give better results than the case illustrated here and are not presented. We can see that x_{3} is not well fitted by the FGPGM step alone, whereas the combination of FGPGM and optimization produces a satisfactory result, with the parameters \theta_{2} and \theta_{4} significantly improved. The sensitivity check is summarized in Tab. 3, which shows that the variables are less sensitive to \theta_{2}, so it is harder to infer accurately. The error in \theta_{5} may be affected by the value of the unidentifiable parameter \theta_{6}.

It is also of interest to examine the performance of the method with more latent variables. In this model, although dS does not enter the equations of the other variables, the data of dS help infer \theta_{1}. We also notice that \dot{R}+\dot{R}_{S}+\dot{R}_{pp}=0, so if S and dS are both missing it is impossible to identify \theta_{1}. Therefore, in the following test we choose the data of dS, R_{S} and R_{pp} as observations, again with Gaussian noise of standard deviation 0.01 as in the previous case with one latent variable. Fig. 11 shows that the result of the FGPGM step is worse than in the case with only one latent variable, but the final reconstruction of the latent variables and the identified parameters are not significantly different from that case.

(a) x_{1}  (b) x_{2}  (c) x_{3}  (d) x_{4}  (e) x_{5}  (f) \theta
Figure 11: The state evolution over time and the inferred parameters for the Protein Transduction system. x_{1} and x_{3} are unobserved and the other variables are observable. The ground truth, the FGPGM result, and the result of combining FGPGM with optimization are compared.

4 Discussion

In this work, we proposed an effective method for parameter inference of coupled ODE systems with partially observable data. Our method is based on the FGPGM approach [23], which avoids product-of-experts heuristics. To improve accuracy and efficiency, we combine it with a least squares optimization. Due to the existence of latent variables, numerical integration is necessary for computing the likelihood function in the FGPGM step, which increases the computational cost; the least squares optimization allows us to greatly reduce the number of samples. In our numerical tests, the number of samples in the FGPGM step is only about 10% of that suggested in the literature. It is also worth noting that a conventional least squares optimization requires a good initial guess, whereas in our approach this initial guess is supplied by the FGPGM step. The numerical examples illustrate the feasibility of the proposed method for parameter inference with partial observations.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (NSFC No. 11971121), the Natural Sciences and Engineering Research Council (NSERC) of Canada, and the Fields Institute for Research in Mathematical Sciences. The authors also express their gratitude to Prof. Xin Gao, Nathan Gold and other members of the Fields CQAM Lab on Health Analytics and Modelling for very beneficial discussions.

References

  • [1] G. Anger. Inverse Problems in Differential Equations. Berlin: Kluwer, 1990.
  • [2] R. C. Aster, B. Borchers and C. H. Thurber. Parameter Estimation and Inverse Problems. Boston: Elsevier, 2005.
  • [3] D. Barber, Y. Wang. Gaussian processes for Bayesian estimation in ordinary differential equations. International Conference on Machine Learning. 2014.
  • [4] H. G. Bock. Recent advances in parameter identification techniques for ODE. In Numerical Treatment of Inverse Problems in Differential and Integral Equations (eds P. Deuflhard and E. Harrier), 95-121. Basel:Birkhäuser, 1983.
  • [5] Ben Calderhead, Mark Girolami and Neil D. Lawrence. Accelerating bayesian inference over nonlinear differential equations with gaussian processes. Neural Information Processing Systems (NIPS), 2008.
  • [6] Frank Dondelinger, Maurizio Filippone, Simon Rogers and Dirk Husmeier. Ode parameter inference using adaptive gradient matching with gaussian processes. International Conference on Artificial Intelligence and Statistics (AISTATS), 2013.
  • [7] W. R. Esposito and C. Floudas. Deterministic global optimization in nonlinear optimal control problems. J. Glob. Optimizn, 17, 97-126., 2000.
  • [8] J. P. Gauthier and I. A. K. Kupka. Observability and Observers for Nonlinear Systems. SIAM J. Control Optim., 32(4), 975-994, 1994.
  • [9] Nico S Gorbach, Stefan Bauer, and Joachim M. Buhmann. Scalable variational inference for dynamical systems. Neural Information Processing Systems (NIPS), 2017.
  • [10] A. Isidori. Nonlinear Control Systems (3rd Edition). Springer-Verlag, 1995.
  • [11] Jari Kaipio and Erkki Somersalo. Statistical and computational inverse problems. Springer Science, 2006.
  • [12] Alfred J Lotka. The growth of mixed populations: two species competing for a common food supply. In The Golden Age of Theoretical Ecology: 1923-1940, pages 274-286. Springer, 1978.
  • [13] Z. Li, M. Osborne and T. Prvan. Parameter estimation in ordinary differential equations. IMA J. Numer. Anal., 25, 264-285, 2005.
  • [14] Benn Macdonald, Catherine Higham and Dirk Husmeier. Controversy in mechanistic modelling with Gaussian processes. Proceedings of the 32 nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37.
  • [15] J. Nocedal and S. Wright. Numerical Optimization, 2nd edn. New York: Springer, 2006.
  • [16] Mu Niu, Simon Rogers, Maurizio Filippone, and Dirk Husmeier. Fast inference in nonlinear dynamical systems using gradient matching. International Conference on Machine Learning (ICML), 2016.
  • [17] H. Pohjanpalo. System identifiability based on the power series expansion of the solution. Mathematical Biosciences, 41:21-33, 1978.
  • [18] A. Papoulis and S. U. Pillai. Probability, Random Variables, and Stochastic Processes, 2002.
  • [19] Jim O Ramsay, Giles Hooker, David Campbell, and Jiguo Cao. Parameter estimation for differential equations: a generalized smoothing approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(5):741-796, 2007.
  • [20] Carl Edward Rasmussen. ”Gaussian processes in machine learning.” Summer School on Machine Learning. Springer, Berlin, Heidelberg, 2003.
  • [21] A. Tarantola. Inverse Problem Theory. Philadelphia: Society for Industrial and Applied Mathematics, 2005.
  • [22] Vladislav Vyshemirsky and Mark A Girolami. Bayesian ranking of biochemical system models. Bioinformatics, 24(6):833-839, 2008.
  • [23] P. Wenk, A. Gotovos, S. Bauer, N. Gorbach, A. Krause, J. M. Buhmann, Fast Gaussian process based gradient matching for parameter identification in systems of nonlinear ODEs. Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) 2019, Naha, Okinawa, Japan. PMLR: Volume 89, 2019.

Appendix A Preliminaries

In the following we list some preliminaries on the derivatives of a Gaussian process that are used in this work; the proofs can be found in, e.g., [18, 20, 23]. We denote a random process by X_{t}, its realization by x, and its time derivative by \dot{X}_{t}.

Definition A.1.

([18]) The sequence of random variables X_{n} converges to the random variable X in the first-mean sense (limit in mean) if

limn𝔼(|XnX|)=0.\lim_{n\rightarrow\infty}\mathbb{E}(|X_{n}-X|)=0. (21)
Definition A.2.

The stochastic process XtX_{t} is first-mean differentiable if for some X˙t\dot{X}_{t}

limδt0𝔼|Xt+δtXtδtX˙t|=0.\lim_{\delta t\rightarrow 0}\mathbb{E}\left|\frac{X_{t+\delta t}-X_{t}}{\delta t}-\dot{X}_{t}\right|=0. (22)
Definition A.3.

For a given random variable X, the moment generating function (MGF) is defined by

ΦX(t)=E[exp(Xt)]=exp(xt)ρ(x)𝑑x.\Phi_{X}(t)=E[\exp(Xt)]=\int_{-\infty}^{\infty}\exp(xt)\rho(x)dx. (23)
Proposition A.4.

If \Phi_{X}(t) is the MGF, then

  1. \frac{d^{i}\Phi_{X}}{dt^{i}}\Big|_{t=0}=m_{i}, where m_{i} is the i-th moment of X.

  2. Let X and Y be two random variables. X and Y have the same distribution if and only if they have the same MGFs.

  3. X\sim N(\mu,\sigma^{2}) if and only if \Phi_{X}(t)=\exp\left(\frac{\sigma^{2}t^{2}}{2}+\mu t\right).

  4. If X and Y are two independent random variables, then \Phi_{X+Y}(t)=\Phi_{X}(t)\Phi_{Y}(t).

By the above propositions, one has

Lemma A.5.

If X and Y are two independent Gaussian random variables with means \mu_{X},\mu_{Y} and variances \sigma_{X}^{2},\sigma_{Y}^{2}, then X+Y is a Gaussian random variable with mean \mu_{X}+\mu_{Y} and variance \sigma_{X}^{2}+\sigma_{Y}^{2}.

Definition A.6.

([20]) A real-valued stochastic process \{X_{t}\}_{t\in T}, where T is an index set, is a Gaussian process if all of its finite-dimensional distributions are multivariate normal. That is, for any choice of distinct values t_{1},t_{2},\dots,t_{N}\in T, the random vector \boldsymbol{X}=(X_{t_{1}},\dots,X_{t_{N}})^{T} has a multivariate normal distribution with joint Gaussian probability density function given by

ρXt1Xt2XtN(xt1,,xtN)=1(2π)N/2det(𝚺)1/2exp(12(𝒙𝝁𝑿)T𝚺1(𝒙𝝁𝑿)).\rho_{X_{t_{1}}X_{t_{2}}\dots X_{t_{N}}}(x_{t_{1}},\dots,x_{t_{N}})=\frac{1}{(2\pi)^{N/2}det(\boldsymbol{\Sigma})^{1/2}}\exp\left(-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu}_{\boldsymbol{X}})^{T}\boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu}_{\boldsymbol{X}})\right). (24)

where the mean vector is defined as

(𝝁𝑿)i=𝔼[𝑿ti]\displaystyle(\boldsymbol{\mu}_{\boldsymbol{X}})_{i}=\mathbb{E}[\boldsymbol{X}_{t_{i}}] (25)

and covariance matrix (𝚺)ij=cov(Xti,Xtj)(\boldsymbol{\Sigma})_{ij}=cov(X_{t_{i}},X_{t_{j}}).

A Gaussian process is fully specified by its mean and covariance functions. A common choice of covariance function is the squared exponential kernel, \mathrm{cov}(X_{t_{i}},X_{t_{j}})=k_{\phi}(t_{i},t_{j})=\exp(-\frac{1}{2l^{2}}|t_{i}-t_{j}|^{2}), where the hyperparameter l represents the length scale of the nonlocal interaction.

Let t_{0},\delta t\in\mathbb{R}, and let X_{t} be a Gaussian process with constant mean \mu and kernel function k_{\phi}(t_{1},t_{2}), assumed to be first-mean differentiable. Then X_{t_{0}+\delta t} and X_{t_{0}} are jointly Gaussian distributed,

[Xt0Xt0+δt]𝒩([μμ],𝚺)\left[\begin{array}[]{c}X_{t_{0}}\\ X_{t_{0}+\delta t}\end{array}\right]\sim\mathcal{N}\left(\left[\begin{array}[]{c}\mu\\ \mu\end{array}\right],\boldsymbol{\Sigma}\right) (26)

with density function

ρ(xt0,xt0+δt)=12πdet(𝚺)1/2exp(12[xt0μxt0+δtμ]T𝚺1[xt0μxt0+δtμ])\rho(x_{t_{0}},x_{t_{0}+\delta t})=\frac{1}{2\pi\det(\boldsymbol{\Sigma})^{1/2}}\exp\left({-\frac{1}{2}\left[\begin{array}[]{c}x_{t_{0}}-\mu\\ x_{t_{0}+\delta t}-\mu\end{array}\right]^{T}\boldsymbol{\Sigma}^{-1}\left[\begin{array}[]{c}x_{t_{0}}-\mu\\ x_{t_{0}+\delta t}-\mu\end{array}\right]}\right) (27)

where

𝚺=(kϕ(t0,t0)kϕ(t0,t0+δt)kϕ(t0+δt,t0)kϕ(t0+δt,t0+δt)).\boldsymbol{\Sigma}=\left(\begin{array}[]{cc}k_{\phi}(t_{0},t_{0})&k_{\phi}(t_{0},t_{0}+\delta t)\\ k_{\phi}(t_{0}+\delta t,t_{0})&k_{\phi}(t_{0}+\delta t,t_{0}+\delta t)\end{array}\right). (28)

If we define linear transformation

𝐓=(101δt1δt),\displaystyle\mathbf{T}=\left(\begin{array}[]{cc}1&0\\ -\frac{1}{\delta t}&\frac{1}{\delta t}\end{array}\right), (31)

then we have

[Xt0Xt0+δtXt0δt]=𝐓[Xt0Xt0+δt]𝒩([μ0],𝐓𝚺𝐓T)\left[\begin{array}[]{c}X_{t_{0}}\\ \frac{X_{t_{0}+\delta t}-X_{t_{0}}}{\delta t}\end{array}\right]=\mathbf{T}\left[\begin{array}[]{c}X_{t_{0}}\\ X_{t_{0}+\delta t}\end{array}\right]\sim\mathcal{N}\left(\left[\begin{array}[]{c}\mu\\ 0\end{array}\right],\mathbf{T}\boldsymbol{\Sigma}\mathbf{T}^{T}\right) (32)

i.e.

ρ(Xt0,Xt0+δtXt0δt)=𝒩([μ0],𝐓𝚺𝐓T)\rho(X_{t_{0}},\frac{X_{t_{0}+\delta t}-X_{t_{0}}}{\delta t})=\mathcal{N}\left(\left[\begin{array}[]{c}\mu\\ 0\end{array}\right],\mathbf{T}\boldsymbol{\Sigma}\mathbf{T}^{T}\right) (33)

where

\mathbf{T}\boldsymbol{\Sigma}\mathbf{T}^{T}=\left(\begin{array}{cc}k_{\phi}(t_{0},t_{0})&\frac{k_{\phi}(t_{0},t_{0}+\delta t)-k_{\phi}(t_{0},t_{0})}{\delta t}\\ \frac{k_{\phi}(t_{0}+\delta t,t_{0})-k_{\phi}(t_{0},t_{0})}{\delta t}&\frac{\frac{k_{\phi}(t_{0}+\delta t,t_{0}+\delta t)-k_{\phi}(t_{0},t_{0}+\delta t)}{\delta t}-\frac{k_{\phi}(t_{0}+\delta t,t_{0})-k_{\phi}(t_{0},t_{0})}{\delta t}}{\delta t}\end{array}\right). (36)

The above derivation shows that X_{t_{0}} and \frac{X_{t_{0}+\delta t}-X_{t_{0}}}{\delta t} are jointly Gaussian distributed. Using the definition of the first-mean derivative and the fact that convergence in r-th mean implies convergence in distribution, it follows that X_{t_{0}} and \dot{X}_{t_{0}} are jointly Gaussian,

[Xt0X˙t0]𝒩([μ0],[kϕ(t0,t0)kϕ(a,b)b|a=t0,b=t0kϕ(a,b)a|a=t0,b=t02kϕ(a,b)ab|a=t0,b=t0]).\left[\begin{array}[]{c}X_{t_{0}}\\ \dot{X}_{t_{0}}\end{array}\right]\sim\mathcal{N}\left(\left[\begin{array}[]{c}\mu\\ 0\end{array}\right],\left[\begin{array}[]{cc}k_{\phi}(t_{0},t_{0})&\frac{\partial k_{\phi}(a,b)}{\partial b}|_{a=t_{0},b=t_{0}}\\ \frac{\partial k_{\phi}(a,b)}{\partial a}|_{a=t_{0},b=t_{0}}&\frac{\partial^{2}k_{\phi}(a,b)}{\partial a\partial b}|_{a=t_{0},b=t_{0}}\end{array}\right]\right). (37)

In general, 𝑿=(Xt1,,Xtk)T\boldsymbol{X}=(X_{t_{1}},\dots,X_{t_{k}})^{T} and 𝑿˙=(X˙t1,,X˙tk)T\dot{\boldsymbol{X}}=(\dot{X}_{t_{1}},\dots,\dot{X}_{t_{k}})^{T} are jointly Gaussian

[𝑿𝑿˙]𝒩([𝝁𝟎],[𝑪ϕ(𝑿,𝑿)𝑪ϕ(𝑿,𝑿˙)𝑪ϕ(𝑿˙,𝑿)𝑪ϕ(𝑿˙,𝑿˙)]).\left[\begin{array}[]{c}\boldsymbol{X}\\ \dot{\boldsymbol{X}}\end{array}\right]\sim\mathcal{N}\left(\left[\begin{array}[]{c}\boldsymbol{\mu}\\ \boldsymbol{0}\end{array}\right],\left[\begin{array}[]{cc}\boldsymbol{C}_{\phi}(\boldsymbol{X},\boldsymbol{X})&\boldsymbol{C}_{\phi}(\boldsymbol{X},\dot{\boldsymbol{X}})\\ \boldsymbol{C}_{\phi}(\dot{\boldsymbol{X}},\boldsymbol{X})&\boldsymbol{C}_{\phi}(\dot{\boldsymbol{X}},\dot{\boldsymbol{X}})\end{array}\right]\right). (38)

Here (C_{\phi}(\mathbf{a},\mathbf{b}))_{ij}=\mathrm{cov}(a_{i},b_{j}) is the covariance between a_{i} and b_{j}, given by the predefined kernel matrix of the Gaussian process. By the linearity of the covariance operator and the predefined kernel function k_{\phi}(a,b), we have

Cϕ(Xti,X˙tj)=kϕ(a,b)b|a=ti,b=tj,\displaystyle C_{\phi}(X_{t_{i}},\dot{X}_{t_{j}})=\frac{\partial k_{\phi}(a,b)}{\partial b}|_{a=t_{i},b=t_{j}}, (39)
Cϕ(X˙ti,Xtj)=kϕ(a,b)a|a=ti,b=tj,\displaystyle C_{\phi}(\dot{X}_{t_{i}},X_{t_{j}})=\frac{\partial k_{\phi}(a,b)}{\partial a}|_{a=t_{i},b=t_{j}}, (40)
Cϕ(X˙ti,X˙tj)=2kϕ(a,b)ab|a=ti,b=tj.\displaystyle C_{\phi}(\dot{X}_{t_{i}},\dot{X}_{t_{j}})=\frac{\partial^{2}k_{\phi}(a,b)}{\partial a\partial b}|_{a=t_{i},b=t_{j}}. (41)
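These identities can be checked numerically. The short sketch below compares the analytic derivative \partial k_{\phi}(a,b)/\partial b of a squared exponential kernel with a finite difference; the length scale and step size are arbitrary.

import numpy as np

l, h = 1.3, 1e-6
k = lambda a, b: np.exp(-(a - b)**2 / (2 * l**2))        # squared exponential kernel
dk_db = lambda a, b: k(a, b) * (a - b) / l**2            # analytic dk/db
a, b = 0.4, 1.1
fd = (k(a, b + h) - k(a, b)) / h                         # finite difference in b
print(abs(dk_db(a, b) - fd) < 1e-5)                      # True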
Lemma A.7.

(Matrix Inversion Lemma) Let \boldsymbol{\Sigma} be a p\times p matrix (p=n+m):

𝚺=[𝚺11𝚺12𝚺21𝚺22]\boldsymbol{\Sigma}=\left[\begin{array}[]{cc}\boldsymbol{\Sigma}_{11}&\boldsymbol{\Sigma}_{12}\\ \boldsymbol{\Sigma}_{21}&\boldsymbol{\Sigma}_{22}\end{array}\right] (42)

where the sub-matrices have dimensions n\times n, n\times m, etc. Suppose \boldsymbol{\Sigma},\boldsymbol{\Sigma}_{11},\boldsymbol{\Sigma}_{22} are non-singular, and partition the inverse in the same way as \boldsymbol{\Sigma},

𝚲=𝚺1=[𝚲11𝚲12𝚲21𝚲22].\boldsymbol{\Lambda}=\boldsymbol{\Sigma}^{-1}=\left[\begin{array}[]{cc}\boldsymbol{\Lambda}_{11}&\boldsymbol{\Lambda}_{12}\\ \boldsymbol{\Lambda}_{21}&\boldsymbol{\Lambda}_{22}\end{array}\right]. (43)

Then

{𝚲11=(𝚺11𝚺12𝚺221𝚺21)1𝚲12=(𝚺11𝚺12𝚺221𝚺21)1𝚺12𝚺221𝚲21=(𝚺22𝚺21𝚺111𝚺12)1𝚺21𝚺111𝚲22=(𝚺22𝚺21𝚺111𝚺12)1.\displaystyle\left\{\begin{array}[]{l}\boldsymbol{\Lambda}_{11}=(\boldsymbol{\Sigma}_{11}-\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21})^{-1}\\ \boldsymbol{\Lambda}_{12}=-(\boldsymbol{\Sigma}_{11}-\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21})^{-1}\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\\ \boldsymbol{\Lambda}_{21}=-(\boldsymbol{\Sigma}_{22}-\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12})^{-1}\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\\ \boldsymbol{\Lambda}_{22}=(\boldsymbol{\Sigma}_{22}-\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12})^{-1}.\end{array}\right. (48)
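The lemma is easy to verify numerically, for example for the (1,1) block of a random symmetric positive definite matrix; the block sizes below are arbitrary.

import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 2
M = rng.standard_normal((n + m, n + m))
Sigma = M @ M.T + (n + m) * np.eye(n + m)          # well conditioned SPD matrix
S11, S12 = Sigma[:n, :n], Sigma[:n, n:]
S21, S22 = Sigma[n:, :n], Sigma[n:, n:]
Lam = np.linalg.inv(Sigma)
Lam11 = np.linalg.inv(S11 - S12 @ np.linalg.inv(S22) @ S21)
print(np.allclose(Lam[:n, :n], Lam11))             # True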
Lemma A.8.

(Conditional Gaussian distributions) Let 𝐗D\boldsymbol{X}\in\mathbb{R}^{D}, 𝐘M\boldsymbol{Y}\in\mathbb{R}^{M}, be jointly Gaussian random vectors with distribution

[𝑿𝒀]𝒩(𝝁,𝚺)\left[\begin{array}[]{c}\boldsymbol{X}\\ \boldsymbol{Y}\end{array}\right]\sim\mathcal{N}\left(\boldsymbol{\mu},\boldsymbol{\Sigma}\right) (49)

where

𝝁=[𝝁X𝝁Y],𝚺=[𝚺XX𝚺XY𝚺YX𝚺YY].\boldsymbol{\mu}=\left[\begin{array}[]{c}\boldsymbol{\mu}_{X}\\ \boldsymbol{\mu}_{Y}\end{array}\right],\boldsymbol{\Sigma}=\left[\begin{array}[]{cc}\boldsymbol{\Sigma}_{XX}&\boldsymbol{\Sigma}_{XY}\\ \boldsymbol{\Sigma}_{YX}&\boldsymbol{\Sigma}_{YY}\end{array}\right]. (50)

Then the conditional Gaussian density function is

\rho_{Y|X}(\boldsymbol{y}|\boldsymbol{x})=\frac{\rho_{XY}(\boldsymbol{x},\boldsymbol{y})}{\rho_{X}(\boldsymbol{x})}=\frac{1}{(2\pi)^{\frac{M}{2}}\det(\boldsymbol{\Sigma}_{Y|X})^{1/2}}\exp\left(-\frac{1}{2}(\boldsymbol{y}-\boldsymbol{\mu}_{Y|X})^{T}\boldsymbol{\Sigma}_{Y|X}^{-1}(\boldsymbol{y}-\boldsymbol{\mu}_{Y|X})\right) (51)

where

𝝁Y|X=𝝁Y+𝚺YX𝚺XX1(𝒙𝝁X),\displaystyle\boldsymbol{\mu}_{Y|X}=\boldsymbol{\mu}_{Y}+\boldsymbol{\Sigma}_{YX}\boldsymbol{\Sigma}_{XX}^{-1}(\boldsymbol{x}-\boldsymbol{\mu}_{X}), (52)
𝚺Y|X=𝚺YY𝚺YX𝚺XX1𝚺XY.\displaystyle\boldsymbol{\Sigma}_{Y|X}=\boldsymbol{\Sigma}_{YY}-\boldsymbol{\Sigma}_{YX}\boldsymbol{\Sigma}_{XX}^{-1}\boldsymbol{\Sigma}_{XY}. (53)
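A small sketch of Eqs. (52)-(53), assuming the first d entries of the joint vector are \boldsymbol{X} and the remaining ones are \boldsymbol{Y}:

import numpy as np

def conditional_gaussian(mu, Sigma, x, d):
    # Conditional moments of Y given X = x from Lemma A.8.
    mu_X, mu_Y = mu[:d], mu[d:]
    S_XX, S_XY = Sigma[:d, :d], Sigma[:d, d:]
    S_YX, S_YY = Sigma[d:, :d], Sigma[d:, d:]
    K = S_YX @ np.linalg.inv(S_XX)
    mu_cond = mu_Y + K @ (x - mu_X)                # Eq. (52)
    Sigma_cond = S_YY - K @ S_XY                   # Eq. (53)
    return mu_cond, Sigma_cond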

According to the above lemma, we have the following conditional distribution.

Lemma A.9.
ρ(𝒙˙|𝒙)𝒩(𝑫(𝒙𝝁X),𝑨)\rho(\dot{\boldsymbol{x}}|\boldsymbol{x})\sim\mathcal{N}(\boldsymbol{D}(\boldsymbol{x}-\boldsymbol{\mu}_{X}),\boldsymbol{A}) (54)

where

𝑫=𝑪ϕ(𝑿˙,𝑿)𝑪ϕ(𝑿,𝑿)1\displaystyle\boldsymbol{D}=\boldsymbol{C}_{\phi}(\dot{\boldsymbol{X}},\boldsymbol{X})\boldsymbol{C}_{\phi}(\boldsymbol{X},\boldsymbol{X})^{-1} (55)
𝑨=𝑪ϕ(𝑿˙,𝑿˙)𝑪ϕ(𝑿˙,𝑿)𝑪ϕ(𝑿,𝑿)1𝑪ϕ(𝑿,𝑿˙)\displaystyle\boldsymbol{A}=\boldsymbol{C}_{\phi}(\dot{\boldsymbol{X}},\dot{\boldsymbol{X}})-\boldsymbol{C}_{\phi}(\dot{\boldsymbol{X}},\boldsymbol{X})\boldsymbol{C}_{\phi}(\boldsymbol{X},\boldsymbol{X})^{-1}\boldsymbol{C}_{\phi}(\boldsymbol{X},\dot{\boldsymbol{X}}) (56)

Appendix B Proof of Theorem 2.1

Proof.

The joint density over all variables in Fig.1 can be represented as

ρ(𝒙M,𝒙˙M,𝒚,𝒙L,FM,F¯M,𝜽|ϕ,σ,γ)\displaystyle\rho(\boldsymbol{x}_{M},\dot{\boldsymbol{x}}_{M},\boldsymbol{y},\boldsymbol{x}_{L},F_{M},\bar{F}_{M},\boldsymbol{\theta}|\phi,\sigma,\gamma)
=\displaystyle= ρGP(𝒙M,𝒙˙M,𝒚|ϕ,σ)ρODE(FM,F¯M,θ,𝒙L|𝒙M,𝒙˙M,γ)\displaystyle\rho_{GP}(\boldsymbol{x}_{M},\dot{\boldsymbol{x}}_{M},\boldsymbol{y}|\phi,\sigma)\rho_{ODE}(F_{M},\bar{F}_{M},\theta,\boldsymbol{x}_{L}|\boldsymbol{x}_{M},\dot{\boldsymbol{x}}_{M},\gamma) (57)
ρGP(𝒙M,𝒙˙M,𝒚|ϕ,σ)=ρ(𝒙M,ϕ)ρ(𝒚|𝒙M,σ)ρ(𝒙˙M|𝒙M,ϕ)\rho_{GP}(\boldsymbol{x}_{M},\dot{\boldsymbol{x}}_{M},\boldsymbol{y}|\phi,\sigma)=\rho(\boldsymbol{x}_{M},\phi)\rho(\boldsymbol{y}|\boldsymbol{x}_{M},\sigma)\rho(\dot{\boldsymbol{x}}_{M}|\boldsymbol{x}_{M},\phi) (58)
ρODE(FM,F¯M,θ,𝒙L|𝒙M,𝒙˙M,γ)\displaystyle\rho_{ODE}(F_{M},\bar{F}_{M},\theta,\boldsymbol{x}_{L}|\boldsymbol{x}_{M},\dot{\boldsymbol{x}}_{M},\gamma)
=\displaystyle= ρ(𝜽)ρ(FM,F¯M,𝒙L|𝜽,𝒙M,𝒙˙M,γ)\displaystyle\rho(\boldsymbol{\theta})\rho(F_{M},\bar{F}_{M},\boldsymbol{x}_{L}|\boldsymbol{\theta},\boldsymbol{x}_{M},\dot{\boldsymbol{x}}_{M},\gamma)
=\displaystyle= ρ(𝜽)ρ(𝒙L|𝜽,𝒙M)ρ(FM,F¯M|𝒙L,𝜽,𝒙M,𝒙˙M,γ)\displaystyle\rho(\boldsymbol{\theta})\rho(\boldsymbol{x}_{L}|\boldsymbol{\theta},\boldsymbol{x}_{M})\rho(F_{M},\bar{F}_{M}|\boldsymbol{x}_{L},\boldsymbol{\theta},\boldsymbol{x}_{M},\dot{\boldsymbol{x}}_{M},\gamma)
=\displaystyle= ρ(𝜽)δ(𝒙~L(𝜽,𝒙M)𝒙L)ρ(FM,F¯M|𝒙L,𝜽,𝒙M,𝒙˙M,γ)\displaystyle\rho(\boldsymbol{\theta})\delta(\tilde{\boldsymbol{x}}_{L}(\boldsymbol{\theta},\boldsymbol{x}_{M})-\boldsymbol{x}_{L})\rho(F_{M},\bar{F}_{M}|\boldsymbol{x}_{L},\boldsymbol{\theta},\boldsymbol{x}_{M},\dot{\boldsymbol{x}}_{M},\gamma)
=\displaystyle= ρ(𝜽)ρ(FM|𝜽,𝒙M,𝒙~L(𝒙M,𝜽))ρ(F¯M|𝒙˙M,γ)δ(FMF¯M)\displaystyle\rho(\boldsymbol{\theta})\rho(F_{M}|\boldsymbol{\theta},\boldsymbol{x}_{M},\tilde{\boldsymbol{x}}_{L}(\boldsymbol{x}_{M},\boldsymbol{\theta}))\rho(\bar{F}_{M}|\dot{\boldsymbol{x}}_{M},\gamma)\delta(F_{M}-\bar{F}_{M})
=\displaystyle= ρ(𝜽)δ(𝒇M(𝜽,𝒙M,𝒙~L(𝒙M,𝜽))FM)𝒩(FM|𝒙˙M,γ𝑰)\displaystyle\rho(\boldsymbol{\theta})\delta(\boldsymbol{f}_{M}(\boldsymbol{\theta},\boldsymbol{x}_{M},\tilde{\boldsymbol{x}}_{L}(\boldsymbol{x}_{M},\boldsymbol{\theta}))-F_{M})\mathcal{N}(F_{M}|\dot{\boldsymbol{x}}_{M},\gamma\boldsymbol{I})
=\displaystyle= ρ(𝜽)𝒩(𝒇M(𝜽,𝒙M,𝒙~L(𝒙M,𝜽))|𝒙˙M,γ𝑰),\displaystyle\rho(\boldsymbol{\theta})\mathcal{N}(\boldsymbol{f}_{M}(\boldsymbol{\theta},\boldsymbol{x}_{M},\tilde{\boldsymbol{x}}_{L}(\boldsymbol{x}_{M},\boldsymbol{\theta}))|\dot{\boldsymbol{x}}_{M},\gamma\boldsymbol{I}), (59)

which shows that \rho_{ODE} ultimately depends only on \boldsymbol{\theta}, \boldsymbol{x}_{M} and \dot{\boldsymbol{x}}_{M}, and not explicitly on F_{M}, \bar{F}_{M} or \boldsymbol{x}_{L}; here \tilde{\boldsymbol{x}}_{L} is determined by \boldsymbol{x}_{M} and \boldsymbol{\theta} through integration. Using Lemma A.9, we have

ρ(𝒙M,𝜽,𝒚|ϕ,σ,γ)=\displaystyle\rho(\boldsymbol{x}_{M},\boldsymbol{\theta},\boldsymbol{y}|\phi,\sigma,\gamma)=
ρ(𝜽)𝒩(𝒙M|𝟎,𝑪ϕ)𝒩(𝒚|𝒙M,σ2𝑰)𝒩(𝒙˙M|𝑫𝒙M,𝑨)𝒩(𝒇M(𝜽,𝒙M,𝒙~L(𝒙M,𝜽))|𝒙˙M,γ𝑰).\displaystyle\rho(\boldsymbol{\theta})\mathcal{N}(\boldsymbol{x}_{M}|\boldsymbol{0},\boldsymbol{C}_{\phi})\mathcal{N}(\boldsymbol{y}|\boldsymbol{x}_{M},\sigma^{2}\boldsymbol{I})\mathcal{N}(\dot{\boldsymbol{x}}_{M}|\boldsymbol{D}\boldsymbol{x}_{M},\boldsymbol{A})\mathcal{N}(\boldsymbol{f}_{M}(\boldsymbol{\theta},\boldsymbol{x}_{M},\tilde{\boldsymbol{x}}_{L}(\boldsymbol{x}_{M},\boldsymbol{\theta}))|\dot{\boldsymbol{x}}_{M},\gamma\boldsymbol{I}). (60)

Integrating 𝒙˙M\dot{\boldsymbol{x}}_{M} out yields

ρ(𝒙M,𝜽,𝒚|ϕ,σ,γ)=\displaystyle\rho(\boldsymbol{x}_{M},\boldsymbol{\theta},\boldsymbol{y}|\phi,\sigma,\gamma)=
ρ(𝜽)𝒩(𝒙M|𝟎,𝑪ϕ)𝒩(𝒚|𝒙M,σ2𝑰)𝒩(𝒇M(𝜽,𝒙M,𝒙~L(𝒙M,𝜽))|𝑫𝒙M,𝑨+γ𝑰).\displaystyle\rho(\boldsymbol{\theta})\mathcal{N}(\boldsymbol{x}_{M}|\boldsymbol{0},\boldsymbol{C}_{\phi})\mathcal{N}(\boldsymbol{y}|\boldsymbol{x}_{M},\sigma^{2}\boldsymbol{I})\mathcal{N}(\boldsymbol{f}_{M}(\boldsymbol{\theta},\boldsymbol{x}_{M},\tilde{\boldsymbol{x}}_{L}(\boldsymbol{x}_{M},\boldsymbol{\theta}))|\boldsymbol{D}\boldsymbol{x}_{M},\boldsymbol{A}+\gamma\boldsymbol{I}). (61)

Finally, we get

ρ(𝒙M,𝜽|𝒚,ϕ,σ,γ)\displaystyle\rho(\boldsymbol{x}_{M},\boldsymbol{\theta}|\boldsymbol{y},\phi,\sigma,\gamma)\propto
ρ(𝜽)𝒩(𝒙M|𝟎,𝑪ϕ)𝒩(𝒚|𝒙M,σ2𝑰)𝒩(𝒇M(𝜽,𝒙M,𝒙~L(𝒙M,𝜽))|𝑫𝒙M,𝑨+γ𝑰).\displaystyle\rho(\boldsymbol{\theta})\mathcal{N}(\boldsymbol{x}_{M}|\boldsymbol{0},\boldsymbol{C}_{\phi})\mathcal{N}(\boldsymbol{y}|\boldsymbol{x}_{M},\sigma^{2}\boldsymbol{I})\mathcal{N}(\boldsymbol{f}_{M}(\boldsymbol{\theta},\boldsymbol{x}_{M},\tilde{\boldsymbol{x}}_{L}(\boldsymbol{x}_{M},\boldsymbol{\theta}))|\boldsymbol{D}\boldsymbol{x}_{M},\boldsymbol{A}+\gamma\boldsymbol{I}). (62)