
Numerical Method for Parameter Inference of Nonlinear ODEs with Partial Observations

Yu Chen Centre for Quantitative Analysis and Modeling (CQAM), The Fields Institute for Research in Mathematical Sciences, 222 College Street, Toronto, Ontario, Canada. School of Mathematics, Shanghai University of Finance and Economics, Shanghai, China Jin Cheng School of Mathematical Sciences, Fudan University, Shanghai, 200433, China School of Mathematics, Shanghai University of Finance and Economics, Shanghai, China Arvind Gupta Computer Science, University of Toronto, Toronto, Ontario, Canada. Huaxiong Huang Corresponding author: [email protected] Joint Mathematical Research Centre of Beijing Normal University and BNU-HKBU United International College, Zhuhai, China Centre for Quantitative Analysis and Modeling (CQAM), The Fields Institute for Research in Mathematical Sciences, 222 College Street, Toronto, Ontario, Canada. Computer Science, University of Toronto, Toronto, Ontario, Canada. Department of Mathematics and Statistics, York University, Toronto, Ontario, Canada. Shixin Xu Corresponding author: [email protected] Duke Kunshan University, 8 Duke Ave, Kunshan, Jiangsu, China.

Abstract

Parameter inference for dynamical systems is a challenging task faced by researchers and practitioners in many fields. In many applications, only a subset of the state variables is observable. In this paper, we propose a method for parameter inference in systems of nonlinear coupled ODEs with partial observations. Our method combines fast Gaussian process based gradient matching (FGPGM) with deterministic optimization. Using the initial guess provided by a Bayesian step with a small number of samples, the subsequent deterministic optimization is both accurate and efficient.

Key words: Gaussian Process, Parameter inference, Nonlinear ODEs, Partial observations

1 Introduction

Many problems in science and engineering can be modelled by systems of ordinary differential equations (ODEs). It is often difficult or impossible to measure some parameters of these systems directly. Therefore, various methods have been developed to estimate parameters from available data. Mathematically, such problems are classified as inverse problems, which have been widely studied [1, 2, 21, 13]. They can also be treated as parameter inference problems in statistics [19, 11].

For nonlinear ODEs, standard statistical inference is time consuming because numerical integration is needed after each update of the parameters [5, 14]. Recently, gradient matching techniques have been proposed to circumvent the high computational cost of numerical integration [19, 6, 16, 9]. These techniques are based on minimizing the difference between the state derivatives obtained from a data interpolant and those given by the ODE right-hand side, and usually involve two steps: data interpolation and parameter adaptation. Among them, nonparametric Bayesian modelling with Gaussian processes is one of the most promising approaches. Calderhead et al. [5] proposed an adaptive gradient matching method based on a product-of-experts approach and a marginalization over the derivatives of the state variables, which was later extended by Dondelinger et al. [6]. Barber & Wang [3] proposed a GPODE method in which the state variables are marginalized. Macdonald et al. [14] provided an interpretation of the above paradigms. Wenk et al. [23] proposed a fast Gaussian process based gradient matching (FGPGM) algorithm together with a theoretical framework for systems of nonlinear ODEs, and showed it to be more accurate, robust and efficient than earlier approaches.

For many practical problems, the variables are only partially observable, or not observable at all times. As a consequence, parameter inference is more challenging, even for coupled systems whose parameters are uniquely determined by the partially observed data under certain initial conditions. It is not clear whether gradient matching techniques can be applied when latent variables are present. Markov chain Monte Carlo algorithms can side-step the issue of parameter identifiability in many cases, but convergence remains a serious issue [19]. Therefore, the feasibility, accuracy, robustness and computational cost of numerical computations for such problems deserve careful attention.

In this work, we focus on parameter inference with partially observable data. The main idea is to treat the observable and non-observable variables differently. For observable variables, we use the same approach as Wenk et al. [23]; non-observable variables are obtained by integrating the ODEs. To circumvent the high computational cost of sampling in the Bayesian approach, we combine FGPGM with a least squares optimization. The remainder of the paper is organized as follows. In Section 2 we describe the numerical method for parameter identification with partial observations. Numerical examples are presented in Section 3. Concluding remarks are given in Section 4.

2 Algorithm

The main strategy of FGPGM is to minimize the mismatch between the data and the ODE solutions in a maximum likelihood sense, making use of the property that a Gaussian process is closed under differentiation.

In this work, we estimate the time-independent parameters \boldsymbol{\theta} of a dynamical system described by

\dot{\boldsymbol{X}}=\boldsymbol{f}(\boldsymbol{X},\boldsymbol{\theta}). (1)

Here \dot{\boldsymbol{X}} is the time derivative of the state \boldsymbol{X} and \boldsymbol{f} can be a nonlinear vector-valued function. We assume that only part of the variables are measurable and denote them by \boldsymbol{X}_{M}. They are observed at discrete time points t_{i} (i=1,\dots,N) as \boldsymbol{Y}(t_{i}) with noise \epsilon, such that \boldsymbol{Y}=\boldsymbol{X}_{M}+\epsilon. We assume the noise is Gaussian, \epsilon(t_{i})\sim\mathcal{N}(\boldsymbol{0},\sigma^{2}\boldsymbol{I}), so that

\rho(\boldsymbol{y}|\boldsymbol{x}_{M},\sigma)=\mathcal{N}(\boldsymbol{y}|\boldsymbol{x}_{M},\sigma^{2}\boldsymbol{I}), (2)

where \boldsymbol{x}_{M} and \boldsymbol{y} are the realizations of \boldsymbol{X}_{M} and \boldsymbol{Y}, respectively. The latent (unmeasurable) variables are denoted by \boldsymbol{X}_{L}, with \dim(\boldsymbol{X}_{M})+\dim(\boldsymbol{X}_{L})=\dim(\boldsymbol{X}). The idea of Gaussian process based gradient matching is as follows. First, we put a Gaussian process prior on \boldsymbol{x}_{M},

\rho(\boldsymbol{x}_{M}|\boldsymbol{\mu}_{M},\phi)=\mathcal{N}(\boldsymbol{x}_{M}|\boldsymbol{\mu}_{M},\boldsymbol{C}_{\phi}). (3)

Then, according to Lemma A.8, the conditional distribution of the kth state derivatives is

\rho(\dot{\boldsymbol{x}}_{M,k}|\boldsymbol{x}_{M,k},\phi_{k})=\mathcal{N}(\dot{\boldsymbol{x}}_{M,k}|\boldsymbol{D}_{k}(\boldsymbol{x}_{M,k}-\boldsymbol{\mu}_{M,k}),\boldsymbol{A}_{k}), (4)

where

\boldsymbol{D}_{k}=\boldsymbol{C}_{\phi_{k}}(\dot{\boldsymbol{x}}_{k},\boldsymbol{x}_{k})\boldsymbol{C}_{\phi_{k}}(\boldsymbol{x}_{k},\boldsymbol{x}_{k})^{-1}, (5)
\boldsymbol{A}_{k}=\boldsymbol{C}_{\phi_{k}}(\dot{\boldsymbol{x}}_{k},\dot{\boldsymbol{x}}_{k})-\boldsymbol{C}_{\phi_{k}}(\dot{\boldsymbol{x}}_{k},\boldsymbol{x}_{k})\boldsymbol{C}_{\phi_{k}}(\boldsymbol{x}_{k},\boldsymbol{x}_{k})^{-1}\boldsymbol{C}_{\phi_{k}}(\boldsymbol{x}_{k},\dot{\boldsymbol{x}}_{k}). (6)

Here \boldsymbol{C}_{\phi} denotes the covariance matrix, whose components are given by \boldsymbol{C}_{\phi}(i,j)=k_{\phi}(t_{i},t_{j}) for a kernel function k_{\phi} parameterized by the hyperparameter \phi. For more details we refer to Appendix A. A Gaussian noise with standard deviation \gamma is also introduced to represent the model uncertainty,

\rho(\dot{\boldsymbol{x}}|\boldsymbol{x},\boldsymbol{\theta},\gamma)=\mathcal{N}(\dot{\boldsymbol{x}}|\boldsymbol{f}(\boldsymbol{x},\boldsymbol{\theta}),\gamma\boldsymbol{I}). (7)
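To make the gradient matching quantities concrete, the following minimal Python sketch builds \boldsymbol{D} and \boldsymbol{A} of Eqs. (5)-(6) for an RBF (squared exponential) kernel, using the derivative covariances of Eqs. (39)-(41) in Appendix A. The function names, the jitter term and the default hyperparameters are illustrative choices, not part of the published FGPGM code.

import numpy as np

def rbf_kernel_blocks(t, l=1.0, sf=1.0):
    # Covariance blocks of an RBF GP and its time derivative,
    # k(a, b) = sf^2 * exp(-(a - b)^2 / (2 l^2)), following Eqs. (39)-(41).
    d = t[:, None] - t[None, :]
    C = sf**2 * np.exp(-d**2 / (2 * l**2))         # cov(x, x)
    dC_da = -C * d / l**2                          # cov(xdot, x): dk/da at (t_i, t_j)
    dC_db = C * d / l**2                           # cov(x, xdot): dk/db at (t_i, t_j)
    ddC = C * (1.0 / l**2 - d**2 / l**4)           # cov(xdot, xdot): d^2 k / da db
    return C, dC_da, dC_db, ddC

def derivative_conditional_moments(t, l=1.0, sf=1.0, jitter=1e-8):
    # D and A of Eqs. (5)-(6): moments of p(xdot | x) under the GP prior.
    C, dC_da, dC_db, ddC = rbf_kernel_blocks(t, l, sf)
    C_inv = np.linalg.inv(C + jitter * np.eye(len(t)))
    D = dC_da @ C_inv                              # C(xdot, x) C(x, x)^{-1}
    A = ddC - dC_da @ C_inv @ dC_db                # Schur complement
    return D, A

For example, D, A = derivative_conditional_moments(np.linspace(0, 2, 20), l=0.5) gives the moments entering Eq. (4) on 20 uniformly spaced observation times.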

We set up the graphical probabilistic model in Fig. 1 to show the relationships between the variables. The joint density can then be represented by the following theorem.

Figure 1: Probabilistic model with partially observable variables.
Theorem 2.1.

Given the modeling assumptions summarized in the graphical probabilistic model in Fig. 1,

\rho(\boldsymbol{x}_{M},\boldsymbol{\theta}|\boldsymbol{y},\boldsymbol{\phi},\boldsymbol{\sigma},\boldsymbol{\gamma})
\propto\rho(\boldsymbol{\theta})\mathcal{N}(\boldsymbol{x}_{M}|\boldsymbol{0},\boldsymbol{C}_{\phi})\mathcal{N}(\boldsymbol{y}|\boldsymbol{x}_{M},\sigma^{2}\boldsymbol{I})\mathcal{N}(\boldsymbol{f}_{M}(\boldsymbol{x}_{M},\tilde{\boldsymbol{x}}_{L}(\boldsymbol{x}_{M},\boldsymbol{\theta}),\boldsymbol{\theta})|\boldsymbol{D}\boldsymbol{x}_{M},\boldsymbol{A}+\gamma\boldsymbol{I}), (8)

where \tilde{\boldsymbol{x}}_{L}(\boldsymbol{x}_{M},\boldsymbol{\theta}), which enters \boldsymbol{f}_{M}, is the latent trajectory determined by \boldsymbol{\theta} and \boldsymbol{x}_{M}.

The proof can be found in Appendix B. In our computation, \tilde{\boldsymbol{x}}_{L} is obtained by integrating the ODE system numerically with the proposed \boldsymbol{\theta} and the initial values of \boldsymbol{x}_{M} and \boldsymbol{x}_{L}. The target is then to maximize the likelihood function \rho(\boldsymbol{x}_{M},\boldsymbol{\theta}|\boldsymbol{y},\phi,\sigma,\gamma).
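As a reading aid, the sketch below shows one way the (unnormalized) log of Eq. (8) can be evaluated when the latent states have to be integrated numerically. It assumes a single observed state stored first in the state vector, a known latent initial value, and hypothetical helpers f_full (full right-hand side) and f_M (right-hand side of the observed state only); it is not the authors' implementation.

import numpy as np
from scipy.integrate import solve_ivp
from scipy.stats import multivariate_normal

def log_density(theta, x_M, y, t, f_full, f_M, x_L0, D, A, C_phi, sigma, gamma):
    # Unnormalized log of Eq. (8) for one observed state.
    n = len(t)
    # Latent trajectory x_L(x_M, theta): integrate the full system from the
    # (assumed known) initial condition with the proposed parameters.
    x0 = np.concatenate(([x_M[0]], x_L0))
    sol = solve_ivp(f_full, (t[0], t[-1]), x0, t_eval=t, args=(theta,), rtol=1e-8)
    x_L = sol.y[1:, :].T

    log_p = multivariate_normal.logpdf(x_M, np.zeros(n), C_phi)        # GP prior
    log_p += multivariate_normal.logpdf(y, x_M, sigma**2 * np.eye(n))  # observations
    mean = D @ x_M                                                     # gradient matching mean
    cov = A + gamma * np.eye(n)
    log_p += multivariate_normal.logpdf(f_M(x_M, x_L, theta), mean, cov)
    return log_p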

The present algorithm combines Gaussian process based gradient matching with a least squares optimization. In the gradient matching step, the Gaussian process model is first fitted by inferring the hyperparameters \phi. Next, the states and parameters are inferred using a one-chain MCMC scheme on the density, as in [23]. Finally, the parameters estimated above are used as the initial guess for the least squares optimization. The algorithm can be summarized as follows.

Algorithm
Input: \boldsymbol{y},\boldsymbol{f}(\boldsymbol{x},\boldsymbol{\theta}),\gamma,N_{MCMC},N_{burnin},t,\sigma_{s},\sigma_{p}
Step 1. Fit the GP model to the data.
Step 2. Infer \boldsymbol{x}_{M}, \boldsymbol{x}_{L} and \boldsymbol{\theta} using MCMC.
\mathcal{S}\leftarrow\emptyset
for i=1\rightarrow N_{MCMC}+N_{burnin} do
      for each state do
            Propose a new state value using a Gaussian distribution with standard
            deviation \sigma_{s}
            Accept or reject the proposed value based on the density (Eq. 8)
            Add the current value to \mathcal{S}
      end for
      for each parameter do
            Propose a new parameter value using a Gaussian distribution with standard
            deviation \sigma_{p}
            Integrate \boldsymbol{x}_{L} from the initial values with the proposed parameters of the ODEs
            Accept or reject the proposed value based on the density (Eq. 8)
            Add the current value to \mathcal{S}
      end for
end for
Discard the first N_{burnin} samples of \mathcal{S}
Return \boldsymbol{x}_{M},\boldsymbol{x}_{L},\boldsymbol{\theta}
Step 3. Optimization using \boldsymbol{\theta} from Step 2 as the initial guess.
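A minimal sketch of the Metropolis-within-Gibbs loop of Step 2 is given below. It assumes a log_density(x_M, theta) callable evaluating (the log of) Eq. (8), for instance the sketch above with its remaining arguments fixed; the proposal and bookkeeping details are simplified compared with [23].

import numpy as np

def mcmc_step2(log_density, x_M0, theta0, sigma_s, sigma_p, n_mcmc, n_burnin, rng=None):
    # Metropolis-within-Gibbs sampler for the states and parameters (Step 2).
    rng = np.random.default_rng() if rng is None else rng
    x_M, theta = x_M0.copy(), theta0.copy()
    lp = log_density(x_M, theta)
    samples = []
    for it in range(n_mcmc + n_burnin):
        for i in range(len(x_M)):                      # update each state value
            prop = x_M.copy()
            prop[i] += sigma_s * rng.standard_normal()
            lp_prop = log_density(prop, theta)
            if np.log(rng.uniform()) < lp_prop - lp:   # Metropolis acceptance
                x_M, lp = prop, lp_prop
        for j in range(len(theta)):                    # update each parameter
            prop = theta.copy()
            prop[j] += sigma_p * rng.standard_normal()
            lp_prop = log_density(x_M, prop)           # triggers re-integration of x_L
            if np.log(rng.uniform()) < lp_prop - lp:
                theta, lp = prop, lp_prop
        samples.append((x_M.copy(), theta.copy()))
    return samples[n_burnin:]                          # discard the burn-in samples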

In Step 1, the Gaussian process model is fitted to the data by maximizing the log marginal likelihood of the observations \boldsymbol{y} at times \boldsymbol{t},

\log(\rho(\boldsymbol{y}|\boldsymbol{t},\phi,\sigma))=-\frac{1}{2}\boldsymbol{y}^{T}(\boldsymbol{C}_{\phi}+\sigma^{2}\boldsymbol{I})^{-1}\boldsymbol{y}-\frac{1}{2}\log|\boldsymbol{C}_{\phi}+\sigma^{2}\boldsymbol{I}|-\frac{n}{2}\log 2\pi, (9)

with respect to the hyperparameters \phi and \sigma, where \sigma is the standard deviation of the observation noise and n is the number of observations. The numerical integration for \tilde{\boldsymbol{x}}_{L} in Step 2 is carried out only after each update of \boldsymbol{\theta}. Step 3 solves the following minimization problem,

\min_{\boldsymbol{\theta}}\|\boldsymbol{x}_{M}(\boldsymbol{\theta})-\boldsymbol{y}\|_{L^{2}(0,T)}^{2}. (10)

In the optimization step, a gradient descent method is adopted, with the gradient approximated numerically at each search step. One advantage of the optimization is that it can produce a more accurate result at lower computational cost; with more data, the Gaussian observation noise is averaged out in the cost functional. However, it requires a proper initial guess of the parameters to avoid falling into local minima, whereas FGPGM places relatively few restrictions on the initial guess. On the other hand, FGPGM needs a large number of MCMC samples for the expectations of the random variables to be meaningful, and it is hard to assess the accuracy of the reconstructed solution. By combining the two methods, we can use a small number of MCMC samples to obtain a rough approximation of the parameters and then use it as the initial guess for the least squares optimization.
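A possible realization of Step 3 is sketched below. It uses SciPy's trust region least squares routine in place of the hand-written gradient descent described above; the argument names (f_full for the full right-hand side, obs_idx for the indices of the observed components) are illustrative.

import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import least_squares

def fit_step3(theta_init, x0, t_obs, y_obs, f_full, obs_idx):
    # Step 3: refine the FGPGM estimate by minimizing the discrete form of Eq. (10);
    # theta_init is the parameter estimate returned by the MCMC step.
    def residuals(theta):
        sol = solve_ivp(f_full, (t_obs[0], t_obs[-1]), x0,
                        t_eval=t_obs, args=(theta,), rtol=1e-8)
        return (sol.y[obs_idx, :] - y_obs).ravel()     # mismatch at the observation times

    result = least_squares(residuals, theta_init)
    return result.x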

3 Experiments

For the Gaussian process regression step on the observable variables, the code published alongside Wenk et al. (2019) [23] was used. The MCMC sampling part was then adapted to the partial observation case according to the model described above. In the following, FGPGM refers to this adapted method for partial observations (Steps 1 and 2 of the present algorithm).

3.1 Lotka Volterra

The Lotka-Volterra system was originally proposed by Lotka [12] to model prey-predator interaction; its dynamics are given by

\dot{x}_{1}=\theta_{1}x_{1}(t)-\theta_{2}x_{1}(t)x_{2}(t), (11)
\dot{x}_{2}=-\theta_{3}x_{2}(t)+\theta_{4}x_{1}(t)x_{2}(t), (12)

where \theta_{1},\theta_{2},\theta_{3},\theta_{4}>0. In the present work, the system is observed through one variable together with the initial value of the other. The rest of the setup is the same as in Gorbach et al. (2017) [9] and Wenk et al. (2019) [23]. The observations are located at 20 uniformly spaced times in the interval [0,2]. The initial values of the variables are (5,3). The trajectory of the observable variable is generated by numerical integration of the system with the true parameters \theta_{1}=2,\theta_{2}=1,\theta_{3}=4,\theta_{4}=1 and perturbed by Gaussian noise with standard deviation 0.1. The RBF kernel is used for the Gaussian process, and the model noise is set to \gamma=3\times 10^{-1}. The results with x_{1} observed are shown in Fig. 2; those with x_{2} observed are given in Fig. 3. In the latter case, the optimization step improves the FGPGM results, with the identified parameters being closer to the true values.
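For reproducibility, a short sketch of this data generation is given below; the random seed and the solver tolerance are arbitrary choices.

import numpy as np
from scipy.integrate import solve_ivp

def lotka_volterra(t, x, theta):
    # Right-hand side of Eqs. (11)-(12)
    x1, x2 = x
    th1, th2, th3, th4 = theta
    return [th1 * x1 - th2 * x1 * x2, -th3 * x2 + th4 * x1 * x2]

rng = np.random.default_rng(0)
t_obs = np.linspace(0.0, 2.0, 20)                        # 20 uniform observation times in [0, 2]
theta_true = (2.0, 1.0, 4.0, 1.0)
sol = solve_ivp(lotka_volterra, (0.0, 2.0), [5.0, 3.0],
                t_eval=t_obs, args=(theta_true,), rtol=1e-8)
y1 = sol.y[0] + 0.1 * rng.standard_normal(t_obs.size)    # noisy observations of x1
# x2 enters the inference only through its initial value x2(0) = 3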

The sensitivities of the variables to the parameters are listed in Tab. 1. The normalized sensitivity indices at the true parameter set \boldsymbol{\theta}_{0} are defined as

S_{ij}=\frac{1}{\|x_{i}\|_{L^{2}(T_{1},T_{2})}}\left\|\frac{\partial x_{i}(t;\boldsymbol{\theta})}{\partial\theta_{j}}\right\|_{L^{2}(T_{1},T_{2})}(\boldsymbol{\theta}_{0}), (13)

and are approximated by numerical differences. Near the true parameter set, the variables are less sensitive to \theta_{1} and \theta_{3} than to the other parameters, which explains why \theta_{1} and \theta_{3} are identified less accurately in the numerical tests (see Fig. 2(c) and Fig. 3(c)).

S_{ij}        x_1    x_2
\theta_1      0.20   0.61
\theta_2      0.52   1.13
\theta_3      0.40   0.33
\theta_4      1.27   0.98
Table 1: Sensitivity of each variable to the parameters of the Lotka-Volterra system at \boldsymbol{\theta}=(2,1,4,1). The sensitivity index is defined in Eq. 13.
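The indices in Eq. (13) can be approximated as in the sketch below, which uses a forward difference with an illustrative step size and assumes a uniform time grid. For Table 1 it would be evaluated with the lotka_volterra right-hand side of the previous sketch, the initial values (5,3) and \boldsymbol{\theta}_{0}=(2,1,4,1).

import numpy as np
from scipy.integrate import solve_ivp

def sensitivity_indices(f_full, x0, theta0, t_grid, h=1e-4):
    # Forward-difference approximation of the normalized indices in Eq. (13).
    def trajectory(theta):
        sol = solve_ivp(f_full, (t_grid[0], t_grid[-1]), x0,
                        t_eval=t_grid, args=(theta,), rtol=1e-8)
        return sol.y                                   # shape (n_states, n_times)

    dt = t_grid[1] - t_grid[0]                         # assumes a uniform grid
    x_base = trajectory(theta0)
    norm_x = np.sqrt(np.sum(x_base**2, axis=1) * dt)   # discrete L2 norms of x_i
    S = np.zeros((x_base.shape[0], len(theta0)))
    for j in range(len(theta0)):
        theta_pert = np.array(theta0, dtype=float)
        theta_pert[j] += h
        dx_dtheta = (trajectory(theta_pert) - x_base) / h
        S[:, j] = np.sqrt(np.sum(dx_dtheta**2, axis=1) * dt) / norm_x
    return S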

The cases with a larger noise level (standard deviation 0.5) are shown in Fig. 4 and Fig. 5, corresponding to x_{1} and x_{2} observations respectively. The prediction of the unobserved variable can deviate far from the ground truth if only the FGPGM method is used; the inferred states and parameters are improved after applying the deterministic optimization.

(a) x_{1}  (b) x_{2}  (c) \theta
Figure 2: Reconstruction and inference results for the Lotka-Volterra system, showing the state evolution over time and the parameter distributions. x_{1} is observable and x_{2} is the latent variable. The ground truth, the FGPGM result, and the result of combining FGPGM with optimization are compared.
(a) x_{1}  (b) x_{2}  (c) \theta
Figure 3: The state evolution over time and the parameter inference results for the Lotka-Volterra system. x_{2} is observable and x_{1} is the latent variable. The ground truth, the FGPGM result, and the result of combining FGPGM with optimization are compared.
(a) x_{1}  (b) x_{2}  (c) \theta
Figure 4: Reconstruction and inference results for the Lotka-Volterra system with x_{1} observable and x_{2} latent. The ground truth, the FGPGM result, and the result of combining FGPGM with optimization are compared. The observation noise has standard deviation 0.5 (large noise case).
(a) x_{1}  (b) x_{2}  (c) \theta
Figure 5: The state evolution over time for the Lotka-Volterra system. x_{2} is observable and x_{1} is the latent variable. The ground truth, the FGPGM result, and the result of combining FGPGM with optimization are compared. The observation noise has standard deviation 0.5 (large noise case).

3.2 Spiky Dynamics

This example is the system proposed by FitzHugh (1961) and Nagumo et al. (1962) for modeling spike potentials in giant squid neurons, abbreviated as the FHN system. It consists of two ODEs with three parameters and has notoriously fast-changing dynamics due to its highly nonlinear terms. In the following tests, the Matern 5/2 kernel is used and \gamma is set to 3\times 10^{-1}, as in Wenk et al. (2019) [23]. We assume that one of the two variables is observable; its data were generated with \theta_{1}=0.2, \theta_{2}=0.2, \theta_{3}=3 and perturbed by Gaussian noise with an average signal-to-noise ratio SNR=100. There are 100 data points uniformly spaced in [0,10]. The system reads

\dot{V}=\theta_{1}(V-\frac{V^{3}}{3}+R), (14)
\dot{R}=\frac{1}{\theta_{1}}(V-\theta_{2}+\theta_{3}R). (15)
S_{ij}        x_1    x_2
\theta_1      2.33   1.24
\theta_2      0.44   0.31
\theta_3      1.01   0.55
Table 2: Sensitivity of each variable to the parameters of the FHN system at \boldsymbol{\theta}=(0.2,0.2,3.0). The sensitivity index is defined in Eq. 13.

In this case, if only the FGPGM step is used, the reconstructed solution corresponding to the identified parameters may deviate significantly from the true time series (see Fig. 6, where x_{1} is observed). It has been pointed out [23] that all GP based gradient matching algorithms lead to smoother trajectories than the ground truth, and this effect becomes more severe with sparse observations. A least squares optimization after FGPGM can largely reduce this effect (see Fig. 7).

Figure 6: Results for the FHN system obtained from the FGPGM method alone, without further optimization. x_{1} is observable and x_{2} is the latent variable. The ground truth, the FGPGM result, and the reconstructed solution (integration of the ODEs with the inferred parameters) are compared.

Fig. 7 and Fig. 8 present the results with only x_{1} and only x_{2} observed, respectively. In both cases the identified parameters are more accurate than with FGPGM alone. From the sensitivity check in Tab. 2, \theta_{1} is expected to be identified most accurately because the variables are most sensitive to it, whereas \theta_{2} is the least sensitive and therefore the hardest to identify; the numerical results agree with this. It is worth mentioning that only 3500 samples were taken in the FGPGM step and that the optimization step took much less time than the FGPGM step. The total computation time is therefore greatly reduced compared with Wenk et al. (2019) [23], where 100,000 MCMC samples were used.

(a) x_{1}  (b) x_{2}  (c) \theta
Figure 7: The state evolution over time and the identified parameters for the FHN system. x_{1} is observable and x_{2} is the latent variable. The ground truth, the FGPGM result, and the result of combining FGPGM with optimization are compared.
(a) x_{1}  (b) x_{2}  (c) \theta
Figure 8: The state evolution over time and the identified parameters for the FHN system. x_{2} is observable and x_{1} is the latent variable. The ground truth, the FGPGM result, and the result of combining FGPGM with optimization are compared.

In this example, we also notice that if the least squares optimization is used alone, local minima can lead to reconstructions far from the ground truth, making it even less robust than the FGPGM method. For example, if the initial guess of the parameters is chosen near (\theta_{1},\theta_{2},\theta_{3})=(1.51,2.2,1.78), the cost functional falls into a local minimum during the gradient based search (see Fig. 9). The existence of many local minima in the full observation case has been pointed out in, e.g., [7, 19]. These results clearly illustrate the benefit of combining FGPGM with least squares optimization.

Figure 9: Results for the FHN system with x_{1} latent, obtained by using least squares optimization alone with an initial guess of the parameters near a local minimum.

3.3 Protein Transduction

Finally, the Protein Transduction system proposed in Vyshemirsky and Girolami (2008) [22] is adopted to illustrate the performance of the method on a system with more equations. As mentioned in Dondelinger et al. (2013) [6], it is notoriously difficult to fit and contains unidentifiable parameters. The system is described by

\dot{S}=-\theta_{1}S-\theta_{2}SR+\theta_{3}R_{S}, (16)
\dot{dS}=\theta_{1}S, (17)
\dot{R}=-\theta_{2}SR+\theta_{3}R_{S}+\theta_{5}\frac{R_{pp}}{\theta_{6}+R_{pp}}, (18)
\dot{R}_{S}=\theta_{2}SR-\theta_{3}R_{S}-\theta_{4}R_{S}, (19)
\dot{R}_{pp}=\theta_{4}R_{S}-\theta_{5}\frac{R_{pp}}{\theta_{6}+R_{pp}}. (20)

The parameter \theta_{6} in this system is unidentifiable. We adopt the same experimental setup as Dondelinger et al. (2013) [6] and Wenk et al. (2019) [23]: \gamma=10^{-4} in the FGPGM step, observations at the discrete times [0,1,2,4,5,7,10,15,20,30,40,50,60,80,100], initial condition [1,0,1,0,0], and data generated by numerically integrating the system with \boldsymbol{\theta}=[0.07,0.6,0.05,0.3,0.017,0.3] and adding Gaussian noise with standard deviation 0.01. As in the previous papers, a sigmoid kernel is used to handle the logarithmically spaced observation times and the typically spiky form of the dynamics.
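A sketch of this setup (the right-hand side of Eqs. (16)-(20) and noisy data at the listed observation times) is given below; the seed and solver tolerance are arbitrary.

import numpy as np
from scipy.integrate import solve_ivp

def protein_transduction(t, x, theta):
    # Right-hand side of Eqs. (16)-(20); x = (S, dS, R, R_S, R_pp)
    S, dS, R, R_S, R_pp = x
    th1, th2, th3, th4, th5, th6 = theta
    return [-th1 * S - th2 * S * R + th3 * R_S,
            th1 * S,
            -th2 * S * R + th3 * R_S + th5 * R_pp / (th6 + R_pp),
            th2 * S * R - th3 * R_S - th4 * R_S,
            th4 * R_S - th5 * R_pp / (th6 + R_pp)]

t_obs = np.array([0, 1, 2, 4, 5, 7, 10, 15, 20, 30, 40, 50, 60, 80, 100], dtype=float)
theta_true = [0.07, 0.6, 0.05, 0.3, 0.017, 0.3]
sol = solve_ivp(protein_transduction, (0.0, 100.0), [1, 0, 1, 0, 0],
                t_eval=t_obs, args=(theta_true,), rtol=1e-8)
rng = np.random.default_rng(0)
y = sol.y + 0.01 * rng.standard_normal(sol.y.shape)   # noisy data; the unobserved rows are dropped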

S_{ij}        x_1    x_2    x_3    x_4    x_5
\theta_1      2.86   9.78   1.73   1.77   3.33
\theta_2      0.70   0.98   0.22   0.59   0.41
\theta_3      1.35   2.11   0.47   0.92   0.90
\theta_4      0.26   0.43   0.03   2.64   0.62
\theta_5      1.53   2.58   24.48  0.90   49.38
\theta_6      0.04   0.07   0.60   0.02   1.21
Table 3: Sensitivity of each variable to the parameters of the Protein Transduction system at \boldsymbol{\theta}=[0.07,0.6,0.05,0.3,0.017,0.3]. The sensitivity index is defined in Eq. 13.
(a) x_{1}  (b) x_{2}  (c) x_{3}  (d) x_{4}  (e) x_{5}  (f) \theta
Figure 10: The state evolution over time and the inferred parameters for the Protein Transduction system. x_{3} is unobserved and the other variables are observable. The ground truth, the FGPGM result, and the result of combining FGPGM with optimization are compared.

Fig. 10 gives the results with x_{3} (R) unobserved. The cases in which one of the other variables is unknown give better results than the case illustrated here and are not presented. We can see that x_{3} is not well fitted by the FGPGM step alone, whereas the combination of FGPGM and optimization produces a satisfactory result, with the parameters \theta_{2} and \theta_{4} significantly improved. The sensitivity check is summarized in Tab. 3, which shows that the variables are less sensitive to \theta_{2}, so it is harder to infer accurately. The error in \theta_{5} may be affected by the value of the unidentifiable parameter \theta_{6}.

It is also of interest to examine the performance of the method with more latent variables. In this model, although dS does not enter the equations of the other variables, the data of dS help infer \theta_{1}. We also notice that \dot{R}+\dot{R}_{S}+\dot{R}_{pp}=0, so if S and dS are both missing it is impossible to identify \theta_{1}. Therefore, in the following test we choose the data of dS, R_{S} and R_{pp} as observations, again with Gaussian noise of standard deviation 0.01 as in the previous case with one latent variable. Fig. 11 shows that the result of the FGPGM step is worse than in the case with only one latent variable, but the final reconstruction of the latent variables and the identified parameters are not significantly different from that case.

(a) x_{1}  (b) x_{2}  (c) x_{3}  (d) x_{4}  (e) x_{5}  (f) \theta
Figure 11: The state evolution over time and the inferred parameters for the Protein Transduction system. x_{1} and x_{3} are unobserved and the other variables are observable. The ground truth, the FGPGM result, and the result of combining FGPGM with optimization are compared.

4 Discussion

In this work, we proposed an effective method for parameter inference of coupled ODE systems with partially observable data. Our method is based on the FGPGM approach [23], which avoids product-of-experts heuristics. To improve accuracy and efficiency, we combine it with a least squares optimization. Due to the existence of latent variables, numerical integration is necessary for computing the likelihood function in the FGPGM step, which increases the computational cost; the least squares optimization allows us to greatly reduce the number of samples. In our numerical tests, the number of samples in the FGPGM step is only about 10% of that suggested in the literature. It is also worth noting that a conventional least squares optimization requires a good initial guess, whereas in our approach this initial guess is supplied by the FGPGM step. The numerical examples illustrate the feasibility of the proposed method for parameter inference with partial observations.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (NSFC No. 11971121), the Natural Sciences and Engineering Research Council (NSERC) of Canada, and the Fields Institute for Research in Mathematical Sciences. The authors also express their gratitude to Prof. Xin Gao, Nathan Gold and other members of the Fields CQAM Lab on Health Analytics and Modelling for very beneficial discussions.

References

  • [1] G. Anger. Inverse Problems in Differential Equations. Berlin: Kluwer, 1990.
  • [2] R. C. Aster, B. Borchers and C. H. Thurber. Parameter Estimation and Inverse Problems. Boston: Elsevier, 2005.
  • [3] D. Barber, Y. Wang. Gaussian processes for Bayesian estimation in ordinary differential equations. International Conference on Machine Learning. 2014.
  • [4] H. G. Bock. Recent advances in parameter identification techniques for ODE. In Numerical Treatment of Inverse Problems in Differential and Integral Equations (eds P. Deuflhard and E. Harrier), 95-121. Basel:Birkhäuser, 1983.
  • [5] Ben Calderhead, Mark Girolami and Neil D. Lawrence. Accelerating bayesian inference over nonlinear differential equations with gaussian processes. Neural Information Processing Systems (NIPS), 2008.
  • [6] Frank Dondelinger, Maurizio Filippone, Simon Rogers and Dirk Husmeier. Ode parameter inference using adaptive gradient matching with gaussian processes. International Conference on Artificial Intelligence and Statistics (AISTATS), 2013.
  • [7] W. R. Esposito and C. Floudas. Deterministic global optimization in nonlinear optimal control problems. J. Glob. Optimizn, 17, 97-126., 2000.
  • [8] J. P. Gauthier and I. A. K. Kupka. Observability and Observers for Nonlinear Systems. SIAM J. Control Optim., 32(4), 975-994, 1994.
  • [9] Nico S Gorbach, Stefan Bauer, and Joachim M. Buhmann. Scalable variational inference for dynamical systems. Neural Information Processing Systems (NIPS), 2017.
  • [10] A. Isidori. Nonlinear Control Systems (3rd Edition). Springer-Verlag, 1995.
  • [11] Jari Kaipio and Erkki Somersalo. Statistical and computational inverse problems. Springer Science, 2006.
  • [12] Alfred J Lotka. The growth of mixed populations: two species competing for a common food supply. In The Golden Age of Theoretical Ecology: 1923-1940, pages 274-286. Springer, 1978.
  • [13] Z. Li, M. Osborne and T. Prvan. Parameter estimation in ordinary differential equations. IMA J. Numer. Anal., 25, 264-285, 2005.
  • [14] Benn Macdonald, Catherine Higham and Dirk Husmeier. Controversy in mechanistic modelling with Gaussian processes. Proceedings of the 32 nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37.
  • [15] J. Nocedal and S. Wright. Numerical Optimization, 2nd edn. New York: Springer, 2006.
  • [16] Mu Niu, Simon Rogers, Maurizio Filippone, and Dirk Husmeier. Fast inference in nonlinear dynamical systems using gradient matching. International Conference on Machine Learning (ICML), 2016.
  • [17] H. Pohjanpalo. System identifiability based on the power series expansion of the solution. Mathematical Biosciences, 41:21-33, 1978.
  • [18] A. Papoulis and S. U. Pillai. Probability, Random Variables, and Stochastic Processes, 2002.
  • [19] Jim O Ramsay, Giles Hooker, David Campbell, and Jiguo Cao. Parameter estimation for differential equations: a generalized smoothing approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(5):741-796, 2007.
  • [20] Carl Edward Rasmussen. ”Gaussian processes in machine learning.” Summer School on Machine Learning. Springer, Berlin, Heidelberg, 2003.
  • [21] A. Tarantola. Inverse Problem Theory. Philadelphia: Society for Industrial and Applied Mathematics, 2005.
  • [22] Vladislav Vyshemirsky and Mark A Girolami. Bayesian ranking of biochemical system models. Bioinformatics, 24(6):833-839, 2008.
  • [23] P. Wenk, A. Gotovos, S. Bauer, N. Gorbach, A. Krause, J. M. Buhmann, Fast Gaussian process based gradient matching for parameter identification in systems of nonlinear ODEs. Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) 2019, Naha, Okinawa, Japan. PMLR: Volume 89, 2019.

Appendix A Preliminaries

In the following we list some preliminaries on the derivatives of a Gaussian process that are used in this work; the proofs can be found in, e.g., [18, 20, 23]. We denote a random process by X_{t}, its realization by x, and its time derivative by \dot{X}_{t}.

Definition A.1.

([18]) The sequence of random variables X_{n} converges to the random variable X in the first-mean sense (limit in mean) if

limn𝔼(|XnX|)=0.\lim_{n\rightarrow\infty}\mathbb{E}(|X_{n}-X|)=0. (21)
Definition A.2.

The stochastic process XtX_{t} is first-mean differentiable if for some X˙t\dot{X}_{t}

limδt0𝔼|Xt+δtXtδtX˙t|=0.\lim_{\delta t\rightarrow 0}\mathbb{E}\left|\frac{X_{t+\delta t}-X_{t}}{\delta t}-\dot{X}_{t}\right|=0. (22)
Definition A.3.

For a given random variable X, the moment generating function (MGF) is defined by

ΦX(t)=E[exp(Xt)]=exp(xt)ρ(x)𝑑x.\Phi_{X}(t)=E[\exp(Xt)]=\int_{-\infty}^{\infty}\exp(xt)\rho(x)dx. (23)
Proposition A.4.

If \Phi_{X}(t) is the MGF, then

  1. \frac{d^{i}\Phi_{X}}{dt^{i}}\Big|_{t=0}=m_{i}, where m_{i} is the i-th moment of X.

  2. Let X and Y be two random variables. X and Y have the same distribution if and only if they have the same MGFs.

  3. X\sim N(\mu,\sigma^{2}) if and only if \Phi_{X}(t)=\exp\left(\frac{\sigma^{2}t^{2}}{2}+\mu t\right).

  4. If X and Y are two independent random variables, then \Phi_{X+Y}(t)=\Phi_{X}(t)\Phi_{Y}(t).

By the above propositions, one has

Lemma A.5.

If X and Y are two independent Gaussian random variables with means \mu_{X},\mu_{Y} and variances \sigma_{X}^{2},\sigma_{Y}^{2}, then X+Y is a Gaussian random variable with mean \mu_{X}+\mu_{Y} and variance \sigma_{X}^{2}+\sigma_{Y}^{2}.

Definition A.6.

([20]) A real-valued stochastic process \{X_{t}\}_{t\in T}, where T is an index set, is a Gaussian process if all of its finite-dimensional distributions are multivariate normal. That is, for any choice of distinct values t_{1},t_{2},\dots,t_{N}\in T, the random vector \boldsymbol{X}=(X_{t_{1}},\dots,X_{t_{N}})^{T} has a multivariate normal distribution with joint Gaussian probability density function given by

ρXt1Xt2XtN(xt1,,xtN)=1(2π)N/2det(𝚺)1/2exp(12(𝒙𝝁𝑿)T𝚺1(𝒙𝝁𝑿)).\rho_{X_{t_{1}}X_{t_{2}}\dots X_{t_{N}}}(x_{t_{1}},\dots,x_{t_{N}})=\frac{1}{(2\pi)^{N/2}det(\boldsymbol{\Sigma})^{1/2}}\exp\left(-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu}_{\boldsymbol{X}})^{T}\boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu}_{\boldsymbol{X}})\right). (24)

where the mean vector is defined as

(𝝁𝑿)i=𝔼[𝑿ti]\displaystyle(\boldsymbol{\mu}_{\boldsymbol{X}})_{i}=\mathbb{E}[\boldsymbol{X}_{t_{i}}] (25)

and covariance matrix (𝚺)ij=cov(Xti,Xtj)(\boldsymbol{\Sigma})_{ij}=cov(X_{t_{i}},X_{t_{j}}).

A Gaussian process is fully specified by its mean and covariance functions. A common choice of covariance function is the squared exponential kernel, \mathrm{cov}(X_{t_{i}},X_{t_{j}})=k_{\phi}(t_{i},t_{j})=\exp(-\frac{1}{2l^{2}}|t_{i}-t_{j}|^{2}), where the hyperparameter l represents the length scale of the nonlocal interaction.

Let t_{0},\delta t\in\mathbb{R}, and let X_{t} be a Gaussian process with constant mean \mu and kernel function k_{\phi}(t_{1},t_{2}), assumed to be first-mean differentiable. Then X_{t_{0}+\delta t} and X_{t_{0}} are jointly Gaussian distributed,

[Xt0Xt0+δt]𝒩([μμ],𝚺)\left[\begin{array}[]{c}X_{t_{0}}\\ X_{t_{0}+\delta t}\end{array}\right]\sim\mathcal{N}\left(\left[\begin{array}[]{c}\mu\\ \mu\end{array}\right],\boldsymbol{\Sigma}\right) (26)

with density function

ρ(xt0,xt0+δt)=12πdet(𝚺)1/2exp(12[xt0μxt0+δtμ]T𝚺1[xt0μxt0+δtμ])\rho(x_{t_{0}},x_{t_{0}+\delta t})=\frac{1}{2\pi\det(\boldsymbol{\Sigma})^{1/2}}\exp\left({-\frac{1}{2}\left[\begin{array}[]{c}x_{t_{0}}-\mu\\ x_{t_{0}+\delta t}-\mu\end{array}\right]^{T}\boldsymbol{\Sigma}^{-1}\left[\begin{array}[]{c}x_{t_{0}}-\mu\\ x_{t_{0}+\delta t}-\mu\end{array}\right]}\right) (27)

where

𝚺=(kϕ(t0,t0)kϕ(t0,t0+δt)kϕ(t0+δt,t0)kϕ(t0+δt,t0+δt)).\boldsymbol{\Sigma}=\left(\begin{array}[]{cc}k_{\phi}(t_{0},t_{0})&k_{\phi}(t_{0},t_{0}+\delta t)\\ k_{\phi}(t_{0}+\delta t,t_{0})&k_{\phi}(t_{0}+\delta t,t_{0}+\delta t)\end{array}\right). (28)

If we define linear transformation

𝐓=(101δt1δt),\displaystyle\mathbf{T}=\left(\begin{array}[]{cc}1&0\\ -\frac{1}{\delta t}&\frac{1}{\delta t}\end{array}\right), (31)

then we have

[Xt0Xt0+δtXt0δt]=𝐓[Xt0Xt0+δt]𝒩([μ0],𝐓𝚺𝐓T)\left[\begin{array}[]{c}X_{t_{0}}\\ \frac{X_{t_{0}+\delta t}-X_{t_{0}}}{\delta t}\end{array}\right]=\mathbf{T}\left[\begin{array}[]{c}X_{t_{0}}\\ X_{t_{0}+\delta t}\end{array}\right]\sim\mathcal{N}\left(\left[\begin{array}[]{c}\mu\\ 0\end{array}\right],\mathbf{T}\boldsymbol{\Sigma}\mathbf{T}^{T}\right) (32)

i.e.

ρ(Xt0,Xt0+δtXt0δt)=𝒩([μ0],𝐓𝚺𝐓T)\rho(X_{t_{0}},\frac{X_{t_{0}+\delta t}-X_{t_{0}}}{\delta t})=\mathcal{N}\left(\left[\begin{array}[]{c}\mu\\ 0\end{array}\right],\mathbf{T}\boldsymbol{\Sigma}\mathbf{T}^{T}\right) (33)

where

\mathbf{T}\boldsymbol{\Sigma}\mathbf{T}^{T}=\left(\begin{array}{cc}k_{\phi}(t_{0},t_{0})&\frac{k_{\phi}(t_{0},t_{0}+\delta t)-k_{\phi}(t_{0},t_{0})}{\delta t}\\ \frac{k_{\phi}(t_{0}+\delta t,t_{0})-k_{\phi}(t_{0},t_{0})}{\delta t}&\frac{\frac{k_{\phi}(t_{0}+\delta t,t_{0}+\delta t)-k_{\phi}(t_{0},t_{0}+\delta t)}{\delta t}-\frac{k_{\phi}(t_{0}+\delta t,t_{0})-k_{\phi}(t_{0},t_{0})}{\delta t}}{\delta t}\end{array}\right). (36)

The above derivation shows that X_{t_{0}} and \frac{X_{t_{0}+\delta t}-X_{t_{0}}}{\delta t} are jointly Gaussian distributed. Using the definition of the first-mean derivative and the fact that convergence in r-th mean implies convergence in distribution, it follows that X_{t_{0}} and \dot{X}_{t_{0}} are jointly Gaussian,

[Xt0X˙t0]𝒩([μ0],[kϕ(t0,t0)kϕ(a,b)b|a=t0,b=t0kϕ(a,b)a|a=t0,b=t02kϕ(a,b)ab|a=t0,b=t0]).\left[\begin{array}[]{c}X_{t_{0}}\\ \dot{X}_{t_{0}}\end{array}\right]\sim\mathcal{N}\left(\left[\begin{array}[]{c}\mu\\ 0\end{array}\right],\left[\begin{array}[]{cc}k_{\phi}(t_{0},t_{0})&\frac{\partial k_{\phi}(a,b)}{\partial b}|_{a=t_{0},b=t_{0}}\\ \frac{\partial k_{\phi}(a,b)}{\partial a}|_{a=t_{0},b=t_{0}}&\frac{\partial^{2}k_{\phi}(a,b)}{\partial a\partial b}|_{a=t_{0},b=t_{0}}\end{array}\right]\right). (37)

In general, 𝑿=(Xt1,,Xtk)T\boldsymbol{X}=(X_{t_{1}},\dots,X_{t_{k}})^{T} and 𝑿˙=(X˙t1,,X˙tk)T\dot{\boldsymbol{X}}=(\dot{X}_{t_{1}},\dots,\dot{X}_{t_{k}})^{T} are jointly Gaussian

[𝑿𝑿˙]𝒩([𝝁𝟎],[𝑪ϕ(𝑿,𝑿)𝑪ϕ(𝑿,𝑿˙)𝑪ϕ(𝑿˙,𝑿)𝑪ϕ(𝑿˙,𝑿˙)]).\left[\begin{array}[]{c}\boldsymbol{X}\\ \dot{\boldsymbol{X}}\end{array}\right]\sim\mathcal{N}\left(\left[\begin{array}[]{c}\boldsymbol{\mu}\\ \boldsymbol{0}\end{array}\right],\left[\begin{array}[]{cc}\boldsymbol{C}_{\phi}(\boldsymbol{X},\boldsymbol{X})&\boldsymbol{C}_{\phi}(\boldsymbol{X},\dot{\boldsymbol{X}})\\ \boldsymbol{C}_{\phi}(\dot{\boldsymbol{X}},\boldsymbol{X})&\boldsymbol{C}_{\phi}(\dot{\boldsymbol{X}},\dot{\boldsymbol{X}})\end{array}\right]\right). (38)

Here (C_{\phi}(\mathbf{a},\mathbf{b}))_{ij}=\mathrm{cov}(a_{i},b_{j}) is the covariance between a_{i} and b_{j}, given by the predefined kernel matrix of the Gaussian process. By the linearity of the covariance operator and the predefined kernel function k_{\phi}(a,b), we have

Cϕ(Xti,X˙tj)=kϕ(a,b)b|a=ti,b=tj,\displaystyle C_{\phi}(X_{t_{i}},\dot{X}_{t_{j}})=\frac{\partial k_{\phi}(a,b)}{\partial b}|_{a=t_{i},b=t_{j}}, (39)
Cϕ(X˙ti,Xtj)=kϕ(a,b)a|a=ti,b=tj,\displaystyle C_{\phi}(\dot{X}_{t_{i}},X_{t_{j}})=\frac{\partial k_{\phi}(a,b)}{\partial a}|_{a=t_{i},b=t_{j}}, (40)
Cϕ(X˙ti,X˙tj)=2kϕ(a,b)ab|a=ti,b=tj.\displaystyle C_{\phi}(\dot{X}_{t_{i}},\dot{X}_{t_{j}})=\frac{\partial^{2}k_{\phi}(a,b)}{\partial a\partial b}|_{a=t_{i},b=t_{j}}. (41)
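These identities can be checked numerically. The short sketch below compares the analytic derivative \partial k_{\phi}(a,b)/\partial b of a squared exponential kernel with a finite difference; the length scale and step size are arbitrary.

import numpy as np

l, h = 1.3, 1e-6
k = lambda a, b: np.exp(-(a - b)**2 / (2 * l**2))        # squared exponential kernel
dk_db = lambda a, b: k(a, b) * (a - b) / l**2            # analytic dk/db
a, b = 0.4, 1.1
fd = (k(a, b + h) - k(a, b)) / h                         # finite difference in b
print(abs(dk_db(a, b) - fd) < 1e-5)                      # True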
Lemma A.7.

(Matrix Inversion Lemma) Let \boldsymbol{\Sigma} be a p\times p matrix (p=n+m):

𝚺=[𝚺11𝚺12𝚺21𝚺22]\boldsymbol{\Sigma}=\left[\begin{array}[]{cc}\boldsymbol{\Sigma}_{11}&\boldsymbol{\Sigma}_{12}\\ \boldsymbol{\Sigma}_{21}&\boldsymbol{\Sigma}_{22}\end{array}\right] (42)

where the sub-matrices have dimensions n\times n, n\times m, etc. Suppose \boldsymbol{\Sigma},\boldsymbol{\Sigma}_{11},\boldsymbol{\Sigma}_{22} are non-singular, and partition the inverse in the same way as \boldsymbol{\Sigma},

𝚲=𝚺1=[𝚲11𝚲12𝚲21𝚲22].\boldsymbol{\Lambda}=\boldsymbol{\Sigma}^{-1}=\left[\begin{array}[]{cc}\boldsymbol{\Lambda}_{11}&\boldsymbol{\Lambda}_{12}\\ \boldsymbol{\Lambda}_{21}&\boldsymbol{\Lambda}_{22}\end{array}\right]. (43)

Then

{𝚲11=(𝚺11𝚺12𝚺221𝚺21)1𝚲12=(𝚺11𝚺12𝚺221𝚺21)1𝚺12𝚺221𝚲21=(𝚺22𝚺21𝚺111𝚺12)1𝚺21𝚺111𝚲22=(𝚺22𝚺21𝚺111𝚺12)1.\displaystyle\left\{\begin{array}[]{l}\boldsymbol{\Lambda}_{11}=(\boldsymbol{\Sigma}_{11}-\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21})^{-1}\\ \boldsymbol{\Lambda}_{12}=-(\boldsymbol{\Sigma}_{11}-\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21})^{-1}\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\\ \boldsymbol{\Lambda}_{21}=-(\boldsymbol{\Sigma}_{22}-\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12})^{-1}\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\\ \boldsymbol{\Lambda}_{22}=(\boldsymbol{\Sigma}_{22}-\boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12})^{-1}.\end{array}\right. (48)
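The lemma is easy to verify numerically, for example for the (1,1) block of a random symmetric positive definite matrix; the block sizes below are arbitrary.

import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 2
M = rng.standard_normal((n + m, n + m))
Sigma = M @ M.T + (n + m) * np.eye(n + m)          # well conditioned SPD matrix
S11, S12 = Sigma[:n, :n], Sigma[:n, n:]
S21, S22 = Sigma[n:, :n], Sigma[n:, n:]
Lam = np.linalg.inv(Sigma)
Lam11 = np.linalg.inv(S11 - S12 @ np.linalg.inv(S22) @ S21)
print(np.allclose(Lam[:n, :n], Lam11))             # True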
Lemma A.8.

(Conditional Gaussian distributions) Let 𝐗D\boldsymbol{X}\in\mathbb{R}^{D}, 𝐘M\boldsymbol{Y}\in\mathbb{R}^{M}, be jointly Gaussian random vectors with distribution

[𝑿𝒀]𝒩(𝝁,𝚺)\left[\begin{array}[]{c}\boldsymbol{X}\\ \boldsymbol{Y}\end{array}\right]\sim\mathcal{N}\left(\boldsymbol{\mu},\boldsymbol{\Sigma}\right) (49)

where

𝝁=[𝝁X𝝁Y],𝚺=[𝚺XX𝚺XY𝚺YX𝚺YY].\boldsymbol{\mu}=\left[\begin{array}[]{c}\boldsymbol{\mu}_{X}\\ \boldsymbol{\mu}_{Y}\end{array}\right],\boldsymbol{\Sigma}=\left[\begin{array}[]{cc}\boldsymbol{\Sigma}_{XX}&\boldsymbol{\Sigma}_{XY}\\ \boldsymbol{\Sigma}_{YX}&\boldsymbol{\Sigma}_{YY}\end{array}\right]. (50)

Then the conditional Gaussian density function is

\rho_{Y|X}(\boldsymbol{y}|\boldsymbol{x})=\frac{\rho_{XY}(\boldsymbol{x},\boldsymbol{y})}{\rho_{X}(\boldsymbol{x})}=\frac{1}{(2\pi)^{\frac{M}{2}}\det(\boldsymbol{\Sigma}_{Y|X})^{1/2}}\exp\left(-\frac{1}{2}(\boldsymbol{y}-\boldsymbol{\mu}_{Y|X})^{T}\boldsymbol{\Sigma}_{Y|X}^{-1}(\boldsymbol{y}-\boldsymbol{\mu}_{Y|X})\right) (51)

where

𝝁Y|X=𝝁Y+𝚺YX𝚺XX1(𝒙𝝁X),\displaystyle\boldsymbol{\mu}_{Y|X}=\boldsymbol{\mu}_{Y}+\boldsymbol{\Sigma}_{YX}\boldsymbol{\Sigma}_{XX}^{-1}(\boldsymbol{x}-\boldsymbol{\mu}_{X}), (52)
𝚺Y|X=𝚺YY𝚺YX𝚺XX1𝚺XY.\displaystyle\boldsymbol{\Sigma}_{Y|X}=\boldsymbol{\Sigma}_{YY}-\boldsymbol{\Sigma}_{YX}\boldsymbol{\Sigma}_{XX}^{-1}\boldsymbol{\Sigma}_{XY}. (53)
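A small sketch of Eqs. (52)-(53), assuming the first d entries of the joint vector are \boldsymbol{X} and the remaining ones are \boldsymbol{Y}:

import numpy as np

def conditional_gaussian(mu, Sigma, x, d):
    # Conditional moments of Y given X = x from Lemma A.8.
    mu_X, mu_Y = mu[:d], mu[d:]
    S_XX, S_XY = Sigma[:d, :d], Sigma[:d, d:]
    S_YX, S_YY = Sigma[d:, :d], Sigma[d:, d:]
    K = S_YX @ np.linalg.inv(S_XX)
    mu_cond = mu_Y + K @ (x - mu_X)                # Eq. (52)
    Sigma_cond = S_YY - K @ S_XY                   # Eq. (53)
    return mu_cond, Sigma_cond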

According to the above lemma, we have the following conditional distribution.

Lemma A.9.
ρ(𝒙˙|𝒙)𝒩(𝑫(𝒙𝝁X),𝑨)\rho(\dot{\boldsymbol{x}}|\boldsymbol{x})\sim\mathcal{N}(\boldsymbol{D}(\boldsymbol{x}-\boldsymbol{\mu}_{X}),\boldsymbol{A}) (54)

where

𝑫=𝑪ϕ(𝑿˙,𝑿)𝑪ϕ(𝑿,𝑿)1\displaystyle\boldsymbol{D}=\boldsymbol{C}_{\phi}(\dot{\boldsymbol{X}},\boldsymbol{X})\boldsymbol{C}_{\phi}(\boldsymbol{X},\boldsymbol{X})^{-1} (55)
𝑨=𝑪ϕ(𝑿˙,𝑿˙)𝑪ϕ(𝑿˙,𝑿)𝑪ϕ(𝑿,𝑿)1𝑪ϕ(𝑿,𝑿˙)\displaystyle\boldsymbol{A}=\boldsymbol{C}_{\phi}(\dot{\boldsymbol{X}},\dot{\boldsymbol{X}})-\boldsymbol{C}_{\phi}(\dot{\boldsymbol{X}},\boldsymbol{X})\boldsymbol{C}_{\phi}(\boldsymbol{X},\boldsymbol{X})^{-1}\boldsymbol{C}_{\phi}(\boldsymbol{X},\dot{\boldsymbol{X}}) (56)

Appendix B Proof of Theorem 2.1

Proof.

The joint density over all variables in Fig.1 can be represented as

ρ(𝒙M,𝒙˙M,𝒚,𝒙L,FM,F¯M,𝜽|ϕ,σ,γ)\displaystyle\rho(\boldsymbol{x}_{M},\dot{\boldsymbol{x}}_{M},\boldsymbol{y},\boldsymbol{x}_{L},F_{M},\bar{F}_{M},\boldsymbol{\theta}|\phi,\sigma,\gamma)
=\displaystyle= ρGP(𝒙M,𝒙˙M,𝒚|ϕ,σ)ρODE(FM,F¯M,θ,𝒙L|𝒙M,𝒙˙M,γ)\displaystyle\rho_{GP}(\boldsymbol{x}_{M},\dot{\boldsymbol{x}}_{M},\boldsymbol{y}|\phi,\sigma)\rho_{ODE}(F_{M},\bar{F}_{M},\theta,\boldsymbol{x}_{L}|\boldsymbol{x}_{M},\dot{\boldsymbol{x}}_{M},\gamma) (57)
ρGP(𝒙M,𝒙˙M,𝒚|ϕ,σ)=ρ(𝒙M,ϕ)ρ(𝒚|𝒙M,σ)ρ(𝒙˙M|𝒙M,ϕ)\rho_{GP}(\boldsymbol{x}_{M},\dot{\boldsymbol{x}}_{M},\boldsymbol{y}|\phi,\sigma)=\rho(\boldsymbol{x}_{M},\phi)\rho(\boldsymbol{y}|\boldsymbol{x}_{M},\sigma)\rho(\dot{\boldsymbol{x}}_{M}|\boldsymbol{x}_{M},\phi) (58)
ρODE(FM,F¯M,θ,𝒙L|𝒙M,𝒙˙M,γ)\displaystyle\rho_{ODE}(F_{M},\bar{F}_{M},\theta,\boldsymbol{x}_{L}|\boldsymbol{x}_{M},\dot{\boldsymbol{x}}_{M},\gamma)
=\displaystyle= ρ(𝜽)ρ(FM,F¯M,𝒙L|𝜽,𝒙M,𝒙˙M,γ)\displaystyle\rho(\boldsymbol{\theta})\rho(F_{M},\bar{F}_{M},\boldsymbol{x}_{L}|\boldsymbol{\theta},\boldsymbol{x}_{M},\dot{\boldsymbol{x}}_{M},\gamma)
=\displaystyle= ρ(𝜽)ρ(𝒙L|𝜽,𝒙M)ρ(FM,F¯M|𝒙L,𝜽,𝒙M,𝒙˙M,γ)\displaystyle\rho(\boldsymbol{\theta})\rho(\boldsymbol{x}_{L}|\boldsymbol{\theta},\boldsymbol{x}_{M})\rho(F_{M},\bar{F}_{M}|\boldsymbol{x}_{L},\boldsymbol{\theta},\boldsymbol{x}_{M},\dot{\boldsymbol{x}}_{M},\gamma)
=\displaystyle= ρ(𝜽)δ(𝒙~L(𝜽,𝒙M)𝒙L)ρ(FM,F¯M|𝒙L,𝜽,𝒙M,𝒙˙M,γ)\displaystyle\rho(\boldsymbol{\theta})\delta(\tilde{\boldsymbol{x}}_{L}(\boldsymbol{\theta},\boldsymbol{x}_{M})-\boldsymbol{x}_{L})\rho(F_{M},\bar{F}_{M}|\boldsymbol{x}_{L},\boldsymbol{\theta},\boldsymbol{x}_{M},\dot{\boldsymbol{x}}_{M},\gamma)
=\displaystyle= ρ(𝜽)ρ(FM|𝜽,𝒙M,𝒙~L(𝒙M,𝜽))ρ(F¯M|𝒙˙M,γ)δ(FMF¯M)\displaystyle\rho(\boldsymbol{\theta})\rho(F_{M}|\boldsymbol{\theta},\boldsymbol{x}_{M},\tilde{\boldsymbol{x}}_{L}(\boldsymbol{x}_{M},\boldsymbol{\theta}))\rho(\bar{F}_{M}|\dot{\boldsymbol{x}}_{M},\gamma)\delta(F_{M}-\bar{F}_{M})
=\displaystyle= ρ(𝜽)δ(𝒇M(𝜽,𝒙M,𝒙~L(𝒙M,𝜽))FM)𝒩(FM|𝒙˙M,γ𝑰)\displaystyle\rho(\boldsymbol{\theta})\delta(\boldsymbol{f}_{M}(\boldsymbol{\theta},\boldsymbol{x}_{M},\tilde{\boldsymbol{x}}_{L}(\boldsymbol{x}_{M},\boldsymbol{\theta}))-F_{M})\mathcal{N}(F_{M}|\dot{\boldsymbol{x}}_{M},\gamma\boldsymbol{I})
=\displaystyle= ρ(𝜽)𝒩(𝒇M(𝜽,𝒙M,𝒙~L(𝒙M,𝜽))|𝒙˙M,γ𝑰),\displaystyle\rho(\boldsymbol{\theta})\mathcal{N}(\boldsymbol{f}_{M}(\boldsymbol{\theta},\boldsymbol{x}_{M},\tilde{\boldsymbol{x}}_{L}(\boldsymbol{x}_{M},\boldsymbol{\theta}))|\dot{\boldsymbol{x}}_{M},\gamma\boldsymbol{I}), (59)

which shows that \rho_{ODE} ultimately depends only on \boldsymbol{\theta}, \boldsymbol{x}_{M} and \dot{\boldsymbol{x}}_{M}, and not explicitly on F_{M}, \bar{F}_{M} or \boldsymbol{x}_{L}; here \tilde{\boldsymbol{x}}_{L} is determined by \boldsymbol{x}_{M} and \boldsymbol{\theta} through integration. Using Lemma A.9, we have

ρ(𝒙M,𝜽,𝒚|ϕ,σ,γ)=\displaystyle\rho(\boldsymbol{x}_{M},\boldsymbol{\theta},\boldsymbol{y}|\phi,\sigma,\gamma)=
ρ(𝜽)𝒩(𝒙M|𝟎,𝑪ϕ)𝒩(𝒚|𝒙M,σ2𝑰)𝒩(𝒙˙M|𝑫𝒙M,𝑨)𝒩(𝒇M(𝜽,𝒙M,𝒙~L(𝒙M,𝜽))|𝒙˙M,γ𝑰).\displaystyle\rho(\boldsymbol{\theta})\mathcal{N}(\boldsymbol{x}_{M}|\boldsymbol{0},\boldsymbol{C}_{\phi})\mathcal{N}(\boldsymbol{y}|\boldsymbol{x}_{M},\sigma^{2}\boldsymbol{I})\mathcal{N}(\dot{\boldsymbol{x}}_{M}|\boldsymbol{D}\boldsymbol{x}_{M},\boldsymbol{A})\mathcal{N}(\boldsymbol{f}_{M}(\boldsymbol{\theta},\boldsymbol{x}_{M},\tilde{\boldsymbol{x}}_{L}(\boldsymbol{x}_{M},\boldsymbol{\theta}))|\dot{\boldsymbol{x}}_{M},\gamma\boldsymbol{I}). (60)

Integrating 𝒙˙M\dot{\boldsymbol{x}}_{M} out yields

ρ(𝒙M,𝜽,𝒚|ϕ,σ,γ)=\displaystyle\rho(\boldsymbol{x}_{M},\boldsymbol{\theta},\boldsymbol{y}|\phi,\sigma,\gamma)=
ρ(𝜽)𝒩(𝒙M|𝟎,𝑪ϕ)𝒩(𝒚|𝒙M,σ2𝑰)𝒩(𝒇M(𝜽,𝒙M,𝒙~L(𝒙M,𝜽))|𝑫𝒙M,𝑨+γ𝑰).\displaystyle\rho(\boldsymbol{\theta})\mathcal{N}(\boldsymbol{x}_{M}|\boldsymbol{0},\boldsymbol{C}_{\phi})\mathcal{N}(\boldsymbol{y}|\boldsymbol{x}_{M},\sigma^{2}\boldsymbol{I})\mathcal{N}(\boldsymbol{f}_{M}(\boldsymbol{\theta},\boldsymbol{x}_{M},\tilde{\boldsymbol{x}}_{L}(\boldsymbol{x}_{M},\boldsymbol{\theta}))|\boldsymbol{D}\boldsymbol{x}_{M},\boldsymbol{A}+\gamma\boldsymbol{I}). (61)

Finally, we get

ρ(𝒙M,𝜽|𝒚,ϕ,σ,γ)\displaystyle\rho(\boldsymbol{x}_{M},\boldsymbol{\theta}|\boldsymbol{y},\phi,\sigma,\gamma)\propto
ρ(𝜽)𝒩(𝒙M|𝟎,𝑪ϕ)𝒩(𝒚|𝒙M,σ2𝑰)𝒩(𝒇M(𝜽,𝒙M,𝒙~L(𝒙M,𝜽))|𝑫𝒙M,𝑨+γ𝑰).\displaystyle\rho(\boldsymbol{\theta})\mathcal{N}(\boldsymbol{x}_{M}|\boldsymbol{0},\boldsymbol{C}_{\phi})\mathcal{N}(\boldsymbol{y}|\boldsymbol{x}_{M},\sigma^{2}\boldsymbol{I})\mathcal{N}(\boldsymbol{f}_{M}(\boldsymbol{\theta},\boldsymbol{x}_{M},\tilde{\boldsymbol{x}}_{L}(\boldsymbol{x}_{M},\boldsymbol{\theta}))|\boldsymbol{D}\boldsymbol{x}_{M},\boldsymbol{A}+\gamma\boldsymbol{I}). (62)