
Learning governing physics from output only measurements

Tapas Tripura
Department of Applied Mechanics
Indian Institute of Technology Delhi
[email protected]
Souvik Chakraborty
Department of Applied Mechanics
School of Artificial Intelligence (ScAI)
Indian Institute of Technology Delhi
[email protected]
Abstract

Extracting governing physics from data is a key challenge in many areas of science and technology. Existing techniques for equation discovery depend on both input and state measurements; in practice, however, we often have access only to the output measurements. We here propose a novel framework for learning the governing physics of a dynamical system from output-only measurements; this essentially transfers the physics discovery problem from the deterministic to the stochastic domain. The proposed approach models the input as a stochastic process and blends concepts from stochastic calculus, sparse learning algorithms, and Bayesian statistics. In particular, we combine the sparsity-promoting spike and slab prior, Bayes' law, and the Euler-Maruyama scheme to identify the governing physics from data. The resulting model is highly efficient and works with sparse, noisy, and incomplete output measurements. The efficacy and robustness of the proposed approach are illustrated on several numerical examples involving both complete and partial state measurements. The results obtained indicate the potential of the proposed approach for identifying governing physics from output-only measurements.

Keywords: Sparse Bayesian learning · explainable artificial intelligence · stochastic differential equation · equation discovery · probabilistic machine learning

1 Introduction

Physical systems are governed by physical laws, often represented either in the form of ordinary differential equations (ODEs) or partial differential equations (PDEs). The literature on techniques for the solution of ODEs/PDEs is quite mature, and given sufficient computational resources, it is possible to solve the governing ODE/PDE with sufficient accuracy. Modern techniques such as physics-informed deep learning [1, 2, 3, 4] and neural network based operator learning [5, 6] can also be used. However, due to various assumptions and approximations, the governing equations often fail to represent the exact physics of the system, resulting in modelling and prediction errors. With advancements in sensor technology and the internet of things (IoT), we have access to data in abundance, whereas the underlying governing physics often remains elusive, specifically in domains such as climate science [7], biology [8], physics [9], chemistry [10], and finance [11], to name a few. One obvious alternative is to train a conventional data-driven machine learning model to obtain an input-output mapping; however, such models often do not generalize to unseen environments and out-of-distribution inputs. To address this issue, methods for data-driven equation discovery have recently been proposed. A seminal work in this area was carried out by [12], wherein symbolic regression was utilized for discovering the underlying structure of the data. Another important work in this area is that of Brunton et al. [13]. However, most of the work in this area is limited by the fact that information on both the input and the state vector is needed. In practice, the input is often not measurable and only a few states are accurately observable; this greatly limits the applicability of the available physics discovery techniques. To address this issue, we propose a novel approach that leverages sparse learning, Bayesian statistics, and stochastic calculus and enables discovery of governing physics from output-only measurements.

Equation discovery techniques have made significant progress, from the early 1970s, when equation discovery was mostly dependent on expertise in the area of interest, to modern machine learning techniques that can extract the physical law of the underlying process with only little human supervision. In the early 1970s, models were selected by maintaining a balance between model complexity and fitness, with the fitness of candidate models measured using the Akaike (AIC) and Bayesian (BIC) information criteria [14, 15]. Later, advances in machine learning tools in combination with new sophisticated measurement instruments gave rise to data-driven techniques for the discovery of governing physics. A first attempt in this direction combined symbolic regression with genetic programming to select, from candidate functional forms, the combination of functions that most accurately represents the data [12, 16]. There are also equation-free methods that bypass the need to formulate constitutive equations in closed form to track the time evolution of dynamical systems. These methods provide a practical recipe for multiscale simulations through local microscopic simulations in time and space [17].

Following the work on symbolic regression [12, 16], Sparse Identification of Nonlinear Dynamics (SINDy) was proposed for discovering the governing physics of non-linear dynamical systems [13]. In terms of accuracy and interpretability of the discovered physics, the SINDy framework overcame the shortcomings of previous discovery techniques. The idea behind SINDy is to use sparse linear regression, in the purview of least-squares fitting, to select the most dominant candidate functions that best represent the data. Owing to its sparsity-promoting approach, SINDy proved to be computationally efficient and scalable with increasing input dimension. In later years, SINDy has seen tremendous application and area-specific development of sparse linear regression in various fields. Examples include sparse identification of biological networks [18] and of chemical reactions [19, 20], sparse modelling for state estimation in fluid mechanics [21, 22], system identification of structures with hysteresis and sparse learning of the aerodynamics of bridges [23, 24], sparse model selection using an integral formulation [25], sparse model selection of dynamical systems using information criteria [26], sparse identification for predictive control [27], identification of structured dynamical systems with limited data [28], model identification via recovery of differential equations from short impulse-response time-series data [29], identification of stochastic dynamic equations [30], and discovery of partial differential equations [31, 32], among others.

Apart from SINDy, identification of non-linear dynamical systems using deep neural networks has also been proposed [33, 2, 34]. The deep learning approaches for non-linear dynamical systems exist as black boxes that take the data as input and provide predictions as output. Later, a different approach within the Bayesian framework was proposed for the discovery of governing physics from noisy and/or limited data [35, 36, 37]. In this approach, least-squares based sparse linear regression is replaced by the more accurate sparse Bayesian linear regression. Owing to the ability of Bayesian inference to perform simultaneous model selection and parameter estimation, this approach eliminates the possibility of overfitting the governing model. Sparsity in the solution is promoted by assigning appropriate sparsity-promoting priors over the weight vector of the sparse regression problem; this provides a natural elimination of the basis functions that do not represent the data accurately. As output, this framework provides the posterior distributions of the weight vector; this means one can accurately quantify the uncertainty in the parameters and construct confidence intervals for future predictions.

Despite the recent progress in equation discovery from data, almost all approaches require measurements of both the state and input variables; the only exception is perhaps stochastic SINDy [30]. Unfortunately, we often have access only to the state variables; this significantly limits the applicability of the available approaches. We hereby propose an equation discovery framework, rooted in stochastic calculus, sparse learning, and Bayesian statistics, that requires only the state measurements. The basic premise is to treat the unknown input as a stochastic process, model it as Gaussian white noise, and identify the resulting stochastic differential equation (SDE); this essentially amounts to identifying the drift and diffusion components of the SDE. We utilize stochastic calculus to decouple the drift and diffusion parts, which allows parallel identification of the two components. Sparse learning and Bayesian statistics, on the other hand, ensure that the identified governing equation is sparse and interpretable; this is achieved by using the spike and slab prior [38, 39]. The key features of the proposed approach can be encapsulated in the following points:

  • First and foremost, unlike other approaches, the proposed approach requires only the state measurements; no input measurements are needed.

  • The proposed approach, being Bayesian in nature, estimates the posterior distribution. This allows computation of the epistemic uncertainty (aka predictive uncertainty) due to limited data. This feature is particularly important in design and decision making.

  • Unlike non-Bayesian approaches, the proposed approach is autonomous in the sense that it does not require any human calibration or cross-validation.

  • The proposed framework falls under the broad umbrella of interpretable machine learning and explainable AI. In other words, the identified model generalizes to unseen environments and out-of-distribution data.

The remainder of the paper is structured as follows: in Section 2, the proposed data-driven framework for learning governing physics without input measurements is presented. In Section 3, numerical experiments using fully and partially observed output measurements are undertaken to demonstrate the robustness of the proposed framework. In Section 4, the key contributions of the proposed framework are summarised and concluding remarks are drawn.

Figure 1: Schematic illustration of the proposed Bayesian-SDE framework for the discovery of stochastic dynamical systems. The proposed framework integrates two key ideas for identification of stochastic systems in terms of SDEs. The first key step is the construction of the target vector for the sparse linear regression. For this purpose the framework utilises the Kramers-Moyal expansion to estimate the drift $f({\bm{X}}_{t},t)$ and diffusion $g({\bm{X}}_{t},t)$ terms from the linear and quadratic variations of the measured sample paths. The linear variation $\lim_{\Delta t\to 0}\left\langle X(t+\Delta t)-z\right\rangle;\ X(t)=z$ denotes the sum of the increments of the sample paths, and the quadratic variation $\lim_{\Delta t\to 0}\left\langle\left|X(t+\Delta t)-z\right|^{2}\right\rangle;\ X(t)=z$ denotes the sum of the squared increments of the sample paths, in the limiting sense. The second step is the formulation of the sparse Bayesian linear regression for the above target vector. Given an ensemble of time histories, the framework obtains a sequence of libraries by evaluating the candidate basis functions $\{\upsilon_{k}({\bm{X}}_{t});\ k=1,2,\ldots,K\}$ over the ensembles. The ensemble mean is taken over all the libraries, and the averaged library ${\bf{L}}\in\mathbb{R}^{N\times K}$ is used for the sparse linear regression. The candidate functions in the library are parameterized by a weight vector ${\bm{\theta}}$. To identify each of the drift terms of an $m$-dimensional diffusion process, this regression is performed $m$ times. The diffusion terms are not directly recoverable, but their discovery is possible in terms of their covariances; this requires $m(m+1)/2$ regressions to discover all the terms of the covariance matrix. Each element of the weight vector ${\bm{\theta}}$ is assigned a latent variable $Z_{k};\ k=1,\ldots,K$ in order to classify the weights into spikes and slabs. After the Bayesian regression, the marginal posterior inclusion probabilities (PIP) $p(Z_{k}=1|{\bm{Y}})$ are estimated. The final model includes only those basis functions whose marginal PIP values exceed a threshold.

2 Discovery of governing physics without input measurement

The overarching framework of the proposed Bayesian physics discovery approach is presented in Fig. 1. The framework treats the state measurements from the sensors as a stochastic process and applies the Kramers-Moyal formula to obtain the target vectors for sparse linear regression. In the sparse regression, the library is obtained by taking the ensemble mean over all the libraries constructed from the collected data set. The framework then constructs two sparse regression problems from these libraries and target vectors for identification of the governing physics; the two problems are independent and can be solved simultaneously. The discovery of governing physics here involves identification of the system dynamics and of the volatility arising from environmental uncertainty; these are represented in this framework as the drift and diffusion components of an SDE.

Before formulating the mathematical backbone of the proposed framework, let us formalize the problem through the following terminology. Let ${\bf{L}}\in\mathbb{R}^{N\times K}$ be the library of candidate functions, where $K$ is the total number of independent basis functions in the library and $N$ is the sample length. The key idea of sparse Bayesian linear regression is to select $k$ basis functions from the $K$ candidate library functions such that $k\ll K$. If ${\bm{Y}}\in\mathbb{R}^{N}$ is the target vector, then a linear combination of the identified basis functions should best represent ${\bm{Y}}$; the coefficients of this linear combination are denoted by a weight vector ${\bm{\theta}}$. For Bayesian model discovery, the condition $k\ll K$ is maintained by assigning prior distributions to the weight vector ${\bm{\theta}}$ that favour sparsity in the solution. In this context, the spike and slab (SS) prior exhibits a strong shrinkage property due to its sharp spike at zero and a diffuse density spanning a large range of possible parameter values. The Dirac-delta spike concentrates most of the probability mass at zero, thus allowing most of the samples to take the value zero, while the diffuse tail distributes a small amount of probability mass over a large range of values, allowing only a few samples with very high probability to escape the shrinkage. The alternative flavours of the SS-prior, constructed from various combinations of candidate spike and slab functions, are well documented in the literature [38, 39, 40]. In this work, we consider the discontinuous spike and slab prior (DSS-prior), where the spike at zero is modeled as a Dirac-delta function and the tail distribution as an independent Student's-t distribution; a small sampling sketch is given below.
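To make the DSS-prior concrete, the following minimal sketch (in Python/NumPy; the hyperparameter names and values are illustrative, not taken from the paper) draws a weight vector from a Dirac-delta spike at zero mixed with a Student's-t slab:

```python
import numpy as np

def sample_dss_prior(K, p0=0.2, df=3.0, scale=1.0, seed=None):
    """One draw from a discontinuous spike-and-slab prior: with probability
    p0 a weight comes from a Student's-t slab, otherwise it sits exactly
    at zero (the Dirac-delta spike). Hyperparameters are illustrative."""
    rng = np.random.default_rng(seed)
    z = rng.random(K) < p0                    # latent indicators Z_k
    theta = np.zeros(K)                       # spike: exact zeros
    theta[z] = scale * rng.standard_t(df, size=int(z.sum()))  # slab draws
    return theta, z.astype(int)

theta, z = sample_dss_prior(K=10, seed=0)
print(z, theta)   # most entries are exactly zero -> sparse draw
```

Most components land exactly at zero, which is precisely the shrinkage mechanism exploited later by the Gibbs sampler. Next, a sparse regression framework is formulated for SDEs.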

2.1 Sparse learning of stochastic differential equations

Dynamical processes are generally expressed in terms of higher-order differential equations. However, in most data-driven machine learning techniques, the actual system is not observed in its original space but identified in a projected space containing all the states of the system that are directly observable. For instance, second-order dynamical systems are often expressed in terms of their displacement and velocity components. Let the projection be realized by a map $\mathcal{T}:\mathbb{R}^{d}\to\mathbb{R}^{m}$ with $m<d$. If such a mapping $\mathcal{T}$ exists, then any higher-order system can be reduced to a set of SDEs of the form:

\dot{\bm{X}}={\bm{f}}({\bm{X}}_{t},t)+{\bm{g}}({\bm{X}}_{t},t){\bm{\zeta}}(t), (1)

where ${\bm{f}}({\bm{X}}_{t},t)$ is the deterministic dynamics of the underlying process, ${\bm{g}}({\bm{X}}_{t},t)$ is the volatility associated with the dynamics arising from the stochastic external input, and ${\bm{\zeta}}(t)$ is white noise. A more appropriate representation of the above equation can be derived through first-order Itô SDEs, which arise naturally in non-linear dynamical systems subjected to stochastic excitation such as earthquake, wind, and wave forces [41, 42]. For non-linear stochastic dynamical systems, SDEs provide a splendid framework for expressing the behavior of the underlying dynamics. An SDE is generally defined in terms of its deterministic drift and its stochastic diffusion components, where both components are allowed to depend on time and the system states. Let $(\Omega,\mathcal{F},P)$ be a probability space and $\{\mathcal{F}_{t},0\leq t\leq T\}$ be the natural filtration constructed from sub-$\sigma$-algebras of $\mathcal{F}$. Under the probability space $(\Omega,\mathcal{F},P)$, an $m$-dimensional $n$-factor SDE driven by the $n$-dimensional Brownian motion $\{{\bm{B}}_{j}(t),\ j=1,\ldots,n\}$ is defined as,

d{\bm{X}}_{t}={\bm{f}}\left({\bm{X}}_{t},t\right)dt+\sum_{j=1}^{n}{\bm{g}}_{j}\left({\bm{X}}_{t},t\right)d{\bf{B}}_{j}(t);\quad{\bm{X}}(t=t_{0})={\bm{X}}_{0};\quad t\in[0,T], (2)

where ${\bm{X}}_{t}\in\mathbb{R}^{m}$ denotes the $\mathcal{F}_{t}$-measurable state vector, ${\bm{f}}({\bm{X}}_{t},t):\mathbb{R}^{m}\mapsto\mathbb{R}^{m}$ is the drift vector, ${\bm{g}}({\bm{X}}_{t},t):\mathbb{R}^{m}\mapsto\mathbb{R}^{m\times n}$ is the diffusion matrix, and ${\bm{B}}_{j}(t)\in\mathbb{R}^{n}$ is the Brownian motion. The white noise is the generalized derivative of Brownian motion, i.e. ${\bm{\zeta}}(t)=\dot{\bm{B}}(t)$; this ensures that Eq. (1) has a corresponding SDE of the form of Eq. (2). In compact matrix notation, the SDE can be expressed as,

d{\bm{X}}_{t}={\bm{f}}\left({\bm{X}}_{t},t\right)dt+{\bm{g}}\left({\bm{X}}_{t},t\right)d{\bm{B}}(t);\quad{\bm{X}}(t=t_{0})={\bm{X}}_{0};\quad t\in[0,T]. (3)

The solution of Eq. (3) can be obtained using various stochastic integration schemes [43, 44, 45, 46]. Discovery of the governing physics in terms of an SDE requires identification of the drift and diffusion components from output measurements. One challenge arising in the identification of SDEs from output data is the non-differentiability of Brownian motion: ${\bm{B}}(t)$ is not a well-defined mathematical function in the classical sense, since it is nowhere differentiable with respect to the process ${\bm{X}}(t)$; its continuity holds only in the mean-square sense. Formally, if the interval $s\in[0,t]$ is partitioned into $n$ parts as $\mathcal{P}_{n}(0,t):=[0=s_{0}<s_{1}<\ldots<s_{n}=t]$, then for a random process $w$ the quadratic variation $Q_{n}(w,t)=\sum_{i}^{n}\left|w(s_{i})-w(s_{i-1})\right|^{2}$ converges to $Q(w,t)$. From the properties of Brownian motion it can be shown that $Q(B,t)=t$; that is, $B(t)$ has unbounded linear variation but non-vanishing quadratic variation, whereas deterministic functions have finite linear variation but zero quadratic variation. Under these properties, the Kramers-Moyal formula suggests that the drift and diffusion components of an SDE can be extracted from the sample time history in terms of their linear and quadratic variations, respectively. Since the continuity of Brownian motion is defined only in the mean-square sense, the diffusion components do not have finite variation but are bounded by their quadratic variation; thus the diffusion components, unlike the drift, are recoverable only through their covariation terms. The linear and quadratic variations denote the first and second moments of the sample increments. Leveraging these facts, the drift and diffusion components of the SDE in Eq. (3) can be expressed in terms of the sample time history as follows (a simplified detailed derivation is presented in A):

{\bm{f}}_{i}\left({\bm{X}}_{t},t\right)=\lim_{\Delta t\to 0}\dfrac{1}{\Delta t}E\left[X_{i}(t+\Delta t)-\xi_{i}\right]\Big|_{x_{k}(t)=\xi_{k}}\quad\forall\;k=1,2,\ldots,N
{\bf{\Gamma}}_{ij}\left({\bm{X}}_{t},t\right)=\dfrac{1}{2}\lim_{\Delta t\to 0}\dfrac{1}{\Delta t}E\left[\left|X_{i}(t+\Delta t)-\xi_{i}\right|\left|X_{j}(t+\Delta t)-\xi_{j}\right|\right]\Big|_{x_{k}(t)=\xi_{k}}\quad\forall\;k=1,2,\ldots,N, (4)

where ${\bm{f}}_{i}({\bm{X}}_{t},t)$ is the $i^{th}$ drift component and ${\bf{\Gamma}}_{ij}$ is the $(ij)^{th}$ component of the diffusion covariance matrix ${\bf{\Gamma}}\in\mathbb{R}^{n\times n}:={\bm{g}}(t,{\bm{X}}_{t}){\bm{g}}(t,{\bm{X}}_{t})^{T}$. We assume that the analytical forms of the drift and diffusion components can be obtained from the measured sample paths in terms of some basis functions. Let $\upsilon_{k}({\bm{X}}_{t})$ represent the various linear and non-linear mathematical functions evaluated on the system states, and let $N$ be the length of the sample path. Also let ${\bf{L}}^{f}\in\mathbb{R}^{N\times K}:=[\upsilon_{1}^{f}({\bm{X}}_{t}),\ldots,\upsilon_{K}^{f}({\bm{X}}_{t})]$ and ${\bf{L}}^{g}\in\mathbb{R}^{N\times K}:=[\upsilon_{1}^{g}({\bm{X}}_{t}),\ldots,\upsilon_{K}^{g}({\bm{X}}_{t})]$ be the libraries of candidate functions constructed from the set $\{\upsilon_{k}({\bm{X}}_{t}),\ k=1,\ldots,K\}$; the candidate functions may take several functional forms such as polynomial, trigonometric, etc. Then the $i^{th}$ drift component and the $(ij)^{th}$ term of the diffusion covariance matrix can be expressed as linear combinations of the library functions as $f_{i}({\bm{X}}_{t},t)=\sum_{k=1}^{K}\upsilon_{k}({\bm{X}}_{t})\theta_{i,k}$, $i=1,\ldots,m$, and ${\bf{\Gamma}}_{ij}=\sum_{k=1}^{K}\upsilon_{k}({\bm{X}}_{t})\theta_{ij,k}$, $i,j=1,\ldots,m$; briefly,

{\bm{f}}_{i}({\bm{X}}_{t},t)=\theta_{i_{1}}^{f}\upsilon_{1}^{f}({\bm{X}}_{t})+\ldots+\theta_{i_{k}}^{f}\upsilon_{k}^{f}({\bm{X}}_{t})+\ldots+\theta_{i_{K}}^{f}\upsilon_{K}^{f}({\bm{X}}_{t})
{\bf{\Gamma}}_{ij}\left({\bm{X}}_{t},t\right)=\theta_{ij_{1}}^{g}\upsilon_{1}^{g}({\bm{X}}_{t})+\ldots+\theta_{ij_{k}}^{g}\upsilon_{k}^{g}({\bm{X}}_{t})+\ldots+\theta_{ij_{K}}^{g}\upsilon_{K}^{g}({\bm{X}}_{t}), (5)

where $\theta_{i_{k}}^{f}$ and $\theta_{ij_{k}}^{g}$ are the weights associated with the $k^{th}$ basis function of the drift and diffusion covariance components, respectively. In general the libraries ${\bm{L}}^{f}$ and ${\bm{L}}^{g}$ can be different, but one can construct a single library that accounts for all the basis functions for the discovery of both the drift and diffusion terms. In a general setting, the discovery of the drift terms culminates in the following regression problem,

\left[{\bm{Y}}_{1}\ {\bm{Y}}_{2}\ \ldots\ {\bm{Y}}_{m}\right]=\underbrace{\left[\begin{array}{cccc}\upsilon_{1}(X_{1,1}\ldots X_{m,1})&\upsilon_{2}(X_{1,1}\ldots X_{m,1})&\cdots&\upsilon_{K}(X_{1,1}\ldots X_{m,1})\\ \upsilon_{1}(X_{1,2}\ldots X_{m,2})&\upsilon_{2}(X_{1,2}\ldots X_{m,2})&\cdots&\upsilon_{K}(X_{1,2}\ldots X_{m,2})\\ \vdots&\vdots&\ddots&\vdots\\ \upsilon_{1}(X_{1,N}\ldots X_{m,N})&\upsilon_{2}(X_{1,N}\ldots X_{m,N})&\cdots&\upsilon_{K}(X_{1,N}\ldots X_{m,N})\end{array}\right]}_{{\bf{L}}\in\mathbb{R}^{N\times K}}\underbrace{\left[\begin{array}{cccc}\theta_{1,1}&\theta_{2,1}&\cdots&\theta_{m,1}\\ \theta_{1,2}&\theta_{2,2}&\cdots&\theta_{m,2}\\ \vdots&\vdots&\ddots&\vdots\\ \theta_{1,K}&\theta_{2,K}&\cdots&\theta_{m,K}\end{array}\right]}_{{\bm{\theta}}\in\mathbb{R}^{K\times m}}. (6)

The discovery problem for the diffusion components can be formulated in the same way and is omitted here for brevity. As a subset of the above regression problem, one can solve the following equations and identify the SDE components independently,

{\bm{Y}}_{i}={\bf{L}}^{f}{\bm{\theta}}_{i}^{f}+{\bm{\varepsilon}}_{i}, (7a)
{\bm{Y}}_{ij}={\bf{L}}^{g}{\bm{\theta}}_{ij}^{g}+{\bm{\eta}}_{ij}, (7b)

where ${\bm{Y}}_{i}$ and ${\bm{Y}}_{ij}$ are the target vectors of the sparse regression problems associated with the $i^{th}$ drift component and the $(ij)^{th}$ diffusion covariance term, respectively, and ${\bm{\varepsilon}}_{i}$ and ${\bm{\eta}}_{ij}$ are the corresponding residual error vectors. For the discovery of the $i^{th}$ drift component in Eq. (7a), the target vector is defined as ${\bm{Y}}_{i}=\Delta t^{-1}\left[(X_{i,1}-\xi_{i,1})\ \ldots\ (X_{i,N}-\xi_{i,N})\right]^{T}$, and for the sparse regression problem for the diffusion components in Eq. (7b), the target vector is set as ${\bm{Y}}_{ij}=\Delta t^{-1}\left[(X_{i,1}-\xi_{i,1})(X_{j,1}-\xi_{j,1})\ \ldots\ (X_{i,N}-\xi_{i,N})(X_{j,N}-\xi_{j,N})\right]^{T}$. The weight vectors for the $i^{th}$ drift component and the $(ij)^{th}$ term of the diffusion covariance matrix are ${\bm{\theta}}_{i}^{f}=\left[\theta_{i_{1}},\theta_{i_{2}},\ldots,\theta_{i_{K}}\right]^{T}$ and ${\bm{\theta}}_{ij}^{g}=\left[\theta_{ij_{1}},\theta_{ij_{2}},\ldots,\theta_{ij_{K}}\right]^{T}$, respectively. A straightforward application of the Gibbs sampling steps in Eqs. (17)-(22) can then be used to solve the regression problems in Eq. (7). Noting that an $m$-dimensional process has $m$ drift components, $m$ independent regressions are required to identify all the drifts. Further, for an $m$-dimensional process with an $m\times n$ diffusion matrix, the diffusion covariance matrix ${\bf{\Gamma}}$ has dimension $m\times m$; owing to the symmetry of the covariance matrix, $m(m+1)/2$ independent regressions suffice to completely identify the diffusion space.
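As an illustrative sketch (not the authors' code), the target vectors of Eqs. (7a)-(7b) can be assembled from an ensemble of measured sample paths with a few lines of NumPy; the array layout is an assumption:

```python
import numpy as np

def km_targets(X, dt):
    """Kramers-Moyal target vectors from an ensemble of sample paths.

    X  : array of shape (E, N + 1, m) -- E measured paths, N + 1 steps, m states
    dt : sampling interval
    Returns drift targets Y of shape (N, m) and diffusion-covariance targets
    G of shape (N, m, m), both ensemble-averaged, per the targets of Eq. (7).
    """
    dX = np.diff(X, axis=1)                   # sample increments X(t + dt) - X(t)
    Y = dX.mean(axis=0) / dt                  # linear variation -> drift targets
    # quadratic (co)variation -> diffusion covariance targets
    G = np.einsum('eni,enj->nij', dX, dX) / (dt * X.shape[0])
    return Y, G
```

Each column Y[:, i] then serves as the target ${\bm{Y}}_{i}$ of Eq. (7a), and G[:, i, j] as the target ${\bm{Y}}_{ij}$ of Eq. (7b).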

2.2 Discovery of SDE by sparse Bayesian regression

Since Eqs. (7a) and (7b) reflect the same type of sparse regression problem, for brevity they can be represented by the following one-dimensional regression equation:

{\bm{Y}}={\bf{L}}{\bm{\theta}}+\epsilon, (8)

where ${\bm{Y}}\in\mathbb{R}^{N}$ denotes the $N$-dimensional target vector, ${\bf{L}}$ denotes the library of candidate functions, ${\bm{\theta}}$ is the weight vector, and $\epsilon\in\mathbb{R}^{N}$ is the residual error vector representing the model mismatch. In Bayesian inference, given the output measurements, the aim is to find the posterior distribution of the weight vector ${\bm{\theta}}$. To estimate this posterior, one applies Bayes' formula to Eq. (8), which yields,

P\left({\bm{\theta}}|{\bm{Y}}\right)=\dfrac{P\left({\bm{\theta}}\right)P\left({\bm{Y}}|{\bm{\theta}}\right)}{P\left({\bm{Y}}\right)}. (9)

Modeling the mismatch error $\epsilon$ as an i.i.d. Gaussian random variable with zero mean and variance $\sigma^{2}$, the likelihood function is written as,

{\bm{Y}}|{\bm{\theta}},\sigma^{2}\sim\mathcal{N}\left({\bf{L}}{\bm{\theta}},\sigma^{2}{\bf{I}}_{N\times N}\right), (10)

where ${\bf{I}}_{N\times N}$ denotes the $N\times N$ identity matrix. The discovery of governing physics demands that the resulting model be interpretable, i.e. the solution for the weight vector ${\bm{\theta}}$ should contain only a few terms. In this work, the discontinuous spike and slab (DSS) distribution is used to promote sparsity in the solution. Equation discovery using the DSS-prior requires classification of the weights of the basis functions into either the spike or the slab component. This is done by introducing a latent indicator variable ${\bm{Z}}=[Z_{1},\ldots,Z_{K}]$ for each component $\theta_{k};\ k=1,\ldots,K$ of the weight vector ${\bm{\theta}}$. The latent indicator $Z_{k}$ is assigned the value 1 if the weight belongs to the slab component and the value 0 if it belongs to the spike component. Note that the components of the weight vector belonging to the spike do not contribute to the discovery of the governing physics. Therefore, let ${\bm{\theta}}_{r}\in\mathbb{R}^{r}:\{r\ll K\}$ denote the vector composed of those elements of ${\bm{\theta}}$ for which $Z_{k}=1$. With this notation, the DSS-prior is defined as [36, 38],

p\left({\bm{\theta}}|{\bm{Z}}\right)=p_{slab}({\bm{\theta}}_{r})\prod_{k,Z_{k}=0}p_{spike}(\theta_{k}), (11)

where the spike and slab distributions are defined as $p_{spike}(\theta_{k})=\delta_{0}$ and $p_{slab}({\bm{\theta}}_{r})=\mathcal{N}\left({\bf{0}},\sigma^{2}\vartheta_{s}{\bf{R}}_{0,r}\right)$ with ${\bf{R}}_{0,r}={\bf{I}}_{r\times r}$; the spike and slab distributions are assumed to be independent. To increase the effectiveness and speed of convergence to the sparse solution, the noise variance $\sigma^{2}$ and the slab variance $\vartheta_{s}$ are treated as random variables. The noise variance $\sigma^{2}$ is assigned an Inverse-gamma distribution with hyperparameters $\alpha_{\sigma}$ and $\beta_{\sigma}$, and the slab variance $\vartheta_{s}$ is assigned an Inverse-gamma prior with hyperparameters $\alpha_{\vartheta}$ and $\beta_{\vartheta}$. Each latent variable $Z_{k}$ takes a value from the set $\{0,1\}$ and is therefore assigned a Bernoulli prior with common hyperparameter $p_{0}$, which is in turn simulated from a Beta prior with hyperparameters $\alpha_{p}$ and $\beta_{p}$. This can be summarized as,

p\left(\vartheta_{s}\right)=IG\left(\alpha_{\vartheta},\beta_{\vartheta}\right) (12)
p\left(Z_{k}|p_{0}\right)=Bern\left(p_{0}\right);\quad k=1,\ldots,K (13)
p\left(p_{0}\right)=Beta\left(\alpha_{p},\beta_{p}\right) (14)
p\left(\sigma^{2}\right)=IG\left(\alpha_{\sigma},\beta_{\sigma}\right) (15)
Figure 2: Hierarchical Bayesian network of the discontinuous spike and slab model for sparse linear regression. The variables in the green square boxes indicate deterministic parameters and the variables in the white circles represent random variables. The term ${\bf{L}}$ represents the library of candidate functions, and $\alpha_{\vartheta}$, $\beta_{\vartheta}$, $\alpha_{p}$, $\beta_{p}$, $\alpha_{\sigma}$, and $\beta_{\sigma}$ are the hyperparameters of the priors of the hierarchical DSS model. Here $p_{0}$, $\vartheta_{s}$, and $\sigma^{2}$ are scalar valued, while ${\bm{Z}}$ and ${\bm{\theta}}$ are vector valued.

With all the random variables, we construct the Bayesian hierarchical model shown in Fig. 2, where the hyperparameters $\alpha_{\vartheta}$, $\beta_{\vartheta}$, $\alpha_{p}$, $\beta_{p}$, $\alpha_{\sigma}$, and $\beta_{\sigma}$ are provided as deterministic constants. From the DAG structure in Fig. 2, the joint distribution of the random variables ${\bm{\theta}}$, ${\bm{Z}}$, $\vartheta_{s}$, $\sigma^{2}$, $p_{0}|{\bm{Y}}$ is obtained as,

p\left({\bm{\theta}},{\bm{Z}},\vartheta_{s},\sigma^{2},p_{0}|{\bm{Y}}\right)=\dfrac{p\left({\bm{Y}}|{\bm{\theta}},\sigma^{2}\right)p\left({\bm{\theta}}|{\bm{Z}},\vartheta_{s},\sigma^{2}\right)p\left({\bm{Z}}|p_{0}\right)p\left(\vartheta_{s}\right)p\left(\sigma^{2}\right)p\left(p_{0}\right)}{p\left({\bm{Y}}\right)}\propto p\left({\bm{Y}}|{\bm{\theta}},\sigma^{2}\right)p\left({\bm{\theta}}|{\bm{Z}},\vartheta_{s},\sigma^{2}\right)p\left({\bm{Z}}|p_{0}\right)p\left(\vartheta_{s}\right)p\left(\sigma^{2}\right)p\left(p_{0}\right), (16)

where $p\left({\bm{\theta}},{\bm{Z}},\vartheta_{s},\sigma^{2},p_{0}|{\bm{Y}}\right)$ denotes the joint posterior distribution, $p\left({\bm{Y}}|{\bm{\theta}},\sigma^{2}\right)$ the likelihood function, $p\left({\bm{\theta}}|{\bm{Z}},\vartheta_{s},\sigma^{2}\right)$ the prior on the weight vector ${\bm{\theta}}$, $p\left({\bm{Z}}|p_{0}\right)$ the prior on the latent vector ${\bm{Z}}$, $p\left(\vartheta_{s}\right)$ the prior on the slab variance $\vartheta_{s}$, $p\left(\sigma^{2}\right)$ the prior on the noise variance, $p\left(p_{0}\right)$ the prior on the success probability $p_{0}$, and $p\left({\bm{Y}}\right)$ the marginal likelihood. Direct sampling from the joint distribution is intractable here due to the spike and slab prior; the Gibbs sampling technique is therefore used to draw random samples from it [47]. The conditional distributions required for Gibbs sampling are derived in B. Since the weights corresponding to the spike distribution do not contribute to the selection of the correct basis functions, only the weights corresponding to $Z_{k}\neq 0$, i.e. the vector ${\bm{\theta}}_{r}$, are sampled. Referring to Eqs. (12)-(15), the sequence of random variables ${\bm{\theta}}^{(0)},\sigma^{2(0)},\vartheta_{s}^{(0)},p_{0}^{(0)},{\bm{Z}}^{(0)},{\bm{\theta}}^{(1)},\sigma^{2(1)},\vartheta_{s}^{(1)},p_{0}^{(1)},{\bm{Z}}^{(1)},\ldots$ is obtained by the following Gibbs sampling steps,

  1.

    The weight vector ${\bm{\theta}}_{r}^{(i)}$ is sampled from the Gaussian distribution with mean ${\bm{\mu}}_{\theta}$ and covariance ${\bf{\Sigma}}_{\theta}$ as,

    {\bm{\theta}}_{r}^{(i)}|{\bm{Y}},\vartheta_{s}^{(i)},\sigma^{2(i)}\sim\mathcal{N}\left({\bm{\mu}}_{\theta}^{(i)},{\bf{\Sigma}}_{\theta}^{(i)}\right), (17)

    where the mean and covariance are defined as ${\bm{\mu}}_{\theta}^{(i)}=\sigma^{-2(i)}{\bf{\Sigma}}_{\theta}^{(i)}{\bf{L}}_{r}^{(i)T}{\bm{Y}}$ and ${\bf{\Sigma}}_{\theta}^{(i)}=\sigma^{2(i)}\left({\bf{L}}_{r}^{(i)T}{\bf{L}}_{r}^{(i)}+\vartheta_{s}^{(i)-1}{\bf{R}}_{0,r}^{(i)-1}\right)^{-1}$, respectively.

  2.

    The latent variable $Z_{k}^{(i+1)}$ is assigned a value from the set $\{0,1\}$ using the Bernoulli distribution as,

    Z_{k}^{(i+1)}|{\bm{Y}},\vartheta_{s}^{(i)},p_{0}^{(i)}\sim Bern\left(u_{k}\right), (18)

    where $u_{k}=\dfrac{p_{0}}{p_{0}+\lambda\left(1-p_{0}\right)}$ and $\lambda=\dfrac{p\left({\bm{Y}}|Z_{k}^{(i)}=0,{\bm{Z}}_{-k}^{(i)},\vartheta_{s}^{(i)}\right)}{p\left({\bm{Y}}|Z_{k}^{(i)}=1,{\bm{Z}}_{-k}^{(i)},\vartheta_{s}^{(i)}\right)}$. Here ${\bm{Z}}_{-k}^{(i)}\in\mathbb{R}^{K-1}$ denotes the latent vector ${\bm{Z}}$ with the $k^{th}$ component excluded. The probability that the $k^{th}$ latent variable $Z_{k}^{(i)}$ takes the value 0 or 1 is estimated from the marginal likelihood,

    p\left({\bm{Y}}|{\bm{Z}}^{(i)},\vartheta_{s}^{(i)}\right)=\dfrac{\Gamma\left(\alpha_{\sigma}+\frac{N}{2}\right)\beta_{\sigma}^{\alpha_{\sigma}}}{\Gamma\left(\alpha_{\sigma}\right)\left(2\pi\right)^{N/2}\left(\beta_{\sigma}+\frac{1}{2}{\bm{Y}}^{T}{\bm{Y}}\right)^{\alpha_{\sigma}+N/2}};\quad\text{when all }\{Z_{k}^{(i)}:k=1,\ldots,K\}=0
    =\dfrac{\Gamma\left(\alpha_{\sigma}+\frac{N}{2}\right)\beta_{\sigma}^{\alpha_{\sigma}}\left(\left|{\bf{R}}_{0,r}^{(i)-1}\right|\left|{\bf{\Sigma}}_{\theta}^{(i)}\right|\right)^{1/2}}{\Gamma\left(\alpha_{\sigma}\right)\left(2\pi\right)^{N/2}\vartheta_{s}^{h_{z}/2}\left(\beta_{\sigma}+\frac{1}{2}{\bm{Y}}^{T}\left({\bf{I}}_{N\times N}-{\bf{L}}_{r}^{(i)}{\bf{\Sigma}}_{\theta}^{(i)}{\bf{L}}_{r}^{(i)T}\right){\bm{Y}}\right)^{\alpha_{\sigma}+N/2}};\quad\text{otherwise.} (19)
  3.

    The noise variance σ2(i+1){\sigma^{2(i+1)}} is simulated from the Inverse-gamma distribution as,

    \sigma^{2(i+1)}|{\bm{Y}},{\bm{Z}}^{(i+1)},\vartheta_{s}^{(i)}\sim IG\left(\alpha_{\sigma}+\dfrac{N}{2},\ \beta_{\sigma}+\dfrac{1}{2}\left({\bm{Y}}^{T}{\bm{Y}}-{\bm{\mu}}_{\theta}^{(i)T}{\bf{\Sigma}}_{\theta}^{(i)-1}{\bm{\mu}}_{\theta}^{(i)}\right)\right). (20)
  4.

    The slab variance ϑs(i+1){\vartheta_{s}^{(i+1)}} is sampled from the Inverse-gamma distribution as,

    \vartheta_{s}^{(i+1)}|{\bm{\theta}}^{(i)},{\bm{Z}}^{(i+1)},\sigma^{2(i+1)}\sim IG\left(\alpha_{\vartheta}+\dfrac{h_{z}}{2},\ \beta_{\vartheta}+\dfrac{1}{2\sigma^{2}}{\bm{\theta}}_{r}^{(i)T}{\bf{R}}_{0,r}^{(i)-1}{\bm{\theta}}_{r}^{(i)}\right). (21)
  5.

    The success rate p0(i+1){p_{0}^{(i+1)}} is sampled from the Beta distribution as,

    p_{0}^{(i+1)}|{\bm{Z}}^{(i+1)}\sim Beta\left(\alpha_{p}+h_{z},\ \beta_{p}+K-h_{z}\right), (22)

    where $h_{z}=\sum_{k=1}^{K}Z_{k}^{(i+1)}$.

  6.

    The weight vector ${\bm{\theta}}_{r}^{(i+1)}$ is updated using step 1.

This MCMC is performed for a total of $N_{M}$ iterations, of which the initial 1000 samples are discarded as burn-in. Let $N_{s}$ denote the number of MCMC samples retained after burn-in to represent the stationary distribution. Then the marginal posterior inclusion probability (PIP) $:=p\left(Z_{k}=1|{\bm{Y}}\right)$ for each of the $K$ basis functions can be estimated by taking the mean over the Gibbs samples of the $k^{th}$ latent variable $Z_{k}$ [36] as,

p\left(Z_{k}=1|{\bm{Y}}\right)\approx\dfrac{1}{N_{s}}\sum_{j=1}^{N_{s}}Z_{k}^{j};\quad k=1,\ldots,K. (23)

The basis functions whose PIP values exceed 0.5, i.e. $p\left(Z_{k}=1|{\bm{Y}}\right)>0.5$, are included in the final discovered model. A PIP value greater than 0.5 indicates that the corresponding basis function was selected in more than half of the MCMC iterations; the higher the PIP, the more likely the basis function is to be required to represent the target vector in unseen scenarios. The mean and covariance of the weight vector give the expected values and standard deviations of the system parameters; the standard deviation provides a confidence interval for the parameters, which can be used to set the lower and upper bounds of a random variable in an unseen environment. Pseudo code for the proposed framework is provided in Algorithm 1, and a runnable sketch of the sampler is given below.
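The following compact Python/NumPy sketch implements steps 1-5 above together with the PIP estimate of Eq. (23). It is a sketch under the paper's conjugacy structure, not the authors' implementation: the slab covariance is taken as $R_{0}=I$, the marginal likelihood of Eq. (19) is evaluated in log form up to Z-independent constants, and the posterior mean of the weights is computed in the standard collapsed form.

```python
import numpy as np
from numpy.linalg import inv, slogdet

def log_evidence(Y, L, Z, vs, a_s, b_s):
    """log p(Y | Z, vartheta_s) with theta and sigma^2 integrated out
    (cf. Eq. (19)), dropping terms that do not depend on Z."""
    N = len(Y)
    r = Z.astype(bool)
    if not r.any():
        return -(a_s + 0.5 * N) * np.log(b_s + 0.5 * Y @ Y)
    Lr = L[:, r]
    A = Lr.T @ Lr + np.eye(int(r.sum())) / vs       # slab covariance R_0 = I
    mu = np.linalg.solve(A, Lr.T @ Y)               # collapsed posterior mean
    _, logdetA = slogdet(A)
    return (-0.5 * logdetA - 0.5 * r.sum() * np.log(vs)
            - (a_s + 0.5 * N) * np.log(b_s + 0.5 * Y @ (Y - Lr @ mu)))

def gibbs_dss(Y, L, n_iter=3000, burn=1000,
              a_p=0.1, b_p=1.0, a_v=0.5, b_v=0.5, a_s=1e4, b_s=1e4, seed=0):
    """Sketch of the DSS Gibbs sampler (steps 1-5). Returns the PIP vector
    of Eq. (23) and the retained posterior weight samples."""
    rng = np.random.default_rng(seed)
    N, K = L.shape
    theta0 = np.linalg.lstsq(L, Y, rcond=None)[0]
    s2 = np.var(Y - L @ theta0)                # sigma^2 init: OLS residual variance
    p0, vs = 0.1, 10.0                         # initial inclusion prob., slab variance
    Z = (np.abs(theta0) > np.abs(theta0).mean()).astype(int)   # crude init of Z
    Z_samps, T_samps = [], []
    for it in range(n_iter):
        for k in range(K):                     # step 2: latent indicators (Eq. (18))
            lp = {}
            for val in (0, 1):
                Z[k] = val
                lp[val] = log_evidence(Y, L, Z, vs, a_s, b_s)
            lam = np.exp(np.clip(lp[0] - lp[1], -700, 700))
            Z[k] = int(rng.random() < p0 / (p0 + lam * (1.0 - p0)))
        r, hz = Z.astype(bool), int(Z.sum())
        theta = np.zeros(K)
        if hz:                                 # step 1: theta_r | rest (Eq. (17))
            Lr = L[:, r]
            A = Lr.T @ Lr + np.eye(hz) / vs
            mu = np.linalg.solve(A, Lr.T @ Y)
            theta[r] = rng.multivariate_normal(mu, s2 * inv(A))
            quad = Y @ (Y - Lr @ mu)
        else:
            quad = Y @ Y
        # step 3: noise variance (Eq. (20)); an IG draw is the reciprocal of a Gamma draw
        s2 = 1.0 / rng.gamma(a_s + 0.5 * N, 1.0 / (b_s + 0.5 * quad))
        # step 4: slab variance (Eq. (21))
        vs = 1.0 / rng.gamma(a_v + 0.5 * hz, 1.0 / (b_v + theta[r] @ theta[r] / (2 * s2)))
        # step 5: inclusion probability (Eq. (22))
        p0 = rng.beta(a_p + hz, b_p + K - hz)
        if it >= burn:
            Z_samps.append(Z.copy()); T_samps.append(theta)
    return np.mean(Z_samps, axis=0), np.array(T_samps)   # PIP, weight samples
```

The per-component sweep over all $K$ indicators in each iteration is the costliest step, but it keeps the sampler exact with respect to the conditional distributions above.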

Algorithm 1 Pseudo code of the proposed Bayesian framework for discovery of the governing physics of stochastic non-linear systems
1: Sample paths: ${\bf{X}}(t)\in\mathbb{R}^{N\times m}$; hyperparameters: $\alpha_{p}$, $\beta_{p}$, $\alpha_{\sigma}$, $\beta_{\sigma}$, $\alpha_{\vartheta}$, $\beta_{\vartheta}$, $p_{0}^{(0)}$, $\vartheta_{s}^{(0)}$
2: Estimate the linear and quadratic variation vectors. ▷ Eq. (4)
3: For the drift, obtain the target vectors ${\bm{Y}}_{i}$ and the library ${\bf{L}}$ using the candidate basis functions. ▷ Eq. (7a)
4: For the diffusion, obtain the target vectors ${\bm{Y}}_{ij}$ and the library ${\bf{L}}$. ▷ Eq. (7b)
5: Estimate the initial noise variance from the residual variance: $\sigma^{2(0)}=\mathrm{Var}({\bf{L}}{\bm{\theta}}-{\bm{Y}})$
6: Estimate the initial latent vector ${\bm{Z}}^{(0)}=[Z_{1}^{(0)},Z_{2}^{(0)},\ldots,Z_{K}^{(0)}]$ such that $\mathrm{MSE}({\bf{L}}{\bm{\theta}}-{\bm{Y}})$ is minimum.
7: Find ${\bm{\theta}}^{(0)}$, ${\bm{\mu}}_{\theta}$ and ${\bf{\Sigma}}_{\theta}$ ▷ Eq. (17)
8: for $i=1,\ldots,N_{M}$ do
9:     Update the latent variable vector ${\bm{Z}}^{(i)}$. ▷ Eq. (18)
10:     Update the noise variance $\sigma^{2(i)}$. ▷ Eq. (20)
11:     Update the slab variance $\vartheta_{s}^{(i)}$. ▷ Eq. (21)
12:     Update the success rate $p_{0}^{(i)}$. ▷ Eq. (22)
13:     Update the weight vector ${\bm{\theta}}^{(i)}$. ▷ Eq. (17)
14:     Repeat steps 9→13
15: Discard the burn-in MCMC samples
16: Estimate the marginal PIP values $p(Z_{k}=1|{\bm{Y}})$ ▷ Eq. (23)
17: Include the basis functions with PIP values above the threshold
18: Estimate the expected value $\mathrm{E}[\theta_{k}]$ and standard deviation $\sigma[\theta_{k}]$ of the system parameters
19: Output: $\{{\bm{\theta}}_{k};\ k=1,\ldots,K\}$, $\{\vartheta_{k}({\bm{X}}_{j});\ j=1,\ldots,\bar{m}\}$, where $K$ and $\bar{m}$ are the numbers of library functions and process states, respectively.

The Gibbs sampler is initialised with the following values of the hyperparameters: $p_{0}^{(0)}=0.1$, $\vartheta_{s}^{(0)}=10$, and $\sigma^{2(0)}$ set equal to the residual variance from ordinary least-squares regression. The initial vector of binary latent variables ${\bm{Z}}^{(0)}=[Z_{1},\ldots,Z_{K}]$ is computed by first setting all components to zero and then activating those components of ${\bm{Z}}$ that reduce the mean-squared error between the training data and the ordinary least-squares model. For this purpose a forward search followed by a backward search is devised, where the backward search iterates through the activated components of the initial latent vector in a fashion similar to the forward search. Given all the other parameters, the initial value of ${\bm{\theta}}^{(0)}$ is obtained from Eq. (16). For the commencement of the algorithm, the deterministic prior parameters are set to the following values: $\alpha_{p}=0.1$ and $\beta_{p}=1$ for the Beta prior on $p_{0}$; $\alpha_{\vartheta}=0.5$ and $\beta_{\vartheta}=0.5$ for the Inverse-gamma prior on the slab variance; and $\alpha_{\sigma}=10^{4}$ and $\beta_{\sigma}=10^{4}$ for the Inverse-gamma prior on the measurement noise. A Markov chain with 3000 samples is used for Gibbs sampling; the first 1000 samples are discarded as burn-in and the remaining 2000 samples are used for posterior computation.
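With these settings, a usage sketch of the gibbs_dss sampler above might look as follows; here Y and Lib are a hypothetical target vector and library (for instance from the km_targets and library-construction sketches in this section):

```python
import numpy as np

# Hypothetical inputs: target vector Y of shape (N,) and library Lib of shape (N, K)
pip, thetas = gibbs_dss(Y, Lib, n_iter=3000, burn=1000,
                        a_p=0.1, b_p=1.0, a_v=0.5, b_v=0.5, a_s=1e4, b_s=1e4)
keep = pip > 0.5                              # PIP threshold of Eq. (23)
print("selected basis indices :", np.flatnonzero(keep))
print("E[theta]               :", thetas.mean(axis=0)[keep])
print("std[theta]             :", thetas.std(axis=0)[keep])
```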

For all the demonstrations in this work, the data are simulated using the Euler-Maruyama (EM) scheme at a frequency of 1000 Hz with the parameters listed in Table 1. The measurement noise is modeled as an $N$-dimensional sequence of zero-mean Gaussian white noise with a standard deviation equal to 5% of the standard deviation of the simulated quantities. The dictionary ${\bf{L}}\in\mathbb{R}^{N\times K}$ is constructed from 5 types of mathematical functions, each representing a mapping of the $m$-dimensional state vector ${\bm{X}}=\{X_{1},X_{2},\ldots,X_{m}\}$:

{\bf{L}}({\bm{X}})=\left[{\bf{1}}\quad P^{1}({\bm{X}})\quad P^{2}({\bm{X}})\quad\ldots\quad P^{\mathcal{P}}({\bm{X}})\quad\mathrm{sgn}({\bm{X}})\quad\left|{\bm{X}}\right|\quad{\bm{X}}\otimes\left|{\bm{X}}\right|\right]. (24)

Here ${\bf{1}}\in\mathbb{R}^{N}$ denotes the $N$-dimensional vector of ones, $P^{\mathcal{P}}({\bm{X}})\in\mathbb{R}^{N\times m}$ denotes the set of terms in the multinomial expansion $(X_{1}+X_{2}+\ldots+X_{m})^{\mathcal{P}}$, $\mathrm{sgn}({\bm{X}})\in\mathbb{R}^{N\times m}$ represents the signum functions $\mathrm{sgn}(X_{i})\ \forall i=1,\ldots,m$, and $\left|{\bm{X}}\right|\in\mathbb{R}^{N\times m}$ denotes the absolute mappings of the states $\left|X_{i}\right|\ \forall i=1,\ldots,m$. The tensor product term ${\bm{X}}\otimes\left|{\bm{X}}\right|\in\mathbb{R}^{N\times 2m}$ represents the set of functions $X_{i}\left|X_{j}\right|\ \forall i,j=1,\ldots,m$. In this study the polynomial order $\mathcal{P}$ is chosen as 6. The cardinality of the library is $K=\left(1+\left|(X_{1}+X_{2}+\ldots+X_{m})^{n}\right|+4m\right)$, where the number of terms in the multinomial expansion is $\left|(X_{1}+X_{2}+\ldots+X_{m})^{n}\right|={}^{n+m-1}C_{m-1}$. The complete architecture of the proposed framework is illustrated in Fig. 1.
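The dictionary of Eq. (24) can be assembled mechanically. The sketch below is our reading of Eq. (24), with the tensor-product block taken as all pairs $X_{i}|X_{j}|$; it builds the columns together with human-readable names:

```python
import itertools
import numpy as np

def build_library(X, P=6):
    """Candidate library of Eq. (24) for a state matrix X of shape (N, m):
    [1, multinomial terms of degree 1..P, sgn(X), |X|, X (x) |X|]."""
    N, m = X.shape
    cols, names = [np.ones(N)], ['1']
    for p in range(1, P + 1):                 # P^p(X): monomials of total degree p
        for idx in itertools.combinations_with_replacement(range(m), p):
            cols.append(np.prod(X[:, idx], axis=1))
            names.append('*'.join(f'X{i + 1}' for i in idx))
    for i in range(m):                        # sgn(X_i)
        cols.append(np.sign(X[:, i])); names.append(f'sgn(X{i + 1})')
    for i in range(m):                        # |X_i|
        cols.append(np.abs(X[:, i])); names.append(f'|X{i + 1}|')
    for i in range(m):                        # tensor-product block X_i |X_j|
        for j in range(m):
            cols.append(X[:, i] * np.abs(X[:, j]))
            names.append(f'X{i + 1}|X{j + 1}|')
    return np.column_stack(cols), names

Lib, names = build_library(np.random.default_rng(0).normal(size=(100, 2)), P=6)
print(Lib.shape)   # (100, 36): for m = 2, the multinomial block has 27 terms
```

For $m=2$ and $\mathcal{P}=6$ this yields $K=36$ columns, which matches the library size reported for the Duffing-Van der Pol oscillator in Fig. 3(c).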

Simulated systems | Drift parameters (linear) | Drift parameters (non-linear) | Diffusion parameters
Black-Scholes | $\lambda=2$ | - | $\mu=1$
Duffing-Van der Pol | $m=1$, $k=2\times 10^{3}$, $c=2$ | $\alpha=10^{5}$ | $\sigma=10$
2-DOF non-linear | $m_{1}=m_{2}=1$, $k_{1}=4\times 10^{3}$, $k_{2}=2\times 10^{3}$, $c_{1}=c_{2}=2$ | $\mu=1$, $g=9.81$ | $\sigma_{1}=10$, $\sigma_{2}=10$
Bouc-Wen | $m=1$, $k=1\times 10^{4}$, $c=20$, $\lambda=0.5$ | $A_{1}=A_{2}=0.5$, $A_{3}=1$, $\bar{n}=3$ | $\sigma=2$
Table 1: Simulation parameters of the systems.

3 Discovery of governing physics of example problems

In this section, the efficacy and robustness of the proposed Bayesian physics discovery framework are demonstrated on a variety of representative stochastic non-linear dynamical systems: (a) the Black-Scholes SDE, (b) the Duffing-Van der Pol oscillator, and (c) a two degree-of-freedom (DOF) base-isolated shear building. For the equation discovery, it is assumed that the input forces are not measurable and that only noisy measurements of the displacements are available. The velocity component is obtained from the displacement vector through numerical differentiation, namely the fourth-order central finite difference formula. Additionally, a case study on the Bouc-Wen oscillator is undertaken, where it is assumed that the hysteresis measurement is also unavailable; the proposed framework therefore treats it as a partially observed system and estimates the effect of the unobserved state without explicitly including it in the library of candidate functions. The results of the case studies are presented in Figs. 3 and 4 and Table 2. These results demonstrate the accuracy of the discovered physics and the robustness of the proposed framework in learning the governing law from limited and noisy displacement measurements alone (without any input measurements).

Figure 3: Discovery of the drift components of the governing physics from data corrupted with 5% Gaussian white noise, based on the marginal posterior inclusion probability (PIP), $P(Z_{k}=1|{\bm{Y}})$. (a) Black-Scholes SDE: the library ${\bf{L}}\in\mathbb{R}^{N\times 9}$ consists of 9 candidate basis functions, of which $x$ and $|x|$ are identified based on the criterion PIP > 0.5. (b) Black-Scholes SDE: the library ${\bf{L}}\in\mathbb{R}^{N\times 6}$ is constructed as a collection of 6 basis functions; the correct drift component $X(t)$ is identified with almost sure probability $P(Z_{1}=1)=1$. (c) Duffing-Van der Pol oscillator: the library ${\bf{L}}\in\mathbb{R}^{N\times 36}$ has 36 candidate basis functions; the drift component is correctly identified as $\upsilon(1)=X_{1}$, $\upsilon(2)=X_{2}$, and $\upsilon(6)=X_{1}^{3}$. (d) Two-DOF shear building (drift component of the first DOF): the library ${\bf{L}}\in\mathbb{R}^{N\times 58}$ consists of 58 candidate basis functions, of which 5 are selected for the first drift component: $\upsilon(1)=Y_{1}$, $\upsilon(2)=Y_{2}$, $\upsilon(3)=Y_{3}$, $\upsilon(4)=Y_{4}$, and $\upsilon(36)=\mathrm{sgn}(Y_{2})$. (e) Two-DOF shear building (drift component of the second DOF): from the library ${\bf{L}}\in\mathbb{R}^{N\times 58}$ the basis functions $\upsilon(1)=Y_{1}$, $\upsilon(2)=Y_{2}$, $\upsilon(3)=Y_{3}$, and $\upsilon(4)=Y_{4}$ are discovered. The basis functions for both drift components are selected with almost full probability $P(Z_{k}=1|{\bm{Y}})$.
Figure 4: Discovery of the diffusion components of the governing physics from data corrupted with 5% Gaussian white noise, based on the marginal posterior inclusion probability (PIP), $P(Z_{k}=1|{\bm{Y}})$. (a) Black-Scholes SDE: the weight vector $\theta\in\mathbb{R}^{9}$ for the library ${\bf{L}}\in\mathbb{R}^{N\times 9}$ is shown; the covariance of the diffusion component is identified as a combination of the terms $X^{2}(t)$ and $X|X|$. (b) Black-Scholes SDE: the weight vector $\theta\in\mathbb{R}^{6}$ has 6 elements, of which the correct covariance term $X^{2}(t)$ is discovered with almost sure probability $P(Z_{2}=1)=1$. (c) Duffing-Van der Pol oscillator: there are 36 elements in the weight vector $\theta\in\mathbb{R}^{36}$; the diffusion term is correctly discovered as $X_{1}^{2}(t)$ with almost sure probability $P(Z_{2}=1)=1$. (d) Two-DOF shear building (diffusion component of the first DOF): the weight vector $\theta\in\mathbb{R}^{58}$ has 58 elements corresponding to the 58 basis functions; the first diffusion component is accurately identified as $X_{1}^{2}(t)$ with almost sure probability $P(Z_{2}=1)=1$. (e) Two-DOF shear building (diffusion component of the second DOF): the second diffusion component $X_{2}^{2}(t)$ is accurately identified with almost sure probability $P(Z_{2}=1)=1$.
Figure 5: Black-Scholes SDE: posterior distributions of the weights of the discovered basis functions. (a) The posterior distributions of $\theta_{X}$ and $\theta_{|X|}$; both resemble a uniform distribution with means around the value 1, and in combination the bases $X$ and $|X|$ identify the drift. (b) The posterior distributions of $\theta_{X^{2}}$ and $\theta_{X|X|}$; the posteriors resemble a uniform distribution with means around 0.5, and in combination they discover the diffusion. (c) The posterior of the weight $\theta_{X}$; its mean is obtained as 2.04, which identifies the drift with almost sure probability $P(Z_{1}=1)=1$. (d) The posterior of the weight $\theta_{X^{2}}$, whose mean provides the value of the parameter $\mu$ as 1.02.
Correct SDEs:
Black-Scholes: $dX(t)=2X(t)dt+X(t)dB(t)$
Duffing-Van der Pol: $dX_{2}(t)=-\left(1000X_{1}(t)+2X_{2}(t)+100000X_{1}^{3}(t)\right)dt+10X(t)dB(t)$
2-DOF non-linear: $dY_{2}(t)=\left(-6000Y_{1}(t)-4Y_{2}(t)-9.81\,\mathrm{sgn}\left(Y_{2}(t)\right)+2000Y_{3}(t)+2Y_{4}(t)\right)dt+10dB_{1}(t)$; $dY_{4}(t)=\left(2000Y_{1}(t)+2Y_{2}(t)-2000Y_{3}(t)-2Y_{4}(t)\right)dt+10dB_{2}(t)$
Identified SDEs:
Black-Scholes: $dX(t)=2.05X(t)dt+1.02X(t)dB(t)$; $dX(t)=\left(0.47X(t)+0.53\left|X(t)\right|\right)dt+\left(1.09X^{2}(t)+0.95X(t)\left|X(t)\right|\right)dB(t)$
Duffing-Van der Pol: $dX_{2}(t)=-\left(1000.02X_{1}(t)+1.99X_{2}(t)+99885.10X_{1}^{3}(t)\right)dt+10.31X(t)dB(t)$
2-DOF non-linear: $dY_{2}(t)=\left(-6000.30Y_{1}(t)-3.97Y_{2}(t)-9.89\,\mathrm{sgn}\left(Y_{2}(t)\right)+1999.73Y_{3}(t)+2.09Y_{4}(t)\right)dt+10.03dB_{1}(t)$; $dY_{4}(t)=\left(2000.65Y_{1}(t)+1.99Y_{2}(t)-1999.06Y_{3}(t)-1.99Y_{4}(t)\right)dt+10.11dB_{2}(t)$
Table 2: Summary of the results of Bayesian equation discovery of stochastic dynamical systems. The parameters of the identified systems denote the expected values of the weights after discarding the burn-in samples; the first 1000 samples are taken as burn-in and discarded before obtaining the final stationary distribution.

3.1 Black–Scholes SDE

The Black–Scholes SDE (formulated as a geometric Brownian motion) describes a continuous-time stochastic process in which the logarithm of the randomly varying quantity follows a Brownian motion. It is frequently used in the stock market for modelling the evolution of the stock price of an underlying asset [48]. The Black–Scholes SDE has drift $f(t,X_{t})=\lambda X$ and diffusion $g(t,X_{t})=\mu X$, with $\lambda>0$ and $\mu>0$ real constants. It is defined as a geometric Brownian motion as follows:

\dfrac{dX(t)}{X(t)}=\lambda\,dt+\mu\,dB(t);\quad X(t=t_{0})=X_{0};\quad t\in[0,T], (25)

where $B=\{B(t);t\geq 0\}$ is the Brownian motion. Data are generated by simulating Eq. (25) and then corrupting the result with Gaussian white noise. Identifying the diffusion term as $g({\bm{X}}_{t},t)=\mu X(t)$, the variance of the diffusion is obtained as $\Gamma=\mu^{2}X^{2}(t)$. The results of this case study are plotted in Figs. 3(a) and 4(a). Fig. 3(a) shows that two basis functions, $X$ and $|X|$, are identified; however, from Eq. (25) it is obvious that only the single function $X$ should have been identified as the driving function of the Black-Scholes SDE. To understand this discrepancy, note that the evolution of the random variable $X(t)$ in the Black-Scholes SDE follows a geometric Brownian motion (GBM), and GBMs always assume positive values (e.g. real stock prices). Since the functions $X$ and $|X|$ are identical on such data, the algorithm identifies both functions with almost equal probability. A similar phenomenon is observed in the identification of the diffusion term: Fig. 4(a) depicts almost equal contributions from the functions $X^{2}$ and $X|X|$, whereas only the term $X^{2}$ should actually have been identified. Upon combining the corresponding parameter values of the basis functions $X$ and $|X|$ for the drift component, and $X^{2}$ and $X|X|$ for the diffusion component, the Black-Scholes SDE in Eq. (25) can easily be simulated for any unseen environmental condition.
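For reference, a minimal sketch of the data-generation step for this example is given below (Euler-Maruyama at 1000 Hz with $\lambda=2$, $\mu=1$ from Table 1, and 5% additive measurement noise; the time horizon and initial condition are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
lam, mu = 2.0, 1.0                  # drift and diffusion parameters (Table 1)
dt, T, x0 = 1e-3, 1.0, 1.0          # 1000 Hz sampling; T and x0 are assumed
n = int(T / dt)

X = np.empty(n + 1); X[0] = x0
for k in range(n):                  # Euler-Maruyama: X_{k+1} = X_k + lam*X_k*dt + mu*X_k*dB_k
    X[k + 1] = X[k] + lam * X[k] * dt + mu * X[k] * np.sqrt(dt) * rng.standard_normal()

X_noisy = X + 0.05 * X.std() * rng.standard_normal(n + 1)   # 5% measurement noise
```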

To verify that this is not a limitation of the proposed scheme, the numerical demonstration is repeated in Figs. 3(b) and 4(b) without the functions $|X|$ and $X|X|$ in the library. The parameters of the basis functions for the drift and diffusion components are portrayed in Figs. 5(c) and 5(d), respectively. The results clearly show that the proposed scheme is able to identify the basis functions along with their associated parameters $\lambda$ and $\mu$ without any significant error. This case study naturally raises the question of which basis functions should be included in the library. One possibility is to visualize the time series data and then apply scientific and engineering judgement; for example, if the time history shows no zero crossing, one can opt against including functions that share similar properties on the observed data (e.g., $|X|$ and $X$, or $X|X|$ and $X^{2}$), as illustrated by the check below. Alternatively, one can retain such terms, noting that similar terms will have equal probability of occurrence and, in combination, will reproduce the observed phenomenon; however, the resulting model will then be applicable only to processes exhibiting no zero crossing.
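The zero-crossing check suggested above is easy to automate. The short sketch below flags whether a measured time history ever changes sign; how the flag feeds into library construction is a hypothetical illustration, not the paper's prescribed workflow.

```python
import numpy as np

def has_zero_crossing(y):
    """Return True if the measured time history changes sign."""
    s = np.sign(y[np.abs(y) > 0.0])   # drop exact zeros before comparing signs
    return bool(np.any(s[1:] != s[:-1]))

# If the response never crosses zero, X and |X| (and likewise X**2 and
# X*|X|) coincide on the data, so the absolute-value bases may be pruned
# from the candidate library before running the identification.
```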

Finally, to illustrate the robustness of the proposed approach and simulate a realistic scenario, we consider a case where the data is generated using the strong Taylor 1.5 stochastic integration scheme. The results of this study are provided in Fig. 6. Figs. 6(a) and (b) show that the proposed framework correctly identifies the drift and diffusion components of the Black-Scholes equation. Subplots (c) and (d) further show the expected values of the parameters of the identified basis functions, which match almost exactly the parameters listed in Table 1. This shows that, despite the lower order formulation, the proposed framework is robust and well equipped.

Figure 6: Black-Scholes equation: Posterior distributions of the weights of the selected basis functions. (a) The drift component $X$ of the Black-Scholes equation is identified. (b) The variance term $X^{2}$ of the diffusion component is identified. (c) Posterior distribution of the weight $\theta(X)$, which identifies the expected value of the parameter $\lambda$; the expected value of $\lambda$ is identified from the mean of the distribution as approximately 2.09. (d) Posterior distribution of the weight $\theta(X^{2})$; the mean of the distribution represents the expected value of the parameter $\mu$, which is found to be 1.01.

3.2 Duffing–Van der Pol oscillator

The second-order non-linear hardening Duffing–Van der Pol oscillator is considered in this section. The Duffing–Van der Pol oscillator draws its physical relevance from models of flow-induced structural vibration problems. It has a cubic dissipating force and is driven by multiplicative white noise. The equation of motion of the system under consideration is expressed as:

m\ddot{X}(t)+c\dot{X}(t)+kX(t)+\alpha X^{3}(t)=\sigma X(t)\dot{B}(t);\quad {\bm{X}}(t=t_{0})={\bm{X}}_{0};\quad t\in[0,T], (26)

where $m$, $c$, and $k$ are the mass, damping, and stiffness parameters of the oscillator. Here, $\alpha$ is a real-valued parameter associated with the cubic non-linearity, $\sigma\geq 0$ is the strength of the multiplicative white noise, and $B=\{B(t);t\geq 0\}$ is the Brownian motion. The time derivative of the Brownian motion, $\dot{B}(t)$, represents the white noise. With the state-space transformation $X=X_{1}$ and $\dot{X}=X_{2}$, where $X_{1}$ and $X_{2}$ represent the displacement and velocity, the corresponding first-order Itô SDEs for the dynamical system are derived as,

\left[\begin{array}{c}dX_{1}(t)\\ dX_{2}(t)\end{array}\right]=\left[\begin{array}{c}X_{2}(t)\\ \dfrac{-1}{m}\left(kX_{1}(t)+cX_{2}(t)+\alpha X_{1}^{3}(t)\right)\end{array}\right]dt+\left[\begin{array}{c}0\\ \dfrac{\sigma}{m}X(t)\end{array}\right]dB(t). (27)

For estimating the drift and diffusion terms, only the evolution of the second variable $X_{2}(t)$ is used to construct the target linear and quadratic variation vectors from the sample paths using the Kramers-Moyal formulae. Here, the variance of the diffusion term is identified as $\Gamma=\sigma^{2}X^{2}(t)/m^{2}$. The results for the basis function selection are displayed in Figs. 3(c) and 4(c), while the associated parameters are plotted in Fig. 7. From Figs. 3(c) and 4(c), it is evident that the proposed framework accurately discovers the basis functions for the drift and diffusion terms as $\left\{X_{1}(t),X_{2}(t),X_{1}^{3}(t)\right\}$ and $\left\{X_{1}(t),X_{2}(t)\right\}$, respectively.
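As a rough numerical counterpart of this step, the sketch below forms the linear and quadratic variation targets from an ensemble of sampled paths of $X_{2}$, following the limiting formulas of Appendix A; the array layout, the ensemble averaging, and the 1/2 convention of Eq. (58) are assumptions about how the targets are assembled.

```python
import numpy as np

def km_targets(x2, dt):
    """Kramers-Moyal target vectors from sampled paths of X2.

    x2 : array of shape (n_ensemble, n_time) holding sampled paths.
    Returns the drift target E[dX2]/dt and the quadratic-variation
    target E[dX2**2]/(2 dt), following the convention of Eq. (58).
    """
    dx = np.diff(x2, axis=1)                        # one-step increments
    drift_target = dx.mean(axis=0) / dt             # ~ f(X, t)
    diff_target = 0.5 * (dx**2).mean(axis=0) / dt   # ~ g^2(X, t)
    return drift_target, diff_target
```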

The pairwise joint posterior distributions of the weights associated with the discovered basis functions are presented in Fig. 7. Subplots (a), (b), and (c) show the weights of the discovered drift functions. It is evident from these figures that the actual parameters associated with the drift component of the DVP oscillator, as listed in Table 1, are correctly identified with very small relative error. Subplot (d) depicts the posterior distribution of the identified diffusion basis function. The mean value of this posterior distribution represents the term $g^{2}(\bm{X}_{t},t)$, where in this case $g(\bm{X}_{t},t)=(\sigma/m)X(t)$. The mean value here is obtained as approximately 106.47; by performing the square root operation on the mean, one approximately recovers the diffusion parameter $\sigma=10.31$. These results verify that the proposed scheme can be successfully applied to discover the governing physics of non-linear oscillators subjected to parametric excitation.

Figure 7: Duffing–Van der Pol oscillator: Posterior distributions of the weights of the selected basis functions. (a) Joint posterior distribution of $\theta(X_1)$ and $\theta(X_2)$; the yellow region indicates the mean of the joint distribution. These weights represent the parameters $k$ and $c$, with expected values $\mathrm{E}(\theta(X_1))=1000.2$ and $\mathrm{E}(\theta(X_2))=1.99$. (b) Joint posterior distribution of $\theta(X_1)$ and $\theta(X_1^3)$: $\mathrm{E}(\theta(X_1))=1000.2$ and $\mathrm{E}(\theta(X_1^3))=99885.1$. (c) Joint posterior distribution of $\theta(X_2)$ and $\theta(X_1^3)$: $\mathrm{E}(\theta(X_2))=1.99$ and $\mathrm{E}(\theta(X_1^3))=99885.1$. (d) Histogram of the diffusion parameter; its mean value, 106.47, represents the term $g^{2}(\bm{X}_{t},t)$, from which the parameter $\sigma/m$ is obtained as approximately 10.31.

3.3 Two-story base-isolated shear building

The responses of civil engineering structures often exhibit the characteristics of stochastic non-linear dynamical systems due to the external excitation. Here, a two-story base-isolated structure is considered: a linear structure connected to the foundation through a Coulomb friction-type base isolator. Under these considerations, the governing equations of motion of the system are written as,

\begin{array}{l}m_{1}\ddot{X}_{1}(t)+c_{1}\dot{X}_{1}(t)+\mu m_{1}g\,\mathrm{sgn}\left(\dot{X}_{1}(t)\right)+k_{1}X_{1}(t)+c_{2}\left(\dot{X}_{1}(t)-\dot{X}_{2}(t)\right)+k_{2}\left(X_{1}(t)-X_{2}(t)\right)=\sigma_{1}\dot{B}_{1}(t)\\ m_{2}\ddot{X}_{2}(t)+c_{2}\left(\dot{X}_{2}(t)-\dot{X}_{1}(t)\right)+k_{2}\left(X_{2}(t)-X_{1}(t)\right)=\sigma_{2}\dot{B}_{2}(t)\\ {\bm{X}}(t=t_{0})={\bm{X}}_{0};\quad t\in[0,T], \end{array} (28)

where $\mu m_{1}g\,\mathrm{sgn}(\dot{X}_{1}(t))$ is the Coulomb friction force arising due to the sliding of the bearings in the isolator, and $\{m_{i}\}$, $\{c_{i}\}$, and $\{k_{i}\}$, $i=1,2$, are the mass, damping, and stiffness of the $i^{th}$ DOF. Here, $\sigma_{1}\geq 0$ and $\sigma_{2}\geq 0$ are the strengths of the white noises $\dot{B}_{1}(t)$ and $\dot{B}_{2}(t)$, where $B_{1}(t)$ and $B_{2}(t)$ are independent Brownian motions. In this case study, the equation discovery involves identification of two drift and two covariance terms (one for each DOF). Following the case study of the Duffing–Van der Pol oscillator, the target linear variation vector for discovery of the $i^{th}$ ($i=1,2$) drift term is constructed from the velocity component of the corresponding DOF. For discovery of the $(ij)^{th}$ ($i,j=1,2$) component of the covariance matrix, the quadratic variation matrix is constructed from the pointwise product of the $i^{th}$ and $j^{th}$ response vectors (see the sketch after Eq. (29)). The covariance matrix of the diffusion then takes the form,

\Gamma=\left[\begin{array}{cccc}0&0&0&0\\ 0&\sigma_{1}^{2}/m_{1}^{2}&0&0\\ 0&0&0&0\\ 0&0&0&\sigma_{2}^{2}/m_{2}^{2}\end{array}\right]. (29)
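A possible numerical realization of this construction is sketched below: every $(ij)^{th}$ covariation target is formed from the pointwise product of the increments of the $i^{th}$ and $j^{th}$ states. The storage layout and the averaging over ensemble and time are assumptions, not the paper's exact implementation.

```python
import numpy as np

def covariation_targets(paths, dt):
    """Quadratic covariation matrix estimate for a multi-state system.

    paths : array of shape (n_ensemble, n_states, n_time) with sampled
            state vectors (here Y1..Y4).
    Returns an (n_states, n_states) estimate of Gamma, averaged over the
    ensemble and over time, with the 1/2 factor of Eq. (59).
    """
    dx = np.diff(paths, axis=2)          # increments of every state
    n_ens, _, n_inc = dx.shape
    gamma = 0.5 * np.einsum('eit,ejt->ij', dx, dx) / (n_ens * n_inc * dt)
    return gamma
```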

The results for the identification of the corresponding basis functions for both drift terms are presented in Figs. 3(d) and (e). From Fig. 3(d), it can be observed that the identified basis functions are $\left\{Y_{1}(t),Y_{2}(t),Y_{3}(t),Y_{4}(t),\mathrm{sgn}(Y_{2}(t))\right\}$, which correctly matches the terms in the equation of motion of the first DOF. Similarly, from Fig. 3(e) it can be verified that the proposed scheme has correctly identified the basis functions of the second drift component as $\left\{Y_{1}(t),Y_{2}(t),Y_{3}(t),Y_{4}(t)\right\}$. Figs. 8 and 9 show the pairwise joint posterior distributions of the weights associated with the identified first and second drift basis functions, respectively. From these figures it is evident that the proposed scheme is able to identify the parameters of the basis functions, as listed in Table 1, with sufficient accuracy. In order to find the explicit values of the system stiffness and damping parameters, the parameters of the second DOF can be identified from Figs. 8(h), 8(e), and 8(f).

Figure 8: Non-linear two-DOF oscillator (first drift component): Pairwise joint posterior distributions of the parameters $\theta_k$ of the selected basis functions. (a) Joint posterior of the weights $\theta(Y_1)$ and $\theta(Y_2)$; the yellow colour denotes the mean value of the joint distribution. The expected values are $\mathrm{E}(\theta(Y_1))=-6000.30$ and $\mathrm{E}(\theta(Y_2))=-3.97$. (b) Joint posterior of $\theta(Y_1)$ and $\theta(Y_3)$: $\mathrm{E}(\theta(Y_1))=-6000.30$ and $\mathrm{E}(\theta(Y_3))=1999.73$. (c) Joint posterior of $\theta(Y_1)$ and $\theta(Y_4)$: $\mathrm{E}(\theta(Y_1))=-6000.30$ and $\mathrm{E}(\theta(Y_4))=2.09$. (d) Joint posterior of $\theta(Y_1)$ and $\theta(\mathrm{sgn}(Y_2))$: $\mathrm{E}(\theta(Y_1))=-6000.30$ and $\mathrm{E}(\theta(\mathrm{sgn}(Y_2)))=-9.89$. (e) Joint posterior of $\theta(Y_2)$ and $\theta(Y_3)$: $\mathrm{E}(\theta(Y_2))=-3.97$ and $\mathrm{E}(\theta(Y_3))=1999.73$. (f) Joint posterior of $\theta(Y_2)$ and $\theta(Y_4)$: $\mathrm{E}(\theta(Y_2))=-3.97$ and $\mathrm{E}(\theta(Y_4))=2.09$. (g) Joint posterior of $\theta(Y_2)$ and $\theta(\mathrm{sgn}(Y_2))$: $\mathrm{E}(\theta(Y_2))=-3.97$ and $\mathrm{E}(\theta(\mathrm{sgn}(Y_2)))=-9.89$. (h) Joint posterior of $\theta(Y_3)$ and $\theta(Y_4)$: $\mathrm{E}(\theta(Y_3))=1999.73$ and $\mathrm{E}(\theta(Y_4))=2.09$. (i) Joint posterior of $\theta(Y_3)$ and $\theta(\mathrm{sgn}(Y_2))$: $\mathrm{E}(\theta(Y_3))=1999.73$ and $\mathrm{E}(\theta(\mathrm{sgn}(Y_2)))=-9.89$. The weights $\{\theta_{Y_1},\theta_{Y_2},\theta_{Y_3},\theta_{Y_4},\theta_{\mathrm{sgn}(Y_2)}\}$ correspond to the basis functions $\{\upsilon_1,\upsilon_2,\upsilon_3,\upsilon_4,\upsilon_{36}\}$ and represent the relative values of the parameters $\{k_1,k_2,k_3,k_4,\mu mg\}$. The negative values arise due to the coupling of the system states.
Figure 9: Non-linear two-DOF oscillator (second drift component): Pairwise joint posterior distributions of the parameters $\theta_k$ of the selected basis functions. (a) Joint posterior of the weights $\theta(Y_1)$ and $\theta(Y_2)$, corresponding to the basis functions $\upsilon_1$ and $\upsilon_2$; the yellow colour denotes the mean value of the joint distribution. The expected values are $\mathrm{E}(\theta(Y_1))=2000.65$ and $\mathrm{E}(\theta(Y_2))=1.99$. (b) Joint posterior of $\theta(Y_1)$ and $\theta(Y_3)$ ($\upsilon_1$, $\upsilon_3$): $\mathrm{E}(\theta(Y_1))=2000.65$ and $\mathrm{E}(\theta(Y_3))=-1999.06$. (c) Joint posterior of $\theta(Y_1)$ and $\theta(Y_4)$ ($\upsilon_1$, $\upsilon_4$): $\mathrm{E}(\theta(Y_1))=2000.65$ and $\mathrm{E}(\theta(Y_4))=-1.99$. (d) Joint posterior of $\theta(Y_2)$ and $\theta(Y_3)$ ($\upsilon_2$, $\upsilon_3$): $\mathrm{E}(\theta(Y_2))=1.99$ and $\mathrm{E}(\theta(Y_3))=-1999.06$. (e) Joint posterior of $\theta(Y_2)$ and $\theta(Y_4)$ ($\upsilon_2$, $\upsilon_4$): $\mathrm{E}(\theta(Y_2))=1.99$ and $\mathrm{E}(\theta(Y_4))=-1.99$. (f) Joint posterior of $\theta(Y_3)$ and $\theta(Y_4)$ ($\upsilon_3$, $\upsilon_4$): $\mathrm{E}(\theta(Y_3))=-1999.06$ and $\mathrm{E}(\theta(Y_4))=-1.99$. The weights $\{\theta_{Y_1},\theta_{Y_2},\theta_{Y_3},\theta_{Y_4}\}$ represent the relative values of the parameters $\{k_1,k_2,k_3,k_4\}$. The negative values arise due to the coupling of the system states.

The identification results for the basis functions, along with the posterior distributions of their parameters for both diffusion terms, are depicted in Fig. 10. Figs. 10(a) and 10(b) present the results for the first diffusion term, and the identification results for the second diffusion term are shown in Figs. 10(c) and 10(d). In both cases, it is evident that the proposed scheme has correctly identified the basis function as $1$. The posterior distributions of the covariance terms ${\bm{\Gamma}}^{11}$ and ${\bm{\Gamma}}^{22}$ are plotted in Figs. 10(b) and 10(d), and their mean values represent $\sigma_{1}^{2}/m_{1}^{2}$ and $\sigma_{2}^{2}/m_{2}^{2}$. A summary of the results is presented in Table 2.

Figure 10: Two-DOF shear building: posterior distributions of the weights associated with the discovered basis functions of the diffusion components. (a) First DOF: posterior distribution of the weight $\theta_1$ of the constant basis function; the mean of the distribution denotes the expected value of $\sigma_1^{2}/m_1^{2}$, observed from the figure as approximately 101. (b) Second DOF: posterior distribution of the weight $\theta_1$; the mean is observed as 102, which is equivalent to the term $\sigma_2^{2}/m_2^{2}$. By performing the square root operation on the mean values, one can easily obtain the desired diffusion components.

3.4 Case study: Non-linear system with partially observed state variables

In this section, a case study is undertaken where we do not have access to all the state variables. We consider a Bouc-Wen oscillator in which the system is described using three states, namely the displacement and velocity of the main mass and the hysteresis displacement associated with the isolator [43, 49]. In this context, we note that the measurement of the hysteresis displacement is practically intractable and only the displacement and velocity states of the main system are available. Further, measuring the velocity of a mechanical system using sensors is a practically challenging task. Through this example, we illustrate the applicability of the proposed approach for such a system. The governing equation of motion of a single-degree-of-freedom Bouc-Wen oscillator is [43]:

m\ddot{X}(t)+c\dot{X}(t)+k\lambda X(t)+k\left(1-\lambda\right)Z(t)=\sigma\dot{B}(t);\qquad {\bm{X}}(t=t_{0})={\bm{X}}_{0};\quad t\in[0,T], (30)

where $m$, $c$, and $k$ are the mass, damping, and stiffness of the system, $X(t)$ is the system state, $\lambda$ is a factor that defines the participation of the elastic force $F_{e}(t)=k\lambda X(t)$ and the hysteresis force $F_{h}(t)=k(1-\lambda)Z(t)$, and $Z(t)$ is the hysteresis displacement. The evolution of the hysteresis parameter is defined through the non-linear differential equation:

\dot{Z}(t)=-A_{1}Z(t)\left|\dot{X}(t)\right|\left|Z(t)\right|^{\bar{n}-1}-A_{2}\dot{X}(t)\left|Z(t)\right|^{\bar{n}}+A_{3}\dot{X}(t);\qquad Z(t=t_{0})=Z_{0};\quad t\in[0,T]. (31)

The positive exponent $\bar{n}$ defines the smoothness of the transition from the elastic to the post-elastic branch, and the parameters $A_{1}$, $A_{2}$, and $A_{3}$ control the size and shape of the hysteresis loop. The drift and diffusion terms are:

f(t,{\bm{X}}_{t})=\left[\begin{array}{c}X_{2}(t)\\ \dfrac{-1}{m}\left(k\lambda X_{1}(t)+cX_{2}(t)+k(1-\lambda)X_{3}(t)\right)\\ -A_{1}X_{3}(t)\left|X_{2}(t)\right|\left|X_{3}(t)\right|^{\bar{n}-1}-A_{2}X_{2}(t)\left|X_{3}(t)\right|^{\bar{n}}+A_{3}X_{2}(t)\end{array}\right];\quad g(t,{\bm{X}}_{t})=\left[\begin{array}{c}0\\ \dfrac{\sigma}{m}\\ 0\end{array}\right]. (32)

Then, the covariance of the diffusion matrix is obtained as,

\Gamma=g(t,{\bm{X}}_{t})g(t,{\bm{X}}_{t})^{T}=\left[\begin{array}{ccc}0&0&0\\ 0&\dfrac{\sigma^{2}}{m^{2}}&0\\ 0&0&0\end{array}\right]. (33)

For the identification of the system, only the (noisy) displacement measurements are considered as input to the algorithm. The system is simulated for $T$ = 1s at a sampling frequency of 1000Hz using the parameter values $m$ = 1, $c$ = 20, $k$ = 100000, $\lambda$ = 0.5, $A_{1}$ = 0.5, $A_{2}$ = 0.5, $A_{3}$ = 1, $\bar{n}=3$, and $\sigma$ = 2. The identified basis functions for the drift and diffusion terms are presented in Fig. 11. In sub-figure (a) it is evident that more basis functions are identified than appear in the original equation. It is straightforward to note that these extra basis functions arise to account for the non-observable hysteresis state $Z(t)$. One can then identify the basis functions $\Theta_{f}(\mathbf{X})$ and $\Theta_{g}(\mathbf{X})$ for the drift and diffusion, respectively, that best represent the data as,

\begin{array}{l}\Theta_{f}(\mathbf{X})=\left[\begin{array}{ccccccccccc}1&X_{1}&X_{2}&X_{1}^{2}&X_{2}^{2}&X_{1}^{3}&X_{1}^{4}&\mathrm{sgn}(X_{1})&\left|X_{1}\right|&\left|X_{2}\right|&X_{1}\left|X_{1}\right|\end{array}\right]\\ \Theta_{g}(\mathbf{X})=\left[\begin{array}{ccc}1&X_{1}&\left|X_{1}\right|\end{array}\right].\end{array} (34)
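A dictionary of this kind is assembled directly from the measured states. The sketch below builds only the retained columns of Eq. (34); the full library used in the study contains 36 candidate functions, so this is a reduced illustration under that assumption.

```python
import numpy as np

def drift_library(x1, x2):
    """Retained columns of the drift dictionary in Eq. (34)."""
    return np.column_stack([
        np.ones_like(x1),        # 1
        x1, x2,                  # X1, X2
        x1**2, x2**2,            # X1^2, X2^2
        x1**3, x1**4,            # X1^3, X1^4
        np.sign(x1),             # sgn(X1)
        np.abs(x1), np.abs(x2),  # |X1|, |X2|
        x1 * np.abs(x1),         # X1|X1|
    ])

def diffusion_library(x1):
    """Retained columns of the diffusion dictionary in Eq. (34)."""
    return np.column_stack([np.ones_like(x1), x1, np.abs(x1)])
```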
Figure 11: Bouc-Wen oscillator: Basis function selection for the Bouc-Wen base isolator system based on the marginal posterior inclusion probability (PIP), $P(Z_k=1|\bm{Y})$; $k=1,\ldots,36$. (a) The identified basis functions for the drift component. The black horizontal axis represents the collection of 36 basis functions, and the red line marks marginal PIP $>$ 0.5. The identified basis functions are $\{\upsilon(0),\upsilon(1),\upsilon(2),\upsilon(3),\upsilon(5),\upsilon(6),\upsilon(10),\upsilon(28),\upsilon(30),\upsilon(31),\upsilon(32)\}$, corresponding to the functions $\{1,X_{1},X_{2},X_{1}^{2},X_{2}^{2},X_{1}^{3},X_{1}^{4},\mathrm{sgn}(X_{1}),\left|X_{1}\right|,\left|X_{2}\right|,X_{1}\left|X_{1}\right|\}$. (b) The identified basis functions for the diffusion component: $\{1,X_{1},\left|X_{1}\right|\}$.
Figure 12: Bouc-Wen oscillator: Pairwise joint posterior distributions of the parameters $\theta_k$ of the selected basis functions. Subplots (a)$\to$(f) present the joint posterior distributions of the drift parameters, whereas subplots (I)$\to$(III) present those of the diffusion parameters. (a) Joint posterior distribution of $\theta(1)$ and $\theta(X_1)$: $\mathrm{E}(\theta(1))=-46.4$ and $\mathrm{E}(\theta(X_1))=-5246$. (b) Joint posterior distribution of $\theta(X_2)$ and $\theta(X_1^2)$: $\mathrm{E}(\theta(X_2))=-20$ and $\mathrm{E}(\theta(X_1^2))=-614$. (c) Joint posterior distribution of $\theta(X_2^2)$ and $\theta(X_1^3)$: $\mathrm{E}(\theta(X_2^2))=-0.05$ and $\mathrm{E}(\theta(X_1^3))=-2328$. (d) Joint posterior distribution of $\theta(X_1^4)$ and $\theta(\mathrm{sgn}(X_1))$: $\mathrm{E}(\theta(X_1^4))=1829$ and $\mathrm{E}(\theta(\mathrm{sgn}(X_1)))=10.64$. (e) Joint posterior distribution of $\theta(|X_1|)$ and $\theta(|X_2|)$: $\mathrm{E}(\theta(|X_1|))=159$ and $\mathrm{E}(\theta(|X_2|))=1.63$. (f) Joint posterior distribution of $\theta(|X_1|)$ and $\theta(X_1|X_1|)$: $\mathrm{E}(\theta(|X_1|))=159$ and $\mathrm{E}(\theta(X_1|X_1|))=1439$. (I) Joint posterior of the weights $\theta_1$ and $\theta_2$ corresponding to the basis functions $1$ and $X_1$; the yellow colour denotes the mean values of the joint distribution. (II) Joint posterior of the weights $\theta_1$ and $\theta_{30}$ corresponding to the basis functions $1$ and $|X_1|$. (III) Joint posterior of the weights $\theta_2$ and $\theta_{30}$ corresponding to the basis functions $X_1$ and $|X_1|$.

The joint posterior distributions of the basis functions retained in Figs. 11(a) and (b) are plotted in Fig. 12. Based on the basis selection in Fig. 11 and the joint posteriors of the parameters $\theta_i$ plotted in Fig. 12, the final drift and diffusion fields for the Bouc-Wen system can be identified as follows,

\begin{array}{ll}f(t,X)=&-46.4-5246X_{1}-20X_{2}-614X_{1}^{2}-0.05X_{2}^{2}-2328X_{1}^{3}+1829X_{1}^{4}+10.64\,\mathrm{sgn}(X_{1})+159\left|X_{1}\right|\\ &+1.625\left|X_{2}\right|+1439X_{1}\left|X_{1}\right|\\ g(t,X)=&4.65-4.872X_{1}+6.786\left|X_{1}\right|.\end{array} (35)
Correct SDEs: \left\{\begin{array}{l}dX_{2}(t)=-\left(5000X_{1}(t)+20X_{2}(t)+5000X_{3}(t)\right)dt+2dB(t)\\ dX_{3}(t)=\left(-0.5X_{3}(t)\left|X_{2}(t)\right|\left|X_{3}(t)\right|^{2}-0.5X_{2}(t)\left|X_{3}(t)\right|^{3}+X_{2}(t)\right)dt\end{array}\right.
Identified SDEs: \left\{\begin{array}{ll}dX_{2}(t)=&\bigl(-46.4-5246X_{1}-20X_{2}-614X_{1}^{2}-2328X_{1}^{3}+1829X_{1}^{4}+10.64\,\mathrm{sgn}(X_{1})+159\left|X_{1}\right|\\ &+1.63\left|X_{2}\right|+1439X_{1}\left|X_{1}\right|\bigr)dt+\sqrt{4.65-3.79X_{1}+5.76\left|X_{1}\right|}\,dB(t)\end{array}\right.
Table 3: Identification results of the Bayesian physics discovery for the Bouc-Wen oscillator

For further treatment, the term $0.05X_{2}^{2}$ is neglected due to its insignificant parameter value. With this, the identified equation for the Bouc-Wen system is shown in Eq. (36):

\ddot{X}+20\dot{X}+5246X+614X^{2}+2328X^{3}-1829X^{4}-10.64\,\mathrm{sgn}(X)-159\left|X\right|-1.625\left|\dot{X}\right|-1439X\left|X\right|+46.4=4.65-4.872X+6.786\left|X\right|. (36)

In order to verify the accuracy of the identified system, the identified equation (Eq. (36)) is simulated using the EM scheme with a time step of $\Delta t$ = 0.001. To account for the fact that the future external excitation will be uncertain and its intensity will not remain the same, we model the forcing function as a Brownian force with a different intensity; to be specific, a Brownian noise intensity of $\sigma$ = 6 is used to simulate the new unseen environment. The results are compared in Fig. 13. We observe that the results obtained using the identified physics match almost exactly the ground truth obtained by solving the actual system. It is worthwhile to note that the proposed approach requires only 1s of data and is able to predict the responses for a stochastic force of different intensity in the distant future (500s). This clearly indicates that the proposed approach, even with partially observed displacement-only measurements, generalizes to unseen environments and provides accurate estimates even for out-of-distribution inputs. A summary of the case studies undertaken in this work is presented in Table 2.
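A minimal sketch of this verification step is given below. It integrates the identified SDE of Table 3 with the Euler-Maruyama scheme at $\Delta t$ = 0.001; the initial state and, in particular, the rescaling of the learned quadratic variation from the training intensity $\sigma$ = 2 to the unseen intensity $\sigma$ = 6 are our assumptions, not the paper's stated procedure.

```python
import numpy as np

# Euler-Maruyama integration of the identified Bouc-Wen model (Table 3).
rng = np.random.default_rng(1)
dt, T = 1e-3, 500.0

def f(x1, x2):
    """Identified drift of dX2 (Table 3)."""
    return (-46.4 - 5246*x1 - 20*x2 - 614*x1**2 - 2328*x1**3 + 1829*x1**4
            + 10.64*np.sign(x1) + 159*abs(x1) + 1.63*abs(x2)
            + 1439*x1*abs(x1))

def g2(x1):
    """Identified quadratic variation of the diffusion (Table 3)."""
    return 4.65 - 3.79*x1 + 5.76*abs(x1)

x1, x2 = 0.0, 0.0             # assumed initial state
scale = (6.0 / 2.0)**2        # assumed rescaling: sigma = 2 -> sigma = 6
for _ in range(int(T / dt)):
    dB = np.sqrt(dt) * rng.standard_normal()
    x1, x2 = (x1 + x2 * dt,
              x2 + f(x1, x2) * dt + np.sqrt(max(scale * g2(x1), 0.0)) * dB)
```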

Figure 13: Bouc-Wen oscillator: Future prediction in an unseen scenario using the discovered equation in the absence of the hysteresis data. The governing equation of the Bouc-Wen system is learned using 1 second of data sampled at 1000Hz, and the prediction is performed 500 seconds ahead of time using the discovered Bouc-Wen model. The figure presents the ensemble of the displacement and velocity predictions over 500 seconds. The simulation is performed using a Brownian noise intensity of $\sigma$ = 6, which is significantly higher than that of the training data. The dotted red line shows the prediction using the discovered equation and the solid blue line denotes the accurately simulated data. The predicted response emulates the original system almost exactly, even though measurements for all the states are not available.

4 Discussions

We proposed a novel data-driven framework for discovering the governing physics of multi-dimensional non-linear dynamical systems from limited and noisy output-only data. In many natural processes, the estimate of the external input is intractable due to limitations of present measurement technologies. Further, measurement of all the state variables is often not possible due to high operational cost; the proposed framework is developed to identify governing equations in such scenarios. In essence, the proposed approach identifies SDEs from displacement-only measurements; in other words, no measurement of the input is required. The deterministic drift component of the SDE captures the dynamics of the underlying dynamical system, whereas the external stochastic force is identified in terms of the diffusion component. This is a key novelty of the proposed framework.

The proposed framework employs sparse Bayesian linear regression in conjunction with the Gibbs sampler to simultaneously obtain the basis functions of the model and their associated parameters. To that end, the Kramers-Moyal expansion is utilized to express the drift and diffusion dynamics of an SDE in terms of the measured sample paths. The drift and diffusion vectors obtained from the sample paths are then used as target vectors in the sparse regression. The use of the Kramers-Moyal expansion within the purview of sparse Bayesian linear regression for the discovery of explainable governing physics is another novelty of the proposed framework. The unified framework possesses the ability of Bayesian inference to update the probability of observing the correct basis functions as new information arrives; this ability to update is missing in recently published least-squares based physics discovery schemes. In a noisy and unseen environment, one is more interested in the probability distribution of a random variable than in a point estimate; the proposed framework therefore provides a probability distribution for each of the weights associated with the corresponding basis functions. The obtained standard deviations of these distributions can then be used to define lower and upper bounds on the estimates of the unknown parameters. Another aspect of the proposed framework is its advantage over neural network based grey-box models, where an explicit expression of the discovered governing physics is not available. In the case of partially observed processes, the proposed framework still provides an explainable equation for the governing physics; although the basis functions in the obtained equation differ significantly from those determined from first principles, the obtained model is able to predict the distant future without significant error. To the knowledge of the authors, the discovery of physics in a partially observed process from the purview of probability theory is one of the key contributions of the proposed framework.

Despite the advantages of the proposed framework over presently available physics discovery techniques, certain issues require special attention for accurate identification of the governing physics. The first issue is the judicious selection of the candidate basis functions. The presence of basis functions with high correlation may sometimes yield a physical law different from the actual physics; this is evident from the identification example on the Black-Scholes equation. The second issue is the associated computational cost. Although the computational cost of the proposed scheme is significantly less than the time required to train its neural network alternatives, further improvements can be made. One alternative is to use a marginal standard deviation criterion instead of taking the mean of the latent vector. Another improvement in computational efficiency can be made by devising an appropriate filter to remove the signal noise synchronously with the sparse Bayesian linear regression. The third issue is related to the quality of the data. Low-fidelity data are less expensive to obtain, but the representation of the physical model can be poor due to the presence of noise; high-fidelity data are highly accurate but expensive to obtain. The proposed framework in its present state is not equipped to make combined use of low- and high-fidelity data. Thus, to further enhance the fidelity of the proposed framework for field applications, a more robust algorithm needs to be formulated: the use of low-fidelity data would ensure low operational cost, while the high-fidelity data would ensure the accuracy of the discovered physics. Lastly, the proposed framework leverages the Kramers-Moyal expansion to express the drift and diffusion components of an SDE in terms of the first and second order moments of the sample paths. The moments are approximated in a limiting sense by making use of the absolute and quadratic identities of Brownian motion. In this context, future improvements can be made by exploiting higher order moments to retain higher order terms from the stochastic Taylor expansions of the drift and diffusion terms. Overall, the proposed framework provides a novel methodology for discovering governing physics from displacement-only measurements and is particularly useful whenever it is not possible to measure the input force. The learned equations have been shown to be accurate and to demonstrate good predictive ability in the distant future.

Appendix A Kramers-Moyal expansion for estimation of the drift and diffusion terms of an SDE from sample paths

Let us consider $p(X,t)={\rm P}(X,t|X_{0},t_{0})$ to be the transition probability density of the solution of the SDE in Eq. (3). The Kramers-Moyal expansion is then written as [50],

\dfrac{\partial p(X,t)}{\partial t}=\sum\limits_{n=1}^{\infty}\left(-\dfrac{\partial}{\partial X}\right)^{n}D^{(n)}(X,t)p(X,t), (37)

where the coefficients in the expansion are given as,

D^{(n)}(X)=\left.\dfrac{1}{n!}M_{n}(t)\right|_{t=0}=\left.\dfrac{1}{n!}\mathop{\lim}\limits_{\Delta t\to 0}\dfrac{1}{\Delta t}\left\langle\left|X(t+\Delta t)-z\right|^{n}\right\rangle\right|_{X(t)=z}. (38)

In order to know how many terms of Eq. (37) will be active for the SDE in Eq. (3), a Fokker-Planck equation for the pdf of the solution of Eq. (3) needs to be constructed. Towards this, let us consider a well-behaved generic function $F(X,t)$ which is at least twice differentiable. For this function, Itô's lemma is written as,

dF(X,t)=\dfrac{\partial F(X,t)}{\partial X}dX(t)+\dfrac{1}{2}\dfrac{\partial^{2}F(X,t)}{\partial X^{2}}g^{2}(X,t)dt. (39)

Substituting Eq. (3) in the above equation gives the following:

dF(X,t)=\left(\dfrac{\partial F(X,t)}{\partial X}f(X,t)+\dfrac{1}{2}\dfrac{\partial^{2}F(X,t)}{\partial X^{2}}g^{2}(X,t)\right)dt+\dfrac{\partial F(X,t)}{\partial X}g(X,t)dB(t). (40)

Taking the expectation on both sides and differentiating with respect to $t$ yields:

\dfrac{d{\rm{E}}[F(X,t)]}{dt}={\rm{E}}\left(\dfrac{\partial F(X,t)}{\partial X}f(X,t)+\dfrac{1}{2}\dfrac{\partial^{2}F(X,t)}{\partial X^{2}}g^{2}(X,t)\right). (41)

Writing the expectations as integrals against the density $p(X,t)$ and noting that the stochastic integral has zero mean, i.e., ${\rm{E}}\left[\int_{0}^{t}\dfrac{\partial F(X,s)}{\partial X}g(X,s)dB(s)\right]=0$, where ${\rm{E}}(\cdot)$ is the expectation operator, one obtains the following,

\int\limits_{-\infty}^{\infty}p(X,t)\dfrac{dF(X)}{dt}dX=\int\limits_{-\infty}^{\infty}p(X,t)\left(\dfrac{\partial F(X,t)}{\partial X}f(X,t)+\dfrac{1}{2}\dfrac{\partial^{2}F(X,t)}{\partial X^{2}}g^{2}(X,t)\right)dX. (42)

By expanding the terms in the above equation using integration by parts and assuming that $\lim_{|X|\to\infty}p(X,t)\to 0$, the following weak form is obtained:

\int\limits_{-\infty}^{\infty}F(X)\left(\dfrac{\partial p(X,t)}{\partial t}+\dfrac{\partial(p(X,t)f(X,t))}{\partial X}-\dfrac{1}{2}\dfrac{\partial^{2}(p(X,t)g^{2}(X,t))}{\partial X^{2}}\right)dX=0. (43)

As the function $F(X,t)$ is arbitrary, the Fokker-Planck-Kolmogorov equation is obtained:

\dfrac{\partial p(X,t)}{\partial t}=-\dfrac{\partial(p(X,t)f(X,t))}{\partial X}+\dfrac{1}{2}\dfrac{\partial^{2}(p(X,t)g^{2}(X,t))}{\partial X^{2}};\quad p(0,X)=p_{0}(X). (44)

In a general setting, the above equation for a higher-dimensional diffusion process is expressed as,

\dfrac{\partial p(X,t)}{\partial t}=L_{t}p(X,t)=-\sum\limits_{i}^{m}\dfrac{\partial(p(X,t)f_{i}(X,t))}{\partial X_{i}}+\dfrac{1}{2}\sum\limits_{i}^{m}\sum\limits_{j}^{m}\sum\limits_{k}^{n}\dfrac{\partial^{2}(p(X,t)g_{i,k}(X,t)g_{j,k}(X,t))}{\partial X_{i}\partial X_{j}}, (45)

where the random variable $X$ satisfies the SDE in Eq. (3). At this point it is clear that only two terms will be active in the expansion of Eq. (37). Thus, for $n$=2, Eq. (37) yields,

\dfrac{\partial p(X,t)}{\partial t}=-\dfrac{\partial}{\partial X}\left[D^{(1)}(X,t)p(X,t)\right]+\dfrac{\partial^{2}}{\partial X^{2}}\left[D^{(2)}(X,t)p(X,t)\right]. (46)

On comparison of Eqs. (44) and (46), it is straightforward to note that, to estimate the drift and diffusion terms, it suffices to estimate the first and second order moments of the variations of the random variable $X(t)$ and then calculate the coefficients $D^{(1)}$ and $D^{(2)}$; formally,

f(X,t)=D^{(1)},\quad g^{2}(X,t)=D^{(2)}. (47)

To derive these coefficients of the Kramers-Moyal expansion, it is imperative to understand the one-step Itô-Taylor expansion of the random variable $X(t)$. Referring to Eq. (40), the integral form of Itô's lemma can be expressed as follows,

F(X_{t+h},t+h)=F(X_{t},t)+\int_{t}^{t+h}\left\{f(X_{s},s)F^{\prime}(X_{s},s)+\dfrac{1}{2}g^{2}(X_{s},s)F^{\prime\prime}(X_{s},s)\right\}ds+\int_{t}^{t+h}g(X_{s},s)F^{\prime}(X_{s},s)dB(s), (48)

where $F^{\prime}(X_{s},s)$ and $F^{\prime\prime}(X_{s},s)$ denote the first and second order partial derivatives with respect to the system states. For simplicity, two stochastic operators are defined in the following form:

\begin{array}{l}\Im^{0}(\cdot)=\dfrac{\partial(\cdot)}{\partial t}+\sum\limits_{i}^{m}f_{i}(X_{t},t)\dfrac{\partial(\cdot)}{\partial X_{i}}+\dfrac{1}{2}\sum\limits_{i}^{m}\sum\limits_{j}^{m}\sum\limits_{k}^{n}g_{i,k}(X_{t},t)g_{j,k}(X_{t},t)\dfrac{\partial^{2}(\cdot)}{\partial X_{i}\partial X_{j}}\\ \Im^{1}(\cdot)=\sum\limits_{i}^{m}\sum\limits_{k}^{n}g_{i,k}(X_{t},t)\dfrac{\partial(\cdot)}{\partial X_{i}}.\end{array} (49)

Using the operators, Eq. (48) can be rephrased as,

F(X_{t+h},t+h)=F(X_{t},t)+\int_{t}^{t+h}\Im^{0}\left(F(X_{s},s)\right)ds+\int_{t}^{t+h}\Im^{1}\left(F(X_{s},s)\right)dB(s). (50)

Then, substituting $F(X_{t},t)=X(t)$, one verifies that $\Im^{0}X(t)=f(X_{t},t)$ and $\Im^{1}X(t)=g(X_{t},t)$. This yields the first iteration:

\begin{array}{ll}X(t+h)&=X(t)+\int_{t}^{t+h}\Im^{0}\left(X(s)\right)ds+\int_{t}^{t+h}\Im^{1}\left(X(s)\right)dB(s)\\ &=X(t)+\int_{t}^{t+h}f(X_{s},s)ds+\int_{t}^{t+h}g(X_{s},s)dB(s).\end{array} (51)

In order to perform the second iteration, it is required to find the stochastic expansions of $F(X_{t},t)=f(X_{t},t)$ and $F(X_{t},t)=g(X_{t},t)$. Expanding $f(X_{t},t)$ and $g(X_{t},t)$ using the operators in Eq. (49) yields,

\begin{array}{l}f(X_{s},s)=f(X,t)+\int_{t}^{s_{1}}\Im^{0}\left(f(X_{s_{2}},s_{2})\right)ds_{2}+\int_{t}^{s_{1}}\Im^{1}\left(f(X_{s_{2}},s_{2})\right)dB(s_{2})\\ g(X_{s},s)=g(X,t)+\int_{t}^{s_{1}}\Im^{0}\left(g(X_{s_{2}},s_{2})\right)ds_{2}+\int_{t}^{s_{1}}\Im^{1}\left(g(X_{s_{2}},s_{2})\right)dB(s_{2}).\end{array} (52)

On substituting the above result in Eq. (51) and further iterating,

X(t+h)-X(t)=f(X,t)\int_{t}^{t+h}ds_{1}+g(X,t)\int_{t}^{t+h}dB(s_{1})+\Im^{1}\left(g(X,t)\right)\int_{t}^{t+h}\int_{t}^{s_{1}}dB(s_{2})dB(s_{1})+R, (53)

where the remainder term $R$ is,

\begin{array}{ll}R=&\int_{t}^{t+h}\int_{t}^{s_{1}}\Im^{0}\left(f(X_{s_{2}},s_{2})\right)ds_{2}ds_{1}+\left\{\int_{t}^{t+h}\int_{t}^{s_{1}}\Im^{1}\left(f(X_{s_{2}},s_{2})\right)dB(s_{2})ds_{1}+\int_{t}^{t+h}\int_{t}^{s_{1}}\Im^{0}\left(g(X_{s_{2}},s_{2})\right)ds_{2}dB(s_{1})\right\}+\\ &\int_{t_{0}}^{t}\int_{t_{0}}^{s_{1}}\int_{t}^{s_{2}}\Im^{1}\Im^{0}\left(g(X_{s_{3}},s_{3})\right)ds_{3}dB(s_{2})dB(s_{1})+\int_{t_{0}}^{t}\int_{t_{0}}^{s_{1}}\int_{t}^{s_{2}}\Im^{1}\Im^{1}\left(g(X_{s_{3}},s_{3})\right)dB(s_{3})dB(s_{2})dB(s_{1}).\end{array} (54)

The above equation is infinitely expandable using Eq. (49). For further treatment, the first moment is taken on both sides. Noting the results $\left\langle\int_{t}^{t+h}dB(s_{1})\right\rangle=0$ and $\left\langle\int_{t_{0}}^{t}\int_{t_{0}}^{s_{1}}dB(s_{2})dB(s_{1})\right\rangle=\left\langle\int_{t_{0}}^{t}B(s_{1})dB(s_{1})-B(t)\int_{t_{0}}^{t}dB(s_{1})\right\rangle=0$,

\begin{array}{ll}\left\langle X(t+h)-X(t)\right\rangle&=f(X,t)h+g(X,t)\left\langle\int_{t}^{t+h}dB(s_{1})\right\rangle+\Im^{1}\left(g(X,t)\right)\left\langle\int_{t}^{t+h}\int_{t}^{s_{1}}dB(s_{2})dB(s_{1})\right\rangle+\left\langle R\right\rangle\\ &=f(X,t)h+\left\langle R\right\rangle.\end{array} (55)

If the number of iterations is $k$, then the Brownian integrals $\int_{t}^{t+h}\int_{t}^{s_{1}}\ldots\int_{t}^{s_{k}}dB(s_{k+1})\ldots dB(s_{2})dB(s_{1})$ of multiplicity $k$ have a contribution proportional to $h^{k/2}$, the time integrals $\int_{t}^{t+h}\int_{t}^{s_{1}}\ldots\int_{t}^{s_{k}}ds_{k+1}\ldots ds_{2}ds_{1}$ have a contribution proportional to $h^{k}$, and the mixed integrals have a contribution between $h^{k/2}$ and $h^{k}$. Thus it is easy to infer that, as $h\to 0$, the higher order terms in the remainder $R$ vanish. Invoking these facts in the above expression, the first coefficient of the Kramers-Moyal expansion is obtained as,

D^{(1)}=f(X,t)=\left.\mathop{\lim}\limits_{\Delta t\to 0}\dfrac{1}{\Delta t}\left\langle X^{(1)}(t+\Delta t)-z\right\rangle\right|_{X(t)=z}. (56)

To derive the second coefficient, it is only required to find the quadratic variation of the increment process of $X(t)$. Upon taking the second moment on both sides of Eq. (53), the following is obtained,

\begin{array}{ll}\left\langle\left|X(t+h)-X(t)\right|^{2}\right\rangle&=\left\langle\left(f(X,t)h+g(X,t)\Delta B+R\right)^{2}\right\rangle\\ &=f^{2}(X,t)h^{2}+2f(X,t)g(X,t)h\left\langle\Delta B\right\rangle+g^{2}(X,t)\left\langle\left(\Delta B\right)^{2}\right\rangle+\left\langle R\right\rangle\\ &=g^{2}(X,t)h+\left\langle R\right\rangle.\end{array} (57)

In the above proof, the Itô identities are utilized, which state that, under mean square convergence, the quadratic variations of the time and Brownian increments are given as $ds_{2}ds_{1}=0$, $ds_{1}dB(s_{1})=0$, $dB(s_{2})ds_{1}=0$, and $dB(s_{1})dB(s_{2})=0$ for $s_{1}\neq s_{2}$. The deterministic time integrals vanish automatically since they have finite variance and zero quadratic covariance, and the other higher order Brownian integrals vanish as $h\to 0$. Thus the second coefficient of the Kramers-Moyal expansion is obtained as,

D^{(2)}=g^{2}(X,t)=\left.\dfrac{1}{2}\mathop{\lim}\limits_{\Delta t\to 0}\dfrac{1}{\Delta t}\left\langle\left|X^{(1)}(t+\Delta t)-z\right|^{2}\right\rangle\right|_{X(t)=z}. (58)

Under the ergodicity assumption on the evolution of the process $X(t)$, the time average $\left\langle\cdot\right\rangle$ is often replaced by the expectation operator ${\rm{E}}(\cdot)$. For $m$ variables ${\bm{X}}=\left\{X_{1},X_{2},\ldots,X_{m}\right\}$, the diffusion process has the form,

dX_{i}(t)=f_{i}({\bm{X}}_{t},t)dt+\sum\limits_{j}^{n}g_{ij}({\bm{X}}_{t},t)dB_{j}(t);\quad i=1,2,\ldots,m.

The properties of the Brownian motions are $\left\langle B_{i}(t)\right\rangle=0$ and $\left\langle B_{i}(t)B_{j}(s)\right\rangle=\delta_{ij}\min(t,s)$. If the covariation matrix of the diffusion components is given by $\Gamma({\bm{X}}_{t},t)=g({\bm{X}}_{t},t)g({\bm{X}}_{t},t)^{T}$, then the drift $f_{i}({\bm{X}},t)$ of the $i^{th}$ diffusion process and the $ij^{th}$ element of the covariation matrix $\Gamma({\bm{X}},t)$ can be estimated as,

\begin{array}{l}f_{i}({\bm{X}},t)=\left.\mathop{\lim}\limits_{\Delta t\to 0}\dfrac{1}{\Delta t}{\rm{E}}\left[X_{i}(t+\Delta t)-z_{i}\right]\right|_{X_{k}(t)=z_{k}}\;\forall\;k=1,2,\ldots,m\\ \Gamma_{ij}({\bm{X}},t)=\left.\dfrac{1}{2}\mathop{\lim}\limits_{\Delta t\to 0}\dfrac{1}{\Delta t}{\rm{E}}\left[\left|X_{i}(t+\Delta t)-z_{i}\right|\left|X_{j}(t+\Delta t)-z_{j}\right|\right]\right|_{X_{k}(t)=z_{k}}\;\forall\;k=1,2,\ldots,m.\end{array} (59)
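The estimators in Eq. (59) translate directly into a few lines of code. The sketch below assumes an ensemble of sample paths sharing the conditioning state at each step, so that the ensemble mean approximates the conditional expectation; a practical single-trajectory implementation would instead need binning or kernel smoothing over the state, which is omitted here.

```python
import numpy as np

def km_estimators(paths, dt):
    """Drift vector and covariation matrix estimates per Eq. (59).

    paths : array of shape (n_ensemble, m_states, n_time), an ensemble of
            sample paths sharing the state X_k(t) = z_k at each step.
    Returns drift of shape (m, n_time-1) and covariation of shape
    (m, m, n_time-1).
    """
    dx = np.diff(paths, axis=2)
    drift = dx.mean(axis=0) / dt                                   # f_i
    gamma = 0.5 * np.einsum('eit,ejt->ijt', dx, dx) / (dx.shape[0] * dt)
    return drift, gamma
```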

Appendix B Conditional probability distributions of the discontinuous Spike and Slab prior (DSS)

The DSS prior model is described in the Methods section. Drawing samples from the joint distribution directly is intractable but is possible through Markov chain Monte Carlo methods. In the present work, the Gibbs sampling technique is utilized to draw samples of the random variables ${\bm{Y}}$, ${\bm{\theta}}$, ${\bm{Z}}$, $\vartheta_{s}$, $\sigma^{2}$ and $p_{0}$. For drawing samples using Gibbs sampling, each random variable needs to be conditioned on the others. However, due to the DAG structure in Fig. 2, the dependencies of the conditional distributions on the other random variables can be relaxed to some extent as follows:

p\left(p_{0}\mid\bm{Y},\bm{\theta},\bm{Z},\vartheta_{s},\sigma^{2}\right)=p\left(p_{0}\mid\bm{Z}\right) (60)
p\left(\vartheta_{s}\mid\bm{Y},\bm{\theta},\bm{Z},p_{0},\sigma^{2}\right)=p\left(\vartheta_{s}\mid\bm{\theta},\bm{Z},\sigma^{2}\right)
p\left(\sigma^{2}\mid\bm{Y},\bm{\theta},\bm{Z},p_{0},\vartheta_{s}\right)=p\left(\sigma^{2}\mid\bm{Y},\bm{\theta},\bm{Z},\vartheta_{s}\right)
p\left(\bm{\theta}\mid\bm{Y},\bm{Z},p_{0},\vartheta_{s},\sigma^{2}\right)=p\left(\bm{\theta}\mid\bm{Y},\bm{Z},\vartheta_{s},\sigma^{2}\right)
p\left(\bm{Z}\mid\bm{Y},\bm{\theta},p_{0},\vartheta_{s},\sigma^{2}\right)=p\left(\bm{Z}\mid\bm{\theta},p_{0},\vartheta_{s},\sigma^{2}\right).

The elements of the weight vector ${\bm{\theta}}$ for which the latent variable $Z_{k}\neq 1$ become an absorbing state of the Markov chain. However, to achieve a stationary distribution, one needs to construct an irreducible Markov chain. Thus, the conditioning on the weight vector is eliminated by marginalizing the conditional distributions with respect to ${\bm{\theta}}$:

\begin{array}{l}\int p\left(\sigma^{2}\mid{\bm{Y}},{\bm{\theta}},{\bm{Z}},\vartheta_{s}\right)d{\bm{\theta}}=\int\dfrac{p\left(\sigma^{2},{\bm{Y}},{\bm{\theta}},{\bm{Z}},\vartheta_{s}\right)}{p\left({\bm{Y}},{\bm{\theta}},{\bm{Z}},\vartheta_{s}\right)}d{\bm{\theta}}=p\left(\sigma^{2}\mid{\bm{Y}},{\bm{Z}},\vartheta_{s}\right)\\ \int p\left({\bm{Z}}\mid{\bm{Y}},{\bm{\theta}},p_{0},\vartheta_{s},\sigma^{2}\right)d{\bm{\theta}}=\int\dfrac{p\left({\bm{Z}},{\bm{Y}},{\bm{\theta}},p_{0},\vartheta_{s},\sigma^{2}\right)}{p\left({\bm{Y}},{\bm{\theta}},p_{0},\vartheta_{s},\sigma^{2}\right)}d{\bm{\theta}}=\dfrac{p\left({\bm{Z}},{\bm{Y}},p_{0},\vartheta_{s},\sigma^{2}\right)}{p\left({\bm{Y}},p_{0},\vartheta_{s},\sigma^{2}\right)}=p\left({\bm{Z}}\mid{\bm{Y}},p_{0},\vartheta_{s},\sigma^{2}\right).\end{array} (61)

The conditional distribution $p\left(p_{0}\mid{\bm{Z}}\right)$:

\begin{array}{ll}p\left(p_{0}\mid{\bm{Z}}\right)&=p\left({\bm{Z}}\mid p_{0}\right)p\left(p_{0}\right)\\ &\propto\left(\prod_{k=1}^{K}p_{0}^{Z_{k}}\left(1-p_{0}\right)^{1-Z_{k}}\right)\left(p_{0}^{\alpha_{p}-1}\left(1-p_{0}\right)^{\beta_{p}-1}\right)\\ &\propto p_{0}^{\alpha_{p}+h_{Z}-1}\left(1-p_{0}\right)^{\beta_{p}+K-h_{Z}-1},\end{array} (62)

where $h_{Z}=\sum_{k=1}^{K}Z_{k}$. Given the latent vector ${\bm{Z}}$, $p_{0}$ is sampled as $p_{0}\mid{\bm{Z}}\sim Beta\left(\alpha_{p}+h_{Z},\beta_{p}+K-h_{Z}\right)$.

The conditional distribution $p\left(\vartheta_{s}\mid{\bm{\theta}},{\bm{Z}},\sigma^{2}\right)$:

\begin{array}{ll}p\left(\vartheta_{s}\mid\bm{\theta},\bm{Z},\sigma^{2}\right)&\propto p\left(\bm{\theta}\mid\bm{Z},\vartheta_{s},\sigma^{2}\right)p\left(\vartheta_{s}\right)p(\bm{Z})p\left(\sigma^{2}\right)\\ &\propto p\left(\bm{\theta}\mid\bm{Z},\vartheta_{s},\sigma^{2}\right)p\left(\vartheta_{s}\right)\\ &\propto\mathcal{N}\left(\bm{\theta}_{r}\mid\mathbf{0},\sigma^{2}\vartheta_{s}\mathbf{R}_{0,r}\right)\mathcal{I}\mathcal{G}\left(\alpha_{\vartheta},\beta_{\vartheta}\right)\\ &\propto\dfrac{1}{\left(\vartheta_{s}\sigma^{2}\right)^{r/2}\left|\mathbf{R}_{0,r}\right|^{1/2}}\exp\left(-\dfrac{\bm{\theta}_{r}^{T}\mathbf{R}_{0,r}^{-1}\bm{\theta}_{r}}{2\sigma^{2}\vartheta_{s}}\right)\vartheta_{s}^{-\alpha_{\vartheta}-1}\exp\left(-\dfrac{\beta_{\vartheta}}{\vartheta_{s}}\right)\\ &\propto\vartheta_{s}^{-\left(\alpha_{\vartheta}+\frac{r}{2}\right)-1}\exp\left(-\dfrac{\beta_{\vartheta}+\left(\bm{\theta}_{r}^{T}\mathbf{R}_{0,r}^{-1}\bm{\theta}_{r}/2\sigma^{2}\right)}{\vartheta_{s}}\right),\end{array} (63)

where $r$ is the number of elements of the weight vector ${\bm{\theta}}$ for which the latent variable $Z_{k}=1$. In the above, it is to be noted that $p(\bm{Z})$ and $p\left(\sigma^{2}\right)$ are constants when $\vartheta_{s}$ is sampled; this can be observed from the DAG structure in Fig. 2. Thus, given $\bm{\theta}$, $\bm{Z}$ and $\sigma^{2}$, $\vartheta_{s}$ is sampled as $\vartheta_{s}\mid\bm{\theta},\bm{Z},\sigma^{2}\sim\mathcal{I}\mathcal{G}\left(\alpha_{\vartheta}+\dfrac{r}{2},\beta_{\vartheta}+\dfrac{\bm{\theta}_{r}^{T}\mathbf{R}_{0,r}^{-1}\bm{\theta}_{r}}{2\sigma^{2}}\right)$.

The conditional distribution $p\left(\sigma^{2}\mid{\bm{Y}},{\bm{\theta}},{\bm{Z}},\vartheta_{s}\right)$:

\begin{array}{ll}p\left(\sigma^{2}\mid\bm{Y},\bm{Z},\vartheta_{s}\right)&\propto\int p\left(\bm{Y},\bm{\theta},\bm{Z},\vartheta_{s},\sigma^{2}\right)d\bm{\theta}\\ &\propto\left(\int p\left(\bm{Y}\mid\bm{\theta},\sigma^{2}\right)p\left(\bm{\theta}\mid\bm{Z},\vartheta_{s},\sigma^{2}\right)d\bm{\theta}\right)p(\bm{Z})p\left(\vartheta_{s}\right)p\left(\sigma^{2}\right)\\ &\propto\left(\int p\left(\bm{Y}\mid\bm{\theta},\sigma^{2}\right)p\left(\bm{\theta}\mid\bm{Z},\vartheta_{s},\sigma^{2}\right)d\bm{\theta}\right)p\left(\sigma^{2}\right).\end{array} (64)

From the DAG structure it can be understood that $p(\bm{Z})$ and $p\left(\vartheta_{s}\right)$ act as constants when $\sigma^{2}$ is sampled. Since the $\theta_{k}$-values corresponding to the spike distribution do not contribute to the basis function selection, only the remaining weights belonging to the slab, denoted by $\bm{\theta}_{r}$, are used. Then, one can expand the integrand $p\left({\bm{Y}}\mid\bm{\theta},\sigma^{2}\right)p\left(\bm{\theta}\mid{\bm{Z}},\vartheta_{s},\sigma^{2}\right)$ as,

\begin{array}{ll}p\left(\bm{Y}\mid\bm{\theta},\sigma^{2}\right)p\left(\bm{\theta}\mid\bm{Z},\vartheta_{s},\sigma^{2}\right)&=\mathcal{N}\left(\bm{Y}\mid\mathbf{L}_{r}\bm{\theta}_{r},\sigma^{2}\mathbf{I}_{N\times N}\right)\mathcal{N}\left(\bm{\theta}_{r}\mid\mathbf{0},\sigma^{2}\vartheta_{s}\mathbf{R}_{0,r}\right)\\ &=\dfrac{1}{\left(2\pi\sigma^{2}\right)^{N/2}}\exp\left(-\dfrac{\left(\bm{Y}-\mathbf{L}_{r}\bm{\theta}_{r}\right)^{T}\left(\bm{Y}-\mathbf{L}_{r}\bm{\theta}_{r}\right)}{2\sigma^{2}}\right)\dfrac{\left(|\mathbf{R}_{0,r}^{-1}|\right)^{1/2}}{\left(2\pi\vartheta_{s}\sigma^{2}\right)^{r/2}}\exp\left(-\dfrac{\bm{\theta}_{r}^{T}\mathbf{R}_{0,r}^{-1}\bm{\theta}_{r}}{2\sigma^{2}\vartheta_{s}}\right)\\ &=\dfrac{1}{\left(2\pi\sigma^{2}\right)^{N/2}}\dfrac{\left(|\mathbf{R}_{0,r}^{-1}|\right)^{1/2}}{\left(2\pi\vartheta_{s}\sigma^{2}\right)^{r/2}}\exp\left(-\dfrac{\left(\bm{\theta}_{r}-\bm{\mu}\right)^{T}\bm{\Sigma}^{-1}\left(\bm{\theta}_{r}-\bm{\mu}\right)}{2\sigma^{2}}\right)\exp\left(-\dfrac{\left(\bm{Y}^{T}\bm{Y}-\bm{\mu}^{T}\bm{\Sigma}^{-1}\bm{\mu}\right)}{2\sigma^{2}}\right),\end{array} (65)

where $\bm{\Sigma}^{-1}=\left(\mathbf{L}_{r}^{T}\mathbf{L}_{r}+\vartheta_{s}^{-1}\mathbf{R}_{0,r}^{-1}\right)$ and $\bm{\mu}=\bm{\Sigma}\mathbf{L}_{r}^{T}{\bm{Y}}$. Upon integration of the above expression with respect to $\bm{\theta}_{r}$, one obtains:

\begin{array}{ll}p\left(\sigma^{2}\mid\bm{Y},\bm{Z},\vartheta_{s}\right)&\propto\dfrac{1}{\left(\sigma^{2}\right)^{N/2}}\exp\left(-\dfrac{\left(\bm{Y}^{T}\bm{Y}-\bm{\mu}^{T}\bm{\Sigma}^{-1}\bm{\mu}\right)}{2\sigma^{2}}\right)p\left(\sigma^{2}\right)\\ &\propto\dfrac{1}{\left(\sigma^{2}\right)^{N/2}}\exp\left(-\dfrac{\left(\bm{Y}^{T}\bm{Y}-\bm{\mu}^{T}\bm{\Sigma}^{-1}\bm{\mu}\right)}{2\sigma^{2}}\right)\left(\sigma^{2}\right)^{-\alpha_{\sigma}-1}\exp\left(-\dfrac{\beta_{\sigma}}{\sigma^{2}}\right)\\ &\propto\left(\sigma^{2}\right)^{-\left(\alpha_{\sigma}+0.5N\right)-1}\exp\left(-\dfrac{\beta_{\sigma}+\frac{1}{2}\left(\bm{Y}^{T}\bm{Y}-\bm{\mu}^{T}\bm{\Sigma}^{-1}\bm{\mu}\right)}{\sigma^{2}}\right).\end{array} (66)

The random variable $\sigma^{2}$ is therefore sampled as $\sigma^{2}\mid\bm{Y},\bm{Z},\vartheta_{s}\sim\mathcal{IG}\left(\alpha_{\sigma}+\dfrac{N}{2},\;\beta_{\sigma}+\dfrac{1}{2}\left(\bm{Y}^{T}\bm{Y}-\bm{\mu}^{T}\bm{\Sigma}^{-1}\bm{\mu}\right)\right)$.
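A corresponding sketch of this collapsed $\sigma^{2}$ update might look as follows; the names and the hyperparameter defaults ($\alpha_{\sigma},\beta_{\sigma}$) are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.stats import invgamma

def sample_sigma2(Y, L_r, R0r_inv, vartheta_s, a_sig=1e-4, b_sig=1e-4):
    """Draw sigma^2 | Y, Z, vartheta_s from
    IG(a_sig + N/2, b_sig + (Y'Y - mu' Sigma^{-1} mu)/2), theta integrated out."""
    N = Y.size
    Sigma_inv = L_r.T @ L_r + R0r_inv / vartheta_s   # Sigma^{-1} = L_r'L_r + R_{0,r}^{-1}/vartheta_s
    mu = np.linalg.solve(Sigma_inv, L_r.T @ Y)       # mu = Sigma L_r' Y
    quad = Y @ Y - mu @ Sigma_inv @ mu               # Y'Y - mu' Sigma^{-1} mu
    return invgamma.rvs(a=a_sig + 0.5 * N, scale=b_sig + 0.5 * quad)
```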

The conditional distribution $p\left(\bm{\theta}\mid\bm{Y},\bm{Z},\vartheta_{s},\sigma^{2}\right)$: The elements of the weight vector $\bm{\theta}$ corresponding to the spike distribution, i.e. the weights for which the latent variable $Z_{k}=0$, $k=1,\ldots,K$, are assigned the value 0. The conditional distribution for the weights belonging to the slab distribution, denoted by $\bm{\theta}_{r}$, is derived as follows:

$$
\begin{aligned}
p\left(\bm{\theta}_{r}\mid\bm{Y},\vartheta_{s},\sigma^{2}\right) &\propto \mathcal{N}\left(\bm{Y}\mid\mathbf{L}_{r}\bm{\theta}_{r},\sigma^{2}\mathbf{I}_{N\times N}\right) \mathcal{N}\left(\bm{\theta}_{r}\mid\bm{0},\sigma^{2}\vartheta_{s}\mathbf{R}_{0,r}\right) \\
&\propto \exp\left(-\frac{\left(\bm{Y}-\mathbf{L}_{r}\bm{\theta}_{r}\right)^{T}\left(\bm{Y}-\mathbf{L}_{r}\bm{\theta}_{r}\right)}{2\sigma^{2}}\right) \exp\left(-\frac{\bm{\theta}_{r}^{T}\mathbf{R}_{0,r}^{-1}\bm{\theta}_{r}}{2\sigma^{2}\vartheta_{s}}\right) \\
&\propto \exp\left(-\frac{\bm{Y}^{T}\bm{Y}+\bm{\theta}_{r}^{T}\bm{\Sigma}^{-1}\bm{\theta}_{r}-2\bm{\theta}_{r}^{T}\bm{\Sigma}^{-1}\bm{\mu}}{2\sigma^{2}}\right) \\
&\propto \exp\left(-\frac{\left(\bm{\theta}_{r}-\bm{\mu}\right)^{T}\bm{\Sigma}^{-1}\left(\bm{\theta}_{r}-\bm{\mu}\right)}{2\sigma^{2}}\right),
\end{aligned}
\tag{67}
$$

where $\bm{\Sigma}^{-1}=\left(\mathbf{L}_{r}^{T}\mathbf{L}_{r}+\vartheta_{s}^{-1}\mathbf{R}_{0,r}^{-1}\right)$ and $\bm{\mu}=\bm{\Sigma}\mathbf{L}_{r}^{T}\bm{Y}$. The random values of $\bm{\theta}_{r}$ are then sampled as $\bm{\theta}_{r}\mid\bm{Y},\vartheta_{s},\sigma^{2}\sim\mathcal{N}\left(\bm{\mu},\sigma^{2}\bm{\Sigma}\right)$.
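The slab-weight update thus reduces to a single multivariate normal draw. A possible sketch, again with illustrative names and with the spike weights assumed to be zeroed outside this call, is:

```python
import numpy as np

def sample_theta_r(Y, L_r, R0r_inv, vartheta_s, sigma2, rng=None):
    """Draw theta_r | Y, vartheta_s, sigma^2 ~ N(mu, sigma^2 Sigma).

    Weights in the spike (Z_k = 0) are set to zero by the caller;
    only the slab weights theta_r are sampled here."""
    rng = np.random.default_rng() if rng is None else rng
    Sigma = np.linalg.inv(L_r.T @ L_r + R0r_inv / vartheta_s)
    mu = Sigma @ L_r.T @ Y
    return rng.multivariate_normal(mean=mu, cov=sigma2 * Sigma)
```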

The conditional distribution $p\left(\bm{Z}\mid\bm{\theta},p_{0},\vartheta_{s},\sigma^{2}\right)$: The latent variables $Z_{k}$ corresponding to the $k^{th}$ elements of the weight vector are assigned the value 0 or 1 independently. The conditional distribution of $Z_{k}$ is estimated by comparing the probabilities that $Z_{k}$ takes the value 1 or 0, given $\vartheta_{s}$, $p_{0}$, and the remaining entries of the latent vector, $\bm{Z}_{-k}$. Here, $\bm{Z}_{-k}$ denotes the latent vector $\bm{Z}$ with its $k^{th}$ element removed. Let $u_{k}$ denote the probability with which the $k^{th}$ latent variable $Z_{k}$ takes the value 1. The probability $u_{k}$ is then found as

$$
\begin{aligned}
u_{k} &= \frac{p\left(Z_{k}=1\mid\bm{Y},\bm{Z}_{-k},\vartheta_{s},p_{0}\right)}{p\left(Z_{k}=1\mid\bm{Y},\bm{Z}_{-k},\vartheta_{s},p_{0}\right)+p\left(Z_{k}=0\mid\bm{Y},\bm{Z}_{-k},\vartheta_{s},p_{0}\right)} \\
&= \frac{p\left(\bm{Y}\mid Z_{k}=1,\bm{Z}_{-k},\vartheta_{s}\right) p\left(Z_{k}=1\mid p_{0}\right)}{p\left(\bm{Y}\mid Z_{k}=1,\bm{Z}_{-k},\vartheta_{s}\right) p\left(Z_{k}=1\mid p_{0}\right)+p\left(\bm{Y}\mid Z_{k}=0,\bm{Z}_{-k},\vartheta_{s}\right) p\left(Z_{k}=0\mid p_{0}\right)} \\
&= \frac{p\left(\bm{Y}\mid Z_{k}=1,\bm{Z}_{-k},\vartheta_{s}\right) p_{0}}{p\left(\bm{Y}\mid Z_{k}=1,\bm{Z}_{-k},\vartheta_{s}\right) p_{0}+p\left(\bm{Y}\mid Z_{k}=0,\bm{Z}_{-k},\vartheta_{s}\right)\left(1-p_{0}\right)} \\
&= \frac{p_{0}}{p_{0}+\lambda_{k}\left(1-p_{0}\right)},
\end{aligned}
\tag{68}
$$

where $\lambda_{k}=\dfrac{p\left(\bm{Y}\mid Z_{k}=0,\bm{Z}_{-k},\vartheta_{s}\right)}{p\left(\bm{Y}\mid Z_{k}=1,\bm{Z}_{-k},\vartheta_{s}\right)}$. The marginal likelihood function $p\left(\bm{Y}\mid\bm{Z},\vartheta_{s}\right)$ is derived by integrating out the random variables $\bm{\theta}$ and $\sigma^{2}$ from the original likelihood function as follows:

$$
p\left(\bm{Y}\mid\bm{Z},\vartheta_{s}\right)=\int p\left(\bm{Y}\mid\bm{Z},\vartheta_{s},\sigma^{2}\right) p\left(\sigma^{2}\right) d\sigma^{2},
\tag{69}
$$

where $p\left(\bm{Y}\mid\bm{Z},\vartheta_{s},\sigma^{2}\right)$ is obtained by marginalizing the likelihood function with respect to the variable $\bm{\theta}$. Considering only the weights whose corresponding latent variables $Z_{k}=1$ (the slab weights), denoted by $\bm{\theta}_{r}$, the probability $p\left(\bm{Y}\mid\bm{Z},\vartheta_{s},\sigma^{2}\right)$ is obtained as

$$
\begin{aligned}
p\left(\bm{Y}\mid\bm{Z},\vartheta_{s},\sigma^{2}\right) &= \int p\left(\bm{Y},\bm{\theta}\mid\bm{Z},\vartheta_{s},\sigma^{2}\right) d\bm{\theta} \\
&= \int p\left(\bm{Y}\mid\bm{\theta}_{r},\sigma^{2}\right) p\left(\bm{\theta}_{r}\mid\vartheta_{s},\sigma^{2}\right) d\bm{\theta}_{r} \\
&= \int \mathcal{N}\left(\bm{Y}\mid\mathbf{L}_{r}\bm{\theta}_{r},\sigma^{2}\mathbf{I}_{N\times N}\right) \mathcal{N}\left(\bm{\theta}_{r}\mid\bm{0},\sigma^{2}\vartheta_{s}\mathbf{R}_{0,r}\right) d\bm{\theta}_{r} \\
&= \int \frac{1}{\left(2\pi\sigma^{2}\right)^{N/2}} \exp\left(-\frac{\left(\bm{Y}-\mathbf{L}_{r}\bm{\theta}_{r}\right)^{T}\left(\bm{Y}-\mathbf{L}_{r}\bm{\theta}_{r}\right)}{2\sigma^{2}}\right) \frac{|\mathbf{R}_{0,r}^{-1}|^{1/2}}{\left(2\pi\vartheta_{s}\sigma^{2}\right)^{r/2}} \exp\left(-\frac{\bm{\theta}_{r}^{T}\mathbf{R}_{0,r}^{-1}\bm{\theta}_{r}}{2\sigma^{2}\vartheta_{s}}\right) d\bm{\theta}_{r} \\
&= \frac{1}{\left(2\pi\sigma^{2}\right)^{N/2}} \frac{|\mathbf{R}_{0,r}^{-1}|^{1/2}\,|\bm{\Sigma}|^{1/2}}{\left(\vartheta_{s}\right)^{r/2}} \exp\left(-\frac{\bm{Y}^{T}\bm{Y}-\bm{\mu}^{T}\bm{\Sigma}^{-1}\bm{\mu}}{2\sigma^{2}}\right),
\end{aligned}
\tag{70}
$$

where $\bm{\Sigma}=\left(\mathbf{L}_{r}^{T}\mathbf{L}_{r}+\vartheta_{s}^{-1}\mathbf{R}_{0,r}^{-1}\right)^{-1}$ and $\bm{\mu}=\bm{\Sigma}\mathbf{L}_{r}^{T}\bm{Y}$. The operator $|\cdot|$ denotes the determinant. With the above result, $p\left(\bm{Y}\mid\bm{Z},\vartheta_{s}\right)$ can be derived as

$$
\begin{aligned}
p\left(\bm{Y}\mid\bm{Z},\vartheta_{s}\right) &= \frac{|\mathbf{R}_{0,r}^{-1}|^{1/2}\,|\bm{\Sigma}|^{1/2}}{(2\pi)^{N/2}\left(\vartheta_{s}\right)^{r/2}} \int \frac{1}{\left(\sigma^{2}\right)^{N/2}} \exp\left(-\frac{\bm{Y}^{T}\bm{Y}-\bm{\mu}^{T}\bm{\Sigma}^{-1}\bm{\mu}}{2\sigma^{2}}\right) \mathcal{IG}\left(\alpha_{\sigma},\beta_{\sigma}\right) d\sigma^{2} \\
&= \frac{|\mathbf{R}_{0,r}^{-1}|^{1/2}\,|\bm{\Sigma}|^{1/2}}{(2\pi)^{N/2}\left(\vartheta_{s}\right)^{r/2}} \frac{\left(\beta_{\sigma}\right)^{\alpha_{\sigma}}}{\Gamma\left(\alpha_{\sigma}\right)} \int \left(\sigma^{2}\right)^{-\left(\alpha_{\sigma}+N/2+1\right)} \exp\left(-\frac{\beta_{\sigma}+\frac{1}{2}\left(\bm{Y}^{T}\bm{Y}-\bm{\mu}^{T}\bm{\Sigma}^{-1}\bm{\mu}\right)}{\sigma^{2}}\right) d\sigma^{2} \\
&= \frac{|\mathbf{R}_{0,r}^{-1}|^{1/2}\,|\bm{\Sigma}|^{1/2}}{(2\pi)^{N/2}\left(\vartheta_{s}\right)^{r/2}} \frac{\left(\beta_{\sigma}\right)^{\alpha_{\sigma}}}{\Gamma\left(\alpha_{\sigma}\right)} \frac{\Gamma\left(\alpha_{\sigma}+\frac{N}{2}\right)}{\left(\beta_{\sigma}+\frac{1}{2}\left(\bm{Y}^{T}\bm{Y}-\bm{\mu}^{T}\bm{\Sigma}^{-1}\bm{\mu}\right)\right)^{\alpha_{\sigma}+\frac{N}{2}}},
\end{aligned}
\tag{71}
$$

where $\Gamma(\cdot)$ denotes the Gamma function. To summarise, the latent variables $Z_{k}$ are sampled from the Bernoulli distribution with parameter $u_{k}$ as $Z_{k}\mid\bm{Y},\vartheta_{s},p_{0}\sim\mathrm{Bern}\left(u_{k}\right)$ for $k=1,\ldots,K$, where the marginalised likelihood function is obtained as

$$
p\left(\bm{Y}\mid\bm{Z},\vartheta_{s}\right)=
\begin{cases}
\dfrac{\left(\beta_{\sigma}\right)^{\alpha_{\sigma}}}{\Gamma\left(\alpha_{\sigma}\right)} \dfrac{\Gamma\left(\alpha_{\sigma}+\frac{N}{2}\right)}{(2\pi)^{N/2}} \dfrac{1}{\left(\beta_{\sigma}+\frac{1}{2}\,\bm{Y}^{T}\bm{Y}\right)^{\alpha_{\sigma}+\frac{N}{2}}}, & \text{when all } \{Z_{k};k=1,\ldots,K\}=0, \\[2ex]
\dfrac{\left(\beta_{\sigma}\right)^{\alpha_{\sigma}}}{\Gamma\left(\alpha_{\sigma}\right)} \dfrac{\Gamma\left(\alpha_{\sigma}+\frac{N}{2}\right)}{(2\pi)^{N/2}\left(\vartheta_{s}\right)^{r/2}} \dfrac{|\mathbf{R}_{0,r}^{-1}|^{1/2}\,|\bm{\Sigma}|^{1/2}}{\left(\beta_{\sigma}+\frac{1}{2}\left(\bm{Y}^{T}\bm{Y}-\bm{\mu}^{T}\bm{\Sigma}^{-1}\bm{\mu}\right)\right)^{\alpha_{\sigma}+\frac{N}{2}}}, & \text{otherwise.}
\end{cases}
\tag{72}
$$
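Putting the pieces together, the $Z_{k}$ update can be sketched in Python as below. The log of Eq. (72) is used for numerical stability. The helper R0_inv_fn (returning $\mathbf{R}_{0,r}^{-1}$ for the active columns), the fixed $p_{0}$, and the hyperparameter defaults are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_likelihood(Y, L_full, Z, vartheta_s, R0_inv_fn,
                            a_sig=1e-4, b_sig=1e-4):
    """log p(Y | Z, vartheta_s) following Eq. (72); handles the
    all-spike case r = 0 separately."""
    N = Y.size
    idx = np.flatnonzero(Z)          # active (slab) columns, Z_k = 1
    r = idx.size
    logp = (a_sig * np.log(b_sig) - gammaln(a_sig)
            + gammaln(a_sig + 0.5 * N) - 0.5 * N * np.log(2.0 * np.pi))
    if r == 0:
        return logp - (a_sig + 0.5 * N) * np.log(b_sig + 0.5 * (Y @ Y))
    L_r = L_full[:, idx]
    R0r_inv = R0_inv_fn(idx)         # hypothetical helper: inverse slab correlation
    Sigma_inv = L_r.T @ L_r + R0r_inv / vartheta_s
    mu = np.linalg.solve(Sigma_inv, L_r.T @ Y)
    quad = Y @ Y - mu @ Sigma_inv @ mu
    _, logdet_Rinv = np.linalg.slogdet(R0r_inv)
    _, logdet_Sinv = np.linalg.slogdet(Sigma_inv)
    logp += (0.5 * logdet_Rinv - 0.5 * logdet_Sinv   # +0.5 log|Sigma|
             - 0.5 * r * np.log(vartheta_s))
    return logp - (a_sig + 0.5 * N) * np.log(b_sig + 0.5 * quad)

def sample_Z(Y, L_full, Z, vartheta_s, p0, R0_inv_fn, rng=None):
    """One Gibbs sweep over the latent indicators, Z_k ~ Bern(u_k), Eq. (68)."""
    rng = np.random.default_rng() if rng is None else rng
    Z = Z.copy()
    for k in range(Z.size):
        Z1, Z0 = Z.copy(), Z.copy()
        Z1[k], Z0[k] = 1, 0
        # lambda_k = p(Y | Z_k = 0, .) / p(Y | Z_k = 1, .), computed in log space
        log_lam = (log_marginal_likelihood(Y, L_full, Z0, vartheta_s, R0_inv_fn)
                   - log_marginal_likelihood(Y, L_full, Z1, vartheta_s, R0_inv_fn))
        u_k = p0 / (p0 + np.exp(log_lam) * (1.0 - p0))
        Z[k] = rng.binomial(1, u_k)
    return Z
```

A full Gibbs iteration under these assumptions would cycle through sample_Z, sample_theta_r, sample_vartheta_s, and sample_sigma2 in turn.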

Acknowledgements

T. Tripura acknowledges the financial support received from the Ministry of Human Resource Development (MHRD), India, in the form of the Prime Minister's Research Fellowship (PMRF). S. Chakraborty acknowledges the financial support received from the Science and Engineering Research Board (SERB) through grant no. SRG/2021/000467 and the seed grant received from IIT Delhi.

Declarations

Funding

The corresponding author received funding from IIT Delhi in the form of a seed grant.

Conflicts of interest

The authors declare that they have no conflict of interest.

Availability of data and material

The datasets generated and/or analysed during the current study are available from the corresponding author on reasonable request.

Code availability

The Python codes written for this work are available from the corresponding author on reasonable request.
