
Learning governing physics from output only measurements

Tapas Tripura
Department of Applied Mechanics
Indian Institute of Technology Delhi
[email protected]
Souvik Chakraborty
Department of Applied Mechanics
School of Artificial Intelligence (ScAI)
Indian Institute of Technology Delhi
[email protected]
Abstract

Extracting governing physics from data is a key challenge in many areas of science and technology. Existing techniques for equation discovery depend on both input and state measurements; in practice, however, we often have access only to the output measurements. We here propose a novel framework for learning the governing physics of a dynamical system from output-only measurements; this essentially transfers the physics discovery problem from the deterministic to the stochastic domain. The proposed approach models the input as a stochastic process and blends concepts from stochastic calculus, sparse learning algorithms, and Bayesian statistics. In particular, we combine the sparsity-promoting spike and slab prior, Bayes' law, and the Euler-Maruyama scheme to identify the governing physics from data. The resulting model is highly efficient and works with sparse, noisy, and incomplete output measurements. The efficacy and robustness of the proposed approach are illustrated on several numerical examples involving both complete and partial state measurements. The results obtained indicate the potential of the proposed approach for identifying governing physics from output-only measurements.

Keywords: Sparse Bayesian learning · explainable artificial intelligence · stochastic differential equation · equation discovery · probabilistic machine learning

1 Introduction

Physical systems are governed by physical laws, often represented either in the form of ordinary differential equations (ODEs) or partial differential equations (PDEs). The literature on techniques for the solution of ODEs/PDEs is quite mature, and given sufficient computational resources, it is possible to solve the governing ODE/PDE with sufficient accuracy. Modern techniques such as physics-informed deep learning [1, 2, 3, 4] and neural network based operator learning [5, 6] can also be used. However, due to various assumptions and approximations, the governing equations often fail to represent the exact physics of the system, resulting in modelling and prediction errors. With advancements in sensor technology and the internet of things (IoT), we have access to data in abundance, whereas the underlying governing physics often remains elusive, specifically in domains such as climate science [7], biology [8], physics [9], chemistry [10], and finance [11], to name a few. One obvious alternative is to train a conventional data-driven machine learning model to obtain an input-output mapping; however, such models often do not generalize to unseen environments and out-of-distribution inputs. To address this issue, methods for data-driven equation discovery have recently been proposed. A seminal work in this area was carried out by [12], wherein symbolic regression was utilized for discovering the underlying structure of the data. Another important work in this area is that of Brunton et al. [13]. However, most of the work in this area is limited by the fact that information on both the input and the state vector is needed. In practice, the input is often not measurable and only a few states are accurately observable; this greatly limits the applicability of the available physics discovery techniques. To address this issue, we propose a novel approach that leverages sparse learning, Bayesian statistics, and stochastic calculus and enables discovery of governing physics from output-only measurements.

Equation discovery techniques have made significant progress, from the early 1970s, when equation discovery was mostly dependent on expertise in the area of interest, to modern machine learning techniques that can extract the physical law of the underlying process with only little human supervision. In the early 1970s, models were selected by maintaining a balance between model complexity and fitness, with the fitness of candidate models measured using the Akaike (AIC) and Bayesian (BIC) information criteria [14, 15]. Later, advances in machine learning tools in combination with new sophisticated measurement instruments gave rise to data-driven techniques for the discovery of governing physics. A first attempt in this direction combined symbolic regression with genetic programming to select, from candidate functional forms, the combination of functions that most accurately represents the data [12, 16]. There are also equation-free methods that bypass the need to formulate constitutive equations in closed form to track the time evolution of dynamical systems. These methods provide a practical recipe for multiscale simulations through local microscopic simulations in time and space [17].

Following the work on symbolic regression [12, 16], Sparse Identification of Nonlinear Dynamics (SINDy) was proposed for discovering the governing physics of non-linear dynamical systems [13]. In terms of accuracy and interpretability of the discovered physics, the SINDy framework overcame the shortcomings of previous discovery techniques. The idea behind SINDy is to use sparse linear regression, in the purview of least-squares fitting, to select the most dominant candidate functions that best represent the data. Owing to its sparsity-promoting approach, SINDy proved to be computationally efficient and scalable with increasing input dimension. In later years, SINDy has seen tremendous application and area-specific development of sparse linear regression in various fields. Examples include sparse identification of biological networks [18] and of chemical reactions [19, 20], sparse modelling for state estimation in fluid mechanics [21, 22], system identification of structures with hysteresis and sparse learning of the aerodynamics of bridges [23, 24], sparse model selection using an integral formulation [25], sparse model selection of dynamical systems using information criteria [26], sparse identification for predictive control [27], identification of structured dynamical systems with limited data [28], model identification via recovery of differential equations from short impulse-response time-series data [29], identification of stochastic dynamic equations [30], and discovery of partial differential equations [31, 32], among others.

Apart from SINDy, identification of non-linear dynamical systems using deep neural networks has also been proposed [33, 2, 34]. The deep learning approaches for non-linear dynamical systems exist as black boxes that take the data as input and provide predictions as output. Later, a different approach within the Bayesian framework was proposed for the discovery of governing physics from noisy and/or limited data [35, 36, 37]. In this approach, least-squares based sparse linear regression is replaced by the more accurate sparse Bayesian linear regression. Owing to the ability of Bayesian inference to perform simultaneous model selection and parameter estimation, this approach eliminates the possibility of overfitting the governing model. Sparsity in the solution is promoted by assigning appropriate sparsity-promoting priors over the weight vector of the sparse regression problem; this provides a natural elimination of the basis functions that do not represent the data accurately. As output, this framework provides the posterior distributions of the weight vector; this means one can accurately quantify the uncertainty in the parameters and construct confidence intervals for future predictions.

Despite the recent progress in equation discovery from data, almost all approaches require measurements of both the state and input variables; the only exception is perhaps stochastic SINDy [30]. Unfortunately, we often have access only to the state variables; this significantly limits the applicability of the available approaches. We hereby propose an equation discovery framework, rooted in stochastic calculus, sparse learning, and Bayesian statistics, that requires only the state measurements. The basic premise is to treat the unknown input as a stochastic process, model it as Gaussian white noise, and identify the resulting stochastic differential equation (SDE); this essentially amounts to identifying the drift and diffusion components of the SDE. We utilize stochastic calculus to decouple the drift and diffusion parts, which allows parallel identification of the two components. Sparse learning and Bayesian statistics, on the other hand, ensure that the identified governing equation is sparse and interpretable; this is achieved by using the spike and slab prior [38, 39]. The key features of the proposed approach can be encapsulated in the following points:

  • First and foremost, unlike other approaches, the proposed approach requires only the state measurements; no input measurements are needed.

  • The proposed approach, being Bayesian in nature, estimates the posterior distribution. This allows computation of the epistemic uncertainty (aka predictive uncertainty) due to limited data. This feature is particularly important in design and decision making.

  • Unlike non-Bayesian approaches, the proposed approach is autonomous in the sense that it does not require any human calibration or cross-validation.

  • The proposed framework falls under the broad umbrella of interpretable machine learning and explainable AI. In other words, the identified model generalizes to unseen environments and out-of-distribution data.

The remainder of the paper is structured as follows: in Section 2, the proposed data-driven framework for learning governing physics without input measurements is presented. In Section 3, numerical experiments using fully and partially observed output measurements are undertaken to demonstrate the robustness of the proposed framework. In Section 4, the key contributions of the proposed framework are summarised and concluding remarks are drawn.

Figure 1: Schematic illustration of the proposed Bayesian-SDE framework for the discovery of stochastic dynamical systems. The proposed framework integrates two key ideas for identification of stochastic systems in terms of SDEs. The first key step is the construction of the target vector for the sparse linear regression. For this purpose the framework utilises the Kramers-Moyal expansion to estimate the drift $f({\bm{X}}_{t},t)$ and diffusion $g({\bm{X}}_{t},t)$ terms from the linear and quadratic variations of the measured sample paths. The linear variation $\lim_{\Delta t\to 0}\left\langle X(t+\Delta t)-z\right\rangle;\ X(t)=z$ denotes the sum of the increments of the sample paths, and the quadratic variation $\lim_{\Delta t\to 0}\left\langle\left|X(t+\Delta t)-z\right|^{2}\right\rangle;\ X(t)=z$ denotes the sum of the squared increments of the sample paths, in the limiting sense. The second step is the formulation of the sparse Bayesian linear regression for the above target vector. Given an ensemble of time histories, the framework obtains a sequence of libraries by evaluating the candidate basis functions $\{\upsilon_{k}({\bm{X}}_{t});\ k=1,2,\ldots,K\}$ over the ensembles. The ensemble mean is taken over all the libraries, and the averaged library ${\bf{L}}\in\mathbb{R}^{N\times K}$ is used for the sparse linear regression. The candidate functions in the library are parameterized by a weight vector ${\bm{\theta}}$. To identify each of the drift terms of an $m$-dimensional diffusion process, this regression is performed $m$ times. The diffusion terms are not directly recoverable, but their discovery is possible in terms of their covariances; this requires $m(m+1)/2$ regressions to discover all the terms of the covariance matrix. Each element of the weight vector ${\bm{\theta}}$ is assigned a latent variable $Z_{k};\ k=1,\ldots,K$ in order to classify the weights into spikes and slabs. After the Bayesian regression, the marginal posterior inclusion probabilities (PIP) $p(Z_{k}=1|{\bm{Y}})$ are estimated. The final model includes only those basis functions whose marginal PIP values exceed a threshold.

2 Discovery of governing physics without input measurement

The overarching framework of the proposed Bayesian physics discovery approach is presented in Fig. 1. The framework treats the state measurements from the sensors as a stochastic process and applies the Kramers-Moyal formula to obtain the target vectors for sparse linear regression. In the sparse regression, the library is obtained by taking the ensemble mean over all the libraries constructed from the collected data set. The framework then constructs two sparse regression problems from these libraries and target vectors for identification of the governing physics; the two problems are independent and can be solved simultaneously. The discovery of governing physics here involves identification of the system dynamics and of the volatility arising from environmental uncertainty; these are represented in this framework as the drift and diffusion components of an SDE.

Before formulating the mathematical backbone of the proposed framework, let us formalize the problem through the following terminology. Let ${\bf{L}}\in\mathbb{R}^{N\times K}$ be the library of candidate functions, where $K$ is the total number of independent basis functions in the library and $N$ is the sample length. The key idea of sparse Bayesian linear regression is to select $k$ basis functions from the $K$ candidate library functions such that $k\ll K$. If ${\bm{Y}}\in\mathbb{R}^{N}$ is the target vector, then a linear combination of the identified basis functions should best represent ${\bm{Y}}$; the coefficients of this linear combination are denoted by a weight vector ${\bm{\theta}}$. For Bayesian model discovery, the condition $k\ll K$ is maintained by assigning prior distributions to the weight vector ${\bm{\theta}}$ that favour sparsity in the solution. In this context, the spike and slab (SS) prior exhibits a strong shrinkage property due to its sharp spike at zero and a diffuse density spanning a large range of possible parameter values. The Dirac-delta spike concentrates most of the probability mass at zero, thus allowing most of the samples to take the value zero, while the diffuse tail distributes a small amount of probability mass over a large range of values, allowing only a few samples with very high probability to escape the shrinkage. The alternative flavours of the SS-prior, constructed from various combinations of candidate spike and slab functions, are well documented in the literature [38, 39, 40]. In this work, we consider the discontinuous spike and slab prior (DSS-prior), where the spike at zero is modeled as a Dirac-delta function and the tail distribution as an independent Student's-t distribution; a small sampling sketch is given below.
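To make the DSS-prior concrete, the following minimal sketch (in Python/NumPy; the hyperparameter names and values are illustrative, not taken from the paper) draws a weight vector from a Dirac-delta spike at zero mixed with a Student's-t slab:

```python
import numpy as np

def sample_dss_prior(K, p0=0.2, df=3.0, scale=1.0, seed=None):
    """One draw from a discontinuous spike-and-slab prior: with probability
    p0 a weight comes from a Student's-t slab, otherwise it sits exactly
    at zero (the Dirac-delta spike). Hyperparameters are illustrative."""
    rng = np.random.default_rng(seed)
    z = rng.random(K) < p0                    # latent indicators Z_k
    theta = np.zeros(K)                       # spike: exact zeros
    theta[z] = scale * rng.standard_t(df, size=int(z.sum()))  # slab draws
    return theta, z.astype(int)

theta, z = sample_dss_prior(K=10, seed=0)
print(z, theta)   # most entries are exactly zero -> sparse draw
```

Most components land exactly at zero, which is precisely the shrinkage mechanism exploited later by the Gibbs sampler. Next, a sparse regression framework is formulated for SDEs.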

2.1 Sparse learning of stochastic differential equations

Dynamical processes are generally expressed in terms of higher-order differential equations. However, in most data-driven machine learning techniques, the actual system is not observed in its original space but identified in a projected space containing all the states of the system that are directly observable. For instance, second-order dynamical systems are often expressed in terms of their displacement and velocity components. Let the projection be realized by a map $\mathcal{T}:\mathbb{R}^{d}\to\mathbb{R}^{m}$ with $m<d$. If such a mapping $\mathcal{T}$ exists, then any higher-order system can be reduced to a set of SDEs of the form:

\dot{\bm{X}}={\bm{f}}({\bm{X}}_{t},t)+{\bm{g}}({\bm{X}}_{t},t){\bm{\zeta}}(t), (1)

where ${\bm{f}}({\bm{X}}_{t},t)$ is the deterministic dynamics of the underlying process, ${\bm{g}}({\bm{X}}_{t},t)$ is the volatility associated with the dynamics arising from the stochastic external input, and ${\bm{\zeta}}(t)$ is white noise. A more appropriate representation of the above equation can be derived through first-order Itô SDEs, which arise naturally in non-linear dynamical systems subjected to stochastic excitation such as earthquake, wind, and wave forces [41, 42]. For non-linear stochastic dynamical systems, SDEs provide a splendid framework for expressing the behavior of the underlying dynamics. An SDE is generally defined in terms of its deterministic drift and its stochastic diffusion components, where both components are allowed to depend on time and the system states. Let $(\Omega,\mathcal{F},P)$ be a probability space and $\{\mathcal{F}_{t},0\leq t\leq T\}$ be the natural filtration constructed from sub-$\sigma$-algebras of $\mathcal{F}$. Under the probability space $(\Omega,\mathcal{F},P)$, an $m$-dimensional $n$-factor SDE driven by the $n$-dimensional Brownian motion $\{{\bm{B}}_{j}(t),\ j=1,\ldots,n\}$ is defined as,

d{\bm{X}}_{t}={\bm{f}}\left({\bm{X}}_{t},t\right)dt+\sum_{j=1}^{n}{\bm{g}}_{j}\left({\bm{X}}_{t},t\right)d{\bf{B}}_{j}(t);\quad{\bm{X}}(t=t_{0})={\bm{X}}_{0};\quad t\in[0,T], (2)

where ${\bm{X}}_{t}\in\mathbb{R}^{m}$ denotes the $\mathcal{F}_{t}$-measurable state vector, ${\bm{f}}({\bm{X}}_{t},t):\mathbb{R}^{m}\mapsto\mathbb{R}^{m}$ is the drift vector, ${\bm{g}}({\bm{X}}_{t},t):\mathbb{R}^{m}\mapsto\mathbb{R}^{m\times n}$ is the diffusion matrix, and ${\bm{B}}_{j}(t)\in\mathbb{R}^{n}$ is the Brownian motion. The white noise is the generalized derivative of Brownian motion, i.e. ${\bm{\zeta}}(t)=\dot{\bm{B}}(t)$; this ensures that Eq. (1) has a corresponding SDE of the form of Eq. (2). In compact matrix notation, the SDE can be expressed as,

d{\bm{X}}_{t}={\bm{f}}\left({\bm{X}}_{t},t\right)dt+{\bm{g}}\left({\bm{X}}_{t},t\right)d{\bm{B}}(t);\quad{\bm{X}}(t=t_{0})={\bm{X}}_{0};\quad t\in[0,T]. (3)

The solution of Eq. (3) can be obtained using various stochastic integration schemes [43, 44, 45, 46]. Discovery of the governing physics in terms of an SDE requires identification of the drift and diffusion components from output measurements. One challenge arising in the identification of SDEs from output data is the non-differentiability of Brownian motion: ${\bm{B}}(t)$ is not a well-defined mathematical function in the classical sense, since it is nowhere differentiable with respect to the process ${\bm{X}}(t)$; its continuity holds only in the mean-square sense. Formally, if the interval $s\in[0,t]$ is partitioned into $n$ parts as $\mathcal{P}_{n}(0,t):=[0=s_{0}<s_{1}<\ldots<s_{n}=t]$, then for a random process $w$ the quadratic variation $Q_{n}(w,t)=\sum_{i}^{n}\left|w(s_{i})-w(s_{i-1})\right|^{2}$ converges to $Q(w,t)$. From the properties of Brownian motion it can be shown that $Q(B,t)=t$; that is, $B(t)$ has unbounded linear variation but non-vanishing quadratic variation, whereas deterministic functions have finite linear variation but zero quadratic variation. Under these properties, the Kramers-Moyal formula suggests that the drift and diffusion components of an SDE can be extracted from the sample time history in terms of their linear and quadratic variations, respectively. Since the continuity of Brownian motion is defined only in the mean-square sense, the diffusion components do not have finite variation but are bounded by their quadratic variation; thus the diffusion components, unlike the drift, are recoverable only through their covariation terms. The linear and quadratic variations denote the first and second moments of the sample increments. Leveraging these facts, the drift and diffusion components of the SDE in Eq. (3) can be expressed in terms of the sample time history as follows (a simplified detailed derivation is presented in A):

{\bm{f}}_{i}\left({\bm{X}}_{t},t\right)=\lim_{\Delta t\to 0}\dfrac{1}{\Delta t}E\left[X_{i}(t+\Delta t)-\xi_{i}\right]\Big|_{x_{k}(t)=\xi_{k}}\quad\forall\;k=1,2,\ldots,N
{\bf{\Gamma}}_{ij}\left({\bm{X}}_{t},t\right)=\dfrac{1}{2}\lim_{\Delta t\to 0}\dfrac{1}{\Delta t}E\left[\left|X_{i}(t+\Delta t)-\xi_{i}\right|\left|X_{j}(t+\Delta t)-\xi_{j}\right|\right]\Big|_{x_{k}(t)=\xi_{k}}\quad\forall\;k=1,2,\ldots,N, (4)

where ${\bm{f}}_{i}({\bm{X}}_{t},t)$ is the $i^{th}$ drift component and ${\bf{\Gamma}}_{ij}$ is the $(ij)^{th}$ component of the diffusion covariance matrix ${\bf{\Gamma}}\in\mathbb{R}^{n\times n}:={\bm{g}}(t,{\bm{X}}_{t}){\bm{g}}(t,{\bm{X}}_{t})^{T}$. We assume that the analytical forms of the drift and diffusion components can be obtained from the measured sample paths in terms of some basis functions. Let $\upsilon_{k}({\bm{X}}_{t})$ represent the various linear and non-linear mathematical functions evaluated on the system states, and let $N$ be the length of the sample path. Also let ${\bf{L}}^{f}\in\mathbb{R}^{N\times K}:=[\upsilon_{1}^{f}({\bm{X}}_{t}),\ldots,\upsilon_{K}^{f}({\bm{X}}_{t})]$ and ${\bf{L}}^{g}\in\mathbb{R}^{N\times K}:=[\upsilon_{1}^{g}({\bm{X}}_{t}),\ldots,\upsilon_{K}^{g}({\bm{X}}_{t})]$ be the libraries of candidate functions constructed from the set $\{\upsilon_{k}({\bm{X}}_{t}),\ k=1,\ldots,K\}$; the candidate functions may take several functional forms such as polynomial, trigonometric, etc. Then the $i^{th}$ drift component and the $(ij)^{th}$ term of the diffusion covariance matrix can be expressed as linear combinations of the library functions as $f_{i}({\bm{X}}_{t},t)=\sum_{k=1}^{K}\upsilon_{k}({\bm{X}}_{t})\theta_{i,k}$, $i=1,\ldots,m$, and ${\bf{\Gamma}}_{ij}=\sum_{k=1}^{K}\upsilon_{k}({\bm{X}}_{t})\theta_{ij,k}$, $i,j=1,\ldots,m$; briefly,

{\bm{f}}_{i}({\bm{X}}_{t},t)=\theta_{i_{1}}^{f}\upsilon_{1}^{f}({\bm{X}}_{t})+\ldots+\theta_{i_{k}}^{f}\upsilon_{k}^{f}({\bm{X}}_{t})+\ldots+\theta_{i_{K}}^{f}\upsilon_{K}^{f}({\bm{X}}_{t})
{\bf{\Gamma}}_{ij}\left({\bm{X}}_{t},t\right)=\theta_{ij_{1}}^{g}\upsilon_{1}^{g}({\bm{X}}_{t})+\ldots+\theta_{ij_{k}}^{g}\upsilon_{k}^{g}({\bm{X}}_{t})+\ldots+\theta_{ij_{K}}^{g}\upsilon_{K}^{g}({\bm{X}}_{t}), (5)

where $\theta_{i_{k}}^{f}$ and $\theta_{ij_{k}}^{g}$ are the weights associated with the $k^{th}$ basis function of the drift and diffusion covariance components, respectively. In general the libraries ${\bm{L}}^{f}$ and ${\bm{L}}^{g}$ can be different, but one can construct a single library that accounts for all the basis functions for the discovery of both the drift and diffusion terms. In a general setting, the discovery of the drift terms culminates in the following regression problem,

\left[{\bm{Y}}_{1}\ {\bm{Y}}_{2}\ \ldots\ {\bm{Y}}_{m}\right]=\underbrace{\left[\begin{array}{cccc}\upsilon_{1}(X_{1,1}\ldots X_{m,1})&\upsilon_{2}(X_{1,1}\ldots X_{m,1})&\cdots&\upsilon_{K}(X_{1,1}\ldots X_{m,1})\\ \upsilon_{1}(X_{1,2}\ldots X_{m,2})&\upsilon_{2}(X_{1,2}\ldots X_{m,2})&\cdots&\upsilon_{K}(X_{1,2}\ldots X_{m,2})\\ \vdots&\vdots&\ddots&\vdots\\ \upsilon_{1}(X_{1,N}\ldots X_{m,N})&\upsilon_{2}(X_{1,N}\ldots X_{m,N})&\cdots&\upsilon_{K}(X_{1,N}\ldots X_{m,N})\end{array}\right]}_{{\bf{L}}\in\mathbb{R}^{N\times K}}\underbrace{\left[\begin{array}{cccc}\theta_{1,1}&\theta_{2,1}&\cdots&\theta_{m,1}\\ \theta_{1,2}&\theta_{2,2}&\cdots&\theta_{m,2}\\ \vdots&\vdots&\ddots&\vdots\\ \theta_{1,K}&\theta_{2,K}&\cdots&\theta_{m,K}\end{array}\right]}_{{\bm{\theta}}\in\mathbb{R}^{K\times m}}. (6)

The discovery problem for the diffusion components can be formulated in the same way and is omitted here for brevity. As a subset of the above regression problem, one can solve the following equations and identify the SDE components independently,

{\bm{Y}}_{i}={\bf{L}}^{f}{\bm{\theta}}_{i}^{f}+{\bm{\varepsilon}}_{i}, (7a)
{\bm{Y}}_{ij}={\bf{L}}^{g}{\bm{\theta}}_{ij}^{g}+{\bm{\eta}}_{ij}, (7b)

where ${\bm{Y}}_{i}$ and ${\bm{Y}}_{ij}$ are the target vectors of the sparse regression problems associated with the $i^{th}$ drift component and the $(ij)^{th}$ diffusion covariance term, respectively, and ${\bm{\varepsilon}}_{i}$ and ${\bm{\eta}}_{ij}$ are the corresponding residual error vectors. For the discovery of the $i^{th}$ drift component in Eq. (7a), the target vector is defined as ${\bm{Y}}_{i}=\Delta t^{-1}\left[(X_{i,1}-\xi_{i,1})\ \ldots\ (X_{i,N}-\xi_{i,N})\right]^{T}$, and for the sparse regression problem for the diffusion components in Eq. (7b), the target vector is set as ${\bm{Y}}_{ij}=\Delta t^{-1}\left[(X_{i,1}-\xi_{i,1})(X_{j,1}-\xi_{j,1})\ \ldots\ (X_{i,N}-\xi_{i,N})(X_{j,N}-\xi_{j,N})\right]^{T}$. The weight vectors for the $i^{th}$ drift component and the $(ij)^{th}$ term of the diffusion covariance matrix are ${\bm{\theta}}_{i}^{f}=\left[\theta_{i_{1}},\theta_{i_{2}},\ldots,\theta_{i_{K}}\right]^{T}$ and ${\bm{\theta}}_{ij}^{g}=\left[\theta_{ij_{1}},\theta_{ij_{2}},\ldots,\theta_{ij_{K}}\right]^{T}$, respectively. A straightforward application of the Gibbs sampling steps in Eqs. (17)-(22) can then be used to solve the regression problems in Eq. (7). Noting that an $m$-dimensional process has $m$ drift components, $m$ independent regressions are required to identify all the drifts. Further, for an $m$-dimensional process with an $m\times n$ diffusion matrix, the diffusion covariance matrix ${\bf{\Gamma}}$ has dimension $m\times m$; owing to the symmetry of the covariance matrix, $m(m+1)/2$ independent regressions suffice to completely identify the diffusion space.
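As an illustrative sketch (not the authors' code), the target vectors of Eqs. (7a)-(7b) can be assembled from an ensemble of measured sample paths with a few lines of NumPy; the array layout is an assumption:

```python
import numpy as np

def km_targets(X, dt):
    """Kramers-Moyal target vectors from an ensemble of sample paths.

    X  : array of shape (E, N + 1, m) -- E measured paths, N + 1 steps, m states
    dt : sampling interval
    Returns drift targets Y of shape (N, m) and diffusion-covariance targets
    G of shape (N, m, m), both ensemble-averaged, per the targets of Eq. (7).
    """
    dX = np.diff(X, axis=1)                   # sample increments X(t + dt) - X(t)
    Y = dX.mean(axis=0) / dt                  # linear variation -> drift targets
    # quadratic (co)variation -> diffusion covariance targets
    G = np.einsum('eni,enj->nij', dX, dX) / (dt * X.shape[0])
    return Y, G
```

Each column Y[:, i] then serves as the target ${\bm{Y}}_{i}$ of Eq. (7a), and G[:, i, j] as the target ${\bm{Y}}_{ij}$ of Eq. (7b).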

2.2 Discovery of SDE by sparse Bayesian regression

Since Eqs. (7a) and (7b) reflect the same type of sparse regression problem, for brevity they can be represented by the following one-dimensional regression equation:

{\bm{Y}}={\bf{L}}{\bm{\theta}}+\epsilon, (8)

where ${\bm{Y}}\in\mathbb{R}^{N}$ denotes the $N$-dimensional target vector, ${\bf{L}}$ denotes the library of candidate functions, ${\bm{\theta}}$ is the weight vector, and $\epsilon\in\mathbb{R}^{N}$ is the residual error vector representing the model mismatch. In Bayesian inference, given the output measurements, the aim is to find the posterior distribution of the weight vector ${\bm{\theta}}$. To estimate this posterior, one applies Bayes' formula to Eq. (8), which yields,

P\left({\bm{\theta}}|{\bm{Y}}\right)=\dfrac{P\left({\bm{\theta}}\right)P\left({\bm{Y}}|{\bm{\theta}}\right)}{P\left({\bm{Y}}\right)}. (9)

Modeling the mismatch error $\epsilon$ as an i.i.d. Gaussian random variable with zero mean and variance $\sigma^{2}$, the likelihood function is written as,

{\bm{Y}}|{\bm{\theta}},\sigma^{2}\sim\mathcal{N}\left({\bf{L}}{\bm{\theta}},\sigma^{2}{\bf{I}}_{N\times N}\right), (10)

where ${\bf{I}}_{N\times N}$ denotes the $N\times N$ identity matrix. The discovery of governing physics demands that the resulting model be interpretable, i.e. the solution for the weight vector ${\bm{\theta}}$ should contain only a few terms. In this work, the discontinuous spike and slab (DSS) distribution is used to promote sparsity in the solution. Equation discovery using the DSS-prior requires classification of the weights of the basis functions into either the spike or the slab component. This is done by introducing a latent indicator variable ${\bm{Z}}=[Z_{1},\ldots,Z_{K}]$ for each component $\theta_{k};\ k=1,\ldots,K$ of the weight vector ${\bm{\theta}}$. The latent indicator $Z_{k}$ is assigned the value 1 if the weight belongs to the slab component and the value 0 if it belongs to the spike component. Note that the components of the weight vector belonging to the spike do not contribute to the discovery of the governing physics. Therefore, let ${\bm{\theta}}_{r}\in\mathbb{R}^{r}:\{r\ll K\}$ denote the vector composed of those elements of ${\bm{\theta}}$ for which $Z_{k}=1$. With this notation, the DSS-prior is defined as [36, 38],

p\left({\bm{\theta}}|{\bm{Z}}\right)=p_{slab}({\bm{\theta}}_{r})\prod_{k,Z_{k}=0}p_{spike}(\theta_{k}), (11)

where the spike and slab distributions are defined as $p_{spike}(\theta_{k})=\delta_{0}$ and $p_{slab}({\bm{\theta}}_{r})=\mathcal{N}\left({\bf{0}},\sigma^{2}\vartheta_{s}{\bf{R}}_{0,r}\right)$ with ${\bf{R}}_{0,r}={\bf{I}}_{r\times r}$; the spike and slab distributions are assumed to be independent. To increase the effectiveness and speed of convergence to the sparse solution, the noise variance $\sigma^{2}$ and the slab variance $\vartheta_{s}$ are treated as random variables. The noise variance $\sigma^{2}$ is assigned an Inverse-gamma distribution with hyperparameters $\alpha_{\sigma}$ and $\beta_{\sigma}$, and the slab variance $\vartheta_{s}$ is assigned an Inverse-gamma prior with hyperparameters $\alpha_{\vartheta}$ and $\beta_{\vartheta}$. Each latent variable $Z_{k}$ takes a value from the set $\{0,1\}$ and is therefore assigned a Bernoulli prior with common hyperparameter $p_{0}$, which is in turn simulated from a Beta prior with hyperparameters $\alpha_{p}$ and $\beta_{p}$. This can be summarized as,

p\left(\vartheta_{s}\right)=IG\left(\alpha_{\vartheta},\beta_{\vartheta}\right) (12)
p\left(Z_{k}|p_{0}\right)=Bern\left(p_{0}\right);\quad k=1,\ldots,K (13)
p\left(p_{0}\right)=Beta\left(\alpha_{p},\beta_{p}\right) (14)
p\left(\sigma^{2}\right)=IG\left(\alpha_{\sigma},\beta_{\sigma}\right) (15)
Figure 2: Hierarchical Bayesian network of the discontinuous spike and slab model for sparse linear regression. The variables in the green square boxes indicate deterministic parameters and the variables in the white circles represent random variables. The term ${\bf{L}}$ represents the library of candidate functions, and $\alpha_{\vartheta}$, $\beta_{\vartheta}$, $\alpha_{p}$, $\beta_{p}$, $\alpha_{\sigma}$, and $\beta_{\sigma}$ are the hyperparameters of the priors of the hierarchical DSS model. Here $p_{0}$, $\vartheta_{s}$, and $\sigma^{2}$ are scalar valued, while ${\bm{Z}}$ and ${\bm{\theta}}$ are vector valued.

With all the random variables, we construct the Bayesian hierarchical model shown in Fig. 2, where the hyperparameters $\alpha_{\vartheta}$, $\beta_{\vartheta}$, $\alpha_{p}$, $\beta_{p}$, $\alpha_{\sigma}$, and $\beta_{\sigma}$ are provided as deterministic constants. From the DAG structure in Fig. 2, the joint distribution of the random variables ${\bm{\theta}}$, ${\bm{Z}}$, $\vartheta_{s}$, $\sigma^{2}$, $p_{0}|{\bm{Y}}$ is obtained as,

p\left({\bm{\theta}},{\bm{Z}},\vartheta_{s},\sigma^{2},p_{0}|{\bm{Y}}\right)=\dfrac{p\left({\bm{Y}}|{\bm{\theta}},\sigma^{2}\right)p\left({\bm{\theta}}|{\bm{Z}},\vartheta_{s},\sigma^{2}\right)p\left({\bm{Z}}|p_{0}\right)p\left(\vartheta_{s}\right)p\left(\sigma^{2}\right)p\left(p_{0}\right)}{p\left({\bm{Y}}\right)}\propto p\left({\bm{Y}}|{\bm{\theta}},\sigma^{2}\right)p\left({\bm{\theta}}|{\bm{Z}},\vartheta_{s},\sigma^{2}\right)p\left({\bm{Z}}|p_{0}\right)p\left(\vartheta_{s}\right)p\left(\sigma^{2}\right)p\left(p_{0}\right), (16)

where $p\left({\bm{\theta}},{\bm{Z}},\vartheta_{s},\sigma^{2},p_{0}|{\bm{Y}}\right)$ denotes the joint posterior distribution, $p\left({\bm{Y}}|{\bm{\theta}},\sigma^{2}\right)$ the likelihood function, $p\left({\bm{\theta}}|{\bm{Z}},\vartheta_{s},\sigma^{2}\right)$ the prior on the weight vector ${\bm{\theta}}$, $p\left({\bm{Z}}|p_{0}\right)$ the prior on the latent vector ${\bm{Z}}$, $p\left(\vartheta_{s}\right)$ the prior on the slab variance $\vartheta_{s}$, $p\left(\sigma^{2}\right)$ the prior on the noise variance, $p\left(p_{0}\right)$ the prior on the success probability $p_{0}$, and $p\left({\bm{Y}}\right)$ the marginal likelihood. Direct sampling from the joint distribution is intractable here due to the spike and slab prior; the Gibbs sampling technique is therefore used to draw random samples from it [47]. The conditional distributions required for Gibbs sampling are derived in B. Since the weights corresponding to the spike distribution do not contribute to the selection of the correct basis functions, only the weights corresponding to $Z_{k}\neq 0$, i.e. the vector ${\bm{\theta}}_{r}$, are sampled. Referring to Eqs. (12)-(15), the sequence of random variables ${\bm{\theta}}^{(0)},\sigma^{2(0)},\vartheta_{s}^{(0)},p_{0}^{(0)},{\bm{Z}}^{(0)},{\bm{\theta}}^{(1)},\sigma^{2(1)},\vartheta_{s}^{(1)},p_{0}^{(1)},{\bm{Z}}^{(1)},\ldots$ is obtained by the following Gibbs sampling steps,

  1.

    The weight vector ${\bm{\theta}}_{r}^{(i)}$ is sampled from the Gaussian distribution with mean ${\bm{\mu}}_{\theta}$ and covariance ${\bf{\Sigma}}_{\theta}$ as,

    {\bm{\theta}}_{r}^{(i)}|{\bm{Y}},\vartheta_{s}^{(i)},\sigma^{2(i)}\sim\mathcal{N}\left({\bm{\mu}}_{\theta}^{(i)},{\bf{\Sigma}}_{\theta}^{(i)}\right), (17)

    where the mean and covariance are defined as ${\bm{\mu}}_{\theta}^{(i)}=\sigma^{-2(i)}{\bf{\Sigma}}_{\theta}^{(i)}{\bf{L}}_{r}^{(i)T}{\bm{Y}}$ and ${\bf{\Sigma}}_{\theta}^{(i)}=\sigma^{2(i)}\left({\bf{L}}_{r}^{(i)T}{\bf{L}}_{r}^{(i)}+\vartheta_{s}^{(i)-1}{\bf{R}}_{0,r}^{(i)-1}\right)^{-1}$, respectively.

  2.

    The latent variable $Z_{k}^{(i+1)}$ is assigned a value from the set $\{0,1\}$ using the Bernoulli distribution as,

    Z_{k}^{(i+1)}|{\bm{Y}},\vartheta_{s}^{(i)},p_{0}^{(i)}\sim Bern\left(u_{k}\right), (18)

    where $u_{k}=\dfrac{p_{0}}{p_{0}+\lambda\left(1-p_{0}\right)}$ and $\lambda=\dfrac{p\left({\bm{Y}}|Z_{k}^{(i)}=0,{\bm{Z}}_{-k}^{(i)},\vartheta_{s}^{(i)}\right)}{p\left({\bm{Y}}|Z_{k}^{(i)}=1,{\bm{Z}}_{-k}^{(i)},\vartheta_{s}^{(i)}\right)}$. Here ${\bm{Z}}_{-k}^{(i)}\in\mathbb{R}^{K-1}$ denotes the latent vector ${\bm{Z}}$ with the $k^{th}$ component excluded. The probability that the $k^{th}$ latent variable $Z_{k}^{(i)}$ takes the value 0 or 1 is estimated from the marginal likelihood,

    p\left({\bm{Y}}|{\bm{Z}}^{(i)},\vartheta_{s}^{(i)}\right)=\dfrac{\Gamma\left(\alpha_{\sigma}+\frac{N}{2}\right)\beta_{\sigma}^{\alpha_{\sigma}}}{\Gamma\left(\alpha_{\sigma}\right)\left(2\pi\right)^{N/2}\left(\beta_{\sigma}+\frac{1}{2}{\bm{Y}}^{T}{\bm{Y}}\right)^{\alpha_{\sigma}+N/2}};\quad\text{when all }\{Z_{k}^{(i)}:k=1,\ldots,K\}=0
    =\dfrac{\Gamma\left(\alpha_{\sigma}+\frac{N}{2}\right)\beta_{\sigma}^{\alpha_{\sigma}}\left(\left|{\bf{R}}_{0,r}^{(i)-1}\right|\left|{\bf{\Sigma}}_{\theta}^{(i)}\right|\right)^{1/2}}{\Gamma\left(\alpha_{\sigma}\right)\left(2\pi\right)^{N/2}\vartheta_{s}^{h_{z}/2}\left(\beta_{\sigma}+\frac{1}{2}{\bm{Y}}^{T}\left({\bf{I}}_{N\times N}-{\bf{L}}_{r}^{(i)}{\bf{\Sigma}}_{\theta}^{(i)}{\bf{L}}_{r}^{(i)T}\right){\bm{Y}}\right)^{\alpha_{\sigma}+N/2}};\quad\text{otherwise.} (19)
  3.

    The noise variance σ2(i+1){\sigma^{2(i+1)}} is simulated from the Inverse-gamma distribution as,

    \sigma^{2(i+1)}|{\bm{Y}},{\bm{Z}}^{(i+1)},\vartheta_{s}^{(i)}\sim IG\left(\alpha_{\sigma}+\dfrac{N}{2},\ \beta_{\sigma}+\dfrac{1}{2}\left({\bm{Y}}^{T}{\bm{Y}}-{\bm{\mu}}_{\theta}^{(i)T}{\bf{\Sigma}}_{\theta}^{(i)-1}{\bm{\mu}}_{\theta}^{(i)}\right)\right). (20)
  4.

    The slab variance ϑs(i+1){\vartheta_{s}^{(i+1)}} is sampled from the Inverse-gamma distribution as,

    \vartheta_{s}^{(i+1)}|{\bm{\theta}}^{(i)},{\bm{Z}}^{(i+1)},\sigma^{2(i+1)}\sim IG\left(\alpha_{\vartheta}+\dfrac{h_{z}}{2},\ \beta_{\vartheta}+\dfrac{1}{2\sigma^{2}}{\bm{\theta}}_{r}^{(i)T}{\bf{R}}_{0,r}^{(i)-1}{\bm{\theta}}_{r}^{(i)}\right). (21)
  5.

    The success rate p0(i+1){p_{0}^{(i+1)}} is sampled from the Beta distribution as,

    p_{0}^{(i+1)}|{\bm{Z}}^{(i+1)}\sim Beta\left(\alpha_{p}+h_{z},\ \beta_{p}+K-h_{z}\right), (22)

    where $h_{z}=\sum_{k=1}^{K}Z_{k}^{(i+1)}$.

  6.

    The weight vector ${\bm{\theta}}_{r}^{(i+1)}$ is updated using step 1.

This MCMC is performed for a total of $N_{M}$ iterations, of which the initial 1000 samples are discarded as burn-in. Let $N_{s}$ denote the number of MCMC samples retained after burn-in to represent the stationary distribution. Then the marginal posterior inclusion probability (PIP) $:=p\left(Z_{k}=1|{\bm{Y}}\right)$ for each of the $K$ basis functions can be estimated by taking the mean over the Gibbs samples of the $k^{th}$ latent variable $Z_{k}$ [36] as,

p\left(Z_{k}=1|{\bm{Y}}\right)\approx\dfrac{1}{N_{s}}\sum_{j=1}^{N_{s}}Z_{k}^{j};\quad k=1,\ldots,K. (23)

The basis functions whose PIP values exceed 0.5, i.e. $p\left(Z_{k}=1|{\bm{Y}}\right)>0.5$, are included in the final discovered model. A PIP value greater than 0.5 indicates that the corresponding basis function was selected in more than half of the MCMC iterations; the higher the PIP, the more likely the basis function is to be required to represent the target vector in unseen scenarios. The mean and covariance of the weight vector give the expected values and standard deviations of the system parameters; the standard deviation provides a confidence interval for the parameters, which can be used to set the lower and upper bounds of a random variable in an unseen environment. Pseudo code for the proposed framework is provided in Algorithm 1, and a runnable sketch of the sampler is given below.
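The following compact Python/NumPy sketch implements steps 1-5 above together with the PIP estimate of Eq. (23). It is a sketch under the paper's conjugacy structure, not the authors' implementation: the slab covariance is taken as $R_{0}=I$, the marginal likelihood of Eq. (19) is evaluated in log form up to Z-independent constants, and the posterior mean of the weights is computed in the standard collapsed form.

```python
import numpy as np
from numpy.linalg import inv, slogdet

def log_evidence(Y, L, Z, vs, a_s, b_s):
    """log p(Y | Z, vartheta_s) with theta and sigma^2 integrated out
    (cf. Eq. (19)), dropping terms that do not depend on Z."""
    N = len(Y)
    r = Z.astype(bool)
    if not r.any():
        return -(a_s + 0.5 * N) * np.log(b_s + 0.5 * Y @ Y)
    Lr = L[:, r]
    A = Lr.T @ Lr + np.eye(int(r.sum())) / vs       # slab covariance R_0 = I
    mu = np.linalg.solve(A, Lr.T @ Y)               # collapsed posterior mean
    _, logdetA = slogdet(A)
    return (-0.5 * logdetA - 0.5 * r.sum() * np.log(vs)
            - (a_s + 0.5 * N) * np.log(b_s + 0.5 * Y @ (Y - Lr @ mu)))

def gibbs_dss(Y, L, n_iter=3000, burn=1000,
              a_p=0.1, b_p=1.0, a_v=0.5, b_v=0.5, a_s=1e4, b_s=1e4, seed=0):
    """Sketch of the DSS Gibbs sampler (steps 1-5). Returns the PIP vector
    of Eq. (23) and the retained posterior weight samples."""
    rng = np.random.default_rng(seed)
    N, K = L.shape
    theta0 = np.linalg.lstsq(L, Y, rcond=None)[0]
    s2 = np.var(Y - L @ theta0)                # sigma^2 init: OLS residual variance
    p0, vs = 0.1, 10.0                         # initial inclusion prob., slab variance
    Z = (np.abs(theta0) > np.abs(theta0).mean()).astype(int)   # crude init of Z
    Z_samps, T_samps = [], []
    for it in range(n_iter):
        for k in range(K):                     # step 2: latent indicators (Eq. (18))
            lp = {}
            for val in (0, 1):
                Z[k] = val
                lp[val] = log_evidence(Y, L, Z, vs, a_s, b_s)
            lam = np.exp(np.clip(lp[0] - lp[1], -700, 700))
            Z[k] = int(rng.random() < p0 / (p0 + lam * (1.0 - p0)))
        r, hz = Z.astype(bool), int(Z.sum())
        theta = np.zeros(K)
        if hz:                                 # step 1: theta_r | rest (Eq. (17))
            Lr = L[:, r]
            A = Lr.T @ Lr + np.eye(hz) / vs
            mu = np.linalg.solve(A, Lr.T @ Y)
            theta[r] = rng.multivariate_normal(mu, s2 * inv(A))
            quad = Y @ (Y - Lr @ mu)
        else:
            quad = Y @ Y
        # step 3: noise variance (Eq. (20)); an IG draw is the reciprocal of a Gamma draw
        s2 = 1.0 / rng.gamma(a_s + 0.5 * N, 1.0 / (b_s + 0.5 * quad))
        # step 4: slab variance (Eq. (21))
        vs = 1.0 / rng.gamma(a_v + 0.5 * hz, 1.0 / (b_v + theta[r] @ theta[r] / (2 * s2)))
        # step 5: inclusion probability (Eq. (22))
        p0 = rng.beta(a_p + hz, b_p + K - hz)
        if it >= burn:
            Z_samps.append(Z.copy()); T_samps.append(theta)
    return np.mean(Z_samps, axis=0), np.array(T_samps)   # PIP, weight samples
```

The per-component sweep over all $K$ indicators in each iteration is the costliest step, but it keeps the sampler exact with respect to the conditional distributions above.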

Algorithm 1 Pseudo code of the proposed Bayesian framework for discovery of the governing physics of stochastic non-linear systems
1: Sample paths: ${\bf{X}}(t)\in\mathbb{R}^{N\times m}$; hyperparameters: $\alpha_{p}$, $\beta_{p}$, $\alpha_{\sigma}$, $\beta_{\sigma}$, $\alpha_{\vartheta}$, $\beta_{\vartheta}$, $p_{0}^{(0)}$, $\vartheta_{s}^{(0)}$
2: Estimate the linear and quadratic variation vectors. ▷ Eq. (4)
3: For the drift, obtain the target vectors ${\bm{Y}}_{i}$ and the library ${\bf{L}}$ using the candidate basis functions. ▷ Eq. (7a)
4: For the diffusion, obtain the target vectors ${\bm{Y}}_{ij}$ and the library ${\bf{L}}$. ▷ Eq. (7b)
5: Estimate the initial noise variance from the residual variance: $\sigma^{2(0)}=\mathrm{Var}({\bf{L}}{\bm{\theta}}-{\bm{Y}})$
6: Estimate the initial latent vector ${\bm{Z}}^{(0)}=[Z_{1}^{(0)},Z_{2}^{(0)},\ldots,Z_{K}^{(0)}]$ such that $\mathrm{MSE}({\bf{L}}{\bm{\theta}}-{\bm{Y}})$ is minimum.
7: Find ${\bm{\theta}}^{(0)}$, ${\bm{\mu}}_{\theta}$ and ${\bf{\Sigma}}_{\theta}$ ▷ Eq. (17)
8: for $i=1,\ldots,N_{M}$ do
9:     Update the latent variable vector ${\bm{Z}}^{(i)}$. ▷ Eq. (18)
10:     Update the noise variance $\sigma^{2(i)}$. ▷ Eq. (20)
11:     Update the slab variance $\vartheta_{s}^{(i)}$. ▷ Eq. (21)
12:     Update the success rate $p_{0}^{(i)}$. ▷ Eq. (22)
13:     Update the weight vector ${\bm{\theta}}^{(i)}$. ▷ Eq. (17)
14:     Repeat steps 9→13
15: Discard the burn-in MCMC samples
16: Estimate the marginal PIP values $p(Z_{k}=1|{\bm{Y}})$ ▷ Eq. (23)
17: Include the basis functions with PIP values above the threshold
18: Estimate the expected value $\mathrm{E}[\theta_{k}]$ and standard deviation $\sigma[\theta_{k}]$ of the system parameters
19: Output: $\{{\bm{\theta}}_{k};\ k=1,\ldots,K\}$, $\{\vartheta_{k}({\bm{X}}_{j});\ j=1,\ldots,\bar{m}\}$, where $K$ and $\bar{m}$ are the numbers of library functions and process states, respectively.

The Gibbs sampler is initialised with the following values of the hyperparameters: $p_{0}^{(0)}=0.1$, $\vartheta_{s}^{(0)}=10$, and $\sigma^{2(0)}$ set equal to the residual variance from ordinary least-squares regression. The initial vector of binary latent variables ${\bm{Z}}^{(0)}=[Z_{1},\ldots,Z_{K}]$ is computed by first setting all components to zero and then activating those components of ${\bm{Z}}$ that reduce the mean-squared error between the training data and the ordinary least-squares model. For this purpose a forward search followed by a backward search is devised, where the backward search iterates through the activated components of the initial latent vector in a fashion similar to the forward search. Given all the other parameters, the initial value of ${\bm{\theta}}^{(0)}$ is obtained from Eq. (16). For the commencement of the algorithm, the deterministic prior parameters are set to the following values: $\alpha_{p}=0.1$ and $\beta_{p}=1$ for the Beta prior on $p_{0}$; $\alpha_{\vartheta}=0.5$ and $\beta_{\vartheta}=0.5$ for the Inverse-gamma prior on the slab variance; and $\alpha_{\sigma}=10^{4}$ and $\beta_{\sigma}=10^{4}$ for the Inverse-gamma prior on the measurement noise. A Markov chain with 3000 samples is used for Gibbs sampling; the first 1000 samples are discarded as burn-in and the remaining 2000 samples are used for posterior computation.
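With these settings, a usage sketch of the gibbs_dss sampler above might look as follows; here Y and Lib are a hypothetical target vector and library (for instance from the km_targets and library-construction sketches in this section):

```python
import numpy as np

# Hypothetical inputs: target vector Y of shape (N,) and library Lib of shape (N, K)
pip, thetas = gibbs_dss(Y, Lib, n_iter=3000, burn=1000,
                        a_p=0.1, b_p=1.0, a_v=0.5, b_v=0.5, a_s=1e4, b_s=1e4)
keep = pip > 0.5                              # PIP threshold of Eq. (23)
print("selected basis indices :", np.flatnonzero(keep))
print("E[theta]               :", thetas.mean(axis=0)[keep])
print("std[theta]             :", thetas.std(axis=0)[keep])
```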

For all the demonstrations in this work, the data are simulated using the Euler-Maruyama (EM) scheme at a frequency of 1000 Hz with the parameters listed in Table 1. The measurement noise is modeled as an $N$-dimensional sequence of zero-mean Gaussian white noise with a standard deviation equal to 5% of the standard deviation of the simulated quantities. The dictionary ${\bf{L}}\in\mathbb{R}^{N\times K}$ is constructed from 5 types of mathematical functions, each representing a mapping of the $m$-dimensional state vector ${\bm{X}}=\{X_{1},X_{2},\ldots,X_{m}\}$:

{\bf{L}}({\bm{X}})=\left[{\bf{1}}\quad P^{1}({\bm{X}})\quad P^{2}({\bm{X}})\quad\ldots\quad P^{\mathcal{P}}({\bm{X}})\quad\mathrm{sgn}({\bm{X}})\quad\left|{\bm{X}}\right|\quad{\bm{X}}\otimes\left|{\bm{X}}\right|\right]. (24)

Here ${\bf{1}}\in\mathbb{R}^{N}$ denotes the $N$-dimensional vector of ones, $P^{\mathcal{P}}({\bm{X}})\in\mathbb{R}^{N\times m}$ denotes the set of terms in the multinomial expansion $(X_{1}+X_{2}+\ldots+X_{m})^{\mathcal{P}}$, $\mathrm{sgn}({\bm{X}})\in\mathbb{R}^{N\times m}$ represents the signum functions $\mathrm{sgn}(X_{i})\ \forall i=1,\ldots,m$, and $\left|{\bm{X}}\right|\in\mathbb{R}^{N\times m}$ denotes the absolute mappings of the states $\left|X_{i}\right|\ \forall i=1,\ldots,m$. The tensor product term ${\bm{X}}\otimes\left|{\bm{X}}\right|\in\mathbb{R}^{N\times 2m}$ represents the set of functions $X_{i}\left|X_{j}\right|\ \forall i,j=1,\ldots,m$. In this study the polynomial order $\mathcal{P}$ is chosen as 6. The cardinality of the library is $K=\left(1+\left|(X_{1}+X_{2}+\ldots+X_{m})^{n}\right|+4m\right)$, where the number of terms in the multinomial expansion is $\left|(X_{1}+X_{2}+\ldots+X_{m})^{n}\right|={}^{n+m-1}C_{m-1}$. The complete architecture of the proposed framework is illustrated in Fig. 1.
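The dictionary of Eq. (24) can be assembled mechanically. The sketch below is our reading of Eq. (24), with the tensor-product block taken as all pairs $X_{i}|X_{j}|$; it builds the columns together with human-readable names:

```python
import itertools
import numpy as np

def build_library(X, P=6):
    """Candidate library of Eq. (24) for a state matrix X of shape (N, m):
    [1, multinomial terms of degree 1..P, sgn(X), |X|, X (x) |X|]."""
    N, m = X.shape
    cols, names = [np.ones(N)], ['1']
    for p in range(1, P + 1):                 # P^p(X): monomials of total degree p
        for idx in itertools.combinations_with_replacement(range(m), p):
            cols.append(np.prod(X[:, idx], axis=1))
            names.append('*'.join(f'X{i + 1}' for i in idx))
    for i in range(m):                        # sgn(X_i)
        cols.append(np.sign(X[:, i])); names.append(f'sgn(X{i + 1})')
    for i in range(m):                        # |X_i|
        cols.append(np.abs(X[:, i])); names.append(f'|X{i + 1}|')
    for i in range(m):                        # tensor-product block X_i |X_j|
        for j in range(m):
            cols.append(X[:, i] * np.abs(X[:, j]))
            names.append(f'X{i + 1}|X{j + 1}|')
    return np.column_stack(cols), names

Lib, names = build_library(np.random.default_rng(0).normal(size=(100, 2)), P=6)
print(Lib.shape)   # (100, 36): for m = 2, the multinomial block has 27 terms
```

For $m=2$ and $\mathcal{P}=6$ this yields $K=36$ columns, which matches the library size reported for the Duffing-Van der Pol oscillator in Fig. 3(c).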

Simulated systems | Drift parameters (linear) | Drift parameters (non-linear) | Diffusion parameters
Black-Scholes | $\lambda=2$ | - | $\mu=1$
Duffing-Van der Pol | $m=1$, $k=2\times 10^{3}$, $c=2$ | $\alpha=10^{5}$ | $\sigma=10$
2-DOF non-linear | $m_{1}=m_{2}=1$, $k_{1}=4\times 10^{3}$, $k_{2}=2\times 10^{3}$, $c_{1}=c_{2}=2$ | $\mu=1$, $g=9.81$ | $\sigma_{1}=10$, $\sigma_{2}=10$
Bouc-Wen | $m=1$, $k=1\times 10^{4}$, $c=20$, $\lambda=0.5$ | $A_{1}=A_{2}=0.5$, $A_{3}=1$, $\bar{n}=3$ | $\sigma=2$
Table 1: Simulation parameters of the systems.

3 Discovery of governing physics of example problems

In this section, the efficacy and robustness of the proposed Bayesian physics discovery framework are demonstrated on a variety of representative stochastic non-linear dynamical systems: (a) the Black-Scholes SDE, (b) the Duffing-Van der Pol oscillator, and (c) a two degree-of-freedom (DOF) base-isolated shear building. For the equation discovery, it is assumed that the input forces are not measurable and that only noisy measurements of the displacements are available. The velocity component is obtained from the displacement vector through numerical differentiation, namely the fourth-order central finite difference formula. Additionally, a case study on the Bouc-Wen oscillator is undertaken, where it is assumed that the hysteresis measurement is also unavailable; the proposed framework therefore treats it as a partially observed system and estimates the effect of the unobserved state without explicitly including it in the library of candidate functions. The results of the case studies are presented in Figs. 3 and 4 and Table 2. These results demonstrate the accuracy of the discovered physics and the robustness of the proposed framework in learning the governing law from limited and noisy displacement measurements alone (without any input measurements).

Figure 3: Discovery of the drift components of the governing physics from data corrupted with 5% Gaussian white noise, based on the marginal posterior inclusion probability (PIP), $P(Z_{k}=1|{\bm{Y}})$. (a) Black-Scholes SDE: the library ${\bf{L}}\in\mathbb{R}^{N\times 9}$ consists of 9 candidate basis functions, of which $x$ and $|x|$ are identified based on the criterion PIP > 0.5. (b) Black-Scholes SDE: the library ${\bf{L}}\in\mathbb{R}^{N\times 6}$ is constructed as a collection of 6 basis functions; the correct drift component $X(t)$ is identified with almost sure probability $P(Z_{1}=1)=1$. (c) Duffing-Van der Pol oscillator: the library ${\bf{L}}\in\mathbb{R}^{N\times 36}$ has 36 candidate basis functions; the drift component is correctly identified as $\upsilon(1)=X_{1}$, $\upsilon(2)=X_{2}$, and $\upsilon(6)=X_{1}^{3}$. (d) Two-DOF shear building (drift component of the first DOF): the library ${\bf{L}}\in\mathbb{R}^{N\times 58}$ consists of 58 candidate basis functions, of which 5 are selected for the first drift component: $\upsilon(1)=Y_{1}$, $\upsilon(2)=Y_{2}$, $\upsilon(3)=Y_{3}$, $\upsilon(4)=Y_{4}$, and $\upsilon(36)=\mathrm{sgn}(Y_{2})$. (e) Two-DOF shear building (drift component of the second DOF): from the library ${\bf{L}}\in\mathbb{R}^{N\times 58}$ the basis functions $\upsilon(1)=Y_{1}$, $\upsilon(2)=Y_{2}$, $\upsilon(3)=Y_{3}$, and $\upsilon(4)=Y_{4}$ are discovered. The basis functions for both drift components are selected with almost full probability $P(Z_{k}=1|{\bm{Y}})$.
Figure 4: Discovery of the diffusion components of the governing physics from data corrupted with 5% Gaussian white noise, based on the marginal posterior inclusion probability (PIP), $P(Z_{k}=1|{\bm{Y}})$. (a) Black-Scholes SDE: the weight vector $\theta\in\mathbb{R}^{9}$ for the library ${\bf{L}}\in\mathbb{R}^{N\times 9}$ is shown; the covariance of the diffusion component is identified as a combination of the terms $X^{2}(t)$ and $X|X|$. (b) Black-Scholes SDE: the weight vector $\theta\in\mathbb{R}^{6}$ has 6 elements, of which the correct covariance term $X^{2}(t)$ is discovered with almost sure probability $P(Z_{2}=1)=1$. (c) Duffing-Van der Pol oscillator: there are 36 elements in the weight vector $\theta\in\mathbb{R}^{36}$; the diffusion term is correctly discovered as $X_{1}^{2}(t)$ with almost sure probability $P(Z_{2}=1)=1$. (d) Two-DOF shear building (diffusion component of the first DOF): the weight vector $\theta\in\mathbb{R}^{58}$ has 58 elements corresponding to the 58 basis functions; the first diffusion component is accurately identified as $X_{1}^{2}(t)$ with almost sure probability $P(Z_{2}=1)=1$. (e) Two-DOF shear building (diffusion component of the second DOF): the second diffusion component $X_{2}^{2}(t)$ is accurately identified with almost sure probability $P(Z_{2}=1)=1$.
Figure 5: Black-Scholes SDE: posterior distributions of the weights of the discovered basis functions. (a) The posterior distributions of $\theta_{X}$ and $\theta_{|X|}$; both resemble a uniform distribution with means around the value 1, and in combination the bases $X$ and $|X|$ identify the drift. (b) The posterior distributions of $\theta_{X^{2}}$ and $\theta_{X|X|}$; the posteriors resemble a uniform distribution with means around 0.5, and in combination they discover the diffusion. (c) The posterior of the weight $\theta_{X}$; its mean is obtained as 2.04, which identifies the drift with almost sure probability $P(Z_{1}=1)=1$. (d) The posterior of the weight $\theta_{X^{2}}$, whose mean provides the value of the parameter $\mu$ as 1.02.
Correct SDEs:
Black-Scholes: $dX(t)=2X(t)dt+X(t)dB(t)$
Duffing-Van der Pol: $dX_{2}(t)=-\left(1000X_{1}(t)+2X_{2}(t)+100000X_{1}^{3}(t)\right)dt+10X(t)dB(t)$
2-DOF non-linear: $dY_{2}(t)=\left(-6000Y_{1}(t)-4Y_{2}(t)-9.81\,\mathrm{sgn}\left(Y_{2}(t)\right)+2000Y_{3}(t)+2Y_{4}(t)\right)dt+10dB_{1}(t)$; $dY_{4}(t)=\left(2000Y_{1}(t)+2Y_{2}(t)-2000Y_{3}(t)-2Y_{4}(t)\right)dt+10dB_{2}(t)$
Identified SDEs:
Black-Scholes: $dX(t)=2.05X(t)dt+1.02X(t)dB(t)$; $dX(t)=\left(0.47X(t)+0.53\left|X(t)\right|\right)dt+\left(1.09X^{2}(t)+0.95X(t)\left|X(t)\right|\right)dB(t)$
Duffing-Van der Pol: $dX_{2}(t)=-\left(1000.02X_{1}(t)+1.99X_{2}(t)+99885.10X_{1}^{3}(t)\right)dt+10.31X(t)dB(t)$
2-DOF non-linear: $dY_{2}(t)=\left(-6000.30Y_{1}(t)-3.97Y_{2}(t)-9.89\,\mathrm{sgn}\left(Y_{2}(t)\right)+1999.73Y_{3}(t)+2.09Y_{4}(t)\right)dt+10.03dB_{1}(t)$; $dY_{4}(t)=\left(2000.65Y_{1}(t)+1.99Y_{2}(t)-1999.06Y_{3}(t)-1.99Y_{4}(t)\right)dt+10.11dB_{2}(t)$
Table 2: Summary of the results of Bayesian equation discovery of stochastic dynamical systems. The parameters of the identified systems denote the expected values of the weights after discarding the burn-in samples; the first 1000 samples are taken as burn-in and discarded before obtaining the final stationary distribution.

3.1 Black–Scholes SDE

The Black–Scholes SDE (formulated as a geometric Brownian motion) describes a continuous-time stochastic process in which the logarithm of the randomly varying quantity follows a Brownian motion. It is frequently used in the stock market for modelling the evolution of the stock price of an underlying asset [48]. The Black–Scholes SDE has drift $f(t,X_{t})=\lambda X$ and diffusion $g(t,X_{t})=\mu X$, with $\lambda>0$ and $\mu>0$ real constants. It is defined as a geometric Brownian motion as follows:

\dfrac{dX(t)}{X(t)}=\lambda\,dt+\mu\,dB(t);\quad X(t=t_{0})=X_{0};\quad t\in[0,T], (25)

where $B=\{B(t);t\geq 0\}$ is the Brownian motion. Data are generated by simulating Eq. (25) and then corrupting the result with Gaussian white noise. Identifying the diffusion term as $g({\bm{X}}_{t},t)=\mu X(t)$, the variance of the diffusion is obtained as $\Gamma=\mu^{2}X^{2}(t)$. The results of this case study are plotted in Figs. 3(a) and 4(a). Fig. 3(a) shows that two basis functions, $X$ and $|X|$, are identified; however, from Eq. (25) it is obvious that only the single function $X$ should have been identified as the driving function of the Black-Scholes SDE. To understand this discrepancy, note that the evolution of the random variable $X(t)$ in the Black-Scholes SDE follows a geometric Brownian motion (GBM), and GBMs always assume positive values (e.g. real stock prices). Since the functions $X$ and $|X|$ are identical on such data, the algorithm identifies both functions with almost equal probability. A similar phenomenon is observed in the identification of the diffusion term: Fig. 4(a) depicts almost equal contributions from the functions $X^{2}$ and $X|X|$, whereas only the term $X^{2}$ should actually have been identified. Upon combining the corresponding parameter values of the basis functions $X$ and $|X|$ for the drift component, and $X^{2}$ and $X|X|$ for the diffusion component, the Black-Scholes SDE in Eq. (25) can easily be simulated for any unseen environmental condition.
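For reference, a minimal sketch of the data-generation step for this example is given below (Euler-Maruyama at 1000 Hz with $\lambda=2$, $\mu=1$ from Table 1, and 5% additive measurement noise; the time horizon and initial condition are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
lam, mu = 2.0, 1.0                  # drift and diffusion parameters (Table 1)
dt, T, x0 = 1e-3, 1.0, 1.0          # 1000 Hz sampling; T and x0 are assumed
n = int(T / dt)

X = np.empty(n + 1); X[0] = x0
for k in range(n):                  # Euler-Maruyama: X_{k+1} = X_k + lam*X_k*dt + mu*X_k*dB_k
    X[k + 1] = X[k] + lam * X[k] * dt + mu * X[k] * np.sqrt(dt) * rng.standard_normal()

X_noisy = X + 0.05 * X.std() * rng.standard_normal(n + 1)   # 5% measurement noise
```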

To verify that this is not a limitation of the proposed scheme, the numerical demonstration is repeated in Figs. 3(b) and 4(b) without the functions $|X|$ and $X|X|$ in the library. The parameters of the basis functions for the drift and diffusion components are portrayed in Figs. 5(c) and 5(d), respectively. The results clearly show that the proposed scheme is able to identify the basis functions along with their associated parameters $\lambda$ and $\mu$ without any significant error. This case study naturally raises the question of which basis functions should be included in the library. One possibility is to visualize the time series data and then apply scientific and engineering judgement; for example, if the time history shows no zero crossing, one can opt against including functions that share similar properties on the observed data (e.g., $|X|$ and $X$, or $X|X|$ and $X^{2}$), as illustrated by the check below. Alternatively, one can retain such terms, noting that similar terms will have equal probability of occurrence and, in combination, will reproduce the observed phenomenon; however, the resulting model will then be applicable only to processes exhibiting no zero crossing.
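The zero-crossing check suggested above is easy to automate. The short sketch below flags whether a measured time history ever changes sign; how the flag feeds into library construction is a hypothetical illustration, not the paper's prescribed workflow.

```python
import numpy as np

def has_zero_crossing(y):
    """Return True if the measured time history changes sign."""
    s = np.sign(y[np.abs(y) > 0.0])   # drop exact zeros before comparing signs
    return bool(np.any(s[1:] != s[:-1]))

# If the response never crosses zero, X and |X| (and likewise X**2 and
# X*|X|) coincide on the data, so the absolute-value bases may be pruned
# from the candidate library before running the identification.
```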

Finally, to illustrate the robustness of the proposed approach and simulate a realistic scenario, we consider a case where the data is generated using the strong Taylor 1.5 stochastic integration scheme. The results of this study are provided in Fig. 6. Figs. 6(a) and (b) show that the proposed framework correctly identifies the drift and diffusion components of the Black-Scholes equation. Subplots (c) and (d) further show the expected values of the parameters of the identified basis functions, which match almost exactly the parameters listed in Table 1. This shows that, despite the lower order formulation, the proposed framework is robust and well equipped.

Figure 6: Black-Scholes equation: Posterior distributions of the weights of the selected basis functions. (a) The drift component $X$ of the Black-Scholes equation is identified. (b) The variance term $X^{2}$ of the diffusion component is identified. (c) Posterior distribution of the weight $\theta(X)$, which identifies the expected value of the parameter $\lambda$; the expected value of $\lambda$ is identified from the mean of the distribution as approximately 2.09. (d) Posterior distribution of the weight $\theta(X^{2})$; the mean of the distribution represents the expected value of the parameter $\mu$, which is found to be 1.01.

3.2 Duffing–Van der Pol oscillator

The second-order non-linear hardening Duffing–Van der Pol oscillator is considered in this section. The Duffing–Van der Pol oscillator draws its physical relevance from models of flow-induced structural vibration problems. It has a cubic dissipating force and is driven by multiplicative white noise. The equation of motion of the system under consideration is expressed as:

m\ddot{X}(t)+c\dot{X}(t)+kX(t)+\alpha X^{3}(t)=\sigma X(t)\dot{B}(t);\quad {\bm{X}}(t=t_{0})={\bm{X}}_{0};\quad t\in[0,T], (26)

where $m$, $c$, and $k$ are the mass, damping, and stiffness parameters of the oscillator. Here, $\alpha$ is a real-valued parameter associated with the cubic non-linearity, $\sigma\geq 0$ is the strength of the multiplicative white noise, and $B=\{B(t);t\geq 0\}$ is the Brownian motion. The time derivative of the Brownian motion, $\dot{B}(t)$, represents the white noise. With the state-space transformation $X=X_{1}$ and $\dot{X}=X_{2}$, where $X_{1}$ and $X_{2}$ represent the displacement and velocity, the corresponding first-order Itô SDEs for the dynamical system are derived as,

\left[\begin{array}{c}dX_{1}(t)\\ dX_{2}(t)\end{array}\right]=\left[\begin{array}{c}X_{2}(t)\\ \dfrac{-1}{m}\left(kX_{1}(t)+cX_{2}(t)+\alpha X_{1}^{3}(t)\right)\end{array}\right]dt+\left[\begin{array}{c}0\\ \dfrac{\sigma}{m}X(t)\end{array}\right]dB(t). (27)

For estimating the drift and diffusion terms, only the evolution of the second variable $X_{2}(t)$ is used to construct the target linear and quadratic variation vectors from the sample paths using the Kramers-Moyal formulae. Here, the variance of the diffusion term is identified as $\Gamma=\sigma^{2}X^{2}(t)/m^{2}$. The results for the basis function selection are displayed in Figs. 3(c) and 4(c), while the associated parameters are plotted in Fig. 7. From Figs. 3(c) and 4(c), it is evident that the proposed framework accurately discovers the basis functions for the drift and diffusion terms as $\left\{X_{1}(t),X_{2}(t),X_{1}^{3}(t)\right\}$ and $\left\{X_{1}(t),X_{2}(t)\right\}$, respectively.
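As a rough numerical counterpart of this step, the sketch below forms the linear and quadratic variation targets from an ensemble of sampled paths of $X_{2}$, following the limiting formulas of Appendix A; the array layout, the ensemble averaging, and the 1/2 convention of Eq. (58) are assumptions about how the targets are assembled.

```python
import numpy as np

def km_targets(x2, dt):
    """Kramers-Moyal target vectors from sampled paths of X2.

    x2 : array of shape (n_ensemble, n_time) holding sampled paths.
    Returns the drift target E[dX2]/dt and the quadratic-variation
    target E[dX2**2]/(2 dt), following the convention of Eq. (58).
    """
    dx = np.diff(x2, axis=1)                        # one-step increments
    drift_target = dx.mean(axis=0) / dt             # ~ f(X, t)
    diff_target = 0.5 * (dx**2).mean(axis=0) / dt   # ~ g^2(X, t)
    return drift_target, diff_target
```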

The pairwise joint posterior distributions of the weights associated with the discovered basis functions are presented in Fig. 7. Subplots (a), (b), and (c) show the weights of the discovered drift functions. It is evident from these figures that the actual parameters associated with the drift component of the DVP oscillator, as listed in Table 1, are correctly identified with very small relative error. Subplot (d) depicts the posterior distribution of the identified diffusion basis function. The mean value of this posterior distribution represents the term $g^{2}(\bm{X}_{t},t)$, where in this case $g(\bm{X}_{t},t)=(\sigma/m)X(t)$. The mean value here is obtained as approximately 106.47; by performing the square root operation on the mean, one approximately recovers the diffusion parameter $\sigma=10.31$. These results verify that the proposed scheme can be successfully applied to discover the governing physics of non-linear oscillators subjected to parametric excitation.

Figure 7: Duffing–Van der Pol oscillator: Posterior distributions of the weights of the selected basis functions. (a) Joint posterior distribution of $\theta(X_1)$ and $\theta(X_2)$; the yellow region indicates the mean of the joint distribution. These weights represent the parameters $k$ and $c$, with expected values $\mathrm{E}(\theta(X_1))=1000.2$ and $\mathrm{E}(\theta(X_2))=1.99$. (b) Joint posterior distribution of $\theta(X_1)$ and $\theta(X_1^3)$: $\mathrm{E}(\theta(X_1))=1000.2$ and $\mathrm{E}(\theta(X_1^3))=99885.1$. (c) Joint posterior distribution of $\theta(X_2)$ and $\theta(X_1^3)$: $\mathrm{E}(\theta(X_2))=1.99$ and $\mathrm{E}(\theta(X_1^3))=99885.1$. (d) Histogram of the diffusion parameter; its mean value, 106.47, represents the term $g^{2}(\bm{X}_{t},t)$, from which the parameter $\sigma/m$ is obtained as approximately 10.31.

3.3 Two-story base-isolated shear building

The responses of civil engineering structures often exhibit the characteristics of stochastic non-linear dynamical systems due to the external excitation. Here, a two-story base-isolated structure is considered: a linear structure connected to the foundation through a Coulomb friction-type base isolator. Under these considerations, the governing equations of motion of the system are written as,

\begin{array}{l}m_{1}\ddot{X}_{1}(t)+c_{1}\dot{X}_{1}(t)+\mu m_{1}g\,\mathrm{sgn}\left(\dot{X}_{1}(t)\right)+k_{1}X_{1}(t)+c_{2}\left(\dot{X}_{1}(t)-\dot{X}_{2}(t)\right)+k_{2}\left(X_{1}(t)-X_{2}(t)\right)=\sigma_{1}\dot{B}_{1}(t)\\ m_{2}\ddot{X}_{2}(t)+c_{2}\left(\dot{X}_{2}(t)-\dot{X}_{1}(t)\right)+k_{2}\left(X_{2}(t)-X_{1}(t)\right)=\sigma_{2}\dot{B}_{2}(t)\\ {\bm{X}}(t=t_{0})={\bm{X}}_{0};\quad t\in[0,T], \end{array} (28)

where $\mu m_{1}g\,\mathrm{sgn}(\dot{X}_{1}(t))$ is the Coulomb friction force arising due to the sliding of the bearings in the isolator, and $\{m_{i}\}$, $\{c_{i}\}$, and $\{k_{i}\}$, $i=1,2$, are the mass, damping, and stiffness of the $i^{th}$ DOF. Here, $\sigma_{1}\geq 0$ and $\sigma_{2}\geq 0$ are the strengths of the white noises $\dot{B}_{1}(t)$ and $\dot{B}_{2}(t)$, where $B_{1}(t)$ and $B_{2}(t)$ are independent Brownian motions. In this case study, the equation discovery involves identification of two drift and two covariance terms (one for each DOF). Following the case study of the Duffing–Van der Pol oscillator, the target linear variation vector for discovery of the $i^{th}$ ($i=1,2$) drift term is constructed from the velocity component of the corresponding DOF. For discovery of the $(ij)^{th}$ ($i,j=1,2$) component of the covariance matrix, the quadratic variation matrix is constructed from the pointwise product of the $i^{th}$ and $j^{th}$ response vectors (see the sketch after Eq. (29)). The covariance matrix of the diffusion then takes the form,

\Gamma=\left[\begin{array}{cccc}0&0&0&0\\ 0&\sigma_{1}^{2}/m_{1}^{2}&0&0\\ 0&0&0&0\\ 0&0&0&\sigma_{2}^{2}/m_{2}^{2}\end{array}\right]. (29)
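A possible numerical realization of this construction is sketched below: every $(ij)^{th}$ covariation target is formed from the pointwise product of the increments of the $i^{th}$ and $j^{th}$ states. The storage layout and the averaging over ensemble and time are assumptions, not the paper's exact implementation.

```python
import numpy as np

def covariation_targets(paths, dt):
    """Quadratic covariation matrix estimate for a multi-state system.

    paths : array of shape (n_ensemble, n_states, n_time) with sampled
            state vectors (here Y1..Y4).
    Returns an (n_states, n_states) estimate of Gamma, averaged over the
    ensemble and over time, with the 1/2 factor of Eq. (59).
    """
    dx = np.diff(paths, axis=2)          # increments of every state
    n_ens, _, n_inc = dx.shape
    gamma = 0.5 * np.einsum('eit,ejt->ij', dx, dx) / (n_ens * n_inc * dt)
    return gamma
```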

The results for the identification of the corresponding basis functions for both drift terms are presented in Figs. 3(d) and (e). From Fig. 3(d), it can be observed that the identified basis functions are $\left\{Y_{1}(t),Y_{2}(t),Y_{3}(t),Y_{4}(t),\mathrm{sgn}(Y_{2}(t))\right\}$, which correctly matches the terms in the equation of motion of the first DOF. Similarly, from Fig. 3(e) it can be verified that the proposed scheme has correctly identified the basis functions of the second drift component as $\left\{Y_{1}(t),Y_{2}(t),Y_{3}(t),Y_{4}(t)\right\}$. Figs. 8 and 9 show the pairwise joint posterior distributions of the weights associated with the identified first and second drift basis functions, respectively. From these figures it is evident that the proposed scheme is able to identify the parameters of the basis functions, as listed in Table 1, with sufficient accuracy. In order to find the explicit values of the system stiffness and damping parameters, the parameters of the second DOF can be identified from Figs. 8(h), 8(e), and 8(f).

Figure 8: Non-linear two-DOF oscillator (first drift component): Pairwise joint posterior distributions of the parameters $\theta_k$ of the selected basis functions. (a) Joint posterior of the weights $\theta(Y_1)$ and $\theta(Y_2)$; the yellow colour denotes the mean value of the joint distribution. The expected values are $\mathrm{E}(\theta(Y_1))=-6000.30$ and $\mathrm{E}(\theta(Y_2))=-3.97$. (b) Joint posterior of $\theta(Y_1)$ and $\theta(Y_3)$: $\mathrm{E}(\theta(Y_1))=-6000.30$ and $\mathrm{E}(\theta(Y_3))=1999.73$. (c) Joint posterior of $\theta(Y_1)$ and $\theta(Y_4)$: $\mathrm{E}(\theta(Y_1))=-6000.30$ and $\mathrm{E}(\theta(Y_4))=2.09$. (d) Joint posterior of $\theta(Y_1)$ and $\theta(\mathrm{sgn}(Y_2))$: $\mathrm{E}(\theta(Y_1))=-6000.30$ and $\mathrm{E}(\theta(\mathrm{sgn}(Y_2)))=-9.89$. (e) Joint posterior of $\theta(Y_2)$ and $\theta(Y_3)$: $\mathrm{E}(\theta(Y_2))=-3.97$ and $\mathrm{E}(\theta(Y_3))=1999.73$. (f) Joint posterior of $\theta(Y_2)$ and $\theta(Y_4)$: $\mathrm{E}(\theta(Y_2))=-3.97$ and $\mathrm{E}(\theta(Y_4))=2.09$. (g) Joint posterior of $\theta(Y_2)$ and $\theta(\mathrm{sgn}(Y_2))$: $\mathrm{E}(\theta(Y_2))=-3.97$ and $\mathrm{E}(\theta(\mathrm{sgn}(Y_2)))=-9.89$. (h) Joint posterior of $\theta(Y_3)$ and $\theta(Y_4)$: $\mathrm{E}(\theta(Y_3))=1999.73$ and $\mathrm{E}(\theta(Y_4))=2.09$. (i) Joint posterior of $\theta(Y_3)$ and $\theta(\mathrm{sgn}(Y_2))$: $\mathrm{E}(\theta(Y_3))=1999.73$ and $\mathrm{E}(\theta(\mathrm{sgn}(Y_2)))=-9.89$. The weights $\{\theta_{Y_1},\theta_{Y_2},\theta_{Y_3},\theta_{Y_4},\theta_{\mathrm{sgn}(Y_2)}\}$ correspond to the basis functions $\{\upsilon_1,\upsilon_2,\upsilon_3,\upsilon_4,\upsilon_{36}\}$ and represent the relative values of the parameters $\{k_1,k_2,k_3,k_4,\mu mg\}$. The negative values arise due to the coupling of the system states.
Figure 9: Non-linear two-DOF oscillator (second drift component): Pairwise joint posterior distributions of the parameters $\theta_k$ of the selected basis functions. (a) Joint posterior of the weights $\theta(Y_1)$ and $\theta(Y_2)$, corresponding to the basis functions $\upsilon_1$ and $\upsilon_2$; the yellow colour denotes the mean value of the joint distribution. The expected values are $\mathrm{E}(\theta(Y_1))=2000.65$ and $\mathrm{E}(\theta(Y_2))=1.99$. (b) Joint posterior of $\theta(Y_1)$ and $\theta(Y_3)$ ($\upsilon_1$, $\upsilon_3$): $\mathrm{E}(\theta(Y_1))=2000.65$ and $\mathrm{E}(\theta(Y_3))=-1999.06$. (c) Joint posterior of $\theta(Y_1)$ and $\theta(Y_4)$ ($\upsilon_1$, $\upsilon_4$): $\mathrm{E}(\theta(Y_1))=2000.65$ and $\mathrm{E}(\theta(Y_4))=-1.99$. (d) Joint posterior of $\theta(Y_2)$ and $\theta(Y_3)$ ($\upsilon_2$, $\upsilon_3$): $\mathrm{E}(\theta(Y_2))=1.99$ and $\mathrm{E}(\theta(Y_3))=-1999.06$. (e) Joint posterior of $\theta(Y_2)$ and $\theta(Y_4)$ ($\upsilon_2$, $\upsilon_4$): $\mathrm{E}(\theta(Y_2))=1.99$ and $\mathrm{E}(\theta(Y_4))=-1.99$. (f) Joint posterior of $\theta(Y_3)$ and $\theta(Y_4)$ ($\upsilon_3$, $\upsilon_4$): $\mathrm{E}(\theta(Y_3))=-1999.06$ and $\mathrm{E}(\theta(Y_4))=-1.99$. The weights $\{\theta_{Y_1},\theta_{Y_2},\theta_{Y_3},\theta_{Y_4}\}$ represent the relative values of the parameters $\{k_1,k_2,k_3,k_4\}$. The negative values arise due to the coupling of the system states.

The identification results for the basis functions, along with the posterior distributions of their parameters for both diffusion terms, are depicted in Fig. 10. Figs. 10(a) and 10(b) present the results for the first diffusion term, and the identification results for the second diffusion term are shown in Figs. 10(c) and 10(d). In both cases, it is evident that the proposed scheme has correctly identified the basis function as $1$. The posterior distributions of the covariance terms ${\bm{\Gamma}}^{11}$ and ${\bm{\Gamma}}^{22}$ are plotted in Figs. 10(b) and 10(d), and their mean values represent $\sigma_{1}^{2}/m_{1}^{2}$ and $\sigma_{2}^{2}/m_{2}^{2}$. A summary of the results is presented in Table 2.

Figure 10: Two-DOF shear building: posterior distributions of the weights associated with the discovered basis functions of the diffusion components. (a) First DOF: posterior distribution of the weight $\theta_1$ of the constant basis function; the mean of the distribution denotes the expected value of $\sigma_1^{2}/m_1^{2}$, observed from the figure as approximately 101. (b) Second DOF: posterior distribution of the weight $\theta_1$; the mean is observed as 102, which is equivalent to the term $\sigma_2^{2}/m_2^{2}$. By performing the square root operation on the mean values, one can easily obtain the desired diffusion components.

3.4 Case study: Non-linear system with partially observed state variables

In this section, a case study is undertaken where we do not have access to all the state variables. We consider a Bouc-Wen oscillator in which the system is described using three states, namely the displacement and velocity of the main mass and the hysteresis displacement associated with the isolator [43, 49]. In this context, we note that the measurement of the hysteresis displacement is practically intractable and only the displacement and velocity states of the main system are available. Further, measuring the velocity of a mechanical system using sensors is a practically challenging task. Through this example, we illustrate the applicability of the proposed approach for such a system. The governing equation of motion of a single-degree-of-freedom Bouc-Wen oscillator is [43]:

m\ddot{X}(t)+c\dot{X}(t)+k\lambda X(t)+k\left(1-\lambda\right)Z(t)=\sigma\dot{B}(t);\qquad {\bm{X}}(t=t_{0})={\bm{X}}_{0};\quad t\in[0,T], (30)

where $m$, $c$, and $k$ are the mass, damping, and stiffness of the system, $X(t)$ is the system state, $\lambda$ is a factor that defines the participation of the elastic force $F_{e}(t)=k\lambda X(t)$ and the hysteresis force $F_{h}(t)=k(1-\lambda)Z(t)$, and $Z(t)$ is the hysteresis displacement. The evolution of the hysteresis parameter is defined through the non-linear differential equation:

\dot{Z}(t)=-A_{1}Z(t)\left|\dot{X}(t)\right|\left|Z(t)\right|^{\bar{n}-1}-A_{2}\dot{X}(t)\left|Z(t)\right|^{\bar{n}}+A_{3}\dot{X}(t);\qquad Z(t=t_{0})=Z_{0};\quad t\in[0,T]. (31)

The positive exponent $\bar{n}$ defines the smoothness of the transition from the elastic to the post-elastic branch, and the parameters $A_{1}$, $A_{2}$, and $A_{3}$ control the size and shape of the hysteresis loop. The drift and diffusion terms are:

f(t,{\bm{X}}_{t})=\left[\begin{array}{c}X_{2}(t)\\ \dfrac{-1}{m}\left(k\lambda X_{1}(t)+cX_{2}(t)+k(1-\lambda)X_{3}(t)\right)\\ -A_{1}X_{3}(t)\left|X_{2}(t)\right|\left|X_{3}(t)\right|^{\bar{n}-1}-A_{2}X_{2}(t)\left|X_{3}(t)\right|^{\bar{n}}+A_{3}X_{2}(t)\end{array}\right];\quad g(t,{\bm{X}}_{t})=\left[\begin{array}{c}0\\ \dfrac{\sigma}{m}\\ 0\end{array}\right]. (32)

Then, the covariance of the diffusion matrix is obtained as,

\Gamma=g(t,{\bm{X}}_{t})g(t,{\bm{X}}_{t})^{T}=\left[\begin{array}{ccc}0&0&0\\ 0&\dfrac{\sigma^{2}}{m^{2}}&0\\ 0&0&0\end{array}\right]. (33)

For the identification of the system, only the (noisy) displacement measurements are considered as input to the algorithm. The system is simulated for $T$ = 1s at a sampling frequency of 1000Hz using the parameter values $m$ = 1, $c$ = 20, $k$ = 100000, $\lambda$ = 0.5, $A_{1}$ = 0.5, $A_{2}$ = 0.5, $A_{3}$ = 1, $\bar{n}=3$, and $\sigma$ = 2. The identified basis functions for the drift and diffusion terms are presented in Fig. 11. In sub-figure (a) it is evident that more basis functions are identified than appear in the original equation. It is straightforward to note that these extra basis functions arise to account for the non-observable hysteresis state $Z(t)$. One can then identify the basis functions $\Theta_{f}(\mathbf{X})$ and $\Theta_{g}(\mathbf{X})$ for the drift and diffusion, respectively, that best represent the data as,

\begin{array}{l}\Theta_{f}(\mathbf{X})=\left[\begin{array}{ccccccccccc}1&X_{1}&X_{2}&X_{1}^{2}&X_{2}^{2}&X_{1}^{3}&X_{1}^{4}&\mathrm{sgn}(X_{1})&\left|X_{1}\right|&\left|X_{2}\right|&X_{1}\left|X_{1}\right|\end{array}\right]\\ \Theta_{g}(\mathbf{X})=\left[\begin{array}{ccc}1&X_{1}&\left|X_{1}\right|\end{array}\right].\end{array} (34)
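A dictionary of this kind is assembled directly from the measured states. The sketch below builds only the retained columns of Eq. (34); the full library used in the study contains 36 candidate functions, so this is a reduced illustration under that assumption.

```python
import numpy as np

def drift_library(x1, x2):
    """Retained columns of the drift dictionary in Eq. (34)."""
    return np.column_stack([
        np.ones_like(x1),        # 1
        x1, x2,                  # X1, X2
        x1**2, x2**2,            # X1^2, X2^2
        x1**3, x1**4,            # X1^3, X1^4
        np.sign(x1),             # sgn(X1)
        np.abs(x1), np.abs(x2),  # |X1|, |X2|
        x1 * np.abs(x1),         # X1|X1|
    ])

def diffusion_library(x1):
    """Retained columns of the diffusion dictionary in Eq. (34)."""
    return np.column_stack([np.ones_like(x1), x1, np.abs(x1)])
```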
Figure 11: Bouc-Wen oscillator: Basis function selection for the Bouc-Wen base isolator system based on the marginal posterior inclusion probability (PIP), $P(Z_k=1|\bm{Y})$; $k=1,\ldots,36$. (a) The identified basis functions for the drift component. The black horizontal axis represents the collection of 36 basis functions, and the red line marks marginal PIP $>$ 0.5. The identified basis functions are $\{\upsilon(0),\upsilon(1),\upsilon(2),\upsilon(3),\upsilon(5),\upsilon(6),\upsilon(10),\upsilon(28),\upsilon(30),\upsilon(31),\upsilon(32)\}$, corresponding to the functions $\{1,X_{1},X_{2},X_{1}^{2},X_{2}^{2},X_{1}^{3},X_{1}^{4},\mathrm{sgn}(X_{1}),\left|X_{1}\right|,\left|X_{2}\right|,X_{1}\left|X_{1}\right|\}$. (b) The identified basis functions for the diffusion component: $\{1,X_{1},\left|X_{1}\right|\}$.
Figure 12: Bouc-Wen oscillator: Pairwise joint posterior distributions of the parameters $\theta_k$ of the selected basis functions. Subplots (a)$\to$(f) present the joint posterior distributions of the drift parameters, whereas subplots (I)$\to$(III) present those of the diffusion parameters. (a) Joint posterior distribution of $\theta(1)$ and $\theta(X_1)$: $\mathrm{E}(\theta(1))=-46.4$ and $\mathrm{E}(\theta(X_1))=-5246$. (b) Joint posterior distribution of $\theta(X_2)$ and $\theta(X_1^2)$: $\mathrm{E}(\theta(X_2))=-20$ and $\mathrm{E}(\theta(X_1^2))=-614$. (c) Joint posterior distribution of $\theta(X_2^2)$ and $\theta(X_1^3)$: $\mathrm{E}(\theta(X_2^2))=-0.05$ and $\mathrm{E}(\theta(X_1^3))=-2328$. (d) Joint posterior distribution of $\theta(X_1^4)$ and $\theta(\mathrm{sgn}(X_1))$: $\mathrm{E}(\theta(X_1^4))=1829$ and $\mathrm{E}(\theta(\mathrm{sgn}(X_1)))=10.64$. (e) Joint posterior distribution of $\theta(|X_1|)$ and $\theta(|X_2|)$: $\mathrm{E}(\theta(|X_1|))=159$ and $\mathrm{E}(\theta(|X_2|))=1.63$. (f) Joint posterior distribution of $\theta(|X_1|)$ and $\theta(X_1|X_1|)$: $\mathrm{E}(\theta(|X_1|))=159$ and $\mathrm{E}(\theta(X_1|X_1|))=1439$. (I) Joint posterior of the weights $\theta_1$ and $\theta_2$ corresponding to the basis functions $1$ and $X_1$; the yellow colour denotes the mean values of the joint distribution. (II) Joint posterior of the weights $\theta_1$ and $\theta_{30}$ corresponding to the basis functions $1$ and $|X_1|$. (III) Joint posterior of the weights $\theta_2$ and $\theta_{30}$ corresponding to the basis functions $X_1$ and $|X_1|$.

The joint posterior distributions of the basis functions retained in Figs. 11(a) and (b) are plotted in Fig. 12. Based on the basis selection in Fig. 11 and the joint posteriors of the parameters $\theta_i$ plotted in Fig. 12, the final drift and diffusion fields for the Bouc-Wen system can be identified as follows,

\begin{array}{ll}f(t,X)=&-46.4-5246X_{1}-20X_{2}-614X_{1}^{2}-0.05X_{2}^{2}-2328X_{1}^{3}+1829X_{1}^{4}+10.64\,\mathrm{sgn}(X_{1})+159\left|X_{1}\right|\\ &+1.625\left|X_{2}\right|+1439X_{1}\left|X_{1}\right|\\ g(t,X)=&4.65-4.872X_{1}+6.786\left|X_{1}\right|.\end{array} (35)
Correct SDEs: \left\{\begin{array}{l}dX_{2}(t)=-\left(5000X_{1}(t)+20X_{2}(t)+5000X_{3}(t)\right)dt+2dB(t)\\ dX_{3}(t)=\left(-0.5X_{3}(t)\left|X_{2}(t)\right|\left|X_{3}(t)\right|^{2}-0.5X_{2}(t)\left|X_{3}(t)\right|^{3}+X_{2}(t)\right)dt\end{array}\right.
Identified SDEs: \left\{\begin{array}{ll}dX_{2}(t)=&\bigl(-46.4-5246X_{1}-20X_{2}-614X_{1}^{2}-2328X_{1}^{3}+1829X_{1}^{4}+10.64\,\mathrm{sgn}(X_{1})+159\left|X_{1}\right|\\ &+1.63\left|X_{2}\right|+1439X_{1}\left|X_{1}\right|\bigr)dt+\sqrt{4.65-3.79X_{1}+5.76\left|X_{1}\right|}\,dB(t)\end{array}\right.
Table 3: Identification results of the Bayesian physics discovery for the Bouc-Wen oscillator

For further treatment, the term $0.05X_{2}^{2}$ is neglected due to its insignificant parameter value. With this, the identified equation for the Bouc-Wen system is shown in Eq. (36):

\ddot{X}+20\dot{X}+5246X+614X^{2}+2328X^{3}-1829X^{4}-10.64\,\mathrm{sgn}(X)-159\left|X\right|-1.625\left|\dot{X}\right|-1439X\left|X\right|+46.4=4.65-4.872X+6.786\left|X\right|. (36)

In order to verify the accuracy of the identified system, the identified equation (Eq. (36)) is simulated using the EM scheme with a time step of $\Delta t$ = 0.001. To account for the fact that the future external excitation will be uncertain and its intensity will not remain the same, we model the forcing function as a Brownian force with a different intensity; to be specific, a Brownian noise intensity of $\sigma$ = 6 is used to simulate the new unseen environment. The results are compared in Fig. 13. We observe that the results obtained using the identified physics match almost exactly the ground truth obtained by solving the actual system. It is worthwhile to note that the proposed approach requires only 1s of data and is able to predict the responses for a stochastic force of different intensity in the distant future (500s). This clearly indicates that the proposed approach, even with partially observed displacement-only measurements, generalizes to unseen environments and provides accurate estimates even for out-of-distribution inputs. A summary of the case studies undertaken in this work is presented in Table 2.
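A minimal sketch of this verification step is given below. It integrates the identified SDE of Table 3 with the Euler-Maruyama scheme at $\Delta t$ = 0.001; the initial state and, in particular, the rescaling of the learned quadratic variation from the training intensity $\sigma$ = 2 to the unseen intensity $\sigma$ = 6 are our assumptions, not the paper's stated procedure.

```python
import numpy as np

# Euler-Maruyama integration of the identified Bouc-Wen model (Table 3).
rng = np.random.default_rng(1)
dt, T = 1e-3, 500.0

def f(x1, x2):
    """Identified drift of dX2 (Table 3)."""
    return (-46.4 - 5246*x1 - 20*x2 - 614*x1**2 - 2328*x1**3 + 1829*x1**4
            + 10.64*np.sign(x1) + 159*abs(x1) + 1.63*abs(x2)
            + 1439*x1*abs(x1))

def g2(x1):
    """Identified quadratic variation of the diffusion (Table 3)."""
    return 4.65 - 3.79*x1 + 5.76*abs(x1)

x1, x2 = 0.0, 0.0             # assumed initial state
scale = (6.0 / 2.0)**2        # assumed rescaling: sigma = 2 -> sigma = 6
for _ in range(int(T / dt)):
    dB = np.sqrt(dt) * rng.standard_normal()
    x1, x2 = (x1 + x2 * dt,
              x2 + f(x1, x2) * dt + np.sqrt(max(scale * g2(x1), 0.0)) * dB)
```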

Figure 13: Bouc-Wen oscillator: Future prediction in an unseen scenario using the discovered equation in the absence of the hysteresis data. The governing equation of the Bouc-Wen system is learned using 1 second of data sampled at 1000Hz, and the prediction is performed 500 seconds ahead of time using the discovered Bouc-Wen model. The figure presents the ensemble of the displacement and velocity predictions over 500 seconds. The simulation is performed using a Brownian noise intensity of $\sigma$ = 6, which is significantly higher than that of the training data. The dotted red line shows the prediction using the discovered equation and the solid blue line denotes the accurately simulated data. The predicted response emulates the original system almost exactly, even though measurements for all the states are not available.

4 Discussions

We proposed a novel data-driven framework for discovering the governing physics of multi-dimensional non-linear dynamical systems from limited and noisy output-only data. In many natural processes, the estimate of the external input is intractable due to limitations of present measurement technologies. Further, measurement of all the state variables is often not possible due to high operational cost; the proposed framework is developed to identify governing equations in such scenarios. In essence, the proposed approach identifies SDEs from displacement-only measurements; in other words, no measurement of the input is required. The deterministic drift component of the SDE captures the dynamics of the underlying dynamical system, whereas the external stochastic force is identified in terms of the diffusion component. This is a key novelty of the proposed framework.

The proposed framework employs sparse Bayesian linear regression in conjunction with the Gibbs sampler to simultaneously obtain the basis functions of the model and their associated parameters. To that end, the Kramers-Moyal expansion is utilized to express the drift and diffusion dynamics of an SDE in terms of the measured sample paths. The drift and diffusion vectors obtained from the sample paths are then used as target vectors in the sparse regression. The use of the Kramers-Moyal expansion within the purview of sparse Bayesian linear regression for the discovery of explainable governing physics is another novelty of the proposed framework. The unified framework possesses the ability of Bayesian inference to update the probability of observing the correct basis functions as new information arrives; this ability to update is missing in recently published least-squares based physics discovery schemes. In a noisy and unseen environment, one is more interested in the probability distribution of a random variable than in a point estimate; the proposed framework therefore provides a probability distribution for each of the weights associated with the corresponding basis functions. The obtained standard deviations of these distributions can then be used to define lower and upper bounds on the estimates of the unknown parameters. Another aspect of the proposed framework is its advantage over neural network based grey-box models, where an explicit expression of the discovered governing physics is not available. In the case of partially observed processes, the proposed framework still provides an explainable equation for the governing physics; although the basis functions in the obtained equation differ significantly from those determined from first principles, the obtained model is able to predict the distant future without significant error. To the knowledge of the authors, the discovery of physics in a partially observed process from the purview of probability theory is one of the key contributions of the proposed framework.

Despite the advantages of the proposed framework over presently available physics discovery techniques, certain issues require special attention for accurate identification of the governing physics. The first issue is the judicious selection of the candidate basis functions. The presence of basis functions with high correlation may sometimes yield a physical law different from the actual physics; this is evident from the identification example on the Black-Scholes equation. The second issue is the associated computational cost. Although the computational cost of the proposed scheme is significantly less than the time required to train its neural network alternatives, further improvements can be made. One alternative is to use a marginal standard deviation criterion instead of taking the mean of the latent vector. Another improvement in computational efficiency can be made by devising an appropriate filter to remove the signal noise synchronously with the sparse Bayesian linear regression. The third issue is related to the quality of the data. Low-fidelity data are less expensive to obtain, but the representation of the physical model can be poor due to the presence of noise; high-fidelity data are highly accurate but expensive to obtain. The proposed framework in its present state is not equipped to make combined use of low- and high-fidelity data. Thus, to further enhance the fidelity of the proposed framework for field applications, a more robust algorithm needs to be formulated: the use of low-fidelity data would ensure low operational cost, while the high-fidelity data would ensure the accuracy of the discovered physics. Lastly, the proposed framework leverages the Kramers-Moyal expansion to express the drift and diffusion components of an SDE in terms of the first and second order moments of the sample paths. The moments are approximated in a limiting sense by making use of the absolute and quadratic identities of Brownian motion. In this context, future improvements can be made by exploiting higher order moments to retain higher order terms from the stochastic Taylor expansions of the drift and diffusion terms. Overall, the proposed framework provides a novel methodology for discovering governing physics from displacement-only measurements and is particularly useful whenever it is not possible to measure the input force. The learned equations have been shown to be accurate and to demonstrate good predictive ability in the distant future.

Appendix A Kramers-Moyal expansion for estimation of the drift and diffusion terms of an SDE from sample paths

Let us consider $p(X,t)={\rm P}(X,t|X_{0},t_{0})$ to be the transition probability density of the solution of the SDE in Eq. (3). The Kramers-Moyal expansion is then written as [50],

\dfrac{\partial p(X,t)}{\partial t}=\sum\limits_{n=1}^{\infty}\left(-\dfrac{\partial}{\partial X}\right)^{n}D^{(n)}(X,t)p(X,t), (37)

where the coefficients in the expansion are given as,

D^{(n)}(X)=\left.\dfrac{1}{n!}M_{n}(t)\right|_{t=0}=\left.\dfrac{1}{n!}\mathop{\lim}\limits_{\Delta t\to 0}\dfrac{1}{\Delta t}\left\langle\left|X(t+\Delta t)-z\right|^{n}\right\rangle\right|_{X(t)=z}. (38)

In order to know how many terms of Eq. (37) will be active for the SDE in Eq. (3), a Fokker-Planck equation for the pdf of the solution of Eq. (3) needs to be constructed. Towards this, let us consider a well-behaved generic function $F(X,t)$ which is at least twice differentiable. For this function, Itô's lemma is written as,

dF(X,t)=\dfrac{\partial F(X,t)}{\partial X}dX(t)+\dfrac{1}{2}\dfrac{\partial^{2}F(X,t)}{\partial X^{2}}g^{2}(X,t)dt. (39)

Substituting Eq. (3) in the above equation gives the following:

dF(X,t)=\left(\dfrac{\partial F(X,t)}{\partial X}f(X,t)+\dfrac{1}{2}\dfrac{\partial^{2}F(X,t)}{\partial X^{2}}g^{2}(X,t)\right)dt+\dfrac{\partial F(X,t)}{\partial X}g(X,t)dB(t). (40)

Taking the expectation on both sides and differentiating with respect to $t$ yields:

\dfrac{d{\rm{E}}[F(X,t)]}{dt}={\rm{E}}\left(\dfrac{\partial F(X,t)}{\partial X}f(X,t)+\dfrac{1}{2}\dfrac{\partial^{2}F(X,t)}{\partial X^{2}}g^{2}(X,t)\right). (41)

Writing the expectations as integrals against the density $p(X,t)$ and noting that the stochastic integral has zero mean, i.e., ${\rm{E}}\left[\int_{0}^{t}\dfrac{\partial F(X,s)}{\partial X}g(X,s)dB(s)\right]=0$, where ${\rm{E}}(\cdot)$ is the expectation operator, one obtains the following,

\int\limits_{-\infty}^{\infty}p(X,t)\dfrac{dF(X)}{dt}dX=\int\limits_{-\infty}^{\infty}p(X,t)\left(\dfrac{\partial F(X,t)}{\partial X}f(X,t)+\dfrac{1}{2}\dfrac{\partial^{2}F(X,t)}{\partial X^{2}}g^{2}(X,t)\right)dX. (42)

By expanding the terms in the above equation using integration by parts and assuming that $\lim_{|X|\to\infty}p(X,t)\to 0$, the following weak form is obtained:

\int\limits_{-\infty}^{\infty}F(X)\left(\dfrac{\partial p(X,t)}{\partial t}+\dfrac{\partial(p(X,t)f(X,t))}{\partial X}-\dfrac{1}{2}\dfrac{\partial^{2}(p(X,t)g^{2}(X,t))}{\partial X^{2}}\right)dX=0. (43)

As the function $F(X,t)$ is arbitrary, the Fokker-Planck-Kolmogorov equation is obtained:

\dfrac{\partial p(X,t)}{\partial t}=-\dfrac{\partial(p(X,t)f(X,t))}{\partial X}+\dfrac{1}{2}\dfrac{\partial^{2}(p(X,t)g^{2}(X,t))}{\partial X^{2}};\quad p(0,X)=p_{0}(X). (44)

In a general setting, the above equation for a higher-dimensional diffusion process is expressed as,

\dfrac{\partial p(X,t)}{\partial t}=L_{t}p(X,t)=-\sum\limits_{i}^{m}\dfrac{\partial(p(X,t)f_{i}(X,t))}{\partial X_{i}}+\dfrac{1}{2}\sum\limits_{i}^{m}\sum\limits_{j}^{m}\sum\limits_{k}^{n}\dfrac{\partial^{2}(p(X,t)g_{i,k}(X,t)g_{j,k}(X,t))}{\partial X_{i}\partial X_{j}}, (45)

where the random variable $X$ satisfies the SDE in Eq. (3). At this point it is clear that only two terms will be active in the expansion of Eq. (37). Thus, for $n$=2, Eq. (37) yields,

\dfrac{\partial p(X,t)}{\partial t}=-\dfrac{\partial}{\partial X}\left[D^{(1)}(X,t)p(X,t)\right]+\dfrac{\partial^{2}}{\partial X^{2}}\left[D^{(2)}(X,t)p(X,t)\right]. (46)

On comparison of Eqs. (44) and (46), it is straightforward to note that, to estimate the drift and diffusion terms, it suffices to estimate the first and second order moments of the variations of the random variable $X(t)$ and then calculate the coefficients $D^{(1)}$ and $D^{(2)}$; formally,

f(X,t)=D^{(1)},\quad g^{2}(X,t)=D^{(2)}. (47)

To derive these coefficients of the Kramers-Moyal expansion, it is imperative to understand the one-step Itô-Taylor expansion of the random variable $X(t)$. Referring to Eq. (40), the integral form of Itô's lemma can be expressed as follows,

F(X_{t+h},t+h)=F(X_{t},t)+\int_{t}^{t+h}\left\{f(X_{s},s)F^{\prime}(X_{s},s)+\dfrac{1}{2}g^{2}(X_{s},s)F^{\prime\prime}(X_{s},s)\right\}ds+\int_{t}^{t+h}g(X_{s},s)F^{\prime}(X_{s},s)dB(s), (48)

where $F^{\prime}(X_{s},s)$ and $F^{\prime\prime}(X_{s},s)$ denote the first and second order partial derivatives with respect to the system states. For simplicity, two stochastic operators are defined in the following form:

\begin{array}{l}\Im^{0}(\cdot)=\dfrac{\partial(\cdot)}{\partial t}+\sum\limits_{i}^{m}f_{i}(X_{t},t)\dfrac{\partial(\cdot)}{\partial X_{i}}+\dfrac{1}{2}\sum\limits_{i}^{m}\sum\limits_{j}^{m}\sum\limits_{k}^{n}g_{i,k}(X_{t},t)g_{j,k}(X_{t},t)\dfrac{\partial^{2}(\cdot)}{\partial X_{i}\partial X_{j}}\\ \Im^{1}(\cdot)=\sum\limits_{i}^{m}\sum\limits_{k}^{n}g_{i,k}(X_{t},t)\dfrac{\partial(\cdot)}{\partial X_{i}}.\end{array} (49)

Using the operators, Eq. (48) can be rephrased as,

F(X_{t+h},t+h)=F(X_{t},t)+\int_{t}^{t+h}\Im^{0}\left(F(X_{s},s)\right)ds+\int_{t}^{t+h}\Im^{1}\left(F(X_{s},s)\right)dB(s). (50)

Then, substituting $F(X_{t},t)=X(t)$, one verifies that $\Im^{0}X(t)=f(X_{t},t)$ and $\Im^{1}X(t)=g(X_{t},t)$. This yields the first iteration:

\begin{array}{ll}X(t+h)&=X(t)+\int_{t}^{t+h}\Im^{0}\left(X(s)\right)ds+\int_{t}^{t+h}\Im^{1}\left(X(s)\right)dB(s)\\ &=X(t)+\int_{t}^{t+h}f(X_{s},s)ds+\int_{t}^{t+h}g(X_{s},s)dB(s).\end{array} (51)

In order to perform the second iteration, it is required to find the stochastic expansions of $F(X_{t},t)=f(X_{t},t)$ and $F(X_{t},t)=g(X_{t},t)$. Expanding $f(X_{t},t)$ and $g(X_{t},t)$ using the operators in Eq. (49) yields,

\begin{array}{l}f(X_{s},s)=f(X,t)+\int_{t}^{s_{1}}\Im^{0}\left(f(X_{s_{2}},s_{2})\right)ds_{2}+\int_{t}^{s_{1}}\Im^{1}\left(f(X_{s_{2}},s_{2})\right)dB(s_{2})\\ g(X_{s},s)=g(X,t)+\int_{t}^{s_{1}}\Im^{0}\left(g(X_{s_{2}},s_{2})\right)ds_{2}+\int_{t}^{s_{1}}\Im^{1}\left(g(X_{s_{2}},s_{2})\right)dB(s_{2}).\end{array} (52)

On substituting the above result in Eq. (51) and further iterating,

X(t+h)-X(t)=f(X,t)\int_{t}^{t+h}ds_{1}+g(X,t)\int_{t}^{t+h}dB(s_{1})+\Im^{1}\left(g(X,t)\right)\int_{t}^{t+h}\int_{t}^{s_{1}}dB(s_{2})dB(s_{1})+R, (53)

where the remainder term $R$ is,

\begin{array}{ll}R=&\int_{t}^{t+h}\int_{t}^{s_{1}}\Im^{0}\left(f(X_{s_{2}},s_{2})\right)ds_{2}ds_{1}+\left\{\int_{t}^{t+h}\int_{t}^{s_{1}}\Im^{1}\left(f(X_{s_{2}},s_{2})\right)dB(s_{2})ds_{1}+\int_{t}^{t+h}\int_{t}^{s_{1}}\Im^{0}\left(g(X_{s_{2}},s_{2})\right)ds_{2}dB(s_{1})\right\}+\\ &\int_{t_{0}}^{t}\int_{t_{0}}^{s_{1}}\int_{t}^{s_{2}}\Im^{1}\Im^{0}\left(g(X_{s_{3}},s_{3})\right)ds_{3}dB(s_{2})dB(s_{1})+\int_{t_{0}}^{t}\int_{t_{0}}^{s_{1}}\int_{t}^{s_{2}}\Im^{1}\Im^{1}\left(g(X_{s_{3}},s_{3})\right)dB(s_{3})dB(s_{2})dB(s_{1}).\end{array} (54)

The above equation is infinitely expandable using Eq. (49). For further treatment, the first moment is taken on both sides. Noting the results $\left\langle\int_{t}^{t+h}dB(s_{1})\right\rangle=0$ and $\left\langle\int_{t_{0}}^{t}\int_{t_{0}}^{s_{1}}dB(s_{2})dB(s_{1})\right\rangle=\left\langle\int_{t_{0}}^{t}B(s_{1})dB(s_{1})-B(t)\int_{t_{0}}^{t}dB(s_{1})\right\rangle=0$,

\begin{array}{ll}\left\langle X(t+h)-X(t)\right\rangle&=f(X,t)h+g(X,t)\left\langle\int_{t}^{t+h}dB(s_{1})\right\rangle+\Im^{1}\left(g(X,t)\right)\left\langle\int_{t}^{t+h}\int_{t}^{s_{1}}dB(s_{2})dB(s_{1})\right\rangle+\left\langle R\right\rangle\\ &=f(X,t)h+\left\langle R\right\rangle.\end{array} (55)

If the number of iterations is $k$, then the Brownian integrals $\int_{t}^{t+h}\int_{t}^{s_{1}}\ldots\int_{t}^{s_{k}}dB(s_{k+1})\ldots dB(s_{2})dB(s_{1})$ of multiplicity $k$ have a contribution proportional to $h^{k/2}$, the time integrals $\int_{t}^{t+h}\int_{t}^{s_{1}}\ldots\int_{t}^{s_{k}}ds_{k+1}\ldots ds_{2}ds_{1}$ have a contribution proportional to $h^{k}$, and the mixed integrals have a contribution between $h^{k/2}$ and $h^{k}$. Thus it is easy to infer that, as $h\to 0$, the higher order terms in the remainder $R$ vanish. Invoking these facts in the above expression, the first coefficient of the Kramers-Moyal expansion is obtained as,

D^{(1)}=f(X,t)=\left.\mathop{\lim}\limits_{\Delta t\to 0}\dfrac{1}{\Delta t}\left\langle X^{(1)}(t+\Delta t)-z\right\rangle\right|_{X(t)=z}. (56)

To derive the second coefficient, it is only required to find the quadratic variation of the increment process of $X(t)$. Upon taking the second moment on both sides of Eq. (53), the following is obtained,

\begin{array}{ll}\left\langle\left|X(t+h)-X(t)\right|^{2}\right\rangle&=\left\langle\left(f(X,t)h+g(X,t)\Delta B+R\right)^{2}\right\rangle\\ &=f^{2}(X,t)h^{2}+2f(X,t)g(X,t)h\left\langle\Delta B\right\rangle+g^{2}(X,t)\left\langle\left(\Delta B\right)^{2}\right\rangle+\left\langle R\right\rangle\\ &=g^{2}(X,t)h+\left\langle R\right\rangle.\end{array} (57)

In the above proof, the Itô identities are utilized, which state that, under mean square convergence, the quadratic variations of the time and Brownian increments are given as $ds_{2}ds_{1}=0$, $ds_{1}dB(s_{1})=0$, $dB(s_{2})ds_{1}=0$, and $dB(s_{1})dB(s_{2})=0$ for $s_{1}\neq s_{2}$. The deterministic time integrals vanish automatically since they have finite variance and zero quadratic covariance, and the other higher order Brownian integrals vanish as $h\to 0$. Thus the second coefficient of the Kramers-Moyal expansion is obtained as,

D^{(2)}=g^{2}(X,t)=\left.\dfrac{1}{2}\mathop{\lim}\limits_{\Delta t\to 0}\dfrac{1}{\Delta t}\left\langle\left|X^{(1)}(t+\Delta t)-z\right|^{2}\right\rangle\right|_{X(t)=z}. (58)

Under the ergodicity assumption on the evolution of the process $X(t)$, the time average $\left\langle\cdot\right\rangle$ is often replaced by the expectation operator ${\rm{E}}(\cdot)$. For $m$ variables ${\bm{X}}=\left\{X_{1},X_{2},\ldots,X_{m}\right\}$, the diffusion process has the form,

dX_{i}(t)=f_{i}({\bm{X}}_{t},t)dt+\sum\limits_{j}^{n}g_{ij}({\bm{X}}_{t},t)dB_{j}(t);\quad i=1,2,\ldots,m.

The properties of the Brownian motions are $\left\langle B_{i}(t)\right\rangle=0$ and $\left\langle B_{i}(t)B_{j}(s)\right\rangle=\delta_{ij}\min(t,s)$. If the covariation matrix of the diffusion components is given by $\Gamma({\bm{X}}_{t},t)=g({\bm{X}}_{t},t)g({\bm{X}}_{t},t)^{T}$, then the drift $f_{i}({\bm{X}},t)$ of the $i^{th}$ diffusion process and the $ij^{th}$ element of the covariation matrix $\Gamma({\bm{X}},t)$ can be estimated as,

\begin{array}{l}f_{i}({\bm{X}},t)=\left.\mathop{\lim}\limits_{\Delta t\to 0}\dfrac{1}{\Delta t}{\rm{E}}\left[X_{i}(t+\Delta t)-z_{i}\right]\right|_{X_{k}(t)=z_{k}}\;\forall\;k=1,2,\ldots,m\\ \Gamma_{ij}({\bm{X}},t)=\left.\dfrac{1}{2}\mathop{\lim}\limits_{\Delta t\to 0}\dfrac{1}{\Delta t}{\rm{E}}\left[\left|X_{i}(t+\Delta t)-z_{i}\right|\left|X_{j}(t+\Delta t)-z_{j}\right|\right]\right|_{X_{k}(t)=z_{k}}\;\forall\;k=1,2,\ldots,m.\end{array} (59)
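The estimators in Eq. (59) translate directly into a few lines of code. The sketch below assumes an ensemble of sample paths sharing the conditioning state at each step, so that the ensemble mean approximates the conditional expectation; a practical single-trajectory implementation would instead need binning or kernel smoothing over the state, which is omitted here.

```python
import numpy as np

def km_estimators(paths, dt):
    """Drift vector and covariation matrix estimates per Eq. (59).

    paths : array of shape (n_ensemble, m_states, n_time), an ensemble of
            sample paths sharing the state X_k(t) = z_k at each step.
    Returns drift of shape (m, n_time-1) and covariation of shape
    (m, m, n_time-1).
    """
    dx = np.diff(paths, axis=2)
    drift = dx.mean(axis=0) / dt                                   # f_i
    gamma = 0.5 * np.einsum('eit,ejt->ijt', dx, dx) / (dx.shape[0] * dt)
    return drift, gamma
```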

Appendix B Conditional probability distributions of the discontinuous Spike and Slab prior (DSS)

The DSS prior model is described in the Methods section. Drawing samples from the joint distribution directly is intractable but is possible through Markov chain Monte Carlo methods. In the present work, the Gibbs sampling technique is utilized to draw samples of the random variables ${\bm{Y}}$, ${\bm{\theta}}$, ${\bm{Z}}$, $\vartheta_{s}$, $\sigma^{2}$ and $p_{0}$. For drawing samples using Gibbs sampling, each random variable needs to be conditioned on the others. However, due to the DAG structure in Fig. 2, the dependencies of the conditional distributions on the other random variables can be relaxed to some extent as follows:

p\left(p_{0}\mid\bm{Y},\bm{\theta},\bm{Z},\vartheta_{s},\sigma^{2}\right)=p\left(p_{0}\mid\bm{Z}\right) (60)
p\left(\vartheta_{s}\mid\bm{Y},\bm{\theta},\bm{Z},p_{0},\sigma^{2}\right)=p\left(\vartheta_{s}\mid\bm{\theta},\bm{Z},\sigma^{2}\right)
p\left(\sigma^{2}\mid\bm{Y},\bm{\theta},\bm{Z},p_{0},\vartheta_{s}\right)=p\left(\sigma^{2}\mid\bm{Y},\bm{\theta},\bm{Z},\vartheta_{s}\right)
p\left(\bm{\theta}\mid\bm{Y},\bm{Z},p_{0},\vartheta_{s},\sigma^{2}\right)=p\left(\bm{\theta}\mid\bm{Y},\bm{Z},\vartheta_{s},\sigma^{2}\right)
p\left(\bm{Z}\mid\bm{Y},\bm{\theta},p_{0},\vartheta_{s},\sigma^{2}\right)=p\left(\bm{Z}\mid\bm{\theta},p_{0},\vartheta_{s},\sigma^{2}\right).

The elements of the weight vector ${\bm{\theta}}$ for which the latent variable $Z_{k}\neq 1$ become an absorbing state of the Markov chain. However, to achieve a stationary distribution, one needs to construct an irreducible Markov chain. Thus, the conditioning on the weight vector is eliminated by marginalizing the conditional distributions with respect to ${\bm{\theta}}$:

\begin{array}{l}\int p\left(\sigma^{2}\mid{\bm{Y}},{\bm{\theta}},{\bm{Z}},\vartheta_{s}\right)d{\bm{\theta}}=\int\dfrac{p\left(\sigma^{2},{\bm{Y}},{\bm{\theta}},{\bm{Z}},\vartheta_{s}\right)}{p\left({\bm{Y}},{\bm{\theta}},{\bm{Z}},\vartheta_{s}\right)}d{\bm{\theta}}=p\left(\sigma^{2}\mid{\bm{Y}},{\bm{Z}},\vartheta_{s}\right)\\ \int p\left({\bm{Z}}\mid{\bm{Y}},{\bm{\theta}},p_{0},\vartheta_{s},\sigma^{2}\right)d{\bm{\theta}}=\int\dfrac{p\left({\bm{Z}},{\bm{Y}},{\bm{\theta}},p_{0},\vartheta_{s},\sigma^{2}\right)}{p\left({\bm{Y}},{\bm{\theta}},p_{0},\vartheta_{s},\sigma^{2}\right)}d{\bm{\theta}}=\dfrac{p\left({\bm{Z}},{\bm{Y}},p_{0},\vartheta_{s},\sigma^{2}\right)}{p\left({\bm{Y}},p_{0},\vartheta_{s},\sigma^{2}\right)}=p\left({\bm{Z}}\mid{\bm{Y}},p_{0},\vartheta_{s},\sigma^{2}\right).\end{array} (61)

The conditional distribution $p\left(p_{0}\mid{\bm{Z}}\right)$:

\begin{array}{ll}p\left(p_{0}\mid{\bm{Z}}\right)&=p\left({\bm{Z}}\mid p_{0}\right)p\left(p_{0}\right)\\ &\propto\left(\prod_{k=1}^{K}p_{0}^{Z_{k}}\left(1-p_{0}\right)^{1-Z_{k}}\right)\left(p_{0}^{\alpha_{p}-1}\left(1-p_{0}\right)^{\beta_{p}-1}\right)\\ &\propto p_{0}^{\alpha_{p}+h_{Z}-1}\left(1-p_{0}\right)^{\beta_{p}+K-h_{Z}-1},\end{array} (62)

where $h_{Z}=\sum_{k=1}^{K}Z_{k}$. Given the latent vector ${\bm{Z}}$, $p_{0}$ is sampled as $p_{0}\mid{\bm{Z}}\sim Beta\left(\alpha_{p}+h_{Z},\beta_{p}+K-h_{Z}\right)$.

The conditional distribution $p\left(\vartheta_{s}\mid{\bm{\theta}},{\bm{Z}},\sigma^{2}\right)$:

\begin{array}{ll}p\left(\vartheta_{s}\mid\bm{\theta},\bm{Z},\sigma^{2}\right)&\propto p\left(\bm{\theta}\mid\bm{Z},\vartheta_{s},\sigma^{2}\right)p\left(\vartheta_{s}\right)p(\bm{Z})p\left(\sigma^{2}\right)\\ &\propto p\left(\bm{\theta}\mid\bm{Z},\vartheta_{s},\sigma^{2}\right)p\left(\vartheta_{s}\right)\\ &\propto\mathcal{N}\left(\bm{\theta}_{r}\mid\mathbf{0},\sigma^{2}\vartheta_{s}\mathbf{R}_{0,r}\right)\mathcal{I}\mathcal{G}\left(\alpha_{\vartheta},\beta_{\vartheta}\right)\\ &\propto\dfrac{1}{\left(\vartheta_{s}\sigma^{2}\right)^{r/2}\left|\mathbf{R}_{0,r}\right|^{1/2}}\exp\left(-\dfrac{\bm{\theta}_{r}^{T}\mathbf{R}_{0,r}^{-1}\bm{\theta}_{r}}{2\sigma^{2}\vartheta_{s}}\right)\vartheta_{s}^{-\alpha_{\vartheta}-1}\exp\left(-\dfrac{\beta_{\vartheta}}{\vartheta_{s}}\right)\\ &\propto\vartheta_{s}^{-\left(\alpha_{\vartheta}+\frac{r}{2}\right)-1}\exp\left(-\dfrac{\beta_{\vartheta}+\left(\bm{\theta}_{r}^{T}\mathbf{R}_{0,r}^{-1}\bm{\theta}_{r}/2\sigma^{2}\right)}{\vartheta_{s}}\right),\end{array} (63)

where $r$ is the number of elements of the weight vector ${\bm{\theta}}$ for which the latent variable $Z_{k}=1$. In the above, it is to be noted that $p(\bm{Z})$ and $p\left(\sigma^{2}\right)$ are constants when $\vartheta_{s}$ is sampled; this can be observed from the DAG structure in Fig. 2. Thus, given $\bm{\theta}$, $\bm{Z}$ and $\sigma^{2}$, $\vartheta_{s}$ is sampled as $\vartheta_{s}\mid\bm{\theta},\bm{Z},\sigma^{2}\sim\mathcal{I}\mathcal{G}\left(\alpha_{\vartheta}+\dfrac{r}{2},\beta_{\vartheta}+\dfrac{\bm{\theta}_{r}^{T}\mathbf{R}_{0,r}^{-1}\bm{\theta}_{r}}{2\sigma^{2}}\right)$.

The conditional distribution $p\left(\sigma^{2}\mid{\bm{Y}},{\bm{\theta}},{\bm{Z}},\vartheta_{s}\right)$:

\begin{array}{ll}p\left(\sigma^{2}\mid\bm{Y},\bm{Z},\vartheta_{s}\right)&\propto\int p\left(\bm{Y},\bm{\theta},\bm{Z},\vartheta_{s},\sigma^{2}\right)d\bm{\theta}\\ &\propto\left(\int p\left(\bm{Y}\mid\bm{\theta},\sigma^{2}\right)p\left(\bm{\theta}\mid\bm{Z},\vartheta_{s},\sigma^{2}\right)d\bm{\theta}\right)p(\bm{Z})p\left(\vartheta_{s}\right)p\left(\sigma^{2}\right)\\ &\propto\left(\int p\left(\bm{Y}\mid\bm{\theta},\sigma^{2}\right)p\left(\bm{\theta}\mid\bm{Z},\vartheta_{s},\sigma^{2}\right)d\bm{\theta}\right)p\left(\sigma^{2}\right).\end{array} (64)

From the DAG structure it can be understood that $p(\bm{Z})$ and $p\left(\vartheta_{s}\right)$ act as constants when $\sigma^{2}$ is sampled. Since the $\theta_{k}$-values corresponding to the spike distribution do not contribute to the basis function selection, only the remaining weights belonging to the slab, denoted by $\bm{\theta}_{r}$, are used. Then, one can expand the integrand $p\left({\bm{Y}}\mid\bm{\theta},\sigma^{2}\right)p\left(\bm{\theta}\mid{\bm{Z}},\vartheta_{s},\sigma^{2}\right)$ as,

\begin{array}{ll}p\left(\bm{Y}\mid\bm{\theta},\sigma^{2}\right)p\left(\bm{\theta}\mid\bm{Z},\vartheta_{s},\sigma^{2}\right)&=\mathcal{N}\left(\bm{Y}\mid\mathbf{L}_{r}\bm{\theta}_{r},\sigma^{2}\mathbf{I}_{N\times N}\right)\mathcal{N}\left(\bm{\theta}_{r}\mid\mathbf{0},\sigma^{2}\vartheta_{s}\mathbf{R}_{0,r}\right)\\ &=\dfrac{1}{\left(2\pi\sigma^{2}\right)^{N/2}}\exp\left(-\dfrac{\left(\bm{Y}-\mathbf{L}_{r}\bm{\theta}_{r}\right)^{T}\left(\bm{Y}-\mathbf{L}_{r}\bm{\theta}_{r}\right)}{2\sigma^{2}}\right)\dfrac{\left(|\mathbf{R}_{0,r}^{-1}|\right)^{1/2}}{\left(2\pi\vartheta_{s}\sigma^{2}\right)^{r/2}}\exp\left(-\dfrac{\bm{\theta}_{r}^{T}\mathbf{R}_{0,r}^{-1}\bm{\theta}_{r}}{2\sigma^{2}\vartheta_{s}}\right)\\ &=\dfrac{1}{\left(2\pi\sigma^{2}\right)^{N/2}}\dfrac{\left(|\mathbf{R}_{0,r}^{-1}|\right)^{1/2}}{\left(2\pi\vartheta_{s}\sigma^{2}\right)^{r/2}}\exp\left(-\dfrac{\left(\bm{\theta}_{r}-\bm{\mu}\right)^{T}\bm{\Sigma}^{-1}\left(\bm{\theta}_{r}-\bm{\mu}\right)}{2\sigma^{2}}\right)\exp\left(-\dfrac{\left(\bm{Y}^{T}\bm{Y}-\bm{\mu}^{T}\bm{\Sigma}^{-1}\bm{\mu}\right)}{2\sigma^{2}}\right),\end{array} (65)

where $\bm{\Sigma}^{-1}=\left(\mathbf{L}_{r}^{T}\mathbf{L}_{r}+\vartheta_{s}^{-1}\mathbf{R}_{0,r}^{-1}\right)$ and $\bm{\mu}=\bm{\Sigma}\mathbf{L}_{r}^{T}{\bm{Y}}$. Upon integration of the above expression with respect to $\bm{\theta}_{r}$, one obtains:

\begin{array}{ll}p\left(\sigma^{2}\mid\bm{Y},\bm{Z},\vartheta_{s}\right)&\propto\dfrac{1}{\left(\sigma^{2}\right)^{N/2}}\exp\left(-\dfrac{\left(\bm{Y}^{T}\bm{Y}-\bm{\mu}^{T}\bm{\Sigma}^{-1}\bm{\mu}\right)}{2\sigma^{2}}\right)p\left(\sigma^{2}\right)\\ &\propto\dfrac{1}{\left(\sigma^{2}\right)^{N/2}}\exp\left(-\dfrac{\left(\bm{Y}^{T}\bm{Y}-\bm{\mu}^{T}\bm{\Sigma}^{-1}\bm{\mu}\right)}{2\sigma^{2}}\right)\left(\sigma^{2}\right)^{-\alpha_{\sigma}-1}\exp\left(-\dfrac{\beta_{\sigma}}{\sigma^{2}}\right)\\ &\propto\left(\sigma^{2}\right)^{-\left(\alpha_{\sigma}+0.5N\right)-1}\exp\left(-\dfrac{\beta_{\sigma}+\frac{1}{2}\left(\bm{Y}^{T}\bm{Y}-\bm{\mu}^{T}\bm{\Sigma}^{-1}\bm{\mu}\right)}{\sigma^{2}}\right).\end{array} (66)

The random variable $\sigma^{2}$ is therefore sampled as $\sigma^{2}\mid\bm{Y},\bm{Z},\vartheta_{s}\sim\mathcal{IG}\left(\alpha_{\sigma}+\dfrac{N}{2},\;\beta_{\sigma}+\dfrac{1}{2}\left(\bm{Y}^{T}\bm{Y}-\bm{\mu}^{T}\bm{\Sigma}^{-1}\bm{\mu}\right)\right)$.
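A corresponding sketch of this collapsed $\sigma^{2}$ update might look as follows; the names and the hyperparameter defaults ($\alpha_{\sigma},\beta_{\sigma}$) are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.stats import invgamma

def sample_sigma2(Y, L_r, R0r_inv, vartheta_s, a_sig=1e-4, b_sig=1e-4):
    """Draw sigma^2 | Y, Z, vartheta_s from
    IG(a_sig + N/2, b_sig + (Y'Y - mu' Sigma^{-1} mu)/2), theta integrated out."""
    N = Y.size
    Sigma_inv = L_r.T @ L_r + R0r_inv / vartheta_s   # Sigma^{-1} = L_r'L_r + R_{0,r}^{-1}/vartheta_s
    mu = np.linalg.solve(Sigma_inv, L_r.T @ Y)       # mu = Sigma L_r' Y
    quad = Y @ Y - mu @ Sigma_inv @ mu               # Y'Y - mu' Sigma^{-1} mu
    return invgamma.rvs(a=a_sig + 0.5 * N, scale=b_sig + 0.5 * quad)
```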

The conditional distribution $p\left(\bm{\theta}\mid\bm{Y},\bm{Z},\vartheta_{s},\sigma^{2}\right)$: The elements of the weight vector $\bm{\theta}$ corresponding to the spike distribution, i.e. the weights for which the latent variable $Z_{k}=0$, $k=1,\ldots,K$, are assigned the value 0. The conditional distribution for the weights belonging to the slab distribution, denoted by $\bm{\theta}_{r}$, is derived as follows:

$$
\begin{aligned}
p\left(\bm{\theta}_{r}\mid\bm{Y},\vartheta_{s},\sigma^{2}\right) &\propto \mathcal{N}\left(\bm{Y}\mid\mathbf{L}_{r}\bm{\theta}_{r},\sigma^{2}\mathbf{I}_{N\times N}\right) \mathcal{N}\left(\bm{\theta}_{r}\mid\bm{0},\sigma^{2}\vartheta_{s}\mathbf{R}_{0,r}\right) \\
&\propto \exp\left(-\frac{\left(\bm{Y}-\mathbf{L}_{r}\bm{\theta}_{r}\right)^{T}\left(\bm{Y}-\mathbf{L}_{r}\bm{\theta}_{r}\right)}{2\sigma^{2}}\right) \exp\left(-\frac{\bm{\theta}_{r}^{T}\mathbf{R}_{0,r}^{-1}\bm{\theta}_{r}}{2\sigma^{2}\vartheta_{s}}\right) \\
&\propto \exp\left(-\frac{\bm{Y}^{T}\bm{Y}+\bm{\theta}_{r}^{T}\bm{\Sigma}^{-1}\bm{\theta}_{r}-2\bm{\theta}_{r}^{T}\bm{\Sigma}^{-1}\bm{\mu}}{2\sigma^{2}}\right) \\
&\propto \exp\left(-\frac{\left(\bm{\theta}_{r}-\bm{\mu}\right)^{T}\bm{\Sigma}^{-1}\left(\bm{\theta}_{r}-\bm{\mu}\right)}{2\sigma^{2}}\right),
\end{aligned}
\tag{67}
$$

where $\bm{\Sigma}^{-1}=\left(\mathbf{L}_{r}^{T}\mathbf{L}_{r}+\vartheta_{s}^{-1}\mathbf{R}_{0,r}^{-1}\right)$ and $\bm{\mu}=\bm{\Sigma}\mathbf{L}_{r}^{T}\bm{Y}$. The random values of $\bm{\theta}_{r}$ are then sampled as $\bm{\theta}_{r}\mid\bm{Y},\vartheta_{s},\sigma^{2}\sim\mathcal{N}\left(\bm{\mu},\sigma^{2}\bm{\Sigma}\right)$.
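The slab-weight update thus reduces to a single multivariate normal draw. A possible sketch, again with illustrative names and with the spike weights assumed to be zeroed outside this call, is:

```python
import numpy as np

def sample_theta_r(Y, L_r, R0r_inv, vartheta_s, sigma2, rng=None):
    """Draw theta_r | Y, vartheta_s, sigma^2 ~ N(mu, sigma^2 Sigma).

    Weights in the spike (Z_k = 0) are set to zero by the caller;
    only the slab weights theta_r are sampled here."""
    rng = np.random.default_rng() if rng is None else rng
    Sigma = np.linalg.inv(L_r.T @ L_r + R0r_inv / vartheta_s)
    mu = Sigma @ L_r.T @ Y
    return rng.multivariate_normal(mean=mu, cov=sigma2 * Sigma)
```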

The conditional distribution $p\left(\bm{Z}\mid\bm{\theta},p_{0},\vartheta_{s},\sigma^{2}\right)$: The latent variables $Z_{k}$ corresponding to the $k^{th}$ elements of the weight vector are assigned the value 0 or 1 independently. The conditional distribution of $Z_{k}$ is estimated by comparing the probabilities that $Z_{k}$ takes the value 1 or 0, given $\vartheta_{s}$, $p_{0}$, and the remaining entries of the latent vector, $\bm{Z}_{-k}$. Here, $\bm{Z}_{-k}$ denotes the latent vector $\bm{Z}$ with its $k^{th}$ element removed. Let $u_{k}$ denote the probability with which the $k^{th}$ latent variable $Z_{k}$ takes the value 1. The probability $u_{k}$ is then found as

$$
\begin{aligned}
u_{k} &= \frac{p\left(Z_{k}=1\mid\bm{Y},\bm{Z}_{-k},\vartheta_{s},p_{0}\right)}{p\left(Z_{k}=1\mid\bm{Y},\bm{Z}_{-k},\vartheta_{s},p_{0}\right)+p\left(Z_{k}=0\mid\bm{Y},\bm{Z}_{-k},\vartheta_{s},p_{0}\right)} \\
&= \frac{p\left(\bm{Y}\mid Z_{k}=1,\bm{Z}_{-k},\vartheta_{s}\right) p\left(Z_{k}=1\mid p_{0}\right)}{p\left(\bm{Y}\mid Z_{k}=1,\bm{Z}_{-k},\vartheta_{s}\right) p\left(Z_{k}=1\mid p_{0}\right)+p\left(\bm{Y}\mid Z_{k}=0,\bm{Z}_{-k},\vartheta_{s}\right) p\left(Z_{k}=0\mid p_{0}\right)} \\
&= \frac{p\left(\bm{Y}\mid Z_{k}=1,\bm{Z}_{-k},\vartheta_{s}\right) p_{0}}{p\left(\bm{Y}\mid Z_{k}=1,\bm{Z}_{-k},\vartheta_{s}\right) p_{0}+p\left(\bm{Y}\mid Z_{k}=0,\bm{Z}_{-k},\vartheta_{s}\right)\left(1-p_{0}\right)} \\
&= \frac{p_{0}}{p_{0}+\lambda_{k}\left(1-p_{0}\right)},
\end{aligned}
\tag{68}
$$

where $\lambda_{k}=\dfrac{p\left(\bm{Y}\mid Z_{k}=0,\bm{Z}_{-k},\vartheta_{s}\right)}{p\left(\bm{Y}\mid Z_{k}=1,\bm{Z}_{-k},\vartheta_{s}\right)}$. The marginal likelihood function $p\left(\bm{Y}\mid\bm{Z},\vartheta_{s}\right)$ is derived by integrating out the random variables $\bm{\theta}$ and $\sigma^{2}$ from the original likelihood function as follows:

$$
p\left(\bm{Y}\mid\bm{Z},\vartheta_{s}\right)=\int p\left(\bm{Y}\mid\bm{Z},\vartheta_{s},\sigma^{2}\right) p\left(\sigma^{2}\right) d\sigma^{2},
\tag{69}
$$

where $p\left(\bm{Y}\mid\bm{Z},\vartheta_{s},\sigma^{2}\right)$ is obtained by marginalizing the likelihood function with respect to the variable $\bm{\theta}$. Considering only the weights whose corresponding latent variables $Z_{k}=1$ (the slab weights), denoted by $\bm{\theta}_{r}$, the probability $p\left(\bm{Y}\mid\bm{Z},\vartheta_{s},\sigma^{2}\right)$ is obtained as

$$
\begin{aligned}
p\left(\bm{Y}\mid\bm{Z},\vartheta_{s},\sigma^{2}\right) &= \int p\left(\bm{Y},\bm{\theta}\mid\bm{Z},\vartheta_{s},\sigma^{2}\right) d\bm{\theta} \\
&= \int p\left(\bm{Y}\mid\bm{\theta}_{r},\sigma^{2}\right) p\left(\bm{\theta}_{r}\mid\vartheta_{s},\sigma^{2}\right) d\bm{\theta}_{r} \\
&= \int \mathcal{N}\left(\bm{Y}\mid\mathbf{L}_{r}\bm{\theta}_{r},\sigma^{2}\mathbf{I}_{N\times N}\right) \mathcal{N}\left(\bm{\theta}_{r}\mid\bm{0},\sigma^{2}\vartheta_{s}\mathbf{R}_{0,r}\right) d\bm{\theta}_{r} \\
&= \int \frac{1}{\left(2\pi\sigma^{2}\right)^{N/2}} \exp\left(-\frac{\left(\bm{Y}-\mathbf{L}_{r}\bm{\theta}_{r}\right)^{T}\left(\bm{Y}-\mathbf{L}_{r}\bm{\theta}_{r}\right)}{2\sigma^{2}}\right) \frac{|\mathbf{R}_{0,r}^{-1}|^{1/2}}{\left(2\pi\vartheta_{s}\sigma^{2}\right)^{r/2}} \exp\left(-\frac{\bm{\theta}_{r}^{T}\mathbf{R}_{0,r}^{-1}\bm{\theta}_{r}}{2\sigma^{2}\vartheta_{s}}\right) d\bm{\theta}_{r} \\
&= \frac{1}{\left(2\pi\sigma^{2}\right)^{N/2}} \frac{|\mathbf{R}_{0,r}^{-1}|^{1/2}\,|\bm{\Sigma}|^{1/2}}{\left(\vartheta_{s}\right)^{r/2}} \exp\left(-\frac{\bm{Y}^{T}\bm{Y}-\bm{\mu}^{T}\bm{\Sigma}^{-1}\bm{\mu}}{2\sigma^{2}}\right),
\end{aligned}
\tag{70}
$$

where $\bm{\Sigma}=\left(\mathbf{L}_{r}^{T}\mathbf{L}_{r}+\vartheta_{s}^{-1}\mathbf{R}_{0,r}^{-1}\right)^{-1}$ and $\bm{\mu}=\bm{\Sigma}\mathbf{L}_{r}^{T}\bm{Y}$. The operator $|\cdot|$ denotes the determinant. With the above result, $p\left(\bm{Y}\mid\bm{Z},\vartheta_{s}\right)$ can be derived as

$$
\begin{aligned}
p\left(\bm{Y}\mid\bm{Z},\vartheta_{s}\right) &= \frac{|\mathbf{R}_{0,r}^{-1}|^{1/2}\,|\bm{\Sigma}|^{1/2}}{(2\pi)^{N/2}\left(\vartheta_{s}\right)^{r/2}} \int \frac{1}{\left(\sigma^{2}\right)^{N/2}} \exp\left(-\frac{\bm{Y}^{T}\bm{Y}-\bm{\mu}^{T}\bm{\Sigma}^{-1}\bm{\mu}}{2\sigma^{2}}\right) \mathcal{IG}\left(\alpha_{\sigma},\beta_{\sigma}\right) d\sigma^{2} \\
&= \frac{|\mathbf{R}_{0,r}^{-1}|^{1/2}\,|\bm{\Sigma}|^{1/2}}{(2\pi)^{N/2}\left(\vartheta_{s}\right)^{r/2}} \frac{\left(\beta_{\sigma}\right)^{\alpha_{\sigma}}}{\Gamma\left(\alpha_{\sigma}\right)} \int \left(\sigma^{2}\right)^{-\left(\alpha_{\sigma}+N/2+1\right)} \exp\left(-\frac{\beta_{\sigma}+\frac{1}{2}\left(\bm{Y}^{T}\bm{Y}-\bm{\mu}^{T}\bm{\Sigma}^{-1}\bm{\mu}\right)}{\sigma^{2}}\right) d\sigma^{2} \\
&= \frac{|\mathbf{R}_{0,r}^{-1}|^{1/2}\,|\bm{\Sigma}|^{1/2}}{(2\pi)^{N/2}\left(\vartheta_{s}\right)^{r/2}} \frac{\left(\beta_{\sigma}\right)^{\alpha_{\sigma}}}{\Gamma\left(\alpha_{\sigma}\right)} \frac{\Gamma\left(\alpha_{\sigma}+\frac{N}{2}\right)}{\left(\beta_{\sigma}+\frac{1}{2}\left(\bm{Y}^{T}\bm{Y}-\bm{\mu}^{T}\bm{\Sigma}^{-1}\bm{\mu}\right)\right)^{\alpha_{\sigma}+\frac{N}{2}}},
\end{aligned}
\tag{71}
$$

where $\Gamma(\cdot)$ denotes the Gamma function. To summarise, the latent variables $Z_{k}$ are sampled from the Bernoulli distribution with parameter $u_{k}$ as $Z_{k}\mid\bm{Y},\vartheta_{s},p_{0}\sim\mathrm{Bern}\left(u_{k}\right)$ for $k=1,\ldots,K$, where the marginalised likelihood function is obtained as

$$
p\left(\bm{Y}\mid\bm{Z},\vartheta_{s}\right)=
\begin{cases}
\dfrac{\left(\beta_{\sigma}\right)^{\alpha_{\sigma}}}{\Gamma\left(\alpha_{\sigma}\right)} \dfrac{\Gamma\left(\alpha_{\sigma}+\frac{N}{2}\right)}{(2\pi)^{N/2}} \dfrac{1}{\left(\beta_{\sigma}+\frac{1}{2}\,\bm{Y}^{T}\bm{Y}\right)^{\alpha_{\sigma}+\frac{N}{2}}}, & \text{when all } \{Z_{k};k=1,\ldots,K\}=0, \\[2ex]
\dfrac{\left(\beta_{\sigma}\right)^{\alpha_{\sigma}}}{\Gamma\left(\alpha_{\sigma}\right)} \dfrac{\Gamma\left(\alpha_{\sigma}+\frac{N}{2}\right)}{(2\pi)^{N/2}\left(\vartheta_{s}\right)^{r/2}} \dfrac{|\mathbf{R}_{0,r}^{-1}|^{1/2}\,|\bm{\Sigma}|^{1/2}}{\left(\beta_{\sigma}+\frac{1}{2}\left(\bm{Y}^{T}\bm{Y}-\bm{\mu}^{T}\bm{\Sigma}^{-1}\bm{\mu}\right)\right)^{\alpha_{\sigma}+\frac{N}{2}}}, & \text{otherwise.}
\end{cases}
\tag{72}
$$
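Putting the pieces together, the $Z_{k}$ update can be sketched in Python as below. The log of Eq. (72) is used for numerical stability. The helper R0_inv_fn (returning $\mathbf{R}_{0,r}^{-1}$ for the active columns), the fixed $p_{0}$, and the hyperparameter defaults are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_likelihood(Y, L_full, Z, vartheta_s, R0_inv_fn,
                            a_sig=1e-4, b_sig=1e-4):
    """log p(Y | Z, vartheta_s) following Eq. (72); handles the
    all-spike case r = 0 separately."""
    N = Y.size
    idx = np.flatnonzero(Z)          # active (slab) columns, Z_k = 1
    r = idx.size
    logp = (a_sig * np.log(b_sig) - gammaln(a_sig)
            + gammaln(a_sig + 0.5 * N) - 0.5 * N * np.log(2.0 * np.pi))
    if r == 0:
        return logp - (a_sig + 0.5 * N) * np.log(b_sig + 0.5 * (Y @ Y))
    L_r = L_full[:, idx]
    R0r_inv = R0_inv_fn(idx)         # hypothetical helper: inverse slab correlation
    Sigma_inv = L_r.T @ L_r + R0r_inv / vartheta_s
    mu = np.linalg.solve(Sigma_inv, L_r.T @ Y)
    quad = Y @ Y - mu @ Sigma_inv @ mu
    _, logdet_Rinv = np.linalg.slogdet(R0r_inv)
    _, logdet_Sinv = np.linalg.slogdet(Sigma_inv)
    logp += (0.5 * logdet_Rinv - 0.5 * logdet_Sinv   # +0.5 log|Sigma|
             - 0.5 * r * np.log(vartheta_s))
    return logp - (a_sig + 0.5 * N) * np.log(b_sig + 0.5 * quad)

def sample_Z(Y, L_full, Z, vartheta_s, p0, R0_inv_fn, rng=None):
    """One Gibbs sweep over the latent indicators, Z_k ~ Bern(u_k), Eq. (68)."""
    rng = np.random.default_rng() if rng is None else rng
    Z = Z.copy()
    for k in range(Z.size):
        Z1, Z0 = Z.copy(), Z.copy()
        Z1[k], Z0[k] = 1, 0
        # lambda_k = p(Y | Z_k = 0, .) / p(Y | Z_k = 1, .), computed in log space
        log_lam = (log_marginal_likelihood(Y, L_full, Z0, vartheta_s, R0_inv_fn)
                   - log_marginal_likelihood(Y, L_full, Z1, vartheta_s, R0_inv_fn))
        u_k = p0 / (p0 + np.exp(log_lam) * (1.0 - p0))
        Z[k] = rng.binomial(1, u_k)
    return Z
```

A full Gibbs iteration under these assumptions would cycle through sample_Z, sample_theta_r, sample_vartheta_s, and sample_sigma2 in turn.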

Acknowledgements

T. Tripura acknowledges the financial support received from the Ministry of Human Resource Development (MHRD), India, in the form of the Prime Minister's Research Fellowship (PMRF). S. Chakraborty acknowledges the financial support received from the Science and Engineering Research Board (SERB) through grant no. SRG/2021/000467 and the seed grant received from IIT Delhi.

Declarations

Funding

The corresponding author received funding from IIT Delhi in the form of a seed grant.

Conflicts of interest

The authors declare that they have no conflict of interest.

Availability of data and material

The datasets generated and/or analysed during the current study are available from the corresponding author on reasonable request.

Code availability

The Python codes written for this work are available from the corresponding author on reasonable request.
