
Bayesian Inference of Stochastic Reaction Networks using Multifidelity Sequential Tempered Markov Chain Monte Carlo

Thomas A. Catanach (corresponding author, [email protected]), Sandia National Laboratories, Livermore, CA

Huy D. Vo and Brian Munsky, Dept. of Chemical and Biological Engineering, Colorado State University, Fort Collins, CO
Abstract

Stochastic reaction network models are often used to explain and predict the dynamics of gene regulation in single cells. These models usually involve several parameters, such as the kinetic rates of chemical reactions, that are not directly measurable and must be inferred from experimental data. Bayesian inference provides a rigorous probabilistic framework for identifying these parameters by finding a posterior parameter distribution that captures their uncertainty. Traditional computational methods for solving inference problems, such as Markov Chain Monte Carlo methods based on the classical Metropolis-Hastings algorithm, involve numerous serial evaluations of the likelihood function, which in turn requires expensive forward solutions of the chemical master equation (CME). We propose an alternative approach based on a multifidelity extension of the Sequential Tempered Markov Chain Monte Carlo (ST-MCMC) sampler. This algorithm is built upon Sequential Monte Carlo and solves the Bayesian inference problem by decomposing it into a sequence of efficiently solved subproblems that gradually increase model fidelity and the influence of the observed data. We reformulate the finite state projection (FSP) algorithm, a well-known method for solving the CME, to produce a hierarchy of surrogate master equations to be used in this multifidelity scheme. To determine the appropriate fidelity, we introduce a novel information-theoretic criterion that seeks to extract the most information about the ultimate Bayesian posterior from each model in the hierarchy without inducing significant bias. This novel sampling scheme is tested with high performance computing resources using biologically relevant problems.

1 Introduction

A distinguishing feature of biology is the diversity manifested by living things across different scales, from the readily observed multitude of species to the differences between individuals of the same species. At the microscopic level, a population of cells with the same genetic code, growing under the same lab conditions, can still display phenotypic variability in gene products [1, 2, 3, 4, 5]. Phenotypic variability has been observed in an increasing volume of data obtained from single-cell, single-molecule measurements enabled by recent progress in chemical labeling and imaging techniques [6, 7, 8].

Much of the variability in gene expression is attributed to the stochasticity of vital cellular processes (e.g., transcription, translation) that are subjected to the randomness of molecular interactions. Stochastic reaction networks (SRN) represent a class of models that have been widely used to capture temporal and spatial fluctuations in single-cell gene expression [9]. SRN models treat the copy numbers of biochemical species, i.e. the number of molecules of a given type within a cell, as states in a discrete-space, continuous-time Markov process, where chemical reactions are represented by transitions between states. Given an SRN model, the probabilities of gene expression states within a cell can be computed by solving the chemical master equation (CME). This is a dynamical system in an infinite-dimensional space that describes the evolution of the probability distribution of all states. The finite state projection (FSP) is a well-known approximation method to obtain high-fidelity solutions of the CME [10]. This method reduces the intractable state space of the original SRN into a finite subset chosen based on a proven error bound, turning the infinite-dimensional CME system into a finite problem of linear differential equations.

The present work is concerned with the selection, parameter estimation, and uncertainty propagation of these reaction network models within the Bayesian framework. Bayesian methods are a powerful tool for system identification of SRN models because they provide rigorous uncertainty quantification by identifying a probability distribution over plausible model parameters instead of selecting a single model that may fit the data well [11, 12, 13, 14, 15]. This distribution over the models given the data is called the posterior distribution. Quantifying model parameter uncertainty is critical because it is difficult to model the full complexity of a biological system, and the system may exhibit experimental context dependence [16]. Moreover, once model parameter uncertainty has been quantified, further experiments can be designed to provide new information about the system [17, 18, 19, 20, 21].

In this paper, we focus on data obtained by the experimental technique of single-molecule fluorescent in situ hybridization (smFISH). These datasets consist of independent single-cell measurements, each of which measures the copy number of biochemical species at a single time point. The standard approach to sample from the posterior distribution implied by this data is to use Markov Chain Monte Carlo (MCMC) algorithms such as the random walk Metropolis-Hastings MCMC sampler [11, 22]. With high-fidelity CME solutions enabled by the FSP, one can compute the likelihood of observing these single-cell data and then perform Bayesian inference for model parameters. However, this approach suffers from two drawbacks. The first drawback is that the MCMC is inherently serial, preventing it from utilizing the massively parallel processing capability provided by modern high performance computing clusters. Therefore, if standard MCMC techniques are used, several tens to hundreds of thousands of sequential model evaluations may be needed to adequately sample the posterior distribution. The second drawback is that the FSP solutions required for the likelihood function are expensive, often requiring several minutes per model evaluation for a moderately sized problem. Typically, the number of differential equations that the FSP algorithm needs to solve grows exponentially with the number of species in the network, so the size of the state space and transition matrix quickly grows intractable. This has motivated Approximate Bayesian Computation (ABC) approaches that replace the computationally expensive single-cell likelihood function with less expensive model-data discrepancy functions [23, 24, 25]. Samples produced by ABC are, in general, not distributed according to the true posterior distribution, and a careful choice of summary statistics is critical for the performance and reliability of ABC samplers.

We propose in this work a different approach that uses a parallel, multifidelity MCMC computational framework to produce samples from the true posterior distribution. Many parallel MCMC methods have been proposed, based either on parallel proposals or on parallel Markov chains [26, 27, 28, 29]. One popular family of parallel MCMC methods is based upon sequential Monte Carlo (SMC) samplers [30, 28, 31, 32]. For this work, we replace the standard MCMC methods with Sequential Tempered Markov Chain Monte Carlo (ST-MCMC) [32], a massively parallel sampling scheme based on SMC. This method transports a population of model parameter samples through a series of intermediate annealing levels that reflect the gradual increase in the influence of the data likelihood. At each level, MCMC is used to explore the intermediate distribution and re-balance the distribution of the samples. Since this method is population based, these MCMC steps can be done in parallel. Further, this method can effectively adapt to the target posterior distribution to speed up sampling.

Similarly, approaches to multifidelity MCMC have been explored in the literature, such as multifidelity delayed acceptance schemes, Multilevel Markov Chain Monte Carlo, and multifidelity approaches to SMC [33, 34, 35, 36, 37, 38]. Multifidelity delayed acceptance schemes have been applied to Bayesian inference for the CME before [39, 40, 22]. Within these methods, a fast surrogate of the expensive likelihood function is used to pre-screen proposed samples within MCMC before they are accepted or rejected based on the expensive CME likelihood. This method still requires many sequential full model evaluations in order to sample the posterior. Multilevel MCMC [37] uses a hierarchy of models, such as different discretization grids of a PDE, to design an estimator for a specific quantity of interest. It uses parallel Markov chains at different model fidelities to estimate the correction to the quantity-of-interest estimate incurred by refining the model fidelity. Multifidelity SMC methods like the Multilevel Sequential² Monte Carlo sampler [38] use an embarrassingly parallel approach and a hierarchy of multifidelity models, so only a few sequential full model evaluations may be needed. We take this approach to develop a multifidelity form of ST-MCMC for solving the CME. Within our method, instead of only considering a series of annealing levels, we also consider a hierarchy of model fidelities. Thus, the solution of the full inference problem is broken down by steering the samples from a distribution that reflects a low fidelity model with little influence from the likelihood to a distribution that reflects the high-fidelity model with the full influence of the likelihood. By performing the early updates using fast models, the sampler can quickly converge to the most important regions of parameter space, where a higher fidelity model can then be used to better assess which regions of this high-likelihood space are most likely. The key challenge for applying multifidelity methods in the SMC context is deciding which annealing factor and model fidelity are appropriate at a given level. Latz et al. suggest an approach based on the effective sample size of the population [38]. We take a different approach by leveraging a limited number of high fidelity model solves to estimate the information gained about the ultimate posterior given a current model fidelity and annealing factor. This information-theoretic criterion can effectively identify when the lower fidelity model is overly biasing the solution and should be discarded in favor of a higher fidelity model.

We demonstrate the efficiency and accuracy of our novel scheme when solving parameter estimation, model selection, and uncertainty propagation problems for stochastic chemical kinetic models in a Bayesian framework. This new approach to multifidelity ST-MCMC, using fast surrogates from reduced order models based on a novel reformulation of the FSP, significantly reduces the number of expensive likelihood function evaluations required to sample the posterior. The example problems are based on models from the systems and synthetic biology literature. These include a three-dimensional repressilator gene circuit, a spatial bursting gene expression network, and a stochastic transcription network for the inflammation response gene IL1beta [41]. As such, the primary contributions of this paper are:

  1. Development of multifidelity ST-MCMC for Bayesian inference of SRNs

  2. Introduction of an information-theoretic criterion for assessing the appropriate model fidelity within multifidelity ST-MCMC

  3. Description of a novel surrogate model of the CME to be used within multifidelity inference problems

This paper is organized as follows: Section 2 provides general background on stochastic reaction network modeling, the finite state projection, Bayesian inference, and MCMC methods. Section 3 describes the Multifidelity Sequential Tempered MCMC algorithm and the information-theoretic criterion for adapting model fidelity. Section 4 describes a novel surrogate model for the CME. Section 5 describes three experiments to identify model parameters in SRNs using Multifidelity ST-MCMC. Finally, Section 6 concludes.

2 Background

2.1 Stochastic reaction networks for modeling gene expression

A reaction network consists of $N$ different chemical species $S_{1},\ldots,S_{N}$ that interact via the following $M$ chemical reactions

\nu_{1,j}^{react}S_{1}+\ldots+\nu_{N,j}^{react}S_{N}\longrightarrow\nu_{1,j}^{prod}S_{1}+\ldots+\nu_{N,j}^{prod}S_{N}. (1)

We are interested in keeping track of the integer-valued vectors $\bm{x}\equiv(x_{1},\dots,x_{N})^{T}$, where $x_{i}$ is the population of the $i$th species. Assuming constant temperature and volume, the time-evolution of this system can be modeled by a continuous-time, discrete-space Markov process [9]. The $j$th reaction channel is associated with a stoichiometric vector $\bm{\nu}_{j}=(\nu_{1,j}^{prod}-\nu_{1,j}^{react},\ldots,\nu_{N,j}^{prod}-\nu_{N,j}^{react})^{T}$ ($j=1,\ldots,M$) such that, if the system is in state $\bm{x}$ and reaction $j$ occurs, the system transitions to state $\bm{x}+\bm{\nu}_{j}$. Given $\bm{x}(t)=\bm{x}$, the propensity $\alpha_{j}(\bm{x};\theta)dt$ determines the probability that reaction $j$ occurs in the next infinitesimal time interval $[t,t+dt)$, where $\theta$ is the vector of model parameters. In other words,

\operatorname{Prob}(\bm{x}(t+dt)=\bm{x}+\bm{\nu}_{j}\mid\bm{x}(t)=\bm{x})=\alpha_{j}(\bm{x};\theta)dt.

An important class of reaction networks comprises those that follow mass-action kinetics, whose propensity functions take the form

\alpha_{j}(\bm{x};\theta)=c_{j}(\theta)\binom{x_{1}}{\nu_{1,j}^{react}}\cdot\ldots\cdot\binom{x_{N}}{\nu_{N,j}^{react}}. (2)

In this formulation, $c_{j}(\theta)$ is the probability per unit time that a combination of molecules can react via reaction $j$, and the remaining factor is the number of ways the existing molecules can be combined to form the left side of the chemical equation (1).

The time-evolution of the probability distribution of this Markov process is the solution of the linear system of differential equations known as the chemical master equation (CME)

\begin{cases}\frac{d}{dt}\bm{p}(t)=\bm{A}(\theta)\bm{p}(t),\quad t\in[0,\,t_{f}]\\ \bm{p}(0)=\bm{p}_{0}\end{cases}, (3)

where $\bm{p}(t)$ is the time-dependent probability distribution over all states, with entries $p(t,\bm{x})=\textrm{Prob}\{\bm{x}(t)=\bm{x}\mid\bm{x}(0)\}$. The initial distribution $\bm{p}_{0}$ is assumed to be given, and $\bm{A}(\theta)$ is the infinitesimal generator of the Markov process, defined entry-wise as

\bm{A}(\bm{y},\bm{x};\theta)=\begin{cases}\alpha_{j}(\bm{x};\theta)\;\text{ if }\bm{y}=\bm{x}+\bm{\nu}_{j}\\ -\sum_{j=1}^{M}{\alpha_{j}(\bm{x};\theta)}\;\text{ if }\bm{y}=\bm{x}\\ 0\text{ otherwise}\end{cases}. (4)

Here, we have made explicit the dependence of $\bm{A}$ on the model parameter vector $\theta$, which we need to infer from experimental data.
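As a concrete illustration, the following minimal sketch (not taken from the paper) assembles the generator of eq. (4) for a one-species birth-death network, $S\rightarrow S+1$ with rate $k$ and $S\rightarrow S-1$ with rate $\gamma x$, truncated to the states $\{0,\ldots,n_{max}\}$, and propagates eq. (3) with a matrix exponential. The helper name and rate values are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import expm

# Minimal sketch: CME generator of eq. (4) for a birth-death process,
# truncated to states {0, ..., n_max}.
def birth_death_generator(k, gamma, n_max):
    A = np.zeros((n_max + 1, n_max + 1))
    for x in range(n_max + 1):
        if x < n_max:                      # production: x -> x+1
            A[x + 1, x] += k
        if x > 0:                          # degradation: x -> x-1
            A[x - 1, x] += gamma * x
        # diagonal entry: negative total outflow from state x
        A[x, x] -= (k if x < n_max else 0.0) + gamma * x
    return A

# p(t) = expm(A t) p0, cf. eq. (3)
A = birth_death_generator(k=10.0, gamma=1.0, n_max=100)
p0 = np.zeros(101); p0[0] = 1.0
p_t = expm(A * 2.0) @ p0
```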

2.2 The finite state projection

Typically, reaction networks model open biochemical systems, where the set of all possible molecular states is unbounded. This makes the CME an infinite-dimensional linear system of ODEs. The finite state projection (FSP) is a well-known strategy to systematically reduce this linear system into a finite surrogate model with a strict error bound.

The FSP can be thought of as a special class of projection-based model reduction applied to the CME. Specifically, let $\Omega$ be a finite subset of the CME state space. The projection of the CME operator $\bm{A}$ onto the subspace spanned by the point-mass measures $\{\delta_{\bm{x}}\mid\bm{x}\in\Omega\}$ is given by

\widetilde{\bm{A}}_{\Omega}(\bm{y},\bm{x})=\begin{cases}\bm{A}(\bm{y},\bm{x})\;\text{if }\bm{x},\bm{y}\in\Omega\\ 0\;\text{otherwise}\end{cases}. (5)

We can then define a reduced model of the dynamical system (3) based on this projection as

\frac{d}{dt}{\widetilde{\bm{p}}_{\Omega}}(t)=\widetilde{\bm{A}}_{\Omega}\widetilde{\bm{p}}_{\Omega}(t),\;t\in[0,t_{f}]. (6)

Clearly, in solving (6) we only need to keep track of the equations corresponding to states in the finite set $\Omega$, which is amenable to numerical treatment.

In contrast to generic projection methods, the gap between the reduced-order model and the true CME can be computed for the FSP. Indeed, Munsky and Khammash [42, Theorem 2] proved that the truncation error can be quantified in the $\ell_{1}$ norm as

\|\bm{p}(t)-\widetilde{\bm{p}}_{\Omega}(t)\|_{1}=1-\sum_{\bm{x}\in\Omega}{\tilde{p}_{\Omega}(t,\bm{x})}. (7)

Clearly, the right hand side can be readily computed from the solution of the reduced system (6). This precise error quantification gives an effective iterative method for solving the CME. Choosing an error tolerance $\varepsilon>0$ and starting from an initial set $\Omega:=\Omega_{0}$, we solve system (6) and check whether the right hand side of (7) is less than $\varepsilon$. If this check fails, we add more states to $\Omega_{0}$ to get a strictly larger set $\Omega_{1}$ and repeat the procedure until we find an approximation that satisfies the error tolerance.
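A minimal sketch of this iteration for a one-dimensional state space is given below; the helper names `build_generator` and `p0_fn` and the simple expansion-by-growth-factor rule are illustrative assumptions, not the adaptive FSP variants cited later.

```python
import numpy as np
from scipy.linalg import expm

# Minimal sketch of the FSP iteration: grow the truncated state space until
# the probability mass lost by the projection, 1 - sum(p_tilde), falls below
# the tolerance of eq. (7).
def fsp_solve(build_generator, p0_fn, t_f, n_max0, tol=1e-6, growth=1.5):
    n_max = n_max0
    while True:
        A = build_generator(n_max)         # projected generator on {0,...,n_max}
        p0 = p0_fn(n_max)
        p_tilde = expm(A * t_f) @ p0
        error = 1.0 - p_tilde.sum()        # FSP error bound, eq. (7)
        if error <= tol:
            return p_tilde, n_max
        n_max = int(np.ceil(growth * n_max))   # expand the state set and retry
```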

As the sequence of subsets $\Omega_{i}$ grows until it eventually covers the whole state space, we might expect that the finite-time solution of (6) will likewise converge to the true solution. This is indeed the case for all models in practice, with only a few theoretical counterexamples in which the Markov chain is explosive [10]. Sufficient conditions for the convergence of the FSP can be checked based on the form of the propensity functions [43, 44]. In practice, reaction networks tend to have only reactions between two or fewer molecules, with propensity functions in mass-action form (2), and these are guaranteed to be approximable with the FSP [43].

For the rest of the paper, we only concern ourselves with non-explosive SRNs where the FSP converges. Given such models, any exhaustive sequence of subsets $\{\Omega_{j}\}$ suffices to guarantee that the FSP solution eventually satisfies any prespecified error tolerance. However, some choices are more efficient than others, and it is often advantageous to partition the time interval into smaller timesteps and use a smaller $\Omega_{j}$ on each step [45]. We will return to these observations in section 4.3.

A final point to make about the FSP is that, if the sequence of reduced sets $\Omega_{j}$ is increasing, that is, $\Omega_{j}\subset\Omega_{j+1},\,j=1,2,\ldots$, the resulting truncation error has been shown to decrease monotonically [42, Theorem 1] (also see [46, Theorem 2.5]). This gives us a natural way to form a hierarchy of reduced models for the CME, and we will return to this in section 4.1. With the computed distributions of the SRN in hand, we can match them directly to experimental data using a straightforward likelihood function.

2.3 Bayesian inference of SRN models from discrete single-cell measurements

We are interested in inferring the parameters of the reaction network from discrete, single-cell datasets [47, 6, 3, 7, 8] that consist of several snapshots of many independent cells taken at discrete times $t_{1},\ldots,t_{T}$. The snapshot at time $t_{i}$ records gene expression in $n_{i}$ cells; the molecular populations of cell $j$ at time $t_{i}$ are collected in the data vector $\bm{c}_{j,i},\;j=1,\ldots,n_{i}$.

Assume that a model class $\mathcal{M}=\left\{M(\theta)\right\}_{\theta\in\Theta}$ of stochastic reaction networks, consisting of a fixed set of reactions with unknown reaction parameters $\theta$, has been chosen to model the data. Let $p(t,\bm{x}\mid M(\theta))$ denote the entry of the CME solution corresponding to state $\bm{x}$ at time $t$, given by SRN model $M(\theta)$. The log-likelihood of the dataset $\mathcal{D}$ given $M(\theta)$ is given by

{L}(\mathcal{D}|\theta)=\sum_{i=1}^{T}\sum_{j=1}^{n_{i}}{\log{p(t_{i},\bm{c}_{j,i}|M(\theta))}}. (8)

A common approach to fitting this model is to find model parameters $\theta$ that maximize the log-likelihood function. These parameters best fit the data, but this process does not capture any uncertainty about the parameters: there are potentially many other models that fit the data approximately as well as the maximum likelihood model. In contrast, with a Bayesian approach we seek to quantify this parameter uncertainty so that we can be confident in the inference results, understand the influence of this uncertainty on future predictions, and design future experiments to reduce it.
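A minimal sketch of the sum in eq. (8) is shown below; the data layout and the `cme_solution` lookup (a mapping from a state tuple to its FSP probability at each measurement time) are assumed interfaces for illustration.

```python
import numpy as np

# Minimal sketch of the smFISH log-likelihood of eq. (8): each single-cell
# measurement contributes the log of the CME probability of its observed
# state at its measurement time.
def log_likelihood(data, cme_solution):
    # data: {t_i: list of observed copy-number vectors c_{j,i}}
    # cme_solution: {t_i: {state tuple: probability}} from an FSP solve
    logL = 0.0
    for t_i, cells in data.items():
        p_ti = cme_solution[t_i]
        for c in cells:
            logL += np.log(max(p_ti.get(tuple(c), 0.0), 1e-300))  # guard log(0)
    return logL
```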

Bayesian inference is rooted in the Bayesian philosophy of probability, in which our uncertainty about the world is modeled using probability distributions [48]. As information becomes available, we update these distributions to reflect our new state of understanding. For Bayesian inference we therefore begin with a prior distribution, $p(\theta)$, that captures our initial beliefs about the model parameters. After data have been observed, the likelihood of the data given a model class and associated parameters is $p\left(\mathcal{D}\mid\theta\right)=\exp({L}(\mathcal{D}|\theta))$. By applying Bayes' Theorem, we combine the prior and likelihood to construct the posterior distribution on model parameters, which reflects our updated beliefs after the data have been observed:

p\left(\theta\mid\mathcal{D}\right)=\frac{p\left(\mathcal{D}\mid\theta\right)p\left(\theta\right)}{p\left(\mathcal{D}\right)} (9)

Here, $p\left(\mathcal{D}\right)$ is a normalization constant known as the model evidence. It is relevant for Bayesian model selection problems but is not required for parameter calibration. When $p\left(\theta\right)$ is constant, the parameters that maximize the posterior density coincide with the maximum likelihood estimator. Solving the Bayesian inference problem allows us to quantify our uncertainty regarding the accuracy of the parameter fit by sampling from the posterior distribution. However, it can be a computationally challenging problem, as discussed in the next subsection.

The Bayesian framework also provides a criterion to select or weigh different model classes as data become available. Suppose that, instead of a single model class, we are given $K$ possible network structures that could potentially explain the observations. Let $\mathcal{M}^{k}=\{M^{k}(\theta^{k})\}_{\theta^{k}\in\Theta^{k}}$ denote the $k$-th class, where the parameter domains $\Theta^{k}$ need not have the same dimensionality. Each model class is associated with a prior weight $P(\mathcal{M}^{k})$ that represents the prior level of belief in that class. If $\mathcal{D}$ denotes the dataset as before, we can compute the model evidence of $\mathcal{M}^{k}$ as

P(\mathcal{D}|\mathcal{M}^{k})=\int_{\theta\in\Theta^{k}}{p\left(\mathcal{D}\mid\theta,\mathcal{M}^{k}\right)P(\theta\mid\mathcal{M}^{k})\,d\theta}. (10)

The posterior probability of each model class can then be computed by applying Bayes’ Theorem:

P\left(\mathcal{M}^{k}\mid\mathcal{D}\right)=\frac{P(\mathcal{D}|\mathcal{M}^{k})P(\mathcal{M}^{k})}{\sum_{j=1}^{K}P(\mathcal{D}|\mathcal{M}^{j})P(\mathcal{M}^{j})} (11)

These probabilities reflect the posterior weighting of the different model classes and can be used to average predictions over the models. Similarly, the Bayes factors

\frac{P\left(\mathcal{M}^{k}\mid\mathcal{D}\right)}{P\left(\mathcal{M}^{j}\mid\mathcal{D}\right)},\quad j,k=1,\ldots,K,\;j\neq k (12)

allow us to rank and compare the models based on how strongly they are supported by the current data for Bayesian model selection. The drawback of Bayesian model selection in the context of stochastic gene expression is the computational cost of computing the model evidences. We will return to this issue in section 3, where we show how the Multifidelity ST-MCMC framework provides an efficient way to estimate these evidences.

2.4 Markov Chain Monte Carlo samplers

Markov Chain Monte Carlo algorithms are widely used for sampling the posterior distribution of a Bayesian inference problem. These methods design a Markov chain whose stationary distribution is the target posterior distribution, $p\left(\theta\mid\mathcal{D}\right)$. By simulating the evolution of samples, $\theta$, according to the Markov chain, samples are asymptotically drawn from the posterior distribution. However, unlike independent Monte Carlo sampling, these samples are correlated. A common MCMC method is the Metropolis-Hastings algorithm. The algorithm begins by initializing the parameter state to some $\theta_{0}$. At step $i+1$, a candidate sample $\theta^{\prime}$ is drawn according to a proposal distribution $Q\left(\theta^{\prime}\mid\theta_{i}\right)$. This candidate is then accepted with probability $\alpha\left(\theta^{\prime}\mid\theta_{i}\right)$ given by

\alpha\left(\theta^{\prime}\mid\theta_{i}\right)=\min\left(1,\frac{p\left(\theta^{\prime}\mid\mathcal{D}\right)Q\left(\theta_{i}\mid\theta^{\prime}\right)}{p\left(\theta_{i}\mid\mathcal{D}\right)Q\left(\theta^{\prime}\mid\theta_{i}\right)}\right)=\min\left(1,\frac{p\left(\mathcal{D}\mid\theta^{\prime}\right)p\left(\theta^{\prime}\right)Q\left(\theta_{i}\mid\theta^{\prime}\right)}{p\left(\mathcal{D}\mid\theta_{i}\right)p\left(\theta_{i}\right)Q\left(\theta^{\prime}\mid\theta_{i}\right)}\right) (13)

This acceptance probability is independent of the normalization constant $p\left(\mathcal{D}\right)$ and is therefore computationally tractable. If the candidate is accepted, $\theta_{i+1}=\theta^{\prime}$; otherwise, $\theta_{i+1}=\theta_{i}$. The algorithm iterates until a sufficient number of posterior samples have been generated to accurately represent the posterior.
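A minimal sketch of such a sampler is given below, assuming a symmetric Gaussian random-walk proposal so that the $Q$ terms in eq. (13) cancel; `log_post` is an assumed callable returning the unnormalized log-posterior.

```python
import numpy as np

# Minimal sketch of a Metropolis-Hastings sampler implementing eq. (13).
def metropolis_hastings(log_post, theta0, n_steps, step_size=0.1, rng=None):
    rng = rng or np.random.default_rng()
    theta, lp = np.asarray(theta0, float), log_post(theta0)
    chain = []
    for _ in range(n_steps):
        prop = theta + step_size * rng.standard_normal(theta.shape)  # symmetric proposal
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # accept with probability alpha
            theta, lp = prop, lp_prop
        chain.append(theta.copy())
    return np.array(chain)
```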

A common method to judge whether sufficient posterior samples have been generated by an MCMC sampler is the notion of effective sample size (ESS). The ESS of a correlated population of $N$ samples, $\theta_{i=1\dots N}$, is the number of independent samples that would estimate a quantity of interest, $\hat{q}=E\left[q\left(\theta\right)\right]$, with the same variance as the correlated sample population. Therefore, the ESS of a quantity of interest can be computed as

N_{ESS}=\frac{Var\left[q\left(\theta\right)\right]}{Var\left[\hat{q}\right]} (14)

Based upon the Markov chain Central Limit Theorem, for correlated samples generated according to the stationary distribution of a Markov process, the ESS can be approximated as

N_{ESS}=\frac{N}{1+2\sum_{i=1}^{N}\rho_{i}\left(q\left(\theta_{1\dots N}\right)\right)} (15)

where $\rho_{i}$ is the $i$th lag autocorrelation of the quantity of interest whose evolution is defined by the Markov chain. Therefore, when designing a Metropolis-Hastings proposal distribution $Q\left(\theta^{\prime}\mid\theta_{i}\right)$ we want to minimize this correlation and thus maximize $\frac{N_{ESS}}{N}$ to attain the highest possible sampling efficiency. However, even very effective samplers often need tens of thousands of sequential model evaluations to generate enough effective samples, making this form of MCMC challenging for computationally expensive models.
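A minimal sketch of the autocorrelation-based estimate in eq. (15) is shown below; the truncation of the autocorrelation sum once it becomes negligible is a common heuristic assumed here, not a prescription from the paper.

```python
import numpy as np

# Minimal sketch of the effective sample size of eq. (15) for a scalar
# quantity of interest q evaluated along the Markov chain.
def effective_sample_size(q, max_lag=None):
    q = np.asarray(q, float)
    n = len(q)
    q = q - q.mean()
    var = q @ q / n                      # lag-0 autocovariance
    max_lag = max_lag or n // 2
    rho_sum = 0.0
    for lag in range(1, max_lag):
        rho = (q[:-lag] @ q[lag:]) / ((n - lag) * var)
        if rho < 0.05:                   # truncate once correlation is negligible
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)
```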

2.5 Sequential Monte Carlo samplers

To overcome many of the challenges associated with standard Metropolis-Hastings MCMC, parallel methods such as Sequential Monte Carlo (SMC) have been introduced to better leverage high performance computing resources. SMC methods for Bayesian inference transport a sample population, initially distributed so that it can approximate expectations with respect to the prior, to one that can approximate posterior expectations [30, 28, 31, 32]. Typically, this means that a population of samples initially distributed according to the prior is transformed into a population of samples approximately distributed according to the posterior. In Sequential Tempered MCMC (ST-MCMC) [32], we break the inference problem into a series of annealing levels $i$ defined by an annealing factor $\beta_{i}\in\left[0,1\right]$. Each level defines an intermediate distribution, $\pi_{\beta_{i}}\left(\theta\right)$, which we use to generate samples and to compute expectations. These intermediate distributions take the form

\pi_{\beta_{i}}\left(\theta\right)=p\left(\theta\mid\mathcal{D},\beta_{i}\right)=\frac{p\left(\mathcal{D}\mid\theta\right)^{\beta_{i}}p\left(\theta\right)}{\int p\left(\mathcal{D}\mid\theta\right)^{\beta_{i}}p\left(\theta\right)d\theta} (16)

This annealing approach is common to many SMC methods used for Bayesian inference and can be thought of as gradually integrating the influence of the data into the solution. To simplify the problem of transporting samples from the prior, $\pi_{0}\left(\theta\right)$, to the posterior, $\pi_{1}\left(\theta\right)$, we transport samples sequentially through each level in the sequence, i.e. from $\pi_{\beta_{i}}\left(\theta\right)$ to $\pi_{\beta_{i+1}}\left(\theta\right)$ with $\beta_{i+1}>\beta_{i}$. Because we can control the size of the jump $\Delta\beta=\beta_{i+1}-\beta_{i}$, we can ensure that each change is not so drastic as to cause a poor approximation of the true distribution, i.e. too large a decrease in the ESS. Transporting samples is done in three steps:

  1. Re-weight the previous sample population, distributed according to $\pi_{\beta_{i}}\left(\theta\right)$, with unnormalized weights $w_{i}=p\left(\mathcal{D}\mid\theta_{i}\right)^{\Delta\beta}$ to reflect expectations with respect to the new distribution $\pi_{\beta_{i+1}}\left(\theta\right)$.

  2. Re-sample the population according to the weights so that the samples now reflect $\pi_{\beta_{i+1}}\left(\theta\right)$.

  3. Seed a Markov chain starting at each sample, and then use MCMC to explore $\pi_{\beta_{i+1}}\left(\theta\right)$.

The MCMC step is essential to ensure that the sample population does not degenerate, since the re-weighting and re-sampling steps reduce the ESS of the population. MCMC increases the ESS because it decorrelates the seeds and explores the target distribution, causing the samples to better reflect it. Typically, $\Delta\beta$ is chosen adaptively so that the ESS does not decrease too much during the update. This is achieved by finding a $\Delta\beta$ such that the coefficient of variation (COV) of the sample weights equals a target $\kappa$. The COV approximates the ESS through $N_{ESS}\approx\frac{N}{1+\kappa^{2}}$. Therefore, we find the $\Delta\beta>0$ that solves the equation

\kappa=\frac{\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(w_{i}\left(\Delta\beta\right)-\hat{w}\left(\Delta\beta\right)\right)^{2}}}{\hat{w}\left(\Delta\beta\right)} (17)

Here, $w_{i}\left(\Delta\beta\right)=p\left(\mathcal{D}\mid\theta_{i}\right)^{\Delta\beta}$ and $\hat{w}\left(\Delta\beta\right)=\frac{1}{N}\sum_{i=1}^{N}w_{i}\left(\Delta\beta\right)$. Typically, we choose $\kappa=1$, which corresponds to a target ESS of $N/2$. With this method for finding $\Delta\beta$, we sequentially move through all the adaptively tuned annealing levels until we reach the final posterior reflected by $\pi_{1}\left(\theta\right)$. For more details about this algorithm, see [32].
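The adaptive choice of $\Delta\beta$ amounts to a one-dimensional root find on the COV of the weights. The minimal sketch below, assuming per-sample log-likelihood values are available, uses a bracketed root solver and caps the jump at $\beta=1$.

```python
import numpy as np
from scipy.optimize import brentq

# Minimal sketch of the adaptive tempering step: find delta_beta so that the
# coefficient of variation of w_i = p(D|theta_i)^delta_beta matches the
# target kappa of eq. (17).
def next_delta_beta(log_like, beta, kappa=1.0):
    def cov_minus_target(db):
        w = np.exp(db * (log_like - log_like.max()))   # stabilized weights
        return w.std() / w.mean() - kappa
    if cov_minus_target(1.0 - beta) <= 0.0:
        return 1.0 - beta                              # can jump straight to beta = 1
    return brentq(cov_minus_target, 1e-10, 1.0 - beta) # otherwise solve eq. (17)
```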

3 Multifidelity ST-MCMC

For expensive models, ST-MCMC and similar SMC based methods may still be computationally prohibitive. One approach to overcome this computational burden is to utilize a multifidelity model hierarchy that can speed up sampling. The key idea is that for early levels of the ST-MCMC algorithm a low fidelity but computationally cheap model may be sufficiently informative to guide the samples towards the ultimate posterior distribution. This is because at early levels, the annealing factor causes the contribution of the likelihood to be damped, so perturbations in the likelihood caused by the decrease in model fidelity are less important. Intuitively, a lower fidelity model may be useful when the bias it introduces in the likelihood function is less than the variance of the likelihood at the annealing level. We consider different strategies for rigorously defining this intuition in the rest of this section.

We extend ST-MCMC to multifidelity ST-MCMC by defining intermediate levels both in terms of their annealing factor $\beta$ and the choice of model fidelity $M$, where we consider a hierarchy of models $\mathcal{M}=\{M_{j}:j=1\dots K\}$ with increasing fidelity and computational cost. This algorithm is described in Algorithm 1. The key challenge is determining the best strategy for choosing $\beta_{l}$ and $m_{l}$ in Step 2, since the rest of the algorithm proceeds like standard ST-MCMC.

Inputs: Prior distribution $p\left(\theta\right)$
Model fidelity hierarchy $\mathcal{M}=\{M_{j}:j=1\dots K\}$
Likelihood function $p\left(\mathcal{D}\mid\theta,M_{j}\right)$
Number of samples $N$
Output: Posterior samples $\theta_{1\dots N}$
begin
       Initialize: Set level counter $l=0$
       Set annealing factor $\beta_{0}=0$
       Set model level $m_{0}=1$
       Define the first intermediate distribution $\pi_{0}\left(\theta\right)=p\left(\theta\right)$
       Draw initial samples $\theta_{1\dots N}^{0}\sim\pi_{0}\left(\theta\right)$
       while $\pi_{\beta_{l}}\left(\theta\mid M_{m_{l}}\right)\neq p\left(\theta\mid\mathcal{D},M_{K}\right)$ do
             1) Increment the level counter $l=l+1$
             2) Choose the next $\beta_{l}$ and $m_{l}$ based on the previous level sample population $\theta_{1\dots N}^{l-1}$
             3) Define the next intermediate distribution $\pi_{\beta_{l}}\left(\theta\mid M_{m_{l}}\right)\propto p\left(\mathcal{D}\mid\theta,M_{m_{l}}\right)^{\beta_{l}}p\left(\theta\right)$
             4) Compute the unnormalized importance weights for the population as $w_{i}=\frac{\pi_{\beta_{l}}\left(\theta_{i}^{l-1}\mid M_{m_{l}}\right)}{\pi_{\beta_{l-1}}\left(\theta_{i}^{l-1}\mid M_{m_{l-1}}\right)}$
             5) Resample the population according to the normalized importance weights to initialize $\theta_{1\dots N}^{l}$
             6) Evolve the samples $\theta_{1\dots N}^{l}$ using MCMC with stationary distribution $\pi_{\beta_{l}}\left(\theta\mid M_{m_{l}}\right)$
       end while
       return $\theta_{1\dots N}=\theta_{1\dots N}^{l}$
end
Algorithm 1: Multifidelity ST-MCMC
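For orientation, the following compact sketch mirrors the structure of Algorithm 1; the helpers `sample_prior`, `log_like`, `choose_next_level`, and `mcmc_explore` are assumed interfaces standing in for the prior, the per-sample log-likelihood under a given fidelity, Step 2, and Step 6, respectively.

```python
import numpy as np

# Compact sketch of Algorithm 1: reweight, resample, then MCMC at each
# (beta_l, m_l) level until the full-fidelity posterior (beta = 1, m = K) is reached.
def multifidelity_stmcmc(sample_prior, log_like, n_models, n_samples,
                         choose_next_level, mcmc_explore, rng=None):
    rng = rng or np.random.default_rng()
    theta = sample_prior(n_samples)                 # samples from p(theta)
    beta, m = 0.0, 0                                # annealing factor and model index
    while not (beta == 1.0 and m == n_models - 1):
        beta_new, m_new = choose_next_level(theta, beta, m)                    # Step 2
        log_w = beta_new * log_like(theta, m_new) - beta * log_like(theta, m)  # Step 4
        w = np.exp(log_w - log_w.max())
        idx = rng.choice(n_samples, size=n_samples, p=w / w.sum())             # Step 5
        theta = mcmc_explore(theta[idx], beta_new, m_new)                      # Step 6
        beta, m = beta_new, m_new
    return theta
```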

3.1 Tempering and bridging using an Effective Sample Size criteria

One approach to choosing the appropriate annealing factor and model fidelity is the combined likelihood tempering and model bridging scheme discussed in Latz et al. [38]. This scheme is based upon the ESS statistic discussed in Section 2.5. Within Multifidelity ST-MCMC, at every level $l$ of Algorithm 1, we choose whether to temper by changing $\beta_{l}=\beta_{l-1}+\Delta\beta$ or to bridge by changing the model fidelity, $m_{l}=m_{l-1}+1$. This choice is made by measuring the ESS of the sample population with respect to the next change in model fidelity by computing the unnormalized weights as if we were to bridge:

w_{i}=\frac{\pi_{\beta_{l-1}}\left(\theta_{i}^{l-1}\mid M_{m_{l-1}+1}\right)}{\pi_{\beta_{l-1}}\left(\theta_{i}^{l-1}\mid M_{m_{l-1}}\right)}=\frac{p\left(\mathcal{D}\mid\theta_{i}^{l-1},M_{m_{l-1}+1}\right)^{\beta_{l-1}}}{p\left(\mathcal{D}\mid\theta_{i}^{l-1},M_{m_{l-1}}\right)^{\beta_{l-1}}} (18)

We then compute the coefficient of variation of the weights, $w_{i}$, and determine whether it exceeds a target $\kappa$. If it does, we choose to bridge to the next model fidelity because the sample population is beginning to degenerate and no longer has sufficient ESS to approximate the next intermediate posterior. If the COV is less than $\kappa$, we keep the current model and instead temper $\beta$. The next $\beta$ is chosen using the same strategy as before, by solving Equation 17.
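A minimal sketch of this decision is shown below, assuming the per-sample log-likelihoods under both the current and the next model in the hierarchy have been evaluated at the current population.

```python
import numpy as np

# Minimal sketch of the ESS-based choice between tempering and bridging:
# compute the bridging weights of eq. (18) at the current beta and bridge to
# the next fidelity whenever their coefficient of variation exceeds kappa.
def should_bridge(log_like_current, log_like_next, beta, kappa=1.0):
    log_w = beta * (log_like_next - log_like_current)
    w = np.exp(log_w - log_w.max())          # unnormalized weights, stabilized
    return (w.std() / w.mean()) > kappa      # True -> bridge, False -> temper
```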

3.2 Information-theoretic criteria for model fidelity adaptation

We introduce a new criterion for model fidelity selection based on information theory. This criterion is motivated by the fact that the ESS-based strategy described above only decides to change fidelity based upon the next model in the hierarchy. This means the sampler may continue to use a low fidelity model because it still meets the ESS criterion with respect to the next model, even when it is drifting away from the high fidelity posterior. Instead, we introduce a method that utilizes a limited number of full fidelity model evaluations to help us better decide when to bridge model fidelity. Depending on the computational cost of the full fidelity simulations, the improved bridging strategy and improved robustness of this method may outweigh the cost of these full fidelity solutions.

Within multifidelity ST-MCMC, suppose the algorithm is at annealing level $l$ with annealing factor $\beta_{l}\in\left[0,1\right]$ and has been sampling the intermediate posterior defined by a model, $M_{m_{l}}$, in a model hierarchy $\mathcal{M}=\{M_{j}:j=1\dots K\}$. We would like to know whether $M_{m_{l}}$ still provides meaningful information about the ultimate posterior once we move to level $l+1$ with annealing factor $\beta_{l+1}$. Here we assume the ultimate posterior is $p\left(\theta\mid\mathcal{D},M_{K}\right)$, where $M_{K}$ is the highest fidelity model. Therefore, unlike the previous ESS-based method, we begin by proposing a tempering step under the assumption that the current model fidelity is valid. We find the proposed $\beta_{l+1}$ by solving Equation 17.

If $M_{m_{l}}$ no longer provides meaningful information at the next level, we use the next highest fidelity model in the algorithm, $M_{m_{l}+1}$. This criterion can be formulated using a generalization of information theory [49], where the information gained about the full posterior, $p\left(\theta\mid\mathcal{D},M_{K}\right)$, by moving from level $l$ to $l+1$ with model $M_{m_{l}}$ is:

\begin{split}&\mathcal{I}_{p\left(\theta\mid\mathcal{D},M_{K}\right)}\left[p\left(\theta\mid\mathcal{D},M_{m_{l}},\beta_{l+1}\right)||p\left(\theta\mid\mathcal{D},M_{m_{l}},\beta_{l}\right)\right]\\ &=D_{\text{KL}}\left[p\left(\theta\mid\mathcal{D},M_{K}\right)||p\left(\theta\mid\mathcal{D},M_{m_{l}},\beta_{l}\right)\right]-D_{\text{KL}}\left[p\left(\theta\mid\mathcal{D},M_{K}\right)||p\left(\theta\mid\mathcal{D},M_{m_{l}},\beta_{l+1}\right)\right]\\ &=\int p\left(\theta\mid\mathcal{D},M_{K}\right)\log\frac{p\left(\theta\mid\mathcal{D},M_{m_{l}},\beta_{l+1}\right)}{p\left(\theta\mid\mathcal{D},M_{m_{l}},\beta_{l}\right)}d\theta\end{split} (19)

If this quantity is positive, then the intermediate posterior defined by $\beta_{l+1}$ and $m_{l}$ is closer to the ultimate posterior than the previous level, so we choose $m_{l+1}=m_{l}$. However, if it is negative, the update is driving the distribution away from the ultimate posterior, so we should use a higher fidelity model for the next update, thus $m_{l+1}=m_{l}+1$.

If we choose to update the model fidelity, we consider two strategies for choosing $\beta_{l+1}$ for the next level. The first strategy, in keeping with the ESS-based tempering and bridging framework above, is to set $\beta_{l+1}=\beta_{l}$. The second strategy is to tune $\beta_{l+1}$ to try to attain an ESS target. The first strategy is often more computationally efficient, but may not be as robust if changing model fidelity introduces significant variations. To tune $\beta_{l+1}$, we first define the importance weight for transitioning from a level defined by $\beta_{l}$ and $m_{l}$ to a level defined by $\beta_{l+1}$ and $m_{l+1}=m_{l}+1$ as

w_{i}=\frac{\pi_{\beta_{l+1}}\left(\theta_{i}^{l}\mid M_{m_{l}+1}\right)}{\pi_{\beta_{l}}\left(\theta_{i}^{l}\mid M_{m_{l}}\right)}=\frac{p\left(\mathcal{D}\mid\theta_{i}^{l},M_{m_{l}+1}\right)^{\beta_{l+1}}}{p\left(\mathcal{D}\mid\theta_{i}^{l},M_{m_{l}}\right)^{\beta_{l}}} (20)

Using the same approach as before, we can then tune $\beta_{l+1}$ to meet an ESS target based upon the COV of the weights. However, unlike in the previous cases, this target might not be achievable. If Equation (17) has a solution, we choose the largest $\beta_{l+1}$ such that the COV target is met. If Equation (17) does not have a solution, we find the $\beta_{l+1}$ that minimizes the COV and thus maximizes the ESS.

3.3 Computing the information-theoretic criteria

Since computing the information gain in Equation 19 requires marginalizing over the posterior, it can be challenging. However, it can be approximated using the samples from ST-MCMC. The first step is to recognize the connection between computing this criterion and estimating the model evidence:

\begin{split}&\mathcal{I}_{p\left(\theta\mid\mathcal{D},M_{K}\right)}\left[p\left(\theta\mid\mathcal{D},M_{m_{l}},\beta_{l+1}\right)||p\left(\theta\mid\mathcal{D},M_{m_{l}},\beta_{l}\right)\right]\\ &=\int p\left(\theta\mid\mathcal{D},M_{K}\right)\log\frac{p\left(\theta\mid\mathcal{D},M_{m_{l}},\beta_{l+1}\right)}{p\left(\theta\mid\mathcal{D},M_{m_{l}},\beta_{l}\right)}d\theta\\ &=\int p\left(\theta\mid\mathcal{D},M_{K}\right)\log\frac{p\left(\mathcal{D}\mid\theta,M_{m_{l}}\right)^{\beta_{l+1}}p\left(\theta\right)}{p\left(\mathcal{D}\mid M_{m_{l}},\beta_{l+1}\right)}\frac{p\left(\mathcal{D}\mid M_{m_{l}},\beta_{l}\right)}{p\left(\mathcal{D}\mid\theta,M_{m_{l}}\right)^{\beta_{l}}p\left(\theta\right)}d\theta\\ &=\int p\left(\theta\mid\mathcal{D},M_{K}\right)\log p\left(\mathcal{D}\mid\theta,M_{m_{l}}\right)^{\Delta\beta}\frac{p\left(\mathcal{D}\mid M_{m_{l}},\beta_{l}\right)}{p\left(\mathcal{D}\mid M_{m_{l}},\beta_{l+1}\right)}d\theta\end{split} (21)

Here, $p\left(\mathcal{D}\mid M,\beta\right)$ is the model evidence, i.e., the normalization constant, for the likelihood defined by the model $M$ with an annealing factor $\beta$:

p\left(\mathcal{D}\mid M,\beta\right)=\int p\left(\mathcal{D}\mid\theta,M\right)^{\beta}p\left(\theta\right)d\theta (22)

By noting the relationship to model evidence, the ratio of the evidences can be expressed as:

\begin{split}\frac{p\left(\mathcal{D}\mid M_{m_{l}},\beta_{l}\right)}{p\left(\mathcal{D}\mid M_{m_{l}},\beta_{l+1}\right)}&=\frac{\int p\left(\mathcal{D}\mid\theta,M_{m_{l}}\right)^{\beta_{l}}p\left(\theta\right)d\theta}{\int p\left(\mathcal{D}\mid\theta,M_{m_{l}}\right)^{\beta_{l+1}}p\left(\theta\right)d\theta}\\ &=\frac{1}{\int p\left(\mathcal{D}\mid\theta,M_{m_{l}}\right)^{\Delta\beta}\frac{p\left(\mathcal{D}\mid\theta,M_{m_{l}}\right)^{\beta_{l}}p\left(\theta\right)}{\int p\left(\mathcal{D}\mid\theta,M_{m_{l}}\right)^{\beta_{l}}p\left(\theta\right)d\theta}d\theta}\\ &=\frac{1}{\int p\left(\mathcal{D}\mid\theta,M_{m_{l}}\right)^{\Delta\beta}p\left(\theta\mid\mathcal{D},M_{m_{l}},\beta_{l}\right)d\theta}\\ &=\frac{1}{\text{E}_{\theta\sim p\left(\theta\mid\mathcal{D},M_{m_{l}},\beta_{l}\right)}\left[p\left(\mathcal{D}\mid\theta,M_{m_{l}}\right)^{\Delta\beta}\right]}\end{split} (23)

Therefore,

\begin{split}&\int p\left(\theta\mid\mathcal{D},M_{K}\right)\log p\left(\mathcal{D}\mid\theta,M_{m_{l}}\right)^{\Delta\beta}\frac{p\left(\mathcal{D}\mid M_{m_{l}},\beta_{l}\right)}{p\left(\mathcal{D}\mid M_{m_{l}},\beta_{l+1}\right)}d\theta\\ &=\int p\left(\theta\mid\mathcal{D},M_{K}\right)\log\frac{p\left(\mathcal{D}\mid\theta,M_{m_{l}}\right)^{\Delta\beta}}{\text{E}_{\theta\sim p\left(\theta\mid\mathcal{D},M_{m_{l}},\beta_{l}\right)}\left[p\left(\mathcal{D}\mid\theta,M_{m_{l}}\right)^{\Delta\beta}\right]}d\theta\end{split} (24)

Since we cannot yet sample from $p\left(\theta\mid\mathcal{D},M_{K}\right)$, we use importance sampling to express this integral in terms of the level $l$ distribution, for which we have samples:

\begin{split}&\int p\left(\theta\mid\mathcal{D},M_{K}\right)\log\frac{p\left(\mathcal{D}\mid\theta,M_{m_{l}}\right)^{\Delta\beta}}{\text{E}_{\theta\sim p\left(\theta\mid\mathcal{D},M_{m_{l}},\beta_{l}\right)}\left[p\left(\mathcal{D}\mid\theta,M_{m_{l}}\right)^{\Delta\beta}\right]}d\theta\\ &=\int p\left(\theta\mid\mathcal{D},M_{m_{l}},\beta_{l}\right)\frac{p\left(\theta\mid\mathcal{D},M_{K}\right)}{p\left(\theta\mid\mathcal{D},M_{m_{l}},\beta_{l}\right)}\log\frac{p\left(\mathcal{D}\mid\theta,M_{m_{l}}\right)^{\Delta\beta}}{\text{E}_{\theta\sim p\left(\theta\mid\mathcal{D},M_{m_{l}},\beta_{l}\right)}\left[p\left(\mathcal{D}\mid\theta,M_{m_{l}}\right)^{\Delta\beta}\right]}d\theta\\ &\propto\int p\left(\theta\mid\mathcal{D},M_{m_{l}},\beta_{l}\right)\frac{p\left(\mathcal{D}\mid\theta,M_{K}\right)}{p\left(\mathcal{D}\mid\theta,M_{m_{l}}\right)^{\beta_{l}}}\log\frac{p\left(\mathcal{D}\mid\theta,M_{m_{l}}\right)^{\Delta\beta}}{\text{E}_{\theta\sim p\left(\theta\mid\mathcal{D},M_{m_{l}},\beta_{l}\right)}\left[p\left(\mathcal{D}\mid\theta,M_{m_{l}}\right)^{\Delta\beta}\right]}d\theta\\ &=\mathcal{I}_{criteria}\end{split} (25)

This integral only needs to be known up to a constant of proportionality since we only need to assess whether it is positive. We can then express it in terms of expectations as:

\begin{split}\mathcal{I}_{criteria}=&\text{E}_{\theta\sim p\left(\theta\mid\mathcal{D},M_{m_{l}},\beta_{l}\right)}\left[\frac{p\left(\mathcal{D}\mid\theta,M_{K}\right)}{p\left(\mathcal{D}\mid\theta,M_{m_{l}}\right)^{\beta_{l}}}\log p\left(\mathcal{D}\mid\theta,M_{m_{l}}\right)^{\Delta\beta}\right]-\\ &\text{E}_{\theta\sim p\left(\theta\mid\mathcal{D},M_{m_{l}},\beta_{l}\right)}\left[\frac{p\left(\mathcal{D}\mid\theta,M_{K}\right)}{p\left(\mathcal{D}\mid\theta,M_{m_{l}}\right)^{\beta_{l}}}\right]\log\text{E}_{\theta\sim p\left(\theta\mid\mathcal{D},M_{m_{l}},\beta_{l}\right)}\left[p\left(\mathcal{D}\mid\theta,M_{m_{l}}\right)^{\Delta\beta}\right]\end{split} (26)

We can now estimate whether $\mathcal{I}_{criteria}$ is positive or negative to determine whether information is gained or lost by the next update. To approximate these expectations we use the $N$ ST-MCMC samples at level $l$, $\{\theta^{l}_{i}:i=1\dots N\}$, which are approximately distributed according to $p\left(\theta\mid\mathcal{D},M_{m_{l}},\beta_{l}\right)$. We also use the evaluations of the full fidelity model likelihood at these points:

\begin{split}\text{E}_{\theta\sim p\left(\theta\mid\mathcal{D},M_{m_{l}},\beta_{l}\right)}&\left[\frac{p\left(\mathcal{D}\mid\theta,M_{K}\right)}{p\left(\mathcal{D}\mid\theta,M_{m_{l}}\right)^{\beta_{l}}}\log p\left(\mathcal{D}\mid\theta,M_{m_{l}}\right)^{\Delta\beta}\right]\\ &\approx\frac{1}{N}\sum_{i=1}^{N}\frac{p\left(\mathcal{D}\mid\theta^{l}_{i},M_{K}\right)}{p\left(\mathcal{D}\mid\theta^{l}_{i},M_{m_{l}}\right)^{\beta_{l}}}\log p\left(\mathcal{D}\mid\theta^{l}_{i},M_{m_{l}}\right)^{\Delta\beta}\end{split} (27)
\text{E}_{\theta\sim p\left(\theta\mid\mathcal{D},M_{m_{l}},\beta_{l}\right)}\left[\frac{p\left(\mathcal{D}\mid\theta,M_{K}\right)}{p\left(\mathcal{D}\mid\theta,M_{m_{l}}\right)^{\beta_{l}}}\right]\approx\frac{1}{N}\sum_{i=1}^{N}\frac{p\left(\mathcal{D}\mid\theta^{l}_{i},M_{K}\right)}{p\left(\mathcal{D}\mid\theta^{l}_{i},M_{m_{l}}\right)^{\beta_{l}}} (28)
\text{E}_{\theta\sim p\left(\theta\mid\mathcal{D},M_{m_{l}},\beta_{l}\right)}\left[p\left(\mathcal{D}\mid\theta,M_{m_{l}}\right)^{\Delta\beta}\right]\approx\frac{1}{N}\sum_{i=1}^{N}p\left(\mathcal{D}\mid\theta^{l}_{i},M_{m_{l}}\right)^{\Delta\beta} (29)
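A minimal sketch of the resulting estimator is given below; it assumes the per-sample log-likelihoods of the current low fidelity model and of the full fidelity model have both been evaluated at the level-$l$ samples, works in log space for numerical stability, and uses only the sign of the result, as described above.

```python
import numpy as np

# Minimal sketch of the sample estimate of I_criteria from eqs. (26)-(29).
def info_criterion(log_like_low, log_like_full, beta, delta_beta):
    # importance ratio p(D|theta, M_K) / p(D|theta, M_ml)^beta_l (up to a
    # positive constant after stabilization, which does not change the sign)
    log_r = log_like_full - beta * log_like_low
    r = np.exp(log_r - log_r.max())
    term1 = np.mean(r * delta_beta * log_like_low)            # cf. eq. (27)
    # log E[p(D|theta, M_ml)^delta_beta], cf. eq. (29), via log-sum-exp
    log_w = delta_beta * log_like_low
    log_mean_w = np.log(np.mean(np.exp(log_w - log_w.max()))) + log_w.max()
    term2 = np.mean(r) * log_mean_w                           # cf. eqs. (28)-(29)
    return term1 - term2   # positive -> keep current fidelity; negative -> bridge
```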

3.4 Multifidelity ST-MCMC and Bayesian model selection

SMC and ST-MCMC methods not only enable robust solutions of Bayesian inference problems for parameter calibration, but also enable Bayesian model selection by providing asymptotically unbiased estimates of the model evidence. Model evidence estimates are generally highly computationally expensive since they require estimating the normalization constant,

p\left(\mathcal{D}\mid M\right)=\int p\left(\mathcal{D}\mid\theta,M\right)p\left(\theta\mid M\right)d\theta, (30)

which requires marginalizing the likelihood over the prior distribution. If the high probability content of the prior differs significantly from the most likely parameters according to the likelihood, it is difficult to estimate this integral using Monte Carlo samples. Instead, SMC-type methods break this estimate down into a series of Monte Carlo approximations over the intermediate distribution levels previously discussed. As such, a hierarchy of multifidelity models can also be used to accelerate this estimate within the Multifidelity ST-MCMC framework. Using the methods described in [28, 50], an SMC-based sampler such as Multifidelity ST-MCMC can estimate the model evidence of the highest fidelity model, $M_{K}$, by estimating the product:

p\left(\mathcal{D}\mid M_{K}\right)=\prod_{l=1}^{L}\frac{p\left(\mathcal{D}\mid m_{l},\beta_{l}\right)}{p\left(\mathcal{D}\mid m_{l-1},\beta_{l-1}\right)}=\prod_{l=1}^{L}c_{l} (31)

where $p\left(\mathcal{D}\mid m,\beta\right)=\int p\left(\mathcal{D}\mid\theta,M_{m}\right)^{\beta}p\left(\theta\right)d\theta$ and $L$ is the final level of ST-MCMC. The ratio $c_{l}$ can be written as

\begin{split}c_{l}&=\frac{\int p\left(\mathcal{D}\mid\theta,M_{m_{l}}\right)^{\beta_{l}}p\left(\theta\right)d\theta}{\int p\left(\mathcal{D}\mid\theta,M_{m_{l-1}}\right)^{\beta_{l-1}}p\left(\theta\right)d\theta}\\ &=\int\frac{p\left(\mathcal{D}\mid\theta,M_{m_{l}}\right)^{\beta_{l}}}{p\left(\mathcal{D}\mid\theta,M_{m_{l-1}}\right)^{\beta_{l-1}}}\frac{p\left(\mathcal{D}\mid\theta,M_{m_{l-1}}\right)^{\beta_{l-1}}p\left(\theta\right)}{\int p\left(\mathcal{D}\mid\theta,M_{m_{l-1}}\right)^{\beta_{l-1}}p\left(\theta\right)d\theta}d\theta\\ &=\text{E}_{\theta\sim p\left(\theta\mid\mathcal{D},M_{m_{l-1}},\beta_{l-1}\right)}\left[\frac{p\left(\mathcal{D}\mid\theta,M_{m_{l}}\right)^{\beta_{l}}}{p\left(\mathcal{D}\mid\theta,M_{m_{l-1}}\right)^{\beta_{l-1}}}\right]\\ &\approx\frac{1}{N}\sum_{i=1}^{N}w_{i}^{l}\end{split} (32)

where $w_{i}^{l}$ are the unnormalized resampling weights at level $l$ for the sample population $\theta_{i=1\dots N}^{l-1}$. Therefore, using the weights we already computed as part of Multifidelity ST-MCMC, we are able to compute an estimate of the model evidence.
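In practice the product in eq. (31) is best accumulated in log space. The minimal sketch below does this with the unnormalized log-weights already available at each level; the class name is an illustrative assumption.

```python
import numpy as np

# Minimal sketch of the evidence estimate of eqs. (31)-(32): accumulate, in
# log space, the mean of the unnormalized resampling weights at every level.
class EvidenceAccumulator:
    def __init__(self):
        self.log_evidence = 0.0

    def update(self, log_w):               # log_w: unnormalized log-weights at level l
        m = np.max(log_w)
        self.log_evidence += m + np.log(np.mean(np.exp(log_w - m)))  # log of eq. (32)

    def value(self):
        return self.log_evidence           # log p(D | M_K) after the final level
```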

4 Multifidelity reduced models of the chemical master equation

The FSP algorithm introduced in section 2 is commonly used to compute the likelihood of observed data measurements. When used within MCMC sampling, the FSP is usually implemented in one of the following two ways.

The first is to fix a single, large subset of states for all parameter samples [11]. Since the probability distribution of the CME changes significantly as the MCMC explores the parameter space, it is very difficult to specify a finite state set that accurately captures the significant portion of the probability mass for all times and all parameters. One can end up choosing a static FSP that is either inaccurate or inefficient. This scenario is similar to when a static discretization scheme (e.g., finite elements) is employed in the simulation of parametric partial differential equation models, where a manually chosen grid size may turn out to be too coarse for some parameter regimes and excessive for others.

This drawback motivates the second approach, which instead uses adaptive CME solvers. There have been many adaptive formulations of the FSP [45, 51, 52, 53, 54] in which the state set is iteratively expanded until the approximate solution satisfies a user-specified error tolerance. However, there may be regions in the parameter space where the adaptive state set must be expanded to an enormous size to accurately approximate the CME solution. These 'non-physical' parameter combinations usually fit the data poorly, and the large computational effort for their forward solutions does not provide useful information about the posterior. On the other hand, since reaction networks usually comprise nonlinear and unpredictable interactions, it is difficult to know a priori which parameter values give rise to such difficult (but uninformative) forward solutions of the full-fidelity CME. Simple techniques to regularize the cost of the forward solutions by restricting either the computational time or the number of time steps run the risk of mistakenly ignoring genuinely informative parameter candidates whose evaluation just happens to require high computational cost.

The framework of the Multifidelity ST-MCMC sampler allows us to strike a compromise. In particular, we recast the static FSP into a surrogate CME whose solution complexity is uniformly bounded across all parameters. An adaptive FSP method with a strict error tolerance is applied only to the surrogate master equations. This allows us to avoid overcommitting to parameters that have low posterior probabilities during the early annealing levels, yet still guarantees accurate computation of the likelihood at later sampling stages.

4.1 Implicitly defined finite state projection for constructing surrogate CME models

Surrogate models can be derived by adding restrictive assumptions to the physics of the original model. In particular, consider a hypothetical biological surrogate of the original cells modeled by the full CME in which all cellular processes 'freeze' when molecular copy numbers reach a certain set of thresholds. As we increase these thresholds, the surrogate cells behave more freely and more like the original cells, and the master equation describing their behavior approaches the original CME, as illustrated in Fig. 1.

Let $b_{1},\ldots,b_{N}$ be bounds on the copy numbers of species $1$ through $N$. We define an approximate SRN whose propensities are surrogates of the original SRN propensities and are given by

\hat{\alpha}_{j}(\bm{x})=\alpha_{j}(\bm{x})\prod_{i}[x_{i}\leq b_{i}], (33)

where $[E]$ takes the value $1$ if expression $E$ is true and zero otherwise. Since there are no further transitions once the process enters a state that exceeds the bounds, the state space of the surrogate chemical master equation is effectively reduced to the hyper-rectangle $H(b)=\times_{i=1}^{N}\{0,\ldots,b_{i}\}$. Thus, the infinite-dimensional system of differential equations (3) is replaced by the finite-dimensional surrogate dynamical system

M(b):\;\frac{d}{dt}{\widehat{\bm{p}}_{H}(t)}=\widehat{\bm{A}}_{H}\widehat{\bm{p}}_{H}(t),\;\widehat{\bm{p}}_{H}(0)=\widehat{\bm{p}}_{0}|_{H}, (34)

where the truncated infinitesimal generator \widehat{\bm{A}}_{H} is defined similarly to eq. (4), but with the exact propensities replaced by the surrogate propensities given in eq. (33).

We note that eq. (34) is equivalent to eq. (6) with \Omega=H(b). Thus, our surrogate propensities implicitly define a finite state projection of the original CME. We also note that the surrogate CMEs need not be constrained within a hyper-rectangle as considered here. It may be beneficial to derive a sequence of transformations of the original propensities, with the approximations chosen in such a way as to alleviate the computational burden of solving the original model by, e.g., making the lower-fidelity dynamical system less stiff than the high-fidelity one. We leave this more general strategy to future work.
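
To make the construction concrete, the following Python sketch assembles the truncated generator of eq. (34) for a given bound vector and integrates the resulting linear ODE with an off-the-shelf solver. The helper names and the birth-death example are hypothetical illustrations, not the implementation used in this work.

import itertools
import numpy as np
from scipy.sparse import lil_matrix, csr_matrix
from scipy.integrate import solve_ivp

def truncated_generator(bounds, stoich, propensities):
    # Assemble the FSP-truncated generator A_H on H(b) = {0,...,b_1} x ... x {0,...,b_N}
    # implied by the surrogate propensities of eq. (33). `stoich` lists stoichiometry
    # vectors and `propensities` lists the callables alpha_j(x).
    states = list(itertools.product(*[range(b + 1) for b in bounds]))
    index = {s: i for i, s in enumerate(states)}
    A = lil_matrix((len(states), len(states)))
    for x, ix in index.items():
        for s_j, alpha_j in zip(stoich, propensities):
            rate = alpha_j(np.array(x))
            if rate <= 0.0:
                continue
            A[ix, ix] -= rate                      # outflow, kept even when the target leaves H(b)
            y = tuple(np.array(x) + s_j)
            if y in index:                         # inflow only for states retained in H(b)
                A[index[y], ix] += rate
    return csr_matrix(A), states

def solve_surrogate_cme(bounds, stoich, propensities, p0, t_grid):
    # Integrate d/dt p_H = A_H p_H (eq. (34)) on the truncated state space.
    A, states = truncated_generator(bounds, stoich, propensities)
    sol = solve_ivp(lambda t, p: A @ p, (t_grid[0], t_grid[-1]), p0,
                    t_eval=t_grid, method="BDF", rtol=1e-6, atol=1e-12)
    return sol.y, states

# Hypothetical birth-death example, freezing above b = 30.
stoich = [np.array([1]), np.array([-1])]
props = [lambda x: 5.0, lambda x: 0.5 * x[0]]
p0 = np.zeros(31); p0[0] = 1.0
P, _ = solve_surrogate_cme([30], stoich, props, p0, np.linspace(0.0, 10.0, 5))
print(1.0 - P[:, -1].sum())  # probability mass lost to the frozen states (FSP error)

Within H(b) the surrogate propensities coincide with the originals; the only effect of eq. (33) is that probability flowing past the bounds is absorbed, so the column sums of the truncated generator are non-positive and 1 - \mathds{1}^{T}\widehat{\bm{p}}_{H}(t) measures the truncation error.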

For the present choice of hyper-rectangular state spaces, we recall the important result mentioned in section 2.2: as the entries of \bm{b} increase monotonically, the state space H(\bm{b}) includes more states, and the truncation error, measured as the \ell_{1}-distance between \widehat{\bm{p}}_{H} and the true CME solution \bm{p}(t), decreases monotonically. This provides us with a straightforward and natural way to form a hierarchy of surrogate models within the Multifidelity ST-MCMC framework.

Figure 1: Illustrative realizations of the full and surrogate CMEs for a simple system with mass-action propensities. Given the same random seed, the simulated trajectory of the surrogate CME is identical to that of the true CME until the state reaches a threshold, b, where the surrogate trajectory freezes. Increasing the threshold reduces the chance that the surrogate trajectories hit the bounds, and consequently more realizations of the true CME are captured by the surrogate model.

4.2 Using the hierarchy of surrogate CMEs within the ST-MCMC framework

With the surrogate CME models formulated, a strict hierarchy of surrogate models can be defined by a sequence of bounding vectors \bm{b}^{(1)}\leq\bm{b}^{(2)}\leq\ldots\leq\bm{b}^{(K)}, where the “\leq” sign applies element-wise. The corresponding surrogate models M_{l}:=M(\bm{b}^{(l)}) are then defined as in eq. (34). As mentioned earlier, the error in the surrogate CMEs decreases monotonically as we increase the bounds. Therefore, \{M_{l}\} forms a hierarchy in which each level attains more fidelity than its predecessor.

At the l-th level, the log-likelihood function in eq. (8) is approximated by

L(\mathcal{D}|\theta)\approx L_{M_{l}}(\theta)=\sum_{i=1}^{T}\sum_{j=1}^{n_{i}}\log p\bigl(t_{i},\min(\bm{c}_{j,i},\bm{b}^{(l)})\,\big|\,M_{l}(\theta)\bigr). \qquad (35)

In the surrogate log-likelihood, the data are projected onto the finite state space H(\bm{b}^{(l)}), and the probabilities of the data at different time points are computed from the surrogate Markov model M_{l}. Clearly, as l increases, the surrogate function L_{M_{l}}(\theta) becomes a more accurate approximation of the true log-likelihood L(\mathcal{D}|\theta). In the ideal situation where the hierarchy is allowed to have infinite depth, these surrogates are guaranteed to form a sequence that converges pointwise to the true log-likelihood from below. This is shown formally in the following proposition.
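
For illustration, eq. (35) can be evaluated along the lines of the following Python sketch. The routine assumes a hypothetical solver (such as the one sketched earlier in this section) that returns, for each measurement time, the surrogate probability distribution as an N-dimensional array over H(\bm{b}^{(l)}); the function and argument names are illustrative only.

import numpy as np

def surrogate_log_likelihood(data, bounds, distributions):
    # data: dict mapping each measurement time t_i to an (n_i x N) array of
    #       observed copy-number vectors c_{j,i}.
    # bounds: the bound vector b^{(l)} of the surrogate model M_l.
    # distributions: dict mapping t_i to the surrogate CME solution at t_i,
    #       stored as an array of shape (b_1 + 1, ..., b_N + 1).
    bounds = np.asarray(bounds)
    log_like = 0.0
    for t_i, cells in data.items():
        # Project each observation onto the truncated state space: min(c_{j,i}, b^{(l)}).
        projected = np.minimum(np.asarray(cells), bounds)
        for c in projected:
            prob = distributions[t_i][tuple(c)]
            log_like += np.log(max(prob, 1e-300))  # guard against log(0) from truncated mass
    return log_like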

Proposition 4.1.

Let the sequence of bounds \{\bm{b}^{(l)}\}_{l=1}^{\infty}\subset\mathbb{N}^{N}, where \bm{b}^{(l)}:=(b_{1}^{(l)},\ldots,b_{N}^{(l)}), be chosen such that \bm{b}^{(l)}\leq\bm{b}^{(l+1)} elementwise (i.e., b_{j}^{(l)}\leq b_{j}^{(l+1)}). Assume that the continuous-time Markov chain underlying the SRN is non-explosive for all parameters and that the initial distribution of the CME (3) has finite support. For each fixed value of the parameter \theta, we have the following:

  1. L_{M_{l}}(\theta)\rightarrow L(\mathcal{D}|\theta) as l\rightarrow\infty.

  2. There exists a subsequence l_{i} such that L_{M_{l_{i}}}\uparrow L(\mathcal{D}|\theta).

Here, the log-likelihood function L(\mathcal{D}|\theta) is defined as in (8) and the surrogate L_{M_{l}}(\theta) is defined as in (35).

Proof.

Without loss of generality, we assume that the initial distribution is concentrated at a single state \bm{x}_{0}. Let R be the number of reactions that occur during the finite time interval [0,t_{T}] given the starting state \bm{x}_{0}. If there exists \varepsilon>0 for which \mathbb{P}(R>R_{\varepsilon})\geq\varepsilon for every choice of R_{\varepsilon}, then we have \mathbb{P}(R=\infty)\geq\varepsilon, violating the assumption of non-explosion. Thus, for every \varepsilon>0, there exists R_{\varepsilon} such that \mathbb{P}(R>R_{\varepsilon})<\varepsilon. Furthermore, we can find l_{\varepsilon} such that H(\bm{b}^{(l_{\varepsilon})}) contains all states that are reachable from \bm{x}_{0} via R_{\varepsilon} reactions or fewer. Thus, for l\geq l_{\varepsilon}, the probability that a sample path of the surrogate CME ever exceeds H(\bm{b}^{(l)}) within [0,t_{T}] is less than \varepsilon, and the corresponding solution of the surrogate CME is guaranteed to be less than \varepsilon away (in one-norm) from the true CME solution. This proves (i).

For sufficiently large l, we have \min(\bm{c}_{j,i},\bm{b}^{(l)})=\bm{c}_{j,i} for all data points. Entry-wise, the FSP approximations increase monotonically [10, Theorem 2.2], so p(t_{i},\bm{c}_{j,i}|M_{l}(\theta)) increases monotonically in l. We can then choose the subsequence \{l_{i}\} from \{l\} by simply discarding the leading elements until \bm{b}^{(l)} no longer affects the \min(\cdot,\cdot) function in eq. (35). This proves (ii). ∎

In summary, the FSP scheme allows us to define a hierarchy of surrogate master equations that approach the true CME as the surrogate state space enlarges. From this, we can define a sequence of surrogate log-likelihood functions that converge to the true log-likelihood from below. These surrogates can be used within the Multifidelity ST-MCMC framework introduced in section 3. Before we do so, however, we must first ensure that an accurate solution to the system (34) can be computed efficiently.

4.3 Fast and accurate solution of the surrogate master equation

Although the surrogate master equation (34) is a significant reduction of the infinite-dimensional CME, the number of states included in the truncated state space H(b) still grows as O(b_{1}\cdots b_{N}), and the surrogate CME can quickly become expensive as we increase the entries of b. However, in practice, the probability mass of the solution vector \widehat{\bm{p}}_{H}(t) tends to concentrate on a much smaller subset of states. It is therefore advantageous to approximate \widehat{\bm{p}}_{H}(t) with a more compactly supported distribution. More precisely, if we let \varepsilon>0 be an error tolerance, we can use a distribution \widetilde{\bm{p}}_{\Omega} supported on \Omega\subset H such that \|\widehat{\bm{p}}_{H}-\widetilde{\bm{p}}_{\Omega}\|\leq\varepsilon. Here, the FSP error bound (7) plays a critical role in choosing the appropriate support set \Omega. We note that this error bound was recently utilized by Fox et al. [55] to compute rigorous lower and upper bounds for the true log-likelihood function (8), from which comparisons between certain models can be made even with a low-fidelity FSP solution. We do not pursue this direction in the present work.

To efficiently compute the solution of the surrogate CMEs using the principles just mentioned, we employ a new FSP implementation recently developed by Vo and Munsky [56]. This solver divides the time interval of interest [0,t_{f}] into subintervals I_{j}:=[t_{j},t_{j+1}),\;j=0,\ldots,n_{step}-1, with 0:=t_{0}<t_{1}<\ldots<t_{n_{step}}:=t_{f}. On each time subinterval I_{j}, the dense tensor \widehat{\bm{p}}_{H}(t) that solves the surrogate CME (34) is approximated by a sparse tensor \widetilde{\bm{p}}_{\Omega_{j}}(t) supported on \Omega_{j}\subset H, obtained from solving

\frac{d}{dt}\widetilde{\bm{p}}_{\Omega_{j}}(t)=\widetilde{\bm{A}}_{\Omega_{j}}\widetilde{\bm{p}}_{\Omega_{j}}(t),\qquad t\in[t_{j},t_{j+1}), \qquad (36)

where

\widetilde{\bm{A}}_{\Omega_{j}}(\bm{y},\bm{x})=\begin{cases}\widehat{\bm{A}}_{H}(\bm{y},\bm{x})&\text{if }\bm{x},\bm{y}\in\Omega_{j},\\ 0&\text{otherwise}.\end{cases}

Clearly, in solving (36), we only need to keep track of the equations corresponding to states in \Omega_{j}, which reduces the computational cost significantly.

From the FSP error bound (7), we derive an error-control criterion of the form

g_{j}(t)=1-\mathds{1}^{T}\widetilde{\bm{p}}_{\Omega_{j}}(t)\leq\frac{t}{t_{f}}\varepsilon. \qquad (37)

If at some t\in[t_{j},t_{j+1}) we find that the inequality is not satisfied, more states are added to \Omega_{j} and the integration restarts from t_{j} until the criterion is satisfied over the whole interval. The determination of the time steps t_{j} is left to the ODE integrator employed for solving eq. (36).
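
The resulting expand-and-retry logic can be summarized by the following Python-style sketch; `integrate_on` and `expand_state_set` are hypothetical placeholders for the ODE integration on a fixed state set and for the state-set expansion described next, and the actual solver of [56] is a parallel PETSc implementation rather than this serial loop.

def adaptive_fsp_step(p_j, omega_j, t_j, t_next, t_final, eps, integrate_on, expand_state_set):
    # Advance the sparse solution from t_j to t_next while enforcing the
    # error-control criterion of eq. (37): 1 - sum(p(t)) <= (t / t_final) * eps.
    while True:
        trajectory = integrate_on(omega_j, p_j, t_j, t_next)   # solve eq. (36) on Omega_j
        if all(1.0 - p.sum() <= (t / t_final) * eps for t, p in trajectory):
            return trajectory[-1][1], omega_j                  # accepted: return p(t_next) and Omega_j
        # Criterion violated somewhere in [t_j, t_next): enlarge Omega_j and
        # restart the integration of this subinterval from t_j.
        omega_j = expand_state_set(omega_j)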

The state sets \Omega_{j} are chosen as integer solutions of a set of inequality constraints. In particular, they have the form

\Omega_{j}=\left\{\bm{x}\in H(b)\;\middle|\;f_{i}(\bm{x})\leq c_{i}^{(j)}\right\}, \qquad (38)

where the f_{i} are functions chosen a priori, and the c_{i}^{(j)}>0 are positive scalars. To expand \Omega_{j}, we simply increase c_{i}^{(j)} and run a breadth-first-search routine to explore all reachable states that satisfy the relaxed inequality constraints.
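
A minimal sketch of this breadth-first expansion, assuming integer-valued state vectors, a list of stoichiometry vectors, and a priori constraint functions f_{i} as in eq. (38), could look as follows (all names are illustrative):

from collections import deque
import numpy as np

def constrained_reachable_set(seed_states, stoich, constraints, c_bounds, rect_bounds):
    # Breadth-first search over states x reachable from the seeds with
    # 0 <= x <= b (the hyper-rectangle H(b)) and f_i(x) <= c_i for all i, cf. eq. (38).
    def admissible(x):
        return (np.all(x >= 0) and np.all(x <= rect_bounds)
                and all(f(x) <= c for f, c in zip(constraints, c_bounds)))

    visited = {tuple(x) for x in seed_states}
    queue = deque(np.asarray(x) for x in seed_states)
    while queue:
        x = queue.popleft()
        for s in stoich:                         # take one reaction step in every direction
            y = x + s
            if tuple(y) not in visited and admissible(y):
                visited.add(tuple(y))
                queue.append(y)
    return visited

# Relaxing the constraint levels c_i and re-running the search enlarges Omega_j.
omega = constrained_reachable_set(
    seed_states=[(0, 0)],
    stoich=[np.array([1, 0]), np.array([-1, 0]), np.array([0, 1]), np.array([0, -1])],
    constraints=[lambda x: x.sum()],
    c_bounds=[10],
    rect_bounds=np.array([30, 30]))
print(len(omega))  # number of states satisfying x1 + x2 <= 10 within the 31 x 31 box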

Implementation-wise, the approximate solution \widetilde{\bm{p}}_{\Omega_{j}} is stored in a coordinate format similar to that used for sparse tensors [57]. The list of tensor indices is managed with the Distributed Directory data structure in the software package Zoltan [58, 59]. We also make use of parallel objects from the PETSc library [60, 61, 62]. Other implementation details will be communicated elsewhere [56]. It is worth pointing out that these MPI-based libraries allow our implementation to scale to multiple computing nodes, but the scaling efficiency will be limited by the communication cost inherent in numerical operations such as matrix-vector multiplications. Therefore, simply plugging a parallel forward-solution code running on an increasing number of nodes into a serial MCMC sampler such as Metropolis-Hastings will have diminishing benefits. ST-MCMC, in contrast, allows us to achieve better utilization of the computing resources, since it is embarrassingly parallel: doubling the number of processors simply enables us to simultaneously evaluate twice as many parameter samples in about the same computational time.

We also note that using an adaptive solver such as the one we present here incurs some numerical error in the surrogate likelihood function. However, we expect this error to be negligible with a conservative choice of error tolerance. In particular, the error threshold \varepsilon is always fixed at 10^{-8} in our numerical tests. In the next section, we will confirm the accuracy and efficiency of our combined multifidelity sampler and adaptive model reduction scheme when applied to two biologically inspired problems and one problem based on a real experimental dataset.

5 Numerical examples

In the following tests we compare the four variants of ST-MCMC described above: the Full-fidelity, ESS-Bridge, IT-Bridge, and Tuned IT-Bridge schemes. The Full-fidelity scheme is the classic ST-MCMC with every likelihood evaluation using the highest model fidelity. The remaining three schemes are Multifidelity ST-MCMC variants in which the bridging between fidelity levels is determined based on the ESS or on the new information-theoretic criterion, with or without \beta-tuning (see eq. (20) and the preceding discussion in section 3.2). In each of these Multifidelity schemes, the surrogate likelihood is formulated as described in section 4.1. When all propensities are time-invariant, as in the first two examples, the reduced system of ODEs in eq. (36) is solved by computing the action of the matrix exponential operator using the Krylov approximation with the Incomplete Orthogonalization Procedure [63, 64, 65], with the Krylov basis size fixed at 30. In the case of time-varying propensities in the third example, we use the four-stage, third-order, L-stable Rosenbrock-W scheme [66] implemented in the TS module of PETSc [67]. All these ODE solvers are set with a conservative absolute tolerance of 10^{-14} and a relative tolerance of 10^{-4}.
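
The Krylov/IOP solver of [63, 64, 65] is not reproduced here, but the quantity it approximates, the action of the matrix exponential on the probability vector, can be illustrated with SciPy on a hypothetical toy truncated generator (a birth-death process; this is only a stand-in and does not reflect the actual PETSc-based implementation):

import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import expm_multiply

# Toy FSP-truncated generator for a birth-death process on {0, ..., n}:
# birth rate k, death rate g * x; probability leaking past n is absorbed.
n, k, g = 50, 5.0, 0.5
x = np.arange(n + 1)
A = diags([-(k + g * x), np.full(n, k), g * x[1:]], offsets=[0, -1, 1], format="csc")

# p(t) = exp(t A) p(0) evaluated at several times without ever forming exp(t A).
p0 = np.zeros(n + 1)
p0[0] = 1.0
p_t = expm_multiply(A, p0, start=0.0, stop=10.0, num=5, endpoint=True)
print(p_t.shape, 1.0 - p_t[-1].sum())  # truncation error mass at the final time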

5.1 Parameter inference for repressilator gene circuit

reaction propensity
1. \emptyset\rightarrow\mathrm{TetR}    k_{0}/(1+a_{0}[\mathrm{LacI}]^{b_{0}})
2. \mathrm{TetR}\rightarrow\emptyset    \gamma_{0}[\mathrm{TetR}]
3. \emptyset\rightarrow\mathrm{\lambda cI}    k_{1}/(1+a_{1}[\mathrm{TetR}]^{b_{1}})
4. \mathrm{\lambda cI}\rightarrow\emptyset    \gamma_{1}[\mathrm{\lambda cI}]
5. \emptyset\rightarrow\mathrm{LacI}    k_{2}/(1+a_{2}[\mathrm{\lambda cI}]^{b_{2}})
6. \mathrm{LacI}\rightarrow\emptyset    \gamma_{2}[\mathrm{LacI}]
Table 1: Reactions and propensities in the repressilator model. ([X] is the number of copies of the species X.)

We first consider a three-species model inspired by the well-known repressilator gene circuit [1]. This model consists of three species, \mathrm{TetR}, \mathrm{\lambda cI}, and \mathrm{LacI}, which constitute a negative feedback network (Table 1). We simulate a dataset that consists of five measurement times, 2, 4, 6, 8, and 10 minutes, with 1000 cells measured at each time point. These numbers of single-cell measurements are typical of smFISH experiments [6, 41]. We assume that all cells start at the state \bm{x}_{0}=(\mathrm{TetR},\mathrm{\lambda cI},\mathrm{LacI})=(0,0,0), so that there are no gene products at the initial time.

The hierarchy of surrogate CMEs (cf. (34)) is defined by the bounds

b^{(l)}=\begin{bmatrix}b_{\mathrm{TetR}}^{(l)}\\ b_{\mathrm{\lambda cI}}^{(l)}\\ b_{\mathrm{LacI}}^{(l)}\end{bmatrix}=\begin{bmatrix}\left\lfloor c_{1}+(l-1)\frac{d_{1}-c_{1}}{L_{\max}+1}\right\rfloor\\ \left\lfloor c_{2}+(l-1)\frac{d_{2}-c_{2}}{L_{\max}+1}\right\rfloor\\ \left\lfloor c_{3}+(l-1)\frac{d_{3}-c_{3}}{L_{\max}+1}\right\rfloor\end{bmatrix}

where (c_{1},c_{2},c_{3})=(20,40,40) and (d_{1},d_{2},d_{3})=(50,100,100), with L_{\max}=10. Therefore, the multifidelity ST-MCMC will transit through ten levels, with the highest-fidelity model having a state space of size 51\times 101\times 101.

We conduct parameter inference in log_{10}-transformed space. The prior for the parameters is chosen to be a multivariate normal distribution (in log_{10} space) with a diagonal covariance matrix (see Table 2). We ran both the ST-MCMC with the highest-fidelity surrogate CME and the multifidelity ST-MCMC on 29 nodes, with 36 cores per node and 1044 parallel chains. For each level, samples were evolved using Metropolis-Hastings MCMC until a correlation target of 0.6 was reached. The proposal distribution was adaptively tuned as part of the algorithm. Details on the ST-MCMC sampler and its tuning can be found in [32]. Fig. 2 shows the time taken for each sampling scheme to reach a given annealing level, with the multifidelity schemes using our proposed IT-based criterion outperforming the state-of-the-art fixed-fidelity ST-MCMC and the ESS-Bridging scheme. Specifically, while the full-fidelity ST-MCMC took over 25 hours to finish, the Multifidelity ST-MCMC with ESS, Information Theoretic, and Tuned Information Theoretic Bridging took 7.2, 4.1, and 5.3 hours, respectively, resulting in speedup factors of about 3.5, 6.2, and 4.8. The novel Information Theoretic (IT) schemes are clearly faster than the ESS-based scheme in this example, with the untuned IT scheme almost twice as fast as the ESS-based scheme. We observe that for the early levels of the algorithms, when \beta is small, the lowest-fidelity model is sufficiently informative. Further, at these early levels the ESS-based scheme is slightly faster than the others since it does not require any full-model evaluations. However, once \beta gets larger, the IT-based methods start to outperform the ESS-based method, since they use the full-model evaluations to judge that they do not need to bridge to the higher-fidelity models as quickly as the ESS-based scheme does.

Although the prior assigns a probability density of only about 8.766\times 10^{-20} to the true parameter vector, all samplers were able to bring the particles close to the true parameters (Fig. 4). There is no notable difference in the shapes of the posterior distributions constructed from the samples of these schemes (Fig. 3 and Table 2).

Parameter True Prior Posterior
Full-fidelity ESS-Bridge IT-Bridge Tuned IT-Bridge
log_{10}(k_{0})    1.00    10.00 ± 0.3    1.00 ± 0.01    1.00 ± 0.02    0.99 ± 0.01    1.00 ± 0.01
log_{10}(\gamma_{0})    -2.00    0.10 ± 0.3    -1.98 ± 0.07    -1.98 ± 0.08    -1.98 ± 0.07    -1.98 ± 0.07
log_{10}(a_{0})    -1.00    0.10 ± 0.3    -1.05 ± 0.06    -1.05 ± 0.07    -1.07 ± 0.06    -1.06 ± 0.06
log_{10}(b_{0})    0.30    0.10 ± 0.3    0.31 ± 0.01    0.31 ± 0.01    0.31 ± 0.01    0.31 ± 0.01
log_{10}(k_{1})    0.88    10.00 ± 0.3    0.87 ± 0.00    0.87 ± 0.01    0.87 ± 0.00    0.87 ± 0.00
log_{10}(\gamma_{1})    -1.70    0.10 ± 0.3    -1.71 ± 0.05    -1.73 ± 0.06    -1.73 ± 0.05    -1.71 ± 0.05
log_{10}(a_{1})    -2.00    0.10 ± 0.3    -1.98 ± 0.05    -1.98 ± 0.06    -1.99 ± 0.05    -1.98 ± 0.05
log_{10}(b_{1})    0.40    0.10 ± 0.3    0.40 ± 0.01    0.40 ± 0.01    0.40 ± 0.01    0.40 ± 0.01
log_{10}(k_{2})    1.00    10.00 ± 0.3    0.98 ± 0.01    0.99 ± 0.01    0.98 ± 0.01    0.98 ± 0.01
log_{10}(\gamma_{2})    -1.30    0.10 ± 0.3    -1.34 ± 0.03    -1.34 ± 0.04    -1.34 ± 0.03    -1.34 ± 0.03
log_{10}(a_{2})    -1.30    0.10 ± 0.3    -1.35 ± 0.06    -1.34 ± 0.07    -1.36 ± 0.06    -1.35 ± 0.06
log_{10}(b_{2})    0.48    0.10 ± 0.3    0.48 ± 0.01    0.48 ± 0.01    0.48 ± 0.01    0.48 ± 0.01
Table 2: Model parameters in the repressilator example. The second column lists the true values, and the third column the parameters of the prior distribution, where we use a Gaussian prior in the log_{10}-transformed parameter space with a diagonal covariance matrix. The last four columns present the posterior mean and standard deviation of the model parameters estimated using four methods: Full-fidelity ST-MCMC, Multifidelity ST-MCMC with ESS-Bridging, Multifidelity ST-MCMC with IT-Bridging, and Multifidelity ST-MCMC with IT-Bridging and \beta-tuning.
Figure 2: Performance of ST-MCMC samplers on the repressilator example. The horizontal axis represents the annealing factor, i.e., the inverse temperature. Significant speedup is observed for the multifidelity schemes.
Figure 3: Prior and posterior densities in the repressilator example. See table 2 for the numerical values of the estimated means and standard deviations of these posterior distributions. It is evident that all methods converge to virtually the same distribution.
Figure 4: Evolution of the population of samples for the repressilator model parameters using four different ST-MCMC variants: full-fidelity, and multifidelity strategies with bridging based on the ESS, the Information Theoretic criterion, and the Tuned Information Theoretic criterion. The solid lines represent the history of the sample means. The shaded region shows the mean ± one standard deviation. Notice for the \gamma_{1} parameter that bias starts to accumulate for the IT-based methods. This bias is corrected once the sampler starts bridging, which happens when the bias begins to exceed the natural parameter variability.

5.2 Bayesian comparison of compartmental models of gene expression

reaction index reaction propensity
1,\ldots,n_{G}    G_{i-1}\rightarrow G_{i},\; i=1,\ldots,n_{G}-1    k_{i-1}^{+}[G_{i-1}]
n_{G}+1,\ldots,2n_{G}    G_{i}\rightarrow G_{i-1},\; i=1,\ldots,n_{G}-1    k_{i}^{-}[G_{i}]
2n_{G}+1,\ldots,3n_{G}-1    G_{i}\rightarrow G_{i}+\text{RNA}_{nuc},\; i=1,\ldots,n_{G}-1    r_{i}[G_{i}]
3n_{G}    \text{RNA}_{nuc}\rightarrow\text{RNA}_{cyt}    k_{trans}[\text{RNA}_{nuc}]
3n_{G}+1    \text{RNA}_{cyt}\rightarrow\emptyset    \gamma[\text{RNA}_{cyt}]
Table 3: Reactions and propensities in the compartmental gene expression model.

We next explore the application of multifidelity ST-MCMC to the problem of model selection. We consider a class of compartmental multi-state gene expression models based on the model considered in [68]. The model separates biomolecules into the nuclear and cytoplasmic compartments. The reaction network consists of a gene that can switch between an inactivated state G_{0} and several activated states G_{i}, i=1,\ldots,n_{G}-1. When activated, the gene can be transcribed into RNA molecules within the nucleus at an average rate of r_{i} molecules/minute. These nuclear mRNA molecules are then transported into the cytoplasm at a rate of k_{trans} molecules/minute, where they degrade at the probabilistic rate \gamma molecules/minute. Overall, the model consists of n_{G}+2 species: the gene in its different states, nuclear mRNA, and cytoplasmic mRNA. These molecular species can go through 3n_{G}+1 reaction channels (Table 3). Only the copy numbers of the nuclear and cytoplasmic mRNA species are observable in experiments. We want to use model selection to decide the number n_{G} of gene states that best explains the observed data.

We simulate a ground-truth dataset based on the model with n_{G}:=3, which consists of 1000 single-cell measurements for each time point t\in\{2,4,6,8,10\} (with hours as the time unit). We then use the multifidelity ST-MCMC with the information-theoretic criterion and \beta-tuning to estimate the model evidence for three classes of reaction networks that consist of two, three, and four gene states, and compare these results with the model evidence found using the full model. We choose the information-theoretic criterion with \beta-tuning over the other multifidelity approaches because it is the most robust and uses \beta-tuning to avoid sampling degeneracy when the model fidelity changes. This is very important for computing the model evidence because the error of the evidence estimate is related to the KL-divergence between the intermediate distributions. The 2- and 3-state models were run using 7 nodes with 36 cores each, while the 4-state model was run on 14 nodes, also with 36 cores each. Each ST-MCMC run used 1008 parallel samples. For each level, chains were run using MCMC until a correlation target of 0.6 was reached or 100 iterations were exceeded. These results are summarized in Table 4. The model evidence ranges are computed using the approach described in [50].

From our results, we observe that the Tuned-IT multifidelity ST-MCMC not only provides an estimate of the model evidence consistent with the full-model ST-MCMC, it actually reports a smaller error, all while taking less time, with speedup factors of about 1.6, 3.8, and 3.2 for the 2-, 3-, and 4-gene-state models, respectively. The improved estimate of the multifidelity approach is likely due to the fact that it uses more intermediate levels, so it has a finer discretization of the thermodynamic integration used to estimate the evidence.
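
The evidence values above come from thermodynamic integration over the tempering schedule, following [50]. A simplified serial sketch of that estimator, assuming the per-level log-likelihood values of the particles have been stored during the ST-MCMC run, is shown below; the variable names and the toy numbers are illustrative only.

import numpy as np

def log_evidence_thermodynamic(betas, loglike_samples):
    # Estimate log p(D | M) = \int_0^1 E_beta[ log L(D | theta) ] d(beta) with a
    # trapezoidal rule over the inverse-temperature schedule visited by ST-MCMC.
    means = np.array([np.mean(ll) for ll in loglike_samples])
    db = np.diff(np.asarray(betas, dtype=float))
    return float(np.sum(0.5 * (means[1:] + means[:-1]) * db))

# Illustrative usage with synthetic numbers: a finer beta schedule gives a finer
# discretization of the integral and hence a smaller discretization error.
betas = np.linspace(0.0, 1.0, 11)
rng = np.random.default_rng(0)
fake_loglikes = [rng.normal(-20100.0 - 50.0 * (1.0 - b), 5.0, size=1000) for b in betas]
print(log_evidence_thermodynamic(betas, fake_loglikes))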

The computed evidence indicates that the present data does not significantly favor one model choice over the others. Clearly more experiments are required to provide conclusive evidence for model selection, and the multifidelity framework allows us to recognize the insufficiency of the data faster than the full-fidelity scheme. This is an important advantage in practice, as a faster assessment of current experimental data will likely reduce the lag time between consecutive batches of experiments. Further, it may be possible to integrate the same multifidelity ST-MCMC approach into Bayesian experimental design to speed up estimating the expected information gain from various experimental setups, in order to design experiments that better discriminate between the models.

Model Full-fidelity Tuned IT Bridge
Log Evidence Time (Sec) Log Evidence Time (Sec)
2 Gene    -20108.8 ± 5.4    1244    -20112.6 ± 2.0    758
3 Gene    -20111.9 ± 5.6    67496    -20115.7 ± 2.0    17511
4 Gene    -20113.0 ± 5.7    76546    -20117.5 ± 1.8    23777
Table 4: Comparison of the model evidence computation for the 2-, 3-, and 4-state gene expression models using ST-MCMC with the full-fidelity model and Multifidelity ST-MCMC with \beta-tuning. The evidence estimates from the full-fidelity and multifidelity methods are consistent, but the multifidelity scheme is significantly faster.

5.3 Stochastic transcription of the inflammation response gene IL1beta

reaction propensity
1. G_{0}\rightarrow G_{1}    k_{01}[G_{0}]
2. G_{1}\rightarrow G_{2}    k_{12}[G_{1}]
3. G_{2}\rightarrow G_{1}    k_{21}[G_{2}]
4. G_{1}\rightarrow G_{0}    k_{10}(t)=\max\{0,a_{10}-b_{10}S(t)\}, see eq. (39)
5. \emptyset\rightarrow\text{RNA}    \alpha_{1}[G_{1}]+\alpha_{2}[G_{2}]
6. \text{RNA}\rightarrow\emptyset    \gamma[\text{RNA}]
Table 5: Reactions and propensities in the IL1beta model.

Having explored the performance of the Multifidelity ST-MCMC schemes with FSP on theoretical examples with simulated datasets, we now apply our method to a real dataset. We consider the expression of the IL1beta gene in response to LPS stimulation that was studied in Kalb et al. [41]. The dataset consists of mRNA counts for IL1beta measured right before applying the LPS stimulation, as well as at 0.5, 1, 2, and 4 hours after. We consider a three-state gene expression model with a time-varying deactivation rate. This results in a chemical reaction network with time-varying propensities and eleven uncertain parameters (Table 5). We assume the initial state (2,0,0,0). The observed mRNA counts are fit to the solutions of the CME at times T_{0}+\{0,0.5,1,2,4\} hours, where the time offset T_{0} is to be estimated. The influence of LPS-induced signaling molecules is modeled by a function of the form

S(t)=\max\left\{0,\;\exp\left(-r_{1}(t-T_{0})\right)\left(1-\exp\left(-r_{2}(t-T_{0})\right)\right)\right\}. \qquad (39)

This signal affects the rate at which the gene turns off,

k_{1,0}(t)=\max\{0,a_{10}-b_{10}S(t)\}.

Similar to the previous example, the gene state is hidden and the data only contains measurements of the mRNA copy numbers.
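
For concreteness, these time-varying rates translate directly into code; the sketch below uses placeholder parameter values (not the posterior estimates of Table 6) and hypothetical function names.

import numpy as np

def lps_signal(t, r1, r2, T0):
    # S(t) of eq. (39); the max(0, .) makes the signal vanish before the offset time T0.
    u = t - T0
    return np.maximum(0.0, np.exp(-r1 * u) * (1.0 - np.exp(-r2 * u)))

def gene_off_rate(t, a10, b10, r1, r2, T0):
    # Time-varying deactivation rate k_{1,0}(t) = max{0, a10 - b10 * S(t)}.
    return np.maximum(0.0, a10 - b10 * lps_signal(t, r1, r2, T0))

# Placeholder values, chosen only to show the shape of the response.
t = np.linspace(0.0, 4.0, 9)   # hours after the first measurement
print(gene_off_rate(t, a10=0.05, b10=5.0, r1=1.0, r2=2.0, T0=0.5))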

Fig. 5 summarizes the performance of the four sampling schemes. The full-fidelity model considered in this example has a state space of only 18,432 states and can be solved quickly without any reduction scheme. Yet, we still observe significant speedup from the Multifidelity schemes. Specifically, the Multifidelity ST-MCMC with ESS, Information Theoretic, and Tuned Information Theoretic Bridging took 3153, 2384, and 2703 seconds to finish, respectively, giving speedup factors of 1.7, 2.2, and 2.0 over the full-fidelity scheme, which took over 5468 seconds. In Figure 6, we see the evolution of the model parameters for the different methods, which helps explain the differences seen in Figure 5. It took significantly longer for the ESS-based method to bridge to higher-fidelity models than the IT-based methods. However, when it did bridge, it jumped straight to the highest model fidelity. For several parameters, we see that their evolution under the ESS-based scheme accumulated significant bias at lower \beta levels before being corrected when the ESS sampler started bridging much later on (e.g., parameters r_{1}, k_{01}, and T_{0}). Furthermore, once bridging occurred, the successive distributions were very far apart from each other, so the sample population degenerated. This can be seen in the bias that occurs in parameters r_{2}, b_{10}, and \alpha_{1} immediately after bridging. The fact that the ESS-based method does not use any information from the full posterior explains this delay and degeneracy, as it is unable to detect the emergence of bias. In contrast, the IT-based methods use the guidance of full-model evaluations to better recognize the emergence of bias, so they correct it more quickly. Therefore, the IT-based schemes have a smoother evolution and, as a result, take less time. However, we do observe bias in r_{2} after one bridging step for the IT-Bridge without \beta-tuning, indicating some sampling degeneracy. This is not the case for the IT-Bridge with \beta-tuning, since it is specifically designed to avoid degeneracy.

All sampling schemes yield essentially the same posterior estimates for the model parameters (Table 6). Despite significant posterior variance for some parameters, the Bayesian prediction for the distributions of RNA copy number has negligible uncertainty, and it appears to correspond reasonably well with the experimental data at the beginning and the end of the measurement time period (Fig. 7). We note that this is not necessarily the only model structure that could explain the data, and there may yet be other models that could fit and predict single-cell behavior more accurately. The speedup enabled by the Multifidelity framework will allow researchers to more rapidly propose, assess, and choose between alternative models.

Parameter Prior Posterior
Full-fidelity ESS-Bridge IT-Bridge Tuned IT-Bridge
log_{10}(r_{1})    -2.00 ± 0.33    -2.48 ± 0.03    -2.47 ± 0.03    -2.47 ± 0.03    -2.47 ± 0.03
log_{10}(r_{2})    -2.00 ± 0.33    -2.00 ± 0.34    -2.00 ± 0.33    -2.00 ± 0.33    -2.00 ± 0.32
log_{10}(k_{01})    -3.00 ± 0.33    -3.26 ± 0.03    -3.25 ± 0.04    -3.25 ± 0.03    -3.25 ± 0.03
log_{10}(a_{10})    -2.00 ± 0.33    -1.31 ± 0.06    -1.31 ± 0.06    -1.30 ± 0.06    -1.31 ± 0.06
log_{10}(b_{10})    3.00 ± 0.33    3.04 ± 0.34    3.05 ± 0.34    3.06 ± 0.30    3.04 ± 0.31
log_{10}(k_{12})    -3.00 ± 0.33    -3.08 ± 0.03    -3.08 ± 0.04    -3.08 ± 0.04    -3.08 ± 0.03
log_{10}(k_{21})    -2.00 ± 0.33    -1.25 ± 0.15    -1.26 ± 0.14    -1.28 ± 0.14    -1.28 ± 0.13
log_{10}(\alpha_{1})    -3.00 ± 0.33    -3.60 ± 0.15    -3.60 ± 0.16    -3.60 ± 0.16    -3.60 ± 0.16
log_{10}(\alpha_{2})    0.00 ± 0.33    0.71 ± 0.15    0.70 ± 0.14    0.68 ± 0.13    0.68 ± 0.13
log_{10}(\gamma)    -4.00 ± 0.33    -4.56 ± 0.05    -4.56 ± 0.05    -4.56 ± 0.05    -4.56 ± 0.05
log_{10}(T_{0})    4.00 ± 0.33    5.36 ± 0.09    5.36 ± 0.09    5.37 ± 0.08    5.37 ± 0.09
Table 6: Model parameters in the IL1beta example. The second column presents the parameters of the prior distribution, where we use a Gaussian prior in the log_{10}-transformed parameter space with a diagonal covariance matrix. The last four columns present the posterior mean and standard deviation of the model parameters estimated using ST-MCMC with the full-fidelity model and Multifidelity ST-MCMC with three different bridging strategies.
Figure 5: Performance of ST-MCMC samplers on the IL1beta example. The horizontal axis represents the inverse temperature.
Figure 6: Evolution of the population of samples for the IL1beta model parameters using four different ST-MCMC variants: full-fidelity, and multifidelity strategies with bridging based on the ESS, the Information Theoretic criterion, and the Tuned Information Theoretic criterion. The solid lines represent the history of the sample means. The shaded region shows the mean ± one standard deviation.
Figure 7: Comparison of data and the posterior mRNA distribution predictions for the IL1beta transcription model at zero and four hours after LPS induction. The mean Bayesian prediction for the mRNA probability distribution is computed by averaging the solution of the CME over all posterior samples. The shaded region shows one standard deviation around the mean. Visually, samples from different ST-MCMC formulations yield identical predictions.

6 Conclusion

Rapid advancements in experimental techniques are allowing biologists to collect quantitative data about cellular processes at ever smaller scales with increasing detail [6, 7, 8]. Mathematical models have become an indispensable part of the process of learning and making predictions from this data. Stochastic reaction networks (SRNs) form a powerful class of models that have found widespread use within the quantitative biology community [7]. Identifying these models from the data, however, is a challenging task due to the computational cost of solving the chemical master equation (CME). This has prevented a fully Bayesian statistical framework from being adopted widely in real biological studies. In this paper, we seek to address the challenge of applying the Bayesian philosophy to analyzing stochastic gene expression data by proposing an efficient computational framework for Bayesian parameter calibration and model selection for SRNs. This framework combines novel multifidelity formulations of the massively parallel ST-MCMC sampler with surrogate models of the CME. Numerical tests demonstrate that this combined approach leads to significant savings in comparison to a state-of-the-art method that uses solely the high-fidelity models. Further, we also propose a new criterion, based on information theory, for tuning model fidelity within multifidelity SMC-type methods, which compares favorably to effective-sample-size-based techniques.

The research reported here may potentially lead to fruitful future directions. With respect to surrogate models, the approach proposed here for the efficient solution of the surrogate master equations is only one among various alternatives that have been proposed over the years since the introduction of the FSP algorithm [10]. Another attractive option for constructing multifidelity models is to utilize a low-rank tensor format such as the quantized tensor train that has been proposed for the forward solution of the CME [69, 70, 71, 72, 73, 74]. It is also possible to exploit bounds on the log-likelihood function as done in Fox et al. [55].

The improved efficiency may lead to more widespread adoption of the Bayesian approach in answering biological questions. We refer to Catanach et al. [16] for an example of a Bayesian approach to studying the phenomenon of context dependence in synthetic gene circuits using Bayesian model selection, which required significant computational resources.

There are also many directions for improving Multifidelity ST-MCMC in general. We expect that the estimation of the information gain criterion could be significantly improved. One possibility is using a more advanced sampling scheme that leverages model evaluations from across the multifidelity hierarchy. This could further reduce the number of full-model evaluations needed at each level. Another avenue of research is designing Multifidelity ST-MCMC specifically for estimating a quantity of interest to a given accuracy, as is done in Multilevel MCMC. If we have such a design objective, ST-MCMC may not need to progress through the full model hierarchy or all annealing levels in order to provide enough information to estimate the quantity of interest to the desired accuracy. By further reducing the computational cost of Bayesian methods like ST-MCMC, engineers and scientists will be better able to integrate uncertainty quantification into their workflows. Therefore, as high performance computing resources become increasingly accessible, we expect the Multifidelity ST-MCMC framework to provide a useful tool for researchers who are interested in model calibration and uncertainty propagation for complex models.

Acknowledgement

We thank James Werner and Daniel Kalb for kindly sharing with us the data from their smFISH experiment. The cited work [41] was performed, in part, at the Center for Integrated Nanotechnologies, an Office of Science User Facility operated for the U.S. Department of Energy (DOE) Office of Science by Los Alamos National Laboratory (Contract 89233218CNA000001) and Sandia National Laboratories (Contract DE-NA-0003525). The work presented here was also funded in part by the Department of Energy Office of Advanced Scientific Computing Research through the John von Neumann Fellowship. Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA-0003525. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government. SAND2019-15382 J.

We also thank Ania-Ariadna Baetica for providing constructive comments on the manuscript and Jed Duersch for discussions regarding information theory.

References

  • [1] Elowitz, M.B. and Leibler, S., A synthetic oscillatory network of transcriptional regulators., Nature, 403(6767):335–338, 2000.
  • [2] Munsky, B., Trinh, B., and Khammash, M., Listening to the noise: random fluctuations reveal gene network parameters., Mol. Syst. Biol., 5(318):318, 2009.
  • [3] Neuert, G., Munsky, B., Tan, R.Z., Teytelman, L., Khammash, M., and Van Oudenaarden, A., Systematic identification of signal-activated stochastic gene regulation, Science, 339(6119):584–587, 2013.
  • [4] Singh, A. and Dennehy, J.J., Stochastic holin expression can account for lysis time variation in the bacteriophage λ\lambda, J. R. Soc. Interface, 11(95), 2014.
  • [5] Van Boxtel, C., Van Heerden, J.H., Nordholt, N., Schmidt, P., and Bruggeman, F.J., Taking chances and making mistakes: Non-genetic phenotypic heterogeneity and its consequences for surviving in dynamic environments, J. R. Soc. Interface, 14(132), 2017.
  • [6] Raj, A., van den Bogaard, P., Rifkin, S.A., van Oudenaarden, A., and Tyagi, S., Imaging individual mRNA molecules using multiple singly labeled probes, Nat. Methods, 5:877, sep 2008.
  • [7] Munsky, B., Fox, Z., and Neuert, G., Integrating single-molecule experiments and discrete stochastic models to understand heterogeneous gene transcription dynamics, Methods, 85:12–21, sep 2015.
  • [8] Li, G. and Neuert, G., Multiplex RNA single molecule FISH of inducible mRNAs in single yeast cells, Sci. data, 6(1):94, jun 2019.
  • [9] Gillespie, D.T., A rigorous derivation of the chemical master equation, Phys. A Stat. Mech. its Appl., 188(1-3):404–425, sep 1992.
  • [10] Munsky, B. and Khammash, M., The finite state projection algorithm for the solution of the chemical master equation, J. Chem. Phys., 124(4):44104, 2006.
  • [11] Gómez-Schiavon, M., Chen, L.F., West, A.E., and Buchler, N.E., BayFish: Bayesian inference of transcription dynamics from population snapshots of single-molecule RNA FISH in single cells, Genome Biol., 18(1):164, 2017.
  • [12] Schnoerr, D., Sanguinetti, G., and Grima, R., Approximation and inference methods for stochastic biochemical kinetics—a tutorial review, J. Phys. A, 50(9):093001, jan 2017.
  • [13] Tiberi, S., Walsh, M., Cavallaro, M., Hebenstreit, D., and Finkenstädt, B., Bayesian inference on stochastic gene transcription from flow cytometry data, In Bioinformatics, Vol. 34, pp. i647–i655, 2018.
  • [14] Weber, L., Raymond, W., and Munsky, B., Identification of gene regulation models from single-cell data, Phys. Biol., 15(5):055001, may 2018.
  • [15] Lin, Y.T. and Buchler, N.E., Exact and efficient hybrid monte carlo algorithm for accelerated bayesian inference of gene expression models from snapshots of single-cell transcripts, J. Chem. Phys., 151(2):024106, 2019.
  • [16] Catanach, T.A., McCardell, R., Baetica, A.A., and Murray, R.M., Context dependence of biological circuits, bioRxiv, 2018.
  • [17] Lindley, D.V., On a Measure of the Information Provided by an Experiment, Ann. Math. Stat., 27(4):986–1005, 2007.
  • [18] Huan, X. and Marzouk, Y.M., Simulation-based optimal Bayesian experimental design for nonlinear systems, J. Comput. Phys., 232(1):288–317, 2013.
  • [19] Ruess, J., Milias-Argeitis, A., and Lygeros, J., Designing experiments to understand the variability in biochemical reaction networks., J. R. Soc. Interface, 10(88):20130588, 2013.
  • [20] Fox, Z.R. and Munsky, B., The finite state projection based Fisher information matrix approach to estimate information and optimize single-cell experiments, PLOS Comput. Biol., 15(1):e1006365, jan 2019.
  • [21] Busetto, A.G., Hauser, A., Krummenacher, G., Sunnåker, M., Dimopoulos, S., Ong, C.S., Stelling, J., and Buhmann, J.M., Near-optimal experimental design for model selection in systems biology, Bioinformatics, 29(20):2625–2632, 07 2013.
  • [22] Vo, H.D., Fox, Z., Baetica, A., and Munsky, B., Bayesian Estimation for Stochastic Gene Expression Using Multifidelity Models, J. Phys. Chem. B, 123(10):2217–2234, mar 2019.
  • [23] Wu, Q., Smith-Miles, K., and Tian, T., Approximate Bayesian computation schemes for parameter inference of discrete stochastic models using simulated likelihood density, BMC Bioinformatics, 15(12), nov 2014.
  • [24] Prescott, T.P. and Baker, R.E., Multifidelity Approximate Bayesian Computation, 2018.
  • [25] Warne, D.J., Baker, R.E., and Simpson, M.J., Rapid bayesian inference for expensive stochastic models, 2019.
  • [26] Rosenthal, J.S., Parallel computing and monte carlo algorithms, Far east journal of theoretical statistics, 4(2):207–236, 2000.
  • [27] Liu, J.S., Liang, F., and Wong, W.H., The multiple-try method and local optimization in metropolis sampling, Journal of the American Statistical Association, 95(449):121–134, 2000.
  • [28] Ching, J. and Chen, Y.C., Transitional markov chain monte carlo method for bayesian model updating, model class selection, and model averaging, Journal of Engineering Mechanics, 133(7):816–832, 2007.
  • [29] Wu, S., Angelikopoulos, P., Papadimitriou, C., and Koumoutsakos, P., Bayesian Annealed Sequential Importance Sampling: An Unbiased Version of Transitional Markov Chain Monte Carlo, ASCE-ASME J Risk and Uncert in Engrg Sys Part B Mech Engrg, 4(1), 09 2017, 011008.
  • [30] Moral, P.D., Doucet, A., and Jasra, A., Sequential Monte Carlo samplers, J. R. Stat. Soc. B, 68:411–436, 2006.
  • [31] Kantas, N., Beskos, A., and Jasra, A., Sequential monte carlo methods for high-dimensional inverse problems: A case study for the navier–stokes equations, SIAM/ASA Journal on Uncertainty Quantification, 2(1):464–489, 2014.
  • [32] Catanach, T.A. and Beck, J.L. Bayesian updating and uncertainty quantification using sequential tempered mcmc with the rank-one modified metropolis algorithm, 2018.
  • [33] Andrés Christen, J. and Fox, C., Markov chain Monte Carlo using an approximation, J. Comput. Graph. Stat., 14(4):795–810, 2005.
  • [34] Efendiev, Y., Hou, T., and Luo, W., Preconditioning markov chain monte carlo simulations using coarse-scale models, SIAM Journal on Scientific Computing, 28(2):776–803, 2006.
  • [35] Cui, T., Marzouk, Y.M., and Willcox, K.E., Data-driven model reduction for the Bayesian solution of inverse problems, Int. J. Numer. Methods Eng., 102(5):966–990, 2015.
  • [36] Koutsourelakis, P.S., A multi-resolution, non-parametric, Bayesian framework for identification of spatially-varying model parameters, J. Comput. Phys., 228(17):6184–6211, sep 2009.
  • [37] Dodwell, T.J., Ketelsen, C., Scheichl, R., and Teckentrup, A.L., A hierarchical multilevel markov chain monte carlo algorithm with applications to uncertainty quantification in subsurface flow, SIAM/ASA Journal on Uncertainty Quantification, 3(1):1075–1108, 2015.
  • [38] Latz, J., Papaioannou, I., and Ullmann, E., Multilevel Sequential² Monte Carlo for Bayesian inverse problems, J. Comput. Phys., 368:154–178, 2018.
  • [39] Golightly, A., Henderson, D.A., and Sherlock, C., Delayed acceptance particle MCMC for exact inference in stochastic kinetic models, Stat. Comput., 25(5):1039–1055, 2015.
  • [40] Sherlock, C., Golightly, A., and Henderson, D.A., Adaptive, Delayed-Acceptance MCMC for Targets With Expensive Likelihoods, J. Comput. Graph. Stat., 26(2):434–444, 2017.
  • [41] Kalb, D.M., Adikari, S.H., Hong-Geller, E., and Werner, J.H., Single-cell correlations of mRNA and protein content in a human monocytic cell line after LPS stimulation, PLOS ONE, 14(4):1–16, 2019.
  • [42] Munsky, B. and Khammash, M., Transient analysis of stochastic switches and trajectories with applications to gene regulatory networks, IET Systems Biology, 2(5):323–333, Sep. 2008.
  • [43] Gauckler, L. and Yserentant, H., Regularity and approximability of the solutions to the chemical master equation, ESAIM. Math. Model. Numer. Anal., 48:1757–1775, 2014.
  • [44] Gupta, A., Briat, C., and Khammash, M., A Scalable Computational Framework for Establishing Long-Term Behavior of Stochastic Reaction Networks, PLoS Comput. Biol., 10(6), 2014.
  • [45] Munsky, B. and Khammash, M., A multiple time interval finite state projection algorithm for the solution to the chemical master equation, J. Comput. Phys., 226(1):818–835, sep 2007.
  • [46] Kuntz, J., Thomas, P., Stan, G.B., and Barahona, M., The exit time finite state projection scheme: Bounding exit distributions and occupation measures of continuous-time markov chains, SIAM J. Sci. Comput., 41(2):A748–A769, 2019.
  • [47] Femino, A.M., Fay, F.S., Fogarty, K., and Singer, R.H., Visualization of single RNA transcripts in situ, Science, 280(5363):585–590, apr 1998.
  • [48] Jaynes, E.T., Probability theory: The logic of science, Cambridge university press, 2003.
  • [49] Duersch, J.A. and Catanach, T.A. Generalizing information to the evolution of rational belief, 2019.
  • [50] Calderhead, B. and Girolami, M., Estimating bayes factors via thermodynamic integration and population mcmc, Computational Statistics & Data Analysis, 53(12):4028 – 4045, 2009.
  • [51] Burrage, K., Hegland, M., Macnamara, S., and Sidje, R.B., A Krylov-based finite state projection algorithm for solving the chemical master equation arising in the discrete modelling of biological systems, In: Langeville, A. and Stewart, W.J. (Eds.), Proc. 150th Markov Anniv. Meet., pp. 1–18, Charleston, SC, USA, 2006. Boson books.
  • [52] Wolf, V., Goel, R., Mateescu, M., and Henzinger, T.A., Solving the chemical master equation using sliding windows., BMC Syst. Biol., 4:42, 2010.
  • [53] Sidje, R.B. and Vo, H.D., Solving the chemical master equation by a fast adaptive finite state projection based on the stochastic simulation algorithm, Math. Biosci., 269:10–16, 2015.
  • [54] Cao, Y., Terebus, A., and Liang, J., State Space Truncation with Quantified Errors for Accurate Solutions to Discrete Chemical Master Equation, Bull. Math. Biol., 78(4):617–661, 2016.
  • [55] Fox, Z., Neuert, G., and Munsky, B., Finite state projection based bounds to compare chemical master equation models using single-cell data, J. Chem. Phys., 145(7):074101, 2016.
  • [56] Vo, H.D. and Munsky, B., High performance solver for the chemical master equation, in preparation, 2019.
  • [57] Bader, B.W. and Kolda, T.G., Efficient matlab computations with sparse and factored tensors, SIAM Journal on Scientific Computing, 30(1):205–231, 2008.
  • [58] Devine, K., Boman, E., Heaphy, R., Hendrickson, B., and Vaughan, C., Zoltan data management services for parallel dynamic applications, Comput. Sci. Eng., 4(2):90–97, 2002.
  • [59] Boman, E.G., Catalyurek, U.V., Chevalier, C., and Devine, K.D., The Zoltan and Isorropia parallel toolkits for combinatorial scientific computing: Partitioning, ordering, and coloring, Sci. Prog., 20(2):129–150, 2012.
  • [60] Balay, S., Gropp, W.D., McInnes, L.C., and Smith, B.F., Efficient management of parallelism in object oriented numerical software libraries, In: Arge, E., Bruaset, A.M., and Langtangen, H.P. (Eds.), Modern Software Tools in Scientific Computing, pp. 163–202. Birkhäuser Press, 1997.
  • [61] Balay, S., Abhyankar, S., Adams, M.F., Brown, J., Brune, P., Buschelman, K., Dalcin, L., Eijkhout, V., Gropp, W.D., Kaushik, D., Knepley, M.G., McInnes, L.C., Rupp, K., Smith, B.F., Zampini, S., and Zhang, H., PETSc users manual, Tech. Rep. ANL-95/11 - Revision 3.6, Argonne National Laboratory, 2015.
  • [62] Balay, S., Abhyankar, S., Adams, M.F., Brown, J., Brune, P., Buschelman, K., Dalcin, L., Eijkhout, V., Gropp, W.D., Kaushik, D., Knepley, M.G., McInnes, L.C., Rupp, K., Smith, B.F., Zampini, S., and Zhang, H. PETSc Web page. http://www.mcs.anl.gov/petsc, 2015.
  • [63] Koskela, A., Approximating the matrix exponential of an advection-diffusion operator using the incomplete orthogonalization method, Lect. Notes Comput. Sci. Eng., 103:345–353, 2015.
  • [64] Vo, H.D. and Sidje, R.B., Approximating the large sparse matrix exponential using incomplete orthogonalization and Krylov subspaces of variable dimension, Numer. Linear Algebr. with Appl., 24(3), 2017.
  • [65] Gaudreault, S., Rainwater, G., and Tokman, M., Kiops: A fast adaptive krylov subspace solver for exponential integrators, J. Comput. Phys., 372:236 – 255, 2018.
  • [66] Rang, J. and Angermann, L., New rosenbrock w-methods of order 3 for partial differential algebraic equations of index 1, BIT Numer. Math., 45(4):761–787, Dec 2005.
  • [67] Abhyankar, S., Brown, J., Constantinescu, E.M., Ghosh, D., Smith, B.F., and Zhang, H., PETSc/TS: A Modern Scalable ODE/DAE Solver Library, jun 2018.
  • [68] Munsky, B., Li, G., Fox, Z.R., Shepherd, D.P., and Neuert, G., Distribution shapes govern the discovery of predictive models for gene regulation, Proc. Natl. Acad. Sci., 115(29):7533–7538, jul 2018.
  • [69] Kazeev, V., Khammash, M., Nip, M., and Schwab, C., Direct Solution of the Chemical Master Equation using Quantized Tensor Trains, PLoS Comput Biol, 10(3), 2014.
  • [70] Kazeev, V. and Schwab, C., Tensor Approximation of Stationary Distributions of Chemical Reaction Networks, SIAM J. Matrix Anal. Appl., 36(3):1221–1247, 2015.
  • [71] Dolgov, S.V. and Savostyanov, D., Alternating minimal energy methods for linear systems in higher dimensions, SIAM J. Sci. Comput., 36(5):A2248–A2271, 2014.
  • [72] Dolgov, S. and Khoromskij, B., Simultaneous state-time approximation of the chemical master equation using tensor product formats, Numerical Linear Algebra with Applications, 22(2):197–219, 2015.
  • [73] Vo, H.D. and Sidje, R.B., An adaptive solution to the chemical master equation using tensors, J. Chem. Phys., 147(147), 2017.
  • [74] Dolgov, S.V., A Tensor Decomposition Algorithm for Large ODEs with Conservation Laws, Comput. Methods Appl. Math., 19(1):23, 2018.