This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Using early rejection Markov chain Monte Carlo and Gaussian processes to accelerate ABC methods

Xuefei Cao1, Shijia Wang2∗ and Yongdao Zhou1
1School of Statistics and Data Science, Nankai University, China
2Institute of Mathematical Sciences, ShanghaiTech University, China
Address correspondence to: Dr. Shijia Wang ([email protected]) and Dr. Yongdao Zhou ([email protected]).
Abstract

Approximate Bayesian computation (ABC) is a class of Bayesian inference algorithms that targets for problems with intractable or unavailable likelihood function. It uses synthetic data drawn from the simulation model to approximate the posterior distribution. However, ABC is computationally intensive for complex models in which simulating synthetic data is very expensive. In this article, we propose an early rejection Markov chain Monte Carlo (ejMCMC) sampler based on Gaussian processes to accelerate inference speed. We early reject samples in the first stage of the kernel using a discrepancy model, in which the discrepancy between the simulated and observed data is modeled by Gaussian process (GP). Hence, the synthetic data is generated only if the parameter space is worth exploring. We demonstrate from theory, simulation experiments, and real data analysis that the new algorithm significantly improves inference efficiency compared to existing early-rejection MCMC algorithms. In addition, we employ our proposed method within an ABC sequential Monte Carlo (SMC) sampler. In our numerical experiments, we use examples of ordinary differential equations, stochastic differential equations, and delay differential equations to demonstrate the effectiveness of the proposed algorithm. We develop an R package that is available at https://github.com/caofff/ejMCMC.


Keywords: Approximate Bayesian computation, Gaussian process, Markov chain Monte Carlo, early rejection, Sequence Monte Carlo.

1 Introduction

In Bayesian statistics, we aim to infer the posterior distribution of the unknown parameter, which requires the evaluation of the likelihood function of data given a parameter value. However, in some complex statistical models (e.g. stochastic differential equations), it is computationally very expensive or even not possible to derive an analytical formula of the likelihood function, but we are able to simulate synthetic data from the statistical model. Approximate Bayesian computation (ABC) is a class of Bayesian computation methods that avoids the computation of likelihood function and is suitable for this context. In ABC, we simulate synthetic data from the statistical model given a simulated parameter, and this parameter value will be retained if the simulated data sets are close enough to the observed data sets.

Approximate Bayesian computation (Pritchard et al., 1999; Beaumont et al., 2002; Sisson et al., 2018) is originally introduced in the population genetics context. The simplest version of ABC is a rejection sampler. In practice, rejection sampling is inefficient, especially when there is a large difference between the prior distribution and the posterior distribution. Marjoram et al. (2003) introduced an ABC Markov chain Monte Carlo (MCMC) algorithm to approximate the posterior distribution. Wegmann et al. (2009) improved the performance of ABC-MCMC by relaxing the tolerance within MCMC while conducting subsampling and regression adjustment to the MCMC output. Clarté et al. (2020) explored a Gibbs type ABC algorithm that component-wisely targets the corresponding conditional posterior distributions. Sequential Monte Carlo (SMC) methods (Doucet et al., 2001; Del Moral et al., 2006) have become good alternatives to MCMC for complex model inference. Sisson et al. (2007); Del Moral et al. (2012) employed ABC-MCMC algorithms within the SMC framework of Del Moral et al. (2006). Buchholz and Chopin (2019) improved ABC algorithm based on quasi-Monte Carlo sequences, and the resulting ABC estimates achieves a lower variance compared with Monte Carlo ABC. Frazier et al. (2022); Price et al. (2017) implemented Bayesian synthetic likelihood to conduct inference for complex models with intractable likelihood function. Frazier et al. (2018) studied the posterior concentration property of the approximate Bayesian computation methods, and the asymptotic distribution of the posterior mean.

The choice of summary statistics affects the inference efficiency and results of ABC. In general, we measure the discrepancy between two data sets by using the distance between their low-dimensional summary statistics. However, constructing low-dimensional summary statistics can be challenging due to lack of expert knowledge. Sisson and Fan (2010); Marin et al. (2012); Blum et al. (2013) provide a detailed summary of the selection of summary statistics in ABC. Joyce and Marjoram (2008); Nunes and Balding (2010) select the best subset from a pre-selected set of candidate summary statistics based on various information criteria. Pudlo et al. (2016) conduct selection of summary statistics via random forests. Fearnhead and Prangle (2012) construct summary statistics in a semi-automatic fashion using regression. Jiang et al. (2017); Åkesson et al. (2021) learn summary statistics automatically via neural networks.

For expensive ABC simulators, various methods have been proposed to increase the sample-efficiency of ABC. Wilkinson (2014) used Gaussian process (GP) to model uncertainty in likelihood to exclude regions with negligible posterior probabilities. Järvenpää et al. (2018, 2019, 2020) modelled the discrepancy between synthetic and observed data using GPs, and estimated the ABC posterior based on the fitted GP without further simulation (GP-ABC). Picchini (2014) proposed an early rejection ABC-MCMC algorithm for SDE parameter inference, in which the acceptance ratio of MH algorithm is divided into two parts. Everitt and Rowińska (2021) used delayed acceptance (DA) MCMC (Christen and Fox, 2005) in ABC framework, and used cheap simulators to build a more efficient proposal distribution.

In this article, we propose a new early rejection MCMC (ejMCMC) method to speed up inference for simulating expensive models, based on the seminal work of Picchini (2014) and Järvenpää et al. (2018, 2019, 2020). In the first step of our algorithm, we rule out the parameter space that is not worth exploring by running a Metropolis Hasting (MH) step with a predicted discrepancy value. Compared to alternative machine learning techniques, Gaussian processes furnish not just point predictions, but also encompass uncertainty estimates and exhibit efficacy with limited data sets. Hence, we choose a Gaussian process model to evaluate the discrepancy in this article. Consequently, our proposed method can early reject samples in cases the algorithm of Picchini (2014) is inefficient (e.g. uniform priors, symmetric proposals). Then, we simulate data from the model only if it is highly possible to be accepted. The posterior estimated by GP-ABC approaches (Järvenpää et al., 2018, 2019, 2020) rely entirely on the fitted surrogate model. In practice, there exist cases that the discrepancy for parameters cannot be well modelled by GPs. For example, if the shape of discrepancy as a function of parameters is multi-modal, the Gaussian process model may not be able to accurately capture these peaks. The approximate Bayesian posterior obtained by the GP surrogate model could be inaccurate. We show this phenomenon in Section 4.1. The proposed ejMCMC method effectively combines MCMC and GP discrepancy model, and hence can correct the inaccurate posterior estimated by GP-ABC approaches. The efficiency of DA-ABC approach (Everitt and Rowińska, 2021) relies on a cheap simulator used in proposing stage. It is challenging to build an effective but cheap simulator when there is a lack of expert knowledge. Our ejMCMC instead does not require as much prior knowledge. Our numerical experiments indicate that the new method can achieve about 40%40\% acceleration compared with Picchini (2014), and the estimated posterior is more accurate than Picchini (2014) with fixed computational budgets. We show that the ejMCMC based on GP discrepancy satisfies the detailed balance condition, and the posterior concentration property. The proposed algorithm is theoretically more efficient than the existing one. We also propose an efficient GP discrepancy function to balance the computational efficiency and ratio of false rejection samples. The ejMCMC can be designed as an efficient forward kernel within ABC-ASMC framework (Del Moral et al., 2012), and the resulting algorithm (ejASMC) is more efficient.

The rest of article is organized as follows. In Section 2, we show some background information of approximate Bayesian computation. In Section 3, we introduce our proposed algorithms. In sections 4 and 5, we use a real data analysis and simulation studies to show the effectiveness of our method. Section 6 gives the conclusion, all proofs of the theoretical results are deferred to the Appendix.

2 Approximate Bayesian computation methods

In this section, we provide a review of ABC algorithms relevant for this article.

2.1 Basics of Approximate Bayesian computation

Let 𝜽\boldsymbol{\theta} be the parameter of interest, which belongs to a parameter space Θp\Theta\subset\mathbb{R}^{p}, and let π(𝜽)\pi(\boldsymbol{\theta}) be a prior density for 𝜽\boldsymbol{\theta}. The likelihood function, denoted as p(𝜽)p(\cdot\mid\boldsymbol{\theta}), is costly to compute, while it is possible to simulate data from the model with a given parameter 𝜽\boldsymbol{\theta}. Consider a given observed data set 𝒚=(𝒚(1),𝒚(2),,𝒚(T))𝒴Td×T\boldsymbol{y}=(\boldsymbol{y}_{(1)},\boldsymbol{y}_{(2)},\ldots,\boldsymbol{y}_{(T)})^{\top}\in\mathcal{Y}^{T}\subset\mathbb{R}^{d\times T}, where dd denotes the dimension of data and TT denotes the number of repeated measurements. The likelihood-free inference is achieved by augmenting the target posterior from π(𝜽𝒚)p(𝒚𝜽)π(𝜽)\pi(\boldsymbol{\theta}\mid\boldsymbol{y})\propto p(\boldsymbol{y}\mid\boldsymbol{\theta})\pi(\boldsymbol{\theta}) to

πε(𝜽,𝒙𝒚)=π(𝜽)p(𝒙𝜽)Kε(Δ(𝒙,𝒚))π(𝜽)p(𝒙𝜽)Kε(Δ(𝒙,𝒚))𝑑𝒙𝑑𝜽,\pi_{\varepsilon}(\boldsymbol{\theta},\boldsymbol{x}\mid\boldsymbol{y})=\frac{\pi(\boldsymbol{\theta})p(\boldsymbol{x}\mid\boldsymbol{\theta})K_{\varepsilon}(\Delta(\boldsymbol{x},\boldsymbol{y}))}{\iint\pi(\boldsymbol{\theta})p(\boldsymbol{x}\mid\boldsymbol{\theta})K_{\varepsilon}(\Delta(\boldsymbol{x},\boldsymbol{y}))d\boldsymbol{x}d\boldsymbol{\theta}},

where the auxiliary parameter 𝒙\boldsymbol{x} is a (simulated) data set from p(𝜽)p(\cdot\mid\boldsymbol{\theta}), Δ:𝒴T×𝒴T+\Delta:\mathcal{Y}^{T}\times\mathcal{Y}^{T}\rightarrow\mathbb{R}^{+} is a discrepancy function that measures the difference between the synthetic data set 𝒙\boldsymbol{x} and the observed data set 𝒚\boldsymbol{y}. The function Kε()K_{\varepsilon}(\cdot) is defined as Kε()=K(/ε)K(0)K_{\varepsilon}(\cdot)=\frac{K(\cdot/\varepsilon)}{K(0)}, where K()K(\cdot) is a centered smoothing kernel density. In this article, we mainly focus on a kernel function K(z)K(z) that takes its maximum value at zero and is zero for z>1z>1. For example, the uniform, Epanechnikov, tricube, triangle, quartic (biweight), triweight and quadratic kernels (Epanechnikov, 1969; Cleveland and Devlin, 1988; Altman, 1992). The parameter ε>0\varepsilon>0 serves as a tolerance level, and it weights the posterior πε(𝜽,𝒙𝒚)\pi_{\varepsilon}(\boldsymbol{\theta},\boldsymbol{x}\mid\boldsymbol{y}) higher in regions where 𝒙\boldsymbol{x} and 𝒚\boldsymbol{y} are similar.

The goal of approximate Bayesian computation is to estimate the marginal posterior distribution of 𝜽\boldsymbol{\theta}, denoted as πε(𝜽𝒚)\pi_{\varepsilon}(\boldsymbol{\theta}\mid\boldsymbol{y}), by integrating out the auxiliary data set 𝒙\boldsymbol{x} from the modified posterior distribution. This is given by the following equation:

πε(𝜽𝒚)=π(𝜽)p(𝒙𝜽)Kε(Δ(𝒙,𝒚))𝑑𝒙π(𝜽)p(𝒙𝜽)Kε(Δ(𝒙,𝒚))𝑑𝒙𝑑𝜽.\pi_{\varepsilon}(\boldsymbol{\theta}\mid\boldsymbol{y})=\frac{\pi(\boldsymbol{\theta})\int p(\boldsymbol{x}\mid\boldsymbol{\theta})K_{\varepsilon}(\Delta(\boldsymbol{x},\boldsymbol{y}))d\boldsymbol{x}}{\iint\pi(\boldsymbol{\theta})p(\boldsymbol{x}\mid\boldsymbol{\theta})K_{\varepsilon}(\Delta(\boldsymbol{x},\boldsymbol{y}))d\boldsymbol{x}d\boldsymbol{\theta}}. (1)

The ABC posterior provides an approximation of the true posterior distribution when the tolerance level ε\varepsilon approaches to zero.

In practice, we set Δ(𝒙,𝒚)=ρ(𝑺(𝒙),𝑺(𝒚))\Delta(\boldsymbol{x},\boldsymbol{y})=\rho(\boldsymbol{S}(\boldsymbol{x}),\boldsymbol{S}(\boldsymbol{y})) for some distance metric ρ(,)\rho(\cdot,\cdot) and a chosen vector of summary statistics 𝑺()\boldsymbol{S}(\cdot): 𝒴Tds\mathcal{Y}^{T}\rightarrow{\mathbb{R}^{d_{s}}} with some small dsd_{s}. For example, given an observed data set with TT independent and identically distributed data points, we can take 𝑺(𝒚)=i=1T𝒚i/T\boldsymbol{S}(\boldsymbol{y})=\sum_{i=1}^{T}\boldsymbol{y}_{i}/T and ρ(,)=2\rho(\cdot,\cdot)=\left\|\cdot\right\|_{2}. If the vector of summary statistic is sufficient for the parameter 𝜽\boldsymbol{\theta}, then comparing the summary statistics of two data sets will be equivalent to comparing the data sets themselves (Brooks et al., 2011). In this case, the ABC posterior converges to the conditional distribution of 𝜽\boldsymbol{\theta} given 𝒚\boldsymbol{y} as ε0\varepsilon\rightarrow 0; otherwise, the ABC posterior converges to the conditional distribution of 𝜽\boldsymbol{\theta} given 𝑺(𝒚)\boldsymbol{S}(\boldsymbol{y}), that is, π(𝜽𝑺(𝒚))\pi(\boldsymbol{\theta}\mid\boldsymbol{S}(\boldsymbol{y})). For convenience of description, we still denote πε(𝜽𝑺(𝒚))\pi_{\varepsilon}(\boldsymbol{\theta}\mid\boldsymbol{S}(\boldsymbol{y})) as πε(𝜽𝒚)\pi_{\varepsilon}(\boldsymbol{\theta}\mid\boldsymbol{y}).

2.2 ABC Markov chain Monte Carlo

Marjoram et al. (2003) proposed an ABC Markov chain Monte Carlo (MCMC) method based on the Metropolis–Hastings MCMC algorithm. The ABC-MCMC constructs a Markov chain that admits the target probability distribution as stationarity under mild regularity conditions. At (n+1)(n+1)-th MCMC iteration, we repeat the following procedure.

  1. (i)

    Generate a candidate value 𝜽q(𝜽n)\boldsymbol{\theta}^{*}\sim q(\cdot\mid\boldsymbol{\theta}_{n}).

  2. (ii)

    Generate a synthetic data set 𝒙=(𝒙(1),,𝒙(T))T\boldsymbol{x}^{*}=(\boldsymbol{x}_{(1)}^{*},\cdots,\boldsymbol{x}_{(T)}^{*})^{T} from the likelihood, p(𝜽)p(\cdot\mid\boldsymbol{\boldsymbol{\theta}^{*}}).

  3. (iii)

    Accept the proposed parameter value (𝜽n+1,𝒙n+1)=(𝜽,𝒙)(\boldsymbol{\theta}_{n+1},\boldsymbol{x}_{n+1})=(\boldsymbol{\theta}^{*},\boldsymbol{x}^{*}) with probability

    α[(𝜽n,𝒙n),(𝜽,𝒙)]=min{1,π(𝜽)q(𝜽n𝜽)π(𝜽n)q(𝜽𝜽n)Kε(Δ(𝒙,𝒚))Kε(Δ(𝒙n,𝒚))};\alpha[(\boldsymbol{\theta}_{n},\boldsymbol{x}_{n}),(\boldsymbol{\theta}^{*},\boldsymbol{x}^{*})]=\operatorname{min}\left\{1,\frac{\pi(\boldsymbol{\theta}^{*})q(\boldsymbol{\theta}_{n}\mid\boldsymbol{\theta}^{*})}{\pi(\boldsymbol{\theta}_{n})q(\boldsymbol{\theta}^{*}\mid\boldsymbol{\theta}_{n})}\frac{K_{\varepsilon}(\Delta(\boldsymbol{x}^{*},\boldsymbol{y}))}{K_{\varepsilon}(\Delta(\boldsymbol{x}_{n},\boldsymbol{y}))}\right\}; (2)

    Otherwise, set (𝜽n+1,𝒙n+1)=(𝜽n,𝒙n)(\boldsymbol{\theta}_{n+1},\boldsymbol{x}_{n+1})=(\boldsymbol{\theta}_{n},\boldsymbol{x}_{n}).

The MCMC algorithm targets the joint posterior distribution πε(𝜽,𝒙𝒚)\pi_{\varepsilon}(\boldsymbol{\theta},\boldsymbol{x}\mid\boldsymbol{y}), with the marginal distribution of interest being the posterior distribution πε(𝜽𝒚)\pi_{\varepsilon}(\boldsymbol{\theta}\mid\boldsymbol{y}). The sampler generates a Markov chain sequence {(𝜽n,𝒙n)}\{(\boldsymbol{\theta}_{n},\boldsymbol{x}_{n})\} for n0n\geq 0, where 𝒙n\boldsymbol{x}_{n} represents the synthetic data set, and 𝜽n\boldsymbol{\theta}_{n} represents the parameter value.

For complex statistical models that are expensive to simulate, we aim to speed up inference by reducing the number of simulations. Based on a uniform kernel Kε(Δ(𝒙,𝒚))=𝕀(Δ(𝒙,𝒚)ε)K_{\varepsilon}(\Delta(\boldsymbol{x},\boldsymbol{y}))=\mathbb{I}(\Delta(\boldsymbol{x},\boldsymbol{y})\leq\varepsilon), Picchini (2014) proposed an early-rejection MCMC algorithm denoted by OejMCMC. Here, we generalize the uniform kernel to a more general kernel function. By the definition of the kernel function in Eq.(1), the acceptance probability α[(𝜽n,𝒙n),(𝜽,𝒙)]\alpha[(\boldsymbol{\theta}_{n},\boldsymbol{x}_{n}),(\boldsymbol{\theta}^{*},\boldsymbol{x}^{*})] (Eq. (2)) is always not greater than

α˘[(𝜽n,𝒙n),𝜽]=π(𝜽)q(𝜽n𝜽)/{π(𝜽n)q(𝜽𝜽n)Kε(Δ(𝒙n,𝒚))}.\breve{\alpha}[(\boldsymbol{\theta}_{n},\boldsymbol{x}_{n}),\boldsymbol{\theta}^{*}]=\pi(\boldsymbol{\theta}^{*})q(\boldsymbol{\theta}_{n}\mid\boldsymbol{\theta}^{*})/\{\pi(\boldsymbol{\theta}_{n})q(\boldsymbol{\theta}^{*}\mid\boldsymbol{\theta}_{n})K_{\varepsilon}(\Delta(\boldsymbol{x}_{n},\boldsymbol{y}))\}. (3)

Then we can reject some candidate parameters before simulations by generating w𝒰(0,1)w\sim\mathcal{U}(0,1), and comparing ww with α˘[(𝜽n,𝒙n),𝜽]\breve{\alpha}[(\boldsymbol{\theta}_{n},\boldsymbol{x}_{n}),\boldsymbol{\theta}^{*}]. Specifically, If w>α˘[(𝜽n,𝒙n),𝜽]w>\breve{\alpha}[(\boldsymbol{\theta}_{n},\boldsymbol{x}_{n}),\boldsymbol{\theta}^{*}], we immediately reject 𝜽\boldsymbol{\theta}^{*} before simulating synthetic data with 𝜽\boldsymbol{\theta}^{*}, otherwise, simulate synthetic data and determine whether to receive candidate parameters according to Metropolis-Hastings acceptance probability.

In many cases, we may take a weak prior distribution or take a uniform distribution as prior, and use a symmetric proposal distribution, that is, for all 𝜽,𝜽Θ\boldsymbol{\theta},\boldsymbol{\theta}^{*}\in\Theta, π(𝜽)π(𝜽)\pi(\boldsymbol{\theta})\approx\pi(\boldsymbol{\theta}^{*}) and q(𝜽𝜽)=q(𝜽𝜽)q(\boldsymbol{\theta}^{*}\mid\boldsymbol{\theta})=q(\boldsymbol{\theta}\mid\boldsymbol{\theta}^{*}), P(w>α˘[(𝜽,𝒙),𝜽])0P(w>\breve{\alpha}[(\boldsymbol{\theta},\boldsymbol{x}),\boldsymbol{\theta}^{*}])\approx 0. Hence, parameters are rarely rejected early, the method basically degenerates to the original ABC-MCMC. In Section 3, we will propose a new early rejection method based on discrepancy models.

2.3 ABC Sequential Monte Carlo

Sisson et al. (2007); Del Moral et al. (2012) proposed ABC-SMC methods within the SMC framework of Del Moral et al. (2006). SMC is a class of importance sampling based methods for generating weighted particles from a target distribution π\pi by approximately simulating from a sequence of related probability distribution {πr}r=0:R\{\pi_{r}\}_{r=0:R}, where the selected initial distribution π0\pi_{0} is easy to approximate (e.g. the prior distribution), and the final distribution πR=π\pi_{R}=\pi is the target distribution. The particles are moved from rr-th target to (r+1)(r+1)-th target, by using a Markov kernel.

In ABC-SMC, we select a sequence of intermediate target distributions {πr=πεr(𝜽𝒚)}r=0:R\{\pi_{r}=\pi_{\varepsilon_{r}}(\boldsymbol{\theta}\mid\boldsymbol{y})\}_{r=0:R} with a descending sequence of tolerance parameters {εr}r=0:R\{\varepsilon_{r}\}_{r=0:R}. The initial distribution has a large tolerance and is easy to approximate, the final distribution has a small tolerance and is hence accurate. Let 𝜽r(i)\boldsymbol{\theta}_{r}^{(i)} denote the ii-th particle after iteration rr, let Δr(i)\Delta_{r}^{(i)} denote the corresponding discrepancy and let Wr(i)W_{r}^{(i)} denote the normalized weight. The set of weighted particles {𝜽r(i),Wr(i)}i=1:N\{\boldsymbol{\theta}_{r}^{(i)},W_{r}^{(i)}\}_{i=1:N} represents an empirical approximation of πεr(𝜽𝒚)\pi_{\varepsilon_{r}}(\boldsymbol{\theta}\mid\boldsymbol{y}). At each iteration r+1r+1, we conduct propagation, weighting and resampling to approximate πεr+1(𝜽𝒚)\pi_{\varepsilon_{r+1}}(\boldsymbol{\theta}\mid\boldsymbol{y}). We refer reader to Appendix Section B.1 for more details of these three steps.

3 An early rejection method based on discrepancy models

To reduce the computational cost of generating synthetic data sets in complex statistical models, we propose an early rejection method based on a discrepancy model. In Section 2.2, we discussed how the early rejection acceptance probability α˘[(𝜽n,𝒙n),𝜽]\breve{\alpha}[(\boldsymbol{\theta}_{n},\boldsymbol{x}_{n}),\boldsymbol{\theta}^{*}] proposed by Picchini (2014) can be inefficient in certain cases. In this section, we introduce a more efficient early rejection ABC method that utilizes a discrepancy model to speed up inference. We refer to our method as ejMCMC to distinguish it from the early rejection MCMC proposed in Picchini (2014).

3.1 Early rejection MCMC based on a discrepancy model

The main computational cost of ABC-MCMC comes from evaluating the acceptance ratio (Eq. 2), which requires simulating synthetic data sets. In this article, we propose to construct a pseudo MH acceptance probability

α^[(𝜽,𝒙),(𝜽,𝒙)]=min{1,π(𝜽)q(𝜽𝜽)π(𝜽)q(𝜽𝜽)min{Kε(Δ(𝒙,𝒚)),Kε(h(𝜽))}min{Kε(Δ(𝒙,𝒚)),Kε(h(𝜽))}}\hat{\alpha}[(\boldsymbol{\theta},\boldsymbol{x}),(\boldsymbol{\theta}^{*},\boldsymbol{x}^{*})]=\min\left\{1,\frac{\pi(\boldsymbol{\theta}^{*})q(\boldsymbol{\theta}\mid\boldsymbol{\theta}^{*})}{\pi(\boldsymbol{\theta})q(\boldsymbol{\theta}^{*}\mid\boldsymbol{\theta})}\frac{\min\{K_{\varepsilon}(\Delta(\boldsymbol{x}^{*},\boldsymbol{y})),K_{\varepsilon}(h(\boldsymbol{\theta}^{*}))\}}{\min\{K_{\varepsilon}(\Delta(\boldsymbol{x},\boldsymbol{y})),K_{\varepsilon}(h(\boldsymbol{\theta}))\}}\right\} (4)

based on a discrepancy function hh. Note that the pseudo MH acceptance probability α^[(𝜽,𝒙),(𝜽,𝒙)]\hat{\alpha}[(\boldsymbol{\theta},\boldsymbol{x}),(\boldsymbol{\theta}^{*},\boldsymbol{x}^{*})] is not greater than

α~[(𝜽,𝒙),𝜽]=π(𝜽)q(𝜽𝜽)π(𝜽)q(𝜽𝜽)Kε(h(𝜽))min{Kε(Δ(𝒙,𝒚)),Kε(h(𝜽))},\tilde{\alpha}[(\boldsymbol{\theta},\boldsymbol{x}),\boldsymbol{\theta}^{*}]=\frac{\pi(\boldsymbol{\theta}^{*})q(\boldsymbol{\theta}\mid\boldsymbol{\theta}^{*})}{\pi(\boldsymbol{\theta})q(\boldsymbol{\theta}^{*}\mid\boldsymbol{\theta})}\frac{K_{\varepsilon}(h(\boldsymbol{\theta}^{*}))}{\min\{K_{\varepsilon}(\Delta(\boldsymbol{x},\boldsymbol{y})),K_{\varepsilon}(h(\boldsymbol{\theta}))\}}, (5)

which does not depend on Δ(𝒙,𝒚)\Delta(\boldsymbol{x}^{*},\boldsymbol{y}). This allows us to reject some candidate parameters based on α~[(𝜽,𝒙),𝜽]\tilde{\alpha}[(\boldsymbol{\theta},\boldsymbol{x}),\boldsymbol{\theta}^{*}] before simulating the synthetic data. In the first stage of ejMCMC, we use α~[(𝜽,𝒙),𝜽]\tilde{\alpha}[(\boldsymbol{\theta},\boldsymbol{x}),\boldsymbol{\theta}^{*}] to early reject samples. Since α~[(𝜽,𝒙),𝜽]α˘[(𝜽n,𝒙n),𝜽]\tilde{\alpha}[(\boldsymbol{\theta},\boldsymbol{x}),\boldsymbol{\theta}^{*}]\leq\breve{\alpha}[(\boldsymbol{\theta}_{n},\boldsymbol{x}_{n}),\boldsymbol{\theta}^{*}], ejMCMC’s early rejection rate is always not lower than OejMCMC. Our ejMCMC algorithm degenerates to OejMCMC in case that h(𝜽)0h(\boldsymbol{\theta})\equiv 0. The proposed early rejection MCMC algorithm is shown in Algorithm 1.

If h(𝜽)h(\boldsymbol{\theta}) satisfies that h(𝜽)Δ(𝒙,𝒚)h(\boldsymbol{\theta})\leq\Delta(\boldsymbol{x},\boldsymbol{y}) in the ABC posterior region with positive support, the pseudo MH acceptance probability (Eq.(4)) is equal to the Metropolis-Hastings acceptance probability α[(𝜽,𝒙),(𝜽,𝒙)]{\alpha}[(\boldsymbol{\theta},\boldsymbol{x}),(\boldsymbol{\theta}^{*},\boldsymbol{x}^{*})] (Eq.(2)). However, the function h(𝜽)h(\boldsymbol{\theta}) cannot strictly satisfy h(𝜽)Δ(𝒙,𝒚)h(\boldsymbol{\theta})\leq\Delta(\boldsymbol{x},\boldsymbol{y}) in practice. Consequently, we will false reject samples in region {𝜽:h(𝜽)>Δ(𝒙,𝒚)}\{\boldsymbol{\theta}:h(\boldsymbol{\theta})>\Delta(\boldsymbol{x},\boldsymbol{y})\}. Hence, the ejMCMC algorithm targets a modified posterior distribution based on a discrepancy function h(𝜽)h(\boldsymbol{\theta}), which may not be equal to the original posterior distribution πε(𝜽,𝒙𝒚)\pi_{\varepsilon}(\boldsymbol{\theta},\boldsymbol{x}\mid\boldsymbol{y}).

As we discussed above, the function h(𝜽)h(\boldsymbol{\theta}) is a trade-off between speed and accuracy. In this article, we use a Gaussian process to model the functional relationship between the discrepancy and the parameters, and h(𝜽)h(\boldsymbol{\theta}) is aa (for example, a=0.05a=0.05) quantile prediction function of the deviation. More details of the selection and training of h(𝜽)h(\boldsymbol{\theta}) will be described in Section 3.4.

Algorithm 1 Early rejection MCMC algorithm based on a discrepancy model

Input: (a) Total number of MCMC iterations NN, ABC threshold ε\varepsilon; (b) observed data 𝒚\boldsymbol{y}; (c) prior distribution π()\pi(\cdot), likelihood function p()p(\cdot\mid\cdot), proposal distribution q()q(\cdot\mid\cdot) and prediction function h()h(\cdot).

1:  Initialize 𝜽0\boldsymbol{\theta}_{0}.
2:  for n=0,,N1n=0,\ldots,N-1 do
3:     Set (𝜽n+1,𝒙n+1)=(𝜽n,𝒙n)(\boldsymbol{\theta}_{n+1},\boldsymbol{x}_{n+1})=(\boldsymbol{\theta}_{n},\boldsymbol{x}_{n}).
4:     Generate a candidate value 𝜽q(𝜽n)\boldsymbol{\theta}^{*}\sim q(\cdot\mid\boldsymbol{\theta}_{n}) from the proposal distribution.
5:     Generate ww from U[0,1]U[0,1].
6:     if w<π(𝜽)q(𝜽n𝜽)π(𝜽n)q(𝜽𝜽n)min{Kε(Δ(𝒙n,𝒚)),Kε(h(𝜽n))}w<\frac{\pi(\boldsymbol{\theta}^{*})q(\boldsymbol{\theta}_{n}\mid\boldsymbol{\theta}^{*})}{\pi(\boldsymbol{\theta}_{n})q(\boldsymbol{\theta}^{*}\mid\boldsymbol{\theta}_{n})\min\{K_{\varepsilon}(\Delta(\boldsymbol{x}_{n},\boldsymbol{y})),K_{\varepsilon}(h(\boldsymbol{\theta}_{n}))\}} then
7:        Compute h(𝜽)h(\boldsymbol{\theta}^{*}).
8:        if w<α~[(𝜽n,𝒙n),𝜽]w<\tilde{\alpha}[(\boldsymbol{\theta}_{n},\boldsymbol{x}_{n}),\boldsymbol{\theta}^{*}] then
9:           generate 𝒙\boldsymbol{x}^{*} from p(𝜽)p(\cdot\mid\boldsymbol{\theta}^{*}).
10:           if w<α^[(𝜽n,𝒙n),(𝜽,𝒙)]w<\hat{\alpha}[(\boldsymbol{\theta}_{n},\boldsymbol{x}_{n}),(\boldsymbol{\theta}^{*},\boldsymbol{x}^{*})] then
11:              (𝜽n+1,𝒙n+1)=(𝜽,𝒙)(\boldsymbol{\theta}_{n+1},\boldsymbol{x}_{n+1})=(\boldsymbol{\theta}^{*},\boldsymbol{x}^{*}).
12:           end if
13:        end if
14:     end if
15:  end for
16:  return  θ1:N\theta_{1:N}

3.2 Properties of ejMCMC algorithm

Let Θ:={𝜽:h(𝜽)<ε}\Theta^{\prime}:=\{\boldsymbol{\theta}:h(\boldsymbol{\theta})<\varepsilon\} and Θ1:={𝜽:πε(𝜽𝒚)>0}\Theta_{1}:=\{\boldsymbol{\theta}:\pi_{\varepsilon}(\boldsymbol{\theta}\mid\boldsymbol{y})>0\}. Here Θ1\Theta_{1} denotes the ABC posterior region with positive support. The following proposition shows that under mild assumptions, the ejMCMC algorithm satisfies the detailed balance condition.

Assumption 1.

ΘΘ1\Theta^{\prime}\cap\Theta_{1} is not a measure zero set.

This is a basic assumption for ejMCMC algorithm, without which the algorithm will collapse. All conclusions about the ejMCMC algorithm are based on this assumption.

Proposition 1.

If Assumption 1 is satisfied, the early rejection MCMC based on a discrepancy model satisfies the detailed balance condition, that is,

π^ε(𝜽,𝒙𝒚)q(𝜽𝜽)p(𝒙𝜽)α^[(𝜽,𝒙),(𝜽,𝒙)]=π^ε(𝜽,𝒙𝒚)q(𝜽𝜽)p(𝒙𝜽)α^[(𝜽,𝒙),(𝜽,𝒙)],\hat{\pi}_{\varepsilon}(\boldsymbol{\theta},\boldsymbol{x}\mid\boldsymbol{y})q(\boldsymbol{\theta}^{*}\mid\boldsymbol{\theta})p(\boldsymbol{x}^{*}\mid\boldsymbol{\theta}^{*})\hat{\alpha}[(\boldsymbol{\theta},\boldsymbol{x}),(\boldsymbol{\theta}^{*},\boldsymbol{x}^{*})]=\hat{\pi}_{\varepsilon}(\boldsymbol{\theta}^{*},\boldsymbol{x}^{*}\mid\boldsymbol{y})q(\boldsymbol{\theta}\mid\boldsymbol{\theta}^{*})p(\boldsymbol{x}\mid\boldsymbol{\theta})\hat{\alpha}[(\boldsymbol{\theta}^{*},\boldsymbol{x}^{*}),(\boldsymbol{\theta},\boldsymbol{x})],

where

π^ε(𝜽,𝒙𝒚)=π(𝜽)p(𝒙𝜽)min{Kε(Δ(𝒙,𝒚)),Kε(h(𝜽))}π(𝜽)p(𝒙𝜽)min{Kε(Δ(𝒙,𝒚)),Kε(h(𝜽))}𝑑𝒙𝑑𝜽\hat{\pi}_{\varepsilon}(\boldsymbol{\theta},\boldsymbol{x}\mid\boldsymbol{y})=\frac{\pi(\boldsymbol{\theta})p(\boldsymbol{x}\mid\boldsymbol{\theta})\min\{K_{\varepsilon}(\Delta(\boldsymbol{x},\boldsymbol{y})),K_{\varepsilon}(h(\boldsymbol{\theta}))\}}{\iint\pi(\boldsymbol{\theta})p(\boldsymbol{x}\mid\boldsymbol{\theta})\min\{K_{\varepsilon}(\Delta(\boldsymbol{x},\boldsymbol{y})),K_{\varepsilon}(h(\boldsymbol{\theta}))\}d\boldsymbol{x}d\boldsymbol{\theta}}

is the invariant distribution of Markov chain. The marginal posterior distribution of 𝛉\boldsymbol{\theta} obtained from the ejMCMC algorithm is

π^ε(𝜽𝒚)=π(𝜽)p(𝒙𝜽)min{Kε(Δ(𝒙,𝒚)),Kε(h(𝜽))}𝑑𝒙π(𝜽)p(𝒙𝜽)min{Kε(Δ(𝒙,𝒚)),Kε(h(𝜽))}𝑑𝒙𝑑𝜽.\hat{\pi}_{\varepsilon}(\boldsymbol{\theta}\mid\boldsymbol{y})=\frac{\pi(\boldsymbol{\theta})\int p(\boldsymbol{x}\mid\boldsymbol{\theta})\min\{K_{\varepsilon}(\Delta(\boldsymbol{x},\boldsymbol{y})),K_{\varepsilon}(h(\boldsymbol{\theta}))\}d\boldsymbol{x}}{\iint\pi(\boldsymbol{\theta})p(\boldsymbol{x}\mid\boldsymbol{\theta})\min\{K_{\varepsilon}(\Delta(\boldsymbol{x},\boldsymbol{y})),K_{\varepsilon}(h(\boldsymbol{\theta}))\}d\boldsymbol{x}d\boldsymbol{\theta}}. (6)

Proposition 1 demonstrates that using the ejMCMC move does not violate the detailed balance condition of the Markov process. Moreover, Proposition 2 shows that the relationship between the posterior of the ejMCMC and that of the original ABC-MCMC.

Proposition 2.
  1. (i)

    The discrepancy function h(𝜽)h(\boldsymbol{\theta}) satisfies that

    sup𝜽Θ1/𝒩Θ1p(𝒙𝜽)𝕀(Δ(𝒙,𝒚)h(𝜽))𝑑𝒙a,\sup_{\boldsymbol{\theta}\in{\Theta_{1}/\mathcal{N}_{\Theta_{1}}}}\int p(\boldsymbol{x}\mid\boldsymbol{\theta})\mathbb{I}(\Delta(\boldsymbol{x},\boldsymbol{y})\leq h(\boldsymbol{\theta}))d\boldsymbol{x}\leq a,

    where 𝒩Θ1\mathcal{N}_{\Theta_{1}} is a Lebesgue zero-measure subset of Θ1\Theta_{1} and a[0,1)a\in[0,1) is a constant. The target posterior distribution of ejMCMC approaches to that of ABC-MCMC in terms of L1L_{1} distance when a0a\to 0, where the L1L_{1}-distance between g()g(\cdot) and f()f(\cdot) is defined as

    DL1(g,f)=|g(𝜽)f(𝜽)|𝑑𝜽.D_{L_{1}}(g,f)=\int|g(\boldsymbol{\theta})-f(\boldsymbol{\theta})|d\boldsymbol{\theta}. (7)
  2. (ii)

    If K()K(\cdot) is a uniform kernel, the posterior distribution of ejMCMC is

    π^ε(𝜽𝒚){π(𝜽)Δ(𝒙,𝒚)<εp(𝒙𝜽)𝑑𝒙𝜽Θ,0𝜽Θ/Θ.\hat{\pi}_{\varepsilon}(\boldsymbol{\theta}\mid\boldsymbol{y})\propto\left\{\begin{array}[]{lcl}\pi(\boldsymbol{\theta})\int_{\Delta(\boldsymbol{x},\boldsymbol{y})<\varepsilon}p(\boldsymbol{x}\mid\boldsymbol{\theta})d\boldsymbol{x}&&\boldsymbol{\theta}\in\Theta^{\prime},\\ 0&&\boldsymbol{\theta}\in\Theta/\Theta^{\prime}.\end{array}\right.

    It is equal to ABC-MCMC posterior if Θ1/Θ\Theta_{1}/\Theta^{\prime} is a measure zero set.

Proposition 2 shows that the resulting Markov chain still converges to the true approximate Bayesian posterior under some conditions of h()h(\cdot). This implies that the ejMCMC algorithm provides a promising framework to Bayesian posterior approximation, without compromising on accuracy or convergence guarantees under some mild conditions. In some other cases, there may exist some mismatch between the posterior of ejMCMC and ABC-MCMC. In this article, we utilize the L1L_{1}-distance shown in Eq. (7) to quantify the dissimilarity between two density functions.

Next we will prove the posterior concentration property of ejMCMC algorithm. Frazier et al. (2018) proved the posterior concentration property of the ABC posterior with the uniform kernel, that is, for any δ>0\delta>0, and for some ε>0\varepsilon>0,

Πε{d(𝜽,𝜽0)>δ𝒚}=d(𝜽,𝜽0)>δπε(𝜽𝒚)𝑑𝜽=oP(1),\Pi_{\varepsilon}\{d(\boldsymbol{\theta},\boldsymbol{\theta}_{0})>\delta\mid\boldsymbol{y}\}=\int_{d(\boldsymbol{\theta},\boldsymbol{\theta}_{0})>\delta}\pi_{\varepsilon}(\boldsymbol{\theta}\mid\boldsymbol{y})d\boldsymbol{\theta}=o_{P}(1),

where d(,)d(\cdot,\cdot) is a metric on Θ\Theta. Theorem 1 shows that the posterior concentration of the algorithm still holds, with some technical assumptions of function h()h(\cdot). This property is important, because for any AΘA\subset\Theta, ABC posterior probability Πε{A𝒚}\Pi_{\varepsilon}\{A\mid\boldsymbol{y}\} will be different from the exact posterior probability. Without the guarantees of exact posterior inference, knowing that Πε{𝒚}\Pi_{\varepsilon}\{\cdot\mid\boldsymbol{y}\} will concentrate on the true parameter 𝜽0\boldsymbol{\theta}_{0} can be used as an effective way to express our uncertainty about 𝜽\boldsymbol{\theta}. The posterior concentration is related to the rate at which ε\varepsilon goes to 0 and the rate at which the observed summaries S(𝒚)S(\boldsymbol{y}) and the simulated summaries S(𝒙)S(\boldsymbol{x}) converge to well defined limit counterparts b(𝜽0)b(\boldsymbol{\theta}_{0}) and b(𝜽)b(\boldsymbol{\theta}). Here we consider S(𝒙)b(𝜽)S(\boldsymbol{x})\rightarrow b(\boldsymbol{\theta}) as TT\rightarrow\infty, where 𝒙=(𝒙(1),,𝒙(T))\boldsymbol{x}=(\boldsymbol{x}_{(1)},\ldots,\boldsymbol{x}_{(T)})^{\top} and 𝒙(t)p(𝜽)\boldsymbol{x}_{(t)}\sim p(\cdot\mid\boldsymbol{\theta}) for t=1,,Tt=1,\ldots,T, and consider ε\varepsilon as a TT-dependent sequence satisfying εT0\varepsilon_{T}\rightarrow 0 as TT\rightarrow\infty. Here, Π()\Pi(\cdot) is a probability measure with prior density π()\pi(\cdot).

Assumption 2.

For all ε\varepsilon, there exists a constant 0γ1<10\leq\gamma_{1}<1 satisfying that

Π[{𝜽:h(𝜽)ε and ρ(b(𝜽),b(𝜽0))ε/3}]γ1Π[ρ(b(𝜽),b(𝜽0))ε/3].\Pi\left[\left\{\boldsymbol{\theta}:h(\boldsymbol{\theta})\geq\varepsilon\text{ and }\rho(b(\boldsymbol{\theta}),b(\boldsymbol{\theta}_{0}))\leq{\varepsilon/3}\right\}\right]\leq\gamma_{1}\Pi\left[\rho(b(\boldsymbol{\theta}),b(\boldsymbol{\theta}_{0}))\leq{\varepsilon/3}\right].

Assumption 3.

For all ε\varepsilon, there exists a constant 0γ2<10\leq\gamma_{2}<1 satisfying that

Θp(𝒙𝜽)𝕀(ρ(S(𝒙),S(𝒚))<h(𝜽))𝑑𝒙𝑑Π(𝜽)γ2Θp(𝒙𝜽)𝕀(ρ(S(𝒙),S(𝒚))<ε)𝑑𝒙𝑑Π(𝜽).\iint_{\Theta^{\prime}}p(\boldsymbol{x}\mid\boldsymbol{\theta})\mathbb{I}(\rho(S(\boldsymbol{x}),S(\boldsymbol{y}))<h(\boldsymbol{\theta}))d\boldsymbol{x}d\Pi(\boldsymbol{\theta})\leq\gamma_{2}\iint_{\Theta^{\prime}}p(\boldsymbol{x}\mid\boldsymbol{\theta})\mathbb{I}(\rho(S(\boldsymbol{x}),S(\boldsymbol{y}))<\varepsilon)d\boldsymbol{x}d\Pi(\boldsymbol{\theta}).

Remark. Assumption 2 controls the proportion of 𝜽\boldsymbol{\theta} satisfying h(𝜽)εh(\boldsymbol{\theta})\geq\varepsilon, that will be rejected before simulating, in a neighbourhood of 𝜽0\boldsymbol{\theta}_{0}. Assumption 3 is required for ABC with a more general kernel than a uniform kernel. This assumption controls the proportion of acceptable (𝜽,𝒙)(\boldsymbol{\theta},\boldsymbol{x}) that may be rejected early. Intuitively, Assumptions 2 and 3 about h(𝜽)h(\boldsymbol{\theta}) are to ensure that a small proportion of 𝜽\boldsymbol{\theta} near 𝜽0\boldsymbol{\theta}_{0} will be rejected before simulating. As detailed in Section 3.4.1, our proposed discrepancy model h(𝜽)h(\boldsymbol{\theta}) can always ensure that only a small proportion of 𝜽\boldsymbol{\theta} near the true value are early rejected by adjusting the quantile prediction function. Without these assumptions about the early rejection stage, posterior concentration concentration about 𝜽0\boldsymbol{\theta}_{0} cannot occur.

Theorem 1.

If Assumptions 1-3 of Frazier et al. (2018) and Assumptions 1-3 stated in this article are satisfied, the posterior concentration of the ejMCMC targeting Equation (6) still holds. For uniform kernel cases, Assumption 3 is not required for this property.

We have previously discussed the quality of the ABC estimates. Now, we turn to investigate the efficiency of the algorithm.

Proposition 3.

When MCMC algorithms reach the stationary distributions, the early rejection rates of ejMCMC algorithm and OejMCMC are

Rej=1min{1,α~[(𝜽,𝒙),𝜽]}q(𝜽𝜽)π^ε(𝜽,𝒙𝒚)𝑑𝒙𝑑𝜽𝑑𝜽R_{ej}=1-\iint\min\{1,\tilde{\alpha}[(\boldsymbol{\theta},\boldsymbol{x}),\boldsymbol{\theta}^{*}]\}q(\boldsymbol{\theta}^{*}\mid\boldsymbol{\theta})\hat{\pi}_{\varepsilon}(\boldsymbol{\theta},\boldsymbol{x}\mid\boldsymbol{y})d\boldsymbol{x}d\boldsymbol{\theta}^{*}d\boldsymbol{\theta}

and

ROej=1min{1,α˘[(𝜽,𝒙),𝜽]}q(𝜽𝜽)πε(𝜽,𝒙𝒚)𝑑𝒙𝑑𝜽𝑑𝜽,R_{Oej}=1-\iint\min\{1,\breve{\alpha}[(\boldsymbol{\theta},\boldsymbol{x}),\boldsymbol{\theta}^{*}]\}q(\boldsymbol{\theta}^{*}\mid\boldsymbol{\theta}){\pi}_{\varepsilon}(\boldsymbol{\theta},\boldsymbol{x}\mid\boldsymbol{y})d\boldsymbol{x}d\boldsymbol{\theta}^{*}d\boldsymbol{\theta},

respectively. For uniform kernel cases, if Θ1Θ\Theta_{1}-\Theta^{\prime} is a set of measure zero on parameter space Θ\Theta,

RejROejΘΘ1Θ/Θmin{1,π(𝜽)q(𝜽𝜽)π(𝜽)q(𝜽𝜽)}q(𝜽𝜽)πε(𝜽𝒚)𝑑𝜽𝑑𝜽,R_{ej}-R_{Oej}\geq\int_{\Theta^{\prime}\cap\Theta_{1}}\int_{\Theta/\Theta^{\prime}}\min\{1,\frac{\pi(\boldsymbol{\theta}^{*})q(\boldsymbol{\theta}\mid\boldsymbol{\theta}^{*})}{\pi(\boldsymbol{\theta})q(\boldsymbol{\theta}^{*}\mid\boldsymbol{\theta})}\}q(\boldsymbol{\theta}^{*}\mid\boldsymbol{\theta})\pi_{\varepsilon}(\boldsymbol{\theta}\mid\boldsymbol{y})d\boldsymbol{\theta}^{*}d\boldsymbol{\theta}, (8)

where Θ={𝛉:h(𝛉)<ε}\Theta^{\prime}=\{\boldsymbol{\theta}:h(\boldsymbol{\theta})<\varepsilon\} and Θ1={𝛉:πε(𝛉𝐲)>0}\Theta_{1}=\{\boldsymbol{\theta}:\pi_{\varepsilon}(\boldsymbol{\theta}\mid\boldsymbol{y})>0\}.

Proposition 3 shows that when the kernel function K()K(\cdot) is uniform and the posterior of ejMCMC and OejMCMC are consistent, the efficiency of ejMCMC is expected to be higher since the term RejROejR_{ej}-R_{Oej} in Equation (8) is always positive. This is because ejMCMC targets the posterior region more effectively by rejecting more candidates in the early rejection step, potentially reducing the computational cost of generating synthetic data sets. Furthermore, when the function h(𝜽)h(\boldsymbol{\theta}) satisfies h(𝜽)>εh(\boldsymbol{\theta})>\varepsilon in all regions with negligible posterior probabilities, the efficiency of ejMCMC is maximized. This means that the discrepancy model h(𝜽)h(\boldsymbol{\theta}) is able to identify high probability ABC posterior regions, leading to higher early rejection rates and improved efficiency of the algorithm. The ejMCMC algorithm can play a role in correcting prior information before simulations in some sense, when the prior information is poor. In this article, we use

Eff=# of early rejected parameters# of rejected parameters\text{Eff}=\frac{\text{\# of early rejected parameters}}{\text{\# of rejected parameters}} (9)

to quantify the early rejection efficiency of algorithms. This metric can be seen as a regularization of the early rejection rate, ensuring that the efficiency value is bounded between 0 and 1, with 1 being the best possible efficiency where all rejected parameters are early rejected. Proposition 4 shows the effect of the function h()h(\cdot) on the early rejection efficiency of the ejMCMC algorithm.

Proposition 4.

For all 𝛉Θ\boldsymbol{\theta}\in\Theta, larger probability of Δ(𝐱,𝐲)h(𝛉)\Delta(\boldsymbol{x},\boldsymbol{y})\leq h(\boldsymbol{\theta}) can lead to higher early rejection efficiency of ejMCMC.

Combining Propositions 2 and 4, the probability of Δ(𝒙,𝒚)h(𝜽)\Delta(\boldsymbol{x},\boldsymbol{y})\leq h(\boldsymbol{\theta}) keeps a balance between efficiency and accuracy.

3.3 An early rejection adaptive SMC

ABC-MCMC algorithm has several drawbacks. Firstly, the convergence speed of MCMC algorithm is often slow, especially in high-dimensional spaces. This may require a large number of iterations to obtain accurate approximations. Secondly, the MCMC algorithm is sensitive to the choice of initial values. Improper initial values can cause the Markov chain to get stuck in local optima, preventing effective exploration of the entire state space. In comparison with ABC-MCMC, ABC-SMC employs a sequential updating of particle weights to approximate the target distribution, which effectively mitigates the risk of getting stuck in local optima. Furthermore, the parallelization in SMC algorithms enhance their efficiency in complex applications. In this section, we introduce an early rejection ASMC (ejASMC) based on ABC adaptive SMC (ABC-ASMC) (Del Moral et al., 2012) and the ejMCMC method. The sequence of thresholds and MCMC kernels can be pre-specified before running the algorithm, but the algorithm may collapse with improperly selected thresholds and kernels. In this article, we adaptively tune the sequence of thresholds and ejMCMC proposal distributions.

Adaptive selection of thresholds: We adapt the scheme proposed in Del Moral et al. (2012) for selecting the sequence of thresholds, based on the key remark that the evaluation of weights Wr+1(i)=Wr(i)Kεr+1(Δ(𝒚,𝒙𝜽r(i))/Kεr(Δ(𝒚,𝒙𝜽r(i)))W_{r+1}^{(i)}={W}_{r}^{(i)}K_{\varepsilon_{r+1}}(\Delta(\boldsymbol{y},\boldsymbol{x}_{\boldsymbol{\theta}_{r}^{(i)}})/K_{\varepsilon_{r}}(\Delta(\boldsymbol{y},\boldsymbol{x}_{\boldsymbol{\theta}_{r}^{(i)}})) does not depend on particles of iteration r+1r+1, {𝜽r+1(i),𝒙r+1(i)}i=1:N\{\boldsymbol{\theta}_{r+1}^{(i)},\boldsymbol{x}_{r+1}^{(i)}\}_{i=1:N}. We select the threshold by controlling the proportion of unique alive particles, which is defined as

PA({Wr+1(i)},εr+1)=iU𝕀(Wr+1(i)>0)N,\operatorname{PA}(\{W_{r+1}^{(i)}\},\varepsilon_{r+1})=\frac{\sum_{i\in U}\mathbb{I}(W_{r+1}^{(i)}>0)}{N},

where Wr+1(i)W_{r+1}^{(i)} is a function of the threshold εr+1\varepsilon_{r+1}, and UU is the largest subset of the set {1:N}\{1:N\} such that for any two distinct elements ii and jj in UU, 𝜽r+1(i)𝜽r+1(j)\boldsymbol{\theta}_{r+1}^{(i)}\neq\boldsymbol{\theta}_{r+1}^{(j)}. The proportion of unique alive particles is also intuitively a sensible measure of ‘quality’ of our SMC approximation. The threshold εr+1\varepsilon_{r+1} is selected to make sure that the proportion of alive particles is equal to a given percentage of the particles number

PA({Wr+1(i)},εr+1)=γN\operatorname{PA}(\{W_{r+1}^{(i)}\},\varepsilon_{r+1})=\gamma N (10)

for γ(0,1)\gamma\in(0,1). In practice, bisection is used to compute the root of Equation (10). The parameter γ\gamma is a “quality” index for the resulting SMC approximation of the target. If γ1\gamma\rightarrow 1 then we will move slowly towards the target but the SMC approximation is more accurate. However, if γ0\gamma\rightarrow 0 then we can move very quickly towards the target but the resulting SMC approximation will be unreliable.

Adaptive MCMC proposal distribution: For each intermediate target, the SMC algorithm applies an MCMC move with invariant density πεr+1(𝜽𝒚)\pi_{\varepsilon_{r+1}}(\boldsymbol{\theta}\mid\boldsymbol{y}). MCMC moves are implemented by accepting candidate parameters with a certain probability given by the MH ratio of ejMCMC algorithm. For 𝜽r(i)\boldsymbol{\theta}_{r}^{(i)} with wr+1(i)>0w_{r+1}^{(i)}>0, we generate 𝜽,𝒙\boldsymbol{\theta}^{*},\boldsymbol{x}^{*} according to a proposal qr+1(𝜽𝜽r(i))p(𝒙𝜽)q_{r+1}(\boldsymbol{\theta}^{*}\mid\boldsymbol{\theta}_{r}^{(i)})p(\boldsymbol{x}^{*}\mid\boldsymbol{\theta}^{*}), then accept the candidate with probability α^[(𝜽r(i),𝒙r(i)),(𝜽,𝒙)]\hat{\alpha}[(\boldsymbol{\theta}_{r}^{(i)},\boldsymbol{x}_{r}^{(i)}),(\boldsymbol{\theta}^{*},\boldsymbol{x}^{*})]. We can adaptively determine the parameters of the proposal qr+1()q_{r+1}(\cdot\mid\cdot) based on the previous approximation of the target πεr\pi_{\varepsilon_{r}}. In this article, we use Gaussian distribution as the proposal distribution, and the Gaussian covariance matrix is determined adaptively by computing the covariance matrix of the weighted particles {𝜽r(i),Wr(i)}i=1:N\{\boldsymbol{\theta}_{r}^{(i)},W_{r}^{(i)}\}_{i=1:N}.

The output of ejASMC is the weighted particles {𝜽R(i),WR(i)}i=1:N\{\boldsymbol{\theta}_{R}^{(i)},W_{R}^{(i)}\}_{i=1:N}, the empirical distribution of which is the ABC posterior estimate. We set ε0=\varepsilon_{0}=\infty, then the initial target distribution is the prior distribution. Thus the initial weighted particles 𝜽0(i)π(𝜽)\boldsymbol{\theta}_{0}^{(i)}\sim\pi(\boldsymbol{\theta}) and W0(i)=1/NW_{0}^{(i)}=1/N for i=1,,Ni=1,\cdots,N. The detailed pseudo-code of ejASMC is shown in Appendix Section B.2.

3.4 Selection of discrepancy model

3.4.1 A Gaussian process discrepancy model h()h(\cdot)

In approximate Bayesian computation, whether a candidate parameter 𝜽\boldsymbol{\theta}^{*} will be retained as a sample of the posterior distribution is mainly determined by the discrepancy Δ(𝒙,𝒚)\Delta(\boldsymbol{x},\boldsymbol{y}) between the pseudo-data set 𝒙p(𝜽)\boldsymbol{x}\sim p(\cdot\mid\boldsymbol{\theta}^{*}) and the observed data 𝒚\boldsymbol{y}. To speed up the inference, Gutmann and Corander (2016) considered a uniform kernel in Eq.(1) and proposed to model the discrepancy Δ𝜽=Δ(𝒙𝜽,𝒚)\Delta_{\boldsymbol{\theta}}=\Delta(\boldsymbol{x}_{\boldsymbol{\theta}},\boldsymbol{y}) between the observed data 𝒚\boldsymbol{y} and the pseudo-data 𝒙𝜽\boldsymbol{x}_{\boldsymbol{\theta}} as a function of 𝜽\boldsymbol{\theta}. The estimated posterior with a uniform kernel for each 𝜽\boldsymbol{\theta} admits

πε(𝜽𝒚)π(𝜽)(Δ𝜽ε),\pi_{\varepsilon}(\boldsymbol{\theta}\mid\boldsymbol{y})\propto\pi(\boldsymbol{\theta})\mathbb{P}(\Delta_{\boldsymbol{\theta}}\leq\varepsilon), (11)

where the probability is computed using the fitted surrogate model. Furthermore, for a continuous and strictly increasing function g()g(\cdot), we have (g(Δ𝜽)g(ε))=(Δ𝜽ε)\mathbb{P}(g(\Delta_{\boldsymbol{\theta}})\leq g(\varepsilon))=\mathbb{P}(\Delta_{\boldsymbol{\theta}}\leq\varepsilon). In some cases, modeling g(Δ𝜽)g(\Delta_{\boldsymbol{\theta}}) instead of Δ𝜽\Delta_{\boldsymbol{\theta}} as a function of 𝜽\boldsymbol{\theta} is easier and more suitable. For example, as Δ(𝒚,𝒙𝜽)\Delta(\boldsymbol{y},\boldsymbol{x}_{\boldsymbol{\theta}}) is a positive value for all 𝒙𝜽\boldsymbol{x}_{\boldsymbol{\theta}}, it may be a better choice to model log-discrepancy log(Δ(𝒚,𝒙𝜽))\log(\Delta(\boldsymbol{y},\boldsymbol{x}_{\boldsymbol{\theta}})).

Gutmann and Corander (2016); Järvenpää et al. (2019, 2020) utilize GP formulations to model the discrepancy. In GP regression, we assume that Δ𝜽𝒩(f(𝜽),σ2)\Delta_{\boldsymbol{\theta}}\sim\mathcal{N}(f(\boldsymbol{\theta}),\sigma^{2}) and f(𝜽)𝒢𝒫(m(𝜽),k(𝜽,𝜽))f(\boldsymbol{\theta})\sim\mathcal{GP}(m(\boldsymbol{\theta}),k(\boldsymbol{\theta},\boldsymbol{\theta}^{\prime})) with a mean function m()m(\cdot): Θ\Theta\rightarrow{\mathbb{R}} and a covariance function k(,)k(\cdot,\cdot): Θ×Θ\Theta\times\Theta\rightarrow{\mathbb{R}}. For a training data set 𝒟s={(Δ1,𝜽1),,(Δs,𝜽s)}\mathcal{D}_{s}=\{(\Delta_{1},\boldsymbol{\theta}_{1}),\cdots,(\Delta_{s},\boldsymbol{\theta}_{s})\}, the posterior predictive density for the latent function ff at 𝜽\boldsymbol{\theta} follows a Gaussian density with the mean and variance

μf𝒟s(𝜽)=𝐤s(𝜽)T𝐊s1(𝜽)Δ1:s,vf𝒟s(𝜽)=k(𝜽,𝜽)𝐤s(𝜽)T𝐊s1(𝜽)𝐤s(𝜽)\mu_{f\mid\mathcal{D}_{s}}(\boldsymbol{\theta})=\mathbf{k}_{s}(\boldsymbol{\theta})^{T}\mathbf{K}_{s}^{-1}(\boldsymbol{\theta})\Delta_{1:s},\quad v_{f\mid\mathcal{D}_{s}}(\boldsymbol{\theta})=k(\boldsymbol{\theta},\boldsymbol{\theta})-\mathbf{k}_{s}(\boldsymbol{\theta})^{T}\mathbf{K}_{s}^{-1}(\boldsymbol{\theta})\mathbf{k}_{s}(\boldsymbol{\theta}) (12)

respectively. Here Δ1:s=(Δ1,,Δs)T\Delta_{1:s}=\left(\Delta_{1},\cdots,\Delta_{s}\right)^{T}, 𝐤s(𝜽)=(k(𝜽,𝜽1),,k(𝜽,𝜽s))T\mathbf{k}_{s}(\boldsymbol{\theta})=\left(k(\boldsymbol{\theta},\boldsymbol{\theta}_{1}),\cdots,k(\boldsymbol{\theta},\boldsymbol{\theta}_{s})\right)^{T}, and 𝐊s(𝜽)\mathbf{K}_{s}(\boldsymbol{\theta}) is a s×ss\times s matrix with [𝐊s(𝜽)]ij=k(𝜽i,𝜽j)+σ2𝕀i=j[\mathbf{K}_{s}(\boldsymbol{\theta})]_{ij}=k(\boldsymbol{\theta}_{i},\boldsymbol{\theta}_{j})+\sigma^{2}\mathbb{I}_{i=j} for i=1,si=1,\ldots s and j=1,sj=1,\ldots s. The ABC posterior at 𝜽\boldsymbol{\theta} can be computed using the fitted GP as

πε(𝜽𝒚)π(𝜽)(Δ𝜽ε)=π(𝜽)Φ((εμf𝒟s(𝜽))/vf𝒟s(𝜽)+σ2),\pi_{\varepsilon}(\boldsymbol{\theta}\mid\boldsymbol{y})\propto\pi(\boldsymbol{\theta})\mathbb{P}\left(\Delta_{\boldsymbol{\theta}}\leq\varepsilon\right)=\pi(\boldsymbol{\theta})\Phi\left(\left(\varepsilon-\mu_{f\mid\mathcal{D}_{s}}(\boldsymbol{\theta})\right)/\sqrt{v_{f\mid\mathcal{D}_{s}}(\boldsymbol{\theta})+\sigma^{2}}\right), (13)

where ε\varepsilon is the threshold and Φ\Phi is the cumulative density function of the standard Gaussian distribution.

Following Gutmann and Corander (2016) and Järvenpää et al. (2018, 2019, 2020), we use a Gaussian process model to establish the functional relationship between the discrepancy and the parameters, and propose an expression for h(𝜽)h(\boldsymbol{\theta}). We use an initial discrepancy-parameters pairs 𝒟s\mathcal{D}_{s} to train a GP model. Based on the GP discrepancy model, we take

ha(𝜽)=μf𝒟s(𝜽)Zavf𝒟s(𝜽)+σ2,h_{a}(\boldsymbol{\theta})=\mu_{f\mid\mathcal{D}_{s}}(\boldsymbol{\theta})-Z_{a}\sqrt{v_{f\mid\mathcal{D}_{s}}(\boldsymbol{\theta})+\sigma^{2}}, (14)

where (x<Za)=a\mathbb{P}(x<Z_{a})=a and xN(0,1)x\sim N(0,1). Let h(𝜽)=ha(𝜽)h(\boldsymbol{\theta})=h_{a}(\boldsymbol{\theta}), Propositions 2 and 4 indicate that the choice of aa is a trade-off between efficiency and accuracy of ejMCMC algorithm. A smaller value of aa will lead to more accurate estimation of ejMCMC algorithm but lower early rejection efficiency.

3.4.2 Selection of training data

Although the posterior estimated by an early rejection method based on the discrepancy model does not rely entirely on the fitted surrogate model like GP-ABC (Järvenpää et al., 2018), the accuracy of the discrepancy model is also critical to the results. Proposition 2 shows that we only need to ensure the accuracy of discrepancy model in approximation Bayesian posterior region, that is, the region with small discrepancy. In Gutmann and Corander (2016); Järvenpää et al. (2019), they use a GP discrepancy model to intelligently select the next parameter value 𝜽s+1\boldsymbol{\theta}_{s+1} via minimization the acquisition function

𝒜s(𝜽)=μf𝒟s(𝜽)ηsvf𝒟s(𝜽),\mathcal{A}_{s}(\boldsymbol{\theta})=\mu_{f\mid\mathcal{D}_{s}}(\boldsymbol{\theta})-\eta_{s}\sqrt{v_{f\mid\mathcal{D}_{s}}(\boldsymbol{\theta})}, (15)

where ηs\eta_{s} is a positive number related to ss, then run the computationally costly simulation model to obtain updated training data 𝒟s+1\mathcal{D}_{s+1}. The coefficient ηs\eta_{s} is a trade-off between exploration and exploitation. This method of sequentially generating training points can ensure that there are enough training data points in areas with small discrepancy to ensure the accuracy of the model in this area. But it is computationally expensive to select a new training point every time by solving an optimization problem.

In this article, we consider ensuring the accuracy of the model by placing more training points in areas with smaller discrepancy via a pilot run of ABC-SMC algorithm with OejMCMC as proposals to sequentially generate batch data points. The detailed pseudo-code is shown in Appendix Section B.3. Our algorithm does not require to solve the optimization problem, and hence has lower computational cost. The budgets for the pilot ABC-SMC run could be relatively small (e.g. with a short sequence of tolerance parameters). Moreover, when the accuracy of the discrepancy model trained by ABC-SMC samples is not enough, the algorithm allows to continue to generate training data sequentially, by changing the algorithm termination conditions.

The computational complexity of training a GP model is O(n3)O(n^{3}). Increasing sample size of training data will significantly increase the computational burden. Based on Gutmann and Corander (2016); Järvenpää et al. (2018, 2019) and our exploratory study, we suggest collecting a few thousands data points for training GP discrepancy model. However, this may not be enough for large dimensional problems. With increasing number of iterations, we can use the set of accepted parameters and corresponding discrepancy terms as new training data to update the GP model within ejMCMC (Drovandi et al., 2018). In this case, we recommend using the online-GP algorithm (Stanton et al., 2021) to train the Gaussian process model, which can use new training data to update the model without using the full data sets.

4 Simulation study

4.1 A toy example

We first illustrate the advantage of our proposed method via a toy example. Suppose that the model of interest is a mixture of two normal distributions

p(yθ)=0.5ϕ(y;θ+2,0.6)+0.5ϕ(y;θ1,0.6),p(y\mid\theta)=0.5\phi(y;\theta+2,0.6)+0.5\phi(y;\theta-1,0.6),

where ϕ(y;μ,σ2)\phi(y;\mu,\sigma^{2}) denotes the probability density function of 𝒩(μ,σ2)\mathcal{N}(\mu,\sigma^{2}) evaluated at yy. Suppose that the prior distribution is π(θ)=𝒰(6,6)\pi(\theta)=\mathcal{U}(-6,6), and the observed data associated with this model is y0=1y_{0}=1. The discrepancy we use is Δ(y,y0)=|y1|\Delta(y,y_{0})=|y-1|. We choose a uniform kernel with the threshold ε=0.6\varepsilon=0.6 for ABC algorithms. Note that we only have one observed data point in this example.

Firstly, we run rejection sampling ABC 10610^{6} iterations and treat the resulting posterior distribution as ground truth. We then compare our ejMCMC with OejMCMC and GP-ABC (Järvenpää et al., 2018). We collect 2,0002,000 training data points from prior, which are then used to train the GP discrepancy model. The resulting GP model is used to approximate the posterior of GP-ABC, and we name this as GP-ABC1. We then run GP-ABC, OejMCMC and ejMCMC with same computational budgets (N=100,000N=100,000 iterations). This GP-ABC with N=100,000N=100,000 is named as GP-ABC2. A random walk proposal q(θθn)𝒩(θn,0.32)q(\theta\mid\theta_{n})\sim\mathcal{N}(\theta_{n},0.3^{2}) is implemented for MCMC algorithms. In the implementation of ejMCMC, the discrepancy model ha(θ)h_{a}(\theta) is a GP model trained with a=0.05a=0.05. Figure 1 shows the Gaussian process modeling for identifying the relationship between the discrepancy Δ\Delta and parameter θ\theta. The x-axis denotes the parameter and the y-axis denotes the value of the discrepancy. Black dots denote the discrepancies, the red line is the mean function of GP and the area within two grey lines is the 95% predictive interval. In this example, the discrepancy for parameters is multi-modal, GP discrepancy model doesn’t fit the ‘shape’ of the discrepancy correctly. Figure 1 displays the estimated posterior distributions provided by GP-ABC1, GP-ABC2, ejMCMC, OejMCMC and the ground truth (rejection sampling). It indicates that both GP-ABC algorithms fail to capture a bimodal shape of the posterior, due to that they fail to capture the bimodal shape of discrepancy. This is also demonstrated in Järvenpää et al. (2018). Our ejMCMC is able to capture both peaks of the posterior. The L1L_{1} distance (DL1D_{L_{1}}) between the estimated posterior by ejMCMC and the corresponding true posterior is 0.0559, lower than the DL1D_{L_{1}} between OejMCMC and ground truth (0.0748). We also implement ejMCMC with online-GP (Stanton et al., 2021) discrepancy model for this toy example. It is available at https://github.com/caofff/ejMCMC.

Refer to caption
Refer to caption
Figure 1: (a) Gaussian process modeling for identifying the relationship between the discrepancy Δ\Delta and parameter θ\theta. The x-axis denotes the parameter and the y-axis denotes the value of the discrepancy. Black dots denote the discrepancies, the red line is the mean function of GP and the area between the two grey lines is the 95% predictive interval. (b) Posterior density estimated by GP-ABC1, GP-ABC2, ejMCMC, OejMCMC, and rejection sampling (ground truth).

4.2 An ordinary differential equation model

In this subsection, we investigate the performance of ABC algorithms via an ordinary differential equation (ODE) example. We use the R package deSolve (Soetaert et al., 2010) to simulate differential equations. We generate ODE trajectories according to the following system,

dx1(t)dt\displaystyle\frac{dx_{1}(t)}{dt} =\displaystyle= 7236+x2(t)θ1,\displaystyle\frac{72}{36+x_{2}(t)}-\theta_{1},
dx2(t)dt\displaystyle\frac{dx_{2}(t)}{dt} =\displaystyle= θ2x1(t)1,\displaystyle\theta_{2}x_{1}(t)-1, (16)

where θ1=2\theta_{1}=2 and θ2=1\theta_{2}=1, and the initial conditions x1(0)=7x_{1}(0)=7 and x2(0)=10x_{2}(0)=-10. The observations 𝒚i\boldsymbol{y}_{i} are simulated from a normal distribution with the mean xi(t|𝜽)x_{i}(t|\boldsymbol{\theta}) and the variance σi2\sigma_{i}^{2}, where σ1=1\sigma_{1}=1 and σ2=3\sigma_{2}=3. We generate 121121 observations for each ODE function, equally spaced within [0,60][0,60]. We use root mean squared error (RMSE) to model the discrepancy between the observed data and the pseudo-data. We take the following prior distributions π(𝜽)\pi(\boldsymbol{\theta}): θ1𝒰(1.8,2.2)\theta_{1}\sim\mathcal{U}(1.8,2.2), θ2𝒰(0.8,1.2)\theta_{2}\sim\mathcal{U}(0.8,1.2).

We conduct a performance comparison of two early rejection MCMC algorithms under different threshold levels. A smaller threshold value implies that the ABC posterior is closer to the actual posterior, and consequently, more simulation costs are required. In the ABC-MCMC algorithm, the convergence speed slows down with smaller thresholds. The 1% and 5% quantiles of the discrepancy correspond to the parameters that conform to the prior distribution are 4.03 and 4.68, respectively. Here we study the performance of ABC algorithms with ε=4.8\varepsilon=4.8, 4.54.5, 4.14.1 and 4.054.05. We observe the Markov chains mix poor when the threshold value is smaller. For each level of tolerance, we first run MCMC algorithms 1,000,0001,000,000 iterations with 3 different initializations. The Gelman-Rubin diagnostic (Gelman et al., 1995) is used to assess the convergence of the Markov chain. The resulting posterior can be regarded as ground truth. For each case, we also run both ejMCMC and OejMCMC 100,000100,000 iterations. We collect 3,0003,000 training data from prior for ejMCMC algorithm. Appendix Figure C.1 illustrates the efficacy of the Gaussian Process (GP) model’s fit and the positioning of the training data points. The discrepancy can be fitted well by the Gaussian process. The choice of aa in the GP discrepancy model is a trade off between efficiency and accuracy. For each threshold ε\varepsilon, we run the ejMCMC algorithm with a=0.5a=0.5, 0.20.2, 0.10.1, 0.050.05 and 0.010.01. Table 1 shows the results for ε=4.5\varepsilon=4.5 by varying aa (i.e. h(𝜽)h(\boldsymbol{\theta})). This demonstrates that the selection of h(𝜽)h(\boldsymbol{\theta}) keeps a balance between accuracy and computing speed. Results for the remaining three thresholds are shown in Appendix Section C.1. To balance the efficiency and accuracy, we choose a=0.05a=0.05.

Table 1: ODE: ε=4.5\varepsilon=4.5, accuracy and efficiency of ejMCMC varying aa.
aa NsimN_{\text{sim}} NaccN_{\text{acc}} DL1(𝜽1)D_{L_{1}}(\boldsymbol{\theta}_{1}) DL1(𝜽2)D_{L_{1}}(\boldsymbol{\theta}_{2}) Eff
0.50 40039 36170 0.2071 0.1922 0.9394
0.20 47550 37574 0.0541 0.0459 0.8402
0.10 51398 37572 0.0374 0.0426 0.7785
0.05 55116 38017 0.0226 0.0216 0.7241
0.01 60464 37593 0.0273 0.0194 0.6335

Table 2 shows the number of synthetic data generated (NsimN_{\text{sim}}), number of collected posterior samples (NaccN_{\text{acc}}), and the L1L_{1}-distance between true value and posterior estimates varying threshold parameter ε\varepsilon. Under the settings of uniform prior and Gaussian proposal distribution, the efficiency of OejMCM is 0. While our proposed algorithm can achieve a considerable computational acceleration by saving 20%80%20\%-80\% of data simulation. The accuracy of the posterior distribution estimates has no drop compared to OejMCMC. The efficiency of algorithms with threshold values 4.84.8 and 4.54.5 is higher than cases with that of 4.14.1 and 4.054.05. When the threshold is set to 4.1 or 4.05, the proportion of candidate parameters lying outside of the posterior region is low, resulting in a decrease of efficiency.

Table 2: Efficiency of OejMCMC and ejMCMC varying threshold parameter ε\varepsilon.
ε\varepsilon method NsimN_{\text{sim}} NaccN_{\text{acc}} DL1(𝜽1)D_{L_{1}}(\boldsymbol{\theta}_{1}) DL1(𝜽2)D_{L_{1}}(\boldsymbol{\theta}_{2}) Eff
4.8 OejMCMC 100000 39965 0.0207 0.0171 0
ejMCMC 52719 39549 0.0194 0.0216 0.7821
\hdashline4.5 OejMCMC 100000 37691 0.0244 0.0259 0
ejMCMC 55116 38017 0.0226 0.0216 0.7241
\hdashline4.1 OejMCMC 100000 37672 0.0418 0.0452 0
ejMCMC 76538 38218 0.0427 0.0360 0.3798
\hdashline4.05 OejMCMC 100000 39861 0.0445 0.0533 0
ejMCMC 85325 39897 0.0428 0.0420 0.2442
Refer to caption
Figure 2: Posterior distributions of the ODE model provided by the three ABC algorithms varying the threshold parameter ε\varepsilon. The grey dots represent the accepted MCMC samples. First column: the true posterior estimated by running reject sampling with 100,000100,000 accepted particles. Second column: posterior density estimated by OejMCMC with 100,000100,000 iterations. Third column: posterior density estimated by ejMCMC with 100,000100,000 iterations. First row: ε=4.8\varepsilon=4.8, second row: ε=4.5\varepsilon=4.5, third row: ε=4.1\varepsilon=4.1, last row: ε=4.05\varepsilon=4.05.

Figure 2 shows the posterior distributions of the ODE model provided by the three ABC algorithms varying the threshold parameter ε\varepsilon. The distributions provided by three methods are close, though ejMCMC slightly misses the tail of posteriors. This is because all candidate parameters satisfying h(𝜽)>εh(\boldsymbol{\theta})>\varepsilon in the posterior region are early rejected. However, the false rejection avoids the MCMC sampler “sticking” in some local region for long periods of time (marked by the red box in the Figure 2). We show an example of ABC-MCMC traceplots in Appendix Figure C.2.

In this study, we also employ the ejASMC algorithm to infer the posterior distribution of the parameters, with N=2048N=2048 samples generated for each iteration and the threshold parameters are updated by retaining 1024 (the scale parameter γ=0.5\gamma=0.5) active particles. The termination condition for the algorithm is set such that the number of simulated data exceeds 50,000. As shown in Figure 3, the posterior distribution becomes increasingly concentrated near the true value as the number of iterations increases. The threshold value ε\varepsilon of the final iteration is 3.843.84. ABC-MCMC algorithms achieve very low acceptance probability using this threshold. Here, the early rejection efficiency of ejASMC algorithm is 0.3820. L1L_{1} distances between the marginal distributions of parameters and exact Bayesian posterior are 1.0626 and 1.0247, which are both smaller than the results provided by ejMCMC and OejMCMC (Appendix Table C.5), due to a smaller threshold. The posterior density plot provided by ejASMC are shown in Appendix Figure C.3.

Refer to caption
Figure 3: Marginal posterior distributions estimated by the ejASMC algorithm at r=0,1,9,18,,54r=0,1,9,18,\cdots,54 iterations. The color of the posterior estimates changes from dark blue to light blue as more steps of SMC sampler are performed, decreasing the threshold ε\varepsilon. True values of parameters are indicated by vertical lines.

4.3 Stochastic kinetic networks

Consider the parameter inference of a multidimensional stochastic differential equation (SDE) model for biochemical networks (Golightly and Wilkinson, 2010). In the dynamic system, there are eight reactions (R1,R2,,R8{R_{1},R_{2},\cdots,R_{8}}) that represent a simplified model of the auto-regulation mechanism in prokaryotes, based on protein dimerization that inhibits its own transcription. The reactions are defined as follows

R1:\displaystyle R_{1}: DNA+P2DNAP2\displaystyle~{}\text{DNA}+P_{2}\rightarrow\text{DNA}\cdot P_{2} R2:\displaystyle R_{2}: DNAP2DNA+P2\displaystyle~{}\text{DNA}\cdot P2\rightarrow\text{DNA}+P_{2}
R3:\displaystyle R_{3}: DNADNA+RNA\displaystyle~{}\text{DNA}\rightarrow\text{DNA}+\text{RNA} R4:\displaystyle R_{4}: RNARNA+P\displaystyle~{}\text{RNA}\rightarrow\text{RNA}+P
R5:\displaystyle R_{5}: 2PP2\displaystyle~{}2P\rightarrow P_{2} R6:\displaystyle R_{6}: P22P\displaystyle~{}P_{2}\rightarrow 2P
R7:\displaystyle R_{7}: RNA\displaystyle~{}\text{RNA}\rightarrow\emptyset R8:\displaystyle R_{8}: P.\displaystyle~{}P\rightarrow\emptyset.

Consider the time evolution of the system as a Markov process with state 𝑿t\boldsymbol{X}_{t}, where 𝑿t=(RNA,P,P2,DNA)T\boldsymbol{X}_{t}=(\text{RNA},P,P_{2},\text{DNA})^{T} represents the number of molecules of each species at time tt (i.e. non-negative integers), and the corresponding stoichiometric matrix can be represented as

S=(00100010000122011100110011000000).S=\left(\begin{array}[]{cccccccc}0&0&1&0&0&0&-1&0\\ 0&0&0&1&-2&2&0&-1\\ -1&1&0&0&1&-1&0&0\\ -1&1&0&0&0&0&0&0\end{array}\right).

We refer readers to Section 6.2 of Picchini (2014) for a more detailed description of this networks.

Now we assume that a constant rate θi\theta_{i} (i=1,2,,8)(i=1,2,\cdots,8) is associated to each reaction ii, the stochastic differential equation (SDE) for this model can be written as

d𝑿t=𝑺𝒉(𝑿t,𝜽)dt+𝑺diag(𝒉(𝑿t,𝜽))d𝑾t,d\boldsymbol{X}_{t}=\boldsymbol{S}\cdot\boldsymbol{h}(\boldsymbol{X}_{t},\boldsymbol{\theta})dt+\boldsymbol{S}\sqrt{\text{diag}(\boldsymbol{h}(\boldsymbol{X}_{t},\boldsymbol{\theta}))}d\boldsymbol{W}_{t}, (17)

where 𝒉(𝑿t,𝜽)=(θ1DNAP2,θ2(kDNA),θ3DNA,θ4RNA,θ5P(P1)/2,θ6P2,θ7RNA,θ8P)T\boldsymbol{h}(\boldsymbol{X}_{t},\boldsymbol{\theta})=(\theta_{1}\cdot\text{DNA}\cdot P_{2},\theta_{2}\cdot(k-\text{DNA}),\theta_{3}\cdot\text{DNA},\theta_{4}\cdot\text{RNA},\theta_{5}\cdot P\cdot(P-1)/2,\theta_{6}\cdot P_{2},\theta_{7}\cdot\text{RNA},\theta_{8}\cdot P)^{T} is a “propensity function”, kk is a constant that satisfies DNAP2+DNA=k\text{DNA}\cdot P_{2}+\text{DNA}=k, and hi(Xt,θi)dth_{i}(\operatorname{X}_{t},\theta_{i})dt is the probability of reaction ii occurring in the time interval (t,t+dt](t,t+dt], dWt=(dWt,1,,dWt,8)TdW_{t}=(dW_{t,1},\ldots,dW_{t,8})^{T} is a Wiener process, where dWt,idW_{t,i} is independently and identically distributed as dWt,iN(0,dt)dW_{t,i}\sim N(0,dt) for i=1,,8i=1,\ldots,8. The goal is to infer the reaction rates 𝜽\boldsymbol{\theta} in different experimental settings.

We consider three scenarios as in Picchini (2014). 𝒟1\mathcal{D}_{1}: fully observed data without measurement error. 𝒟2\mathcal{D}_{2}: fully observed data with measurement error ϵijN(0,5)\epsilon_{ij}\sim N(0,5) independently for every ii and for each coordinate jj of the state vector 𝑿t\boldsymbol{X}_{t}. 𝒟3\mathcal{D}_{3}: partially observed data (i.e. the DNA coordinate is unobserved) with measurement error ϵijN(0,σ2I3)\epsilon_{ij}\sim N(0,\sigma^{2}I_{3}), and σ\sigma is unknown.

We use ejMCMC and OejMCMC algorithms to estimate the ABC posterior. The ABC-ASMC algorithm with OejMCMC move is used to collect 20002000 training data, with 500 particles, and ε\varepsilon is updated by keeping 250 unique particles. In ABC-SMC, RMSE is used to model the discrepancy Δ\Delta between the observed data and the simulated data.

For 𝒟1\mathcal{D}_{1} and 𝒟2\mathcal{D}_{2}, the threshold is set to ε=0.03\varepsilon=0.03. We use the Gaussian kernel function as the proposal distribution of MCMC move, and the covariance matrix of parameters with discrepancy less than 0.05 in the training data serves as the covariance parameter of the Gaussian kernel. For 𝒟3\mathcal{D}_{3}, we set ε=0.05\varepsilon=0.05, and training data with a discrepancy less than 0.060.06 is used to tune the covariance matrix of the Gaussian kernel.

For all three scenarios, we run both ejMCMC and OejMCMC algorithms 500,000500,000 iterations. Table 3 shows the computing details of both algorithms. The results indicate that the efficiency of ejMCMC is higher than OejMCMC in all three scenarios. In scenario 𝒟2\mathcal{D}_{2}, the ejMCMC algorithm can accelerate about 41.57% compared with the OejMCMC algorithm.

Table 3: Number of iterations (NiteN_{\text{ite}}), number of predictions (NpreN_{\text{pre}}), number of synthetic data generated (NsimN_{\text{sim}}), number of accepted posterior samples (NaccN_{\text{acc}}), early rejection efficiency for different scenarios.
method NiteN_{\text{ite}} NpreN_{\text{pre}} NsimN_{\text{sim}} NaccN_{\text{acc}} Eff
𝒟1\mathcal{D}_{1} OejMCMC 500,000 - 124054 10574 0.7681
ejMCMC 500,000 123418 91580 10568 0.8345
\hdashline𝒟2\mathcal{D}_{2} OejMCMC 500,000 - 120423 7566 0.7708
ejMCMC 500,000 120367 73632 7935 0.8665
\hdashline𝒟3\mathcal{D}_{3} OejMCMC 500,000 - 299656 23185 0.4202
ejMCMC 500,000 298229 271459 20045 0.4762

For three scenarios, the fitted GP discrepancy model is shown in Appendix Figure C.4-C.6, and the boxplots of the posterior samples are shown in Figure C.7-C.9 of Appendix C.2. The GP-ABC posterior (estimated using the training points) is less accurate than OejMCMC and ejMCMC since the GP model does not fit the discrepancy very well. Under scenario D3D_{3}, the true values of the parameters θ3\theta_{3} and θ7\theta_{7} are not within the 95% credible interval of the OejMCMC algorithm, but are within those of the ejMCMC algorithm. The estimation results obtained by both algorithms in another two scenarios are similar, and the true values of the parameters are included in the estimated 95% credible intervals. We refer readers to Appendix C.2 for more details of setups and results.

5 Real Data Analysis

In this section, we consider a delay differential equation(DDE) example about the resource competition in the laboratory populations of Australian sheep blowflies (Lucilia cuprina), which is a classic experiment studied by Nicholson (1954). The room temperature of cultivating blowflies were maintained at 25°C. Appendix Figure C.10 displays the counts of blowflies, which are observed at N=137N=137 time points. The resource limitation in the dynamic system of the blowfly population acts with a time delay, which causes the fluctuations shown in the blowfly population. The fluctuations displayed in the blowfly population are caused by the time lag between stimulus and reaction (Berezansky et al., 2010).

May (1976) proposed to model the counts of blowflies with the following DDE model

dx(t)dt\displaystyle\frac{dx(t)}{dt} =\displaystyle= νx(t)[1x(tτ)/(1000P)],\displaystyle\nu x(t)[1-x(t-\tau)/(1000\cdot P)], (18)

where x(t)x(t) is the blowfly population, ν\nu is the rate of increase of the blowfly population, PP is a resource limitation parameter set by the supply of food, and τ\tau is the time delay, roughly equal to the time for an egg to grow up to a pupa. Our goal is to estimate the initial value, x(0)x(0), and the three parameters, ν\nu, PP, and τ\tau, from the noisy blowfly data y(t)y(t). The observed counts of blowflies y(t)y(t) is assumed to be lognormal distributed with the mean x(t)x(t) and the variance σ2\sigma^{2}.

For convenience of description, we assume 𝜽={X0,v,P,τ}\boldsymbol{\theta}=\{X_{0},v,P,\tau\}, and let 𝒚\boldsymbol{y} denote the observed data. We define the following prior distributions for 𝜽:log(X0)𝒩(8.5,0.32),log(v)𝒩(1.35,0.22),log(P)𝒩(0.8,0.32),log(τ)𝒩(2.25,0.082)\boldsymbol{\theta}:~{}\log(X_{0})\sim\mathcal{N}(8.5,0.3^{2}),~{}\log(v)\sim\mathcal{N}(-1.35,0.2^{2}),~{}\log(P)\sim\mathcal{N}(0.8,0.3^{2}),~{}\log(\tau)\sim\mathcal{N}(2.25,0.08^{2}). We use root mean square error as the discrepancy function Δ\Delta for ABC. The simulation data are generated via the Euler-Maruyama method (Maruyama, 1955) implemented in R package deSolve (Soetaert et al., 2010), using a constant step-size 0.1. For the cheap simulator of DAMCMC algorithm, we set the step-size to 2 for the first forty time units, and set the step size of the rest time unit to the time interval of the observation data.

We collect 30003000 training data by running an ABC-ASMC algorithm with 500500 particles, and use the resulting parameter-discrepancy pairs to train the GP discrepancy model. The predict function ha(𝜽)h_{a}(\boldsymbol{\theta}) is the GP model trained using the 30003000 samples with a=0.05a=0.05. We run ejMCMC, OejMCMC and DAMCMC 500,000500,000 iterations. The ejMCMC algorithm simulates 110,564110,564 synthetic data and saves 93.09%93.09\% of the cost, while OejMCMC algorithm simulates 291,250291,250 synthetic data and saves only 49.91%49.91\% of the cost. Our algorithm ejMCMC significantly improve the computational speed compared with OejMCMC. Both algorithms accept approximately 82,00082,000 samples. The DAMCMC algorithm runs the cheap simulator 500,000500,000 times and the expensive simulator 82,69282,692 times, and accepts 66,02866,028 samples. In addition, we compare the computing time of three MCMC methods as a function of the step size. The results are reported in Appendix Section C.3.

Refer to caption
Figure 4: Marginal ABC posterior distributions estimated by OejMCMC, ejMCMC and DAMCMC.

Figure 4 displays the marginal ABC posterior distributions produced by the OejMCMC, ejMCMC and DAMCMC algorithms. Figure 4 shows that the posterior of ejMCMC algorithm is consistent with that of OejMCMC algorithm. However, the posterior generated by the DAMCMC algorithm differs from that of OejMCMC and ejMCMC algorithms. Figure 5 shows the two-dimensional joint posterior distributions of the parameters obtained by using the ejMCMC algorithm. The posterior means and 95% credible intervals of 𝜽\boldsymbol{\theta} provided by the OejMCMC, ejMCMC and DAMCMC algorithms are shown in Appendix Figure C.6. We also compute the correlation between posterior samples: corr(ν,P)=0.264corr(\nu,P)=0.264, corr(ν,τ)=0.882corr(\nu,\tau)=-0.882, corr(P,τ)=0.112corr(P,\tau)=-0.112. Recall that ν\nu is the rate of increase of the blowfly population, PP is a resource limitation parameter set by the supply of food, and τ\tau is the time delay that roughly equal to the time for an egg to grow up to a pupa. The positive correlation between ν\nu and PP indicates that the blowfly population grows faster when there is a sufficient food supply. The tiny negative value of the correlation between τ\tau and PP implies that the amount of food supply has a small impact on the period of being a pupa. The negative correlation between τ\tau and ν\nu indicates that the blowfly population will increase slower if the period for an egg to grow up to a pupa is longer.

Refer to caption
Figure 5: The posterior distributions of parameters estimated by ejMCMC with 500,000500,000 MCMC iterations. Grey points: the accepted particles; loop lines: the 2D joint density contours.

6 Conclusion

We propose the early rejection ABC methods to accelerate parameter inference for complex statistical models. The main idea is to accurately reject parameter samples before generating synthetic data, and this is achieved by using a Gaussian process modeling to describe the relationship between the parameter and discrepancy. Our proposed method has two advantages compared with existing work. First, it utilizes MCMC moves for sampling parameters, which is more efficient in high-dimensional cases than sampling from a prior distribution. Second, it uses a pseudo Metropolis Hasting acceptance ratio to early reject samples in state space that are not worth exploring. The new method inherits the detailed balance condition and the posterior concentration property. It is also theoretically more efficient than the existing early rejection method.

We use several schemes to improve the performance of ejMCMC. To ensure the accuracy of the discrepancy model, we sequentially generate training points in areas with smaller discrepancy via a pilot run of ABC-SMC. We propose a prediction function for discrepancy model that balances accuracy and computational speed. An adaptive early rejection SMC is introduced by combining ABC-SMC and ejMCMC method. The ejASMC can adaptively select the sequence of thresholds and MCMC kernels, the particles are initialized via uniform design.

There are several lines of extensions and improvements for future work. We utilize identical kernel function and threshold value for Δ(x,y)\Delta(x,y) and h(𝜽)h(\boldsymbol{\theta}) in the pseudo acceptance ratio (i.e. min{Kε(Δ(x,y)),Kε(h(𝜽))}\min\left\{K_{\varepsilon}(\Delta(x,y)),K_{\varepsilon}(h(\boldsymbol{\theta}))\right\}) of ejMCMC. Using different kernel functions and threshold values can enhance the flexibility of algorithm. For example, decreasing the threshold of Kε(h(𝜽))K_{\varepsilon}(h(\boldsymbol{\theta})) can decrease the rate of false positive. One future extension is to propose pseudo acceptance ratio with more general kernel functions. MCMC approaches often require a large number of simulations to make sure that the Markov chain converges to the stationary distribution. This limits their application to models that only a few hundreds simulations are possible. Compare to MCMC methods, importance sampling does not require to assess convergence and the consistency property holds under mild conditions. The efficiency of importance sampling methods mainly depends on proposal distributions. Our second line of future research is to combine GP-ABC and importance sampling approaches, by serving the fitted GP model as an efficient proposal distribution in importance sampling framework. In addition, we will investigate using the resulting importance sampling distribution to design efficient MCMC proposal distributions to improve the mixing of Markov chain. The current selection of discrepancy model is limited to models defined on Euclidean space. Lastly, we will extend h(𝜽)h(\boldsymbol{\theta}) to problems defined on more general space (e.g. a phylogenetic tree) in our future research.

Acknowledge

This work was supported by the National Natural Science Foundation of China (12131001 and 12101333), the startup fund of ShanghaiTech University, the Fundamental Research Funds for the Central Universities, LPMC, and KLMDASR. The authorship is listed in alphabetic order.


SUPPLEMENTAL MATERIALS

Appendix A:

The proofs of all theoretical results presented in the main text.

Appendix B:

Some description of algorithms not shown in the main text.

Appendix C:

Some numerical results and some details of setups not shown in the main text.

References

  • Åkesson et al. (2021) Åkesson, M., P. Singh, F. Wrede, and A. Hellander (2021). Convolutional neural networks as summary statistics for approximate bayesian computation. IEEE/ACM Transactions on Computational Biology and Bioinformatics 19(6), 3353–3365.
  • Altman (1992) Altman, N. S. (1992). An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician 46(3), 175–185.
  • Beaumont et al. (2002) Beaumont, M. A., W. Zhang, and D. J. Balding (2002, 12). Approximate Bayesian computation in population genetics. Genetics 162(4), 2025–2035.
  • Berezansky et al. (2010) Berezansky, L., E. Braverman, and L. Idels (2010). Nicholson’s blowflies differential equations revisited: main results and open problems. Applied Mathematical Modelling 34(6), 1405–1417.
  • Blum et al. (2013) Blum, M. G., M. A. Nunes, D. Prangle, and S. A. Sisson (2013). A comparative review of dimension reduction methods in approximate bayesian computation. Statistical Science 28(2), 189–208.
  • Brooks et al. (2011) Brooks, S., A. Gelman, G. Jones, and X.-L. Meng (2011). Handbook of markov chain monte carlo, Chapter Likelihood-Free MCMC, pp.  313–333. New York: CRC press.
  • Buchholz and Chopin (2019) Buchholz, A. and N. Chopin (2019). Improving approximate Bayesian computation via Quasi-Monte Carlo. Journal of Computational and Graphical Statistics 28(1), 205–219.
  • Christen and Fox (2005) Christen, J. A. and C. Fox (2005). Markov chain monte carlo using an approximation. Journal of Computational and Graphical statistics 14(4), 795–810.
  • Clarté et al. (2020) Clarté, G., C. P. Robert, R. J. Ryder, and J. Stoehr (2020, 11). Componentwise approximate Bayesian computation via Gibbs-like steps. Biometrika 108(3), 591–607.
  • Cleveland and Devlin (1988) Cleveland, W. S. and S. J. Devlin (1988). Locally weighted regression: an approach to regression analysis by local fitting. Journal of the American statistical association 83(403), 596–610.
  • Del Moral et al. (2006) Del Moral, P., A. Doucet, and A. Jasra (2006). Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68(3), 411–436.
  • Del Moral et al. (2012) Del Moral, P., A. Doucet, and A. Jasra (2012). An adaptive sequential Monte Carlo method for approximate Bayesian computation. Statistics and Computing 22(5), 1009–1020.
  • Doucet et al. (2001) Doucet, A., N. De Freitas, and N. Gordon (2001). An introduction to sequential Monte Carlo methods. In Sequential Monte Carlo methods in practice, pp.  3–14. Springer.
  • Drovandi et al. (2018) Drovandi, C. C., M. T. Moores, and R. J. Boys (2018). Accelerating pseudo-marginal mcmc using gaussian processes. Computational Statistics & Data Analysis 118, 1–17.
  • Epanechnikov (1969) Epanechnikov, V. A. (1969). Non-parametric estimation of a multivariate probability density. Theory of Probability & Its Applications 14(1), 153–158.
  • Everitt and Rowińska (2021) Everitt, R. G. and P. A. Rowińska (2021). Delayed acceptance ABC-SMC. Journal of Computational and Graphical Statistics 30(1), 55–66.
  • Fearnhead and Prangle (2012) Fearnhead, P. and D. Prangle (2012). Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation [with discussion]. Journal of the Royal Statistical Society. Series B (Statistical Methodology) 74(3), 419–474.
  • Frazier et al. (2018) Frazier, D. T., G. M. Martin, C. P. Robert, and J. Rousseau (2018). Asymptotic properties of approximate bayesian computation. Biometrika 105(3), 593–607.
  • Frazier et al. (2022) Frazier, D. T., D. J. Nott, C. Drovandi, and R. Kohn (2022). Bayesian inference using synthetic likelihood: Asymptotics and adjustments. Journal of the American Statistical Association 0(0), 1–12.
  • Gelman et al. (1995) Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin (1995). Bayesian data analysis. Chapman and Hall/CRC.
  • Golightly and Wilkinson (2010) Golightly, A. and D. Wilkinson (2010). Learning and Inference for Computational Systems Biology, Chapter Markov chain Monte Carlo algorithms for SDE parameter estimation, pp.  253–276. MIT Press.
  • Gutmann and Corander (2016) Gutmann, M. U. and J. Corander (2016). Bayesian optimization for likelihood-free inference of simulator-based statistical models. Journal of Machine Learning Research 17(125), 1–47.
  • Jiang et al. (2017) Jiang, B., T.-y. Wu, C. Zheng, and W. H. Wong (2017). Learning summary statistic for approximate bayesian computation via deep neural network. Statistica Sinica 27(4), 1595–1618.
  • Joyce and Marjoram (2008) Joyce, P. and P. Marjoram (2008). Approximately sufficient statistics and bayesian computation. Statistical applications in genetics and molecular biology 7(1), Article 26.
  • Järvenpää et al. (2019) Järvenpää, M., M. Gutmann, A. Vehtari, and P. Marttinen (2019, 06). Efficient acquisition rules for model-based approximate bayesian computation. Bayesian Analysis 14(2), 595–622.
  • Järvenpää et al. (2018) Järvenpää, M., M. U. Gutmann, A. Vehtari, and P. Marttinen (2018). Gaussian process modelling in approximate Bayesian computation to estimate horizontal gene transfer in bacteria. The Annals of Applied Statistics 12(4), 2228 – 2251.
  • Järvenpää et al. (2020) Järvenpää, M., A. Vehtari, and P. Marttinen (2020). Batch simulations and uncertainty quantification in gaussian process surrogate approximate bayesian computation. In Conference on Uncertainty in Artificial Intelligence, pp. 779–788. PMLR.
  • Marin et al. (2012) Marin, J.-M., P. Pudlo, C. P. Robert, and R. J. Ryder (2012). Approximate bayesian computational methods. Statistics and Computing 22(6), 1167–1180.
  • Marjoram et al. (2003) Marjoram, P., J. Molitor, V. Plagnol, and S. Tavaré (2003). Markov chain Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences 100(26), 15324–15328.
  • Maruyama (1955) Maruyama, G. (1955). Continuous markov processes and stochastic equations. Rendiconti del Circolo Matematico di Palermo 4, 48–90.
  • May (1976) May, R. M. (1976). Models for single populations. In R. M. May (Ed.), Theoretical ecology, pp.  4–25. Philadelphia, Pennsylvania, USA: W. B. Saunders Company.
  • Nicholson (1954) Nicholson, A. J. (1954). An outline of the dynamics of animal populations. Australian Journal of Zoology 2(1), 9–65.
  • Nunes and Balding (2010) Nunes, M. A. and D. J. Balding (2010). On optimal selection of summary statistics for approximate bayesian computation. Statistical applications in genetics and molecular biology 9(1), 34.
  • Picchini (2014) Picchini, U. (2014). Inference for SDE models via approximate Bayesian computation. Journal of Computational and Graphical Statistics 23(4), 1080–1100.
  • Price et al. (2017) Price, L. F., C. C. Drovandi, A. Lee, and D. J. Nott (2017, jul). Bayesian synthetic likelihood. Journal of Computational and Graphical Statistics 27(1), 1–11.
  • Pritchard et al. (1999) Pritchard, J. K., M. T. Seielstad, A. Perez-Lezaun, and M. W. Feldman (1999, 12). Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Molecular Biology and Evolution 16(12), 1791–1798.
  • Pudlo et al. (2016) Pudlo, P., J.-M. Marin, A. Estoup, J.-M. Cornuet, M. Gautier, and C. P. Robert (2016). Reliable abc model choice via random forests. Bioinformatics 32(6), 859–866.
  • Sisson and Fan (2010) Sisson, S. A. and Y. Fan (2010). Likelihood-free markov chain monte carlo. In S. Brooks, G. L. Jones, and X.-L. Meng (Eds.), Handbook of Markov Chain Monte Carlo. Chapman and Hall/CRC Press. In press.
  • Sisson et al. (2018) Sisson, S. A., Y. Fan, and M. Beaumont (2018). Handbook of approximate Bayesian computation. CRC Press.
  • Sisson et al. (2007) Sisson, S. A., Y. Fan, and M. M. Tanaka (2007). Sequential Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences 104(6), 1760–1765.
  • Soetaert et al. (2010) Soetaert, K., T. Petzoldt, and R. W. Setzer (2010). Solving differential equations in r: package desolve. Journal of statistical software 33, 1–25.
  • Stanton et al. (2021) Stanton, S., W. Maddox, I. Delbridge, and A. G. Wilson (2021). Kernel interpolation for scalable online gaussian processes. In International Conference on Artificial Intelligence and Statistics, pp.  3133–3141. PMLR.
  • Wegmann et al. (2009) Wegmann, D., C. Leuenberger, and L. Excoffier (2009, 08). Efficient approximate Bayesian computation coupled with Markov chain Monte Carlo without likelihood. Genetics 182(4), 1207–1218.
  • Wilkinson (2014) Wilkinson, R. (2014). Accelerating abc methods using gaussian processes. In Artificial Intelligence and Statistics, pp.  1015–1023. PMLR.