Using early rejection Markov chain Monte Carlo and Gaussian processes to accelerate ABC methods

Xuefei Cao¹, Shijia Wang^2∗ and Yongdao Zhou¹
¹School of Statistics and Data Science, Nankai University, China
²Institute of Mathematical Sciences, ShanghaiTech University, China
Address correspondence to: Dr. Shijia Wang ([email protected]) and Dr. Yongdao Zhou ([email protected]).

Abstract

Approximate Bayesian computation (ABC) is a class of Bayesian inference algorithms that targets for problems with intractable or unavailable likelihood function. It uses synthetic data drawn from the simulation model to approximate the posterior distribution. However, ABC is computationally intensive for complex models in which simulating synthetic data is very expensive. In this article, we propose an early rejection Markov chain Monte Carlo (ejMCMC) sampler based on Gaussian processes to accelerate inference speed. We early reject samples in the first stage of the kernel using a discrepancy model, in which the discrepancy between the simulated and observed data is modeled by Gaussian process (GP). Hence, the synthetic data is generated only if the parameter space is worth exploring. We demonstrate from theory, simulation experiments, and real data analysis that the new algorithm significantly improves inference efficiency compared to existing early-rejection MCMC algorithms. In addition, we employ our proposed method within an ABC sequential Monte Carlo (SMC) sampler. In our numerical experiments, we use examples of ordinary differential equations, stochastic differential equations, and delay differential equations to demonstrate the effectiveness of the proposed algorithm. We develop an R package that is available at https://github.com/caofff/ejMCMC.

Keywords: Approximate Bayesian computation, Gaussian process, Markov chain Monte Carlo, early rejection, Sequence Monte Carlo.

1 Introduction

In Bayesian statistics, we aim to infer the posterior distribution of the unknown parameter, which requires the evaluation of the likelihood function of data given a parameter value. However, in some complex statistical models (e.g. stochastic differential equations), it is computationally very expensive or even not possible to derive an analytical formula of the likelihood function, but we are able to simulate synthetic data from the statistical model. Approximate Bayesian computation (ABC) is a class of Bayesian computation methods that avoids the computation of likelihood function and is suitable for this context. In ABC, we simulate synthetic data from the statistical model given a simulated parameter, and this parameter value will be retained if the simulated data sets are close enough to the observed data sets.

Approximate Bayesian computation (Pritchard et al., 1999; Beaumont et al., 2002; Sisson et al., 2018) is originally introduced in the population genetics context. The simplest version of ABC is a rejection sampler. In practice, rejection sampling is inefficient, especially when there is a large difference between the prior distribution and the posterior distribution. Marjoram et al. (2003) introduced an ABC Markov chain Monte Carlo (MCMC) algorithm to approximate the posterior distribution. Wegmann et al. (2009) improved the performance of ABC-MCMC by relaxing the tolerance within MCMC while conducting subsampling and regression adjustment to the MCMC output. Clarté et al. (2020) explored a Gibbs type ABC algorithm that component-wisely targets the corresponding conditional posterior distributions. Sequential Monte Carlo (SMC) methods (Doucet et al., 2001; Del Moral et al., 2006) have become good alternatives to MCMC for complex model inference. Sisson et al. (2007); Del Moral et al. (2012) employed ABC-MCMC algorithms within the SMC framework of Del Moral et al. (2006). Buchholz and Chopin (2019) improved ABC algorithm based on quasi-Monte Carlo sequences, and the resulting ABC estimates achieves a lower variance compared with Monte Carlo ABC. Frazier et al. (2022); Price et al. (2017) implemented Bayesian synthetic likelihood to conduct inference for complex models with intractable likelihood function. Frazier et al. (2018) studied the posterior concentration property of the approximate Bayesian computation methods, and the asymptotic distribution of the posterior mean.

The choice of summary statistics affects the inference efficiency and results of ABC. In general, we measure the discrepancy between two data sets by using the distance between their low-dimensional summary statistics. However, constructing low-dimensional summary statistics can be challenging due to lack of expert knowledge. Sisson and Fan (2010); Marin et al. (2012); Blum et al. (2013) provide a detailed summary of the selection of summary statistics in ABC. Joyce and Marjoram (2008); Nunes and Balding (2010) select the best subset from a pre-selected set of candidate summary statistics based on various information criteria. Pudlo et al. (2016) conduct selection of summary statistics via random forests. Fearnhead and Prangle (2012) construct summary statistics in a semi-automatic fashion using regression. Jiang et al. (2017); Åkesson et al. (2021) learn summary statistics automatically via neural networks.

For expensive ABC simulators, various methods have been proposed to increase the sample-efficiency of ABC. Wilkinson (2014) used Gaussian process (GP) to model uncertainty in likelihood to exclude regions with negligible posterior probabilities. Järvenpää et al. (2018, 2019, 2020) modelled the discrepancy between synthetic and observed data using GPs, and estimated the ABC posterior based on the fitted GP without further simulation (GP-ABC). Picchini (2014) proposed an early rejection ABC-MCMC algorithm for SDE parameter inference, in which the acceptance ratio of MH algorithm is divided into two parts. Everitt and Rowińska (2021) used delayed acceptance (DA) MCMC (Christen and Fox, 2005) in ABC framework, and used cheap simulators to build a more efficient proposal distribution.

In this article, we propose a new early rejection MCMC (ejMCMC) method to speed up inference for simulating expensive models, based on the seminal work of Picchini (2014) and Järvenpää et al. (2018, 2019, 2020). In the first step of our algorithm, we rule out the parameter space that is not worth exploring by running a Metropolis Hasting (MH) step with a predicted discrepancy value. Compared to alternative machine learning techniques, Gaussian processes furnish not just point predictions, but also encompass uncertainty estimates and exhibit efficacy with limited data sets. Hence, we choose a Gaussian process model to evaluate the discrepancy in this article. Consequently, our proposed method can early reject samples in cases the algorithm of Picchini (2014) is inefficient (e.g. uniform priors, symmetric proposals). Then, we simulate data from the model only if it is highly possible to be accepted. The posterior estimated by GP-ABC approaches (Järvenpää et al., 2018, 2019, 2020) rely entirely on the fitted surrogate model. In practice, there exist cases that the discrepancy for parameters cannot be well modelled by GPs. For example, if the shape of discrepancy as a function of parameters is multi-modal, the Gaussian process model may not be able to accurately capture these peaks. The approximate Bayesian posterior obtained by the GP surrogate model could be inaccurate. We show this phenomenon in Section 4.1. The proposed ejMCMC method effectively combines MCMC and GP discrepancy model, and hence can correct the inaccurate posterior estimated by GP-ABC approaches. The efficiency of DA-ABC approach (Everitt and Rowińska, 2021) relies on a cheap simulator used in proposing stage. It is challenging to build an effective but cheap simulator when there is a lack of expert knowledge. Our ejMCMC instead does not require as much prior knowledge. Our numerical experiments indicate that the new method can achieve about $40\%$ acceleration compared with Picchini (2014), and the estimated posterior is more accurate than Picchini (2014) with fixed computational budgets. We show that the ejMCMC based on GP discrepancy satisfies the detailed balance condition, and the posterior concentration property. The proposed algorithm is theoretically more efficient than the existing one. We also propose an efficient GP discrepancy function to balance the computational efficiency and ratio of false rejection samples. The ejMCMC can be designed as an efficient forward kernel within ABC-ASMC framework (Del Moral et al., 2012), and the resulting algorithm (ejASMC) is more efficient.

The rest of article is organized as follows. In Section 2, we show some background information of approximate Bayesian computation. In Section 3, we introduce our proposed algorithms. In sections 4 and 5, we use a real data analysis and simulation studies to show the effectiveness of our method. Section 6 gives the conclusion, all proofs of the theoretical results are deferred to the Appendix.

2 Approximate Bayesian computation methods

In this section, we provide a review of ABC algorithms relevant for this article.

2.1 Basics of Approximate Bayesian computation

Let $\boldsymbol{\theta}$ be the parameter of interest, which belongs to a parameter space $\Theta\subset\mathbb{R}^{p}$ , and let $\pi(\boldsymbol{\theta})$ be a prior density for $\boldsymbol{\theta}$ . The likelihood function, denoted as $p(\cdot\mid\boldsymbol{\theta})$ , is costly to compute, while it is possible to simulate data from the model with a given parameter $\boldsymbol{\theta}$ . Consider a given observed data set $\boldsymbol{y}=(\boldsymbol{y}_{(1)},\boldsymbol{y}_{(2)},\ldots,\boldsymbol{y}_{(T)})^{\top}\in\mathcal{Y}^{T}\subset\mathbb{R}^{d\times T}$ , where $d$ denotes the dimension of data and $T$ denotes the number of repeated measurements. The likelihood-free inference is achieved by augmenting the target posterior from $\pi(\boldsymbol{\theta}\mid\boldsymbol{y})\propto p(\boldsymbol{y}\mid\boldsymbol{\theta})\pi(\boldsymbol{\theta})$ to

\pi_{\varepsilon}(\boldsymbol{\theta},\boldsymbol{x}\mid\boldsymbol{y})=\frac{\pi(\boldsymbol{\theta})p(\boldsymbol{x}\mid\boldsymbol{\theta})K_{\varepsilon}(\Delta(\boldsymbol{x},\boldsymbol{y}))}{\iint\pi(\boldsymbol{\theta})p(\boldsymbol{x}\mid\boldsymbol{\theta})K_{\varepsilon}(\Delta(\boldsymbol{x},\boldsymbol{y}))d\boldsymbol{x}d\boldsymbol{\theta}},

where the auxiliary parameter $\boldsymbol{x}$ is a (simulated) data set from $p(\cdot\mid\boldsymbol{\theta})$ , $\Delta:\mathcal{Y}^{T}\times\mathcal{Y}^{T}\rightarrow\mathbb{R}^{+}$ is a discrepancy function that measures the difference between the synthetic data set $\boldsymbol{x}$ and the observed data set $\boldsymbol{y}$ . The function $K_{\varepsilon}(\cdot)$ is defined as $K_{\varepsilon}(\cdot)=\frac{K(\cdot/\varepsilon)}{K(0)}$ , where $K(\cdot)$ is a centered smoothing kernel density. In this article, we mainly focus on a kernel function $K(z)$ that takes its maximum value at zero and is zero for $z>1$ . For example, the uniform, Epanechnikov, tricube, triangle, quartic (biweight), triweight and quadratic kernels (Epanechnikov, 1969; Cleveland and Devlin, 1988; Altman, 1992). The parameter $\varepsilon>0$ serves as a tolerance level, and it weights the posterior $\pi_{\varepsilon}(\boldsymbol{\theta},\boldsymbol{x}\mid\boldsymbol{y})$ higher in regions where $\boldsymbol{x}$ and $\boldsymbol{y}$ are similar.

The goal of approximate Bayesian computation is to estimate the marginal posterior distribution of $\boldsymbol{\theta}$ , denoted as $\pi_{\varepsilon}(\boldsymbol{\theta}\mid\boldsymbol{y})$ , by integrating out the auxiliary data set $\boldsymbol{x}$ from the modified posterior distribution. This is given by the following equation:

\pi_{\varepsilon}(\boldsymbol{\theta}\mid\boldsymbol{y})=\frac{\pi(\boldsymbol{\theta})\int p(\boldsymbol{x}\mid\boldsymbol{\theta})K_{\varepsilon}(\Delta(\boldsymbol{x},\boldsymbol{y}))d\boldsymbol{x}}{\iint\pi(\boldsymbol{\theta})p(\boldsymbol{x}\mid\boldsymbol{\theta})K_{\varepsilon}(\Delta(\boldsymbol{x},\boldsymbol{y}))d\boldsymbol{x}d\boldsymbol{\theta}}.

(1)

The ABC posterior provides an approximation of the true posterior distribution when the tolerance level $\varepsilon$ approaches to zero.

In practice, we set $\Delta(\boldsymbol{x},\boldsymbol{y})=\rho(\boldsymbol{S}(\boldsymbol{x}),\boldsymbol{S}(\boldsymbol{y}))$ for some distance metric $\rho(\cdot,\cdot)$ and a chosen vector of summary statistics $\boldsymbol{S}(\cdot)$ : $\mathcal{Y}^{T}\rightarrow{\mathbb{R}^{d_{s}}}$ with some small $d_{s}$ . For example, given an observed data set with $T$ independent and identically distributed data points, we can take $\boldsymbol{S}(\boldsymbol{y})=\sum_{i=1}^{T}\boldsymbol{y}_{i}/T$ and $\rho(\cdot,\cdot)=\left\|\cdot\right\|_{2}$ . If the vector of summary statistic is sufficient for the parameter $\boldsymbol{\theta}$ , then comparing the summary statistics of two data sets will be equivalent to comparing the data sets themselves (Brooks et al., 2011). In this case, the ABC posterior converges to the conditional distribution of $\boldsymbol{\theta}$ given $\boldsymbol{y}$ as $\varepsilon\rightarrow 0$ ; otherwise, the ABC posterior converges to the conditional distribution of $\boldsymbol{\theta}$ given $\boldsymbol{S}(\boldsymbol{y})$ , that is, $\pi(\boldsymbol{\theta}\mid\boldsymbol{S}(\boldsymbol{y}))$ . For convenience of description, we still denote $\pi_{\varepsilon}(\boldsymbol{\theta}\mid\boldsymbol{S}(\boldsymbol{y}))$ as $\pi_{\varepsilon}(\boldsymbol{\theta}\mid\boldsymbol{y})$ .

2.2 ABC Markov chain Monte Carlo

Marjoram et al. (2003) proposed an ABC Markov chain Monte Carlo (MCMC) method based on the Metropolis–Hastings MCMC algorithm. The ABC-MCMC constructs a Markov chain that admits the target probability distribution as stationarity under mild regularity conditions. At $(n+1)$ -th MCMC iteration, we repeat the following procedure.

(i)

Generate a candidate value $\boldsymbol{\theta}^{*}\sim q(\cdot\mid\boldsymbol{\theta}_{n})$ .
(ii)

Generate a synthetic data set $\boldsymbol{x}^{*}=(\boldsymbol{x}_{(1)}^{*},\cdots,\boldsymbol{x}_{(T)}^{*})^{T}$ from the likelihood, $p(\cdot\mid\boldsymbol{\boldsymbol{\theta}^{*}})$ .

(iii)

Accept the proposed parameter value $(\boldsymbol{\theta}_{n+1},\boldsymbol{x}_{n+1})=(\boldsymbol{\theta}^{*},\boldsymbol{x}^{*})$ with probability

\alpha[(\boldsymbol{\theta}_{n},\boldsymbol{x}_{n}),(\boldsymbol{\theta}^{*},\boldsymbol{x}^{*})]=\operatorname{min}\left\{1,\frac{\pi(\boldsymbol{\theta}^{*})q(\boldsymbol{\theta}_{n}\mid\boldsymbol{\theta}^{*})}{\pi(\boldsymbol{\theta}_{n})q(\boldsymbol{\theta}^{*}\mid\boldsymbol{\theta}_{n})}\frac{K_{\varepsilon}(\Delta(\boldsymbol{x}^{*},\boldsymbol{y}))}{K_{\varepsilon}(\Delta(\boldsymbol{x}_{n},\boldsymbol{y}))}\right\};

(2)

Otherwise, set $(\boldsymbol{\theta}_{n+1},\boldsymbol{x}_{n+1})=(\boldsymbol{\theta}_{n},\boldsymbol{x}_{n})$ .

The MCMC algorithm targets the joint posterior distribution $\pi_{\varepsilon}(\boldsymbol{\theta},\boldsymbol{x}\mid\boldsymbol{y})$ , with the marginal distribution of interest being the posterior distribution $\pi_{\varepsilon}(\boldsymbol{\theta}\mid\boldsymbol{y})$ . The sampler generates a Markov chain sequence $\{(\boldsymbol{\theta}_{n},\boldsymbol{x}_{n})\}$ for $n\geq 0$ , where $\boldsymbol{x}_{n}$ represents the synthetic data set, and $\boldsymbol{\theta}_{n}$ represents the parameter value.

For complex statistical models that are expensive to simulate, we aim to speed up inference by reducing the number of simulations. Based on a uniform kernel $K_{\varepsilon}(\Delta(\boldsymbol{x},\boldsymbol{y}))=\mathbb{I}(\Delta(\boldsymbol{x},\boldsymbol{y})\leq\varepsilon)$ , Picchini (2014) proposed an early-rejection MCMC algorithm denoted by OejMCMC. Here, we generalize the uniform kernel to a more general kernel function. By the definition of the kernel function in Eq.(1), the acceptance probability $\alpha[(\boldsymbol{\theta}_{n},\boldsymbol{x}_{n}),(\boldsymbol{\theta}^{*},\boldsymbol{x}^{*})]$ (Eq. (2)) is always not greater than

\breve{\alpha}[(\boldsymbol{\theta}_{n},\boldsymbol{x}_{n}),\boldsymbol{\theta}^{*}]=\pi(\boldsymbol{\theta}^{*})q(\boldsymbol{\theta}_{n}\mid\boldsymbol{\theta}^{*})/\{\pi(\boldsymbol{\theta}_{n})q(\boldsymbol{\theta}^{*}\mid\boldsymbol{\theta}_{n})K_{\varepsilon}(\Delta(\boldsymbol{x}_{n},\boldsymbol{y}))\}.

(3)

Then we can reject some candidate parameters before simulations by generating $w\sim\mathcal{U}(0,1)$ , and comparing $w$ with $\breve{\alpha}[(\boldsymbol{\theta}_{n},\boldsymbol{x}_{n}),\boldsymbol{\theta}^{*}]$ . Specifically, If $w>\breve{\alpha}[(\boldsymbol{\theta}_{n},\boldsymbol{x}_{n}),\boldsymbol{\theta}^{*}]$ , we immediately reject $\boldsymbol{\theta}^{*}$ before simulating synthetic data with $\boldsymbol{\theta}^{*}$ , otherwise, simulate synthetic data and determine whether to receive candidate parameters according to Metropolis-Hastings acceptance probability.

In many cases, we may take a weak prior distribution or take a uniform distribution as prior, and use a symmetric proposal distribution, that is, for all $\boldsymbol{\theta},\boldsymbol{\theta}^{*}\in\Theta$ , $\pi(\boldsymbol{\theta})\approx\pi(\boldsymbol{\theta}^{*})$ and $q(\boldsymbol{\theta}^{*}\mid\boldsymbol{\theta})=q(\boldsymbol{\theta}\mid\boldsymbol{\theta}^{*})$ , $P(w>\breve{\alpha}[(\boldsymbol{\theta},\boldsymbol{x}),\boldsymbol{\theta}^{*}])\approx 0$ . Hence, parameters are rarely rejected early, the method basically degenerates to the original ABC-MCMC. In Section 3, we will propose a new early rejection method based on discrepancy models.

2.3 ABC Sequential Monte Carlo

Sisson et al. (2007); Del Moral et al. (2012) proposed ABC-SMC methods within the SMC framework of Del Moral et al. (2006). SMC is a class of importance sampling based methods for generating weighted particles from a target distribution $\pi$ by approximately simulating from a sequence of related probability distribution $\{\pi_{r}\}_{r=0:R}$ , where the selected initial distribution $\pi_{0}$ is easy to approximate (e.g. the prior distribution), and the final distribution $\pi_{R}=\pi$ is the target distribution. The particles are moved from $r$ -th target to $(r+1)$ -th target, by using a Markov kernel.

In ABC-SMC, we select a sequence of intermediate target distributions $\{\pi_{r}=\pi_{\varepsilon_{r}}(\boldsymbol{\theta}\mid\boldsymbol{y})\}_{r=0:R}$ with a descending sequence of tolerance parameters $\{\varepsilon_{r}\}_{r=0:R}$ . The initial distribution has a large tolerance and is easy to approximate, the final distribution has a small tolerance and is hence accurate. Let $\boldsymbol{\theta}_{r}^{(i)}$ denote the $i$ -th particle after iteration $r$ , let $\Delta_{r}^{(i)}$ denote the corresponding discrepancy and let $W_{r}^{(i)}$ denote the normalized weight. The set of weighted particles $\{\boldsymbol{\theta}_{r}^{(i)},W_{r}^{(i)}\}_{i=1:N}$ represents an empirical approximation of $\pi_{\varepsilon_{r}}(\boldsymbol{\theta}\mid\boldsymbol{y})$ . At each iteration $r+1$ , we conduct propagation, weighting and resampling to approximate $\pi_{\varepsilon_{r+1}}(\boldsymbol{\theta}\mid\boldsymbol{y})$ . We refer reader to Appendix Section B.1 for more details of these three steps.

3 An early rejection method based on discrepancy models

To reduce the computational cost of generating synthetic data sets in complex statistical models, we propose an early rejection method based on a discrepancy model. In Section 2.2, we discussed how the early rejection acceptance probability $\breve{\alpha}[(\boldsymbol{\theta}_{n},\boldsymbol{x}_{n}),\boldsymbol{\theta}^{*}]$ proposed by Picchini (2014) can be inefficient in certain cases. In this section, we introduce a more efficient early rejection ABC method that utilizes a discrepancy model to speed up inference. We refer to our method as ejMCMC to distinguish it from the early rejection MCMC proposed in Picchini (2014).

3.1 Early rejection MCMC based on a discrepancy model

The main computational cost of ABC-MCMC comes from evaluating the acceptance ratio (Eq. 2), which requires simulating synthetic data sets. In this article, we propose to construct a pseudo MH acceptance probability

\hat{\alpha}[(\boldsymbol{\theta},\boldsymbol{x}),(\boldsymbol{\theta}^{*},\boldsymbol{x}^{*})]=\min\left\{1,\frac{\pi(\boldsymbol{\theta}^{*})q(\boldsymbol{\theta}\mid\boldsymbol{\theta}^{*})}{\pi(\boldsymbol{\theta})q(\boldsymbol{\theta}^{*}\mid\boldsymbol{\theta})}\frac{\min\{K_{\varepsilon}(\Delta(\boldsymbol{x}^{*},\boldsymbol{y})),K_{\varepsilon}(h(\boldsymbol{\theta}^{*}))\}}{\min\{K_{\varepsilon}(\Delta(\boldsymbol{x},\boldsymbol{y})),K_{\varepsilon}(h(\boldsymbol{\theta}))\}}\right\}

(4)

based on a discrepancy function $h$ . Note that the pseudo MH acceptance probability $\hat{\alpha}[(\boldsymbol{\theta},\boldsymbol{x}),(\boldsymbol{\theta}^{*},\boldsymbol{x}^{*})]$ is not greater than

\tilde{\alpha}[(\boldsymbol{\theta},\boldsymbol{x}),\boldsymbol{\theta}^{*}]=\frac{\pi(\boldsymbol{\theta}^{*})q(\boldsymbol{\theta}\mid\boldsymbol{\theta}^{*})}{\pi(\boldsymbol{\theta})q(\boldsymbol{\theta}^{*}\mid\boldsymbol{\theta})}\frac{K_{\varepsilon}(h(\boldsymbol{\theta}^{*}))}{\min\{K_{\varepsilon}(\Delta(\boldsymbol{x},\boldsymbol{y})),K_{\varepsilon}(h(\boldsymbol{\theta}))\}},

(5)

which does not depend on $\Delta(\boldsymbol{x}^{*},\boldsymbol{y})$ . This allows us to reject some candidate parameters based on $\tilde{\alpha}[(\boldsymbol{\theta},\boldsymbol{x}),\boldsymbol{\theta}^{*}]$ before simulating the synthetic data. In the first stage of ejMCMC, we use $\tilde{\alpha}[(\boldsymbol{\theta},\boldsymbol{x}),\boldsymbol{\theta}^{*}]$ to early reject samples. Since $\tilde{\alpha}[(\boldsymbol{\theta},\boldsymbol{x}),\boldsymbol{\theta}^{*}]\leq\breve{\alpha}[(\boldsymbol{\theta}_{n},\boldsymbol{x}_{n}),\boldsymbol{\theta}^{*}]$ , ejMCMC’s early rejection rate is always not lower than OejMCMC. Our ejMCMC algorithm degenerates to OejMCMC in case that $h(\boldsymbol{\theta})\equiv 0$ . The proposed early rejection MCMC algorithm is shown in Algorithm 1.

If $h(\boldsymbol{\theta})$ satisfies that $h(\boldsymbol{\theta})\leq\Delta(\boldsymbol{x},\boldsymbol{y})$ in the ABC posterior region with positive support, the pseudo MH acceptance probability (Eq.(4)) is equal to the Metropolis-Hastings acceptance probability ${\alpha}[(\boldsymbol{\theta},\boldsymbol{x}),(\boldsymbol{\theta}^{*},\boldsymbol{x}^{*})]$ (Eq.(2)). However, the function $h(\boldsymbol{\theta})$ cannot strictly satisfy $h(\boldsymbol{\theta})\leq\Delta(\boldsymbol{x},\boldsymbol{y})$ in practice. Consequently, we will false reject samples in region $\{\boldsymbol{\theta}:h(\boldsymbol{\theta})>\Delta(\boldsymbol{x},\boldsymbol{y})\}$ . Hence, the ejMCMC algorithm targets a modified posterior distribution based on a discrepancy function $h(\boldsymbol{\theta})$ , which may not be equal to the original posterior distribution $\pi_{\varepsilon}(\boldsymbol{\theta},\boldsymbol{x}\mid\boldsymbol{y})$ .

As we discussed above, the function $h(\boldsymbol{\theta})$ is a trade-off between speed and accuracy. In this article, we use a Gaussian process to model the functional relationship between the discrepancy and the parameters, and $h(\boldsymbol{\theta})$ is $a$ (for example, $a=0.05$ ) quantile prediction function of the deviation. More details of the selection and training of $h(\boldsymbol{\theta})$ will be described in Section 3.4.

Algorithm 1 Early rejection MCMC algorithm based on a discrepancy model

Input: (a) Total number of MCMC iterations $N$ , ABC threshold $\varepsilon$ ; (b) observed data $\boldsymbol{y}$ ; (c) prior distribution $\pi(\cdot)$ , likelihood function $p(\cdot\mid\cdot)$ , proposal distribution $q(\cdot\mid\cdot)$ and prediction function $h(\cdot)$ .

1: Initialize

\boldsymbol{\theta}_{0}

2: for

n=0,\ldots,N-1

3: Set

(\boldsymbol{\theta}_{n+1},\boldsymbol{x}_{n+1})=(\boldsymbol{\theta}_{n},\boldsymbol{x}_{n})

4: Generate a candidate value

\boldsymbol{\theta}^{*}\sim q(\cdot\mid\boldsymbol{\theta}_{n})

from the proposal distribution.

5: Generate

w

from

U[0,1]

6: if

w<\frac{\pi(\boldsymbol{\theta}^{*})q(\boldsymbol{\theta}_{n}\mid\boldsymbol{\theta}^{*})}{\pi(\boldsymbol{\theta}_{n})q(\boldsymbol{\theta}^{*}\mid\boldsymbol{\theta}_{n})\min\{K_{\varepsilon}(\Delta(\boldsymbol{x}_{n},\boldsymbol{y})),K_{\varepsilon}(h(\boldsymbol{\theta}_{n}))\}}

then

7: Compute

h(\boldsymbol{\theta}^{*})

8: if

w<\tilde{\alpha}[(\boldsymbol{\theta}_{n},\boldsymbol{x}_{n}),\boldsymbol{\theta}^{*}]

then

9: generate

\boldsymbol{x}^{*}

from

p(\cdot\mid\boldsymbol{\theta}^{*})

10: if

w<\hat{\alpha}[(\boldsymbol{\theta}_{n},\boldsymbol{x}_{n}),(\boldsymbol{\theta}^{*},\boldsymbol{x}^{*})]

then

11:

(\boldsymbol{\theta}_{n+1},\boldsymbol{x}_{n+1})=(\boldsymbol{\theta}^{*},\boldsymbol{x}^{*})

12: end if

13: end if

14: end if

15: end for

16: return

\theta_{1:N}

3.2 Properties of ejMCMC algorithm

Let $\Theta^{\prime}:=\{\boldsymbol{\theta}:h(\boldsymbol{\theta})<\varepsilon\}$ and $\Theta_{1}:=\{\boldsymbol{\theta}:\pi_{\varepsilon}(\boldsymbol{\theta}\mid\boldsymbol{y})>0\}$ . Here $\Theta_{1}$ denotes the ABC posterior region with positive support. The following proposition shows that under mild assumptions, the ejMCMC algorithm satisfies the detailed balance condition.

Assumption 1.

$\Theta^{\prime}\cap\Theta_{1}$ is not a measure zero set.

This is a basic assumption for ejMCMC algorithm, without which the algorithm will collapse. All conclusions about the ejMCMC algorithm are based on this assumption.

Proposition 1.

If Assumption 1 is satisfied, the early rejection MCMC based on a discrepancy model satisfies the detailed balance condition, that is,

\hat{\pi}_{\varepsilon}(\boldsymbol{\theta},\boldsymbol{x}\mid\boldsymbol{y})q(\boldsymbol{\theta}^{*}\mid\boldsymbol{\theta})p(\boldsymbol{x}^{*}\mid\boldsymbol{\theta}^{*})\hat{\alpha}[(\boldsymbol{\theta},\boldsymbol{x}),(\boldsymbol{\theta}^{*},\boldsymbol{x}^{*})]=\hat{\pi}_{\varepsilon}(\boldsymbol{\theta}^{*},\boldsymbol{x}^{*}\mid\boldsymbol{y})q(\boldsymbol{\theta}\mid\boldsymbol{\theta}^{*})p(\boldsymbol{x}\mid\boldsymbol{\theta})\hat{\alpha}[(\boldsymbol{\theta}^{*},\boldsymbol{x}^{*}),(\boldsymbol{\theta},\boldsymbol{x})],

where

\hat{\pi}_{\varepsilon}(\boldsymbol{\theta},\boldsymbol{x}\mid\boldsymbol{y})=\frac{\pi(\boldsymbol{\theta})p(\boldsymbol{x}\mid\boldsymbol{\theta})\min\{K_{\varepsilon}(\Delta(\boldsymbol{x},\boldsymbol{y})),K_{\varepsilon}(h(\boldsymbol{\theta}))\}}{\iint\pi(\boldsymbol{\theta})p(\boldsymbol{x}\mid\boldsymbol{\theta})\min\{K_{\varepsilon}(\Delta(\boldsymbol{x},\boldsymbol{y})),K_{\varepsilon}(h(\boldsymbol{\theta}))\}d\boldsymbol{x}d\boldsymbol{\theta}}

is the invariant distribution of Markov chain. The marginal posterior distribution of $\boldsymbol{\theta}$ obtained from the ejMCMC algorithm is

\hat{\pi}_{\varepsilon}(\boldsymbol{\theta}\mid\boldsymbol{y})=\frac{\pi(\boldsymbol{\theta})\int p(\boldsymbol{x}\mid\boldsymbol{\theta})\min\{K_{\varepsilon}(\Delta(\boldsymbol{x},\boldsymbol{y})),K_{\varepsilon}(h(\boldsymbol{\theta}))\}d\boldsymbol{x}}{\iint\pi(\boldsymbol{\theta})p(\boldsymbol{x}\mid\boldsymbol{\theta})\min\{K_{\varepsilon}(\Delta(\boldsymbol{x},\boldsymbol{y})),K_{\varepsilon}(h(\boldsymbol{\theta}))\}d\boldsymbol{x}d\boldsymbol{\theta}}.

(6)

Proposition 1 demonstrates that using the ejMCMC move does not violate the detailed balance condition of the Markov process. Moreover, Proposition 2 shows that the relationship between the posterior of the ejMCMC and that of the original ABC-MCMC.

Proposition 2.

(i)

The discrepancy function $h(\boldsymbol{\theta})$ satisfies that

\sup_{\boldsymbol{\theta}\in{\Theta_{1}/\mathcal{N}_{\Theta_{1}}}}\int p(\boldsymbol{x}\mid\boldsymbol{\theta})\mathbb{I}(\Delta(\boldsymbol{x},\boldsymbol{y})\leq h(\boldsymbol{\theta}))d\boldsymbol{x}\leq a,

where $\mathcal{N}_{\Theta_{1}}$ is a Lebesgue zero-measure subset of $\Theta_{1}$ and $a\in[0,1)$ is a constant. The target posterior distribution of ejMCMC approaches to that of ABC-MCMC in terms of $L_{1}$ distance when $a\to 0$ , where the $L_{1}$ -distance between $g(\cdot)$ and $f(\cdot)$ is defined as

D_{L_{1}}(g,f)=\int|g(\boldsymbol{\theta})-f(\boldsymbol{\theta})|d\boldsymbol{\theta}.

(7)

(ii)

If $K(\cdot)$ is a uniform kernel, the posterior distribution of ejMCMC is

\hat{\pi}_{\varepsilon}(\boldsymbol{\theta}\mid\boldsymbol{y})\propto\left\{\begin{array}[]{lcl}\pi(\boldsymbol{\theta})\int_{\Delta(\boldsymbol{x},\boldsymbol{y})<\varepsilon}p(\boldsymbol{x}\mid\boldsymbol{\theta})d\boldsymbol{x}&&\boldsymbol{\theta}\in\Theta^{\prime},\\ 0&&\boldsymbol{\theta}\in\Theta/\Theta^{\prime}.\end{array}\right.

It is equal to ABC-MCMC posterior if $\Theta_{1}/\Theta^{\prime}$ is a measure zero set.

Proposition 2 shows that the resulting Markov chain still converges to the true approximate Bayesian posterior under some conditions of $h(\cdot)$ . This implies that the ejMCMC algorithm provides a promising framework to Bayesian posterior approximation, without compromising on accuracy or convergence guarantees under some mild conditions. In some other cases, there may exist some mismatch between the posterior of ejMCMC and ABC-MCMC. In this article, we utilize the $L_{1}$ -distance shown in Eq. (7) to quantify the dissimilarity between two density functions.

Next we will prove the posterior concentration property of ejMCMC algorithm. Frazier et al. (2018) proved the posterior concentration property of the ABC posterior with the uniform kernel, that is, for any $\delta>0$ , and for some $\varepsilon>0$ ,

\Pi_{\varepsilon}\{d(\boldsymbol{\theta},\boldsymbol{\theta}_{0})>\delta\mid\boldsymbol{y}\}=\int_{d(\boldsymbol{\theta},\boldsymbol{\theta}_{0})>\delta}\pi_{\varepsilon}(\boldsymbol{\theta}\mid\boldsymbol{y})d\boldsymbol{\theta}=o_{P}(1),

where $d(\cdot,\cdot)$ is a metric on $\Theta$ . Theorem 1 shows that the posterior concentration of the algorithm still holds, with some technical assumptions of function $h(\cdot)$ . This property is important, because for any $A\subset\Theta$ , ABC posterior probability $\Pi_{\varepsilon}\{A\mid\boldsymbol{y}\}$ will be different from the exact posterior probability. Without the guarantees of exact posterior inference, knowing that $\Pi_{\varepsilon}\{\cdot\mid\boldsymbol{y}\}$ will concentrate on the true parameter $\boldsymbol{\theta}_{0}$ can be used as an effective way to express our uncertainty about $\boldsymbol{\theta}$ . The posterior concentration is related to the rate at which $\varepsilon$ goes to 0 and the rate at which the observed summaries $S(\boldsymbol{y})$ and the simulated summaries $S(\boldsymbol{x})$ converge to well defined limit counterparts $b(\boldsymbol{\theta}_{0})$ and $b(\boldsymbol{\theta})$ . Here we consider $S(\boldsymbol{x})\rightarrow b(\boldsymbol{\theta})$ as $T\rightarrow\infty$ , where $\boldsymbol{x}=(\boldsymbol{x}_{(1)},\ldots,\boldsymbol{x}_{(T)})^{\top}$ and $\boldsymbol{x}_{(t)}\sim p(\cdot\mid\boldsymbol{\theta})$ for $t=1,\ldots,T$ , and consider $\varepsilon$ as a $T$ -dependent sequence satisfying $\varepsilon_{T}\rightarrow 0$ as $T\rightarrow\infty$ . Here, $\Pi(\cdot)$ is a probability measure with prior density $\pi(\cdot)$ .

Assumption 2.

For all $\varepsilon$ , there exists a constant $0\leq\gamma_{1}<1$ satisfying that

\Pi\left[\left\{\boldsymbol{\theta}:h(\boldsymbol{\theta})\geq\varepsilon\text{ and }\rho(b(\boldsymbol{\theta}),b(\boldsymbol{\theta}_{0}))\leq{\varepsilon/3}\right\}\right]\leq\gamma_{1}\Pi\left[\rho(b(\boldsymbol{\theta}),b(\boldsymbol{\theta}_{0}))\leq{\varepsilon/3}\right].

Assumption 3.

For all $\varepsilon$ , there exists a constant $0\leq\gamma_{2}<1$ satisfying that

\iint_{\Theta^{\prime}}p(\boldsymbol{x}\mid\boldsymbol{\theta})\mathbb{I}(\rho(S(\boldsymbol{x}),S(\boldsymbol{y}))<h(\boldsymbol{\theta}))d\boldsymbol{x}d\Pi(\boldsymbol{\theta})\leq\gamma_{2}\iint_{\Theta^{\prime}}p(\boldsymbol{x}\mid\boldsymbol{\theta})\mathbb{I}(\rho(S(\boldsymbol{x}),S(\boldsymbol{y}))<\varepsilon)d\boldsymbol{x}d\Pi(\boldsymbol{\theta}).

Remark. Assumption 2 controls the proportion of $\boldsymbol{\theta}$ satisfying $h(\boldsymbol{\theta})\geq\varepsilon$ , that will be rejected before simulating, in a neighbourhood of $\boldsymbol{\theta}_{0}$ . Assumption 3 is required for ABC with a more general kernel than a uniform kernel. This assumption controls the proportion of acceptable $(\boldsymbol{\theta},\boldsymbol{x})$ that may be rejected early. Intuitively, Assumptions 2 and 3 about $h(\boldsymbol{\theta})$ are to ensure that a small proportion of $\boldsymbol{\theta}$ near $\boldsymbol{\theta}_{0}$ will be rejected before simulating. As detailed in Section 3.4.1, our proposed discrepancy model $h(\boldsymbol{\theta})$ can always ensure that only a small proportion of $\boldsymbol{\theta}$ near the true value are early rejected by adjusting the quantile prediction function. Without these assumptions about the early rejection stage, posterior concentration concentration about $\boldsymbol{\theta}_{0}$ cannot occur.

Theorem 1.

If Assumptions 1-3 of Frazier et al. (2018) and Assumptions 1-3 stated in this article are satisfied, the posterior concentration of the ejMCMC targeting Equation (6) still holds. For uniform kernel cases, Assumption 3 is not required for this property.

We have previously discussed the quality of the ABC estimates. Now, we turn to investigate the efficiency of the algorithm.

Proposition 3.

When MCMC algorithms reach the stationary distributions, the early rejection rates of ejMCMC algorithm and OejMCMC are

R_{ej}=1-\iint\min\{1,\tilde{\alpha}[(\boldsymbol{\theta},\boldsymbol{x}),\boldsymbol{\theta}^{*}]\}q(\boldsymbol{\theta}^{*}\mid\boldsymbol{\theta})\hat{\pi}_{\varepsilon}(\boldsymbol{\theta},\boldsymbol{x}\mid\boldsymbol{y})d\boldsymbol{x}d\boldsymbol{\theta}^{*}d\boldsymbol{\theta}

and

R_{Oej}=1-\iint\min\{1,\breve{\alpha}[(\boldsymbol{\theta},\boldsymbol{x}),\boldsymbol{\theta}^{*}]\}q(\boldsymbol{\theta}^{*}\mid\boldsymbol{\theta}){\pi}_{\varepsilon}(\boldsymbol{\theta},\boldsymbol{x}\mid\boldsymbol{y})d\boldsymbol{x}d\boldsymbol{\theta}^{*}d\boldsymbol{\theta},

respectively. For uniform kernel cases, if $\Theta_{1}-\Theta^{\prime}$ is a set of measure zero on parameter space $\Theta$ ,

R_{ej}-R_{Oej}\geq\int_{\Theta^{\prime}\cap\Theta_{1}}\int_{\Theta/\Theta^{\prime}}\min\{1,\frac{\pi(\boldsymbol{\theta}^{*})q(\boldsymbol{\theta}\mid\boldsymbol{\theta}^{*})}{\pi(\boldsymbol{\theta})q(\boldsymbol{\theta}^{*}\mid\boldsymbol{\theta})}\}q(\boldsymbol{\theta}^{*}\mid\boldsymbol{\theta})\pi_{\varepsilon}(\boldsymbol{\theta}\mid\boldsymbol{y})d\boldsymbol{\theta}^{*}d\boldsymbol{\theta},

(8)

where $\Theta^{\prime}=\{\boldsymbol{\theta}:h(\boldsymbol{\theta})<\varepsilon\}$ and $\Theta_{1}=\{\boldsymbol{\theta}:\pi_{\varepsilon}(\boldsymbol{\theta}\mid\boldsymbol{y})>0\}$ .

Proposition 3 shows that when the kernel function $K(\cdot)$ is uniform and the posterior of ejMCMC and OejMCMC are consistent, the efficiency of ejMCMC is expected to be higher since the term $R_{ej}-R_{Oej}$ in Equation (8) is always positive. This is because ejMCMC targets the posterior region more effectively by rejecting more candidates in the early rejection step, potentially reducing the computational cost of generating synthetic data sets. Furthermore, when the function $h(\boldsymbol{\theta})$ satisfies $h(\boldsymbol{\theta})>\varepsilon$ in all regions with negligible posterior probabilities, the efficiency of ejMCMC is maximized. This means that the discrepancy model $h(\boldsymbol{\theta})$ is able to identify high probability ABC posterior regions, leading to higher early rejection rates and improved efficiency of the algorithm. The ejMCMC algorithm can play a role in correcting prior information before simulations in some sense, when the prior information is poor. In this article, we use

\text{Eff}=\frac{\text{\# of early rejected parameters}}{\text{\# of rejected parameters}}

(9)

to quantify the early rejection efficiency of algorithms. This metric can be seen as a regularization of the early rejection rate, ensuring that the efficiency value is bounded between 0 and 1, with 1 being the best possible efficiency where all rejected parameters are early rejected. Proposition 4 shows the effect of the function $h(\cdot)$ on the early rejection efficiency of the ejMCMC algorithm.

Proposition 4.

For all $\boldsymbol{\theta}\in\Theta$ , larger probability of $\Delta(\boldsymbol{x},\boldsymbol{y})\leq h(\boldsymbol{\theta})$ can lead to higher early rejection efficiency of ejMCMC.

Combining Propositions 2 and 4, the probability of $\Delta(\boldsymbol{x},\boldsymbol{y})\leq h(\boldsymbol{\theta})$ keeps a balance between efficiency and accuracy.

3.3 An early rejection adaptive SMC

ABC-MCMC algorithm has several drawbacks. Firstly, the convergence speed of MCMC algorithm is often slow, especially in high-dimensional spaces. This may require a large number of iterations to obtain accurate approximations. Secondly, the MCMC algorithm is sensitive to the choice of initial values. Improper initial values can cause the Markov chain to get stuck in local optima, preventing effective exploration of the entire state space. In comparison with ABC-MCMC, ABC-SMC employs a sequential updating of particle weights to approximate the target distribution, which effectively mitigates the risk of getting stuck in local optima. Furthermore, the parallelization in SMC algorithms enhance their efficiency in complex applications. In this section, we introduce an early rejection ASMC (ejASMC) based on ABC adaptive SMC (ABC-ASMC) (Del Moral et al., 2012) and the ejMCMC method. The sequence of thresholds and MCMC kernels can be pre-specified before running the algorithm, but the algorithm may collapse with improperly selected thresholds and kernels. In this article, we adaptively tune the sequence of thresholds and ejMCMC proposal distributions.

Adaptive selection of thresholds: We adapt the scheme proposed in Del Moral et al. (2012) for selecting the sequence of thresholds, based on the key remark that the evaluation of weights $W_{r+1}^{(i)}={W}_{r}^{(i)}K_{\varepsilon_{r+1}}(\Delta(\boldsymbol{y},\boldsymbol{x}_{\boldsymbol{\theta}_{r}^{(i)}})/K_{\varepsilon_{r}}(\Delta(\boldsymbol{y},\boldsymbol{x}_{\boldsymbol{\theta}_{r}^{(i)}}))$ does not depend on particles of iteration $r+1$ , $\{\boldsymbol{\theta}_{r+1}^{(i)},\boldsymbol{x}_{r+1}^{(i)}\}_{i=1:N}$ . We select the threshold by controlling the proportion of unique alive particles, which is defined as

\operatorname{PA}(\{W_{r+1}^{(i)}\},\varepsilon_{r+1})=\frac{\sum_{i\in U}\mathbb{I}(W_{r+1}^{(i)}>0)}{N},

where $W_{r+1}^{(i)}$ is a function of the threshold $\varepsilon_{r+1}$ , and $U$ is the largest subset of the set $\{1:N\}$ such that for any two distinct elements $i$ and $j$ in $U$ , $\boldsymbol{\theta}_{r+1}^{(i)}\neq\boldsymbol{\theta}_{r+1}^{(j)}$ . The proportion of unique alive particles is also intuitively a sensible measure of ‘quality’ of our SMC approximation. The threshold $\varepsilon_{r+1}$ is selected to make sure that the proportion of alive particles is equal to a given percentage of the particles number

\operatorname{PA}(\{W_{r+1}^{(i)}\},\varepsilon_{r+1})=\gamma N

(10)

for $\gamma\in(0,1)$ . In practice, bisection is used to compute the root of Equation (10). The parameter $\gamma$ is a “quality” index for the resulting SMC approximation of the target. If $\gamma\rightarrow 1$ then we will move slowly towards the target but the SMC approximation is more accurate. However, if $\gamma\rightarrow 0$ then we can move very quickly towards the target but the resulting SMC approximation will be unreliable.

Adaptive MCMC proposal distribution: For each intermediate target, the SMC algorithm applies an MCMC move with invariant density $\pi_{\varepsilon_{r+1}}(\boldsymbol{\theta}\mid\boldsymbol{y})$ . MCMC moves are implemented by accepting candidate parameters with a certain probability given by the MH ratio of ejMCMC algorithm. For $\boldsymbol{\theta}_{r}^{(i)}$ with $w_{r+1}^{(i)}>0$ , we generate $\boldsymbol{\theta}^{*},\boldsymbol{x}^{*}$ according to a proposal $q_{r+1}(\boldsymbol{\theta}^{*}\mid\boldsymbol{\theta}_{r}^{(i)})p(\boldsymbol{x}^{*}\mid\boldsymbol{\theta}^{*})$ , then accept the candidate with probability $\hat{\alpha}[(\boldsymbol{\theta}_{r}^{(i)},\boldsymbol{x}_{r}^{(i)}),(\boldsymbol{\theta}^{*},\boldsymbol{x}^{*})]$ . We can adaptively determine the parameters of the proposal $q_{r+1}(\cdot\mid\cdot)$ based on the previous approximation of the target $\pi_{\varepsilon_{r}}$ . In this article, we use Gaussian distribution as the proposal distribution, and the Gaussian covariance matrix is determined adaptively by computing the covariance matrix of the weighted particles $\{\boldsymbol{\theta}_{r}^{(i)},W_{r}^{(i)}\}_{i=1:N}$ .

The output of ejASMC is the weighted particles $\{\boldsymbol{\theta}_{R}^{(i)},W_{R}^{(i)}\}_{i=1:N}$ , the empirical distribution of which is the ABC posterior estimate. We set $\varepsilon_{0}=\infty$ , then the initial target distribution is the prior distribution. Thus the initial weighted particles $\boldsymbol{\theta}_{0}^{(i)}\sim\pi(\boldsymbol{\theta})$ and $W_{0}^{(i)}=1/N$ for $i=1,\cdots,N$ . The detailed pseudo-code of ejASMC is shown in Appendix Section B.2.

3.4 Selection of discrepancy model

3.4.1 A Gaussian process discrepancy model $h(\cdot)$

In approximate Bayesian computation, whether a candidate parameter $\boldsymbol{\theta}^{*}$ will be retained as a sample of the posterior distribution is mainly determined by the discrepancy $\Delta(\boldsymbol{x},\boldsymbol{y})$ between the pseudo-data set $\boldsymbol{x}\sim p(\cdot\mid\boldsymbol{\theta}^{*})$ and the observed data $\boldsymbol{y}$ . To speed up the inference, Gutmann and Corander (2016) considered a uniform kernel in Eq.(1) and proposed to model the discrepancy $\Delta_{\boldsymbol{\theta}}=\Delta(\boldsymbol{x}_{\boldsymbol{\theta}},\boldsymbol{y})$ between the observed data $\boldsymbol{y}$ and the pseudo-data $\boldsymbol{x}_{\boldsymbol{\theta}}$ as a function of $\boldsymbol{\theta}$ . The estimated posterior with a uniform kernel for each $\boldsymbol{\theta}$ admits

\pi_{\varepsilon}(\boldsymbol{\theta}\mid\boldsymbol{y})\propto\pi(\boldsymbol{\theta})\mathbb{P}(\Delta_{\boldsymbol{\theta}}\leq\varepsilon),

(11)

where the probability is computed using the fitted surrogate model. Furthermore, for a continuous and strictly increasing function $g(\cdot)$ , we have $\mathbb{P}(g(\Delta_{\boldsymbol{\theta}})\leq g(\varepsilon))=\mathbb{P}(\Delta_{\boldsymbol{\theta}}\leq\varepsilon)$ . In some cases, modeling $g(\Delta_{\boldsymbol{\theta}})$ instead of $\Delta_{\boldsymbol{\theta}}$ as a function of $\boldsymbol{\theta}$ is easier and more suitable. For example, as $\Delta(\boldsymbol{y},\boldsymbol{x}_{\boldsymbol{\theta}})$ is a positive value for all $\boldsymbol{x}_{\boldsymbol{\theta}}$ , it may be a better choice to model log-discrepancy $\log(\Delta(\boldsymbol{y},\boldsymbol{x}_{\boldsymbol{\theta}}))$ .

Gutmann and Corander (2016); Järvenpää et al. (2019, 2020) utilize GP formulations to model the discrepancy. In GP regression, we assume that $\Delta_{\boldsymbol{\theta}}\sim\mathcal{N}(f(\boldsymbol{\theta}),\sigma^{2})$ and $f(\boldsymbol{\theta})\sim\mathcal{GP}(m(\boldsymbol{\theta}),k(\boldsymbol{\theta},\boldsymbol{\theta}^{\prime}))$ with a mean function $m(\cdot)$ : $\Theta\rightarrow{\mathbb{R}}$ and a covariance function $k(\cdot,\cdot)$ : $\Theta\times\Theta\rightarrow{\mathbb{R}}$ . For a training data set $\mathcal{D}_{s}=\{(\Delta_{1},\boldsymbol{\theta}_{1}),\cdots,(\Delta_{s},\boldsymbol{\theta}_{s})\}$ , the posterior predictive density for the latent function $f$ at $\boldsymbol{\theta}$ follows a Gaussian density with the mean and variance

\mu_{f\mid\mathcal{D}_{s}}(\boldsymbol{\theta})=\mathbf{k}_{s}(\boldsymbol{\theta})^{T}\mathbf{K}_{s}^{-1}(\boldsymbol{\theta})\Delta_{1:s},\quad v_{f\mid\mathcal{D}_{s}}(\boldsymbol{\theta})=k(\boldsymbol{\theta},\boldsymbol{\theta})-\mathbf{k}_{s}(\boldsymbol{\theta})^{T}\mathbf{K}_{s}^{-1}(\boldsymbol{\theta})\mathbf{k}_{s}(\boldsymbol{\theta})

(12)

respectively. Here $\Delta_{1:s}=\left(\Delta_{1},\cdots,\Delta_{s}\right)^{T}$ , $\mathbf{k}_{s}(\boldsymbol{\theta})=\left(k(\boldsymbol{\theta},\boldsymbol{\theta}_{1}),\cdots,k(\boldsymbol{\theta},\boldsymbol{\theta}_{s})\right)^{T}$ , and $\mathbf{K}_{s}(\boldsymbol{\theta})$ is a $s\times s$ matrix with $[\mathbf{K}_{s}(\boldsymbol{\theta})]_{ij}=k(\boldsymbol{\theta}_{i},\boldsymbol{\theta}_{j})+\sigma^{2}\mathbb{I}_{i=j}$ for $i=1,\ldots s$ and $j=1,\ldots s$ . The ABC posterior at $\boldsymbol{\theta}$ can be computed using the fitted GP as

\pi_{\varepsilon}(\boldsymbol{\theta}\mid\boldsymbol{y})\propto\pi(\boldsymbol{\theta})\mathbb{P}\left(\Delta_{\boldsymbol{\theta}}\leq\varepsilon\right)=\pi(\boldsymbol{\theta})\Phi\left(\left(\varepsilon-\mu_{f\mid\mathcal{D}_{s}}(\boldsymbol{\theta})\right)/\sqrt{v_{f\mid\mathcal{D}_{s}}(\boldsymbol{\theta})+\sigma^{2}}\right),

(13)

where $\varepsilon$ is the threshold and $\Phi$ is the cumulative density function of the standard Gaussian distribution.

Following Gutmann and Corander (2016) and Järvenpää et al. (2018, 2019, 2020), we use a Gaussian process model to establish the functional relationship between the discrepancy and the parameters, and propose an expression for $h(\boldsymbol{\theta})$ . We use an initial discrepancy-parameters pairs $\mathcal{D}_{s}$ to train a GP model. Based on the GP discrepancy model, we take

h_{a}(\boldsymbol{\theta})=\mu_{f\mid\mathcal{D}_{s}}(\boldsymbol{\theta})-Z_{a}\sqrt{v_{f\mid\mathcal{D}_{s}}(\boldsymbol{\theta})+\sigma^{2}},

(14)

where $\mathbb{P}(x<Z_{a})=a$ and $x\sim N(0,1)$ . Let $h(\boldsymbol{\theta})=h_{a}(\boldsymbol{\theta})$ , Propositions 2 and 4 indicate that the choice of $a$ is a trade-off between efficiency and accuracy of ejMCMC algorithm. A smaller value of $a$ will lead to more accurate estimation of ejMCMC algorithm but lower early rejection efficiency.

3.4.2 Selection of training data

Although the posterior estimated by an early rejection method based on the discrepancy model does not rely entirely on the fitted surrogate model like GP-ABC (Järvenpää et al., 2018), the accuracy of the discrepancy model is also critical to the results. Proposition 2 shows that we only need to ensure the accuracy of discrepancy model in approximation Bayesian posterior region, that is, the region with small discrepancy. In Gutmann and Corander (2016); Järvenpää et al. (2019), they use a GP discrepancy model to intelligently select the next parameter value $\boldsymbol{\theta}_{s+1}$ via minimization the acquisition function

\mathcal{A}_{s}(\boldsymbol{\theta})=\mu_{f\mid\mathcal{D}_{s}}(\boldsymbol{\theta})-\eta_{s}\sqrt{v_{f\mid\mathcal{D}_{s}}(\boldsymbol{\theta})},

(15)

where $\eta_{s}$ is a positive number related to $s$ , then run the computationally costly simulation model to obtain updated training data $\mathcal{D}_{s+1}$ . The coefficient $\eta_{s}$ is a trade-off between exploration and exploitation. This method of sequentially generating training points can ensure that there are enough training data points in areas with small discrepancy to ensure the accuracy of the model in this area. But it is computationally expensive to select a new training point every time by solving an optimization problem.

In this article, we consider ensuring the accuracy of the model by placing more training points in areas with smaller discrepancy via a pilot run of ABC-SMC algorithm with OejMCMC as proposals to sequentially generate batch data points. The detailed pseudo-code is shown in Appendix Section B.3. Our algorithm does not require to solve the optimization problem, and hence has lower computational cost. The budgets for the pilot ABC-SMC run could be relatively small (e.g. with a short sequence of tolerance parameters). Moreover, when the accuracy of the discrepancy model trained by ABC-SMC samples is not enough, the algorithm allows to continue to generate training data sequentially, by changing the algorithm termination conditions.

The computational complexity of training a GP model is $O(n^{3})$ . Increasing sample size of training data will significantly increase the computational burden. Based on Gutmann and Corander (2016); Järvenpää et al. (2018, 2019) and our exploratory study, we suggest collecting a few thousands data points for training GP discrepancy model. However, this may not be enough for large dimensional problems. With increasing number of iterations, we can use the set of accepted parameters and corresponding discrepancy terms as new training data to update the GP model within ejMCMC (Drovandi et al., 2018). In this case, we recommend using the online-GP algorithm (Stanton et al., 2021) to train the Gaussian process model, which can use new training data to update the model without using the full data sets.

4 Simulation study

4.1 A toy example

We first illustrate the advantage of our proposed method via a toy example. Suppose that the model of interest is a mixture of two normal distributions

p(y\mid\theta)=0.5\phi(y;\theta+2,0.6)+0.5\phi(y;\theta-1,0.6),

where $\phi(y;\mu,\sigma^{2})$ denotes the probability density function of $\mathcal{N}(\mu,\sigma^{2})$ evaluated at $y$ . Suppose that the prior distribution is $\pi(\theta)=\mathcal{U}(-6,6)$ , and the observed data associated with this model is $y_{0}=1$ . The discrepancy we use is $\Delta(y,y_{0})=|y-1|$ . We choose a uniform kernel with the threshold $\varepsilon=0.6$ for ABC algorithms. Note that we only have one observed data point in this example.

Firstly, we run rejection sampling ABC $10^{6}$ iterations and treat the resulting posterior distribution as ground truth. We then compare our ejMCMC with OejMCMC and GP-ABC (Järvenpää et al., 2018). We collect $2,000$ training data points from prior, which are then used to train the GP discrepancy model. The resulting GP model is used to approximate the posterior of GP-ABC, and we name this as GP-ABC₁. We then run GP-ABC, OejMCMC and ejMCMC with same computational budgets ( $N=100,000$ iterations). This GP-ABC with $N=100,000$ is named as GP-ABC₂. A random walk proposal $q(\theta\mid\theta_{n})\sim\mathcal{N}(\theta_{n},0.3^{2})$ is implemented for MCMC algorithms. In the implementation of ejMCMC, the discrepancy model $h_{a}(\theta)$ is a GP model trained with $a=0.05$ . Figure 1 shows the Gaussian process modeling for identifying the relationship between the discrepancy $\Delta$ and parameter $\theta$ . The x-axis denotes the parameter and the y-axis denotes the value of the discrepancy. Black dots denote the discrepancies, the red line is the mean function of GP and the area within two grey lines is the 95% predictive interval. In this example, the discrepancy for parameters is multi-modal, GP discrepancy model doesn’t fit the ‘shape’ of the discrepancy correctly. Figure 1 displays the estimated posterior distributions provided by GP-ABC₁, GP-ABC₂, ejMCMC, OejMCMC and the ground truth (rejection sampling). It indicates that both GP-ABC algorithms fail to capture a bimodal shape of the posterior, due to that they fail to capture the bimodal shape of discrepancy. This is also demonstrated in Järvenpää et al. (2018). Our ejMCMC is able to capture both peaks of the posterior. The $L_{1}$ distance ( $D_{L_{1}}$ ) between the estimated posterior by ejMCMC and the corresponding true posterior is 0.0559, lower than the $D_{L_{1}}$ between OejMCMC and ground truth (0.0748). We also implement ejMCMC with online-GP (Stanton et al., 2021) discrepancy model for this toy example. It is available at https://github.com/caofff/ejMCMC.

Refer to caption — Figure 1: (a) Gaussian process modeling for identifying the relationship between the discrepancy $\Delta$ and parameter $\theta$ . The x-axis denotes the parameter and the y-axis denotes the value of the discrepancy. Black dots denote the discrepancies, the red line is the mean function of GP and the area between the two grey lines is the 95% predictive interval. (b) Posterior density estimated by GP-ABC₁, GP-ABC₂, ejMCMC, OejMCMC, and rejection sampling (ground truth).

4.2 An ordinary differential equation model

In this subsection, we investigate the performance of ABC algorithms via an ordinary differential equation (ODE) example. We use the R package deSolve (Soetaert et al., 2010) to simulate differential equations. We generate ODE trajectories according to the following system,

	$\displaystyle\frac{dx_{1}(t)}{dt}$	$\displaystyle=$	$\displaystyle\frac{72}{36+x_{2}(t)}-\theta_{1},$
	$\displaystyle\frac{dx_{2}(t)}{dt}$	$\displaystyle=$	$\displaystyle\theta_{2}x_{1}(t)-1,$		(16)

where $\theta_{1}=2$ and $\theta_{2}=1$ , and the initial conditions $x_{1}(0)=7$ and $x_{2}(0)=-10$ . The observations $\boldsymbol{y}_{i}$ are simulated from a normal distribution with the mean $x_{i}(t|\boldsymbol{\theta})$ and the variance $\sigma_{i}^{2}$ , where $\sigma_{1}=1$ and $\sigma_{2}=3$ . We generate $121$ observations for each ODE function, equally spaced within $[0,60]$ . We use root mean squared error (RMSE) to model the discrepancy between the observed data and the pseudo-data. We take the following prior distributions $\pi(\boldsymbol{\theta})$ : $\theta_{1}\sim\mathcal{U}(1.8,2.2)$ , $\theta_{2}\sim\mathcal{U}(0.8,1.2)$ .

We conduct a performance comparison of two early rejection MCMC algorithms under different threshold levels. A smaller threshold value implies that the ABC posterior is closer to the actual posterior, and consequently, more simulation costs are required. In the ABC-MCMC algorithm, the convergence speed slows down with smaller thresholds. The 1% and 5% quantiles of the discrepancy correspond to the parameters that conform to the prior distribution are 4.03 and 4.68, respectively. Here we study the performance of ABC algorithms with $\varepsilon=4.8$ , $4.5$ , $4.1$ and $4.05$ . We observe the Markov chains mix poor when the threshold value is smaller. For each level of tolerance, we first run MCMC algorithms $1,000,000$ iterations with 3 different initializations. The Gelman-Rubin diagnostic (Gelman et al., 1995) is used to assess the convergence of the Markov chain. The resulting posterior can be regarded as ground truth. For each case, we also run both ejMCMC and OejMCMC $100,000$ iterations. We collect $3,000$ training data from prior for ejMCMC algorithm. Appendix Figure C.1 illustrates the efficacy of the Gaussian Process (GP) model’s fit and the positioning of the training data points. The discrepancy can be fitted well by the Gaussian process. The choice of $a$ in the GP discrepancy model is a trade off between efficiency and accuracy. For each threshold $\varepsilon$ , we run the ejMCMC algorithm with $a=0.5$ , $0.2$ , $0.1$ , $0.05$ and $0.01$ . Table 1 shows the results for $\varepsilon=4.5$ by varying $a$ (i.e. $h(\boldsymbol{\theta})$ ). This demonstrates that the selection of $h(\boldsymbol{\theta})$ keeps a balance between accuracy and computing speed. Results for the remaining three thresholds are shown in Appendix Section C.1. To balance the efficiency and accuracy, we choose $a=0.05$ .

Table 1: ODE:

\varepsilon=4.5

, accuracy and efficiency of ejMCMC varying

a

$a$	$N_{\text{sim}}$	$N_{\text{acc}}$	$D_{L_{1}}(\boldsymbol{\theta}_{1})$	$D_{L_{1}}(\boldsymbol{\theta}_{2})$	Eff
0.50	40039	36170	0.2071	0.1922	0.9394
0.20	47550	37574	0.0541	0.0459	0.8402
0.10	51398	37572	0.0374	0.0426	0.7785
0.05	55116	38017	0.0226	0.0216	0.7241
0.01	60464	37593	0.0273	0.0194	0.6335

Table 2 shows the number of synthetic data generated ( $N_{\text{sim}}$ ), number of collected posterior samples ( $N_{\text{acc}}$ ), and the $L_{1}$ -distance between true value and posterior estimates varying threshold parameter $\varepsilon$ . Under the settings of uniform prior and Gaussian proposal distribution, the efficiency of OejMCM is 0. While our proposed algorithm can achieve a considerable computational acceleration by saving $20\%-80\%$ of data simulation. The accuracy of the posterior distribution estimates has no drop compared to OejMCMC. The efficiency of algorithms with threshold values $4.8$ and $4.5$ is higher than cases with that of $4.1$ and $4.05$ . When the threshold is set to 4.1 or 4.05, the proportion of candidate parameters lying outside of the posterior region is low, resulting in a decrease of efficiency.

Table 2: Efficiency of OejMCMC and ejMCMC varying threshold parameter

\varepsilon

$\varepsilon$	method	$N_{\text{sim}}$	$N_{\text{acc}}$	$D_{L_{1}}(\boldsymbol{\theta}_{1})$	$D_{L_{1}}(\boldsymbol{\theta}_{2})$	Eff
4.8	OejMCMC	100000	39965	0.0207	0.0171	0
4.8	ejMCMC	52719	39549	0.0194	0.0216	0.7821
\hdashline4.5	OejMCMC	100000	37691	0.0244	0.0259	0
\hdashline4.5	ejMCMC	55116	38017	0.0226	0.0216	0.7241
\hdashline4.1	OejMCMC	100000	37672	0.0418	0.0452	0
\hdashline4.1	ejMCMC	76538	38218	0.0427	0.0360	0.3798
\hdashline4.05	OejMCMC	100000	39861	0.0445	0.0533	0
\hdashline4.05	ejMCMC	85325	39897	0.0428	0.0420	0.2442

Figure 2 shows the posterior distributions of the ODE model provided by the three ABC algorithms varying the threshold parameter $\varepsilon$ . The distributions provided by three methods are close, though ejMCMC slightly misses the tail of posteriors. This is because all candidate parameters satisfying $h(\boldsymbol{\theta})>\varepsilon$ in the posterior region are early rejected. However, the false rejection avoids the MCMC sampler “sticking” in some local region for long periods of time (marked by the red box in the Figure 2). We show an example of ABC-MCMC traceplots in Appendix Figure C.2.

In this study, we also employ the ejASMC algorithm to infer the posterior distribution of the parameters, with $N=2048$ samples generated for each iteration and the threshold parameters are updated by retaining 1024 (the scale parameter $\gamma=0.5$ ) active particles. The termination condition for the algorithm is set such that the number of simulated data exceeds 50,000. As shown in Figure 3, the posterior distribution becomes increasingly concentrated near the true value as the number of iterations increases. The threshold value $\varepsilon$ of the final iteration is $3.84$ . ABC-MCMC algorithms achieve very low acceptance probability using this threshold. Here, the early rejection efficiency of ejASMC algorithm is 0.3820. $L_{1}$ distances between the marginal distributions of parameters and exact Bayesian posterior are 1.0626 and 1.0247, which are both smaller than the results provided by ejMCMC and OejMCMC (Appendix Table C.5), due to a smaller threshold. The posterior density plot provided by ejASMC are shown in Appendix Figure C.3.

4.3 Stochastic kinetic networks

Consider the parameter inference of a multidimensional stochastic differential equation (SDE) model for biochemical networks (Golightly and Wilkinson, 2010). In the dynamic system, there are eight reactions ( ${R_{1},R_{2},\cdots,R_{8}}$ ) that represent a simplified model of the auto-regulation mechanism in prokaryotes, based on protein dimerization that inhibits its own transcription. The reactions are defined as follows

$\displaystyle R_{1}:$	$\displaystyle~{}\text{DNA}+P_{2}\rightarrow\text{DNA}\cdot P_{2}$	$\displaystyle R_{2}:$	$\displaystyle~{}\text{DNA}\cdot P2\rightarrow\text{DNA}+P_{2}$
$\displaystyle R_{3}:$	$\displaystyle~{}\text{DNA}\rightarrow\text{DNA}+\text{RNA}$	$\displaystyle R_{4}:$	$\displaystyle~{}\text{RNA}\rightarrow\text{RNA}+P$
$\displaystyle R_{5}:$	$\displaystyle~{}2P\rightarrow P_{2}$	$\displaystyle R_{6}:$	$\displaystyle~{}P_{2}\rightarrow 2P$
$\displaystyle R_{7}:$	$\displaystyle~{}\text{RNA}\rightarrow\emptyset$	$\displaystyle R_{8}:$	$\displaystyle~{}P\rightarrow\emptyset.$

Consider the time evolution of the system as a Markov process with state $\boldsymbol{X}_{t}$ , where $\boldsymbol{X}_{t}=(\text{RNA},P,P_{2},\text{DNA})^{T}$ represents the number of molecules of each species at time $t$ (i.e. non-negative integers), and the corresponding stoichiometric matrix can be represented as

S=\left(\begin{array}[]{cccccccc}0&0&1&0&0&0&-1&0\\ 0&0&0&1&-2&2&0&-1\\ -1&1&0&0&1&-1&0&0\\ -1&1&0&0&0&0&0&0\end{array}\right).

We refer readers to Section 6.2 of Picchini (2014) for a more detailed description of this networks.

Now we assume that a constant rate $\theta_{i}$ $(i=1,2,\cdots,8)$ is associated to each reaction $i$ , the stochastic differential equation (SDE) for this model can be written as

d\boldsymbol{X}_{t}=\boldsymbol{S}\cdot\boldsymbol{h}(\boldsymbol{X}_{t},\boldsymbol{\theta})dt+\boldsymbol{S}\sqrt{\text{diag}(\boldsymbol{h}(\boldsymbol{X}_{t},\boldsymbol{\theta}))}d\boldsymbol{W}_{t},

(17)

where $\boldsymbol{h}(\boldsymbol{X}_{t},\boldsymbol{\theta})=(\theta_{1}\cdot\text{DNA}\cdot P_{2},\theta_{2}\cdot(k-\text{DNA}),\theta_{3}\cdot\text{DNA},\theta_{4}\cdot\text{RNA},\theta_{5}\cdot P\cdot(P-1)/2,\theta_{6}\cdot P_{2},\theta_{7}\cdot\text{RNA},\theta_{8}\cdot P)^{T}$ is a “propensity function”, $k$ is a constant that satisfies $\text{DNA}\cdot P_{2}+\text{DNA}=k$ , and $h_{i}(\operatorname{X}_{t},\theta_{i})dt$ is the probability of reaction $i$ occurring in the time interval $(t,t+dt]$ , $dW_{t}=(dW_{t,1},\ldots,dW_{t,8})^{T}$ is a Wiener process, where $dW_{t,i}$ is independently and identically distributed as $dW_{t,i}\sim N(0,dt)$ for $i=1,\ldots,8$ . The goal is to infer the reaction rates $\boldsymbol{\theta}$ in different experimental settings.

We consider three scenarios as in Picchini (2014). $\mathcal{D}_{1}$ : fully observed data without measurement error. $\mathcal{D}_{2}$ : fully observed data with measurement error $\epsilon_{ij}\sim N(0,5)$ independently for every $i$ and for each coordinate $j$ of the state vector $\boldsymbol{X}_{t}$ . $\mathcal{D}_{3}$ : partially observed data (i.e. the DNA coordinate is unobserved) with measurement error $\epsilon_{ij}\sim N(0,\sigma^{2}I_{3})$ , and $\sigma$ is unknown.

We use ejMCMC and OejMCMC algorithms to estimate the ABC posterior. The ABC-ASMC algorithm with OejMCMC move is used to collect $2000$ training data, with 500 particles, and $\varepsilon$ is updated by keeping 250 unique particles. In ABC-SMC, RMSE is used to model the discrepancy $\Delta$ between the observed data and the simulated data.

For $\mathcal{D}_{1}$ and $\mathcal{D}_{2}$ , the threshold is set to $\varepsilon=0.03$ . We use the Gaussian kernel function as the proposal distribution of MCMC move, and the covariance matrix of parameters with discrepancy less than 0.05 in the training data serves as the covariance parameter of the Gaussian kernel. For $\mathcal{D}_{3}$ , we set $\varepsilon=0.05$ , and training data with a discrepancy less than $0.06$ is used to tune the covariance matrix of the Gaussian kernel.

For all three scenarios, we run both ejMCMC and OejMCMC algorithms $500,000$ iterations. Table 3 shows the computing details of both algorithms. The results indicate that the efficiency of ejMCMC is higher than OejMCMC in all three scenarios. In scenario $\mathcal{D}_{2}$ , the ejMCMC algorithm can accelerate about 41.57% compared with the OejMCMC algorithm.

Table 3: Number of iterations (

N_{\text{ite}}

), number of predictions (

N_{\text{pre}}

), number of synthetic data generated (

N_{\text{sim}}

), number of accepted posterior samples (

N_{\text{acc}}

), early rejection efficiency for different scenarios.

	method	$N_{\text{ite}}$	$N_{\text{pre}}$	$N_{\text{sim}}$	$N_{\text{acc}}$	Eff
$\mathcal{D}_{1}$	OejMCMC	500,000	-	124054	10574	0.7681
$\mathcal{D}_{1}$	ejMCMC	500,000	123418	91580	10568	0.8345
\hdashline $\mathcal{D}_{2}$	OejMCMC	500,000	-	120423	7566	0.7708
\hdashline $\mathcal{D}_{2}$	ejMCMC	500,000	120367	73632	7935	0.8665
\hdashline $\mathcal{D}_{3}$	OejMCMC	500,000	-	299656	23185	0.4202
\hdashline $\mathcal{D}_{3}$	ejMCMC	500,000	298229	271459	20045	0.4762

For three scenarios, the fitted GP discrepancy model is shown in Appendix Figure C.4-C.6, and the boxplots of the posterior samples are shown in Figure C.7-C.9 of Appendix C.2. The GP-ABC posterior (estimated using the training points) is less accurate than OejMCMC and ejMCMC since the GP model does not fit the discrepancy very well. Under scenario $D_{3}$ , the true values of the parameters $\theta_{3}$ and $\theta_{7}$ are not within the 95% credible interval of the OejMCMC algorithm, but are within those of the ejMCMC algorithm. The estimation results obtained by both algorithms in another two scenarios are similar, and the true values of the parameters are included in the estimated 95% credible intervals. We refer readers to Appendix C.2 for more details of setups and results.

5 Real Data Analysis

In this section, we consider a delay differential equation(DDE) example about the resource competition in the laboratory populations of Australian sheep blowflies (Lucilia cuprina), which is a classic experiment studied by Nicholson (1954). The room temperature of cultivating blowflies were maintained at 25°C. Appendix Figure C.10 displays the counts of blowflies, which are observed at $N=137$ time points. The resource limitation in the dynamic system of the blowfly population acts with a time delay, which causes the fluctuations shown in the blowfly population. The fluctuations displayed in the blowfly population are caused by the time lag between stimulus and reaction (Berezansky et al., 2010).

May (1976) proposed to model the counts of blowflies with the following DDE model

\displaystyle\frac{dx(t)}{dt}

\displaystyle=

\displaystyle\nu x(t)[1-x(t-\tau)/(1000\cdot P)],

(18)

where $x(t)$ is the blowfly population, $\nu$ is the rate of increase of the blowfly population, $P$ is a resource limitation parameter set by the supply of food, and $\tau$ is the time delay, roughly equal to the time for an egg to grow up to a pupa. Our goal is to estimate the initial value, $x(0)$ , and the three parameters, $\nu$ , $P$ , and $\tau$ , from the noisy blowfly data $y(t)$ . The observed counts of blowflies $y(t)$ is assumed to be lognormal distributed with the mean $x(t)$ and the variance $\sigma^{2}$ .

For convenience of description, we assume $\boldsymbol{\theta}=\{X_{0},v,P,\tau\}$ , and let $\boldsymbol{y}$ denote the observed data. We define the following prior distributions for $\boldsymbol{\theta}:~{}\log(X_{0})\sim\mathcal{N}(8.5,0.3^{2}),~{}\log(v)\sim\mathcal{N}(-1.35,0.2^{2}),~{}\log(P)\sim\mathcal{N}(0.8,0.3^{2}),~{}\log(\tau)\sim\mathcal{N}(2.25,0.08^{2})$ . We use root mean square error as the discrepancy function $\Delta$ for ABC. The simulation data are generated via the Euler-Maruyama method (Maruyama, 1955) implemented in R package deSolve (Soetaert et al., 2010), using a constant step-size 0.1. For the cheap simulator of DAMCMC algorithm, we set the step-size to 2 for the first forty time units, and set the step size of the rest time unit to the time interval of the observation data.

We collect $3000$ training data by running an ABC-ASMC algorithm with $500$ particles, and use the resulting parameter-discrepancy pairs to train the GP discrepancy model. The predict function $h_{a}(\boldsymbol{\theta})$ is the GP model trained using the $3000$ samples with $a=0.05$ . We run ejMCMC, OejMCMC and DAMCMC $500,000$ iterations. The ejMCMC algorithm simulates $110,564$ synthetic data and saves $93.09\%$ of the cost, while OejMCMC algorithm simulates $291,250$ synthetic data and saves only $49.91\%$ of the cost. Our algorithm ejMCMC significantly improve the computational speed compared with OejMCMC. Both algorithms accept approximately $82,000$ samples. The DAMCMC algorithm runs the cheap simulator $500,000$ times and the expensive simulator $82,692$ times, and accepts $66,028$ samples. In addition, we compare the computing time of three MCMC methods as a function of the step size. The results are reported in Appendix Section C.3.

Figure 4 displays the marginal ABC posterior distributions produced by the OejMCMC, ejMCMC and DAMCMC algorithms. Figure 4 shows that the posterior of ejMCMC algorithm is consistent with that of OejMCMC algorithm. However, the posterior generated by the DAMCMC algorithm differs from that of OejMCMC and ejMCMC algorithms. Figure 5 shows the two-dimensional joint posterior distributions of the parameters obtained by using the ejMCMC algorithm. The posterior means and 95% credible intervals of $\boldsymbol{\theta}$ provided by the OejMCMC, ejMCMC and DAMCMC algorithms are shown in Appendix Figure C.6. We also compute the correlation between posterior samples: $corr(\nu,P)=0.264$ , $corr(\nu,\tau)=-0.882$ , $corr(P,\tau)=-0.112$ . Recall that $\nu$ is the rate of increase of the blowfly population, $P$ is a resource limitation parameter set by the supply of food, and $\tau$ is the time delay that roughly equal to the time for an egg to grow up to a pupa. The positive correlation between $\nu$ and $P$ indicates that the blowfly population grows faster when there is a sufficient food supply. The tiny negative value of the correlation between $\tau$ and $P$ implies that the amount of food supply has a small impact on the period of being a pupa. The negative correlation between $\tau$ and $\nu$ indicates that the blowfly population will increase slower if the period for an egg to grow up to a pupa is longer.

6 Conclusion

We propose the early rejection ABC methods to accelerate parameter inference for complex statistical models. The main idea is to accurately reject parameter samples before generating synthetic data, and this is achieved by using a Gaussian process modeling to describe the relationship between the parameter and discrepancy. Our proposed method has two advantages compared with existing work. First, it utilizes MCMC moves for sampling parameters, which is more efficient in high-dimensional cases than sampling from a prior distribution. Second, it uses a pseudo Metropolis Hasting acceptance ratio to early reject samples in state space that are not worth exploring. The new method inherits the detailed balance condition and the posterior concentration property. It is also theoretically more efficient than the existing early rejection method.

We use several schemes to improve the performance of ejMCMC. To ensure the accuracy of the discrepancy model, we sequentially generate training points in areas with smaller discrepancy via a pilot run of ABC-SMC. We propose a prediction function for discrepancy model that balances accuracy and computational speed. An adaptive early rejection SMC is introduced by combining ABC-SMC and ejMCMC method. The ejASMC can adaptively select the sequence of thresholds and MCMC kernels, the particles are initialized via uniform design.

There are several lines of extensions and improvements for future work. We utilize identical kernel function and threshold value for $\Delta(x,y)$ and $h(\boldsymbol{\theta})$ in the pseudo acceptance ratio (i.e. $\min\left\{K_{\varepsilon}(\Delta(x,y)),K_{\varepsilon}(h(\boldsymbol{\theta}))\right\}$ ) of ejMCMC. Using different kernel functions and threshold values can enhance the flexibility of algorithm. For example, decreasing the threshold of $K_{\varepsilon}(h(\boldsymbol{\theta}))$ can decrease the rate of false positive. One future extension is to propose pseudo acceptance ratio with more general kernel functions. MCMC approaches often require a large number of simulations to make sure that the Markov chain converges to the stationary distribution. This limits their application to models that only a few hundreds simulations are possible. Compare to MCMC methods, importance sampling does not require to assess convergence and the consistency property holds under mild conditions. The efficiency of importance sampling methods mainly depends on proposal distributions. Our second line of future research is to combine GP-ABC and importance sampling approaches, by serving the fitted GP model as an efficient proposal distribution in importance sampling framework. In addition, we will investigate using the resulting importance sampling distribution to design efficient MCMC proposal distributions to improve the mixing of Markov chain. The current selection of discrepancy model is limited to models defined on Euclidean space. Lastly, we will extend $h(\boldsymbol{\theta})$ to problems defined on more general space (e.g. a phylogenetic tree) in our future research.

Acknowledge

This work was supported by the National Natural Science Foundation of China (12131001 and 12101333), the startup fund of ShanghaiTech University, the Fundamental Research Funds for the Central Universities, LPMC, and KLMDASR. The authorship is listed in alphabetic order.

SUPPLEMENTAL MATERIALS

Appendix A:: The proofs of all theoretical results presented in the main text.
Appendix B:: Some description of algorithms not shown in the main text.
Appendix C:: Some numerical results and some details of setups not shown in the main text.

References

Åkesson et al. (2021) Åkesson, M., P. Singh, F. Wrede, and A. Hellander (2021). Convolutional neural networks as summary statistics for approximate bayesian computation. IEEE/ACM Transactions on Computational Biology and Bioinformatics 19(6), 3353–3365.
Altman (1992) Altman, N. S. (1992). An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician 46(3), 175–185.
Beaumont et al. (2002) Beaumont, M. A., W. Zhang, and D. J. Balding (2002, 12). Approximate Bayesian computation in population genetics. Genetics 162(4), 2025–2035.
Berezansky et al. (2010) Berezansky, L., E. Braverman, and L. Idels (2010). Nicholson’s blowflies differential equations revisited: main results and open problems. Applied Mathematical Modelling 34(6), 1405–1417.
Blum et al. (2013) Blum, M. G., M. A. Nunes, D. Prangle, and S. A. Sisson (2013). A comparative review of dimension reduction methods in approximate bayesian computation. Statistical Science 28(2), 189–208.
Brooks et al. (2011) Brooks, S., A. Gelman, G. Jones, and X.-L. Meng (2011). Handbook of markov chain monte carlo, Chapter Likelihood-Free MCMC, pp. 313–333. New York: CRC press.
Buchholz and Chopin (2019) Buchholz, A. and N. Chopin (2019). Improving approximate Bayesian computation via Quasi-Monte Carlo. Journal of Computational and Graphical Statistics 28(1), 205–219.
Christen and Fox (2005) Christen, J. A. and C. Fox (2005). Markov chain monte carlo using an approximation. Journal of Computational and Graphical statistics 14(4), 795–810.
Clarté et al. (2020) Clarté, G., C. P. Robert, R. J. Ryder, and J. Stoehr (2020, 11). Componentwise approximate Bayesian computation via Gibbs-like steps. Biometrika 108(3), 591–607.
Cleveland and Devlin (1988) Cleveland, W. S. and S. J. Devlin (1988). Locally weighted regression: an approach to regression analysis by local fitting. Journal of the American statistical association 83(403), 596–610.
Del Moral et al. (2006) Del Moral, P., A. Doucet, and A. Jasra (2006). Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68(3), 411–436.
Del Moral et al. (2012) Del Moral, P., A. Doucet, and A. Jasra (2012). An adaptive sequential Monte Carlo method for approximate Bayesian computation. Statistics and Computing 22(5), 1009–1020.
Doucet et al. (2001) Doucet, A., N. De Freitas, and N. Gordon (2001). An introduction to sequential Monte Carlo methods. In Sequential Monte Carlo methods in practice, pp. 3–14. Springer.
Drovandi et al. (2018) Drovandi, C. C., M. T. Moores, and R. J. Boys (2018). Accelerating pseudo-marginal mcmc using gaussian processes. Computational Statistics & Data Analysis 118, 1–17.
Epanechnikov (1969) Epanechnikov, V. A. (1969). Non-parametric estimation of a multivariate probability density. Theory of Probability & Its Applications 14(1), 153–158.
Everitt and Rowińska (2021) Everitt, R. G. and P. A. Rowińska (2021). Delayed acceptance ABC-SMC. Journal of Computational and Graphical Statistics 30(1), 55–66.
Fearnhead and Prangle (2012) Fearnhead, P. and D. Prangle (2012). Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation [with discussion]. Journal of the Royal Statistical Society. Series B (Statistical Methodology) 74(3), 419–474.
Frazier et al. (2018) Frazier, D. T., G. M. Martin, C. P. Robert, and J. Rousseau (2018). Asymptotic properties of approximate bayesian computation. Biometrika 105(3), 593–607.
Frazier et al. (2022) Frazier, D. T., D. J. Nott, C. Drovandi, and R. Kohn (2022). Bayesian inference using synthetic likelihood: Asymptotics and adjustments. Journal of the American Statistical Association 0(0), 1–12.
Gelman et al. (1995) Gelman, A., J. B. Carlin, H. S. Stern, and D. B. Rubin (1995). Bayesian data analysis. Chapman and Hall/CRC.
Golightly and Wilkinson (2010) Golightly, A. and D. Wilkinson (2010). Learning and Inference for Computational Systems Biology, Chapter Markov chain Monte Carlo algorithms for SDE parameter estimation, pp. 253–276. MIT Press.
Gutmann and Corander (2016) Gutmann, M. U. and J. Corander (2016). Bayesian optimization for likelihood-free inference of simulator-based statistical models. Journal of Machine Learning Research 17(125), 1–47.
Jiang et al. (2017) Jiang, B., T.-y. Wu, C. Zheng, and W. H. Wong (2017). Learning summary statistic for approximate bayesian computation via deep neural network. Statistica Sinica 27(4), 1595–1618.
Joyce and Marjoram (2008) Joyce, P. and P. Marjoram (2008). Approximately sufficient statistics and bayesian computation. Statistical applications in genetics and molecular biology 7(1), Article 26.
Järvenpää et al. (2019) Järvenpää, M., M. Gutmann, A. Vehtari, and P. Marttinen (2019, 06). Efficient acquisition rules for model-based approximate bayesian computation. Bayesian Analysis 14(2), 595–622.
Järvenpää et al. (2018) Järvenpää, M., M. U. Gutmann, A. Vehtari, and P. Marttinen (2018). Gaussian process modelling in approximate Bayesian computation to estimate horizontal gene transfer in bacteria. The Annals of Applied Statistics 12(4), 2228 – 2251.
Järvenpää et al. (2020) Järvenpää, M., A. Vehtari, and P. Marttinen (2020). Batch simulations and uncertainty quantification in gaussian process surrogate approximate bayesian computation. In Conference on Uncertainty in Artificial Intelligence, pp. 779–788. PMLR.
Marin et al. (2012) Marin, J.-M., P. Pudlo, C. P. Robert, and R. J. Ryder (2012). Approximate bayesian computational methods. Statistics and Computing 22(6), 1167–1180.
Marjoram et al. (2003) Marjoram, P., J. Molitor, V. Plagnol, and S. Tavaré (2003). Markov chain Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences 100(26), 15324–15328.
Maruyama (1955) Maruyama, G. (1955). Continuous markov processes and stochastic equations. Rendiconti del Circolo Matematico di Palermo 4, 48–90.
May (1976) May, R. M. (1976). Models for single populations. In R. M. May (Ed.), Theoretical ecology, pp. 4–25. Philadelphia, Pennsylvania, USA: W. B. Saunders Company.
Nicholson (1954) Nicholson, A. J. (1954). An outline of the dynamics of animal populations. Australian Journal of Zoology 2(1), 9–65.
Nunes and Balding (2010) Nunes, M. A. and D. J. Balding (2010). On optimal selection of summary statistics for approximate bayesian computation. Statistical applications in genetics and molecular biology 9(1), 34.
Picchini (2014) Picchini, U. (2014). Inference for SDE models via approximate Bayesian computation. Journal of Computational and Graphical Statistics 23(4), 1080–1100.
Price et al. (2017) Price, L. F., C. C. Drovandi, A. Lee, and D. J. Nott (2017, jul). Bayesian synthetic likelihood. Journal of Computational and Graphical Statistics 27(1), 1–11.
Pritchard et al. (1999) Pritchard, J. K., M. T. Seielstad, A. Perez-Lezaun, and M. W. Feldman (1999, 12). Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Molecular Biology and Evolution 16(12), 1791–1798.
Pudlo et al. (2016) Pudlo, P., J.-M. Marin, A. Estoup, J.-M. Cornuet, M. Gautier, and C. P. Robert (2016). Reliable abc model choice via random forests. Bioinformatics 32(6), 859–866.
Sisson and Fan (2010) Sisson, S. A. and Y. Fan (2010). Likelihood-free markov chain monte carlo. In S. Brooks, G. L. Jones, and X.-L. Meng (Eds.), Handbook of Markov Chain Monte Carlo. Chapman and Hall/CRC Press. In press.
Sisson et al. (2018) Sisson, S. A., Y. Fan, and M. Beaumont (2018). Handbook of approximate Bayesian computation. CRC Press.
Sisson et al. (2007) Sisson, S. A., Y. Fan, and M. M. Tanaka (2007). Sequential Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences 104(6), 1760–1765.
Soetaert et al. (2010) Soetaert, K., T. Petzoldt, and R. W. Setzer (2010). Solving differential equations in r: package desolve. Journal of statistical software 33, 1–25.
Stanton et al. (2021) Stanton, S., W. Maddox, I. Delbridge, and A. G. Wilson (2021). Kernel interpolation for scalable online gaussian processes. In International Conference on Artificial Intelligence and Statistics, pp. 3133–3141. PMLR.
Wegmann et al. (2009) Wegmann, D., C. Leuenberger, and L. Excoffier (2009, 08). Efficient approximate Bayesian computation coupled with Markov chain Monte Carlo without likelihood. Genetics 182(4), 1207–1218.
Wilkinson (2014) Wilkinson, R. (2014). Accelerating abc methods using gaussian processes. In Artificial Intelligence and Statistics, pp. 1015–1023. PMLR.

Using early rejection Markov chain Monte Carlo and Gaussian processes to accelerate ABC methods

Abstract

1 Introduction

2 Approximate Bayesian computation methods

2.1 Basics of Approximate Bayesian computation

2.2 ABC Markov chain Monte Carlo

2.3 ABC Sequential Monte Carlo

3 An early rejection method based on discrepancy models

3.1 Early rejection MCMC based on a discrepancy model

3.2 Properties of ejMCMC algorithm

Assumption 1.

Proposition 1.

Proposition 2.

Assumption 2.

Assumption 3.

Theorem 1.

Proposition 3.

Proposition 4.

3.3 An early rejection adaptive SMC

3.4 Selection of discrepancy model

3.4.1 A Gaussian process discrepancy model h​(⋅)h(\cdot)

3.4.2 Selection of training data

4 Simulation study

4.1 A toy example

4.2 An ordinary differential equation model

4.3 Stochastic kinetic networks

5 Real Data Analysis

6 Conclusion

Acknowledge

References

3.4.1 A Gaussian process discrepancy model $h(\cdot)$