
aeons: approximating the end of nested sampling

Zixiao Hu,1,2 Artem Baryshnikov,1,2 Will Handley1,2
1Astrophysics Group, Cavendish Laboratory, J. J. Thomson Avenue, Cambridge, CB3 0HE, UK
2Kavli Institute for Cosmology, Madingley Road, Cambridge, CB3 0HA, UK
E-mail: zixiao.hu.09@gmail.com; wh260@cam.ac.uk
Abstract

This paper presents analytic results on the anatomy of nested sampling, from which a technique is developed to estimate the run-time of the algorithm that works for any nested sampling implementation. We test these methods on both toy models and true cosmological nested sampling runs. The method gives an order-of-magnitude prediction of the endpoint at all times, and forecasts the true endpoint to within standard error from around the halfway point.

keywords:
methods: data analysis – methods: statistical

1 Introduction

Nested sampling is a multi-purpose algorithm invented by John Skilling which simultaneously functions as a probabilistic sampler, integrator and optimiser (Skilling, 2006). It was immediately adopted for cosmology, and is now used in a wide range of physical sciences including particle physics (Trotta et al., 2008), materials science (Pártay et al., 2021) and machine learning (Higson et al., 2018a). The core algorithm is unique in its estimation of volumes by counting, coupling together nested Monte Carlo integrals in a way that makes high-dimensional integration feasible and robust. It also avoids problems that challenge traditional Bayesian samplers, such as posterior multi-modality and phase transitions.

The order of magnitude of the run-time of an algorithm, that is, whether termination is hours or weeks and months away, is of high importance to the end user. Currently, existing implementations of nested sampling (e.g. Feroz et al. 2009; Handley et al. 2015; Brewer & Foreman-Mackey 2016; Speagle 2020; Buchner 2021; Williams et al. 2021; McEwen et al. 2023) either do not give an indication of remaining run-time, or only provide crude measures of progress that do not directly correspond to the true endpoint.

This paper sets out a principled manner of endpoint estimation for nested sampling at each intermediate stage (as shown in Fig. 1), the key idea being to use the existing samples to predict the likelihood in the region we have yet to sample from. We begin with an overview of nested sampling in Section 2, followed by an examination of the anatomy of a nested sampling run to establish key concepts for endpoint prediction in Section 3. Section 4 then outlines the methodology we use, including discussion and comparisons to previous attempts. Finally, Section 5 presents the results and discussions for toy and cosmological chains, before we conclude.

Predicted endpoint: 25054 +/- 242
Progress: [=================>########] 72%
___________________
lives      |   500 |
phantoms   | 24310 |
posteriors | 18018 |
equals     |   245 |
-------------------
ncluster   =  1/1
ndead      =  18018
nposterior =  18018
nequals    =  249
nlike      =  4159049
<nlike>    =  491.04 (9.82 per slice)
log(Z)     =  -12.55 +/- 0.27

Figure 1: Output from PolyChord for a typical nested sampling run. The predicted endpoint, shown in red, is calculated using the method described in this paper.

2 Background

We begin with a brief description of the nested sampling algorithm to establish the necessary notation. For a more comprehensive treatment, we recommend the original paper (Skilling, 2006), the Sivia & Skilling (2006) textbook, as well as the excellent technical review by Buchner (2023) and Nature review by Ashton et al. (2022).

For a given likelihood \mathcal{L}(\theta) and prior \pi(\theta), nested sampling simultaneously calculates the Bayesian evidence

\mathcal{Z}=\int\mathcal{L}(\theta)\,\pi(\theta)\,\mathrm{d}\theta (1)

while producing samples of the posterior distribution

\mathcal{P}(\theta)=\frac{\mathcal{L}(\theta)\pi(\theta)}{\mathcal{Z}}. (2)

The algorithm operates by maintaining a set of n_{i} live points sampled from the prior, which can vary in number throughout the run (Higson et al., 2019). At each iteration i, the point with the lowest likelihood is removed and added to a list of dead points (illustrated in Fig. 2). New points are then (optionally) drawn from the prior, subject to the constraint that they must have a higher likelihood than the latest dead point \mathcal{L}_{i}. Repeating the procedure leads to the live points compressing around peaks in the likelihood. If there are more births than deaths, the number of live points n_{i} increases, whilst choosing not to generate new points reduces the number of live points. The exact creation schedule depends on the dynamic nested sampling strategy.

The integral in (1) is then evaluated by transformation to a one-dimensional integral over the prior volume X,

\mathcal{Z}=\int_{0}^{1}\mathcal{L}(X)\,\mathrm{d}X\approx\sum_{i=1}\mathcal{L}_{i}\,\tfrac{1}{2}(X_{i-1}-X_{i+1}), (3)

where X(\mathcal{L}) is the fraction of the prior with a likelihood greater than \mathcal{L}. Whilst the likelihood contour \mathcal{L}_{i} at each iteration is known, the prior volumes X_{i} must be statistically estimated as follows: one can define a shrinkage factor t_{i} at each iteration X_{i}=t_{i}X_{i-1}, such that

X_{i}=\prod_{k=1}^{i}t_{k}. (4)

The t_{i} are the maximum of n_{i} points drawn uniformly from [0,1], so follow the distribution

P(t_{i})=n_{i}t_{i}^{n_{i}-1}, (5)
\langle\log t_{i}\rangle=-\frac{1}{n_{i}},\quad\mathrm{Var}(\log t_{i})=\frac{1}{n_{i}^{2}}. (6)

The algorithm terminates when a user-specified condition is met; a popular choice is when the evidence in the live points falls below some fraction \epsilon of the accumulated evidence, e.g. 10^{-3}, which is proven to be a valid convergence criterion (Evans, 2006; Chopin & Robert, 2010). Much of the existing literature treats this remaining evidence separately, for instance by estimating it as the termination X multiplied by the average likelihood amongst the remaining live points. It is, however, quantitatively equivalent but qualitatively neater to consider termination as killing the remaining live points off one-by-one, incrementing the evidence exactly as during the run with decreasing n_{i} (Speagle, 2020).

Uncertainties in the evidence are dominated by the spread in the prior volume distribution, and the simplest way to estimate them is by Monte Carlo sampling over sets of \bm{t}. As Skilling and others (Chopin & Robert, 2010; Keeton, 2011) have shown, for any given problem the uncertainty in \log\mathcal{Z} is proportional to 1/\sqrt{n_{\mathrm{live}}}, so n_{\mathrm{live}} sets the resolution of the algorithm.
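To make the notation above concrete, the following sketch runs 'perfect' nested sampling on a toy likelihood profile \log\mathcal{L}(X)=-X^{2/d}/2\sigma^{2} (a hypothetical choice, anticipating Section 3.2), simulating the shrinkage factors of Eq. (5), accumulating the evidence with a simple rectangle-rule variant of Eq. (3), and stopping on the remaining-evidence criterion. It is an illustrative toy, not part of any particular implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, nlive, eps = 10, 0.01, 500, 1e-3

def logL(logX):
    # Toy profile log L(X) = -X^(2/d) / (2 sigma^2), with log L_max = 0
    return -np.exp(2 * logX / d) / (2 * sigma**2)

logX_prev, logX, logZ = 0.0, 0.0, -np.inf
while True:
    # Shrinkage factor t_i = maximum of nlive uniform draws (Eq. 5)
    logX += np.log(rng.uniform(size=nlive).max())
    # Evidence increment dZ ~ L_i (X_{i-1} - X_i)
    dlogZ = logL(logX) + logX_prev + np.log1p(-np.exp(logX - logX_prev))
    logZ = np.logaddexp(logZ, dlogZ)
    # Terminate when the evidence still enclosed (roughly L_i X_i) drops below eps * Z
    if logL(logX) + logX < logZ + np.log(eps):
        break
    logX_prev = logX

print(f"log Z = {logZ:.2f}, terminated at log X = {logX:.1f}")
```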

3 The anatomy of a nested sampling run

The following sections act as an inventory of the information available to us at an intermediate iteration i, which we shall use to make endpoint predictions in Section 4. We present an anatomy of the progression of a nested sampling run in terms of the prior volume compression (Section 3.1), the log-likelihood increase (Section 3.2), the inferred temperature (Section 3.3), and the dimensionality of the samples (Section 3.4).

3.1 Prior volume

The key feature of nested sampling is that the sampling is controlled by prior volume compression. The task is to find the posterior, which typically lies in a tiny fraction of the prior volume; the total compression required is quantified by the average information gain, or Kullback-Leibler divergence:

\mathcal{D}_{\mathrm{KL}}=\int\mathcal{P}(\theta)\log\frac{\mathcal{P}(\theta)}{\pi(\theta)}\,\mathrm{d}\theta. (7)

The bulk of the posterior lies within a prior volume X=e^{-\mathcal{D}_{\mathrm{KL}}}, which is the target compression. From Eq. (6), the run gets there by iteratively taking steps of size \Delta\log X_{i}=-1/n_{i}, so adding up the contribution of each step gives

\langle\log X_{i}\rangle=-\sum_{k=1}^{i}\frac{1}{n_{k}},\quad\mathrm{Var}(\log X_{i})=\sum_{k=1}^{i}\frac{1}{n_{k}^{2}}. (8)

A steady step size in \log X corresponds to a geometrically constant measure for the dead points, which is exactly what is needed to overcome the curse of dimensionality.

Refer to caption
Figure 2: The dead points of a nested sampling run, recursively zoomed in. Their density is constant in \log X, which is a geometrically constant measure \propto\mathrm{d}X/X, and hence scale invariant until the final set of live points is reached. The live points at a given iteration i (larger dots) are uniform across the prior, plotted for comparison.

The same is not true for the live points, which are uniformly distributed in prior volume, and by the comments in the penultimate paragraph of Section 2 have n_{i+k}=n_{i}-k. As a result, the maximum live point is found at

\langle\log X_{\mathrm{min}}^{\mathrm{live}}\rangle=\langle\log X_{i}\rangle-\sum_{k=1}^{n_{i}}\frac{1}{n_{i+k}}\approx-\frac{i}{n_{i}}-\log n_{i}-\gamma, (9)

with variance

\mathrm{Var}(\log X_{\mathrm{min}}^{\mathrm{live}})=\mathrm{Var}(\log X_{i})+\sum_{k=1}^{n_{i}}\frac{1}{n_{i+k}^{2}}\approx\frac{i}{n_{i}^{2}}+\frac{\pi^{2}}{6}, (10)

where the large n_{i} limit is taken for the approximation to the harmonic series, \gamma being the Euler-Mascheroni constant.

The live points therefore only get us a factor of \log n_{i} closer in volume to the posterior bulk. In other words, it is not until we are around \log n_{i} away from \log X=-\mathcal{D}_{\mathrm{KL}} that the samples begin populating the posterior typical set. One can see from (7) that the divergence increases linearly with dimension, so for large dimensionalities and typical live point numbers \lesssim 1000, this does not happen until near the end of the run.

The result is consistent with that in Pártay et al. (2010), which states that a spike at a volume smaller than X_{i}/n_{i} will go undetected. Intuitively, this is because for a sharply peaked likelihood the live points are too diffuse to land there with any significant probability for most of the run. These results are summarised in Fig. 3.

Refer to caption
Figure 3: The distribution of the posterior mass in terms of \log X, the live points over the constrained prior, and the smallest live point prior volume \log X_{\mathrm{min}}^{\mathrm{live}} at an intermediate iteration i. For large values of \mathcal{D}_{\mathrm{KL}}, i.e. informative posteriors and/or large dimensionalities, the maximum live point is very far from the posterior bulk until the very end of the run. Note that the x-axis in this plot is \log X, so that the run proceeds from right to left to emphasise that the enclosed prior volume iteratively gets smaller. Plots of the sort from here onward will be in terms of -\log X, where the run will more naturally proceed from left to right.
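Equations (9) and (10) are straightforward to check numerically. The sketch below (a toy check, assuming a constant number of live points n) simulates the dead-point compression of Eq. (8) and the uniform live points, and compares the sample mean and variance of \log X_{\mathrm{min}}^{\mathrm{live}} against the harmonic-series approximations.

```python
import numpy as np

rng = np.random.default_rng(1)
n, i, trials = 500, 10_000, 2_000

# Dead-point compression, Eq. (8): log X_i is a sum of i steps of mean -1/n
logX_i = -rng.exponential(scale=1/n, size=(trials, i)).sum(axis=1)
# Live points are uniform on [0, X_i]; the deepest one is the minimum of n uniforms
logX_min = logX_i + np.log(rng.uniform(size=(trials, n)).min(axis=1))

gamma = 0.5772156649
print(logX_min.mean(), -i/n - np.log(n) - gamma)   # Eq. (9)
print(logX_min.var(),  i/n**2 + np.pi**2/6)        # Eq. (10)
```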

3.2 Log-likelihood

We now consider the distribution of the samples in log-likelihood. To get an insight into the analytics, we will examine the representative case of the d-dimensional multivariate Gaussian with maximum point \log\mathcal{L}_{\mathrm{max}} and length scale \sigma:

\log\mathcal{L}=\log\mathcal{L}_{\mathrm{max}}-X^{2/d}/2\sigma^{2}. (11)

Provided that \sigma^{d}\ll 1, the posterior can be approximated as

\mathcal{P}(X)=\tfrac{1}{(2\pi\sigma^{2})^{d/2}}e^{-X^{2/d}/2\sigma^{2}}, (12)

which in terms of \log\mathcal{L} becomes

\mathcal{P}(\log\mathcal{L})=\frac{1}{\Gamma\left(\frac{d}{2}\right)}e^{\log\mathcal{L}-\log\mathcal{L}_{\mathrm{max}}}(\log\mathcal{L}_{\mathrm{max}}-\log\mathcal{L})^{\frac{d}{2}-1}, (13)

i.e. 2(\log\mathcal{L}_{\mathrm{max}}-\log\mathcal{L})\sim\chi^{2}_{d}, with mean and variance

\langle\log\mathcal{L}\rangle_{\mathcal{P}}=\log\mathcal{L}_{\mathrm{max}}-\frac{d}{2},\quad\mathrm{Var}(\log\mathcal{L})_{\mathcal{P}}=\frac{d}{2}. (14)
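As a sanity check of Eqs. (13)-(14) (a toy verification, not part of the method itself), one can sample a spherical Gaussian posterior directly in parameter space and confirm that \log\mathcal{L}_{\mathrm{max}}-\log\mathcal{L} behaves as half a \chi^{2}_{d} variable:

```python
import numpy as np

rng = np.random.default_rng(2)
d, sigma, N = 16, 0.01, 200_000

# For a spherical Gaussian the posterior is theta ~ N(0, sigma^2 I_d), and
# logL_max - logL = |theta|^2 / (2 sigma^2), i.e. half a chi-squared_d variable
theta = rng.normal(scale=sigma, size=(N, d))
drop = (theta**2).sum(axis=1) / (2 * sigma**2)

print(drop.mean(), d / 2)   # mean of Eq. (14)
print(drop.var(),  d / 2)   # variance of Eq. (14)
```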

Given the prior is uniform, X\sim U(0,1), the KL divergence in the \sigma^{d}\ll 1 limit is:

\mathcal{D}_{\mathrm{KL}}=-\frac{d}{2}\log 2\Gamma(1+\tfrac{d}{2})^{2/d}\sigma^{2}. (15)

One can also derive the individual distributions for the live and dead points in log-likelihood. Note that this is merely the distribution of the values of the points, without their nested sampling weights. The dead distribution is

P(\log\mathcal{L})=\frac{\mathrm{d}\log X}{\mathrm{d}\log\mathcal{L}}P(\log X)\propto\frac{d\,n_{i}}{2(\log\mathcal{L}_{\mathrm{max}}-\log\mathcal{L})}, (16)

since the log-densities of the dead point prior volumes are proportional to the live point number at any given iteration. Meanwhile, the live points are uniformly distributed, and so have the distribution

P(\log\mathcal{L})=\frac{d}{2}\frac{(\log\mathcal{L}_{\mathrm{max}}-\log\mathcal{L})^{\frac{d}{2}-1}}{(\log\mathcal{L}_{\mathrm{max}}-\log\mathcal{L}_{i})^{\frac{d}{2}}}\quad[\log\mathcal{L}_{i}<\log\mathcal{L}<\log\mathcal{L}_{\mathrm{max}}]. (17)

Fig. 4 shows the distinction between the distributions of the live and dead point values, as well as the posterior.

How much further do the live points penetrate in log-likelihood? We seek the distribution for the maximum likelihood of the live points, \log\mathcal{L}_{(n_{i})}^{\mathrm{live}}, where we have used the notation of order statistics to denote x_{(k)} as the maximum of k points. For convenience, we will normalise the likelihoods as

y=\frac{\log\mathcal{L}-\log\mathcal{L}_{i}}{\log\mathcal{L}_{\mathrm{max}}-\log\mathcal{L}_{i}}, (18)

so that y=0 corresponds to the latest dead point at \log\mathcal{L}_{i} and y=1 to the maximum. The live point distribution simplifies to

P(y)=\frac{d}{2}(1-y)^{\frac{d}{2}-1}\quad[0<y<1]. (19)

Using the result that the maximum of n_{i} variables with cumulative distribution F(y) follows \frac{\mathrm{d}}{\mathrm{d}y}\left(F(y)^{n_{i}}\right), we obtain

P(y_{(n_{i})}^{\mathrm{live}})=\frac{n_{i}d}{2}(1-y_{(n_{i})}^{\mathrm{live}})^{\frac{d}{2}-1}\left(1-(1-y_{(n_{i})}^{\mathrm{live}})^{\frac{d}{2}}\right)^{n_{i}-1}\quad[0<y_{(n_{i})}^{\mathrm{live}}<1], (20)

which may be roughly summarised as

y_{(n_{i})}^{\mathrm{live}}\sim 1-\frac{\Gamma(1+\frac{2}{d})\Gamma(1+n_{i})}{\Gamma(1+\frac{2}{d}+n_{i})}\pm\left(\frac{\Gamma(1+n_{i})\Gamma(1+\frac{4}{d})}{\Gamma(1+\frac{4}{d}+n_{i})}-\frac{\Gamma(1+\frac{2}{d})^{2}\Gamma(1+n_{i})^{2}}{\Gamma(1+\frac{2}{d}+n_{i})^{2}}\right)^{\frac{1}{2}}, (21)

or in the large d, n_{i} limit

\lim_{d\to\infty}y_{(n_{i})}^{\mathrm{live}}\sim\frac{2H_{n_{i}}}{d}\pm\left(\frac{2(\pi^{2}-6\psi^{(1)}(1+n_{i}))}{3d^{2}}\right)^{\frac{1}{2}}, (22)
\lim_{d,n_{i}\to\infty}y_{(n_{i})}^{\mathrm{live}}\sim\frac{2\log n_{i}}{d}\pm\sqrt{\frac{2}{3}}\frac{\pi}{d}, (23)

where \psi^{(1)} is the trigamma function and H_{n_{i}} is the n_{i}th harmonic number. This is a very small fraction in high dimensions, showing that until the end the live points are far from the posterior bulk.
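The exact mean in Eq. (21) and the leading-order limit of Eq. (23) are easy to verify by simulation; the sketch below (an illustrative check with arbitrarily chosen d and n_{i}) samples y directly from Eq. (19) by inverting its cumulative distribution.

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(3)
d, n, trials = 100, 500, 100_000

# Invert the CDF of Eq. (19): F(y) = 1 - (1-y)^(d/2), so y = 1 - u^(2/d), u ~ U(0,1)
y = 1 - rng.uniform(size=(trials, n)) ** (2 / d)
y_max = y.max(axis=1)

# Exact mean from Eq. (21); the last number is the leading-order limit of Eq. (23)
exact = 1 - np.exp(gammaln(1 + 2/d) + gammaln(1 + n) - gammaln(1 + 2/d + n))
print(y_max.mean(), exact, 2 * np.log(n) / d)
```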

Alternatively, the same result follows intuitively from the fact that at each step, y increases by

\lim_{n_{i}\to\infty}\Delta y\approx\frac{\mathrm{d}y}{\mathrm{d}\log X}\Delta\log X=\frac{2}{dn_{i}}, (24)

so that by again summing the harmonic series, we get

y_{(n_{i})}^{\mathrm{live}}=\frac{2\log n_{i}}{d}. (25)

Eq. 24 also implies that the normalised distance between the highest and second highest live point is roughly

y_{(n_{i})}^{\mathrm{live}}-y_{(n_{i}-1)}^{\mathrm{live}}\approx\frac{2}{d}. (26)

Before reaching the posterior bulk, \log\mathcal{L}_{\mathrm{max}}-\log\mathcal{L}_{i}>d/2, so we must have

\log\mathcal{L}^{\mathrm{live}}_{(n_{i})}-\log\mathcal{L}^{\mathrm{live}}_{(n_{i}-1)}>1. (27)

In other words, the highest likelihood point is always at least a factor of e greater in likelihood than the second highest. It is therefore typically the case that nearly all of the posterior mass is concentrated in a single point, the maximum live point, until the very end of the run when the prior volumes have shrunk enough to compensate.

Aside: nested sampling as a maximiser

Previous literature (Akrami et al., 2010; Feroz et al., 2011) has explored the potential for nested sampling to be used as a global maximiser, given its ability to handle multi-modalities. In particular, the latter authors emphasised that posterior samplers such as nested sampling find the bulk of the mass, not the maximum of the distribution, but that this can be remedied by tightening the termination criterion. We now use the machinery we have developed to put this statement on a more quantitative footing.

Let us take the current iteration to be the termination point with likelihood \log\mathcal{L}_{\mathrm{f}} and prior volume X_{\mathrm{f}}, so that

\epsilon=\frac{\int_{0}^{X_{\mathrm{f}}}\mathcal{L}\,\mathrm{d}X}{\int_{0}^{\infty}\mathcal{L}\,\mathrm{d}X}. (28)

Note that we have assumed that prior effects are negligible (so 1\approx\infty), and that \epsilon\ll 1 so that the denominator is approximately the accumulated evidence. Computing this for (11), we find the answer in terms of lower incomplete gamma functions

\epsilon=1-\frac{\Gamma_{d/2}\left(X_{\mathrm{f}}^{2/d}/2\sigma^{2}\right)}{\Gamma(d/2)}. (29)

Taking the X_{\mathrm{f}}\ll(\sqrt{2}\sigma)^{d} limit (almost certainly valid at termination) we find

\lim_{X_{\mathrm{f}}\ll(\sqrt{2}\sigma)^{d}}\epsilon\approx\frac{X_{\mathrm{f}}}{(\sqrt{2}\sigma)^{d}\,\Gamma\left(1+\frac{d}{2}\right)}=\frac{(\log\mathcal{L}_{\mathrm{max}}-\log\mathcal{L}_{\mathrm{f}})^{\frac{d}{2}}}{\Gamma\left(1+\frac{d}{2}\right)}. (30)

We thus have an expression relating \mathcal{L}_{\mathrm{f}} at termination to the termination fraction \epsilon. This becomes yet more pleasing in the large d limit: since \epsilon^{2/d}\to 1, a Stirling approximation gives

\lim_{d\to\infty}\log\mathcal{L}_{\mathrm{f}}\approx\log\mathcal{L}_{\mathrm{max}}-\frac{d}{2e}. (31)

In the event that we retain \epsilon, we replace \frac{d}{2e}\to\frac{d}{2e}\epsilon^{2/d}, so the \frac{d}{2e} term can only be beaten down by making \epsilon exponentially smaller as the dimension increases.

Putting this together, taking \mathcal{L}_{i} in (18) to be \mathcal{L}_{\mathrm{f}} and combining this with (23), we find

\boxed{\log\mathcal{L}_{\mathrm{max}}^{\mathrm{live}}\approx\log\mathcal{L}_{\mathrm{max}}-\frac{d}{2e}+\frac{\log n_{i}}{e}\pm\frac{\pi}{\sqrt{6}e}}, (32)

showing that in general nested sampling will finish at a contour d/2e away from the maximum log-likelihood. The final set of n_{i} live points gets you \log n_{i}/e closer, with a chance of getting \sim\pi/\sqrt{6}e=0.472 closer still by statistical fluctuation.

Making the traditional termination criterion stricter therefore has limited returns in high dimensions, if it is ultimately still based on the remaining evidence. However, nested sampling still shrinks around the maximum exponentially, so provided a good alternative termination criterion is chosen, it will get there in reasonable time.

To quantify this statement, we now calculate the number of iterations \Delta i required to get from the posterior bulk to the true maximum. At the posterior, we have roughly speaking

(\log X,\log\mathcal{L})=(-\mathcal{D}_{\mathrm{KL}},\log\mathcal{L}_{\mathrm{max}}-d/2). (33)

The prior volume \log X_{\delta} at which the likelihood is within \delta of the maximum can then be found by inverting Eq. (11):

\log X_{\delta}=-\mathcal{D}_{\mathrm{KL}}-\frac{d}{2}\left(\log\frac{d}{2}-\log\delta\right), (34)

which corresponds to an additional

\Delta i=\frac{nd}{2}\left(\log\frac{d}{2}-\log\delta\right) (35)

iterations after the bulk is reached, where n is the harmonic mean of the number of live points. One would therefore expect a general distribution to take \mathcal{O}(\tfrac{nd}{2}\log\tfrac{d}{2}) iterations to get from the usual nested sampling stopping point to within an e-fold of the maximum. A rule-of-thumb termination criterion could therefore be to run for at least \tfrac{nd}{2}\log\tfrac{d}{2} iterations after the posterior is reached.
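For a feel of the numbers, here is a short evaluation of Eq. (35) with illustrative (hypothetical) values of n, d and \delta:

```python
import numpy as np

# Extra iterations from the posterior bulk to within delta = 1 (one e-fold)
# of the maximum log-likelihood, Eq. (35), for illustrative values
n, d, delta = 500, 32, 1.0
extra = 0.5 * n * d * (np.log(d / 2) - np.log(delta))
print(f"{extra:.0f} further iterations")   # roughly 2.2e4 for these values
```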

A summary of the distances between the notable points at the end of a run is shown in Fig. 4.

Refer to caption
Figure 4: Distribution of samples as a function of \log\mathcal{L}, showing the posterior \mathcal{P}(\log\mathcal{L}), the distribution of the live points \pi(\log\mathcal{L}\mid\mathcal{L}>\mathcal{L}_{i}), and the distribution of the maximum likelihood live point P(\log\mathcal{L}_{\mathrm{max}}^{\mathrm{live}}). The distances between these locations at the end of the run are shown, the key takeaway being that in high dimensions the highest log-likelihood point of a nested sampling run is nowhere near the maximum.

3.3 Temperature

Motivations

As shown in the previous section, midway through the run nearly all of the posterior mass is concentrated at a single point. However, this does not capture the structure of the posterior that has been explored and all of the information it provides.

We have the potential to fix this because nested sampling is invariant to monotonic transformations, so we can transform the likelihood as \mathcal{L}\to\mathcal{L}^{\beta} without loss of information by trivially re-weighting the samples. Increasing \beta worsens the situation, while \beta\to 0 simply gives back the prior. There is, on the other hand, a significant intermediate range which makes the samples look like a posterior centred at the present contour, allowing us to recover the structure of the samples. A schematic of the procedure is shown in Fig. 5.

Refer to caption
Figure 5: The likelihood and posterior as a function of \log X in the middle of a nested sampling run. Almost all of the posterior mass is concentrated at a single point with the highest compression, because it is orders of magnitude higher in likelihood. Reducing \beta re-weights to shift the posterior mass; taking \beta\to 0 goes too far and gives back the prior, but there is some intermediate \beta=\beta_{i} which is significant because it centres the posterior at the current contour.

At this point it is relevant to note the correspondence between Bayesian inference and statistical mechanics, from which the above transform is derived. If one equates the parameters to microstates i, the negative log-likelihood to the microstate energy E_{i}, and the prior to the density of states g_{i}, then the posterior as given by the generalised Bayes' rule is the canonical ensemble

p(E_{i})=\frac{g_{i}e^{-\beta E_{i}}}{Z(\beta)}\quad\leftrightarrow\quad\mathcal{P}_{\beta}(\theta)=\frac{\mathcal{L}^{\beta}(\theta)\pi(\theta)}{\mathcal{Z}(\beta)} (36)

at the inverse temperature \beta=1/T. As noted by Habeck (2015), thermal algorithms such as thermodynamic integration (Gelman & Meng, 1998) get the evidence by evolving through a series of canonical ensembles via some temperature schedule, but nested sampling instead maintains microcanonical ensembles, which are the iso-likelihood contours. Instead of using temperature (Kirkpatrick et al., 1983; Swendsen & Wang, 1986) or energy (Wang & Landau, 2001) as a control parameter, nested sampling chooses a series of ensembles with constant relative volume entropy \Delta\log X, which allows the algorithm to handle phase transitions (Baldock et al., 2016).

Because the temperature of a microcanonical ensemble is a derived property rather than a parameter, there is some freedom in its definition. Returning to our original motivation, we make the connection that the temperature is the re-weighting \mathcal{L}\to\mathcal{L}^{\beta} which centres the ensemble around the current energy. We now present several temperatures that achieve this aim, each of which one can plausibly consider to be the current temperature of a nested sampling run.

A. Microcanonical temperature

The obvious candidate is the microcanonical temperature \partial S/\partial E, where the volume entropy is \log X and the energy is as usual -\log\mathcal{L}. This gives the density of states; as discussed in Skilling's original paper,

\beta_{\mathrm{M}}=-\frac{\mathrm{d}\log X}{\mathrm{d}\log\mathcal{L}}\bigg|_{\log\mathcal{L}_{i}} (37)

is the \beta at which \mathcal{L}^{\beta}X peaks at \log X_{i}, if we assume differentiability, which is exactly the intuition we were aiming for: it puts the ensemble bulk at the current contour.

Its value can be easily obtained via finite differences of the \log\mathcal{L} and \log X intervals, albeit subject to an arbitrary window size for the differencing. Indeed, materials science applications (Baldock et al., 2017) use this estimator to monitor the 'cooling' progress of nested sampling, with a window size of 1000 iterations.
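A minimal sketch of such a finite-difference estimator is given below (our own illustrative implementation, not the one used by Baldock et al.); it assumes arrays of dead-point \log\mathcal{L} and \log X values in run order. For a run with a constant number of live points n, one could take logX = -np.arange(1, len(logL) + 1) / n, following Eq. (8).

```python
import numpy as np

def microcanonical_beta(logL, logX, window=1000):
    """Sliding-window finite-difference estimate of beta_M = -dlogX/dlogL (Eq. 37).

    logL, logX : arrays of dead-point log-likelihoods and log prior volumes,
                 ordered by iteration (logL increasing, logX decreasing).
    """
    dlogX = logX[window:] - logX[:-window]
    dlogL = logL[window:] - logL[:-window]
    return -dlogX / dlogL   # positive, since logX falls as logL rises
```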

B. Canonical temperature

Another temperature considered by Habeck (2015) is that at which the current energy (i.e. -\log\mathcal{L}_{i}) is the average energy of the entire ensemble. One can obtain it by inverting

\langle\log\mathcal{L}\rangle_{\mathcal{P}_{\beta}}=\log\mathcal{L}_{i} (38)

to get the 'canonical' temperature \beta_{\mathrm{C}}. While \beta_{\mathrm{M}} is derived from (the gradient of) a single contour, this temperature uses the entire ensemble. It has the desirable property that it rises monotonically with compression, in analogy to a monotonic annealing schedule.

C. Bayesian temperature

We furthermore propose a temperature \beta_{\mathrm{B}} that is obtained via Bayesian inference, which returns a distribution rather than a point estimate. Since each value of \beta leads to a different likelihood \mathcal{L}^{\beta}, one can consider the posterior distribution as a function of \log X to be conditioned on \beta. We can therefore write

\mathcal{P}(\log X\mid\beta)=\frac{\mathcal{L}^{\beta}(X)\,X}{\mathcal{Z}(\beta)}. (39)

What we would really like is the distribution of \beta at the present iteration, so the natural step is to invert this via Bayes' rule:

P\left(\beta\mid\log X_{i}\right)=\frac{\mathcal{P}\left(\log X_{i}\mid\beta\right)P\left(\beta\right)}{P\left(\log X\right)}. (40)

As with all Bayesian analyses, the distribution of \beta is fixed up to a prior, which we choose to be uniform in \beta. The obtained temperatures are consistent with the previous two choices, which may seem oddly coincidental. In fact, closer inspection reveals that the values of \beta favoured by P(\beta\mid\log X_{i}) are those for which the posterior at the present contour, normalised by the corresponding evidence, is large. Thus the Bayesian temperature uses the same idea as the microcanonical one, except it accounts for the spread in the result.

Comparisons

Fig. 6 shows the three temperatures as a function of compression for two cases, one containing a phase transition and one without. They are consistent in both cases when there is a single dominant phase, but differ during a phase transition. The canonical temperature is the only one that rises monotonically with compression.

Refer to caption
Figure 6: Inferred temperatures using the microcanonical, canonical and Bayesian definitions. The shaded regions show the 1-2\sigma uncertainties. All are consistent for a single phase, but differ during a phase transition.

One should keep in mind that despite the above theoretical reasoning, our introduction of the likelihood transformation was ultimately motivated by our wish to utilise the extra degree of freedom it provides. As we will see below, we recommend choosing the exact definition depending on what is useful for the problem at hand.

3.4 Dimensionality

We can immediately use the inferred temperature to track how the effective dimensionality of the posterior changes throughout the run, which was previously inaccessible. Handley & Lemos (2019) demonstrated that at the end of a run, a measure of the number of constrained parameters is given by the Bayesian model dimensionality (BMD), defined as the posterior variance of the information content:

\frac{d_{G}}{2}=\int\mathcal{P}(\theta)\left(\log\frac{\mathcal{P}(\theta)}{\pi(\theta)}-\mathcal{D}_{\mathrm{KL}}\right)^{2}\mathrm{d}\theta=\langle\mathcal{I}^{2}\rangle_{\mathcal{P}}-\langle\mathcal{I}\rangle^{2}_{\mathcal{P}}. (41)

Calculating this quantity using an intermediate set of weighted samples (which is concentrated at a single point) leads to vanishing variance, and hence vanishing dimensionality. However, we can recover the structure of the posterior together with the true dimensionality by adjusting the temperature. Dimensionality estimates are plotted in Fig. 7 for a spherical 32-d Gaussian, for which the true dimensionality is known.
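In terms of samples at inverse temperature \beta, the information content is \mathcal{I}=\beta\log\mathcal{L}-\log\mathcal{Z}(\beta), so Eq. (41) reduces to d_{G}=2\beta^{2}\,\mathrm{Var}_{\mathcal{P}_{\beta}}(\log\mathcal{L}). Below is a from-scratch sketch of this computation (anesthetic provides equivalent functionality; the raw log-weights logw passed in here are an assumed input, e.g. the prior-volume weights of the dead points):

```python
import numpy as np

def bmd(logL, logw, beta):
    """Bayesian model dimensionality d_G = 2 * beta^2 * Var_{P_beta}(logL).

    logL : sample log-likelihoods; logw : their raw (prior-volume) log-weights.
    """
    a = beta * logL + logw            # unnormalised log posterior weights at temperature beta
    p = np.exp(a - a.max())
    p /= p.sum()
    mean = np.dot(p, logL)
    return 2 * beta**2 * np.dot(p, (logL - mean) ** 2)
```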

The different choices of temperature are again consistent, but for the rest of this paper we choose the Bayesian \beta, because it provides a better reflection of the uncertainty in the estimate; the others, while fluctuating around the true value, are often many standard errors away from it at any single point.

Refer to caption
Figure 7: Dimensionality estimates using the different temperatures for a spherical 32-d Gaussian. Again, all are consistent, but the Bayesian definition has an uncertainty which includes the true value far more consistently than the other definitions.

Anisotropic compression

Plots of sample dimensionality against compression also draw attention to the directions in which the samples are constrained throughout the run. As a concrete example, consider an elongated Gaussian in a unit hypercube prior with \mu=\bm{0} and \Sigma=\mathrm{diag}\left(10^{-3},10^{-3},10^{-3},10^{-6},10^{-6},10^{-6}\right), for which the dimensionality estimates are plotted in Fig. 8. Alongside is a view of the distribution of live points across the prior for two directions with different scales, which shows the level to which those parameters have been constrained at different times.

A feature of nested sampling made apparent here is that parameters with high variance are initially 'hidden'. Compression occurs in the direction which is most likely to yield a sample of higher likelihood, and initially it is much easier to find a better point along a direction that is still poorly constrained relative to its likelihood scale, i.e. the low-variance directions. The higher variance parameters are constrained much later, and before that happens it appears as though those parameters have a uniform distribution.

Refer to caption
Figure 8: Dimensionality estimates for a Gaussian that is elongated in half of its dimensions, with 1-2\sigma uncertainties shaded. The locations of the live points in the prior are shown at three stages, indicated by the connecting lines. As can be seen from the live point distribution, the prior does not compress in the higher variance direction until much later in the run, and early on it appears as if those directions are completely unconstrained.

It is important to appreciate that at lower compression the samples truly lie in a lower-dimensional space; this is not an artefact of the way we view them. Anticipating the full dimensionality of the space is therefore just as impossible as anticipating the spike in a slab-spike geometry, so in this sense such geometries contain a 'compressive phase transition'.

4 Endpoint prediction

As described in Petrosyan & Handley (2022) and further explored in the talk and upcoming paper Handley (2023a, b), the time complexity of nested sampling is

T\propto\langle\tfrac{1}{n_{i}}\rangle^{-1}\times\langle\mathcal{T}\{\mathcal{L}(\theta)\}\rangle\times\langle f_{\mathrm{sampler}}\rangle\times\mathcal{D}_{\mathrm{KL}}. (42)

The first term is the harmonic mean of the number of live points, \sim\mathcal{O}(n_{i}). The second term is the average time per likelihood evaluation. The third is the average number of evaluations required to replace a dead point with a live point at higher likelihood, which is given by the implementation and usually does not vary in orders of magnitude. The final term is the Kullback-Leibler divergence of Eq. (7), the compression factor required to get from the prior to the posterior. This term is generally outside of user control, in most cases a priori unknown, and of principal interest in this section.

4.1 The termination prior volume

Making the above discussion more precise, we wish to find the compression factor \log X_{\mathrm{f}} at which the termination criterion is met, which is larger in magnitude than \mathcal{D}_{\mathrm{KL}} (Fig. 3). The difficulty is that at an intermediate iteration we only know the posterior up to the maximum log-likelihood live point, which until just before the end is far from the posterior bulk.

In order to get an idea of where the true posterior bulk sits, we need to predict what the posterior looks like past the highest live point. We do this by extrapolating the known likelihood profile; that is, the trajectory of \mathcal{L}(X) traced out by the live and dead points. One would never use this predicted posterior to perform inference, since more accuracy can always be achieved by simply finishing the run. However, we will demonstrate it is sufficient for making a run-time prediction for \log X_{\mathrm{f}}.

Quantitatively, this proceeds as follows: fit a function f(X,\phi) with some parameters \phi to the known likelihood profile, which allows us to express the termination condition determining the prior volume we need to compress to as

\Delta\mathcal{Z}=\epsilon\mathcal{Z}_{\mathrm{tot}}, (43)

or equivalently

\int_{0}^{X_{\mathrm{f}}}f(X,\phi)\,\mathrm{d}X=\epsilon\left(\int_{0}^{X_{i}}f(X,\phi)\,\mathrm{d}X+\mathcal{Z}_{\mathrm{dead}}\right), (44)

where X_{i} is the volume of the iteration we have currently compressed to, and \mathcal{Z}_{\mathrm{dead}} is the evidence we have accumulated up to this point. X_{\mathrm{f}} can then be identified by solving the above equation either analytically or numerically.

Once X_{\mathrm{f}} is known, the corresponding iteration count depends on the live point schedule. For example, in the constant n_{i} case \log X decreases by 1/n at each iteration, so the total number of iterations N_{\mathrm{f}} would be

N_{\mathrm{f}}=-n\log X_{\mathrm{f}}. (45)
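For a generic fitted profile, Eqs. (44)-(45) can be solved numerically; the sketch below is illustrative only, working with a linear-units profile f(X) for readability (for sharply peaked profiles, the analytic route of Section 4.2 is both faster and more robust).

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def endpoint_iterations(f, X_i, Z_dead, eps=1e-3, n=500):
    """Solve Eq. (44) for log X_f given a fitted profile f(X), then convert to
    an iteration count via Eq. (45), assuming a constant number of live points n."""
    Z_live = quad(f, 0, X_i)[0]              # evidence still enclosed by the current contour
    target = eps * (Z_live + Z_dead)
    # Root-find in log X for stability, since X_f can be extremely small
    g = lambda logX: quad(f, 0, np.exp(logX))[0] - target
    logX_f = brentq(g, np.log(X_i) - 500.0, np.log(X_i))
    return -n * logX_f
```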

4.2 How to extrapolate?

A key observation is that the Bayesian model dimensionality is the equivalent dimension of the posterior if it were actually Gaussian. Fitting a Gaussian of this dimension to the likelihood profile therefore makes a reasonable approximation to the true distribution, without explicitly assuming the form of the likelihood function. The parameterisation of the Gaussian that we fit is the same as that given in Section 3.2, which we shall repeat here for clarity:

f(X;\phi)=\log\mathcal{L}_{\mathrm{max}}-X^{2/d}/2\sigma^{2}. (46)

The extrapolation then proceeds thus:

1. Find the current dimensionality d^{(i)}_{G} of the posterior at the Bayesian temperature.

2. Take the live point profile and perform a least squares fit to (46), stipulating that d=d^{(i)}_{G}, to infer \log\mathcal{L}_{\mathrm{max}} and \sigma.

3. Use the likelihood predicted by these parameters to solve (44) for X_{\mathrm{f}}.

The advantage of fitting a Gaussian is that the procedure can be sped up analytically. Firstly, the least squares regression is trivial because analytic estimators exist; the cost function

C^{2}(\log\mathcal{L}_{\mathrm{max}},\sigma)=\sum_{i}\left|\log\mathcal{L}_{i}-f(X_{i};\log\mathcal{L}_{\mathrm{max}},\sigma)\right|^{2} (47)

is minimised with respect to (\log\mathcal{L}_{\mathrm{max}},\sigma) when

\sigma^{2}=\frac{N\sum_{i}X_{i}^{4/d}-\left(\sum_{i}X_{i}^{2/d}\right)^{2}}{2\sum_{i}\log\mathcal{L}_{i}\sum_{i}X_{i}^{2/d}-2N\sum_{i}X_{i}^{2/d}\log\mathcal{L}_{i}}, (48)

and

\log\mathcal{L}_{\mathrm{max}}=\frac{1}{N}\sum_{i}\log\mathcal{L}_{i}+\frac{1}{2N\sigma^{2}}\sum_{i}X_{i}^{2/d}. (49)

Secondly, the termination prior volume can also be obtained analytically. Rewriting Eq. 44 in terms of the Gaussian parameters gives

\epsilon=\frac{\int_{0}^{X_{\mathrm{f}}}\mathcal{L}_{\mathrm{max}}\exp\left(-X^{2/d}/2\sigma^{2}\right)\mathrm{d}X}{\int_{0}^{X_{i}}\mathcal{L}_{\mathrm{max}}\exp\left(-X^{2/d}/2\sigma^{2}\right)\mathrm{d}X+\mathcal{Z}_{\mathrm{dead}}}. (50)

The integrals have the analytic solution

\int_{0}^{X_{k}}\mathcal{L}_{\mathrm{max}}\exp\left(-X^{2/d}/2\sigma^{2}\right)\mathrm{d}X=\frac{d}{2}\cdot\left(\sqrt{2}\sigma\right)^{d}\cdot\gamma_{k}, (51)

where \gamma_{k}=\Gamma_{d/2}\left(X_{k}^{2/d}/2\sigma^{2}\right) is the lower incomplete gamma function. After taking the inverse of \gamma and a few more steps of algebra, we arrive at

\log X_{\mathrm{f}}=\frac{d}{2}\log 2\sigma^{2}+\log\Gamma^{-1}_{d/2}\left(\epsilon\gamma_{i}+\frac{\epsilon\mathcal{Z}_{\mathrm{dead}}}{\left(2\sigma^{2}\right)^{d/2}\mathcal{L}_{\mathrm{max}}}\right), (52)

and N_{\mathrm{f}} is of course just -n multiplied by this. Intuitively, the above procedure can be thought of as inferring the number of constrained parameters, then extrapolating them up to find the point at which they will be fully constrained.

Uncertainties in the final estimate are obtained by drawing many samples from the distribution of d_{G} defined by the Bayesian temperature, and repeating step two for each. One might wonder why we do not obtain d via least squares regression together with the other parameters; extensive testing has shown this approach to be far less stable.
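Steps two and three are cheap enough to be repeated for every sampled value of d_{G}. The sketch below is our own illustrative implementation of them (not the aeons package itself): it applies the estimators of Eqs. (48)-(49) to the live point profile and then solves the termination condition of Eq. (50) using regularised incomplete gamma functions, which is equivalent to Eq. (52) up to normalisation conventions.

```python
import numpy as np
from scipy.special import gammainc, gammaincinv, gammaln

def predict_logXf(logL, logX, d, logZ_dead, eps=1e-3):
    """Fit the Gaussian profile of Eq. (46) to the live points and return log X_f.

    logL, logX : live point log-likelihoods and log prior volumes;
    d          : dimensionality sampled from the BMD distribution;
    logZ_dead  : log evidence accumulated so far.
    """
    t, N = np.exp(2 * logX / d), len(logL)                           # t = X^(2/d)
    sigma2 = (N * (t**2).sum() - t.sum()**2) / \
             (2 * logL.sum() * t.sum() - 2 * N * (t * logL).sum())   # Eq. (48)
    logLmax = logL.mean() + t.sum() / (2 * N * sigma2)               # Eq. (49)

    # Evidence inside volume X: Lmax (2 sigma^2)^(d/2) Gamma(1+d/2) P(d/2, X^(2/d)/2sigma^2),
    # with P the regularised lower incomplete gamma function
    logC = 0.5 * d * np.log(2 * sigma2) + gammaln(1 + d / 2)
    P_i = gammainc(d / 2, t.max() / (2 * sigma2))
    P_f = eps * (P_i + np.exp(logZ_dead - logLmax - logC))           # from Eq. (50)
    u_f = gammaincinv(d / 2, P_f)
    return 0.5 * d * np.log(2 * sigma2 * u_f)    # endpoint iterations: N_f = -n * logX_f
```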

4.3 Alternative approaches

More comprehensive Bayesian approaches, perhaps including a priori information about the likelihood or greater flexibility in the fitting function, could likely perform better than what we have just presented. However, such methods would not befit run-time prediction, which has a much more limited computational budget, hence the more pragmatic approach we have adopted. Here, we discuss, as a benchmark, alternative approaches to endpoint estimation that have a comparable computational complexity.

A. Integral progress

An alternative approach used in Ultranest (Buchner, 2021) derives a progress bar based on the fraction of the accumulated integral compared to the remaining integral, approximated as

\mathcal{Z}_{\mathrm{rem}}\approx\mathcal{L}_{\mathrm{max}}^{\mathrm{live}}X_{i}. (53)

This has several shortcomings. First, run-time is proportional to compression rather than accumulation of the integral, since it takes just as long to traverse the width of the bulk as it does any other width. Second, because of the point-like nature of the posterior mid-run, the remaining integral approximated as such holds nearly all of the evidence, so the relative fraction of the accumulated and remaining evidence is almost zero for most of the run. Finally, approximation (53) is always an underestimate, because as previously found the maximum live point is generally nowhere near the true maximum. This approach can however be useful in the low dimensions appropriate for Ultranest, where the live points are always near the maximum, but in general it is less reliable.

B. Extrapolating evidence increments

Seasoned watchers of nested sampling runs might be curious how the method compares to simply extrapolating the increments of evidence to roughly estimate when the evidence converges. We do this for a spherical Gaussian and compare it to our method. At an intermediate stage of the run, the most recent outputs might look something like that shown in the first two columns of the table in Fig. 9(a). Extrapolating those data to a linear and exponential profile yields endpoint estimates plotted in the graph to the right.

The linear extrapolation is clearly an underestimate, since it fails to account for the long tail of the nonlinear profile. The increments are also not exactly exponential, since the exponential fit leads to a large over-prediction. The predicted endpoint over the course of a run for d=16, \sigma=0.01, as shown in Fig. 10, shows the same result. One might expect an average to be more accurate, but this tends to be biased towards the exponential prediction, and there is no obvious choice of weighting that would fix this.
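As a concrete illustration, the sketch below fits the two benchmark extrapolations to the tabulated increments of Fig. 9(a); the convergence tolerance used to define the exponential endpoint is a hypothetical choice of ours, since the fitted increments never reach exactly zero.

```python
import numpy as np

# Evidence increments per 500 iterations, taken from the table in Fig. 9(a)
iters = np.arange(5000, 10000, 500)
dlogZ = np.array([190.8, 171.2, 140.9, 132.2, 106.6, 94.7, 87.7, 82.9, 67.9, 59.1])

# Linear fit: endpoint where the increments would hit zero (an under-prediction)
slope, intercept = np.polyfit(iters, dlogZ, 1)
print("linear estimate:     ", -intercept / slope)

# Exponential fit: endpoint where the increments fall below a chosen tolerance
# (an over-prediction for any reasonable tolerance)
k, logA = np.polyfit(iters, np.log(dlogZ), 1)
print("exponential estimate:", (np.log(1e-3) - logA) / k)
```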

More importantly, we find that for real likelihoods which have an element of noise the extrapolation often diverges, for instance when the increments do not monotonically decrease. Directly extrapolating the evidence increments is therefore far less stable than the previous method, and generally not a reliable method for prediction.

iteration  log Z  Δ log Z
5000 -1435.8 190.8
5500 -1264.6 171.2
6000 -1123.7 140.9
6500 -991.5 132.2
7000 -885.0 106.6
7500 -790.3 94.7
8000 -702.6 87.7
8500 -619.7 82.9
9000 -551.8 67.9
9500 -492.7 59.1
(a)
Refer to caption
(b)
Figure 9: Extrapolating the increments of evidence. The left column shows the output of a nested sampling run, and the right column shows the extrapolation.

Refer to caption
Figure 10: Endpoint predictions for a spherical Gaussian, with 1-2\sigma uncertainties shaded. Extrapolating the evidence increments in a linear/exponential manner under/over-predicts the endpoint, both of which perform considerably worse than the method of extrapolating the likelihood.

5 Results

We now test the approach established in the previous section on a range of distributions. We begin by considering a series of toy examples to explore the capabilities and limitations of the method, before presenting results for real cosmological chains.

5.1 Toy examples

Throughout the toy examples we make use of the perfect nested sampling (Keeton, 2011; Higson et al., 2018b) framework implemented in anesthetic (Handley, 2019).

A. Gaussians

Predictions for spherical Gaussians of various dimensions are shown in Fig. 11 as a benchmark for when fitting a Gaussian distribution is exact. In all endpoint prediction plots, the shading indicates the 1-2\sigma uncertainties. All were run with n_{i}=500, except for one run with n_{i}=2000 for comparison, with each Gaussian having a width of \sigma=0.01. The correct endpoint is recovered to within standard error at all points except the very beginning, when the parameters have hardly been constrained.

We note that as with nested sampling in general, increasing n_{i} improves the resolution and reliability of the inferences, which can be seen from the middle two plots.

Refer to caption
Figure 11: Endpoint predictions for a spherical Gaussian run with n_{i}=500 (except for the third plot from left). The correct endpoint is obtained for all but the earliest iterations, and the uncertainty is controlled by the number of live points, which can be seen from the two d=16 plots.

We also observe the effect of elongating the Gaussian, using the same example as Section 3.4. Fig. 12 shows a step-like trend similar to the inferred dimensionalities, reflecting the fact that the full dimensionality is undetectable at lower compression factors. The endpoint for a likelihood whose remaining three directions are completely unconstrained coincides with our predictions at early iterations, showing that the two cases are indistinguishable.

Refer to caption
Figure 12: Endpoint prediction for an elongated Gaussian. At early stages, the full dimensionality is undetectable, and the endpoint is predicted to be the same as for a likelihood with three unconstrained directions. Only once the prior has been compressed enough to constrain the other three directions does the prediction converge to the true value.

B. Cauchy

One case that might be expected to cause problems is the pathological Cauchy distribution, which is far from a Gaussian. Fig. 13 shows the predictions for a likelihood of the form

\log\mathcal{L}=\log\mathcal{L}_{\mathrm{max}}-\frac{1+d}{2}\log\left(1+\frac{X^{2}}{\gamma^{2}}\right), (54)

choosing d=10 and allowing \gamma to vary. The correct estimate is obtained to within standard error by about halfway, but before that it is inaccurate. The key limitation is that the estimate is wrong early on not because the compression is anisotropic, or because there is a phase transition, but rather as a limitation of reducing the likelihood to a Gaussian via the BMD, which is itself less stable for a Cauchy.

Nevertheless, the right order of magnitude is obtained at all times, so this remains sufficient for most use-cases. The Cauchy is also a pathological case, and the same problem does not in practice appear for more realistic cases, as we shall see next.

Refer to caption
Figure 13: Predictions for the Cauchy distribution for various widths \gamma and n_{i}=500. The endpoint is underestimated for the first half of the run, but this is a limitation of the Gaussian approximation rather than a lack of information mid-run.

5.2 Cosmological examples

Finally, we evaluate the method on real cosmological chains. Fig. 14 presents the endpoints (calculated after the fact) for nested sampling runs for curvature quantification on several common cosmological data sets (details in Handley 2021).

The SH0ES, BAO and lensing chains are 'easy' low \mathcal{D}_{\mathrm{KL}} inferences, so it is expected that the correct endpoint is inferred practically from the start. The Planck endpoints, on the other hand, are not correct until at least midway through. However, this is expected from the covariance of the Planck likelihood, which consists of principal components of many scales and is therefore elongated in many dimensions. It is therefore of the same class as the elongated Gaussian presented in Section 3.4; the samples exist in a lower dimensional subspace mid-run, which slowly increases to the full dimensionality only at the end of the run.

Refer to caption
Figure 14: Endpoint predictions for cosmological likelihoods. The first three low \mathcal{D}_{\mathrm{KL}} inferences get the correct endpoint from the start, while the Planck chain takes longer to converge because the likelihood is highly covariant and thus subject to anisotropic compression, which makes the samples lie in a lower dimensional subspace mid-run.

6 Conclusion

We have derived new analytic results mapping out the anatomy of nested sampling, understanding the progression of a run via the compression of prior volume, the increase in log-likelihood, the inferred temperature schedule, and the convergence of the sample dimensionality. From these analyses, we developed a method for predicting the endpoint of a nested sampling run, by using the inferred Bayesian model dimensionality mid-run to extrapolate the known likelihood profile.

The method in general converges on a correct prediction of the endpoint by about halfway, and gets the correct order of magnitude throughout. Consistent predictions are obtained for both toy and cosmological examples. The accuracy is typically limited by the information available mid-run, either because of a phase transition or because of the anisotropy of the nested sampling compression. Pathological distributions, such as a Cauchy, lead to less stable inferences of the dimensionality and expose the limitations of a Gaussian approximation, though the order of magnitude is still correct.

Further work can be done to experiment with more flexible basis functions for regression of the likelihood profile, so that it is less dependent on the Gaussian approximation.

Code availability

A package is in development to implement the endpoint prediction mechanism, for easy plug-in to existing implementations of nested sampling. The latest updates can be found at github.com/zixiao-h/aeons.

Acknowledgements

WJH is supported by the Royal Society as a Royal Society University Research Fellow at the University of Cambridge. AB was supported by a Cambridge Mathematical Placement and a Royal Society summer studentship. ZH was supported by a Royal Society summer studentship. We thank Mike Hobson for helpful discussion during ZH's Part III viva.

References