
aeons: approximating the end of nested sampling

Zixiao Hu,1,2 Artem Baryshnikov,1,2 Will Handley1,2
1Astrophysics Group, Cavendish Laboratory, J. J. Thomson Avenue, Cambridge, CB3 0HE, UK
2Kavli Institute for Cosmology, Madingley Road, Cambridge, CB3 0HA, UK
E-mail: zixiao.hu.09@gmail.com; wh260@cam.ac.uk
Abstract

This paper presents analytic results on the anatomy of nested sampling, from which a technique is developed to estimate the run-time of the algorithm that works for any nested sampling implementation. We test these methods on both toy models and true cosmological nested sampling runs. The method gives an order-of-magnitude prediction of the endpoint at all times, and forecasts the true endpoint to within standard error from around the halfway point.

keywords:
methods: data analysis – methods: statistical

1 Introduction

Nested sampling is a multi-purpose algorithm invented by John Skilling which simultaneously functions as a probabilistic sampler, integrator and optimiser (Skilling, 2006). It was immediately adopted for cosmology, and is now used in a wide range of physical sciences including particle physics (Trotta et al., 2008), materials science (Pártay et al., 2021) and machine learning (Higson et al., 2018a). The core algorithm is unique in its estimation of volumes by counting, coupling together nested Monte Carlo integrals in a way that makes high-dimensional integration feasible and robust. It also avoids problems that challenge traditional Bayesian samplers, such as posterior multi-modality and phase transitions.

The order of magnitude of the run-time of an algorithm, that is, whether termination is hours or weeks and months away, is of high importance to the end user. Currently, existing implementations of nested sampling (e.g. Feroz et al. 2009; Handley et al. 2015; Brewer & Foreman-Mackey 2016; Speagle 2020; Buchner 2021; Williams et al. 2021; McEwen et al. 2023) either do not give an indication of remaining run-time, or only provide crude measures of progress that do not directly correspond to the true endpoint.

This paper sets out a principled manner of endpoint estimation for nested sampling at each intermediate stage (as shown in Fig. 1), the key idea being to use the existing samples to predict the likelihood in the region we have yet to sample from. We begin with an overview of nested sampling in Section 2, followed by an examination of the anatomy of a nested sampling run to establish key concepts for endpoint prediction in Section 3. Section 4 then outlines the methodology we use, including discussion and comparisons to previous attempts. Finally, Section 5 presents the results and discussions for toy and cosmological chains, before we conclude.

Predicted endpoint: 25054 +/- 242
Progress: [=================>########] 72%
___________________
lives      |   500 |
phantoms   | 24310 |
posteriors | 18018 |
equals     |   245 |
-------------------
ncluster   =  1/1
ndead      =  18018
nposterior =  18018
nequals    =  249
nlike      =  4159049
<nlike>    =  491.04 (9.82 per slice)
log(Z)     =  -12.55 +/- 0.27

Figure 1: Output from PolyChord for a typical nested sampling run. The predicted endpoint, shown in red, is calculated using the method described in this paper.

2 Background

We begin with a brief description of the nested sampling algorithm to establish the necessary notation. For a more comprehensive treatment, we recommend the original paper (Skilling, 2006), the Sivia & Skilling (2006) textbook, as well as the excellent technical review by Buchner (2023) and Nature review by Ashton et al. (2022).

For a given likelihood \mathcal{L}(\theta) and prior \pi(\theta), nested sampling simultaneously calculates the Bayesian evidence

\mathcal{Z}=\int\mathcal{L}(\theta)\,\pi(\theta)\,\mathrm{d}\theta (1)

while producing samples of the posterior distribution

\mathcal{P}(\theta)=\frac{\mathcal{L}(\theta)\pi(\theta)}{\mathcal{Z}}. (2)

The algorithm operates by maintaining a set of n_{i} live points sampled from the prior, which can vary in number throughout the run (Higson et al., 2019). At each iteration i, the point with the lowest likelihood is removed and added to a list of dead points (illustrated in Fig. 2). New points are then (optionally) drawn from the prior, subject to the constraint that they must have a higher likelihood than the latest dead point \mathcal{L}_{i}. Repeating the procedure leads to the live points compressing around peaks in the likelihood. If there are more births than deaths, the number of live points n_{i} increases, whilst choosing not to generate new points reduces the number of live points. The exact creation schedule depends on the dynamic nested sampling strategy.

The integral in (1) is then evaluated by transformation to a one-dimensional integral over the prior volume X,

\mathcal{Z}=\int_{0}^{1}\mathcal{L}(X)\,\mathrm{d}X\approx\sum_{i=1}\mathcal{L}_{i}\,\tfrac{1}{2}(X_{i-1}-X_{i+1}), (3)

where X(\mathcal{L}) is the fraction of the prior with a likelihood greater than \mathcal{L}. Whilst the likelihood contour \mathcal{L}_{i} at each iteration is known, the prior volumes X_{i} must be statistically estimated as follows: one can define a shrinkage factor t_{i} at each iteration X_{i}=t_{i}X_{i-1}, such that

X_{i}=\prod_{k=1}^{i}t_{k}. (4)

The t_{i} are the maximum of n_{i} points drawn uniformly from [0,1], so follow the distribution

P(t_{i})=n_{i}t_{i}^{n_{i}-1}, (5)
\langle\log t_{i}\rangle=-\frac{1}{n_{i}},\quad\mathrm{Var}(\log t_{i})=\frac{1}{n_{i}^{2}}. (6)

The algorithm terminates when a user-specified condition is met; a popular choice is when the evidence in the live points falls below some fraction \epsilon of the accumulated evidence, e.g. 10^{-3}, which is proven to be a valid convergence criterion (Evans, 2006; Chopin & Robert, 2010). Much of the existing literature treats this remaining evidence separately, for instance by estimating it as the termination X multiplied by the average likelihood amongst the remaining live points. It is, however, quantitatively equivalent but qualitatively neater to consider termination as killing the remaining live points off one-by-one, incrementing the evidence exactly as during the run with decreasing n_{i} (Speagle, 2020).

Uncertainties in the evidence are dominated by the spread in the prior volume distribution, and the simplest way to estimate them is by Monte Carlo sampling over sets of \bm{t}. As Skilling and others (Chopin & Robert, 2010; Keeton, 2011) have shown, for any given problem the uncertainty in \log\mathcal{Z} is proportional to 1/\sqrt{n_{\mathrm{live}}}, so n_{\mathrm{live}} sets the resolution of the algorithm.
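To make the notation above concrete, the following sketch runs 'perfect' nested sampling on a toy likelihood profile \log\mathcal{L}(X)=-X^{2/d}/2\sigma^{2} (a hypothetical choice, anticipating Section 3.2), simulating the shrinkage factors of Eq. (5), accumulating the evidence with a simple rectangle-rule variant of Eq. (3), and stopping on the remaining-evidence criterion. It is an illustrative toy, not part of any particular implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, nlive, eps = 10, 0.01, 500, 1e-3

def logL(logX):
    # Toy profile log L(X) = -X^(2/d) / (2 sigma^2), with log L_max = 0
    return -np.exp(2 * logX / d) / (2 * sigma**2)

logX_prev, logX, logZ = 0.0, 0.0, -np.inf
while True:
    # Shrinkage factor t_i = maximum of nlive uniform draws (Eq. 5)
    logX += np.log(rng.uniform(size=nlive).max())
    # Evidence increment dZ ~ L_i (X_{i-1} - X_i)
    dlogZ = logL(logX) + logX_prev + np.log1p(-np.exp(logX - logX_prev))
    logZ = np.logaddexp(logZ, dlogZ)
    # Terminate when the evidence still enclosed (roughly L_i X_i) drops below eps * Z
    if logL(logX) + logX < logZ + np.log(eps):
        break
    logX_prev = logX

print(f"log Z = {logZ:.2f}, terminated at log X = {logX:.1f}")
```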

3 The anatomy of a nested sampling run

The following sections act as an inventory of the information available to us at an intermediate iteration i, which we shall use to make endpoint predictions in Section 4. We present an anatomy of the progression of a nested sampling run in terms of the prior volume compression (Section 3.1), the log-likelihood increase (Section 3.2), the inferred temperature (Section 3.3), and the dimensionality of the samples (Section 3.4).

3.1 Prior volume

The key feature of nested sampling is that the sampling is controlled by prior volume compression. The task is to find the posterior, which typically lies in a tiny fraction of the prior volume; the total compression required is quantified by the average information gain, or Kullback-Leibler divergence:

\mathcal{D}_{\mathrm{KL}}=\int\mathcal{P}(\theta)\log\frac{\mathcal{P}(\theta)}{\pi(\theta)}\,\mathrm{d}\theta. (7)

The bulk of the posterior lies within a prior volume X=e^{-\mathcal{D}_{\mathrm{KL}}}, which is the target compression. From Eq. (6), the run gets there by iteratively taking steps of size \Delta\log X_{i}=-1/n_{i}, so adding up the contribution of each step gives

\langle\log X_{i}\rangle=-\sum_{k=1}^{i}\frac{1}{n_{k}},\quad\mathrm{Var}(\log X_{i})=\sum_{k=1}^{i}\frac{1}{n_{k}^{2}}. (8)

A steady step size in \log X corresponds to a geometrically constant measure for the dead points, which is exactly what is needed to overcome the curse of dimensionality.

Refer to caption
Figure 2: The dead points of a nested sampling run, recursively zoomed in. Their density is constant in \log X, which is a geometrically constant measure \propto\mathrm{d}X/X, and hence scale invariant until the final set of live points is reached. The live points at a given iteration i (larger dots) are uniform across the prior, plotted for comparison.

The same is not true for the live points, which are uniformly distributed in prior volume, and by the comments in the penultimate paragraph of Section 2 have n_{i+k}=n_{i}-k. As a result, the maximum live point is found at

\langle\log X_{\mathrm{min}}^{\mathrm{live}}\rangle=\langle\log X_{i}\rangle-\sum_{k=1}^{n_{i}}\frac{1}{n_{i+k}}\approx-\frac{i}{n_{i}}-\log n_{i}-\gamma, (9)

with variance

\mathrm{Var}(\log X_{\mathrm{min}}^{\mathrm{live}})=\mathrm{Var}(\log X_{i})+\sum_{k=1}^{n_{i}}\frac{1}{n_{i+k}^{2}}\approx\frac{i}{n_{i}^{2}}+\frac{\pi^{2}}{6}, (10)

where the large n_{i} limit is taken for the approximation to the harmonic series, \gamma being the Euler-Mascheroni constant.

The live points therefore only get us a factor of \log n_{i} closer in volume to the posterior bulk. In other words, it is not until we are around \log n_{i} away from \log X=-\mathcal{D}_{\mathrm{KL}} that the samples begin populating the posterior typical set. One can see from (7) that the divergence increases linearly with dimension, so for large dimensionalities and typical live point numbers \lesssim 1000, this does not happen until near the end of the run.

The result is consistent with that in Pártay et al. (2010), which states that a spike at a volume smaller than X_{i}/n_{i} will go undetected. Intuitively, this is because for a sharply peaked likelihood the live points are too diffuse to land there with any significant probability for most of the run. These results are summarised in Fig. 3.

Refer to caption
Figure 3: The distribution of the posterior mass in terms of \log X, the live points over the constrained prior, and the smallest live point prior volume \log X_{\mathrm{min}}^{\mathrm{live}} at an intermediate iteration i. For large values of \mathcal{D}_{\mathrm{KL}}, i.e. informative posteriors and/or large dimensionalities, the maximum live point is very far from the posterior bulk until the very end of the run. Note that the x-axis in this plot is \log X, so that the run proceeds from right to left to emphasise that the enclosed prior volume iteratively gets smaller. Plots of the sort from here onward will be in terms of -\log X, where the run will more naturally proceed from left to right.
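Equations (9) and (10) are straightforward to check numerically. The sketch below (a toy check, assuming a constant number of live points n) simulates the dead-point compression of Eq. (8) and the uniform live points, and compares the sample mean and variance of \log X_{\mathrm{min}}^{\mathrm{live}} against the harmonic-series approximations.

```python
import numpy as np

rng = np.random.default_rng(1)
n, i, trials = 500, 10_000, 2_000

# Dead-point compression, Eq. (8): log X_i is a sum of i steps of mean -1/n
logX_i = -rng.exponential(scale=1/n, size=(trials, i)).sum(axis=1)
# Live points are uniform on [0, X_i]; the deepest one is the minimum of n uniforms
logX_min = logX_i + np.log(rng.uniform(size=(trials, n)).min(axis=1))

gamma = 0.5772156649
print(logX_min.mean(), -i/n - np.log(n) - gamma)   # Eq. (9)
print(logX_min.var(),  i/n**2 + np.pi**2/6)        # Eq. (10)
```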

3.2 Log-likelihood

We now consider the distribution of the samples in log-likelihood. To get an insight into the analytics, we will examine the representative case of the d-dimensional multivariate Gaussian with maximum point \log\mathcal{L}_{\mathrm{max}} and length scale \sigma:

\log\mathcal{L}=\log\mathcal{L}_{\mathrm{max}}-X^{2/d}/2\sigma^{2}. (11)

Provided that \sigma^{d}\ll 1, the posterior can be approximated as

\mathcal{P}(X)=\tfrac{1}{(2\pi\sigma^{2})^{d/2}}e^{-X^{2/d}/2\sigma^{2}}, (12)

which in terms of \log\mathcal{L} becomes

\mathcal{P}(\log\mathcal{L})=\frac{1}{\Gamma\left(\frac{d}{2}\right)}e^{\log\mathcal{L}-\log\mathcal{L}_{\mathrm{max}}}(\log\mathcal{L}_{\mathrm{max}}-\log\mathcal{L})^{\frac{d}{2}-1}, (13)

i.e. 2(\log\mathcal{L}_{\mathrm{max}}-\log\mathcal{L})\sim\chi^{2}_{d}, with mean and variance

\langle\log\mathcal{L}\rangle_{\mathcal{P}}=\log\mathcal{L}_{\mathrm{max}}-\frac{d}{2},\quad\mathrm{Var}(\log\mathcal{L})_{\mathcal{P}}=\frac{d}{2}. (14)
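As a sanity check of Eqs. (13)-(14) (a toy verification, not part of the method itself), one can sample a spherical Gaussian posterior directly in parameter space and confirm that \log\mathcal{L}_{\mathrm{max}}-\log\mathcal{L} behaves as half a \chi^{2}_{d} variable:

```python
import numpy as np

rng = np.random.default_rng(2)
d, sigma, N = 16, 0.01, 200_000

# For a spherical Gaussian the posterior is theta ~ N(0, sigma^2 I_d), and
# logL_max - logL = |theta|^2 / (2 sigma^2), i.e. half a chi-squared_d variable
theta = rng.normal(scale=sigma, size=(N, d))
drop = (theta**2).sum(axis=1) / (2 * sigma**2)

print(drop.mean(), d / 2)   # mean of Eq. (14)
print(drop.var(),  d / 2)   # variance of Eq. (14)
```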

Given the prior is uniform, X\sim U(0,1), the KL divergence in the \sigma^{d}\ll 1 limit is:

\mathcal{D}_{\mathrm{KL}}=-\frac{d}{2}\log 2\Gamma(1+\tfrac{d}{2})^{2/d}\sigma^{2}. (15)

One can also derive the individual distributions for the live and dead points in log-likelihood. Note that this is merely the distribution of the values of the points, without their nested sampling weights. The dead distribution is

P(\log\mathcal{L})=\frac{\mathrm{d}\log X}{\mathrm{d}\log\mathcal{L}}P(\log X)\propto\frac{d\,n_{i}}{2(\log\mathcal{L}_{\mathrm{max}}-\log\mathcal{L})}, (16)

since the log-densities of the dead point prior volumes are proportional to the live point number at any given iteration. Meanwhile, the live points are uniformly distributed, and so have the distribution

P(\log\mathcal{L})=\frac{d}{2}\frac{(\log\mathcal{L}_{\mathrm{max}}-\log\mathcal{L})^{\frac{d}{2}-1}}{(\log\mathcal{L}_{\mathrm{max}}-\log\mathcal{L}_{i})^{\frac{d}{2}}}\quad[\log\mathcal{L}_{i}<\log\mathcal{L}<\log\mathcal{L}_{\mathrm{max}}]. (17)

Fig. 4 shows the distinction between the distributions of the live and dead point values, as well as the posterior.

How much further do the live points penetrate in log-likelihood? We seek the distribution for the maximum likelihood of the live points, \log\mathcal{L}_{(n_{i})}^{\mathrm{live}}, where we have used the notation of order statistics to denote x_{(k)} as the maximum of k points. For convenience, we will normalise the likelihoods as

y=\frac{\log\mathcal{L}-\log\mathcal{L}_{i}}{\log\mathcal{L}_{\mathrm{max}}-\log\mathcal{L}_{i}}, (18)

so that y=0 corresponds to the latest dead point at \log\mathcal{L}_{i} and y=1 to the maximum. The live point distribution simplifies to

P(y)=\frac{d}{2}(1-y)^{\frac{d}{2}-1}\quad[0<y<1]. (19)

Using the result that the maximum of n_{i} variables with cumulative distribution F(y) follows \frac{\mathrm{d}}{\mathrm{d}y}\left(F(y)^{n_{i}}\right), we obtain

P(y_{(n_{i})}^{\mathrm{live}})=\frac{n_{i}d}{2}(1-y_{(n_{i})}^{\mathrm{live}})^{\frac{d}{2}-1}\left(1-(1-y_{(n_{i})}^{\mathrm{live}})^{\frac{d}{2}}\right)^{n_{i}-1}\quad[0<y_{(n_{i})}^{\mathrm{live}}<1], (20)

which may be roughly summarised as

y_{(n_{i})}^{\mathrm{live}}\sim 1-\frac{\Gamma(1+\frac{2}{d})\Gamma(1+n_{i})}{\Gamma(1+\frac{2}{d}+n_{i})}\pm\left(\frac{\Gamma(1+n_{i})\Gamma(1+\frac{4}{d})}{\Gamma(1+\frac{4}{d}+n_{i})}-\frac{\Gamma(1+\frac{2}{d})^{2}\Gamma(1+n_{i})^{2}}{\Gamma(1+\frac{2}{d}+n_{i})^{2}}\right)^{\frac{1}{2}}, (21)

or in the large d, n_{i} limit

\lim_{d\to\infty}y_{(n_{i})}^{\mathrm{live}}\sim\frac{2H_{n_{i}}}{d}\pm\left(\frac{2(\pi^{2}-6\psi^{(1)}(1+n_{i}))}{3d^{2}}\right)^{\frac{1}{2}}, (22)
\lim_{d,n_{i}\to\infty}y_{(n_{i})}^{\mathrm{live}}\sim\frac{2\log n_{i}}{d}\pm\sqrt{\frac{2}{3}}\frac{\pi}{d}, (23)

where \psi^{(1)} is the trigamma function and H_{n_{i}} is the n_{i}th harmonic number. This is a very small fraction in high dimensions, showing that until the end the live points are far from the posterior bulk.
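The exact mean in Eq. (21) and the leading-order limit of Eq. (23) are easy to verify by simulation; the sketch below (an illustrative check with arbitrarily chosen d and n_{i}) samples y directly from Eq. (19) by inverting its cumulative distribution.

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(3)
d, n, trials = 100, 500, 100_000

# Invert the CDF of Eq. (19): F(y) = 1 - (1-y)^(d/2), so y = 1 - u^(2/d), u ~ U(0,1)
y = 1 - rng.uniform(size=(trials, n)) ** (2 / d)
y_max = y.max(axis=1)

# Exact mean from Eq. (21); the last number is the leading-order limit of Eq. (23)
exact = 1 - np.exp(gammaln(1 + 2/d) + gammaln(1 + n) - gammaln(1 + 2/d + n))
print(y_max.mean(), exact, 2 * np.log(n) / d)
```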

Alternatively, the same result follows intuitively from the fact that at each step, y increases by

\lim_{n_{i}\to\infty}\Delta y\approx\frac{\mathrm{d}y}{\mathrm{d}\log X}\Delta\log X=\frac{2}{dn_{i}}, (24)

so that by again summing the harmonic series, we get

y_{(n_{i})}^{\mathrm{live}}=\frac{2\log n_{i}}{d}. (25)

Eq. 24 also implies that the normalised distance between the highest and second highest live point is roughly

y_{(n_{i})}^{\mathrm{live}}-y_{(n_{i}-1)}^{\mathrm{live}}\approx\frac{2}{d}. (26)

Before reaching the posterior bulk, \log\mathcal{L}_{\mathrm{max}}-\log\mathcal{L}_{i}>d/2, so we must have

\log\mathcal{L}^{\mathrm{live}}_{(n_{i})}-\log\mathcal{L}^{\mathrm{live}}_{(n_{i}-1)}>1. (27)

In other words, the highest likelihood point is always at least a factor of e greater in likelihood than the second highest. It is therefore typically the case that nearly all of the posterior mass is concentrated in a single point, the maximum live point, until the very end of the run when the prior volumes have shrunk enough to compensate.

Aside: nested sampling as a maximiser

Previous literature (Akrami et al., 2010; Feroz et al., 2011) has explored the potential for nested sampling to be used as a global maximiser, given its ability to handle multi-modalities. In particular, the latter authors emphasised that posterior samplers such as nested sampling find the bulk of the mass, not the maximum of the distribution, but that this can be remedied by tightening the termination criterion. We now use the machinery we have developed to put this statement on a more quantitative footing.

Let us take the current iteration to be the termination point with likelihood \log\mathcal{L}_{\mathrm{f}} and prior volume X_{\mathrm{f}}, so that

\epsilon=\frac{\int_{0}^{X_{\mathrm{f}}}\mathcal{L}\,\mathrm{d}X}{\int_{0}^{\infty}\mathcal{L}\,\mathrm{d}X}. (28)

Note that we have assumed that prior effects are negligible (so 1\approx\infty), and that \epsilon\ll 1 so that the denominator is approximately the accumulated evidence. Computing this for (11), we find the answer in terms of lower incomplete gamma functions

\epsilon=1-\frac{\Gamma_{d/2}\left(X_{\mathrm{f}}^{2/d}/2\sigma^{2}\right)}{\Gamma(d/2)}. (29)

Taking the X_{\mathrm{f}}\ll(\sqrt{2}\sigma)^{d} limit (almost certainly valid at termination) we find

\lim_{X_{\mathrm{f}}\ll(\sqrt{2}\sigma)^{d}}\epsilon\approx\frac{X_{\mathrm{f}}}{(\sqrt{2}\sigma)^{d}\,\Gamma\left(1+\frac{d}{2}\right)}=\frac{(\log\mathcal{L}_{\mathrm{max}}-\log\mathcal{L}_{\mathrm{f}})^{\frac{d}{2}}}{\Gamma\left(1+\frac{d}{2}\right)}. (30)

We thus have an expression relating \mathcal{L}_{\mathrm{f}} at termination to the termination fraction \epsilon. This becomes yet more pleasing in the large d limit: since \epsilon^{2/d}\to 1, a Stirling approximation gives

\lim_{d\to\infty}\log\mathcal{L}_{\mathrm{f}}\approx\log\mathcal{L}_{\mathrm{max}}-\frac{d}{2e}. (31)

In the event that we retain \epsilon, we replace \frac{d}{2e}\to\frac{d}{2e}\epsilon^{2/d}, so the \frac{d}{2e} term can only be beaten down by making \epsilon exponentially smaller as the dimension increases.

Putting this together, taking \mathcal{L}_{i} in (18) to be \mathcal{L}_{\mathrm{f}} and combining this with (23), we find

\boxed{\log\mathcal{L}_{\mathrm{max}}^{\mathrm{live}}\approx\log\mathcal{L}_{\mathrm{max}}-\frac{d}{2e}+\frac{\log n_{i}}{e}\pm\frac{\pi}{\sqrt{6}e}}, (32)

showing that in general nested sampling will finish at a contour d/2e away from the maximum log-likelihood. The final set of n_{i} live points gets you \log n_{i}/e closer, with a chance of getting \sim\pi/\sqrt{6}e=0.472 closer still by statistical fluctuation.

Making the traditional termination criterion stricter therefore has limited returns in high dimensions, if it is ultimately still based on the remaining evidence. However, nested sampling still shrinks around the maximum exponentially, so provided a good alternative termination criterion is chosen, it will get there in reasonable time.

To quantify this statement, we now calculate the number of iterations \Delta i required to get from the posterior bulk to the true maximum. At the posterior, we have roughly speaking

(\log X,\log\mathcal{L})=(-\mathcal{D}_{\mathrm{KL}},\log\mathcal{L}_{\mathrm{max}}-d/2). (33)

The prior volume \log X_{\delta} at which the likelihood is within \delta of the maximum can then be found by inverting Eq. (11):

\log X_{\delta}=-\mathcal{D}_{\mathrm{KL}}-\frac{d}{2}\left(\log\frac{d}{2}-\log\delta\right), (34)

which corresponds to an additional

\Delta i=\frac{nd}{2}\left(\log\frac{d}{2}-\log\delta\right) (35)

iterations after the bulk is reached, where n is the harmonic mean of the number of live points. One would therefore expect a general distribution to take \mathcal{O}(\tfrac{nd}{2}\log\tfrac{d}{2}) iterations to get from the usual nested sampling stopping point to within an e-fold of the maximum. A rule-of-thumb termination criterion could therefore be to run for at least \tfrac{nd}{2}\log\tfrac{d}{2} iterations after the posterior is reached.
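For a feel of the numbers, here is a short evaluation of Eq. (35) with illustrative (hypothetical) values of n, d and \delta:

```python
import numpy as np

# Extra iterations from the posterior bulk to within delta = 1 (one e-fold)
# of the maximum log-likelihood, Eq. (35), for illustrative values
n, d, delta = 500, 32, 1.0
extra = 0.5 * n * d * (np.log(d / 2) - np.log(delta))
print(f"{extra:.0f} further iterations")   # roughly 2.2e4 for these values
```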

A summary of the distances between the notable points at the end of a run is shown in Fig. 4.

Refer to caption
Figure 4: Distribution of samples as a function of \log\mathcal{L}, showing the posterior \mathcal{P}(\log\mathcal{L}), the distribution of the live points \pi(\log\mathcal{L}\mid\mathcal{L}>\mathcal{L}_{i}), and the distribution of the maximum likelihood live point P(\log\mathcal{L}_{\mathrm{max}}^{\mathrm{live}}). The distances between these locations at the end of the run are shown, the key takeaway being that in high dimensions the highest log-likelihood point of a nested sampling run is nowhere near the maximum.

3.3 Temperature

Motivations

As shown in the previous section, midway through the run nearly all of the posterior mass is concentrated at a single point. However, this does not capture the structure of the posterior that has been explored and all of the information it provides.

We have the potential to fix this because nested sampling is invariant to monotonic transformations, so we can transform the likelihood as \mathcal{L}\to\mathcal{L}^{\beta} without loss of information by trivially re-weighting the samples. Increasing \beta worsens the situation, while \beta\to 0 simply gives back the prior. There is, on the other hand, a significant intermediate range which makes the samples look like a posterior centred at the present contour, allowing us to recover the structure of the samples. A schematic of the procedure is shown in Fig. 5.

Refer to caption
Figure 5: The likelihood and posterior as a function of \log X in the middle of a nested sampling run. Almost all of the posterior mass is concentrated at a single point with the highest compression, because it is orders of magnitude higher in likelihood. Reducing \beta re-weights to shift the posterior mass; taking \beta\to 0 goes too far and gives back the prior, but there is some intermediate \beta=\beta_{i} which is significant because it centres the posterior at the current contour.

At this point it is relevant to note the correspondence between Bayesian inference and statistical mechanics, from which the above transform is derived. If one equates the parameters to microstates i, the negative log-likelihood to the microstate energy E_{i}, and the prior to the density of states g_{i}, then the posterior as given by the generalised Bayes' rule is the canonical ensemble

p(E_{i})=\frac{g_{i}e^{-\beta E_{i}}}{Z(\beta)}\quad\leftrightarrow\quad\mathcal{P}_{\beta}(\theta)=\frac{\mathcal{L}^{\beta}(\theta)\pi(\theta)}{\mathcal{Z}(\beta)} (36)

at the inverse temperature \beta=1/T. As noted by Habeck (2015), thermal algorithms such as thermodynamic integration (Gelman & Meng, 1998) get the evidence by evolving through a series of canonical ensembles via some temperature schedule, but nested sampling instead maintains microcanonical ensembles, which are the iso-likelihood contours. Instead of using temperature (Kirkpatrick et al., 1983; Swendsen & Wang, 1986) or energy (Wang & Landau, 2001) as a control parameter, nested sampling chooses a series of ensembles with constant relative volume entropy \Delta\log X, which allows the algorithm to handle phase transitions (Baldock et al., 2016).

Because the temperature of a microcanonical ensemble is a derived property rather than a parameter, there is some freedom in its definition. Returning to our original motivation, we make the connection that the temperature is the re-weighting \mathcal{L}\to\mathcal{L}^{\beta} which centres the ensemble around the current energy. We now present several temperatures that achieve this aim, each of which one can plausibly consider to be the current temperature of a nested sampling run.

A. Microcanonical temperature

The obvious candidate is the microcanonical temperature \partial S/\partial E, where the volume entropy is \log X and the energy is as usual -\log\mathcal{L}. This gives the density of states; as discussed in Skilling's original paper,

\beta_{\mathrm{M}}=-\frac{\mathrm{d}\log X}{\mathrm{d}\log\mathcal{L}}\bigg|_{\log\mathcal{L}_{i}} (37)

is the \beta at which \mathcal{L}^{\beta}X peaks at \log X_{i}, if we assume differentiability, which is exactly the intuition we were aiming for: it puts the ensemble bulk at the current contour.

Its value can be easily obtained via finite differences of the \log\mathcal{L} and \log X intervals, albeit subject to an arbitrary window size for the differencing. Indeed, materials science applications (Baldock et al., 2017) use this estimator to monitor the 'cooling' progress of nested sampling, with a window size of 1000 iterations.
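A minimal sketch of such a finite-difference estimator is given below (our own illustrative implementation, not the one used by Baldock et al.); it assumes arrays of dead-point \log\mathcal{L} and \log X values in run order. For a run with a constant number of live points n, one could take logX = -np.arange(1, len(logL) + 1) / n, following Eq. (8).

```python
import numpy as np

def microcanonical_beta(logL, logX, window=1000):
    """Sliding-window finite-difference estimate of beta_M = -dlogX/dlogL (Eq. 37).

    logL, logX : arrays of dead-point log-likelihoods and log prior volumes,
                 ordered by iteration (logL increasing, logX decreasing).
    """
    dlogX = logX[window:] - logX[:-window]
    dlogL = logL[window:] - logL[:-window]
    return -dlogX / dlogL   # positive, since logX falls as logL rises
```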

B. Canonical temperature

Another temperature considered by Habeck (2015) is that at which the current energy (i.e. -\log\mathcal{L}_{i}) is the average energy of the entire ensemble. One can obtain it by inverting

\langle\log\mathcal{L}\rangle_{\mathcal{P}_{\beta}}=\log\mathcal{L}_{i} (38)

to get the 'canonical' temperature \beta_{\mathrm{C}}. While \beta_{\mathrm{M}} is derived from (the gradient of) a single contour, this temperature uses the entire ensemble. It has the desirable property that it rises monotonically with compression, in analogy to a monotonic annealing schedule.

C. Bayesian temperature

We furthermore propose a temperature \beta_{\mathrm{B}} that is obtained via Bayesian inference, which returns a distribution rather than a point estimate. Since each value of \beta leads to a different likelihood \mathcal{L}^{\beta}, one can consider the posterior distribution as a function of \log X to be conditioned on \beta. We can therefore write

\mathcal{P}(\log X\mid\beta)=\frac{\mathcal{L}^{\beta}(X)\,X}{\mathcal{Z}(\beta)}. (39)

What we would really like is the distribution of \beta at the present iteration, so the natural step is to invert this via Bayes' rule:

P\left(\beta\mid\log X_{i}\right)=\frac{\mathcal{P}\left(\log X_{i}\mid\beta\right)P\left(\beta\right)}{P\left(\log X\right)}. (40)

As with all Bayesian analyses, the distribution of \beta is fixed up to a prior, which we choose to be uniform in \beta. The obtained temperatures are consistent with the previous two choices, which may seem oddly coincidental. In fact, closer inspection reveals that the values of \beta favoured by P(\beta\mid\log X_{i}) are those for which the posterior at the present contour, normalised by the corresponding evidence, is large. Thus the Bayesian temperature uses the same idea as the microcanonical one, except it accounts for the spread in the result.

Comparisons

Fig. 6 shows the three temperatures as a function of compression for two cases, one containing a phase transition and one without. They are consistent in both cases when there is a single dominant phase, but differ during a phase transition. The canonical temperature is the only one that rises monotonically with compression.

Refer to caption
Figure 6: Inferred temperatures using the microcanonical, canonical and Bayesian definitions. The shaded regions show the 1-2\sigma uncertainties. All are consistent for a single phase, but differ during a phase transition.

One should keep in mind that despite the above theoretical reasoning, our introduction of the likelihood transformation was ultimately motivated by our wish to utilise the extra degree of freedom it provides. As we will see below, we recommend choosing the exact definition depending on what is useful for the problem at hand.

3.4 Dimensionality

We can immediately use the inferred temperature to track how the effective dimensionality of the posterior changes throughout the run, which was previously inaccessible. Handley & Lemos (2019) demonstrated that at the end of a run, a measure of the number of constrained parameters is given by the Bayesian model dimensionality (BMD), defined as the posterior variance of the information content:

\frac{d_{G}}{2}=\int\mathcal{P}(\theta)\left(\log\frac{\mathcal{P}(\theta)}{\pi(\theta)}-\mathcal{D}_{\mathrm{KL}}\right)^{2}\mathrm{d}\theta=\langle\mathcal{I}^{2}\rangle_{\mathcal{P}}-\langle\mathcal{I}\rangle^{2}_{\mathcal{P}}. (41)

Calculating this quantity using an intermediate set of weighted samples (which is concentrated at a single point) leads to vanishing variance, and hence vanishing dimensionality. However, we can recover the structure of the posterior together with the true dimensionality by adjusting the temperature. Dimensionality estimates are plotted in Fig. 7 for a spherical 32-d Gaussian, for which the true dimensionality is known.
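In terms of samples at inverse temperature \beta, the information content is \mathcal{I}=\beta\log\mathcal{L}-\log\mathcal{Z}(\beta), so Eq. (41) reduces to d_{G}=2\beta^{2}\,\mathrm{Var}_{\mathcal{P}_{\beta}}(\log\mathcal{L}). Below is a from-scratch sketch of this computation (anesthetic provides equivalent functionality; the raw log-weights logw passed in here are an assumed input, e.g. the prior-volume weights of the dead points):

```python
import numpy as np

def bmd(logL, logw, beta):
    """Bayesian model dimensionality d_G = 2 * beta^2 * Var_{P_beta}(logL).

    logL : sample log-likelihoods; logw : their raw (prior-volume) log-weights.
    """
    a = beta * logL + logw            # unnormalised log posterior weights at temperature beta
    p = np.exp(a - a.max())
    p /= p.sum()
    mean = np.dot(p, logL)
    return 2 * beta**2 * np.dot(p, (logL - mean) ** 2)
```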

The different choices of temperature are again consistent, but for the rest of this paper we choose the Bayesian \beta, because it provides a better reflection of the uncertainty in the estimate; the others, while fluctuating around the true value, are often many standard errors away from it at any single point.

Refer to caption
Figure 7: Dimensionality estimates using the different temperatures for a spherical 32-d Gaussian. Again, all are consistent, but the Bayesian definition has an uncertainty which includes the true value far more consistently than the other definitions.

Anisotropic compression

Plots of sample dimensionality against compression also draw attention to the directions in which the samples are constrained throughout the run. As a concrete example, consider an elongated Gaussian in a unit hypercube prior with \mu=\bm{0} and \Sigma=\mathrm{diag}\left(10^{-3},10^{-3},10^{-3},10^{-6},10^{-6},10^{-6}\right), for which the dimensionality estimates are plotted in Fig. 8. Alongside is a view of the distribution of live points across the prior for two directions with different scales, which shows the level to which those parameters have been constrained at different times.

A feature of nested sampling made apparent here is that parameters with high variance are initially 'hidden'. Compression occurs in the direction which is most likely to yield a sample of higher likelihood, and initially it is much easier to find a better point along a direction that is still poorly constrained relative to its likelihood scale, i.e. the low-variance directions. The higher variance parameters are constrained much later, and before that happens it appears as though those parameters have a uniform distribution.

Refer to caption
Figure 8: Dimensionality estimates for a Gaussian that is elongated in half of its dimensions, with 1-2\sigma uncertainties shaded. The locations of the live points in the prior are shown at three stages, indicated by the connecting lines. As can be seen from the live point distribution, the prior does not compress in the higher variance direction until much later in the run, and early on it appears as if those directions are completely unconstrained.

It is important to appreciate that at lower compression the samples truly lie in a lower-dimensional space; this is not an artefact of the way we view them. Anticipating the full dimensionality of the space is therefore just as impossible as anticipating the spike in a slab-spike geometry, so in this sense such geometries contain a 'compressive phase transition'.

4 Endpoint prediction

As described in Petrosyan & Handley (2022) and further explored in the talk and upcoming paper Handley (2023a, b), the time complexity of nested sampling is

T\propto\langle\tfrac{1}{n_{i}}\rangle^{-1}\times\langle\mathcal{T}\{\mathcal{L}(\theta)\}\rangle\times\langle f_{\mathrm{sampler}}\rangle\times\mathcal{D}_{\mathrm{KL}}. (42)

The first term is the harmonic mean of the number of live points, \sim\mathcal{O}(n_{i}). The second term is the average time per likelihood evaluation. The third is the average number of evaluations required to replace a dead point with a live point at higher likelihood, which is given by the implementation and usually does not vary in orders of magnitude. The final term is the Kullback-Leibler divergence of Eq. (7), the compression factor required to get from the prior to the posterior. This term is generally outside of user control, in most cases a priori unknown, and of principal interest in this section.

4.1 The termination prior volume

Making the above discussion more precise, we wish to find the compression factor \log X_{\mathrm{f}} at which the termination criterion is met, which is larger in magnitude than \mathcal{D}_{\mathrm{KL}} (Fig. 3). The difficulty is that at an intermediate iteration we only know the posterior up to the maximum log-likelihood live point, which until just before the end is far from the posterior bulk.

In order to get an idea of where the true posterior bulk sits, we need to predict what the posterior looks like past the highest live point. We do this by extrapolating the known likelihood profile; that is, the trajectory of \mathcal{L}(X) traced out by the live and dead points. One would never use this predicted posterior to perform inference, since more accuracy can always be achieved by simply finishing the run. However, we will demonstrate it is sufficient for making a run-time prediction for \log X_{\mathrm{f}}.

Quantitatively, this proceeds as follows: fit a function f(X,\phi) with some parameters \phi to the known likelihood profile, which allows us to express the termination condition determining the prior volume we need to compress to as

\Delta\mathcal{Z}=\epsilon\mathcal{Z}_{\mathrm{tot}}, (43)

or equivalently

\int_{0}^{X_{\mathrm{f}}}f(X,\phi)\,\mathrm{d}X=\epsilon\left(\int_{0}^{X_{i}}f(X,\phi)\,\mathrm{d}X+\mathcal{Z}_{\mathrm{dead}}\right), (44)

where X_{i} is the volume of the iteration we have currently compressed to, and \mathcal{Z}_{\mathrm{dead}} is the evidence we have accumulated up to this point. X_{\mathrm{f}} can then be identified by solving the above equation either analytically or numerically.

Once X_{\mathrm{f}} is known, the corresponding iteration count depends on the live point schedule. For example, in the constant n_{i} case \log X decreases by 1/n at each iteration, so the total number of iterations N_{\mathrm{f}} would be

N_{\mathrm{f}}=-n\log X_{\mathrm{f}}. (45)
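For a generic fitted profile, Eqs. (44)-(45) can be solved numerically; the sketch below is illustrative only, working with a linear-units profile f(X) for readability (for sharply peaked profiles, the analytic route of Section 4.2 is both faster and more robust).

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def endpoint_iterations(f, X_i, Z_dead, eps=1e-3, n=500):
    """Solve Eq. (44) for log X_f given a fitted profile f(X), then convert to
    an iteration count via Eq. (45), assuming a constant number of live points n."""
    Z_live = quad(f, 0, X_i)[0]              # evidence still enclosed by the current contour
    target = eps * (Z_live + Z_dead)
    # Root-find in log X for stability, since X_f can be extremely small
    g = lambda logX: quad(f, 0, np.exp(logX))[0] - target
    logX_f = brentq(g, np.log(X_i) - 500.0, np.log(X_i))
    return -n * logX_f
```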

4.2 How to extrapolate?

A key observation is that the Bayesian model dimensionality is the equivalent dimension of the posterior if it were actually Gaussian. Fitting a Gaussian of this dimension to the likelihood profile therefore makes a reasonable approximation to the true distribution, without explicitly assuming the form of the likelihood function. The parameterisation of the Gaussian that we fit is the same as that given in Section 3.2, which we shall repeat here for clarity:

f(X;\phi)=\log\mathcal{L}_{\mathrm{max}}-X^{2/d}/2\sigma^{2}. (46)

The extrapolation then proceeds thus:

1. Find the current dimensionality d^{(i)}_{G} of the posterior at the Bayesian temperature.

2. Take the live point profile and perform a least squares fit to (46), stipulating that d=d^{(i)}_{G}, to infer \log\mathcal{L}_{\mathrm{max}} and \sigma.

3. Use the likelihood predicted by these parameters to solve (44) for X_{\mathrm{f}}.

The advantage of fitting a Gaussian is that the procedure can be sped up analytically. Firstly, the least squares regression is trivial because analytic estimators exist; the cost function

C^{2}(\log\mathcal{L}_{\mathrm{max}},\sigma)=\sum_{i}\left|\log\mathcal{L}_{i}-f(X_{i};\log\mathcal{L}_{\mathrm{max}},\sigma)\right|^{2} (47)

is minimised with respect to (\log\mathcal{L}_{\mathrm{max}},\sigma) when

\sigma^{2}=\frac{N\sum_{i}X_{i}^{4/d}-\left(\sum_{i}X_{i}^{2/d}\right)^{2}}{2\sum_{i}\log\mathcal{L}_{i}\sum_{i}X_{i}^{2/d}-2N\sum_{i}X_{i}^{2/d}\log\mathcal{L}_{i}}, (48)

and

\log\mathcal{L}_{\mathrm{max}}=\frac{1}{N}\sum_{i}\log\mathcal{L}_{i}+\frac{1}{2N\sigma^{2}}\sum_{i}X_{i}^{2/d}. (49)

Secondly, the termination prior volume can also be obtained analytically. Rewriting Eq. 44 in terms of the Gaussian parameters gives

\epsilon=\frac{\int_{0}^{X_{\mathrm{f}}}\mathcal{L}_{\mathrm{max}}\exp\left(-X^{2/d}/2\sigma^{2}\right)\mathrm{d}X}{\int_{0}^{X_{i}}\mathcal{L}_{\mathrm{max}}\exp\left(-X^{2/d}/2\sigma^{2}\right)\mathrm{d}X+\mathcal{Z}_{\mathrm{dead}}}. (50)

The integrals have the analytic solution

\int_{0}^{X_{k}}\mathcal{L}_{\mathrm{max}}\exp\left(-X^{2/d}/2\sigma^{2}\right)\mathrm{d}X=\frac{d}{2}\cdot\left(\sqrt{2}\sigma\right)^{d}\cdot\gamma_{k}, (51)

where \gamma_{k}=\Gamma_{d/2}\left(X_{k}^{2/d}/2\sigma^{2}\right) is the lower incomplete gamma function. After taking the inverse of \gamma and a few more steps of algebra, we arrive at

\log X_{\mathrm{f}}=\frac{d}{2}\log 2\sigma^{2}+\log\Gamma^{-1}_{d/2}\left(\epsilon\gamma_{i}+\frac{\epsilon\mathcal{Z}_{\mathrm{dead}}}{\left(2\sigma^{2}\right)^{d/2}\mathcal{L}_{\mathrm{max}}}\right), (52)

and N_{\mathrm{f}} is of course just -n multiplied by this. Intuitively, the above procedure can be thought of as inferring the number of constrained parameters, then extrapolating them up to find the point at which they will be fully constrained.

Uncertainties in the final estimate are obtained by drawing many samples from the distribution of d_{G} defined by the Bayesian temperature, and repeating step two for each. One might wonder why we do not obtain d via least squares regression together with the other parameters; extensive testing has shown this approach to be far less stable.
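Steps two and three are cheap enough to be repeated for every sampled value of d_{G}. The sketch below is our own illustrative implementation of them (not the aeons package itself): it applies the estimators of Eqs. (48)-(49) to the live point profile and then solves the termination condition of Eq. (50) using regularised incomplete gamma functions, which is equivalent to Eq. (52) up to normalisation conventions.

```python
import numpy as np
from scipy.special import gammainc, gammaincinv, gammaln

def predict_logXf(logL, logX, d, logZ_dead, eps=1e-3):
    """Fit the Gaussian profile of Eq. (46) to the live points and return log X_f.

    logL, logX : live point log-likelihoods and log prior volumes;
    d          : dimensionality sampled from the BMD distribution;
    logZ_dead  : log evidence accumulated so far.
    """
    t, N = np.exp(2 * logX / d), len(logL)                           # t = X^(2/d)
    sigma2 = (N * (t**2).sum() - t.sum()**2) / \
             (2 * logL.sum() * t.sum() - 2 * N * (t * logL).sum())   # Eq. (48)
    logLmax = logL.mean() + t.sum() / (2 * N * sigma2)               # Eq. (49)

    # Evidence inside volume X: Lmax (2 sigma^2)^(d/2) Gamma(1+d/2) P(d/2, X^(2/d)/2sigma^2),
    # with P the regularised lower incomplete gamma function
    logC = 0.5 * d * np.log(2 * sigma2) + gammaln(1 + d / 2)
    P_i = gammainc(d / 2, t.max() / (2 * sigma2))
    P_f = eps * (P_i + np.exp(logZ_dead - logLmax - logC))           # from Eq. (50)
    u_f = gammaincinv(d / 2, P_f)
    return 0.5 * d * np.log(2 * sigma2 * u_f)    # endpoint iterations: N_f = -n * logX_f
```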

4.3 Alternative approaches

More comprehensive Bayesian approaches, perhaps including a priori information about the likelihood or greater flexibility in the fitting function, could likely perform better than what we have just presented. However, such methods would not befit run-time prediction, which has a much more limited computational budget, hence the more pragmatic approach we have adopted. Here, we discuss, as a benchmark, alternative approaches to endpoint estimation that have a comparable computational complexity.

A. Integral progress

An alternative approach used in Ultranest (Buchner, 2021) derives a progress bar based on the fraction of the accumulated integral compared to the remaining integral, approximated as

\mathcal{Z}_{\mathrm{rem}}\approx\mathcal{L}_{\mathrm{max}}^{\mathrm{live}}X_{i}. (53)

This has several shortcomings. First, run-time is proportional to compression rather than accumulation of the integral, since it takes just as long to traverse the width of the bulk as it does any other width. Second, because of the point-like nature of the posterior mid-run, the remaining integral approximated as such holds nearly all of the evidence, so the relative fraction of the accumulated and remaining evidence is almost zero for most of the run. Finally, approximation (53) is always an underestimate, because as previously found the maximum live point is generally nowhere near the true maximum. This approach can however be useful in the low dimensions appropriate for Ultranest, where the live points are always near the maximum, but in general it is less reliable.

B. Extrapolating evidence increments

Seasoned watchers of nested sampling runs might be curious how the method compares to simply extrapolating the increments of evidence to roughly estimate when the evidence converges. We do this for a spherical Gaussian and compare it to our method. At an intermediate stage of the run, the most recent outputs might look something like that shown in the first two columns of the table in Fig. 9(a). Extrapolating those data to a linear and exponential profile yields endpoint estimates plotted in the graph to the right.

The linear extrapolation is clearly an underestimate, since it fails to account for the long tail of the nonlinear profile. The increments are also not exactly exponential, since the exponential fit leads to a large over-prediction. The predicted endpoint over the course of a run for d=16, \sigma=0.01, as shown in Fig. 10, shows the same result. One might expect an average to be more accurate, but this tends to be biased towards the exponential prediction, and there is no obvious choice of weighting that would fix this.
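As a concrete illustration, the sketch below fits the two benchmark extrapolations to the tabulated increments of Fig. 9(a); the convergence tolerance used to define the exponential endpoint is a hypothetical choice of ours, since the fitted increments never reach exactly zero.

```python
import numpy as np

# Evidence increments per 500 iterations, taken from the table in Fig. 9(a)
iters = np.arange(5000, 10000, 500)
dlogZ = np.array([190.8, 171.2, 140.9, 132.2, 106.6, 94.7, 87.7, 82.9, 67.9, 59.1])

# Linear fit: endpoint where the increments would hit zero (an under-prediction)
slope, intercept = np.polyfit(iters, dlogZ, 1)
print("linear estimate:     ", -intercept / slope)

# Exponential fit: endpoint where the increments fall below a chosen tolerance
# (an over-prediction for any reasonable tolerance)
k, logA = np.polyfit(iters, np.log(dlogZ), 1)
print("exponential estimate:", (np.log(1e-3) - logA) / k)
```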

More importantly, we find that for real likelihoods which have an element of noise the extrapolation often diverges, for instance when the increments do not monotonically decrease. Directly extrapolating the evidence increments is therefore far less stable than the previous method, and generally not a reliable method for prediction.

iteration  log Z  Δ log Z
5000 -1435.8 190.8
5500 -1264.6 171.2
6000 -1123.7 140.9
6500 -991.5 132.2
7000 -885.0 106.6
7500 -790.3 94.7
8000 -702.6 87.7
8500 -619.7 82.9
9000 -551.8 67.9
9500 -492.7 59.1
(a)
Refer to caption
(b)
Figure 9: Extrapolating the increments of evidence. The left column shows the output of a nested sampling run, and the right column shows the extrapolation.

Refer to caption
Figure 10: Endpoint predictions for a spherical Gaussian, with 1-2\sigma uncertainties shaded. Extrapolating the evidence increments in a linear/exponential manner under/over-predicts the endpoint, both of which perform considerably worse than the method of extrapolating the likelihood.

5 Results

We now test the approach established in the previous section on a range of distributions. We begin by considering a series of toy examples to explore the capabilities and limitations of the method, before presenting results for real cosmological chains.

5.1 Toy examples

Throughout the toy examples we make use of the perfect nested sampling (Keeton, 2011; Higson et al., 2018b) framework implemented in anesthetic (Handley, 2019).

A. Gaussians

Predictions for spherical Gaussians of various dimensions are shown in Fig. 11 as a benchmark for when fitting a Gaussian distribution is exact. In all endpoint prediction plots, the shading indicates the 1-2\sigma uncertainties. All were run with n_{i}=500, except for one run with n_{i}=2000 for comparison, with each Gaussian having a width of \sigma=0.01. The correct endpoint is recovered to within standard error at all points except the very beginning, when the parameters have hardly been constrained.

We note that as with nested sampling in general, increasing n_{i} improves the resolution and reliability of the inferences, which can be seen from the middle two plots.

Refer to caption
Figure 11: Endpoint predictions for a spherical Gaussian run with n_{i}=500 (except for the third plot from left). The correct endpoint is obtained for all but the earliest iterations, and the uncertainty is controlled by the number of live points, which can be seen from the two d=16 plots.

We also observe the effect of elongating the Gaussian, using the same example as Section 3.4. Fig. 12 shows a step-like trend similar to the inferred dimensionalities, reflecting the fact that the full dimensionality is undetectable at lower compression factors. The endpoint for a likelihood whose remaining three directions are completely unconstrained coincides with our predictions at early iterations, showing that the two cases are indistinguishable.

Refer to caption
Figure 12: Endpoint prediction for an elongated Gaussian. At early stages, the full dimensionality is undetectable, and the endpoint is predicted to be the same as for a likelihood with three unconstrained directions. Only once the prior has been compressed enough to constrain the other three directions does the prediction converge to the true value.

B. Cauchy

One case that might be expected to cause problems is the pathological Cauchy distribution, which is far from a Gaussian. Fig. 13 shows the predictions for a likelihood of the form

\log\mathcal{L}=\log\mathcal{L}_{\mathrm{max}}-\frac{1+d}{2}\log\left(1+\frac{X^{2}}{\gamma^{2}}\right), (54)

choosing d=10 and allowing \gamma to vary. The correct estimate is obtained to within standard error by about halfway, but before that it is inaccurate. The key limitation is that the estimate is wrong early on not because the compression is anisotropic, or because there is a phase transition, but rather as a limitation of reducing the likelihood to a Gaussian via the BMD, which is itself less stable for a Cauchy.

Nevertheless, the right order of magnitude is obtained at all times, so this remains sufficient for most use-cases. The Cauchy is also a pathological case, and the same problem does not in practice appear for more realistic cases, as we shall see next.

Refer to caption
Figure 13: Predictions for the Cauchy distribution for various widths \gamma and n_{i}=500. The endpoint is underestimated for the first half of the run, but this is a limitation of the Gaussian approximation rather than a lack of information mid-run.

5.2 Cosmological examples

Finally, we evaluate the method on real cosmological chains. Fig. 14 presents the endpoints (calculated after the fact) for nested sampling runs for curvature quantification on several common cosmological data sets (details in Handley 2021).

The SH0ES, BAO and lensing chains are 'easy' low \mathcal{D}_{\mathrm{KL}} inferences, so it is expected that the correct endpoint is inferred practically from the start. The Planck endpoints, on the other hand, are not correct until at least midway through. However, this is expected from the covariance of the Planck likelihood, which consists of principal components of many scales and is therefore elongated in many dimensions. It is therefore of the same class as the elongated Gaussian presented in Section 3.4; the samples exist in a lower dimensional subspace mid-run, which slowly increases to the full dimensionality only at the end of the run.

Refer to caption
Figure 14: Endpoint predictions for cosmological likelihoods. The first three low \mathcal{D}_{\mathrm{KL}} inferences get the correct endpoint from the start, while the Planck chain takes longer to converge because the likelihood is highly covariant and thus subject to anisotropic compression, which makes the samples lie in a lower dimensional subspace mid-run.

6 Conclusion

We have derived new analytic results mapping out the anatomy of nested sampling, understanding the progression of a run via the compression of prior volume, the increase in log-likelihood, the inferred temperature schedule, and the convergence of the sample dimensionality. From these analyses, we developed a method for predicting the endpoint of a nested sampling run, by using the inferred Bayesian model dimensionality mid-run to extrapolate the known likelihood profile.

The method in general converges on a correct prediction of the endpoint by about halfway, and gets the correct order of magnitude throughout. Consistent predictions are obtained for both toy and cosmological examples. The accuracy is typically limited by the information available mid-run, either because of a phase transition or because of the anisotropy of the nested sampling compression. Pathological distributions, such as a Cauchy, lead to less stable inferences of the dimensionality and expose the limitations of a Gaussian approximation, though the order of magnitude is still correct.

Further work can be done to experiment with more flexible basis functions for regression of the likelihood profile, so that it is less dependent on the Gaussian approximation.

Code availability

A package is in development to implement the endpoint prediction mechanism, for easy plug-in to existing implementations of nested sampling. The latest updates can be found at github.com/zixiao-h/aeons.

Acknowledgements

WJH is supported by the Royal Society as a Royal Society University Research Fellow at the University of Cambridge. AB was supported by a Cambridge Mathematical Placement and a Royal Society summer studentship. ZH was supported by a Royal Society summer studentship. We thank Mike Hobson for helpful discussion during ZH's Part III viva.

References