The Variance and Covariance of Counts-in-Cells Probabilities

Andrew Repp & István Szapudi
Institute for Astronomy, University of Hawaii, 2680 Woodlawn Drive, Honolulu, HI 96822, USA
Abstract

Counts-in-cells (CIC) measurements contain a wealth of cosmological information yet are seldom used to constrain theories. Although we can predict the shape of the distribution for a given cosmology, to fit a model to the observed CIC probabilities requires the covariance matrix – both the variance of counts in one probability bin and the covariance between counts in different bins. To date, there have been no general expressions for these variances. Here we show that correlations of particular levels, or “slices,” of the density field determine the variance and covariance of CIC probabilities. We derive explicit formulae that accurately predict the variance and covariance among subvolumes of a simulated galaxy catalog, opening the door to the use of CIC measurements for cosmological parameter estimation.

keywords:
cosmology: theory – cosmology: miscellaneous – methods: statistical

1 Introduction

One of the primary motivations for galaxy surveys is their ability to constrain cosmological models, a consequence of the statistical imprint which cosmology leaves upon the galaxy distribution. The galaxy power spectrum is an observable which captures a large amount of the information inherent in these surveys (e.g., Peebles 1980; Baumgart and Fry 1991); indeed, for a Gaussian field the power spectrum encodes all of the field’s cosmologically relevant information.

However, the power spectrum is sensitive to the second moment of the distribution only. Thus, since the matter distribution is non-Gaussian (notably so on scales of $10h^{-1}$ Mpc or less), the power spectrum is blind to the cosmological information residing in the higher moments on these scales (Meiksin and White 1999; Rimes and Hamilton 2005). Even worse, because the distribution is close to lognormal, a significant amount of information escapes the entire hierarchy of $N$-point correlation functions (Carron 2011; Carron and Neyrinck 2012). Other statistical tools are thus necessary to capture the information lost in the power spectrum.

The log transform – again because of the approximate lognormality of the distribution – represents one means of recapturing this information (Neyrinck et al. 2006, 2009; Repp and Szapudi 2017); indeed, the log power spectrum captures virtually all of it (Carron and Szapudi 2013, 2014). Another means of recapturing at least some of this information is to consider the (one-point) counts-in-cells (CIC) probability distribution function (PDF). Although this measure ignores the information inherent in spatial correlations, its higher-order moments encode information to which the power spectrum is blind, and thus joint analysis of the PDF and power spectrum can provide significantly tighter constraints on cosmological parameters than analysis of the power spectrum alone (Uhlemann et al. 2020).

Theoretical work on cosmological applications of CIC dates at least to the efforts of Balian and Schaeffer (1989) – who derive the form of the matter PDF under the assumption of scale-invariant $N$-point correlation functions – resulting in a model by Bernardeau and Schaeffer (1991) for galaxy multiplicity functions. Analysis continued with the study by Colombi (1994) of log moments (via the Edgeworth expansion); the proposal by Bernardeau and Kofman (1995) of methods for generating PDFs; and the derivation by Bernardeau (1994a, b) of cumulants for the matter PDF. Colombi et al. (1995) in turn examine the errors introduced by finite-volume effects, and Szapudi and Colombi (1996) characterize the effects of cosmic-variance error. Other existing theoretical work includes analysis of sampling effects (Colombi et al. 1998) and of the distribution of probability measurements (Szapudi et al. 2000). Valageas (2002) provides a non-perturbative calculation of the PDF in the quasilinear regime; likewise, Uhlemann et al. (2018a, b, 2020) use large deviation statistics to predict the CIC for galaxy surveys.

CIC measurements also have a long history, including their use in simulations by Baugh et al. (1995) to determine the $N$-point correlation functions and by Colombi et al. (2000) to determine the void probability distribution. Application to survey data includes analyses by Szapudi et al. (1992), Gaztanaga (1994), and Szapudi et al. (1995, 1996) of early projected surveys, and by Baugh et al. (2004) and Croton et al. (2007) of the 2dFGRS data; measurement by Pápai and Szapudi (2010) of the PDF in the SDSS LRG sample; determination by Wolk et al. (2013) of higher-order statistics in the CFHTLS-W survey; and verification by Clerkin et al. (2017) of log-normality of the projected CIC in DES data.

Turning to actual constraints on cosmological parameters, Gruen et al. (2018), together with Friedrich et al. (2018), provide the first complete cosmological analysis of the galaxy density PDF (in combination with lensing), thereby deriving cosmological constraints from DES and SDSS data. In addition, Salvador et al. (2019) use CIC statistics to analyze nonlinear galaxy bias in DES data, and Repp and Szapudi (2020) derive joint constraints on $\sigma_8$ and linear galaxy bias from CIC in SDSS data.

However, any use of CIC measurements to constrain cosmology requires an accounting for both the variance in the probability measurements and the covariance between measurements in different probability bins. It is likely that this fact plays a large role in the temporal gap (of nearly thirty years) between the initial theoretical work and the fits of Gruen et al. (2018). Early efforts such as Colombi (1994) suggested that fitting log moments might be more tractable, but until now fits to the entire PDF have typically required an entire ensemble of cosmological simulations to estimate the covariance matrix.

Therefore, as an alternative to running a computationally expensive suite of simulations, we in this work derive analytical expressions for the variance and covariance of CIC probability measurements. To test our results, we empirically determine the variability of probability measurements in a Millennium Simulation galaxy catalog, and we show that our expressions accurately predict the variance and covariance of the measured galaxy PDF.

We structure the remainder of this work as follows: Section 2 defines slice fields corresponding to probability bins; these slice fields form the foundation for the derivation. Section 3 derives the variance of CIC measurements along with the expected error on the estimator for this variance; we then demonstrate the accuracy of our result by comparing it to simulation measurements of the CIC variance. Likewise, Section 4 derives the covariance between CIC measurements in different bins of probability; we again determine the expected error on the covariance estimator and demonstrate the accuracy of our result. Discussion and conclusions follow in Sections 5 and 6.

2 The Slice Field

Suppose we have a field of objects (such as galaxies in a survey) contained in a number $N_c$ of non-overlapping cells positioned at $\mathbf{r}_1,\mathbf{r}_2,\ldots,\mathbf{r}_{N_c}$. If each cell contains a whole number $N_i=N(\mathbf{r}_i)$ of objects, it is straightforward to measure the probability distribution $\mathcal{P}(N)$ for the field. More generally, if we bin the measured numbers $N$ into non-overlapping bins $B_1,B_2,\ldots$, we can likewise measure the probability $\mathcal{P}(B)$ for any given bin $B$, recovering $\mathcal{P}(N)$ when the bins have unit width.

For a given bin $B$, we now define $\mathcal{S}_B$, the slice field for $B$, such that $\mathcal{S}_B(\mathbf{r}_i)=1$ if $N(\mathbf{r}_i)\in B$; otherwise $\mathcal{S}_B(\mathbf{r}_i)=0$. This field $\mathcal{S}_B$ thus identifies the spatial location of a particular slice of the possible values of $N$, namely, those for which $N\in B$. Bins of unit width allow us to recover the original $N$-field of counts-in-cells (CIC) by summing over the slice fields:

$$N(\mathbf{r}_i)=\sum_N N\,\mathcal{S}_{\{N\}}(\mathbf{r}_i). \tag{1}$$

Wider bins likewise recover a binned version of the original field.

Two simple slice-field properties will be useful in the sequel. First, because all slice-field values are either 0 or 1, all moments of any given slice field are equal to the probability of the associated bin:

$$\mathcal{P}(B)=\langle\mathcal{S}_B\rangle=\langle\mathcal{S}_B^{\,n}\rangle, \tag{2}$$

for all natural numbers $n$. Second, if two bins $B_1$ and $B_2$ are disjoint, their corresponding slice fields must also be disjoint:

$$\mathcal{S}_{B_1}\cdot\mathcal{S}_{B_2}\equiv 0 \mbox{ if } B_1\cap B_2=\varnothing. \tag{3}$$

The correlation function of the slice field is related to the sliced correlation functions as defined in Neyrinck et al. (2018); it is straightforward to recover the sliced correlation functions from these slice fields by a 1-point slice-averaging at one end.
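The slice-field definition and Equations 1–3 can be illustrated with a small numerical sketch. The Poisson toy field, the grid size, the bin choices, and the helper name `slice_field` below are all assumptions made for illustration; none of them come from the analysis in this paper.

```python
import numpy as np

def slice_field(counts, bin_values):
    """Return the 0/1 slice field S_B: 1 where the cell count lies in bin B."""
    return np.isin(counts, list(bin_values)).astype(np.int64)

# Hypothetical toy counts-in-cells field (Poisson draws, purely illustrative).
rng = np.random.default_rng(0)
counts = rng.poisson(lam=2.0, size=(8, 8, 8))

# Equation 1: unit-width slices, weighted by N, sum back to the counts field.
recovered = sum(n * slice_field(counts, {n}) for n in range(counts.max() + 1))

# Equation 2: every moment of a slice field equals the bin probability.
S_low = slice_field(counts, {0, 1})
P_low = S_low.mean()

# Equation 3: slice fields of disjoint bins never overlap.
S_high = slice_field(counts, {2, 3})
overlap = int((S_low * S_high).sum())
```

Here `recovered` reproduces `counts` exactly, and `overlap` is zero because the bins {0, 1} and {2, 3} are disjoint.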

3 Counts-in-cells Variance

3.1 Theoretical Prediction

We now turn to the problem of determining the error on the measured CIC probability in a way that accounts for correlations between neighboring cells. Given a bin $B$ of counts, and writing $\mathcal{S}$ for $\mathcal{S}_B$, we see from Equation 2 that the probability of that bin is $\mathcal{P}(B)=\langle\mathcal{S}\rangle$. Furthermore, the variance of $\mathcal{S}$ is $\sigma^2_{\mathcal{S}}=\langle\mathcal{S}^2\rangle-\langle\mathcal{S}\rangle^2=\mathcal{P}(B)\left(1-\mathcal{P}(B)\right)$.

If the survey cells are uncorrelated, it follows that the variance of $\mathcal{P}(B)$ is given by

$$\sigma^2_{\mathcal{P}(B)}=\sigma^2_{\langle\mathcal{S}\rangle}=\frac{\mathcal{P}(B)\left(1-\mathcal{P}(B)\right)}{N_c}. \tag{4}$$

(Note that this expression appears in Colombi et al. (1995) for $B=\{0\}$ with no correlation.) Equation 4 treats each cell as an independent measurement of the probability; correlations between cells will decrease the effective number of independent measurements, requiring modification of the expression.

Thus, to handle the case of correlated cells, we first consider the artificial case of an $\mathcal{S}$-field consisting of $n$ survey cells such that the correlation between any two of them is a fixed value $\xi$. Explicitly, if $s_i$ is the value of $\mathcal{S}$ in the $i$th cell, then for $i\neq j$ we have $\langle s_i s_j\rangle=(1+\xi)P^2$, where $P=\mathcal{P}(B)$ is the probability that $\mathcal{S}=1$ in any given cell. (Note that $\xi$ is the correlation of $\mathcal{S}$, not of the counts-in-cells $N$.) We know that

$$\mathcal{P}(B)=\langle\mathcal{S}\rangle=\frac{1}{n}\sum_{i=1}^{n}s_i. \tag{5}$$

In Appendix A.1 we show that the variance $\sigma^2_{\mathcal{P}(B)}$ of this quantity is given by

$$\mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n}s_i\right)=\frac{\left(1-\left(1-(n-1)\xi\right)P\right)P}{n}. \tag{6}$$

This expression, however, presupposes an equal degree of correlation $\xi$ between all $n$ cells. Obtaining a similar expression for the general case of varying $\xi$ is analytically intractable. We note, however, that in the inductive proof of Equation 6 (Appendix A.1), the term through which new instances of $\xi$ accumulate into the expression is $\langle s_i s_{n+1}\rangle$ (in Equation 22). This term has precisely the form expected for a Monte Carlo volume average of a spatially varying $\xi(r)$. With this motivation, we approximate $\xi$ in Equation 6 with the average slice-field correlation $\overline{\xi}_{\mathcal{S}}$, where the average is taken over all pairs $(\mathbf{r}_i,\mathbf{r}_j)$ of survey cells ($i\neq j$). (In Section 3.3 we verify the validity of this approximation numerically.) Performing this substitution and writing $N_c$ (the number of cells) for $n$, we obtain for any counts-in-cells bin $B$,

$$\sigma^2_{\mathcal{P}(B)}=\frac{\mathcal{P}(B)\left(1-\mathcal{P}(B)\right)}{N_c}+\frac{(N_c-1)\,\overline{\xi}_{\mathcal{S}}}{N_c}\,\mathcal{P}(B)^2, \tag{7}$$

which reduces to Equation 4 in the uncorrelated case.
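As a minimal sketch, Equation 7 can be written as a short function. The function name, argument names, and the numerical values used below are illustrative assumptions only; in practice $\mathcal{P}(B)$ and $\overline{\xi}_{\mathcal{S}}$ would come from measurement.

```python
def cic_variance(P, n_cells, xi_bar=0.0):
    """Equation 7: variance of a bin probability P measured from n_cells cells,
    given the pair-averaged slice-field correlation xi_bar."""
    binomial_term = P * (1.0 - P) / n_cells                  # Equation 4
    correlation_term = (n_cells - 1) * xi_bar * P**2 / n_cells
    return binomial_term + correlation_term

# Made-up inputs: P(B) = 0.1 measured from 1000 cells.
uncorrelated = cic_variance(0.1, 1000)               # reduces to Equation 4
correlated = cic_variance(0.1, 1000, xi_bar=0.05)    # positive correlation inflates it
```

With these inputs even a modest average correlation dominates the binomial term, because the correlation term does not fall off as $1/N_c$.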

3.2 Error on the Measured Variance

In order to test Equation 7, we need multiple measurements of the probability $\mathcal{P}(B)$. It is possible to obtain such measurements from a single simulation by dividing it into multiple subvolumes, calculating $\mathcal{P}(B)$ in each subvolume, and then observing the variance of those measured $\mathcal{P}(B)$-values. However, a meaningful comparison between measurement and theory requires an estimate of the possible scatter in these observations of the variance. To this we now turn.

Suppose we have a set $\left\{P_i\right\}$ of $N_m$ measurements of a probability $\mathcal{P}(B)$. Let us denote the measured variance of this set as $s_P^2$. Then according to a standard result (e.g., Kendall and Stuart 1958, eq. 12.35, converting cumulants to moments),

$$\mathrm{Var}\left(s_P^2\right)=\frac{\mu_4-\mu_2^2}{N_m}+\frac{2\mu_2^2}{N_m(N_m-1)}, \tag{8}$$

where $\mu_j$ denotes the $j$th central moment of the distribution of $P_i$-values. We can use the measured variance $s_P^2$ for $\mu_2$; however, we must perform an additional estimation of $\mu_4$. For simplicity, in calculating $\mu_4$ we will ignore the correlation between neighboring cells; we shall however mitigate the effect of this simplification by expressing our results in terms of $s_P^2$, which does include the effects of correlation.

Proceeding to determine $\mu_4$, and employing $P$ as shorthand for $\mathcal{P}(B)$, it is straightforward to obtain the moment-generating function for the slice field $\mathcal{S}$ corresponding to bin $B$:

$$M_{\mathcal{S}}(t)=1+\left(e^{t}-1\right)P. \tag{9}$$

Now, if each measurement $P_i$ was obtained by averaging the $\mathcal{S}$-values in $N_c$ independent cells (see Equation 5), then the moment-generating function for the probability is that of a mean of $N_c$ such slice values, $M_{\mathcal{P}}(t)=\left[M_{\mathcal{S}}(t/N_c)\right]^{N_c}$, that is,

$$M_{\mathcal{P}}(t)=\left(1+\left(e^{t/N_c}-1\right)P\right)^{N_c}. \tag{10}$$

From this function (see Appendix A.2 for details) we obtain the following expression for $\mu_4$:

$$\mu_4=3\left(s_P^2\right)^2+\frac{s_P^2}{N_c^2}(1-6P+6P^2); \tag{11}$$

inserting this result into Equation 8 and simplifying, we obtain

$$\mathrm{Var}\left(s_P^2\right)=\frac{2\left(s_P^2\right)^2}{N_m-1}+\frac{s_P^2(1-6P+6P^2)}{N_mN_c^2}, \tag{12}$$

where $N_c$ is the number of survey cells in each probability measurement $P_i$, and $N_m$ is the total number of measurements. Equation 12 thus gives us the uncertainty on the measured value of $s_P^2$.

Note also that each probability measurement $P_i$ is a mean of $N_c$ values of the $\mathcal{S}$-field. Thus for reasonably large values of $N_c$, we can invoke the central limit theorem and treat the distribution of $P_i$-values as Gaussian, in which case $\mathrm{Var}\left(s_P^2\right)=2\left(s_P^2\right)^2/(N_m-1)$. This result is, of course, the limit of Equation 12 for large $N_c$.
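The error estimate of Equation 12 is simple to evaluate, and it can be cross-checked against Equation 8. The sketch below does both, under the assumption of uncorrelated cells, for which the mean of $N_c$ slice values has $\mu_2=P(1-P)/N_c$ and $\mu_4=3\mu_2^2+\mu_2(1-6P+6P^2)/N_c^2$ (the standard binomial result). Function names and input values are illustrative assumptions.

```python
def variance_of_variance(s2, P, n_cells, n_meas):
    """Equation 12: variance of the measured variance s2 of n_meas probabilities,
    each obtained from n_cells cells."""
    gaussian_term = 2.0 * s2**2 / (n_meas - 1)
    non_gaussian_term = s2 * (1.0 - 6.0 * P + 6.0 * P**2) / (n_meas * n_cells**2)
    return gaussian_term + non_gaussian_term

def variance_of_variance_from_moments(mu2, mu4, n_meas):
    """Equation 8: the same quantity in terms of central moments mu2 and mu4."""
    return (mu4 - mu2**2) / n_meas + 2.0 * mu2**2 / (n_meas * (n_meas - 1))

# Made-up example: P = 0.1, 100 cells per measurement, 8 measurements.
P, n_cells, n_meas = 0.1, 100, 8
mu2 = P * (1.0 - P) / n_cells
mu4 = 3.0 * mu2**2 + mu2 * (1.0 - 6.0 * P + 6.0 * P**2) / n_cells**2
from_eq12 = variance_of_variance(mu2, P, n_cells, n_meas)
from_eq8 = variance_of_variance_from_moments(mu2, mu4, n_meas)
```

The two values agree to floating-point precision, and for large `n_cells` both approach the Gaussian limit $2(s_P^2)^2/(N_m-1)$.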

3.3 Comparing Theory with Measurement

Figure 1: A comparison of the variance of CIC probabilities predicted by Equation 7 with those measured in a mock galaxy survey catalog, in $1.95h^{-1}$-Mpc cubical cells (left panels) and $31.25h^{-1}$-Mpc cubical cells (right panels). The top axes show the number of subvolumes into which the survey is split ($N_m$ in Equation 12), equal to the number of measurements of $\mathcal{P}(B)$; the bottom axes show the number of cells within each subvolume ($N_c$ in Equation 12). Thick lines (with error bars) show the empirical variance of the measurements of $\mathcal{P}(B)$ (one measurement for each subvolume, each subvolume containing $N_{\mathrm{cells}}$ cells); thin lines show the predicted variances from Equation 7, where we determine the volume-averaged correlations as described in the text. The thin dotted lines show the predicted variance for the uncorrelated case (Equation 4). The bottom panels display the ratio of the true variance to that predicted in the absence of correlation; in some cases the correlations can increase the variance by two orders of magnitude. For clarity, the lower-panel curves have received a slight horizontal offset from each other.

We can now compare the predictions of Equation 7 with measured variances. To do so, we make use of the L-galaxies catalog (Bertone et al. 2007; from the repository at http://gavo.mpa-garching.mpg.de/Millennium/) from the Millennium Simulation (Springel et al. 2005), imposing a stellar mass cut of $M_\star\geq 10^{9}\,\mathrm{M}_\odot$ to obtain a mock galaxy survey. We perform two tests, one with the survey volume divided into $256^3$ cubical cells (with side length $1.95h^{-1}$ Mpc) and a second with the volume divided into $16^3$ cubical cells (with side length $31.25h^{-1}$ Mpc).

We next split the survey into $N_m$ subvolumes, each consisting of $N_c$ survey cells (so that $N_mN_c=N_{\mathrm{tot}}$, the total number of cells in the survey). Given a CIC bin $B$, we determine the probability $\mathcal{P}(B)$ within each subvolume, thus obtaining an ensemble of $N_m$ measured probabilities. The variance of this ensemble gives us a measured value for $\sigma^2_{\mathcal{P}(B)}$, and this value will depend on the number of cells $N_c$ used in the measurement of each probability. These empirical variances appear as thick curves in Fig. 1. In this figure, the bottom axis shows $N_c$, the number of cells in each subvolume, and the top axis shows $N_m$, the number of subvolumes. (Note also that we choose unit bin widths for the $1.95h^{-1}$-Mpc cells and varying bin widths for the $31.25h^{-1}$-Mpc cells.) Equation 12 gives us the error bars on these measurements, where we use for $s_P^2$ and $P$ the measured variances and probabilities.
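The subvolume procedure just described might be sketched as follows. This is only an illustration under stated assumptions: a cubic grid whose side is an exact multiple of the subvolume side, cubic subvolumes, and an uncorrelated random toy slice field standing in for the mock catalog. The function name `subvolume_probabilities` is hypothetical.

```python
import numpy as np

def subvolume_probabilities(S, sub_side):
    """Mean of the slice field S in each cubic subvolume of side sub_side cells.

    Assumes S is a cubic 3-D array whose side is a multiple of sub_side."""
    m = S.shape[0] // sub_side          # subvolumes per dimension
    blocks = S.reshape(m, sub_side, m, sub_side, m, sub_side)
    return blocks.mean(axis=(1, 3, 5)).ravel()

# Toy slice field: uncorrelated cells with P(B) = 0.2 (illustrative only).
rng = np.random.default_rng(1)
S = (rng.random((16, 16, 16)) < 0.2).astype(float)

P_i = subvolume_probabilities(S, sub_side=4)   # N_m = 64 subvolumes, N_c = 64 cells each
s2_P = P_i.var(ddof=1)                         # measured variance of the probabilities
```

For this uncorrelated toy field, `s2_P` should scatter around the Equation 4 value $0.2\times0.8/64$; a clustered field would instead show the excess variance described by Equation 7.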

The thin curves in Fig. 1 show the predictions of Equation 7. This prediction is not entirely a priori, since it requires the (measured) probability values $\mathcal{P}(B)$ and the measured volume-averaged correlation $\overline{\xi}_{\mathcal{S}}$ of the corresponding slice field. To obtain the latter, we first measure the two-point correlation function $\xi_{\mathcal{S}}(r)$ of the appropriate slice field using a standard fast Fourier transform method; we then use Monte Carlo sampling of the slice field to obtain random pairs, using $\xi_{\mathcal{S}}(r)$ to calculate their correlation and folding the result into the average; we continue the sampling process until the variation in the running average has a subpercent effect on the predicted $\sigma^2_{\mathcal{P}(B)}$.
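One possible way to carry out the two steps above (an FFT estimate of the slice-field correlation, followed by a Monte Carlo average over random cell pairs) is sketched here. It is not the pipeline used for the figures: it assumes a periodic grid, tabulates $\xi$ by lag vector rather than by radial separation $r$, draws a fixed number of pairs instead of iterating to convergence, and uses hypothetical function names and a random toy field.

```python
import numpy as np

def slice_autocorrelation(S):
    """xi(lag) = <S(x) S(x+lag)> / P^2 - 1 on a periodic grid, via FFT."""
    P = S.mean()
    F = np.fft.rfftn(S)
    pair_mean = np.fft.irfftn(F * np.conj(F), s=S.shape) / S.size
    return pair_mean / P**2 - 1.0

def mean_pair_correlation(xi, sub_side, n_pairs, rng):
    """Monte Carlo average of xi over random pairs of distinct cells
    drawn inside a cubic subvolume of side sub_side cells."""
    ndim = xi.ndim
    a = rng.integers(0, sub_side, size=(n_pairs, ndim))
    b = rng.integers(0, sub_side, size=(n_pairs, ndim))
    lag = (b - a)[np.any(a != b, axis=1)]            # drop pairs with i == j
    idx = tuple(lag[:, d] % xi.shape[d] for d in range(ndim))
    return xi[idx].mean()

# Toy slice field: uncorrelated cells with P(B) = 0.3 (illustrative only).
rng = np.random.default_rng(2)
S = (rng.random((16, 16, 16)) < 0.3).astype(float)
xi = slice_autocorrelation(S)
xi_bar = mean_pair_correlation(xi, sub_side=4, n_pairs=20000, rng=rng)
```

A useful sanity check is the zero-lag value: since $\langle\mathcal{S}^2\rangle=P$ (Equation 2), `xi[0, 0, 0]` must equal $1/P-1$.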

The endpoints of the curves illustrate two extremes. The left-hand endpoints represent the situation in which each survey cell constitutes its own subvolume ($N_c=1$, and $N_m=256^3$ or $16^3$). In this case, each cell provides an estimate of $\mathcal{P}(B)$, and these estimates are either 0 or 1 (depending on whether the cell falls into that probability bin). In this case we expect the variance of these estimates to be large.

The right-hand endpoint of each curve represents the opposite situation of high $N_c$ and low $N_m$. At this endpoint, the survey is divided into 8 subvolumes ($N_c=256^3/8$ or $16^3/8$, and in both cases $N_m=8$). Here we have 8 measurements of $\mathcal{P}(B)$, and the variance of these measurements is small due to the large number of cells $N_c$ involved in the calculation of each one. On the other hand, since we have only 8 measurements of $\mathcal{P}(B)$, our estimate $s_P^2$ of the variance is less certain.

In both cases ($1.95h^{-1}$- and $31.25h^{-1}$-Mpc cells), we see from this figure that the predicted variance is in excellent agreement with the measured values. Furthermore, we see the expected trends: as the number of cells $N_c$ used for the measurement of $\mathcal{P}(B)$ increases, the variance in those measurements decreases; however, as the number of measurements of $\mathcal{P}(B)$ decreases, the uncertainty in the variance increases.

The top panels of Fig. 1 also show, as thin dotted lines, the variance predicted under the assumption of no correlation between survey cells (Equation 4), which is proportional to $1/N_c$; the bottom panels show the ratio between the correlated and uncorrelated cases. It is evident that inter-cell correlations can have a significant effect on the variance of $\mathcal{P}(B)$; in the case of $\mathcal{P}(0)$ for the $1.95h^{-1}$-Mpc cells, the difference is more than two orders of magnitude.

4 Counts-in-cells Covariance

4.1 Theoretical Prediction

The second issue one must consider in fitting models to CIC results is the covariance between different bins of counts. (It is clear that such covariance must exist, given that a survey cell falling into one bin is thereby excluded from all other bins.) Thus we here derive an expression for the covariance $\sigma_{\mathcal{P}(B_1)\mathcal{P}(B_2)}$ of counts-in-cells in two (disjoint) bins $B_1$ and $B_2$. To simplify notation, we write $\mathcal{S}_1$, $\mathcal{S}_2$ for the slice fields of the two bins, and $P_1$, $P_2$ for $\mathcal{P}(B_1)$, $\mathcal{P}(B_2)$.

We begin by considering the case of two slice fields, both drawn from the same survey consisting of $n$ cells, and, as before, we initially assume that the cross-correlation between the two slice fields is a fixed, constant value $\xi_{12}$. In particular, for $i\neq j$ we let $s_{1i}=\mathcal{S}_1(\mathbf{r}_i)$ and $s_{2j}=\mathcal{S}_2(\mathbf{r}_j)$; then given this cross-correlation, we can say that the joint probability of $(s_{1i}=1,s_{2j}=1)$ is $\mathcal{P}(1,1)=P_1P_2(1+\xi_{12})$. Furthermore, since $\mathcal{S}_1(\mathbf{r}_i)\mathcal{S}_2(\mathbf{r}_j)$ vanishes unless $\mathcal{S}_1(\mathbf{r}_i)=\mathcal{S}_2(\mathbf{r}_j)=1$, the expected value is $\langle\mathcal{S}_1(\mathbf{r}_i)\mathcal{S}_2(\mathbf{r}_j)\rangle=\mathcal{P}(1,1)$.

Now $P_1$ is simply $(1/n)\sum s_{1i}$, and $P_2$ is $(1/n)\sum s_{2j}$. Given these relationships, we prove in Appendix A.3 the following statement (analogous to Equation 6) concerning the covariance of the probabilities $P_1$ and $P_2$:

$$\mathrm{Cov}\left(\frac{1}{n}\sum_{i=1}^{n}s_{1i}\,,\frac{1}{n}\sum_{j=1}^{n}s_{2j}\right)=\frac{-P_1P_2}{n}\left(1-(n-1)\xi_{12}\right). \tag{13}$$

At this point we make an approximation analogous to that in Section 3.1 by replacing the fixed $\xi_{12}$ with the average cross-correlation $\overline{\xi}_{\mathcal{S}_1\mathcal{S}_2}$ of the two slice fields. Again writing $N_c$ for $n$, we obtain, for any disjoint counts-in-cells bins $B_1$ and $B_2$,

$$\sigma_{\mathcal{P}(B_1)\mathcal{P}(B_2)}=\frac{-\mathcal{P}(B_1)\mathcal{P}(B_2)}{N_c}+\frac{N_c-1}{N_c}\,\overline{\xi}_{\mathcal{S}_1\mathcal{S}_2}\,\mathcal{P}(B_1)\mathcal{P}(B_2). \tag{14}$$

We note that in the absence of cross-correlation, the covariance is negative since the bins are mutually exclusive.
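The structure of the covariance can be checked by summing the pair contributions directly. For disjoint bins, each of the $n$ same-cell pairs has $\langle s_{1i}s_{2i}\rangle=0$ and therefore contributes $-P_1P_2$, while each of the $n(n-1)$ cross-cell pairs contributes $P_1P_2(1+\xi_{12})-P_1P_2=P_1P_2\,\xi_{12}$. The sketch below compares that direct sum with the closed form; the function names and numerical inputs are illustrative assumptions.

```python
def cic_covariance(P1, P2, n_cells, xi12_bar=0.0):
    """Closed form: covariance of two disjoint-bin probabilities measured
    from n_cells cells, with pair-averaged cross-correlation xi12_bar."""
    exclusivity_term = -P1 * P2 / n_cells
    cross_term = (n_cells - 1) * xi12_bar * P1 * P2 / n_cells
    return exclusivity_term + cross_term

def cic_covariance_pair_sum(P1, P2, n_cells, xi12):
    """Direct sum of Cov(s_1i, s_2j) over all n_cells**2 pairs of cells."""
    same_cell = n_cells * (0.0 - P1 * P2)                        # i == j terms
    cross_cell = n_cells * (n_cells - 1) * (P1 * P2 * (1.0 + xi12) - P1 * P2)
    return (same_cell + cross_cell) / n_cells**2

# Made-up inputs: empty cells versus occupied cells, anticorrelated.
closed = cic_covariance(0.85, 0.05, 4096, xi12_bar=-0.02)
summed = cic_covariance_pair_sum(0.85, 0.05, 4096, xi12=-0.02)
```

With a negative cross-correlation the second term adds to the negative exclusivity term, making the covariance more negative; a sufficiently positive cross-correlation can make it positive.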

4.2 Error on the Measured Covariance

As in Section 3, we now wish to compare Equation 14 with results measured from simulations, and thus we require an estimate for the uncertainty of the measured covariances.

Let us begin with two disjoint CIC bins $B_1$ and $B_2$; let us also suppose that we have two sets $\left\{P_{1i}\right\}$ and $\left\{P_{2i}\right\}$ of measurements of probabilities $\mathcal{P}(B_1)$ and $\mathcal{P}(B_2)$, each set consisting of $N_m$ measurements. We denote the covariance of these two sets of probability measurements as $S_{P_1P_2}$.

The following expression (Kendall and Stuart 1958, p. 322) gives the variance of $S_{P_1P_2}$ (where our $S_{P_1P_2}$ corresponds to $k_{11}$ of Kendall and Stuart):

$$\mathrm{Var}\left(S_{P_1P_2}\right)=\frac{\mu_{22}}{N_m}+\frac{\mu_{02}\mu_{20}}{N_m(N_m-1)}-\frac{N_m-2}{N_m(N_m-1)}\,\mu_{11}^2, \tag{15}$$

where we have converted cumulants into moments, with $\mu_{rs}$ denoting the $r$th, $s$th product moments about the means of the random variables $P_1$, $P_2$. In calculating these moments we will ignore cross-correlations between nearby cells (as in Section 3.2), but we shall again seek to reduce the impact of this simplification by expressing our result in terms of $S_{P_1P_2}$.

Let us employ $P_1$, $P_2$ as shorthand for $\mathcal{P}(B_1)$, $\mathcal{P}(B_2)$. Then the joint moment-generating function for the corresponding slice fields $\mathcal{S}_1$, $\mathcal{S}_2$ is

$$M_{\mathcal{S}_1\mathcal{S}_2}(t_1,t_2)=1+\left(e^{t_1}-1\right)P_1+\left(e^{t_2}-1\right)P_2. \tag{16}$$

Furthermore (as in Section 3.2), each of the measurements $P_{1i}$, $P_{2i}$ is an average over the $\mathcal{S}_1$, $\mathcal{S}_2$-values in $N_c$ cells. Thus, treating the cells as independent, the joint moment-generating function for the measured probabilities is $\left[M_{\mathcal{S}_1\mathcal{S}_2}(t_1/N_c,t_2/N_c)\right]^{N_c}$, that is,

$$M_{\mathcal{P}_1\mathcal{P}_2}(t_1,t_2)=\left(1+\left(e^{t_1/N_c}-1\right)P_1+\left(e^{t_2/N_c}-1\right)P_2\right)^{N_c}. \tag{17}$$

From this function, we calculate in Appendix A.4 the required joint central moments of the $\mathcal{P}_1,\mathcal{P}_2$ distribution. We then apply these results to Equation 15 and, using $S_{P_1P_2}=-P_1P_2/N_c$ (Equation 14 with no cross-correlation), it eventually follows that

$$\mathrm{Var}\left(S_{P_1P_2}\right)=\frac{1}{N_m-1}\left\{2\left(S_{P_1P_2}\right)^2\left(1-\frac{3(N_m-1)}{N_cN_m}\right)-\frac{S_{P_1P_2}}{N_c^2N_m}\left[(1-P_1-P_2)\left(1+N_m(N_c-1)\right)+(P_1+P_2)(N_m-1)\right]\right\}. \tag{18}$$

As with the variance in Section 3.2, we can note that for large values of $N_c$, the $P_{1i}$, $P_{2i}$ values quickly approach a joint Gaussian distribution; in this case we can take the limit of Equation 18 to obtain $\mathrm{Var}\left(S_{P_1P_2}\right)=2\left(S_{P_1P_2}\right)^2/(N_m-1)$.
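Equation 18 is lengthy, so a short numerical cross-check may be helpful. The sketch below evaluates it and compares the result with the cumulant form of the same standard result, $\mathrm{Var}(k_{11})=\kappa_{22}/N_m+(\kappa_{20}\kappa_{02}+\kappa_{11}^2)/(N_m-1)$, using the joint cumulants of means of $N_c$ uncorrelated, mutually exclusive slice values. The cumulant expressions, function names, and inputs are assumptions of this sketch rather than material from the appendix.

```python
def covariance_error_variance(S12, P1, P2, n_cells, n_meas):
    """Equation 18: variance of the measured covariance S12."""
    gaussian = 2.0 * S12**2 * (1.0 - 3.0 * (n_meas - 1) / (n_cells * n_meas))
    bracket = ((1.0 - P1 - P2) * (1 + n_meas * (n_cells - 1))
               + (P1 + P2) * (n_meas - 1))
    correction = S12 * bracket / (n_cells**2 * n_meas)
    return (gaussian - correction) / (n_meas - 1)

def covariance_error_variance_cumulants(P1, P2, n_cells, n_meas):
    """Same quantity from joint cumulants, assuming uncorrelated cells."""
    k22 = -P1 * P2 * (1 - 2 * P1 - 2 * P2 + 6 * P1 * P2) / n_cells**3
    k20 = P1 * (1 - P1) / n_cells
    k02 = P2 * (1 - P2) / n_cells
    k11 = -P1 * P2 / n_cells
    return k22 / n_meas + (k20 * k02 + k11**2) / (n_meas - 1)

# Made-up inputs: two bins with P1 = 0.2, P2 = 0.3; 50 cells, 8 measurements.
P1, P2, n_cells, n_meas = 0.2, 0.3, 50, 8
S12 = -P1 * P2 / n_cells                      # Equation 14 with no cross-correlation
from_eq18 = covariance_error_variance(S12, P1, P2, n_cells, n_meas)
from_cumulants = covariance_error_variance_cumulants(P1, P2, n_cells, n_meas)
```

The two routes agree for these inputs, and for large `n_cells` the first function tends to the Gaussian limit $2(S_{P_1P_2})^2/(N_m-1)$.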

4.3 Comparing Theory with Measurement

Figure 2: A comparison of the predicted covariance of CIC probabilities (Equation 14) and the measured covariance in the same mock galaxy catalog as in Fig. 1. The top axes show the number of subvolumes into which the survey is split ($N_m$ in Equation 18), equal to the number of measurements of $\mathcal{P}(B_1)$ and $\mathcal{P}(B_2)$; the bottom axes show the number of cells in each subvolume ($N_c$ in Equation 18). Thick lines (with error bars) show the negative of the empirical covariance of the measurements of $\mathcal{P}(B_1)$ and $\mathcal{P}(B_2)$ (i.e., $N_{\mathrm{subvol}}$ measurements, each involving $N_{\mathrm{cells}}$ survey cells); thin lines show the negative of the predicted covariance from Equation 14, using volume-averaged cross-correlations determined as in the text. The thin dotted lines show the negative of the predicted covariance for the case of no cross-correlation. Dashed lines (thick and thin) indicate positive (rather than negative) values for the covariance. The bottom panels display the absolute ratio of the true covariances to those predicted assuming no cross-correlation. For clarity, the curves have received a slight horizontal offset from each other.

As in Section 3.3, we proceed to compare the predictions of Equation 14 with measured covariances; we use the same mock survey, with the same cell sizes ($1.95h^{-1}$ and $31.25h^{-1}$ Mpc), as in Section 3.3. Again we split each survey into $N_m$ subvolumes, with each subvolume consisting of $N_c$ survey cells.

In this case we choose two CIC bins $B_1$ and $B_2$ and empirically determine the probabilities $\mathcal{P}(B_1)$ and $\mathcal{P}(B_2)$ within each subvolume; the result is an ensemble of $N_m$ measured probabilities $P_{1i}$ in $B_1$ and a corresponding ensemble of $N_m$ measured probabilities $P_{2i}$ in $B_2$. The covariance $S_{P_1P_2}$ of these two sets of measurements is our estimate for $\mathrm{Cov}(\mathcal{P}(B_1),\mathcal{P}(B_2))$, and this value will depend on the number of cells $N_c$ used in the measurement of the probabilities. As before, we use unit bin widths with the $1.95h^{-1}$-Mpc cells and varying bin widths with the $31.25h^{-1}$-Mpc cells.

These empirical covariances appear as thick curves in Fig. 2. Since in these cases the covariances are typically negative, we plot $-\mathrm{Cov}$ with solid lines (using dashed lines to indicate positive covariances). Equation 18 provides the error bars for these measurements, where we use for $S_{P_1P_2}$, $P_1$, and $P_2$ the measured covariances and probabilities.

The thin curves in Fig. 2 show the predictions of Equation 14. For the volume-averaged correlation $\overline{\xi}_{\mathcal{S}_1\mathcal{S}_2}$ of the slice fields for the two bins, we first empirically calculate the two-point cross-correlation function $\xi_{\mathcal{S}_1\mathcal{S}_2}(r)$ of the two slice fields with FFT methods. We then, as in Section 3.3, sample the slice fields to obtain random pairs of positions within the survey volume, determine the cross-correlation of those positions from $\xi_{\mathcal{S}_1\mathcal{S}_2}(r)$, and fold the result into a running average, terminating the sampling process once the variation in the running average has a subpercent effect on the predicted $\sigma_{\mathcal{P}(B_1)\mathcal{P}(B_2)}$.

Once again, the agreement between prediction and measurement is excellent, although the large error bars for the $31.25h^{-1}$-Mpc cells mean that most of the measurements yield only upper limits on the absolute value. We also observe the same (expected) trends as in Fig. 1: the covariance of the CIC measurements decreases as the number of cells $N_c$ in each measurement increases; likewise the error bars on the covariance increase as the number $N_m$ of measurements decreases.

The thin dotted lines in Fig. 2 show the predicted covariance in the case of no cross-correlation, and the lower panels show the ratio between the cross-correlated and non-cross-correlated results. It is again clear that cross-correlations among survey cells in different probability bins can increase the magnitude of the covariance by multiple orders of magnitude.

5 Discussion

Figure 3: Covariance matrices for CIC probabilities in a mock galaxy survey (described in the text), using logarithmically spaced probability bins and measured (cross-) correlation functions.

Equations 7 and 14 allow us to calculate the covariance matrices for the mock surveys in Sections 3.3 and 4.3 (though the calculation requires empirical determination of the average correlations ξ¯𝒮\overline{\xi}_{\mathcal{S}} and cross-correlations ξ¯𝒮1𝒮2\overline{\xi}_{\mathcal{S}_{1}\mathcal{S}_{2}}). We here perform one such calculation.

For our CIC bins, we start with 20 logarithmically-spaced bins, which we then combine as necessary to ensure that no bin contains fewer than three survey cells; we end up with 20 and 18 bins for the $1.95h^{-1}$-Mpc and $31.25h^{-1}$-Mpc cases, respectively. Since we now use the entire survey to calculate the (co-)variances, we set $N_{c}=N_{\mathrm{tot}}$, the total number of survey cells, in Equations 7 and 14. Fig. 3 displays the resulting covariance matrices.
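
One way to combine sparse bins is sketched below. The text does not specify how bins are merged, so the greedy left-to-right rule, the folding of a sparse tail into the last kept bin, and the name `merge_sparse_bins` are all assumptions made for illustration.

```python
def merge_sparse_bins(edges, counts, min_cells=3):
    """Greedily merge adjacent bins until each holds at least min_cells cells.

    edges  : bin edges, with len(edges) == len(counts) + 1
    counts : number of survey cells falling in each original bin
    """
    new_edges, new_counts = [edges[0]], []
    running = 0
    for right_edge, c in zip(edges[1:], counts):
        running += c
        if running >= min_cells:
            # Close the current (possibly merged) bin at this edge.
            new_edges.append(right_edge)
            new_counts.append(running)
            running = 0
    if new_edges[-1] != edges[-1]:
        # Leftover sparse tail: fold it into the last kept bin if there is one.
        if new_counts:
            new_counts[-1] += running
            new_edges[-1] = edges[-1]
        else:
            new_counts.append(running)
            new_edges.append(edges[-1])
    return new_edges, new_counts

# Toy example with five bins, two of which are too sparse.
edges, counts = merge_sparse_bins([0, 1, 2, 3, 4, 5], [5, 1, 1, 4, 2])
```

The total number of cells is preserved by construction, so the merged probabilities still sum to one.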

Figure 4: Left-hand panel: measured correlation function $\xi(r)$ for the slice fields of four CIC bins, compared to the galaxy correlation function, in our mock galaxy survey. Right-hand panel: measured cross-correlation function $\xi_{12}(r)$ between the $N=3$ slice field and three other slice fields, again compared to the galaxy correlation function. To first order, the two-point (cross-) correlation functions for the slice fields seem to differ from the galaxy correlation function by a simple multiplicative factor.

We note the following concerning these matrices. First, the two cases display significant structural differences. At $\sim 30h^{-1}$ Mpc (right-hand panel), the covariance matrix is approximately diagonal (albeit with significant noise), indicating that at these scales the CIC galaxy distribution is approaching a Gaussian limit. However, at $\sim 2h^{-1}$ Mpc (left-hand panel), the covariance is dominated by the $N=0$ cells, which occupy over 85 per cent of the survey volume. Furthermore, we have already noted (immediately following Equation 14) that, in the absence of cross-correlations, the covariance between probability bins will be negative; we see this behavior in the left-hand panel of Fig. 3 near $N=0$. In this case, the negative covariance induced by mutual exclusivity is exacerbated by the fact that the empty cells aggregate into large voids and thus are negatively cross-correlated with $N>1$ (see right-hand panel of Fig. 4). Other than these features, the covariance matrix at this smaller scale is quite smooth.

The second observation is that, in consequence, to assume diagonal covariance matrices is a reasonable approximation on scales $\gtrsim 30h^{-1}$ Mpc. This fact is also evident in the right-hand panel of Fig. 1, which makes it clear that at such scales the intercellular correlations have only a minor effect on the variance of $\mathcal{P}(B)$. Repp and Szapudi (2020) are therefore justified in ignoring these correlations when fitting $\sigma_{8}$ and linear bias to CIC measurements from the Sloan Main Galaxy Sample. However, extraction of information from the galaxy PDF at smaller scales will need to take these correlations (and cross-correlations) into account.

The third observation is that the calculated covariance matrix is only as good as the measurement of the (cross-) correlation functions of the slice fields. It is this fact which is responsible for the noise in the right-hand panel of Fig. 3, since at these scales we have only $16^{3}$ cells in our survey (and thus many fewer in each probability bin). As a result, we expect the measurement of $\overline{\xi}$ to be quite noisy, as we in fact see. Even at small scales, the measurement of $\xi(r)$, and thus of $\overline{\xi}$, becomes quite noisy for the low-probability, high-density bins (Fig. 4). Thus, it would be helpful to have a theory for the slice-field correlation functions.

Such a theory seems to be feasible, given that the slice-field correlation and cross-correlation functions appear to differ from the galaxy correlation function by a multiplicative bias, at least to first order (Fig. 4). Indeed, the slice fields pick out specific density contours in a manner analogous to the way in which galaxies preferentially trace regions of high matter-density; thus one might expect a bias analogous to the Kaiser bias of galaxies (Kaiser 1984). Fitting these bias parameters to the measured correlations could therefore eliminate much of the noise from the measurements of the various $\overline{\xi}$-values.
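
As a rough illustration of such a fit, the sketch below estimates a single multiplicative factor relating a slice-field correlation function to the galaxy correlation function by weighted least squares. The function name `fit_multiplicative_bias`, the one-parameter model, and the toy power-law data are assumptions introduced here; they are not measurements from the mock survey.

```python
import numpy as np

def fit_multiplicative_bias(xi_slice, xi_gal, weights=None):
    """Least-squares factor b minimising sum_k w_k (xi_slice_k - b xi_gal_k)^2."""
    xi_slice = np.asarray(xi_slice, dtype=float)
    xi_gal = np.asarray(xi_gal, dtype=float)
    w = np.ones_like(xi_gal) if weights is None else np.asarray(weights, dtype=float)
    # Closed-form solution of the one-parameter linear least-squares problem.
    return np.sum(w * xi_slice * xi_gal) / np.sum(w * xi_gal**2)

# Toy data (invented for illustration): a power law and a scaled copy of it.
r = np.logspace(0, 1.5, 20)
xi_gal = (r / 5.0) ** -1.8
xi_slice = 2.5 * xi_gal
b = fit_multiplicative_bias(xi_slice, xi_gal)
```

Passing inverse-variance `weights` would down-weight the noisy separations; a smoothed slice-field correlation is then `b * xi_gal`.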

Finally, we note that the correlations of the slice field represent a further generalization of the sliced correlation functions introduced by Neyrinck et al. (2018), which isolate the correlation of one particular density with the entire field. Thus they are similar to marked correlation functions and power spectra, which promise to enhance, e.g., the detection of neutrino signatures (Massara et al. 2020; Philcox et al. 2020) in galaxy surveys. Indeed, since slice-field correlations are two-point functions – whereas marked correlations are inherently higher-order (densities at two points plus spatially-varying marks) – it is possible that slice-field correlations will prove more tractable than marked correlations, without sacrificing information content.

6 Conclusions

Since counts-in-cells (CIC) probabilities contain significant information not included in the galaxy power spectrum, it is important to develop the theoretical machinery for fitting cosmological models to CIC measurements from galaxy surveys. One of the key ingredients in performing such fits is an understanding of the variance of CIC measurements within a given probability bin, as well as of the covariance of those measurements between different probability bins. We have here derived expressions for both of these quantities.

In order to derive these expressions, we first define the slice field $\mathcal{S}_{B}$ for a given bin $B$, such that $\mathcal{S}_{B}=1$ if $N$ (the number of galaxies within a survey cell) falls within $B$; otherwise $\mathcal{S}_{B}=0$. Using these fields we derive Equation 7 for the variance $\mathrm{Var}(\mathcal{P}(B))$ of measurements of a given probability bin, and we derive Equation 14 for the covariance $\mathrm{Cov}(\mathcal{P}(B_{1}),\mathcal{P}(B_{2}))$ of the measurements in two distinct probability bins. These expressions depend on the probabilities, the number of cells from which the probabilities are determined, and the volume-averaged (cross-) correlation of the corresponding slice fields. Conceptually, the degree of correlation affects the result by reducing the effective number of cells involved in the probability calculation.
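
The slice-field definition can be made concrete with a short sketch. The helper name `slice_field`, the use of inclusive bin limits, and the toy counts are assumptions for illustration only.

```python
import numpy as np

def slice_field(counts, n_min, n_max):
    """Return 1 in cells where n_min <= N <= n_max, and 0 elsewhere."""
    counts = np.asarray(counts)
    return ((counts >= n_min) & (counts <= n_max)).astype(int)

# Toy galaxy counts in eight survey cells (invented for illustration).
counts = np.array([0, 0, 3, 1, 7, 0, 2, 5])
s_empty = slice_field(counts, 0, 0)   # bin B1: N = 0
s_mid = slice_field(counts, 2, 5)     # bin B2: 2 <= N <= 5
p_empty = s_empty.mean()              # estimate of P(B1)
p_mid = s_mid.mean()                  # estimate of P(B2)
overlap = int((s_empty * s_mid).sum())  # disjoint bins never both equal 1
```

The mean of a slice field over the cells is the measured probability for that bin, and the product of two slice fields for disjoint bins vanishes in every cell.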

To test Equations 7 and 14 we turn to a mock galaxy survey from the Millennium Simulation; by dividing the survey into multiple subvolumes we can measure the probability within each subvolume and thus empirically determine the variance/covariance of the $\mathcal{P}(B)$ measurements. Furthermore, a meaningful comparison to the predicted (co-)variances requires an estimate of the scatter in the measurement of those (co-)variances. Taking that scatter into account, we find that Equations 7 and 14 accurately predict the variance and covariance of CIC measurements (Figs. 1 and 2).
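
The subvolume test can be sketched as follows, assuming a cubic grid of cells split into equal subcubes. The function name `subvolume_probabilities`, the trimming of leftover cells, and the toy field are hypothetical choices and need not match the subdivision used for Figs. 1 and 2.

```python
import numpy as np

def subvolume_probabilities(slice_field, n_split):
    """Split a cubic 3-D slice field into n_split**3 equal subcubes and
    return the fraction of cells with S = 1 in each subcube."""
    n = slice_field.shape[0]
    m = n // n_split                      # cells per subcube side
    trimmed = slice_field[:m * n_split, :m * n_split, :m * n_split]
    blocks = trimmed.reshape(n_split, m, n_split, m, n_split, m)
    return blocks.mean(axis=(1, 3, 5)).ravel()

# Toy field: a 4x4x4 grid in which one octant is entirely inside the bin.
field = np.zeros((4, 4, 4), dtype=int)
field[:2, :2, :2] = 1
probs = subvolume_probabilities(field, 2)
empirical_var = np.var(probs, ddof=1)     # sample variance across subvolumes
```

The sample variance (or, for two bins, the sample covariance) of these per-subvolume probabilities is the empirical quantity to compare with the predictions.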

We further conclude that at large scales ($\sim 30h^{-1}$ Mpc) the correlation between neighboring cells has a negligible impact on the variance, whereas at small scales ($\sim 2h^{-1}$ Mpc) the correlations can increase the variance by orders of magnitude.

In summary, two of the tools necessary for wider cosmological utilization of counts-in-cells are the abilities to determine the variance and the covariance of the CIC probabilities. These tools are now available.

Acknowledgements

The Millennium Simulation data bases used in this work and the web application providing online access to them were constructed as part of the activities of the German Astrophysical Virtual Observatory (GAVO). This work was supported by NASA Headquarters under the NASA Earth and Space Science Fellowship program – “Grant 80NSSC18K1081” – and AR gratefully acknowledges the support. IS acknowledges support from National Science Foundation (NSF) award 1616974.

Data Availability

The data underlying this article are available in the Virgo-Millennium Database (maintained by the German Astrophysical Virtual Observatory) at http://gavo.mpa-garching.mpg.de/Millennium.

Appendix A Derivations

A.1 Proof of Equation 6

Equation 6 claims that the variance of $\mathcal{P}(B)=(1/n)\sum s_{i}$ is

$$\mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n}s_{i}\right)=\frac{\left(1-\left(1-(n-1)\xi\right)P\right)P}{n}. \tag{6}$$

Recall that we are assuming the correlation between any two of the slice-field values $s_{1},s_{2},\ldots,s_{n}$ is a fixed quantity $\xi$. Thus if $i\neq j$, it is the case that $\langle s_{i}s_{j}\rangle=(1+\xi)P^{2}$, where $P=\mathcal{P}(B)$ is the probability that $\mathcal{S}=1$ in any given cell. We also recall that $\xi$ is the correlation function of the slice field $\mathcal{S}$, not that of the counts-in-cells $N$.

The proof of Equation 6 proceeds by induction. For $n=1$, the right-hand side of Equation 6 is simply $P-P^{2}=\langle\mathcal{S}^{2}\rangle-\langle\mathcal{S}\rangle^{2}$ (by Equation 2), which is of course the variance of $\mathcal{S}$. On the other hand, for an arbitrary $n$, we have

$$\mathrm{Var}\left(\frac{1}{n+1}\sum_{i=1}^{n+1}s_{i}\right)=\frac{1}{(n+1)^{2}}\left\langle\left(\sum_{i=1}^{n+1}s_{i}-(n+1)P\right)^{2}\right\rangle \tag{19}$$
$$\quad=\frac{1}{(n+1)^{2}}\left\langle\left(\left(\sum_{i=1}^{n}s_{i}-nP\right)+\left(s_{n+1}-P\right)\right)^{2}\right\rangle \tag{20}$$
$$\begin{aligned}\quad=\frac{1}{(n+1)^{2}}&\left\{n^{2}\left\langle\left(\frac{1}{n}\sum_{i=1}^{n}s_{i}-P\right)^{2}\right\rangle\right.\\ &+2\left\langle\left(\sum_{i=1}^{n}s_{i}-nP\right)\left(s_{n+1}-P\right)\right\rangle\\ &+\left.\left\langle\left(s_{n+1}-P\right)^{2}\right\rangle\right\}\end{aligned} \tag{21}$$

Equation 21 contains three expectation values. If Equation 6 holds for $n$, the first expectation value equals $\left(1-\left(1-(n-1)\xi\right)P\right)P/n$. The third is simply the variance of $\mathcal{S}$, or $P-P^{2}$. And for the second, we have

$$\begin{aligned}&\left\langle\left(\sum_{i=1}^{n}s_{i}-nP\right)\left(s_{n+1}-P\right)\right\rangle\\ &\quad=\sum_{i=1}^{n}\left\langle s_{i}s_{n+1}\right\rangle-P\sum_{i=1}^{n}\langle s_{i}\rangle-nP\langle s_{n+1}\rangle+nP^{2}\end{aligned} \tag{22}$$
$$\quad=nP^{2}(1+\xi)-nP^{2}=nP^{2}\xi. \tag{23}$$

Substituting these expressions for the expectation values into Equation 21 and simplifying, we obtain

$$\mathrm{Var}\left(\frac{1}{n+1}\sum_{i=1}^{n+1}s_{i}\right)=\frac{\left(1-P\left(1-n\xi\right)\right)P}{n+1}, \tag{24}$$

completing the induction on Equation 6.
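
As an independent check on Equation 6, the variance of the cell average can also be evaluated by summing the full covariance matrix of the $s_{i}$ directly. The sketch below does this numerically; the function names and the test values of $n$, $P$ and $\xi$ are arbitrary choices made here for illustration.

```python
import numpy as np

def var_mean_direct(n, P, xi):
    """Var((1/n) sum s_i) from the pairwise covariances of the s_i."""
    # Off-diagonal: <s_i s_j> - P^2 = xi P^2; diagonal: Var(s_i) = P - P^2.
    cov = np.full((n, n), xi * P**2)
    np.fill_diagonal(cov, P - P**2)
    return cov.sum() / n**2

def var_mean_eq6(n, P, xi):
    """Closed form of Equation 6."""
    return (1 - (1 - (n - 1) * xi) * P) * P / n

# Largest discrepancy over a few arbitrary parameter choices.
max_diff = max(
    abs(var_mean_direct(n, P, xi) - var_mean_eq6(n, P, xi))
    for n in (1, 2, 5, 50)
    for P in (0.05, 0.3, 0.85)
    for xi in (0.0, 0.01, 0.2)
)
```

For $\xi=0$ both expressions reduce to the binomial result $P(1-P)/n$.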

A.2 Moments of the Distribution of Measurements of $\mathcal{P}(B)$

If $B$ is a bin in which we measure the probability $\mathcal{P}(B)$, let us suppose that $\{P_{i}\}$ is a set of $N_{m}$ measurements of $\mathcal{P}(B)$. Our goal is to determine the fourth central moment $\mu_{4}$ of the distribution of the measurements $\{P_{i}\}$ of $\mathcal{P}(B)$. We begin with Equation 10, which gives the moment-generating function for this distribution:

$$M_{\mathcal{P}}(t)=\frac{1}{N_{c}}\left(1+\left(e^{t}-1\right)P\right)^{N_{c}}, \tag{10}$$

from which we can determine the moments of the distribution of measured $\mathcal{P}(B)$ values.

Using $P$ as a shorthand for $\mathcal{P}(B)$, we obtain the following moments:

$$\left\langle P\right\rangle=P \tag{25}$$
$$\left\langle P^{2}\right\rangle=\frac{P}{N_{c}}+\frac{(N_{c}-1)P^{2}}{N_{c}} \tag{26}$$
$$\left\langle P^{3}\right\rangle=\frac{P}{N_{c}^{2}}+\frac{3(N_{c}-1)P^{2}}{N_{c}^{2}}+\frac{(N_{c}-2)(N_{c}-1)P^{3}}{N_{c}^{2}} \tag{27}$$
$$\begin{aligned}\left\langle P^{4}\right\rangle=&\frac{P}{N_{c}^{3}}+\frac{7(N_{c}-1)P^{2}}{N_{c}^{3}}+\frac{6(N_{c}-2)(N_{c}-1)P^{3}}{N_{c}^{3}}\\ &+\frac{(N_{c}-3)(N_{c}-2)(N_{c}-1)P^{4}}{N_{c}^{3}}.\end{aligned} \tag{28}$$

Thus the fourth central moment of the distribution of measured values for $\mathcal{P}(B)$ is

$$\mu_{4}=\left\langle P^{4}\right\rangle-4\left\langle P^{3}\right\rangle\left\langle P\right\rangle+6\left\langle P^{2}\right\rangle\left\langle P\right\rangle^{2}-3\left\langle P\right\rangle^{4} \tag{29}$$
$$\quad=\frac{3P^{2}}{N_{c}^{2}}(P-1)^{2}+\frac{P(1-P)}{N_{c}^{3}}(1-6P+6P^{2}) \tag{30}$$
$$\quad=3\left(s_{P}^{2}\right)^{2}+\frac{s_{P}^{2}}{N_{c}^{2}}(1-6P+6P^{2}), \tag{31}$$

where the final step follows from $s_{P}^{2}=P(1-P)/N_{c}$ (Equation 4).
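
The moments in Equations 25 to 28 coincide with those of $K/N_{c}$ for a binomially distributed count $K$ (that is, for uncorrelated cells), so Equation 30 can be checked against an exact sum over the binomial distribution. The sketch below does so; the function names and parameter values are illustrative choices, not part of the derivation.

```python
from math import comb

def mu4_exact(Nc, P):
    """Fourth central moment of K/Nc for K ~ Binomial(Nc, P), by direct sum."""
    pmf = [comb(Nc, k) * P**k * (1 - P)**(Nc - k) for k in range(Nc + 1)]
    return sum(p * (k / Nc - P)**4 for k, p in enumerate(pmf))

def mu4_closed(Nc, P):
    """Closed form of Equation 30."""
    return (3 * P**2 * (P - 1)**2 / Nc**2
            + P * (1 - P) * (1 - 6 * P + 6 * P**2) / Nc**3)

# Largest relative discrepancy over a few arbitrary parameter choices.
max_rel_diff = max(
    abs(mu4_exact(Nc, P) - mu4_closed(Nc, P)) / mu4_closed(Nc, P)
    for Nc in (1, 3, 10, 40)
    for P in (0.05, 0.3, 0.85)
)
```

The first term of Equation 30 dominates for large $N_{c}$, which is the Gaussian limit $\mu_{4}\approx 3\sigma^{4}$.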

A.3 Proof of Equation 13

Equation 13 claims that the covariance of the probabilities $P_{1}$ and $P_{2}$ is given by

$$\mathrm{Cov}\left(\frac{1}{n}\sum_{i=1}^{n}s_{1i}\,,\frac{1}{n}\sum_{j=1}^{n}s_{2j}\right)=\frac{-P_{1}P_{2}}{n}\left(1-(n-1)\xi_{12}\right). \tag{13}$$

As in Appendix A.1, we prove the claim using induction.

To establish Equation 13 for $n=1$, we first note that, because the slice fields are disjoint (Equation 3), it cannot be the case that $\mathcal{S}_{1}=\mathcal{S}_{2}=1$ in this single survey cell. The possibilities therefore are $\mathcal{S}_{1}=1,\mathcal{S}_{2}=0$ (with probability $P_{1}$), $\mathcal{S}_{1}=0,\mathcal{S}_{2}=1$ (with probability $P_{2}$), and $\mathcal{S}_{1}=\mathcal{S}_{2}=0$ (with probability $1-P_{1}-P_{2}$). Hence the left-hand side of Equation 13 is

$$\mathrm{Cov}(\mathcal{S}_{1},\mathcal{S}_{2})=\left\langle(\mathcal{S}_{1}-P_{1})(\mathcal{S}_{2}-P_{2})\right\rangle \tag{32}$$
$$\begin{aligned}&=(1-P_{1})(-P_{2})P_{1}+(-P_{1})(1-P_{2})P_{2}\\ &\qquad+(P_{1}P_{2})(1-P_{1}-P_{2})\end{aligned} \tag{33}$$
$$=-P_{1}P_{2}, \tag{34}$$

thus establishing the prescription for $n=1$.

Now assuming the equation holds for a given $n$, we can write

$$\mathrm{Cov}\left(\frac{1}{n+1}\sum_{i=1}^{n+1}s_{1i}\,,\frac{1}{n+1}\sum_{j=1}^{n+1}s_{2j}\right) \tag{35}$$
$$\begin{aligned}&=\frac{1}{(n+1)^{2}}\left\langle\left(\sum_{i=1}^{n+1}s_{1i}-(n+1)P_{1}\right)\right.\\ &\qquad\qquad\left.\times\left(\sum_{j=1}^{n+1}s_{2j}-(n+1)P_{2}\right)\right\rangle\end{aligned} \tag{36}$$
$$\begin{aligned}&=\frac{1}{(n+1)^{2}}\left\{\left\langle\left(\sum_{i=1}^{n}s_{1i}-nP_{1}\right)\left(\sum_{j=1}^{n}s_{2j}-nP_{2}\right)\right\rangle\right.\\ &\qquad\qquad+\left\langle\left(\sum_{i=1}^{n}s_{1i}-nP_{1}\right)\left(s_{2(n+1)}-P_{2}\right)\right\rangle\\ &\qquad\qquad+\left\langle\left(s_{1(n+1)}-P_{1}\right)\left(\sum_{j=1}^{n}s_{2j}-nP_{2}\right)\right\rangle\\ &\qquad\qquad+\left.\left\langle\left(s_{1(n+1)}-P_{1}\right)\left(s_{2(n+1)}-P_{2}\right)\right\rangle\right\}\end{aligned} \tag{37}$$

Equation 37 contains four expectation values, which we now evaluate. By our inductive hypothesis, the first (which is $n^{2}$ times the covariance for $n$ cells) is $-nP_{1}P_{2}\left(1-(n-1)\xi_{12}\right)$. The second becomes

$$\sum_{i=1}^{n}\langle s_{1i}s_{2(n+1)}\rangle-P_{2}\sum_{i=1}^{n}\langle s_{1i}\rangle-nP_{1}\langle s_{2(n+1)}\rangle+nP_{1}P_{2} \tag{38}$$
$$=nP_{1}P_{2}\xi_{12} \tag{39}$$

by recalling that for $i\neq j$, $\langle s_{1i}s_{2j}\rangle=P_{1}P_{2}(1+\xi_{12})$ and that $\langle s_{1i}\rangle=P_{1}$, etc. The third expectation value becomes the same quantity by symmetry. Finally, we recall that $\langle s_{1(n+1)}s_{2(n+1)}\rangle=0$ because the slice fields are disjoint, and thus the fourth expectation value becomes $-P_{1}P_{2}$.

Inserting these results into Equation 37 and simplifying, we obtain

$$\mathrm{Cov}\left(\frac{1}{n+1}\sum_{i=1}^{n+1}s_{1i}\,,\frac{1}{n+1}\sum_{j=1}^{n+1}s_{2j}\right)=\frac{-P_{1}P_{2}}{n+1}\left(1-n\xi_{12}\right), \tag{40}$$

thus establishing Equation 13 for all natural numbers $n$.
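
The covariance of the two cell averages can also be evaluated by brute force from the pairwise moments used in this proof, namely $\langle s_{1i}s_{2j}\rangle=P_{1}P_{2}(1+\xi_{12})$ for $i\neq j$ and $\langle s_{1i}s_{2i}\rangle=0$. The sketch below sums these contributions directly; the function name and the sample arguments are hypothetical, and the routine is offered only as a numerical cross-check of the closed form.

```python
import numpy as np

def cov_of_means_direct(n, P1, P2, xi12):
    """Cov((1/n) sum s_1i, (1/n) sum s_2j) from the pairwise moments."""
    # <s_1i s_2j> = P1 P2 (1 + xi12) for i != j.
    second_moment = np.full((n, n), P1 * P2 * (1 + xi12))
    # <s_1i s_2i> = 0 because the two bins are disjoint.
    np.fill_diagonal(second_moment, 0.0)
    # Subtract <s_1i><s_2j> = P1 P2 to obtain the covariances.
    cov = second_moment - P1 * P2
    return cov.sum() / n**2

single_cell = cov_of_means_direct(1, 0.3, 0.2, 0.5)   # equals -P1*P2 for n = 1
many_cells = cov_of_means_direct(100, 0.3, 0.2, 0.05)
```

For a single cell the result is $-P_{1}P_{2}$ regardless of $\xi_{12}$, in agreement with Equation 34.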

A.4 Moments of the Joint Distribution of Measurements of $\mathcal{P}(B_{1},B_{2})$

If $B_{1},B_{2}$ are disjoint CIC bins, let us consider two sets of measurements $\{P_{1i}\},\{P_{2i}\}$ of the probabilities $\mathcal{P}(B_{1})$ and $\mathcal{P}(B_{2})$, respectively, each consisting of $N_{m}$ measurements. We want to determine the central moments of the joint distribution of the measurements $\{P_{1i}\}$ of $\mathcal{P}(B_{1})$ and $\{P_{2i}\}$ of $\mathcal{P}(B_{2})$. We start with Equation 17, which (using $P_{1}$ and $P_{2}$ as shorthand for $\mathcal{P}(B_{1})$ and $\mathcal{P}(B_{2})$, respectively) gives the joint moment-generating function for this distribution:

$$M_{\mathcal{P}_{1}\mathcal{P}_{2}}(t_{1},t_{2})=\frac{1}{N_{c}}\left(1+\left(e^{t_{1}}-1\right)P_{1}+\left(e^{t_{2}}-1\right)P_{2}\right)^{N_{c}}. \tag{17}$$

Proceeding to calculate the joint moments by differentiating this expression, we obtain:

$$\left\langle P_{1}P_{2}\right\rangle=\frac{N_{c}-1}{N_{c}}P_{1}P_{2} \tag{41}$$
$$\left\langle P_{1}^{2}P_{2}\right\rangle=\frac{N_{c}-1}{N_{c}^{2}}P_{1}P_{2}+\frac{(N_{c}-1)(N_{c}-2)}{N_{c}^{2}}P_{1}^{2}P_{2} \tag{42}$$
$$\left\langle P_{1}P_{2}^{2}\right\rangle=\frac{N_{c}-1}{N_{c}^{2}}P_{1}P_{2}+\frac{(N_{c}-1)(N_{c}-2)}{N_{c}^{2}}P_{1}P_{2}^{2} \tag{43}$$
$$\begin{aligned}\left\langle P_{1}^{2}P_{2}^{2}\right\rangle=&\frac{(N_{c}-1)(N_{c}-2)(N_{c}-3)}{N_{c}^{3}}P_{1}^{2}P_{2}^{2}\\ &+\frac{(N_{c}-1)(N_{c}-2)}{N_{c}^{3}}(P_{1}^{2}P_{2}+P_{1}P_{2}^{2})\\ &+\frac{(N_{c}-1)}{N_{c}^{3}}P_{1}P_{2}.\end{aligned} \tag{44}$$

From these results (along with Equations 25 and 26) we can obtain the required joint central moments of the $\mathcal{P}_{1},\mathcal{P}_{2}$ distribution:

$$\mu_{20}=\left\langle P_{1}^{2}\right\rangle-\left\langle P_{1}\right\rangle^{2}=\frac{P_{1}(1-P_{1})}{N_{c}} \tag{45}$$
$$\mu_{02}=\left\langle P_{2}^{2}\right\rangle-\left\langle P_{2}\right\rangle^{2}=\frac{P_{2}(1-P_{2})}{N_{c}} \tag{46}$$
$$\mu_{11}=\left\langle P_{1}P_{2}\right\rangle-\left\langle P_{1}\right\rangle\left\langle P_{2}\right\rangle=-\frac{P_{1}P_{2}}{N_{c}} \tag{47}$$
$$\begin{aligned}\mu_{22}&=\left\langle P_{1}^{2}P_{2}^{2}\right\rangle-2\left\langle P_{1}^{2}P_{2}\right\rangle\left\langle P_{2}\right\rangle-2\left\langle P_{1}P_{2}^{2}\right\rangle\left\langle P_{1}\right\rangle\\ &\qquad+4\left\langle P_{1}P_{2}\right\rangle\left\langle P_{1}\right\rangle\left\langle P_{2}\right\rangle+\left\langle P_{1}^{2}\right\rangle\left\langle P_{2}\right\rangle^{2}+\left\langle P_{2}^{2}\right\rangle\left\langle P_{1}\right\rangle^{2}\\ &\qquad-3\left\langle P_{1}\right\rangle^{2}\left\langle P_{2}\right\rangle^{2}\end{aligned} \tag{48}$$
$$=\frac{P_{1}P_{2}}{N_{c}^{3}}\left((N_{c}-1)-(N_{c}-2)(P_{1}+P_{2}-3P_{1}P_{2})\right). \tag{49}$$
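
Equation 49 can be checked against an exact sum over the multinomial distribution of the cell counts in the two bins, which is the uncorrelated-cell case whose joint moments agree with Equations 41 to 47. The sketch below performs that check; the function names and parameter values are illustrative assumptions.

```python
from math import comb

def mu22_exact(Nc, P1, P2):
    """Joint central moment <(K1/Nc - P1)^2 (K2/Nc - P2)^2> for multinomial counts."""
    P0 = 1 - P1 - P2                      # probability of falling in neither bin
    total = 0.0
    for k1 in range(Nc + 1):
        for k2 in range(Nc - k1 + 1):
            k0 = Nc - k1 - k2
            prob = comb(Nc, k1) * comb(Nc - k1, k2) * P1**k1 * P2**k2 * P0**k0
            total += prob * (k1 / Nc - P1)**2 * (k2 / Nc - P2)**2
    return total

def mu22_closed(Nc, P1, P2):
    """Closed form of Equation 49."""
    return P1 * P2 / Nc**3 * ((Nc - 1) - (Nc - 2) * (P1 + P2 - 3 * P1 * P2))

# Largest relative discrepancy over a few arbitrary parameter choices.
max_rel_diff = max(
    abs(mu22_exact(Nc, P1, P2) - mu22_closed(Nc, P1, P2)) / mu22_closed(Nc, P1, P2)
    for Nc in (2, 5, 20)
    for (P1, P2) in ((0.1, 0.2), (0.3, 0.3), (0.05, 0.6))
)
```

The double loop costs of order $N_{c}^{2}$ operations, so this direct check is practical only for modest $N_{c}$.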

References

  • Balian and Schaeffer (1989) Balian, R. and Schaeffer, R.: 1989, A&A 220, 1
  • Baugh et al. (2004) Baugh, C. M., Croton, D. J., Gaztañaga, E., Norberg, P., Colless, M., Baldry, I. K., Bland-Hawthorn, J., et al.: 2004, MNRAS 351(2), L44
  • Baugh et al. (1995) Baugh, C. M., Gaztanaga, E., and Efstathiou, G.: 1995, MNRAS 274(4), 1049
  • Baumgart and Fry (1991) Baumgart, D. J. and Fry, J. N.: 1991, ApJ 375, 25
  • Bernardeau (1994a) Bernardeau, F.: 1994a, ApJ 433, 1
  • Bernardeau (1994b) Bernardeau, F.: 1994b, A&A 291, 697
  • Bernardeau and Kofman (1995) Bernardeau, F. and Kofman, L.: 1995, ApJ 443, 479
  • Bernardeau and Schaeffer (1991) Bernardeau, F. and Schaeffer, R.: 1991, A&A 250, 23
  • Bertone et al. (2007) Bertone, S., De Lucia, G., and Thomas, P. A.: 2007, MNRAS 379(3), 1143
  • Carron (2011) Carron, J.: 2011, ApJ 738, 86
  • Carron and Neyrinck (2012) Carron, J. and Neyrinck, M. C.: 2012, ApJ 750, 28
  • Carron and Szapudi (2013) Carron, J. and Szapudi, I.: 2013, MNRAS 434, 2961
  • Carron and Szapudi (2014) Carron, J. and Szapudi, I.: 2014, MNRAS 439, L11
  • Clerkin et al. (2017) Clerkin, L., Kirk, D., Manera, M., Lahav, O., Abdalla, F., Amara, A., Bacon, D., et al.: 2017, MNRAS 466(2), 1444
  • Colombi (1994) Colombi, S.: 1994, ApJ 435(2), 536
  • Colombi et al. (1995) Colombi, S., Bouchet, F. R., and Schaeffer, R.: 1995, ApJS 96, 401
  • Colombi et al. (2000) Colombi, S., Szapudi, I., Jenkins, A., and Colberg, J.: 2000, MNRAS 313(4), 711
  • Colombi et al. (1998) Colombi, S., Szapudi, I., and Szalay, A. S.: 1998, MNRAS 296(2), 253
  • Croton et al. (2007) Croton, D. J., Gao, L., and White, S. D. M.: 2007, MNRAS 374(4), 1303
  • Friedrich et al. (2018) Friedrich, O., Gruen, D., DeRose, J., Kirk, D., Krause, E., McClintock, T., Rykoff, E. S., et al.: 2018, Phys. Rev. D 98(2), 023508
  • Gaztanaga (1994) Gaztanaga, E.: 1994, MNRAS 268, 913
  • Gruen et al. (2018) Gruen, D., Friedrich, O., Krause, E., DeRose, J., Cawthon, R., Davis, C., Elvin-Poole, J., Rykoff, E. S., et al.: 2018, Phys. Rev. D 98(2), 023507
  • Kaiser (1984) Kaiser, N.: 1984, ApJ 284, L9
  • Kendall and Stuart (1958) Kendall, M. and Stuart, A.: 1958, The Advanced Theory of Statistics, Vol. 1, Hafner, New York
  • Massara et al. (2020) Massara, E., Villaescusa-Navarro, F., Ho, S., Dalal, N., and Spergel, D. N.: 2020, arXiv e-prints p. arXiv:2001.11024
  • Meiksin and White (1999) Meiksin, A. and White, M.: 1999, MNRAS 308(4), 1179
  • Neyrinck et al. (2018) Neyrinck, M. C., Szapudi, I., McCullagh, N., Szalay, A. S., Falck, B., and Wang, J.: 2018, MNRAS 478, 2495
  • Neyrinck et al. (2006) Neyrinck, M. C., Szapudi, I., and Rimes, C. D.: 2006, MNRAS 370, L66
  • Neyrinck et al. (2009) Neyrinck, M. C., Szapudi, I., and Szalay, A. S.: 2009, ApJ 698, L90
  • Pápai and Szapudi (2010) Pápai, P. and Szapudi, I.: 2010, ApJ 725(2), 2078
  • Peebles (1980) Peebles, P. J. E.: 1980, The large-scale structure of the universe
  • Philcox et al. (2020) Philcox, O. H. E., Massara, E., and Spergel, D. N.: 2020, arXiv e-prints p. arXiv:2006.10055
  • Repp and Szapudi (2017) Repp, A. and Szapudi, I.: 2017, MNRAS 464, L21
  • Repp and Szapudi (2020) Repp, A. and Szapudi, I.: 2020, arXiv e-prints p. arXiv:2006.01146
  • Rimes and Hamilton (2005) Rimes, C. D. and Hamilton, A. J. S.: 2005, MNRAS 360, L82
  • Salvador et al. (2019) Salvador, A. I., Sánchez, F. J., Pagul, A., García-Bellido, J., Sanchez, E., Pujol, A., Frieman, J., et al.: 2019, MNRAS 482(2), 1435
  • Springel et al. (2005) Springel, V., White, S. D. M., Jenkins, A., Frenk, C. S., Yoshida, N., Gao, L., Navarro, J., et al.: 2005, Nature 435, 629
  • Szapudi and Colombi (1996) Szapudi, I. and Colombi, S.: 1996, ApJ 470, 131
  • Szapudi et al. (2000) Szapudi, I., Colombi, S., Jenkins, A., and Colberg, J.: 2000, MNRAS 313(4), 725
  • Szapudi et al. (1995) Szapudi, I., Dalton, G. B., Efstathiou, G., and Szalay, A. S.: 1995, ApJ 444, 520
  • Szapudi et al. (1996) Szapudi, I., Meiksin, A., and Nichol, R. C.: 1996, ApJ 473, 15
  • Szapudi et al. (1992) Szapudi, I., Szalay, A. S., and Boschan, P.: 1992, ApJ 390, 350
  • Uhlemann et al. (2018a) Uhlemann, C., Feix, M., Codis, S., Pichon, C., Bernardeau, F., L’Huillier, B., Kim, J., et al.: 2018a, MNRAS 473(4), 5098
  • Uhlemann et al. (2020) Uhlemann, C., Friedrich, O., Villaescusa-Navarro, F., Banerjee, A., and Codis, S. r.: 2020, MNRAS 495(4), 4006
  • Uhlemann et al. (2018b) Uhlemann, C., Pichon, C., Codis, S., L’Huillier, B., Kim, J., Bernardeau, F., Park, C., and Prunet, S.: 2018b, MNRAS 477(2), 2772
  • Valageas (2002) Valageas, P.: 2002, A&A 382, 412
  • Wolk et al. (2013) Wolk, M., McCracken, H. J., Colombi, S., Fry, J. N., Kilbinger, M., Hudelot, P., Mellier, Y., et al.: 2013, MNRAS 435, 2