Characterizing the Conditional Galaxy Property Distribution using Gaussian Mixture Models

Yucheng Zhang (张宇澄) Department of Mathematics and Theory, Peng Cheng Laboratory, Shenzhen, Guangdong 518066, China Center for Cosmology and Particle Physics, Department of Physics, New York University, 726 Broadway, New York, NY 10003, USA Anthony R. Pullen Center for Cosmology and Particle Physics, Department of Physics, New York University, 726 Broadway, New York, NY 10003, USA Center for Computational Astrophysics, Flatiron Institute, New York, NY 10010, USA Rachel S. Somerville Center for Computational Astrophysics, Flatiron Institute, New York, NY 10010, USA Patrick C. Breysse Center for Cosmology and Particle Physics, Department of Physics, New York University, 726 Broadway, New York, NY 10003, USA John C. Forbes Center for Computational Astrophysics, Flatiron Institute, New York, NY 10010, USA Shengqi Yang (杨晟祺) Carnegie Observatories, 813 Santa Barbara Street, Pasadena, CA 91101, USA Yin Li (李寅) Department of Mathematics and Theory, Peng Cheng Laboratory, Shenzhen, Guangdong 518066, China Center for Computational Astrophysics, Flatiron Institute, New York, NY 10010, USA Center for Computational Mathematics, Flatiron Institute, New York, NY 10010, USA Abhishek S. Maniyar Center for Cosmology and Particle Physics, Department of Physics, New York University, 726 Broadway, New York, NY 10003, USA

Abstract

Line-intensity mapping (LIM) is a promising technique to constrain the global distribution of galaxy properties. To combine LIM experiments probing different tracers with traditional galaxy surveys and fully exploit the scientific potential of these observations, it is necessary to have a physically motivated modeling framework. As part of developing such a framework, in this work we introduce and model the conditional galaxy property distribution (CGPD), i.e. the distribution of galaxy properties conditioned on the host halo mass and redshift. We consider five galaxy properties, including the galaxy stellar mass, molecular gas mass, galaxy radius, gas phase metallicity and star formation rate (SFR), which are important for predicting the emission lines of interest. The CGPD represents the full distribution of galaxies in the five dimensional property space; many important galaxy distribution functions and scaling relations, such as the stellar mass function and SFR main sequence, can be derived from integrating and projecting it. We utilize two different kinds of cosmological galaxy simulations, a semi-analytic model and the IllustrisTNG hydrodynamic simulation, to characterize the CGPD and explore how well it can be represented using a Gaussian mixture model (GMM). We find that with just a few ( $\sim 3$ ) Gaussian components, a GMM can describe the CGPD of the simulated galaxies to high accuracy for both simulations. The CGPD can be mapped to LIM or other observables by constructing the appropriate relationship between galaxy properties and the relevant observable tracers.

1 Introduction

Understanding the physical processes that govern the formation and evolution of galaxies over cosmic time is one of the most fundamental and important problems in astrophysics and cosmology. Numerous astronomical surveys across a wide range of wavelengths have been carried out to observe galaxies, from nearby targets to those probing the earliest epochs (e.g. York et al., 2000; Koekemoer et al., 2011). However, galaxy surveys are limited by the trade-off between area and depth since deeper and fainter targets require more time to achieve a detectable signal-to-noise ratio. Instead of resolving and observing individual targets, a relatively recent technique called line-intensity mapping (LIM) (e.g. Kovetz et al., 2017) has been proposed to measure the integrated emission lines from all sources, bright and faint. Similar to traditional galaxy surveys, LIM not only provides a promising tracer of the large-scale structure (LSS) of the total matter distribution, but also contains a wealth of information about astrophysical properties of galaxies (Schaan & White, 2021a).

One of the potentially powerful aspects of LIM is that there are now numerous experiments planned or online that probe different lines, which trace different gas phases, at different redshifts (see Bernal & Kovetz, 2022, for a recent summary). For example, some of the most commonly considered lines include CO, which is a tracer of dense molecular gas in the interstellar medium (ISM) of galaxies, atomic and ionic fine structure lines such as [CII], which is primarily emitted by photo-dissociation regions in the outer layers of molecular clouds, 21-cm emission from atomic gas in the ISM and intergalactic medium (IGM), and lines arising from ionized gas such as Lyman- $\alpha$ and H- $\alpha$ . These are the brightest and most easily detected lines, but there are many others that can provide interesting information. It is clear that combining multi-tracer LIM surveys with traditional galaxy surveys will be the key to extracting the maximum amount of science from these experiments.

In order to do this, it is essential to have a comprehensive modeling framework. The most commonly used framework is halo-based phenomenological models (see Bernal & Kovetz, 2022, for a recent review). In this approach, scaling relations between halo mass and redshift and “intrinsic” physical properties such as stellar mass and star formation rate are derived using abundance matching. In abundance matching, observational estimates of stellar mass functions and SFR are used along with predictions for the number density of dark matter halos as a function of mass to derive constraints on the mapping between these quantities (see Wechsler & Tinker, 2018, for a review). Then, scaling relations between e.g. SFR and a line luminosity of interest are derived using observations of individually detected sources (e.g. Silva et al., 2015; Li et al., 2016; Padmanabhan, 2018, 2019; Sun et al., 2019). This approach has the advantage of being highly computationally efficient, and of being tied to observations.

However, there are a number of drawbacks and limitations to this approach. First, these empirical fitting functions usually lack physical motivation, so they are able to provide only limited insights about the physical processes underlying galaxy formation. Second, LIM detections are sensitive to galaxies that are too faint to detect individually, so the scaling relations have to be extrapolated beyond the observational range. Third, these empirical approaches are not able to self-consistently capture the covariances between different tracers, or between the intrinsic properties (stellar mass, SFR) and various emission lines. It is important to keep in mind that the estimates of stellar mass and SFR that are used in abundance matching models are not directly observable quantities, but are derived from UV-optical-NIR observations and carry significant uncertainties. The prior used to obtain these estimates can have a strong impact on the results (Leja et al., 2022). It is critical to be able to properly fold in the prior, covariances, and uncertainties on these quantities, which lie at the foundation of abundance matching approaches.

Numerical cosmological simulations that incorporate baryonic astrophysical processes are a potentially powerful tool for exploring the physics of galaxy formation and evolution, as well as the connection with observable line emissions. The most fundamental approach is based on explicitly simulating the hydrodynamics and thermodynamics of baryonic material evolving within a $\Lambda$ CDM cosmological framework for structure formation (Somerville & Davé, 2015a; Naab & Ostriker, 2017). These numerical hydrodynamic simulations provide direct predictions for the distribution and state of nearly all of the relevant components of galaxies and the IGM (stars, gas at different densities and temperatures, various chemical elements). They can be coupled with modeling tools to provide predictions of many of the emission lines of interest for LIM experiments (Leung et al., 2020; Olsen et al., 2021). However, due to its computational expense, this technique suffers from a fairly severe trade-off between volume and resolution. Simulations that resolve the multi-phase ISM can typically be run for only a few galaxies (e.g. Pallottini et al., 2022). Simulations of cosmological volumes must adopt “sub-grid” recipes to treat most of the critical processes of galaxy formation, including star formation, stellar feedback, and black hole formation and feedback. And even the largest computationally feasible cosmological hydrodynamic simulations fall short of the volumes that are already routinely probed by galaxy and LIM surveys. For example, MillenniumTNG has a volume of (500 Mpc/ $h$ )³, compared to (1760 Mpc/ $h$ )³ for the Baryon Oscillations Spectroscopic Survey of $z<0.7$ luminous red galaxies alone (Alam et al., 2017). The currently adopted sub-grid models are highly uncertain, and approaches adopted by different groups can lead to dramatically different predictions for gas properties. It is infeasible to thoroughly explore the parameter space of the sub-grid models for significant volumes using a direct numerical approach.

An alternative approach to physics-based models is semi-analytic models (SAM), which combine simplified treatments of physical processes with cosmological halo merger histories. These models produce similar predictions to those of numerical hydrodynamic simulations for global galaxy properties, and include a similar suite of physical processes, but at a fraction of the computational cost. However, unlike numerical hydrodynamics simulations they do not provide detailed information about the internal structure of galaxies or the spatial distribution of gas within or around them. Popping et al. (2014) and Popping et al. (2019) coupled semi-analytic models of galaxy formation with the DESPOTIC code (Krumholz, 2014), which computes line luminosities for optically thick clouds using a one-zone model, to make predictions for a broad suite of emission lines, including multiple transitions in the CO ladder, [CI] and [CII]. They validated these models by comparing the predictions with an extensive suite of observations of individually detected galaxies across a broad range of redshifts. Yang et al. (2021) used lightcones populated with this approach to create mock maps (including interlopers, cosmic infrared background, and Galactic foreground) for exemplar LIM experiments in CO and [CI]. This work showed that SAMs coupled with the appropriate line emission models in post-processing are a promising alternative tool for LIM forecasts and interpretation.

Ideally, we would like to be able to use Bayesian inference to constrain the posteriors of both cosmological and astrophysical parameters from a combined suite of multi-tracer LIM and galaxy survey observables. When using a forward-modeling or simulation-based inference approach on a large population of galaxies or a cosmological volume (Henriques et al., 2009, 2013; Lu et al., 2011, 2012, 2014; Forbes et al., 2019), depending on the sampling approach adopted, this requires thousands to tens of thousands of executions of the simulation. Most currently available SAM codes are not fast enough for this to be practical, particularly if large volumes need to be simulated.

An intermediate approach is to develop models that link halo properties to galaxy intrinsic and observable properties using physics-based simulations, such as hydrodynamic simulations and SAMs, instead of empirical methods such as abundance matching. For example, Yang et al. (2022) used the SAM-based predictions described above to derive and present fitting functions between halo mass and all major predicted lines over a broad range of redshifts, and showed that when these fitting functions were combined with dark matter halo properties as in the standard empirical halo model approach, agreement of 5-10% was achieved for common LIM summary statistics such as the power spectrum and intensity.

Refer to caption — Figure 1: Illustration of how a parametric model of the Conditional Galaxy Property Distribution can be used to obtain insights about the evolution of physical quantities of interest. First, a general class of models for representing the CGPD (here GMM) is motivated using physics-based models of galaxy formation. The models must be flexible but not overly complex. The CGPD is combined with a model that predicts the observed line luminosity as a function of the galaxy properties (the Line Luminosity Relation, LLR), to predict observable summary statistics, such as the power spectrum, one-point distribution, or integrated line intensity. Observational estimates of these quantities are then used to estimate the likelihood (if using a “traditional” explicit likelihood inference approach) or to directly estimate the posterior, if using implicit likelihood inference. Using a sampler, a new set of parameter values is guessed, and the model is run again until convergence is reached. Now the “best fit” CGPD can be projected and integrated as described in Section 2, yielding the posterior and confidence intervals on quantities of great interest for galaxy evolution, as summarized in the rightmost box. These estimates can be compared to traditional estimates from galaxy surveys, where physical parameters are estimated through SED fitting, and with the predictions of physics-based simulations.

In this work, we introduce a new approach that we believe can have very general applicability to forecasting and interpreting a wide variety of galaxy and LIM survey data. We introduce the conditional galaxy property distribution (CGPD) $\Phi\left(\bm{p}|M,z\right)$ , which gives the number density of galaxies in the property space $\bm{p}$ , conditioned on the halo mass $M$ and redshift $z$ . Whereas traditional halo models are typically mappings from a scalar quantity (halo mass) to another scalar (e.g. stellar mass or SFR), we can think of this as an extension of the halo model to map to a higher dimensional suite of galaxy properties. In this work, $\bm{p}$ has dimension five and includes the galaxy stellar mass, molecular gas mass, galaxy radius, gas phase metallicity and star formation rate (SFR), which are important for predicting the line luminosities of interest (here, we have in mind mainly CO and [CII]). However, in principle $\bm{p}$ could contain any galaxy property of interest and have an arbitrary number of dimensions. With this set of galaxy properties, CGPD also encapsulates many popular and important galaxy distribution functions and scaling relations such as the stellar mass function and SFR main sequence.

In this paper, we explore the form and effective dimensionality of the CGPD for this chosen set of key galaxy properties, using two different physics based cosmological simulations of galaxy formation: the “Santa Cruz” semi-analytic model (Somerville & Primack, 1999; Somerville et al., 2008, 2012; Porter et al., 2014; Popping et al., 2014; Somerville et al., 2015), and the IllustrisTNG hydrodynamic simulation suite (Springel et al., 2018; Pillepich et al., 2018; Marinacci et al., 2018; Nelson et al., 2018; Naiman et al., 2018). A key point is that it is well known that galaxy properties are highly covariant, giving rise to the many familiar galaxy scaling relations such as the stellar mass vs. SFR relation (SF sequence), mass metallicity relation, etc. Exploring higher dimensional manifolds within galaxy property space (as hinted at by, for example, “fundamental plane” type relations such as the fundamental mass metallicity relation) will in itself provide interesting insights into galaxy formation, as the physics that shapes galaxy properties is encoded in these relationships. Moreover, it will help us to develop tools to characterize and model these high-dimensional relationships in order to fully capture the covariances between galaxy properties. In this work, we adopt a Gaussian mixture model (GMM) based approach as a generative density estimation model to describe the CGPD.

Combined with a model that can predict emission line luminosities from these galaxy properties, this parametric model could potentially be used in a statistical inference pipeline, either using “traditional” methods such as Markov Chain Monte Carlo (MCMC) sampling combined with an assumed form for the likelihood, or more modern implicit likelihood inference methods (Alsing et al., 2018; Cranmer et al., 2020). With the formulation presented here, the posterior of the parameters of the GMM (and hence the CGPD) can be constrained based on a chosen set of LIM observables. The resulting CGPD can then be projected and integrated to obtain estimates for galaxy population summary statistics of common interest, such as the SFR or cold gas distribution functions or cosmic SFR or cold gas density, at a series of redshifts. If desired, additional observational constraints from galaxy surveys, such as the stellar mass function, can simultaneously be used as constraints in the inference procedure. In Fig. 1, we provide a graphical representation of how the CGPD model that we develop in this work fits into this broader framework, which we plan to build up in future works.

The paper is organized as follows. In Section 2, we introduce the CGPD and how it is related to more familiar summary statistics that are commonly used to characterize galaxy populations. We introduce the simulated galaxy catalogs with physical properties of interest in Section 3. Based on these simulations, we then present the application of the GMM in characterizing the CGPD in Section 4. We discuss our results in Section 5, and conclude in Section 6. Throughout this work, we adopt a flat $\Lambda$ CDM fiducial cosmology with Planck 2015 parameters: $\Omega_{m}=0.3089$ , $\Omega_{\Lambda}=0.6911$ , $\Omega_{b}=0.0486$ , $h=0.6774$ , $\sigma_{8}=0.8159$ , $n_{s}=0.9667$ (Planck Collaboration et al., 2016), which are consistent with the cosmology assumed in the simulations.

2 Conditional Galaxy Property Distribution

For a halo of mass $M$ at redshift $z$ , the CGPD $\Phi\left(\bm{p}|M,z\right)$ is defined as the number density distribution of galaxies in the property space $\bm{p}$ . Integrating $\Phi\left(\bm{p}|M,z\right)$ over $\bm{p}$ yields the expected number of galaxies for such a halo. For simplicity, in this work we write the CGPD as a function of $\bm{p}$ , although in practice, we perform the fitting of the CGPD in $\tilde{\bm{p}}=\log\bm{p}$ space ¹¹1In this work, we denote a base 10 logarithm with $\log$ and natural logarithm with $\ln$ ., i.e. $\tilde{\Phi}(\tilde{\bm{p}}|M,z)$ . The CGPD may be represented in either of the two spaces for convenience. The transformation of the CGPD between these two spaces can be straightforwardly given by

\Phi\left(\bm{p}|M,z\right)=\tilde{\Phi}\left(\tilde{\bm{p}}(\bm{p})|M,z\right)\left|\frac{\partial\,\tilde{\bm{p}}(\bm{p})}{\partial\,\bm{p}}\right|\,,

(1)

where $\left|\partial\,\tilde{\bm{p}}(\bm{p})/\partial\,\bm{p}\right|$ is the determinant of the Jacobian matrix. Multiplying the differential volume in $\bm{p}$ space on both sides gives

\Phi\left(\bm{p}|M,z\right)\prod_{i}\,dp_{i}=\tilde{\Phi}\left(\tilde{\bm{p}}|M,z\right)\prod_{i}\,d\tilde{p}_{i}\,,

(2)

which is more convenient for the purpose of integration. The galaxy properties we consider are summarized in Table 1.

Table 1: The 5-D galaxy properties

\bm{p}

for the CGPD

\Phi\left(\bm{p}|M,z\right)

considered in this work, which are selected based on their relevance to the emission lines of interest.

Property	Unit	Description
$m_{*}$	$[M_{\odot}]$	galaxy stellar mass
$m_{{\rm H}_{2}}$	$[M_{\odot}]$	molecular gas mass
$r_{\rm gal}$	[kpc]	galaxy radius
$Z_{g}$	$[Z_{\odot}]$	gas phase metallicity
${\rm SFR}$	$[M_{\odot}/{\rm yr}]$	star formation rate

These are the properties that are the most important for the line luminosities of interest. Note that $\Phi\left(\bm{p}|M,z\right)$ is related to the conditional luminosity function (CLF) $\Phi(\bm{L}|M,z)$ introduced in Schaan & White (2021b) by the line luminosity relation (LLR) $L_{i}(\bm{p},z)$ , which gives the luminosity of line $i$ for a galaxy with properties $\bm{p}$ at redshift $z$ . The LLR is assumed to be independent of the host halo mass $M$ . The modeling of the LLR using SAM simulated galaxies coupled with spectral synthesis models is being developed in a companion work (Breysse et al., 2023).

The CGPD must be coupled with the dark matter halo model in order to connect the cosmological dark matter field with observable galaxies. For example, we obtain the global distribution function for each element of the property vector $\bm{p}$ by performing the integral:

\phi(\bm{p}|z)=\int dM\,n_{\rm halo}(M,z)\Phi\left(\bm{p}|M,z\right)\,,

(3)

where $n_{\rm halo}(M,z)$ is the halo mass function (HMF) at redshift $z$ . In this work, we use the HMF from Tinker et al. (2008) with the implementation in the hmf ²²2https://github.com/halomod/hmf code (Murray et al., 2013). This general galaxy distribution function $\phi(\bm{p}|z)$ gives the comoving volume number density of galaxies in the property space $\bm{p}$ at redshift $z$ . Integrating $\phi(\bm{p}|z)$ over $\bm{p}$ yields the total number of galaxies per unit comoving volume. As shown below, by integrating over certain dimensions, $\phi(\bm{p}|z)$ encapsulates many observables of wide interest in galaxy and LIM surveys, such as the cosmic densities of galaxy properties, the galaxy abundance as a function of a certain property, and the scaling relations between two (or more) galaxy properties.

2.1 Cosmic densities of galaxy properties

The cosmic number density of galaxies per unit comoving volume is given by

n_{g}(z)=\int d\bm{p}\,\phi(\bm{p}|z)\,.

(4)

Similarly, the cosmic density of a galaxy property $p_{i}$ in $\bm{p}$ is

\rho_{p_{i}}(z)=\int d\bm{p}\,\phi(\bm{p}|z)p_{i}\,.

(5)

These integrated quantities are important observable functions describing the overall evolution of galaxy properties over cosmic time (e.g. Madau & Dickinson, 2014, and references therein), including the cosmic stellar mass density $\rho_{m_{*}}(z)$ , cosmic molecular gas mass density $\rho_{m_{{\rm H}_{2}}}(z)$ , and cosmic SFR density $\rho_{\rm SFR}(z)$ .

2.2 Galaxy distribution functions and scaling relations

Besides the volume density of the galaxy properties in Eq. 5, the abundances of galaxies as functions of certain physical properties have also been widely studied in galaxy surveys.

As a function of a single property $p_{i}$ , the galaxy distribution $\phi(p_{i}|z)$ is given by integrating $\phi(\bm{p}|z)$ over the other $\dim(\bm{p})-1$ dimensions,

\phi(p_{i}|z)=\int\phi(\bm{p}|z)\,\prod_{j\neq i}dp_{j}\,.

(6)

These functions of wide interest include the galaxy stellar mass function $\phi(m_{*})$ (e.g. Baldry et al., 2008), the molecular gas mass function $\phi(m_{{\rm H}_{2}})$ (e.g. Diemer et al., 2018), and the star formation rate function $\phi({\rm SFR})$ (e.g. Smit et al., 2012), etc.

Another important observable for galaxy surveys is the correlation between certain sets of galaxy properties, some of which can be even described with simple scaling relations. At a given redshift, integrating $\phi(\bm{p}|z)$ over $\dim(\bm{p})-2$ dimensions of $\bm{p}$ yields

\phi(p_{i},p_{j}|z)=\int\phi(\bm{p}|z)\,\prod_{k\neq i,j}dp_{k}\,,

(7)

the distribution of galaxies in the space of the two unprojected-over properties. A traditional scaling relation corresponds to fitting this distribution with an empirical function $p_{i}(p_{j})$ . These scaling relations of wide interest include the star formation main sequence ${\rm SFR}(m_{*})$ (e.g. Noeske et al., 2007), the mass-metallicity relation $Z_{g}(m_{*})$ (e.g. Tremonti et al., 2004), the size-mass relation $r_{\rm gal}(m_{*})$ (e.g. Lange et al., 2015; Mowla et al., 2019), and the stellar mass vs. gas mass relation $m_{*}-m_{{\rm H}_{2}}$ (e.g. Calette et al., 2018). Besides the main trend or correlation given by the scaling relations, there is also dispersion around the median values. These dispersions $\sigma_{p_{i}}(p_{j})$ also contain a wealth of physical information (e.g. Forbes et al., 2014), and the corresponding conditional distributions like $\phi(p_{i}|p_{j})$ have also been investigated (e.g. Leja et al., 2022). In our case, $\phi(p_{i},p_{j})$ includes all the information, from which the scaling relations and the conditional distribution can also be derived. The conditional probability distribution $f(p_{i}|p_{j})$ is given by

f(p_{i}|p_{j})=\frac{\phi(p_{i},p_{j})}{\int dp_{i}\,\phi(p_{i},p_{j})}\,,

(8)

where $f(p_{i}|p_{j})$ differs from $\phi(p_{i}|p_{j})$ by a factor of the galaxy number density. Then the function $p_{i}(p_{j})$ can be modelled as

\left<p_{i}(p_{j})\right>=\int dp_{i}\,p_{i}f(p_{i}|p_{j})\,.

(9)

Similarly, the dispersion follows,

\sigma^{2}_{p_{i}}(p_{j})=\left<p^{2}_{i}(p_{j})\right>-\left<p_{i}(p_{j})\right>^{2}\,.

(10)

Integrating over $\dim(\bm{p})-3$ dimensions yields $\phi(p_{i},p_{j},p_{k}|z)$ , which leads to “fundamental plane” relations $p_{i}(p_{j},p_{k})$ such as $m_{*}(Z_{g},m_{{\rm H}_{2}})$ or $m_{*}(Z_{g},{\rm SFR})$ (e.g. Lara-López et al., 2010). The expressions are almost the same as Eqs. 8, 9 and 10, with $p_{j}$ simply replaced with $p_{j},\,p_{k}$ .

2.3 Line luminosities and intensities

Typically, line luminosity functions are 1-D distributions over one line luminosity $L$ . As with the relation between CGPD and CLF mentioned above, the line luminosity function $\phi(L|z)$ can also be connected to $\phi(\bm{p}|z)$ by the LLR $L_{0}(\bm{p},z)$ ,

\phi(L|z)=\int d\bm{p}\,\delta^{\rm D}(L-L_{0}(\bm{p},z))\phi(\bm{p}|z)\,.

(11)

By combining $\phi(\bm{p}|z)$ and LLR, the observed intensity can be derived as

I_{\nu}(z)=\frac{1}{4\pi\nu_{0}}\frac{c}{H(z)}\int d\bm{p}\,\phi(\bm{p}|z)L_{0}(\bm{p},z)\,,

(12)

where the integral (without the factor) gives the line luminosity density.

2.4 Galaxy properties per halo

Besides the expressions based on Eq. 3 which integrate over all halos of different masses, we can similarly write down the properties per halo from the CGPD $\Phi\left(\bm{p}|M,z\right)$ . For example, the average number of galaxies for a halo of a given mass can be written as

N_{g}^{h}(M,z)=\int d\bm{p}\,\Phi\left(\bm{p}|M,z\right)\,,

(13)

where the superscript $h$ indicates that it is a quantity per halo. Similarly, the expected galaxy property $p_{i}$ per halo is given as

P_{i}^{h}(M,z)=\int d\bm{p}\,\Phi\left(\bm{p}|M,z\right)p_{i}\,.

(14)

Note that Eq. 14 should be divided by Eq. 13 for $r_{\rm gal}$ if the average radius per galaxy is desired instead of the sum of all galaxy radii in the halo. With LLR $L_{i}(\bm{p},z)$ , the line luminosity for line $i$ per halo can be written as

L_{i}^{h}(M,z)=\int d\bm{p}\,\Phi\left(\bm{p}|M,z\right)L_{i}(\bm{p},z)\,.

(15)

3 Simulations

In this section we briefly introduce the physics-based simulations, which will be used to study and motivate the functional form and modeling approach for characterizing the CGPD. These include the “Santa Cruz” semi-analytic models and hydrodynamic simulations from the IllustrisTNG project. Note that comparing and discussing the differences between the simulated results from the SAM and TNG is outside of the scope of this work, and a recent discussion can be found in Gabrielpillai et al. (2022). It is important to consider different models in order to test the robustness and flexibility of the GMM in describing the CGPD.

3.1 Semi-Analytic Model

State-of-the-art SAMs have been shown to produce consistent predictions for global galaxy properties and their evolution compared with much more expensive numerical hydrodynamic simulations (Somerville & Davé, 2015b). In this work, we use the SAM developed by the “Santa Cruz” group (Somerville & Primack, 1999; Somerville et al., 2008, 2012; Porter et al., 2014; Popping et al., 2014; Somerville et al., 2015). This well-established semi-analytic galaxy formation modeling code has been compared extensively with galaxy observations from $z\sim 0-10$ and is overall very successful at matching many observations (see e.g. Popping et al., 2014; Somerville et al., 2015; Yung et al., 2019a, b; Somerville et al., 2021).

The backbone of the SAM consists of dark matter halo merger trees, which can be extracted from N-body simulations or generated with Monte Carlo realizations based on the extended Press-Schechter (EPS) formalism (Somerville & Kolatt, 1999). In this work, in order to make sure that all halo masses in the wide range of interest are well populated for fitting, we use the Monte Carlo EPS method, which enables better sampling of the full range of relevant halo masses at low computational cost.

We generate simulations for one hundred redshifts uniformly sampled in $0\leq z<10$ , with galaxies at each redshift being simulated independently. At each redshift, we consider halos with mass in the range $10\leq\log(M/M_{\odot})\leq 13$ , which has been shown to be the relevant halo mass range probed by LIM measurements (Yang et al., 2021). In this logarithmic halo mass range, a thousand halos are generated for each of the one thousand uniformly sampled halo masses, and hence in total this yields one million halos for each redshift. This large number of halos and well-sampled resolution in halo mass give us more flexibility in binning for constructing the CGPD. We only consider star-forming galaxies, which are selected with the criterion ${\rm sSFR}>1/(3t_{H}(z))$ , where ${\rm sSFR}\equiv{\rm SFR}/m_{*}$ is the specific SFR and $t_{H}(z)$ is the Hubble time at the galaxy redshift $z$ . We ignore quenched galaxies in this work, as they have a negligible contribution to LIM. For simplicity and better demonstration, in our main analyses we only consider central galaxies. We checked that the contributions from satellites for the integrated quantities are less than $10\%$ , and we include some further details on including and fitting satellites in Section 5.2.2.

3.2 Hydrodynamic simulations

IllustrisTNG is a project that includes a series of state-of-the-art cosmological N-body and hydrodynamic simulations (Springel et al., 2018; Pillepich et al., 2018; Marinacci et al., 2018; Nelson et al., 2018; Naiman et al., 2018). We use the halo and subhalo (i.e. galaxy) catalogs from TNG100-1, which are simulated in a cubic box of size $75\,{\rm Mpc/h}$ with a mass resolution of $\sim 10^{6}\,M_{\odot}$ for the dark matter and baryonic particles. The TNG fields that are most comparable to the galaxy properties simulated by the SAM used in this work are summarized in Table 2.

Table 2: Fields in TNG subhalo catalogs that are most comparable to the galaxy properties in Table 1. The units have also been transformed correspondingly.

Property	TNG field
$m_{*}$	SubhaloMassInRadType (Type=4)
$m_{{\rm H}_{2}}$	m_h2_DK11_vol
$r_{\rm gal}$	SubhaloHalfmassRadType (Type=4)
$Z_{g}$	SubhaloGasMetallicity
${\rm SFR}$	SubhaloSFR

TNG fields for $m_{*}$ and ${\rm SFR}$ are selected following the discussion in Gabrielpillai et al. (2022), where it is also noted that in some cases these quantities are not defined in exactly the same way as in SAM. Then the field for $r_{\rm gal}$ is chosen to be consistent with that of $m_{*}$ . The molecular gas mass $m_{{\rm H}_{2}}$ estimates are provided in post-processed supplementary catalogs (Diemer et al., 2018, 2019; Stevens et al., 2021) for five redshifts $z=(0,\ 0.5,\ 1,\ 1.5,\ 2)$ , which are described in detail in the data specifications of the TNG project ³³3https://www.tng-project.org/data/docs/specifications/. For the host halo mass $M$ , the Group_M_TopHat200 field in TNG is used.

4 Results: Characterizing the CGPD

Based on the simulated data described above, we now propose and test analytic models that could potentially properly characterize the CGPD.

There are various parametric and nonparametric methods for describing the distribution of point data, also known as density estimation (Ivezić et al., 2019). Histograms might be the most straightforward nonparametric method, which approximate the function value as constants within discrete bins. However, the total number of bins increases exponentially with the dimensionality of the data. Treating each bin as an independent variable makes it infeasible for inference, and also this method does not give any insights into the structure of the distribution. This remains an issue even for more advanced nonparametric models like Gaussian processes. Another class of methods assumes a specific parametric form of the underlying probabilistic density function that generates the data points, among which Gaussian Mixture Models are one of the most powerful approaches due to its simplicity and flexibility. In this section, we present the modeling of the CGPD using GMMs.

4.1 Gaussian Mixture Model

The GMM is a probabilistic model defined as the weighted average of a finite number of Gaussian distributions, and the likelihood function reads

L(\bm{x}|\bm{\theta})=\sum_{j=1}^{N_{\rm gc}}w_{j}\,\mathcal{N}(\bm{x}|\bm{\mu}_{j},\bm{\Sigma}_{j})\,,

(16)

where $\mathcal{N}$ denotes a multivariate Gaussian distribution, $\bm{x}$ is the data vector, and $\bm{\theta}$ includes all the model parameters, i.e. the weight $w_{j}$ , the mean vector $\bm{\mu}_{j}$ and the covariance matrix $\bm{\Sigma}_{j}$ for each of the $N_{\rm gc}$ Gaussian components. The weights are constrained by the normalization $\sum_{j=1}^{N_{\rm gc}}w_{j}=1$ such that the likelihood function integrates to one. In unsupervised machine learning, GMM is a powerful method for clustering and classification with each component corresponding to a certain class. Here, GMM is mainly used for density estimation and we are not attempting to classify galaxies in their property space. Another practical advantage of GMM in describing multidimensional data is their simplicity in giving marginal and conditional distributions, which remain Gaussian. Since each component is a multivariate Gaussian, marginalizing (integrating) over any subset of dimensions can be done by simply removing the marginalized dimensions from the mean vector and covariance matrix. Similarly, the conditional distributions for a multivariate Gaussian can also be written analytically. For example, for a bivariate case,

f(X_{1}|X_{2}=x_{2})\sim\mathcal{N}\left(\mu_{1}+\frac{\sigma_{1}}{\sigma_{2}}\rho(x_{2}-\mu_{2}),\ (1-\rho^{2})\sigma_{1}^{2}\right)\,,

(17)

where $\mu_{1}\,(\mu_{2})$ and $\sigma_{1}\,(\sigma_{2})$ are the mean and standard deviation of $X_{1}\,(X_{2})$ , and $\rho$ is the correlation coefficient between the two. This would be useful for deriving e.g. $\phi(p_{i}|p_{j},z)$ from $\phi(p_{i},p_{j}|z)$ , as discussed following Eq. 7.

4.2 Fitting CGPD with GMM

Fig. 2 shows an example of the CGPD for central galaxies in the SAM with host halo mass $11.4<\log(M/M_{\odot})<11.6$ at redshift $z=2$ , where the projected pair distribution plots are made using corner.py ⁴⁴4https://github.com/dfm/corner.py (Foreman-Mackey, 2016).

We can see that the marginal distributions for the logarithmic galaxy properties are very close to being Gaussian around the peak but with possible extended spreads on the left side.

Similar to Fig. 2 for the SAM data, we show galaxies selected under the same criteria from the TNG simulated data in Fig. 3. There are many fewer galaxies in the TNG data compared to those in the SAM. Although the locations and dispersions are different from those of the SAM shown in Fig. 2, the distributions are still very close to being Gaussian. We conclude for now that this is likely to be a fairly robust property of CGPD in simulations and that therefore Gaussian mixtures are a promising parameterized model for this problem.

We use GMM as a parametric model that describes the CGPD $\Phi\left(\bm{p}|M,z\right)$ , where the data vector $\bm{x}$ in Eq. 16 corresponds to the galaxy property vector $\bm{p}$ . For a bin in mass and redshift $(M,z)$ , given $n$ simulated galaxies $\{\bm{p}_{i}\}_{i=1}^{n}$ for fitting, the overall logarithmic likelihood function is given by

\mathcal{L}(\{\bm{p}_{i}\}_{i=1}^{n}|\bm{\theta})=\sum_{i=1}^{n}\ln\,L(\bm{p}_{i}|\bm{\theta})\,,

(18)

which is maximized to find the best-fit $\hat{\bm{\theta}}$ using the expectation-maximization (EM) algorithm (Dempster et al., 1977). In this work, we use the implementation of the GMM in scikit-learn ⁵⁵5https://scikit-learn.org/stable/modules/mixture.html.

Given the galaxy data and GMM model, we fit the galaxies in the 5-D physical property space shown in Fig. 2. Considering the correlations between physical properties, we use the full covariance matrices for the Gaussian components. However, this corresponds to a large number of parameters in the covariance matrix and the model might need to be further simplified in order to be used for statistical inference. We discuss a possible dimensionality reduction based on principal component analysis (PCA) in Section 4.4.

At each redshift, we divide the logarithmic host halo mass range into bins of uniform width $\Delta\log M=0.2$ . As mentioned in Section 3.1, the halo mass resolution in our SAM simulations is $0.003\,\log M_{\odot}$ , which is much smaller than the bin size such that halo masses are well sampled in each bin. Since different halo mass bins are fitted independently, the overall complexity of the model is also proportional to the number of bins. However, using fewer bins means a wider spread of the distribution for each bin and possibly results in lower accuracy. This is another trade-off depending on the usage. Another detail to notice is that as described in Section 3.1, in our simulations from SAM based on merger trees using the Monte Carlo method, the same number of halos are generated for the one thousand discrete halo masses. Therefore for wide halo mass bins, to be realistic it would be necessary to subsample the halos within a bin to match the HMF.

The model complexity of a GMM is proportional to the number of Gaussian components $N_{\rm gc}$ , which is an important hyper-parameter that needs to be determined. A typical model selection is based on the Bayesian information criterion (BIC), which attempts to achieve a balance between the complexity and performance of the model,

{\rm BIC}={\rm dim}(\bm{\theta})\ln n-2\hat{\mathcal{L}}\,,

(19)

where ${\rm dim}(\bm{\theta})$ denotes the total number of parameters in the model, $n$ is the number of data points, and $\hat{\mathcal{L}}$ is the log-likelihood at the best-fit $\hat{\bm{\theta}}$ . For the same data, the models with lower BIC scores are preferred.

In Fig. 4, for GMM fitted on the same example group of galaxies shown in Fig. 2, we show the BIC scores for models with different numbers of components ( $N_{\rm gc}$ ). The BIC score decreases as we increase $N_{\rm gc}$ starting from just one component, and the curve flattens very rapidly. Similar curves are also observed for galaxies at other redshifts and other halo mass bins. These curves indicate that a small number of components is sufficient, and continuing to increase $N_{\rm gc}$ to a very large value (e.g. $>\,10$ ) will not significantly improve the fitting performance, but will only make the model more complicated. The BIC score is a general criterion for model selection based on the likelihood, while in practice the trade-off between accuracy and model complexity depends on the specific usage. If we want to use the model as an emulator of the SAM simulations, then we may want to use more components to make it more accurate. On the other hand, if we want to use the model for statistical inference, we would prefer to make the model as simple as possible while achieving acceptable accuracy. In our main analyses, we always start with very few (i.e. 1, 2 or 3) components.

4.3 Model Performance

To check how well the GMM fits the SAM and TNG data, for some of the galaxy property functions mentioned in Section 2, we compare the results given by the best-fit GMM with those computed directly from the data.

First we consider the cosmic densities of galaxy properties defined in Eq. 5. In Fig. 5, we show the results for GMM with one or two components for the one hundred uniformly spaced redshift bins in $0\leq z<10$ . Also shown are the TNG data at five available redshifts (see Section 3.2) and the corresponding best-fit GMMs with just one component.

For both SAM and TNG simulated data, with just one or two Gaussian components, the percent errors of the cosmic densities evaluated with the GMM compared to the results computed directly from the simulation are mostly within $5\,\%$ , which is acceptable given the uncertainties in state-of-the-art observational constraints. For these integrated quantities, the improvement obtained when using one more component which can capture the tails in distributions (see 1-D histograms in Fig. 2) is not very significant.

We also consider the galaxy one-property distribution functions defined in Eq. 6.

In Fig. 6, for both the SAM and TNG, we show these functions at redshift $z=2$ as an example, including the function computed directly from the simulation and the best-fit GMMs with different numbers of components. GMM with just one component can already capture most of the information, especially the main slope of the function. Compared with the cosmic densities above, using more than one component could help capturing more information. In general, using no more than 3 components can already reproduce the galaxy one-property functions to very high accuracy.

We now consider the galaxy two-property distribution functions (scaling relations) defined in Eq. 7.

For example, we plot $\phi(m_{*},{\rm SFR})$ , $\phi(m_{*},Z_{g})$ , $\phi(m_{*},r_{\rm gal})$ , and $\phi(m_{*},m_{{\rm H}_{2}})$ at redshift $z=2$ in Fig. 7 for SAM and Fig. 8 for TNG. For clarity, here we only show a GMM with three components, which can already reproduce the data distribution very well. Using fewer (i.e. one or two) components could also be acceptable depending on how many details we would like to recover. Scaling relations can also be observed in some of the distributions.

4.4 Dimensionality Reduction: Principal Component Analysis

As expected and also shown in Fig. 2, there are strong correlations between the galaxy physical properties we consider. This makes it possible to reduce the dimension of the data without losing much information using linear transformations like Principal Component Analysis. With the data standardized (i.e. shifted to zero mean and normalized to unit variance) on each dimension, starting from the principal component (PC) with the largest variance, PCA sequentially finds the next PC with the next largest variance while being linearly independent of the previous PCs. Lower data dimensionality could reduce the number of parameters needed in the model and make it more feasible for use in an inference pipeline. Besides, the linear independence (i.e. zero covariances) between PCs makes it possible to simplify certain models like a multivariate Gaussian distribution even further.

Here we perform PCA on central galaxies for each $(M,z)$ bin. The cumulative percentages of variance explained by PCs up to the third largest variance is shown in Fig. 9.

It turns out that using just 3 PCs can already recover more than 90 percent of the variance. The distribution in the space of first 3 PCs for the galaxies in Fig. 2, is shown in Fig. 10.

Since PCA attempts to minimize parameter covariance, we do not see very strong correlations as we have seen in Fig. 2 for the physical properties. However, it is still possible for PCs to have nonlinear correlations.

Therefore with acceptable information loss, instead of in the 5-D physical property space, we can describe the distribution in the 3-D PC space and convert it back to the physical property space for usage. An additional advantage of using GMM is that a Gaussian distribution can also be analytically transformed between two spaces. Each Gaussian component fitted in the PC space also corresponds to a Gaussian component in the physical property space. With the mapping between the PC vector $\bm{y}$ and physical property vector $\bm{p}$ given by the linear transformation $\bm{y}=\bm{W}\bm{p}$ with $\bm{W}$ being a 2-D matrix of dimension ${\rm dim}(\bm{y})\times{\rm dim}(\bm{p})$ , the covariance matrices are related by

{\rm Cov}(\bm{p})=\bm{W}^{T}{\rm Cov}(\bm{y})\bm{W}\,.

(20)

In Fig. 10, we show the best-fit GMM with one and two components in the PC space. For just one component, the covariance matrix is diagonal since there is no covariance. For two components, since each of them only cover part of the data, the full covariance matrices need to be considered. If we convert the best-fit GMM back to the physical property space, the models are show a similar level of agreement as Fig. 2 where we fit the physical properties directly.

Fitting the model in PC space does provide the potential to simplify the model. However, the transformation matrix between PC and physical property spaces also introduces additional parameters. We leave further exploration for a future work.

5 Discussion

In this section, we discuss the significance of the work presented here, as well as some of the caveats and limitations of this analysis.

5.1 Significance of the CGPD for understanding galaxy formation

The CGPD is a natural extension of traditional halo models, in that it links halo mass and redshift to galaxy properties. The new feature is that, while traditional halo models attempt to quantify this mapping with a single galaxy property at a time, and typically do not model the covariance between galaxy properties, the CGPD describes the mapping to an n-dimensional vector of galaxy properties and explicitly includes the covariances. We note that what appears as scatter or dispersion in a traditional style halo mass vs. single galaxy property plot is in some cases actually a result of projecting a higher dimensional manifold into a plane. Thus we expect the CGPD approach to partially, though perhaps not fully, also provide a natural description of dispersion along with the correct residual correlations. The CGPD as extracted from physics-based simulations can be an interesting vehicle for comparing different simulations or studying the impact of varying physical processes. This has not been a focus of this work, but it is notable that while the CGPD for the Santa Cruz SAM and TNG appear qualitatively quite similar, some of the quantities differ in the location of the mean and also in the width of the distributions. It would be interesting to systematically study the impact of varying physical processes and/or cosmology on the CGPD. Better understanding why there is a relatively “thin” manifold in the n-dimensional property space and how physical processes shape the location and thickness of this manifold is also an interesting open question.

5.2 Limitations of current modeling approach

5.2.1 Star forming and quenched galaxies

For practical reasons, we have only included central, star-forming galaxies in this analysis. This choice should not have a large impact on the accuracy of our method for our main target application, which is line intensity mapping. The lines of interest are produced by star-forming galaxies. For other applications, including the full galaxy population (not only star-forming galaxies) would certainly lead to a more multi-modal CGPD, but we are confident that this could still be modeled with a GMM approach using more components (see e.g. Hahn et al., 2019).

5.2.2 CGPD for satellite galaxies

According to the SAM predictions, the overall cosmic volume densities and other integrated quantities of the galaxy properties are dominated by central galaxies, with the contributions from satellites mostly less than $10\,\%$ . However, it would also be straightforward to include a separate CGPD for satellite galaxies. For satellite galaxies, instead of the host mass $M$ , it is customary to use a quantity like $M_{\rm max,sub}$ , which is defined as the maximum mass that the host halo of the satellite galaxy had before it became a satellite (e.g. Behroozi et al., 2019). Under the (common) ansatz that the properties of satellite galaxies depend only on the sub-halo mass $M_{\rm max,sub}$ and not on the host mass in which the satellite (sub-halo) is orbiting, we could then carry out an identical analysis of the CGPD for satellites where the host halo mass $M$ is replaced by the sub-halo mass $M_{\rm max,sub}$ . We would need to add the sub-halo mass function to the distinct halo mass function introduced in Eqn. 3. Then the galaxy property distributions can simply be added,

\phi(\bm{p}|z)=\phi(\bm{p}|z)_{\rm cen}+\phi(\bm{p}|z)_{\rm sat}\,.

(21)

This would add some complexity to the model but would be a straightforward extension.

5.2.3 Dependence on higher order halo properties

Another simplification is that we assume that galaxy properties depend only on halo mass and redshift, which is known to be in contrast with the results of both hydrodynamic simulations and SAMs. Although this ansatz is fundamental to the ubiquitous halo model picture, it has been shown that properties such as galaxy clustering in hydrodynamic simulations depend on halo properties other than mass, an effect commonly referred to as assembly bias (Hadzhiyska et al., 2020, 2021). Similarly, it has been shown that some of the dispersion in the stellar mass vs. halo mass relation in simulations is actually due to these higher order halo properties, such as formation history (Mitchell & Schaye, 2022; Gabrielpillai et al., 2022). We are interested in extending the CGPD framework to include additional halo properties, guided by physics-based simulations.

5.2.4 Evolution across halo masses and redshifts

In the main results section, we fit GMM on galaxies for each redshift and halo mass bin independently. However, a continuous evolution of the galaxy distribution across those discrete bins should be expected. If the evolution of GMM parameters could be characterized with another parametric model, then the model could be further simplified. This would be important and possibly necessary for using the approach for statistical inference in the future.

Here we briefly discuss the evolution of galaxy properties across halo masses and redshifts. To avoid the side effects from possible local maxima of the likelihood function, instead of purely random initialization, we use the best-fit parameter values from the previous bin to initialize the parameters for the next bin. For GMM with just one component, the best-fit mean vector and covariance matrix are simply the sample mean and covariance of the data. We notice that the mean of the galaxy properties evolves smoothly across both halo masses and redshifts, while the variances and covariances can change dramatically. For multiple Gaussian components, an additional challenge is that we need to track each component individually, and there are different approaches for this. For example, we could sort the components based on their weights, and assume that the component with the highest weight in the current bin always corresponds to the highest-weight one in the next bin, and so on for other components with lower weights. There are also different definitions of the distance between two Gaussian distributions, based on which we could map one component in the current bin to the closest one in the next bin. However, this approach may not be very robust since the components could cross each others trajectories. In our case, we use the best-fit parameters from the current bin as the initial values for the next bin; then the new data can be considered as introducing an update to the current parameters. From this point of view, the order of parameters remains unchanged across bins. However, having tried all these approaches, we still found discontinuous behavior of the Gaussian components for not only the covariances but also the mean vectors. We leave further investigation for future work.

6 Conclusions

Traditional galaxy surveys and the relatively recent LIM techniques are both powerful methods to study the formation and evolution of galaxies, one of the most fundamental and important problems in astrophysics and cosmology. Motivated by the necessity of developing a physically motivated, computationally efficient framework that can describe both galaxy and LIM observations, we propose and characterize a new approach based on the CGPD $\Phi\left(\bm{p}|M,z\right)$ function, which describes the probability of a halo with mass $M$ at redshift $z$ to host a galaxy with properties $\bm{p}$ . When combined with the Line Luminosity Relation (LLR) $L_{i}(\bm{p},z)$ function developed in a companion work (Breysse et al., 2023), a highly efficient and flexible pipeline that connects halos with an n-dimensional distribution of galaxy physical properties, and with LIM observables, can be constructed.

As a distribution function in the 5-D galaxy property space that we have chosen, the CGPD encapsulates many important galaxy distribution functions and scaling relations etc. Motivated by the form of the CGPD seen in two separate physics-based cosmological simulations (the Santa Cruz SAM and IllustrisTNG), we discuss a parameterized characterization of CGPD using GMM. We find the following main results:

•

With just a few ( $\lesssim 3$ ) components, GMM is capable of reproducing the CGPD distribution from both the SAM and TNG to high accuracy.
•

The GMM based CGPD yields estimates of common summary statistics of interest including 2-D property relations (scaling relations), property distribution functions, and integrated cosmic densities, with a typical accuracy of a few percent.
•

For each bin in halo mass and redshift, we find that the 5-D property distribution can be reduced to only 3 dimensions using PCA, with a 10% loss of information.

The galaxy properties in this work were selected based on their importance for predicting the luminosity of certain emission lines, but a similar approach could be used for other galaxy properties, including galaxy photometry as measured in galaxy surveys.

The next step is to construct a statistical inference pipeline for LIM observations. For a 5-D Gaussian distribution with a full covariance matrix, we have 20 free parameters including 5 parameters in the mean vector and 15 independent parameters in the symmetric covariance matrix. Then if we use 3 Gaussian components for each halo mass bin and 15 mass bins for each redshift as in our main analysis, we would have 900 parameters for the GMM at each redshift. However, these parameters are expected to be highly degenerate. We have shown that the data dimensionality can be reduced significantly using PCA. Another possible approach for doing implicit likelihood inference in high dimensional parameter space is the marginal posterior density with moment networks as proposed in Jeffrey & Wandelt (2020).

7 Acknowledgements

We thank the IllustrisTNG team for granting the JupyterLab access to the simulations, and Dylan Nelson for help accessing the TNG data. ARP was supported by NASA under award numbers 80NSSC18K1014, NNH17ZDA001N, and 80NSSC22K0666, and by the NSF under award number 2108411. ARP and rss are supported by the Simons Foundation. The numerical computations in this work were performed in part using the NYU Greene High Performance Computing cluster.

References

Alam et al. (2017) Alam, S., Ata, M., Bailey, S., et al. 2017, MNRAS, 470, 2617, doi: 10.1093/mnras/stx721
Alsing et al. (2018) Alsing, J., Wandelt, B., & Feeney, S. 2018, MNRAS, 477, 2874, doi: 10.1093/mnras/sty819
Baldry et al. (2008) Baldry, I. K., Glazebrook, K., & Driver, S. P. 2008, MNRAS, 388, 945, doi: 10.1111/j.1365-2966.2008.13348.x
Behroozi et al. (2019) Behroozi, P., Wechsler, R. H., Hearin, A. P., & Conroy, C. 2019, MNRAS, 488, 3143, doi: 10.1093/mnras/stz1182
Bernal & Kovetz (2022) Bernal, J. L., & Kovetz, E. D. 2022, A&A Rev., 30, 5, doi: 10.1007/s00159-022-00143-0
Breysse et al. (2023) Breysse, P., et al. 2023, In preparation
Calette et al. (2018) Calette, A. R., Avila-Reese, V., Rodríguez-Puebla, A., Hernández-Toledo, H., & Papastergis, E. 2018, Rev. Mexicana Astron. Astrofis., 54, 443, doi: 10.48550/arXiv.1803.07692
Cranmer et al. (2020) Cranmer, K., Brehmer, J., & Louppe, G. 2020, Proceedings of the National Academy of Science, 117, 30055, doi: 10.1073/pnas.1912789117
Dempster et al. (1977) Dempster, A. P., Laird, N. M., & Rubin, D. B. 1977, Journal of the Royal Statistical Society. Series B (Methodological), 39, 1. http://www.jstor.org/stable/2984875
Diemer et al. (2018) Diemer, B., Stevens, A. R. H., Forbes, J. C., et al. 2018, ApJS, 238, 33, doi: 10.3847/1538-4365/aae387
Diemer et al. (2019) Diemer, B., Stevens, A. R. H., Lagos, C. d. P., et al. 2019, MNRAS, 487, 1529, doi: 10.1093/mnras/stz1323
Forbes et al. (2014) Forbes, J. C., Krumholz, M. R., Burkert, A., & Dekel, A. 2014, MNRAS, 443, 168, doi: 10.1093/mnras/stu1142
Forbes et al. (2019) Forbes, J. C., Krumholz, M. R., & Speagle, J. S. 2019, MNRAS, 487, 3581, doi: 10.1093/mnras/stz1473
Foreman-Mackey (2016) Foreman-Mackey, D. 2016, The Journal of Open Source Software, 1, 24, doi: 10.21105/joss.00024
Gabrielpillai et al. (2022) Gabrielpillai, A., Somerville, R. S., Genel, S., et al. 2022, MNRAS, 517, 6091, doi: 10.1093/mnras/stac2297
Hadzhiyska et al. (2020) Hadzhiyska, B., Bose, S., Eisenstein, D., Hernquist, L., & Spergel, D. N. 2020, MNRAS, 493, 5506, doi: 10.1093/mnras/staa623
Hadzhiyska et al. (2021) Hadzhiyska, B., Liu, S., Somerville, R. S., et al. 2021, MNRAS, 508, 698, doi: 10.1093/mnras/stab2564
Hahn et al. (2019) Hahn, C., Starkenburg, T. K., Choi, E., et al. 2019, ApJ, 872, 160, doi: 10.3847/1538-4357/aafedd
Henriques et al. (2009) Henriques, B. M. B., Thomas, P. A., Oliver, S., & Roseboom, I. 2009, MNRAS, 396, 535, doi: 10.1111/j.1365-2966.2009.14730.x
Henriques et al. (2013) Henriques, B. M. B., White, S. D. M., Thomas, P. A., et al. 2013, MNRAS, 431, 3373, doi: 10.1093/mnras/stt415
Ivezić et al. (2019) Ivezić, Ž., Connolly, A., Vanderplas, J., & Gray, A. 2019, Statistics, Data Mining and Machine Learning in Astronomy (Princeton University Press)
Jeffrey & Wandelt (2020) Jeffrey, N., & Wandelt, B. D. 2020, arXiv e-prints, arXiv:2011.05991. https://arxiv.org/abs/2011.05991
Koekemoer et al. (2011) Koekemoer, A. M., Faber, S. M., Ferguson, H. C., et al. 2011, ApJS, 197, 36, doi: 10.1088/0067-0049/197/2/36
Kovetz et al. (2017) Kovetz, E. D., Viero, M. P., Lidz, A., et al. 2017, arXiv e-prints, arXiv:1709.09066. https://arxiv.org/abs/1709.09066
Krumholz (2014) Krumholz, M. R. 2014, MNRAS, 437, 1662, doi: 10.1093/mnras/stt2000
Lange et al. (2015) Lange, R., Driver, S. P., Robotham, A. S. G., et al. 2015, MNRAS, 447, 2603, doi: 10.1093/mnras/stu2467
Lara-López et al. (2010) Lara-López, M. A., Cepa, J., Bongiovanni, A., et al. 2010, A&A, 521, L53, doi: 10.1051/0004-6361/201014803
Leja et al. (2022) Leja, J., Speagle, J. S., Ting, Y.-S., et al. 2022, ApJ, 936, 165, doi: 10.3847/1538-4357/ac887d
Leung et al. (2020) Leung, T. K. D., Olsen, K. P., Somerville, R. S., et al. 2020, ApJ, 905, 102, doi: 10.3847/1538-4357/abc25e
Li et al. (2016) Li, T. Y., Wechsler, R. H., Devaraj, K., & Church, S. E. 2016, ApJ, 817, 169, doi: 10.3847/0004-637X/817/2/169
Lu et al. (2012) Lu, Y., Mo, H. J., Katz, N., & Weinberg, M. D. 2012, MNRAS, 421, 1779, doi: 10.1111/j.1365-2966.2012.20435.x
Lu et al. (2014) Lu, Y., Mo, H. J., Lu, Z., Katz, N., & Weinberg, M. D. 2014, MNRAS, 443, 1252, doi: 10.1093/mnras/stu1200
Lu et al. (2011) Lu, Y., Mo, H. J., Weinberg, M. D., & Katz, N. 2011, MNRAS, 416, 1949, doi: 10.1111/j.1365-2966.2011.19170.x
Madau & Dickinson (2014) Madau, P., & Dickinson, M. 2014, ARA&A, 52, 415, doi: 10.1146/annurev-astro-081811-125615
Marinacci et al. (2018) Marinacci, F., Vogelsberger, M., Pakmor, R., et al. 2018, MNRAS, 480, 5113, doi: 10.1093/mnras/sty2206
Mitchell & Schaye (2022) Mitchell, P. D., & Schaye, J. 2022, MNRAS, 511, 2948, doi: 10.1093/mnras/stab3339
Mowla et al. (2019) Mowla, L., van der Wel, A., van Dokkum, P., & Miller, T. B. 2019, ApJ, 872, L13, doi: 10.3847/2041-8213/ab0379
Murray et al. (2013) Murray, S. G., Power, C., & Robotham, A. S. G. 2013, Astronomy and Computing, 3, 23, doi: 10.1016/j.ascom.2013.11.001
Naab & Ostriker (2017) Naab, T., & Ostriker, J. P. 2017, ARA&A, 55, 59, doi: 10.1146/annurev-astro-081913-040019
Naiman et al. (2018) Naiman, J. P., Pillepich, A., Springel, V., et al. 2018, MNRAS, 477, 1206, doi: 10.1093/mnras/sty618
Nelson et al. (2018) Nelson, D., Pillepich, A., Springel, V., et al. 2018, MNRAS, 475, 624, doi: 10.1093/mnras/stx3040
Noeske et al. (2007) Noeske, K. G., Weiner, B. J., Faber, S. M., et al. 2007, ApJ, 660, L43, doi: 10.1086/517926
Olsen et al. (2021) Olsen, K. P., Burkhart, B., Mac Low, M.-M., et al. 2021, ApJ, 922, 88, doi: 10.3847/1538-4357/ac20d4
Padmanabhan (2018) Padmanabhan, H. 2018, MNRAS, 475, 1477, doi: 10.1093/mnras/stx3250
Padmanabhan (2019) —. 2019, MNRAS, 488, 3014, doi: 10.1093/mnras/stz1878
Pallottini et al. (2022) Pallottini, A., Ferrara, A., Gallerani, S., et al. 2022, MNRAS, 513, 5621, doi: 10.1093/mnras/stac1281
Pillepich et al. (2018) Pillepich, A., Nelson, D., Hernquist, L., et al. 2018, MNRAS, 475, 648, doi: 10.1093/mnras/stx3112
Planck Collaboration et al. (2016) Planck Collaboration, Ade, P. A. R., Aghanim, N., et al. 2016, A&A, 594, A13, doi: 10.1051/0004-6361/201525830
Popping et al. (2019) Popping, G., Narayanan, D., Somerville, R. S., Faisst, A. L., & Krumholz, M. R. 2019, MNRAS, 482, 4906, doi: 10.1093/mnras/sty2969
Popping et al. (2014) Popping, G., Somerville, R. S., & Trager, S. C. 2014, MNRAS, 442, 2398, doi: 10.1093/mnras/stu991
Porter et al. (2014) Porter, L. A., Somerville, R. S., Primack, J. R., & Johansson, P. H. 2014, MNRAS, 444, 942, doi: 10.1093/mnras/stu1434
Schaan & White (2021a) Schaan, E., & White, M. 2021a, J. Cosmology Astropart. Phys, 2021, 067, doi: 10.1088/1475-7516/2021/05/067
Schaan & White (2021b) —. 2021b, J. Cosmology Astropart. Phys, 2021, 068, doi: 10.1088/1475-7516/2021/05/068
Silva et al. (2015) Silva, M., Santos, M. G., Cooray, A., & Gong, Y. 2015, ApJ, 806, 209, doi: 10.1088/0004-637X/806/2/209
Smit et al. (2012) Smit, R., Bouwens, R. J., Franx, M., et al. 2012, ApJ, 756, 14, doi: 10.1088/0004-637X/756/1/14
Somerville & Davé (2015a) Somerville, R. S., & Davé, R. 2015a, ARA&A, 53, 51, doi: 10.1146/annurev-astro-082812-140951
Somerville & Davé (2015b) —. 2015b, ARA&A, 53, 51, doi: 10.1146/annurev-astro-082812-140951
Somerville et al. (2012) Somerville, R. S., Gilmore, R. C., Primack, J. R., & Domínguez, A. 2012, MNRAS, 423, 1992, doi: 10.1111/j.1365-2966.2012.20490.x
Somerville et al. (2008) Somerville, R. S., Hopkins, P. F., Cox, T. J., Robertson, B. E., & Hernquist, L. 2008, MNRAS, 391, 481, doi: 10.1111/j.1365-2966.2008.13805.x
Somerville & Kolatt (1999) Somerville, R. S., & Kolatt, T. S. 1999, MNRAS, 305, 1, doi: 10.1046/j.1365-8711.1999.02154.x
Somerville et al. (2015) Somerville, R. S., Popping, G., & Trager, S. C. 2015, MNRAS, 453, 4337, doi: 10.1093/mnras/stv1877
Somerville & Primack (1999) Somerville, R. S., & Primack, J. R. 1999, MNRAS, 310, 1087, doi: 10.1046/j.1365-8711.1999.03032.x
Somerville et al. (2021) Somerville, R. S., Olsen, C., Yung, L. Y. A., et al. 2021, MNRAS, 502, 4858, doi: 10.1093/mnras/stab231
Springel et al. (2018) Springel, V., Pakmor, R., Pillepich, A., et al. 2018, MNRAS, 475, 676, doi: 10.1093/mnras/stx3304
Stevens et al. (2021) Stevens, A. R. H., Lagos, C. d. P., Cortese, L., et al. 2021, MNRAS, 502, 3158, doi: 10.1093/mnras/staa3662
Sun et al. (2019) Sun, G., Hensley, B. S., Chang, T.-C., Doré, O., & Serra, P. 2019, ApJ, 887, 142, doi: 10.3847/1538-4357/ab55df
Tinker et al. (2008) Tinker, J., Kravtsov, A. V., Klypin, A., et al. 2008, ApJ, 688, 709, doi: 10.1086/591439
Tremonti et al. (2004) Tremonti, C. A., Heckman, T. M., Kauffmann, G., et al. 2004, ApJ, 613, 898, doi: 10.1086/423264
Wechsler & Tinker (2018) Wechsler, R. H., & Tinker, J. L. 2018, ARA&A, 56, 435, doi: 10.1146/annurev-astro-081817-051756
Yang et al. (2022) Yang, S., Popping, G., Somerville, R. S., et al. 2022, ApJ, 929, 140, doi: 10.3847/1538-4357/ac5d57
Yang et al. (2021) Yang, S., Somerville, R. S., Pullen, A. R., et al. 2021, ApJ, 911, 132, doi: 10.3847/1538-4357/abec75
York et al. (2000) York, D. G., Adelman, J., Anderson, John E., J., et al. 2000, AJ, 120, 1579, doi: 10.1086/301513
Yung et al. (2019a) Yung, L. Y. A., Somerville, R. S., Finkelstein, S. L., Popping, G., & Davé, R. 2019a, MNRAS, 483, 2983, doi: 10.1093/mnras/sty3241
Yung et al. (2019b) Yung, L. Y. A., Somerville, R. S., Popping, G., et al. 2019b, MNRAS, 490, 2855, doi: 10.1093/mnras/stz2755