Characterizing the Conditional Galaxy Property Distribution using Gaussian Mixture Models
Abstract
Line-intensity mapping (LIM) is a promising technique to constrain the global distribution of galaxy properties. To combine LIM experiments probing different tracers with traditional galaxy surveys and fully exploit the scientific potential of these observations, it is necessary to have a physically motivated modeling framework. As part of developing such a framework, in this work we introduce and model the conditional galaxy property distribution (CGPD), i.e. the distribution of galaxy properties conditioned on the host halo mass and redshift. We consider five galaxy properties, including the galaxy stellar mass, molecular gas mass, galaxy radius, gas phase metallicity and star formation rate (SFR), which are important for predicting the emission lines of interest. The CGPD represents the full distribution of galaxies in the five dimensional property space; many important galaxy distribution functions and scaling relations, such as the stellar mass function and SFR main sequence, can be derived from integrating and projecting it. We utilize two different kinds of cosmological galaxy simulations, a semi-analytic model and the IllustrisTNG hydrodynamic simulation, to characterize the CGPD and explore how well it can be represented using a Gaussian mixture model (GMM). We find that with just a few () Gaussian components, a GMM can describe the CGPD of the simulated galaxies to high accuracy for both simulations. The CGPD can be mapped to LIM or other observables by constructing the appropriate relationship between galaxy properties and the relevant observable tracers.
1 Introduction
Understanding the physical processes that govern the formation and evolution of galaxies over cosmic time is one of the most fundamental and important problems in astrophysics and cosmology. Numerous astronomical surveys across a wide range of wavelengths have been carried out to observe galaxies, from nearby targets to those probing the earliest epochs (e.g. York et al., 2000; Koekemoer et al., 2011). However, galaxy surveys are limited by the trade-off between area and depth since deeper and fainter targets require more time to achieve a detectable signal-to-noise ratio. Instead of resolving and observing individual targets, a relatively recent technique called line-intensity mapping (LIM) (e.g. Kovetz et al., 2017) has been proposed to measure the integrated emission lines from all sources, bright and faint. Similar to traditional galaxy surveys, LIM not only provides a promising tracer of the large-scale structure (LSS) of the total matter distribution, but also contains a wealth of information about astrophysical properties of galaxies (Schaan & White, 2021a).
One of the potentially powerful aspects of LIM is that there are now numerous experiments planned or online that probe different lines, which trace different gas phases, at different redshifts (see Bernal & Kovetz, 2022, for a recent summary). For example, some of the most commonly considered lines include CO, which is a tracer of dense molecular gas in the interstellar medium (ISM) of galaxies, atomic and ionic fine structure lines such as [CII], which is primarily emitted by photo-dissociation regions in the outer layers of molecular clouds, 21-cm emission from atomic gas in the ISM and intergalactic medium (IGM), and lines arising from ionized gas such as Lyman- and H-. These are the brightest and most easily detected lines, but there are many others that can provide interesting information. It is clear that combining multi-tracer LIM surveys with traditional galaxy surveys will be the key to extracting the maximum amount of science from these experiments.
In order to do this, it is essential to have a comprehensive modeling framework. The most commonly used framework is halo-based phenomenological models (see Bernal & Kovetz, 2022, for a recent review). In this approach, scaling relations between halo mass and redshift and “intrinsic” physical properties such as stellar mass and star formation rate are derived using abundance matching. In abundance matching, observational estimates of stellar mass functions and SFR are used along with predictions for the number density of dark matter halos as a function of mass to derive constraints on the mapping between these quantities (see Wechsler & Tinker, 2018, for a review). Then, scaling relations between e.g. SFR and a line luminosity of interest are derived using observations of individually detected sources (e.g. Silva et al., 2015; Li et al., 2016; Padmanabhan, 2018, 2019; Sun et al., 2019). This approach has the advantage of being highly computationally efficient, and of being tied to observations.
However, there are a number of drawbacks and limitations to this approach. First, these empirical fitting functions usually lack physical motivation, so they are able to provide only limited insights about the physical processes underlying galaxy formation. Second, LIM detections are sensitive to galaxies that are too faint to detect individually, so the scaling relations have to be extrapolated beyond the observational range. Third, these empirical approaches are not able to self-consistently capture the covariances between different tracers, or between the intrinsic properties (stellar mass, SFR) and various emission lines. It is important to keep in mind that the estimates of stellar mass and SFR that are used in abundance matching models are not directly observable quantities, but are derived from UV-optical-NIR observations and carry significant uncertainties. The prior used to obtain these estimates can have a strong impact on the results (Leja et al., 2022). It is critical to be able to properly fold in the prior, covariances, and uncertainties on these quantities, which lie at the foundation of abundance matching approaches.
Numerical cosmological simulations that incorporate baryonic astrophysical processes are a potentially powerful tool for exploring the physics of galaxy formation and evolution, as well as the connection with observable line emissions. The most fundamental approach is based on explicitly simulating the hydrodynamics and thermodynamics of baryonic material evolving within a CDM cosmological framework for structure formation (Somerville & Davé, 2015a; Naab & Ostriker, 2017). These numerical hydrodynamic simulations provide direct predictions for the distribution and state of nearly all of the relevant components of galaxies and the IGM (stars, gas at different densities and temperatures, various chemical elements). They can be coupled with modeling tools to provide predictions of many of the emission lines of interest for LIM experiments (Leung et al., 2020; Olsen et al., 2021). However, due to its computational expense, this technique suffers from a fairly severe trade-off between volume and resolution. Simulations that resolve the multi-phase ISM can typically be run for only a few galaxies (e.g. Pallottini et al., 2022). Simulations of cosmological volumes must adopt “sub-grid” recipes to treat most of the critical processes of galaxy formation, including star formation, stellar feedback, and black hole formation and feedback. And even the largest computationally feasible cosmological hydrodynamic simulations fall short of the volumes that are already routinely probed by galaxy and LIM surveys. For example, MillenniumTNG has a volume of (500 Mpc/)3, compared to (1760 Mpc/)3 for the Baryon Oscillations Spectroscopic Survey of luminous red galaxies alone (Alam et al., 2017). The currently adopted sub-grid models are highly uncertain, and approaches adopted by different groups can lead to dramatically different predictions for gas properties. It is infeasible to thoroughly explore the parameter space of the sub-grid models for significant volumes using a direct numerical approach.
An alternative approach to physics-based models is semi-analytic models (SAM), which combine simplified treatments of physical processes with cosmological halo merger histories. These models produce similar predictions to those of numerical hydrodynamic simulations for global galaxy properties, and include a similar suite of physical processes, but at a fraction of the computational cost. However, unlike numerical hydrodynamics simulations they do not provide detailed information about the internal structure of galaxies or the spatial distribution of gas within or around them. Popping et al. (2014) and Popping et al. (2019) coupled semi-analytic models of galaxy formation with the DESPOTIC code (Krumholz, 2014), which computes line luminosities for optically thick clouds using a one-zone model, to make predictions for a broad suite of emission lines, including multiple transitions in the CO ladder, [CI] and [CII]. They validated these models by comparing the predictions with an extensive suite of observations of individually detected galaxies across a broad range of redshifts. Yang et al. (2021) used lightcones populated with this approach to create mock maps (including interlopers, cosmic infrared background, and Galactic foreground) for exemplar LIM experiments in CO and [CI]. This work showed that SAMs coupled with the appropriate line emission models in post-processing are a promising alternative tool for LIM forecasts and interpretation.
Ideally, we would like to be able to use Bayesian inference to constrain the posteriors of both cosmological and astrophysical parameters from a combined suite of multi-tracer LIM and galaxy survey observables. When using a forward-modeling or simulation-based inference approach on a large population of galaxies or a cosmological volume (Henriques et al., 2009, 2013; Lu et al., 2011, 2012, 2014; Forbes et al., 2019), depending on the sampling approach adopted, this requires thousands to tens of thousands of executions of the simulation. Most currently available SAM codes are not fast enough for this to be practical, particularly if large volumes need to be simulated.
An intermediate approach is to develop models that link halo properties to galaxy intrinsic and observable properties using physics-based simulations, such as hydrodynamic simulations and SAMs, instead of empirical methods such as abundance matching. For example, Yang et al. (2022) used the SAM-based predictions described above to derive and present fitting functions between halo mass and all major predicted lines over a broad range of redshifts, and showed that when these fitting functions were combined with dark matter halo properties as in the standard empirical halo model approach, agreement of 5-10% was achieved for common LIM summary statistics such as the power spectrum and intensity.

In this work, we introduce a new approach that we believe can have very general applicability to forecasting and interpreting a wide variety of galaxy and LIM survey data. We introduce the conditional galaxy property distribution (CGPD) , which gives the number density of galaxies in the property space , conditioned on the halo mass and redshift . Whereas traditional halo models are typically mappings from a scalar quantity (halo mass) to another scalar (e.g. stellar mass or SFR), we can think of this as an extension of the halo model to map to a higher dimensional suite of galaxy properties. In this work, has dimension five and includes the galaxy stellar mass, molecular gas mass, galaxy radius, gas phase metallicity and star formation rate (SFR), which are important for predicting the line luminosities of interest (here, we have in mind mainly CO and [CII]). However, in principle could contain any galaxy property of interest and have an arbitrary number of dimensions. With this set of galaxy properties, CGPD also encapsulates many popular and important galaxy distribution functions and scaling relations such as the stellar mass function and SFR main sequence.
In this paper, we explore the form and effective dimensionality of the CGPD for this chosen set of key galaxy properties, using two different physics based cosmological simulations of galaxy formation: the “Santa Cruz” semi-analytic model (Somerville & Primack, 1999; Somerville et al., 2008, 2012; Porter et al., 2014; Popping et al., 2014; Somerville et al., 2015), and the IllustrisTNG hydrodynamic simulation suite (Springel et al., 2018; Pillepich et al., 2018; Marinacci et al., 2018; Nelson et al., 2018; Naiman et al., 2018). A key point is that it is well known that galaxy properties are highly covariant, giving rise to the many familiar galaxy scaling relations such as the stellar mass vs. SFR relation (SF sequence), mass metallicity relation, etc. Exploring higher dimensional manifolds within galaxy property space (as hinted at by, for example, “fundamental plane” type relations such as the fundamental mass metallicity relation) will in itself provide interesting insights into galaxy formation, as the physics that shapes galaxy properties is encoded in these relationships. Moreover, it will help us to develop tools to characterize and model these high-dimensional relationships in order to fully capture the covariances between galaxy properties. In this work, we adopt a Gaussian mixture model (GMM) based approach as a generative density estimation model to describe the CGPD.
Combined with a model that can predict emission line luminosities from these galaxy properties, this parametric model could potentially be used in a statistical inference pipeline, either using “traditional” methods such as Markov Chain Monte Carlo (MCMC) sampling combined with an assumed form for the likelihood, or more modern implicit likelihood inference methods (Alsing et al., 2018; Cranmer et al., 2020). With the formulation presented here, the posterior of the parameters of the GMM (and hence the CGPD) can be constrained based on a chosen set of LIM observables. The resulting CGPD can then be projected and integrated to obtain estimates for galaxy population summary statistics of common interest, such as the SFR or cold gas distribution functions or cosmic SFR or cold gas density, at a series of redshifts. If desired, additional observational constraints from galaxy surveys, such as the stellar mass function, can simultaneously be used as constraints in the inference procedure. In Fig. 1, we provide a graphical representation of how the CGPD model that we develop in this work fits into this broader framework, which we plan to build up in future works.
The paper is organized as follows. In Section 2, we introduce the CGPD and how it is related to more familiar summary statistics that are commonly used to characterize galaxy populations. We introduce the simulated galaxy catalogs with physical properties of interest in Section 3. Based on these simulations, we then present the application of the GMM in characterizing the CGPD in Section 4. We discuss our results in Section 5, and conclude in Section 6. Throughout this work, we adopt a flat CDM fiducial cosmology with Planck 2015 parameters: , , , , , (Planck Collaboration et al., 2016), which are consistent with the cosmology assumed in the simulations.
2 Conditional Galaxy Property Distribution
For a halo of mass at redshift , the CGPD is defined as the number density distribution of galaxies in the property space . Integrating over yields the expected number of galaxies for such a halo. For simplicity, in this work we write the CGPD as a function of , although in practice, we perform the fitting of the CGPD in space 111In this work, we denote a base 10 logarithm with and natural logarithm with ., i.e. . The CGPD may be represented in either of the two spaces for convenience. The transformation of the CGPD between these two spaces can be straightforwardly given by
(1) |
where is the determinant of the Jacobian matrix. Multiplying the differential volume in space on both sides gives
(2) |
which is more convenient for the purpose of integration. The galaxy properties we consider are summarized in Table 1.
Property | Unit | Description |
---|---|---|
galaxy stellar mass | ||
molecular gas mass | ||
[kpc] | galaxy radius | |
gas phase metallicity | ||
star formation rate |
These are the properties that are the most important for the line luminosities of interest. Note that is related to the conditional luminosity function (CLF) introduced in Schaan & White (2021b) by the line luminosity relation (LLR) , which gives the luminosity of line for a galaxy with properties at redshift . The LLR is assumed to be independent of the host halo mass . The modeling of the LLR using SAM simulated galaxies coupled with spectral synthesis models is being developed in a companion work (Breysse et al., 2023).
The CGPD must be coupled with the dark matter halo model in order to connect the cosmological dark matter field with observable galaxies. For example, we obtain the global distribution function for each element of the property vector by performing the integral:
(3) |
where is the halo mass function (HMF) at redshift . In this work, we use the HMF from Tinker et al. (2008) with the implementation in the hmf 222https://github.com/halomod/hmf code (Murray et al., 2013). This general galaxy distribution function gives the comoving volume number density of galaxies in the property space at redshift . Integrating over yields the total number of galaxies per unit comoving volume. As shown below, by integrating over certain dimensions, encapsulates many observables of wide interest in galaxy and LIM surveys, such as the cosmic densities of galaxy properties, the galaxy abundance as a function of a certain property, and the scaling relations between two (or more) galaxy properties.
2.1 Cosmic densities of galaxy properties
The cosmic number density of galaxies per unit comoving volume is given by
(4) |
Similarly, the cosmic density of a galaxy property in is
(5) |
These integrated quantities are important observable functions describing the overall evolution of galaxy properties over cosmic time (e.g. Madau & Dickinson, 2014, and references therein), including the cosmic stellar mass density , cosmic molecular gas mass density , and cosmic SFR density .
2.2 Galaxy distribution functions and scaling relations
Besides the volume density of the galaxy properties in Eq. 5, the abundances of galaxies as functions of certain physical properties have also been widely studied in galaxy surveys.
As a function of a single property , the galaxy distribution is given by integrating over the other dimensions,
(6) |
These functions of wide interest include the galaxy stellar mass function (e.g. Baldry et al., 2008), the molecular gas mass function (e.g. Diemer et al., 2018), and the star formation rate function (e.g. Smit et al., 2012), etc.
Another important observable for galaxy surveys is the correlation between certain sets of galaxy properties, some of which can be even described with simple scaling relations. At a given redshift, integrating over dimensions of yields
(7) |
the distribution of galaxies in the space of the two unprojected-over properties. A traditional scaling relation corresponds to fitting this distribution with an empirical function . These scaling relations of wide interest include the star formation main sequence (e.g. Noeske et al., 2007), the mass-metallicity relation (e.g. Tremonti et al., 2004), the size-mass relation (e.g. Lange et al., 2015; Mowla et al., 2019), and the stellar mass vs. gas mass relation (e.g. Calette et al., 2018). Besides the main trend or correlation given by the scaling relations, there is also dispersion around the median values. These dispersions also contain a wealth of physical information (e.g. Forbes et al., 2014), and the corresponding conditional distributions like have also been investigated (e.g. Leja et al., 2022). In our case, includes all the information, from which the scaling relations and the conditional distribution can also be derived. The conditional probability distribution is given by
(8) |
where differs from by a factor of the galaxy number density. Then the function can be modelled as
(9) |
Similarly, the dispersion follows,
(10) |
2.3 Line luminosities and intensities
Typically, line luminosity functions are 1-D distributions over one line luminosity . As with the relation between CGPD and CLF mentioned above, the line luminosity function can also be connected to by the LLR ,
(11) |
By combining and LLR, the observed intensity can be derived as
(12) |
where the integral (without the factor) gives the line luminosity density.
2.4 Galaxy properties per halo
Besides the expressions based on Eq. 3 which integrate over all halos of different masses, we can similarly write down the properties per halo from the CGPD . For example, the average number of galaxies for a halo of a given mass can be written as
(13) |
where the superscript indicates that it is a quantity per halo. Similarly, the expected galaxy property per halo is given as
(14) |
Note that Eq. 14 should be divided by Eq. 13 for if the average radius per galaxy is desired instead of the sum of all galaxy radii in the halo. With LLR , the line luminosity for line per halo can be written as
(15) |
3 Simulations
In this section we briefly introduce the physics-based simulations, which will be used to study and motivate the functional form and modeling approach for characterizing the CGPD. These include the “Santa Cruz” semi-analytic models and hydrodynamic simulations from the IllustrisTNG project. Note that comparing and discussing the differences between the simulated results from the SAM and TNG is outside of the scope of this work, and a recent discussion can be found in Gabrielpillai et al. (2022). It is important to consider different models in order to test the robustness and flexibility of the GMM in describing the CGPD.
3.1 Semi-Analytic Model
State-of-the-art SAMs have been shown to produce consistent predictions for global galaxy properties and their evolution compared with much more expensive numerical hydrodynamic simulations (Somerville & Davé, 2015b). In this work, we use the SAM developed by the “Santa Cruz” group (Somerville & Primack, 1999; Somerville et al., 2008, 2012; Porter et al., 2014; Popping et al., 2014; Somerville et al., 2015). This well-established semi-analytic galaxy formation modeling code has been compared extensively with galaxy observations from and is overall very successful at matching many observations (see e.g. Popping et al., 2014; Somerville et al., 2015; Yung et al., 2019a, b; Somerville et al., 2021).
The backbone of the SAM consists of dark matter halo merger trees, which can be extracted from N-body simulations or generated with Monte Carlo realizations based on the extended Press-Schechter (EPS) formalism (Somerville & Kolatt, 1999). In this work, in order to make sure that all halo masses in the wide range of interest are well populated for fitting, we use the Monte Carlo EPS method, which enables better sampling of the full range of relevant halo masses at low computational cost.
We generate simulations for one hundred redshifts uniformly sampled in , with galaxies at each redshift being simulated independently. At each redshift, we consider halos with mass in the range , which has been shown to be the relevant halo mass range probed by LIM measurements (Yang et al., 2021). In this logarithmic halo mass range, a thousand halos are generated for each of the one thousand uniformly sampled halo masses, and hence in total this yields one million halos for each redshift. This large number of halos and well-sampled resolution in halo mass give us more flexibility in binning for constructing the CGPD. We only consider star-forming galaxies, which are selected with the criterion , where is the specific SFR and is the Hubble time at the galaxy redshift . We ignore quenched galaxies in this work, as they have a negligible contribution to LIM. For simplicity and better demonstration, in our main analyses we only consider central galaxies. We checked that the contributions from satellites for the integrated quantities are less than , and we include some further details on including and fitting satellites in Section 5.2.2.
3.2 Hydrodynamic simulations
IllustrisTNG is a project that includes a series of state-of-the-art cosmological N-body and hydrodynamic simulations (Springel et al., 2018; Pillepich et al., 2018; Marinacci et al., 2018; Nelson et al., 2018; Naiman et al., 2018). We use the halo and subhalo (i.e. galaxy) catalogs from TNG100-1, which are simulated in a cubic box of size with a mass resolution of for the dark matter and baryonic particles. The TNG fields that are most comparable to the galaxy properties simulated by the SAM used in this work are summarized in Table 2.
Property | TNG field |
---|---|
SubhaloMassInRadType (Type=4) | |
m_h2_DK11_vol | |
SubhaloHalfmassRadType (Type=4) | |
SubhaloGasMetallicity | |
SubhaloSFR |
TNG fields for and are selected following the discussion in Gabrielpillai et al. (2022), where it is also noted that in some cases these quantities are not defined in exactly the same way as in SAM. Then the field for is chosen to be consistent with that of . The molecular gas mass estimates are provided in post-processed supplementary catalogs (Diemer et al., 2018, 2019; Stevens et al., 2021) for five redshifts , which are described in detail in the data specifications of the TNG project 333https://www.tng-project.org/data/docs/specifications/. For the host halo mass , the Group_M_TopHat200 field in TNG is used.
4 Results: Characterizing the CGPD
Based on the simulated data described above, we now propose and test analytic models that could potentially properly characterize the CGPD.
There are various parametric and nonparametric methods for describing the distribution of point data, also known as density estimation (Ivezić et al., 2019). Histograms might be the most straightforward nonparametric method, which approximate the function value as constants within discrete bins. However, the total number of bins increases exponentially with the dimensionality of the data. Treating each bin as an independent variable makes it infeasible for inference, and also this method does not give any insights into the structure of the distribution. This remains an issue even for more advanced nonparametric models like Gaussian processes. Another class of methods assumes a specific parametric form of the underlying probabilistic density function that generates the data points, among which Gaussian Mixture Models are one of the most powerful approaches due to its simplicity and flexibility. In this section, we present the modeling of the CGPD using GMMs.
4.1 Gaussian Mixture Model
The GMM is a probabilistic model defined as the weighted average of a finite number of Gaussian distributions, and the likelihood function reads
(16) |
where denotes a multivariate Gaussian distribution, is the data vector, and includes all the model parameters, i.e. the weight , the mean vector and the covariance matrix for each of the Gaussian components. The weights are constrained by the normalization such that the likelihood function integrates to one. In unsupervised machine learning, GMM is a powerful method for clustering and classification with each component corresponding to a certain class. Here, GMM is mainly used for density estimation and we are not attempting to classify galaxies in their property space. Another practical advantage of GMM in describing multidimensional data is their simplicity in giving marginal and conditional distributions, which remain Gaussian. Since each component is a multivariate Gaussian, marginalizing (integrating) over any subset of dimensions can be done by simply removing the marginalized dimensions from the mean vector and covariance matrix. Similarly, the conditional distributions for a multivariate Gaussian can also be written analytically. For example, for a bivariate case,
(17) |
where and are the mean and standard deviation of , and is the correlation coefficient between the two. This would be useful for deriving e.g. from , as discussed following Eq. 7.
4.2 Fitting CGPD with GMM
Fig. 2 shows an example of the CGPD for central galaxies in the SAM with host halo mass at redshift , where the projected pair distribution plots are made using corner.py 444https://github.com/dfm/corner.py (Foreman-Mackey, 2016).

We can see that the marginal distributions for the logarithmic galaxy properties are very close to being Gaussian around the peak but with possible extended spreads on the left side.

Similar to Fig. 2 for the SAM data, we show galaxies selected under the same criteria from the TNG simulated data in Fig. 3. There are many fewer galaxies in the TNG data compared to those in the SAM. Although the locations and dispersions are different from those of the SAM shown in Fig. 2, the distributions are still very close to being Gaussian. We conclude for now that this is likely to be a fairly robust property of CGPD in simulations and that therefore Gaussian mixtures are a promising parameterized model for this problem.
We use GMM as a parametric model that describes the CGPD , where the data vector in Eq. 16 corresponds to the galaxy property vector . For a bin in mass and redshift , given simulated galaxies for fitting, the overall logarithmic likelihood function is given by
(18) |
which is maximized to find the best-fit using the expectation-maximization (EM) algorithm (Dempster et al., 1977). In this work, we use the implementation of the GMM in scikit-learn 555https://scikit-learn.org/stable/modules/mixture.html.
Given the galaxy data and GMM model, we fit the galaxies in the 5-D physical property space shown in Fig. 2. Considering the correlations between physical properties, we use the full covariance matrices for the Gaussian components. However, this corresponds to a large number of parameters in the covariance matrix and the model might need to be further simplified in order to be used for statistical inference. We discuss a possible dimensionality reduction based on principal component analysis (PCA) in Section 4.4.
At each redshift, we divide the logarithmic host halo mass range into bins of uniform width . As mentioned in Section 3.1, the halo mass resolution in our SAM simulations is , which is much smaller than the bin size such that halo masses are well sampled in each bin. Since different halo mass bins are fitted independently, the overall complexity of the model is also proportional to the number of bins. However, using fewer bins means a wider spread of the distribution for each bin and possibly results in lower accuracy. This is another trade-off depending on the usage. Another detail to notice is that as described in Section 3.1, in our simulations from SAM based on merger trees using the Monte Carlo method, the same number of halos are generated for the one thousand discrete halo masses. Therefore for wide halo mass bins, to be realistic it would be necessary to subsample the halos within a bin to match the HMF.
The model complexity of a GMM is proportional to the number of Gaussian components , which is an important hyper-parameter that needs to be determined. A typical model selection is based on the Bayesian information criterion (BIC), which attempts to achieve a balance between the complexity and performance of the model,
(19) |
where denotes the total number of parameters in the model, is the number of data points, and is the log-likelihood at the best-fit . For the same data, the models with lower BIC scores are preferred.

In Fig. 4, for GMM fitted on the same example group of galaxies shown in Fig. 2, we show the BIC scores for models with different numbers of components (). The BIC score decreases as we increase starting from just one component, and the curve flattens very rapidly. Similar curves are also observed for galaxies at other redshifts and other halo mass bins. These curves indicate that a small number of components is sufficient, and continuing to increase to a very large value (e.g. ) will not significantly improve the fitting performance, but will only make the model more complicated. The BIC score is a general criterion for model selection based on the likelihood, while in practice the trade-off between accuracy and model complexity depends on the specific usage. If we want to use the model as an emulator of the SAM simulations, then we may want to use more components to make it more accurate. On the other hand, if we want to use the model for statistical inference, we would prefer to make the model as simple as possible while achieving acceptable accuracy. In our main analyses, we always start with very few (i.e. 1, 2 or 3) components.
4.3 Model Performance
To check how well the GMM fits the SAM and TNG data, for some of the galaxy property functions mentioned in Section 2, we compare the results given by the best-fit GMM with those computed directly from the data.
First we consider the cosmic densities of galaxy properties defined in Eq. 5. In Fig. 5, we show the results for GMM with one or two components for the one hundred uniformly spaced redshift bins in . Also shown are the TNG data at five available redshifts (see Section 3.2) and the corresponding best-fit GMMs with just one component.




For both SAM and TNG simulated data, with just one or two Gaussian components, the percent errors of the cosmic densities evaluated with the GMM compared to the results computed directly from the simulation are mostly within , which is acceptable given the uncertainties in state-of-the-art observational constraints. For these integrated quantities, the improvement obtained when using one more component which can capture the tails in distributions (see 1-D histograms in Fig. 2) is not very significant.
We also consider the galaxy one-property distribution functions defined in Eq. 6.





In Fig. 6, for both the SAM and TNG, we show these functions at redshift as an example, including the function computed directly from the simulation and the best-fit GMMs with different numbers of components. GMM with just one component can already capture most of the information, especially the main slope of the function. Compared with the cosmic densities above, using more than one component could help capturing more information. In general, using no more than 3 components can already reproduce the galaxy one-property functions to very high accuracy.
We now consider the galaxy two-property distribution functions (scaling relations) defined in Eq. 7.








For example, we plot , , , and at redshift in Fig. 7 for SAM and Fig. 8 for TNG. For clarity, here we only show a GMM with three components, which can already reproduce the data distribution very well. Using fewer (i.e. one or two) components could also be acceptable depending on how many details we would like to recover. Scaling relations can also be observed in some of the distributions.
4.4 Dimensionality Reduction: Principal Component Analysis
As expected and also shown in Fig. 2, there are strong correlations between the galaxy physical properties we consider. This makes it possible to reduce the dimension of the data without losing much information using linear transformations like Principal Component Analysis. With the data standardized (i.e. shifted to zero mean and normalized to unit variance) on each dimension, starting from the principal component (PC) with the largest variance, PCA sequentially finds the next PC with the next largest variance while being linearly independent of the previous PCs. Lower data dimensionality could reduce the number of parameters needed in the model and make it more feasible for use in an inference pipeline. Besides, the linear independence (i.e. zero covariances) between PCs makes it possible to simplify certain models like a multivariate Gaussian distribution even further.
Here we perform PCA on central galaxies for each bin. The cumulative percentages of variance explained by PCs up to the third largest variance is shown in Fig. 9.

It turns out that using just 3 PCs can already recover more than 90 percent of the variance. The distribution in the space of first 3 PCs for the galaxies in Fig. 2, is shown in Fig. 10.

Since PCA attempts to minimize parameter covariance, we do not see very strong correlations as we have seen in Fig. 2 for the physical properties. However, it is still possible for PCs to have nonlinear correlations.
Therefore with acceptable information loss, instead of in the 5-D physical property space, we can describe the distribution in the 3-D PC space and convert it back to the physical property space for usage. An additional advantage of using GMM is that a Gaussian distribution can also be analytically transformed between two spaces. Each Gaussian component fitted in the PC space also corresponds to a Gaussian component in the physical property space. With the mapping between the PC vector and physical property vector given by the linear transformation with being a 2-D matrix of dimension , the covariance matrices are related by
(20) |
In Fig. 10, we show the best-fit GMM with one and two components in the PC space. For just one component, the covariance matrix is diagonal since there is no covariance. For two components, since each of them only cover part of the data, the full covariance matrices need to be considered. If we convert the best-fit GMM back to the physical property space, the models are show a similar level of agreement as Fig. 2 where we fit the physical properties directly.
Fitting the model in PC space does provide the potential to simplify the model. However, the transformation matrix between PC and physical property spaces also introduces additional parameters. We leave further exploration for a future work.
5 Discussion
In this section, we discuss the significance of the work presented here, as well as some of the caveats and limitations of this analysis.
5.1 Significance of the CGPD for understanding galaxy formation
The CGPD is a natural extension of traditional halo models, in that it links halo mass and redshift to galaxy properties. The new feature is that, while traditional halo models attempt to quantify this mapping with a single galaxy property at a time, and typically do not model the covariance between galaxy properties, the CGPD describes the mapping to an n-dimensional vector of galaxy properties and explicitly includes the covariances. We note that what appears as scatter or dispersion in a traditional style halo mass vs. single galaxy property plot is in some cases actually a result of projecting a higher dimensional manifold into a plane. Thus we expect the CGPD approach to partially, though perhaps not fully, also provide a natural description of dispersion along with the correct residual correlations. The CGPD as extracted from physics-based simulations can be an interesting vehicle for comparing different simulations or studying the impact of varying physical processes. This has not been a focus of this work, but it is notable that while the CGPD for the Santa Cruz SAM and TNG appear qualitatively quite similar, some of the quantities differ in the location of the mean and also in the width of the distributions. It would be interesting to systematically study the impact of varying physical processes and/or cosmology on the CGPD. Better understanding why there is a relatively “thin” manifold in the n-dimensional property space and how physical processes shape the location and thickness of this manifold is also an interesting open question.
5.2 Limitations of current modeling approach
5.2.1 Star forming and quenched galaxies
For practical reasons, we have only included central, star-forming galaxies in this analysis. This choice should not have a large impact on the accuracy of our method for our main target application, which is line intensity mapping. The lines of interest are produced by star-forming galaxies. For other applications, including the full galaxy population (not only star-forming galaxies) would certainly lead to a more multi-modal CGPD, but we are confident that this could still be modeled with a GMM approach using more components (see e.g. Hahn et al., 2019).
5.2.2 CGPD for satellite galaxies
According to the SAM predictions, the overall cosmic volume densities and other integrated quantities of the galaxy properties are dominated by central galaxies, with the contributions from satellites mostly less than . However, it would also be straightforward to include a separate CGPD for satellite galaxies. For satellite galaxies, instead of the host mass , it is customary to use a quantity like , which is defined as the maximum mass that the host halo of the satellite galaxy had before it became a satellite (e.g. Behroozi et al., 2019). Under the (common) ansatz that the properties of satellite galaxies depend only on the sub-halo mass and not on the host mass in which the satellite (sub-halo) is orbiting, we could then carry out an identical analysis of the CGPD for satellites where the host halo mass is replaced by the sub-halo mass . We would need to add the sub-halo mass function to the distinct halo mass function introduced in Eqn. 3. Then the galaxy property distributions can simply be added,
(21) |
This would add some complexity to the model but would be a straightforward extension.
5.2.3 Dependence on higher order halo properties
Another simplification is that we assume that galaxy properties depend only on halo mass and redshift, which is known to be in contrast with the results of both hydrodynamic simulations and SAMs. Although this ansatz is fundamental to the ubiquitous halo model picture, it has been shown that properties such as galaxy clustering in hydrodynamic simulations depend on halo properties other than mass, an effect commonly referred to as assembly bias (Hadzhiyska et al., 2020, 2021). Similarly, it has been shown that some of the dispersion in the stellar mass vs. halo mass relation in simulations is actually due to these higher order halo properties, such as formation history (Mitchell & Schaye, 2022; Gabrielpillai et al., 2022). We are interested in extending the CGPD framework to include additional halo properties, guided by physics-based simulations.
5.2.4 Evolution across halo masses and redshifts
In the main results section, we fit GMM on galaxies for each redshift and halo mass bin independently. However, a continuous evolution of the galaxy distribution across those discrete bins should be expected. If the evolution of GMM parameters could be characterized with another parametric model, then the model could be further simplified. This would be important and possibly necessary for using the approach for statistical inference in the future.
Here we briefly discuss the evolution of galaxy properties across halo masses and redshifts. To avoid the side effects from possible local maxima of the likelihood function, instead of purely random initialization, we use the best-fit parameter values from the previous bin to initialize the parameters for the next bin. For GMM with just one component, the best-fit mean vector and covariance matrix are simply the sample mean and covariance of the data. We notice that the mean of the galaxy properties evolves smoothly across both halo masses and redshifts, while the variances and covariances can change dramatically. For multiple Gaussian components, an additional challenge is that we need to track each component individually, and there are different approaches for this. For example, we could sort the components based on their weights, and assume that the component with the highest weight in the current bin always corresponds to the highest-weight one in the next bin, and so on for other components with lower weights. There are also different definitions of the distance between two Gaussian distributions, based on which we could map one component in the current bin to the closest one in the next bin. However, this approach may not be very robust since the components could cross each others trajectories. In our case, we use the best-fit parameters from the current bin as the initial values for the next bin; then the new data can be considered as introducing an update to the current parameters. From this point of view, the order of parameters remains unchanged across bins. However, having tried all these approaches, we still found discontinuous behavior of the Gaussian components for not only the covariances but also the mean vectors. We leave further investigation for future work.
6 Conclusions
Traditional galaxy surveys and the relatively recent LIM techniques are both powerful methods to study the formation and evolution of galaxies, one of the most fundamental and important problems in astrophysics and cosmology. Motivated by the necessity of developing a physically motivated, computationally efficient framework that can describe both galaxy and LIM observations, we propose and characterize a new approach based on the CGPD function, which describes the probability of a halo with mass at redshift to host a galaxy with properties . When combined with the Line Luminosity Relation (LLR) function developed in a companion work (Breysse et al., 2023), a highly efficient and flexible pipeline that connects halos with an n-dimensional distribution of galaxy physical properties, and with LIM observables, can be constructed.
As a distribution function in the 5-D galaxy property space that we have chosen, the CGPD encapsulates many important galaxy distribution functions and scaling relations etc. Motivated by the form of the CGPD seen in two separate physics-based cosmological simulations (the Santa Cruz SAM and IllustrisTNG), we discuss a parameterized characterization of CGPD using GMM. We find the following main results:
-
•
With just a few () components, GMM is capable of reproducing the CGPD distribution from both the SAM and TNG to high accuracy.
-
•
The GMM based CGPD yields estimates of common summary statistics of interest including 2-D property relations (scaling relations), property distribution functions, and integrated cosmic densities, with a typical accuracy of a few percent.
-
•
For each bin in halo mass and redshift, we find that the 5-D property distribution can be reduced to only 3 dimensions using PCA, with a 10% loss of information.
The galaxy properties in this work were selected based on their importance for predicting the luminosity of certain emission lines, but a similar approach could be used for other galaxy properties, including galaxy photometry as measured in galaxy surveys.
The next step is to construct a statistical inference pipeline for LIM observations. For a 5-D Gaussian distribution with a full covariance matrix, we have 20 free parameters including 5 parameters in the mean vector and 15 independent parameters in the symmetric covariance matrix. Then if we use 3 Gaussian components for each halo mass bin and 15 mass bins for each redshift as in our main analysis, we would have 900 parameters for the GMM at each redshift. However, these parameters are expected to be highly degenerate. We have shown that the data dimensionality can be reduced significantly using PCA. Another possible approach for doing implicit likelihood inference in high dimensional parameter space is the marginal posterior density with moment networks as proposed in Jeffrey & Wandelt (2020).
7 Acknowledgements
We thank the IllustrisTNG team for granting the JupyterLab access to the simulations, and Dylan Nelson for help accessing the TNG data. ARP was supported by NASA under award numbers 80NSSC18K1014, NNH17ZDA001N, and 80NSSC22K0666, and by the NSF under award number 2108411. ARP and rss are supported by the Simons Foundation. The numerical computations in this work were performed in part using the NYU Greene High Performance Computing cluster.
References
- Alam et al. (2017) Alam, S., Ata, M., Bailey, S., et al. 2017, MNRAS, 470, 2617, doi: 10.1093/mnras/stx721
- Alsing et al. (2018) Alsing, J., Wandelt, B., & Feeney, S. 2018, MNRAS, 477, 2874, doi: 10.1093/mnras/sty819
- Baldry et al. (2008) Baldry, I. K., Glazebrook, K., & Driver, S. P. 2008, MNRAS, 388, 945, doi: 10.1111/j.1365-2966.2008.13348.x
- Behroozi et al. (2019) Behroozi, P., Wechsler, R. H., Hearin, A. P., & Conroy, C. 2019, MNRAS, 488, 3143, doi: 10.1093/mnras/stz1182
- Bernal & Kovetz (2022) Bernal, J. L., & Kovetz, E. D. 2022, A&A Rev., 30, 5, doi: 10.1007/s00159-022-00143-0
- Breysse et al. (2023) Breysse, P., et al. 2023, In preparation
- Calette et al. (2018) Calette, A. R., Avila-Reese, V., Rodríguez-Puebla, A., Hernández-Toledo, H., & Papastergis, E. 2018, Rev. Mexicana Astron. Astrofis., 54, 443, doi: 10.48550/arXiv.1803.07692
- Cranmer et al. (2020) Cranmer, K., Brehmer, J., & Louppe, G. 2020, Proceedings of the National Academy of Science, 117, 30055, doi: 10.1073/pnas.1912789117
- Dempster et al. (1977) Dempster, A. P., Laird, N. M., & Rubin, D. B. 1977, Journal of the Royal Statistical Society. Series B (Methodological), 39, 1. http://www.jstor.org/stable/2984875
- Diemer et al. (2018) Diemer, B., Stevens, A. R. H., Forbes, J. C., et al. 2018, ApJS, 238, 33, doi: 10.3847/1538-4365/aae387
- Diemer et al. (2019) Diemer, B., Stevens, A. R. H., Lagos, C. d. P., et al. 2019, MNRAS, 487, 1529, doi: 10.1093/mnras/stz1323
- Forbes et al. (2014) Forbes, J. C., Krumholz, M. R., Burkert, A., & Dekel, A. 2014, MNRAS, 443, 168, doi: 10.1093/mnras/stu1142
- Forbes et al. (2019) Forbes, J. C., Krumholz, M. R., & Speagle, J. S. 2019, MNRAS, 487, 3581, doi: 10.1093/mnras/stz1473
- Foreman-Mackey (2016) Foreman-Mackey, D. 2016, The Journal of Open Source Software, 1, 24, doi: 10.21105/joss.00024
- Gabrielpillai et al. (2022) Gabrielpillai, A., Somerville, R. S., Genel, S., et al. 2022, MNRAS, 517, 6091, doi: 10.1093/mnras/stac2297
- Hadzhiyska et al. (2020) Hadzhiyska, B., Bose, S., Eisenstein, D., Hernquist, L., & Spergel, D. N. 2020, MNRAS, 493, 5506, doi: 10.1093/mnras/staa623
- Hadzhiyska et al. (2021) Hadzhiyska, B., Liu, S., Somerville, R. S., et al. 2021, MNRAS, 508, 698, doi: 10.1093/mnras/stab2564
- Hahn et al. (2019) Hahn, C., Starkenburg, T. K., Choi, E., et al. 2019, ApJ, 872, 160, doi: 10.3847/1538-4357/aafedd
- Henriques et al. (2009) Henriques, B. M. B., Thomas, P. A., Oliver, S., & Roseboom, I. 2009, MNRAS, 396, 535, doi: 10.1111/j.1365-2966.2009.14730.x
- Henriques et al. (2013) Henriques, B. M. B., White, S. D. M., Thomas, P. A., et al. 2013, MNRAS, 431, 3373, doi: 10.1093/mnras/stt415
- Ivezić et al. (2019) Ivezić, Ž., Connolly, A., Vanderplas, J., & Gray, A. 2019, Statistics, Data Mining and Machine Learning in Astronomy (Princeton University Press)
- Jeffrey & Wandelt (2020) Jeffrey, N., & Wandelt, B. D. 2020, arXiv e-prints, arXiv:2011.05991. https://arxiv.org/abs/2011.05991
- Koekemoer et al. (2011) Koekemoer, A. M., Faber, S. M., Ferguson, H. C., et al. 2011, ApJS, 197, 36, doi: 10.1088/0067-0049/197/2/36
- Kovetz et al. (2017) Kovetz, E. D., Viero, M. P., Lidz, A., et al. 2017, arXiv e-prints, arXiv:1709.09066. https://arxiv.org/abs/1709.09066
- Krumholz (2014) Krumholz, M. R. 2014, MNRAS, 437, 1662, doi: 10.1093/mnras/stt2000
- Lange et al. (2015) Lange, R., Driver, S. P., Robotham, A. S. G., et al. 2015, MNRAS, 447, 2603, doi: 10.1093/mnras/stu2467
- Lara-López et al. (2010) Lara-López, M. A., Cepa, J., Bongiovanni, A., et al. 2010, A&A, 521, L53, doi: 10.1051/0004-6361/201014803
- Leja et al. (2022) Leja, J., Speagle, J. S., Ting, Y.-S., et al. 2022, ApJ, 936, 165, doi: 10.3847/1538-4357/ac887d
- Leung et al. (2020) Leung, T. K. D., Olsen, K. P., Somerville, R. S., et al. 2020, ApJ, 905, 102, doi: 10.3847/1538-4357/abc25e
- Li et al. (2016) Li, T. Y., Wechsler, R. H., Devaraj, K., & Church, S. E. 2016, ApJ, 817, 169, doi: 10.3847/0004-637X/817/2/169
- Lu et al. (2012) Lu, Y., Mo, H. J., Katz, N., & Weinberg, M. D. 2012, MNRAS, 421, 1779, doi: 10.1111/j.1365-2966.2012.20435.x
- Lu et al. (2014) Lu, Y., Mo, H. J., Lu, Z., Katz, N., & Weinberg, M. D. 2014, MNRAS, 443, 1252, doi: 10.1093/mnras/stu1200
- Lu et al. (2011) Lu, Y., Mo, H. J., Weinberg, M. D., & Katz, N. 2011, MNRAS, 416, 1949, doi: 10.1111/j.1365-2966.2011.19170.x
- Madau & Dickinson (2014) Madau, P., & Dickinson, M. 2014, ARA&A, 52, 415, doi: 10.1146/annurev-astro-081811-125615
- Marinacci et al. (2018) Marinacci, F., Vogelsberger, M., Pakmor, R., et al. 2018, MNRAS, 480, 5113, doi: 10.1093/mnras/sty2206
- Mitchell & Schaye (2022) Mitchell, P. D., & Schaye, J. 2022, MNRAS, 511, 2948, doi: 10.1093/mnras/stab3339
- Mowla et al. (2019) Mowla, L., van der Wel, A., van Dokkum, P., & Miller, T. B. 2019, ApJ, 872, L13, doi: 10.3847/2041-8213/ab0379
- Murray et al. (2013) Murray, S. G., Power, C., & Robotham, A. S. G. 2013, Astronomy and Computing, 3, 23, doi: 10.1016/j.ascom.2013.11.001
- Naab & Ostriker (2017) Naab, T., & Ostriker, J. P. 2017, ARA&A, 55, 59, doi: 10.1146/annurev-astro-081913-040019
- Naiman et al. (2018) Naiman, J. P., Pillepich, A., Springel, V., et al. 2018, MNRAS, 477, 1206, doi: 10.1093/mnras/sty618
- Nelson et al. (2018) Nelson, D., Pillepich, A., Springel, V., et al. 2018, MNRAS, 475, 624, doi: 10.1093/mnras/stx3040
- Noeske et al. (2007) Noeske, K. G., Weiner, B. J., Faber, S. M., et al. 2007, ApJ, 660, L43, doi: 10.1086/517926
- Olsen et al. (2021) Olsen, K. P., Burkhart, B., Mac Low, M.-M., et al. 2021, ApJ, 922, 88, doi: 10.3847/1538-4357/ac20d4
- Padmanabhan (2018) Padmanabhan, H. 2018, MNRAS, 475, 1477, doi: 10.1093/mnras/stx3250
- Padmanabhan (2019) —. 2019, MNRAS, 488, 3014, doi: 10.1093/mnras/stz1878
- Pallottini et al. (2022) Pallottini, A., Ferrara, A., Gallerani, S., et al. 2022, MNRAS, 513, 5621, doi: 10.1093/mnras/stac1281
- Pillepich et al. (2018) Pillepich, A., Nelson, D., Hernquist, L., et al. 2018, MNRAS, 475, 648, doi: 10.1093/mnras/stx3112
- Planck Collaboration et al. (2016) Planck Collaboration, Ade, P. A. R., Aghanim, N., et al. 2016, A&A, 594, A13, doi: 10.1051/0004-6361/201525830
- Popping et al. (2019) Popping, G., Narayanan, D., Somerville, R. S., Faisst, A. L., & Krumholz, M. R. 2019, MNRAS, 482, 4906, doi: 10.1093/mnras/sty2969
- Popping et al. (2014) Popping, G., Somerville, R. S., & Trager, S. C. 2014, MNRAS, 442, 2398, doi: 10.1093/mnras/stu991
- Porter et al. (2014) Porter, L. A., Somerville, R. S., Primack, J. R., & Johansson, P. H. 2014, MNRAS, 444, 942, doi: 10.1093/mnras/stu1434
- Schaan & White (2021a) Schaan, E., & White, M. 2021a, J. Cosmology Astropart. Phys, 2021, 067, doi: 10.1088/1475-7516/2021/05/067
- Schaan & White (2021b) —. 2021b, J. Cosmology Astropart. Phys, 2021, 068, doi: 10.1088/1475-7516/2021/05/068
- Silva et al. (2015) Silva, M., Santos, M. G., Cooray, A., & Gong, Y. 2015, ApJ, 806, 209, doi: 10.1088/0004-637X/806/2/209
- Smit et al. (2012) Smit, R., Bouwens, R. J., Franx, M., et al. 2012, ApJ, 756, 14, doi: 10.1088/0004-637X/756/1/14
- Somerville & Davé (2015a) Somerville, R. S., & Davé, R. 2015a, ARA&A, 53, 51, doi: 10.1146/annurev-astro-082812-140951
- Somerville & Davé (2015b) —. 2015b, ARA&A, 53, 51, doi: 10.1146/annurev-astro-082812-140951
- Somerville et al. (2012) Somerville, R. S., Gilmore, R. C., Primack, J. R., & Domínguez, A. 2012, MNRAS, 423, 1992, doi: 10.1111/j.1365-2966.2012.20490.x
- Somerville et al. (2008) Somerville, R. S., Hopkins, P. F., Cox, T. J., Robertson, B. E., & Hernquist, L. 2008, MNRAS, 391, 481, doi: 10.1111/j.1365-2966.2008.13805.x
- Somerville & Kolatt (1999) Somerville, R. S., & Kolatt, T. S. 1999, MNRAS, 305, 1, doi: 10.1046/j.1365-8711.1999.02154.x
- Somerville et al. (2015) Somerville, R. S., Popping, G., & Trager, S. C. 2015, MNRAS, 453, 4337, doi: 10.1093/mnras/stv1877
- Somerville & Primack (1999) Somerville, R. S., & Primack, J. R. 1999, MNRAS, 310, 1087, doi: 10.1046/j.1365-8711.1999.03032.x
- Somerville et al. (2021) Somerville, R. S., Olsen, C., Yung, L. Y. A., et al. 2021, MNRAS, 502, 4858, doi: 10.1093/mnras/stab231
- Springel et al. (2018) Springel, V., Pakmor, R., Pillepich, A., et al. 2018, MNRAS, 475, 676, doi: 10.1093/mnras/stx3304
- Stevens et al. (2021) Stevens, A. R. H., Lagos, C. d. P., Cortese, L., et al. 2021, MNRAS, 502, 3158, doi: 10.1093/mnras/staa3662
- Sun et al. (2019) Sun, G., Hensley, B. S., Chang, T.-C., Doré, O., & Serra, P. 2019, ApJ, 887, 142, doi: 10.3847/1538-4357/ab55df
- Tinker et al. (2008) Tinker, J., Kravtsov, A. V., Klypin, A., et al. 2008, ApJ, 688, 709, doi: 10.1086/591439
- Tremonti et al. (2004) Tremonti, C. A., Heckman, T. M., Kauffmann, G., et al. 2004, ApJ, 613, 898, doi: 10.1086/423264
- Wechsler & Tinker (2018) Wechsler, R. H., & Tinker, J. L. 2018, ARA&A, 56, 435, doi: 10.1146/annurev-astro-081817-051756
- Yang et al. (2022) Yang, S., Popping, G., Somerville, R. S., et al. 2022, ApJ, 929, 140, doi: 10.3847/1538-4357/ac5d57
- Yang et al. (2021) Yang, S., Somerville, R. S., Pullen, A. R., et al. 2021, ApJ, 911, 132, doi: 10.3847/1538-4357/abec75
- York et al. (2000) York, D. G., Adelman, J., Anderson, John E., J., et al. 2000, AJ, 120, 1579, doi: 10.1086/301513
- Yung et al. (2019a) Yung, L. Y. A., Somerville, R. S., Finkelstein, S. L., Popping, G., & Davé, R. 2019a, MNRAS, 483, 2983, doi: 10.1093/mnras/sty3241
- Yung et al. (2019b) Yung, L. Y. A., Somerville, R. S., Popping, G., et al. 2019b, MNRAS, 490, 2855, doi: 10.1093/mnras/stz2755