This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Characterizing the Conditional Galaxy Property Distribution using Gaussian Mixture Models

Yucheng Zhang (张宇澄) Department of Mathematics and Theory, Peng Cheng Laboratory, Shenzhen, Guangdong 518066, China Center for Cosmology and Particle Physics, Department of Physics, New York University, 726 Broadway, New York, NY 10003, USA Anthony R. Pullen Center for Cosmology and Particle Physics, Department of Physics, New York University, 726 Broadway, New York, NY 10003, USA Center for Computational Astrophysics, Flatiron Institute, New York, NY 10010, USA Rachel S. Somerville Center for Computational Astrophysics, Flatiron Institute, New York, NY 10010, USA Patrick C. Breysse Center for Cosmology and Particle Physics, Department of Physics, New York University, 726 Broadway, New York, NY 10003, USA John C. Forbes Center for Computational Astrophysics, Flatiron Institute, New York, NY 10010, USA Shengqi Yang (杨晟祺) Carnegie Observatories, 813 Santa Barbara Street, Pasadena, CA 91101, USA Yin Li (李寅) Department of Mathematics and Theory, Peng Cheng Laboratory, Shenzhen, Guangdong 518066, China Center for Computational Astrophysics, Flatiron Institute, New York, NY 10010, USA Center for Computational Mathematics, Flatiron Institute, New York, NY 10010, USA Abhishek S. Maniyar Center for Cosmology and Particle Physics, Department of Physics, New York University, 726 Broadway, New York, NY 10003, USA
Abstract

Line-intensity mapping (LIM) is a promising technique to constrain the global distribution of galaxy properties. To combine LIM experiments probing different tracers with traditional galaxy surveys and fully exploit the scientific potential of these observations, it is necessary to have a physically motivated modeling framework. As part of developing such a framework, in this work we introduce and model the conditional galaxy property distribution (CGPD), i.e. the distribution of galaxy properties conditioned on the host halo mass and redshift. We consider five galaxy properties, including the galaxy stellar mass, molecular gas mass, galaxy radius, gas phase metallicity and star formation rate (SFR), which are important for predicting the emission lines of interest. The CGPD represents the full distribution of galaxies in the five dimensional property space; many important galaxy distribution functions and scaling relations, such as the stellar mass function and SFR main sequence, can be derived from integrating and projecting it. We utilize two different kinds of cosmological galaxy simulations, a semi-analytic model and the IllustrisTNG hydrodynamic simulation, to characterize the CGPD and explore how well it can be represented using a Gaussian mixture model (GMM). We find that with just a few (3\sim 3) Gaussian components, a GMM can describe the CGPD of the simulated galaxies to high accuracy for both simulations. The CGPD can be mapped to LIM or other observables by constructing the appropriate relationship between galaxy properties and the relevant observable tracers.

1 Introduction

Understanding the physical processes that govern the formation and evolution of galaxies over cosmic time is one of the most fundamental and important problems in astrophysics and cosmology. Numerous astronomical surveys across a wide range of wavelengths have been carried out to observe galaxies, from nearby targets to those probing the earliest epochs (e.g. York et al., 2000; Koekemoer et al., 2011). However, galaxy surveys are limited by the trade-off between area and depth since deeper and fainter targets require more time to achieve a detectable signal-to-noise ratio. Instead of resolving and observing individual targets, a relatively recent technique called line-intensity mapping (LIM) (e.g. Kovetz et al., 2017) has been proposed to measure the integrated emission lines from all sources, bright and faint. Similar to traditional galaxy surveys, LIM not only provides a promising tracer of the large-scale structure (LSS) of the total matter distribution, but also contains a wealth of information about astrophysical properties of galaxies (Schaan & White, 2021a).

One of the potentially powerful aspects of LIM is that there are now numerous experiments planned or online that probe different lines, which trace different gas phases, at different redshifts (see Bernal & Kovetz, 2022, for a recent summary). For example, some of the most commonly considered lines include CO, which is a tracer of dense molecular gas in the interstellar medium (ISM) of galaxies, atomic and ionic fine structure lines such as [CII], which is primarily emitted by photo-dissociation regions in the outer layers of molecular clouds, 21-cm emission from atomic gas in the ISM and intergalactic medium (IGM), and lines arising from ionized gas such as Lyman-α\alpha and H-α\alpha. These are the brightest and most easily detected lines, but there are many others that can provide interesting information. It is clear that combining multi-tracer LIM surveys with traditional galaxy surveys will be the key to extracting the maximum amount of science from these experiments.

In order to do this, it is essential to have a comprehensive modeling framework. The most commonly used framework is halo-based phenomenological models (see Bernal & Kovetz, 2022, for a recent review). In this approach, scaling relations between halo mass and redshift and “intrinsic” physical properties such as stellar mass and star formation rate are derived using abundance matching. In abundance matching, observational estimates of stellar mass functions and SFR are used along with predictions for the number density of dark matter halos as a function of mass to derive constraints on the mapping between these quantities (see Wechsler & Tinker, 2018, for a review). Then, scaling relations between e.g. SFR and a line luminosity of interest are derived using observations of individually detected sources (e.g. Silva et al., 2015; Li et al., 2016; Padmanabhan, 2018, 2019; Sun et al., 2019). This approach has the advantage of being highly computationally efficient, and of being tied to observations.

However, there are a number of drawbacks and limitations to this approach. First, these empirical fitting functions usually lack physical motivation, so they are able to provide only limited insights about the physical processes underlying galaxy formation. Second, LIM detections are sensitive to galaxies that are too faint to detect individually, so the scaling relations have to be extrapolated beyond the observational range. Third, these empirical approaches are not able to self-consistently capture the covariances between different tracers, or between the intrinsic properties (stellar mass, SFR) and various emission lines. It is important to keep in mind that the estimates of stellar mass and SFR that are used in abundance matching models are not directly observable quantities, but are derived from UV-optical-NIR observations and carry significant uncertainties. The prior used to obtain these estimates can have a strong impact on the results (Leja et al., 2022). It is critical to be able to properly fold in the prior, covariances, and uncertainties on these quantities, which lie at the foundation of abundance matching approaches.

Numerical cosmological simulations that incorporate baryonic astrophysical processes are a potentially powerful tool for exploring the physics of galaxy formation and evolution, as well as the connection with observable line emissions. The most fundamental approach is based on explicitly simulating the hydrodynamics and thermodynamics of baryonic material evolving within a Λ\LambdaCDM cosmological framework for structure formation (Somerville & Davé, 2015a; Naab & Ostriker, 2017). These numerical hydrodynamic simulations provide direct predictions for the distribution and state of nearly all of the relevant components of galaxies and the IGM (stars, gas at different densities and temperatures, various chemical elements). They can be coupled with modeling tools to provide predictions of many of the emission lines of interest for LIM experiments (Leung et al., 2020; Olsen et al., 2021). However, due to its computational expense, this technique suffers from a fairly severe trade-off between volume and resolution. Simulations that resolve the multi-phase ISM can typically be run for only a few galaxies (e.g. Pallottini et al., 2022). Simulations of cosmological volumes must adopt “sub-grid” recipes to treat most of the critical processes of galaxy formation, including star formation, stellar feedback, and black hole formation and feedback. And even the largest computationally feasible cosmological hydrodynamic simulations fall short of the volumes that are already routinely probed by galaxy and LIM surveys. For example, MillenniumTNG has a volume of (500 Mpc/hh)3, compared to (1760 Mpc/hh)3 for the Baryon Oscillations Spectroscopic Survey of z<0.7z<0.7 luminous red galaxies alone (Alam et al., 2017). The currently adopted sub-grid models are highly uncertain, and approaches adopted by different groups can lead to dramatically different predictions for gas properties. It is infeasible to thoroughly explore the parameter space of the sub-grid models for significant volumes using a direct numerical approach.

An alternative approach to physics-based models is semi-analytic models (SAM), which combine simplified treatments of physical processes with cosmological halo merger histories. These models produce similar predictions to those of numerical hydrodynamic simulations for global galaxy properties, and include a similar suite of physical processes, but at a fraction of the computational cost. However, unlike numerical hydrodynamics simulations they do not provide detailed information about the internal structure of galaxies or the spatial distribution of gas within or around them. Popping et al. (2014) and Popping et al. (2019) coupled semi-analytic models of galaxy formation with the DESPOTIC code (Krumholz, 2014), which computes line luminosities for optically thick clouds using a one-zone model, to make predictions for a broad suite of emission lines, including multiple transitions in the CO ladder, [CI] and [CII]. They validated these models by comparing the predictions with an extensive suite of observations of individually detected galaxies across a broad range of redshifts. Yang et al. (2021) used lightcones populated with this approach to create mock maps (including interlopers, cosmic infrared background, and Galactic foreground) for exemplar LIM experiments in CO and [CI]. This work showed that SAMs coupled with the appropriate line emission models in post-processing are a promising alternative tool for LIM forecasts and interpretation.

Ideally, we would like to be able to use Bayesian inference to constrain the posteriors of both cosmological and astrophysical parameters from a combined suite of multi-tracer LIM and galaxy survey observables. When using a forward-modeling or simulation-based inference approach on a large population of galaxies or a cosmological volume (Henriques et al., 2009, 2013; Lu et al., 2011, 2012, 2014; Forbes et al., 2019), depending on the sampling approach adopted, this requires thousands to tens of thousands of executions of the simulation. Most currently available SAM codes are not fast enough for this to be practical, particularly if large volumes need to be simulated.

An intermediate approach is to develop models that link halo properties to galaxy intrinsic and observable properties using physics-based simulations, such as hydrodynamic simulations and SAMs, instead of empirical methods such as abundance matching. For example, Yang et al. (2022) used the SAM-based predictions described above to derive and present fitting functions between halo mass and all major predicted lines over a broad range of redshifts, and showed that when these fitting functions were combined with dark matter halo properties as in the standard empirical halo model approach, agreement of 5-10% was achieved for common LIM summary statistics such as the power spectrum and intensity.

Refer to caption
Figure 1: Illustration of how a parametric model of the Conditional Galaxy Property Distribution can be used to obtain insights about the evolution of physical quantities of interest. First, a general class of models for representing the CGPD (here GMM) is motivated using physics-based models of galaxy formation. The models must be flexible but not overly complex. The CGPD is combined with a model that predicts the observed line luminosity as a function of the galaxy properties (the Line Luminosity Relation, LLR), to predict observable summary statistics, such as the power spectrum, one-point distribution, or integrated line intensity. Observational estimates of these quantities are then used to estimate the likelihood (if using a “traditional” explicit likelihood inference approach) or to directly estimate the posterior, if using implicit likelihood inference. Using a sampler, a new set of parameter values is guessed, and the model is run again until convergence is reached. Now the “best fit” CGPD can be projected and integrated as described in Section 2, yielding the posterior and confidence intervals on quantities of great interest for galaxy evolution, as summarized in the rightmost box. These estimates can be compared to traditional estimates from galaxy surveys, where physical parameters are estimated through SED fitting, and with the predictions of physics-based simulations.

In this work, we introduce a new approach that we believe can have very general applicability to forecasting and interpreting a wide variety of galaxy and LIM survey data. We introduce the conditional galaxy property distribution (CGPD) Φ(𝒑|M,z)\Phi\left(\bm{p}|M,z\right), which gives the number density of galaxies in the property space 𝒑\bm{p}, conditioned on the halo mass MM and redshift zz. Whereas traditional halo models are typically mappings from a scalar quantity (halo mass) to another scalar (e.g. stellar mass or SFR), we can think of this as an extension of the halo model to map to a higher dimensional suite of galaxy properties. In this work, 𝒑\bm{p} has dimension five and includes the galaxy stellar mass, molecular gas mass, galaxy radius, gas phase metallicity and star formation rate (SFR), which are important for predicting the line luminosities of interest (here, we have in mind mainly CO and [CII]). However, in principle 𝒑\bm{p} could contain any galaxy property of interest and have an arbitrary number of dimensions. With this set of galaxy properties, CGPD also encapsulates many popular and important galaxy distribution functions and scaling relations such as the stellar mass function and SFR main sequence.

In this paper, we explore the form and effective dimensionality of the CGPD for this chosen set of key galaxy properties, using two different physics based cosmological simulations of galaxy formation: the “Santa Cruz” semi-analytic model (Somerville & Primack, 1999; Somerville et al., 2008, 2012; Porter et al., 2014; Popping et al., 2014; Somerville et al., 2015), and the IllustrisTNG hydrodynamic simulation suite (Springel et al., 2018; Pillepich et al., 2018; Marinacci et al., 2018; Nelson et al., 2018; Naiman et al., 2018). A key point is that it is well known that galaxy properties are highly covariant, giving rise to the many familiar galaxy scaling relations such as the stellar mass vs. SFR relation (SF sequence), mass metallicity relation, etc. Exploring higher dimensional manifolds within galaxy property space (as hinted at by, for example, “fundamental plane” type relations such as the fundamental mass metallicity relation) will in itself provide interesting insights into galaxy formation, as the physics that shapes galaxy properties is encoded in these relationships. Moreover, it will help us to develop tools to characterize and model these high-dimensional relationships in order to fully capture the covariances between galaxy properties. In this work, we adopt a Gaussian mixture model (GMM) based approach as a generative density estimation model to describe the CGPD.

Combined with a model that can predict emission line luminosities from these galaxy properties, this parametric model could potentially be used in a statistical inference pipeline, either using “traditional” methods such as Markov Chain Monte Carlo (MCMC) sampling combined with an assumed form for the likelihood, or more modern implicit likelihood inference methods (Alsing et al., 2018; Cranmer et al., 2020). With the formulation presented here, the posterior of the parameters of the GMM (and hence the CGPD) can be constrained based on a chosen set of LIM observables. The resulting CGPD can then be projected and integrated to obtain estimates for galaxy population summary statistics of common interest, such as the SFR or cold gas distribution functions or cosmic SFR or cold gas density, at a series of redshifts. If desired, additional observational constraints from galaxy surveys, such as the stellar mass function, can simultaneously be used as constraints in the inference procedure. In Fig. 1, we provide a graphical representation of how the CGPD model that we develop in this work fits into this broader framework, which we plan to build up in future works.

The paper is organized as follows. In Section 2, we introduce the CGPD and how it is related to more familiar summary statistics that are commonly used to characterize galaxy populations. We introduce the simulated galaxy catalogs with physical properties of interest in Section 3. Based on these simulations, we then present the application of the GMM in characterizing the CGPD in Section 4. We discuss our results in Section 5, and conclude in Section 6. Throughout this work, we adopt a flat Λ\LambdaCDM fiducial cosmology with Planck 2015 parameters: Ωm=0.3089\Omega_{m}=0.3089, ΩΛ=0.6911\Omega_{\Lambda}=0.6911, Ωb=0.0486\Omega_{b}=0.0486, h=0.6774h=0.6774, σ8=0.8159\sigma_{8}=0.8159, ns=0.9667n_{s}=0.9667 (Planck Collaboration et al., 2016), which are consistent with the cosmology assumed in the simulations.

2 Conditional Galaxy Property Distribution

For a halo of mass MM at redshift zz, the CGPD Φ(𝒑|M,z)\Phi\left(\bm{p}|M,z\right) is defined as the number density distribution of galaxies in the property space 𝒑\bm{p}. Integrating Φ(𝒑|M,z)\Phi\left(\bm{p}|M,z\right) over 𝒑\bm{p} yields the expected number of galaxies for such a halo. For simplicity, in this work we write the CGPD as a function of 𝒑\bm{p}, although in practice, we perform the fitting of the CGPD in 𝒑~=log𝒑\tilde{\bm{p}}=\log\bm{p} space 111In this work, we denote a base 10 logarithm with log\log and natural logarithm with ln\ln., i.e. Φ~(𝒑~|M,z)\tilde{\Phi}(\tilde{\bm{p}}|M,z). The CGPD may be represented in either of the two spaces for convenience. The transformation of the CGPD between these two spaces can be straightforwardly given by

Φ(𝒑|M,z)=Φ~(𝒑~(𝒑)|M,z)|𝒑~(𝒑)𝒑|,\Phi\left(\bm{p}|M,z\right)=\tilde{\Phi}\left(\tilde{\bm{p}}(\bm{p})|M,z\right)\left|\frac{\partial\,\tilde{\bm{p}}(\bm{p})}{\partial\,\bm{p}}\right|\,, (1)

where |𝒑~(𝒑)/𝒑|\left|\partial\,\tilde{\bm{p}}(\bm{p})/\partial\,\bm{p}\right| is the determinant of the Jacobian matrix. Multiplying the differential volume in 𝒑\bm{p} space on both sides gives

Φ(𝒑|M,z)idpi=Φ~(𝒑~|M,z)idp~i,\Phi\left(\bm{p}|M,z\right)\prod_{i}\,dp_{i}=\tilde{\Phi}\left(\tilde{\bm{p}}|M,z\right)\prod_{i}\,d\tilde{p}_{i}\,, (2)

which is more convenient for the purpose of integration. The galaxy properties we consider are summarized in Table 1.

Table 1: The 5-D galaxy properties 𝒑\bm{p} for the CGPD Φ(𝒑|M,z)\Phi\left(\bm{p}|M,z\right) considered in this work, which are selected based on their relevance to the emission lines of interest.
Property Unit Description
mm_{*} [M][M_{\odot}] galaxy stellar mass
mH2m_{{\rm H}_{2}} [M][M_{\odot}] molecular gas mass
rgalr_{\rm gal} [kpc] galaxy radius
ZgZ_{g} [Z][Z_{\odot}] gas phase metallicity
SFR{\rm SFR} [M/yr][M_{\odot}/{\rm yr}] star formation rate

These are the properties that are the most important for the line luminosities of interest. Note that Φ(𝒑|M,z)\Phi\left(\bm{p}|M,z\right) is related to the conditional luminosity function (CLF) Φ(𝑳|M,z)\Phi(\bm{L}|M,z) introduced in Schaan & White (2021b) by the line luminosity relation (LLR) Li(𝒑,z)L_{i}(\bm{p},z), which gives the luminosity of line ii for a galaxy with properties 𝒑\bm{p} at redshift zz. The LLR is assumed to be independent of the host halo mass MM. The modeling of the LLR using SAM simulated galaxies coupled with spectral synthesis models is being developed in a companion work (Breysse et al., 2023).

The CGPD must be coupled with the dark matter halo model in order to connect the cosmological dark matter field with observable galaxies. For example, we obtain the global distribution function for each element of the property vector 𝒑\bm{p} by performing the integral:

ϕ(𝒑|z)=𝑑Mnhalo(M,z)Φ(𝒑|M,z),\phi(\bm{p}|z)=\int dM\,n_{\rm halo}(M,z)\Phi\left(\bm{p}|M,z\right)\,, (3)

where nhalo(M,z)n_{\rm halo}(M,z) is the halo mass function (HMF) at redshift zz. In this work, we use the HMF from Tinker et al. (2008) with the implementation in the hmf 222https://github.com/halomod/hmf code (Murray et al., 2013). This general galaxy distribution function ϕ(𝒑|z)\phi(\bm{p}|z) gives the comoving volume number density of galaxies in the property space 𝒑\bm{p} at redshift zz. Integrating ϕ(𝒑|z)\phi(\bm{p}|z) over 𝒑\bm{p} yields the total number of galaxies per unit comoving volume. As shown below, by integrating over certain dimensions, ϕ(𝒑|z)\phi(\bm{p}|z) encapsulates many observables of wide interest in galaxy and LIM surveys, such as the cosmic densities of galaxy properties, the galaxy abundance as a function of a certain property, and the scaling relations between two (or more) galaxy properties.

2.1 Cosmic densities of galaxy properties

The cosmic number density of galaxies per unit comoving volume is given by

ng(z)=𝑑𝒑ϕ(𝒑|z).n_{g}(z)=\int d\bm{p}\,\phi(\bm{p}|z)\,. (4)

Similarly, the cosmic density of a galaxy property pip_{i} in 𝒑\bm{p} is

ρpi(z)=𝑑𝒑ϕ(𝒑|z)pi.\rho_{p_{i}}(z)=\int d\bm{p}\,\phi(\bm{p}|z)p_{i}\,. (5)

These integrated quantities are important observable functions describing the overall evolution of galaxy properties over cosmic time (e.g. Madau & Dickinson, 2014, and references therein), including the cosmic stellar mass density ρm(z)\rho_{m_{*}}(z), cosmic molecular gas mass density ρmH2(z)\rho_{m_{{\rm H}_{2}}}(z), and cosmic SFR density ρSFR(z)\rho_{\rm SFR}(z).

2.2 Galaxy distribution functions and scaling relations

Besides the volume density of the galaxy properties in Eq. 5, the abundances of galaxies as functions of certain physical properties have also been widely studied in galaxy surveys.

As a function of a single property pip_{i}, the galaxy distribution ϕ(pi|z)\phi(p_{i}|z) is given by integrating ϕ(𝒑|z)\phi(\bm{p}|z) over the other dim(𝒑)1\dim(\bm{p})-1 dimensions,

ϕ(pi|z)=ϕ(𝒑|z)jidpj.\phi(p_{i}|z)=\int\phi(\bm{p}|z)\,\prod_{j\neq i}dp_{j}\,. (6)

These functions of wide interest include the galaxy stellar mass function ϕ(m)\phi(m_{*}) (e.g. Baldry et al., 2008), the molecular gas mass function ϕ(mH2)\phi(m_{{\rm H}_{2}}) (e.g. Diemer et al., 2018), and the star formation rate function ϕ(SFR)\phi({\rm SFR}) (e.g. Smit et al., 2012), etc.

Another important observable for galaxy surveys is the correlation between certain sets of galaxy properties, some of which can be even described with simple scaling relations. At a given redshift, integrating ϕ(𝒑|z)\phi(\bm{p}|z) over dim(𝒑)2\dim(\bm{p})-2 dimensions of 𝒑\bm{p} yields

ϕ(pi,pj|z)=ϕ(𝒑|z)ki,jdpk,\phi(p_{i},p_{j}|z)=\int\phi(\bm{p}|z)\,\prod_{k\neq i,j}dp_{k}\,, (7)

the distribution of galaxies in the space of the two unprojected-over properties. A traditional scaling relation corresponds to fitting this distribution with an empirical function pi(pj)p_{i}(p_{j}). These scaling relations of wide interest include the star formation main sequence SFR(m){\rm SFR}(m_{*}) (e.g. Noeske et al., 2007), the mass-metallicity relation Zg(m)Z_{g}(m_{*}) (e.g. Tremonti et al., 2004), the size-mass relation rgal(m)r_{\rm gal}(m_{*}) (e.g. Lange et al., 2015; Mowla et al., 2019), and the stellar mass vs. gas mass relation mmH2m_{*}-m_{{\rm H}_{2}} (e.g. Calette et al., 2018). Besides the main trend or correlation given by the scaling relations, there is also dispersion around the median values. These dispersions σpi(pj)\sigma_{p_{i}}(p_{j}) also contain a wealth of physical information (e.g. Forbes et al., 2014), and the corresponding conditional distributions like ϕ(pi|pj)\phi(p_{i}|p_{j}) have also been investigated (e.g. Leja et al., 2022). In our case, ϕ(pi,pj)\phi(p_{i},p_{j}) includes all the information, from which the scaling relations and the conditional distribution can also be derived. The conditional probability distribution f(pi|pj)f(p_{i}|p_{j}) is given by

f(pi|pj)=ϕ(pi,pj)𝑑piϕ(pi,pj),f(p_{i}|p_{j})=\frac{\phi(p_{i},p_{j})}{\int dp_{i}\,\phi(p_{i},p_{j})}\,, (8)

where f(pi|pj)f(p_{i}|p_{j}) differs from ϕ(pi|pj)\phi(p_{i}|p_{j}) by a factor of the galaxy number density. Then the function pi(pj)p_{i}(p_{j}) can be modelled as

pi(pj)=𝑑pipif(pi|pj).\left<p_{i}(p_{j})\right>=\int dp_{i}\,p_{i}f(p_{i}|p_{j})\,. (9)

Similarly, the dispersion follows,

σpi2(pj)=pi2(pj)pi(pj)2.\sigma^{2}_{p_{i}}(p_{j})=\left<p^{2}_{i}(p_{j})\right>-\left<p_{i}(p_{j})\right>^{2}\,. (10)

Integrating over dim(𝒑)3\dim(\bm{p})-3 dimensions yields ϕ(pi,pj,pk|z)\phi(p_{i},p_{j},p_{k}|z), which leads to “fundamental plane” relations pi(pj,pk)p_{i}(p_{j},p_{k}) such as m(Zg,mH2)m_{*}(Z_{g},m_{{\rm H}_{2}}) or m(Zg,SFR)m_{*}(Z_{g},{\rm SFR}) (e.g. Lara-López et al., 2010). The expressions are almost the same as Eqs. 89 and 10, with pjp_{j} simply replaced with pj,pkp_{j},\,p_{k}.

2.3 Line luminosities and intensities

Typically, line luminosity functions are 1-D distributions over one line luminosity LL. As with the relation between CGPD and CLF mentioned above, the line luminosity function ϕ(L|z)\phi(L|z) can also be connected to ϕ(𝒑|z)\phi(\bm{p}|z) by the LLR L0(𝒑,z)L_{0}(\bm{p},z),

ϕ(L|z)=𝑑𝒑δD(LL0(𝒑,z))ϕ(𝒑|z).\phi(L|z)=\int d\bm{p}\,\delta^{\rm D}(L-L_{0}(\bm{p},z))\phi(\bm{p}|z)\,. (11)

By combining ϕ(𝒑|z)\phi(\bm{p}|z) and LLR, the observed intensity can be derived as

Iν(z)=14πν0cH(z)𝑑𝒑ϕ(𝒑|z)L0(𝒑,z),I_{\nu}(z)=\frac{1}{4\pi\nu_{0}}\frac{c}{H(z)}\int d\bm{p}\,\phi(\bm{p}|z)L_{0}(\bm{p},z)\,, (12)

where the integral (without the factor) gives the line luminosity density.

2.4 Galaxy properties per halo

Besides the expressions based on Eq. 3 which integrate over all halos of different masses, we can similarly write down the properties per halo from the CGPD Φ(𝒑|M,z)\Phi\left(\bm{p}|M,z\right). For example, the average number of galaxies for a halo of a given mass can be written as

Ngh(M,z)=𝑑𝒑Φ(𝒑|M,z),N_{g}^{h}(M,z)=\int d\bm{p}\,\Phi\left(\bm{p}|M,z\right)\,, (13)

where the superscript hh indicates that it is a quantity per halo. Similarly, the expected galaxy property pip_{i} per halo is given as

Pih(M,z)=𝑑𝒑Φ(𝒑|M,z)pi.P_{i}^{h}(M,z)=\int d\bm{p}\,\Phi\left(\bm{p}|M,z\right)p_{i}\,. (14)

Note that Eq. 14 should be divided by Eq. 13 for rgalr_{\rm gal} if the average radius per galaxy is desired instead of the sum of all galaxy radii in the halo. With LLR Li(𝒑,z)L_{i}(\bm{p},z), the line luminosity for line ii per halo can be written as

Lih(M,z)=𝑑𝒑Φ(𝒑|M,z)Li(𝒑,z).L_{i}^{h}(M,z)=\int d\bm{p}\,\Phi\left(\bm{p}|M,z\right)L_{i}(\bm{p},z)\,. (15)

3 Simulations

In this section we briefly introduce the physics-based simulations, which will be used to study and motivate the functional form and modeling approach for characterizing the CGPD. These include the “Santa Cruz” semi-analytic models and hydrodynamic simulations from the IllustrisTNG project. Note that comparing and discussing the differences between the simulated results from the SAM and TNG is outside of the scope of this work, and a recent discussion can be found in Gabrielpillai et al. (2022). It is important to consider different models in order to test the robustness and flexibility of the GMM in describing the CGPD.

3.1 Semi-Analytic Model

State-of-the-art SAMs have been shown to produce consistent predictions for global galaxy properties and their evolution compared with much more expensive numerical hydrodynamic simulations (Somerville & Davé, 2015b). In this work, we use the SAM developed by the “Santa Cruz” group (Somerville & Primack, 1999; Somerville et al., 2008, 2012; Porter et al., 2014; Popping et al., 2014; Somerville et al., 2015). This well-established semi-analytic galaxy formation modeling code has been compared extensively with galaxy observations from z010z\sim 0-10 and is overall very successful at matching many observations (see e.g. Popping et al., 2014; Somerville et al., 2015; Yung et al., 2019a, b; Somerville et al., 2021).

The backbone of the SAM consists of dark matter halo merger trees, which can be extracted from N-body simulations or generated with Monte Carlo realizations based on the extended Press-Schechter (EPS) formalism (Somerville & Kolatt, 1999). In this work, in order to make sure that all halo masses in the wide range of interest are well populated for fitting, we use the Monte Carlo EPS method, which enables better sampling of the full range of relevant halo masses at low computational cost.

We generate simulations for one hundred redshifts uniformly sampled in 0z<100\leq z<10, with galaxies at each redshift being simulated independently. At each redshift, we consider halos with mass in the range 10log(M/M)1310\leq\log(M/M_{\odot})\leq 13, which has been shown to be the relevant halo mass range probed by LIM measurements (Yang et al., 2021). In this logarithmic halo mass range, a thousand halos are generated for each of the one thousand uniformly sampled halo masses, and hence in total this yields one million halos for each redshift. This large number of halos and well-sampled resolution in halo mass give us more flexibility in binning for constructing the CGPD. We only consider star-forming galaxies, which are selected with the criterion sSFR>1/(3tH(z)){\rm sSFR}>1/(3t_{H}(z)), where sSFRSFR/m{\rm sSFR}\equiv{\rm SFR}/m_{*} is the specific SFR and tH(z)t_{H}(z) is the Hubble time at the galaxy redshift zz. We ignore quenched galaxies in this work, as they have a negligible contribution to LIM. For simplicity and better demonstration, in our main analyses we only consider central galaxies. We checked that the contributions from satellites for the integrated quantities are less than 10%10\%, and we include some further details on including and fitting satellites in Section 5.2.2.

3.2 Hydrodynamic simulations

IllustrisTNG is a project that includes a series of state-of-the-art cosmological N-body and hydrodynamic simulations (Springel et al., 2018; Pillepich et al., 2018; Marinacci et al., 2018; Nelson et al., 2018; Naiman et al., 2018). We use the halo and subhalo (i.e. galaxy) catalogs from TNG100-1, which are simulated in a cubic box of size 75Mpc/h75\,{\rm Mpc/h} with a mass resolution of 106M\sim 10^{6}\,M_{\odot} for the dark matter and baryonic particles. The TNG fields that are most comparable to the galaxy properties simulated by the SAM used in this work are summarized in Table 2.

Table 2: Fields in TNG subhalo catalogs that are most comparable to the galaxy properties in Table 1. The units have also been transformed correspondingly.
Property TNG field
mm_{*} SubhaloMassInRadType (Type=4)
mH2m_{{\rm H}_{2}} m_h2_DK11_vol
rgalr_{\rm gal} SubhaloHalfmassRadType (Type=4)
ZgZ_{g} SubhaloGasMetallicity
SFR{\rm SFR} SubhaloSFR

TNG fields for mm_{*} and SFR{\rm SFR} are selected following the discussion in Gabrielpillai et al. (2022), where it is also noted that in some cases these quantities are not defined in exactly the same way as in SAM. Then the field for rgalr_{\rm gal} is chosen to be consistent with that of mm_{*}. The molecular gas mass mH2m_{{\rm H}_{2}} estimates are provided in post-processed supplementary catalogs (Diemer et al., 2018, 2019; Stevens et al., 2021) for five redshifts z=(0, 0.5, 1, 1.5, 2)z=(0,\ 0.5,\ 1,\ 1.5,\ 2), which are described in detail in the data specifications of the TNG project 333https://www.tng-project.org/data/docs/specifications/. For the host halo mass MM, the Group_M_TopHat200 field in TNG is used.

4 Results: Characterizing the CGPD

Based on the simulated data described above, we now propose and test analytic models that could potentially properly characterize the CGPD.

There are various parametric and nonparametric methods for describing the distribution of point data, also known as density estimation (Ivezić et al., 2019). Histograms might be the most straightforward nonparametric method, which approximate the function value as constants within discrete bins. However, the total number of bins increases exponentially with the dimensionality of the data. Treating each bin as an independent variable makes it infeasible for inference, and also this method does not give any insights into the structure of the distribution. This remains an issue even for more advanced nonparametric models like Gaussian processes. Another class of methods assumes a specific parametric form of the underlying probabilistic density function that generates the data points, among which Gaussian Mixture Models are one of the most powerful approaches due to its simplicity and flexibility. In this section, we present the modeling of the CGPD using GMMs.

4.1 Gaussian Mixture Model

The GMM is a probabilistic model defined as the weighted average of a finite number of Gaussian distributions, and the likelihood function reads

L(𝒙|𝜽)=j=1Ngcwj𝒩(𝒙|𝝁j,𝚺j),L(\bm{x}|\bm{\theta})=\sum_{j=1}^{N_{\rm gc}}w_{j}\,\mathcal{N}(\bm{x}|\bm{\mu}_{j},\bm{\Sigma}_{j})\,, (16)

where 𝒩\mathcal{N} denotes a multivariate Gaussian distribution, 𝒙\bm{x} is the data vector, and 𝜽\bm{\theta} includes all the model parameters, i.e. the weight wjw_{j}, the mean vector 𝝁j\bm{\mu}_{j} and the covariance matrix 𝚺j\bm{\Sigma}_{j} for each of the NgcN_{\rm gc} Gaussian components. The weights are constrained by the normalization j=1Ngcwj=1\sum_{j=1}^{N_{\rm gc}}w_{j}=1 such that the likelihood function integrates to one. In unsupervised machine learning, GMM is a powerful method for clustering and classification with each component corresponding to a certain class. Here, GMM is mainly used for density estimation and we are not attempting to classify galaxies in their property space. Another practical advantage of GMM in describing multidimensional data is their simplicity in giving marginal and conditional distributions, which remain Gaussian. Since each component is a multivariate Gaussian, marginalizing (integrating) over any subset of dimensions can be done by simply removing the marginalized dimensions from the mean vector and covariance matrix. Similarly, the conditional distributions for a multivariate Gaussian can also be written analytically. For example, for a bivariate case,

f(X1|X2=x2)𝒩(μ1+σ1σ2ρ(x2μ2),(1ρ2)σ12),f(X_{1}|X_{2}=x_{2})\sim\mathcal{N}\left(\mu_{1}+\frac{\sigma_{1}}{\sigma_{2}}\rho(x_{2}-\mu_{2}),\ (1-\rho^{2})\sigma_{1}^{2}\right)\,, (17)

where μ1(μ2)\mu_{1}\,(\mu_{2}) and σ1(σ2)\sigma_{1}\,(\sigma_{2}) are the mean and standard deviation of X1(X2)X_{1}\,(X_{2}), and ρ\rho is the correlation coefficient between the two. This would be useful for deriving e.g. ϕ(pi|pj,z)\phi(p_{i}|p_{j},z) from ϕ(pi,pj|z)\phi(p_{i},p_{j}|z), as discussed following Eq. 7.

4.2 Fitting CGPD with GMM

Fig. 2 shows an example of the CGPD for central galaxies in the SAM with host halo mass 11.4<log(M/M)<11.611.4<\log(M/M_{\odot})<11.6 at redshift z=2z=2, where the projected pair distribution plots are made using corner.py 444https://github.com/dfm/corner.py (Foreman-Mackey, 2016).

Refer to caption
Figure 2: The blue data points with white contours denote the distribution of central galaxies hosted by halos of mass 11.4<log(M/M)<11.611.4<\log(M/M_{\odot})<11.6 at redshift z=2z=2, simulated by the SAM. The orange and green lines denote the GMM models fitted with 1 and 2 Gaussian components respectively. The contours correspond to 1σ1\,\sigma and 2σ2\,\sigma levels in 2D, which cover 39.3%39.3\% and 86.5%86.5\% of the volume respectively.

We can see that the marginal distributions for the logarithmic galaxy properties are very close to being Gaussian around the peak but with possible extended spreads on the left side.

Refer to caption
Figure 3: Similar to Fig. 2 but for the TNG simulation data.

Similar to Fig. 2 for the SAM data, we show galaxies selected under the same criteria from the TNG simulated data in Fig. 3. There are many fewer galaxies in the TNG data compared to those in the SAM. Although the locations and dispersions are different from those of the SAM shown in Fig. 2, the distributions are still very close to being Gaussian. We conclude for now that this is likely to be a fairly robust property of CGPD in simulations and that therefore Gaussian mixtures are a promising parameterized model for this problem.

We use GMM as a parametric model that describes the CGPD Φ(𝒑|M,z)\Phi\left(\bm{p}|M,z\right), where the data vector 𝒙\bm{x} in Eq. 16 corresponds to the galaxy property vector 𝒑\bm{p}. For a bin in mass and redshift (M,z)(M,z), given nn simulated galaxies {𝒑i}i=1n\{\bm{p}_{i}\}_{i=1}^{n} for fitting, the overall logarithmic likelihood function is given by

({𝒑i}i=1n|𝜽)=i=1nlnL(𝒑i|𝜽),\mathcal{L}(\{\bm{p}_{i}\}_{i=1}^{n}|\bm{\theta})=\sum_{i=1}^{n}\ln\,L(\bm{p}_{i}|\bm{\theta})\,, (18)

which is maximized to find the best-fit 𝜽^\hat{\bm{\theta}} using the expectation-maximization (EM) algorithm (Dempster et al., 1977). In this work, we use the implementation of the GMM in scikit-learn 555https://scikit-learn.org/stable/modules/mixture.html.

Given the galaxy data and GMM model, we fit the galaxies in the 5-D physical property space shown in Fig. 2. Considering the correlations between physical properties, we use the full covariance matrices for the Gaussian components. However, this corresponds to a large number of parameters in the covariance matrix and the model might need to be further simplified in order to be used for statistical inference. We discuss a possible dimensionality reduction based on principal component analysis (PCA) in Section 4.4.

At each redshift, we divide the logarithmic host halo mass range into bins of uniform width ΔlogM=0.2\Delta\log M=0.2. As mentioned in Section 3.1, the halo mass resolution in our SAM simulations is 0.003logM0.003\,\log M_{\odot}, which is much smaller than the bin size such that halo masses are well sampled in each bin. Since different halo mass bins are fitted independently, the overall complexity of the model is also proportional to the number of bins. However, using fewer bins means a wider spread of the distribution for each bin and possibly results in lower accuracy. This is another trade-off depending on the usage. Another detail to notice is that as described in Section 3.1, in our simulations from SAM based on merger trees using the Monte Carlo method, the same number of halos are generated for the one thousand discrete halo masses. Therefore for wide halo mass bins, to be realistic it would be necessary to subsample the halos within a bin to match the HMF.

The model complexity of a GMM is proportional to the number of Gaussian components NgcN_{\rm gc}, which is an important hyper-parameter that needs to be determined. A typical model selection is based on the Bayesian information criterion (BIC), which attempts to achieve a balance between the complexity and performance of the model,

BIC=dim(𝜽)lnn2^,{\rm BIC}={\rm dim}(\bm{\theta})\ln n-2\hat{\mathcal{L}}\,, (19)

where dim(𝜽){\rm dim}(\bm{\theta}) denotes the total number of parameters in the model, nn is the number of data points, and ^\hat{\mathcal{L}} is the log-likelihood at the best-fit 𝜽^\hat{\bm{\theta}}. For the same data, the models with lower BIC scores are preferred.

Refer to caption
Figure 4: BIC scores for different numbers of components in a GMM, using the same galaxy sample from the SAM as that shown in Fig. 2.

In Fig. 4, for GMM fitted on the same example group of galaxies shown in Fig. 2, we show the BIC scores for models with different numbers of components (NgcN_{\rm gc}). The BIC score decreases as we increase NgcN_{\rm gc} starting from just one component, and the curve flattens very rapidly. Similar curves are also observed for galaxies at other redshifts and other halo mass bins. These curves indicate that a small number of components is sufficient, and continuing to increase NgcN_{\rm gc} to a very large value (e.g. > 10>\,10) will not significantly improve the fitting performance, but will only make the model more complicated. The BIC score is a general criterion for model selection based on the likelihood, while in practice the trade-off between accuracy and model complexity depends on the specific usage. If we want to use the model as an emulator of the SAM simulations, then we may want to use more components to make it more accurate. On the other hand, if we want to use the model for statistical inference, we would prefer to make the model as simple as possible while achieving acceptable accuracy. In our main analyses, we always start with very few (i.e. 1, 2 or 3) components.

4.3 Model Performance

To check how well the GMM fits the SAM and TNG data, for some of the galaxy property functions mentioned in Section 2, we compare the results given by the best-fit GMM with those computed directly from the data.

First we consider the cosmic densities of galaxy properties defined in Eq. 5. In Fig. 5, we show the results for GMM with one or two components for the one hundred uniformly spaced redshift bins in 0z<100\leq z<10. Also shown are the TNG data at five available redshifts (see Section 3.2) and the corresponding best-fit GMMs with just one component.

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 5: Cosmic densities of the galaxy physical properties. The solid blue lines are computed directly from the SAM simulation data. The dashed blue (dotted orange) lines denote the best-fit GMM with one (two) component(s). The lower panels show the percent errors of the GMMs compared to the SAM data. We also show the TNG simulated data at five redshifts (see Section 3.2), along with the best-fit GMM with one component and the corresponding percent error.

For both SAM and TNG simulated data, with just one or two Gaussian components, the percent errors of the cosmic densities evaluated with the GMM compared to the results computed directly from the simulation are mostly within 5%5\,\%, which is acceptable given the uncertainties in state-of-the-art observational constraints. For these integrated quantities, the improvement obtained when using one more component which can capture the tails in distributions (see 1-D histograms in Fig. 2) is not very significant.

We also consider the galaxy one-property distribution functions defined in Eq. 6.

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 6: The galaxy one-property distribution functions defined in Eq. 6. We show the stellar mass function (upper left), molecular gas mass function (upper right), galaxy radius function (middle left), gas phase metallicity function (middle right), and SFR function (lower) at redshift z=2z=2. The black line is the function computed directly from the SAM data. The blue, orange and green lines show the function obtained from the best-fit GMM using 1, 2 and 3 components respectively for each halo mass bin. We show the TNG results and its best-fit GMM with 2 components with dashed lines. The corresponding percent errors compared to the SAM and TNG simulation data are shown in the lower panels. Since we consider halo masses in 10log(M/M)1310\leq\log(M/M_{\odot})\leq 13, there are also limits outside which the galaxies hosted by less or more massive halos may not be negligible. We denote these limits with grey dashed vertical lines, which are the mean properties of the galaxies in the lowest and highest mass bins we consider, i.e. 10log(M/M)10.210\leq\log(M/M_{\odot})\leq 10.2 and 12.8log(M/M)1312.8\leq\log(M/M_{\odot})\leq 13.

In Fig. 6, for both the SAM and TNG, we show these functions at redshift z=2z=2 as an example, including the function computed directly from the simulation and the best-fit GMMs with different numbers of components. GMM with just one component can already capture most of the information, especially the main slope of the function. Compared with the cosmic densities above, using more than one component could help capturing more information. In general, using no more than 3 components can already reproduce the galaxy one-property functions to very high accuracy.

We now consider the galaxy two-property distribution functions (scaling relations) defined in Eq. 7.

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 7: Galaxy two-property distribution functions defined in Eq. 7. Here we show ϕ(m,SFR)\phi(m_{*},{\rm SFR}) (upper left), ϕ(m,Zg)\phi(m_{*},Z_{g}) (upper right), ϕ(m,rgal)\phi(m_{*},r_{\rm gal}) (lower left) and ϕ(m,mH2)\phi(m_{*},m_{{\rm H}_{2}}) (lower right) at redshift z=2z=2. The histograms with white-dashed contours show the SAM simulation results. The orange contours show the best-fit GMM with three components. The contours are plotted at (0.7, 0.5, 0.3, 0.1, 0.05, 0.01) times the maximum value of the corresponding distribution. The vertical and horizontal grey dashed lines denote the same limits of each property as in Fig. 6.
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 8: Similar to Fig. 7 but for the TNG simulation data.

For example, we plot ϕ(m,SFR)\phi(m_{*},{\rm SFR}), ϕ(m,Zg)\phi(m_{*},Z_{g}), ϕ(m,rgal)\phi(m_{*},r_{\rm gal}), and ϕ(m,mH2)\phi(m_{*},m_{{\rm H}_{2}}) at redshift z=2z=2 in Fig. 7 for SAM and Fig. 8 for TNG. For clarity, here we only show a GMM with three components, which can already reproduce the data distribution very well. Using fewer (i.e. one or two) components could also be acceptable depending on how many details we would like to recover. Scaling relations can also be observed in some of the distributions.

4.4 Dimensionality Reduction: Principal Component Analysis

As expected and also shown in Fig. 2, there are strong correlations between the galaxy physical properties we consider. This makes it possible to reduce the dimension of the data without losing much information using linear transformations like Principal Component Analysis. With the data standardized (i.e. shifted to zero mean and normalized to unit variance) on each dimension, starting from the principal component (PC) with the largest variance, PCA sequentially finds the next PC with the next largest variance while being linearly independent of the previous PCs. Lower data dimensionality could reduce the number of parameters needed in the model and make it more feasible for use in an inference pipeline. Besides, the linear independence (i.e. zero covariances) between PCs makes it possible to simplify certain models like a multivariate Gaussian distribution even further.

Here we perform PCA on central galaxies for each (M,z)(M,z) bin. The cumulative percentages of variance explained by PCs up to the third largest variance is shown in Fig. 9.

Refer to caption
Figure 9: For the distribution of galaxies in each (M,z)(M,z) bin, cumulative percentages of variance explained by PCs with up to the first, second and third largest variance.

It turns out that using just 3 PCs can already recover more than 90 percent of the variance. The distribution in the space of first 3 PCs for the galaxies in Fig. 2, is shown in Fig. 10.

Refer to caption
Figure 10: Blue data points with white contours show the distribution of the same group of galaxies shown in Fig. 2 in the space of the first 3 PCs. The orange and green lines denote the distributions with GMM using 1 and 2 Gaussian components respectively. The contours correspond to 1σ1\,\sigma and 2σ2\,\sigma levels in 2D, which cover 39.3%39.3\% and 86.5%86.5\% of the volume respectively.

Since PCA attempts to minimize parameter covariance, we do not see very strong correlations as we have seen in Fig. 2 for the physical properties. However, it is still possible for PCs to have nonlinear correlations.

Therefore with acceptable information loss, instead of in the 5-D physical property space, we can describe the distribution in the 3-D PC space and convert it back to the physical property space for usage. An additional advantage of using GMM is that a Gaussian distribution can also be analytically transformed between two spaces. Each Gaussian component fitted in the PC space also corresponds to a Gaussian component in the physical property space. With the mapping between the PC vector 𝒚\bm{y} and physical property vector 𝒑\bm{p} given by the linear transformation 𝒚=𝑾𝒑\bm{y}=\bm{W}\bm{p} with 𝑾\bm{W} being a 2-D matrix of dimension dim(𝒚)×dim(𝒑){\rm dim}(\bm{y})\times{\rm dim}(\bm{p}), the covariance matrices are related by

Cov(𝒑)=𝑾TCov(𝒚)𝑾.{\rm Cov}(\bm{p})=\bm{W}^{T}{\rm Cov}(\bm{y})\bm{W}\,. (20)

In Fig. 10, we show the best-fit GMM with one and two components in the PC space. For just one component, the covariance matrix is diagonal since there is no covariance. For two components, since each of them only cover part of the data, the full covariance matrices need to be considered. If we convert the best-fit GMM back to the physical property space, the models are show a similar level of agreement as Fig. 2 where we fit the physical properties directly.

Fitting the model in PC space does provide the potential to simplify the model. However, the transformation matrix between PC and physical property spaces also introduces additional parameters. We leave further exploration for a future work.

5 Discussion

In this section, we discuss the significance of the work presented here, as well as some of the caveats and limitations of this analysis.

5.1 Significance of the CGPD for understanding galaxy formation

The CGPD is a natural extension of traditional halo models, in that it links halo mass and redshift to galaxy properties. The new feature is that, while traditional halo models attempt to quantify this mapping with a single galaxy property at a time, and typically do not model the covariance between galaxy properties, the CGPD describes the mapping to an n-dimensional vector of galaxy properties and explicitly includes the covariances. We note that what appears as scatter or dispersion in a traditional style halo mass vs. single galaxy property plot is in some cases actually a result of projecting a higher dimensional manifold into a plane. Thus we expect the CGPD approach to partially, though perhaps not fully, also provide a natural description of dispersion along with the correct residual correlations. The CGPD as extracted from physics-based simulations can be an interesting vehicle for comparing different simulations or studying the impact of varying physical processes. This has not been a focus of this work, but it is notable that while the CGPD for the Santa Cruz SAM and TNG appear qualitatively quite similar, some of the quantities differ in the location of the mean and also in the width of the distributions. It would be interesting to systematically study the impact of varying physical processes and/or cosmology on the CGPD. Better understanding why there is a relatively “thin” manifold in the n-dimensional property space and how physical processes shape the location and thickness of this manifold is also an interesting open question.

5.2 Limitations of current modeling approach

5.2.1 Star forming and quenched galaxies

For practical reasons, we have only included central, star-forming galaxies in this analysis. This choice should not have a large impact on the accuracy of our method for our main target application, which is line intensity mapping. The lines of interest are produced by star-forming galaxies. For other applications, including the full galaxy population (not only star-forming galaxies) would certainly lead to a more multi-modal CGPD, but we are confident that this could still be modeled with a GMM approach using more components (see e.g. Hahn et al., 2019).

5.2.2 CGPD for satellite galaxies

According to the SAM predictions, the overall cosmic volume densities and other integrated quantities of the galaxy properties are dominated by central galaxies, with the contributions from satellites mostly less than 10%10\,\%. However, it would also be straightforward to include a separate CGPD for satellite galaxies. For satellite galaxies, instead of the host mass MM, it is customary to use a quantity like Mmax,subM_{\rm max,sub}, which is defined as the maximum mass that the host halo of the satellite galaxy had before it became a satellite (e.g. Behroozi et al., 2019). Under the (common) ansatz that the properties of satellite galaxies depend only on the sub-halo mass Mmax,subM_{\rm max,sub} and not on the host mass in which the satellite (sub-halo) is orbiting, we could then carry out an identical analysis of the CGPD for satellites where the host halo mass MM is replaced by the sub-halo mass Mmax,subM_{\rm max,sub}. We would need to add the sub-halo mass function to the distinct halo mass function introduced in Eqn. 3. Then the galaxy property distributions can simply be added,

ϕ(𝒑|z)=ϕ(𝒑|z)cen+ϕ(𝒑|z)sat.\phi(\bm{p}|z)=\phi(\bm{p}|z)_{\rm cen}+\phi(\bm{p}|z)_{\rm sat}\,. (21)

This would add some complexity to the model but would be a straightforward extension.

5.2.3 Dependence on higher order halo properties

Another simplification is that we assume that galaxy properties depend only on halo mass and redshift, which is known to be in contrast with the results of both hydrodynamic simulations and SAMs. Although this ansatz is fundamental to the ubiquitous halo model picture, it has been shown that properties such as galaxy clustering in hydrodynamic simulations depend on halo properties other than mass, an effect commonly referred to as assembly bias (Hadzhiyska et al., 2020, 2021). Similarly, it has been shown that some of the dispersion in the stellar mass vs. halo mass relation in simulations is actually due to these higher order halo properties, such as formation history (Mitchell & Schaye, 2022; Gabrielpillai et al., 2022). We are interested in extending the CGPD framework to include additional halo properties, guided by physics-based simulations.

5.2.4 Evolution across halo masses and redshifts

In the main results section, we fit GMM on galaxies for each redshift and halo mass bin independently. However, a continuous evolution of the galaxy distribution across those discrete bins should be expected. If the evolution of GMM parameters could be characterized with another parametric model, then the model could be further simplified. This would be important and possibly necessary for using the approach for statistical inference in the future.

Here we briefly discuss the evolution of galaxy properties across halo masses and redshifts. To avoid the side effects from possible local maxima of the likelihood function, instead of purely random initialization, we use the best-fit parameter values from the previous bin to initialize the parameters for the next bin. For GMM with just one component, the best-fit mean vector and covariance matrix are simply the sample mean and covariance of the data. We notice that the mean of the galaxy properties evolves smoothly across both halo masses and redshifts, while the variances and covariances can change dramatically. For multiple Gaussian components, an additional challenge is that we need to track each component individually, and there are different approaches for this. For example, we could sort the components based on their weights, and assume that the component with the highest weight in the current bin always corresponds to the highest-weight one in the next bin, and so on for other components with lower weights. There are also different definitions of the distance between two Gaussian distributions, based on which we could map one component in the current bin to the closest one in the next bin. However, this approach may not be very robust since the components could cross each others trajectories. In our case, we use the best-fit parameters from the current bin as the initial values for the next bin; then the new data can be considered as introducing an update to the current parameters. From this point of view, the order of parameters remains unchanged across bins. However, having tried all these approaches, we still found discontinuous behavior of the Gaussian components for not only the covariances but also the mean vectors. We leave further investigation for future work.

6 Conclusions

Traditional galaxy surveys and the relatively recent LIM techniques are both powerful methods to study the formation and evolution of galaxies, one of the most fundamental and important problems in astrophysics and cosmology. Motivated by the necessity of developing a physically motivated, computationally efficient framework that can describe both galaxy and LIM observations, we propose and characterize a new approach based on the CGPD Φ(𝒑|M,z)\Phi\left(\bm{p}|M,z\right) function, which describes the probability of a halo with mass MM at redshift zz to host a galaxy with properties 𝒑\bm{p}. When combined with the Line Luminosity Relation (LLR) Li(𝒑,z)L_{i}(\bm{p},z) function developed in a companion work (Breysse et al., 2023), a highly efficient and flexible pipeline that connects halos with an n-dimensional distribution of galaxy physical properties, and with LIM observables, can be constructed.

As a distribution function in the 5-D galaxy property space that we have chosen, the CGPD encapsulates many important galaxy distribution functions and scaling relations etc. Motivated by the form of the CGPD seen in two separate physics-based cosmological simulations (the Santa Cruz SAM and IllustrisTNG), we discuss a parameterized characterization of CGPD using GMM. We find the following main results:

  • With just a few (3\lesssim 3) components, GMM is capable of reproducing the CGPD distribution from both the SAM and TNG to high accuracy.

  • The GMM based CGPD yields estimates of common summary statistics of interest including 2-D property relations (scaling relations), property distribution functions, and integrated cosmic densities, with a typical accuracy of a few percent.

  • For each bin in halo mass and redshift, we find that the 5-D property distribution can be reduced to only 3 dimensions using PCA, with a 10% loss of information.

The galaxy properties in this work were selected based on their importance for predicting the luminosity of certain emission lines, but a similar approach could be used for other galaxy properties, including galaxy photometry as measured in galaxy surveys.

The next step is to construct a statistical inference pipeline for LIM observations. For a 5-D Gaussian distribution with a full covariance matrix, we have 20 free parameters including 5 parameters in the mean vector and 15 independent parameters in the symmetric covariance matrix. Then if we use 3 Gaussian components for each halo mass bin and 15 mass bins for each redshift as in our main analysis, we would have 900 parameters for the GMM at each redshift. However, these parameters are expected to be highly degenerate. We have shown that the data dimensionality can be reduced significantly using PCA. Another possible approach for doing implicit likelihood inference in high dimensional parameter space is the marginal posterior density with moment networks as proposed in Jeffrey & Wandelt (2020).

7 Acknowledgements

We thank the IllustrisTNG team for granting the JupyterLab access to the simulations, and Dylan Nelson for help accessing the TNG data. ARP was supported by NASA under award numbers 80NSSC18K1014, NNH17ZDA001N, and 80NSSC22K0666, and by the NSF under award number 2108411. ARP and rss are supported by the Simons Foundation. The numerical computations in this work were performed in part using the NYU Greene High Performance Computing cluster.

References

  • Alam et al. (2017) Alam, S., Ata, M., Bailey, S., et al. 2017, MNRAS, 470, 2617, doi: 10.1093/mnras/stx721
  • Alsing et al. (2018) Alsing, J., Wandelt, B., & Feeney, S. 2018, MNRAS, 477, 2874, doi: 10.1093/mnras/sty819
  • Baldry et al. (2008) Baldry, I. K., Glazebrook, K., & Driver, S. P. 2008, MNRAS, 388, 945, doi: 10.1111/j.1365-2966.2008.13348.x
  • Behroozi et al. (2019) Behroozi, P., Wechsler, R. H., Hearin, A. P., & Conroy, C. 2019, MNRAS, 488, 3143, doi: 10.1093/mnras/stz1182
  • Bernal & Kovetz (2022) Bernal, J. L., & Kovetz, E. D. 2022, A&A Rev., 30, 5, doi: 10.1007/s00159-022-00143-0
  • Breysse et al. (2023) Breysse, P., et al. 2023, In preparation
  • Calette et al. (2018) Calette, A. R., Avila-Reese, V., Rodríguez-Puebla, A., Hernández-Toledo, H., & Papastergis, E. 2018, Rev. Mexicana Astron. Astrofis., 54, 443, doi: 10.48550/arXiv.1803.07692
  • Cranmer et al. (2020) Cranmer, K., Brehmer, J., & Louppe, G. 2020, Proceedings of the National Academy of Science, 117, 30055, doi: 10.1073/pnas.1912789117
  • Dempster et al. (1977) Dempster, A. P., Laird, N. M., & Rubin, D. B. 1977, Journal of the Royal Statistical Society. Series B (Methodological), 39, 1. http://www.jstor.org/stable/2984875
  • Diemer et al. (2018) Diemer, B., Stevens, A. R. H., Forbes, J. C., et al. 2018, ApJS, 238, 33, doi: 10.3847/1538-4365/aae387
  • Diemer et al. (2019) Diemer, B., Stevens, A. R. H., Lagos, C. d. P., et al. 2019, MNRAS, 487, 1529, doi: 10.1093/mnras/stz1323
  • Forbes et al. (2014) Forbes, J. C., Krumholz, M. R., Burkert, A., & Dekel, A. 2014, MNRAS, 443, 168, doi: 10.1093/mnras/stu1142
  • Forbes et al. (2019) Forbes, J. C., Krumholz, M. R., & Speagle, J. S. 2019, MNRAS, 487, 3581, doi: 10.1093/mnras/stz1473
  • Foreman-Mackey (2016) Foreman-Mackey, D. 2016, The Journal of Open Source Software, 1, 24, doi: 10.21105/joss.00024
  • Gabrielpillai et al. (2022) Gabrielpillai, A., Somerville, R. S., Genel, S., et al. 2022, MNRAS, 517, 6091, doi: 10.1093/mnras/stac2297
  • Hadzhiyska et al. (2020) Hadzhiyska, B., Bose, S., Eisenstein, D., Hernquist, L., & Spergel, D. N. 2020, MNRAS, 493, 5506, doi: 10.1093/mnras/staa623
  • Hadzhiyska et al. (2021) Hadzhiyska, B., Liu, S., Somerville, R. S., et al. 2021, MNRAS, 508, 698, doi: 10.1093/mnras/stab2564
  • Hahn et al. (2019) Hahn, C., Starkenburg, T. K., Choi, E., et al. 2019, ApJ, 872, 160, doi: 10.3847/1538-4357/aafedd
  • Henriques et al. (2009) Henriques, B. M. B., Thomas, P. A., Oliver, S., & Roseboom, I. 2009, MNRAS, 396, 535, doi: 10.1111/j.1365-2966.2009.14730.x
  • Henriques et al. (2013) Henriques, B. M. B., White, S. D. M., Thomas, P. A., et al. 2013, MNRAS, 431, 3373, doi: 10.1093/mnras/stt415
  • Ivezić et al. (2019) Ivezić, Ž., Connolly, A., Vanderplas, J., & Gray, A. 2019, Statistics, Data Mining and Machine Learning in Astronomy (Princeton University Press)
  • Jeffrey & Wandelt (2020) Jeffrey, N., & Wandelt, B. D. 2020, arXiv e-prints, arXiv:2011.05991. https://arxiv.org/abs/2011.05991
  • Koekemoer et al. (2011) Koekemoer, A. M., Faber, S. M., Ferguson, H. C., et al. 2011, ApJS, 197, 36, doi: 10.1088/0067-0049/197/2/36
  • Kovetz et al. (2017) Kovetz, E. D., Viero, M. P., Lidz, A., et al. 2017, arXiv e-prints, arXiv:1709.09066. https://arxiv.org/abs/1709.09066
  • Krumholz (2014) Krumholz, M. R. 2014, MNRAS, 437, 1662, doi: 10.1093/mnras/stt2000
  • Lange et al. (2015) Lange, R., Driver, S. P., Robotham, A. S. G., et al. 2015, MNRAS, 447, 2603, doi: 10.1093/mnras/stu2467
  • Lara-López et al. (2010) Lara-López, M. A., Cepa, J., Bongiovanni, A., et al. 2010, A&A, 521, L53, doi: 10.1051/0004-6361/201014803
  • Leja et al. (2022) Leja, J., Speagle, J. S., Ting, Y.-S., et al. 2022, ApJ, 936, 165, doi: 10.3847/1538-4357/ac887d
  • Leung et al. (2020) Leung, T. K. D., Olsen, K. P., Somerville, R. S., et al. 2020, ApJ, 905, 102, doi: 10.3847/1538-4357/abc25e
  • Li et al. (2016) Li, T. Y., Wechsler, R. H., Devaraj, K., & Church, S. E. 2016, ApJ, 817, 169, doi: 10.3847/0004-637X/817/2/169
  • Lu et al. (2012) Lu, Y., Mo, H. J., Katz, N., & Weinberg, M. D. 2012, MNRAS, 421, 1779, doi: 10.1111/j.1365-2966.2012.20435.x
  • Lu et al. (2014) Lu, Y., Mo, H. J., Lu, Z., Katz, N., & Weinberg, M. D. 2014, MNRAS, 443, 1252, doi: 10.1093/mnras/stu1200
  • Lu et al. (2011) Lu, Y., Mo, H. J., Weinberg, M. D., & Katz, N. 2011, MNRAS, 416, 1949, doi: 10.1111/j.1365-2966.2011.19170.x
  • Madau & Dickinson (2014) Madau, P., & Dickinson, M. 2014, ARA&A, 52, 415, doi: 10.1146/annurev-astro-081811-125615
  • Marinacci et al. (2018) Marinacci, F., Vogelsberger, M., Pakmor, R., et al. 2018, MNRAS, 480, 5113, doi: 10.1093/mnras/sty2206
  • Mitchell & Schaye (2022) Mitchell, P. D., & Schaye, J. 2022, MNRAS, 511, 2948, doi: 10.1093/mnras/stab3339
  • Mowla et al. (2019) Mowla, L., van der Wel, A., van Dokkum, P., & Miller, T. B. 2019, ApJ, 872, L13, doi: 10.3847/2041-8213/ab0379
  • Murray et al. (2013) Murray, S. G., Power, C., & Robotham, A. S. G. 2013, Astronomy and Computing, 3, 23, doi: 10.1016/j.ascom.2013.11.001
  • Naab & Ostriker (2017) Naab, T., & Ostriker, J. P. 2017, ARA&A, 55, 59, doi: 10.1146/annurev-astro-081913-040019
  • Naiman et al. (2018) Naiman, J. P., Pillepich, A., Springel, V., et al. 2018, MNRAS, 477, 1206, doi: 10.1093/mnras/sty618
  • Nelson et al. (2018) Nelson, D., Pillepich, A., Springel, V., et al. 2018, MNRAS, 475, 624, doi: 10.1093/mnras/stx3040
  • Noeske et al. (2007) Noeske, K. G., Weiner, B. J., Faber, S. M., et al. 2007, ApJ, 660, L43, doi: 10.1086/517926
  • Olsen et al. (2021) Olsen, K. P., Burkhart, B., Mac Low, M.-M., et al. 2021, ApJ, 922, 88, doi: 10.3847/1538-4357/ac20d4
  • Padmanabhan (2018) Padmanabhan, H. 2018, MNRAS, 475, 1477, doi: 10.1093/mnras/stx3250
  • Padmanabhan (2019) —. 2019, MNRAS, 488, 3014, doi: 10.1093/mnras/stz1878
  • Pallottini et al. (2022) Pallottini, A., Ferrara, A., Gallerani, S., et al. 2022, MNRAS, 513, 5621, doi: 10.1093/mnras/stac1281
  • Pillepich et al. (2018) Pillepich, A., Nelson, D., Hernquist, L., et al. 2018, MNRAS, 475, 648, doi: 10.1093/mnras/stx3112
  • Planck Collaboration et al. (2016) Planck Collaboration, Ade, P. A. R., Aghanim, N., et al. 2016, A&A, 594, A13, doi: 10.1051/0004-6361/201525830
  • Popping et al. (2019) Popping, G., Narayanan, D., Somerville, R. S., Faisst, A. L., & Krumholz, M. R. 2019, MNRAS, 482, 4906, doi: 10.1093/mnras/sty2969
  • Popping et al. (2014) Popping, G., Somerville, R. S., & Trager, S. C. 2014, MNRAS, 442, 2398, doi: 10.1093/mnras/stu991
  • Porter et al. (2014) Porter, L. A., Somerville, R. S., Primack, J. R., & Johansson, P. H. 2014, MNRAS, 444, 942, doi: 10.1093/mnras/stu1434
  • Schaan & White (2021a) Schaan, E., & White, M. 2021a, J. Cosmology Astropart. Phys, 2021, 067, doi: 10.1088/1475-7516/2021/05/067
  • Schaan & White (2021b) —. 2021b, J. Cosmology Astropart. Phys, 2021, 068, doi: 10.1088/1475-7516/2021/05/068
  • Silva et al. (2015) Silva, M., Santos, M. G., Cooray, A., & Gong, Y. 2015, ApJ, 806, 209, doi: 10.1088/0004-637X/806/2/209
  • Smit et al. (2012) Smit, R., Bouwens, R. J., Franx, M., et al. 2012, ApJ, 756, 14, doi: 10.1088/0004-637X/756/1/14
  • Somerville & Davé (2015a) Somerville, R. S., & Davé, R. 2015a, ARA&A, 53, 51, doi: 10.1146/annurev-astro-082812-140951
  • Somerville & Davé (2015b) —. 2015b, ARA&A, 53, 51, doi: 10.1146/annurev-astro-082812-140951
  • Somerville et al. (2012) Somerville, R. S., Gilmore, R. C., Primack, J. R., & Domínguez, A. 2012, MNRAS, 423, 1992, doi: 10.1111/j.1365-2966.2012.20490.x
  • Somerville et al. (2008) Somerville, R. S., Hopkins, P. F., Cox, T. J., Robertson, B. E., & Hernquist, L. 2008, MNRAS, 391, 481, doi: 10.1111/j.1365-2966.2008.13805.x
  • Somerville & Kolatt (1999) Somerville, R. S., & Kolatt, T. S. 1999, MNRAS, 305, 1, doi: 10.1046/j.1365-8711.1999.02154.x
  • Somerville et al. (2015) Somerville, R. S., Popping, G., & Trager, S. C. 2015, MNRAS, 453, 4337, doi: 10.1093/mnras/stv1877
  • Somerville & Primack (1999) Somerville, R. S., & Primack, J. R. 1999, MNRAS, 310, 1087, doi: 10.1046/j.1365-8711.1999.03032.x
  • Somerville et al. (2021) Somerville, R. S., Olsen, C., Yung, L. Y. A., et al. 2021, MNRAS, 502, 4858, doi: 10.1093/mnras/stab231
  • Springel et al. (2018) Springel, V., Pakmor, R., Pillepich, A., et al. 2018, MNRAS, 475, 676, doi: 10.1093/mnras/stx3304
  • Stevens et al. (2021) Stevens, A. R. H., Lagos, C. d. P., Cortese, L., et al. 2021, MNRAS, 502, 3158, doi: 10.1093/mnras/staa3662
  • Sun et al. (2019) Sun, G., Hensley, B. S., Chang, T.-C., Doré, O., & Serra, P. 2019, ApJ, 887, 142, doi: 10.3847/1538-4357/ab55df
  • Tinker et al. (2008) Tinker, J., Kravtsov, A. V., Klypin, A., et al. 2008, ApJ, 688, 709, doi: 10.1086/591439
  • Tremonti et al. (2004) Tremonti, C. A., Heckman, T. M., Kauffmann, G., et al. 2004, ApJ, 613, 898, doi: 10.1086/423264
  • Wechsler & Tinker (2018) Wechsler, R. H., & Tinker, J. L. 2018, ARA&A, 56, 435, doi: 10.1146/annurev-astro-081817-051756
  • Yang et al. (2022) Yang, S., Popping, G., Somerville, R. S., et al. 2022, ApJ, 929, 140, doi: 10.3847/1538-4357/ac5d57
  • Yang et al. (2021) Yang, S., Somerville, R. S., Pullen, A. R., et al. 2021, ApJ, 911, 132, doi: 10.3847/1538-4357/abec75
  • York et al. (2000) York, D. G., Adelman, J., Anderson, John E., J., et al. 2000, AJ, 120, 1579, doi: 10.1086/301513
  • Yung et al. (2019a) Yung, L. Y. A., Somerville, R. S., Finkelstein, S. L., Popping, G., & Davé, R. 2019a, MNRAS, 483, 2983, doi: 10.1093/mnras/sty3241
  • Yung et al. (2019b) Yung, L. Y. A., Somerville, R. S., Popping, G., et al. 2019b, MNRAS, 490, 2855, doi: 10.1093/mnras/stz2755