
Likelihood-free Inference with Mixture Density Network

Guo-Jian Wang, Cheng Cheng, Yin-Zhe Ma
School of Chemistry and Physics, University of KwaZulu-Natal, Westville Campus, Private Bag X54001, Durban, 4000, South Africa
NAOC-UKZN Computational Astrophysics Centre (NUCAC), University of KwaZulu-Natal, Durban, 4000, South Africa
National Institute for Theoretical and Computational Sciences (NITheCS), South Africa

Jun-Qing Xia
Department of Astronomy, Beijing Normal University, Beijing 100875, China
(Received 2022 March 20; Revised 2022 June 28; Accepted 2022 June 29)
Abstract

In this work, we propose using the mixture density network (MDN) to estimate cosmological parameters. We test the MDN method by constraining parameters of the $\Lambda$CDM and $w$CDM models using Type Ia supernovae and the power spectra of the cosmic microwave background. We find that the MDN method can achieve the same level of accuracy as the Markov Chain Monte Carlo method, with a slight difference of $\mathcal{O}(10^{-2}\sigma)$. Furthermore, the MDN method can provide accurate parameter estimates with only $\mathcal{O}(10^{3})$ forward simulation samples, which is useful for complex and resource-consuming cosmological models. This method can process either one data set or multiple data sets to achieve joint constraints on parameters, and is extendable to parameter estimation for complicated models in a wider scientific field. Thus, the MDN provides an alternative way for likelihood-free inference of parameters.

Cosmological parameters (339); Observational cosmology (1146); Computational methods (1965); Astronomy data analysis (1858); Neural networks (1933)
Corresponding author: [email protected]

Published 2022 September 2

1 Introduction

In research on cosmology, precise parameter estimation is one of the most important steps to understanding the physical processes in the universe. The most commonly used method for parameter estimation is the Markov Chain Monte Carlo (MCMC) method. However, we often encounter scenarios where, although we can generate simulated data through forward simulations, the likelihood is intractable in practice. In addition, simulations may consume a lot of time and computational resources, making parameter inference impractical—for example, for physics related to nonlinear structure formation on small scales (Springel, 2005; Klypin et al., 2011) and the epoch of reionization (Mesinger et al., 2016; Kern et al., 2017). Therefore, we need to develop new methods to circumvent these problems.

Recently, the artificial neural network (ANN) with many hidden layers has achieved significant improvement in many fields including cosmology and astrophysics. For example, it performs excellently in searching for and estimating the parameters of strong gravitational lenses (Hezaveh et al., 2017; Jacobs et al., 2017; Petrillo et al., 2017; Pourrahmani et al., 2018; Schaefer et al., 2018; Li et al., 2020, 2021), analyzing gravitational waves (George et al., 2018a; George & Huerta, 2018b, c; Shen et al., 2019; Li et al., 2020), classifying the large-scale structure of the universe (Aragon-Calvo, 2019), discriminating between cosmological and reionization models (Schmelzle et al., 2017; Hassan et al., 2018), recovering the cosmic microwave background (CMB) signal from contaminated observations (Petroff et al., 2020; Casas et al., 2022; Wang et al., 2022), reconstructing functions from observational data (Wang et al., 2020b; Escamilla-Rivera et al., 2020; Wang et al., 2021), and even estimating cosmological and astrophysical parameters (Shimabukuro & Semelin, 2017; Fluri et al., 2018; Schmit & Pritchard, 2018; Fluri et al., 2019; Ribli et al., 2019; Wang et al., 2020a; Ntampaka et al., 2020; Nygaard et al., 2022).

As an application of the ANN, the mixture density network (MDN; Bishop 1994) is designed to model the conditional probability density $p(\bm{\theta}|\bm{d})$. The MDN is a combination of the ANN and a mixture model, which in principle can represent an arbitrary conditional probability density. Much of the recent literature has used the MDN to model parameters (Zen & Senior, 2014; Alsing et al., 2018, 2019; Kruse, 2020; Zhao et al., 2022). In Alsing et al. (2018), the MDN was used to model the joint probability density $p(\bm{\theta},\bm{t})$, where $\bm{t}$ is the summary statistics of the data. The posterior distribution was then obtained by evaluating the joint probability density at the observational summary $\bm{t}_{0}$: $p(\bm{\theta}|\bm{t}_{0})\propto p(\bm{\theta},\bm{t}=\bm{t}_{0})$. In comparison, in Alsing et al. (2019), the MDN was used to model the conditional probability density $p(\bm{t}|\bm{\theta})$, from which the likelihood $p(\bm{t}_{0}|\bm{\theta})$ could be obtained at the observational summary $\bm{t}_{0}$. The posterior distribution could then be obtained by multiplying the likelihood and the prior: $p(\bm{\theta}|\bm{t}_{0})\propto p(\bm{t}_{0}|\bm{\theta})\times p(\bm{\theta})$.

In this work, we show that for one-dimensional observational data, the MDN is capable of estimating cosmological parameters with high accuracy. The MDN used here aims to model the conditional probability density $p(\bm{\theta}|\bm{d})$, and the posterior distribution can be obtained by evaluating it at the observational data $\bm{d}_{0}$. Two mixture models are considered in our analysis: the Gaussian mixture model and the beta mixture model. We test the MDN method by estimating parameters of the $\Lambda$CDM and $w$CDM cosmologies using Type Ia supernovae (SN-Ia) and the angular power spectra of the CMB. The code and examples are available online (https://github.com/Guo-Jian-Wang/mdncoper).

This paper is organized as follows: In Section 2, we illustrate the method of estimating parameters using the MDN, which includes an introduction to the ANN, mixture model, MDN, and training and parameter inference method. Section 3 shows the application of the MDN method to the Pantheon SN-Ia. Section 4 presents a joint constraint on parameters using the power spectra of the Planck CMB and the Pantheon SN-Ia. Section 5 shows the effect of hyperparameters of the MDN on the parameter inference. In Section 6, we illustrate the beta mixture model. A discussion of the MDN method is presented in Section 7. We then conclude in Section 8.

Figure 1: A neuron of the ANN.

2 Methodology

2.1 ANNs

An ANN is a computing system inspired by the biological neural network. The unit of an ANN is called a neuron (see Figure 1), and an ANN consists of many neurons. Each neuron transforms the input from other neurons and gives an output

y=f\left(\sum_{i}w_{i}x_{i}+b\right)~, \qquad (1)

where $x$ is the input of a neuron, $f(\cdot)$ is a nonlinear function that is usually called the activation function, and $w$ and $b$ are parameters to be learned by the network. The parameters of an ANN model can be optimized by minimizing the loss function $\mathcal{L}$ in the training process. The loss function used in this work will be described in Section 2.3. In our analysis, we consider the randomized leaky rectified linear unit (RReLU; Xu et al. 2015) as the activation function:

f(x)=\begin{cases} x & \text{if } x\geq 0 \\ ax & \text{if } x<0, \end{cases} \qquad (2)

where $a$ is a random number sampled from a uniform distribution $U(l,u)$, with $l,u\in[0,1)$. Here we adopt the default settings of $l=1/8$ and $u=1/3$ in PyTorch (https://pytorch.org/docs/stable/index.html). A discussion of the effect of activation functions on the result is given in Section 5.3.
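As a minimal sketch (assuming the PyTorch activation named above), the snippet below shows how the RReLU of Equation (2) behaves; the tensor values are illustrative only.

```python
# A minimal sketch of the RReLU activation of Equation (2), using PyTorch's nn.RReLU
# with the default bounds l = 1/8 and u = 1/3 adopted in the text.
import torch
import torch.nn as nn

rrelu = nn.RReLU(lower=1/8, upper=1/3)
x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
# In training mode, negative inputs are scaled by a random slope a ~ U(1/8, 1/3).
print(rrelu(x))
```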

The general structure of an ANN is composed of many layers, and each layer contains many neurons. There are three types of layers in a network: an input layer, one or more hidden layers, and an output layer. Data are fed to the input layer, the information is passed through each hidden layer, and the result is finally computed from the output layer. Furthermore, the batch normalization technique (Ioffe & Szegedy, 2015) is applied before each nonlinear layer to facilitate optimization and speed up convergence.

In general, there is no theory to determine exactly how many neurons should be used in each hidden layer. Here we adopt the model architecture used in Wang et al. (2020a), i.e., the number of neurons in each hidden layer decreases in proportion to the layer index. The number of neurons in the $i$th hidden layer is therefore

N_{i}=\frac{N_{\rm in}}{F^{i}}~, \qquad (3)

where $N_{\rm in}$ is the number of neurons in the input layer and $F$ is a decreasing factor defined by

F=\left(\frac{N_{\rm in}}{N_{\rm out}}\right)^{\frac{1}{n+1}}~, \qquad (4)

where $N_{\rm out}$ is the number of neurons in the output layer and $n$ is the number of hidden layers. In our analysis, we consider a network with three hidden layers. We discuss the effect of the number of hidden layers on the estimation result in Section 5.2.
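As a rough illustration of Equations (3)-(4), the sketch below computes the hidden-layer widths; the input and output sizes are illustrative assumptions, not the exact architecture used in the paper.

```python
# A sketch of the layer-width rule of Equations (3)-(4): hidden-layer sizes decrease
# by the factor F from the input size toward the output size.
def hidden_layer_widths(n_in, n_out, n_hidden):
    """Return the number of neurons N_i in each of the n_hidden hidden layers."""
    F = (n_in / n_out) ** (1.0 / (n_hidden + 1))
    return [int(round(n_in / F ** i)) for i in range(1, n_hidden + 1)]

# e.g. 1048 SN-Ia magnitudes in, 30 mixture-model parameters out, 3 hidden layers
print(hidden_layer_widths(1048, 30, 3))  # -> [431, 177, 73]
```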

Figure 2: An example of a mixture model.

2.2 Mixture Model

A mixture model is a probabilistic model that assumes all data points are generated from a mixture of a finite number of distributions with unknown parameters, where the distributions can be of any type. In Figure 2 we show an example of a mixture model for one parameter. For a measurement $\bm{d}$ and parameters $\bm{\theta}$, the probability density of $\bm{\theta}$ with $K$ components has the form

p(\bm{\theta}|\bm{d})=\sum_{i=1}^{K}\omega_{i}\,p_{i}(\bm{\theta}|\bm{d})~, \qquad (5)

where $\omega_{i}$ is a mixture weight representing the probability that $\bm{\theta}$ belongs to the $i$th component. The mixture weights are nonnegative and their sum is normalized, i.e., $\sum_{i=1}^{K}\omega_{i}=1$.

The normal (or Gaussian) distribution acts as the foundation for many modeling approaches in statistics. Therefore, in the following analysis, we mainly consider the Gaussian distribution as the unit of the mixture model; in Section 6, we also discuss the application of the beta mixture model to parameter estimation. For the case of only one parameter, Equation (5) becomes

p(\theta|\bm{d}) = \sum_{i=1}^{K}\omega_{i}\,\mathcal{N}(\theta;\mu_{i},\sigma_{i})
                 = \sum_{i=1}^{K}\omega_{i}\cdot\frac{1}{\sqrt{2\pi\sigma^{2}_{i}}}\,e^{-\frac{(\theta-\mu_{i})^{2}}{2\sigma^{2}_{i}}}~. \qquad (6)

For multiple parameters, Equation (5) becomes

p(\bm{\theta}|\bm{d}) = \sum_{i=1}^{K}\omega_{i}\,\mathcal{N}(\bm{\theta};\bm{\mu}_{i},\bm{\Sigma}_{i})
                      = \sum_{i=1}^{K}\omega_{i}\cdot\frac{\exp\left(-\tfrac{1}{2}(\bm{\theta}-\bm{\mu}_{i})^{\top}\bm{\Sigma}_{i}^{-1}(\bm{\theta}-\bm{\mu}_{i})\right)}{\sqrt{(2\pi)^{N}|\bm{\Sigma}_{i}|}}~, \qquad (7)

where $N$ is the number of parameters.
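As a minimal sketch of Equation (6) (the multivariate form of Equation (7) is analogous), the snippet below evaluates a one-parameter Gaussian mixture density; the weights, means, and widths are illustrative numbers only.

```python
# A sketch of Equation (6): the probability density of a one-parameter Gaussian mixture.
import numpy as np

def gaussian_mixture_pdf(theta, weights, mus, sigmas):
    """p(theta|d) for a 1D Gaussian mixture; the weights must sum to 1."""
    weights, mus, sigmas = map(np.asarray, (weights, mus, sigmas))
    comps = np.exp(-(theta - mus) ** 2 / (2 * sigmas ** 2)) / np.sqrt(2 * np.pi * sigmas ** 2)
    return np.sum(weights * comps)

print(gaussian_mixture_pdf(0.31, weights=[0.6, 0.4], mus=[0.30, 0.33], sigmas=[0.02, 0.03]))
```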

2.3 MDN

An MDN is a combination of an ANN and a mixture model. It learns a mapping between the measurement $\bm{d}$ and the parameters of the mixture model. For an MDN with Gaussian components specifically, it learns a mapping between the measurement $\bm{d}$ and the parameters of the Gaussian mixture model ($\omega$, $\bm{\mu}$, and $\bm{\Sigma}$, or $\sigma$ for the case of only one parameter). Figure 3 shows the general structure of an MDN with Gaussian components. The left side of the MDN accepts the measurement as input, and the parameters of the mixture model are then computed on the right side.

Figure 3: The general structure of an MDN with Gaussian components. The input is the observational data, and the outputs are the parameters of the mixture model. Each node here is a neuron.
Figure 4: The general structure of a multibranch MDN with Gaussian components. The inputs are multiple data sets $\{D_{1},D_{2},D_{3},\ldots,D_{n}\}$ from different experiments, and the outputs are the parameters of the mixture model. The yellow part shows the combination of the outputs of these branches.

The MDN model shown in Figure 3 is for the case of a single observational data set. But in cosmology we usually need to combine multiple data sets to estimate cosmological parameters. Therefore, following Wang et al. (2020a), we design a multibranch MDN for joint constraints on parameters using multiple data sets, as shown in Figure 4. There are multiple branches on the left side of the model that accept multiple observational data sets as input. The information is then combined in the middle of the model (the yellow part of Figure 4), and finally the MDN outputs the parameters of the mixture model.

The parameters (e.g., $w$ and $b$ in Equation (1)) of an MDN can be learned by minimizing the loss function (Kruse, 2020):

\mathcal{L} = \mathbb{E}\left[-\ln\left(p(\bm{\theta}|\bm{d})\right)\right]
            = \mathbb{E}\left[-\ln\left(\sum_{i=1}^{K}\omega_{i}\,p_{i}(\bm{\theta}|\bm{d})\right)\right]~, \qquad (8)

where the loss is averaged over the minibatch input samples to the network. Specifically, for an MDN with Gaussian components and the case of only one parameter, Equation (8) becomes

\mathcal{L} = \mathbb{E}\left[-\ln\left(\sum_{i=1}^{K}\omega_{i}\cdot\frac{1}{\sqrt{2\pi\sigma^{2}_{i}}}\,e^{-\frac{(\theta-\mu_{i})^{2}}{2\sigma^{2}_{i}}}\right)\right]~. \qquad (9)

Because small values in the exponent give very small values in the logarithm, which can lead to numerical underflow, we use the Log-Sum-Exp trick and carry out the calculations in the logarithmic domain for numerical stability. Thus, Equation (9) can be rewritten as

\mathcal{L} = \mathbb{E}\left[-\ln\left(\sum_{i=1}^{K}e^{\ln(\omega_{i})+\ln(p_{i}(\theta|\bm{d}))}\right)\right]~, \qquad (10)

where

\ln\left[p_{i}(\theta|\bm{d})\right]=-\frac{(\theta-\mu_{i})^{2}}{2\sigma_{i}^{2}}-\ln(\sigma_{i})-\frac{\ln(2\pi)}{2}~. \qquad (11)
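A hedged sketch of Equations (10)-(11) is given below: the single-parameter Gaussian MDN loss evaluated with the Log-Sum-Exp trick in PyTorch. The tensor names and shapes are illustrative assumptions, not the released mdncoper code.

```python
# A sketch of Equations (10)-(11): single-parameter Gaussian MDN loss with Log-Sum-Exp.
import math
import torch

def mdn_loss_1d(log_w, mu, sigma, theta):
    """log_w, mu, sigma: (batch, K) mixture outputs; theta: (batch, 1) target parameter."""
    # ln p_i(theta|d) of Equation (11), broadcast over the K components
    log_p = -(theta - mu) ** 2 / (2 * sigma ** 2) - torch.log(sigma) - 0.5 * math.log(2 * math.pi)
    # Equation (10): -ln sum_i exp[ln(w_i) + ln(p_i)], averaged over the minibatch
    return -torch.logsumexp(log_w + log_p, dim=1).mean()
```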

For the case with multiple parameters, Equation (8) becomes

\mathcal{L} = \mathbb{E}\left[-\ln\left(\sum_{i=1}^{K}\omega_{i}\times\frac{\exp\left(-\tfrac{1}{2}(\bm{\theta}-\bm{\mu}_{i})^{\top}\bm{\Sigma}_{i}^{-1}(\bm{\theta}-\bm{\mu}_{i})\right)}{\sqrt{(2\pi)^{N}|\bm{\Sigma}_{i}|}}\right)\right]~. \qquad (12)

Considering numerical stability, this equation can be written as

\mathcal{L} = \mathbb{E}\left[-\ln\left(\sum_{i=1}^{K}e^{\ln(\omega_{i})+\ln(p_{i}(\bm{\theta}|\bm{d}))}\right)\right]~, \qquad (13)

where

\ln[p_{i}(\bm{\theta}|\bm{d})] = -\frac{1}{2}(\bm{\theta}-\bm{\mu}_{i})^{\top}\bm{\Sigma}_{i}^{-1}(\bm{\theta}-\bm{\mu}_{i}) + \ln\left(|\bm{\Sigma}_{i}^{-1}|^{\frac{1}{2}}\right) - \ln\left(\sqrt{(2\pi)^{N}}\right)~. \qquad (14)

Note that for a valid multivariate Gaussian distribution, $\bm{\Sigma}_{i}$ and $\bm{\Sigma}^{-1}_{i}$ must be positive-definite matrices. Therefore, the precision matrix $\bm{\Sigma}^{-1}_{i}$ can be characterized by its upper Cholesky factor $\bm{U}_{i}$:

\bm{\Sigma}^{-1}_{i}=\bm{U}^{\top}_{i}\bm{U}_{i}~, \qquad (15)

where $\bm{U}_{i}$ is an upper triangular matrix with strictly positive diagonal entries. $\bm{U}_{i}$ has $N(N+1)/2$ nonzero entries, which is far fewer than the $N^{2}$ entries of $\bm{\Sigma}^{-1}_{i}$. Thus, if the MDN learned $\bm{\Sigma}^{-1}_{i}$ instead of $\bm{U}_{i}$, there would be more neurons in the network, which would increase the training time; moreover, the network output might not be a positive-definite matrix. Therefore, to make the convergence of the network more stable, the MDN actually learns $\bm{U}_{i}$. Since the outputs of the network can be either positive or negative, we use a Softplus function to ensure that the diagonal entries of $\bm{U}_{i}$ are positive:

(\bm{U}_{i})_{jk} = \begin{cases} \text{Softplus}\left((\widetilde{\bm{U}}_{i})_{jk}\right)~, & \text{if } j=k \\ (\widetilde{\bm{U}}_{i})_{jk}~, & \text{otherwise} \end{cases} \qquad (16)

where $\widetilde{\bm{U}}_{i}$ is the output of the network, and

\text{Softplus}(x)=\frac{1}{\beta}\ln(1+e^{\beta x})~, \qquad (17)

where $\beta$ is a parameter; we set $\beta=1$ throughout the paper. This parameterization offers an efficient way to evaluate the terms in Equation (14):

|\bm{\Sigma}_{i}^{-1}| = |\bm{U}^{\top}_{i}\bm{U}_{i}| = |\bm{U}^{\top}_{i}|\cdot|\bm{U}_{i}| = \left(\prod_{j=1}^{N}\text{diag}(\bm{U}_{i})_{j}\right)^{2}~. \qquad (18)

Therefore,

\ln\left(|\bm{\Sigma}_{i}^{-1}|^{\frac{1}{2}}\right)=\sum_{j=1}^{N}\ln\left(\text{diag}(\bm{U}_{i})_{j}\right)~. \qquad (19)

In addition,

(\bm{\theta}-\bm{\mu}_{i})^{\top}\bm{\Sigma}_{i}^{-1}(\bm{\theta}-\bm{\mu}_{i}) = (\bm{\theta}-\bm{\mu}_{i})^{\top}\bm{U}^{\top}_{i}\bm{U}_{i}(\bm{\theta}-\bm{\mu}_{i}) = \left[\bm{U}_{i}(\bm{\theta}-\bm{\mu}_{i})\right]^{\top}\left[\bm{U}_{i}(\bm{\theta}-\bm{\mu}_{i})\right] = \left\lVert\bm{U}_{i}(\bm{\theta}-\bm{\mu}_{i})\right\rVert^{2}_{2}~. \qquad (20)

Therefore, Equation (14) can be finally written as

\ln[p_{i}(\bm{\theta}|\bm{d})] = -\frac{1}{2}\left\lVert\bm{U}_{i}(\bm{\theta}-\bm{\mu}_{i})\right\rVert^{2}_{2} + \sum_{j=1}^{N}\ln\left(\text{diag}(\bm{U}_{i})_{j}\right) - \ln\left(\sqrt{(2\pi)^{N}}\right)~. \qquad (21)
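The sketch below is an assumed PyTorch implementation (not the authors' released code) of Equations (16)-(21): it builds the upper Cholesky factor $\bm{U}_{i}$ with a Softplus-positive diagonal and evaluates the component log-density.

```python
# A sketch of Equations (16)-(21): the multivariate Gaussian component log-density
# parameterized by the upper Cholesky factor U_i of the precision matrix.
import math
import torch
import torch.nn.functional as F

def log_component_density(theta, mu, U_raw):
    """theta: (batch, N); mu: (batch, K, N); U_raw: (batch, K, N, N) raw network output.
    Returns ln p_i(theta|d) with shape (batch, K)."""
    N = theta.shape[-1]
    # Equation (16): keep the strict upper triangle, force the diagonal positive via Softplus
    U = torch.triu(U_raw, diagonal=1) + torch.diag_embed(
        F.softplus(torch.diagonal(U_raw, dim1=-2, dim2=-1)))
    diff = (theta.unsqueeze(1) - mu).unsqueeze(-1)                 # (batch, K, N, 1)
    quad = (U @ diff).squeeze(-1).pow(2).sum(-1)                   # Eq. (20): ||U (theta - mu)||^2
    log_det_half = torch.log(torch.diagonal(U, dim1=-2, dim2=-1)).sum(-1)  # Eq. (19)
    return -0.5 * quad + log_det_half - 0.5 * N * math.log(2 * math.pi)    # Eq. (21)

def mdn_loss(log_w, theta, mu, U_raw):
    """Equation (13): Log-Sum-Exp loss over K components, averaged over the minibatch."""
    return -torch.logsumexp(log_w + log_component_density(theta, mu, U_raw), dim=1).mean()
```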

2.4 Training Strategy

For parameter estimation with the MDN, we will first train the MDN with simulated data and then estimate parameters with the observational data. In this section, we describe the processes of training and parameter estimation.

2.4.1 Training Set

The basic principle that allows the MDN to be used for parameter estimation is that it can learn a mapping from the input data space to the output parameter space of the mixture model. Note that the parameters of the mixture model are functions of the cosmological parameters to be estimated. Therefore, the parameter space of the cosmological model in the training set should be large enough to cover the posterior distribution of the cosmological parameters. Following Wang et al. (2020a), we set the range of parameters in the training set to $[P-5\sigma_{p}, P+5\sigma_{p}]$, where $P$ is the best-fitting value of the posterior distribution of the cosmological parameters and $\sigma_{p}$ is the corresponding $1\sigma$ error. Note that the best-fitting value here refers to the mode of the posterior distribution. Within this parameter space, the cosmological parameters in the training set are drawn from a uniform distribution.

From the illustrations in Sections 2.2 and 2.3, one can see that the errors of the cosmological parameters are determined by the standard deviation $\sigma$ or covariance matrix $\bm{\Sigma}$ of the mixture model, which in turn should be determined by the covariance matrix of the observational measurement. This means that the MDN should know what the observational errors look like. Therefore, inspired by Wang et al. (2020a), we add noise to the training set based on the covariance matrix of the observational measurement. Specifically, Gaussian noise $\mathcal{N}(0,\bm{\Sigma})$ is generated and added to the samples of the training set at each epoch of the training process. Note that the noise level equals the observational errors, which differs from Wang et al. (2020a), where multiple levels of noise can be added to the training set. For more details, we refer interested readers to Wang et al. (2020a).
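The noise step above can be sketched as follows (an assumption about the implementation, using the observational covariance matrix to draw correlated Gaussian noise):

```python
# A minimal sketch of the noise step: Gaussian noise N(0, Sigma) drawn from the
# observational covariance matrix is added to every simulated data vector at each epoch.
import numpy as np

def add_observational_noise(sim_data, cov):
    """sim_data: (n_samples, n_points) simulated measurements; cov: (n_points, n_points)."""
    noise = np.random.multivariate_normal(np.zeros(cov.shape[0]), cov, size=sim_data.shape[0])
    return sim_data + noise
```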

The orders of magnitude of the cosmological parameters may be quite different, which can affect the performance of the network. To address this, data preprocessing techniques should be used before feeding the data to the MDN. Following Wang et al. (2020a), we take two steps to normalize the data. First, we divide the cosmological parameters in the training set by their estimated values (here the mean of the parameters in the training set is taken as the estimated value) so that they all become of order unity; then, we normalize the training set using the $z$-score normalization technique (Kreyszig, 2011):

z=\frac{x-\mu}{\sigma}~, \qquad (22)

where $\mu$ and $\sigma$ are the mean and standard deviation of the measurement $\bm{d}$ or of the corresponding cosmological parameters $\bm{\theta}$. This data preprocessing method can reduce the influence of differences between parameters on the result, thus making the MDN a general method that can be applied to any cosmological parameters.
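A minimal sketch of this two-step preprocessing (dividing by the estimated values and then applying Equation (22)) is:

```python
# A sketch of the parameter preprocessing: scale to order unity, then z-score normalize.
import numpy as np

def preprocess_params(params):
    """params: (n_samples, n_params) cosmological parameters in the training set."""
    scaled = params / params.mean(axis=0)   # divide by the estimated (mean) values
    return (scaled - scaled.mean(axis=0)) / scaled.std(axis=0)   # Equation (22)
```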

2.4.2 Training and Parameter Estimation

Here we illustrate how to train an MDN and estimate parameters using a well-trained MDN model. The key steps of the training process are as follows:

  1. Provide a cosmological model and the initial conditions of the corresponding cosmological parameters.

  2. Simulate the training set using the method illustrated in Section 2.4.1.

  3. Add noise and conduct data preprocessing on the training set using the method illustrated in Section 2.4.1.

  4. Feed the training set to the MDN and train the model.

  5. Feed the observational measurement to the well-trained MDN model to obtain the parameters of the mixture model.

  6. Obtain an ANN chain by generating samples from the mixture model according to Equation (6) or (7) (see the sketch after this list).

  7. Calculate the best-fitting values and errors of the cosmological parameters using the ANN chain. Then, update the parameter space to be learned according to the estimated parameters.

  8. By repeating steps 2-7, one can obtain a stable ANN chain for parameter estimation.

Note that the initial conditions of the cosmological parameters set in step 1 are general ranges of the parameters, which means that the true parameters may be located outside of these ranges. Therefore, the parameter space should be updated in step 7 as training continues.
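As a minimal sketch of steps 6-7 above, the snippet below draws an ANN chain from a predicted Gaussian mixture and summarizes it to update the parameter space; the mixture parameters are illustrative numbers, not outputs of the trained network.

```python
# A sketch of steps 6-7: sample an "ANN chain" from the predicted Gaussian mixture
# (Equation (7)) and summarize it to update the parameter space.
import numpy as np

def sample_mixture(weights, mus, covs, n_samples=10000):
    """Draw samples from a multivariate Gaussian mixture."""
    weights, mus, covs = np.asarray(weights), np.asarray(mus), np.asarray(covs)
    idx = np.random.choice(len(weights), size=n_samples, p=weights)
    return np.array([np.random.multivariate_normal(mus[i], covs[i]) for i in idx])

chain = sample_mixture([0.7, 0.3],
                       [[-1.00, 0.300], [-1.10, 0.330]],
                       [np.diag([0.02, 0.002]), np.diag([0.03, 0.003])])
# A simple mean/std summary; the paper uses the mode and 1-sigma errors of the chain.
best_fit, sigma = chain.mean(axis=0), chain.std(axis=0)
new_range = np.stack([best_fit - 5 * sigma, best_fit + 5 * sigma])   # updated parameter space
```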

3 Application to SN-Ia

In this section, we apply the MDN method to the SN-Ia data set to estimate parameters of the $w$CDM cosmology.

3.1 SN-Ia data

To test the capability of the MDN for estimating parameters from one data set, we constrain $w$ and $\Omega_{\rm m}$ of the $w$CDM model using the Pantheon SN-Ia (Scolnic et al., 2018), which contains 1048 data points in the redshift range [0.01, 2.26]. The distance modulus of the Pantheon SN-Ia can be written as

\mu=m_{B,{\rm corr}}^{*}-M_{B}~, \qquad (23)

where $m_{B,{\rm corr}}^{*}=m_{B}^{*}+\alpha\times x_{1}-\beta\times c+\Delta_{B}$ is the corrected apparent magnitude and $M_{B}$ is the $B$-band absolute magnitude (Scolnic et al., 2018). The variable $\alpha$ is the coefficient of the relation between luminosity and stretch, $\beta$ is the coefficient of the relation between luminosity and color, and $\Delta_{B}$ is a distance correction based on predicted biases from simulations.

The measurement of the Pantheon SN-Ia is the set of corrected apparent magnitudes. Therefore, the input of the MDN is $m_{B,{\rm corr}}^{*}$ for the observational SN-Ia, and $\mu+M_{B}$ for the simulated data generated by the $w$CDM model. Specifically, we have the distance modulus

\mu=5\log_{10}\left(D_{\rm L}(z)/{\rm Mpc}\right)+25~, \qquad (24)

and the luminosity distance

D_{\rm L}(z)=\frac{c(1+z)}{H_{0}}\int_{0}^{z}\frac{{\rm d}z^{\prime}}{E(z^{\prime})}~, \qquad (25)

where $c$ is the speed of light and $H_{0}$ is the Hubble constant. The function $E(z)$ describes the evolution of the Hubble parameter:

E(z)=\sqrt{\Omega_{\rm m}(1+z)^{3}+(1-\Omega_{\rm m})(1+z)^{3(1+w)}}~, \qquad (26)

where $\Omega_{\rm m}$ is the matter density parameter and $w$ is the equation of state of dark energy. Equation (26) assumes spatial flatness. Then we have

\mu+M_{B} = 5\log_{10}\frac{D_{\rm L}(z)}{\rm Mpc}+M_{B}+25~. \qquad (27)

As is well known, the absolute magnitude $M_{B}$ of SN-Ia is strongly degenerate with the Hubble constant $H_{0}$. Therefore, we combine $M_{B}$ and $H_{0}$ into a new parameter and constrain it simultaneously with the cosmological parameters. To do this, we define a dimensionless luminosity distance

\widetilde{D}_{\rm L}(z)=(1+z)\int_{0}^{z}\frac{{\rm d}z^{\prime}}{E(z^{\prime})}~. \qquad (28)

Then Equation (27) can be rewritten as

\mu+M_{B} = 5\log_{10}\frac{(c/H_{0})\,\widetilde{D}_{\rm L}(z)}{\rm Mpc}+M_{B}+25 = 5\log_{10}\widetilde{D}_{\rm L}(z)+\mu_{c}~, \qquad (29)

where $\mu_{c}\equiv 5\log_{10}\left(c/H_{0}/{\rm Mpc}\right)+M_{B}+25$ is a new nuisance parameter to be estimated simultaneously with the cosmological parameters.
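As an illustration of Equations (26)-(29), a minimal sketch of the simulated MDN input $\mu + M_{B} = 5\log_{10}\widetilde{D}_{\rm L}(z) + \mu_{c}$ for a flat $w$CDM model is given below; the redshifts and parameter values are illustrative.

```python
# A sketch of Equations (26)-(29): the corrected apparent magnitudes fed to the MDN
# for a flat wCDM model, written in terms of the dimensionless luminosity distance.
import numpy as np
from scipy.integrate import quad

def dimensionless_DL(z, Om, w):
    """Equation (28): (1+z) * int_0^z dz'/E(z')."""
    E = lambda zp: np.sqrt(Om * (1 + zp) ** 3 + (1 - Om) * (1 + zp) ** (3 * (1 + w)))
    return (1 + z) * quad(lambda zp: 1.0 / E(zp), 0, z)[0]

def model_magnitudes(z_array, Om, w, mu_c):
    """Equation (29): mu + M_B = 5 log10(D_L_tilde) + mu_c."""
    return np.array([5 * np.log10(dimensionless_DL(z, Om, w)) + mu_c for z in z_array])

# e.g. mu_c is roughly 23.86 for H0 = 70 km/s/Mpc and M_B = -19.3
print(model_magnitudes(np.array([0.1, 0.5, 1.0]), Om=0.3, w=-1.0, mu_c=23.86))
```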

Table 1: Best-fitting values of the $w$CDM parameters using the Pantheon SN-Ia data without considering the systematic covariance matrix. The errors are $1\sigma$ C.L.

Parameters          MCMC                    MDN
$w$                 $-1.235\pm 0.141$       $-1.230\pm 0.139$
$\Omega_{\rm m}$    $0.356\pm 0.033$        $0.351\pm 0.034$
Figure 5: One-dimensional and two-dimensional marginalized distributions with $1\sigma$ and $2\sigma$ contours of $w$, $\Omega_{\rm m}$, and $\mu_{c}$ constrained from the Pantheon SN-Ia, with no systematic uncertainty. Three Gaussian components are used here for the MDN method.
Figure 6: Losses of the training set and validation set. There are 3000 samples in the training set, and 500 samples in the validation set.

3.2 Result

We first constrain $w$, $\Omega_{\rm m}$, and the nuisance parameter $\mu_{c}$ simultaneously using the Pantheon SN-Ia data without considering systematic uncertainties. Two methods are considered in our analysis: standard MCMC and the MDN. For the MCMC method, we use the public package emcee (Foreman-Mackey et al., 2013), a Python module that implements the MCMC method, for parameter estimation. During the constraining procedure, an MCMC chain with 100,000 steps is generated after burn-in. The best-fitting values with $1\sigma$ errors of the parameters are calculated from the MCMC chain with coplot (https://github.com/Guo-Jian-Wang/coplot), as shown in Table 1. We also draw the corresponding one-dimensional and two-dimensional contours in Figure 5. For the MDN method, we consider three Gaussian components and use the settings illustrated in Sections 2.1, 2.3, and 2.4. There are 3000 samples in the training set and 500 samples in the validation set. After training, we feed the corrected apparent magnitudes $m_{B,{\rm corr}}^{*}$ to the well-trained MDN model to obtain the parameters of the mixture model, and then obtain an ANN chain of the cosmological parameters. The results are shown in Table 1 and Figure 5. Clearly, the results of the MDN method are almost the same as those of the MCMC method.

We show the losses of the training and validation sets in Figure 6. The loss of the training set is large at the beginning of the training process, and gradually decreases and stabilizes with the epoch. For the validation set, the trend of the loss is similar to that of the training set. After 400 epochs, the loss of the validation set is smaller than that of the training set; clearly, there is no overfitting for the MDN. The reason the loss of the validation set is smaller than that of the training set is that multiple sets of noise are added to each sample; that is, multiple noise realizations are generated and added to each sample to ensure that the MDN learns that the measurement may differ due to the presence of noise. Our analysis shows that adding multiple sets of noise to the measurement benefits the stability of the MDN and makes the final estimated cosmological parameters more accurate, especially for the multibranch MDN.

Furthermore, we consider the systematic uncertainties in our analysis to test the capability of the MDN for dealing with observations with covariance. We use the systematic covariance matrix $\bm{C}_{\rm sys}$ (Scolnic et al., 2018) to generate noise and then add it to the training set (see Section 2.4.1). We then constrain the parameters with the MCMC and MDN (with three Gaussian components) methods. The results are shown in Table 2 and Figure 7. The difference between the MDN results and the MCMC results can then be calculated using

\Delta P=\frac{|P_{\rm MCMC}-P_{\rm MDN}|}{\sigma}~, \qquad (30)

where $\sigma=\sqrt{\sigma^{2}_{\rm MCMC}+\sigma^{2}_{\rm MDN}}$, and $P_{\rm MCMC}$ and $P_{\rm MDN}$ are the best-fitting parameters for the MCMC and MDN methods, respectively. Specifically, the differences for the parameters are $0.174\sigma$, $0.015\sigma$, and $0.076\sigma$, respectively, the difference being slightly larger for the parameter $w$.

Table 2: The same as Table 1, but including systematic uncertainties.

Parameters          MCMC                    MDN
$w$                 $-1.011\pm 0.216$       $-1.060\pm 0.229$
$\Omega_{\rm m}$    $0.327\pm 0.074$        $0.328\pm 0.076$
Figure 7: The same as Figure 5, but including systematic uncertainties.
Figure 8: The same as Figure 7, but nine Gaussian components are used here.

We can see that the shapes of both the one-dimensional and two-dimensional contours based on the MDN method are slightly different from those of the MCMC method. The reason is that the joint probability distribution of $w$ and $\Omega_{\rm m}$ deviates from a multivariate Gaussian distribution, which makes it more difficult to model their joint probability distribution. We take two approaches to address this problem. One is to increase the number of components (the $p_{i}$ functions in Equation (5)), which diversifies the functional forms that the MDN can fit. Specifically, we use nine Gaussian components and reestimate the cosmological parameters with the MDN, and plot the one-dimensional and two-dimensional contours in Figure 8. One can see that the MDN result converges to the MCMC result. Quantitatively, the differences between the MDN results and the MCMC results are $0.051\sigma$, $0.046\sigma$, and $0.025\sigma$. These deviations are smaller than those obtained using three components, especially for the parameter $w$. Another approach is to sample the parameter space more efficiently. As illustrated in Section 2.4.1, samples are generated uniformly in the $\pm 5\sigma_{p}$ range of the posterior distribution, which might not be the most efficient way. We will explore different sampling methods in future work.

Table 3: $1\sigma$ constraints on parameters of the $\Lambda$CDM model using the Planck-2015 CMB temperature and polarization power spectra and the Pantheon SN-Ia.

Parameters              MCMC                        MDN
$H_{0}$                 $67.701\pm 0.633$           $67.698\pm 0.619$
$\Omega_{\rm b}h^{2}$   $0.02231\pm 0.00015$        $0.02229\pm 0.00015$
$\Omega_{\rm c}h^{2}$   $0.11866\pm 0.00141$        $0.11867\pm 0.00138$
$\tau$                  $0.06589\pm 0.01346$        $0.06170\pm 0.01412$
$10^{9}A_{\rm s}$       $2.13366\pm 0.05636$        $2.11930\pm 0.05908$
$n_{\rm s}$             $0.96817\pm 0.00398$        $0.96787\pm 0.00382$
$M_{B}$                 $-19.418\pm 0.017$          $-19.418\pm 0.017$
Figure 9: One-dimensional and two-dimensional marginalized distributions with $1\sigma$ and $2\sigma$ contours of $H_{0}$, $\Omega_{\rm b}h^{2}$, $\Omega_{\rm c}h^{2}$, $\tau$, $A_{\rm s}$, $n_{\rm s}$, and $\sum m_{\nu}$ constrained from the Planck-2015 CMB temperature and polarization power spectra and the Pantheon SN-Ia. Three Gaussian components are used here for the MDN method.
Figure 10: Two-dimensional marginalized distributions with $1\sigma$ and $2\sigma$ contours of $w$ and $\Omega_{\rm m}$ constrained from the Pantheon SN-Ia when considering the systematic uncertainties.

4 Joint Constraint on Parameters

Cosmological research often combines multiple data sets to constrain cosmological parameters. Therefore, in this section, we estimate parameters using the multibranch MDN. Our purpose is to verify the feasibility of this method; for simplicity of our analysis, we only consider the Planck-2015 temperature and polarization power spectra (COM_PowerSpect_CMB_R2.02.fits, http://pla.esac.esa.int/pla/#cosmology) and the Pantheon SN-Ia data to constrain parameters. These parameters include the $B$-band absolute magnitude $M_{B}$ (a nuisance parameter of the SN-Ia) and six parameters of the standard $\Lambda$CDM model, i.e., the Hubble constant $H_{0}$, the baryon density $\Omega_{\rm b}h^{2}$, the cold dark matter density $\Omega_{\rm c}h^{2}$, the optical depth $\tau$, and the amplitude and spectral index of the primordial curvature perturbations ($A_{\rm s}$, $n_{\rm s}$). There are four branches in the multibranch MDN model, and each of them accepts one measurement, namely the TT, EE, and TE power spectra of the CMB, and the corrected apparent magnitudes $m_{B,{\rm corr}}^{*}$ of the Pantheon SN-Ia. We calculate the power spectra of the CMB with the public code CAMB (https://github.com/cmbant/CAMB; Lewis et al., 2000).

We first constrain these parameters using the MCMC method, generating an MCMC chain with 100,000 steps. The best-fitting values with $1\sigma$ errors of these parameters are then calculated using the MCMC chain, as shown in Table 3. The corresponding one-dimensional and two-dimensional marginalized distributions with $1\sigma$ and $2\sigma$ contours are shown in Figure 9.

Then, following the same procedure as in Section 3, we constrain the parameters using the MDN method. In the training process, 3000 samples are used to train the MDN model, and 500 samples are used as a validation set. After training the MDN model, we feed the observational TT, EE, and TE power spectra of the Planck CMB, as well as the corrected apparent magnitudes $m_{B,{\rm corr}}^{*}$ of the Pantheon SN-Ia, to the MDN model to obtain the parameters of the mixture model. Finally, we generate samples using the mixture model to obtain an ANN chain of the cosmological parameters. The best-fitting values and $1\sigma$ errors calculated using the chain are shown in Table 3, and the corresponding one-dimensional and two-dimensional contours are shown in Figure 9. We can see that both the best-fitting values and the $1\sigma$ errors are almost the same as those of the MCMC method. The deviations of the parameters between the two methods are $0.003\sigma$, $0.078\sigma$, $0.007\sigma$, $0.215\sigma$, $0.176\sigma$, $0.055\sigma$, and $0.006\sigma$, which are quite small on average. This indicates that the MDN method can combine multiple observational data sets to jointly constrain cosmological parameters, and is generally applicable to cosmological parameter estimation.

5 Effect of Hyperparameters

There are many hyperparameters that can be selected manually in the MDN model, such as the number of components, the number of hidden layers, the activation function, the number of training samples, and the number of epochs. In this section, we discuss the effect of these hyperparameters on the results. All results in this section are evaluated on observational data or simulated observations that can be taken as the test set.

5.1 The Number of Components

As we have shown in Section 3.2, the number of components of the mixture model may affect the parameter constraints. Here, we further discuss the effect of the number of components on the result. Specifically, with the same procedure, we estimate $w$ and $\Omega_{\rm m}$ of the $w$CDM model with the Pantheon SN-Ia data considering one, three, five, and seven Gaussian components. The two-dimensional contours of the parameters are shown in Figure 10.

For the case of one component, the constraint is

w = -1.100\pm 0.237, \qquad \Omega_{\rm m} = 0.314\pm 0.081. \qquad (31)

This result converges to the MCMC result (see Table 2), and both the best-fitting values and $1\sigma$ errors are similar to those of the MCMC method. However, the two-dimensional contour (the upper left panel of Figure 10) is quite different from the contour of the MCMC method. The reason, as we illustrated in Section 3.2, is that the joint probability distribution of $w$ and $\Omega_{\rm m}$ deviates significantly from a multivariate Gaussian distribution.

For the cases with three, five, and seven components, the results all converge to the MCMC result, and the shapes of the two-dimensional contours are all similar to the MCMC result. This suggests that for $w$ and $\Omega_{\rm m}$ of the $w$CDM model, three Gaussian components are enough to characterize the joint probability distribution. Therefore, we recommend using multiple components, such as three or more, when estimating parameters of other models. However, we do not recommend using a very large number of components (e.g., more than 10), especially for multiple-parameter cases, because the MDN becomes unstable and it then becomes difficult to constrain these components.

Figure 11: Mean deviations between the MDN results and the fiducial values as a function of the number of hidden layers, the number of training samples, the number of epochs, and the activation function.

5.2 The Number of Hidden Layers

To test the effect of hyperparameters in the following sections, we simulate SN-Ia data based on a future Wide-Field Infrared Survey Telescope (WFIRST) experiment (Spergel et al., 2015) in a flat $w$CDM model with the following fiducial parameter values: $w=-1$, $\Omega_{\rm m}=0.3$, and $H_{0}=70~{\rm km\,s^{-1}\,Mpc^{-1}}$. The total number of SN-Ia samples is 2725 in the range $0.1<z<1.7$. We then estimate $w$ and $\Omega_{\rm m}$ using the MDN method with three Gaussian components. Finally, we calculate the mean deviation between the MDN results and the fiducial values:

{\rm Mean\ deviation}=\frac{1}{N}\sum_{i=1}^{N}\frac{|\theta_{i,\rm pred}-\theta_{i,\rm fid}|}{\sigma_{i,\rm pred}}~, \qquad (32)

where $N$ is the number of cosmological parameters, $\theta_{i,\rm fid}$ is the fiducial parameter, and $\theta_{i,\rm pred}$ and $\sigma_{i,\rm pred}$ are the predicted best-fitting value and error of the cosmological parameter, respectively.
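A minimal sketch of Equation (32) follows; the numbers are illustrative only.

```python
# A sketch of Equation (32): the mean deviation between MDN estimates and fiducial values.
import numpy as np

def mean_deviation(theta_pred, sigma_pred, theta_fid):
    """All inputs are length-N arrays over the cosmological parameters."""
    theta_pred, sigma_pred, theta_fid = map(np.asarray, (theta_pred, sigma_pred, theta_fid))
    return np.mean(np.abs(theta_pred - theta_fid) / sigma_pred)

print(mean_deviation([-1.02, 0.305], [0.20, 0.07], [-1.0, 0.3]))  # roughly 0.086 sigma
```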

First, we test the effect of the number of hidden layers of the MDN model. We design five different MDN structures with the number of hidden layers varying from one to five, taking RReLU as the activation function. We train the MDN models with 3000 samples, and after 2000 epochs, we obtain the corresponding five ANN chains by feeding in the simulated observations. We then calculate the mean deviations between the MDN results and the fiducial values. The mean deviation as a function of the number of hidden layers is shown in the upper left panel of Figure 11. The mean deviations of these five cases are $0.072\sigma$, $0.033\sigma$, $0.032\sigma$, $0.094\sigma$, and $0.082\sigma$, respectively. We can see that there is a larger deviation for MDNs with fewer or more hidden layers. An MDN with one hidden layer may not have enough layers to build a mapping between the input data and the output parameters, while an MDN with four or five hidden layers has more difficulty learning the mixture model. We note that the training time increases as the number of hidden layers increases. As a result, MDNs with many hidden layers become impractical because of the training time.

5.3 Activation Function

The choice of activation function may also affect the parameter estimation. Here we select several commonly used activation functions and test their effects on the performance of the MDN model: the rectified linear unit (ReLU), the exponential linear unit (ELU), RReLU, the sigmoid function, and the hyperbolic tangent (Tanh). The ReLU activation function is defined as (see also Nair & Hinton 2010)

f(x)=\begin{cases} x & \text{if } x\geq 0 \\ 0 & \text{if } x<0. \end{cases} \qquad (33)

The ELU is (Clevert et al., 2016)

f(x)=\begin{cases} x & \text{if } x>0 \\ \alpha(e^{x}-1) & \text{if } x\leq 0, \end{cases} \qquad (34)

where $\alpha$ is a parameter that we set to 1. The sigmoid function is defined by (see also Han & Moraga 1995)

f(x)=\frac{1}{1+e^{-x}}~, \qquad (35)

and the hyperbolic tangent is (Malfliet, 1992)

f(x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}~. \qquad (36)

The MDN model used here contains three hidden layers. Using the data simulated in Section 5.2, we train another five MDN models with 3000 samples to estimate $w$ and $\Omega_{\rm m}$. After 2000 epochs, we obtain five ANN chains and then calculate the corresponding deviations. The mean deviation of the parameters between the MDN results and the fiducial values as a function of the activation function is shown in the upper right panel of Figure 11. One can see that the MDN with the sigmoid function has the maximum deviation of $0.158\sigma$, while the deviations of the other MDNs are all smaller than $0.067\sigma$. Therefore, in our specific application of the MDN, the ReLU, ELU, RReLU, and Tanh activation functions can all be used.

5.4 The Number of Training Samples

To test the effect of the number of samples in the training set, we train several MDN models with the number of samples varying from 1000 to 10,000. The MDN model used here contains three hidden layers, the RReLU nonlinear function is taken as the activation function, and the MDN is trained for 2000 epochs. The mean deviation of the parameters between the MDN results and the fiducial values is shown in the lower left panel of Figure 11. We can see that the mean deviation varies in the interval $[0.010\sigma, 0.103\sigma]$ as the number of training samples changes, but most of the deviations are less than $0.055\sigma$. Therefore, in our specific application of the MDN, thousands of samples are enough to train an MDN model well. Note that the training time of an MDN model is related to the number of samples in the training set. Therefore, considering both the performance and the training time of the MDN, the number of samples should be selected reasonably.

5.5 The Number of Epochs

The number of epochs may also affect the performance of the MDN. To test this, we train several MDN models with the number of epochs varying from 500 to 5000. The MDN models used here contain three hidden layers, the activation function is RReLU, and 3000 samples are used to train the models. With the same procedure, we train the models and calculate the deviations between the MDN results and the fiducial values. The mean deviation as a function of the number of epochs is shown in the lower right panel of Figure 11. We can see that as the number of epochs increases, the mean deviation oscillates in the range $[0.015\sigma, 0.059\sigma]$. Also, in our case, there is little improvement from training for more than 2000 epochs. Since the training time is also related to the number of epochs, the number of epochs should also be selected reasonably.

6 Beta Mixture Model

6.1 One Parameter

The analysis above focused on the Gaussian distribution as the unit of the mixture model. Since the unit of the mixture model can be any type of distribution, in this section we use a beta distribution as the unit of the mixture model. A beta mixture model with $K$ components has the form

p(\theta|\bm{d}) = \sum_{i=1}^{K}\omega_{i}\,{\rm Beta}(\theta;\alpha_{i},\beta_{i})
                 = \sum_{i=1}^{K}\omega_{i}\cdot\frac{\Gamma(\alpha_{i}+\beta_{i})}{\Gamma(\alpha_{i})\Gamma(\beta_{i})}\theta^{\alpha_{i}-1}(1-\theta)^{\beta_{i}-1}~, \qquad (37)

where $\omega$ is the mixture weight, $\alpha,\beta>0$ are two shape parameters to be determined, and $\Gamma(\cdot)$ is the gamma function defined as

\Gamma(z)=\int_{0}^{\infty}x^{z-1}e^{-x}\,{\rm d}x~. \qquad (38)

Here we use the Softplus function (Equation (17)) to guarantee that $\alpha$ and $\beta$ are positive. Note that the beta distribution is defined on the interval [0, 1]; thus we normalize the parameters in the training set (i.e., the targets) to the range [0, 1] using the min-max normalization technique:

z=\frac{x-{\rm min}(x)}{{\rm max}(x)-{\rm min}(x)}~. \qquad (39)

The parameters of the beta mixture model can be estimated by minimizing the loss function

\mathcal{L} = \mathbb{E}\left[-\ln\left(\sum_{i=1}^{K}\omega_{i}\times\frac{\Gamma(\alpha_{i}+\beta_{i})}{\Gamma(\alpha_{i})\Gamma(\beta_{i})}\theta^{\alpha_{i}-1}(1-\theta)^{\beta_{i}-1}\right)\right]~. \qquad (40)

For numerical stability, we also use the Log-Sum-Exp trick; Equation (40) can then be rewritten as

\mathcal{L} = \mathbb{E}\left[-\ln\left(\sum_{i=1}^{K}e^{\ln(\omega_{i})+\ln(p_{i}(\theta|\bm{d}))}\right)\right]~, \qquad (41)

where

\ln\left[p_{i}(\theta|\bm{d})\right] = \ln(\Gamma(\alpha_{i}+\beta_{i}))-\ln(\Gamma(\alpha_{i}))-\ln(\Gamma(\beta_{i})) + (\alpha_{i}-1)\ln(\theta)+(\beta_{i}-1)\ln(1-\theta)~. \qquad (42)
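A hedged sketch of Equations (41)-(42) in PyTorch is given below; torch.lgamma is used for the log-gamma terms, and the tensor names and shapes are illustrative assumptions, not the released code.

```python
# A sketch of Equations (41)-(42): the beta-component log-density and the mixture loss.
import torch

def log_beta_component(theta, alpha, beta):
    """theta: (batch, 1), values in [0, 1]; alpha, beta: (batch, K). Returns (batch, K)."""
    log_norm = torch.lgamma(alpha + beta) - torch.lgamma(alpha) - torch.lgamma(beta)
    return log_norm + (alpha - 1) * torch.log(theta) + (beta - 1) * torch.log(1 - theta)

def beta_mdn_loss(log_w, theta, alpha, beta):
    """Equation (41): negative log-likelihood evaluated with the Log-Sum-Exp trick."""
    return -torch.logsumexp(log_w + log_beta_component(theta, alpha, beta), dim=1).mean()
```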
Table 4: Best-fitting values of $\Omega_{\rm m}$ of the $w$CDM model using the Pantheon SN-Ia. Here $w$ is manually set to $-1$ or $-0.5$. The errors are $1\sigma$ C.L. MDN-Gaussian refers to the MDN with Gaussian components, and MDN-Beta refers to the MDN with beta components.

Methods          $w=-1$                              $w=-0.5$
MCMC             $\Omega_{\rm m}=0.397\pm 0.010$     $\Omega_{\rm m}=0.0003_{-0.0002}^{+0.0151}$
MDN-Gaussian     $\Omega_{\rm m}=0.396\pm 0.009$     $\Omega_{\rm m}=0.0033_{-0.0018}^{+0.0134}$
MDN-Beta         $\Omega_{\rm m}=0.397\pm 0.010$     $\Omega_{\rm m}=0.0003_{-0.0002}^{+0.0150}$
Figure 12: One-dimensional distributions of $\Omega_{\rm m}$ constrained from the Pantheon SN-Ia, where the blue solid line is based on the MDN method with three Gaussian components and the red solid line is based on the MDN method with one beta component. Here $w$ is set to $-1$.
Figure 13: The same as Figure 12, but now $w$ is manually set to $-0.5$.

Since the beta distribution is for a single random variable, we test the capability of the beta mixture model for parameter inference by using the Pantheon SN-Ia to constrain $\Omega_{\rm m}$ of the $w$CDM model. In our analysis, the Hubble constant is set to $H_{0}=70~{\rm km\,s^{-1}\,Mpc^{-1}}$, and the absolute magnitude is set to $M_{B}=-19.3$. We consider two cases for the equation of state of dark energy: $w=-1$ and $w=-0.5$. The case $w=-0.5$ is set entirely by hand, to test the situation where the posterior distribution of the parameter is truncated by a physical limit. We constrain $\Omega_{\rm m}$ using the MCMC method and the MDN method. For the MDN method, we consider two mixture models: the Gaussian mixture model and the beta mixture model.

For the case of $w=-1$, we first constrain $\Omega_{\rm m}$ using the MCMC method. After generating an MCMC chain, the best-fitting value with the $1\sigma$ error of $\Omega_{\rm m}$ can be calculated (see Table 4). The corresponding marginalized distribution is shown in Figure 12 with a green dashed line. We then use the MDN method with three Gaussian components to constrain $\Omega_{\rm m}$ and find that the result (the blue solid line in Figure 12) is almost the same as that of the MCMC method for both the best-fitting value and the $1\sigma$ error. We also constrain $\Omega_{\rm m}$ using the MDN method with one beta component and obtain the same result as the MCMC method.

With the same procedure, we constrain $\Omega_{\rm m}$ for the case of $w=-0.5$. The result of the MCMC method is shown in Table 4, and the marginalized distribution is shown in Figure 13 with a green dashed line. We can see that the distribution is truncated due to the physical limit. We then constrain $\Omega_{\rm m}$ with the MDN method with Gaussian components and find that both the best-fitting value and the $1\sigma$ error deviate slightly from the MCMC result, and the shape of the distribution (the blue solid line in Figure 13) differs slightly from that of the MCMC method. Finally, the result for $\Omega_{\rm m}$ based on the MDN method with beta components is almost the same as that of the MCMC method, and the shape of the distribution (the red solid line in Figure 13) is consistent with the one based on the MCMC method.

These results show that MDN models with beta components are able to constrain parameters, and even have advantages over MDN models with Gaussian components in the case of a truncated distribution. In this case, the reason the beta mixture model performs better than the Gaussian mixture model is that the beta distribution has a greater variety of shapes (see Figure 14 for examples), which enables it to better fit the posterior distribution of the parameters.

Figure 14: Examples of the beta probability density function with different shape parameters, calculated using the beta function in Equation (37).
Figure 15: One-dimensional distributions of $w$, $\Omega_{\rm m}$, and $\mu_{c}$ constrained from the Pantheon SN-Ia.

6.2 Multiple Parameters

We now switch to multiparameter estimation. If the parameters are not independent, the beta mixture model cannot be used for MDN parameter estimation. But if the correlations between parameters are negligible, the beta mixture model can be used for multiple-parameter tasks to obtain the one-dimensional distribution of each parameter. In this case, the probability density in Equation (37) becomes

p(\bm{\theta}|\bm{d}) = \prod_{j=1}^{N}\sum_{i=1}^{K}\omega_{i}\cdot\frac{\Gamma(\alpha_{i}+\beta_{i})}{\Gamma(\alpha_{i})\Gamma(\beta_{i})}\theta_{j}^{\alpha_{i}-1}(1-\theta_{j})^{\beta_{i}-1}~, \qquad (43)

where $N$ is the number of parameters. To test this, we constrain $w$, $\Omega_{\rm m}$, and the nuisance parameter $\mu_{c}$ using the Pantheon SN-Ia. The Softplus function is used as the activation function for the stability of the MDN. The one-dimensional distributions of these parameters are shown in Figure 15. We can see that the results based on the MDN with beta components are almost the same as those of the MCMC method. However, our analysis shows that the MDN can be unstable when using Equation (43) as the probability density. The reason may be that there are correlations between the parameters that are not considered by the MDN. Besides, there are many components ($N\times K$ in total) in the probability density of Equation (43), especially for cases with many cosmological parameters, which also increases the instability of the MDN. Therefore, the beta mixture model based on Equation (43) should be investigated further.

As a comparison, we also constrain $w$, $\Omega_{\rm m}$, and $\mu_{c}$ without considering correlations between them, using an MDN with a Gaussian mixture model. In this case, the probability density in Equation (6) becomes

p(\bm{\theta}|\bm{d}) = \prod_{j=1}^{N}\sum_{i=1}^{K}\omega_{i}\cdot\frac{1}{\sqrt{2\pi\sigma^{2}_{i}}}\,e^{-\frac{(\theta_{j}-\mu_{i})^{2}}{2\sigma^{2}_{i}}}~. \qquad (44)

The one-dimensional distributions of these three parameters are shown as green dotted lines in Figure 15. Clearly, this result is similar to that based on the beta mixture model. We note that the Gaussian mixture model performs poorly for parameters with truncated distributions (see Figure 13), and that correlations between parameters are not obtained in this case. Therefore, the full Gaussian mixture model based on Equation (7) is recommended for multiparameter estimation.

7 Discussion

The MDN method proposed in this paper combines an ANN and a mixture model. The main idea is to assume that the posterior distribution of the parameters is a mixture of several simple elemental distributions, such as the Gaussian distribution. Once the basic distribution is determined, the posterior distribution of the parameters can be calculated. Therefore, the role of the neural network here is to determine the basic distributions. In general, the MDN method models a connection between measurements and parameters, where the function of the mixture model is to model the parameters and the function of the neural network is to extract information from the measurement. The neural network used here is a fully connected network that deals with one-dimensional data. In future work, we will generalize this technique to two- and three-dimensional cosmological data sets.

As illustrated in Section 2.4.2, the parameter space is updated during the training process, so it gradually approaches and covers the actual values of the cosmological parameters. Therefore, the MDN method is not sensitive to the initial conditions, meaning that the initial conditions can be set freely with an arbitrary range of parameters. This is an advantage over the MCMC method and is beneficial for models that lack prior knowledge.

Many hyperparameters can be chosen when using the MDN, such as the number of hidden layers, the activation function, the number of training samples, the number of epochs, and even the number of components. The analysis in Section 5 uses specific simulated (or observational) data to test the effect of these hyperparameters. Due to the selection of hyperparameters and the randomization in the training process (e.g., the initialization of the network parameters $w$ and $b$ in Equation (1), and the random noise added to the training set), the final estimated parameters exhibit some instability. However, this shortcoming does not affect the fact that the MDN can provide an accurate estimation with a mean deviation of $\mathcal{O}(10^{-2}\sigma)$ after appropriate hyperparameters are selected (e.g., using thousands of samples and thousands of epochs).

8 Conclusions

In this work, we conducted parameter inference with the MDN method on both one data set and multiple data sets to achieve joint constraints on parameters using the Pantheon SN-Ia and the CMB power spectra of the Planck mission. Our analysis shows that a well-trained MDN model can estimate parameters at the same level of accuracy as the traditional MCMC method. Furthermore, compared to the MCMC method, the MDN method can provide accurate parameter estimates with far fewer forward simulation samples.

We considered two mixture models in the MDN model: the Gaussian mixture model and the beta mixture model. The Gaussian mixture model can theoretically be applied to any parameter, while the beta mixture model is only applicable to a case with one parameter or multiple independent parameters. Our results show that the beta mixture model outperforms the Gaussian mixture model for parameters with truncated distributions.

The MDN is an ANN-based method, making it a flexible method that is extendable to two- and three-dimensional data, which may be of great use in extracting information from high-dimensional data.

9 Acknowledgement

Y.-Z.M. acknowledges the support of the National Research Foundation with grant Nos. 120385 and 120378. J.-Q.X. is supported by the National Science Foundation of China under grant No. U1931202. All computations were carried out using the computational cluster resources at the Centre for High Performance Computing, Cape Town, South Africa. This work was part of the research program “New Insights into Astrophysics and Cosmology with Theoretical Models Confronting Observational Data” of the National Institute for Theoretical and Computational Sciences of South Africa.

References