Roundtrip: A Deep Generative Neural Density Estimator
Abstract
Density estimation is a fundamental problem in both statistics and machine learning. In this study, we propose Roundtrip as a general-purpose neural density estimator based on deep generative models. Roundtrip retains the generative power of generative adversarial networks (GANs) but also provides estimates of density values. Unlike previous neural density estimators that put stringent conditions on the transformation from the latent space to the data space, Roundtrip enables the use of much more general mappings. In a series of experiments, Roundtrip (source code available at https://github.com/kimmo1019/Roundtrip) achieves state-of-the-art performance in a diverse range of density estimation tasks.
1 Introduction
Density estimation is a fundamental problem in statistics. Let $p_x(\mathbf{x})$ be a density on the $n$-dimensional Euclidean space $\mathbb{R}^n$. Our task is to estimate this density based on a set of independently and identically distributed data points drawn from it.
Traditional density estimators such as histograms (Scott 1979; Lugosi, Nobel et al. 1996) and kernel density estimators (KDEs (Rosenblatt 1956; Parzen 1962)) typically perform well only in low dimension (e.g., when $n$ is small). Recently, neural network-based approaches were proposed for density estimation and yielded promising results for high-dimensional problems (e.g., when each data point is an image). There are mainly two families of such neural density estimators: autoregressive models (Uria et al. 2016; Germain et al. 2015; Papamakarios, Pavlakou, and Murray 2017) and normalizing flows (Rezende and Mohamed 2015; Ballé, Laparra, and Simoncelli 2015; Dinh, Sohl-Dickstein, and Bengio 2016). Autoregression-based neural density estimators decompose the density into a product of conditional densities based on the probability chain rule $p(\mathbf{x}) = \prod_{i} p(x_i \mid \mathbf{x}_{1:i-1})$. Each conditional probability is modeled by a parametric density (e.g., a Gaussian or a mixture of Gaussians) whose parameters are learned by neural networks. Density estimators based on normalizing flows represent $\mathbf{x}$ as an invertible transformation of a latent variable $\mathbf{z}$ with known density, where the invertible transformation is a composition of a series of simple functions whose Jacobian is easy to compute. The parameters of these component functions are then learned by neural networks.
As suggested by (Kingma et al. 2016), both of the above types of neural density estimators can be viewed under the following general framework. Given a differentiable and invertible mapping $G: \mathbb{R}^n \to \mathbb{R}^n$ and a base density $p_z(\mathbf{z})$, the density of $\mathbf{x} = G(\mathbf{z})$ can be represented through the change of variable rule as
$$p_x(\mathbf{x}) = p_z(\mathbf{z})\left|\det\frac{\partial G(\mathbf{z})}{\partial \mathbf{z}}\right|^{-1}, \qquad \mathbf{z} = G^{-1}(\mathbf{x}), \tag{1}$$
where $\partial G(\mathbf{z})/\partial\mathbf{z}$ is the Jacobian matrix of the function $G$ at point $\mathbf{z}$. Density estimation at $\mathbf{x}$ can be solved if the base density is known and the determinant and inverse of the Jacobian matrix are feasible to calculate. To achieve this, previous neural density estimators have to carefully design model architectures that impose constraints on the Jacobian matrix. For example, (Papamakarios, Pavlakou, and Murray 2017; Dinh, Sohl-Dickstein, and Bengio 2016; Kingma et al. 2016) require the Jacobian to be triangular, (Berg et al. 2018) constructed low-rank perturbations of a diagonal matrix as the Jacobian, and (Karami et al. 2018) proposed a circular convolution where the Jacobian is a circulant matrix. These strong constraints diminish the expressiveness of the neural networks, which may lead to poor performance. For example, autoregressive neural density estimators, which learn the conditionals $p(x_i \mid \mathbf{x}_{1:i-1})$, are naturally sensitive to the order of the features. Moreover, the change of variable rule is not applicable when the base density and the target density have different dimensions. However, experience from deep generative models (e.g., GAN (Goodfellow et al. 2014) and VAE (Kingma and Welling 2013)) suggests that it is often desirable to use a latent space of smaller dimension than the data space.
To overcome the limitations above, we propose a new neural density estimator called Roundtrip. Our approach is motivated by recent advances in deep generative neural networks (Goodfellow et al. 2014; Zhu et al. 2017; Makhzani et al. 2015). Roundtrip differs from previous neural density estimators in two ways. 1) It allows the direct use of a deep generative network to model the transformation from the latent variable space to the data space, while previous neural density estimators use neural networks only to represent the component functions that build up an invertible transformation. 2) It can efficiently model data densities that are concentrated near learned manifolds, which is difficult to achieve with previous approaches as they require the latent space to have the same dimension as the data space. Importantly, we also provide methods, based on either importance sampling or Laplace approximation, for the point-wise evaluation of the density estimate. We summarize our major contributions in this study as follows.
• We proposed Roundtrip as a general-purpose neural density estimator based on deep generative models. Roundtrip requires less restrictive model assumptions compared to previous neural density estimators.
• We provided theoretical guarantees for the feasibility of density estimation with deep generative models. Specifically, we proved that the principle underlying previous neural density estimators can be regarded as a special case of our Roundtrip framework (see proof in Appendix B).
• We demonstrated the state-of-the-art performance of the Roundtrip model through a series of experiments, including density estimation tasks in simulations as well as in real data applications ranging from image generation to outlier detection.
2 Methods
2.1 Roundtrip overview
The key idea of Roundtrip is to approximate the target distribution as a convolution of a Gaussian with a distribution induced on a manifold by transforming a base distribution, where the transformation is learned by jointly training two GAN models (Figure 1). Density estimation is performed offline, after the two GAN models in Roundtrip have been trained. Next, we first introduce our framework for modeling densities with deep generative networks before providing details on the training strategy and model architecture.
2.2 Density modeling in Roundtrip
Consider two random variables $\mathbf{z} \in \mathbb{R}^m$ and $\mathbf{x} \in \mathbb{R}^n$, where $\mathbf{z}$ has a known density $p_z(\mathbf{z})$ (e.g., standard Gaussian) and $\mathbf{x}$ is distributed according to a target density $p_x(\mathbf{x})$ that we intend to estimate based on observations from it. We introduce two functions, $G: \mathbb{R}^m \to \mathbb{R}^n$ and $H: \mathbb{R}^n \to \mathbb{R}^m$, for learning a forward and a backward mapping between the two distributions. These two functions are learned by two neural networks (Figure 1). The model architecture is similar to CycleGAN (Zhu et al. 2017), but we exploit it for the new task of density estimation. To do this, we denote $\tilde{\mathbf{x}} = G(\mathbf{z})$ and $\tilde{\mathbf{z}} = H(\mathbf{x})$, and assume that the forward mapping error follows a Gaussian distribution
$$\mathbf{x} = G(\mathbf{z}) + \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I}_n). \tag{2}$$
Typically, we set $m < n$, which means that $\tilde{\mathbf{x}} = G(\mathbf{z})$ takes values in a manifold of $\mathbb{R}^n$ with intrinsic dimension at most $m$. Basically, this roundtrip model utilizes $G$ to produce a manifold and then approximates the target density as a mixture of Gaussians, where the mixing density is the density induced on the manifold. In what follows, we will set $p_z(\mathbf{z})$ to be the standard Gaussian $\mathcal{N}(\mathbf{0}, \mathbf{I}_m)$. Based on the model assumption, $\mathbf{x} \mid \mathbf{z} \sim \mathcal{N}(G(\mathbf{z}), \sigma^2\mathbf{I}_n)$. Then, the target density can be expressed as
$$p_x(\mathbf{x}) = \int p(\mathbf{x} \mid \mathbf{z})\, p_z(\mathbf{z})\, d\mathbf{z} = \mathbb{E}_{\mathbf{z}\sim p_z(\mathbf{z})}\left[p(\mathbf{x}\mid\mathbf{z})\right], \tag{3}$$
where $p(\mathbf{x}\mid\mathbf{z}) = (2\pi\sigma^2)^{-n/2}\exp\left(-\|\mathbf{x}-G(\mathbf{z})\|^2/(2\sigma^2)\right)$. The density estimation problem has thus been transformed into computing the integral in equation (3). We postpone model training details to Sections 2.3-2.5. Assuming that $G$ and $H$ have already been well learned, we now discuss how to evaluate the integral in (3) by either importance sampling or Laplace approximation.

Importance sampling
The simplest way to estimate (3) is the empirical expectation $\frac{1}{N}\sum_{i=1}^{N} p(\mathbf{x}\mid\mathbf{z}_i)$, where $\mathbf{z}_i \sim p_z(\mathbf{z})$. However, this is usually extremely inefficient, as $p(\mathbf{x}\mid\mathbf{z})$ typically takes low values at most values of $\mathbf{z}$ sampled from $p_z(\mathbf{z})$. Thus, we propose to sample from an importance distribution $q(\mathbf{z})$ instead of the base density and use the importance-weighted estimate
$$\hat{p}_x(\mathbf{x}) = \frac{1}{N}\sum_{i=1}^{N} w(\mathbf{z}_i)\, p(\mathbf{x}\mid\mathbf{z}_i), \qquad w(\mathbf{z}) = \frac{p_z(\mathbf{z})}{q(\mathbf{z})}, \qquad \mathbf{z}_i \sim q(\mathbf{z}), \tag{4}$$
where $N$ is the sample size, $w(\mathbf{z})$ is the importance weight function, and $\mathbf{z}_i$ are samples from $q(\mathbf{z})$. We propose to set $q(\mathbf{z})$ to be a Student's t distribution centered at $\tilde{\mathbf{z}} = H(\mathbf{x})$. This choice is motivated by the following considerations. 1) For a given $\mathbf{x}$, $p(\mathbf{x}\mid\mathbf{z})$ is likely to be maximized at values of $\mathbf{z}$ near $H(\mathbf{x})$. 2) The Student's t distribution has a heavier tail than the Gaussian, which provides control of the variance of the summand in (4). More details, including an illustrative example of importance sampling, are provided in Appendix A.
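To make the estimator in (4) concrete, the following is a minimal NumPy/SciPy sketch of the importance-sampling evaluation, assuming `G` and `H` are already-trained callables that map arrays of latent codes to data space and data points to latent space, respectively. The function name, the degrees of freedom of the Student's t proposal, and the use of `scipy.stats.multivariate_t` are illustrative choices of ours, not the authors' implementation.

```python
import numpy as np
from scipy.stats import multivariate_t, multivariate_normal

def roundtrip_density_is(x, G, H, sigma, m, n_samples=40000, df=5):
    """Importance-sampling estimate of p_x(x), cf. equation (4).

    G: callable, maps (N, m) latent codes to (N, n) data points.
    H: callable, maps (1, n) data points to (1, m) latent codes.
    sigma: standard deviation of the Gaussian error in equation (2).
    """
    n = x.shape[0]
    z_center = H(x[None, :]).ravel()                 # center the proposal at H(x)
    proposal = multivariate_t(loc=z_center, shape=np.eye(m), df=df)
    z = proposal.rvs(size=n_samples)                 # z_i ~ q(z), shape (n_samples, m)

    log_q = proposal.logpdf(z)                                   # log q(z_i)
    log_pz = multivariate_normal(mean=np.zeros(m)).logpdf(z)     # log p_z(z_i)
    # log p(x | z_i) for the isotropic Gaussian N(G(z_i), sigma^2 I_n)
    sq_err = np.sum((x[None, :] - G(z)) ** 2, axis=1)
    log_px_z = -0.5 * sq_err / sigma**2 - 0.5 * n * np.log(2 * np.pi * sigma**2)

    # average the importance-weighted terms in log space for numerical stability
    log_terms = log_px_z + log_pz - log_q
    return np.exp(np.logaddexp.reduce(log_terms) - np.log(n_samples))
```

The default of 40,000 samples mirrors the setting reported in Section 3.1.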
Laplace approximation
We can also obtain an approximation to the integral in (3) by Laplace's method. To achieve this goal, we linearize $G(\mathbf{z})$ around $\tilde{\mathbf{z}} = H(\mathbf{x})$, which yields a quadratic approximation to $\|\mathbf{x}-G(\mathbf{z})\|^2$. The linearization can be represented as
$$G(\mathbf{z}) \approx G(\tilde{\mathbf{z}}) + J_{\tilde{\mathbf{z}}}\,(\mathbf{z}-\tilde{\mathbf{z}}), \tag{5}$$
where $J_{\tilde{\mathbf{z}}} = \partial G(\mathbf{z})/\partial\mathbf{z}\,\big|_{\mathbf{z}=\tilde{\mathbf{z}}}$ is the Jacobian matrix of $G$ at $\tilde{\mathbf{z}}$. Substituting (5) into $\|\mathbf{x}-G(\mathbf{z})\|^2$, we have
$$\|\mathbf{x}-G(\mathbf{z})\|^2 \approx \|\mathbf{x}-G(\tilde{\mathbf{z}}) - J_{\tilde{\mathbf{z}}}\,(\mathbf{z}-\tilde{\mathbf{z}})\|^2. \tag{6}$$
Next, we make the variable substitutions
$$\mathbf{w} = \mathbf{z} - \tilde{\mathbf{z}}, \qquad \mathbf{r} = \mathbf{x} - G(\tilde{\mathbf{z}}). \tag{7}$$
Substituting equations (6) and (7) into the integrand $p(\mathbf{x}\mid\mathbf{z})\,p_z(\mathbf{z})$ of equation (3), we get
$$p(\mathbf{x}\mid\mathbf{z})\,p_z(\mathbf{z}) \approx \frac{1}{(2\pi\sigma^2)^{n/2}(2\pi)^{m/2}}\exp\left\{-\frac{1}{2}\left[\mathbf{w}^{T}\!\left(\frac{J_{\tilde{\mathbf{z}}}^{T}J_{\tilde{\mathbf{z}}}}{\sigma^2}+\mathbf{I}_m\right)\mathbf{w} - 2\left(\frac{\mathbf{r}^{T}J_{\tilde{\mathbf{z}}}}{\sigma^2}-\tilde{\mathbf{z}}^{T}\right)\mathbf{w} + \frac{\|\mathbf{r}\|^2}{\sigma^2}+\|\tilde{\mathbf{z}}\|^2\right]\right\}, \tag{8}$$
where $\mathbf{I}_m$ is the $m\times m$ identity matrix and $\sigma$ is the standard deviation in the model assumption (2). The integral over $\mathbf{z}$ in (3) can now be solved by constructing a multivariate Gaussian distribution in $\mathbf{w}$ from (8) as follows
$$\mathcal{N}(\mathbf{w};\boldsymbol{\mu},\boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{m/2}|\boldsymbol{\Sigma}|^{1/2}}\exp\left\{-\frac{1}{2}(\mathbf{w}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{w}-\boldsymbol{\mu})\right\}, \tag{9}$$
where $|\boldsymbol{\Sigma}|$ denotes the determinant of the covariance matrix $\boldsymbol{\Sigma}$. Matching (9) to the quadratic form in (8), the mean and covariance matrix of this multivariate Gaussian should be
$$\boldsymbol{\Sigma} = \left(\frac{J_{\tilde{\mathbf{z}}}^{T}J_{\tilde{\mathbf{z}}}}{\sigma^2}+\mathbf{I}_m\right)^{-1}, \qquad \boldsymbol{\mu} = \boldsymbol{\Sigma}\left(\frac{J_{\tilde{\mathbf{z}}}^{T}\mathbf{r}}{\sigma^2}-\tilde{\mathbf{z}}\right). \tag{10}$$
Substituting (9) into (3), we obtain the final closed-form solution for the density at point $\mathbf{x}$
$$p_x(\mathbf{x}) \approx \frac{|\boldsymbol{\Sigma}|^{1/2}}{(2\pi\sigma^2)^{n/2}}\exp\left\{-\frac{1}{2}\left(\frac{\|\mathbf{r}\|^2}{\sigma^2}+\|\tilde{\mathbf{z}}\|^2-\boldsymbol{\mu}^{T}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}\right)\right\}, \tag{11}$$
where $\mathbf{r} = \mathbf{x} - G(\tilde{\mathbf{z}})$ as defined in (7), $\tilde{\mathbf{z}} = H(\mathbf{x})$, $J_{\tilde{\mathbf{z}}}$ is the Jacobian matrix of $G$ at $\tilde{\mathbf{z}}$, and $|\cdot|$ denotes the matrix determinant. Interestingly, we note that the change of variable rule in (1), where $G$ is a differentiable and invertible function, is a special case of the closed-form solution (11) if the following three conditions are satisfied: 1) $m = n$, 2) $H = G^{-1}$, 3) $\sigma \to 0$. The proof is given in Appendix B.
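For illustration, the following sketch evaluates the closed form (11) numerically, assuming `G` and `H` are trained callables and using a finite-difference Jacobian; the helper names and the finite-difference step are our own choices rather than the authors' implementation.

```python
import numpy as np

def numerical_jacobian(G, z, eps=1e-5):
    """Finite-difference Jacobian of G at z; returns an (n, m) matrix."""
    base = G(z[None, :]).ravel()
    cols = []
    for j in range(z.shape[0]):
        z_eps = z.copy()
        z_eps[j] += eps
        cols.append((G(z_eps[None, :]).ravel() - base) / eps)
    return np.stack(cols, axis=1)

def roundtrip_density_laplace(x, G, H, sigma):
    """Laplace-approximation estimate of p_x(x), cf. equations (7)-(11)."""
    z_tilde = H(x[None, :]).ravel()              # expansion point H(x)
    J = numerical_jacobian(G, z_tilde)           # Jacobian of G at H(x)
    r = x - G(z_tilde[None, :]).ravel()          # residual r = x - G(z~)
    m, n = z_tilde.shape[0], x.shape[0]

    A = J.T @ J / sigma**2 + np.eye(m)           # Sigma^{-1} in (10)
    b = J.T @ r / sigma**2 - z_tilde
    mu = np.linalg.solve(A, b)                   # mean in (10)

    # closed form (11): |Sigma|^{1/2} / (2*pi*sigma^2)^{n/2} * exp(-quad/2)
    quad = r @ r / sigma**2 + z_tilde @ z_tilde - mu @ A @ mu
    _, logdet_A = np.linalg.slogdet(A)           # log|Sigma| = -log|A|
    log_p = -0.5 * logdet_A - 0.5 * n * np.log(2 * np.pi * sigma**2) - 0.5 * quad
    return np.exp(log_p)
```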
In the remainder of Section 2, we discuss how to learn $G$ and $H$ from the observed data.
2.3 Adversarial training loss
The Roundtrip model consists of a pair of GAN models. In the forward GAN, the generator $G$ aims at generating samples that are similar to the observed data, while the discriminator $D_x$ tries to discern observed data (positive) from generated samples (negative). In the backward GAN, the mapping function $H$ and the discriminator $D_z$ aim to transform the data distribution so that it approximates the base distribution in the latent space. Each discriminator can be considered a binary classifier that asserts an input data point to be positive (1) or negative (0). The objective loss functions of the above four neural networks ($G$, $H$, $D_x$, and $D_z$) in the training process can be represented as follows
$$\begin{aligned}
\mathcal{L}(D_x) &= \mathbb{E}_{\mathbf{x}\sim p_x(\mathbf{x})}\big[(D_x(\mathbf{x})-1)^2\big] + \mathbb{E}_{\mathbf{z}\sim p_z(\mathbf{z})}\big[D_x(G(\mathbf{z}))^2\big], \\
\mathcal{L}(D_z) &= \mathbb{E}_{\mathbf{z}\sim p_z(\mathbf{z})}\big[(D_z(\mathbf{z})-1)^2\big] + \mathbb{E}_{\mathbf{x}\sim p_x(\mathbf{x})}\big[D_z(H(\mathbf{x}))^2\big], \\
\mathcal{L}(G) &= \mathbb{E}_{\mathbf{z}\sim p_z(\mathbf{z})}\big[(D_x(G(\mathbf{z}))-1)^2\big], \\
\mathcal{L}(H) &= \mathbb{E}_{\mathbf{x}\sim p_x(\mathbf{x})}\big[(D_z(H(\mathbf{x}))-1)^2\big],
\end{aligned} \tag{12}$$
where $\mathbf{z}$ and $\mathbf{x}$ are sampled from the base density $p_z(\mathbf{z})$ and the data density $p_x(\mathbf{x})$, respectively. In practice, sampling $\mathbf{x}$ from the data density can be regarded as randomly sampling from the observed data with replacement. Minimizing the loss of a generator (e.g., $G$) and the loss of the corresponding discriminator (e.g., $D_x$) are opposing objectives, as the two networks compete with each other during the training process. Note that the least-squares loss functions used in equation (12) have been discussed in detail in LSGAN (Mao et al. 2017).
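As a concrete illustration of these least-squares objectives, the sketch below computes the four losses of (12) on a mini-batch, with the networks represented as plain callables that return per-sample discriminator scores; this is a framework-agnostic sketch of the loss formulas, not the authors' TensorFlow code.

```python
import numpy as np

def lsgan_losses(G, H, Dx, Dz, x_batch, z_batch):
    """Least-squares GAN losses of equation (12) on one mini-batch."""
    x_fake = G(z_batch)        # latent samples pushed to the data space
    z_fake = H(x_batch)        # data samples pushed to the latent space

    # discriminators: real samples should score 1, generated samples 0
    loss_Dx = np.mean((Dx(x_batch) - 1.0) ** 2) + np.mean(Dx(x_fake) ** 2)
    loss_Dz = np.mean((Dz(z_batch) - 1.0) ** 2) + np.mean(Dz(z_fake) ** 2)

    # generators: try to make the corresponding discriminator output 1
    loss_G = np.mean((Dx(x_fake) - 1.0) ** 2)
    loss_H = np.mean((Dz(z_fake) - 1.0) ** 2)
    return loss_G, loss_H, loss_Dx, loss_Dz
```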
2.4 Roundtrip loss
During training, we also aim to minimize the roundtrip losses, defined as $\|\mathbf{z}-H(G(\mathbf{z}))\|$ and $\|\mathbf{x}-G(H(\mathbf{x}))\|$, where $\mathbf{z}$ and $\mathbf{x}$ are sampled from the base density and the data density, respectively. The principle is to minimize the distance traveled when a data point goes through a roundtrip transformation between the two data domains. If $m < n$, this ensures that $G(H(\mathbf{x}))$ stays close to the projection of $\mathbf{x}$ onto the manifold induced by $G$, and that $H(G(\mathbf{z}))$ stays close to $\mathbf{z}$. In practice, we use the $L_2$ loss for both terms, as minimizing the $L_2$ loss corresponds to a Gaussian error model (Mathieu, Couprie, and LeCun 2015), which exactly matches our model assumption. We denote the roundtrip loss as
$$\mathcal{L}_{RT}(G,H) = \alpha\,\mathbb{E}_{\mathbf{z}\sim p_z(\mathbf{z})}\|\mathbf{z}-H(G(\mathbf{z}))\|_2^2 + \beta\,\mathbb{E}_{\mathbf{x}\sim p_x(\mathbf{x})}\|\mathbf{x}-G(H(\mathbf{x}))\|_2^2, \tag{13}$$
where $\alpha$ and $\beta$ are two constant coefficients. The idea of a roundtrip loss, which exploits transitivity for regularizing structured data, can also be found in previous works (Zhu et al. 2017; Yi et al. 2017).
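A matching sketch of the roundtrip loss in (13), again with `G` and `H` as plain callables; the coefficients default to the values $\alpha=\beta=10$ reported in Section 3.1.

```python
import numpy as np

def roundtrip_loss(G, H, x_batch, z_batch, alpha=10.0, beta=10.0):
    """Roundtrip (cycle-consistency) loss of equation (13) on one mini-batch."""
    z_cycle = H(G(z_batch))    # z -> x~ -> z round trip
    x_cycle = G(H(x_batch))    # x -> z~ -> x round trip
    loss_z = np.mean(np.sum((z_batch - z_cycle) ** 2, axis=1))
    loss_x = np.mean(np.sum((x_batch - x_cycle) ** 2, axis=1))
    return alpha * loss_z + beta * loss_x
```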
2.5 Full training loss
Combining the adversarial training loss and the roundtrip loss, we obtain the full training losses for the generator networks and the discriminator networks as $\mathcal{L}(G,H) = \mathcal{L}(G)+\mathcal{L}(H)+\mathcal{L}_{RT}(G,H)$ and $\mathcal{L}(D_x,D_z) = \mathcal{L}(D_x)+\mathcal{L}(D_z)$, respectively. To achieve joint training of the two GAN models, we iteratively update the parameters of the two generative models ($G$ and $H$) and of the two discriminative models ($D_x$ and $D_z$). Thus, the overall iterative optimization problem in Roundtrip can be represented as
$$G^{*},H^{*} = \arg\min_{G,H}\,\mathcal{L}(G,H), \qquad D_x^{*},D_z^{*} = \arg\min_{D_x,D_z}\,\mathcal{L}(D_x,D_z). \tag{14}$$
After the iterative model training process, the learned networks are used as the $G$ and $H$ functions in the density estimation procedure. Note that traditional GAN-based models lack a robust stopping criterion for model training. In contrast, the training of Roundtrip can be easily evaluated by monitoring the average log-likelihood on a validation set; we stop training Roundtrip when there is no further improvement of the average log-likelihood on the validation set (see training details in Section 3.1).
2.6 Model architecture
The model architecture of Roundtrip is highly flexible. In most cases, when it is used for density estimation with vector-valued data, we use fully-connected networks for both the generative and the discriminative networks. Specifically, the $G$ network contains 10 fully-connected layers with 512 hidden nodes per layer, while the $H$ network contains 10 fully-connected layers with 256 hidden nodes per layer. The $D_x$ network contains 4 fully-connected layers with 256 hidden nodes per layer, while the $D_z$ network contains 2 fully-connected layers with 128 hidden nodes per layer. The leaky-ReLU activation function is used as the non-linear transformation in each hidden layer.
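The fully-connected architecture described above can be sketched in Keras as follows. This is a minimal sketch: the layer counts and widths follow our reading of the description, while the output dimensions, the leaky-ReLU slope, and the use of `tf.keras` are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Sequential

def mlp(n_layers, width, out_dim, out_activation=None):
    """Stack of fully-connected layers with leaky-ReLU activations."""
    net = Sequential()
    for _ in range(n_layers):
        net.add(layers.Dense(width))
        net.add(layers.LeakyReLU(0.2))
    net.add(layers.Dense(out_dim, activation=out_activation))
    return net

m, n = 8, 21                     # latent and data dimensions (e.g., HEPMASS)
G_net  = mlp(10, 512, n)         # forward mapping z -> x
H_net  = mlp(10, 256, m)         # backward mapping x -> z
Dx_net = mlp(4, 256, 1)          # discriminator on the data space
Dz_net = mlp(2, 128, 1)          # discriminator on the latent space
```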
We also extended Roundtrip to estimate the density of tensor-valued data (e.g., images) by introducing a one-hot encoded class label $\mathbf{y}$ as an additional input to both the $G$ and $D_x$ networks, in a conditional GAN (CGAN) manner (Mirza and Osindero 2014). $\mathbf{y}$ is combined with the hidden representations in these networks by concatenation. Compared with vector-valued data, tensor-valued data such as images are flattened into vectors when taken as input to the networks and reshaped back when produced as output. Similar to the model architecture of DCGAN (Radford, Metz, and Chintala 2015), we use transposed convolutional layers in the $G$ network for upsampling images from the latent space. In addition, we use traditional convolutional neural networks for $H$ and $D_x$, while $D_z$ still adopts a fully-connected architecture. Note that batch normalization (Ioffe and Szegedy 2015) is applied after each convolutional or transposed convolutional layer (detailed hyperparameters are provided in Appendix E).

3 Results
3.1 Experiment setup
We tested the performance of the Roundtrip model in a series of experiments, including simulation studies and real data studies. In these experiments, we compared Roundtrip to the widely used Gaussian kernel density estimator as well as several neural density estimators, including MADE (Germain et al. 2015), Real NVP (Dinh, Sohl-Dickstein, and Bengio 2016), and MAF (Papamakarios, Pavlakou, and Murray 2017). In the outlier detection experiment, we additionally compared to two commonly used outlier detection methods: One-class SVM (Schölkopf et al. 2001) and Isolation Forest (Liu, Ting, and Zhou 2008). Note that the default setting of the Roundtrip model uses the importance sampling strategy; results of the Roundtrip density estimator based on the Laplace approximation are reported in Appendix C.
The neural networks in the Roundtrip model were implemented with TensorFlow (Abadi et al. 2016). In all experiments, we set $\alpha = 10$ and $\beta = 10$ in equation (13). For the parameter $\sigma$ in our model assumption, we first pretrained the Roundtrip model for 20 epochs and selected the value of $\sigma$ from a grid of candidates that maximizes the average likelihood on the validation set. The sample size $N$ in importance sampling is set to 40,000. An Adam optimizer (Kingma and Ba 2014) with a learning rate of 0.0002 was used for backpropagation and updating model parameters. We stop model training when there is no improvement in the average log-likelihood on the validation set for 10 consecutive epochs (early stopping).
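The selection of $\sigma$ by validation likelihood can be viewed as a simple grid search after pretraining. The sketch below is illustrative only: the candidate grid is hypothetical (the original grid values are not reproduced here), and `roundtrip_density_is` refers to the importance-sampling estimator sketched in Section 2.2.

```python
import numpy as np

def select_sigma(x_valid, G, H, m, candidates=(0.05, 0.1, 0.2, 0.5, 1.0)):
    """Pick the sigma maximizing the average validation log-likelihood.

    The candidate grid here is a hypothetical example, not the paper's grid.
    """
    best_sigma, best_ll = None, -np.inf
    for sigma in candidates:
        ll = np.mean([np.log(roundtrip_density_is(x, G, H, sigma, m) + 1e-300)
                      for x in x_valid])
        if ll > best_ll:
            best_sigma, best_ll = sigma, ll
    return best_sigma
```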
We took the Gaussian kernel density estimator (KDE) as a baseline, where the bandwidth is selected by Silverman's "rule of thumb" (Silverman 1986) or Scott's rule (Scott 1992); we report the one with better results. The three alternative neural density estimators (MADE, Real NVP, and MAF) were implemented based on https://github.com/gpapamak/maf. In the outlier detection tasks, we implemented One-class SVM and Isolation Forest using the scikit-learn library (Pedregosa et al. 2011) with default parameters. To ensure fair model comparison, both simulation and real data were randomly split into a 90% training set and a 10% test set. For neural density estimators, including Roundtrip, 10% of the training set was kept as a validation set. The image datasets come with predefined training and test sets and thus require no further splitting.
Table 1. Average log-likelihood on the test set of five UCI datasets.

|  | AReM | CASP | HEPMASS | BANK | YPMSD |
|---|---|---|---|---|---|
| KDE | 6.26±0.07 | 20.47±0.10 | -25.46±0.03 | 15.84±0.12 | 247.03±0.61 |
| MADE | 6.00±0.11 | 21.82±0.23 | -15.15±0.02 | 14.97±0.53 | 273.20±0.35 |
| Real NVP | 9.52±0.18 | 26.81±0.15 | -18.71±0.02 | 26.33±0.22 | 287.74±0.34 |
| MAF | 9.49±0.17 | 27.61±0.13 | -17.39±0.02 | 20.09±0.20 | 290.76±0.33 |
| Roundtrip | 11.74±0.04 | 28.38±0.08 | -4.18±0.02 | 35.16±0.14 | 297.98±0.52 |
3.2 Evaluation
For two-dimensional simulation datasets, we directly visualize both the true density and the estimated density on a bounded 2D region. For higher-dimensional simulation datasets, where the true density can still be calculated, we evaluate different density estimators by the Spearman (rank) correlation between true and estimated densities on the test set. For real data, where the ground-truth density is not available, the average estimated log-likelihood (natural log) on the test set is used as the evaluation measure.
In the outlier detection application, we measure performance by the precision at $k$, defined as the proportion of correct results (true outliers) among the top $k$ ranked instances. We set $k$ to the number of outliers in the test set.
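For clarity, a short sketch of this metric, where test points are ranked by ascending estimated density (lowest density first); the function name is ours.

```python
import numpy as np

def precision_at_k(log_density, is_outlier):
    """Precision at k: fraction of true outliers among the k lowest-density points.

    log_density: estimated log-densities on the test set.
    is_outlier:  boolean ground-truth outlier labels.
    k is set to the number of true outliers in the test set.
    """
    k = int(np.sum(is_outlier))
    lowest_k = np.argsort(log_density)[:k]   # indices of the k least likely points
    return float(np.mean(is_outlier[lowest_k]))
```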

3.3 Simulation studies
We first designed three 2D simulation datasets, for which the true density can be calculated, to test the performance of different density estimators.
(a) Independent Gaussian mixture.
Each coordinate $x_i$, $i = 1, 2$, is drawn independently from a mixture of two Gaussian components.
(b) 8-octagon Gaussian mixture.
$\mathbf{x}$ is drawn from a mixture of eight Gaussian components $\mathcal{N}(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$, $i = 1, \dots, 8$, whose means $\boldsymbol{\mu}_i$ lie on the vertices of a regular octagon.
(c) Involute.
The point $(x_1, x_2)$ lies near an involute (spiral-shaped) curve: a Gaussian latent variable is mapped through the involute transformation, and Gaussian noise is added to both coordinates.
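To illustrate how such simulation data can be generated, below is a sketch for cases (a) and (c). The specific component means, standard deviations, and the involute parameterization are assumed values for illustration only; they are not the exact parameters used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_mixture_indep(n_samples, dim=2, mean=3.0, std=0.5):
    """(a) Each coordinate drawn independently from a two-component mixture.
    The component means +/-3 and std 0.5 are illustrative assumptions."""
    signs = rng.choice([-1.0, 1.0], size=(n_samples, dim))
    return signs * mean + rng.normal(scale=std, size=(n_samples, dim))

def involute(n_samples, noise=0.4):
    """(c) A 2D spiral (involute-like) distribution; parameterization assumed."""
    t = np.abs(rng.normal(scale=2.0, size=n_samples)) + 0.5
    x1 = t * np.sin(t) + rng.normal(scale=noise, size=n_samples)
    x2 = t * np.cos(t) + rng.normal(scale=noise, size=n_samples)
    return np.stack([x1, x2], axis=1)

X_a = gaussian_mixture_indep(20000)   # 20,000 points, as in the experiments
X_c = involute(20000)
```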
20,000 points were sampled from each of the above true data distributions. After model training, we directly estimated the density on a 2D bounded region (a 100×100 grid) with different methods (Figure 2). For the independent Gaussian mixture in case (a), Roundtrip clearly separates the independent components in the Gaussian mixture, while other neural density estimators either fail (MADE) or produce obvious spurious density trajectories between components (Real NVP and MAF). Roundtrip also captures the density better for the highly non-linear structure in case (c). We then took case (a) for a further study by increasing the dimension up to 10 (containing $2^{10}$ modes). The performance of the kernel density estimator (KDE) decreases dramatically as the dimension increases. Roundtrip still achieves a Spearman correlation of 0.829 at dimension 10, compared to 0.669 for Real NVP, 0.595 for MAF, and 0.14 for KDE (see Appendix C).
3.4 Real data studies
UCI datasets
We collected five datasets (AReM, CASP, HEPMASS, BANK, and YPMSD) from the UCI machine learning repository (Dua and Graff 2017), with dimensions ranging from 6 to 90 and sample sizes from 42,240 to 515,345 (see more details on data description and preprocessing in Appendix D). Unlike the simulation data, these real datasets have no ground-truth density. Hence, we evaluated different methods by calculating the average log-likelihood on the test set. Table 1 reports the performance of Roundtrip and the other neural density estimators; a Gaussian kernel density estimator (KDE) fitted to the training data is included as a baseline. Roundtrip outperforms the other neural density estimators, achieving the highest average log-likelihood on the test set of each dataset, which again demonstrates the superiority of our model.
Image datasets
Table 2. Precision at $k$ for outlier detection on three ODDS datasets.

|  | OC-SVM | I-Forest | Real NVP | MAF | Roundtrip |
|---|---|---|---|---|---|
| Shuttle | 0.953 | 0.956 | 0.784 | 0.929 | 0.973 |
| Mammography | 0.037 | 0.482 | 0.474 | 0.407 | 0.482 |
| ForestCover | 0.127 | 0.058 | 0.054 | 0.046 | 0.177 |
We further applied the Roundtrip model to generate images and to assess the quality of the generated images by their estimated density. Deep generative models have demonstrated their power in generating synthetic images; however, a deep generative model alone cannot provide quality scores for the generated images. Here, we propose to use our Roundtrip method to generate images together with a quality score (the estimated density of each image). We test this approach on two commonly used image datasets, MNIST (LeCun, Cortes, and Burges 2010) and CIFAR-10 (Krizhevsky, Hinton et al. 2009), where in each dataset the images come from 10 distinct classes. The Roundtrip model was modified by introducing an additional one-hot encoded class label $\mathbf{y}$ to both the $G$ and $D_x$ networks, and convolutional layers were used in $G$, $H$, and $D_x$ (see Methods). We then model the conditional density $p(\mathbf{x}\mid\mathbf{y})$ with $\mathbf{y} \sim \mathrm{Cat}(10)$, where $\mathrm{Cat}(10)$ denotes a categorical distribution with 10 distinct classes. We use this modified Roundtrip model to simultaneously generate images conditional on a class label and compute the within-class density of each image. The competing neural density estimators typically require many tricks to achieve image generation and density estimation, including rescaling pixel values, transforming the bounded pixel values into an unbounded logit space, and adding uniform noise. Roundtrip requires no additional transformation except for rescaling. In Figure 3, the generated images of each class are sorted by decreasing likelihood. It can be seen that images generated by Roundtrip are more realistic than those generated by MAF (the best among the alternative neural density estimators; see Figure 2 and Table 1). Furthermore, the density provided by Roundtrip correlates well with the quality of the generated images.
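As a small illustration of how class-conditional density estimates can be used, the sketch below combines them into a marginal density under a uniform class prior and ranks generated images of a class by their within-class density; the helper `conditional_density` is a hypothetical callable, not part of the released code.

```python
import numpy as np

def marginal_density(x, conditional_density, n_classes=10):
    """p(x) = sum_y p(x|y) p(y) with a uniform categorical prior over classes."""
    return np.mean([conditional_density(x, y) for y in range(n_classes)])

def rank_by_quality(images, label, conditional_density):
    """Sort an array of generated images of one class by decreasing density."""
    scores = np.array([conditional_density(img, label) for img in images])
    order = np.argsort(-scores)                 # highest-density images first
    return images[order], scores[order]
```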
3.5 Outlier detection
Finally, we applied the Roundtrip model to an outlier detection task, where a data point with an extremely low density value is regarded as a likely outlier. We tested this method on three outlier detection datasets (Shuttle, Mammography, and ForestCover) from the ODDS database (http://odds.cs.stonybrook.edu/). Each dataset was split into training, validation, and test sets (details of the data are given in Appendix D). Besides the neural density estimators, we also included two baselines: One-class SVM (Schölkopf et al. 2001) and Isolation Forest (Liu, Ting, and Zhou 2008). The results shown in Table 2 are based on the average precision over three independent runs of each algorithm. Roundtrip achieves the best or comparable results on the different outlier detection tasks. Especially on the last dataset, ForestCover, in which the outlier percentage is only 0.9%, Roundtrip still achieves a precision of 17.7%, while the precision of the other neural density estimators is less than 6%.
4 Discussion
We proposed Roundtrip as a new neural density estimator based on deep generative models. Unlike prior studies, which focus on modeling an invertible transformation between a base density and the target density where the parameters of the component functions are learned by neural networks, Roundtrip allows the direct use of a deep generative network to model the transformation from the latent variable space to the data space. Moreover, the change of variable rule used by previous methods requires the base density and the target density to have the same dimension, whereas Roundtrip provides a more flexible transformation between them. To the best of our knowledge, Roundtrip is the first work to tackle the general-purpose density estimation problem with deep generative neural networks (e.g., GANs). In a series of experiments, Roundtrip outperforms previous neural density estimators in a variety of density estimation tasks, including simulation and real data studies and an outlier detection application. We also demonstrated the flexibility of Roundtrip, which can estimate densities for both vector-valued and tensor-valued data (e.g., images).
Density estimation aims at obtaining accurate density values for given i.i.d. data points. Deep generative models such as GANs typically focus on data generation, where the density is only implicitly learned. Our work provides a new perspective by borrowing the generative power of deep generative models for accurately evaluating densities.
5 Ethical impact statement
This work does not present any foreseeable societal consequence.
References
- Abadi et al. (2016) Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. 2016. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283.
- Baldi et al. (2016) Baldi, P.; Cranmer, K.; Faucett, T.; Sadowski, P.; and Whiteson, D. 2016. Parameterized machine learning for high-energy physics. arXiv preprint arXiv:1601.07913 .
- Ballé, Laparra, and Simoncelli (2015) Ballé, J.; Laparra, V.; and Simoncelli, E. P. 2015. Density modeling of images using a generalized normalization transformation. arXiv preprint arXiv:1511.06281 .
- Berg et al. (2018) Berg, R. v. d.; Hasenclever, L.; Tomczak, J. M.; and Welling, M. 2018. Sylvester normalizing flows for variational inference. arXiv preprint arXiv:1803.05649 .
- Dinh, Sohl-Dickstein, and Bengio (2016) Dinh, L.; Sohl-Dickstein, J.; and Bengio, S. 2016. Density estimation using real nvp. arXiv preprint arXiv:1605.08803 .
- Dua and Graff (2017) Dua, D.; and Graff, C. 2017. UCI Machine Learning Repository. URL http://archive.ics.uci.edu/ml.
- Germain et al. (2015) Germain, M.; Gregor, K.; Murray, I.; and Larochelle, H. 2015. Made: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, 881–889.
- Goodfellow et al. (2014) Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in neural information processing systems, 2672–2680.
- Ioffe and Szegedy (2015) Ioffe, S.; and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 .
- Karami et al. (2018) Karami, M.; Dinh, L.; Duckworth, D.; Sohl-Dickstein, J.; and Schuurmans, D. 2018. Generative convolutional flow for density estimation. In Workshop on Bayesian Deep Learning NeurIPS 2018.
- Kingma and Ba (2014) Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 .
- Kingma et al. (2016) Kingma, D. P.; Salimans, T.; Jozefowicz, R.; Chen, X.; Sutskever, I.; and Welling, M. 2016. Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems, 4743–4751.
- Kingma and Welling (2013) Kingma, D. P.; and Welling, M. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 .
- Krizhevsky, Hinton et al. (2009) Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images .
- LeCun, Cortes, and Burges (2010) LeCun, Y.; Cortes, C.; and Burges, C. 2010. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist 2.
- Liu, Ting, and Zhou (2008) Liu, F. T.; Ting, K. M.; and Zhou, Z.-H. 2008. Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining, 413–422. IEEE.
- Lugosi, Nobel et al. (1996) Lugosi, G.; Nobel, A.; et al. 1996. Consistency of data-driven histogram methods for density estimation and classification. The Annals of Statistics 24(2): 687–706.
- Makhzani et al. (2015) Makhzani, A.; Shlens, J.; Jaitly, N.; Goodfellow, I.; and Frey, B. 2015. Adversarial autoencoders. arXiv preprint arXiv:1511.05644 .
- Mao et al. (2017) Mao, X.; Li, Q.; Xie, H.; Lau, R. Y.; Wang, Z.; and Paul Smolley, S. 2017. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, 2794–2802.
- Mathieu, Couprie, and LeCun (2015) Mathieu, M.; Couprie, C.; and LeCun, Y. 2015. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440 .
- Mirza and Osindero (2014) Mirza, M.; and Osindero, S. 2014. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 .
- Moro, Cortez, and Rita (2014) Moro, S.; Cortez, P.; and Rita, P. 2014. A data-driven approach to predict the success of bank telemarketing. Decision Support Systems 62: 22–31.
- Palumbo et al. (2016) Palumbo, F.; Gallicchio, C.; Pucci, R.; and Micheli, A. 2016. Human activity recognition using multisensor data fusion based on reservoir computing. Journal of Ambient Intelligence and Smart Environments 8(2): 87–107.
- Papamakarios, Pavlakou, and Murray (2017) Papamakarios, G.; Pavlakou, T.; and Murray, I. 2017. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, 2338–2347.
- Parzen (1962) Parzen, E. 1962. On estimation of a probability density function and mode. The annals of mathematical statistics 33(3): 1065–1076.
- Pedregosa et al. (2011) Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; and Duchesnay, E. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12: 2825–2830.
- Radford, Metz, and Chintala (2015) Radford, A.; Metz, L.; and Chintala, S. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 .
- Rezende and Mohamed (2015) Rezende, D. J.; and Mohamed, S. 2015. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770 .
- Rosenblatt (1956) Rosenblatt, M. 1956. Remarks on Some Nonparametric Estimates of a Density Function. The Annals of Mathematical Statistics 832–837.
- Schölkopf et al. (2001) Schölkopf, B.; Platt, J. C.; Shawe-Taylor, J.; Smola, A. J.; and Williamson, R. C. 2001. Estimating the support of a high-dimensional distribution. Neural computation 13(7): 1443–1471.
- Scott (1979) Scott, D. W. 1979. On optimal and data-based histograms. Biometrika 66(3): 605–610.
- Scott (1992) Scott, D. W. 1992. Multivariate density estimation: theory, practice, and visualization .
- Silverman (1986) Silverman, B. W. 1986. Density estimation for statistics and data analysis, volume 26. CRC press.
- Uria et al. (2016) Uria, B.; Côté, M.-A.; Gregor, K.; Murray, I.; and Larochelle, H. 2016. Neural autoregressive distribution estimation. The Journal of Machine Learning Research 17(1): 7184–7220.
- Yi et al. (2017) Yi, Z.; Zhang, H.; Tan, P.; and Gong, M. 2017. Dualgan: Unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE international conference on computer vision, 2849–2857.
- Zhu et al. (2017) Zhu, J.-Y.; Park, T.; Isola, P.; and Efros, A. A. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, 2223–2232.
Appendix A
We used importance sampling to obtain a numerical estimate of the integral in equation (3). One key problem is choosing an appropriate importance distribution $q(\mathbf{z})$. In the Roundtrip model, we chose $q(\mathbf{z})$ to be a Student's t distribution centered at $\tilde{\mathbf{z}} = H(\mathbf{x})$. The conditional density $p(\mathbf{x}\mid\mathbf{z})$ takes its largest value near $\mathbf{z} = H(\mathbf{x})$, since
$$p(\mathbf{x}\mid H(\mathbf{x})) = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left(-\frac{\|\mathbf{x}-G(H(\mathbf{x}))\|^2}{2\sigma^2}\right). \tag{15}$$
We can see that minimizing the roundtrip loss in Section 2.4 is equivalent to maximizing $p(\mathbf{x}\mid H(\mathbf{x}))$. This is also the reason we impose a roundtrip loss during the training process.
To make the importance sampling strategy easier to understand, we illustrate it with an example from the simulation study. Taking the involute case from Section 3.3, we visualize $p_z(\mathbf{z})$, $q(\mathbf{z})$, and $p(\mathbf{x}\mid\mathbf{z})$ along the first dimension, focusing on the density at the point $\mathbf{x} = (3, 3)$ (Figure S1).
Figure S1. Distributions of $p_z(\mathbf{z})$, $q(\mathbf{z})$, and $p(\mathbf{x}\mid\mathbf{z})$ for estimating the density at the point $\mathbf{x} = (3, 3)$.
As $p(\mathbf{x}\mid\mathbf{z})$ typically decays much faster than $p_z(\mathbf{z})$, we chose $q(\mathbf{z})$ such that its center is as close as possible to the region where $p(\mathbf{x}\mid\mathbf{z})$ is large. To sum up, in the importance sampling strategy, the $G$ network is used for evaluating $p(\mathbf{x}\mid\mathbf{z})$ at the sampled points, while the $H$ network is used for determining the center of the importance distribution $q(\mathbf{z})$.
Appendix B
The change of variable rule as a special case
We first restate the density of $\mathbf{x}$ from equation (11) in the main text as follows
$$p_x(\mathbf{x}) \approx \frac{1}{(2\pi\sigma^2)^{n/2}\left|\frac{J_{\tilde{\mathbf{z}}}^{T}J_{\tilde{\mathbf{z}}}}{\sigma^2}+\mathbf{I}_m\right|^{1/2}}\exp\left\{-\frac{1}{2}\left(\frac{\|\mathbf{r}\|^2}{\sigma^2}+\|\tilde{\mathbf{z}}\|^2-\boldsymbol{\mu}^{T}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}\right)\right\}, \tag{16}$$

where $\tilde{\mathbf{z}} = H(\mathbf{x})$, $\mathbf{r} = \mathbf{x} - G(\tilde{\mathbf{z}})$, and $\boldsymbol{\mu}$, $\boldsymbol{\Sigma}$ are given in equation (10).
When $m = n$ and $H = G^{-1}$, we have $G(\tilde{\mathbf{z}}) = G(H(\mathbf{x})) = \mathbf{x}$, hence $\mathbf{r} = \mathbf{0}$, $\boldsymbol{\mu} = -\boldsymbol{\Sigma}\tilde{\mathbf{z}}$, and $\boldsymbol{\mu}^{T}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} = \tilde{\mathbf{z}}^{T}\boldsymbol{\Sigma}\tilde{\mathbf{z}}$.
Finally, taking the limit $\sigma \to 0$, we have $\boldsymbol{\Sigma} = \left(\frac{J_{\tilde{\mathbf{z}}}^{T}J_{\tilde{\mathbf{z}}}}{\sigma^2}+\mathbf{I}_n\right)^{-1} \to \mathbf{0}$, so that $\tilde{\mathbf{z}}^{T}\boldsymbol{\Sigma}\tilde{\mathbf{z}} \to 0$ and $\sigma^{n}\left|\frac{J_{\tilde{\mathbf{z}}}^{T}J_{\tilde{\mathbf{z}}}}{\sigma^2}+\mathbf{I}_n\right|^{1/2} \to |\det J_{\tilde{\mathbf{z}}}|$, and therefore
$$p_x(\mathbf{x}) \to \frac{1}{(2\pi)^{n/2}}\exp\left(-\frac{\|\tilde{\mathbf{z}}\|^2}{2}\right)\left|\det J_{\tilde{\mathbf{z}}}\right|^{-1} = p_z(H(\mathbf{x}))\left|\det\frac{\partial G(\mathbf{z})}{\partial\mathbf{z}}\right|^{-1}_{\,\mathbf{z}=H(\mathbf{x})}. \tag{17}$$
This is exactly the change of variable rule in equation (1). So we have proved that under the three conditions (1) $m = n$, (2) $H = G^{-1}$, (3) $\sigma \to 0$, the proposed Laplace approximation degenerates into the change of variable rule, which is the principle underlying previous neural density estimators. Our Laplace approximation approach can therefore be considered an extension of the change of variable rule, which requires the base density and the target density to have the same dimension.
Figure S2. Performance of different density estimators at different dimensions.
Appendix C
Table S1. Dimension and sample size of UCI/Image datasets
| Dataset | Domain | Dim(z) | Dim(x) | Sample size (Train) | Sample size (Validation) | Sample size (Test) |
|---|---|---|---|---|---|---|
| AReM | Social science | 3 | 6 | 34215 | 3801 | 4223 |
| CASP | Chemistry | 5 | 9 | 37042 | 4115 | 4573 |
| HEPMASS | Physics | 8 | 21 | 315123 | 35013 | 174987 |
| BANK | Finance | 8 | 17 | 36621 | 4069 | 4521 |
| YPMSD | Audio | 20 | 90 | 417430 | 46381 | 51534 |
| MNIST | Image | 100 | 784 | 50000 | 10000 | 10000 |
| CIFAR-10 | Image | 100 | 3072 | 45000 | 5000 | 10000 |
We took case (a), the independent Gaussian mixture, for a further study by increasing the dimension up to 10 (containing $2^{10}$ modes). The Spearman correlation between the estimated density and the true density on the test set is calculated and shown in Figure S2. The kernel density estimator (KDE) performs comparably or even better when the dimension is less than 5, but its performance decreases sharply when the dimension is larger than 5. Our Roundtrip model with the importance sampling strategy (Roundtrip-IS) achieves consistently better performance than the other neural density estimators across dimensions. We also note that the Roundtrip model with Laplace approximation (Roundtrip-LP) outperforms MADE but is not as good as MAF and Real NVP in most cases (Figure S2).
Although we provided theoretical guarantees for the approximate solution, the success of Roundtrip-LP requires that the higher-order terms neglected in equation (5) of the main text be negligible, which may introduce additional bias into the density estimates. We therefore reported all density estimation results using the more robust Roundtrip-IS model (the default setting), as the importance sampling estimate is unbiased.
Appendix D
UCI and image datasets
We provide detailed descriptions of the data and preprocessing for all datasets used in our study.
AReM. The Activity Recognition system based on Multisensor data fusion (AReM) (Palumbo et al. 2016) dataset contains temporal data from a wireless sensor network worn by an actor performing the activities bending, cycling, lying down, sitting, standing, and walking. The time-domain features, consisting of 3 mean values and 3 standard deviations, were collected from the multisensor system over a period of time. Although this is time-series data, we treat each example as if it were drawn i.i.d. from the target distribution. The raw data were first feature-scaled through min-max normalization and then randomly split into a 90% training set and a 10% test set. Note that for neural density estimators, 10% of the training set was kept for validation.
CASP. The CASP dataset contains physicochemical properties of protein tertiary structure. Each example denotes an individual residue with 9 features, including total surface area, non-polar exposed area, fractional area of exposed non-polar residue, fractional area of the exposed non-polar part of the residue, molecular mass weighted exposed area, Euclidean distance, secondary structure penalty, and spatial distribution constraints (N,K value). The same data normalization and split as for the AReM dataset were used.
HEPMASS. The HEPMASS (Baldi et al. 2016) dataset describes particle collision signatures of exotic particles in high-energy physics. We preprocessed this dataset following the same strategy as (Papamakarios, Pavlakou, and Murray 2017): examples from the "1000" dataset, where the particle mass is 1000, were collected, and five features were removed due to too many recurring values.
BANK. The BANK dataset (Moro, Cortez, and Rita 2014) relates to a marketing campaign of a Portuguese banking institution, where the goal is to predict whether a client will subscribe to a deposit. Label encoding was used for the discrete features in the raw data, with values between 0 and the number of classes minus one. A small uniform noise was then added to each feature. Finally, the same data normalization and split as for the AReM dataset were used.
YPMSD. YPMSD (http://millionsongdataset.com/) is a dataset containing the audio features of songs from years ranging from 1922 to 2011. Each song has 90 features, corresponding to 12 timbre averages and 78 timbre covariances. The same data normalization and split as for the AReM dataset were used.
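For concreteness, a minimal sketch of the preprocessing steps described above (min-max normalization, a random 90/10 split, and label encoding of discrete features with small uniform noise); the noise scale is an assumption, since the exact value is not reproduced above.

```python
import numpy as np

def preprocess_uci(X, seed=0):
    """Min-max normalize features and make a random 90/10 train/test split."""
    rng = np.random.default_rng(seed)
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-12)
    idx = rng.permutation(len(X))
    n_train = int(0.9 * len(X))
    return X[idx[:n_train]], X[idx[n_train:]]

def encode_discrete(column, noise_scale=0.1, seed=0):
    """Label-encode a discrete column and add small uniform noise (scale assumed)."""
    rng = np.random.default_rng(seed)
    _, codes = np.unique(column, return_inverse=True)
    return codes + rng.uniform(0.0, noise_scale, size=len(codes))
```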
Descriptions of the five UCI datasets and the two image datasets (MNIST and CIFAR-10), including feature dimensions and sample sizes, are summarized in Table S1.
Table S2. Dimension and sample size of ODDS datasets
| Dataset | Dim(z) | Dim(x) | Outliers (%) | Sample size (Train) | Sample size (Validation) | Sample size (Test) |
|---|---|---|---|---|---|---|
| Shuttle | 3 | 9 | 7 | 39770 | 4418 | 4909 |
| Mammography | 3 | 6 | 2.32 | 9059 | 1006 | 1118 |
| ForestCover | 4 | 10 | 0.9 | 231700 | 25744 | 28604 |
ODDS datasets
We provide detailed descriptions of the ODDS datasets used in this study.
Shuttle. The Shuttle dataset (http://odds.cs.stonybrook.edu/shuttle-dataset/) contains 9 numerical features. The smallest five classes, i.e., 2, 3, 5, 6, and 7, are combined to form the outlier class, while class 1 forms the inlier class. Data for class 4 are discarded. All inlier and outlier data were first mixed together and then randomly split into a 90% training set and a 10% test set. For neural density estimators, 10% of the training set was kept for validation.
Mammography. The Mammography dataset (http://odds.cs.stonybrook.edu/mammography-dataset/) describes the characteristics of 260 calcifications. The minority (calcification) class is considered the outlier class and the non-calcification class the inlier class. The same data split strategy as for the Shuttle dataset was used.
ForestCover. The ForestCover dataset (http://odds.cs.stonybrook.edu/forestcovercovertype-dataset/) is used for predicting forest cover type from cartographic variables. The outlier detection dataset was created using only the 10 quantitative attributes. Instances from class 2 are considered normal points and instances from class 4 are anomalies. The same data split strategy as for the Shuttle dataset was used.
The descriptions of the three ODDS datasets are summarized in Table S2.
Table S3. The network architecture for conditional image generation and density estimation.
| $G$ generator | $D_x$ discriminator |
|---|---|
| Inputs $\mathbf{z}$ and $\mathbf{y}$ | Input flattened image and $\mathbf{y}$ |
| Concat(z, y) | Reshape, 28×28×1 |
| FC, 1024. batchnorm. LRelu | conv, 32, stride 2. batchnorm. LRelu |
| Concat(FC, y) | Concat(Conv, yb) |
| FC. batchnorm. LRelu | conv, 64, stride 2. batchnorm. LRelu |
| Reshape | Flatten, 1568 |
| Concat(Reshape, yb) | Concat(Flat, yb) |
| upconv, 64, stride 2. Sigmoid | FC, 1024. batchnorm. LRelu |
| Flatten, 784 | Concat(FC, y) |
|  | FC, 1 |

| $H$ generator | $D_z$ discriminator |
|---|---|
| Input flattened image | Input $\mathbf{z}$ |
| Reshape, 28×28×1 | FC, 128. LRelu |
| conv, 64, stride 2. LRelu | FC, 128. batchnorm. Tanh |
| conv, 64, stride 2. LRelu | FC, 1 |
| FC, 1024 |  |
| FC, 100 |  |
Appendix E
Taking the MNIST dataset as an example, we provide the details of the network architectures used in the Roundtrip model for conditional image generation and density estimation in Table S3. The one-hot encoded label $\mathbf{y}$ is fed to both the $G$ and $D_x$ networks. Note that yb is a reshaped version of $\mathbf{y}$, which is convenient for channel-wise concatenation. For the CIFAR-10 dataset, the network architecture used in Roundtrip is exactly the same, except that the image size is 32×32×3 and the number of hidden units in the second fully-connected layer of the $G$ network is adjusted accordingly.