This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

[1]\fnmMarcus \surHoerger

[1]\orgdivSchool of Mathematics & Physics, \orgnameThe University of Queensland, \orgaddress\cityBrisbane, \postcode4072, \stateQLD, \countryAustralia

A Flow-Based Generative Model for Rare-Event Simulation

\fnmLachlan \surGibson [email protected]    [email protected]    \fnmDirk \surKroese [email protected] *
Abstract

Solving decision problems in complex, stochastic environments is often achieved by estimating the expected outcome of decisions via Monte Carlo sampling. However, sampling may overlook rare, but important events, which can severely impact the decision making process. We present a method in which a Normalizing Flow generative model is trained to simulate samples directly from a conditional distribution given that a rare event occurs. By utilizing Coupling Flows, our model can, in principle, approximate any sampling distribution arbitrarily well. By combining the approximation method with Importance Sampling, highly accurate estimates of complicated integrals and expectations can be obtained. We include several examples to demonstrate how the method can be used for efficient sampling and estimation, even in high-dimensional and rare-event settings. We illustrate that by simulating directly from a rare-event distribution significant insight can be gained into the way rare events happen.

keywords:
normalizing flows, neural networks, rare events, simulation

1 Introduction

Many decision problems in complex, stochastic environments (Kochenderfer, \APACyear2015) are nowadays solved by estimating the expected outcome of decisions via Monte Carlo simulations (Kroese \BOthers., \APACyear2011, \APACyear2019; Liu, \APACyear2004). The success of Monte Carlo methods is due to their simplicity, flexibility, and scalability. However, in many problem domains, the occurrence of rare but important events — events that happen with a very small probability, say less than 10410^{-4} — severely impairs their efficiency, for the reason that such events do not show up often in a typical simulation run (Arief \BOthers., \APACyear2021). On the other hand, failing to consider such rare events can lead to decisions with potentially catastrophic outcomes, e.g., hazardous behaviour of autonomous cars. By using well-known variance reduction techniques such as Importance Sampling  (Kroese \BOthers., \APACyear2011; Bucklew, \APACyear2004) it is possible to sometimes dramatically increase the efficiency of the standard Monte Carlo method. Nevertheless, there are few efficient methods available that give insights into how a system behaves under a rare event. An important research goal is thus to find methods that simulate a random process conditionally on the occurrence of a rare event. In this case, the target sampling distribution is the distribution of the original process conditioned on the rare event occurring. More broadly, for both estimation and sampling problems, the challenge is to identify a “good” sampling distribution that closely approximates a target distribution. Certain approximation methods, such as the Cross-Entropy method (de Boer \BOthers., \APACyear2005; Rubinstein \BBA Kroese, \APACyear2017), train a parametric model by minimizing the cross-entropy between a distribution family and the target distribution. However, typical parametric models are often not flexible enough to efficiently capture all the complexity of many interesting systems. Botev \BOthers. (\APACyear2007) introduced a Generalized Cross-entropy method, which approximates the target distribution in a non-parametric way. However, the quality of the results is determined by various hyper-parameters, which require expertise to tune.

While many standard methods can identify sampling distributions of sufficient quality to estimate rare-event probabilities and related quantities of interest, extremely good approximations to the target conditional distribution are required to be able to explore the behaviour of the simulated system under rare-event conditions. This can occasionally be achieved by strategically selecting very specific distribution families using problem-specific information, but this has been difficult to achieve with conventional simulation methods. Therefore, it is desirable to have a more general approach that can closely approximate any target distribution without relying on problem-specific knowledge. Gibson \BBA Kroese (\APACyear2022) recently introduced a framework for rare-event simulation using neural networks that aimed to achieve this. In that framework two Multilayer Perceptrons are trained simultaneously: the first being a generative model to represent the sampling distribution, and the second to approximate the probability distribution of the first. While the framework had some success in one-dimensional problems, the reliance on a second network and probability density estimation convoluted the training process and made scaling to higher-dimensional problems difficult. Ardizzone \BOthers. (\APACyear2019) and Falkner \BOthers. (\APACyear2022) explore the use of normalizing flows for rare event sampling in the context of image generation and Boltzmann generators. The approach in Ardizzone \BOthers. (\APACyear2019) requires a conditioning network to be pre-trained and embedded into the normalizing flows architecture. Falkner \BOthers. (\APACyear2022) embeds a bias variable into the flow to learn distributions that are biased towards the region of interest. However, such bias variables can be difficult to construct for more complex problems. In contrast, our model is trained end-to-end and without having to introduce a conditioning variable.

We present a new framework, inspired by Gibson \BBA Kroese (\APACyear2022), in which a Normalizing Flows generative model is trained to learn the optimal sampling distribution and used to estimate quantities such as the rare-event probability. The highly expressive nature of some Normalizing Flows architectures, trained using standard deep learning training algorithms, massively expands the range of learnable distributions, while the invertibility of Normalizing Flows allows the exact generative probability density to be computed without the need of a second network. In Section 2 we present the background theory and a description of the training algorithm, and in Section 3 we present several examples of rare-event simulation using this method. The source-code of our method is available at https://github.com/hoergems/rare-event-simulation-normalizing-flows.

2 Theory

The theory of rare-event simulation (Juneja \BBA Shahabuddin, \APACyear2006; Rubino \BBA Tuffin, \APACyear2009; Bucklew, \APACyear2004), Importance Sampling (Glynn \BBA Iglehart, \APACyear1989; Tokdar \BBA Kass, \APACyear2010; Neal, \APACyear2001) and Normalizing Flows (Kobyzev \BOthers., \APACyear2021; Papamakarios \BOthers., \APACyear2021; Rezende \BBA Mohamed, \APACyear2015) is well established. In this section we review the relevant background, formulate the foundation of our method, and present the training procedure.

2.1 Rare-Event Simulation and Importance Sampling

First let us introduce a mathematical basis for rare-event simulation and Importance Sampling. Consider a random object (e.g., random variable, random vector, or stochastic process), 𝑿\boldsymbol{X}, taking values in some space, 𝒳\mathcal{X}, with probability density function, p(𝒙)p(\boldsymbol{x}), that represents the result of a simulation experiment. Running simulations corresponds to sampling 𝑿\boldsymbol{X}. We consider rare events of the form {𝑿𝒜}\{\boldsymbol{X}\in\mathcal{A}\}, where 𝒜𝒳\mathcal{A}\subset\mathcal{X} is the set of simulation outcomes 𝒙\boldsymbol{x} that satisfy a predicate S(𝒙)γ{S(\boldsymbol{x})\geq\gamma} for some performance function S:𝒳S:\mathcal{X}\rightarrow\mathbb{R} and level parameter γ\gamma\in\mathbb{R}. The probability cc of the rare event {𝑿𝒜}\{\boldsymbol{X}\in\mathcal{A}\} can thus be written as

c:=(𝑿𝒜)=(S(𝑿)γ)=𝔼[𝟙𝒜(𝑿)],c:=\mathbb{P}(\boldsymbol{X}\in\mathcal{A})=\mathbb{P}(S(\boldsymbol{X})\geq\gamma)=\mathbb{E}[\mathbbm{1}_{\mathcal{A}}(\boldsymbol{X})],

where cc is very small, but not 0. Here, 𝟙𝒜(𝒙)\mathbbm{1}_{\mathcal{A}}(\boldsymbol{x}) is an indicator function that is 1 when 𝒙𝒜{\boldsymbol{x}\in\mathcal{A}} and 0 otherwise, and 𝔼\mathbb{E} represents the expected value. The expected number of simulations to sample a single rare event is 1/c1/c, so simulating rare events by sampling 𝑿\boldsymbol{X} directly quickly becomes computationally infeasible the rarer the event is. Therefore, other techniques, such as Importance Sampling become necessary to simulate and analyse rare events.

In particular, suppose that 𝑿\boldsymbol{X} is sampled from a probability density function p(𝒙)p(\boldsymbol{x}) and that we wish to estimate the quantity

:=𝔼[H(𝑿)]=H(𝒙)p(𝒙)d𝒙\ell:={\mathbb{E}}[H(\boldsymbol{X})]=\int H(\boldsymbol{x})p(\boldsymbol{x})\,\mathrm{d}\boldsymbol{x}

via “crude” Monte Carlo: Simulate 𝑿1,,𝑿n\boldsymbol{X}_{1},\ldots,\boldsymbol{X}_{n} independently from pp, and estimate \ell via the sample average i=1nH(𝑿i)/n\sum_{i=1}^{n}H(\boldsymbol{X}_{i})/n. The idea of Importance Sampling is to change the probability distribution under which the simulation takes place. However, computing the expected value of a function of 𝑿\boldsymbol{X} while using a different sampling density qq, requires a correction factor of the likelihood ratio p(𝒙)/q(𝒙)p(\boldsymbol{x})/q(\boldsymbol{x}),

:=𝔼p[H(𝑿)]=𝔼q[H(𝑿)p(𝑿)q(𝑿)],\ell:={\mathbb{E}}_{p}\left[H(\boldsymbol{X})\right]={\mathbb{E}}_{q}\left[H(\boldsymbol{X})\frac{p(\boldsymbol{X})}{q(\boldsymbol{X})}\right], (1)

where 𝔼p=𝔼{\mathbb{E}}_{p}={\mathbb{E}} and 𝔼q{\mathbb{E}}_{q} represent the expected values under the two probability models. This relationship allows an unbiased estimate of \ell to be computed by sampling 𝑿\boldsymbol{X} from qq instead of pp, via

^:=1nk=1nH(𝑿k)p(𝑿k)q(𝑿k),\widehat{\ell}:=\frac{1}{n}\sum_{k=1}^{n}H(\boldsymbol{X}_{k})\frac{p(\boldsymbol{X}_{k})}{q(\boldsymbol{X}_{k})},

where 𝑿1,,𝑿n\boldsymbol{X}_{1},\ldots,\boldsymbol{X}_{n} is an independent and identically distributed (iid) sample from qq. Note that any choice for qq is allowed, as long as q(𝒙)0q(\boldsymbol{x})\neq 0 when p(𝒙)0p(\boldsymbol{x})\neq 0. The optimal sampling distribution for estimating \ell in this way is thus the distribution that minimizes the variance of the estimator ^\widehat{\ell}. When HH is strictly positive (or strictly negative), choosing the probability density

q(𝒙):=p(𝒙)H(𝒙),q^{*}(\boldsymbol{x}):=\frac{p(\boldsymbol{x})H(\boldsymbol{x})}{\ell}, (2)

yields in fact a zero-variance estimator, because then ^=\widehat{\ell}=\ell. As a special case, an unbiased estimator of the rare-event probability c=(𝑿A)c={\mathbb{P}}(\boldsymbol{X}\in A) can be computed as

c^:=1nk=1n𝟙𝒜(𝑿k)p(𝑿k)q(𝑿k),\widehat{c}:=\frac{1}{n}\sum_{k=1}^{n}\mathbbm{1}_{\mathcal{A}}(\boldsymbol{X}_{k})\frac{p(\boldsymbol{X}_{k})}{q(\boldsymbol{X}_{k})}, (3)

and the optimal sampling density qq^{*} is just the original density pp truncated to the rare-event region 𝒜\mathcal{A}:

q(𝒙)=p𝒜(𝒙):=p(𝒙)𝟙𝒜(𝒙)c.q^{*}(\boldsymbol{x})=p^{\mathcal{A}}(\boldsymbol{x}):=\frac{p(\boldsymbol{x})\mathbbm{1}_{\mathcal{A}}(\boldsymbol{x})}{c}. (4)

More generally, if we want to estimate the expectation of H(𝑿)H(\boldsymbol{X}) conditional on 𝑿𝒜\boldsymbol{X}\in\mathcal{A}, that is,

𝒜:=𝔼p𝒜[H(𝑿)]=1c𝔼q[H(𝑿)p(𝑿)q(𝑿)𝟙𝒜(𝑿)],\ell^{\mathcal{A}}:={\mathbb{E}}_{p^{\mathcal{A}}}[H(\boldsymbol{X})]=\frac{1}{c}{{\mathbb{E}}}_{q}\left[H(\boldsymbol{X})\frac{p(\boldsymbol{X})}{q(\boldsymbol{X})}\mathbbm{1}_{\mathcal{A}}(\boldsymbol{X})\right], (5)

then 𝒜\ell^{\mathcal{A}} can be estimated via

^𝒜:=1nc^k=1nH(𝑿k)p(𝑿k)q(𝑿k)𝟙𝒜(𝑿k).\widehat{\ell}^{\mathcal{A}}:=\frac{1}{n\,\widehat{c}}\sum_{k=1}^{n}H(\boldsymbol{X}_{k})\frac{p(\boldsymbol{X}_{k})}{q(\boldsymbol{X}_{k})}\mathbbm{1}_{\mathcal{A}}(\boldsymbol{X}_{k}). (6)

In this case, the optimal Importance Sampling density is

q(𝒙):=p(𝒙)H(𝒙)𝟙𝒜(𝒙)c,q^{*}(\boldsymbol{x}):=\frac{p(\boldsymbol{x})H(\boldsymbol{x})\mathbbm{1}_{\mathcal{A}}(\boldsymbol{x})}{\ell\,c}, (7)

2.2 Normalizing Flows

If it is possible to approximate the target density (e.g., the optimal Importance Sampling density) closely, then we are able to compute certain quantities of interest (e.g., c=(𝑿𝒜)c={\mathbb{P}}(\boldsymbol{X}\in\mathcal{A}) or =𝔼H(𝑿))\ell={\mathbb{E}}H(\boldsymbol{X})) with low variance. Normalizing Flows is a generative model that can closely approximate a wide variety of probability densities. Suppose the random object, 𝑿\boldsymbol{X}, can be mapped to another random object, 𝒁\boldsymbol{Z}, taking values in some space 𝒵\mathcal{Z} of the same dimensionality, and vice versa, via invertible differentiable functions 𝝍\boldsymbol{\psi} and ϕ=𝝍1\boldsymbol{\phi}=\boldsymbol{\psi}^{-1}, so that

𝑿=𝝍(𝒁;𝒖)and𝒁=ϕ(𝑿;𝒖),\boldsymbol{X}=\boldsymbol{\psi}(\boldsymbol{Z}\mathchar 24635\relax\;\boldsymbol{u})\qquad\text{and}\qquad\boldsymbol{Z}=\boldsymbol{\phi}(\boldsymbol{X}\mathchar 24635\relax\;\boldsymbol{u}),

where 𝒖\boldsymbol{u} is a vector representing all, if any, parameters. If, under the target density, ϕ\boldsymbol{\phi} maps 𝑿\boldsymbol{X} to a random object 𝒁\boldsymbol{Z} that has a simple base distribution, such as a (multivariate) normal or uniform distribution, then the target distribution can be sampled indirectly by sampling from the base distribution and mapping the result using 𝝍\boldsymbol{\psi}. The name of such a generative model is derived from the idea that a sequence of invertible differentiable transformations can map even a very complex probability distribution to a simple base distribution, thereby forming a ‘normalizing flow’. The construction of the ‘flow’ relies on the principle that any composition of invertible differentiable functions will also be invertible and differentiable.

In what follows, we assume that 𝒳\cal X is a subset of n\mathbb{R}^{n} and that 𝑿\boldsymbol{X} and 𝒁\boldsymbol{Z} are “continuous” random variables; more precisely, that they have probability densities p𝑿p_{\boldsymbol{X}} and p𝒀p_{\boldsymbol{Y}} with respect to the Lebesgue measure on 𝒳\cal X. Since 𝝍\boldsymbol{\psi} is differentiable, the probability density function 𝑿\boldsymbol{X} can be expressed in terms of the probability density function 𝒁\boldsymbol{Z} via the relation

p𝑿(𝒙)=p𝒁(ϕ(𝒙;𝒖))|detDϕ(𝒙;𝒖)|,p_{\boldsymbol{X}}(\boldsymbol{x})=p_{\boldsymbol{Z}}(\boldsymbol{\phi}(\boldsymbol{x}\mathchar 24635\relax\;\boldsymbol{u}))\left|\det\mathrm{D}\boldsymbol{\phi}(\boldsymbol{x}\mathchar 24635\relax\;\boldsymbol{u})\right|, (8)

or, in terms of 𝒛\boldsymbol{z}:

p𝑿(𝝍(𝒛;𝒖))=p𝒁(𝒛)|detD𝝍(𝒛;𝒖)|1.p_{\boldsymbol{X}}(\boldsymbol{\psi}(\boldsymbol{z}\mathchar 24635\relax\;\boldsymbol{u}))=p_{\boldsymbol{Z}}(\boldsymbol{z})\left|\det\mathrm{D}\boldsymbol{\psi}(\boldsymbol{z}\mathchar 24635\relax\;\boldsymbol{u})\right|^{-1}. (9)

Here, Dϕ(𝒙;𝒖)\mathrm{D}\boldsymbol{\phi}(\boldsymbol{x}\mathchar 24635\relax\;\boldsymbol{u}) is Jacobian matrix of ϕ\boldsymbol{\phi} and the absolute value of its determinant is the Jacobian of ϕ\boldsymbol{\phi}, and similar for 𝝍\boldsymbol{\psi}. If 𝝍\boldsymbol{\psi} is a composition of other invertible differentiable functions,

𝝍(;𝒖)=i𝝍i(;𝒖i),\boldsymbol{\psi}(\,\cdot\,\mathchar 24635\relax\;\boldsymbol{u})=\operatorname*{\circ}_{i}\boldsymbol{\psi}_{i}(\,\cdot\,\mathchar 24635\relax\;\boldsymbol{u}_{i}),

then we can use the chain rule to combine the Jacobians as a product,

|detD𝝍(;𝒖)|=i|detD𝝍i(;𝒖i)|.\left|\det\mathrm{D}\boldsymbol{\psi}(\,\cdot\,\mathchar 24635\relax\;\boldsymbol{u})\right|=\prod_{i}\left|\det\mathrm{D}\boldsymbol{\psi}_{i}(\,\cdot\,\mathchar 24635\relax\;\boldsymbol{u}_{i})\right|.

Therefore, analogous to a feed-forward neural network, the full Jacobian can be computed during a single pass as 𝝍(𝒛;𝒖)\boldsymbol{\psi}(\boldsymbol{z}\mathchar 24635\relax\;\boldsymbol{u}) is computed. A similar result holds for the inverse: ϕ\boldsymbol{\phi}. Functional composition increases the complexity of the model while maintaining the accessibility of the Jacobian. In this way, Normalizing Flows provides the opportunity for highly expressible generative models, with computationally tractable density functions. As will be discussed in Section 2.4, training the model to learn a target density from a given function, rather than via training data, requires the ability to compute p𝑿p_{\boldsymbol{X}} as a differentiable function. Therefore, using a Normalizing Flows model improves and simplifies the framework introduced by Gibson \BBA Kroese (\APACyear2022) by eliminating the need to train a second neural network to approximate the density function via kernel density estimation.

2.3 Coupling Flows

Coupling Flows, first introduced by Dinh \BOthers. (\APACyear2015), is a family of Normalizing Flows which have been shown to be universal approximators of arbitrary invertible differentiable functions when composed appropriately (Teshima \BOthers., \APACyear2020). A single Coupling Flows unit splits 𝒛\boldsymbol{z} into two vectors, 𝒛A\boldsymbol{z}^{A} and 𝒛B\boldsymbol{z}^{B}, as well as 𝒙\boldsymbol{x} into 𝒙A\boldsymbol{x}^{A} and 𝒙B\boldsymbol{x}^{B}. It then transforms one part via an invertible differentiable function, 𝒘\boldsymbol{w}, called the coupling function, which contains parameters determined by the other partition. The relationship between the second part and the coupling function parameters, called the conditioner, Θ\Theta, does not need to be invertible, since this part is not transformed and is accessible when computing the inverse. This means functions with many learnable parameters, like neural networks, can be used to capture interesting and complex dependencies between variables. Figure 1 illustrates the structure of a Coupling Flow models and its inverse.

Figure 1: Flow charts showing the structure of a Coupling Flow (left) and its inverse (right).

The Coupling Flow can be expressed mathematically as:

𝒙=(𝒙A,𝒙B)=𝝍(𝒛A,𝒛B;𝒖)\displaystyle\boldsymbol{x}=(\boldsymbol{x}^{A},\boldsymbol{x}^{B})=\boldsymbol{\psi}(\boldsymbol{z}^{A},\boldsymbol{z}^{B}\mathchar 24635\relax\;\boldsymbol{u}) =(𝒘(𝒛A;Θ(𝒛B;𝒖)),𝒛B),\displaystyle=(\boldsymbol{w}(\boldsymbol{z}^{A}\mathchar 24635\relax\;\Theta(\boldsymbol{z}^{B}\mathchar 24635\relax\;\boldsymbol{u})),\boldsymbol{z}^{B}),
𝒛=(𝒛A,𝒛B)=ϕ(𝒙A,𝒙B;𝒖)\displaystyle\boldsymbol{z}=(\boldsymbol{z}^{A},\boldsymbol{z}^{B})=\boldsymbol{\phi}(\boldsymbol{x}^{A},\boldsymbol{x}^{B}\mathchar 24635\relax\;\boldsymbol{u}) =(𝒘1(𝒛A;Θ(𝒙B;𝒖)),𝒙B).\displaystyle=(\boldsymbol{w}^{-1}(\boldsymbol{z}^{A}\mathchar 24635\relax\;\Theta(\boldsymbol{x}^{B}\mathchar 24635\relax\;\boldsymbol{u})),\boldsymbol{x}^{B}).

Note that the parameters 𝒖\boldsymbol{u} of the flow are contained in the conditioner, Θ\Theta. Since the BB-partition does not transform, the Jacobian of the full transformation is just the Jacobian of the coupling function; that is,

|detD𝝍(𝒛,𝒖)|=|detD𝒘(𝒛A;Θ(𝒛B;𝒖))|.\left|\det\mathrm{D}\boldsymbol{\psi}(\boldsymbol{z},\boldsymbol{u})\right|=\left|\det\mathrm{D}\boldsymbol{w}(\boldsymbol{z}^{A}\mathchar 24635\relax\;\Theta(\boldsymbol{z}^{B}\mathchar 24635\relax\;\boldsymbol{u}))\right|.

The coupling function is often chosen to be a simple affine transformation (Dinh \BOthers., \APACyear2015, \APACyear2016; Kingma \BBA Dhariwal, \APACyear2018). Instead, we base our coupling function on the rational function proposed by Ziegler \BBA Rush (\APACyear2019):

r(z;θ1,θ2,θ3,θ4,θ5)=θ1z+θ2+θ31+(θ4z+θ5)2.r(z\mathchar 24635\relax\;\theta_{1},\theta_{2},\theta_{3},\theta_{4},\theta_{5})=\theta_{1}z+\theta_{2}+\frac{\theta_{3}}{1+(\theta_{4}z+\theta_{5})^{2}}. (10)

This function has five parameters when applied elementwise. However, separate parameters can be used for each dimension in the 𝒛A\boldsymbol{z}^{A} partition. For additional complexity, the coupling function can also be a composition of these rational functions, w(;Θ)=iri(;θ1i,θ2i,θ3i,θ4i,θ5i)w(\cdot\mathchar 24635\relax\;\Theta)=\operatorname*{\circ}_{i}r_{i}(\cdot\mathchar 24635\relax\;\theta_{1}^{i},\theta_{2}^{i},\theta_{3}^{i},\theta_{4}^{i},\theta_{5}^{i}). In this case the total number of parameters in the coupling function is five multiplied by the number of compositions multiplied by the number of dimension in 𝒛A\boldsymbol{z}^{A}. We choose the conditioner function to be a Multilayer Perceptron, with an input dimensionality to match 𝒛B\boldsymbol{z}^{B} and an output dimensionality to match the number of parameters in the coupling function.

In order for eq. 10 to define an invertible function, some restrictions have to be placed on the parameters. Choosing

θ1>0,θ4>0,and|θ3|<83θ19θ4\theta_{1}>0,\quad\theta_{4}>0,\quad\text{and}\quad|\theta_{3}|<\frac{8\sqrt{3}\theta_{1}}{9\theta_{4}}

ensures that the function rr in eq. 10 is a strictly increasing invertible function. We enforce these constraints, following Ziegler and Rush, by processing the output of the conditioner Multilayer Perceptron,

θ1\displaystyle\theta_{1} =eθ1,\displaystyle=\mathrm{e}^{\theta_{1}^{\prime}}, θ2\displaystyle\theta_{2} =θ2,\displaystyle=\theta_{2}^{\prime}, θ4\displaystyle\theta_{4} =eθ4,\displaystyle=\mathrm{e}^{\theta_{4}^{\prime}}, θ5\displaystyle\theta_{5} =θ5,\displaystyle=\theta_{5}^{\prime}, θ3\displaystyle\theta_{3} =0.9583θ29θ4tanhθ3,\displaystyle=0.95\frac{8\sqrt{3}\theta_{2}}{9\theta_{4}}\tanh\theta_{3}^{\prime},

where primed letters denote unconstrained outputs of the conditioner function. The factor of 0.95 in the expression for θ3\theta_{3} helps improve stability by precluding barely invertible functions that can exist near the bound |θ3|=83θ1/(9θ4)|\theta_{3}|=8\sqrt{3}\theta_{1}/(9\theta_{4}). While we do not need to compute the inverse in our method, it can be calculated as the real solution to a cubic equation.

A single coupling flow unit does not transform the BB-partition. So, functional composition is necessary to obtain a general approximation. To achieve this, the dimensions of each coupling flow unit are split differently to ensure that all dimensions have to opportunity to ‘influence’ all other dimensions, thereby capturing any dependencies between all dimensions. In practice, we permute the dimensions between Coupling Flow units, while fixing the partitioned dimensions. The permutation of dimensions can be viewed as special case of a linear transformation, where the transformation matrix is a permutation of the rows of the identity matrix. The Jacobian matrix of a linear transformation is equal to the transformation matrix. So, the Jacobian is just 1 and is trivial to include in the probability density calculations of eq. 8 and eq. 9.

2.4 Objective and Training

Choosing an appropriate Normalizing Flows model architecture allows any density to be approximated arbitrarily well if the correct parameters can be identified. We use a form of stochastic gradient descent to iteratively tune the parameters of the model to minimize the Kullback–Leibler (KL) divergence between the model density, qq say, and a known target function, h:𝒳h:\mathcal{X}\rightarrow\mathbb{R}, that is proportional to the target probability density function. That is, we minimize

𝒟(q,h):=𝔼qlnq(𝑿)h(𝑿).\mathcal{D}(q,h):={\mathbb{E}}_{q}\ln\frac{q(\boldsymbol{X})}{h(\boldsymbol{X})}. (11)

Note that a model density qq that minimizes eq. 11 also minimizes the KL divergence between qq and the actual (normalized) target density. The advantage of using eq. 11 is that the normalization constant of the target density does not need to be known. This motivates the following minimization program:111Actually, the first term in this objective corresponds to the negative differential entropy of the base density and does not depend on any parameters, so could be omitted. However, we keep it to make interpreting the loss a little simpler.

min𝒖L(𝒖):=min𝒖𝔼[lnp𝒁(𝒁)ln|detD𝝍(𝒁;𝒖)|lnh(𝝍(𝒁;𝒖))],\min_{\boldsymbol{u}}L(\boldsymbol{u}):=\min_{\boldsymbol{u}}{\mathbb{E}}\left[\ln p_{\boldsymbol{Z}}(\boldsymbol{Z})-\ln\left|\det\mathrm{D}\boldsymbol{\psi}(\boldsymbol{Z}\mathchar 24635\relax\;\boldsymbol{u})\right|-\ln h(\boldsymbol{\psi}(\boldsymbol{Z}\mathchar 24635\relax\;\boldsymbol{u}))\right], (12)

which corresponds to choosing parameters to minimize the KL divergence between the Normalizing Flows generative density and a target density whose probability density function is proportional to hh. The expression can be derived by substituting qq from eq. 9 into eq. 11 and taking the expectation with respect to the base density. The minimum KL divergence is zero, when the two densities are identical, so the minimum value of this objective is the negative natural logarithm of the normalization constant of hh. For example, if h(𝒙)=p(𝒙)H(𝒙)h(\boldsymbol{x})=p(\boldsymbol{x})H(\boldsymbol{x}), then the minimum value of the objective function LL is ln-\ln\ell. The objective function values can be estimated via

L^(𝒖):=1nk=1n[lnp𝒁(𝒁k)ln|detD𝝍(𝒁k;𝒖)|lnh(𝝍(𝒁k;𝒖))],\widehat{L}(\boldsymbol{u}):=\frac{1}{n}\sum_{k=1}^{n}\left[\ln p_{\boldsymbol{Z}}(\boldsymbol{Z}_{k})-\ln\left|\det\mathrm{D}\boldsymbol{\psi}(\boldsymbol{Z}_{k}\mathchar 24635\relax\;\boldsymbol{u})\right|-\ln h(\boldsymbol{\psi}(\boldsymbol{Z}_{k}\mathchar 24635\relax\;\boldsymbol{u}))\right], (13)

where 𝒁1,,𝒁n\boldsymbol{Z}_{1},\ldots,\boldsymbol{Z}_{n} is an iid sample of the base density.

The objective function LL marks a clear distinction between how we train a Normalizing Flows generative model using a target function, and how they are typically trained using a data set. In other works, such as by Dinh \BOthers. (\APACyear2015), Normalizing Flows are trained to learn the density of a data set by maximizing the log-likelihood of the sampled data. While this can be considered equivalent to minimizing the KL divergence, the training process involves sampling the data set, which corresponds to sampling 𝑿\boldsymbol{X}, and then moving in the ‘normalizing’ direction with ϕ\boldsymbol{\phi} to solve the maximization program:

max𝒖𝔼qlnq(𝑿)=max𝒖𝔼q[lnp𝒁(ϕ(𝑿;𝒖))+ln|detDϕ(𝑿;𝒖)|].\max_{\boldsymbol{u}}{\mathbb{E}}_{q}\ln q(\boldsymbol{X})=\max_{\boldsymbol{u}}{\mathbb{E}}_{q}\left[\ln p_{\boldsymbol{Z}}(\boldsymbol{\phi}(\boldsymbol{X}\mathchar 24635\relax\;\boldsymbol{u}))+\ln\left|\det\mathrm{D}\boldsymbol{\phi}(\boldsymbol{X}\mathchar 24635\relax\;\boldsymbol{u})\right|\right].

After the model is trained, new data can be generated by sampling 𝒁\boldsymbol{Z} and moving in the ‘generating’ direction using 𝑿=𝝍(𝒁;𝒖)\boldsymbol{X}=\boldsymbol{\psi}(\boldsymbol{Z}\mathchar 24635\relax\;\boldsymbol{u}). Therefore, in the data-focused case, it is essential to be able to compute both transformations, 𝝍\boldsymbol{\psi} and ϕ\boldsymbol{\phi}. However, in the context of rare-event simulation, we do not have access to training data, but a target function that is proportional to a target probability density, and all training occurs in the ‘generating’ direction. Therefore, computing ϕ\boldsymbol{\phi} is not required (although 𝝍\boldsymbol{\psi} must still be invertible in principle). Algorithm 1 outlines our training procedure.

Algorithm 1 Training a flow-based generative model using a target function.
1:Randomly initialize parameters, 𝒖\boldsymbol{u}.
2:for many iterations do
3:     Independently sample 𝒁1,,𝒁n\boldsymbol{Z}_{1},\ldots,\boldsymbol{Z}_{n} from the base distribution.
4:     Compute ln|detD𝝍(𝒁k;𝒖)|\ln\left|\det\mathrm{D}\boldsymbol{\psi}(\boldsymbol{Z}_{k}\mathchar 24635\relax\;\boldsymbol{u})\right| and 𝑿k=𝝍(𝒁k;𝒖)\boldsymbol{X}_{k}=\boldsymbol{\psi}(\boldsymbol{Z}_{k}\mathchar 24635\relax\;\boldsymbol{u}) during a single pass.
5:     Compute the target function values h(𝑿k)h(\boldsymbol{X}_{k}).
6:     Estimate the objective function via eq. 13.
7:     Adjust parameters 𝒖\boldsymbol{u} via standard gradient descent methods.
8:end for

2.5 Conditional Target Densities

As previously shown in eq. 4 and eq. 7, when estimating rare-event probabilities and other quantity conditioned on a rare event, the target density includes an indicator function, 𝟙𝒜(𝒙)\mathbbm{1}_{\mathcal{A}}(\boldsymbol{x}). However, the term, lnh(𝝍(𝒁;𝒖))-\ln h(\boldsymbol{\psi}(\boldsymbol{Z}\mathchar 24635\relax\;\boldsymbol{u})) in the objective function LL in eq. 12 requires each value h(𝒙)h(\boldsymbol{x}) of the target function value to be strictly positive. To circumvent this problem, we reuse the solution presented by Gibson \BBA Kroese (\APACyear2022), to approximate 𝟙𝒜(𝒙)\mathbbm{1}_{\mathcal{A}}(\boldsymbol{x}) by defining a strictly positive penalty factor,

ρ(𝒙):=exp[α(γS(𝒙))𝟙Ac(𝒙)],\rho(\boldsymbol{x}):=\exp\left[-\alpha\left(\gamma-S(\boldsymbol{x})\right)\mathbbm{1}_{A^{c}}(\boldsymbol{x})\right], (14)

which is a positive approximation of 𝟙𝒜(𝒙)\mathbbm{1}_{\mathcal{A}}(\boldsymbol{x}) when α>0\alpha>0 and 𝒜c={𝒙𝒳|S(𝒙)<γ}\mathcal{A}^{c}=\{\boldsymbol{x}\in\mathcal{X}~{}|~{}S(\boldsymbol{x})<\gamma\} is the complement set of 𝒜\mathcal{A}. The approximation is more accurate the larger the value of α\alpha and is exact in the limit α\alpha\rightarrow\infty. Figure 2 illustrates the convergence of the approximation.

Refer to caption
Figure 2: A comparison between the penalty factor ρ(𝒙)\rho(\boldsymbol{x}) (colored lines) and the step function 𝟙𝒜(𝒙)\mathbbm{1}_{\mathcal{A}}(\boldsymbol{x}) (black dashed line). As α\alpha grows the approximation becomes more accurate.

Therefore, if the target density includes the indicator function as a factor, we replace it with the penalty factor with appropriate performance function, SS. For example, if the aim is to estimate 𝒜\ell^{\mathcal{A}} from eq. 5, then the target function would be of the form

h(𝒙)=p(𝒙)H(𝒙)ρ(𝒙).h(\boldsymbol{x})=p(\boldsymbol{x})H(\boldsymbol{x})\rho(\boldsymbol{x}).

Note that in this case

𝔼[lnh(𝑿)]=𝔼[lnp(𝑿)H(𝑿)]α𝔼[(γS(𝑿))𝟙Ac(𝑿)],{\mathbb{E}}\left[-\ln h(\boldsymbol{X})\right]={\mathbb{E}}\left[-\ln p(\boldsymbol{X})H(\boldsymbol{X})\right]-\alpha{\mathbb{E}}\left[\left(\gamma-S(\boldsymbol{X})\right)\mathbbm{1}_{A^{c}}(\boldsymbol{X})\right],

so learning a conditional density by including the penalty factor in the target function is equivalent to learning the unconditional target p(𝒙)H(𝒙)p(\boldsymbol{x})H(\boldsymbol{x}) and including an additional penalty term in the objective that penalizes samples outside the rare-event region. In this way, α\alpha can be interpreted as the weight of the penalty. For our experiments we choose α=100\alpha=100.

3 Results

In this section we look at how well Normalizing Flows models can learn various target functions using several practical examples.

3.1 Truncated Normal Density

Firstly we consider a very simple one-dimensional example of a truncated normal density. Let XX have a standard normal distribution: X𝒩(0,1)X\sim\mathcal{N}(0,1) and consider the rare-event region 𝒜=[γ,)\mathcal{A}=[\gamma,\infty). An obvious choice for the performance function is S(x)=xS(x)=x, so if the goal is to estimate the rare-event probability c=(Xγ)c={\mathbb{P}}(X\geq\gamma), then the target function is

h(x):=pX(x)exp[α(γx)𝟙(,γ)(x)].h(x):=p_{X}(x)\exp\left[-\alpha\left(\gamma-x\right)\mathbbm{1}_{(-\infty,\gamma)}(x)\right].

A Coupling Flow is unsuitable for a one-dimensional problem, so we use a composition of three rational functions from eq. 10, containing a total of 15 parameters. Choosing a batch size of n=1,000n=1,000, a learning rate of 0.0010.001, a weight decay parameter of 0.00010.0001 and α=100\alpha=100, the model was trained via Algorithm 1 and the Adam gradient descent optimizer (Kingma \BBA Ba, \APACyear2017) for 30,000 iterations. Figure 3 illustrates how the model converges towards the truncated normal target density, with threshold parameter γ=3\gamma=3.

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 3: The normalizing flow probability density function at different stages of training. The black density is the target probability density of a normal density truncated to the interval [3,)[3,\infty). The blue density is the probability density of XX at the corresponding training iteration.

Using a sample of 1,000 points, eq. 3 gives us an estimate 0.00134576 of the rare-event probability (X3){\mathbb{P}}(X\geq 3) , which is about 0.3% error from the true value of about 0.0013499. From the same sample, the sample standard deviation is about 0.000261. We can estimate the minimum sample size needed for a 1% standard error by

n=(σc^10.01)2376,n=\left(\frac{\sigma}{\widehat{c}}\frac{1}{0.01}\right)^{2}\approx 376,

where σ\sigma is the sample standard deviation of the summand in eq. 3. Using a crude Monte Carlo estimator, the relative standard error should be c(1c)0.0367\sqrt{c(1-c)}\approx 0.0367, requiring a much larger sample size of about n=7.4×106n=7.4\times 10^{6} for a 1% standard error.

The learned density is not exactly identical to the target conditional density, but is very close. To quantify how good the approximation is, we can estimate the KL divergence between the target density qq^{*} and the learned density using

𝒟(q,q)=𝔼q[lnq(𝑿)q(𝑿)]=𝔼q[q(𝑿)q(𝑿)lnq(𝑿)q(𝑿)].\mathcal{D}(q^{*},q)={\mathbb{E}}_{q^{*}}\left[\ln\frac{q^{*}(\boldsymbol{X})}{q(\boldsymbol{X})}\right]={\mathbb{E}}_{q}\left[\frac{q^{*}(\boldsymbol{X})}{q(\boldsymbol{X})}\ln\frac{q^{*}(\boldsymbol{X})}{q(\boldsymbol{X})}\right].

In this example the KL divergence is estimated to be 0.034650.03465 with a standard error of 0.56%0.56\% using sample size 10,000.

3.2 Two-Dimensional Exponential Density

Secondly, we try another example from Gibson \BBA Kroese (\APACyear2022) of a two-dimensional exponential density. In particular, in this example the probability density function of 𝑿=(X1,X2)\boldsymbol{X}=(X_{1},X_{2}) is:

p(x1,x2)=e(x1+x2),p(x_{1},x_{2})=\mathrm{e}^{-(x_{1}+x_{2})},

and the rare-event region is 𝒜={(x1,x2)+2:x1+x2γ}\mathcal{A}=\{(x_{1},x_{2})\in\mathbb{R}^{2}_{+}~{}:~{}x_{1}+x_{2}\geq\gamma\}, with performance function, S(x1,x2)=x1+x2S(x_{1},x_{2})=x_{1}+x_{2}. To estimate c=(X1+X2γ)c={\mathbb{P}}(X_{1}+X_{2}\geq\gamma), the target function is

h(x1,x2)=exp[(x1+x2)α(γx1x2)𝟙{x1+x2<γ}].h(x_{1},x_{2})=\exp\left[-(x_{1}+x_{2})-\alpha\left(\gamma-x_{1}-x_{2}\right)\mathbbm{1}_{\{x_{1}+x_{2}<\gamma\}}\right].

This time we choose a flow model composed of six Coupling Flows, interlaced with dimension permutations, followed by an elementwise exponential function. The exponentiation ensures that the generated points are positive. The coupling function in each Coupling Flow is a composition of two rational functions of the form in eq. 10. The conditioner in each Coupling Flow is a linear transformation plus a constant vector. Therefore, the total number of learnable parameters is 60. Figure 4 shows the structure of this model via a flow chart.

The model was trained for 100,000 iterations with a learning rate of 0.0001, a weight decay of constant of 0.0001, and a batch size of 10,000. When γ=10\gamma=10, c=(1+γ)eγ0.000499399c=(1+\gamma)\mathrm{e}^{-\gamma}\approx 0.000499399. After training, with a sample size of 1,000 the estimated constant is 0.000499200.00049920, with standard deviation of about 6.34×1056.34\times 10^{-5} giving a relative standard error of 0.40%. Achieving a comparable standard error using a crude Mote Carlo estimator would require a sample size of more than 10810^{8}. Figure 5 shows how well the model learned the target density by including a scatter plot of a sample of points as well as comparing the learned probability density with the target probability density. The KL-divergence is estimated to be about 0.011 with 9.7% standard error using sample size of 10,000.

Figure 4: Flow chart showing the normalizing flow structure used in the exponential density example. The base density is a 2D multivariate normal density which forms the input to a sequence of six consecutive Coupling Flow and dimension permutation pairs, followed by an exponential function applied to both dimensions. The learnable parameters are found in the conditioner functions of all six Coupling Flows.
Refer to caption
Refer to caption
Figure 5: 1,000 points sampled from the trained normalizing flow. The left scatter plot shows the how the sampled two-dimensional points almost exclusively satisfy the rare-event condition, x1+x210x_{1}+x_{2}\geq 10. The right plot compares the learned log-density with the renormalized exponential log-density. The dashed black line shows the target density, which is equal to the renormalized log density, until about 2.398-2.398, which is the maximum log-density when x1+x2=10x_{1}+x_{2}=10.

3.3 Bridge Network

This third example comes from Chapter 9 of Kroese \BOthers. (\APACyear2011). In this example, we increase the number of dimensions to five, each corresponding to a random edge length in a bridge network, shown in Figure 6. The graph contains five edges and four nodes, where the length of each edge is uniformly distributed and the goal is to identify the expected shortest path from node AA to node DD. Specifically, we wish to estimate the expected length of the shortest path from AA to DD; that is, the expectation \ell in eq. 1, with Xi𝒰(0,1)X_{i}\sim\mathcal{U}(0,1) and

H(𝑿)=min{X1+X4,X1+3X3+2X5,2X2+3X3+X4,2X2+2X5}.H(\boldsymbol{X})=\min\{X_{1}+X_{4},X_{1}+3X_{3}+2X_{5},2X_{2}+3X_{3}+X_{4},2X_{2}+2X_{5}\}.

In this case p(𝒙)=1p(\boldsymbol{x})=1, so the optimal Importance Sampling density that minimizes the variance in estimating \ell is q(𝒙)=H(𝒙)/q^{*}(\boldsymbol{x})=H(\boldsymbol{x})/\ell. Therefore, the target function is h(𝒙)=H(𝒙)h(\boldsymbol{x})=H(\boldsymbol{x}).

Figure 6: A bridge network containing four nodes and five edges, where the edge lengths X1,,X5X_{1},\ldots,X_{5} are random variables. So the minimum path length from node AA to node DD is also a random variable.

The space of allowed values in this example is 𝒳=[0,1]5\mathcal{X}=[0,1]^{5}, so it is reasonable to choose a model architecture in which the domains and co-domains of each composed flow unit is [0,1]5[0,1]^{5}. To achieve this we choose a uniform base distribution, Zi𝒰(0,1)Z_{i}\sim\mathcal{U}(0,1). Additionally, the model is composed of five Coupling Flows and dimension permutation pairs. In each permutation the dimensions are cycled via [3,4,5,1,2][3,4,5,1,2]. The dimensions are partitioned so the coupling function transforms the first three dimensions. The coupling function of each Coupling Flow is a rational function based on eq. 10 with separate parameters for each dimension. The conditioner, like the previous example, is a linear transformation. Figure 7 uses a flow chart to illustrate the structure of this model. The parameters in the coupling functions of each Coupling Flow unit contain additional constraints to ensure that r(0)=0r(0)=0 and r(1)=1r(1)=1, thereby preserving the domain of [0,1]5[0,1]^{5} at each flow component.

Figure 7: Flow chart showing the normalizing flow structure used in the bridge network example. The base distribution is a 5D uniform distribution which forms the input to a sequence of five consecutive Coupling Flows and dimension permutation pairs.

The model was trained with a learning rate of 0.0001, weight decay parameter of 0.0001, batch size 10,000 for 300,000 iterations. The scatter plot in Figure 8 compares the learned probability density of 𝝍(𝒁;𝒖)\boldsymbol{\psi}(\boldsymbol{Z}\mathchar 24635\relax\;\boldsymbol{u}) with the optimal sampling density given by eq. 2. The learned density forms a fairly good approximation of the optimal sampling density with a KL divergence of about 0.0017 with 29% standard error (using sample of 10,000). The histograms in Figure 8 compare the densities of the summands of the Importance Sampling estimator from eq. 1 and the crude Monte Carlo estimator, being the sample mean of H(𝑿)H(\boldsymbol{X}). In this example both methods provide accurate estimates of the expected minimum path length. Typical values of the crude Monte Carlo and IS estimators are 0.9272 with a standard relative error of 0.43%0.43\% and 0.9301 with a standard relative error of 0.053%0.053\%. While both values agree with the theoretical value of l=13391440=0.929861˙l=\frac{1339}{1440}=0.92986\dot{1}, the IS estimator reduces the variance by a factor of about 65.

Refer to caption
Refer to caption
Figure 8: A scatter plot (left) comparing the learned probability density and the target density, and a histogram (right) comparing the summand densities of the crude Monte Carlo estimator (blue) and the Importance Sampling estimator (orange). Clearly the model has learned to approximate the target density and can estimate the expected minimum path length with a much lower variance using Importance Sampling than the crude Monte Carlo estimator.

3.4 Conditional Bridge Network

In this example we continue with the bridge network, but now consider the rare event that the shortest path includes three edges, passing through the middle, rather than just two. That is, the shortest path is either ABCDABCD or ACBDACBD. The rare-event region can be defined by choosing

S(𝑿)=min(X1+3X3+2X5,2X2+3X3+X4)min(X1+X4,2X2+2X5),S(\boldsymbol{X})=\min\left(X_{1}+3X_{3}+2X_{5},2X_{2}+3X_{3}+X_{4}\right)-\min\left(X_{1}+X_{4},2X_{2}+2X_{5}\right),

and γ=0\gamma=0. That is, the penalty depends on the difference between the shortest ‘inner’ path (ABCDABCD or ACBDACBD) and the shortest ‘outer’ path (ABDABD or ACDACD).

The exact same model was reused, but trained across 500,000 iterations with the penalty factor included in the target function. In this way, the goal is to estimate the expected shortest path length of the distribution conditioned on the shortest path being ABCDABCD or ACBDACBD. The rare event probability can be estimated using eq. 3 with a sample size of 10,000 to be about 0.0346 with a standard relative error of 1.3%1.3\%. This corresponds to a variance reduction in a factor of about 17 compared to the crude Monte Carlo estimate. The conditional expected shortest path length can now be estimated via eq. 6 as 0.913 with a standard relative error of 1.7%1.7\%.

By learning an approximation of the conditional distribution, we can explore the likely conditions required to generate the rare event. Figure 9 compares scatter plots between the learned unconditional and the conditional distributions. The unconditional distribution is not too different from the original uniform distribution, with the exception of X1X_{1} and X4X_{4}. This is the expected result since the path, ABDABD, with path length X1+X4X_{1}+X_{4}, is the most likely to be the shortest. However, the learned conditional distribution is significantly different to the original uniform distribution. From the scatter plots it is clear that X3X_{3} is almost always short, and X2X_{2} and X5X_{5} have a strong negative correlation. Therefore, we can conclude that the most likely conditions for the shortest path to be one that passes through the middle edge, is that the middle edge is short, and that exactly one of bottom two edges is short. This is congruent with expectations since a long middle edge would likely make the total path length longer than both of the outer paths, if both bottom edges were short then the middle edge would be bypassed, and if both bottom edges were long, then the shortest path would likely be across the top.

Refer to caption
Refer to caption
Figure 9: Scatter plots comparing the each of the five dimensions against each other, as well as histograms showing the distribution of each dimension. The darker purple colour corresponds to a higher probability density. The left figure shows the learned unconditional distribution while the right shows the learned conditional distribution.

3.5 Asian Call Option

In this section we look at stock price models and expected pricing of call options. This example, from Chapter 15 of Kroese \BOthers. (\APACyear2011), specifically looks at estimating the expected Asian call option of a stock whose undiscounted price, StS_{t}, at time tt can be modelled by the Stochastic Differential Equation (SDE)

dSt=rStdt+σStdXt,dS_{t}=rS_{t}dt+\sigma S_{t}dX_{t},

where the drift, rr, is the risk-free rate of return, {Xt}\{X_{t}\} is a Wiener process and σ\sigma is the volatility. This SDE is solved by

St=S0e(rσ2/2)/t+σXt.S_{t}=S_{0}\mathrm{e}^{(r-\sigma^{2}/2)/t+\sigma X_{t}}.

The average stock price across the interval t[0,T]t\in[0,T] can be approximated by

S¯T=1n+1i=0nSti,ti=iTn,i=0,,n,\bar{S}_{T}=\frac{1}{n+1}\sum^{n}_{i=0}S_{t_{i}},\quad t_{i}=\frac{iT}{n},~{}i=0,\ldots,n,

where the time has been discretized into nn equal intervals of Δt=T/n\Delta t=T/n,

Sti=S0exp((rσ2/2)ti+σXti),S_{t_{i}}=S_{0}\exp\left((r-\sigma^{2}/2)t_{i}+\sigma X_{t_{i}}\right),

and Xti𝒩(0,Δt)X_{t_{i}}\sim\mathcal{N}(0,\Delta t). The discounted payoff at time t=0t=0 of the Asian option is

H(𝑿)=erT(S¯TK)+,H(\boldsymbol{X})=\mathrm{e}^{-rT}(\bar{S}_{T}-K)^{+},

where 𝑿=(Xt1,,Xtn)\boldsymbol{X}=\left(X_{t_{1}},\ldots,X_{t_{n}}\right) and the ++ superscript denotes the ramp function. Since {Xt}\{X_{t}\} is a Wiener process, 𝑿𝒩(𝟎,ΔtI)\boldsymbol{X}\sim\mathcal{N}(\mathbf{0},\Delta tI), where II is the identity matrix of appropriate dimensionality. If the goal is to estimate the expected discounted payoff then, according to eq. 2, the optimal sampling distribution should include a factor of H(𝒙)H(\boldsymbol{x}). However, the objective eq. 12 requires the target function to be non-zero at all sampled values. Therefore, we introduce the following approximation

x+v(x):={x,if xδδexδ1,if x<δx^{+}\approx v(x):=\begin{cases}x,&\text{if }x\geqslant\delta\\ \delta\mathrm{e}^{\frac{x}{\delta}-1},&\text{if }x<\delta\end{cases}

which is a strictly positive continuously differentiable function that is exact in the limit δ0\delta\rightarrow 0. In our experiments we choose δ=0.5\delta=0.5. Therefore, the target function is

h(𝒙)=v(erT(S¯TK))1(2πΔt)n/2e12Δt|𝒙|2.h(\boldsymbol{x})=v\left(\mathrm{e}^{-rT}(\bar{S}_{T}-K)\right)\frac{1}{(2\pi\Delta t)^{n/2}}\mathrm{e}^{-\frac{1}{2\Delta t}|\boldsymbol{x}|^{2}}.

We chose the base distribution to be 𝒁𝒩(𝟎,ΔtI)\boldsymbol{Z}\sim\mathcal{N}(\mathbf{0},\Delta tI) and the normalizing flow to be comprised of six coupling flow units with equal partition sizes. The dimensions were permuted after each coupling flow unit. Figure 10 illustrates the normalizing flows structure.

Figure 10: Flow chart showing the normalizing flow structure used in the Asian call option example. The base density is a 88D multivariate normal density which forms the input to a sequence of six consecutive Coupling Flow and dimension permutation pairs.

The parameters used in this example were r=0.07r=0.07, σ=0.2\sigma=0.2, K=35K=35, S0=40S_{0}=40, T=4/12T=4/12 and n=88n=88. The model was trained for 30000 steps with a batch size of 1000, learning rate of 0.0001 and weight decay of 0.0001. Figure 11 compares the histograms of the crude Monte Carlo and importance sampling estimator summands. Using a sample of 10000, the crude estimator is about 5.3432 with a standard relative error of 0.49%0.49\%, and the importance sampling estimator is about 5.3580 with a standard relative error of 0.23%0.23\%. This corresponds to a variance reduction of about a factor of 4.4.

Refer to caption
Figure 11: Histograms of the crude Monte Carlo and Importance Sampling estimator summands. Importance Sampling using the learned distribution allows the expected discounted payoff of the Asian option to be estimated with less variance than the crude Monte Carlo estimate.

Let us explore the rare event in which the discounted payoff exceeds 14, 𝒜={𝒙n:erT(S¯TK)+γ}{\mathcal{A}=\{\boldsymbol{x}\in\mathbb{R}^{n}~{}:~{}\mathrm{e}^{-rT}(\bar{S}_{T}-K)^{+}\geq\gamma\}} with performance function S(𝒙)=erT(S¯TK)γS(\boldsymbol{x})=\mathrm{e}^{-rT}(\bar{S}_{T}-K)\geqslant\gamma and γ=14\gamma=14. In this case the goal is to explore how this rare event occurs and to estimate the probability of the rare event. Therefore, the target function is just the probability density function of the Wiener process multiplied by the penalty factor,

h(𝒙)=1(2πΔt)n/2e12Δt|𝒙|2ρ(𝒙).h(\boldsymbol{x})=\frac{1}{(2\pi\Delta t)^{n/2}}\mathrm{e}^{-\frac{1}{2\Delta t}|\boldsymbol{x}|^{2}}\rho(\boldsymbol{x}).

This time the model was trained for 100000 iterations. With a sample of 10000 the crude and IS estimators are about 0.0022 with 21%21\% relative error, and 0.0015797 with 0.46%0.46\% relative error respectively. This corresponds to a variance reduction of a factor of about 2100. Using the estimated normalization constant of 0.0015797 the KL divergence between the learnt density and the target density is about 0.15157 with a standard relative error of 0.40%0.40\% indicating that the learnt distribution is very close to the target conditional distribution. Figure 12 illustrates the strong linear relationship between the learnt log density and the target unconditional density, and compares the distributions of the simulated discounted Asian options.

Refer to caption Refer to caption
Figure 12: A scatter plot comparing the learnt log probability density and the unconditional log probability density (left) and Histograms (right) of simulated discounted payoffs using crude Monte Carlo (blue) and the learned conditional distribution (orange).

Since the learnt distribution is a decent approximation of the conditional distribution, we can use the simulated trajectories to gain insights into how high performing options can occur. Figure 13 shows typical stock prices across time when simulating using the original probability measure and the learnt probability measure. In the first instance the average value is largely static, increasing very slowly, while the variance grows across time. However, conditioning the simulation on achieving a discounted payoff of at least 14 changes the trend unexpectedly. Rather than a steady growth in stock price across the whole time interval, most of the stock price growth occurs by t=0.25t=0.25. Additionally, the variance is fairly constant over time during this growth time, mostly growing near the end. This result is also visible in the covariance matrix of the stock price, illustrated in figure 14. Under the original probability distribution the stock price near maturity depends little on the early prices, and the variance grows roughly linearly with time. However, under the learnt probability distribution, the variance in stock price is largely constant until a short time before maturity. Additionally, there is a small correlation between early stock prices and late stock prices. This could indicate that

Refer to caption Refer to caption
Figure 13: Undiscounted stock prices generated by crude Monte Carlo (left) and the model conditioned on discounted payoffs of at least 14 (right). Coloured lines represent 10 trajectories, dashed black lines represent the mean and the shaded grey region is a single standard deviation from the mean.
Refer to caption
Figure 14: Covariance matrix of undiscounted stock prices simulated using crude Monte Carlo (left) and generated by the model conditioned on discounted payoffs of at least 14 (right). In the crude simulation the variance grows with with time. There is a small amount of covariance between early stock prices and later stock prices in the conditionally generated prices. It appears that while most stock prices achieve the discounted payoff of at least 14 through consistent steady increases, there are a few trajectories which perform well near the beginning or near the end, but not both.

3.6 Double Slit Experiment

In this example we test our method on a high-dimensional problem. A dimensionless particle traverses an unbounded 22-dimensional environment for a maximum duration of TT, which is discretized into Δt=d2\Delta t=\frac{d}{2} time steps, where d2{d\geq 2}. The environment consists of a barrier with two slits, located at (xslit,yslit)(x_{\mathrm{slit}},y_{\mathrm{slit}}) and (xslit,yslit)(x_{\mathrm{slit}},-y_{\mathrm{slit}}). The width of both slits is wslitw_{\mathrm{slit}}. Additionally, the environment contains a vertical screen located at xscreenx_{\mathrm{screen}}. The aim of the particle is to reach the screen while simultaneously passing through one of the two slits. At time step 0, the particle starts at the 2D2D-position 𝑽0=(0,0)\boldsymbol{V}_{0}=(0,0). Given a random vector 𝑸𝒩(𝟎,2TdId)\boldsymbol{Q}\sim\mathcal{N}(\boldsymbol{0},\frac{2T}{d}I_{d}), where IdI_{d} is the dd-dimensional identity matrix, the position of the particle at time step 1kd/21\leq k\leq d/2 evolves according to

𝑽k=𝑽k1+𝑸k,\boldsymbol{V}_{k}=\boldsymbol{V}_{k-1}+\boldsymbol{Q}_{k}, (15)

where 𝑸k=(Q2k1,Q2k)\boldsymbol{Q}_{k}=(Q_{2k-1},Q_{2k})^{\top} is the kk-th component vector of 𝑸\boldsymbol{Q}. The path 𝑽\boldsymbol{V} of the particle induced by 𝑸\boldsymbol{Q} is then the sequence of points 𝑽=(𝑽k)k=1d/2\boldsymbol{V}=(\boldsymbol{V}_{k})_{k=1}^{d/2}, with 𝑽k\boldsymbol{V}_{k} defined according to eq. 15.

Let gdg_{d} be the density of 𝒩(𝟎,2TdId)\mathcal{N}(\boldsymbol{0},\frac{2T}{d}I_{d}). Our goal is to learn a sampling density that is proportional to gdg_{d} truncated to the rare-event region 𝒜={𝒒d:S(𝒒)0}\mathcal{A}=\left\{\boldsymbol{q}\in\mathbb{R}^{d}:S(\boldsymbol{q})\geq 0\right\}, where 𝒒\boldsymbol{q} is an outcome of 𝑸\boldsymbol{Q}. The performance function S(𝒒)S(\boldsymbol{q}) is defined as follows. Suppose is MM is the time step where 𝑿\boldsymbol{X} is closest to the screen; that is,

M:=argmin1kd/2(xscreenXk),M:=\mathop{\rm argmin}_{1\leq k\leq d/2}\left(x_{\mathrm{screen}}-X_{k}\right),

where XkX_{k} is the xx-coordinate of the particle at time step kk. Additionally, let NN be the first step where XN1xslitXNX_{N-1}\leq x_{\mathrm{slit}}\leq X_{N}, i.e., where the particle crosses xslitx_{\mathrm{slit}}. The performance function of an outcome 𝒒\boldsymbol{q} of 𝑸\boldsymbol{Q}, with path 𝒗=P(𝒒)\boldsymbol{v}=P(\boldsymbol{q}), and outcomes nn and mm of NN and MM is then defined as

S(𝒒)=(min{f1(yn),f2(yn)}+f3(xm)),S(\boldsymbol{q})=-\bigg{(}\min\{f_{1}(y_{n}),f_{2}(y_{n})\}+f_{3}(x_{m})\bigg{)}, (16)

where

f1(yn)\displaystyle f_{1}(y_{n}) =(|ynyslit|wslit2)𝟙{|ynyslit|>wslit2},\displaystyle=\left(\left|y_{n}-y_{\mathrm{slit}}\right|-\frac{w_{\mathrm{slit}}}{2}\right)\mathbbm{1}_{\left\{\left|y_{n}-y_{\mathrm{slit}}\right|>\frac{w_{\mathrm{slit}}}{2}\right\}},
f2(yn)\displaystyle f_{2}(y_{n}) =(|yn+yslit|wslit2)𝟙{|yn+yslit|>wslit2},\displaystyle=\left(\left|y_{n}+y_{\mathrm{slit}}\right|-\frac{w_{\mathrm{slit}}}{2}\right)\mathbbm{1}_{\left\{\left|y_{n}+y_{\mathrm{slit}}\right|>\frac{w_{\mathrm{slit}}}{2}\right\}},
f3(xm)\displaystyle f_{3}(x_{m}) =(xscreenxm)𝟙{xscreenxm>0}.\displaystyle=\left(x_{\mathrm{screen}}-x_{m}\right)\mathbbm{1}_{\left\{x_{\mathrm{screen}}-x_{m}>0\right\}}. (17)

The functions f1f_{1} and f2f_{2} in section 3.6 measure the distances of the yy-coordinate (i.e., yny_{n}) of point 𝒗n\boldsymbol{v}_{n} to the slits and the min\min operator in eq. 16 ensures that only the minimum distance to the slits contributes to the performance function. Note that the definitions of f1f_{1} and f2f_{2} imply that a path successfully crosses a slit if yslitwslit2|yn|yslit+wslit2y_{\mathrm{slit}}-\frac{w_{\mathrm{slit}}}{2}\leq\left|y_{n}\right|\leq y_{\mathrm{slit}}+\frac{w_{\mathrm{slit}}}{2}, even if the path touches the barrier from 𝒗n1\boldsymbol{v}_{n-1} to 𝒗n\boldsymbol{v}_{n}. In case the particle does not cross xslitx_{\mathrm{slit}}, we set both f1f_{1} and f2f_{2} to 0. The function f3f_{3} is an additional penalty term which penalizes the xx-coordinate of the point 𝒗m\boldsymbol{v}_{m} (i.e., xmx_{m}) to be on the left-hand side of the screen.

With the performance function in eq. 16, the target function is defined as

h(𝒒)=gd(𝒒)exp[α(γS(𝒒))𝟙{S(𝒒)<0}],h(\boldsymbol{q})=g_{d}(\boldsymbol{q})\exp\left[-\alpha(\gamma-S(\boldsymbol{q}))\mathbbm{1}_{\left\{S(\boldsymbol{q})<0\right\}}\right], (18)

with γ=0\gamma=0. For the base distribution, we chose a Gaussian Mixture Model (GMM\mathrm{GMM}) consisting of two dd-dimensional component distributions 𝒩(𝝁1,2TdId)\mathcal{N}(\boldsymbol{\mu}_{1},\frac{2T}{d}I_{d}) and 𝒩(𝝁2,2TdId)\mathcal{N}(\boldsymbol{\mu}_{2},\frac{2T}{d}I_{d}) with equal weights, where 𝝁1=(1,1,,1,1)\boldsymbol{\mu}_{1}=(1,-1,\ldots,1,-1)^{\top} and 𝝁2=(1,,1)\boldsymbol{\mu}_{2}=(1,\ldots,1)^{\top}. This bimodal mixture model helps in learning the correct bimodal target density. The normalizing flow is composed of twenty coupling flow units with equal partition sizes. The dimensions are permuted after each coupling flow unit. Figure 15 illustrates the normalizing flow.

In our experiments we set T=1T=1, xslit=5x_{\mathrm{slit}}=5, yslit=1.5y_{\mathrm{slit}}=1.5, wslit=1w_{\mathrm{slit}}=1 and xscreen=10x_{\mathrm{screen}}=10. We then trained three normalizing flows using d=100d=100, d=500d=500 and d=1,000d=1,000, where each flow was trained for 100,000100,000 iterations with a batch size of 200200, learning rate of 10610^{-6} and weight decay of 10810^{-8}.

Figure 15: Flow chart showing the normalizing flow structure used in the Double Slit example. The base density is a Gaussian Mixture Model (GMM\mathrm{GMM}) consisting of two component distributions 𝒩(𝝁1,2TdId)\mathcal{N}(\boldsymbol{\mu}_{1},\frac{2T}{d}I_{d}) and 𝒩(𝝁2,2TdId)\mathcal{N}(\boldsymbol{\mu}_{2},\frac{2T}{d}I_{d}) with equal weights, where 𝝁1=(1,1,,1,1)\boldsymbol{\mu}_{1}=(1,-1,\ldots,1,-1)^{\top} and 𝝁2=(1,,1)\boldsymbol{\mu}_{2}=(1,\ldots,1)^{\top}. This forms the input to a sequence of twenty consecutive Coupling Flow and dimension permutation pairs.
Refer to caption
Figure 16: The double slit environment (left figure) with 100100 paths simulated with the learned normalizing flows for d=100d=100 (red), d=500d=500 (green) and d=1,000d=1,000 (blue). The black regions represent the barrier, while the dark green region represents the screen. The right figure shows the histograms of 100,000100,000 simulated paths at the screen for d=100d=100 (red), d=500d=500 (green) and d=1,000d=1,000 (blue).

Figure 16 (left) shows the double slit environment with paths simulated using the learned normalizing flows for d=100d=100 (red paths), d=500d=500 (green paths) and d=1,000d=1,000 (blue paths). We additionally used 100,000100,000 samples for each dd to test the capability of the models in generating paths that successfully pass through one of the slits and hit the screen at xscreenx_{\mathrm{screen}}. In our tests, the success rate was 91.6%91.6\%, 91.2%91.2\% and 90.3%90.3\% for d=100d=100, d=500d=500 and d=1,000d=1,000 respectively, which shows that the learnt distributions are close to the conditional target distribution, even when the base and target distributions are 1,0001,000-dimensional. We additionally investigate how the marginal distributions of the paths at xscreenx_{\mathrm{screen}} look like. Figure 16 (right) shows the histogram of the yy-positions of 100,000100,000 paths at xscreenx_{\mathrm{screen}} for d=100d=100 (red), d=500d=500 (green) and d=1,000d=1,000 (blue) respectively. We can see that for all values of dd, the distributions at the screen closely match each other. This is both desired and expected, due to the scaling factor 2Td\frac{2T}{d} for the variance of the conditional target densities.

4 Conclusion

Monte Carlo methods are important tools for decision making under uncertainty. However, rare events pose significant challenges for standard Monte Carlo methods, due to the low probability of sampling such events. In this paper, we propose a framework for rare-event simulation. Our framework uses a normalizing flow based generative model to learn optimal importance sampling distributions, including conditional distributions. The model is trained using standard gradient descent techniques to minimize the KL divergence between the model density and a function that is proportional to the target density. Empirical evaluations in several domains demonstrate that our framework is able to closely approximate complicated distributions, even in high-dimensional (up to 1,0001,000-dimensional) rare-event settings. Simultaneously, our framework demonstrates substantial improvements in sample efficiency compared to standard Monte Carlo methods. Given these properties, we believe that combining our framework with Monte Carlo based methods for decision making under uncertainty is a fruitful avenue for future research.

\bmhead

Supplementary information The source-code of our framework and the examples is publicly available at https://github.com/hoergems/rare-event-simulation-normalizing-flows.

\bmhead

Acknowledgements We thank Wei Jiang for helpful discussions. This work is supported by the Australian Research Council Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS) CE140100049 grant and the Australian Research Council (ARC) Discovery Project 200101049.

References

  • \bibcommenthead
  • Ardizzone \BOthers. (\APACyear2019) \APACinsertmetastarardizzone2019guided{APACrefauthors}Ardizzone, L., Lüth, C., Kruse, J., Rother, C.\BCBL Köthe, U.  \APACrefYearMonthDay2019. \APACrefbtitleGuided Image Generation with Conditional Invertible Neural Networks. Guided image generation with conditional invertible neural networks. \APACaddressPublisherarXiv. {APACrefURL} https://doi.org/10.48550/arXiv.1907.02392 \PrintBackRefs\CurrentBib
  • Arief \BOthers. (\APACyear2021) \APACinsertmetastararief2021certifiable{APACrefauthors}Arief, M., Bai, Y., Ding, W., He, S., Huang, Z., Lam, H.\BCBL Zhao, D.  \APACrefYearMonthDay2021. \APACrefbtitleCertifiable Deep Importance Sampling for Rare-Event Simulation of Black-Box Systems. Certifiable deep importance sampling for rare-event simulation of black-box systems. \APACaddressPublisherarXiv. {APACrefURL} https://doi.org/10.48550/arXiv.2111.02204 \PrintBackRefs\CurrentBib
  • Botev \BOthers. (\APACyear2007) \APACinsertmetastarbotev2007generalized{APACrefauthors}Botev, Z.I., Kroese, D.P.\BCBL Taimre, T.  \APACrefYearMonthDay2007. \BBOQ\APACrefatitleGeneralized Cross-entropy Methods with Applications to Rare-event Simulation and Optimization Generalized cross-entropy methods with applications to rare-event simulation and optimization.\BBCQ \APACjournalVolNumPagesSimulation8311785–806, {APACrefDOI} https://doi.org/10.1177/0037549707087067 \PrintBackRefs\CurrentBib
  • Bucklew (\APACyear2004) \APACinsertmetastarBucklewJames2004ItRE{APACrefauthors}Bucklew, J.  \APACrefYear2004. \APACrefbtitleIntroduction to Rare Event Simulation Introduction to rare event simulation. \APACaddressPublisherSpringer. \PrintBackRefs\CurrentBib
  • de Boer \BOthers. (\APACyear2005) \APACinsertmetastardeBoer2005{APACrefauthors}de Boer, P\BHBIT., Kroese, D.P., Mannor, S.\BCBL Rubinstein, R.Y.  \APACrefYearMonthDay2005. \BBOQ\APACrefatitleA Tutorial on the Cross-Entropy Method A tutorial on the cross-entropy method.\BBCQ \APACjournalVolNumPagesAnnals of Operations Research134119–67, {APACrefDOI} https://doi.org/10.1007/s10479-005-5724-z \PrintBackRefs\CurrentBib
  • Dinh \BOthers. (\APACyear2015) \APACinsertmetastarDinh2015{APACrefauthors}Dinh, L., Krueger, D.\BCBL Bengio, Y.  \APACrefYearMonthDay2015. \APACrefbtitleNICE: Non-linear Independent Components Estimation. Nice: Non-linear independent components estimation. \APACaddressPublisherarXiv. {APACrefURL} https://doi.org/10.48550/arXiv.1410.8516 \PrintBackRefs\CurrentBib
  • Dinh \BOthers. (\APACyear2016) \APACinsertmetastarDinh2016{APACrefauthors}Dinh, L., Sohl-Dickstein, J.\BCBL Bengio, S.  \APACrefYearMonthDay2016. \APACrefbtitleDensity estimation using Real NVP. Density estimation using real NVP. \APACaddressPublisherarXiv. {APACrefURL} https://doi.org/10.48550/arXiv.1605.08803 \PrintBackRefs\CurrentBib
  • Falkner \BOthers. (\APACyear2022) \APACinsertmetastarfalkner2022conditioning{APACrefauthors}Falkner, S., Coretti, A., Romano, S., Geissler, P.\BCBL Dellago, C.  \APACrefYearMonthDay2022. \APACrefbtitleConditioning Normalizing Flows for Rare Event Sampling. Conditioning normalizing flows for rare event sampling. \APACaddressPublisherarXiv. {APACrefURL} https://doi.org/10.48550/arXiv.2207.14530 \PrintBackRefs\CurrentBib
  • Gibson \BBA Kroese (\APACyear2022) \APACinsertmetastarGibson2022{APACrefauthors}Gibson, L.J.\BCBT \BBA Kroese, D.P.  \APACrefYearMonthDay2022. \BBOQ\APACrefatitleRare-Event Simulation via Neural Networks Rare-event simulation via neural networks.\BBCQ \APACrefbtitleAdvances in Modeling and Simulation. Advances in modeling and simulation. \APACaddressPublisherSpringer. \APACrefnoteIn Press \PrintBackRefs\CurrentBib
  • Glynn \BBA Iglehart (\APACyear1989) \APACinsertmetastarGlynn1989{APACrefauthors}Glynn, P.W.\BCBT \BBA Iglehart, D.L.  \APACrefYearMonthDay1989. \BBOQ\APACrefatitleImportance Sampling for Stochastic Simulations Importance sampling for stochastic simulations.\BBCQ \APACjournalVolNumPagesManagement Science35111367-1392, {APACrefDOI} https://doi.org/10.1287/mnsc.35.11.1367 \PrintBackRefs\CurrentBib
  • Juneja \BBA Shahabuddin (\APACyear2006) \APACinsertmetastarJUNEJA2006291{APACrefauthors}Juneja, S.\BCBT \BBA Shahabuddin, P.  \APACrefYearMonthDay2006. \BBOQ\APACrefatitleChapter 11 Rare-Event Simulation Techniques: An Introduction and Recent Advances Chapter 11 rare-event simulation techniques: An introduction and recent advances.\BBCQ S.G. Henderson \BBA B.L. Nelson (\BEDS), \APACrefbtitleSimulation Simulation (\BPGS 291–350). \APACaddressPublisherElsevier. \PrintBackRefs\CurrentBib
  • Kingma \BBA Ba (\APACyear2017) \APACinsertmetastarkingma2017adam{APACrefauthors}Kingma, D.P.\BCBT \BBA Ba, J.  \APACrefYearMonthDay2017. \APACrefbtitleAdam: A Method for Stochastic Optimization. Adam: A method for stochastic optimization. \APACaddressPublisherarXiv. {APACrefURL} https://doi.org/10.48550/arXiv.1412.6980 \PrintBackRefs\CurrentBib
  • Kingma \BBA Dhariwal (\APACyear2018) \APACinsertmetastarNEURIPS2018_d139db6a{APACrefauthors}Kingma, D.P.\BCBT \BBA Dhariwal, P.  \APACrefYearMonthDay2018. \BBOQ\APACrefatitleGlow: Generative Flow with Invertible 1x1 Convolutions Glow: Generative flow with invertible 1x1 convolutions.\BBCQ S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi\BCBL \BBA R. Garnett (\BEDS), \APACrefbtitleAdvances in Neural Information Processing Systems Advances in neural information processing systems (\BVOL 31). \APACaddressPublisherCurran Associates. \PrintBackRefs\CurrentBib
  • Kobyzev \BOthers. (\APACyear2021) \APACinsertmetastarKobyzev2020{APACrefauthors}Kobyzev, I., Prince, S.J.\BCBL Brubaker, M.A.  \APACrefYearMonthDay2021. \BBOQ\APACrefatitleNormalizing Flows: An Introduction and Review of Current Methods Normalizing flows: An introduction and review of current methods.\BBCQ \APACjournalVolNumPagesIEEE Transactions on Pattern Analysis and Machine Intelligence43113964–3979, {APACrefDOI} https://doi.org/10.1109/TPAMI.2020.2992934 \PrintBackRefs\CurrentBib
  • Kochenderfer (\APACyear2015) \APACinsertmetastarKochenderfer2015decision{APACrefauthors}Kochenderfer, M.J.  \APACrefYear2015. \APACrefbtitleDecision Making Under Uncertainty: Theory and Application Decision Making Under Uncertainty: Theory and Application. \APACaddressPublisherMIT Press. \PrintBackRefs\CurrentBib
  • Kroese \BOthers. (\APACyear2019) \APACinsertmetastarkroese2019data{APACrefauthors}Kroese, D.P., Botev, Z.I., Taimre, T.\BCBL Vaisman, R.  \APACrefYear2019. \APACrefbtitleData Science and Machine Learning: Mathematical and Statistical Methods Data science and machine learning: Mathematical and statistical methods. \APACaddressPublisherChapman and Hall/CRC. \PrintBackRefs\CurrentBib
  • Kroese \BOthers. (\APACyear2011) \APACinsertmetastarkroese2013handbook{APACrefauthors}Kroese, D.P., Taimre, T.\BCBL Botev, Z.I.  \APACrefYear2011. \APACrefbtitleHandbook of Monte Carlo Methods Handbook of Monte Carlo methods. \APACaddressPublisherJohn Wiley & Sons. \PrintBackRefs\CurrentBib
  • Liu (\APACyear2004) \APACinsertmetastarliu2001monte{APACrefauthors}Liu, J.S.  \APACrefYear2004. \APACrefbtitleMonte Carlo Strategies in Scientific Computing Monte Carlo strategies in scientific computing. \APACaddressPublisherSpringer. \PrintBackRefs\CurrentBib
  • Neal (\APACyear2001) \APACinsertmetastarNeal2001{APACrefauthors}Neal, R.M.  \APACrefYearMonthDay2001. \BBOQ\APACrefatitleAnnealed Importance Sampling Annealed importance sampling.\BBCQ \APACjournalVolNumPagesStatistics and Computing112125-139, {APACrefDOI} https://doi.org/10.1023/A:1008923215028 \PrintBackRefs\CurrentBib
  • Papamakarios \BOthers. (\APACyear2021) \APACinsertmetastarpapamakarios2021normalizing{APACrefauthors}Papamakarios, G., Nalisnick, E.T., Rezende, D.J., Mohamed, S.\BCBL Lakshminarayanan, B.  \APACrefYearMonthDay2021. \BBOQ\APACrefatitleNormalizing Flows for Probabilistic Modeling and Inference Normalizing flows for probabilistic modeling and inference.\BBCQ \APACjournalVolNumPagesThe Journal of Machine Learning Research22571–64, {APACrefURL} http://jmlr.org/papers/v22/19-1028.html \PrintBackRefs\CurrentBib
  • Rezende \BBA Mohamed (\APACyear2015) \APACinsertmetastarRezende2015{APACrefauthors}Rezende, D.\BCBT \BBA Mohamed, S.  \APACrefYearMonthDay2015. \BBOQ\APACrefatitleVariational Inference with Normalizing Flows Variational inference with normalizing flows.\BBCQ F. Bach \BBA D. Blei (\BEDS), \APACrefbtitleProceedings of the 32nd International Conference on Machine Learning Proceedings of the 32nd international conference on machine learning (\BVOL 37, \BPGS 1530–1538). \APACaddressPublisherPMLR. \PrintBackRefs\CurrentBib
  • Rubino \BBA Tuffin (\APACyear2009) \APACinsertmetastarRubino2009{APACrefauthors}Rubino, G.\BCBT \BBA Tuffin, B.  \APACrefYear2009. \APACrefbtitleRare Event Simulation using Monte Carlo Methods Rare event simulation using Monte Carlo methods. \APACaddressPublisherJohn Wiley & Sons. \PrintBackRefs\CurrentBib
  • Rubinstein \BBA Kroese (\APACyear2017) \APACinsertmetastarrubinstein2016simulation{APACrefauthors}Rubinstein, R.Y.\BCBT \BBA Kroese, D.P.  \APACrefYear2017. \APACrefbtitleSimulation and the Monte Carlo Method Simulation and the Monte Carlo method. \APACaddressPublisherJohn Wiley & Sons. \PrintBackRefs\CurrentBib
  • Teshima \BOthers. (\APACyear2020) \APACinsertmetastarTakeshi2020{APACrefauthors}Teshima, T., Ishikawa, I., Tojo, K., Oono, K., Ikeda, M.\BCBL Sugiyama, M.  \APACrefYearMonthDay2020. \APACrefbtitleCoupling-based Invertible Neural Networks Are Universal Diffeomorphism Approximators. Coupling-based invertible neural networks are universal diffeomorphism approximators. \APACaddressPublisherarXiv. {APACrefURL} https://doi.org/10.48550/arXiv.2006.11469 \PrintBackRefs\CurrentBib
  • Tokdar \BBA Kass (\APACyear2010) \APACinsertmetastarTokdar2010{APACrefauthors}Tokdar, S.T.\BCBT \BBA Kass, R.E.  \APACrefYearMonthDay2010. \BBOQ\APACrefatitleImportance Sampling: A Review Importance sampling: A review.\BBCQ \APACjournalVolNumPagesWIREs Computational Statistics2154-60, {APACrefDOI} https://doi.org/10.1002/wics.56 \PrintBackRefs\CurrentBib
  • Ziegler \BBA Rush (\APACyear2019) \APACinsertmetastarziegler2019{APACrefauthors}Ziegler, Z.\BCBT \BBA Rush, A.  \APACrefYearMonthDay2019. \BBOQ\APACrefatitleLatent Normalizing Flows for Discrete Sequences Latent normalizing flows for discrete sequences.\BBCQ K. Chaudhuri \BBA R. Salakhutdinov (\BEDS), \APACrefbtitleProceedings of the 36th International Conference on Machine Learning Proceedings of the 36th international conference on machine learning (\BVOL 97, \BPGS 7673–7682). \APACaddressPublisherPMLR. \PrintBackRefs\CurrentBib