
A direct importance sampling-based framework for rare event uncertainty quantification in non-Gaussian spaces

Elsayed Eshra
The Pennsylvania State University
University Park, PA 16802
[email protected]
&Konstantinos G. Papakonstantinou
The Pennsylvania State University
University Park, PA 16802
[email protected]
&Hamed Nikbakht
The Pennsylvania State University
University Park, PA 16802
[email protected]
Abstract

This work introduces a novel framework for precisely and efficiently estimating rare event probabilities in complex, high-dimensional non-Gaussian spaces, building on our foundational Approximate Sampling Target with Post-processing Adjustment (ASTPA) approach. An unnormalized sampling target is first constructed and sampled, relaxing the optimal importance sampling distribution and appropriately designed for non-Gaussian spaces. Post-sampling, its normalizing constant is estimated using a stable inverse importance sampling procedure, employing an importance sampling density based on the already available samples. The sought probability is then computed based on the estimates evaluated in these two stages. The proposed estimator is theoretically analyzed, proving its unbiasedness and deriving its analytical coefficient of variation. To sample the constructed target, we resort to our developed Quasi-Newton mass preconditioned Hamiltonian MCMC (QNp-HMCMC) and we prove that it converges to the correct stationary target distribution. To avoid the challenging task of tuning the trajectory length in complex spaces, QNp-HMCMC is effectively utilized in this work with a single-step integration. We thus show the equivalence of QNp-HMCMC with single-step implementation to a unique and efficient preconditioned Metropolis-adjusted Langevin algorithm (MALA). An optimization approach is also leveraged to initiate QNp-HMCMC effectively, and the implementation of the developed framework in bounded spaces is eventually discussed. A series of diverse problems involving high dimensionality (several hundred inputs), strong nonlinearity, and non-Gaussianity is presented, showcasing the capabilities and efficiency of the suggested framework and demonstrating its advantages compared to relevant state-of-the-art sampling methods.

Keywords Rare Event  \cdot Non-Gaussian  \cdot Inverse Importance Sampling  \cdot Reliability Estimation  \cdot Hamiltonian MCMC  \cdot Quasi-Newton  \cdot Preconditioned MALA  \cdot High dimensions.

1 Introduction

This work introduces a new framework for directly estimating rare event probabilities in complex, high-dimensional non-Gaussian spaces. Rare event uncertainty quantification is a pivotal component of contemporary decision-making across diverse domains. It is particularly crucial for assessing and mitigating risks tied to infrequent yet highly impactful occurrences, such as structural failures induced by extreme loads like earthquakes and hurricanes. This field, commonly referred to as reliability estimation in engineering, has a long history, and comprehensive insights into its complexities can be found in [1, 2, 3, 4, 5, 6]. In many practical applications, rare event probabilities are fortunately exceptionally low, often in the range of 10^{-4} to 10^{-9} and lower, posing significant numerical and mathematical challenges. These challenges are further compounded by the reliance on computationally expensive models of the involved systems with high-dimensional random variable spaces. Achieving accurate estimates while minimizing the number of model evaluations is thus a critical objective.

Numerous existing solution techniques are specifically designed for Gaussian spaces and typically seek to transform non-Gaussian random variables to Gaussian ones. However, when such Gaussian transformations are not possible, these techniques often encounter challenges in non-Gaussian spaces due to potential asymmetries, complex dependencies, intricate geometries, and constrained spaces. Approximation methods, such as First Order Reliability Method (FORM) [7, 8], Second Order Reliability Method (SORM) [9], and Large deviation theory (LDT) [10, 11, 12], provide asymptotic expressions for estimating rare event probabilities [13]. In general, these methods lose accuracy and/or practicality for arbitrary non-Gaussian distributions, and may also exhibit considerable errors in problems with moderately to highly nonlinear performance functions [7, 14]. Sampling methods can tackle the problem in its utmost generality, however, with a higher computational cost. In addressing the known limitations of the crude Monte Carlo approach [15], numerous sophisticated sampling methods have been proposed. Among others, the Subset Simulation (SuS) approach, originally presented in [16], has been widely used and continuously enhanced for Gaussian spaces [17, 18, 19, 20], given its potential to handle high-dimensional problems, despite some limitations [21]. Several more recent works have also shown enhanced performance in non-Gaussian spaces by mainly employing advanced samplers within SuS, such as Hamiltonian MCMC [22, 23], Riemannian Manifold Hamiltonian Monte Carlo (RMHMC) [24, 25], Hamiltonian Neural Networks (HNN) [26], and affine invariant ensemble MCMC sampler [27, 28]. It should be noted that Subset Simulation also has similarities with multilevel splitting methods [29, 30, 31]. Another sampling method capable of efficiently estimating rare event probabilities is importance sampling (IS), a well-known variance-reduction approach. Since this work builds upon the theoretically rigorous foundations of importance sampling, we provide a comprehensive review and discussion of IS techniques for rare event estimation in Section 2. Combining surrogate models, that approximate computationally expensive original models, with sampling-based approaches has also demonstrated significant savings in computational cost across various applications [32, 33, 34]; however, their applicability to high-dimensional, complex problems still faces practical challenges.

In this paper, we introduce a novel framework for efficiently and precisely estimating rare event probabilities, directly in non-Gaussian spaces, building on our foundational Approximate Sampling Target with Post-processing Adjustment (ASTPA) approach, recently developed for Gaussian spaces in [35]. As shown in this work, the developed framework demonstrates excellent performance in challenging high-dimensional non-Gaussian scenarios, thereby overcoming a significant limitation of numerous existing methods. In essence, rare event probability estimation is a normalizing constant estimation problem for the joint distribution of the involved random variables truncated over the rare event domain, also known as the optimal sampling target; a demanding problem traditionally approached through sequential techniques. Conversely, the ASTPA approach directly and innovatively addresses this challenge by decomposing the problem into two less demanding estimation problems. Hence, the proposed methodology first constructs an unnormalized sampling target, appropriately designed for non-Gaussian spaces, relaxing the optimal importance sampling distribution. To sample this constructed target adequately, our Quasi-Newton mass preconditioned Hamiltonian MCMC (QNp-HMCMC) algorithm is employed, particularly suitable for complex spaces, as discussed later. The obtained samples are subsequently utilized not only to compute a shifted estimate of the sought probability but also to guide the second ASTPA stage. Post-sampling, the normalizing constant of the approximate sampling target is computed through the inverse importance sampling (IIS) technique [35], which utilizes an importance sampling density (ISD) fitted based on the samples drawn in the first stage. A practical approach augmenting the IIS stability in challenging non-Gaussian settings, by avoiding anomalous samples, is also devised in this work. The rare event probability is eventually computed by utilizing this computed normalizing constant to correct the shifted estimate of the first stage. The statistical properties of the proposed estimator are thoroughly derived, including its proven unbiasedness and analytical coefficient of variation (C.o.V). Notably, the derived analytical C.o.V expression, accompanied by an applied implementation technique based on the effective sample size, has demonstrated accurate agreement with numerical results.

Hamiltonian Markov Chain Monte Carlo (HMCMC) is one of the most successful and powerful MCMC algorithms [22, 36], based, as the name suggests, on Hamiltonian dynamics [37]. Its performance largely depends on tuning three parameters: the mass matrix, the simulation step size, and the length of the simulated trajectories. Its efficiency in high-dimensional complex spaces can be significantly improved by selecting an appropriate mass matrix [24, 38, 39]. Therefore, in this work, we rigorously analyze a Quasi-Newton mass preconditioned HMCMC (QNp-HMCMC) sampling algorithm, an approach we initially conceptualized in our earlier work [35]. QNp-HMCMC effectively and uniquely leverages the geometric information of the target distribution. During the burn-in stage, a skew-symmetric preconditioning scheme for the Hamiltonian dynamics is adopted [40], incorporating Hessian information merely approximated based on the already available gradients. The estimated Hessian is subsequently utilized to provide an informed, preconditioned mass matrix, better describing the scale and correlation of the involved random variables. To reinforce its theoretical basis, we prove here that simulating the skew-symmetric preconditioned dynamics in QNp-HMCMC leads to the correct stationary target distribution. To effectively implement QNp-HMCMC, we provide detailed recommendations for approximating the Hessian information in complex non-Gaussian spaces. The simulation step size in QNp-HMCMC is automatically tuned using the dual averaging algorithm [41, 42]. Equipped with the preconditioned dynamics and mass matrix, QNp-HMCMC is characterized by large informed sampling steps, enabling it to work efficiently even with single-step simulated Hamiltonian dynamics, avoiding the challenging task of tuning its trajectory length in general, intricate domains, and enabling increased computational efficiency by minimizing the number of model calls (gradient evaluations). As proven in this work, this QNp-HMCMC version with a single-step implementation is equivalent to an original preconditioned Metropolis-adjusted Langevin algorithm (MALA). Finally, an optimization approach is also devised in this paper to initiate the QNp-HMCMC sampling effectively.

This paper is structured as follows. Section 2 offers a comprehensive discussion on existing importance sampling-based methods for rare event probability estimation, highlighting their common characteristics and, notably, their connections and differences with our ASTPA approach. The developed framework is subsequently presented in Section 3, and Section 4 discusses in detail the single-step QNp-HMCMC algorithm. Section 5 introduces an effective optimization approach for rare event domain discovery, specifically using the Adam optimizer [43], aiming to provide good initial states for the MCMC sampler. Given that it is more practical for MCMC and HMCMC algorithms to operate in unconstrained spaces, Section 6 discusses transforming bounded random variables to unconstrained ones. A summary of the QNp-HMCMC-based ASTPA framework for non-Gaussian domains is then presented in Section 7. To fully examine the capabilities of the proposed methodology, in Section 8 its performance is demonstrated and successfully compared against the state-of-the-art Subset Simulation, and related variants, in a series of challenging low- and high-dimensional non-Gaussian problems. The paper then concludes in Section 9.

2 Rare Event Probability Estimation

Let 𝑿\bm{X} be a continuous random vector taking values in 𝒳d\mathcal{X}\subseteq\mathbb{R}^{d} and having a joint probability density function (PDF) π𝑿\pi_{\bm{X}}. We are interested in rare events {𝒙:g(𝒙)0}\mathcal{F}\coloneqq\{\bm{x}\,:\,g(\bm{x})\leq 0\}, where g:𝒳g:\mathcal{X}\rightarrow\mathbb{R} is a performance function, also known as the limit-state function, defining the occurrence of a rare event. In the context of reliability estimation, which considers failures as rare events, the limit state function, for example, can be formulated as g(𝑿)=λ(𝑿)g(\bm{X})=\lambda-\mathcal{M}(\bm{X}), where (𝑿)\mathcal{M}(\bm{X}) denotes a model output (a quantity of interest). Here, the predefined threshold λ\lambda determines the rarity of the failure event by controlling when g(𝑿)g(\bm{X}) starts to take non-positive values, i.e., rare event occurrence. In this work, we aim to estimate the rare event probability pp_{\mathcal{F}}:

p_{\mathcal{F}}=\int_{\mathcal{F}}\pi_{\bm{X}}(\bm{x})\,d\bm{x}=\int_{\mathcal{X}}I_{\mathcal{F}}(\bm{x})\,\pi_{\bm{X}}(\bm{x})\,d\bm{x}={\mathop{\mathbb{E}}}_{\pi_{\bm{X}}}[I_{\mathcal{F}}(\bm{X})] (1)

where I:𝒳{0,1}I_{\mathcal{F}}:\mathcal{X}\rightarrow\{0,1\} is the indicator function, i.e., I(𝒙)=1I_{\mathcal{F}}(\bm{x})=1 if 𝒙\bm{x}\in\mathcal{F}, and I(𝒙)=0I_{\mathcal{F}}(\bm{x})=0 otherwise, and 𝔼\mathop{\mathbb{E}} is the expectation operator.

Our objective in this work is to estimate the integral in Eq. 1 under the following challenging yet realistic settings: (i) the analytical calculation of Eq. 1 is generally intractable, as the limit-state function can be complicated, e.g., involve the solution of a differential equation; (ii) the computational effort for evaluating I_{\mathcal{F}} for each realization \bm{x} is assumed to be quite significant, often relying on computationally intensive models, necessitating the minimization of such function evaluations (model calls); (iii) the rare event probability p_{\mathcal{F}} is extremely low, e.g., on the order of p_{\mathcal{F}}\sim 10^{-4}-10^{-9}; (iv) the random variable space, \mathcal{X}\subseteq\mathbb{R}^{d}, is assumed to be high-dimensional, e.g., d=10^{2} and more; (v) the joint probability density, \pi_{\bm{X}}(\bm{x}), is non-Gaussian, with transformations to the Gaussian domain assumed inapplicable, necessitating the evaluation of Eq. 1 directly in non-Gaussian spaces.

Under these settings, several sampling methods become highly inefficient and fail to address the problem effectively. For instance, the standard Monte Carlo estimator of Eq. 1 using {𝒙i}1N\{\bm{x}_{i}\}_{1}^{N} draws from π𝑿\pi_{\bm{X}}, p^=1Ni=1NI(𝒙i)\hat{p}_{\mathcal{F}}=\frac{1}{N}\sum_{i=1}^{N}I_{\mathcal{F}}(\bm{x}_{i}), has a coefficient of variation (1p)/(Np)\sqrt{(1-p_{\mathcal{F}})/(Np_{\mathcal{F}})}, rendering this estimate inefficient by requiring a prohibitively large number of samples to accurately quantify small probabilities; see setting (iii) above. Importance sampling (IS), as a variance-reduction method, can efficiently compute the integral in Eq. 1, through sampling from an appropriate importance sampling density (ISD), h(𝑿)h(\bm{\bm{X}}). The ISD places greater importance on rare event domains in the random variable space compared to π𝑿(𝒙)\pi_{\bm{X}}(\bm{x}), satisfying the sufficient condition supp(Iπ𝑿)supp(h)\text{supp}(I_{\mathcal{F}}\,\pi_{\bm{X}})\subseteq\text{supp}(h), with supp()\text{supp}(\cdot) denoting the support of the function. This results in an increased number of rare event samples, thereby reducing the number of required model calls. Eq. (1) can then be written as:

p_{\mathcal{F}}=\int_{\mathcal{X}}I_{\mathcal{F}}(\bm{x})\,\dfrac{\pi_{\bm{X}}(\bm{x})}{h(\bm{x})}\,h(\bm{x})\,d\bm{x}={\mathop{\mathbb{E}}}_{h}\big{[}I_{\mathcal{F}}(\bm{X})\dfrac{\pi_{\bm{X}}(\bm{X})}{h(\bm{X})}\big{]} (2)

leading to the unbiased IS estimator:

\hat{p}_{\mathcal{F}}=\frac{1}{N}\sum_{i=1}^{N}I_{\mathcal{F}}(\bm{x}_{i})\,\dfrac{\pi_{\bm{X}}(\bm{x}_{i})}{h(\bm{x}_{i})} (3)
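For concreteness, the short sketch below contrasts the crude Monte Carlo estimator of Eq. 1 with the importance sampling estimator of Eq. 3 on a toy problem; the bivariate Gaussian densities and linear limit-state function used here are illustrative assumptions only, chosen to keep the script self-contained.

```python
# Illustrative sketch (not the paper's implementation): crude Monte Carlo, Eq. (1),
# versus importance sampling, Eq. (3), on an assumed toy problem.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
d = 2
pi_X = stats.multivariate_normal(mean=np.zeros(d))     # original density pi_X (assumed)
h = stats.multivariate_normal(mean=np.full(d, 3.0))    # an ISD shifted toward the rare event (assumed)
g = lambda x: 4.0 - np.sum(x, axis=-1)                 # toy limit state; failure when g(x) <= 0

N = 10_000

# Crude Monte Carlo: average of the indicator under pi_X
x_mc = pi_X.rvs(size=N, random_state=rng)
p_mc = np.mean(g(x_mc) <= 0.0)

# Importance sampling: indicator weighted by the likelihood ratio pi_X/h, sampled under h
x_is = h.rvs(size=N, random_state=rng)
w = pi_X.pdf(x_is) / h.pdf(x_is)
p_is = np.mean((g(x_is) <= 0.0) * w)

print(f"crude MC: {p_mc:.2e}   importance sampling: {p_is:.2e}")
```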

The efficiency of IS relies heavily on the careful selection of both the ISD and the sampling algorithm. A theoretically optimal ISD hh^{*}, providing zero-variance IS estimator, can be given as [44]:

h^{*}(\bm{X})=\dfrac{1}{p_{\mathcal{F}}}\,I_{\mathcal{F}}(\bm{X})\,\pi_{\bm{X}}(\bm{X}) (4)

Yet, it is evident that this ISD cannot be utilized due to the fact that the normalizing constant in Eq. (4) is the sought target probability pp_{\mathcal{F}}. Notably, finding a near-optimal ISD, approximating hh^{*}, is an ongoing research pursuit. The existing approaches can be largely classified into two categories. The first directly follows the IS estimator in Eq. 3 and seeks to approximate hh^{*} by a parametric or non-parametric probability density function (PDF), hh. An early approach in this category utilizes densities centered around one or more design points [45], provided by an initial FORM analysis, which has later been deemed ineffective for high-dimensional, complex spaces [16]. A more recent popular approach is the cross entropy-based importance sampling (CE-IS). It alternatively constructs a parametric ISD hh by minimizing the Kullback-Leibler (KL) divergence between hh^{*} and a chosen parametric family of probability distributions [46, 47, 48, 49, 50]. Recent CE-IS variants incorporate information from influential rare event points, e.g., design points, to effectively initialize the cross entropy process [51, 52]. Within this first category of methods, several different noteworthy approaches have also been proposed to approximate hh^{*}, e.g., [53, 54, 55, 56, 57].

Before delving into the second category of methods, we define ISD h=h~(𝑿)/Chh={\nicefrac{{\tilde{h}(\bm{X})}}{{C_{h}}}}, where h~\tilde{h} is an unnormalized importance distribution, and Ch=𝒳h~(𝒙)𝑑𝒙C_{h}=\int_{\mathcal{X}}\tilde{h}(\bm{x})d\bm{x} is its normalizing constant. Following this definition, h~=I(𝑿)π𝑿(𝑿)\tilde{h}^{*}=I_{\mathcal{F}}(\bm{X})\pi_{\bm{X}}(\bm{X}) denotes an unnormalized optimal importance sampling distribution, and Ch=pC_{h^{*}}=p_{\mathcal{F}} is its normalizing constant. Thus, in the second category of methods, sequential variants of importance sampling are employed, mainly utilizing a number of intermediate unnormalized distributions. These unnormalized distributions are generally more flexible than the PDFs usually utilized within the first category, as discussed above, thus being capable of providing a better approximation of hh^{*}. Employing an unnormalized importance sampling distribution h~\tilde{h}, Eq. 2 can be written as:

p_{\mathcal{F}}={\mathop{\mathbb{E}}}_{h}\big{[}I_{\mathcal{F}}(\bm{X})\dfrac{\pi_{\bm{X}}(\bm{X})}{h(\bm{X})}\big{]}=C_{h}\,{\mathop{\mathbb{E}}}_{h}\big{[}I_{\mathcal{F}}(\bm{X})\dfrac{\pi_{\bm{X}}(\bm{X})}{\tilde{h}(\bm{X})}\big{]} (5)

Existing methods currently compute the normalizing constant, ChC_{h}, using consecutive intermediate unnormalized distributions h~i\tilde{h}_{i}’s, i=1,2,,ni=1,2,...,n, following Eq. 6. These distributions should be designed to link between π𝑿\pi_{\bm{X}} and hh^{*}, with h~1\tilde{h}_{1} and h~n=h~\tilde{h}_{n}=\tilde{h} being the closest to π𝑿\pi_{\bm{X}} and hh^{*}, respectively.

\dfrac{C_{h_{n}}}{C_{\pi}}=\dfrac{C_{h_{1}}}{C_{\pi}}\prod_{i=2}^{n}\dfrac{C_{h_{i}}}{C_{h_{i-1}}} (6)

where ChiC_{h_{i}} is the normalizing constant of hih_{i}, and Cπ=1.0C_{\pi}=1.0, typically, is the normalizing constant of the original distribution π𝑿\pi_{\bm{X}}. CπC_{\pi} is left in Eq. 6 for generality purposes, particularly to cover scenarios when an unnormalized original distribution π~𝑿\tilde{\pi}_{\bm{X}} is instead provided (substitute π𝑿=π~𝑿/Cπ\pi_{\bm{X}}={\nicefrac{{\tilde{\pi}_{\bm{X}}}}{{C_{\pi}}}} in Eq. 5 to verify that). Eq. 6 can be computed using the sequential importance sampling (SIS) approach as [58, 59]:

\dfrac{C_{h_{i}}}{C_{h_{i-1}}}={\mathop{\mathbb{E}}}_{h_{i-1}}\big{[}\dfrac{\tilde{h}_{i}(\bm{X})}{\tilde{h}_{i-1}(\bm{X})}\big{]} (7)

Alternatively, bridge sampling can evaluate Eq. 6 as [60, 61, 62]:

\dfrac{C_{h_{i}}}{C_{h_{i-1}}}=\dfrac{\mathop{\mathbb{E}}_{h_{i-1}}[\tilde{h}_{i}(\bm{X})\,h_{i}^{B}(\bm{X})]}{\mathop{\mathbb{E}}_{h_{i}}[\tilde{h}_{i-1}(\bm{X})\,h_{i}^{B}(\bm{X})]},\quad\text{or}\quad\dfrac{C_{h_{i}}}{C_{h_{i-1}}}=\dfrac{\nicefrac{C_{h_{i}^{B}}}{C_{h_{i-1}}}}{\nicefrac{C_{h_{i}^{B}}}{C_{h_{i}}}}=\dfrac{\mathop{\mathbb{E}}_{h_{i-1}}[\nicefrac{h_{i}^{B}(\bm{X})}{\tilde{h}_{i-1}(\bm{X})}]}{\mathop{\mathbb{E}}_{h_{i}}[\nicefrac{h_{i}^{B}(\bm{X})}{\tilde{h}_{i}(\bm{X})}]} (8)

where hiB(𝑿)h_{i}^{B}(\bm{X}) is a bridge distribution [61] between h~i\tilde{h}_{i} and h~i1\tilde{h}_{i-1}, and ChiBC_{h_{i}^{B}} is its associated normalizing constant. This latter technique aims to provide a more accurate estimate even if the two consecutive distributions, h~i\tilde{h}_{i} and h~i1\tilde{h}_{i-1}, are not close enough. Eq. 6 can also be computed using path sampling [63, 64]. A key to the success of these sequential methods is to design nearby intermediate distributions and to sufficiently sample them to accurately compute Eq. 6, a non-trivial and computationally demanding task. A discussion on methods within this second category can be found in [65]. Interestingly, the subset simulation method [16] can also be seen as a special case of this second category, wherein the ratio Chi/Chi1{\nicefrac{{C_{h_{i}}}}{{C_{h_{i-1}}}}} is the conditional probability between every two consecutive intermediate events, i\mathcal{F}_{i} and i1\mathcal{F}_{i-1}.
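As a brief illustration of how a single ratio in Eq. 7 is estimated within such sequential schemes, the sketch below averages the ratio of two assumed unnormalized one-dimensional kernels over samples from the previous intermediate density; the kernels and the use of i.i.d. samples in place of MCMC samples are simplifying assumptions.

```python
# Sketch of one ratio update in Eq. (7): C_{h_i}/C_{h_{i-1}} estimated as the average of
# h_tilde_i/h_tilde_{i-1} under the previous intermediate density (assumed toy kernels).
import numpy as np

rng = np.random.default_rng(1)

h_tilde_prev = lambda x: np.exp(-0.5 * x**2)            # unnormalized h_{i-1}: N(0, 1) kernel
h_tilde_next = lambda x: np.exp(-0.5 * (x / 0.5)**2)    # unnormalized h_i: N(0, 0.5^2) kernel

x = rng.normal(0.0, 1.0, size=100_000)                  # samples distributed as h_{i-1}

ratio = np.mean(h_tilde_next(x) / h_tilde_prev(x))      # estimates C_{h_i}/C_{h_{i-1}}
print(ratio)                                            # true value: 0.5*sqrt(2*pi)/sqrt(2*pi) = 0.5
```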

In this work, we are originally estimating Eq. 5 in a non-sequential manner, using a single unnormalized distribution h~\tilde{h}, approximating hh^{*}, i.e., n=1n=1. Therefore, the proposed approach significantly reduces the computational cost associated with employing multiple intermediate distributions. The discovery of the rare event domain is alternatively achieved through the utilization of advanced samplers and optimization techniques, as detailed in Section 5. It should also be noted that the discussed importance sampling-based methods in the current literature are mostly tailored for Gaussian spaces. Consequently, the ability of our proposed framework to operate directly in non-Gaussian spaces is an additional substantial benefit.

3 Approximate Sampling Target with Post-processing Adjustment (ASTPA) in Non-Gaussian Spaces

ASTPA estimates the rare event probability in Eq. 1 through an innovative non-sequential importance sampling variant evaluating Eq. 5. To this end, ASTPA constructs a single unnormalized approximate sampling target h~\tilde{h}, relaxing the optimal ISD in Eq. 4. Post-sampling, its normalizing constant is computed utilizing our devised inverse importance sampling. For clarity, we rewrite Eq. 5 as:

p_{\mathcal{F}}={\mathop{\mathbb{E}}}_{h}\big{[}I_{\mathcal{F}}(\bm{X})\dfrac{\pi_{\bm{X}}(\bm{X})}{h(\bm{X})}\big{]}=C_{h}\,{\mathop{\mathbb{E}}}_{h}\big{[}I_{\mathcal{F}}(\bm{X})\dfrac{\pi_{\bm{X}}(\bm{X})}{\tilde{h}(\bm{X})}\big{]}=C_{h}\,\tilde{p}_{\mathcal{F}} (9)

Computing pp_{\mathcal{F}} is thus decomposed into two (hopefully simpler) problems. The first involves constructing h~\tilde{h} and sampling {𝒙i}i=1Nh\{\bm{x}_{i}\}_{i=1}^{N}\sim h, to compute the unbiased expectation of the weighted indicator function (shifted probability estimate) as:

\tilde{p}_{\mathcal{F}}={\mathop{\mathbb{E}}}_{h}\big{[}I_{\mathcal{F}}(\bm{X})\dfrac{\pi_{\bm{X}}(\bm{X})}{\tilde{h}(\bm{X})}\big{]}\approx\hat{\tilde{p}}_{\mathcal{F}}=\dfrac{1}{N}\sum_{i=1}^{N}I_{\mathcal{F}}(\bm{x}_{i})\,\dfrac{\pi_{\bm{X}}(\bm{x}_{i})}{\tilde{h}(\bm{x}_{i})} (10)

An empirical estimation for the normalizing constant, C^h\hat{C}_{h}, is then computed using the inverse importance sampling, as discussed in Section 3.2. The ASTPA estimator for the sought rare event probability can then be computed as:

\hat{p}_{\mathcal{F}}=\hat{\tilde{p}}_{\mathcal{F}}\,\hat{C}_{h} (11)

Fig. 1 concisely portrays the proposed framework using a correlated Gumbel distribution and a quadratic limit state function, as detailed in Section 8.1, with p2.51×107p_{\mathcal{F}}\sim 2.51\times 10^{-7}. The construction of the approximate sampling target in ASTPA is discussed in Section 3.1, including its recommended parameters for problems described in non-Gaussian spaces. Then, we present the inverse importance sampling approach in Section 3.2. Subsequently, Section 3.3 discusses thoroughly the statistical properties of the proposed estimator presented in Eq. 11.

[Figure 1 panels: (a) \pi_{\bm{X}}; (b) h^{*}(\bm{X}); (c) \tilde{h}(\bm{X}); (d) Samples \sim\tilde{h}(\bm{X}); (e) Q(\bm{X}); (f) Samples \sim Q(\bm{X})]
Figure 1: Outlining the ASTPA framework; (a) shows an original bivariate joint distribution π𝑿\pi_{\bm{X}} in red, and a limit state function at g(𝑿)=0g(\bm{X})=0 shown using a grey line, with the rare event domain being outside g(𝑿)=0g(\bm{X})=0, as detailed in Section 8.1, (b) visualizes the optimal importance sampling density, hh^{*}, in Eq. 4, (c) depicts the constructed approximate sampling target, h~\tilde{h}, in Eq. 13, (d) showcases samples from this target, subsequently used to compute p~^\hat{\tilde{p}}_{\mathcal{F}} in Eq. 10, and (e) and (f) demonstrate the inverse importance sampling procedure to compute the normalizing constant C^h\hat{C}_{h} in Eq. 18. Specifically, (e) and (f) present a crudely fitted Q(𝑿)Q(\bm{X}) and its samples, respectively. Another choice for QQ is also shown in Fig. 3. The sought probability is eventually computed using Eq. 11.

3.1 Target distribution formulation

As discussed earlier, the basic idea is to construct a near-optimal sampling target distribution h~\tilde{h}, placing higher importance on the rare event domain \mathcal{F}. Similar to methods utilizing sequential variants of IS (see Section 2 for details), the approximate sampling target in ASTPA is constructed by approximating the indicator function II_{\mathcal{F}} with a smooth function, namely a likelihood function g𝑿\ell_{g_{\bm{X}}}. This likelihood function is chosen as a logistic cumulative distribution function (CDF) of the negative of a scaled limit state function, FCDF(g(𝑿)/gc)F_{\text{CDF}}\left({\nicefrac{{-g(\bm{X})}}{{g_{c}}}}\right), with gcg_{c} being a scaling constant:

\ell_{g_{\bm{X}}}=F_{\text{CDF}}\bigg{(}\dfrac{-g(\bm{X})}{g_{c}}\,\bigg{\arrowvert}\,\mu_{g},\,\sigma\bigg{)}=\Bigg{(}1+\exp\bigg{(}\dfrac{\big(\frac{g(\bm{X})}{g_{c}}\big)+\mu_{g}}{\big(\frac{\sqrt{3}}{\pi}\big)\sigma}\bigg{)}\Bigg{)}^{-1} (12)

where \mu_{g} and \sigma are the mean and dispersion factor of the logistic CDF, respectively; their recommended values are discussed subsequently. Our numerical experiments in this work and in [35] have underscored the effectiveness of this chosen approximation of I_{\mathcal{F}}. It should also be noted, however, that numerous other choices for approximating I_{\mathcal{F}} exist in the literature, e.g., the standard normal CDF [59, 48] and the exponential tilting barrier [62]. The approximate sampling target in ASTPA is then expressed as:

\tilde{h}(\bm{X})=\ell_{g_{\bm{X}}}\,\pi_{\bm{X}}(\bm{X})=\Bigg{(}1+\exp\bigg{(}\dfrac{\big(\frac{g(\bm{X})}{g_{c}}\big)+\mu_{g}}{\big(\frac{\sqrt{3}}{\pi}\big)\sigma}\bigg{)}\Bigg{)}^{-1}\,\pi_{\bm{X}}(\bm{X}) (13)

The scaling g(\bm{X})/g_{c} is suggested to handle the varying magnitudes that different limit-state functions can exhibit. It is implemented so that the input to the logistic CDF, g(\bm{X})/g_{c}, at an influential point with respect to the original distribution \pi_{\bm{X}} consistently falls within a predefined range, regardless of the magnitude of g(\bm{X}). In this work, we use the mean value of the original distribution, \bm{\mu}_{\pi_{\bm{X}}}, to represent this influential point, albeit other points could also serve this purpose. To make the input to the logistic CDF, evaluated at \bm{x}=\bm{\mu}_{\pi_{\bm{X}}}, lie in the desired range [10, 20], we define g_{c} as:

g_{c}=\begin{cases}\dfrac{g(\bm{\mu}_{\pi_{\bm{X}}})}{q},\ q\in[10\,\,20]&\text{if}\ \big{(}g(\bm{\mu}_{\pi_{\bm{X}}})>20\big{)}\bigcup\big{(}0<g(\bm{\mu}_{\pi_{\bm{X}}})<10\big{)}\\ 1,&\text{otherwise}\end{cases} (14)

For example, if g(\bm{\mu}_{\pi_{\bm{X}}})=10^{4}, this scaling ensures that the input at \bm{x}=\bm{\mu}_{\pi_{\bm{X}}} equals q, i.e., g(\bm{\mu}_{\pi_{\bm{X}}})/g_{c}=g(\bm{\mu}_{\pi_{\bm{X}}})/\big(g(\bm{\mu}_{\pi_{\bm{X}}})/q\big)=q. Fine-tuning the value of the scaling constant, q, within the recommended range is not usually necessary. Our empirical investigation has confirmed the effectiveness of this proposed scaling technique and demonstrated that the recommended range for q is practical for the studied non-Gaussian problems.

The dispersion factor \sigma of the logistic CDF, F_{\text{CDF}}, determines the spread of the likelihood function, thus controlling the shape of the constructed target \tilde{h}. For non-Gaussian spaces, \sigma values are recommended in the range [0.1, 0.6]. Fine-tuning \sigma beyond the first decimal within that range is not usually necessary. Lower values, [0.1, 0.4], usually work efficiently, whereas a higher \sigma value within the recommended range may be needed for cases involving a large number of (strongly) nonlinear random variables.

For the mean parameter, μg\mu_{g}, of the logistic CDF, FCDFF_{\text{CDF}}, we are interested in generally locating it inside the rare event domain, g(𝑿)<0g(\bm{X})<0, to enhance the sampling efficiency. As such, we are describing μg\mu_{g} through a percentile, pp, of the logistic CDF and its quantile function, Υg\Upsilon_{g}:

\Upsilon_{g}(p;\mu_{g},\sigma)=\mu_{g}+\big(\frac{\sqrt{3}}{\pi}\sigma\big)\ln\bigg{(}\frac{p}{1-p}\bigg{)} (15)

By placing a chosen percentile pp of the logistic CDF on the limit-state surface at g(𝑿)=0g(\bm{X})=0 (Υg(p;μg,σ)=0)(\Upsilon_{g}(p;\mu_{g},\sigma)=0), the mean parameter μg\mu_{g} can be computed as:

\mu_{g}=-\big(\frac{\sqrt{3}}{\pi}\sigma\big)\ln\bigg{(}\frac{p}{1-p}\bigg{)} (16)

Based on our empirical investigation, we found that using the p10p10 percentile (p=0.1p=0.1) yields good efficiency. Therefore, it is used in all our experiments in this work. Substituting p=0.1p=0.1 in Eq. 16, the mean parameter is then defined as a function of the dispersion factor as μg1.21σ\mu_{g}\approx 1.21\,\sigma.
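The following sketch assembles the above ingredients, i.e., the scaling constant of Eq. 14, the mean parameter of Eq. 16, and the logistic-CDF likelihood of Eq. 12, into the unnormalized log-target of Eq. 13; the original density and limit-state function used in the final lines are placeholder assumptions for illustration only.

```python
# Sketch of the ASTPA target construction, Eqs. (12)-(16). Only the construction of
# g_c, mu_g and log h_tilde follows the text; the example density and limit state
# at the bottom are placeholder assumptions.
import numpy as np

def scaling_constant(g_at_mean, q=15.0):
    """Eq. (14): choose g_c so that g(mu_pi)/g_c lies in [10, 20], with q in that range."""
    if g_at_mean > 20.0 or 0.0 < g_at_mean < 10.0:
        return g_at_mean / q
    return 1.0

def astpa_log_target(x, g, log_pi, g_c, sigma=0.3, p=0.1):
    """Unnormalized log h_tilde(x) of Eq. (13), with the logistic-CDF likelihood of Eq. (12)."""
    mu_g = -(np.sqrt(3.0) / np.pi) * sigma * np.log(p / (1.0 - p))  # Eq. (16); ~1.21*sigma for p = 0.1
    s = (np.sqrt(3.0) / np.pi) * sigma                              # dispersion of the logistic CDF
    arg = (g(x) / g_c + mu_g) / s
    log_likelihood = -np.logaddexp(0.0, arg)                        # log of (1 + exp(arg))^(-1)
    return log_likelihood + log_pi(x)

# Placeholder usage with a standard Gaussian original density and a linear limit state:
log_pi = lambda x: -0.5 * np.sum(np.asarray(x)**2, axis=-1)
g = lambda x: 4.0 - np.sum(np.asarray(x), axis=-1)
g_c = scaling_constant(g(np.zeros(2)))              # evaluated at the mean of pi_X, Eq. (14)
print(astpa_log_target(np.zeros(2), g, log_pi, g_c))
```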

Figure 2: Effect of the likelihood dispersion factor, σ\sigma, and the scaling constant, qq, on the target distribution for a quadratic limit-state function, shown in grey at g(𝑿)=0g(\bm{X})=0, with a correlated Gumbel distribution, as detailed in Section 8.1.

In Fig. 2, we illustrate how the likelihood dispersion factor, σ\sigma, and the scaling constant, qq, influence the shape of the target distribution, based on a correlated Gumbel distribution, as detailed in Section 8.1. As can be seen, decreasing σ\sigma and/or increasing qq results in a more concentrated target distribution inside the rare event domain. Although an increasing number of rare event samples can be obtained by pushing the target further inside the rare event domain, i.e., being very close to the optimal one (Iπ𝑿)(I_{\mathcal{F}}\pi_{\bm{X}}), that may also complicate the sampling process and the computation of the normalizing constant C^h\hat{C}_{h}, as discussed in Section 3.4.

3.2 Inverse Importance Sampling

Inverse importance sampling (IIS) is a general technique for computing normalizing constants, demonstrating exceptional performance in the context of rare event probability estimation. Given an unnormalized distribution \tilde{h} and samples \{\bm{x}_{i}\}_{i=1}^{N}\sim h, inverse importance sampling first fits an ISD Q(\bm{x}) based on these samples; details of Q(\bm{x}) are discussed below. IIS then estimates the normalizing constant C_{h} as:

C_{h}=\int_{\mathcal{X}}\tilde{h}(\bm{x})\,d\bm{x}=\int_{\mathcal{X}}\dfrac{\tilde{h}(\bm{x})}{Q(\bm{x})}\,Q(\bm{x})\,d\bm{x}=\mathbb{E}_{Q}\big{[}\dfrac{\tilde{h}(\bm{X})}{Q(\bm{X})}\big{]} (17)

By drawing {𝒙}i=1M\{\bm{x}^{\prime}\}_{i=1}^{M} i.i.d. samples from QQ, the unbiased normalizing constant estimator can be computed as:

\hat{C}_{h}=\dfrac{1}{M}\sum_{i=1}^{M}\dfrac{\tilde{h}(\bm{x}_{i}^{\prime})}{Q(\bm{x}_{i}^{\prime})} (18)

In this work, the ISD Q(.)Q(.) is a computed Gaussian Mixture Model (GMM), based again on the already available samples, {𝒙i}i=1N\{\bm{x}_{i}\}_{i=1}^{N}, and the generic Expectation-Maximization (EM) algorithm [66]. A typical GMM expression can be then given by:

Q(\bm{X})=\sum_{k=1}^{\bm{K}}w_{k}\,\phi(\bm{X}\,;\,\bm{\mu}_{k},\bm{\Sigma}_{k}) (19)

where ϕ(.)\phi(.) is the PDF, wkw_{k} is the weight, 𝝁k\bm{\mu}_{k} is the mean vector and 𝚺k\bm{\Sigma}_{k} is the covariance matrix of the kthkth Gaussian component, that can all be estimated, for all components, by the EM algorithm. In low-dimensional spaces, where d<20d<20, we employ a GMM with a large number (10)(\sim 10) of Gaussian components that have full covariance matrices. This approach aims to accurately approximate the target distribution h~\tilde{h} with an ISD Q(.)Q(.), thus improving the IIS estimator in Eq. 18. In contrast, for higher dimensions, we fit Gaussian Mixture Models (GMMs) with diagonal covariance matrices. That is to reduce the number of GMM parameters to be estimated, thereby mitigating the scalability issues commonly associated with GMMs. Specifically, for high-dimensional examples discussed in this paper, we utilize a GMM fitted with a single component and a diagonal covariance matrix, i.e., a multivariate independent Gaussian density. This method has demonstrated exceptional efficacy in calculating the normalizing constant ChC_{h} for challenging examples, even in spaces with dimensions as high as 500. Notably, this IIS technique can generally still work effectively even using a crudely fitted QQ. We demonstrate this feature by employing two different choices for the ISD, QQ, to quantify the target probability in the illustrative example depicted in Fig. 1, with p2.51×107p_{\mathcal{F}}\sim 2.51\times 10^{-7}. The first QQ is an accurate GMM with ten Gaussian components having full covariance matrices, as depicted in Fig. 3 , which yields 𝔼[p^]\mathop{\mathbb{E}}[\hat{p}_{\mathcal{F}}] of 2.52×1072.52\times 10^{-7} and a coefficient of variation (C.o.V) of 0.030.03. On the other hand, using the same number of total model calls (4,0484,048), the crudely fitted GMM in Fig. 1(e) still provides very good performance with an estimated probability of 2.49×1072.49\times 10^{-7} and a C.o.V of 0.100.10. The ASTPA estimator of the rare event probability can finally be computed as:

p_{\mathcal{F}}={\mathop{\mathbb{E}}}_{h}\big{[}I_{\mathcal{F}}(\bm{X})\dfrac{\pi_{\bm{X}}(\bm{X})}{\tilde{h}(\bm{X})}\big{]}\,\mathbb{E}_{Q}\big{[}\dfrac{\tilde{h}(\bm{X})}{Q(\bm{X})}\big{]}\approx\hat{\tilde{p}}_{\mathcal{F}}\,\hat{C}_{h}=\bigg{(}\dfrac{1}{N}\sum_{i=1}^{N}I_{\mathcal{F}}(\bm{x}_{i})\dfrac{\pi_{\bm{X}}(\bm{x}_{i})}{\tilde{h}(\bm{x}_{i})}\bigg{)}\bigg{(}\dfrac{1}{M}\sum_{i=1}^{M}\dfrac{\tilde{h}(\bm{x}_{i}^{\prime})}{Q(\bm{x}_{i}^{\prime})}\bigg{)} (20)

where {𝒙i}i=1N\{\bm{x}_{i}\}_{i=1}^{N} and {𝒙i}i=1M\{\bm{x}_{i}^{\prime}\}_{i=1}^{M} are samples from hh and QQ, respectively.
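A minimal sketch of the two ASTPA stages of Eq. 20 is given below, assuming vectorized callables for g, \log\pi_{\bm{X}}, and \log\tilde{h}; scikit-learn's GaussianMixture is used here as one possible EM-based GMM fit for the ISD Q, which is a choice made for this sketch rather than a requirement of the framework.

```python
# Sketch of the two ASTPA stages in Eq. (20); g, log_pi and log_h_tilde are assumed
# vectorized callables, and the GMM fit (ISD Q) uses scikit-learn's EM implementation.
import numpy as np
from sklearn.mixture import GaussianMixture

def astpa_estimate(x_h, g, log_pi, log_h_tilde, n_components=1, M=2000, seed=0):
    """x_h: (N, d) array of MCMC samples ~ h_tilde. Returns p_hat = p_tilde_hat * C_hat."""
    # Stage 1, Eq. (10): shifted probability estimate from the target samples
    # (in practice the g-values are already available from the sampling stage).
    log_w = log_pi(x_h) - log_h_tilde(x_h)
    fail = g(x_h) <= 0.0
    p_tilde_hat = np.sum(np.exp(log_w[fail])) / x_h.shape[0]

    # Stage 2, Eqs. (17)-(18): fit Q on the same samples, then average h_tilde/Q over Q-samples
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag",       # diagonal covariances for higher dimensions
                          random_state=seed).fit(x_h)
    x_q, _ = gmm.sample(M)
    C_hat = np.mean(np.exp(log_h_tilde(x_q) - gmm.score_samples(x_q)))

    return p_tilde_hat * C_hat                          # Eqs. (11) and (20)
```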

[Figure 3 panels: (a) Q(\bm{X}); (b) Samples \sim Q(\bm{X})]
Figure 3: The inverse importance sampling technique for the same illustrative example as in Fig. 1, here with an accurately fitted GMM, Q, with ten Gaussian components having full covariance matrices, shown in (a); its samples are shown in (b).
Remark 1.

We introduce an effective approach to enhance the robustness of the IIS estimator \hat{C}_{h} against the impact of rarely observed outlier or anomalous samples. Initially, the sample set \{\bm{x}_{i}^{\prime}\}_{i=1}^{M}\sim Q is split into two subsets of equal size, \{\bm{x}_{i}^{\prime}\}_{i=1}^{M/2} and \{\bm{x}_{i}^{\prime}\}_{i=M/2+1}^{M}, given that M is an even number. Subsequently, Eq. 18 is computed separately for each subset, yielding two estimates, \hat{C}_{h}^{1} and \hat{C}_{h}^{2}. If these two estimates fall within a reasonable range of each other, specifically if 1/3\leq\hat{C}_{h}^{1}/\hat{C}_{h}^{2}\leq 3, we compute the final IIS estimate as their average, \hat{C}_{h}=0.5(\hat{C}_{h}^{1}+\hat{C}_{h}^{2}), which is equivalent to computing Eq. 18 using the entire set \{\bm{x}_{i}^{\prime}\}_{i=1}^{M}. In rare cases, however, the estimates diverge beyond this threshold, indicating potential instability, and a conservative approach is thus adopted: the final estimator is taken as the minimum of the two estimates, \hat{C}_{h}=\min(\hat{C}_{h}^{1},\hat{C}_{h}^{2}). This method has been successfully applied in all examples presented in this work.
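A sketch of this split-half safeguard, applied directly to the M ratios \tilde{h}(\bm{x}_{i}^{\prime})/Q(\bm{x}_{i}^{\prime}) entering Eq. 18, could look as follows (M is assumed to be even).

```python
# Sketch of the split-half safeguard of Remark 1, applied to the M ratios
# h_tilde(x_i')/Q(x_i') that enter Eq. (18); M is assumed to be even.
import numpy as np

def robust_C_hat(ratios):
    r1, r2 = np.split(np.asarray(ratios), 2)
    C1, C2 = r1.mean(), r2.mean()
    if 1.0 / 3.0 <= C1 / C2 <= 3.0:
        return 0.5 * (C1 + C2)      # equivalent to averaging over the full sample set
    return min(C1, C2)              # conservative choice when the two halves disagree
```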

3.3 Statistical properties of the ASTPA estimator (p^\hat{p}_{\mathcal{F}})

In this section, we establish the unbiasedness of the ASTPA estimator, \hat{p}_{\mathcal{F}}=\hat{\tilde{p}}_{\mathcal{F}}\,\hat{C}_{h}, in Eq. 20 and derive its analytical Coefficient of Variation (C.o.V). Assuming \hat{\tilde{p}}_{\mathcal{F}} and \hat{C}_{h} are independent, the expected value of their product can be expressed as:

\mathop{\mathbb{E}}[\hat{p}_{\mathcal{F}}]=\mathop{\mathbb{E}}[\hat{\tilde{p}}_{\mathcal{F}}\,\hat{C}_{h}]=\mathop{\mathbb{E}}[\hat{\tilde{p}}_{\mathcal{F}}]\,\mathop{\mathbb{E}}[\hat{C}_{h}] (21)

Considering their formulations in Eqs. 10 and 18, \hat{\tilde{p}}_{\mathcal{F}} and \hat{C}_{h} are unbiased estimators of \tilde{p}_{\mathcal{F}} and C_{h}, respectively, i.e., \mathop{\mathbb{E}}[\hat{\tilde{p}}_{\mathcal{F}}]=\tilde{p}_{\mathcal{F}} and \mathop{\mathbb{E}}[\hat{C}_{h}]=C_{h}. The ASTPA estimator is thus unbiased:

\mathop{\mathbb{E}}[\hat{p}_{\mathcal{F}}]=\mathop{\mathbb{E}}[\hat{\tilde{p}}_{\mathcal{F}}\,\hat{C}_{h}]=\tilde{p}_{\mathcal{F}}\,C_{h}=p_{\mathcal{F}} (22)

The variance (Var^\widehat{\text{Var}}) of the ASTPA estimator p^\hat{p}_{\mathcal{F}} can then be computed as:

\widehat{\text{Var}}(\hat{p}_{\mathcal{F}})=(\hat{\tilde{p}}_{\mathcal{F}})^{2}\,\widehat{\text{Var}}(\hat{C}_{h})+(\hat{C}_{h})^{2}\,\widehat{\text{Var}}(\hat{\tilde{p}}_{\mathcal{F}})+\widehat{\text{Var}}(\hat{\tilde{p}}_{\mathcal{F}})\,\widehat{\text{Var}}(\hat{C}_{h}) (23)

where p~^\hat{\tilde{p}}_{\mathcal{F}} and C^h\hat{C}_{h} can be computed according to Eqs. 10 and 18, respectively. Var^(p~^)\widehat{\text{Var}}(\hat{\tilde{p}}_{\mathcal{F}}) and Var^(C^h)\widehat{\text{Var}}(\hat{C}_{h}) can thus be estimated as follows:

\widehat{\text{Var}}(\hat{\tilde{p}}_{\mathcal{F}})=\frac{1}{N_{s}(N_{s}-1)}\sum_{i=1}^{N_{s}}\bigg{(}\dfrac{I_{\mathcal{F}}(\bm{x}_{i})\,\pi_{\bm{X}}(\bm{x}_{i})}{\tilde{h}(\bm{x}_{i})}-\hat{\tilde{p}}_{\mathcal{F}}\bigg{)}^{2} (24)
\widehat{\text{Var}}(\hat{C}_{h})=\frac{1}{M(M-1)}\sum_{i=1}^{M}\Bigg{(}\dfrac{\tilde{h}(\bm{x}_{i}^{\prime})}{Q(\bm{x}_{i}^{\prime})}-\hat{C}_{h}\Bigg{)}^{2} (25)

where {𝒙i}i=1Ns\{\bm{x}_{i}\}_{i=1}^{N_{s}} and {𝒙i}i=1M\{\bm{x}_{i}^{\prime}\}_{i=1}^{M} are samples from hh and QQ, respectively. NsN_{s} denotes the number of used Markov chain samples, taking into account the fact that the samples are not independent and identically distributed (i.i.d) in this case. The analytical Coefficient of Variation (C.o.V) can then be provided as:

\textrm{C.o.V}\approx\dfrac{\sqrt{\widehat{\text{Var}}(\hat{p}_{\mathcal{F}})}}{\hat{p}_{\mathcal{F}}} (26)

In this work, the thinning process reduces the sample size from NN to NsN_{s} based on the effective sample size (ESS) of the sample set {𝒙i}i=1N\{\bm{x}_{i}\}_{i=1}^{N}. We select every jthj^{th} sample for thinning, where jj, an integer, is determined by:

j=\left\lfloor\dfrac{N}{4\cdot\text{ESS}_{\text{min}}}\right\rfloor (27)

where ESSmin\text{ESS}_{\text{min}} represents the minimum effective sample size across all dimensions and is computed using TensorFlow Probability software [67]. To ensure jj remains within practical bounds for the thinning process, it is constrained to the set {3,4,5,,30}\{3,4,5,\ldots,30\}, with any calculated value of jj outside this range being adjusted to the nearest bound within it. The analytical C.o.V computed according to this approach showed good agreement with the sample C.o.V in all our numerical examples.
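The sketch below implements the thinning factor of Eq. 27 and the analytical C.o.V of Eqs. 23-26, assuming the minimum effective sample size is supplied externally (e.g., by the TensorFlow Probability routine referenced above) and that the thinned weighted-indicator values and the IIS ratios are available as plain arrays.

```python
# Sketch of the thinning factor of Eq. (27) and the analytical C.o.V of Eqs. (23)-(26);
# ess_min is assumed to be computed externally (e.g., with TensorFlow Probability).
import numpy as np

def thinning_factor(N, ess_min):
    """Eq. (27), constrained to the practical set {3, 4, ..., 30}."""
    j = int(np.floor(N / (4.0 * ess_min)))
    return int(np.clip(j, 3, 30))

def astpa_cov(weighted_indicators_thinned, iis_ratios):
    """weighted_indicators_thinned: I_F * pi_X / h_tilde over the N_s thinned samples (Eq. 24).
       iis_ratios: h_tilde / Q over the M samples drawn from Q (Eq. 25)."""
    a = np.asarray(weighted_indicators_thinned)
    b = np.asarray(iis_ratios)
    p_tilde_hat, C_hat = a.mean(), b.mean()
    var_p_tilde = a.var(ddof=1) / a.size                 # Eq. (24)
    var_C = b.var(ddof=1) / b.size                       # Eq. (25)
    var_p = (p_tilde_hat**2) * var_C + (C_hat**2) * var_p_tilde + var_p_tilde * var_C  # Eq. (23)
    return np.sqrt(var_p) / (p_tilde_hat * C_hat)        # Eq. (26)
```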

Remark 2.

The unbiasedness of the ASTPA estimator is based on assuming independence between p~^\hat{\tilde{p}}_{\mathcal{F}} and C^h\hat{C}_{h}. If dependence exists, the expected value of their product becomes:

\mathop{\mathbb{E}}[\hat{p}_{\mathcal{F}}]=\mathop{\mathbb{E}}[\hat{\tilde{p}}_{\mathcal{F}}\,\hat{C}_{h}]=\mathop{\mathbb{E}}[\hat{\tilde{p}}_{\mathcal{F}}]\,\mathop{\mathbb{E}}[\hat{C}_{h}]+\mathrm{Cov}(\hat{\tilde{p}}_{\mathcal{F}},\hat{C}_{h})=\tilde{p}_{\mathcal{F}}\,C_{h}+\mathrm{Cov}(\hat{\tilde{p}}_{\mathcal{F}},\hat{C}_{h}) (28)

where Cov(p~^,C^h)\mathrm{Cov}(\hat{\tilde{p}}_{\mathcal{F}},\hat{C}_{h}) is the covariance between p~^\hat{\tilde{p}}_{\mathcal{F}} and C^h\hat{C}_{h}. This dependence could arise from using the sample set {𝐱i}i=1N\{\bm{x}_{i}\}_{i=1}^{N} to fit the ISD QQ, leading to a potential bias in the estimator. Despite this potential for bias, our empirical investigations using a finite sample size reveal that the bias of the ASTPA estimator is minimal. One approach to ensure independence, if wanted, is to fit the ISD QQ on an additional sample set {𝐱i′′}i=1Nh\{\bm{x}_{i}^{\prime\prime}\}_{i=1}^{N}\sim h, separate from the one used to estimate p~^\hat{\tilde{p}}_{\mathcal{F}}. Our empirical findings, however, suggest that such steps are generally unnecessary.

3.4 Impact of the constructed sampling target on the estimated probability

In this section, we illustrate how the construction of the sampling target, h, influences the estimated probability p_{\mathcal{F}}. To this end, we consider three cases: (a) a highly relaxed target using \sigma and q values outside the recommended ranges, (b) a target constructed with \sigma and q values within the recommended ranges, and (c) a target highly concentrated inside the rare event domain, i.e., very close to the theoretically optimal ISD in Eq. 4. We again consider the correlated Gumbel distribution example, as detailed in Section 8.1, with p_{\mathcal{F}}\sim 2.51\times 10^{-7}. Fig. 4 depicts these three constructed targets, accompanied by the corresponding \sigma and q values. The impact on the estimator accuracy is judged using the sampling coefficients of variation (C.o.V) of the normalizing constant \hat{C}_{h} (Eq. 18), the shifted probability estimate \hat{\tilde{p}}_{\mathcal{F}} (Eq. 10), and the sought probability \hat{p}_{\mathcal{F}} (Eq. 20), as well as the expected values of \hat{p}_{\mathcal{F}}. Fig. 4 shows the results obtained using a total of \sim 4{,}000 model calls in all cases, based on 100 independent simulations.

In Fig. 4(a), the highly relaxed target is mainly located outside the rare event domain, thus resulting in a low number of rare event samples. Therefore, the C.o.V of p~^\hat{\tilde{p}}_{\mathcal{F}} becomes high since only a few samples with I=1I_{\mathcal{F}}=1 appear in Eq. 10. Conversely, the case shown in Fig. 4(b) demonstrates excellent performance with low coefficients of variation and 𝔼[p^]\mathbb{E}[\hat{p}_{\mathcal{F}}] very close to the reference value. On the other hand, the highly concentrated target in Fig. 4(c), constructed using σ\sigma and qq values outside the recommended range, presents interesting results with an approximately zero C.o.V for p~^\hat{\tilde{p}}_{\mathcal{F}}, and a comparatively higher C.o.V of C^h\hat{C}_{h}. The reason for the former is that almost all the samples yield I=1I_{\mathcal{F}}=1, therefore p~^1.0\hat{\tilde{p}}_{\mathcal{F}}\approx 1.0 regardless of the number of samples. However, this concentrated sampling target may not often be sampled adequately using an affordable number of model calls compared to an appropriately relaxed target. Consequently, the fitted GMM within the inverse importance sampling procedure may not accurately represent the target hh. The accuracy of the estimated normalizing constant then slightly deteriorates in this 2-dimensional example. This decrease in accuracy becomes significantly more pronounced if the scenario in Fig. 4(c) is followed in high-dimensional problems, e.g., the high-dimensional examples discussed in Section 8.

The deteriorated performance observed sometimes in certain importance sampling-based methods, particularly for high-dimensional and significantly nonlinear performance functions, can be attributed to the scenario outlined in Fig. 4(c). Specifically, some methods, see [53, 55] for example, aim to sample the unnormalized optimal distribution (Iπ𝑿)(I_{\mathcal{F}}\,\pi_{\bm{X}}), and then fit a normalized PDF based on these samples. Another sample set from the fitted PDF is then employed to compute the normalizing constant of (Iπ𝑿)(I_{\mathcal{F}}\pi_{\bm{X}}), which is the sought probability pp_{\mathcal{F}}. This investigation here, therefore, underscores the sampling benefits of appropriately relaxing (smoothing) the indicator function, often leading to enhanced performance.

[Figure 4 panels: the three constructed sampling targets (a), (b), and (c)]

                                            (a)       (b)       (c)
C.o.V(\hat{\tilde{p}}_{\mathcal{F}})        0.40      0.03      0.00
C.o.V(\hat{C}_{h})                          0.13      0.02      0.13
C.o.V(\hat{p}_{\mathcal{F}})                0.40      0.03      0.13
\mathop{\mathbb{E}}[\hat{p}_{\mathcal{F}}]  2.38E-7   2.52E-7   2.33E-7
Figure 4: Impact of the constructed sampling target on the accuracy of the proposed estimator; reference p2.51×107p_{\mathcal{F}}\sim 2.51\times 10^{-7}.

4 Sampling the Approximate Target (h~\tilde{h}) in ASTPA

In this section, we discuss sampling the approximate target h~\tilde{h} in ASTPA using gradient-based Markov Chain Monte Carlo (MCMC) methods. Specifically, we focus on the state-of-the-art Hamiltonian MCMC (HMCMC) methods, including our developed Quasi-Newton mass preconditioned HMCMC (QNp-HMCMC), aimed at improving sampling efficiency for intricate probability distributions. We first present the standard HMCMC algorithm, followed by our developed QNp-HMCMC, emphasizing its theoretical foundations and practical implementation techniques for complex non-Gaussian spaces. Connections with the preconditioned Metropolis-adjusted Langevin algorithm (MALA) are illustrated, highlighting the uniqueness of QNp-HMCMC, even with a single-step integration.

4.1 Hamiltonian Markov Chain Monte Carlo (HMCMC)

HMCMC leverages Hamiltonian dynamics to generate distant Markov chain samples, thereby avoiding the slow exploration of high-dimensional random variable spaces encountered in random-walk samplers. This Hamiltonian approach was first introduced to molecular simulations by Alder and Wainwright in [68], in which the motion of the molecules was deterministic. Subsequently, Duane et al. [37] integrated MCMC methods with molecular dynamics approaches. The enhanced sampling efficiency of HMCMC is mainly attributed to the augmentation of the space with momentum variables and the utilization of the target distribution’s gradient, facilitating informed sampling steps.

Given dd-dimensional variables of interest 𝑿\bm{X} with unnormalized density h~\tilde{h}(.), HMCMC introduces dd-dimensional auxiliary momentum variables 𝒁\bm{Z} and then samples from the joint distribution characterized by:

\pi(\bm{X},\bm{Z})\propto\exp\big{(}-H(\bm{X},\bm{Z})\big{)} (29)

where H(𝑿,𝒁)=U(𝑿)+K(𝑿,𝒁)H(\bm{X},\bm{Z})=U(\bm{X})+K(\bm{X},\bm{Z}) is termed the Hamiltonian function HH. The functions U(𝑿)=logh~(𝑿)U(\bm{\bm{X}})=-\log\tilde{h}(\bm{X}) and K(𝑿,𝒁)=logπ𝒁|𝑿(𝒁|𝑿)K(\bm{X},\bm{Z})=-\log\pi_{\bm{Z}\arrowvert\bm{X}}(\bm{Z}\arrowvert\bm{X}) are introduced as the potential energy and kinetic energy, owing to the concept of the canonical distribution [22] and the physical laws which motivate the Hamiltonian Markov Chain Monte Carlo algorithm. The kinetic energy function is unconstrained and can be formed in various ways according to the implementation. In most typical cases, the momentum is sampled by a zero-mean normal distribution [22, 69], and accordingly, the kinetic energy can be written as K(𝑿,𝒁)=logπ𝒁|𝑿(𝒁|𝑿)=12𝒁T𝐌1𝒁K(\bm{X},\bm{Z})=-\log\pi_{\bm{Z}\arrowvert\bm{X}}(\bm{Z}\arrowvert\bm{X})=\frac{1}{2}\bm{Z}^{T}\mathbf{M}^{-1}\bm{Z}, where 𝐌\mathbf{M} is a symmetric, positive-definite matrix, termed mass matrix. The joint distribution π(𝑿,𝒁)\pi(\bm{X},\bm{Z}) can then be written as:

\pi(\bm{X},\bm{Z})\propto\exp\big{(}-\big{(}U(\bm{X})+K(\bm{X},\bm{Z})\big{)}\big{)}=\tilde{h}(\bm{X})\,\pi_{\bm{Z}\arrowvert\bm{X}}(\bm{Z}\arrowvert\bm{X})=\tilde{h}(\bm{X})\,\mathcal{N}(\bm{Z}\arrowvert\bm{0},\mathbf{M}) (30)

where 𝒩()\mathcal{N}(\cdot) denotes a Gaussian distribution. HMCMC generates a Metropolis proposal on the joint state-space (𝑿,Z)(\bm{X},\textbf{Z}) by sampling the momentum and simulating trajectories of Hamiltonian dynamics in which the time evolution of the state (𝑿,Z)(\bm{X},\textbf{Z}) is governed by Hamilton’s equations, expressed typically by:

\frac{d\bm{X}}{dt}=\frac{\partial H}{\partial\bm{Z}}=\frac{\partial K}{\partial\bm{Z}}=\mathbf{M}^{-1}\bm{Z},\qquad\frac{d\bm{Z}}{dt}=-\frac{\partial H}{\partial\bm{X}}=-\frac{\partial U}{\partial\bm{X}}=\nabla_{\bm{X}}\mathcal{L}(\bm{X}) (31)

where \mathcal{L}(\bm{X})=\log(\tilde{h}(\bm{X})) denotes the log-density of the target distribution. Hamiltonian dynamics prove to be an effective proposal generation mechanism because the distribution \pi(\bm{X},\bm{Z}) is invariant under the dynamics of Eq. 31. These dynamics enable a proposal, obtained from an approximate solution of Eq. 31, to be distant from the current state yet accepted with high probability. The solution of Eq. 31 is analytically intractable in general, and thus the Hamiltonian equations need to be solved numerically by discretizing time with some small step size, \varepsilon. A symplectic integrator that can be used for this numerical solution is the leapfrog integrator, which works as follows:

\bm{z}_{t+\varepsilon/2}=\bm{z}_{t}-\big(\dfrac{\varepsilon}{2}\big)\dfrac{\partial U}{\partial\bm{X}}(\bm{x}_{t}),\quad\bm{x}_{t+\varepsilon}=\bm{x}_{t}+\varepsilon\,\dfrac{\partial K}{\partial\bm{Z}}(\bm{z}_{t+\varepsilon/2}),\quad\bm{z}_{t+\varepsilon}=\bm{z}_{t+\varepsilon/2}-\big(\dfrac{\varepsilon}{2}\big)\dfrac{\partial U}{\partial\bm{X}}(\bm{x}_{t+\varepsilon}) (32)

where \dfrac{\partial U}{\partial\bm{X}}(\bm{x})=-\nabla_{\bm{X}}\mathcal{L}(\bm{x}) and \dfrac{\partial K}{\partial\bm{Z}}(\bm{z})=\mathbf{M}^{-1}\bm{z}. The main advantages of the leapfrog integrator are its simplicity, the fact that it is volume-preserving, and that it is reversible, due to its symmetry, by simply negating \bm{z}, thereby generating a valid Metropolis proposal. See Neal [22] and Betancourt [69] for more details on energy conservation, reversibility, and volume-preserving integrators and their connections to HMCMC. It is noted that in the above leapfrog integration algorithm, the computationally expensive part is the one model call per step required to acquire the \dfrac{\partial U}{\partial\bm{X}} term. With \tau the trajectory (or path) length, taking L=\tau/\varepsilon leapfrog steps approximates the evolution (\bm{x}(0),\bm{z}(0))\longrightarrow(\bm{x}(\tau),\bm{z}(\tau)), providing the exact solution in the limit \varepsilon\longrightarrow 0. Section 4.3 discusses techniques for tuning these parameters.
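A minimal sketch of a single leapfrog step of Eq. 32, written for a general mass matrix \mathbf{M} and a user-supplied gradient of \mathcal{L}(\bm{x}), is given below.

```python
# Sketch of one leapfrog step of Eq. (32) for a general mass matrix M, where
# grad_log_target(x) returns the gradient of L(x) = log h_tilde(x), so that dU/dx = -grad L.
def leapfrog(x, z, eps, grad_log_target, M_inv):
    z = z + 0.5 * eps * grad_log_target(x)   # half-step momentum update
    x = x + eps * (M_inv @ z)                # full-step position update, dK/dz = M^{-1} z
    z = z + 0.5 * eps * grad_log_target(x)   # half-step momentum update
    return x, z
```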

A generic procedure for drawing NIterN_{Iter} samples via HMCMC is described in Algorithm 1, where again (𝑿)\mathcal{L}(\bm{X}) is the log-density of the target distribution of interest, 𝒙0\bm{x}^{0} are initial values, and L is the number of leapfrog steps, as explained before. For each HMCMC step, the momentum is first resampled, and then the L leapfrog updates are performed, as seen in Eq. 32, before a typical accept/reject Metropolis step takes place. In Algorithm 1, the mass matrix M is set to the identity matrix, I, as often followed in the standard implementation of HMCMC. However, the efficiency of the HMCMC can be significantly enhanced by appropriately selecting the mass matrix, particularly when dealing with complex, high-dimensional distributions. In the subsequent section, we introduce our developed technique for adapting the mass matrix, which has shown substantial improvement in the sampling efficiency.

Algorithm 1 Hamiltonian Markov Chain Monte Carlo
1: procedure HMCMC(\bm{x}^{0}, \varepsilon, L, \mathcal{L}(\bm{X})=\log(\tilde{h}(\bm{X})), N_{Iter})
2:     for m=1 to N_{Iter} do
3:         \bm{z}^{0}\sim\mathcal{N}(\bm{0},\mathbf{I}) \triangleright sampling from an independent standard Gaussian distribution, \mathbf{M}=\mathbf{I}
4:         \bm{x}^{m}\leftarrow\bm{x}^{m-1},\ \tilde{\bm{x}}\leftarrow\bm{x}^{m-1},\ \tilde{\bm{z}}\leftarrow\bm{z}^{0}
5:         for i=1 to L do
6:             \tilde{\bm{x}},\ \tilde{\bm{z}}\leftarrow\text{Leapfrog}(\tilde{\bm{x}},\tilde{\bm{z}},\varepsilon) \triangleright leapfrog integration in Eq. 32
7:         end for
8:         with probability:
9:             \alpha=\min\Big{\{}1,\ \dfrac{\exp(\mathcal{L}(\tilde{\bm{x}})-\frac{1}{2}\tilde{\bm{z}}^{T}\tilde{\bm{z}})}{\exp(\mathcal{L}(\bm{x}^{m-1})-\frac{1}{2}{\bm{z}^{0}}^{T}\bm{z}^{0})}\Big{\}} \triangleright Metropolis step
10:            \bm{x}^{m}\leftarrow\tilde{\bm{x}},\ \bm{z}^{m}\leftarrow-\tilde{\bm{z}}
11:     end for
12: end procedure
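For illustration, a compact Python sketch of Algorithm 1 follows, reusing the leapfrog function sketched after Eq. 32 (with \mathbf{M}=\mathbf{I}); the log-density and its gradient are assumed to be supplied as callables.

```python
# Sketch of Algorithm 1 with an identity mass matrix, reusing the leapfrog function
# sketched after Eq. (32); log_target and grad_log_target are assumed callables for
# L(x) = log h_tilde(x) and its gradient.
import numpy as np

def hmcmc(x0, eps, L, log_target, grad_log_target, n_iter, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    M_inv = np.eye(x.size)                                  # M = I, as in Algorithm 1
    samples = []
    for _ in range(n_iter):
        z0 = rng.standard_normal(x.size)                    # resample momentum, z ~ N(0, I)
        x_new, z_new = x.copy(), z0.copy()
        for _ in range(L):                                  # L leapfrog steps, Eq. (32)
            x_new, z_new = leapfrog(x_new, z_new, eps, grad_log_target, M_inv)
        # Metropolis accept/reject on the joint (x, z) space
        log_alpha = (log_target(x_new) - 0.5 * (z_new @ z_new)) \
                    - (log_target(x) - 0.5 * (z0 @ z0))
        if np.log(rng.uniform()) < log_alpha:
            x = x_new
        samples.append(x.copy())
    return np.array(samples)
```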

4.2 Quasi-Newton mass preconditioned HMCMC (QNp-HMCMC)

In complex, high-dimensional problems, the performance of the standard HMCMC sampler, presented as Algorithm 1, may deteriorate, necessitating a prohibitive number of model calls. Various strategies have been proposed to overcome this limitation, including the adaptation of the mass matrix 𝐌\mathbf{M}, as highlighted in [24, 38, 39, 70]. We have uniquely employed this technique in our Quasi-Newton mass preconditioned HMCMC (QNp-HMCMC), as presented in Algorithm 2 and discussed in this section. The role of the mass matrix in HMCMC is analogous to performing a linear transformation for the involved random variables 𝑿\bm{X}, potentially simplifying the geometric complexity of the underlying sampling distribution. Such a transformation may thus make the transformed variables less correlated and have a similar scale, facilitating a more efficient sampling process. Specifically, operating HMCMC in a transformed space 𝑿=𝐀𝑿\bm{X}^{\prime}=\mathbf{A}\bm{X}, where 𝐀\mathbf{A} is a non-singular matrix, and using a standard mass matrix 𝐌1=𝐈\mathbf{M}_{1}=\mathbf{I} is equivalent to using a transformed mass matrix 𝐌2=𝐀T𝐀\mathbf{M}_{2}=\mathbf{A}^{T}\mathbf{A} in the original space 𝑿\bm{X} [22, 38].

Algorithm 2 Quasi-Newton mass preconditioned Hamiltonian Markov Chain Monte Carlo (QNp-HMCMC)
1:procedure QNp-HMCMC(𝒙0\bm{x}^{0}, ε\varepsilon, L, (𝑿)\mathcal{L}(\bm{X}), NBurnInN_{\textit{BurnIn}}, NIterN_{Iter})
2:     𝐖0=𝐈\mathbf{W}^{0}=\mathbf{I}
3:     for m=1m=1 toto NIterN_{Iter} do
4:         if mm \leq NBurnInN_{BurnIn} then
5:              𝒛0\bm{z}^{0}\sim𝒩(0,𝐌)\mathcal{N}(\textbf{{0}},\mathbf{M})\triangleright where 𝐌=𝐈\mathbf{M}=\mathbf{I}
6:              𝒙m\bm{x}^{m} \leftarrow 𝒙m1\bm{x}^{m-1}, 𝒙~\tilde{\bm{x}} \leftarrow 𝒙m1\bm{x}^{m-1}, 𝒛~\tilde{\bm{z}} \leftarrow 𝒛0\bm{z}^{0}, 𝐁\mathbf{B} \leftarrow 𝐖m1\mathbf{W}^{m-1}, 𝐂0\mathbf{C}^{0} \leftarrow 𝐖m1\mathbf{W}^{m-1}
7:              for i=1i=1 toto LL do
8:                  𝒙~\tilde{\bm{x}}, 𝒛~\tilde{\bm{z}} \leftarrow Leapfrog-BurnIn(𝒙~\tilde{\bm{x}}, 𝒛~\tilde{\bm{z}}, ε\varepsilon, 𝐁\mathbf{B})
9:                   Compute 𝐂i\mathbf{C}^{i} using 𝐂i1\mathbf{C}^{i-1} in Eq. 33
10:              end for
11:              𝐖m\mathbf{W}^{m} \leftarrow 𝐂L\mathbf{C}^{L}
12:              withwith probabilityprobability:
13:             α\alpha = min{\bigg{\{}1,exp((𝒙~)12𝒛~T𝒛~)exp((𝒙m1)12𝒛0T𝒛0)\dfrac{\exp(\mathcal{L}(\tilde{\bm{x}})-\dfrac{1}{2}\tilde{\bm{z}}^{T}\tilde{\bm{z}})}{\exp(\mathcal{L}(\bm{x}^{m-1})-\dfrac{1}{2}{\bm{z}^{0}}^{T}\bm{z}^{0})} }\bigg{\}}
14:              𝒙m\bm{x}^{m} \leftarrow 𝒙~\tilde{\bm{x}}, 𝒛m\bm{z}^{m} \leftarrow -𝒛~\tilde{\bm{z}}
15:         else\triangleright If mm >> NBurnInN_{BurnIn}
16:              𝒛0\bm{z}^{0}\sim𝒩(0,𝐌)\mathcal{N}(\textbf{{0}},\mathbf{M})\triangleright where 𝐌=(𝐖NBurnIn)1\mathbf{M}=(\mathbf{W}^{N_{BurnIn}})^{-1}
17:              𝒙m\bm{x}^{m} \leftarrow 𝒙m1\bm{x}^{m-1}, 𝒙~\tilde{\bm{x}} \leftarrow 𝒙m1\bm{x}^{m-1}, 𝒛~\tilde{\bm{z}} \leftarrow 𝒛0\bm{z}^{0}
18:              for i=1i=1 toto LL do
19:                  𝒙~\tilde{\bm{x}}, 𝒛~\tilde{\bm{z}} \leftarrow Leapfrog(𝒙~\tilde{\bm{x}}, 𝒛~\tilde{\bm{z}}, ε\varepsilon, 𝐌\mathbf{M})
20:              end for
21:              withwith probabilityprobability:
22:             α\alpha = min{\bigg{\{}1,exp((𝒙~)12𝒛~T𝐌1𝒛~)exp((𝒙m1)12𝒛0T𝐌1𝒛0)\dfrac{\exp(\mathcal{L}(\tilde{\bm{x}})-\dfrac{1}{2}\tilde{\bm{z}}^{T}\ \mathbf{M}^{-1}\tilde{\bm{z}})}{\exp(\mathcal{L}(\bm{x}^{m-1})-\dfrac{1}{2}{\bm{z}^{0}}^{T}\ \mathbf{M}^{-1}\bm{z}^{0})} }\bigg{\}}
23:              𝒙m\bm{x}^{m} \leftarrow 𝒙~\tilde{\bm{x}}, 𝒛m\bm{z}^{m} \leftarrow -𝒛~\tilde{\bm{z}}
24:         end if
25:     end for
26:end procedure
33:function Leapfrog-BurnIn(𝒙,𝒛,ε,𝐁\bm{x},\bm{z},\varepsilon,\mathbf{B})
34:     𝒛~𝒛+(ε/2)𝐁𝑿(𝒙)\tilde{\bm{z}}\leftarrow\bm{z}+(\varepsilon/2)\mathbf{B}\nabla_{\bm{X}}\mathcal{L}(\bm{x})
35:     𝒙~𝒙+ε𝐁𝒛~\tilde{\bm{x}}\leftarrow\bm{x}+\varepsilon\mathbf{B}\tilde{\bm{z}}
36:     𝒛~𝒛~+(ε/2)𝐁𝑿(𝒙~)\tilde{\bm{z}}\leftarrow\tilde{\bm{z}}+(\varepsilon/2)\mathbf{B}\nabla_{\bm{X}}\mathcal{L}(\tilde{\bm{x}})
37:return 𝒙~\tilde{\bm{x}}, 𝒛~\tilde{\bm{z}}
38:end function
41:function Leapfrog(𝒙,𝒛,ε,𝐌\bm{x},\bm{z},\varepsilon,\mathbf{M})
42:     𝒛~𝒛+(ε/2)𝑿(𝒙)\tilde{\bm{z}}\leftarrow\bm{z}+(\varepsilon/2)\nabla_{\bm{X}}\mathcal{L}(\bm{x})
43:     𝒙~𝒙+ε𝐌1𝒛~\tilde{\bm{x}}\leftarrow\bm{x}+\varepsilon\mathbf{M}^{-1}\tilde{\bm{z}}
44:     𝒛~𝒛~+(ε/2)𝑿(𝒙~)\tilde{\bm{z}}\leftarrow\tilde{\bm{z}}+(\varepsilon/2)\nabla_{\bm{X}}\mathcal{L}(\tilde{\bm{x}})
45:return 𝒙~\tilde{\bm{x}}, 𝒛~\tilde{\bm{z}}
46:end function

The mass matrix is commonly adapted to better describe the local geometry of the target distribution, enabling more efficient exploration of the space of random variables. For instance, the Riemannian Manifold Hamiltonian Monte Carlo (RMHMC) approach, suggested in [24], leverages the manifold structure of the variable space through a position-specific Riemannian metric tensor. This tensor functions as a position-dependent mass matrix, leading to a non-separable Hamiltonian. Therefore, solving the Hamiltonian equations within RMHMC requires a generalized leapfrog scheme, demanding higher-order derivatives of distributions and additional model calls per leapfrog step. Conversely, within the scope of our research, a more computationally feasible approach is to introduce a position-independent mass matrix 𝐌(𝒙)=𝐌\mathbf{M}(\bm{x})=\mathbf{M}, as in algorithm 1, describing the local structure of the target distribution. Such description can be generally obtained in a Newton-type context utilizing the Hessian information of the target distribution, as demonstrated in various MCMC methods [71, 72, 73, 74]. In the context of HMCMC, Zhang and Sutton [38] employed a quasi-Newton-type approach to provide a position-independent, yet continuously adapted, mass matrix using a variant of the BFGS formula [75] to approximate the Hessian matrix. They maintained position independence by adopting the ensemble chain adaption (ECA) approach [76], updating the mass matrix for each chain based on the states of others. In QNp-HMCMC, we utilize an alternative approach to ensure position independence by approximating the Hessian matrix during a burn-in phase and subsequently utilizing it as a constant mass matrix. Importantly, we enhance the burn-in stage’s efficiency through a skew-symmetric preconditioning approach while setting the mass matrix to identity.

In this work, we approximate the Hessian information of the potential energy function, U(𝑿)=(𝑿)=logh~(𝑿)U(\bm{\bm{X}})=-\mathcal{L}(\bm{X})=-\log\tilde{h}(\bm{X}), using the well-known BFGS approximation [77], solely based on the gradient information. Let 𝑿d\bm{X}\in\mathbb{R}^{d}, consistent with the previous sections. Given the kk-thth estimate 𝐖k\mathbf{W}_{k}, where 𝐖k\mathbf{W}_{k} is an approximation to the inverse Hessian of the potential energy function at 𝒙k\bm{x}_{k}, the BFGS update 𝐖k+1\mathbf{W}_{k+1} can be expressed as:

𝐖k+1=(𝐈𝒔k𝒚kT𝒚kT𝒔k)𝐖k(𝐈𝒚k𝒔kT𝒚kT𝒔k)+𝒔k𝒔kT𝒚kT𝒔k\displaystyle\mathbf{W}_{k+1}=(\mathbf{I}-\dfrac{\bm{s}_{k}\bm{y}_{k}^{T}}{\bm{y}_{k}^{T}\bm{s}_{k}})\mathbf{W}_{k}(\mathbf{I}-\dfrac{\bm{y}_{k}\bm{s}_{k}^{T}}{\bm{y}_{k}^{T}\bm{s}_{k}})+\dfrac{\bm{s}_{k}\bm{s}_{k}^{T}}{\bm{y}_{k}^{T}\bm{s}_{k}} (33)

where 𝐈\mathbf{I} is the identity matrix, 𝒔k=𝒙k+1𝒙k\bm{s}_{k}=\bm{x}_{k+1}-\bm{x}_{k}, and 𝒚k=(𝒙k+1)+(𝒙k)\bm{y}_{k}=-\nabla\mathcal{L}(\bm{x}_{k+1})+\nabla\mathcal{L}(\bm{x}_{k}) where :d\mathcal{L}:\mathbb{R}^{d}\longrightarrow\mathbb{R} denotes the log density of the target distribution. There is a long history of efficient BFGS updates for very large systems and several numerical techniques can be used, including sparse and limited-memory approaches.
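A direct transcription of the update in Eq. 33 is sketched below for illustration; the function name and the NumPy-based form are ours and are not drawn from the referenced BFGS literature.

import numpy as np

def bfgs_inverse_hessian_update(W, s, y):
    # BFGS update of the inverse Hessian approximation (Eq. 33)
    #   s = x_{k+1} - x_k
    #   y = -grad L(x_{k+1}) + grad L(x_k)   (gradient difference of U = -L)
    rho = 1.0 / (y @ s)
    I = np.eye(s.size)
    V = I - rho * np.outer(s, y)
    return V @ W @ V.T + rho * np.outer(s, s)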

To achieve a good approximation of the inverse Hessian matrix during the burn-in stage, we enhance the sampling efficiency by resorting to a skew-symmetric preconditioning approach for Hamiltonian dynamics. Such preconditioning approaches were originally introduced in [40] and utilized within HMCMC in [39], albeit in settings completely different from our QNp-HMCMC. Ma et al. [40] devised a unified framework for continuous-dynamics samplers such as HMCMC and Langevin Monte Carlo. Let \bm{\theta}=(\bm{X},\bm{Z})\in\mathbb{R}^{2d} be an augmented state. The suggested dynamics in [40] can then be written as a stochastic differential equation (SDE) of the form:

d𝜽=𝐟(𝜽)dt+2𝐃(𝜽)d𝐖(t)d\bm{\theta}=\mathbf{f}(\bm{\theta})\,dt+\sqrt{2\,\mathbf{D}(\bm{\theta})}\,d\mathbf{W}(t) (34)

where \mathbf{f}(\bm{\theta}) denotes the deterministic drift, often related to the gradient of the Hamiltonian H(\bm{\theta}), \mathbf{W}(t) is a 2d-dimensional Wiener process, and \mathbf{D}(\bm{\theta})\in\mathbb{R}^{2d\times 2d} is a positive semidefinite diffusion matrix. The deterministic drift is further proposed as:

𝐟(𝜽)=[𝐃(𝜽)+𝐐(𝜽)]H(𝜽)+𝚪(𝜽),𝚪i(𝜽)=j=1d𝜽j(𝐃ij(𝜽)+𝐐ij(𝜽))\mathbf{f}(\bm{\theta})=-[\mathbf{D}(\bm{\theta})+\mathbf{Q}(\bm{\theta})]\nabla H(\bm{\theta})+\bm{\Gamma}(\bm{\theta}),\quad\bm{\Gamma}_{i}(\bm{\theta})=\sum_{j=1}^{d}\dfrac{\partial}{\partial\bm{\theta}_{j}}\big{(}\mathbf{D}_{ij}(\bm{\theta})+\mathbf{Q}_{ij}(\bm{\theta})\big{)} (35)

where \mathbf{Q}(\bm{\theta})\in\mathbb{R}^{2d\times 2d} is a skew-symmetric curl matrix representing the deterministic traversing effects seen in the HMC procedure. Ma et al. proved that for any choice of positive semidefinite \mathbf{D}(\bm{\theta}) and skew-symmetric \mathbf{Q}(\bm{\theta}) parameterizing \mathbf{f}(\bm{\theta}), simulating Eq. 34 with \mathbf{f}(\bm{\theta}) as in Eq. 35 leads to the desired distribution as the stationary distribution, \pi(\bm{\theta})\propto\exp\big(-H(\bm{\theta})\big) in Eq. 29. Deterministic dynamics, such as Hamiltonian dynamics, can be obtained under this unified framework in Eq. 34 by setting the diffusion matrix \mathbf{D}(\bm{\theta})=0, yielding:

d\bm{\theta}=\big[-\mathbf{Q}(\bm{\theta})\nabla H(\bm{\theta})+\bm{\Gamma}(\bm{\theta})\big]dt,\quad\bm{\Gamma}_{i}(\bm{\theta})=\sum_{j=1}^{d}\dfrac{\partial}{\partial\bm{\theta}_{j}}\big(\mathbf{Q}_{ij}(\bm{\theta})\big) (36)

Hamiltonian dynamics in Eq. 31 can then be retrieved by setting 𝐐(𝜽)=[0𝐈d𝐈d0]\mathbf{Q}(\bm{\theta})=\begin{bmatrix}0&-\mathbf{I}_{d}\\ \mathbf{I}_{d}&0\\ \end{bmatrix}, with 𝐈dd×d\mathbf{I}_{d}\in\mathbb{R}^{d\times d} being an identity matrix, as:

d𝜽dt=ddt[𝑿𝒁]=[0𝐈d𝐈d0][𝑿H(𝑿,𝒁)𝒁H(𝑿,𝒁)]=[0𝐈d𝐈d0][𝑿(𝑿)𝐌1𝒁]=[𝐌1𝒁𝑿(𝑿)]\dfrac{d\bm{\theta}}{dt}=\dfrac{d}{dt}\begin{bmatrix}\bm{X}\\[6.0pt] \bm{Z}\\ \end{bmatrix}=-\begin{bmatrix}0&-\mathbf{I}_{d}\\[6.0pt] \mathbf{I}_{d}&0\\ \end{bmatrix}\begin{bmatrix}\nabla_{\bm{X}}H(\bm{X},\bm{Z})\\[6.0pt] \nabla_{\bm{Z}}H(\bm{X},\bm{Z})\\ \end{bmatrix}=-\begin{bmatrix}0&-\mathbf{I}_{d}\\[6.0pt] \mathbf{I}_{d}&0\\ \end{bmatrix}\begin{bmatrix}-\nabla_{\bm{X}}\mathcal{L}(\bm{X})\\[6.0pt] \mathbf{M}^{-1}\bm{Z}\\ \end{bmatrix}=\begin{bmatrix}\mathbf{M}^{-1}\bm{Z}\\[6.0pt] \nabla_{\bm{X}}\mathcal{L}(\bm{X})\end{bmatrix} (37)

It is worth mentioning that if the mass matrix is position-dependent, as in RMHMC, a different treatment of the unified dynamics in Eq. 34 is required to retrieve the corresponding dynamics; see [40] for more details.

In our suggested QNp-HMCMC, we utilize a skew-symmetric preconditioning matrix 𝐐(𝜽)=[0𝐖𝐖0]\mathbf{Q}(\bm{\theta})=\begin{bmatrix}0&-\mathbf{W}\\ \mathbf{W}&0\\ \end{bmatrix}, with 𝐖\mathbf{W} being the inverse Hessian of the potential energy function. Assuming 𝐖\mathbf{W} is position-independent, the correction term can be ignored, i.e., 𝚪i(𝜽)=0\bm{\Gamma}_{i}(\bm{\theta})=0, and therefore the corresponding dynamics can then be written as:

d𝜽dt=ddt[𝑿𝒁]=[0𝐖𝐖0][𝑿(𝑿)𝐌1𝒁]=[𝐖𝐌1𝒁𝐖𝑿(𝑿)]\dfrac{d\bm{\theta}}{dt}=\dfrac{d}{dt}\begin{bmatrix}\bm{X}\\[6.0pt] \bm{Z}\\ \end{bmatrix}=-\begin{bmatrix}0&-\mathbf{W}\\[6.0pt] \mathbf{W}&0\\ \end{bmatrix}\begin{bmatrix}-\nabla_{\bm{X}}\mathcal{L}(\bm{X})\\[6.0pt] \mathbf{M}^{-1}\bm{Z}\\ \end{bmatrix}=\begin{bmatrix}\mathbf{W}\,\mathbf{M}^{-1}\bm{Z}\\[6.0pt] \mathbf{W}\,\nabla_{\bm{X}}\mathcal{L}(\bm{X})\end{bmatrix} (38)

Simulating these dynamics keeps the target distribution invariant as it is cast under the discussed unified framework of continuous-dynamics samplers [40]. Accordingly, the leapfrog integrator is then reformulated as:

𝒛t+ε/2=𝒛t+(ε2)𝐖t𝑿(𝒙t),𝒙t+ε=𝒙t+ε𝐖t𝐌1𝒛t+ε/2,𝒛t+ε=𝒛t+ε/2+(ε2)𝐖t𝑿(𝒙t+ε)\bm{z}\textsubscript{t+$\varepsilon$/2}=\bm{z}\textsubscript{t}+(\dfrac{\varepsilon}{2})\mathbf{W}\textsubscript{t}\nabla_{\bm{X}}\mathcal{L}(\bm{x}\textsubscript{t}),\ \ \ \ \bm{x}\textsubscript{t+$\varepsilon$}=\bm{x}\textsubscript{t}+\varepsilon\mathbf{W}\textsubscript{t}\,\mathbf{M}^{-1}\bm{z}\textsubscript{t+$\varepsilon$/2},\ \ \ \ \bm{z}\textsubscript{t+$\varepsilon$}=\bm{z}\textsubscript{t+$\varepsilon$/2}+(\dfrac{\varepsilon}{2})\mathbf{W}\textsubscript{t}\nabla_{\bm{X}}\mathcal{L}(\bm{x}\textsubscript{t+$\varepsilon$}) (39)

Since the last update of the inverse Hessian, 𝐖t\mathbf{W}\textsubscript{t}, is employed to simulate the current step, the position-independence of 𝐐\mathbf{Q} may not be achieved, and thus, a Metropolis correction should be devised to lead to the correct stationary distribution, which we follow in this work. Alternatively, an older update of the inverse Hessian, say 𝐖t-5\mathbf{W}\textsubscript{t-5}, can be employed to ensure position independence. A similar approach can be seen in [73] in the context of Langevin Monte Carlo. Further, we use the same inverse Hessian for all steps of a simulated trajectory: (𝒙(0),𝒛(0))(𝒙(τ),𝒛(τ))(\bm{x}(0),\bm{z}(0))\longrightarrow(\bm{x}(\tau),\bm{z}(\tau)). Following this preconditioning scheme thus implies that the new dynamics efficiently and compatibly adjusts both the 𝒁\bm{Z} and 𝑿\bm{X} evolutions based on the local structure of the target distribution and also features a Quasi-Newton direction for the momentum variables, allowing large jumps across the state space. In the burn-in stage of QNp-HMCMC, we set the mass matrix to identity (𝒁𝒩(0,𝐈))\big{(}\bm{Z}\sim\mathcal{N}(\textbf{{0}},\mathbf{I})\big{)}, while the skew-symmetric preconditioning is in effect. However, adapting the mass matrix while using these dynamics is an interesting direction for future research.
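For clarity, one step of the preconditioned leapfrog scheme in Eq. 39, with the identity mass matrix used in the burn-in stage, can be sketched as follows; this is an illustrative transcription, with grad_log_h again standing for \nabla_{\bm{X}}\mathcal{L}(\cdot).

def leapfrog_burnin(x, z, eps, W, grad_log_h):
    # One skew-symmetric preconditioned leapfrog step (Eq. 39) with M = I,
    # so the position update uses W directly
    z = z + 0.5 * eps * (W @ grad_log_h(x))
    x = x + eps * (W @ z)
    z = z + 0.5 * eps * (W @ grad_log_h(x))
    return x, z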

After this adaptive burn-in phase, we leverage the adapted inverse Hessian matrix to provide an informed, preconditioned mass matrix for the subsequent non-adaptive sampling phase of the algorithm. This is performed by inverting the final estimate of the inverse Hessian and utilizing the Hessian matrix itself as a constant mass matrix, i.e., \mathbf{M}=\mathbf{W}^{-1}. As such, in this non-adaptive phase of QNp-HMCMC, we follow the original dynamics in Eq. 31 and the leapfrog integrator in Eq. 32. However, we now utilize a properly constructed mass matrix that accounts for the scale and correlations of the position variables, leading to significant efficiency gains, particularly in complex, high-dimensional problems.

Utilizing the Hessian matrix as a mass matrix to simulate the momentum variables, \bm{Z}\sim\mathcal{N}(\bm{0},\mathbf{M}), necessitates the positive definiteness of the mass matrix. The BFGS procedure in Eq. 33 normally provides a symmetric, positive-definite \mathbf{W} matrix in an optimization context. However, in our case, we are using BFGS under different settings that may not satisfy the curvature condition \bm{s}_{k}^{T}\bm{y}_{k}>0, resulting in occasional deviations from positive-definiteness. Several standard techniques can then be implemented to ensure positive-definiteness, such as damped BFGS updating [77] or the simple addition \mathbf{W}_{new}=\mathbf{W}_{old}+\delta\mathbf{I}, where \delta\geq 0 is an appropriately chosen number. Alternatively, \mathbf{W} can be updated only when the curvature condition is satisfied, which directly guarantees positive definiteness. To further ensure the stability of the sampler, a positive threshold can be introduced to the curvature condition, e.g., \bm{s}_{k}^{T}\bm{y}_{k}>10. This latter approach has been used and has worked well in this work. As can be seen in Algorithm 2, approximating the inverse Hessian is independent of the accept/reject Metropolis step. That is, we always use the simulated sample to adapt \mathbf{W}, even if the proposed sample is later rejected, since each accepted/rejected point provides insight into the geometry of the sampling target. As a safeguard against faraway proposed samples, we may revert to the previous approximation of \mathbf{W} when the acceptance probability is too low.
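The curvature-threshold safeguard described above can be sketched as a thin wrapper around the update function given after Eq. 33; the threshold value of 10 follows the text, while the wrapper itself is only illustrative.

def guarded_bfgs_update(W, s, y, threshold=10.0):
    # Update W only when the curvature condition s^T y > threshold holds,
    # preserving symmetric positive-definiteness of the approximation
    if s @ y > threshold:
        return bfgs_inverse_hessian_update(W, s, y)
    return W  # otherwise keep the previous estimate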

To summarize, our developed Quasi-Newton mass preconditioned Hamiltonian Markov Chain Monte Carlo (QNp-HMCMC) consists of two integrated coupled phases, an adaptive and a non-adaptive, as concisely presented in Algorithm 2. In the first adaptive burn-in phase, we approximate the inverse Hessian matrix, 𝐖\mathbf{W}, using a BFGS formula while the sampler dynamics are enhanced using a skew-symmetric scheme. The final estimate of the inverse Hessian matrix is then utilized to provide a constant, preconditioned mass matrix (𝐌=𝐖1)(\mathbf{M}=\mathbf{W}^{-1}) to be utilized during the non-adaptive sampling phase. Overall, QNp-HMCMC is a practical, efficient approach that only requires gradient information and provides important insight about the geometry of the target distribution, eventually improving computational performance and enabling faster mixing.

4.3 (QNp-)HMCMC parameters

The performance of HMCMC algorithms relies on carefully selecting suitable values for the step size \varepsilon and the length \tau of the simulated trajectory. The step size \varepsilon can be efficiently and independently tuned to achieve some target average acceptance rate. In this work, we set this target acceptance rate to 65\%, as values between 60\% and 80\% are typically assumed optimal [22, 78, 41]. To this end, we adopt the dual averaging algorithm of Hoffman and Gelman [41, 42], using the default parameters provided in [41], albeit with an appropriate initial step size. Therefore, in QNp-HMCMC, the step size \varepsilon is automatically tuned, which we do in the first 2N_{BurnIn} iterations. This adaptation is extended beyond the burn-in phase, given that different dynamics are used in each phase of QNp-HMCMC, likely requiring different optimal step sizes.

In contrast, tuning the trajectory length τ\tau or the number of steps, L=τ/ε\textit{L}=\tau/\varepsilon, is trickier. It is often achieved by maximizing variants of the Expected Squared Jumping Distance (ESJD) [79], 𝔼L𝒙(m+1)𝒙(m)2{\mathbb{E}}_{L}\lVert\bm{x}^{(m+1)}-\bm{x}^{(m)}\rVert^{2}, such as normalized ESJD [80] and Change in the Estimator of the Expected Square (ChEES) [36]. Motivated by the idea of ESJD, Hoffman and Gelman [41] proposed a reliable, robust HMCMC variant known as the No-U-Turn Sampler (NUTS), which basically runs the Leapfrog integrator, building a tree procedure, until the simulation makes a “U-turn”. About half of these leapfrog steps are thus wasted to satisfy detailed balance [36, 81]. Obviously, such a computationally expensive approach is not suitable in the context of rare event estimation, as we aim to minimize the number of model calls. However, we observed that our QNp-HMCMC can perform well even when one leapfrog step is utilized. Since that aligns with our goal of minimizing model calls, we adopted single-step QNp-HMCMC implementation in this work, which worked well in practice. It is understood that using multiple steps can generally achieve faster mixing rates for HMCMC [82]. Still, the QNp-HMCMC has demonstrated good performance with single-step implementation, due to the utilized preconditioned dynamics and mass matrix. Utilizing a single step for the leapfrog has also been adopted in the recent variant of the Generalized Hamiltonian Monte Carlo (GHMC) introduced in [83]. An interesting direction for future research is to incorporate quasi-Newton preconditioning approaches within GHMC while augmented by Neal’s persistent Metropolis scheme presented in [84].

4.4 Connections with the preconditioned Metropolis adjusted Langevin algorithm (MALA)

In this section, we explore connections between QNp-HMCMC with single-step implementation and preconditioned Langevin Monte Carlo (LMC) samplers, including the Metropolis adjusted Langevin algorithm (MALA). Such exploration provides insight into the LMC preconditioning equivalent to the skew-symmetric preconditioning scheme in our QNp-HMCMC, leading to a novel, efficient preconditioning scheme for LMC. The preconditioned Langevin Monte Carlo proposal can be generally written as [85, 24]:

𝒙~=𝒙t+(ε22)𝐀𝑿(𝒙t)+ε𝐀1/2𝒛t\tilde{\bm{x}}=\bm{x}\textsubscript{t}+(\dfrac{\varepsilon^{2}}{2})\,\mathbf{A}\nabla_{\bm{X}}\mathcal{L}(\bm{x}\textsubscript{t})+\varepsilon\mathbf{A}^{1/2}\bm{z}^{\prime}_{\text{t}} (40)

where 𝐀\mathbf{A} is a preconditioning positive definite matrix, 𝐀1/2\mathbf{A}^{1/2} can be obtained using Cholesky decomposition as 𝐀=𝐀1/2(𝐀1/2)T\mathbf{A}=\mathbf{A}^{1/2}(\mathbf{A}^{1/2})^{T}, the log density (𝑿)=logh~(𝑿)\mathcal{L}(\bm{X})=\log\tilde{h}(\bm{X}) with h~\tilde{h} being the target distribution, and 𝒁𝒩(0,𝐈)\bm{Z}^{\prime}\sim\mathcal{N}(\textbf{{0}},\mathbf{I}).

Lemma 1.

QNp-HMCMC with single-step implementation is equivalent to a preconditioned Langevin Monte Carlo (LMC) with an adaptive preconditioning matrix 𝐀t=𝐖t2\mathbf{A}_{\text{t}}=\mathbf{W}_{\text{t}}^{2} in the burn-in stage and a fixed preconditioning matrix 𝐀=𝐌1=𝐖NBurnIn\mathbf{A}=\mathbf{M}^{-1}=\mathbf{W}_{N_{BurnIn}} afterward, where 𝐖t\mathbf{W}_{\text{t}} is the inverse Hessian matrix approximated at iteration tt of the burn-in stage using the BFGS formula, and 𝐖NBurnIn\mathbf{W}_{N_{BurnIn}} is the final approximation of the inverse Hessian matrix.

Lemma 2.

Considering the accept/reject Metropolis step, QNp-HMCMC with single-step implementation is equivalent to a preconditioned Metropolis adjusted Langevin algorithm (MALA), with preconditioning matrices described in lemma 1.

Proofs of Lemmas 1 and 2 are provided in Appendix A. Despite the many existing LMC variants, the proposed preconditioning approach, \mathbf{A}_{t}=\mathbf{W}_{t}^{2}, is, to our knowledge, unique. Based on Lemmas 1 and 2, the preconditioned MALA scheme corresponding to QNp-HMCMC with single-step integration is a novel, efficient algorithm, as shown in this work, and can be retrieved from Algorithm 2 by setting L=1. Further, the computational cost of QNp-HMCMC can be reduced by following the decomposition approach in Eq. 40, thus avoiding the inversion step, \mathbf{M}=\mathbf{W}^{-1}_{N_{BurnIn}}, in Algorithm 2. Alternatively, a BFGS variant providing the decomposed (inverse) Hessian can be used instead of Eq. 33, such as the formula in [75], for example.
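As an illustration of this equivalence, one preconditioned MALA step with the proposal of Eq. 40 and its Metropolis-Hastings correction can be sketched as below; this is a generic transcription under the stated assumptions, not the exact implementation of Algorithm 2 with L=1.

import numpy as np

def preconditioned_mala_step(x, eps, A, log_h, grad_log_h, rng):
    # One preconditioned MALA step: proposal of Eq. 40 plus the
    # corresponding (asymmetric) Metropolis-Hastings correction
    L_chol = np.linalg.cholesky(A)                        # A = L L^T

    def log_q(x_to, x_from):
        # log N(x_to; x_from + (eps^2/2) A grad, eps^2 A), up to common constants
        mean = x_from + 0.5 * eps**2 * (A @ grad_log_h(x_from))
        diff = x_to - mean
        return -0.5 * (diff @ np.linalg.solve(A, diff)) / eps**2

    noise = L_chol @ rng.standard_normal(x.size)
    x_prop = x + 0.5 * eps**2 * (A @ grad_log_h(x)) + eps * noise
    log_alpha = (log_h(x_prop) + log_q(x, x_prop)) - (log_h(x) + log_q(x_prop, x))
    return x_prop if np.log(rng.uniform()) < log_alpha else x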

5 Rare Event Domain Discovery

It is well known that good initial points can significantly enhance the efficiency of any MCMC sampler by facilitating faster convergence to the stationary target distribution. This is particularly crucial in the context of estimating probabilities of rare events, where identifying such points is challenging because rare event domains can lie arbitrarily within the tail regions of the original distribution; see Fig. 5, for example. Therefore, starting at a sample from the original distribution of the involved random variables often initializes the sampler far from the regions of interest. In our previous work [35], we adopted an annealing approach for the approximate sampling target in ASTPA, \tilde{h}, in which the sampling target continuously moves toward the rare event domain during the burn-in phase, facilitating rare event discovery. This approach, accompanied by the powerful HMCMC samplers, has demonstrated excellent performance in Gaussian spaces.

In this work, we alternatively use an optimization approach to initialize the sampler, given our focus on complex non-Gaussian spaces. Specifically, the Adam optimizer [43] is employed to minimize the negative logarithm of the target distribution (-\log\tilde{h}), given its ability to automatically tune the step size while utilizing momentum for accelerated convergence. We initialize Adam at the mean of the original distribution, \bm{\mu}_{\pi_{\bm{X}}}, set its learning rate to 0.1, and set the other parameters to the default values recommended in [43]. This optimization is performed over 500 iterations, or until convergence is reached earlier, indicated by the L2 norm of the optimizer's update dropping below 10^{-7}.

This discussed optimization approach can also be used to verify the appropriateness of the ASTPA parameters, gcg_{c} and σ\sigma. Convergence of the Adam optimizer to a state, 𝒙Adam\bm{x}_{Adam}, within the rare event domain, or to a state with a limit-state value relatively near zero, suggests that the sampling target is well-placed within the regions of interest. Otherwise, we may reduce the σ\sigma value and/or change the scaling constant gcg_{c} to shift the sampling target closer to these regions. In this case, running Adam on the new target, starting from 𝒙Adam\bm{x}_{Adam}, often requires a few iterations to converge. Our observations, however, indicate that the recommended parameter ranges for σ\sigma and gcg_{c} in Section 3.1 are generally effective in practice.
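A minimal sketch of this initialization is given below, assuming only that the gradient of \log\tilde{h} is available; the learning rate, iteration cap, and stopping tolerance follow the values stated above, while the plain NumPy implementation of Adam is ours.

import numpy as np

def adam_initialize(x0, grad_log_h, lr=0.1, beta1=0.9, beta2=0.999,
                    eps=1e-8, max_iter=500, tol=1e-7):
    # Minimize -log h_tilde with Adam, starting from the mean of pi_X;
    # stop when the L2 norm of the parameter update drops below tol
    x = np.array(x0, dtype=float)
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, max_iter + 1):
        g = -grad_log_h(x)                     # gradient of the negative log-target
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g**2
        m_hat = m / (1.0 - beta1**t)
        v_hat = v / (1.0 - beta2**t)
        step = lr * m_hat / (np.sqrt(v_hat) + eps)
        x = x - step
        if np.linalg.norm(step) < tol:
            break
    return x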

6 ASTPA in Bounded Spaces

To obviate the need to deal with variable bounds in non-Gaussian spaces while simulating the Hamiltonian dynamics during sampling, a transformation to an unconstrained space can be performed [86]. A new random variable \bm{Y}=\bm{R}(\bm{X}) can be characterized by transforming a d-dimensional random variable \bm{X}, with PDF \pi_{\bm{X}}(\bm{X}), using a suitably defined function \bm{R}. If \bm{R} is a one-to-one function and its inverse \bm{R}^{-1} is available, with an explicit Jacobian, then the density of \bm{Y} is expressed as:

\pi_{\bm{Y}}(\bm{Y})=\pi_{\bm{X}}\big(\bm{R}^{-1}(\bm{Y})\big)\,\big|\det J_{\bm{R}^{-1}}(\bm{Y})\big| (41)

where "det" denotes the determinant operation of the Jacobian matrix, J𝑹1(𝒀)J_{\bm{R}^{-1}}(\bm{Y}). We are interested in constructing a map Ri:XiYiR_{i}:X_{i}\rightarrow Y_{i}, with i=1,2,,di=1,2,...,d, where YiY_{i} is unbounded random variable. In this scenario, the Jacobian matrix will be diagonal and thus easy to compute. Subsequently, we present these maps based on distinct relevant categories of bounds.

Lower bounded variables: A one-dimensional random variable XiX_{i}, having a lower bound αi\alpha_{i}, can be transformed to unconstrained variable Yi=Ri(Xi)Y_{i}=R_{i}(X_{i}) using a logarithmic transform as Yi=log(Xiαi)Y_{i}=\log(X_{i}-\alpha_{i}), and thus Xi=Ri1(Yi)=exp(Yi)+αiX_{i}=R_{i}^{-1}(Y_{i})=\exp(Y_{i})+\alpha_{i}. Therefore, the corresponding ithi^{th} diagonal element in JR1(𝒀)J_{R^{-1}}(\bm{Y}) will be exp(Yi)\exp(Y_{i}).

Upper bounded variables: The logarithmic transformation Yi=log(βiXi)Y_{i}=\log(\beta_{i}-X_{i}) can be utilized to transform a one-dimensional random variable XiX_{i}, with an upper bound βi\beta_{i}, to unconstrained random variable YiY_{i}, and then Xi=Ri1(Yi)=βiexp(Yi)X_{i}=R_{i}^{-1}(Y_{i})=\beta_{i}-\exp(Y_{i}). The corresponding ithi^{th} diagonal element in the Jacobian matrix will be (exp(Yi))\big{(}-\exp(Y_{i})\big{)}.

Lower and upper bounded variables: A random variable X_{i} with a lower bound \alpha_{i} and an upper bound \beta_{i} can be transformed to an unconstrained random variable Y_{i} as Y_{i}=\text{logit}\big((X_{i}-\alpha_{i})/(\beta_{i}-\alpha_{i})\big), where the logit function is defined for q\in(0,1) as \text{logit}(q)=\log\big(q/(1-q)\big). The inverse map and its derivative are then expressed as X_{i}=R_{i}^{-1}(Y_{i})=\alpha_{i}+(\beta_{i}-\alpha_{i})\,\text{logit}^{-1}(Y_{i}) and dX_{i}/dY_{i}=(\beta_{i}-\alpha_{i})\,\text{logit}^{-1}(Y_{i})\,\big(1-\text{logit}^{-1}(Y_{i})\big), respectively, where \text{logit}^{-1}(\kappa)=1/\big(1+\exp(-\kappa)\big) for \kappa\in(-\infty,+\infty).
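The three maps and the corresponding diagonal log-Jacobian terms entering Eq. 41 can be sketched as follows (an illustrative transcription of the formulas above):

import numpy as np

def lower_bounded_from_unbounded(y, a):
    # X = exp(Y) + a ;  log|dX/dY| = Y
    return np.exp(y) + a, y

def upper_bounded_from_unbounded(y, b):
    # X = b - exp(Y) ;  log|dX/dY| = Y
    return b - np.exp(y), y

def doubly_bounded_from_unbounded(y, a, b):
    # X = a + (b - a) * logit^{-1}(Y) ;  dX/dY = (b - a) s (1 - s), s = logit^{-1}(Y)
    s = 1.0 / (1.0 + np.exp(-y))
    x = a + (b - a) * s
    log_abs_jac = np.log(b - a) + np.log(s) + np.log(1.0 - s)
    return x, log_abs_jac

Per Eq. 41, the log-density in the unbounded space is then \log\pi_{\bm{X}}\big(\bm{R}^{-1}(\bm{y})\big) plus the sum of the returned log-Jacobian terms over the transformed components.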

7 Summary of (QNp-)HMCMC-based ASTPA

The proposed (Quasi-Newton mass preconditioned) HMCMC-based ASTPA framework can be directly applied in non-Gaussian spaces as follows:

  1. 1.

    Constructing the approximate sampling target h~\tilde{h} in Eq. 13, where its two parameters σ\sigma and gcg_{c} are determined as recommended in Section 3.1.

  2. 2.

Running the Adam optimizer to obtain a good initial state for the sampler, as described in Section 5, with a number of iterations N_{Adam}, typically less than 500.

  3. 3.

Utilizing the (QNp-)HMCMC with single-step implementation to draw a sample set \{\bm{x}_{i}\}_{i=1}^{N}\sim h. We consider a burn-in phase with a length N_{BurnIn}\approx 10\text{--}15\%\ \text{of}\ N, particularly needed to approximate the inverse Hessian in QNp-HMCMC. The leapfrog step size \varepsilon is automatically tuned during the first 2N_{BurnIn} iterations using the dual averaging algorithm of Hoffman and Gelman [41].

  4. 4.

    Computing the expected value of the weighted indicator function, p~^\hat{\tilde{p}}_{\mathcal{F}}, using Eq. 10 based on {𝒙i}i=1N\{\bm{x}_{i}\}_{i=1}^{N}.

  5. 5.

Applying Inverse Importance Sampling (IIS) to compute the normalizing constant estimate \hat{C}_{h}. This starts by fitting a GMM Q(\bm{X}) based on \{\bm{x}_{i}\}_{i=1}^{N}, with Q(\bm{X}) structured as recommended in Section 3.2. An i.i.d. sample set \{\bm{x}_{i}\}_{i=1}^{M} is then drawn from Q(\bm{X}) to compute \hat{C}_{h} using Eq. 18. M can generally be around 30\% of N.

  6. 6.

    Computing the rare event probability as p^=p~^C^h\hat{p}_{\mathcal{F}}=\hat{\tilde{p}}_{\mathcal{F}}\,\,\hat{C}_{h}.

Naturally, the required total number of model calls (N_{Total}=N_{Adam}+N_{BurnIn}+N+M) is case-dependent, and convergence of the estimator can be checked through Eqs. 24 and 25. However, N_{Total} can roughly be 5,000-10,000 for high-dimensional problems and target probabilities lower than 10^{-4}.
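A compact sketch of the post-sampling stages (4-6) above is given below for illustration. It assumes the weighted-indicator estimator of Eq. 10 and the IIS estimator of Eq. 18 take the standard forms implied by the description above, and it uses scikit-learn's GaussianMixture as one possible GMM implementation; it is not the code used to produce the reported results.

import numpy as np
from sklearn.mixture import GaussianMixture

def astpa_post_sampling(samples_h, log_pi, log_h_tilde, indicator,
                        n_iis, n_components=1, seed=0):
    # samples_h  : (N, d) array of samples drawn from h via (QNp-)HMCMC
    # log_pi     : log of the (normalized) original distribution pi_X
    # log_h_tilde: log of the unnormalized sampling target h_tilde
    # indicator  : returns 1.0 in the rare event domain (e.g., g(x) <= 0), else 0.0

    # Stage 4: expected weighted indicator over the samples drawn from h
    log_w = np.array([log_pi(x) - log_h_tilde(x) for x in samples_h])
    ind = np.array([indicator(x) for x in samples_h], dtype=float)
    p_tilde = np.mean(ind * np.exp(log_w))

    # Stage 5: inverse importance sampling for the normalizing constant of h_tilde,
    # using a simple (here diagonal-covariance) GMM fitted on the available samples
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          random_state=seed).fit(samples_h)
    x_iis, _ = gmm.sample(n_iis)
    log_q = gmm.score_samples(x_iis)
    log_h_vals = np.array([log_h_tilde(x) for x in x_iis])
    C_h = np.mean(np.exp(log_h_vals - log_q))

    # Stage 6: rare event probability estimate
    return p_tilde * C_h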

8 Numerical Results

Several numerical examples are studied in this section to examine the performance and efficiency of the proposed methods. In all examples, (QNp-)HMCMC-based ASTPA is implemented as summarized in Section 7. Results are compared with the Component-Wise Metropolis-Hastings based Subset Simulation (CWMH-SuS) [16], with a uniform proposal distribution of width 2, a sample percentile of 0.1 to determine the subsets, and an appropriate number of samples in each subset level. In Example 2, our results are compared against the state-of-the-art Riemannian Manifold Hamiltonian Monte Carlo-based subset simulation (RMHMC-SuS), using the available results in [25]. We could not, however, access an advanced importance sampling-based method that can competitively and directly operate in the studied complex non-Gaussian problems. Comparisons are provided in terms of accuracy and computational cost based on 100 independent simulations. Specifically, for each simulation j, we record the target probability estimate (\hat{p}_{\mathcal{F}})^{j}, the total number of model calls (N_{Total})^{j}, and the analytical Coefficient of Variation of ASTPA, (\text{C.o.V-Anal})^{j}, computed by Eq. 26. N_{Total} is defined for ASTPA in Section 7. We then report the mean of the rare event probability estimates, \mathbb{E}[\hat{p}_{\mathcal{F}}], that of the total number of model calls, \mathbb{E}[N_{Total}], and the sampling C.o.V, computed as:

C.o.V=Var^(p^)𝔼[p^],Var^(p^)=11001j=1100((p^)j𝔼[p^])2\textrm{C.o.V}=\dfrac{\sqrt{\widehat{\text{Var}}(\hat{p}_{\mathcal{F}})}}{\mathop{\mathbb{E}}[\hat{p}_{\mathcal{F}}]},\quad\widehat{\text{Var}}(\hat{p}_{\mathcal{F}})=\dfrac{1}{100-1}\,\sum_{j=1}^{100}\big{(}(\hat{p}_{\mathcal{F}})^{j}-\mathop{\mathbb{E}}[\hat{p}_{\mathcal{F}}]\big{)}^{2} (42)

The mean of the analytical C.o.V, 𝔼[C.o.V-Anal]\mathbb{E}[\text{C.o.V-Anal}], is also reported in parentheses. The total number of limit-state function evaluations NTotalN_{Total} for ASTPA has been determined based on achieving C.o.V values [0.1, 0.35]\in[0.1,\,0.35]. The reference rare event probabilities are computed based on the standard Monte Carlo estimator, described in Section 2, using 10610^{6}-10810^{8} samples, as appropriate. The problem dimensions are denoted by dd, and all ASTPA parameters are carefully chosen for all examples but are not optimized. Analytical gradients are provided in all examples; hence, one limit-state/model call can provide both the relevant function value and gradient.

8.1 Example 1: Multi-dimensional quadratic limit-state function with correlated Gumbel distribution

Our proposed framework is first tested directly on a highly correlated Gumbel distribution, with its joint probability distribution formulated using a Gaussian copula as:

π𝑿(𝑿)=ϕd([Φ1(FX1(X1)),Φ1(FX2(X2)),,Φ1(FXd(Xd))];𝐑)i=1dfXi(Xi)i=1dϕ(Φ1(FXi(Xi)))\displaystyle\pi_{\bm{X}}(\bm{X})=\phi_{d}\big{(}[\Phi^{-1}(F_{X_{1}}(X_{1})),\Phi^{-1}(F_{X_{2}}(X_{2})),...,\Phi^{-1}(F_{X_{d}}(X_{d}))];\mathbf{R}\big{)}\frac{\prod_{i=1}^{d}f_{X_{i}}(X_{i})}{\prod_{i=1}^{d}\phi\big{(}\Phi^{-1}(F_{X_{i}}(X_{i}))\big{)}} (43)

where ϕd()\phi_{d}(\cdot) is the dd-dimensional standard Gaussian PDF, with a correlation matrix 𝐑\mathbf{R}, having a pair-wise correlation ρij=0.9528\rho_{ij}=0.9528; ϕ()\phi(\cdot) and Φ()\Phi(\cdot) are the univariate standard normal PDF and CDF, respectively; fXif_{X_{i}} and FXiF_{X_{i}} are the Gumbel distribution PDF and CDF of XiX_{i}, respectively. All XiX_{i}’s have a mean value μi=10\mu_{i}=10 and a Coefficient of Variation of 0.40. The model is represented here using the following quadratic limit-state function:

g(𝑿)=λ1di=1dXi+2.5(X1j=2γXj)2\displaystyle g(\bm{X})=\lambda-\frac{1}{\sqrt{d}}\ \sum_{i=1}^{d}X_{i}+2.5\ \bigg{(}X_{1}-\sum_{j=2}^{\gamma}X_{j}\bigg{)}^{2} (44)

where the threshold λ\lambda determines the rarity of the considered event, and γ\gamma defines the number of involved nonlinear variables. Fig. 5(a) showcases the joint probability distribution and the limit-state surface for d=2d=2, λ=70\lambda=70, and γ=2\gamma=2.

Table 1 compares the performance of ASTPA methods with CWMH-SuS for dimensions d=2,3,and 40d=2,3,\text{and }40. The approximate sampling target, h~\tilde{h}, in ASTPA is constructed for all dimensions using a likelihood dispersion factor, σ\sigma, of 0.1, and a scaling constant gc=g(𝝁π𝑿)/qg_{c}=\nicefrac{{g(\bm{\mu}_{\pi_{\bm{X}}})}}{{q}}, with q=20q=20, and 𝝁π𝑿=[μi,μi,,μi]d\bm{\mu}_{\pi_{\bm{X}}}=[\mu_{i},\mu_{i},...,\mu_{i}]\in\mathbb{R}^{d}, as explained in Eq. 14. Adam optimizer is then run, starting at 𝝁π𝑿\bm{\mu}_{\pi_{\bm{X}}}, to provide an initial state for the HMCMC samplers. In Table 1, it is clear that our ASTPA 𝔼[p^]\mathbb{E}[\hat{p}_{\mathcal{F}}] estimates have very good agreement with the reference probability given by standard Monte Carlo estimator (MCs) for all studied cases. Samples in MCs and the first set in Subset Simulation are generated using Nataf transformation [87, 88]. Compared to CWMH-SuS, ASTPA results are significantly efficient, as demonstrated by achieving low C.o.V using a significantly lower total number of model calls NTotalN_{Total}. Comparing HMCMC and QNp-HMCMC within ASTPA, it is obvious that QNp-HMCMC exhibits enhanced performance, considering the comparatively lower number of model calls, i.e., (N+NBurnIn)(N+N_{BurnIn}) (see Section 7), required to achieve similar accuracy. Further, the good agreement between the sampling C.o.V and analytical C.o.V, reported in parentheses, shows the effectiveness and accuracy of our analytical C.o.V expression, particularly when utilizing the QNp-HMCMC sampler. Fig. 5 outlines the ASTPA framework in the 2D case here, with the joint distribution π𝑿\pi_{\bm{X}} and the limit-state surface at g(𝑿)=0g(\bm{X})=0 shown in Fig. 5(a). The approximate sampling target is visualized in Fig. 5(b), with its samples plotted in grey in Fig. 5(c). The blue points in Fig. 5(c) are the steps of the Adam optimizer, efficiently discovering the rare event domain. As illustrated in Fig. 4, the developed framework can provide better estimates for the 2-dimensional case by employing other parameter values, σ\sigma and qq, to construct the sampling target. Nevertheless, for the sake of consistency hereafter, we use the same parameter values across all considered dimensionalities within the same example, albeit with less accurate estimates for some cases.

Figure 5: Example 1: (a) Correlated Gumbel distribution plotted in red; the grey line represents the limit-state surface at g(𝑿)=0g(\bm{X})=0, (b) The approximate sampling target h~\tilde{h} in ASTPA, (c) Adam’s steps (500) indicatively plotted in blue; grey samples drawn from hh using QNp-HMCMC.
Table 1: Example 1: Performance of various methods for a quadratic limit-state function with
a highly correlated Gumbel distribution.
Eq. 44 100 Simulations MCs ASTPA (σ=0.1,q=20\sigma=0.1,\,q=20) CWMH-SuS
HMCMC QNp-HMCMC
d=2d=2 λ=70\lambda=70 γ=2\gamma=2 𝔼[NTotal]\mathop{\mathbb{E}}[N_{Total}] 1.00E9 5,348 4,048 44,690
C.o.V 0.06 0.09 (0.05) 0.09 (0.06) 6.78
𝔼[p^]\mathop{\mathbb{E}}[\hat{p}_{\mathcal{F}}] 2.51E-7 2.37E-7 2.43E-7 4.55E-8
d=3d=3 λ=5\lambda=5 γ=3\gamma=3 𝔼[NTotal]\mathop{\mathbb{E}}[N_{Total}] 1.00E9 8,598 4,598 31,730
C.o.V 0.05 0.23 (0.14) 0.16 (0.13) 0.82
𝔼[p^]\mathop{\mathbb{E}}[\hat{p}_{\mathcal{F}}] 4.17E-7 4.18E-7 4.19E-7 5.19E-7
d=40d=40 λ=200\lambda=-200 γ=20\gamma=20 𝔼[NTotal]\mathop{\mathbb{E}}[N_{Total}] 1.00E8 13,298 5,298 40,685
C.o.V 0.04 0.19 (0.07) 0.08 (0.08) 4.63
𝔼[p^]\mathop{\mathbb{E}}[\hat{p}_{\mathcal{F}}] 4.60E-6 4.56E-6 4.60E-6 6.32E-6

8.2 Example 2: High-dimensional hyperspherical limit-state function with funnel distribution

Our proposed framework is here tested on a challenging non-Gaussian sampling distribution. Specifically, we consider a dd-dimensional random variable 𝑿=[X1,X2,,Xd]\bm{X}=[X_{1},X_{2},...,X_{d}] following a dd-dimensional Neal’s funnel distribution, defined as [89, 25]:

π𝑿(𝑿)=i=1d1𝒩(Xi|0,exp(Xd))𝒩(Xd|0, 1)\pi_{\bm{X}}(\bm{X})=\prod_{i=1}^{d-1}\mathcal{N}(X_{i}\,|0,\,\exp(X_{d}))\cdot\,\mathcal{N}(X_{d}\,|0,\,1) (45)

A hyperspherical limit-state function is utilized in this example, expressed as:

g(𝑿)=i=1d1Xi2+(Xd+6)2r2g(\bm{X})=\sum_{i=1}^{d-1}X_{i}^{2}+(X_{d}+6)^{2}-r^{2}\\ (46)

where rr is a threshold parameter determining the level of the rare event probability. Fig. 6(a) depicts Neal’s funnel distribution π𝑿(𝑿)\pi_{\bm{X}}(\bm{X}) and the limit-state function for d=2d=2, and r=2r=2. The approximate sampling target, its samples, and Adam’s steps are also visualized in Fig. 6. The results, accompanied by ASTPA parameters, are presented in Table 2, for dimensions up to 101101. Samples in MCs and the first set in Subset Simulation (CWMH-SuS) are generated directly based on the joint distribution in Eq. 45. Importantly, ASTPA performance is successfully compared with the state-of-the-art RMHMC-SuS, using results reported for the studied example up to 5151 dimensions in [25]. Moreover, the ASTPA results in this example support our findings in Section 8.1, including that on the analytical C.o.V and emphasizing the superiority of ASTPA in such challenging scenarios. Here, HMCMC and QNp-HMCMC exhibit similar performance, except for the 101101-dimensional case, wherein QNp-HMCMC-based ASTPA provided a significantly efficient estimate.
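For reference, the log-density of Eq. 45 (up to additive constants) and the limit-state function of Eq. 46 with its gradient can be sketched as below; treating exp(X_d) as the variance of the first d-1 components is our reading of the stated notation, and the sketch is purely illustrative.

import numpy as np

def log_funnel(x):
    # Log-density of Eq. 45 up to additive constants, with exp(x_d) read as the
    # variance of the first d-1 components (assumed convention)
    xd = x[-1]
    return (-0.5 * np.exp(-xd) * np.sum(x[:-1]**2)   # N(x_i | 0, exp(x_d)) terms
            - 0.5 * (x.size - 1) * xd                # their log-normalizations
            - 0.5 * xd**2)                           # N(x_d | 0, 1)

def g_hypersphere(x, r):
    # Limit-state function of Eq. 46 and its gradient
    val = np.sum(x[:-1]**2) + (x[-1] + 6.0)**2 - r**2
    grad = 2.0 * x
    grad[-1] = 2.0 * (x[-1] + 6.0)
    return val, grad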

Figure 6: Example 2: (a) Neal’s funnel distribution plotted in red; the grey line represents the limit-state surface at g(𝑿)=0g(\bm{X})=0, (b) The approximate sampling target h~\tilde{h} in ASTPA, (c) Adam’s steps plotted in blue; grey samples drawn from hh using QNp-HMCMC.
Table 2: Example 2: Performance of various methods for a high-dimensional hyperspherical limit-state function
with Neal’s funnel distribution.
Eq. 46 100 Simulations MCs ASTPA (σ=0.1,q=20\sigma=0.1,\,q=20) RMHMC-SuS [25] CWMH-SuS
HMCMC QNp-HMCMC U(1,1)U(-1,1)
d=2d=2 r=2r=2 𝔼[NTotal]\mathop{\mathbb{E}}[N_{Total}] 1.00E7 1,213 1,213 4,761 4,618
C.o.V 0.06 0.09 (0.08) 0.10 (0.09) 0.35 0.57
𝔼[p^]\mathop{\mathbb{E}}[\hat{p}_{\mathcal{F}}] 3.11E-5 3.09E-5 3.09E-5 3.11E-5 3.13E-5
d=31d=31 r=2r=2 𝔼[NTotal]\mathop{\mathbb{E}}[N_{Total}] 1.00E7 3,213 3,213 4,905 20,884
C.o.V 0.07 0.10 (0.11) 0.12 (0.11) 0.41 1.95
𝔼[p^]\mathop{\mathbb{E}}[\hat{p}_{\mathcal{F}}] 1.87E-5 1.83E-5 1.84E-5 1.99E-5 1.52E-5
d=51d=51 r=2r=2 𝔼[NTotal]\mathop{\mathbb{E}}[N_{Total}] 1.00E7 4,313 4,313 5,152 22,072
C.o.V 0.08 0.15 (0.14) 0.13 (0.11) 0.45 2.63
𝔼[p^]\mathop{\mathbb{E}}[\hat{p}_{\mathcal{F}}] 1.37E-5 1.34E-5 1.33E-5 1.37E-5 1.32E-5
d=51d=51 r=1r=1 𝔼[NTotal]\mathop{\mathbb{E}}[N_{Total}] 1.00E8 4,840 4,840 7,065 32,260
C.o.V 0.29 0.15 (0.11) 0.17 (0.11) 0.55 5.21
𝔼[p^]\mathop{\mathbb{E}}[\hat{p}_{\mathcal{F}}] 1.28E-7 1.26E-7 1.28E-7 1.36E-7 0.77E-7
d=101d=101 r=2r=2 𝔼[NTotal]\mathop{\mathbb{E}}[N_{Total}] 1.00E8 14,313 7,813 25,564
C.o.V 0.04 0.16 (0.14) 0.16 (0.11) 4.04
𝔼[p^]\mathop{\mathbb{E}}[\hat{p}_{\mathcal{F}}] 6.82E-6 6.55E-6 6.55E-6 5.68E-6

8.3 Example 3: High-dimensional highly nonlinear problem

To further investigate the ASTPA performance in high-dimensional non-Gaussian spaces with significant nonlinearities, the limit-state function in this example is expressed in the lognormal space as:

g(𝑿)=Y01200i=1200Xi+2.5(X1j=210Xj)2+(X11k=1214Xk)4+(X15l=1617Xl)8\displaystyle g(\bm{X})=Y_{0}-\frac{1}{\sqrt{200}}\ \sum_{i=1}^{200}X_{i}+2.5\ \bigg{(}X_{1}-\sum_{j=2}^{10}X_{j}\bigg{)}^{2}+\bigg{(}X_{11}-\sum_{k=12}^{14}X_{k}\bigg{)}^{4}+\bigg{(}X_{15}-\sum_{l=16}^{17}X_{l}\bigg{)}^{8} (47)

where 𝑿\bm{X} follows an independent lognormal distribution with a mean μXi=1.0\mu_{X_{i}}=1.0, and a standard deviation, σXi=1.0\sigma_{X_{i}}=1.0. Two values are considered for the threshold Y0Y_{0}, 1515 and 1616, to examine different probability levels. Prior to applying ASTPA, these positive lognormal variables are transformed to an unconstrained non-Gaussian space by following the approach discussed in Section 6. The results, accompanied by the ASTPA parameters, are presented in Table 3, wherein ASTPA has significantly outperformed CWMH-SuS, considering all metrics while giving an accurate mean value for the sought probability. Once more, QNp-HMCMC, which is implemented here with diagonal inverse Hessian and mass matrices, has significantly outperformed the original HMCMC in this highly nonlinear scenario, reinforcing our previous findings.

Table 3: Example 3: Performance of various methods for a high-dimensional, highly nonlinear
limit-state function with lognormal distribution (d=200d=200).
Eq. 47 100 Simulations MCs ASTPA (σ=0.2, q=10) CWMH-SuS
HMCMC QNp-HMCMC
Y0=15Y_{0}=15 𝔼[NTotal]\mathop{\mathbb{E}}[N_{Total}] 1.00E7 30,312 8,812 13,719
C.o.V 0.06 0.24 (0.18) 0.22 (0.26) 0.41
𝔼[p^]\mathop{\mathbb{E}}[\hat{p}_{\mathcal{F}}] 2.22E-5 2.22E-5 2.20E-5 5.62E-5
Y0=16Y_{0}=16 𝔼[NTotal]\mathop{\mathbb{E}}[N_{Total}] 1.00E7 30,333 11,833 23,855
C.o.V 0.16 0.26 (0.26) 0.24 (0.39) 0.46
𝔼[p^]\mathop{\mathbb{E}}[\hat{p}_{\mathcal{F}}] 3.54E-6 3.69E-6 3.62E-6 1.66E-5

8.4 Example 4: 2D and 3D Rosenbrock distributions

In this example, we consider the challenging Rosenbrock distribution, defined for multi-dimensional random variables as [27, 90]:

π𝑿(𝑿)=ai=2dbiπd/2exp(a(X1γ)2i=2dbi(XiXi12)2)\displaystyle\pi_{\bm{X}}(\bm{X})=\dfrac{\sqrt{a}\prod_{i=2}^{d}\sqrt{b_{i}}}{\pi^{d/2}}\,\,\exp\bigg{(}-a(X_{1}-\gamma)^{2}-\sum_{i=2}^{d}b_{i}(X_{i}-X_{i-1}^{2})^{2}\bigg{)} (48)

where the parameters a, b_{i}, and \gamma define the geometry and complexity of the distribution. An interesting discussion on the sampling complexity of the 2D Rosenbrock distribution can be found in [28, 26]. In this example, we include two- and three-dimensional cases of Eq. 48, where the relevant parameter sets are given in Table 4. It should be noted that these chosen parameters result in significantly challenging sampling targets, as showcased for the 2-dimensional case in Fig. 7(a). To further complicate this testing benchmark, a low probability level (\sim 10^{-5}-10^{-6}) is considered by defining the limit-state function as:

g(𝑿)=2503X1i=2dXi\displaystyle g(\bm{X})=250-3X_{1}-\sum_{i=2}^{d}X_{i} (49)
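For completeness, the unnormalized log-density of Eq. 48 and its gradient, as required by the HMCMC samplers, can be sketched as follows; this is a direct transcription of the stated formula, with b allowed to be either a scalar or a length d-1 array, and is not the implementation used for the reported results.

import numpy as np

def log_rosenbrock(x, a, b, gamma):
    # Unnormalized log-density of Eq. 48
    return -a * (x[0] - gamma)**2 - np.sum(b * (x[1:] - x[:-1]**2)**2)

def grad_log_rosenbrock(x, a, b, gamma):
    # Gradient of the log-density above, needed by the HMCMC samplers
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    diff = x[1:] - x[:-1]**2                    # (X_i - X_{i-1}^2), i = 2..d
    g[0] = -2.0 * a * (x[0] - gamma)
    g[:-1] += 4.0 * b * diff * x[:-1]           # chain-rule term through X_{i-1}^2
    g[1:] += -2.0 * b * diff
    return g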

The approximate sampling target in ASTPA is depicted in Fig. 7(b). The results, accompanied by the ASTPA parameters, are reported in Table 4. Samples in MCs and the first set in Subset Simulation (CWMH-SuS) are generated directly based on the conditional sampling scheme provided in [90]. The Adam optimizer is run for 1500 iterations here because of the complexity of this example. As can be seen in Table 4, QNp-HMCMC-based ASTPA exceptionally outperforms the other methods. The performance difference between the two HMCMC schemes can be explained by observing the grey samples in Fig. 7, where Fig. 7(c) visualizes 1650 QNp-HMCMC samples, and Fig. 7(d) showcases around one million HMCMC samples. Comparing these two plots demonstrates the superior ability of QNp-HMCMC to sample complex target distributions representatively and efficiently. As discussed in Section 4.2, this sampling efficiency is mainly attributed to the preconditioned dynamics and mass matrix incorporated in QNp-HMCMC. In QNp-HMCMC-based ASTPA, the estimates of the developed analytical C.o.V expression exhibit very good agreement with the sampling ones. However, this is not the case for HMCMC, considering its highly correlated samples, which require a thinning scheme different from that in Eq. 27. It should be noted that the performance of Subset Simulation (SuS) can generally be enhanced by adopting advanced samplers, such as HMCMC or RMHMC, as in Section 8.2.

Figure 7: Example 4: (a) 2D Rosenbrock distribution plotted in red; the grey line represents the limit-state surface at g(𝑿)=0g(\bm{X})=0, (b) the approximate sampling target h~\tilde{h} in ASTPA, (c) Adam’s steps (1500) indicatively plotted in blue; 1650 grey samples drawn from hh using QNp-HMCMC, including burn-in samples shown in light grey, and (d) around 10610^{6} grey samples generated using the standard HMCMC; comparing (c) and (d) demonstrates the efficiency of QNp-HMCMC in adequately exploring the sampling target .
Table 4: Example 4: Performance of various methods for 2D and 3D Rosenbrock distributions.
Eq. 48 100 Simulations MCs ASTPA (σ=0.1,q=20\sigma=0.1,\,q=20) CWMH-SuS
HMCMC QNp-HMCMC
d=2d=2 γ=1\gamma=1 a=0.05a=0.05 b=5b=5 𝔼[NTotal]\mathop{\mathbb{E}}[N_{Total}] 1.00E8 1.02E6 3,848 633,700
C.o.V 0.03 0.20 (0.03) 0.12 (0.11) 1.09
𝔼[p^]\mathop{\mathbb{E}}[\hat{p}_{\mathcal{F}}] 1.15E-5 0.67E-5 1.10E-5 1.15E-5
d=3d=3 γ=0.5\gamma=0.5 a=1a=1 b=5b=5 𝔼[NTotal]\mathop{\mathbb{E}}[N_{Total}] 1.00E8 1.02E6 4,948 848,800
C.o.V 0.10 0.41 (0.07) 0.16 (0.13) 2.22
𝔼[p^]\mathop{\mathbb{E}}[\hat{p}_{\mathcal{F}}] 1.00E-6 0.20E-6 0.87E-6 1.70E-6

8.5 Example 5: High-dimensional ring-shaped conditional distribution in a Bayesian context

Another challenging, complex, high-dimensional example, originally introduced in [25], is considered in this section. The original joint distribution, \pi_{\bm{X}}(\bm{X})=\pi_{\bm{X}}(\bm{X}|y), is constructed in a Bayesian context. The observed random variable y is assumed to follow a Gaussian distribution with mean \mu_{Y}=2 and standard deviation \sigma_{Y}=4, and 100 data points are then generated accordingly. A likelihood function L(y|\bm{X}) is then defined as:

L(y|𝑿)=j=11001σY2πexp((yjs𝑿)22σY2)L(y|\bm{X})=\prod_{j=1}^{100}\frac{1}{\sigma_{Y}\sqrt{2\pi}}\,\exp{-\bigg{(}\dfrac{(y_{j}-s_{\bm{X}})^{2}}{2\,\sigma_{Y}^{2}}\bigg{)}} (50)

where s𝑿i=1dXi2s_{\bm{X}}\coloneqq\sum_{i=1}^{d}X_{i}^{2} is assumed hereafter to be an unknown mean parameter, intended to be estimated using the Bayes’ rule. An independent standard Gaussian prior distribution, f𝑿(𝑿)=i=1dN(Xi|0, 1)f_{\bm{X}}(\bm{X})=\prod_{i=1}^{d}N(X_{i}|0,\,1), is then assumed for 𝑿\bm{X}. The posterior distribution can thus be expressed as:

π𝑿(𝑿|y)=L(y|𝒙)f𝑿(𝒙)π(y)=exp(s𝑿212σY2j=1100(yjs𝑿)2)Cπ=π~𝑿(𝑿|y)Cπ\pi_{\bm{X}}(\bm{X}|y)=\dfrac{L(y|\bm{x})f_{\bm{X}}(\bm{x})}{\pi(y)}=\dfrac{\exp\bigg{(}-\dfrac{s_{\bm{X}}}{2}-\dfrac{1}{2\sigma_{Y}^{2}}\sum_{j=1}^{100}(y_{j}-s_{\bm{X}})^{2}\bigg{)}}{C_{\pi}}=\dfrac{\tilde{\pi}_{\bm{X}}(\bm{X}|y)}{C_{\pi}} (51)

where π(y)\pi(y) is the Bayes evidence, subsequently absorbed in CπC_{\pi}, a normalizing constant of the unnormalized posterior π~𝑿(𝑿|y)\tilde{\pi}_{\bm{X}}(\bm{X}|y). Fig. 8(a) depicts this unnormalized posterior distribution for d=2d=2. Since ASTPA necessitates a normalized original distribution, the normalizing constant CπC_{\pi} is first computed.

8.5.1 Normalizing constant of the posterior distribution

The normalizing constant CπC_{\pi} can be expressed as:

Cπ=𝒳π~𝑿(𝒙|y)𝑑𝒙C_{\pi}=\int_{\mathcal{X}}\tilde{\pi}_{\bm{X}}(\bm{x}|y)d\bm{x} (52)

We employ our developed inverse importance sampling (IIS) technique to evaluate this integral. First, we draw a sample set \{\bm{x}_{i}\}_{i=1}^{N_{\pi}} from the posterior distribution \tilde{\pi}_{\bm{X}}(\bm{X}|y) using the HMCMC sampler. A GMM Q^{\prime}(\bm{X}), structured as recommended in Section 3.2, is subsequently fitted based on \{\bm{x}_{i}\}_{i=1}^{N_{\pi}} and is then used as the ISD as follows:

Cπ=𝒳π~𝑿(𝒙|y)Q(𝒙)Q(𝒙)𝑑𝒙=𝔼Q[π~𝑿(𝒙|y)Q(𝑿)]C_{\pi}=\int_{\mathcal{X}}\dfrac{\tilde{\pi}_{\bm{X}}(\bm{x}|y)}{Q^{\prime}(\bm{x})}Q^{\prime}(\bm{x})d\bm{x}=\mathbb{E}_{Q^{\prime}}\big{[}\dfrac{\tilde{\pi}_{\bm{X}}(\bm{x}|y)}{Q^{\prime}(\bm{X})}] (53)

By drawing an i.i.d sample set {𝒙i}i=1Mπ\{\bm{x}_{i}^{\prime}\}_{i=1}^{M_{\pi}} from QQ^{\prime}, C^π\hat{C}_{\pi} can be eventually computed as:

\hat{C}_{\pi}=\dfrac{1}{M_{\pi}}\sum_{i=1}^{M_{\pi}}\dfrac{\tilde{\pi}_{\bm{X}}(\bm{x}_{i}^{\prime}|y)}{Q^{\prime}(\bm{x}_{i}^{\prime})} (54)
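This procedure can be sketched compactly as below, again using scikit-learn's GaussianMixture as one possible GMM implementation; for the extremely small constants reported later, a log-sum-exp formulation would be used in practice to avoid numerical underflow.

import numpy as np
from sklearn.mixture import GaussianMixture

def estimate_normalizing_constant(posterior_samples, log_post_unnorm,
                                  n_iis, n_components=1, seed=0):
    # Fit a GMM Q' on samples from the unnormalized posterior, draw i.i.d.
    # samples from it, and average pi_tilde / Q' over them (Eqs. 53-54)
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          random_state=seed).fit(posterior_samples)
    x_iis, _ = gmm.sample(n_iis)
    log_q = gmm.score_samples(x_iis)                          # log Q'(x'_i)
    log_post = np.array([log_post_unnorm(x) for x in x_iis])  # log pi_tilde(x'_i | y)
    return np.mean(np.exp(log_post - log_q))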

Figure 8: Example 5: Outlining the normalizing constant, CπC_{\pi}, estimation in Section 8.5.1; (a) Ring-shaped distribution π~𝑿\tilde{\pi}_{\bm{X}}, (b) NπN_{\pi} samples from π~𝑿\tilde{\pi}_{\bm{X}}, (c) a GMM Q(𝑿)Q^{\prime}(\bm{X}) fitted based on the π~𝑿\tilde{\pi}_{\bm{X}} samples, and (d) MπM_{\pi} GMM samples utilized to compute CπC_{\pi} in Eq. 54.
Table 5: Example 5: The estimated normalizing constant CπC_{\pi} for the ring-shaped distribution
50 Simulations d=2d=2 d=50d=50 d=150d=150 d=500d=500
NπTotalN_{\pi}^{Total} 11,000 11,000 11,000 20,000
C.o.V 0.01 0.01 0.02 0.03
𝔼[C^π]\mathop{\mathbb{E}}[\hat{C}_{\pi}] 3.43E-19 3.26E-21 1.05E-50 6.41E-210

Fig. 8 showcases this procedure for the 2-dimensional case. Table 5 shows the estimated C_{\pi} for dimensions d=2, 50, 150, and 500, based on 50 independent simulations. We report the mean estimates of the normalizing constant, \mathbb{E}[\hat{C}_{\pi}], the Coefficient of Variation (C.o.V) of \hat{C}_{\pi}, and the total number of \tilde{\pi}_{\bm{X}}(\bm{x}_{i}^{\prime}|y) evaluations, N_{\pi}^{Total}=N_{\pi}+M_{\pi}. It should be noted that estimating C_{\pi} does not require any (expensive) limit-state function evaluations. Therefore, the focus is on obtaining accurate estimates characterized by a low C.o.V, regardless of the number of samples. We thus did not compare \mathbb{E}[\hat{C}_{\pi}] with other methods; its accuracy is instead verified through the subsequently computed probabilities.

8.5.2 Target probability estimation

The following quadratic limit-state function is then considered:

g(𝑿)=r2(X12)2i=2dXi2g(\bm{X})=r^{2}-(X_{1}-2)^{2}-\sum_{i=2}^{d}X_{i}^{2} (55)

where the threshold rr determines the level of the rare event probability. Fig. 9 depicts the ringed-shaped distribution π𝑿(𝑿|y)\pi_{\bm{X}}(\bm{X}|y), the approximate sampling target h~(𝑿)\tilde{h}(\bm{X}), its samples, and Adam’s steps, for d=2d=2.

Figure 9: Example 5: (a) Ring-shaped distribution plotted in red; the grey line represents the limit-state surface at g(𝑿)=0g(\bm{X})=0, (b) The approximate sampling target h~\tilde{h} in ASTPA, (c) Adam’s steps plotted in blue; grey samples drawn from hh using QNp-HMCMC.
Table 6: Example 5: Performance of various methods for the high dimensional problem
with the ring-shaped distribution
Eq. 55 100 Simulations MCMC ASTPA (σ=0.3, q=10) CWMH-SuS
HMCMC QNp-HMCMC
d=2d=2 r=3.8r=3.8 𝔼[NTotal]\mathop{\mathbb{E}}[N_{Total}] 1.00E7 1,639 1,639 9,812
C.o.V 0.20 0.07 (0.16) 0.09 (0.10) 2.05
𝔼[p^]\mathop{\mathbb{E}}[\hat{p}_{\mathcal{F}}] 3.38E-5 3.48E-5 3.45E-5 3.34E-5
d=50d=50 r=3.4r=3.4 𝔼[NTotal]\mathop{\mathbb{E}}[N_{Total}] 1.00E7 3,156 3,156 18,724
C.o.V 0.09 0.20 (0.22) 0.15 (0.17) 0.73
𝔼[p^]\mathop{\mathbb{E}}[\hat{p}_{\mathcal{F}}] 2.32E-5 2.36E-5 2.29E-5 3.53E-5
d=150d=150 r=3.4r=3.4 𝔼[NTotal]\mathop{\mathbb{E}}[N_{Total}] 1.00E6 5,056 5,056 17,330
C.o.V 0.22 0.14 (0.16) 0.17 (0.17) 0.72
𝔼[p^]\mathop{\mathbb{E}}[\hat{p}_{\mathcal{F}}] 6.78E-5 6.91E-5 6.85E-5 9.26E-5
d=150d=150 r=3.6r=3.6 𝔼[NTotal]\mathop{\mathbb{E}}[N_{Total}] 4,998 4,998 32,476
C.o.V 0.10 (0.15) 0.12 (0.16) 2.25
𝔼[p^]\mathop{\mathbb{E}}[\hat{p}_{\mathcal{F}}] 1.65E-8 1.69E-8 1.54E-8
d=500d=500 r=3.6r=3.6 𝔼[NTotal]\mathop{\mathbb{E}}[N_{Total}] 1.00E5 6,597 6,597 11,200
C.o.V 0.10 0.18 (0.17) 0.18 (0.16) 0.35
𝔼[p^]\mathop{\mathbb{E}}[\hat{p}_{\mathcal{F}}] 2.90E-3 2.89E-3 2.88E-3 3.30E-3
d=500d=500 r=3.8r=3.8 𝔼[NTotal]\mathop{\mathbb{E}}[N_{Total}] 6,639 6,639 27,976
C.o.V 0.23 (0.35) 0.23 (0.35) 5.49
𝔼[p^]\mathop{\mathbb{E}}[\hat{p}_{\mathcal{F}}] 0.99E-7 1.00E-7 4.99E-7

The presented results in Table 6 demonstrate the superiority of the proposed framework for dimensions up to 500500. The MCMC term in Table 6 denotes utilizing the standard Monte Carlo approach with samples generated using an MCMC algorithm (HMCMC) since direct sampling is not applicable for the considered posterior distribution π𝑿(𝑿|y)\pi_{\bm{X}}(\bm{X}|y). However, we could not afford these standard MCMC computations for two cases with d=150d=150 and 500500, in which the target probability is extremely low (108)\sim 10^{-8}). Comparisons with MCMC results show the accuracy of the mean estimate 𝔼[p^]\mathop{\mathbb{E}}[\hat{p}_{\mathcal{F}}]. That also implies the accuracy of the computed normalizing constant CπC_{\pi} in Table 5. Further, ASTPA has significantly outperformed CWMH-SuS, which utilized HMCMC to generate the first set. We could not access advanced SuS variants capable of efficiently operating in such high-dimensional complex scenarios. Here, similar performance is observed for HMCMC and QNp-HMCMC, which utilized diagonal inverse Hessian and mass matrices for d=500d=500, as it yielded more stable performance. It is also evident that the analytical CoV demonstrates good agreement with the sampling C.o.V.

The target probability in this example is referred to in the literature as a posteriori rare event probability [57]. The ASTPA estimator, in this case, can be expressed as

p=p~Ch=ChCπ𝔼h[I(𝑿)π~𝑿(𝑿)h~(𝑿)]=p~~ChCπp_{\mathcal{F}}=\tilde{p}_{\mathcal{F}}\,\,C_{h}=\dfrac{C_{h}}{C_{\pi}}\,{\mathop{\mathbb{E}}}_{h}\big{[}I_{\mathcal{F}}(\bm{X})\dfrac{\tilde{\pi}_{\bm{X}}(\bm{X})}{\tilde{h}(\bm{X})}\big{]}=\dfrac{\tilde{\tilde{p}}_{\mathcal{F}}\,\,C_{h}}{C_{\pi}} (56)

This estimate is known to be theoretically biased due to the ratio estimator involved, even if \tilde{\tilde{p}}_{\mathcal{F}}, C_{\pi}, and C_{h} are independent. However, in [57], where the ISD h is normalized, i.e., C_{h} is dropped, the authors proved theoretically that this bias is negligible if h and Q^{\prime} closely resemble I_{\mathcal{F}}\pi_{\bm{X}} and \pi_{\bm{X}}, respectively. That is true in our proposed framework since, in fact, h is a smoothed version of I_{\mathcal{F}}\pi_{\bm{X}}, and Q^{\prime} is fitted on samples from \pi_{\bm{X}}. Notably, the self-normalized importance sampling scheme with h=Q^{\prime} cannot be applied given the significant difference between I_{\mathcal{F}}\pi_{\bm{X}} and \pi_{\bm{X}} in our context; see the difference between Fig. 8(c) and Fig. 9(b).

9 Conclusions

A novel framework for accurately and efficiently estimating rare event probabilities directly in complex non-Gaussian spaces is presented in this paper, suitable for low- and high-dimensional problems and very small probabilities, building upon our foundational Approximate Sampling Target with Post-processing Adjustment (ASTPA) approach [35]. An unnormalized sampling target is first constructed and sampled, relaxing the optimal importance sampling distribution, and appropriately designed for non-Gaussian spaces. The obtained samples are subsequently utilized not only to compute a shifted estimate of the sought probability but also to guide the second stage in ASTPA. Post-sampling, the sampling target’s normalizing constant is estimated using a stable inverse importance sampling procedure, which employs an importance sampling density based on the already available samples. As shown and discussed in the paper, a very approximately fitted Gaussian Mixture Model (GMM), even using one component with a diagonal covariance matrix, is adequate at this step, thus avoiding the known scalability issues of GMMs, as also showcased by the high-dimensional numerical examples (with dimensions up to 500) presented in this work. The target probability is eventually computed based on the estimates in these two stages. We employed our developed Quasi-Newton mass preconditioned Hamiltonian MCMC (QNp-HMCMC) algorithm to sample the constructed target, achieving outstanding performance in sampling high-dimensional, complex distributions.

In comparison to our relevant framework for estimating rare event probabilities in Gaussian spaces in [35], this paper introduces significant enhancements and novel contributions to enable direct applicability in non-Gaussian spaces. We have first adjusted the recommended values for the parameters of the ASTPA sampling target to suit non-Gaussian spaces. The inverse importance sampling technique has also been advanced through a stable implementation approach, making it robust against possible anomalous samples. The proposed estimator is further theoretically analyzed in this work, proving its unbiasedness and deriving its analytical coefficient of variation (C.o.V). The analytical C.o.V expression, accompanied by an applied implementation technique based on the effective sample size, demonstrates accurate agreement with empirical results. Further, our developed Quasi-Newton mass preconditioned Hamiltonian MCMC (QNp-HMCMC) is theoretically reinforced in this work, proving that it converges to the correct stationary target distribution. To avoid the challenging task of tuning the trajectory length in general, complex, non-Gaussian spaces, a variant of QNp-HMCMC based on a single-step integration is effectively utilized in this manuscript, and we show its equivalence to an original and efficient preconditioned Metropolis-adjusted Langevin algorithm (MALA). An optimization approach based on the Adam optimizer is also devised to initiate QNp-HMCMC effectively. The utilization of the proposed framework in bounded spaces is also discussed and examined.

We have demonstrated the notable performance of our developed methodology through comparisons with the state-of-the-art Subset Simulation, and related variants, in a series of diverse, low- and high-dimensional non-Gaussian problems. Our methodology sets new efficiency benchmarks on existing and new, challenging non-Gaussian problems, including high-dimensional Neal’s funnel distribution and two- and three-dimensional Rosenbrock distributions, as well as a high-dimensional, highly correlated Gumbel distribution. In addition, we have successfully applied our framework to scenarios involving unnormalized probability distributions, such as those often appearing in Bayesian contexts, where the target probability is related to a posteriori rare events.

Since gradient-based sampling methods are utilized in this work, all of our results and conclusions assume that analytical gradients can be computed. Some of our ongoing and future work is further directed toward exploring various ASTPA variants. This includes devising effective discovery schemes for challenging multi-modal rare event domains involving distant modes, developing a related non-gradient-based sampling framework, and addressing high-dimensional first-passage problems under various settings [91].

Acknowledgements

The authors would like to thank Prof. Dr. Daniel Straub, Dr. Iason Papaioannou, and Dr. Max Ehre at the Technical University of Munich, for very fruitful scientific discussions. This material is based upon work partially supported by the U.S. National Science Foundation under CAREER Grant No. 1751941.

References

  • [1] O. Ditlevsen and H. O. Madsen, Structural Reliability Methods. Department of Mechanical Engineering, Technical University of Denmark, 2007.
  • [2] E. Nikolaidis, D. Ghiocel, and S. Singhal, Engineering Design Reliability Handbook. CRC press, 2005.
  • [3] M. Lemaire, A. Chateauneuf, and J. Mitteau, Structural Reliability. Wiley, 2009.
  • [4] S.-K. Au and Y. Wang, Engineering Risk Assessment with Subset Simulation. Wiley, 2014.
  • [5] R. E. Melchers and A. T. Beck, Structural Reliability Analysis and Prediction. Wiley, 2018.
  • [6] A. Der Kiureghian, Structural and system reliability. Cambridge University Press, 2022.
  • [7] R. Rackwitz, “Reliability analysis—a review and some perspectives,” Structural Safety, vol. 23, no. 4, pp. 365–395, 2001.
  • [8] K. Breitung, “40 years FORM: Some new aspects?,” Probabilistic Engineering Mechanics, vol. 42, pp. 71–77, 2015.
  • [9] K. Breitung, “Asymptotic approximations for multinormal integrals,” Journal of Engineering Mechanics, vol. 110, no. 3, pp. 357–366, 1984.
  • [10] S. Asmussen and P. W. Glynn, Stochastic simulation: algorithms and analysis, vol. 57. Springer, 2007.
  • [11] A. Dembo and O. Zeitouni, Large deviations techniques and applications, vol. 38. Springer Science & Business Media, 2009.
  • [12] T. Grafke and E. Vanden-Eijnden, “Numerical computation of rare events via large deviation theory,” Chaos: An Interdisciplinary Journal of Nonlinear Science, vol. 29, no. 6, 2019.
  • [13] K. W. Breitung, Asymptotic approximations for probability integrals. Springer, 2006.
  • [14] M. Valdebenito, H. Pradlwarter, and G. Schuëller, “The role of the design point for calculating failure probabilities in view of dimensionality and structural nonlinearities,” Structural Safety, vol. 32, no. 2, pp. 101–111, 2010.
  • [15] Y. Bai, A. B. Dieker, and H. Lam, “Curse of dimensionality in rare-event simulation,” in 2023 Winter Simulation Conference (WSC), pp. 375–384, IEEE, 2023.
  • [16] S.-K. Au and J. L. Beck, “Estimation of small failure probabilities in high dimensions by Subset Simulation,” Probabilistic Engineering Mechanics, vol. 16, no. 4, pp. 263–277, 2001.
  • [17] K. M. Zuev and L. S. Katafygiotis, “Modified Metropolis–Hastings algorithm with delayed rejection,” Probabilistic Engineering Mechanics, vol. 26, no. 3, pp. 405–412, 2011.
  • [18] K. M. Zuev, J. L. Beck, S.-K. Au, and L. S. Katafygiotis, “Bayesian post-processor and other enhancements of Subset Simulation for estimating failure probabilities in high dimensions,” Computers & Structures, vol. 92, pp. 283–296, 2012.
  • [19] S.-K. Au and E. Patelli, “Rare event simulation in finite-infinite dimensional space,” Reliability Engineering & System Safety, vol. 148, pp. 67–77, 2016.
  • [20] I. Papaioannou, W. Betz, K. Zwirglmaier, and D. Straub, “MCMC algorithms for Subset Simulation,” Probabilistic Engineering Mechanics, vol. 41, pp. 89–103, 2015.
  • [21] K. Breitung, “The geometry of limit state function graphs and subset simulation: Counterexamples,” Reliability Engineering & System Safety, vol. 182, pp. 98–106, 2019.
  • [22] R. M. Neal, “MCMC using Hamiltonian dynamics,” arXiv preprint arXiv:1206.1901, 2012.
  • [23] Z. Wang, M. Broccardo, and J. Song, “Hamiltonian Monte Carlo methods for Subset Simulation in reliability analysis,” Structural Safety, vol. 76, pp. 51–67, 2019.
  • [24] M. Girolami and B. Calderhead, “Riemann manifold Langevin and Hamiltonian Monte Carlo methods,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 73, no. 2, pp. 123–214, 2011.
  • [25] W. Chen, Z. Wang, M. Broccardo, and J. Song, “Riemannian Manifold Hamiltonian Monte Carlo based subset simulation for reliability analysis in non-Gaussian space,” Structural Safety, vol. 94, p. 102134, 2022.
  • [26] D. Thaler, S. L. Dhulipala, F. Bamer, B. Markert, and M. D. Shields, “Reliability analysis of complex systems using subset simulations with Hamiltonian Neural Networks,” arXiv preprint arXiv:2401.05244, 2024.
  • [27] J. Goodman and J. Weare, “Ensemble samplers with affine invariance,” Communications in Applied Mathematics and Computational Science, vol. 5, no. 1, pp. 65–80, 2010.
  • [28] M. D. Shields, D. G. Giovanis, and V. Sundar, “Subset simulation for problems with strongly non-Gaussian, highly anisotropic, and degenerate distributions,” Computers & Structures, vol. 245, p. 106431, 2021.
  • [29] A. Guyader, N. Hengartner, and E. Matzner-Løber, “Simulation and estimation of extreme quantiles and extreme probabilities,” Applied Mathematics & Optimization, vol. 64, no. 2, pp. 171–196, 2011.
  • [30] F. Cérou, P. Del Moral, T. Furon, and A. Guyader, “Sequential Monte Carlo for rare event estimation,” Statistics and Computing, vol. 22, no. 3, pp. 795–808, 2012.
  • [31] C. Walter, “Moving particles: A parallel optimal multilevel splitting method with application in quantiles estimation and meta-model based algorithms,” Structural Safety, vol. 55, pp. 10–25, 2015.
  • [32] X. Huang, J. Chen, and H. Zhu, “Assessing small failure probabilities by AK–SS: An active learning method combining kriging and subset simulation,” Structural Safety, vol. 59, pp. 86–95, 2016.
  • [33] S. Marelli and B. Sudret, “An active-learning algorithm that combines sparse polynomial chaos expansions and bootstrap for structural reliability analysis,” Structural Safety, vol. 75, pp. 67–74, 2018.
  • [34] M. Moustapha, S. Marelli, and B. Sudret, “Active learning for structural reliability: Survey, general framework and benchmark,” Structural Safety, vol. 96, p. 102174, 2022.
  • [35] K. G. Papakonstantinou, H. Nikbakht, and E. Eshra, “Hamiltonian MCMC methods for estimating rare events probabilities in high-dimensional problems,” Probabilistic Engineering Mechanics, vol. 74, p. 103485, 2023.
  • [36] M. Hoffman, A. Radul, and P. Sountsov, “An adaptive-MCMC scheme for setting trajectory lengths in Hamiltonian Monte Carlo,” in International Conference on Artificial Intelligence and Statistics, pp. 3907–3915, PMLR, 2021.
  • [37] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth, “Hybrid Monte Carlo,” Physics Letters B, vol. 195, no. 2, pp. 216–222, 1987.
  • [38] Y. Zhang and C. A. Sutton, “Quasi-Newton methods for Markov Chain Monte Carlo,” in Advances in Neural Information Processing Systems, pp. 2393–2401, 2011.
  • [39] T. Fu, L. Luo, and Z. Zhang, “Quasi-Newton Hamiltonian Monte Carlo,” in Uncertainty in Artificial Intelligence, 2016.
  • [40] Y.-A. Ma, T. Chen, and E. Fox, “A complete recipe for stochastic gradient MCMC,” in Advances in Neural Information Processing Systems, pp. 2917–2925, 2015.
  • [41] M. D. Hoffman and A. Gelman, “The No-U-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1593–1623, 2014.
  • [42] Y. Nesterov, “Primal-dual subgradient methods for convex problems,” Mathematical Programming, vol. 120, no. 1, pp. 221–259, 2009.
  • [43] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [44] G. L. Ang, A. H.-S. Ang, and W. H. Tang, “Optimal importance-sampling density estimator,” Journal of Engineering Mechanics, vol. 118, no. 6, pp. 1146–1163, 1992.
  • [45] G. I. Schuëller and R. Stix, “A critical appraisal of methods to determine failure probabilities,” Structural Safety, vol. 4, no. 4, pp. 293–309, 1987.
  • [46] N. Kurtz and J. Song, “Cross-entropy-based adaptive importance sampling using Gaussian mixture,” Structural Safety, vol. 42, pp. 35–44, 2013.
  • [47] Z. Wang and J. Song, “Cross-entropy-based adaptive importance sampling using von Mises-Fisher mixture for high dimensional reliability analysis,” Structural Safety, vol. 59, pp. 42–52, 2016.
  • [48] I. Papaioannou, S. Geyer, and D. Straub, “Improved cross entropy-based importance sampling with a flexible mixture model,” Reliability Engineering & System Safety, vol. 191, p. 106564, 2019.
  • [49] F. Uribe, I. Papaioannou, Y. M. Marzouk, and D. Straub, “Cross-entropy-based importance sampling with failure-informed dimension reduction for rare event simulation,” SIAM/ASA Journal on Uncertainty Quantification, vol. 9, no. 2, pp. 818–847, 2021.
  • [50] J. Demange-Chryst, F. Bachoc, J. Morio, and T. Krauth, “Variational autoencoder with weighted samples for high-dimensional non-parametric adaptive importance sampling,” arXiv preprint arXiv:2310.09194, 2023.
  • [51] S. Tong and G. Stadler, “Large deviation theory-based adaptive importance sampling for rare events in high dimensions,” SIAM/ASA Journal on Uncertainty Quantification, vol. 11, no. 3, pp. 788–813, 2023.
  • [52] M. Chiron, C. Genest, J. Morio, and S. Dubreuil, “Failure probability estimation through high-dimensional elliptical distribution modeling with multiple importance sampling,” Reliability Engineering & System Safety, vol. 235, p. 109238, 2023.
  • [53] S.-K. Au and J. L. Beck, “A new adaptive importance sampling scheme for reliability calculations,” Structural Safety, vol. 21, no. 2, pp. 135–158, 1999.
  • [54] S.-K. Au and J. L. Beck, “Important sampling in high dimensions,” Structural Safety, vol. 25, no. 2, pp. 139–163, 2003.
  • [55] A. Tabandeh, G. Jia, and P. Gardoni, “A review and assessment of importance sampling methods for reliability analysis,” Structural Safety, vol. 97, p. 102216, 2022.
  • [56] M. Ehre, I. Papaioannou, and D. Straub, “Stein variational rare event simulation,” arXiv preprint arXiv:2308.04971, 2023.
  • [57] T. Cui, S. Dolgov, and R. Scheichl, “Deep importance sampling using tensor trains with application to a priori and a posteriori rare events,” SIAM Journal on Scientific Computing, vol. 46, no. 1, pp. C1–C29, 2024.
  • [58] P. Del Moral, A. Doucet, and A. Jasra, “Sequential Monte Carlo samplers,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, no. 3, pp. 411–436, 2006.
  • [59] I. Papaioannou, C. Papadimitriou, and D. Straub, “Sequential importance sampling for structural reliability analysis,” Structural Safety, vol. 62, pp. 66–75, 2016.
  • [60] C. H. Bennett, “Efficient estimation of free energy differences from Monte Carlo data,” Journal of Computational Physics, vol. 22, no. 2, pp. 245–268, 1976.
  • [61] X.-L. Meng and W. H. Wong, “Simulating ratios of normalizing constants via a simple identity: a theoretical exploration,” Statistica Sinica, pp. 831–860, 1996.
  • [62] A. Sinha, M. O’Kelly, R. Tedrake, and J. C. Duchi, “Neural bridge sampling for evaluating safety-critical autonomous systems,” Advances in Neural Information Processing Systems, vol. 33, pp. 6402–6416, 2020.
  • [63] A. Gelman and X.-L. Meng, “Simulating normalizing constants: From importance sampling to bridge sampling to path sampling,” Statistical Science, pp. 163–185, 1998.
  • [64] A. M. Johansen, P. Del Moral, and A. Doucet, “Sequential Monte Carlo samplers for rare events,” in 6th International Workshop on Rare Event Simulation, pp. 256–267, 2006.
  • [65] J. Xian and Z. Wang, “Relaxation-based importance sampling for structural reliability analysis,” Structural Safety, vol. 106, p. 102393, 2024.
  • [66] G. McLachlan and D. Peel, Finite mixture models. Wiley, 2000.
  • [67] J. V. Dillon, I. Langmore, D. Tran, E. Brevdo, S. Vasudevan, D. Moore, B. Patton, A. Alemi, M. Hoffman, and R. A. Saurous, “Tensorflow distributions,” arXiv preprint arXiv:1711.10604, 2017.
  • [68] B. J. Alder and T. E. Wainwright, “Studies in molecular dynamics. I. General method,” The Journal of Chemical Physics, vol. 31, no. 2, pp. 459–466, 1959.
  • [69] M. Betancourt, “A conceptual introduction to Hamiltonian Monte Carlo,” arXiv preprint arXiv:1701.02434, 2017.
  • [70] M. Hirt, M. Titsias, and P. Dellaportas, “Entropy-based adaptive Hamiltonian Monte Carlo,” Advances in Neural Information Processing Systems, vol. 34, pp. 28482–28495, 2021.
  • [71] Y. Qi and T. P. Minka, “Hessian-based Markov chain Monte-Carlo algorithms,” in First Cape Cod Workshop on Monte Carlo Methods, vol. 2, Cape Cod Massachusetts, USA, 2002.
  • [72] J. Martin, L. C. Wilcox, C. Burstedde, and O. Ghattas, “A stochastic Newton MCMC method for large-scale statistical inverse problems with application to seismic inversion,” SIAM Journal on Scientific Computing, vol. 34, no. 3, pp. A1460–A1487, 2012.
  • [73] U. Simsekli, R. Badeau, T. Cemgil, and G. Richard, “Stochastic quasi-Newton Langevin Monte Carlo,” in International Conference on Machine Learning, pp. 642–651, PMLR, 2016.
  • [74] B. Leimkuhler, C. Matthews, and J. Weare, “Ensemble preconditioning for Markov chain Monte Carlo simulation,” Statistics and Computing, vol. 28, pp. 277–290, 2018.
  • [75] K. W. Brodlie, A. Gourlay, and J. Greenstadt, “Rank-one and rank-two corrections to positive definite matrices expressed in product form,” IMA Journal of Applied Mathematics, vol. 11, no. 1, pp. 73–82, 1973.
  • [76] W. R. Gilks, G. O. Roberts, and E. I. George, “Adaptive direction sampling,” Journal of the Royal Statistical Society: Series D (The Statistician), vol. 43, no. 1, pp. 179–189, 1994.
  • [77] J. Nocedal and S. J. Wright, Numerical Optimization. Springer, 2006.
  • [78] A. Beskos, N. Pillai, G. Roberts, J.-M. Sanz-Serna, and A. Stuart, “Optimal tuning of the hybrid Monte Carlo algorithm,” Bernoulli, vol. 19, no. 5A, pp. 1501–1534, 2013.
  • [79] C. Pasarica and A. Gelman, “Adaptively scaling the Metropolis algorithm using expected squared jumped distance,” Statistica Sinica, pp. 343–364, 2010.
  • [80] Z. Wang, S. Mohamed, and N. Freitas, “Adaptive Hamiltonian and Riemann manifold Monte Carlo,” in International Conference on Machine Learning, pp. 1462–1470, 2013.
  • [81] C. Wu, J. Stoehr, and C. P. Robert, “Faster Hamiltonian Monte Carlo by learning leapfrog scale,” arXiv preprint arXiv:1810.04449, 2018.
  • [82] Y. Chen, R. Dwivedi, M. J. Wainwright, and B. Yu, “Fast mixing of Metropolized Hamiltonian Monte Carlo: Benefits of multi-step gradients,” The Journal of Machine Learning Research, vol. 21, no. 1, pp. 3647–3717, 2020.
  • [83] M. D. Hoffman and P. Sountsov, “Tuning-free generalized Hamiltonian Monte Carlo,” in International Conference on Artificial Intelligence and Statistics, pp. 7799–7813, PMLR, 2022.
  • [84] R. M. Neal, “Non-reversibly updating a uniform [0, 1] value for Metropolis accept/reject decisions,” arXiv preprint arXiv:2001.11950, 2020.
  • [85] G. O. Roberts and O. Stramer, “Langevin diffusions and Metropolis-Hastings algorithms,” Methodology and Computing in Applied Probability, vol. 4, pp. 337–357, 2002.
  • [86] K. G. Papakonstantinou, H. Nikbakht, and E. Eshra, “Quasi-Newton Hamiltonian MCMC sampling for reliability estimation in high-dimensional non-Gaussian spaces,” in 13th International Conference on Structural Safety & Reliability (ICOSSAR), Shanghai, China, 2022.
  • [87] A. Nataf, “Détermination des distributions dont les marges sont données,” Comptes Rendus de l’Académie des Sciences, vol. 225, pp. 42–43, 1962.
  • [88] R. Lebrun and A. Dutfoy, “An innovating analysis of the Nataf transformation from the copula viewpoint,” Probabilistic Engineering Mechanics, vol. 24, no. 3, pp. 312–320, 2009.
  • [89] R. M. Neal, “Slice sampling,” The Annals of Statistics, vol. 31, no. 3, pp. 705–767, 2003.
  • [90] F. Pagani, M. Wiegand, and S. Nadarajah, “An n-dimensional Rosenbrock distribution for MCMC testing,” arXiv preprint arXiv:1903.09556, 2019.
  • [91] K. G. Papakonstantinou, E. Eshra, and H. Nikbakht, “Hamiltonian MCMC based framework for time-variant rare event uncertainty quantification,” in 14th International Conference on Applications of Statistics and Probability in Civil Engineering (ICASP), Dublin, Ireland, 2023.

APPENDIX

Appendix A Connections between the single-step QNp-HMCMC implementation and the preconditioned Metropolis-adjusted Langevin algorithm (MALA)

Proof of Lemma 1.

First, HMCMC with single-step implementation can be written based on Eq. 32 as:

\displaystyle\tilde{\bm{x}}=\bm{x}\textsubscript{t}+\varepsilon\,\mathbf{M}^{-1}\Big(\bm{z}\textsubscript{t}+\dfrac{\varepsilon}{2}\nabla_{\bm{X}}\mathcal{L}(\bm{x}\textsubscript{t})\Big)=\bm{x}\textsubscript{t}+\dfrac{\varepsilon^{2}}{2}\,\mathbf{M}^{-1}\nabla_{\bm{X}}\mathcal{L}(\bm{x}\textsubscript{t})+\varepsilon\,\mathbf{M}^{-1}\bm{z}\textsubscript{t}    (57)

Since 𝒁𝒩(0,𝐌)\bm{Z}\sim\mathcal{N}(\textbf{{0}},\mathbf{M}), it can be expressed as 𝒁=𝐌1/2𝒁\bm{Z}=\mathbf{M}^{1/2}\bm{Z}^{\prime} where 𝒁𝒩(0,𝐈)\bm{Z}^{\prime}\sim\mathcal{N}(\textbf{{0}},\mathbf{I}), and 𝐌1/2\mathbf{M}^{1/2} is computed using Cholesky decomposition, 𝐌=𝐌1/2(𝐌1/2)T\mathbf{M}=\mathbf{M}^{1/2}(\mathbf{M}^{1/2})^{T}. Eq. 57 can then be written as:

\displaystyle\tilde{\bm{x}}=\bm{x}\textsubscript{t}+\dfrac{\varepsilon^{2}}{2}\,\mathbf{M}^{-1}\nabla_{\bm{X}}\mathcal{L}(\bm{x}\textsubscript{t})+\varepsilon\,\mathbf{M}^{-1}\mathbf{M}^{1/2}\bm{z}^{\prime}_{\text{t}}=\bm{x}\textsubscript{t}+\dfrac{\varepsilon^{2}}{2}\,\mathbf{M}^{-1}\nabla_{\bm{X}}\mathcal{L}(\bm{x}\textsubscript{t})+\varepsilon\,(\mathbf{M}^{-1/2})^{T}\bm{z}^{\prime}_{\text{t}}    (58)

The right side is simplified using the Cholesky decomposition \mathbf{M}^{-1}=\big(\mathbf{M}^{1/2}(\mathbf{M}^{1/2})^{T}\big)^{-1}=(\mathbf{M}^{-1/2})^{T}\mathbf{M}^{-1/2}, and then \mathbf{M}^{-1}\mathbf{M}^{1/2}=(\mathbf{M}^{-1/2})^{T}\mathbf{M}^{-1/2}\mathbf{M}^{1/2}=(\mathbf{M}^{-1/2})^{T}. Comparing Eq. 58 with Eq. 40, it is clear that HMCMC with a single-step implementation is equivalent to a preconditioned Langevin Monte Carlo (LMC) whose preconditioning matrix equals the inverse of the mass matrix, \mathbf{A}=\mathbf{M}^{-1}.

The connection between the skew-symmetric preconditioned dynamics in the burn-in stage of single-step QNp-HMCMC and the preconditioned LMC can be studied similarly. The adaptive scheme in the single-step QNp-HMCMC can be written based on Eq. 39 as:

\displaystyle\tilde{\bm{x}}=\bm{x}\textsubscript{t}+\varepsilon\,\mathbf{W}_{\text{t}}\mathbf{M}^{-1}\Big(\bm{z}\textsubscript{t}+\dfrac{\varepsilon}{2}\mathbf{W}_{\text{t}}\nabla_{\bm{X}}\mathcal{L}(\bm{x}\textsubscript{t})\Big)=\bm{x}\textsubscript{t}+\dfrac{\varepsilon^{2}}{2}\,\mathbf{W}_{\text{t}}\mathbf{M}^{-1}\mathbf{W}_{\text{t}}\nabla_{\bm{X}}\mathcal{L}(\bm{x}\textsubscript{t})+\varepsilon\,\mathbf{W}_{\text{t}}\mathbf{M}^{-1}\bm{z}\textsubscript{t}    (59)
=\bm{x}\textsubscript{t}+\dfrac{\varepsilon^{2}}{2}\,\mathbf{W}_{\text{t}}\mathbf{M}^{-1}\mathbf{W}_{\text{t}}\nabla_{\bm{X}}\mathcal{L}(\bm{x}\textsubscript{t})+\varepsilon\,\mathbf{W}_{\text{t}}(\mathbf{M}^{-1/2})^{T}\bm{z}^{\prime}_{\text{t}}

Comparing Eq. 59 and Eq. 40 shows that the burn-in stage of QNp-HMCMC with a single-step implementation is equivalent to a preconditioned LMC with a preconditioning matrix \mathbf{A}_{\text{t}}=\mathbf{W}_{\text{t}}\mathbf{M}^{-1}\mathbf{W}_{\text{t}}, where \mathbf{M}=\mathbf{I} in this work. ∎
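As a numerical illustration of Lemma 1, the minimal Python sketch below (using a hypothetical quadratic log-density and arbitrarily chosen diagonal M and W_t, all placeholders) verifies that the single-step (QNp-)HMCMC position update coincides with the preconditioned LMC proposal of Eq. 40 with A_t = W_t M^{-1} W_t; setting W_t = I recovers the plain HMCMC case with A = M^{-1}.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Hypothetical toy log-density L(x) = -0.5 x^T S^{-1} x, so grad L(x) = -S^{-1} x
S = np.diag(rng.uniform(0.5, 2.0, d))
grad_L = lambda x: -np.linalg.solve(S, x)

eps = 0.1
M = np.diag(rng.uniform(0.5, 2.0, d))          # SPD mass matrix
Minv = np.linalg.inv(M)
W = np.diag(rng.uniform(0.5, 2.0, d))          # symmetric preconditioner W_t (W = I gives plain HMCMC)

x_t = rng.standard_normal(d)
z_t = np.linalg.cholesky(M) @ rng.standard_normal(d)   # momentum z ~ N(0, M)

# single-step QNp-HMCMC position update (Eq. 39; W = I reduces it to Eq. 32)
z_half = z_t + 0.5 * eps * W @ grad_L(x_t)
x_hmcmc = x_t + eps * W @ Minv @ z_half

# preconditioned LMC proposal (Eq. 40) with A = W M^{-1} W and the matched noise term,
# eps * W * M^{-1} * z_t, whose covariance is eps^2 * A
A = W @ Minv @ W
x_lmc = x_t + 0.5 * eps**2 * A @ grad_L(x_t) + eps * W @ Minv @ z_t

print(np.allclose(x_hmcmc, x_lmc))             # True: the two proposals coincide
```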

Proof of Lemma 2.

The proposal of the Langevin Monte Carlo in Eq. 40 can be viewed as a Metropolis-Hastings update in which \tilde{\bm{x}} is proposed from a Gaussian distribution, \mathcal{N}\big(\bm{x}\textsubscript{t}+\dfrac{\varepsilon^{2}}{2}\,\mathbf{A}\nabla_{\bm{X}}\mathcal{L}(\bm{x}\textsubscript{t}),\,\varepsilon^{2}\mathbf{A}\big), and the acceptance probability of the Metropolis-adjusted Langevin algorithm (MALA) is then written as:

\displaystyle\alpha=\min\bigg\{1,\dfrac{\exp\big(\mathcal{L}(\tilde{\bm{x}})\big)\,\mathcal{N}\big(\bm{x}\textsubscript{t}\,\big|\,\tilde{\bm{x}}+\dfrac{\varepsilon^{2}}{2}\,\mathbf{A}\nabla_{\bm{X}}\mathcal{L}(\tilde{\bm{x}}),\,\varepsilon^{2}\mathbf{A}\big)}{\exp\big(\mathcal{L}(\bm{x}\textsubscript{t})\big)\,\mathcal{N}\big(\tilde{\bm{x}}\,\big|\,\bm{x}\textsubscript{t}+\dfrac{\varepsilon^{2}}{2}\,\mathbf{A}\nabla_{\bm{X}}\mathcal{L}(\bm{x}\textsubscript{t}),\,\varepsilon^{2}\mathbf{A}\big)}\bigg\}    (60)
=\min\bigg\{1,\dfrac{\exp\Big(\mathcal{L}(\tilde{\bm{x}})-\dfrac{1}{2\varepsilon^{2}}\big(\bm{x}\textsubscript{t}-\tilde{\bm{x}}-\dfrac{\varepsilon^{2}}{2}\,\mathbf{A}\nabla_{\bm{X}}\mathcal{L}(\tilde{\bm{x}})\big)^{T}\mathbf{A}^{-1}\big(\bm{x}\textsubscript{t}-\tilde{\bm{x}}-\dfrac{\varepsilon^{2}}{2}\,\mathbf{A}\nabla_{\bm{X}}\mathcal{L}(\tilde{\bm{x}})\big)\Big)}{\exp\Big(\mathcal{L}(\bm{x}\textsubscript{t})-\dfrac{1}{2\varepsilon^{2}}\big(\tilde{\bm{x}}-\bm{x}\textsubscript{t}-\dfrac{\varepsilon^{2}}{2}\,\mathbf{A}\nabla_{\bm{X}}\mathcal{L}(\bm{x}\textsubscript{t})\big)^{T}\mathbf{A}^{-1}\big(\tilde{\bm{x}}-\bm{x}\textsubscript{t}-\dfrac{\varepsilon^{2}}{2}\,\mathbf{A}\nabla_{\bm{X}}\mathcal{L}(\bm{x}\textsubscript{t})\big)\Big)}\bigg\}
where the Gaussian normalizing constants, being identical, cancel in the ratio.

To check the equivalence of the acceptance probability of preconditioned MALA and single-step HMCMC, we write the resampled momentum 𝒛t\bm{z}\textsubscript{t} and the simulated one 𝒛~\tilde{\bm{z}} based on Eq. 32 as:

𝒛t=𝒛t+ε/2(ε2)𝑿(𝒙t)=1ε𝐌(𝒙~𝒙t)(ε2)𝑿(𝒙t)=1ε𝐌(𝒙~𝒙t(ε22)M1𝑿(𝒙t))\bm{z}\textsubscript{t}=\bm{z}\textsubscript{t+$\varepsilon$/2}-(\dfrac{\varepsilon}{2})\nabla_{\bm{X}}\mathcal{L}(\bm{x}\textsubscript{t})=\dfrac{1}{\varepsilon}\mathbf{M}(\tilde{\bm{x}}-\bm{x}\textsubscript{t})-(\dfrac{\varepsilon}{2})\nabla_{\bm{X}}\mathcal{L}(\bm{x}\textsubscript{t})=\dfrac{1}{\varepsilon}\mathbf{M}\bigg{(}\tilde{\bm{x}}-\bm{x}\textsubscript{t}-(\dfrac{\varepsilon^{2}}{2})\textbf{M}^{-1}\nabla_{\bm{X}}\mathcal{L}(\bm{x}\textsubscript{t})\bigg{)} (61)
𝒛~=𝒛t+ε/2+(ε2)𝑿(𝒙~)=1ε𝐌(𝒙~𝒙t)+(ε2)𝑿(𝒙~)=1ε𝐌(𝒙t𝒙~(ε22)M1𝑿(𝒙~))\tilde{\bm{z}}=\bm{z}\textsubscript{t+$\varepsilon$/2}+(\dfrac{\varepsilon}{2})\nabla_{\bm{X}}\mathcal{L}(\tilde{\bm{x}})=\dfrac{1}{\varepsilon}\mathbf{M}(\tilde{\bm{x}}-\bm{x}\textsubscript{t})+(\dfrac{\varepsilon}{2})\nabla_{\bm{X}}\mathcal{L}(\tilde{\bm{x}})=-\dfrac{1}{\varepsilon}\mathbf{M}\bigg{(}\bm{x}\textsubscript{t}-\tilde{\bm{x}}-(\dfrac{\varepsilon^{2}}{2})\textbf{M}^{-1}\nabla_{\bm{X}}\mathcal{L}(\tilde{\bm{x}})\bigg{)} (62)

The acceptance probability of HMCMC can then be expressed as:

α\displaystyle\alpha =min{1,exp((𝒙~)12𝒛~T𝐌1𝒛~)exp((𝒙t)12𝒛tT𝐌1𝒛t)}\displaystyle=\min\bigg{\{}1,\dfrac{\exp(\mathcal{L}(\tilde{\bm{x}})-\dfrac{1}{2}\tilde{\bm{z}}^{T}\ \mathbf{M}^{-1}\tilde{\bm{z}})}{\exp(\mathcal{L}(\bm{x}_{\text{t}})-\dfrac{1}{2}{\bm{z}_{\text{t}}}^{T}\ \mathbf{M}^{-1}\bm{z}_{\text{t}})}\bigg{\}} (63)
=min{1,exp((𝒙~)12ε2(𝒙t𝒙~(ε22)M1𝑿(𝒙~))T𝐌T𝐌1𝐌(𝒙t𝒙~(ε22)M1𝑿(𝒙~)))exp((𝒙t)12ε2(𝒙~𝒙t(ε22)M1𝑿(𝒙t))T𝐌T𝐌1𝐌(𝒙~𝒙t(ε22)M1𝑿(𝒙t)))}\displaystyle=\min\bigg{\{}1,\dfrac{\exp\bigg{(}\mathcal{L}(\tilde{\bm{x}})-\dfrac{1}{2\varepsilon^{2}}\big{(}\bm{x}\textsubscript{t}-\tilde{\bm{x}}-(\dfrac{\varepsilon^{2}}{2})\textbf{M}^{-1}\nabla_{\bm{X}}\mathcal{L}(\tilde{\bm{x}})\big{)}^{T}\mathbf{M}^{T}\ \mathbf{M}^{-1}\mathbf{M}\big{(}\bm{x}\textsubscript{t}-\tilde{\bm{x}}-(\dfrac{\varepsilon^{2}}{2})\textbf{M}^{-1}\nabla_{\bm{X}}\mathcal{L}(\tilde{\bm{x}})\big{)}\bigg{)}}{\exp\bigg{(}\mathcal{L}(\bm{x}_{\text{t}})-\dfrac{1}{2\varepsilon^{2}}\big{(}\tilde{\bm{x}}-\bm{x}\textsubscript{t}-(\dfrac{\varepsilon^{2}}{2})\textbf{M}^{-1}\nabla_{\bm{X}}\mathcal{L}(\bm{x}\textsubscript{t})\big{)}^{T}\mathbf{M}^{T}\ \mathbf{M}^{-1}\mathbf{M}\big{(}\tilde{\bm{x}}-\bm{x}\textsubscript{t}-(\dfrac{\varepsilon^{2}}{2})\textbf{M}^{-1}\nabla_{\bm{X}}\mathcal{L}(\bm{x}\textsubscript{t})\big{)}\bigg{)}}\bigg{\}}
=min{1,exp((𝒙~)12ε2(𝒙t𝒙~(ε22)M1𝑿(𝒙~))T𝐌(𝒙t𝒙~(ε22)M1𝑿(𝒙~)))exp((𝒙t)12ε2(𝒙~𝒙t(ε22)M1𝑿(𝒙t))T𝐌(𝒙~𝒙t(ε22)M1𝑿(𝒙t)))}\displaystyle=\min\bigg{\{}1,\dfrac{\exp\bigg{(}\mathcal{L}(\tilde{\bm{x}})-\dfrac{1}{2\varepsilon^{2}}\big{(}\bm{x}\textsubscript{t}-\tilde{\bm{x}}-(\dfrac{\varepsilon^{2}}{2})\textbf{M}^{-1}\nabla_{\bm{X}}\mathcal{L}(\tilde{\bm{x}})\big{)}^{T}\mathbf{M}\big{(}\bm{x}\textsubscript{t}-\tilde{\bm{x}}-(\dfrac{\varepsilon^{2}}{2})\textbf{M}^{-1}\nabla_{\bm{X}}\mathcal{L}(\tilde{\bm{x}})\big{)}\bigg{)}}{\exp\bigg{(}\mathcal{L}(\bm{x}_{\text{t}})-\dfrac{1}{2\varepsilon^{2}}\big{(}\tilde{\bm{x}}-\bm{x}\textsubscript{t}-(\dfrac{\varepsilon^{2}}{2})\textbf{M}^{-1}\nabla_{\bm{X}}\mathcal{L}(\bm{x}\textsubscript{t})\big{)}^{T}\mathbf{M}\big{(}\tilde{\bm{x}}-\bm{x}\textsubscript{t}-(\dfrac{\varepsilon^{2}}{2})\textbf{M}^{-1}\nabla_{\bm{X}}\mathcal{L}(\bm{x}\textsubscript{t})\big{)}\bigg{)}}\bigg{\}}

Comparing Eq. 63 and Eq. 60, it is obvious that the acceptance ratio of the single-step HMCMC is equivalent to that of the preconditioned MALA when using a preconditioning matrix 𝐀=𝐌1\mathbf{A}=\mathbf{M}^{-1}, with 𝐌\mathbf{M} being a symmetric positive definite mass matrix.

The equivalence of the acceptance probabilities of the preconditioned MALA and the burn-in stage in QNp-HMCMC can also be shown by rewriting Eq. 39 as:

𝒛t\displaystyle\bm{z}\textsubscript{t} =𝒛t+ε/2(ε2)𝐖t𝑿(𝒙t)=1ε𝐌𝐖t1(𝒙~𝒙t)(ε2)𝐖t𝑿(𝒙t)\displaystyle=\bm{z}\textsubscript{t+$\varepsilon$/2}-(\dfrac{\varepsilon}{2})\mathbf{W}_{\text{t}}\nabla_{\bm{X}}\mathcal{L}(\bm{x}\textsubscript{t})=\dfrac{1}{\varepsilon}\mathbf{M}\mathbf{W}_{\text{t}}^{-1}(\tilde{\bm{x}}-\bm{x}\textsubscript{t})-(\dfrac{\varepsilon}{2})\mathbf{W}_{\text{t}}\nabla_{\bm{X}}\mathcal{L}(\bm{x}\textsubscript{t}) (64)
=1ε𝐌𝐖t1(𝒙~𝒙t(ε22)𝐖tM1𝐖t𝑿(𝒙t))\displaystyle=\dfrac{1}{\varepsilon}\mathbf{M}\mathbf{W}_{\text{t}}^{-1}\bigg{(}\tilde{\bm{x}}-\bm{x}\textsubscript{t}-(\dfrac{\varepsilon^{2}}{2})\mathbf{W}_{\text{t}}\textbf{M}^{-1}\mathbf{W}_{\text{t}}\nabla_{\bm{X}}\mathcal{L}(\bm{x}\textsubscript{t})\bigg{)}
𝒛~\displaystyle\tilde{\bm{z}} =𝒛t+ε/2+(ε2)𝐖t𝑿(𝒙~)=1ε𝐌𝐖t1(𝒙~𝒙t)+(ε2)𝐖t𝑿(𝒙~)\displaystyle=\bm{z}\textsubscript{t+$\varepsilon$/2}+(\dfrac{\varepsilon}{2})\mathbf{W}_{\text{t}}\nabla_{\bm{X}}\mathcal{L}(\tilde{\bm{x}})=\dfrac{1}{\varepsilon}\mathbf{M}\mathbf{W}_{\text{t}}^{-1}(\tilde{\bm{x}}-\bm{x}\textsubscript{t})+(\dfrac{\varepsilon}{2})\mathbf{W}_{\text{t}}\nabla_{\bm{X}}\mathcal{L}(\tilde{\bm{x}}) (65)
=1ε𝐌𝐖t1(𝒙t𝒙~(ε22)𝐖tM1𝐖t𝑿(𝒙~))\displaystyle=-\dfrac{1}{\varepsilon}\mathbf{M}\mathbf{W}_{\text{t}}^{-1}\bigg{(}\bm{x}\textsubscript{t}-\tilde{\bm{x}}-(\dfrac{\varepsilon^{2}}{2})\mathbf{W}_{\text{t}}\textbf{M}^{-1}\mathbf{W}_{\text{t}}\nabla_{\bm{X}}\mathcal{L}(\tilde{\bm{x}})\bigg{)}

The corresponding acceptance probability can then be expressed as:

α=min{1,exp((𝒙~))exp((𝒙t))R},\displaystyle\alpha=\min\bigg{\{}1,\dfrac{\exp(\mathcal{L}(\tilde{\bm{x}}))}{\exp(\mathcal{L}(\bm{x}_{\text{t}}))}\cdot R^{\prime}\bigg{\}}, (66)
R=exp(12ε2(𝒙t𝒙~(ε22)𝐖tM1𝐖t𝑿(𝒙~))T(𝐖t1𝐌𝐖t1)(𝒙t𝒙~(ε22)𝐖tM1𝐖t𝑿(𝒙~)))exp(12ε2(𝒙~𝒙t(ε22)𝐖tM1𝐖t𝑿(𝒙t))T(𝐖t1𝐌𝐖t1)(𝒙~𝒙t(ε22)𝐖tM1𝐖t𝑿(𝒙t)))\displaystyle R^{\prime}=\dfrac{\exp(-\dfrac{1}{2\varepsilon^{2}}\big{(}\bm{x}\textsubscript{t}-\tilde{\bm{x}}-(\dfrac{\varepsilon^{2}}{2})\mathbf{W}_{\text{t}}\textbf{M}^{-1}\mathbf{W}_{\text{t}}\nabla_{\bm{X}}\mathcal{L}(\tilde{\bm{x}})\big{)}^{T}(\mathbf{W}_{\text{t}}^{-1}\mathbf{M}\mathbf{W}_{\text{t}}^{-1})\big{(}\bm{x}\textsubscript{t}-\tilde{\bm{x}}-(\dfrac{\varepsilon^{2}}{2})\mathbf{W}_{\text{t}}\textbf{M}^{-1}\mathbf{W}_{\text{t}}\nabla_{\bm{X}}\mathcal{L}(\tilde{\bm{x}})\big{)}\bigg{)}}{\exp\bigg{(}-\dfrac{1}{2\varepsilon^{2}}\big{(}\tilde{\bm{x}}-\bm{x}\textsubscript{t}-(\dfrac{\varepsilon^{2}}{2})\mathbf{W}_{\text{t}}\textbf{M}^{-1}\mathbf{W}_{\text{t}}\nabla_{\bm{X}}\mathcal{L}(\bm{x}\textsubscript{t})\big{)}^{T}(\mathbf{W}_{\text{t}}^{-1}\mathbf{M}\mathbf{W}_{\text{t}}^{-1})\big{(}\tilde{\bm{x}}-\bm{x}\textsubscript{t}-(\dfrac{\varepsilon^{2}}{2})\mathbf{W}_{\text{t}}\textbf{M}^{-1}\mathbf{W}_{\text{t}}\nabla_{\bm{X}}\mathcal{L}(\bm{x}\textsubscript{t})\big{)}\bigg{)}}

Consequently, the acceptance probability in the burn-in stage of QNp-HMCMC is equivalent to that of the preconditioned MALA with a preconditioning matrix 𝐀t=𝐖t𝐌1𝐖t\mathbf{A}_{\text{t}}=\mathbf{W}_{\text{t}}\mathbf{M}^{-1}\mathbf{W}_{\text{t}}, where 𝐌=𝐈\mathbf{M}=\mathbf{I} in this work. ∎
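The acceptance-probability equivalence of Lemma 2 can be verified numerically in the same spirit. The Python sketch below (again a hypothetical toy setting, with M = I as used in this work and a diagonal W_t) computes the single-step QNp-HMCMC acceptance ratio of Eq. 63 (equal to Eq. 66 after substituting Eqs. 64–65) and the preconditioned MALA ratio of Eq. 60 with A_t = W_t M^{-1} W_t, and confirms that they coincide.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
d = 3

# Hypothetical toy log-density L(x) = -0.5 x^T S^{-1} x and its gradient
S = np.diag(rng.uniform(0.5, 2.0, d))
log_L = lambda x: -0.5 * x @ np.linalg.solve(S, x)
grad_L = lambda x: -np.linalg.solve(S, x)

eps = 0.2
M = np.eye(d)                                   # mass matrix (M = I in this work)
W = np.diag(rng.uniform(0.5, 2.0, d))           # symmetric preconditioner W_t
A = W @ np.linalg.inv(M) @ W                    # A_t = W_t M^{-1} W_t

x_t = rng.standard_normal(d)
z_t = np.linalg.cholesky(M) @ rng.standard_normal(d)    # momentum z ~ N(0, M)

# single-step QNp-HMCMC proposal (Eq. 39) and simulated momentum (Eqs. 64-65)
z_half = z_t + 0.5 * eps * W @ grad_L(x_t)
x_new = x_t + eps * W @ np.linalg.solve(M, z_half)
z_new = z_half + 0.5 * eps * W @ grad_L(x_new)

# HMCMC acceptance ratio via the Hamiltonian H = -L(x) + 0.5 z^T M^{-1} z (Eq. 63 / 66)
H = lambda x, z: -log_L(x) + 0.5 * z @ np.linalg.solve(M, z)
log_r_hmcmc = H(x_t, z_t) - H(x_new, z_new)

# preconditioned MALA acceptance ratio (Eq. 60) with A = W M^{-1} W
mean_fwd = x_t + 0.5 * eps**2 * A @ grad_L(x_t)
mean_bwd = x_new + 0.5 * eps**2 * A @ grad_L(x_new)
log_q_fwd = multivariate_normal(mean_fwd, eps**2 * A).logpdf(x_new)   # q(x_new | x_t)
log_q_bwd = multivariate_normal(mean_bwd, eps**2 * A).logpdf(x_t)     # q(x_t | x_new)
log_r_mala = log_L(x_new) - log_L(x_t) + log_q_bwd - log_q_fwd

print(np.allclose(log_r_hmcmc, log_r_mala))     # True: identical acceptance ratios
```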