
Pre-integration via active subspaces

Sifan Liu Department of Statistics, Stanford University Art B. Owen Department of Statistics, Stanford University
(February 2022)
Abstract

Pre-integration is an extension of conditional Monte Carlo to quasi-Monte Carlo and randomized quasi-Monte Carlo. It can reduce but not increase the variance in Monte Carlo. For quasi-Monte Carlo it can bring about improved regularity of the integrand with potentially greatly improved accuracy. Pre-integration is ordinarily done by integrating out one of $d$ input variables to a function. In the common case of a Gaussian integral one can also pre-integrate over any linear combination of variables. We propose to do that and we choose the first eigenvector in an active subspace decomposition to be the pre-integrated linear combination. We find in numerical examples that this active subspace pre-integration strategy is competitive with pre-integrating the first variable in the principal components construction on the Asian option where principal components are known to be very effective. It outperforms other pre-integration methods on some basket options where there is no well established default. We show theoretically that, just as in Monte Carlo, pre-integration can reduce but not increase the variance when one uses scrambled net integration. We show that the lead eigenvector in an active subspace decomposition is closely related to the vector that maximizes a less computationally tractable criterion using a Sobol’ index to find the most important linear combination of Gaussian variables. They optimize similar expectations involving the gradient. We show that the Sobol’ index criterion for the leading eigenvector is invariant to the way that one chooses the remaining $d-1$ eigenvectors with which to sample the Gaussian vector.

1 Introduction

Pre-integration [12] is a strategy for high dimensional numerical integration in which one variable is integrated out in closed form (or by a very accurate quadrature rule) while the others are handled by quasi-Monte Carlo (QMC) sampling. This strategy has a long history in Monte Carlo (MC) sampling [13, 43], where it is known as conditional Monte Carlo. In the Markov chain Monte Carlo (MCMC) literature, such conditioning is called Rao-Blackwellization, although it does not generally bring the optimal variance reduction that results from the Rao-Blackwell theorem. See [38] for a survey.

The advantage of pre-integration in QMC goes beyond the variance reduction that arises in MC. After pre-integration, a $d$-dimensional integration problem with a discontinuity or a kink (discontinuity in the gradient) can be converted into a much smoother $(d-1)$-dimensional problem [12]. QMC exploits regularity of the integrand and then smoothness brings a benefit on top of the $L^{2}$ norm reduction that comes from conditioning. Gilbert, Kuo and Sloan [9] point out that the resulting smoothness depends critically on a monotonicity property of the integrand with respect to the variable being integrated out. Hoyt and Owen [17] give conditions where pre-integration reduces the mean dimension (that we will define below) of the integrand. It can reduce mean dimension from proportional to $\sqrt{d}$ to $O(1)$ as $d\to\infty$ in a sequence of ridge functions with a discontinuity that the pre-integration smooths out. He [14] studied the error rate of pre-integration for scrambled nets applied to functions of the form $f(\boldsymbol{x})=h(\boldsymbol{x})\mathbf{1}\{\phi(\boldsymbol{x})\geqslant 0\}$ for a Gaussian variable $\boldsymbol{x}$. That work assumes that $h$ and $\phi$ are smooth functions and $\phi$ is monotone in $x_{j}$. Then pre-integration has a smoothing effect that when combined with some boundary growth conditions yields an error rate of $O(n^{-1+\varepsilon})$.

Many of the use cases for pre-integration involve integration with respect to the multivariate Gaussian distribution, especially for problems arising in finance. In the Gaussian context, we have more choices for the variable to pre-integrate over. In addition to pre-integration over any one of the $d$ coordinates, pre-integration over any linear combination of the variables remains integration with respect to a univariate Gaussian variable. Our proposal is to pre-integrate over a linear combination of variables chosen to optimize a measure of variable importance derived from active subspaces [3].

When sampling from a multivariate Gaussian distribution by QMC, even without pre-integration, one must choose a square root of the covariance matrix by which to multiply some sampled scalar Gaussian variables. There are numerous choices for that square root. One can sample via the principal component matrix decomposition as [1] and many others do. For integrands defined with respect to Brownian motions, one can use the Brownian bridge construction studied by [23]. These options have some potential disadvantages. It is always possible that the integrand is little affected by the first principal component. In a pessimistic scenario, the integrand could be proportional to a principal component that is orthogonal to the first one. This is a well known pitfall in principal components regression [20]. In a related phenomenon, [34] exhibits an integrand where QMC via the standard construction is more effective than via the Brownian bridge.

Not only might a principal component direction perform poorly, the first principal component is not necessarily well defined. Although the problem may be initially defined in terms of a specific Gaussian distribution, by a change of variable we can rewrite our integral as an expectation with respect to another Gaussian distribution with a different covariance matrix that has a different first principal component. Or, if the problem is posed with a covariance equal to the $d$-dimensional identity matrix, then every unit vector linear combination of variables is a first principal component.

Some proposed methods take account of the specific integrand while formulating a sampling strategy. These include stratifying in a direction chosen from exponential tilting [11], exploiting a linearization of the integrand at $d+1$ special points starting with the center of the domain [18], and a gradient principal component analysis (GPCA) algorithm [45] that we describe in more detail below.

The problem we consider is to compute an approximation to $\mu=\mathbb{E}(f(\boldsymbol{x}))$ where $\boldsymbol{x}\in\mathbb{R}^{d}$ has the spherical Gaussian distribution denoted by $\mathcal{N}(0,I)$ and $f$ has a gradient almost everywhere that is square integrable. Let $C=\mathbb{E}(\nabla f(\boldsymbol{x})\nabla f(\boldsymbol{x})^{\mathsf{T}})\in\mathbb{R}^{d\times d}$. The $r$ dimensional active subspace [3] is the space spanned by the $r$ leading eigenvectors of $C$. For other uses of the matrix $C$ in numerical computation, see the references in [4]. For $r=1$, let $\theta$ be the leading eigenvector of $C$ normalized to be a unit vector. Our use is to pre-integrate $f$ over $\theta^{\mathsf{T}}\boldsymbol{x}\sim\mathcal{N}(0,1)$.

The eigendecomposition of $C$ is an uncentered principal components decomposition of the gradients, also known as the GPCA. The GPCA method [45] also uses the eigendecomposition of $C$ to define a matrix square root for a QMC sampling strategy to reduce effective dimension, but it involves no pre-integration. The algorithm in [44] pre-integrates the first variable $x_{1}$ out of $f$. Then it applies GPCA to the remaining $d-1$ variables in order to find a suitable $(d-1)\times(d-1)$ matrix square root for the remaining Gaussian variables. They pre-integrate over a coordinate variable while we always pre-integrate over the leading eigenvector, which is not generally one of the $d$ coordinates. All of the algorithms that involve $C$ take a sample in order to estimate it.

This paper is organized as follows. Section 2 provides some background on RQMC and pre-integration. Section 3 shows that pre-integration never increases the variance of scrambled net integration, extending the well known property of conditional integration to RQMC. This holds even if the points being scrambled are not a digital net. There is presently very little guidance in the literature about which variable to pre-integrate over, beyond the monotonicity condition recently studied in [9] and a remark for ridge functions in [17]. An RQMC variance formula from Dick and Pillichshammer [7] overlaps greatly with a certain Sobol’ index and so this section advocates using the variable that maximizes that Sobol’ index of variable importance. That index has an effective practical estimator due to Jansen [19]. In Section 4, we describe using the unit vector $\theta$ which maximizes the Sobol’ index for $\theta^{\mathsf{T}}\boldsymbol{x}$. This strategy is to find a matrix square root for which the first column maximizes the criterion from Section 3. We show that this Sobol’ index $\underline{\tau}^{2}_{\theta}$ is well defined in that it does not depend on how we parameterize the space orthogonal to $\theta$. This index is upper bounded by $\theta^{\mathsf{T}}\mathbb{E}(\nabla f(\boldsymbol{x})\nabla f(\boldsymbol{x})^{\mathsf{T}})\theta$. When $f$ is differentiable, the Sobol’ index can be expressed as a weighted expectation of $\theta^{\mathsf{T}}\nabla f(\boldsymbol{x})\nabla f(\boldsymbol{x})^{\mathsf{T}}\theta$. As a result, choosing a projection by active subspaces amounts to optimizing a computationally convenient proxy measure for a Sobol’ index of variable importance. We apply active subspace pre-integration to option pricing examples in Section 5. These include an Asian call option and some of its partial derivatives (called the Greeks) as well as a basket option. The following summary is based on the figures in that section. Using the standard construction, active subspace pre-integration is a clear winner for five of the six Asian option integration problems and is essentially tied with the method from [44] for the one called Gamma. The principal components construction of [1] is more commonly used for this problem and has been extensively studied. In that construction, active subspace pre-integration is more accurate than the other methods for Gamma, is worse than some others for Rho, and is essentially the same as the best methods in the other four problems. Every one of the pre-integration methods was more accurate than RQMC without pre-integration, consistent with Theorem 3.2. We also looked at a basket option for which there is not a strong default construction as well accepted as the PCA construction is for the Asian option. There, in six cases (two basket options and three baseline sampling constructions), active subspace pre-integration was always the most accurate. That section makes some brief comments about the computational cost. In our simulations, the pre-integration methods cost significantly more than the other methods, but the cost factor is not enough to overwhelm their improved accuracy, except that pre-integrating the first variable in the standard construction of Brownian motion usually brought little to no advantage. Section 6 has our conclusions.

2 Background

In this section, we introduce some background on RQMC, pre-integration, and active subspaces. First we introduce some notation; additional notation is introduced as needed.

For a positive integer $d$, we let $1{:}d=\{1,2,\ldots,d\}$. For a subset $u\subseteq 1{:}d$, let $-u=1{:}d\setminus u$. For an integer $j\in 1{:}d$, we use $j$ to represent $\{j\}$ when the context is clear and $-j=1{:}d\setminus\{j\}$. Let $|u|$ denote the cardinality of $u$. For $\boldsymbol{x}\in\mathbb{R}^{d}$, we let $\boldsymbol{x}_{u}$ be the $|u|$ dimensional vector containing only the $x_{j}$ with $j\in u$. For $\boldsymbol{x},\boldsymbol{z}\in\mathbb{R}^{d}$ and $u\subseteq 1{:}d$, we let $\boldsymbol{x}_{u}{:}\boldsymbol{z}_{-u}$ be the $d$ dimensional vector whose $j$-th entry is $x_{j}$ if $j\in u$, and $z_{j}$ if $j\notin u$. We use $\mathbb{N}=\{1,2,\ldots\}$ for the natural numbers and $\mathbb{N}_{0}=\mathbb{N}\cup\{0\}$. We denote the density and the cumulative distribution function (CDF) of the standard Gaussian distribution $\mathcal{N}(0,1)$ as $\varphi$ and $\Phi$, respectively. We let $\Phi^{-1}$ denote the inverse CDF of $\mathcal{N}(0,1)$. We also use $\varphi$ to denote the density of the $d$-dimensional standard Gaussian distribution $\mathcal{N}(0,I_{d})$, $\varphi(\boldsymbol{x})=(2\pi)^{-d/2}\exp(-\|\boldsymbol{x}\|^{2}/2)$. We use $\mathcal{N}(0,I)$ when the dimension of the random variable is clear from context. For a matrix $\Theta\in\mathbb{R}^{d\times d}$ we often need to select a subset of columns. We do that via $\Theta[u]$, such as $\Theta[1{:}r]$ or $\Theta[2{:}d]$.

2.1 QMC and RQMC

For background on QMC see the monograph [7] and the survey article [6]. For a survey of RQMC see [21]. We will emphasize scrambled net integration with a recent description in [32]. Here we give a brief account.

QMC provides a way to estimate $\mu=\int_{[0,1]^{d}}f(\boldsymbol{x})\,\mathrm{d}\boldsymbol{x}$ with greater accuracy than can be done by MC, while sharing with MC the ability to handle larger dimensions $d$ than can be well handled by classical quadrature methods such as those in [5]. The QMC estimate, like the MC one, takes the form $\hat{\mu}_{n}=(1/n)\sum_{i=0}^{n-1}f(\boldsymbol{x}_{i})$, except that instead of taking $\boldsymbol{x}_{i}\overset{\text{iid}}{\sim}\mathbb{U}[0,1]^{d}$ the sample points are chosen strategically to get a small value for

D_{n}^{*}=D_{n}^{*}(\boldsymbol{x}_{0},\dots,\boldsymbol{x}_{n-1})=\sup_{\boldsymbol{a}\in[0,1]^{d}}\Bigl|\frac{1}{n}\sum_{i=0}^{n-1}\mathbf{1}\{\boldsymbol{x}_{i}\in[\boldsymbol{0},\boldsymbol{a})\}-\prod_{j=1}^{d}a_{j}\Bigr|

which is known as the star discrepancy. The Koksma-Hlawka inequality (see [15]) is

|\hat{\mu}_{n}-\mu|\leqslant D_{n}^{*}\times\|f\|_{\mathrm{HK}} \qquad (1)

where $\|\cdot\|_{\mathrm{HK}}$ denotes total variation in the sense of Hardy and Krause. It is possible to construct points $\boldsymbol{x}_{i}$ with $D_{n}^{*}=O((\log n)^{d-1}/n)$ or to choose an infinite sequence of them along which $D_{n}^{*}=O((\log n)^{d}/n)$. Both of these are often written as $O(n^{-1+\epsilon})$ for any $\epsilon>0$ and this rate translates directly to $|\hat{\mu}_{n}-\mu|$ when $f\in\mathrm{BVHK}[0,1]^{d}$, the set of functions of bounded variation in the sense of Hardy and Krause. While the logarithmic powers are not negligible, they correspond to worst case integrands that are not representative of integrands evaluated in practice, and as we see below, some RQMC methods provide a measure of control against them.

The QMC methods we study are digital nets and sequences. To define them, for an integer base $b\geqslant 2$ let

E(\boldsymbol{k},\boldsymbol{c})=\prod_{j=1}^{d}\Bigl[\frac{c_{j}}{b^{k_{j}}},\frac{c_{j}+1}{b^{k_{j}}}\Bigr) \qquad (2)

for $\boldsymbol{k}=(k_{1},\dots,k_{d})$ and $\boldsymbol{c}=(c_{1},\dots,c_{d})$ where $k_{j}\in\mathbb{N}_{0}$ and $0\leqslant c_{j}<b^{k_{j}}$. The sets $E(\boldsymbol{k},\boldsymbol{c})$ are called elementary intervals in base $b$. They have volume $b^{-|\boldsymbol{k}|}$ where $|\boldsymbol{k}|=\sum_{j=1}^{d}k_{j}$. For integers $m\geqslant t\geqslant 0$, the points $\boldsymbol{x}_{0},\dots,\boldsymbol{x}_{n-1}$ with $n=b^{m}$ are a $(t,m,d)$-net in base $b$ if every elementary interval $E(\boldsymbol{k},\boldsymbol{c})$ with $|\boldsymbol{k}|\leqslant m-t$ contains $b^{m-|\boldsymbol{k}|}$ of those points, which is exactly $n$ times the volume of $E(\boldsymbol{k},\boldsymbol{c})$.

For each $\boldsymbol{k}$ the sets $E(\boldsymbol{k},\boldsymbol{c})$ partition $[0,1)^{d}$ into $b^{|\boldsymbol{k}|}$ congruent half open sets. If $|\boldsymbol{k}|\leqslant m-t$, then the $n$ points $\boldsymbol{x}_{i}$ are perfectly stratified over that partition. The power of digital nets is that the points $\boldsymbol{x}_{i}$ satisfy $\binom{m-t+d-1}{d-1}$ such stratifications simultaneously. They attain $D_{n}^{*}=O((\log n)^{d-1}/n)$ after approximating each $[\boldsymbol{0},\boldsymbol{a})$ by sets $E(\boldsymbol{k},\boldsymbol{c})$ [24]. Given $b$, $m$ and $d$, a smaller $t$ is better. It is not always possible to get $t=0$.

For an integer $t\geqslant 0$, the infinite sequence $(\boldsymbol{x}_{i})_{i\geqslant 0}$ is a $(t,d)$-sequence in base $b$ if for all integers $m\geqslant t$ and $r\geqslant 0$, the points $\boldsymbol{x}_{rb^{m}},\dots,\boldsymbol{x}_{rb^{m}+b^{m}-1}$ are a $(t,m,d)$-net in base $b$. This means that the first $b^{m}$ points are a $(t,m,d)$-net, as are the second $b^{m}$ points, and if we take the first $b$ such nets together we get a $(t,m+1,d)$-net. Similarly the first $b$ such $(t,m+1,d)$-nets form a $(t,m+2,d)$-net and so on ad infinitum.

The nets and sequences in common use are called digital nets and sequences owing to a digital strategy used in their construction, where $x_{ij}$ is expanded in base $b$ and there are rules for constructing those base $b$ digits from the base $b$ representation of $i$. See [24]. The most commonly used nets and sequences are those of Sobol’ [39]. Sobol’ sequences are $(t,d)$-sequences in base $2$ and sometimes using base $2$ brings computational advantages for computers that work in base $2$. The value of $t$ is nondecreasing in $d$. The first $2^{m}$ points of a $(t,d)$-sequence can be a $(t^{\prime},m,d)$-net for some $t^{\prime}<t$.

While the Koksma-Hlawka inequality (1) shows that QMC is asymptotically better than MC for $f\in\mathrm{BVHK}[0,1]^{d}$, it is not usable for error estimation, and furthermore, many integrands of interest such as unbounded ones have $V_{\mathrm{HK}}(f)=\infty$. RQMC methods can address both problems. In RQMC the points $\boldsymbol{x}_{i}\sim\mathbb{U}[0,1]^{d}$ individually while collectively they have a small $D_{n}^{*}$. The uniformity property makes

\hat{\mu}_{n}=\frac{1}{n}\sum_{i=0}^{n-1}f(\boldsymbol{x}_{i}) \qquad (3)

an unbiased estimate of $\mu$ when $f\in L^{1}[0,1]^{d}$. If $f\in L^{2}[0,1]^{d}$ then $\mathrm{Var}(\hat{\mu}_{n})<\infty$ and it can be estimated by using independent repeated RQMC evaluations.

Scrambled $(t,m,d)$-nets have $\boldsymbol{x}_{i}\sim\mathbb{U}[0,1]^{d}$ individually and the collective condition is that they form a $(t,m,d)$-net with probability one. For an infinite sequence of $(t,m,d)$-nets one can use scrambled $(t,d)$-sequences. See [26] for both of these. The estimate $\hat{\mu}_{n}$ taken over a scrambled $(t,d)$-sequence satisfies a strong law of large numbers if $f\in L^{1+\epsilon}[0,1]^{d}$ [32]. If $f\in L^{2}[0,1]^{d}$, then $\mathrm{Var}(\hat{\mu}_{n})=o(1/n)$, giving the method asymptotically unbounded efficiency versus MC, which has variance $\sigma^{2}/n$ for $\sigma^{2}=\mathrm{Var}(f(\boldsymbol{x}))$. For smooth enough $f$, $\mathrm{Var}(\hat{\mu}_{n})=O(n^{-3}(\log n)^{d-1})$ [28, 30] with the sharpest sufficient condition in [46]. Also there exists $\Gamma<\infty$ with $\mathrm{Var}(\hat{\mu}_{n})\leqslant\Gamma\sigma^{2}/n$ for all $f\in L^{2}[0,1]^{d}$ [29]. This bound involves no powers of $\log(n)$.
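To make this concrete, the following is a minimal sketch of ours (not code from the paper) of scrambled net integration in practice: it draws scrambled Sobol' points with scipy.stats.qmc and estimates $\mathrm{Var}(\hat{\mu}_{n})$ from independent replicates as described above. The test integrand and the replicate counts are arbitrary choices for illustration.

```python
# Minimal RQMC sketch (ours): scrambled Sobol' points plus independent
# replicates to estimate the variance of the RQMC estimate.
import numpy as np
from scipy.stats import qmc

def rqmc_estimates(f, d, m=10, n_rep=20, seed=0):
    """Return n_rep independent RQMC estimates of int_{[0,1]^d} f(x) dx,
    each based on n = 2^m scrambled Sobol' points."""
    rng = np.random.default_rng(seed)
    est = []
    for _ in range(n_rep):
        x = qmc.Sobol(d=d, scramble=True, seed=rng).random_base2(m)
        est.append(f(x).mean())
    return np.array(est)

# Example: a smooth product integrand whose exact integral is 1.
f = lambda x: np.prod(1.0 + (x - 0.5) / 3.0, axis=1)
est = rqmc_estimates(f, d=5)
print(est.mean(), est.var(ddof=1))  # pooled estimate and estimated Var of one replicate
```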

2.2 Pre-integration with respect to $\mathbb{U}[0,1]^{d}$ and $\mathcal{N}(0,I)$

For $j\in 1{:}d$, $\int_{[0,1]^{d}}f(\boldsymbol{x})\,\mathrm{d}\boldsymbol{x}=\int_{[0,1]^{d-1}}\int_{0}^{1}f(\boldsymbol{x})\,\mathrm{d}x_{j}\,\mathrm{d}\boldsymbol{x}_{-j}$, which we can also write as $\mathbb{E}(f(\boldsymbol{x}))=\mathbb{E}(\mathbb{E}(f(\boldsymbol{x})\mid\boldsymbol{x}_{-j}))$ for $\boldsymbol{x}\sim\mathbb{U}[0,1]^{d}$. For $\boldsymbol{x}\in[0,1]^{d}$, define

g(\boldsymbol{x})=g_{j}(\boldsymbol{x})=\int_{0}^{1}f(\boldsymbol{x})\,\mathrm{d}x_{j}.

It simplifies the presentation of some of our results, especially Theorem 3.2 on variance reduction, to keep $g$ defined as above on $[0,1]^{d}$ even though it does not depend at all on $x_{j}$ and could be written as a function of $\boldsymbol{x}_{-j}\in[0,1]^{d-1}$ instead. In pre-integration

\hat{\mu}_{n}=\hat{\mu}_{n,j}=\frac{1}{n}\sum_{i=0}^{n-1}g_{j}(\boldsymbol{x}_{i})

which as we noted in the introduction is conditional MC except that we now use RQMC inputs.

Pre-integration can bring some advantages for RQMC. The plain MC variance of $g(\boldsymbol{x})$ is no larger than that of $f(\boldsymbol{x})$ and is generally smaller unless $f$ does not depend at all on $x_{j}$. Thus the bound $\Gamma\sigma^{2}/n$ is reduced. Next, the pre-integrated integrand $g$ can be much smoother than $f$, and (R)QMC improves on MC by exploiting smoothness. For example, [12, 44] observe that for some option pricing integrands, pre-integrating a certain variable can remove the discontinuities in the integrand or its gradient.

The integrands we consider here are defined with respect to a Gaussian random variable. We are interested in $\mu=\mathbb{E}(f(\boldsymbol{z}))$ for $\boldsymbol{z}\sim\mathcal{N}(\boldsymbol{0},\Sigma)$ and a nonsingular covariance $\Sigma\in\mathbb{R}^{d\times d}$. Letting $R_{0}\in\mathbb{R}^{d\times d}$ with $R_{0}R_{0}^{\mathsf{T}}=\Sigma$ we can write $\mu=\mathbb{E}(f(R_{0}\boldsymbol{z}))$ for $\boldsymbol{z}\sim\mathcal{N}(0,I)$. For an orthogonal matrix $Q\in\mathbb{R}^{d\times d}$ we also have $Q\boldsymbol{z}\sim\mathcal{N}(0,I)$. Then taking $\boldsymbol{z}=\Phi^{-1}(\boldsymbol{x})$ componentwise leads us to the estimate

\hat{\mu}=\frac{1}{n}\sum_{i=0}^{n-1}f(R\Phi^{-1}(\boldsymbol{x}_{i}))\quad\text{with }R=R_{0}Q

for RQMC points $\boldsymbol{x}_{i}$. The choice of $Q$, or equivalently $R$, does not affect the MC variance of $\hat{\mu}$ but it can change the RQMC variance. We will consider some examples later. The mapping $\Phi^{-1}$ from $\mathbb{U}[0,1]^{d}$ to $\mathcal{N}(0,I)$ can be replaced by another one such as the Box-Muller transformation. The choice of transformation does not affect the MC variance but does affect the RQMC variance. Most researchers use $\Phi^{-1}$ but [25] advocates for Box-Muller.
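As a small illustration of the estimate above (ours, not the paper's code), the sketch below maps scrambled Sobol' points to $\mathcal{N}(0,\Sigma)$ through $\boldsymbol{z}=R\Phi^{-1}(\boldsymbol{x})$ with $R=R_{0}Q$; the covariance, the orthogonal factor $Q$ and the integrand are placeholders.

```python
# Sketch (ours): estimate E f(z) for z ~ N(0, Sigma) by RQMC with a chosen
# matrix square root R = R0 Q and the componentwise inverse Gaussian CDF.
import numpy as np
from scipy.stats import norm, qmc

def rqmc_gaussian_mean(f, Sigma, Q=None, m=12, seed=1):
    d = Sigma.shape[0]
    R0 = np.linalg.cholesky(Sigma)        # one square root, R0 R0^T = Sigma
    R = R0 if Q is None else R0 @ Q       # any orthogonal Q gives another square root
    x = qmc.Sobol(d=d, scramble=True, seed=seed).random_base2(m)
    z = norm.ppf(x) @ R.T                 # rows are N(0, Sigma) samples
    return f(z).mean()

# Example: exchangeable covariance and a kinked integrand.
d = 4
Sigma = 0.5 * np.eye(d) + 0.5 * np.ones((d, d))
print(rqmc_gaussian_mean(lambda z: np.maximum(z.sum(axis=1), 0.0), Sigma))
```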

When we are using pre-integration for a problem defined with respect to a $\mathcal{N}(0,\Sigma)$ random variable we must choose $R$ and then the coordinate $j$ over which to pre-integrate. Our approach is to choose $R$ to make coordinate $j=1$ as important as we can, using active subspaces.

2.3 The ANOVA decomposition

For $f\in L^{2}[0,1]^{d}$ we can define an analysis of variance (ANOVA) decomposition from [16, 40, 8]. For details see [31, Appendix A.6]. This decomposition writes

f(\boldsymbol{x})=\sum_{u\subseteq 1{:}d}f_{u}(\boldsymbol{x})

where $f_{u}$ depends on $\boldsymbol{x}$ only through $x_{j}$ with $j\in u$ and also $\int_{0}^{1}f_{u}(\boldsymbol{x})\,\mathrm{d}x_{j}=0$ whenever $j\in u$. The decomposition is orthogonal in that $\int_{[0,1]^{d}}f_{u}(\boldsymbol{x})f_{v}(\boldsymbol{x})\,\mathrm{d}\boldsymbol{x}=0$ if $u\neq v$. The term $f_{\varnothing}$ is the constant function everywhere equal to $\mu=\int_{[0,1]^{d}}f(\boldsymbol{x})\,\mathrm{d}\boldsymbol{x}$. To each effect $f_{u}$ there corresponds a variance component

\sigma^{2}_{u}=\mathrm{Var}(f_{u}(\boldsymbol{x}))=\begin{cases}\int_{[0,1]^{d}}f_{u}(\boldsymbol{x})^{2}\,\mathrm{d}\boldsymbol{x},&|u|>0\\ 0,&\text{else.}\end{cases}

For $|u|\geqslant 2$, the effect $f_{u}$ is called a $|u|$-fold interaction. The variance components sum to $\sigma^{2}=\mathrm{Var}(f(\boldsymbol{x}))$. We will use the ANOVA decomposition below when describing how to choose a pre-integration variable.

Sobol’ indices [41] are derived from the ANOVA decomposition. For $u\subseteq 1{:}d$ these are

\underline{\tau}^{2}_{u}=\sum_{v\subseteq u}\sigma^{2}_{v}\quad\text{and}\quad\overline{\tau}^{2}_{u}=\sum_{v\subseteq 1{:}d}\mathbf{1}\{u\cap v\neq\varnothing\}\,\sigma^{2}_{v}.

They provide two ways to judge the importance of the set of variables $x_{j}$ for $j\in u$. They are usually normalized by $\sigma^{2}$ to get an interpretation as a proportion of variance explained.

The mean dimension of $f$ is $\nu(f)=\sum_{u\subseteq 1{:}d}|u|\sigma^{2}_{u}/\sigma^{2}$. It satisfies $\nu(f)=\sum_{j=1}^{d}\overline{\tau}^{2}_{j}/\sigma^{2}$. A ridge function takes the form $f(\boldsymbol{x})=h(\Theta^{\mathsf{T}}\boldsymbol{x})$ for $\Theta\in\mathbb{R}^{d\times r}$ with $\Theta^{\mathsf{T}}\Theta=I$. For $\boldsymbol{x}\sim\mathcal{N}(0,I)$, the variance of $f$ does not depend on $d$ and the mean dimension is $O(1)$ as $d\to\infty$ [17] if $h$ is Lipschitz. If $h$ has a step discontinuity then it is possible to have $\nu(f)=\Omega(\sqrt{d})$ reduced to $O(1)$ by pre-integration over a component variable $x_{j}$ with $\theta_{j}$ bounded away from zero as $d\to\infty$ [17].

3 Pre-integration and scrambled net variance

Conditional MC can reduce but not increase the variance of plain MC integration. Here we show that the same thing holds for scrambled nets using the nested uniform scrambling of [26]. The affine linear scrambling of [22] has the same variance and hence the same result. We assume that $f\in L^{2}[0,1)^{d}$. The half-open interval is just a notational convenience. For any $f\in L^{2}[0,1]^{d}$ we could set $f(\boldsymbol{x})=0$ for any $\boldsymbol{x}\in D=[0,1]^{d}\setminus[0,1)^{d}$ and get an equivalent function with the same integral and, almost surely, the same RQMC estimate because all $\boldsymbol{x}_{i}\sim\mathbb{U}[0,1]^{d}$ avoid $D$ with probability one.

We will pre-integrate over one of the $d$ components of $\boldsymbol{x}\in[0,1)^{d}$. It is also possible to pre-integrate over multiple components and reduce the RQMC variance each time, though the utility of that strategy is limited by the availability of suitable closed forms or effective quadratures.

3.1 Walsh function expansions

To get variance formulas for scrambled nets we follow Dick and Pillichshammer [7] who work with a Walsh function expansion of $L^{2}[0,1)^{d}$ for which they credit Pirsic [36]. Let $\omega_{b}=e^{2\pi i/b}$ with $i$ being the imaginary unit. For $k\in\mathbb{N}_{0}$ write $k=\sum_{j\geqslant 0}\kappa_{j}b^{j}$ for base $b$ digits $\kappa_{j}\in\{0,1,\dots,b-1\}$. For $x\in[0,1)$ write $x=\sum_{j\geqslant 1}\xi_{j}b^{-j}$ for base $b$ digits $\xi_{j}\in\{0,1,\dots,b-1\}$. Countably many $x$ have an expansion terminating in either infinitely many 0s or in infinitely many $b-1$s. For those we always choose the expansion terminating in 0s.

Using the above notation we can define the $k$-th $b$-adic Walsh function ${}_{b}\mathrm{wal}_{k}:[0,1)\to\mathbb{C}$ as

{}_{b}\mathrm{wal}_{k}(x)=\omega_{b}^{\sum_{j\geqslant 1}\xi_{j}\kappa_{j-1}}.

The summation in the exponent is finite because $k<\infty$. Note that ${}_{b}\mathrm{wal}_{0}(x)=1$ for all $x\in[0,1)$. For $\boldsymbol{x}=(x_{1},\dots,x_{d})\in[0,1)^{d}$ and $\boldsymbol{k}=(k_{1},\ldots,k_{d})\in\mathbb{N}_{0}^{d}$, the $d$-dimensional Walsh functions are defined as

{}_{b}\mathrm{wal}_{\boldsymbol{k}}(\boldsymbol{x})=\prod_{j=1}^{d}{}_{b}\mathrm{wal}_{k_{j}}(x_{j}).

The Walsh series expansion of $f(\boldsymbol{x})$ is

f(\boldsymbol{x})\sim\sum_{\boldsymbol{k}\in\mathbb{N}_{0}^{d}}\hat{f}(\boldsymbol{k})\,{}_{b}\mathrm{wal}_{\boldsymbol{k}}(\boldsymbol{x}),\quad\text{where }\hat{f}(\boldsymbol{k})=\int_{[0,1)^{d}}f(\boldsymbol{x})\overline{{}_{b}\mathrm{wal}_{\boldsymbol{k}}(\boldsymbol{x})}\,\mathrm{d}\boldsymbol{x}.

The $d$-dimensional $b$-adic Walsh function system is a complete orthonormal basis in $L^{2}([0,1)^{d})$ [7, Theorem A.11] and the series expansion converges to $f$ in $L^{2}$.

While our integrand is real valued, it will also satisfy an expansion written in terms of complex numbers. For real valued $f$,

\mathrm{Var}(f)=\sum_{\boldsymbol{k}\in\mathbb{N}_{0}^{d}\setminus\{\boldsymbol{0}\}}|\hat{f}(\boldsymbol{k})|^{2}.

The variance under scrambled nets is different. To study it we group the Walsh coefficients. For $\boldsymbol{\ell}\in\mathbb{N}_{0}^{d}$ let

C_{\boldsymbol{\ell}}=\bigl\{\boldsymbol{k}\in\mathbb{N}_{0}^{d}\bigm|\lfloor b^{\ell_{j}-1}\rfloor\leqslant k_{j}<b^{\ell_{j}},\ 1\leqslant j\leqslant d\bigr\}.

Then define

\beta_{\boldsymbol{\ell}}(\boldsymbol{x})=\sum_{\boldsymbol{k}\in C_{\boldsymbol{\ell}}}\hat{f}(\boldsymbol{k})\,{}_{b}\mathrm{wal}_{\boldsymbol{k}}(\boldsymbol{x}).

The functions $\beta_{\boldsymbol{\ell}}$ are orthogonal in that $\int_{[0,1)^{d}}\beta_{\boldsymbol{\ell}}(\boldsymbol{x})\overline{\beta_{\boldsymbol{\ell}^{\prime}}(\boldsymbol{x})}\,\mathrm{d}\boldsymbol{x}=0$ when $\boldsymbol{\ell}^{\prime}\neq\boldsymbol{\ell}$. For $\boldsymbol{\ell}\neq\boldsymbol{0}$, $\beta_{\boldsymbol{\ell}}(\boldsymbol{x})$ has variance

\sigma^{2}_{\boldsymbol{\ell}}=\int_{[0,1)^{d}}|\beta_{\boldsymbol{\ell}}(\boldsymbol{x})|^{2}\,\mathrm{d}\boldsymbol{x}=\sum_{\boldsymbol{k}\in C_{\boldsymbol{\ell}}}|\hat{f}(\boldsymbol{k})|^{2}.

If $\boldsymbol{x}_{i}$ are a scrambled version of original points $\boldsymbol{a}_{i}\in[0,1)^{d}$ then under this scrambling

\mathrm{Var}(\hat{\mu}_{n})=\frac{1}{n}\sum_{\boldsymbol{\ell}\in\mathbb{N}_{0}^{d}\setminus\{\boldsymbol{0}\}}\Gamma_{\boldsymbol{\ell}}\sigma^{2}_{\boldsymbol{\ell}} \qquad (4)

for a collection of gain coefficients $\Gamma_{\boldsymbol{\ell}}\geqslant 0$ that depend on the $\boldsymbol{a}_{i}$. This expression can also be obtained through a base $b$ Haar wavelet decomposition [27]. Our $\Gamma_{\boldsymbol{\ell}}$ equals $nG_{\boldsymbol{\ell}}$ from [7]. The variance of $\hat{\mu}$ under IID MC sampling is $(1/n)\sum_{\boldsymbol{\ell}\in\mathbb{N}_{0}^{d}\setminus\{\boldsymbol{0}\}}\sigma^{2}_{\boldsymbol{\ell}}$, so $\Gamma_{\boldsymbol{\ell}}<1$ corresponds to integrating the term $\beta_{\boldsymbol{\ell}}(\boldsymbol{x})$ with less variance than MC does.

If scrambling of [26] or [22] is applied to $\boldsymbol{a}_{i}$ then

\Gamma_{\boldsymbol{\ell}}=\frac{1}{n}\sum_{i,i^{\prime}=0}^{n-1}\prod_{j=1}^{d}\frac{b\,\mathbf{1}\{\lfloor b^{\ell_{j}}a_{i,j}\rfloor=\lfloor b^{\ell_{j}}a_{i^{\prime},j}\rfloor\}-\mathbf{1}\{\lfloor b^{\ell_{j}-1}a_{i,j}\rfloor=\lfloor b^{\ell_{j}-1}a_{i^{\prime},j}\rfloor\}}{b-1}. \qquad (5)

This holds for any $\boldsymbol{a}_{i}$, not just digital nets. When the $\boldsymbol{a}_{i}$ are the first $b^{m}$ points of a $(t,d)$-sequence in base $b$ then $\Gamma=\sup_{\boldsymbol{\ell}}\Gamma_{\boldsymbol{\ell}}<\infty$ (uniformly in $m$) so that $\mathrm{Var}(\hat{\mu})\leqslant\Gamma\sigma^{2}/n$. Similarly for any $\boldsymbol{\ell}\in\mathbb{N}_{0}^{d}$ we have $\Gamma_{\boldsymbol{\ell}}\to 0$ as $n=b^{m}\to\infty$ in a $(t,d)$-sequence in base $b$, from which $\mathrm{Var}(\hat{\mu}_{n})=o(1/n)$. For a $(t,m,d)$-net in base $b$, one can show that the gain coefficients $\Gamma_{\boldsymbol{\ell}}=0$ for all $\boldsymbol{\ell}$ with $|\boldsymbol{\ell}|\leqslant m-t$.
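As a sanity check on equation (5), here is a small sketch of ours that evaluates $\Gamma_{\boldsymbol{\ell}}$ directly from the points $\boldsymbol{a}_{i}$; it is $O(n^{2}d)$ and only meant for tiny examples. The $n=8$ equispaced points $\{i/8\}$ in $d=1$ form a $(0,3,1)$-net in base $2$, so $\Gamma_{\ell}=0$ for $\ell\leqslant 3$, while larger $\ell$ can have nonzero gain.

```python
# Direct evaluation of the gain coefficient Gamma_ell in equation (5) (sketch).
import numpy as np

def gain_coefficient(a, ell, b=2):
    """a: (n, d) points in [0,1)^d; ell: length-d array of nonnegative integers."""
    n, d = a.shape
    total = 0.0
    for i in range(n):
        for ip in range(n):
            prod = 1.0
            for j in range(d):
                hi = np.floor(b ** ell[j] * a[i, j]) == np.floor(b ** ell[j] * a[ip, j])
                lo = np.floor(b ** (ell[j] - 1) * a[i, j]) == np.floor(b ** (ell[j] - 1) * a[ip, j])
                prod *= (b * hi - lo) / (b - 1)
            total += prod
    return total / n

a = np.arange(8)[:, None] / 8.0                     # a (0,3,1)-net in base 2
print(gain_coefficient(a, np.array([1])), gain_coefficient(a, np.array([4])))  # 0.0 and 1.0
```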

3.2 Walsh decomposition after pre-integration

Proposition 3.1.

For $f\in L^{2}[0,1)^{d}$ and $j\in 1{:}d$, let $g$ be $f$ pre-integrated over $x_{j}$. Then for $\boldsymbol{k}\in\mathbb{N}_{0}^{d}$,

\hat{g}(\boldsymbol{k})=\begin{cases}\hat{f}(\boldsymbol{k}),&k_{j}=0\\ 0,&k_{j}>0.\end{cases} \qquad (6)
Proof.

We write

\hat{g}(\boldsymbol{k})=\int_{[0,1)^{d-1}}\int_{0}^{1}g(\boldsymbol{x})\overline{{}_{b}\mathrm{wal}_{\boldsymbol{k}}(\boldsymbol{x})}\,\mathrm{d}x_{j}\,\mathrm{d}\boldsymbol{x}_{-j}=\int_{[0,1)^{d-1}}g(\boldsymbol{x})\int_{0}^{1}\overline{{}_{b}\mathrm{wal}_{\boldsymbol{k}}(\boldsymbol{x})}\,\mathrm{d}x_{j}\,\mathrm{d}\boldsymbol{x}_{-j}

because $g(\boldsymbol{x})$ does not depend on $x_{j}$. If $k_{j}>0$, then the inner integral vanishes, establishing the second clause in (6). If $k_{j}=0$ then ${}_{b}\mathrm{wal}_{k_{j}}(x_{j})=1$ for all $x_{j}$ and the inner integral equals $\prod_{\ell\neq j}\overline{{}_{b}\mathrm{wal}_{k_{\ell}}(x_{\ell})}=\overline{{}_{b}\mathrm{wal}_{\boldsymbol{k}}(\boldsymbol{x})}$, so that $\hat{g}(\boldsymbol{k})=\int_{[0,1)^{d-1}}g(\boldsymbol{x})\overline{{}_{b}\mathrm{wal}_{\boldsymbol{k}}(\boldsymbol{x})}\,\mathrm{d}\boldsymbol{x}_{-j}=\int_{[0,1)^{d}}f(\boldsymbol{x})\overline{{}_{b}\mathrm{wal}_{\boldsymbol{k}}(\boldsymbol{x})}\,\mathrm{d}\boldsymbol{x}=\hat{f}(\boldsymbol{k})$ because ${}_{b}\mathrm{wal}_{\boldsymbol{k}}$ does not depend on $x_{j}$, establishing the first clause. ∎

Theorem 3.2.

For $\boldsymbol{a}_{0},\dots,\boldsymbol{a}_{n-1}\in[0,1)^{d}$ let $\boldsymbol{x}_{0},\dots,\boldsymbol{x}_{n-1}$ be a scrambled version of them using the algorithm from [26] or [22]. Let $f\in L^{2}[0,1)^{d}$ and for $j\in 1{:}d$, let $g$ be $f$ pre-integrated over $x_{j}$. Then

\mathrm{Var}\Bigl(\frac{1}{n}\sum_{i=0}^{n-1}g(\boldsymbol{x}_{i})\Bigr)\leqslant\mathrm{Var}\Bigl(\frac{1}{n}\sum_{i=0}^{n-1}f(\boldsymbol{x}_{i})\Bigr).
Proof.

With either $f$ or $g$ we have the same gain coefficients $\Gamma_{\boldsymbol{\ell}}$ for $\boldsymbol{\ell}\in\mathbb{N}_{0}^{d}$. However

\sigma^{2}_{\boldsymbol{\ell}}(g)=\sum_{\boldsymbol{k}\in C_{\boldsymbol{\ell}}}|\hat{g}(\boldsymbol{k})|^{2}=\sum_{\boldsymbol{k}\in C_{\boldsymbol{\ell}},\,k_{j}=0}|\hat{f}(\boldsymbol{k})|^{2}\leqslant\sum_{\boldsymbol{k}\in C_{\boldsymbol{\ell}}}|\hat{f}(\boldsymbol{k})|^{2}=\sigma^{2}_{\boldsymbol{\ell}}(f).

The result now follows from (4). ∎

Theorem 3.2 shows that pre-integration does not increase the variance under scrambling. This holds whether or not the underlying points are a digital net, though of course the main case of interest is for scrambling of digital nets and sequences.

Pre-integration has another benefit that is not captured by Theorem 3.2. By reducing the input dimension from $d$ to $d-1$ we might be able to find better points in $[0,1]^{d-1}$ than the $(t,m,d)$-net we would otherwise use in $[0,1]^{d}$. Those improved points might have some smaller gain coefficients or they might be a $(t^{\prime},m,d-1)$-net in base $b$ with $t^{\prime}<t$.

For scrambled net sampling, reducing the dimension reduces an upper bound on the variance. For any function $f\in L^{2}[0,1)^{d}$, the variance using a scrambled $(t,m,d)$-net in base $b$ is at most $b^{t+d}$ times the MC variance. Reducing the dimension reduces the bound to $b^{t+d-1}$ times the MC variance. For digital constructions in base $2$, there are sharper bounds on these ratios, $2^{t+d-1}$ and $2^{t+d-2}$, respectively [33], and for some nets described there, even lower bounds apply.

As remarked above, pre-integration over a variable $x_{j}$ that $f(\boldsymbol{x})$ uses will reduce the variance under scrambled net sampling. This reduction does not require $f$ to be monotone in $x_{j}$, though such cases have the potential to bring a greater improvement. Pre-integration can either increase or decrease the mean dimension because

\nu(f)=\frac{\sum_{\boldsymbol{k}\in\mathbb{N}_{0}^{d}\setminus\{\boldsymbol{0}\}}|\hat{f}(\boldsymbol{k})|^{2}\,\|\boldsymbol{k}\|_{0}}{\sum_{\boldsymbol{k}\in\mathbb{N}_{0}^{d}\setminus\{\boldsymbol{0}\}}|\hat{f}(\boldsymbol{k})|^{2}}\quad\text{and}\quad\nu(g_{j})=\frac{\sum_{\boldsymbol{k}\in\mathbb{N}_{0}^{d}\setminus\{\boldsymbol{0}\}}\mathbf{1}\{k_{j}=0\}\,|\hat{f}(\boldsymbol{k})|^{2}\,\|\boldsymbol{k}\|_{0}}{\sum_{\boldsymbol{k}\in\mathbb{N}_{0}^{d}\setminus\{\boldsymbol{0}\}}\mathbf{1}\{k_{j}=0\}\,|\hat{f}(\boldsymbol{k})|^{2}}

and pre-integration could possibly reduce the denominator by a greater proportion than it reduces the numerator.

3.3 Choice of $x_{j}$

In order to choose $x_{j}$ to pre-integrate over, we can look at the variance reduction we get. Pre-integrating over $x_{j}$ reduces the scrambled net variance by

\frac{1}{n}\sum_{\boldsymbol{\ell}\in\mathbb{N}_{0}^{d}\setminus\{\boldsymbol{0}\}}\Gamma_{\boldsymbol{\ell}}\sigma^{2}_{\boldsymbol{\ell}}\,\mathbf{1}\{\ell_{j}>0\}=\frac{1}{n}\sum_{\boldsymbol{k}\in\mathbb{N}_{0}^{d}\setminus\{\boldsymbol{0}\}}\Gamma_{\boldsymbol{\ell}}\,|\hat{f}(\boldsymbol{k})|^{2}\,\mathbf{1}\{k_{j}>0\}. \qquad (7)

Evaluating this quantity for each $j\in 1{:}d$ might be more expensive than getting a good estimate of $\mu$. However we don’t need to find the best $j$. Any $j$ where $f$ depends on $x_{j}$ will bring some improvement. Below we develop a principled and computationally convenient choice by choosing the $j$ which is most important as measured by a Sobol’ index [41] from global sensitivity analysis [37].

A convenient proxy for (7) is

\frac{1}{n}\sum_{\boldsymbol{k}\in\mathbb{N}_{0}^{d}\setminus\{\boldsymbol{0}\}}|\hat{f}(\boldsymbol{k})|^{2}\,\mathbf{1}\{k_{j}>0\}=\frac{1}{n}\sum_{u\subseteq 1{:}d}\sigma^{2}_{u}\,\mathbf{1}\{j\in u\} \qquad (8)

where $\sigma^{2}_{u}$ is the ANOVA variance component for the set $u$. The equality above follows because the ANOVA can be defined, as Sobol’ [40] did, by collecting up the terms involving $x_{j}$ for $j\in u$ from the orthogonal decomposition. Sobol’ used Haar functions. The right hand side of (8) equals $\overline{\tau}^{2}_{j}/n$. It counts all the variance components in which variable $j$ participates. From the orthogonality properties of ANOVA effects it follows that

\overline{\tau}^{2}_{j}=\frac{1}{2}\int_{[0,1]^{d+1}}\bigl(f(\boldsymbol{z}_{j}{:}\boldsymbol{x}_{-j})-f(\boldsymbol{x})\bigr)^{2}\,\mathrm{d}\boldsymbol{z}_{j}\,\mathrm{d}\boldsymbol{x}.

The Jansen estimator [19] is an estimate of the above integral that can be done by a $d+1$ dimensional MC or QMC or RQMC sampling algorithm. Our main interest in this Sobol’ index estimator is that we use it as a point of comparison to the use of active subspaces in choosing a projection of a Gaussian vector along which to pre-integrate.
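A minimal sketch of the Jansen estimator follows (ours, not the paper's code); it uses a $d+1$ dimensional scrambled Sobol' input as described above, and the additive test function with known total indices is only for illustration.

```python
# Jansen's estimator of the (unnormalized) total Sobol' index for coordinate j (sketch).
import numpy as np
from scipy.stats import qmc

def jansen_total_index(f, d, j, m=12, seed=2):
    u = qmc.Sobol(d=d + 1, scramble=True, seed=seed).random_base2(m)
    x = u[:, :d]                     # base points x in [0,1]^d
    x_mix = x.copy()
    x_mix[:, j] = u[:, d]            # replace coordinate j by a fresh value: z_j : x_{-j}
    return 0.5 * np.mean((f(x_mix) - f(x)) ** 2)

# Example: for f(x) = sum_j c_j x_j, the total index of x_j is c_j^2 / 12.
c = np.array([1.0, 2.0, 3.0])
f = lambda x: x @ c
print([jansen_total_index(f, 3, j) for j in range(3)])  # approx [1/12, 4/12, 9/12]
```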

4 Active subspace method

Without loss of generality an expectation defined with respect to $\boldsymbol{x}\sim\mathcal{N}(0,\Sigma)$ for nonsingular $\Sigma\in\mathbb{R}^{d\times d}$ can be written as an expectation with respect to $\boldsymbol{x}\sim\mathcal{N}(0,I)$. For a unit vector $\theta\in\mathbb{R}^{d}$, we will pre-integrate over $\boldsymbol{x}^{\mathsf{T}}\theta\sim\mathcal{N}(0,1)$ and then the problem is to make a principled choice of $\theta$. It would not be practical to seek an optimal choice.

Our proposal is to use active subspaces [3]. As mentioned in the introduction we let

C=\mathbb{E}(\nabla f(\boldsymbol{x})\nabla f(\boldsymbol{x})^{\mathsf{T}})

and then let $\Theta[1{:}r]$ comprise the $r$ leading eigenvectors of $C$. The original use for active subspaces is to approximate $f(\boldsymbol{x})\approx\tilde{f}(\Theta[1{:}r]^{\mathsf{T}}\boldsymbol{x})$ for some function $\tilde{f}$ on $\mathbb{R}^{r}$. It is well known that one can construct functions where the active subspace will be a bad choice over which to approximate. For instance, with $f(\boldsymbol{x})=\sin(10^{6}x_{1})+100x_{2}$ the $r=1$ active subspace provides a function of $x_{1}$ alone, while a function of $x_{2}$ alone can provide a better approximation than a function of $x_{1}$ alone can. Active subspaces remain useful for approximation because the motivating problems are not so pathological and there is a human in the loop to catch such things. They also have an enormous practical advantage that one set of evaluations of $\nabla f$ can be used in the search for $\Theta$ instead of having every candidate $\Theta$ require its own evaluations of $\nabla f$. Using active subspaces for integration retains that advantage.
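A quick numeric check of the cautionary example above (our own illustration, not from the paper) estimates $C$ by plain MC for $f(\boldsymbol{x})=\sin(10^{6}x_{1})+100x_{2}$: the leading eigenvector is essentially $e_{1}$, even though the $x_{2}$ term carries far more variance.

```python
# Numeric illustration (sketch): the r = 1 active subspace for
# f(x) = sin(1e6 x_1) + 100 x_2 picks x_1 although x_2 explains more variance.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((100_000, 2))
grad = np.column_stack([1e6 * np.cos(1e6 * x[:, 0]), np.full(len(x), 100.0)])
C = grad.T @ grad / len(x)
lam, Theta = np.linalg.eigh(C)                  # eigenvalues in ascending order
print(Theta[:, -1])                             # leading eigenvector, close to (1, 0)
print(np.var(np.sin(1e6 * x[:, 0])), np.var(100.0 * x[:, 1]))  # about 0.5 versus about 1e4
```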

In our setting, we take $r=1$ and pre-integrate over $\theta^{\mathsf{T}}\boldsymbol{x}$ where $\theta$ is the leading eigenvector of $C$. That is, $\theta$ maximizes $\theta^{\mathsf{T}}\mathbb{E}(\nabla f(\boldsymbol{x})\nabla f(\boldsymbol{x})^{\mathsf{T}})\theta$ over $d$ dimensional unit vectors. Now suppose that instead of using $f(\boldsymbol{x})$ we use $f_{Q}(\boldsymbol{x})=f(Q\boldsymbol{x})$ for an orthogonal matrix $Q\in\mathbb{R}^{d\times d}$. Then $\mathbb{E}(\nabla f_{Q}(\boldsymbol{x})\nabla f_{Q}(\boldsymbol{x})^{\mathsf{T}})=Q^{\mathsf{T}}CQ$ which is similar to $C$. It has the same eigenvalues and the leading eigenvector is $\tilde{\theta}=Q^{\mathsf{T}}\theta$. Furthermore, Theorem 3.1 of [47] shows that the invariance extends to the whole eigendecomposition.

4.1 Connection to a Sobol’ index

The discussion in Section 3 motivates pre-integration of $f(\boldsymbol{x})$ for $\boldsymbol{x}\sim\mathcal{N}(0,I)$ over a linear combination $\theta^{\mathsf{T}}\boldsymbol{x}$ having the largest Sobol’ index over unit vectors $\theta$. For $\theta_{1}=\theta$, let $\Theta=(\theta_{1},\theta_{2},\dots,\theta_{d})\in\mathbb{R}^{d\times d}$ be an orthogonal matrix and write

f_{\Theta}(\boldsymbol{x})=f(\Theta\boldsymbol{x})=f(x_{1}\theta_{1}+x_{2}\theta_{2}+\cdots+x_{d}\theta_{d}).

Then we define $\overline{\tau}^{2}_{\theta}(f)$ to be $\overline{\tau}^{2}_{1}$ in the ANOVA of $f_{\Theta}(\boldsymbol{x})$. First we show that $\overline{\tau}^{2}_{\theta}$ does not depend on the last $d-1$ columns of $\Theta$.

Let $z$, $\tilde{z}$ and $\boldsymbol{y}$ be independent with distributions $\mathcal{N}(0,1)$, $\mathcal{N}(0,1)$ and $\mathcal{N}(0,I_{d-1})$ respectively. Let $\boldsymbol{x}=z\theta_{1}+\Theta_{-1}\boldsymbol{y}$ and $\tilde{\boldsymbol{x}}=\tilde{z}\theta_{1}+\Theta_{-1}\boldsymbol{y}$. Using the Jansen formula for $\theta=\theta_{1}$,

\overline{\tau}^{2}_{\theta}=\frac{1}{2}\int_{\mathbb{R}}\int_{\mathbb{R}}\int_{\mathbb{R}^{d-1}}\bigl(f(z\theta_{1}+\Theta_{-1}\boldsymbol{y})-f(\tilde{z}\theta_{1}+\Theta_{-1}\boldsymbol{y})\bigr)^{2}\varphi(\boldsymbol{y})\,\mathrm{d}\boldsymbol{y}\,\varphi(\tilde{z})\,\mathrm{d}\tilde{z}\,\varphi(z)\,\mathrm{d}z. \qquad (9)

Now for an orthogonal matrix $Q\in\mathbb{R}^{(d-1)\times(d-1)}$, let

\widetilde{\Theta}=\Theta\begin{pmatrix}1&\boldsymbol{0}_{d-1}^{\mathsf{T}}\\ \boldsymbol{0}_{d-1}&Q\end{pmatrix}=\begin{pmatrix}\tilde{\theta}_{1}&\tilde{\theta}_{2}&\cdots&\tilde{\theta}_{d}\end{pmatrix}

where $\tilde{\theta}_{1}=\theta_{1}$. In this parameterization we get

\overline{\tau}^{2}_{\theta}=\frac{1}{2}\int_{\mathbb{R}}\int_{\mathbb{R}}\int_{\mathbb{R}^{d-1}}\bigl(f(z\theta_{1}+\widetilde{\Theta}_{-1}\boldsymbol{y})-f(\tilde{z}\theta_{1}+\widetilde{\Theta}_{-1}\boldsymbol{y})\bigr)^{2}\varphi(\boldsymbol{y})\,\mathrm{d}\boldsymbol{y}\,\varphi(\tilde{z})\,\mathrm{d}\tilde{z}\,\varphi(z)\,\mathrm{d}z
=\frac{1}{2}\int_{\mathbb{R}}\int_{\mathbb{R}}\int_{\mathbb{R}^{d-1}}\bigl(f(z\theta_{1}+\Theta_{-1}Q\boldsymbol{y})-f(\tilde{z}\theta_{1}+\Theta_{-1}Q\boldsymbol{y})\bigr)^{2}\varphi(\boldsymbol{y})\,\mathrm{d}\boldsymbol{y}\,\varphi(\tilde{z})\,\mathrm{d}\tilde{z}\,\varphi(z)\,\mathrm{d}z

which matches (9) after a change of variable. There is an even stronger invariance property in this setup. The random variable $\mathbb{E}(f_{\Theta}(\boldsymbol{x})\mid\boldsymbol{x}_{-1})$ (random because it depends on $\boldsymbol{x}_{-1}$) has a distribution that does not depend on $\theta_{2},\dots,\theta_{d}$.

Theorem 4.1.

Let $\boldsymbol{x}\sim\mathcal{N}(0,I_{d})$ for $f$ with $\mathbb{E}(f(\boldsymbol{x})^{2})<\infty$ and let $\Theta\in\mathbb{R}^{d\times d}$ be an orthogonal matrix with columns $\theta_{j}$ for $j=1,\dots,d$. Then the distribution of $\mathbb{E}(f_{\Theta}(\boldsymbol{x})\mid\boldsymbol{x}_{-1})$ does not depend on the last $d-1$ columns of $\Theta$.

Proof.

Let $\Theta=(\theta_{1}\ \Theta_{-1})$ and $\widetilde{\Theta}=(\theta_{1}\ \widetilde{\Theta}_{-1})$ be orthogonal $d\times d$ matrices. Define

f_{\Theta}(\boldsymbol{x})=f(\Theta\boldsymbol{x})=f(x_{1}\theta_{1}+\Theta_{-1}\boldsymbol{x}_{-1})\quad\text{and}\quad f_{\widetilde{\Theta}}(\boldsymbol{x})=f(\widetilde{\Theta}\boldsymbol{x})=f(x_{1}\theta_{1}+\widetilde{\Theta}_{-1}\boldsymbol{x}_{-1}).

Now $\Theta_{-1}\boldsymbol{x}_{-1}$ and $\widetilde{\Theta}_{-1}\boldsymbol{x}_{-1}$ both have the same $\mathcal{N}(0,I-\theta_{1}\theta_{1}^{\mathsf{T}})$ distribution independently of $x_{1}\theta_{1}$. Choose $A_{1}\subset\mathbb{R}$ and $A_{0}$ in the support of $\mathcal{N}(0,I-\theta_{1}\theta_{1}^{\mathsf{T}})$ with positive probability under that (singular) distribution. Then $\Pr(f_{\Theta}(\boldsymbol{x})\in A_{1}\mid\Theta_{-1}\boldsymbol{x}_{-1}\in A_{0})=\Pr(f_{\widetilde{\Theta}}(\boldsymbol{x})\in A_{1}\mid\widetilde{\Theta}_{-1}\boldsymbol{x}_{-1}\in A_{0})$.

For $A\subset\mathbb{R}$ and $B\subset\mathbb{R}^{d-1}$,

\Pr(f_{\Theta}(\boldsymbol{x})\in A\mid\boldsymbol{x}_{-1}\in B)=\Pr(f_{\Theta}(\boldsymbol{x})\in A\mid\Theta_{-1}\boldsymbol{x}_{-1}\in\Theta_{-1}B)
=\Pr(f_{\widetilde{\Theta}}(\boldsymbol{x})\in A\mid\widetilde{\Theta}_{-1}\boldsymbol{x}_{-1}\in\Theta_{-1}B)
=\Pr(f_{\widetilde{\Theta}}(\boldsymbol{x})\in A\mid\Theta_{-1}^{\mathsf{T}}\widetilde{\Theta}_{-1}\boldsymbol{x}_{-1}\in B)

where $\Theta_{-1}^{\mathsf{T}}\widetilde{\Theta}_{-1}\boldsymbol{x}_{-1}$ has the same $\mathcal{N}(0,I_{d-1})$ distribution that $\boldsymbol{x}_{-1}$ has. Then for $C\subset[0,1]$,

\Pr\bigl(\Pr(f_{\Theta}(\boldsymbol{x})\in A\mid\boldsymbol{x}_{-1}\in B)\in C\bigr)=\Pr\bigl(\Pr(f_{\widetilde{\Theta}}(\boldsymbol{x})\in A\mid\Theta_{-1}^{\mathsf{T}}\widetilde{\Theta}_{-1}\boldsymbol{x}_{-1}\in B)\in C\bigr)
=\Pr\bigl(\Pr(f_{\widetilde{\Theta}}(\boldsymbol{x})\in A\mid\boldsymbol{x}_{-1}\in B)\in C\bigr).

It follows that the distribution of $\Pr(f_{\widetilde{\Theta}}(\boldsymbol{x})\in A\mid\boldsymbol{x}_{-1})$ is the same as the distribution of $\Pr(f_{\Theta}(\boldsymbol{x})\in A\mid\boldsymbol{x}_{-1})$. Integrating over $A$ we get that $\mathbb{E}(f_{\widetilde{\Theta}}(\boldsymbol{x})\mid\boldsymbol{x}_{-1})$ has the same distribution as $\mathbb{E}(f_{\Theta}(\boldsymbol{x})\mid\boldsymbol{x}_{-1})$ does. ∎

Another consequence of Theorem 4.1 is that $\underline{\tau}^{2}_{\theta}(f)=\underline{\tau}^{2}_{1}(f_{\Theta})$ is unaffected by $\theta_{2},\dots,\theta_{d}$. Because the variance of $f$ is unchanged by making an orthogonal matrix transformation of its inputs, the normalized Sobol’ indices $\underline{\tau}^{2}_{\theta}/\sigma^{2}$ and $\overline{\tau}^{2}_{\theta}/\sigma^{2}$ are also invariant.

Finding the optimal $\theta$ would ordinarily require an expensive search because every estimate of $\overline{\tau}_{\theta}^{2}$ for a given $\theta$ would require its own collection of evaluations of $f$. Using a Poincaré inequality in [42] we can bound that Sobol’ index by

\overline{\tau}_{\theta}^{2}(f)\leqslant\mathbb{E}\bigl((\theta^{\mathsf{T}}\nabla f(\boldsymbol{x}))^{2}\bigr)=\theta^{\mathsf{T}}C\theta.

The active subspace direction thus maximizes an upper bound on the Sobol’ index for a projection. Next we develop a deeper correspondence between these two measures.

For a unit vector $\theta\in\mathbb{R}^{d}$, we can write $f(\boldsymbol{x})=f(\theta\theta^{\mathsf{T}}\boldsymbol{x}+(I-\theta\theta^{\mathsf{T}})\boldsymbol{x})$. If $\boldsymbol{x},\boldsymbol{z}$ are independent $\mathcal{N}(0,I)$ vectors then we can change the component of $\boldsymbol{x}$ parallel to $\theta$ by changing the argument of $f$ to be $\theta\theta^{\mathsf{T}}\boldsymbol{z}+(I-\theta\theta^{\mathsf{T}})\boldsymbol{x}$. This leaves the resulting point unchanged in the $d-1$ dimensional space orthogonal to $\theta$. Let $\tilde{x}=\theta^{\mathsf{T}}\boldsymbol{x}$ and $\tilde{z}=\theta^{\mathsf{T}}\boldsymbol{z}$. Then $\tilde{x},\tilde{z}\sim\mathcal{N}(0,1)$ and $(I-\theta\theta^{\mathsf{T}})\boldsymbol{x}\sim\mathcal{N}(0,I-\theta\theta^{\mathsf{T}})$ are all independent. If $f$ is differentiable, then by the mean value theorem

f(\theta\theta^{\mathsf{T}}\boldsymbol{z}+(I-\theta\theta^{\mathsf{T}})\boldsymbol{x})-f(\theta\theta^{\mathsf{T}}\boldsymbol{x}+(I-\theta\theta^{\mathsf{T}})\boldsymbol{x})=\theta^{\mathsf{T}}\nabla f(\theta\tilde{y}+(I-\theta\theta^{\mathsf{T}})\boldsymbol{x})(\tilde{z}-\tilde{x})

for a real number $\tilde{y}$ between $\tilde{x}$ and $\tilde{z}$. Using the Jansen formula, the Sobol’ index for this projection is

\frac{1}{2}\theta^{\mathsf{T}}\mathbb{E}\Bigl((\tilde{z}-\tilde{x})^{2}\nabla f(\theta\tilde{y}+(I-\theta\theta^{\mathsf{T}})\boldsymbol{x})\nabla f(\theta\tilde{y}+(I-\theta\theta^{\mathsf{T}})\boldsymbol{x})^{\mathsf{T}}\Bigr)\theta \qquad (10)

which matches $\theta^{\mathsf{T}}\mathbb{E}(\nabla f(\boldsymbol{x})\nabla f(\boldsymbol{x})^{\mathsf{T}})\theta$ over a $d-1$ dimensional subspace but differs from it as follows. First, it includes a weight factor $(\tilde{z}-\tilde{x})^{2}$ that puts more emphasis on pairs of inputs where $\theta^{\mathsf{T}}\boldsymbol{x}$ and $\theta^{\mathsf{T}}\boldsymbol{z}$ are far from each other. Second, the evaluation point projected onto $\theta$ equals $\tilde{y}$, which lies between two independent $\mathcal{N}(0,1)$ variables instead of having the $\mathcal{N}(0,1)$ distribution, and just where it lies between them depends on details of $f$; there could be more than one such $\tilde{y}$ for some $f$. The formula simplifies in an illustrative way for quadratic functions $f$.

Proposition 4.2.

If $f:\mathbb{R}^{d}\to\mathbb{R}$ is a quadratic function and $\theta\in\mathbb{R}^{d}$ is a unit vector, then the Sobol’ index $\overline{\tau}_{\theta}^{2}$ is

\theta^{\mathsf{T}}\mathbb{E}\biggl(\nabla f\Bigl(\frac{\theta\theta^{\mathsf{T}}\boldsymbol{x}}{\sqrt{2}}+(I-\theta\theta^{\mathsf{T}})\boldsymbol{x}\Bigr)\nabla f\Bigl(\frac{\theta\theta^{\mathsf{T}}\boldsymbol{x}}{\sqrt{2}}+(I-\theta\theta^{\mathsf{T}})\boldsymbol{x}\Bigr)^{\mathsf{T}}\biggr)\theta. \qquad (11)
Proof.

If $f$ is quadratic, then $\tilde{y}=(\tilde{z}+\tilde{x})/2\sim\mathcal{N}(0,1/2)$ and $\tilde{z}-\tilde{x}\sim\mathcal{N}(0,2)$ and $(I-\theta\theta^{\mathsf{T}})\boldsymbol{x}$ are all independent. Then $\mathbb{E}((\tilde{z}-\tilde{x})^{2})=2$ and $\theta\tilde{y}$ has the same distribution as $\theta\theta^{\mathsf{T}}\boldsymbol{x}/\sqrt{2}$, which is also independent of $(\tilde{z}-\tilde{x})$ and $(I-\theta\theta^{\mathsf{T}})\boldsymbol{x}$. Making those substitutions in (10) yields (11). ∎

The Sobol’ index in equation (11) matches the quantity optimized by the first active subspace apart from the divisor $\sqrt{2}$ affecting one of the $d$ dimensions. We can also show directly that for $\boldsymbol{x}\sim\mathcal{N}(0,I)$ and $f(\boldsymbol{x})=(1/2)\boldsymbol{x}^{\mathsf{T}}A\boldsymbol{x}+b^{\mathsf{T}}\boldsymbol{x}$ for a symmetric matrix $A$, the Sobol’ criterion reduces to $\theta^{\mathsf{T}}A^{2}\theta+(\theta^{\mathsf{T}}b)^{2}-(1/2)(\theta^{\mathsf{T}}A\theta)^{2}$ compared to an active subspace criterion of $\theta^{\mathsf{T}}A^{2}\theta+(\theta^{\mathsf{T}}b)^{2}$.
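The closed forms above are easy to confirm numerically. The following sketch (ours, under the paper's setup with a randomly generated quadratic) compares a Monte Carlo evaluation of the Jansen formula for $\overline{\tau}^{2}_{\theta}$ with $\theta^{\mathsf{T}}A^{2}\theta+(\theta^{\mathsf{T}}b)^{2}-\tfrac12(\theta^{\mathsf{T}}A\theta)^{2}$ and with the active subspace criterion $\theta^{\mathsf{T}}C\theta=\theta^{\mathsf{T}}A^{2}\theta+(\theta^{\mathsf{T}}b)^{2}$.

```python
# Monte Carlo check (sketch) of the quadratic-case formulas above.
import numpy as np

rng = np.random.default_rng(3)
d = 4
A = rng.standard_normal((d, d)); A = (A + A.T) / 2           # symmetric A
b = rng.standard_normal(d)
theta = rng.standard_normal(d); theta /= np.linalg.norm(theta)

f = lambda x: 0.5 * np.einsum('ij,jk,ik->i', x, A, x) + x @ b

n = 400_000
x = rng.standard_normal((n, d))
z = rng.standard_normal(n)                                   # fresh N(0,1) values
x_new = x + np.outer(z - x @ theta, theta)                   # resample theta^T x, keep the rest
tau_mc = 0.5 * np.mean((f(x_new) - f(x)) ** 2)               # Jansen formula by MC

q = theta @ A @ theta
tau_closed = theta @ A @ A @ theta + (theta @ b) ** 2 - 0.5 * q ** 2
as_criterion = theta @ A @ A @ theta + (theta @ b) ** 2      # theta^T C theta, C = A^2 + b b^T
print(tau_mc, tau_closed, as_criterion)
```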

4.2 Active subspace

Because $C=\mathbb{E}(\nabla f(\boldsymbol{x})\nabla f(\boldsymbol{x})^{\mathsf{T}})$ is positive semi-definite (PSD), it has the eigen-decomposition $C=\Theta D\Theta^{\mathsf{T}}$, where $\Theta=(\theta_{1},\ldots,\theta_{d})\in\mathbb{R}^{d\times d}$ is an orthogonal matrix consisting of eigenvectors of $C$, and $D=\mathrm{diag}(\lambda_{1},\ldots,\lambda_{d})$ with $\lambda_{1}\geqslant\cdots\geqslant\lambda_{d}\geqslant 0$ being the eigenvalues. Constantine et al. [4] prove that there exists a constant $c$ such that

\mathbb{E}\bigl(\bigl(f(\boldsymbol{x})-\mathbb{E}(f(\boldsymbol{x})\mid\Theta[1{:}r]^{\mathsf{T}}\boldsymbol{x})\bigr)^{2}\bigr)\leqslant c(\lambda_{r+1}+\cdots+\lambda_{d}) \qquad (12)

for all $f$ with a square integrable gradient. In general, the Poincaré constant $c$ depends on the support of the function and the probability measure. But for the multivariate standard Gaussian distribution, the Poincaré constant is always 1 [2, 35]. This is because $\Theta[1{:}r]^{\mathsf{T}}\boldsymbol{x}$ and $\Theta[-(1{:}r)]^{\mathsf{T}}\boldsymbol{x}$ are independent standard Gaussian variables, and thus

\mathbb{E}\bigl(\bigl(f(\boldsymbol{x})-\mathbb{E}(f(\boldsymbol{x})\mid\Theta[1{:}r]^{\mathsf{T}}\boldsymbol{x})\bigr)^{2}\mid\Theta[1{:}r]^{\mathsf{T}}\boldsymbol{x}\bigr)\leqslant\lambda_{r+1}+\cdots+\lambda_{d}

for all $\Theta[1{:}r]^{\mathsf{T}}\boldsymbol{x}$.

In our problem, we take $r=1$ and pre-integrate $f$ over $\theta^{\mathsf{T}}\boldsymbol{x}$, where $\theta$ is the first column of $\Theta$. Because we will end up with a $d-1$ dimensional integration problem it is convenient in an implementation to make $\theta^{\mathsf{T}}\boldsymbol{x}$ the last variable, not the first. For instance, one would use the first $d-1$ components in a Sobol’ sequence, not components $2$ through $d$. Taking $\theta$ to be the first column of $\Theta$, we compute with

g(\boldsymbol{x}_{-d})=\int_{-\infty}^{\infty}f(\theta x_{d}+\Psi\boldsymbol{x}_{-d})\,\varphi(x_{d})\,\mathrm{d}x_{d}

using a quadrature rule of negligible error, or a closed form expression if a suitable one is available. Here $\Psi\in\mathbb{R}^{d\times(d-1)}$ is an orthonormal matrix whose columns are orthogonal to $\theta$. We then integrate this $g$ over $d-1$ variables by RQMC. We can use $\Psi=\Theta[2{:}d]$. Or, if we want to avoid the cost of computing the full eigendecomposition of $C$, we can find $\theta_{1}$ by a power iteration and then use a Householder transformation

\Theta=I-2\boldsymbol{w}\boldsymbol{w}^{\mathsf{T}},\quad\text{where}\quad\boldsymbol{w}=\frac{\theta-e_{1}}{\|\theta-e_{1}\|}

and $e_{1}=(1,0,0,\ldots,0)^{\mathsf{T}}$. This $\Theta$ is an orthogonal matrix whose first column is $\theta$ and again we can choose $\Psi=\Theta[2{:}d]$. In our numerical work, we have used $\Theta[2{:}d]$ from the eigendecomposition instead of the Householder transformation because of the effective dimension motivation for those eigenvectors given by [45].
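A small sketch (ours) of the Householder construction just described follows: it builds an orthogonal $\Theta$ whose first column is a given unit vector $\theta$, so that $\Psi=\Theta[2{:}d]$ spans the complement of $\theta$.

```python
# Householder frame (sketch): orthogonal Theta with first column theta.
import numpy as np

def householder_frame(theta):
    d = len(theta)
    e1 = np.zeros(d); e1[0] = 1.0
    w = theta - e1
    nw = np.linalg.norm(w)
    if nw < 1e-12:                       # theta is already e_1
        return np.eye(d)
    w /= nw
    return np.eye(d) - 2.0 * np.outer(w, w)

theta = np.ones(5) / np.sqrt(5)
Theta = householder_frame(theta)
print(np.allclose(Theta[:, 0], theta), np.allclose(Theta.T @ Theta, np.eye(5)))
```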

In practice, we must estimate CC. In the above description, we replace CC by

C^=1Mi=0M1f(𝒙i)f(𝒙i)𝖳,\displaystyle\widehat{C}=\frac{1}{M}\sum_{i=0}^{M-1}\nabla f(\boldsymbol{x}_{i})\nabla f(\boldsymbol{x}_{i})^{\mathsf{T}}, (13)

for an RQMC generated sample with 𝒙i𝒩(0,Id)\boldsymbol{x}_{i}\sim\mathcal{N}(0,I_{d}) and then define θ\theta and Θ1\Theta_{-1} using C^\widehat{C} in place of CC. We summarize the procedure in Algorithm 1.

Input : Integrand ff, number of samples MM to compute C^\hat{C}, number of samples nn to compute μ^\hat{\mu}
Output : An estimate μ^\hat{\mu} of df(𝒙)φ(𝒙)𝑑𝒙\int_{\mathbb{R}^{d}}f(\boldsymbol{x})\varphi(\boldsymbol{x})d\boldsymbol{x}
/* Find active subspaces */
Take 𝒙0,,𝒙M1𝒩(0,Id)\boldsymbol{x}_{0},\ldots,\boldsymbol{x}_{M-1}\sim\mathcal{N}(0,I_{d}) by RQMC.
Compute C^=1Mi=0M1f(𝒙i)f(𝒙i)𝖳\widehat{C}=\frac{1}{M}\sum_{i=0}^{M-1}\nabla f(\boldsymbol{x}_{i})\nabla f(\boldsymbol{x}_{i})^{\mathsf{T}}.
Compute the eigen-decomposition C^=Θ^D^Θ^𝖳\widehat{C}=\widehat{\Theta}\widehat{D}\widehat{\Theta}^{\mathsf{T}}.
/* Pre-integration */
Let θ\theta be the first column of Θ^\widehat{\Theta}
Compute the pre-integrated function f~{\tilde{f}}, that is 𝔼(f(𝒙)θ𝖳𝒙)\mathbb{E}(f(\boldsymbol{x})\!\mid\!\theta^{\mathsf{T}}\boldsymbol{x}), by a closed form or a quadrature rule
/* RQMC integration */
Take 𝒙0,,𝒙n1𝒩(0,Id1)\boldsymbol{x}_{0},\ldots,\boldsymbol{x}_{n-1}\sim\mathcal{N}(0,I_{d-1}) by RQMC.
Let μ^=1ni=0n1f~(𝒙i)\hat{\mu}=\frac{1}{n}\sum_{i=0}^{n-1}{\tilde{f}}(\boldsymbol{x}_{i}).
Algorithm 1: Pre-integration with active subspace
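As a concrete companion to Algorithm 1, the sketch below (ours) strings the steps together, assuming scipy for scrambled Sobol' points and using a Gauss–Hermite rule for the pre-integration step in place of a closed form. The vectorized signatures of f and grad_f, the function name, and the default values of M, n and the quadrature size are our assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.stats import norm, qmc

def as_preintegration(f, grad_f, d, M=128, n=2**13, n_quad=64, seed=0):
    """Sketch of Algorithm 1. f maps an (N, d) array to (N,) values and
    grad_f maps (N, d) to (N, d). Pre-integration over theta^T x uses
    Gauss-Hermite quadrature, standing in for a closed form."""
    # Step 1: estimate C = E(grad f grad f^T) from an RQMC Gaussian sample.
    U = qmc.Sobol(d, scramble=True, seed=seed).random(M)
    X = norm.ppf(np.clip(U, 1e-12, 1 - 1e-12))
    G = grad_f(X)
    C_hat = G.T @ G / M
    # Step 2: leading eigenvector theta and an orthonormal complement Psi.
    evals, evecs = np.linalg.eigh(C_hat)      # ascending eigenvalues
    Theta = evecs[:, ::-1]                    # decreasing order
    theta, Psi = Theta[:, 0], Theta[:, 1:]
    # Step 3: pre-integrated g(x_{-d}) = E f(theta z + Psi x_{-d}), z ~ N(0,1).
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_quad)
    weights = weights / np.sqrt(2.0 * np.pi)  # normalize to the N(0,1) density
    def g(X_rest):                            # X_rest has shape (N, d-1)
        base = X_rest @ Psi.T                 # rows are Psi x_{-d}
        vals = np.zeros(X_rest.shape[0])
        for z, w in zip(nodes, weights):
            vals += w * f(base + z * theta)
        return vals
    # Step 4: integrate g over the remaining d-1 variables by RQMC.
    U2 = qmc.Sobol(d - 1, scramble=True, seed=seed + 1).random(n)
    return g(norm.ppf(np.clip(U2, 1e-12, 1 - 1e-12))).mean()
```

In practice one would pass a vectorized integrand such as the Asian option payoff of Section 5 together with a finite-difference gradient.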

Using our prior notation we can now describe the approach of [44] more precisely. They first pre-integrate one variable in closed form, producing a d1d-1 dimensional integrand. They then apply gradient PCA (GPCA) [4] to the pre-integrated function to find a good d1d-1 dimensional rotation matrix. That is, they first find h(𝒙2:d):=𝔼(f(𝒙)𝒙2:d)h(\boldsymbol{x}_{2:d}):=\mathbb{E}(f(\boldsymbol{x})\!\mid\!\boldsymbol{x}_{2:d}), then compute

C~^=1Mi=0M1h(𝒙i)h(𝒙i)𝖳(d1)×(d1),𝒙i𝒩(0,Id1),\displaystyle\widehat{\widetilde{C}}=\frac{1}{M}\sum_{i=0}^{M-1}\nabla h(\boldsymbol{x}_{i})\nabla h(\boldsymbol{x}_{i})^{\mathsf{T}}\in\mathbb{R}^{(d-1)\times(d-1)},\quad\boldsymbol{x}_{i}\sim\mathcal{N}(0,I_{d-1}), (14)

using RQMC points 𝒙i\boldsymbol{x}_{i}. Then they find the eigen-decomposition C~^=V^Λ^V^𝖳\widehat{\widetilde{C}}=\widehat{V}\widehat{\Lambda}\widehat{V}^{\mathsf{T}}. Finally, they use RQMC to integrate the function h(V^𝒙)h(\widehat{V}\boldsymbol{x}) where 𝒙𝒩(0,Id1)\boldsymbol{x}\sim\mathcal{N}(0,I_{d-1}). The main difference is that they apply pre-integration to the original integrand f(𝒙)f(\boldsymbol{x}) while we apply pre-integration to the rotated integrand fΘ(𝒙)=f(Θ𝒙)f_{\Theta}(\boldsymbol{x})=f(\Theta\boldsymbol{x}). They conduct GPCA in the end as an approach to reduce effective dimension, while we conduct a similar GPCA (active subspace method) at the beginning to find the important subspace.

5 Application to option pricing

Here we study some Gaussian integrals arising from financial valuation. We assume that an asset price StS_{t}, such as a stock, follows a geometric Brownian motion satisfying the stochastic differential equation (SDE)

dSt=rStdt+σStdBt,\displaystyle\,\mathrm{d}S_{t}=rS_{t}\,\mathrm{d}t+\sigma S_{t}\,\mathrm{d}B_{t},

where BtB_{t} is a Brownian motion. Here, rr is the interest rate and σ>0\sigma>0 is the constant volatility for the asset. For an initial price S0S_{0}, the SDE above has a unique solution

St=S0exp((rσ22)t+σBt).\displaystyle S_{t}=S_{0}\exp\Bigl{(}\Bigl{(}r-\frac{\sigma^{2}}{2}\Bigr{)}t+\sigma B_{t}\Bigr{)}.

Suppose the maturity time of the option is TT. In practice, we simulate discrete Brownian motion. We call BB a dd-dimensional discrete Brownian motion if BB follows a multivariate Gaussian distribution with mean zero and covariance Σ\Sigma with Σij=Δtmin(i,j)\Sigma_{ij}=\Delta t\min(i,j), where Δt=T/d\Delta t=T/d is the length of each time interval and 1i,jd1\leqslant i,j\leqslant d. To sample a discrete Brownian motion, we can first find a d×dd\times d matrix RR such that RR𝖳=ΣRR^{\mathsf{T}}=\Sigma, then generate a standard Gaussian variable 𝒛𝒩(0,Id)\boldsymbol{z}\sim\mathcal{N}(0,I_{d}), and let B=R𝒛B=R\boldsymbol{z}. Taking RR to be the lower triangular matrix in the Cholesky decomposition of Σ\Sigma yields the standard construction. Using the usual eigen-decomposition Σ=UΛU𝖳\Sigma=U\Lambda U^{\mathsf{T}}, we can take R=UΛ1/2R=U\Lambda^{1/2}. This is called the principal component analysis (PCA) construction. For explicit forms of both these choices of RR, see [10].
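For concreteness, here is a small sketch (ours, in numpy) of the two ways of building RR described above.

```python
import numpy as np

def bm_construction(d, T, method="standard"):
    """Return R with R R^T = Sigma, Sigma_ij = (T/d) min(i, j), so that B = R z, z ~ N(0, I_d).
    'standard' uses the Cholesky factor; 'pca' uses U Lambda^{1/2} from the eigendecomposition."""
    dt = T / d
    idx = np.arange(1, d + 1)
    Sigma = dt * np.minimum.outer(idx, idx)
    if method == "standard":
        return np.linalg.cholesky(Sigma)       # lower triangular
    evals, evecs = np.linalg.eigh(Sigma)
    order = np.argsort(evals)[::-1]            # decreasing eigenvalues
    return evecs[:, order] * np.sqrt(evals[order])
```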

5.1 Option with one asset

When we use the matrix RR, we can approximate SjΔtS_{j\Delta t} by

Sj\displaystyle S_{j} =S0exp((rσ22)jΔt+σBj),1jd\displaystyle=S_{0}\exp\Bigl{(}\Bigl{(}r-\frac{\sigma^{2}}{2}\Bigr{)}j\Delta t+\sigma B_{j}\Bigr{)},\quad 1\leqslant j\leqslant d

where B=R𝒛B=R\boldsymbol{z} is the discrete Brownian motion. The arithmetic average of the stock price is given by

S¯(R,𝒛)=S0dj=1dexp((rσ22)jΔt+σk=1dRjkzk).\displaystyle\bar{S}(R,\boldsymbol{z})=\frac{S_{0}}{d}\sum_{j=1}^{d}\exp\Bigl{(}\Bigl{(}r-\frac{\sigma^{2}}{2}\Bigr{)}j\Delta t+\sigma\sum_{k=1}^{d}R_{jk}z_{k}\Bigr{)}.

Then the expected payoff of the arithmetic average Asian call option with strike price KK is 𝔼((S¯(R,𝒛)K)+)\mathbb{E}\bigl{(}(\bar{S}(R,\boldsymbol{z})-K)_{+}\bigr{)}, where the expectation is taken over 𝒛𝒩(0,Id)\boldsymbol{z}\sim\mathcal{N}(0,I_{d}).

Suppose that we want to marginalize over z1z_{1} before computing the expectation 𝔼((S¯(R,𝒛)K)+)\mathbb{E}((\bar{S}(R,\boldsymbol{z})-K)_{+}). If Rj1>0R_{j1}>0 for all 1jd1\leqslant j\leqslant d, then S¯(R,𝒛)\bar{S}(R,\boldsymbol{z}) is increasing in z1z_{1} for any value of 𝒛2:d\boldsymbol{z}_{2:d}. If we can find γ=γ(𝒛2:d)\gamma=\gamma(\boldsymbol{z}_{2:d}) such that

\bar{S}(R,(\gamma,\boldsymbol{z}_{2:d}))=K, (15)

then the pre-integration step becomes

𝔼((S¯(R,𝒛)K)+𝒛2:d)\displaystyle\mathbb{E}((\bar{S}(R,\boldsymbol{z})-K)_{+}\!\mid\!\boldsymbol{z}_{2:d})
=z1γ(𝒛2:d)(S¯(R,(z1,𝒛2:d))K)φ(z1)dz1\displaystyle=\int_{z_{1}\geqslant\gamma(\boldsymbol{z}_{2:d})}(\bar{S}(R,(z_{1},\boldsymbol{z}_{2:d}))-K)\varphi(z_{1})\,\mathrm{d}z_{1}
=S0dj=1dexp((rσ22)jΔt+σk=2dRjkzk+σ2Rj122)Φ¯(γσRj1)KΦ¯(γ),\displaystyle=\frac{S_{0}}{d}\sum_{j=1}^{d}\exp\Bigl{(}\Bigl{(}r-\frac{\sigma^{2}}{2}\Bigr{)}j\Delta t+\sigma\sum_{k=2}^{d}R_{jk}z_{k}+\frac{\sigma^{2}R_{j1}^{2}}{2}\Bigr{)}\bar{\Phi}(\gamma-\sigma R_{j1})-K\bar{\Phi}(\gamma), (16)

where Φ¯(x)=1Φ(x)\bar{\Phi}(x)=1-\Phi(x). In practice, Equation (15) can be solved by a root finding algorithm. For example, Newton iteration usually converges in only a few steps.
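A hedged sketch of this pre-integration step is given below (ours, in numpy/scipy). It writes S̄ = Σ_j a_j exp(c_j z_1) with c_j = σ R_{j1} assumed positive, solves (15), and applies (16). We use Brent's method with an expanding bracket rather than the Newton iteration described above; only the root-finding details differ.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def preintegrate_z1(R, z_rest, S0, K, r, sigma, T):
    """Closed-form pre-integration of the Asian call payoff over z_1 given z_{2:d}."""
    d = R.shape[0]
    dt = T / d
    j = np.arange(1, d + 1)
    a = (S0 / d) * np.exp((r - 0.5 * sigma**2) * j * dt + sigma * (R[:, 1:] @ z_rest))
    c = sigma * R[:, 0]                             # assumed positive, so S_bar increases in z_1
    h = lambda z1: np.sum(a * np.exp(c * z1)) - K   # residual of equation (15)
    lo, hi = -1.0, 1.0
    while h(lo) > 0:                                # expand the bracket; h(-inf) = -K < 0
        lo *= 2
    while h(hi) < 0:                                # and h(+inf) = +inf
        hi *= 2
    gamma = brentq(h, lo, hi)
    # Equation (16): E((S_bar - K)_+ | z_{2:d})
    return np.sum(a * np.exp(0.5 * c**2) * norm.sf(gamma - c)) - K * norm.sf(gamma)
```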

The condition that Rj1>0R_{j1}>0 for 1jd1\leqslant j\leqslant d is satisfied when we use the standard construction or the PCA construction of Brownian motion. Using the active subspace method, we use R~=RΘ\tilde{R}=R\Theta in place of RR, where Θ\Theta consists of the eigenvectors of C=𝔼(f(𝒛)f(𝒛)𝖳)C=\mathbb{E}(\nabla f(\boldsymbol{z})\nabla f(\boldsymbol{z})^{\mathsf{T}}) and here f(𝒛)=(S¯(R,𝒛)K)+f(\boldsymbol{z})=(\bar{S}(R,\boldsymbol{z})-K)_{+}. If every R~j1{\tilde{R}}_{j1} is negative then we replace R~{\tilde{R}} by R~-{\tilde{R}}. We have not proved that the components of the first column of R~{\tilde{R}} must all have the same sign, but that has always held for the integrands in our simulations. As a result we have not had to use a numerical quadrature.

We compare Algorithm 1 with other methods in the option pricing example considered in [44] and [14]. Apart from the payoff function of the call option, we also consider the Greeks: Delta, Gamma, Rho, Theta, and Vega. These are defined in [44]. We take the parameters d=50d=50, T=1T=1, σ=0.4\sigma=0.4, r=0.1r=0.1, S0=K=100S_{0}=K=100, the same as in [14]. We consider the following five methods:

  • AS+pre-int: our proposed active subspace pre-integration method (Algorithm 1), which applies the active subspace method to find the direction along which to pre-integrate,

  • pre-int+DimRed: the method proposed in [44], which first pre-integrates z1z_{1} and applies GPCA to conduct dimension reduction for the other d1d-1 variables,

  • pre-int: pre-integrating z1z_{1} with no dimension reduction,

  • RQMC: usual RQMC, and

  • MC: plain Monte Carlo.

We vary nn from 232^{3} to 2172^{17}. For each nn, we repeat the simulation 50 times and compute the root mean squared error (RMSE) defined by

150k=150(μ^(k)μ)2,\sqrt{\frac{1}{50}\sum_{k=1}^{50}(\hat{\mu}^{(k)}-\mu)^{2}},

where μ^(k)\hat{\mu}^{(k)} is the estimate in the kk-th replicate. The true value μ\mu is approximated by applying pre-int+DimRed with the PCA construction and n=217n=2^{17} RQMC samples, averaged over 30 independent replicates. We chose this method because it works well, and we wanted to avoid using AS+pre-int in case the model used for the ground truth had some advantage in reported accuracy.

The RMSE is plotted versus the sample size on the log–log scale. For the methods AS+pre-int and pre-int+DimRed, we use M=128M=128 samples to estimate CC as in (13). We approximate the gradients of the original integrand and the pre-integrated integrand by the finite difference

f(x)(f(x+ε𝐞1)f(x)ε,,f(x+ε𝐞d1)f(x)ε)𝖳,ε=106,\nabla f(x)\approx\left(\frac{f(x+\varepsilon{\mathbf{e}}_{1})-f(x)}{\varepsilon},\ldots,\frac{f(x+\varepsilon{\mathbf{e}}_{d-1})-f(x)}{\varepsilon}\right)^{\mathsf{T}},\quad\varepsilon=10^{-6},

matching the choice in [44]. We chose a small value of MM to keep the costs comparable to plain RQMC. Also because θ\theta is a local optimum of θ𝖳Cθ\theta^{\mathsf{T}}C\theta, we have θ^𝖳Cθ^=θ𝖳Cθ+O(θ^θ2)\hat{\theta}^{\mathsf{T}}C\hat{\theta}=\theta^{\mathsf{T}}C\theta+O(\|\hat{\theta}-\theta\|^{2}) so there are diminishing returns to accurate estimation of θ\theta. Finally, any θ\theta where ff varies along θ𝖳𝒙\theta^{\mathsf{T}}\boldsymbol{x} brings a variance reduction.
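For completeness, here is a minimal sketch (ours) of such a forward-difference gradient; the number of components matches the dimension of whichever integrand is being differentiated.

```python
import numpy as np

def fd_gradient(f, x, eps=1e-6):
    """Forward-difference approximation to the gradient of f at the point x,
    costing one extra function evaluation per coordinate."""
    fx = f(x)
    g = np.empty_like(x, dtype=float)
    for k in range(x.size):
        xk = x.copy()
        xk[k] += eps
        g[k] = (f(xk) - fx) / eps
    return g
```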

We consider both the standard construction (Figure 1) and the PCA construction (Figure 2) of Brownian motion. Several observations are in order:

  (a) With the standard construction, AS+pre-int dominates all the other methods for five of the six test functions and is tied for best with pre-int+DimRed for the other one (Gamma).

  (b) With the PCA construction, AS+pre-int, pre-int+DimRed and pre-int are the best methods for the payoff, Delta, Theta, and Vega and are nearly equally good.

  (c) For Rho, pre-int+DimRed and pre-int are best, while for Gamma, AS+pre-int is best.

The performance of active subspace pre-integration is the same under either the standard or the PCA construction by invariance. For these Asian options it is already well known that the PCA is especially effective. Active subspace pre-integration finds something almost as good without special knowledge, coming out better in one example, worse in another and essentially the same in the other four.

Figure 1: Single asset option. Standard construction of Brownian motion with d=50d=50. This and subsequent figures include least squares estimated slopes on the log–log scale.
Figure 2: Single asset option. PCA construction of Brownian motion with d=50d=50.

5.2 Basket option

A basket option depends on the weighted average of several assets. Suppose that the LL assets S(1),,S(L)S^{(1)},\ldots,S^{(L)} follow the SDE

dSt()=rSt()dt+σSt()dBt(),\displaystyle\,\mathrm{d}S_{t}^{(\ell)}=r_{\ell}S_{t}^{(\ell)}\,\mathrm{d}t+\sigma_{\ell}S_{t}^{(\ell)}\,\mathrm{d}B_{t}^{(\ell)},

where {B()}1L\{B^{(\ell)}\}_{1\leqslant\ell\leqslant L} are standard Brownian motions with correlation

Corr(Bt(),Bt(k))=ρk\displaystyle\mathrm{Corr}(B_{t}^{(\ell)},B_{t}^{(k)})=\rho_{\ell k}

for all t>0t>0. For nonnegative weights w1,,wLw_{1},\ldots,w_{L} with w1++wL=1w_{1}+\ldots+w_{L}=1, the payoff function of the Asian basket call option is given by

(=1LwS¯()K)+\displaystyle\Biggl{(}\,\sum_{\ell=1}^{L}w_{\ell}\bar{S}^{(\ell)}-K\Biggr{)}_{+}

where S¯()\bar{S}^{(\ell)} is the arithmetic average of St()S^{(\ell)}_{t} over the time interval [0,T][0,T]. Here, we only consider L=2L=2 assets. To generate B(1),B(2)B^{(1)},B^{(2)} with correlation ρ\rho, we can generate two independent standard Brownian motions W(1),W(2)W^{(1)},W^{(2)} and let

B(1)=W(1)andB(2)=ρW(1)+1ρ2W(2).\displaystyle B^{(1)}=W^{(1)}\quad\text{and}\quad B^{(2)}=\rho W^{(1)}+\sqrt{1-\rho^{2}}W^{(2)}.

Following the same discretization as before, we can generate (𝒛𝖳,𝒛~𝖳)𝒩(0,I2d)(\boldsymbol{z}^{\mathsf{T}},\tilde{\boldsymbol{z}}^{\mathsf{T}})\sim\mathcal{N}(0,I_{2d}). Then for time steps j=1,,dj=1,\dots,d, let

Sj(1)\displaystyle S_{j}^{(1)} =S0(1)exp((r1σ122)jΔt+σ1k=1dRjkzk),and\displaystyle=S_{0}^{(1)}\exp\Bigl{(}\Bigl{(}r_{1}-\frac{\sigma_{1}^{2}}{2}\Bigr{)}j\Delta t+\sigma_{1}\sum_{k=1}^{d}R_{jk}z_{k}\Bigr{)},\quad\text{and}
Sj(2)\displaystyle S_{j}^{(2)} =S0(2)exp((r2σ222)jΔt+σ2(ρk=1dRjkzk+1ρ2k=1dRjkz~k)).\displaystyle=S_{0}^{(2)}\exp\Bigl{(}\Bigl{(}r_{2}-\frac{\sigma_{2}^{2}}{2}\Bigr{)}j\Delta t+\sigma_{2}\Bigl{(}\rho\sum_{k=1}^{d}R_{jk}z_{k}+\sqrt{1-\rho^{2}}\sum_{k=1}^{d}R_{jk}\tilde{z}_{k}\Bigr{)}\Bigr{)}.

Again, the d×dd\times d matrix RR can be constructed by the standard construction or the PCA construction. Thus, the expected payoff can be written as

𝔼((w1dj=1dSj(1)+w2dj=1dSj(2)K)+),\displaystyle\mathbb{E}\Bigl{(}\Bigl{(}\frac{w_{1}}{d}\sum_{j=1}^{d}S_{j}^{(1)}+\frac{w_{2}}{d}\sum_{j=1}^{d}S_{j}^{(2)}-K\Bigr{)}_{+}\Bigr{)},

where the expectation is taken over (𝒛𝖳,𝒛~𝖳)𝖳𝒩(0,I2d)(\boldsymbol{z}^{\mathsf{T}},\tilde{\boldsymbol{z}}^{\mathsf{T}})^{\mathsf{T}}\sim\mathcal{N}(0,I_{2d}).
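As an illustration, here is a minimal sketch (ours, in numpy) of this payoff for a single draw (z, z̃); here S0, r, sigma and w are length-2 sequences holding the per-asset parameters.

```python
import numpy as np

def basket_payoff(R, z, z_tilde, S0, K, r, sigma, rho, w, T):
    """Two-asset Asian basket call payoff for one draw, using B1 = R z and
    B2 = R (rho z + sqrt(1 - rho^2) z_tilde)."""
    d = R.shape[0]
    dt = T / d
    j = np.arange(1, d + 1)
    B1 = R @ z
    B2 = R @ (rho * z + np.sqrt(1.0 - rho**2) * z_tilde)
    S1 = S0[0] * np.exp((r[0] - 0.5 * sigma[0]**2) * j * dt + sigma[0] * B1)
    S2 = S0[1] * np.exp((r[1] - 0.5 * sigma[1]**2) * j * dt + sigma[1] * B2)
    return max(w[0] * S1.mean() + w[1] * S2.mean() - K, 0.0)
```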

In the pre-integration step, we choose to integrate out z1z_{1}. This can be carried out as in equations (15) and (16) provided that the first column of R~\tilde{R} is nonnegative. We take d=50d=50, T=1T=1, ρ=0.5\rho=0.5, S0(1)=S0(2)=100S_{0}^{(1)}=S_{0}^{(2)}=100 and K=95K=95. The RMSE values are plotted in Figure 3. In the left panel, we take r1=0.1r_{1}=0.1, r2=0.2r_{2}=0.2, σ1=0.2\sigma_{1}=0.2, σ2=0.4\sigma_{2}=0.4, w1=0.8w_{1}=0.8 and w2=0.2w_{2}=0.2. In the right panel, we take r1=0.2r_{1}=0.2, r2=0.1r_{2}=0.1, σ1=0.4\sigma_{1}=0.4, σ2=0.2\sigma_{2}=0.2, w1=0.2w_{1}=0.2 and w2=0.8w_{2}=0.8. This reverses the roles of the two assets, which makes a difference when one pre-integrates over the first of the 2d2d inputs. We use M=128M=128 as before. In the top row of Figure 3, the matrix RR is obtained by the standard construction, while in the bottom row, RR is obtained by the PCA construction.

Figure 3: Basket option pricing.

A few observations are in order:

  (a) For the standard construction, pre-integrating over z1z_{1} brings little improvement over plain RQMC. But for the PCA construction, pre-integrating over z1z_{1} brings a big variance reduction.

  (b) The dimension reduction technique from [44] substantially improves the RMSE compared to pre-integration without dimension reduction. This improvement is particularly significant for the standard construction.

  (c) Active subspace pre-integration, AS+pre-int, has the best performance for both the standard and PCA constructions. It is even better than pre-integrating out the first principal component of Brownian motion with dimension reduction. In this example, the active subspace method is able to find a better direction than the first principal component over which to pre-integrate.

This problem required L=2L=2 Brownian motions and the above examples used the same decomposition to sample them both. A sharper principal components analysis would merge the two Brownian motions into a single 2d2d-dimensional process and use the principal components from their joint covariance matrix. We call this the full PCA construction as described next.

The processes B(1)=σ1R𝒛B^{(1)}=\sigma_{1}R\boldsymbol{z} and B(2)=σ2R(ρ𝒛+1ρ2𝒛~)B^{(2)}=\sigma_{2}R(\rho\boldsymbol{z}+\sqrt{1-\rho^{2}}\tilde{\boldsymbol{z}}) have the joint distribution

(B(1)B(2))𝒩(0,(σ12Σρσ1σ2Σρσ1σ2Σσ22Σ)).\displaystyle\begin{pmatrix}B^{(1)}\\ B^{(2)}\end{pmatrix}\sim\mathcal{N}\left(0,\begin{pmatrix}\sigma_{1}^{2}\Sigma&\rho\sigma_{1}\sigma_{2}\Sigma\\ \rho\sigma_{1}\sigma_{2}\Sigma&\sigma_{2}^{2}\Sigma\end{pmatrix}\right).

Let Σ~\widetilde{\Sigma} be the joint covariance matrix above. An alternative generation method is to pick R~\tilde{R} with R~R~𝖳=Σ~\tilde{R}\tilde{R}^{\mathsf{T}}=\tilde{\Sigma}, and let

(B(1)B(2))=R~(𝒛𝒛~), where (𝒛𝒛~)𝒩(0,I2d).\displaystyle\begin{pmatrix}B^{(1)}\\ B^{(2)}\end{pmatrix}=\tilde{R}\begin{pmatrix}\boldsymbol{z}\\ \tilde{\boldsymbol{z}}\end{pmatrix},\quad\text{ where }\quad\begin{pmatrix}\boldsymbol{z}\\ \tilde{\boldsymbol{z}}\end{pmatrix}\sim\mathcal{N}(0,I_{2d}).

The matrix R~\tilde{R} can be found by either a Cholesky decomposition or an eigendecomposition of Σ~\widetilde{\Sigma}. We call these the full standard construction and the full PCA construction, respectively. Taking B(1)=σ1R𝒛B^{(1)}=\sigma_{1}R\boldsymbol{z} and B(2)=σ2R(ρ𝒛+1ρ2𝒛~)B^{(2)}=\sigma_{2}R(\rho\boldsymbol{z}+\sqrt{1-\rho^{2}}\tilde{\boldsymbol{z}}) as before is equivalent to taking

R~=(σ1Id𝟎σ2ρId1ρ2σ2Id)(R𝟎𝟎R).\tilde{R}=\begin{pmatrix}\sigma_{1}I_{d}&\mathbf{0}\\ \sigma_{2}\rho I_{d}&\sqrt{1-\rho^{2}}\sigma_{2}I_{d}\end{pmatrix}\begin{pmatrix}R&\mathbf{0}\\ \mathbf{0}&R\end{pmatrix}.

We call this the ordinary PCA construction or the ordinary standard construction, depending on whether RR comes from the PCA or the standard construction.
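Below is a sketch (ours, in numpy) of how R~ might be formed for the full constructions; Sigma is the d×d single-asset covariance from before.

```python
import numpy as np

def full_construction(Sigma, sigma1, sigma2, rho, method="pca"):
    """R_tilde with R_tilde R_tilde^T = Sigma_tilde, the joint 2d x 2d covariance of (B1, B2).
    'standard' takes the Cholesky factor; 'pca' takes U Lambda^{1/2} from the eigendecomposition."""
    Sigma_tilde = np.block([
        [sigma1**2 * Sigma,             rho * sigma1 * sigma2 * Sigma],
        [rho * sigma1 * sigma2 * Sigma, sigma2**2 * Sigma],
    ])
    if method == "standard":
        return np.linalg.cholesky(Sigma_tilde)
    evals, evecs = np.linalg.eigh(Sigma_tilde)
    order = np.argsort(evals)[::-1]
    return evecs[:, order] * np.sqrt(np.maximum(evals[order], 0.0))
```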

Figure 4 compares results using this full PCA construction. All methods apply pre-integration before using RQMC. For active subspace pre-integration, we pre-integrate along the direction in \mathbb{R}^{2d} found by the active subspace method. For “full PCA”, we pre-integrate along the first principal component of Σ~\tilde{\Sigma}. We can see that active subspace pre-integration has a better RMSE than even the full PCA construction with dimension reduction, which we consider to be the natural generalization of the PCA construction to this setting.

Figure 4: These plots compare pre-integration strategies for two basket options. We compare active subspace pre-integration to strategies with full and partial PCA and standard constructions of Brownian motion.

5.3 Timing

Pre-integration changes the computational cost of RQMC and this has not been much studied in the literature. Our figures compare the RMSE of different samplers as a function of the number nn of evaluations. Efficiency depends also on running times. Here we make a study of running times for the pre-integration methods we considered; it is brief in order to conserve space. Running times depend on implementation details that could make our efficiency calculations differ from others’. All of the timings we report were conducted on a MacBook Pro with 8 cores and 16GB memory. We simulated them all with 1010 replicates and n=215n=2^{15}. The standard errors of the average evaluation times were negligible, about 0.30.3%–0.80.8% of the corresponding means. We also computed times to find C^\widehat{C} and C~^\widehat{\widetilde{C}} (defined in equations (13) and (14)). Those had small standard errors too, and their costs are a small fraction of the total. The costs of finding the eigendecompositions are negligible in our examples.

Table 1 shows the results of our timings. For the 6 integrands, we find that pre-integration raises the cost of computation by roughly 1212 to 1616-fold. That extra cost is not enough to offset the gains from pre-integration with large nn except for pre-integration of the first component in the standard construction. That variable is not very important and we might well have expected pre-integration to be unhelpful for it.

Most of the pre-integrated computations took about the same amount of time, but plain pre-integration and pre-integration with dimension reduction take slightly more time for the standard construction. Upon investigation we found that those methods took slightly more Newton iterations on average. In hindsight this makes sense. The Newton iterations started at 0. In the standard construction, the first variable does not have a very strong effect on the option value, requiring it to take larger values to reach the positivity threshold x1=γ(𝒙2:d)x_{1}=\gamma(\boldsymbol{x}_{2{:}d}) and hence (a few) more iterations.

Another component of cost for active subspace methods is in computing an approximation to CC and C~\widetilde{C}. We used M=128M=128 function evaluations. Those are about 2d2d times as expensive as evaluations because they use divided differences. One advantage of active subspace pre-integration is that it uses divided differences of the original integrand. Pre-integration plus dimension reduction requires divided differences of the pre-integrated function with associated Newton searches, and that costs more.

Integrand (construction)   MC    RQMC   pre-int   AS+pre-int   pre-int+DimRed   C^\widehat{C}   C~^\widehat{\widetilde{C}}
Payoff (standard)          0.6   0.6    7.5       6.9          7.4              0.3             2.9
Payoff (PCA)               0.6   0.6    6.9       6.9          6.9              0.3             2.7
Delta (standard)           0.5   0.4    5.7       5.2          5.7              0.2             2.2
Delta (PCA)                0.5   0.4    5.1       5.2          5.2              0.2             2.0
Gamma (standard)           0.6   0.6    12.2      12.2         12.3             0.2             4.7
Gamma (PCA)                0.6   0.6    11.7      12.3         11.7             0.2             4.5
Rho (standard)             0.9   0.9    10.6      10.1         10.6             0.3             4.1
Rho (PCA)                  0.9   0.9    10.1      10.2         10.1             0.4             3.9
Theta (standard)           1.0   1.0    13.9      13.3         14.0             0.4             5.4
Theta (PCA)                1.0   1.0    13.4      13.4         13.4             0.4             5.2
Vega (standard)            0.6   0.6    9.7       9.1          9.7              0.3             3.7
Vega (PCA)                 0.6   0.6    9.1       9.1          9.1              0.3             3.5
Table 1: Time in seconds to compute all Asian option integrands and methods. All simulations take 2152^{15} samples and the times are averaged over 10 replicates. Each integrand has two rows, where the top row uses the standard construction and the bottom row uses the PCA construction. The right two columns are the times used to estimate C^\widehat{C} and C~^\widehat{\widetilde{C}} by M=128M=128 samples.

6 Discussion

In this paper we have studied a kind of conditional RQMC known as pre-integration. We found that, just like conditional MC, the procedure can reduce variance but cannot increase it. We proposed to pre-integrate over the first component in an active subspace approximation, which is also known as the gradient PCA approximation. We showed a close relationship between this choice of pre-integration variable and what one would get using a computationally infeasible but well motivated choice by maximizing the Sobol’ index of a linear combination of variables.

In the numerical examples of option pricing, we see that active subspace pre-integration achieves a better RMSE than previous methods when using the standard construction of Brownian motion. For the PCA construction, the proposed method has comparable accuracy in four of six cases, is better once and is worse once. For those six integrands, the PCA construction is already very good. Active subspace pre-integration still provides an automatic way to choose the pre-integration direction. Even using the standard construction it is almost as good as pre-integration with the PCA construction. It can be used in settings where there is no strong incumbent decomposition analogous to the PCA for the Asian option. We saw it perform well for basket options. We even saw an improvement for Gamma in the very well studied case of the Asian option with the PCA construction.

We note that active subspaces use an uncentered PCA analysis of the matrix of sample gradients. One could use instead a centered analysis of 𝔼((f(𝒙)η)(f(𝒙)η)𝖳)\mathbb{E}((\nabla f(\boldsymbol{x})-\eta)(\nabla f(\boldsymbol{x})-\eta)^{\mathsf{T}}) where η=𝔼(f(𝒙))\eta=\mathbb{E}(\nabla f(\boldsymbol{x})). The potential advantage of this is that f(𝒙)η\nabla f(\boldsymbol{x})-\eta is the gradient of f(𝒙)η𝖳𝒙f(\boldsymbol{x})-\eta^{\mathsf{T}}\boldsymbol{x} which subtracts a linear approximation from ff before searching for θ\theta. The rationale for this alternative is that RQMC might already do well integrating a linear function and we would then want to choose a direction θ\theta that performs well for the nonlinear part of ff. In our examples, we found very little difference between the two methods and so we proposed the simpler uncentered active subspace pre-integration.
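For reference, here is a minimal sketch (ours) of that centered alternative, given a matrix G whose rows are sampled gradients.

```python
import numpy as np

def centered_gradient_pca(G):
    """Eigendecomposition of the centered matrix E((grad f - eta)(grad f - eta)^T),
    eta = E(grad f), estimated from the rows of G; columns returned in decreasing order."""
    eta = G.mean(axis=0)
    Gc = G - eta
    C_centered = Gc.T @ Gc / G.shape[0]
    evals, evecs = np.linalg.eigh(C_centered)
    return evecs[:, np.argsort(evals)[::-1]]
```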

Acknowledgments

This work was supported by the U.S. National Science Foundation under grant IIS-1837931.

References

  • [1] P. Acworth, M. Broadie, and P. Glasserman. A comparison of some Monte Carlo techniques for option pricing. In H. Niederreiter, P. Hellekalek, G. Larcher, and P. Zinterhof, editors, Monte Carlo and quasi-Monte Carlo methods ’96, pages 1–18. Springer, 1997.
  • [2] L. H. Y. Chen. An inequality for the multivariate normal distribution. Journal of Multivariate Analysis, 12(2):306–315, 1982.
  • [3] P. G. Constantine. Active subspaces: Emerging ideas for dimension reduction in parameter studies. SIAM, Philadelphia, 2015.
  • [4] P. G. Constantine, E. Dow, and Q. Wang. Active subspace methods in theory and practice: applications to kriging surfaces. SIAM Journal on Scientific Computing, 36(4):A1500–A1524, 2014.
  • [5] P. J. Davis and P. Rabinowitz. Methods of Numerical Integration. Academic Press, San Diego, 2nd edition, 1984.
  • [6] J. Dick, F. Y. Kuo, and I. H. Sloan. High-dimensional integration: the quasi-Monte Carlo way. Acta Numerica, 22:133–288, 2013.
  • [7] J. Dick and F. Pillichshammer. Digital sequences, discrepancy and quasi-Monte Carlo integration. Cambridge University Press, Cambridge, 2010.
  • [8] B. Efron and C. Stein. The jackknife estimate of variance. Annals of Statistics, 9(3):586–596, 1981.
  • [9] A. D. Gilbert, F. Y. Kuo, and I. H. Sloan. Preintegration is not smoothing when monotonicity fails. Technical report, arXiv:2112.11621, 2021.
  • [10] P. Glasserman. Monte Carlo Methods in Financial Engineering, volume 53. Springer, 2004.
  • [11] P. Glasserman, P. Heidelberger, and P. Shahabuddin. Asymptotically optimal importance sampling and stratification for pricing path-dependent options. Mathematical finance, 9(2):117–152, 1999.
  • [12] A. Griewank, F. Y. Kuo, H. Leövey, and I. H. Sloan. High dimensional integration of kinks and jumps—smoothing by preintegration. Journal of Computational and Applied Mathematics, 344:259–274, 2018.
  • [13] J. M. Hammersley. Conditional Monte Carlo. Journal of the ACM (JACM), 3(2):73–76, 1956.
  • [14] Z. He. On the error rate of conditional quasi–Monte Carlo for discontinuous functions. SIAM Journal on Numerical Analysis, 57(2):854–874, 2019.
  • [15] F. J. Hickernell. Koksma-Hlawka inequality. Wiley StatsRef: Statistics Reference Online, 2014.
  • [16] W. Hoeffding. A class of statistics with asymptotically normal distribution. Annals of Mathematical Statistics, 19(3):293–325, 1948.
  • [17] C. R. Hoyt and A. B. Owen. Mean dimension of ridge functions. SIAM Journal on Numerical Analysis, 58(2):1195–1216, 2020.
  • [18] J. Imai and K. S. Tan. Minimizing effective dimension using linear transformation. In Monte Carlo and Quasi-Monte Carlo Methods 2002, pages 275–292. Springer, 2004.
  • [19] M. J. W. Jansen. Analysis of variance designs for model output. Computer Physics Communications, 117(1–2):35–43, 1999.
  • [20] I. T. Jolliffe. A note on the use of principal components in regression. Journal of the Royal Statistical Society: Series C (Applied Statistics), 31(3):300–303, 1982.
  • [21] P. L’Ecuyer and C. Lemieux. A survey of randomized quasi-Monte Carlo methods. In Modeling Uncertainty: An Examination of Stochastic Theory, Methods, and Applications, pages 419–474. Kluwer Academic Publishers, 2002.
  • [22] J. Matoušek. On the L2–discrepancy for anchored boxes. Journal of Complexity, 14(4):527–556, 1998.
  • [23] B. Moskowitz and R. E. Caflisch. Smoothness and dimension reduction in quasi-Monte Carlo methods. Mathematical and Computer Modelling, 23(8-9):37–54, 1996.
  • [24] H. Niederreiter. Random Number Generation and Quasi-Monte Carlo Methods. SIAM, Philadelphia, PA, 1992.
  • [25] G. Ökten and A. Göncü. Generating low-discrepancy sequences from the normal distribution: Box–Muller or inverse transform? Mathematical and Computer Modelling, 53(5-6):1268–1281, 2011.
  • [26] A. B. Owen. Randomly permuted (t,m,s)(t,m,s)-nets and (t,s)(t,s)-sequences. In Monte Carlo and Quasi-Monte Carlo Methods in Scientific Computing, pages 299–317, New York, 1995. Springer-Verlag.
  • [27] A. B. Owen. Monte Carlo variance of scrambled net quadrature. SIAM Journal on Numerical Analysis, 34(5):1884–1910, 1997.
  • [28] A. B. Owen. Scrambled net variance for integrals of smooth functions. Annals of Statistics, 25(4):1541–1562, 1997.
  • [29] A. B. Owen. Scrambling Sobol’ and Niederreiter-Xing points. Journal of Complexity, 14(4):466–489, December 1998.
  • [30] A. B. Owen. Local antithetic sampling with scrambled nets. Annals of Statistics, 36(5):2319–2343, 2008.
  • [31] A. B. Owen. Monte Carlo Theory, Methods and Examples. statweb.stanford.edu/~owen/mc, 2013.
  • [32] A. B. Owen and D. Rudolf. A strong law of large numbers for scrambled net integration. SIAM Review, 63(2):360–372, 2021.
  • [33] Z. Pan and A. B. Owen. The nonzero gain coefficients of Sobol’s sequences are always powers of two. Technical Report arXiv:2106.10534, Stanford University, 2021.
  • [34] A. Papageorgiou. The Brownian bridge does not offer a consistent advantage in quasi-Monte Carlo integration. Journal of Complexity, 18(1):171–186, 2002.
  • [35] M. T. Parente, J. Wallin, and B. Wohlmuth. Generalized bounds for active subspaces. Electronic Journal of Statistics, 14(1):917–943, 2020.
  • [36] G. Pirsic. Schnell konvergierende Walshreihen über gruppen. Master’s thesis, University of Salzburg, 1995. Institute for Mathematics.
  • [37] S. Razavi, A. Jakeman, A. Saltelli, C. Prieur, B. Iooss, E. Borgonovo, E. Plischke, S. L. Piano, T. Iwanaga, W. Becker, S. Tarantola, J. H. A. Guillaume, J. Jakeman, H. Gupta, N. Melillo, G. Rabitti, V. Chabirdon, Q. Duan, X. Sun, S. Smith, R. Sheikholeslami, N. Hosseini, M. Asadzade, A. Puy, S. Kucherenko, and H. R. Maier. The future of sensitivity analysis: an essential discipline for systems modeling and policy support. Environmental Modelling & Software, 137:104954, 2021.
  • [38] C. P. Robert and G. O. Roberts. Rao-Blackwellization in the MCMC era. Technical Report arXiv:2101.01011, University of Warwick, 2021.
  • [39] I. M. Sobol’. The distribution of points in a cube and the accurate evaluation of integrals. USSR Computational Mathematics and Mathematical Physics, 7(4):86–112, 1967.
  • [40] I. M. Sobol’. Multidimensional Quadrature Formulas and Haar Functions. Nauka, Moscow, 1969. (In Russian).
  • [41] I. M. Sobol’. Sensitivity estimates for nonlinear mathematical models. Mathematical Modeling and Computational Experiment, 1:407–414, 1993.
  • [42] I. M. Sobol’ and S. Kucherenko. Derivative based global sensitivity measures and their link with global sensitivity indices. Mathematics and Computers in Simulation (MATCOM), 79(10):3009–3017, 2009.
  • [43] H. F. Trotter and J. W. Tukey. Conditional Monte Carlo for normal samples. In Symposium on Monte Carlo Methods, pages 64–79, New York, 1956. Wiley.
  • [44] Y. Xiao and X. Wang. Conditional quasi-Monte Carlo methods and dimension reduction for option pricing and hedging with discontinuous functions. Journal of Computational and Applied Mathematics, 343:289–308, 2018.
  • [45] Y. Xiao and X. Wang. Enhancing quasi-Monte Carlo simulation by minimizing effective dimension for derivative pricing. Computational Economics, 54(1):343–366, 2019.
  • [46] R.-X. Yue and S.-S. Mao. On the variance of quadrature over scrambled nets and sequences. Statistics & Probability Letters, 44(3):267–280, 1999.
  • [47] C. Zhang, X. Wang, and Z. He. Efficient importance sampling in quasi-Monte Carlo methods for computational finance. SIAM Journal on Scientific Computing, 43(1):B1–B29, 2021.