
Neural Option Pricing for Rough Bergomi Model

Changqing Teng and Guanglian Li
Department of Mathematics, The University of Hong Kong, Pokfulam Road, Hong Kong. Emails: [email protected], [email protected].
GL acknowledges the support from GRF (project number: 17317122) and Early Career Scheme (project number: 27301921), RGC, Hong Kong.
Abstract

The rough Bergomi (rBergomi) model can accurately describe the historical and implied volatilities, and has gained much attention in the past few years. However, there are many hidden unknown parameters or even functions in the model. In this work we investigate the potential of learning the forward variance curve in the rBergomi model using a neural SDE. To construct an efficient solver for the neural SDE, we propose a novel numerical scheme for simulating the volatility process using the modified summation of exponentials. Using the Wasserstein 1-distance to define the loss function, we show that the learned forward variance curve is capable of calibrating the price process of the underlying asset and the prices of European-style options simultaneously. Several numerical tests are provided to demonstrate its performance.

1 Introduction

Empirical studies of a wide range of assets' volatility time series show that the log-volatility in practice behaves similarly to fractional Brownian motion (FBM) (an FBM with Hurst index $H$ is the unique centered Gaussian process $W^{H}_{t}$ with autocovariance $\mathbb{E}[W^{H}_{t}W^{H}_{s}]=\tfrac{1}{2}[|t|^{2H}+|s|^{2H}-|t-s|^{2H}]$) with a Hurst index $H\approx 0.1$ at any reasonable time scale [8]. Motivated by this empirical observation, several rough stochastic volatility models have been proposed, all of which are essentially based on FBM and involve the fractional kernel. The rough Bergomi (rBergomi) model [3] is one recent rough volatility model with a remarkable capability of fitting both historical and implied volatilities.

The rBergomi model can be formulated as follows. Let $S_{t}$ be the price of the underlying asset over the time horizon $[0,T]$, defined on a given filtered probability space $(\Omega,\mathcal{F},\{\mathcal{F}_{t}\}_{t\geq 0},\mathbb{Q})$, with $\mathbb{Q}$ being a risk-neutral martingale measure:

S_{t}=S_{0}\exp\left(-\frac{1}{2}\int_{0}^{t}V_{s}\,\mathrm{d}s+\int_{0}^{t}\sqrt{V_{s}}\,\mathrm{d}Z_{s}\right),\quad t\in[0,T]. \qquad (1)

Here, $V_{t}$ is the spot variance process satisfying

V_{t}=\xi_{0}(t)\exp\left(\eta\sqrt{2H}\int_{0}^{t}(t-s)^{H-\frac{1}{2}}\,\mathrm{d}W_{s}-\frac{\eta^{2}}{2}t^{2H}\right),\quad t\in[0,T]. \qquad (2)

$S_{0}>0$ denotes the initial value of the underlying asset, and the parameter $\eta$ is defined by

\eta:=2\nu\sqrt{\frac{\Gamma(3/2-H)}{\Gamma(H+1/2)\Gamma(2-2H)}},

with $\nu$ being the ratio between the increment of $\log V_{t}$ and that of the FBM over $(t,t+\Delta t)$ [3, Equation (2.1)].

The Hurst index $H\in(0,1/2)$ reflects the regularity of the volatility $V_{t}$. $\xi_{0}(\cdot)$ is the so-called initial forward variance curve, defined by $\xi_{0}(t)=\mathbb{E}^{\mathbb{Q}}[V_{t}|\mathcal{F}_{0}]=\mathbb{E}[V_{t}]$ [3]. $Z_{t}$ is a standard Brownian motion:

Z_{t}:=\rho W_{t}+\sqrt{1-\rho^{2}}\,W^{\perp}_{t}. \qquad (3)

Here, $\rho\in(-1,0)$ is the correlation parameter and $W^{\perp}_{t}$ is a standard Brownian motion independent of $W_{t}$. The price of a European option with payoff function $h(\cdot)$ and expiry $T$ is given by

P_{0}=\mathbb{E}[\mathrm{e}^{-rT}h(S_{T})]. \qquad (4)

Despite its significance from a modeling perspective, the singular kernel in the rBergomi model leads to the loss of the Markovian and semimartingale structure. In practice, simulating and pricing options under this model involves several challenges. The primary difficulty arises from the singularity of the fractional kernel

G(t):=t^{H-\frac{1}{2}}, \qquad (5)

at $t=0$, which impacts the simulation of the Volterra process given by

I_{t}:=\sqrt{2H}\int_{0}^{t}G(t-s)\,\mathrm{d}W_{s}. \qquad (6)

Note that this Volterra process is a Riemann-Liouville FBM, or Lévy's definition of FBM up to a multiplicative constant [18]. The deterministic nature of the kernel $G(\cdot)$ implies that $I_{t}$ is a centered, locally $(H-\epsilon)$-Hölder continuous Gaussian process with $\mathbb{E}[I_{t}^{2}]=t^{2H}$. Furthermore, a straightforward calculation shows

\mathbb{E}[I_{t_{1}}I_{t_{2}}]=t_{1}^{2H}\,C\!\left(\frac{t_{2}}{t_{1}}\right),\quad\text{for }t_{2}>t_{1},

where, for $x>1$, the univariate function $C(\cdot)$ is given by

C(x):=2H\int_{0}^{1}(1-s)^{H-1/2}(x-s)^{H-1/2}\,\mathrm{d}s,

indicating that $I_{t}$ is non-stationary.
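To make the non-stationarity concrete, the covariance above is easy to evaluate numerically. The following Python snippet (our own illustration, not part of the paper) computes $C(x)$ by quadrature and shows that $\mathbb{E}[I_{t_1}I_{t_2}]$ depends on $(t_1,t_2)$ and not merely on the lag $t_2-t_1$:

```python
import numpy as np
from scipy.integrate import quad

H = 0.07  # Hurst index, the value used later in the numerical tests

def C(x, H=H):
    """C(x) = 2H * int_0^1 (1-s)^(H-1/2) (x-s)^(H-1/2) ds for x > 1."""
    integrand = lambda s: (1.0 - s) ** (H - 0.5) * (x - s) ** (H - 0.5)
    val, _ = quad(integrand, 0.0, 1.0)  # integrable endpoint singularity at s = 1
    return 2.0 * H * val

def cov_I(t1, t2, H=H):
    """E[I_{t1} I_{t2}] = t1^{2H} C(t2/t1) for t2 > t1 > 0."""
    return t1 ** (2 * H) * C(t2 / t1, H)

# Same lag t2 - t1 = 0.1, but different covariances: I_t is non-stationary.
print(cov_I(0.1, 0.2), cov_I(0.2, 0.3))
```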

Even though the rBergomi model enjoys a remarkable calibration capability with market data, it contains many hidden unknown parameters or even functions. For example, the Hurst exponent (in a general SDE) is regarded as an unknown function $H:=H(t)\colon\mathbb{R}_{+}\to(0,1)$ parameterized by a neural network in [21], which is learned using neural SDEs. According to [3], $\xi_{0}(t)$ can be any given initial forward variance swap curve consistent with the market price. In view of this fact and the high expressivity of neural SDEs [22, 23, 16, 14, 17, 10, 9], in this work we propose to parameterize the initial forward variance curve $\xi_{0}(t)$ using a feedforward neural network:

\xi_{0}(t)=\xi_{0}(t;\theta), \qquad (7)

where $\theta$ represents the weight parameter vector of the neural network. The training data are generated via a suitable numerical scheme for a given target initial forward variance curve $\xi_{0}(t)$.

Motivated by the implementation in [7], we adopt the Wasserstein metric as the loss function to train the weights $\theta$. Since the Wasserstein loss is convex and has a unique global minimum, any SDE may be learned in the infinite data limit [13]. Remarkably, the Wasserstein-1 distance attained during training is a natural upper bound for the discrepancy between the exact option price $P_{0}$ and the learned option price $P_{0}(\theta^{*})$ given by the neural SDE. This in particular implies that if the approximation of the underlying dynamics of the stock price $S_{t}$ is accurate to some level, then the European option price $P_{0}$ with this stock as the underlying asset will be accurate to the same level. In this manner, the loss realizes two optimization goals simultaneously.

To generate the training data to learn $\theta$, we need to jointly simulate the Volterra process $I_{t}$ and the underlying Brownian motion $Z_{t}$ for the stock price. The non-stationarity and the joint simulation requirement make Cholesky factorization [3] the only available exact method. However, it incurs an $\mathcal{O}(n^{3})$ cost for the Cholesky factorization, an $\mathcal{O}(n^{2})$ cost per sample path and $\mathcal{O}(n^{2})$ storage, with $n$ being the total number of time steps. Clearly, the method is infeasible for large $n$. The best-known method to reduce the computational complexity and storage cost is the hybrid scheme [4]. The hybrid scheme approximates the kernel $G(\cdot)$ by a power function near zero and by a step function elsewhere, which yields an approximation combining Wiener integrals of the power function with a Riemann sum. Since the fractional kernel $G(\cdot)$ is itself a power function, the approximation is exact near zero. Its computational complexity is $\mathcal{O}(n\log n)$ and its storage cost is $\mathcal{O}(n)$.

In this work, we propose an efficient modified summation of exponentials (mSOE) based method (16) to simulate (1)-(2) (to facilitate the training of the neural SDE). Similar approximations have already been considered, e.g., by Bayer and Breneis [2] and by Abi Jaber and El Euch [1]. We further enhance the numerical performance by keeping the kernel exact near the singularity, which achieves high accuracy with the fewest number of summation terms. Numerical experiments show that the mSOE scheme considerably improves upon the prevalent hybrid scheme [4] in terms of accuracy, while having lower computational and storage costs. Besides the rBergomi model, the proposed approach is applicable to a wide class of stochastic volatility models, especially rough volatility models with completely monotone kernels.

In sum, the contributions of this work are three-fold. First, we derive an efficient modified summation of exponentials (mSOE) based method (16) to solve (1)-(2) even for very small Hurst parameters $H\in(0,1/2)$. Second, we propose to learn the forward variance curve by a neural SDE using a loss function based on the Wasserstein 1-distance, which can learn the underlying dynamics of the stock price as well as the European option price. Third and last, the mSOE scheme is further utilized to generate the training data for the proposed neural SDE model and serves as the solver for the neural SDE, which improves the efficiency significantly.

The remainder of the paper is organized as follows. In Section 2, we introduce an approximation of the singular kernel $G(\cdot)$ by the mSOE, and then describe two approaches for obtaining the corresponding nodes and weights. Based on the mSOE, we propose a numerical scheme for simulating the rBergomi model (1)-(2). We introduce in Section 3 the neural SDE model and describe its training. We illustrate in Section 4 the numerical performance of the mSOE scheme. Moreover, several numerical experiments on the performance of the neural SDE model are presented for different target initial forward variance curves. Finally, we summarize the main findings in Section 5.

2 Simulation of the rBergomi model

In this section, we develop an mSOE scheme for efficiently simulating the rBergomi model (1)-(2). Throughout, we consider an equidistant temporal grid $0=t_{0}<t_{1}<\cdots<t_{n}=T$ with a time step size $\tau:=T/n$ and $t_{i}:=i\tau$.

2.1 Modified SOE based numerical scheme

The non-Markovian nature of the Gaussian process $I_{t}$ in (6) poses multiple theoretical and numerical challenges. A tractable and flexible Markovian approximation is highly desirable. By the well-known Bernstein theorem [25], a completely monotone function (i.e., a function $g$ satisfying $(-1)^{k}g^{(k)}(x)\geq 0$ for all $x>0$ and $k=0,1,2,\cdots$) can be represented as the Laplace transform of a nonnegative measure. We apply this result to the fractional kernel $G(\cdot)$ and obtain

G(t)=\frac{1}{\Gamma(\frac{1}{2}-H)}\int_{0}^{\infty}\mathrm{e}^{-xt}x^{-H-\frac{1}{2}}\,\mathrm{d}x=:\int_{0}^{\infty}\mathrm{e}^{-xt}\,\mu(\mathrm{d}x), \qquad (8)

where $\mu(\mathrm{d}x)=w(x)\,\mathrm{d}x$ with $w(x)=\frac{1}{\Gamma(\frac{1}{2}-H)}x^{-H-\frac{1}{2}}$, and $\Gamma(\cdot)$ denotes Euler's gamma function, defined by $\Gamma(z)=\int_{0}^{\infty}s^{z-1}\mathrm{e}^{-s}\,\mathrm{d}s$ for $\Re(z)>0$. That is, $G(t)$ is an infinite mixture of exponentials.
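As a quick numerical sanity check (not part of the original paper), the Laplace-transform representation (8) can be verified directly: integrating $\mathrm{e}^{-xt}w(x)$ over $x$ reproduces $t^{H-1/2}$ up to quadrature error. A minimal sketch in Python:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

H = 0.07

def G(t):
    return t ** (H - 0.5)

def G_laplace(t):
    # G(t) = 1/Gamma(1/2 - H) * int_0^inf e^{-x t} x^{-H-1/2} dx; the integral is
    # split at x = 1 to handle the integrable singularity at x = 0.
    f = lambda x: np.exp(-x * t) * x ** (-H - 0.5)
    return (quad(f, 0.0, 1.0)[0] + quad(f, 1.0, np.inf)[0]) / gamma(0.5 - H)

for t in (0.1, 0.5, 1.0):
    print(t, G(t), G_laplace(t))  # the two values agree to quadrature accuracy
```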

The stochastic Fubini theorem implies

\begin{aligned}
I_{t} &=\sqrt{2H}\int_{0}^{t}G(t-s)\,\mathrm{d}W_{s}=\sqrt{2H}\int_{0}^{\infty}\int_{0}^{t}\mathrm{e}^{-x(t-s)}\,\mathrm{d}W_{s}\,\mu(\mathrm{d}x)\\
&=:\sqrt{2H}\int_{0}^{\infty}Y_{t}^{x}\,\mu(\mathrm{d}x). \qquad (9)
\end{aligned}

Note that for any fixed $x\geq 0$, $(Y_{t}^{x};t\geq 0)$ is an Ornstein-Uhlenbeck process with parameter $x$, solving the SDE $\mathrm{d}Y_{t}^{x}=-xY_{t}^{x}\,\mathrm{d}t+\mathrm{d}W_{t}$ and starting from the origin. Therefore, $I_{t}$ is a linear functional of the infinite-dimensional process $\mathcal{Y}_{t}:=(Y_{t}^{x},x\geq 0)$ [6]. We propose to simulate $I_{t}$ exactly near $t$ and to apply numerical quadrature to (8) elsewhere to enhance the computational efficiency, and we refer to the resulting method as the modified SOE (mSOE) scheme.

A summation of exponentials can be utilized to approximate any completely monotone function $g(x)$. Complete monotonicity covers most non-negative, non-increasing and smooth functions, and is less restrictive than the requirement of the hybrid scheme [20].

Remark 1.

For the truncated Brownian semistationary process $Y_{t}$ of the form $Y_{t}=\int_{0}^{t}g(t-s)\sigma_{s}\,\mathrm{d}W_{s}$, the hybrid scheme requires the kernel $g(\cdot)$ to satisfy the following two conditions: (a) There exists an $L_{g}(x)\in C^{1}((0,1])$ satisfying $\lim_{x\to 0}\frac{L_{g}(tx)}{L_{g}(x)}=1$ for any $t>0$, with the derivative $L^{\prime}_{g}$ satisfying $|L^{\prime}_{g}(x)|\leq C(1+x^{-1})$ for $x\in(0,1]$, such that

g(x)=x^{\alpha}L_{g}(x),\quad x\in(0,1],\ \alpha\in(-0.5,0.5)\backslash\{0\}.

(b) $g$ is differentiable on $(0,\infty)$.

Condition (a) implies that $g(\cdot)$ does not allow for a strong singularity near the origin, so that it can be well approximated by a certain power function. However, the rough fractional kernel $cx^{\alpha}$ with $\alpha\in[-1,-0.5]$ fails to satisfy this assumption.

Common examples of completely monotone functions are the exponential kernel $c\mathrm{e}^{-\lambda x}$ with $c,\lambda\geq 0$, the rough fractional kernel $cx^{\alpha}$ with $c\geq 0$ and $\alpha<0$, and the shifted power-law kernel $(1+t)^{\beta}$ with $\beta\leq 0$. More flexible kernels can be constructed from these building blocks, as complete monotonicity is preserved by summation and products [19, Theorem 1]. The approximation property of the summation of exponentials is guaranteed by the following theorem.

Theorem 2.1 ([5, Theorem 3.4]).

Let $g(\cdot)$ be completely monotone and analytic for $\Re(x)>0$, and let $0<a<b$. Then there exists a uniform approximation $\hat{g}(x):=\sum_{j=1}^{n}\omega_{j}\mathrm{e}^{-\lambda_{j}x}$ on the interval $[a,b]$ such that

\lim_{n\to\infty}\|g-\hat{g}\|_{\infty}^{1/n}\leq\sigma^{-2}.

Here, $\sigma=\exp\!\big(\frac{\pi\mathcal{K}(k)}{\mathcal{K}^{\prime}(k)}\big)$, $\mathcal{K}(k)=\mathcal{K}^{\prime}(k^{\prime})$ and $\mathcal{K}^{\prime}(k)=\int_{0}^{\infty}\frac{\mathrm{d}t}{\sqrt{(k^{2}+t^{2})(1+t^{2})}}$, with $k^{2}+(k^{\prime})^{2}=1$.

The proposed mSOE scheme relies on replacing the kernel $G(\cdot)$ by $\hat{G}(\cdot)$, which is defined by

\hat{G}(t):=\begin{cases} t^{H-\frac{1}{2}}, & t\in[t_{0},t_{1}),\\[2pt] \sum_{j=1}^{N}\omega_{j}\mathrm{e}^{-\lambda_{j}t}, & t\in[t_{1},t_{n}], \end{cases} \qquad (10)

for a certain non-negative paired sequence $\{(\omega_{j},\lambda_{j})\}_{j=1}^{N}$, where the $\lambda_{j}$'s are the interpolation points (nodes) and the $\omega_{j}$'s are the corresponding weights. This scheme is also referred to as the mSOE-$N$ scheme below. By replacing $G(\cdot)$ with $\hat{G}$ in (6), we derive the associated approximation $\hat{I}(t_{i})$ of $I_{t_{i}}$,

\hat{I}(t_{i}):=\sqrt{2H}\int_{t_{i-1}}^{t_{i}}(t_{i}-s)^{H-\frac{1}{2}}\,\mathrm{d}W_{s}+\sqrt{2H}\sum_{j=1}^{N}\omega_{j}\int_{0}^{t_{i-1}}\mathrm{e}^{-\lambda_{j}(t_{i}-s)}\,\mathrm{d}W_{s}. \qquad (11)

Then the Itô isometry implies

\begin{aligned}
\mathbb{E}\left[|I(t_{i})-\hat{I}(t_{i})|^{2}\right] &=2H\int_{0}^{t_{i-1}}|G(t_{i}-s)-\hat{G}(t_{i}-s)|^{2}\,\mathrm{d}s\\
&=2H\int_{t_{1}}^{t_{i}}|G(s)-\hat{G}(s)|^{2}\,\mathrm{d}s. \qquad (12)
\end{aligned}

This indicates that we only need to choose the non-negative pairs $\{(\omega_{j},\lambda_{j})\}_{j=1}^{N}$ that minimize $\int_{t_{1}}^{t_{i}}|G(s)-\hat{G}(s)|^{2}\,\mathrm{d}s$ in order to obtain a good approximation of $I(t_{i})$ in the sense of the pointwise mean squared error. Next, we describe two known approaches to obtain the pairs $\{(\omega_{j},\lambda_{j})\}_{j=1}^{N}$ in (10). Both are essentially based on Gauss quadrature.

Summation of exponentials: approach A

Approach A in [2] applies $m$-point Gauss-Jacobi quadrature with weight function $x^{-H-\frac{1}{2}}$ to $n$ geometrically spaced intervals $[\zeta_{i},\zeta_{i+1}]$, $i=0,\cdots,n-1$, and uses a Riemann-type approximation on the interval $[0,\zeta_{0}]$. The resulting approximation $\hat{G}_{A}(t)$ with $N+1$ summation terms is given by

\hat{G}_{A}(t):=\sum_{j=0}^{N}\omega_{j}\mathrm{e}^{-\lambda_{j}t},\quad t\in[t_{1},t_{n}],

where $\lambda_{0}=0$ and $\omega_{0}=\frac{1}{\Gamma(\frac{1}{2}-H)}\int_{0}^{\zeta_{0}}x^{-H-\frac{1}{2}}\,\mathrm{d}x$. Let $N:=nm$ be the prespecified total number of nodes. The parameters $m$, $n$ and $\zeta_{i}$ are computed from a set of parameters $\alpha,\beta,a,b\in(0,\infty)$ as follows:

\begin{aligned}
m &=\left\lceil\frac{\beta}{A}\sqrt{N}\right\rceil,\quad n=\left\lfloor\frac{A}{\beta}\sqrt{N}\right\rfloor\quad(mn\approx N),\\
A &=\left(\frac{1}{H}+\frac{1}{3/2-H}\right)^{1/2},\\
\zeta_{0} &=a\exp\left(-\frac{\alpha}{(3/2-H)A}\sqrt{N}\right),\quad\zeta_{n}=b\exp\left(\frac{\alpha}{HA}\sqrt{N}\right),\\
\zeta_{i} &=\zeta_{0}\left(\frac{\zeta_{n}}{\zeta_{0}}\right)^{i/n},\quad i=0,\cdots,n.
\end{aligned}
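The parameter choices of Approach A translate directly into code. The sketch below (our own, following the formulas above) computes $m$, $n$, $A$ and the geometric partition $\{\zeta_i\}$; the values $\alpha=\beta=a=b=1$ in the usage line are merely illustrative:

```python
import numpy as np

def approach_A_partition(N, H, alpha, beta, a=1.0, b=1.0):
    """Geometric partition and per-interval quadrature order for Approach A."""
    A = np.sqrt(1.0 / H + 1.0 / (1.5 - H))
    m = int(np.ceil(beta / A * np.sqrt(N)))    # Gauss-Jacobi points per interval
    n = int(np.floor(A / beta * np.sqrt(N)))   # number of geometric intervals (m*n ~ N)
    zeta_0 = a * np.exp(-alpha / ((1.5 - H) * A) * np.sqrt(N))
    zeta_n = b * np.exp(alpha / (H * A) * np.sqrt(N))
    zeta = zeta_0 * (zeta_n / zeta_0) ** (np.arange(n + 1) / n)
    return m, n, A, zeta

m, n, A, zeta = approach_A_partition(N=20, H=0.07, alpha=1.0, beta=1.0)
```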

The following lemma gives the approximation error of the above Gaussian quadrature; it is a modification of [2, Lemmas 2.8 and 2.9].

Lemma 2.1.

Let $(\omega_{j})_{j=1}^{nm}$ be the weights and $(\lambda_{j})_{j=1}^{nm}$ be the nodes of Gaussian quadrature on the intervals $[\zeta_{i},\zeta_{i+1}]$, $i=0,\cdots,n-1$, computed with the set of parameters $(\alpha,\beta,1,1)$. Then the following error estimates hold:

\begin{aligned}
\left|\int_{0}^{\zeta_{0}}\mathrm{e}^{-xt}\,\mu(\mathrm{d}x)-\omega_{0}\right| &\leq\frac{t}{\Gamma(\frac{1}{2}-H)(\frac{3}{2}-H)}\exp\left(-\frac{\alpha}{A}\sqrt{N}\right),\\
\left|\int_{\zeta_{0}}^{\zeta_{n}}\mathrm{e}^{-xt}\,\mu(\mathrm{d}x)-\sum_{j=1}^{nm}\omega_{j}\mathrm{e}^{-\lambda_{j}t}\right| &\leq\sqrt{\frac{5\pi^{3}}{18}}\,\frac{n\left(\mathrm{e}^{\alpha\beta}-1\right)t^{H-\frac{1}{2}}}{\Gamma(\frac{1}{2}-H)2^{2m+1}m^{H}}\exp\left(\frac{2\beta}{A}\sqrt{N}\log\left(\mathrm{e}^{\alpha\beta}-1\right)\right),\\
\left|\int_{\zeta_{n}}^{\infty}\mathrm{e}^{-xt}\,\mu(\mathrm{d}x)\right| &\leq\frac{1}{t\,\Gamma(\frac{1}{2}-H)\,\zeta_{n}^{H+\frac{1}{2}}}\exp(-\zeta_{n}t).
\end{aligned}

Summation of exponentials: approach B

Approach B in [11] approximates the kernel $G(t)$ efficiently on the interval $[t_{1},t_{n}]$ with a desired precision $\epsilon>0$. It applies $n_{0}$-point Gauss-Jacobi quadrature on the interval $[0,2^{-M}]$ with the weight function $x^{-H-\frac{1}{2}}$, where $n_{0}=\mathcal{O}(\log\frac{1}{\epsilon})$ and $M=\mathcal{O}(\log T)$; $n_{s}$-point Gauss-Legendre quadrature on $M$ small intervals $[2^{j},2^{j+1}]$, $j=-M,\cdots,-1$, where $n_{s}=\mathcal{O}(\log\frac{1}{\epsilon})$; and $n_{l}$-point Gauss-Legendre quadrature on $N+1$ large intervals $[2^{j},2^{j+1}]$, $j=0,\cdots,N$, where $n_{l}=\mathcal{O}(\log\frac{1}{\epsilon}+\log\frac{1}{\tau})$ and $N=\mathcal{O}(\log\log\frac{1}{\epsilon}+\log\frac{1}{\tau})$ [11, Theorem 2.1]. The resulting approximation $\hat{G}_{B}(t)$ reads

\hat{G}_{B}(t)=\sum_{k=1}^{n_{0}}\omega_{0,k}\mathrm{e}^{-\lambda_{0,k}t}+\sum_{j=-M}^{-1}\sum_{k=1}^{n_{s}}\omega_{j,k}\mathrm{e}^{-\lambda_{j,k}t}\lambda_{j,k}^{-H-\frac{1}{2}}+\sum_{j=0}^{N}\sum_{k=1}^{n_{l}}\omega_{j,k}\mathrm{e}^{-\lambda_{j,k}t}\lambda_{j,k}^{-H-\frac{1}{2}},\quad t\in[t_{1},t_{n}], \qquad (13)

where the factors $\lambda_{j,k}^{-H-\frac{1}{2}}$ can be absorbed into the corresponding weights $\omega_{j,k}$ so that the approximation takes the form of (10). There holds $|G(t)-\hat{G}_{B}(t)|\leq\epsilon$. For further optimization, a modified Prony's method is applied on the interval $(0,1)$ and a standard model reduction method is applied on $[1,2^{N+1}]$ to reduce the number of exponentials needed.

The approximation error is given below, which is a slight modification of [11, Lemmas 2.2, 2.3 and 2.4].

Lemma 2.2.

Let $a=2^{-M}$, $p=2^{N+1}$ and adopt the settings in (13). Then the following error estimates hold:

\begin{aligned}
&\left|\int_{0}^{a}\mathrm{e}^{-xt}\,\mu(\mathrm{d}x)-\sum_{k=1}^{n_{0}}\omega_{0,k}\mathrm{e}^{-\lambda_{0,k}t}\right|\leq\frac{4\sqrt{\pi}}{\Gamma(\frac{1}{2}-H)}a^{\frac{1}{2}-H}n_{0}^{\frac{3}{2}}\left(\frac{\mathrm{e}aT}{8(n_{0}-1)}\right)^{2n_{0}},\\
&\left|\int_{a}^{p}\mathrm{e}^{-xt}\,\mu(\mathrm{d}x)-\sum_{j=-M}^{-1}\sum_{k=1}^{n_{s}}\omega_{j,k}\mathrm{e}^{-\lambda_{j,k}t}\lambda_{j,k}^{-H-\frac{1}{2}}-\sum_{j=0}^{N}\sum_{k=1}^{n_{l}}\omega_{j,k}\mathrm{e}^{-\lambda_{j,k}t}\lambda_{j,k}^{-H-\frac{1}{2}}\right|\\
&\qquad\leq\frac{2^{\frac{3}{2}}\pi}{\Gamma(\frac{1}{2}-H)}\left(\frac{\mathrm{e}^{\frac{1}{\mathrm{e}}}}{4}\right)^{\max(n_{s},n_{l})}\frac{2^{(\frac{1}{2}-H)(N+1)}-2^{-(\frac{1}{2}-H)M}}{2^{\frac{1}{2}-H}-1},\\
&\left|\int_{p}^{\infty}\mathrm{e}^{-xt}\,\mu(\mathrm{d}x)\right|\leq\frac{1}{\tau\,\Gamma(\frac{1}{2}-H)\,p^{H+\frac{1}{2}}}\mathrm{e}^{-\tau p}.
\end{aligned}

We shall compare Approaches A and B in Section 4; the results show that Approach B outperforms Approach A. We present in Table 1 the parameter pairs for $H=0.07$ and $N=20$. See [12, Section 3.4] for further discussion of SOE approximations of the fractional kernel $G(t)$.

Table 1: Parameters for the mSOE scheme based on Approach B with $H=0.07$, $\epsilon=0.0008$ and $N=20$.

 $j$   $\omega_j$   $\lambda_j$       $j$   $\omega_j$   $\lambda_j$
 1     0.26118      0.47726           11    1.28121      1.92749
 2     0.19002      0.22777           12    1.76098      40.38675
 3     0.13840      0.108690          13    2.42043      84.62266
 4     0.10717      0.0511098         14    3.32681      177.31051
 5     0.11366      0.0193668         15    4.57262      371.52005
 6     0.14757      0.00204153        16    6.28495      778.44877
 7     0.35898      1                 17    8.63850      1631.08960
 8     0.49341      2.09531           18    11.87339     3417.63439
 9     0.67818      4.390310          19    16.31967     7160.99521
 10    0.93214      9.19906           20    22.43096     15004.4875
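Given any node-weight pairs $\{(\omega_j,\lambda_j)\}_{j=1}^{N}$ (e.g., those of Table 1), the approximation $\hat{G}$ in (10) and the $L^2(t_1,T)$ discrepancy appearing in (12) (and used as the criterion err in Section 4.1) can be evaluated with the following minimal Python sketch (our own code):

```python
import numpy as np
from scipy.integrate import quad

def G(t, H):
    return t ** (H - 0.5)

def G_hat(t, omega, lam):
    """SOE part of (10): sum_j omega_j * exp(-lambda_j * t), valid for t >= t_1."""
    return np.sum(omega * np.exp(-lam * t))

def l2_err(omega, lam, H=0.07, t1=0.0005, T=1.0):
    """err = ( int_{t1}^{T} (G - G_hat)^2 dt )^{1/2}, cf. (12) and Section 4.1."""
    integrand = lambda t: (G(t, H) - G_hat(t, omega, lam)) ** 2
    return np.sqrt(quad(integrand, t1, T)[0])

# Usage: load the N = 20 pairs of Table 1 into the arrays `omega` and `lam`, then
#   print(l2_err(omega, lam))
# gives the kind of error plotted in Figure 1(a).
```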

2.2 Numerical method based on mSOE scheme

Next we describe a numerical method for simulating $(S_{t_{i}},V_{t_{i}})$, $i=1,\cdots,n$, based on the mSOE scheme (11). Recall that the integral (6) can be written as the sum of a history part and a local part:

\begin{aligned}
I_{t_{i+1}} &=\sqrt{2H}\int_{0}^{t_{i}}(t_{i+1}-s)^{H-\frac{1}{2}}\,\mathrm{d}W_{s}+\sqrt{2H}\int_{t_{i}}^{t_{i+1}}(t_{i+1}-s)^{H-\frac{1}{2}}\,\mathrm{d}W_{s} \qquad (14)\\
&=:I_{\mathcal{F}}(t_{i+1})+I_{\mathcal{N}}(t_{i+1}).
\end{aligned}

The local part $I_{\mathcal{N}}(t_{i+1})\sim\mathcal{N}(0,\tau^{2H})$ can be simulated exactly. The history part $I_{\mathcal{F}}(t_{i+1})$ is approximated by replacing the kernel $G(\cdot)$ with the mSOE approximation $\hat{G}(t)$ (from Approach B):

\bar{I}_{\mathcal{F}}(t_{i+1})=\sqrt{2H}\sum_{j=1}^{N}\omega_{j}\int_{0}^{t_{i}}\mathrm{e}^{-\lambda_{j}(t_{i+1}-s)}\,\mathrm{d}W_{s}=:\sqrt{2H}\sum_{j=1}^{N}\omega_{j}\bar{I}_{\mathcal{F}}^{j}(t_{i+1}).

Then direct computation leads to

\begin{aligned}
\bar{I}_{\mathcal{F}}^{j}(t_{i+1}) &=\mathrm{e}^{-\lambda_{j}\tau}\int_{0}^{t_{i}}\mathrm{e}^{-\lambda_{j}(t_{i}-s)}\,\mathrm{d}W_{s}\\
&=\mathrm{e}^{-\lambda_{j}\tau}\left(\int_{0}^{t_{i-1}}\mathrm{e}^{-\lambda_{j}(t_{i}-s)}\,\mathrm{d}W_{s}+\int_{t_{i-1}}^{t_{i}}\mathrm{e}^{-\lambda_{j}(t_{i}-s)}\,\mathrm{d}W_{s}\right)\\
&=\mathrm{e}^{-\lambda_{j}\tau}\left(\bar{I}_{\mathcal{F}}^{j}(t_{i})+\int_{t_{i-1}}^{t_{i}}\mathrm{e}^{-\lambda_{j}(t_{i}-s)}\,\mathrm{d}W_{s}\right).
\end{aligned}

Consequently, we obtain the recurrence formula for each history component:

\bar{I}_{\mathcal{F}}^{j}(t_{i})=\begin{cases} 0, & i=1,\\[2pt] \mathrm{e}^{-\lambda_{j}\tau}\left(\bar{I}_{\mathcal{F}}^{j}(t_{i-1})+\int_{t_{i-2}}^{t_{i-1}}\mathrm{e}^{-\lambda_{j}(t_{i-1}-s)}\,\mathrm{d}W_{s}\right), & i>1. \end{cases} \qquad (15)

This, together with (14), implies that we need to simulate a centered $(N+2)$-dimensional Gaussian random vector at each $t_{i}$,

\Theta_{i}:=\left(\Delta W_{t_{i}},\int_{t_{i-1}}^{t_{i}}\mathrm{e}^{-\lambda_{1}(t_{i}-s)}\,\mathrm{d}W_{s},\cdots,\int_{t_{i-1}}^{t_{i}}\mathrm{e}^{-\lambda_{N}(t_{i}-s)}\,\mathrm{d}W_{s},\,I_{\mathcal{N}}(t_{i})\right)\quad\text{for }i=1,\cdots,n.

Here, $\Delta W_{t_{i}}:=W_{t_{i}}-W_{t_{i-1}}$ denotes the Brownian increment. Note that $\Theta_{i}$ is determined by its covariance matrix $\Sigma$, which is defined by

\begin{aligned}
&\Sigma_{1,1}=\tau,\quad\Sigma_{1,l}=\Sigma_{l,1}=\frac{1}{\lambda_{l-1}}\left(1-\mathrm{e}^{-\lambda_{l-1}\tau}\right),\quad\Sigma_{k,l}=\frac{1}{\lambda_{k-1}+\lambda_{l-1}}\left(1-\mathrm{e}^{-(\lambda_{k-1}+\lambda_{l-1})\tau}\right),\\
&\Sigma_{N+2,1}=\Sigma_{1,N+2}=\frac{\sqrt{2H}}{H+1/2}\tau^{H+\frac{1}{2}},\\
&\Sigma_{N+2,l}=\Sigma_{l,N+2}=\frac{\sqrt{2H}}{\lambda_{l-1}^{H+1/2}}\,\gamma\!\left(H+\tfrac{1}{2},\lambda_{l-1}\tau\right),\\
&\Sigma_{N+2,N+2}=\tau^{2H},
\end{aligned}

for $k,l=2,\cdots,N+1$, where $\gamma(\cdot,\cdot)$ denotes the lower incomplete gamma function. We only need to implement the Cholesky decomposition once, since $\Sigma$ is independent of $i$.
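The covariance matrix $\Sigma$ and its Cholesky factor can be assembled directly from the formulas above. The following sketch is our own illustration; note that SciPy's `gammainc` is the regularized lower incomplete gamma function, so it is rescaled by $\Gamma(H+1/2)$ to recover $\gamma(\cdot,\cdot)$:

```python
import numpy as np
from scipy.special import gamma, gammainc

def build_covariance(lam, tau, H):
    """Covariance of Theta_i = (dW, int e^{-lam_j(t_i - s)} dW_s for j = 1..N, I_N(t_i))."""
    N = len(lam)
    Sigma = np.zeros((N + 2, N + 2))
    Sigma[0, 0] = tau
    for k in range(N):
        Sigma[0, k + 1] = Sigma[k + 1, 0] = (1.0 - np.exp(-lam[k] * tau)) / lam[k]
        for l in range(N):
            s = lam[k] + lam[l]
            Sigma[k + 1, l + 1] = (1.0 - np.exp(-s * tau)) / s
    a = H + 0.5
    low_inc_gamma = gammainc(a, lam * tau) * gamma(a)   # gamma(a, x) = P(a, x) * Gamma(a)
    Sigma[0, -1] = Sigma[-1, 0] = np.sqrt(2 * H) / a * tau ** a
    Sigma[1:-1, -1] = Sigma[-1, 1:-1] = np.sqrt(2 * H) * low_inc_gamma / lam ** a
    Sigma[-1, -1] = tau ** (2 * H)
    return Sigma

# The Cholesky factor is computed once, since Sigma does not depend on i:
#   L = np.linalg.cholesky(build_covariance(lam, tau, H))
#   Theta_i = L @ np.random.standard_normal(N + 2)
```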

Finally, we present the two-step numerical scheme for $(S_{t_{i+1}},V_{t_{i+1}})$, $i=0,\cdots,n-1$:

\begin{aligned}
\bar{S}_{t_{i+1}} &=\bar{S}_{t_{i}}\exp\left(\sqrt{\bar{V}_{t_{i}}}\left(\rho\Delta W_{t_{i+1}}+\sqrt{1-\rho^{2}}\,\Delta W_{t_{i+1}}^{\perp}\right)-\frac{\tau}{2}\bar{V}_{t_{i}}\right), \qquad (16)\\
\bar{V}_{t_{i+1}} &=\xi_{0}(t_{i+1})\exp\left(\eta\left(\bar{I}_{\mathcal{F}}(t_{i+1})+I_{\mathcal{N}}(t_{i+1})\right)-\frac{\eta^{2}}{2}t_{i+1}^{2H}\right),\\
\bar{I}_{\mathcal{F}}(t_{i+1}) &=\sqrt{2H}\sum_{j=1}^{N}\omega_{j}\bar{I}_{\mathcal{F}}^{j}(t_{i+1}),\\
\bar{I}_{\mathcal{F}}^{j}(t_{i+1}) &=\mathrm{e}^{-\lambda_{j}\tau}\left(\bar{I}_{\mathcal{F}}^{j}(t_{i})+\int_{t_{i-1}}^{t_{i}}\mathrm{e}^{-\lambda_{j}(t_{i}-s)}\,\mathrm{d}W_{s}\right).
\end{aligned}

This mSOE-$N$ based simulation scheme (16) requires only $\mathcal{O}(N^{3})$ offline computation, which accounts for the Cholesky decomposition of the covariance matrix $\Sigma$, together with $\mathcal{O}(Nn)$ computational complexity and $\mathcal{O}(N)$ storage.
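For completeness, a minimal Python sketch of the full mSOE-$N$ simulation (16) is given below. This is our own illustrative implementation; it reuses `build_covariance` from the previous snippet and assumes that the pairs $(\omega_j,\lambda_j)$ and the forward variance curve $\xi_0$ are supplied by the user:

```python
import numpy as np

def simulate_rbergomi_msoe(xi0, omega, lam, H, eta, rho, T, n, n_paths, S0=1.0, seed=0):
    """Simulate (S, V) on an equidistant grid of n steps via the mSOE-N scheme (16)."""
    rng = np.random.default_rng(seed)
    tau = T / n
    t = np.linspace(0.0, T, n + 1)
    L = np.linalg.cholesky(build_covariance(lam, tau, H))   # see the previous snippet

    S = np.full((n_paths, n + 1), S0)
    V = np.zeros((n_paths, n + 1))
    V[:, 0] = xi0(0.0)
    Ibar = np.zeros((n_paths, len(lam)))        # \bar{I}_F^j(t_i), zero at i = 1
    prev_dWexp = np.zeros((n_paths, len(lam)))  # exponential integrals of the previous step

    for i in range(1, n + 1):
        Theta = rng.standard_normal((n_paths, len(lam) + 2)) @ L.T   # samples of Theta_i
        dW, dWexp, I_loc = Theta[:, 0], Theta[:, 1:-1], Theta[:, -1]
        dWperp = np.sqrt(tau) * rng.standard_normal(n_paths)         # independent of W
        if i > 1:
            Ibar = np.exp(-lam * tau) * (Ibar + prev_dWexp)          # recursion (15)
        I_F = np.sqrt(2 * H) * Ibar @ omega
        V[:, i] = xi0(t[i]) * np.exp(eta * (I_F + I_loc) - 0.5 * eta**2 * t[i] ** (2 * H))
        S[:, i] = S[:, i - 1] * np.exp(
            np.sqrt(V[:, i - 1]) * (rho * dW + np.sqrt(1 - rho**2) * dWperp)
            - 0.5 * tau * V[:, i - 1]
        )
        prev_dWexp = dWexp
    return S, V

# Example usage with the rBergomi parameters of Table 2 (omega, lam e.g. from Table 1):
#   S, V = simulate_rbergomi_msoe(lambda s: 0.235**2, omega, lam, H=0.07, eta=1.9,
#                                 rho=-0.9, T=1.0, n=2000, n_paths=100_000)
```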

3 Learning the forward variance curve

This section is concerned with learning the forward variance curve $\xi_{0}(t;\theta)$, with $\theta$ being the weight parameters of the feedforward neural network (7), by a neural SDE. Without loss of generality, assuming $S_{0}=1$, we can rewrite (1)-(2) as

\left\{\begin{aligned} S_{t}&=1+\int_{0}^{t}S_{s}\exp(X_{s})\,\mathrm{d}Z_{s},\\ X_{t}&=\frac{1}{2}\log(V_{t}). \end{aligned}\right. \qquad (17)

Upon parameterizing the forward variance curve by (7), the dynamics of the price process and the variance process are parameterized accordingly, which gives the following neural SDE:

\left\{\begin{aligned} S_{t}(\theta)&=1+\int_{0}^{t}S_{s}(\theta)\exp(X_{s}(\theta))\,\mathrm{d}Z_{s},\\ X_{t}(\theta)&=\frac{1}{2}\log(V_{t}(\theta)),\\ V_{t}(\theta)&=\xi_{0}(t;\theta)\,\mathcal{E}(\eta I(t)), \end{aligned}\right. \qquad (18)

where $\mathcal{E}(\cdot)$ denotes the (Wick) stochastic exponential. We propose to use the Wasserstein-1 distance as the loss function to train this neural SDE. That is, the learned weight parameters $\theta^{*}$ are given by

\theta^{*}:=\arg\min_{\theta}W_{1}\left(S_{T},S_{T}(\theta)\right). \qquad (19)

Next, we recall the Wasserstein distance. Let $(M,d)$ be a Radon space. The Wasserstein-$p$ distance for $p\in[1,\infty)$ between two probability measures $\mu$ and $\nu$ on $M$ with finite $p$-moments is defined by

W_{p}(\mu,\nu)=\left(\inf_{\gamma\in\Gamma(\mu,\nu)}\mathbb{E}_{(x,y)\sim\gamma}\,d(x,y)^{p}\right)^{1/p},

where $\Gamma(\mu,\nu)$ is the set of all couplings of $\mu$ and $\nu$. A coupling $\gamma$ is a joint probability measure on $M\times M$ whose marginals are $\mu$ and $\nu$ on the first and second factors, respectively. If $\mu$ and $\nu$ are probability measures on $\mathbb{R}$, then their Wasserstein distance can be computed directly from their cumulative distribution functions $F$ and $G$ [24]:

W_{p}(\mu,\nu)=\left(\int_{0}^{1}|F^{-1}(z)-G^{-1}(z)|^{p}\,\mathrm{d}z\right)^{1/p}.

In particular, when $p=1$, according to the Kantorovich-Rubinstein duality [24], the Wasserstein-1 distance can be represented as

W_{1}(\mu,\nu)=\sup_{\mathrm{Lip}(f)\leq 1}\mathbb{E}_{x\sim\mu}[f(x)]-\mathbb{E}_{y\sim\nu}[f(y)], \qquad (20)

where $\mathrm{Lip}(f)$ denotes the Lipschitz constant of $f$.

In the experiments, we work with empirical distributions on $\mathbb{R}$. If $\xi$ and $\eta$ are two empirical measures with $m$ samples $X_{1},\cdots,X_{m}$ and $Y_{1},\cdots,Y_{m}$, respectively, then the Wasserstein-$p$ distance is given by

W_{p}(\xi,\eta)=\left(\frac{1}{m}\sum_{i=1}^{m}|X_{(i)}-Y_{(i)}|^{p}\right)^{1/p}, \qquad (21)

where $X_{(i)}$ denotes the $i$th smallest value among the $m$ samples (and analogously for $Y_{(i)}$). With $S_{t}(\theta^{*})$ being the price process of the underlying asset after training, the price $P_{0}(\theta^{*})$ of a European option with payoff function $h(x)$ and expiry $T$ is given by

P_{0}(\theta^{*})=\mathbb{E}[\mathrm{e}^{-rT}h(S_{T}(\theta^{*}))]. \qquad (22)

Since the payoff function $h(\cdot)$ is 1-Lipschitz (e.g., $h(x)=(x-K)^{+}$ for a European call with strike $K$), the Wasserstein-1 distance is, by (20), a natural upper bound for the pricing error of the rBergomi model. This makes it a highly desirable choice of training loss.

Proposition 3.1.

Set the interest rate $r=0$. Then the Wasserstein-1 distance is a natural upper bound for the pricing error of the rBergomi model:

|P_{0}-P_{0}(\theta^{*})|=\left|\mathbb{E}\left[h(S_{T})\right]-\mathbb{E}\left[h(S_{T}(\theta^{*}))\right]\right|\leq W_{1}(S_{T},S_{T}(\theta^{*})). \qquad (23)
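For empirical measures, (21) with $p=1$ reduces to the mean absolute difference of the sorted samples, so the bound (23) can be monitored directly during training. A minimal Python sketch (our own):

```python
import numpy as np

def wasserstein1(x, y):
    """Empirical Wasserstein-1 distance (21) between two samples of equal size m."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

def call_price(s_T, strike):
    """Monte Carlo price of a European call with r = 0 and payoff (S_T - K)^+."""
    return np.mean(np.maximum(s_T - strike, 0.0))

# For samples s_exact of S_T and s_model of S_T(theta*) and any strike K,
#   abs(call_price(s_exact, K) - call_price(s_model, K)) <= wasserstein1(s_exact, s_model),
# which is exactly the bound (23) used to monitor the pricing error.
```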

4 Numerical tests

In this section, we illustrate the performance of the proposed mSOE scheme (16) for simulating (1)-(2), and demonstrate the learning of the forward variance curve (7) using the neural SDE (18).

4.1 Numerical test for mSOE scheme (16)

First we compare Approaches A and B. They only differ in the choice of the nodes $\lambda_{j}$ and weights $\omega_{j}$. To compare their performance in the simulation of the stochastic Volterra process $I(t)$, one only needs to estimate the discrepancy between $G$ and $\hat{G}$ in the $L^{2}(t_{1},T)$-norm, cf. (12). Thus, as the criterion for comparison, we use $\text{err}:=(\int_{t_{1}}^{T}(G(t)-\hat{G}(t))^{2}\,\mathrm{d}t)^{\frac{1}{2}}$. We show in Figure 1(a) the err versus the number of nodes $N$ for Approaches A and B with $H=0.07$. We observe that Approach B has a better error decay. This behavior is also observed for other $H\in(0,1/2)$. Thus, we implement Approach B in (10), which involves an $\mathcal{O}(N^{3})$ computational cost to obtain the non-negative tuples $\{(\omega_{j},\lambda_{j})\}_{j=1}^{N}$.

Figure 1: (a) The performance of the mSOE-$N$ schemes based on Approaches A and B with $\tau=0.0005$, $T=1$ and $H=0.07$; (b) the training dynamics.

Next we calculate the implied volatility curves using the mSOE-$N$ based numerical schemes with $N=2,4,8,10,32$ and the parameters listed in Table 2. To ensure that the temporal discretization error and the Monte Carlo simulation error are negligible, we take the number of temporal steps $n=2000$ and the number of samples $M=10^{6}$. The resulting implied volatility curves, computed via the Newton-Raphson method, are shown in Figure 2. We compare the proposed mSOE scheme with the exact method, i.e., Cholesky decomposition, and with the hybrid scheme. We observe that the mSOE-4 scheme can already accurately approximate the implied volatility curves given by the exact method. This clearly shows the high efficiency of the proposed mSOE based scheme for simulating the implied volatility curves.

Table 2: Parameter values used in the rBergomi model.

 $S_0$   $\xi_0$     $\eta$   $H$    $\rho$
 1       $0.235^2$   1.9      0.07   -0.9
Figure 2: The implied volatility curves $\sigma_{\mathrm{imp}}$ computed by the mSOE-$N$ based numerical scheme (16) with different numbers of summation terms $N$. In the plot, $k=\log(K/S_{0})$ denotes the log-moneyness and $K$ the strike price.

Next, we depict in Figure 3 the root mean squared errors (RMSEs) of the first and second moments of the following Gaussian random variable, using the same set of parameters:

\mathcal{G}(t):=\eta I_{t}-\frac{\eta^{2}}{2}t^{2H}\sim\mathcal{N}\left(-\frac{\eta^{2}}{2}t^{2H},\eta^{2}t^{2H}\right).

The RMSEs are defined by

\begin{aligned}
\text{RMSE 1st moment} &=\left(\sum_{i=1}^{n}\left(\mathbb{E}\left[\hat{\mathcal{G}}(t_{i})\right]+\frac{\eta^{2}}{2}t_{i}^{2H}\right)^{2}\right)^{1/2},\\
\text{RMSE 2nd moment} &=\left(\sum_{i=1}^{n}\left(\mathbb{E}\left[\hat{\mathcal{G}}(t_{i})^{2}\right]-\left(\frac{\eta^{4}t_{i}^{4H}}{4}+\eta^{2}t_{i}^{2H}\right)\right)^{2}\right)^{1/2},
\end{aligned}

where $\hat{\mathcal{G}}(t)$ is the approximation of $\mathcal{G}(t)$ under the mSOE-$N$ scheme or the hybrid scheme. These quantities characterize the fundamental statistical features of the distribution. We observe that the mSOE-$N$ based numerical scheme achieves high accuracy for $N>15$ terms, again clearly showing the high efficiency of the proposed scheme.
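In these tests, the expectations $\mathbb{E}[\hat{\mathcal{G}}(t_i)]$ and $\mathbb{E}[\hat{\mathcal{G}}(t_i)^2]$ are replaced by Monte Carlo averages over the simulated paths. A minimal sketch of the two RMSEs (our own code, where `G_hat_paths` is a hypothetical array of shape `(paths, n)` holding simulated values of $\hat{\mathcal{G}}(t_i)$):

```python
import numpy as np

def moment_rmses(G_hat_paths, t, eta, H):
    """RMSEs of the first and second moments of G(t) = eta*I_t - (eta^2/2) t^{2H}."""
    m1 = G_hat_paths.mean(axis=0)           # Monte Carlo estimate of E[G_hat(t_i)]
    m2 = (G_hat_paths ** 2).mean(axis=0)    # Monte Carlo estimate of E[G_hat(t_i)^2]
    exact_m1 = -0.5 * eta**2 * t ** (2 * H)
    exact_m2 = 0.25 * eta**4 * t ** (4 * H) + eta**2 * t ** (2 * H)
    rmse1 = np.sqrt(np.sum((m1 - exact_m1) ** 2))
    rmse2 = np.sqrt(np.sum((m2 - exact_m2) ** 2))
    return rmse1, rmse2
```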

Figure 3: The RMSEs of the first and second moments generated by the mSOE-$N$ based numerical scheme (16) and the hybrid scheme.

4.2 Numerical performance for learning the forward variance curve

Now we validate (23) using three examples of ground truth forward variance curves:

\xi_{0}(t)\equiv 0.235^{2}, \qquad (24a)
\xi_{0}(t)=2|W_{t}|, \qquad (24b)
\xi_{0}(t)=0.1|W^{H}_{t}|\ \text{with }H=0.07. \qquad (24c)

The first example (24a) corresponds to a constant forward variance curve. The second example (24b) takes a scaled sample path of Brownian motion as the forward variance curve, and the third example (24c) utilises a scaled path of fractional Brownian motion as the forward variance curve.

Figure 4: Architecture of the feedforward neural network with three hidden layers of width 100: input $t$, output $\xi(t;\theta)$.

The initial forward variance curve $\xi_{0}$ is parameterized by a feedforward neural network, cf. (7), which has 3 hidden layers of width 100 and leaky ReLU activations; see Figure 4 for a schematic illustration. The weights of each model were carefully initialized in order to prevent vanishing gradients.
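A minimal PyTorch sketch of this network (our own code; the paper does not prescribe a particular framework) is given below. The softplus applied to the output, which keeps the learned forward variance curve positive, and the Kaiming initialization are our assumptions rather than choices stated in the paper:

```python
import torch
import torch.nn as nn

class ForwardVarianceNet(nn.Module):
    """xi_0(t; theta): 1 -> 100 -> 100 -> 100 -> 1 with leaky ReLU activations, cf. Figure 4."""
    def __init__(self, width=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, width), nn.LeakyReLU(),
            nn.Linear(width, width), nn.LeakyReLU(),
            nn.Linear(width, width), nn.LeakyReLU(),
            nn.Linear(width, 1),
        )
        for m in self.net:                  # careful initialization against vanishing gradients
            if isinstance(m, nn.Linear):
                nn.init.kaiming_normal_(m.weight, nonlinearity='leaky_relu')
                nn.init.zeros_(m.bias)

    def forward(self, t):
        # t has shape (batch, 1); softplus enforces positivity of the variance curve
        return nn.functional.softplus(self.net(t))

xi0_net = ForwardVarianceNet()
print(xi0_net(torch.linspace(0.0, 1.0, 5).unsqueeze(-1)))
```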

To generate the training data, we use the stock price samples generated by the scheme (16). The total number of samples used in the experiments is $10^{5}$. Training of the neural network was performed on the first 81.92% of the dataset, and the model's performance was evaluated on the remaining 18.08%. Each sample is of length 2000, which is the number of discretized time intervals. The batch size was 4096, chosen as large as the GPU memory allowed. Each model was trained for 100 epochs. We use the Wasserstein-1 distance as the loss by applying (21) to the empirical distributions of batches of data at the final time point of the samples.

The neural SDEs were trained using Adam [15]. The learning rate was taken to be $10^{-4}$ and was then gradually reduced until a good performance was achieved. The training was performed on a Linux server with an NVIDIA RTX6000 GPU and took a few hours for each experiment. $V_{t}(\theta)$ in (18) was simulated by the mSOE scheme to reduce the computational complexity, and $S_{t}(\theta)$ was then solved using the Euler-Maruyama method. The example code for dataset sampling and network training will be made available at the GitHub repository https://github.com/evergreen1002/Neural-option-pricing-for-rBergomi-model.
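A minimal sketch of one training step is shown below (our own code). It assumes a differentiable simulator `simulate_S_T(xi0_net, batch_size)`, a hypothetical routine that discretizes (18) with the mSOE/Euler-Maruyama scheme and returns terminal prices $S_T(\theta)$, as well as a tensor of training samples `S_T_data_batch`:

```python
import torch

def wasserstein1_torch(x, y):
    """Empirical W1 loss, i.e. (21) with p = 1: mean absolute difference of sorted samples."""
    return torch.mean(torch.abs(torch.sort(x).values - torch.sort(y).values))

optimizer = torch.optim.Adam(xi0_net.parameters(), lr=1e-4)   # xi0_net from the previous snippet

def training_step(S_T_data_batch, batch_size=4096):
    optimizer.zero_grad()
    S_T_model = simulate_S_T(xi0_net, batch_size)   # hypothetical differentiable simulator of (18)
    loss = wasserstein1_torch(S_T_model, S_T_data_batch)
    loss.backward()                                 # backpropagate through the simulation
    optimizer.step()
    return loss.item()
```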

Now we present the numerical results for the learned forward variance curves in Figure 5. The three rows correspond to the three initial forward variance curves (24a), (24b) and (24c) used for training, respectively. The first two columns display the empirical distribution of $S_{T}(\theta)$ and the corresponding European call option prices $P_{0}(\theta)$ on the test set against a number of strikes before training has begun. The next two columns display $S_{T}(\theta^{*})$ and $P_{0}(\theta^{*})$ after training. It is observed that $S_{T}(\theta^{*})$ and $P_{0}(\theta^{*})$ closely match the exact ones on the test set, thereby showing the high accuracy of learning the rBergomi model. The last column shows the Wasserstein-1 distance at time $T$ compared with the maximum error in the option price over each batch during training; these two quantities are precisely the ones appearing in (23). Only the first 500 iterations are plotted. In all cases, the training loss can be reduced to an acceptable level, and so can the pricing error, clearly indicating the feasibility of learning within the rBergomi model. Finally, we investigate whether the training loss can be further reduced after more iterations for the first example (24a). Figure 1(b) shows the training loss against the number of iterations, where gradient descent is applied with learning rate $10^{-5}$. We observe that the training loss oscillates near 0.05 even after 100,000 iterations. Indeed, the oscillation of the loss values worsens as the iteration proceeds, indicating the necessity of early stopping of the training process.

Figure 5: Left to right: empirical distribution of the terminal stock price before training, option price before training, empirical distribution of the terminal stock price after training, option price after training, and the learning curve.

5 Conclusion

In this work, we have proposed a novel neural SDE model to learn the forward variance curve of the rough Bergomi model. We have also proposed a new modified summation of exponentials (mSOE) scheme to improve the efficiency of generating the training data and to facilitate the training process. We have utilized the Wasserstein 1-distance as the loss function to calibrate the dynamics of the underlying asset and the price of European options simultaneously. Furthermore, several numerical experiments demonstrate the performance of the approach and clearly show the feasibility of using neural networks to calibrate such models. Future work includes learning all the unknown functions in the rBergomi model (as well as in other rough volatility models) with the proposed approach, using market data instead of the simulated data employed in the present work.

References

  • [1] E. Abi Jaber and O. El Euch. Multifactor approximation of rough volatility models. SIAM Journal on Financial Mathematics, 10(2):309–349, 2019.
  • [2] C. Bayer and S. Breneis. Markovian approximations of stochastic Volterra equations with the fractional kernel. Quantitative Finance, 23(1):53–70, 2023.
  • [3] C. Bayer, P. Friz, and J. Gatheral. Pricing under rough volatility. Quantitative Finance, 16(6):887–904, 2016.
  • [4] M. Bennedsen, A. Lunde, and M. S. Pakkanen. Hybrid scheme for Brownian semistationary processes. Finance and Stochastics, 21:931–965, 2017.
  • [5] D. Braess. Nonlinear approximation theory. Springer Science & Business Media, 2012.
  • [6] L. Coutin and P. Carmona. Fractional Brownian motion and the Markov property. Electronic Communications in Probability, 3:12, 1998.
  • [7] T. DeLise. Neural options pricing. Preprint, arXiv:2105.13320, 2021.
  • [8] J. Gatheral, T. Jaisson, and M. Rosenbaum. Volatility is rough. In Commodities, pages 659–690. Chapman and Hall/CRC, 2022.
  • [9] L. Hodgkinson, C. van der Heide, F. Roosta, and M. W. Mahoney. Stochastic normalizing flows. arXiv preprint arXiv:2002.09547, 2020.
  • [10] J. Jia and A. R. Benson. Neural jump stochastic differential equations. In Advances in Neural Information Processing Systems, volume 32, 2019.
  • [11] S. Jiang, J. Zhang, Q. Zhang, and Z. Zhang. Fast evaluation of the Caputo fractional derivative and its applications to fractional diffusion equations. Communications in Computational Physics, 21(3):650–678, 2017.
  • [12] B. Jin and Z. Zhou. Numerical Treatment and Analysis of Time-Fractional Evolution Equations, volume 214 of Applied Mathematical Sciences. Springer, Cham, 2023.
  • [13] P. Kidger, J. Foster, X. Li, and T. J. Lyons. Neural SDEs as infinite-dimensional GANs. In International conference on machine learning, pages 5453–5463. PMLR, 2021.
  • [14] P. Kidger, J. Foster, X. C. Li, and T. Lyons. Efficient and accurate gradients for neural SDEs. In Advances in Neural Information Processing Systems, volume 34, pages 18747–18761, 2021.
  • [15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In 3rd International Conference for Learning Representations, San Diego, 2015.
  • [16] X. Li, T.-K. L. Wong, R. T. Chen, and D. Duvenaud. Scalable gradients for stochastic differential equations. In International Conference on Artificial Intelligence and Statistics, pages 3870–3882. PMLR, 2020.
  • [17] X. Liu, T. Xiao, S. Si, Q. Cao, S. Kumar, and C.-J. Hsieh. Neural SDE: Stabilizing neural ODE networks with stochastic noise. arXiv preprint arXiv:1906.02355, 2019.
  • [18] B. B. Mandelbrot and J. W. Van Ness. Fractional Brownian motions, fractional noises and applications. SIAM Review, 10(4):422–437, 1968.
  • [19] K. S. Miller and S. G. Samko. Completely monotonic functions. Integral Transforms and Special Functions, 12(4):389–402, 2001.
  • [20] S. E. Rømer. Hybrid multifactor scheme for stochastic Volterra equations with completely monotone kernels. Available at SSRN 3706253, 2022.
  • [21] A. Tong, T. Nguyen-Tang, T. Tran, and J. Choi. Learning fractional white noises in neural stochastic differential equations. In Advances in Neural Information Processing Systems, volume 35, pages 37660–37675, 2022.
  • [22] B. Tzen and M. Raginsky. Neural stochastic differential equations: Deep latent gaussian models in the diffusion limit. arXiv preprint arXiv:1905.09883, 2019.
  • [23] B. Tzen and M. Raginsky. Theoretical guarantees for sampling and inference in generative models with latent diffusions. In Conference on Learning Theory, pages 3084–3114. PMLR, 2019.
  • [24] C. Villani. Topics in optimal transportation. American Mathematical Soc., Providence, 2021.
  • [25] D. V. Widder. The Laplace Transform, volume vol. 6 of Princeton Mathematical Series. Princeton University Press, Princeton, NJ, 1941.