
Pricing American options under rough volatility using deep-signatures and signature-kernels

Christian Bayer (Weierstrass Institut (WIAS), Berlin, Germany, email: [email protected]), Luca Pelizzari (Universität Berlin and Weierstrass Institut (WIAS), Berlin, Germany, email: [email protected]), Jia-Jie Zhu (Weierstrass Institut (WIAS), Berlin, Germany, email: [email protected]).
Abstract

We extend the signature-based primal and dual solutions to the optimal stopping problem recently introduced in [BPS23], by integrating deep-signature and signature-kernel learning methodologies. These approaches are designed for non-Markovian frameworks, in particular enabling the pricing of American options under rough volatility. We demonstrate and compare the performance within the popular rough Heston and rough Bergomi models.

Keywords: Signature, optimal stopping, rough volatility, deep learning, kernel learning.
MSC2020 classifications: 60G40, 60L10, 91G20, 91G60.

1 Introduction

Starting with the seminal paper [KLA20], (rough) path signatures [Che57] have been increasingly recognized as a powerful tool for numerical approximations to solutions of stochastic optimal control problems when the underlying system does not have the Markov property. While the workhorse theoretical methods for stochastic optimal control – dynamic programming / HJB equations on the one hand, Pontryagin maximum principle / BSDEs on the other hand – can, in principle, be formulated and analysed even when the underlying system is not Markovian, this comes at the expense of tractability.

For example, in the Markovian case, we can assume (under mild conditions) that optimal controls are of feedback form, i.e., can be expressed as functions $\alpha^{\ast}(t,X_{t})$ of the underlying state variable $X_{t}$. If we do not make the Markovian assumption, we only know that optimal controls $\alpha^{\ast}_{t}$ are, say, progressively measurable w.r.t. the governing filtration, i.e., $\alpha^{\ast}_{t}=\alpha^{\ast}\left(t,X|_{[0,t]}\right)$, provided that said filtration is generated by a process $X$. Similar remarks can be made for the value function as well as the solutions to the corresponding BSDE systems.

From a numerical point of view, this observation increases the computational challenge considerably for solving non-Markovian stochastic optimal control problems. In essence, an approximation problem for a finite-dimensional function, say, $\alpha^{\ast}:[0,T]\times\mathbb{R}^{d}\to\mathbb{R}^{m}$, is replaced with one for a function on pathspace, say $\alpha^{\ast}:[0,T]\times C([0,T];\mathbb{R}^{d})\to\mathbb{R}^{m}$ (with additional measurability constraints).

Kalsi, Lyons and Perez Arribas suggested a general framework for solving stochastic optimal control problems in a model-free way (in particular, without assuming a Markovian problem) for the example of an optimal execution problem, see [KLA20]. In a nutshell, the approach consists of approximating the strategy as well as the value function as linear functionals of the signature of the underlying process. The linearization – based on the signature’s universality – allows them to rephrase the optimal execution problem as a deterministic optimization problem in terms of the expected signature.

Full linearization as in [KLA20] is only feasible if the unknown function (value function, control, …) is smooth enough, so as not to require a very deep degree of signature approximation – comparable to approximations with polynomials of high degree in finite dimensional spaces. Otherwise, several other strategies have been considered in the literature:

  • Signatures can be interlaced with non-linear transformations of the path, e.g., deep neural networks, to increase the signature’s “expressivity”, see, for instance, [TBO21].

  • Signatures – or, alternatively, log-signatures to reduce the dimension – can be used as features for other, non-linear approximation methods, e.g., deep neural networks. In the context of optimal stopping, this was first suggested in [BHRS23].

In addition, signature kernels, see [CO22], provide a classical kernel learning approach to regression problems in general, and can also be used for stochastic optimal control, in particular. [SCF+21] observed that the (non-truncated) signature kernel satisfies a Goursat PDE, and, hence, can be computed numerically, without requiring the (otherwise crucial) truncations of the signature.

Rough volatility models, see [BFF+23], are a popular class of stochastic volatility models in finance, i.e., with a stock price process following a dynamics $dS_{t}=\sqrt{v_{t}}S_{t}\,dZ_{t}$, where the stochastic variance process $v$ is “rough”, e.g., an exponential of fractional Brownian motion with Hurst index $0<H<1/2$. Crucially, such models $(S,v)$ do not have the Markov property, so signature methods are ideally suited for computing approximate solutions to the optimal stopping problem in $S$ – in financial terms: to compute the prices of American or Bermudan options.

This paper builds on [BPS23], where adaptations of the classical Longstaff–Schwartz and dual algorithms for Bermudan option pricing based on linear functionals of the signature were introduced and analysed. In this paper, we further extend these algorithms by introducing versions based on non-linear functions of the signature as well as signature-kernel versions. We then apply them to the Bermudan options pricing problem for popular rough volatility models (specifically, the rough Bergomi model [BFG16] and the rough Heston model [EER19]), and numerically analyse their performance for these models under realistic model parameters.

The goal of this paper is to compute American (more precisely, Bermudan) option prices in the aforementioned rough volatility models, with special emphasis on numerical performance as well as comparisons between the methods outlined before. Specifically, we solve primal and dual formulations of the optimal stopping problem – giving us lower and upper bounds of the option price, respectively – using three different signature methods each:

  1. linear functionals of the truncated signature;

  2. deep neural networks applied to truncated signatures;

  3. linear combinations of the signature kernel.

Compared to the literature (see, for instance, [BPS23] based on linear functionals of the signature alone), we provide considerably sharper bounds for the option price, in the sense of the length of the interval between lower and upper bounds. More specifically, we find that for a realistic rough volatility model (the rough Bergomi model with Hurst index $H=0.07$, as suggested in [BFG16]), we obtain a tight gap of about $1\%$.

Regarding the comparison between the different methods, no method seems to consistently outperform the others. For the primal problem, all three methods tend to perform very well, with mild advantages for the linear and the signature kernel methods. The dual formulation, however, seems to lead to more difficult approximation and optimization problems, and the neural-network-based method has some advantages. Overall, we see the biggest improvements compared to [BPS23] for the dual formulation.

This conclusion is also supported by various other numerical studies performed, including a study of the dependence of the training error on the number of training samples. We also study the relative importance of the different components in the signature, specifically in the DNN-based method. For the primal formulation, while the most important signature components are those corresponding to powers of the increments, generally all components matter, and there is no apparent sparse structure. On the other hand, for the dual problem, only very few signature components actually are important, and there seems to be great potential for dimension reduction.

Acknowledgments

All authors gratefully acknowledge funding by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy – The Berlin Mathematics Research Center MATH+ (EXC-2046/1, project ID: 390685689). CB and PL acknowledge support from DFG CRC/TRR 388 “Rough Analysis, Stochastic Dynamics and Related Fields”, Project B03.

2 A review on Monte-Carlo methods for optimal stopping

In this section we introduce the optimal stopping problem in a fairly general framework, and recall two general simulation-based procedures to derive lower and upper bounds of the optimal value. Let $\{\Omega,\mathcal{F},\mathbb{P}\}$ be a probability space supporting a state-process $(X_{t}:0\leq t\leq T)$, which we only assume to be $\alpha$-Hölder continuous almost surely, $\alpha\in(0,1)$. Moreover, we consider an $(\mathcal{F}^{X}_{t})$-adapted, continuous process $(Z_{t}:0\leq t\leq T)$, the cash-flow process, which for technical reasons we assume to fulfill $\sup_{0\leq t\leq T}|Z_{t}|^{2}\in L^{1}$. Finally, the optimal stopping problem reads

Y_{0}=\sup_{\tau\in\mathcal{S}_{0}}\mathbb{E}[Z_{\tau}],\quad Y_{t}=\operatorname*{ess\,sup}_{\tau\in\mathcal{S}_{t}}\mathbb{E}[Z_{\tau}|\mathcal{F}^{X}_{t}],\quad 0<t\leq T, \qquad (1)

where $\mathcal{S}_{t}$ denotes the set of $(\mathcal{F}^{X}_{t})$-stopping times on $[t,T]$.

2.1 A general Longstaff-Schwartz algorithm

Replacing the continuous time interval $[0,T]$ by some finite grid $0=t_{0}<t_{1}<\dots<t_{N}=T$, the dynamic programming principle [PS06, eq. (2.1.7)] for the discrete optimal stopping problem reads

Y_{t_{N}}=Z_{t_{N}},\quad Y_{t_{n}}=\max\big(Z_{t_{n}},\mathbb{E}[Y_{t_{n+1}}|\mathcal{F}^{X}_{t_{n}}]\big),\quad 0\leq n\leq N-1, \qquad (2)

and $\tau^{\star}=\inf\{t_{n}:Z_{t_{n}}\geq\mathbb{E}[Y_{t_{n+1}}|\mathcal{F}^{X}_{t_{n}}]\}$ is optimal. Motivated by the famous Longstaff and Schwartz algorithm [LS01], the first optimal stopping time can be obtained as $\tau_{0}$ of the following recursion

\tau_{N}=t_{N},\quad\tau_{n}=t_{n}1_{\{Z_{t_{n}}\geq\mathbb{E}[Z_{\tau_{n+1}}|\mathcal{F}^{X}_{t_{n}}]\}}+\tau_{n+1}1_{\{Z_{t_{n}}<\mathbb{E}[Z_{\tau_{n+1}}|\mathcal{F}^{X}_{t_{n}}]\}},\quad 0\leq n\leq N-1. \qquad (3)

The remaining question is how to compute the continuation values $\mathbb{E}[Z_{\tau_{n+1}}|\mathcal{F}^{X}_{t_{n}}]$ for possibly non-Markovian state-processes $X$, which will be the main topic of the forthcoming Section 3. For now, assume we are given a suitable family of basis-functions $(\psi_{k})$, such that $\mathbb{E}[Z_{\tau_{n+1}}|\mathcal{F}^{X}_{t_{n}}]\approx\sum_{k}\alpha_{k}^{n}\psi_{k}(X|_{[0,t_{n}]})$ with some coefficients $(\alpha^{n}_{k})$. A general version of the Longstaff-Schwartz algorithm goes as follows:

  1. Draw $i=1,\dots,M$ samples $X^{(i)}$ and $Z^{(i)}$ and initialize $\tau_{N}^{(i)}\equiv t_{N}$.

  2. Then, for $n=N-1,\dots,1$:

    • solve the minimization problem

      \alpha^{\star,n}:=\mathop{\mathrm{argmin}}_{\alpha}\frac{1}{M}\sum_{i=1}^{M}\left(Z^{(i)}_{\tau_{n+1}^{(i)}}-\sum_{k}\alpha_{k}\psi_{k}^{n}(X|_{[0,t_{n}]}^{(i)})\right)^{2};

    • set $\tau_{n}^{(i)}=t_{n}$ if $Z^{(i)}_{t_{n}}\geq\sum_{k}\alpha^{\star,n}_{k}\psi_{k}^{n}(X|_{[0,t_{n}]}^{(i)})$, and $\tau_{n}^{(i)}=\tau_{n+1}^{(i)}$ otherwise.

  3. Finally, draw $i=1,\dots,M^{\prime}$ independent sample-paths $\tilde{X}^{(i)}$ and $\tilde{Z}^{(i)}$, and compute the stopping times $\tilde{\tau}^{(i)}=\inf\big\{t_{n}:\tilde{Z}^{(i)}_{t_{n}}\geq\sum_{k}\alpha^{\star,n}_{k}\psi_{k}^{n}(\tilde{X}|_{[0,t_{n}]}^{(i)})\big\}$ and the lower-biased estimator

      \tilde{y}_{0}=\max\left(Z_{t_{0}},\frac{1}{M^{\prime}}\sum_{i=1}^{M^{\prime}}\tilde{Z}^{(i)}_{\tilde{\tau}^{(i)}}\right).

Notice that the resimulation in the final step ensures that the $\tilde{\tau}^{(i)}$ are indeed stopping times, since the computed coefficients $\alpha$ are now independent of the samples, and thus $\max(Z_{t_{0}},\mathbb{E}[Z_{\tilde{\tau}_{1}}])$ is a true lower bound. The Monte-Carlo approximation $\tilde{y}_{0}$ is therefore lower-biased and, strictly speaking, not a true lower bound, due to the Monte-Carlo error with respect to $M^{\prime}$. Nevertheless, this error is typically chosen to be very small, and we will refer to $\tilde{y}_{0}$ as a lower bound anyway.
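
To make the procedure above concrete, the following is a minimal numpy sketch of the generic Longstaff-Schwartz recursion with user-supplied basis functions $\psi_{k}$; the array shapes, the toy Black-Scholes example and all function names are illustrative and not taken from the paper's code.

```python
import numpy as np

def longstaff_schwartz_lower_bound(Z, Psi, Z_test, Psi_test):
    """Generic Longstaff-Schwartz as in Section 2.1 (illustrative sketch).

    Z, Z_test     : arrays of shape (M, N+1), discounted cash-flows Z_{t_0},...,Z_{t_N}
    Psi, Psi_test : arrays of shape (M, N+1, q), basis functions psi_k(X|_[0,t_n])
    Returns the lower-biased estimate computed on the independent test samples.
    """
    M, N1, _ = Psi.shape
    N = N1 - 1
    tau = np.full(M, N)                                   # step 1: tau_N = t_N
    coeffs = [None] * N1
    for n in range(N - 1, 0, -1):                         # step 2: backward induction
        y = Z[np.arange(M), tau]                          # realized cash-flow Z_{tau_{n+1}}
        coeffs[n], *_ = np.linalg.lstsq(Psi[:, n, :], y, rcond=None)
        exercise = Z[:, n] >= Psi[:, n, :] @ coeffs[n]    # payoff >= estimated continuation value
        tau = np.where(exercise, n, tau)
    # step 3: evaluate the learned stopping rule on fresh samples
    Mp = Z_test.shape[0]
    tau_t = np.full(Mp, N)
    for n in range(N - 1, 0, -1):
        tau_t = np.where(Z_test[:, n] >= Psi_test[:, n, :] @ coeffs[n], n, tau_t)
    return max(Z_test[:, 0].mean(), Z_test[np.arange(Mp), tau_t].mean())

# toy usage: Bermudan put on a Black-Scholes asset, polynomial features of S_t
rng = np.random.default_rng(0)
M, N, T, r, sigma, S0, K = 20_000, 12, 1.0, 0.05, 0.3, 100.0, 100.0

def simulate(m):
    dt = T / N
    logS = np.cumsum((r - 0.5 * sigma**2) * dt
                     + sigma * np.sqrt(dt) * rng.standard_normal((m, N)), axis=1)
    S = np.hstack([np.full((m, 1), S0), S0 * np.exp(logS)])
    t = np.linspace(0.0, T, N + 1)
    Z = np.exp(-r * t) * np.maximum(K - S, 0.0)            # discounted payoff
    Psi = np.stack([np.ones_like(S), S, S**2, Z], axis=2)  # simple Markovian basis
    return Z, Psi

print(longstaff_schwartz_lower_bound(*simulate(M), *simulate(M)))
```

In the signature-based methods of Section 3, the columns of Psi at time $t_{n}$ would instead be (functions of) the truncated signature of the path up to $t_{n}$.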

2.2 A general dual procedure

While the Longstaff-Schwartz procedure described in the last section provides us with lower-bounds for the optimal value, it is desirable to additionally have an algorithm producing upper-bounds. To this end, it is useful to consider the (discretized) dual formulation of (1) due to [Rog02]

\inf_{M\in\mathcal{M}^{2}}\mathbb{E}\Big[\max_{0\leq k\leq N}(Z_{t_{k}}-M_{t_{k}})\Big], \qquad (4)

where the minimization is over discrete $L^{2}$-martingales with respect to the filtration $\mathcal{G}_{n}=\mathcal{F}^{X}_{t_{n}},\ n=0,\dots,N$. By the Martingale Representation Theorem [KS91, Theorem 4.5], if the underlying filtration $(\mathcal{F}^{X}_{t})$ is generated by a Brownian motion $W$, such martingales are of the form $M_{t_{k}}^{\alpha}=\int_{0}^{t_{k}}\alpha_{s}dW_{s}$ for some $(\mathcal{F}_{t}^{X})$-progressive process $\alpha$. On the other hand, due to the progressive measurability, the integrands are of the form $\alpha_{t}=f(X|_{[0,t]})$ for some (deterministic) function $f$. Assume again that we are given some family of basis-functions $(\psi_{k})$ such that $f(X|_{[0,t]})\approx\sum_{k}\beta_{k}\psi_{k}(X|_{[0,t]})$ for some coefficients $\beta$. If the family $(\psi_{k})$ is rich enough, it is reasonable to parametrize the space $\mathcal{M}^{2}$ by the span of the basis-martingales $\{M^{k}_{t}=\int_{0}^{t}\psi_{k}(X|_{[0,s]})dW_{s}:k\geq 0\}$, which leads to the following dual algorithm:

  1. Draw $i=1,\dots,M$ sample-paths of $Z^{(i)},X^{(i)},W^{(i)}$, and compute the basis martingales $M^{k,(i)}$ (e.g. using an Euler scheme).

  2. Solve the minimization problem (see, e.g., [DFM12, Bel13])

      \beta^{\star}=\mathop{\mathrm{argmin}}_{\beta}\frac{1}{M}\sum_{i=1}^{M}\max_{0\leq n\leq N}\left(Z^{(i)}_{t_{n}}-\sum_{k}\beta_{k}M^{k,(i)}_{t_{n}}\right).

  3. Finally, draw $i=1,\dots,M^{\prime}$ independent sample-paths $\tilde{Z}^{(i)},\tilde{X}^{(i)},\tilde{W}^{(i)}$, compute the martingales $\tilde{M}^{k,(i)}$, and compute the upper-biased estimator

      \tilde{y}_{0}=\frac{1}{M^{\prime}}\sum_{i=1}^{M^{\prime}}\max_{0\leq n\leq N}\left(\tilde{Z}^{(i)}_{t_{n}}-\sum_{k}\beta^{\star}_{k}\tilde{M}^{k,(i)}_{t_{n}}\right).

Similarly to the primal case, the independent resimulations ensure that we obtain true martingales $\tilde{M}^{\star}:=\sum_{k}\beta^{\star}_{k}\tilde{M}^{k}$, since $\beta^{\star}$ is independent of the samples, and therefore $\mathbb{E}[\max_{k}(Z_{t_{k}}-\tilde{M}^{\star}_{t_{k}})]$ is a true upper bound. By the same abuse of terminology as in the primal case, we call the Monte-Carlo approximation $\tilde{y}_{0}$ an upper bound, keeping in mind that it is, strictly speaking, only upper-biased due to the Monte-Carlo error with respect to $M^{\prime}$.
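
The minimization in the second step is a convex piecewise-linear problem in $\beta$ and can, for instance, be written as a linear program (in the spirit of [DFM12, Bel13]). The following is a small illustrative sketch, assuming the basis martingales have already been sampled on the grid via Euler sums $M^{k,(i)}_{t_{n+1}}=M^{k,(i)}_{t_{n}}+\psi_{k}(X^{(i)}|_{[0,t_{n}]})\Delta W^{(i)}_{n}$; the function names and the epigraph/LP formulation are our own, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import linprog

def dual_upper_bound_coeffs(Z, Mart):
    """Minimize (1/M) sum_i max_n ( Z^{(i)}_{t_n} - sum_k beta_k M^{k,(i)}_{t_n} )
    via the epigraph LP:  min (1/M) sum_i u_i  s.t.  u_i >= Z^{(i)}_{t_n} - sum_k beta_k M^{k,(i)}_{t_n}.

    Z    : (M, N+1) cash-flow samples
    Mart : (M, N+1, q) sampled basis martingales M^{k,(i)}_{t_n}
    """
    M, N1, q = Mart.shape
    # decision variables x = (beta_1,...,beta_q, u_1,...,u_M)
    c = np.concatenate([np.zeros(q), np.ones(M) / M])
    rows, rhs = [], []
    for n in range(N1):                        # one constraint block per exercise date
        rows.append(np.hstack([-Mart[:, n, :], -np.eye(M)]))
        rhs.append(-Z[:, n])
    res = linprog(c, A_ub=np.vstack(rows), b_ub=np.concatenate(rhs),
                  bounds=[(None, None)] * (q + M), method="highs")
    beta = res.x[:q]
    in_sample_value = np.mean(np.max(Z - Mart @ beta, axis=1))
    return beta, in_sample_value
```

On fresh samples one would recompute the basis martingales and evaluate the same maximum with the trained $\beta^{\star}$, as in step 3 above.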

3 Signature stopping policies and dual martingales

To numerically implement the two algorithms presented in the last section, we need to learn functionals $f:\mathcal{X}\rightarrow\mathbb{R}$, where $\mathcal{X}$ is an infinite-dimensional path space. As motivated in the introduction, we will use the path-signature, resp. the corresponding signature-kernel, to tackle this problem. For the rough path theoretical details in this section, we refer to the excellent books [FV10, FH20], but see also [CS24] with a focus on machine learning.

Let $(x(t):0\leq t\leq T)$ be an $\mathbb{R}^{d}$-valued, $\alpha$-Hölder continuous path, $\alpha\in(0,1]$. The rough path signature of $x$ is given by the path $\mathbbm{x}^{<\infty}_{\cdot,\cdot}:\Delta_{[0,T]}\to T((\mathbb{R}^{d}))$, defined (at least formally) as the collection of iterated integrals

\mathbbm{x}^{<\infty}_{s,t}:=\sum_{n=0}^{\infty}\int_{s<t_{1}<\cdots<t_{n}<t}dx(t_{1})\otimes\cdots\otimes dx(t_{n}),\quad 0\leq s\leq t\leq T,

taking values in the extended tensor algebra

T((\mathbb{R}^{d})):=\prod_{n=0}^{\infty}(\mathbb{R}^{d})^{\otimes n},

an algebra under a (naturally defined) non-commutative product $\otimes$. If $\alpha=1$, the iterated integrals are understood in the Riemann-Stieltjes sense, see [FV10, Chapter 7.2]; the case $\alpha<1$ is less obvious, since $dx$ does not have a meaning a-priori. However, Lyons’ theory of rough paths [Lyo98] allows us to give the latter a meaning, but higher-order information of the path is required. More precisely, we have to lift $x$ to a so-called rough path $\mathbf{x}=(\mathbf{x}^{(1)},\dots,\mathbf{x}^{(N)})\in\prod_{n=0}^{N}(\mathbb{R}^{d})^{\otimes n}$, where $N=\lfloor\frac{1}{\alpha}\rfloor$ and $\mathbf{x}^{(1)}=x$, see [FV10, Chapter 9]. Having such a rough path lift $\mathbf{x}$ at hand, the notion of rough integration allows us to make sense of integrating against $d\mathbf{x}$, and a signature lift of $\mathbf{x}$ can be defined, sharing all the properties of the path-signature for smooth paths, which is due to Lyons’ extension theorem [Lyo98, Theorem 3.7].

Remark 3.1

The authors of [BGLY16] show that the map $x\mapsto\mathbbm{x}^{<\infty}$ is injective up to a certain equivalence class on path spaces, which can be eliminated by adding a monotone component to the path, e.g. the time-augmentation $\hat{x}(t)=(t,x(t))$. Moreover, by applying a so-called tensor-normalization $\lambda:T((\mathbb{R}^{d}))\rightarrow T((\mathbb{R}^{d}))$ to the signature, the authors of [CO22] introduce the notion of robust signatures $\lambda(\mathbbm{x}^{<\infty})$, which can be seen as a bounded version of the signature, sharing all of its structural properties. For the theoretical results in this section, we always assume that our underlying process $X$ is already augmented by some monotone function, that this pair can be lifted to a geometric rough path $\mathbf{X}$, see [FV10, Definition 9.16], and we denote by $\mathbbm{X}^{<\infty}$ the unique robust rough path signature for some fixed normalization $\lambda$.
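
As a concrete illustration of the objects above, the following snippet computes the truncated signature of a time-augmented sample path with the iisignature package used later in Section 4; the path, truncation level and dimensions are arbitrary toy choices.

```python
import numpy as np
import iisignature

# time-augmented path x_hat(t) = (t, x(t)) sampled on a grid, cf. Remark 3.1
n_steps, K = 200, 4
t = np.linspace(0.0, 1.0, n_steps + 1)
x = np.cumsum(np.r_[0.0, np.sqrt(np.diff(t)) * np.random.randn(n_steps)])  # Brownian-type sample
path = np.column_stack([t, x])                  # shape (n_steps + 1, 2)

sig = iisignature.sig(path, K)                  # levels 1,...,K of the signature on [0,T]
print(sig.shape, iisignature.siglength(2, K))   # both give 2 + 4 + 8 + 16 = 30 entries
```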

3.1 Linear signature learning

One of the most important algebraic properties of the signature is the following: For any two linear functionals on the tensor-algebra, that is $\ell_{1},\ell_{2}\in T((\mathbb{R}^{d})^{\star})$, we can find $\ell_{3}$ (in fact, a closed-form expression can be given), such that $\big\langle\ell_{1},\mathbbm{X}^{<\infty}_{0,T}\big\rangle\big\langle\ell_{2},\mathbbm{X}^{<\infty}_{0,T}\big\rangle=\big\langle\ell_{3},\mathbbm{X}^{<\infty}_{0,T}\big\rangle$ for all signatures $\mathbbm{X}^{<\infty}_{0,T}$. In words, the space of linear functionals of the signature forms an algebra, which is point-separating (since we assume to have a unique signature, see Remark 3.1). An application of a general Stone-Weierstrass result, see [CO22, Section 2.1], tells us that the set of linear functionals of the robust signature is dense in the set of bounded continuous functions on the corresponding path space. This fact can be used to deduce a global $L^{p}$-approximation for linear functionals of the signature, see [BPS23, Theorem 2.8], which then yields the following convergence results. We will always assume to be in the framework described at the beginning of [BPS23, Section 3.1], where certain assumptions on the path space, as well as on the payoff process, are made precise. The following proposition summarizes the results [BPS23, Propositions 3.3 and 3.8], to which we refer for more precise statements and detailed proofs.

Proposition 3.2

Using the same notation from Section 2, we consider $i=1,\dots,M$ independent sample-paths of $X,Z,W$, and we define the martingales $M^{\ell}_{t}=\int_{0}^{t}\big\langle\ell,\mathbbm{X}^{\leq K}_{0,s}\big\rangle dW_{s}$ for some truncation level $K\in\mathbb{N}$. We define the estimators

\ell_{P}^{\star,n}=\mathop{\mathrm{argmin}}_{\ell,|\ell|\leq K}\frac{1}{M}\sum_{i=1}^{M}\left(Z^{(i)}_{\tau_{n+1}^{(i)}}-\big\langle\ell,\mathbbm{X}^{\leq K,(i)}_{0,t_{n}}\big\rangle\right)^{2}, \qquad (5)
\ell_{D}^{\star}=\mathop{\mathrm{argmin}}_{\ell,|\ell|\leq K}\frac{1}{M}\sum_{i=1}^{M}\max_{0\leq n\leq N}\left(Z^{(i)}_{t_{n}}-M^{\ell,(i)}_{t_{n}}\right).

Then, for new samples $\tilde{X},\tilde{Z},\tilde{W}$, independent of $X,Z,W$, define the $(\mathcal{F}_{t}^{X})$-stopping times $\tilde{\tau}^{(i)}=\inf\big\{t_{n}:\tilde{Z}^{(i)}_{t_{n}}\geq\big\langle\ell_{P}^{\star,n},\mathbbm{X}^{\leq K,(i)}_{0,t_{n}}\big\rangle\big\}$ and martingales $\tilde{M}^{\ell^{\star}_{D},(i)}$. Then

\max\left(\tilde{Z}_{t_{0}},\frac{1}{M}\sum_{i=1}^{M}\tilde{Z}^{(i)}_{\tilde{\tau}^{(i)}}\right)\nearrow Y_{0},\quad\frac{1}{M}\sum_{i=1}^{M}\max_{0\leq n\leq N}\left(\tilde{Z}^{(i)}_{t_{n}}-\tilde{M}^{\ell_{D}^{\star},(i)}_{t_{n}}\right)\searrow Y_{0}, \qquad (6)

as $M,K\rightarrow\infty$, where the convergence with respect to $M$ is almost sure convergence.

While the last result ensures convergence of the primal and dual procedures, it does not come with quantitative statements about convergence rates. Additionally, these algorithms can become computationally expensive, as the size of the signature grows exponentially with $K$. In [BPS23] several numerical experiments were performed for these two algorithms in non-Markovian frameworks, in particular for models driven by fractional Brownian motion with small Hurst parameters. The accuracy of the method can be measured by the duality gap between lower and upper bounds, and in some examples these gaps were observed to be quite significant even for high truncation levels. Two important sources for this gap can be described as follows: First, especially in very rough regimes, the considered signature levels might not suffice to capture the relevant past of the non-Markovian processes, so that information is lost when truncating the signature. Secondly, the functionals on the path spaces we try to learn are highly non-linear, so that more sophisticated learning technologies are required to improve the performance. The goal of the next sections is to introduce non-linear extensions of the primal and dual algorithms, based on deep-, resp. kernel-learning methodologies, with the objective of reducing the duality gap.

3.2 Deep signature learning

Applying Deep Neural Networks (DNNs) on top of the signature was considered several times in the literature, see for instance [KBPA+19], where even more generally, the signature can act as a layer in a network. Here, we simply consider the truncated signature as the input for a DNN of the form

\theta=\beta\circ\sigma\circ A_{1}\circ\sigma\circ\cdots\circ A_{I},

where $\beta,A_{1},\dots,A_{I}$ are affine, $\beta:\mathbb{R}^{m}\rightarrow\mathbb{R}$, $A_{I}:T(\mathbb{R}^{d})\rightarrow\mathbb{R}^{m}$, $A_{I-i}:\mathbb{R}^{m}\to\mathbb{R}^{m}$ for $i=1,\dots,I-1$, and $\sigma$ is an activation function. We call $\theta$ a signature DNN with $I$ hidden layers, each of which has $m$ neurons. The signature map $\mathbf{x}|_{[0,t]}\mapsto\mathbbm{x}^{<\infty}_{0,t}$ represents the input layer, and the first hidden layer, acting on the signature, can be seen as a vector of linear functionals $A_{I}=(\ell_{1},\dots,\ell_{m})$, with $A_{I}(\mathbbm{x}^{<\infty})=(\big\langle\ell_{1},\mathbbm{x}^{<\infty}\big\rangle,\dots,\big\langle\ell_{m},\mathbbm{x}^{<\infty}\big\rangle)^{\top}\in\mathbb{R}^{m}$. The remaining hidden layers are all of the form $A_{j}(x)=A_{j}x+b_{j}$, where $A_{j}\in\mathbb{R}^{m\times m},b_{j}\in\mathbb{R}^{m}$, and the output layer is given by $\beta\in\mathbb{R}^{m}$. We denote by $\mathrm{DNN}^{\sigma,I}_{\mathrm{sig}}$ the set of all such signature DNNs $\theta$ with $I$ hidden layers and activation function $\sigma$.
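
A minimal Keras sketch of such a signature DNN, with the truncated signature precomputed and fed in as a feature vector; the width, depth and activation below are placeholders, while the concrete architectures used in the experiments are specified in Section 4.

```python
import tensorflow as tf

def signature_dnn(sig_dim, n_hidden=2, width=64, activation="relu"):
    """theta = beta o sigma o A_1 o ... o A_I acting on the truncated signature."""
    inputs = tf.keras.Input(shape=(sig_dim,))              # truncated signature of X|_[0,t]
    h = inputs
    for _ in range(n_hidden):                              # hidden affine maps + activation
        h = tf.keras.layers.Dense(width, activation=activation)(h)
    outputs = tf.keras.layers.Dense(1)(h)                  # linear read-out beta
    return tf.keras.Model(inputs, outputs)

model = signature_dnn(sig_dim=30)                          # e.g. level-4 signature of a 2-d path
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
# model.fit(sig_features, regression_targets, batch_size=128, epochs=15)
```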

It is well known that already DNNs with $I=1$ hidden layer are universal, see, e.g., [LLPS93]. For this reason, and to ease the notation for the theoretical results, in the remainder of this section we fix $I=1$. Nevertheless, all the results extend directly to multiple hidden layers $I>1$, and in the numerical experiments we mostly choose more than one hidden layer. Setting $\mathrm{DNN}^{\sigma}_{\mathrm{sig}}:=\mathrm{DNN}^{\sigma,1}_{\mathrm{sig}}$, we can explicitly write the set as

\mathrm{DNN}^{\sigma}_{\mathrm{sig}}=\left\{\mathbf{x}|_{[0,t]}\mapsto\sum_{j=1}^{m}\beta_{j}\sigma\Big(\big\langle\ell_{j},\mathbbm{x}^{<\infty}_{0,t}\big\rangle\Big):(\beta,\ell)\in\mathbb{R}^{m}\times\mathcal{W}^{m},m\geq 1\right\}, \qquad (7)

where $\mathcal{W}$ denotes the set of all linear functionals, and $\mathbf{x}|_{[0,t]}$ are stopped $\alpha$-Hölder rough paths, see [BPS23, Section 2] for details. For fixed $(m,k)\in\mathbb{N}^{2}$ and any pair $(\beta,\ell)\in\mathbb{R}^{m}\times(\mathcal{W}^{\leq k})^{m}$, where $\mathcal{W}^{\leq k}$ are linear functionals on the $k$-step signature $\mathbbm{x}^{\leq k}$, denote by $\theta^{(\beta,\ell)}$ the corresponding functional in the set $\mathrm{DNN}^{\sigma}_{\mathrm{sig}}$. Revisiting the Longstaff & Schwartz algorithm presented in Section 2.1, it is tempting to define a sequence of deep stopping times $(\tau^{(m,k)}_{n}:1\leq n\leq N)$, defined recursively as in (3), but with the conditional expectations replaced by $\theta^{(\beta_{P}^{\star,n},\ell_{P}^{\star,n})}(\mathbf{X}|_{[0,t_{n}]})$, where for $0\leq n\leq N$

(\beta_{P}^{\star,n},\ell_{P}^{\star,n})=\mathop{\mathrm{argmin}}_{(\beta,\ell)\in\mathbb{R}^{m}\times(\mathcal{W}^{\leq k})^{m}}\mathbb{E}\left[\big(Z_{\tau^{(k,m)}_{n+1}}-\theta^{(\beta,\ell)}(\mathbf{X}|_{[0,t_{n}]})\big)^{2}\right]. \qquad (8)

In words, we recursively learn the continuation values as neural networks of truncated signatures. Similarly, we approximate the Doob-martingale from Section 2.2 by deep martingales $M^{(k,m)}_{t}=\int_{0}^{t}\theta^{(\beta_{D}^{\star},\ell_{D}^{\star})}(\mathbf{X}|_{[0,s]})dW_{s}$, where

(\beta_{D}^{\star},\ell_{D}^{\star})=\mathop{\mathrm{argmin}}_{(\beta,\ell)\in\mathbb{R}^{m}\times(\mathcal{W}^{\leq k})^{m}}\mathbb{E}\left[\max_{0\leq n\leq N}\bigg(Z_{t_{n}}-\int_{0}^{t_{n}}\theta^{(\beta,\ell)}(\mathbf{X}|_{[0,s]})dW_{s}\bigg)\right]. \qquad (9)

Thanks to the universality of both signatures and DNNs, the following result should not come as a surprise. As in the last section, we adopt the assumptions described at the beginning of [BPS23, Section 3.1].

Proposition 3.3

Assuming that the filtration is Brownian, that is $\mathcal{F}^{X}=\mathcal{F}^{W}$, and the activation function $\sigma$ is non-polynomial, we have

Y_{0}^{N}=\lim_{k,m\to\infty}\max\left(Z_{t_{0}},\mathbb{E}[Z_{\tau_{1}^{(k,m)}}]\right)=\lim_{k,m\to\infty}\mathbb{E}\Big[\max_{0\leq n\leq N}\big(Z_{t_{n}}-M^{(k,m)}_{t_{n}}\big)\Big].

An outline of the proof can be found in Appendix A. Similar to the linear case, in practice we solve the sample average approximation of (8)-(9), that is

(\beta_{P}^{\star,n},\ell_{P}^{\star,n})=\mathop{\mathrm{argmin}}_{(\beta,\ell)\in\mathbb{R}^{m}\times(\mathcal{W}^{\leq k})^{m}}\frac{1}{M}\sum_{i=1}^{M}\left(Z^{(i)}_{\tau_{n+1}^{(i)}}-\theta^{(\beta,\ell)}(\mathbf{X}^{(i)}|_{[0,t_{n}]})\right)^{2}, \qquad (10)
(\beta_{D}^{\star},\ell_{D}^{\star})=\mathop{\mathrm{argmin}}_{(\beta,\ell)\in\mathbb{R}^{m}\times(\mathcal{W}^{\leq k})^{m}}\frac{1}{M}\sum_{i=1}^{M}\max_{0\leq n\leq N}\left(Z^{(i)}_{t_{n}}-M^{(\beta,\ell),(i)}_{t_{n}}\right),

for i.i.d. sample-paths $Z^{(i)},\mathbf{X}^{(i)},W^{(i)}$. The latter (non-convex) optimization problems can then be solved using stochastic gradient descent. More details about this and the DNN architecture will be discussed in Section 4. Let us conclude this section with a remark.

Remark 3.4

To address the complexity issue coming from the size of the signature, it can be helpful to consider the so-called log-signature. The latter is defined by $\mathbb{L}^{<\infty}_{s,t}:=\mathrm{log}^{\otimes}(\mathbbm{X}^{<\infty}_{s,t})$, where $\mathrm{log}^{\otimes}$ is the bijection

\mathrm{log}^{\otimes}(1+\mathbf{x}):=\sum_{k\geq 1}\frac{(-1)^{k+1}}{k}\mathbf{x}^{\otimes k}\in T((\mathbb{R}^{d})),\quad\mathbf{x}\in T((\mathbb{R}^{d})),\ \mathbf{x}^{(0)}=0.

It can be shown that the dimension of the truncated log-signature $\mathbb{L}^{\leq K}$ grows much more slowly than that of $\mathbbm{X}^{\leq K}$, see, e.g., [Rei17], and denoting by $\exp^{\otimes}$ the inverse of $\mathrm{log}^{\otimes}$, we have $\mathbbm{X}^{\leq K}=\exp^{\otimes}(\mathbb{L}^{\leq K})$. Although the log-signature itself is no longer universal, Proposition 3.3 remains true for log-signature DNNs, that is, replacing the signature in (7) by its log-signature transform. In Remark A.1, after the proof of Proposition 3.3, we briefly explain why this is true and how to modify the proof.
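
For completeness, the dimension reduction mentioned in this remark can be checked directly with iisignature (toy path and truncation level chosen arbitrarily):

```python
import numpy as np
import iisignature

path = np.random.randn(201, 2).cumsum(axis=0)   # toy 2-dimensional path on a grid
K = 4
prep = iisignature.prepare(2, K)                # basis data for the log-signature
print(len(iisignature.logsig(path, prep)),      # 8 log-signature entries for d=2, K=4
      len(iisignature.sig(path, K)))            # versus 30 signature entries
```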

3.3 Signature-kernel learning

Let us start by recalling some notions from general RKHS theory, see, e.g., [Ste08]. Given a feature map $\phi:\mathcal{X}\rightarrow\mathcal{H}$, where $\mathcal{H}$ is a Hilbert space (the so-called feature space), one can define the associated kernel

k:\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R},\quad k(x,y):=\big\langle\phi(x),\phi(y)\big\rangle_{\mathcal{H}}. \qquad (11)

If the kernel is positive-definite, there exists a unique RKHS $\mathcal{H}$ with reproducing kernel $k$, see [Ste08, Theorem 4.21], in the sense that $k$ reproduces elements $f\in\mathcal{H}$, that is $f(x)=\big\langle k(\cdot,x),f\big\rangle_{\mathcal{H}}$, and it generates the Hilbert space, $\mathcal{H}=\overline{\mathrm{span}\{k(\cdot,x):x\in\mathcal{X}\}}$. In our case, the feature map $\phi$ maps a path $x\in\mathcal{X}$ to its path-signature $\mathbbm{x}^{<\infty}$, and therefore the signature kernel is naturally defined by

k_{s,t}(x,y):=\big\langle\mathbbm{x}^{<\infty}_{0,s},\mathbbm{y}^{<\infty}_{0,t}\big\rangle,\quad 0\leq s,t\leq T,

where $\big\langle\cdot,\cdot\big\rangle$ is the natural extension of the inner products on $(\mathbb{R}^{d})^{\otimes k}$ to the tensor algebra $T((\mathbb{R}^{d}))$, see [CS24, Definition 2.1.1]. It has been shown in [SCF+21] that the signature-kernel solves a Goursat-type PDE with respect to the time variables. This kernel trick allows us to evaluate the signature-kernel, representing the whole signature, by numerically solving a second-order PDE. An important extension, designed for rougher inputs, was discussed in the recent work [LL24].
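
In practice the Goursat PDE is solved numerically; the following is a minimal sketch with the sigkernel package (also used in Section 4) computing a Gram matrix of signature-kernel evaluations for a batch of toy paths, assuming the package interface as documented in its repository.

```python
import torch
import sigkernel

# batches of piecewise-linear paths of shape (batch, length, dim), in double precision
X = torch.cumsum(torch.randn(10, 50, 2, dtype=torch.float64), dim=1)
Y = torch.cumsum(torch.randn(8, 50, 2, dtype=torch.float64), dim=1)

static_kernel = sigkernel.LinearKernel()           # plain inner product, matching k_{s,t} above
sig_kernel = sigkernel.SigKernel(static_kernel, dyadic_order=1)
G = sig_kernel.compute_Gram(X, Y, sym=False)       # (10, 8) matrix of kernel values k_{T,T}(x^(i), y^(j))
print(G.shape)
```

To obtain $k_{t_{n},t_{n}}$ one evaluates the same Gram matrix on the paths restricted (and discretized) on $[0,t_{n}]$.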

An important observation is that universality of the signature feature map is equivalent to universality of the signature-kernel, see for instance [CO22, Section 6], but also [CS24, Chapter 2.1]. Returning to optimal stopping, this motivates us to seek functionals for the primal and dual problems by solving the regularized minimization problems on the RKHS

f^{\lambda}=\mathop{\mathrm{argmin}}_{f\in\mathcal{H}}\mathcal{L}(f)+\lambda\|f\|^{2}_{\mathcal{H}},\quad\lambda>0, \qquad (12)

where the loss $\mathcal{L}$ is either the primal or the dual loss function. More precisely, in the primal case, at each exercise date $n$, the loss is simply the mean-square error $\mathcal{L}_{n}(f)=\|Z_{\tau_{n+1}}-f(X|_{[0,t_{n}]})\|_{2}^{2}$. In the dual case, it is given by $\mathcal{L}(f)=\|\max_{1\leq k\leq N}(Z_{t_{k}}-M^{f}_{t_{k}})\|_{1}$, where $M_{t}^{f}=\int_{0}^{t}f(X|_{[0,s]})dW_{s}$. Then, replacing the losses by their sampled versions, the general representer theorem [SHS01, Theorem 1] reveals that the minimizers are of the form $f^{\alpha}(x)=\sum_{i=1}^{M}\alpha_{i}k(X^{(i)},x)$. Thus, let us define the minimizers

\alpha_{P}^{\star,n}:=\mathop{\mathrm{argmin}}_{\alpha\in\mathbb{R}^{M}}\frac{1}{M}\sum_{i=1}^{M}\left(Z^{(i)}_{\tau_{n+1}^{(i)}}-\sum_{j=1}^{M}\alpha_{j}k_{t_{n},t_{n}}(X^{(i)},X^{(j)})\right)^{2}+\lambda\|f^{\alpha}\|^{2}_{\mathcal{H}},\quad\lambda>0, \qquad (13)

for all exercise dates $n=1,\dots,N-1$. Similarly,

\alpha_{D}^{\star}:=\mathop{\mathrm{argmin}}_{\alpha\in\mathbb{R}^{M}}\frac{1}{M}\sum_{i=1}^{M}\max_{0\leq n\leq N}\left(Z^{(i)}_{t_{n}}-\sum_{j=1}^{M}\alpha_{j}M^{(i),(j)}_{t_{n}}\right)+\lambda\|f^{\alpha}\|^{2}_{\mathcal{H}},\quad\lambda>0, \qquad (14)

for the kernel-martingales $M^{(i),(j)}_{t}=\left(\int_{0}^{t}k_{s,s}(X^{(j)},X)dW_{s}\right)^{(i)}$. The algorithms in Section 2 can then easily be translated to the kernel-learning framework, by replacing the minimizers in the second step accordingly.

Remark 3.5

We expect that the convergence results in Proposition 3.2 can be obtained analogously in this kernel-learning framework, when sending $M\to\infty$ and $\lambda\to 0$. For instance, in the dual case it follows from the universality of the signature-kernel that the closure of the span of the family of kernel-martingales $\mathcal{M}^{K}=\left\{\int k_{s,s}(x,X)dW_{s}:x\in\mathcal{X}\right\}$ corresponds to the space of $L^{2}(\mathcal{F}^{W})$-martingales. For the sample average approximation and existence of minimizers, one can then argue similarly as in [BPS23, Appendix A.2], and use the representer theorem to conclude.

Remark 3.6

The obvious advantage of this method is that no truncation for the feature map is necessary, and therefore theoretically it does not suffer from a loss of information, see Remark 3.1. Modulo evaluation of the signature-kernel, the approximation error only depends on the regularization λ\lambda and the number of samples MM. Moreover, at least for the mean-square loss, it seems possible to theoretically study convergence rates of such algorithms. This, however, involves a more precise understanding of the signature RKHS \mathcal{H} and integral operators therein, a problem that is outside the scope of this paper, but planned for future research.

Having mentioned the theoretical advantages of the kernel-method, evaluating the signature-kernel, which means solving the Goursat PDE [SCF+21], becomes the main difficulty in this procedure. Due to the recursive nature of the regression problems in the Longstaff and Schwartz algorithm in Section 2.1, typically large sample sizes are required to ensure stability. Moreover, as we are especially interested in paths of low regularity, a fine discretization grid is required to solve the Goursat PDE. Combining these observations with the fact that the kernel-ridge-regressions involve computing and inverting the Gram-matrices $\big(k_{t_{n},t_{n}}(X^{(i)},X^{(j)})\big)_{i,j}$, it becomes clear that a reduction of the computational costs is required.

To this end, we propose the following approach, related to the so-called Nyström-method for kernel-learning [DMC05, WS00]. We randomly select $L\ll M$ subsamples, denoting by $I_{L}$ the set of those indices, and in both the primal and dual optimization problems we restrict the minimization to functions of the form $f(\cdot)=\sum_{i\in I_{L}}\beta_{i}k(x^{i},\cdot)$. It follows that we only need to compute the $(M\times L)$-matrices

\mathbf{K}_{t_{n}}=\big(k_{t_{n},t_{n}}(X^{j},X^{i}):i\in I_{L},\ j=1,\dots,M\big),\quad n=1,\dots,N.

For the primal case, that is, for the kernel ridge regression, we can additionally observe that the explicit solution is given by

\alpha_{P}^{\star,n}=(\mathbf{K}_{t_{n}}^{T}\mathbf{K}_{t_{n}}+M\lambda\mathbf{R}_{t_{n}})^{-1}\mathbf{K}_{t_{n}}^{T}Z_{\tau_{n+1}},

where $\mathbf{R}_{t_{n}}=\big(k_{t_{n},t_{n}}(x^{i},x^{j}):i,j\in I_{L}\big)$. Further details about the implementation will be presented in the next section.
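
A direct numpy transcription of this closed-form update might look as follows; the array names are ours, and in the experiments the Gram blocks come from the signature-kernel PDE solver.

```python
import numpy as np

def nystroem_krr_coefficients(K_n, R_n, targets, lam):
    """alpha = (K^T K + M*lam*R)^{-1} K^T Z_{tau_{n+1}}  for one exercise date t_n.

    K_n     : (M, L) matrix of kernel evaluations against the L landmark paths
    R_n     : (L, L) Gram matrix of the landmark paths
    targets : (M,)   realized cash-flows Z_{tau_{n+1}}
    lam     : ridge parameter lambda > 0
    """
    M = K_n.shape[0]
    return np.linalg.solve(K_n.T @ K_n + M * lam * R_n, K_n.T @ targets)

# the continuation value at t_n for a new path x is then
# sum_{i in I_L} alpha_i * k_{t_n,t_n}(x^{(i)}, x).
```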

4 Pricing American options under rough volatility

In this section we test the signature-based procedures for the problem of pricing American options in rough volatility models. In such models, the asset-price dynamics is given by

S_{0}=s_{0},\quad dS_{t}=rS_{t}dt+S_{t}v_{t}\left(\rho dW_{t}+\sqrt{1-\rho^{2}}dB_{t}\right),\quad 0<t\leq T, \qquad (15)

where $W$ and $B$ are two independent Brownian motions, the volatility $(v_{t})_{t\in[0,T]}$ is $(\mathcal{F}^{W}_{t})$-adapted and continuous, $\rho\in[-1,1]$, and $r>0$ is the interest rate. We denote by $X$ the log-price, that is $X_{t}=\log(S_{t})$. Compared to classical diffusion models, we are interested in volatility processes $(v_{t})$ driven by fractional Brownian motion, turning both the volatility and the price into non-Markovian processes. We will focus on the following two, arguably most popular, examples of rough volatility specifications.

Example 1

(Rough Bergomi [BFG16]) The rough Bergomi volatility is given by

v_{t}=\xi_{0}\mathcal{E}\left(\eta\int_{0}^{t}(t-s)^{H-\frac{1}{2}}dW_{s}\right),

where $\mathcal{E}$ denotes the stochastic exponential. In all the numerical examples, we will choose $\eta=1.9$ and $\xi_{0}=0.09$.
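
For illustration, a crude left-point discretization of the Volterra integral above and of the resulting Wick/stochastic exponential can be written as follows; this naive scheme is biased for small $H$, and the experiments use the hybrid-scheme code related to [MP18] instead, so the snippet is only a sketch of the model's structure under the stated convention.

```python
import numpy as np

def rough_bergomi_vol(n_paths, n_steps, T=1.0, H=0.07, eta=1.9, xi0=0.09, seed=0):
    """Sample v_t = xi0 * exp(eta * Y_t - 0.5 * eta^2 * Var(Y_t)) with
    Y_t = int_0^t (t-s)^{H-1/2} dW_s and Var(Y_t) = t^{2H}/(2H) (Wick exponential)."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    t = np.linspace(0.0, T, n_steps + 1)
    dW = rng.standard_normal((n_paths, n_steps)) * np.sqrt(dt)
    Y = np.zeros((n_paths, n_steps + 1))
    for k in range(1, n_steps + 1):
        weights = (t[k] - t[:k]) ** (H - 0.5)      # left-point kernel weights (singular but finite)
        Y[:, k] = dW[:, :k] @ weights
    var_Y = t ** (2 * H) / (2 * H)
    v = xi0 * np.exp(eta * Y - 0.5 * eta**2 * var_Y)
    return t, v, dW                                 # dW is reused to correlate with the price

t, v, dW = rough_bergomi_vol(n_paths=4, n_steps=250)
```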

Example 2

(Rough Heston [EER19]) The rough Heston volatility, resp. its variance $V=v^{2}$, is defined as the (weak) solution to the Volterra-type CIR equation

V_{t}=V_{0}+\int_{0}^{t}(t-s)^{H-\frac{1}{2}}\lambda(\theta-V_{s})ds+\int_{0}^{t}(t-s)^{H-\frac{1}{2}}\nu\sqrt{V_{s}}dW_{s}. \qquad (16)

In all the numerical examples, we will choose $V_{0}=\theta=0.02$ and $\nu=\lambda=0.3$.
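
Analogously, a plain Euler-type discretization of the Volterra equation (16) reads as follows; again, this naive left-point scheme is only a structural sketch (it can be inaccurate for small $H$), whereas the experiments rely on the Markovian approximations related to [BB24].

```python
import numpy as np

def rough_heston_variance(n_paths, n_steps, T=1.0, H=0.1, V0=0.02,
                          theta=0.02, lam=0.3, nu=0.3, seed=0):
    """Left-point Euler scheme for
    V_t = V_0 + int_0^t (t-s)^{H-1/2} lam*(theta - V_s) ds + int_0^t (t-s)^{H-1/2} nu*sqrt(V_s) dW_s."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    t = np.linspace(0.0, T, n_steps + 1)
    dW = rng.standard_normal((n_paths, n_steps)) * np.sqrt(dt)
    V = np.full((n_paths, n_steps + 1), V0)
    for k in range(1, n_steps + 1):
        kern = (t[k] - t[:k]) ** (H - 0.5)                       # Volterra kernel weights
        drift = lam * (theta - V[:, :k]) * dt                    # lam*(theta - V_{t_j}) dt
        diffusion = nu * np.sqrt(np.maximum(V[:, :k], 0.0)) * dW[:, :k]
        V[:, k] = V0 + (drift + diffusion) @ kern
    return t, V, dW

t, V, dW = rough_heston_variance(n_paths=4, n_steps=250)
```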

The problem of pricing American (resp. Bermudan) options consists of solving the discrete optimal stopping problem

Y_{0}^{N}=\sup_{\tau\in\mathcal{S}^{N}_{0}}\mathbb{E}[e^{-r\tau}\phi(S_{\tau})], \qquad (17)

where $\mathcal{S}^{N}_{0}$ is the set of $(\mathcal{F}^{S}_{t})$-stopping times taking values in the set of possible exercise-dates $\{t_{0},\dots,t_{N}\}$, and $\phi$ is the payoff function. We will focus on so-called put-options, which correspond to the payoff function $\phi(x)=(K-x)^{+}$ for a given strike price $K$.

4.1 Implementation details

The code accompanying this section can be found at https://github.com/lucapelizzari/Optimal_Stopping_with_signatures. Before discussing numerical experiments, let us specify in detail how the procedures introduced in Section 3 are implemented. All the signatures are computed using the package iisignature, see [RG18]. For the signature kernel, we rely on the PDE solvers from the package sigkernel to obtain the kernel as a finite-difference solution of the Goursat PDE derived in [SCF+21]. For the simulation of rough Bergomi, we use https://github.com/ryanmccrickerd/rough_bergomi related to [MP18], and for the simulation of rough Heston https://github.com/SimonBreneis/approximations_to_fractional_stochastic_volterra_equations related to [BB24].

Linear signature stopping

This approach was already studied in detail in [BPS23], and we adopt the choices made there. In particular, for the primal procedure we use the signature of $\hat{X}_{t}=(t,X_{t})$, and for the dual procedure the signature of $\hat{Z}_{t}=(t,X_{t},\phi(X_{t}))$, and in both we add Laguerre polynomials $L(X_{t},v_{t})$ to the set of basis functions. For more details about this choice we refer to [BPS23, Section 4].

Deep signature stopping

Here we make a slightly different choice for the basis compared to the linear approach, exploiting both the universality of the DNNs and of the signature. First, it was observed, for instance in [BBFP23], that the price $X$ has a partial Markovian nature in $B$, and only depends on the past through the non-Markovian volatility process $(v_{t})_{t\in[0,T]}$. Leaving rigorous arguments to [BBFP23], one can for instance note that $\mathrm{Law}(X_{t+h}|\mathcal{F}_{t})=\mathrm{Law}(X^{t,x}_{t+h}|\mathcal{F}^{v}_{t})|_{x=X_{t}}$, so that conditional expectations satisfy $\mathbb{E}[\phi(X_{t+\Delta t})|\mathcal{F}_{t}]=f(X_{t},(v_{s})_{s\leq t})$ for some measurable function $f$. To capture the relevant memory of the dynamics, in the primal case we lift the variance-augmented process $\hat{v}_{t}=(\langle X\rangle_{t},v_{t})=(\int_{0}^{t}v_{u}^{2}du,v_{t})$ to the standard signature $\hat{\mathbb{V}}^{<\infty}$. In the dual case we simply lift the time-augmentation $\hat{v}_{t}=(t,v_{t})$. The partial Markovianity motivates us to simply apply DNNs to $\{X_{t},\hat{\mathbb{V}}_{0,t}^{\leq K}\}$.

Let us now specify the DNN architecture: For the Longstaff and Schwartz algorithm, similar to [LL21], we rely on the Leaky ReLU activation function $\sigma(x)=1_{\{x\geq 0\}}x+1_{\{x<0\}}0.3x$ and the ADAM optimizer to fit the models at each exercise date, with a batch-size of $b=128$ and learning rate $\lambda=10^{-3}$. Inspired by [LL21], we use $e=15$ epochs to learn the conditional expectation at the last exercise date (first regression in the algorithm), and then use the trained weights to initialize the DNN at the next exercise-date. Doing this allows us to reduce the number of epochs to $e=1$ for $1\leq n<N-1$, which in turn reduces the computation time significantly. It is sufficient to consider $I=2$ hidden layers with $D_{i}\equiv(\mathrm{dim}(\hat{\mathbb{V}}^{\leq K})+1)+32$ neurons each. For the dual algorithm we rely on the classical ReLU activation function $\sigma(x)=\max(x,0)$, again using the ADAM optimizer with batch size $b=128$ and learning rate $10^{-3}$ to fit the model. Compared to the primal algorithm, deeper neural networks with $I=6$ layers and $D_{i}\equiv(\mathrm{dim}(\hat{\mathbb{V}}^{\leq K})+1)+32$ neurons are required, which is due to the non-linear nature of the integrand of the Doob-martingale.

Signature-kernel stopping

For the kernel ridge regressions in the Longstaff and Schwartz algorithm based methods, we choose the kernel of the augmented path $\hat{N}_{t}=(\langle X\rangle_{t},W_{t},X_{t})$. At each exercise date $n$, we randomly select $L=32$ subsamples, where the $i$-th sample is selected with probability $p_{i}=\frac{k_{t_{n},t_{n}}(\hat{N}^{(i)},\hat{N}^{(i)})}{\sum_{j=1}^{M}k_{t_{n},t_{n}}(\hat{N}^{(j)},\hat{N}^{(j)})}$, noting that the latter only requires the diagonal of the Gram matrix $\mathbf{K}$.

For the dual procedure, we randomly select the subsamples according to probabilities built on the quadratic variation of the kernel martingales, $\langle\int_{0}^{\cdot}k_{s,s}(\hat{N}^{(i)},\hat{N})dW_{s}\rangle_{T}^{(i)}=\int_{0}^{T}k_{s,s}(\hat{N}^{(i)},\hat{N}^{(i)})^{2}ds$, that is

p_{i}=\frac{\int_{0}^{T}k_{s,s}(\hat{N}^{(i)},\hat{N}^{(i)})^{2}ds}{\sum_{j=1}^{M}\int_{0}^{T}k_{s,s}(\hat{N}^{(j)},\hat{N}^{(j)})^{2}ds}.

Having $\mathbf{K}\in\mathbb{R}^{M\times L}$ at hand, we noticed that the best performance is achieved when solving (14) using a simple version of the neural network technology described before, with one hidden layer and ReLU activation.
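
For concreteness, the two subsampling rules above only need the diagonal kernel values $k_{s,s}(\hat{N}^{(i)},\hat{N}^{(i)})$ on the time grid; a small sketch, with a hypothetical array k_diag holding these values:

```python
import numpy as np

def primal_landmark_probs(k_diag_tn):
    """k_diag_tn : (M,) diagonal values k_{t_n,t_n}(N^(i), N^(i)) at one exercise date."""
    return k_diag_tn / k_diag_tn.sum()

def dual_landmark_probs(k_diag, t_grid):
    """k_diag : (M, N+1) diagonal values on the grid; p_i ~ int_0^T k_{s,s}(N^(i),N^(i))^2 ds."""
    qv = ((k_diag**2)[:, :-1] * np.diff(t_grid)).sum(axis=1)   # left-point quadrature
    return qv / qv.sum()

# landmarks = np.random.default_rng(0).choice(M, size=32, replace=False, p=probs)
```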

4.2 American put option prices in rough Bergomi and rough Heston

In this section we compare all the signature-based methods for pricing Bermudan put options. In Tables 1-2, we present the rough Bergomi model with Hurst parameters $H=0.8$, resp. $H=0.07$, for a range of strikes $K\in\{70,80,\dots,120\}$ and $S_{0}=100$. We choose a contract duration of $T=1$ year, $J=12$ early exercise dates, interest rate $r=0.05$ and correlation $\rho=-0.9$. In the first column we report the European price, that is, the price with no early exercise opportunity, $\mathbf{E}=\mathbb{E}(e^{-rT}(K-S_{T})^{+})$. The second column corresponds to lower bounds obtained in [GMZ20], see also [BTW20]. Finally, in the remaining columns, we compare the point estimate, which is the biased estimator from the Longstaff and Schwartz algorithm based on the training samples, with the lower and upper bounds obtained from independent testing samples. While the intervals in the linear case are taken from [BPS23], we derive the intervals for the deep signature using truncation level $K=4$, discretization level $N=600$ for the signature and martingales, and $M=12^{18}$ samples for both training and testing, to ensure stability of the procedure. For the signature-kernel, we use $M=2^{17}$ samples, with $N=240$ discretization points for solving the Goursat PDE. For each strike, we highlight the best lower, resp. upper, bound, and in the last column present the relative duality gap with respect to these two values. In Table 3 we present the same considerations for the rough Heston model with $H=0.1$, $r=0.06$ and $\rho=-0.7$, a choice motivated in [BB24], with a smaller regularization parameter. The trainings were repeated $20$ times and the overall Monte-Carlo error is below $0.01$. Finally, in Table 4 we compare the computational time (in seconds) of one training each, as well as the offline computation of the signature, resp. the signature-kernel.

Figure 1: Primal put option prices with strike $K=100$ and the signature-kernel procedure, with respect to the penalization parameter $\lambda$. (a) Rough Heston with $H=0.1$; (b) rough Bergomi with $H=0.07$.

In Figure 1 we present the dependence of the lower bounds, resp. point estimates, on $\lambda\in[0,2]$. The optimal choice is the one yielding the highest possible lower bound (blue lines) while keeping the distance to the point estimate reasonable, since a larger difference suggests overfitting. Following this rule, we choose $\lambda=10^{-3}$ for rough Bergomi, and $\lambda=10^{-8}$ in the rough Heston model.

First and most visibly, we observe that the deep signature method delivers significantly better upper bounds than both other methods, with reasonable training and offline costs, see Table 4. Moreover, the deep method also mostly produces the smallest duality gap, that is, the smallest distance between lower and upper bound. However, one can also observe that for both the point estimates and lower bounds, the linear and kernel methods, which are based on conventional convex optimization problems, generally slightly outperform the deep signature method. This, however, comes at the price of offline computational costs, see Table 4, where we recall that the DNNs allow us to only lift the augmented volatility process. As discussed in [BPS23], for the linear procedure it is necessary to lift the three-dimensional path $(t,X_{t},\phi(X_{t}))$ and add polynomials of the state, which explains the increased computational time in the linear case. While the signature-kernel method improves on the linear approach, the complexity of deriving the Gram-matrix, i.e. solving the Goursat PDE, see Table 4, leaves the signature-kernel procedure as the most expensive approach to solving the optimal stopping problem. To handle large data sets, it is therefore necessary to address the complexity issue for the kernel, and a potential direction could be the random Fourier signature features of [TOS23], but this is outside the scope of this paper.

In summary, for our goal of achieving minimal duality gaps for arbitrarily large sample sizes, the best possible results can be achieved by combining a Longstaff & Schwartz algorithm based on the linear signature or the signature-kernel, exploiting the simplicity of the training procedure, with the deep-signature approach for the upper bounds, to capture the non-linearity of the Doob martingale integrand. If one has to choose a single method, we recommend the deep-signature method, simply as it shows the smallest duality gaps and overall computational times.

Strike E [GMZ20] Linear [BPS23] Deep signature Signature kernel Best Gap
Point est. C.I. Point est. C.I. Point est. C.I.
70 0.78 1.84 1.847 [1.83, 1.90] 1.831 [1.82, 1.85] 1.835 [1.83, 1.89] 1.08 %
80 1.55 3.10 3.101 [3.08, 3.19] 3.086 [3.07, 3.12] 3.080 [3.08, 3.16] 1.28 %
90 3.11 5.08 5.086 [5.07, 5.17] 5.058 [5.04, 5.08] 5.061 [5.06, 5.15] 0.78 %
100 6.10 8.19 8.188 [8.15, 8.27] 8.132 [8.11, 8.17] 8.159 [8.15, 8.26] 0.24 %
110 11.41 13.00 12.991 [12.97, 13.09] 12.922 [12.87, 12.98] 12.944 [12.94, 13.03] 0.07 %
120 18.65 20.28 20.219 [20.21, 20.51] 20.161 [20.14, 20.25] 20.162 [20.16, 20.27] 0.19 %
Table 1: Comparison of put option prices in rough Bergomi with $H=0.8$.
Strike E [GMZ20] Linear [BPS23] Deep signature Signature kernel Best Gap
Point est. Interval Point est. Interval Point est. Interval
70 0.88 1.88 1.929 [1.92, 1.99] 1.921 [1.91, 1.95] 1.926 [1.92, 2.01] 1.53 %
80 1.67 3.25 3.289 [3.27, 3.37] 3.281 [3.26, 3.31] 3.286 [3.27, 3.36] 1.20 %
90 3.11 5.34 5.394 [5.37, 5.50] 5.383 [5.35, 5.44] 5.397 [5.37, 5.53] 1.28 %
100 5.83 8.53 8.586 [8.57, 8.77] 8.555 [8.52, 8.66] 8.589 [8.56, 8.75] 1.03 %
110 10.94 13.28 13.314 [13.29, 13.59] 13.281 [13.21, 13.45] 13.326 [13.27, 13.46] 1.18 %
120 18.37 20.20 20.267 [20.24, 20.66] 20.163 [20.14, 20.63] 20.276 [20.24, 20.44] 0.97 %
Table 2: Comparison of put option prices in rough Bergomi with $H=0.07$.
Strike E Linear [BPS23] Deep signature Signature kernel Best Gap
Point est. Interval Point est. Interval Point est. Interval
70 0.16 0.438 [0.42, 0.53] 0.426 [0.42, 0.44] 0.430 [0.43, 0.46] 2.27 %
80 0.42 0.966 [0.94, 1.11] 0.961 [0.95, 0.97] 0.960 [0.96, 1.01] 1.03 %
90 1.02 2.034 [1.99, 2.28] 2.031 [2.02, 2.07] 2.027 [2.02, 2.11] 2.42 %
100 2.66 4.245 [4.16, 4.66] 4.248 [4.24, 4.36] 4.219 [4.21, 4.44] 2.75 %
110 8.05 9.715 [9.63, 10.52] 9.660 [9.66, 10.42] 9.693 [9.68, 10.14] 4.53 %
120 16.68 19.500 [19.49, 20.04] 19.487 [19.48, 20.00] 19.502 [19.50, 19.56] 0.30 %
Table 3: Comparison of put option prices in rough Heston with $H=0.1$.
Method Training primal Training dual Offline costs
Linear 1.73 610.10 508.32
Deep 100.23 520.34 50.92
Kernel 1.69 390.98 5564.53
Table 4: Computational times (in seconds) for training and evaluation, and offline computation times (in seconds) for signatures and kernels

On the other hand, it is important to note that we observe the primal signature-kernel method to be more stable than the deep-signature approach with respect to smaller data sets. This unsurprising advantage of the kernel method can become important when one works with real-world rather than synthetic data, where the number of samples cannot be arbitrarily large, in which case the signature-kernel might be favorable. This is illustrated in Figure 2 (a) for the rough Bergomi model with $H=0.07$, where we observe more severe instability in the neural network training for smaller $M$, both in the sense of small lower bounds and of high training variance. The Monte-Carlo errors reflect this variance, which is obtained after independently repeating the training $20$ times for each sample size. For completeness, we show a similar plot for the dual case in Figure 2 (b), where it is confirmed again that the deep-signature method heavily outperforms the signature-kernel approach, in every data-size regime. The variance in the dual procedure is much smaller, as we only solve one optimization problem in the training, rather than one at each exercise date.

Figure 2: Put option prices for $H=0.07$ and strike $K=100$, with respect to the training sample size $M$. (a) Primal; (b) dual.

In Figure 3 we compare the option prices in the rough Bergomi model with respect to the correlation parameter $\rho\in[-1,1]$, for the two very different regimes $H=0.07$ and $H=0.8$. The blue region reflects the pricing interval derived from the deep-signature approach, and once again we present the point-estimates of all methods. In both cases we observe that the pricing methods perform worse – in the sense of the duality gap – as we approach the uncorrelated case $\rho=0$, while the intervals get tighter in more correlated regimes.

Figure 3: Point-estimates and pricing intervals for the put option with $K=100$ in the rough Bergomi model, with respect to the correlation parameter $\rho\in[-1,1]$. (a) $H=0.07$; (b) $H=0.8$.

In Figure 4 we compare the duality gap in the deep-signature method, with respect to the number of discretization points and different Hurst regimes. More precisely, let us define the relative duality gap

\epsilon=\epsilon(N,H)=\frac{y_{0}^{D}(N,H)-y_{0}^{P}(N,H)}{y_{0}^{D}(N,H)},

where $y_{0}^{D}(N,H)$ (resp. $y_{0}^{P}(N,H)$) is the upper (resp. lower) bound obtained from the deep-signature method in the rough Bergomi model with Hurst parameter $H\in(0,1)$ and $N\in\mathbb{N}$ discretization points. All the other parameters are chosen as in Tables 1-2. We can observe that, in general, higher Hurst parameters show smaller duality gaps and also faster convergence with respect to $N$. Notice that, apart from optimization and Monte-Carlo errors, the two main sources for the gap here are the discretization error from approximating iterated and stochastic integrals, and the non-Markovianity in the $H\neq 1/2$ regimes. When $H$ is close to $1/2$, the convergence rate seems to be around $H$, and since this regime is “almost Markovian” (and Markovian for $H=1/2$), this can be interpreted as the strong error occurring when approximating the stochastic integrals $\int f(v_{t},X_{t})dW_{t}$ with an Euler scheme and rough integrand. In general, however, it is not possible to separate these two error sources, so that it becomes more difficult to interpret the observed convergence rate.

Figure 4: Linearly fitted duality gaps with respect to time-discretization and different Hurst regimes in the rough Bergomi model.

Finally, in Figure 5 we compare the importance of the different entries of the level-$3$ signature in the deep-signature pricing methodologies. This is done by first training the primal and dual model, and then, for each fixed feature, shuffling the corresponding samples and measuring the error between the point-estimate and the shuffled estimate. This procedure is repeated independently $20$ times for each feature, and in Figure 5 we plot the averaged errors. Notice that the higher this value is, the more influence the corresponding feature has on the computed option price. In both methods, we observe that the “Markovian” state features $X_{t}$, $v_{t}=\big\langle 2,\hat{\mathbb{V}}_{0,t}\big\rangle$, $v_{t}^{2}=\big\langle 22,\hat{\mathbb{V}}_{0,t}\big\rangle$ appear to have the biggest influence. Apart from that, we can observe that the integrals $\big\langle 21,\hat{\mathbb{V}}_{0,t}\big\rangle=\int_{0}^{t}v_{u}d\langle X\rangle_{u}$ (resp. $\int_{0}^{t}v_{u}du$), and $\big\langle 212,\hat{\mathbb{V}}_{0,t}\big\rangle=\int_{0}^{t}\int_{0}^{u}v_{r}d\langle X\rangle_{r}dX_{u}$ (resp. $\big\langle 212,\hat{\mathbb{V}}_{0,t}\big\rangle=\int_{0}^{t}\int_{0}^{u}v_{r}dr\,dX_{u}$) seem to carry the most important information about the past of the volatility.

Figure 5: Feature importance in the rough Bergomi model with $H=0.07$ and $\rho=-0.9$. (a) Primal; (b) dual.

Appendix A Proof for deep-signature optimal stopping

Proof of Proposition 3.3.

The first step is to generalize the global approximation result [BPS23, Theorem 2.8]. Leaving details to [BPS23, Section 2], we are given a measure space (𝒳,(𝒳),μ)(\mathcal{X},\mathcal{B}(\mathcal{X}),\mu), where 𝒳\mathcal{X} is the space of stopped, time-augmented, geometric α\alpha-Hölder rough paths ([BPS23, Definition 2.1]), (𝒳)\mathcal{B}(\mathcal{X}) is the Borel σ\sigma-algebra, and μ\mu is a measure fulfilling certain assumptions ([BPS23, Assumption 2.6]). Thanks to [BPS23, Theorem 2.8], for any fLp(𝒳,μ)f\in L^{p}(\mathcal{X},\mu) and ϵ>0\epsilon>0, we can find a linear functional on the signature fϵ(𝐱|[0,t])=lϵ,𝕩0,t<f^{\epsilon}(\mathbf{x}|_{[0,t]})=\big{\langle}l^{\epsilon},\mathbbm{x}^{<\infty}_{0,t}\big{\rangle} such that ffϵLpϵ/2\|f-f^{\epsilon}\|_{L^{p}}\leq\epsilon/2. By the LpL^{p}-universality of DNNs, see [LLPS93, Proposition 2], denoting by kϵk_{\epsilon} the truncation level of lϵl^{\epsilon} in 𝒲\mathcal{W}, for mm large enough we can find (β,~)m×(𝒲kϵ)m(\beta,\tilde{\ell})\in\mathbb{R}^{m}\times(\mathcal{W}^{\leq k_{\epsilon}})^{m} such that θ(β,~)(𝐱|[0,t])fϵLpϵ/2\|\theta^{(\beta,\tilde{\ell})}(\mathbf{x}|_{[0,t]})-f^{\epsilon}\|_{L^{p}}\leq\epsilon/2. An application of the triangle inequality then shows that for any fLp(𝒳,μ)f\in L^{p}(\mathcal{X},\mu), we can find a sequence (θn)nDNNsigσ(\theta_{n})_{n\in\mathbb{N}}\subseteq\mathrm{DNN}^{\sigma}_{\mathrm{sig}} such that fθnLp0\|f-\theta_{n}\|_{L^{p}}\rightarrow 0 as nn\to\infty.
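To fix ideas, the elements of DNNsigσ\mathrm{DNN}^{\sigma}_{\mathrm{sig}} used above are single-hidden-layer networks acting on truncated signature coefficients, of the form \theta^{(\beta,\ell)} = \sum_{j}\beta_{j}\,\sigma(\langle\ell_{j},\cdot\rangle). The sketch below evaluates such a network on a flattened truncated signature; the array layout and names are our own simplification, not the paper's implementation.

```python
import numpy as np

def theta(beta, ell, sig, activation=np.tanh):
    """Evaluate a signature-DNN of the form  theta(x) = sum_j beta_j * sigma(<ell_j, sig(x)>).

    beta: shape (m,) outer weights; ell: shape (m, D) coefficients of m words
    up to the truncation level; sig: shape (D,) truncated signature of x|_[0,t].
    """
    hidden = activation(ell @ sig)   # inner products <ell_j, x^{<=k}> passed through sigma
    return beta @ hidden

# Toy usage with random coefficients and a random "signature" vector of dimension D.
rng = np.random.default_rng(0)
m, D = 8, 15                          # e.g. D = 1 + 2 + 4 + 8 for a level-3 signature in dim 2
beta, ell = rng.normal(size=m), rng.normal(size=(m, D))
print(theta(beta, ell, rng.normal(size=D)))
```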

Now, in the primal case, it is in fact possible to show that for all 0nN0\leq n\leq N we have 𝔼[Zτn(m,k)|tn]𝔼[Zτn|tn]\mathbb{E}[Z_{\tau_{n}^{(m,k)}}|\mathcal{F}_{t_{n}}]\rightarrow\mathbb{E}[Z_{\tau_{n}}|\mathcal{F}_{t_{n}}] as (m,k)(m,k)\to\infty in L2L^{2}, from which the result follows immediately. We proceed by backward induction; the result clearly holds for n=Nn=N. Assuming that the claim holds for n+1n+1, we denote by θn(m,k)\theta_{n}^{(m,k)} the element θ(βP,n,P,n)\theta^{(\beta_{P}^{\star,n},\ell_{P}^{\star,n})} with respect to the minimizer (βP,n,P,n)(\beta_{P}^{\star,n},\ell_{P}^{\star,n}) in (8). We can estimate 𝔼[Zτn(m,k)Zτn|tn]L2\|\mathbb{E}[Z_{\tau_{n}^{(m,k)}}-Z_{\tau_{n}}|\mathcal{F}_{t_{n}}]\|_{L^{2}} exactly as in [BPS23, Appendix A.1] up to equation (A.1), which reduces the claim to the convergence of

L(m,k)=θ(m,k)(𝐗|[0,tn])𝔼[Zτn+1|tn]L2.L^{(m,k)}=\big{\|}\theta^{(m,k)}(\mathbf{X}|_{[0,t_{n}]})-\mathbb{E}[Z_{\tau_{n+1}}|\mathcal{F}_{t_{n}}]\big{\|}_{L^{2}}.

There is, however, a subtle difference here compared to the conclusion used in the linear case: the space DNNsigσ\mathrm{DNN}^{\sigma}_{\mathrm{sig}} is not convex, which means that θ(m,k)(𝐗|[0,tn])\theta^{(m,k)}(\mathbf{X}|_{[0,t_{n}]}) with respect to (8) is not necessarily the orthogonal projection of Zτn+1(m,k)Z_{\tau_{n+1}^{(m,k)}}. We can still conclude by introducing the minimizer (β^P,n,^P,n)(\hat{\beta}_{P}^{\star,n},\hat{\ell}_{P}^{\star,n}) as in (8), but with Zτn+1(m,k)Z_{\tau_{n+1}^{(m,k)}} replaced by 𝔼[Zτn+1|tn]\mathbb{E}[Z_{\tau_{n+1}}|\mathcal{F}_{t_{n}}]. Then we have

L(m,k)\displaystyle L^{(m,k)} θ(m,k)(𝐗|[0,tn])𝔼[Zτn+1(m,k)|tn]L2+𝔼[Zτn+1(m,k)Zτn+1|tn]L2\displaystyle\leq\big{\|}\theta^{(m,k)}(\mathbf{X}|_{[0,t_{n}]})-\mathbb{E}[Z_{\tau^{(m,k)}_{n+1}}|\mathcal{F}_{t_{n}}]\big{\|}_{L^{2}}+\big{\|}\mathbb{E}[Z_{\tau^{(m,k)}_{n+1}}-Z_{\tau_{n+1}}|\mathcal{F}_{t_{n}}]\big{\|}_{L^{2}}
θ^(m,k)(𝐗|[0,tn])𝔼[Zτn+1(m,k)|tn]L2+𝔼[Zτn+1(m,k)Zτn+1|tn]L2\displaystyle\leq\big{\|}\hat{\theta}^{(m,k)}(\mathbf{X}|_{[0,t_{n}]})-\mathbb{E}[Z_{\tau^{(m,k)}_{n+1}}|\mathcal{F}_{t_{n}}]\big{\|}_{L^{2}}+\big{\|}\mathbb{E}[Z_{\tau^{(m,k)}_{n+1}}-Z_{\tau_{n+1}}|\mathcal{F}_{t_{n}}]\big{\|}_{L^{2}}
θ^(m,k)(𝐗|[0,tn])𝔼[Zτn+1|tn]L2+2𝔼[Zτn+1(m,k)Zτn+1|tn]L2\displaystyle\leq\big{\|}\hat{\theta}^{(m,k)}(\mathbf{X}|_{[0,t_{n}]})-\mathbb{E}[Z_{\tau_{n+1}}|\mathcal{F}_{t_{n}}]\big{\|}_{L^{2}}+2\big{\|}\mathbb{E}[Z_{\tau^{(m,k)}_{n+1}}-Z_{\tau_{n+1}}|\mathcal{F}_{t_{n}}]\big{\|}_{L^{2}}
θ^(m,k)(𝐗|[0,tn])𝔼[Zτn+1|tn]L2+2𝔼[Zτn+1(m,k)Zτn+1|tn+1]L2,\displaystyle\leq\big{\|}\hat{\theta}^{(m,k)}(\mathbf{X}|_{[0,t_{n}]})-\mathbb{E}[Z_{\tau_{n+1}}|\mathcal{F}_{t_{n}}]\big{\|}_{L^{2}}+2\big{\|}\mathbb{E}[Z_{\tau^{(m,k)}_{n+1}}-Z_{\tau_{n+1}}|\mathcal{F}_{t_{n+1}}]\big{\|}_{L^{2}},

where we used the triangle inequality in the first and third inequalities, the fact that θ(m,k)\theta^{(m,k)} minimizes the L2L^{2} distance to 𝔼[Zτn+1(m,k)|tn]\mathbb{E}[Z_{\tau^{(m,k)}_{n+1}}|\mathcal{F}_{t_{n}}] in the second inequality, and the contraction property of conditional expectations in the last inequality. The last term now converges by the induction hypothesis, while the first one converges, as m,km,k\to\infty, thanks to the global approximation result.
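The fact used in the second inequality can be made explicit; we record the standard L^{2} projection identity here for the reader's convenience. Since any element of the class evaluated at the stopped path is measurable with respect to \mathcal{F}_{t_{n}}, the tower property gives, for every admissible \theta,

\big\|Z_{\tau^{(m,k)}_{n+1}}-\theta(\mathbf{X}|_{[0,t_{n}]})\big\|_{L^{2}}^{2}=\big\|Z_{\tau^{(m,k)}_{n+1}}-\mathbb{E}\big[Z_{\tau^{(m,k)}_{n+1}}\big|\mathcal{F}_{t_{n}}\big]\big\|_{L^{2}}^{2}+\big\|\mathbb{E}\big[Z_{\tau^{(m,k)}_{n+1}}\big|\mathcal{F}_{t_{n}}\big]-\theta(\mathbf{X}|_{[0,t_{n}]})\big\|_{L^{2}}^{2},

whose first term on the right-hand side does not depend on \theta. Hence the minimizer \theta^{(m,k)} of the left-hand side over \mathrm{DNN}^{\sigma}_{\mathrm{sig}} also minimizes the distance to the conditional expectation over this class, and in particular performs at least as well as \hat{\theta}^{(m,k)}.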

Finally, let us also outline the proof of the dual case, which closely follows the techniques of [BPS23, Appendix A.2]. First, an application of the global approximation result allows us to equivalently rewrite (4) as

Y0N=infθDNNsigσ𝔼[max0nN(ZtnMtnθ)],Mtnθ=0tnθ(𝐗|[0,s])𝑑WsY^{N}_{0}=\inf_{\theta\in\mathrm{DNN}^{\sigma}_{\mathrm{sig}}}\mathbb{E}\big{[}\max_{0\leq n\leq N}(Z_{t_{n}}-M^{\theta}_{t_{n}})\big{]},\quad M^{\theta}_{t_{n}}=\int_{0}^{t_{n}}\theta(\mathbf{X}|_{[0,s]})dW_{s} (18)

Indeed, leaving the detailed argument to the proof of [BPS23, Theorem 3.7], this follows from the following observation: as already discussed at the beginning of Section 2.2, since X=W\mathcal{F}^{X}=\mathcal{F}^{W}, by martingale representation it is enough to minimize (4) over L2L^{2}-progressive processes, which in turn can be approximated arbitrarily well by θDNNsigσ\theta\in\mathrm{DNN}^{\sigma}_{\mathrm{sig}} thanks to the global approximation result. Next we define Y0N,(m,k)=𝔼[max0nN(ZtnMtn(m,k))]Y_{0}^{N,(m,k)}=\mathbb{E}[\max_{0\leq n\leq N}(Z_{t_{n}}-M^{(m,k)}_{t_{n}})], where we recall that M(m,k)M^{(m,k)} was defined with respect to the minimizer in (9), and note that Y0N,(m,k)Y0NY_{0}^{N,(m,k)}\geq Y_{0}^{N}. Then, by (18), for every ϵ>0\epsilon>0 there exists an element θϵDNNsigσ\theta^{\epsilon}\in\mathrm{DNN}^{\sigma}_{\mathrm{sig}} such that Y0N+ϵ𝔼[max0nN(ZtnMtnθϵ)]Y_{0}^{N}+\epsilon\geq\mathbb{E}[\max_{0\leq n\leq N}(Z_{t_{n}}-M^{\theta^{\epsilon}}_{t_{n}})], so that

0Y0N,(m,k)Y0Nϵ+Y0N,(m,k)𝔼[max0nN(ZtnMtnθϵ)]ϵ,0\leq Y_{0}^{N,(m,k)}-Y^{N}_{0}\leq\epsilon+Y_{0}^{N,(m,k)}-\mathbb{E}\big{[}\max_{0\leq n\leq N}(Z_{t_{n}}-M^{\theta^{\epsilon}}_{t_{n}})\big{]}\leq\epsilon,

where the last inequality holds for (m,k)(m,k) large enough that θϵ=θ(β~,~)\theta^{\epsilon}=\theta^{(\tilde{\beta},\tilde{\ell})} for some (β~,~)m×(𝒲k)m(\tilde{\beta},\tilde{\ell})\in\mathbb{R}^{m}\times(\mathcal{W}^{\leq k})^{m}. ∎
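Numerically, the dual bound is estimated by Monte-Carlo, replacing the stochastic integral in (18) by left-point Euler sums along the simulated paths. A minimal sketch, assuming the integrand values and the Brownian increments have already been computed path-wise (names and array layout are ours, not the paper's code):

```python
import numpy as np

def dual_upper_bound(Z, theta_vals, dW):
    """Monte-Carlo estimate of E[max_n (Z_{t_n} - M_{t_n})] with
    M_{t_n} ~ sum_{j<n} theta(X|_[0,t_j]) * (W_{t_{j+1}} - W_{t_j}).

    Z:          (n_paths, N+1) discounted payoffs Z_{t_n} along each path
    theta_vals: (n_paths, N)   integrand theta(X|_[0,t_j]) evaluated path-wise
    dW:         (n_paths, N)   Brownian increments of the martingale driver
    """
    M_increments = theta_vals * dW
    M = np.concatenate(
        [np.zeros((Z.shape[0], 1)), np.cumsum(M_increments, axis=1)], axis=1
    )                                   # M_{t_0} = 0, then left-point Euler sums
    return np.mean(np.max(Z - M, axis=1))
```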

Remark A.1

The same results remain true when working with the log-signature introduced in Remark 3.4. For the global approximation result, one can simply note that for any fLp(𝒳,μ)f\in L^{p}(\mathcal{X},\mu), we know that fLp,𝕏K=,exp(𝕃K)f\overset{L^{p}}{\approx}\big{\langle}\ell,\mathbb{X}^{\leq K}\big{\rangle}=\big{\langle}\ell,\mathrm{exp}^{\otimes}(\mathbbm{L}^{\leq K})\big{\rangle} for KK large enough, where exp\exp^{\otimes} is the (continuous!) inverse of the log\mathrm{log}^{\otimes} introduced in Remark 3.4. Using [LLPS93, Proposition 2] again, we can approximate this exponential of the truncated log-signature by a DNN. In summary, fLpjβjσ(j~,𝕃<)f\overset{L^{p}}{\approx}\sum_{j}\beta_{j}\sigma(\big{\langle}\tilde{\ell_{j}},\mathbb{L}^{<\infty}\big{\rangle}) in the sense of the global approximation result discussed at the beginning of the proof. Then all the arguments can be repeated, now relying on the set DNNlogσ\mathrm{DNN}_{\mathrm{log}}^{\sigma}, which replaces the signature with the log-signature.
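As a usage illustration only (not part of the pricing pipeline of this paper), truncated signatures and log-signatures of a sampled, time-augmented path can be computed, e.g., with the iisignature library [RG18]:

```python
import numpy as np
import iisignature  # [RG18]

depth = 3
# Time-augmented two-dimensional path (t, X_t) sampled on a grid of 100 points.
t = np.linspace(0.0, 1.0, 100)
X = np.cumsum(np.random.default_rng(0).normal(scale=0.1, size=100))
path = np.stack([t, X], axis=1)

sig = iisignature.sig(path, depth)                  # truncated signature features
prep = iisignature.prepare(path.shape[1], depth)    # basis data for the log-signature
logsig = iisignature.logsig(path, prep)             # truncated log-signature features

print(sig.shape, logsig.shape)   # 2+4+8 = 14 signature entries vs. 5 log-signature entries
```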

References

  • [BB24] Christian Bayer and Simon Breneis. Efficient option pricing in the rough Heston model using weak simulation schemes. Quantitative Finance, pages 1–15, 2024.
  • [BBFP23] Peter Bank, Christian Bayer, Peter K Friz, and Luca Pelizzari. Rough PDEs for local stochastic volatility models. arXiv preprint arXiv:2307.09216, 2023.
  • [Bel13] Denis Belomestny. Solving optimal stopping problems via empirical dual optimization. The Annals of Applied Probability, 23(5):1988 – 2019, 2013.
  • [BFF+23] Christian Bayer, Peter K Friz, Masaaki Fukasawa, Jim Gatheral, Antoine Jacquier, and Mathieu Rosenbaum. Rough volatility. SIAM, 2023.
  • [BFG16] Christian Bayer, Peter Friz, and Jim Gatheral. Pricing under rough volatility. Quantitative Finance, 16(6):887–904, 2016.
  • [BGLY16] Horatio Boedihardjo, Xi Geng, Terry Lyons, and Danyu Yang. The signature of a rough path: uniqueness. Advances in Mathematics, 293:720–737, 2016.
  • [BHRS23] Christian Bayer, Paul P Hager, Sebastian Riedel, and John Schoenmakers. Optimal stopping with signatures. The Annals of Applied Probability, 33(1):238–273, 2023.
  • [BPS23] Christian Bayer, Luca Pelizzari, and John Schoenmakers. Primal and dual optimal stopping with signatures. arXiv preprint arXiv:2312.03444, 2023.
  • [BTW20] Christian Bayer, Raúl Tempone, and Sören Wolfers. Pricing American options by exercise rate optimization. Quantitative Finance, 20(11):1749–1760, 2020.
  • [Che57] Kuo-Tsai Chen. Integration of paths, geometric invariants and a generalized Baker-Hausdorff formula. Annals of Mathematics, 65(1):163–178, 1957.
  • [CO22] Ilya Chevyrev and Harald Oberhauser. Signature moments to characterize laws of stochastic processes. The Journal of Machine Learning Research, 23(1):7928–7969, 2022.
  • [CS24] Thomas Cass and Cristopher Salvi. Lecture notes on rough paths and applications to machine learning. arXiv preprint arXiv:2404.06583, 2024.
  • [DFM12] Vijay V Desai, Vivek F Farias, and Ciamac C Moallemi. Pathwise optimization for optimal stopping problems. Management Science, 58(12):2292–2308, 2012.
  • [DMC05] Petros Drineas, Michael W Mahoney, and Nello Cristianini. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6(12), 2005.
  • [EER19] Omar El Euch and Mathieu Rosenbaum. The characteristic function of rough Heston models. Mathematical Finance, 29(1):3–38, 2019.
  • [FH20] Peter K Friz and Martin Hairer. A course on rough paths. Springer, 2020.
  • [FV10] Peter K Friz and Nicolas B Victoir. Multidimensional stochastic processes as rough paths: theory and applications, volume 120. Cambridge University Press, 2010.
  • [GMZ20] Ludovic Goudenege, Andrea Molent, and Antonino Zanette. Machine learning for pricing American options in high-dimensional Markovian and non-Markovian models. Quantitative Finance, 20(4):573–591, 2020.
  • [KBPA+19] Patrick Kidger, Patric Bonnier, Imanol Perez Arribas, Cristopher Salvi, and Terry Lyons. Deep signature transforms. Advances in Neural Information Processing Systems, 32, 2019.
  • [KLA20] Jasdeep Kalsi, Terry Lyons, and Imanol Perez Arribas. Optimal execution with rough path signatures. SIAM Journal on Financial Mathematics, 11(2):470–493, 2020.
  • [KS91] Ioannis Karatzas and Steven Shreve. Brownian motion and stochastic calculus, volume 113. Springer Science & Business Media, 1991.
  • [LL21] Bernard Lapeyre and Jérôme Lelong. Neural network regression for Bermudan option pricing. Monte Carlo Methods and Applications, 27(3):227–247, 2021.
  • [LL24] Maud Lemercier and Terry Lyons. A high order solver for signature kernels. arXiv preprint arXiv:2404.02926, 2024.
  • [LLPS93] Moshe Leshno, Vladimir Ya Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural networks, 6(6):861–867, 1993.
  • [LS01] Francis A Longstaff and Eduardo S Schwartz. Valuing American options by simulation: a simple least-squares approach. The Review of Financial Studies, 14(1):113–147, 2001.
  • [Lyo98] Terry J Lyons. Differential equations driven by rough signals. Revista Matemática Iberoamericana, 14(2):215–310, 1998.
  • [MP18] Ryan McCrickerd and Mikko S Pakkanen. Turbocharging Monte Carlo pricing for the rough Bergomi model. Quantitative Finance, 18(11):1877–1886, 2018.
  • [PS06] Goran Peskir and Albert Shiryaev. Optimal stopping and free-boundary problems. Springer, 2006.
  • [Rei17] Jeremy Reizenstein. Calculation of iterated-integral signatures and log signatures. arXiv preprint arXiv:1712.02757, 2017.
  • [RG18] Jeremy Reizenstein and Benjamin Graham. The iisignature library: efficient calculation of iterated-integral signatures and log signatures. arXiv preprint arXiv:1802.08252, 2018.
  • [Rog02] Leonard CG Rogers. Monte Carlo valuation of American options. Mathematical Finance, 12(3):271–286, 2002.
  • [SCF+21] Cristopher Salvi, Thomas Cass, James Foster, Terry Lyons, and Weixin Yang. The signature kernel is the solution of a Goursat PDE. SIAM Journal on Mathematics of Data Science, 3(3):873–899, 2021.
  • [SHS01] Bernhard Schölkopf, Ralf Herbrich, and Alex J Smola. A generalized representer theorem. In International conference on computational learning theory, pages 416–426. Springer, 2001.
  • [Ste08] Ingo Steinwart. Support Vector Machines. Springer, 2008.
  • [TBO21] Csaba Toth, Patric Bonnier, and Harald Oberhauser. Seq2tens: An efficient representation of sequences by low-rank tensor projections. In International Conference on Learning Representations, 2021.
  • [TOS23] Csaba Toth, Harald Oberhauser, and Zoltan Szabo. Random Fourier signature features. arXiv preprint arXiv:2311.12214, 2023.
  • [WS00] Christopher Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. Advances in Neural Information Processing Systems, 13, 2000.