

Sig-Splines: universal approximation and convex calibration of time series generative models

Magnus Wiese [email protected] University of Kaiserslautern, Gottlieb-Daimler Straße 48, Kaiserslautern, Germany; Phillip Murray [email protected] Imperial College London, 180 Queen’s Gate, London, United Kingdom; and Ralf Korn [email protected] University of Kaiserslautern, Gottlieb-Daimler Straße 48, Kaiserslautern, Germany
(2023)
Abstract.

We propose a novel generative model for multivariate discrete-time time series data. Drawing inspiration from the construction of neural spline flows, our algorithm incorporates linear transformations and the signature transform as a seamless substitution for traditional neural networks. This approach not only retains the universality property inherent in neural networks but also introduces convexity in the model's parameters.

generative modelling, market simulation, signatures, time series
copyright: ACM; journal year: 2023; doi: XXXXXXX.XXXXXXX; conference: 4th ACM International Conference on AI in Finance, November 27–29, 2023, New York City; price: 15.00; isbn: 978-1-4503-XXXX-X/18/06; ccs: Mathematics of computing, Probability and statistics; Applied computing; Computing methodologies, Machine learning

1. Introduction

Constructing and approximating generative models that fit multidimensional time series data with high realism is a fundamental problem with applications in probabilistic prediction and forecasting [Rasul et al., 2020] and in synthetic data simulation in areas such as finance [Arribas et al., 2020, Buehler et al., 2020, 2022, Ni et al., 2020, 2021, Wiese et al., 2021, 2020]. The sequential nature of the data presents some unique challenges. It is essential that any generative model captures the temporal dynamics of the time series process: at each time step, it is not enough to simply reflect the marginal distribution p(\mathbf{x}_{t}); instead we need to model the conditional density p(\mathbf{x}_{t+1}|\mathcal{F}_{t}), where the filtration \mathcal{F}_{t} represents all information available at time t. Therefore, the model requires an efficient encoding of the history of the path as the conditioning variable.

In this paper, we present a novel approach to time series generative modelling that combines the use of path signatures with the framework of normalizing flow density estimation. Normalizing flows [Kobyzev et al., 2020, Papamakarios et al., 2019] are a class of probability density estimation models that express the data \mathbf{x} as the output of an invertible, differentiable transformation T of some base noise \mathbf{u}:

\mathbf{x}=T(\mathbf{u})\qquad\text{where}\ \mathbf{u}\sim p(\mathbf{u})

A special instance of such models is the neural spline flow [Durkan et al., 2019a], which constructs a triangular chain of approximate conditional CDFs of the components of \mathbf{x} by using neural networks to output the width and height of a sequence of knots; these knots define a monotonically increasing CDF for each dimension through spline interpolation, which can be linear, rational quadratic or cubic [Durkan et al., 2019a, b]. Such models may be estimated by finding neural network parameters that minimize the forward Kullback-Leibler divergence between the target density p(\mathbf{x}) and the flow-based model density p_{\theta}(\mathbf{x}). Thus, training such models may face the known challenges in neural network training. In particular, learning the conditional density p(\mathbf{x}_{t+1}|\mathcal{F}_{t}) for time series data would in general require some recurrent structure in the neural network to account for the path sequence, unless specific Markovian assumptions were made.

To overcome these challenges, we replace the neural network in the neural spline flow with an alternative function approximator, via the signature transform. First introduced in the context of rough path theory [Friz and Hairer, 2020, Lyons, 2014], the signature gives an efficient and parsimonious encoding of the information in the path history (i.e. the filtration). This encoding provides a feature map which exhibits a universal approximation property - any continuous function of the path can be approximated arbitrarily well by a linear combination of signature features. This makes signature-based methods highly computationally efficient for convex objective functions.

By incorporating the signature transform, our algorithm, termed the signature spline flow, possesses two significant properties:

  • Universality: We leverage the universal approximation theorem of path signatures, allowing our model to approximate the conditional density of any time series model with arbitrary precision.

  • Convexity: By replacing the neural network with the signature transform, we demonstrate that our optimization problem becomes convex in its parameters. This means that gradient descent or convex optimization methods lead to a unique global minimum (except for potential linearly dependent terms within the signature).

2. Related work

The seminal paper of Goodfellow et al. on Generative Adversarial Networks (GANs) [Goodfellow et al., 2014] has led to a rapid expansion of the literature related to generative modelling. While the algorithm may seem straightforward at first, different optimisation techniques have led to convergence towards equilibria that give satisfying results for various modalities of data, including images [Brock et al., 2019], (financial) time series data [Wiese et al., 2020, Yoon et al., 2019], music [Engel et al., 2019] and text-to-image synthesis [Reed et al., 2016].

Normalizing flows are an alternative approach to constructing a generative model by creating expressive bijections that allow for tractable conditional densities. Naming the most impactful papers in roughly chronological order, NICE [Dinh et al., 2014] proposed additive coupling layers, and the real-valued non-volume preserving (real NVP) transformations of Dinh et al. [Dinh et al., 2016] chained affine coupling transforms to construct expressive bijections, which were later shown to be universal [Teshima et al., 2020]. Subsequently, the idea of constructing a triangular map and leveraging the theorem of Bogachev et al. [Bogachev et al., 2005] gained popularity and was coined the autoregressive flow by various authors. Works that present algorithms leveraging autoregressive flows include Papamakarios et al. [Papamakarios et al., 2017], Wehenkel and Louppe [Wehenkel and Louppe, 2019] and Durkan et al. [Durkan et al., 2019a, b]. While the invertibility property of normalizing flows can at first sight seem limiting, since a flow cannot construct densities on a manifold, this problem was addressed with Riemannian manifold flows [Gemici et al., 2016], where an injective decoder mapping to the high-dimensional space is constructed. An efficient algorithm to guarantee injectivity was introduced in [Brehmer and Cranmer, 2020].

Our research paper centers around constructing a conditional parametrized generative density p_{\theta}(\mathbf{x}_{t+1}|\mathcal{F}_{t}) for time series data, where the condition is specified by the current filtration \mathcal{F}_{t}. Previous studies have explored the estimation and construction of such densities using real data.

Ni et al. [Ni et al., 2020] propose a conditional signature-based Wasserstein metric and estimate it through signatures and linear regression. The combination of GANs and autoencoders is utilized by [Yoon et al., 2019] to learn the conditional dynamics. Wiese et al. [Wiese and Murray, 2022, Wiese et al., 2021] employ neural spline flows as a one-point estimator for estimating the conditional law of observed spot and volatility processes.

Other approaches focus on estimating the unconditional law of stochastic processes using Sig-SDEs [Arribas et al., 2020], neural SDEs [Gierjatowicz et al., 2020], the Sig-Wasserstein metric [Ni et al., 2021], and temporal convolutional networks [Wiese et al., 2020]. Among these, the work of Dyer et al. [Dyer et al., 2021] is most closely related to ours, utilizing deep signature transforms [Kidger et al., 2019] to minimize the KL-divergence.

3. Signatures

When working with time series data in real-world applications we typically have a dataset which may have been sampled at discrete and potentially irregular intervals. Whilst many models take the view that the underlying process is a discrete-time process, the perspective in rough path theory is to model the data as discrete observations from an unknown continuous-time process. The trajectories of these continuous-time processes define paths in some path space. Rough path theory, and signatures in particular, give a powerful and computationally efficient way of working with data in the path space.

First introduced by Chen [Chen, 1957, 2001], the signature has been widely used in finance and machine learning. We provide here only a very brief introduction to the mathematical framework we will be using to define the signature, and refer the reader to e.g. [Chevyrev and Kormilitzin, 2016, Lyons et al., 2007] for a more thorough overview.

The signature of a V-valued path takes values in the tensor algebra space of V. In this section, we assume for simplicity that V is the d-dimensional Euclidean vector space; i.e. V=\mathbb{R}^{d}.

We may now define the signature of a d-dimensional path as a sequence of iterated integrals which lives in the tensor algebra T((\mathbb{R}^{d})).

Definition 0 (Signature of a path).

Let \mathbf{X}=(\mathbf{X}_{1},\ldots,\mathbf{X}_{d})\colon[0,1]\to\mathbb{R}^{d} be continuous. The signature of \mathbf{X} evaluated at the word \mathbf{i}=(i_{1},\dots,i_{k})\in[d]^{\times k} is defined as

\operatorname{Sig}_{\mathbf{i}}(\mathbf{X})=\underset{0<t_{1}<\cdots<t_{k}<1}{\int\cdots\int}\mathrm{d}X_{t_{1},i_{1}}\cdots\mathrm{d}X_{t_{k},i_{k}}\in(\mathbb{R}^{d})^{\otimes k}

Furthermore, the signature of \mathbf{X} is defined as the collection of iterated integrals

\operatorname{Sig}(\mathbf{X})=\left(\operatorname{Sig}_{\mathbf{i}}(\mathbf{X})\right)_{\mathbf{i}\in[d]^{\star}}\in T((\mathbb{R}^{d})).
Remark 1 (Signature of a sequence).

Let \mathbf{x}=(\mathbf{x}_{1},\ldots,\mathbf{x}_{n})\in S(\mathbb{R}^{d}) be a sequence. Let \mathbf{X}=(\mathbf{X}_{1},\ldots,\mathbf{X}_{d})\colon[0,1]\to\mathbb{R}^{d} be continuous, such that \mathbf{X}(\tfrac{i-1}{n-1})=\mathbf{x}_{i}, and linear on the intervals in between. For convenience, we denote the signature of the linearly embedded sequence \mathbf{x} for any word \mathbf{i}=(i_{1},\dots,i_{k})\in[d]^{\star} by \operatorname{Sig}_{\mathbf{i}}(\mathbf{x})=\operatorname{Sig}_{\mathbf{i}}(\mathbf{X}).

The signature is a projection from the path space into a sequence of statistics of the path. Therefore, it can informally be thought of as playing the role of a basis on the path space. Each element in the signature encodes some information about the path, and some have clear interpretations, particularly in the context of financial time series: the first term represents the increment of the path over the time interval, often referred to as the drift of the process. Different path transformations, such as the lead-lag transformation [Chevyrev and Kormilitzin, 2016, Page 20], can be applied before computing the signature to obtain statistics such as the realized volatility.
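To make the definition of the signature and Remark 1 concrete, the following sketch (our Python/NumPy illustration, not part of the original implementation) computes the level-1 and level-2 terms of the signature of a linearly embedded sequence. It relies on Chen's identity: a linear segment with increment \Delta has level-k signature term \Delta^{\otimes k}/k!, and the signature of a concatenation of segments is the tensor product of the segment signatures.

```python
import numpy as np

def truncated_signature_level2(x):
    """Level-1 and level-2 signature terms of the piecewise-linear path
    through the rows of x (shape (n, d)), as in Remark 1.

    Chen's identity, truncated at level 2: a linear segment with increment
    delta has signature (1, delta, delta (x) delta / 2), and signatures of
    concatenated paths combine via the tensor product.
    """
    n, d = x.shape
    s1 = np.zeros(d)        # level-1 term: the total increment of the path
    s2 = np.zeros((d, d))   # level-2 terms: iterated integrals Sig_{(i,j)}
    for k in range(n - 1):
        delta = x[k + 1] - x[k]
        # update level 2 before level 1 (Chen's identity)
        s2 = s2 + np.outer(s1, delta) + 0.5 * np.outer(delta, delta)
        s1 = s1 + delta
    return s1, s2

# Example: a 2-dimensional sequence of length 4.
x = np.array([[0.0, 0.0], [0.3, 0.1], [0.5, 0.4], [0.2, 0.9]])
s1, s2 = truncated_signature_level2(x)
print(s1)                   # equals x[-1] - x[0], the increment ("drift") term
print(s2[0, 1], s2[1, 0])   # second-order cross terms Sig_{(1,2)}, Sig_{(2,1)}
```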

To state the universality property of signatures we have to introduce the concept of time augmentation. For m\in\mathbb{N} let BV([0,1],\mathbb{R}^{m}) denote the space of \mathbb{R}^{m}-valued paths of bounded variation. Let \tilde{\psi}:BV([0,1],\mathbb{R}^{d})\to BV([0,1],\mathbb{R}^{d+1}) be the function defined for t\in[0,1] as \tilde{\psi}(\mathbf{X})(t)=(t,\mathbf{X}(t)). We call

\Omega_{0}([0,1];\mathbb{R}^{d})\coloneqq\{\mathbf{X}:[0,1]\to\mathbb{R}^{d+1}\ |\ \tilde{\mathbf{X}}\in BV([0,1],\mathbb{R}^{d}),\ \tilde{\mathbf{X}}(0)=0,\ \mathbf{X}=\tilde{\psi}(\tilde{\mathbf{X}})\}

the space of time-augmented paths starting at zero. We have the following representation property of signatures.

Proposition 0 (Universality of signatures).

Let F be a real-valued continuous function on continuous piecewise smooth paths in \mathbb{R}^{d} and let \mathcal{K}\subset\Omega_{0}([0,1];\mathbb{R}^{d}) be a compact set of such paths. Let \varepsilon>0. Then there exists an order of truncation L\in\mathbb{N} and a linear functional \mathbf{w}\in T((\mathbb{R}^{d+1})^{*}) such that for all \hat{\mathbf{X}}\in\mathcal{K},

\left|F(\hat{\mathbf{X}})-\langle\mathbf{w},\operatorname{Sig}^{(L)}(\hat{\mathbf{X}})\rangle\right|<\varepsilon

This result shows that any continuous function on a compact set of paths can be approximated arbitrarily well simply by a linear combination of terms of the signature. For a proof, see [Király and Oberhauser, 2019]. It can be thought of as a universal approximation property akin to the oft-cited one for neural networks [Cybenko, 1989]. The primary difference is that neural networks typically require a very large number of parameters which must be optimized through gradient-based methods applied to a loss function that is non-convex in the network parameters, whereas the approximation via signatures amounts to a simple linear regression on the signature features. Hence, once the signature is calculated, function approximation becomes extremely computationally efficient and can be done via second-order methods. Furthermore, as we will see later, we may apply further convex transformations of the signature and retain an objective that is convex in the parameters.
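As a sketch of how the universality property is used in practice, the following snippet approximates a path functional by ridge regression on truncated signature features of basepoint- and time-augmented paths. It is an illustration under our own choices (level-2 truncation, a toy realized-variance functional) and reuses truncated_signature_level2 from the snippet above; it is not the paper's implementation.

```python
import numpy as np

def sig_features(x):
    """Flattened level-1 and level-2 signature features of the basepoint-
    and time-augmented sequence x (shape (n, d))."""
    n, d = x.shape
    t = np.linspace(0.0, 1.0, n)[:, None]
    x_aug = np.concatenate([np.zeros((1, d + 1)),        # basepoint
                            np.hstack([t, x])], axis=0)  # time augmentation
    s1, s2 = truncated_signature_level2(x_aug)           # from the snippet above
    return np.concatenate([s1, s2.ravel()])

# Toy functional of the path: realized variance of the first coordinate.
rng = np.random.default_rng(0)
paths = rng.uniform(size=(500, 10, 2))
F = np.array([np.sum(np.diff(p[:, 0]) ** 2) for p in paths])

Phi = np.stack([sig_features(p) for p in paths])
# Ridge regression on signature features: a convex problem with a closed form.
lam = 1e-3
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ F)
print(np.mean((Phi @ w - F) ** 2))   # in-sample mean squared error
```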

4. Linear neural spline flows

In this section, we revisit neural spline flows as prerequisites. Specifically, we explore their application to multivariate data in subsection 4.1 and further delve into their relevance in the context of multivariate time series data in subsection 4.2.

4.1. Multivariate data

Without loss of generality let \mathbf{x}=(\mathrm{x}_{1},\dots,\mathrm{x}_{d})\sim p be a [0,1]^{d}-valued random variable. (Any random variable can be mapped to [0,1] by applying the integral probability transform (IPT); note that a bijection applied to make a random variable [0,1]^{d}-valued does not impact the KL-divergence objective derived later.) Furthermore, denote by F_{1}(x)\coloneqq\mathbb{P}(\mathrm{x}_{1}\leq x) the cumulative distribution function (CDF) of \mathrm{x}_{1} and for i\in\{2,\dots,d\} by

F_{i}({x}|\mathbf{x}_{:i-1})\coloneqq\mathbb{P}(\mathrm{x}_{i}\leq x|{\mathrm{x}}_{:i-1}=\mathbf{x}_{:i-1})

the conditional CDF of \mathrm{x}_{i} given that the value of {\mathrm{x}}_{:i-1} is \mathbf{x}_{:i-1}. The set of conditional CDFs \mathbf{F}=(F_{1},\dots,F_{d}) completely defines the joint distribution of the random variable \mathbf{x}\sim p. For completeness we restate the inverse sampling theorem in multiple dimensions:

Theorem 1 (Inverse sampling theorem).

Let \mathbf{u}=(\mathrm{u}_{1},\dots,\mathrm{u}_{d}) be a uniformly distributed random variable on [0,1]^{d}. Furthermore, let \mathrm{y}_{1}=(F_{1})^{-1}(\mathrm{u}_{1}) and define for i=2,\dots,d the random variables

\mathrm{y}_{i}=(F_{i})^{-1}(\mathrm{u}_{i}|\mathrm{y}_{1},\dots,\mathrm{y}_{i-1}).

Then the random variables \mathbf{x}\sim p and \mathbf{y}=(\mathrm{y}_{1},\dots,\mathrm{y}_{d}) are equal in distribution, i.e. \mathbf{x}\stackrel{d}{=}\mathbf{y}.

Proof.

See [Papamakarios et al., 2019, Section 2.2] for a derivation or [Bogachev et al., 2005] for a more formal treatment of the proof. ∎
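A minimal numerical illustration of the inverse sampling theorem in two dimensions, for a toy target whose conditional CDFs can be inverted in closed form (\mathrm{x}_{1}\sim U[0,1] and \mathrm{x}_{2}|\mathrm{x}_{1}\sim U[0,\mathrm{x}_{1}]); the example is ours and not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 100_000

# Toy target: x1 ~ U[0,1], x2 | x1 ~ U[0, x1].
# F1(x) = x            =>  F1^{-1}(u) = u
# F2(x | x1) = x / x1  =>  F2^{-1}(u | x1) = u * x1
u = rng.uniform(size=(M, 2))
y1 = u[:, 0]
y2 = u[:, 1] * y1

# Direct (ancestral) sampling of the same law for comparison.
x1 = rng.uniform(size=M)
x2 = rng.uniform(size=M) * x1

print(np.mean(y2), np.mean(x2))                              # both close to 1/4
print(np.corrcoef(y1, y2)[0, 1], np.corrcoef(x1, x2)[0, 1])  # matching dependence
```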

Various methods exist to approximate a set of conditional CDFs \mathbf{F}=(F_{1},\dots,F_{d}). In this paper, we are interested in a spline-based approximation

\hat{F}_{i}:[0,1]\times[0,1]^{i-1}\times\Theta\to[0,1],\ i\in[d]

parametrised by model parameters \theta\in\Theta and the conditioning vector \mathbf{x}_{:i-1}. Crucially, the constructed spline has to satisfy the properties of a CDF: it has to (1) be monotonically increasing in x_{i} and (2) span from 0 to 1. In order to satisfy both requirements the following construction is used.

Let N+1\in\mathbb{N} be the number of knots used to construct the spline. Furthermore, denote by \Phi:\mathbb{R}^{N}\to\mathbb{R}^{N} the softmax transform, which is defined as \Phi(\mathbf{x})=\left({\exp(\mathbf{x}_{j})}/{\sum_{i=1}^{N}\exp(\mathbf{x}_{i})}\right)_{j\in\{1,\dots,N\}}, and let H_{i}:[0,1]^{i-1}\times\Theta\to\mathbb{R}^{N} be a potentially non-linear function which we shall coin the feature map. We call the composition of the softmax transform with the feature map H_{i} the increment function and denote it by

\Delta_{i}(\mathbf{x}_{:i-1},\theta)=\Phi\circ H_{i}(\mathbf{x}_{:i-1},\theta)\ .

The parametrised spline with linear interpolation \hat{F}_{i} is then defined as

F_{i}(x_{i};\mathbf{x}_{:i-1},\theta)=\sum_{j=1}^{k}\Delta_{i,j}(\mathbf{x}_{:i-1},\theta)+N\Big(x_{i}-\frac{k}{N}\Big)\Delta_{i,k+1}(\mathbf{x}_{:i-1},\theta)\qquad\textrm{if}\ x_{i}\in\Big[\frac{k}{N},\frac{k+1}{N}\Big]

where \Delta_{i,j}(\mathbf{x}_{:i-1},\theta) denotes the j^{th} component of the increment function \Delta_{i}.

Note that due to the application of the softmax transform the constructed spline satisfies the properties of a CDF.
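The construction can be sketched in a few lines: softmax-normalized increments on a uniform grid of N bins yield a piecewise-linear, monotonically increasing CDF spanning from 0 to 1. The feature map below is a placeholder (a random linear map of the conditioning vector), chosen only for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def linear_spline_cdf(x_i, increments):
    """Evaluate the piecewise-linear CDF built from softmax increments at a
    point x_i in [0, 1]; N = len(increments) bins of width 1/N."""
    N = len(increments)
    k = min(int(np.floor(x_i * N)), N - 1)   # index of the bin containing x_i
    return increments[:k].sum() + (x_i - k / N) * N * increments[k]

N = 8
rng = np.random.default_rng(2)
x_cond = rng.uniform(size=3)            # conditioning vector x_{:i-1}
W = rng.normal(size=(N, 3))             # placeholder feature map H_i
increments = softmax(W @ x_cond)        # increment function Delta_i(x_{:i-1}, theta)

print(linear_spline_cdf(0.0, increments), linear_spline_cdf(1.0, increments))  # 0.0, 1.0
# The conditional density on bin k is N * increments[k] (piecewise constant).
```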

Using the above definition of a linear spline CDF we can construct a linear spline flow which approximates the set of CDFs \mathbf{F}:

Definition 0 (Spline flow with linear interpolation).

For i\in[d] let {F}_{i}:[0,1]\times[0,1]^{i-1}\times\Theta\to[0,1] be linear spline CDFs. Then we call the function \mathbf{F}:[0,1]^{d}\times\Theta^{d}\to[0,1]^{d} defined as

{\mathbf{F}}(\mathbf{x};\theta)=({F}_{i}(x_{i};\mathbf{x}_{:i-1},\theta_{i}))_{i\in[d]}

where \theta=(\theta_{i})_{i\in[d]}, a spline flow with linear interpolation.

Various options for the feature map H_{i} exist. The most widespread in the machine learning literature is a neural network, resulting in a linear neural spline flow.

Definition 0 (Linear neural spline flow).

For i\in[d] let H_{i}=\mathcal{NN}:[0,1]^{i-1}\times\Theta\to\mathbb{R}^{N} be a neural network. We call a spline flow \mathbf{F}:[0,1]^{d}\times\Theta^{d}\to[0,1]^{d} taking (H_{i})_{i\in[d]} as its feature maps a linear neural spline flow.

Remark 2 (Interpolation schemes).

Other interpolation schemes that ensure that {F}_{i} is monotonically increasing exist (see for example [Durkan et al., 2019a]). For the sake of this paper, we will restrict ourselves to linear interpolation.

Utilizing normalizing flows, such as linear spline CDFs, offers the advantage of allowing for analytical evaluation of the likelihood function. This characteristic can be expressed explicitly as an autoregressive flow [Papamakarios et al., 2019], employing the chain rule of probability (defining p(x_{1}|\mathbf{x}_{:0})=p(x_{1})):

(1) p_{\theta}(\mathbf{x})=\prod_{i=1}^{d}p_{\theta}(x_{i}|\mathbf{x}_{:i-1})

The density can be further expanded by using the definition of a parametrised linear CDF

(2) p_{\theta}(\mathbf{x})=\prod_{i=1}^{d}\prod_{k=1}^{N}\left(N\,\Delta_{i,k}(\mathbf{x}_{:i-1},\theta)\right)^{C_{i,k}(\mathbf{x})}
(3) \phantom{p_{\theta}(\mathbf{x})}=N^{d}\prod_{i=1}^{d}\prod_{k=1}^{N}\Delta_{i,k}(\mathbf{x}_{:i-1},\theta)^{C_{i,k}(\mathbf{x})}

where C_{i,k}(\mathbf{x})=\mathbf{1}_{\{x_{i}\in[\frac{k-1}{N},\frac{k}{N})\}} is the indicator function for the k^{th} bin, which is 1 if x_{i} falls into the bin [\frac{k-1}{N},\frac{k}{N}) and 0 otherwise. Due to the tractability of the conditional density, the calibration of the parameters \theta\in\Theta can be performed by minimizing the Kullback-Leibler (KL) divergence. The KL-divergence is defined as the expected ratio of the log-densities under the true density p

\mathrm{KL}(p,p_{\theta})\coloneqq\mathbb{E}_{p}\left[\ln\dfrac{p(\mathbf{x})}{p_{\theta}(\mathbf{x})}\right]

and for a neural spline flow \mathbf{F}_{\theta} with linear interpolation enjoys the explicit representation:

\mathrm{KL}(p,p_{\theta})=-\mathbb{E}_{p}\left[\ln p_{\theta}(\mathbf{x})\right]+K=-\sum_{i=1}^{d}\sum_{k=1}^{N}\mathbb{E}_{p}\left[C_{i,k}(\mathbf{x})\ln\Delta_{i,k}(\mathbf{x}_{:i-1},\theta)\right]+K

where K represents a constant term that is independent of the parameters \theta\in\Theta.

In practice, the density p is generally unknown and the expectation in the above derived KL-divergence needs to be estimated via Monte Carlo (MC) from a sample \{\mathbf{x}^{(j)}\}_{j=1}^{M}\sim p of size M. In this case, the MC approximation is given as

J(\theta)\coloneqq-\dfrac{1}{M}\sum_{i=1}^{d}\sum_{j=1}^{M}\ln p_{\theta}\big(x_{i}^{(j)}|\mathbf{x}_{:i-1}^{(j)}\big)

where we drop the constant term K. The calibration problem is then defined as the minimization of the loss function J:\Theta\to\mathbb{R}

\min_{\theta\in\Theta}J(\theta).

Updates of the parameters are performed via batch or stochastic gradient descent [Ruder, 2016]: for iteration k\in\mathbb{N}_{0} and initial parameters \theta^{(0)}\in\Theta,

\theta^{(k+1)}\leftarrow\theta^{(k)}-\alpha\,\nabla_{\theta}J(\theta)\big|_{\theta=\theta^{(k)}}

where \alpha>0 is the step size or learning rate and the gradients with respect to the neural network's parameters are computed via the backpropagation algorithm [Rumelhart et al., 1986].

4.2. Discrete-time stochastic processes

In the case of time series data one is interested in approximating the transition dynamics, that is, the conditional density of the next observation, conditional on past states of the time series. Let (\Omega,\mathbb{F}=(\mathcal{F}_{t})_{t\in\mathbb{N}},\mathbb{P}) be a filtered probability space and let (\mathbf{x}_{t})_{t}\sim p be a time series observed at discrete timestamps, which for ease of notation we assume to be regular, although they need not be. As in the previous section, we assume without loss of generality that the time series (\mathbf{x}_{t})_{t}\sim p is [0,1]^{d}-valued. Assume further that the process \mathbf{x}\sim p is generative in the sense that the filtration is generated by the process: \mathcal{F}_{t}=\sigma((\mathbf{x}_{s})_{s=0,\dots,t}),\ t\in\mathbb{N}.

Our objective is to approximate a conditional model density p_{\theta}(\mathbf{x}_{t+1}|\mathcal{F}_{t}) that minimizes an adapted version of the KL-divergence to the true conditional density

\mathrm{aKL}(p,p_{\theta})=\mathbb{E}_{p}\left[\mathbb{E}_{p}\left[\ln\dfrac{p(\mathbf{x}_{t+1}|\mathcal{F}_{t})}{p_{\theta}(\mathbf{x}_{t+1}|\mathcal{F}_{t})}\Big{|}\mathcal{F}_{t}\right]\right]

To accommodate the conditional information of past samples, i.e. the filtration, we need to generalise neural spline flows by including the condition, i.e. past states of the time series.

Definition 0 (Discrete-time linear spline flow).

Fix t\in\mathbb{N} and for i\in[d] let H_{i}:[0,1]^{dt+(i-1)}\times\Theta\to\mathbb{R}^{N} be a non-linear function and {F}_{i}:[0,1]\times[0,1]^{dt+(i-1)}\times\Theta\to[0,1] be linear spline CDFs taking H_{i} as the feature map. We call the function \mathbf{F}:[0,1]^{d}\times[0,1]^{dt}\times\Theta^{d}\to[0,1]^{d} defined as

(4) {\mathbf{F}}(\mathbf{x}_{t+1};\mathbf{x}_{\leq t},\theta)=({F}_{i}(x_{t+1,i};(\mathbf{x}_{\leq t},\mathbf{x}_{t+1,:i-1}),\theta_{i}))_{i\in[d]}

a discrete-time linear spline flow.

Following the definition of a neural spline flow, the discrete-time linear neural spline flow is simply defined as a discrete-time linear spline flow whose feature maps (H_{i})_{i\in[d]} are neural networks.

Remark 3 (Markovian dynamics).

If the time series is Markovian with r-lagged memory, a single conditional neural spline flow can be constructed to approximate the conditional law at any time t\in\mathbb{N}. The objective reduces to the “simpler” adapted KL-divergence

\mathrm{aKL}(p,p_{\theta})=\mathbb{E}_{p}\left[\mathbb{E}_{p}\left[\ln\dfrac{p(\mathbf{x}_{t+1}|\mathcal{F}_{t})}{p_{\theta}(\mathbf{x}_{t+1}|\mathcal{F}_{t})}\Big{|}\mathcal{F}_{t}\right]\right]=-\mathbb{E}_{p}\left[\mathbb{E}_{p}\left[\ln p_{\theta}(\mathbf{x}_{t+1}|\mathbf{x}_{t-r+1:t})\Big{|}\mathbf{x}_{t-r+1:t}\right]\right]+K

5. Signature spline flows

In this section, we introduce signature spline flows and adopt the notation from subsection 4.2. Before we proceed, we first define a couple of helper augmentations which will be useful to define the sig-spline.

Definition 0 (Mask augmentation).

The function m:S(\mathbb{R}^{d})\times[d]\to S(\mathbb{R}^{d}) defined for i\in[d] as

m(\mathbf{x},i)=\left(\mathbf{x}_{1},\dots,\mathbf{x}_{n-1},\tilde{m}(\mathbf{x}_{n},i)\right)

where \tilde{m}:\mathbb{R}^{d}\times[d]\to\mathbb{R}^{d} is defined as

\tilde{m}(\mathbf{x}_{n},i)=(x^{1}_{n},\dots,x^{i-1}_{n},x^{i}_{n-1},\dots,x^{d}_{n-1})^{T}

is called mask augmentation.

Thus, the mask augmentation removes any information beyond the (i-1)^{th} coordinate of the n^{th} term of a sequence. This is useful for defining any conditional density approximator for discrete-time data.

To obtain the universality property of signatures the sequence has to start at \mathbf{0} and needs to be time-augmented. The following two definitions will therefore be useful:

Definition 0 (Basepoint augmentation).

We call the function \phi:S(\mathbb{R}^{d})\to S(\mathbb{R}^{d}) defined as \phi:(\mathbf{x}_{1},\dots,\mathbf{x}_{n})\mapsto(\mathbf{0},\mathbf{x}_{1},\dots,\mathbf{x}_{n}) the basepoint augmentation.

Definition 0 (Time augmentation).

We call the function \psi:S(\mathbb{R}^{d})\to S(\mathbb{R}^{1+d}) defined as

\psi:(\mathbf{x}_{1},\dots,\mathbf{x}_{n})\mapsto((t_{1},\mathbf{x}_{1}),\dots,(t_{n},\mathbf{x}_{n}))

time augmentation.

Last, let \gamma_{i}:S(\mathbb{R}^{d})\to S(\mathbb{R}^{d+1}) be the composition of the basepoint, time and mask augmentations, defined as \gamma_{i}=\phi\circ m_{i}\circ\psi, where m_{i} is the mask augmentation applied at the i^{th} coordinate.
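The three augmentations are straightforward to implement; the sketch below (our illustration, with sequences stored as NumPy arrays of shape (n, d)) composes them into \gamma_{i}, under the interpretation that the mask acts on the original coordinates and leaves the prepended time channel untouched.

```python
import numpy as np

def mask_augmentation(x, i):
    """Replace coordinates i, ..., d of the last observation by those of the
    previous observation (1-based coordinate index i)."""
    x = x.copy()
    x[-1, i - 1:] = x[-2, i - 1:]
    return x

def time_augmentation(x):
    """Prepend a time channel t_1, ..., t_n on a uniform grid in [0, 1]."""
    n = x.shape[0]
    t = np.linspace(0.0, 1.0, n)[:, None]
    return np.hstack([t, x])

def basepoint_augmentation(x):
    """Prepend the zero vector to the sequence."""
    return np.vstack([np.zeros((1, x.shape[1])), x])

def gamma(x, i):
    """gamma_i = basepoint o mask_i o time; i + 1 skips the time channel."""
    return basepoint_augmentation(mask_augmentation(time_augmentation(x), i + 1))

x = np.array([[0.1, 0.2], [0.4, 0.3], [0.8, 0.5]])
print(gamma(x, 1))   # coordinates 1, ..., d of the last row are masked
```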

The signature spline flow is defined in the same spirit as the neural spline flow, except that the neural network-based CDF approximator is replaced by a signature-based one and the augmentations \gamma_{i},\ i\in[d], are applied to the raw time series.

Definition 0 (Signature spline flow).

Let \mathbf{x}\in S(\mathbb{R}^{d}) be a discrete-time series of length t\in\mathbb{N}, L be the order of truncation and let \Theta=\times_{j=1}^{N}T^{(L)}((\mathbb{R}^{d+1})^{*}) be the parameter space. Furthermore, let H_{i}:S([0,1]^{d})\times\Theta\to\mathbb{R}^{N} be defined as

H_{i}(\mathbf{x},\mathbf{U})=\left(\langle\mathbf{u}_{j},\operatorname{Sig}^{(L)}(\gamma_{i}(\mathbf{x}))\rangle\right)_{j\in[N]}

where \mathbf{U}=(\mathbf{u}_{1},\dots,\mathbf{u}_{N}). We call a spline flow \mathbf{F}:S([0,1]^{d})\times\Theta^{\times d}\to[0,1]^{d} using (H_{i})_{i\in[d]} as its feature maps a signature spline flow.
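Combining the augmentations with the truncated signature gives the sig-spline feature map. The sketch below (our illustration) builds H_i for a 2-dimensional path with a level-2 truncation, reusing gamma and truncated_signature_level2 from the earlier snippets.

```python
import numpy as np

def sig_spline_features(x, i):
    """Flattened level-<=2 signature of the augmented path gamma_i(x);
    reuses gamma and truncated_signature_level2 from the earlier sketches."""
    s1, s2 = truncated_signature_level2(gamma(x, i))
    return np.concatenate([s1, s2.ravel()])

def H(x, U_i, i):
    """Feature map H_i(x, U) = (<u_j, Sig(gamma_i(x))>)_{j=1..N}."""
    return U_i @ sig_spline_features(x, i)

# Toy usage: N = 8 knots, d = 2; the augmented path has 3 channels, so the
# level-<=2 signature has 3 + 9 = 12 entries.
rng = np.random.default_rng(4)
x = rng.uniform(size=(5, 2))
U_1 = rng.normal(size=(8, 12))
increments = np.exp(H(x, U_1, 1))
increments /= increments.sum()   # softmax yields the spline increments Delta_1
print(increments)
```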

As before, we define our objective to be to minimize an adapted version of the KL-divergence of our model density with respect to the true density.

\mathrm{aKL}(p,p_{\theta})\coloneqq\mathbb{E}_{p}\left[\mathbb{E}_{p}\left[\ln\dfrac{p(\mathbf{x}_{t+1}|\mathcal{F}_{t})}{p_{\theta}(\mathbf{x}_{t+1}|\mathcal{F}_{t})}\Big{|}\mathcal{F}_{t}\right]\right]=-\mathbb{E}_{p}\left[\mathbb{E}_{p}\left[\ln p_{\theta}(\mathbf{x}_{t+1}|\mathcal{F}_{t})\,\big{|}\,\mathcal{F}_{t}\right]\right]+K=-\sum_{i=1}^{d}\sum_{k=1}^{N}\mathbb{E}_{p}\left[\mathbb{E}_{p}\left[C_{i,k}(\mathbf{x}_{t+1})\ln\Delta_{k}(\mathbf{x},\theta_{i})\,\big{|}\,\mathcal{F}_{t}\right]\right]+K

For a finite set of realizations \{(\mathbf{x}_{1}^{(j)},\dots,\mathbf{x}_{t+1}^{(j)})\}_{j=1}^{M} we obtain the Monte Carlo approximation

(5) J(\theta)=-\dfrac{1}{M}\sum_{i=1}^{d}\sum_{k=1}^{N}\sum_{j=1}^{M}C_{i,k}(\mathbf{x}_{t+1}^{(j)})\ln\Delta_{k}(\mathbf{x}^{(j)},\theta_{i})

where we denote the parameters as \theta=(\mathbf{U}_{1},\dots,\mathbf{U}_{d})\in\Theta^{\times d}.

The following theorem states that the calibration problem, i.e. minimizing the cost function over the parameters \theta\in\Theta, is convex. The proof can be found in Appendix A.

Theorem 5 (Convexity).

The objective function J:\Theta\to\mathbb{R} is convex.

When working with a dataset of limited size, optimizing with respect to the cost function J may lead to overfitting of the density's parameters. The following corollary shows that regularizing the model's parameters using a convex penalty function maintains the convexity property of the calibration problem.

Corollary 0.

Let \lambda>0 and let \alpha:\Theta\to\mathbb{R} be a convex function of \theta. Then the regularized objective J_{\lambda}(\theta)\coloneqq J(\theta)+\lambda\alpha(\theta) is convex.

Note that the regularized objective includes the penalty functions \alpha(\theta)=\|\theta\|_{2}^{2} and \alpha(\theta)=\|\theta\|_{1}, so we may retain the convexity of our objective through these standard penalty functions. The regularized objective can be particularly helpful to avoid overfitting in the context of time series generation where we may only have one single realization of the time series.
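Putting the pieces together, the following sketch (our illustration, not the authors' code) calibrates a single sig-spline conditional density by gradient descent on the L2-regularized objective. The gradient uses the closed form (\mathbf{p}-\mathbf{c})\otimes\mathbf{y} derived in Appendix A; the features Y stand in for truncated signatures of the conditioning paths.

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=-1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=-1, keepdims=True)

def fit_sig_spline_conditional(Y, targets, N, lam=1e-3, lr=0.1, steps=1000):
    """Fit the N-bin model p = softmax(U y) for a single coordinate.

    Y       : (M, K) feature matrix (signature features of conditioning paths)
    targets : (M,) next-step values of the coordinate, in [0, 1]
    Returns U (N, K), an approximate minimizer of the convex objective
    J_lam(U) = -(1/M) sum_{j,k} C_{j,k} log p_{j,k} + lam * ||U||_2^2.
    """
    M, K = Y.shape
    bins = np.minimum((targets * N).astype(int), N - 1)
    C = np.eye(N)[bins]                              # bin indicators C_{j,k}
    U = np.zeros((N, K))
    for _ in range(steps):
        P = softmax_rows(Y @ U.T)                    # model bin probabilities
        grad = (P - C).T @ Y / M + 2 * lam * U       # (p - c) ⊗ y averaged, plus L2 term
        U = U - lr * grad
    return U

# Toy usage with random features; in the paper Y would hold truncated signatures.
rng = np.random.default_rng(3)
Y = rng.normal(size=(1000, 12))
targets = rng.uniform(size=1000)
U = fit_sig_spline_conditional(Y, targets, N=8)
print(softmax_rows(Y[:1] @ U.T))   # conditional bin probabilities for one path
```

Because the objective is convex, any convex solver (second-order or accelerated first-order methods alike) reaches the same global minimum, up to linearly dependent signature terms.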

We now present our main theoretical result, that the sig-spline construction is able to approximate the conditional transition density of any time series arbitrarily well, given a high enough signature truncation order, and sufficiently many knots in the spline. Again the proof can be found in Appendix A.

Theorem 7 (Universality of Sig-Splines).

Let \mathbf{x}\sim p be a [0,1]^{d}-valued Markov process with k-lagged memory. Let \varepsilon>0. Then there exists an order of truncation L, a number of bins N and a set of linear functionals \mathbf{U}\in\mathbb{R}^{N\times f(d,L)} such that for all paths \mathbf{x}\in S([0,1]^{d}) of length t

|p(\mathbf{x}_{t+1}|\mathcal{F}_{t})-p_{\mathbf{W}}(\mathbf{x}_{t+1}|\mathcal{F}_{t})|<\varepsilon

with respect to the L^{2} norm.

6. Numerical results

This section delves into the evaluation of sig- and neural splines, examining their performance through a series of controlled data and real-world data experiments in subsections 6.2 to 6.4. Subsection 6.2 focuses on assessing generative performance using a VAR process with known parameters. In contrast, subsections 6.3 and 6.4 evaluate the performance of these splines on realized volatilities of multiple equity indices and spot prices, respectively.

6.1. Experiment outline

6.1.1. Training

The neural spline flow serves as the benchmark model to surpass in all experiments. Each neural spline flow consists of three hidden layers, each with 64 hidden dimensions which are used to construct the linear spline CDF. On the other hand, the sig-spline models are calibrated for orders of truncation ranging from 1 to 4, allowing us to observe how the performance of the sig-spline model varies with higher orders. Throughout all experiments, the models are calibrated using 2-dimensional time series data. The number of parameters used in each sig-spline model, assuming a 2-dimensional path, is reported in Table 1.

Before training, the dataset is divided into a train and test set. To account for the randomness in the train and test split and its potential impact on model performance, the models are calibrated using 10 different seeds. Performance metrics are computed as averages across all 10 calibration runs.

Within each calibration, early stopping [Prechelt, 2002] is applied to each individual conditional density estimator to mitigate overfitting on the train set. Early stopping halts the fitting of a conditional density estimator when the test set error increases consecutively for a certain number of times. In these experiments, the patience (the number of consecutive test set errors allowed before the fitting process stops) is set to 32. All models are trained using full-batch gradient descent.

Order of truncation:  1 | 2 | 3 | 4
Parameters:  512 | 1664 | 5120 | 15488
Table 1. Number of parameters used in each sig-spline as a function of the order of truncation.

6.1.2. Evaluation

After training, all calibrated models are evaluated using a set of test metrics. This involves sampling a batch of time series of length 4 from the calibrated model, computing standard statistics from the generated dataset, and comparing them with the empirical statistics of the real dataset. This comparison is done by calculating the difference between the statistics and applying the l_{1} norm. Specifically, four statistics are compared for both the return process and the level process: the first two lags of the autocorrelation function (ACF), the skewness, the kurtosis, and the cross-correlation. Additionally, for the multi-asset spot return dataset, the ACF of the absolute returns is compared, as a lower discrepancy would indicate that the generative model is capable of capturing volatility clustering.
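The comparison of statistics can be sketched as follows (our illustration, with hypothetical generated and real arrays of shape (batch, length, dim)): each statistic is computed on both datasets and the discrepancy is the l_{1} norm of the difference.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def acf(x, lag):
    """Autocorrelation at a given lag, pooled over batch and dimensions."""
    a = x[:, :-lag].reshape(-1)
    b = x[:, lag:].reshape(-1)
    return np.corrcoef(a, b)[0, 1]

def stats(x):
    """Lag-1/2 ACF, skewness, kurtosis and cross-correlation of a batch."""
    flat = x.reshape(-1, x.shape[-1])
    return np.array([acf(x, 1), acf(x, 2),
                     skew(flat, axis=0).mean(), kurtosis(flat, axis=0).mean(),
                     np.corrcoef(flat, rowvar=False)[0, 1]])

def metric(real, generated):
    """l1 discrepancy between the statistics of real and generated batches."""
    return np.abs(stats(real) - stats(generated)).sum()

rng = np.random.default_rng(5)
real = rng.normal(size=(256, 4, 2))        # stand-in for the real dataset
generated = rng.normal(size=(256, 4, 2))   # stand-in for model samples
print(metric(real, generated))
```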

All numerical results can be found in Appendix B. Each table presents the performance metrics of the models, with the best-performing metric highlighted in bold font. It is important to note that the designation of the best-performing model is merely indicative, as in several cases, the presence of error bounds prevents the identification of a clear best-performing model.

6.2. Vector autoregression

Assume a d-dimensional \mathrm{VAR}(2) process (\mathbf{y}_{t})_{t} taking the form

\mathbf{y}_{t+1}=\mathbf{W}_{1}\mathbf{y}_{t}+\mathbf{W}_{2}\mathbf{y}_{t-1}+\mathbf{\Sigma}^{\frac{1}{2}}\mathbf{z}_{t+1}

where \mathbf{W}_{1},\mathbf{W}_{2},\mathbf{\Sigma}\in\mathbb{R}^{d\times d} are matrices and \mathbf{z}_{t}\sim\mathcal{N}(0,\mathbf{I}) is an adapted, normally distributed d-dimensional random variable. The process (\mathbf{y}_{t})_{t} is assumed to be latent in the sense that it is not observed. The observed process (\mathbf{x}_{t})_{t} is assumed to be D-dimensional and defined as

\mathbf{x}_{t}=F_{\theta}(\mathbf{y}_{t}),\ t\in\mathbb{N}

where F_{\theta}:\mathbb{R}^{d}\to\mathbb{R}^{D} is an unknown non-linear function.

In the controlled experiment the dimensions d=2 and D=8 are assumed. The decoder F_{\theta}:\mathbb{R}^{2}\to\mathbb{R}^{8} is represented by a neural network with two hidden layers of 64 hidden dimensions, initialized randomly using the He initialization scheme [He et al., 2015] with parametric ReLUs as activation functions. The VAR dynamic is governed by the autoregressive matrices

\mathbf{W}_{1}=\begin{pmatrix}0.1&0.0\\ 0.0&0.2\end{pmatrix}\quad\mathbf{W}_{2}=\begin{pmatrix}0.6&0.0\\ 0.0&0.3\end{pmatrix}

and the covariance matrix

\mathbf{\Sigma}=\begin{pmatrix}0.5&0.0\\ 0.0&0.5\end{pmatrix}.

A total of 4096 lags of the latent process are sampled to create a simulated empirical dataset. These lags serve as the basis for generating the observed empirical time series using a randomly sampled decoder (refer to Figure 1). Subsequently, an autoencoder is trained to compress the generated time series back to its original latent dimension. This compression step is employed to enhance scalability and demonstrate that autoencoders can produce a lower-dimensional representation of the time series.
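The simulated dataset can be reproduced, up to the exact decoder weights, along the following lines (an illustrative sketch; a leaky ReLU is used here in place of the parametric ReLU, and the decoder weights are random stand-ins).

```python
import numpy as np

rng = np.random.default_rng(6)
T, d, D = 4096, 2, 8
W1 = np.array([[0.1, 0.0], [0.0, 0.2]])
W2 = np.array([[0.6, 0.0], [0.0, 0.3]])
Sigma_half = np.sqrt(0.5) * np.eye(d)

# Simulate the latent VAR(2) process y_t.
y = np.zeros((T, d))
for t in range(2, T):
    y[t] = W1 @ y[t - 1] + W2 @ y[t - 2] + Sigma_half @ rng.standard_normal(d)

# Random two-layer decoder with He-initialized weights.
def he(fan_in, fan_out):
    return rng.normal(scale=np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

def lrelu(z, a=0.25):
    return np.where(z > 0, z, a * z)

A1, A2, A3 = he(d, 64), he(64, 64), he(64, D)
x = lrelu(lrelu(y @ A1) @ A2) @ A3   # observed 8-dimensional time series
print(x.shape)                        # (4096, 8)
```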

Figure 1. Sample of the original observed path (\mathbf{x}_{t})_{t} (top) and the compressed path (\mathbf{y}_{t})_{t} (bottom).

Utilizing the calibrated autoencoder, the encoder component is applied to obtain the compressed time series. This compressed representation is then used to calibrate the conditional density approximators, enabling the modeling of the conditional density of the original time series based on the compressed data obtained from the autoencoder.

Tables B.1.1 - B.1.4 report the performance metrics for the compressed level and return process, and the observed level and return process respectively. Upon examining all the tables, it becomes evident that the performance of sig-spline models consistently improves with higher truncation orders. This holds particularly true for the autocorrelation function, revealing that a truncation order below 3 fails to adequately capture the proper dependence of the \mathrm{VAR}(2) process.

When comparing all the tables, it becomes apparent that the neural spline model holds a slight advantage over the sig-spline model truncated at order 4. However, when considering the performance on the compressed process, the two models demonstrate very comparable performance.

6.3. Realized volatilities

The subsequent case study explores the effectiveness of the sig-spline model in approximating the dynamics of a real-world multivariate volatilities dataset derived from the MAN AHL Realized Library. To conduct the numerical evaluation, the med-realized volatilities of Standard & Poor’s 500 (SPX), Dow Jones Industrial Average (DJI), Nikkei 225 (N225), Euro STOXX 50 (STOXX50E), and Amsterdam Exchange (AEX) are extracted from the dataframe. These volatilities form a 5-dimensional time series spanning from January 1st, 2005, to December 31st, 2019. Figure 2 provides a visual representation of the corresponding historical volatilities.

An observation from Figure 2 reveals a strong correlation between both the levels and their returns (refer to Figure 3 for the corresponding cross-correlation matrices). Due to these high cross-correlations, an autoencoder is calibrated to compress the 5-dimensional time series into a 2-dimensional representation. This 2-dimensional time series, depicted in Figure 2, serves as the basis for calibrating the neural and sig-spline conditional density estimators.

Figure 2. Historical realized med-volatilities (top) of the considered stock indices and compressed realized volatilities (bottom).
Figure 3. Cross-correlation matrices of the volatility levels (left) and returns (right).

The performance metrics for the compressed 2-dimensional return and level process are presented in Table B.2.1 and Table B.2.2, while Table B.2.3 and B.2.4 display the performance metrics for the original 5-dimensional process. Analysis of Table B.2.1 and B.2.2 indicates that, in most metrics, the neural spline demonstrates superior performance compared to the sig spline for the compressed process. This suggests that the neural network exhibits better ability to capture complex dependence structures in more intricate real-world datasets. However, it is interesting to note that this performance advantage of the neural spline does not directly translate to an advantage in the observed process, as observed in Table B.2.3 and B.2.4. In fact, the signature spline truncated at order 4 frequently exhibited better performance compared to the neural spline flow.

It is noteworthy that Table B.2.3 and Table B.2.4 highlight the kurtosis metrics, which stand out prominently. This observation arises from the high kurtosis exhibited by the med-realized volatilities, indicating that both the neural and sig-spline models struggle to capture the heavy-tailed nature of the data.

6.4. Multi-asset spot returns

The following real-world multi-asset spot return dataset is sourced from the MAN AHL Realized Library. It focuses on the SPX and DJI stock indices, covering the period from January 1st, 2005, to December 31st, 2021 (refer to the top figure of Figure 4).

During the calibration process, it was noted that sig-splines struggled to capture the strong cross-correlations in the returns of SPX and DJI. To address this issue, the spot time series underwent preprocessing by applying Principal Component Analysis (PCA) to the index returns, resulting in whitened returns. The preprocessed time series is depicted in Figure 4. Subsequently, the models were calibrated using the transformed return series.
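The PCA whitening step can be performed directly from an eigendecomposition of the empirical covariance; the sketch below uses simulated, strongly correlated returns as a stand-in for the SPX/DJI data.

```python
import numpy as np

rng = np.random.default_rng(7)
# Simulated stand-in for the (strongly correlated) SPX/DJI return series.
z = rng.standard_normal((4000, 2))
returns = z @ np.linalg.cholesky(np.array([[1.0, 0.97], [0.97, 1.0]])).T * 0.01

# PCA whitening: rotate onto the principal components, rescale to unit variance.
mu = returns.mean(axis=0)
cov = np.cov(returns - mu, rowvar=False)
eigval, eigvec = np.linalg.eigh(cov)
whitened = (returns - mu) @ eigvec / np.sqrt(eigval)

print(np.cov(whitened, rowvar=False).round(3))   # approximately the identity
# The transform is invertible, so generated whitened returns can be mapped back:
# reconstructed = whitened * np.sqrt(eigval) @ eigvec.T + mu
```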

Table B.3.1 and B.3.2 present the performance metrics for both the preprocessed return series and the observed return process. Notably, the cross-correlation metric for all sig-spline models closely compares with the neural spline baseline, thanks to the application of PCA transform during the preprocessing of the spot return series. Both tables demonstrate that sig- and neural splines perform relatively well, with sig-splines showcasing superior performance. Moreover, it is observed that higher orders of truncation lead to improved performance in autocorrelation metrics, which detect serial independence and volatility clustering.

Figure 4. Normalized historical spot path of SPX (blue) and DJI (orange) (top plot) and the preprocessed path (bottom plot).

7. Conclusion

This paper introduced signature spline flows as a generative model for time series data. By employing the signature transform, sig-splines are constructed as an alternative to the neural networks used in neural spline flows. In section 5, we formally demonstrate the universal approximation capability of sig-splines, highlighting their ability to approximate any conditional density. Additionally, we establish the convexity of sig-spline calibration with respect to its parameters.

To assess their performance, we compared sig-splines with neural splines in section 6 using a simulated benchmark dataset and two real-world financial datasets. Our evaluation, based on standard test metrics, reveals that sig-splines perform comparably to neural spline flows.

The convexity and universality properties of sig-splines pique our interest in future research directions. We believe that pursuing these avenues could yield valuable insights and fruitful outcomes:

  • Sig-ICA (Signature Independent Component Analysis): Exploring the application of the signature-based ICA introduced in [Schell and Oberhauser, 2023] could enhance the scalability of the sig-spline generative model. A Sig-ICA preprocessing would allow calibrating a sig-spline model on each coordinate of the process separately, allowing for fewer parameters and more interpretability.

  • Regularisation techniques: This paper has not extensively addressed methods for improving the generalization of sig-splines. Investigating regularization techniques specifically tailored to sig-splines could prove beneficial in enhancing their performance and robustness, particularly in scenarios with limited training data or complex dependencies.

References

  • Rasul et al. [2020] Kashif Rasul, Abdul-Saboor Sheikh, Ingmar Schuster, Urs Bergmann, and Roland Vollgraf. Multivariate probabilistic time series forecasting via conditioned normalizing flows. arXiv preprint arXiv:2002.06103, 2020.
  • Arribas et al. [2020] Imanol Perez Arribas, Cristopher Salvi, and Lukasz Szpruch. Sig-sdes model for quantitative finance, 2020.
  • Buehler et al. [2020] Hans Buehler, Blanka Horvath, Terry Lyons, Imanol Perez Arribas, and Ben Wood. A data-driven market simulator for small data environments, 2020.
  • Buehler et al. [2022] Hans Buehler, Phillip Murray, Mikko S. Pakkanen, and Ben Wood. Deep hedging: Learning to remove the drift under trading frictions with minimal equivalent near-martingale measures, 2022.
  • Ni et al. [2020] Hao Ni, Lukasz Szpruch, Magnus Wiese, Shujian Liao, and Baoren Xiao. Conditional sig-wasserstein gans for time series generation, 2020.
  • Ni et al. [2021] Hao Ni, Lukasz Szpruch, Marc Sabate-Vidales, Baoren Xiao, Magnus Wiese, and Shujian Liao. Sig-wasserstein gans for time series generation. In Proceedings of the Second ACM International Conference on AI in Finance, ICAIF ’21, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450391481. doi: 10.1145/3490354.3494393. URL https://doi.org/10.1145/3490354.3494393.
  • Wiese et al. [2021] Magnus Wiese, Ben Wood, Alexandre Pachoud, Ralf Korn, Hans Buehler, Phillip Murray, and Lianjun Bai. Multi-asset spot and option market simulation, 2021.
  • Wiese et al. [2020] Magnus Wiese, Robert Knobloch, Ralf Korn, and Peter Kretschmer. Quant gans: Deep generation of financial time series. Quantitative Finance, 20(9):1419–1440, 2020.
  • Kobyzev et al. [2020] Ivan Kobyzev, Simon JD Prince, and Marcus A Brubaker. Normalizing flows: An introduction and review of current methods. IEEE transactions on pattern analysis and machine intelligence, 43(11):3964–3979, 2020.
  • Papamakarios et al. [2019] George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. arXiv preprint arXiv:1912.02762, 2019.
  • Durkan et al. [2019a] Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Neural spline flows. Advances in Neural Information Processing Systems, 32:7511–7522, 2019a.
  • Durkan et al. [2019b] Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Cubic-spline flows. arXiv preprint arXiv:1906.02145, 2019b.
  • Friz and Hairer [2020] Peter K Friz and Martin Hairer. A course on rough paths. Springer, 2020.
  • Lyons [2014] Terry Lyons. Rough paths, signatures and the modelling of functions on streams. arXiv preprint arXiv:1405.4537, 2014.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
  • Brock et al. [2019] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=B1xsqj09Fm.
  • Yoon et al. [2019] Jinsung Yoon, Daniel Jarrett, and Mihaela Van der Schaar. Time-series generative adversarial networks. Advances in Neural Information Processing Systems, 32, 2019.
  • Engel et al. [2019] Jesse Engel, Kumar Krishna Agrawal, Shuo Chen, Ishaan Gulrajani, Chris Donahue, and Adam Roberts. GANSynth: Adversarial neural audio synthesis. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=H1xQVn09FX.
  • Reed et al. [2016] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In International conference on machine learning, pages 1060–1069. PMLR, 2016.
  • Dinh et al. [2014] Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation, 2014. URL https://arxiv.org/abs/1410.8516.
  • Dinh et al. [2016] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.
  • Teshima et al. [2020] Takeshi Teshima, Isao Ishikawa, Koichi Tojo, Kenta Oono, Masahiro Ikeda, and Masashi Sugiyama. Coupling-based invertible neural networks are universal diffeomorphism approximators. Advances in Neural Information Processing Systems, 33:3362–3373, 2020.
  • Bogachev et al. [2005] Vladimir Igorevich Bogachev, Aleksandr Viktorovich Kolesnikov, and Kirill Vladimirovich Medvedev. Triangular transformations of measures. Sbornik: Mathematics, 196(3):309, 2005.
  • Papamakarios et al. [2017] George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. Advances in neural information processing systems, 30, 2017.
  • Wehenkel and Louppe [2019] Antoine Wehenkel and Gilles Louppe. Unconstrained monotonic neural networks. Advances in Neural Information Processing Systems, 32:1545–1555, 2019.
  • Gemici et al. [2016] Mevlana C Gemici, Danilo Rezende, and Shakir Mohamed. Normalizing flows on riemannian manifolds. arXiv preprint arXiv:1611.02304, 2016.
  • Brehmer and Cranmer [2020] Johann Brehmer and Kyle Cranmer. Flows for simultaneous manifold learning and density estimation, 2020.
  • Wiese and Murray [2022] Magnus Wiese and Phillip Murray. Risk-neutral market simulation. arXiv preprint arXiv:2202.13996, 2022.
  • Gierjatowicz et al. [2020] Patryk Gierjatowicz, Marc Sabate-Vidales, David Siska, Lukasz Szpruch, and Zan Zuric. Robust pricing and hedging via neural sdes. Available at SSRN 3646241, 2020.
  • Dyer et al. [2021] Joel Dyer, Patrick W Cannon, and Sebastian M Schmon. Deep signature statistics for likelihood-free time-series models. In ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models, 2021. URL https://openreview.net/forum?id=OOlxsoRPyFL.
  • Kidger et al. [2019] Patrick Kidger, Patric Bonnier, Imanol Perez Arribas, Cristopher Salvi, and Terry Lyons. Deep signature transforms. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/d2cdf047a6674cef251d56544a3cf029-Paper.pdf.
  • Chen [1957] Kuo-Tsai Chen. Integration of paths, geometric invariants and a generalized baker-hausdorff formula. Annals of Mathematics, pages 163–178, 1957.
  • Chen [2001] Kuo-Tsai Chen. Iterated integrals and exponential homomorphisms. Collected Papers of KT Chen, page 54, 2001.
  • Chevyrev and Kormilitzin [2016] Ilya Chevyrev and Andrey Kormilitzin. A primer on the signature method in machine learning. arXiv preprint arXiv:1603.03788, 2016.
  • Lyons et al. [2007] Terry J Lyons, Michael Caruana, and Thierry Lévy. Differential equations driven by rough paths. Springer, 2007.
  • Király and Oberhauser [2019] Franz J Király and Harald Oberhauser. Kernels for sequentially ordered data. Journal of Machine Learning Research, 20(31):1–45, 2019.
  • Cybenko [1989] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314, 1989.
  • Ruder [2016] Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
  • Rumelhart et al. [1986] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. nature, 323(6088):533–536, 1986.
  • Prechelt [2002] Lutz Prechelt. Early stopping-but when? In Neural Networks: Tricks of the trade, pages 55–69. Springer, 2002.
  • He et al. [2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
  • Schell and Oberhauser [2023] Alexander Schell and Harald Oberhauser. Nonlinear independent component analysis for discrete-time and continuous-time signals. The Annals of Statistics, 51(2):487–518, 2023.
  • Böhning [1992] Dankmar Böhning. Multinomial logistic regression algorithm. Annals of the institute of Statistical Mathematics, 44(1):197–200, 1992.
  • Vu [2016] Trung Vu. Multinomial logistic regression: Convexity and smoothness, 2016. URL https://trungvietvu.github.io/notes/2016/MLR.

Appendix A Proofs

A.1. Proof of Theorem 5

See Theorem 5.

Proof.
The proof follows ideas from [Böhning, 1992]. A short summary of the proof was given by Vu [Vu, 2016].

The sum of two convex functions remains convex. We therefore consider the cost function (5) without loss of generality for a single sample \mathbf{x}=(\mathbf{x}_{1},\dots,\mathbf{x}_{t+1}) and only a single conditional density i\in\{1,\dots,d\}; i.e. we consider the cost function of the i^{th} conditional density J^{i}:\bigtimes_{k=1}^{N}T^{(L)}((\mathbb{R}^{1+d})^{*})\to\mathbb{R} defined as

J^{i}(\mathbf{V}^{i})=-\sum_{k=1}^{N}\Big[{c}_{k}\Big(\langle\mathbf{v}_{k},\mathbf{y}\rangle-\ln\big(\sum_{j=1}^{N}\exp(\langle\mathbf{v}_{j},\mathbf{y}\rangle)\big)\Big)\Big]

where \mathbf{y}=\operatorname{Sig}^{(L)}(\gamma_{i}({\mathbf{x}})) is the signature of the augmented path and \mathbf{c}\in\{0,1\}^{N} is the vector of indicator functions

(6) \mathbf{c}=(\mathbf{1}_{[\alpha^{i}_{k-1},\alpha^{i}_{k}]}(x_{t+1,i}))_{k\in\{1,\dots,N\}}

where 0=\alpha_{0}^{i}<\dots<\alpha_{N}^{i}=1 is the predefined partition of the x-axis of the constructed CDF. We obtain the full loss function J as

J(\theta)=J(\mathbf{V}^{1},\dots,\mathbf{V}^{d})=\sum_{i=1}^{d}J^{i}(\mathbf{V}^{i}).

For legibility we drop the dependence on i and write H(\mathbf{V})\coloneqq J^{i}(\mathbf{V}^{i}) instead. To ease the notation we identify the linear functionals \mathbf{v}_{i},\ i\in\{1,\dots,N\}, and the signature \mathbf{y}\in T^{(L)}(\mathbb{R}^{1+d}) with vectors in the real-valued vector space \mathbb{R}^{K},\ K=f(L,1+d); thus the linear functionals can be thought of as weights hereafter.

To show that the cost function H:\bigtimes_{i=1}^{N}\mathbb{R}^{K}\to\mathbb{R} is convex we derive the first- and second-order gradients with respect to the model parameters \mathbf{V}=(\mathbf{v}_{1},\dots,\mathbf{v}_{N}). The first-order derivative with respect to a single set of weights \mathbf{v}_{i},\ i\in\{1,\dots,N\}, is

\dfrac{\partial H(\mathbf{V})}{\partial\mathbf{v}_{i}}=\dfrac{\exp(\langle\mathbf{v}_{i},\mathbf{y}\rangle)}{\sum_{j=1}^{N}\exp(\langle\mathbf{v}_{j},\mathbf{y}\rangle)}\mathbf{y}-{c}_{i}\mathbf{y}={p}_{i}\mathbf{y}-{c}_{i}\mathbf{y}

where {p}_{i} denotes the i^{th} component of \Phi\big((\langle\mathbf{v}_{j},\mathbf{y}\rangle)_{j\in\{1,\dots,N\}}\big). The gradient can be expressed in matrix-vector notation as

\nabla_{\mathbf{V}}H(\mathbf{V})=(\mathbf{p}-\mathbf{c})\otimes\mathbf{y}

where \otimes denotes the Kronecker product. Furthermore, the second-order gradient is given for any i,j\in\{1,\dots,N\} as

\dfrac{\partial H(\mathbf{V})}{\partial\mathbf{v}_{j}\partial\mathbf{v}_{i}^{T}}=\dfrac{\partial}{\partial\mathbf{v}_{j}}\left({p}_{i}\mathbf{y}-{c}_{i}\mathbf{y}\right)^{T}=\dfrac{\partial}{\partial\mathbf{v}_{j}}p_{i}\mathbf{y}^{T}=(\delta_{ij}{p}_{i}-{p}_{i}{p}_{j})\mathbf{y}\mathbf{y}^{T}

which in matrix-vector notation can be expressed as

(7) \nabla^{2}_{\mathbf{V}}H(\mathbf{V})=(\mathbf{D}(\mathbf{p})-\mathbf{p}\mathbf{p}^{T})\otimes\mathbf{y}\mathbf{y}^{T}

where \mathbf{D}(\mathbf{p}) denotes the diagonal matrix with the entries of \mathbf{p} on its diagonal.

Finally, we need to show that the Hessian of the cost function H with respect to \mathbf{V} is positive semidefinite. To demonstrate this, first recall that if the eigenvalues of the square matrices \mathbf{A}\in\mathbb{R}^{n\times n} and \mathbf{B}\in\mathbb{R}^{m\times m} are \alpha_{i},\ i\in\{1,\dots,n\}, and \beta_{j},\ j\in\{1,\dots,m\}, respectively, then the eigenvalues of \mathbf{A}\otimes\mathbf{B} are \alpha_{i}\beta_{j},\ i\in\{1,\dots,n\},\ j\in\{1,\dots,m\}. Thus, it suffices to show that the matrices (\mathbf{D}(\mathbf{p})-\mathbf{p}\mathbf{p}^{T})\in\mathbb{R}^{N\times N} and \mathbf{y}\mathbf{y}^{T}\in\mathbb{R}^{K\times K} are positive semidefinite. For the latter matrix this is straightforward since for any \mathbf{a}\in\mathbb{R}^{K} we have \mathbf{a}^{T}\mathbf{y}\mathbf{y}^{T}\mathbf{a}=(\mathbf{a}^{T}\mathbf{y})^{2}\geq 0. For the former we note that it is diagonally dominant with non-negative diagonal entries, since p_{i}-p_{i}^{2}=p_{i}\sum_{j\neq i}p_{j}=\sum_{j\neq i}|p_{i}p_{j}|, and conclude by the Gershgorin circle theorem that it is positive semidefinite. Thus,

(8) \lambda_{\min}(\nabla^{2}_{\mathbf{V}}H(\mathbf{V}))=\lambda_{\min}(\mathbf{D}(\mathbf{p})-\mathbf{p}\mathbf{p}^{T})\cdot\lambda_{\min}(\mathbf{y}\mathbf{y}^{T})\geq 0

which concludes the proof. ∎
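As a sanity check on the derivation above, the short Python sketch below compares the closed-form gradient (\mathbf{p}-\mathbf{c})\otimes\mathbf{y} with a finite-difference approximation and checks that the smallest eigenvalue of the Hessian (\mathbf{D}(\mathbf{p})-\mathbf{p}\mathbf{p}^{T})\otimes\mathbf{y}\mathbf{y}^{T} is nonnegative up to floating-point tolerance. A random vector stands in for the flattened truncated signature, and the sizes N, K and the chosen bin index are illustrative assumptions rather than values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 5, 12                 # number of bins and flattened signature dimension (illustrative)
y = rng.normal(size=K)       # stand-in for the flattened truncated signature Sig^(L)(eta^i(x))
c = np.zeros(N); c[2] = 1.0  # one-hot bin indicator c
V = rng.normal(size=(N, K))  # weight matrix with rows v_1, ..., v_N

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def H(V_flat):
    """Cross-entropy H(V) = -sum_i c_i log p_i with p = softmax(V y)."""
    p = softmax(V_flat.reshape(N, K) @ y)
    return -np.log(p[c.argmax()])

p = softmax(V @ y)

# closed-form gradient (p - c) (x) y and Hessian (D(p) - p p^T) (x) (y y^T)
grad = np.kron(p - c, y)
hess = np.kron(np.diag(p) - np.outer(p, p), np.outer(y, y))

# finite-difference check of the gradient
eps = 1e-6
num_grad = np.array([(H(V.ravel() + eps * e) - H(V.ravel() - eps * e)) / (2 * eps)
                     for e in np.eye(N * K)])

print("max gradient error:         ", np.abs(grad - num_grad).max())
print("smallest Hessian eigenvalue:", np.linalg.eigvalsh(hess).min())
```

The finite-difference gradient agrees with the closed-form expression up to numerical noise, and the smallest Hessian eigenvalue is zero up to rounding, consistent with the positive semidefiniteness established above.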

Remark 4.

Note that the convexity of the cost function J:\Theta\to\mathbb{R} does not hold if one makes the \alpha-partition itself a trainable parameter; the proof above relies on the partition being fixed.

A.2. Proof of Theorem 7


Proof.

We first note that by factoring the true and model densities into the product of the conditionals p(xi|𝐱:i1,t)p({x}_{i}|\mathbf{x}_{:i-1},\mathcal{F}_{t}), we can write (dropping the explicit reference to the filtration to ease notation)

|p(𝐱t+1|t)p𝐖(𝐱t+1|t)|=|i=1dp(xi|𝐱:i1)i=1dp𝐖(xi|𝐱:i1)|\left|p(\mathbf{x}_{t+1}|\mathcal{F}_{t})-p_{\mathbf{W}}(\mathbf{x}_{t+1}|\mathcal{F}_{t})\right|=\left|\prod_{i=1}^{d}p({x}_{i}|\mathbf{x}_{:i-1})-\prod_{i=1}^{d}p_{\mathbf{W}}({x}_{i}|\mathbf{x}_{:i-1})\right|

To use this factorisation, we make use of the following lemma.

Lemma 0.

Let a1,,ada_{1},\ldots,a_{d} and b1,,bdb_{1},\ldots,b_{d} be two bounded sequences of real numbers. Then there exists MM\in\mathbb{R} such that

|i=1daii=1dbi|Mi=1d|aibi|\left|\prod_{i=1}^{d}a_{i}-\prod_{i=1}^{d}b_{i}\right|\leq M\sum_{i=1}^{d}\left|a_{i}-b_{i}\right|
Proof.

Write \prod_{i=1}^{d}a_{i}-\prod_{i=1}^{d}b_{i}=\sum_{i=1}^{d}a_{1}\cdots a_{i-1}(a_{i}-b_{i})b_{i+1}\cdots b_{d}. Taking absolute values, applying the triangle inequality, and setting M=\max_{i=1,\ldots,d}|a_{1}\cdots a_{i-1}b_{i+1}\cdots b_{d}|, we obtain the result. ∎
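The bound in the lemma above is elementary; the following minimal check in Python, using arbitrary bounded nonnegative sequences whose values and length are purely illustrative, verifies the telescoping estimate numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.uniform(0.0, 2.0, size=6)  # two bounded sequences (illustrative values)
b = rng.uniform(0.0, 2.0, size=6)

lhs = abs(np.prod(a) - np.prod(b))
# M = max_i |a_1 ... a_{i-1} b_{i+1} ... b_d|, exactly as chosen in the proof
M = max(abs(np.prod(a[:i]) * np.prod(b[i + 1:])) for i in range(len(a)))
rhs = M * np.abs(a - b).sum()

assert lhs <= rhs + 1e-12
print(f"{lhs:.6f} <= {rhs:.6f}")
```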

Hence, there exists MM\in\mathbb{R} such that we can write

|i=1dp(xi|𝐱:i1)i=1dp𝐖(xi|𝐱:i1)|Mi=1d|p(xi|𝐱:i1)p𝐖(xi|𝐱:i1)|\left|\prod_{i=1}^{d}p({x}_{i}|\mathbf{x}_{:i-1})-\prod_{i=1}^{d}p_{\mathbf{W}}({x}_{i}|\mathbf{x}_{:i-1})\right|\leq M\sum_{i=1}^{d}\left|p({x}_{i}|\mathbf{x}_{:i-1})-p_{\mathbf{W}}({x}_{i}|\mathbf{x}_{:i-1})\right|

Thus, to prove the claim it suffices to establish that each conditional can be approximated arbitrarily well, so without loss of generality we can focus on the univariate case d=1. For any N\in\mathbb{N}, define the piecewise constant approximation of the true conditional density as

p^(xt+1|t)j=1Np^jcj\hat{p}(x_{t+1}|\mathcal{F}_{t})\coloneqq\prod_{j=1}^{N}\hat{p}_{j}^{c_{j}}

with

p^jNj1NjNp(xt+1|t)dxt+1\hat{p}_{j}\coloneqq N\int_{\frac{j-1}{N}}^{\frac{j}{N}}p(x_{t+1}|\mathcal{F}_{t})\ \mathrm{d}x_{t+1}

and cjc_{j} defined as above. Such piecewise constant functions are known to be dense in L2[0,1]L^{2}[0,1], hence given any ε>0\varepsilon>0 we can find NN such that

|p(xt+1|t)p^(xt+1|t)|<ε/2|p(x_{t+1}|\mathcal{F}_{t})-\hat{p}(x_{t+1}|\mathcal{F}_{t})|<\varepsilon/2

in the L^{2} norm. For such N, consider the log probabilities q_{j}=\ln\hat{p}_{j}. Due to the universality of signatures, for each j=1,\ldots,N we can find a truncation order L_{j}\coloneqq L_{j}(N) and a weight vector \mathbf{w}_{j}\in\mathbb{R}^{f(d,L_{j})} such that |q_{j}-\mathbf{w}_{j}^{T}\mathbf{y}|<\varepsilon/(2N). Define L=\max\{L_{j}:j=1,\ldots,N\} and K=f(d,L), and take the weight matrix \mathbf{W}\in\bigtimes_{i=1}^{N}T^{(L)}(\mathbb{R}^{1+d}) to have rows \mathbf{w}_{j}, padded with zeros whenever L_{j}<L. Thus, writing \mathbf{q}=(q_{1},\ldots,q_{N}) we have

|𝐪𝐖𝐲|j=1N|qj𝐰jT𝐲|ε/2|\mathbf{q}-\mathbf{W}\mathbf{y}|\leq\sum_{j=1}^{N}|q_{j}-\mathbf{w}_{j}^{T}\mathbf{y}|\leq\varepsilon/2

Defining the vectors \hat{\mathbf{p}}=\Phi(\mathbf{q}) and, analogously, \mathbf{p}_{\mathbf{W}}=\Phi(\mathbf{W}\mathbf{y}), the Lipschitz property of the softmax function (with Lipschitz constant at most 1 with respect to the Euclidean norm) yields

|𝐩^𝐩𝐖||𝐪𝐖𝐲|ε/2|\hat{\mathbf{p}}-\mathbf{p}_{\mathbf{W}}|\leq|\mathbf{q}-\mathbf{W}\mathbf{y}|\leq\varepsilon/2

Hence, with an application of the triangle inequality we obtain our result

|p(x_{t+1}|\mathcal{F}_{t})-p_{\mathbf{W}}(x_{t+1}|\mathcal{F}_{t})|\leq\varepsilon\ . ∎
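To make the first half of the construction concrete, the sketch below replaces an assumed conditional density, here a Beta(2,2) density on [0,1] chosen purely for illustration, by the piecewise constant density \hat{p} with bin values \hat{p}_{j}=N\int_{(j-1)/N}^{j/N}p(x)\,\mathrm{d}x and reports how the L^{2} error decreases as the number of bins N grows. The density, grid resolution and bin counts are assumptions made for this demonstration, not part of the proof.

```python
import numpy as np

# Illustrative stand-in for the true conditional density p(x_{t+1} | F_t):
# p(x) = 6 x (1 - x) (Beta(2,2)) with CDF F(x) = 3 x^2 - 2 x^3, so bin integrals are exact.
p = lambda x: 6.0 * x * (1.0 - x)
F = lambda x: 3.0 * x ** 2 - 2.0 * x ** 3

xs = np.linspace(0.0, 1.0, 100_001)  # fine grid used to approximate the L2 error

for N in (4, 16, 64, 256):
    edges = np.linspace(0.0, 1.0, N + 1)
    p_hat_vals = N * (F(edges[1:]) - F(edges[:-1]))             # hat{p}_j = N * integral of p over bin j
    bins = np.clip(np.searchsorted(edges, xs, side="right") - 1, 0, N - 1)
    p_hat = p_hat_vals[bins]                                    # piecewise constant approximation
    l2_err = np.sqrt(np.mean((p(xs) - p_hat) ** 2))             # L2[0,1] error via grid average
    print(f"N = {N:4d}   L2 error = {l2_err:.4f}")
```

For this smooth density the error decays roughly like 1/N; the second half of the argument then approximates the log bin probabilities by linear functionals of the signature.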

Appendix B Numerical results

B.1. VAR model

Model / Test metrics |ρhxρgx|1|\rho^{x}_{h}-\rho^{x}_{g}|_{1} |κhxκgx|1|\kappa^{x}_{h}-\kappa^{x}_{g}|_{1} |shxsgx|1|s_{h}^{x}-s_{g}^{x}|_{1} |ΣhxΣgx|1|\Sigma_{h}^{x}-\Sigma_{g}^{x}|_{1}
Sig-Spline (Order 1) 0.0147±0.00630.0147\pm 0.0063 0.3478±0.1675\textbf{0.3478}\pm 0.1675 0.3072±0.04250.3072\pm 0.0425 0.0108±0.0065{0.0108}\pm 0.0065
Sig-Spline (Order 2) 0.0133±0.0059{0.0133}\pm 0.0059 2.8459±1.49222.8459\pm 1.4922 0.3402±0.11230.3402\pm 0.1123 0.0145±0.0098{0.0145}\pm 0.0098
Sig-Spline (Order 3) 0.0108±0.0037{0.0108}\pm 0.0037 0.6097±0.26110.6097\pm 0.2611 0.1093±0.0255{0.1093}\pm 0.0255 0.0162±0.00950.0162\pm 0.0095
Sig-Spline (Order 4) 0.0122±0.0064{0.0122}\pm 0.0064 0.5694±0.19480.5694\pm 0.1948 0.1075±0.0406{0.1075}\pm 0.0406 0.0168±0.01190.0168\pm 0.0119
Neural spline flow 0.0102±0.0038\textbf{0.0102}\pm 0.0038 0.4404±0.3541{0.4404}\pm 0.3541 0.0979±0.0357\textbf{0.0979}\pm 0.0357 0.0107±0.0053\textbf{0.0107}\pm 0.0053
Table B.1.1. Performance metrics of the compressed level process.
Model / Test metrics |ρhrρgr|1|\rho^{r}_{h}-\rho^{r}_{g}|_{1} |κhrκgr|1|\kappa^{r}_{h}-\kappa^{r}_{g}|_{1} |shrsgr|1|s_{h}^{r}-s_{g}^{r}|_{1} |ΣhrΣgr|1|\Sigma_{h}^{r}-\Sigma_{g}^{r}|_{1}
Sig-Spline (Order 1) 0.0579±0.00340.0579\pm 0.0034 0.3888±0.19130.3888\pm 0.1913 0.0867±0.05570.0867\pm 0.0557 0.0216±0.01240.0216\pm 0.0124
Sig-Spline (Order 2) 0.0122±0.00500.0122\pm 0.0050 0.3412±0.12220.3412\pm 0.1222 0.0708±0.03860.0708\pm 0.0386 0.0154±0.00960.0154\pm 0.0096
Sig-Spline (Order 3) 0.0084±0.0028{0.0084}\pm 0.0028 0.2044±0.04810.2044\pm 0.0481 0.0603±0.02770.0603\pm 0.0277 0.0139±0.01710.0139\pm 0.0171
Sig-Spline (Order 4) 0.0098±0.0041{0.0098}\pm 0.0041 0.2107±0.07540.2107\pm 0.0754 0.0569±0.0310\textbf{0.0569}\pm 0.0310 0.0113±0.0109\textbf{0.0113}\pm 0.0109
Neural spline flow 0.0076±0.0036\textbf{0.0076}\pm 0.0036 0.1972±0.0691\textbf{0.1972}\pm 0.0691 0.0585±0.02750.0585\pm 0.0275 0.0134±0.01050.0134\pm 0.0105
Table B.1.2. Performance metrics of the compressed return process.
Model / Test metrics |ρhxρgx|1|\rho^{x}_{h}-\rho^{x}_{g}|_{1} |κhxκgx|1|\kappa^{x}_{h}-\kappa^{x}_{g}|_{1} |shxsgx|1|s_{h}^{x}-s_{g}^{x}|_{1} |ΣhxΣgx|1|\Sigma_{h}^{x}-\Sigma_{g}^{x}|_{1}
Sig-Spline (Order 1) 0.0398±0.00520.0398\pm 0.0052 4.2277±1.02564.2277\pm 1.0256 0.9135±0.11870.9135\pm 0.1187 0.1976±0.01220.1976\pm 0.0122
Sig-Spline (Order 2) 0.0479±0.01050.0479\pm 0.0105 78.8745±58.482678.8745\pm 58.4826 3.5408±1.81493.5408\pm 1.8149 0.1939±0.02780.1939\pm 0.0278
Sig-Spline (Order 3) 0.0297±0.00510.0297\pm 0.0051 5.5644±1.37695.5644\pm 1.3769 0.7894±0.12730.7894\pm 0.1273 0.1197±0.01050.1197\pm 0.0105
Sig-Spline (Order 4) 0.0322±0.00670.0322\pm 0.0067 7.6248±1.88947.6248\pm 1.8894 0.8857±0.15040.8857\pm 0.1504 0.1118±0.00830.1118\pm 0.0083
Neural spline flow 0.0157±0.0041\textbf{0.0157}\pm 0.0041 2.9326±1.2577\textbf{2.9326}\pm 1.2577 0.4276±0.1100\textbf{0.4276}\pm 0.1100 0.0730±0.0084\textbf{0.0730}\pm 0.0084
Table B.1.3. Performance metrics of the observed level process.
Model / Test metrics |ρhrρgr|1|\rho^{r}_{h}-\rho^{r}_{g}|_{1} |κhrκgr|1|\kappa^{r}_{h}-\kappa^{r}_{g}|_{1} |shrsgr|1|s_{h}^{r}-s_{g}^{r}|_{1} |ΣhrΣgr|1|\Sigma_{h}^{r}-\Sigma_{g}^{r}|_{1}
Sig-Spline (Order 1) 0.0577±0.00580.0577\pm 0.0058 3.6188±0.67043.6188\pm 0.6704 0.4141±0.11340.4141\pm 0.1134 0.1707±0.00600.1707\pm 0.0060
Sig-Spline (Order 2) 0.0461±0.01730.0461\pm 0.0173 8.9617±3.31858.9617\pm 3.3185 0.9355±0.23660.9355\pm 0.2366 0.1232±0.00630.1232\pm 0.0063
Sig-Spline (Order 3) 0.0196±0.00360.0196\pm 0.0036 4.5679±1.29194.5679\pm 1.2919 0.5792±0.14360.5792\pm 0.1436 0.0928±0.00750.0928\pm 0.0075
Sig-Spline (Order 4) 0.0231±0.00470.0231\pm 0.0047 5.4555±1.19965.4555\pm 1.1996 0.5990±0.08880.5990\pm 0.0888 0.0845±0.00670.0845\pm 0.0067
Neural spline flow 0.0127±0.0052\textbf{0.0127}\pm 0.0052 2.6636±0.7675\textbf{2.6636}\pm 0.7675 0.2558±0.0933\textbf{0.2558}\pm 0.0933 0.0702±0.0060\textbf{0.0702}\pm 0.0060
Table B.1.4. Performance metrics of the observed return process.

B.2. Realized volatilities

Model / Test metrics |ρhxρgx|1|\rho^{x}_{h}-\rho^{x}_{g}|_{1} |κhxκgx|1|\kappa^{x}_{h}-\kappa^{x}_{g}|_{1} |shxsgx|1|s_{h}^{x}-s_{g}^{x}|_{1} |ΣhxΣgx|1|\Sigma_{h}^{x}-\Sigma_{g}^{x}|_{1}
Sig-Spline (Order 1) 0.0074±0.0028{0.0074}\pm 0.0028 0.6196±0.2898{0.6196}\pm 0.2898 0.1844±0.04790.1844\pm 0.0479 0.0121±0.0071{0.0121}\pm 0.0071
Sig-Spline (Order 2) 0.0093±0.0018{0.0093}\pm 0.0018 1.4172±0.30191.4172\pm 0.3019 0.2043±0.06860.2043\pm 0.0686 0.0232±0.00930.0232\pm 0.0093
Sig-Spline (Order 3) 0.0141±0.00380.0141\pm 0.0038 1.3337±0.15641.3337\pm 0.1564 0.3197±0.07410.3197\pm 0.0741 0.0546±0.00870.0546\pm 0.0087
Sig-Spline (Order 4) 0.0143±0.00290.0143\pm 0.0029 1.2699±0.26921.2699\pm 0.2692 0.3211±0.06350.3211\pm 0.0635 0.0507±0.01160.0507\pm 0.0116
Neural spline flow 0.0064±0.0031\textbf{0.0064}\pm 0.0031 0.5631±0.0912\textbf{0.5631}\pm 0.0912 0.0641±0.0336\textbf{0.0641}\pm 0.0336 0.0088±0.0078\textbf{0.0088}\pm 0.0078
Table B.2.1. Performance metrics of the compressed level process.
Model / Test metrics |ρhrρgr|1|\rho^{r}_{h}-\rho^{r}_{g}|_{1} |κhrκgr|1|\kappa^{r}_{h}-\kappa^{r}_{g}|_{1} |shrsgr|1|s_{h}^{r}-s_{g}^{r}|_{1} |ΣhrΣgr|1|\Sigma_{h}^{r}-\Sigma_{g}^{r}|_{1}
Sig-Spline (Order 1) 0.1055±0.00380.1055\pm 0.0038 1.0656±0.4424{1.0656}\pm 0.4424 0.1556±0.1015{0.1556}\pm 0.1015 0.0424±0.00800.0424\pm 0.0080
Sig-Spline (Order 2) 0.0375±0.00720.0375\pm 0.0072 1.9738±0.33381.9738\pm 0.3338 0.1544±0.0628{0.1544}\pm 0.0628 0.0170±0.00780.0170\pm 0.0078
Sig-Spline (Order 3) 0.0424±0.00480.0424\pm 0.0048 1.6472±0.2757{1.6472}\pm 0.2757 0.1242±0.0324\textbf{0.1242}\pm 0.0324 0.0093±0.0081{0.0093}\pm 0.0081
Sig-Spline (Order 4) 0.0357±0.00720.0357\pm 0.0072 1.4642±0.3115{1.4642}\pm 0.3115 0.2270±0.04710.2270\pm 0.0471 0.0076±0.0073\textbf{0.0076}\pm 0.0073
Neural spline flow 0.0177±0.0044\textbf{0.0177}\pm 0.0044 1.0643±0.6016\textbf{1.0643}\pm 0.6016 0.1548±0.0862{0.1548}\pm 0.0862 0.0108±0.0075{0.0108}\pm 0.0075
Table B.2.2. Performance metrics of the compressed return process.
Model / Test metrics |ρhxρgx|1|\rho^{x}_{h}-\rho^{x}_{g}|_{1} |κhxκgx|1|\kappa^{x}_{h}-\kappa^{x}_{g}|_{1} |shxsgx|1|s_{h}^{x}-s_{g}^{x}|_{1} |ΣhxΣgx|1|\Sigma_{h}^{x}-\Sigma_{g}^{x}|_{1}
Sig-Spline (Order 1) 0.1172±0.0552{0.1172}\pm 0.0552 8343.9745±3288.41398343.9745\pm 3288.4139 74.3962±18.176274.3962\pm 18.1762 0.1318±0.03280.1318\pm 0.0328
Sig-Spline (Order 2) 0.0972±0.0481\textbf{0.0972}\pm 0.0481 4585.1863±3288.87834585.1863\pm 3288.8783 48.2547±19.932148.2547\pm 19.9321 0.1261±0.04090.1261\pm 0.0409
Sig-Spline (Order 3) 0.1904±0.05730.1904\pm 0.0573 2949.4543±1667.5847{2949.4543}\pm 1667.5847 35.1558±11.1678{35.1558}\pm 11.1678 0.1099±0.0146\textbf{0.1099}\pm 0.0146
Sig-Spline (Order 4) 0.1775±0.05360.1775\pm 0.0536 2479.8028±1398.8296\textbf{2479.8028}\pm 1398.8296 31.7946±10.3507\textbf{31.7946}\pm 10.3507 0.1159±0.0198{0.1159}\pm 0.0198
Neural spline flow 0.1147±0.0895{0.1147}\pm 0.0895 5403.4121±4172.15825403.4121\pm 4172.1582 52.9103±24.013852.9103\pm 24.0138 0.1177±0.0279{0.1177}\pm 0.0279
Table B.2.3. Performance metrics of the observed level process.
Model / Test metrics |ρhrρgr|1|\rho^{r}_{h}-\rho^{r}_{g}|_{1} |κhrκgr|1|\kappa^{r}_{h}-\kappa^{r}_{g}|_{1} |shrsgr|1|s_{h}^{r}-s_{g}^{r}|_{1} |ΣhrΣgr|1|\Sigma_{h}^{r}-\Sigma_{g}^{r}|_{1}
Sig-Spline (Order 1) 0.1258±0.06810.1258\pm 0.0681 8598.7035±3035.61888598.7035\pm 3035.6188 69.0059±29.296169.0059\pm 29.2961 0.2549±0.07220.2549\pm 0.0722
Sig-Spline (Order 2) 0.1353±0.06980.1353\pm 0.0698 4300.6591±2254.01254300.6591\pm 2254.0125 48.5637±14.584048.5637\pm 14.5840 0.2156±0.0416\textbf{0.2156}\pm 0.0416
Sig-Spline (Order 3) 0.0667±0.0322{0.0667}\pm 0.0322 1912.7057±1028.8896{1912.7057}\pm 1028.8896 21.6471±11.3284{21.6471}\pm 11.3284 0.2362±0.02800.2362\pm 0.0280
Sig-Spline (Order 4) 0.0622±0.0349\textbf{0.0622}\pm 0.0349 1673.3185±762.6186\textbf{1673.3185}\pm 762.6186 20.4442±7.3065\textbf{20.4442}\pm 7.3065 0.2296±0.04090.2296\pm 0.0409
Neural spline flow 0.1084±0.05180.1084\pm 0.0518 6216.6590±3531.59666216.6590\pm 3531.5966 61.3187±25.709361.3187\pm 25.7093 0.2419±0.06780.2419\pm 0.0678
Table B.2.4. Performance metrics of the observed return process.

B.3. Multi-asset spot returns

Model / Test metrics |ρhrρgr|1|\rho^{r}_{h}-\rho^{r}_{g}|_{1} |κhrκgr|1|\kappa^{r}_{h}-\kappa^{r}_{g}|_{1} |shrsgr|1|s_{h}^{r}-s_{g}^{r}|_{1} |ΣhrΣgr|1|\Sigma_{h}^{r}-\Sigma_{g}^{r}|_{1} |ρh|r|ρg|r||1|\rho^{|r|}_{h}-\rho^{|r|}_{g}|_{1}
Sig-Spline (Order 1) 0.0140±0.00260.0140\pm 0.0026 0.7377±0.46590.7377\pm 0.4659 0.1238±0.05570.1238\pm 0.0557 0.0043±0.00290.0043\pm 0.0029 0.0140±0.00260.0140\pm 0.0026
Sig-Spline (Order 2) 0.0115±0.00390.0115\pm 0.0039 0.7787±0.51960.7787\pm 0.5196 0.1056±0.03910.1056\pm 0.0391 0.0036±0.00230.0036\pm 0.0023 0.0115±0.00390.0115\pm 0.0039
Sig-Spline (Order 3) 0.0103±0.00640.0103\pm 0.0064 0.9353±0.43260.9353\pm 0.4326 0.0995±0.0457\textbf{0.0995}\pm 0.0457 0.0050±0.00390.0050\pm 0.0039 0.0103±0.00640.0103\pm 0.0064
Sig-Spline (Order 4) 0.0085±0.0058\textbf{0.0085}\pm 0.0058 0.8250±0.36480.8250\pm 0.3648 0.0998±0.05770.0998\pm 0.0577 0.0041±0.00350.0041\pm 0.0035 0.0085±0.0058\textbf{0.0085}\pm 0.0058
Neural spline flow 0.0145±0.00170.0145\pm 0.0017 0.7160±0.3343\textbf{0.7160}\pm 0.3343 0.1176±0.06210.1176\pm 0.0621 0.0017±0.0010\textbf{0.0017}\pm 0.0010 0.0145±0.00170.0145\pm 0.0017
Table B.3.1. Performance metrics of the preprocessed return process.
Model / Test metrics |ρhrρgr|1|\rho^{r}_{h}-\rho^{r}_{g}|_{1} |κhrκgr|1|\kappa^{r}_{h}-\kappa^{r}_{g}|_{1} |shrsgr|1|s_{h}^{r}-s_{g}^{r}|_{1} |ΣhrΣgr|1|\Sigma_{h}^{r}-\Sigma_{g}^{r}|_{1} |ρh|r|ρg|r||1|\rho^{|r|}_{h}-\rho^{|r|}_{g}|_{1}
Sig-Spline (Order 1) 0.0307±0.00370.0307\pm 0.0037 14.8671±0.907514.8671\pm 0.9075 0.8110±0.10670.8110\pm 0.1067 0.0057±0.00040.0057\pm 0.0004 0.0307±0.00370.0307\pm 0.0037
Sig-Spline (Order 2) 0.0259±0.00780.0259\pm 0.0078 15.3513±0.723015.3513\pm 0.7230 0.8226±0.12300.8226\pm 0.1230 0.0056±0.00040.0056\pm 0.0004 0.0259±0.00780.0259\pm 0.0078
Sig-Spline (Order 3) 0.0234±0.00740.0234\pm 0.0074 15.2616±0.812615.2616\pm 0.8126 0.7708±0.11580.7708\pm 0.1158 0.0055±0.0004\textbf{0.0055}\pm 0.0004 0.0234±0.00740.0234\pm 0.0074
Sig-Spline (Order 4) 0.0211±0.0087\textbf{0.0211}\pm 0.0087 15.4545±0.632615.4545\pm 0.6326 0.7699±0.09410.7699\pm 0.0941 0.0056±0.00030.0056\pm 0.0003 0.0211±0.0087\textbf{0.0211}\pm 0.0087
Neural spline flow 0.0345±0.00210.0345\pm 0.0021 13.6939±1.3599\textbf{13.6939}\pm 1.3599 0.6839±0.1220\textbf{0.6839}\pm 0.1220 0.0057±0.00040.0057\pm 0.0004 0.0345±0.00210.0345\pm 0.0021
Table B.3.2. Performance metrics of the observed return process.
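For context on how tables of this kind can be produced, the following sketch shows one plausible way to compute the reported discrepancies, under the assumption that \rho, \kappa, s and \Sigma denote autocorrelations, kurtosis, skewness and cross-correlations of the historical (h) and generated (g) samples and that |\cdot|_{1} is the entrywise l^{1} distance. The metric definitions, function names and toy data below are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np
from scipy.stats import kurtosis, skew

def l1_metrics(x_hist, x_gen, lags=(1, 2, 3)):
    """Hypothetical l1 discrepancies between statistics of historical and generated
    samples of shape (T, d): autocorrelations (rho), kurtosis (kappa), skewness (s)
    and cross-correlations (Sigma)."""
    def acf(x, lag):
        return np.array([np.corrcoef(x[:-lag, i], x[lag:, i])[0, 1] for i in range(x.shape[1])])
    rho   = sum(np.abs(acf(x_hist, lag) - acf(x_gen, lag)).sum() for lag in lags)
    kappa = np.abs(kurtosis(x_hist, axis=0) - kurtosis(x_gen, axis=0)).sum()
    s     = np.abs(skew(x_hist, axis=0) - skew(x_gen, axis=0)).sum()
    Sigma = np.abs(np.corrcoef(x_hist.T) - np.corrcoef(x_gen.T)).sum()
    return {"rho": rho, "kappa": kappa, "s": s, "Sigma": Sigma}

# toy usage with random data standing in for the historical and model-generated paths
rng = np.random.default_rng(2)
print(l1_metrics(rng.normal(size=(500, 3)), rng.normal(size=(500, 3))))
```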