
Recurrent Interpolants for Probabilistic Time Series Prediction

Yu Chen∗ (Morgan Stanley), Marin Biloš∗ (Morgan Stanley), Sarthak Mittal†,∗ (Mila, Université de Montréal), Wei Deng (Morgan Stanley), Kashif Rasul (Morgan Stanley), Anderson Schneider (Morgan Stanley)
Abstract

Sequential models such as recurrent neural networks and transformers have become standard for probabilistic multivariate time series forecasting across various domains. Despite their strengths, they struggle to capture high-dimensional distributions and cross-feature dependencies. Recent work explores generative approaches using diffusion or flow-based models, extending to time series imputation and forecasting. However, scalability remains a challenge. This work proposes a novel method that combines the efficiency of recurrent neural networks with the probabilistic modeling of diffusion models, based on stochastic interpolants and conditional generation with control features, and demonstrates its effectiveness on synthetic and real-world forecasting benchmarks.

Preprint. ∗Equal contribution. †Work done as part of an internship at Morgan Stanley.

1 Introduction

Autoregressive models [Box et al., 2015], such as recurrent neural networks [Graves, 2013, Sutskever et al., 2014, Hochreiter and Schmidhuber, 1997] or transformer models [Vaswani et al., 2017], have been the go-to methods for neural time series forecasting. They are widely applied in finance, biostatistics, medicine, and geophysics, effectively showcasing their ability to capture short-term and long-term sequential dependencies [Morrill et al., 2021]. These methods can also provide an assessment of prediction uncertainty through probabilistic forecasting by incorporating a specific parametric probabilistic model into the output layer of the neural network. For instance, a predictor can model a Gaussian distribution by predicting both its mean and covariance. However, the probabilistic output layer is confined to a simple family of distributions because the density needs to be parameterized by the neural network and the loss must be differentiable with respect to the network parameters.

To better capture sophisticated distributions in time series modeling and learn both the temporal and cross-feature dependencies, a common strategy involves exploring the generative modeling of time series using efficient distribution transportation plans, especially via diffusion or flow-based models. For example, Li et al. [2020] propose latent neural SDEs as latent states for modeling time series in a stochastic manner, while Spantini et al. [2022] summarize non-linear extensions of state space models using both deterministic and stochastic transport plans. Tashiro et al. [2021], Biloš et al. [2023], Chen et al. [2023a], Miguel et al. [2022], Li et al. [2022] study the application of diffusion models to probabilistic time series imputation and forecasting. The generative model is trained to learn the joint density of the prediction window \mathbf{X}_{\mathrm{pred.}}\in\mathbb{R}^{D\times T_{\mathrm{prediction}}} with D>1 variates given the context \mathbf{X}_{\mathrm{cont.}}\in\mathbb{R}^{D\times T_{\mathrm{context}}}, where T_{\mathrm{context}} is the size of the context window and T_{\mathrm{prediction}} is the size of the subsequent prediction window. During inference, the model performs conditional generation given only the context, similar to the inpainting task in computer vision [Song et al., 2021]. Compared to a recurrent model, whose size is proportional only to the number of features and not to the length of the time window, such generative predictors may suffer from scalability issues because their size depends on both the feature dimension and the window size. A more computationally friendly framework is needed for large-scale generative time series prediction.

Generative modeling excels at capturing complicated high-dimensional distributions, but most models require learning a mapping from a noise distribution to the data distribution. If the generative procedure instead starts from an initial distribution close to the terminal data distribution, it can substantially alleviate learning challenges, reduce inference complexity, and enhance the quality of generated samples, as supported by previous studies [Rubanova et al., 2019, Rasul et al., 2021a, b, Chen et al., 2023b, Deng et al., 2024a, b, Chen et al., 2024]. Time series data is typically continuous, and neighboring time points exhibit strong correlations, indicating that the distribution of the next time point is close to that of the current one.

These observations inspire a time series prediction model under the generative framework that maps between dependent data points: initiating the prediction of the future time point's distribution from the current time point is more straightforward and yields better quality. Meanwhile, the longer temporal dependency is encoded by a recurrent neural network, and the embedded history is passed to the generative model to guide the prediction of future time points. The new framework benefits from the efficient training and computation inherited from the recurrent neural network, while enjoying the high-quality probabilistic modeling empowered by the diffusion model.

Our contributions include:

  • extending the theory of stochastic interpolants to a more general conditional generation framework with extra control features;

  • adopting a conditional stochastic interpolants module for sequential modeling and multivariate probabilistic time series prediction, which is computationally friendly and achieves high-quality modeling of the distribution of future time points.

2 Background

As we formalize probabilistic time series forecasting within the generative framework in Section 4, this section is dedicated to reviewing commonly used generative methods and their extensions for conditional generation. These models will serve as baseline models in subsequent sections. For a broader overview of time series forecasting problems, refer to Salinas et al. [2019], Alexandrov et al. [2020], and the references therein.

2.1 Denoising Diffusion Probabilistic Model (DDPM)

DDPM [Sohl-Dickstein et al., 2015, Ho et al., 2020] adds Gaussian noise to the observed data point \mathbf{x}^{0}\in\mathbb{R}^{D} at different scales, indexed by n, with 0<\beta_{1}<\beta_{2}<\dots<\beta_{N}, such that the first noisy value \mathbf{x}^{1} is close to the clean data \mathbf{x}^{0} and the final value \mathbf{x}^{N} is indistinguishable from noise. The generative model learns to revert this process, allowing new points to be sampled from pure noise.

Following previous convention, we define \bar{\alpha}_{n}=\prod_{k=1}^{n}\alpha_{k} with \alpha_{n}=1-\beta_{n}. Since the transition kernel is Gaussian, the marginal at step n can be computed directly from \mathbf{x}^{0}:

q(\mathbf{x}^{n}|\mathbf{x}^{0})=\mathcal{N}(\sqrt{\bar{\alpha}_{n}}\mathbf{x}^{0},(1-\bar{\alpha}_{n}){\bm{I}}). (1)

The posterior distribution is available in closed form:

q(\mathbf{x}^{n-1}|\mathbf{x}^{n},\mathbf{x}^{0})=\mathcal{N}(\tilde{{\bm{\mu}}}_{n},\tilde{\beta}_{n}{\bm{I}}), (2)

where \tilde{{\bm{\mu}}}_{n} depends on \mathbf{x}^{0}, \mathbf{x}^{n}, and the choice of \beta-schedule. The generative model p(\mathbf{x}^{n-1}|\mathbf{x}^{n})\approx q(\mathbf{x}^{n-1}|\mathbf{x}^{n},\mathbf{x}^{0}) approximates the reverse process. The actual model \epsilon_{\theta}(\mathbf{x}^{n},n) is usually reparameterized to predict the noise added to a clean data point from the noisy data point \mathbf{x}^{n}. The loss function can simply be written as:

\mathcal{L}=\mathbb{E}_{\bm{\epsilon}\sim\mathcal{N}({\bm{0}},{\bm{I}}),n\sim\mathcal{U}(\{1,\dots,N\})}\left[\lVert\epsilon_{\theta}(\mathbf{x}^{n},n)-\bm{\epsilon}\rVert_{2}^{2}\right]. (3)

Sampling new data is performed by first sampling a point from pure noise, \mathbf{x}^{N}\sim\mathcal{N}({\bm{0}},{\bm{I}}), and then gradually denoising it using the above model to obtain a sample from the data distribution via N calls of the model [Ho et al., 2020].
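As a concrete illustration of Equations 1 and 3, the following sketch implements the forward noising and the noise-prediction loss; eps_theta is a placeholder for any noise-prediction network and is not part of the original text.

```python
import torch

def ddpm_loss(eps_theta, x0, betas):
    """One stochastic estimate of the DDPM loss in Equation 3.

    eps_theta: callable (x_n, n) -> predicted noise (same shape as x_n); assumed placeholder.
    x0:        clean data batch of shape (B, D).
    betas:     noise schedule beta_1 < ... < beta_N, shape (N,).
    """
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)          # \bar{alpha}_n
    n = torch.randint(0, betas.shape[0], (x0.shape[0],))   # n ~ U({1, ..., N})
    a_bar = alpha_bar[n].unsqueeze(-1)
    eps = torch.randn_like(x0)                             # eps ~ N(0, I)
    x_n = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps     # Equation 1
    return ((eps_theta(x_n, n) - eps) ** 2).sum(-1).mean()
```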

2.2 Score-based Generative Model (SGM)

SGM [Song et al., 2021], like DDPM, considers a pair of forward and backward dynamics for s\in[0,1]:

{\mathrm{d}}\mathbf{x}^{s}=f(\mathbf{x}^{s},s){\mathrm{d}}s+g(s){\mathrm{d}}\mathbf{w}^{s}, (4)
{\mathrm{d}}\mathbf{x}^{s}=[f(\mathbf{x}^{s},s)-g(s)^{2}\nabla_{\mathbf{x}^{s}}\log p(\mathbf{x}^{s})]{\mathrm{d}}s+g(s){\mathrm{d}}\mathbf{w}^{s}, (5)

where \nabla_{\mathbf{x}^{s}}\log p(\mathbf{x}^{s}) is the so-called score function. The forward process is usually chosen to be a simple process, such as Brownian motion or an Ornstein–Uhlenbeck process, which transports the data distribution to a standard Gaussian. The generative process is given by the backward process, which moves from the Gaussian prior to the data distribution of interest. Equation 5 thus provides a way to generate new points by starting at \mathbf{x}^{1}\sim\mathcal{N}({\bm{0}},{\bm{I}}) and solving the SDE backward in time, yielding \mathbf{x}^{0} as a sample from the data distribution. In practice, the only missing piece is the score, which is approximated with a neural network.

Since we have access to clean data during training, the conditional score is available in closed form. The model \epsilon_{\theta}(\mathbf{x}^{s},s) learns to approximate the score from noisy data only, resulting in a loss function similar to Equation 3:

\mathcal{L}=\mathbb{E}_{s\sim\mathcal{U}(0,1),\mathbf{x}^{0}\sim\text{Data},\mathbf{x}^{s}\sim p(\mathbf{x}^{s}|\mathbf{x}^{0})}\left[\lVert\epsilon_{\theta}(\mathbf{x}^{s},s)-\nabla_{\mathbf{x}^{s}}\log p(\mathbf{x}^{s}|\mathbf{x}^{0})\rVert_{2}^{2}\right]. (6)
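For concreteness, the sketch below instantiates the loss in Equation 6 for one simple choice of forward process, a Brownian motion with constant diffusion, so that p(x^s|x^0) is Gaussian and its score is available in closed form; the network handle eps_theta is an assumed placeholder.

```python
import torch

def dsm_loss(eps_theta, x0, sigma=1.0):
    """Denoising score matching (Equation 6) for the forward process
    dX^s = sigma dW^s, whose conditional is p(x^s | x^0) = N(x^0, s * sigma^2 * I).

    eps_theta: callable (x_s, s) -> estimated score; assumed placeholder.
    x0:        clean data batch of shape (B, D).
    """
    s = torch.rand(x0.shape[0], 1).clamp_min(1e-3)   # s ~ U(0, 1), avoid s = 0
    noise = torch.randn_like(x0)
    x_s = x0 + sigma * s.sqrt() * noise              # sample from p(x^s | x^0)
    score = -(x_s - x0) / (s * sigma ** 2)           # closed-form conditional score
    return ((eps_theta(x_s, s) - score) ** 2).sum(-1).mean()
```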

2.3 Flow Matching (FM)

Flow matching [Lipman et al., 2023] constructs a probability path by learning the vector field that generates it. Given a data point \mathbf{x}^{1}, the conditional probability path is denoted p_{s}(\mathbf{x}|\mathbf{x}^{1}) for s\in[0,1]. We put constraints on p_{s}(\mathbf{x}|\mathbf{x}^{1}) such that p_{0}(\mathbf{x}|\mathbf{x}^{1})=\mathcal{N}({\bm{0}},{\bm{I}}) and p_{1}(\mathbf{x}|\mathbf{x}^{1})=\mathcal{N}(\mathbf{x}^{1},\sigma^{2}{\bm{I}}), with small \sigma>0. That is, the distribution p_{0}(\mathbf{x}|\mathbf{x}^{1}) corresponds to the noise distribution and the distribution p_{1}(\mathbf{x}|\mathbf{x}^{1}) is centered around the data point with small variance.

Then there exists a conditional vector field u_{s}(\mathbf{x}|\mathbf{x}^{1}) which generates p_{s}(\mathbf{x}|\mathbf{x}^{1}). Our goal is to learn this vector field with a neural network \epsilon_{\theta}(\mathbf{x},s), which amounts to learning the generative process. This can be done by minimizing the flow matching objective:

\mathcal{L}=\mathbb{E}_{s\sim\mathcal{U}(0,1),\mathbf{x}^{1}\sim\text{Data},\mathbf{x}\sim p_{s}(\mathbf{x}|\mathbf{x}^{1})}\left[\lVert\epsilon_{\theta}(\mathbf{x},s)-u_{s}(\mathbf{x}|\mathbf{x}^{1})\rVert_{2}^{2}\right]. (7)

Going back to Equation 6, we notice that the two approaches are similar. Flow matching differs in the path construction, and it learns the vector field directly instead of the score, potentially offering a more stable alternative.

One choice for the noising path transports the values into noise as a linear function of the transport time:

\mathbf{x}^{s}=s\mathbf{x}^{1}+(1-(1-\sigma)s)\bm{\epsilon},\quad\bm{\epsilon}\sim\mathcal{N}({\bm{0}},{\bm{I}}). (8)

This probability path is generated by the following conditional vector field, available in closed form:

u_{s}(\mathbf{x}|\mathbf{x}^{1})=\frac{\mathbf{x}^{1}-(1-\sigma)\mathbf{x}}{1-(1-\sigma)s}. (9)

By learning the field u_{s}(\mathbf{x}|\mathbf{x}^{1}) with a neural network \epsilon_{\theta}(\mathbf{x},s), we can sample new points by drawing an initial value \mathbf{x}^{0} from the noise distribution p_{0} and solving an ODE from 0 to 1 to obtain the new sample \mathbf{x}^{1}.
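A minimal sketch of the flow matching objective with the linear path of Equation 8 and the closed-form vector field of Equation 9; v_theta is an assumed placeholder for the learned field.

```python
import torch

def fm_loss(v_theta, x1, sigma=1e-2):
    """Conditional flow matching loss (Equation 7) with the linear path of
    Equation 8 and the closed-form vector field of Equation 9.

    v_theta: callable (x, s) -> estimated vector field; assumed placeholder.
    x1:      data batch of shape (B, D).
    """
    s = torch.rand(x1.shape[0], 1)
    eps = torch.randn_like(x1)
    x_s = s * x1 + (1 - (1 - sigma) * s) * eps                 # Equation 8
    u_s = (x1 - (1 - sigma) * x_s) / (1 - (1 - sigma) * s)     # Equation 9
    return ((v_theta(x_s, s) - u_s) ** 2).sum(-1).mean()
```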

2.4 Stochastic Interpolants (SI)

Stochastic interpolants [Albergo et al., 2024] aim to model the dependent coupling between (\mathbf{x}^{0},\mathbf{x}^{1}) with joint density \rho(\mathbf{x}^{0},\mathbf{x}^{1}), and establish two-way generative SDEs mapping one data distribution to another. The method constructs a simple stochastic mapping from s=0 to s=1 given the values at the two ends, \mathbf{x}^{0}\sim\rho_{0} and \mathbf{x}^{1}\sim\rho_{1}, which provides a means of transport between the two densities \rho_{0} and \rho_{1} while maintaining the dependency between \mathbf{x}^{0} and \mathbf{x}^{1}:

\mathbf{x}^{s}=\alpha(s)\mathbf{x}^{0}+\beta(s)\mathbf{x}^{1}+\gamma(s)\mathbf{z},\quad s\in[0,1],\;\mathbf{z}\sim\mathcal{N}(\mathbf{0},\mathbf{I}), (10)

where \rho(s,\mathbf{x}) denotes the marginal density of \mathbf{x}^{s} at diffusion time s. Such a stochastic mapping is characterized by a pair of functions, the velocity function \mathbf{b}(s,\mathbf{x}) and the score function \mathbf{s}(s,\mathbf{x}):

\mathbf{s}(s,\mathbf{x}):=\nabla\log\rho(s,\mathbf{x}), (11)
\mathbf{b}(s,\mathbf{x}):=\mathbb{E}_{\mathbf{x}^{0},\mathbf{x}^{1},\mathbf{z}}[\dot{\alpha}(s)\mathbf{x}^{0}+\dot{\beta}(s)\mathbf{x}^{1}+\dot{\gamma}(s)\mathbf{z}\,|\,\mathbf{x}^{s}=\mathbf{x}]. (12)

\mathbf{b}(s,\mathbf{x}), \rho(s,\mathbf{x}), and \mathbf{s}(s,\mathbf{x}) satisfy the equalities

\partial_{s}\rho(s,\mathbf{x})+\nabla\cdot(\mathbf{b}(s,\mathbf{x})\rho(s,\mathbf{x}))=0, (13)
\mathbf{s}(s,\mathbf{x})=-\gamma^{-1}(s)\mathbb{E}_{\mathbf{z}}[\mathbf{z}\,|\,\mathbf{x}^{s}=\mathbf{x}], (14)

where \alpha(s) and \beta(s) schedule the deterministic interpolant. We set \alpha(0)=1, \alpha(1)=0, \beta(0)=0, \beta(1)=1. \gamma(s) schedules the variance of the stochastic component \mathbf{z}; we set \gamma(0)=\gamma(1)=0, so the two ends of the interpolant are fixed at \mathbf{x}^{0} and \mathbf{x}^{1}. Figure 1 shows one example of the interpolant schedule, where \alpha(s)=\sqrt{1-\gamma^{2}(s)}\cos(\frac{1}{2}\pi s), \beta(s)=\sqrt{1-\gamma^{2}(s)}\sin(\frac{1}{2}\pi s), and \gamma(s)=\sqrt{2s(1-s)}.

Figure 1: \alpha(\cdot), \beta(\cdot), and \gamma(\cdot), the schedules of the stochastic interpolant.
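The schedules of Figure 1 and the interpolant of equation 10 can be written compactly as below; this is an illustrative sketch, and the helper names are ours (other gamma-interpolant combinations from Section 5.1 can be swapped in).

```python
import torch

def si_schedules(s):
    """Example schedules of Figure 1: gamma(s) = sqrt(2 s (1 - s)) with a
    trigonometric interpolant.  Boundary conditions: alpha(0) = beta(1) = 1,
    alpha(1) = beta(0) = 0, gamma(0) = gamma(1) = 0.
    s: tensor of diffusion times in [0, 1].
    """
    gamma = torch.sqrt(2 * s * (1 - s))
    scale = torch.sqrt(1 - gamma ** 2)
    alpha = scale * torch.cos(0.5 * torch.pi * s)
    beta = scale * torch.sin(0.5 * torch.pi * s)
    return alpha, beta, gamma

def interpolate(x0, x1, s, z):
    """x^s = alpha(s) x^0 + beta(s) x^1 + gamma(s) z  (equation 10)."""
    alpha, beta, gamma = si_schedules(s)
    return alpha * x0 + beta * x1 + gamma * z
```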

The velocity function \mathbf{b}(s,\mathbf{x}) and the score function \mathbf{s}(s,\mathbf{x}) can be modeled by a rich family of functions, such as deep neural networks. The models are trained to match the above equalities by minimizing the mean squared error loss functions

\mathcal{L}_{b}=\int_{0}^{1}\mathbb{E}\Big{[}\frac{1}{2}\|\hat{\mathbf{b}}(s,\mathbf{x}^{s})\|^{2}-\big{(}\dot{\alpha}(s)\mathbf{x}^{0}+\dot{\beta}(s)\mathbf{x}^{1}+\dot{\gamma}(s)\mathbf{z}\big{)}^{T}\hat{\mathbf{b}}(s,\mathbf{x}^{s})\Big{]}{\mathrm{d}}s, (15)
\mathcal{L}_{s}=\int_{0}^{1}\mathbb{E}\Big{[}\frac{1}{2}\|\hat{\mathbf{s}}(s,\mathbf{x}^{s})\|^{2}+\gamma^{-1}(s)\mathbf{z}^{T}\hat{\mathbf{s}}(s,\mathbf{x}^{s})\Big{]}{\mathrm{d}}s. (16)

More details of training will be shown in section 4.

During inference, one side of the diffusion trajectory (at s=0 or s=1) is usually given, and the goal is to infer the sample distribution on the other side. The interpolant in equation 10 results in elegant forward and backward SDEs and corresponding Fokker-Planck equations, which offer convenient tools for inference. The SDEs are composed of \mathbf{b}(s,\mathbf{x}^{s}) and \mathbf{s}(s,\mathbf{x}^{s}), which are learned from data. For any \epsilon(s)\geq 0, define the forward and backward SDEs

{\mathrm{d}}\mathbf{x}^{s}=[\mathbf{b}(s,\mathbf{x})+\epsilon(s)\mathbf{s}(s,\mathbf{x})]{\mathrm{d}}s+\sqrt{2\epsilon(s)}{\mathrm{d}}\mathbf{w}^{s}, (17)
{\mathrm{d}}\mathbf{x}^{s}=[\mathbf{b}(s,\mathbf{x})-\epsilon(s)\mathbf{s}(s,\mathbf{x})]{\mathrm{d}}s+\sqrt{2\epsilon(s)}{\mathrm{d}}\mathbf{w}_{\mathrm{B}}^{s}, (18)

where \mathbf{w}_{\mathrm{B}}^{s} is a backward Brownian motion. The SDEs satisfy the forward and backward Fokker-Planck equations

\partial_{s}\rho+\nabla\cdot(\mathbf{b}_{\mathrm{F}}\rho)=\epsilon(s)\Delta\rho,\quad\rho(0)=\rho_{0}, (19)
\partial_{s}\rho+\nabla\cdot(\mathbf{b}_{\mathrm{B}}\rho)=-\epsilon(s)\Delta\rho,\quad\rho(1)=\rho_{1}. (20)

These properties imply that one can draw samples from the conditional density \rho(\mathbf{x}^{1}|\mathbf{x}^{0}) by following the forward SDE in equation 17 starting from \mathbf{x}^{0} at s=0. One can also draw samples from the joint density \rho(\mathbf{x}^{0},\mathbf{x}^{1}) by first drawing a sample \mathbf{x}^{0}\sim\rho_{0} (if feasible, for example, picking one sample from the dataset), then using the forward SDE to generate samples \mathbf{x}^{1} at s=1. The method guarantees that \mathbf{x}^{1} follows the marginal distribution \rho_{1} and that the sample pair (\mathbf{x}^{0},\mathbf{x}^{1}) follows the joint density \rho(\mathbf{x}^{0},\mathbf{x}^{1}). Drawing samples using the backward SDE is analogous: one can draw samples from \rho(\mathbf{x}^{0}|\mathbf{x}^{1}) and from the joint density \rho(\mathbf{x}^{0},\mathbf{x}^{1}) as well. Details of inference are given in Section 4.

Figure 2: Stochastic interpolants for time series prediction using the forward SDE in equation 22.

3 Conditional generation with extra features

All the aforementioned methods can be adapted for conditional generation with additional features. The conditions may range from simple categorical values [Song et al., 2021] to complex prompts involving multiple data types, including partial observations of a sample’s entries (e.g., image inpainting, time series imputation) [Tashiro et al., 2021, Song et al., 2021], images [Zheng et al., 2023, Rombach et al., 2022], text [Rombach et al., 2022, Zhang et al., ], etc. A commonly employed technique to handle diverse conditions is to integrate condition information through feature embedding, where the embedding is injected into various layers of neural networks [Song et al., 2021, Rombach et al., 2022]. For instance, conditional SGM can be trained with

\mathcal{L}_{\mathrm{cond}}=\mathbb{E}_{s\sim\mathcal{U}(0,1),(\mathbf{x}^{0},\xi)\sim\text{Data},\mathbf{x}^{s}\sim p(\mathbf{x}^{s}|\mathbf{x}^{0})}\left[\lVert\epsilon_{\theta}(s,\mathbf{x}^{s},\xi)-\nabla_{\mathbf{x}^{s}}\log p(\mathbf{x}^{s}|\mathbf{x}^{0})\rVert_{2}^{2}\right], (21)

where the data consists of pairs of a sample \mathbf{x}^{0} and the corresponding condition \xi. This simple scheme has shown its effectiveness across various tasks, achieving state-of-the-art performance [Rombach et al., 2022, Zhang et al.].

Figure 3: Examples of model-generated samples for synthetic two-dimensional (D=2) datasets.

Likewise, SI can be extended to conditional generation by substituting the velocity and score functions with \mathbf{b}(s,\mathbf{x}^{s},\xi) and \mathbf{s}(s,\mathbf{x}^{s},\xi) [Albergo et al., 2024]. The model is trained using samples of tuples (\mathbf{x}^{0},\mathbf{x}^{1},\xi), where \xi is the extra condition feature. Consequently, inference using the forward or backward SDEs becomes

{\mathrm{d}}\mathbf{x}^{s}=[\mathbf{b}(s,\mathbf{x}^{s},\xi)+\epsilon(s)\mathbf{s}(s,\mathbf{x}^{s},\xi)]{\mathrm{d}}s+\sqrt{2\epsilon(s)}{\mathrm{d}}\mathbf{w}^{s}, (22)
{\mathrm{d}}\mathbf{x}^{s}=[\mathbf{b}(s,\mathbf{x}^{s},\xi)-\epsilon(s)\mathbf{s}(s,\mathbf{x}^{s},\xi)]{\mathrm{d}}s+\sqrt{2\epsilon(s)}{\mathrm{d}}\mathbf{w}_{\mathrm{B}}^{s}, (23)

where both the velocity and score functions depend on the condition \xi. The loss functions are analogous to equation 15 and equation 16.

Regarding the time series prediction task, we will encode a large context window as the conditional information, and the prediction or generation of future time points will rely on such a conditional generation mechanism.

Next, we show that the probability distribution of \mathbf{x}^{s} simulated by equation 24 yields a time-dependent density. This density is a solution of the transport equation 25, which smoothly transitions between \rho_{0} and \rho_{1}.

Algorithm 1 Training algorithm.
  Input: Sample \mathbf{x}_{1:C+P} from the training split. Interpolant schedules: \alpha(s),\beta(s),\gamma(s). Models: velocity \hat{\mathbf{b}}, score \hat{\mathbf{s}}, \mathrm{RNN}.
  for iteration t=C to C+P-1 do
     Sample s\sim\mathrm{Beta}(0.1,0.1) and \mathbf{z}\sim\mathcal{N}(\mathbf{0},\mathbf{I}).
     \mathbf{x}^{s}=\alpha(s)\mathbf{x}_{t}+\beta(s)\mathbf{x}_{t+1}+\gamma(s)\mathbf{z}
     \mathbf{h}_{t}=\mathrm{RNN}(\mathbf{x}_{t},\mathbf{h}_{t-1})
     \mathcal{L}_{b}=\frac{1}{p_{\mathrm{Beta}}(s)}\Big{[}\frac{1}{2}\|\hat{\mathbf{b}}(s,\mathbf{x}^{s},\mathbf{h}_{t})\|^{2}-\big{(}\dot{\alpha}(s)\mathbf{x}_{t}+\dot{\beta}(s)\mathbf{x}_{t+1}+\dot{\gamma}(s)\mathbf{z}\big{)}^{T}\hat{\mathbf{b}}(s,\mathbf{x}^{s},\mathbf{h}_{t})\Big{]}
     \mathcal{L}_{s}=\frac{1}{p_{\mathrm{Beta}}(s)}\Big{[}\frac{1}{2}\|\hat{\mathbf{s}}(s,\mathbf{x}^{s},\mathbf{h}_{t})\|^{2}+\gamma^{-1}(s)\mathbf{z}^{T}\hat{\mathbf{s}}(s,\mathbf{x}^{s},\mathbf{h}_{t})\Big{]}
     Perform back-propagation by minimizing \mathcal{L}_{b} and \mathcal{L}_{s}.
  end for
Theorem 1.

(Extension of Stochastic Interpolants to Arbitrary Joint Distributions). Let \rho_{01} be the joint distribution with (\mathbf{x}^{0},\mathbf{x}^{1})\sim\rho_{01} and let the stochastic interpolant be

\mathbf{x}^{s}=\alpha_{s}\mathbf{x}^{0}+\beta_{s}\mathbf{x}^{1}+\gamma_{s}\mathbf{z}, (24)

where \alpha_{0}=\beta_{1}=1, \alpha_{1}=\beta_{0}=\gamma_{0}=\gamma_{1}=0, and \alpha_{s}^{2}+\beta_{s}^{2}+\gamma_{s}^{2}>0 for all s\in[0,1]. We define \rho_{s} to be the noise-dependent density of \mathbf{x}^{s}, which satisfies the boundary conditions at s=0,1 and the transport equation

\dot{\rho}_{s}+\nabla\cdot(\mathbf{b}_{s}\rho_{s})=0 (25)

for all s\in[0,1], with the velocity defined as

\mathbf{b}_{s}(\mathbf{x}|\xi)=\mathbb{E}\left[\dot{\alpha}_{s}\mathbf{x}^{0}+\dot{\beta}_{s}\mathbf{x}^{1}+\dot{\gamma}_{s}\,\mathbf{z}\,|\,\mathbf{x}^{s}=\mathbf{x},\xi\right], (26)

where the expectation is taken over the density \rho_{01} given \mathbf{x}^{s}=\mathbf{x} and the extra information \xi.

The score function satisfies

\nabla\log\rho_{s}(\mathbf{x})=-\gamma^{-1}_{s}\mathbb{E}\left[\mathbf{z}\,|\,\mathbf{x}^{s}=\mathbf{x},\xi\right].

The proof is similar in spirit to Theorem 2 in Albergo et al. [2024] and is detailed in Appendix B. The key difference is that we consider a continuous-time interpretation and avoid using characteristic functions, which makes the analysis more accessible. Additionally, the score function \nabla\log\rho_{s}(\mathbf{x}) is obtained by optimizing a simple quadratic objective, as stated in Theorem 2 in the Appendix.

4 Stochastic interpolants for time series prediction

We formulate the multivariate probabilistic time series prediction task through the conditional probability \prod_{t=C+1}^{C+P}p(\mathbf{x}_{t}|\mathbf{x}_{1:t-1}) for some chosen context length C and prediction length P. The model diagram is illustrated in Figure 2. Here, \mathbf{x}_{t}\in\mathbb{R}^{D} represents the multivariate time series at date-time index t with D>1 variates. \mathbf{x}_{1:C}=\mathbf{X}_{\mathrm{cont.}} is the context window, and during training the subsequent prediction window \mathbf{x}_{C+1:C+P}=\mathbf{X}_{\mathrm{pred.}} is available, typically sampled randomly from within the train split of a dataset.

For this problem, we employ the conditional Stochastic Interpolants (SI) method as follows. In the training phase, the generative model learns the joint distribution p(\mathbf{x}_{t+1},\mathbf{x}_{t}|\mathbf{x}_{1:t-1}) of the pair (\mathbf{x}_{t+1},\mathbf{x}_{t}) given the past observations \mathbf{x}_{1:t-1}, where \mathbf{x}_{t}\sim\rho_{0} and \mathbf{x}_{t+1}\sim\rho_{1} for all t, so the marginal distributions are equal: \rho_{0}=\rho_{1}. The model aims to learn the coupling between \mathbf{x}_{t+1} and \mathbf{x}_{t} conditioned on the history \mathbf{x}_{1:t}. This is achieved by training the conditional velocity and score functions in equation 22.

Algorithm 2 Inference algorithm.
  Input: Last context \mathbf{x}_{1:C}. Trained models: velocity \hat{\mathbf{b}}, score \hat{\mathbf{s}}, \mathrm{RNN}. Diffusion variance \epsilon(s).
  for iteration t=C to C+P-1 do
     Set \hat{\mathbf{x}}^{0}=\mathbf{x}_{t} and \mathbf{h}_{t}=\mathrm{RNN}(\mathbf{x}_{t},\mathbf{h}_{t-1}). Run the SDE integral for s\in[0,1] following
     {\mathrm{d}}\hat{\mathbf{x}}^{s}=[\hat{\mathbf{b}}(s,\hat{\mathbf{x}}^{s},\mathbf{h}_{t})+\epsilon(s)\hat{\mathbf{s}}(s,\hat{\mathbf{x}}^{s},\mathbf{h}_{t})]{\mathrm{d}}s+\sqrt{2\epsilon(s)}{\mathrm{d}}\mathbf{w}^{s}
     Output: \hat{\mathbf{x}}^{1} as the prediction \mathbf{x}_{t+1}.
  end for
8gaussians Circles Moons Rings Swissroll
DDPM (cosine) 2.58 0.20 0.20 0.12 0.24
DDPM (linear) 0.70 0.18 0.12 0.11 0.14
SGM 1.10 0.30 0.35 0.32 0.14
FM 0.58 0.10 0.11 0.09 0.15
SI (quad, linear) 0.52 0.15 0.32 0.12 0.16
SI (sqrt, linear) 0.59 0.29 0.51 0.22 0.37
SI (sqrt, trig) 0.75 0.25 0.50 0.48 0.36
SI (trig, linear) 0.52 0.13 0.29 0.21 0.16
Table 1: Wasserstein distance between the generated samples and true data.

As the sample spaces of \rho_{0} and \rho_{1} must be the same, the generative model cannot directly map the whole context window \mathbf{x}_{1:t} to the target \mathbf{x}_{t+1} due to the different tensor sizes. Instead, a recurrent neural network encodes the context \mathbf{x}_{1:t} into a history prompt vector \mathbf{h}_{t}\in\mathbb{R}^{H}. Subsequently, the score and velocity functions perform conditional generation diffusing from \mathbf{x}_{t} with the condition input \mathbf{h}_{t}, following equation 22.

The training loss is computed from tuples (\mathbf{x}_{t+1},\mathbf{x}_{t},\mathbf{h}_{t}) for each time step t. It is worth noting that the loss values become large when s is close to the two ends. To address this and stabilize training, importance sampling is leveraged to better handle the integral over diffusion time in the loss functions equation 15 and equation 16, where we use a Beta distribution as the proposal distribution. The algorithm is outlined in Algorithm 1 and sketched below. Additional details can be found in Appendix C.
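A condensed sketch of one training pass of Algorithm 1 with the Beta importance sampling described above; the model handles (rnn, b_net, s_net) and the schedules helper returning the schedules and their derivatives are assumed placeholders, and batching and optimizer steps are omitted.

```python
import torch

def si_training_step(x_seq, rnn, b_net, s_net, schedules, C):
    """One pass of Algorithm 1 over a single window x_seq of shape (C+P, D).
    `schedules(s)` returns (alpha, beta, gamma, d_alpha, d_beta, d_gamma) at s.
    """
    beta_dist = torch.distributions.Beta(0.1, 0.1)
    h = None
    loss_b = x_seq.new_zeros(())
    loss_s = x_seq.new_zeros(())
    for t in range(C, x_seq.shape[0] - 1):
        x_t, x_next = x_seq[t], x_seq[t + 1]
        h = rnn(x_t, h)                                   # history prompt h_t
        s = beta_dist.sample()                            # s ~ Beta(0.1, 0.1)
        w = 1.0 / beta_dist.log_prob(s).exp()             # 1 / p_Beta(s)
        z = torch.randn_like(x_t)
        a, b, g, da, db, dg = schedules(s)
        x_s = a * x_t + b * x_next + g * z                # interpolant x^s
        target = da * x_t + db * x_next + dg * z
        b_hat, s_hat = b_net(s, x_s, h), s_net(s, x_s, h)
        loss_b = loss_b + w * (0.5 * (b_hat ** 2).sum() - (target * b_hat).sum())
        loss_s = loss_s + w * (0.5 * (s_hat ** 2).sum() + (z * s_hat).sum() / g)
    return loss_b, loss_s                                 # back-propagate both losses
```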

In the inference phase, the RNN first encodes the context \mathbf{x}_{1:t} into the history prompt \mathbf{h}_{t}; then SI transports the context vector \mathbf{x}_{t} to the target distribution under the condition \mathbf{h}_{t}, following the forward SDE. For multiple-step prediction, we recursively run the step-by-step prediction in an autoregressive manner, as outlined in Algorithm 2 and sketched below. By repeating the inference loop (in the batch dimension) we obtain empirical samples from the predicted distribution, which are used to quantify uncertainty.
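The following sketch mirrors Algorithm 2 with a simple Euler–Maruyama integration of the forward SDE in equation 22; the model handles and the constant choice of \epsilon(s) are illustrative assumptions.

```python
import torch

def si_forecast(x_context, rnn, b_net, s_net, P, eps=0.1, n_steps=100):
    """Autoregressive forecasting as in Algorithm 2: each step integrates the
    forward SDE of equation 22 with Euler-Maruyama.  x_context has shape
    (C, D); the constant eps plays the role of epsilon(s).  Returns (P, D).
    """
    h = None
    for x_t in x_context:                      # encode the context window
        h = rnn(x_t, h)
    x_t, preds = x_context[-1], []
    ds = 1.0 / n_steps
    for _ in range(P):
        x = x_t.clone()
        for i in range(n_steps):               # integrate s from 0 to 1
            s = torch.tensor(i * ds)
            drift = b_net(s, x, h) + eps * s_net(s, x, h)
            x = x + drift * ds + (2 * eps * ds) ** 0.5 * torch.randn_like(x)
        preds.append(x)
        x_t = x                                # feed the prediction back
        h = rnn(x_t, h)                        # update the history prompt
    return torch.stack(preds)
```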

Figure 4: Forecast paths for SI on the Solar dataset showing the median prediction and the 50% and 90% prediction intervals calculated from model samples, on 6 of the 137 variate dimensions.

5 Experiments

We first verify the method on synthetic datasets and then apply it to the time series forecasting tasks with real data.

Figure 5: Time-Grad [Rasul et al., 2021a] model for conditional time series prediction as a comparison.

Baseline models such as DDPM, SGM, FM, and SI all involve modeling field functions, where the inputs are the state vector (in the same space as the data samples), the diffusion time, and the condition embedding, and the output is used to generate the sample. The field functions correspond to the noise prediction function in DDPM, the score function in SGM, the vector field in FM, and the velocity and score functions in SI. To make a fair comparison between these models, we use the same neural networks for all of them. Details of the models are discussed in Appendix C.

Exchange rate Solar Traffic Wiki
Vec-LSTM 0.008±0.001 0.391±0.017 0.087±0.041 0.133±0.002
DDPM 0.009±0.004 0.359±0.061 0.058±0.014 0.084±0.023
FM 0.009±0.001 0.419±0.027 0.038±0.002 64.256±62.596
SGM 0.008±0.002 0.364±0.029 0.071±0.05 0.108±0.026
SI 0.007±0.001 0.359±0.06 0.083±0.005 0.080±0.007
Table 2: CRPS-sum metric on multivariate probabilistic forecasting tasks. A smaller number indicates better performance.

5.1 Synthetic datasets

We synthesize several two-dimensional datasets with regular patterns, such as Circles, Moons, etc. Details can be found in Chen et al. [2022], Lipman et al. [2023] and their published code repositories. The models introduced in Section 2 are compared to SI as baselines. For diffusion-like models, we implement DDPM with a linear or cosine noise schedule. We use the synthetic datasets to determine a good range of hyperparameters for the later time series experiments and to investigate the behavior of the methods with respect to varying data sizes, model sizes, and training lengths.

To fairly compare the generation quality, all models generate data in the same setting by mapping from a standard Gaussian to the target distribution. The neural networks and hyperparameters, such as batch size and training epochs, are also identical. The generated samples from the different methods are shown in Figure 3. Table 1 measures the sample quality with the Wasserstein distance [Ramdas et al., 2017]. It shows that all the models can capture the true distribution. The same holds when we use different metrics such as Sliced Wasserstein Distance (SWD) [Rabin et al., 2012] and Maximum Mean Discrepancy (MMD) [Gretton et al., 2012].

We also test different schedules for the stochastic interpolant model. For example, “SI (sqrt, linear)” means we use the square-root gamma-function \gamma(s)=\sqrt{2s(1-s)} and a linear interpolant. Other gamma-functions that we consider are quad: \gamma(s)=s(1-s), and trig: \gamma(s)=\sin^{2}(\pi s). Most of the gamma-interpolant combinations achieve good results in modeling the target distribution.

5.2 Multivariate probabilistic forecasting

In this section, we will empirically verify that: 1) SI is a suitable generative module for the prediction compared with other baselines with different generative methods under the same framework; 2) the whole framework can achieve competitive performance in time series forecasting.

Models.

The baseline models include DDPM-, SGM-, and FM-based generative models adapted for step-by-step (autoregressive) prediction. DDPM- and SGM-based models can only generate samples by transporting a Gaussian noise distribution to the data distribution, so we modify the framework by replacing the context time point \mathbf{x}_{t} with Gaussian noise, as shown in Figure 5. Flow matching can easily fit into this framework by replacing the denoising objective with the flow matching objective. The modified framework is shown in Figure 2. We model the map from the previous time series observation to the next (forecasted) value; we argue this is a more natural choice than mapping from noise at each prediction step. Finally, Vec-LSTM from Salinas et al. [2019] is compared as a pure recurrent neural network model whose probabilistic layer is a multivariate Gaussian.

Setup.

The real-world time series datasets include Solar [Lai et al., 2018], Exchange [Lai et al., 2018], Traffic (https://archive.ics.uci.edu/ml/datasets/PEMS-SF), and Wiki (https://github.com/mbohlkeschneider/gluon-ts/tree/mv_release/datasets), which have been commonly used for probabilistic forecasting tasks. We follow the preprocessing steps of Salinas et al. [2019]. The probabilistic forecasts are evaluated by the Continuous Ranked Probability Score (CRPS-sum) [Koochali et al., 2022], the normalized root mean square error computed from the median of the samples (NRMSE), and the normalized deviation point metric (ND). The metric calculation is provided by the gluonts package [Alexandrov et al., 2020]. In all cases, smaller values indicate better performance.

Results.

The results for CRPS-sum are shown in Table 2. The results for the other metrics are consistent with CRPS-sum and are shown in Tables 4 and 5 in Appendix C. We outperform or match the other models on three out of four datasets; only on Traffic does the FM model achieve better performance. Note that on the Wiki data FM cannot capture the data distribution; we ran a search over flow matching hyperparameters without obtaining satisfying results. Therefore, we conclude that stochastic interpolants are a strong candidate for conditional generation, in particular for multivariate probabilistic forecasting. Compared to the RNN-based model Vec-LSTM, our model and other baselines such as SGM and DDPM achieve better performance, which implies that carefully modeling the probability distribution is critical for high-dimensional time series prediction. Figure 4 demonstrates the quality of the forecast on the Solar dataset. Our model makes precise predictions and captures the uncertainty, even when the scale of the different dimensions varies considerably.

6 Conclusions

This study presents a method that merges the computational efficiency of recurrent neural networks with the high-quality probabilistic modeling of diffusion models, applied to probabilistic time series forecasting. Grounded in stochastic interpolants and an extended conditional generation framework with control features, the method is evaluated empirically on both synthetic and real datasets, showing competitive performance.

References

  • Albergo et al. [2023] Michael S. Albergo, Nicholas M. Boffi, and Eric Vanden-Eijnden. Stochastic Interpolants: A Unifying Framework for Flows and Diffusions. In arXiv:2303.08797v3, 2023.
  • Albergo et al. [2024] Michael Samuel Albergo, Mark Goldstein, Nicholas Matthew Boffi, Rajesh Ranganath, and Eric Vanden-Eijnden. Stochastic interpolants with data-dependent couplings. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=FFILRGD0jG.
  • Alexandrov et al. [2020] Alexander Alexandrov, Konstantinos Benidis, Michael Bohlke-Schneider, Valentin Flunkert, Jan Gasthaus, Tim Januschowski, Danielle C. Maddix, Syama Rangapuram, David Salinas, Jasper Schulz, Lorenzo Stella, Ali Caner Türkmen, and Yuyang Wang. Gluonts: Probabilistic and neural time series modeling in python. Journal of Machine Learning Research, 21(116):1–6, 2020. URL http://jmlr.org/papers/v21/19-820.html.
  • Biloš et al. [2023] Marin Biloš, Kashif Rasul, Anderson Schneider, Yuriy Nevmyvaka, and Stephan Günnemann. Modeling Temporal Data as Continuous Functions with Process Diffusion. In Proc. of the International Conference on Machine Learning (ICML), 2023.
  • Box et al. [2015] George E. P. Box, Gwilym M. Jenkins, Gregory C. Reinsel, and Greta M. Ljung. Time Series Analysis: Forecasting and Control. WILEY, 2015.
  • Chen et al. [2022] Tianrong Chen, Guan-Horng Liu, and Evangelos Theodorou. Likelihood training of schrödinger bridge using forward-backward SDEs theory. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=nioAdKCEdXB.
  • Chen et al. [2024] Yifan Chen, Mark Goldstein, Mengjian Hua, Michael S. Albergo, Nicholas M. Boffi, and Eric Vanden-Eijnden. Probabilistic Forecasting with Stochastic Interpolants and Föllmer Processes. In Proc. of the International Conference on Machine Learning (ICML), 2024.
  • Chen et al. [2023a] Yu Chen, Wei Deng, Shikai Fang, Fengpei Li, Nicole Tianjiao Yang, Yikai Zhang, Kashif Rasul, Shandian Zhe, Anderson Schneider, and Yuriy Nevmyvaka. Provably Convergent Schrödinger Bridge with Applications to Probabilistic Time Series Imputation. In International Conference on Machine Learning (ICML), 2023a.
  • Chen et al. [2023b] Zehua Chen, Guande He, Kaiwen Zheng, Xu Tan, and Jun Zhu. Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis. arXiv preprint arXiv:2312.03491, 2023b.
  • Chen et al. [2023c] Zonglei Chen, Minbo Ma, Tianrui Li, Hongjun Wang, and Chongshou Li. Long Sequence Time-Series Forecasting with Deep Learning: A Survey. In Information Fusion, 2023c.
  • Deng et al. [2024a] Wei Deng, Yu Chen, Nicole Tianjiao Yang, Hengrong Du, Qi Feng, and Ricky T. Q. Chen. Reflected Schrödinger Bridge for Constrained Generative Modeling. In Proc. of the Conference on Uncertainty in Artificial Intelligence (UAI), 2024a.
  • Deng et al. [2024b] Wei Deng, Weijian Luo, Yixin Tan, Marin Biloš, Yu Chen, Yuriy Nevmyvaka, and Ricky T. Q. Chen. Variational Schrödinger Diffusion Models. In Proc. of the International Conference on Machine Learning (ICML), 2024b.
  • Graves [2013] Alex Graves. Generating Sequences with Recurrent Neural Networks. 2013.
  • Gretton et al. [2012] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8), 1997.
  • Koochali et al. [2022] Alireza Koochali, Peter Schichtel, Andreas Dengel, and Sheraz Ahmed. Random noise vs. state-of-the-art probabilistic forecasting methods: A case study on crps-sum discrimination ability. Applied Sciences, 12(10):5104, 2022.
  • Lai et al. [2018] Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu. Modeling long-and short-term temporal patterns with deep neural networks. In The 41st international ACM SIGIR conference on research & development in information retrieval, pages 95–104, 2018.
  • Li et al. [2020] Xuechen Li, Ting-Kam Leonard Wong, Ricky T. Q. Chen, and David Duvenaud. Scalable Gradients for Stochastic Differential Equations. In Proc. of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.
  • Li et al. [2022] Yan Li, Xinjiang Lu, Yaqing Wang, and Dejing Dou. Generative Time Series Forecasting with Diffusion, Denoise, and Disentanglement. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
  • Lipman et al. [2023] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=PqvMRDCJT9t.
  • Miguel et al. [2022] Juan Miguel Lopez Alcaraz and Nils Strodthoff. Diffusion-based Time Series Imputation and Forecasting with Structured State Space Models. In Transactions on Machine Learning Research, 2022.
  • Morrill et al. [2021] James Morrill, Cristopher Salvi, Patrick Kidger, James Foster, and Terry Lyons. Neural Rough Differential Equations for Long Time Series. In Proc. of the International Conference on Machine Learning (ICML), 2021.
  • Rabin et al. [2012] Julien Rabin, Gabriel Peyré, Julie Delon, and Marc Bernot. Wasserstein barycenter and its application to texture mixing. In Scale Space and Variational Methods in Computer Vision: Third International Conference, SSVM 2011, Ein-Gedi, Israel, May 29–June 2, 2011, Revised Selected Papers 3, pages 435–446. Springer, 2012.
  • Ramdas et al. [2017] Aaditya Ramdas, Nicolás García Trillos, and Marco Cuturi. On wasserstein two-sample testing and related families of nonparametric tests. Entropy, 19(2):47, 2017.
  • Rasul et al. [2021a] Kashif Rasul, Calvin Seward, Ingmar Schuster, and Roland Vollgraf. Autoregressive Denoising Diffusion Models for Multivariate Probabilistic Time Series Forecasting. In International Conference on Machine Learning, pages 8857–8868. PMLR, 2021a.
  • Rasul et al. [2021b] Kashif Rasul, Abdul-Saboor Sheikh, Ingmar Schuster, Urs Bergmann, and Roland Vollgraf. Multivariate Probabilistic Time Series Forecasting via Conditioned Normalizing Flows. In Proc. of the International Conference on Learning Representation (ICLR), 2021b.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  • Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
  • Rubanova et al. [2019] Yulia Rubanova, Ricky T. Q. Chen, and David Duvenaud. Latent ODEs for Irregularly-Sampled Time Series. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
  • Salinas et al. [2019] David Salinas, Michael Bohlke-Schneider, Laurent Callot, Roberto Medico, and Jan Gasthaus. High-dimensional Multivariate Forecasting with Low-rank Gaussian Copula Processes. Advances in neural information processing systems, 32, 2019.
  • Shishkov [2023] Vladislav Shishkov. Time Diffusion. In https://github.com/timetoai/TimeDiffusion, 2023.
  • Sohl-Dickstein et al. [2015] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. In International Conference on Machine Learning (ICML), 2015.
  • Song et al. [2021] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations (ICLR), 2021.
  • Spantini et al. [2022] Alessio Spantini, Ricardo Baptista, and Youssef Marzouk. Coupling Techniques for Nonlinear Ensemble Filtering. In SIAM Review, volume 64:4, page 10.1137, 2022.
  • Sutskever et al. [2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to Sequence Learning with Neural Networks. In NIPS, 2014.
  • Tashiro et al. [2021] Yusuke Tashiro, Jiaming Song, Yang Song, and Stefano Ermon. CSDI: Conditional Score-based Diffusion Models for Probabilistic Time Series Imputation. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Wen et al. [2023] Qingsong Wen, Tian Zhou, Chaoli Zhang, Weiqi Chen, Ziqing Ma, Junchi Yan, and Liang Sun. Transformers in Time Series: a Survey. Thirty-Second International Joint Conference on Artificial Intelligence, 2023.
  • Zhang et al. [2023] Chenshuang Zhang, Chaoning Zhang, Mengchun Zhang, and In So Kweon. Text-to-Image Diffusion Models in Generative AI: A Survey. arXiv preprint arXiv:2303.07909, 2023.
  • Zheng et al. [2023] Guangcong Zheng, Xianpan Zhou, Xuewei Li, Zhongang Qi, Ying Shan, and Xi Li. Layoutdiffusion: Controllable diffusion model for layout-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22490–22499, 2023.

Appendix A Related Works

A plethora of papers focus on autoregressive models, particularly transformer-based ones. For a more comprehensive review, we refer to Wen et al. [2023], Chen et al. [2023c]. While our work does not aim to replace RNN- or transformer-based architectures, we emphasize that one of the main motivations behind our work is to develop a probabilistic module building upon these recent advancements. Due to limited resources, we did not extensively explore all underlying temporal architectures but instead selected relatively simple models as defaults.

The authors were aware of other diffusion-based probabilistic models, as highlighted in the introduction. Unlike our lightweight model, which models the transition between adjacent time points, these selected works model the entire time window, requiring both high memory and computational complexity. With our computation budget restricted to a 32 GB GPU device, effectively training these diffusion models on large datasets with hundreds of features is challenging.

Additionally, several relevant works are related to our idea. For instance, Rasul et al. [2021a] incorporates the DDPM structure, aligning with our DDPM baseline; during inference, the prediction diffuses from pure noise to the target distribution. TimeDiff [Shishkov, 2023] introduces two modifications to the established diffusion model: during training it mixes target and context data, and it adds an AR model for a more precise initial prediction. Both of these can be incorporated into our model as well.

The existing probabilistic forecasters model the distribution of the next value from scratch, meaning they start with normal distribution in the case of normalizing flows and diffusion models or output parametric distribution in the case of transformers and deep AR. We propose modeling the transformation between the previously observed value and the next value we want to predict. We believe this is a more natural way to forecast which can be seen from requiring fewer solver steps to reach the target distribution.

The second row (DDPM) in Table 2 is an exact implementation of Rasul et al. [2021a]. The results might differ due to slightly different training setups, but all the models share the same training parameters, so the ranking should remain the same. We also include ND-sum and NRMSE-sum in the appendix for completeness.

Discussion of Vec-LSTM baseline

In terms of the neural network architecture, we use a similar architecture for the LSTM encoder. To be clear, however, Vec-LSTM [Salinas et al., 2019] and our SI framework are not the same, mainly due to the different ways of probabilistic modeling. Vec-LSTM considers a multivariate Gaussian distribution for the time points, where the mean and covariance matrices are modeled using separate LSTMs. In particular, the covariance matrix is modeled through a low-dimensional structure \bm{\Sigma}(\bm{h}_{t})=\bm{D}_{t}(\bm{h}_{t})+\bm{V}_{t}(\bm{h}_{t})\bm{V}_{t}(\bm{h}_{t})^{\intercal}, where \bm{h}_{t} is the latent variable from the LSTM. The SI framework does not explicitly model the output distribution in any parametric form. Instead, the latent variable from the RNN output is used as the condition variable to guide the diffusion model in equation 22. Thus, the RNN architectures in the two frameworks are not strictly comparable.
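For reference, the low-rank Gaussian output head of Vec-LSTM can be sketched as follows; the layer modules are assumed placeholders and this is not the exact implementation of Salinas et al. [2019].

```python
import torch
import torch.nn.functional as F

def lowrank_gaussian_head(h_t, mu_layer, d_layer, v_layer, rank):
    """Low-rank covariance parameterization Sigma(h_t) = D(h_t) + V(h_t) V(h_t)^T.
    mu_layer, d_layer, v_layer: assumed linear layers mapping the LSTM state
    h_t of shape (B, H) to the output dimension (and rank).
    """
    mu = mu_layer(h_t)                                # (B, D) mean
    diag = F.softplus(d_layer(h_t))                   # (B, D) positive diagonal of D
    V = v_layer(h_t).reshape(h_t.shape[0], -1, rank)  # (B, D, rank) low-rank factor
    return torch.distributions.LowRankMultivariateNormal(mu, V, diag)
```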

Appendix B Proof: Conditional Stochastic Interpolant

The proof is similar in spirit to Theorem 2 in Albergo et al. [2024]. The key difference is that we consider a continuous-time interpretation, which makes the analysis more accessible.

Proof [Proof of Theorem 1]

Given the conditional information \xi and {\mathbf{x}}^{s}={\mathbf{x}} simulated from equation 24, the conditional stochastic interpolant for equation 24 satisfies (where the index t is the diffusion-time index, not a date-time index):

\mathbb{E}[{\mathbf{x}}^{t}|{\mathbf{x}}^{s}={\mathbf{x}},\xi]=\mathbb{E}[\alpha_{t}{\mathbf{x}}^{0}+\beta_{t}{\mathbf{x}}^{1}+\gamma_{t}\mathbf{z}|{\mathbf{x}}^{s}={\mathbf{x}},\xi], (27)

where the expectation is taken over the density ({\mathbf{x}}^{0},{\mathbf{x}}^{1})\sim\rho({\mathbf{x}}^{0},{\mathbf{x}}^{1}|\xi), \xi\sim\eta(\xi), and \mathbf{z}\sim\mathcal{N}(\mathbf{0},\mathbf{I}).

We next show that equation 27 is a solution of the following stochastic differential equation:

{\mathrm{d}}\mathbb{E}[{\mathbf{x}}^{t}|{\mathbf{x}}^{s}={\mathbf{x}},\xi]={\mathbf{f}}_{t}({\mathbf{x}}){\mathrm{d}}t+\sigma_{t}{\mathrm{d}}{\mathrm{w}}^{t}, (28)

where {\mathbf{f}}_{t}({\mathbf{x}})=\mathbb{E}[\dot{\alpha_{t}}{\mathbf{x}}^{0}+\dot{\beta_{t}}{\mathbf{x}}^{1}|{\mathbf{x}}^{s}={\mathbf{x}},\xi] and \sigma_{t}=\sqrt{2{\gamma_{t}}\dot{\gamma_{t}}}.

To prove the above argument, we proceed to verify the drift and diffusion terms respectively:

  • Drift: It is straightforward to verify the drift {\mathbf{f}}_{t} by taking the derivative of the conditional expectation \mathbb{E}[\alpha_{t}{\mathbf{x}}^{0}+\beta_{t}{\mathbf{x}}^{1}|{\mathbf{x}}^{s}={\mathbf{x}},\xi] with respect to t.

  • Diffusion: For the diffusion term, the proof hinges on showing \sigma_{t}=\sqrt{2{\gamma_{t}}\dot{\gamma_{t}}}, which boils down to proving that the stochastic integral satisfies \int_{0}^{t}\sqrt{2{\gamma_{s}}\dot{\gamma_{s}}}{\mathrm{d}}{\mathrm{w}}^{s}=\gamma_{t}\mathbf{z} in distribution. Note that \mathbb{E}[\int_{0}^{t}\sqrt{2{\gamma_{s}}\dot{\gamma_{s}}}{\mathrm{d}}{\mathrm{w}}^{s}]=0. Invoking the Itô isometry, we have \text{Var}(\int_{0}^{t}\sqrt{2{\gamma_{s}}\dot{\gamma_{s}}}{\mathrm{d}}{\mathrm{w}}^{s})=\int_{0}^{t}2{\gamma_{s}}\dot{\gamma_{s}}{\mathrm{d}}s=\int_{0}^{t}({\gamma_{s}}^{2})^{\prime}{\mathrm{d}}s=\gamma_{t}^{2} (given \gamma_{0}=0). In other words, \int_{0}^{t}\sqrt{2{\gamma_{s}}\dot{\gamma_{s}}}{\mathrm{d}}{\mathrm{w}}^{s} is a normal random variable with mean 0 and variance \gamma_{t}^{2}, which proves that equation 27 is a solution of the stochastic differential equation 28.

Defining \Sigma_{t}=2{\gamma_{t}}\dot{\gamma_{t}}, the Fokker-Planck equation associated with equation 28 gives

0=ρtt+(𝐟tρt12Σtρt)=ρtt+((𝐟t12Σtlogρt)ρt)=ρtt+((𝔼[αt˙𝐱0+βt˙𝐱1|𝐱s=𝐱,ξ]γtγt˙logρt)ρt)=ρtt+(bt|s(𝐱,ξ)ρt),\begin{split}0&=\frac{\partial\rho_{t}}{\partial t}+\nabla\cdot\bigg{(}{\mathbf{f}}_{t}\rho_{t}-\frac{1}{2}\Sigma_{t}\nabla\rho_{t}\bigg{)}\\ &=\frac{\partial\rho_{t}}{\partial t}+\nabla\cdot\bigg{(}\bigg{(}{\mathbf{f}}_{t}-\frac{1}{2}\Sigma_{t}\nabla\log\rho_{t}\bigg{)}\rho_{t}\bigg{)}\\ &=\frac{\partial\rho_{t}}{\partial t}+\nabla\cdot\bigg{(}\bigg{(}\mathbb{E}[\dot{\alpha_{t}}{\mathbf{x}}^{0}+\dot{\beta_{t}}{\mathbf{x}}^{1}|{\mathbf{x}}^{s}={\mathbf{x}},\xi]-{\gamma_{t}}\dot{\gamma_{t}}\nabla\log\rho_{t}\bigg{)}\rho_{t}\bigg{)}\\ &=\frac{\partial\rho_{t}}{\partial t}+\nabla\cdot\big{(}b_{t|s}({\mathbf{x}},\xi)\rho_{t}\big{)},\end{split} (29)

where b_{t|s}({\mathbf{x}}|\xi)=\mathbb{E}[\dot{\alpha_{t}}{\mathbf{x}}^{0}+\dot{\beta_{t}}{\mathbf{x}}^{1}-{\gamma_{t}}\dot{\gamma_{t}}\nabla\log\rho_{t}|{\mathbf{x}}^{s}={\mathbf{x}},\xi].

Further setting s=t and writing b_{t}\equiv b_{t|t}, we have b_{t}({\mathbf{x}}|\xi)=\mathbb{E}[\dot{\alpha_{t}}{\mathbf{x}}^{0}+\dot{\beta_{t}}{\mathbf{x}}^{1}-{\gamma_{t}}\dot{\gamma_{t}}\nabla\log\rho_{t}|{\mathbf{x}}^{t}={\mathbf{x}},\xi].

Further define g^{(i)}_{t}({\mathbf{x}}|\xi)=\mathbb{E}[{\mathbf{x}}^{i}|{\mathbf{x}}^{t}={\mathbf{x}},\xi] for i\in\{0,1\}, and g^{(z)}_{t}({\mathbf{x}}|\xi)=\mathbb{E}[\mathbf{z}|{\mathbf{x}}^{t}={\mathbf{x}},\xi]. We have that

b_{t}({\mathbf{x}}|\xi)=\mathbb{E}[\dot{\alpha_{t}}{\mathbf{x}}^{0}+\dot{\beta_{t}}{\mathbf{x}}^{1}-{\gamma_{t}}\dot{\gamma_{t}}\nabla\log\rho_{t}|{\mathbf{x}}^{t}={\mathbf{x}},\xi]
=\dot{\alpha_{t}}g^{(0)}+\dot{\beta_{t}}g^{(1)}+\dot{\gamma_{t}}g^{(z)}
=\mathbb{E}[\dot{\alpha_{t}}{\mathbf{x}}^{0}+\dot{\beta_{t}}{\mathbf{x}}^{1}+\dot{\gamma_{t}}\mathbf{z}|{\mathbf{x}}^{t}={\mathbf{x}},\xi],

where the first equality follows from equation 29 and the last one follows by taking the derivative of equation 27 with respect to the index t.

We also observe that \nabla\log\rho_{t}=-\gamma_{t}^{-1}\mathbb{E}[\mathbf{z}|{\mathbf{x}}^{t}={\mathbf{x}},\xi].

Theorem 2.

The loss functions used for estimating the vector fields are

L_{i}(\hat{g}^{(i)})=\int_{0}^{1}\mathbb{E}[|\hat{g}^{(i)}|^{2}-2{\mathbf{x}}^{i}\cdot\hat{g}^{(i)}]{\mathrm{d}}t,

where i\in\{0,1,z\} (with {\mathbf{x}}^{z}\equiv\mathbf{z}), and the expectation is taken over the density ({\mathbf{x}}^{0},{\mathbf{x}}^{1})\sim\rho({\mathbf{x}}^{0},{\mathbf{x}}^{1}|\xi), \xi\sim\eta(\xi), and \mathbf{z}\sim\mathcal{N}(\mathbf{0},\mathbf{I}).

Proof  To show that the loss is effective for estimating g^{(0)}, g^{(1)}, and g^{(z)}, it suffices to show that

L_{0}(\hat{g}^{(0)})=\int_{0}^{1}\mathbb{E}[|\hat{g}^{(0)}|^{2}-2{\mathbf{x}}^{0}\cdot\hat{g}^{(0)}]{\mathrm{d}}t
=\int_{0}^{1}\int_{\mathbb{R}^{D}}\Big{[}|\hat{g}^{(0)}|^{2}-2\mathbb{E}[{\mathbf{x}}^{0}|{\mathbf{x}}^{t}={\mathbf{x}},\xi]\cdot\hat{g}^{(0)}\Big{]}\rho_{t}({\mathbf{x}}){\mathrm{d}}{\mathbf{x}}{\mathrm{d}}t
=\int_{0}^{1}\int_{\mathbb{R}^{D}}\Big{[}|\hat{g}^{(0)}|^{2}-2g^{(0)}\cdot\hat{g}^{(0)}\Big{]}\rho_{t}({\mathbf{x}}){\mathrm{d}}{\mathbf{x}}{\mathrm{d}}t,

where the last equality follows by definition. The unique minimizer is attained by setting \hat{g}^{(0)}=g^{(0)}.

The proofs for g^{(1)} and g^{(z)} follow in a similar fashion.

Appendix C Experiment details

C.1 Time series data

The time series datasets include Solar [Lai et al., 2018], Exchange [Lai et al., 2018], Traffic (https://archive.ics.uci.edu/ml/datasets/PEMS-SF), and Wikipedia (https://github.com/mbohlkeschneider/gluon-ts/tree/mv_release/datasets). We follow the preprocessing steps of Salinas et al. [2019]. Details of the datasets are listed in Table 3.

Table 3: Properties of the datasets.
Datasets Dimension D Frequency Total time points Prediction length P
Exchange 8 Daily 6,071 30
Solar 137 Hourly 7,009 24
Traffic 963 Hourly 4,001 24
Wiki 2000 Daily 792 30

The probabilistic forecasts are evaluated by the Continuous Ranked Probability Score (CRPS-sum) [Koochali et al., 2022], the normalized root mean square error computed from the median of the samples (NRMSE), and the normalized deviation point metric (ND). The metric calculation is provided by the gluonts package [Alexandrov et al., 2020] via the module gluonts.evaluation.MultivariateEvaluator.

C.2 Models and hyperparameters

Baseline models such as DDPM, SGM, FM, and SI all involve modeling field functions, where the inputs are the state vector (in the same space as the data samples), the diffusion time, and the condition embedding, and the output is used to generate the sample. The field functions correspond to the noise prediction function in DDPM, the score function in SGM, the vector field in FM, and the velocity and score functions in SI. To make a fair comparison between these models, we use the same neural networks for all of them.

In the synthetic dataset experiments, we model the field functions with a 4-layer ResNet, where each layer has 256 intermediate dimensions. The batch size is 10,000 for all models and each model is trained for 20,000 iterations. The learning rate is 10^{-3}.

In the time series forecasting experiments, the RNN for the history encoder has 1 layer and 128 latent dimensions; the field function is modeled with a U-Net-like structure [Ronneberger et al., 2015] with 8 residual blocks, where each block has 64 dimensions. To stabilize the training, we also use paired sampling for the stochastic interpolants, introduced by Albergo et al. [2023, Appendix C]:

\mathbf{x}^{s}=\alpha(s)\mathbf{x}^{0}+\beta(s)\mathbf{x}^{1}+\gamma(s)\mathbf{z},
{\mathbf{x}^{s}}^{\prime}=\alpha(s)\mathbf{x}^{0}+\beta(s)\mathbf{x}^{1}+\gamma(s)(-\mathbf{z}),
\quad s\in[0,1],\;\mathbf{z}\sim\mathcal{N}(\mathbf{0},\mathbf{I}).

The baseline models are trained for 200 epochs with batch size 64 and learning rate 10^{-3}. The SI model is trained for 100 epochs with batch size 128 and learning rate 10^{-4}. We find that if the learning rate is too large, SI may not converge properly.

C.3 Importance sampling

The loss functions for training the velocity and score functions are

\mathcal{L}_{b}=\int_{0}^{1}\mathbb{E}\Big{[}\frac{1}{2}\|\hat{\mathbf{b}}(s,\mathbf{x}^{s})\|^{2}-\big{(}\dot{\alpha}(s)\mathbf{x}^{0}+\dot{\beta}(s)\mathbf{x}^{1}+\dot{\gamma}(s)\mathbf{z}\big{)}^{T}\hat{\mathbf{b}}(s,\mathbf{x}^{s})\Big{]}{\mathrm{d}}s, (30)
\mathcal{L}_{s}=\int_{0}^{1}\mathbb{E}\Big{[}\frac{1}{2}\|\hat{\mathbf{s}}(s,\mathbf{x}^{s})\|^{2}+\gamma^{-1}\mathbf{z}^{T}\hat{\mathbf{s}}(s,\mathbf{x}^{s})\Big{]}{\mathrm{d}}s.

Both loss functions involve an integral over the diffusion time s\in[0,1] of the form

\mathcal{L}=\int_{0}^{1}l(s){\mathrm{d}}s\approx\sum_{i}l(s_{i}),\quad s_{i}\sim\mathrm{Uniform}[0,1]. (31)

However, the loss value l(s) has a large variance, especially when s is near 0 or 1. Figure 6 shows an example of the distribution of l(s) across multiple s. The large variance slows down the convergence of training. To overcome this issue, we apply importance sampling, a technique similar to Song et al. [2021, Sec. 5.1], to stabilize the training. Instead of drawing the diffusion time from a uniform distribution, importance sampling considers

\mathcal{L}=\int_{0}^{1}l(s){\mathrm{d}}s\approx\sum_{i}\frac{l(s_{i})}{\tilde{q}(s_{i})},\quad s_{i}\sim\tilde{q}(s). (32)

Ideally, one wants to keep l(s_{i})/\tilde{q}(s_{i}) as constant as possible so that the variance of the estimate is minimized. Since the loss value l(s) is very large when s is close to 0 or 1, relatively flat in the middle, and the domain of s is [0,1], we choose the Beta distribution \mathrm{Beta}(s;0.1,0.1) as the proposal distribution \tilde{q}. As shown in Figure 6, the values of l(s_{i})/\tilde{q}(s_{i}) plotted against s become concentrated in a small range.
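A small sketch of the importance-sampling estimator in equation 32 with the Beta(0.1, 0.1) proposal; per_sample_loss is a placeholder for l(s) evaluated on each sampled diffusion time.

```python
import torch

def importance_sampled_loss(per_sample_loss, batch_size, a=0.1, b=0.1):
    """Monte Carlo estimate of equation 32: draw s ~ Beta(0.1, 0.1), which
    puts more mass near 0 and 1 where l(s) is large, and reweight by 1/q(s).
    """
    q = torch.distributions.Beta(a, b)
    s = q.sample((batch_size,))
    weights = 1.0 / q.log_prob(s).exp()       # 1 / \tilde{q}(s_i)
    return (weights * per_sample_loss(s)).mean()
```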

Figure 6: Comparison between uniform sampling and importance sampling. Each dot represents the loss of one sample with respect to the diffusion time.

C.4 Additional forecasting results

Exchange rate Solar Traffic Wiki
DDPM 0.011±0.004 0.377±0.061 0.064±0.014 0.093±0.023
FM 0.011±0.001 0.445±0.031 0.041±0.002 80.624±89.804
SGM 0.01±0.002 0.388±0.026 0.08±0.053 0.122±0.026
SI 0.008±0.002 0.399±0.065 0.089±0.006 0.091±0.011
Table 4: ND-sum. A smaller number indicates better performance.
Exchange rate Solar Traffic Wiki
DDPM 0.013±0.005 0.72±0.08 0.094±0.029 0.123±0.026
FM 0.014±0.002 0.849±0.072 0.059±0.007 165.128±147.682
SGM 0.019±0.004 0.76±0.066 0.109±0.064 0.164±0.03
SI 0.01±0.003 0.722±0.132 0.127±0.003 0.117±0.011
Table 5: NRMSE-sum. A smaller number indicates better performance.

C.5 Baseline Model using Unconditional SI

Additionally, we introduced a new experiment to verify the necessity of conditional SI over unconditional SI. The unconditional SI diffuses from pure noise and does not utilize the prior distribution from the previous time point. In this case, the context for the prediction is provided exclusively by the RNN encoder. The new results are shown in the following tables. When compared with the conditional SI framework, the unconditional model shows slightly inferior performance.

Exchange rate Solar Traffic
SI 0.007±0.001 0.359±0.06 0.083±0.005
Vanilla SI 0.010±0.001 0.383±0.010 0.082±0.006
Table 6: CRPS-sum metric on multivariate probabilistic forecasting. A smaller number indicates better performance.
Exchange rate Solar Traffic
SI 0.008±0.002 0.399±0.065 0.089±0.006
Vanilla SI 0.010±0.003 0.430±0.113 0.093±0.007
Table 7: ND-sum. A smaller number indicates better performance.
Exchange rate Solar Traffic
SI 0.010±0.003 0.722±0.132 0.127±0.003
Vanilla SI 0.012±0.003 0.815±0.135 0.132±0.015
Table 8: NRMSE-sum. A smaller number indicates better performance.