

Channel-aware Contrastive Conditional Diffusion for Multivariate Probabilistic Time Series Forecasting

Siyang Li1,  Yize Chen2,  Hui Xiong1*
1Hong Kong University of Science and Technology (Guangzhou)
2University of Alberta
[email protected], [email protected],
[email protected]
*Corresponding author.
Abstract

Forecasting faithful trajectories of multivariate time series in practical settings is essential for reasonable decision-making. Recent methods mainly tailor generative conditional diffusion models to estimate the target temporal predictive distribution. However, it remains challenging to exploit the implicit predictive information in past observations efficiently enough to bolster conditional diffusion learning. To this end, we propose a generic channel-aware Contrastive Conditional Diffusion model termed CCDM to achieve desirable multivariate probabilistic forecasting, obviating the need for curated temporal conditioning inductive biases. In detail, we first design a channel-centric conditional denoising network to manage intra-variate variations and cross-variate correlations, which leads to scalability across diverse prediction horizons and channel numbers. Then, we devise a denoising-based temporal contrastive learning scheme to explicitly amplify the predictive mutual information between past observations and future forecasts. It coherently complements naive step-wise denoising diffusion training and improves forecasting accuracy and generality on unknown test time series. Besides, we offer theoretical insights into the benefits of such auxiliary contrastive training refinement from both neural mutual information and temporal distribution generalization aspects. The proposed CCDM exhibits superior forecasting capability compared to current state-of-the-art diffusion forecasters over a comprehensive benchmark, achieving the best MSE and CRPS outcomes on 66.67% and 83.33% of cases, respectively. Our code is publicly available at https://github.com/LSY-Cython/CCDM.

1 Introduction

Multivariate probabilistic time series forecasting aims to quantify the stochastic temporal evolution of multiple continuous variables and benefits decision-making in various engineering fields, such as weather prediction (Li et al., 2024), renewable energy dispatch (Dumas et al., 2022), traffic planning (Huang et al., 2023) and financial trading (Gao et al., 2024). Modern methods mainly customize time series generative models (Salinas et al., 2019; Li et al., 2022; Yoon et al., 2019; Rasul et al., 2020) and produce diverse plausible trajectories to decipher the intricate temporal predictive distribution conditioned on past observations. Owing to the excellent mode coverage capacity and training stability of diffusion models (Song et al., 2020; Ho et al., 2020), a flurry of conditional diffusion forecasters (Lin et al., 2023; Yang et al., 2024) has recently been developed by designing effective temporal conditioning mechanisms to discover informative patterns from historical time series.

Despite current advances, two open challenges remain in learning a precise and generalizable multivariate predictive distribution via the step-wise denoising paradigm. The first obstacle is how to represent multivariate temporal correlations in both past observed and target denoised sequences. To this end, many diffusion forecasting methods focus on efficient architectural designs for conditional denoising networks to capture heterogeneous temporal correlations in both diffusion-corrupted and pure conditioning time series. Among them, spatiotemporal attention layers in CSDI (Tashiro et al., 2021) and structured state space modules in SSSD (Alcaraz & Strodthoff, 2022) are employed to characterize intra-channel and inter-channel relations (a channel carries the same physical interpretation as a variate, with each channel indicating a univariate time series). LDT (Feng et al., 2024) and D3VAE (Li et al., 2022) utilize latent diffusion models to handle high-dimensional sequences. However, due to their limited capacity to identify channel-specific and cross-channel properties, these temporal denoisers cannot deliver scalability and reliability on difficult prediction tasks with numerous channels or long horizons. Inspired by the recent success of channel-centric views in multivariate point forecasting (Chen et al., 2024a; Liu et al., 2023), we propose to embed a unified channel manipulation strategy into the conditional denoising network, as depicted in Fig. 3, where we stack channel-independent and channel-mixing modules to capture univariate variations and inter-variate correlations.

Figure 1: The schematic of the proposed information-theoretic denoising-based contrastive diffusion learning. Bi-directional arrows indicate that the two learning schemes are complementary. The bar chart depicts the average gains from contrastive diffusion refinement across diverse prediction setups on six datasets.

The second issue is how to enhance the exploitation efficiency of the implicit predictive information hidden in the provided time series, which can improve the diversity and accuracy of generated profiles. It has been revealed that learning to unveil useful temporal information such as decomposed patterns (Deng et al., 2024) or spectral biases (Crabbé et al., 2024) can improve the estimated predictive distribution. Related diffusion forecasters also develop auxiliary training strategies to amplify fine-grained features, as the naive step-wise noise regression paradigm falls short in fully releasing the intrinsic predictive information. In particular, they employ specific temporal inductive biases to promote temporal conditioning schemes or guide iterative inference procedures. Pretraining temporal conditioning encoders by deterministic point prediction (Shen & Kwok, 2023; Li et al., 2023) is one viable method, which produces more accurate medians and sharper prediction intervals. Coupling unique temporal features such as multi-granularity dynamics (Fan et al., 2024; Shen et al., 2023) or target quantitative metrics (Kollovieh et al., 2024) to regularize the sequential diffusion process can also steer the reverse generation process towards plausible trajectories. However, these auxiliary methods need to fully expose task-specific temporal properties or tailor distinct regulations for diffusion training and sampling procedures. They are neither consistent with standard hierarchical diffusion optimization nor generic across generative time series diffusion models.

Motivated by the neural information view in (Tsai et al., 2020), naive conditional time series diffusion learning can be deemed a forward predictive way to maximize the temporal mutual information between past observations and target forecasts. Such auxiliary inductive biases can empirically enrich the predictive temporal information. However, predictive learning alone is inadequate to reveal the full task-specific information. In light of the composite objective which integrates contrastive learning to procure more robust task-related representations (Tsai et al., 2020), we propose to further enhance the prediction-related mutual information captured by denoising diffusion in a complementary contrastive way, where both positive and negative time series are inspected at each diffusion step. We illustrate such temporal contrastive refinement of conditional diffusion forecasting in Fig. 1; it mitigates over-fitting and attains better generality on unknown test data.

In this work, we propose a contrastive conditional diffusion model termed CCDM which can explicitly maximize the predictive mutual information for multivariate probabilistic forecasting. The efficient channel-aware denoiser architecture and the complementary denoising-based contrastive refinement are two recipes to boost diffusion forecasting capacity. Our main contributions are summarized as follows: (1) We design a composite channel-aware conditional denoising network, which merges channel-independent dense encoders to extract univariate dynamics and channel-wise diffusion transformers to aggregate cross-variate correlations. It gives rise to efficient iterative inference and better scalability over various channel numbers and prediction horizons. (2) We propose to explicitly amplify the predictive information between generated forecasts and past observations via a coherent denoising-based temporal contrastive learning, which can be seamlessly aligned with vanilla step-wise denoising diffusion training and is thus efficient to implement. (3) Extensive simulations validate the superior forecasting capability of CCDM. It attains better accuracy and reliability than other strong models in various forecasting settings, especially in long-term and large-channel scenarios.

2 Preliminaries

2.1 Problem formulation

In this paper, we look into the task of multivariate probabilistic time series forecasting. Given the past observations $\mathbf{x}\in\mathbb{R}^{L\times D}$ as the conditioning time series, the goal is to generate a group of $S$ plausible forecasts $\{\hat{\mathbf{y}}_{0}^{(s)}\in\mathbb{R}^{H\times D}\}_{s=1}^{S}$ from the learned conditional predictive distribution $p_{\theta}(\mathbf{y}_{0}|\mathbf{x})$. Here, $D$ is the number of channels, and $L$ and $H$ denote the lookback window length and prediction horizon, respectively. $\theta$ stands for the parameters of a conditional diffusion forecaster which approximates the real predictive distribution $q(\mathbf{y}_{0}|\mathbf{x})$. We allocate diverse values to the horizon $H$ and channel number $D$ to construct a holistic benchmark that comprehensively evaluates the capability of different conditional diffusion models in various forecasting scenarios.

2.2 Conditional denoising diffusion models

Conditional diffusion models have exhibited impressive capability on a wide variety of controllable multi-modal synthesis tasks (Chen et al., 2024b). They dictate a bi-directional distribution transport process between raw data $\mathbf{y}_{0}$ and prior Gaussian noise $\mathbf{y}_{K}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ over $K$ diffusion steps. The forward process gradually degrades clean $\mathbf{y}_{0}$ to fully noisy $\mathbf{y}_{K}$ and can be fixed as a Markov chain: $q(\mathbf{y}_{0:K})=q(\mathbf{y}_{0})\prod_{k=1}^{K}q(\mathbf{y}_{k}|\mathbf{y}_{k-1})$, where $q(\mathbf{y}_{k}|\mathbf{y}_{k-1}):=\mathcal{N}(\mathbf{y}_{k};\sqrt{1-\beta_{k}}\mathbf{y}_{k-1},\beta_{k}\mathbf{I})$ and $\beta_{k}$ is the degree of imposed step-wise Gaussian noise. We can accelerate the forward sampling procedure and obtain the closed-form latent state $\mathbf{y}_{k}$ at an arbitrary step $k$ by a noteworthy property (Ho et al., 2020): $\mathbf{y}_{k}=\sqrt{\bar{\alpha}_{k}}\mathbf{y}_{0}+\sqrt{1-\bar{\alpha}_{k}}\bm{\epsilon}$, where $\bar{\alpha}_{k}:=\prod_{s=1}^{k}(1-\beta_{s})$ and $\bm{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$. The reverse generation process converts the known Gaussian prior into realistic prediction data $\mathbf{y}_{0}$ given input conditions $\mathbf{x}$, and can be cast as a parameterized Markov chain: $p_{\theta}(\mathbf{y}_{0:K}|\mathbf{x})=p(\mathbf{y}_{K})\prod_{k=K}^{1}p_{\theta}(\mathbf{y}_{k-1}|\mathbf{y}_{k},\mathbf{x})$. The overall training objective can be simplified to minimizing the step-wise denoising loss below:

\mathcal{L}_{k}^{denoise}=\mathbb{E}_{\mathbf{y}_{0},\mathbf{x},\bm{\epsilon}}\left[\left\|\bm{\epsilon}-\bm{\epsilon}_{\theta}\left(\sqrt{\bar{\alpha}_{k}}\mathbf{y}_{0}+\sqrt{1-\bar{\alpha}_{k}}\bm{\epsilon},\mathbf{x},k\right)\right\|_{2}^{2}\right]. \quad (1)
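For concreteness, a minimal PyTorch sketch of this step-wise denoising loss is given below. The tensor shapes, the function name, and the denoiser call signature $\bm{\epsilon}_{\theta}(\mathbf{y}_{k},\mathbf{x},k)$ are illustrative assumptions rather than the exact implementation used later in CCDM.

```python
import torch

def denoising_loss(eps_theta, y0, x, alpha_bar, k):
    """Step-wise denoising objective of Eq. (1) for a batch of windows.

    eps_theta: conditional denoiser taking (noisy target, condition, step index).
    y0:        clean future windows, shape (B, H, D).
    x:         past observations, shape (B, L, D).
    alpha_bar: cumulative noise schedule bar{alpha}_k for all K steps, shape (K,).
    k:         sampled diffusion step indices, shape (B,).
    """
    eps = torch.randn_like(y0)                            # epsilon ~ N(0, I)
    a = alpha_bar[k].view(-1, 1, 1)                       # broadcast over (H, D)
    y_k = a.sqrt() * y0 + (1.0 - a).sqrt() * eps          # closed-form forward sample of y_k
    pred = eps_theta(y_k, x, k)                           # predicted noise
    return ((pred - eps) ** 2).flatten(1).sum(-1).mean()  # squared L2 error, averaged over the batch
```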

A potential issue for current conditional diffusion models lies in forging an effective conditioning mechanism that can enhance the alignment between given conditions $\mathbf{x}$ and produced data $\mathbf{y}_{0}$, like the coherent semantics between textual descriptions and visual renderings (Esser et al., 2024), or the conformity of generated vehicle motions to scenario constraints (Jiang et al., 2023). However, such data consistency is hard to represent for temporal conditional probability modeling. We thus explicitly learn to amplify the prediction-related temporal information conveyed from past conditioning time series to generated trajectories. Such predictive mutual information can reflect underlying temporal properties in historical sequences, with which the produced forecasts should comply.

2.3 Neural mutual information maximization

As discussed above, to more efficiently represent the useful predictive modes involved in the conditioning time series, we choose to explicitly maximize the prediction-oriented mutual information when learning the conditional diffusion forecaster. Learning to maximize mutual information is effective in boosting the consistency between two associated variables (Song & Ermon, 2019), and has been actively applied to self-supervised learning (Liang et al., 2024b) and multi-modal alignment (Liang et al., 2024a). Regarding conditional diffusion learning, there also exist several related works (Wang et al., 2023; Zhu et al., 2022) which explicitly employ mutual information maximization to enhance high-level semantic coherence between input prompts and generated samples. In contrast, we propose a complementary way to equip the conditional diffusion forecaster with this tool to bolster the utilization of informative temporal patterns. Besides, we provide a distinct composite loss design and more concrete interpretations of the benefits of the contrastive scheme for ordinary conditional diffusion.

Among the two practical methods to maximize the intractable mutual information (Tsai et al., 2020), contrastive learning strengthens the association by discriminating intra-class from inter-class samples. Contrastive predictive coding (Oord et al., 2018) realizes this objective by optimizing a low-variance contrastive lower bound via the prevalent InfoNCE loss:

\mathcal{L}_{InfoNCE}=-\mathbb{E}_{(\mathbf{y}_{0},\mathbf{x})\sim q(\mathbf{y}_{0},\mathbf{x}),\,\mathbf{y}_{0}^{(n)}\sim q^{n}(\mathbf{y}_{0})}\left[\log\frac{f(\mathbf{y}_{0},\mathbf{x})}{f(\mathbf{y}_{0},\mathbf{x})+\sum_{n=1}^{N}f(\mathbf{y}_{0}^{(n)},\mathbf{x})}\right]. \quad (2)

During each iteration, we create a set of $N$ negative samples via the negative construction operation $q^{n}(\mathbf{y}_{0})$ applied to the positive data $\mathbf{y}_{0}$. $f(\mathbf{y}_{0},\mathbf{x})$ accounts for the density ratio $\frac{q(\mathbf{y}_{0}|\mathbf{x})}{q(\mathbf{y}_{0})}$ and can be any positive real-valued function. This flexible form of the density ratio function offers a natural motivation for the following denoising-based contrastive conditional diffusion.
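The sketch below shows how Eq. 2 can be evaluated once the positive and negative density-ratio scores are available. It is a generic InfoNCE estimator under the assumption that the scores are already positive-valued, not the specific loss used by CCDM later (which is given in Eq. 4).

```python
import torch

def info_nce(f_pos, f_neg):
    """Generic InfoNCE estimate of Eq. (2).

    f_pos: positive density-ratio scores f(y0, x), shape (B,).
    f_neg: scores of the N negatives f(y0^(n), x), shape (B, N).
    Both are assumed to be positive-valued already (e.g. exponentials of a similarity).
    """
    denom = f_pos + f_neg.sum(dim=-1)        # f(y0, x) + sum_n f(y0^(n), x)
    return -(f_pos / denom).log().mean()     # negative log ratio, averaged over the batch
```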

Forward predictive learning is another way to boost the inter-dependency by fully reconstructing the target $\mathbf{y}_{0}$ conditioned on the given $\mathbf{x}$. This reconstruction learning can be realized by learning a deterministic mapping or a conditional generative model from $\mathbf{x}$ to $\mathbf{y}_{0}$. As $I(\mathbf{y}_{0};\mathbf{x})=H(\mathbf{y}_{0})-H(\mathbf{y}_{0}|\mathbf{x})$ and $H(\mathbf{y}_{0})$ is irrelevant to discovering the entanglement between $\mathbf{x}$ and $\mathbf{y}_{0}$, maximizing $I(\mathbf{x};\mathbf{y}_{0})$ boils down to optimizing the predictive lower bound $\mathbb{E}_{q(\mathbf{x},\mathbf{y}_{0})}[\log{p_{\theta}(\mathbf{y}_{0}|\mathbf{x})}]\leq-H(\mathbf{y}_{0}|\mathbf{x})$, which is aligned with the likelihood-based objective of naive conditional diffusion learning. (Tsai et al., 2020) claims that combining both predictive and contrastive learning tactics can significantly raise the quality of the obtained task-related features. Accordingly, we equip vanilla conditional time series diffusion with a denoising-based InfoNCE contrastive loss to further boost the temporal predictive information between past conditions and future forecasts. A concise motivation of this information-theoretic contrastive diffusion forecasting is depicted in Fig. 1.

3 Method: Channel-aware contrastive conditional diffusion

Figure 2: The framework of the denoising-based contrastive conditional diffusion forecaster.
Figure 3: The diagram of the channel-aware conditional denoiser architecture. Left: the whole network. Middle: channel-mixing DiT blocks. Right: channel-independent MLP dense modules.

In this section, we elucidate two innovations of the tailored CCDM for generative multivariate time series forecasting, including the hybrid channel-aware denoiser architecture depicted in Fig. 3 and denoising-based contrastive diffusion learning demonstrated in Fig. 2.

3.1 Channel-aware conditional denoising network

Recent progress in multivariate prediction methods (Liu et al., 2023; Ilbert et al., 2024) shows that proper integration of channel management strategies into time series backbones is critical to discover univariate dynamics and cross-variate correlations. However, this problem has not been well explored in multivariate diffusion forecasting, and previous conditional denoiser structures do not explicitly distinguish such heterogeneous channel-centric temporal properties. To this end, we design a channel-aware conditional denoising network which incorporates composite channel manipulation modules, i.e., channel-independent dense encoders and channel-mixing diffusion transformers. This architecture can efficiently represent intra-variate and inter-variate temporal correlations in the past conditioning $\mathbf{x}$ and future predicted $\mathbf{y}_{0}$ under different noise levels, while remaining robust to diverse prediction horizons and channel numbers.

Channel-independent dense encoders. We develop two channel-independent MLP encoders to extract unique temporal variations in each individual channel of the observed condition $\mathbf{x}$ and the corrupted latent state $\mathbf{y}_{k}$ at each diffusion step. The core ingredient in the latent and condition encoders is the channel-independent dense module (CiDM) borrowed from TiDE (Das et al., 2023a), which stands out as a potent MLP building block for universal time series analysis models (Das et al., 2023b). A salient element in CiDM is the skip-connected MLP residual block, which improves temporal pattern expressivity. The $D$ parallel linear layers share weights and are used for separate channel feature embedding. We stack $n_{enc}$ CiDM modules of hidden dimension $e_{hid}$ to transform both $\mathbf{x}$ and $\mathbf{y}_{k}$ into size $e_{hid}\times D$. These two input encoders can be easily adjusted to accommodate different context windows and hidden feature dimensions.
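A minimal sketch of such a channel-independent residual block is shown below. It loosely follows the TiDE-style residual MLP described above; the exact layer widths, dropout, and normalization placement inside CCDM's encoders are assumptions made for illustration.

```python
import torch.nn as nn

class CiDM(nn.Module):
    """Channel-independent dense module: a residual MLP applied to each channel's
    temporal feature vector separately, with weights shared across all D channels."""

    def __init__(self, in_dim, hid_dim, out_dim, dropout=0.1):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hid_dim), nn.ReLU(),
            nn.Linear(hid_dim, out_dim), nn.Dropout(dropout),
        )
        self.skip = nn.Linear(in_dim, out_dim)  # skip connection of the MLP residual block
        self.norm = nn.LayerNorm(out_dim)

    def forward(self, z):
        # z: (B, D, in_dim); the same linear layers act on every channel independently
        return self.norm(self.mlp(z) + self.skip(z))
```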

Channel-wise diffusion transformers. To regress the step-wise Gaussian noise $\bm{\epsilon}_{k}$ on raw $\mathbf{y}_{0}$ more precisely, we should fully exploit the implicit temporal information in the pure conditioning $\mathbf{x}$ and the polluted target $\mathbf{y}_{k}$. We concatenate the latent encodings of $\mathbf{x}$ and $\mathbf{y}_{k}$ along the channel axis and then leverage $n_{att}$-depth channel-wise diffusion transformer (DiT) blocks to aggregate heterogeneous temporal modes from various channels. DiT is an emergent diffusion backbone for open-ended text-to-image synthesis which offers eminent efficiency, scalability and robustness (Peebles & Xie, 2023; Esser et al., 2024). Two critical components in DiT are multi-head self-attention for feature fusion and adaptive layer norm (adaLN) layers that absorb other conditioning items (e.g., the diffusion step embedding or text labels) as learnable scale and shift parameters. Although DiT has been adopted by a few works (Cao et al., 2024; Feng et al., 2024) for generative time series modeling, they do not adapt it from a fully channel-centric angle. We repurpose DiT to model the multivariate predictive distribution by replacing naive point-wise attention with a channel-wise version, which can blend correlated temporal information from different variates in $\mathbf{x}$ and $\mathbf{y}_{k}$. Afterwards, we develop an output decoder with $n_{dec}$ CiDMs plus a final adaLN to yield the prediction of the imposed noise $\bm{\epsilon}_{k}$ given $\mathbf{x}$ and $\mathbf{y}_{k}$.
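To make the channel-wise attention concrete, the sketch below adapts a DiT block so that tokens correspond to channels and adaLN modulation comes from the diffusion-step embedding. The scale/shift/gate parameterization follows the public DiT design, while the token layout (concatenated condition and latent channels) and the dimensions are our assumptions, not the exact CCDM configuration.

```python
import torch.nn as nn

class ChannelDiTBlock(nn.Module):
    """DiT-style block where self-attention mixes information across channels
    (tokens = channels), with adaLN modulation from the diffusion-step embedding."""

    def __init__(self, dim, n_heads, cond_dim):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # adaLN: the step embedding produces per-block scale/shift/gate parameters
        self.ada = nn.Linear(cond_dim, 6 * dim)

    def forward(self, z, step_emb):
        # z: (B, 2D, dim) channel tokens (condition + latent); step_emb: (B, cond_dim)
        s1, b1, g1, s2, b2, g2 = self.ada(step_emb).chunk(6, dim=-1)
        h = self.norm1(z) * (1 + s1.unsqueeze(1)) + b1.unsqueeze(1)
        z = z + g1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]  # channel-wise attention
        h = self.norm2(z) * (1 + s2.unsqueeze(1)) + b2.unsqueeze(1)
        return z + g2.unsqueeze(1) * self.mlp(h)
```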

3.2 Denoising-based temporal contrastive refinement

Unlike previous empirically designed temporal conditioning schemes for better exploitation of past predictive information, we propose to explicitly maximize the prediction-related mutual information $I(\mathbf{y}_{0};\mathbf{x})$ between past observations $\mathbf{x}$ and future forecasts $\mathbf{y}_{0}$ via an adapted denoising-based contrastive strategy. We employ the learnable denoising network $\bm{\epsilon}_{\theta}(\cdot)$ to represent the contrastive lower bound of $I(\mathbf{y}_{0};\mathbf{x})$ given by Eq. 2, and show that this information-theoretic contrastive refinement is complementary to and aligned with the original conditional denoising diffusion optimization, which is itself a forward predictive method to maximize $I(\mathbf{y}_{0};\mathbf{x})$.

To improve the diffusion forecasting capacity more fundamentally, the devised contrastive learning term should directly benefit the naive step-wise denoising training procedure, i.e., regularize the noise elimination behaviors of the conditional denoiser $\bm{\epsilon}_{\theta}(\cdot)$. Since the density ratio function $f(\mathbf{y}_{0},\mathbf{x})$ constituting the contrastive mutual information lower bound in Eq. 2 can take any positive-valued form, this flexibility naturally motivates us to prescribe $f(\cdot)$ using the step-wise denoising objective in Eq. 1, for both the positive sample $\mathbf{y}_{0}$ and a group of negative samples $\mathbf{y}_{0}^{(n)}$:

f_{k,\bm{\epsilon}'}(\mathbf{y}_{0},\mathbf{x};\theta)=\exp\left(-\|\bm{\epsilon}'-\bm{\epsilon}_{\theta}(\sqrt{\bar{\alpha}_{k}}\mathbf{y}_{0}+\sqrt{1-\bar{\alpha}_{k}}\bm{\epsilon}',\mathbf{x},k)\|^{2}_{2}/\tau\right); \quad (3a)
f_{k,\bm{\epsilon}'}(\mathbf{y}_{0}^{(n)},\mathbf{x};\theta)=\exp\left(-\|\bm{\epsilon}'-\bm{\epsilon}_{\theta}(\sqrt{\bar{\alpha}_{k}}\mathbf{y}_{0}^{(n)}+\sqrt{1-\bar{\alpha}_{k}}\bm{\epsilon}',\mathbf{x},k)\|^{2}_{2}/\tau\right); \quad (3b)

where $\tau$ is the temperature coefficient in the softmax-form contrastive loss. The negative sample set $\{\mathbf{y}_{0}^{(n)}\}_{n=1}^{N}$ is constructed by a hybrid time series augmentation method which alters both temporal variations and point magnitudes (see Appendix A.3 for details). Then, we can derive the following contrastive refinement loss, which is consistent with vanilla step-wise denoising diffusion training:

\mathcal{L}_{k}^{contrast}=-\mathbb{E}_{\mathbf{x},\mathbf{y}_{0},\{\mathbf{y}_{0}^{(n)}\}_{n=1}^{N},\bm{\epsilon}'}\left[\log\frac{f_{k,\bm{\epsilon}'}(\mathbf{y}_{0},\mathbf{x};\theta)}{f_{k,\bm{\epsilon}'}(\mathbf{y}_{0},\mathbf{x};\theta)+\sum_{n=1}^{N}f_{k,\bm{\epsilon}'}(\mathbf{y}_{0}^{(n)},\mathbf{x};\theta)}\right]. \quad (4)
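A possible implementation of Eqs. 3-4 is sketched below, evaluated in log space for numerical stability. The negative windows are assumed to be pre-generated by the augmentation of Appendix A.3, and the denoiser call signature and shapes are illustrative assumptions.

```python
import torch

def contrastive_loss(eps_theta, y0, y0_neg, x, alpha_bar, k, tau=0.1):
    """Denoising-based InfoNCE of Eqs. (3)-(4).

    y0:     positive future windows, (B, H, D).
    y0_neg: negative windows from the augmentation set, (B, N, H, D).
    The same noise draw eps' is shared by the positive and its negatives.
    """
    eps = torch.randn_like(y0)
    a = alpha_bar[k].view(-1, 1, 1)

    def log_score(y):                                      # log f_{k,eps'}(y, x; theta), Eq. (3)
        y_k = a.sqrt() * y + (1.0 - a).sqrt() * eps
        err = ((eps - eps_theta(y_k, x, k)) ** 2).flatten(1).sum(-1)
        return -err / tau

    log_pos = log_score(y0)                                                 # (B,)
    log_neg = torch.stack([log_score(y0_neg[:, n])
                           for n in range(y0_neg.size(1))], dim=-1)         # (B, N)
    all_logits = torch.cat([log_pos.unsqueeze(-1), log_neg], dim=-1)        # (B, 1+N)
    return -(log_pos - torch.logsumexp(all_logits, dim=-1)).mean()          # Eq. (4)
```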

The devised denoising-based temporal contrastive learning can not only seamlessly coordinate with standard diffusion training at each step $k$, but also improve the conditional denoiser's behavior in out-of-distribution (OOD) regions. These OOD areas are constituted by the low-density diffusion paths of negative samples, which are not touched by merely executing denoising learning along the high-density probability paths of positive samples.

3.3 Overall learning objective

The naive denoising diffusion model trained by log-likelihood maximization (Ho et al., 2020) comprises $K$ valid step-wise training terms. To align with this step-wise denoising distribution learning, we amortize the contrastive regularization in Eq. 4 over each training step and derive the overall learning objective below:

\max_{\theta}\;\mathbb{E}_{q(\mathbf{y}_{0},\mathbf{x})}\left[\log p_{\theta}(\mathbf{y}_{0}|\mathbf{x})+\lambda K\cdot I_{\theta}(\mathbf{y}_{0};\mathbf{x})\right], \quad (5)

where $\log p_{\theta}(\mathbf{y}_{0}|\mathbf{x})$ can be decomposed into the step-wise denoising terms $\sum_{k=1}^{K}\mathcal{L}_{k}^{denoise}$ and corresponds to the predictive distribution learning, whilst $\max_{\theta}I_{\theta}(\mathbf{y}_{0};\mathbf{x})$ governs the information-theoretic contrastive learning. The practical training loss of the devised CCDM at each diffusion step can then be presented as:

\mathcal{L}^{CCDM}_{k}=\mathbb{E}_{\mathbf{y}_{0},\mathbf{x},k\sim\mathrm{U}[1,K]}\left(\mathcal{L}_{k}^{denoise}+\lambda\mathcal{L}_{k}^{contrast}\right). \quad (6)

So far, we have obtained the overall step-wise training procedure for CCDM, which is a $\lambda$-weighted combination of the vanilla denoising term in Eq. 1 and the auxiliary contrastive term in Eq. 4. The whole training algorithm is clarified in Appendix A.4; it is efficient, end-to-end and seamlessly coupled with the original simplified denoising diffusion.
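Reusing the two loss sketches above, one training step of Eq. 6 might look as follows. The uniform step sampling matches the expectation in Eq. 6, while the default weight value and the function interface are assumptions for illustration rather than the released code.

```python
import torch

def ccdm_training_step(eps_theta, y0, y0_neg, x, alpha_bar, K, lam=1e-3):
    """One CCDM loss evaluation (Eq. 6): a lambda-weighted sum of the step-wise
    denoising loss and the denoising-based contrastive loss at a uniformly
    sampled diffusion step, reusing the denoising_loss/contrastive_loss sketches."""
    k = torch.randint(0, K, (y0.size(0),), device=y0.device)   # k ~ U[1, K], 0-indexed here
    return denoising_loss(eps_theta, y0, x, alpha_bar, k) \
        + lam * contrastive_loss(eps_theta, y0, y0_neg, x, alpha_bar, k)
```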

Theoretical insights. Beyond the concrete method described above, we offer a two-fold interpretation of how conditional diffusion forecasters can gain from auxiliary temporal contrastive learning. From the neural mutual information perspective, Eq. 1 and Eq. 4 train the parameterized conditional denoiser $\bm{\epsilon}_{\theta}(\cdot)$ to simultaneously optimize two complementary lower bounds (i.e., predictive and contrastive) of the prediction-related mutual information $I_{\theta}(\mathbf{y}_{0};\mathbf{x})$ between future forecasts and past conditioning time series. According to (Tsai et al., 2020), this composite learning method can enhance the representation efficiency of distilling task-related information from accessible conditions. The contrastive scheme assists $\bm{\epsilon}_{\theta}(\cdot)$ in learning from helpful negative instances, which yields useful discriminative temporal patterns for accurate multivariate predictive distribution recovery and mitigates over-fitting on historical training time series. From the distribution generalization perspective, explicitly optimizing the probabilities of unexpected negative samples exposes $\bm{\epsilon}_{\theta}(\cdot)$ to more OOD regions that the pure denoising fashion on positive in-distribution samples does not encompass. In time series forecasting, there always exists a distribution shift between unforeseen testing data and historical training data. The contrastive term in Eq. 4 intuitively minimizes the likelihood $\log p_{\theta}(\mathbf{y}_{0}^{(n)})$ of undesirable spurious forecasts by directly impeding $\bm{\epsilon}_{\theta}(\cdot)$ from correctly removing the noise over negative $\mathbf{y}_{0}^{(n)}$. This contrastive enforcement helps $\bm{\epsilon}_{\theta}(\cdot)$ avoid low-density areas and experience more OOD areas during in-distribution training. In light of the arguments in (Wu et al., 2024), promoting the denoiser's robustness in OOD regions at the testing stage is crucial for sampling plausible forecasts.

Moreover, we reveal the upper bound of the forecasting error on testing data for conditional diffusion models in Proposition 1. It obviously reflects that the efficacy of conditional diffusion forecasters is inextricably intertwined with the step-wise noise regression accuracy of the trained $\bm{\epsilon}_{\theta}(\cdot)$ on unknown test time series. In this regard, resorting to temporal contrastive refinement or other auxiliary training regimes is sensible to boost conditional denoiser behaviors and final prediction outcomes.

Proposition 1. Let $q^{te}(\mathbf{y}_{0}|\mathbf{x})$ be the ground-truth distribution of the test time series, and $p^{te}_{\theta}(\mathbf{y}_{0}|\mathbf{x})$ be the predictive distribution approximated by the developed conditional diffusion model. Let the KL-divergence between $q^{te}(\mathbf{y}_{0}|\mathbf{x})$ and $p^{te}_{\theta}(\mathbf{y}_{0}|\mathbf{x})$ represent the resulting probabilistic forecasting error. Then the denoising diffusion-induced forecasting error is upper-bounded:

\mathcal{D}_{KL}\left[q^{te}(\mathbf{y}_{0}|\mathbf{x})\,\|\,p^{te}_{\theta}(\mathbf{y}_{0}|\mathbf{x})\right]\leq\mathbb{E}_{\mathbf{x},\mathbf{y}_{0},\bm{\epsilon}_{k},k}\left[A_{k}\left\|\bm{\epsilon}_{\theta}\left(\sqrt{\bar{\alpha}_{k}}\mathbf{y}_{0}+\sqrt{1-\bar{\alpha}_{k}}\bm{\epsilon}_{k},\mathbf{x},k\right)-\bm{\epsilon}_{k}\right\|^{2}_{2}\right]+C. \quad (7)

This upper bound is determined by the denoising behavior of the learned $\bm{\epsilon}_{\theta}(\cdot)$ on unknown test time series. $A_{k}$ is a step-wise constant related to the noise schedule, and $C$ is a constant depending on the test data quantity. See Appendix A.1 for the proof.

4 Experiments

4.1 Experimental setup

Datasets. We choose six multivariate time series datasets, i.e., ETTh1, Exchange, Weather, Appliance, Electricity and Traffic, which cover a wide range of temporal dynamics and channel numbers $D$ to comprehensively gauge probabilistic forecasting performance. We manually establish a more comprehensive benchmark with diverse values of the lookback window $L$ and prediction horizon $H$, distinct from previous models which merely evaluate their generative forecasting capacity on a single short-term setup. Refer to Appendix A.5 for more details on the datasets.

Evaluation metrics. We adopt two standard metrics to assess the quality of both the probabilistic and deterministic forecasts derived from the generated trajectories. CRPS (Continuous Ranked Probability Score) assesses the reliability of the estimated predictive distribution, and MSE (Mean Squared Error) quantifies the accuracy of the resulting point forecasts. See Appendix A.6 for more details on the metrics.
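Since the exact metric definitions are deferred to Appendix A.6, the sketch below shows one common way to compute them from the $S$ generated trajectories: a sample-based (energy-form) CRPS estimator and the MSE of a median point forecast. Both are standard approximations offered only as an illustration of how such numbers can be obtained, not necessarily the paper's exact formulas.

```python
import numpy as np

def sample_crps(samples, target):
    """Sample-based CRPS approximation per point: E|Y - y| - 0.5 * E|Y - Y'|,
    with expectations replaced by averages over the S generated forecasts.

    samples: (S, H, D) generated trajectories; target: (H, D) ground truth."""
    term1 = np.abs(samples - target).mean(axis=0)                          # E|Y - y|
    term2 = np.abs(samples[:, None] - samples[None, :]).mean(axis=(0, 1))  # E|Y - Y'|
    return float((term1 - 0.5 * term2).mean())

def point_mse(samples, target):
    """MSE of a point forecast taken as the per-point median of the samples."""
    return float(((np.median(samples, axis=0) - target) ** 2).mean())
```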

Baselines. We select five remarkable recent denoising diffusion-based generative forecasters for comparison: TimeGrad (Rasul et al., 2021), CSDI (Tashiro et al., 2021), SSSD (Alcaraz & Strodthoff, 2022), TimeDiff (Shen & Kwok, 2023) and TMDM (Li et al., 2023). Since these models do not report results on long-term probabilistic forecasting scenarios, we fully reproduce them on the newly constructed benchmark.

Implementation details. By default, we execute the end-to-end contrastive diffusion training in Eq. 6 for 200 epochs. To reduce the contrastive learning costs in those cases which consume enormous computational resources, we also employ a cost-efficient two-stage training strategy. Concretely, we first pretrain a low-cost naive diffusion forecaster by Eq. 1 and then fine-tune it with the total contrastive objective in Eq. 6 for only 30 epochs. We keep the temperature coefficient $\tau=0.1$ and randomly generate $S=100$ multivariate profiles to compose prediction intervals. See Appendix A.7 for more details on the network architecture and contrastive training configurations in different forecasting setups. All experiments are conducted on a single NVIDIA A100 40GB GPU.
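A compact sketch of this two-stage schedule, reusing the per-step loss sketch from Sec. 3.3, is given below; the optimizer, learning rate, and data-loader format are illustrative assumptions rather than the actual training configuration.

```python
import torch

def train_ccdm(model, loader, alpha_bar, K, lam, pretrain_epochs=200, finetune_epochs=30):
    """Two-stage schedule: pretrain with the plain denoising loss only, then
    fine-tune with the full contrastive objective via ccdm_training_step above."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)        # optimizer/lr are assumptions
    for epoch in range(pretrain_epochs + finetune_epochs):
        weight = 0.0 if epoch < pretrain_epochs else lam       # contrastive term only in stage two
        for x, y0, y0_neg in loader:                           # loader yields (condition, target, negatives)
            # for brevity the contrastive term is still evaluated (with zero weight) in stage one
            loss = ccdm_training_step(model, y0, y0_neg, x, alpha_bar, K, lam=weight)
            opt.zero_grad()
            loss.backward()
            opt.step()
```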

4.2 Overall results

Table 1: Overall comparisons w.r.t. MSE and CRPS on six real-world datasets with diverse horizons $H\in\{96,168,336,720\}$. The best and second-best results are boldfaced and underlined.
Methods CCDM TMDM TimeDiff SSSD CSDI TimeGrad
Metrics MSE CRPS MSE CRPS MSE CRPS MSE CRPS MSE CRPS MSE CRPS
ETTh1 96 0.3856 0.2935 0.4692 0.3952 \ul0.4025 \ul0.3942 1.0984 0.5622 1.1013 0.5794 1.1730 0.6223
168 0.4267 0.3142 0.5296 0.4163 \ul0.4397 0.4170 0.6067 \ul0.4046 1.1013 0.5794 1.1554 0.5970
336 \ul0.5312 0.3452 0.5862 0.4655 0.4943 \ul0.4488 0.9330 0.5421 1.0459 0.6223 1.1403 0.5883
720 0.5642 0.4870 0.7083 0.5335 \ul0.5779 \ul0.5145 1.3776 0.7035 1.0081 0.5952 1.2529 0.6498
Avg 0.4769 0.3600 0.5733 0.4526 \ul0.4786 \ul0.4418 1.0039 0.5531 1.0642 0.5941 1.1804 0.6144
Exchange 96 0.0905 0.1545 0.1278 \ul0.2112 \ul0.1106 0.2349 0.5551 0.4569 0.2551 0.2901 1.8655 1.0439
168 0.1638 0.2159 0.2791 0.3210 \ul0.2050 \ul0.3187 0.4517 0.3602 0.8050 0.5093 1.1638 0.8374
336 0.4407 0.3517 \ul0.4572 0.4426 0.5834 0.5472 0.5641 \ul0.4106 0.6179 0.4786 1.9264 1.0465
720 \ul1.1685 0.5864 2.5625 1.0828 0.9096 0.7128 1.3686 \ul0.6386 1.3816 0.7423 2.4034 1.1478
Avg \ul0.4659 0.3271 0.8567 0.5144 0.4522 \ul0.4534 0.7349 0.4666 0.7649 0.5051 1.8398 1.0189
Weather 96 \ul0.2669 0.1904 0.2768 0.2273 0.3842 0.3441 0.6103 0.3878 0.2608 \ul0.2127 0.5628 0.3445
168 0.2489 0.2006 0.2864 0.2519 0.3566 0.3192 \ul0.2796 \ul0.2060 0.2930 0.2286 0.4141 0.2880
336 0.2870 \ul0.2258 0.3494 0.3007 0.4805 0.3591 0.3189 0.2355 \ul0.2918 0.2193 0.5462 0.3549
720 0.5627 0.4083 \ul0.3975 0.3365 0.5052 0.3880 0.6880 0.4179 0.3803 0.2770 0.4774 \ul0.3221
Avg 0.3414 \ul0.2563 \ul0.3275 0.2791 0.4316 0.3526 0.4742 0.3118 0.3065 0.2344 0.5001 0.3274
Appliance 96 0.6988 0.4138 \ul0.6858 0.4678 0.7328 0.5740 1.1954 0.6504 0.6823 \ul0.4334 1.6748 0.8397
168 0.6266 0.4020 0.7153 0.5232 \ul0.6468 0.5562 0.7841 0.4776 0.7176 \ul0.4560 1.8901 0.8858
336 0.9119 0.5036 1.0310 0.6590 \ul0.9531 0.6822 1.8822 0.8002 1.0565 \ul0.5675 1.8506 0.8661
720 1.5798 0.8620 1.3937 \ul0.8272 \ul1.4327 0.8809 3.3226 1.1225 1.7347 0.7982 2.4393 1.0083
Avg \ul0.9543 0.5454 0.9565 0.6193 0.9414 0.6733 1.7961 0.7627 1.0478 \ul0.5638 1.9637 0.9000
Electricity 96 0.2102 0.2182 0.1954 0.3113 \ul0.1960 0.3123 0.2444 \ul0.2346 0.2560 0.2571 0.3733 0.3259
168 0.1678 \ul0.2014 0.1908 0.3037 0.1907 0.3043 0.2001 0.2249 \ul0.1754 0.1985 0.3676 0.3083
336 0.1683 0.2014 0.2042 0.3165 0.2047 0.3172 0.1941 0.2245 \ul0.1803 \ul0.2043 0.4249 0.3497
720 0.1994 0.2232 0.2282 0.3338 \ul0.2277 \ul0.3336 0.3743 0.3680 0.9932 0.5678 0.4299 0.3479
Avg 0.1864 0.2111 \ul0.2047 0.3163 0.2048 0.3169 0.2532 \ul0.2630 0.4012 0.3069 0.3989 0.3330
Traffic 96 1.0282 0.4093 \ul0.9692 0.5894 0.9684 0.5859 1.0363 0.4445 1.1154 \ul0.4240 1.2259 0.4667
168 0.6881 0.3077 0.8632 0.5254 \ul0.8553 0.5192 0.9551 \ul0.4289 1.6000 0.6701 1.3282 0.5510
336 0.6863 0.3358 0.8874 0.5562 \ul0.8834 0.5538 0.9283 0.5140 1.5724 0.6780 1.0447 \ul0.3817
720 0.9357 0.4519 \ul1.0258 0.6383 1.0270 0.6387 1.0635 0.5515 1.5428 0.6696 1.1753 \ul0.4604
Avg 0.8346 0.3762 0.9364 0.5773 \ul0.9335 0.5744 0.9958 0.4847 1.4577 0.6104 1.1935 \ul0.4650

Table 1 demonstrates that the devised CCDM model outperforms existing diffusion forecasters on most of the generative forecasting cases. Concretely, CCDM attains the best outcomes on 16/24 deterministic and 20/24 probabilistic evaluations, with 9.10% and 15.66% average improvements in MSE and CRPS on these cases. Especially on the two most difficult datasets, Electricity and Traffic, CCDM garners notable gains of 8.94% and 10.59% on MSE and 19.73% and 19.10% on CRPS. These prominent increases reflect that the devised channel-centric structure and contrastive refinement of the diffusion forecaster enhance its representation efficiency of implicit predictive information across diverse prediction scenarios. The second-best model, CSDI, also manifests excellent forecasting ability, especially on Weather, which has complex multivariate temporal correlations. The hybrid attention module in CSDI can capture these relations well, but it incurs high computational overhead and over-fits on other datasets. TMDM and TimeDiff also attain small MSE in a few cases due to their extra deterministic pre-training of the conditioning encoders. Note that we completely replicate TimeGrad on the whole benchmark for the first time, despite its severe inference costs, and validate that it can actually realize reasonable forecasting results. In Fig. 4, we depict the prediction intervals produced by different diffusion models on one case. We can clearly see that CCDM's interval is much more faithful, while TimeDiff's area is sharper but loses diversity and accuracy. See Appendix A.11 for more forecasting result showcases and Appendix A.8 for a time cost analysis.

4.3 Ablation study

To investigate the respective effect of each component, we remove the proposed denoising-based contrastive learning and the channel-wise DiT structure, and exhibit the average metric degradation over different prediction horizons in Table 2. Without auxiliary contrastive diffusion training, we observe a mean performance drop of 8.33% and 5.81% on MSE and CRPS over the whole benchmark. This notable decrease indicates that the dedicated denoising-based contrastive refinement enhances the utilization efficiency of conditional temporal predictive information and yields a more genuine multivariate predictive distribution. Due to the restriction on computational costs, such contrastive gains on the Electricity and Traffic datasets are relatively smaller. We can amplify the contrastive benefits on large-scale datasets in the future by increasing the batch size and the number of negatives within an iteration. Regarding the influence of the composite channel-aware management in the conditional denoiser, we replace the channel-wise DiT modules with the same depth of linear dense encoders, incurring a fully channel-independent architecture. The average reductions in MSE and CRPS over the whole test settings are 19.80% and 26.10%. This considerable drop reveals that the channel-mixing attention empowers the denoising network to integrate useful cross-variate temporal features in past observations and corrupted targets. Besides, the degree of improvement induced by the channel-centric DiT is consistent with the true variate correlations in real-world datasets. For instance, the performance decrease is less salient on the Electricity dataset, where the electricity consumption of different customers is not highly related. Whilst on the ETTh1 and Weather datasets, whose sensory measurements are heavily inter-correlated, the channel-mixing DiT improves the diffusion forecasting capacity more substantially.

Table 2: Average MSE and CRPS degradation resulting from the ablation of denoising-based contrastive learning or channel-wise DiT module. Full results can be found in Appendix A.9.
Models w/o contrastive refinement w/o channel-wise DiT
Metrics MSE Degradation CRPS Degradation MSE Degradation CRPS Degradation
ETTh1 0.5508 15.97% 0.3889 8.55% 0.5956 23.34% 0.5816 58.20%
Exchange 0.4966 11.52% 0.3403 5.24% 0.4924 12.18% 0.3555 11.00%
Weather 0.3816 13.00% 0.2695 4.77% 0.4843 40.28% 0.3336 28.92%
Appliance 1.0220 7.18% 0.5818 5.99% 1.1183 16.96% 0.7231 32.70%
Electricity 0.1887 1.20% 0.2144 1.57% 0.1973 5.71% 0.2137 1.25%
Traffic 0.8439 1.08% 0.4130 8.73% 1.0084 20.30% 0.4675 24.50%
Figure 4: Comparison of generated point forecasts and prediction intervals on an Electricity channel.
Figure 5: Forecasting results by varying the contrastive weight $\lambda$ on three datasets with $H=168$.

4.4 Contrastive refinement analysis

Below, we empirically investigate the efficacy of the devised denoising-based temporal contrastive refinement, including three vital factors in contrastive learning practice and its generality to other existing diffusion forecasters.

Influence of contrastive weight $\lambda$. The complementary step-wise denoising-based contrastive loss in Eq. 4 can enhance the alignment between diffusion-generated forecasts and the given temporal predictive information. To elucidate the impact of different degrees of contrastive refinement on the original diffusion optimization, we escalate the contrastive weight $\lambda$ in Eq. 5 from 0.0001 to 0.01 and display the corresponding forecasting outcomes in Fig. 5. We find that imposing the contrastive regime on denoising diffusion training indeed promotes the generative forecasting capacity, and the gain margin fluctuates moderately across weights and datasets. Roughly, a modest weight between 0.0005 and 0.005 leads to better improvement. We also empirically observe that a larger $\lambda$ can accelerate the diffusion training convergence. See Appendix A.10 for a more detailed analysis of the influence of the negative number $N$ and temperature $\tau$.

Table 3: Forecasting performance promotion induced by applying denoising-based contrastive training to two existing conditional diffusion forecasters.
Methods TimeDiff CSDI
Metrics MSE Promotion CRPS Promotion MSE Promotion CRPS Promotion
ETTh1 96 0.4143 -2.93% 0.3491 11.44% 0.6559 40.44% 0.4371 24.56%
168 0.4715 -7.23% 0.3753 10.00% 0.5894 29.53% 0.3851 25.76%
336 0.5073 -2.63% 0.4025 10.32% 0.9920 5.15% 0.5644 9.30%
720 0.5291 6.19% 0.4338 14.44% 0.7744 23.18% 0.7010 -17.78%
Exchange 96 0.0901 18.54% 0.1722 26.69% 0.1589 37.71% 0.2082 28.23%
168 0.1588 22.54% 0.2312 27.46% 0.4096 49.12% 0.3840 24.60%
336 0.6345 -8.76% 0.4293 21.55% 0.5664 8.33% 0.4110 14.12%
720 0.9735 -7.03% 0.6941 2.62% 1.3642 1.26% 0.6392 13.89%

Generality of contrastive training. We add the step-wise denoising contrastive training presented in Eq. 4 to two existing diffusion forecasters to validate its generality for conditional time series diffusion learning. From the results shown in Table 3, it is obvious that CSDI's generative forecasting ability can be further enhanced by contrastive diffusion training. Its hybrid attention network can represent complex temporal patterns more properly by handling more OOD negative samples. For TimeDiff, which owns extra pre-trained auto-regressive conditioning encoders, CRPS values consistently decrease but some unexpected increases appear on MSE. This may stem from the side effect of the redundant contrastive procedure on the well-behaved deterministic pre-training strategy.

5 Related work

Channel-oriented multivariate forecasting. Recent progress in multivariate deterministic prediction (Liu et al., 2023; Lu et al., 2023; Chen et al., 2024a; Han et al., 2024) indicates that learning channel-centric temporal properties (including single-channel dynamics and cross-channel correlations) is of significant importance. Both channel-independent and channel-fusing time series processing are crucial to improve forecasting performance. However, the effectiveness of such channel manipulation structures is rarely investigated in diffusion-based multivariate probabilistic forecasting, where the extra influence of channel noise imposed in varying degrees should also be addressed. To tackle this barrier, we blend both channel-independent and channel-mixing modules in the conditional diffusion denoiser to boost its forecasting ability on multivariate cases.

Time series diffusion models. Diffusion models have been actively applied to tackle a wide scope of time series tasks, including synthesis (Yuan & Qiao, 2024; Narasimhan et al., 2024), forecasting (Rasul et al., 2021), imputation (Tashiro et al., 2021) and anomaly detection (Chen et al., 2023). Their common goal is to derive a high-quality conditional temporal distribution aligned with diverse input contexts, such as statistical properties in constrained generation (Coletta et al., 2024) or historical records. A valid solution is to inject useful temporal properties into iterative diffusion learning (Yuan & Qiao, 2024; Biloš et al., 2023) or to develop gradient-based guidance schemes (Coletta et al., 2024). However, there is still room to enhance them in terms of training methods and denoiser architectures. To bridge this gap for multivariate forecasting, we exclusively design a channel-aware denoiser and explicitly enhance the predictive mutual information between past observations and future forecasts via an adapted temporal contrastive diffusion learning. Even though several works have applied contrastive diffusion to cross-modal content creation (Wang et al., 2024b; Zhu et al., 2022), its efficacy on time series generative modeling has not yet been well explored, and reasonable interpretations of such contrastive diffusion merits are also scarce. See Appendix A.2 for more detailed related work, which also covers universal temporal contrastive learning.

6 Conclusion

In this work, we propose the channel-aware contrastive conditional diffusion model named CCDM for probabilistic forecasts on multivariate time series. CCDM can capture intrinsic prediction-related temporal information hidden in observed conditioning time series using an efficient channel-centric denoiser architecture and information-maximizing denoising-based contrastive refinement. Extensive experiments demonstrate the exceptional forecasting capability of CCDM over existing time series diffusion models. In future work, we plan to reduce the training costs imposed by additional temporal contrastive learning, and extend this contrastive diffusion method to general time series analysis and other cross-domain synthesis tasks.

Ethics Statement

Our work is only aimed at faithful multivariate probabilistic forecasting for human good, so there is no involvement of human subjects or conflicts of interest as far as the authors are aware.

References

  • Alcaraz & Strodthoff (2022) Juan Lopez Alcaraz and Nils Strodthoff. Diffusion-based time series imputation and forecasting with structured state space models. Transactions on Machine Learning Research, 2022.
  • Biloš et al. (2023) Marin Biloš, Kashif Rasul, Anderson Schneider, Yuriy Nevmyvaka, and Stephan Günnemann. Modeling temporal data as continuous functions with stochastic process diffusion. In International Conference on Machine Learning, pp.  2452–2470. PMLR, 2023.
  • Cao et al. (2024) Defu Cao, Wen Ye, and Yan Liu. Timedit: General-purpose diffusion transformers for time series foundation model. In ICML 2024 Workshop on Foundation Models in the Wild, 2024.
  • Chen et al. (2024a) Jialin Chen, Jan Eric Lenssen, Aosong Feng, Weihua Hu, Matthias Fey, Leandros Tassiulas, Jure Leskovec, and Rex Ying. From similarity to superiority: Channel clustering for time series forecasting. arXiv preprint arXiv:2404.01340, 2024a.
  • Chen et al. (2024b) Minshuo Chen, Song Mei, Jianqing Fan, and Mengdi Wang. An overview of diffusion models: Applications, guided generation, statistical rates and optimization. arXiv preprint arXiv:2404.07771, 2024b.
  • Chen et al. (2020) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp.  1597–1607. PMLR, 2020.
  • Chen et al. (2023) Yuhang Chen, Chaoyun Zhang, Minghua Ma, Yudong Liu, Ruomeng Ding, Bowen Li, Shilin He, Saravan Rajmohan, Qingwei Lin, and Dongmei Zhang. Imdiffusion: Imputed diffusion models for multivariate time series anomaly detection. Proceedings of the VLDB Endowment, 17(3):359–372, 2023.
  • Coletta et al. (2024) Andrea Coletta, Sriram Gopalakrishnan, Daniel Borrajo, and Svitlana Vyetrenko. On the constrained time-series generation problem. Advances in Neural Information Processing Systems, 36, 2024.
  • Crabbé et al. (2024) Jonathan Crabbé, Nicolas Huynh, Jan Pawel Stanczuk, and Mihaela van der Schaar. Time series diffusion in the frequency domain. In Forty-first International Conference on Machine Learning, 2024.
  • Das et al. (2023a) Abhimanyu Das, Weihao Kong, Andrew Leach, Shaan K Mathur, Rajat Sen, and Rose Yu. Long-term forecasting with tide: Time-series dense encoder. Transactions on Machine Learning Research, 2023a.
  • Das et al. (2023b) Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. A decoder-only foundation model for time-series forecasting. arXiv preprint arXiv:2310.10688, 2023b.
  • Deng et al. (2024) Jinliang Deng, Xuan Song, Ivor W Tsang, and Hui Xiong. The bigger the better? rethinking the effective model scale in long-term time series forecasting. arXiv preprint arXiv:2401.11929, 2024.
  • Dumas et al. (2022) Jonathan Dumas, Antoine Wehenkel, Damien Lanaspeze, Bertrand Cornélusse, and Antonio Sutera. A deep generative model for probabilistic energy forecasting in power systems: normalizing flows. Applied Energy, 305:117871, 2022.
  • Esser et al. (2024) Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206, 2024.
  • Fan et al. (2024) Xinyao Fan, Yueying Wu, Chang Xu, Yuhao Huang, Weiqing Liu, and Jiang Bian. Mg-tsd: Multi-granularity time series diffusion models with guided learning process. arXiv preprint arXiv:2403.05751, 2024.
  • Feng et al. (2024) Shibo Feng, Chunyan Miao, Zhong Zhang, and Peilin Zhao. Latent diffusion transformer for probabilistic time series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp.  11979–11987, 2024.
  • Franceschi et al. (2019) Jean-Yves Franceschi, Aymeric Dieuleveut, and Martin Jaggi. Unsupervised scalable representation learning for multivariate time series. Advances in neural information processing systems, 32, 2019.
  • Gao et al. (2024) Yuan Gao, Haokun Chen, Xiang Wang, Zhicai Wang, Xue Wang, Jinyang Gao, and Bolin Ding. Diffsformer: A diffusion transformer on stock factor augmentation. arXiv preprint arXiv:2402.06656, 2024.
  • Han et al. (2024) Lu Han, Xu-Yang Chen, Han-Jia Ye, and De-Chuan Zhan. Softs: Efficient multivariate time series forecasting with series-core fusion. arXiv preprint arXiv:2404.14197, 2024.
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • Huang et al. (2023) Xingshuai Huang, Di Wu, and Benoit Boulet. Metaprobformer for charging load probabilistic forecasting of electric vehicle charging stations. IEEE Transactions on Intelligent Transportation Systems, 2023.
  • Ilbert et al. (2024) Romain Ilbert, Ambroise Odonnat, Vasilii Feofanov, Aladin Virmaux, Giuseppe Paolo, Themis Palpanas, and Ievgen Redko. Unlocking the potential of transformers in time series forecasting with sharpness-aware minimization and channel-wise attention. arXiv preprint arXiv:2402.10198, 2024.
  • Jiang et al. (2023) Chiyu Jiang, Andre Cornman, Cheolho Park, Benjamin Sapp, Yin Zhou, Dragomir Anguelov, et al. Motiondiffuser: Controllable multi-agent motion prediction using diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  9644–9653, 2023.
  • Kollovieh et al. (2024) Marcel Kollovieh, Abdul Fatir Ansari, Michael Bohlke-Schneider, Jasper Zschiegner, Hao Wang, and Yuyang Bernie Wang. Predict, refine, synthesize: Self-guiding diffusion models for probabilistic time series forecasting. Advances in Neural Information Processing Systems, 36, 2024.
  • Lee et al. (2023) Seunghan Lee, Taeyoung Park, and Kibok Lee. Soft contrastive learning for time series. In The Twelfth International Conference on Learning Representations, 2023.
  • Li et al. (2024) Lizao Li, Robert Carver, Ignacio Lopez-Gomez, Fei Sha, and John Anderson. Generative emulation of weather forecast ensembles with diffusion models. Science Advances, 10(13):eadk4489, 2024.
  • Li et al. (2022) Yan Li, Xinjiang Lu, Yaqing Wang, and Dejing Dou. Generative time series forecasting with diffusion, denoise, and disentanglement. Advances in Neural Information Processing Systems, 35:23009–23022, 2022.
  • Li et al. (2023) Yuxin Li, Wenchao Chen, Xinyue Hu, Bo Chen, Mingyuan Zhou, et al. Transformer-modulated diffusion models for probabilistic multivariate time series forecasting. In The Twelfth International Conference on Learning Representations, 2023.
  • Liang et al. (2024a) Paul Pu Liang, Yun Cheng, Xiang Fan, Chun Kai Ling, Suzanne Nie, Richard Chen, Zihao Deng, Nicholas Allen, Randy Auerbach, Faisal Mahmood, et al. Quantifying & modeling multimodal interactions: An information decomposition framework. Advances in Neural Information Processing Systems, 36, 2024a.
  • Liang et al. (2024b) Paul Pu Liang, Zihao Deng, Martin Q Ma, James Y Zou, Louis-Philippe Morency, and Ruslan Salakhutdinov. Factorized contrastive learning: Going beyond multi-view redundancy. Advances in Neural Information Processing Systems, 36, 2024b.
  • Lin et al. (2023) Lequan Lin, Zhengkun Li, Ruikun Li, Xuliang Li, and Junbin Gao. Diffusion models for time-series applications: a survey. Frontiers of Information Technology & Electronic Engineering, pp.  1–23, 2023.
  • Liu et al. (2022) Yong Liu, Haixu Wu, Jianmin Wang, and Mingsheng Long. Non-stationary transformers: Exploring the stationarity in time series forecasting. Advances in Neural Information Processing Systems, 35:9881–9893, 2022.
  • Liu et al. (2023) Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long. itransformer: Inverted transformers are effective for time series forecasting. In The Twelfth International Conference on Learning Representations, 2023.
  • Lu et al. (2023) Jiecheng Lu, Xu Han, and Shihao Yang. Arm: Refining multivariate forecasting with adaptive temporal-contextual learning. In The Twelfth International Conference on Learning Representations, 2023.
  • Narasimhan et al. (2024) Sai Shankar Narasimhan, Shubhankar Agarwal, Oguzhan Akcin, Sujay Sanghavi, and Sandeep Chinchali. Time weaver: A conditional time series generation model. arXiv preprint arXiv:2403.02682, 2024.
  • Nie et al. (2022) Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. In The Eleventh International Conference on Learning Representations, 2022.
  • Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • Peebles & Xie (2023) William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  4195–4205, 2023.
  • Rasul et al. (2020) Kashif Rasul, Abdul-Saboor Sheikh, Ingmar Schuster, Urs M Bergmann, and Roland Vollgraf. Multivariate probabilistic time series forecasting via conditioned normalizing flows. In International Conference on Learning Representations, 2020.
  • Rasul et al. (2021) Kashif Rasul, Calvin Seward, Ingmar Schuster, and Roland Vollgraf. Autoregressive denoising diffusion models for multivariate probabilistic time series forecasting. In International Conference on Machine Learning, pp.  8857–8868. PMLR, 2021.
  • Salinas et al. (2019) David Salinas, Michael Bohlke-Schneider, Laurent Callot, Roberto Medico, and Jan Gasthaus. High-dimensional multivariate forecasting with low-rank gaussian copula processes. Advances in neural information processing systems, 32, 2019.
  • Shen & Kwok (2023) Lifeng Shen and James Kwok. Non-autoregressive conditional diffusion models for time series prediction. In International Conference on Machine Learning, pp.  31016–31029. PMLR, 2023.
  • Shen et al. (2023) Lifeng Shen, Weiyu Chen, and James Kwok. Multi-resolution diffusion models for time series forecasting. In The Twelfth International Conference on Learning Representations, 2023.
  • Song & Ermon (2019) Jiaming Song and Stefano Ermon. Understanding the limitations of variational mutual information estimators. In International Conference on Learning Representations, 2019.
  • Song et al. (2020) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
  • Tashiro et al. (2021) Yusuke Tashiro, Jiaming Song, Yang Song, and Stefano Ermon. Csdi: Conditional score-based diffusion models for probabilistic time series imputation. Advances in Neural Information Processing Systems, 34:24804–24816, 2021.
  • Trirat et al. (2024) Patara Trirat, Yooju Shin, Junhyeok Kang, Youngeun Nam, Jihye Na, Minyoung Bae, Joeun Kim, Byunghyun Kim, and Jae-Gil Lee. Universal time-series representation learning: A survey. arXiv preprint arXiv:2401.03717, 2024.
  • Tsai et al. (2020) Yao-Hung Hubert Tsai, Yue Wu, Ruslan Salakhutdinov, and Louis-Philippe Morency. Self-supervised learning from a multi-view perspective. In International Conference on Learning Representations, 2020.
  • Wang & Liu (2021) Feng Wang and Huaping Liu. Understanding the behaviour of contrastive loss. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  2495–2504, 2021.
  • Wang et al. (2024a) Yihe Wang, Yu Han, Haishuai Wang, and Xiang Zhang. Contrast everything: A hierarchical contrastive framework for medical time-series. Advances in Neural Information Processing Systems, 36, 2024a.
  • Wang et al. (2023) Yingheng Wang, Yair Schiff, Aaron Gokaslan, Weishen Pan, Fei Wang, Christopher De Sa, and Volodymyr Kuleshov. Infodiffusion: Representation learning using information maximizing diffusion models. In International Conference on Machine Learning, pp.  36336–36354. PMLR, 2023.
  • Wang et al. (2024b) Yongkang Wang, Xuan Liu, Feng Huang, Zhankun Xiong, and Wen Zhang. A multi-modal contrastive diffusion model for therapeutic peptide generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp.  3–11, 2024b.
  • Wang et al. (2022) Zhiyuan Wang, Xovee Xu, Weifeng Zhang, Goce Trajcevski, Ting Zhong, and Fan Zhou. Learning latent seasonal-trend representations for time series forecasting. Advances in Neural Information Processing Systems, 35:38775–38787, 2022.
  • Woo et al. (2021) Gerald Woo, Chenghao Liu, Doyen Sahoo, Akshat Kumar, and Steven Hoi. Cost: Contrastive learning of disentangled seasonal-trend representations for time series forecasting. In International Conference on Learning Representations, 2021.
  • Wu et al. (2021) Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in neural information processing systems, 34:22419–22430, 2021.
  • Wu et al. (2024) Yunshu Wu, Yingtao Luo, Xianghao Kong, Evangelos E Papalexakis, and Greg Ver Steeg. Your diffusion model is secretly a noise classifier and benefits from contrastive training. arXiv preprint arXiv:2407.08946, 2024.
  • Yang et al. (2024) Yiyuan Yang, Ming Jin, Haomin Wen, Chaoli Zhang, Yuxuan Liang, Lintao Ma, Yi Wang, Chenghao Liu, Bin Yang, Zenglin Xu, et al. A survey on diffusion models for time series and spatio-temporal data. arXiv preprint arXiv:2404.18886, 2024.
  • Yoon et al. (2019) Jinsung Yoon, Daniel Jarrett, and Mihaela Van der Schaar. Time-series generative adversarial networks. Advances in neural information processing systems, 32, 2019.
  • Yuan & Qiao (2024) Xinyu Yuan and Yan Qiao. Diffusion-ts: Interpretable diffusion for general time series generation. arXiv preprint arXiv:2403.01742, 2024.
  • Zeng et al. (2023) Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? In Proceedings of the AAAI conference on artificial intelligence, volume 37, pp.  11121–11128, 2023.
  • Zhou et al. (2021) Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pp.  11106–11115, 2021.
  • Zhou et al. (2022) Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting. In International conference on machine learning, pp.  27268–27286. PMLR, 2022.
  • Zhu et al. (2022) Ye Zhu, Yu Wu, Kyle Olszewski, Jian Ren, Sergey Tulyakov, and Yan Yan. Discrete contrastive diffusion for cross-modal music and image generation. In The Eleventh International Conference on Learning Representations, 2022.

Appendix A Appendix

A.1 Proof for Proposition 1

Below, we show how to derive the upper bound on the diffusion-induced probabilistic forecasting error stated in Proposition 1. We use the KL-divergence between the real distribution q^{te}(\mathbf{y}_{0}|\mathbf{x}) of test time series and the predictive distribution p^{te}_{\theta}(\mathbf{y}_{0}|\mathbf{x}) approximated by the conditional diffusion model to represent the probabilistic forecasting error:

\mathcal{D}_{KL}\left[q^{te}(\mathbf{y}_{0}|\mathbf{x})||p^{te}_{\theta}(\mathbf{y}_{0}|\mathbf{x})\right]=\mathbb{E}_{q^{te}(\mathbf{y}_{0}|\mathbf{x})}\left[\log q^{te}(\mathbf{y}_{0}|\mathbf{x})\right]-\mathbb{E}_{q^{te}(\mathbf{y}_{0}|\mathbf{x})}\left[\log p^{te}_{\theta}(\mathbf{y}_{0}|\mathbf{x})\right]. (8)

The first term in Eq. 8 is unrelated to conditional diffusion learning and can thus be treated as a constant \mathrm{C}_{1} determined by the information content of the real test data:

\mathbb{E}_{q^{te}(\mathbf{y}_{0}|\mathbf{x})}\left[\log q^{te}(\mathbf{y}_{0}|\mathbf{x})\right]=-\frac{1}{q^{te}(\mathbf{x})}\mathbb{E}_{q^{te}(\mathbf{y}_{0},\mathbf{x})}\left[\log q^{te}(\mathbf{y}_{0}|\mathbf{x})\right]=-\frac{\mathit{H}(\mathbf{y}_{0}|\mathbf{x})}{q^{te}(\mathbf{x})}=\mathrm{C}_{1}. (9)

The second term in Eq. 8 is the expected log-likelihood over q^{te}(\mathbf{y}_{0}|\mathbf{x}), which matches the learning objective of vanilla conditional diffusion models (Ho et al., 2020). Akin to the step-wise denoising loss derivation in (Ho et al., 2020), we can upper-bound this term via Jensen's inequality and decompose it into K+1 items \mathcal{V}_{0},\dots,\mathcal{V}_{K}:

-\mathbb{E}_{q^{te}(\mathbf{y}_{0}|\mathbf{x})}\left[\log p^{te}_{\theta}(\mathbf{y}_{0}|\mathbf{x})\right]=-\mathbb{E}_{q^{te}(\mathbf{y}_{0}|\mathbf{x})}\left[\log\int q^{te}(\mathbf{y}_{1:K}|\mathbf{y}_{0})\frac{p^{te}_{\theta}(\mathbf{y}_{0:K}|\mathbf{x})}{q^{te}(\mathbf{y}_{1:K}|\mathbf{y}_{0})}\mathrm{d}\mathbf{y}_{1:K}\right]
\leq-\mathbb{E}_{q^{te}(\mathbf{y}_{0}|\mathbf{x})}\left[\mathbb{E}_{q^{te}(\mathbf{y}_{1:K}|\mathbf{y}_{0})}\left[\log\frac{p^{te}_{\theta}(\mathbf{y}_{0:K}|\mathbf{x})}{q^{te}(\mathbf{y}_{1:K}|\mathbf{y}_{0})}\right]\right]
=\mathbb{E}_{q^{te}(\mathbf{y}_{0}|\mathbf{x})}\left[\mathcal{V}_{0}+\sum_{k=2}^{K}\mathcal{V}_{k-1}+\mathcal{V}_{K}\right], (10)

where

\mathcal{V}_{K}=\mathcal{D}_{KL}\left[q^{te}(\mathbf{y}_{K}|\mathbf{y}_{0})||p^{te}_{\theta}(\mathbf{y}_{K}|\mathbf{x})\right]=0, (11)

since q^{te}(\mathbf{y}_{K}|\mathbf{y}_{0}) and p^{te}_{\theta}(\mathbf{y}_{K}|\mathbf{x}) are both standard Gaussian. Since the reverse transitions at each diffusion step take explicit Gaussian forms, we can write out

\mathcal{V}_{k-1}=\mathbb{E}_{q^{te}(\mathbf{y}_{k}|\mathbf{y}_{0})}\left[\mathcal{D}_{KL}\left[q^{te}(\mathbf{y}_{k-1}|\mathbf{y}_{k},\mathbf{y}_{0})||p^{te}_{\theta}(\mathbf{y}_{k-1}|\mathbf{y}_{k},\mathbf{x})\right]\right]
=\mathbb{E}_{q^{te}(\mathbf{y}_{k}|\mathbf{y}_{0})}\left[\mathcal{D}_{KL}\left[\mathcal{N}(\mathbf{y}_{k-1};\bm{\mu}_{k}(\mathbf{y}_{k},\mathbf{y}_{0}),\tilde{\beta}_{k}\mathbf{I})||\mathcal{N}(\mathbf{y}_{k-1};\bm{\mu}_{\theta}(\mathbf{y}_{k},\mathbf{x},k),\tilde{\beta}_{k}\mathbf{I})\right]\right]
=\mathbb{E}_{q^{te}(\mathbf{y}_{k}|\mathbf{y}_{0})}\left[\frac{1}{2\tilde{\beta}_{k}^{2}}\left\|\bm{\mu}_{\theta}(\mathbf{y}_{k},\mathbf{x},k)-\bm{\mu}_{k}(\mathbf{y}_{k},\mathbf{y}_{0})\right\|^{2}_{2}\right]
=\mathbb{E}_{\mathbf{y}_{0},\bm{\epsilon}_{k}}\left[\frac{1}{2\tilde{\beta}_{k}^{2}}\left\|\frac{1}{\sqrt{\alpha_{k}}}\left(\mathbf{y}_{k}-\frac{\beta_{k}}{\sqrt{1-\bar{\alpha}_{k}}}\bm{\epsilon}_{\theta}(\mathbf{y}_{k},\mathbf{x},k)\right)-\frac{1}{\sqrt{\alpha_{k}}}\left(\mathbf{y}_{k}-\frac{\beta_{k}}{\sqrt{1-\bar{\alpha}_{k}}}\bm{\epsilon}_{k}\right)\right\|^{2}_{2}\right]
=\mathbb{E}_{\mathbf{y}_{0},\bm{\epsilon}_{k}}\left[\frac{\beta_{k}^{2}}{2\tilde{\beta}_{k}^{2}\alpha_{k}(1-\bar{\alpha}_{k})}\left\|\bm{\epsilon}_{\theta}\left(\sqrt{\bar{\alpha}_{k}}\mathbf{y}_{0}+\sqrt{1-\bar{\alpha}_{k}}\bm{\epsilon}_{k},\mathbf{x},k\right)-\bm{\epsilon}_{k}\right\|^{2}_{2}\right], (12)

where \tilde{\beta}_{k}=\frac{1-\bar{\alpha}_{k-1}}{1-\bar{\alpha}_{k}}\beta_{k}, and \mathcal{V}_{0} is a special case of Eq. 12 with k=1:

\mathcal{V}_{0}=-\mathbb{E}_{q(\mathbf{y}_{1}|\mathbf{y}_{0})}\left[\log p_{\theta}(\mathbf{y}_{0}|\mathbf{y}_{1},\mathbf{x})\right]
=\mathbb{E}_{q(\mathbf{y}_{1}|\mathbf{y}_{0})}\left[\log(2\pi)^{\frac{HD}{2}}\tilde{\beta}_{1}+\frac{1}{2\tilde{\beta}_{1}^{2}}\left\|\mathbf{y}_{0}-\bm{\mu}_{\theta}(\mathbf{y}_{1},\mathbf{x},k=1)\right\|^{2}_{2}\right]
=\mathbb{E}_{\mathbf{y}_{0},\bm{\epsilon}_{1}}\left[\frac{\beta_{1}^{2}}{2\tilde{\beta}_{1}^{2}\alpha_{1}(1-\bar{\alpha}_{1})}\left\|\bm{\epsilon}_{\theta}\left(\sqrt{\bar{\alpha}_{1}}\mathbf{y}_{0}+\sqrt{1-\bar{\alpha}_{1}}\bm{\epsilon}_{1},\mathbf{x},k=1\right)-\bm{\epsilon}_{1}\right\|^{2}_{2}\right]+\mathrm{C}_{2}. (13)

Overall, letting A_{k}=\frac{\beta_{k}^{2}}{2\tilde{\beta}_{k}^{2}\alpha_{k}(1-\bar{\alpha}_{k})} and C=\mathrm{C}_{1}+\mathrm{C}_{2}, we arrive at the ultimate upper bound of the probabilistic forecasting error in the following concise form:

\mathcal{D}_{KL}\left[q^{te}(\mathbf{y}_{0}|\mathbf{x})||p^{te}_{\theta}(\mathbf{y}_{0}|\mathbf{x})\right]\leq\mathbb{E}_{\mathbf{x},\mathbf{y}_{0},\bm{\epsilon}_{k},k}\left[A_{k}\left\|\bm{\epsilon}_{\theta}\left(\sqrt{\bar{\alpha}_{k}}\mathbf{y}_{0}+\sqrt{1-\bar{\alpha}_{k}}\bm{\epsilon}_{k},\mathbf{x},k\right)-\bm{\epsilon}_{k}\right\|^{2}_{2}\right]+C, (14)

which completes the proof of Proposition 1. It shows that, for unknown test time series, the diffusion-based generative forecasting performance is governed by how well the trained conditional denoising network generalizes on the total step-wise noise regression objective.
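To make the weighting in Eq. 14 concrete, the following minimal NumPy sketch evaluates A_{k}=\frac{\beta_{k}^{2}}{2\tilde{\beta}_{k}^{2}\alpha_{k}(1-\bar{\alpha}_{k})} under a quadratic noise schedule. The specific interpolation (linear in \sqrt{\beta} between \beta_{1} and \beta_{K}, with the H=96 values from Table 5) is an assumption for illustration rather than the exact schedule in our released code.

import numpy as np

def quadratic_beta_schedule(beta_1, beta_K, K):
    # Assumed quadratic schedule: linear interpolation in sqrt(beta) space.
    return np.linspace(np.sqrt(beta_1), np.sqrt(beta_K), K) ** 2

def stepwise_weights(beta):
    # A_k = beta_k^2 / (2 * beta_tilde_k^2 * alpha_k * (1 - alpha_bar_k)),
    # with beta_tilde_k = (1 - alpha_bar_{k-1}) / (1 - alpha_bar_k) * beta_k.
    alpha = 1.0 - beta
    alpha_bar = np.cumprod(alpha)
    alpha_bar_prev = np.concatenate(([1.0], alpha_bar[:-1]))
    beta_tilde = (1.0 - alpha_bar_prev) / (1.0 - alpha_bar) * beta
    # beta_tilde_1 = 0 by construction; the k = 1 term is treated separately in Eq. 13.
    beta_tilde = np.clip(beta_tilde, 1e-12, None)
    return beta ** 2 / (2.0 * beta_tilde ** 2 * alpha * (1.0 - alpha_bar))

beta = quadratic_beta_schedule(1e-4, 0.5, K=50)  # H=96 setting in Table 5
A = stepwise_weights(beta)
print(A[1:4], A[-3:])  # under this schedule, weights are largest at early low-noise steps and shrink toward k = K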

A.2 Additional Discussions on related works

Channel-oriented multivariate forecasting. How to properly manage channel-centric temporal properties (i.e., single-channel dynamics and cross-channel correlations) has received growing attention in recent multivariate forecasting works (Chen et al., 2024a; Han et al., 2024) for two reasons. First, traditional transformer-based models (Zhou et al., 2021; Wu et al., 2021; Zhou et al., 2022; Liu et al., 2022) focus on improving the expressivity and efficiency of long-range temporal dependency modeling, which cannot explicitly discriminate the roles of disparate channels and can yield unsatisfactory outcomes. Second, channel-independent predictors (Nie et al., 2022; Zeng et al., 2023; Das et al., 2023a) apply a shared network uniformly to all channels and show that single-channel separate prediction can outperform multi-channel mixing settings; yet this channel-independent structure fails to handle complex temporal modes where auxiliary information from other channels could also be helpful. The latest progress (Liu et al., 2023; Lu et al., 2023; Chen et al., 2024a; Han et al., 2024) indicates that both channel independence and channel fusion are crucial for versatile time series predictors. However, the significance of proper channel manipulation is rarely probed in multivariate diffusion forecasters, and the additional influence of channel-wise noise imposed to different extents should also be considered. To tackle this barrier, we blend both channel-independent and channel-mixing modules in the diffusion denoiser to boost its forecasting ability on multivariate cases.

Time series diffusion models. Owing to their remarkable capacity to generate high-fidelity samples, diffusion models are actively exploited to capture stochastic dynamics and temporal correlations in a variety of time series tasks, including synthesis (Yuan & Qiao, 2024; Narasimhan et al., 2024), forecasting (Rasul et al., 2021), imputation (Tashiro et al., 2021) and anomaly detection (Chen et al., 2023). The common goal of these tasks is to derive a high-quality conditional temporal distribution aligned with diverse input contexts, such as statistical properties in constrained generation (Coletta et al., 2024) or historical records. The key challenge lies in designing a potent temporal conditioning mechanism to empower the conditional backward generation. An intuitive way is to integrate useful temporal properties such as trend-seasonality (Yuan & Qiao, 2024), continuity (Biloš et al., 2023) and multi-scale modes (Shen et al., 2023; Fan et al., 2024) to empirically boost the utilization efficiency of conditioning data in the learnable denoising process. Another track develops gradient-based guidance schemes to satisfy given constraints via differentiable (Coletta et al., 2024) or objective-oriented optimization (Kollovieh et al., 2024). Despite this plethora of time series diffusion models, there is still room to enhance them in terms of training manners and denoiser architectures. To bridge this gap for multivariate forecasting, we exclusively design a channel-aware denoiser network and recast the estimation of the conditional predictive distribution in the paradigm of mutual information maximization, which enhances the consistency between past conditioning and future predicted time series. On top of the original conditional likelihood maximization via step-wise noise regression, we adapt temporal contrastive learning to further augment conditional diffusion training. In future work, we hope to extend these innovations to benefit other time series analysis tasks.

Temporal contrastive learning. Time series contrastive learning primarily aims to obtain self-supervised universal temporal representations that can enable an array of downstream tasks with few shots (Trirat et al., 2024; Lee et al., 2023; Franceschi et al., 2019; Wang et al., 2024a). This line of research focuses on developing efficient representation learning methods to pre-train temporal feature extractors along two vital dimensions: contrastive loss design and positive/negative sample pair construction. For the deterministic time series prediction task, specialized decomposed contrastive pre-training approaches (Woo et al., 2021; Wang et al., 2022) investigate disentangled seasonal and trend representations, which can ease the subsequent prediction of volatile temporal evolution. In contrast, we devise an end-to-end denoising-based contrastive learning scheme that directly ameliorates conditional denoiser training, rather than following the common pre-training fashion on general temporal representation networks. We realize this contrastive refinement in a form identical to the step-wise noise regression of vanilla diffusion. Moreover, we alter both temporal variations and point magnitudes in the time series augmentation stage, which constructs more useful negative samples for the contrastive denoiser improvement.

A.3 Negative time series augmentation

To enable contrastive learning, we employ two types of augmentation methods to produce negative multivariate time series \mathbf{y}_{0}^{(n)}. The first is to alter the ground-truth temporal variations of each univariate time series by patch shuffling, since recovering the correct temporal evolution is a vital challenge for time series diffusion models. As shown in Fig. 6, we divide a given sequence into an array of sub-series patches and randomly shuffle their order to corrupt the original temporal dynamics. The second is to scale up or down the magnitudes of individual time points, as an ideal prediction interval should cover every point without any falling outside. Thus, for each positive target \mathbf{y}_{0}, we uniformly sample a scaling factor a_{d} from \left[0,0.5\right]\cup\left[1.5,2.0\right] and impose it on each channel via a_{d}\cdot\mathbf{y}_{0}^{d}\in\mathbb{R}^{H}. We find that both ways of generating negative samples help the diffusion model learn more realistic time series; a minimal implementation sketch is provided after Fig. 6.

Figure 6: One diagram of the variation-based time series augmentation method.
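For reference, a minimal NumPy sketch of this hybrid augmentation is given below. The patch length and the choice of applying one shared permutation to all channels are illustrative assumptions, not the tuned settings of our experiments.

import numpy as np

def patch_shuffle(y, patch_len=16, rng=None):
    # y: (H, D) target window. Split the horizon into sub-series patches and
    # randomly permute their order to corrupt the original temporal variations.
    rng = rng or np.random.default_rng()
    H, D = y.shape
    n_patch = H // patch_len
    out = y.copy()
    patches = out[:n_patch * patch_len].reshape(n_patch, patch_len, D)
    out[:n_patch * patch_len] = patches[rng.permutation(n_patch)].reshape(-1, D)
    return out

def magnitude_scale(y, rng=None):
    # Scale each channel d by a factor a_d drawn uniformly from [0, 0.5] or [1.5, 2.0],
    # pushing points outside a faithful prediction interval.
    rng = rng or np.random.default_rng()
    _, D = y.shape
    low = rng.uniform(0.0, 0.5, size=D)
    high = rng.uniform(1.5, 2.0, size=D)
    a = np.where(rng.random(D) < 0.5, low, high)
    return y * a  # broadcasts over the horizon dimension

def make_negatives(y0, n_neg, seed=0):
    # Hybrid augmentation: alternate variation-based and magnitude-based negatives.
    rng = np.random.default_rng(seed)
    negs = [patch_shuffle(y0, rng=rng) if i % 2 == 0 else magnitude_scale(y0, rng=rng)
            for i in range(n_neg)]
    return np.stack(negs)  # (n_neg, H, D)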

A.4 Training algorithm

We elucidate the step-wise denoising-based contrastive diffusion training algorithm in Algorithm 1.

Algorithm 1 Step-wise contrastive conditional diffusion training procedure.

Input: Lookback time series \mathbf{x}\in\mathbb{R}^{L\times D}; target time series \mathbf{y}_{0}\in\mathbb{R}^{H\times D}; lookback length L; prediction horizon H; variate number D; diffusion step number K; negative sample number N; contrastive loss weight \lambda; temperature coefficient \tau;
repeat

1: Draw step k\sim\mathbb{U}[1,\dots,K].
2: Draw noise \bm{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I}) to calculate the naive diffusion loss \mathcal{L}_{k}^{denoise} in Eq. 1.
3: Draw noise \bm{\epsilon}^{\prime}\sim\mathcal{N}(\mathbf{0},\mathbf{I}) for the denoising-based contrastive loss \mathcal{L}_{k}^{contrast} in Eq. 4.
4: Obtain a set of negative time series \left\{\mathbf{y}_{0}^{n}\right\}_{n=1}^{N} using the hybrid augmentation in Appendix A.3 and calculate \mathcal{L}_{k}^{contrast}.
5: Compute the contrastive conditional diffusion loss \mathcal{L}^{CCDM}_{k}=\mathcal{L}_{k}^{denoise}+\lambda\mathcal{L}_{k}^{contrast} in Eq. 6.
6: Optimize the conditional denoising network \bm{\epsilon}_{\theta}(\cdot) with the gradient \nabla_{\theta}\mathcal{L}^{CCDM}_{k}.

until converged
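As a complement to Algorithm 1, the PyTorch-style sketch below outlines one training iteration. The denoiser signature eps_theta(y_k, x, k), the batched make_negatives helper (a tensor version of the sketch in Appendix A.3) and the InfoNCE-style formulation of the contrastive term over step-wise denoising errors are illustrative assumptions; the exact contrastive loss is the one defined in Eq. 4 of the main text.

import torch
import torch.nn.functional as F

def ccdm_training_step(eps_theta, x, y0, alpha_bar, K, make_negatives,
                       n_neg, lam, tau, optimizer):
    # x: (B, L, D) lookback window; y0: (B, H, D) target window;
    # alpha_bar: (K,) tensor of cumulative products of (1 - beta_k).
    B = y0.shape[0]
    k = torch.randint(1, K + 1, (B,), device=y0.device)             # step 1: draw diffusion steps
    a = alpha_bar[k - 1].view(B, 1, 1)

    # Step 2: naive step-wise denoising loss (Eq. 1).
    eps = torch.randn_like(y0)
    y_k = a.sqrt() * y0 + (1.0 - a).sqrt() * eps
    loss_denoise = F.mse_loss(eps_theta(y_k, x, k), eps)

    # Steps 3-4: fresh noise and hybrid-augmented negatives for the contrastive loss.
    eps_p = torch.randn_like(y0)
    y0_neg = make_negatives(y0, n_neg)                               # (B, N, H, D), see Appendix A.3
    y_k_pos = a.sqrt() * y0 + (1.0 - a).sqrt() * eps_p
    y_k_neg = a.sqrt().unsqueeze(1) * y0_neg + (1.0 - a).sqrt().unsqueeze(1) * eps_p.unsqueeze(1)

    # Assumed InfoNCE-style form of the contrastive term over denoising errors (cf. Eq. 4).
    err_pos = ((eps_theta(y_k_pos, x, k) - eps_p) ** 2).flatten(1).mean(-1)             # (B,)
    err_neg = torch.stack([((eps_theta(y_k_neg[:, n], x, k) - eps_p) ** 2).flatten(1).mean(-1)
                           for n in range(n_neg)], dim=1)                               # (B, N)
    logits = -torch.cat([err_pos.unsqueeze(1), err_neg], dim=1) / tau
    target = torch.zeros(B, dtype=torch.long, device=y0.device)      # index 0 = positive target
    loss_contrast = F.cross_entropy(logits, target)

    # Steps 5-6: combined loss (Eq. 6) and gradient update.
    loss = loss_denoise + lam * loss_contrast
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()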

A.5 Dataset description

We summarize the datasets in Table 4, which clarifies the channel number D, sampling frequency, train/validation/test split sizes and application field of each dataset. All of these datasets are publicly accessible.

Table 4: Detailed dataset description. Split size indicates the number of time points in the training, validation and testing divisions, respectively.
Dataset Variate number D Sampling frequency Split size Field
ETTh1 7 Hourly (8640, 2880, 2880) Energy
Exchange 8 Daily (5311, 758, 1517) Finance
Weather 21 10min (34560, 5760, 11520) Weather
Appliance 28 10min (13814, 1973, 3947) Energy
Electricity 321 Hourly (17280, 2880, 5760) Energy
Traffic 862 Hourly (11520, 2880, 2880) Traffic

A.6 Evaluation metrics

To assess the accuracy and reliability of the estimated multivariate predictive distribution, we adopt two common metrics, MSE and CRPS, to quantify both the deterministic and probabilistic forecasting performance of the generated prediction intervals. Let \mathbf{y}_{0} denote the ground-truth time series and \{\hat{\mathbf{y}}_{0}^{(s)}\}_{s=1}^{S} the produced prediction set, and let its 50\%-quantile trajectory \bar{\mathbf{y}}_{0} signify the point forecast. The two metrics are then calculated in a point-wise form over all channels and timestamps:

MSE=\frac{1}{HD}\left\|\mathbf{y}_{0}-\bar{\mathbf{y}}_{0}\right\|_{2}^{2}; (15)
CRPS=\frac{1}{HD}\sum_{d=1}^{D}\sum_{t=1}^{H}\int_{\mathbb{R}}\left(F(\hat{y}_{td})-\mathbb{I}\{y_{td}\leq\hat{y}_{td}\}\right)^{2}\mathrm{d}\hat{y}_{td}; (16)

where y_{td} denotes the t-th point of the d-th univariate time series and F is the empirical cumulative distribution function of the generated forecasts.
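In practice, both metrics are estimated directly from the S generated trajectories. The NumPy sketch below takes the point-wise median trajectory as the point forecast for MSE and uses the standard sample-based CRPS estimator, which approximates the integral in Eq. 16 when the sample set is treated as the empirical distribution F; this estimator choice is an assumption consistent with common practice rather than a transcript of our exact evaluation script.

import numpy as np

def mse_from_samples(y_true, y_samples):
    # y_true: (H, D); y_samples: (S, H, D). Point forecast = 50%-quantile trajectory (Eq. 15).
    y_med = np.median(y_samples, axis=0)
    return np.mean((y_true - y_med) ** 2)

def crps_from_samples(y_true, y_samples):
    # Sample-based CRPS averaged over all H*D points:
    # CRPS ~ E|Y - y| - 0.5 * E|Y - Y'|, with Y, Y' independent forecast samples.
    term1 = np.mean(np.abs(y_samples - y_true[None]), axis=0)                      # (H, D)
    term2 = np.mean(np.abs(y_samples[:, None] - y_samples[None, :]), axis=(0, 1))  # (H, D)
    return np.mean(term1 - 0.5 * term2)

# Toy usage: 100 sampled trajectories around a synthetic ground truth (H=96, D=7).
rng = np.random.default_rng(0)
y_true = rng.normal(size=(96, 7))
y_samples = y_true[None] + 0.1 * rng.normal(size=(100, 96, 7))
print(mse_from_samples(y_true, y_samples), crps_from_samples(y_true, y_samples))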

A.7 Experimental configurations

In Table 5, we detail the conditional diffusion model configurations for different forecasting scenarios, including the channel-aware DiT compositions and the diffusion noise scheduling. We simply keep the number of layers of the input and output channel-independent dense encoders identical to the depth of the attention modules, i.e., n_{att}=n_{enc}=n_{dec}=2. Notably, the designed channel-centric conditional denoising network can easily scale to diverse forecasting scenarios by merely adjusting the hidden representation dimension e_{hid}, which grows with the prediction horizon H.

In Table 6, we present the concrete contrastive training configurations behind the main comparison outcomes in Table 1. We adopt two-stage separate training on the Weather, Electricity and Traffic datasets to reduce training time and memory consumption. The best contrastive weight is chosen from \left\{0.001, 0.0005, 0.0001, 0.00005\right\}. Due to GPU memory limitations, we have to reduce the negative sample number and batch size on the Electricity and Traffic datasets with hundreds of channels, which could restrict the final forecasting performance. The initial learning rates are also listed for full reproduction on the newly adopted benchmark.

Table 5: Diffusion forecaster configurations on different forecasting setups.
Forecasting setup / DiT blocks / Noise schedule (quadratic)
Lookback length L Prediction horizon H Depth n_att Heads Hidden dim e_hid β_1 β_K Steps K
48 96 2 8 128 0.0001 0.5 50
96 168 2 8 256 0.0001 0.2 100
192 336 2 8 512 0.0001 0.1 200
336 720 2 8 728 0.0001 0.1 200
Table 6: Contrastive training configurations corresponding to forecasting results in Table 1.
Setup Contrastive weight λ Negative number N Batch size Initial learning rate Training mode
ETTh1 96 0.001 64*2 32 0.001 End-to-end
168 0.001 64*2 32 0.001 End-to-end
336 0.001 64*2 32 0.0002 End-to-end
720 0.001 64*2 32 0.0002 End-to-end
Exchange 96 0.001 64*2 32 0.001 End-to-end
168 0.001 64*2 32 0.001 End-to-end
336 0.001 64*2 32 0.0002 End-to-end
720 0.001 64*2 32 0.0002 End-to-end
Weather 96 0.001 64*2 32 0.0001 Two-stage
168 0.001 64*2 32 0.0001 Two-stage
336 0.0001 64*2 32 0.00002 Two-stage
720 0.001 64*2 32 0.00002 Two-stage
Appliance 96 0.001 64*2 32 0.001 End-to-end
168 0.001 64*2 32 0.001 End-to-end
336 0.0001 64*2 32 0.0002 End-to-end
720 0.00005 64*2 32 0.0002 End-to-end
Electricity 96 0.0001 32*2 16 0.0001 Two-stage
168 0.00005 32*2 12 0.0001 Two-stage
336 0.00005 32*2 8 0.00002 Two-stage
720 0.00005 20*2 8 0.00002 Two-stage
Traffic 96 0.0005 28*2 4 0.0001 Two-stage
168 0.0001 22*2 4 0.0001 Two-stage
336 0.00005 16*2 4 0.00002 Two-stage
720 0.00005 12*2 4 0.00002 Two-stage

A.8 Computational time analysis

We compare both the training and inference time costs of different diffusion forecasters in Table 7. The auxiliary contrastive learning indeed increases the burden of vanilla denoising diffusion training in exchange for a higher-quality multivariate predictive distribution; we therefore adopt the two-stage separate strategy to accelerate training. The sequential generation procedure of our CCDM is notably faster than that of other models, which indicates that the designed channel-centric denoiser architecture scales efficiently to diverse forecasting settings. Besides, the deterministic autoregressive pretraining in TimeDiff, the hybrid attention layers in CSDI and the point-wise amortized diffusion in TimeGrad magnify their time consumption to different extents.

Table 7: Time cost comparison of diffusion forecasters on different sizes of prediction tasks. Training time [s] per epoch and inference time [ms] per step are provided.
Size CCDM TimeDiff CSDI TimeGrad
Train [s] Infer [ms] Train [s] Infer [ms] Train [s] Infer [ms] Train [s] Infer [ms]
D=8 H=96 18.67 3.63 14.11 3.00 4.78 3.63 2.22 349.42
H=168 28.11 4.37 18.56 3.00 6.33 3.58 3.89 603.05
H=336 80.78 4.71 24.00 2.98 10.67 3.72 5.32 1163.55
H=720 166.44 4.97 26.22 3.05 18.78 3.58 9.33 2571.37
D=28 H=96 66.67 3.76 25.44 4.00 34.89 3.76 7.67 374.23
H=168 203.11 4.38 37.11 3.94 50.78 3.68 12.22 605.80
H=336 441.74 4.71 33.22 4.29 97.22 3.66 21.67 1170.64
H=720 903.00 4.75 34.67 4.50 181.56 6.45 50.22 2551.13
D=321 H=96 573.22 4.59 657.67 17.92 84.78 9.08 48.56 357.51
H=168 1131.89 4.70 859.44 19.48 145.89 17.20 86.22 630.14
H=336 3173.89 4.83 1190.33 20.61 376.11 47.77 171.44 1188.07
H=720 4039.56 5.09 1269.56 22.71 546.67 70.13 330.67 2672.31
D=862 H=96 1466.14 4.54 185.78 46.67 104.22 25.12 80.56 369.04
H=168 1884.77 4.52 193.89 47.89 118.33 47.71 146.67 620.23
H=336 3202.85 5.17 284.89 49.09 228.44 96.06 289.11 1186.83
H=720 4678.78 7.86 463.56 55.01 417.33 193.62 545.67 2591.93

A.9 Full results on ablation study

We report the full forecasting outcomes corresponding to the ablation study of Section 4.3 in Table 8. In a nutshell, the performance gains from the denoiser architecture and the contrastive refinement vary across forecasting scenarios. Careful configuration of the channel-aware denoising network and the auxiliary contrastive training is still required to achieve optimal results for a specific time series field and prediction setup.

Table 8: Complete forecasting results by masking denoising-based temporal contrastive refinement or channel-mixing DiT blocks.
Methods w/o contrastive refinement w/o channel-wise DiT
Metrics MSE Degradation CRPS Degradation MSE Degradation CRPS Degradation
ETTh1 96 0.4447 15.33% 0.3199 8.99% 0.3903 1.22% 0.2963 0.95%
168 0.5223 22.40% 0.3402 8.27% 0.5800 35.93% 0.6674 112.41%
336 0.6416 20.78% 0.3917 13.47% 0.5381 1.30% 0.4699 36.12%
720 0.5944 5.35% 0.5038 3.45% 0.8740 54.91% 0.8928 83.33%
Avg 0.5508 15.97% 0.3889 8.55% 0.5956 23.34% 0.5816 58.20%
Exchange 96 0.1057 16.80% 0.1677 8.54% 0.0959 5.97% 0.1598 3.43%
168 0.1986 21.25% 0.2338 8.29% 0.2200 34.31% 0.2777 28.62%
336 0.4532 2.84% 0.3557 1.14% 0.4735 7.44% 0.3870 10.04%
720 1.2290 5.18% 0.6038 2.97% 1.1802 1.00% 0.5975 1.89%
Avg 0.4966 11.52% 0.3403 5.24% 0.4924 12.18% 0.3555 11.00%
Weather 96 0.2825 5.84% 0.1936 1.68% 0.2919 9.37% 0.2012 5.67%
168 0.3349 34.55% 0.2167 8.03% 0.4199 68.70% 0.2981 48.60%
336 0.2932 2.16% 0.2313 2.44% 0.3825 33.28% 0.2873 27.24%
720 0.6158 9.44% 0.4365 6.91% 0.8428 49.78% 0.5478 34.17%
Avg 0.3816 13.00% 0.2695 4.77% 0.4843 40.28% 0.3336 28.92%
Appliance 96 0.7097 1.56% 0.4291 3.70% 0.7473 6.94% 0.4546 9.86%
168 0.7313 16.71% 0.4374 8.81% 0.7853 25.33% 0.6070 51.00%
336 0.9254 1.48% 0.5083 0.93% 1.0660 16.90% 0.6971 38.42%
720 1.7215 8.97% 0.9525 10.50% 1.8744 18.65% 1.1336 31.51%
Avg 1.0220 7.18% 0.5818 5.99% 1.1183 16.96% 0.7231 32.70%
Electricity 96 0.2142 1.90% 0.2266 3.85% 0.2296 9.23% 0.2198 0.73%
168 0.1689 0.66% 0.2033 0.94% 0.1779 6.02% 0.2041 1.34%
336 0.1714 1.84% 0.2035 1.04% 0.1744 3.62% 0.2048 1.69%
720 0.2002 0.40% 0.2242 0.45% 0.2073 3.96% 0.2260 1.25%
Avg 0.1887 1.20% 0.2144 1.57% 0.1973 5.71% 0.2137 1.25%
Traffic 96 1.0345 0.61% 0.4226 3.25% 1.2831 24.79% 0.4741 15.83%
168 0.6936 0.80% 0.3113 1.17% 0.7682 11.64% 0.3869 25.74%
336 0.6913 0.73% 0.3572 6.37% 0.8472 23.44% 0.4329 28.92%
720 0.9561 2.18% 0.5610 24.14% 1.1351 21.31% 0.5762 27.51%
Avg 0.8439 1.08% 0.4130 8.73% 1.0084 20.30% 0.4675 24.50%

A.10 More analysis on contrastive refinement

Influence of negative number N. Previous works on visual contrastive representation learning (Oord et al., 2018; Chen et al., 2020) claim that a larger number of negative samples per training iteration yields more informative latent features for downstream vision recognition tasks. To probe the influence of the negative sample number N on the specialized contrastive time series diffusion model for multivariate forecasting, we vary N from 16 to 256 and report the outcomes in Fig. 7. The optimal N is 192, 128 and 16 on the three datasets, and the two quantitative metrics of each dataset exhibit distinct trends. This phenomenon suggests that the real impact of the negative sample number on contrastive training gains is relatively intractable and does not follow the rule observed in visual contrastive self-supervised pretraining. It could also be attributed to the substantially smaller training corpus of time series compared to images. We should therefore determine the best number of negative instances according to concrete data characteristics, together with other training hyper-parameters.

Figure 7: Forecasting results by different numbers of negative samples.

Influence of temperature coefficient τ. The proposed denoising-based contrastive diffusion loss in Eq. 4 takes a canonical softmax form. According to the gradient analysis of the universal softmax-based contrastive loss in Wang & Liu (2021), the temperature τ is a critical factor controlling the penalty magnitude on different negative samples. To attain a contrastive improvement on conditional denoiser training, it is important to maintain τ within an appropriate interval. We assign four values to τ and report the quantitative results in Table 9. We observe that [0.05, 0.1] is a reasonable range on ETTh1, while [0.1, 0.5] is also valid for the other two datasets.
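To make this penalty-concentration effect tangible, the toy sketch below computes the softmax weights that a softmax-based contrastive loss assigns to a set of negative similarities at the four probed temperatures; the similarity values are synthetic and purely illustrative.

import numpy as np

def negative_weights(sims, tau):
    # Softmax weights over negative similarities: a smaller temperature concentrates
    # the gradient penalty on the hardest (most similar) negatives (Wang & Liu, 2021).
    logits = np.asarray(sims) / tau
    w = np.exp(logits - logits.max())
    return w / w.sum()

sims = [0.9, 0.7, 0.5, 0.1]            # synthetic negative similarities
for tau in [0.05, 0.1, 0.5, 1.0]:      # temperatures probed in Table 9
    print(tau, np.round(negative_weights(sims, tau), 3))
# tau = 0.05 puts nearly all weight on the hardest negative, while tau = 1.0 spreads it much more evenly.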

Table 9: Forecasting results by different temperature coefficients.
Temperature τ ETTh1 Exchange Weather
MSE CRPS MSE CRPS MSE CRPS
0.05 0.4391 0.3124 0.1728 0.2216 0.2515 0.2015
0.1 0.4267 0.3142 0.1638 0.2159 0.2489 0.2006
0.5 0.4730 0.3252 0.1583 0.2152 0.2493 0.2022
1.0 0.5082 0.3389 0.2031 0.2449 0.2505 0.2025

A.11 More showcases on prediction intervals

In Figs. 8-13 below, we visualize more prediction intervals generated by the proposed CCDM on the six datasets. The legend of each figure is identical to Fig. 4. For each task, we display only the first 7 or 8 variates and present two random samples under the L=48, H=96 setting.

Figure 8: ETTh1 prediction intervals of total 7 channels.
Figure 9: Exchange prediction intervals of total 7 channels.
Figure 10: Weather prediction intervals of first 8 channels.
Figure 11: Appliance prediction intervals of first 8 channels.
Figure 12: Electricity prediction intervals of first 8 channels.
Figure 13: Traffic prediction intervals of first 8 channels.