
Conditional Lagrangian Wasserstein Flow for Time Series Imputation

Weizhu QIAN
School of Computer Science and Technology
Soochow University, China

Dalin Zhang
Department of Computer Science
Aalborg University, Denmark

Yan Zhao
Department of Computer Science
Aalborg University, Denmark

Correspondence to [email protected]
Abstract

Time series imputation is important for numerous real-world applications. To overcome the limitations of diffusion model-based imputation methods, e.g., slow convergence in inference, we propose a novel method for time series imputation, called Conditional Lagrangian Wasserstein Flow. The proposed method leverages (conditional) optimal transport theory to learn the probability flow in a simulation-free manner, in which the initial noise, missing data, and observations are treated as the source distribution, target distribution, and conditional information, respectively. Following the principle of least action in Lagrangian mechanics, we learn the velocity by minimizing the corresponding kinetic energy. Moreover, to incorporate more prior information into the model, we parameterize the derivative of a task-specific potential function via a variational autoencoder and combine it with the base estimator to form a Rao-Blackwellized sampler. The proposed model requires fewer intermediate steps to produce high-quality samples at inference time than existing diffusion methods. Finally, experimental results on real-world datasets show that the proposed method achieves competitive imputation performance compared to state-of-the-art methods.

1 Introduction

Time series imputation is essential for practical scenarios in many fields, such as transportation, environmental monitoring, and medical care. Deep learning-based approaches, such as RNNs, VAEs, and GANs, have proven advantageous over traditional machine learning methods on various complex real-world multivariate time series analysis tasks Fortuin et al. (2020). More recently, diffusion models, such as denoising diffusion probabilistic models (DDPMs) Ho et al. (2020) and score-based generative models (SBGMs) Song et al. (2020), have gained increasing attention in the field of time series analysis due to their powerful modelling capability Lin et al. (2023); Meijer and Chen (2024).

Although many diffusion model-based time series imputation approaches have been proposed and show advantages over conventional deep learning models Tashiro et al. (2021); Chen et al. (2021, 2023), they suffer from slow convergence or large computational costs. Such limitations may prevent them from being applied in real-world applications. To address these issues, in this work we leverage optimal transport theory Villani et al. (2009) and Lagrangian mechanics Arnol'd (2013) to propose a novel method, called Conditional Lagrangian Wasserstein Flow (CLWF), for fast and accurate time series imputation.

In our method, we treat the multivariate time series imputation task as a conditional optimal transport problem, whereby the random noise is the source distribution, the missing data are the target distribution, and the observed data serve as the conditional information. To generate new data samples efficiently and accurately, we need to find the shortest path in the probability space according to optimal transport theory. To this end, we first project the original source and target distributions into the Wasserstein space by sampling mini-batch OT maps. Afterwards, we construct time-dependent intermediate samples by interpolating between the source and target distributions. Then, according to the principle of least action in Lagrangian mechanics Arnol'd (2013), the optimal velocity function moving the source distribution to the target distribution is learned in a self-supervised manner by minimizing the corresponding kinetic energy. Moreover, to further improve the model's performance, we learn a task-specific potential function by training a Variational Autoencoder (VAE) Kingma and Welling (2014) on the observed time series data to build a Rao-Blackwellized trajectory sampler.

Finally, CLWF is assessed on two real-world multivariate time series datasets. The obtained results show that the proposed method achieves competitive performance and admits fast convergence compared with other state-of-the-art time series imputation methods.

The contributions of the paper are summarized as follows:

  • We present Conditional Lagrangian Wasserstein Flow, a novel conditional generative framework based on the optimal transport theory and Lagrangian mechanics;

  • We propose a Rao-Blackwellized trajectory sampler to enhance the data generation performance by incorporating the prior information;

  • We develop practical algorithms to solve the time series imputation problem via a conditional generative approach;

  • We demonstrate that the proposed method achieves competitive performance on time series imputation tasks compared with other state-of-the-art methods.

2 Preliminaries

In this section, we concisely introduce the fundamentals of stochastic differential equations, optimal transport, the Schrödinger Bridge, and Lagrangian mechanics.

2.1 Stochastic Differential Equations

We treat the data generation task as an initial value problem (IVP), in which $X_0 \in \mathbb{R}^d$ is the initial data (e.g., some random noise) at the initial time $t=0$, and $X_T \in \mathbb{R}^d$ is the target data at the terminal time $t=T$. To solve the IVP, we consider a stochastic differential equation (SDE) defined by a Borel measurable time-dependent drift function $\mu_t: [0,T]\times\mathbb{R}^d \rightarrow \mathbb{R}^d$ and a positive Borel measurable time-dependent diffusion function $\sigma_t: [0,T] \rightarrow \mathbb{R}^d_{>0}$. Accordingly, the Itô form of the SDE can be written as follows Oksendal (2013):

$\mathrm{d}X_{t}=\mu_{t}(X_{t},t)\,\mathrm{d}t+\sigma_{t}\,\mathrm{d}W_{t},$ (1)

where WtW_{t} is a Brownian motion/Wiener process. When the diffusion term is not considered, the SDE degenerates to an ordinary differential equation (ODE). However, we will use the SDE for the theoretical analysis as it is more general.
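For concreteness, the following is a minimal sketch (assuming PyTorch; `mu` and `sigma` are hypothetical drift and diffusion callables, not part of the paper) of how Eq. 1 can be simulated with the Euler–Maruyama scheme; setting the diffusion to zero recovers the plain Euler ODE integrator used later for sampling.

import torch

def euler_maruyama(mu, sigma, x0, T=1.0, n_steps=100):
    # Simulate dX_t = mu(X_t, t) dt + sigma(t) dW_t with a fixed step size.
    # mu: callable (x, t) -> drift; sigma: callable (t) -> scalar diffusion.
    dt = T / n_steps
    x, t = x0.clone(), 0.0
    for _ in range(n_steps):
        dw = torch.randn_like(x) * dt ** 0.5    # Brownian increment
        x = x + mu(x, t) * dt + sigma(t) * dw   # Euler-Maruyama update
        t += dt
    return x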

The Fokker–Planck equation (FPE) Risken and Frank (2012) describing the evolution of the marginal density $p_t(X_t)$ reads:

$\frac{\partial}{\partial t}p_{t}(X_{t})=-\nabla\cdot(p_{t}\mu_{t})+\frac{\sigma_{t}^{2}}{2}\Delta p_{t},$ (2)

where $\Delta p_{t}=\nabla\cdot(\nabla p_{t})$ is the Laplacian. In fact, both Eq. 1 and Eq. 2 describe the dynamics of the system and serve as the boundary conditions for the optimization problems introduced in later sections, with different focuses: when the constraint is Eq. 1, the formalism is Lagrangian, which depicts the movement of each individual particle; when the constraint is Eq. 2, the formalism is Eulerian, which depicts the evolution of the population.

2.2 Optimal Transport

The optimal transport (OT) problem seeks the optimal transport plans/maps that move the source distribution to the target distribution Villani et al. (2009); Santambrogio (2015); Peyré et al. (2019). In Kantorovich's formulation of the OT problem, the transport cost is minimized with respect to probabilistic couplings/joint distributions Villani et al. (2009); Santambrogio (2015); Peyré et al. (2019). Let $p_0$ and $p_T$ be two Borel probability measures with finite second moments on the space $\Omega \subseteq \mathbb{R}^d$, and let $\Pi(p_0, p_T)$ denote the set of transport plans between these two marginals. Then, Kantorovich's OT problem is defined as follows:

$\underset{\pi\in\Pi(p_{0},p_{T})}{\inf}\int_{\mathcal{X}\times\mathcal{Y}}\frac{1}{2}\|x-y\|^{2}\,\pi(x,y)\,\mathrm{d}x\,\mathrm{d}y,$ (3)

where $\Pi(p_{0},p_{T})=\big\{\pi\in\mathcal{P}(\mathcal{X}\times\mathcal{Y}):(\pi^{x})_{\#}\pi=p_{0},\,(\pi^{y})_{\#}\pi=p_{T}\big\}$, with $\pi^{x}$ and $\pi^{y}$ being the two projections of $\mathcal{X}\times\mathcal{Y}$ onto $\Omega$. The minimizer of Eq. 3, $\pi^{*}$, always exists and is referred to as the optimal transport plan.

Note that the R.H.S. of Eq. 3 can also include an entropy regularization term $D_{\mathrm{KL}}(\pi\|p_{0}\otimes p_{T})$; the original OT problem then becomes the entropy-regularized optimal transport (EROT) problem with Eq. 2 as the constraint, which is better behaved in terms of convexity and stability Cuturi (2013). In particular, from a data generation perspective, $p_0$ is some random initial noise and $p_T$ is the target data distribution, and we can sample the optimal transport maps in a mini-batch manner Tong et al. (2023, 2024); Pooladian et al. (2023).
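As an illustration of sampling mini-batch OT maps, the sketch below (our own assumption, not the authors' implementation) solves the discrete Kantorovich problem of Eq. 3 for two equal-sized mini-batches with uniform weights, where the optimal plan reduces to a permutation and can be found with a linear assignment solver.

import numpy as np
from scipy.optimize import linear_sum_assignment

def minibatch_ot_pairs(x0, x1):
    """Pair two equal-sized mini-batches according to the exact OT plan (Eq. 3)."""
    # Pairwise squared Euclidean costs between source and target samples.
    cost = ((x0[:, None, :] - x1[None, :, :]) ** 2).sum(-1)
    rows, cols = linear_sum_assignment(cost)  # optimal permutation for uniform weights
    return x0[rows], x1[cols]

# Toy usage: couple Gaussian noise with shifted "data" samples.
x0 = np.random.randn(64, 8)
x1 = np.random.randn(64, 8) + 2.0
src, tgt = minibatch_ot_pairs(x0, x1)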

2.3 Schrödinger Bridge

The transport problem in Sec. 2.2 can be further viewed from a distribution evolution perspective, which is particularly suitable for developing flow models of the data generation process. For this purpose, the Schrödinger Bridge (SB) problem is introduced Léonard (2012). Assume that $\Omega = C^{1}([0,T],\mathbb{R}^{d})$ and $\mathcal{P}(\Omega)$ is a probability path measure on the path space $\Omega$; the SB problem then aims to find the following optimal path measure:

$\mathbb{P}^{*}=\underset{\mathbb{P}\in\mathcal{P}(\Omega)}{\arg\min}\,D_{\mathrm{KL}}(\mathbb{P}\|\mathbb{Q})\quad\text{subject to}~\mathbb{P}_{0}=q_{0}~\text{and}~\mathbb{P}_{T}=q_{T},$ (4)

where the Kullback–Leibler (KL) divergence is $D_{\mathrm{KL}}(\mathbb{P}\|\mathbb{Q})=\int\log\frac{\mathrm{d}\mathbb{P}}{\mathrm{d}\mathbb{Q}}\,\mathrm{d}\mathbb{P}$ if $\mathbb{P}\ll\mathbb{Q}$ and $+\infty$ otherwise, and $\mathbb{Q}$ is a reference path measure, e.g., a Brownian motion or an Ornstein–Uhlenbeck process. Moreover, the distribution matching problem in Eq. 3 can be cast as a dynamical SB problem as well Gushchin et al. (2024); Koshizuka and Sato (2022); Liu et al. (2024):

$\arg\min_{\theta}\,\mathbb{E}_{p(X_{t})}\Big[\frac{1}{2}\|\mu_{t}^{\theta}(X_{t},t)\|^{2}\Big],\quad\text{subject to Eq. 1 or Eq. 2},$ (5)

where $\theta$ denotes the parameters of the variational drift function $\mu_{t}^{\theta}$.

2.4 Lagrangian Mechanics

In this section, we formulate the data generation problem under the framework of Lagrangian mechanics Arnol'd (2013). Let $p_t$ be the density (law) of the generalized coordinates $X_t$ and $\dot{p}_t=\frac{\mathrm{d}p_t}{\mathrm{d}t}$ its time derivative. $\mathcal{K}(p_t,\dot{p}_t,t)$ is the kinetic energy and $\mathcal{U}(p_t,t)$ is the potential energy; the corresponding Lagrangian is

$\mathcal{L}(p_{t},\dot{p}_{t},t)=\mathcal{K}(p_{t},\dot{p}_{t},t)-\mathcal{U}(p_{t},t).$ (6)

We assume that Eq. 6 is lower semi-continuous (lsc) and strictly convex in $\dot{p}_t$ in the Wasserstein space. The kinetic energy $\mathcal{K}$ and the potential energy $\mathcal{U}$ are defined as follows, respectively:

$\mathcal{K}(p_{t},\dot{p}_{t},t)=\int^{T}_{0}\int_{\mathbb{R}^{d}}\frac{1}{2}\|\mu_{t}(x_{t},t)\|^{2}p_{t}(x_{t})\,\mathrm{d}x_{t}\,\mathrm{d}t,$ (7)
$\mathcal{U}(p_{t},t)=\int_{\mathbb{R}^{d}}U_{t}(x_{t})p_{t}(x_{t})\,\mathrm{d}x_{t},$ (8)

where $U_{t}(x_{t})$ is the potential function. The action in the context of Lagrangian mechanics is then defined as follows:

$\mathcal{A}[\mu_{t}(x)]=\int^{T}_{0}\int_{\mathbb{R}^{d}}\mathcal{L}(x_{t},\mu_{t},t)\,\mathrm{d}x_{t}\,\mathrm{d}t.$ (9)

According to the principle of least action, the shortest path is the one minimizing the action, which is also aligned with Eq. 4 in the SB theory. Therefore, we can leverage Lagrangian dynamics to tackle the OT problem for data generation. To solve Eq. 6, we need to satisfy the stationarity condition, i.e., the Euler–Lagrange equation:

$\frac{\mathrm{d}}{\mathrm{d}t}\frac{\partial}{\partial\dot{p}_{t}}\mathcal{L}(p_{t},\dot{p}_{t},t)=\frac{\partial}{\partial p_{t}}\mathcal{L}(p_{t},\dot{p}_{t},t),$ (10)

with the boundary conditions $\frac{\mathrm{d}X_{t}}{\mathrm{d}t}=\mu(X_{t},t)$, $q_{0}=p_{0}$, and $q_{T}=p_{T}$.

3 Conditional Lagrangian Wasserstein Flow for Time Series Imputation

In this section, building upon the optimal transport theory, the Schrödinger Bridge problem, and the Lagrangian mechanics introduced in Sec. 2, we propose Conditional Lagrangian Wasserstein Flow, a novel conditional generative method for time series imputation.

3.1 Time Series Imputation

Our goal is to impute the missing time series data points based on the observations. For training, we adopt a conditionally generative approach for time series imputation in the sample space $\mathbb{R}^{K\times L}$, where $K$ denotes the dimension of the multivariate time series and $L$ denotes the sequence length. In our self-supervised learning approach, the total observed data $x^{\mathrm{obs}}\in\mathbb{R}^{K\times L}$ are partitioned into the imputation target $x^{\mathrm{tar}}\in\mathbb{R}^{K\times L}$ and the conditional data $x^{\mathrm{cond}}\in\mathbb{R}^{K\times L}$.

As a result, the missing data points $x^{\mathrm{tar}}$ can be generated based on the conditions $x^{\mathrm{cond}}$ jointly with some uninformative initial distribution $x_{0}\in\mathbb{R}^{K\times L}$ (e.g., Gaussian noise) at time $t=0$. The imputation task can then be described as $x^{\mathrm{tar}}\sim p(x^{\mathrm{tar}}|x^{\mathrm{input}}_{0})$, where the total input of the model is $x^{\mathrm{input}}_{0}:=(x^{\mathrm{cond}},x_{0})\in\mathbb{R}^{K\times L\times 2}$.
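A minimal sketch of this self-supervised partitioning is given below; the mask names and the target ratio are illustrative assumptions rather than the authors' exact preprocessing.

import torch

def partition_observed(x_obs, obs_mask, target_ratio=0.1):
    # x_obs:    (K, L) observed series (zeros at originally missing points).
    # obs_mask: (K, L) binary mask, 1 where a value was actually observed.
    rand = torch.rand(obs_mask.shape)
    target_mask = (rand < target_ratio) & obs_mask.bool()   # hidden entries become x^tar
    cond_mask = obs_mask.bool() & ~target_mask              # the rest stays visible as x^cond
    return x_obs * target_mask, target_mask, x_obs * cond_mask, cond_mask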

3.2 Interpolation in Wasserstein Space

To solve Eq. 7, we need to sample the intermediate variable $X_t$ in the Wasserstein space first. To do so, an interpolation method is adopted to construct the intermediate samples. According to the OT and SB problems introduced in Sec. 2, we define the following time-differentiable interpolant:

$I_{t}:\Omega\times\Omega\rightarrow\Omega\quad\text{such that}~I_{0}=X_{0}~\text{and}~I_{T}=X_{T},$ (11)

where $\Omega\subseteq\mathbb{R}^{d}$ is the support of the marginals $p_{0}(X_{0})$ and $p_{T}(X_{T})$, as well as of the conditional $p(X_{t}|X_{0},X_{T},t)$.

To implement $I_t$, we first independently sample some random noise $X_{0}\sim\mathcal{N}(0,\sigma^{2}_{0})$ at the initial time $t=0$ and data samples $X_{T}\sim p(x^{\mathrm{tar}})$ at the terminal time $t=T$. Afterwards, interpolation is used to construct the intermediate samples $X_{t}\sim p(X_{t}|X_{0},X_{T},t)$, where $t\sim\mathrm{uniform}(0,T)$ Liu et al. (2022); Albergo and Vanden-Eijnden (2023); Tong et al. (2024). More specifically, we design the following sampling approach:

$X_{t}=\frac{t}{T}(X_{T}+\gamma_{t})+(1-\frac{t}{T})X_{0}+\alpha(t)\sqrt{\frac{t(T-t)}{T}}\,\epsilon,\quad t\in[0,T],$ (12)

where $\gamma_{t}\sim\mathcal{N}(0,\sigma^{2}_{\gamma})$ is random noise with variance $\sigma^{2}_{\gamma}$ injected into the target data samples to improve the coupling's generalization, $\epsilon$ is standard Gaussian noise, and $\alpha(t)\geq 0$ is a time-dependent scalar.

Note that Eq. 12 alone only allows us to generate time-dependent intermediate samples in the Euclidean space rather than the Wasserstein space, which can lead to slow convergence since the sampling paths are not straightened. To address this issue, we project the samples into the Wasserstein space before interpolating in order to straighten the probability flow. To this end, we leverage the method adopted in Tong et al. (2023, 2024); Pooladian et al. (2023) to first sample mini-batch OT maps between $X_0$ and $X_T$, and then perform the interpolation according to Eq. 12. Finally, we form the joint variable $x^{\mathrm{input}}_{t}:=(x^{\mathrm{cond}},x_{t})$ as the input for computing the velocity of the Wasserstein flow.
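A minimal sketch of the interpolation step in Eq. 12 is shown below, assuming the mini-batches have already been paired by an OT coupling (e.g., with the assignment-based sketch in Sec. 2.2) and that `t` is broadcastable over the sample dimensions; the function name and defaults are illustrative.

import torch

def sample_interpolant(x0, x_tar, t, T=1.0, sigma_gamma=1e-3, alpha=0.0):
    # Eq. 12: X_t = (t/T)(X_T + gamma_t) + (1 - t/T) X_0 + alpha(t) sqrt(t(T-t)/T) eps
    gamma = sigma_gamma * torch.randn_like(x_tar)   # noise injected into the target
    eps = torch.randn_like(x_tar)                   # bridge-style Gaussian noise
    bridge = alpha * torch.sqrt(t * (T - t) / T)
    return (t / T) * (x_tar + gamma) + (1 - t / T) * x0 + bridge * eps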

3.3 Flow Matching

To estimate the velocity $\mu_{t}(X_{t},t)$ of the Wasserstein flow in Eq. 1, previous methods that require trajectory simulation for training can incur long convergence times and large computational costs Chen et al. (2018); Onken et al. (2021). To circumvent these issues, in this work we adopt a simulation-free training strategy based on the OT theory introduced in Sec. 2.2 Liu et al. (2022); Tong et al. (2023); Albergo and Vanden-Eijnden (2023), which turns out to be faster and more scalable to large time series datasets.

Since we can now draw mini-batch interpolated samples of the source and target distributions in the Wasserstein space using Eq. 12, we model the variational velocity function with a neural network with parameters $\theta$. According to Eq. 1, the target velocity can be computed from the difference between the source and target samples. Therefore, the variational velocity function $\mu_{t}^{\theta}(x^{\mathrm{input}}_{t},t)$ can be learned by

$\arg\min_{\theta}\int_{0}^{T}\int_{\mathbb{R}^{d}\times\mathbb{R}^{d}\times\mathbb{R}^{d}}\Big\|\frac{\mathrm{d}X_{t}}{\mathrm{d}t}-\mu_{t}^{\theta}(x^{\mathrm{input}}_{t},t)\Big\|^{2}\,\mathrm{d}x_{0}\,\mathrm{d}x^{\mathrm{tar}}\,\mathrm{d}x^{\mathrm{input}}\,\mathrm{d}t$ (13)
$\approx\arg\min_{\theta}\mathbb{E}_{p(x_{0}),p(x^{\mathrm{tar}}),p(x^{\mathrm{input}}),t}\Big[\Big\|\frac{x^{\mathrm{tar}}-x_{0}}{T}-\mu_{t}^{\theta}(x^{\mathrm{input}}_{t},t)\Big\|^{2}\Big].$ (14)

Eq. 14 can be solved by drawing mini-batch samples in the Wasserstein space and performing stochastic gradient descent accordingly. In this fashion, the learning process is simulation-free as the trajectory simulation is not needed.
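A sketch of one training step under Eq. 14 is given below; `model` stands for a hypothetical velocity network $\mu_t^{\theta}$ taking the interpolant, the conditioning series, and the time as inputs, and `sample_interpolant` refers to the Eq. 12 sketch above.

import torch

def flow_matching_loss(model, x0, x_tar, x_cond, T=1.0):
    b = x0.shape[0]
    t = torch.rand(b, 1, 1) * T                   # t ~ uniform(0, T)
    x_t = sample_interpolant(x0, x_tar, t, T=T)   # intermediate sample (Eq. 12)
    target_velocity = (x_tar - x0) / T            # straight-line displacement
    pred_velocity = model(x_t, x_cond, t)         # mu_theta(x_t^input, t)
    return ((target_velocity - pred_velocity) ** 2).mean()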

Moreover, note that Eq. 13 also obeys the principle of least action introduced in Sec. 2.4, as it minimizes the kinetic energy described in Eq. 7. This indicates that the geodesic driving the particles from the source distribution to the target distribution in the OT problem described in Sec. 2 is found as well, which enables us to generate new samples with fewer simulation steps than standard diffusion models.

3.4 Potential Function

So far, we have shown how to leverage the kinetic energy to estimate the velocity in the Lagrangian described by Eq. 6. In addition, we can incorporate prior knowledge through the task-specific potential energy in the dynamics, which enables us to further improve the data generation performance. To this end, let $U_{t}(X_{t}):\mathbb{R}^{d}\times[0,T]\rightarrow\mathbb{R}$ be the task-specific potential function depending on the generalized coordinates $X_t$ Yang and Karniadakis (2020); Onken et al. (2021); Neklyudov et al. (2023b). We can then compute the dynamics of the system by

$\frac{\mathrm{d}X_{t}}{\mathrm{d}t}=v_{t}(X_{t},t)=-\nabla_{x}U_{t}(X_{t}).$ (15)

Since the data generation problem in our case can also be interpreted as a stochastic optimal control (SOC) problem Bellman (1966); Fleming and Rishel (2012); Nüsken and Richter (2021); Zhang and Chen (2022); Holdijk et al. (2023); Berner et al. (2024), the existence of such a $U_{t}(X_{t})$ is assured by Pontryagin's Maximum Principle (PMP) Evans (2024).

To estimate $v_{t}(X_{t},t)$, according to the Lagrangian in Eq. 6, we assume that the potential function takes the form $U_{t}(X_{t})\approx-\log\mathcal{N}(X_{t}|\hat{X}_{t},\sigma^{2}_{p})$, where $\hat{X}_{t}$ is the learned mean and $\sigma^{2}_{p}$ is a pre-defined variance. As a result, the derivative is $\nabla_{x}U_{t}(X_{t})=\frac{X_{t}-\hat{X}_{t}}{\sigma^{2}_{p}}$. In terms of practical implementation, we parameterize $\nabla_{x}U_{t}(X_{t})$ via a Variational Autoencoder (VAE) Kingma and Welling (2014). More specifically, we pre-train a VAE on the total observed time series data $X^{\mathrm{obs}}$. Afterwards, the reconstruction discrepancies of the VAE are used to approximate the task-specific $v_{t}^{\phi}(X_{t},t)$ depending on $X_t$:

$v_{t}^{\phi}(X_{t},t)=-\frac{1}{\sigma^{2}_{p}}(X_{t}-\mathrm{VAE}(X_{t})),$ (16)

where $\mathrm{VAE}(X_{t})$ represents the reconstruction output of the pre-trained VAE model with input $X_t$, and $\sigma^{2}_{p}$ is treated as a positive constant for simplicity. In this manner, we can incorporate the prior knowledge learned from the accessible training data into the sampling process formulated by Eq. 14 to enhance the data generation performance.
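The sketch below illustrates Eq. 16, assuming `vae` is a hypothetical pre-trained model whose forward pass returns the reconstruction of its input.

import torch

def potential_gradient(vae, x_t, sigma_p2=0.01):
    # v_t^phi(X_t, t) = -(X_t - VAE(X_t)) / sigma_p^2   (Eq. 16)
    with torch.no_grad():
        recon = vae(x_t)
    return -(x_t - recon) / sigma_p2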

3.5 Rao-Blackwellized Sampler

To generate the missing time series data points, we first formulate an unbiased ODE sampler $S(X_{t},\mu_{t}^{\theta}(X_{t},t),t)$ for $X_{t+1}$ using the Euler method and the $\mu_{t}^{\theta}(X_{t},t)$ learned via Eq. 14 (i.e., the diffusion term in Eq. 1 is omitted). Alternatively, one can adopt an SDE sampler using the Euler–Maruyama method. Nevertheless, to achieve the best imputation performance, we choose the ODE sampler for implementation. Note that the ODE sampler alone is sufficient to generate high-quality samples for time series imputation.

We can now construct a Rao-Blackwellized trajectory sampler Casella and Robert (1996) for time series imputation using Eq. 14 and Eq. 16. To this end, we treat $\mathcal{S}(X_{t+1}|X_{t},\mu_{t}^{\theta}(X_{t},t),t)$ as the base estimator for $X_{t+1}$ with $\mathbb{E}[\mathcal{S}^{2}]<\infty$ for all $X_{t+1}$, and we assume $\mathcal{T}(X_{t},v_{t}^{\phi}(x_{t},t),t)$ is a sufficient statistic for $X_{t+1}$ based on Eq. 16, even if it is not a very accurate estimator of $X_{t+1}$ by itself. As a result, we can formulate a new trajectory sampler $\mathcal{S}^{*}=\mathbb{E}[\mathcal{S}|\mathcal{T}]$ to generate the missing time series data. According to the Rao-Blackwell theorem Casella and Robert (1996), we have

$\mathbb{E}[(\mathcal{S}^{*}-X_{t+1})^{2}]\leq\mathbb{E}[(\mathcal{S}-X_{t+1})^{2}],$ (17)

where the inequality is strict unless $\mathcal{S}$ is a function of $\mathcal{T}$. Eq. 17 suggests that we can construct a sampler with smaller errors than the base ODE sampler $\mathcal{S}$ by Rao-Blackwellization.

3.6 The Algorithms

Figure 1: The overall training process of Conditional Lagrangian Wasserstein Flow.

The overall training process of CLWF is illustrated in Fig. 1 and consists of the following stages. First, the total observed data $x^{\mathrm{obs}}$ are partitioned into target data and conditional data for training. Next, data pairs of $x^{\mathrm{tar}}$ and $x_0$ are sampled from the target data and random Gaussian noise, respectively. Then, the data pairs are projected into the Wasserstein space by sampling the corresponding OT maps. After that, the intermediate variable $x_t$ is sampled through interpolation using Eq. 12, and the target velocity $\frac{\mathrm{d}X_{t}}{\mathrm{d}t}$ is approximated by $\frac{x^{\mathrm{tar}}-x_{0}}{T}$. Subsequently, the joint variable $x^{\mathrm{input}}$, composed of the conditional information $x^{\mathrm{cond}}$ and the intermediate variable $x_t$, is fed to the variational flow model $\mu_{t}^{\theta}$ to compute the velocity. Finally, the flow matching loss defined by Eq. 14 is minimized by stochastic gradient descent.

Furthermore, to incorporate prior information into the model, we can optionally train a VAE model on the total observed data $x^{\mathrm{obs}}$. The VAE is used to estimate the derivative of the task-specific potential function according to Eq. 16, which can be further utilized to construct a more powerful Rao-Blackwellized sampler for inference.

For inference, at time $t=0$, we sample the initial random noise $x_0$ and combine it with the conditional information $x^{\mathrm{cond}}$ to form the joint variable $x^{\mathrm{input}}$. Note that during trajectory sampling, $x_0$ evolves over time while $x^{\mathrm{cond}}$ remains invariant. We use $x^{\mathrm{input}}$ as the input of the flow model $\mu_{t}^{\theta}$ to compute the velocity and then sample the new $x_t$ using the Euler method. If we perform Rao-Blackwellization, $x_t$ is additionally fed to the VAE model to compute the derivative of the potential function, and $x_t$ is updated again using the Euler method. The above process is repeated until the terminal time is reached. Moreover, we can sample multiple trajectories using different initial random noise and take their average as the final imputation result. The training and sampling procedures are presented in Algorithm 1 and Algorithm 2, respectively, followed by a Python-style sketch of the sampling loop.

Algorithm 1 Training procedure
Input: terminal time $T$, max epochs, observed data $X^{\mathrm{obs}}$, parameters $\theta$ and $\phi$.
while epoch < max epochs do
     sample $t$, $(x_{0},x_{T})$, and OT maps;
     sample $x_{t}$ according to Eq. 12;
     minimize Eq. 14;
end while
if Rao-Blackwellization then
     train a VAE model on $X^{\mathrm{obs}}$.
end if
Algorithm 2 Sampling procedure
Input: initial time $t=0$, terminal time $T=1$, Euler method step number $N$.
sample initial noise $X_{0}\sim\mathcal{N}(0,\sigma^{2}_{0})$
while $t < T$ do
     $X_{t}=X_{t}+\mu_{t}^{\theta}(x^{\mathrm{input}},t)\frac{T}{N}$
     if Rao-Blackwellization then
          $X_{t}=X_{t}+v_{t}^{\phi}(x_{t},t)\frac{T}{N}$
     end if
     $t=t+\frac{T}{N}$
end while
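Below is a minimal Python-style sketch of Algorithm 2, with the optional Rao-Blackwellized correction and Monte Carlo averaging; the `model` and `vae` interfaces are the same hypothetical ones used in the earlier sketches, not the authors' actual code.

import torch

def impute(model, x_cond, shape, n_steps=15, T=1.0, sigma0=0.1,
           vae=None, sigma_p2=0.01, n_mc=50):
    dt = T / n_steps
    samples = []
    for _ in range(n_mc):                               # Monte Carlo trajectories
        x = sigma0 * torch.randn(shape)                 # X_0 ~ N(0, sigma0^2)
        t = 0.0
        for _ in range(n_steps):
            t_batch = torch.full((shape[0], 1, 1), t)
            x = x + model(x, x_cond, t_batch) * dt      # base Euler update
            if vae is not None:                         # Rao-Blackwellized correction (Eq. 16)
                x = x + potential_gradient(vae, x, sigma_p2) * dt
            t += dt
        samples.append(x)
    return torch.stack(samples).mean(0)                 # average imputation result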

4 Experiments

4.1 Datasets

We use two public multivariate time series datasets for validation. The first is the PM 2.5 dataset Zheng et al. (2013), collected from air quality monitoring sites over 12 months. The missing rate of the raw data is 13%. The feature number $K$ is 36 and the sequence length $L$ is 36. In our experiments, only the observed data points are randomly masked as the imputation targets.

The other dataset is the PhysioNet dataset Silva et al. (2012), collected from intensive care units over 48 hours. The feature number $K$ is 35 and the sequence length $L$ is 48. The missing rate of the raw data is 80%. In our experiments, 10% and 50% of the data points are randomly masked as the imputation targets, denoted as PhysioNet 0.1 and PhysioNet 0.5, respectively.

4.2 Baselines

For comparison, we select the following state-of-the-art time series imputation methods as baselines: 1) GP-VAE Fortuin et al. (2020), which combines a VAE model with a Gaussian Process prior; 2) CSDI Tashiro et al. (2021), which is based on a conditional diffusion model; 3) CSBI Chen et al. (2023), which is based on the Schrödinger bridge diffusion model; and 4) DSPD-GP Biloš et al. (2023), which combines a diffusion model with a Gaussian Process prior.

4.3 Experimental Settings

In terms of architecture, both the flow model and the VAE model are built upon Transformers Tashiro et al. (2021). We use the ODE sampler for inference and sample the exact optimal transport maps for interpolation to achieve the best performance. The optimizer is Adam with a learning rate of 0.001 and a linear scheduler. The maximum number of training epochs is 200 and the mini-batch size for training is 64. The total number of Euler steps used in CLWF is 15, while the step numbers for the other diffusion models, i.e., CSDI, CSBI, and DSPD-GP, are set as suggested in their papers. The number of Monte Carlo samples for inference is 50. The standard deviation $\sigma_0$ of the initial noise $X_0$ is 0.1, and the standard deviation $\sigma_{\gamma}$ of the injected noise $\gamma_t$ is 0.001. The coefficient $\sigma_p^2$ in the derivative of the potential function is 0.01.
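For reference, the stated hyperparameters can be collected into a single configuration sketch; the dictionary and key names below are illustrative, not the authors' actual configuration file.

CONFIG = {
    "optimizer": "Adam",
    "learning_rate": 1e-3,    # with a linear scheduler
    "max_epochs": 200,
    "batch_size": 64,
    "euler_steps": 15,        # CLWF sampling steps
    "mc_samples": 50,         # Monte Carlo trajectories at inference
    "sigma_0": 0.1,           # std of the initial noise X_0
    "sigma_gamma": 1e-3,      # std of the injected target noise gamma_t
    "sigma_p_sq": 0.01,       # variance in the potential-function derivative
}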

4.4 Experimental Results

4.4.1 Imputation Results

Table 1: Test imputation results on PM 2.5, PhysioNet 0.1, and PhysioNet 0.5 (5-trial averages). The best results are in bold and the second best are underlined.

Method    | PM 2.5        | PhysioNet 0.1 | PhysioNet 0.5
          | RMSE    MAE   | RMSE    MAE   | RMSE    MAE
GP-VAE    | 43.1    26.4  | 0.73    0.42  | 0.76    0.47
CSDI      | 19.3    9.86  | 0.57    0.24  | 0.65    0.32
CSBI      | 19.0    9.80  | 0.55    0.23  | 0.63    0.31
DSPD-GP   | 18.3    9.70  | 0.54    0.22  | 0.68    0.30
CLWF      | 18.1    9.70  | 0.47    0.22  | 0.64    0.29

We assess the proposed method on PM 2.5, PhysioNet 0.1, and PhysioNet 0.5, respectively. The root mean squared error (RMSE) and the mean absolute error (MAE) are used as the evaluation metrics. From the test results shown in Table 1 and Fig. 2, we can see that our method CLWF outperforms the existing deep learning-based method (GP-VAE) and the recent state-of-the-art diffusion methods (CSDI, CSBI, and DSPD-GP). Moreover, CLWF uses only 15 sampling steps for inference, while the baseline diffusion methods use 50 sampling steps. This suggests that CLWF is faster and more accurate than the existing methods on time series imputation tasks.

Figure 2: Visualization of the test imputation results on PM 2.5; green dots are the conditions, blue dots are the imputation results, and red dots are the ground truth.

4.4.2 Ablation Study

Single-sample Imputation Result. We compare the time series imputation performance of CLWF with CSDI when only one Monte Carlo sample is used. The results shown in Table 2 indicate that CLWF outperforms CSDI, which suggests that CLWF exhibits lower imputation variance than diffusion-based models. This indicates that CLWF is more efficient and computationally economical for inference.

Table 2: Single-sample test imputation results on PM 2.5, PhysioNet 0.1, and PhysioNet 0.5 (5-trial averages).

Method    | PM 2.5        | PhysioNet 0.1 | PhysioNet 0.5
          | RMSE    MAE   | RMSE    MAE   | RMSE    MAE
CSDI      | 22.2    11.7  | 0.74    0.30  | 0.83    0.40
CLWF      | 18.4    10.0  | 0.48    0.22  | 0.64    0.30

Effect of Rao-Blackwellization. We compare the test imputation performance of CLWF with and without Rao-Blackwellization. Note that the PhysioNet dataset does not have enough non-zero data points to train a valid VAE model; therefore, we only construct the Rao-Blackwellized sampler for the PM 2.5 dataset.

Table 3: Test imputation results on PM 2.5 (5-trial averages).

Method              | RMSE    MAE
CLWF (without RB)   | 18.2    9.75
CLWF (RB)           | 18.1    9.70

The results shown in Table 3 indicate that the Rao-Blackwellized sampler can further improve the time series imputation performance of the base sampler.

5 Related Work

5.1 Diffusion Models

Diffusion models, such as DDPMs Ho et al. (2020) and SBGMs Song et al. (2020), are considered the new contenders to GANs on data generation tasks, but they generally take a relatively long time to produce high-quality samples. To mitigate this problem, flow matching methods have been proposed from an optimal transport perspective. For example, ENOT uses a saddle point reformulation of the OT problem to develop a new diffusion model Gushchin et al. (2024). Flow matching methods have also been developed based on the OT theory Lipman et al. (2022); Liu (2022); Liu et al. (2023); Albergo and Vanden-Eijnden (2023); Albergo et al. (2023). In particular, mini-batch couplings have been proposed to straighten the probability flows for fast inference Pooladian et al. (2023); Tong et al. (2024, 2023).

The Schrödinger Bridge has also been applied to diffusion models to improve their data generation performance. Diffusion Schrödinger Bridge utilizes the Iterative Proportional Fitting (IPF) method to solve the SB problem De Bortoli et al. (2021). SB-FBSDE uses forward-backward (FB) SDE theory to solve the SB problem through likelihood training Chen et al. (2022). GSBM formulates a generalized Schrödinger Bridge matching framework by including task-specific state costs for various data generation tasks Liu et al. (2024). NLSB chooses to model the potential function rather than the velocity function to solve the Lagrangian SB problem Koshizuka and Sato (2022). Action Matching Neklyudov et al. (2023a, b) leverages the principle of least action in Lagrangian mechanics to implicitly model the velocity function for trajectory inference. Another class of diffusion models has been proposed from a stochastic optimal control perspective by solving HJB-PDEs Nüsken and Richter (2021); Zhang and Chen (2022); Berner et al. (2024); Liu et al. (2024).

5.2 Time Series Imputation

Many diffusion-based models have recently been proposed for time series imputation Lin et al. (2023); Meijer and Chen (2024). For instance, CSDI Tashiro et al. (2021) combines a conditional DDPM with a Transformer model to impute time series data. CSBI Chen et al. (2023) adopts the FB-SDE theory to train a conditional Schrödinger bridge model for probabilistic time series imputation. To model the dynamics of time series from irregularly sampled data, DSPD-GP Biloš et al. (2023) uses a Gaussian process as the noise generator. TSDiff Kollovieh et al. (2024) utilizes self-guidance and a learned implicit probability density to improve the time series imputation performance of diffusion models. However, the time series imputation methods mentioned above exhibit common issues, such as slow convergence, similar to many diffusion models. Therefore, in this work, we propose CLWF to tackle these challenges.

6 Conclusion, Limitation, and Broader Impact

In this work, we proposed CLWF, a novel time series imputation method based on optimal transport theory and Lagrangian mechanics. To generate the missing time series data, following the principle of least action, CLWF learns a velocity field by minimizing the kinetic energy to move the initial random noise to the target distribution. Moreover, we can also estimate the derivative of a potential function via a VAE model trained on the observed training data to further improve the performance of the base sampler via Rao-Blackwellization. In contrast with previous diffusion-based models, the proposed method requires fewer simulation steps and Monte Carlo samples to produce high-quality data, which leads to fast inference. For validation, CLWF is assessed on two public datasets and achieves competitive results compared with existing methods.

One limitation of CLWF is that the generated samples are not diverse enough since we use an ODE sampler for inference, which results in a slightly higher test continuous ranked probability score (CRPS) compared to previous works, e.g., CSDI. Therefore, for future work, we will seek suitable approaches to accurately model the diffusion term in the SDE. Moreover, we will also try to design better task-specific potential functions for sparse multivariate time series data. We also plan to explore the potential of the Lagrangian Wasserstein Flow model for other time series analysis tasks, such as anomaly detection and uncertainty quantification.

In terms of broader impact, our study on time series imputation has the potential to address important real-world challenges and consequently make a positive impact on daily lives.

References

  • Albergo et al. [2023] Michael S Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797, 2023.
  • Albergo and Vanden-Eijnden [2023] Michael Samuel Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. In The Eleventh International Conference on Learning Representations, 2023.
  • Arnol’d [2013] Vladimir Igorevich Arnol’d. Mathematical methods of classical mechanics, volume 60. Springer Science & Business Media, 2013.
  • Bellman [1966] Richard Bellman. Dynamic programming. science, 153(3731):34–37, 1966.
  • Berner et al. [2024] Julius Berner, Lorenz Richter, and Karen Ullrich. An optimal control perspective on diffusion-based generative modeling. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=oYIjw37pTP.
  • Bertsekas [2012] Dimitri Bertsekas. Dynamic programming and optimal control: Volume I, volume 4. Athena scientific, 2012.
  • Biloš et al. [2023] Marin Biloš, Kashif Rasul, Anderson Schneider, Yuriy Nevmyvaka, and Stephan Günnemann. Modeling temporal data as continuous functions with stochastic process diffusion. In International Conference on Machine Learning, pages 2452–2470. PMLR, 2023.
  • Casella and Robert [1996] George Casella and Christian P Robert. Rao-blackwellisation of sampling schemes. Biometrika, 83(1):81–94, 1996.
  • Chen et al. [2018] Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. Advances in neural information processing systems, 31, 2018.
  • Chen et al. [2022] Tianrong Chen, Guan-Horng Liu, and Evangelos Theodorou. Likelihood training of schrödinger bridge using forward-backward sdes theory. In International Conference on Learning Representations, 2022, 2022.
  • Chen et al. [2021] Yongxin Chen, Tryphon T Georgiou, and Michele Pavon. Stochastic control liaisons: Richard sinkhorn meets gaspard monge on a schrodinger bridge. Siam Review, 63(2):249–313, 2021.
  • Chen et al. [2023] Yu Chen, Wei Deng, Shikai Fang, Fengpei Li, Nicole Tianjiao Yang, Yikai Zhang, Kashif Rasul, Shandian Zhe, Anderson Schneider, and Yuriy Nevmyvaka. Provably convergent schrödinger bridge with applications to probabilistic time series imputation. In International Conference on Machine Learning, pages 4485–4513. PMLR, 2023.
  • Cuturi [2013] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in neural information processing systems, 26, 2013.
  • De Bortoli et al. [2021] Valentin De Bortoli, James Thornton, Jeremy Heng, and Arnaud Doucet. Diffusion schrödinger bridge with applications to score-based generative modeling. Advances in Neural Information Processing Systems, 34:17695–17709, 2021.
  • Evans [2022] Lawrence C Evans. Partial differential equations, volume 19. American Mathematical Society, 2022.
  • Evans [2024] Lawrence C Evans. An introduction to mathematical optimal control theory spring, 2024 version. Lecture notes available at https://math.berkeley.edu/ evans/control.course.pdf, 2024.
  • Fleming and Rishel [2012] Wendell H Fleming and Raymond W Rishel. Deterministic and stochastic optimal control, volume 1. Springer Science & Business Media, 2012.
  • Fortuin et al. [2020] Vincent Fortuin, Dmitry Baranchuk, Gunnar Rätsch, and Stephan Mandt. Gp-vae: Deep probabilistic time series imputation. In International conference on artificial intelligence and statistics, pages 1651–1661. PMLR, 2020.
  • Gushchin et al. [2024] Nikita Gushchin, Alexander Kolesov, Alexander Korotin, Dmitry P Vetrov, and Evgeny Burnaev. Entropic neural optimal transport via diffusion processes. Advances in Neural Information Processing Systems, 36, 2024.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • Holdijk et al. [2023] Lars Holdijk, Yuanqi Du, Ferry Hooft, Priyank Jaini, Berend Ensing, and Max Welling. Stochastic optimal control for collective variable free sampling of molecular transition paths. Advances in Neural Information Processing Systems, 36, 2023.
  • Kingma and Welling [2014] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In The Second International Conference on Learning Representations, 2014.
  • Kollovieh et al. [2024] Marcel Kollovieh, Abdul Fatir Ansari, Michael Bohlke-Schneider, Jasper Zschiegner, Hao Wang, and Yuyang Bernie Wang. Predict, refine, synthesize: Self-guiding diffusion models for probabilistic time series forecasting. Advances in Neural Information Processing Systems, 36, 2024.
  • Koshizuka and Sato [2022] Takeshi Koshizuka and Issei Sato. Neural lagrangian schrödinger bridge: Diffusion modeling for population dynamics. In The Eleventh International Conference on Learning Representations, 2022.
  • Léonard [2012] Christian Léonard. From the schrödinger problem to the monge–kantorovich problem. Journal of Functional Analysis, 262(4):1879–1920, 2012.
  • Lin et al. [2023] Lequan Lin, Zhengkun Li, Ruikun Li, Xuliang Li, and Junbin Gao. Diffusion models for time-series applications: a survey. Frontiers of Information Technology & Electronic Engineering, pages 1–23, 2023.
  • Lipman et al. [2022] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2022.
  • Liu et al. [2024] Guan-Horng Liu, Yaron Lipman, Maximilian Nickel, Brian Karrer, Evangelos Theodorou, and Ricky TQ Chen. Generalized schrödinger bridge matching. In The Twelfth International Conference on Learning Representations, 2024.
  • Liu [2022] Qiang Liu. Rectified flow: A marginal preserving approach to optimal transport. arXiv preprint arXiv:2209.14577, 2022.
  • Liu et al. [2022] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations, 2022.
  • Liu et al. [2023] Xingchao Liu, Lemeng Wu, Mao Ye, and Qiang Liu. Learning diffusion bridges on constrained domains. In international conference on learning representations (ICLR), 2023.
  • Meijer and Chen [2024] Caspar Meijer and Lydia Y Chen. The rise of diffusion models in time-series forecasting. arXiv preprint arXiv:2401.03006, 2024.
  • Neklyudov et al. [2023a] Kirill Neklyudov, Rob Brekelmans, Daniel Severo, and Alireza Makhzani. Action matching: Learning stochastic dynamics from samples. In International Conference on Machine Learning, pages 25858–25889. PMLR, 2023a.
  • Neklyudov et al. [2023b] Kirill Neklyudov, Rob Brekelmans, Alexander Tong, Lazar Atanackovic, Qiang Liu, and Alireza Makhzani. A computational framework for solving wasserstein lagrangian flows. arXiv preprint arXiv:2310.10649, 2023b.
  • Nüsken and Richter [2021] Nikolas Nüsken and Lorenz Richter. Solving high-dimensional hamilton–jacobi–bellman pdes using neural networks: perspectives from the theory of controlled diffusions and measures on path space. Partial differential equations and applications, 2(4):48, 2021.
  • Oksendal [2013] Bernt Oksendal. Stochastic differential equations: an introduction with applications. Springer Science & Business Media, 2013.
  • Onken et al. [2021] Derek Onken, Samy Wu Fung, Xingjian Li, and Lars Ruthotto. Ot-flow: Fast and accurate continuous normalizing flows via optimal transport. Proceedings of the AAAI Conference on Artificial Intelligence, 35(10):9223–9232, 2021.
  • Peyré et al. [2019] Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport: With applications to data science. Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019.
  • Pooladian et al. [2023] Aram-Alexandre Pooladian, Heli Ben-Hamu, Carles Domingo-Enrich, Brandon Amos, Yaron Lipman, and Ricky TQ Chen. Multisample flow matching: Straightening flows with minibatch couplings. In International Conference on Machine Learning, pages 28100–28127. PMLR, 2023.
  • Risken and Frank [2012] Hannes Risken and Till Frank. The Fokker-Planck Equation: Methods of Solution and Applications, volume 18. Springer Science & Business Media, 2012.
  • Santambrogio [2015] Filippo Santambrogio. Optimal transport for applied mathematicians. Birkäuser, NY, 55(58-63):94, 2015.
  • Silva et al. [2012] Ikaro Silva, George Moody, Daniel J Scott, Leo A Celi, and Roger G Mark. Predicting in-hospital mortality of icu patients: The physionet/computing in cardiology challenge 2012. In 2012 Computing in Cardiology, pages 245–248. IEEE, 2012.
  • Song et al. [2020] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2020.
  • Tashiro et al. [2021] Yusuke Tashiro, Jiaming Song, Yang Song, and Stefano Ermon. Csdi: Conditional score-based diffusion models for probabilistic time series imputation. Advances in Neural Information Processing Systems, 34:24804–24816, 2021.
  • Tong et al. [2023] Alexander Tong, Nikolay Malkin, Kilian Fatras, Lazar Atanackovic, Yanlei Zhang, Guillaume Huguet, Guy Wolf, and Yoshua Bengio. Simulation-free schrödinger bridges via score and flow matching. arXiv preprint arXiv:2307.03672, 2023.
  • Tong et al. [2024] Alexander Tong, Kilian FATRAS, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Guy Wolf, and Yoshua Bengio. Improving and generalizing flow-based generative models with minibatch optimal transport. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=CD9Snc73AW. Expert Certification.
  • Villani et al. [2009] Cédric Villani et al. Optimal transport: old and new, volume 338. Springer, 2009.
  • Yang and Karniadakis [2020] Liu Yang and George Em Karniadakis. Potential flow generator with l 2 optimal transport regularity for generative models. IEEE Transactions on Neural Networks and Learning Systems, 33(2):528–538, 2020.
  • Yong and Zhou [2012] Jiongmin Yong and Xun Yu Zhou. Stochastic controls: Hamiltonian systems and HJB equations, volume 43. Springer Science & Business Media, 2012.
  • Zhang and Chen [2022] Qinsheng Zhang and Yongxin Chen. Path integral sampler: A stochastic control approach for sampling. In The Tenth International Conference on Learning Representations. OpenReview.net, 2022. URL https://openreview.net/forum?id=_uCb2ynRu7Y.
  • Zheng et al. [2013] Yu Zheng, Furui Liu, and Hsun-Ping Hsieh. U-air: When urban air quality inference meets big data. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1436–1444, 2013.

Appendix A Stochastic Optimal Control

The data generation task can also be interpreted as a stochastic optimal control (SOC) problem Bellman [1966], Fleming and Rishel [2012], Nüsken and Richter [2021], Zhang and Chen [2022], Holdijk et al. [2023], Berner et al. [2024], whose cost function $\mathcal{J}$ is defined as:

$\mathcal{J}(X_{t},t)=\mathbb{E}_{p(X_{t})}\Big[\int^{T}_{0}\int_{\mathbb{R}^{d}}\frac{1}{2}\|\nabla_{x}\Psi(X_{t},t)\|^{2}\,\mathrm{d}X_{t}\,\mathrm{d}t\Big]+\mathbb{E}_{p(X_{T})}\big[\Psi(X_{T})\big],$ (18)

where $\frac{1}{2}\|\nabla_{x}\Psi(X_{t},t)\|^{2}$ denotes the running cost and $\Psi(X_{T})$ denotes the terminal cost. The above SOC problem can be solved by dynamic programming Bellman [1966], Bertsekas [2012].

Further, let $V(X_{t},t)=\inf\mathcal{J}(X_{t},t)$ be the value function (optimal cost-to-go) of the SOC problem; the corresponding Hamilton-Jacobi-Bellman (HJB) partial differential equation (PDE) Evans [2022], Yong and Zhou [2012] is then given by

$\frac{\partial V_{t}}{\partial t}-\frac{1}{2}\nabla V_{t}^{\top}\nabla V_{t}+\frac{1}{2}\Delta V_{t}=0,$ (19)
with the terminal condition $V(X_{t},T)=\Psi(X_{t}).$ (20)

Appendix B Rao-Blackwell Theorem

Theorem 1 (Rao-Blackwell). Let $\mathcal{S}$ be an unbiased estimator of some parameter $\theta\in\Theta$ and $\mathcal{T}(X)$ a sufficient statistic for $\theta$. Then: 1) $\mathcal{S}^{*}=\mathbb{E}[\mathcal{S}|\mathcal{T}(X)]$ is an unbiased estimator for $\theta$; and 2) $\mathbb{V}_{\theta}[\mathcal{S}^{*}]\leq\mathbb{V}_{\theta}[\mathcal{S}]$ for all $\theta$. The inequality is strict unless $\mathcal{S}$ is a function of $\mathcal{T}$.

Appendix C Experimental Environment

For the hardware environment of the experiments, we use a single NVIDIA A100-PCIE-40GB GPU and an Intel(R) Xeon(R) Gold-6248R 3.00GHz CPU. For the software environment, the Python version is 3.9.7, the CUDA version is 11.7, and the PyTorch version is 2.0.1.