
Department of Statistics & Actuarial Science, The University of Hong Kong

Time Series Generative Learning with Application to Brain Imaging Analysis

Zhenghao Li*   Sanyou Wu   Long Feng#

* Contributes equally
# Correspondence to: [email protected]
Abstract

This paper focuses on the analysis of sequential image data, particularly brain imaging data such as MRI, fMRI, and CT, with the motivation of understanding the brain aging process and neurodegenerative diseases. To achieve this goal, we investigate image generation in a time series context. Specifically, we formulate a min-max problem derived from the $f$-divergence between neighboring pairs to learn a time series generator in a nonparametric manner. The generator enables us to generate future images by transforming prior lag-k observations and a random vector from a reference distribution. With a deep neural network learned generator, we prove that the joint distribution of the generated sequence converges to the latent truth under a Markov and a conditional invariance condition. Furthermore, we extend our generation mechanism to a panel data scenario to accommodate multiple samples. The effectiveness of our mechanism is evaluated by generating real brain MRI sequences from the Alzheimer’s Disease Neuroimaging Initiative. These generated image sequences can be used as data augmentation to enhance the performance of further downstream tasks, such as Alzheimer’s disease detection.

keywords:
Time series, Generative learning, Brain imaging, Markov property, Data augmentation

1 Introduction

Time series data is not limited to numbers; it can also include images, texts, and other forms of information. This paper specifically focuses on the analysis of image time series, driven by the goal of understanding the brain aging process with brain imaging data, such as magnetic resonance imaging (MRI), computed tomography (CT), etc. Understanding the brain aging process holds immense value for a variety of applications. For instance, insights into the brain aging process can reveal structural changes within the brain (Tofts, 2005) and aid in the early detection of degenerative diseases such as Alzheimer’s Disease (Jack et al., 2004). Consequently, the characterization and forecasting of brain image time series data can play a crucial role in the fight against age-related neurodegenerative diseases (Cole et al., 2018; Huizinga et al., 2018).

Time series analysis, a classic topic in statistics, has been extensively studied, with numerous notable works such as autoregression (AR), autoregressive moving-average (ARMA, Brockwell and Davis, 1991), and autoregressive conditional heteroskedasticity (ARCH, Engle, 1982; Bollerslev, 1986), among others. The classic scalar time series model has been expanded to accommodate vectors (Stock and Watson, 2001), matrices (Chen et al., 2021; Han et al., 2023; Chang et al., 2023), and even tensors (Chen et al., 2022a) in various contexts. Furthermore, numerous time series analysis methods have been adapted to high-dimensional models employing regularization techniques, such as Basu and Michailidis (2015); Guo et al. (2016). Beyond linear methods, nonlinear time series models have also been explored in the literature, e.g., Fan and Yao (2003); Tsay and Chen (2018). We refer to Tsay (2013) for a comprehensive review of time series models. In recent years, with the advent of deep learning, various deep neural network architectures have been applied to sequential data modeling, including recurrent neural network (RNN)-based methods (Pascanu et al., 2013; Lai et al., 2018) and attention-based methods (Li et al., 2019; Zhou et al., 2021). These deep learning models have demonstrated remarkable success in addressing nonlinear dependencies, resulting in promising forecasting performance. Despite these advances, however, image data typically contains rich spatial and structural information along with substantially higher dimensions, and time series methods designed for numerical values are not easily applicable to image data analysis.

In recent years, imaging data analysis has attracted significant attention in both the statistics and machine learning communities. Broadly speaking, existing work in the statistical literature can be divided into two categories. The first category typically involves converting images into vectors and analyzing the resulting vectors with high-dimensional statistical methods, such as the Lasso (Tibshirani, 1996) or other penalization techniques, Bayesian approaches (Kang et al., 2018; Feng et al., 2019), etc. Notably, this type of approach may generate ultra high-dimensional vectors and face heavy computational burdens when dealing with high-resolution images. The second category treats image data as tensor inputs and employs general tensor decomposition techniques, such as canonical polyadic decomposition (Zhou et al., 2013) or Tucker decomposition (Li et al., 2018; Luo and Zhang, 2023). Recently, Kronecker product based decomposition has been applied in image analysis, and its connection to the Convolutional Neural Network (CNN, LeCun et al., 1998) has been explored in the literature (Wu and Feng, 2023; Feng and Yang, 2023). Indeed, deep neural networks (DNNs), particularly CNNs, have arguably emerged as the most prevalent approach for image data analysis across various contexts.

Owing to the success of DNNs, generative models have become an immensely active area, achieving numerous accomplishments in image analysis and computer vision. In particular, variational autoencoders (Kingma and Welling, 2013), generative adversarial networks (GAN, Goodfellow et al., 2014), and diffusion models (Ho et al., 2020) have gained substantial attention and have been applied to various imaging tasks. Moreover, statisticians have also made tremendous contributions to the advancement of generative models. For example, Zhou et al. (2023a) introduced a deep generative approach for conditional sampling, which matches appropriate joint distributions using the Kullback-Leibler divergence. Zhou et al. (2023b) introduced a nonparametric test for the Markov property in time series using generative models. Chen et al. (2022b) proposed an inferential Wasserstein GAN (iWGAN) model that fuses autoencoders and GANs, and established generalization error bounds for the iWGAN. Although generative models have achieved remarkable success, their investigation in the context of image time series analysis has been much less explored, to the best of our knowledge. Ravi et al. (2019) introduced a deep learning approach to generate neuroimages and simulate disease progression, while Yoon et al. (2023) employed a conditional diffusion model for generating sequential medical images. In fact, handling medical imaging data can present additional challenges, such as limited sample sizes, higher image resolutions, and subtle differences between images. Consequently, conventional generative approaches might not be sufficient for generating medical images.

Motivated by brain imaging analysis, this paper considers the image generation problem in a time series context. Suppose that a time series of images $\{X_t \in \mathbb{R}^p, t=1,\ldots,T\}$ of dimension $p$ is observed from $0$ to $T$ and we are interested in generating the next $S$ points, i.e., $\{X_t, t=T+1,\ldots,T+S\}$. Our goal is to generate a sequence $(\widehat{X}_{T+1},\ldots,\widehat{X}_{T+S})$ which follows the same joint distribution as $(X_{T+1},\ldots,X_{T+S})$, i.e.,

$$(\widehat{X}_{T+1},\cdots,\widehat{X}_{T+S})\overset{\text{d}}{=}(X_{T+1},\cdots,X_{T+S}). \quad (1)$$

In this paper, we show that under a Markov and a conditional invariance condition, such generation is possible by letting

$$\widehat{X}_T=X_T,\ \ \widehat{X}_{T+s}=g(\eta_{T+s-1},\widehat{X}_{T+s-1}),\quad 1\leq s\leq S, \quad (2)$$

where $g$ is an unknown measurable function, i.e., the generator, that needs to be estimated, and $\eta_t$ is an $m$-dimensional random vector drawn i.i.d. from a reference distribution such as the Gaussian. If (1) could be achieved, not only would any subset of the generated sequence $(\widehat{X}_{T+1},\ldots,\widehat{X}_{T+S})$ follow the same distribution as that of $(X_{T+1},\ldots,X_{T+S})$, but the dependencies between $(\widehat{X}_{T+1},\ldots,\widehat{X}_{T+S})$ and $X_T$ would also remain consistent with the existing relationships between $(X_{T+1},\ldots,X_{T+S})$ and $X_T$. We refer to the generation (2) as “iterative generation”, as $\widehat{X}_{T+S}$ is generated iteratively. Additionally, we consider an “$s$-step” generation that allows us to generate $\widehat{X}_{T+s}$ directly from $X_T$:

$${\widetilde{X}}_{T+s}=G(\eta_T,X_T,s),\quad 1\leq s\leq S, \quad (3)$$

where $G$ is the target function to be learned.

To guarantee that the generation (2) achieves the distribution matching (1), we first establish the existence of the generator $g$. Given its existence, we formulate a min-max problem derived from the $f$-divergence between the pairs $(X_t,X_{t+1})$ and $(X_t,g(\eta_t,X_t))$ to estimate the generator $g(\cdot)$. With the learned $\widehat{g}(\cdot)$, we prove that under a mild distributional condition on $X_t$, not only does the pairwise distribution of $(X_T,\widehat{X}_{T+s})$ converge in expectation to that of $(X_T,X_{T+s})$ for any $1\leq s\leq S$, but, more importantly, the joint distribution of the generated sequence $(X_T,\widehat{X}_{T+1},\cdots,\widehat{X}_{T+S})$ converges to that of $(X_T,X_{T+1},\cdots,X_{T+S})$. The lag-1 generation in (2) can be further extended to a lag-k time series context with similar distribution matching guarantees. Furthermore, we generalize our framework to a panel data scenario to accommodate multiple samples. Finally, the effectiveness of our generation mechanism is evaluated by generating real brain MRI sequences from the Alzheimer’s Disease Neuroimaging Initiative (ADNI).

The rest of this paper is organized as follows. In Section 2, we establish the existence of the generator and formulate the estimation of the iterative and $s$-step generations. In Section 3, we prove the weak convergence of the joint and pairwise distributions for the generation. We extend our theoretical analysis to a lag-k time series context in Section 4. Section 5 generalizes the framework to a panel data scenario with multiple samples. We conduct simulation studies in Section 6 and generate real brain MRI sequences from ADNI in Section 7. The detailed proofs and the implementation of our approach are deferred to the supplementary material.

2 Existence and estimation

We consider a time series $\{X_t\in\mathbb{R}^p, t=1,2,\ldots\}$ that satisfies the following two assumptions:

$$\text{(i) Markov}:\ \ X_t\,|\,X_{t-1},\ldots,X_0\ \overset{\text{d}}{=}\ X_t\,|\,X_{t-1},$$
$$\text{(ii) Conditional invariance}:\ \ p_{X_t|X_{t-1}}(x|y)=p_{X_1|X_0}(x|y),\ \ t=1,2,\ldots.$$

The Markov and conditional invariance properties are commonly imposed in time series analysis. Here we restrict our attention to lag-1 time series; further generalization to the lag-k scenario will be considered in Section 4. Notably, Zhou et al. (2023b) proposed a nonparametric test for the Markov property using generative learning. The proposed test can even allow us to infer the order of a Markov model. We omit the details here and refer interested readers to their paper.

Suppose that the time series is observed from $0$ to $T$ and we are interested in generating the next $S$ points, i.e., $\{X_t, t=T+1,\ldots,T+S\}$. In particular, we aim to generate a sequence $(\widehat{X}_{T+1},\ldots,\widehat{X}_{T+S})$ that follows the same joint distribution as $(X_{T+1},\ldots,X_{T+S})$. More aggressively, we aim to achieve

$$(X_T,\widehat{X}_{T+1},\cdots,\widehat{X}_{T+S})\sim(X_T,X_{T+1},\cdots,X_{T+S}). \quad (4)$$

The target (4) is aggressive. If it holds true, then not only does any subset of the generated sequence $(\widehat{X}_{T+1},\ldots,\widehat{X}_{T+S})$ follow the same distribution as that of $(X_{T+1},\ldots,X_{T+S})$, but the dependencies between $(\widehat{X}_{T+1},\ldots,\widehat{X}_{T+S})$ and $X_T$ also remain consistent with those existing between $(X_{T+1},\ldots,X_{T+S})$ and $X_T$.

In this paper, we show that such a generation is possible by letting

$$\widehat{X}_T=X_T,\ \ \widehat{X}_{T+s}=g(\eta_{T+s-1},\widehat{X}_{T+s-1}),\ \ 1\leq s\leq S, \quad (5)$$

where $g$ is an unknown measurable function and $\{\eta_t\}\overset{\text{i.i.d.}}{\sim}N(0,\boldsymbol{I}_m)$ is a sequence of $m$-dimensional Gaussian vectors independent of $\{X_t\}$.
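To make the roll-out in (5) concrete, a minimal sketch is given below. Here `g_hat` is a placeholder for any learned generator (not the estimator defined later in (12)), the noise dimension `m` is user-chosen, and images are assumed to be flattened into $p$-dimensional vectors.

```python
import numpy as np

def iterative_generate(g_hat, x_T, S, m, rng=None):
    """Roll out (5): X_hat_{T+s} = g(eta_{T+s-1}, X_hat_{T+s-1}), starting from X_hat_T = X_T."""
    rng = np.random.default_rng() if rng is None else rng
    x_hat, path = x_T, [x_T]
    for _ in range(S):
        eta = rng.standard_normal(m)      # reference noise eta ~ N(0, I_m)
        x_hat = g_hat(eta, x_hat)         # one-step-ahead generation
        path.append(x_hat)
    return np.stack(path)                 # shape (S+1, p): (X_hat_T, ..., X_hat_{T+S})

# Toy usage with a hypothetical linear generator (p = m = 4):
# g_hat = lambda eta, x: 0.9 * x + 0.1 * eta
# sample_path = iterative_generate(g_hat, np.zeros(4), S=5, m=4)
```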

The existence of the target function $g$ is guaranteed by the following theorem.

Theorem 1.

Let $X_0, X_1, \cdots$ be a sequence of random variables which satisfies:

$$\text{(i) Markov}:\ \ X_t\,|\,X_{t-1},\cdots,X_0\overset{\text{d}}{=}X_t\,|\,X_{t-1},$$
$$\text{(ii) Conditional invariance}:\ \ p_{X_t|X_{t-1}}(x|y)=p_{X_1|X_0}(x|y)$$

for any $t\geq 1$. Suppose $\widehat{X}_0^0\overset{\text{d}}{=}X_0$ and let $\eta_0,\eta_1,\cdots$ be a sequence of independent $m$-dimensional Gaussian vectors. Then there exists a measurable function $g$ such that the sequence

$$\left\{\widehat{X}_t^0:\ \widehat{X}_t^0=g(\eta_{t-1},\widehat{X}^0_{t-1}),\ t=1,2,\ldots\right\} \quad (6)$$

satisfies, for any $s\geq 1$,

$$(\widehat{X}_0^0,\widehat{X}_1^0,\cdots,\widehat{X}_s^0)\overset{\text{d}}{=}(X_0,X_1,\cdots,X_s). \quad (7)$$

Theorem 1 can be proved using the following noise-outsourcing lemma.

Lemma 1.

(Theorem 5.10 in Kallenberg, 2021)

Let $(X,Y)$ be a random vector. Suppose $\widehat{X}\overset{\text{d}}{=}X$ is a random variable and $\eta\sim N(0,1)$ is independent of $\widehat{X}$. Then there exist a random variable $\widehat{Y}$ and a measurable function $g$ such that $\widehat{Y}=g(\eta,\widehat{X})$ and $(X,Y)\overset{\text{d}}{=}(\widehat{X},\widehat{Y})$.

We refer to the generation mechanism (6) as “iterative generation” since it is produced iteratively, one step at a time. Theorem 1 proves the existence of such an iterative generation process. Besides iterative generation, an alternative approach is to directly generate the outcomes after $s$ steps. Specifically, we consider the $s$-step generation of the following form

$${\widetilde{X}}_{T+s}=G(\eta_T,X_T,s),\ \ 1\leq s\leq S, \quad (8)$$

with $G$ being the target function.

Theorem 2.

Let $X_0, X_1, \cdots$ be a sequence of random variables which satisfies:

$$\text{(i) Markov}:\ \ X_t\,|\,X_{t-1},\cdots,X_0\overset{\text{d}}{=}X_t\,|\,X_{t-1},$$
$$\text{(ii) Conditional invariance}:\ \ p_{X_t|X_{t-1}}(x|y)=p_{X_1|X_0}(x|y)$$

for any $t\geq 1$. Let $\eta_0,\eta_1,\cdots$ be a sequence of independent $m$-dimensional Gaussian vectors. Further let $g(\cdot)$ be the target function in Theorem 1. Then there exists a measurable function $G$ such that the sequence

$$\left\{{\widetilde{X}}_t^0:\ {\widetilde{X}}_t^0=G(\eta_0,X_0,t),\ t=1,2,\ldots\right\} \quad (9)$$

satisfies

$${\widetilde{X}}^0_t\overset{\text{d}}{=}X_t \quad (10)$$

and

$$G(\eta,X,1)=g(\eta,X),\ \ \ \forall\eta\in\mathbb{R}^m,\ \ X\in\mathbb{R}^p. \quad (11)$$
Remark 2.1.

Unlike Theorem 1, the sequence $\big\{{\widetilde{X}}^0_t,\ t=1,2,\ldots\big\}$ does not necessarily achieve the target property (7) on the joint distribution. Instead, Theorem 2 can only guarantee the distributional match on the marginals, as in (10). The major difference is that in the $s$-step generation, the conditional distribution ${\widetilde{X}}^0_{t+1}\,\big|\,{\widetilde{X}}^0_t=x$ varies with $t$. Consequently, the mutual dependencies between ${\widetilde{X}}^0_1,{\widetilde{X}}^0_2,\ldots$ cannot be preserved in the generation.

Remark 2.2.

Due to the connection between $G(\cdot)$ and $g(\cdot)$ in (11), $G(\cdot)$ can be considered an extension of $g(\cdot)$ that includes an extra forecasting lag argument.

Given Theorem 1 on the existence of $g$, we now consider the estimation of the function $g(\cdot)$. For any given period $1,\cdots,T$, we consider the following $f$-GAN (Nowozin et al., 2016) type of min-max problem for the iterative generation:

$$(\widehat{g},\widehat{h})=\arg\min_{g\in\mathcal{G}_1}\max_{h\in\mathcal{H}_1}\widehat{\mathcal{L}}(g,h), \quad (12)$$
$$\widehat{\mathcal{L}}(g,h)=\frac{1}{T}\sum_{t=0}^{T-1}\left[h(X_t,g(\eta_t,X_t))-f^*(h(X_t,X_{t+1}))\right], \quad (13)$$

where $f^*(x)=\sup_y\{x\cdot y-f(y)\}$ is the convex conjugate of $f$, and $\mathcal{G}_1$ and $\mathcal{H}_1$ are spaces of continuous and bounded functions.

The $f$-divergence includes many commonly used measures of divergence, such as the Kullback-Leibler (KL) divergence and the $\chi^2$ divergence, as special cases. In our analysis, we consider a general $f$-divergence. From a technical perspective, we assume that $f$ is a convex function satisfying $f(1)=0$. We further assume that there exist constants $a>0$ and $0<b<1$ such that

$$f''(x+1)\geq\frac{a}{(1+bx)^3},\ \forall\ x\geq-1. \quad (14)$$

Assumption (14) is rather mild. For instance, the KL divergence, defined by $f(x)=x\log x$, meets the requirement of (14) with $a=1$ and $b=1/3$. Similarly, the $\chi^2$ divergence, defined by $f(x)=(x-1)^2$, satisfies (14) with $a=1/4$ and $b=1/2$.
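As an illustration of how (12)-(13) can be optimized in practice, the sketch below alternates stochastic gradient steps on the empirical objective with the KL choice $f(x)=x\log x$, whose convex conjugate is $f^*(t)=e^{t-1}$. This is only a schematic implementation under simplifying assumptions: images are flattened into $p$-dimensional vectors, and the multilayer perceptron architectures, optimizer, and learning rate are illustrative choices rather than the networks prescribed in Section 3.3.

```python
import torch
import torch.nn as nn

def f_star_kl(t):
    # Convex conjugate of f(x) = x log x (KL divergence): f*(t) = exp(t - 1).
    return torch.exp(t - 1.0)

def make_mlp(d_in, d_out, width=128, depth=3):
    layers, d = [], d_in
    for _ in range(depth):
        layers += [nn.Linear(d, width), nn.ReLU()]
        d = width
    layers.append(nn.Linear(d, d_out))
    return nn.Sequential(*layers)

def train_iterative_generator(X, m=16, n_iters=2000, lr=1e-4):
    """X: tensor of shape (T + 1, p) holding the observed series X_0, ..., X_T."""
    _, p = X.shape
    g = make_mlp(m + p, p)                # generator g(eta, x)
    h = make_mlp(2 * p, 1)                # discriminator h(x, y)
    opt_g = torch.optim.Adam(g.parameters(), lr=lr)
    opt_h = torch.optim.Adam(h.parameters(), lr=lr)
    x_prev, x_next = X[:-1], X[1:]        # neighboring pairs (X_t, X_{t+1})
    for _ in range(n_iters):
        eta = torch.randn(x_prev.shape[0], m)
        # Inner maximization over h: ascend the objective (13) with g held fixed.
        fake = g(torch.cat([eta, x_prev], dim=1)).detach()
        obj_h = (h(torch.cat([x_prev, fake], dim=1)).mean()
                 - f_star_kl(h(torch.cat([x_prev, x_next], dim=1))).mean())
        opt_h.zero_grad(); (-obj_h).backward(); opt_h.step()
        # Outer minimization over g: only the first term of (13) depends on g.
        fake = g(torch.cat([eta, x_prev], dim=1))
        obj_g = h(torch.cat([x_prev, fake], dim=1)).mean()
        opt_g.zero_grad(); obj_g.backward(); opt_g.step()
    return g
```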

Now we define the pseudo dimension and global bound of $\mathcal{G}_1$ and $\mathcal{H}_1$, which will be used in the later analysis. Let $\mathcal{F}$ be a class of functions from $\mathcal{X}$ to $\mathbb{R}$. The pseudo dimension of $\mathcal{F}$, written as $\text{Pdim}_{\mathcal{F}}$, is the largest integer $m$ for which there exists $\{x_1,\cdots,x_m,y_1,\cdots,y_m\}\in\mathcal{X}^m\times\mathbb{R}^m$ such that for any $(b_1,\cdots,b_m)\in\{0,1\}^m$ there exists $f\in\mathcal{F}$ such that $\forall i: f(x_i)>y_i\Leftrightarrow b_i=1$. Note that this definition is adopted from Bartlett et al. (2017). Furthermore, the global bound of $\mathcal{F}$ is defined as $B_{\mathcal{F}}:=\sup_{f\in\mathcal{F}}\|f\|_\infty$.

The min-max problem (12) is derived from the $f$-divergence between the pairs $(X_t,X_{t+1})$ and $(X_t,g(\eta_t,X_t))$. For any two probability distributions with densities $p$ and $q$, let $D_f(q\|p)=\int f\big(\frac{q(z)}{p(z)}\big)p(z)dz$ be their $f$-divergence. Denote the $f$-divergence between $(X_t,X_{t+1})$ and $(X_t,g(\eta_t,X_t))$ as $\mathbb{L}_t(g)$:

$$\mathbb{L}_t(g)=D_f(p_{X_t,g(\eta_t,X_t)}\|p_{X_t,X_{t+1}}). \quad (15)$$

A variational formulation of the $f$-divergence (Keziou, 2003; Nguyen et al., 2010) is based on the Fenchel conjugate. Let

$$\mathcal{L}_t(g,h)=\mathbb{E}_{X_t,\eta_t}h(X_t,g(\eta_t,X_t))-\mathbb{E}_{X_t,X_{t+1}}f^*(h(X_t,X_{t+1})). \quad (16)$$

Then we have $\mathbb{L}_t(g)\geq\sup_{h\in\mathcal{H}_1}\mathcal{L}_t(g,h)$. The equality holds if and only if $f'(q_t/p_t)\in\mathcal{H}_1$, where $p_t$ and $q_t$ denote the densities of $(X_t,X_{t+1})$ and $(X_t,g(\eta_t,X_t))$, respectively. If $f'(q_t/p_t)\in\mathcal{H}_1$ for $t=0,\ldots,T-1$, then by Theorem 1, there exists a function $g$ such that

$$\mathbb{L}_t(g)=\sup_{h\in\mathcal{H}_1}\mathcal{L}_t(g,h)=0,\ \ \forall t=0,1,\ldots,T-1.$$

Consequently, a function $g$ exists such that

$$\sup_{h\in\mathcal{H}_1}{\mathbb{E}}\widehat{\mathcal{L}}(g,h)=0.$$

For the $s$-step generation, we consider a similar min-max problem of the following form:

$$(\widehat{G},\widehat{H})=\arg\min_{G\in\mathcal{G}}\max_{H\in\mathcal{H}}\widetilde{\mathcal{L}}(G,H), \quad (17)$$
$$\widetilde{\mathcal{L}}(G,H)=\frac{1}{|\Omega|}\sum_{(t,s)\in\Omega}\left[H(X_t,G(\eta_t,X_t,s),s)-f^*(H(X_t,X_{t+s},s))\right], \quad (18)$$
$$\Omega=\left[(t,s):t+s\leq T,\ t\geq 1,\ 1\leq s\leq S\right]. \quad (19)$$

Similar to $(\mathcal{G}_1,\mathcal{H}_1)$, here $\mathcal{G}$ and $\mathcal{H}$ are spaces of continuous and bounded functions. As the $s$-step generation allows us to generate outcomes after $s$ steps for an arbitrary $s$, the objective $\widetilde{\mathcal{L}}(G,H)$ includes all the available pairs before the observation time $T$; in comparison, in iterative generation the pairs are restricted to neighbors. The $s$-step generation is related to the following average of $f$-divergences:

$$\dot{\mathbb{L}}_t(G)=\frac{1}{S}\sum_{s=1}^S D_f\left(p_{X_t,G(\eta_t,X_t,s)}\|p_{X_t,X_{t+s}}\right). \quad (20)$$

As in iterative generation, we also consider the following variational form

$$\dot{\mathcal{L}}_t(G,H)=S^{-1}\sum_{s=1}^S\Big[\mathbb{E}_{X_t,\eta_t}H(X_t,G(\eta_t,X_t,s),s)-{\mathbb{E}}_{X_t,X_{t+s}}f^*(H(X_t,X_{t+s},s))\Big]. \quad (21)$$

Similarly, we have $\dot{\mathbb{L}}_t(G)=\sup_H\dot{\mathcal{L}}_t(G,H)$.
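For concreteness, the following sketch shows how the index set $\Omega$ in (19) and the empirical $s$-step objective (18) can be assembled; `G`, `H`, and `f_star` are placeholders for user-supplied networks and conjugate, with the lag $s$ passed as an extra scalar input as in (17).

```python
import torch

def s_step_loss(G, H, X, S, m, f_star):
    """Empirical objective (18), averaged over Omega = {(t, s): t >= 1, 1 <= s <= S, t + s <= T}."""
    T = X.shape[0] - 1                              # observations X_0, ..., X_T
    terms = []
    for t in range(1, T):                           # t >= 1
        for s in range(1, min(S, T - t) + 1):       # 1 <= s <= S and t + s <= T
            eta = torch.randn(m)
            s_in = torch.tensor([float(s)])
            fake = G(torch.cat([eta, X[t], s_in]))
            terms.append(H(torch.cat([X[t], fake, s_in]))
                         - f_star(H(torch.cat([X[t], X[t + s], s_in]))))
    return torch.stack(terms).mean()                # (1 / |Omega|) * sum over Omega
```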

On the other hand, the solutions of the $s$-step generation (17) and the iterative generation (12) can also be connected, as in the following proposition.

Proposition 3.

Let $(\widehat{G},\widehat{H})$ and $(\widehat{g},\widehat{h})$ be defined as in (17) and (12), respectively. Then,

$$\widehat{G}(\cdot,\cdot,1)\equiv\widehat{g}(\cdot,\cdot),$$
$$\widehat{H}(\cdot,\cdot,1)\equiv\widehat{h}(\cdot,\cdot).$$

Thus, the property (11) of the target functions $g$ and $G$ is also inherited by the estimated solutions $\widehat{g}$ and $\widehat{G}$.

3 Convergence analysis for Lag-1 time series

3.1 General bounds for iterative generation

In this section, we prove that the time series obtained by iterative generation matches the true distribution, as in (4), asymptotically. To achieve this goal, we impose the following condition on $X_t$.

Assumption 1.

The probability density function of $X_t$, denoted as $p_t$, converges in $L_1$, i.e., there exists a function $p_\infty$ such that

$$\int|p_t(x)-p_\infty(x)|dx\leq\mathcal{O}(t^{-\alpha}), \quad (23)$$

where $\alpha>0$ is a constant.

Assumption 1 requires that the density of $X_t$ converges in $L_1$, with the convergence rate controlled by $t^{-\alpha}$. In many scenarios, such a rate is rather mild and can be achieved easily. For example, for the Gaussian distribution family below, the convergence rate is controlled by $e^{-ct}$, which is of smaller order than $t^{-\alpha}$.

Example 1.

Let $\{X_t\}$ be a time series satisfying

$$X_0\sim N(0,\Sigma_0),\ \ X_{t+1}=\phi_2 X_t+\phi_1\xi_t,\ \ t=0,1,2,\ldots,$$

where $\phi_1,\phi_2\in\mathbb{R}^{p\times p}$, $\phi_2$ is a symmetric matrix whose largest singular value is less than $1$, and $\xi_t\overset{\text{i.i.d.}}{\sim}N(0,I_p)$ is independent of $X_0$. Let $c=-2\log\sigma_{\text{max}}(\phi_2)$. Then there exists a density function $p_\infty$ such that

$$\int|p_{X_t}(x)-p_\infty(x)|dx=\mathcal{O}(e^{-ct}). \quad (24)$$
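As a quick numerical illustration of Example 1 (a sketch, not part of the theory), the covariance of $X_t$ obeys the recursion $\Sigma_{t+1}=\phi_2\Sigma_t\phi_2^\top+\phi_1\phi_1^\top$, and the code below shows its geometric convergence for an arbitrary choice of $\phi_1$, $\phi_2$ with $\sigma_{\max}(\phi_2)<1$.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 3
phi2 = 0.5 * np.eye(p)                    # symmetric, sigma_max(phi2) = 0.5 < 1
phi1 = 0.3 * rng.standard_normal((p, p))
sigma = np.eye(p)                         # Sigma_0

prev = sigma.copy()
for t in range(1, 31):
    sigma = phi2 @ sigma @ phi2.T + phi1 @ phi1.T   # Sigma_t from Sigma_{t-1}
    if t % 10 == 0:
        print(t, np.abs(sigma - prev).max())        # successive gaps shrink like e^{-ct}
    prev = sigma.copy()
```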

In classical time series analysis, stationarity conditions are usually imposed to obtain the desired statistical properties. For example, a time series $\{Y_t\}$ is said to be strictly (or strongly) stationary if

$$F_{Y_{t_1},\cdots,Y_{t_n}}(y_1,\cdots,y_n)=F_{Y_{t_1+\tau},\cdots,Y_{t_n+\tau}}(y_1,\cdots,y_n)$$

for all $t_1,\cdots,t_n,\tau\in\mathbb{N}$ and for all $n>0$, where $F_X(\cdot)$ is the distribution function of $X$. Clearly, strict stationarity implies Assumption 1. In other words, Assumption 1 is a much weaker condition, since the convergence condition is imposed only on the marginal densities, with no requirement placed on the joint distribution.

Given the convergence condition, we are ready to state the main theorem for the iterative generation. Let $\widehat{g}$ be the solution to (12) and $\widehat{X}_{T+s}$ be the generated sequence, i.e.,

$$\widehat{X}_T=X_T,\ \ \widehat{X}_{T+s}=\widehat{g}(\eta_{T+s-1},\widehat{X}_{T+s-1}),\ \ s=1,\ldots,S.$$

Then, the joint density of $(\widehat{X}_T,\cdots,\widehat{X}_{T+S})$ converges to the corresponding truth in expectation, as stated in Theorem 4 below.

Theorem 4.

(Iterative generation) Let $X_0, X_1, \cdots$ be a sequence of random variables satisfying the Markov and conditional invariance conditions as in Theorem 1. Suppose Assumption 1 holds. Let $\widehat{g}$ be the solution to the $f$-GAN problem (12) with $f$ satisfying (14). Then,

$$\mathbb{E}_{(X_t,\eta_t)_{t=0}^T}\|p_{\widehat{X}_T,\cdots,\widehat{X}_{T+S}}-p_{X_T,\cdots,X_{T+S}}\|^2_{L_1}\leq\underbrace{\Delta_1+\Delta_2}_{\text{statistical error}}+\underbrace{\Delta_3+\Delta_4}_{\text{approximation error}}, \quad (25)$$

where

$$\Delta_1=\mathcal{O}(T^{-\frac{\alpha}{\alpha+1}}+T^{-2\alpha}),$$
$$\Delta_2=\mathcal{O}\left(\sqrt{\frac{\text{Pdim}_{\mathcal{G}_1}\log(T\text{B}_{\mathcal{G}_1})}{T}}+\sqrt{\frac{\text{Pdim}_{\mathcal{H}_1}\log(T\text{B}_{\mathcal{H}_1})}{T}}\right),$$
$$\Delta_3=\mathcal{O}(1)\cdot\mathbb{E}_{(X_t,\eta_t)_{t=0}^T}\left(\sup_h\mathcal{L}_T(\widehat{g},h)-\sup_{h\in\mathcal{H}_1}\mathcal{L}_T(\widehat{g},h)\right),$$
$$\Delta_4=\mathcal{O}(1)\cdot\inf_{\bar{g}\in\mathcal{G}_1}\mathbb{L}_T(\bar{g}).$$

Moreover, for the pairwise distribution, we have

$$\mathbb{E}_{(X_t,\eta_t)_{t=0}^T}\frac{1}{S}\sum_{s=1}^S\|p_{\widehat{X}_T,\widehat{X}_{T+s}}-p_{X_T,X_{T+s}}\|_{L_1}^2\leq\underbrace{\Delta_1+\Delta_2}_{\text{statistical error}}+\underbrace{\Delta_3+\Delta_4}_{\text{approximation error}}. \quad (26)$$
Corollary 1.

When $\{X_t\}$ follows a Gaussian distribution family as in Example 1, the statistical error $\Delta_1$ can be further improved to

$$\Delta_1=\mathcal{O}\left(e^{-cT}+(1/T)\log T\right),$$

where $c$ is a certain constant.

Theorem 4 establishes the convergence of the joint and pairwise distributions for the iterative generation. The $L_1$ distance between $(X_T,\widehat{X}_{T+1},\cdots,\widehat{X}_{T+S})$ and the true sequence can be bounded by four terms: the statistical errors $\Delta_1$, $\Delta_2$ and the approximation errors $\Delta_3$, $\Delta_4$. In particular, it is clear that $\Delta_1$ converges to 0 as $T\rightarrow\infty$. For $\Delta_2$ to $\Delta_4$, we prove their convergence in Section 3.3 when $\mathcal{H}_1$ and $\mathcal{G}_1$ are chosen to be spaces of deep neural networks. As a result of Theorem 4, the main objective (4) is ensured asymptotically. In other words, with sufficient samples the generated sequence $(X_T,\widehat{X}_{T+1},\cdots,\widehat{X}_{T+S})$ follows approximately the same distribution as the truth $(X_T,X_{T+1},\cdots,X_{T+S})$.

Remark 3.1.

The statistical error $\Delta_1$ depends on $T$ and $\alpha$, the convergence speed of $X_t$. Clearly, we could omit the term $\mathcal{O}(T^{-2\alpha})$ in $\Delta_1$, as $T^{-2\alpha}$ is of smaller order than $T^{-\frac{\alpha}{\alpha+1}}$. We include $T^{-2\alpha}$ in $\Delta_1$ because it controls the difference between the joint-distribution and pairwise-distribution bounds; see Proposition 5 below. In addition, the term $\mathcal{O}(T^{-\frac{\alpha}{\alpha+1}})$ is obtained by estimating a carefully constructed quantity $d_s(G,H)$ in Proposition 6 below.

Proposition 5.

For any $S=1,2,\ldots$,

$$\mathbb{E}_{(X_t,\eta_t)_{t=0}^T}\left\|p_{\widehat{X}_T,\cdots,\widehat{X}_{T+S}}-p_{X_T,\cdots,X_{T+S}}\right\|_{L_1}^2\leq 2S^2\,\mathbb{E}_{(X_t,\eta_t)_{t=0}^T}\left\|p_{{\widetilde{X}}_T,{\widetilde{X}}_{T+1}}-p_{X_T,X_{T+1}}\right\|_{L_1}^2+\mathcal{O}(T^{-2\alpha}). \quad (27)$$
Proposition 6.

For any $S=1,2,\ldots$, define

$$U(x,y,z,s)=H(x,G(z,x,s),s)-f^*(H(x,y,s)), \quad (29)$$
$$d_s(G,H)=\mathbb{E}_{X_T,X_{T+s},\eta_T}U(X_T,X_{T+s},\eta_T,s)-\frac{1}{T-s+1}\sum_{t=0}^{T-s}\mathbb{E}_{X_t,X_{t+s},\eta_t}U(X_t,X_{t+s},\eta_t,s), \quad (31)$$

then we have

$$\sup_{G\in\mathcal{G},H\in\mathcal{H}}|d_s(G,H)|\leq\mathcal{O}(T^{-\frac{\alpha}{\alpha+1}}). \quad (32)$$
Remark 3.2.

The statistical error $\Delta_2$ depends only on the time period $T$ and the structure of the function spaces $\mathcal{G}$ and $\mathcal{H}$. In Subsection 3.3, we show that $\Delta_2$ goes to 0 when $\mathcal{G}$ and $\mathcal{H}$ are taken as neural network spaces of appropriate sizes. $\Delta_2$ is obtained by bounding the Rademacher complexity of $\{b_t^s\}_{t=0}^{T-s}$ in Proposition 7 below. Under the time series setting, $\{X_t, t=1,2,\ldots\}$ are highly correlated, and conventional techniques for bounding the Rademacher complexity do not apply. In our proof, we adopt a technique introduced by McDonald and Shalizi (2017) that allows us to handle correlated variables. We defer the details to the supplementary material.

Proposition 7.

Let $\{\epsilon_t\}_{t\geq 0}$ be Rademacher random variables. For $1\leq s\leq S$, define

$$b_t^s(G,H)=H(X_t,G(\eta_t,X_t,s),s)-f^*(H(X_t,X_{t+s},s))-\mathbb{E}\big[H(X_t,G(\eta_t,X_t,s),s)-f^*(H(X_t,X_{t+s},s))\big]. \quad (34)$$

Further let $\mathcal{R}_s(\mathcal{G}\times\mathcal{H})$ be the Rademacher complexity of $\{b_t^s\}_{t=0}^{T-s}$,

$$\mathcal{R}_s(\mathcal{G}\times\mathcal{H})=\mathbb{E}\sup_{G\in\mathcal{G},H\in\mathcal{H}}\left|\frac{2}{T-s+1}\sum_{t=0}^{T-s}\epsilon_t b_t^s(G,H)\right|. \quad (35)$$

Then we have

$$\mathbb{E}\sup_{G\in\mathcal{G},H\in\mathcal{H}}\left|\frac{1}{T-s+1}\sum_{t=0}^{T-s}b_t^s(G,H)\right|\leq\mathcal{R}_s(\mathcal{G}\times\mathcal{H}). \quad (36)$$

Moreover, $\mathcal{R}_s(\mathcal{G}\times\mathcal{H})$ can be further bounded using the pseudo dimension and global bound of $\mathcal{G}$ and $\mathcal{H}$.

3.2 General bounds for $s$-step generation

In this subsection, we provide theoretical guarantees for the $s$-step generation. Let $\widehat{G}$ be the solution to (17) and ${\widetilde{X}}_{T+s}$ be the generated sequence, i.e.,

$${\widetilde{X}}_{T+s}=\widehat{G}(\eta_T,{\widetilde{X}}_T,s),\ \ s=1,\ldots,S.$$

We now show that the pairwise distributional closeness between $({\widetilde{X}}_T,{\widetilde{X}}_{T+s})$ and $(X_T,X_{T+s})$ can be guaranteed, as stated in Theorem 8 below.

Theorem 8.

($s$-step generation) Let $X_0, X_1, \cdots$ be a sequence of random variables satisfying the Markov and conditional invariance conditions as in Theorem 1. Suppose Assumption 1 holds. Let $\widehat{G}$ be the solution to the $f$-GAN problem (17) with $f$ satisfying (14). Then,

$$\mathbb{E}_{(X_t,\eta_t)_{t=0}^T}\frac{1}{S}\sum_{s=1}^S\|p_{{\widetilde{X}}_T,{\widetilde{X}}_{T+s}}-p_{X_T,X_{T+s}}\|_{L_1}^2\leq\underbrace{\widetilde{\Delta}_1+\widetilde{\Delta}_2}_{\text{statistical error}}+\underbrace{\widetilde{\Delta}_3+\widetilde{\Delta}_4}_{\text{approximation error}}, \quad (37)$$

where

$$\widetilde{\Delta}_1=\mathcal{O}(T^{-\frac{\alpha}{\alpha+1}}),$$
$$\widetilde{\Delta}_2=\mathcal{O}\left(\sqrt{\frac{\text{Pdim}_{\mathcal{G}}\log(T\text{B}_{\mathcal{G}})}{T}}+\sqrt{\frac{\text{Pdim}_{\mathcal{H}}\log(T\text{B}_{\mathcal{H}})}{T}}\right),$$
$$\widetilde{\Delta}_3=\mathcal{O}(1)\cdot\mathbb{E}_{(X_t,\eta_t)_{t=0}^T}\left(\sup_H\dot{\mathcal{L}}_T(\widehat{G},H)-\sup_{H\in\mathcal{H}}\dot{\mathcal{L}}_T(\widehat{G},H)\right),$$
$$\widetilde{\Delta}_4=\mathcal{O}(1)\cdot\inf_{\bar{G}\in\mathcal{G}}\dot{\mathbb{L}}_T(\bar{G}).$$

In particular, when $S=1$,

$$\mathbb{E}_{(X_t,\eta_t)_{t=0}^T}\|p_{{\widetilde{X}}_T,{\widetilde{X}}_{T+1}}-p_{X_T,X_{T+1}}\|_{L_1}^2\leq\underbrace{\widetilde{\Delta}_1+\Delta_2}_{\text{statistical error}}+\underbrace{\Delta_3+\Delta_4}_{\text{approximation error}}, \quad (38)$$

where $\Delta_2,\Delta_3$ and $\Delta_4$ are the quantities in Theorem 4.

Theorem 8 demonstrates the convergence of the pairwise distributions for the $s$-step generation. It states that the $L_1$ distance between the distributions of $({\widetilde{X}}_T,{\widetilde{X}}_{T+s})$ and $(X_T,X_{T+s})$, for any $s$, can be bounded by the sum of two statistical errors and two approximation errors. In the following subsection, similar to Theorem 4, we show that all the $\widetilde{\Delta}$ terms approach 0 when $\mathcal{H}$ and $\mathcal{G}$ are chosen to be spaces of deep neural networks. Comparing Theorem 8 with Theorem 4, it can be observed that the term $T^{-2\alpha}$ in $\Delta_1$ does not appear in $\widetilde{\Delta}_1$. As discussed in Remark 3.1, the term $T^{-2\alpha}$ controls the difference between the joint and pairwise distribution bounds, and it is no longer needed in Theorem 8.

Remark 3.3.

Unlike in Theorem 4, convergence of the joint distribution may not be guaranteed for the $s$-step generated sequence $\big\{{\widetilde{X}}_t,\ t=1,2,\ldots\big\}$. As discussed in Theorem 2, there is no assurance regarding the existence of a $G$ attaining the joint distribution match. The major issue for $s$-step generation is that the conditional distribution ${\widetilde{X}}_{T+s}\,|\,({\widetilde{X}}_{T+s-1}=x)$ varies with $s$. Thus the mutual dependencies between ${\widetilde{X}}^0_1,{\widetilde{X}}^0_2,\ldots$ cannot be preserved in the generation. In comparison, in iterative generation the joint distribution match can be achieved because the conditional distribution of adjacent generations does not vary with $s$, i.e.,

$$\widehat{X}_{T+s}\,|\,(\widehat{X}_{T+s-1}=x)\ \overset{\text{d}}{\equiv}\ \widehat{X}_{T+1}\,|\,(\widehat{X}_T=x).$$

3.3 Analysis of deep neural network spaces

Neural networks have been extensively studied in recent years due to their universal approximation power. In this subsection, we use DNNs to approximate the generator $g$ in our model. In particular, we show that both the statistical and approximation errors converge to zero when $\mathcal{G}_1$, $\mathcal{H}_1$, $\mathcal{G}$, and $\mathcal{H}$ are taken to be spaces of Rectified Linear Unit (ReLU) neural network functions. To avoid redundancy, we concentrate on the spaces $\mathcal{G}_1$ and $\mathcal{H}_1$, with generalizations to $\mathcal{G}$ and $\mathcal{H}$ being straightforward.

Recall that the input $X_t$ and the reference $\eta_t$ are of dimension $p$ and $m$, respectively. We consider the generator $g:\mathbb{R}^{p+m}\rightarrow\mathbb{R}^p$ in the space of ReLU neural networks $\mathcal{G}_1:=\mathcal{G}_{\mathcal{D},\mathcal{W},\mathcal{K},\mathcal{B}}$ with width $\mathcal{W}$, depth $\mathcal{D}$, size $\mathcal{K}$, and global bound $\mathcal{B}$. Specifically, let $\omega_j$ denote the number of hidden units in layer $j$, with $\omega_0=p+m$ being the dimension of the input layer. Then the width $\mathcal{W}=\max_{0\leq i\leq\mathcal{D}}\{\omega_i\}$ is the maximum layer dimension, the depth $\mathcal{D}$ is the number of layers, the size $\mathcal{K}=\sum_{i=0}^{\mathcal{D}-1}\omega_i(\omega_{i+1}+1)$ is the total number of parameters, and the global bound satisfies $\|g\|_\infty\leq\mathcal{B}$ for all $g\in\mathcal{G}_1$. Similarly, we may define a ReLU network space for the discriminator $h$ as $\mathcal{H}_1:=\mathcal{H}_{\widetilde{\mathcal{D}},\widetilde{\mathcal{W}},\widetilde{\mathcal{K}},\widetilde{\mathcal{B}}}$.
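As an illustration, one member of such a generator space can be written down as follows; the particular width, depth, and bound below are placeholder values rather than the theoretically prescribed sizes of Proposition 10, and clipping the output is one simple way to enforce the global bound $\mathcal{B}$.

```python
import torch
import torch.nn as nn

class ReluGenerator(nn.Module):
    """One element of the ReLU network space G_{D, W, K, B}: (eta, x) in R^{m+p} -> R^p."""

    def __init__(self, p, m, width=256, depth=4, bound=10.0):
        super().__init__()
        layers, d = [], p + m
        for _ in range(depth):
            layers += [nn.Linear(d, width), nn.ReLU()]
            d = width
        layers.append(nn.Linear(d, p))
        self.net = nn.Sequential(*layers)
        self.bound = bound                 # global bound B on the sup-norm of the output

    def forward(self, eta, x):
        out = self.net(torch.cat([eta, x], dim=-1))
        return out.clamp(-self.bound, self.bound)   # keep ||g||_inf <= B (still piecewise linear)

# g = ReluGenerator(p=64 * 64, m=16)   # e.g., flattened 64 x 64 image slices
```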

Then, by Bartlett et al. (2017), we can bound the pseudo dimension of $\mathcal{G}_1$ (and $\mathcal{H}_1$), and consequently $\Delta_2$, as in Proposition 9 below.

Proposition 9.

Let $\mathcal{G}_1:=\mathcal{G}_{\mathcal{D},\mathcal{W},\mathcal{K},\mathcal{B}}$ be the ReLU network space with width $\mathcal{W}$, depth $\mathcal{D}$, size $\mathcal{K}$, and global bound $\mathcal{B}$. Then we have

$$\text{Pdim}_{\mathcal{G}_1}=\mathcal{O}(\mathcal{D}\mathcal{K}\log\mathcal{K}).$$

Consequently,

$$\Delta_2=\mathcal{O}\left(\sqrt{\frac{\mathcal{D}\mathcal{K}\log\mathcal{K}\log(T\mathcal{B})}{T}}+\sqrt{\frac{\tilde{\mathcal{D}}\tilde{\mathcal{K}}\log\tilde{\mathcal{K}}\log(T\tilde{\mathcal{B}})}{T}}\right).$$

By Proposition 9, it is clear that $\Delta_2$ goes to 0 for ReLU network spaces of appropriate size, e.g., when $\mathcal{D}\mathcal{K}\log\mathcal{K}\log(T\mathcal{B})$ and $\tilde{\mathcal{D}}\tilde{\mathcal{K}}\log\tilde{\mathcal{K}}\log(T\tilde{\mathcal{B}})$ are of smaller order than $T$. Moreover, since $\Delta_1\rightarrow 0$ as $T\rightarrow\infty$ regardless of the network structure, we can conclude that the statistical error in Theorem 4 converges to 0.

Now we consider the approximation errors $\Delta_3$ and $\Delta_4$. The approximation power of DNNs has been intensively studied in the literature under different conditions, such as smoothness assumptions. For instance, the early work of Stone (1982) established the optimal minimax rate of convergence for estimating a $(\beta,C)$-smooth function, while more recently, Yarotsky (2017) and Lu et al. (2020) considered target functions with continuous $\beta$-th derivatives. Jiao et al. (2021) assumed $\beta$-Hölder smooth functions with $\beta>1$. Moreover, studies including Shen et al. (2021), Schmidt-Hieber (2020), and Bauer and Kohler (2019) have sought to enhance the convergence rate by assuming that the target function possesses a certain compositional structure. Here, we adopt Theorem 4.3 in Shen et al. (2019) and show that the approximation errors in Theorem 4 converge to 0 with a particular structure of neural networks.

Proposition 10.

Let $\mathcal{G}_1:=\mathcal{G}_{\mathcal{D},\mathcal{W},\mathcal{K},\mathcal{B}}$ be a ReLU network space with depth $\mathcal{D}=12\log T+14+2(p+m+1)$ and width $\mathcal{W}=3^{p+m+4}\max\{(p+m+1)\lfloor(T^{\frac{p+m+1}{2(3+p+m)}}/\log T)^{\frac{1}{p+m+1}}\rfloor,T^{\frac{p+m+1}{2(3+p+m)}}/\log T+1\}$. Further let $\mathcal{H}_1:=\mathcal{H}_{\widetilde{\mathcal{D}},\widetilde{\mathcal{W}},\tilde{\mathcal{K}},\widetilde{\mathcal{B}}}$ be a ReLU network space with depth $\widetilde{\mathcal{D}}=12\log T+14+2(2p+1)$ and width $\widetilde{\mathcal{W}}=3^{2p+4}\max\{(2p+1)\lfloor(T^{\frac{2p+1}{2(2p+3)}}/\log T)^{\frac{1}{2p+1}}\rfloor,T^{\frac{2p+1}{2(2p+3)}}/\log T+1\}$. Then as $T\rightarrow\infty$, we have

$$\Delta_3=\mathcal{O}(1)\cdot\mathbb{E}_{(X_t,\eta_t)_{t=0}^T}\left(\sup_h\mathcal{L}_T(\widehat{g},h)-\sup_{h\in\mathcal{H}_1}\mathcal{L}_T(\widehat{g},h)\right)\rightarrow 0,$$
$$\Delta_4=\mathcal{O}(1)\cdot\inf_{\bar{g}\in\mathcal{G}_1}\mathbb{L}_T(\bar{g})\rightarrow 0.$$

4 Generalizations to lag-k time series

In this section, we generalize the lag-1 time series studied in Sections 2 and 3 to a lag-k setting. Specifically, we consider a time series $\{X_t\in\mathbb{R}^p, t=1,2,\ldots\}$ that satisfies the following lag-k Markov assumption

$$X_t\,|\,X_{t-1},\ldots,X_0\ \overset{\text{d}}{=}\ X_t\,|\,X_{t-1},\ldots,X_{t-k}. \quad (39)$$

Moreover, we assume that $\{X_t\in\mathbb{R}^p, t=1,2,\ldots\}$ is conditionally invariant,

$$p_{X_t|X_{t-1},\cdots,X_{t-k}}(x_k|x_{k-1},\cdots,x_0)=p_{X_k|X_{k-1},\cdots,X_0}(x_k|x_{k-1},\cdots,x_0),\ \ \forall t\geq k. \quad (40)$$

In other words, the conditional density function of $X_t\,|\,X_{t-1},\cdots,X_{t-k}$ does not depend on $t$.

Given a lag-k time series $\{X_t\}$, we aim to generate a sequence $(\widehat{X}_{T+1},\ldots,\widehat{X}_{T+S})$ that not only follows the same joint distribution as $(X_{T+1},\ldots,X_{T+S})$, but also maintains the dependencies between $(\widehat{X}_{T+1},\ldots,\widehat{X}_{T+S})$ and $(X_{T-k+1},\ldots,X_T)$. In other words, we aim to achieve

$$(\widehat{X}_{T-k+1},\ldots,\widehat{X}_T,\widehat{X}_{T+1},\ldots,\widehat{X}_{T+S})\overset{\text{d}}{=}(X_{T-k+1},\ldots,X_T,X_{T+1},\ldots,X_{T+S}). \quad (41)$$

We show that such a generation is possible via the following iterative generation:

$$(\widehat{X}_{T-k+1},\cdots,\widehat{X}_T)=(X_{T-k+1},\cdots,X_T),$$
$$\widehat{X}_{T+s}=g(\eta_{T-k+s},\widehat{X}_{T-k+s},\cdots,\widehat{X}_{T-1+s}),\ \ 1\leq s\leq S,$$

where $g$ is the target function to be estimated and the $\eta_t$ are i.i.d. Gaussian vectors of dimension $m$. Moreover, we may also consider the $s$-step generation:

$${\widetilde{X}}_{T+s}=G\left(\eta_{T-k+1},X_{T-k+1},\cdots,X_T,s\right).$$
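To illustrate the lag-k roll-out, a minimal sketch is given below; `g_hat` is a placeholder for a learned lag-k generator $g(\eta, x_{t-k},\cdots,x_{t-1})$, and a sliding window of the last $k$ generated images is maintained.

```python
import numpy as np
from collections import deque

def lag_k_generate(g_hat, last_k, S, m, rng=None):
    """Roll out the lag-k iterative generation starting from (X_{T-k+1}, ..., X_T)."""
    rng = np.random.default_rng() if rng is None else rng
    window = deque(last_k, maxlen=len(last_k))       # the k most recent (generated) images
    out = []
    for _ in range(S):
        eta = rng.standard_normal(m)                 # reference noise eta ~ N(0, I_m)
        x_new = g_hat(eta, *window)                  # g(eta, X_hat_{T-k+s}, ..., X_hat_{T-1+s})
        window.append(x_new)                         # slide the window forward
        out.append(x_new)
    return np.stack(out)                             # shape (S, p): (X_hat_{T+1}, ..., X_hat_{T+S})
```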

The following proposition suggests that, analogous to the lag-1 case, there exists a function $g$ for the iterative generation to achieve the joint distribution matching. Furthermore, for the $s$-step generation, a function $G$ exists to attain the marginal distribution matching.

Proposition 11.

Let $\{X_t\}$ satisfy the lag-k Markov property (39) and the conditional invariance condition (40). Let $(\widehat{X}_0^0,\cdots,\widehat{X}_{k-1}^0)\overset{\text{d}}{=}(X_0,\cdots,X_{k-1})$ and let $\eta_0,\eta_1,\cdots$ be independent $m$-dimensional Gaussian vectors which are independent of $(\widehat{X}_0,\cdots,\widehat{X}_{k-1})$. Then for iterative generation, there exists a measurable function $g$ such that the sequence

$$\left\{\widehat{X}_t^0:\widehat{X}_t^0=g(\eta_{t-k},\widehat{X}_{t-k}^0,\cdots,\widehat{X}_{t-1}^0)\right\} \quad (42)$$

satisfies, for any $s\geq 1$,

$$(\widehat{X}_0^0,\cdots,\widehat{X}_s^0)\overset{\text{d}}{=}(X_0,\cdots,X_s). \quad (43)$$

Moreover, for $s$-step generation, the sequence

$$\left\{{\widetilde{X}}_{t+k-1}^0:{\widetilde{X}}_{t+k-1}^0=G(\eta_0,X_0,\cdots,X_{k-1},t)\right\} \quad (44)$$

satisfies

$${\widetilde{X}}_{t+k-1}^0\overset{\text{d}}{=}X_{t+k-1}. \quad (45)$$

Now we consider the estimation of $g$ and $G$ for lag-k time series. For any sequence $\{A_t\}$ and positive integers $u\leq v$, denote by $A_{[u,v]}$ the set $(A_u,A_{u+1},\cdots,A_v)$. We then consider the following min-max problem for the estimation of the $s$-step generation:

$$(\widehat{G},\widehat{H})=\arg\min_{G\in\mathcal{G}}\max_{H\in\mathcal{H}}\widetilde{\mathcal{L}}(G,H), \quad (46)$$
$$\widetilde{\mathcal{L}}(G,H)=\frac{1}{|\Omega|}\sum_{(t,s)\in\Omega}\big[H(X_{[t-k+1,t]},G(\eta_{t-k+1},X_{[t-k+1,t]},s),s)-f^*(H(X_{[t-k+1,t]},X_{t+s},s))\big], \quad (48)$$
$$\Omega=\left[(t,s):t+s\leq T,\ t\geq k,\ 1\leq s\leq S\right], \quad (49)$$

where $\mathcal{G}$ and $\mathcal{H}$ are spaces of continuous and bounded functions. As in the lag-1 case, the generator $g$ and discriminator $h$ for the lag-k iterative generation can be obtained by letting

$$\widehat{g}(\cdot,\cdot)=\widehat{G}(\cdot,\cdot,1),\ \ \ \widehat{h}(\cdot,\cdot)=\widehat{H}(\cdot,\cdot,1).$$

For lag-k time series, we impose the following condition analogous to Assumption 1 in Section 3.

Assumption 2.

The probability density function of $X_{[t,t+k-1]}$, denoted by $p_{t,k}$, converges in $L_1$, i.e., there exists a function $p_{\infty,k}$ such that

$$\int\left|p_{t,k}(x_1,\cdots,x_k)-p_{\infty,k}(x_1,\cdots,x_k)\right|dx_{[1,k]}\leq\mathcal{O}(t^{-\alpha}), \quad (50)$$

where $\alpha>0$ is a certain positive constant.

Given Assumption 2, we can derive theoretical guarantees for the distribution matching of lag-k time series. Let $\widehat{X}_{T+s}$ denote the iteratively generated sequence, i.e.,

$$\widehat{X}_{T-j}=X_{T-j},\ \ j=0,\cdots,k-1, \quad (51)$$
$$\widehat{X}_{T+s}=\widehat{g}(\eta_{T+s-k},\widehat{X}_{T+s-k},\cdots,\widehat{X}_{T+s-1}),\ \ s=1,\cdots,S. \quad (52)$$

Further let ${\widetilde{X}}_{T+s}$ denote the $s$-step generated sequence, i.e.,

$${\widetilde{X}}_{T-j}=X_{T-j},\ \ j=0,\cdots,k-1, \quad (53)$$
$${\widetilde{X}}_{T+s}=\widehat{G}(\eta_{T-k+1},{\widetilde{X}}_{T-k+1},\cdots,{\widetilde{X}}_T,s),\ \ s=1,\cdots,S. \quad (54)$$

Then we have the following convergence theorem for iterative and s-step generated sequences.

Theorem 12.

Let $\{X_t\}$ satisfy the lag-k Markov property (39) and the conditional invariance condition (40). Suppose Assumption 2 holds. Let $\widehat{G}$ be the solution to the $f$-GAN problem (46) with $f$ satisfying (14). Then, for the iterative generations $\widehat{X}_{T+s}$ in (51), we have

$$\mathbb{E}_{(X_t,\eta_t)_{t=0}^T}\left\|p_{\widehat{X}_{[T-k+1,T+S]}}-p_{X_{[T-k+1,T+S]}}\right\|^2_{L_1}\leq\underbrace{{\overline{\Delta}}_1+{\overline{\Delta}}_2}_{\text{statistical err}}+\underbrace{{\overline{\Delta}}_3+{\overline{\Delta}}_4}_{\text{approximation err}}, \quad (55)$$

and

$$\mathbb{E}_{(X_t,\eta_t)_{t=0}^T}\frac{1}{S}\sum_{s=1}^S\left\|p_{\widehat{X}_{[T-k+1,T]},\widehat{X}_{T+s}}-p_{X_{[T-k+1,T]},X_{T+s}}\right\|_{L_1}^2\leq\underbrace{{\overline{\Delta}}_1+{\overline{\Delta}}_2}_{\text{statistical err}}+\underbrace{{\overline{\Delta}}_3+{\overline{\Delta}}_4}_{\text{approximation err}}, \quad (56)$$

where

$${\overline{\Delta}}_1=\mathcal{O}(T^{-\frac{\alpha}{\alpha+1}}+T^{-2\alpha}),$$
$${\overline{\Delta}}_2=\mathcal{O}\left(\sqrt{\frac{\text{Pdim}_{\mathcal{G}_1}\log(T\text{B}_{\mathcal{G}_1})}{T}}+\sqrt{\frac{\text{Pdim}_{\mathcal{H}_1}\log(T\text{B}_{\mathcal{H}_1})}{T}}\right),$$
$${\overline{\Delta}}_3=\mathcal{O}(1)\cdot\mathbb{E}_{(X_t,\eta_t)_{t=0}^T}\left(\sup_h\mathcal{L}_T(\widehat{g},h)-\sup_{h\in\mathcal{H}_1}\mathcal{L}_T(\widehat{g},h)\right),$$
$${\overline{\Delta}}_4=\mathcal{O}(1)\cdot\inf_{\bar{g}\in\mathcal{G}_1}\mathbb{L}_T(\bar{g}).$$

Moreover, for the $s$-step generations ${\widetilde{X}}_{T+s}$ in (53), we have

$$\mathbb{E}_{(X_t,\eta_t)_{t=0}^T}\frac{1}{S}\sum_{s=1}^S\left\|p_{{\widetilde{X}}_{[T-k+1,T]},{\widetilde{X}}_{T+s}}-p_{X_{[T-k+1,T]},X_{T+s}}\right\|_{L_1}^2\leq\underbrace{\breve{\Delta}_1+\breve{\Delta}_2}_{\text{statistical err}}+\underbrace{\breve{\Delta}_3+\breve{\Delta}_4}_{\text{approximation err}}. \quad (57)$$

In particular, when $S=1$,

$$\mathbb{E}_{(X_t,\eta_t)_{t=0}^T}\left\|p_{{\widetilde{X}}_{[T-k+1,T+1]}}-p_{X_{[T-k+1,T+1]}}\right\|_{L_1}^2\leq\underbrace{\breve{\Delta}_1+{\overline{\Delta}}_2}_{\text{statistical err}}+\underbrace{{\overline{\Delta}}_3+{\overline{\Delta}}_4}_{\text{approximation err}}, \quad (58)$$

where

$$\breve{\Delta}_1=\mathcal{O}(T^{-\frac{\alpha}{\alpha+1}}),$$
$$\breve{\Delta}_2=\mathcal{O}\left(\sqrt{\frac{\text{Pdim}_{\mathcal{G}}\log(T\text{B}_{\mathcal{G}})}{T}}+\sqrt{\frac{\text{Pdim}_{\mathcal{H}}\log(T\text{B}_{\mathcal{H}})}{T}}\right),$$
$$\breve{\Delta}_3=\mathcal{O}(1)\cdot\mathbb{E}_{(X_t,\eta_t)_{t=0}^T}\left(\sup_H\dot{\mathcal{L}}_T(\widehat{G},H)-\sup_{H\in\mathcal{H}}\dot{\mathcal{L}}_T(\widehat{G},H)\right),$$
$$\breve{\Delta}_4=\mathcal{O}(1)\cdot\inf_{\bar{G}\in\mathcal{G}}\dot{\mathbb{L}}_T(\bar{G}).$$
Remark 4.1.

Analogous to the lag-1 case, $\mathcal{L}_T$, $\mathbb{L}_T$, $\dot{\mathcal{L}}_T$, and $\dot{\mathbb{L}}_T$ for lag-k time series are defined as follows:

$$\mathbb{L}_T(g)=D_f\left(p_{X_{[T-k+1,T]},g\left(\eta_{T-k+1},X_{[T-k+1,T]}\right)}\|p_{X_{[T-k+1,T+1]}}\right), \quad (59)$$
$$\mathcal{L}_T(g,h)=\mathbb{E}_{X_{[T-k+1,T]},\eta_{T-k+1}}h\left(X_{[T-k+1,T]},g(\eta_{T-k+1},X_{[T-k+1,T]})\right)-\mathbb{E}_{X_{[T-k+1,T+1]}}f^*\left(h(X_{[T-k+1,T+1]})\right), \quad (61)$$
$$\dot{\mathbb{L}}_T(G)=\frac{1}{S}\sum_{s=1}^S D_f\left(p_{X_{[T-k+1,T]},G(\eta_{T-k+1},X_{[T-k+1,T]},s)}\|p_{X_{[T-k+1,T]},X_{T+s}}\right), \quad (62)$$
$$\dot{\mathcal{L}}_T(G,H)=\frac{1}{S}\sum_{s=1}^S\Big[\mathbb{E}_{X_{[T-k+1,T]},\eta_{T-k+1}}H\left(X_{[T-k+1,T]},G(\eta_{T-k+1},X_{[T-k+1,T]},s),s\right)-\mathbb{E}_{X_{[T-k+1,T]},X_{T+s}}f^*(H(X_{[T-k+1,T]},X_{T+s},s))\Big]. \quad (64)$$

When $\mathcal{H}$ and $\mathcal{G}$ are taken to be appropriate deep neural network spaces, ${\overline{\Delta}}_1$ to ${\overline{\Delta}}_4$ and $\breve{\Delta}_1$ to $\breve{\Delta}_4$ all converge to 0. Consequently, the joint distribution matching for the iterative generation and the pairwise distribution matching for the $s$-step generation can be guaranteed for lag-k time series.

5 Further generalizations to panel data

In this section, we extend our analysis of image time series to a panel data setting. In particular, we consider a scenario with $n$ subjects, and for each subject $i=1,2,\cdots,n$ we observe a sequence of images $\{X_{i,t}, t=1,2,\ldots,T_i\}$. Here we allow the time series length $T_i$ to differ across subjects. Clearly, this type of setting is frequently encountered when analyzing medical image data. Our objective is to generate images for each subject at future time points.

In the panel data setting, we assume that $\{X_{i,t}, t=1,2,\ldots,T_i\}$ satisfies the following Markov condition for all subjects:

$$X_{i,t}\,|\,X_{i,t-1},\cdots,X_{i,0}\overset{\text{d}}{=}X_{i,t}\,|\,X_{i,t-1},\ \ i=[n],\ t=[T_i]. \quad (65)$$

We further assume the following invariance condition

$$p_{X_{i,t}|X_{i,t-1}}(x|y)=p_{X_{1,1}|X_{1,0}}(x|y),\ \ i=[n],\ t=[T_i]. \quad (66)$$

In other words, we assume the same conditional distribution across subjects $i$ and time points $t$.

Similar to the previous sections, we aim to find a common function $g$ such that for all subjects $i=1,2,\cdots,n$, the generated sequence

$$\widehat{X}_{i,T_i}=X_{i,T_i},\ \ \ \widehat{X}_{i,T_i+s}=g(\eta_{T_i+s-1},\widehat{X}_{i,T_i+s-1}),\ \ 1\leq s\leq S, \quad (67)$$

achieves distribution matching

$$(\widehat{X}_{i,T_i},\widehat{X}_{i,T_i+1},\cdots,\widehat{X}_{i,T_i+S})\sim(X_{i,T_i},X_{i,T_i+1},\cdots,X_{i,T_i+S}). \quad (68)$$

By Theorem 1, such a function $g$ clearly exists. To estimate $g$, we consider the following min-max problem:

$$(\widehat{g},\widehat{h})=\arg\min_{g\in\mathcal{G}}\max_{h\in\mathcal{H}}\widehat{\mathcal{L}}(g,h), \quad (69)$$
$$\widehat{\mathcal{L}}(g,h)=\frac{1}{n}\sum_{i=1}^n\frac{1}{T_i}\sum_{t=0}^{T_i-1}\Big[h(X_{i,t},g(\eta_t,X_{i,t}))-f^*(h(X_{i,t},X_{i,t+1}))\Big], \quad (70)$$

where, as before, $\mathcal{G}$ and $\mathcal{H}$ are spaces of continuous and uniformly bounded functions. To prove the convergence of the generated sequences, we consider two different settings: 1) $T_i$ approaches infinity, while $n$ may either go to infinity or be finite; 2) $T_i$ is finite, while $n$ approaches infinity.
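A sketch of the pooled empirical objective (70) is given below: the neighboring-pair losses are averaged within each subject and then across the $n$ subjects. Here `series` is a list of tensors with `series[i]` of shape $(T_i+1, p)$, and `g`, `h`, and `f_star` are placeholders as in the single-series sketch in Section 2.

```python
import torch

def panel_loss(g, h, series, m, f_star):
    """Pooled objective (70): average the per-subject pair losses over the n subjects."""
    per_subject = []
    for X in series:                          # X has shape (T_i + 1, p)
        x_prev, x_next = X[:-1], X[1:]        # neighboring pairs (X_{i,t}, X_{i,t+1})
        eta = torch.randn(x_prev.shape[0], m)
        fake = g(torch.cat([eta, x_prev], dim=1))
        term = (h(torch.cat([x_prev, fake], dim=1))
                - f_star(h(torch.cat([x_prev, x_next], dim=1)))).mean()
        per_subject.append(term)              # (1 / T_i) * sum_t [...]
    return torch.stack(per_subject).mean()    # average over subjects i = 1, ..., n
```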

5.1 Convergence analysis for TiT_{i}\to\infty

In this subsection, we consider the case that min1in{Ti}\mathop{\min}_{1\leq i\leq n}\left\{T_{i}\right\}\to\infty, while nn may either go to infinity or be finite. We consider the following sequences

X^i,Ti=Xi,Ti,X^i,Ti+s=g^(ηTi+s1,X^i,Ti+s1),i=[n],s=[S].\displaystyle\widehat{X}_{i,T_{i}}=X_{i,T_{i}},\ \ \widehat{X}_{i,T_{i}+s}=\widehat{g}(\eta_{T_{i}+s-1},\widehat{X}_{i,T_{i}+s-1}),\ \ i=[n],\ s=[S]. (71)

Now we are ready to present the convergence theorem for the generated sequences.

Theorem 13.

Suppose {Xi,t}\left\{X_{i,t}\right\} satisfies the Markov property (65) and conditional invariance condition (66). Suppose {Xi,t,t=[Ti]}\left\{X_{i,t},\ t=[T_{i}]\right\} satisfies Assumption 1 for each subject ii. Let g^\widehat{g} be the solution to the f-GAN problem (69) with f satisfying (14). Then,

𝔼{ηt,Xi,t}1ni=1npX^i,Ti,,X^i,Ti+SpXi,Ti,,Xi,Ti+SL12Δ˙1+Δ˙2statistical error+Δ˙3+Δ˙4approximation error\displaystyle\mathbb{E}_{\{\eta_{t},X_{i,t}\}}\frac{1}{n}\sum_{i=1}^{n}\left\|p_{\widehat{X}_{i,T_{i}},\cdots,\widehat{X}_{i,T_{i}+S}}-p_{X_{i,T_{i}},\cdots,X_{i,T_{i}+S}}\right\|_{L_{1}}^{2}\leq\underbrace{\dot{\Delta}_{1}+\dot{\Delta}_{2}}_{\text{statistical error}}+\underbrace{\dot{\Delta}_{3}+\dot{\Delta}_{4}}_{\text{approximation error}} (72)

where

Δ˙1=𝒪(1ni=1n[Tiαα+1+Ti2α]),\displaystyle\dot{\Delta}_{1}=\mathcal{O}\left(\frac{1}{n}\sum\limits_{i=1}^{n}\left[T_{i}^{-\frac{\alpha}{\alpha+1}}+T_{i}^{-2\alpha}\right]\right),
Δ˙2=𝒪(1ni=1n[Pdim𝒢log(TiB𝒢)Ti+Pdimlog(TiB)Ti]),\displaystyle\dot{\Delta}_{2}=\mathcal{O}\left(\frac{1}{n}\sum\limits_{i=1}^{n}\left[\sqrt{\frac{\text{Pdim}_{\mathcal{G}}\log(T_{i}\text{B}_{\mathcal{G}})}{T_{i}}}+\sqrt{\frac{\text{Pdim}_{\mathcal{H}}\log(T_{i}\text{B}_{\mathcal{H}})}{T_{i}}}\right]\right),
Δ˙3=𝒪(1)𝔼(suph(n)(g^,h)suph(n)(g^,h)),\displaystyle\dot{\Delta}_{3}=\mathcal{O}(1)\cdot\mathbb{E}(\mathop{\sup}_{h}\mathcal{L}_{(n)}(\widehat{g},h)-\mathop{\sup}_{h\in\mathcal{H}}\mathcal{L}_{(n)}(\widehat{g},h)),
Δ˙4=𝒪(1)infg¯𝒢𝕃(n)(g¯).\displaystyle\dot{\Delta}_{4}=\mathcal{O}(1)\cdot\mathop{\inf}_{\bar{g}\in\mathcal{G}}\mathbb{L}_{(n)}(\bar{g}).

Here (n)(,)\mathcal{L}_{(n)}(\cdot,\cdot) and 𝕃(n)(,)\mathbb{L}_{(n)}(\cdot,\cdot) are defined as

(n)(g,h)=1ni=1n[𝔼ηTi,Xi,Tih(Xi,Ti,g(ηTi,Xi,Ti))𝔼Xi,Ti,Xi,Ti+1f(h(Xi,Ti,Xi,Ti+1))]\displaystyle\mathcal{L}_{(n)}(g,h)=\frac{1}{n}\sum\limits_{i=1}^{n}\big{[}\mathbb{E}_{\eta_{T_{i}},X_{i,T_{i}}}h(X_{i,T_{i}},g(\eta_{T_{i}},X_{i,T_{i}}))-\mathbb{E}_{X_{i,T_{i}},X_{i,T_{i}+1}}f^{*}(h(X_{i,T_{i}},X_{i,T_{i}+1}))\big{]}
𝕃(n)(g)=1ni=1nDf(pXi,Ti,g(ηTi,Xi,Ti)pXi,Ti,Xi,Ti+1).\displaystyle\mathbb{L}_{(n)}(g)=\frac{1}{n}\sum\limits_{i=1}^{n}D_{f}(p_{X_{i,T_{i}},g(\eta_{T_{i}},X_{i,T_{i}})}\|p_{X_{i,T_{i}},X_{i,T_{i}+1}}).

Whether $n$ is finite or approaches infinity, Theorem 13 demonstrates the convergence of the generated sequence when $T_i\rightarrow\infty$. We note that the usual assumption of independence across subjects is not necessary in Theorem 13, which implies that convergence can be guaranteed even when observations from different subjects are dependent.

5.2 Convergence analysis for $n\to\infty$ with finite $T$

In this subsection, we consider the case that nn\to\infty, while TiT_{i} is finite. Without loss of generality, we assume that T1=T2==TnTT_{1}=T_{2}=\cdots=T_{n}\triangleq T. In addition, the following assumption is needed in our analysis.

Assumption 3.

For all i=1,,ni=1,\ldots,n, the starting point Xi,0X_{i,0} follows the same distribution as X0X_{0}, i.e.,

X1,0=d=dXn,0=dX0.\displaystyle X_{1,0}\overset{\text{d}}{=}\cdots\overset{\text{d}}{=}X_{n,0}\overset{\text{d}}{=}X_{0}. (73)

By combining Assumption 3 with the Markov and conditional invariance conditions, the sequences $(X_{i,0},\cdots,X_{i,T})$, $i=1,\ldots,n$, all follow the same joint distribution. Consequently, we obtain the following convergence theorem.

Theorem 14.

Suppose {Xi,t}\left\{X_{i,t}\right\} satisfies Assumption 3, the Markov property (65) and conditional invariance condition (66). Further assume pXT+s2(x)pXT1(x)𝑑x<\int\frac{p_{X_{T+s}}^{2}(x)}{p_{X_{T-1}}(x)}dx<\infty for all s=[S]s=[S]. Then, if the sequences (Xi,0,,Xi,T)(X_{i,0},\cdots,X_{i,T}) are independent across samples i=1,,ni=1,\ldots,n, we have for all ii,

𝔼pX^i,T,,X^i,T+SpXT,,XT+SL12Δ¨1statistical error+Δ¨2+Δ¨3approximation error,\displaystyle\mathbb{E}\left\|p_{\widehat{X}_{i,T},\cdots,\widehat{X}_{i,T+S}}-p_{X_{T},\cdots,X_{T+S}}\right\|_{L_{1}}^{2}\leq\underbrace{\ddot{\Delta}_{1}}_{\text{statistical error}}+\underbrace{\ddot{\Delta}_{2}+\ddot{\Delta}_{3}}_{\text{approximation error}}, (74)

where

Δ¨1=𝒪(Pdim𝒢log(nB𝒢)n+Pdimlog(nB)n),\displaystyle\ddot{\Delta}_{1}=\mathcal{O}\left(\sqrt{\frac{\text{Pdim}_{\mathcal{G}}\log(n\text{B}_{\mathcal{G}})}{n}}+\sqrt{\frac{\text{Pdim}_{\mathcal{H}}\log(n\text{B}_{\mathcal{H}})}{n}}\right),
Δ¨2=𝒪(1)𝔼(suph˙(T)(g^,h)suph˙(T)(g^,h)),\displaystyle\ddot{\Delta}_{2}=\mathcal{O}(1)\cdot\mathbb{E}(\mathop{\sup}_{h}\dot{\mathcal{L}}_{(T)}(\widehat{g},h)-\mathop{\sup}_{h\in\mathcal{H}}\dot{\mathcal{L}}_{(T)}(\widehat{g},h)),
Δ¨3=𝒪(1)infg¯𝒢𝕃˙(T)(g¯).\displaystyle\ddot{\Delta}_{3}=\mathcal{O}(1)\cdot\mathop{\inf}_{\bar{g}\in\mathcal{G}}\dot{\mathbb{L}}_{(T)}(\bar{g}).

Here ˙(T)(,)\dot{\mathcal{L}}_{(T)}(\cdot,\cdot) and 𝕃˙(T)(,)\dot{\mathbb{L}}_{(T)}(\cdot,\cdot) are defined as

˙(T)(g,h)=1Tt=0T1[𝔼ηt,Xi,th(Xi,t,g(ηt,Xi,t))𝔼Xi,t,Xi,t+1f(h(Xi,t,Xi,t+1))],\displaystyle\dot{\mathcal{L}}_{(T)}(g,h)=\frac{1}{T}\sum\limits_{t=0}^{T-1}\big{[}\mathbb{E}_{\eta_{t},X_{i,t}}h(X_{i,t},g(\eta_{t},X_{i,t}))-\mathbb{E}_{X_{i,t},X_{i,t+1}}f^{*}(h(X_{i,t},X_{i,t+1}))\big{]},
𝕃˙(T)(g)=1Tt=0T1Df(pXi,t,g(ηt,Xi,t)pXi,t,Xi,t+1).\displaystyle\dot{\mathbb{L}}_{(T)}(g)=\frac{1}{T}\sum\limits_{t=0}^{T-1}D_{f}(p_{X_{i,t},g(\eta_{t},X_{i,t})}\|p_{X_{i,t},X_{i,t+1}}).

Moreover,

𝔼1Tt=0T1pXi,t,g^(ηt,Xi,t)pXi,t,Xi,t+1L12Δ¨1statistical error+Δ¨2+Δ¨3approximation error.\displaystyle\mathbb{E}\frac{1}{T}\sum\limits_{t=0}^{T-1}\left\|p_{X_{i,t},\widehat{g}(\eta_{t},X_{i,t})}-p_{X_{i,t},X_{i,t+1}}\right\|_{L_{1}}^{2}\leq\underbrace{\ddot{\Delta}_{1}}_{\text{statistical error}}+\underbrace{\ddot{\Delta}_{2}+\ddot{\Delta}_{3}}_{\text{approximation error}}. (75)
Remark 5.1.

The independence assumption is necessary for convergence in Theorem 14, whereas it is not required in Theorem 13. In a panel data setting where the time series length $T$ is finite, we cannot rely on a sufficiently long time series to achieve convergence. In particular, Proposition 5 can no longer be employed to control the difference between the joint and pairwise distributions. Thus, the proof of Theorem 14 differs significantly from those in previous sections.

6 Simulation studies

In this section, we conduct comprehensive simulation studies to assess the performance of our generation methods. We begin in Section 6.1 with the generation of a single image time series and then generalize to the panel data scenario in Section 6.2.

6.1 Study I: Single Time Series

We consider matrix-valued time series to mimic the setting of real image data. Specifically, we consider the following three cases:

  • Case 1. Lag-1 Linear

    Xt+1=ϕ1Xt+ϕeEt+1,\displaystyle X_{t+1}=\phi_{1}X_{t}+\phi_{e}E_{t+1},
  • Case 2. Lag-1 Nonlinear

    Xt+1=ϕ1sinXt+ϕeEt+1,\displaystyle X_{t+1}=\phi_{1}\sin X_{t}^{\top}+\phi_{e}E_{t+1},
  • Case 3. Lag-3 Nonlinear

    Xt+1=ϕ1cos(XtXt2Xt)+ϕ2max{0,Xt1}+ϕeEt+1.\displaystyle X_{t+1}=\phi_{1}\cos(X_{t}^{\top}X_{t-2}X_{t}^{\top})+\phi_{2}\sqrt{\max\{0,X_{t-1}^{\top}\}}+\phi_{e}E_{t+1}.

Here $X_t\in\mathbb{R}^{p_1\times p_2}$ and $E_{t+1}\in\mathbb{R}^{p_1\times p_2}$ represent the target image and an independent noise matrix, respectively, with the image size fixed at $(p_1,p_2)=(32,32)$. In all three cases, the noise matrix $E_{t+1}$ consists of i.i.d. standard normal entries. Moreover, the initialization $X_0$ in Cases 1 and 2, and $X_0$, $X_1$, $X_2$ in Case 3, are also taken to be matrices with i.i.d. standard normal entries. It is worth noting that under the Case 1 setting, $X_t$ is column-wise independent, and each column of $X_t$ converges to $\mathcal{N}(0,\Sigma_\infty)$, where $\Sigma_\infty$ satisfies $\Sigma_\infty=\phi_1\Sigma_\infty\phi_1^\top+\phi_e\phi_e^\top$. In this simulation, we let $\phi_1\in\mathbb{R}^{32\times 32}$ be a normalized Gaussian matrix with largest eigenvalue modulus less than 1, and $\phi_e\in\mathbb{R}^{32\times 32}$ be a fixed matrix with block-shaped patterns. Given $\phi_1$ and $\phi_e$, $\Sigma_\infty$ can be solved for easily. We plot $\phi_1$, $\phi_e$ and $\Sigma_\infty$ in the supplementary material. In Case 2, we set $\phi_1,\phi_e\in\mathbb{R}^{32\times 32}$ in the same manner as in Case 1, though the Gaussian limit no longer holds. In Case 3, we let both $\phi_1$ and $\phi_2\in\mathbb{R}^{32\times 32}$ be normalized Gaussian matrices and fix $\phi_e\in\mathbb{R}^{32\times 32}$ as before.
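As a concrete illustration of the data-generating process, the snippet below simulates Case 1 with NumPy. The coefficient matrices here are stand-ins: $\phi_1$ is rescaled so that its largest eigenvalue modulus is 0.9, and a random matrix replaces the block-patterned $\phi_e$ used in the paper.

import numpy as np

rng = np.random.default_rng(0)
p = 32

# Stand-ins for the coefficient matrices (the paper uses a block-patterned phi_e).
phi1 = rng.standard_normal((p, p))
phi1 *= 0.9 / np.max(np.abs(np.linalg.eigvals(phi1)))   # largest eigenvalue modulus = 0.9 < 1
phie = rng.standard_normal((p, p))

def simulate_case1(T):
    """Case 1 (lag-1 linear): X_{t+1} = phi1 X_t + phie E_{t+1}, X_0 with i.i.d. N(0,1) entries."""
    X = rng.standard_normal((p, p))
    series = [X]
    for _ in range(T):
        E = rng.standard_normal((p, p))
        X = phi1 @ X + phie @ E
        series.append(X)
    return np.stack(series)              # shape (T + 1, 32, 32)

series = simulate_case1(T=1000)

# Column-wise stationary covariance: Sigma_inf = phi1 Sigma_inf phi1^T + phie phie^T,
# obtained here by iterating the fixed-point recursion to convergence.
Sigma_inf = np.zeros((p, p))
for _ in range(500):
    Sigma_inf = phi1 @ Sigma_inf @ phi1.T + phie @ phie.T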

We consider time series with two different lengths, $T=1000$ and $T=5000$, for training. We set the horizon $S=3$, i.e., we generate images at most $S=3$ steps ahead. We aim to generate a total of $T_{new}=500$ image points in a “rolling forecasting” style. Specifically, for any $s=1,\ldots,S$, we let the $s$-step generation (denoted as “s-step GTS”) be

X~T+tnew(j)(s)=G^(XT+tnews,ηT+tnews(j),s),\displaystyle{\widetilde{X}}_{T+t_{new}}^{(j)}(s)=\widehat{G}\left(X_{T+t_{new}-s},\eta^{(j)}_{T+t_{new}-s},s\right),

where $t_{new}\leq T_{new}$ and the superscript $j$ indicates the $j$-th generated sample. Here $\widehat{G}$ is estimated using 10,000 randomly selected pairs $(X_t,X_{t+s})$ from the training data. For the iterative generation (denoted as “iter GTS”), we let

X^T+tnew(j)(s)=g^(s)(XT+tnews,[ηT+tnews(j),,ηT+tnew1(j)]),\displaystyle\widehat{X}_{T+t_{new}}^{(j)}(s)=\widehat{g}^{(s)}\left(X_{T+t_{new}-s},[\eta^{(j)}_{T+t_{new}-s},\ldots,\eta^{(j)}_{T+t_{new}-1}]\right),

where $\widehat{g}^{(s)}$ denotes the $s$-fold composition of the function $\widehat{g}$.
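The two generation schemes can be sketched as follows; g_hat and G_hat stand for the learned generators (hypothetical callables following the argument order of the displays above), and the noise dimension is an assumption.

import numpy as np

rng = np.random.default_rng(0)
NOISE_DIM = 20          # assumed dimension of the reference noise eta

def iterative_generate(g_hat, x_cond, s):
    """'iter GTS': compose g_hat s times, drawing fresh noise at every step."""
    x = x_cond
    for _ in range(s):
        eta = rng.standard_normal(NOISE_DIM)
        x = g_hat(eta, x)
    return x

def s_step_generate(G_hat, x_cond, s):
    """'s-step GTS': one draw from the learned conditional law of X_{t+s} given X_t."""
    eta = rng.standard_normal(NOISE_DIM)
    return G_hat(x_cond, eta, s)

def rolling_forecast(gen_one, series, T, T_new, s):
    """Rolling generation: for each target time T + t_new, condition on X_{T + t_new - s}."""
    return [gen_one(series[T + t_new - s], s) for t_new in range(1, T_new + 1)]

# Example usage (G_hat assumed to be a trained s-step generator):
# forecasts = rolling_forecast(lambda x, s: s_step_generate(G_hat, x, s),
#                              series, T=1000, T_new=500, s=2)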

Both $\widehat{G}$ for the s-step generation and $\widehat{g}$ for the iterative generation are estimated under the KL divergence (i.e., $f(x)=x\log x$) using neural networks in the simulation. Specifically, in Cases 1 and 2, the input $X_t$ first goes through two fully connected layers, each with a ReLU activation. It is then combined with the random noise vector $\eta$, and this combined input is passed through a single fully connected layer. The discriminator has two separate processing branches that embed $X_t$ and $X_{t+1}$ into low-dimensional vectors, respectively. These vectors are concatenated and further processed to produce an output score. The details of the network structure can be found in Table 1. For Case 3, in the generator network, $X_t$, $X_{t-1}$, and $X_{t-2}$ are each processed independently by three separate fully connected layers, then combined with a random noise vector and passed through another fully connected layer before producing the output. The discriminator in this case follows the same design as in Cases 1 and 2.
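A PyTorch sketch consistent with this description and with Table 1 is given below; it is our reconstruction rather than the authors' code, and the noise dimension of 20 is inferred from the table (in dims = 148 = 128 + 20) rather than stated in the text.

import torch
import torch.nn as nn

class Generator(nn.Module):
    """Generator for Cases 1-2 (cf. Table 1): two fc+ReLU layers on the flattened
    32x32 input, concatenation with the noise vector eta, then one fc layer back to 32x32."""
    def __init__(self, noise_dim=20):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(),
                                   nn.Linear(256, 128), nn.ReLU())
        self.out = nn.Linear(128 + noise_dim, 1024)
    def forward(self, eta, x):
        z = self.embed(x.flatten(1))                   # (batch, 128)
        return self.out(torch.cat([z, eta], dim=1)).view(-1, 32, 32)

class Discriminator(nn.Module):
    """Two branches embed X_t and X_{t+1} into R^64; the concatenated embedding
    is mapped to a scalar score."""
    def __init__(self):
        super().__init__()
        self.branch_t  = nn.Sequential(nn.Linear(1024, 64), nn.ReLU(), nn.Linear(64, 64))
        self.branch_t1 = nn.Sequential(nn.Linear(1024, 64), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
    def forward(self, x_t, x_t1):
        h = torch.cat([self.branch_t(x_t.flatten(1)),
                       self.branch_t1(x_t1.flatten(1))], dim=1)
        return self.head(h)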

Unlike supervised learning, generative learning does not have universally applicable metrics for evaluating the quality of generated samples (Theis et al., 2015; Borji, 2022). Assessing the visual quality of produced images often depends on expert domain knowledge. Meanwhile, the GAN literature has devoted significant effort to understanding and developing evaluation metrics for generative performance. Several quantitative metrics have been introduced, such as the Inception Score (Salimans et al., 2016), the Fréchet Inception Distance (Heusel et al., 2017), and the Maximum Mean Discrepancy (Bińkowski et al., 2018), among others.

Table 1: The architecture of the generator and discriminator for Cases 1 and 2.

Generator:
  1. fully connected layer (in dims = 1024, out dims = 256)
  2. ReLU
  3. fully connected layer (in dims = 256, out dims = 128)
  4. ReLU
  5. concatenate with random vector $\eta$
  6. fully connected layer (in dims = 148, out dims = 1024)

Discriminator:
  $X_t \xrightarrow{\text{fc + ReLU}} 64 \xrightarrow{\text{fc}} x_t \in \mathbb{R}^{64}$
  $X_{t+1} \xrightarrow{\text{fc + ReLU}} x_{t+1} \in \mathbb{R}^{64}$
  $\text{concatenate}(x_t, x_{t+1}) \xrightarrow{\text{fc + ReLU}} 64 \xrightarrow{\text{fc}} \text{score}$
Table 2: The NRMSE of mean estimation in Study I under different settings. For each setting, the best result is marked with * and the second best with †.

T | Case | Method | s=1 | s=2 | s=3
1000 | 1 | OLS | 0.013* (0.002) | 0.017* (0.003) | 0.023* (0.004)
1000 | 1 | Naive Baseline | 1.563 (0.039) | 1.932 (0.043) | 1.408 (0.098)
1000 | 1 | iter GTS | 0.800† (0.038) | 0.891 (0.050) | 0.985 (0.064)
1000 | 1 | s-step GTS | 0.800† (0.038) | 0.843† (0.047) | 0.966† (0.062)
1000 | 2 | OLS | 0.973 (0.060) | 1.000 (0.021) | 0.995 (0.008)
1000 | 2 | Naive Baseline | 1.462 (0.129) | 1.455 (0.152) | 1.373 (0.178)
1000 | 2 | iter GTS | 0.606* (0.053) | 0.664† (0.064) | 0.702† (0.075)
1000 | 2 | s-step GTS | 0.606* (0.053) | 0.609* (0.057) | 0.615* (0.061)
1000 | 3 | OLS | 0.608 (0.021) | 0.609 (0.021) | 0.609 (0.020)
1000 | 3 | Naive Baseline | 0.845 (0.034) | 0.843 (0.034) | 0.844 (0.033)
1000 | 3 | iter GTS | 0.598* (0.020) | 0.595* (0.020) | 0.598* (0.020)
1000 | 3 | s-step GTS | 0.598* (0.020) | 0.595* (0.020) | 0.602† (0.020)
5000 | 1 | OLS | 0.007* (0.001) | 0.010* (0.002) | 0.013* (0.002)
5000 | 1 | Naive Baseline | 1.563 (0.040) | 1.932 (0.043) | 1.408 (0.099)
5000 | 1 | iter GTS | 0.615† (0.098) | 0.717 (0.163) | 0.851 (0.226)
5000 | 1 | s-step GTS | 0.615† (0.098) | 0.707† (0.167) | 0.749† (0.079)
5000 | 2 | OLS | 0.973 (0.060) | 1.001 (0.021) | 0.995 (0.007)
5000 | 2 | Naive Baseline | 1.468 (0.132) | 1.452 (0.157) | 1.377 (0.181)
5000 | 2 | iter GTS | 0.470* (0.046) | 0.532† (0.056) | 0.576† (0.066)
5000 | 2 | s-step GTS | 0.470* (0.046) | 0.470* (0.048) | 0.500* (0.054)
5000 | 3 | OLS | 0.608 (0.020) | 0.610 (0.021) | 0.609 (0.020)
5000 | 3 | Naive Baseline | 0.845 (0.032) | 0.846 (0.032) | 0.850 (0.033)
5000 | 3 | iter GTS | 0.578* (0.020) | 0.574† (0.020) | 0.595† (0.020)
5000 | 3 | s-step GTS | 0.578* (0.020) | 0.566* (0.020) | 0.592* (0.021)

Nonetheless, achieving consensus on the evaluation of generative models remains an unresolved issue. In our approach, we compute the mean of the generated samples and report the normalized root mean squared error (NRMSE) of this mean estimate. Specifically, for a given step $s$, let $\widetilde{X}_{T+t_{new}}(s)=(1/J)\sum_{j=1}^{J}\widetilde{X}_{T+t_{new}}^{(j)}(s)$ and $\widehat{X}_{T+t_{new}}(s)=(1/J)\sum_{j=1}^{J}\widehat{X}_{T+t_{new}}^{(j)}(s)$ be the estimated means of the s-step and iteratively generated samples, respectively. The NRMSE of the iterative generation is defined as

NRMSE(s)=X^T+tnew(s)𝔼XT+tnewF/𝔼XT+tnewF,\displaystyle\text{NRMSE}(s)=\|\widehat{X}_{T+t_{new}}(s)-{\mathbb{E}}X_{T+t_{new}}\|_{F}/\|{\mathbb{E}}X_{T+t_{new}}\|_{F},

where F\|\cdot\|_{F} denotes the Frobenius norm. The NRMSE of the s-step generation can be defined similarly.
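A direct NumPy implementation of the NRMSE above is given here for reference, under the assumption that the $J$ generated images are stacked along the first axis.

import numpy as np

def nrmse(generated_samples, true_mean):
    """NRMSE of the mean estimate: average the J generated images for a given
    (t_new, s), then compare with E[X_{T+t_new}] in Frobenius norm."""
    x_bar = np.mean(generated_samples, axis=0)     # (1/J) sum_j X^{(j)}_{T+t_new}(s)
    return np.linalg.norm(x_bar - true_mean, 'fro') / np.linalg.norm(true_mean, 'fro')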

The performance of our iterative and s-step generation is compared with two benchmark approaches. We first consider a naive baseline in which, for a given $s=1,\ldots,S$, the prediction for $X_{T+t_{new}}$ is taken to be the observation $s$ steps earlier, i.e., $\widehat{X}_{T+t_{new}}=X_{T+t_{new}-s}$. In addition, we consider a simple linear estimator obtained by ordinary least squares (OLS). Specifically, the linear coefficients are estimated with a correctly specified lag order, i.e., $\widehat{\phi}_1=\mathop{\rm arg\,min}_{\phi}\sum_{t=1}^{T}\|X_{t+1}-\phi X_t\|_F$ for Cases 1 and 2, and $(\widehat{\phi}_1,\widehat{\phi}_2,\widehat{\phi}_3)=\mathop{\rm arg\,min}_{\phi_1,\phi_2,\phi_3}\sum_{t=1}^{T}\|X_{t+1}-\phi_1 X_t-\phi_2 X_{t-1}-\phi_3 X_{t-2}\|_F$ for Case 3. Clearly, in Case 1, OLS is a suitable choice as the model is linear. However, for Cases 2 and 3, OLS is mis-specified.
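For the lag-1 benchmark, the least-squares coefficient has the closed form $\widehat{\phi}_1=(\sum_t X_{t+1}X_t^\top)(\sum_t X_tX_t^\top)^{-1}$ from the normal equations; a NumPy sketch (ours, not the authors' code) follows.

import numpy as np

def ols_lag1(series):
    """Closed-form least squares for X_{t+1} ~ phi X_t:
    phi_hat = (sum_t X_{t+1} X_t^T) (sum_t X_t X_t^T)^{-1}."""
    X_t, X_next = series[:-1], series[1:]
    A = np.einsum('tij,tkj->ik', X_next, X_t)      # sum_t X_{t+1} X_t^T
    B = np.einsum('tij,tkj->ik', X_t, X_t)         # sum_t X_t X_t^T
    return A @ np.linalg.inv(B)

def ols_predict(phi_hat, x_cond, s):
    """s-step OLS prediction: apply phi_hat s times to the conditioning image."""
    x = x_cond
    for _ in range(s):
        x = phi_hat @ x
    return x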

We repeat the simulation 100 times and report in Table 2 the mean and standard deviation of the NRMSE for $s=1,2,3$ using different approaches. It is clear that our s-step and iterative generations exhibit competitive performance across almost all settings, particularly in the nonlinear Cases 2 and 3. In Case 1, the original model is linear, and as expected, OLS achieves the minimum NRMSE. Moreover, as $s$ increases, the problem becomes more challenging; nevertheless, both the s-step and iterative generations maintain robust performance across different $s$. It is important to note that when $s=1$, the iterative and s-step generation methods are equivalent. For $s=2,3$, the s-step generation generally exhibits a slightly smaller NRMSE, though the difference is not significant.

6.2 Study II: Multiple Time Series

In this subsection, we consider a panel data setting with multiple time series. We consider two different sample sizes, $n=200$ and $500$, while fixing the time series length $T=20$. We set the horizon $S=3$ as before, i.e., we generate images at most $S=3$ steps ahead for each subject.

Table 3: The NRMSE of mean estimation in Study II under different settings. For each setting, the best result is marked with * and the second best with †.

(n, T) | Case | Method | s=1 | s=2 | s=3
(200, 20) | 1 | OLS | 0.037* (0.001) | 0.070* (0.002) | 0.096* (0.003)
(200, 20) | 1 | Naive Baseline | 1.563 (0.039) | 1.925 (0.043) | 1.380 (0.097)
(200, 20) | 1 | iter GTS | 0.729† (0.036) | 0.813 (0.046) | 0.893 (0.056)
(200, 20) | 1 | s-step GTS | 0.729† (0.036) | 0.780† (0.044) | 0.885† (0.056)
(200, 20) | 2 | OLS | 0.985 (0.029) | 1.000 (0.005) | 0.999 (0.001)
(200, 20) | 2 | Naive Baseline | 1.459 (0.107) | 1.478 (0.127) | 1.415 (0.156)
(200, 20) | 2 | iter GTS | 0.583* (0.049) | 0.647* (0.060) | 0.690* (0.071)
(200, 20) | 2 | s-step GTS | 0.583* (0.049) | 0.689† (0.068) | 0.710† (0.077)
(200, 20) | 3 | OLS | 0.617 (0.018) | 0.620 (0.018) | 0.625 (0.015)
(200, 20) | 3 | Naive Baseline | 0.847 (0.032) | 0.845 (0.033) | 0.846 (0.034)
(200, 20) | 3 | iter GTS | 0.593* (0.019) | 0.590* (0.020) | 0.594* (0.020)
(200, 20) | 3 | s-step GTS | 0.593* (0.019) | 0.592† (0.020) | 0.597† (0.020)
(500, 20) | 1 | OLS | 0.037* (0.001) | 0.069* (0.002) | 0.096* (0.003)
(500, 20) | 1 | Naive Baseline | 1.564 (0.039) | 1.925 (0.043) | 1.379 (0.098)
(500, 20) | 1 | iter GTS | 0.611† (0.040) | 0.667 (0.053) | 0.755† (0.072)
(500, 20) | 1 | s-step GTS | 0.611† (0.040) | 0.647† (0.066) | 0.763 (0.056)
(500, 20) | 2 | OLS | 0.985 (0.029) | 1.000 (0.005) | 0.999 (0.001)
(500, 20) | 2 | Naive Baseline | 1.458 (0.109) | 1.478 (0.128) | 1.414 (0.156)
(500, 20) | 2 | iter GTS | 0.511* (0.050) | 0.578* (0.060) | 0.625* (0.070)
(500, 20) | 2 | s-step GTS | 0.511* (0.050) | 0.639† (0.069) | 0.682† (0.080)
(500, 20) | 3 | OLS | 0.617 (0.018) | 0.619 (0.018) | 0.625 (0.015)
(500, 20) | 3 | Naive Baseline | 0.847 (0.032) | 0.845 (0.033) | 0.846 (0.034)
(500, 20) | 3 | iter GTS | 0.579* (0.019) | 0.577* (0.020) | 0.592* (0.020)
(500, 20) | 3 | s-step GTS | 0.579* (0.019) | 0.579† (0.020) | 0.593† (0.021)

Analogous to the single time series setting, we repeat the simulation 100 times and report in Table 3 the mean and standard deviation of the NRMSE for $s=1,2,3$ using different approaches. Table 3 shows a similar pattern to Table 2. Both the iterative and s-step generated images achieve the minimum NRMSE in Cases 2 and 3, while under the linear Case 1, OLS continues to achieve the lowest NRMSE. One notable difference in Table 3 is that the iterative generation achieves a lower NRMSE than the s-step generation. One potential explanation is the increase in sample size: compared with the single time series setting, a sample size of $n=200$ or $500$ allows the iterative approach to obtain a better estimate of $g$, which in turn enhances the image generation.

7 The ADNI study

Driven by the goal of understanding the brain aging process, we study in this section real brain MRI data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Introduced by Petersen et al. (2010), ADNI aims to investigate the progression of Alzheimer’s disease (AD) by collecting sequential MRI scans of subjects classified as cognitively normal (CN), as having mild cognitive impairment (MCI), or as having AD.

Our study focuses on analyzing the brain’s progression during the MCI stage. We include a total of 565 participants from the MCI group in our analysis, each having a sequence of T1-weighted MRI scans with lengths varying between 3 and 9. We approach this as an imaging time series generation problem with multiple samples. However, a significant challenge with this dataset is the short length of each sample ($T_i\leq 9$), a common issue in brain imaging analysis. Given the short length of these samples, we do not generate new images beyond a specific point $T$, as it would be difficult to evaluate their quality. Instead, we divide the dataset into a training set of 450 samples and a testing set of 115 samples. We then train the generator $G$ using the training set and generate image sequences for the testing set given the starting point $X_0$. As in the simulation study, we also consider a naive baseline in which the prediction for $X_{i,s}$ is taken to be the starting observation, i.e., $\widehat{X}_{i,s}=X_{i,0}$. It is worth mentioning that OLS is not suitable for MRI analysis and, as such, is not included in this context.

In our study, all brain T1-weighted MRI scans are processed through a standard pipeline, which begins with a spatially adaptive non-local means (SANLM) denoising filter (Manjón et al., 2010), followed by resampling, bias correction, affine registration, unified segmentation (Ashburner and Friston, 2005), skull stripping, and cerebellum removal. Each MRI is then locally intensity corrected and spatially normalized into the Montreal Neurological Institute (MNI) atlas space (Ashburner, 2007). These procedures result in processed images of size $169\times 205\times 169$. We further rescale the intensities of the resulting images to the range $[-1,1]$. We select the central axial slice from each MRI and crop it to a size of $144\times 192$ by removing the zero-valued voxels.

Figure 1 plots the original, iteratively generated, and s-step generated MRI sequences, along with their differences from $X_0$, for one subject in the test set. As shown by the original images, the brain changes gradually as age increases. This pattern is clearly captured by both the iterative and s-step generations. To further assess the generated MRI images, we plot the starting image $X_0$, the true image after $s$ steps (i.e., $X_s$), and the iteratively and s-step generated images (i.e., $\widehat{X}_s$ and $\widetilde{X}_s$) for three subjects in Figure 2.

[Figure 1: rows show $X_s$ and $X_s-X_0$, $\widehat{X}_s$ and $\widehat{X}_s-X_0$, and $\widetilde{X}_s$ and $\widetilde{X}_s-X_0$.]
Figure 1: The original, iteratively generated, and s-step generated MRI sequences for $s=0,1,\ldots,7$, along with their differences from $X_0$, for one subject.

Although the differences are subtle, we can observe that the image XsX_{s} and the generated images, X^s\widehat{X}_{s} and X~s{\widetilde{X}}_{s}, are quite similar, but they all deviate from X0X_{0} in several crucial regions. We highlight four of these regions in Figure 2, including: a) cortical sulci, b) ventricles, c) edge of ventricles, and d) anterior inter-hemispheric fissure. More specifically, we can observe (from subjects 1, 2, and 3, region a) that the cortical sulci widen as age increases. The widening of cortical sulci may be associated with white matter degradation (Drayer,, 1988; Walhovd et al.,, 2005). This phenomenon is also observed in patients with Alzheimer’s Disease (Migliaccio et al.,, 2012). Additionally, the brain ventricles expand from time 0 to ss as suggested by subjects 1 and 3, region b. The enlargement of ventricles during the aging process is one of the most striking features in structural brain scans across the lifespan (MacDonald and Pike,, 2021). Moreover, we notice that the edge of the ventricles becomes softer (darker region of subject 1, region c), and there is an increased presence of low signal areas adjacent to the ventricles (subject 2, regions b and c). From a clinical perspective, this observation suggests the existence of periventricular interstitial edema, which is linked to reduced ependyma activity and brain white matter atrophy (Todd et al.,, 2018). Lastly, the anterior interhemispheric fissure deepens with aging, as demonstrated in subject 1, region d. In conclusion, the generated samples can potentially aid clinical analyses in identifying age-related brain issues.

[Figure 2: for each subject, panels show $X_0$, $X_s$, and the iterative and s-step generations; the three subjects age from 71 to 80, from 72 to 78, and from 72 to 77 between $X_0$ and $X_s$, respectively.]
Figure 2: An illustration of the generated samples for three subjects in the testing group. For each subject, we present, from left to right, the starting image $X_0$, the target $X_s$, and the two generated samples $\widehat{X}_s$ and $\widetilde{X}_s$ obtained by iterative and s-step generation, respectively. Moreover, we highlight (with different colors) four regions where $X_s$ (along with $\widehat{X}_s$ and $\widetilde{X}_s$) differs most from $X_0$: a) cortical sulci, b) ventricles, c) edge of the ventricles, d) anterior interhemispheric fissure.

As discussed before, there is a lack of universally applicable metrics for evaluating the quality of generated images. In this study, we consider three metrics to measure the difference between the target $X_s$ and our generation $\widehat{X}_s$: the structural similarity index measure (SSIM), the peak signal-to-noise ratio (PSNR), and the previously introduced NRMSE. More specifically, SSIM calculates a similarity score between two images by comparing their luminance, contrast, and structure. Given two images $X$ and $Y$, it is defined as

SSIM(X,Y)=(2μxμy+c1)(2σxy+c2)(μx2+μy2+c1)(σx2+σy2+c2),\displaystyle\text{SSIM}(X,Y)=\frac{(2\mu_{x}\mu_{y}+c_{1})(2\sigma_{xy}+c_{2})}{(\mu_{x}^{2}+\mu_{y}^{2}+c_{1})(\sigma_{x}^{2}+\sigma_{y}^{2}+c_{2})},

where $(\mu_x,\mu_y)$ and $(\sigma_x^2,\sigma_y^2)$ are the means and variances of the pixel values in $X$ and $Y$, respectively, $\sigma_{xy}$ is the covariance between $X$ and $Y$, and $c_1$, $c_2$ are constants specified in the supplementary material. PSNR is a widely used measure of reconstruction quality for images subject to lossy compression. It is typically defined via the mean squared error:

PSNR(X,Y)=10log10(c2/MSE(X,Y)),\displaystyle\text{PSNR}(X,Y)=10\log_{10}\left(c^{2}/\text{MSE}(X,Y)\right),

where cc is the maximum pixel value among X,YX,Y and MSE(X,Y)=XYF2/(D1D2)\text{MSE}(X,Y)=\|X-Y\|_{F}^{2}/(D_{1}D_{2}). Clearly, better image generation performance is indicated by higher values of SSIM and PSNR, as well as smaller values of NRMSE.
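For reference, a NumPy sketch of the two metrics as defined above is given below. The SSIM here is the global (whole-image) version of the stated formula, and the constants c1, c2 are placeholders, since the paper specifies them in the supplementary material.

import numpy as np

def psnr(x, y):
    """PSNR with c the maximum pixel value among X, Y and MSE = ||X - Y||_F^2 / (D1 D2)."""
    mse = np.mean((x - y) ** 2)
    c = max(x.max(), y.max())
    return 10.0 * np.log10(c ** 2 / mse)

def ssim_global(x, y, c1=1e-4, c2=9e-4):   # placeholder constants
    """Whole-image SSIM from the formula above (no local sliding window)."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))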

In Figure 3, we present for each $s$ the mean SSIM, PSNR, and NRMSE over all subjects in the testing set. The results clearly indicate that both the s-step and iterative generations outperform the benchmark in all three metrics across almost all $s=1,\ldots,9$, providing strong evidence of the high quality of our generated images. As $s$ increases, the generation problem becomes more challenging. Consequently, we observe a decrease in both SSIM and PSNR for the iterative and s-step generations, indicating a decline in generation quality, while the NRMSE of both approaches increases, suggesting a higher level of error in the generated images. When further comparing the iterative and s-step generations, the s-step generation dominates in this study, mainly because of the limited sample size: with fewer than 500 training subjects and observed sequences of length at most 9 per subject, direct s-step generation proves advantageous over the iterative generation method.

Figure 3: The mean SSIM, PSNR, and NRMSE over all testing subjects in the ADNI data analysis for s=1,,9s=1,\ldots,9.

In summary, this study further validates the effectiveness of our approach in generating image sequences. Incorporating the generated image sequences as data augmentation (Chen et al., 2022c, ) could further enhance the performance of downstream tasks, such as Alzheimer’s disease detection (Xia et al.,, 2022). As mentioned before, AD is the most prevalent neurodegenerative disorder, progressively leading to irreversible neuronal damage. The early diagnosis of AD and its syndromes, such as MCI, is of significant importance. We believe that the proposed image time series learning offers valuable assistance in understanding and identifying AD, as well as other aging-related brain issues.

References

  • Anthony and Bartlett, (1999) Anthony, M. and Bartlett, P. L. (1999). Neural Network Learning: Theoretical Foundations. Cambridge University Press.
  • Ashburner, (2007) Ashburner, J. (2007). A fast diffeomorphic image registration algorithm. Neuroimage, 38(1):95–113.
  • Ashburner and Friston, (2005) Ashburner, J. and Friston, K. J. (2005). Unified segmentation. Neuroimage, 26(3):839–851.
  • Bartlett et al., (2017) Bartlett, P. L., Harvey, N., Liaw, C., and Mehrabian, A. (2017). Nearly-tight vc-dimension and pseudodimension bounds for piecewise linear neural networks.
  • Basu and Michailidis, (2015) Basu, S. and Michailidis, G. (2015). Regularized estimation in sparse high-dimensional time series models.
  • Bauer and Kohler, (2019) Bauer, B. and Kohler, M. (2019). On deep learning as a remedy for the curse of dimensionality in nonparametric regression.
  • Bińkowski et al., (2018) Bińkowski, M., Sutherland, D. J., Arbel, M., and Gretton, A. (2018). Demystifying mmd gans. arXiv preprint arXiv:1801.01401.
  • Bollerslev, (1986) Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. Journal of econometrics, 31(3):307–327.
  • Borji, (2022) Borji, A. (2022). Pros and cons of gan evaluation measures: New developments. Computer Vision and Image Understanding, 215:103329.
  • Brockwell and Davis, (1991) Brockwell, P. J. and Davis, R. A. (1991). Time series: theory and methods. Springer science & business media.
  • Chang et al., (2023) Chang, J., He, J., Yang, L., and Yao, Q. (2023). Modelling matrix time series via a tensor cp-decomposition. Journal of the Royal Statistical Society Series B: Statistical Methodology, 85(1):127–148.
  • Chen et al., (2021) Chen, R., Xiao, H., and Yang, D. (2021). Autoregressive models for matrix-valued time series. Journal of Econometrics, 222(1):539–560.
  • (13) Chen, R., Yang, D., and Zhang, C.-H. (2022a). Factor models for high-dimensional tensor time series. Journal of the American Statistical Association, 117(537):94–116.
  • (14) Chen, Y., Gao, Q., and Wang, X. (2022b). Inferential wasserstein generative adversarial networks. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(1):83–113.
  • (15) Chen, Y., Yang, X.-H., Wei, Z., Heidari, A. A., Zheng, N., Li, Z., Chen, H., Hu, H., Zhou, Q., and Guan, Q. (2022c). Generative adversarial networks in medical image augmentation: a review. Computers in Biology and Medicine, 144:105382.
  • Cole et al., (2018) Cole, J. H., Ritchie, S. J., Bastin, M. E., Hernández, V., Muñoz Maniega, S., Royle, N., Corley, J., Pattie, A., Harris, S. E., Zhang, Q., et al. (2018). Brain age predicts mortality. Molecular psychiatry, 23(5):1385–1392.
  • Drayer, (1988) Drayer, B. P. (1988). Imaging of the aging brain. part i. normal findings. Radiology, 166(3):785–796.
  • Engle, (1982) Engle, R. F. (1982). Autoregressive conditional heteroscedasticity with estimates of the variance of united kingdom inflation. Econometrica: Journal of the econometric society, pages 987–1007.
  • Fan and Yao, (2003) Fan, J. and Yao, Q. (2003). Nonlinear time series: nonparametric and parametric methods, volume 20. Springer.
  • Feng and Yang, (2023) Feng, L. and Yang, G. (2023). Deep kronecker network. Biometrika, page asad049.
  • Feng et al., (2019) Feng, X., Li, T., Song, X., and Zhu, H. (2019). Bayesian scalar on image regression with nonignorable nonresponse. Journal of the American Statistical Association, pages 1–24.
  • Goodfellow et al., (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. Advances in neural information processing systems, 27.
  • Guo et al., (2016) Guo, S., Wang, Y., and Yao, Q. (2016). High-dimensional and banded vector autoregressions. Biometrika, page asw046.
  • Han et al., (2023) Han, Y., Chen, R., Zhang, C.-H., and Yao, Q. (2023). Simultaneous decorrelation of matrix time series. Journal of the American Statistical Association, pages 1–13.
  • Heusel et al., (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. (2017). Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30.
  • Ho et al., (2020) Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851.
  • Huizinga et al., (2018) Huizinga, W., Poot, D. H., Vernooij, M. W., Roshchupkin, G. V., Bron, E. E., Ikram, M. A., Rueckert, D., Niessen, W. J., Klein, S., Initiative, A. D. N., et al. (2018). A spatio-temporal reference model of the aging brain. NeuroImage, 169:11–22.
  • Jack et al., (2004) Jack, C., Shiung, M., Gunter, J., O’brien, P., Weigand, S., Knopman, D. S., Boeve, B. F., Ivnik, R. J., Smith, G. E., Cha, R., et al. (2004). Comparison of different mri brain atrophy rate measures with clinical disease progression in ad. Neurology, 62(4):591–600.
  • Jiao et al., (2021) Jiao, Y., Shen, G., Lin, Y., and Huang, J. (2021). Deep nonparametric regression on approximately low-dimensional manifolds. arXiv preprint arXiv:2104.06708.
  • Kallenberg, (2021) Kallenberg, O. (2021). Foundations of modern probability. Probability Theory and Stochastic Modelling.
  • Kang et al., (2018) Kang, J., Reich, B. J., and Staicu, A.-M. (2018). Scalar-on-image regression via the soft-thresholded gaussian process. Biometrika, 105(1):165–184.
  • Keziou, (2003) Keziou, A. (2003). Dual representation of φ\varphi-divergences and applications. Comptes rendus mathématique, 336(10):857–862.
  • Kingma and Welling, (2013) Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
  • Lai et al., (2018) Lai, G., Chang, W.-C., Yang, Y., and Liu, H. (2018). Modeling long-and short-term temporal patterns with deep neural networks. In The 41st international ACM SIGIR conference on research & development in information retrieval, pages 95–104.
  • LeCun et al., (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
  • Li et al., (2019) Li, S., Jin, X., Xuan, Y., Zhou, X., Chen, W., Wang, Y.-X., and Yan, X. (2019). Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. Advances in neural information processing systems, 32.
  • Li et al., (2018) Li, X., Xu, D., Zhou, H., and Li, L. (2018). Tucker tensor regression and neuroimaging analysis. Statistics in Biosciences, 10(3):520–545.
  • Loshchilov and Hutter, (2017) Loshchilov, I. and Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
  • Lu et al., (2020) Lu, J., Shen, Z., Yang, H., and Zhang, S. (2020). Deep network approximation for smooth functions. arxiv e-prints, page. arXiv preprint arXiv:2001.03040.
  • Luo and Zhang, (2023) Luo, Y. and Zhang, A. R. (2023). Low-rank tensor estimation via riemannian gauss-newton: Statistical optimality and second-order convergence. The Journal of Machine Learning Research, 24(1):18274–18321.
  • MacDonald and Pike, (2021) MacDonald, M. E. and Pike, G. B. (2021). Mri of healthy brain aging: A review. NMR in Biomedicine, 34(9):e4564.
  • Manjón et al., (2010) Manjón, J. V., Coupé, P., Martí-Bonmatí, L., Collins, D. L., and Robles, M. (2010). Adaptive non-local means denoising of mr images with spatially varying noise levels. Journal of Magnetic Resonance Imaging, 31(1):192–203.
  • McDonald and Shalizi, (2017) McDonald, D. J. and Shalizi, C. R. (2017). Rademacher complexity of stationary sequences.
  • Migliaccio et al., (2012) Migliaccio, R., Agosta, F., Possin, K. L., Rabinovici, G. D., Miller, B. L., and Gorno-Tempini, M. L. (2012). White matter atrophy in alzheimer’s disease variants. Alzheimer’s & Dementia, 8:S78–S87.
  • Nguyen et al., (2010) Nguyen, X., Wainwright, M. J., and Jordan, M. I. (2010). Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861.
  • Nowozin et al., (2016) Nowozin, S., Cseke, B., and Tomioka, R. (2016). f-gan: Training generative neural samplers using variational divergence minimization. Advances in neural information processing systems, 29.
  • Pascanu et al., (2013) Pascanu, R., Gulcehre, C., Cho, K., and Bengio, Y. (2013). How to construct deep recurrent neural networks. arXiv preprint arXiv:1312.6026.
  • Petersen et al., (2010) Petersen, R. C., Aisen, P. S., Beckett, L. A., Donohue, M. C., Gamst, A. C., Harvey, D. J., Jack, C. R., Jagust, W. J., Shaw, L. M., Toga, A. W., et al. (2010). Alzheimer’s disease neuroimaging initiative (adni): clinical characterization. Neurology, 74(3):201–209.
  • Ravi et al., (2019) Ravi, D., Alexander, D. C., Oxtoby, N. P., and Initiative, A. D. N. (2019). Degenerative adversarial neuroimage nets: generating images that mimic disease progression. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 164–172. Springer.
  • Salimans et al., (2016) Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. (2016). Improved techniques for training gans. Advances in neural information processing systems, 29.
  • Schmidt-Hieber, (2020) Schmidt-Hieber, J. (2020). Nonparametric regression using deep neural networks with relu activation function.
  • Shen et al., (2021) Shen, G., Jiao, Y., Lin, Y., Horowitz, J. L., and Huang, J. (2021). Deep quantile regression: Mitigating the curse of dimensionality through composition. arXiv preprint arXiv:2107.04907.
  • Shen et al., (2019) Shen, Z., Yang, H., and Zhang, S. (2019). Deep network approximation characterized by number of neurons. arXiv preprint arXiv:1906.05497.
  • Stock and Watson, (2001) Stock, J. H. and Watson, M. W. (2001). Vector autoregressions. Journal of Economic perspectives, 15(4):101–115.
  • Stone, (1982) Stone, C. J. (1982). Optimal global rates of convergence for nonparametric regression. The annals of statistics, pages 1040–1053.
  • Theis et al., (2015) Theis, L., Oord, A. v. d., and Bethge, M. (2015). A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844.
  • Tibshirani, (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288.
  • Todd et al., (2018) Todd, K. L., Brighton, T., Norton, E. S., Schick, S., Elkins, W., Pletnikova, O., Fortinsky, R. H., Troncoso, J. C., Molfese, P. J., Resnick, S. M., et al. (2018). Ventricular and periventricular anomalies in the aging and cognitively impaired brain. Frontiers in aging neuroscience, 9:445.
  • Tofts, (2005) Tofts, P. (2005). Quantitative MRI of the brain: measuring changes caused by disease. John Wiley & Sons.
  • Tsay, (2013) Tsay, R. S. (2013). Multivariate time series analysis: with R and financial applications. John Wiley & Sons.
  • Tsay and Chen, (2018) Tsay, R. S. and Chen, R. (2018). Nonlinear time series analysis, volume 891. John Wiley & Sons.
  • Walhovd et al., (2005) Walhovd, K. B., Fjell, A. M., Reinvang, I., Lundervold, A., Dale, A. M., Eilertsen, D. E., Quinn, B. T., Salat, D., Makris, N., and Fischl, B. (2005). Effects of age on volumes of cortex, white matter and subcortical structures. Neurobiology of aging, 26(9):1261–1270.
  • Wu and Feng, (2023) Wu, S. and Feng, L. (2023). Sparse kronecker product decomposition: a general framework of signal region detection in image regression. Journal of the Royal Statistical Society Series B: Statistical Methodology, 85(3):783–809.
  • Xia et al., (2022) Xia, T., Sanchez, P., Qin, C., and Tsaftaris, S. A. (2022). Adversarial counterfactual augmentation: application in alzheimer’s disease classification. Frontiers in radiology, 2:1039160.
  • Yarotsky, (2017) Yarotsky, D. (2017). Error bounds for approximations with deep relu networks. Neural Networks, 94:103–114.
  • Yoon et al., (2023) Yoon, J. S., Zhang, C., Suk, H.-I., Guo, J., and Li, X. (2023). Sadm: Sequence-aware diffusion model for longitudinal medical image generation. In International Conference on Information Processing in Medical Imaging, pages 388–400. Springer.
  • Zhang et al., (2019) Zhang, H., Goodfellow, I., Metaxas, D., and Odena, A. (2019). Self-attention generative adversarial networks. In International conference on machine learning, pages 7354–7363. PMLR.
  • Zhang et al., (2018) Zhang, Z., Liu, Q., and Wang, Y. (2018). Road extraction by deep residual u-net. IEEE Geoscience and Remote Sensing Letters, 15(5):749–753.
  • Zhou et al., (2013) Zhou, H., Li, L., and Zhu, H. (2013). Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108(502):540–552.
  • Zhou et al., (2021) Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., and Zhang, W. (2021). Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 11106–11115.
  • (71) Zhou, X., Jiao, Y., Liu, J., and Huang, J. (2023a). A deep generative approach to conditional sampling. Journal of the American Statistical Association, 118(543):1837–1848.
  • (72) Zhou, Y., Shi, C., Li, L., and Yao, Q. (2023b). Testing for the markov property in time series via deep conditional generative learning. arXiv preprint arXiv:2305.19244.

In the supplementary material, we provide proofs in the following order: Theorem 1, Theorem 2, Proposition 3, Example 1, Proposition 5, Proposition 6, Proposition 7, Theorem 8, Theorem 4, Theorem 14, Proposition 10. Moreover, we include three additional lemmas. Furthermore, we present more details on the numerical implementations.

Appendix A Proofs

A.1 Proof of Theorem 1

Proof A.1.

By Lemma 1, we can find a measurable function $g$ and a random variable $\widehat{X}_1^0$ such that $\widehat{X}_1^0=g(\eta_0,\widehat{X}_0^0)$ and $(X_0,X_1)\overset{\text{d}}{=}(\widehat{X}_0^0,\widehat{X}_1^0)$. Then, defining $\widehat{X}_t^0=g(\eta_{t-1},\widehat{X}_{t-1}^0)$ for $t\geq 2$, we use mathematical induction to prove that $(X_0,X_1,\cdots,X_s)\overset{\text{d}}{=}(\widehat{X}_0^0,\widehat{X}_1^0,\cdots,\widehat{X}_s^0)$ for $s\geq 1$.
Because the construction of $g$ ensures $(X_0,X_1)\overset{\text{d}}{=}(\widehat{X}_0^0,\widehat{X}_1^0)$, the conclusion holds automatically when $s=1$.
For s>1s>1, suppose (X0,X1,,Xs1)=d(X^00,X^10,,X^s10)(X_{0},X_{1},\cdots,X_{s-1})\overset{\text{d}}{=}(\widehat{X}_{0}^{0},\widehat{X}_{1}^{0},\cdots,\widehat{X}_{s-1}^{0}), we will prove that (X0,X1,,Xs)=d(X^00,X^10,,X^s0)(X_{0},X_{1},\cdots,X_{s})\overset{\text{d}}{=}(\widehat{X}_{0}^{0},\widehat{X}_{1}^{0},\cdots,\widehat{X}_{s}^{0}) below.
Note that $(X_0,X_1)\overset{\text{d}}{=}(\widehat{X}_0^0,\widehat{X}_1^0)$; hence $\widehat{X}_1^0|\widehat{X}_0^0=x\overset{\text{d}}{=}X_1|X_0=x$, which means $X_1|X_0=x\overset{\text{d}}{=}g(\eta_0,x)$.
Because $\eta_{s-1}$ is independent of $\widehat{X}_0^0$ and $\{\eta_i\}_{i=0}^{s-2}$, it is also independent of $\{\widehat{X}_i^0\}_{i=0}^{s-1}$, and therefore:

X^s0|X^s10=xs1,,X^00=x0=g(ηs1,xs1)\widehat{X}_{s}^{0}|\widehat{X}_{s-1}^{0}=x_{s-1},\cdots,\widehat{X}_{0}^{0}=x_{0}=g(\eta_{s-1},x_{s-1})

Because ηs1=dη0\eta_{s-1}\overset{\text{d}}{=}\eta_{0}, combining the given conditions, we have:

g(ηs1,xs1)\displaystyle g(\eta_{s-1},x_{s-1}) =dg(η0,xs1)\displaystyle\overset{\text{d}}{=}g(\eta_{0},x_{s-1})
=X^10|X^00=xs1\displaystyle=\widehat{X}_{1}^{0}|\widehat{X}_{0}^{0}=x_{s-1}
=dX1|X0=xs1\displaystyle\overset{\text{d}}{=}X_{1}|X_{0}=x_{s-1}
=dXs|Xs1=xs1\displaystyle\overset{\text{d}}{=}X_{s}|X_{s-1}=x_{s-1}
=dXs|Xs1=xs1,,X0=x0\displaystyle\overset{\text{d}}{=}X_{s}|X_{s-1}=x_{s-1},\cdots,X_{0}=x_{0}

which shows that:

X^s0|X^s10=xs1,,X^00=x0=dXs|Xs1=xs1,,X0=x0\widehat{X}_{s}^{0}|\widehat{X}_{s-1}^{0}=x_{s-1},\cdots,\widehat{X}_{0}^{0}=x_{0}\overset{\text{d}}{=}X_{s}|X_{s-1}=x_{s-1},\cdots,X_{0}=x_{0}

Combining this with $(X_0,X_1,\cdots,X_{s-1})\overset{\text{d}}{=}(\widehat{X}_0^0,\widehat{X}_1^0,\cdots,\widehat{X}_{s-1}^0)$, we therefore obtain:

(X0,X1,,Xs)=d(X^00,X^10,,X^s0)(X_{0},X_{1},\cdots,X_{s})\overset{\text{d}}{=}(\widehat{X}_{0}^{0},\widehat{X}_{1}^{0},\cdots,\widehat{X}_{s}^{0})

A.2 Proof of Theorem 2

Proof A.2.

We define the measurable function GG such that:

G(,,1)=g(,)\displaystyle G(\cdot,\cdot,1)=g(\cdot,\cdot)

For $t\geq 2$, applying Lemma 1 to $X_0$, $X_t$ and $\eta_t$, there exists a measurable function $G_t$ such that:

Xt=dGt(ηt,X0)\displaystyle X_{t}\overset{\text{d}}{=}G_{t}(\eta_{t},X_{0})

Let G(,,t)=Gt(,)G(\cdot,\cdot,t)=G_{t}(\cdot,\cdot), X~t0=G(ηt,X0,t){\widetilde{X}}_{t}^{0}=G(\eta_{t},X_{0},t), then:

X~t0=dXt\displaystyle{\widetilde{X}}_{t}^{0}\overset{\text{d}}{=}X_{t}

Therefore, such a function $G$ satisfies the condition.

A.3 Proof of Proposition 3

Proof A.3.

For G𝒢,HG\in\mathcal{G},H\in\mathcal{H} and 1sS1\leq s\leq S, define:

Gs(,):=G(,,s)\displaystyle G_{s}(\cdot,\cdot):=G(\cdot,\cdot,s)
Hs(,):=H(,,s)\displaystyle H_{s}(\cdot,\cdot):=H(\cdot,\cdot,s)

Then (17) can be rewritten as:

~(G,H)=1|Ω|(t,s)Ω[Hs(Xt,Gs(ηt,Xt))f(Hs(Xt,Xt+s))]\displaystyle\tilde{\mathcal{L}}(G,H)=\frac{1}{|\Omega|}\sum\limits_{(t,s)\in\Omega}[H_{s}(X_{t},G_{s}(\eta_{t},X_{t}))-f^{*}(H_{s}(X_{t},X_{t+s}))] (76)

For 1sS1\leq s\leq S, let:

~s(Gs,Hs)=1|Ω|t=0Ts[Hs(Xt,Gs(ηt,Xt))f(Hs(Xt,Xt+s))]\displaystyle\tilde{\mathcal{L}}_{s}(G_{s},H_{s})=\frac{1}{|\Omega|}\sum\limits_{t=0}^{T-s}[H_{s}(X_{t},G_{s}(\eta_{t},X_{t}))-f^{*}(H_{s}(X_{t},X_{t+s}))]

then,

~(G,H)=s=1S~s(Gs,Hs)\displaystyle\tilde{\mathcal{L}}(G,H)=\sum\limits_{s=1}^{S}\tilde{\mathcal{L}}_{s}(G_{s},H_{s})

According to the above equation, the optimization problem (17) can be decomposed into the following $S$ optimization problems:

(G^s,H^s)=argminGs𝒢smaxHss~s(Gs,Hs), 1sS\displaystyle(\widehat{G}_{s},\widehat{H}_{s})=\arg\mathop{\min}_{G_{s}\in\mathcal{G}_{s}}\mathop{\max}_{H_{s}\in\mathcal{H}_{s}}\tilde{\mathcal{L}}_{s}(G_{s},H_{s})\ ,\ 1\leq s\leq S (77)

In particular, let s=1s=1 and note that ~1(G1,H1)=T|Ω|^(G1,H1)\tilde{\mathcal{L}}_{1}(G_{1},H_{1})=\frac{T}{|\Omega|}\widehat{\mathcal{L}}(G_{1},H_{1}), we have:

(G^1,H^1)=(g^,h^)\displaystyle(\widehat{G}_{1},\widehat{H}_{1})=(\widehat{g},\widehat{h})

A.4 Proof of Example 1

Proof A.4.

Clearly, for $t\geq 0$, $X_t$ follows a Gaussian distribution with mean $\nu_t=0$. Let $\Sigma_t$ denote the covariance matrix of $X_t$; then the sequence of covariance matrices satisfies the following recurrence:

Σt+1=ϕ2Σtϕ2T+ϕ1ϕ1T\displaystyle\Sigma_{t+1}=\phi_{2}\Sigma_{t}\phi_{2}^{T}+\phi_{1}\phi_{1}^{T} (78)

In addition, Σt\Sigma_{t} converges to a symmetric matrix Σ\Sigma, whose explicit expression is:

Σ=t=0ϕ2tϕ1ϕ1T(ϕ2T)t\displaystyle\Sigma=\sum\limits_{t=0}^{\infty}\phi_{2}^{t}\phi_{1}\phi_{1}^{T}(\phi_{2}^{T})^{t}

Letting $t\to\infty$ in (78), we have:

Σ=ϕ2Σϕ2T+ϕ1ϕ1T\displaystyle\Sigma=\phi_{2}\Sigma\phi_{2}^{T}+\phi_{1}\phi_{1}^{T}

then,

Σt+1Σt=ϕ2(ΣtΣ)ϕ2T\displaystyle\Sigma_{t+1}-\Sigma_{t}=\phi_{2}(\Sigma_{t}-\Sigma)\phi_{2}^{T}

By iteration:

ΣtΣ=ϕ2t(Σ0Σ)(ϕ2T)t\displaystyle\Sigma_{t}-\Sigma=\phi_{2}^{t}(\Sigma_{0}-\Sigma)(\phi_{2}^{T})^{t}

Let σ=σmax(ϕ2)\sigma=\sigma_{\text{max}}(\phi_{2}). Because ϕ2\phi_{2} is symmetric, there exists an orthogonal matrix Q=(qij)p×pQ=(q_{ij})_{p\times p}, such that:

ϕ2=QUQT\displaystyle\phi_{2}=QUQ^{T}

where U=diag{u1,u2,,up}U=\hbox{\rm diag}\left\{u_{1},u_{2},\cdots,u_{p}\right\} with |ui|σ|u_{i}|\leq\sigma, 1ip1\leq i\leq p.

Denote M0=QT(Σ0Σ)QM_{0}=Q^{T}(\Sigma_{0}-\Sigma)Q, then:

\Sigma_{t}-\Sigma=QU^{t}M_{0}U^{t}Q^{T}

let M0=(mij)p×pM_{0}=(m_{ij})_{p\times p}, then:

|(\Sigma_{t}-\Sigma)_{ij}|=\Big|\sum\limits_{k,l=1}^{p}q_{ik}q_{jl}m_{kl}u_{k}^{t}u_{l}^{t}\Big|\leq K\sigma^{2t} (79)

where K=p2max{|qij|}2max{|mij|}K=p^{2}\max\left\{|q_{ij}|\right\}^{2}\cdot\max\left\{|m_{ij}|\right\} is a constant.

Let p(x)=(2π)p/2(|Σ|)1/2exp{12xTΣ1x}p(x)=(2\pi)^{-p/2}(|\Sigma|)^{-1/2}\exp\left\{-\frac{1}{2}x^{T}\Sigma^{-1}x\right\} be the density function of N(0,Σ)N(0,\Sigma), then:

|p_{X_{t}}(x)-p(x)|
=\Big|(2\pi)^{-p/2}|\Sigma_{t}|^{-1/2}\exp\big\{-\tfrac{1}{2}x^{T}\Sigma_{t}^{-1}x\big\}-(2\pi)^{-p/2}|\Sigma|^{-1/2}\exp\big\{-\tfrac{1}{2}x^{T}\Sigma^{-1}x\big\}\Big|
\leq(2\pi)^{-p/2}\big||\Sigma_{t}|^{-1/2}-|\Sigma|^{-1/2}\big|\exp\big\{-\tfrac{1}{2}x^{T}\Sigma_{t}^{-1}x\big\}
\quad+(2\pi)^{-p/2}|\Sigma|^{-1/2}\Big|\exp\big\{-\tfrac{1}{2}x^{T}\Sigma_{t}^{-1}x\big\}-\exp\big\{-\tfrac{1}{2}x^{T}\Sigma^{-1}x\big\}\Big|

From (79), there exists a constant K1K_{1}, such that:

||Σt||Σ||=||Σ+(ΣtΣ)||Σ||K1σ2t\displaystyle||\Sigma_{t}|-|\Sigma||=||\Sigma+(\Sigma_{t}-\Sigma)|-|\Sigma||\leq K_{1}\sigma^{2t}

then:

(2π)p/2||Σt|1/2|Σ|1/2|exp{12xTΣt1x}𝑑x\displaystyle\int(2\pi)^{-p/2}||\Sigma_{t}|^{-1/2}-|\Sigma|^{-1/2}|\exp\left\{-\frac{1}{2}x^{T}\Sigma_{t}^{-1}x\right\}dx
=\displaystyle= ||Σt|1/2|Σ|1/2||Σt|1/2\displaystyle||\Sigma_{t}|^{-1/2}-|\Sigma|^{-1/2}|\cdot|\Sigma_{t}|^{1/2}
=\displaystyle= ||Σt|1/2|Σ|1/2||Σ|1/2\displaystyle||\Sigma_{t}|^{1/2}-|\Sigma|^{1/2}|\cdot|\Sigma|^{-1/2}
\displaystyle\leq ||Σt|1/2|Σ|1/2|||Σt|1/2+|Σ|1/2||Σ|1\displaystyle||\Sigma_{t}|^{1/2}-|\Sigma|^{1/2}|\cdot||\Sigma_{t}|^{1/2}+|\Sigma|^{1/2}|\cdot|\Sigma|^{-1}
=\displaystyle= ||Σt||Σ|||Σ|1\displaystyle||\Sigma_{t}|-|\Sigma||\cdot|\Sigma|^{-1}
\displaystyle\leq K1σ2t|Σ|1\displaystyle K_{1}\sigma^{2t}|\Sigma|^{-1}

For xpx\in\mathbb{R}^{p}, if xT(Σt1Σ1)x0x^{T}(\Sigma_{t}^{-1}-\Sigma^{-1})x\geq 0, then:

|exp(12xTΣt1x)exp{12xTΣ1x}|\displaystyle|\exp(-\frac{1}{2}x^{T}\Sigma_{t}^{-1}x)-\exp\left\{-\frac{1}{2}x^{T}\Sigma^{-1}x\right\}|
=\displaystyle= exp{12xTΣ1x}|1exp{12xT(Σt1Σ1)x}|\displaystyle\exp\left\{-\frac{1}{2}x^{T}\Sigma^{-1}x\right\}\cdot|1-\exp\left\{-\frac{1}{2}x^{T}(\Sigma_{t}^{-1}-\Sigma^{-1})x\right\}|
\displaystyle\leq 12xT(Σt1Σ1)xexp{12xTΣ1x}\displaystyle\frac{1}{2}x^{T}(\Sigma_{t}^{-1}-\Sigma^{-1})x\cdot\exp\left\{-\frac{1}{2}x^{T}\Sigma^{-1}x\right\}

if xT(Σt1Σ1)x<0x^{T}(\Sigma_{t}^{-1}-\Sigma^{-1})x<0:

|exp{12xTΣt1x}exp{12xTΣ1x}|\displaystyle|\exp\left\{-\frac{1}{2}x^{T}\Sigma_{t}^{-1}x\right\}-\exp\left\{-\frac{1}{2}x^{T}\Sigma^{-1}x\right\}|
=\displaystyle= exp{12xTΣt1x}|1exp{12xT(Σ1Σt1)x}|\displaystyle\exp\left\{-\frac{1}{2}x^{T}\Sigma_{t}^{-1}x\right\}\cdot|1-\exp\left\{-\frac{1}{2}x^{T}(\Sigma^{-1}-\Sigma_{t}^{-1})x\right\}|
\displaystyle\leq 12xT(Σ1Σt1)xexp{12xTΣt1x}\displaystyle\frac{1}{2}x^{T}(\Sigma^{-1}-\Sigma_{t}^{-1})x\cdot\exp\left\{-\frac{1}{2}x^{T}\Sigma_{t}^{-1}x\right\}

therefore,

|exp{12xTΣt1x}exp{12xTΣ1x}|\displaystyle|\exp\left\{-\frac{1}{2}x^{T}\Sigma_{t}^{-1}x\right\}-\exp\left\{-\frac{1}{2}x^{T}\Sigma^{-1}x\right\}| \displaystyle\leq 12|xT(Σt1Σ1)x|(exp{12xTΣ1x}\displaystyle\frac{1}{2}|x^{T}(\Sigma_{t}^{-1}-\Sigma^{-1})x|\cdot(\exp\left\{-\frac{1}{2}x^{T}\Sigma^{-1}x\right\}
+exp{12xTΣt1x})\displaystyle\ \ +\exp\left\{-\frac{1}{2}x^{T}\Sigma_{t}^{-1}x\right\})

Since Σt1Σ1=Σt1(ΣΣt)Σ1\Sigma_{t}^{-1}-\Sigma^{-1}=\Sigma_{t}^{-1}(\Sigma-\Sigma_{t})\Sigma^{-1}, by (79), there exists a constant K2K_{2} such that:

|(\Sigma_{t}^{-1}-\Sigma^{-1})_{ij}|\leq K_{2}\sigma^{2t}

hence, for xpx\in\mathbb{R}^{p},

|xT(Σt1Σ1)x|\displaystyle|x^{T}(\Sigma_{t}^{-1}-\Sigma^{-1})x| σmax(Σt1Σ1)xTx\displaystyle\leq\sigma_{\text{max}}(\Sigma_{t}^{-1}-\Sigma^{-1})\cdot x^{T}x
=sup|z|=1|(Σt1Σ1)z|2xTx\displaystyle=\mathop{\sup}_{|z|=1}|(\Sigma_{t}^{-1}-\Sigma^{-1})z|_{2}\cdot x^{T}x
pK2σ2txTx\displaystyle\leq\sqrt{p}K_{2}\sigma^{2t}x^{T}x

therefore:

(2π)p/2|Σ|1/2|exp{12xTΣt1x}exp{12xTΣ1x}|𝑑x\displaystyle\int(2\pi)^{-p/2}|\Sigma|^{-1/2}|\exp\left\{-\frac{1}{2}x^{T}\Sigma_{t}^{-1}x\right\}-\exp\left\{-\frac{1}{2}x^{T}\Sigma^{-1}x\right\}|dx
\displaystyle\leq (2π)p/2|Σ|1/2p2K2σ2txTx(exp{12xTΣ1x}+exp{12xTΣt1x})𝑑x\displaystyle\int(2\pi)^{-p/2}|\Sigma|^{-1/2}\frac{\sqrt{p}}{2}K_{2}\sigma^{2t}x^{T}x(\exp\left\{-\frac{1}{2}x^{T}\Sigma^{-1}x\right\}+\exp\left\{-\frac{1}{2}x^{T}\Sigma_{t}^{-1}x\right\})dx
=\displaystyle= (2π)p/2|Σ|1/2p2K2σ2txTxexp{12xTΣ1x}𝑑x\displaystyle\int(2\pi)^{-p/2}|\Sigma|^{-1/2}\frac{\sqrt{p}}{2}K_{2}\sigma^{2t}x^{T}x\exp\left\{-\frac{1}{2}x^{T}\Sigma^{-1}x\right\}dx
+(2π)p/2|Σ|1/2p2K2σ2txTxexp{12xTΣt1x}𝑑x\displaystyle+\int(2\pi)^{-p/2}|\Sigma|^{-1/2}\frac{\sqrt{p}}{2}K_{2}\sigma^{2t}x^{T}x\exp\left\{-\frac{1}{2}x^{T}\Sigma_{t}^{-1}x\right\}dx
=\displaystyle= (2π)p/2|Σ|1/2p2K2σ2tzTΣzexp{12zTz}|Σ|1/2𝑑z\displaystyle\int(2\pi)^{-p/2}|\Sigma|^{-1/2}\frac{\sqrt{p}}{2}K_{2}\sigma^{2t}z^{T}\Sigma z\exp\left\{-\frac{1}{2}z^{T}z\right\}|\Sigma|^{1/2}dz
+(2π)p/2|Σ|1/2p2K2σ2tzTΣtzexp{12zTz}|Σt|1/2𝑑z\displaystyle+\int(2\pi)^{-p/2}|\Sigma|^{-1/2}\frac{\sqrt{p}}{2}K_{2}\sigma^{2t}z^{T}\Sigma_{t}z\exp\left\{-\frac{1}{2}z^{T}z\right\}|\Sigma_{t}|^{1/2}dz
\displaystyle\leq (2π)p/2|Σ|1/2p2K2σ2t(|Σ|1/2σmax(Σ)+|Σt|1/2σmax(Σt))zTzexp{12zTz}𝑑z\displaystyle\int(2\pi)^{-p/2}|\Sigma|^{-1/2}\frac{\sqrt{p}}{2}K_{2}\sigma^{2t}(|\Sigma|^{1/2}\sigma_{\text{max}}(\Sigma)+|\Sigma_{t}|^{1/2}\sigma_{\text{max}}(\Sigma_{t}))z^{T}z\exp\left\{-\frac{1}{2}z^{T}z\right\}dz
=\displaystyle= p|Σ|1/2p2K2σ2t(|Σ|1/2σmax(Σ)+|Σt|1/2σmax(Σt))\displaystyle p\cdot|\Sigma|^{-1/2}\frac{\sqrt{p}}{2}K_{2}\sigma^{2t}(|\Sigma|^{1/2}\sigma_{\text{max}}(\Sigma)+|\Sigma_{t}|^{1/2}\sigma_{\text{max}}(\Sigma_{t}))
\displaystyle\leq K3σ2t\displaystyle K_{3}\sigma^{2t}

where K3K_{3} is a constant.

Combining the above results, we have:

|pXt(x)p(x)|𝑑xK1σ2t|Σ|1+K3σ2t=𝒪(σ2t)\displaystyle\int|p_{X_{t}}(x)-p(x)|dx\leq K_{1}\sigma^{2t}|\Sigma|^{-1}+K_{3}\sigma^{2t}=\mathcal{O}(\sigma^{2t})
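
As a quick numerical sanity check (ours, not part of the formal argument), the rate above can be observed in a one-dimensional special case. The sketch below assumes a scalar AR(1) model X_{t}=\sigma X_{t-1}+\varepsilon_{t} with \varepsilon_{t}\sim N(0,1) and X_{0}=0, so that \Sigma_{t}=(1-\sigma^{2t})/(1-\sigma^{2}) and \Sigma=1/(1-\sigma^{2}); it computes the L_{1} distance between the time-t marginal density and the stationary density on a grid and compares it with \sigma^{2t}.

```python
import numpy as np

# One-dimensional sanity check of the O(sigma^{2t}) rate (illustration only).
# Assumed model: X_t = sigma * X_{t-1} + eps_t, eps_t ~ N(0, 1), X_0 = 0, so that
# Var(X_t) = (1 - sigma^(2t)) / (1 - sigma^2) and the stationary variance is 1 / (1 - sigma^2).
sigma = 0.7
var_inf = 1.0 / (1.0 - sigma**2)
x = np.linspace(-10.0, 10.0, 200001)      # integration grid
dx = x[1] - x[0]

def normal_pdf(x, var):
    return np.exp(-0.5 * x**2 / var) / np.sqrt(2.0 * np.pi * var)

for t in [2, 4, 6, 8]:
    var_t = (1.0 - sigma**(2 * t)) / (1.0 - sigma**2)
    l1 = np.sum(np.abs(normal_pdf(x, var_t) - normal_pdf(x, var_inf))) * dx
    print(f"t={t}:  L1 distance = {l1:.3e},  sigma^(2t) = {sigma**(2 * t):.3e}")
```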

A.5 Proof of Proposition 5

Proof A.5.

We prove by induction that the following inequality holds for 1\leq s\leq S:

pX^T,,X^T+spXT,,XT+sL1r=0s1pT+r(x)p(|x)q(|x)L1dx\displaystyle\|p_{\widehat{X}_{T},\cdots,\widehat{X}_{T+s}}-p_{X_{T},\cdots,X_{T+s}}\|_{L_{1}}\leq\int\sum\limits_{r=0}^{s-1}p_{T+r}(x)\cdot\|p(\cdot|x)-q(\cdot|x)\|_{L_{1}}dx (80)

For s=1s=1, note that:

pX^T,X^T+1(x,y)pXT,XT+1(x,y)=pT(x)[q(y|x)p(y|x)]\displaystyle p_{\widehat{X}_{T},\widehat{X}_{T+1}}(x,y)-p_{X_{T},X_{T+1}}(x,y)=p_{T}(x)[q(y|x)-p(y|x)]

then:

pX^T,X^T+1pXT,XT+1L1\displaystyle\|p_{\widehat{X}_{T},\widehat{X}_{T+1}}-p_{X_{T},X_{T+1}}\|_{L_{1}} =\displaystyle= |pX^T,X^T+1(x,y)pXT,XT+1(x,y)|𝑑x𝑑y\displaystyle\int|p_{\widehat{X}_{T},\widehat{X}_{T+1}}(x,y)-p_{X_{T},X_{T+1}}(x,y)|dxdy
=\displaystyle= pT(x)p(|x)q(|x)L1dx\displaystyle\int p_{T}(x)\|p(\cdot|x)-q(\cdot|x)\|_{L_{1}}dx

Therefore, inequality (80) holds for s=1.

For 1<k\leq S, assume that (80) holds for s=k-1; we now consider the case s=k. Note that for 1\leq s\leq S:

pX^T,,X^T+s(x0,,xs)=pT(x0)i=1sq(xi|xi1)\displaystyle p_{\widehat{X}_{T},\cdots,\widehat{X}_{T+s}}(x_{0},\cdots,x_{s})=p_{T}(x_{0})\cdot\prod\limits_{i=1}^{s}q(x_{i}|x_{i-1})
pXT,,XT+s(x0,,xs)=pT(x0)i=1sp(xi|xi1)\displaystyle p_{X_{T},\cdots,X_{T+s}}(x_{0},\cdots,x_{s})=p_{T}(x_{0})\cdot\prod\limits_{i=1}^{s}p(x_{i}|x_{i-1})

then:

pX^T,,X^T+kpXT,,XT+kL1\displaystyle\|p_{\widehat{X}_{T},\cdots,\widehat{X}_{T+k}}-p_{X_{T},\cdots,X_{T+k}}\|_{L_{1}}
=\displaystyle= pT(x0)|i=1kq(xi|xi1)i=1kp(xi|xi1)|dx0:k\displaystyle\int p_{T}(x_{0})|\prod\limits_{i=1}^{k}q(x_{i}|x_{i-1})-\prod\limits_{i=1}^{k}p(x_{i}|x_{i-1})|dx_{0:k}
\displaystyle\leq pT(x0)|i=1k1q(xi|xi1)i=1k1p(xi|xi1)|q(xk|xk1)dx0:k\displaystyle\int p_{T}(x_{0})\cdot|\prod\limits_{i=1}^{k-1}q(x_{i}|x_{i-1})-\prod\limits_{i=1}^{k-1}p(x_{i}|x_{i-1})|\cdot q(x_{k}|x_{k-1})dx_{0:k}
+pT(x0)i=1k1p(xi|xi1)|q(xk|xk1)p(xk|xk1)|dx0:k\displaystyle\ +\int p_{T}(x_{0})\cdot\prod\limits_{i=1}^{k-1}p(x_{i}|x_{i-1})\cdot|q(x_{k}|x_{k-1})-p(x_{k}|x_{k-1})|dx_{0:k}
=\displaystyle= pT(x0)|i=1k1q(xi|xi1)i=1k1p(xi|xi1)|dx0:k1\displaystyle\int p_{T}(x_{0})\cdot|\prod\limits_{i=1}^{k-1}q(x_{i}|x_{i-1})-\prod\limits_{i=1}^{k-1}p(x_{i}|x_{i-1})|dx_{0:k-1}
+pXT,,XT+k1(x0,,xk1)|q(xk|xk1)p(xk|xk1)|dx0:k\displaystyle\ +\int p_{X_{T},\cdots,X_{T+k-1}}(x_{0},\cdots,x_{k-1})\cdot|q(x_{k}|x_{k-1})-p(x_{k}|x_{k-1})|dx_{0:k}
=\displaystyle= pX^T,,X^T+k1pXT,,XT+k1L1\displaystyle\|p_{\widehat{X}_{T},\cdots,\widehat{X}_{T+k-1}}-p_{X_{T},\cdots,X_{T+k-1}}\|_{L_{1}}
+pT+k1(xk1)|q(xk|xk1)p(xk|xk1)|dxkdxk1\displaystyle\ +\int p_{T+k-1}(x_{k-1})\cdot|q(x_{k}|x_{k-1})-p(x_{k}|x_{k-1})|dx_{k}dx_{k-1}
\displaystyle=\|p_{\widehat{X}_{T},\cdots,\widehat{X}_{T+k-1}}-p_{X_{T},\cdots,X_{T+k-1}}\|_{L_{1}}
+pT+k1(xk1)p(|xk1)q(|xk1)L1dxk1\displaystyle\ +\int p_{T+k-1}(x_{k-1})\cdot\|p(\cdot|x_{k-1})-q(\cdot|x_{k-1})\|_{L_{1}}dx_{k-1}

By the induction hypothesis,

pX^T,,X^T+k1pXT,,XT+k1L1r=0k2pT+r(x)p(|x)q(|x)L1dx\displaystyle\|p_{\widehat{X}_{T},\cdots,\widehat{X}_{T+k-1}}-p_{X_{T},\cdots,X_{T+k-1}}\|_{L_{1}}\leq\int\sum\limits_{r=0}^{k-2}p_{T+r}(x)\cdot\|p(\cdot|x)-q(\cdot|x)\|_{L_{1}}dx

hence, we have:

pX^T,,X^T+kpXT,,XT+kL1\displaystyle\|p_{\widehat{X}_{T},\cdots,\widehat{X}_{T+k}}-p_{X_{T},\cdots,X_{T+k}}\|_{L_{1}}
\displaystyle\leq pX^T,,X^T+k1pXT,,XT+k1L1\displaystyle\|p_{\widehat{X}_{T},\cdots,\widehat{X}_{T+k-1}}-p_{X_{T},\cdots,X_{T+k-1}}\|_{L_{1}}
+pT+k1(xk1)p(|xk1)q(|xk1)L1dxk1\displaystyle\ \ \ +\int p_{T+k-1}(x_{k-1})\cdot\|p(\cdot|x_{k-1})-q(\cdot|x_{k-1})\|_{L_{1}}dx_{k-1}
\displaystyle\leq\int\sum\limits_{r=0}^{k-2}p_{T+r}(x)\cdot\|p(\cdot|x)-q(\cdot|x)\|_{L_{1}}dx
+pT+k1(xk1)p(|xk1)q(|xk1)L1dxk1\displaystyle\ \ \ +\int p_{T+k-1}(x_{k-1})\cdot\|p(\cdot|x_{k-1})-q(\cdot|x_{k-1})\|_{L_{1}}dx_{k-1}
\displaystyle=\int\sum\limits_{r=0}^{k-1}p_{T+r}(x)\cdot\|p(\cdot|x)-q(\cdot|x)\|_{L_{1}}dx

Hence (80) holds for s=k, completing the induction.

Therefore, the inequality (80) holds for all 1sS1\leq s\leq S. In particular, taking s=Ss=S, by Assumption 1, we have:

pX^T,,X^T+SpXT,,XT+SL1\displaystyle\|p_{\widehat{X}_{T},\cdots,\widehat{X}_{T+S}}-p_{X_{T},\cdots,X_{T+S}}\|_{L_{1}}
\displaystyle\leq r=0S1pT+r(x)p(|x)q(|x)L1dx\displaystyle\int\sum\limits_{r=0}^{S-1}p_{T+r}(x)\cdot\|p(\cdot|x)-q(\cdot|x)\|_{L_{1}}dx
\displaystyle\leq SpT(x)p(|x)q(|x)L1dx+r=0S1|pT+r(x)pT(x)|p(|x)q(|x)L1dx\displaystyle S\int p_{T}(x)\cdot\|p(\cdot|x)-q(\cdot|x)\|_{L_{1}}dx+\int\sum\limits_{r=0}^{S-1}|p_{T+r}(x)-p_{T}(x)|\cdot\|p(\cdot|x)-q(\cdot|x)\|_{L_{1}}dx
\displaystyle\leq SpT(x)p(|x)q(|x)L1dx+2r=0S1pT+rpTL1\displaystyle S\int p_{T}(x)\cdot\|p(\cdot|x)-q(\cdot|x)\|_{L_{1}}dx+2\sum\limits_{r=0}^{S-1}\|p_{T+r}-p_{T}\|_{L_{1}}
\displaystyle\leq SpT(x)p(|x)q(|x)L1dx+𝒪(Tα)\displaystyle S\int p_{T}(x)\cdot\|p(\cdot|x)-q(\cdot|x)\|_{L_{1}}dx+\mathcal{O}(T^{-\alpha})
=\displaystyle= SpX^T,X^T+1pXT,XT+1L1+𝒪(Tα)\displaystyle S\|p_{\widehat{X}_{T},\widehat{X}_{T+1}}-p_{X_{T},X_{T+1}}\|_{L_{1}}+\mathcal{O}(T^{-\alpha})

Therefore,

𝔼(Xt,ηt)t=0TpX^T,,X^T+SpXT,,XT+SL12\displaystyle\mathbb{E}_{(X_{t},\eta_{t})_{t=0}^{T}}\left\|p_{\widehat{X}_{T},\cdots,\widehat{X}_{T+S}}-p_{X_{T},\cdots,X_{T+S}}\right\|_{L_{1}}^{2}
\displaystyle\leq 𝔼(Xt,ηt)t=0T[2S2pX^T,X^T+1pXT,XT+1L12+𝒪(T2α)]\displaystyle\mathbb{E}_{(X_{t},\eta_{t})_{t=0}^{T}}\big{[}2S^{2}\|p_{\widehat{X}_{T},\widehat{X}_{T+1}}-p_{X_{T},X_{T+1}}\|_{L_{1}}^{2}+\mathcal{O}(T^{-2\alpha})\big{]}
=\displaystyle= 2S2𝔼(Xt,ηt)t=0TpX^T,X^T+1pXT,XT+1L12+𝒪(T2α)\displaystyle 2S^{2}\mathbb{E}_{(X_{t},\eta_{t})_{t=0}^{T}}\|p_{\widehat{X}_{T},\widehat{X}_{T+1}}-p_{X_{T},X_{T+1}}\|_{L_{1}}^{2}+\mathcal{O}(T^{-2\alpha})
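
The telescoping bound (80) also admits an exact finite-state analogue that is easy to verify numerically. The sketch below is our own illustration (not part of the proof): it replaces the densities with a hypothetical 3-state Markov chain, takes P as the true transition matrix and Q as a perturbed generator transition, and compares the L_{1} distance between the (s+1)-dimensional joint laws with the right-hand side of (80).

```python
import numpy as np
from itertools import product

# Finite-state analogue of inequality (80), checked exactly (illustration only).
# P: "true" transition matrix; Q: perturbed "generator" transition (both hypothetical).
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])
Q = np.array([[0.65, 0.25, 0.10],
              [0.15, 0.75, 0.10],
              [0.30, 0.25, 0.45]])
pT = np.array([0.5, 0.3, 0.2])     # shared marginal of X_T
S = 4

def joint(trans, init, s):
    """Exact joint pmf of (X_T, ..., X_{T+s}) as a dict over state paths."""
    out = {}
    for path in product(range(3), repeat=s + 1):
        prob = init[path[0]]
        for a, b in zip(path[:-1], path[1:]):
            prob *= trans[a, b]
        out[path] = prob
    return out

for s in range(1, S + 1):
    jp, jq = joint(P, pT, s), joint(Q, pT, s)
    lhs = sum(abs(jq[k] - jp[k]) for k in jp)   # L1 distance between the joint laws
    rhs, marg = 0.0, pT.copy()
    for r in range(s):                          # sum_{r=0}^{s-1} E_{p_{T+r}} ||P(x,.) - Q(x,.)||_1
        rhs += np.sum(marg[:, None] * np.abs(P - Q))
        marg = marg @ P                         # propagate the true marginal
    print(f"s={s}:  LHS = {lhs:.4f}  <=  RHS = {rhs:.4f}")
```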

A.6 Proof of Proposition 6

Proof A.6.

Since G\in\mathcal{G} and H\in\mathcal{H} are bounded functions, there exists a constant M\geq 0 such that:

|H(Xt,G(ηt,Xt,s),s)f(H(Xt,Xt+s,s))|M,t0, 1sS\displaystyle|H(X_{t},G(\eta_{t},X_{t},s),s)-f^{*}(H(X_{t},X_{t+s},s))|\leq M,\ \ t\geq 0,\ \ 1\leq s\leq S

By Lemma 16, for 1sS1\leq s\leq S and t1,t2>0t_{1},t_{2}>0:

pXt1,Xt1+spXt2,Xt2+sL1𝒪(min{t1,t2}α)\displaystyle\|p_{X_{t_{1}},X_{t_{1}+s}}-p_{X_{t_{2}},X_{t_{2}+s}}\|_{L_{1}}\leq\mathcal{O}(\mathop{\min}\left\{t_{1},t_{2}\right\}^{-\alpha})

Then, for t1,t2>0t_{1},t_{2}>0:

|𝔼[H(Xt1,G(ηt1,Xt1,s),s)f(H(Xt1,Xt1+s,s))]\displaystyle|\mathbb{E}[H(X_{t_{1}},G(\eta_{t_{1}},X_{t_{1}},s),s)-f^{*}(H(X_{t_{1}},X_{t_{1}+s},s))]
𝔼[H(Xt2,G(ηt2,Xt2,s),s)f(H(Xt2,Xt2+s,s))]|\displaystyle\ \ -\mathbb{E}[H(X_{t_{2}},G(\eta_{t_{2}},X_{t_{2}},s),s)-f^{*}(H(X_{t_{2}},X_{t_{2}+s},s))]|
=\displaystyle= |𝔼ηN(0,Im)(pXt1,Xt1+s(x,y)pXt2,Xt2+s(x,y))[H(x,G(η,x,s),s)f(H(x,y,s))]𝑑x𝑑y|\displaystyle|\mathbb{E}_{\eta\sim N(0,I_{m})}\int(p_{X_{t_{1}},X_{t_{1}+s}}(x,y)-p_{X_{t_{2}},X_{t_{2}+s}}(x,y))[H(x,G(\eta,x,s),s)-f^{*}(H(x,y,s))]dxdy|
\displaystyle\leq MpXt1,Xt1+spXt2,Xt2+sL1\displaystyle M\cdot\|p_{X_{t_{1}},X_{t_{1}+s}}-p_{X_{t_{2}},X_{t_{2}+s}}\|_{L_{1}}
\displaystyle\leq M𝒪(min{t1,t2}α)\displaystyle M\mathcal{O}(\mathop{\min}\left\{t_{1},t_{2}\right\}^{-\alpha})

Therefore, letting T_{0} be an arbitrary integer with 1\leq T_{0}\leq T-S, we have:

supG𝒢,H|ds(G,H)|\displaystyle\mathop{\sup}_{G\in\mathcal{G},H\in\mathcal{H}}|d_{s}(G,H)|
=\displaystyle= supG𝒢,H|1Ts+1t=0Ts𝔼[H(Xt,G(ηt,Xt,s),s)f(H(Xt,Xt+s,s))]\displaystyle\mathop{\sup}_{G\in\mathcal{G},H\in\mathcal{H}}|\frac{1}{T-s+1}\sum\limits_{t=0}^{T-s}\mathbb{E}[H(X_{t},G(\eta_{t},X_{t},s),s)-f^{*}(H(X_{t},X_{t+s},s))]
𝔼[H(XT,G(ηT,XT,s),s)f(H(XT,XT+s,s))]|\displaystyle\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ -\mathbb{E}[H(X_{T},G(\eta_{T},X_{T},s),s)-f^{*}(H(X_{T},X_{T+s},s))]|
\displaystyle\leq supG𝒢,H|1Ts+1t=0T01(𝔼[H(Xt,G(ηt,Xt,s),s)f(H(Xt,Xt+s,s))]\displaystyle\mathop{\sup}_{G\in\mathcal{G},H\in\mathcal{H}}|\frac{1}{T-s+1}\sum\limits_{t=0}^{T_{0}-1}(\mathbb{E}[H(X_{t},G(\eta_{t},X_{t},s),s)-f^{*}(H(X_{t},X_{t+s},s))]
𝔼[H(XT,G(ηT,XT,s),s)f(H(XT,XT+s,s))])|\displaystyle\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ -\mathbb{E}[H(X_{T},G(\eta_{T},X_{T},s),s)-f^{*}(H(X_{T},X_{T+s},s))])|
+supG𝒢,H|1Ts+1t=T0Ts(𝔼[H(Xt,G(ηt,Xt,s),s)f(H(Xt,Xt+s,s))]\displaystyle+\mathop{\sup}_{G\in\mathcal{G},H\in\mathcal{H}}|\frac{1}{T-s+1}\sum\limits_{t=T_{0}}^{T-s}(\mathbb{E}[H(X_{t},G(\eta_{t},X_{t},s),s)-f^{*}(H(X_{t},X_{t+s},s))]
𝔼[H(XT,G(ηT,XT,s),s)f(H(XT,XT+s,s))])|\displaystyle\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ -\mathbb{E}[H(X_{T},G(\eta_{T},X_{T},s),s)-f^{*}(H(X_{T},X_{T+s},s))])|
\displaystyle\leq 2MT0Ts+1+M(TsT0+1)Ts+1𝒪(T0α)\displaystyle\frac{2MT_{0}}{T-s+1}+\frac{M(T-s-T_{0}+1)}{T-s+1}\mathcal{O}(T_{0}^{-\alpha})
\displaystyle\leq 2MT0Ts+1+M𝒪(T0α)\displaystyle\frac{2MT_{0}}{T-s+1}+M\mathcal{O}(T_{0}^{-\alpha})

Taking T_{0}=\lceil T^{\frac{1}{\alpha+1}}\rceil to balance the two terms, we obtain:

supG𝒢,H|ds(G,H)|2M(T1α+1+1)Ts+1+M𝒪(Tαα+1)=𝒪(Tαα+1)\displaystyle\mathop{\sup}_{G\in\mathcal{G},H\in\mathcal{H}}|d_{s}(G,H)|\leq\frac{2M(T^{\frac{1}{\alpha+1}}+1)}{T-s+1}+M\mathcal{O}(T^{-\frac{\alpha}{\alpha+1}})=\mathcal{O}(T^{-\frac{\alpha}{\alpha+1}})

A.7 Proof of Proposition 7

Proof A.7.

Proposition 7 directly follows from Theorem 8 in McDonald and Shalizi, (2017).

A.8 Proof of Theorem 8

Proof A.8.

By Lemma 17:

𝔼(Xt,ηt)t=0T1Ss=1SpX~T,X~T+spXT,XT+sL12\displaystyle\mathbb{E}_{(X_{t},\eta_{t})_{t=0}^{T}}\frac{1}{S}\sum\limits_{s=1}^{S}\|p_{{\widetilde{X}}_{T},{\widetilde{X}}_{T+s}}-p_{X_{T},X_{T+s}}\|_{L_{1}}^{2}
\displaystyle\leq 𝔼(Xt,ηt)t=0T1Ss=1S2aDf(pX~T,G^(ηT,X~T,s)pXT,XT+s)\displaystyle\mathbb{E}_{(X_{t},\eta_{t})_{t=0}^{T}}\frac{1}{S}\sum\limits_{s=1}^{S}\frac{2}{a}D_{f}(p_{{\widetilde{X}}_{T},\widehat{G}(\eta_{T},{\widetilde{X}}_{T},s)}\|p_{X_{T},X_{T+s}})
=\displaystyle= 2a𝔼(Xt,ηt)t=0T𝕃˙T(G^)\displaystyle\frac{2}{a}\mathbb{E}_{(X_{t},\eta_{t})_{t=0}^{T}}\dot{\mathbb{L}}_{T}(\widehat{G})

After some calculations, we can decompose 𝔼(Xt,ηt)t=0T𝕃˙T(G^)\mathbb{E}_{(X_{t},\eta_{t})_{t=0}^{T}}\dot{\mathbb{L}}_{T}(\widehat{G}) as:

𝔼(Xt,ηt)t=0T𝕃˙T(G^)Δ~3+Δ~4+Δ5\displaystyle\mathbb{E}_{(X_{t},\eta_{t})_{t=0}^{T}}\dot{\mathbb{L}}_{T}(\widehat{G})\leq\tilde{\Delta}_{3}+\tilde{\Delta}_{4}+\Delta_{5}

where

Δ~3=𝔼(Xt,ηt)t=0TsupH˙T(G^,H)supH˙T(G^,H)\displaystyle\tilde{\Delta}_{3}=\mathbb{E}_{(X_{t},\eta_{t})_{t=0}^{T}}\mathop{\sup}_{H}\dot{\mathcal{L}}_{T}(\widehat{G},H)-\mathop{\sup}_{H\in\mathcal{H}}\dot{\mathcal{L}}_{T}(\widehat{G},H)
Δ~4=infG¯𝒢𝕃˙T(G¯)\displaystyle\tilde{\Delta}_{4}=\mathop{\inf}_{\bar{G}\in\mathcal{G}}\dot{\mathbb{L}}_{T}(\bar{G})
Δ5=2𝔼(Xt,ηt)t=0TsupG𝒢,H|˙T(G,H)~(G,H)|\displaystyle\Delta_{5}=2\mathbb{E}_{(X_{t},\eta_{t})_{t=0}^{T}}\mathop{\sup}_{G\in\mathcal{G},H\in\mathcal{H}}|\dot{\mathcal{L}}_{T}(G,H)-\widetilde{\mathcal{L}}(G,H)|

Then we only need to prove that:

Δ5=2𝔼(Xt,ηt)t=0TsupG𝒢,H|˙T(G,H)~(G,H)|Δ~1+Δ~2\displaystyle\Delta_{5}=2\mathbb{E}_{(X_{t},\eta_{t})_{t=0}^{T}}\mathop{\sup}_{G\in\mathcal{G},H\in\mathcal{H}}|\dot{\mathcal{L}}_{T}(G,H)-\tilde{\mathcal{L}}(G,H)|\leq\tilde{\Delta}_{1}+\widetilde{\Delta}_{2}

where

Δ~1=𝒪(Tαα+1)\displaystyle\tilde{\Delta}_{1}=\mathcal{O}(T^{-\frac{\alpha}{\alpha+1}})
Δ~2=𝒪(Pdim𝒢log(TB𝒢)T+Pdimlog(TB)T)\displaystyle\tilde{\Delta}_{2}=\mathcal{O}\left(\sqrt{\frac{\text{Pdim}_{\mathcal{G}}\log(T\text{B}_{\mathcal{G}})}{T}}+\sqrt{\frac{\text{Pdim}_{\mathcal{H}}\log(T\text{B}_{\mathcal{H}})}{T}}\right)

Let:

\displaystyle\dot{\mathcal{L}}_{T,s}(G,H)=\mathbb{E}_{X_{T},\eta_{T}}H(X_{T},G(\eta_{T},X_{T},s),s)-\mathbb{E}_{X_{T},X_{T+s}}f^{*}(H(X_{T},X_{T+s},s))
~s(G,H)=1|Ω|t=0Ts[H(Xt,G(ηt,Xt,s),s)f(H(Xt,Xt+s,s))]\displaystyle\widetilde{\mathcal{L}}_{s}(G,H)=\frac{1}{|\Omega|}\sum\limits_{t=0}^{T-s}[H(X_{t},G(\eta_{t},X_{t},s),s)-f^{*}(H(X_{t},X_{t+s},s))]
bts(G,H)=H(Xt,G(ηt,Xt,s),s)f(H(Xt,Xt+s,s))\displaystyle b_{t}^{s}(G,H)=H(X_{t},G(\eta_{t},X_{t},s),s)-f^{*}(H(X_{t},X_{t+s},s))
𝔼[H(Xt,G(ηt,Xt,s),s)f(H(Xt,Xt+s,s))]\displaystyle\ \ \ \ \ \ \ \ \ \ \ \ \ \ -\mathbb{E}[H(X_{t},G(\eta_{t},X_{t},s),s)-f^{*}(H(X_{t},X_{t+s},s))]
bs(G,H)=|Ω|Ts+1[~s(G,H)𝔼~s(G,H)]\displaystyle b_{s}(G,H)=\frac{|\Omega|}{T-s+1}[\widetilde{\mathcal{L}}_{s}(G,H)-\mathbb{E}\widetilde{\mathcal{L}}_{s}(G,H)]
ds(G,H)=˙T,s(G,H)|Ω|Ts+1𝔼~s(G,H)\displaystyle d_{s}(G,H)=\dot{\mathcal{L}}_{T,s}(G,H)-\frac{|\Omega|}{T-s+1}\mathbb{E}\widetilde{\mathcal{L}}_{s}(G,H)

then bs(G,H)=1Ts+1t=0Tsbts(G,H)b_{s}(G,H)=\frac{1}{T-s+1}\sum\limits_{t=0}^{T-s}b_{t}^{s}(G,H), and:

Δ52Ss=1S𝔼supG𝒢,H|bs(G,H)|+2Ss=1SsupG𝒢,H|ds(G,H)|\displaystyle\Delta_{5}\leq\frac{2}{S}\sum\limits_{s=1}^{S}\mathbb{E}\mathop{\sup}_{G\in\mathcal{G},H\in\mathcal{H}}|b_{s}(G,H)|+\frac{2}{S}\sum\limits_{s=1}^{S}\mathop{\sup}_{G\in\mathcal{G},H\in\mathcal{H}}|d_{s}(G,H)| (81)
\displaystyle+2\sum\limits_{s=1}^{S}|1-\frac{|\Omega|}{S(T-s+1)}|\cdot\mathbb{E}\mathop{\sup}_{G\in\mathcal{G},H\in\mathcal{H}}|\widetilde{\mathcal{L}}_{s}(G,H)| (82)

Let \epsilon_{t} be i.i.d. Rademacher random variables. For 1\leq s\leq S, denote the Rademacher complexity on \mathcal{G}\times\mathcal{H} as:

s(𝒢×)=𝔼supG𝒢,H|2Ts+1t=0Tsϵtbts(G,H)|\displaystyle\mathcal{R}_{s}(\mathcal{G}\times\mathcal{H})=\mathbb{E}\mathop{\sup}_{G\in\mathcal{G},H\in\mathcal{H}}|\frac{2}{T-s+1}\sum\limits_{t=0}^{T-s}\epsilon_{t}b_{t}^{s}(G,H)|

By Proposition 7, we have:

𝔼supG𝒢,H|bs(G,H)|s(𝒢×)\displaystyle\mathbb{E}\mathop{\sup}_{G\in\mathcal{G},H\in\mathcal{H}}|b_{s}(G,H)|\leq\mathcal{R}_{s}(\mathcal{G}\times\mathcal{H}) (83)

Fix (X_{t},\eta_{t})_{t=0}^{T-1} and define the empirical metric d on \mathcal{G}\times\mathcal{H} by:

d((G,H),(G~,H~))=sup0tTs|bts(G,H)bts(G~,H~)|\displaystyle d((G,H),(\tilde{G},\tilde{H}))=\mathop{\sup}_{0\leq t\leq T-s}|b_{t}^{s}(G,H)-b_{t}^{s}(\tilde{G},\tilde{H})|

Let \mathcal{G}_{\delta}\times\mathcal{H}_{\delta} be a \delta-net of (\mathcal{G}\times\mathcal{H},d) and let C(\delta,\mathcal{G}\times\mathcal{H},d) denote the corresponding covering number. Then, by Lemma 15:

s(𝒢×)\displaystyle\mathcal{R}_{s}(\mathcal{G}\times\mathcal{H}) =𝔼(Xt,ηt)t=0T𝔼(ϵt)t=0TssupG𝒢,H|2Ts+1t=0Tsϵtbts(G,H)|\displaystyle=\mathbb{E}_{\left(X_{t},\eta_{t}\right)_{t=0}^{T}}\mathbb{E}_{\left(\epsilon_{t}\right)_{t=0}^{T-s}}\mathop{\sup}_{G\in\mathcal{G},H\in\mathcal{H}}|\frac{2}{T-s+1}\sum\limits_{t=0}^{T-s}\epsilon_{t}b_{t}^{s}(G,H)|
2δ+𝔼(Xt,ηt)t=0T𝔼(ϵt)t=0Tssup(G,H)𝒢δ×δ|2Ts+1t=0Tsϵtbts(G,H)|\displaystyle\leq 2\delta+\mathbb{E}_{\left(X_{t},\eta_{t}\right)_{t=0}^{T}}\mathbb{E}_{\left(\epsilon_{t}\right)_{t=0}^{T-s}}\mathop{\sup}_{(G,H)\in\mathcal{G}_{\delta}\times\mathcal{H}_{\delta}}|\frac{2}{T-s+1}\sum\limits_{t=0}^{T-s}\epsilon_{t}b_{t}^{s}(G,H)|
2δ+2Ts+1𝔼(Xt,ηt)t=0T(2log(2C(δ,𝒢×,d))sup(G,H)𝒢δ×δt=0Ts(bts(G,H))2)1/2\displaystyle\leq 2\delta+\frac{2}{T-s+1}\mathbb{E}_{\left(X_{t},\eta_{t}\right)_{t=0}^{T}}(2\log(2C(\delta,\mathcal{G}\times\mathcal{H},d))\cdot\sup_{(G,H)\in\mathcal{G}_{\delta}\times\mathcal{H}_{\delta}}\sum\limits_{t=0}^{T-s}(b_{t}^{s}(G,H))^{2})^{1/2}
2δ+4Ts+1𝔼(Xt,ηt)t=0T(logC(δ,𝒢×,d)sup(G,H)𝒢δ×δt=0Ts(bts(G,H))2)1/2\displaystyle\leq 2\delta+\frac{4}{T-s+1}\mathbb{E}_{\left(X_{t},\eta_{t}\right)_{t=0}^{T}}(\log C(\delta,\mathcal{G}\times\mathcal{H},d)\cdot\sup_{(G,H)\in\mathcal{G}_{\delta}\times\mathcal{H}_{\delta}}\sum\limits_{t=0}^{T-s}(b_{t}^{s}(G,H))^{2})^{1/2}

By assumption, b_{t}^{s}(G,H) is bounded by a constant C_{0}, so we have:

s(𝒢×)\displaystyle\mathcal{R}_{s}(\mathcal{G}\times\mathcal{H}) 2δ+4Ts+1𝔼(Xt,ηt)t=0T((Ts+1)C02logC(δ,𝒢×,d))1/2\displaystyle\leq 2\delta+\frac{4}{T-s+1}\mathbb{E}_{\left(X_{t},\eta_{t}\right)_{t=0}^{T}}((T-s+1)C_{0}^{2}\log C(\delta,\mathcal{G}\times\mathcal{H},d))^{1/2}
2δ+4C0Ts+1𝔼(Xt,ηt)t=0T(logC(δ,𝒢,d)+logC(δ,,d))1/2\displaystyle\leq 2\delta+\frac{4C_{0}}{\sqrt{T-s+1}}\mathbb{E}_{\left(X_{t},\eta_{t}\right)_{t=0}^{T}}(\log C(\delta,\mathcal{G},d)+\log C(\delta,\mathcal{H},d))^{1/2}

By Theorem 12.2 in Anthony and Bartlett, (1999), the covering number can be bounded as:

logC(δ,𝒢,d)Pdim𝒢log2eTB𝒢δPdim𝒢\displaystyle\log C(\delta,\mathcal{G},d)\leq\text{Pdim}_{\mathcal{G}}\log\frac{2eT\text{B}_{\mathcal{G}}}{\delta\text{Pdim}_{\mathcal{G}}}
logC(δ,,d)Pdimlog2eTBδPdim\displaystyle\log C(\delta,\mathcal{H},d)\leq\text{Pdim}_{\mathcal{H}}\log\frac{2eT\text{B}_{\mathcal{H}}}{\delta\text{Pdim}_{\mathcal{H}}}

Here \text{Pdim}_{\mathcal{G}} and \text{Pdim}_{\mathcal{H}} denote the pseudo-dimensions of \mathcal{G} and \mathcal{H}, respectively.

Taking \delta=T^{-1}, we therefore have:

s(𝒢×)\displaystyle\ \ \ \mathcal{R}_{s}(\mathcal{G}\times\mathcal{H})
2T1+4C0Ts+1(Pdim𝒢log2eT2B𝒢Pdim𝒢+Pdimlog2eT2BPdim)1/2\displaystyle\leq 2T^{-1}+\frac{4C_{0}}{\sqrt{T-s+1}}(\text{Pdim}_{\mathcal{G}}\log\frac{2eT^{2}\text{B}_{\mathcal{G}}}{\text{Pdim}_{\mathcal{G}}}+\text{Pdim}_{\mathcal{H}}\log\frac{2eT^{2}\text{B}_{\mathcal{H}}}{\text{Pdim}_{\mathcal{H}}})^{1/2}
2T1+4C0Ts+1(Pdim𝒢log2eT2B𝒢Pdim𝒢)1/2+4C0Ts+1(Pdimlog2eT2BPdim)1/2\displaystyle\leq 2T^{-1}+\frac{4C_{0}}{\sqrt{T-s+1}}(\text{Pdim}_{\mathcal{G}}\log\frac{2eT^{2}\text{B}_{\mathcal{G}}}{\text{Pdim}_{\mathcal{G}}})^{1/2}+\frac{4C_{0}}{\sqrt{T-s+1}}(\text{Pdim}_{\mathcal{H}}\log\frac{2eT^{2}\text{B}_{\mathcal{H}}}{\text{Pdim}_{\mathcal{H}}})^{1/2}
=𝒪((Pdim𝒢log(TB𝒢)T)1/2)+𝒪((Pdimlog(TB)T)1/2)\displaystyle=\mathcal{O}((\frac{\text{Pdim}_{\mathcal{G}}\log(T\text{B}_{\mathcal{G}})}{T})^{1/2})+\mathcal{O}((\frac{\text{Pdim}_{\mathcal{H}}\log(T\text{B}_{\mathcal{H}})}{T})^{1/2})

By Proposition 6, we have:

supG𝒢,H|ds(G,H)|𝒪(Tαα+1)\displaystyle\mathop{\sup}_{G\in\mathcal{G},H\in\mathcal{H}}|d_{s}(G,H)|\leq\ \mathcal{O}(T^{-\frac{\alpha}{\alpha+1}})

Finally, note that |Ω|=S(2TS+1)2|\Omega|=\frac{S(2T-S+1)}{2}, then:

|1|Ω|S(Ts+1)|𝔼supG𝒢,H|~s(G,H)|\displaystyle|1-\frac{|\Omega|}{S(T-s+1)}|\mathbb{E}\mathop{\sup}_{G\in\mathcal{G},H\in\mathcal{H}}|\widetilde{\mathcal{L}}_{s}(G,H)| \displaystyle\leq |1|Ω|S(Ts+1)|(Ts+1)M|Ω|\displaystyle|1-\frac{|\Omega|}{S(T-s+1)}|\cdot\frac{(T-s+1)M}{|\Omega|}
=\displaystyle= |S2s+1|2(Ts+1)2(Ts+1)MS(2TS+1)\displaystyle\frac{|S-2s+1|}{2(T-s+1)}\cdot\frac{2(T-s+1)M}{S(2T-S+1)}
=\displaystyle= 𝒪(T1)\displaystyle\mathcal{O}(T^{-1})

therefore,

Δ5\displaystyle\Delta_{5} \displaystyle\leq 𝒪(Pdim𝒢log(TB𝒢)T+Pdimlog(TB)T)+𝒪(Tαα+1)+𝒪(T1)\displaystyle\mathcal{O}\left(\sqrt{\frac{\text{Pdim}_{\mathcal{G}}\log(T\text{B}_{\mathcal{G}})}{T}}+\sqrt{\frac{\text{Pdim}_{\mathcal{H}}\log(T\text{B}_{\mathcal{H}})}{T}}\right)+\mathcal{O}(T^{-\frac{\alpha}{\alpha+1}})+\mathcal{O}(T^{-1})
=\displaystyle= 𝒪(Pdim𝒢log(TB𝒢)T+Pdimlog(TB)T)+𝒪(Tαα+1)\displaystyle\mathcal{O}\left(\sqrt{\frac{\text{Pdim}_{\mathcal{G}}\log(T\text{B}_{\mathcal{G}})}{T}}+\sqrt{\frac{\text{Pdim}_{\mathcal{H}}\log(T\text{B}_{\mathcal{H}})}{T}}\right)+\mathcal{O}(T^{-\frac{\alpha}{\alpha+1}})
=\displaystyle= Δ~1+Δ~2\displaystyle\tilde{\Delta}_{1}+\tilde{\Delta}_{2}

A.9 Proof of Theorem 4

Proof A.9.

By Proposition 5 and Theorem 8, we have:

𝔼(Xt,ηt)t=0TpX^T,,X^T+SpXT,,XT+SL12\displaystyle\mathbb{E}_{(X_{t},\eta_{t})_{t=0}^{T}}\left\|p_{\widehat{X}_{T},\cdots,\widehat{X}_{T+S}}-p_{X_{T},\cdots,X_{T+S}}\right\|_{L_{1}}^{2}
\displaystyle\leq 2S2𝔼(Xt,ηt)t=0TpX^T,X^T+1pXT,XT+1L12+𝒪(T2α)\displaystyle 2S^{2}\mathbb{E}_{(X_{t},\eta_{t})_{t=0}^{T}}\|p_{\widehat{X}_{T},\widehat{X}_{T+1}}-p_{X_{T},X_{T+1}}\|_{L_{1}}^{2}+\mathcal{O}(T^{-2\alpha})
\displaystyle\leq Δ~1+Δ2+Δ3+Δ4+𝒪(T2α)\displaystyle\tilde{\Delta}_{1}+\Delta_{2}+\Delta_{3}+\Delta_{4}+\mathcal{O}(T^{-2\alpha})
=\displaystyle= Δ1+Δ2+Δ3+Δ4\displaystyle\Delta_{1}+\Delta_{2}+\Delta_{3}+\Delta_{4}

where,

Δ1=𝒪(Tαα+1+T2α)\displaystyle\Delta_{1}=\mathcal{O}(T^{-\frac{\alpha}{\alpha+1}}+T^{-2\alpha})
Δ~1=𝒪(Tαα+1)\displaystyle\tilde{\Delta}_{1}=\mathcal{O}(T^{-\frac{\alpha}{\alpha+1}})
Δ2=𝒪(Pdim𝒢1log(TB𝒢1)T+Pdim1log(TB1)T)\displaystyle\Delta_{2}=\mathcal{O}\left(\sqrt{\frac{\text{Pdim}_{\mathcal{G}_{1}}\log(T\text{B}_{\mathcal{G}_{1}})}{T}}+\sqrt{\frac{\text{Pdim}_{\mathcal{H}_{1}}\log(T\text{B}_{\mathcal{H}_{1}})}{T}}\right)
Δ3=4S2a𝔼(Xt,ηt)t=0T(suphT(g^,h)suph1T(g^,h))\displaystyle\Delta_{3}=\frac{4S^{2}}{a}\mathbb{E}_{(X_{t},\eta_{t})_{t=0}^{T}}(\mathop{\sup}_{h}\mathcal{L}_{T}(\widehat{g},h)-\mathop{\sup}_{h\in\mathcal{H}_{1}}\mathcal{L}_{T}(\widehat{g},h))
Δ4=4S2ainfg¯𝒢1𝕃T(g¯)\displaystyle\Delta_{4}=\frac{4S^{2}}{a}\mathop{\inf}_{\bar{g}\in\mathcal{G}_{1}}\mathbb{L}_{T}(\bar{g})

A.10 Proof of Theorem 14

Proof A.10.

Similarly to the proof of Theorem 8, we can obtain:

𝔼1Tt=0T1pXi,t,g^(ηt,Xi,t)pXi,t,Xi,t+1L12Δ¨1+Δ¨2+Δ¨3\displaystyle\mathbb{E}\frac{1}{T}\sum\limits_{t=0}^{T-1}\left\|p_{X_{i,t},\widehat{g}(\eta_{t},X_{i,t})}-p_{X_{i,t},X_{i,t+1}}\right\|_{L_{1}}^{2}\leq\ddot{\Delta}_{1}+\ddot{\Delta}_{2}+\ddot{\Delta}_{3}

where,

Δ¨1=𝒪(Pdim𝒢log(nB𝒢)n+Pdimlog(nB)n)\displaystyle\ddot{\Delta}_{1}=\mathcal{O}\left(\sqrt{\frac{\text{Pdim}_{\mathcal{G}}\log(n\text{B}_{\mathcal{G}})}{n}}+\sqrt{\frac{\text{Pdim}_{\mathcal{H}}\log(n\text{B}_{\mathcal{H}})}{n}}\right)
Δ¨2=𝒪(1)𝔼(suph˙(T)(g^,h)suph˙(T)(g^,h))\displaystyle\ddot{\Delta}_{2}=\mathcal{O}(1)\cdot\mathbb{E}(\mathop{\sup}_{h}\dot{\mathcal{L}}_{(T)}(\widehat{g},h)-\mathop{\sup}_{h\in\mathcal{H}}\dot{\mathcal{L}}_{(T)}(\widehat{g},h))
Δ¨3=𝒪(1)infg¯𝒢𝕃˙(T)(g¯)\displaystyle\ddot{\Delta}_{3}=\mathcal{O}(1)\cdot\mathop{\inf}_{\bar{g}\in\mathcal{G}}\dot{\mathbb{L}}_{(T)}(\bar{g})

Within this proof, let p(\cdot|x) denote the conditional density of \widehat{g}(\eta_{0},x) and q(\cdot|x) the conditional density of X_{1} given X_{0}=x, i.e.,

g^(η0,x)p(|x)\displaystyle\widehat{g}(\eta_{0},x)\sim p(\cdot|x)
X1|(X0=x)q(|x)\displaystyle X_{1}|(X_{0}=x)\sim q(\cdot|x)

Let M_{0}=\mathop{\sup}_{0\leq s\leq S-1}\int\frac{p_{T+s}^{2}(x)}{p_{T-1}(x)}dx<\infty. By the proof of Proposition 5, we have:

pX^T,,X^T+SpXT,,XT+SL12\displaystyle\|p_{\widehat{X}_{T},\cdots,\widehat{X}_{T+S}}-p_{X_{T},\cdots,X_{T+S}}\|_{L_{1}}^{2}
=\displaystyle= (|pX^T,,X^T+S(x0,,xS)pXT,,XT+S(x0,,xS)|𝑑x0:S)2\displaystyle(\int|p_{\widehat{X}_{T},\cdots,\widehat{X}_{T+S}}(x_{0},\cdots,x_{S})-p_{X_{T},\cdots,X_{T+S}}(x_{0},\cdots,x_{S})|dx_{0:S})^{2}
\displaystyle\leq (r=0S1pT+r(x)p(|x)q(|x)L1dx)2\displaystyle(\int\sum\limits_{r=0}^{S-1}p_{T+r}(x)\cdot\|p(\cdot|x)-q(\cdot|x)\|_{L_{1}}dx)^{2}
\displaystyle\leq (r=0S1pT+r(x))2pT1(x)dxpT1(x)p(|x)q(|x)L12dx\displaystyle\int\frac{(\sum\limits_{r=0}^{S-1}p_{T+r}(x))^{2}}{p_{T-1}(x)}dx\cdot\int p_{T-1}(x)\|p(\cdot|x)-q(\cdot|x)\|_{L_{1}}^{2}dx
\displaystyle\leq S2M0pT1(x)Df(p(|x)q(|x))dx\displaystyle S^{2}M_{0}\int p_{T-1}(x)\cdot D_{f}(p(\cdot|x)\|q(\cdot|x))dx
=\displaystyle= S2M0Df(pXT1,g^(ηT1,XT1)pXT1,XT)\displaystyle S^{2}M_{0}D_{f}(p_{X_{T-1},\widehat{g}(\eta_{T-1},X_{T-1})}\|p_{X_{T-1},X_{T}})
\displaystyle\leq S2M0T(Δ¨1+Δ¨2+Δ¨3)\displaystyle S^{2}M_{0}T(\ddot{\Delta}_{1}+\ddot{\Delta}_{2}+\ddot{\Delta}_{3})
=\displaystyle= Δ¨1+Δ¨2+Δ¨3\displaystyle\ddot{\Delta}_{1}+\ddot{\Delta}_{2}+\ddot{\Delta}_{3}

A.11 Proof of Proposition 10

Proof A.11.

We first denote h_{\hat{g}}:=\arg\sup\limits_{h}{\cal L}(\hat{g},h), which is continuous on E_{2}=[-\log T,\log T]^{2p+1} by assumption. Setting E=E_{2}, L=\log T, and N=T^{\frac{2p+1}{2(2p+3)}}/\log T in Theorem 4.3 of Shen et al., (2019), there exists a ReLU network \hat{h}_{\phi}\in{\cal H}_{1} with depth \widetilde{{\cal D}}=12\log T+14+2(2p+1) and width \widetilde{{\cal W}}=3^{2p+4}\max\{(2p+1)\lfloor(T^{\frac{2p+1}{2(2p+3)}}/\log T)^{\frac{1}{2p+1}}\rfloor,T^{\frac{2p+1}{2(2p+3)}}/\log T+1\} such that

h^ϕhg^L(E2)192p+1whg^E2(2logTT12p+3)\displaystyle\|\hat{h}_{\phi}-h_{\hat{g}}\|_{L^{\infty}(E_{2})}\leq 19\sqrt{2p+1}w_{h_{\hat{g}}}^{E_{2}}(2\log T\cdot T^{\frac{-1}{2p+3}})

where w_{h_{\hat{g}}}^{E_{2}} is the modulus of continuity of h_{\hat{g}} on E_{2} as defined in Shen et al., (2019). Then, by the continuity of \mathcal{L},

Δ3=suph(g^,h)suph1(g^,h)(g^,hg^)(g^,h^ϕ)0\displaystyle\Delta_{3}=\mathop{\sup}_{h}\mathcal{L}(\widehat{g},h)-\mathop{\sup}_{h\in\mathcal{H}_{1}}\mathcal{L}(\widehat{g},h)\leq\mathcal{L}(\widehat{g},h_{\hat{g}})-\mathcal{L}(\widehat{g},\hat{h}_{\phi})\rightarrow 0

Similarly, let g^{*}:=\arg\inf\limits_{g}{\mathbb{L}}(g), which is a continuous function on E_{1}=[-\log T,\log T]^{p+m+1} by assumption. Setting E=E_{1}, L=\log T, and N=T^{\frac{p+m+1}{2(3+p+m)}}/\log T in Theorem 4.3 of Shen et al., (2019), there exists a ReLU network \bar{g}\in{\cal G}_{1} with depth {\cal D}=12\log T+14+2(p+m+1) and width {\cal W}=3^{p+m+4}\max\{(p+m+1)\lfloor(T^{\frac{p+m+1}{2(3+p+m)}}/\log T)^{\frac{1}{p+m+1}}\rfloor,T^{\frac{p+m+1}{2(3+p+m)}}/\log T+1\} such that

g¯gL(E1)19p+m+1wgE1(2logTT1p+m+3)\displaystyle\|\bar{g}-g^{*}\|_{L^{\infty}(E_{1})}\leq 19\sqrt{p+m+1}w_{g^{*}}^{E_{1}}(2\log T\cdot T^{\frac{-1}{p+m+3}})

where w_{g^{*}}^{E_{1}} is the modulus of continuity of g^{*} on E_{1} as defined in Shen et al., (2019). Let \bar{h}=f^{\prime}(\frac{p_{X_{T},\bar{g}(\eta,X_{T})}}{p_{X_{T},X_{T+1}}}) and h^{*}=f^{\prime}(\frac{p_{X_{T},g^{*}(\eta,X_{T})}}{p_{X_{T},X_{T+1}}}). Since f^{\prime} is continuous and the densities are integrable in L_{1}, we have \|\bar{h}-h^{*}\|\rightarrow 0 as T\rightarrow\infty. By definition, {\mathbb{L}}(\cdot) can be rephrased as follows:

infg¯𝒢1𝕃(g¯)=𝔼XT,ηTh¯(XT,g¯(ηT,XT))\displaystyle\mathop{\inf}_{\bar{g}\in\mathcal{G}_{1}}\mathbb{L}(\bar{g})=\mathbb{E}_{X_{T},\eta_{T}}\bar{h}(X_{T},\bar{g}(\eta_{T},X_{T})) 𝔼XT,XT+1f(h¯(XT,XT+1)).\displaystyle-\mathbb{E}_{X_{T},X_{T+1}}f^{*}(\bar{h}(X_{T},X_{T+1})).

Therefore, by the continuity of ff^{*} (since ff is a differentiable convex function), we have

Δ4=infg¯𝒢1𝕃(g¯)0.\displaystyle\Delta_{4}=\mathop{\inf}_{\bar{g}\in\mathcal{G}_{1}}\mathbb{L}(\bar{g})\rightarrow 0.
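
For concreteness, the prescribed network sizes grow logarithmically in depth and polynomially in width with T. The sketch below is our own illustration; it simply evaluates the depth and width formulas stated above for the discriminator class (\widetilde{{\cal D}},\widetilde{{\cal W}}) and the generator class ({\cal D},{\cal W}), with placeholder values of p and m and the natural logarithm assumed.

```python
import math

# Evaluate the network depth/width formulas from the proof of Proposition 10
# (illustration only; p and m are placeholder values, natural logarithm assumed).
def discriminator_size(T, p):
    N = T ** ((2 * p + 1) / (2 * (2 * p + 3))) / math.log(T)
    depth = 12 * math.log(T) + 14 + 2 * (2 * p + 1)
    width = 3 ** (2 * p + 4) * max((2 * p + 1) * math.floor(N ** (1 / (2 * p + 1))), N + 1)
    return depth, width

def generator_size(T, p, m):
    d = p + m + 1
    N = T ** (d / (2 * (d + 2))) / math.log(T)
    depth = 12 * math.log(T) + 14 + 2 * d
    width = 3 ** (d + 3) * max(d * math.floor(N ** (1 / d)), N + 1)
    return depth, width

for T in [100, 1000, 10000]:
    print(T, discriminator_size(T, p=2), generator_size(T, p=2, m=3))
```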

A.12 Additional lemmas

Lemma 15.

Let \epsilon_{i}\ (1\leq i\leq m) be i.i.d. Rademacher random variables. For any finite set A\subset\mathbb{R}^{m}, let R=\sup_{a\in A}(\sum\limits_{i=1}^{m}a_{i}^{2})^{1/2}. Then:

(A)=𝔼(ϵi)i=1msupaA|1mi=1mϵiai|R2log(2|A|)m\displaystyle\mathcal{R}(A)=\mathbb{E}_{(\epsilon_{i})_{i=1}^{m}}\mathop{\sup}_{a\in A}|\frac{1}{m}\sum\limits_{i=1}^{m}\epsilon_{i}a_{i}|\leq\frac{R\sqrt{2\log(2|A|)}}{m}
Proof A.12.

Let B=A\cup(-A). Since \sup_{a\in A}|\sum_{i=1}^{m}\epsilon_{i}a_{i}|=\sup_{b\in B}\sum_{i=1}^{m}\epsilon_{i}b_{i} and |B|\leq 2|A|, it suffices to prove that:

𝔼(ϵi)i=1m[supbB1mi=1mϵibi]R2log|B|m\displaystyle\mathbb{E}_{(\epsilon_{i})_{i=1}^{m}}[\mathop{\sup}_{b\in B}\frac{1}{m}\sum\limits_{i=1}^{m}\epsilon_{i}b_{i}]\leq\frac{R\sqrt{2\log|B|}}{m}

For arbitrary s>0, by Jensen's inequality:

exp(s𝔼(ϵi)i=1m[supbBi=1mϵibi])\displaystyle\exp(s\mathbb{E}_{(\epsilon_{i})_{i=1}^{m}}[\mathop{\sup}_{b\in B}\sum\limits_{i=1}^{m}\epsilon_{i}b_{i}]) 𝔼(ϵi)i=1m[exp{ssupbBi=1mϵibi}]\displaystyle\leq\mathbb{E}_{(\epsilon_{i})_{i=1}^{m}}[\exp\left\{s\mathop{\sup}_{b\in B}\sum\limits_{i=1}^{m}\epsilon_{i}b_{i}\right\}]
bB𝔼(ϵi)i=1m[exp{si=1mϵibi}]\displaystyle\leq\sum\limits_{b\in B}\mathbb{E}_{(\epsilon_{i})_{i=1}^{m}}[\exp\left\{s\sum\limits_{i=1}^{m}\epsilon_{i}b_{i}\right\}]
=bBi=1m𝔼(ϵi)i=1mexp{sϵibi}\displaystyle=\sum\limits_{b\in B}\prod\limits_{i=1}^{m}\mathbb{E}_{(\epsilon_{i})_{i=1}^{m}}\exp\left\{s\epsilon_{i}b_{i}\right\}

Since E[\epsilon_{i}b_{i}]=0 and \epsilon_{i}b_{i}\in[-|b_{i}|,|b_{i}|], applying Hoeffding's lemma, we have:

exp{s𝔼(ϵi)i=1m[supbBi=1mϵibi]}\displaystyle\exp\left\{s\mathbb{E}_{(\epsilon_{i})_{i=1}^{m}}[\mathop{\sup}_{b\in B}\sum\limits_{i=1}^{m}\epsilon_{i}b_{i}]\right\} bBi=1m𝔼(ϵi)i=1mexp{sϵibi}\displaystyle\leq\sum\limits_{b\in B}\prod\limits_{i=1}^{m}\mathbb{E}_{(\epsilon_{i})_{i=1}^{m}}\exp\left\{s\epsilon_{i}b_{i}\right\}
\displaystyle\leq\sum\limits_{b\in B}\prod\limits_{i=1}^{m}\exp\left\{\frac{s^{2}(2|b_{i}|)^{2}}{8}\right\}
=bBexp{s22i=1mbi2}\displaystyle=\sum\limits_{b\in B}\exp\left\{\frac{s^{2}}{2}\sum\limits_{i=1}^{m}b_{i}^{2}\right\}
|B|exp{s2R22}\displaystyle\leq|B|\exp\left\{\frac{s^{2}R^{2}}{2}\right\}

Therefore,

𝔼(ϵi)i=1m[supbBi=1mϵibi]log|B|s+sR22\displaystyle\mathbb{E}_{(\epsilon_{i})_{i=1}^{m}}[\mathop{\sup}_{b\in B}\sum\limits_{i=1}^{m}\epsilon_{i}b_{i}]\leq\frac{\log|B|}{s}+\frac{sR^{2}}{2}

Taking s=\frac{\sqrt{2\log|B|}}{R}, we obtain:

𝔼(ϵi)i=1m[supbBi=1mϵibi]R2log|B|\displaystyle\mathbb{E}_{(\epsilon_{i})_{i=1}^{m}}[\mathop{\sup}_{b\in B}\sum\limits_{i=1}^{m}\epsilon_{i}b_{i}]\leq R\sqrt{2\log|B|}
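
A short Monte Carlo illustration of Lemma 15 (ours, not part of the proof; the finite set A is randomly generated and the constants are arbitrary) estimates \mathcal{R}(A) by simulation and compares it with the bound R\sqrt{2\log(2|A|)}/m.

```python
import numpy as np

# Monte Carlo check of Lemma 15 (illustration only; A is a random finite set).
rng = np.random.default_rng(0)
m, n_vectors, n_sims = 50, 20, 20000
A = rng.normal(size=(n_vectors, m))               # finite set A in R^m
R = np.max(np.linalg.norm(A, axis=1))             # R = sup_{a in A} (sum_i a_i^2)^{1/2}

eps = rng.choice([-1.0, 1.0], size=(n_sims, m))   # Rademacher signs
sup_vals = np.max(np.abs(eps @ A.T), axis=1) / m  # sup_{a in A} |(1/m) sum_i eps_i a_i|
print("estimated R(A):", sup_vals.mean())
print("Lemma 15 bound:", R * np.sqrt(2 * np.log(2 * n_vectors)) / m)
```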
Lemma 16.

Suppose Assumption 1 holds. Then for 1\leq s\leq S and t_{1},t_{2}>0, the joint density functions of (X_{t_{1}},X_{t_{1}+s}) and (X_{t_{2}},X_{t_{2}+s}) satisfy:

pXt1,Xt1+spXt2,Xt2+sL1𝒪(min{t1,t2}α)\displaystyle\|p_{X_{t_{1}},X_{t_{1}+s}}-p_{X_{t_{2}},X_{t_{2}+s}}\|_{L_{1}}\leq\mathcal{O}(\mathop{\min}\left\{t_{1},t_{2}\right\}^{-\alpha}) (84)
Proof A.13.

Let ps(|x)p_{s}(\cdot|x) be the conditional density function of Xs|(X0=x)X_{s}|(X_{0}=x).

For 1sS1\leq s\leq S and t0t\geq 0, the conditional density function of Xt+s|XtX_{t+s}|X_{t} satisfies:

pXt+s|Xt(xs|x0)\displaystyle p_{X_{t+s}|X_{t}}(x_{s}|x_{0}) =pXt+s,Xt+s1,,Xt+1|Xt(xs,xs1,,x1|x0)𝑑xs1𝑑x1\displaystyle=\int p_{X_{t+s},X_{t+s-1},\cdots,X_{t+1}|X_{t}}(x_{s},x_{s-1},\cdots,x_{1}|x_{0})dx_{s-1}\cdots dx_{1}
=i=1sp1(xi|xi1)dxs1dx1\displaystyle=\int\prod\limits_{i=1}^{s}p_{1}(x_{i}|x_{i-1})dx_{s-1}\cdots dx_{1}
=ps(xs|x0)\displaystyle=p_{s}(x_{s}|x_{0})

Then, for 1sS1\leq s\leq S and t1,t2>0t_{1},t_{2}>0, we have:

pXt1,Xt1+spXt2,Xt2+sL1\displaystyle\|p_{X_{t_{1}},X_{t_{1}+s}}-p_{X_{t_{2}},X_{t_{2}+s}}\|_{L_{1}} =|pt1(x)ps(y|x)pt2(x)ps(y|x)|dydx\displaystyle=\int|p_{t_{1}}(x)\cdot p_{s}(y|x)-p_{t_{2}}(x)\cdot p_{s}(y|x)|dydx
=|pt1(x)pt2(x)|ps(y|x)𝑑y𝑑x\displaystyle=\int|p_{t_{1}}(x)-p_{t_{2}}(x)|\cdot p_{s}(y|x)dydx
=|pt1(x)pt2(x)|𝑑x\displaystyle=\int|p_{t_{1}}(x)-p_{t_{2}}(x)|dx
=pt1pt2L1\displaystyle=\|p_{t_{1}}-p_{t_{2}}\|_{L_{1}}

Combining this with Assumption 1, we have:

pXt1,Xt1+spXt2,Xt2+sL1\displaystyle\|p_{X_{t_{1}},X_{t_{1}+s}}-p_{X_{t_{2}},X_{t_{2}+s}}\|_{L_{1}} =\displaystyle= pt1pt2L1\displaystyle\|p_{t_{1}}-p_{t_{2}}\|_{L_{1}}
\displaystyle\leq pt1pL1+pt2pL1\displaystyle\|p_{t_{1}}-p_{\infty}\|_{L_{1}}+\|p_{t_{2}}-p_{\infty}\|_{L_{1}}
\displaystyle\leq 𝒪(t1α)+𝒪(t2α)\displaystyle\mathcal{O}(t_{1}^{-\alpha})+\mathcal{O}(t_{2}^{-\alpha})
=\displaystyle= 𝒪(min{t1,t2}α)\displaystyle\mathcal{O}(\mathop{\min}\left\{t_{1},t_{2}\right\}^{-\alpha})
Lemma 17.

Suppose the convex function f satisfies f(1)=0. If equation (14) holds, then for any density functions p and q, we have:

Df(pq)a2pqL12\displaystyle D_{f}(p\|q)\geq\frac{a}{2}\left\|p-q\right\|_{L_{1}}^{2} (85)
Proof A.14.

Let L=f^{\prime}(0). By equation (14), we obtain the following inequality:

f(x+1)a2x21+bx+Lx,x1\displaystyle f(x+1)\geq\frac{a}{2}\frac{x^{2}}{1+bx}+Lx,\ \ x\geq-1 (86)

Let r(x)=p(x)/q(x)-1\geq-1; then the f-divergence of p from q can be expressed as:

Df(pq)=q(x)f(r(x)+1)𝑑x\displaystyle D_{f}(p\|q)=\int q(x)f(r(x)+1)dx

Note that,

r(x)q(x)𝑑x=(p(x)q(x))𝑑x=0\displaystyle\int r(x)q(x)dx=\int(p(x)-q(x))dx=0 (87)

Combining this with inequality (86), we have:

Df(pq)\displaystyle D_{f}(p\|q) =\displaystyle= q(x)f(r(x)+1)𝑑x\displaystyle\int q(x)f(r(x)+1)dx
\displaystyle\geq q(x)(a2r(x)21+br(x)+Lr(x))𝑑x\displaystyle\int q(x)(\frac{a}{2}\frac{r(x)^{2}}{1+br(x)}+Lr(x))dx
=\displaystyle= a2q(x)r(x)21+br(x)𝑑x\displaystyle\frac{a}{2}\int q(x)\frac{r(x)^{2}}{1+br(x)}dx

By equation (87),

q(x)(1+br(x))𝑑x=q(x)𝑑x+bq(x)r(x)𝑑x=1\displaystyle\int q(x)(1+br(x))dx=\int q(x)dx+b\int q(x)r(x)dx=1

Since 0<b<1, 1+br(x)\geq 1-b>0 holds for all x. Therefore, by the Cauchy--Schwarz inequality, we have:

Df(pq)\displaystyle D_{f}(p\|q) \displaystyle\geq a2q(x)r(x)21+br(x)𝑑x\displaystyle\frac{a}{2}\int q(x)\frac{r(x)^{2}}{1+br(x)}dx
=\displaystyle= a2q(x)r(x)21+br(x)𝑑xq(x)(1+br(x))𝑑x\displaystyle\frac{a}{2}\int q(x)\frac{r(x)^{2}}{1+br(x)}dx\cdot\int q(x)(1+br(x))dx
\displaystyle\geq a2(q(x)|r(x)|𝑑x)2\displaystyle\frac{a}{2}(\int q(x)|r(x)|dx)^{2}
=\displaystyle= a2(|p(x)q(x)|𝑑x)2\displaystyle\frac{a}{2}(\int|p(x)-q(x)|dx)^{2}
=\displaystyle= a2pqL12\displaystyle\frac{a}{2}\|p-q\|_{L_{1}}^{2}
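
As a concrete special case, the classical Pinsker inequality for the KL divergence (f(x)=x\log x) has exactly the form of (85) with a=1. The sketch below is our own numerical illustration of that special case on random discrete distributions; it does not rely on the specific form of condition (14).

```python
import numpy as np

# Numerical check of a bound of the form (85) with a = 1 (Pinsker's inequality)
# for the KL divergence f(x) = x log x on random discrete distributions (illustration only).
rng = np.random.default_rng(1)
for _ in range(5):
    p = rng.dirichlet(np.ones(10))
    q = rng.dirichlet(np.ones(10))
    kl = np.sum(p * np.log(p / q))     # D_f(p || q) for f(x) = x log x
    l1 = np.sum(np.abs(p - q))         # ||p - q||_{L1}
    print(f"KL = {kl:.4f}  >=  0.5 * L1^2 = {0.5 * l1**2:.4f}")
```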

Appendix B Implementations

B.1 Simulations

We present here the visualization of ϕ1\phi_{1}, ϕe\phi_{e}, and Σ\Sigma_{\infty} of Case 1 in the simulation study.

Figure 4: The visualization of ϕ1\phi_{1}, ϕe\phi_{e}, and Σ\Sigma_{\infty} of Case 1 in the simulation study.

B.2 The ADNI study

Given two images XX and YY, structural similarity index measure (SSIM) is defined as

SSIM(X,Y)=(2μxμy+c1)(2σxy+c2)(μx2+μy2+c1)(σx2+σy2+c2),\displaystyle\text{SSIM}(X,Y)=\frac{(2\mu_{x}\mu_{y}+c_{1})(2\sigma_{xy}+c_{2})}{(\mu_{x}^{2}+\mu_{y}^{2}+c_{1})(\sigma_{x}^{2}+\sigma_{y}^{2}+c_{2})},

where \mu_{x},\mu_{y} and \sigma_{x}^{2},\sigma_{y}^{2} are the means and variances of the pixel values in X and Y, respectively, and \sigma_{xy} is the covariance between X and Y. The constants are c_{1}=(0.01R)^{2} and c_{2}=(0.03R)^{2}, where R denotes the range of pixel values in the image. When computing \text{SSIM}(\widehat{X}_{s},X_{s}), R refers to the pixel range of X_{s}.
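
A minimal global-SSIM computation following the formula above might look as follows. This is our own sketch: it evaluates a single SSIM value over whole images (rather than the windowed average used by common imaging toolboxes) and takes R to be the max-min pixel range of the reference image.

```python
import numpy as np

def ssim_global(x, y):
    """Single global SSIM value between two images (sketch only).

    R is taken as the max-min pixel range of the reference image y, matching
    the convention that R refers to the pixel range of X_s.
    """
    x, y = x.astype(np.float64), y.astype(np.float64)
    R = y.max() - y.min()
    c1, c2 = (0.01 * R) ** 2, (0.03 * R) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2))

rng = np.random.default_rng(0)
img = rng.random((64, 64))
print(ssim_global(img, img))                                     # exactly 1 for identical images
print(ssim_global(img + 0.1 * rng.normal(size=img.shape), img))  # below 1 for a noisy copy
```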

Below are the specifics of the Generator and the Discriminator used in the ADNI study.
Generator: The generator G\in{\cal G} consists of two components: the encoder E_{G} and the decoder D_{G}. First, a 2D slice X_{t} is fed into E_{G}, which generates an embedding vector of size 130, denoted E_{G}(X_{t})\in{\mathbb{R}}^{130}. Next, we concatenate E_{G}(X_{t}) with the age-difference vector \boldsymbol{s} and use the result as the input of D_{G}. The output of D_{G} is a generated image \widehat{X}_{t+s} with the same dimensions as X_{t}. The structures of the encoder E_{G} and decoder D_{G} are adopted from the residual U-net proposed by Zhang et al., (2018). A schematic sketch of this conditional wiring is given after the description of the discriminator below.
Discriminator: The discriminator H\in{\cal H} consists of an encoder part (E_{H}) and a critic part (C_{H}). The encoder part produces two latent features, E_{H}(X_{t}) and E_{H}(X_{t+s}). We then use the combination of E_{H}(X_{t}), E_{H}(X_{t+s}), and \boldsymbol{s} as the input of the critic C_{H}, which produces a confidence score. The encoder component resembles the encoder in the generator, while the critic part is adopted from Zhang et al., (2019).
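
A schematic of the generator's conditional wiring, in PyTorch-style code, is given below. This is our own sketch: the encoder and decoder bodies are small placeholders standing in for the residual U-net blocks of Zhang et al., (2018), the image size is hypothetical, and the age difference is passed as a scalar per sample; only the 130-dimensional embedding and its concatenation with \boldsymbol{s} follow the description above.

```python
import torch
import torch.nn as nn

# Schematic wiring of the conditional generator G (sketch only).
# The actual encoder/decoder follow the residual U-net of Zhang et al. (2018);
# here they are small placeholders and the 1 x 128 x 128 image size is hypothetical.
class Generator(nn.Module):
    def __init__(self, emb_dim=130):
        super().__init__()
        self.encoder = nn.Sequential(                       # placeholder for E_G
            nn.Conv2d(1, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(), nn.Linear(32 * 32 * 32, emb_dim))
        self.decoder = nn.Sequential(                       # placeholder for D_G
            nn.Linear(emb_dim + 1, 32 * 32 * 32), nn.ReLU(),
            nn.Unflatten(1, (32, 32, 32)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1))

    def forward(self, x_t, s):
        z = self.encoder(x_t)                      # E_G(X_t), shape (batch, 130)
        z = torch.cat([z, s.view(-1, 1)], dim=1)   # concatenate the age difference s
        return self.decoder(z)                     # generated image X_hat_{t+s}

x_t = torch.randn(4, 1, 128, 128)                  # a batch of 2D slices
s = torch.tensor([1.0, 2.0, 0.5, 3.0])             # age differences
print(Generator()(x_t, s).shape)                   # torch.Size([4, 1, 128, 128])
```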

The number of training epochs is set to 500. We use the AdamW optimizer of Loshchilov and Hutter, (2017) for both networks, with a learning rate and weight decay of 1\times 10^{-4}. The mini-batch size is set to 12.
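
The optimizer configuration described above corresponds to the following setup (our own sketch; the two nn.Linear modules are placeholders for the generator and discriminator networks).

```python
import torch
import torch.nn as nn

# Optimizer setup matching the description above (sketch only; the two nn.Linear
# modules are placeholders for the generator and discriminator networks).
generator, discriminator = nn.Linear(10, 10), nn.Linear(10, 1)
opt_G = torch.optim.AdamW(generator.parameters(), lr=1e-4, weight_decay=1e-4)
opt_D = torch.optim.AdamW(discriminator.parameters(), lr=1e-4, weight_decay=1e-4)
epochs, batch_size = 500, 12   # training-loop constants from the description above
```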