
Sparse Deep Learning for Time Series Data: Theory and Applications

Mingxuan Zhang
Department of Statistics
Purdue University
[email protected]
Yan Sun
Department of Biostatistics, Epidemiology, and Informatics
University of Pennsylvania
[email protected]
Faming Liang
Department of Statistics
[email protected]
Abstract

Sparse deep learning has become a popular technique for improving the performance of deep neural networks in areas such as uncertainty quantification, variable selection, and large-scale network compression. However, most existing research has focused on problems where the observations are independent and identically distributed (i.i.d.), and there has been little work on the problems where the observations are dependent, such as time series data and sequential data in natural language processing. This paper aims to address this gap by studying the theory for sparse deep learning with dependent data. We show that sparse recurrent neural networks (RNNs) can be consistently estimated, and their predictions are asymptotically normally distributed under appropriate assumptions, enabling the prediction uncertainty to be correctly quantified. Our numerical results show that sparse deep learning outperforms state-of-the-art methods, such as conformal predictions, in prediction uncertainty quantification for time series data. Furthermore, our results indicate that the proposed method can consistently identify the autoregressive order for time series data and outperform existing methods in large-scale model compression. Our proposed method has important practical implications in fields such as finance, healthcare, and energy, where both accurate point estimates and prediction uncertainty quantification are of concern.

1 Introduction

Over the past decade, deep learning has experienced unparalleled triumphs across a multitude of domains, such as time series forecasting [1, 2, 3, 4, 5], natural language processing [6, 7], and computer vision [8, 9]. However, challenges like generalization and miscalibration [10] persist, posing potential risks in critical applications like medical diagnosis and autonomous vehicles.

In order to enhance the performance of deep neural networks (DNNs), significant research efforts have been dedicated to exploring optimization methods and the loss surface of DNNs, see, e.g., [11, 12, 13, 14, 15, 16], with the aim of expediting and directing the convergence of DNNs towards regions that exhibit strong generalization capabilities. While these investigations are valuable, effectively addressing both the challenges of generalization and miscalibration requires additional and perhaps more essential ingredients: consistent estimation of the underlying input-output mapping and complete knowledge of the asymptotic distribution of predictions. As a highly effective method that addresses both challenges, sparse deep learning has been extensively studied, see, e.g., [17, 18, 19, 20, 21]. Nevertheless, it is important to note that all these studies have been conducted under the assumption of independently and identically distributed (i.i.d.) data, whereas in practice we frequently encounter data that exhibit dependence, such as time series data.

The primary objective of this paper is to address this gap by establishing a theoretical foundation for sparse deep learning with time series data. Specifically, we lay the foundation within the Bayesian framework. For RNNs, by letting their parameters be subject to a mixture Gaussian prior, we establish posterior consistency, structure selection consistency, input-output mapping estimation consistency, and asymptotic normality of predicted values.

We validate our theory through numerical experiments on both synthetic and real-world datasets. Our approach outperforms existing state-of-the-art methods in uncertainty quantification and model compression, highlighting its potential for practical applications where both accurate point prediction and prediction uncertainty quantification are of concern.

2 Related Works

Sparse deep learning. Theoretical investigations have been conducted on the approximation power of sparse DNNs across different classes of functions [22, 23]. Recently, [17] has made notable progress by integrating sparse DNNs into the framework of statistical modeling, which offers a fundamentally distinct neural network approximation theory. Unlike traditional theories that lack data involvement and allow connection weights to assume values in an unbounded space to achieve arbitrarily small approximation errors with small networks [24], their theory [17] links network approximation error, network size, and weight bounds to the training sample size. They show that a sparse DNN of size $O(n/\log(n))$ can effectively approximate various types of functions, such as affine and piecewise smooth functions, as $n\to\infty$, where $n$ denotes the training sample size. Additionally, sparse DNNs exhibit several advantageous theoretical guarantees, such as improved interpretability, enabling the consistent identification of relevant variables for high-dimensional nonlinear systems. Building upon this foundation, [18] establishes the asymptotic normality of connection weights and predictions, enabling valid statistical inference for prediction uncertainties. This work extends the sparse deep learning theory of [17, 18] from the case of i.i.d. data to the case of time series data.

Uncertainty quantification. Conformal Prediction (CP) has emerged as a prominent technique for generating prediction intervals, particularly for black-box models like neural networks. A key advantage of CP is its capability to provide valid prediction intervals for any data distribution, even with finite samples, provided the data meets the condition of exchangeability [25, 26]. While i.i.d. data easily satisfies this condition, dependent data, such as time series, often doesn’t. Researchers have extended CP to handle time series data by relying on properties like strong mixing and ergodicity [27, 28]. In a recent work, [29] introduced a random swapping mechanism to address potentially non-exchangeable data, allowing conformal prediction to be applied on top of a model trained with weighted samples. The main focus of this approach was to provide a theoretical basis for the differences observed in the coverage rate of the proposed method. Another recent study by [30] took a deep dive into the Adaptive Conformal Inference (ACI) [31], leading to the development of the Aggregation Adaptive Conformal Inference (AgACI) method. In situations where a dataset contains a group of similar and independent time series, treating each time series as a separate observation, applying a CP method becomes straightforward [32]. For a comprehensive tutorial on CP methods, one can refer to [33]. Beyond CP, other approaches for addressing uncertainty quantification in time series datasets include multi-horizon probabilistic forecasting [34], methods based on dropout [35], and recursive Bayesian approaches [36].

3 Sparse Deep Learning for Time Series Data: Theory

Let $D_n=\{y_1,\dots,y_n\}$ denote a time series sequence, where $y_i\in\mathbb{R}$. Let $(\Omega,\mathcal{F},P^*)$ be the probability space of $D_n$, and let $\alpha_k=\sup\{|P^*(y_j\in A, y_{k+j}\in B)-P^*(y_j\in A)P^*(y_{k+j}\in B)|: A,B\in\mathcal{F}, j\in\mathbb{N}^+\}$ be the $k$-th order $\alpha$-mixing coefficient.

Assumption 3.1.

The time series $D_n$ is (strictly) stationary and $\alpha$-mixing with an exponentially decaying mixing coefficient, and follows an autoregressive model of order $l$:

$y_i=\mu^*(y_{i-1:i-l},{\boldsymbol{u}}_i)+\eta_i, \qquad (1)$

where $\mu^*$ is a non-linear function, $y_{i-1:i-l}=(y_{i-1},\dots,y_{i-l})$, ${\boldsymbol{u}}_i$ contains optional exogenous variables, and $\eta_i\stackrel{i.i.d.}{\sim}N(0,\sigma^2)$ with $\sigma^2$ assumed to be a constant.

Remark 3.2.

Similar assumptions are commonly adopted to establish asymptotic properties of stochastic processes [37, 38, 39, 40, 41, 28]. For example, the asymptotic normality of the maximum likelihood estimator (MLE) can be established under the assumption that the time series is strictly stationary and ergodic, provided that the model size is fixed [37]. A posterior contraction rate of the autoregressive (AR) model can be obtained by assuming it is $\alpha$-mixing with $\sum_{k=0}^{\infty}\alpha_k^{1-2/s}<\infty$ for some $s>2$, which is implied by an exponentially decaying mixing coefficient [41]. For stochastic processes that are strictly stationary and $\beta$-mixing, results such as uniform laws of large numbers and convergence rates of empirical processes [38, 39] can also be obtained.

Remark 3.3.

Extending the results of [17, 18] to the case where the dataset includes a set of i.i.d. time series, with each time series regarded as an individual observation, is straightforward, because in that case all observations are independent and share the same distribution.

3.1 Posterior Consistency

Both the MLP and the RNN can be used to approximate $\mu^*$ as defined in (1); for simplicity, we do not explicitly denote the exogenous variables ${\boldsymbol{u}}_i$ unless necessary. For the MLP, we can formulate the task as a regression problem, where the input is ${\boldsymbol{x}}_i=y_{i-1:i-R_l}$ for some $l\leq R_l\ll n$ and the corresponding output is $y_i$; the dataset $D_n$ can then be expressed as $\{({\boldsymbol{x}}_i,y_i)\}_{i=1+R_l}^{n}$. Detailed settings and results for the MLP are given in Appendix B.3. In what follows, we focus on the RNN, which serves as an extension of the previous studies. For the RNN, we can rewrite the training dataset as $D_n=\{y_{i:i-M_l+1}\}_{i=M_l}^{n}$ for some $l\leq R_l<M_l\ll n$, i.e., we split the entire sequence into a set of shorter sequences, where $R_l$ denotes an upper bound for the exact AR order $l$, and $M_l$ denotes the length of these shorter sequences (see Figure 1). We assume $R_l$ is known but not $l$, since in practice it is unlikely that we know the exact order $l$.

Figure 1: A multi-layer RNN with an input window size of $k$. We restrict the use of the RNN's outputs until the hidden states have accumulated a sufficient quantity of past information to ensure accurate predictions.
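As a concrete illustration of this splitting step, the following sketch (Python/NumPy; the function name and the toy series are our own, not from the paper's code) builds the set of length-$M_l$ subsequences $\{y_{i:i-M_l+1}\}_{i=M_l}^{n}$ from a single series.

```python
import numpy as np

def make_windows(y, M_l):
    """Split a 1-D series y_1,...,y_n into overlapping subsequences
    y_{i-M_l+1:i} for i = M_l,...,n (one window per prediction target)."""
    n = len(y)
    return np.stack([y[i - M_l:i] for i in range(M_l, n + 1)])

# toy example
y = np.arange(1, 11, dtype=float)      # a series of length n = 10
windows = make_windows(y, M_l=4)       # shape (n - M_l + 1, M_l) = (7, 4)
print(windows[0])                      # [1. 2. 3. 4.]
```

Each window supplies the inputs for one subsequence, and, as in Figure 1, only the outputs produced after enough past information has accumulated would be used for training.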

For simplicity of notation, we do not distinguish between the weights and biases of the RNN. In this paper, the presence of the subscript $n$ in the notation of a variable indicates its potential to increase with the sample size $n$. To define an RNN with $H_n-1$ hidden layers, for $h\in\{1,2,\dots,H_n\}$, we let $\psi^h$ and $L_h$ denote, respectively, the nonlinear activation function and the number of hidden neurons at layer $h$. We set $L_{H_n}=1$ and $L_0=p_n$, where $p_n$ denotes a generic input dimension. Because of the existence of hidden states from the past, the input ${\boldsymbol{x}}_i$ can contain only $y_{i-1}$, or $y_{i-1:i-r}$ for some $r>1$. Let ${\boldsymbol{w}}^h\in\mathbb{R}^{L_h\times L_{h-1}}$ and ${\boldsymbol{v}}^h\in\mathbb{R}^{L_h\times L_h}$ denote the weight matrices at layer $h$. With these notations, the output at step $i$ of an RNN model can be expressed as

$\mu({\boldsymbol{x}}_i,\{{\boldsymbol{z}}_{i-1}^h\}_{h=1}^{H_n-1},{\boldsymbol{\beta}})={\boldsymbol{w}}^{H_n}\psi^{H_n-1}[\cdots\psi^1[{\boldsymbol{w}}^1{\boldsymbol{x}}_i+{\boldsymbol{v}}^1{\boldsymbol{z}}_{i-1}^1]\cdots], \qquad (2)$

where ${\boldsymbol{z}}_i^h=\psi^h[{\boldsymbol{w}}^h{\boldsymbol{z}}_i^{h-1}+{\boldsymbol{v}}^h{\boldsymbol{z}}_{i-1}^h]$ denotes the hidden state of layer $h$ at step $i$ with ${\boldsymbol{z}}_0^h={\boldsymbol{0}}$; and ${\boldsymbol{\beta}}$ is the collection of all weights, consisting of $K_n=\sum_{h=1}^{H_n}(L_h\times L_{h-1})+\sum_{h=1}^{H_n-1}(L_h^2)+L_h$ elements. To represent the structure of a sparse RNN, we introduce an indicator variable for each weight in ${\boldsymbol{\beta}}$. Let ${\boldsymbol{\gamma}}=\{\gamma_j\in\{0,1\}: j=1,\dots,K_n\}$, which specifies the structure of a sparse RNN. To include information on the network structure ${\boldsymbol{\gamma}}$ and keep the notation concise, we redenote $\mu({\boldsymbol{x}}_i,\{{\boldsymbol{z}}_{i-1}^h\}_{h=1}^{H_n-1},{\boldsymbol{\beta}})$ by $\mu({\boldsymbol{x}}_{i:i-M_l+1},{\boldsymbol{\beta}},{\boldsymbol{\gamma}})$, as $\{{\boldsymbol{z}}_{i-1}^h\}_{h=1}^{H_n-1}$ depends only on $({\boldsymbol{\beta}},{\boldsymbol{\gamma}})$ and up to ${\boldsymbol{x}}_{i-1:i-M_l+1}$.
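To make the recursion in (2) concrete, here is a minimal sketch of a single-hidden-layer version with a tanh activation; the weight shapes and random initialization are illustrative assumptions only.

```python
import numpy as np

def rnn_forward(x_seq, w1, v1, w2):
    """One-hidden-layer RNN: z_i = tanh(w1 x_i + v1 z_{i-1}), out_i = w2 z_i.
    x_seq: (T, p) inputs; w1: (L1, p); v1: (L1, L1); w2: (1, L1)."""
    L1 = w1.shape[0]
    z = np.zeros(L1)                    # z_0 = 0, as in the text
    outputs = []
    for x in x_seq:
        z = np.tanh(w1 @ x + v1 @ z)    # hidden-state update
        outputs.append((w2 @ z).item()) # step output mu(x_i, z_{i-1}, beta)
    return np.array(outputs)

# toy example with random weights
rng = np.random.default_rng(0)
p, L1, T = 3, 8, 5
out = rnn_forward(rng.normal(size=(T, p)),
                  rng.normal(size=(L1, p)) * 0.1,
                  rng.normal(size=(L1, L1)) * 0.1,
                  rng.normal(size=(1, L1)) * 0.1)
print(out.shape)  # (5,)
```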

Posterior consistency is an essential concept in Bayesian statistics, which forms the basis of Bayesian inference. While posterior consistency generally holds for low-dimensional problems, establishing it becomes challenging in high-dimensional scenarios. In such cases, the dimensionality often surpasses the sample size, and if the prior is not appropriately elicited, prior information can overpower the data information, leading to posterior inconsistency.

Following [17, 18, 19], we let each connection weight be subject to a mixture Gaussian prior, i.e.,

$\beta_j\sim\lambda_n N(0,\sigma_{1,n}^2)+(1-\lambda_n)N(0,\sigma_{0,n}^2),\quad j\in\{1,2,\dots,K_n\}, \qquad (3)$

by integrating out the structure information ${\boldsymbol{\gamma}}$, where $\lambda_n\in(0,1)$ is the mixture proportion, $\sigma_{0,n}^2$ is typically set to a very small number, while $\sigma_{1,n}^2$ is relatively large. Visualizations of the mixture Gaussian prior for different values of $\lambda_n$, $\sigma_{0,n}^2$, and $\sigma_{1,n}^2$ are given in Appendix E.
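For concreteness, a small sketch evaluating the log-density of the prior (3) elementwise is given below; the hyperparameter values in the example are arbitrary and only for illustration.

```python
import numpy as np

def log_mixture_prior(beta, lam, sigma0, sigma1):
    """Elementwise log-density of lam*N(0, sigma1^2) + (1-lam)*N(0, sigma0^2)."""
    def log_normal(b, s):
        return -0.5 * np.log(2 * np.pi * s**2) - b**2 / (2 * s**2)
    # log-sum-exp of the two mixture components for numerical stability
    a = np.log(lam) + log_normal(beta, sigma1)
    b = np.log(1 - lam) + log_normal(beta, sigma0)
    m = np.maximum(a, b)
    return m + np.log(np.exp(a - m) + np.exp(b - m))

beta = np.array([0.0, 0.01, 0.5, 2.0])
print(log_mixture_prior(beta, lam=1e-4, sigma0=1e-4, sigma1=1.0))
```

Weights near zero fall under the sharp spike component $N(0,\sigma_{0,n}^2)$, while larger weights are supported by the flat slab $N(0,\sigma_{1,n}^2)$, which is what induces sparsity.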

We assume $\mu^*$ can be well approximated by a sparse RNN given enough past information, and refer to this sparse RNN as the true RNN model. To be more specific, we define the true RNN model as

$({\boldsymbol{\beta}}^*,{\boldsymbol{\gamma}}^*)=\operatorname*{arg\,min}_{({\boldsymbol{\beta}},{\boldsymbol{\gamma}})\in\mathcal{G}_n,\;\|\mu({\boldsymbol{x}}_{i:i-M_l+1},{\boldsymbol{\beta}},{\boldsymbol{\gamma}})-\mu^*(y_{i-1:i-l})\|_{L_2(\Omega)}\leq\varpi_n}|{\boldsymbol{\gamma}}|, \qquad (4)$

where $\mathcal{G}_n:=\mathcal{G}(C_0,C_1,\epsilon,p_n,L_1,L_2,\dots,L_{H_n})$ denotes the space of all valid networks that satisfy Assumption 3.4 for the given values of $H_n$, $p_n$, and $L_h$'s, and $\varpi_n$ is some sequence converging to 0 as $n\to\infty$. For any given RNN $({\boldsymbol{\beta}},{\boldsymbol{\gamma}})$, the error $|\mu^*(\cdot)-\mu(\cdot,{\boldsymbol{\beta}},{\boldsymbol{\gamma}})|$ can be decomposed into the approximation error $|\mu^*(\cdot)-\mu(\cdot,{\boldsymbol{\beta}}^*,{\boldsymbol{\gamma}}^*)|$ and the estimation error $|\mu(\cdot,{\boldsymbol{\beta}}^*,{\boldsymbol{\gamma}}^*)-\mu(\cdot,{\boldsymbol{\beta}},{\boldsymbol{\gamma}})|$. The former is bounded by $\varpi_n$, and the order of the latter will be given in Theorem 3.8. For the sparse RNN, we make the following assumptions:

Assumption 3.4.

The true sparse RNN model $({\boldsymbol{\beta}}^*,{\boldsymbol{\gamma}}^*)$ satisfies the following conditions:

  • The network structure satisfies $r_nH_n\log(n)+r_n\log(\bar{L})+s_n\log(p_n)\leq C_0n^{1-\epsilon}$, where $0<\epsilon<1$ is a small constant, $r_n=|{\boldsymbol{\gamma}}^*|$ denotes the connectivity of ${\boldsymbol{\gamma}}^*$, $\bar{L}=\max_{1\leq j\leq H_n-1}L_j$ denotes the maximum hidden state dimension, and $s_n$ denotes the input dimension of ${\boldsymbol{\gamma}}^*$.

  • The network weights are polynomially bounded: $\|{\boldsymbol{\beta}}^*\|_{\infty}\leq E_n$, where $E_n=n^{C_1}$ for some constant $C_1>0$.

Remark 3.5.

Assumption 3.4 is identical to Assumption A.2 of [17], which limits the connectivity of the true RNN model to be of $o(n^{1-\epsilon})$ for some $0<\epsilon<1$. Then, as implied by Lemma 3.9, an RNN of size $O(n/\log(n))$ is large enough for modeling a time series sequence of length $n$. Refer to [17] for discussions on the universal approximation ability of the neural network under this assumption; the universal approximation ability can still hold for many classes of functions, such as affine functions, piecewise smooth functions, and bounded $\alpha$-Hölder smooth functions.

Assumption 3.6.

The activation function $\psi^h$ is bounded for $h=1$ (e.g., sigmoid and tanh), and is Lipschitz continuous with a Lipschitz constant of 1 for $2\leq h\leq H_n$ (e.g., ReLU, sigmoid and tanh).

Remark 3.7.

Assumption 3.6 mirrors [17, 18], except that we require the activation function for the first layer to be bounded. This extra assumption can be viewed as a replacement for the boundedness assumption for the input variables of a conventional DNN.

Let $d(p_1,p_2)$ denote the integrated Hellinger distance between two conditional densities $p_1(y|{\boldsymbol{x}})$ and $p_2(y|{\boldsymbol{x}})$. Let $\pi(\cdot|D_n)$ be the posterior probability of an event. Theorem 3.8 establishes posterior consistency for the RNN model with the mixture Gaussian prior (3).

Theorem 3.8.

Suppose Assumptions 3.1, 3.4, and 3.6 hold. If the mixture Gaussian prior (3) satisfies the conditions $\lambda_n=O(1/[M_l^{H_n}K_n[n^{2M_lH_n}(\bar{L}p_n)]^{\tau}])$ for some constant $\tau>0$, $E_n/[H_n\log(n)+\log(\bar{L})]^{1/2}\leq\sigma_{1,n}\leq n^{\alpha}$ for some constant $\alpha$, and $\sigma_{0,n}\leq\min\{1/[M_l^{H_n}\sqrt{n}K_n(n^{3/2}\sigma_{1,n}/H_n)^{2M_lH_n}],\,1/[M_l^{H_n}\sqrt{n}K_n(nE_n/H_n)^{2M_lH_n}]\}$, then there exists an error sequence $\epsilon_n^2=O(\varpi_n^2)+O(\zeta_n^2)$ such that $\lim_{n\to\infty}\epsilon_n=0$ and $\lim_{n\to\infty}n\epsilon_n^2=\infty$, and the posterior distribution satisfies

$\pi(d(p_{{\boldsymbol{\beta}}},p_{\mu^*})\geq C\epsilon_n|D_n)=O_{P^*}(e^{-n\epsilon_n^2}) \qquad (5)$

for sufficiently large $n$ and $C>0$, where $\zeta_n^2=[r_nH_n\log(n)+r_n\log(\bar{L})+s_n\log(p_n)]/n$, $p_{\mu^*}$ denotes the underlying true data distribution, and $p_{{\boldsymbol{\beta}}}$ denotes the data distribution reconstructed by the Bayesian RNN based on its posterior samples.

3.2 Uncertainty Quantification with Sparse RNNs

As mentioned previously, posterior consistency forms the basis for Bayesian inference with the RNN model. Based on Theorem 3.8, we further establish structure selection consistency and asymptotic normality of connection weights and predictions for the sparse RNN. In particular, the asymptotic normality of predictions enables the prediction intervals with correct coverage rates to be constructed.

Structure Selection Consistency

It is known that the neural network often suffers from a non-identifiability issue due to the symmetry of its structure. For instance, one can permute certain hidden nodes or simultaneously change the signs or scales of certain weights while keeping the output of the neural network invariant. To address this issue, we follow [17] to define an equivalent class of RNNs, denoted by $\Theta$, which is a set of RNNs such that any possible RNN for the problem can be represented by one and only one RNN in $\Theta$ via appropriate weight transformations. Let $\nu({\boldsymbol{\beta}},{\boldsymbol{\gamma}})\in\Theta$ denote an operator that maps any RNN to $\Theta$. To serve the purpose of structure selection in the space $\Theta$, we consider the marginal posterior inclusion probability (MIPP) approach. Formally, for each connection weight $i=1,\dots,K_n$, we define its MIPP as $q_i=\int\sum_{{\boldsymbol{\gamma}}}e_{i|\nu({\boldsymbol{\beta}},{\boldsymbol{\gamma}})}\pi({\boldsymbol{\gamma}}|{\boldsymbol{\beta}},D_n)\pi({\boldsymbol{\beta}}|D_n)d{\boldsymbol{\beta}}$, where $e_{i|\nu({\boldsymbol{\beta}},{\boldsymbol{\gamma}})}$ is the inclusion indicator of connection $i$ in $\nu({\boldsymbol{\beta}},{\boldsymbol{\gamma}})$. The MIPP approach selects the connections whose MIPPs exceed a threshold $\hat{q}$. Let $\hat{{\boldsymbol{\gamma}}}_{\hat{q}}=\{i: q_i>\hat{q},\, i=1,\dots,K_n\}$ denote an estimator of ${\boldsymbol{\gamma}}_*=\{i: e_{i|\nu({\boldsymbol{\beta}}^*,{\boldsymbol{\gamma}}^*)}=1,\, i=1,\dots,K_n\}$. Let $A(\epsilon_n)=\{{\boldsymbol{\beta}}: d(p_{{\boldsymbol{\beta}}},p_{\mu^*})\leq\epsilon_n\}$ and $\rho(\epsilon_n)=\max_{1\leq i\leq K_n}\int_{A(\epsilon_n)}\sum_{{\boldsymbol{\gamma}}}|e_{i|\nu({\boldsymbol{\beta}},{\boldsymbol{\gamma}})}-e_{i|\nu({\boldsymbol{\beta}}^*,{\boldsymbol{\gamma}}^*)}|\pi({\boldsymbol{\gamma}}|{\boldsymbol{\beta}},D_n)\pi({\boldsymbol{\beta}}|D_n)d{\boldsymbol{\beta}}$, which measures, on $A(\epsilon_n)$, the structure difference between the true RNN and those sampled from the posterior. Then we have the following lemma:

Lemma 3.9.

If the conditions of Theorem 3.8 are satisfied and $\rho(\epsilon_n)\to 0$ as $n\to\infty$, then

  • (i) $\max_{1\leq i\leq K_n}\{|q_i-e_{i|\nu({\boldsymbol{\beta}}^*,{\boldsymbol{\gamma}}^*)}|\}\overset{p}{\to}0$ as $n\to\infty$;

  • (ii) (sure screening) $P({\boldsymbol{\gamma}}_*\subset\hat{{\boldsymbol{\gamma}}}_{\hat{q}})\overset{p}{\to}1$ as $n\to\infty$, for any prespecified $\hat{q}\in(0,1)$;

  • (iii) (consistency) $P({\boldsymbol{\gamma}}_*=\hat{{\boldsymbol{\gamma}}}_{0.5})\overset{p}{\to}1$ as $n\to\infty$;

where $\overset{p}{\to}$ denotes convergence in probability.

Remark 3.10.

This lemma implies that we can filter out irrelevant variables and simplify the RNN structure when appropriate. Please refer to Section 5.2 for a numerical illustration.
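As a minimal sketch of the MIPP selection rule underlying Lemma 3.9(iii), the code below estimates each $q_i$ by a Monte Carlo average of posterior inclusion-indicator draws (simulated here purely for illustration) and keeps the connections with $q_i>0.5$.

```python
import numpy as np

def select_structure(indicator_draws, q_hat=0.5):
    """indicator_draws: (num_posterior_samples, K_n) array of 0/1 inclusion
    indicators; returns estimated MIPPs and the selected connection indices."""
    q = indicator_draws.mean(axis=0)       # Monte Carlo estimate of q_i
    selected = np.where(q > q_hat)[0]      # gamma_hat_{q_hat}
    return q, selected

# toy illustration with 3 "relevant" connections out of 10
rng = np.random.default_rng(1)
probs = np.array([0.95, 0.9, 0.85] + [0.05] * 7)
draws = rng.binomial(1, probs, size=(200, 10))
q, selected = select_structure(draws)
print(selected)   # typically array([0, 1, 2])
```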

Asymptotic Normality of Connection Weights and Predictions

The following two theorems establish the asymptotic normality of $\tilde{\nu}({\boldsymbol{\beta}})$ and predictions, where $\tilde{\nu}({\boldsymbol{\beta}})$ denotes a transformation of ${\boldsymbol{\beta}}$ which is invariant with respect to $\mu(\cdot,{\boldsymbol{\beta}},{\boldsymbol{\gamma}})$ while minimizing $\|\tilde{\nu}({\boldsymbol{\beta}})-{\boldsymbol{\beta}}^*\|_{\infty}$.

We follow the same definition of asymptotic normality as in [18, 42, 21]. The posterior distribution of the function $g({\boldsymbol{\beta}})$ is asymptotically normal with center $g^*$ and variance $G$ if, for $d_{{\boldsymbol{\beta}}}$ the bounded Lipschitz metric for weak convergence and $\phi_n$ the mapping $\phi_n:{\boldsymbol{\beta}}\to\sqrt{n}(g({\boldsymbol{\beta}})-g^*)$, it holds, as $n\to\infty$, that

$d_{{\boldsymbol{\beta}}}(\pi(\cdot|D_n)\circ\phi_n^{-1},N(0,G))\to 0, \qquad (6)$

in $P^*$-probability, which we also denote by $\pi(\cdot|D_n)\circ\phi_n^{-1}\leadsto N(0,G)$.

The detailed assumptions and setups for the following two theorems are given in Appendix C. For simplicity, we let $M_l=R_l+1$, and let $l_n({\boldsymbol{\beta}})=\frac{1}{n-R_l}\sum_{i=R_l+1}^{n}\log(p_{{\boldsymbol{\beta}}}(y_i|y_{i-1:i-R_l}))$ denote the averaged log-likelihood function. Let $H_n({\boldsymbol{\beta}})$ denote the Hessian matrix of $l_n({\boldsymbol{\beta}})$, and let $h_{i,j}({\boldsymbol{\beta}})$ and $h^{i,j}({\boldsymbol{\beta}})$ denote the $(i,j)$-th elements of $H_n({\boldsymbol{\beta}})$ and $H_n^{-1}({\boldsymbol{\beta}})$, respectively.
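Under the Gaussian noise assumption in (1), $l_n(\boldsymbol{\beta})$ has a simple closed form; the sketch below is an illustrative transcription of that formula, with the network's one-step predictions assumed to be supplied as an array.

```python
import numpy as np

def averaged_log_likelihood(y, preds, sigma2, R_l):
    """l_n(beta) = (1/(n - R_l)) * sum_{i=R_l+1}^{n} log N(y_i | mu_i, sigma2),
    where preds[k] is the network prediction mu(y_{i-1:i-R_l}, beta) for
    the k-th target y_i, i = R_l+1,...,n."""
    targets = y[R_l:]                     # y_{R_l+1}, ..., y_n
    resid = targets - preds
    log_dens = -0.5 * np.log(2 * np.pi * sigma2) - resid**2 / (2 * sigma2)
    return log_dens.mean()

y = np.array([0.1, -0.2, 0.05, 0.3, -0.1])
preds = np.zeros(3)                       # with R_l = 2, three targets
print(averaged_log_likelihood(y, preds, sigma2=1.0, R_l=2))
```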

Theorem 3.11.

Assume that the conditions of Lemma 3.9 hold with $\rho(\epsilon_n)=o(\frac{1}{K_n})$ and that the additional assumptions given in Appendix C hold. Then $\pi(\sqrt{n}(\tilde{\nu}({\boldsymbol{\beta}})-{\boldsymbol{\beta}}^*)|D_n)\leadsto N({\boldsymbol{0}},{\boldsymbol{V}})$ in $P^*$-probability as $n\to\infty$, where ${\boldsymbol{V}}=(v_{i,j})$, and $v_{i,j}=E(h^{i,j}({\boldsymbol{\beta}}^*))$ if $i,j\in{\boldsymbol{\gamma}}^*$ and $v_{i,j}=0$ otherwise.

Theorem 3.12 establishes asymptotic normality of the sparse RNN prediction, which implies prediction consistency and forms the theoretical basis for prediction uncertainty quantification as well.

Theorem 3.12.

Assume that the conditions of Theorem 3.11 and the additional assumptions given in Appendix C hold. Then $\pi(\sqrt{n}(\mu(\cdot,{\boldsymbol{\beta}})-\mu(\cdot,{\boldsymbol{\beta}}^*))|D_n)\leadsto N(0,\Sigma)$, where $\Sigma=\nabla_{{\boldsymbol{\gamma}}^*}\mu(\cdot,{\boldsymbol{\beta}}^*)'H^{-1}\nabla_{{\boldsymbol{\gamma}}^*}\mu(\cdot,{\boldsymbol{\beta}}^*)$ and $H=E(-\nabla^2_{{\boldsymbol{\gamma}}^*}l_n({\boldsymbol{\beta}}^*))$ is the Fisher information matrix.

4 Computation

In the preceding section, we established a theoretical foundation for sparse deep learning with time series data under the Bayesian framework. Building on [17], it is straightforward to show that the Bayesian computation can be simplified by invoking the Laplace approximation theorem at the maximum a posteriori (MAP) estimator. This essentially transforms the proposed Bayesian method into a regularization method, by interpreting the log-prior density function as a penalty on the log-likelihood function in RNN training. Consequently, we can train the regularized RNN model using an optimization algorithm such as SGD or Adam. To address the local-trap issue potentially suffered by these methods, we train the regularized RNN using a prior annealing algorithm [18], as described in Algorithm 1. For a trained RNN, we sparsify its structure by truncating the weights below a threshold to zero and then refine the nonzero weights to attain the MAP estimator. For algorithmic specifics, refer to Appendix D. Below, we outline the steps for constructing prediction intervals for one-step-ahead forecasts (a code sketch of these steps is given after the list), where $\mu(y_{k-1:k-R_l},\hat{{\boldsymbol{\beta}}})$ is one-dimensional, and $\hat{{\boldsymbol{\beta}}}$ and $\hat{{\boldsymbol{\gamma}}}$ represent the estimators of the network parameters and structure, respectively, as obtained by Algorithm 1:

  • Estimate $\sigma^2$ by $\hat{\sigma}^2=\frac{1}{n-R_l-1}\sum_{i=R_l+1}^{n}\|y_i-\mu(y_{i-1:i-R_l},\hat{{\boldsymbol{\beta}}})\|^2$.

  • For a test point $y_{k-1:k-R_l}$, estimate $\Sigma$ in Theorem 3.12 by

    $\hat{\varsigma}^2=\nabla_{\hat{{\boldsymbol{\gamma}}}}\mu(y_{k-1:k-R_l},\hat{{\boldsymbol{\beta}}})'(-\nabla^2_{\hat{{\boldsymbol{\gamma}}}}l_n(\hat{{\boldsymbol{\beta}}}))^{-1}\nabla_{\hat{{\boldsymbol{\gamma}}}}\mu(y_{k-1:k-R_l},\hat{{\boldsymbol{\beta}}}).$
  • The corresponding $(1-\alpha)\%$ prediction interval is given by

    $\mu(y_{k-1:k-R_l},\hat{{\boldsymbol{\beta}}})\pm z_{\alpha/2}\sqrt{\hat{\varsigma}^2/(n-R_l)+\hat{\sigma}^2},$

    where $n-R_l$ is the number of observations used in training, and $z_{\alpha/2}$ denotes the upper $\alpha/2$-quantile of the standard Gaussian distribution.
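The three steps above translate directly into code. The sketch below is only an illustration of the formulas, assuming the gradient of the prediction and the Hessian of $l_n$ with respect to the retained nonzero weights have already been computed (e.g., by automatic differentiation) and that SciPy is available.

```python
import numpy as np
from scipy.stats import norm

def prediction_interval(mu_hat, grad_mu, hessian_ln, sigma2_hat,
                        n_train, alpha=0.1):
    """One-step-ahead (1 - alpha) prediction interval.
    mu_hat:      point prediction mu(y_{k-1:k-R_l}, beta_hat)
    grad_mu:     gradient of the prediction w.r.t. the nonzero weights
    hessian_ln:  Hessian of l_n(beta_hat) w.r.t. the same weights
    sigma2_hat:  estimated noise variance
    n_train:     number of training observations (n - R_l)."""
    varsigma2 = grad_mu @ np.linalg.solve(-hessian_ln, grad_mu)  # plug-in variance
    half_width = norm.ppf(1 - alpha / 2) * np.sqrt(
        varsigma2 / n_train + sigma2_hat)
    return mu_hat - half_width, mu_hat + half_width

# toy numbers purely for illustration
grad = np.array([0.2, -0.1, 0.05])
H = -np.eye(3) * 50.0          # a negative-definite Hessian of l_n
print(prediction_interval(1.3, grad, H, sigma2_hat=0.4, n_train=1000))
```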

For construction of multi-horizon prediction intervals, see Appendix F.

5 Numerical Experiments

5.1 Uncertainty Quantification

As mentioned in Section 3, we will consider two types of time series datasets: the first type comprises a single time series, and the second type consists of a set of time series. We will compare the performance of our method against the state-of-the-art Conformal Prediction (CP) methods for both types of datasets. We set $\alpha=0.1$ (the error rate) for all uncertainty quantification experiments in the paper, and so the nominal coverage level of the prediction intervals is 90%.

5.1.1 A Single Time Series: French Electricity Spot Prices

We perform one-step-ahead forecasts on the French electricity spot prices data from 2016 to 2020, which consist of 35,064 observations. A detailed description and visualization of this time series are given in Appendix G.1. Our goal is to predict the 24 hourly prices of the following day, given the prices up until the end of the current day. As the hourly prices exhibit distinct patterns, we fit one model per hour, as in the CP baseline [30]. We follow the data splitting strategy used in [30], where the data of the first three years (2016-2019) are used as the (initial) training set, and predictions are made for the last year (2019-2020).

For all the methods considered, we use the same underlying neural network model: an MLP with one hidden layer of size 100 and the sigmoid activation function. More details on the training process are provided in Appendix G.1. For the state-of-the-art CP methods, EnbPI-V2 [28], NEX-CP [29], ACI [31] and AgACI [30], we conduct experiments in an online fashion, where the model is trained using a sliding window of the previous three years of data (refer to Figure 4 in the Appendix). Specifically, after constructing the prediction interval for each time step in the prediction period, we add the ground truth value to the training set and then retrain the model with the updated training set. For ACI, we conduct experiments with various values of $\gamma$ and present the one that yields the best performance. In the case of AgACI, we adopt the same aggregation approach as used in [30], namely, the Bernstein Online Aggregation (BOA) method [43] with a gradient trick. We also report the performance of ACI with $\gamma=0$ as a reference. For NEX-CP, we use the same weights as those employed in their time-series experiments. For EnbPI-V2, we tune the number of bootstrap models and select the one that offers the best performance.

Since this time series exhibits no or only a minor distribution shift, our method (PA) is trained in an offline fashion, where the model is fitted once using only the observations between 2016 and 2019, and the observations in the prediction range are used only for the final evaluation. That is, our method uses less data information in training compared to the baseline methods.

The results are presented in Figure 2, which includes empirical coverage (with methods that are positioned closer to 90.0 being more effective), median/average prediction interval length, and corresponding interquartile range/standard deviation. As expected, our method is able to train and calibrate the model by using only the initial training set, i.e., the data for 2016-2019, and successfully produces faithful prediction intervals. In contrast, all CP methods produce wider prediction intervals than ours and higher coverage rates than the nominal level of 90%. In addition, ACI is sensitive to the choice of γ\gamma [30].

Figure 2: Comparisons of our method and baselines on hourly spot electricity prices in France. Left: Average prediction length with vertical error bars indicating the standard deviation of the prediction interval lengths. Right: Median prediction length with vertical error bars indicating the interquartile range of the prediction interval lengths. The precise values for these metrics are provided in Table 5 of Appendix G.1.

5.1.2 A Set of Time Series

We conduct experiments using three publicly available real-world datasets: Medical Information Mart for Intensive Care (MIMIC-III), electroencephalography (EEG) data, and daily COVID-19 case numbers within the United Kingdom’s local authority districts (COVID-19) [32]. A concise overview of these datasets is presented in Table 1. Our method, denoted as PA-RNN (since the underlying prediction model is an LSTM [44]), is compared to three benchmark methods: CF-RNN [32], MQ-RNN [34], and DP-RNN [35], where LSTM is used for all these methods. To ensure a fair comparison, we adhere to the same model structure, hyperparameters, and data processing steps as specified in [32]. Detailed information regarding the three datasets and training procedures can be found in Appendix G.2.

The numerical results are summarized in Table 2. Note that the baseline results for EEG and COVID-19 are directly taken from the original paper [32]. We reproduce the baseline results for MIMIC-III, as the specific subset used by [32] is not publicly available. Table 2 indicates that our method consistently outperforms the baselines. In particular, our method consistently generates shorter prediction intervals than the conformal baseline CF-RNN while maintaining the same or an even better coverage rate. Both the MQ-RNN and DP-RNN methods fail to generate prediction intervals that accurately maintain a faithful coverage rate.

Table 1: A brief description of the datasets used in Section 5.1.2, where “Train size” and “Test size” indicate the numbers of training and testing sequences, respectively, and “Length” represents the length of the input sequence.
Dataset Train size Test size Length Prediction horizon
MIMIC-III [45] 2692 300 [3,47] 2
EEG [46] 19200 19200 40 10
COVID-19 [47] 300 80 100 50
Table 2: Uncertainty quantification results by different methods for the MIMIC-III, EEG, and COVID-19 data, where “Coverage” represents the joint coverage rate averaged over five different random seeds, the prediction interval length is averaged across prediction horizons and five random seeds, and the numbers in the parentheses indicate the standard deviations (across five random seeds) of the respective means.

Model Coverage (MIMIC-III) PI length (MIMIC-III) Coverage (EEG) PI length (EEG) Coverage (COVID-19) PI length (COVID-19)
PA-RNN 90.8% (0.7%) 2.35 (0.26) 94.69% (0.4%) 51.02 (10.50) 90.5% (1.4%) 444.93 (203.28)
CF-RNN 92.0% (0.3%) 3.01 (0.17) 96.5% (0.4%) 61.86 (18.02) 89.7% (2.4%) 733.95 (582.52)
MQ-RNN 85.3% (0.2%) 2.64 (0.11) 48.0% (1.8%) 21.39 (2.36) 15.0% (2.6%) 136.56 (63.32)
DP-RNN 1.2% (0.3%) 0.12 (0.01) 3.3% (0.3%) 7.39 (0.74) 0.0% (0.0%) 61.18 (32.37)

5.2 Autoregressive Order Selection

In this section, we evaluate the performance of our method in selecting the autoregressive order for two synthetic autoregressive processes. The first is the non-linear autoregressive (NLAR) process [48, 49, 50]:

$y_i=-0.17+0.85y_{i-1}+0.14y_{i-2}-0.31y_{i-3}+0.08y_{i-7}+12.80\,G_1(y_{i-1:i-7})+2.44\,G_2(y_{i-1:i-7})+\eta_i,$

where $\eta_i\sim N(0,1)$ represents i.i.d. Gaussian random noises, and the functions $G_1$ and $G_2$ are defined as:

$G_1(y_{i-1:i-7})=\left(1+\exp\left(-0.46(0.29y_{i-1}-0.87y_{i-2}+0.40y_{i-7}-6.68)\right)\right)^{-1},$
$G_2(y_{i-1:i-7})=\left(1+\exp\left(-1.17\times 10^{-3}(0.83y_{i-1}-0.53y_{i-2}-0.18y_{i-7}+0.38)\right)\right)^{-1}.$

The second is the exponential autoregressive process [51]:

$y_i=\left(0.8-1.1\exp\left(-50y_{i-1}^2\right)\right)y_{i-1}+\eta_i,$

where, again, $\eta_i\sim N(0,1)$ denotes i.i.d. Gaussian random noises.
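For concreteness, the following sketch simulates the two synthetic processes defined above; the burn-in length and random seed are our own choices, not specifications from the paper.

```python
import numpy as np

def simulate_expar(n, burn_in=100, seed=0):
    """Exponential AR process: y_i = (0.8 - 1.1 exp(-50 y_{i-1}^2)) y_{i-1} + eta_i."""
    rng = np.random.default_rng(seed)
    y = np.zeros(n + burn_in)
    for i in range(1, n + burn_in):
        y[i] = (0.8 - 1.1 * np.exp(-50 * y[i - 1] ** 2)) * y[i - 1] + rng.normal()
    return y[burn_in:]

def simulate_nlar(n, burn_in=100, seed=0):
    """NLAR process of order 7 with the two logistic components G1 and G2."""
    rng = np.random.default_rng(seed)
    y = np.zeros(n + burn_in)
    for i in range(7, n + burn_in):
        g1 = 1.0 / (1.0 + np.exp(-0.46 * (0.29 * y[i-1] - 0.87 * y[i-2] + 0.40 * y[i-7] - 6.68)))
        g2 = 1.0 / (1.0 + np.exp(-1.17e-3 * (0.83 * y[i-1] - 0.53 * y[i-2] - 0.18 * y[i-7] + 0.38)))
        y[i] = (-0.17 + 0.85 * y[i-1] + 0.14 * y[i-2] - 0.31 * y[i-3]
                + 0.08 * y[i-7] + 12.80 * g1 + 2.44 * g2 + rng.normal())
    return y[burn_in:]

print(simulate_expar(5), simulate_nlar(5))
```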

For both synthetic processes, we generate five datasets. Each dataset consists of training, validation, and test sequences. The training sequence has 10000 samples, while the validation and test sequences each contain 1000 samples. For training, we employ a single-layer RNN with a hidden layer width of 1000. Further details on the experimental setting can be found in Appendix G.3.

For the NLAR process, we consider two different window sizes for RNN modeling: 15 (with input $y_{i-1:i-15}$) and 1 (with input $y_{i-1}$). Notably, the NLAR process has an order of 7. In the case where the window size is 15, the input information suffices for RNN modeling, rendering the past information conveyed by the hidden states redundant. However, when the window size is 1, this past information becomes indispensable for the RNN model. In contrast, the exponential autoregressive process has an order of 1. For all window sizes we explored, namely 1, 3, 5, 7, 10, 15, the input information is always sufficient for RNN modeling.

We evaluate the predictive performance using mean square prediction error (MSPE) and mean square fitting error (MSFE). The model selection performance is assessed by two metrics: the false selection rate (FSR) and the negative selection rate (NSR). The FSR is given by

$\text{FSR}=\frac{\sum_{j=1}^{5}|\hat{S}_j/S|}{\sum_{j=1}^{5}|\hat{S}_j|}$

and the NSR by

$\text{NSR}=\frac{\sum_{j=1}^{5}|S/\hat{S}_j|}{\sum_{j=1}^{5}|S|}$

where $S$ denotes the set of true variables, and $\hat{S}_j$ represents the set of selected variables from dataset $j$. Furthermore, we provide the final number of nonzero connections for the hidden states and the estimated autoregressive orders. The numerical results for the NLAR process are presented in Table 3, while the numerical results for the exponential autoregressive process can be found in Table 4.
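Both metrics reduce to simple set operations; a small sketch with made-up selected sets (the true lag set of the NLAR process is used for illustration) is given below.

```python
def fsr_nsr(true_vars, selected_sets):
    """FSR = (# falsely selected) / (# selected); NSR = (# missed) / (# true),
    both pooled over the datasets, following the formulas above."""
    S = set(true_vars)
    false_sel = sum(len(S_hat - S) for S_hat in selected_sets)
    total_sel = sum(len(S_hat) for S_hat in selected_sets)
    missed = sum(len(S - S_hat) for S_hat in selected_sets)
    return false_sel / total_sel, missed / (len(selected_sets) * len(S))

# toy illustration for the NLAR process (true lags 1, 2, 3, 7)
selected = [{1, 2, 3, 7}, {1, 2, 3, 7, 9}, {1, 2, 3, 7}, {1, 2, 7}, {1, 2, 3, 7}]
print(fsr_nsr({1, 2, 3, 7}, selected))   # (FSR, NSR)
```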

Our results are promising. Specifically, when the window size is equal to or exceeds the true autoregressive order, all connections associated with the hidden states are pruned, effectively converting the RNN into an MLP. Conversely, if the window size is smaller than the true autoregressive order, a significant number of connections from the hidden states are retained. Impressively, our method accurately identifies the autoregressive order, a noteworthy achievement considering the inherent dependencies in the time series data. Although our method produces a nonzero FSR for the NLAR process, this is quite reasonable considering the relatively short time sequence and the complexity of the functions $G_1$ and $G_2$.

Table 3: Numerical results for the NLAR process: numbers in the parentheses are standard deviations of the respective means, "-" indicates not applicable, \downarrow means lower is better, and “#hidden link” denotes the number of nonzero connections from the hidden states. All results are obtained from five independent runs.

Model Window size FSR ↓ NSR ↓ AR order #hidden link MSPE ↓ MSFE ↓
PA-RNN 1 0 0 - 357 (21) 1.056 (0.001) 1.057 (0.006)
PA-RNN 15 0.23 0 7.4 (0.25) 0 (0) 1.017 (0.008) 1.020 (0.010)

Table 4: Numerical results for the exponential autoregressive process.

Model Window size FSR ↓ NSR ↓ AR order #hidden link MSPE ↓ MSFE ↓
PA-RNN 1 0 0 1 (0) 0 (0) 1.004 (0.004) 1.003 (0.005)
PA-RNN 3 0 0 1 (0) 0 (0) 1.006 (0.005) 0.999 (0.004)
PA-RNN 5 0 0 1 (0) 0 (0) 1.000 (0.004) 1.007 (0.005)
PA-RNN 7 0 0 1 (0) 0 (0) 1.006 (0.005) 1.000 (0.003)
PA-RNN 10 0 0 1 (0) 0 (0) 1.002 (0.004) 1.002 (0.006)
PA-RNN 15 0 0 1 (0) 0 (0) 1.001 (0.004) 1.002 (0.007)

5.3 RNN Model Compression

We have also applied our method to RNN model compression, achieving state-of-the-art results. Please refer to Section G.4 in the Appendix for details.

6 Discussion

This paper has established the theoretical groundwork for sparse deep learning with time series data, including posterior consistency, structure selection consistency, and asymptotic normality of predictions. Our empirical studies indicate that sparse deep learning can outperform current cutting-edge approaches, such as conformal predictions, in prediction uncertainty quantification. More specifically, compared to conformal methods, our method maintains the same coverage rate, if not better, while generating significantly shorter prediction intervals. Furthermore, our method effectively determines the autoregression order for time series data and surpasses state-of-the-art techniques in large-scale model compression.

In summary, this paper represents a significant advancement in statistical inference for deep RNNs, which, through sparsification, successfully integrates RNNs into the framework of statistical modeling. The superiority of our method over the conformal methods further demonstrates the criticality of consistently approximating the underlying mechanism of the data generation process in uncertainty quantification.

The theory developed in this paper includes LSTM [44] as a special case, and some numerical examples have been conducted with LSTMs; see Section G of the Appendix for details. Furthermore, there is room for refining the theoretical study under varying mixing assumptions for time series data, which could broaden the applications of the proposed method. Also, the efficacy of the proposed method can potentially be further improved with the elicitation of different prior distributions.

References

  • [1] Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu. Modeling long-and short-term temporal patterns with deep neural networks. In The 41st international ACM SIGIR conference on research & development in information retrieval, pages 95–104, 2018.
  • [2] David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. Deepar: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3):1181–1191, 2020.
  • [3] Syama Sundar Rangapuram, Matthias W Seeger, Jan Gasthaus, Lorenzo Stella, Yuyang Wang, and Tim Januschowski. Deep state space models for time series forecasting. Advances in neural information processing systems, 31, 2018.
  • [4] Cristian Challu, Kin G Olivares, Boris N Oreshkin, Federico Garza, Max Mergenthaler, and Artur Dubrawski. N-hits: Neural hierarchical interpolation for time series forecasting. arXiv preprint arXiv:2201.12886, 2022.
  • [5] Hansika Hewamalage, Christoph Bergmeir, and Kasun Bandara. Recurrent neural networks for time series forecasting: Current status and future directions. International Journal of Forecasting, 37(1):388–427, 2021.
  • [6] Gábor Melis, Tomáš Kočiskỳ, and Phil Blunsom. Mogrifier lstm. arXiv preprint arXiv:1909.01792, 2019.
  • [7] Luciano Floridi and Massimo Chiriatti. Gpt-3: Its nature, scope, limits, and consequences. Minds and Machines, 30:681–694, 2020.
  • [8] Yuki Tatsunami and Masato Taki. Sequencer: Deep lstm for image classification. arXiv preprint arXiv:2205.01972, 2022.
  • [9] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
  • [10] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International conference on machine learning, pages 1321–1330. PMLR, 2017.
  • [11] Quynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. In International conference on machine learning, pages 2603–2612. PMLR, 2017.
  • [12] Marco Gori and Alberto Tesi. On the problem of local minima in backpropagation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(1):76–86, 1992.
  • [13] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pages 242–252. PMLR, 2019.
  • [14] Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In International conference on machine learning, pages 1675–1685. PMLR, 2019.
  • [15] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Gradient descent optimizes over-parameterized deep relu networks. Machine learning, 109(3):467–492, 2020.
  • [16] Difan Zou and Quanquan Gu. An improved analysis of training over-parameterized deep neural networks. Advances in neural information processing systems, 32, 2019.
  • [17] Yan Sun, Qifan Song, and Faming Liang. Consistent sparse deep learning: Theory and computation. Journal of the American Statistical Association, pages 1–15, 2021.
  • [18] Yan Sun, Wenjun Xiong, and Faming Liang. Sparse deep learning: A new framework immune to local traps and miscalibration. Advances in Neural Information Processing Systems, 34:22301–22312, 2021.
  • [19] Faming Liang, Qizhai Li, and Lei Zhou. Bayesian neural networks for selection of drug sensitive genes. Journal of the American Statistical Association, 113(523):955–972, 2018.
  • [20] Nicholas G Polson and Veronika Ročková. Posterior concentration for sparse deep learning. Advances in Neural Information Processing Systems, 31, 2018.
  • [21] Yuexi Wang and Veronika Rocková. Uncertainty quantification for sparse deep learning. In International Conference on Artificial Intelligence and Statistics, pages 298–308. PMLR, 2020.
  • [22] Johannes Schmidt-Hieber. Nonparametric regression using deep neural networks with relu activation function. The Annals of Statistics, 48(4):1916–1921, 2020.
  • [23] Helmut Bolcskei, Philipp Grohs, Gitta Kutyniok, and Philipp Petersen. Optimal approximation with sparsely connected deep neural networks. SIAM Journal on Mathematics of Data Science, 1(1):8–45, 2019.
  • [24] Vitaly Maiorov and Allan Pinkus. Lower bounds for approximation by mlp neural networks. Neurocomputing, 25(1-3):81–91, 1999.
  • [25] Jing Lei, Max G’Sell, Alessandro Rinaldo, Ryan J Tibshirani, and Larry Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094–1111, 2018.
  • [26] Harris Papadopoulos, Kostas Proedrou, Volodya Vovk, and Alex Gammerman. Inductive confidence machines for regression. In Machine Learning: ECML 2002: 13th European Conference on Machine Learning Helsinki, Finland, August 19–23, 2002 Proceedings 13, pages 345–356. Springer, 2002.
  • [27] Victor Chernozhukov, Kaspar Wüthrich, and Zhu Yinchu. Exact and robust conformal inference methods for predictive machine learning with dependent data. In Conference On learning theory, pages 732–749. PMLR, 2018.
  • [28] Chen Xu and Yao Xie. Conformal prediction interval for dynamic time-series. In International Conference on Machine Learning, pages 11559–11569. PMLR, 2021.
  • [29] Rina Foygel Barber, Emmanuel J Candes, Aaditya Ramdas, and Ryan J Tibshirani. Conformal prediction beyond exchangeability. The Annals of Statistics, 51(2):816–845, 2023.
  • [30] Margaux Zaffran, Olivier Féron, Yannig Goude, Julie Josse, and Aymeric Dieuleveut. Adaptive conformal predictions for time series. In International Conference on Machine Learning, pages 25834–25866. PMLR, 2022.
  • [31] Isaac Gibbs and Emmanuel Candes. Adaptive conformal inference under distribution shift. Advances in Neural Information Processing Systems, 34:1660–1672, 2021.
  • [32] Kamile Stankeviciute, Ahmed M Alaa, and Mihaela van der Schaar. Conformal time-series forecasting. Advances in Neural Information Processing Systems, 34:6216–6228, 2021.
  • [33] Anastasios N Angelopoulos and Stephen Bates. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv preprint arXiv:2107.07511, 2021.
  • [34] Ruofeng Wen, Kari Torkkola, Balakrishnan Narayanaswamy, and Dhruv Madeka. A multi-horizon quantile recurrent forecaster. arXiv preprint arXiv:1711.11053, 2017.
  • [35] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059. PMLR, 2016.
  • [36] Derrick T Mirikitani and Nikolay Nikolaev. Recursive bayesian recurrent neural networks for time-series modeling. IEEE Transactions on Neural Networks, 21(2):262–274, 2009.
  • [37] Shiqing Ling and Michael McAleer. A general asymptotic theory for time-series models. Statistica Neerlandica, 64(1):97–111, 2010.
  • [38] Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, pages 94–116, 1994.
  • [39] Ron Meir. Performance bounds for nonlinear time series prediction. In Proceedings of the tenth annual conference on computational learning theory, pages 122–129, 1997.
  • [40] Cosma Shalizi and Aryeh Kontorovich. Predictive pac learning and process decompositions. Advances in neural information processing systems, 26, 2013.
  • [41] Subhashis Ghosal and Aad van der Vaart. Convergence rates of posterior distributions for non-i.i.d. observations. Annals of Statistics, 35:192–223, 2007.
  • [42] Ismael Castillo and Judith Rousseau. A bernstein–von mises theorem for smooth functionals in semiparametric models. Annals of Statistics, 43:2353–2383, 2013.
  • [43] Olivier Wintenberger. Optimal learning with bernstein online aggregation. Machine Learning, 106:119–141, 2017.
  • [44] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, nov 1997.
  • [45] Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. Mimic-iii, a freely accessible critical care database. Scientific data, 3(1):1–9, 2016.
  • [46] Catherine Blake. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
  • [47] GOV UK et al. Coronavirus (covid-19) in the uk, 2020.
  • [48] Souhaib Ben Taieb and Amir F Atiya. A bias and variance analysis for multistep-ahead time series forecasting. IEEE transactions on neural networks and learning systems, 27(1):62–76, 2015.
  • [49] Marcelo C Medeiros, Timo Teräsvirta, and Gianluigi Rech. Building neural network models for time series: a statistical approach. Journal of Forecasting, 25(1):49–75, 2006.
  • [50] Timo Terasvirta and Anders Kock. Forecasting with nonlinear time series models. Econometrics: Multiple Equation Models eJournal, 2010.
  • [51] Bjørn Auestad and Dag Tjøstheim. Identification of nonlinear time series: First order characterization and order determination. Biometrika, 77(4):669–687, 1990.
  • [52] Wenxin Jiang. Bayesian variable selection for high dimensional generalized linear models: convergence rates of the fitted densities. The Annals of Statistics, 35(4):1487–1511, 2007.
  • [53] Subhashis Ghosal and Aad Van der Vaart. Fundamentals of nonparametric Bayesian inference, volume 44. Cambridge University Press, 2017.
  • [54] Andre M Zubkov and Aleksandr A Serov. A complete proof of universal inequalities for the distribution function of the binomial law. Theory of Probability & Its Applications, 57(3):539–544, 2013.
  • [55] Pentti Saikkonen. Dependent versions of a central limit theorem for the squared length of a sample mean. Statistics & probability letters, 22(3):185–194, 1995.
  • [56] Stephen Portnoy. Asymptotic behavior of likelihood methods for exponential families when the number of parameters tends to infinity. The Annals of Statistics, pages 356–366, 1988.
  • [57] Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochastic gradient mcmc. Advances in neural information processing systems, 28, 2015.
  • [58] Grigoris A Dourbois and Pandelis N Biskas. European market coupling algorithm incorporating clearing conditions of block and complex orders. In 2015 IEEE Eindhoven PowerTech, pages 1–6. IEEE, 2015.
  • [59] Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. Conformal prediction with temporal quantile adjustments. arXiv preprint arXiv:2205.09940, 2022.
  • [60] Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag. What is the state of neural network pruning? Proceedings of machine learning and systems, 2:129–146, 2020.
  • [61] Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandra Peste. Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. J. Mach. Learn. Res., 22(241):1–124, 2021.
  • [62] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
  • [63] Nadezhda Chirkova, Ekaterina Lobacheva, and Dmitry Vetrov. Bayesian compression for natural language processing. arXiv preprint arXiv:1810.10927, 2018.
  • [64] Maxim Kodryan, Artem Grachev, Dmitry Ignatov, and Dmitry Vetrov. Efficient language modeling with automatic relevance determination in recurrent neural networks. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 40–48, 2019.
  • [65] Ekaterina Lobacheva, Nadezhda Chirkova, and Dmitry Vetrov. Bayesian sparsification of gated recurrent neural networks. arXiv preprint arXiv:1812.05692, 2018.
  • [66] Artem M Grachev, Dmitry I Ignatov, and Andrey V Savchenko. Compression of recurrent neural networks for efficient language modeling. Applied Soft Computing, 79:354–362, 2019.
  • [67] Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878, 2017.

Appendix for “Sparse Deep Learning for Time Series Data: Theory and Applications”

Appendix A Mathematical Facts of Sparse RNNs

For a sparse RNN model with $H_n-1$ recurrent layers and a single output layer, several mathematical facts can be established. Let us denote the number of nodes in each layer by $L_1,\ldots,L_{H_n}$. Additionally, let $r_{w_i}$ represent the number of non-zero connection weights in ${\boldsymbol{w}}^i$, and let $r_{v_i}$ denote the number of non-zero connection weights in ${\boldsymbol{v}}^i$, where ${\boldsymbol{w}}^i$ and ${\boldsymbol{v}}^i$ denote the two weight matrices at layer $i$ of the sparse RNN.

Furthermore, we define $O^t_{i,j}({\boldsymbol{\beta}},{\boldsymbol{x}}_{t:1})$ as the output value of the $j$-th neuron of the $i$-th layer at time $t$. This output depends on the parameter vector ${\boldsymbol{\beta}}$ and the input sequence ${\boldsymbol{x}}_{t:1}=({\boldsymbol{x}}_t,\ldots,{\boldsymbol{x}}_1)$, where ${\boldsymbol{x}}_t$ represents the generic input to the sparse RNN at time step $t$.

Lemma A.1.

Under Assumption 3.6, if a sparse RNN has at most $r_n$ non-zero connection weights (i.e., $r_n=\sum_{i=1}^{H_n}r_{w_i}+\sum_{i=1}^{H_n-1}r_{v_i}$) and $\|{\boldsymbol{\beta}}\|_{\infty}=E_n$, then the summation of the absolute outputs of the $i$-th layer at time $t$ is bounded by

$\sum_{j=1}^{L_i}|O^t_{i,j}({\boldsymbol{\beta}},{\boldsymbol{x}}_{t:1})|\leq t^i\left(\prod_{k=1}^{i}r_{w_k}\right)\left(\prod_{k=1}^{i}r_{v_k}\right)^{t-1}(E_n)^{ti},\quad 1\leq i\leq H_n-1,$
$\sum_{j=1}^{L_{H_n}}|O^t_{H_n,j}({\boldsymbol{\beta}},{\boldsymbol{x}}_{t:1})|\leq t^{H_n}\left(\prod_{k=1}^{H_n}r_{w_k}\right)\left(\prod_{k=1}^{H_n-1}r_{v_k}\right)^{t-1}(E_n)^{t(H_n-1)+1}.$
Proof.

For simplicity, we write $O^t_{i,j}({\boldsymbol{\beta}},{\boldsymbol{x}}_{t:1})$ as $O^t_{i,j}$ when appropriate. The lemma follows from the facts below:

  • For $t=1$ and $1\leq i\leq H_n$ (Lemma S4 of [17]):

    $\sum_{j=1}^{L_i}|O^1_{i,j}|\leq E_n^i\prod_{k=1}^{i}r_{w_k}$
  • For $t\geq 2$ and $i=1$ (recursive relationship):

    $\sum_{j=1}^{L_1}|O^t_{1,j}|\leq E_n r_{w_1}+E_n r_{v_1}\sum_{j=1}^{L_1}|O^{t-1}_{1,j}|$
  • For $t\geq 2$ and $2\leq i\leq H_n-1$ (recursive relationship):

    $\sum_{j=1}^{L_i}|O^t_{i,j}|\leq E_n r_{w_i}\sum_{j=1}^{L_{i-1}}|O^t_{i-1,j}|+E_n r_{v_i}\sum_{j=1}^{L_i}|O^{t-1}_{i,j}|$
  • For $t\geq 2$ and $i=H_n$ (recursive relationship):

    $\sum_{j=1}^{L_{H_n}}|O^t_{H_n,j}|\leq E_n r_{w_{H_n}}\sum_{j=1}^{L_{H_n-1}}|O^t_{H_n-1,j}|$

We can verify this lemma by plugging the conclusion into all recursive relationships. We show the steps to verify the case t\geq 2 and 2\leq i\leq H_{n}-1, since the other cases are trivial to verify.

\displaystyle\sum_{j=1}^{L_{i}}|O^{t}_{i,j}|\leq E_{n}r_{w_{i}}\sum_{j=1}^{L_{i-1}}|O^{t}_{i-1,j}|+E_{n}r_{v_{i}}\sum_{j=1}^{L_{i}}|O^{t-1}_{i,j}|
\displaystyle\leq t^{i-1}\left(\prod_{k=1}^{i}r_{w_{k}}\right)\left(\prod_{k=1}^{i-1}r_{v_{k}}\right)^{t-1}(E_{n})^{t(i-1)+1}+(t-1)^{i}\left(\prod_{k=1}^{i}r_{w_{k}}\right)\left(\prod_{k=1}^{i}r_{v_{k}}\right)^{t-2}r_{v_{i}}(E_{n})^{(t-1)i+1}
\displaystyle\leq t^{i-1}\left(\prod_{k=1}^{i}r_{w_{k}}\right)\left(\prod_{k=1}^{i}r_{v_{k}}\right)^{t-1}(E_{n})^{ti}+(t-1)^{i}\left(\prod_{k=1}^{i}r_{w_{k}}\right)\left(\prod_{k=1}^{i}r_{v_{k}}\right)^{t-1}(E_{n})^{ti}
\displaystyle\leq t^{i}\left(\prod_{k=1}^{i}r_{w_{k}}\right)\left(\prod_{k=1}^{i}r_{v_{k}}\right)^{t-1}(E_{n})^{ti},

where the last inequality is due to the fact that

\displaystyle\dfrac{t^{i-1}+(t-1)^{i}}{t^{i}}=\dfrac{1}{t}+\left(\dfrac{t-1}{t}\right)^{i}\leq\dfrac{1}{t}+\dfrac{t-1}{t}=1.
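
As a quick numerical sanity check of this last inequality (our own added example, not part of the original proof), take t=2 and i=3:

\displaystyle\dfrac{t^{i-1}+(t-1)^{i}}{t^{i}}=\dfrac{2^{2}+1^{3}}{2^{3}}=\dfrac{5}{8}\leq 1.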

Lemma A.2.

Under Assumption 3.6, consider two RNNs, {\boldsymbol{\beta}} and \widetilde{{\boldsymbol{\beta}}}, where the former is a sparse network satisfying \norm{{\boldsymbol{\beta}}}_{0}=r_{n} and \norm{{\boldsymbol{\beta}}}_{\infty}=E_{n}, and its network structure vector is denoted by {\boldsymbol{\gamma}}. If |{\boldsymbol{\beta}}_{i}-\widetilde{{\boldsymbol{\beta}}}_{i}|<\delta_{1} for all i\in{\boldsymbol{\gamma}} and |{\boldsymbol{\beta}}_{i}-\widetilde{{\boldsymbol{\beta}}}_{i}|<\delta_{2} for all i\notin{\boldsymbol{\gamma}}, then

\displaystyle\max_{{\boldsymbol{x}}_{t:1}}|\mu({\boldsymbol{x}}_{t:1},{\boldsymbol{\beta}})-\mu({\boldsymbol{x}}_{t:1},\widetilde{{\boldsymbol{\beta}}})|
\displaystyle\leq t^{H_{n}}\delta_{1}H_{n}\left(\prod_{k=1}^{H_{n}}r_{w_{k}}\right)(E_{n}+\delta_{1})^{t(H_{n}+1)-2}\left(\prod_{k=1}^{H_{n}-1}r_{v_{k}}\right)^{t-1}
\displaystyle+t^{H_{n}}\delta_{2}(p_{n}L_{1}+\sum_{k=1}^{H_{n}}L_{k})\left(\prod_{k=1}^{H_{n}}[\delta_{2}L_{k}+r_{w_{k}}(E_{n}+\delta_{1})]\right)\left(\prod_{k=1}^{H_{n}-1}[\delta_{2}L_{k}+r_{v_{k}}(E_{n}+\delta_{1})]\right)^{t-1}.
Proof.

Define \check{{\boldsymbol{\beta}}} such that \check{{\boldsymbol{\beta}}}_{i}=\widetilde{{\boldsymbol{\beta}}}_{i} for all i\in{\boldsymbol{\gamma}} and \check{{\boldsymbol{\beta}}}_{i}=0 for all i\notin{\boldsymbol{\gamma}}. Let \check{O}^{t}_{i,j} denote O^{t}_{i,j}(\check{{\boldsymbol{\beta}}},{\boldsymbol{x}}_{t:1}). Consider the following facts:

  • For t=1 and 1\leq i\leq H_{n} (Lemma S4 from [17])

    \displaystyle\sum_{j=1}^{L_{i}}|\check{O}^{1}_{i,j}-O^{1}_{i,j}|\leq i\delta_{1}(E_{n}+\delta_{1})^{i-1}\left(\prod_{k=1}^{i}r_{w_{k}}\right)
  • For t\geq 2 and i=1 (recursive relationship)

    \displaystyle\sum_{j=1}^{L_{1}}|\check{O}^{t}_{1,j}-O^{t}_{1,j}|\leq r_{w_{1}}\delta_{1}+r_{v_{1}}\delta_{1}\sum_{k=1}^{L_{1}}|O^{t-1}_{1,k}|+r_{v_{1}}(E_{n}+\delta_{1})\sum_{k=1}^{L_{1}}|\check{O}^{t-1}_{1,k}-O^{t-1}_{1,k}|
  • For t\geq 2 and 2\leq i\leq H_{n}-1 (recursive relationship)

    \displaystyle\sum_{j=1}^{L_{i}}|\check{O}^{t}_{i,j}-O^{t}_{i,j}|\leq r_{w_{i}}\delta_{1}\sum_{k=1}^{L_{i-1}}|O^{t}_{i-1,k}|+r_{w_{i}}(E_{n}+\delta_{1})\sum_{k=1}^{L_{i-1}}|\check{O}^{t}_{i-1,k}-O^{t}_{i-1,k}|
    \displaystyle+r_{v_{i}}\delta_{1}\sum_{k=1}^{L_{i}}|O^{t-1}_{i,k}|+r_{v_{i}}(E_{n}+\delta_{1})\sum_{k=1}^{L_{i}}|\check{O}^{t-1}_{i,k}-O^{t-1}_{i,k}|
  • For t\geq 2 and i=H_{n} (recursive relationship)

    \displaystyle\sum_{j=1}^{L_{H_{n}}}|\check{O}^{t}_{H_{n},j}-O^{t}_{H_{n},j}|\leq r_{w_{H_{n}}}\delta_{1}\sum_{k=1}^{L_{H_{n}-1}}|O^{t}_{H_{n}-1,k}|+r_{w_{H_{n}}}(E_{n}+\delta_{1})\sum_{k=1}^{L_{H_{n}-1}}|\check{O}^{t}_{H_{n}-1,k}-O^{t}_{H_{n}-1,k}|.

From these facts, we have

  • For 1\leq i\leq H_{n}-1

    \displaystyle\sum_{j=1}^{L_{i}}|\check{O}^{t}_{i,j}-O^{t}_{i,j}|\leq t^{i+1}i\delta_{1}\left(\prod_{k=1}^{i}r_{w_{k}}\right)\left(\prod_{k=1}^{i}r_{v_{k}}\right)^{t-1}(E_{n}+\delta_{1})^{(i+1)t-2}
  • For i=H_{n}

    \begin{split}|\mu({\boldsymbol{x}}_{t:1},{\boldsymbol{\beta}})-\mu({\boldsymbol{x}}_{t:1},\check{{\boldsymbol{\beta}}})|&=\sum_{j=1}^{L_{H_{n}}}|\check{O}^{t}_{H_{n},j}-O^{t}_{H_{n},j}|\\ &\leq t^{H_{n}}H_{n}\delta_{1}\left(\prod_{k=1}^{H_{n}}r_{w_{k}}\right)\left(\prod_{k=1}^{H_{n}-1}r_{v_{k}}\right)^{t-1}(E_{n}+\delta_{1})^{(H_{n}+1)t-2}.\end{split}

We can verify the above conclusion by plugging it into all recursive relationships. We show the steps to verify the case t\geq 2 and 2\leq i\leq H_{n}-1, since the other cases are trivial to verify.

\displaystyle\sum_{j=1}^{L_{i}}|\check{O}^{t}_{i,j}-O^{t}_{i,j}|
\displaystyle\leq r_{w_{i}}\delta_{1}\sum_{k=1}^{L_{i-1}}|O^{t}_{i-1,k}|+r_{w_{i}}(E_{n}+\delta_{1})\sum_{k=1}^{L_{i-1}}|\check{O}^{t}_{i-1,k}-O^{t}_{i-1,k}|+r_{v_{i}}\delta_{1}\sum_{k=1}^{L_{i}}|O^{t-1}_{i,k}|+r_{v_{i}}(E_{n}+\delta_{1})\sum_{k=1}^{L_{i}}|\check{O}^{t-1}_{i,k}-O^{t-1}_{i,k}|
\displaystyle\leq\delta_{1}t^{i-1}\left(\prod_{k=1}^{i}r_{w_{k}}\right)\left(\prod_{k=1}^{i-1}r_{v_{k}}\right)^{t-1}(E_{n}+\delta_{1})^{ti-t}+\delta_{1}t^{i}(i-1)\left(\prod_{k=1}^{i}r_{w_{k}}\right)\left(\prod_{k=1}^{i-1}r_{v_{k}}\right)^{t-1}(E_{n}+\delta_{1})^{ti-1}
\displaystyle+\delta_{1}(t-1)^{i}\left(\prod_{k=1}^{i}r_{w_{k}}\right)\left(\prod_{k=1}^{i}r_{v_{k}}\right)^{t-2}r_{v_{i}}(E_{n}+\delta_{1})^{ti-i}+\delta_{1}(t-1)^{i+1}i\left(\prod_{k=1}^{i}r_{w_{k}}\right)\left(\prod_{k=1}^{i}r_{v_{k}}\right)^{t-2}r_{v_{i}}(E_{n}+\delta_{1})^{ti-i+t-2}
\displaystyle\leq\delta_{1}t^{i-1}\left(\prod_{k=1}^{i}r_{w_{k}}\right)\left(\prod_{k=1}^{i}r_{v_{k}}\right)^{t-1}(E_{n}+\delta_{1})^{(i+1)t-2}+\delta_{1}t^{i}(i-1)\left(\prod_{k=1}^{i}r_{w_{k}}\right)\left(\prod_{k=1}^{i}r_{v_{k}}\right)^{t-1}(E_{n}+\delta_{1})^{(i+1)t-2}
\displaystyle+\delta_{1}(t-1)^{i}\left(\prod_{k=1}^{i}r_{w_{k}}\right)\left(\prod_{k=1}^{i}r_{v_{k}}\right)^{t-1}(E_{n}+\delta_{1})^{(i+1)t-2}+\delta_{1}(t-1)^{i+1}i\left(\prod_{k=1}^{i}r_{w_{k}}\right)\left(\prod_{k=1}^{i}r_{v_{k}}\right)^{t-1}(E_{n}+\delta_{1})^{(i+1)t-2}
\displaystyle\leq t^{i+1}i\delta_{1}\left(\prod_{k=1}^{i}r_{w_{k}}\right)\left(\prod_{k=1}^{i}r_{v_{k}}\right)^{t-1}(E_{n}+\delta_{1})^{(i+1)t-2}

where the last inequality is due to the fact that, for t\geq 2 and i\geq 2,

\displaystyle\dfrac{t^{i-1}+(i-1)t^{i}+(t-1)^{i}+i(t-1)^{i+1}}{it^{i+1}}
\displaystyle=\dfrac{1}{it^{2}}+\dfrac{i-1}{i}\dfrac{1}{t}+\dfrac{1}{it}\left(\dfrac{t-1}{t}\right)^{i}+\left(\dfrac{t-1}{t}\right)^{i+1}
\displaystyle\leq\dfrac{1}{2t^{2}}+\dfrac{1}{t}+\dfrac{(t-1)^{2}}{2t^{3}}+\left(\dfrac{t-1}{t}\right)^{3}
\displaystyle=\dfrac{t+2t^{2}+t^{2}-2t+1+2(t-1)^{3}}{2t^{3}}
\displaystyle=1+\dfrac{-3t^{2}+5t-1}{2t^{3}}
\displaystyle\leq 1.

Now let \widetilde{O}^{t}_{i,j} denote O^{t}_{i,j}(\widetilde{{\boldsymbol{\beta}}},{\boldsymbol{x}}_{t:1}), and consider the following facts:

  • For 1\leq t\leq T and 1\leq i\leq H_{n}

    \displaystyle\sum_{j=1}^{L_{i}}|\check{O}^{t}_{i,j}|\leq t^{i}\left(\prod_{k=1}^{i}r_{w_{k}}\right)\left(\prod_{k=1}^{i}r_{v_{k}}\right)^{t-1}(E_{n}+\delta_{1})^{ti},\hskip 8.53581pt1\leq i\leq H_{n}-1,\hskip 5.69054pt1\leq t\leq T
    \displaystyle\sum_{j=1}^{L_{H_{n}}}|\check{O}^{t}_{H_{n},j}|\leq t^{H_{n}}\left(\prod_{k=1}^{H_{n}}r_{w_{k}}\right)\left(\prod_{k=1}^{H_{n}-1}r_{v_{k}}\right)^{t-1}\left(E_{n}+\delta_{1}\right)^{t(H_{n}-1)+1},\hskip 8.53581pt1\leq t\leq T

    These bounds can be easily derived from Lemma A.1 and the fact that \norm{\check{{\boldsymbol{\beta}}}}_{\infty}\leq E_{n}+\delta_{1}.

  • For t=1 and 1\leq i\leq H_{n} (Lemma S4 from [17])

    \displaystyle\sum_{j=1}^{L_{i}}|\check{O}^{1}_{i,j}-\widetilde{O}^{1}_{i,j}|\leq\delta_{2}(p_{n}L_{1}+\sum_{k=1}^{i}L_{k})\left(\prod_{k=1}^{i}[\delta_{2}L_{k}+r_{w_{k}}(E_{n}+\delta_{1})]\right)
  • For t\geq 2 and i=1 (recursive relationship)

    \displaystyle\sum_{j=1}^{L_{1}}|\check{O}^{t}_{1,j}-\widetilde{O}^{t}_{1,j}|\leq\delta_{2}L_{1}p_{n}+\delta_{2}L_{1}\sum_{k=1}^{L_{1}}|\check{O}^{t-1}_{1,k}-\widetilde{O}^{t-1}_{1,k}|+r_{v_{1}}(E_{n}+\delta_{1})\sum_{k=1}^{L_{1}}|\check{O}^{t-1}_{1,k}-\widetilde{O}^{t-1}_{1,k}|+\delta_{2}L_{1}\sum_{k=1}^{L_{1}}|\check{O}^{t-1}_{1,k}|
  • For t\geq 2 and 2\leq i\leq H_{n}-1 (recursive relationship)

    \displaystyle\sum_{j=1}^{L_{i}}|\check{O}^{t}_{i,j}-\widetilde{O}^{t}_{i,j}|\leq\delta_{2}L_{i}\sum_{k=1}^{L_{i-1}}|\check{O}^{t}_{i-1,k}-\widetilde{O}^{t}_{i-1,k}|+r_{w_{i}}(E_{n}+\delta_{1})\sum_{k=1}^{L_{i-1}}|\check{O}^{t}_{i-1,k}-\widetilde{O}^{t}_{i-1,k}|+\delta_{2}L_{i}\sum_{k=1}^{L_{i-1}}|\check{O}^{t}_{i-1,k}|
    \displaystyle+\delta_{2}L_{i}\sum_{k=1}^{L_{i}}|\check{O}^{t-1}_{i,k}-\widetilde{O}^{t-1}_{i,k}|+r_{v_{i}}(E_{n}+\delta_{1})\sum_{k=1}^{L_{i}}|\check{O}^{t-1}_{i,k}-\widetilde{O}^{t-1}_{i,k}|+\delta_{2}L_{i}\sum_{k=1}^{L_{i}}|\check{O}^{t-1}_{i,k}|
  • For t\geq 2 and i=H_{n} (recursive relationship)

    \displaystyle\sum_{j=1}^{L_{H_{n}}}|\check{O}^{t}_{H_{n},j}-\widetilde{O}^{t}_{H_{n},j}|\leq\delta_{2}L_{H_{n}}\sum_{k=1}^{L_{H_{n}-1}}|\check{O}^{t}_{H_{n}-1,k}-\widetilde{O}^{t}_{H_{n}-1,k}|+r_{w_{H_{n}}}(E_{n}+\delta_{1})\sum_{k=1}^{L_{H_{n}-1}}|\check{O}^{t}_{H_{n}-1,k}-\widetilde{O}^{t}_{H_{n}-1,k}|
    \displaystyle+\delta_{2}L_{H_{n}}\sum_{k=1}^{L_{H_{n}-1}}|\check{O}^{t}_{H_{n}-1,k}|

we have

  • For 1\leq i\leq H_{n}-1

    \displaystyle\sum_{j=1}^{L_{i}}|\check{O}^{t}_{i,j}-\widetilde{O}^{t}_{i,j}|\leq t^{i+1}\delta_{2}(p_{n}L_{1}+\sum_{k=1}^{i}L_{k})\left(\prod_{k=1}^{i}[\delta_{2}L_{k}+r_{w_{k}}(E_{n}+\delta_{1})]\right)\left(\prod_{k=1}^{i}[\delta_{2}L_{k}+r_{v_{k}}(E_{n}+\delta_{1})]\right)^{t-1}
  • For i=H_{n}

    \displaystyle|\mu({\boldsymbol{x}}_{t:1},\check{{\boldsymbol{\beta}}})-\mu({\boldsymbol{x}}_{t:1},\widetilde{{\boldsymbol{\beta}}})|
    \displaystyle=\sum_{j=1}^{L_{H_{n}}}|\check{O}^{t}_{H_{n},j}-\widetilde{O}^{t}_{H_{n},j}|
    \displaystyle\leq t^{H_{n}}\delta_{2}(p_{n}L_{1}+\sum_{k=1}^{H_{n}}L_{k})\left(\prod_{k=1}^{H_{n}}[\delta_{2}L_{k}+r_{w_{k}}(E_{n}+\delta_{1})]\right)\left(\prod_{k=1}^{H_{n}-1}[\delta_{2}L_{k}+r_{v_{k}}(E_{n}+\delta_{1})]\right)^{t-1}

We can verify the above conclusion in a similar manner.

The proof is completed by summing the bounds for |\mu({\boldsymbol{x}}_{t:1},{\boldsymbol{\beta}})-\mu({\boldsymbol{x}}_{t:1},\check{{\boldsymbol{\beta}}})| and |\mu({\boldsymbol{x}}_{t:1},\check{{\boldsymbol{\beta}}})-\mu({\boldsymbol{x}}_{t:1},\widetilde{{\boldsymbol{\beta}}})|. ∎

Appendix B Proofs on Posterior Consistency: A Single Time Series

To establish posterior consistency for DNNs with i.i.d. data, [17] utilized Proposition 1 of [52], which provides three sufficient conditions for proving posterior consistency for general statistical models with i.i.d. data, along with a posterior contraction rate. In this paper, we aim to establish posterior consistency for DNNs with stochastic processes, specifically time series, that are strictly stationary and \alpha-mixing with an exponentially decaying mixing coefficient [53, 41].

Consider a time series D_{n}=(z_{1},z_{2},\dots,z_{n}) defined on a probability space (\Omega,\mathcal{F},P^{*}), which satisfies the assumptions outlined in Assumption 1. For simplicity, we assume that the initial values z_{1-l},\dots,z_{0} are fixed and given.

Let \mathbb{P}_{n} denote a set of probability densities, let \mathbb{P}_{n}^{c} denote the complement of \mathbb{P}_{n}, and let \epsilon_{n} denote a sequence of positive numbers. Let N(\epsilon_{n},\mathbb{P}_{n}) be the minimum number of Hellinger balls of radius \epsilon_{n} needed to cover \mathbb{P}_{n}, i.e., N(\epsilon_{n},\mathbb{P}_{n}) is the smallest number k such that there exist sets S_{j}=\{p:d(p,p_{j})\leq\epsilon_{n}\}, j=1,\dots,k, with \mathbb{P}_{n}\subset\cup_{j=1}^{k}S_{j}, where

\displaystyle d(p,q)=\sqrt{\int\int(\sqrt{p(z|{\boldsymbol{x}})}-\sqrt{q(z|{\boldsymbol{x}})})^{2}v({\boldsymbol{x}})dzd{\boldsymbol{x}}}

denotes the integrated Hellinger distance [41, 53] between the two conditional densities p(z|{\boldsymbol{x}}) and q(z|{\boldsymbol{x}}), where {\boldsymbol{x}}\in\mathbb{R}^{l} contains the history of z up to l time steps back, and v({\boldsymbol{x}}) is the probability density function of the marginal distribution of {\boldsymbol{x}}. Note that v is invariant with respect to the time index due to the strict stationarity assumption.
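
For intuition, the following Python sketch (our own illustration, not code from the paper) approximates this integrated Hellinger distance by Monte Carlo for two conditional Gaussian densities N(\mu_{1}({\boldsymbol{x}}),\sigma^{2}) and N(\mu_{2}({\boldsymbol{x}}),\sigma^{2}) with a common variance, using the closed form \int(\sqrt{p}-\sqrt{q})^{2}dz=2\{1-\exp(-(\mu_{1}-\mu_{2})^{2}/(8\sigma^{2}))\}; the mean functions and the sampler for v({\boldsymbol{x}}) below are placeholders.

import numpy as np

def integrated_hellinger(mu1, mu2, sigma, sample_x, n_mc=100_000, seed=0):
    """Monte Carlo estimate of the integrated Hellinger distance between two
    conditional Gaussian densities N(mu1(x), sigma^2) and N(mu2(x), sigma^2),
    with x drawn from the marginal density v via `sample_x`."""
    rng = np.random.default_rng(seed)
    x = sample_x(rng, n_mc)                      # draws from v(x)
    gap = mu1(x) - mu2(x)                        # conditional mean differences
    # inner integral over z has a closed form for equal-variance Gaussians
    inner = 2.0 * (1.0 - np.exp(-gap**2 / (8.0 * sigma**2)))
    return np.sqrt(inner.mean())

# toy usage with placeholder mean functions and a standard normal marginal for x
d = integrated_hellinger(
    mu1=lambda x: np.sin(x).sum(axis=-1),
    mu2=lambda x: 0.9 * np.sin(x).sum(axis=-1),
    sigma=1.0,
    sample_x=lambda rng, n: rng.standard_normal((n, 3)),
)
print(f"estimated integrated Hellinger distance: {d:.4f}")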

For D_{n}, denote the corresponding true conditional density by p^{*}. Define \pi(\cdot) as the prior density, and \pi(\cdot|D_{n}) as the posterior. Define \hat{\pi}(\epsilon)=\pi[d(p,p^{*})>\epsilon|D_{n}] for each \epsilon>0. Assume the following conditions:

  • (a)

    \log N(\epsilon_{n},\mathbb{P}_{n})\leq n\epsilon_{n}^{2} for all sufficiently large n.

  • (b)

    \pi(\mathbb{P}^{c}_{n})\leq e^{-bn\epsilon_{n}^{2}} for some b>0 and all sufficiently large n.

  • (c)

    Let \norm{f}_{s}=(\int|f|^{s}dr)^{1/s}. Then \pi(\norm{p-p^{*}}_{s}\leq b^{\prime}\epsilon_{n})\geq e^{-\gamma n\epsilon_{n}^{2}} for some b^{\prime},\gamma>0, s>2, and all sufficiently large n.

Lemma B.1.

Under conditions (a), (b), and (c), if \lim_{n\to\infty}\epsilon_{n}=0 and \lim_{n\to\infty}n\epsilon_{n}^{2}=\infty, then for sufficiently large n we have, for some large C>0,

\displaystyle\hat{\pi}(C\epsilon_{n})=\pi(d(p,p^{*})\geq C\epsilon_{n}|D_{n})=O_{P^{*}}(e^{-n\epsilon_{n}^{2}}). (7)
Proof.

This lemma can be proved with arguments similar to those used in Section 9.5.3, Theorem 8.19, and Theorem 8.29 of [53]. ∎

B.1 Posterior Consistency with a General Shrinkage Prior

Let {\boldsymbol{\beta}} denote the vector of all connection weights of an RNN. To prove Theorem 3.8, we first consider a general shrinkage prior under which all entries of {\boldsymbol{\beta}} are subject to an independent continuous prior \pi_{b}, i.e., \pi({\boldsymbol{\beta}})=\prod_{j=1}^{K_{n}}\pi_{b}(\beta_{j}), where K_{n} denotes the total number of elements of {\boldsymbol{\beta}}. Theorem B.2 provides a sufficient condition for posterior consistency.

Theorem B.2.

(Posterior consistency) Suppose Assumptions 1 - 3.6 hold. If the prior \pi_{b} satisfies

\displaystyle\log(1/\underline{\pi}_{b})=O(H_{n}\log{n}+\log{\bar{L}}) (8)
\displaystyle\pi_{b}\{[-\eta_{n},\eta_{n}]\}\geq 1-\dfrac{1}{K_{n}}\exp\{-\tau[H_{n}\log{n}+\log{\bar{L}}+\log{p_{n}}]\} (9)
\displaystyle\pi_{b}\{[-\eta^{\prime}_{n},\eta^{\prime}_{n}]\}\geq 1-\dfrac{1}{K_{n}} (10)
\displaystyle-\log[K_{n}\pi_{b}(|\beta_{j}|>M_{n})]\succ n\epsilon_{n}^{2} (11)

for some \tau>0, where

\displaystyle\eta_{n}<1/[\sqrt{n}M_{l}^{H_{n}}K_{n}(n/H_{n})^{2M_{l}H_{n}}(c_{0}M_{n})^{2M_{l}H_{n}}],
\displaystyle\eta^{\prime}_{n}<1/[\sqrt{n}M_{l}^{H_{n}}K_{n}(r_{n}/H_{n})^{2M_{l}H_{n}}(c_{0}E_{n})^{2M_{l}H_{n}}],

with some c_{0}>1, \underline{\pi}_{b} is the minimal density value of \pi_{b} within the interval [-E_{n}-1,E_{n}+1], and M_{n} is some sequence satisfying \log(M_{n})=O(\log(n)). Then, there exists a sequence \epsilon_{n}, satisfying n\epsilon_{n}^{2}\asymp r_{n}H_{n}\log{n}+r_{n}\log{\bar{L}}+s_{n}\log{p_{n}}+n\varpi_{n}^{2} and \epsilon_{n}\prec 1, such that

\displaystyle\hat{\pi}(M\epsilon_{n})=\pi(d(p,p^{*})\geq M\epsilon_{n}|D_{n})=O_{P^{*}}(e^{-n\epsilon_{n}^{2}})

for some large M>0.

Proof.

To prove this theorem, it suffices to check the three conditions listed in Lemma B.1.

Checking condition (c):

Consider the set A=\{{\boldsymbol{\beta}}:\max_{j\in{\boldsymbol{\gamma}}^{*}}|\beta_{j}-\beta^{*}_{j}|\leq\omega_{n},\max_{j\notin{\boldsymbol{\gamma}}^{*}}|\beta_{j}-\beta^{*}_{j}|\leq\omega_{n}^{\prime}\}, where

\displaystyle\omega_{n}\leq\dfrac{c_{1}\epsilon_{n}}{M_{l}^{H_{n}}H_{n}(r_{n}/H_{n})^{2M_{l}H_{n}}(c_{0}E_{n})^{2M_{l}H_{n}}}
\displaystyle\omega_{n}^{\prime}\leq\dfrac{c_{1}\epsilon_{n}}{M_{l}^{H_{n}}K_{n}(r_{n}/H_{n})^{2M_{l}H_{n}}(c_{0}E_{n})^{2M_{l}H_{n}}}

for some c_{1}>0 and c_{0}>1. If {\boldsymbol{\beta}}\in A, then by Lemma A.2 we have

\displaystyle|\mu({\boldsymbol{x}}_{M_{l}:1},{\boldsymbol{\beta}})-\mu({\boldsymbol{x}}_{M_{l}:1},{\boldsymbol{\beta}}^{*})|\leq 3c_{1}\epsilon_{n}.

By the definition (4), we have

\displaystyle|\mu({\boldsymbol{x}}_{M_{l}:1},{\boldsymbol{\beta}})-\mu^{*}({\boldsymbol{x}}_{M_{l}:1})|\leq 3c_{1}\epsilon_{n}+\varpi_{n}.

Finally, for some s>2 (for simplicity, we take s to be an even integer),

\displaystyle\norm{p_{{\boldsymbol{\beta}}}-p_{\mu^{*}}}_{s}=\left(\int(|\mu({\boldsymbol{x}}_{M_{l}:1},{\boldsymbol{\beta}})-\mu^{*}({\boldsymbol{x}}_{M_{l}:1})|)^{s}v({\boldsymbol{x}}_{M_{l}:1})d{\boldsymbol{x}}_{M_{l}:1}\right)^{1/s}
\displaystyle\leq\left(\int(3c_{1}\epsilon_{n}+\varpi_{n})^{s}v({\boldsymbol{x}}_{M_{l}:1})d{\boldsymbol{x}}_{M_{l}:1}\right)^{1/s}
\displaystyle\leq 3c_{1}\epsilon_{n}+\varpi_{n}.

For any small b^{\prime}>0, condition (c) is satisfied as long as c_{1} is sufficiently small, \epsilon_{n}\geq C_{0}\varpi_{n} for some large C_{0}, and the prior satisfies -\log{\pi(A)}\leq\gamma n\epsilon_{n}^{2}. Since

\displaystyle\pi(A)\geq\left(\pi_{b}([E_{n}-\omega_{n},E_{n}+\omega_{n}])\right)^{r_{n}}\times\pi(\{\max_{j\notin{\boldsymbol{\gamma}}^{*}}|\beta_{j}|\leq\omega_{n}^{\prime}\})
\displaystyle\geq(2\underline{\pi}_{b}\omega_{n})^{r_{n}}\times\pi_{b}([-\omega_{n}^{\prime},\omega_{n}^{\prime}])^{K_{n}-r_{n}}
\displaystyle\geq(2\underline{\pi}_{b}\omega_{n})^{r_{n}}(1-1/K_{n})^{K_{n}},

where the last inequality is due to the fact that \omega_{n}^{\prime}\gg\eta_{n}^{\prime}. Note that \lim_{n\rightarrow\infty}(1-1/K_{n})^{K_{n}}=e^{-1}. Since \log(1/\omega_{n})\asymp 2M_{l}H_{n}\log(E_{n})+2M_{l}H_{n}\log(r_{n}/H_{n})+\log(1/\epsilon_{n})+\text{constant}=O(H_{n}\log{n}) (note that \log(1/\epsilon_{n})=O(\log{n})), we have, for sufficiently large n,

\displaystyle-\log\pi(A)\leq r_{n}\log(\dfrac{1}{\underline{\pi}_{b}})+r_{n}O(H_{n}\log(n))+r_{n}\log(\dfrac{1}{2})+1
\displaystyle=O(r_{n}H_{n}\log(n)+r_{n}\log\bar{L}).

Thus, the prior satisfies -\log{\pi(A)}\leq\gamma n\epsilon_{n}^{2} for sufficiently large n when n\epsilon_{n}^{2}\geq C_{0}(r_{n}H_{n}\log{n}+r_{n}\log\bar{L}) for some sufficiently large constant C_{0}, and hence condition (c) holds.

Checking condition (a):

Let \mathbb{P}_{n} denote the set of probability densities for the RNNs whose parameter vectors satisfy

{\boldsymbol{\beta}}\in B_{n}=\Big\{|\beta_{j}|\leq M_{n},\ \Gamma_{{\boldsymbol{\beta}}}=\{i:|\beta_{i}|\geq\delta_{n}^{\prime}\}\text{ satisfies }|\Gamma_{{\boldsymbol{\beta}}}|\leq k_{n}r_{n},\ |\Gamma_{{\boldsymbol{\beta}}}|_{in}\leq k^{\prime}_{n}s_{n}\Big\},

where |\Gamma_{{\boldsymbol{\beta}}}|_{in} denotes the number of input connections with absolute weights greater than \delta_{n}^{\prime}, k_{n}(\leq n/r_{n}) and k_{n}^{\prime}(\leq n/s_{n}) will be specified later, and

\displaystyle\delta_{n}^{\prime}=\dfrac{c_{1}\epsilon_{n}}{M_{l}^{H_{n}}K_{n}(k_{n}r_{n}/H_{n})^{2M_{l}H_{n}}(c_{0}M_{n})^{2M_{l}H_{n}}},

for some c_{1}>0 and c_{0}>0. Let

\delta_{n}=\dfrac{c_{1}\epsilon_{n}}{M_{l}^{H_{n}}H_{n}(k_{n}r_{n}/H_{n})^{2M_{l}H_{n}}(c_{0}M_{n})^{2M_{l}H_{n}}}.

Consider two parameter vectors {\boldsymbol{\beta}}^{u} and {\boldsymbol{\beta}}^{v} in the set B_{n} such that there exists a structure {\boldsymbol{\gamma}} with |{\boldsymbol{\gamma}}|\leq k_{n}r_{n} and |{\boldsymbol{\gamma}}|_{in}\leq k_{n}^{\prime}s_{n}, |{\boldsymbol{\beta}}^{u}_{j}-{\boldsymbol{\beta}}^{v}_{j}|\leq\delta_{n} for all j\in{\boldsymbol{\gamma}}, and \max(|{\boldsymbol{\beta}}^{v}_{j}|,|{\boldsymbol{\beta}}^{u}_{j}|)\leq\delta_{n}^{\prime} for all j\notin{\boldsymbol{\gamma}}. Hence, by Lemma A.2, we have |\mu({\boldsymbol{x}}_{M_{l}:1},{\boldsymbol{\beta}}^{u})-\mu({\boldsymbol{x}}_{M_{l}:1},{\boldsymbol{\beta}}^{v})|^{2}\leq 9c^{2}_{1}\epsilon_{n}^{2}. For two normal distributions N(\mu_{1},\sigma^{2}) and N(\mu_{2},\sigma^{2}), define the corresponding Kullback-Leibler divergence as

\displaystyle d_{0}(p_{\mu_{1}},p_{\mu_{2}})=\dfrac{1}{2\sigma^{2}}|\mu_{2}-\mu_{1}|^{2}.

Together with the fact that 2d(p_{\mu_{1}},p_{\mu_{2}})^{2}\leq d_{0}(p_{\mu_{1}},p_{\mu_{2}}), we have

\displaystyle d(p_{{\boldsymbol{\beta}}^{u}},p_{{\boldsymbol{\beta}}^{v}})\leq\sqrt{d_{0}(p_{{\boldsymbol{\beta}}^{u}},p_{{\boldsymbol{\beta}}^{v}})}\leq\sqrt{C(9+o(1))c_{1}^{2}\epsilon_{n}^{2}}\leq\epsilon_{n}

for some C>0, given a sufficiently small c_{1}.

Given the above results, one can bound the covering number N(\mathbb{P}_{n},\epsilon_{n}) by \sum_{j=1}^{k_{n}r_{n}}\chi^{j}_{H_{n}}(\dfrac{2M_{n}}{\delta_{n}})^{j}, where \chi^{j}_{H_{n}} denotes the number of all valid networks that have exactly j connections and no more than k^{\prime}_{n}s_{n} inputs. Since \log{\chi_{H_{n}}^{j}}\leq k^{\prime}_{n}s_{n}\log{p_{n}}+j\log(k^{\prime}_{n}s_{n}L_{1}+2H_{n}\bar{L}^{2}), \log{M_{n}}=O(\log{n}), k_{n}r_{n}\leq n, and k^{\prime}_{n}s_{n}\leq n, we have

\displaystyle\log N(\mathbb{P}_{n},\epsilon_{n})\leq\log(k_{n}r_{n}\chi_{H_{n}}^{k_{n}r_{n}}(\dfrac{2M_{n}}{\delta_{n}})^{k_{n}r_{n}})
\displaystyle\leq\log{k_{n}r_{n}}+k^{\prime}_{n}s_{n}\log{p_{n}}+k_{n}r_{n}\log(2H_{n})+2k_{n}r_{n}\log(\bar{L}+k^{\prime}_{n}s_{n})
\displaystyle+k_{n}r_{n}\log{\dfrac{2M_{n}M_{l}^{H_{n}}H_{n}(k_{n}r_{n}/H_{n})^{2M_{l}H_{n}}(c_{0}M_{n})^{2M_{l}H_{n}}}{c_{1}\epsilon_{n}}}
\displaystyle=k_{n}r_{n}O(H_{n}\log{n}+\log{\bar{L}})+k_{n}^{\prime}s_{n}\log{p_{n}}.

We can choose k_{n} and k^{\prime}_{n} such that, for sufficiently large n, k_{n}r_{n}\{H_{n}\log{n}+\log{\bar{L}}\}\asymp k_{n}^{\prime}s_{n}\log{p_{n}}\asymp n\epsilon_{n}^{2}, and then \log N(\mathbb{P}_{n},\epsilon_{n})\leq n\epsilon_{n}^{2}.

Checking condition (b):

Lemma B.3.

(Theorem 1 of [54]) Let X\sim\text{Binomial}(n,p) be a Binomial random variable. For any 1<k<n-1,

\displaystyle Pr(X\geq k+1)\leq 1-\Phi(\text{sign}(k-np)[2nH(p,k/n)]^{1/2}),

where \Phi is the cumulative distribution function (CDF) of the standard Gaussian distribution and H(p,k/n)=(k/n)\log(k/np)+(1-k/n)\log[(1-k/n)/(1-p)].
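
As a standalone numerical illustration of this tail bound (our own added check, not used in the proof), the following Python snippet compares the bound with the exact binomial tail probability for one choice of n, p, and k:

import numpy as np
from scipy.stats import binom, norm

def tail_bound(n, p, k):
    """Upper bound on Pr(X >= k+1) for X ~ Binomial(n, p) from Lemma B.3."""
    q = k / n
    H = q * np.log(q / p) + (1.0 - q) * np.log((1.0 - q) / (1.0 - p))
    return 1.0 - norm.cdf(np.sign(k - n * p) * np.sqrt(2.0 * n * H))

n, p, k = 1000, 0.01, 25
exact = binom.sf(k, n, p)          # Pr(X >= k+1)
print(f"exact tail: {exact:.3e}, bound: {tail_bound(n, p, k):.3e}")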

Now,

\displaystyle\pi(\mathbb{P}_{n}^{c})\leq Pr(\text{Binomial}(K_{n},v_{n})>k_{n}r_{n})+K_{n}\pi_{b}(|\beta_{j}|>M_{n})+Pr(|\Gamma_{{\boldsymbol{\beta}}}|_{in}>k_{n}^{\prime}s_{n}),

where v_{n}=1-\pi_{b}([-\delta_{n}^{\prime},\delta_{n}^{\prime}]). By the condition on \pi_{b} and the fact that \delta_{n}^{\prime}\gg\eta_{n}^{\prime}, it is easy to see that v_{n}\leq\exp\{-\tau[H_{n}\log{n}+\log{\bar{L}}+\log{p_{n}}]-\log{K_{n}}\} for some constant \tau>0. Thus, by Lemma B.3, -\log{Pr(\text{Binomial}(K_{n},v_{n})>k_{n}r_{n})}\approx\tau k_{n}r_{n}[H_{n}\log{n}+\log{\bar{L}}+\log{p_{n}}]\gtrsim n\epsilon_{n}^{2} due to the choice of k_{n}, and -\log{Pr(|\Gamma_{{\boldsymbol{\beta}}}|_{in}\geq k_{n}^{\prime}s_{n})}\approx k_{n}^{\prime}s_{n}[\tau(H_{n}\log{n}+\log{\bar{L}}+\log{p_{n}})+\log{(p_{n}K_{n}/L_{1})}]\gtrsim n\epsilon_{n}^{2} due to the choice of k_{n}^{\prime}. Thus, condition (b) holds as well. The proof is then completed by Lemma B.1. ∎

B.2 Proof of Theorem 3.8

Proof.

To prove Theorem 3.8 in the main text, it suffices to verify the four conditions on \pi_{b} listed in Theorem B.2. Let M_{n}=\max(\sqrt{2n}\sigma_{1,n},E_{n}). Condition (8) can be verified by choosing \sigma_{1,n} such that E_{n}^{2}/2\sigma_{1,n}^{2}+\log{\sigma_{1,n}^{2}}=O(H_{n}\log{n}+\log{\bar{L}}). Conditions (9) and (10) can be verified by setting \lambda_{n}=1/\{M_{l}^{H_{n}}K_{n}[n^{2M_{l}H_{n}}(\bar{L}p_{n})]^{\tau}\} and \sigma_{0,n}\prec 1/\{M_{l}^{H_{n}}\sqrt{n}K_{n}(n/H_{n})^{2M_{l}H_{n}}(c_{0}M_{n})^{2M_{l}H_{n}}\}. Condition (11) can be verified by M_{n}\geq 2n\sigma_{0,n}^{2} and \tau[H_{n}\log{n}+\log{\bar{L}}+\log{p_{n}}]+M_{n}^{2}/2\sigma_{1,n}^{2}\geq n. Finally, based on the proof above, we see that n\epsilon_{n}^{2}\asymp r_{n}H_{n}\log{n}+r_{n}\log{\bar{L}}+s_{n}\log{p_{n}}+n\varpi_{n}^{2} and \lim_{n\rightarrow\infty}\epsilon_{n}=0. ∎

B.3 Posterior Consistency for Multilayer Perceptrons

As highlighted in Section 3, the one-step-ahead prediction problem can be formulated as a regression problem and solved with an MLP. Here, the input is {\boldsymbol{x}}_{i}=y_{i-1:i-R_{l}} for some l\leq R_{l}\ll n, with the corresponding output being y_{i}. Thus, the dataset D_{n} can be represented as ({\boldsymbol{x}}_{i},y_{i})_{i=1+R_{l}}^{n}. We apply the same assumptions as for the RNN, specifically Assumptions 1, 3.4, and 3.6, and use the same definitions and notations for the MLP as those in [17, 18].

Leveraging the mathematical properties of sparse MLPs presented in [17] and the proof for sparse RNNs with a single time series given above, one can straightforwardly derive the following corollary. Let d(p_{1},p_{2}) denote the integrated Hellinger distance between two conditional densities p_{1}(y|{\boldsymbol{x}}) and p_{2}(y|{\boldsymbol{x}}), and let \pi(\cdot|D_{n}) be the posterior probability of an event.

Corollary B.4.

Suppose Assumptions 1, 3.4, and 3.6 hold. If the mixture Gaussian prior (3) satisfies the conditions \lambda_{n}=O(1/[K_{n}[n^{H_{n}}(\bar{L}p_{n})]^{\tau}]) for some constant \tau>0, E_{n}/[H_{n}\log{n}+\log{\bar{L}}]^{1/2}\leq\sigma_{1,n}\leq n^{\alpha} for some constant \alpha, and \sigma_{0,n}\leq\min\{1/[\sqrt{n}K_{n}(n^{3/2}\sigma_{1,n}/H_{n})^{H_{n}}],1/[\sqrt{n}K_{n}(nE_{n}/H_{n})^{H_{n}}]\}, then there exists an error sequence \epsilon_{n}^{2}=O(\varpi^{2}_{n})+O(\zeta_{n}^{2}) such that \lim_{n\to\infty}\epsilon_{n}=0 and \lim_{n\to\infty}n\epsilon_{n}^{2}=\infty, and the posterior distribution satisfies

\displaystyle\pi(d(p_{{\boldsymbol{\beta}}},p_{\mu^{*}})\geq C\epsilon_{n}|D_{n})=O_{P^{*}}(e^{-n\epsilon_{n}^{2}}) (12)

for sufficiently large n and C>0, where \zeta_{n}^{2}=[r_{n}H_{n}\log{n}+r_{n}\log{\bar{L}}+s_{n}\log{p_{n}}]/n, p_{\mu^{*}} denotes the underlying true data distribution, and p_{{\boldsymbol{\beta}}} denotes the data distribution reconstructed by the Bayesian MLP based on its posterior samples.

Appendix C Asymptotic Normality of Connection Weights and Predictions

This section provides detailed assumptions and proofs for Theorems 3.11 and 3.12. For simplicity, we assume y_{0:1-R_{l}} is also given, and we let M_{l}=R_{l}+1. Let l_{n}({\boldsymbol{\beta}})=\frac{1}{n}\sum_{i=1}^{n}\log(p_{{\boldsymbol{\beta}}}(y_{i}|y_{i-1:i-R_{l}})) denote the (average) log-likelihood function, and let \pi({\boldsymbol{\beta}}) denote the density of the mixture Gaussian prior (3). Let h_{i_{1},i_{2},\dots,i_{d}}({\boldsymbol{\beta}})=\dfrac{\partial^{d}{l_{n}({\boldsymbol{\beta}})}}{\partial{\beta_{i_{1}}}\cdots\partial{\beta_{i_{d}}}} denote the d-th order partial derivatives. Let H_{n}({\boldsymbol{\beta}}) denote the Hessian matrix of l_{n}({\boldsymbol{\beta}}), and let h_{i,j}({\boldsymbol{\beta}}) and h^{i,j}({\boldsymbol{\beta}}) denote the (i,j)-th components of H_{n}({\boldsymbol{\beta}}) and H_{n}^{-1}({\boldsymbol{\beta}}), respectively. Let B_{\lambda,n}=\bar{\lambda}_{n}^{1/2}({\boldsymbol{\beta}}^{*})/\underline{\lambda}_{n}({\boldsymbol{\beta}}^{*}) and b_{\lambda,n}=\sqrt{r_{n}/n}B_{\lambda,n}. For an RNN with parameter vector {\boldsymbol{\beta}}, we define the weight truncation at the true model structure {\boldsymbol{\gamma}}^{*} by ({\boldsymbol{\beta}}_{{\boldsymbol{\gamma}}^{*}})_{i}={\boldsymbol{\beta}}_{i} for i\in{\boldsymbol{\gamma}}^{*} and ({\boldsymbol{\beta}}_{{\boldsymbol{\gamma}}^{*}})_{i}=0 for i\notin{\boldsymbol{\gamma}}^{*}. For the mixture Gaussian prior (3), let B_{\delta_{n}}({\boldsymbol{\beta}}^{*})=\{{\boldsymbol{\beta}}:|{\boldsymbol{\beta}}_{i}-{\boldsymbol{\beta}}^{*}_{i}|\leq\delta_{n},\forall i\in{\boldsymbol{\gamma}}^{*},|{\boldsymbol{\beta}}_{i}-{\boldsymbol{\beta}}^{*}_{i}|\leq 2\sigma_{0,n}\log(\dfrac{\sigma_{1,n}}{\lambda_{n}\sigma_{0,n}}),\forall i\notin{\boldsymbol{\gamma}}^{*}\}.

In addition, we make the following assumptions for Theorem 3.11 and Theorem 3.12.

Assumption C.1.

Assume the conditions of Lemma 3.9 hold with \rho(\epsilon_{n})=o(\dfrac{1}{K_{n}}) and C_{1}\geq\dfrac{2}{3}, where C_{1} is as defined in Assumption 3.4. For some \delta_{n} such that \dfrac{r_{n}}{\sqrt{n}}\lesssim\delta_{n}\lesssim\dfrac{1}{\sqrt[3]{n}r_{n}}, let A(\epsilon_{n},\delta_{n})=\{{\boldsymbol{\beta}}:\max_{i\in{\boldsymbol{\gamma}}^{*}}|{\boldsymbol{\beta}}_{i}-{\boldsymbol{\beta}}^{*}_{i}|>\delta_{n},d(p_{{\boldsymbol{\beta}}},p_{\mu^{*}})\leq\epsilon_{n}\}, where \epsilon_{n} is the posterior contraction rate as defined in Theorem 3.8. Assume there exist some constants C>2 and M>0 such that

  • C.1

    {\boldsymbol{\beta}}^{*}=({\boldsymbol{\beta}}^{*}_{1},\dots,{\boldsymbol{\beta}}^{*}_{K_{n}}) is generic, \min_{i\in{\boldsymbol{\gamma}}^{*}}|{\boldsymbol{\beta}}_{i}^{*}|>C\delta_{n}, and \pi(A(\epsilon_{n},\delta_{n})|D_{n})\to 0 as n\to\infty.

  • C.2

    |h_{i}({\boldsymbol{\beta}}^{*})|<M, |h_{j,k}({\boldsymbol{\beta}}^{*})|<M, |h^{j,k}({\boldsymbol{\beta}}^{*})|<M, |h_{i,j,k}({\boldsymbol{\beta}})|<M, and |h_{l}({\boldsymbol{\beta}})|<M hold for any i,j,k\in{\boldsymbol{\gamma}}^{*}, l\notin{\boldsymbol{\gamma}}^{*}, and {\boldsymbol{\beta}}\in B_{2\delta_{n}}({\boldsymbol{\beta}}^{*}).

  • C.3

    \sup\{|E_{{\boldsymbol{\beta}}}(a^{\prime}U^{3})|:\norm{{\boldsymbol{\beta}}_{{\boldsymbol{\gamma}}^{*}}-{\boldsymbol{\beta}}^{*}}\leq 1.2b_{\lambda,n},\norm{a}=1\}\leq 0.1\sqrt{n/r_{n}}\,\underline{\lambda}_{n}^{2}({\boldsymbol{\beta}}^{*})/\bar{\lambda}_{n}^{1/2}({\boldsymbol{\beta}}^{*}) and B_{\lambda,n}=O(1), where U=Z-E_{{\boldsymbol{\beta}}_{{\boldsymbol{\gamma}}^{*}}}(Z), Z denotes a random variable drawn from the neural network model parameterized by {\boldsymbol{\beta}}_{{\boldsymbol{\gamma}}^{*}}, and E_{{\boldsymbol{\beta}}_{{\boldsymbol{\gamma}}^{*}}} denotes its expectation.

  • C.4

    r_{n}^{2}/n\to 0 and the conditions of Theorem 2 of [55] hold.

Conditions C.1 to C.3 align with the assumptions made in [18]. An additional assumption, C.4, is introduced to account for the dependence inherent in time series data. This assumption is crucial and is employed in conjunction with C.3 to establish the consistency of the maximum likelihood estimator (MLE) of {\boldsymbol{\beta}}_{{\boldsymbol{\gamma}}^{*}} for a given structure {\boldsymbol{\gamma}}^{*}. While the assumption r_{n}/n\to 0, which is implied by Assumption 3.4, is sufficient for independent data, a stronger restriction, specifically r_{n}^{2}/n\to 0, is necessary for dependent data such as time series. It is worth noting that the conditions used in Theorem 2 of [55] pertain specifically to the time series data itself.

Let \mu_{i_{1},i_{2},\dots,i_{d}}(\cdot,{\boldsymbol{\beta}})=\dfrac{\partial^{d}{\mu(\cdot,{\boldsymbol{\beta}})}}{\partial{\beta_{i_{1}}}\cdots\partial{\beta_{i_{d}}}} denote the d-th order partial derivative for a given input.

Assumption C.2.

|\mu_{i}(\cdot,{\boldsymbol{\beta}}^{*})|<M, |\mu_{i,j}(\cdot,{\boldsymbol{\beta}})|<M, and |\mu_{k}(\cdot,{\boldsymbol{\beta}})|<M for any i,j\in{\boldsymbol{\gamma}}^{*}, k\notin{\boldsymbol{\gamma}}^{*}, and {\boldsymbol{\beta}}\in B_{2\delta_{n}}({\boldsymbol{\beta}}^{*}), where M is as defined in Assumption C.1.

Proof of Theorem 3.11 and Theorem 3.12

The proofs of these two theorems can be conducted following the approach of [18]. The main difference is that [18] relies on Theorem 2.1 of [56], which assumes independent data; the same conclusions can be extended to time series data by using Theorem 2 of [55] together with Assumption C.4, which allows dependent data to be handled in the analysis.

Appendix D Computation

Algorithm 1 gives the prior annealing procedure [18]. In practice, the following implementation can be used based on Algorithm 1:

  • For 0<t<T_{1}, perform initial training.

  • For T_{1}\leq t\leq T_{2}, set \sigma_{0,n}^{(t)}=\sigma_{0,n}^{init} and gradually increase \eta^{(t)}=\frac{t-T_{1}}{T_{2}-T_{1}}.

  • For T_{2}\leq t\leq T_{3}, set \eta^{(t)}=1 and gradually decrease \sigma_{0,n}^{(t)} according to the formula (\sigma_{0,n}^{(t)})^{2}=\left(\frac{T_{3}-t}{T_{3}-T_{2}}\right)(\sigma_{0,n}^{init})^{2}+\left(\frac{t-T_{2}}{T_{3}-T_{2}}\right)(\sigma_{0,n}^{end})^{2}.

  • For t>T_{3}, set \eta^{(t)}=1, \sigma_{0,n}^{(t)}=\sigma_{0,n}^{end}, and gradually decrease the temperature \tau according to the formula \tau=\frac{c}{t-T_{3}}, where c is a constant. A code sketch of this schedule is given after the list.
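
The following Python sketch (our own illustration, not code released with the paper) implements the schedule described above; the stage boundaries T_{1}, T_{2}, T_{3}, the variance endpoints, and the constant c are user-supplied, and the values returned before T_{1} simply reflect that the prior is not yet applied during initial training.

def annealing_schedule(t, T1, T2, T3, sigma0_init_sq, sigma0_end_sq, c=1.0):
    """Return (eta, sigma0_sq, tau) at iteration t for the prior-annealing schedule."""
    if t < T1:                       # initial training: prior not yet applied
        eta, sigma0_sq, tau = 0.0, sigma0_init_sq, 1.0
    elif t <= T2:                    # ramp eta up, keep sigma0^2 at its initial value
        eta = (t - T1) / (T2 - T1)
        sigma0_sq, tau = sigma0_init_sq, 1.0
    elif t <= T3:                    # interpolate sigma0^2 from init to end value
        eta = 1.0
        w = (T3 - t) / (T3 - T2)
        sigma0_sq = w * sigma0_init_sq + (1.0 - w) * sigma0_end_sq
        tau = 1.0
    else:                            # simulated annealing of the temperature
        eta, sigma0_sq = 1.0, sigma0_end_sq
        tau = c / (t - T3)
    return eta, sigma0_sq, tau

# example with stage boundaries similar to those used in Appendix G.1
for t in (100, 155, 200, 280):
    print(t, annealing_schedule(t, T1=150, T2=160, T3=260,
                                sigma0_init_sq=1e-5, sigma0_end_sq=1e-6))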

Please refer to Appendix E for intuitive explanations of the prior annealing algorithm and further details on training a model to achieve the desired sparsity.

Algorithm 1 Prior Annealing: Frequentist
  [1] (Initial Training) Train a neural network satisfying condition (S^{*}) such that a global optimal solution {\boldsymbol{\beta}}_{0}=\operatorname*{arg\,max}_{{\boldsymbol{\beta}}}l_{n}({\boldsymbol{\beta}}) is reached; this stage can be accomplished by using SGD or Adam.
  [2] (Prior Annealing) Initialize {\boldsymbol{\beta}} at {\boldsymbol{\beta}}_{0}, and simulate from a sequence of distributions \pi({\boldsymbol{\beta}}|D_{n},\tau,\eta^{(k)},\sigma^{(k)}_{0,n})\propto e^{nl_{n}({\boldsymbol{\beta}})/\tau}\pi_{k}^{\eta^{(k)}/\tau}({\boldsymbol{\beta}}) for k=1,2,\dots,m, where 0<\eta^{(1)}\leq\eta^{(2)}\leq\dots\leq\eta^{(m)}=1, \pi_{k}=\lambda_{n}N(0,\sigma^{2}_{1,n})+(1-\lambda_{n})N(0,(\sigma^{(k)}_{0,n})^{2}), and \sigma_{0,n}^{init}=\sigma^{(1)}_{0,n}\geq\sigma^{(2)}_{0,n}\geq\dots\geq\sigma^{(m)}_{0,n}=\sigma^{end}_{0,n}. This can be done by using stochastic gradient MCMC algorithms [57]. After stage m has been reached, continue to run the simulated annealing algorithm by gradually decreasing the temperature \tau to a very small value. Denote the resulting model by \hat{{\boldsymbol{\beta}}}.
  [3] (Structure Sparsification) For each connection weight i\in\{1,2,\dots,K_{n}\}, set \tilde{{\boldsymbol{\gamma}}}_{i}=1 if |{\boldsymbol{\beta}}_{i}|>\dfrac{\sqrt{2}\sigma_{0,n}\sigma_{1,n}}{\sqrt{\sigma_{1,n}^{2}-\sigma_{0,n}^{2}}}\sqrt{\log(\dfrac{1-\lambda_{n}}{\lambda_{n}}\dfrac{\sigma_{1,n}}{\sigma_{0,n}})} and \tilde{{\boldsymbol{\gamma}}}_{i}=0 otherwise, where the threshold value is determined by solving \pi({\boldsymbol{\gamma}}_{i}=1|{\boldsymbol{\beta}})>0.5 with \sigma_{0,n}=\sigma_{0,n}^{end}. Denote the structure of the sparse RNN by \tilde{{\boldsymbol{\gamma}}}.
  [4] (Nonzero-weights Refining) Refine the nonzero weights of the sparse model (\hat{{\boldsymbol{\beta}}},\tilde{{\boldsymbol{\gamma}}}) by maximizing l_{n}({\boldsymbol{\beta}}). Denote the resulting estimate by (\tilde{{\boldsymbol{\beta}}},\tilde{{\boldsymbol{\gamma}}}).

Appendix E Mixture Gaussian Prior



Figure 3: Changes of the negative log prior density function during the prior annealing of \sigma_{0,n}^{2}, where the values of \lambda_{n} and \sigma_{1,n}^{2} are fixed.

The mixture Gaussian prior penalizes the model parameters as a piecewise L2 penalty, applying different degrees of shrinkage in different regions of the parameter space. Given values for \sigma_{0,n}, \sigma_{1,n}, and \lambda_{n}, a threshold value can be computed as in Algorithm 1. Parameters whose absolute values fall below this threshold receive a large penalty and hence constitute the "penalty region", while parameters whose absolute values exceed the threshold receive a relatively minimal penalty and form the "free space". The severity of the penalty in the free space largely depends on the value of \sigma_{1,n}.
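
As a concrete illustration of this piecewise behavior (our own sketch, not the authors' code), the negative log density of the mixture Gaussian prior can be evaluated directly; tabulating it over a grid of parameter values for decreasing \sigma_{0,n}^{2} reproduces the qualitative pattern shown in Figure 3.

import numpy as np

def neg_log_mixture_prior(beta, lambda_n, sigma0_sq, sigma1_sq):
    """Negative log density of the mixture Gaussian prior
    lambda_n * N(0, sigma1_sq) + (1 - lambda_n) * N(0, sigma0_sq),
    evaluated elementwise at the parameter values in `beta`."""
    beta = np.asarray(beta, dtype=float)
    comp1 = lambda_n * np.exp(-beta**2 / (2 * sigma1_sq)) / np.sqrt(2 * np.pi * sigma1_sq)
    comp0 = (1 - lambda_n) * np.exp(-beta**2 / (2 * sigma0_sq)) / np.sqrt(2 * np.pi * sigma0_sq)
    return -np.log(comp0 + comp1)

# hyperparameter values matching the Figure 3 setting: shrinking sigma0^2 sharpens the penalty region
grid = np.linspace(-0.2, 0.2, 5)
for sigma0_sq in (1e-5, 1e-6, 1e-7):
    print(sigma0_sq, np.round(neg_log_mixture_prior(grid, 1e-7, sigma0_sq, 0.05), 2))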

For instance, as depicted in Figure 3, following the practical implementation detailed in Appendix D, we fix \lambda_{n}=1e-7 and \sigma_{1,n}^{2}=0.05, set (\sigma_{0,n}^{init})^{2}=1e-5 and (\sigma_{0,n}^{end})^{2}=1e-7, and gradually reduce \sigma_{0,n}^{2} from the initial value to the end value. Initially, the free space is relatively small and the penalty region is relatively large; however, the penalty imposed on the parameters within the initial penalty region is also minor, making it challenging to shrink these parameters to zero. As we progressively decrease \sigma_{0,n}^{2}, the free space enlarges, the penalty region shrinks, and the penalty on the parameters within the penalty region intensifies simultaneously. Once \sigma_{0,n}^{2} reaches (\sigma_{0,n}^{end})^{2}, the parameters within the penalty region are close to zero, while the parameters outside the penalty region can vary freely in almost all areas of the parameter space.

For model compression tasks, achieving the desired sparsity involves several steps after the initial training phase outlined in Algorithm 1. First, determine the initial threshold value based on the pre-set values of \lambda_{n}, \sigma_{1,n}^{2}, and (\sigma_{0,n}^{init})^{2}. Then, compute the proportion of parameters in the initial model whose absolute values fall below this threshold; this proportion serves as an estimate of the anticipated sparsity. Finally, adjust (\sigma_{0,n}^{init})^{2} until the predicted sparsity aligns closely with the desired sparsity level, as sketched in the code below.
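
A minimal sketch of this tuning step is given below (our own illustration; the array weights stands for the flattened parameter vector of the initially trained model and is synthetic here). It computes the threshold of Algorithm 1 for given prior hyperparameters and the sparsity that this threshold would induce.

import numpy as np

def sparsification_threshold(lambda_n, sigma0_sq, sigma1_sq):
    """Threshold from step 3 of Algorithm 1: weights with absolute value
    below this value fall into the penalty region (i.e., would be pruned)."""
    sigma0, sigma1 = np.sqrt(sigma0_sq), np.sqrt(sigma1_sq)
    return (np.sqrt(2.0) * sigma0 * sigma1 / np.sqrt(sigma1_sq - sigma0_sq)
            * np.sqrt(np.log((1.0 - lambda_n) / lambda_n * sigma1 / sigma0)))

def expected_sparsity(weights, lambda_n, sigma0_init_sq, sigma1_sq):
    """Proportion of weights whose absolute value falls below the initial threshold."""
    thr = sparsification_threshold(lambda_n, sigma0_init_sq, sigma1_sq)
    return np.mean(np.abs(weights) < thr)

# toy usage: adjust sigma0_init_sq until the estimate matches the target sparsity
weights = np.random.default_rng(0).normal(0.0, 0.05, size=100_000)  # synthetic stand-in
for s0 in (1e-5, 5e-4, 2e-3):
    print(s0, expected_sparsity(weights, 1e-7, s0, 0.5))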

Appendix F Construction of Multi-Horizon Joint Prediction Intervals

To construct valid joint prediction intervals for multi-horizon forecasting, we estimate the individual variances for each time step in the prediction horizon using procedures similar to those for one-step-ahead forecasting. We then adjust the critical value by dividing \alpha by the number of time steps m in the prediction horizon and use z_{\alpha/(2m)} as the adjusted critical value. This Bonferroni correction ensures the desired coverage probability across the entire prediction horizon.

As an example, we refer to the experiments conducted in Section 5.1.2, where we work with a set of training sequences denoted by D_{n}=\{y_{1}^{(i)},\ldots,y_{T}^{(i)},\ldots,y_{T+m}^{(i)}\}_{i=1}^{n}. Here, y_{T+1}^{(i)},y_{T+2}^{(i)},\ldots,y_{T+m}^{(i)} represent the prediction horizon for each sequence, while T represents the length of the observed sequence. The construction proceeds as follows; a code sketch is given after the list.

  • Train a model by the proposed algorithm, and denote the trained model by (\hat{{\boldsymbol{\beta}}},\hat{{\boldsymbol{\gamma}}}).

  • Calculate \hat{\boldsymbol{\sigma}}^{2} as an estimator of \boldsymbol{\sigma}^{2}\in\mathbb{R}^{m\times 1}:

    \displaystyle\hat{\boldsymbol{\sigma}}^{2}=\dfrac{1}{n-1}\sum_{i=1}^{n}\left({\boldsymbol{\mu}}(\hat{{\boldsymbol{\beta}}},y_{1:T}^{(i)})-y^{(i)}_{T+1:T+m}\right)\otimes\left({\boldsymbol{\mu}}(\hat{{\boldsymbol{\beta}}},y_{1:T}^{(i)})-y^{(i)}_{T+1:T+m}\right),

    where {\boldsymbol{\mu}}(\hat{{\boldsymbol{\beta}}},y_{1:T}^{(i)})=\hat{y}_{T+1:T+m}^{(i)} denotes the m-step-ahead forecast of the trained model, and \otimes denotes the elementwise product.

  • For a test sequence y_{1:T}^{(0)}, calculate

    \displaystyle\hat{\boldsymbol{\Sigma}}=\nabla_{\hat{{\boldsymbol{\gamma}}}}{\boldsymbol{\mu}}(\hat{{\boldsymbol{\beta}}},y_{1:T}^{(0)})^{\prime}(-\nabla^{2}_{\hat{{\boldsymbol{\gamma}}}}l_{n}(\hat{{\boldsymbol{\beta}}}))^{-1}\nabla_{\hat{{\boldsymbol{\gamma}}}}{\boldsymbol{\mu}}(\hat{{\boldsymbol{\beta}}},y_{1:T}^{(0)}).

    Let \hat{\boldsymbol{\varsigma}}\in\mathbb{R}^{m\times 1} denote the vector formed by the diagonal elements of \hat{\boldsymbol{\Sigma}}.

  • The Bonferroni simultaneous prediction intervals for all elements of y_{T+1:T+m}^{(0)} are given by

    \displaystyle{\boldsymbol{\mu}}(\hat{{\boldsymbol{\beta}}},y_{1:T}^{(0)})\pm z_{\alpha/(2m)}\left(\dfrac{1}{n}\hat{\boldsymbol{\varsigma}}+\hat{\boldsymbol{\sigma}}^{2}\right)^{\circ\frac{1}{2}},

    where \circ\frac{1}{2} represents the element-wise square root operation.
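
The construction above can be summarized in the following Python sketch (our own illustration; predict stands for the trained network's m-step-ahead forecast \boldsymbol{\mu}(\hat{{\boldsymbol{\beta}}},y_{1:T}), and varsigma_hat for the diagonal of \hat{\boldsymbol{\Sigma}}, which we treat as precomputed from the gradients and the Hessian).

import numpy as np
from scipy.stats import norm

def multi_horizon_pi(predict, train_seqs, train_targets, test_seq,
                     varsigma_hat, alpha=0.1):
    """Bonferroni simultaneous prediction intervals for an m-step horizon.
    predict: callable mapping an observed sequence to an m-vector of forecasts.
    train_seqs/train_targets: training sequences y_{1:T} and targets y_{T+1:T+m}.
    varsigma_hat: m-vector of diagonal elements of Sigma_hat for the test sequence."""
    n = len(train_seqs)
    resid = np.stack([predict(s) - y for s, y in zip(train_seqs, train_targets)])
    sigma2_hat = (resid**2).sum(axis=0) / (n - 1)        # per-horizon variance estimate
    mu_hat = predict(test_seq)
    m = mu_hat.shape[0]
    z = norm.ppf(1.0 - alpha / (2.0 * m))                # Bonferroni-adjusted critical value
    half_width = z * np.sqrt(varsigma_hat / n + sigma2_hat)
    return mu_hat - half_width, mu_hat + half_width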

The Bayesian method is particularly advantageous when the number of non-zero connection weights in (\hat{{\boldsymbol{\beta}}},\hat{{\boldsymbol{\gamma}}}) is large, making the computation of the Hessian matrix of the log-likelihood function costly or infeasible. For detailed information on using the Bayesian method to construct prediction intervals, please refer to [18].

Appendix G Numerical Experiments

G.1 French Electricity Spot Prices

Dataset. The dataset contains the spot prices of electricity in France established over a period of four years, from 2016 to 2020, on an auction market. In this market, producers and suppliers submit their orders for the next day's 24-hour period, specifying the electricity volume in MWh they intend to sell or purchase, along with the corresponding price in €/MWh. At midnight, the Euphemia algorithm, as described in [58], calculates the 24 hourly prices for the next day, based on the submitted orders and other constraints. This hourly dataset consists of 35064 observations, covering (3\times 365+366)\times 24 periods. Our main objective is to predict the 24 prices for the next day by considering different explanatory variables such as the day-ahead forecast consumption, day-of-the-week, as well as the prices of the previous day and of the same day in the previous week, as these variables are crucial in determining the spot prices of electricity. Refer to Figure 4 for a visual representation of the dataset.

The prediction models and training settings described below are the same for all 24 hours.

Figure 4: (a) French electricity spot prices (2016 to 2019). (b) Online learning method used for ACI and AgACI in [30], where the prediction range is assumed to be between T_{\text{pred start}} and T_{\text{pred end}}.

Prediction Model. We use an MLP with one hidden layer of size 100 and the sigmoid activation function as the underlying prediction model for all methods and all hours.

Training: Baselines. For all CP baselines, we train the MLP for 300 epochs using SGD with a constant learning rate of 0.001 and momentum of 0.9; the batch size is set to 100. For ACI, we use the same list of \gamma\in\{0, 0.000005, 0.00005, 0.0001, 0.0002, 0.0003, 0.0004, 0.0005, 0.0006, 0.0007, 0.0008, 0.0009, 0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007, 0.008, 0.009, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09\} as in [30].

Training: Our Method. For our method PA, we train for a total of 300 epochs with the same learning rate, momentum, and batch size. We use T_{1}=150, T_{2}=160, and T_{3}=260. We run SGD with momentum for t<T_{1} and SGHMC with temperature =1 for t\geq T_{1}. For the mixture Gaussian prior, we fix \sigma_{1,n}^{2}=0.01, (\sigma_{0,n}^{init})^{2}=1e-5, (\sigma_{0,n}^{end})^{2}=1e-6, and \lambda_{n}=1e-7.

Table 5: Uncertainty quantification results produced by different methods for the French electricity spot prices dataset. This table corresponds to Figure 2.

Methods | Coverage | Average PI length (standard deviation) | Median PI length (interquartile range)
PA offline (ours) | 90.06 | 20.19 (3.23) | 20.59 (1.82)
AgACI online | 91.05 | 23.27 (14.49) | 21.38 (2.62)
ACI \gamma=0.01 online | 90.21 | 22.13 (8.50) | 20.68 (2.70)
ACI \gamma=0.0 online | 92.89 | 24.39 (9.17) | 22.99 (2.91)
EnbPI V2 online | 91.23 | 27.99 (10.17) | 26.55 (5.09)
NexCP online | 91.18 | 24.41 (10.40) | 22.88 (2.86)

G.2 EEG, MIMIC-III, and COVID-19

EEG The EEG dataset, available online, served as the primary source for the EEG signal time series. This dataset contains responses from both control and alcoholic subjects who were presented with visual stimuli of three different types. To maintain consistency with previous work [32], we utilized the medium-sized version, consisting of 10 control and 10 alcoholic subjects. We focused solely on the control subjects for our experiments, as the dataset summaries indicated that their EEG responses were more difficult to predict. For detailed information on our processing steps, please refer to [32].

COVID-19 We followed the processing steps of previous research [32] and utilized COVID-19 data from various regions within the same country to minimize potential distribution shifts while adhering to the exchangeability assumption. The lower-tier local authority split provided us with 380 sequences, which we randomly allocated between the training, calibration, and test sets over multiple trials. The dataset is available online.

MIMIC-III We collect patient data on the use of antibiotics (specifically Levofloxacin) from the MIMIC-III dataset [45]. However, the subset of the MIMIC-III dataset used in [32] is not publicly available to us, and the authors did not provide information on the processing steps, such as SQL queries and Python code. Therefore, we follow the published processing steps provided in [59]. We remove sequences with fewer than 5 visits or more than 47 visits, resulting in a total of 2992 sequences. We randomly split the dataset into train, calibration, and test sets, with corresponding proportions of 43%, 47%, and 10%. We use the white blood cell count (high) as the feature for the univariate time series from these sequences. Access to the MIMIC-III dataset requires PhysioNet credentialing.

Table 6 presents the training hyperparameters and prediction models used for the baselines in the three datasets, while Table 7 presents the training details used for our method.

We hypothesize that the mediocre performance on the EEG dataset is due to the small size of the prediction model, which has only 2025 parameters and is substantially smaller than the number of training sequences. Hence, we conducted experiments on the EEG dataset using a slightly larger prediction model for the different methods. The results are presented in Table 8, and the corresponding training details are provided in Table 9. Again, our method outperforms the other baselines.

Table 6: Training details for the baselines, where LSTM-d refers to a single hidden layer LSTM network with hidden size d.

Hyperparameter | MIMIC-III | EEG | COVID-19
Model | LSTM-500 | LSTM-20 | LSTM-20
Learning rate | 2e-4 | 0.01 | 0.01
Epochs | 65 | 100 | 1000
Optimizer | Adam | Adam | Adam
Batch size | 150 | 150 | 150

Table 7: Training details for our method, where LSTM-d refers to a single hidden layer LSTM network with hidden size d.

Hyperparameter | MIMIC-III | EEG | COVID-19
Model | LSTM-500 | LSTM-20 | LSTM-20
Learning rate | 2e-4 | 0.01 | 0.01
Epochs | 65 | 100 | 1000
Optimizer | Adam | Adam | Adam
Batch size | 150 | 150 | 150
\sigma_{1,n}^{2} | 0.05 | 0.1 | 0.1
(\sigma_{0,n}^{init})^{2} | 1e-5 | 1e-7 | 1e-7
(\sigma_{0,n}^{end})^{2} | 1e-6 | 1e-8 | 1e-8
\lambda_{n} | 1e-7 | 1e-7 | 1e-7
Temperature | 1 | 1 | 1
T_{1} | 40 | 50 | 800
T_{2} | 40 | 55 | 850
T_{3} | 55 | 80 | 950

Table 8: Uncertainty quantification results produced by various methods for the EEG data using a larger prediction model. The “coverage” represents the average joint coverage rate over different random seeds. The prediction interval (PI) lengths are averaged across prediction horizons and random seeds. The values in parentheses indicate the standard deviations of the respective means.

EEG Model | Coverage | PI lengths
PA-RNN | 91.0% (0.8%) | 40.84 (5.69)
CF-RNN | 90.8% (1.8%) | 43.4 (6.79)
MQ-RNN | 45.2% (2.5%) | 20.63 (2.07)
DP-RNN | 0.6% (0.1%) | 5.02 (0.53)

Table 9: Training details for the EEG data with a slightly larger prediction model for all methods, where "-" denotes that the information is not applicable.

Hyperparameter | PA-RNN | CF-RNN | MQ-RNN | DP-RNN
Model | LSTM-100 | LSTM-100 | LSTM-100 | LSTM-100
Learning rate | 1e-3 | 1e-3 | 1e-3 | 1e-3
Epochs | 150 | 150 | 150 | 150
Optimizer | Adam | Adam | Adam | Adam
Batch size | 150 | 150 | 150 | 150
\sigma_{1,n}^{2} | 0.01 | - | - | -
(\sigma_{0,n}^{init})^{2} | 1e-5 | - | - | -
(\sigma_{0,n}^{end})^{2} | 1e-6 | - | - | -
\lambda_{n} | 1e-7 | - | - | -
Temperature | 1 | - | - | -
T_{1} | 100 | - | - | -
T_{2} | 105 | - | - | -
T_{3} | 130 | - | - | -

G.3 Autoregressive Order Selection

Model An Elman RNN with one hidden layer of size 1000. Different window sizes (i.e., 1 or 15) result in different total numbers of parameters.

Hyperparameters All training hyperparameters are given in Table 10.

Table 10: Training details for the autoregressive order selection experiment.

Hyperparameter | PA-RNN 1 | PA-RNN 15 | RNN 1 | RNN 15
Learning rate | 4e-3 | 1e-4 | 1e-4 | 1e-4
Iterations | 25000 | 25000 | 25000 | 25000
Optimizer | SGHMC | SGHMC | SGHMC | SGHMC
Batch size | 36 | 36 | 36 | 36
Subsample size per iteration | 50 | 50 | 50 | 50
Prediction horizon | 1 | 1 | 1 | 1
\sigma_{1,n}^{2} | 0.05 | 0.05 | - | -
(\sigma_{0,n}^{init})^{2} | 2e-6 | 4e-6 | - | -
(\sigma_{0,n}^{end})^{2} | 1e-7 | 1e-7 | - | -
\lambda_{n} | 1e-7 | 1e-7 | - | -
Temperature | 0.1 | 0.1 | - | -
T_{1} | 5000 | 5000 | - | -
T_{2} | 10000 | 10000 | - | -
T_{3} | 25000 | 25000 | - | -

G.4 Large-Scale Model Compression

As pointed out by recent summary/survey papers on sparse deep learning [60, 61], the lack of standardized benchmarks and metrics that provide guidance on model structure, task, dataset, and sparsity levels has caused difficulties in conducting fair and meaningful comparisons with previous works. For example, the task of compressing Penn Tree Bank (PTB) word language model [62] is a popular comparison task. However, many previous works [63, 64, 65, 66] have avoided comparison with the state-of-the-art method by either not using the standard baseline model, not reporting, or conducting comparisons at different sparsity levels. Therefore, we performed an extensive search of papers that reported performance on this task, and to the best of our knowledge, the state-of-the-art method is the Automated Gradual Pruning (AGP) by [67].

In our experiments, we train large stacked LSTM language models on the PTB dataset at different sparsity levels. The model architecture follows the same design as in [62], comprising an embedding layer, two stacked LSTM layers, and a softmax layer. The vocabulary size is 10000, the embedding size is 1500, and the hidden layer size is 1500, resulting in a total of 66 million parameters. We compare our method with AGP at sparsity levels of 80%, 85%, 90%, 95%, and 97.5%, as in [67]. The results are summarized in Table 11; our method consistently achieves better results. For AGP, the numbers are taken directly from the original paper; since only one experimental result is provided per sparsity level, no standard deviation is reported. For our method, we run three independent trials and report both the mean and standard deviation for each sparsity level. During the initial training stage, we follow the same training procedure as in [62]. The details of the prior annealing and fine-tuning stages of our method for different sparsity levels are provided below.
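To make the architecture concrete, the following sketch reproduces the stated dimensions (vocabulary size 10000, embedding size 1500, two stacked LSTM layers with hidden size 1500, and a softmax output layer) in PyTorch. Class and argument names are ours, and details such as initialization and exact dropout placement follow [62] in the actual experiments rather than this sketch.

```python
import torch
import torch.nn as nn

class PTBWordModel(nn.Module):
    """Stacked LSTM word language model with the dimensions used in our experiments."""

    def __init__(self, vocab_size=10000, emb_size=1500, hidden_size=1500, dropout=0.65):
        super().__init__()
        self.drop = nn.Dropout(dropout)
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.lstm = nn.LSTM(emb_size, hidden_size, num_layers=2,
                            dropout=dropout, batch_first=True)
        self.decoder = nn.Linear(hidden_size, vocab_size)  # softmax output layer

    def forward(self, tokens, hidden=None):
        # tokens: (batch, seq_len) integer word indices
        emb = self.drop(self.embedding(tokens))
        out, hidden = self.lstm(emb, hidden)
        logits = self.decoder(self.drop(out))  # (batch, seq_len, vocab_size)
        return logits, hidden

model = PTBWordModel()
print(sum(p.numel() for p in model.parameters()))  # roughly 66 million
```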

During the initial training stage, we follow the same training procedure as in [62] for all sparsity levels, and hence T_{1} = 55.

During the prior annealing stage, we train for a total of 60 epochs using SGHMC. For all sparsity levels considered, we fix \sigma_{1,n}^{2} = 0.5, \lambda_{n} = 1e-6, momentum 1-\alpha = 0.9, and minibatch size 20, and we set T_{2} = T_{1} + 5 = 60 and T_{3} = T_{2} + 20 = 80. We set the temperature to \tau = 0.01 for t \leq T_{3} and gradually decrease it as \tau = 0.01 / (t - T_{3}) for t > T_{3}.
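The temperature schedule above is simple to state in code; the sketch below implements it directly (function and argument names are ours). During the same stage, \sigma_{0,n}^{2} is annealed from (\sigma_{0,n}^{init})^{2} to (\sigma_{0,n}^{end})^{2}; we do not reproduce that schedule here.

```python
def sghmc_temperature(epoch: int, t3: int = 80, tau0: float = 0.01) -> float:
    """Temperature schedule for the prior annealing stage:
    tau = 0.01 for epochs t <= T3, and tau = 0.01 / (t - T3) afterwards."""
    if epoch <= t3:
        return tau0
    return tau0 / (epoch - t3)

# Example: the temperature stays at 0.01 up to epoch 80 and then decays.
print([round(sghmc_temperature(t), 5) for t in (60, 80, 81, 90, 115)])
```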

During the fine-tuning stage, we apply a training procedure similar to that of the initial training stage, i.e., we use SGD with gradient clipping and decrease the learning rate by a constant factor after a certain number of epochs. The minibatch size is set to 20, and we apply early stopping based on the validation perplexity, with a maximum of 30 epochs.
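A minimal sketch of this fine-tuning loop, under the stated settings, is given below. The model, data iterators, and evaluate helper are placeholders; the learning rate, decay factor, and decay epoch correspond to the 80% sparsity column of Table 12, and the gradient clipping threshold is an illustrative value rather than one reported here.

```python
import copy
import torch

def fine_tune(model, train_batches, valid_batches, evaluate,
              lr=0.008, decay_factor=1 / 0.95, decay_epoch=5,
              clip=0.25, max_epochs=30):
    """SGD fine-tuning with gradient clipping, step learning-rate decay,
    and early stopping on validation perplexity (sketch)."""
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    best_ppl = float("inf")
    best_state = copy.deepcopy(model.state_dict())
    for epoch in range(1, max_epochs + 1):
        if epoch > decay_epoch:  # decay the learning rate by a constant factor
            for group in optimizer.param_groups:
                group["lr"] /= decay_factor
        model.train()
        for tokens, targets in train_batches:
            optimizer.zero_grad()
            logits, _ = model(tokens)
            loss = criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
            optimizer.step()
        val_ppl = evaluate(model, valid_batches)  # validation perplexity
        if val_ppl < best_ppl:
            best_ppl = val_ppl
            best_state = copy.deepcopy(model.state_dict())
        else:
            break  # early stopping: validation perplexity stopped improving
    model.load_state_dict(best_state)
    return model
```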

Table 12 lists all remaining hyperparameters (those not specified above) for the different sparsity levels.

Note that the mixture Gaussian prior is by nature a regularization method, so we lower the dropout ratio during the prior annealing and fine-tuning stages for models with relatively high sparsity.

Table 11: Test set perplexity for different methods and sparsity levels.
Method Sparsity Test Perplexity
baseline 0% 78.40
PA 80.87% ± 0.09% 77.122 ± 0.147
PA 85.83% ± 0.05% 77.431 ± 0.109
PA 90.53% ± 0.05% 78.823 ± 0.118
PA 95.40% ± 0.00% 84.525 ± 0.112
PA 97.78% ± 0.05% 93.268 ± 0.136
AGP 80% 77.52
AGP 85% 78.31
AGP 90% 80.24
AGP 95% 87.83
AGP 97.5% 103.20
Table 12: Word language model compression: hyperparameters for our method during the prior annealing and fine-tuning stages. We denote the learning rate by LR, the prior annealing stage by PA, and the fine-tuning stage by FT.

Hyperparameter / Sparsity 80% 85% 90% 95% 97.5%
Dropout ratio 0.65 0.65 0.65 0.4 0.4
(\sigma_{0,n}^{init})^{2} 0.0005 0.0007 0.00097 0.00174 0.00248
(\sigma_{0,n}^{end})^{2} 1e-7 1e-7 1e-7 1e-7 1e-6
LR PA 0.004 0.004 0.008 0.008 0.01
LR FT 0.008 0.008 0.008 0.1 0.2
LR decay factor FT 1/0.95 1/0.95 1/0.95 1/0.95 1/0.95
LR decay epoch FT 5 5 5 5 5
Table 13: Training details for the additional autoregressive order selection experiment.

Data-generating model: y_{i} = \left(0.8 - 1.1\exp(-50 y_{i-1}^{2})\right) y_{i-1} + \eta_{i}
Model PA-RNN (window sizes 1, 3, 5, 7, 10, 15)
Learning rate 1e-4
Iterations 2000
Optimizer SGHMC
Batch size 36
Subsample size per iteration 50
Prediction horizon 1
\sigma_{1,n}^{2} 0.05
(\sigma_{0,n}^{init})^{2} 1e-5
(\sigma_{0,n}^{end})^{2} 1e-6
\lambda_{n} 1e-6
Temperature 0.1
T_{1} 500
T_{2} 1000
T_{3} 1500
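For reference, the series defined by the equation in Table 13 can be simulated as in the sketch below. The noise distribution, noise scale, series length, and burn-in are illustrative assumptions of ours (we take \eta_{i} to be Gaussian), since they are not restated in this appendix.

```python
import numpy as np

def simulate_series(n=2000, burn_in=200, noise_sd=1.0, seed=0):
    """Simulate y_i = (0.8 - 1.1 * exp(-50 * y_{i-1}^2)) * y_{i-1} + eta_i.

    The noise scale, length, and burn-in are illustrative choices,
    not values taken from the experiment description.
    """
    rng = np.random.default_rng(seed)
    y = np.zeros(n + burn_in)
    for i in range(1, n + burn_in):
        y[i] = (0.8 - 1.1 * np.exp(-50.0 * y[i - 1] ** 2)) * y[i - 1] \
               + noise_sd * rng.normal()
    return y[burn_in:]  # discard burn-in so the retained series is past initialization effects

series = simulate_series()
print(series[:5])
```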