Sparse Deep Learning for Time Series Data: Theory and Applications
Abstract
Sparse deep learning has become a popular technique for improving the performance of deep neural networks in areas such as uncertainty quantification, variable selection, and large-scale network compression. However, most existing research has focused on problems where the observations are independent and identically distributed (i.i.d.), and there has been little work on problems where the observations are dependent, such as time series data and sequential data in natural language processing. This paper aims to address this gap by studying the theory for sparse deep learning with dependent data. We show that sparse recurrent neural networks (RNNs) can be consistently estimated, and their predictions are asymptotically normally distributed under appropriate assumptions, enabling the prediction uncertainty to be correctly quantified. Our numerical results show that sparse deep learning outperforms state-of-the-art methods, such as conformal prediction, in prediction uncertainty quantification for time series data. Furthermore, our results indicate that the proposed method can consistently identify the autoregressive order for time series data and outperform existing methods in large-scale model compression. Our proposed method has important practical implications in fields such as finance, healthcare, and energy, where both accurate point estimates and prediction uncertainty quantification are of concern.
1 Introduction
Over the past decade, deep learning has experienced unparalleled triumphs across a multitude of domains, such as time series forecasting [1, 2, 3, 4, 5], natural language processing [6, 7], and computer vision [8, 9]. However, challenges like generalization and miscalibration [10] persist, posing potential risks in critical applications like medical diagnosis and autonomous vehicles.
In order to enhance the performance of deep neural networks (DNNs), significant research efforts have been dedicated to exploring optimization methods and the loss surface of DNNs, see, e.g., [11, 12, 13, 14, 15, 16], which aim to expedite and direct the convergence of DNNs toward regions that exhibit strong generalization capabilities. While these investigations are valuable, effectively addressing both the challenges of generalization and miscalibration requires additional and perhaps more essential aspects: consistent estimation of the underlying input-output mapping and complete knowledge of the asymptotic distribution of predictions. As a highly effective method that addresses both challenges, sparse deep learning has been extensively studied, see, e.g., [17, 18, 19, 20, 21]. Nevertheless, it is important to note that all these studies have been conducted under the assumption of independently and identically distributed (i.i.d.) data, whereas in practice we frequently encounter situations where the data exhibit dependence, such as time series data.
The primary objective of this paper is to address this gap by establishing a theoretical foundation for sparse deep learning with time series data. Specifically, we lay the foundation within the Bayesian framework. For RNNs, by letting their parameters be subject to a mixture Gaussian prior, we establish posterior consistency, structure selection consistency, input-output mapping estimation consistency, and asymptotic normality of predicted values.
We validate our theory through numerical experiments on both synthetic and real-world datasets. Our approach outperforms existing state-of-the-art methods in uncertainty quantification and model compression, highlighting its potential for practical applications where both accurate point prediction and prediction uncertainty quantification are of concern.
2 Related Works
Sparse deep learning. Theoretical investigations have been conducted on the approximation power of sparse DNNs across different classes of functions [22, 23]. Recently, [17] made notable progress by integrating sparse DNNs into the framework of statistical modeling, which offers a fundamentally distinct neural network approximation theory. Unlike traditional theories that do not involve the data and allow connection weights to take values in an unbounded space to achieve arbitrarily small approximation errors with small networks [24], the theory of [17] links the network approximation error, the network size, and the weight bounds to the training sample size $n$. They show that a sparse DNN whose size grows appropriately with $n$ can effectively approximate various types of functions, such as affine and piecewise smooth functions, as $n \to \infty$. Additionally, sparse DNNs exhibit several advantageous theoretical guarantees, such as improved interpretability, enabling the consistent identification of relevant variables for high-dimensional nonlinear systems. Building upon this foundation, [18] establishes the asymptotic normality of connection weights and predictions, enabling valid statistical inference for prediction uncertainties. This work extends the sparse deep learning theory of [17, 18] from the case of i.i.d. data to the case of time series data.
Uncertainty quantification. Conformal Prediction (CP) has emerged as a prominent technique for generating prediction intervals, particularly for black-box models like neural networks. A key advantage of CP is its capability to provide valid prediction intervals for any data distribution, even with finite samples, provided the data meets the condition of exchangeability [25, 26]. While i.i.d. data easily satisfies this condition, dependent data, such as time series, often doesn’t. Researchers have extended CP to handle time series data by relying on properties like strong mixing and ergodicity [27, 28]. In a recent work, [29] introduced a random swapping mechanism to address potentially non-exchangeable data, allowing conformal prediction to be applied on top of a model trained with weighted samples. The main focus of this approach was to provide a theoretical basis for the differences observed in the coverage rate of the proposed method. Another recent study by [30] took a deep dive into the Adaptive Conformal Inference (ACI) [31], leading to the development of the Aggregation Adaptive Conformal Inference (AgACI) method. In situations where a dataset contains a group of similar and independent time series, treating each time series as a separate observation, applying a CP method becomes straightforward [32]. For a comprehensive tutorial on CP methods, one can refer to [33]. Beyond CP, other approaches for addressing uncertainty quantification in time series datasets include multi-horizon probabilistic forecasting [34], methods based on dropout [35], and recursive Bayesian approaches [36].
3 Sparse Deep Learning for Time Series Data: Theory
Let $\{y_t\}$ denote a time series sequence, where each $y_t$ is real-valued. Let $(\Omega, \mathcal{F}, P)$ be the probability space of $\{y_t\}$, and let the $k$-th order mixing coefficient of the process be defined in the standard way.
Assumption 3.1.
The time series $\{y_t\}$ is (strictly) stationary, with an exponentially decaying mixing coefficient, and it follows an autoregressive model of order $p$:

$y_t = f(y_{t-1}, \ldots, y_{t-p}, \boldsymbol{x}_t) + \epsilon_t$,  (1)

where $f$ is a non-linear function, $\boldsymbol{x}_t$ contains optional exogenous variables, and $\epsilon_t \sim N(0, \sigma^2)$ with $\sigma^2$ being assumed to be a constant.
Remark 3.2.
Similar assumptions are commonly adopted to establish asymptotic properties of stochastic processes [37, 38, 39, 40, 41, 28]. For example, the asymptotic normality of the maximum likelihood estimator (MLE) can be established under the assumption that the time series is strictly stationary and ergodic, provided that the model size is fixed [37]. A posterior contraction rate for the autoregressive (AR) model can be obtained by assuming the process is mixing at a polynomial rate, which is implied by an exponentially decaying mixing coefficient [41]. For stochastic processes that are strictly stationary and mixing, results such as uniform laws of large numbers and convergence rates of empirical processes can also be obtained [38, 39].
Remark 3.3.
3.1 Posterior Consistency
Both the MLP and the RNN can be used to approximate $f$ as defined in (1); for simplicity, we do not explicitly denote the exogenous variables unless necessary. For the MLP, we can formulate the task as a regression problem, where the input is a window of past observations and the corresponding output is the current observation $y_t$; the dataset can then be expressed as a collection of such input-output pairs. Detailed settings and results for the MLP are given in Appendix B.3. In what follows, we focus on the RNN, which serves as an extension of the previous studies. For the RNN, we rewrite the training dataset by splitting the entire sequence into a set of shorter sequences, where the input window size serves as an upper bound for the exact AR order $p$ and each shorter sequence has a fixed length (see Figure 1). We assume this upper bound is known, but not $p$ itself, since in practice it is unlikely that we know the exact order.

For simplicity of notation, we do not distinguish between weights and biases of the RNN. In this paper, the presence of the subscript $n$ in the notation of a variable indicates its potential to increase with the sample size $n$. To define an RNN with $H$ hidden layers, for $h = 1, \ldots, H$, we let $\psi^h$ and $L_h$ denote, respectively, the nonlinear activation function and the number of hidden neurons at layer $h$; the input layer has a generic input dimension, and the output layer has a single neuron. Because of the existence of hidden states from the past, the input at each step can contain only the most recent observation or a short window of recent observations. Let $\boldsymbol{w}^h$ and $\boldsymbol{u}^h$ denote, respectively, the input and recurrent weight matrices at layer $h$. With these notations, the output at step $t$ of an RNN model can be expressed as

$\boldsymbol{h}_t^{h} = \psi^{h}\big(\boldsymbol{w}^{h}\boldsymbol{h}_t^{h-1} + \boldsymbol{u}^{h}\boldsymbol{h}_{t-1}^{h}\big), \quad h = 1, \ldots, H; \qquad \text{output at step } t = \boldsymbol{w}^{H+1}\boldsymbol{h}_t^{H}$,  (2)

where $\boldsymbol{h}_t^{h}$ denotes the hidden state of layer $h$ at step $t$, with $\boldsymbol{h}_t^{0}$ taken as the input at step $t$, and $\boldsymbol{\beta}$ is the collection of all weights. To represent the structure of a sparse RNN, we introduce an indicator variable for each weight in $\boldsymbol{\beta}$, and we let $\boldsymbol{\gamma}$ collect these indicators, which specifies the structure of a sparse RNN. To include information on the network structure and keep the notation concise, we redenote the network by $\mu(\boldsymbol{\beta}, \boldsymbol{\gamma}, \cdot)$, as its output depends only on $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}$ given the inputs.
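To make the recursion in (2) concrete, the following minimal sketch implements a single-hidden-layer Elman-style RNN forecast with explicit binary masks playing the role of the structure indicator; the layer width, function names, and random masks are illustrative and not taken from the paper.

```python
import torch

def sparse_rnn_forecast(y_window, w_in, u_rec, w_out, mask_in, mask_rec, mask_out):
    """One-step-ahead forecast with a single-hidden-layer Elman RNN.

    y_window : 1-D tensor of past observations, fed one value per step.
    mask_*   : binary tensors with the same shapes as the weight matrices;
               they play the role of the structure indicator in the paper.
    """
    hidden = torch.zeros(u_rec.shape[0])                 # initial hidden state
    for t in range(y_window.shape[0]):
        x_t = y_window[t].reshape(1)                     # input at step t
        pre = (w_in * mask_in) @ x_t + (u_rec * mask_rec) @ hidden
        hidden = torch.tanh(pre)                         # bounded activation (Assumption 3.6)
    return (w_out * mask_out) @ hidden                   # linear output layer

# toy usage with random weights and random sparse masks
torch.manual_seed(0)
width = 8                                                # hidden width (illustrative)
w_in, u_rec, w_out = torch.randn(width, 1), torch.randn(width, width), torch.randn(1, width)
masks = [torch.bernoulli(0.3 * torch.ones_like(w)) for w in (w_in, u_rec, w_out)]
y_hat = sparse_rnn_forecast(torch.randn(12), w_in, u_rec, w_out, *masks)
```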
Posterior consistency is an essential concept in Bayesian statistics, which forms the basis of Bayesian inference. While posterior consistency generally holds for low-dimensional problems, establishing it becomes challenging in high-dimensional scenarios. In such cases, the dimensionality often surpasses the sample size, and if the prior is not appropriately elicited, prior information can overpower the data information, leading to posterior inconsistency.
Following [17, 18, 19], we let each connection weight be subject to a mixture Gaussian prior, i.e.,
$\beta_{i} \sim \lambda_n N(0, \sigma_{1,n}^{2}) + (1 - \lambda_n) N(0, \sigma_{0,n}^{2})$,  (3)

by integrating out the structure information $\boldsymbol{\gamma}$, where $\lambda_n \in (0,1)$ is the mixture proportion, $\sigma_{0,n}$ is typically set to a very small number, while $\sigma_{1,n}$ is relatively large. Visualizations of the mixture Gaussian prior for different values of $\lambda_n$, $\sigma_{0,n}$, and $\sigma_{1,n}$ are given in Appendix E.
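For illustration, a minimal sketch of the log-density of the mixture Gaussian prior (3) is given below; in the regularization view of Section 4, its negative serves as the penalty added to the training loss. The hyperparameter values are placeholders rather than the settings used in the experiments.

```python
import math
import torch

def mixture_gaussian_log_prior(beta, lam=1e-7, sigma0=1e-4, sigma1=1e-1):
    """Log-density of the mixture Gaussian prior (3), summed over all weights.
    The hyperparameter values are placeholders, not the paper's settings."""
    def log_normal(x, sigma):
        return -0.5 * (x / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))
    log_spike = math.log(1.0 - lam) + log_normal(beta, sigma0)   # N(0, sigma0^2) component
    log_slab = math.log(lam) + log_normal(beta, sigma1)          # N(0, sigma1^2) component
    # log(lam * N1 + (1 - lam) * N0), computed stably via logsumexp
    return torch.logsumexp(torch.stack([log_spike, log_slab]), dim=0).sum()

# In the regularization view of Section 4, a penalized training loss can be
# formed as, e.g., loss = nll - mixture_gaussian_log_prior(weights) / n.
```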
We assume that $f$ can be well approximated by a sparse RNN given enough past information, and we refer to this sparse RNN as the true RNN model. To be more specific, we define the true RNN model as

$(\boldsymbol{\beta}^*, \boldsymbol{\gamma}^*) = \operatorname{arg\,min}_{(\boldsymbol{\beta}, \boldsymbol{\gamma}) \in \mathcal{G}_n,\ \|\mu(\boldsymbol{\beta}, \boldsymbol{\gamma}, \cdot) - f\| \le \varpi_n} |\boldsymbol{\gamma}|$,  (4)

where $\mathcal{G}_n$ denotes the space of all valid networks that satisfy Assumption 3.4 for the given network depth and layer widths, and $\varpi_n$ is some sequence converging to $0$ as $n \to \infty$. For any given RNN $(\boldsymbol{\beta}, \boldsymbol{\gamma})$, the error $\mu(\boldsymbol{\beta}, \boldsymbol{\gamma}, \cdot) - f$ can be decomposed into the approximation error $\mu(\boldsymbol{\beta}^*, \boldsymbol{\gamma}^*, \cdot) - f$ and the estimation error $\mu(\boldsymbol{\beta}, \boldsymbol{\gamma}, \cdot) - \mu(\boldsymbol{\beta}^*, \boldsymbol{\gamma}^*, \cdot)$. The former is bounded by $\varpi_n$, and the order of the latter will be given in Theorem 3.8. For the sparse RNN, we make the following assumptions:
Assumption 3.4.
The true sparse RNN model satisfies the following conditions:
- The network structure is sparse: the connectivity of $\boldsymbol{\gamma}^*$, the maximum hidden state dimension, and the input dimension of the true network are suitably controlled relative to the sample size $n$, up to a small constant.
- The network weights are polynomially bounded: $\|\boldsymbol{\beta}^*\|_{\infty}$ is bounded by a polynomial of $n$ of some constant degree.
Remark 3.5.
Assumption 3.4 is identical to Assumption A.2 of [17], which limits the connectivity of the true RNN model to grow only slowly with the sample size $n$. Then, as implied by Lemma 3.9, an RNN of this size is large enough for modeling a time series sequence of length $n$. Refer to [17] for discussions of the universal approximation ability of the neural network under this assumption; the universal approximation ability still holds for many classes of functions, such as affine functions, piecewise smooth functions, and bounded Hölder-smooth functions.
Assumption 3.6.
The activation function used in the recurrent layers is bounded (e.g., sigmoid and tanh), and the activation functions are Lipschitz continuous with a Lipschitz constant of 1 (e.g., ReLU, sigmoid, and tanh).
Remark 3.7.
Let $d(\cdot, \cdot)$ denote the integrated Hellinger distance between two conditional densities, and let $P(\cdot \mid D_n)$ denote the posterior probability of an event given the dataset $D_n$. Theorem 3.8 establishes posterior consistency for the RNN model with the mixture Gaussian prior (3).
Theorem 3.8.
Suppose Assumptions 3.1, 3.4, and 3.6 hold. If the mixture Gaussian prior (3) satisfies appropriate rate conditions on its hyperparameters $\lambda_n$, $\sigma_{0,n}$, and $\sigma_{1,n}$ (as verified in Appendix B.2), then there exists an error sequence $\epsilon_n$ such that $\epsilon_n \to 0$ and $n\epsilon_n^2 \to \infty$ as $n \to \infty$, and the posterior distribution satisfies

(5)

for sufficiently large $n$, where $\nu^*$ denotes the underlying true data distribution, and $\hat{\nu}$ denotes the data distribution reconstructed by the Bayesian RNN based on its posterior samples.
3.2 Uncertainty Quantification with Sparse RNNs
As mentioned previously, posterior consistency forms the basis for Bayesian inference with the RNN model. Based on Theorem 3.8, we further establish structure selection consistency and asymptotic normality of connection weights and predictions for the sparse RNN. In particular, the asymptotic normality of predictions enables the prediction intervals with correct coverage rates to be constructed.
Structure Selection Consistency
It is known that a neural network often suffers from a non-identifiability issue due to the symmetry of its structure. For instance, one can permute certain hidden nodes or simultaneously change the signs or scales of certain weights while keeping the output of the neural network invariant. To address this issue, we follow [17] to define an equivalence class of RNNs, which is a set of RNNs such that any possible RNN for the problem can be represented by one and only one RNN in the class via appropriate weight transformations, and we let $\nu(\cdot)$ denote the operator that maps any RNN to its representative. To serve the purpose of structure selection in this representative space, we consider the marginal posterior inclusion probability (MIPP) approach. Formally, for each connection weight, we define its MIPP as the posterior probability that the corresponding structure indicator equals one after the mapping $\nu(\cdot)$ is applied. The MIPP approach selects the connections whose MIPPs exceed a threshold. Let $\hat{\boldsymbol{\gamma}}$ denote the resulting estimator of the true structure $\boldsymbol{\gamma}^*$, and let $\rho(\epsilon_n)$ denote a quantity measuring the difference between the structure of the true RNN and those of the RNNs sampled from the posterior. Then we have the following lemma:
Lemma 3.9.
If the conditions of Theorem 3.8 are satisfied and $\rho(\epsilon_n) \to 0$ as $n \to \infty$, then
- (i) the estimated MIPPs converge uniformly to the true inclusion indicators in probability as $n \to \infty$;
- (ii) (sure screening) the selected structure contains the true structure $\boldsymbol{\gamma}^*$ with probability tending to one as $n \to \infty$, for any prespecified threshold bounded away from 0 and 1;
- (iii) (consistency) the selected structure coincides with $\boldsymbol{\gamma}^*$ with probability tending to one as $n \to \infty$.
Remark 3.10.
This lemma implies that we can filter out irrelevant variables and simplify the RNN structure when appropriate. Please refer to Section 5.2 for a numerical illustration.
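As a computational illustration of the MIPP approach, the following sketch estimates MIPPs from the structure indicators of (transformed) posterior samples and thresholds them; the array layout and threshold value are assumptions made for the example.

```python
import numpy as np

def select_structure_by_mipp(indicator_samples, threshold=0.5):
    """Select connections by marginal inclusion probability.

    indicator_samples : (num_posterior_samples, num_weights) binary array of
        structure indicators from (transformed) posterior samples.
    Returns the estimated MIPPs and the indices of the selected connections.
    """
    mipp = indicator_samples.mean(axis=0)        # marginal inclusion probabilities
    selected = np.where(mipp > threshold)[0]
    return mipp, selected
```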
Asymptotic Normality of Connection Weights and Predictions
The following two theorems establish the asymptotic normality of the connection weights and of predictions, where the weights are understood up to the transformation $\nu(\cdot)$, which leaves the network output invariant while minimizing the distance to the true parameter.

We follow the same definition of asymptotic normality as in [18, 42, 21]: the posterior distribution of a functional is asymptotically normal with a given center and variance if, under the bounded Lipschitz metric for weak convergence, the centered and scaled posterior distribution of the functional converges to the corresponding normal distribution, i.e.,

(6)

in probability as $n \to \infty$, which we denote using the usual weak-convergence notation.

The detailed assumptions and setups for the following two theorems are given in Appendix C. For simplicity, we let $l_n(\boldsymbol{\beta})$ denote the averaged log-likelihood function, let $H_n(\boldsymbol{\beta})$ denote its Hessian matrix, and let $h_{i,j}$ and $h^{i,j}$ denote the $(i,j)$-th elements of $H_n$ and its inverse, respectively.
Theorem 3.11.
Theorem 3.12 establishes asymptotic normality of the sparse RNN prediction, which implies prediction consistency and forms the theoretical basis for prediction uncertainty quantification as well.
4 Computation
In the preceding section, we established a theoretical foundation for sparse deep learning with time series data under the Bayesian framework. Building on [17], it is straightforward to show that the Bayesian computation can be simplified by invoking the Laplace approximation at the maximum a posteriori (MAP) estimator. This essentially transforms the proposed Bayesian method into a regularization method, interpreting the log-prior density as a penalty on the log-likelihood in RNN training. Consequently, we can train the regularized RNN model using an optimization algorithm such as SGD or Adam. To address the local-trap issue these methods can suffer from, we train the regularized RNN using a prior annealing algorithm [18], as described in Algorithm 1. For a trained RNN, we sparsify its structure by truncating to zero the weights whose magnitudes fall below a threshold, and we then refine the nonzero weights to attain the MAP estimator. For algorithmic specifics, refer to Appendix D.
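The truncate-and-refine step described above can be sketched as follows, assuming a PyTorch model; the threshold choice and the subsequent refinement loop for the nonzero weights are not shown.

```python
import torch

def truncate_and_mask(model, threshold):
    """Zero out weights whose magnitudes fall below `threshold` and return
    boolean masks of the surviving connections, which can be used to keep the
    pruned weights at zero while the remaining weights are refined toward the
    MAP estimator."""
    masks = {}
    with torch.no_grad():
        for name, param in model.named_parameters():
            mask = param.abs() >= threshold
            param.mul_(mask.to(param.dtype))     # truncate small weights to zero
            masks[name] = mask
    return masks
```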
Below, we outline the steps for constructing prediction intervals for one-step-ahead forecasts, where the response is one-dimensional and the estimators of the network parameters and structure are obtained by Algorithm 1 (a code sketch of these steps is given at the end of this section):

- Estimate the noise variance $\sigma^2$ from the residuals of the fitted model on the training set.
- For a test point, estimate the predictive variance term appearing in Theorem 3.12 at the estimated sparse network.
- The corresponding prediction interval is given by the point prediction plus and minus $z_{\alpha/2}$ times the square root of the total predictive variance,

where $n$ observations are used in training, and $z_{\alpha/2}$ denotes the upper $\alpha/2$-quantile of the standard Gaussian distribution.
For construction of multi-horizon prediction intervals, see Appendix F.
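A sketch of the one-step-ahead prediction interval construction is given below. It assumes the delta-method form $\nabla\mu^{\top}(-H)^{-1}\nabla\mu/n$ for the parameter-uncertainty term, which is my reading of the variance estimator in Theorem 3.12 rather than a verbatim reproduction; the function and argument names are likewise illustrative.

```python
import numpy as np
from scipy.stats import norm

def one_step_prediction_interval(y_hat, grad_mu, hessian_ll, sigma2_hat, n_train, alpha=0.1):
    """Prediction interval for a one-step-ahead forecast.

    y_hat      : point prediction at the test point.
    grad_mu    : gradient of the network output w.r.t. the nonzero weights.
    hessian_ll : Hessian of the averaged log-likelihood at the sparse estimator.
    sigma2_hat : estimated noise variance (e.g., from training residuals).
    n_train    : number of training observations.
    """
    # parameter-uncertainty term (delta-method form, assumed here)
    param_var = grad_mu @ np.linalg.solve(-hessian_ll, grad_mu) / n_train
    total_sd = np.sqrt(param_var + sigma2_hat)
    z = norm.ppf(1 - alpha / 2)
    return y_hat - z * total_sd, y_hat + z * total_sd
```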
5 Numerical Experiments
5.1 Uncertainty Quantification
As mentioned in Section 3, we consider two types of time series datasets: the first comprises a single time series, and the second consists of a set of time series. We compare the performance of our method against state-of-the-art Conformal Prediction (CP) methods for both types of datasets. We set the error rate $\alpha = 0.1$ for all uncertainty quantification experiments in the paper, so the nominal coverage level of the prediction intervals is 90%.
5.1.1 A Single Time Series: French Electricity Spot Prices
We perform one-step-ahead forecasts on the French electricity spot prices data from 2016 to 2019, which consist of 35,064 observations. A detailed description and visualization of this time series are given in Appendix G.1. Our goal is to predict the hourly prices of the following day, given the prices up until the end of the current day. As the hourly prices exhibit distinct patterns, we fit one model per hour, as in the CP baseline [30]. We follow the data splitting strategy used in [30], where the first three years of data are used as the (initial) training set, and the prediction is made for the last year (2019).
For all the methods considered, we use the same underlying neural network model: an MLP with one hidden layer of size 100 and the sigmoid activation function. More details on the training process are provided in Appendix G.1. For the state-of-the-art CP methods, EnbPI-V2 [28], NEX-CP [29], ACI [31] and AgACI [30], we conduct experiments in an online fashion, where the model is trained using a sliding window of the previous three years of data (refer to Figure 4 in the Appendix). Specifically, after constructing the prediction interval for each time step in the prediction period, we add the ground truth value to the training set and then retrain the model with the updated training set. For ACI, we conduct experiments with various values of the learning rate $\gamma$ and present the one that yields the best performance. In the case of AgACI, we adopt the same aggregation approach as used in [30], namely, the Bernstein Online Aggregation (BOA) method [43] with a gradient trick. We also report the performance of ACI with $\gamma = 0$ as a reference. For NEX-CP, we use the same weights as those employed in their time-series experiments. For EnbPI-V2, we tune the number of bootstrap models and select the one that offers the best performance.
Since this time series exhibits no or only a minor distribution shift, our method PA is trained in an offline fashion: the model is fixed over the prediction period and trained using only the observations in the initial training set, while the observations in the prediction range are used only for the final evaluation. That is, our method uses less data information in training than the baseline methods.
The results are presented in Figure 2, which reports the empirical coverage (methods positioned closer to 90.0 are more effective), the median/average prediction interval length, and the corresponding interquartile range/standard deviation. As expected, our method is able to train and calibrate the model by using only the initial training set, i.e., the data for 2016-2019, and successfully produces faithful prediction intervals. In contrast, all CP methods produce wider prediction intervals than ours and coverage rates higher than the nominal level of 90%. In addition, ACI is sensitive to the choice of $\gamma$ [30].

5.1.2 A Set of Time Series
We conduct experiments using three publicly available real-world datasets: Medical Information Mart for Intensive Care (MIMIC-III), electroencephalography (EEG) data, and daily COVID-19 case numbers within the United Kingdom's local authority districts (COVID-19) [32]. A concise overview of these datasets is presented in Table 1. Our method, denoted by PA-RNN, is compared to three benchmark methods: CF-RNN [32], MQ-RNN [34], and DP-RNN [35], where an LSTM [44] is used as the underlying prediction model for all methods, including ours. To ensure a fair comparison, we adhere to the same model structure, hyperparameters, and data processing steps as specified in [32]. Detailed information regarding the three datasets and training procedures can be found in Appendix G.2.
The numerical results are summarized in Table 2. Note that the baseline results for EEG and COVID-19 are directly taken from the original paper [32]. We reproduce the baseline results for MIMIC-III, as the specific subset used by [32] is not publicly available. Table 2 indicates that our method consistently outperforms the baselines. In particular, our method consistently generates shorter prediction intervals than the conformal baseline CF-RNN while maintaining the same or an even better coverage rate. Both the MQ-RNN and DP-RNN methods fail to generate prediction intervals that accurately maintain a faithful coverage rate.
Dataset | Train size | Test size | Length | Prediction horizon |
---|---|---|---|---|
MIMIC-III [45] | ||||
EEG [46] | ||||
COVID-19 [47] |
Model | MIMIC-III Coverage | MIMIC-III PI length | EEG Coverage | EEG PI length | COVID-19 Coverage | COVID-19 PI length |
---|---|---|---|---|---|---|
PA-RNN | | | | | | |
CF-RNN | | | | | | |
MQ-RNN | | | | | | |
DP-RNN | | | | | | |
5.2 Autoregressive Order Selection
In this section, we evaluate the performance of our method in selecting the autoregressive order for two synthetic autoregressive processes. The first is the non-linear autoregressive (NLAR) process [48, 49, 50]:
where $\epsilon_t$ represents i.i.d. Gaussian random noise, and $f_1$ and $f_2$ are fixed non-linear functions.
The second is the exponential autoregressive process [51]:
where, again, $\epsilon_t$ denotes i.i.d. Gaussian random noise.
For both synthetic processes, we generate five datasets. Each dataset consists of training, validation, and test sequences. The training sequence has 10000 samples, while the validation and test sequences each contain 1000 samples. For training, we employ a single-layer RNN with a hidden layer width of 1000. Further details on the experimental setting can be found in Appendix G.3.
For the NLAR process, we consider two different window sizes for RNN modeling: one that equals or exceeds the true autoregressive order of the process and one that is smaller. In the former case, the input information suffices for RNN modeling, rendering the past information conveyed by the hidden states redundant. In the latter case, this past information becomes indispensable for the RNN model. In contrast, for the exponential autoregressive process, the input information is sufficient for RNN modeling for all window sizes we explored.
We evaluate the predictive performance using the mean squared prediction error (MSPE) and the mean squared fitting error (MSFE). The model selection performance is assessed by two metrics: the false selection rate (FSR) and the negative selection rate (NSR). The FSR is given by

$\mathrm{FSR} = \sum_{j=1}^{5} |\hat{S}_j \setminus S| \,/\, \sum_{j=1}^{5} |\hat{S}_j|$,

and the NSR by

$\mathrm{NSR} = \sum_{j=1}^{5} |S \setminus \hat{S}_j| \,/\, \sum_{j=1}^{5} |S|$,

where $S$ denotes the set of true variables, and $\hat{S}_j$ represents the set of selected variables from dataset $j$. Furthermore, we provide the final number of nonzero connections for the hidden states and the estimated autoregressive orders. The numerical results for the NLAR process are presented in Table 3, while those for the exponential autoregressive process can be found in Table 4.
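A small helper implementing the FSR and NSR definitions above is sketched here; the set-based representation of the selected variables is an assumption made for the example.

```python
def false_and_negative_selection_rates(true_vars, selected_per_dataset):
    """Compute FSR and NSR over several replicate datasets.

    true_vars            : set of indices of the true variables S.
    selected_per_dataset : list of sets, the selected variables for each dataset j.
    """
    false_sel = sum(len(s - true_vars) for s in selected_per_dataset)
    total_sel = sum(len(s) for s in selected_per_dataset)
    missed = sum(len(true_vars - s) for s in selected_per_dataset)
    fsr = false_sel / max(total_sel, 1)
    nsr = missed / (len(true_vars) * len(selected_per_dataset))
    return fsr, nsr

# example: true lags {1, 2}, selections from five synthetic datasets
fsr, nsr = false_and_negative_selection_rates(
    {1, 2}, [{1, 2}, {1, 2, 5}, {1, 2}, {2}, {1, 2}])
```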
Our results are promising. Specifically, when the window size equals or exceeds the true autoregressive order, all connections associated with the hidden states are pruned, effectively converting the RNN into an MLP. Conversely, if the window size is smaller than the true autoregressive order, a significant number of connections from the hidden states are retained. Impressively, our method accurately identifies the autoregressive order, a noteworthy achievement considering the inherent dependencies in the time series data. Although our method produces a nonzero FSR for the NLAR process, this is quite reasonable considering the relatively short time sequence and the complexity of the functions $f_1$ and $f_2$.
Model Window size FSR NSR AR order #hidden link MSPE MSFE PA-RNN 0 0 - PA-RNN
Model Window size FSR NSR AR order #hidden link MSPE MSFE PA-RNN 0 0 PA-RNN 0 0 PA-RNN 0 0 PA-RNN 0 0 PA-RNN 0 0 PA-RNN
5.3 RNN Model Compression
We have also applied our method to RNN model compression, achieving state-of-the-art results. Please refer to Section G.4 in the Appendix for details.
6 Discussion
This paper has established the theoretical groundwork for sparse deep learning with time series data, including posterior consistency, structure selection consistency, and asymptotic normality of predictions. Our empirical studies indicate that sparse deep learning can outperform current cutting-edge approaches, such as conformal prediction, in prediction uncertainty quantification. More specifically, compared to conformal methods, our method maintains the same coverage rate, if not a better one, while generating significantly shorter prediction intervals. Furthermore, our method effectively determines the autoregressive order for time series data and surpasses state-of-the-art techniques in large-scale model compression.
In summary, this paper represents a significant advancement in statistical inference for deep RNNs: through sparsification, it successfully integrates RNNs into the framework of statistical modeling. The superiority of our method over the conformal methods further underscores the criticality of consistently approximating the underlying data-generating mechanism in uncertainty quantification.

The theory developed in this paper includes the LSTM [44] as a special case, and some numerical examples have been conducted with LSTMs; see Section G of the Appendix for details. Furthermore, there is room for refining the theoretical study under different mixing assumptions for time series data, which could broaden the applications of the proposed method. Also, the efficacy of the proposed method could potentially be further improved by eliciting different prior distributions.
References
- [1] Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu. Modeling long-and short-term temporal patterns with deep neural networks. In The 41st international ACM SIGIR conference on research & development in information retrieval, pages 95–104, 2018.
- [2] David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. Deepar: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3):1181–1191, 2020.
- [3] Syama Sundar Rangapuram, Matthias W Seeger, Jan Gasthaus, Lorenzo Stella, Yuyang Wang, and Tim Januschowski. Deep state space models for time series forecasting. Advances in neural information processing systems, 31, 2018.
- [4] Cristian Challu, Kin G Olivares, Boris N Oreshkin, Federico Garza, Max Mergenthaler, and Artur Dubrawski. N-hits: Neural hierarchical interpolation for time series forecasting. arXiv preprint arXiv:2201.12886, 2022.
- [5] Hansika Hewamalage, Christoph Bergmeir, and Kasun Bandara. Recurrent neural networks for time series forecasting: Current status and future directions. International Journal of Forecasting, 37(1):388–427, 2021.
- [6] Gábor Melis, Tomáš Kočiskỳ, and Phil Blunsom. Mogrifier lstm. arXiv preprint arXiv:1909.01792, 2019.
- [7] Luciano Floridi and Massimo Chiriatti. Gpt-3: Its nature, scope, limits, and consequences. Minds and Machines, 30:681–694, 2020.
- [8] Yuki Tatsunami and Masato Taki. Sequencer: Deep lstm for image classification. arXiv preprint arXiv:2205.01972, 2022.
- [9] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
- [10] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International conference on machine learning, pages 1321–1330. PMLR, 2017.
- [11] Quynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. In International conference on machine learning, pages 2603–2612. PMLR, 2017.
- [12] Marco Gori and Alberto Tesi. On the problem of local minima in backpropagation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(1):76–86, 1992.
- [13] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pages 242–252. PMLR, 2019.
- [14] Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In International conference on machine learning, pages 1675–1685. PMLR, 2019.
- [15] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Gradient descent optimizes over-parameterized deep relu networks. Machine learning, 109(3):467–492, 2020.
- [16] Difan Zou and Quanquan Gu. An improved analysis of training over-parameterized deep neural networks. Advances in neural information processing systems, 32, 2019.
- [17] Yan Sun, Qifan Song, and Faming Liang. Consistent sparse deep learning: Theory and computation. Journal of the American Statistical Association, pages 1–15, 2021.
- [18] Yan Sun, Wenjun Xiong, and Faming Liang. Sparse deep learning: A new framework immune to local traps and miscalibration. Advances in Neural Information Processing Systems, 34:22301–22312, 2021.
- [19] Faming Liang, Qizhai Li, and Lei Zhou. Bayesian neural networks for selection of drug sensitive genes. Journal of the American Statistical Association, 113(523):955–972, 2018.
- [20] Nicholas G Polson and Veronika Ročková. Posterior concentration for sparse deep learning. Advances in Neural Information Processing Systems, 31, 2018.
- [21] Yuexi Wang and Veronika Rocková. Uncertainty quantification for sparse deep learning. In International Conference on Artificial Intelligence and Statistics, pages 298–308. PMLR, 2020.
- [22] Johannes Schmidt-Hieber. Nonparametric regression using deep neural networks with relu activation function. The Annals of Statistics, 48(4):1916–1921, 2020.
- [23] Helmut Bolcskei, Philipp Grohs, Gitta Kutyniok, and Philipp Petersen. Optimal approximation with sparsely connected deep neural networks. SIAM Journal on Mathematics of Data Science, 1(1):8–45, 2019.
- [24] Vitaly Maiorov and Allan Pinkus. Lower bounds for approximation by mlp neural networks. Neurocomputing, 25(1-3):81–91, 1999.
- [25] Jing Lei, Max G’Sell, Alessandro Rinaldo, Ryan J Tibshirani, and Larry Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094–1111, 2018.
- [26] Harris Papadopoulos, Kostas Proedrou, Volodya Vovk, and Alex Gammerman. Inductive confidence machines for regression. In Machine Learning: ECML 2002: 13th European Conference on Machine Learning Helsinki, Finland, August 19–23, 2002 Proceedings 13, pages 345–356. Springer, 2002.
- [27] Victor Chernozhukov, Kaspar Wüthrich, and Zhu Yinchu. Exact and robust conformal inference methods for predictive machine learning with dependent data. In Conference On learning theory, pages 732–749. PMLR, 2018.
- [28] Chen Xu and Yao Xie. Conformal prediction interval for dynamic time-series. In International Conference on Machine Learning, pages 11559–11569. PMLR, 2021.
- [29] Rina Foygel Barber, Emmanuel J Candes, Aaditya Ramdas, and Ryan J Tibshirani. Conformal prediction beyond exchangeability. The Annals of Statistics, 51(2):816–845, 2023.
- [30] Margaux Zaffran, Olivier Féron, Yannig Goude, Julie Josse, and Aymeric Dieuleveut. Adaptive conformal predictions for time series. In International Conference on Machine Learning, pages 25834–25866. PMLR, 2022.
- [31] Isaac Gibbs and Emmanuel Candes. Adaptive conformal inference under distribution shift. Advances in Neural Information Processing Systems, 34:1660–1672, 2021.
- [32] Kamile Stankeviciute, Ahmed M Alaa, and Mihaela van der Schaar. Conformal time-series forecasting. Advances in Neural Information Processing Systems, 34:6216–6228, 2021.
- [33] Anastasios N Angelopoulos and Stephen Bates. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv preprint arXiv:2107.07511, 2021.
- [34] Ruofeng Wen, Kari Torkkola, Balakrishnan Narayanaswamy, and Dhruv Madeka. A multi-horizon quantile recurrent forecaster. arXiv preprint arXiv:1711.11053, 2017.
- [35] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059. PMLR, 2016.
- [36] Derrick T Mirikitani and Nikolay Nikolaev. Recursive bayesian recurrent neural networks for time-series modeling. IEEE Transactions on Neural Networks, 21(2):262–274, 2009.
- [37] Shiqing Ling and Michael McAleer. A general asymptotic theory for time-series models. Statistica Neerlandica, 64(1):97–111, 2010.
- [38] Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, pages 94–116, 1994.
- [39] Ron Meir. Performance bounds for nonlinear time series prediction. In Proceedings of the tenth annual conference on computational learning theory, pages 122–129, 1997.
- [40] Cosma Shalizi and Aryeh Kontorovich. Predictive pac learning and process decompositions. Advances in neural information processing systems, 26, 2013.
- [41] Subhashis Ghosal and Aad van der Vaart. Convergence rates of posterior distributions for non-i.i.d. observations. Annals of Statistics, 35:192–223, 2007.
- [42] Ismael Castillo and Judith Rousseau. A bernstein–von mises theorem for smooth functionals in semiparametric models. Annals of Statistics, 43:2353–2383, 2013.
- [43] Olivier Wintenberger. Optimal learning with bernstein online aggregation. Machine Learning, 106:119–141, 2017.
- [44] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, nov 1997.
- [45] Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. Mimic-iii, a freely accessible critical care database. Scientific data, 3(1):1–9, 2016.
- [46] Catherine Blake. Uci repository of machine learning databases. http://www. ics. uci. edu/~ mlearn/MLRepository. html, 1998.
- [47] GOV UK et al. Coronavirus (covid-19) in the uk, 2020.
- [48] Souhaib Ben Taieb and Amir F Atiya. A bias and variance analysis for multistep-ahead time series forecasting. IEEE transactions on neural networks and learning systems, 27(1):62–76, 2015.
- [49] Marcelo C Medeiros, Timo Teräsvirta, and Gianluigi Rech. Building neural network models for time series: a statistical approach. Journal of Forecasting, 25(1):49–75, 2006.
- [50] Timo Terasvirta and Anders Kock. Forecasting with nonlinear time series models. Econometrics: Multiple Equation Models eJournal, 2010.
- [51] Bjørn Auestad and Dag Tjøstheim. Identification of nonlinear time series: First order characterization and order determination. Biometrika, 77(4):669–687, 1990.
- [52] Wenxin Jiang. Bayesian variable selection for high dimensional generalized linear models: convergence rates of the fitted densities. The Annals of Statistics, 35(4):1487–1511, 2007.
- [53] Subhashis Ghosal and Aad Van der Vaart. Fundamentals of nonparametric Bayesian inference, volume 44. Cambridge University Press, 2017.
- [54] Andre M Zubkov and Aleksandr A Serov. A complete proof of universal inequalities for the distribution function of the binomial law. Theory of Probability & Its Applications, 57(3):539–544, 2013.
- [55] Pentti Saikkonen. Dependent versions of a central limit theorem for the squared length of a sample mean. Statistics & probability letters, 22(3):185–194, 1995.
- [56] Stephen Portnoy. Asymptotic behavior of likelihood methods for exponential families when the number of parameters tends to infinity. The Annals of Statistics, pages 356–366, 1988.
- [57] Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochastic gradient mcmc. Advances in neural information processing systems, 28, 2015.
- [58] Grigoris A Dourbois and Pandelis N Biskas. European market coupling algorithm incorporating clearing conditions of block and complex orders. In 2015 IEEE Eindhoven PowerTech, pages 1–6. IEEE, 2015.
- [59] Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. Conformal prediction with temporal quantile adjustments. arXiv preprint arXiv:2205.09940, 2022.
- [60] Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag. What is the state of neural network pruning? Proceedings of machine learning and systems, 2:129–146, 2020.
- [61] Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandra Peste. Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. J. Mach. Learn. Res., 22(241):1–124, 2021.
- [62] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
- [63] Nadezhda Chirkova, Ekaterina Lobacheva, and Dmitry Vetrov. Bayesian compression for natural language processing. arXiv preprint arXiv:1810.10927, 2018.
- [64] Maxim Kodryan, Artem Grachev, Dmitry Ignatov, and Dmitry Vetrov. Efficient language modeling with automatic relevance determination in recurrent neural networks. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 40–48, 2019.
- [65] Ekaterina Lobacheva, Nadezhda Chirkova, and Dmitry Vetrov. Bayesian sparsification of gated recurrent neural networks. arXiv preprint arXiv:1812.05692, 2018.
- [66] Artem M Grachev, Dmitry I Ignatov, and Andrey V Savchenko. Compression of recurrent neural networks for efficient language modeling. Applied Soft Computing, 79:354–362, 2019.
- [67] Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878, 2017.
Appendix for “Sparse Deep Learning for Time Series Data: Theory and Applications”
Appendix A Mathematical Facts of Sparse RNNs
For a sparse RNN model with recurrent layers and a single output layer, several mathematical facts can be established. Let’s denote the number of nodes in each layer as . Additionally, let represent the number of non-zero connection weights for , and let denote the number of non-zero connection weights for , where and denote the connection weights at two layers of a sparse RNN.
Furthermore, we define as the output value of the -th neuron of the -th layer at time . This output depends on the parameter vector and the input sequence , where represents the generic input to the sparse RNN at time step .
Lemma A.1.
Under Assumption 3.6, if a sparse RNN has at most non-zero connection weights (i.e., ) and , then the summation of the absolute outputs of the th layer at time is bounded by
Proof.
For simplicity, we rewrite as when appropriate. The lemma is the result from the following facts:
-
•
For and (Lemma S4 from [17])
-
•
For and (recursive relationship)
-
•
For and (recursive relationship)
-
•
For and (recursive relationship)
We can verify this lemma by plugging the conclusion into all recursive relationships. We show the steps to verify the case when and , since other cases are trivial to verify.
where the last inequality is due to the fact that
∎
Lemma A.2.
Under Assumption 3.6, consider two RNNs, and , where the former one is a sparse network satisfying and , and its network structure vector is denoted by . Assume that if for all and for all , then
Proof.
Define such that for all and for all . Let denote . Then, from the following facts that
-
•
For and (Lemma S4 from [17])
-
•
For and (recursive relationship)
-
•
For and (recursive relationship)
-
•
For and (recursive relationship)
We have
-
•
For
-
•
For
We can verify the above conclusion by plugging it into all recursive relationships. We show the steps to verify the case when and , since other cases are trivial to verify.
where the last inequality is due to the fact that for and , it is easy to see that
Now let denote , then from the facts that
- •
-
•
For and (Lemma S4 from [17])
-
•
For and (recursive relationship)
-
•
For and (recursive relationship)
-
•
For and (recursive relationship)
we have
-
•
For
-
•
For
We can verify the above conclusion in a similar approach.
The proof is completed by summation of the bound for and . ∎
Appendix B Proofs on Posterior Consistency: A Single Time Series
To establish posterior consistency for DNNs with i.i.d. data, [17] utilized Proposition 1 of [52], which provides three sufficient conditions for proving posterior consistency for general statistical models with i.i.d. data, along with a posterior contraction rate. In this paper, we aim to establish posterior consistency for DNNs with stochastic processes, specifically time series that are strictly stationary and mixing with an exponentially decaying mixing coefficient [53, 41].
Consider a time series defined on a probability space, which satisfies the conditions outlined in Assumption 3.1. For simplicity, we assume that the initial values are fixed and given.
Let denote a set of probability densities, let denote the complement of , and let denote a sequence of positive numbers. Let be the minimum number of Hellinger balls of radius that are needed to cover , i.e., is the minimum of all numbers such that there exist sets , with holding, where
denotes the integrated Hellinger distance [41, 53] between the two conditional densities and , where contains the history up to time steps of , and is the probability density function of the marginal distribution of . Note that is invariant with respect to time index due to the strictly stationary assumption.
For , denote the corresponding true conditional density by . Define as the prior density, and as the posterior. Define for each . Assume the conditions:
-
(a)
for all sufficiently large .
-
(b)
for some and all sufficiently large .
-
(c)
Let , then for some , , and all sufficiently large .
Lemma B.1.
Under the conditions (a), (b) and (c), given sufficiently large , , and , we have for some large ,
(7)
Proof.
This lemma can be proved with arguments similar to those used in the corresponding results of [53]. ∎
B.1 Posterior Consistency with a General Shrinkage Prior
Let denote the vector of all connection weights of a RNN. To prove Theorem 3.8, we first consider a general shrinkage prior that all entries of are subject to an independent continuous prior , i.e., , where denotes the total number of elements of . Theorem B.2 provides a sufficient condition for posterior consistency.
Theorem B.2.
(Posterior consistency) Suppose Assumptions 3.1-3.6 hold. If the prior satisfies that
(8)
(9)
(10)
(11)
for some , where
with some , is the minimal density value of within the interval , and is some sequence satisfying . Then, there exists a sequence , satisfying and , such that
for some large .
Proof.
To prove this theorem, it suffices to check all three conditions listed in Lemma B.1
Checking condition (c):
Consider the set , where
for some and . If , by Lemma A.2, we have
By the definition (4), we have
Finally for some (for simplicity, we take to be an even integer),
For any small , condition (c) is satisfied as long as is sufficiently small, for some large , and the prior satisfies . Since
where the last inequality is due to the fact that . Note that . Since (note that )), then for sufficiently large ,
Thus, the prior satisfies for sufficiently large , when for some sufficiently large constant . Thus condition (c) holds.
Checking condition (a):
Let denote the set of probability densities for the RNNs whose parameter vectors satisfy
where denotes the number of input connections with the absolute weights greater than , and will be specified later, and
for some . Let
Consider two parameter vectors and in set , such that there exists a structure with and , and for all , for all . Hence, by Lemma A.2, we have that . For two normal distributions and , define the corresponding Kullback-Leibler divergence as
Together with the fact that , we have
for some , given a sufficiently small .
Given the above results, one can bound the packing number by where denotes the number of all valid networks who has exact connections and has no more than inputs. Since , , , and , then
We can choose and such that for sufficiently large , and then .
Checking condition (b):
Lemma B.3.
(Theorem 1 of [54]) Let be a Binomial random variable. For any
where is the cumulative distribution function (CDF) of the standard Gaussian distribution and .
B.2 Proof of Theorem 3.8
Proof.
To prove Theorem 3.8 in the main text, it suffices to verify the four conditions on the prior listed in Theorem B.2. Let . Condition 8 can be verified by choosing such that . Conditions 9 and 10 can be verified by setting and . Finally, condition 11 can be verified by and . Based on the proof above, we also see that and .
∎
B.3 Posterior Consistency for Multilayer Perceptrons
As highlighted in Section 3, the MLP can be formulated as a regression problem, where the input is a window of past observations and the corresponding output is the current observation; the dataset can thus be represented as a collection of such input-output pairs. We apply the same assumptions, specifically Assumptions 3.1, 3.4, and 3.6. We also use the same definitions and notations for the MLP as those in [17, 18].
Leveraging the mathematical properties of sparse MLPs presented in [17] and the proof of sparse RNNs for a single time series discussed above, one can straightforwardly derive the following Corollary. Let denote the integrated Hellinger distance between two conditional densities and . Let be the posterior probability of an event.
Corollary B.4.
Suppose Assumptions 3.1, 3.4, and 3.6 hold. If the mixture Gaussian prior (3) satisfies the same hyperparameter conditions as in Theorem 3.8, then there exists an error sequence $\epsilon_n$ such that $\epsilon_n \to 0$ and $n\epsilon_n^2 \to \infty$, and the posterior distribution satisfies

(12)

for sufficiently large $n$, where $\nu^*$ denotes the underlying true data distribution, and $\hat{\nu}$ denotes the data distribution reconstructed by the Bayesian MLP based on its posterior samples.
Appendix C Asymptotic Normality of Connection Weights and Predictions
This section provides detailed assumptions and proofs for Theorem 3.11 and 3.12. For simplicity, we assume is also given, and we let . Let denote the likelihood function, and let denote the density of the mixture Gaussian prior (3). Let which denotes the -th order partial derivatives. Let denote the Hessian matrix of , and let denote the -th component of and , respectively. Let and . For a RNN with , we define the weight truncation at the true model structure for and for . For the mixture Gaussian prior (3), let .
Assumption C.1.
Assume the conditions of Lemma 3.9 hold with and the defined in Assumption 3.4. For some s.t. , let , where is the posterior contraction rate as defined in Theorem 3.8. Assume there exists some constants and such that
-
C.1
is generic, and as .
-
C.2
hold for any , and .
-
C.3
and , where , denotes a random variable drawn from a neural network model parameterized by , and denotes the mean of .
-
C.4
and the conditions for Theorem of [55] hold.
Conditions C.1-C.3 align with the assumptions made in [18]. An additional assumption, C.4, is introduced to account for the dependency inherent in time series data. This assumption is crucial and is employed in conjunction with the other conditions to establish the consistency of the maximum likelihood estimator (MLE) of the network parameters for a given structure. While a weaker version of this restriction is sufficient for independent data, being implied by Assumption 3.4, a stronger restriction is necessary for dependent data such as time series. It is worth noting that the conditions used in the theorem of [55] pertain specifically to the time series data itself.
Let denote the $d$-th order partial derivative with respect to some input.
Assumption C.2.
for any , and , where is as defined in Assumption C.1.
Proof of Theorem 3.11 and Theorem 3.12
The proof of these two theorems can be conducted following the approach in [18]. The main difference is that [18] relies on a theorem of [56], which assumes independent data. The same conclusions can be extended to time series data by instead using the theorem of [55], taking into account Assumption C.4. This allows dependent data to be accommodated in the analysis.
Appendix D Computation
Algorithm 1 gives the prior annealing procedure [18]. In practice, the following implementation can be followed based on Algorithm 1 (a code sketch of this schedule is given at the end of this section):

- In the first stage of iterations, perform initial training.
- In the second stage, keep $\sigma_{0,n}$ at its initial value and gradually increase the influence of the prior.
- In the third stage, keep the prior fully weighted and gradually decrease $\sigma_{0,n}$ from its initial value toward its target value according to a prescribed formula.
- In the final stage, fix the prior weight and $\sigma_{0,n}$ at their target values, and gradually decrease the temperature toward zero according to a prescribed formula with a constant decay factor.
Please refer to Appendix E for intuitive explanations of the prior annealing algorithm and further details on training a model to achieve the desired sparsity.
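A minimal sketch of the staged schedule described above is given below. The stage boundaries, the linear ramp for the prior weight, the linear decay of sigma0, and the reciprocal temperature decay are placeholders standing in for the exact formulas of Algorithm 1 in [18].

```python
def prior_annealing_schedule(t, t1, t2, t3, sigma0_init=1e-2, sigma0_end=1e-5, tau_const=1.0):
    """Return (prior_weight, sigma0, temperature) for iteration t, following the
    four stages described above; the boundaries t1 < t2 < t3 and all values
    are illustrative placeholders."""
    if t <= t1:                               # stage 1: initial training, prior off
        return 0.0, sigma0_init, 1.0
    if t <= t2:                               # stage 2: ramp up the prior weight
        return (t - t1) / (t2 - t1), sigma0_init, 1.0
    if t <= t3:                               # stage 3: shrink sigma0 toward its target
        frac = (t - t2) / (t3 - t2)
        return 1.0, sigma0_init + frac * (sigma0_end - sigma0_init), 1.0
    # stage 4: keep the prior fixed and cool the temperature toward zero
    return 1.0, sigma0_end, min(1.0, tau_const / (t - t3))
```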
Appendix E Mixture Gaussian Prior
The mixture Gaussian prior imposes a penalty on the model parameters by acting as a piecewise L2 penalty, applying varying degrees of penalty in different regions of the parameter space. Given values for $\lambda_n$, $\sigma_{0,n}$, and $\sigma_{1,n}$, a threshold value can be computed as in Algorithm 1. Parameters whose absolute values are below this threshold receive a large penalty, hence constituting the "penalty region", while parameters whose absolute values are above the threshold receive relatively minimal penalty, forming the "free space". The severity of the penalty in the free space largely depends on the value of $\sigma_{1,n}$.
For instance, as depicted in Figure 3, based on the simple practical implementation detailed in Appendix D, we fix $\lambda_n$ and $\sigma_{1,n}$, set the initial and final values of $\sigma_{0,n}$, and gradually reduce $\sigma_{0,n}$ from the initial value to the final value. Initially, the free space is relatively small and the penalty region is relatively large. However, the penalty imposed on the parameters within the initial penalty region is also minor, making it challenging to shrink these parameters to zero. As we progressively decrease $\sigma_{0,n}$, the free space enlarges, the penalty region diminishes, and the penalty on the parameters within the penalty region intensifies simultaneously. Once $\sigma_{0,n}$ reaches its final value, the parameters within the penalty region will be close to zero, and the parameters outside the penalty region can vary freely in almost all areas of the parameter space.
For model compression tasks, achieving the desired sparsity involves several steps after the initial training phase outlined in Algorithm 1. First, determine the initial threshold value based on preset values of $\lambda_n$, $\sigma_{0,n}$, and $\sigma_{1,n}$. Then, compute the proportion of parameters in the initial model whose absolute values fall below this threshold; this proportion serves as an estimate of the anticipated sparsity. Adjust the prior hyperparameters until the estimated sparsity aligns closely with the desired sparsity level (a code sketch of this calibration is given below).
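The calibration procedure described above can be sketched as follows. The threshold is taken as the magnitude at which the slab component of the prior starts to dominate the spike component, and the calibration adjusts sigma0 over a grid; both choices are assumptions made for this sketch rather than the exact recipe of Algorithm 1.

```python
import math
import numpy as np

def mixture_prior_threshold(lam, sigma0, sigma1):
    """Magnitude at which the slab component starts to dominate the spike,
    i.e. lam * N(x; 0, sigma1^2) = (1 - lam) * N(x; 0, sigma0^2)."""
    ratio = math.log(((1 - lam) * sigma1) / (lam * sigma0))
    return math.sqrt(2 * ratio * sigma0**2 * sigma1**2 / (sigma1**2 - sigma0**2))

def estimated_sparsity(weights, lam, sigma0, sigma1):
    """Fraction of weights whose magnitudes fall below the threshold."""
    return float(np.mean(np.abs(weights) < mixture_prior_threshold(lam, sigma0, sigma1)))

def calibrate_sigma0(weights, target, lam=1e-7, sigma1=1e-1, grid=np.logspace(-8, -3, 200)):
    """Pick sigma0 from a grid so the estimated sparsity is closest to `target`."""
    return min(grid, key=lambda s0: abs(estimated_sparsity(weights, lam, s0, sigma1) - target))
```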
Appendix F Construction of Multi-Horizon Joint Prediction Intervals
To construct valid joint prediction intervals for multi-horizon forecasting, we estimate the individual variances for each time step in the prediction horizon using procedures similar to those for one-step-ahead forecasting. We then apply a Bonferroni correction: the significance level $\alpha$ is divided by the number of time steps in the prediction horizon, and the correspondingly adjusted critical value is used for each step. This ensures the desired coverage probability across the entire prediction horizon; a code sketch of this construction is given at the end of this section.
As an example, we refer to the experiments conducted in Section 5.1.2, where we work with a set of training sequences; each sequence consists of an observed part and a prediction horizon to be forecast.
-
•
Train a model by the proposed algorithm, and denote the trained model by .
-
•
Calculate as an estimator of :
where , and denotes elementwise product.
-
•
For a test sequence , calculate
Let denote the vector formed by the diagonal elements of .
-
•
The Bonferroni simultaneous prediction intervals for all elements of are given by
where represents the element-wise square root operation.
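A minimal sketch of the Bonferroni-adjusted simultaneous intervals described above, assuming the per-step predictive variances have already been estimated:

```python
import numpy as np
from scipy.stats import norm

def bonferroni_joint_intervals(point_forecasts, variances, alpha=0.1):
    """Bonferroni-adjusted simultaneous prediction intervals over a horizon.

    point_forecasts : array of length H with predicted values.
    variances       : array of length H with total predictive variances
                      (parameter uncertainty plus noise) for each step.
    """
    horizon = len(point_forecasts)
    z = norm.ppf(1 - alpha / (2 * horizon))          # adjusted critical value
    half_width = z * np.sqrt(np.asarray(variances, dtype=float))
    forecasts = np.asarray(point_forecasts, dtype=float)
    return forecasts - half_width, forecasts + half_width
```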
The Bayesian method is particularly advantageous when dealing with a large number of non-zero connection weights in the network, which makes the computation of the Hessian matrix of the log-likelihood function costly or infeasible. For detailed information on utilizing the Bayesian method for constructing prediction intervals, please refer to [18].
Appendix G Numerical Experiments
G.1 French Electricity Spot Prices
Dataset. The dataset contains the spot prices of electricity in France established over a period of four years, from 2016 to 2019, using an auction market. In this market, producers and suppliers submit their orders for the next day's 24-hour period, specifying the electricity volume in MWh they intend to sell or purchase, along with the corresponding price in €/MWh. At midnight, the Euphemia algorithm, as described in [58], calculates the hourly prices for the next day based on the submitted orders and other constraints. This hourly dataset consists of 35,064 observations. Our main objective is to predict the prices for the next day by considering different explanatory variables, such as the day-ahead forecast consumption, the day of the week, and the prices of the previous day and of the same day in the previous week, as these variables are crucial in determining the spot prices of electricity. Refer to Figure 4 for a visual representation of the dataset.
The prediction models and training settings described below are the same for all hours.


Prediction Model. We use an MLP with one hidden layer of size 100 and the sigmoid activation function as the underlying prediction model for all methods and all hours.
Training: Baselines. For all CP baselines, we train the MLP using SGD with a constant learning rate, momentum, and a fixed batch size. For the ACI, we use the same list of $\gamma$ values as in [30].
Training: Our Method. For our method PA, we train with the same learning rate, momentum, and batch size as the baselines. We run SGD with momentum during the initial training phase and SGHMC with a prescribed temperature during the prior annealing phase. The hyperparameters of the mixture Gaussian prior are fixed throughout training.
Methods | Coverage | Average PI Lengths (standard deviation) | Median PI Lengths (interquartile range) |
---|---|---|---|
PA offline (ours) | | | |
AgACI online | | | |
ACI online | | | |
ACI online | | | |
EnbPI V2 online | | | |
NexCP online | | | |
G.2 EEG, MIMIC-III, and COVID-19
EEG The EEG dataset, which is publicly available, served as the primary source for the EEG signal time series. This dataset contains responses from both control and alcoholic subjects who were presented with visual stimuli of three different types. To maintain consistency with previous work [32], we utilized the medium-sized version, consisting of both control and alcoholic subjects. We focused solely on the control subjects for our experiments, as the dataset summaries indicated their EEG responses were more difficult to predict. For detailed information on our processing steps, please refer to [32].
COVID-19 We followed the processing steps of previous research [32] and utilized COVID-19 data from various regions within the same country to minimize potential distribution shifts while adhering to the exchangeability assumption. The lower-tier local authority split provided us with a set of sequences, which we randomly allocated between the training, calibration, and test sets over multiple trials. The dataset is publicly available online.
MIMIC-III We collect patient data on the use of antibiotics (specifically Levofloxacin) from the MIMIC-III dataset [45]. However, the subset of the MIMIC-III dataset used in [32] is not publicly available to us, and the authors did not provide information on the processing steps, such as SQL queries and Python code. Therefore, we follow the published processing steps provided in [59]. We remove sequences with fewer than visits or more than visits, resulting in a total of sequences. We randomly split the dataset into train, calibration, and test sets, with corresponding proportions of , , and . We use the white blood cell count (high) as the feature for the univariate time series from these sequences. Access to the MIMIC-III dataset requires PhysioNet credentialing.
Table 6 presents the training hyperparameters and prediction models used for the baselines in the three datasets, while Table 7 presents the training details used for our method.
We hypothesize that the mediocre performance on the EEG dataset is due to the small size of the prediction model, whose number of parameters is substantially smaller than the number of training sequences. Hence, we conducted experiments on the EEG dataset using a slightly larger prediction model for the different methods. The results are presented in Table 8, and the corresponding training details are provided in Table 6. Again, our method outperforms the other baselines.
Hyperparameter | MIMIC-III | EEG | COVID-19 |
---|---|---|---|
Model | LSTM- | LSTM- | LSTM- |
Learning rate | | | |
Epochs | | | |
Optimizer | Adam | Adam | Adam |
Batch size | | | |
Hyperparameter | MIMIC-III | EEG | COVID-19 |
---|---|---|---|
Model | LSTM- | LSTM- | LSTM- |
Learning rate | | | |
Epochs | | | |
Optimizer | Adam | Adam | Adam |
Batch size | | | |
Temperature | | | |
Model | Coverage (EEG) | PI lengths (EEG) |
---|---|---|
PA-RNN | | |
CF-RNN | | |
MQ-RNN | | |
DP-RNN | | |
PA-RNN CF-RNN MQ-RNN DP-RNN model LSTM- LSTM- LSTM- LSTM- learning rate Epochs Optimizer Adam Adam Adam Adam Batch size - - - - - - - - - - - - temperature - - - - - - - - - - - -
G.3 Autoregressive Order Selection
Model. An Elman RNN with one hidden layer of width 1000 is used. Different window sizes result in different total numbers of parameters.
Hyperparameters. All training hyperparameters are given in Table 10.
PA-RNN PA-RNN RNN RNN Learning rate Iterations Optimizer SGHMC SGHMC SGHMC SGHMC Batch size Subsample size per iteration Prediction horizon - - - - - - - - Temperature - - - - - - - -
G.4 Large-Scale Model Compression
As pointed out by recent summary/survey papers on sparse deep learning [60, 61], the lack of standardized benchmarks and metrics that provide guidance on model structure, task, dataset, and sparsity levels has caused difficulties in conducting fair and meaningful comparisons with previous works. For example, the task of compressing Penn Tree Bank (PTB) word language model [62] is a popular comparison task. However, many previous works [63, 64, 65, 66] have avoided comparison with the state-of-the-art method by either not using the standard baseline model, not reporting, or conducting comparisons at different sparsity levels. Therefore, we performed an extensive search of papers that reported performance on this task, and to the best of our knowledge, the state-of-the-art method is the Automated Gradual Pruning (AGP) by [67].
In our experiments, we train large stacked LSTM language models on the PTB dataset at different sparsity levels. The model architecture follows the same design as in [62], comprising an embedding layer, two stacked LSTM layers, and a softmax layer; the vocabulary size, embedding size, and hidden layer size follow [62], resulting in a model with tens of millions of parameters. We compare our method with AGP at the same five sparsity levels as in [67]. The results are summarized in Table 11, and our method consistently achieves better results. For AGP, the numbers are taken directly from the original paper; since only one experimental result is provided for each sparsity level, no standard deviation is reported. For our method, we run three independent trials and report both the mean and standard deviation for each sparsity level. During the initial training stage, we follow the same training procedure as in [62]. The details of the prior annealing and fine-tuning stages of our method for different sparsity levels are provided below.
During the initial training stage, we follow the same training procedure as in [62] for all sparsity levels.

During the prior annealing stage, we train for a fixed number of epochs using SGHMC. For all levels of sparsity considered, we fix the mixture-prior hyperparameters, the momentum, and the minibatch size, set an initial temperature, and gradually decrease the temperature over the course of this stage.

During the fine-tuning stage, we apply a training procedure similar to that of the initial training stage, i.e., we use SGD with gradient clipping and decrease the learning rate by a constant factor after a certain number of epochs, and we apply early stopping based on the validation perplexity.
Table 12 gives all hyperparameters (not specified above) for different sparsity levels.
Note that the mixture Gaussian prior is by nature a regularization method, so we lower the dropout ratio during the prior annealing and fine-tuning stages for models with relatively high sparsity.
Method | Sparsity | Test Perplexity |
---|---|---|
baseline | ||
PA | ||
PA | ||
PA | ||
PA | ||
PA | ||
AGP | ||
AGP | ||
AGP | ||
AGP | ||
AGP |
Hyperparameters/Sparsity | |||||
---|---|---|---|---|---|
Dropout ratio | |||||
LR PA | |||||
LR FT | |||||
LR decay factor FT | |||||
LR decay epoch FT |
PA-RNN Learning rate Iterations Optimizer SGHMC Batch size Subsample size per iteration Prediction horizon Temperature