
Inventec Corporation - AI Center, Taiwan

A Trainable Reconciliation Method
for Hierarchical Time-Series

Davide Burba    Trista Chen
Abstract

In numerous applications, it is required to produce forecasts for multiple time-series at different hierarchy levels. An obvious example is given by the supply chain in which demand forecasting may be needed at a store, city, or country level. The independent forecasts typically do not add up properly because of the hierarchical constraints, so a reconciliation step is needed. In this paper, we propose a new general, flexible, and easy-to-implement reconciliation strategy based on an encoder-decoder neural network. By testing our method on four real-world datasets, we show that it can consistently reach or surpass the performance of existing methods in the reconciliation setting.

Keywords:
hierarchical time-series, reconciliation, forecasting, neural networks

1 Introduction

A hierarchical time-series is a collection of time-varying observations organized in a hierarchical structure. The problem of forecasting hierarchical time-series often appears in business and economics, where time-varying quantities need to be predicted at different granularity levels. For instance, in the supply chain, forecasts of the demand may be required at a country, city, or store level to organize the logistics. A synthetic example of a hierarchical time-series is shown in Figure 1.

Figure 1: Example of a hierarchical time-series: (a) time-series hierarchy; (b) time-series values. The hierarchy constraints are $A_{t}=B_{t}+C_{t}$, $B_{t}=D_{t}+E_{t}$, and $C_{t}=F_{t}+G_{t}$.

Predictions for hierarchical time-series are typically built in two independent stages. First, forecasts are produced for all or some of the time-series. Then, the forecasts are reconciled to enforce the hierarchy constraints. The main methods for reconciling time-series predictions are known as “bottom-up”, “top-down”, “optimal reconciliation”, and “trace minimization”. These methods can be expressed in a common framework [1], and at the moment there is no consensus about their relative performance. A common drawback of existing methods is that they are not flexible, meaning that they do not allow a specific metric to be optimized. However, in real-world forecasting applications it is typically required to minimize a given metric, e.g. the Mean Absolute Scaled Error (MASE) or the Mean Logarithm of Absolute Error (MLAE), and the modeling choices depend on the chosen metric.

With the rise of deep learning in the past years, some attempts have been proposed to improve performance and overcome the current limitations in the reconciliation setting [2, 3]. These methods exploit the hierarchical structure by imposing soft constraints in the loss function to regularize the training process, improve forecasting performance, and tighten the reconciliation gaps. However, they cannot guarantee an exact reconciliation, i.e. the hierarchy constraints may be violated.

In this work, we propose a new exact methodology to reconcile hierarchical time-series forecasts based on an encoder-decoder neural network. The encoder is a trainable neural network that takes the independent forecasts as input and outputs the bottom-level reconciled forecasts. The decoder is a fixed matrix which exactly reconstructs the forecasts at all levels from the encoded bottom-level predictions. Our method includes and generalizes the representation space of existing methods, is extremely flexible, and is easy to implement. We apply it to four real-world datasets, and we show that it consistently achieves performance better than or equal to that of existing reconciliation methods.

To summarize, the main contributions of this paper are: (1) we propose a new easy-to-implement reconciliation method which is more general and flexible than existing approaches, (2) we empirically show that this method improves over the state-of-the-art by testing it on multiple heterogeneous datasets.

The rest of the paper is organized as follows. In Section 2 we briefly describe existing reconciliation methods. In Section 3 we describe the proposed method and in Section 4 we show our implementation. In Section 5 we describe the datasets used in the experiments and in Section 6 we show the results of the experiments. We conclude in Section 7.

2 Related Works

In this Section we briefly introduce the main existing reconciliation methods, i.e. bottom-up, top-down, optimal reconciliation, and trace minimization.

2.0.1 Bottom-Up

The Bottom-Up (BU) [1] approach consists in producing forecasts only for the bottom level of the hierarchy and then aggregating them to the upper levels. It is the simplest reconciliation method.

2.0.2 Top-Down

The Top-Down (TD) approach consists of making predictions at the top level of the hierarchy and distributing them to the lower levels. Different splitting strategies correspond to different Top-Down methods. The most popular ones are Top-Down Historical Proportions (TDHP) [4], which computes the splits based on the historical time-series proportions, and Top-Down Forecasted Proportions (TDFP) [5], which, at each forecasted time-step, iteratively computes the forecasted proportions while descending the hierarchy tree. Note that TDFP requires predictions at all levels of the hierarchy to compute the proportions.

TD and BU approaches may be mixed in the Middle-Out (MO) approach. With MO, we first select an intermediate level of the hierarchy; then, predictions are computed via BU for upper levels and TD for lower levels.

2.0.3 Optimal-Combination

The Optimal Combination (OC) method [1] requires producing forecasts for all the time-series at all levels of the hierarchy. The predictions are then mixed via a constrained linear combination to produce the reconciled predictions. The values of the linear coefficients are based on an estimate of the covariance matrix of the incoherence errors.

2.0.4 Trace-Minimization

In [6] it is shown that the covariance matrix of the incoherence errors for the OC method is non-identifiable, and therefore impossible to estimate. The authors propose instead the Trace-Minimization (TM) method [6], a constrained linear combination method whose estimate is based on the covariance matrix of the forecast errors. The TM estimate minimizes the mean squared error of the coherent forecasts under the constraint of unbiasedness.

All the methods described above can be framed in a common framework [1]. Let us denote by $N$ the total number of time-series at all levels and by $M$ the number of time-series at the bottom level. Let $\mathbf{\hat{y}}_{t}$ be the vector containing the individual time-series predictions at time $t$ and $\mathbf{\tilde{y}}_{t}$ the vector with the corresponding reconciled predictions. Then, the methods described above can be expressed as:

$$\mathbf{\tilde{y}}_{t}=SP\,\mathbf{\hat{y}}_{t} \tag{1}$$

where $P$ is a method-specific $(M\times N)$ matrix which maps the individual predictions to the bottom-level reconciled predictions, and $S$ is the hierarchy-specific $(N\times M)$ matrix which reconstructs the predictions at all levels of the hierarchy by summing the corresponding bottom-level values.

As an example, let us consider the hierarchy illustrated in Figure 1. If we set $\mathbf{\hat{y}}_{t}=[\hat{A}_{t},\dots,\hat{G}_{t}]^{T}$ and $\mathbf{\tilde{y}}_{t}=[\tilde{A}_{t},\dots,\tilde{G}_{t}]^{T}$, then the BU approach corresponds to the following matrices:

$$P^{(BU)}=\begin{bmatrix}
0&0&0&1&0&0&0\\
0&0&0&0&1&0&0\\
0&0&0&0&0&1&0\\
0&0&0&0&0&0&1
\end{bmatrix}
\qquad
S=\begin{bmatrix}
1&1&1&1\\
1&1&0&0\\
0&0&1&1\\
1&0&0&0\\
0&1&0&0\\
0&0&1&0\\
0&0&0&1
\end{bmatrix}$$

and therefore:

$$\mathbf{\tilde{y}}_{t}=SP^{(BU)}
\begin{bmatrix}\hat{A}_{t}\\ \hat{B}_{t}\\ \hat{C}_{t}\\ \hat{D}_{t}\\ \hat{E}_{t}\\ \hat{F}_{t}\\ \hat{G}_{t}\end{bmatrix}
=\begin{bmatrix}\hat{D}_{t}+\hat{E}_{t}+\hat{F}_{t}+\hat{G}_{t}\\ \hat{D}_{t}+\hat{E}_{t}\\ \hat{F}_{t}+\hat{G}_{t}\\ \hat{D}_{t}\\ \hat{E}_{t}\\ \hat{F}_{t}\\ \hat{G}_{t}\end{bmatrix}$$

For the TD approach, the $P$ matrix is filled with zeros except for the first column, which sums to one. For the optimal combination and trace-minimization methods, $P$ is a dense matrix.
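To make the framework concrete, the following is a minimal NumPy sketch of Equation (1) for the hierarchy of Figure 1 with the bottom-up $P$ matrix; the forecast values are purely illustrative.

```python
import numpy as np

# Summing matrix S for the hierarchy of Figure 1 (rows: A..G, columns: D..G).
S = np.array([
    [1, 1, 1, 1],   # A = D + E + F + G
    [1, 1, 0, 0],   # B = D + E
    [0, 0, 1, 1],   # C = F + G
    [1, 0, 0, 0],   # D
    [0, 1, 0, 0],   # E
    [0, 0, 1, 0],   # F
    [0, 0, 0, 1],   # G
])

# Bottom-up mapping P: keep only the four bottom-level forecasts (D..G).
P_bu = np.zeros((4, 7))
P_bu[np.arange(4), 3 + np.arange(4)] = 1.0

y_hat = np.array([10.0, 6.0, 4.0, 3.5, 2.5, 2.2, 1.8])  # independent forecasts A..G
y_tilde = S @ P_bu @ y_hat                               # reconciled forecasts
assert np.isclose(y_tilde[0], y_tilde[3:].sum())         # A = D + E + F + G holds
```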

3 Proposed Method

In our approach we propose to generalize the $(M\times N)$ matrix $P$ described in Section 2 with a generic function $p:\mathbb{R}^{N}\rightarrow\mathbb{R}^{M}$ represented via a neural network. We can think of the proposed method as a simple encoder-decoder neural network, where the network $p$ represents the (trainable) encoder and the matrix $S$ represents the (fixed) decoder. The encoder maps the input predictions to the bottom-level reconciled predictions, and the decoder takes the latter as input and reconstructs the predictions at all levels. A representation of our method is given in Figure 2.

Figure 2: Trainable reconciler structure (refer to the hierarchy in Figure 1).
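This structure translates directly into a short PyTorch module. The sketch below is a minimal version assuming a fully connected encoder with a single hidden layer; the class name, layer sizes, and argument names are our own illustration rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class TrainableReconciler(nn.Module):
    """Encoder-decoder reconciler: trainable encoder, fixed summing matrix S."""

    def __init__(self, S: torch.Tensor, hidden_size: int):
        super().__init__()
        n_total, n_bottom = S.shape  # N time-series in total, M at the bottom
        # Trainable encoder: maps the N independent forecasts to the M
        # bottom-level reconciled forecasts.
        self.encoder = nn.Sequential(
            nn.Linear(n_total, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_bottom),
        )
        # Fixed decoder: the summing matrix S is registered as a buffer,
        # so it is saved with the model but never updated by the optimizer.
        self.register_buffer("S", S)

    def forward(self, y_hat: torch.Tensor) -> torch.Tensor:
        bottom = self.encoder(y_hat)   # (batch, M) bottom-level predictions
        return bottom @ self.S.T       # (batch, N), coherent by construction
```

For the hierarchy of Figure 1, $S$ is the $7\times 4$ summing matrix of Section 2; since every output is rebuilt from the encoded bottom level, the hierarchy constraints hold exactly regardless of the encoder weights.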

This approach has two main theoretical advantages: generalization and flexibility. The generalization advantage refers to the wide representation space of our model, which includes the one described in Section 2 and extends it, allowing for non-linearities. The flexibility advantage refers to the fact that our approach allows us to target specific performance goals via the choice of a suitable loss function in the training phase. Existing methods are simple heuristics (bottom-up, top-down) or minimize the estimated coefficients’ errors under different assumptions (optimal combination, trace minimization). Our method instead allows us to target different metrics. For instance, if our target is the Mean Logarithm of Absolute Error (MLAE), we can use the MLAE as a loss function. Or, if we are especially interested in a specific level of the hierarchy, we can assign a higher weight to the corresponding loss terms. Similarly, if the importance of errors depends on their scale or is asymmetric, we can change the loss function accordingly so as to optimize for our target.

Last but not least, our model has the practical advantage of being easy to implement and accessible to the wide and fast-growing deep learning community, as opposed to complex statistical models such as optimal combination or trace minimization. Furthermore, if the forecasting models can be expressed in a deep-learning framework, our approach allows us to stack the reconciliation network on top of them and simultaneously train the reconciler and fine-tune the forecasting models.

4 Implementation

In this Section we describe how we implement our trainable reconciliation method to perform the experiments. We use the Scikit-learn [7] and LightGBM [8] libraries for the forecasting models, and the PyTorch [9] framework for our reconciliation method. We develop our analysis in Python, and we use R with the hts [10] package to fit the Optimal-Combination and Trace-Minimization methods.

4.1 Time-Series Prediction

For simplicity, we only consider the task of forecasting the one-step-ahead time-series values. To train the forecasting models we consider two alternative strategies, i.e. estimating an individual model for each time-series, or a global model for all of them. For the individual strategy, we consider linear autoregressive models ($AR(p)$), taking the lagged time-series values as input. For the datasets which present a high number of time-series ($>500$), we estimate a single global predictive model to forecast all the time-series. This approach allows us to exploit the time-series similarities and to estimate a more complex model [11]. We consider the Light Gradient Boosting (LightGBM) model, taking as input the scaled lagged values, time-series specific features (such as the level in the hierarchy), and temporal features.
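As an illustration of the individual strategy, the following is a minimal sketch of fitting a linear $AR(p)$ model on lagged values with Scikit-learn; the toy series, the lag order, and the helper function are our own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def make_lagged(y: np.ndarray, p: int):
    """Build a design matrix whose rows hold the p values preceding each target."""
    X = np.column_stack([y[i:len(y) - p + i] for i in range(p)])
    return X, y[p:]

rng = np.random.default_rng(0)
y = np.sin(np.arange(200) / 5.0) + rng.normal(0, 0.1, 200)  # toy series

p = 4                                      # lag order, chosen by cross-validation
X, target = make_lagged(y, p)
model = LinearRegression().fit(X, target)
one_step_ahead = model.predict(y[-p:].reshape(1, -1))       # forecast for t + 1
```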

For both modeling strategies, we choose the best hyperparameter combination by performing ten-fold blocked cross-validation, i.e. with validation sets belonging to the same time-window. This validation technique was shown to be effective for time-series tasks [12, 13]. For the individual strategy with $AR(p)$, we perform a grid search to optimize the number of predictive lagged values $p$. For the global strategy with LightGBM, we perform a grid search to optimize the number of leaves and the minimum number of observations per leaf, and we keep the default values for all the other hyperparameters. Once the best configuration is found, we keep the ten models trained during cross-validation in order to validate the reconciler parameters, as shown in Section 4.4.
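A minimal sketch of how the blocked splits could be generated, assuming contiguous, non-overlapping validation windows of (roughly) equal size; the helper name is our own.

```python
import numpy as np

def blocked_cv_splits(n_timesteps: int, n_folds: int = 10):
    """Yield (train_idx, val_idx) pairs with one contiguous validation window per fold."""
    boundaries = np.linspace(0, n_timesteps, n_folds + 1, dtype=int)
    for k in range(n_folds):
        val_idx = np.arange(boundaries[k], boundaries[k + 1])
        train_idx = np.setdiff1d(np.arange(n_timesteps), val_idx)
        yield train_idx, val_idx

# Example: 380 training time-steps (the Dairy dataset) split into ten blocks.
for train_idx, val_idx in blocked_cv_splits(380):
    pass  # fit on train_idx, evaluate on val_idx
```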

4.2 Reconciliation

For the reconciler’s encoder, we consider feed-forward networks with ReLU activation functions, and we set the size of the output layer to the number of bottom-level time-series. We consider two architecture alternatives. The first is a standard fully connected network. The second is a “shrunk” version of a fully connected network, such that the output for a given bottom-level time-series depends only on the forecasts for itself and its parents in the hierarchy at all levels. As an example, consider the hierarchy in Figure 1 and the reconciled prediction for the bottom-level time-series $\tilde{D}_{t}$. In the fully connected case, $\tilde{D}_{t}$ is influenced by the forecasts for all the time-series at all levels: $\tilde{D}_{t}=f(\hat{A}_{t},\dots,\hat{G}_{t})$. In the shrunk case instead, $\tilde{D}_{t}$ depends only on the forecasts for itself, its parent, and its grandparent: $\tilde{D}_{t}=f(\hat{A}_{t},\hat{B}_{t},\hat{D}_{t})$. The “shrunk” version does not have the same representative power as the fully connected one. However, note that it still includes the bottom-up, top-down, and middle-out representation spaces, and any mix of them. Furthermore, for a given number of hierarchy levels, the number of trainable parameters grows quadratically with the number of time-series in the fully connected case and linearly in the shrunk case. To reduce the overfitting risk, we consider the shrunk architecture when the number of time-series is greater than ten times their length. However, the shrunk option can more generally be considered as a tunable hyperparameter. An illustration of the two encoder architectures is given in Figure 3.

Figure 3: Encoder architecture alternatives: (a) fully connected; (b) shrunk. Hidden layers are omitted for ease of visualization; refer to the hierarchy in Figure 1.
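One simple way to realize the shrunk encoder is to mask the weights of a linear layer so that each bottom-level output only sees its own forecast and those of its ancestors. The sketch below shows this for the hierarchy of Figure 1 without hidden layers, as in Figure 3(b); the class and the mask layout are our own illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Linear):
    """Linear layer whose weights are element-wise masked at every forward pass."""

    def __init__(self, mask: torch.Tensor):
        out_features, in_features = mask.shape
        super().__init__(in_features, out_features)
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return nn.functional.linear(x, self.weight * self.mask, self.bias)

# Rows: bottom-level outputs D, E, F, G; columns: inputs A, B, C, D, E, F, G.
mask = torch.tensor([
    [1, 1, 0, 1, 0, 0, 0],  # D depends on A, B, D
    [1, 1, 0, 0, 1, 0, 0],  # E depends on A, B, E
    [1, 0, 1, 0, 0, 1, 0],  # F depends on A, C, F
    [1, 0, 1, 0, 0, 0, 1],  # G depends on A, C, G
], dtype=torch.float32)
shrunk_encoder = MaskedLinear(mask)
```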

Applying the model directly to the raw time-series presents a challenge: hierarchical time-series naturally exhibit large amplitude differences, which could compromise the model training since neural networks are sensitive to the scale of the data [14]. To deal with the scale problem, we perform a normalization step. We divide each time-series by a scaling factor before feeding the data to the encoder, and we perform the inverse scaling on the encoded output. The scaling factor is set to one plus the time-series average computed in the training period.
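A minimal sketch of the normalization step, with toy data and a single linear layer as a stand-in for the trained encoder; only the scaling logic reflects the procedure described above.

```python
import torch

# Toy setup for the hierarchy of Figure 1: 7 series (A..G) observed for 100
# training steps, and a batch of 16 independent forecasts to reconcile.
y_train = torch.rand(100, 7) * 50.0
y_hat = torch.rand(16, 7) * 50.0

scale_all = 1.0 + y_train.mean(dim=0)     # one factor per input series
scale_bottom = scale_all[3:]              # factors for the bottom series D..G

encoder = torch.nn.Linear(7, 4)           # stand-in for the trained encoder
bottom = encoder(y_hat / scale_all) * scale_bottom   # scale in, inverse-scale out
```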

Given the limited number of training samples, learning a multi-dimensional transformation could be challenging for the network. To simplify the training process, we want to initialize the network to the BU method, meaning that the untrained encoder should map the raw predictions to the raw bottom-level predictions. We explore two different ways to achieve this, i.e. using residual connections [15] or using a custom initialization of the network weights. We choose the second option due to its slightly higher performance, and we refer to this method as “bottom-up initialization”. As an example, consider the two architectures shown in Figure 3. With both these architectures, we initialize the weights connecting each bottom-level input to the corresponding output to one (i.e. the connections $\hat{D}_{t}\rightarrow\tilde{D}_{t}$, $\hat{E}_{t}\rightarrow\tilde{E}_{t}$, $\hat{F}_{t}\rightarrow\tilde{F}_{t}$, $\hat{G}_{t}\rightarrow\tilde{G}_{t}$) and all the other weights and biases to zero. This technique can be trivially extended to multiple layers, assuming that the size of the hidden layers is at least equal to the output size.
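A minimal sketch of the bottom-up initialization for an encoder without hidden layers (hierarchy of Figure 1), assuming the inputs are ordered A..G with the bottom series last.

```python
import torch
import torch.nn as nn

n_total, n_bottom = 7, 4                   # A..G in total, D..G at the bottom
encoder = nn.Linear(n_total, n_bottom)

with torch.no_grad():
    encoder.weight.zero_()
    encoder.bias.zero_()
    for j in range(n_bottom):
        # Connect each bottom-level input to its own output with weight one.
        encoder.weight[j, n_total - n_bottom + j] = 1.0

y_hat = torch.rand(1, n_total)
# Before training, the encoder simply passes the bottom-level forecasts through,
# so the full reconciler reproduces the bottom-up method.
assert torch.allclose(encoder(y_hat), y_hat[:, -n_bottom:])
```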

To improve the stability of the model, we ensemble ten networks and we take the average prediction.

4.3 Metrics

To demonstrate the flexibility of our approach, we consider two evaluation metrics. The first one is the Mean Absolute Scaled Error (MASE) [16], which scales the prediction errors by the average in-sample naïve error and has the nice properties of being symmetric, stable for values close to zero, and scale-free. These, in general, are desirable statistical properties. However, the scale-free property sometimes may not be aligned with the business point of view. If, on the one hand, scale-dependent metrics such as the Mean Absolute Error (MAE) may be misleading, since the prediction error for small time-series becomes irrelevant in the presence of much larger ones, on the other hand, higher forecast errors typically correspond to higher costs, regardless of the time-series scale. Furthermore, the MASE is undefined for constant time-series and unstable for almost constant ones. For these reasons, we also consider the Mean Logarithm of Absolute Error (MLAE). The MLAE depends on the time-series scale, but its dependency is “reduced” via the natural logarithmic transformation $x\mapsto\log(x+1)$, and it is stable for constant and almost constant time-series.

We train the models with different loss functions based on the two target metrics: the MASE loss for the MASE metric, and the MLAE loss for the MLAE metric. Note that using the performance metric as a loss function is a convenient way to optimize for our target and proved effective in our experiments. However, there are cases in which this may not be the optimal solution. For instance, using the MASE as a loss may be unstable in the case of almost constant time-series; if this is the case, we suggest using a different loss or a regularized version of it, e.g. by scaling the errors by one plus the average in-sample naïve error.
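Both losses are straightforward to express in PyTorch. The sketch below assumes that `naive_error` holds the average in-sample naïve error of each series and that errors are simply averaged over series and time-steps; weighting schemes (e.g. per hierarchy level) would multiply the terms before averaging.

```python
import torch

def mase_loss(y_tilde: torch.Tensor, y_true: torch.Tensor,
              naive_error: torch.Tensor) -> torch.Tensor:
    # Absolute errors scaled by each series' average in-sample naive error.
    return ((y_tilde - y_true).abs() / naive_error).mean()

def mlae_loss(y_tilde: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
    # Mean of log(1 + |error|): scale-dependent, but with a dampened magnitude.
    return torch.log1p((y_tilde - y_true).abs()).mean()
```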

4.4 Validation Strategy

To choose the hyperparameters for the reconciliation model, we again exploit ten-fold blocked cross-validation. Specifically, we divide the training time interval into ten time-windows, and at each iteration we perform the following steps:

  1. Set one time window as the validation period, and the remaining ones as the training period.

  2. Select the forecasting models trained on the training period and use them to generate forecasts for both periods. These are the reconciler inputs in the next step.

  3. Train the reconciler with the forecasts in the training period and evaluate its performance with the forecasts in the validation period.

Note that, for each fold, we produce the reconciler input with the forecasting models trained on the reconciler training period. One may wonder why we do not simply use the predictions of the forecasting models trained on the whole training period. This is because in doing so, the validation labels for the reconciler would be used twice, first to train the forecasting models and then to evaluate the reconciler, and this would lead to an over-optimistic evaluation.

We select the best combination of hyperparameters via random search; details about the training procedure and the hyperparameters search are shown in the Appendix.

5 Datasets

For our experiments we consider four real-world datasets with increasing hierarchy complexity. We denote them as Dairy [17], Walmart [18], Australian [19], and Electricity [20]. A short description of each dataset is given below, and detailed instructions on the data collection process are available at https://github.com/davide-inventec/trainable-reconciler-data. Table 1 provides an overview of the dataset statistics. All the datasets are scaled by the overall absolute average on the training set.

5.0.1 Dairy

Weekly sales of dairy products in the United States (volume in USD). The hierarchy levels are total and dairy product (milk, butter, etc.).

5.0.2 Walmart

Daily sales for a popular Walmart product. The hierarchy levels are total, country, store.

5.0.3 Australian

Quarterly time-series of domestic tourism in Australia. The hierarchy levels are total, purpose of travel (holiday, business, etc.), region (Sydney, Brisbane, etc.), and destination (city, non-city).

5.0.4 Electricity

Yearly electricity consumption in MWh in the State of New York. The hierarchy levels are total, county (Bronx, Richmond, etc.), type of consumer (Business, Residential, etc.), and provider (National Grid, Central Hudson, etc.).

Table 1: Dataset statistics.

Dataset      # levels  # time-series  # time-series per level  Train time-steps  Test time-steps
Dairy        2         6              (1, 5)                   380               53
Walmart      3         13             (1, 3, 9)                1575              366
Australian   4         89             (1, 4, 28, 56)           28                8
Electricity  4         772            (1, 60, 295, 416)        36                12

For the Electricity dataset we use a global forecasting model, using 13 numerical features (12 lagged values scaled by the time-series average, and the time-series average) and 6 categorical features (the month, 3 level-specific variables, the hierarchy level, and the time-series ID) as predictors. For all the other datasets, we use individual forecasting models. We use the shrunk architecture for Electricity and the fully connected one for all the others.

6 Results

We evaluate the performance of the different methods by computing the MASE and MLAE on the test period of each dataset. To compute the scores, we average the prediction errors over all the time-series at all hierarchy levels. Then, we assess the significance of the performance differences between our method and the best of the others by performing paired $t$-tests at the 1% and 5% significance levels on the corresponding scaled/logarithmic absolute errors. In Table 2 we show the performance results, highlighting significantly better scores in bold and using one or two stars to denote p-values smaller than 5% and 1%, respectively. The detailed scores for each hierarchy level are shown in the Appendix.

Table 2: Performances on the test sets.
(a) MASE
BU TDHP TDFP OC TM Trainable (ours)
Dairy 1.1011 1.5543 1.0960 1.1150 1.1014 1.1037
Walmart 0.6608 0.7823 0.6600 0.6606 0.6580 0.6567
Australian 0.6472 0.7181 0.6918 0.7602 0.6301 0.5910 **
Electricity 0.7041 0.9629 0.7216 3.0777 0.7042 0.7038 *
(b) MLAE
BU TDHP TDFP OC TM Trainable (ours)
Dairy 0.1110 0.1255 0.1113 0.1114 0.1112 0.1129
Walmart 0.1070 0.1169 0.1065 0.1064 0.1064 0.1060
Australian 0.0929 0.1144 0.1045 0.1018 0.0923 0.0883 *
Electricity 0.0569 0.0742 0.0580 0.0600 0.0575 0.0567 **

For the datasets with a complex hierarchy, our method performs significantly better than all the others, while for the simpler ones the differences across methods are generally small. In particular, for the Australian and Electricity datasets, our method achieves a significantly better performance than all the other methods in terms of both MASE and MLAE. For the Walmart dataset, our method has the best performance; however, the difference from the second best is not statistically significant. For the Dairy dataset, all the methods have statistically similar performances (except for TDHP, which has larger errors). We think that this is because the Dairy dataset exhibits a trivial two-level hierarchy, and therefore there is not much room for improvement.

To summarize, we find that our method consistently achieves performance better than or equivalent to the best of the other methods, and the greatest improvement corresponds to the datasets with the deepest hierarchies. Given that the scores of the other methods are highly variable (especially for TDHP, TDFP, and OC), we believe that this is a strong result.

7 Conclusion

We have shown that a deep learning based approach can improve the forecasting accuracy over state-of-the-art methods in the time-series reconciliation setting. Our method, which is based on an encoder-decoder feed-forward neural network, generalizes the representation space of existing methods, allows us to target specific metrics via the choice of an appropriate loss function, and is easy to implement. By testing our model on four real-world datasets, we showed that it achieves statistically better or equivalent performance compared to the best of all the other methods used in the experiments.

As for future developments, we believe that our method may open multiple research directions. In particular, further exploration could be done to evaluate the possibility of jointly training the forecasting models and the reconciliation network. Besides, since our custom initialization converges to a solution that is probably close to the bottom-up method, it would be interesting to investigate the effect of initialization strategies on different architectures.

Appendix

Reconciler Hyperparameters Search

We train all the models via backpropagation, using the AdamW [21] optimizer and a batch size of 128. For the fully connected networks, we set the size of the hidden layers to the number of bottom-level time-series. For the shrunk networks, we set the size of the hidden layers of the sub-network associated with each bottom-level time-series to 8, so the number of nodes in each hidden layer is equal to 8 times the number of bottom-level time-series.

We optimize the reconciler hyperparameters by evaluating 100 random combinations via the procedure described in Section 4.4. Combinations are uniformly sampled from the following values (a sampling sketch is given after the list):

  • dropout: $[0, 0.1, 0.2]$

  • learning rate: $[10^{-3}, 10^{-4}, 10^{-5}]$

  • weight decay: $[10^{-1}, 3\times 10^{-2}, 10^{-2}]$

  • number of training epochs: $[50, 100, 200, 500]$

  • number of hidden layers: $[0, 1, 2, 3]$
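A minimal sketch of the sampling step of the random search, assuming the grid above; the scoring of each candidate via the blocked cross-validation of Section 4.4 is left implicit.

```python
import random

grid = {
    "dropout": [0, 0.1, 0.2],
    "learning_rate": [1e-3, 1e-4, 1e-5],
    "weight_decay": [1e-1, 3e-2, 1e-2],
    "epochs": [50, 100, 200, 500],
    "hidden_layers": [0, 1, 2, 3],
}

# 100 uniformly sampled hyperparameter combinations; each candidate is then
# scored with the ten-fold blocked cross-validation procedure.
candidates = [{name: random.choice(values) for name, values in grid.items()}
              for _ in range(100)]
```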

Results by Hierarchy Level

In Table 3 we show the performances on the test set for each dataset, at each level of the hierarchy. Level 0 denotes the top level.

Table 3: Performances on the test sets by hierarchy level.
DAIRY BU TDHP TDFP OC TM Trainable (ours)
mase - level 0 1.0865 1.6276 1.0776 1.1014 1.0868 1.0840
mase - level 1 1.1740 1.1878 1.1878 1.1829 1.1748 1.2023
mlae - level 0 0.0837 0.1007 0.0836 0.0839 0.0840 0.0844
mlae - level 1 0.2475 0.2497 0.2497 0.2488 0.2475 0.2553
WALMART BU TDHP TDFP OC TM Trainable (ours)
mase - level 0 0.6812 0.8417 0.6808 0.6827 0.6795 0.6783
mase - level 1 0.6269 0.6747 0.6274 0.6248 0.6223 0.6205
mase - level 2 0.5795 0.5698 0.5698 0.5689 0.5722 0.5714
mlae - level 0 0.0726 0.0848 0.0724 0.0725 0.0724 0.0720
mlae - level 1 0.1467 0.1542 0.1464 0.1459 0.1458 0.1453
mlae - level 2 0.2980 0.2937 0.2937 0.2934 0.2949 0.2939
AUSTRALIAN BU TDHP TDFP OC TM Trainable (ours)
mase - level 0 0.6542 0.7020 0.6905 0.7488 0.6388 0.6087
mase - level 1 0.6479 0.7474 0.6859 0.7873 0.6206 0.5731
mase - level 2 0.5769 0.7515 0.7549 0.7667 0.5978 0.4934
mase - level 3 0.5160 0.6716 0.6716 0.6159 0.5377 0.4923
mlae - level 0 0.0601 0.0711 0.0649 0.0634 0.0587 0.0578
mlae - level 1 0.0937 0.1224 0.1005 0.0993 0.0901 0.0875
mlae - level 2 0.3629 0.4416 0.4614 0.4502 0.3828 0.3387
mlae - level 3 0.8300 1.0047 1.0047 0.9300 0.8731 0.8120
ELECTRICITY BU TDHP TDFP OC TM Trainable (ours)
mase - level 0 0.7193 0.9813 0.7403 3.9513 0.7203 0.7189
mase - level 1 0.6944 0.9523 0.7083 2.3308 0.6923 0.6942
mase - level 2 0.6510 0.8961 0.6620 0.7365 0.6561 0.6505
mase - level 3 0.4439 0.4633 0.4633 0.4630 0.4555 0.4434
mlae - level 0 0.0376 0.0491 0.0382 0.0402 0.0380 0.0375
mlae - level 1 0.0507 0.0665 0.0514 0.0531 0.0511 0.0505
mlae - level 2 0.1887 0.2503 0.1915 0.1957 0.1915 0.1878
mlae - level 3 2.0085 2.2147 2.2147 2.2129 2.0325 2.0025

References

  • [1] Rob J Hyndman, Roman A Ahmed, George Athanasopoulos, and Han Lin Shang. Optimal combination forecasts for hierarchical time series. Computational statistics & data analysis, 55(9):2579–2589, 2011.
  • [2] Konstantin Mishchenko, Mallory Montgomery, and Federico Vaggi. A self-supervised approach to hierarchical forecasting with applications to groupwise synthetic controls. arXiv preprint arXiv:1906.10586, 2019.
  • [3] Paolo Mancuso, Veronica Piccialli, and Antonio M Sudoso. A machine learning approach for forecasting hierarchical time series. arXiv preprint arXiv:2006.00630, 2020.
  • [4] Charles W Gross and Jeffrey E Sohl. Disaggregation methods to expedite product line forecasting. Journal of forecasting, 9(3):233–254, 1990.
  • [5] George Athanasopoulos, Roman A Ahmed, and Rob J Hyndman. Hierarchical forecasts for australian domestic tourism. International Journal of Forecasting, 25(1):146–166, 2009.
  • [6] Shanika L Wickramasuriya, George Athanasopoulos, and Rob J Hyndman. Optimal forecast reconciliation for hierarchical and grouped time series through trace minimization. Journal of the American Statistical Association, 114(526):804–819, 2019.
  • [7] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • [8] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. Lightgbm: A highly efficient gradient boosting decision tree. In Advances in neural information processing systems, pages 3146–3154, 2017.
  • [9] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.
  • [10] Rob J Hyndman, George Athanasopoulos, and Han Lin Shang. hts: An R package for forecasting hierarchical or grouped time series. R package vignette (available at http://cran.unej.ac.id/web/packages/hts/vignettes/hts.pdf), 2015.
  • [11] David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. Deepar: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3):1181–1191, 2020.
  • [12] Christoph Bergmeir and José M Benítez. On the use of cross-validation for time series predictor evaluation. Information Sciences, 191:192–213, 2012.
  • [13] Christoph Bergmeir, Rob J Hyndman, and Bonsoo Koo. A note on the validity of cross-validation for evaluating autoregressive time series prediction. Computational Statistics & Data Analysis, 120:70–83, 2018.
  • [14] J. Sola and J. Sevilla. Importance of input data normalization for the application of neural networks to complex industrial problems. IEEE Transactions on Nuclear Science, 44(3):1464–1468, 1997.
  • [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [16] Rob J Hyndman and Anne B Koehler. Another look at measures of forecast accuracy. International journal of forecasting, 22(4):679–688, 2006.
  • [17] United States Department of Agriculture. https://mpr.datamart.ams.usda.gov/menu.do?path=Products%5CDairy.
  • [18] M5 Forecasting - Accuracy. https://www.kaggle.com/c/m5-forecasting-accuracy/data?select=sales_train_validation.csv.
  • [19] Rob J Hyndman Blog. https://robjhyndman.com/data/hier1_with_names.csv.
  • [20] Utility Energy Registry Monthly County Energy Use: Beginning 2016. https://data.ny.gov/Energy-Environment/Utility-Energy-Registry-Monthly-County-Energy-Use-/47km-hhvs.
  • [21] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.