SLOTH: Structured Learning and Task-based Optimization for Time Series Forecasting on Hierarchies

Fan Zhou, Chen Pan, Lintao Ma, Yu Liu, Shiyu Wang, James Zhang, Xinxin Zhu, Xuanwei Hu, Yunhua Hu, Yangfei Zheng, Lei Lei, Yun Hu

Abstract

Multivariate time series forecasting with hierarchical structure is widely used in real-world applications, e.g., sales predictions for the geographical hierarchy formed by cities, states, and countries. The hierarchical time series (HTS) forecasting includes two sub-tasks, i.e., forecasting and reconciliation. In the previous works, hierarchical information is only integrated in the reconciliation step to maintain coherency, but not in forecasting step for accuracy improvement. In this paper, we propose two novel tree-based feature integration mechanisms, i.e., top-down convolution and bottom-up attention to leverage the information of the hierarchical structure to improve the forecasting performance. Moreover, unlike most previous reconciliation methods which either rely on strong assumptions or focus on coherent constraints only, we utilize deep neural optimization networks, which not only achieve coherency without any assumptions, but also allow more flexible and realistic constraints to achieve task-based targets, e.g., lower under-estimation penalty and meaningful decision-making loss to facilitate the subsequent downstream tasks. Experiments on real-world datasets demonstrate that our tree-based feature integration mechanism achieves superior performances on hierarchical forecasting tasks compared to the state-of-the-art methods, and our neural optimization networks can be applied to real-world tasks effectively without any additional effort under coherence and task-based constraints.

Introduction

Forecasting for multiple time series with complex hierarchical structure has been applied in various real-world problems (Dangerfield and Morris 1992; Athanasopoulos, Ahmed, and Hyndman 2009; Liu, Yan, and Hauskrecht 2018; Jeon, Panagiotelis, and Petropoulos 2018). For instance, the international sales forecasting of transnational corporations needs to consider geographical hierarchies involving the levels of city, state, and country (Han, Dasgupta, and Ghosh 2021). Moreover, retail sales can also be divided into different groups according to the product category to form another hierarchy. The combination of different hierarchies can form a more complicated but realistic virtual topology, e.g., the geographical hierarchies (of different regions) and the commodity hierarchies (of varies categories) are nested to be parts of the supply chain in the retail sector.

The critical challenge in hierarchical forecasting tasks lies in producing accurate prediction results while satisfying aggregation (coherence) constraints (Taieb and Koo 2019). Specifically, the hierarchical structure in these time series implies coherence constraint, i.e., time series at upper levels are the aggregation of those at lower levels. However, independent forecasts from a prediction model (called base forecasts) are unlikely to satisfy coherence constraint.

Previous work on hierarchical forecasting mainly focuses on a procedure of two separate stages (Hyndman, Lee, and Wang 2016; Taieb and Koo 2019; Corani et al. 2020; Anderer and Li 2021): In the first stage, base forecasts are generated independently for each time series in the hierarchy; in the second stage, these forecasts are adjusted via reconciliation to derive coherent results. The base forecasts are obtained by univariate or multivariate time series models, such as traditional forecasting methods (e.g., ARIMA (Taieb and Koo 2019)) and deep learning techniques (e.g. DeepVar (Salinas et al. 2019)). However, both approaches ignore the information of hierarchical structure for prediction. As for reconciliation, traditional statistical methods mostly rely on strong assumption, such as unbiased forecasts and Gaussian noises (e.g., MinT (Wickramasuriya, Athanasopoulos, and Hyndman 2019)), which are often inconsistent with non-Gaussian/non-linear real-world hierarchical time series (HTS) data. A notable exception is the approach proposed in (Rangapuram et al. 2021), promising coherence by a closed-form projection to achieve an end-to-end reconciliation without any assumptions. However, this method may introduce huge adjustments on the original forecasts for coherency, making the final forecasts unreasonable. Moreover, even coherent forecasters might still be impractical for some real-world tasks with further practical operational or task-related constraints, such as inventory management and resource scheduling. These reconciliation concerns can be addressed by imposing more realistic constraints to control the scale and optimize the task-based targets for the downstream tasks, which can be achieved by deep neural optimization layer (OptNet) (Amos and Kolter 2017).

In this work, we provides an end-to-end framework to generate coherent forecasts of hierarchical time series that incorporates the hierarchical structure in the prediction process, while taking task-based constraints and targets into consideration. In detail, our contributions to HTS forecasting and aligned decision-making can be summarised as follows:

•

We propose two tree-based mechanisms, including top-down convolution and bottom-up attention, to leverage hierarchical structure information (through feature fusion of all levels in the hierarchy) for performance improvement of the base forecasts. To the best of our knowledge, our approach is the first model that harnesses the power of deep learning to exploit the complicated structure information of HTS.
•

We provide a flexible end-to-end learning framework to unify the goals of the forecasting and decision-making by employing a deep differentiable convex optimization layer (OptNet), which not only achieves controllable reconciliation without any assumptions, but also adapts to more practical task-related constraints and targets to solve the real-world problems without any explicit post-processing step.
•

Extensive experiments on real-world hierarchical datasets from various industrial domains demonstrate that our proposed approach achieves significant improvements over the state-of-the-art baseline methods, and our approach has been deployed online to cloud resource scheduling project in Ant Group.

Preliminaries

A hierarchical time series can be denoted as a tree structure (see Figure 1) with linear aggregation constraints, expressed by aggregation matrix $\mathbf{S}\in\mathbb{R}^{n\times m}$ ( $m$ is the number of bottom-level nodes, and $n$ is the total number of nodes). In the hierarchy, each node represents a time series, to be predicted over a time horizon.

Given a time horizon $t\in\{1,2,\dots,T\}$ , we use $y_{i,t}\in\mathbb{R}$ to denote the values of the $i$ -th component of a multivariate hierarchical time series, where $i\in\{1,2,\dots,n\}$ is the index of the individual univariate time series. Here we assume that the index $i$ abides by the level-order traversal of the hierarchical tree, going from left to right at each level.

Refer to caption — Figure 1: An example of HTS structure for $n=8$ time series with $m=5$ bottom-level series and $r=3$ upper-level series.

In the tree structure, the time series of leaf nodes are called the bottom-level series $\mathbf{b}_{t}\in\mathbb{R}^{m}$ , and those of the remaining nodes are termed upper-levels series $\mathbf{u}_{t}\in\mathbb{R}^{r}$ . Obviously, the total number of nodes $n=r+m$ , and $\mathbf{y}_{t}:=[\mathbf{u}_{t},\mathbf{b}_{t}]^{\mathsf{T}}\in\mathbb{R}^{n}$ contains observations at time $t$ for all levels, which satisfies

\mathbf{y}_{t}=\mathbf{S}\mathbf{b}_{t},

(1)

where $\mathbf{S}\in\{0,1\}^{n\times m}$ is an aggregation matrix.

Taking the HTS in Figure 1 as an example, the aggregation matrix is in the form:

\mathbf{S}=\begin{bmatrix}\mathbf{S}_{\text{sum}}\\ \mathbf{I}_{5}\end{bmatrix}\begin{bmatrix}\begin{array}[]{ccccc}1&1&1&1&1\\ 1&1&0&0&0\\ 0&0&1&1&1\\ {5}\\ &&\mathbf{I}_{5}&&\\ \end{array}\end{bmatrix},

where $\mathbf{I}_{5}$ is an identity matrix of size 5. The total number of series in the hierarchy is $n=3+5$ . At each time step $t$ ,

	$\displaystyle\mathbf{u}_{t}$	$\displaystyle=[y_{A,t},y_{B,t},y_{C,t}]\in\mathbb{R}^{3},$
	$\displaystyle\mathbf{b}_{t}$	$\displaystyle=[y_{D,t},y_{E,t},y_{F,t},y_{G,t},y_{H,t}]\in\mathbb{R}^{5},$
	$\displaystyle y_{A,t}$	$\displaystyle=y_{B,t}+y_{C,t}=y_{D,t}+y_{E,t}+y_{F,t}+y_{G,t}+y_{H,t}.$

The coherence constraint of Eq. (1) can be represented as (Rangapuram et al. 2021)

\mathbf{A}\mathbf{y}_{t}=\mathbf{0},

(2)

where $\mathbf{A}:=[\mathbf{I}_{r}|-\mathbf{S}_{\text{sum}}]\in\{0,1\}^{r\times n}$ , $\mathbf{0}$ is an $r$ -vector of zeros, and $\mathbf{I}_{r}$ is a $r\times r$ identity matrix.

Method

In this section, we introduce our framework (SLOTH), which not only can improve prediction performance by integrating hierarchical structure into forecasting, but also achieves task-based goals with deep neural optimization layer fulfilling coherence constraint. Figure 2 illustrates the architecture of SLOTH, consisting of two main components:

•

a structured hierarchical forecasting module that produces forecasts over the prediction horizon across all nodes, utilizing both top-down convolution and bottom-up attention to enhance the temporal features for each node.
•

a task-based constrained optimization module that leverages OptNet to satisfy the coherence constraint and provide a flexible module for real-world tasks with more complex practical constraints and targets.

Structured Hierarchical Forecasting

In this section, we introduce the structured hierarchical learning module, which integrates dynamic features from the hierarchical structure to produce better base forecasts.

Temporal Feature Extraction Module

This module extracts temporal features of each node as follows:

	$\displaystyle\bar{h}^{i}_{t}$	$\displaystyle=\text{UPDATE}(\bar{h}^{i}_{t-1},x^{i}_{t};\mathbf{\theta}),\;i\in\{{1,2,\dots,n}\},$		(3)
	$\displaystyle\bar{\mathbf{H}}_{t}$	$\displaystyle=[{\bar{h}^{1}_{t},\bar{h}^{2}_{t},\dots,\bar{h}^{n}_{t}}],$

where ${x}^{i}_{t}$ is the covariates of node $i$ at time $t$ , $\bar{h}_{t-1}^{i}$ is the hidden feature of the previous time step $t-1$ , $\mathbf{\theta}$ is the model parameters shared across all nodes, and $\textit{UPDATE}(\cdot)$ is the pattern extraction function. Any recurrent-type neural network can be adopted as the temporal feature extraction module, such as the RNN variants (Hochreiter and Schmidhuber 1997; Chung et al. 2014), TCN (Bai, Kolter, and Koltun 2018), WAVENET (Van Den Oord et al. 2016), and NBEATS (Oreshkin et al. 2019). We use GRU (Chung et al. 2014) in our experiments due to its simplicity.

It is worth noting that we process each node independently in this module, which is in contrast to some existing works that utilize multivariate model to extract the relationship between time series (Rangapuram et al. 2021). It is unnecessary for our framework because we leverage the hierarchical structure in feature integration step, which we believe is enough to characterize the relationship between nodes. We reduce the time complexity from O( $n^{2}$ ) to O( $n$ ) accordingly.

Top-Down Convolution Module (TD-Conv)

This module incorporates structural information to dynamic pattern by integrating temporal features (e.g., trends and seasonality) from nodes at the top level into those at the bottom level to enhance the temporal stability. In other words, clearer seasonality and more smooth evolving (Taieb, Taylor, and Hyndman 2021).

Most of the previous methods indicate that time series of nodes at the upper level are easier to predict than nodes at the bottom level, which implies the dynamic pattern at the top level is more stable. Therefore, the bottom nodes can use their ancestors’ features at top levels to improve prediction performance. Similar to Tree Convolution (Mou et al. 2016), our approach introduces a top-down convolution mechanism by extracting effective top-level temporal patterns to denoise the feature of bottom nodes and increase stability.

The top-down convolution mechanism shown in Figure 3 is based on the outputs of the univariate forecasting model. We want to highlight that all nodes share the forecasting model to obtain the temporal features. Please note that ancestor’s value are the sum of all children nodes. In other words, ancestors’ features are helpful to predict the considered node, which calls for the integration of the features of both levels for better prediction.

In order to reduce the computational complexity, the hidden states are reorganized into matrices (yellow boxes in the middle part of Figure 3) after temporal hidden features are obtained. Each row of the matrices represents the feature concatenation of nodes from considered node to root in the hierarchy. We then apply convolution neural networks (CNNs) to each row of those matrices to aggregate the temporal patterns, since it is well known that CNN is an effective tool to integrate valid information from hidden features.

The computation complexity is very high if the convolution is applied on $\bar{\mathbf{H}}_{t}$ because it is not sorted as tree structure index. To accelerate the computation, we transform the hierarchical structure to a series of matrix forms to speed up the convolution process as shown in the middle part of Figure 3. Specifically, the nodes’ hidden features $\mathbf{\bar{H}}_{t}$ in the shape of $(n,d_{h})$ are reorganized into a matrix form $\mathbf{\bar{H}}^{\prime}_{t}=\{\bar{h^{\prime}}^{1}_{t},...,\bar{h^{\prime}}^{n}_{t}\}$ , where $\bar{h^{\prime}}^{i}_{t}=[\bar{h}^{1}_{t},\dots,\bar{h}^{i}_{t}]$ is the concatenation of temporal features $\bar{h}_{t}$ of all the ancestors of node $i$ in the shape of $(l_{i},d_{h})$ , where $n$ is number of nodes, $l_{i}$ is the number of levels of node $i$ , and $d_{h}$ is the dimension. Please note that $\mathbf{\bar{H}}^{\prime}_{t}$ is not an actual matrix but a union of matrices of different dimensions, where the first dimension of each $\bar{h^{\prime}}_{t}$ is the level index. Then we apply convolution to $\bar{h^{\prime}}_{t}^{i}$ to integrate temporal features of all ancestors of $i$ as follows

	$\displaystyle\hat{h}_{t}^{i}$	$\displaystyle=\text{Conv}_{l_{i}}(\bar{h^{\prime}}^{i}_{t};\Theta_{l_{i}})=\sum_{k=1}^{l_{i}}w_{k}\bar{h}^{l_{i}-k}_{t},$		(4)
	$\displaystyle\mathbf{\hat{H}}_{t}$	$\displaystyle=[{\hat{h}^{1}_{t},\hat{h}^{2}_{t},\dots,\hat{h}^{n}_{t}}],$

where different levels have different convolution components $\text{Conv}_{l_{i}}$ , while nodes at the same level share the same parameter $\Theta_{l_{i}}$ .

It is important to emphasize that HTS is different from general graph-structured time series, where spatial-temporal information is passed between adjacent nodes, but the relationship between nodes in hierarchical structure only contains value aggregation, which indicates the message passing mechanism of graph time series (DCRNN (Li et al. 2017) and STGCN (Yu, Yin, and Zhu 2017)) is not appropriate for our problem. We compare it with our SLOTH method in Appendix E.

Bottom-Up Attention Module (BU-Attn)

This module integrates temporal features of nodes at the bottom levels to their ancestors at top levels to enhance the ability of adapting to dynamic pattern variations including sudden level or trend changes of time series.

TD-Conv carries top-level information downward to improve the predictions at the bottom levels. On the other hand, the information from the bottom-levels should also be useful for the prediction of the top-levels, due to their relationship of value aggregation.

Please note that feature aggregation is different from value aggregation (Eq. (1)). Specifically, the summing matrix ( $\mathbf{S}$ ) is actually a two-level hierarchical structure, rather than a tree structure that is reasonable for value aggregation, but causes structural information loss in feature aggregation. Direct summing operation is inappropriate for feature aggregation as there is no relationship of summation between parents and children in the feature space.

We therefore adopt the attention mechanism (Vaswani et al. 2017; Nguyen et al. 2020) to aggregate temporal features from the bottom levels, due to the following considerations: 1) hierarchical structure has variations (different number of levels and children nodes); 2) child nodes contribute differently to parents with various scales/dynamic patterns. The attention mechanism is appropriate to aggregate the feature as it is flexible for various structures and can learn weighted contributions based on feature similarities.

$\displaystyle\tilde{h}^{\cup_{l}}_{t}$	$\displaystyle=\operatorname{Softmax}\left(\frac{\mathbf{Q}^{\cup_{l}}_{t}(\mathbf{K}^{\cup_{l+1}}_{t})^{\mathsf{T}}}{\sqrt{d_{h}}}\right)\mathbf{V}^{\cup_{l+1}}_{t},$	(5)
$\displaystyle\tilde{\mathbf{H}}_{t}$	$\displaystyle=[\tilde{h}^{\cup_{1}}_{t},\dots,\tilde{h}^{\cup_{n_{l}}}_{t}],$
$\displaystyle\mathbf{V}_{t}^{\cup_{l}}$	$\displaystyle=\tilde{h}_{t}^{\cup_{l}}-\hat{h}^{\cup_{l}}_{t}+\bar{h}^{\cup_{l}}_{t}.$	(6)

Algorithm 1 Bottom-up Attention Module

\mathbf{\bar{H}}_{t},\mathbf{\hat{H}}_{t},\phi_{q},\phi_{k}

l\leftarrow n_{l}-1,\tilde{h}^{\cup_{n_{l}}}_{t}=\hat{h}^{\cup_{n_{l}}}_{t},\mathbf{V}_{t}=0

\mathbf{V}^{n_{l}}_{t}=\hat{h}^{\cup_{n_{l}}}_{t}

3: /*

\cup_{i}

means union of nodes at level i

4: while

l>0

\mathbf{Q}_{t}^{\cup_{l}}=F(\bar{h}^{\cup_{l}}_{t};\phi_{q})

\mathbf{K}_{t}^{\cup_{l+1}}=F(\bar{h}^{\cup_{l+1}}_{t};\phi_{k})

7: Update

\tilde{h}_{t}^{\cup_{l}}

according to Eq. (5)

8: Update

\mathbf{V}_{t}^{\cup_{l}}=\tilde{h}_{t}^{\cup_{l}}-\hat{h}^{\cup_{l}}_{t}+\bar{h}^{\cup_{l}}_{t}

l\leftarrow l-1,

10: end while

11: return

\mathbf{\tilde{H}}_{t}

Specifically, as shown in Algorithm 1, attention process starts from the second last to the first level (except for leaf nodes). Parent node takes the original temporal feature $\bar{h}_{t}$ to generate the query, and the corresponding child node takes the original feature $\bar{h}_{t}$ to generate the key. Child node feature value $\mathbf{V}_{t}$ is updated as attention process going upward as in Eq. (6). $\tilde{h}^{\cup_{l}}_{t}$ is the feature of all nodes at level $l$ by attention aggregation, which contains the contributions from both children and their ancestors. $\mathbf{V}_{t}^{\cup_{l}}$ is attention value of each node at level $l$ , which is used in Eq. (5). Since we not only attempt to strengthen the information of children and the node’s own features, but also to weaken the parents’ influence in the bottom-up attention process because parent’s feature becomes the node’s own feature in the next iteration of BU-Attn, therefore, we subtract top-down temporal features $\hat{h}_{t}^{\cup_{l}}$ and add original temporal features $\bar{h}_{t}^{\cup_{l}}$ .

It is important to note that computation process at same level can be executed concurrently because it is only related to self temporal hidden feature $\bar{\mathbf{H}}_{t}$ . All the experiments show our bottom-up attention aggregation mechanism carries the bottom-level information containing dynamic patterns to top-level to improve the forecasting performance. More importantly, both TD-Conv and BU-Attn can be independent components to be used in the fitting process (i.e., step $t$ of RNN) for better prediction accuracy, or to be used after the temporal patterns are obtained, trading accuracy for faster computation.

Base Forecast Module

This module serves as prediction generation based on dynamic features, and the prediction can either be probabilistic or point estimates. Our framework is agnostic to the forecasting models, such as MLP (Gardner and Dorling 1998), seq2seq (Sutskever, Vinyals, and Le 2014), and attention networks (Vaswani et al. 2017). We employ the MLP to generate the base point estimates for its flexibility and simplicity. In order to avoid the loss of information of temporal features in the cascade of hierarchical learning, we apply residual connection between temporal feature extraction module and base forecast module as follows

{z}_{t}=\sigma(MLP(\tilde{h}_{t}));h_{t}=(1-{z}_{t})\tilde{h}_{t}+{z}_{t}\bar{h}_{t}\textrm{ . }

(7)

Then we apply MLP to generate base forecasts as follows

\hat{y}^{i}_{t}=MLP(h^{i}_{t}),\quad\hat{\mathbf{y}}_{t}=[\hat{y}^{1}_{t},\dots,\hat{y}^{n}_{t}].

(8)

Task-based Constrained Optimization

In this section, we introduce a task-based optimization module that leverages the deep neural optimization layer to achieve targets in realistic scenarios, while satisfying coherence and task-based constraints.

Optimization with Coherence Constraint in Forecasting Task

We formally define HTS forecasting task as a prediction and optimization problem in this section. As shown in Eq. (9), reconciliation on base forecasts can be represented as a constrained optimization problem (Rangapuram et al. 2021), where two categories of constraints are considered, i.e., the equality constraints representing coherency, and the inequality constraints ensuring the reconciliation is restricted, which means the adjustment of the base forecasts is limited in a specific range to reduce the deterioration of forecast performance,

		$\displaystyle\tilde{\mathbf{y}}_{t}=\mathop{\arg\min}_{\mathbf{y}\in\mathbb{R}^{n}}\\|\mathbf{y}-\hat{\mathbf{y}}_{t}\\|_{2}=\mathop{\arg\min}_{\mathbf{y}\in\mathbb{R}^{n}}\frac{1}{2}\mathbf{y}^{\mathsf{T}}\mathbf{y}-\hat{\mathbf{y}}_{t}^{\mathsf{T}}\mathbf{y}$		(9)
		$\displaystyle\text{subject to}\begin{cases}\mathbf{A}{\mathbf{y}}=\mathbf{0},\\ \delta_{1}abs(\hat{\mathbf{y}}_{t})-\varepsilon_{1}\leq\mathbf{y}-\hat{\mathbf{y}}_{t}\leq\delta_{2}abs(\hat{\mathbf{y}}_{t})+\varepsilon_{2},\end{cases}$		(9)

where $\hat{\mathbf{y}}$ is the base forecasts without reconciliation, and $\delta_{i},\varepsilon_{i},i=1,2$ are some predefined constants.

Recall that the aforementioned end-to-end optimization architecture (Rangapuram et al. 2021) provides a closed-form solution for reconciliation problem. It projects the base forecasts into the solution space effectively by multiplying reconciliation matrix, which only depends on aggregation matrix $\mathbf{S}$ and is thus easy to calculate (for convenience, the details are supplied in Appendix A). However, this procedure only considers aggregation constraints without limiting adjustment scale which may sometime cause the reconciled results $\tilde{\mathbf{y}}$ become unreasonable, e.g., a negative value for small base forecasts in demand forecasting. In addition, loss based on reconciliation projection derails gradient magnitudes and directions, which may cause the model not to converge to the optimum. Moreover, it is not a general solution for real-world scenarios, where more complex task-related constraints and targets are involved.

To keep reconciliation result reasonable and training efficient, we utilize neural network layer OptNet to solve the constrained reconciliation optimization problem, which is essentially a quadratic programming problem. The Lagrangian of formal quadratic programming problem is defined in Eq. (10) (Amos and Kolter 2017), where equality constraints are $\mathbf{A}\mathbf{z}=\mathbf{b}$ and inequality constraints are $\mathbf{G}\mathbf{z}\leq\mathbf{h}$ :

L(\mathbf{z},{\nu},{\lambda})=\frac{1}{2}\mathbf{z}^{\mathsf{T}}\mathbf{Q}\mathbf{z}-\mathbf{q}^{\mathsf{T}}\mathbf{z}+\nu^{\mathsf{T}}(\mathbf{A}\mathbf{z}-\mathbf{b})+\lambda^{\mathsf{T}}(\mathbf{G}\mathbf{z}-\mathbf{h}).

(10)

When applied to hierarchical reconciliation problems (where we take the special range constraint $\mathbf{y}\geq\mathbf{0}$ as an example), the Lagrangian can be revised to

L(\mathbf{y},\nu,\lambda)=\frac{1}{2}\mathbf{y}^{\mathsf{T}}\mathbf{y}-\hat{\mathbf{y}}^{\mathsf{T}}\mathbf{y}+\nu^{\mathsf{T}}\mathbf{A}\mathbf{y}+\lambda^{\mathsf{T}}(-\mathbf{I}\mathbf{y}),

(11)

where $\nu,\lambda$ are the dual variables of equality and inequality constraints respectively. Then we can derive the differentials of these variables according to the KKT condition, and apply linear differential theory to calculate the Jacobians for backpropagation. The detail is as follows

$\begin{bmatrix}\mathbf{I}&-\mathbf{I}&\mathbf{A}^{\mathsf{T}}\\ diag(\lambda)(-\mathbf{I})&diag(-\mathbf{y})&\mathbf{0}\\ \mathbf{A}&\mathbf{0}&\mathbf{0}\end{bmatrix}\begin{bmatrix}d\mathbf{y}\\ d\lambda\\ d\nu\end{bmatrix}=-\begin{bmatrix}d\mathbf{y}-d\hat{\mathbf{y}}-d\lambda+d\mathbf{A}^{\mathsf{T}}\nu\\ -diag(\lambda)d\mathbf{y}\\ d\mathbf{A}\mathbf{y}\end{bmatrix}\mathrm{,}$

(12)

where function $diag(\cdot)$ means diagonal matrix. We can infer the conditions of the constrained reconciliation optimization problem from the left side, and compute the derivative of the relevant function with respect to model parameters from the right side. In practice, we apply OptNet layer to obtain the solution of argmin differential QPs quickly to solve the linear equation. In this way, our framework achieves end-to-end learning by directly generating reconciliation optimization results, while calculating the derivative and backpropagating the gradient to the optimization model automatically.

Optimization with Task-based Constraints and Target for Real-World Scenarios

The coherence constraint is enough in forecasting tasks when the only concern is prediction accuracy, which is, however, most likely unrealistic in real-world tasks with specific limitations and practical targets. Such tasks can be further revised as follows

		$\displaystyle\mathcal{J}(\hat{\mathbf{y}})=\mathop{\arg\min}_{\mathbf{y}}{f(\hat{\mathbf{y}},\mathbf{y})}$		(13)
	subject to	$\displaystyle\begin{cases}\mathbf{A}\mathbf{y}=\mathbf{0},e_{j}=0,\;j=1,\dots,n_{\text{eq}},\\ g_{i}(\mathbf{y},\hat{\mathbf{y}})\leq 0,i=1,\dots,n_{\text{ineq}},\end{cases}$		(13)

where $f$ is the task-based quadratic objective, $e_{j}$ represents task-based equality constraint other than coherence constraint, $n_{\text{eq}}$ is the number of equality constraints, and $g_{i}$ is an inequality constraint, $n_{\text{ineq}}$ is the number of task-based inequality constraints. Eq. (13) can be efficiently solved using differential QP layer in an end-to-end fashion (Donti, Amos, and Kolter 2017), where we need to transform our target into a quadratic loss and add equality/inequality constraints. We construct a scheduling experiment on M5 dataset in the following section to validate the superiority of our framework on realistic tasks.

Experiments

Dataset	levels	nodes	structure	freq
Labour	4	57	1, 8, 16, 32	1M
Tourism	4	89	1, 4, 28, 56	3M
M5	5	114	1, 3, 10, 30 ,70	1D

Table 1: Dataset statistics. Structure column shows the number of nodes at each level from top to bottom, as for frequency (‘freq’), 1D means one day and 3M means three months.

Model	Tourism		Labour		M5
Metric	MAPE	w-MAPE	MAPE	w-MAPE	MAPE	w-MAPE
ARIMA-BU	0.2966	0.1212	0.0467	0.0457	0.1134	0.0638
ARIMA-MINT-SHR	0.2942	0.1237	0.0506	0.0471	0.1140	0.0675
ARIMA-MINT-OLS	0.3030	0.1254	0.0505	0.0467	0.1400	0.0733
ARIMA-ERM	1.6447	0.6198	0.0495	0.0402	0.1810	0.1163
PERMBU-MINT	0.2947(0.0031)	0.1057(0.0004)	0.0497(0.0003)	0.0453(0.0002)	0.1176(0.0005)	0.0759(0.0007)
DeepAR-Proj	0.3214(0.0202)	0.1171(0.0116)	0.0423(0.0016)	0.0290(0.0013)	0.1546(0.0165)	0.0951(0.0195)
DeepVAR-Proj	0.4214(0.0548)	0.2162(0.0307)	0.0936(0.0206)	0.0884(0.0235)	0.2019(0.0279)	0.1615(0.0311)
NBEATS-Proj	0.3295(0.0231)	0.1359(0.0264)	0.0355(0.0018)	0.0268(0.0043)	0.2256(0.0399)	0.1952(0.0637)
INFORMER-Proj	0.5401(0.0339)	0.5566(0.0261)	0.1537(0.0685)	0.1455(0.0683)	0.3123(0.0735)	0.3098(0.0708)
AUTOFORMER-Proj	0.3983(0.0678)	0.1862(0.0596)	0.0455(0.0037)	0.0367(0.0016)	0.1654(0.0153)	0.1308(0.0560)
FEDFORMER-Proj	0.3741(0.0291)	0.1685(0.0180)	0.0440(0.0038)	0.0334(0.0024)	0.1505(0.0139)	0.1188(0.0044)
DeepAR-BU	0.3065(0.0123)	0.1154(0.0097)	0.0378(0.0014)	0.0278(0.0022)	0.1151(0.0017)	0.0686(0.0012)
DeepVAR-BU	0.4135(0.0562)	0.2195(0.0370)	0.1112(0.0371)	0.1008(0.0352)	0.1851(0.0153)	0.1494(0.0140)
NBEATS-BU	0.2904(0.0308)	0.1259(0.0183)	0.0393(0.0031)	0.0310(0.0046)	0.1740(0.0221)	0.1398(0.0294)
INFROMER-BU	0.5694(0.0065)	0.5707(0.0072)	0.1654(0.0824)	0.1580(0.0840)	0.3128(0.0728)	0.3099(0.0706)
AUTOFORMER-BU	0.3787(0.0578)	0.1868(0.0084)	0.0519(0.0034)	0.0505(0.0011)	0.1506(0.0146)	0.1143(0.0049)
FEDFORMER-BU	0.3408(0.0099)	0.1544(0.0097)	0.0464(0.0041)	0.0369(0.0027)	0.1424(0.0141)	0.1081(0.0034)
SLOTH(Opt)(ours)	0.2613(0.0017)	0.1032(0.0012)	0.0328(0.0006)	0.0183(0.0008)	0.1116(0.0018)	0.0696(0.0017)
SLOTH(Proj)	0.2780(0.0051)	0.1098(0.0008)	0.0370(0.0052)	0.0228(0.0072)	0.1121(0.0014)	0.0704(0.0005)
SLOTH(BU)	0.2583(0.0015)	0.0991(0.0021)	0.0391(0.0051)	0.0248(0.0065)	0.1127(0.0017)	0.0703(0.0023)

Table 2: MAPE and weighted-MAPE (w-MAPE) metric values over five independent runs for baselines such as traditional reconciliation methods and end-to-end methods, as well as our approach. The value in brackets is the variance over the five runs.

In this section, we conduct extensive evaluations on real-world hierarchical datasets. Firstly, we evaluate the performance of our framework, and compare our approach against the traditional statistical method and end-to-end model (HierE2E). We then add more practical constraints to M5 dataset, building a meaningful optimization target to solve an inventory management problem, and again evaluate various approaches for hierarchical tasks under these realistic scenarios. Our framework on a real-world task of cloud resource scheduling in Ant Group is shown in Appendix E.

Real-world datasets

We take three publicly available datasets with standard hierarchical structures.

•

Tourism (Bushell et al. 2001; Athanasopoulos, Ahmed, and Hyndman 2009) includes an 89-series geographical hierarchy with quarterly observations of Australian tourism flows from 1998 to 2006, which is divided into 4 levels. Bottom-level contains 56 series, aggregated-levels contain 33 series, prediction length is 8. This dataset is frequently referenced in hierarchical forecasting studies (Hyndman and Athanasopoulos 2018).
•

Labour (Labour Force 2021) includes monthly Australian employment data from Feb. 1978 to Dec. 2020. By using included category labels, we construct a 57-series hierarchy, which is divided into 4 levels. Specifically, bottom-level contains 57 series, aggregated-levels contain 49 series in total, and prediction length is 8.
•

M5 (Han, Dasgupta, and Ghosh 2021) dataset describes daily sales from Jan. 2011 to June 2016 of various products. We construct 5-level hierarchical structure as state, store, category, department, and detail product from the origin dataset, resulting in 70 bottom time series and 44 aggregated-level time series. The prediction length is 8.

Results Analysis

In this section, we validate the overall performance of our method in the prediction task on three public datasets. We report scale-free metrics MAPE and scaled metrics weighted-MAPE to measure the accuracy. We apply several representative state-of-the-art forecasting baselines (details shown in Appendix B), including DeepAR (Salinas et al. 2020), DeepVAR (Salinas et al. 2019), NBEATS (Oreshkin et al. 2019), and former-based methods (Zhou et al. 2021; Wu et al. 2021; Zhou et al. 2022). We then combine these methods with traditional bottom-up aggregation mechanism and closed-form solution (Rangapuram et al. 2021) (DeepVar-Proj is HierE2E) to generate reconciliation results.

Table 2 reports the results. The top part shows results of traditional statistic methods, the middle part is of deep neural networks methods with closed-formed solution and bottom-up reconciliation, and the bottom part is the results from our approach and the combination of our forecasting mechanism with traditional bottom-up (BU) and closed-form projection methods (Proj).

We can see that traditional statistic methods perform poorly compared with deep neural networks. NBEATS performs best on Tourism dataset. In particular, NBEATS-BU performs best on MAPE while NBEATS-Proj performs best on weighted-MAPE. However, informer-related methods perform poorly. One possible explanation is that it requires much larger training dataset.

One can observe that the models performing best on MAPE do not perform as well on weighted-MAPE, which is caused by different level contributions on the overall performance, e.g., the bottom-level contributes more on MAPE but higher levels contribute more on weighted-MAPE.

Our proposed approach (SLOTH) has superior performance compared to other methods on both MAPE and weighted-MAPE metrics in most scenarios. Specifically, SLOTH achieves best performance among all models on Labour and M5 datasets, and ranks the second on Tourism dataset and the third on w-MAPE for M5. Besides optimization reconciliation, we also combine our forecasting mechanism with the aforementioned bottom-up and closed-form projection methods, and both methods achieve higher accuracy than the baselines. Even SLOTH-BU, with primitive structure, achieves the best performance in Tourism dataset. Please note that smaller variances indicate that our framework shows stabler performances across various scenarios.

In conclusion, our SLOTH mechanism improves the forecasting performance, and optimized reconciliation generates reasonable coherent predictions without much performance loss. We also assess the gains in the performance across all levels and running time in Appendix D, and the ablation study for each component is presented in Appendix E.

M5 Scheduling Task

In this section, we apply our framework to realistic scenarios with meaningful task-based constraints and targets, using a designed scheduling task for product sales based on M5 dataset. Specifically, we define a meaningful task that minimize the cost of scheduling and inventory with more practical conditions:

•

Underestimation and overestimation contribute differently to the final cost. Underestimation implies that the store needs to order more commodity to fulfill the demand, which increases scheduling cost. Overestimation implies that the store has to keep the extra commodity, which increases inventory cost.
•

Different levels generate different weighted contribution due to aggregation, e.g., scheduling and inventory costs for the top-levels are less than those for the bottom-levels because the company scale is larger at the top-levels.
•

Commodities of different types have different inventory and scheduling costs, since the shelf life for food products is shorter than household products, i.e., the inventory cost is lower for shorter storage time and the scheduling cost is higher due to the need for faster transportation.

Settings. We assume scheduling takes place every week. We set prediction length to 7 and context length to 14. The other penalty settings and target are detailed in Appendix F. We then compare the outcome from the following models:

1.

Prediction Net: prediction model that takes the prediction metric (MAE) as the loss for optimization, and bottom-up approach for coherency.
2.

Weighted Loss Net: prediction model that takes task cost (weighted-MAE) as the loss for optimization, and bottom-up for coherency.
3.

SLOTH: our end-to-end approach that takes both cost and constraints, with the task loss for optimization.

As shown in Figure 4, the prediction Net model performs the best on prediction (MAPE), which is the training objective. As for the task cost, which the scheduler really cares in practice, our SLOTH framework outperforms the Prediction Net by a large margin. Specifically, our model improves the task-cost performance by 36.8% compared to the Prediction Net, while at the same time achieving a similar task loss target as the Weighted Loss Net, but with an improvement of 53.2% in the prediction accuracy.

Conclusion

In this paper, we introduced a novel task-based structure-learning framework (SLOTH) for HTS. We proposed two tree-based mechanisms to utilize the hierarchical structure for HTS forecasting. The top-down convolution integrates the temporal feature of the top-level to enhance stability of dynamic patterns, while the bottom-up attention incorporates the features of the bottom-level to improve the coherency of the temporal features. In the reconciliation step, we applied the deep neural optimization layer to produce the controllable coherent result, which also accommodates complicated realistic task-based constraints and targets under coherency without requiring any explicit post-processing step. We unified the goals of forecasting and decision-making and achieved an end-to-end framework. We conducted extensive empirical evaluations on real-world datasets, where the competitiveness of our method under various conditions against other state-of-the-art methods were demonstrated. Furthermore, our ablation studies proved the efficacy of each component we designed. Our method has also been deployed in the production environment in Ant Group for its cloud resources scheduling.

References

Amos and Kolter (2017) Amos, B.; and Kolter, J. Z. 2017. Optnet: Differentiable optimization as a layer in neural networks. In International Conference on Machine Learning, 136–145. PMLR.
Anderer and Li (2021) Anderer, M.; and Li, F. 2021. Forecasting reconciliation with a top-down alignment of independent level forecasts. arXiv:2103.08250.
Athanasopoulos, Ahmed, and Hyndman (2009) Athanasopoulos, G.; Ahmed, R. A.; and Hyndman, R. J. 2009. Hierarchical forecasts for Australian domestic tourism. International Journal of Forecasting, 25(1): 146–166.
Athanasopoulos et al. (2017) Athanasopoulos, G.; Hyndman, R. J.; Kourentzes, N.; and Petropoulos, F. 2017. Forecasting with temporal hierarchies. European Journal of Operational Research, 262(1): 60–74.
Bai, Kolter, and Koltun (2018) Bai, S.; Kolter, J. Z.; and Koltun, V. 2018. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv:1803.01271.
Bushell et al. (2001) Bushell, R.; Prosser, G. M.; Faulkner, H. W.; and Jafari, J. 2001. Tourism research in Australia. Journal of Travel Research, 39(3): 323–326.
Chung et al. (2014) Chung, J.; Gulcehre, C.; Cho, K.; and Bengio, Y. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555.
Corani et al. (2020) Corani, G.; Azzimonti, D.; Augusto, J. P.; and Zaffalon, M. 2020. Probabilistic reconciliation of hierarchical forecast via Bayes’ rule. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 211–226. Springer.
Dangerfield and Morris (1992) Dangerfield, B. J.; and Morris, J. S. 1992. Top-down or bottom-up: Aggregate versus disaggregate extrapolations. International journal of forecasting, 8(2): 233–241.
Donti, Amos, and Kolter (2017) Donti, P.; Amos, B.; and Kolter, J. Z. 2017. Task-based end-to-end model learning in stochastic optimization. Advances in Neural Information Processing Systems, 30.
Gardner and Dorling (1998) Gardner, M. W.; and Dorling, S. 1998. Artificial neural networks (the multilayer perceptron)—a review of applications in the atmospheric sciences. Atmospheric Environment, 32(14-15): 2627–2636.
Gross and Sohl (1990) Gross, C. W.; and Sohl, J. E. 1990. Disaggregation methods to expedite product line forecasting. Journal of forecasting, 9(3): 233–254.
Han, Dasgupta, and Ghosh (2021) Han, X.; Dasgupta, S.; and Ghosh, J. 2021. Simultaneously Reconciled Quantile Forecasting of Hierarchically Related Time Series. In International Conference on Artificial Intelligence and Statistics, 190–198. PMLR.
Hochreiter and Schmidhuber (1997) Hochreiter, S.; and Schmidhuber, J. 1997. Long short-term memory. Neural Computation, 9(8): 1735–1780.
Hyndman et al. (2011) Hyndman, R. J.; Ahmed, R. A.; Athanasopoulos, G.; and Han, L. S. 2011. Optimal combination forecasts for hierarchical time serie. Computational Statistics & Data Analysis.
Hyndman and Athanasopoulos (2018) Hyndman, R. J.; and Athanasopoulos, G. 2018. Forecasting: Principles and Practice. OTexts.
Hyndman, Lee, and Wang (2016) Hyndman, R. J.; Lee, A. J.; and Wang, E. 2016. Fast computation of reconciled forecasts for hierarchical and grouped time series. Computational Statistics & Data Analysis, 97: 16–32.
Jeon, Panagiotelis, and Petropoulos (2018) Jeon, J.; Panagiotelis, A.; and Petropoulos, F. 2018. Reconciliation of probabilistic forecasts with an application to wind power. arXiv:1808.02635.
Kotary et al. (2021) Kotary, J.; Fioretto, F.; Van Hentenryck, P.; and Wilder, B. 2021. End-to-end constrained optimization learning: A survey. arXiv preprint arXiv:2103.16378.
Kourentzes and Athanasopoulos (2019) Kourentzes, N.; and Athanasopoulos, G. 2019. Cross-temporal coherent forecasts for Australian tourism. Annals of Tourism Research, 75.
Labour Force (2021) Labour Force, A. 2021. Australian Bureau of Statistics.
Li et al. (2017) Li, Y.; Yu, R.; Shahabi, C.; and Liu, Y. 2017. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv:1707.01926.
Liu, Yan, and Hauskrecht (2018) Liu, Z.; Yan, Y.; and Hauskrecht, M. 2018. A flexible forecasting framework for hierarchical time series with seasonal patterns: A case study of web traffic. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 889–892.
Mou et al. (2014) Mou, L.; Li, G.; Jin, Z.; Zhang, L.; and Wang, T. 2014. TBCNN: A tree-based convolutional neural network for programming language processing. arXiv preprint arXiv:1409.5718.
Mou et al. (2016) Mou, L.; Li, G.; Zhang, L.; Wang, T.; and Jin, Z. 2016. Convolutional neural networks over tree structures for programming language processing. In Thirtieth AAAI Conference on Artificial Intelligence.
Nguyen et al. (2020) Nguyen, X.-P.; Joty, S.; Hoi, S. C. H.; and Socher, R. 2020. Tree-structured Attention with Hierarchical Accumulation. arXiv:2002.08046.
Oreshkin et al. (2019) Oreshkin, B. N.; Carpov, D.; Chapados, N.; and Bengio, Y. 2019. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. arXiv:1905.10437.
Qu et al. (2023) Qu, C.; Tan, X.; Xue, S.; Shi, X.; Zhang, J.; and Mei, H. 2023. Bellman Meets Hawkes: Model-Based Reinforcement Learning via Temporal Point Processes. In AAAI 2023. AAAI Press.
Rangapuram et al. (2021) Rangapuram, S. S.; Werner, L. D.; Benidis, K.; Mercado, P.; Gasthaus, J.; and Januschowski, T. 2021. End-to-end learning of coherent probabilistic forecasts for hierarchical time series. In International Conference on Machine Learning, 8832–8843. PMLR.
Salinas et al. (2019) Salinas, D.; Bohlke-Schneider, M.; Callot, L.; Medico, R.; and Gasthaus, J. 2019. High-dimensional multivariate forecasting with low-rank gaussian copula processes. arXiv:1910.03002.
Salinas et al. (2020) Salinas, D.; Flunkert, V.; Gasthaus, J.; and Januschowski, T. 2020. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3): 1181–1191.
Sutskever, Vinyals, and Le (2014) Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. Advances in neural information processing systems, 27.
Taieb and Koo (2019) Taieb, S. B.; and Koo, B. 2019. Regularized regression for hierarchical forecasting without unbiasedness conditions. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1337–1347.
Taieb, Taylor, and Hyndman (2017) Taieb, S. B.; Taylor, J. W.; and Hyndman, R. J. 2017. Coherent probabilistic forecasts for hierarchical time series. In International Conference on Machine Learning, 3348–3357. PMLR.
Taieb, Taylor, and Hyndman (2021) Taieb, S. B.; Taylor, J. W.; and Hyndman, R. J. 2021. Hierarchical probabilistic forecasting of electricity demand with smart meter data. Journal of the American Statistical Association, 116(533): 27–43.
Van Den Oord et al. (2016) Van Den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A. W.; and Kavukcuoglu, K. 2016. WaveNet: A generative model for raw audio. SSW, 125: 2.
Vaswani et al. (2017) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.
Wickramasuriya, Athanasopoulos, and Hyndman (2019) Wickramasuriya, S. L.; Athanasopoulos, G.; and Hyndman, R. J. 2019. Optimal forecast reconciliation for hierarchical and grouped time series through trace minimization. Journal of the American Statistical Association, 114(526): 804–819.
Wu et al. (2021) Wu, H.; Xu, J.; Wang, J.; and Long, M. 2021. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in Neural Information Processing Systems, 34: 22419–22430.
Xue et al. (2022a) Xue, S.; Qu, C.; Shi, X.; Liao, C.; Zhu, S.; Tan, X.; Ma, L.; Wang, S.; Wang, S.; Hu, Y.; Lei, L.; Zheng, Y.; Li, J.; and Zhang, J. 2022a. A Meta Reinforcement Learning Approach for Predictive Autoscaling in the Cloud. In Zhang, A.; and Rangwala, H., eds., KDD ’22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 14 - 18, 2022, 4290–4299. ACM.
Xue et al. (2022b) Xue, S.; Shi, X.; Zhang, Y. J.; and Mei, H. 2022b. HYPRO: A Hybridly Normalized Probabilistic Model for Long-Horizon Prediction of Event Sequences. In Advances in Neural Information Processing Systems.
Yu, Yin, and Zhu (2017) Yu, B.; Yin, H.; and Zhu, Z. 2017. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. arXiv:1709.04875.
Zhou et al. (2021) Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; and Zhang, W. 2021. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of AAAI.
Zhou et al. (2022) Zhou, T.; Ma, Z.; Wen, Q.; Wang, X.; Sun, L.; and Jin, R. 2022. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. Proceedings of the 39th International Conference on Machine Learning, 162: 27268–27286.

References

Amos and Kolter (2017) Amos, B.; and Kolter, J. Z. 2017. Optnet: Differentiable optimization as a layer in neural networks. In International Conference on Machine Learning, 136–145. PMLR.
Anderer and Li (2021) Anderer, M.; and Li, F. 2021. Forecasting reconciliation with a top-down alignment of independent level forecasts. arXiv:2103.08250.
Athanasopoulos, Ahmed, and Hyndman (2009) Athanasopoulos, G.; Ahmed, R. A.; and Hyndman, R. J. 2009. Hierarchical forecasts for Australian domestic tourism. International Journal of Forecasting, 25(1): 146–166.
Athanasopoulos et al. (2017) Athanasopoulos, G.; Hyndman, R. J.; Kourentzes, N.; and Petropoulos, F. 2017. Forecasting with temporal hierarchies. European Journal of Operational Research, 262(1): 60–74.
Bai, Kolter, and Koltun (2018) Bai, S.; Kolter, J. Z.; and Koltun, V. 2018. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv:1803.01271.
Bushell et al. (2001) Bushell, R.; Prosser, G. M.; Faulkner, H. W.; and Jafari, J. 2001. Tourism research in Australia. Journal of Travel Research, 39(3): 323–326.
Chung et al. (2014) Chung, J.; Gulcehre, C.; Cho, K.; and Bengio, Y. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555.
Corani et al. (2020) Corani, G.; Azzimonti, D.; Augusto, J. P.; and Zaffalon, M. 2020. Probabilistic reconciliation of hierarchical forecast via Bayes’ rule. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 211–226. Springer.
Dangerfield and Morris (1992) Dangerfield, B. J.; and Morris, J. S. 1992. Top-down or bottom-up: Aggregate versus disaggregate extrapolations. International journal of forecasting, 8(2): 233–241.
Donti, Amos, and Kolter (2017) Donti, P.; Amos, B.; and Kolter, J. Z. 2017. Task-based end-to-end model learning in stochastic optimization. Advances in Neural Information Processing Systems, 30.
Gardner and Dorling (1998) Gardner, M. W.; and Dorling, S. 1998. Artificial neural networks (the multilayer perceptron)—a review of applications in the atmospheric sciences. Atmospheric Environment, 32(14-15): 2627–2636.
Gross and Sohl (1990) Gross, C. W.; and Sohl, J. E. 1990. Disaggregation methods to expedite product line forecasting. Journal of forecasting, 9(3): 233–254.
Han, Dasgupta, and Ghosh (2021) Han, X.; Dasgupta, S.; and Ghosh, J. 2021. Simultaneously Reconciled Quantile Forecasting of Hierarchically Related Time Series. In International Conference on Artificial Intelligence and Statistics, 190–198. PMLR.
Hochreiter and Schmidhuber (1997) Hochreiter, S.; and Schmidhuber, J. 1997. Long short-term memory. Neural Computation, 9(8): 1735–1780.
Hyndman et al. (2011) Hyndman, R. J.; Ahmed, R. A.; Athanasopoulos, G.; and Han, L. S. 2011. Optimal combination forecasts for hierarchical time serie. Computational Statistics & Data Analysis.
Hyndman and Athanasopoulos (2018) Hyndman, R. J.; and Athanasopoulos, G. 2018. Forecasting: Principles and Practice. OTexts.
Hyndman, Lee, and Wang (2016) Hyndman, R. J.; Lee, A. J.; and Wang, E. 2016. Fast computation of reconciled forecasts for hierarchical and grouped time series. Computational Statistics & Data Analysis, 97: 16–32.
Jeon, Panagiotelis, and Petropoulos (2018) Jeon, J.; Panagiotelis, A.; and Petropoulos, F. 2018. Reconciliation of probabilistic forecasts with an application to wind power. arXiv:1808.02635.
Kotary et al. (2021) Kotary, J.; Fioretto, F.; Van Hentenryck, P.; and Wilder, B. 2021. End-to-end constrained optimization learning: A survey. arXiv preprint arXiv:2103.16378.
Kourentzes and Athanasopoulos (2019) Kourentzes, N.; and Athanasopoulos, G. 2019. Cross-temporal coherent forecasts for Australian tourism. Annals of Tourism Research, 75.
Labour Force (2021) Labour Force, A. 2021. Australian Bureau of Statistics.
Li et al. (2017) Li, Y.; Yu, R.; Shahabi, C.; and Liu, Y. 2017. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv:1707.01926.
Liu, Yan, and Hauskrecht (2018) Liu, Z.; Yan, Y.; and Hauskrecht, M. 2018. A flexible forecasting framework for hierarchical time series with seasonal patterns: A case study of web traffic. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 889–892.
Mou et al. (2014) Mou, L.; Li, G.; Jin, Z.; Zhang, L.; and Wang, T. 2014. TBCNN: A tree-based convolutional neural network for programming language processing. arXiv preprint arXiv:1409.5718.
Mou et al. (2016) Mou, L.; Li, G.; Zhang, L.; Wang, T.; and Jin, Z. 2016. Convolutional neural networks over tree structures for programming language processing. In Thirtieth AAAI Conference on Artificial Intelligence.
Nguyen et al. (2020) Nguyen, X.-P.; Joty, S.; Hoi, S. C. H.; and Socher, R. 2020. Tree-structured Attention with Hierarchical Accumulation. arXiv:2002.08046.
Oreshkin et al. (2019) Oreshkin, B. N.; Carpov, D.; Chapados, N.; and Bengio, Y. 2019. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. arXiv:1905.10437.
Qu et al. (2023) Qu, C.; Tan, X.; Xue, S.; Shi, X.; Zhang, J.; and Mei, H. 2023. Bellman Meets Hawkes: Model-Based Reinforcement Learning via Temporal Point Processes. In AAAI 2023. AAAI Press.
Rangapuram et al. (2021) Rangapuram, S. S.; Werner, L. D.; Benidis, K.; Mercado, P.; Gasthaus, J.; and Januschowski, T. 2021. End-to-end learning of coherent probabilistic forecasts for hierarchical time series. In International Conference on Machine Learning, 8832–8843. PMLR.
Salinas et al. (2019) Salinas, D.; Bohlke-Schneider, M.; Callot, L.; Medico, R.; and Gasthaus, J. 2019. High-dimensional multivariate forecasting with low-rank gaussian copula processes. arXiv:1910.03002.
Salinas et al. (2020) Salinas, D.; Flunkert, V.; Gasthaus, J.; and Januschowski, T. 2020. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3): 1181–1191.
Sutskever, Vinyals, and Le (2014) Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. Advances in neural information processing systems, 27.
Taieb and Koo (2019) Taieb, S. B.; and Koo, B. 2019. Regularized regression for hierarchical forecasting without unbiasedness conditions. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1337–1347.
Taieb, Taylor, and Hyndman (2017) Taieb, S. B.; Taylor, J. W.; and Hyndman, R. J. 2017. Coherent probabilistic forecasts for hierarchical time series. In International Conference on Machine Learning, 3348–3357. PMLR.
Taieb, Taylor, and Hyndman (2021) Taieb, S. B.; Taylor, J. W.; and Hyndman, R. J. 2021. Hierarchical probabilistic forecasting of electricity demand with smart meter data. Journal of the American Statistical Association, 116(533): 27–43.
Van Den Oord et al. (2016) Van Den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A. W.; and Kavukcuoglu, K. 2016. WaveNet: A generative model for raw audio. SSW, 125: 2.
Vaswani et al. (2017) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.
Wickramasuriya, Athanasopoulos, and Hyndman (2019) Wickramasuriya, S. L.; Athanasopoulos, G.; and Hyndman, R. J. 2019. Optimal forecast reconciliation for hierarchical and grouped time series through trace minimization. Journal of the American Statistical Association, 114(526): 804–819.
Wu et al. (2021) Wu, H.; Xu, J.; Wang, J.; and Long, M. 2021. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in Neural Information Processing Systems, 34: 22419–22430.
Xue et al. (2022a) Xue, S.; Qu, C.; Shi, X.; Liao, C.; Zhu, S.; Tan, X.; Ma, L.; Wang, S.; Wang, S.; Hu, Y.; Lei, L.; Zheng, Y.; Li, J.; and Zhang, J. 2022a. A Meta Reinforcement Learning Approach for Predictive Autoscaling in the Cloud. In Zhang, A.; and Rangwala, H., eds., KDD ’22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 14 - 18, 2022, 4290–4299. ACM.
Xue et al. (2022b) Xue, S.; Shi, X.; Zhang, Y. J.; and Mei, H. 2022b. HYPRO: A Hybridly Normalized Probabilistic Model for Long-Horizon Prediction of Event Sequences. In Advances in Neural Information Processing Systems.
Yu, Yin, and Zhu (2017) Yu, B.; Yin, H.; and Zhu, Z. 2017. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. arXiv:1709.04875.
Zhou et al. (2021) Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; and Zhang, W. 2021. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of AAAI.
Zhou et al. (2022) Zhou, T.; Ma, Z.; Wen, Q.; Wang, X.; Sun, L.; and Jin, R. 2022. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. Proceedings of the 39th International Conference on Machine Learning, 162: 27268–27286.

Appendix A Appendix A Related Work

Hirarchical Time Series Prediction

Traditional approaches

Traditionally, coherence is utilized as a tool helping generate base forecasts. Base forecasts of a pre-specified level in the hierarchy are generated first, then according to the coherence, reconciliation methods are applied to generate base forecasts of other levels. There are three types of approaches following this routine: Bottom-Up mechanism (Gross and Sohl 1990) generates forecasts at the bottom level and then aggregates them according to the hierarchical structure to produce high level forecasts as in Eq. (1), where hierarchical structure information was utilized explicitly during the aggregations. The major shortcoming of this approach is that the bottom-level series tend to be noisy and there is a high risk of inaccurate forecasting (Athanasopoulos et al. 2017; Wickramasuriya, Athanasopoulos, and Hyndman 2019), which accumulates during aggregation. To overcome this weakness, a Top-Down mechanism was proposed in (Athanasopoulos, Ahmed, and Hyndman 2009), where the top-level time series are first forecasted and then disaggregated to bottom-levels. Forecasting at the top-level is an easier task because the temporal pattern is normally more stable and less noisy, but the disaggregation procedure still cannot fully utilize the information of the hierarchical structure. Considering the disadvantages of aforementioned approaches, a compromised approach called Middle-out mechanism is proposed in (Athanasopoulos et al. 2017), where forecasts are first produced at a middle level of the hierarchy, followed by aggregations for the higher levels and disaggregations for the lower ones. A significant weakness of these methods lies in information loss, i.e., the characteristics of time series at other levels are not integrated, which worsens with deeper hierarchies (Kourentzes and Athanasopoulos 2019).

Forecast Reconciliation

In order to overcome the weaknesses of the above approaches, a two-stage procedure is proposed to produce coherent forecasts for all series in the hierarchy:

1.

Produce forecasting results for time series of each level independently without considering the structure.
2.

Conduct reconciliation on base forecasts using the hierarchy structure, to obtain the coherent forecasts.

Given the $h$ -step-ahead base forecasts $\hat{\mathbf{y}}_{T+h}$ , the reconciliation step (Hyndman et al. 2011; Wickramasuriya, Athanasopoulos, and Hyndman 2019) obtains coherent forecasts as follows:

\tilde{\mathbf{y}}_{T+h}=\mathbf{S}\mathbf{P}\hat{\mathbf{y}}_{T+h},

(14)

where matrix $\mathbf{P}\in\mathbb{R}^{m\times n}$ projects base forecasts of all nodes to those at the bottom level, $\mathbf{S}\in\{0,1\}^{n\times m}$ is the aggregation matrix, and $\tilde{\mathbf{y}}_{T+h}$ is the coherent forecasts. Aggregation matrix ensures that reconciliation forecasts $\tilde{\mathbf{y}}_{T+h}$ satisfy the coherence constraint. The key to the reconciliation step is to solve for the matrix $\mathbf{P}$ , and the Trace Minimization (MinT) algorithm (Wickramasuriya, Athanasopoulos, and Hyndman 2019) provides the optimal solution of the analytical expression $\mathbf{P}=(\mathbf{S}^{\mathsf{T}}\mathbf{W}^{-1}_{h}\mathbf{S})^{-1}$ $(\mathbf{S}^{\mathsf{T}}\mathbf{W}^{-1}_{h})$ , where $\mathbf{W}_{h}$ is the covariance matrix of the $h$ -step-ahead forecast error $\hat{\varepsilon}_{T+h}=\mathbf{y}_{T+h}-\hat{\mathbf{y}}_{T+h}$ , and the empirical risk minimization (ERM) algorithm (Taieb and Koo 2019) puts unbiased error and variance error into the objective function and then derive the best reconciliation result through solving an ERM problem.

There are several major issues with these two-stage procedures: 1) These methods employ univariate time series models as the base forecaster predominantly, such as linear auto-regressive (AR) models, which are initially trained independently without considering the relationships across time series; 2) The base forecasts are revised without any regard to the learned model parameters; 3) Several methods assume that the individual base forecast is unbiased, which is inconsistent with many real-world tasks.

An end-to-end framework for coherent probabilistic forecasting is proposed recently in Hier-E2E (Rangapuram et al. 2021), leveraging a closed-form projection from the space of base forecasts to the space of coherent solution with minimization of revision ,in order to retain the prediction as much as possible while satisfying the coherence constraint as

\tilde{\mathbf{y}}_{t+h}=\mathbf{M}\hat{\mathbf{y}}_{t+h}=[\mathbf{I}-\mathbf{A}^{\mathsf{T}}(\mathbf{A}\mathbf{A}^{\mathsf{T}})^{-1}\mathbf{A}]\hat{\mathbf{y}}_{t+h},

(15)

where $\mathbf{A}$ is the constraint matrix as shown in Eq. (2). The closed-form projection matrix $\mathbf{M}$ can be calculated before training since it only depends on hierarchical matrix $\mathbf{A}$ . This procedure improves the reconciliation efficiency without incurring any post-processing penalty. Another advantage of this method is the use of multivariate model DeepVAR (Salinas et al. 2019) to produce base forecasts, where training takes all time series simultaneously, followed by DNN to extract relationship between nodes, which leads to better forecast accuracy. However, this method has the following limitations: 1) the structural information in the hierarchy is not explicitly incorporated to produce the base forecasts; 2) the closed-form projection ignores the inequality constraints; and 3) more general and complex task-related constraints beyond the coherence constraint are not considered.

To the best of our knowledge, none of the existing approaches takes hierarchical structure information explicitly into consideration in base forecasting. Inspired by structure learning (Nguyen et al. 2020; Mou et al. 2014; Li et al. 2017; Yu, Yin, and Zhu 2017) \add, we propose dual tree-based mechanisms to incorporate hierarchical structure into temporal features for more accurate base forecasts. Furthermore, all the aforementioned methods aim at ensuring the coherence, ignoring task-based targets or realistic constraints, making them impractical for real-world applications. Our optimization reconciliation approach, on the other hand, takes the final target as an optimization goal while satisfying task-related constraints including coherence constraint.

Prediction and Optimization

The intersection of prediction and decision models has been a prominent topic in recent years (Kotary et al. 2021), where decision models can be partially represented as constrained optimization problems.

	$\displaystyle\mathcal{J}(\mathbf{y})=$	$\displaystyle\mathop{\arg\min}_{\mathbf{z}}f(\mathbf{y},\mathbf{z})$		(16)
	subject to	$\displaystyle\;\mathbf{z}\in C_{\mathbf{z}},$		(16)

where $\mathbf{y}$ is the prediction results generated by the time series model, $f({\mathbf{y}},\cdot)$ is the objective function and $C_{\mathbf{z}}$ is the feasible solution space under specific constraints. The training process is to find the optimal solution $\hat{\mathbf{y}}$ by minimizing the target loss $J(\mathbf{y})$ . The optimization process can be seen as a Quadratic Programming (QP) problem if target loss function is quadratic. Amos and Kolter (Amos and Kolter 2017) employed GPU-ready QP solver (OPTNET) to solve constrained QP problems with a deep neural layer. In the training process, OPTNET derives argmin gradient for backpropagation by differentiating Lagrange’s KKT conditions using matrix differential calculus, and generates the gradient to optimize implicit inference layer. Optimization result is obtained in forward pass. Donti et al. (Donti, Amos, and Kolter 2017) then proposed a predict-and-optimize architecture by applying the QP solver to time series tasks with task-specified constraints.

Appendix B Appendix B Baselines

The baselines for prediction and reconciliation is as follows, we include traditional and deep learning methods.

•
Traditional statistical models ²²2All methods can be found in hts R package: https://cran.r-project.org/web/packages/hts/vignettes/hts.pdf
- –
  
  ARIMA-BU: ARIMA is a standard statistic prediction method, and bottom-up reconciliation (BU) is the basic reconciliation method.
- –
  
  ARIMA-MINT-SHR: ARIMA prediction with min-trace reconciliation using SHR mechanism to generate covariance (Wickramasuriya, Athanasopoulos, and Hyndman 2019).
- –
  
  ARIMA-MINT-OLS:ARIMA prediction with min-trace reconciliation method using OLS mechanism to generate covariance (Wickramasuriya, Athanasopoulos, and Hyndman 2019).
- –
  
  ARIMA-ERM: ARIMA prediction, using ERM (Taieb and Koo 2019) to generate reconciliation result.
- –
  
  PERMBU-MINT: probabilistic hierarchical forecasting method with min-trace reconciliation (Taieb, Taylor, and Hyndman 2017)
•
Deep learning baselines are the combination of four baseline deep forecasting models with two reconciliation methods.
- –
  - *
    
    DeepAR: an approach to produce accurate probabilistic forecasts, based on autoregressive recurrent neural network model, which is widely applied in industrial forecasting tasks. ³³3https://github.com/arrigonialberto86/deepar (Salinas et al. 2020)
  - *
    
    DeepVar: an representative RNN-based multi-variate probabilistic time series model with a Gaussian copula process to handle multi-variate Gaussion joint distributions. ⁴⁴4https://github.com/zalandoresearch/pytorch-ts/tree/master/pts/model/deepvar (Salinas et al. 2019)
  - *
    
    NBEATS: a deep neural architecture based on backward and forward residual links and a deep stack of neural layers to achieve interpretability. ⁵⁵5https://github.com/philipperemy/n-beats (Oreshkin et al. 2019)
  - *
    
    INFORMER: an efficient transformer-based model in long sequence time-series forecasting with ProbSparse self-attention, self-attention distilling and generative style decoder. ⁶⁶6https://github.com/zhouhaoyi/Informer2020 (Zhou et al. 2021)
- –
  Reconciliation methods
  - *
    
    Bottom-Up (BU): a traditionally popular reconciliation method, used by most of the state-of-the-art methods as discussed in Appendix A.
  - *
    
    Closed-form projection operator (Proj): the state-of-the-art end-to-end reconciliation method, with the closed-form projection minimizing the revision of base forecast while ensuring coherence. It is combined with DeepVar forecasting to form the best model in hierarchical forecasting in HierE2E (Rangapuram et al. 2021)

Appendix C Appendix C Evalution Metrics

We use Mean Absolute Percent Error (MAPE) as the prediction metric, and adopt the weighted mape ( $w\_MAPE$ ) to evaluate the performance when taking scaled differences across all levels into consideration:

MAPE=\frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y^{i}-\hat{y}^{i}}{\hat{y}^{i}}\right|,

(17)

w\_MAPE=\sum_{i=1}^{n}\frac{\hat{y}^{i}\times|y^{i}-\hat{y}^{i}|}{\sum_{j=1}^{n}\hat{y}^{j}}.

(18)

In the ablation study, we use $CO\_MAPE$ to measure the coherence satisfaction as shown in Eq. (19), and we do not consider the leaf node when calculating $CO\_MAPE$ because it does not contain any aggregation:

CO\_MAPE=\frac{1}{r}\sum_{i=1}^{r}\frac{|y^{i}-\sum_{j\in c(i)}y^{j}|}{y^{i}},

(19)

where $c(i)$ means the children of node $i$ , and $r$ is number of all nodes minus the number of leaf nodes.

Appendix D Appendix D Detailed Experiments Results

Dataset		Tourism				Labour				M5
		Level				Level				Level
Model	Metric	1	2	3	4	1	2	3	4	1	2	3	4	5
ARIMA-BU	MAPE	0.0632	0.1055	0.2442	0.3406	0.0439	0.0375	0.0391	0.0529	0.0259	0.0498	0.0576	0.0851	0.1374
ARIMA-BU	w-MAPE	0.0158	0.0248	0.0360	0.0444	0.0109	0.0111	0.0112	0.0123	0.0051	0.0098	0.0119	0.0159	0.0208
PERMBU-MINT	MAPE	0.0571	0.1014	0.2450	0.3390	0.0448	0.0422	0.0425	0.0556	0.0423	0.0672	0.0705	0.0900	0.1393
PERMBU-MINT	w-MAPE	0.0142	0.0205	0.0324	0.0405	0.0112	0.0108	0.0109	0.0123	0.0086	0.0135	0.0148	0.0183	0.0225
DEEPAR-BU	MAPE	0.0686	0.1034	0.2661	0.3455	0.0236	0.0253	0.0265	0.047	0.0353	0.0569	0.0649	0.0904	0.1365
DEEPAR-BU	w-MAPE	0.0172	0.0212	0.0337	0.0395	0.0059	0.0065	0.0068	0.0086	0.007	0.0112	0.0133	0.0167	0.0204
DEEPAR-Proj	MAPE	0.0771	0.1161	0.2825	0.36	0.024	0.0282	0.03	0.0526	0.0615	0.0785	0.0917	0.1176	0.1842
DEEPAR-Proj	w-MAPE	0.0193	0.0225	0.0349	0.0405	0.006	0.0068	0.0072	0.009	0.0123	0.0159	0.0186	0.0225	0.0258
DEEPVAR-BU	MAPE	0.1672	0.1904	0.363	0.459	0.0984	0.102	0.1035	0.1196	0.1254	0.1333	0.1436	0.169	0.2011
DEEPVAR-BU	w-MAPE	0.0418	0.0489	0.061	0.0678	0.0246	0.0249	0.025	0.0257	0.0251	0.0274	0.0289	0.0326	0.0355
DEEPVAR-Proj	MAPE	0.1621	0.1834	0.3683	0.4697	0.0867	0.0839	0.0853	0.1005	0.138	0.1463	0.1569	0.18	0.2211
DEEPVAR-Proj	w-MAPE	0.0405	0.0477	0.0604	0.0676	0.0217	0.0219	0.022	0.0227	0.0276	0.0299	0.0315	0.0348	0.0378
NBEATS-BU	MAPE	0.073	0.1287	0.276	0.3537	0.0254	0.0274	0.0287	0.0458	0.1554	0.1621	0.166	0.173	0.2119
NBEATS-BU	w-MAPE	0.0183	0.0262	0.0387	0.0458	0.0064	0.0067	0.0069	0.0088	0.0311	0.0323	0.0338	0.0362	0.0386
NBEATS-Proj	MAPE	0.1167	0.1844	0.3816	0.4895	0.0358	0.0351	0.0359	0.0523	0.1131	0.1292	0.1316	0.156	0.2299
NBEATS-Proj	w-MAPE	0.0292	0.0346	0.0461	0.0528	0.0089	0.0092	0.0093	0.0111	0.0226	0.0258	0.0272	0.0316	0.0348
SLOTH-Opt	MAPE	0.0617	0.0946	0.2274	0.2939	0.0124	0.0212	0.0226	0.0415	0.0323	0.0581	0.0647	0.0877	0.1321
SLOTH-Opt	w-MAPE	0.0154	0.0194	0.0316	0.0368	0.0031	0.0041	0.0046	0.0066	0.0065	0.0115	0.0134	0.0173	0.021
SLOTH-BU	MAPE	0.0554	0.089	0.2235	0.2916	0.0197	0.0287	0.0303	0.0468	0.0345	0.0588	0.0654	0.0883	0.1333
SLOTH-BU	w-MAPE	0.0139	0.0186	0.0306	0.0361	0.0049	0.0058	0.0062	0.0079	0.0069	0.0117	0.0136	0.0172	0.0209
SLOTH-Proj	MAPE	0.0682	0.0988	0.2455	0.3118	0.0167	0.0255	0.027	0.0455	0.0346	0.0586	0.0655	0.0888	0.1320
SLOTH-Proj	w-MAPE	0.017	0.0209	0.0325	0.0393	0.0042	0.0053	0.0056	0.0077	0.0069	0.0117	0.0136	0.0172	0.0208

Table 3: MAPE and weighted-MAPE(w-MAPE) values across all levels over 5 independent runs. We only show level performances from the part of baselines which perform superior enough. The best performance for each level in three datasets are highlighted.

Deep Model	Tourism	Labour	M5
DeepAR-Proj	0.1112	3.6691	15.5745
DeepVAR-Proj	0.2431	6.4946	26.3628
NBEATS-Proj	1.0372	3.6010	64.2808
INFORMER-Proj	0.6268	13.6285	53.3686
DeepAR-BU	0.2315	5.8928	24.8748
NBEATS-BU	0.9865	2.5818	11.7834
INFROMER-BU	0.3460	12.9719	53.9309
SLOTH(Opt)(ours)	0.4610	6.9624	48.2653
SLOTH(Proj)	0.1992	3.9627	18.1219
SLOTH(BU)	0.2065	4.021	19.6753

Table 4: Average running time (seconds) over 5 independent runs for baselines such as traditional reconciliation methods and end-to-end methods, as well as our approach.

As mentioned in section Experiment , the performance difference on w-MAPE and MAPE metrics indicates that the model performs differently across different aggregation levels. So we assess the gains in the performance across all levels presented in Table 3 , which demonstrates that our approach achieves the best performance on both MAPE and weighted-MAPE for 15 out of 25 total levels across three datasets, and our forecasting mechanism, combined with other reconciliation methods, outperforms all other baseline methods across all aggregation levels and bottom levels. The training running time of deep model for 10 epochs is shown in Table 4, we don’t show the result of statistic models such as $ARIMA\_BU$ because it is R scripts which run on CPUs, it meanless to compare it with deep models running on GPUs.

Appendix E Appendix E Ablation Study

We conduct ablation studies for each component in our framework, and we demonstrate that we utilize the hierarchical structural information effectively to provide stabler and more accurate results than the state-of-the-art forecasting model. In order to prove that our performance improvement only comes from the forecasting module, we do not perform the explicit reconciliation step after forecasting.

We first evaluate the validity of the top-down convolution (TD-Conv) mechanism which incorporates the ancestors’ information. Our mechanism can be applied to any deep RNN-like model, and we adopt GRU(Chung et al. 2014), TCN(Bai, Kolter, and Koltun 2018), and WAVENET(Van Den Oord et al. 2016) as baseline forecasting models to generate dynamic features for each node. As shown in Figure 5, the orange bar (model with TD-Conv) is always lower than the green bar (base model), therefore, models perform better on the three datasets when the top-down convolution mechanism is applied to integrate dynamic patterns from ancestors. Taking MAPE as the metric, top-down convolution improves the performance by 6.2% comparing to the baseline.

We further compare the performance improvement of the bottom-up attention mechanism which integrates children’s hidden features into the parent node as shown in Figure 5. We also adopt GRU, TCN, and WAVENET as dynamic feature generators. The results show that the bottom-up attention mechanism helps the forecasting model obtain more informative features and lower the prediction error. This mechanism outperforms the baseline RNN models by 4.3%. We also find that the bottom-up attention mechanism improves the coherency without explicit reconciliation as Figure 6 demonstrates, wherethe CO-MAPE of the forecasting results from the bottom-up attention integration feature is lower.

As mentioned in Method, the hierarchical structure is a type of the graph structure, but different from the graph time-series scenario because nodes interact with neighbors dynamically; nodes in hierarchical structure only have aggregation relationship. Nodes at the top levels are virtual or conceptual split points, such as the north-eastern region or a state, and there might exist no physical interaction between nodes at all. We represent hierarchical structure as a bi-directional adjacent graph where edges only exist between parents and children. We then adopt the graph time series method, DCRNN, to generate the predictions and compare the performance against baseline models as shown in Figure 7. We can see DCRNN performs inferior to baseline models, and our explanation is that the graph structure from hierarchical topology is sub-optimal for the representation of the true relationship between nodes, i.e., the mechanism of interaction is inappropriate for the hierarchical structure. In other words, improper modeling of relationship between nodes induces extra noise for prediction.

HTS forecasting has always been modelled as a multivariate forecasting task in previous methods such as Hier-E2E(Rangapuram et al. 2021), where DeepVar (Salinas et al. 2019) is used to extract dynamic patterns and to predict for all nodes, ignoring the hierarchical relationship between each node, but adopts deep neural networks to extract valid similarities that help improve the prediction performance. One limitation of multivariate method is that the increased complexity and the inaccurate relationship from overfitting or underfitting of the deep architecture, causing performance loss for prediction. We evaluate the performance by taking features of all nodes in the hierarchical structure as input to jointly model the whole system and then generate results for all nodes. As shown in Figure 7, the multi-variate model performs better than the univariate model, but is still inferior to our proposed approach.

Appendix F Appendix F M5 Scheduling Task details

We discuss the details of the designed realistic demand-scheduling task for m5 scheduling task. The penalty setting for each level is shown in Table 5. We would like to minimize the cost of the whole supply chain, formally, the target is defined as follows:

level	1	2	3	4	5
$\mathbf{c}_{u}$	2	1	0.9	0.7	0.5
$\mathbf{c}_{o}$	50	45	40	35	30
$\mathbf{q}$	0.5	0.7	0.8	1	2

Table 5: penalty setting for each level

	$\displaystyle f(\hat{\mathbf{y}},\mathbf{y})=$	$\displaystyle\mathbf{c}_{u}^{\mathsf{T}}\max(\hat{\mathbf{y}}-\mathbf{y},\mathbf{0})+\mathbf{c}_{o}^{\mathsf{T}}\max(\mathbf{y}-\hat{\mathbf{y}},\mathbf{0})$		(20)
		$\displaystyle+\frac{1}{2}(\hat{\mathbf{y}}-\mathbf{y})^{\mathsf{T}}diag(\mathbf{q})(\hat{\mathbf{y}}-\mathbf{y}),$		(20)

where $\hat{\mathbf{y}}$ is the ground truth. In order to simplify computation, we induce $\mathbf{y}_{u}$ and $\mathbf{y}_{o}$ to represent the amount of underestimation and overestimation, respectively, Therefore, our task loss can be reformulated as:

		$\displaystyle\mathop{\min}_{\mathbf{y},\mathbf{y}_{u},\mathbf{y}_{o}}\mathbf{c}_{u}^{\mathsf{T}}\mathbf{y}_{u}+\mathbf{c}_{o}^{\mathsf{T}}\mathbf{y}_{o}-\hat{\mathbf{y}}^{\mathsf{T}}diag(\mathbf{q})\mathbf{y}+\frac{1}{2}\mathbf{y}^{\mathsf{T}}diag(\mathbf{q})\mathbf{y},$		(21)
		$\displaystyle\text{subject to}\begin{cases}\mathbf{A}\mathbf{y}=\mathbf{0},\\ \mathbf{y}+\mathbf{y}_{u}\geq\hat{\mathbf{y}},\;\mathbf{y}-\mathbf{y}_{o}\leq\hat{\mathbf{y}},\\ \mathbf{y},\mathbf{y}_{u},\mathbf{y}_{o}\geq\mathbf{0}.\end{cases}$		(21)

Appendix G Appendix G Online Cloud Resource Scheduling Experiments

Our Company has a giant cluster of application servers to support its complex Internet business, separated into numerous Internet Data Centers (IDCs) (Xue et al. 2022a; Qu et al. 2023; Xue et al. 2022b), which are computation facilities that an organization or a service provider relies on. For Company’s IDCs, the deployment of application services presents a hierarchical structure. To be more specific, each deployment includes three levels: the upper layer is called the app deployment unit, and the bottom deployment unit is called the zone, which can be divided into Gzone, Rzone, and Czone. Each zone is then divided into multiple groups, as shown in Figure 8.

We select part of the server hierarchical topology to validate our performance. The hierarchical topology is arranged as follows: 17 IDCs in the bottom level, 3 different zones as the second level, and the root takes the sum of all zones. The frequency granularity is 10min, and we set the prediction length as 12 and the context length as 24. The experiment results are shown in Figure 9, and our approach achieves better performance in a long time horizon forecasting compared to original online baselines.

Appendix H Appendix H Reproducibility

Detail of SLOTH Architecture

The detail of SLOTH architecture of experiment is as follows, we adopt GRU as RNN component to extract the temporal feature, where the hidden dimension is 128 and layer number is 2;for top down convolution mechanism we take 2-d convolution, the kernel size is (1, index of level num) for each level, then adopts RELU as activation; for bottom up attention mechanism, the hidden dimension of query and key is 128, and the layer number is 1; the dimension of output network is 128, and layer number is 2. For optimization reconciliation, we use default setting of OPTNET.

The implementation of pseudocode of SLOTH is shown in 2.

Hyper Parameters Tuning

For all RNN-related neural methods, the context length of recurrent module is set in range {4, 8, 10, 12, 16} for Tourism dataset, and select from {4, 8, 12, 16, 24, 32, 40} for Labour and M5 datasets. For DeepAR and DeepVar methods, the number of rnn layers is selected from range {1, 2, 3, 4}, and the number of hidden layers is chosen from range {32, 64, 128, 256, 512}. and lag sequence is set as (1, 4, 8) for Tourism dataset, (1, 2, 3, 4, 5, 6, 7) for Labour and M5 datasets. For INFORMER, the layer of encoder is chosen from range { 2, 3, 4}, the head number of multi-head attention is set as 8, and the dimension of attention is set as 128. For NBEATS, the number of stacks is chosen from range {32, 48, 64}, the number of hidden state of fully connect layer is chosen from {128, 256, 512}, and the number of blocks is chosen from range {1, 2, 3}, and the number of block layers is chosen from range {2, 4}.

For ARIMA and other statistic methods, we follow the setting of HTS package directly.

Experiment Setup

We conduct all our experiments on a server with following configuration, 2 NVIDIA Tesla P100-PCIe GPUs(16GB memory), and 4 Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz(64 cores) CPUs, 256GB memory and 400TB disk.

Algorithm 2 SLOTH Alogorithm

\mathbf{Y}_{(0,t-1)},S

i\leftarrow 1

\bar{\mathbf{H}_{t}}\leftarrow\mathbf{0}

\mathbf{H}_{t}\leftarrow\mathbf{0}

/* temporal feature extraction

while

i<=n

\bar{H}_{t}^{i}=RNN(y_{(0,t-1)}^{i},h_{0}^{i})

i\leftarrow i+1

end while

i\leftarrow n

\hat{\mathbf{H}}_{t}\leftarrow\mathbf{0}

/* top down convolution

while

i>0

\hat{H}_{t}^{i}=CNN(\bar{h}^{i}_{t},[\bar{h}^{parents_{i}}_{t}])

i\leftarrow i-1

end while

/* bottom up attention

l\leftarrow n_{l}-1,\tilde{h}^{\cup_{n_{l}}}_{t}=\hat{h}^{\cup_{n_{l}}}_{t},\mathbf{V}_{t}=0

\mathbf{V}^{n_{l}}_{t}=\hat{h}^{\cup_{n_{l}}}_{t},\cup_{i}

means union of nodes at level i

while

l>0

\mathbf{Q}_{t}^{\cup_{l}}=F(\bar{h}^{\cup_{l}}_{t};\phi_{q})

\mathbf{K}_{t}^{\cup_{l+1}}=F(\bar{h}^{\cup_{l+1}}_{t};\phi_{k})

Update

\tilde{h}_{t}^{\cup_{l}}

according to Equation (5)

Update

\mathbf{V}_{t}^{\cup_{l}}=\tilde{h}_{t}^{\cup_{l}}-\hat{h}^{\cup_{l}}_{t}+\bar{h}^{\cup_{l}}_{t}

l\leftarrow l-1,

end while

i\leftarrow 1,\mathbf{}{H}_{t}^{i}\leftarrow\mathbf{0},\mathbf{\hat{Y}}_{t}^{i}\leftarrow\mathbf{0}

/* base forecaster

while

i<=n

z_{t}^{i}=\sigma(mlp(\tilde{h}_{t}^{i}))

h_{t}^{i}=(1-z_{t}^{i})*\tilde{h}_{t}^{i}+z_{t}^{i}*\bar{h}_{t}^{i}

\hat{y}_{t}^{i}=MLP(h_{t}^{i})

i\leftarrow i+1

end while

/* Optimization Reconciliation

\mathbf{Y}_{t}=OPTNET(\mathbf{\hat{Y}_{t},\mathbf{S}})

return

\mathbf{Y}_{t}