
Kernel-U-Net: Multivariate Time Series Forecasting
using Custom Kernels

   Jiang YOU 1, 2, 3 Université Paris-Est Créteil,
Île-de-France, France
[email protected]
      Arben CELA 1, 2, 4 ESIEE Paris-UGE
Île-de-France, France
[email protected]
      René NATOWICZ 1,2 ESIEE Paris-UGE
Île-de-France, France
[email protected]
         Jacob OUANOUNOU 3 HN-Services
Île-de-France, France
[email protected]
   Patrick SIARRY 1 Université Paris-Est Créteil
Île-de-France, France
[email protected]
Abstract

The time series forecasting task predicts future trends based on historical information. Transformer-based U-Net architectures, despite their success in medical image segmentation, have limitations in both expressiveness and computational efficiency in time series forecasting, as evidenced by YFormer. To tackle these challenges, we introduce Kernel-U-Net, a flexible and kernel-customizable U-shape neural network architecture. The Kernel-U-Net encoder compresses the input series into latent vectors, and its symmetric decoder subsequently expands these vectors into output series. Specifically, Kernel-U-Net separates the procedure of partitioning the input time series into patches from the kernel manipulation, thereby providing the convenience of executing customized kernels. Our method offers two primary advantages: 1) flexibility in kernel customization to adapt to specific datasets; and 2) enhanced computational efficiency, with the complexity of the Transformer layer reduced to linear. Experiments on seven real-world datasets demonstrate that Kernel-U-Net's performance either exceeds or meets that of existing state-of-the-art models in the majority of cases in channel-independent settings. The source code for Kernel-U-Net will be made publicly available for further research and application.

979-8-3503-6813-0/24/$31.00 ©2024 IEEE
1 Laboratoire Images, Signaux et Systèmes Intelligents (LISSI), Université Paris-Est Créteil, Île-de-France, France
2 Département Informatique et Télécommunication, ESIEE Paris-Université Gustave Eiffel, Île-de-France, France
3 HN-Services, Île-de-France, France
4 Artificial Intelligence Laboratory, UMT, Tirana, Albania
Proceedings of the 18th International Conference On Innovations In Intelligent Systems And Applications, Craiova, Romania, 2024.

I Introduction

Time series forecasting predicts future trends based on recent historical information. It allows experts to track evolving situations and react in a timely manner in critical cases. Its applications span domains such as predicting road occupancy rates from sensors across a city [1], monitoring weekly influenza-like illness patient counts [2], monitoring electricity transformer temperature in long-term electric power deployment [3], or forecasting temperature, pressure and humidity at weather stations [4].

Figure 1: Illustration of the Kernel U-Net architecture. a) Architecture of Kernel U-Net; it allows executing a linear kernel as well as nonlinear kernels such as MLP, LSTM, and Transformer. b) Illustration of the K-U-Net encoder; the application of $\phi^{(l)}_{enc}$ on patches is independent of the choice of kernel. c) In the K-U-Net decoder, a custom kernel $\phi^{(l)}_{dec}$ expands the vectors into patches in the reverse order.

Over the past few decades, time series forecasting solutions have evolved from traditional statistical methods [5] and machine learning techniques [6] to deep learning-based solutions, such as recurrent neural networks (RNN) [7], Long Short-Term Memory (LSTM) [8], Temporal Convolutional Networks (TCN) [9] and Transformer-based models [10].

Among the Transformer models applied to time series data, Informer [3], Autoformer [2], and FEDformer [11] are the best-known variants that incrementally improved prediction quality. After a recent paper [12] challenged the efficiency of Transformer-based models with NLinear, a simple linear-layer model, the authors of [13] argued that the performance degradation comes from applying Transformer modules to point-wise sequences while ignoring patches. By adding a linear patch layer, their model PatchTST successfully relieved the overfitting problem of Transformer modules and reached state-of-the-art results.

We observe that models display distinct strengths depending on the dataset type. For instance, NLinear stands out for its efficiency in univariate time series tasks, particularly on small datasets. On the other hand, PatchTST is noteworthy for its expressiveness in multivariate time series tasks on large datasets. These contrasting attributes highlight the need for a unified but flexible architecture that integrates various modules easily and allows for customized solutions. Such integration should not only ensure a balance between computational efficiency and expressiveness but also support rapid development and testing.

The convolutional U-Net, a classic and highly expressive model in medical image segmentation [14], features a symmetric encoder-decoder structure that is elegant in its design. This structure is particularly suited to time series forecasting, as both inputs and outputs in this context are typically drawn from the same distribution. The first U-shape model adapted for time series forecasting was Yformer [15], which incorporated Transformers in both its encoder and decoder components. As mentioned previously, employing Transformers on point-wise data has the potential to cause overfitting issues. Therefore, our investigation aims to discover whether there is a U-shape architecture effective in time series forecasting that also possesses the capability to integrate various modules flexibly, thus facilitating the customization of solutions.

To tackle this problem, we propose a flexible and kernel-customizable architecture, Kernel-U-Net (K-U-Net), inspired by convolutional U-net, Swin Transformer, and Yformer for time series forecasting. K-U-Net generalizes the concept of convolutional kernel and provides convenience for composing particular models with non-linear kernels. Following the design pattern, K-U-Net can easily integrate custom kernels by replacing linear kernels with Transformer or LSTM kernels. As a result, K-U-Net can gain expressivity by capturing more complex patterns and dependencies in the data.

Furthermore, the hierarchical structure of K-U-Net exponentially reduces the input length at each level, thereby concurrently decreasing the complexity involved in learning such sequences. Notably, when Transformer modules are utilized in the second or higher-level layers, the computation cost remains linear, ensuring efficiency in processing.

To fully study the performance and efficiency of K-U-Net, we conduct experiments for time series forecasting tasks on several widely used benchmark datasets. We compose 30 variants of K-U-Net by placing different kernels at different layers and then we choose the best candidates for fine-tuning. Our results show that in time series forecasting, K-U-Net exceeds or meets the state-of-the-art results, such as NLinear[12] and PatchTST, in the majority of cases.

In summary, the contributions of this work include:

  • We propose Kernel-U-Net, a U-shape architecture that progressively compresses the input sequence into a latent vector and expands it to generate the output sequence.

  • Kernel-U-Net generalizes the concept of the convolutional kernel and provides convenience for composing particular models with custom kernels.

  • The computational complexity is guaranteed to be linear when employing Transformer kernels at the second or higher layers.

  • Kernel-U-Net exceeds or meets the state-of-the-art results in most cases.

We conclude that Kernel-U-Net stands out as a highly promising option for large-scale time series forecasting tasks. Its hierarchical design provides a balance of low computational complexity and high expressiveness. In most scenarios, it either surpasses or is slightly below the state-of-the-art results. Furthermore, its adaptability in fast-paced development and testing environments is ensured by the use of flexible, customizable kernels.

II Related works

II-1 Transformer

The Transformer [16] was initially introduced in the field of Natural Language Processing (NLP) for language translation tasks. It contains a positional encoding layer, blocks composed of multi-head attention layers, and a linear layer with SoftMax activation. As it demonstrated outstanding performance on NLP tasks, many researchers followed this line of research.

Vision Transformers (ViTs) [17] applied a pure Transformer directly to sequences of image patches to classify the full image and outperformed CNN-based methods on ImageNet [18]. Swin Transformer [19] proposed a hierarchical Transformer whose representation is computed with shifted windows. The shifted-window scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while still allowing cross-window connections. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size.

In time series forecasting, Transformer-based models have also attracted researchers. LogTrans [10] proposed convolutional self-attention by employing causal convolutions to produce queries and keys in the self-attention layer. To improve computational efficiency, the authors also proposed a LogSparse Transformer with only $O(L(\log L)^{2})$ space complexity to break the memory bottleneck. Informer [3] has an encoder-decoder architecture that validates the potential of Transformer-like models to capture individual long-range dependencies between long sequences. The authors propose a ProbSparse self-attention mechanism to efficiently replace canonical self-attention. It achieves $O(L\log L)$ time complexity and $O(L\log L)$ memory usage on dependency alignments. Pyraformer [20] simultaneously captures temporal dependencies of different ranges in a compact multi-resolution fashion. Theoretically, by choosing parameters appropriately, it achieves concurrently a maximum path length of $O(1)$ and time and space complexity of $O(L)$ in the forward pass. PatchTST employs ProbSparse and a linear projection patch layer to reduce the computational complexity of the Transformer to $O(\frac{L}{S}\log(\frac{L}{S}))$, where $S$ is the patch size. However, its flattened head still incurs a computational cost of $O(\frac{L^{2}}{S})$.

Meanwhile, another family of Transformer-based models combines the Transformer with traditional methods in time series processing. Autoformer [2] introduces an Auto-Correlation mechanism in place of self-attention, which discovers sub-series similarity based on the series periodicity and aggregates similar sub-series from underlying periods. The Frequency Enhanced Decomposed Transformer (FEDformer) [11] captures global properties of time series with seasonal-trend decomposition. The authors proposed Fourier-enhanced and Wavelet-enhanced blocks in the Transformer structure to capture important time series patterns through frequency-domain mapping.

II-2 U-Net

U-Net is a neural network architecture designed primarily for medical image segmentation [14]. U-Net is composed of an encoder and a decoder. In the encoder, a long sequence is gradually reduced by convolutional and max-pooling layers into a latent vector. In the decoder, the latent vector is expanded by transposed convolutional layers to generate an output with the same shape as the input. With the help of skip connections between the encoder and decoder, U-Net can also capture and merge low-level and high-level information easily.

With such a neat structural design, U-Net has achieved great success in a variety of applications such as medical image segmentation [14], biomedical 3D image segmentation [21], time series segmentation [22], image super-resolution [23] and image denoising [24]. Its techniques have evolved from the basic 2D U-Net to 3D U-Net [21], 1D U-Net [25] and Swin-U-Net [26], which uses only Transformer blocks at each layer of the U-Net.

In time series processing, U-Time [22] is a U-Net composed of convolutional layers for time series segmentation tasks, and Yformer [15] is the first U-Net-based model for time series forecasting. In particular, Yformer applied Transformer blocks at each layer of the U-Net and effectively capitalized on multi-resolution feature maps.

II-3 Hierarchical and hybrid model

In time series processing, the increasing size of data degrades the performance of deep models and, crucially, increases the cost of learning them. For example, recurrent models such as RNN, LSTM, and GRU have linear complexity but suffer from the vanishing gradient problem as the input length increases. Transformer-based blocks capture long dependencies better but require $O(L^{2})$ computations in general. To balance the expressiveness of complex models and computational efficiency, researchers have investigated hybrid models that merge different modules into the network, as well as hierarchical architectures.

For example, the authors in [27] investigated a tree-structured model made of bidirectional RNN layers and concatenation layers for skeleton-based action recognition; the authors in [28] stacked RNN and LSTM layers with an attention mechanism for semantic relation classification of text; and the authors in [29] applied hierarchical LSTM and GRU for document classification. In time series processing, the authors in [30] combined a Deep Belief Network (DBN) and LSTM for sleep signal pattern classification.

To meet the demand of balancing the quality of prediction and efficiency in learning Transformer-based models, researchers proposed hierarchical structure in Swin-Transformer [19], pyramidal structure in Pyraformer [20], U-shape structure in Yformer [15] or patch layer in PatchTST [13].

Kernel-U-Net is a U-shape architecture that exponentially reduces the input length at each level, thereby concurrently decreasing the complexity involved in learning long sequences. Kernel-U-Net separates the procedure of partitioning input time series into patches from kernel manipulation, thereby providing the convenience of executing customized kernels. By replacing linear kernels with transformer or LSTM kernels, the model gains enhanced expressiveness, allowing it to capture more complex patterns and dependencies in the data. Notably, when Transformer modules are utilized in the second or higher-level layers, the computation cost remains linear, ensuring efficiency in processing.

III Method

III-A Problem Formulation

Let us note by $x\in\mathbb{R}^{N\times M}$ the matrix which represents the multivariate time series dataset, where the first dimension $N$ represents the sampling time and the second dimension $M$ is the feature size. Let $L$ be the length of memory or the look-back window; we denote the historical time series $(x_{t+1,1},...,x_{t+L,M})$ (or, for short, $(x_{t+1},...,x_{t+L})$). We also denote the future time series $(x_{t+L+1},...,x_{t+L+T})$, where $T$ is the length of the future horizon and $t\in[0,N-L-T]$ is the time stamp.

The time series forecasting task takes a multivariate time series as input and predicts a future series. Let $x_t$ be the features at time step $t$, $L$ the length of the look-back window, and $T$ the future horizon. Given a historical data series $(x_{t+1},...,x_{t+L})$, the time series forecasting task predicts the values $(\hat{x}_{t+L+1},...,\hat{x}_{t+L+T})$ in the future. We can then define the basic time series forecasting problem:

$(\hat{x}_{t+L+1},...,\hat{x}_{t+L+T})=f(x_{t+1},...,x_{t+L})$ (1)

where $f$ is the function that predicts the future series based on a historical series.
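As an illustration, the following sketch builds the $(L, T)$ training pairs of Eq. (1) from a raw series with a sliding window; the function name and example shapes are hypothetical and not part of our implementation.

import numpy as np

def make_windows(series: np.ndarray, L: int, T: int):
    # Slice a series of shape (N, M) into (look-back, horizon) pairs,
    # one pair per time stamp t in [0, N - L - T], matching Eq. (1).
    N = series.shape[0]
    xs, ys = [], []
    for t in range(N - L - T + 1):
        xs.append(series[t:t + L])          # (x_{t+1}, ..., x_{t+L})
        ys.append(series[t + L:t + L + T])  # (x_{t+L+1}, ..., x_{t+L+T})
    return np.stack(xs), np.stack(ys)

# Example with 1000 steps and 7 features, L = 336, T = 96
x, y = make_windows(np.random.randn(1000, 7), L=336, T=96)
print(x.shape, y.shape)  # (569, 336, 7) (569, 96, 7)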

Algorithm A Kernel U-Net Encoder

Input: $L_1$, $M_1$, $\{L_2,...,L_n, M_2,...,M_m\}$, $\{D_2,...,D_n, D_{n+2},...,D_{n+m-1}\}$, $\{\phi^{(1)},...,\phi^{(n)}, \phi^{(n+2)},...,\phi^{(n+m)}\}$, $J_h$, $D_h$
Output: Instance of Kernel U-Net Encoder

  # Define the init function:
  def __init__(inputs):
    $J\in\{L_1, L_2,...,L_n, M_2,...,M_m\}$
    # Create first layer
    layers = [KW($\phi^{(1)}$, $L_1$, $M_1$, $1$, $D_2$)]
    # Create intermediate layers
    for $l$ in $\{2,...,n, n+2,...,n+m-1\}$ do
      layers.append(KW($\phi^{(l)}$, $J_l$, $D_l$, $1$, $D_{l\_next}$))
    end for
    # Create last layer
    layers.append(KW($\phi^{(n+m)}$, $J_{n+m}$, $D_{n+m}$, $J_h$, $D_h$))
    encoder = nn.Sequential(*layers)

III-B Kernel U-Net

Kernel U-Net (K-U-Net) is a neural network featuring hierarchical and symmetric U-shape architecture. It separates the procedure of partitioning input time series into patches from kernel manipulation, thereby providing the convenience of executing customized kernels (Figure 1). More precisely, the K-U-Net encoder reshapes the input sequence into a large batch of small patches and repeatedly applies custom kernels on them until the latent vector is obtained. Later, the K-U-Net decoder expands the latent vector into patches gradually at each layer and obtains a large batch of small patches (Figure 1). At last, K-U-Net reshapes the patches to get the final output. In the following paragraphs, we describe methods such as the hierarchical partition of the input sequence and the creation of Kernel-U-Net.

III-B1 Hierarchical partition of the input and output

In the first place, we split the input trajectory matrix into patches. Let us consider a trajectory matrix $X_t\in\mathbb{R}^{L\times M}$ at time step $t$. Given a list of multiples $\{L_2,...,L_n\}$ for the look-back window and $\{M_2,...,M_m\}$ for the features, the patch length $L_1$ and feature unit $M_1$ are such that $L=\prod_{1}^{n}L_k$ and $M=\prod_{1}^{m}M_k$. We reshape $X_t$ as a set of small patches $P_t=\{X_{t,i,j}\mid X_{t,i,j}\in\mathbb{R}^{L_1\times M_1}\}$, where $i\in\{1,...,\prod_{2}^{n}L_k\}$ and $j\in\{1,...,\prod_{2}^{m}M_k\}$. The total number of patches is the product of the multiples of length and feature size, $\#P_t=\prod_{2}^{m}M_k\cdot\prod_{2}^{n}L_k$.

The partitioned patches will be processed by kernels in the encoder gradually, and their size will be reduced after the kernel operation at each layer. In the decoder, the patches are generated from vectors of length $1$ at each layer. Since the decoder is symmetrical to the encoder, there will also be a trajectory matrix $\hat{X}_{t+L}\in\mathbb{R}^{T\times M}$ composed of a set of generated patches $\hat{P}_t=\{\hat{X}_{t+L,i,j}\}$ as output. For a simpler description, we let the look-back window $L$ and the forecasting horizon $T$ be equal.
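A minimal sketch of this partition, assuming a look-back window $L=336=4\cdot4\cdot3\cdot7$ and a single feature ($M=M_1=1$, so the feature transpose is trivial); the variable names are illustrative only.

import torch

B, L, M = 32, 336, 1                   # batch, look-back window, feature size
L1, length_multiples = 4, [4, 3, 7]    # patch length L1 and {L2, L3, L4}
X = torch.randn(B, L, M)               # a batch of trajectory matrices X_t

# Reshape the trajectory matrices into a large batch of small patches
# X_{t,i,j} of shape (L1, M1); with M = 1 no feature partition is needed.
num_patches = L // L1                  # product of the length multiples = 84
patches = X.reshape(B * num_patches, L1, M)
print(patches.shape)                   # torch.Size([2688, 4, 1])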

Algorithm B Kernel Wrapper (KW)

Input: $\phi$, $X$, $J_{in}$, $D_{in}$, $J_{out}$, $D_{out}$

Output: Instance of Kernel Wrapper

  # Define the init function:
  def __init__(inputs):
    # Instantiation of the main kernel function $\phi$
    $\phi = \phi(J_{in}, D_{in}, J_{out}, D_{out})$
    # Holds variables for global operation
    skip_out = None
    skip_in = None
  # Define the forward function:
  def forward(x):
    reshape $x$ into (-1, $J_{in}$, $D_{in}$)
    $X = X\,+$ skip_in if skip_in is not None else $X$
    $Z = \phi(X)$  # assert $Z$.shape is (-1, $J_{out}$, $D_{out}$)
    skip_out = $Z$  # will be assigned to skip_in in the decoder
    return $Z$

III-B2 Hierarchical processing with kernels

The hierarchical processing of Kernel U-Net consists of compressing an input sequence at the encoding stage and generating an output sequence at the decoding stage. By default, kernels reduce the dimension of the input at each layer in the encoder and increase the dimension in the decoder. Let us consider $\mathcal{X}\in\mathbb{R}^{B\times L\times M}$, a batch of trajectory matrices $X_t$, where $B$ is the batch size. We now describe the shape of the intermediate patches before and after the kernel operation.

At the encoder stage, we compress the input $\mathcal{X}\in\mathbb{R}^{B\times L\times M}$ into a latent vector $\mathcal{Z}$. To simplify the problem, let us assume that $D_h$ is the unique dimension of the hidden vectors at each intermediate layer and of the latent vector. Firstly, we reshape $\mathcal{X}$ to $(B, 1, \prod_{2}^{n}L_k, L_1, \prod_{2}^{m}M_k, M_1)$, then transpose it to $(B, \prod_{2}^{m}M_k, \prod_{2}^{n}L_k, L_1, 1, M_1)$ and reshape it to $(B\cdot\prod_{2}^{m}M_k\cdot\prod_{2}^{n}L_k, L_1, M_1)$. We denote this tensor $\mathcal{P}^{(1)}_{in}$, a large batch of small patches ready for processing with a kernel. Secondly, the kernel at the first layer processes $\mathcal{P}^{(1)}_{in}$ and outputs a hidden vector $\mathcal{P}^{(1)}_{out}$ of shape $(B\cdot\prod_{2}^{m}M_k\cdot\prod_{2}^{n}L_k, 1, D_h)$. After this operation, we reshape the output $\mathcal{P}^{(1)}_{out}$ to $\mathcal{P}^{(2)}_{in}$ of shape $(B\cdot\prod_{2}^{m}M_k\cdot\prod_{3}^{n}L_k, L_2, D_h)$ as input for the next layer. Iteratively, the encoder processes all the multiples $\{L_2,...,L_n,M_2,...,M_m\}$ in order and finally gives a batch of latent vectors $\mathcal{Z}=\mathcal{P}^{(n+m)}_{out}$ of shape $(B, 1, D_h)$ (Algorithm A).
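The shape flow of the encoder can be checked with the short sketch below, where each kernel is stubbed by a linear map from a patch of shape $(J_{in}, D_{in})$ to a hidden vector of shape $(1, D_h)$; the sizes follow the hypothetical setting $L=336=4\cdot4\cdot3\cdot7$, $M=1$, $D_h=128$, and the code verifies only shapes, not a trained model.

import torch
import torch.nn as nn

B, L1, length_multiples, Dh = 32, 4, [4, 3, 7], 128

def stub_kernel(J_in, D_in):
    # Placeholder kernel: flatten the patch and map it to one hidden vector.
    return nn.Sequential(nn.Flatten(1),
                         nn.Linear(J_in * D_in, Dh),
                         nn.Unflatten(1, (1, Dh)))

p = torch.randn(B * 4 * 3 * 7, L1, 1)   # P_in^(1): large batch of small patches
p = stub_kernel(L1, 1)(p)               # P_out^(1): (B*84, 1, Dh)
for Lk in length_multiples:             # iterate over the length multiples
    p = p.reshape(-1, Lk, Dh)           # regroup Lk neighbouring outputs
    p = stub_kernel(Lk, Dh)(p)          # compress them into one hidden vector
print(p.shape)                          # latent vector Z: torch.Size([32, 1, 128])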

TABLE I: Multivariate time series forecasting results with Kernel U-Net. The prediction lengths $T$ are in {96, 192, 336, 720} for all datasets. We note the best results in bold and underline the second-best results.
Dataset | T | K-U-Net | PatchTST | NLinear | DLinear | FEDformer | Autoformer | Informer | Yformer   (each cell: MSE / MAE)

ETTh1 | 96 | 0.355 / 0.388 | 0.37 / 0.4 | 0.374 / 0.394 | 0.375 / 0.399 | 0.376 / 0.419 | 0.449 / 0.459 | 0.865 / 0.713 | 0.985 / 0.74
ETTh1 | 192 | 0.388 / 0.412 | 0.413 / 0.429 | 0.408 / 0.415 | 0.405 / 0.416 | 0.42 / 0.448 | 0.5 / 0.482 | 1.008 / 0.792 | 1.17 / 0.855
ETTh1 | 336 | 0.407 / 0.427 | 0.422 / 0.44 | 0.429 / 0.427 | 0.439 / 0.443 | 0.459 / 0.465 | 0.521 / 0.496 | 1.107 / 0.809 | 1.208 / 0.886
ETTh1 | 720 | 0.430 / 0.454 | 0.447 / 0.468 | 0.44 / 0.453 | 0.472 / 0.49 | 0.506 / 0.507 | 0.514 / 0.512 | 1.181 / 0.865 | 1.34 / 0.899

ETTh2 | 96 | 0.269 / 0.335 | 0.274 / 0.337 | 0.277 / 0.338 | 0.289 / 0.353 | 0.346 / 0.388 | 0.358 / 0.397 | 3.755 / 1.525 | 1.335 / 0.936
ETTh2 | 192 | 0.332 / 0.377 | 0.339 / 0.379 | 0.344 / 0.381 | 0.383 / 0.418 | 0.429 / 0.439 | 0.456 / 0.452 | 5.602 / 1.931 | 1.593 / 1.021
ETTh2 | 336 | 0.355 / 0.400 | 0.329 / 0.384 | 0.357 / 0.400 | 0.448 / 0.465 | 0.496 / 0.487 | 0.482 / 0.486 | 4.721 / 1.835 | 1.444 / 0.96
ETTh2 | 720 | 0.384 / 0.435 | 0.379 / 0.422 | 0.394 / 0.436 | 0.605 / 0.551 | 0.463 / 0.474 | 0.515 / 0.511 | 3.647 / 1.625 | 3.498 / 1.631

ETTm1 | 96 | 0.275 / 0.331 | 0.29 / 0.342 | 0.306 / 0.348 | 0.299 / 0.343 | 0.379 / 0.419 | 0.505 / 0.475 | 0.672 / 0.571 | 0.849 / 0.669
ETTm1 | 192 | 0.320 / 0.361 | 0.332 / 0.369 | 0.349 / 0.375 | 0.335 / 0.365 | 0.426 / 0.441 | 0.553 / 0.496 | 0.795 / 0.669 | 0.928 / 0.724
ETTm1 | 336 | 0.349 / 0.380 | 0.366 / 0.392 | 0.375 / 0.388 | 0.369 / 0.386 | 0.445 / 0.459 | 0.621 / 0.537 | 1.212 / 0.871 | 1.058 / 0.786
ETTm1 | 720 | 0.401 / 0.412 | 0.416 / 0.42 | 0.433 / 0.422 | 0.425 / 0.421 | 0.543 / 0.49 | 0.671 / 0.561 | 1.166 / 0.823 | 0.955 / 0.703

ETTm2 | 96 | 0.157 / 0.243 | 0.165 / 0.255 | 0.167 / 0.255 | 0.167 / 0.26 | 0.203 / 0.287 | 0.255 / 0.339 | 0.365 / 0.453 | 0.487 / 0.529
ETTm2 | 192 | 0.213 / 0.283 | 0.22 / 0.292 | 0.221 / 0.293 | 0.224 / 0.303 | 0.269 / 0.328 | 0.281 / 0.34 | 0.533 / 0.563 | 0.789 / 0.705
ETTm2 | 336 | 0.266 / 0.320 | 0.274 / 0.329 | 0.274 / 0.327 | 0.281 / 0.342 | 0.325 / 0.366 | 0.339 / 0.372 | 1.363 / 0.887 | 1.256 / 0.904
ETTm2 | 720 | 0.343 / 0.377 | 0.362 / 0.385 | 0.368 / 0.384 | 0.397 / 0.421 | 0.421 / 0.415 | 0.433 / 0.432 | 3.379 / 1.338 | 2.698 / 1.297

Electricity | 96 | 0.128 / 0.219 | 0.129 / 0.222 | 0.141 / 0.237 | 0.14 / 0.237 | 0.193 / 0.308 | 0.201 / 0.317 | 0.274 / 0.368 | - / -
Electricity | 192 | 0.145 / 0.234 | 0.147 / 0.24 | 0.154 / 0.248 | 0.153 / 0.249 | 0.201 / 0.315 | 0.222 / 0.334 | 0.296 / 0.386 | - / -
Electricity | 336 | 0.160 / 0.250 | 0.163 / 0.259 | 0.171 / 0.265 | 0.169 / 0.267 | 0.214 / 0.329 | 0.231 / 0.338 | 0.3 / 0.394 | - / -
Electricity | 720 | 0.196 / 0.283 | 0.197 / 0.29 | 0.21 / 0.297 | 0.203 / 0.301 | 0.246 / 0.355 | 0.254 / 0.361 | 0.373 / 0.439 | - / -

Traffic | 96 | 0.354 / 0.229 | 0.36 / 0.249 | 0.41 / 0.279 | 0.41 / 0.282 | 0.587 / 0.366 | 0.613 / 0.388 | 0.719 / 0.391 | - / -
Traffic | 192 | 0.372 / 0.262 | 0.379 / 0.25 | 0.423 / 0.284 | 0.423 / 0.287 | 0.604 / 0.373 | 0.616 / 0.382 | 0.696 / 0.379 | - / -
Traffic | 336 | 0.388 / 0.270 | 0.392 / 0.264 | 0.435 / 0.29 | 0.436 / 0.296 | 0.621 / 0.383 | 0.622 / 0.337 | 0.777 / 0.42 | - / -
Traffic | 720 | 0.430 / 0.269 | 0.432 / 0.286 | 0.464 / 0.307 | 0.466 / 0.315 | 0.626 / 0.382 | 0.66 / 0.408 | 0.864 / 0.472 | - / -

Weather | 96 | 0.142 / 0.183 | 0.149 / 0.198 | 0.182 / 0.232 | 0.176 / 0.237 | 0.217 / 0.296 | 0.266 / 0.336 | 0.3 / 0.384 | - / -
Weather | 192 | 0.187 / 0.226 | 0.194 / 0.241 | 0.225 / 0.269 | 0.22 / 0.282 | 0.276 / 0.336 | 0.307 / 0.367 | 0.598 / 0.544 | - / -
Weather | 336 | 0.238 / 0.269 | 0.245 / 0.282 | 0.271 / 0.301 | 0.265 / 0.319 | 0.339 / 0.38 | 0.359 / 0.395 | 0.578 / 0.523 | - / -
Weather | 720 | 0.308 / 0.323 | 0.314 / 0.334 | 0.338 / 0.348 | 0.323 / 0.362 | 0.403 / 0.428 | 0.419 / 0.428 | 1.059 / 0.741 | - / -

At the decoder stage, the operations are reversed. We start with $\mathcal{Q}^{(n+m)}_{in}=\mathcal{Z}$ of shape $(B, 1, D_h)$, a set of input patches to the decoder. We send it to the decoder kernel and get an output $\mathcal{Q}^{(n+m)}_{out}$ of shape $(B, M_m, D_h)$, then we reshape it to $\mathcal{Q}^{(n+m-1)}_{in}$ of shape $(B\cdot M_m, 1, D_h)$ for the next kernel. At the end of this iteration over the multiples $\{M_{m-1},...,M_2, L_n,...,L_2, L_1\}$, we finally have a set of patches $\mathcal{Q}^{(1)}_{out}$ of shape $(B\cdot\prod_{2}^{m}M_k\cdot\prod_{2}^{n}L_k, L_1, M_1)$. The last operations are reshaping it into $(B, \prod_{2}^{m}M_k, \prod_{2}^{n}L_k, L_1, 1, M_1)$, transposing it into $(B, 1, \prod_{2}^{n}L_k, L_1, \prod_{2}^{m}M_k, M_1)$ and reshaping it into $(B, L, M)$ as the final output.

III-B3 Kernel Wrapper

The Kernel Wrapper (KW) requires parameters such as the kernel $\phi$, input patches $X$, output patches $Z$, patch-set size $\mathcal{B}$, input patch length $J_{in}$ and dimension $D_{in}$, and output patch length $J_{out}$ and dimension $D_{out}$. The kernel wrapper instantiates the given kernel and calls it in the forward function. It reshapes the input patches, executes the kernel, and then checks the output shape. In the encoder, the wrapper processes $X$ and outputs $Z$; in the decoder, it processes the sum of $X$ and an encoder output received via the skip connection, then gives an output (Algorithm B).
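A hedged PyTorch sketch of Algorithm B is given below; the class and attribute names are ours, and the kernel constructor signature $\phi(J_{in}, D_{in}, J_{out}, D_{out})$ follows the algorithm.

import torch.nn as nn

class KernelWrapper(nn.Module):
    # Sketch of Algorithm B: reshape the patches, apply the kernel, and
    # keep the output so the decoder side can reuse it as a skip input.
    def __init__(self, kernel_cls, J_in, D_in, J_out, D_out):
        super().__init__()
        self.phi = kernel_cls(J_in, D_in, J_out, D_out)  # main kernel phi
        self.J_in, self.D_in = J_in, D_in
        self.skip_out = None   # filled on the encoder side
        self.skip_in = None    # assigned on the decoder side

    def forward(self, x):
        x = x.reshape(-1, self.J_in, self.D_in)
        if self.skip_in is not None:      # decoder: add the encoder output
            x = x + self.skip_in
        z = self.phi(x)                   # expected shape (-1, J_out, D_out)
        self.skip_out = z                 # encoder: expose for the skip connection
        return z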

III-B4 Formulation of Kernel Operation

We add $enc$ and $dec$ to the indices of variables to differentiate their use in the encoder and decoder. We denote $\mathcal{P}=X_{enc}$ and $\mathcal{Q}=X_{dec}$ the sets of patches, $l\in\{1,...,n,n+2,...,n+m\}$ the index of layers, and $\mathcal{B}$ the patch-set size. By default, an encoder kernel at layer $l$ receives $\mathcal{P}^{(l)}_{in}\in\mathbb{R}^{\mathcal{B}^{(l)}\times J^{(l)}_{enc\_in}\times D^{(l)}_{enc\_in}}$ and gives $\mathcal{P}^{(l)}_{out}\in\mathbb{R}^{\mathcal{B}^{(l)}\times 1\times D^{(l)}_{enc\_out}}$. The decoder kernel at layer $l$ receives $\mathcal{Q}^{(l)}_{in}\in\mathbb{R}^{\mathcal{B}^{(l)}\times 1\times D^{(l)}_{dec\_in}}$ and outputs $\mathcal{Q}^{(l)}_{out}\in\mathbb{R}^{\mathcal{B}^{(l)}\times J^{(l)}_{dec\_out}\times D^{(l)}_{dec\_out}}$.

We recall that $i^{(l)}, j^{(l)}$ are the indices for the multiples of length and feature at layer $l$. Given a set of inputs $\mathcal{P}^{(l)}_{in}$ at layer $l$, the output $\mathcal{P}^{(l)}_{out}$ of the kernel $\phi^{(l)}_{enc}$ in the encoder is written:

$\mathcal{P}^{(l)}_{out,i^{(l)},j^{(l)}}=\phi^{(l)}_{enc}(\mathcal{P}^{(l)}_{in,i^{(l)},j^{(l)}})$ (2)

The decoder kernel $\phi^{(l)}_{dec}$ at layer $l$ takes as input the sum of $\mathcal{Q}^{(l)}_{in}$ and the encoder output $\mathcal{P}^{(l)}_{out}$ received via the skip connection, and produces $\mathcal{Q}^{(l)}_{out}$ as output:

$\mathcal{Q}^{(l)}_{out,i^{(l)},j^{(l)}}=\phi^{(l)}_{dec}(\mathcal{Q}^{(l)}_{in,i^{(l)},j^{(l)}}+\mathcal{P}^{(l)}_{out,i^{(l)},j^{(l)}})$ (3)

Remark that in the case $l=n+m$, we have $\mathcal{Q}^{(l)}_{in,i^{(l)},j^{(l)}}+\mathcal{P}^{(l)}_{out,i^{(l)},j^{(l)}}=\mathcal{P}^{(l)}_{out,i^{(l)},j^{(l)}}$ because there are no higher layers and the kernel only processes the encoder output.

III-B5 Creation of Kernel U-Net

We create the encoder, the decoder, and the U-Net in order. The algorithm passes parameters describing the multiples, kernels, and hidden dimensions. Let us note the input length $L$, the feature dimension $M$, the concatenated lists of multiples of the look-back window and the features $\{L_2,...,L_n, M_2,...,M_m\}$, a list of hidden dimensions of the same size $\{D_h,...,D_h\}$, the patch size $L_1$ and feature unit $M_1$, a list of kernels $\{\phi^{(1)}_{enc}, \phi^{(2)}_{enc},...,\phi^{(n)}_{enc}, \phi^{(n+2)}_{enc},...,\phi^{(n+m)}_{enc}\}$, and the latent vector length $J_h$ and dimension $D_h$. We also use $next$ and $prev$ to iterate over the index $l$. For a simpler description, we set the hidden dimension of the intermediate output vectors within layers equal to that of the latent vector. It corresponds to the channel size in a convolutional network and can be increased for a larger passage of information if necessary. We describe the creation of the K-U-Net encoder in Algorithm A.

The decoder is symmetrical to the encoder and applies kernels in reverse order. More precisely, the decoder takes $J_h$ and $D_h$, the multiples $J\in\{L_1, L_2,...,L_n, M_2,...,M_m\}$, the kernels $\{\phi^{(n+m)}_{dec},...,\phi^{(n+2)}_{dec}, \phi^{(n)}_{dec},...,\phi^{(2)}_{dec}, \phi^{(1)}_{dec}\}$, and initiates the kernel wrappers KW($\phi^{(n+m)}_{dec}$, $J_h$, $D_h$, $J_{n+m}$, $D_{n+m}$), KW($\phi^{(l)}_{dec}$, $1$, $D_{l\_prev}$, $J_l$, $D_l$), and KW($\phi^{(1)}_{dec}$, $1$, $D_2$, $L_1$, $M_1$) at the highest, intermediate, and lowest layers, respectively.

The K-U-Net initiates an encoder and a decoder. In the forward function, the encoder processes the input and generates a list of outputs at each layer and a latent vector. It assigns the skip_out from encoder kernels to skip_in in decoder kernels. Then the decoder takes the outputs from the encoder via skip-connection and the latent vector to generate the final result.
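The assembly can be sketched as follows, reusing the KernelWrapper above; this is a simplified reading of the forward pass (hypothetical class, assuming a latent length $J_h=1$), in which the top decoder layer receives only the latent vector through its skip connection, as stated in the remark on layer $l=n+m$.

import torch
import torch.nn as nn

class KernelUNet(nn.Module):
    def __init__(self, encoder_wrappers, decoder_wrappers):
        super().__init__()
        self.encoder = nn.ModuleList(encoder_wrappers)  # lowest -> highest layer
        self.decoder = nn.ModuleList(decoder_wrappers)  # highest -> lowest layer

    def forward(self, x):
        for enc in self.encoder:          # compress the patches into the latent Z
            x = enc(x)
        # Wire skip connections: each decoder layer adds the matching
        # encoder output; layers are paired in reverse order.
        for enc, dec in zip(reversed(self.encoder), self.decoder):
            dec.skip_in = enc.skip_out
        q = torch.zeros_like(x)           # the top layer then sees only Z via its skip
        for dec in self.decoder:
            q = dec(q)
        return q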

Figure 2: Structure of Linear Kernel. a) Linear Kernel, b) Multi-layer perceptron (MLP) Kernel with Tanh activation, c) LSTM Kernel, d) Transformer Kernel

III-C Custom Kernels

III-C1 Linear kernel

The linear kernel is a simple matrix multiplication. Given $X\in\mathbb{R}^{\mathcal{B}\times J_{in}\times D_{in}}$, the linear kernel $\phi$ reshapes it to $(\mathcal{B}, J_{in}\cdot D_{in})$ and processes it as follows:

$Z=\phi(X)=XW+b$ (4)

where $Z$ is the output of the kernel, $W\in\mathbb{R}^{J_{in}\cdot D_{in}\times J_{out}\cdot D_{out}}$ is the weight matrix and $b\in\mathbb{R}^{J_{out}\cdot D_{out}}$ is the bias vector. Remark that the number of parameters of $W$ is $J_{in}\cdot D_{in}\cdot J_{out}\cdot D_{out}$ and that this kernel operation is equivalent to a 1D convolutional layer whose kernel size is $J_{in}$ or $J_{out}$.
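A sketch of this kernel in PyTorch (the class name is ours):

import torch.nn as nn

class LinearKernel(nn.Module):
    # Eq. (4): flatten the patch and apply a single affine map.
    def __init__(self, J_in, D_in, J_out, D_out):
        super().__init__()
        self.J_out, self.D_out = J_out, D_out
        self.linear = nn.Linear(J_in * D_in, J_out * D_out)  # W and b

    def forward(self, x):                  # x: (batch, J_in, D_in)
        z = self.linear(x.reshape(x.shape[0], -1))
        return z.reshape(-1, self.J_out, self.D_out)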

III-C2 Multi-Layer Perceptron kernel

The multi-layer perceptron (MLP) kernel has an additional hidden layer and a non-linear activation function Tanh. The formulation is:

$Z=\phi(X)=\mathrm{Tanh}(XW_1+b_1)W_2+b_2$ (5)

where $Z$ is the output of the kernel, $W_1\in\mathbb{R}^{J_{in}\cdot D_{in}\times J^{\prime}\cdot D^{\prime}}$ and $W_2\in\mathbb{R}^{J^{\prime}\cdot D^{\prime}\times J_{out}\cdot D_{out}}$ are the weight matrices, $b_1\in\mathbb{R}^{J^{\prime}\cdot D^{\prime}}$ and $b_2\in\mathbb{R}^{J_{out}\cdot D_{out}}$ are the bias vectors, and $J^{\prime}=\frac{1}{2}(J_{in}+J_{out})$, $D^{\prime}=\frac{1}{2}(D_{in}+D_{out})$.
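The corresponding sketch (class name ours; integer division is used for the hidden size $J^{\prime}\cdot D^{\prime}$):

import torch.nn as nn

class MLPKernel(nn.Module):
    # Eq. (5): one hidden layer of size J' * D' with a Tanh activation.
    def __init__(self, J_in, D_in, J_out, D_out):
        super().__init__()
        hidden = ((J_in + J_out) // 2) * ((D_in + D_out) // 2)  # J' * D'
        self.J_out, self.D_out = J_out, D_out
        self.net = nn.Sequential(
            nn.Linear(J_in * D_in, hidden),
            nn.Tanh(),
            nn.Linear(hidden, J_out * D_out),
        )

    def forward(self, x):                  # x: (batch, J_in, D_in)
        z = self.net(x.reshape(x.shape[0], -1))
        return z.reshape(-1, self.J_out, self.D_out)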

III-C3 Transformer kernel

The vanilla Transformer [16] is made of a positional encoding layer, blocks composed of multi-head attention layers, and a linear layer with ReLU activation. The Transformer kernel in this work follows this classical structure (Figure 2).
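A simplified sketch of such a kernel is shown below; it omits the positional encoding of Figure 2 and uses PyTorch's built-in encoder layers, so it should be read as an approximation of our kernel rather than its exact implementation (it also assumes $D_{out}$ is divisible by the number of heads).

import torch.nn as nn

class TransformerKernel(nn.Module):
    # A small Transformer encoder over the J_in tokens of a patch,
    # followed by a linear head producing the (J_out, D_out) output.
    def __init__(self, J_in, D_in, J_out, D_out, n_heads=4, n_layers=2):
        super().__init__()
        self.J_out, self.D_out = J_out, D_out
        self.embed = nn.Linear(D_in, D_out)
        layer = nn.TransformerEncoderLayer(d_model=D_out, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(J_in * D_out, J_out * D_out)

    def forward(self, x):                  # x: (batch, J_in, D_in)
        h = self.encoder(self.embed(x))    # (batch, J_in, D_out)
        z = self.head(h.reshape(h.shape[0], -1))
        return z.reshape(-1, self.J_out, self.D_out)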

III-C4 LSTM kernel

The LSTM kernel contains a classic LSTM cell [31] and a linear layer for the skip connection. The hidden states of all time steps are then combined with a linear layer (Figure 2).
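A possible sketch, assuming the combination and skip paths are both linear maps on the flattened sequence (class name ours):

import torch.nn as nn

class LSTMKernel(nn.Module):
    # An LSTM over the J_in steps of a patch; the hidden states of all
    # steps are combined by a linear layer, plus a linear skip path.
    def __init__(self, J_in, D_in, J_out, D_out):
        super().__init__()
        self.J_out, self.D_out = J_out, D_out
        self.lstm = nn.LSTM(D_in, D_out, batch_first=True)
        self.combine = nn.Linear(J_in * D_out, J_out * D_out)
        self.skip = nn.Linear(J_in * D_in, J_out * D_out)

    def forward(self, x):                  # x: (batch, J_in, D_in)
        h, _ = self.lstm(x)                # (batch, J_in, D_out)
        z = self.combine(h.reshape(h.shape[0], -1)) \
            + self.skip(x.reshape(x.shape[0], -1))
        return z.reshape(-1, self.J_out, self.D_out)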

III-D Complexity Analysis

Since K-U-Net is symmetric, we only study the complexity of the encoder and assume that the feature size is $1$. Let us suppose that the encoder receives a sequence of length $L$ where $L=\prod_{1}^{n}L_i$ and all $L_i$ are equal. As the patch size in the first layer of the Kernel U-Net encoder is $L_1$, a kernel module processes $\frac{L}{L_1}$ patches of size $L_1$; more generally, the kernel at layer $n$ processes $\frac{L}{L_1^{n}}$ patches. Therefore, the complexity at layer $n$ is $O(\frac{L}{L_1^{n}}\cdot g(L_1))$, where $g(L_1)$ is the complexity inside the kernel as a function of the patch size $L_1$. When using the linear kernel at the first layer, the complexity is $O(\frac{L}{L_1}\cdot L_1)=O(L)$. When using a classic Transformer kernel, the complexity is $O(\frac{L}{L_1}\cdot L_1^{2})=O(L\cdot L_1)$. Setting $L_1=\log(L)$, the complexity of applying such a quadratic kernel is $O(L\log(L))$. Moreover, if we apply the Transformer kernel starting from the second layer, the complexity is dramatically reduced to $O(\frac{L}{L_1\cdot L_2}\cdot L_2^{2})=O(L)$. Following the same argument, the complexity of using LSTM and MLP kernels starting from the second layer is also bounded by $O(L)$.
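As a concrete illustration with the setting used in our experiments ($L=336$, $L_1=4$, multiples $\{4,3,7\}$): a Transformer kernel placed at the first layer attends within $\frac{336}{4}=84$ patches of length 4, i.e., on the order of $84\cdot 4^{2}=1344$ attention terms ($O(L\cdot L_1)$), while the same kernel placed at the second layer processes $\frac{336}{4\cdot 4}=21$ groups of 4 hidden vectors, i.e., about $21\cdot 4^{2}=336$ terms ($O(L)$); both are far below the roughly $336^{2}\approx 1.1\times 10^{5}$ terms of point-wise self-attention over the full sequence.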

IV Experiments and Results

IV-1 Datasets

We conducted experiments with our proposed Kernel U-Net on 7 public datasets, including 4 ETT (Electricity Transformer Temperature) datasets (ETTh1, ETTh2, ETTm1, ETTm2; https://github.com/zhouhaoyi/ETDataset), Weather (https://www.bgc-jena.mpg.de/wetter/), Traffic (http://pems.dot.ca.gov) and Electricity (https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014). These datasets have been benchmarked and made publicly available in [2], and a description of the datasets is available in [12].

Here, we followed the experiment setting in [12] and partitioned the ETT datasets into [12, 4, 4] months for training, validation, and testing, respectively. The Weather, Traffic, and Electricity datasets are split into [0.7, 0.1, 0.2] fractions for training, validation, and testing.

IV-2 Baselines and Experimental Settings

We follow the experiment setting in NLinear [12]: we take $L$ steps of historical data as input and forecast the values of the next $T$ steps, where $L\in\mathcal{L}=\{336,720\}$ and $T\in\mathcal{T}=\{96,192,336,720\}$. We replace last-value normalization by mean-value normalization for the ETT, Electricity, and Weather datasets, and apply instance normalization [32] for the Traffic dataset. We use Mean Squared Error (MSE) and Mean Absolute Error (MAE) for evaluation, as in [2].
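A minimal sketch of the mean-value normalization step, assuming the per-channel mean of the look-back window is subtracted before the model and added back to the forecast (function and variable names are illustrative):

import torch

def mean_value_normalize(x_hist):
    # x_hist: (B, L, M) look-back window.
    offset = x_hist.mean(dim=1, keepdim=True)   # per-channel mean, shape (B, 1, M)
    return x_hist - offset, offset

# Usage with a hypothetical `model`:
# x_norm, offset = mean_value_normalize(x_hist)
# y_hat = model(x_norm) + offset               # restore the original level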

We include recent methods: PatchTST [13], NLinear, DLinear [12], FEDformer [11], Autoformer [2], Informer [3], LogTrans [10] and Yformer [15]. We merge the results reported in [13] and [12], taking the best of each in the supervised setting, and execute Yformer with the default parameters from its GitHub repository (https://github.com/18kiran12/Yformer-Time-Series-Forecasting).

IV-3 Experiment details

We use a 4-layer K-U-Net for the experiments. The lists of multiples are respectively {4, 3, 7} and {6, 6, 5} for look-back windows $L\in\{336,720\}$. The bottom patch length is 4 and its width is 1. Following the channel-independent setting [12], we reshape the input into $(B\cdot M, L, 1)$ for processing with K-U-Net and then reshape it back. The hidden dimension is 128 for multivariate tasks. The learning rate is selected in $[0.00005, 0.001]$. The number of training epochs is 50 and the patience of early stopping is 10 in general. We use MAE as the loss function.

IV-4 Model Variants Search

We propose 4 kernels for experiments with K-U-Net on 7 datasets. By replacing a linear kernel at different layers with other types of kernels, we search for variants that fit the dataset. As there are too many variants to name, we note them as "KUN_<kernel>_<replace_layer> (<look-back_window>)". For example, KUN_Linear_0000 (720) means that the model is made of linear kernels at all layers and the look-back window size is 720, KUN_Transf_0100 means that a Transformer kernel replaces the linear kernel at the second layer, and KUN_Linear_0110 means that multilayer perceptron kernels replace the kernels at the second and third layers. We have composed 16 variants with MLP kernels and 7 variants with Transformer and LSTM kernels.

Figure 3: Search result of K-U-Net with Linear, MLP, LSTM, Transformer kernels on ETTh1 dataset.
Figure 4: GPU consumption of K-U-Net, PatchTST and Yformer.

To enumerate all variants that achieve the best result at least once, we report their performance as the average of the top-5 minimum running MSE (Top5MMSE) values on the validation and test sets. The search results are reported with a relative score (RS) to the minimum Top5MMSE:

$\text{RS}(V)=\text{Min}_{T\in\mathcal{T}}\frac{\text{Top5MMSE}(V,T)}{\text{Min}_{\{U\in\mathcal{V},L\in\mathcal{L}\}}\text{Top5MMSE}(U,T,L)}$ (6)

where $U$, $V$ are variant models in the model set $\mathcal{V}$, $T$ is a forecasting horizon in $\mathcal{T}$, and $L$ is a look-back window in $\mathcal{L}$. The relative score assigns 1 to the best-performing model and thus helps identify high-potential candidates for further fine-tuning examinations.
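For clarity, a sketch of Eq. (6) is given below, assuming the Top5MMSE values are stored per variant and per horizon with the minimum over look-back windows already taken (the data layout is hypothetical):

def relative_score(top5mmse, variant, horizons):
    # top5mmse[v][T]: Top5MMSE of variant v at horizon T (best over L).
    ratios = []
    for T in horizons:
        best = min(scores[T] for scores in top5mmse.values())
        ratios.append(top5mmse[variant][T] / best)
    return min(ratios)   # equals 1 if the variant is best for at least one horizon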

We observe in Figure 3 that the best variant for the ETTh1 dataset is KUN_Linear_0010. Comparing the relative scores of the K-U-Net variants with replace_layer codes (0010, 0011), (0110, 0111) and (0100, 0101), we remark that replacing the highest layer with a Transformer or LSTM kernel degrades the performance because of overfitting. Furthermore, we observe in Figure 5 that the best variants for the Weather dataset are KUN_Linear_0110, KUN_LSTM_0100 and KUN_Transf_0100. Comparing the relative scores of the variants with replace_layer codes (0010, 0110), (0011, 0111) and (0001, 0101), we remark that replacing the second layer with a Transformer or LSTM kernel improves the performance thanks to their expressiveness.

Among the candidates in the search phase, we empirically choose 3 variants for fine-tuning experiments with 5 runs. The final results show that the performance of Kernel U-Net exceeds or meets that of state-of-the-art methods in multivariate settings in most cases.

IV-5 Multivariate time series forecasting result

We remark that our model improves the MSE performance by around 72.92% compared with Yformer and by 2.99% compared with PatchTST and NLinear in the multivariate setting (Table I). It is worth noting that K-U-Net achieves similar results on the Electricity dataset with a variant based on MLP kernels.

Figure 5: Search result of K-U-Net with linear, MLP, LSTM, Transformer kernels on Weather dataset.
Figure 6: Computation time of K-U-Net, PatchTST and Yformer.

IV-6 Computation Efficiency

We examined the computation efficiency of 6 variants of K-U-Net, PatchTST, and Yformer. We execute the models on the ETTh1 and ETTm1 datasets for 10 epochs and measure the average execution time per epoch and the GPU consumption during training. The hidden dimension is set to 128 for all models. For a fair comparison, PatchTST and Yformer contain 2 layers of Transformer blocks, which equals KUN_Transf_0100. All experiments are executed on a Tesla V100 GPU in a Google Colab environment. Compared to PatchTST, KUN_Linear_0000 saves 85.20% and 83.07% of memory (Figure 4) and 81.82% and 80.55% of computation time (Figure 6) on the ETTh1 and ETTm1 datasets, while KUN_Transf_0100 saves 55.15% and 50.00% of memory and 28.47% and 26.88% of computation time, respectively.

V Conclusion

In this paper, we propose Kernel-U-Net, a highly promising candidate for large-scale time series forecasting tasks. It provides the convenience of composing particular models with custom kernels, and thereby adapts well to particular datasets. As an efficient architecture, it accelerates the procedure of searching for appropriate variants. Kernel-U-Net either exceeds or meets the state-of-the-art results in most cases. In the future, we hope to develop more kernels and hope that Kernel U-Net can be useful for other time series tasks such as classification or anomaly detection.

References

  • [1] Lai, G., W.-C. Chang, Y. Yang, et al. Modeling long- and short-term temporal patterns with deep neural networks. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’18, page 95–104. Association for Computing Machinery, New York, NY, USA, 2018.
  • [2] Wu, H., J. Xu, J. Wang, et al. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, J. W. Vaughan, eds., Advances in Neural Information Processing Systems, vol. 34, pages 22419–22430. Curran Associates, Inc., 2021.
  • [3] Zhou, H., S. Zhang, J. Peng, et al. Informer: Beyond efficient transformer for long sequence time-series forecasting. In The Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Virtual Conference, vol. 35, pages 11106–11115. AAAI Press, 2021.
  • [4] Liu, Y., H. Wu, J. Wang, et al. Non-stationary transformers: Exploring the stationarity in time series forecasting. Advances in Neural Information Processing Systems, 2022.
  • [5] Khandelwal, I., R. Adhikari, G. Verma. Time series forecasting using hybrid ARIMA and ANN models based on DWT decomposition. Procedia Computer Science, 48:173–179, 2015.
  • [6] Persson, C., P. Bacher, T. Shiga, et al. Multi-site solar power forecasting using gradient boosted regression trees. Solar Energy, 150:423–436, 2017.
  • [7] Tokgöz, A., G. Ünal. An RNN based time series approach for forecasting Turkish electricity load. In 2018 26th Signal Processing and Communications Applications Conference (SIU), pages 1–4. 2018.
  • [8] Kong, W., Z. Dong, Y. Jia, et al. Short-term residential load forecasting based on LSTM recurrent neural network. IEEE Transactions on Smart Grid, PP:1–1, 2017.
  • [9] Hewage, P., A. Behera, M. Trovati, et al. Temporal convolutional neural (TCN) network for an effective weather forecasting using time-series data from the local weather station. Soft Computing, 24(21):16453–16482, 2020.
  • [10] Li, S., X. Jin, Y. Xuan, et al. Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting, page 11. Curran Associates Inc., Red Hook, NY, USA, 2019.
  • [11] Zhou, T., Z. Ma, Q. Wen, et al. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. In Proceedings of the 39th International Conference on Machine Learning, pages 27268–27286. PMLR, 2022.
  • [12] Zeng, A., M. Chen, L. Zhang, et al. Are transformers effective for time series forecasting? In Proceedings of the AAAI Conference on Artificial Intelligence. 2023.
  • [13] Nie, Y., N. H. Nguyen, P. Sinthong, et al. A time series is worth 64 words: Long-term forecasting with transformers. International Conference on Learning Representations, 2023.
  • [14] Ronneberger, O., P. Fischer, T. Brox. U-net: Convolutional networks for biomedical image segmentation. In N. Navab, J. Hornegger, W. M. Wells, A. F. Frangi, eds., Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pages 234–241. Springer International Publishing, 2015.
  • [15] Madhusudhanan, K., J. Burchert, N. Duong-Trung, et al. U-net inspired transformer architecture for far horizon time series forecasting. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2022, Grenoble, France, September 19–23, 2022, Proceedings, Part VI, page 36–52. Springer-Verlag, Berlin, Heidelberg, 2023.
  • [16] Vaswani, A., N. Shazeer, N. Parmar, et al. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett, eds., Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc., 2017.
  • [17] Dosovitskiy, A., L. Beyer, A. Kolesnikov, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. 2021.
  • [18] Deng, J., R. Socher, L. Fei-Fei, et al. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 00, pages 248–255. 2009.
  • [19] Liu, Z., Y. Lin, Y. Cao, et al. Swin transformer: Hierarchical vision transformer using shifted windows. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9992–10002. IEEE, 2021.
  • [20] Liu, S., H. Yu, C. Liao, et al. Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting. In International Conference on Learning Representations. 2022.
  • [21] Çiçek, Ö., A. Abdulkadir, S. S. Lienkamp, et al. 3d u-net: Learning dense volumetric segmentation from sparse annotation. In International Conference on Medical Image Computing and Computer-Assisted Intervention. 2016.
  • [22] Perslev, M., M. Jensen, S. Darkner, et al. U-time: A fully convolutional network for time series segmentation applied to sleep staging. In Advances in Neural Information Processing Systems, vol. 32. Curran Associates, Inc., 2019.
  • [23] Han, N., L. Zhou, Z. Xie, et al. Multi-level u-net network for image super-resolution reconstruction. Displays, 73:102192, 2022.
  • [24] Zhang, K., Y. Li, J. Liang, et al. Practical blind image denoising via swin-conv-UNet and data synthesis. Mach. Intell. Res., 2023.
  • [25] Azar, J., G. B. Tayeh, A. Makhoul, et al. Efficient lossy compression for iot using sz and reconstruction with 1d u-net. Mob. Netw. Appl., 27(3):984–996, 2022.
  • [26] Cao, H., Y. Wang, J. Chen, et al. Swin-unet: Unet-like pure transformer for medical image segmentation. In L. Karlinsky, T. Michaeli, K. Nishino, eds., Computer Vision – ECCV 2022 Workshops, pages 205–218. Springer Nature Switzerland, 2023.
  • [27] Du, Y., W. Wang, L. Wang. Hierarchical recurrent neural network for skeleton based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1110–1118. 2015.
  • [28] Xiao, M., C. Liu. Semantic relation classification via hierarchical recurrent neural network with attention. In Y. Matsumoto, R. Prasad, eds., Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1254–1263. The COLING 2016 Organizing Committee, 2016.
  • [29] Kowsari, K., D. E. Brown, M. Heidarysafa, et al. HDLTex: Hierarchical deep learning for text classification. In 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 364–371. 2017.
  • [30] Hong, J., J. Yoon. Multivariate time-series classification of sleep patterns using a hybrid deep learning architecture. In 2017 IEEE 19th International Conference on e-Health Networking, Applications and Services (Healthcom), pages 1–6. 2017.
  • [31] Hochreiter, S., J. Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, 1997.
  • [32] Ulyanov, D., A. Vedaldi, V. Lempitsky. Instance normalization: The missing ingredient for fast stylization, 2017.