
The University of Adelaide
{chang.dong, liangwei.zheng, weitong.chen}@adelaide.edu.au

Kolmogorov-Arnold Networks (KAN) for Time Series Classification and Robust Analysis

Chang Dong, Liangwei Zheng, Weitong Chen (corresponding author)
Abstract

Kolmogorov-Arnold Networks (KAN) have recently attracted significant attention as a promising alternative to traditional Multi-Layer Perceptrons (MLP). Despite their theoretical appeal, KAN require validation on large-scale benchmark datasets. Time series data have become increasingly prevalent in recent years, and univariate time series in particular are naturally suited to validating KAN. We therefore conducted a fair comparison among KAN, MLP, and mixed structures. The results indicate that KAN can achieve performance comparable to, or even slightly better than, MLP across 128 time series datasets. We also performed an ablation study on KAN, revealing that the output is primarily determined by the base component rather than by the B-spline function. Furthermore, we assessed the robustness of these models and found that KAN and the hybrid structure MLP_KAN exhibit significant robustness advantages, attributed to their lower Lipschitz constants. This suggests that KAN and KAN layers hold strong potential as robust models or as a means of improving the adversarial robustness of other models.

Keywords:
Kolmogorov-Arnold Networks · Time-Series · Adversarial Attack

1 Introduction

In recent years, time-series analysis has become increasingly prevalent in various fields, including healthcare [6], human activity recognition [16], and remote sensing [17]. Among these tasks, time series classification (TSC) is one of the key challenges in time series analysis. Owing to rapid advances in machine learning research, many TSC algorithms have been proposed. Before the prevalence of deep learning, numerous effective algorithms already existed. For instance, NN-DTW [1] measures the overall similarity of whole sequences using different distance metrics. Other methods include identifying repeated shapelets in sequences as features [7], discriminating time series by the frequency of repetition of certain subseries [18], and using ensemble methods [13]. However, these approaches often faced limitations, as they were either difficult to generalize to various scenarios or had high time complexity.

In recent years, deep learning-based methods have achieved notable success in various fields such as image recognition and natural language processing. Consequently, many techniques from these domains have been adapted for time series classification, including Convolutional Neural Networks (CNNs) [9], Recurrent Neural Networks (RNNs) [11], and Transformers [20]. Even large-model-based classification has found extensive application in time series classification [21].

The development of neural networks is fundamentally rooted in the concept of the multi-layer perceptron (MLP). Regardless of their complexity, neural networks retain an architecture similar to that of the MLP. According to the Universal Approximation Theorem (UAT), any continuous function can be approximated by a single hidden layer with a finite number of neurons [8]. This theorem underpins the capability of MLP to model and fit complex distributions. Recently, Liu et al. [14] proposed a new paradigm for neural networks called Kolmogorov-Arnold Networks (KAN), which contrasts with traditional MLP-based neural networks. Unlike MLP, KAN is based on Kolmogorov-Arnold Theory (KAT) [12] and explicitly defines the model size required for fitting. MLP and KAN have analogous structures: in MLP, neuron outputs undergo a linear transformation followed by activation before being passed to the next neuron, whereas in KAN, the edges serve as learnable activation functions, followed by a linear transformation before passing to the next neuron. This learnable activation function makes KAN a potential competitor to MLP-based neural networks. However, KAN has only been validated on formulas constructed in the physics domain and has not been tested on large-scale datasets, leaving its scalability unproven. Univariate time series are inherently well-suited to KAN's inputs, making them excellent candidates for validation. Furthermore, the robustness of TSC has garnered significant attention in recent years [3, 4, 5, 10]. As a new architecture, KAN's robustness has not yet been studied. Given these circumstances, to validate the performance and robustness of KAN in TSC tasks, we conducted the following work:

  • We performed a fair comparison across 128 UCR datasets among KAN, MLP, KAN_MLP (KAN with the last layer replaced by MLP), and MLP_KAN (MLP with the last layer replaced by KAN) under identical configurations, as well as MLP_L (an MLP with the same number of parameters). We found that KAN achieves performance comparable to MLP.

  • We conducted an ablation study to investigate the roles of the base and B-spline functions. The results indicated that the output values were predominantly determined by the base function. Additionally, we observed that in the absence of the base function, spline functions with large grid sizes were difficult to optimize.

  • We assessed the robustness of KAN by comparing it with other models. Our findings revealed that KAN exhibited superior adversarial robustness due to its lower Lipschitz constant.

  • We observed an anomalous phenomenon: KAN with higher grid sizes demonstrated greater robustness despite having a higher Lipschitz constant. We provide a plausible hypothesis for this observation in the final section.

2 Background

2.1 Kolmogorov-Arnold representation

KAN is inspired by Kolmogorov-Arnold representation theory (KAT). It states that any multivariate continuous function defined on a bounded domain can be represented as a finite composition of continuous functions of a single variable and the binary operation of addition. Specifically, if f is a continuous function on a bounded domain D ⊂ ℝ^n, then there exist continuous functions φ_{ij} and ψ_i such that:

f(x_1, x_2, \ldots, x_n) = \sum_{i=1}^{2n+1} \psi_i\left(\sum_{j=1}^{n} \phi_{ij}(x_j)\right),   (1)

where φ_{ij}: [0,1] → ℝ and ψ_i: ℝ → ℝ. The theorem transforms the task of learning a multivariate function into learning a finite number of univariate functions. Compared to MLP, it explicitly provides the number of one-dimensional functions needed for fitting. However, these univariate functions could be non-smooth or even fractal, making the representation theoretically valid but practically unusable. Nevertheless, Liu et al. [14] found that, by analogy to MLP in neural networks, KAN need not be limited to two layers and finite width to fit all non-linearities. Furthermore, most natural functions tend to be smooth and have sparse structures. These insights suggest that a scalable KAN could become a strong competitor to MLP.

2.2 Adversarial Attack

Adversarial attacks apply carefully crafted small perturbations r ∈ ℝ^d to input data x ∈ ℝ^d, leading to significant changes in a model's output, such as fooling a classifier f: ℝ^d → ℝ^m so that the predicted label changes:

\operatorname{argmax}\{f(x)\} \neq \operatorname{argmax}\{f(x_{adv})\}, \quad x_{adv} = x + r, \ \text{s.t.}\ \|r\|^2 \ll \|x\|^2.   (2)

Here, the perturbation r is small in magnitude relative to x, as indicated by their norms. We apply such perturbations to test whether the model can be fooled, thereby assessing its robustness. To consider the worst-case scenario, we typically use gradient attacks, which require full knowledge of the model and the data. Among them, the most widely used method is Projected Gradient Descent (PGD) [15], a gradient-based iterative attack that is among the most effective means of evaluating the robustness of models against gradient attacks. It can be characterized by:

x_{adv}^{(t+1)} = \mathrm{Clip}_{x,\epsilon}\left\{ x_{adv}^{(t)} + \alpha \cdot \mathrm{sign}\left(\nabla_{x_{adv}^{(t)}} \mathcal{L}(x_{adv}^{(t)}, y)\right) \right\},   (3)

where t is the iteration index and Clip_{x,ε}{·} ensures that x_adv^{(t+1)} remains within ε of the original input x. The method iteratively adjusts x_adv to maximize the loss in the direction that moves the prediction away from the true label y, while ensuring x_adv stays within a small perturbation distance ε of x.
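To make Eq. (3) concrete, the following is a minimal PGD sketch in PyTorch. The function name and defaults (`pgd_attack`, a step size of 0.01·eps and 100 steps, mirroring the setting in Sec. 4) are illustrative rather than the exact implementation used in our experiments.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.1, alpha=None, steps=100):
    """Minimal sketch of Eq. (3): signed-gradient ascent steps clipped to an L_inf eps-ball."""
    alpha = 0.01 * eps if alpha is None else alpha   # step size of 0.01 * eps, as in Sec. 4
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)      # untargeted: maximize loss on the true label
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Ascend the loss along the gradient sign, then project back into the eps-ball around x.
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)
    return x_adv.detach()
```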

2.3 Local Lipschitz Constant

A function f: ℝ^m → ℝ^n is ℓ_f-locally Lipschitz continuous at radius r if, for each i = 1, …, n and for all ‖x_1 − x_2‖ ≤ r, the following holds:

\|f(x_1) - f(x_2)\| \leq \ell_f \|x_1 - x_2\|,   (4)

where ℓ_f is the local Lipschitz constant; hereafter, we will refer to it simply as the Lipschitz constant. The Lipschitz constant is directly linked to perturbation stability, which in turn relates to robustness [19].
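In practice the local Lipschitz constant is rarely computed exactly; one simple way to probe it is a sampling-based lower bound on Eq. (4), as in the sketch below. This is only an illustrative estimator, not the exact procedure used in Sec. 5.3.

```python
import torch

@torch.no_grad()
def local_lipschitz_lower_bound(model, x, r=0.1, n_samples=256):
    """Monte-Carlo lower bound on the local Lipschitz constant in Eq. (4).

    x is assumed to be a batch of flattened inputs of shape (batch, d); points are
    sampled on the sphere of radius r around each input and the largest observed
    ratio ||f(x1) - f(x2)|| / ||x1 - x2|| is returned.
    """
    model.eval()
    fx = model(x)
    best = 0.0
    for _ in range(n_samples):
        delta = torch.randn_like(x)
        delta = r * delta / delta.norm(dim=1, keepdim=True).clamp_min(1e-12)
        num = (model(x + delta) - fx).norm(dim=1)
        den = delta.norm(dim=1)                      # equals r up to numerical error
        best = max(best, (num / den).max().item())
    return best
```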

3 Methodology

3.1 Kolmogorov-Arnold Networks (KAN)

Figure 1: A three-layer KAN structure with the architecture [3-5-3-1].

Assume there is a data distribution D ⊆ ℝ^d × ℝ^m. Our objective is to learn a function f: ℝ^d → ℝ^m that minimizes the empirical risk R̂(f) = (1/n) Σ_{i=1}^{n} ‖y_i − f(x_i)‖. The purpose of KAN is to learn such a representation of f, thereby minimizing the objective loss. The original KAN used a two-layer structure, which Liu et al. [14] extended to arbitrary width and depth. In contrast to MLP, the activation functions are placed on the edges instead of the neurons; KAN use 3rd-order B-spline (k = 3) functions for fitting, which allows learning sophisticated activation functions by controlling the weight of each basis. In this case, neuron q in layer l+1 can be represented as:

x_{l+1,q}^{spline} = \sum_{p=1}^{n} w_{p,q}^{spline} \cdot \Phi_{l,q,p}(x_{l,p}) = \sum_{p=1}^{n} w_{p,q}^{spline} \cdot \sum_{i=1}^{k+G} w_i \cdot B_i(x_{l,p}),   (5)

where x_{l,p} is the input from an arbitrary neuron p in the previous layer l. The input from all n neurons in the previous layer undergoes a nonlinear transformation produced by a learnable B-spline combination, where G is the grid size, which determines the number of B-spline bases (k+G). This is followed by a weighted summation to obtain the q-th output x_{l+1,q}^{spline}. Additionally, KAN introduce a base function, similar to a residual connection, using a weighted silu to stabilize optimization, which can be represented as:

x_{l+1,q}^{base} = \sum_{p=1}^{n} w_{p,q}^{base} \cdot \mathrm{silu}(x_{l,p}) = \sum_{p=1}^{n} w_{p,q}^{base} \cdot \frac{x_{l,p}}{1 + e^{-x_{l,p}}}   (6)

Therefore, the output of the q-th neuron in layer l+1 can be represented as:

x_{l+1,q} = x_{l+1,q}^{spline} + x_{l+1,q}^{base}   (7)

For a multi-layer KAN, the final output can be represented as a nested structure of layers:

f(\mathbf{x}) = f(x_1, x_2, \ldots, x_n) = \Psi_L \circ \Psi_{L-1} \circ \cdots \circ \Psi_1(\mathbf{x})   (8)

where Ψ_l denotes the l-th layer, which combines the two operations above: a spline transformation and the silu base activation. Fig. 1 illustrates a three-layer KAN structure with the architecture [3-5-3-1], clearly depicting how KAN operates.
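For concreteness, the sketch below shows one way Eqs. (5)-(7) can be realized as a single layer in PyTorch, with a uniform grid fixed on [-1, 1] and the B-spline bases evaluated by the Cox-de Boor recursion. It folds the outer weights w_{p,q}^{spline} and the per-basis weights w_i into one coefficient tensor; efficient-KAN follows a similar idea, but this simplified version is for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleKANLayer(nn.Module):
    """Illustrative KAN layer implementing Eqs. (5)-(7): a spline branch plus a silu base branch."""

    def __init__(self, in_dim, out_dim, grid_size=5, k=3):
        super().__init__()
        # Uniform grid on [-1, 1], extended by k knots on each side for the recursion below.
        h = 2.0 / grid_size
        grid = torch.arange(-k, grid_size + k + 1, dtype=torch.float32) * h - 1.0
        self.register_buffer("grid", grid)           # (grid_size + 2k + 1,) knots
        self.k = k
        self.w_base = nn.Parameter(torch.randn(in_dim, out_dim) / in_dim ** 0.5)
        # w_{p,q}^{spline} and the per-basis weights w_i folded into one coefficient tensor.
        self.w_spline = nn.Parameter(torch.randn(in_dim, out_dim, grid_size + k) * 0.1)

    def b_splines(self, x):
        # Cox-de Boor recursion; x: (batch, in_dim) -> bases: (batch, in_dim, grid_size + k)
        g, k = self.grid, self.k
        x = x.unsqueeze(-1)
        bases = ((x >= g[:-1]) & (x < g[1:])).to(x.dtype)
        for d in range(1, k + 1):
            bases = ((x - g[: -(d + 1)]) / (g[d:-1] - g[: -(d + 1)]) * bases[..., :-1]
                     + (g[d + 1:] - x) / (g[d + 1:] - g[1:-d]) * bases[..., 1:])
        return bases

    def forward(self, x):
        base = F.silu(x) @ self.w_base                                            # Eq. (6)
        spline = torch.einsum("bpi,pqi->bq", self.b_splines(x), self.w_spline)    # Eq. (5)
        return base + spline                                                      # Eq. (7)
```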

3.2 KAN for time series classification

We constructed classifiers using KAN, similar to the structure shown in Fig. 1. Because the B-spline fitting interval is set to [-1, 1], data distributed outside this interval cannot be fitted effectively. Instead of directly adopting the method proposed by Liu et al., which updates the grid interval according to the data distribution, we employed a more straightforward approach: we fixed the B-spline grid interval to [-1, 1] throughout training and applied batch normalization in each KAN layer to keep the activations within [-1, 1], ensuring the data distribution conforms to the grid and easing optimization. To build a KAN for TSC, we adopted a 3-layer structure with the output transformed to the number of classes, giving an architecture of [d-d-128-m], where d is the sequence length and m is the number of classes. We compared KAN with MLPs that had the same number of neurons per layer or the same number of parameters, as well as networks where the last layer of KAN was replaced with MLP (KAN_MLP) and the last layer of MLP was replaced with KAN (MLP_KAN). The experimental design is shown in Tab. 1; a minimal sketch of the KAN classifier follows the table.

Table 1: Model Architectures and Parameters (G = 5, k = 3 for all B-splines)

Networks | Architecture | Activation | Parameters (≈)
KAN | [d-d-128-m] | Silu, B-Spline | (2+G+k)·d^2 + (258 + 128·(G+k))·d
MLP_I | [d-d-128-m] | Relu | d^2 + 131d
MLP_II | [d-10d-128-m] | Relu | 10d^2 + 1310d
KAN_MLP | [d-d-128-m] | Relu, Silu, B-Spline | (2+G+k)·d^2 + 130d
MLP_KAN | [d-d-128-m] | Relu, Silu, B-Spline | d^2 + (2+G+k)·128d
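Below is a minimal sketch of the [d-d-128-m] classifier described above, reusing the SimpleKANLayer from Sec. 3.1. Placing a BatchNorm1d before each KAN layer is one plausible way to keep activations near the fixed [-1, 1] grid; the class name is illustrative rather than the code used in our experiments.

```python
import torch.nn as nn

class KANClassifier(nn.Module):
    """[d-d-128-m] KAN classifier; BatchNorm keeps inputs to each KAN layer near the [-1, 1] grid."""

    def __init__(self, seq_len, n_classes, grid_size=5, k=3):
        super().__init__()
        dims = [seq_len, seq_len, 128, n_classes]
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.BatchNorm1d(d_in), SimpleKANLayer(d_in, d_out, grid_size, k)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):          # x: (batch, seq_len) univariate series
        return self.net(x)         # logits: (batch, n_classes)
```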

4 Experiment Settings

Dataset: We applied the UCR2018 datasets [2] to evaluate these models. The UCR Time Series Archive encompasses 128 datasets, which are all univariate. These datasets span a diverse range of real-world domains, including healthcare, human activity recognition, remote sensing and more. Each dataset comprises a distinct number of samples, all of which have been pre-partitioned into training and testing sets. Reflecting the intricacies of real-world data, the archive includes datasets with missing values, imbalances, and those with limited training samples.

Evaluation Metrics: We used the accuracy and the F1 score to assess the performance of all models in tab. 1. During adversarial attacks, we evaluate the robustness of the models using the Attack Success Rate (ASR).
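For reference, the sketch below computes ASR as the fraction of originally correctly classified samples whose predictions are flipped by the attack; this particular convention is an assumption on our part, since ASR definitions vary across the literature.

```python
import torch

@torch.no_grad()
def attack_success_rate(model, x, x_adv, y):
    """ASR as the share of originally correct predictions that the attack flips."""
    clean_pred = model(x).argmax(dim=1)
    adv_pred = model(x_adv).argmax(dim=1)
    correct = clean_pred == y
    flipped = correct & (adv_pred != y)
    return flipped.sum().item() / max(correct.sum().item(), 1)
```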

Experiment setup: Our experiments were executed on a server equipped with Nvidia RTX 4090 GPUs, 64 GB RAM, and an AMD EPYC 7320 processor.

Parameter setting: In our experiments, we utilized the open-source GitHub project efficient-KAN (see our GitHub repository and the efficient-KAN project) to replace the original CPU-based KAN architecture proposed by Liu et al. [14]. This modification, along with switching the optimizer from BFGS to AdamW, allowed for faster training. We set the dropout rate to 0.1 and trained all models for 1000 epochs. The learning rate was initialized at 1e-2 and decayed to 90% of its previous value every 25 epochs. For KAN in particular, we used a weight decay of 1e-2, set L1 regularization on the weights to 0, and set entropy regularization to 1e-5. For adversarial attacks, we employed PGD with non-targeted attacks. The perturbation magnitude ε is set to [0.05, 0.1, 0.25, 0.5], with a step size of 0.01·ε and 100 iterations per attack.
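The sketch below wires the hyperparameters listed above into a standard PyTorch training loop (AdamW, lr 1e-2 with a ×0.9 decay every 25 epochs, weight decay 1e-2, 1000 epochs). `KANClassifier`, `seq_len`, `n_classes`, and `train_loader` are assumed to be defined elsewhere (see the sketches in Sec. 3); dropout and the KAN-specific regularization terms are only noted, since their exact placement depends on the efficient-KAN implementation.

```python
import torch

# Hyperparameters as listed above; KANClassifier, seq_len, n_classes and train_loader
# are assumed to be defined elsewhere.
model = KANClassifier(seq_len, n_classes).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=25, gamma=0.9)  # lr -> 90% every 25 epochs
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(1000):
    model.train()
    for x, y in train_loader:
        x, y = x.cuda(), y.cuda()
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        # For KAN, an entropy regularization term with weight 1e-5 is added to the loss here;
        # the exact call depends on the efficient-KAN implementation, so it is omitted.
        loss.backward()
        optimizer.step()
    scheduler.step()
```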

5 Result

5.1 Performance Comparison

Fig. 2 (a) and (b) show the accuracy and F1-score distributions across the 128 UCR datasets for the five models, respectively. We observe that these five models achieve relatively similar performance across the 128 datasets, both in terms of F1 score and accuracy. However, KAN performs slightly better overall. This conclusion is also supported by the critical diagrams in Fig. 3, where only KAN and MLP_L (the MLP with the same number of parameters) exhibit a statistically significant difference. In the critical diagram, KAN ranks highest, indicating its strong fitting capability and demonstrating that it can achieve performance comparable to, or even better than, traditional neural networks on benchmark time series datasets.

Figure 2: Performance comparison of five models across 128 datasets. (a) Test accuracy; (b) test F1 score.
Figure 3: Critical diagram of accuracy for five models across 128 datasets (higher rank is better).

5.2 Ablation Study of KAN

KAN have a more complex structure than MLP, owing to the combination of base and spline functions, and different grid sizes of the spline functions have varying impacts on performance. To evaluate their influence, we investigated the following three configurations of KAN (a sketch of how these configurations can be toggled follows the list):

  1. Complete KAN with different grid sizes: 1, 5, 50

  2. KAN with only the spline component, with different grid sizes: 1, 5, 50

  3. KAN with only the base component
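As noted above the list, these configurations can be realized by masking one branch of the layer. The sketch below shows one such toggle on top of the SimpleKANLayer from the Sec. 3.1 sketch; the flags `use_base` and `use_spline` are illustrative, not the code used to produce the results.

```python
import torch
import torch.nn.functional as F

class AblatableKANLayer(SimpleKANLayer):
    """SimpleKANLayer (Sec. 3.1 sketch) with switches to drop the base or spline branch."""

    def __init__(self, in_dim, out_dim, grid_size=5, k=3, use_base=True, use_spline=True):
        super().__init__(in_dim, out_dim, grid_size, k)
        self.use_base, self.use_spline = use_base, use_spline

    def forward(self, x):
        out = torch.zeros(x.shape[0], self.w_base.shape[1], device=x.device, dtype=x.dtype)
        if self.use_base:
            out = out + F.silu(x) @ self.w_base
        if self.use_spline:
            out = out + torch.einsum("bpi,pqi->bq", self.b_splines(x), self.w_spline)
        return out

# The three ablation configurations:
# 1) AblatableKANLayer(d, d, grid_size=g)                   # complete KAN, g in {1, 5, 50}
# 2) AblatableKANLayer(d, d, grid_size=g, use_base=False)   # spline only, g in {1, 5, 50}
# 3) AblatableKANLayer(d, d, use_spline=False)              # base only
```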

Table 2: Test accuracy of models with different architectures on the 128 datasets. The values in columns 1, 5, and 50 give the number of datasets (out of 128) on which the accuracy for the row's grid size is greater than or equal to that for the column's grid size, under the same architecture. Q1, Q2, and Q3 denote the quantiles of the accuracy distribution across the 128 datasets.

KAN Configuration | Grid Size | 1 | 5 | 50 | Q1 | Q2 | Q3
w/ Base & Spline | 1 | 128 | 64 | 96 | 0.6000 | 0.7991 | 0.9214
w/ Base & Spline | 5 | 76 | 128 | 112 | 0.6146 | 0.7750 | 0.9387
w/ Base & Spline | 50 | 39 | 24 | 128 | 0.5626 | 0.6976 | 0.847
w/o Base | 1 | 128 | 100 | 122 | 0.5591 | 0.7571 | 0.9009
w/o Base | 5 | 40 | 128 | 119 | 0.5054 | 0.6706 | 0.8271
w/o Base | 50 | 13 | 19 | 128 | 0.2226 | 0.4315 | 0.5732
w/o Spline | - | - | - | - | 0.5652 | 0.7698 | 0.9000
Table 3: Test (train) accuracy of different KAN configurations on the CBF dataset.

KAN Configuration | Grid Size 1 | Grid Size 5 | Grid Size 50
w/ Base & Spline | 0.9011 (0.9667) | 0.9644 (0.9667) | 0.9300 (0.9667)
w/o Base | 0.8722 (1.0000) | 0.8644 (0.9667) | 0.3178 (0.6667)
w/o Spline | 0.8811 (1.0000) | 0.8811 (1.0000) | 0.8811 (1.0000)
Figure 4: Distribution of the flattened train/test output values of the last layer of the model under different configurations on the CBF dataset. (a) Train, grid size 1; (b) train, grid size 50; (c) test, grid size 1; (d) test, grid size 50.
Figure 5: ASR distribution across 128 datasets for the five models under different perturbation magnitudes: (a) eps = 0.05, (b) eps = 0.1, (c) eps = 0.25, (d) eps = 0.5.
Figure 6: Critical diagram of ASR rank across 128 datasets for the five models under different perturbation magnitudes (lower rank is better): (a) eps = 0.05, (b) eps = 0.1, (c) eps = 0.25, (d) eps = 0.5.
Figure 7: Lipschitz constant distribution differences across 128 datasets: (a) KAN − MLP, (b) MLP_KAN − MLP.
Figure 8: Comparison between grid size 50 and grid size 1 across 128 datasets: (a) ASR (grid size 50 vs. grid size 1), (b) Lipschitz constant difference (grid size 1 − grid size 50).

Tab. 2 presents the overall performance of these three configurations across the 128 datasets. We observe that an excessively large grid size leads to performance degradation, regardless of whether the base function is present. In the complete KAN, there is little difference in performance between the smaller grid sizes. For grid size = 1, nearly 50% of the datasets achieved over 80% accuracy, whereas for grid size = 50 this value drops to less than 70%. In the KAN without the base function, overall performance declines significantly as the grid size increases; in particular, for grid size = 50, the accuracy on 50% of the datasets is below 43.15%. Additionally, we found that the performance of KAN without the spline function is close to that of KAN with only the spline function and grid size = 1. This indicates that the fitting capability of KAN largely comes from the simple base activation, suggesting that complex B-spline combinations may cause optimization difficulties. To explain these results, we analyzed the CBF dataset, which exhibited a trend similar to the overall results, as shown in Tab. 3. Most results are comparable, except for the model with only the spline function and a grid size of 50, which performed significantly worse. We analyzed the output of the two components at the last layer, as shown in Fig. 4.

We observed two phenomena across both the training and testing sets. First, the output values of the spline component are smaller and more concentrated than those of the base component. This indicates that the spline's contribution to the final decision is less significant than that of the base, suggesting that the base component plays the more critical role in decision-making. Second, the most significant difference between Fig. 4(c) and 4(d) is the distinct distribution observed when the base component is removed. When the grid size is 1, the output distributions of the three configurations are similar, exhibiting two prominent peaks on the positive and negative sides, with the negative peak being higher. This pattern occurs because the CBF dataset has 3 classes: when one class is predicted with high confidence, the other two classes tend to output negative values. However, this changes drastically at a grid size of 50. Here, the spline output shows only a single peak concentrated around zero in both the training and testing sets, corresponding to the lower accuracy observed in Tab. 2 for grid size = 50 (without base). This further confirms that an excessively large grid size complicates the network's optimization.
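The histograms in Fig. 4 can be reproduced, in spirit, by recording the two branches of the last layer separately. Below is an illustrative sketch that hooks the final SimpleKANLayer from the earlier sketches; the function name and hook mechanics are ours, not the original analysis code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def last_layer_branch_outputs(model, loader, last_layer):
    """Collect the flattened spline and base outputs of the final KAN layer (cf. Fig. 4)."""
    spline_vals, base_vals, cache = [], [], {}

    def grab_input(module, inputs, output):      # record the input that reaches the last layer
        cache["x"] = inputs[0].detach()

    handle = last_layer.register_forward_hook(grab_input)
    for x, _ in loader:
        model(x)                                 # forward pass populates cache["x"]
        xin = cache["x"]
        base_vals.append((F.silu(xin) @ last_layer.w_base).flatten())
        spline_vals.append(torch.einsum("bpi,pqi->bq",
                                        last_layer.b_splines(xin),
                                        last_layer.w_spline).flatten())
    handle.remove()
    return torch.cat(spline_vals), torch.cat(base_vals)   # histogram these two tensors
```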

5.3 Evaluation and Analysis of Adversarial Robustness

We also found that KAN demonstrates better robustness than MLP. We performed untargeted PGD attacks on the aforementioned five models, with ε ranging from 0.05 to 0.5. The results consistently show that KAN significantly outperforms MLP. Fig. 5 illustrates the ASR of PGD on these models. We observe that KAN and MLP_KAN exhibit remarkable robustness compared to the other three models, and this advantage grows as ε increases. Specifically, at ε = 0.5, MLP_KAN shows the best robustness among the five, with the ASR remaining below approximately 75% on half of the datasets, whereas the ASR for MLP, MLP_L, and KAN_MLP approaches 1 on nearly half of the datasets. From Fig. 6, it is evident that KAN and MLP_KAN demonstrate significantly different robustness across the 128 datasets compared to the other three models.

To explain this phenomenon, we computed the Lipschitz constants of the KAN, MLP, and MLP_KAN models. Fig. 7 shows the distribution of Lipschitz constant differences across the 128 datasets. Fig. 7(a) and 7(b) both indicate that introducing a KAN layer generally decreases the Lipschitz constant on most datasets, which is consistent with the observations in Fig. 5. The combination of spline functions produced by KAN with a low grid size tends to be smooth and flat, making it difficult for small changes in the input to cause significant changes in the output y. Additionally, as shown in Fig. 4, the output distribution of the B-spline component is narrower than that of the base (activation plus linear) component, so it contributes minimally to the output. This could be the primary reason why KAN is more robust under attack.

However, we observed the opposite result in another experiment. Fig. 8 shows that the Lipschitz constant corresponding to a larger grid size is greater. This is expected, because as the number of grid intervals increases, the slopes of the spline bases also increase, making the overall B-spline function more likely to produce large slopes. Yet the results indicate that the model with the larger Lipschitz constant exhibits greater robustness: on 104 of the 128 datasets, the ASR for grid size 50 is less than or equal to that for grid size 1. We preliminarily attribute this to the fact that the values produced by the B-spline are much smaller than those of the base function, so the base contributes the majority of the decision. However, the gradients of a B-spline with a larger grid size are substantial, so during PGD attacks the gradient is mainly provided by the B-spline component, which may not point in the correct (most damaging) direction compared to the base. Therefore, networks with larger Lipschitz constants can exhibit stronger robustness. This may also further explain why models become difficult to optimize as the grid size increases.

6 Conclusion

In this paper, we applied KAN to time series classification and conducted a fair comparison among KAN, MLP, and mixed structures. We found that KAN can achieve performance comparable to MLP. Additionally, we analyzed the importance of each component of KAN and discovered that a larger grid size can be difficult to optimize, leading to lower performance. Furthermore, we evaluated the adversarial robustness of KAN and the other models, finding that KAN exhibited remarkable robustness, attributed to its lower Lipschitz constant. Moreover, we found that KAN with a larger grid size has a greater Lipschitz constant yet exhibits stronger robustness. We provided an explanation for this phenomenon, although it requires further verification and broader experiments in future work.

7 Acknowledgments

We thank all the creators and providers of the UCR time series benchmark datasets. This research work is supported by Australian Research Council Linkage Project (LP230200821), Australian Research Council Discovery Projects (DP240103070), Australian Research Council ARC Early Career Industry Fellowship (IE230100119), Australian Research Council ARC Early Career Industry Fellowship (IE240100275), and the University of Adelaide Sustainability FAME Strategy Internal Grant 2023.

References

  • [1] Bagnall, A., Lines, J., Bostrom, A., Large, J., Keogh, E.: The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data mining and knowledge discovery 31, 606–660 (2017)
  • [2] Dau, H.A., Bagnall, A., Kamgar, K., Yeh, C.C.M., Zhu, Y., Gharghabi, S., Ratanamahatana, C.A., Keogh, E.: The UCR time series archive. IEEE/CAA Journal of Automatica Sinica 6(6), 1293–1305 (2019)
  • [3] Dong, C., Li, Z., Zheng, L., Chen, W., Zhang, W.E.: Boosting certificate robustness for time series classification with efficient self-ensemble. arXiv preprint arXiv:2409.02802 (2024)
  • [4] Dong, C.G., Zheng, L.N., Chen, W., Zhang, W.E., Yue, L.: Swap: Exploiting second-ranked logits for adversarial attacks on time series. In: 2023 IEEE International Conference on Knowledge Graph (ICKG). pp. 117–125. IEEE (2023)
  • [5] Fawaz, H.I., Forestier, G., Weber, J., Idoumghar, L., Muller, P.A.: Adversarial attacks on deep neural networks for time series classification. In: 2019 International Joint Conference on Neural Networks (IJCNN). pp. 1–8. IEEE (2019)
  • [6] Forestier, G., Petitjean, F., Senin, P., Despinoy, F., Huaulmé, A., Fawaz, H.I., Weber, J., Idoumghar, L., Muller, P.A., Jannin, P.: Surgical motion analysis using discriminative interpretable patterns. Artificial intelligence in medicine 91, 3–11 (2018)
  • [7] Hills, J., Lines, J., Baranauskas, E., Mapp, J., Bagnall, A.: Classification of time series by shapelet transformation. Data mining and knowledge discovery 28, 851–881 (2014)
  • [8] Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural networks 2(5), 359–366 (1989)
  • [9] Ismail Fawaz, H., Lucas, B., Forestier, G., Pelletier, C., Schmidt, D.F., Weber, J., Webb, G.I., Idoumghar, L., Muller, P.A., Petitjean, F.: InceptionTime: Finding AlexNet for time series classification. Data Mining and Knowledge Discovery 34(6), 1936–1962 (2020)
  • [10] Karim, F., Majumdar, S., Darabi, H.: Adversarial attacks on time series. IEEE transactions on pattern analysis and machine intelligence 43(10), 3309–3320 (2020)
  • [11] Karim, F., Majumdar, S., Darabi, H., Chen, S.: LSTM fully convolutional networks for time series classification. IEEE Access 6, 1662–1669 (2017)
  • [12] Kolmogorov, A.N.: On the representation of continuous functions of several variables by superpositions of continuous functions of a smaller number of variables. American Mathematical Society (1961)
  • [13] Lines, J., Taylor, S., Bagnall, A.: HIVE-COTE: The hierarchical vote collective of transformation-based ensembles for time series classification. In: 2016 IEEE 16th International Conference on Data Mining (ICDM). pp. 1041–1046 (2016). https://doi.org/10.1109/ICDM.2016.0133
  • [14] Liu, Z., Wang, Y., Vaidya, S., Ruehle, F., Halverson, J., Soljačić, M., Hou, T.Y., Tegmark, M.: KAN: Kolmogorov-Arnold networks. arXiv preprint arXiv:2404.19756 (2024)
  • [15] Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083 (2017)
  • [16] Nweke, H.F., Teh, Y.W., Al-Garadi, M.A., Alo, U.R.: Deep learning algorithms for human activity recognition using mobile and wearable sensor networks: State of the art and research challenges. Expert Systems with Applications 105, 233–261 (2018)
  • [17] Pelletier, C., Webb, G.I., Petitjean, F.: Temporal convolutional neural network for the classification of satellite image time series. Remote Sensing 11(5),  523 (2019)
  • [18] Schäfer, P.: The BOSS is concerned with time series classification in the presence of noise. Data Mining and Knowledge Discovery 29, 1505–1530 (2015)
  • [19] Tholeti, T., Kalyani, S.: The robust way to stack and bag: the local lipschitz way. arXiv preprint arXiv:2206.00513 (2022)
  • [20] Wen, Q., Zhou, T., Zhang, C., Chen, W., Ma, Z., Yan, J., Sun, L.: Transformers in time series: A survey. arXiv preprint arXiv:2202.07125 (2022)
  • [21] Zhou, T., Niu, P., Sun, L., Jin, R., et al.: One fits all: Power general time series analysis by pretrained LM. Advances in Neural Information Processing Systems 36, 43322–43355 (2023)