
Wav-KAN: Wavelet Kolmogorov-Arnold Networks

Zavareh Bozorgasl and Hao Chen. Z. Bozorgasl ([email protected]) and H. Chen ([email protected]) are with the Department of Electrical and Computer Engineering, Boise State University, Boise, ID, 83712.
Abstract

In this paper, we introduce Wav-KAN, an innovative neural network architecture that leverages the Wavelet Kolmogorov-Arnold Networks (Wav-KAN) framework to enhance interpretability and performance. (Wav-KAN refers to a family of continuous and discrete wavelet transforms used within KAN; an alternative title for this paper could be "Wavelet for Everybody.") Traditional multilayer perceptrons (MLPs) and even recent advances such as Spl-KAN [1] face challenges related to interpretability, training speed, robustness, computational efficiency, and performance. Wav-KAN addresses these limitations by incorporating wavelet functions into the Kolmogorov-Arnold network structure, enabling the network to capture both high-frequency and low-frequency components of the input data efficiently. Wavelet-based approximations employ orthogonal or semi-orthogonal bases and maintain a balance between accurately representing the underlying data structure and avoiding overfitting to noise. While the continuous wavelet transform (CWT) has a lot of potential, we also employ the discrete wavelet transform (DWT) for multiresolution analysis, which obviates the need to recompute previous steps when extracting finer details and efficiently combines local detailed information where data points are dense with broader trends where data points are sparse; we demonstrate its efficacy in capturing detailed signal characteristics. Analogous to how water conforms to the shape of its container, Wav-KAN adapts to the data structure, resulting in enhanced accuracy, faster training, and increased robustness compared to Spl-KAN and MLPs. Our results highlight the potential of Wav-KAN as a powerful tool for developing interpretable and high-performance neural networks, with applications spanning various fields. This work sets the stage for further exploration and implementation of Wav-KAN in frameworks such as PyTorch and TensorFlow, and for making wavelets in KAN as widely used as activation functions such as ReLU and sigmoid are in networks based on the universal approximation theorem (UAT). The code to replicate the simulations is available at https://github.com/zavareh1/Wav-KAN.

Index Terms:
Kolmogorov-Arnold Networks (KAN), Wavelet, Wav-KAN, Neural Networks.

I Introduction

Advancements in artificial intelligence (AI) have led to the creation of highly proficient AI systems that make decisions for reasons that are not clear to us. This has raised concerns about the widespread deployment of untrustworthy AI systems in the economy and our daily lives, introducing several new risks, including the potential for future AIs to deceive humans to achieve undesirable objectives [2, 3]. Efforts to unravel the black-box behaviour of neural networks have attracted a lot of attention in recent years.
Interpretability of neural networks is crucial as it influences trust in these systems and helps address ethical concerns such as algorithmic discrimination. It is also essential for applying neural networks in scientific fields like drug discovery and genomics, where understanding the model’s decisions is necessary for validation and regulatory compliance [4, 5].
The multilayer feedforward perceptron (MLP) model is among the most widely used and practical neural network models [6]. Despite their widespread usage, MLPs have serious drawbacks: for example, they consume almost all non-embedding parameters in transformers and are less interpretable than attention layers [7, 1, 8].

Model renovation approaches aim to enhance interpretability by incorporating more understandable components into a network. These components can include neurons with specially designed activation functions, additional layers with specific functionalities, modular architectures, and similar features [9].

Breaking down a neural network into individual neurons faces the key challenge of polysemantic neurons, i.e., neurons that activate for several unrelated types of features [10]. The work in [11] explores the reasons behind the emergence of polysemanticity and proposes that it occurs because of superposition, i.e., models learn more distinct features than the available dimensions in a layer. Due to the limitation that a vector space can only contain as many orthogonal vectors as it has dimensions, the network ends up learning an overcomplete set of non-orthogonal features. For superposition to be beneficial, these features must activate sparsely; otherwise, the interference between non-orthogonal features would negate any performance improvements [8].

Kolmogorov-Arnold Networks (KANs), which stand on the Kolmogorov-Arnold representation theorem [12], can bring many advantages, including more interpretability and accuracy [1]. KANs have learnable univariate activation functions on edges, and the nodes sum those activation functions. The integration of KAN, ensembles of probabilistic trees, and multivariate B-spline representations is presented in [13]. However, most previous work on KAN considers the depth-2 representation; the exception is Liu et al. [1], which extends KANs to arbitrary widths and depths. As [1] is based on B-splines, we call it Spl-KAN.
In this paper, we present an improved version of KAN, called Wav-KAN, which uses wavelets in the KAN configuration. Figure 1 shows a Wav-KAN with 2 input features, 3 hidden nodes, and 2 output nodes (Wav-KAN[2,3,2]); the decorative curves in the figure only illustrate the potential of wavelets for approximation and are not the outputs of activation functions of a trained network, which is why some of them are not differentiable. In general, Wav-KAN can have an arbitrary number of layers. The structure is similar to an MLP with the weights replaced by wavelet functions, and the nodes sum those wavelet functions.

Figure 1: Wav-KAN with an arbitrary number of layers (here, Wav-KAN[2,3,2])

Wavelets have been extensively employed in multiresolution analysis [14]. There are some studies of wavelets in neural networks based on the universal approximation theorem (UAT). For example, the complex Gabor wavelet activation function is renowned for its optimal concentration in the space-frequency domain and its exceptional ability to represent images accurately. Hence, [15] used the Gabor wavelet in implicit neural representations (INR) for applications including image denoising, image inpainting, super-resolution, computed tomography reconstruction, image overfitting, and novel view synthesis with neural radiance fields. To the best of our knowledge, our proposed framework is the first work that uses wavelets in the Kolmogorov-Arnold representation theorem for neural networks of arbitrary width and depth.
In comparison with the recently proposed Spl-KAN [1] and with MLPs, the proposed configuration is faster, more accurate, and more robust, effectively addressing the existing issues and significantly enhancing performance.
Also, in statistical learning, we look for models that are both more flexible and more interpretable [16]. Wav-KAN will be a powerful member of the family of KAN glass-box models, which facilitate the interpretability of neural networks. This work introduces Wav-KAN, and we believe that by combining the strengths of wavelets and KAN, it will be widely used across fields and implemented in PyTorch, TensorFlow, and R, to name but a few.

Incorporating Wav-KAN in neural networks yields increasingly explainable components that can achieve similar state-of-the-art performance across diverse tasks; some of the proofs and experiments will be presented in the final version of the paper.

Section II discusses KAN and its generalization to multi-layer KAN with wavelets as the function approximators (i.e., activation functions). Section III presents the continuous wavelet transform, in particular the definition of, and criteria for, a mother wavelet that can be used as a basis for function approximation, and Section IV reviews the discrete wavelet transform. The comparison of Wav-KAN, Spl-KAN, and MLPs is given in Section V. Experimental results are given in Section VI. Finally, Section VII concludes the paper.

II Kolmogorov-Arnold Networks

Kolmogorov-Arnold Networks (KANs) represent a novel twist on neural network design that challenges traditional concepts like the Multi-Layer Perceptron (MLP). At the heart of KANs is a beautiful and somewhat abstract mathematical theorem from Kolmogorov and Arnold.

II-A The Kolmogorov-Arnold Representation Theorem

Let’s start with the theorem that inspires KANs:

Theorem (Kolmogorov-Arnold Representation [12]): For any continuous function $f$ of $n$ variables defined on the cube $[0,1]^n$, there exist $2n+1$ functions $\Phi_q$ and $(2n+1) \times n$ functions $\phi_{q,p}$, all univariate and continuous, such that

f(x_1, \ldots, x_n) = \sum_{q=1}^{2n+1} \Phi_q\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right).    (1)

This theorem tells us that any multivariate continuous function can essentially be decomposed into a sum of functions of sums. The magic here is that the inner functions are univariate, meaning they each take a single input. As we use mother wavelets as our basis, in the Wav-KAN notation we write $\psi_{i,j}(x_j)$ instead of $\phi_{i,j}(x_j)$ and $\Psi_i$ instead of $\Phi_i$.
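As a simple illustration of this "sum of functions of sums" structure (an illustration of the form of equation (1) only, not a full Kolmogorov-Arnold decomposition, which in general requires all $2n+1$ outer terms), the product of two strictly positive variables can be written with one outer function and two univariate inner functions:

x_1 x_2 = \exp\left( \ln x_1 + \ln x_2 \right), \qquad \Phi(u) = e^{u}, \quad \phi_{1,p}(x_p) = \ln x_p, \quad x_p \in (0,1].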

II-B From Theory to Networks

How do we translate this theorem into a neural network architecture? Instead of weights and biases adjusting linear combinations of inputs at each node, KANs modify this process to work with functions. A generalized version of the KAN theorem that corresponds to deeper KANs is the recently published Spl-KAN [1].

  • In KANs, every "weight" is actually a small function of its own. A node in a KAN does not apply a fixed non-linear activation function; instead, each learnable activation function on an edge takes its input and produces an output.

Suppose we have an MLP with $n$ inputs and $m$ outputs in a fully connected layer (between layer $l$ and layer $l+1$). The equation in matrix form is given by:

\mathbf{x}^{(l+1)} = \mathbf{W}_{l+1,l}\, \mathbf{x}^{(l)} + \mathbf{b}^{(l+1)}    (2)

where:

  • $\mathbf{W}_{l+1,l}$ is the weight matrix connecting layer $l$ and layer $l+1$,

  • $\mathbf{x}^{(l)}$ is the input vector,

  • $\mathbf{x}^{(l+1)}$ is the output vector,

  • $\mathbf{b}^{(l+1)}$ is the bias vector for layer $l+1$.

The weight matrix $\mathbf{W}_{l+1,l}$ is expanded as follows:

\mathbf{W}_{l+1,l} = \begin{pmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,n} \\ w_{2,1} & w_{2,2} & \cdots & w_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{m,1} & w_{m,2} & \cdots & w_{m,n} \end{pmatrix}    (3)

where $w_{i,j}$, $i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, n$, is the weight between the $i$-th node in the $(l+1)$-th layer and the $j$-th node in the $l$-th layer. The bias vector $\mathbf{b}^{(l+1)}$ is:

\mathbf{b}^{(l+1)} = \begin{pmatrix} b_1^{(l+1)} \\ b_2^{(l+1)} \\ \vdots \\ b_m^{(l+1)} \end{pmatrix}    (4)

Thus, the complete equation becomes:

\begin{pmatrix} x_1^{(l+1)} \\ x_2^{(l+1)} \\ \vdots \\ x_m^{(l+1)} \end{pmatrix} = \begin{pmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,n} \\ w_{2,1} & w_{2,2} & \cdots & w_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{m,1} & w_{m,2} & \cdots & w_{m,n} \end{pmatrix} \begin{pmatrix} x_1^{(l)} \\ x_2^{(l)} \\ \vdots \\ x_n^{(l)} \end{pmatrix} + \begin{pmatrix} b_1^{(l+1)} \\ b_2^{(l+1)} \\ \vdots \\ b_m^{(l+1)} \end{pmatrix}    (5)

Now, suppose we have $L$ layers, each with the structure described above. Let $\sigma(\cdot)$ be the activation function. The compact formula for the whole network, $f(\mathbf{x})$, where $\mathbf{x}$ is the input vector and $f(\cdot)$ is the neural network, is given by:

f(\mathbf{x}) = \mathbf{x}^{(L)}    (6)

where

\mathbf{x}^{(l+1)} = \sigma\left( \mathbf{W}_{l+1,l}\, \mathbf{x}^{(l)} + \mathbf{b}^{(l+1)} \right)    (7)

for $l = 0, 1, 2, \ldots, L-1$, where $\mathbf{x}^{(0)} = \mathbf{x}$ is the input vector. Unrolling this recursion gives:

f(\mathbf{x}) = \sigma\left( \mathbf{W}_{L}\, \sigma\left( \mathbf{W}_{L-1} \cdots \sigma\left( \mathbf{W}_{2}\, \sigma\left( \mathbf{W}_{1} \mathbf{x} + \mathbf{b}_{1} \right) + \mathbf{b}_{2} \right) \cdots + \mathbf{b}_{L-1} \right) + \mathbf{b}_{L} \right)    (8)

In KAN, the relationship between layers changes as follows. Let $\mathbf{x}^{(l)}$ be a vector of size $n$. We transpose $\mathbf{x}^{(l)}$ and stack it into a matrix with $m$ rows and $n$ columns:

\mathbf{x}^{(l)} \in \mathbb{R}^{n} \quad \Rightarrow \quad (\mathbf{x}^{(l)})^{T} \in \mathbb{R}^{1 \times n}

We construct the matrix $\mathbf{X}_l$ as follows:

\mathbf{X}_l = \begin{pmatrix} (\mathbf{x}^{(l)})^{T} \\ (\mathbf{x}^{(l)})^{T} \\ \vdots \\ (\mathbf{x}^{(l)})^{T} \end{pmatrix} \in \mathbb{R}^{m \times n}

where each row of $\mathbf{X}_l$ is the transposed vector $(\mathbf{x}^{(l)})^{T}$.

We define the operator $T_o$, which acts on the matrix $\Psi_{l+1,l}(\mathbf{X}_l)$. This operator sums the elements of each row of the matrix and outputs the resulting vector $\mathbf{v}$. The definition is as follows:

T_o\left( \Psi_{l+1,l}(\mathbf{X}_l) \right) = \mathbf{v}

where $\mathbf{v}$ is a vector with elements given by:

v_i = \sum_{j} \left[ \Psi_{l+1,l}(\mathbf{X}_l) \right]_{ij} = \sum_{j=1}^{n} \psi_{i,j}(x_j^{(l)}), \quad \text{for } i = 1, 2, \ldots, m

In this expression, $\left[ \Psi_{l+1,l}(\mathbf{X}_l) \right]_{ij}$ represents the element in the $i$-th row and $j$-th column of the matrix $\Psi_{l+1,l}(\mathbf{X}_l)$.

Thus, the operator $T_o$ can be written as:

T_o\left( \Psi_{l+1,l}(\mathbf{X}_l) \right) = \left( \sum_{j} \left[ \Psi_{l+1,l}(\mathbf{X}_l) \right]_{ij} \right)_{i}

In this definition, $T_o$ takes the matrix $\Psi_{l+1,l}(\mathbf{X}_l)$, sums the elements of each row, and outputs the resulting vector $\mathbf{v}$.

Indeed, $\Psi_{l+1,l}$ acts on the input vector $\mathbf{x}^{(l)}$: each element $\psi_{i,j}$ of $\Psi_{l+1,l}$ takes the corresponding element of $\mathbf{x}^{(l)}$ as input, and summing the results along each row produces one element of the output:

\mathbf{X}_{l+1} = \Psi_{l+1,l}(\mathbf{X}_l)    (9)

where:

\Psi_{l+1,l}(\mathbf{X}_l) = \begin{pmatrix} \psi_{1,1}(x_1^{(l)}) & \psi_{1,2}(x_2^{(l)}) & \cdots & \psi_{1,n}(x_n^{(l)}) \\ \psi_{2,1}(x_1^{(l)}) & \psi_{2,2}(x_2^{(l)}) & \cdots & \psi_{2,n}(x_n^{(l)}) \\ \vdots & \vdots & \ddots & \vdots \\ \psi_{m,1}(x_1^{(l)}) & \psi_{m,2}(x_2^{(l)}) & \cdots & \psi_{m,n}(x_n^{(l)}) \end{pmatrix}    (10)

Here, $\Psi_{l+1,l}$ represents the activation functions connecting layer $l$ and layer $l+1$. Each element $\psi_{i,j}(\cdot)$ denotes the activation function that connects the $j$-th neuron in layer $l$ to the $i$-th neuron in layer $l+1$. Instead of a multiplication by a weight, equation (9) computes a function of the input with distinct learnable parameters for each edge.

Hence, if $\mathbf{X}_0$ is taken as the input matrix, whose rows are all copies of the input vector, the output of the entire network after $L$ layers is:

f_{KAN}(\mathbf{X}_0) = \mathbf{x}^{(L)} = T_o\left( \Psi_{L,L-1}(\mathbf{X}_{L-1}) \right) = T_o\left( \Psi_{L,L-1}\begin{pmatrix} (T_o(\Psi_{L-1,L-2}(\mathbf{X}_{L-2})))^{T} \\ (T_o(\Psi_{L-1,L-2}(\mathbf{X}_{L-2})))^{T} \\ \vdots \\ (T_o(\Psi_{L-1,L-2}(\mathbf{X}_{L-2})))^{T} \end{pmatrix} \right)
= \cdots = T_o\left( \Psi_{L,L-1}\begin{pmatrix} (T_o(\Psi_{L-1,L-2}(\cdots T_o(\Psi_{1,0}(\mathbf{X}_0)))))^{T} \\ (T_o(\Psi_{L-1,L-2}(\cdots T_o(\Psi_{1,0}(\mathbf{X}_0)))))^{T} \\ \vdots \\ (T_o(\Psi_{L-1,L-2}(\cdots T_o(\Psi_{1,0}(\mathbf{X}_0)))))^{T} \end{pmatrix} \right)    (11)

In summary, traditional MLPs use fixed nonlinear activation functions at each node and linear weights (and biases) to transform inputs through layers. The output at each layer is computed by a linear transformation followed by a fixed activation function. During backpropagation, gradients of the loss function with respect to weights and biases are calculated to update the model parameters. In contrast, KANs replace linear weights with learnable univariate functions placed on edges rather than nodes. In the nodes, we just have a summation of some univariate functions from previous layers. Each function is adaptable, allowing the network to learn both the activation and transformation of the inputs. This change leads to improved accuracy and interpretability, as KANs can better approximate functions with fewer parameters. During backpropagation in KANs, the gradients are computed with respect to these univariate functions, updating them to minimize the loss function. This results in more efficient learning for complex and high-dimensional functions.
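As a concrete illustration of equations (9)-(11), the following is a minimal sketch of a single KAN layer in PyTorch; it is not the released Wav-KAN code, and the toy per-edge cubic function and tensor layout are our own illustrative choices. Each edge (i, j) carries its own learnable univariate function, and each output node simply sums its incoming edge outputs (the operator $T_o$).

import torch

def kan_layer(x, edge_fn, params):
    # One KAN layer as in Eqs. (9)-(10): an m-by-n matrix of univariate functions
    # psi_{i,j} is applied elementwise, then T_o sums each row.
    #   x:       (batch, n) input vector x^{(l)}
    #   edge_fn: callable applied elementwise, with one parameter vector per edge
    #   params:  (m, n, p) tensor of per-edge parameters
    X = x.unsqueeze(1)                 # (batch, 1, n), broadcast over the m output nodes
    Psi = edge_fn(X, params)           # (batch, m, n): [Psi(X)]_{ij} = psi_{i,j}(x_j)
    return Psi.sum(dim=-1)             # T_o: sum over j, giving (batch, m)

def cubic_edge(X, params):
    # Toy edge function psi_{i,j}(x) = a x + b x^2 + c x^3 (placeholder for a wavelet)
    a, b, c = params[..., 0], params[..., 1], params[..., 2]
    return a * X + b * X**2 + c * X**3

x = torch.randn(4, 5)                                  # batch of 4, n = 5 inputs
params = torch.randn(3, 5, 3, requires_grad=True)      # m = 3 output nodes
out = kan_layer(x, cubic_edge, params)                 # shape (4, 3)

Stacking such layers and training params by backpropagation reproduces the chained form of equation (11).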

II-C Why Bother with KANs?

The flexibility of KANs allows for a more nuanced understanding and adaptation to data. By learning the functions directly involved in data relationships, KANs aim to provide a more accurate and interpretable model:

  • Accuracy: They can fit complex patterns in data more precisely with potentially fewer parameters.

  • Interpretability: Since each function has a specific, understandable role, it’s easier to see what the model is ”thinking.”

In summary, KANs leverage deep mathematical insights to offer a fresh perspective on how neural networks can understand and interact with the world. By focusing on functions rather than mere weights, they promise a richer and more intuitive form of machine learning.

III Continuous Wavelet Transform

The Continuous Wavelet Transform (CWT) is a method mostly used in signal processing to analyze the frequency content of a signal as it varies over time [14]. It acts like a microscope, zooming in on different parts of a signal to determine its constituent frequencies and their variations.

CWT utilizes a base function known as a “mother wavelet,” which serves as a template that can be scaled and shifted to match various parts of the signal. The shape of the mother wavelet is critical as it dictates which features of the signal are highlighted.

Let $\psi \in L^2(\mathbb{R})$ be the mother wavelet and $g(t) \in L^2(\mathbb{R})$ be the function that we want to express in the wavelet basis, where $L^2(\mathbb{R}) = \left\{ f(x) \mid \int |f(x)|^2 \, dx < \infty \right\}$. Then, a mother wavelet must satisfy certain criteria [17, 18]:

  1. Zero Mean: The integral of the wavelet over its entire range must equal zero:

    \int_{-\infty}^{\infty} \psi(t) \, dt = 0    (12)

  2. Admissibility Condition: The admissibility constant $C_\psi$ of the wavelet must be finite:

    C_\psi = \int_{0}^{+\infty} \frac{|\hat{\psi}(\omega)|^2}{\omega} \, d\omega < +\infty    (13)

    where $\hat{\psi}$ is the Fourier transform of the wavelet $\psi(t)$.

The CWT of a signal/function is represented by wavelet coefficients, calculated as follows:

C(s, \tau) = \int_{-\infty}^{+\infty} g(t) \, \frac{1}{\sqrt{s}} \, \psi\left( \frac{t - \tau}{s} \right) dt    (14)

where:

  • $g(t)$ is the signal/function that we want to approximate in the wavelet basis.

  • $\psi(t)$ is the mother wavelet.

  • $s \in \mathbb{R}^{+}$ is the scale factor, which is greater than zero.

  • $\tau \in \mathbb{R}$ is the shift factor.

  • $C(s, \tau)$ measures the match between the wavelet and the signal at scale $s$ and shift $\tau$.

A signal can be reconstructed from its wavelet coefficients using the inverse CWT:

g(t) = \frac{1}{C_\psi} \int_{-\infty}^{+\infty} \int_{0}^{+\infty} C(s, \tau) \, \frac{1}{\sqrt{s}} \, \psi\left( \frac{t - \tau}{s} \right) \frac{ds \, d\tau}{s^2}    (15)

where $C_\psi$ is the admissibility constant of equation (13), which depends on the wavelet and ensures the accuracy of the reconstruction.
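To make equation (14) concrete, the following is a minimal numerical sketch of a single CWT coefficient computed by a Riemann sum; the Mexican hat wavelet with $\sigma = 1$ (see Table II and equation (19)), the test signal, and the chosen scale and shift are illustrative assumptions, not the settings used in our experiments.

import numpy as np

def mexican_hat(t):
    # Mexican hat mother wavelet with sigma = 1 (sign convention of Eq. (19) / Table II)
    return (2.0 / (np.sqrt(3.0) * np.pi**0.25)) * (t**2 - 1.0) * np.exp(-t**2 / 2.0)

def cwt_coefficient(g, t, s, tau):
    # C(s, tau) = int g(t) (1/sqrt(s)) psi((t - tau)/s) dt, approximated by a Riemann sum
    dt = t[1] - t[0]
    return np.sum(g * mexican_hat((t - tau) / s) / np.sqrt(s)) * dt

t = np.linspace(-5.0, 5.0, 2001)             # uniform sampling grid
g = np.sin(2.0 * np.pi * t) * np.exp(-t**2)  # a localized toy signal
print(cwt_coefficient(g, t, s=0.5, tau=0.0))

Sweeping $s$ and $\tau$ over grids and stacking the results yields the usual scalogram of the signal.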

IV Discrete Wavelet Transform

The Discrete Wavelet Transform (DWT) is a widely used method in signal processing that decomposes a signal into different frequency components with high efficiency [14]. Unlike the Continuous Wavelet Transform (CWT), which analyzes the signal at every possible scale and translation, the DWT uses discrete sampling intervals, which is suitable for digital applications. When the data points are close together (high sampling rate), the wavelet basis can zoom in to capture the high-frequency details (fine details) in that local area. When the data points are spread out (low sampling rate), the wavelet basis can zoom out to capture the low-frequency trends (overall shape) using global information. This holds for both uniformly and irregularly sampled data. Indeed, by employing wavelets we combine local detailed information where the data points are dense with broader trends where the data points are sparse.

The DWT employs a set of base functions derived from a single mother wavelet through scaling and translation. These base functions are orthonormal, ensuring that the DWT provides a compact and non-redundant representation of the signal. By applying high-pass and low-pass filters iteratively to the signal, the DWT produces approximation and detail coefficients at various levels of resolution, allowing for a multi-resolution analysis.

Let $\psi \in L^2(\mathbb{R})$ be the mother wavelet and $\phi \in L^2(\mathbb{R})$ be the scaling function. The signal $g(t) \in L^2(\mathbb{R})$ is decomposed into approximation coefficients $a_j(k)$ and detail coefficients $d_j(k)$ at level $j$ as follows:

a_j(k) = \sum_{n} g(n) \, \phi_{j,k}(n)    (16)

d_j(k) = \sum_{n} g(n) \, \psi_{j,k}(n)    (17)

where:

  • $g(n)$ is the discrete signal.

  • $\phi_{j,k}(n)$ is the scaling function at scale $j$ and position $k$.

  • $\psi_{j,k}(n)$ is the wavelet function at scale $j$ and position $k$.

  • $a_j(k)$ represents the approximation coefficients.

  • $d_j(k)$ represents the detail coefficients.

The original signal can be reconstructed from its wavelet coefficients using the inverse DWT, ensuring that no information is lost during the transformation process. This reconstruction is given by:

g(n) = \sum_{k} a_J(k) \, \phi_{J,k}(n) + \sum_{j=1}^{J} \sum_{k} d_j(k) \, \psi_{j,k}(n)    (18)

where $J$ is the number of decomposition levels and $a_J(k)$ are the approximation coefficients at the coarsest level.
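The following is a small sketch of this decomposition and reconstruction round trip using the PyWavelets library; the choice of the 'db2' wavelet, the three levels, and the toy signal are illustrative assumptions rather than the settings of our experiments.

import numpy as np
import pywt  # PyWavelets

# A toy signal: a low-frequency trend plus a late high-frequency burst
n = np.arange(512)
g = np.sin(2 * np.pi * n / 128) + (n > 300) * 0.5 * np.sin(2 * np.pi * n / 8)

# Multi-level DWT: coeffs = [a_J, d_J, d_{J-1}, ..., d_1], cf. Eqs. (16)-(18)
coeffs = pywt.wavedec(g, 'db2', level=3)

# Reconstruction via the inverse DWT recovers the signal up to numerical precision
g_rec = pywt.waverec(coeffs, 'db2')
print(np.max(np.abs(g - g_rec[:len(g)])))

Refining the analysis to a deeper level only requires further filtering of the coarsest approximation coefficients; the detail coefficients already computed are reused unchanged, which is the computational saving discussed below.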

The DWT’s ability to provide both time and frequency localization, coupled with its computational efficiency, makes it an indispensable tool in various fields such as image compression, noise reduction, and feature extraction.

Multi-resolution analysis (MRA) using wavelets, specifically discrete wavelet transforms (DWT), is a powerful technique in signal processing and data analysis. MRA allows the decomposition of a signal into different levels of detail by using wavelets, which are localized functions. The DWT achieves this by recursively applying high-pass and low-pass filters to the data/signal, creating approximations and details at various resolutions. This process provides a hierarchical framework, where each level captures different frequency components of the signal, making it easier to analyze localized features and transient phenomena. The ability to zoom in on fine details while retaining the broader structure of the signal is a key advantage of MRA, making it useful in applications such as image compression, noise reduction, and feature extraction.

In the context of discrete wavelet transforms, the data/signal is represented as a sum of wavelet functions at different scales and positions. In wavelet analysis, once we compute the initial set of coefficients, we do not need to recompute them when calculating additional coefficients at finer resolutions. This efficiency arises because wavelet transforms leverage previously computed coefficients to generate further details. Consequently, the hierarchical nature of the multi-resolution analysis ensures that the earlier computations remain valid and reusable, streamlining the process and significantly reducing the computational overhead. This property makes wavelet-based methods particularly advantageous for iterative and real-time signal processing applications. The multi-resolution property of wavelets is particularly beneficial for analyzing non-stationary signals, which have frequency components that vary over time. By breaking down the signal into different resolution levels, MRA enables a more comprehensive understanding of its structure and characteristics. Moreover, the DWT's ability to provide both time and frequency localization offers a significant advantage over traditional Fourier transforms, which only provide frequency information. (We will add some simulations to this section.)

In summary, Wav-KAN, by employing wavelets, adapts to the varying density of data points, using more detailed information where there are more data points and less detailed information where there are fewer. This approach is particularly useful for real-world data, which often suffers from irregular sampling and data dropouts.

V Wav-KAN or Spl-KAN or MLPs?

Wavelets and B-splines are two prominent methods used for function approximation, each with distinct advantages and limitations, particularly when applied in neural networks. B-splines provide smooth and flexible function approximations through piecewise polynomial functions defined over control points. They offer local control, meaning adjustments to a control point affect only a specific region, which is advantageous for precise function tuning. This smoothness and local adaptability make B-splines suitable for applications requiring continuous and refined approximations, such as in CAD and computer graphics. However, the computational complexity increases significantly with higher dimensions, making them less practical for high-dimensional data. Managing knot placement can also be intricate and affect the overall shape and accuracy of the approximation. While B-splines can be used in neural networks to approximate activation functions or smooth decision boundaries, their application is generally less suited to feature extraction tasks compared to wavelets due to their limited ability to handle multi-resolution analysis and sparse representations.

Wavelets excel in multi-resolution analysis, enabling different levels of detail to be represented simultaneously, which makes them a valuable tool for decomposing data into various frequency components. This capability is highly beneficial for feature extraction in neural networks, as it enables capturing both high-frequency details and low-frequency trends. Additionally, wavelets offer sparse representations, which can lead to more efficient neural network architectures and faster training times. They are also well-suited for handling non-stationary signals and localized features, making them ideal for applications such as image recognition and signal classification. However, the choice of the wavelet function is crucial and can significantly impact performance, and edge effects can introduce artifacts that need special handling.

In function approximation, wavelets excel by maintaining a delicate balance between accurately representing the underlying data structure and avoiding overfitting to the noise. Unlike traditional methods that may overly smooth the data or fit to noise, wavelets achieve this balance through their inherent ability to capture both local and global features. By decomposing the data into different frequency components, wavelets can isolate and retain significant patterns while discarding irrelevant noise. This multiresolution analysis ensures that the approximation is robust and reliable, providing a more accurate and nuanced representation of the original data without the pitfalls of noise overfitting. On the other hand, while Spl-KAN is powerful in capturing changes in the data, it also captures the noise in the training data. Indeed, the strength of Spl-KAN is also its weakness.
The advantages of Spl-KAN mentioned in [1], including interpretability and/or accuracy with respect to MLPs, also hold for Wav-KAN. More importantly, Wav-KAN solves the major disadvantage of Spl-KAN, namely its slow training speed. In terms of the number of parameters, we compare Wav-KAN with Spl-KAN and MLPs for a hypothetical neural network that has $N$ input nodes and $N$ output nodes per layer, with $L$ layers. As shown in Table I, and considering the value of $G$, Wav-KAN has fewer parameters than Spl-KAN ($k$ should be at least 2 to obtain good results with Spl-KAN, especially in complex tasks); a worked example follows the table. The coefficient 3 arises because each Wav-KAN activation has a learnable weight, a translation, and a scaling. The learnable parameters of each network are listed in the Parameters column. While the order for MLPs is lower for this hypothetical network, in practice Wav-KAN needs fewer parameters to learn the same task. Indeed, this originates from the capacity of wavelets to capture both low-frequency and high-frequency functions.

TABLE I: Comparison of MLPs, Spl-KAN, and Wav-KAN for a network with $L$ layers, each with $N$ nodes

Neural Network | Order | Learnable Parameters
MLPs | $O(N^2 L)$ or $O(N^2 L + N L)$ | weights and biases
Spl-KAN | $O(N^2 L (G + k + 1)) \sim O(N^2 L G)$ | weights
Wav-KAN | $O(3 N^2 L)$ | weight, translation, scaling
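For concreteness, taking the illustrative values $N = 100$, $L = 3$, $G = 5$, and $k = 3$ (hypothetical values, not the settings of Section VI) gives

N^2 L = 3 \times 10^{4} \ \text{(MLP weights)}, \qquad N^2 L (G + k + 1) = 2.7 \times 10^{5} \ \text{(Spl-KAN)}, \qquad 3 N^2 L = 9 \times 10^{4} \ \text{(Wav-KAN)},

so under this count Wav-KAN needs roughly one third of the parameters of Spl-KAN for the same width and depth.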

Regarding the implementation, Spl-KAN requires a smooth function, like $b(x)$ in equation (2.10) of [1], to capture some global features. Because of the inherent scaling property of wavelets, Wav-KAN does not need such an additional term in its activation functions, which helps Wav-KAN to be faster. For example, the wavelet equal to the second derivative of a Gaussian, called the Mexican hat (the equation below is the normalized Mexican hat wavelet; PyWavelets defines it with the opposite sign, which does not matter because of the weight we put in front of the mother wavelet in our activation functions, see https://pywavelets.readthedocs.io/en/latest/ref/cwt.html) and first used in computer vision to detect multiscale edges [19, 14], has the following form:

\psi(t) = \frac{2}{\pi^{1/4} \sqrt{3\sigma}} \left( \frac{t^2}{\sigma^2} - 1 \right) \exp\left( \frac{-t^2}{2\sigma^2} \right).    (19)

where $\sigma$ denotes the adjustable standard deviation of the Gaussian. In our experiments, we use $\psi_{exp}(t)$, defined as

\psi_{exp}(t) = w\, \psi(t)    (20)

Indeed, $w$ plays the role of the CWT coefficient that multiplies the mother wavelet; as $w$ is a learnable parameter, it helps adapt the shape of the mother wavelet to the function being approximated. A sketch of such an activation inside a Wav-KAN layer is given below.
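The following PyTorch sketch shows one way to realize a Wav-KAN layer with the Mexican hat mother wavelet, a learnable weight $w$, translation $\tau$, and scale $s$ per edge, plus the batch normalization discussed later in this section. It is an illustrative sketch under these assumptions, not the released implementation (see the GitHub repository for that); the class and parameter names are our own.

import torch
import torch.nn as nn

class MexicanHatKANLayer(nn.Module):
    # One Wav-KAN layer: every edge (i, j) applies w_ij * psi((x_j - tau_ij) / s_ij)
    # with the Mexican hat mother wavelet, and each output node sums over its edges
    # (cf. Eqs. (9)-(11) and (19)-(20)).
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.1)  # w
        self.translation = nn.Parameter(torch.zeros(out_features, in_features))   # tau
        self.scale = nn.Parameter(torch.ones(out_features, in_features))          # s
        self.bn = nn.BatchNorm1d(out_features)  # batch normalization, as in Section V

    def forward(self, x):                                         # x: (batch, in_features)
        t = (x.unsqueeze(1) - self.translation) / self.scale      # (batch, out, in)
        # Mexican hat with sigma = 1 (Table II); the sign is absorbed by the weight w
        psi = (2.0 / (3.0**0.5 * torch.pi**0.25)) * (t**2 - 1.0) * torch.exp(-t**2 / 2.0)
        return self.bn((self.weight * psi).sum(dim=-1))           # (batch, out_features)

# A [28*28, 32, 10] Wav-KAN as used in Section VI
model = nn.Sequential(MexicanHatKANLayer(28 * 28, 32), MexicanHatKANLayer(32, 10))
logits = model(torch.randn(8, 28 * 28))   # shape (8, 10)

Other mother wavelets from Table II can be swapped in by replacing the expression for psi, and $\sigma$ or $\omega_0$ could likewise be made learnable parameters.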

Moreover, Spl-KAN heavily depends on the grid spaces, and for better performance, it requires increasing the number of grids 101010Which is equivalent to decrease the spaces between the grids. This helps Spl-KAN accuracy to be improved., though, it brings two disadvantages. First it needs curvefitting which is a cumbersome and computational expensive operation, and also while we increase the number of grids, loss has some jumps 111111Loss will be increased for some training, then, it works better. See Figure 2.3 [1] for better illustration.. Fortunately, wavelet is safe from such computations and deficiencies and if one wants to capture more details, DWT efficiently does that without recalculation of the previous steps.

Last but not least, we found that batch normalization [20] significantly improves accuracy and speeds up training of both Wav-KAN and Spl-KAN; hence, we included batch normalization in both of these methods (in [1], the authors did not mention or apply batch normalization).

VI Simulation Results

In this section, we present the results of our experiments conducted using the KAN (Kolmogorov-Arnold Network) model with various continuous wavelet transformations (CWTs) on the MNIST dataset, utilizing a training set of 60,000 images and a test set of 10,000 images; more simulations, including discrete-wavelet ones for MRA, will be added. It is important to note that our objective was not to optimize the parameters to their best possible values, but rather to demonstrate that Wav-KAN performs well in terms of overall performance. We have incorporated batch normalization into both Spl-KAN and Wav-KAN, resulting in improved performance. The wavelet types considered in our study include the Mexican hat, Morlet, derivative of Gaussian (DOG), and Shannon wavelets (see Table II). For each wavelet type, and for Spl-KAN, we performed five trials, training the model for 50 epochs per trial.

TABLE II: Mother Wavelet Formulas and Parameters

Wavelet Type | Formula of Mother Wavelet | Parameters
Mexican hat | $\psi(t) = \frac{2}{\sqrt{3}\,\pi^{1/4}} \left( t^2 - 1 \right) e^{-t^2/2}$ | $\tau$, $s$
Morlet | $\psi(t) = \cos(\omega_0 t)\, e^{-t^2/2}$ | $\tau$, $s$, with $\omega_0 = 5$
Derivative of Gaussian (DOG) | $\psi(t) = -\frac{d}{dt}\left( e^{-t^2/2} \right)$ | $\tau$, $s$
Shannon | $\psi(t) = \mathrm{sinc}(t/\pi) \cdot w(t)$ | $\tau$, $s$, and $w(t)$: window function

The results were averaged across these trials to ensure the robustness and reliability of our findings.
Both Wav-KAN and Spl-KAN have the structure (number of nodes) [28*28, 32, 10], i.e., [first-layer nodes, middle-layer nodes, output nodes]. Although we enhanced Spl-KAN by using spline order 3 and a grid size of 5, this approach is computationally much more expensive than Wav-KAN. We employed the AdamW optimizer [21, 22] with a learning rate of 0.001 and a weight decay of $10^{-4}$. The loss is cross entropy. A sketch of this training setup is given below.
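For reference, a minimal sketch of this setup, reusing the MexicanHatKANLayer sketch from Section V; the batch size and the plain training loop are illustrative assumptions, not the exact pipeline behind the reported results.

import torch
import torch.nn as nn
from torchvision import datasets, transforms

# MNIST with the [28*28, 32, 10] structure described above
train_set = datasets.MNIST("./data", train=True, download=True,
                           transform=transforms.ToTensor())
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

model = nn.Sequential(nn.Flatten(), MexicanHatKANLayer(28 * 28, 32),
                      MexicanHatKANLayer(32, 10))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()

for epoch in range(50):                      # 50 epochs per trial
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()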

Figure 2: Training accuracy of Wav-KAN [28*28,32,10] versus Spl-KAN [28*28,32,10]

Figure 3: Test accuracy of Wav-KAN [28*28,32,10] versus Spl-KAN [28*28,32,10]

Figures 2 and 3 show the training and test accuracy of Wav-KAN in comparison to Spl-KAN. To avoid clutter, we show only the results for the derivative of Gaussian (DOG) and Mexican hat wavelets, as representative samples. By fine-tuning and using all the degrees of freedom of wavelets (such as the frequency of the sinusoid and the variance of the Gaussian), wavelets show significant superiority; we have carried out many such experiments and will publish them soon. While Spl-KAN has better performance in training, this is due to overfitting to the data, and many wavelet types have shown superior test performance with respect to Spl-KAN. Indeed, for this experiment we set the variance of the Gaussian in the wavelets to 1; better variances could be found by grid search or by making the variance a learnable parameter. One may also consider making $\omega_0$ in the Morlet wavelet and the frequency and/or windowing parameters of the Shannon wavelet learnable. Indeed, we can use the available degrees of freedom to make the shape of the wavelets more flexible. Wavelets strike a balance by not fitting the noise in the data.

We evaluated the performance of each wavelet type in terms of training loss, training accuracy, validation loss, and validation accuracy. Figures 2 and 3 summarize the results, depicting the training and validation metrics averaged over the five trials for each wavelet type.
The simulation results indicate that the choice of wavelet significantly impacts the performance of the KAN model. Wavelets such as the DOG and Mexican hat appear particularly effective at capturing the essential features of the MNIST dataset while maintaining robustness against noise. On the other hand, wavelets like Shannon and Bump did not perform as well, highlighting the importance of wavelet selection in designing neural networks with wavelet transformations.

VII Conclusion

In this paper, we have introduced Wav-KAN, a novel neural network architecture that integrates wavelet functions within the Kolmogorov-Arnold Networks (KAN) framework to enhance both interpretability and performance. By leveraging the multiresolution analysis capabilities of wavelets, Wav-KAN effectively captures complex data patterns and provides a robust solution to the limitations faced by traditional multilayer perceptrons (MLPs) and recently proposed Spl-KANs.

Our experimental results demonstrate that Wav-KAN not only achieves superior accuracy but also benefits from faster training speeds compared to Spl-KAN. The unique structure of Wav-KAN, which combines the strengths of wavelet transforms and the Kolmogorov-Arnold representation theorem, allows for more efficient parameter usage and improved model interpretability.

Wav-KAN represents a significant advancement in the design of interpretable neural networks. Its ability to handle high-dimensional data and provide clear insights into model behavior makes it a promising tool for a wide range of applications, from scientific research to industrial deployment. Future work will focus on further optimizing the Wav-KAN architecture, exploring its applicability to other datasets and tasks, and implementing the framework in popular machine learning libraries such as PyTorch and TensorFlow.

Overall, Wav-KAN stands out as a powerful and versatile model, paving the way for the development of more transparent and efficient neural network architectures. Its potential to combine high performance with interpretability marks a crucial step forward in the field of artificial intelligence.

References

  • [1] Z. Liu, Y. Wang, S. Vaidya, F. Ruehle, J. Halverson, M. Soljačić, T. Y. Hou, and M. Tegmark, “Kan: Kolmogorov-arnold networks,” arXiv preprint arXiv:2404.19756, 2024.
  • [2] D. Hendrycks, M. Mazeika, and T. Woodside, “An overview of catastrophic ai risks,” arXiv preprint arXiv:2306.12001, 2023.
  • [3] R. Ngo, L. Chan, and S. Mindermann, “The alignment problem from a deep learning perspective,” arXiv preprint arXiv:2209.00626, 2022.
  • [4] Y. Zhang, P. Tiňo, A. Leonardis, and K. Tang, “A survey on neural network interpretability,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 5, no. 5, pp. 726–742, 2021.
  • [5] F. Doshi-Velez and B. Kim, “Towards a rigorous science of interpretable machine learning,” arXiv preprint arXiv:1702.08608, 2017.
  • [6] A. Pinkus, “Approximation theory of the mlp model in neural networks,” Acta numerica, vol. 8, pp. 143–195, 1999.
  • [7] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [8] H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey, “Sparse autoencoders find highly interpretable features in language models,” arXiv preprint arXiv:2309.08600, 2023.
  • [9] F.-L. Fan, J. Xiong, M. Li, and G. Wang, “On interpretability of artificial neural networks: A survey,” IEEE Transactions on Radiation and Plasma Medical Sciences, vol. 5, no. 6, pp. 741–760, 2021.
  • [10] C. Olah, N. Cammarata, L. Schubert, G. Goh, M. Petrov, and S. Carter, “Zoom in: An introduction to circuits,” Distill, vol. 5, no. 3, pp. e00 024–001, 2020.
  • [11] N. Elhage, R. Lasenby, and C. Olah, "Privileged bases in the transformer residual stream," 2023. URL: https://transformer-circuits.pub/2023/privilegedbasis/index.html. Accessed 08-07-2023.
  • [12] A. N. Kolmogorov, On the representation of continuous functions of several variables by superpositions of continuous functions of a smaller number of variables.   American Mathematical Society, 1961.
  • [13] D. Fakhoury, E. Fakhoury, and H. Speleers, “Exsplinet: An interpretable and expressive spline-based neural network,” Neural Networks, vol. 152, pp. 332–346, 2022.
  • [14] S. Mallat, A wavelet tour of signal processing.   Elsevier, 1999.
  • [15] V. Saragadam, D. LeJeune, J. Tan, G. Balakrishnan, A. Veeraraghavan, and R. G. Baraniuk, "Wire: Wavelet implicit neural representations," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 18507–18516.
  • [16] G. James, D. Witten, T. Hastie, R. Tibshirani, and J. Taylor, An introduction to statistical learning: With applications in python.   Springer Nature, 2023.
  • [17] A. Calderón, “Intermediate spaces and interpolation, the complex method,” Studia Mathematica, vol. 24, no. 2, pp. 113–190, 1964.
  • [18] A. Grossmann and J. Morlet, “Decomposition of hardy functions into square integrable wavelets of constant shape,” SIAM journal on mathematical analysis, vol. 15, no. 4, pp. 723–736, 1984.
  • [19] A. P. Witkin, “Scale-space filtering,” in Readings in computer vision.   Elsevier, 1987, pp. 329–332.
  • [20] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International conference on machine learning.   pmlr, 2015, pp. 448–456.
  • [21] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [22] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.