
ViKANformer: Embedding Kolmogorov Arnold Networks in Vision Transformers for Pattern-Based Learning

Shreyas S, School of Computer Science and Engineering, VIT-AP University, India
Akshath M, School of Computer Science and Engineering, VIT-AP University, India
Abstract

Vision Transformers (ViTs) have significantly advanced image classification by applying self-attention on patch embeddings. However, the standard MLP blocks in each Transformer layer may not capture complex nonlinear dependencies optimally. In this paper, we propose ViKANformer, a Vision Transformer in which the MLP sub-layers are replaced with Kolmogorov–Arnold Network (KAN) expansions, including Vanilla KAN, Efficient-KAN, Fast-KAN, SineKAN, and FourierKAN, while also examining a Flash Attention variant. By leveraging the Kolmogorov–Arnold theorem, which guarantees that multivariate continuous functions can be expressed via sums of univariate continuous functions, we aim to boost representational power. Experimental results on MNIST demonstrate that SineKAN, Fast-KAN, and a well-tuned Vanilla KAN can achieve over 97% accuracy, albeit with increased training overhead. This trade-off highlights that KAN expansions may be beneficial if computational cost is acceptable. We detail the expansions, present training/test accuracy and F1/ROC metrics, and provide pseudocode and hyperparameters for reproducibility. Finally, we compare ViKANformer to a simple MLP and a small CNN baseline on MNIST, illustrating the efficiency of Transformer-based methods even on a small-scale dataset.

Index Terms:
Vision Transformer, Kolmogorov Arnold Networks, MNIST, Attention Mechanisms, Deep Learning, Flash Attention

I Introduction

The Transformer architecture [vaswani2017attention] has dramatically improved performance in NLP tasks, and its adaptation to images, the Vision Transformer (ViT) [dosovitskiy2020image], has also achieved strong results. ViTs divide images into patches, embed them, and rely on self-attention over the patch embeddings. However, the feed-forward sub-layers (MLPs) may not optimally capture intricate patterns.

Kolmogorov Arnold Networks (KANs) exploit the Kolmogorov–Arnold theorem [kolmogorov1957representation, kanGeneralArxiv], which states that any continuous function of n variables can be decomposed into sums of univariate continuous mappings plus addition. In practice, expansions such as Sine [SineKAN_arxiv], Fourier, radial basis, or polynomial can be used dimension by dimension. We embed such expansions within ViT feed-forward layers, replacing the standard MLP. Additionally, we experiment with Flash Attention, an approach for more efficient attention, to test synergy with KAN expansions.

Contributions:

  • We propose ViKANformer, a plug-and-play code framework that uses KAN expansions in place of standard MLPs in Vision Transformers.

  • We benchmark multiple KAN variants (Vanilla, Sine, Fourier, Fast, Efficient) plus a Flash Attention version on the MNIST dataset.

  • Empirical results show that while expansions such as SineKAN, Fast-KAN, and a tuned Vanilla KAN can reach 97–98% accuracy, they incur higher training costs (7–47 min/epoch).

  • We discuss a simple MLP and a small CNN baseline on MNIST for comparison, noting that while these methods can reach comparable or higher accuracy with less overhead, our aim is to demonstrate the viability of KAN expansions within Transformer-based pipelines.

II Related Work and Literature

II-A Vision Transformers

ViTs [dosovitskiy2020image] chunk an image into patches (e.g., 16×16 or smaller/larger), flatten, and embed them. Positional embeddings are added, then a series of Transformer blocks with multi-head self-attention plus feed-forward sub-layers is applied. While very successful, research continues on optimizing or improving these feed-forward sub-layers, e.g., MLP-Mixer, ConvMixer, or, in our case, KAN expansions.
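For context, the patch-embedding step is commonly implemented as a strided convolution whose kernel and stride equal the patch size. The following PyTorch sketch is a minimal illustration under assumed MNIST-sized inputs (28×28, patch size 7, embedding width 64), not our exact configuration.

```python
# Hedged sketch of ViT patch embedding: a strided convolution whose kernel
# and stride equal the patch size turns each patch into one token.
# Patch size, embedding dim, and the 1-channel MNIST input are assumptions.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=28, patch_size=7, in_channels=1, embed_dim=64):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                   # x: (batch, 1, 28, 28)
        x = self.proj(x)                    # (batch, embed_dim, 4, 4)
        x = x.flatten(2).transpose(1, 2)    # (batch, num_patches, embed_dim)
        return x + self.pos                 # add learnable positional embeddings
```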

II-B Kolmogorov Arnold Theorem

Kolmogorov [kolmogorov1957representation] proved that any continuous f(𝐱) on [0,1]^n can be expressed as finite sums of univariate continuous functions plus addition. The theorem is non-constructive, so practical “KANs” use expansions to approximate these univariate pieces. Recent expansions:

  • Sine expansions [SineKAN_arxiv],

  • Fourier expansions,

  • Radial basis expansions,

  • Polynomial or B-spline expansions.

They can be dimension-wise or can share parameters across dimensions, with varying overhead.
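Concretely, a standard statement of the representation that these expansions approximate is

f(x_{1},\dots,x_{n}) \;=\; \sum_{q=1}^{2n+1} \Phi_{q}\!\Bigl(\sum_{p=1}^{n} \psi_{q,p}(x_{p})\Bigr),

where each Φ_q and ψ_{q,p} is a univariate continuous function; practical KAN layers replace these with parametric expansions such as those listed above.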

II-C Flash Attention

Flash Attention is a more efficient attention mechanism that computes the 𝐐𝐊^⊤ blocks in a memory-optimized way. Some prior works incorporate better feed-forward designs with Flash Attention to further accelerate Transformers. We attempt a FlashKAN approach, combining Flash-based self-attention with KAN expansions in the feed-forward sub-layer.
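As a hedged illustration (not necessarily the exact kernel or block used in our experiments), the sketch below pairs PyTorch's fused scaled-dot-product attention entry point, which may dispatch to a Flash-style kernel, with a pluggable KAN feed-forward module; the class name FlashKANBlock and the kan_ffn argument are illustrative assumptions.

```python
# Hedged sketch: a Transformer block using PyTorch's fused attention entry point
# (which may select a memory-efficient/Flash kernel) with a pluggable KAN feed-forward.
# Module names and the `kan_ffn` argument are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlashKANBlock(nn.Module):
    def __init__(self, dim, n_heads, kan_ffn):
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads = n_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = kan_ffn  # any dimension-wise KAN expansion module

    def forward(self, x):                     # x: (batch, tokens, dim)
        b, t, d = x.shape
        qkv = self.qkv(self.norm1(x)).reshape(b, t, 3, self.n_heads, d // self.n_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (batch, heads, tokens, head_dim)
        attn = F.scaled_dot_product_attention(q, k, v)  # fused attention kernel
        attn = attn.transpose(1, 2).reshape(b, t, d)
        x = x + self.proj(attn)               # residual around attention
        x = x + self.ffn(self.norm2(x))       # residual around KAN feed-forward
        return x
```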

III ViKANformer Architecture

III-A Replacing MLP with KAN

Our approach is to replace the standard MLP block in the Transformer layer with dimension-wise KAN expansions. Suppose we have d-dimensional embeddings. A KAN feed-forward block has the form:

\mathbf{y} \;=\; \mathbf{W}\,\bigl[\phi_{1}(x_{1}) \oplus \cdots \oplus \phi_{d}(x_{d})\bigr],

where each x_j passes through a parametric univariate function ϕ_j. For instance, in SineKAN:

\phi_{j}(x_{j}) \;=\; \sum_{m=1}^{M} \alpha_{j,m}\,\sin\bigl(\omega_{j,m}\,x_{j} + b_{j,m}\bigr). \qquad (1)

The same overall Transformer structure remains intact—multi-head attention, layer normalization, etc.—but the feed-forward sub-layer is replaced by the chosen KAN variant. This modular “plug-and-play” design allows quick experimentation with different expansions.
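As a minimal sketch of how Eq. (1) can be realized dimension-wise, the PyTorch module below is illustrative only; the number of frequencies M, the initialization, and the layer names are assumptions rather than our exact training configuration.

```python
# Minimal sketch of a dimension-wise SineKAN feed-forward layer (Eq. (1)).
# Hyperparameters (M, init scales) are assumptions for illustration only.
import torch
import torch.nn as nn

class SineKANFeedForward(nn.Module):
    def __init__(self, dim, num_frequencies=8):
        super().__init__()
        # Per-dimension, per-frequency parameters: alpha, omega, b in Eq. (1).
        self.alpha = nn.Parameter(torch.randn(dim, num_frequencies) * 0.1)
        self.omega = nn.Parameter(torch.randn(dim, num_frequencies))
        self.bias = nn.Parameter(torch.zeros(dim, num_frequencies))
        # Mixing matrix W that recombines the d univariate outputs.
        self.mix = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (..., dim)
        # phi_j(x_j) = sum_m alpha_{j,m} * sin(omega_{j,m} * x_j + b_{j,m})
        phi = (self.alpha * torch.sin(self.omega * x.unsqueeze(-1) + self.bias)).sum(-1)
        return self.mix(phi)                    # y = W [phi_1(x_1) ⊕ ... ⊕ phi_d(x_d)]
```

In a ViKANformer block, such a module simply takes the place of the usual two-layer MLP, so attention, residual connections, and normalization are unchanged.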

III-B Architecture Diagram

Figure 1 illustrates an overview of the ViKANformer, in a two-column figure for clarity. We use a small Vision Transformer on MNIST as a proof of concept. The main modifications affect only the MLP blocks, while the rest of the Transformer (attention, skip connections, normalization) remains standard.

Figure 1: ViKANformer Overview. We show two Transformer blocks with their self-attention sub-layer. The feed-forward sub-layer (normally an MLP) is replaced by a dimension-wise KAN expansion. Various KAN variants (Sine, Fourier, etc.) can be plugged in.