
MatrixKAN:
Parallelized Kolmogorov-Arnold Network

Cale Coffman    Lizhong Chen
Abstract

Kolmogorov-Arnold Networks (KAN) are a new class of neural network architecture representing a promising alternative to the Multilayer Perceptron (MLP), demonstrating improved expressiveness and interpretability. However, KANs suffer from slow training and inference speeds relative to MLPs due in part to the recursive nature of the underlying B-spline calculations. This issue is particularly apparent with respect to KANs utilizing high-degree B-splines, as the number of required non-parallelizable recursions is proportional to B-spline degree. We solve this issue by proposing MatrixKAN, a novel optimization that parallelizes B-spline calculations with matrix representation and operations, thus significantly improving effective computation time for models utilizing high-degree B-splines. In this paper, we demonstrate the superior scaling of MatrixKAN’s computation time relative to B-spline degree. Further, our experiments demonstrate speedups of approximately 40x relative to KAN, with significant additional speedup potential for larger datasets or higher spline degrees. (Code: Our publicly available implementation of MatrixKAN is provided in the following GitHub repository: https://github.com/OSU-STARLAB/MatrixKAN)


1 Introduction

Kolmogorov-Arnold Networks (KAN) are a new class of neural network architecture representing a promising alternative to Multilayer Perceptrons (MLP). Like MLPs, KANs are composed of a fully connected network of nodes and edges. However, unlike MLPs, which utilize fixed activation functions on network nodes, KANs utilize learnable activation functions on network edges. Along each edge of a KAN, a parameterized B-spline is utilized to model functions of arbitrary degree, making KANs both more expressive and more interpretable (Samadi et al., 2024).

Initial empirical testing of KANs has yielded promising results (Liu et al., 2024a; Jamali et al., 2024; Peng et al., 2024; Hu et al., 2024; Koenig et al., 2024; Carlo et al., 2024; Zhou et al., 2024; Liu et al., 2025). For example, compared to MLPs with comparable parameter counts, KANs trained to model various physics functions have (at worst) achieved comparable loss levels and (at best) achieved loss levels several orders of magnitude better than their MLP counterparts (Liu et al., 2024a). While KANs have demonstrated the potential to outperform traditional MLPs, due to their use of parameterized B-splines as learnable activation functions, KANs suffer from significantly slower training and inference speeds. One major contributing factor to this increase in computation time is the recursive algorithm required for calculating B-spline outputs—the Cox-De Boor recursion algorithm.

As the name suggests, the Cox-De Boor recursion algorithm requires sequential, recursive calls to calculate a single B-spline output, and the number of recursive calls varies linearly with the degree of the B-spline. Perhaps due to this limitation, KANs typically utilize relatively low-degree B-splines (e.g., degree 3); however, our experiments suggest that certain classes of functions may be more effectively modeled by higher-degree B-splines and that use of higher-degree B-splines may result in enhanced model expressiveness.

To reduce the time complexity of using high-degree B-splines as learnable activation functions, we propose MatrixKAN, a novel optimization that parallelizes B-spline calculations, significantly improving effective computation time for KANs utilizing high-degree B-splines. This is achieved by adapting the generalized matrix representation of B-splines to KAN and parallelizing the Cox-De Boor recursion algorithm by decomposing its calculations into matrix operations, with all recursive operations represented by a single matrix that is precalculated at model initialization for B-splines of a given degree. To substantiate this optimization, we perform detailed analyses of the theoretical complexity of each architecture and demonstrate that the optimized MatrixKAN approach results in significantly improved scaling of B-spline calculation time with respect to B-spline degree.

We trained and evaluated our MatrixKAN models on the Feynman physics equations modeled by KAN in (Liu et al., 2024a) and compared both computational efficiency and model performance. To facilitate comparison with KAN, we evaluated computational efficiency and model performance using the same metrics reported in (Liu et al., 2024a)—(i) seconds per training step and (ii) RMSE loss, respectively. With respect to computational efficiency, our results demonstrate that KAN’s effective computation time scales linearly relative to B-spline degree, while MatrixKAN’s effective computation time is independent of B-spline degree, resulting in speedups of up to approximately 40x, with potential for greater speedups using large datasets or higher spline degrees. With respect to model performance, we demonstrate that MatrixKAN achieves loss levels consistent with those achieved by KAN. Further, we show that for various modeled functions, KANs of relatively high degree (i.e., 4 or higher) yield improved loss levels, up to approximately 27.0%, as compared to KANs of relatively low degree (i.e., 3 or less)—cases for which use of the MatrixKAN architecture would result in significantly reduced training and inference time.

The main contributions of this paper are:

  1. Proposed adaptation of the generalized matrix representation of B-splines for B-spline calculations in KAN.

  2. Developed MatrixKAN, an efficient, parallelized implementation of KAN.

  3. Demonstrated that the effective computation time of MatrixKAN scales favorably with the degree of the B-spline employed within the model, resulting in more efficient calculations for high-degree B-splines.

  4. Demonstrated that KANs / MatrixKANs utilizing relatively high-degree B-splines yield improved performance when modeling various functions.

2 Background and Related Work

2.1 Kolmogorov-Arnold Networks

In addition to the features already noted, KANs differ from MLPs in their theoretical foundation. Unlike MLPs, whose theoretical basis as a function modeling framework derives from the universal approximation theorem, KANs rely on the Kolmogorov-Arnold representation theorem, which provides that every multivariate continuous function f can be represented as a finite composition of continuous, univariate functions and the binary operation of addition:

f(x) = f(x_{1}, ..., x_{n}) = \sum_{q=1}^{2n+1} \Phi_{q} \left( \sum_{p=1}^{n} \phi_{q,p}(x_{p}) \right)   (1)

where \phi_{q,p}: [0,1] \rightarrow \mathbb{R} and \Phi_{q}: \mathbb{R} \rightarrow \mathbb{R} (Kolmogorov, 1957; Liu et al., 2024a). This traditional formulation of the Kolmogorov-Arnold representation theorem is analogous to a small neural network with (i) n input neurons, (ii) a single hidden layer with 2n+1 neurons, and (iii) a single output neuron. \phi_{q,p} represent the edges between input neurons and neurons in the hidden layer, and \Phi_{q} represent the edges between neurons in the hidden layer and the output neuron.

To generalize this approach for modeling more complex systems and to facilitate training of such networks using traditional neural network regression strategies, KANs expand this two-layer network to networks of arbitrary width and depth and parameterize all edge functions using B-splines. The shape of a KAN is represented as [n_{0}, n_{1}, ..., n_{L}], where n_{i} is the number of nodes in the i^{th} layer. The activation value of a given node is represented as x_{l,i}, and the B-spline activation function between two nodes in adjacent layers is represented as \phi_{l,j,i}, where l denotes the layer, i denotes the neuron within the given layer, and j denotes the connected neuron in the subsequent layer (Liu et al., 2024a):

\phi_{l,j,i}, \quad l = 0, ..., L-1, \quad i = 1, ..., n_{l}, \quad j = 1, ..., n_{l+1}   (2)

The output of a given B-spline activation function is represented as \tilde{x}_{l,j,i} \equiv \phi_{l,j,i}(x_{l,i}). For any given node, the activation value is equal to the sum of the outputs of the B-spline activation functions on each incoming network edge (Liu et al., 2024a):

x_{l+1,j} = \sum_{i=1}^{n_{l}} \tilde{x}_{l,j,i} = \sum_{i=1}^{n_{l}} \phi_{l,j,i}(x_{l,i})   (3)

where j = 1, ..., n_{l+1}.
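Equation 3 amounts to a per-node summation of the edge activations from the previous layer. The following is a minimal sketch of this aggregation in Python; the nested list of callables is a hypothetical stand-in for the parameterized B-splines and is not the pykan API.

```python
import torch

def kan_layer_forward(x, phis):
    """Minimal sketch of Eq. 3: the activation of node j in layer l+1 is the sum
    of the learnable edge activations applied to the activations of layer l.

    x:    tensor of shape (n_l,) holding the activations x_{l,i}
    phis: phis[j][i] is a callable standing in for the parameterized B-spline
          phi_{l,j,i} on the edge from node i to node j (hypothetical placeholder)
    """
    out = torch.zeros(len(phis))
    for j, incoming in enumerate(phis):        # node j in layer l+1
        for i, phi in enumerate(incoming):     # edge from node i in layer l
            out[j] = out[j] + phi(x[i])        # x_{l+1,j} += phi_{l,j,i}(x_{l,i})
    return out
```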

2.2 B-Splines


Figure 1: Diagram of a cubic B-spline curve, including applicable knots (turquoise) and control points (red).

To facilitate diversity in the learned activation function on each edge of a KAN, KANs implement activation functions using B-splines. B-splines are piecewise polynomial curves whose name derives from the elastic beams (i.e., splines) used by draftsmen to construct sweeping curves in ship design (Prautzsch et al., 2002). They are defined as affine combinations of control points c_{i} and associated basis functions B_{i}(x) (Liu et al., 2024a):

spline(x) = \sum_{i} c_{i} B_{i}(x)   (4)

B-splines can be used to model a wide variety of shapes due to their ability to represent functions of arbitrary degree. A curve spline(x) is called a B-spline of degree k with knots t_{0}, ..., t_{m}, where t_{i} \leq t_{i+1} and t_{i} < t_{i+k+1} for all i, which delineate the intervals over which the B-spline is defined (see Figures 1 and 2), and control points c_{0}, ..., c_{i}, which control the local shape of the B-spline. Because of this control over the shape of the B-spline, KANs parameterize B-splines via their control points. Expressions for the basis functions B_{i,k}(x) of a given B-spline are derived via the Cox-De Boor recursion algorithm, examples of which are shown in Figure 2:

\begin{split} B_{i,0}(x) &= \begin{cases} 1 & \text{if } t_{i} \leq x < t_{i+1}, \\ 0 & \text{otherwise.} \end{cases} \\ B_{i,k}(x) &= \frac{x - t_{i}}{t_{i+k} - t_{i}} B_{i,k-1}(x) + \frac{t_{i+k+1} - x}{t_{i+k+1} - t_{i+1}} B_{i+1,k-1}(x) \end{split}   (5)

As Equations 4 and 5 demonstrate, the value of a basis function B_{i,k}(x) of degree k is calculated recursively by reference to the position of the input value x within the knot vector and the output of certain basis functions of degree k-1, and the final output of the B-spline is determined by multiplying all applicable basis functions by their associated control points c_{i}.
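For concreteness, the following is an illustrative, unoptimized Python sketch of Equations 4 and 5 (not the pykan implementation): each basis function of degree k triggers two evaluations of degree k-1, which is the sequential dependency discussed below.

```python
def cox_de_boor(i, k, x, t):
    """Direct implementation of Eq. 5: B_{i,k}(x) over knot vector t (a list of
    floats). Each call of degree k spawns two calls of degree k-1; empty knot
    intervals contribute zero."""
    if k == 0:
        return 1.0 if t[i] <= x < t[i + 1] else 0.0
    left = right = 0.0
    if t[i + k] > t[i]:
        left = (x - t[i]) / (t[i + k] - t[i]) * cox_de_boor(i, k - 1, x, t)
    if t[i + k + 1] > t[i + 1]:
        right = (t[i + k + 1] - x) / (t[i + k + 1] - t[i + 1]) * cox_de_boor(i + 1, k - 1, x, t)
    return left + right

def spline(x, t, c, k):
    """Eq. 4: spline(x) = sum_i c_i * B_{i,k}(x), with len(t) = len(c) + k + 1."""
    return sum(c[i] * cox_de_boor(i, k, x, t) for i in range(len(c)))
```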


Figure 2: Diagram of cubic B-spline basis function curves (various colors) and knots (turquoise).

Given that basis functions of degree k are defined by reference to basis functions of degree k-1, it follows that calculation of a degree-k basis function requires k levels of sequential, recursive calls. This sequential dependency in the Cox-De Boor recursion algorithm prevents parallelization, resulting in a proportional increase in computation time as B-spline degree increases, even with memoization or dynamic programming techniques. Our testing shows that B-spline calculations constitute a significant portion of total execution time for KANs, ranging from 18% for degree-2 B-splines to over 50% for degree-30 B-splines, making B-spline computational efficiency a fundamental concern in improving the computational efficiency of KANs as a whole. While low-degree B-splines enable KANs to outperform MLPs (Liu et al., 2024a; Peng et al., 2024; Hu et al., 2024), our tests have demonstrated that high-degree B-splines are necessary for optimal performance in modeling certain functions, with optimal convergence occurring at degrees of 30 or higher. To bridge the gap between the need for high-degree B-splines and their high computational cost, we need a way to parallelize the recursive B-spline calculations for efficient execution on a parallel processor.

2.3 Related Work

To address the computational cost of KANs, a number of approaches have been taken. For certain applications, despite the general additional computational overhead of KANs as compared to MLPs with similar parameter counts, KANs have been shown to produce superior performance with lower parameter counts, resulting in overall computational efficiency, as is the case with the convolutional KAN and U-KAN (Bodner et al., 2024; Li et al., 2024). However, these results involve hybrid KAN architectures and are domain-specific, suggesting the need for more general optimization techniques. To achieve more general computational efficiency, many modified KAN architectures have been proposed that replace the underlying B-splines with different basis functions, such as radial basis functions (FastKAN / FasterKAN) (Li, 2024), trigonometric functions (e.g., Larctan-SKAN) and left-shift softplus functions (LSS-SKAN) (Chen & Zhang, 2024b, a), and rational functions (rKAN) (Aghaei, 2024). While each of these approaches has shown speedups over the B-spline-based KAN architecture, each basis function type has its own strengths and weaknesses in function approximation, and none identified here exhibits the local support property of B-splines that enables their granular function modeling capabilities. Other work adds wavelet functions to B-spline-based KANs (Wav-KAN) (Bozorgasl & Chen, 2024) or applies neuron grouping / weight sharing to minimize the overall parameter count of B-spline-based KANs (Free Knots KAN) (Zheng et al., 2025), yielding efficiency gains; in all cases, however, these speedups are only by a constant factor, and none resolves the underlying computational complexity of the recursive B-spline calculations that is addressed by our proposed approach. That said, each is complementary to our approach and could be combined with it to achieve further improved computational efficiency.

3 MatrixKAN

In this work, we propose MatrixKAN, a novel optimization that parallelizes B-spline calculations, allowing for significant reduction in the effective computation time of KAN calculations. In what follows, we first demonstrate that the generalized matrix representation of B-splines can be theoretically applied to B-spline-based KANs. Next, we demonstrate the procedure for parallelizing B-spline outputs in MatrixKAN. Finally, we discuss the computational efficiency of KAN and MatrixKAN.

3.1 Applying Generalized Matrix Representation of B-Splines to KAN

3.1.1 Decomposing the Cox-De Boor Algorithm into Matrix Operations

We first show that the summation of the products of (i) control points and (ii) basis functions representing the output of a B-spline in KAN (as presented in Equation 4) can be decomposed into matrix multiplications only, i.e., the generalized matrix representation of B-splines.

For a B-spline of order k (note that a B-spline of order k is equivalent to a B-spline of degree k-1), whose shape is determined by k control points c_{j} (j = i, ..., i+k-1), the output of the i^{th} B-spline curve segment spline_{i}(u) can be represented as follows:

spline_{i}(u) = \sum_{j=i}^{i+k-1} c_{j} B_{j,k}(u)   (6)

where u = \frac{t - t_{i+k-1}}{t_{i+k} - t_{i+k-1}}, \quad t_{i+k-1} \leq t < t_{i+k}.

By expanding the summation, we obtain the following:

spline_{i}(u) = c_{i} B_{i,k}(u) + c_{i+1} B_{i+1,k}(u) + ... + c_{i+k-1} B_{i+k-1,k}(u)   (7)

This expanded summation can be represented in matrix form as the product of (i) all applicable basis functions and (ii) the control points, as follows:

spline_{i}(u) = \begin{bmatrix} B_{i,k}(u) & B_{i+1,k}(u) & ... & B_{i+k-1,k}(u) \end{bmatrix} \begin{bmatrix} c_{i} \\ c_{i+1} \\ \vdots \\ c_{i+k-1} \end{bmatrix}   (8)

As demonstrated by (Cohen & Riesenfeld, 1982; Qin, 1998), because all basis functions B_{i,k}(u) are, by definition, polynomials of degree k-1, we can leverage Toeplitz matrices (see Appendix A for a detailed discussion of the use of Toeplitz matrices to represent the product of polynomials) to further decompose the basis function matrix \begin{bmatrix} B_{i,k}(u) & B_{i+1,k}(u) & ... & B_{i+k-1,k}(u) \end{bmatrix} from Equation 8 into two factors: (i) the power bases \begin{bmatrix} 1 & u & ... & u^{k-1} \end{bmatrix}, which collect the k powers of the input value u, and (ii) the basis matrix \Psi^{k}(i):

spline_{i}(u) = \begin{bmatrix} 1 & u & ... & u^{k-1} \end{bmatrix} \Psi^{k}(i) \begin{bmatrix} c_{i} \\ c_{i+1} \\ \vdots \\ c_{i+k-1} \end{bmatrix}   (9)

where u = \frac{t - t_{i+k-1}}{t_{i+k} - t_{i+k-1}}, \quad t_{i+k-1} \leq t < t_{i+k}.
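As a concrete illustration of Equation 9, the sketch below evaluates one segment of a uniform cubic B-spline (order k = 4) in Python, plugging in the well-known uniform cubic basis matrix for \Psi^{4}; the function and variable names are illustrative, not part of any library.

```python
import numpy as np

# Uniform cubic B-spline basis matrix Psi^4 (order 4, degree 3), for the power
# basis ordered as [1, u, u^2, u^3]; Eq. 10 produces this matrix for k = 4.
PSI_4 = (1.0 / 6.0) * np.array([
    [ 1.0,  4.0,  1.0, 0.0],
    [-3.0,  0.0,  3.0, 0.0],
    [ 3.0, -6.0,  3.0, 0.0],
    [-1.0,  3.0, -3.0, 1.0],
])

def cubic_segment(u, ctrl):
    """Eq. 9 for one cubic segment: [1 u u^2 u^3] * Psi^4 * [c_i ... c_{i+3}]^T,
    where u in [0, 1) is the normalized position inside the segment's knot interval."""
    powers = np.array([1.0, u, u ** 2, u ** 3])
    return powers @ PSI_4 @ np.asarray(ctrl)

# With all control points equal, the segment reproduces that constant value.
print(cubic_segment(0.25, [2.0, 2.0, 2.0, 2.0]))   # -> 2.0 (up to rounding)
```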

3.1.2 Precalculating the Basis Matrix

We now demonstrate that the basis matrix \Psi^{k} is identical for all B-splines of a given degree and, as such, can be precalculated such that all subsequent B-spline calculations utilizing the basis matrix are simply matrix multiplications, which can be parallelized.

As demonstrated in (Qin, 1998), \Psi^{k}(i) is calculated via the following recursive formula for all uniform B-splines:

\begin{cases} \Psi^{1} = \begin{bmatrix} 1 \end{bmatrix} \\ \Psi^{k} = \frac{1}{k-1} \left( \begin{bmatrix} \Psi^{k-1} \\ 0 \end{bmatrix} \begin{bmatrix} 1 & k-2 & & & 0 \\ & 2 & k-3 & & \\ & & \ddots & \ddots & \\ 0 & & & k-1 & 0 \end{bmatrix} + \begin{bmatrix} 0 \\ \Psi^{k-1} \end{bmatrix} \begin{bmatrix} -1 & 1 & & & 0 \\ & -1 & 1 & & \\ & & \ddots & \ddots & \\ 0 & & & -1 & 1 \end{bmatrix} \right) \end{cases}   (10)

Note that because the only input into Equation 10 is the applicable order k of the B-spline, the output of this algorithm is always the same k x k matrix for a B-spline of order k, meaning \Psi^{k}(i) need be calculated only once for all B-splines of order k. The output of all B-splines of order k can then be calculated via the matrix multiplication operations set forth in Equation 9, without the need to execute any further recursive calculations in accordance with Equations 4, 5, and 6.
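A minimal Python/NumPy sketch of this precalculation follows (the function name is illustrative, not the MatrixKAN API); it implements the recursion in Equation 10 directly and needs to run only once per order k.

```python
import numpy as np

def basis_matrix(k):
    """Recursively compute the uniform B-spline basis matrix Psi^k of Eq. 10.
    Returns a k x k matrix that depends only on the order k, so it can be
    computed once at model initialization and reused for every B-spline."""
    if k == 1:
        return np.array([[1.0]])
    prev = basis_matrix(k - 1)                        # Psi^{k-1}, (k-1) x (k-1)
    top = np.vstack([prev, np.zeros((1, k - 1))])     # [Psi^{k-1}; 0], k x (k-1)
    bottom = np.vstack([np.zeros((1, k - 1)), prev])  # [0; Psi^{k-1}], k x (k-1)
    A = np.zeros((k - 1, k))
    B = np.zeros((k - 1, k))
    for j in range(k - 1):                            # banded rows of Eq. 10
        A[j, j], A[j, j + 1] = j + 1, k - 2 - j       # [1, k-2], [2, k-3], ..., [k-1, 0]
        B[j, j], B[j, j + 1] = -1.0, 1.0              # [-1, 1] on each row
    return (top @ A + bottom @ B) / (k - 1)

# Psi^4 (order 4, degree 3) is the familiar uniform cubic B-spline matrix:
print(basis_matrix(4) * 6)    # -> [[1 4 1 0], [-3 0 3 0], [3 -6 3 0], [-1 3 -3 1]]
```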

3.2 Procedure for Parallelizing B-Spline Outputs in MatrixKAN

Given the above theoretical analysis, we now present the procedure for precalculating the basis matrix \Psi^{k+1}(i) of order k+1 (i.e., of degree k) and executing all B-spline calculations in a MatrixKAN network in a fully parallelized manner:

Step 1: Calculate the basis matrix \Psi^{k+1}. At model initialization, calculate the basis matrix applicable to B-splines of the degree, k, specified for the model. This results in a (k+1) x (k+1) tensor.

Step 2: Calculate applicable knot intervals and power bases \begin{bmatrix} 1 & u & ... & u^{k} \end{bmatrix}, where u = \frac{t - t_{i+k}}{t_{i+k+1} - t_{i+k}}, t_{i+k} \leq t < t_{i+k+1}, and t_{0}, ..., t_{m} represents the knot vector (Equation 9 is described in terms of B-spline order, while the formula here is described in terms of B-spline degree). Thus, for each B-spline, we must first calculate u, which requires determining the applicable knot interval for a given input value. This is achieved by comparing the input values to the knot values in each B-spline knot vector, returning two tensors: one representing the lower knot in the applicable grid interval for each value, t_{i+k}, and one representing the upper knot in the applicable grid interval for each value, t_{i+k+1}. These tensors are then used, along with the tensor containing the input values t, to calculate u per the formula above. Once u is calculated, the power bases tensor is populated with the appropriate powers of u.

Step 3: Calculate applicable basis function outputs. Next, we calculate the basis function outputs for each B-spline based on Equation 9 by multiplying the basis matrix tensor calculated in Step 1 with the power bases tensor calculated in Step 2. To ensure that only applicable basis function outputs are calculated, we utilize the applicable knot intervals calculated in Step 2 to zero portions of the basis matrix tensor representing inapplicable basis functions. We then execute the final matrix multiplication, which results in a tensor containing the basis function outputs for all B-splines across all input values. (The basis matrix tensor may be expanded to match the dimension of the power bases tensor, which has additional dimensions corresponding to the number of input values and the number of dimensions of each input value.)

Step 4: Calculate B-spline outputs. At this point, Step 3 has completed, in parallel, all calculations equivalent to the recursive calculations required by the Cox-De Boor recursion algorithm, and its output is equivalent to the basis function output tensor produced by KAN, so we can employ the original KAN algorithm from here forward to complete the B-spline output calculations.

All of the above operations are tensor operations implemented in PyTorch, and unlike the original KAN implementation, the recursive operations required to calculate basis function outputs have been replaced by matrix multiplication operations, which can be effectively parallelized.
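To make Steps 2-4 concrete, the following is a minimal PyTorch sketch for a single edge with a uniform knot grid; the tensor shapes and names are illustrative and do not reflect the actual MatrixKAN implementation, which batches these operations across all edges and layers.

```python
import torch

def matrix_spline_forward(x, grid_min, grid_max, num_intervals, degree, psi, coef):
    """Illustrative sketch of Steps 2-4 for one edge on a uniform knot grid.

    x:    (batch,) input values
    psi:  (degree+1, degree+1) precomputed basis matrix from Step 1 (Eq. 10)
    coef: (num_intervals + degree,) control points of the edge's B-spline
    """
    h = (grid_max - grid_min) / num_intervals
    # Step 2: locate each input's knot interval and compute u in [0, 1).
    idx = torch.clamp(((x - grid_min) / h).floor().long(), 0, num_intervals - 1)
    u = (x - (grid_min + idx * h)) / h                                   # (batch,)
    powers = torch.stack([u ** p for p in range(degree + 1)], dim=-1)    # (batch, degree+1)

    # Step 3: basis function outputs via a single matrix multiplication (Eq. 9).
    bases = powers @ psi                                                 # (batch, degree+1)

    # Step 4: combine with the control points of each input's active interval.
    ctrl_idx = idx.unsqueeze(-1) + torch.arange(degree + 1)              # (batch, degree+1)
    return (bases * coef[ctrl_idx]).sum(dim=-1)                          # (batch,)
```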

3.3 Computational Efficiency Analysis

In this section, we analyze and compare the computational efficiency of the algorithms implemented in KAN and MatrixKAN. For purposes of this discussion, we assume a network that mirrors what is presented in (Liu et al., 2024a), where all L layers have equal width n_{0} = n_{1} = ... = n_{L} = N, with each B-spline of degree k defined on G intervals.

3.3.1 KAN

We first demonstrate that (i) the number of computations required by KAN in a forward pass is O(N^{2}L(k^{2}+kG)) and (ii) the effective computation time of KAN when executed on a parallel processor is O(Lk).

To determine the number of calculations required for a forward pass of KAN with the parameters described above, note there are L layers and N^{2} edges between layers (i.e., N^{2} B-splines between layers), resulting in O(N^{2}L) total B-splines. For each B-spline, the recursion in Equation 5 can be efficiently implemented with memoization or dynamic programming that iterates over the i and k dimensions in a bottom-up fashion, i.e., computing B_{i,0} for all i first, then B_{i,1} for all i, and so on. This requires O(Dim_{i} \times Dim_{k}) = O((k+G)k) = O(k^{2}+kG) calculations. Thus, each forward pass of KAN entails O(N^{2}L(k^{2}+kG)) calculations.
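For concreteness, a plain-Python sketch of this bottom-up evaluation is shown below (illustrative only, not the pykan implementation): each of the k levels depends only on the level before it, so the entries within a level are mutually independent, but the levels themselves must be computed sequentially.

```python
def basis_functions_bottom_up(x, t, k):
    """Bottom-up (dynamic-programming) evaluation of all basis functions
    B_{i,k}(x) over knot vector t (a list of floats). Level d uses only the
    results of level d-1, so there are k sequential levels in total."""
    n = len(t) - 1
    B = [1.0 if t[i] <= x < t[i + 1] else 0.0 for i in range(n)]   # degree 0
    for d in range(1, k + 1):                                      # one level per degree
        B_next = []
        for i in range(n - d):                                     # independent within a level
            left = right = 0.0
            if t[i + d] > t[i]:
                left = (x - t[i]) / (t[i + d] - t[i]) * B[i]
            if t[i + d + 1] > t[i + 1]:
                right = (t[i + d + 1] - x) / (t[i + d + 1] - t[i + 1]) * B[i + 1]
            B_next.append(left + right)
        B = B_next
    return B   # B[i] == B_{i,k}(x)
```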

To determine the effective computation time of a forward pass of KAN, we must consider the distinction between the serial and parallel portions of the algorithm. As discussed in (Blelloch, 1996), when analyzing the effective computation time of parallel algorithms, because all calculations that have no sequential dependencies can be processed in parallel (assuming a sufficient number of parallel processing cores to spread the work), it is only the length of the longest sequential dependency (i.e., the number of calculations that must be performed sequentially) that governs the effective computation time.

Applying this framework to KAN, first note that the O(N^{2}) computations referenced above (i.e., the number of B-spline calculations within a given layer) can be fully parallelized, as the calculations at any layer are only sequentially dependent upon the outputs of the prior layer. Thus, with respect to the B-spline calculations across the network, the longest chain of sequential dependencies is equal to the number of layers, or O(L). Further, as discussed above, although each B-spline has O(k^{2}+kG) complexity, the computations at each of the k levels of recursion (e.g., computing B_{i,0} for all i) are independent of each other and only sequentially dependent upon the prior level of recursion. Therefore, the longest chain of sequential dependencies for B-spline calculations in a given layer is equal to the total levels of recursion, or O(k). When multiplied with the number of layers, O(L), this results in a total effective computation time for a KAN forward pass of O(Lk).

3.3.2 MatrixKAN

In this section, we demonstrate that (i) the number of computations required by MatrixKAN in a forward pass is O(N^{2}L(k^{2}+G)) and (ii) the effective computation time of MatrixKAN when executed on a parallel processor is superior to that of KAN: O(L).

First, we address the number of computations required by MatrixKAN in precalculating the basis matrix \Psi^{k+1}(i). As set forth in Equation 10, calculation of \Psi^{k+1}(i) requires k levels of recursion, each of which entails two sequentially independent matrix multiplications of square matrices and one addition of the resultant matrices, starting with matrices of dimension k x k and decrementing by 1 with each recursive call down to 1 x 1. Thus, each step requires O(2d^{3}+d^{2}), or simply O(d^{3}), calculations, where d represents the matrix dimension at each level of recursion. This can be represented by the following summation, requiring O(k^{4}) computations:

\sum_{d=1}^{k} d^{3} = \frac{k^{2}(k+1)^{2}}{4} = O(k^{4})   (11)

To determine the effective computation time of calculating the basis matrix \Psi^{k+1}(i), note that because all O(d^{3}) calculations required at each level of recursion are matrix operations, all are sequentially independent and can be parallelized. Thus, the longest chain of sequential dependencies is the total levels of recursion, making the effective computation time of calculating the basis matrix \Psi^{k+1}(i) equal to O(k). Note further that because this calculation need only be executed once for a given B-spline degree (i.e., potentially only once ever, as the calculated basis matrix can be reused between models), the amortized computation time of the algorithm is O(k^{4}/n), where n represents the total number of forward passes executed. Given that model training can easily entail thousands (if not orders of magnitude more) forward passes, n will eventually grow very large, and the amortized computation time of calculating the basis matrix \Psi^{k+1}(i) will approach O(1).

Next, we address the number of computations required by MatrixKAN in a forward pass. Here, as with KAN, there are L layers and N^{2} B-splines between layers, resulting in O(N^{2}L) total B-splines. Each B-spline is calculated based on Equation 9, which requires calculation of the power bases of u, entailing O(G+k) complexity due to the dimension of the knot vectors, and two matrix multiplications. The first matrix multiplication is between the 1 x k power bases matrix and the k x k basis matrix, requiring O(k^{2}) computations, and the second is between the 1 x k matrix resulting from the first calculation and the k x 1 control points matrix, requiring O(k) computations. Consequently, each B-spline calculation in a given layer is O(G+k+k^{2}+k), or simply O(k^{2}+G). This, when multiplied with the O(N^{2}L) total B-splines, results in a total of O(N^{2}L(k^{2}+G)) calculations required in a forward pass of MatrixKAN.

We now demonstrate how the enhanced parallelizability of MatrixKAN B-spline calculations results in significantly reduced effective computation time. To determine the effective computation time of a forward pass of MatrixKAN, first note that, as with KAN, all O(N^{2}) B-spline computations in a given layer can be parallelized. Also like KAN, calculations at any given layer are sequentially dependent upon calculations at the prior layer, resulting in a chain of O(L) sequential dependencies across layers of the network. However, unlike KAN, all B-spline calculations within a given layer can be fully parallelized, i.e., the aforementioned O(k^{2}+G) matrix-based calculations in Equation 9 can be processed in parallel in O(1) time. Thus, the only chain of sequentially dependent calculations in a given MatrixKAN forward pass is the chain of calculations performed across layers, resulting in an effective computation time for MatrixKAN of O(L).

To summarize, in terms of computational complexity, MatrixKAN requires O(N^{2}L(k^{2}+G)) computations, which is less than the O(N^{2}L(k^{2}+kG)) computations required by KAN. More importantly, in terms of effective computation time, MatrixKAN’s O(L) complexity is k times faster than KAN’s O(Lk) complexity, which represents a speedup proportional to the degree of the underlying B-spline. As demonstrated in Section 5, for networks utilizing relatively high-degree B-splines, this can result in calculations many multiples, if not orders of magnitude, faster than KAN.

4 Experimental Setup

4.1 Data Sets

Following the evaluation methodology presented in (Liu et al., 2024a), we conducted experiments on (i) the dataset generated by the hellokan.ipynb script (for the efficiency tests) and (ii) a random subset of the Feynman equations generated by the feynman.py script (for the performance tests), each of which is provided by (Liu et al., 2024b). For each test, we generated a training set containing 1000 training samples and a test set containing 1000 validation samples. After each training epoch, each model was validated against the test set.

4.2 Computational Efficiency

For the efficiency tests, our aim was to compare the computation time of KAN and MatrixKAN relative to grid size, B-spline degree, and dataset size. (For all testing, we utilized the latest MultKAN implementation of KAN provided by (Liu et al., 2024b).) We trained models using one NVIDIA 1080ti-8MB GPU for 20 steps, and to mirror the metrics in (Liu et al., 2024a), we measured computational efficiency in seconds per step, averaged across all steps. With this general setup, we ran three tests comparing (1) training time vs. grid size, (2) training time vs. spline degree, and (3) training time vs. dataset size.

4.3 Performance / Optimal Spline Degree

For the performance tests, our aim was to demonstrate (i) the validity of MatrixKAN, by demonstrating model performance consistent with that of KAN, and (ii) the superior performance of models using high-degree B-splines (i.e., of degree 4 or higher) relative to models using low-degree B-splines (i.e., of degree 3 or less) for certain modeling tasks. We trained KAN and MatrixKAN models in tandem using the model shapes prescribed for the Feynman dataset in Table 2 of (Liu et al., 2024a). To ensure comparability between KAN and MatrixKAN models, models of each architecture were trained using identical training and test sample sets and identical random seed values. Training was performed using the Adam optimizer with a learning rate of 0.001. We trained models of each architecture for 1000 steps using the following values for the underlying B-spline degree: 2, 4, 6, 8, 10, 20, 30.

5 Results

5.1 Computational Efficiency

To compare the computational efficiency of MatrixKAN to KAN, we trained models of shape [2,2,1] and [2,5,1] using the LBFGS optimizer, each model initialized with a grid size of 3, a grid eps of 1, a seed value of 42, and grid update enabled.

5.1.1 Grid Size

For our test measuring computational efficiency with respect to grid size, B-spline degree was set to 6, and grid size was varied across the following values: 2, 5, 10, 25, 50, 100, 250, 500, 1000. (As demonstrated in (Liu et al., 2024a), achieving optimal KAN performance sometimes requires refining the grid to grid sizes of up to 1000. Given the importance of increasing grid size during training to achieve optimal model performance, we included computational efficiency results with respect to increasing grid size to verify that the MatrixKAN optimizations did not create unacceptable additional computational burden at higher grid sizes.) Results are depicted in Figure 3 by plotting training time in seconds per training step against grid size.


Figure 3: A plot comparing MatrixKAN and KAN models of shape [2,2,1] and [2,5,1] across increasing grid sizes.

From Figure 3, we can see the superior performance of MatrixKAN compared to KAN. As depicted in the figure, as grid size increases, training time for each architecture increases, but MatrixKAN training times are consistently lower than those of KAN. The relative efficiency of MatrixKAN vs. KAN in these tests is a result of the static B-spline degree used (6), for which KAN must execute 6 sequential levels of calculation per B-spline that MatrixKAN parallelizes. As a result, although both KAN and MatrixKAN demonstrate similar scaling with respect to grid size, as expected from our theoretical analysis, MatrixKAN demonstrates superior performance due to the underlying B-spline degree.

5.1.2 B-spline Degree

For our test measuring computational efficiency with respect to B-spline degree, grid size was set to 2, and B-spline degree was varied across the following values: 2, 4, 6, 8, 10, 20. Results are depicted in Figure 4 by plotting training time in seconds per training step against B-spline degree.

Figure 4: A plot comparing MatrixKAN and KAN models of shape [2,2,1] and [2,5,1] across increasing spline degrees.

Figure 5: A plot comparing loss level of MatrixKAN and KAN models trained to model Feynman Equation I.6.20b:
f(\Theta, \sigma) = \exp(-\frac{\Theta^{2}}{2\sigma^{2}}) / \sqrt{2\pi\sigma^{2}}

Figure 6: A plot comparing loss level of MatrixKAN and KAN models trained to model Feynman Equation I.12.11:
f(\alpha, \Theta) = 1 + \alpha \sin\Theta

Figure 7: A plot comparing loss level of MatrixKAN and KAN models trained to model Feynman Equation I.26.2:
f(n, \Theta_{2}) = \arcsin(n \sin\Theta_{2})

From Figure 4, we can see a clear distinction between KAN and MatrixKAN models, demonstrating preferable scaling of MatrixKAN computation time with respect to B-spline degree. For the base KAN architecture, both model shapes scale as expected, with computation time increasing linearly with B-spline degree, while MatrixKAN computation time does not exhibit this growth. This supports the conclusion that KAN’s effective computation time is a function of B-spline degree and MatrixKAN’s is independent of B-spline degree, even though the exact speedup here may not equal the theoretical value of k (due to implementation variations that may introduce a constant factor of overhead not captured by the asymptotic analysis).

5.1.3 Dataset Size

Given that real-world training often involves datasets orders of magnitude larger than those utilized in the tests above, we performed a test to measure the computational efficiency of each architecture with respect to dataset size. For this test, we trained a model of shape [2,5,1] with grid size set to 3, B-spline degree set to 20, and dataset size varied across the following values: 10000, 50000, 100000. Results are depicted in Table 1 comparing dataset size to relative speedup, calculated as seconds per step for KAN calculations divided by seconds per step for MatrixKAN calculations.

Table 1: Speedups of MatrixKAN vs. KAN for various dataset sizes.

Dataset Size     10000     50000     100000
Speedup           2.18     17.59      38.89

As depicted in the table, as dataset size increases, the relative speedup of MatrixKAN vs. KAN increases significantly. For datasets up to 100,000 samples, MatrixKAN computations exhibit speedups of up to approximately 40x over KAN computations. The increased speedup is mainly because the parallel fraction of MatrixKAN is greater than that of KAN, so the proportion of computations performed in parallel by MatrixKAN increases more quickly than KAN’s as dataset size grows. Therefore, as dataset size increases, MatrixKAN’s speedup relative to KAN will increase, with potential speedups many times, if not orders of magnitude, greater than those depicted here for larger datasets.

5.2 Performance / Optimal Spline Degree

For our performance tests, we compared the performance of MatrixKAN to KAN by plotting the loss level of each model against B-spline degree. The results of select tests are shown in Figures 5, 6, and 7.

Two things can be observed. First, MatrixKAN and KAN performed identically when modeling all Feynman equations (i.e., completely overlapping curves). Although MatrixKAN and KAN perform B-spline calculations in different ways, the calculations are mathematically identical. Thus, when initialized with identical parameters and trained on identical datasets, models from each architecture should produce identical outputs and follow the same path of gradient descent to identical levels of convergence, which is demonstrated here. Second, the results demonstrate the superior performance of high-degree B-splines (i.e., of degree 4 or higher) in modeling various functions. Of the nine Feynman functions modeled, all but two models achieved optimal convergence at B-spline degree 4 or higher, with the majority of models converging at B-spline degree 6 or higher and with loss level improvements of up to 27.0%. Further, Figure 7 demonstrates that, for some modeling tasks, model performance generally improves as a function of increasing B-spline degree and that B-splines of higher degree than tested here may yield improved results.

6 Conclusion

Kolmogorov-Arnold Networks represent a promising innovation in neural network architecture. However, due to the recursive nature of B-spline calculations, KANs suffer from slow training and inference speeds relative to traditional MLPs. To address this issue, we propose MatrixKAN, a novel optimization that parallelizes B-spline calculations. This optimized architecture adapts the generalized matrix representation of B-splines to parallelize B-spline calculations, resulting in significant speedups relative to KAN. Using this optimized approach, we observed speedups of up to approximately 40x compared to KAN, with potential for significant additional speedups for larger datasets or higher B-spline degrees, while demonstrating consistent model performance between architectures.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. It demonstrates various potential societal benefits, in particular those resulting from the increased testing and use of KAN models using high-degree B-splines. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

  • Aghaei (2024) Aghaei, A. A. rkan: Rational kolmogorov-arnold networks, 2024. URL https://arxiv.org/abs/2406.14495.
  • Blelloch (1996) Blelloch, G. E. Programming parallel algorithms. Commun. ACM, 39(3):85–97, March 1996. ISSN 0001-0782. doi: 10.1145/227234.227246. URL https://doi.org/10.1145/227234.227246.
  • Bodner et al. (2024) Bodner, A. D., Tepsich, A. S., Spolski, J. N., and Pourteau, S. Convolutional kolmogorov-arnold networks, 2024. URL https://arxiv.org/abs/2406.13155.
  • Bozorgasl & Chen (2024) Bozorgasl, Z. and Chen, H. Wav-kan: Wavelet kolmogorov-arnold networks, 2024. URL https://arxiv.org/abs/2405.12832.
  • Carlo et al. (2024) Carlo, G. D., Mastropietro, A., and Anagnostopoulos, A. Kolmogorov-arnold graph neural networks, 2024. URL https://arxiv.org/abs/2406.18354.
  • Chen & Zhang (2024a) Chen, Z. and Zhang, X. Larctan-skan: Simple and efficient single-parameterized kolmogorov-arnold networks using learnable trigonometric function, 2024a. URL https://arxiv.org/abs/2410.19360.
  • Chen & Zhang (2024b) Chen, Z. and Zhang, X. Lss-skan: Efficient kolmogorov-arnold networks based on single-parameterized function, 2024b. URL https://arxiv.org/abs/2410.14951.
  • Cohen & Riesenfeld (1982) Cohen, E. and Riesenfeld, R. F. General matrix representations for bezier and b-spline curves. Computers in Industry, 3(1):9–15, 1982. ISSN 0166-3615. doi: https://doi.org/10.1016/0166-3615(82)90027-6. URL https://www.sciencedirect.com/science/article/pii/0166361582900276. Double Issue- In Memory of Steven Anson Coons.
  • Hu et al. (2024) Hu, L., Wang, Y., and Lin, Z. Incorporating arbitrary matrix group equivariance into kans, 2024. URL https://arxiv.org/abs/2410.00435.
  • Jamali et al. (2024) Jamali, A., Roy, S. K., Hong, D., Lu, B., and Ghamisi, P. How to learn more? exploring kolmogorov-arnold networks for hyperspectral image classification, 2024. URL https://arxiv.org/abs/2406.15719.
  • Koenig et al. (2024) Koenig, B. C., Kim, S., and Deng, S. Kan-odes: Kolmogorov–arnold network ordinary differential equations for learning dynamical systems and hidden physics. Computer Methods in Applied Mechanics and Engineering, 432:117397, December 2024. ISSN 0045-7825. doi: 10.1016/j.cma.2024.117397. URL http://dx.doi.org/10.1016/j.cma.2024.117397.
  • Kolmogorov (1957) Kolmogorov, A. K. On the representation of continuous functions of several variables by superposition of continuous functions of one variable and addition. Doklady Akademii Nauk SSSR, 114:369–373, 1957.
  • Li et al. (2024) Li, C., Liu, X., Li, W., Wang, C., Liu, H., Liu, Y., Chen, Z., and Yuan, Y. U-kan makes strong backbone for medical image segmentation and generation, 2024. URL https://arxiv.org/abs/2406.02918.
  • Li (2024) Li, Z. Kolmogorov-arnold networks are radial basis function networks, 2024. URL https://arxiv.org/abs/2405.06721.
  • Liu et al. (2024a) Liu, Z., Wang, Y., Vaidya, S., Ruehle, F., Halverson, J., Soljačić, M., Hou, T. Y., and Tegmark, M. Kan: Kolmogorov-arnold networks, 2024a. URL https://arxiv.org/abs/2404.19756.
  • Liu et al. (2024b) Liu, Z., Wang, Y., Vaidya, S., Ruehle, F., Halverson, J., Soljačić, M., Hou, T. Y., and Tegmark, M. pykan. https://github.com/kindxiaoming/pykan, 2024b. Accessed: January 28, 2025.
  • Liu et al. (2025) Liu, Z., Chen, H., Bai, L., Li, W., Zou, Z., and Shi, Z. Kolmogorov arnold neural interpolator for downscaling and correcting meteorological fields from in-situ observations, 2025. URL https://arxiv.org/abs/2501.14404.
  • Peng et al. (2024) Peng, Y., Wang, Y., Hu, F., He, M., Mao, Z., Huang, X., and Ding, J. Predictive modeling of flexible ehd pumps using kolmogorov–arnold networks. Biomimetic Intelligence and Robotics, 4(4):100184, December 2024. ISSN 2667-3797. doi: 10.1016/j.birob.2024.100184. URL http://dx.doi.org/10.1016/j.birob.2024.100184.
  • Prautzsch et al. (2002) Prautzsch, H., Boehm, W., and Paluszny, M. Bézier and B-Spline Techniques. Springer Berlin, Heidelberg, 01 2002. ISBN 978-3-642-07842-2. doi: 10.1007/978-3-662-04919-8.
  • Qin (1998) Qin, K. General matrix representations for b-splines. The Visual Computer, 16:177–186, 1998. URL https://api.semanticscholar.org/CorpusID:26401557.
  • Samadi et al. (2024) Samadi, M. E., Müller, Y., and Schuppert, A. Smooth kolmogorov arnold networks enabling structural knowledge representation, 2024. URL https://arxiv.org/abs/2405.11318.
  • Zheng et al. (2025) Zheng, L. N., Zhang, W. E., Yue, L., Xu, M., Maennel, O., and Chen, W. Free-knots kolmogorov-arnold network: On the analysis of spline knots and advancing stability, 2025. URL https://arxiv.org/abs/2501.09283.
  • Zhou et al. (2024) Zhou, Q., Pei, C., Sun, F., Han, J., Gao, Z., Pei, D., Zhang, H., Xie, G., and Li, J. Kan-ad: Time series anomaly detection with kolmogorov-arnold networks, 2024. URL https://arxiv.org/abs/2411.00278.

Appendix A Representing the product of two polynomials using Toeplitz matrices.

A Toeplitz matrix is a matrix whose elements on any line parallel to the main diagonal are all equal:

T = \begin{bmatrix} \alpha_{0} & \alpha_{1} & ... & \alpha_{s} & ... & 0 \\ \alpha_{-1} & \alpha_{0} & \ddots & & \ddots & \\ \vdots & \ddots & \ddots & \ddots & \ddots & \alpha_{s} \\ \alpha_{-r} & \ddots & \ddots & \ddots & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & \ddots & \alpha_{1} \\ 0 & & \alpha_{-r} & \ddots & \alpha_{-1} & \alpha_{0} \end{bmatrix}

Toeplitz matrices are capable of representing polynomial functions as well as the product of polynomial functions. For example, the polynomial g(x) = c_{0} + c_{1}x + c_{2}x^{2} + ... + c_{m-1}x^{m-1} (c_{m-1} \neq 0) may be represented as a special Toeplitz matrix, a lower triangular matrix, as follows:

g(x) = \begin{bmatrix} c_{0} & & & & & & 0 \\ c_{1} & c_{0} & & & & & \\ \vdots & \ddots & \ddots & & & & \\ c_{m-1} & ... & \ddots & \ddots & & & \\ & \ddots & ... & \ddots & \ddots & & \\ & & c_{m-1} & ... & \ddots & c_{0} & \\ 0 & & & c_{m-1} & ... & c_{1} & c_{0} \end{bmatrix}

Further, the product of g(x) and q(x) = d_{0} + d_{1}x + d_{2}x^{2} + ... + d_{n-1}x^{n-1} (d_{n-1} \neq 0) can be represented as follows:

f(x) = g(x)q(x) = X \begin{bmatrix} c_{0} & & & & & & 0 \\ c_{1} & c_{0} & & & & & \\ \vdots & \ddots & \ddots & & & & \\ c_{m-1} & ... & \ddots & \ddots & & & \\ & \ddots & ... & \ddots & \ddots & & \\ & & c_{m-1} & ... & \ddots & c_{0} & \\ 0 & & & c_{m-1} & ... & c_{1} & c_{0} \end{bmatrix} \begin{bmatrix} d_{0} \\ d_{1} \\ \vdots \\ d_{n-1} \\ 0 \\ \vdots \\ 0 \end{bmatrix}

where X = \begin{bmatrix} 1 & x & x^{2} & ... & x^{m+n-2} \end{bmatrix}.
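As a quick numerical illustration of this construction (a minimal NumPy sketch with hypothetical function names, not drawn from the paper), the lower triangular Toeplitz matrix of g applied to the zero-padded coefficient vector of q yields the coefficients of the product polynomial:

```python
import numpy as np

def poly_product_toeplitz(c, d):
    """Multiply g(x) (coefficients c, ascending powers) and q(x) (coefficients d)
    by building the lower triangular Toeplitz matrix of g and applying it to the
    zero-padded coefficient vector of q, as in Appendix A."""
    m, n = len(c), len(d)
    size = m + n - 1                       # number of coefficients of the product
    G = np.zeros((size, size))
    for col in range(size):                # column `col` holds c shifted down by `col`
        rows = np.arange(m) + col
        keep = rows < size
        G[rows[keep], col] = np.asarray(c, dtype=float)[keep]
    d_padded = np.concatenate([np.asarray(d, dtype=float), np.zeros(size - n)])
    return G @ d_padded                    # coefficients of f(x) = g(x) q(x)

# Example: (1 + 2x)(3 + x) = 3 + 7x + 2x^2
print(poly_product_toeplitz([1, 2], [3, 1]))   # -> [3. 7. 2.]
```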