Deep Learning with Kernels through RKHM and the Perron–Frobenius Operator
Abstract
Reproducing kernel Hilbert C*-module (RKHM) is a generalization of reproducing kernel Hilbert space (RKHS) by means of C*-algebra, and the Perron–Frobenius operator is a linear operator related to the composition of functions. Combining these two concepts, we present deep RKHM, a deep learning framework for kernel methods. We derive a new Rademacher generalization bound in this setting and provide a theoretical interpretation of benign overfitting by means of Perron–Frobenius operators. By virtue of C*-algebra, the dependency of the bound on the output dimension is milder than that of existing bounds. We show that C*-algebra is a suitable tool for deep learning with kernels, enabling us to take advantage of the product structure of operators and to provide a clear connection with convolutional neural networks. Our theoretical analysis provides a new lens through which one can design and analyze deep kernel methods.
1 Introduction
Kernel methods and deep neural networks are two major topics in machine learning. Originally, they were investigated independently, but their interactions have been studied recently. One important perspective is deep kernel learning [1, 2, 3]. In this framework, we construct a function as the composition of functions in RKHSs, which is learned from given training data. Representer theorems have been shown for deep kernel methods, which guarantee that solutions of minimization problems are represented using only the given training data [4, 5]. We can combine the flexibility of deep neural networks with the representation power and solid theoretical understanding of kernel methods. Other important perspectives are the neural tangent kernel [6, 7] and convolutional kernels [8], which enable us to understand neural networks using the theory of kernel methods. In addition, Bietti et al. [9] proposed a regularization of deep neural networks from a kernel perspective.
The generalization properties of kernel methods and deep neural networks have been investigated. One typical technique for bounding generalization errors is the Rademacher complexity [10, 11]. For kernel methods, generalization bounds based on the Rademacher complexity can be derived from the reproducing property. Bounds for deep kernel methods and vector-valued RKHSs (vvRKHSs) have also been derived [5, 12, 13]. Table 1 shows existing Rademacher generalization bounds for kernel methods. Generalization bounds for deep neural networks have also been actively studied [14, 15, 16, 17, 18, 19]. Recently, analysis of generalization through the concept of benign overfitting has emerged [20, 21]. Unlike the classical picture of overfitting, called catastrophic overfitting, it describes the phenomenon where a model interpolates the training data yet still generalizes to test data. For kernel regression, it has been shown that the type of overfitting can be characterized by an integral operator associated with the kernel [22].
In this paper, we propose deep RKHM to make deep kernel methods more powerful. RKHM is a generalization of RKHS by means of C*-algebra [23, 24, 25], where a C*-algebra is a generalization of the space of complex numbers and is regarded as a space of operators. We focus on the C*-algebra of matrices in this paper. Applications of RKHMs to kernel methods have been investigated recently [26, 27]. We generalize the concept of deep kernel learning to RKHM, which constructs a map from an operator to an operator as the composition of functions in RKHMs. The product structure of operators induces interactions among elements of matrices, which enables us to capture relations between data components. We then derive a generalization bound for the proposed deep RKHM. By virtue of C*-algebras, we can use the operator norm, which alleviates the dependency of the generalization bound on the output dimension.
We also use Perron–Frobenius operators, which are linear operators describing the composition of functions and have been applied to analyzing dynamical systems [28, 29, 30, 31], to derive the bound. The compositions in the deep RKHM are effectively analyzed by the Perron–Frobenius operators.
C*-algebras and Perron–Frobenius operators are powerful tools that connect the proposed deep RKHM with existing studies. Since the norm of the Perron–Frobenius operator is described by the Gram matrix associated with the kernel, our bound shows a connection of the deep RKHM with benign overfitting. In addition, the product structure of a C*-algebra enables us to provide a connection between deep RKHMs and convolutional neural networks (CNNs).
Our main contributions are as follows.
- We propose deep RKHM, which is a generalization of the deep kernel method by means of C*-algebra. We can make use of the products in the C*-algebra to induce interactions among data components. We also show a representer theorem to guarantee that solutions are represented using only the given data.
- We derive a generalization bound for deep RKHM. The dependency of the bound on the output dimension is milder than that of existing bounds by virtue of C*-algebras. In addition, the Perron–Frobenius operators provide a connection of our bound with benign overfitting.
- We show connections of our study with existing studies such as CNNs and the neural tangent kernel.
We emphasize that our theoretical analysis using C*-algebra and Perron–Frobenius operators gives a new and powerful lens through which one can design and analyze kernel methods.
Table 1: Rademacher generalization bounds for kernel methods.

| Reproducing space | Output dimension | Shallow | Deep |
|---|---|---|---|
| RKHS | 1 | [10] | |
| vvRKHS | | [12, 13] | [5] |
| RKHM (existing) | | [27] | – |
| RKHM (ours) | | Theorem 4.1 | Theorem 4.5 |
2 Preliminaries
2.1 C*-algebra and reproducing kernel Hilbert C*-module
A C*-algebra, which is denoted by $\mathcal{A}$ in the following, is a Banach space equipped with a product and an involution satisfying the C*-identity (condition 3 below).
Definition 2.1 (C*-algebra)
A set $\mathcal{A}$ is called a C*-algebra if it satisfies the following conditions:
1. $\mathcal{A}$ is an algebra over $\mathbb{C}$ and equipped with a bijection $(\cdot)^*: \mathcal{A} \to \mathcal{A}$ that satisfies the following conditions for $\alpha, \beta \in \mathbb{C}$ and $a, b \in \mathcal{A}$: $(\alpha a + \beta b)^* = \overline{\alpha}\, a^* + \overline{\beta}\, b^*$, $(ab)^* = b^* a^*$, $(a^*)^* = a$.
2. $\mathcal{A}$ is a normed space endowed with a norm $\|\cdot\|$, and for $a, b \in \mathcal{A}$, $\|ab\| \le \|a\|\,\|b\|$ holds. In addition, $\mathcal{A}$ is complete with respect to $\|\cdot\|$.
3. For $a \in \mathcal{A}$, the C*-identity $\|a^* a\| = \|a\|^2$ holds.
Example 2.2
A standard example, and the one we focus on in this paper, is the C*-algebra of $d \times d$ complex matrices, with the matrix product as the product, the conjugate transpose as the involution, and the operator norm as the norm.
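As a minimal numerical illustration, the following sketch checks the C*-identity for this matrix algebra; the matrix size and the random seed are arbitrary choices.

```python
# Minimal numerical check of the C*-identity ||a* a|| = ||a||^2 in the
# C*-algebra of d x d complex matrices with the operator (spectral) norm.
# The dimension d and the random seed are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
d = 4
a = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))

op_norm = lambda m: np.linalg.norm(m, ord=2)   # largest singular value
print(np.isclose(op_norm(a.conj().T @ a), op_norm(a) ** 2))   # True
```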
We now define RKHM. Let $\mathcal{X}$ be a non-empty set for data.
Definition 2.3 ($\mathcal{A}$-valued positive definite kernel)
An $\mathcal{A}$-valued map $k: \mathcal{X} \times \mathcal{X} \to \mathcal{A}$ is called a positive definite kernel if it satisfies the following conditions:
$k(x, y) = k(y, x)^*$ for $x, y \in \mathcal{X}$, and
$\sum_{i,j=1}^{n} c_i^* k(x_i, x_j) c_j$ is positive semi-definite for $n \in \mathbb{N}$, $c_1, \ldots, c_n \in \mathcal{A}$, $x_1, \ldots, x_n \in \mathcal{X}$.
Let $\phi$ be the feature map associated with $k$, defined as $\phi(x) = k(\cdot, x)$ for $x \in \mathcal{X}$, and let $\mathcal{M}_{k,0} = \{\sum_{i=1}^{n} \phi(x_i) c_i \mid n \in \mathbb{N},\ c_i \in \mathcal{A},\ x_i \in \mathcal{X}\}$. We can define an $\mathcal{A}$-valued map $\langle \cdot, \cdot \rangle_k: \mathcal{M}_{k,0} \times \mathcal{M}_{k,0} \to \mathcal{A}$ as
$\big\langle \sum_{i=1}^{n} \phi(x_i) c_i, \sum_{j=1}^{m} \phi(y_j) d_j \big\rangle_k = \sum_{i=1}^{n} \sum_{j=1}^{m} c_i^* k(x_i, y_j) d_j$,
which enjoys the reproducing property $\langle \phi(x), f \rangle_k = f(x)$ for $f \in \mathcal{M}_{k,0}$ and $x \in \mathcal{X}$. The reproducing kernel Hilbert C*-module (RKHM) $\mathcal{M}_k$ associated with $k$ is defined as the completion of $\mathcal{M}_{k,0}$. See, for example, the references [33, 34, 26] for more details about C*-algebras and RKHMs.
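As a minimal sketch, the following builds an $\mathcal{A}$-valued Gram matrix for a separable kernel $k(x, y) = a\,\tilde{k}(x, y)$, with $\mathcal{A}$ the algebra of $d \times d$ matrices and $\tilde{k}$ a Gaussian kernel, and checks that the stacked block Gram matrix is positive semi-definite. The kernel choice, the matrix $a$, and all sizes are illustrative assumptions, not a prescription from the text.

```python
# Sketch: A-valued Gram matrix of a separable kernel k(x, y) = a * ktilde(x, y),
# with A the d x d matrices and ktilde a Gaussian kernel. Illustrative only.
import numpy as np

def gaussian(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def block_gram(xs, a, sigma=1.0):
    """n*d x n*d matrix whose (i, j) block is k(x_i, x_j) = a * gaussian(x_i, x_j)."""
    n, d = len(xs), a.shape[0]
    G = np.zeros((n * d, n * d))
    for i in range(n):
        for j in range(n):
            G[i*d:(i+1)*d, j*d:(j+1)*d] = a * gaussian(xs[i], xs[j], sigma)
    return G

rng = np.random.default_rng(0)
d = 3
b = rng.standard_normal((d, d))
a = b @ b.T + np.eye(d)                          # positive definite (hence invertible)
xs = [rng.standard_normal(5) for _ in range(4)]  # four 5-dimensional inputs
G = block_gram(xs, a)
print(G.shape, np.linalg.eigvalsh(G).min() >= -1e-10)   # (12, 12) True
```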
2.2 Perron–Frobenius operator on RKHM
We introduce the Perron–Frobenius operator on RKHMs [35]. Let $\mathcal{X}_1$ and $\mathcal{X}_2$ be non-empty sets and let $k_1$ and $k_2$ be $\mathcal{A}$-valued positive definite kernels on $\mathcal{X}_1$ and $\mathcal{X}_2$, respectively. Let $\mathcal{M}_{k_1}$ and $\mathcal{M}_{k_2}$ be the RKHMs associated with $k_1$ and $k_2$, respectively, and let $\phi_1$ and $\phi_2$ be the feature maps of $\mathcal{M}_{k_1}$ and $\mathcal{M}_{k_2}$, respectively. We begin with the standard notion of linearity in RKHMs.
Definition 2.4 ($\mathcal{A}$-linear)
A linear map $P: \mathcal{M}_{k_1} \to \mathcal{M}_{k_2}$ is called $\mathcal{A}$-linear if for any $c \in \mathcal{A}$ and $v \in \mathcal{M}_{k_1}$, we have $P(vc) = (Pv)c$.
Definition 2.5 (Perron–Frobenius operator)
Let $f: \mathcal{X}_1 \to \mathcal{X}_2$ be a map. The Perron–Frobenius operator with respect to $f$ is an $\mathcal{A}$-linear map $P_f: \mathcal{M}_{k_1} \to \mathcal{M}_{k_2}$ satisfying $P_f \phi_1(x) = \phi_2(f(x))$ for $x \in \mathcal{X}_1$.
Note that a Perron–Frobenius operator is not always well-defined, since $\sum_{i=1}^{n} \phi_1(x_i) c_i = 0$ for $x_i \in \mathcal{X}_1$ and nonzero $c_i \in \mathcal{A}$ does not always mean $\sum_{i=1}^{n} \phi_2(f(x_i)) c_i = 0$.
Definition 2.6 ($\mathcal{A}$-linearly independent)
The set $\{\phi_1(x) \mid x \in \mathcal{X}_1\}$ is called $\mathcal{A}$-linearly independent if it satisfies the following condition: for any $n \in \mathbb{N}$, $c_1, \ldots, c_n \in \mathcal{A}$, and $x_1, \ldots, x_n \in \mathcal{X}_1$, "$\sum_{i=1}^{n} \phi_1(x_i) c_i = 0$" is equivalent to "$c_i = 0$ for all $i = 1, \ldots, n$".
Lemma 2.7
If $\{\phi_1(x) \mid x \in \mathcal{X}_1\}$ is $\mathcal{A}$-linearly independent, then $P_f$ is well-defined.
The following lemma gives a sufficient condition for $\mathcal{A}$-linear independence.
Lemma 2.8
Let $k_1 = a \tilde{k}_1$, i.e., $k_1$ is separable, for an invertible operator $a \in \mathcal{A}$ and a complex-valued positive definite kernel $\tilde{k}_1$. Assume $\{\tilde{\phi}_1(x) \mid x \in \mathcal{X}_1\}$ is linearly independent (e.g., $\tilde{k}_1$ is Gaussian or Laplacian), where $\tilde{\phi}_1$ is the feature map associated with $\tilde{k}_1$. Then, $\{\phi_1(x) \mid x \in \mathcal{X}_1\}$ is $\mathcal{A}$-linearly independent.
Note that separable kernels are widely used in the existing literature on vvRKHSs (see, e.g., [36]). Lemma 2.8 guarantees the validity of the analysis with Perron–Frobenius operators, at least in the separable case. This provides a condition for "good kernels" by means of the well-definedness of the Perron–Frobenius operator. A small numerical illustration is given below.
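The following sketch illustrates the sufficient condition numerically: for a Gaussian scalar kernel and distinct sample points, the scalar Gram matrix is strictly positive definite, so the scalar features are linearly independent; with an invertible $a$, Lemma 2.8 then gives $\mathcal{A}$-linear independence. The sample size, dimension, and bandwidth are arbitrary choices.

```python
# Sketch: the scalar Gram matrix of a Gaussian kernel at distinct points is
# strictly positive definite (hence invertible), illustrating the hypothesis
# of Lemma 2.8. Sizes and bandwidth are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(1)
xs = rng.standard_normal((6, 2))                              # six distinct points
sq = np.sum((xs[:, None, :] - xs[None, :, :]) ** 2, axis=-1)  # pairwise squared distances
G = np.exp(-sq / 2.0)                                         # Gaussian Gram matrix
print(np.linalg.eigvalsh(G).min() > 0)                        # True
```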
Notation
We denote the Euclidean inner product and norm by $\langle \cdot, \cdot \rangle$ and $\|\cdot\|$ without a subscript. For $c \in \mathcal{A}$, let $|c|_{\mathcal{A}}$ be the unique positive element in $\mathcal{A}$ satisfying $|c|_{\mathcal{A}}^2 = c^* c$. If $c$ is a matrix, $\|c\|_{\mathrm{HS}}$ is the Hilbert–Schmidt norm of $c$. The operator norm of a linear operator on an RKHM is denoted by $\|\cdot\|_{\mathrm{op}}$. All the technical proofs are documented in the supplementary material.
3 Deep RKHM
We now construct an $L$-layer deep RKHM. Let $\mathcal{A}$ be the C*-algebra of $d$ by $d$ matrices. Let $\mathcal{A}_0, \ldots, \mathcal{A}_L$ be C*-subalgebras of $\mathcal{A}$, and for $j = 1, \ldots, L$, let $k_j$ be an $\mathcal{A}$-valued positive definite kernel on $\mathcal{A}_{j-1}$ for the $j$th layer. For $j = 1, \ldots, L$, we denote by $\mathcal{M}_j$ the RKHM over $\mathcal{A}$ associated with $k_j$. In addition, we denote by $\tilde{\mathcal{M}}_j$ the RKHM over $\mathcal{A}_j$ associated with $k_j$. Note that $\tilde{\mathcal{M}}_j$ is a subspace of $\mathcal{M}_j$, and for $f \in \tilde{\mathcal{M}}_j$ and $x \in \mathcal{A}_{j-1}$, we have $f(x) \in \mathcal{A}_j$. We set the function space $\mathcal{F}_j$ corresponding to the $j$th layer as a norm-bounded subset of $\tilde{\mathcal{M}}_j$. Here, for $f_j \in \mathcal{F}_j$ with $j = 1, \ldots, L-1$, $P_{f_j}$ is the Perron–Frobenius operator with respect to $f_j$ from $\mathcal{M}_j$ to $\mathcal{M}_{j+1}$. We assume the well-definedness of these operators. Then, we set the class of deep RKHM as $\mathcal{F}^{\mathrm{deep}}_L = \{f_L \circ \cdots \circ f_1 \mid f_j \in \mathcal{F}_j\}$.
Figure 1 schematically shows the structure of the deep RKHM.
[Figure 1: Schematic structure of the deep RKHM.]
Example 3.1
We can set $\mathcal{A}_j$ as the C*-algebra of block diagonal matrices. By choosing the block sizes so that the number of nonzero elements in $\mathcal{A}_j$ decreases during the first layers and increases during the last layers, this construction is regarded as an autoencoder, where the first layers correspond to the encoder and the last layers correspond to the decoder. A sketch of such block structures is given below.
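A minimal sketch of such a block-diagonal structure, assuming concrete block sizes chosen for illustration only: the number of nonzero entries per layer first shrinks and then grows, mirroring the encoder–decoder picture above.

```python
# Sketch of block-diagonal masks whose block sizes shrink (encoder) and then
# grow (decoder), as in Example 3.1. The concrete sizes are assumptions.
import numpy as np

def block_mask(block_sizes):
    """0/1 mask of a block-diagonal matrix with the given block sizes."""
    d = sum(block_sizes)
    mask = np.zeros((d, d))
    start = 0
    for b in block_sizes:
        mask[start:start + b, start:start + b] = 1.0
        start += b
    return mask

layer_blocks = [[8], [4, 4], [2, 2, 2, 2], [4, 4], [8]]   # encoder, then decoder
masks = [block_mask(bs) for bs in layer_blocks]
print([int(m.sum()) for m in masks])   # [64, 32, 16, 32, 64]
```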
Advantage over existing deep kernel methods
Note that deep kernel methods with RKHSs and vvRKHSs have been proposed [1, 4, 2], and autoencoders using RKHSs and vvRKHSs have also been proposed [3, 5]. Deep RKHM has at least three advantages over deep RKHSs and vvRKHSs: 1) the useful structure of matrices stated in Remark 5.2, 2) the availability of the operator norm in the generalization bound stated in Section 4, and 3) the connection with CNNs stated in Subsection 6.1.
4 Generalization Bound
We derive a generalization bound for deep RKHM. To derive the bound, we bound the Rademacher complexity, which is one of the typical tools for deriving generalization bounds [11]. Let $\Omega$ be a probability space equipped with a probability measure $P$, and denote the integral of a measurable function on $\Omega$ by $\mathrm{E}[\cdot]$. Let $x_1, \ldots, x_n$ and $y_1, \ldots, y_n$ be input and output samples from the distributions of the $\mathcal{A}_0$-valued and $\mathcal{A}_L$-valued random variables $x$ and $y$, respectively. Let $\sigma_1, \ldots, \sigma_n$ be i.i.d. Rademacher variables. For an $\mathbb{R}$-valued function class $\mathcal{G}$ and the samples $x_1, \ldots, x_n$, the empirical Rademacher complexity is defined by $\hat{R}_n(\mathcal{G}) = \mathrm{E}_{\sigma}\big[\sup_{g \in \mathcal{G}} \tfrac{1}{n} \sum_{i=1}^{n} \sigma_i g(x_i)\big]$.
4.1 Bound for shallow RKHMs
We use the operator norm to derive a bound whose dependency on the output dimension is milder than that of existing bounds. The availability of the operator norm is one of the advantages of considering C*-algebras (RKHMs) instead of vectors (vvRKHSs). Note that although we can also use the Hilbert–Schmidt norm for matrices, it tends to grow as the dimension becomes large. On the other hand, the operator norm is defined independently of the dimension $d$. Indeed, the Hilbert–Schmidt norm of a matrix $a \in \mathbb{C}^{d \times d}$ is calculated as $\|a\|_{\mathrm{HS}} = (\sum_{i=1}^{d} \sigma_i(a)^2)^{1/2}$, where $\sigma_i(a)$ is the $i$th singular value of $a$. The operator norm of $a$ is the largest singular value of $a$.
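A quick numerical illustration of this difference, using the identity matrix as an example: its Hilbert–Schmidt norm grows like the square root of the dimension, while its operator norm stays at one.

```python
# Hilbert-Schmidt (Frobenius) norm vs. operator (spectral) norm of the
# d x d identity matrix: sqrt(d) vs. 1, for growing d.
import numpy as np

for d in [10, 100, 1000]:
    I = np.eye(d)
    print(d, np.linalg.norm(I, 'fro'), np.linalg.norm(I, 2))
```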
To see the derivation of the bound, we first focus on the case of $L = 1$, i.e., the network is shallow, and consider a space of $\mathcal{A}$-valued functions on $\mathcal{A}_0$. The following theorem shows a bound for RKHMs with the operator norm.
Theorem 4.1
Assume there exists such that for any . Let and . Let . Then, for any , where is defined in Section 3, with probability at least , we have
Theorem 4.1 is derived from the following lemmas. We first fix a vector and consider the operator-valued loss function acting on it. We then show a relation between the generalization error and the Rademacher complexity of vector-valued functions, and bound this Rademacher complexity. Since the bound does not depend on the fixed vector, we can finally remove the dependency on it.
Lemma 4.2
Let be a function class of -valued functions on bounded by (i.e., for any ). Let and . Let satisfy and let . Then, for any , with probability at least , we have
Lemma 4.3
With the same notations in Lemma 4.2, let . Then, we have , where .
Lemma 4.4
Let satisfy . For defined in Section 3, we have
4.2 Bound for deep RKHMs
We now generalize Theorem 4.1 to the deep setting ($L \ge 2$) using the Perron–Frobenius operators.
Theorem 4.5
Assume there exists such that for any . Let and . Let . Then, for any , with probability at least , we have
We use the following proposition and Lemmas 4.2 and 4.3 to show Theorem 4.5. The key idea of the proof is that, by the reproducing property and the definition of the Perron–Frobenius operator, we get $f_L \circ \cdots \circ f_1(x) = \langle \phi_L(f_{L-1} \circ \cdots \circ f_1(x)), f_L \rangle_{k_L} = \langle P_{f_{L-1}} \cdots P_{f_1} \phi_1(x), f_L \rangle_{k_L}$.
Proposition 4.6
Let satisfy . Then, we have
Here, is the submodule of generated by .
Corollary 4.7
Let satisfy . Then, we have
Comparison to deep vvRKHS
We can also regard $\mathcal{A} = \mathbb{C}^{d \times d}$ as the Hilbert space equipped with the Hilbert–Schmidt inner product, i.e., we can flatten matrices and get a $d^2$-dimensional Hilbert space. In this case, the corresponding operator-valued kernel is the multiplication operator of $k(x, y)$, and we can apply existing results for vvRKHSs [5, 12], which involve the trace term of this operator-valued kernel.
Thus, using the existing approaches, we obtain a factor that grows with the dimension $d$ through the Hilbert–Schmidt norm. On the other hand, we have the smaller factor based on the operator norm in Theorems 4.1 and 4.5. Using the operator norm, we can reduce the dependency on the dimension $d$.
5 Learning Deep RKHMs
We focus on the practical learning problem. To learn deep RKHMs, we consider the following minimization problem based on the generalization bound derived in Section 4:
(1)
The second term regarding the Perron–Frobenius operators comes from the bound in Proposition 4.6. We try to reduce the generalization error by reducing the norms of the Perron–Frobenius operators.
5.1 Representer theorem
We first show a representer theorem to guarantee that a solution of the minimization problem (1) can be represented using only the given samples.
Proposition 5.1
Let be an error function, let be an -valued function on the space of bounded linear operators on , and let satisfy for . Assume the following minimization problem has a solution:
Then, there exists a solution admitting a representation of the form $f_j = \sum_{i=1}^{n} \phi_j(x_i^{(j-1)}) c_{j,i}$ for some $c_{j,1}, \ldots, c_{j,n} \in \mathcal{A}$ and for $j = 1, \ldots, L$. Here, $x_i^{(j)} = f_j \circ \cdots \circ f_1(x_i)$ for $j = 1, \ldots, L$ and $x_i^{(0)} = x_i$.
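As a hedged sketch of how such a solution can be evaluated, the code below assumes each layer takes the form $f_j(x) = \sum_{i=1}^{n} k_j(x, x_i^{(j-1)})\, c_{j,i}$ with a separable Gaussian kernel, where the anchors $x_i^{(j-1)}$ are the previous layers' outputs on the training samples; the kernel, shapes, and initialization are illustrative assumptions rather than the paper's implementation.

```python
# Hedged sketch: evaluating a deep RKHM in the representer-theorem form of
# Proposition 5.1, f_j(x) = sum_i k_j(x, x_i^{(j-1)}) c_{j,i}, assuming a
# separable Gaussian kernel k_j(x, y) = a * gaussian(x, y). Illustrative only.
import numpy as np

def gaussian(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def layer_eval(x, anchors, coeffs, a, sigma=1.0):
    """One layer: sum over anchors of k(x, anchor) @ coefficient."""
    out = np.zeros_like(coeffs[0])
    for anchor, c in zip(anchors, coeffs):
        out += (a * gaussian(x, anchor, sigma)) @ c
    return out

def deep_rkhm_eval(x, train_inputs, coeffs_per_layer, a, sigma=1.0):
    """Compose the layers; layer j uses the layer-(j-1) outputs of the training samples as anchors."""
    anchors, out = list(train_inputs), x
    for coeffs in coeffs_per_layer:
        out = layer_eval(out, anchors, coeffs, a, sigma)
        # the list comprehension below still reads the previous anchors
        anchors = [layer_eval(z, anchors, coeffs, a, sigma) for z in anchors]
    return out

rng = np.random.default_rng(0)
d, n, L = 3, 4, 2
a = np.eye(d)
train_inputs = [rng.standard_normal((d, d)) for _ in range(n)]
coeffs = [[0.1 * rng.standard_normal((d, d)) for _ in range(n)] for _ in range(L)]
print(deep_rkhm_eval(train_inputs[0], train_inputs, coeffs, a).shape)   # (3, 3)
```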
Remark 5.2
An advantage of deep RKHM compared to deep vvRKHS is that we can make use of the structure of matrices. For example, the product of two diagonal matrices is calculated by the elementwise product of the diagonal elements. Thus, when $\mathcal{A}_j$ is the C*-algebra of diagonal matrices, interactions among elements in an input are induced only by the kernels, not by the products with the coefficients $c_{j,i}$. That is, the form of the interactions does not depend on the learned parameters $c_{j,i}$. On the other hand, if we set $\mathcal{A}_j$ as the C*-algebra of block diagonal matrices with block sizes larger than one, then at the $j$th layer, we get interactions among elements in the same block through the products with $c_{j,i}$. In this case, the form of the interactions is learned through $c_{j,i}$. For example, the encoder part in Example 3.1 tries to gradually reduce the dependency among elements in the output of each layer and describe the input with a small number of variables. The decoder part tries to increase the dependency.
5.2 Computing the norm of the Perron–Frobenius operator
We discuss the practical computation and the role of the factor regarding the Perron–Frobenius operators in Eq. (1). Let $G_j$ be the Gram matrix whose $(i, l)$-entry is $k_j(x_i^{(j-1)}, x_l^{(j-1)})$.
Proposition 5.3
For , let be the QR decomposition of . Then, we have .
Computing this norm is expensive if $n$ is large since computing the QR decompositions in Proposition 5.3 is expensive. Thus, we consider upper bounding it by a computationally efficient value.
Proposition 5.4
Assume is invertible. Then, we have
(2)
Since is independent of , to make the value small, we try to make the norm of and small according to Proposition 5.4. For example, instead of the second term regarding the Perron–Frobenius operators in Eq. (1), we can consider the following term with :
Note that this term depends on the training samples. The situation is different from the third term in Eq. (1), since the third term does not depend on the training samples before applying the representer theorem. If we try to minimize a value depending on the training samples, the model becomes more specific to the training samples, which may cause overfitting. Thus, the connection between the minimization of the second term in Eq. (1) and generalization cannot be explained by the classical argument about generalization and regularization. However, as we will see in Subsection 6.2, it is related to benign overfitting and has a good effect on generalization.
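The exact additional term is left unspecified above; purely as a hedged sketch motivated by inequality (2) and the discussion in Subsection 6.2, the code below computes a cheap Gram-matrix surrogate: the ratio of the largest singular value of the last layer's Gram matrix to the smallest singular value of the first layer's Gram matrix, weighted by a coefficient `lam`. The kernel, the names, and this particular combination are assumptions, not the term used in the paper.

```python
# Hedged sketch (not the paper's exact term): a Gram-matrix surrogate for the
# Perron-Frobenius regularization, motivated by inequality (2). The Gaussian
# kernel, `lam`, and the ratio form are illustrative assumptions.
import numpy as np

def gaussian(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def gram(samples, sigma=1.0):
    """Scalar Gram matrix G[i, j] = gaussian(samples[i], samples[j])."""
    n = len(samples)
    return np.array([[gaussian(samples[i], samples[j], sigma) for j in range(n)]
                     for i in range(n)])

def pf_surrogate(first_layer_inputs, last_layer_outputs, lam=0.1):
    """lam * sigma_max(G_last) / sigma_min(G_first); small when the spectra are flat."""
    s_first = np.linalg.svd(gram(first_layer_inputs), compute_uv=False)
    s_last = np.linalg.svd(gram(last_layer_outputs), compute_uv=False)
    return lam * s_last.max() / s_first.min()

rng = np.random.default_rng(0)
xs = [rng.standard_normal((3, 3)) for _ in range(5)]        # training inputs
zs = [x + 0.05 * rng.standard_normal((3, 3)) for x in xs]   # stand-in layer outputs
print(pf_surrogate(xs, zs))
```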
Remark 5.5
Remark 5.6
To evaluate $f_j$ for $j = 1, \ldots, L$, we have to construct a Gram matrix and compute the product of the Gram matrix and a vector for each layer. The computational cost of the construction of the Gram matrices does not depend on $d$. The cost of computing the product depends on the number of nonzero elements in the matrices in $\mathcal{A}_j$; if $\mathcal{A}_j$ is the C*-algebra of block diagonal matrices, this number is small. Regarding the cost with respect to $n$, if the positive definite kernel is separable, we can use random Fourier features [37] to reduce the factor depending on $n$ to one that is linear in $n$ times a small integer (the number of random features).
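A minimal random Fourier features sketch [37] for the scalar part of a separable kernel, assuming a Gaussian kernel: a Gram–vector product $Gv$ is approximated by $Z(Z^{\top}v)$ with an $n \times m$ feature matrix $Z$, avoiding the explicit $n \times n$ Gram matrix. The bandwidth, the number of features $m$, and the sizes are arbitrary choices.

```python
# Random Fourier features [37] for the Gaussian kernel: approximate the
# Gram-vector product G v by Z (Z^T v). Sizes and bandwidth are arbitrary.
import numpy as np

def rff_features(X, m=2000, sigma=1.0, seed=0):
    """Feature map z(x) with E[z(x) . z(y)] = exp(-||x - y||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], m)) / sigma
    b = rng.uniform(0.0, 2 * np.pi, size=m)
    return np.sqrt(2.0 / m) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
v = rng.standard_normal(200)
Z = rff_features(X)
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
G = np.exp(-sq / 2.0)                                   # exact Gaussian Gram matrix
print(np.abs(G @ v - Z @ (Z.T @ v)).max())              # small approximation error
```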
6 Connection and Comparison with Existing Studies
The proposed deep RKHM is deeply related to existing studies by virtue of C*-algebra and the Perron–Frobenius operators. We discuss these connections below.
6.1 Connection with CNN
The proposed deep RKHM has a duality with CNNs. Let $\mathcal{A}_j$ be the C*-algebra of $d$ by $d$ circulant matrices. Let $k_j$ be an $\mathcal{A}_j$-valued positive definite kernel defined through an $\mathcal{A}_j$-valued nonlinear function. This $\mathcal{A}_j$-valued positive definite kernel makes the output of each layer become a circulant matrix, which enables us to apply the convolution as the product of the output and a parameter. Then, at the $j$th layer, the input is multiplied by a circulant parameter matrix and is transformed nonlinearly through the kernel. Since the product of two circulant matrices corresponds to the convolution, the circulant parameter corresponds to a filter. In addition, the nonlinear transformation induced by the kernel corresponds to the activation function at the $j$th layer. Thus, the deep RKHM with the above setting corresponds to a CNN. The difference between the deep RKHM and the CNN lies in the parameters that we learn. Whereas for the deep RKHM we learn the coefficients $c_{j,i}$, for the CNN we learn the filter. In other words, whereas for the deep RKHM we learn the activation function, for the CNN we learn the filter. It seems reasonable to interpret this difference as a consequence of solving the problem in the primal or in the dual. In the primal, the number of learning parameters depends on the dimension of the data (or of the filter for the convolution), while in the dual, it depends on the size of the data.
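The statement that the product of two circulant matrices corresponds to a (circular) convolution can be checked directly; the sketch below verifies it numerically, with the vector length chosen arbitrarily.

```python
# Check: circulant(u) @ circulant(w) is the circulant matrix of the circular
# convolution of u and w, i.e., products of circulant matrices are convolutions.
import numpy as np
from scipy.linalg import circulant

rng = np.random.default_rng(0)
n = 6
u, w = rng.standard_normal(n), rng.standard_normal(n)
C = circulant(u) @ circulant(w)
conv = np.real(np.fft.ifft(np.fft.fft(u) * np.fft.fft(w)))   # circular convolution
print(np.allclose(C, circulant(conv)))                       # True
```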
Remark 6.1
The connection between CNNs and shallow RKHMs has already been studied [27]. However, the existing study does not provide the connection between the filter and the activation function. The above investigation shows a clearer layer-wise connection of deep RKHMs with CNNs.
6.2 Connection with benign overfitting
Benign overfitting is the phenomenon that a model fits any amount of data yet generalizes well [20, 21]. For kernel regression, Mallinar et al. [22] showed that if the eigenvalues of the integral operator associated with the kernel function over the data distribution decay more slowly than any power-law decay, then the model exhibits benign overfitting. The Gram matrix is obtained by replacing the integral with the sum over the finite samples. The inequality (2) suggests that the generalization error becomes smaller as the smallest and the largest eigenvalues of the Gram matrix get closer, which means the eigenvalue decay is slower. Combining the observation in Remark 5.5, we can interpret this as follows: as the right-hand side of the inequality (2) becomes smaller, the outputs of noisy training data tend to be more separated from the other outputs. In other words, for the random variable following the data distribution, the network is learned so that the distribution of its outputs generates an integral operator with more separated eigenvalues, which favors benign overfitting. We will also observe this phenomenon numerically in Section 7 and Appendix C.3.2. Since the generalization bound for deep vvRKHSs [5] is described by the Lipschitz constants of the feature maps and the norms of $f_j$ for $j = 1, \ldots, L$, this type of theoretical interpretation regarding benign overfitting is not available for the existing bound for deep vvRKHSs.
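As a small illustration of this spectral picture (with arbitrarily chosen bandwidths and sample sizes, not an experiment from the paper), the sketch below compares the eigenvalue decay of Gaussian Gram matrices: a small bandwidth gives a nearly flat spectrum (slow decay), while a large bandwidth gives rapidly decaying eigenvalues.

```python
# Eigenvalue decay of Gaussian Gram matrices for a small and a large bandwidth:
# nearly flat spectrum versus fast decay. Illustrative choices only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)

for sigma in [0.1, 5.0]:
    G = np.exp(-sq / (2 * sigma ** 2))
    eig = np.sort(np.linalg.eigvalsh(G))[::-1]
    print(sigma, np.round(eig[:5] / eig[0], 3))   # leading eigenvalues, normalized
```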
Remark 6.2
The above arguments about benign overfitting are valid only for deep RKHMs, i.e., the case of $L \ge 2$. If $L = 1$ (shallow RKHM), then the Gram matrix is fixed and determined only by the training data and the kernel. On the other hand, if $L \ge 2$ (deep RKHM), then $G_j$ depends also on $f_1, \ldots, f_{j-1}$. As a result, by adding the term involving the Perron–Frobenius operators to the loss function, we can learn proper $f_j$ so that they make the right-hand side of Eq. (2) small, and the whole network overfits benignly. As $L$ becomes large, the functions change more flexibly to attain a smaller value of the term. This is an advantage of considering a large $L$.
6.3 Connection with neural tangent kernel
The neural tangent kernel has been investigated to understand neural networks using the theory of kernel methods [6, 7]. Generalizing neural networks to C*-algebras, which yields the so-called C*-algebra network, has also been investigated [32, 38]. We define a neural tangent kernel for the C*-algebra network and develop a theory for combining neural networks and deep RKHMs as an analogy of the existing studies. Consider the C*-algebra network over $\mathcal{A}$ with $\mathcal{A}$-valued weight matrices and element-wise activation functions. Each entry of the output corresponds to the output of a standard network whose weights are the corresponding columns and rows of the $\mathcal{A}$-valued weight matrices. We can consider the neural tangent kernel for each such network. Chen and Xu [7] showed that the RKHS associated with the neural tangent kernel restricted to the sphere is the same set as that associated with the Laplacian kernel. Therefore, each row of the output is described by the shallow RKHS associated with the neural tangent kernel. Let $k$ be the $\mathcal{A}$-valued positive definite kernel whose entries are built from these neural tangent kernels. Then, for any function in the corresponding class, the elements in each row of the output are in the RKHS associated with the neural tangent kernel, i.e., associated with the Laplacian kernel. Thus, the network is described by this shallow RKHM.
Remark 6.3
We can combine the deep RKHM and existing neural networks by replacing some of the layers $f_j$ in our model with an existing neural network. The above observation enables us to apply our results in Section 4 to the combined network.
6.4 Comparison to bounds for classical neural networks
Existing bounds for classical neural networks typically depend on the product or sum of matrix norms of all the weight matrices [14, 15, 16, 17]. Note that the Hilbert–Schmidt (Frobenius) norm is such a matrix norm. Unlike the operator norm, this matrix norm tends to grow as the width of the layers becomes large. On the other hand, the dependency of our bound on the width of a layer is not affected by the number of layers in the case where the kernels are separable. Indeed, assume we set the kernels to be separable with complex-valued kernels and invertible matrices. Then, by Proposition 5.3, the factor regarding the Perron–Frobenius operators is written in terms of the scalar Gram matrices, and thus it is independent of $d$. The only part depending on $d$ is the factor from the last layer, so the width dependency does not accumulate across layers. Note that the width of the $j$th layer corresponds to the number of nonzero elements in a matrix in $\mathcal{A}_j$. We also discuss in Appendix B the connection of our bound with the bound based on Koopman operators, the adjoints of the Perron–Frobenius operators [18].
7 Numerical Results
We numerically confirm our theory and the validity of the proposed deep RKHM.
Comparison to vvRKHS
We compared the generalization property of the deep RKHM to that of the deep vvRKHS with the same positive definite kernel. For and , we set as input samples, where and are randomly generated by , the normal distribution with mean 0 and standard deviation 0.1, is the elementwise product, and is the random noise drawn from . We reshaped each sample to a $d$ by $d$ matrix. We set and for , where is the Laplacian kernel. For RKHMs, we set , , and . This is the autoencoder mentioned in Example 3.1. For vvRKHSs, we set the corresponding Hilbert spaces with the Hilbert–Schmidt inner product. We set the loss function as for the deep RKHM and as for the deep vvRKHS. Here, . We did not add any terms to the loss function, to see how the loss function with the operator norm affects the generalization performance. We computed the same value for both the RKHM and the vvRKHS. Figure 2 (a) shows the results. We can see that the deep RKHM generalizes better than the deep vvRKHS, only with the loss function and without any additional terms.
Observation about benign overfitting
We analyzed overfitting numerically. For and , we randomly sampled $d$ by $d$ diagonal matrices from the normal distribution . We set for , where is a noise drawn from the normal distribution . The magnitude of the noise is % of . In addition, we set , , , and as the same kernel as in the above experiment. The additional term to the loss function is set as , where and according to Subsection 5.2. We computed the generalization error for the cases of and . Figure 2 (b) shows the result. We can see that the generalization error saturates without the additional term motivated by the Perron–Frobenius operator. On the other hand, with the additional term, the generalization error becomes small, which is the effect of benign overfitting.
Comparison to CNN
We compared the deep RKHM to a CNN on the classification task with MNIST [39]. We set and . We constructed a deep RKHM combined with a neural network with 2 dense layers. For the deep RKHM, we set , , , and . Then, two dense layers are added. See Subsection 6.3 about combining the deep RKHM with neural networks. Regarding the additional term to the loss function, we set the same term with the previous experiment with and set or . To compare the deep RKHM to CNNs, we also constructed a network by replacing the deep RKHM with a CNN. The CNN is composed of 2 layers with and filters. Figure 2 (c) shows the test accuracy of these networks. We can see that the deep RKHM outperforms the CNN. In addition, we can see that the test accuracy is higher if we add the term regarding the Perron–Frobenius operators. We discuss the memory consumption and the computational cost of each of the deep RKHM and the CNN in Appendix C.3, and we empirically show that the deep RKHM outperforms the CNN that has the same size of learning parameters as the deep RKHM. We also show additional results about benign overfitting in Appendix C.3.
[Figure 2: (a) Comparison to the deep vvRKHS. (b) Observation about benign overfitting. (c) Comparison to the CNN on MNIST.]
8 Conclusion and Limitations
In this paper, we proposed deep RKHM and analyzed it through C*-algebra and the Perron–Frobenius operators. We derived a generalization bound whose dependency on the output dimension is alleviated by the operator norm and which is related to benign overfitting. We showed a representer theorem for the proposed deep RKHM and connections with existing studies such as CNNs and the neural tangent kernel. Our theoretical analysis shows that C*-algebra and Perron–Frobenius operators are effective tools for analyzing deep kernel methods. The main contributions of this paper are our theoretical results with C*-algebra and the Perron–Frobenius operators. More practical investigations are required for further progress. For example, although we numerically showed the validity of our method for the case where the number of samples is limited (the last experiment in Section 7), more experimental results for the case where the number of samples is large would be useful for further analysis. Also, although we can apply random Fourier features (Remark 5.6) to reduce the computational costs, more efficient methods specific to deep RKHM remain to be investigated in future work. As for the theoretical side, we assumed the well-definedness of the Perron–Frobenius operators. Although separable kernels with invertible matrices, which are typical examples of kernels, satisfy the assumption, generalization of our analysis to other kernels should also be studied in future work.
Acknowledgements
Hachem Kadri is partially supported by grant ANR-19-CE23-0011 from the French National Research Agency. Masahiro Ikeda is partially supported by grant JPMJCR1913 from JST CREST.
References
- [1] Youngmin Cho and Lawrence Saul. Kernel methods for deep learning. In Proceedings of Advances in Neural Information Processing Systems (NIPS) 22, 2009.
- [2] Sebastian W. Ober, Carl E. Rasmussen, and Mark van der Wilk. The promises and pitfalls of deep kernel learning. In Proceedings of the 37th Conference on Uncertainty in Artificial Intelligence (UAI), 2021.
- [3] Behnam Gholami and Abolfazl Hajisami. Kernel auto-encoder for semi-supervised hashing. In Proceedings of 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), 2016.
- [4] Bastian Bohn, Michael Griebel, and Christian Rieger. A representer theorem for deep kernel learning. Journal of Machine Learning Research, 20(64):1–32, 2019.
- [5] Pierre Laforgue, Stéphan Clémençon, and Florence d’Alche Buc. Autoencoding any data through kernel autoencoders. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.
- [6] Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS) 31, 2018.
- [7] Lin Chen and Sheng Xu. Deep neural tangent kernel and Laplace kernel have the same RKHS. In Proceedings of the 9th International Conference on Learning Representations (ICLR), 2021.
- [8] Julien Mairal, Piotr Koniusz, Zaid Harchaoui, and Cordelia Schmid. Convolutional kernel networks. In Proceedings of the Advances in Neural Information Processing Systems (NIPS) 27, 2014.
- [9] Alberto Bietti, Grégoire Mialon, Dexiong Chen, and Julien Mairal. A kernel perspective for regularizing deep neural networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.
- [10] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
- [11] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT press, 2018.
- [12] Vikas Sindhwani, Ha Quang Minh, and Aurélie C. Lozano. Scalable matrix-valued kernel learning for high-dimensional nonlinear multivariate regression and granger causality. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI), 2013.
- [13] Riikka Huusari and Hachem Kadri. Entangled kernels - beyond separability. Journal of Machine Learning Research, 22(24):1–40, 2021.
- [14] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Proceedings of the 2015 Conference on Learning Theory (COLT), 2015.
- [15] Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. In Proceedings of the 2018 Conference On Learning Theory (COLT), 2018.
- [16] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Proceedings of Advances in Neural Information Processing Systems (NIPS) 31, 2017.
- [17] Haotian Ju, Dongyue Li, and Hongyang R Zhang. Robust fine-tuning of deep neural networks with Hessian-based generalization guarantees. In Proceedings of the 39th International Conference on Machine Learning (ICML), 2022.
- [18] Yuka Hashimoto, Sho Sonoda, Isao Ishikawa, Atsushi Nitanda, and Taiji Suzuki. Koopman-based bound for generalization: New aspect of neural networks regarding nonlinear noise filtering. arXiv: 2302.05825, 2023.
- [19] Taiji Suzuki, Hiroshi Abe, and Tomoaki Nishimura. Compression based bound for non-compressed network: unified generalization error analysis of large compressible deep neural network. In Proceedings of the 8th International Conference on Learning Representations (ICLR), 2020.
- [20] Mikhail Belkin, Alexander Rakhlin, and Alexandre B. Tsybakov. Does data interpolation contradict statistical optimality? In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.
- [21] Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020.
- [22] Neil Rohit Mallinar, James B Simon, Amirhesam Abedsoltan, Parthe Pandit, Misha Belkin, and Preetum Nakkiran. Benign, tempered, or catastrophic: Toward a refined taxonomy of overfitting. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS) 37, 2022.
- [23] Jaeseong Heo. Reproducing kernel Hilbert C*-modules and kernels associated with cocycles. Journal of Mathematical Physics, 49(10):103507, 2008.
- [24] Shigeru Itoh. Reproducing kernels in modules over C*-algebras and their applications. Journal of Mathematics in Nature Science, pages 1–20, 1990.
- [25] Mohammad S. Moslehian. Vector-valued reproducing kernel Hilbert C*-modules. Complex Analysis and Operator Theory, 16(1):Paper No. 2, 2022.
- [26] Yuka Hashimoto, Isao Ishikawa, Masahiro Ikeda, Fuyuta Komura, Takeshi Katsura, and Yoshinobu Kawahara. Reproducing kernel Hilbert C*-module and kernel mean embeddings. Journal of Machine Learning Research, 22(267):1–56, 2021.
- [27] Yuka Hashimoto, Masahiro Ikeda, and Hachem Kadri. Learning in RKHM: a C*-algebraic twist for kernel machines. In Proceedings of the 26th International Conference on Artificial Intelligence and Statistics (AISTATS), 2023.
- [28] Yoshinobu Kawahara. Dynamic mode decomposition with reproducing kernels for Koopman spectral analysis. In Advances in Neural Information Processing Systems (NIPS) 29, pages 911–919, 2016.
- [29] Isao Ishikawa, Keisuke Fujii, Masahiro Ikeda, Yuka Hashimoto, and Yoshinobu Kawahara. Metric on nonlinear dynamical systems with Perron-Frobenius operators. In Advances in Neural Information Processing Systems (NeurIPS) 31, pages 2856–2866, 2018.
- [30] Dimitrios Giannakis and Suddhasattwa Das. Extraction and prediction of coherent patterns in incompressible flows through space-time Koopman analysis. Physica D: Nonlinear Phenomena, 402:132211, 2020.
- [31] Yuka Hashimoto, Isao Ishikawa, Masahiro Ikeda, Yoichi Matsuo, and Yoshinobu Kawahara. Krylov subspace method for nonlinear dynamical systems with random noise. Journal of Machine Learning Research, 21(172):1–29, 2020.
- [32] Ryuichiro Hataya and Yuka Hashimoto. Noncommutative C*-algebra net: Learning neural networks with powerful product structure in C*-algebra. arXiv: 2302.01191, 2023.
- [33] E. Christopher Lance. Hilbert C*-modules – a Toolkit for Operator Algebraists. London Mathematical Society Lecture Note Series, vol. 210. Cambridge University Press, 1995.
- [34] Gerard J. Murphy. C*-Algebras and Operator Theory. Academic Press, 1990.
- [35] Yuka Hashimoto, Isao Ishikawa, Masahiro Ikeda, Fuyuta Komura, Takeshi Katsura, and Yoshinobu Kawahara. Analysis via orthonormal systems in reproducing kernel Hilbert C*-modules and applications. arXiv: 2003.00738, 2020.
- [36] Mauricio A Álvarez, Lorenzo Rosasco, Neil D Lawrence, et al. Kernels for vector-valued functions: A review. Foundations and Trends® in Machine Learning, 4(3):195–266, 2012.
- [37] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Proceedings of the Advances in Neural Information Processing Systems 20 (NIPS), 2007.
- [38] Yuka Hashimoto, Zhao Wang, and Tomoko Matsui. C*-algebra net: a new approach generalizing neural network parameters to C*-algebra. In Proceedings of the 39th International Conference on Machine Learning (ICML), 2022.
- [39] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- [40] Andreas Maurer. A vector-contraction inequality for rademacher complexities. In Proceedings of the 27th International Conference on Algorithmic Learning Theory (ALT), 2016.
Appendix A Proofs
We provide the proofs of the theorems, propositions, and lemmas in the main paper.
Lemma 2.7 If $\{\phi_1(x) \mid x \in \mathcal{X}_1\}$ is $\mathcal{A}$-linearly independent, then $P_f$ is well-defined.
Proof Assume $\sum_{i=1}^{n} \phi_1(x_i) c_i = 0$ for $x_i \in \mathcal{X}_1$ and $c_i \in \mathcal{A}$. Since $\{\phi_1(x) \mid x \in \mathcal{X}_1\}$ is $\mathcal{A}$-linearly independent, we have $c_i = 0$ for $i = 1, \ldots, n$. Thus, $\sum_{i=1}^{n} \phi_2(f(x_i)) c_i = 0$, and $P_f$ is well-defined.
Lemma 2.8 Let $k_1 = a \tilde{k}_1$, i.e., $k_1$ is separable, for an invertible operator $a \in \mathcal{A}$ and a complex-valued positive definite kernel $\tilde{k}_1$. Assume $\{\tilde{\phi}_1(x) \mid x \in \mathcal{X}_1\}$ is linearly independent (e.g., $\tilde{k}_1$ is Gaussian or Laplacian), where $\tilde{\phi}_1$ is the feature map associated with $\tilde{k}_1$. Then, $\{\phi_1(x) \mid x \in \mathcal{X}_1\}$ is $\mathcal{A}$-linearly independent.
Proof If $\sum_{i=1}^{n} \phi_1(x_i) c_i = 0$, we have
where $G$ is the $\mathcal{A}$-valued Gram matrix whose $(i, j)$-entry is $k_1(x_i, x_j)$ and $c = (c_1, \ldots, c_n)$. Thus, we obtain $Gc = 0$. Let $\tilde{G}$ be the standard Gram matrix whose $(i, j)$-entry is $\tilde{k}_1(x_i, x_j)$. Since $G = \tilde{G} \otimes a$ and $a$ is invertible, the inverse of $G$ is $\tilde{G}^{-1} \otimes a^{-1}$. Thus, $G$ is invertible and we have $c_i = 0$ for all $i$.
Lemma 4.2 Let be a function class of -valued functions on bounded by (i.e., for any ). Let and . Let satisfy and let . Then, for any , with probability at least , we have
Proof For , we have
In addition, we have
Since is a real-valued map, by Theorem 3.3 of Mohri et al. [11], we have
The inequality completes the proof.
Proof Let for and . The Lipschitz constant of is calculated as follows: We have
Thus, we have . By Corollary 4 of Maurer [40], the statement is proved.
Proof We have
(3)
(4)
(5)
where for and , is the semi-inner product defined by and . In addition, the inequality (3) is by the Cauchy–Schwarz inequality and the inequality (5) is by Jensen's inequality. Note that the Cauchy–Schwarz inequality is still valid for semi-inner products. The inequality (4) is derived from the inequality
Theorem 4.1 Assume there exists such that for any . Let and . Let . Then, for any , where is defined in Section 3, with probability at least , we have
Proof For and , we have
Thus, we set as and apply Lemmas 4.2, 4.3, and 4.4. Then, for satisfying , with probability at least , we have
Therefore, we obtain
which completes the proof.
The following lemma by Lance [33, Proposition 1.2] is used in proving Proposition 4.6. Here, for , means is Hermitian positive definite.
Lemma A.1
Let $\mathcal{M}_1$ and $\mathcal{M}_2$ be Hilbert C*-modules and let $P$ be a bounded $\mathcal{A}$-linear operator from $\mathcal{M}_1$ to $\mathcal{M}_2$. Then, we have $\langle Pv, Pv \rangle \le \|P\|^2 \langle v, v \rangle$ for $v \in \mathcal{M}_1$.
Proof
(6)
(7)
where the inequality (6) is by the Cauchy–Schwarz inequality and the inequality (7) is by Lemma A.1.
Proposition 5.1 Let be an error function, let be an -valued function on the space of bounded linear operators on , and let satisfy for . Assume the following minimization problem has a solution:
Then, there exists a solution admitting a representation of the form $f_j = \sum_{i=1}^{n} \phi_j(x_i^{(j-1)}) c_{j,i}$ for some $c_{j,1}, \ldots, c_{j,n} \in \mathcal{A}$ and for $j = 1, \ldots, L$. Here, $x_i^{(j)} = f_j \circ \cdots \circ f_1(x_i)$ for $j = 1, \ldots, L$ and $x_i^{(0)} = x_i$.
Proof Let be the submodule of generated by . Let , where and . Then, for , we have
(8)
In addition, we have
where is the submodule of generated by . In the same manner as Eq. (8), we obtain
Thus, we have . Furthermore, we have
where the last inequality is derived from the fact that is positive and Theorem 2.2.5 (3) by Murphy [34]. As a result, the statement is proved.
Proposition 5.3 For , let be the QR decomposition of . Then, we have .
Proof The result is derived by the identities
Proposition 5.4 Assume is invertible. Then, we have
Proof Since is invertible, by Proposition 5.3, is bounded as
where the last equality is derived by the identity .
Appendix B Connection with Generalization Bound with Koopman Operators
Hashimoto et al. [18] derived a generalization bound for classical neural networks composed of linear transformations and activation functions using Koopman operators. The Koopman operator is the adjoint of the Perron–Frobenius operator. In their analysis, they set the final transformation in an RKHS and observed that if the transformation of each layer is injective, the noise is separated through the linear transformations and cut off by the final transformation. Since the final transformation in our case is in an RKHM, the same interpretation about noise is valid for the deep RKHM, too. Indeed, if a layer is not injective, then the corresponding Gram matrix is not invertible, which results in the right-hand side of the inequality (2) going to infinity. The bound derived by Hashimoto et al. also goes to infinity if the network is not injective.
Appendix C Experimental Details and Additional Results
We provide details for the experiments in Section 7. All the experiments were executed with Python 3.9 and TensorFlow 2.6 on an Intel(R) Core(TM) i9 CPU and an NVIDIA Quadro RTX 5000 GPU with CUDA 11.7.
C.1 Comparison with vvRKHS
We set for with and for positive definite kernels. For the optimizer, we used SGD. The learning rate is set as both for the deep RKHM and deep vvRKHS. The initial value of is set as , where is the block matrix all of whose elements are and is randomly drawn from . The result in Figure 2 is obtained by 5 independent runs.
C.2 Observation about benign overfitting
C.3 Comparison to CNN
We set for , where and are block matrices whose block sizes are and , for positive definite kernels. All the nonzero elements of are set as . We set and to induce interactions in the block elements (see Remark 5.2). For the optimizer, we used Adam with learning rate for both the deep RKHM and the CNN. The initial value of is set as , where is randomly drawn from .
We combined the deep RKHM and CNN with dense layers. Their activation functions are sigmoid and softmax, respectively. For the CNN layers, we also used the sigmoid activation functions. The loss function is set as the categorical cross-entropy for the CNN. The result in Figure 2 is obtained by 5 independent runs.
C.3.1 Memory consumption and computational cost
Memory consumption
We used a CNN with and -filters. On the other hand, for the deep RKHM, we learned the coefficients in Proposition 5.1 for and . That is, we learned the following coefficients:
- block diagonal matrices, each of which has four blocks (for the first layer)
- block diagonal matrices, each of which has seven blocks (for the second layer)
Thus, the size of the parameters we have to learn for the deep RKHM is larger than that of the CNN. Since the memory consumption depends on the size of the learning parameters, the memory consumption is larger for the deep RKHM than for the CNN.
Computational cost
In each iteration, we compute for and the derivative of with respect to the learning parameters. Here, is the network. For the deep RKHM, we compute the product of a Gram matrix (composed of block diagonal matrices) and a vector (composed of block diagonal matrices) for computing for all . Thus, the computational cost for computing for all is , where and are the sizes of the block diagonal matrices. For the CNN, the computational cost for computing for all is , where and are the number of elements in the filters. Since we set and , the computational cost of the deep RKHM for computing for is smaller than that of the CNN. However, since the size of the learning parameters of the deep RKHM is large, the computational cost for computing the derivative of is larger than that of the CNN.
Additional results
To compare the deep RKHM to a CNN with the same size of learning parameters (the same memory consumption), we conducted the same experiment as the last experiment in the main text, except for the structure of the CNN. We constructed a CNN with and -filters (the size of the learning parameters is the same as that of the deep RKHM) and replaced the CNN used in the experiment in the main text with it. Figure 3 shows the result. The result is similar to that in Figure 2 (c), and the deep RKHM also outperforms the CNN with the same learning parameter size. Since the size of the learning parameters is the same in this case, the computational cost of one iteration for learning the deep RKHM is the same as or smaller than that for learning the CNN.
[Figure 3: Test accuracy of the deep RKHM and the CNN with the same size of learning parameters.]
C.3.2 Additional results for benign overfitting
We also observed benign overfitting for the deep RKHM. We compared the train and test losses for the deep RKHM with the Perron–Frobenius term to those without it. The results are illustrated in Figure 4. If we do not minimize the term in Eq. (2), then whereas the training loss becomes small as the learning process proceeds, the test loss becomes large after sufficiently many iterations. On the other hand, if we minimize the term in Eq. (2), then the training loss becomes smaller than in the previous case, and the test loss does not become large even as the learning process proceeds. The result implies that the term regarding the Perron–Frobenius operators in Eq. (1) causes overfitting, but it is benign overfitting.
[Figure 4: Train and test losses for the deep RKHM with and without the Perron–Frobenius term.]