

Deep Learning with Kernels through RKHM and the Perron–Frobenius Operator

Yuka Hashimoto
NTT Network Service Systems Laboratories / RIKEN AIP,
Tokyo, Japan

Masahiro Ikeda
RIKEN AIP / Keio University,
Tokyo, Japan

Hachem Kadri
Aix-Marseille University, CNRS, LIS,
Marseille, France
Abstract

Reproducing kernel Hilbert C^*-module (RKHM) is a generalization of reproducing kernel Hilbert space (RKHS) by means of C^*-algebra, and the Perron–Frobenius operator is a linear operator related to the composition of functions. Combining these two concepts, we present deep RKHM, a deep learning framework for kernel methods. We derive a new Rademacher generalization bound in this setting and provide a theoretical interpretation of benign overfitting by means of Perron–Frobenius operators. By virtue of C^*-algebra, the dependency of the bound on the output dimension is milder than existing bounds. We show that C^*-algebra is a suitable tool for deep learning with kernels, enabling us to take advantage of the product structure of operators and to provide a clear connection with convolutional neural networks. Our theoretical analysis provides a new lens through which one can design and analyze deep kernel methods.

1 Introduction

Kernel methods and deep neural networks are two major topics in machine learning. Originally, they were investigated independently, but their interactions have been studied actively in recent years. One important perspective is deep kernel learning [1, 2, 3]. In this framework, we construct a function as the composition of functions in RKHSs, which is learned from given training data. Representer theorems have been shown for deep kernel methods, which guarantee that solutions of the minimization problems can be represented using only the given training data [4, 5]. We can combine the flexibility of deep neural networks with the representation power and solid theoretical understanding of kernel methods. Other important perspectives are the neural tangent kernel [6, 7] and convolutional kernels [8], which enable us to understand neural networks using the theory of kernel methods. In addition, Bietti et al. [9] proposed a regularization of deep neural networks through a kernel perspective.

The generalization properties of kernel methods and deep neural networks have been investigated. One typical technique for bounding generalization errors is to use the Rademacher complexity [10, 11]. For kernel methods, generalization bounds based on the Rademacher complexity can be derived by the reproducing property. Bounds for deep kernel methods and vector-valued RKHSs (vvRKHSs) were also derived [5, 12, 13]. Table 1 shows existing Rademacher generalization bounds for kernel methods. Generalization bounds for deep neural networks have also been actively studied [14, 15, 16, 17, 18, 19]. Recently, analyzing the generalization property using the concept of benign overfitting has emerged [20, 21]. Unlike the classical interpretation of overfitting, which is called catastrophic overfitting, it explains the phenomenon that the model fits both training and test data. For kernel regression, it has been shown that the type of overfitting can be described by an integral operator associated with the kernel [22].

In this paper, we propose deep RKHM to make deep kernel methods more powerful. RKHM is a generalization of RKHS by means of C^*-algebra [23, 24, 25], where C^*-algebra is a generalization of the space of complex numbers, regarded as a space of operators. We focus on the C^*-algebra of matrices in this paper. Applications of RKHMs to kernel methods have been investigated recently [26, 27]. We generalize the concept of deep kernel learning to RKHM, which constructs a map from an operator to an operator as the composition of functions in RKHMs. The product structure of operators induces interactions among elements of matrices, which enables us to capture relations between data components. Then, we derive a generalization bound for the proposed deep RKHM. By virtue of C^*-algebras, we can use the operator norm, which alleviates the dependency of the generalization bound on the output dimension.

We also use Perron–Frobenius operators, which are linear operators describing the composition of functions and have been applied to analyzing dynamical systems [28, 29, 30, 31], to derive the bound. The compositions in the deep RKHM are effectively analyzed by the Perron–Frobenius operators.

CC^{*}-algebra and Perron–Frobenius operator are powerful tools that provide connections of the proposed deep RKHM with existing studies. Since the norm of the Perron–Frobenius operator is described by the Gram matrix associated with the kernel, our bound shows a connection of the deep RKHM with benign overfitting. In addition, the product structure of a CC^{*}-algebra enables us to provide a connection between the deep RKHMs and convolutional neural networks (CNNs).

Our main contributions are as follows.

  • We propose deep RKHM, which is a generalization of deep kernel method by means of CC^{*}-algebra. We can make use of the products in CC^{*}-algebra to induce interactions among data components. We also show a representer theorem to guarantee the representation of solutions only with given data.

  • We derive a generalization bound for deep RKHM. The dependency of the bound on the output dimension is milder than existing bounds by virtue of CC^{*}-algebras. In addition, the Perron–Frobenius operators provide a connection of our bound with benign overfitting.

  • We show connections of our study with existing studies such as CNNs and neural tangent kernel.

We emphasize that our theoretical analysis using CC^{*}-algebra and Perron–Frobenius operators gives a new and powerful lens through which one can design and analyze kernel methods.

Reproducing space   Output dimension   Shallow                   Deep
RKHS                1                  O(\sqrt{1/n}) [10]        O(A^{L}\sqrt{1/n})
vvRKHS              d                  O(\sqrt{d/n}) [12, 13]    O(A^{L}\sqrt{d/n}) [5]
RKHM (existing)     d                  O(\sqrt{d/n}) [27]        —
RKHM (ours)         d                  O(d^{1/4}/\sqrt{n})       O(B^{L}d^{1/4}/\sqrt{n})

Table 1: Existing generalization bounds for kernel methods based on the Rademacher complexity and our bound (n: sample size, A: Lipschitz constant regarding the positive definite kernel, B: the norm of the Perron–Frobenius operator).

2 Preliminaries

2.1 CC^{*}-algebra and reproducing kernel CC^{*}-module

CC^{*}-algebra, which is denoted by 𝒜\mathcal{A} in the following, is a Banach space equipped with a product and an involution satisfying the CC^{*} identity (condition 3 below).

Definition 2.1 (CC^{*}-algebra)

A set \mathcal{A} is called a C^*-algebra if it satisfies the following conditions:

  1. \mathcal{A} is an algebra over \mathbb{C} and equipped with a bijection (\cdot)^{*}:\mathcal{A}\to\mathcal{A} that satisfies the following conditions for \alpha,\beta\in\mathbb{C} and a,b\in\mathcal{A}:
     \bullet (\alpha a+\beta b)^{*}=\overline{\alpha}a^{*}+\overline{\beta}b^{*},  \bullet (ab)^{*}=b^{*}a^{*},  \bullet (a^{*})^{*}=a.

  2. \mathcal{A} is a normed space endowed with \|\cdot\|_{\mathcal{A}}, and for a,b\in\mathcal{A}, \|ab\|_{\mathcal{A}}\leq\|a\|_{\mathcal{A}}\|b\|_{\mathcal{A}} holds. In addition, \mathcal{A} is complete with respect to \|\cdot\|_{\mathcal{A}}.

  3. For a\in\mathcal{A}, the C^* identity \|a^{*}a\|_{\mathcal{A}}=\|a\|_{\mathcal{A}}^{2} holds.

Example 2.2

A typical example of a C^*-algebra is the C^*-algebra of d by d circulant matrices. Another example is the C^*-algebra of d by d block diagonal matrices with M blocks of sizes \mathbf{m}=(m_{1},\ldots,m_{M}). We denote these algebras by Circ(d) and Block(\mathbf{m},d), respectively. See [27, 32] for more details about these examples.
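To make Example 2.2 concrete, the following is a minimal numerical sketch (our own illustration with NumPy; the helper functions circulant and block_diag are ours, not from the paper) that builds elements of Circ(d) and Block(\mathbf{m},d) and checks that each family is closed under matrix products, which is the C^*-algebraic structure exploited later.

```python
# Sketch: closure of Circ(d) and Block(m, d) under products.
import numpy as np

def circulant(first_col):
    """Build a circulant matrix from its first column."""
    d = len(first_col)
    return np.stack([np.roll(first_col, i) for i in range(d)], axis=1)

def block_diag(blocks):
    """Build a block-diagonal matrix from a list of square blocks."""
    d = sum(b.shape[0] for b in blocks)
    out = np.zeros((d, d), dtype=complex)
    i = 0
    for b in blocks:
        m = b.shape[0]
        out[i:i + m, i:i + m] = b
        i += m
    return out

rng = np.random.default_rng(0)
d = 6

# The product of two circulant matrices is circulant (a commutative C*-algebra).
a = circulant(rng.standard_normal(d))
b = circulant(rng.standard_normal(d))
ab = a @ b
print(np.allclose(ab, circulant(ab[:, 0])))          # True

# The product of two Block((2, 2, 2), 6) matrices stays in Block((2, 2, 2), 6).
blocks1 = [rng.standard_normal((2, 2)) for _ in range(3)]
blocks2 = [rng.standard_normal((2, 2)) for _ in range(3)]
prod = block_diag(blocks1) @ block_diag(blocks2)
mask = block_diag([np.ones((2, 2))] * 3)
print(np.allclose(prod * (1 - mask), 0))             # True: off-block entries stay zero
```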

We now define RKHM. Let 𝒳\mathcal{X} be a non-empty set for data.

Definition 2.3 (𝒜\mathcal{A}-valued positive definite kernel)

An 𝒜\mathcal{A}-valued map k:𝒳×𝒳𝒜k:\mathcal{X}\times\mathcal{X}\to\mathcal{A} is called a positive definite kernel if it satisfies the following conditions:
\bullet k(x,y)=k(y,x)k(x,y)=k(y,x)^{*}  for x,y𝒳x,y\in\mathcal{X},
\bullet i,j=1ncik(xi,xj)cj\sum_{i,j=1}^{n}c_{i}^{*}k(x_{i},x_{j})c_{j} is positive semi-definite  for nn\in\mathbb{N}, ci𝒜c_{i}\in\mathcal{A}, xi𝒳x_{i}\in\mathcal{X}.

Let \phi:\mathcal{X}\to\mathcal{A}^{\mathcal{X}} be the feature map associated with k, defined as \phi(x)=k(\cdot,x) for x\in\mathcal{X}, and let \mathcal{M}_{k,0}=\{\sum_{i=1}^{n}\phi(x_{i})c_{i}\,\mid\,n\in\mathbb{N},\ c_{i}\in\mathcal{A},\ x_{i}\in\mathcal{X}\ (i=1,\ldots,n)\}. We can define an \mathcal{A}-valued map \left\langle\cdot,\cdot\right\rangle_{\mathcal{M}_{k}}:\mathcal{M}_{k,0}\times\mathcal{M}_{k,0}\to\mathcal{A} as

\big{\langle}\sum_{i=1}^{n}\phi(x_{i})c_{i},\sum_{j=1}^{l}\phi(y_{j})b_{j}\big{\rangle}_{\mathcal{M}_{k}}:=\sum_{i=1}^{n}\sum_{j=1}^{l}c_{i}^{*}k(x_{i},y_{j})b_{j},

which enjoys the reproducing property \left\langle\phi(x),f\right\rangle_{\mathcal{M}_{k}}=f(x) for f\in\mathcal{M}_{k,0} and x\in\mathcal{X}. The reproducing kernel Hilbert \mathcal{A}-module (RKHM) \mathcal{M}_{k} associated with k is defined as the completion of \mathcal{M}_{k,0}. See, for example, the references [33, 34, 26] for more details about C^*-algebra and RKHM.
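As an illustration of Definition 2.3 and of the module inner product above, the following sketch (ours; it assumes a separable kernel k(x,y)=\tilde{k}(x,y)a with a Gaussian \tilde{k} and a fixed positive semi-definite matrix a, which is one simple way to build an \mathcal{A}-valued kernel) numerically checks that \sum_{i,j}c_{i}^{*}k(x_{i},x_{j})c_{j} is positive semi-definite for randomly chosen c_{i} and x_{i}.

```python
# Sketch: positivity condition of an A-valued kernel for A = C^{d x d}.
import numpy as np

rng = np.random.default_rng(1)
d, n = 4, 5

def ktilde(x, y, gamma=0.5):
    """Complex-valued Gaussian kernel on matrix inputs (Hilbert-Schmidt distance)."""
    return np.exp(-gamma * np.linalg.norm(x - y) ** 2)

a_root = rng.standard_normal((d, d))
a = a_root @ a_root.T                      # positive semi-definite "operator part"

def k(x, y):
    return ktilde(x, y) * a                # A-valued kernel value k(x, y) in C^{d x d}

xs = [rng.standard_normal((d, d)) for _ in range(n)]
cs = [rng.standard_normal((d, d)) for _ in range(n)]

# Real matrices, so transpose plays the role of the adjoint c_i^*.
s = sum(cs[i].T @ k(xs[i], xs[j]) @ cs[j] for i in range(n) for j in range(n))
eigvals = np.linalg.eigvalsh((s + s.T) / 2)
print(eigvals.min() >= -1e-8)              # True: the sum is positive semi-definite
```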

2.2 Perron–Frobenius operator on RKHM

We introduce Perron–Frobenius operator on RKHM [35]. Let 𝒳1\mathcal{X}_{1} and 𝒳2\mathcal{X}_{2} be nonempty sets and let k1k_{1} and k2k_{2} be 𝒜\mathcal{A}-valued positive definite kernels. Let 1\mathcal{M}_{1} and 2\mathcal{M}_{2} be RKHMs associated with k1k_{1} and k2k_{2}, respectively. Let ϕ1\phi_{1} and ϕ2\phi_{2} be the feature maps of 1\mathcal{M}_{1} and 2\mathcal{M}_{2}, respectively. We begin with the standard notion of linearity in RKHMs.

Definition 2.4 (𝒜\mathcal{A}-linear)

A linear map A:12A:\mathcal{M}_{1}\to\mathcal{M}_{2} is called 𝒜\mathcal{A}-linear if for any a𝒜a\in\mathcal{A} and w1w\in\mathcal{M}_{1}, we have A(wa)=(Aw)aA(wa)=(Aw)a.

Definition 2.5 (Perron–Frobenius operator)

Let f:𝒳1𝒳2f:\mathcal{X}_{1}\to\mathcal{X}_{2} be a map. The Perron–Frobenius operator with respect to ff is an 𝒜\mathcal{A}-linear map Pf:12P_{f}:\mathcal{M}_{1}\to\mathcal{M}_{2} satisfying

Pfϕ1(x)=ϕ2(f(x)).\displaystyle P_{f}\phi_{1}(x)=\phi_{2}(f(x)).

Note that a Perron–Frobenius operator is not always well-defined since ac=bcac=bc for a,b𝒜a,b\in\mathcal{A} and nonzero c𝒜c\in\mathcal{A} does not always mean a=ba=b.

Definition 2.6 (𝒜\mathcal{A}-linearly independent)

The set {ϕ1(x)x𝒳}\{\phi_{1}(x)\,\mid\,x\in\mathcal{X}\} is called 𝒜\mathcal{A}-linearly independent if it satisfies the following condition: For any nn\in\mathbb{N}, x1,,xn𝒳x_{1},\ldots,x_{n}\in\mathcal{X}, and c1,,cn𝒜c_{1},\ldots,c_{n}\in\mathcal{A}, “i=1nϕ1(xi)ci=0\sum_{i=1}^{n}\phi_{1}(x_{i})c_{i}=0” is equivalent to “ci=0c_{i}=0 for all i=1,,ni=1,\ldots,n”.

Lemma 2.7

If {ϕ1(x)x𝒳}\{\phi_{1}(x)\,\mid\,x\in\mathcal{X}\} is 𝒜\mathcal{A}-linearly independent, then PfP_{f} is well-defined.

The following lemma gives a sufficient condition for \mathcal{A}-linear independence.

Lemma 2.8

Let k_1=\tilde{k}a, i.e., k_1 is separable, for an invertible operator a\in\mathcal{A} and a complex-valued kernel \tilde{k}. Assume \{\tilde{\phi}(x)\,\mid\,x\in\mathcal{X}\} is linearly independent (e.g., \tilde{k} is Gaussian or Laplacian), where \tilde{\phi} is the feature map associated with \tilde{k}. Then, \{\phi_1(x)\,\mid\,x\in\mathcal{X}\} is \mathcal{A}-linearly independent.

Note that separable kernels are widely used in the existing literature on vvRKHSs (see, e.g., [36]). Lemma 2.8 guarantees the validity of the analysis with Perron–Frobenius operators, at least in the separable case. This provides a condition for "good kernels" in terms of the well-definedness of the Perron–Frobenius operator.

Notation

We denote the Euclidean inner product and norm by ,\left\langle\cdot,\cdot\right\rangle and \|\cdot\| without the subscript. For a𝒜a\in\mathcal{A}, let |a|𝒜|a|_{\mathcal{A}} be the unique element in 𝒜\mathcal{A} satisfying |a|𝒜2=aa|a|_{\mathcal{A}}^{2}=a^{*}a. If aa is a matrix, aHS\|a\|_{\operatorname{HS}} is the Hilbert–Schmidt norm of aa. The operator norm of a linear operator AA on an RKHM is denoted by Aop\|A\|_{\operatorname{op}}. All the technical proofs are documented in the supplementary.

3 Deep RKHM

We now construct an L-layer deep RKHM. Let \mathcal{A}=\mathbb{C}^{d\times d} be the C^*-algebra of d by d matrices. Let \mathcal{A}_0,\ldots,\mathcal{A}_L be C^*-subalgebras of \mathcal{A}, and for j=1,\ldots,L, let k_j:\mathcal{A}_{j-1}\times\mathcal{A}_{j-1}\to\mathcal{A}_j be an \mathcal{A}_j-valued positive definite kernel for the jth layer. For j=1,\ldots,L, we denote by \mathcal{M}_j the RKHM over \mathcal{A}_j associated with k_j. In addition, we denote by \tilde{\mathcal{M}}_j the RKHM over \mathcal{A} associated with k_j. Note that \mathcal{M}_j is a subspace of \tilde{\mathcal{M}}_j and for u,v\in\mathcal{M}_j, we have \left\langle u,v\right\rangle_{\tilde{\mathcal{M}}_j}=\left\langle u,v\right\rangle_{\mathcal{M}_j}. We set the function space for the final layer as \mathcal{F}_L=\{f\in\mathcal{M}_L\,\mid\,f(x)\in\mathbb{R}^{d\times d}\mbox{ for any }x\in\mathcal{A}_{L-1},\ \|f\|_{\mathcal{M}_L}\leq B_L\} and that for the jth layer as \mathcal{F}_j=\{f\in\mathcal{M}_j\,\mid\,f(x)\in\mathbb{R}^{d\times d}\mbox{ for any }x\in\mathcal{A}_{j-1},\ \|P_f\|_{\operatorname{op}}\leq B_j\} for j=1,\ldots,L-1. Here, for f\in\mathcal{M}_j with j=1,\ldots,L-1, P_f is the Perron–Frobenius operator with respect to f from \tilde{\mathcal{M}}_j to \tilde{\mathcal{M}}_{j+1}. We assume the well-definedness of these operators. Then, we set the class of deep RKHMs as

Ldeep={fLf1fjj(j=1,,L)}.\displaystyle\mathcal{F}^{\mathrm{deep}}_{L}=\{f_{L}\circ\cdots\circ f_{1}\,\mid\,f_{j}\in\mathcal{F}_{j}\ (j=1,\ldots,L)\}.

Figure 1 schematically shows the structure of the deep RKHM.

Figure 1: Overview of the proposed deep RKHM. The small blue squares represent matrix elements. In the case of the autoencoder (see Example 3.1), f_2\circ f_1 is the encoder, and f_4\circ f_3 is the decoder.
Example 3.1

We can set \mathcal{A}_j=Block((m_{1,j},\ldots,m_{M_j,j}),d), the C^*-algebra of block diagonal matrices. By setting M_1\leq\cdots\leq M_l and M_l\geq\cdots\geq M_L for l<L, the number of nonzero elements in \mathcal{A}_j decreases for 1\leq j\leq l and increases for l\leq j\leq L. This construction is regarded as an autoencoder, where layers 1 to l correspond to the encoder and layers l+1 to L correspond to the decoder.

Advantage over existing deep kernel methods

Note that deep kernel methods with RKHSs and vvRKHSs have been proposed [1, 4, 2]. Autoencoders using RKHSs and vvRKHSs were also proposed [3, 5]. We have at least three advantages of deep RKHM over deep vvRKHSs or RKHSs: 1) useful structures of matrices stated in Remark 5.2, 2) availability of the operator norm in the generalization bound stated in Section 4, 3) connection with CNNs stated in Subsection 6.1.

4 Generalization Bound

We derive a generalization error bound for deep RKHM. To derive the bound, we bound the Rademacher complexity, which is one of the typical tools for deriving generalization bounds [11]. Let \Omega be a probability space equipped with a probability measure P. We denote the integral \int_{\Omega}s(\omega)\mathrm{d}P(\omega) of a measurable function s on \Omega by \mathrm{E}[s]. Let x_1,\ldots,x_n and y_1,\ldots,y_n be input and output samples drawn from the distributions of the \mathcal{A}_0- and \mathcal{A}_L-valued random variables x and y, respectively. Let \sigma_{i,j}\ (i=1,\ldots,d,\ j=1,\ldots,n) be i.i.d. Rademacher variables and let \sigma_j=(\sigma_{1,j},\ldots,\sigma_{d,j}). For an \mathbb{R}^d-valued function class \mathcal{F} and \mathbf{x}=(x_1,\ldots,x_n), the empirical Rademacher complexity \hat{R}_n(\mathbf{x},\mathcal{F}) is defined by \hat{R}_n(\mathbf{x},\mathcal{F}):=\mathrm{E}[\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}\left\langle\sigma_i,f(x_i)\right\rangle]/n.

4.1 Bound for shallow RKHMs

We use the operator norm to derive a bound, whose dependency on output dimension is milder than existing bounds. Availability of the operator norm is one of the advantages of considering CC^{*}-algebras (RKHMs) instead of vectors (vvRKHSs). Note that although we can also use the Hilbert–Schmidt norm for matrices, it tends to be large as the dimension dd becomes large. On the other hand, the operator norm is defined independently of the dimension dd. Indeed, the Hilbert–Schmidt norm of a matrix ad×da\in\mathbb{C}^{d\times d} is calculated as (i=1dsi2)1/2(\sum_{i=1}^{d}s_{i}^{2})^{1/2}, where sis_{i} is the iith singular value of aa. The operator norm of aa is the largest singular value of aa.
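The following small experiment (ours, using Gaussian random matrices as a stand-in) illustrates this point numerically: the Hilbert–Schmidt norm grows with d, while the operator norm stays essentially flat.

```python
# Sketch: Hilbert-Schmidt norm vs. operator norm as the dimension d grows.
import numpy as np

rng = np.random.default_rng(0)
for d in (10, 100, 1000):
    a = rng.standard_normal((d, d)) / np.sqrt(d)   # normalized Gaussian matrix
    hs = np.linalg.norm(a, 'fro')                  # (sum_i s_i^2)^{1/2}
    op = np.linalg.norm(a, 2)                      # largest singular value
    print(d, round(hs, 2), round(op, 2))
# The HS norm grows like sqrt(d); the operator norm stays O(1) for this ensemble.
```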

To see the derivation of the bound, we first focus on the case of L=1L=1, i.e., the network is shallow. Let E>0E>0. For a space \mathcal{F} of 𝒜\mathcal{A}-valued functions on 𝒜0\mathcal{A}_{0}, let 𝒢()={(x,y)f(x)yf,y𝒜E}\mathcal{G}(\mathcal{F})=\{(x,y)\mapsto f(x)-y\,\mid\,f\in\mathcal{F},\|y\|_{\mathcal{A}}\leq E\}. The following theorem shows a bound for RKHMs with the operator norm.

Theorem 4.1

Assume there exists D>0D>0 such that k1(x,x)𝒜D\|k_{1}(x,x)\|_{\mathcal{A}}\leq D for any x𝒜0x\in\mathcal{A}_{0}. Let K~=42(DB1+E)B1\tilde{K}=4\sqrt{2}(\sqrt{D}B_{1}+E)B_{1} and M~=6(DB1+E)2\tilde{M}=6(\sqrt{D}B_{1}+E)^{2}. Let δ(0,1)\delta\in(0,1). Then, for any g𝒢(1)g\in\mathcal{G}(\mathcal{F}_{1}), where 1\mathcal{F}_{1} is defined in Section 3, with probability at least 1δ1-\delta, we have

E[|g(x,y)|𝒜2]𝒜1ni=1n|g(xi,yi)|𝒜2𝒜+K~n(i=1ntrk1(xi,xi))1/2+M~log(2/δ)2n.\displaystyle\|\mathrm{E}[|g(x,y)|^{2}_{\mathcal{A}}]\|_{\mathcal{A}}\leq\bigg{\|}\frac{1}{n}\sum_{i=1}^{n}|g(x_{i},y_{i})|^{2}_{\mathcal{A}}\bigg{\|}_{\mathcal{A}}+\frac{\tilde{K}}{n}\bigg{(}\sum_{i=1}^{n}\operatorname{tr}k_{1}(x_{i},x_{i})\bigg{)}^{1/2}+\tilde{M}\sqrt{\frac{\log(2/\delta)}{2n}}.

Theorem 4.1 is derived from the following lemmas. We first fix a vector p\in\mathbb{R}^d and consider the operator-valued loss function acting on p. We then show a relation between the generalization error and the Rademacher complexity of vector-valued functions, and bound the Rademacher complexity. Since the bound does not depend on p, we can finally remove the dependency on p.

Lemma 4.2

Let \mathcal{F} be a function class of d×d\mathbb{R}^{d\times d}-valued functions on 𝒜0\mathcal{A}_{0} bounded by CC (i.e., f(x)𝒜C\|f(x)\|_{\mathcal{A}}\leq C for any x𝒜0x\in\mathcal{A}_{0}). Let 𝒢~(,p)={(x,y)(f(x)y)p2f,y𝒜E}\tilde{\mathcal{G}}(\mathcal{F},p)=\{(x,y)\mapsto\|(f(x)-y)p\|^{2}\,\mid\,f\in\mathcal{F},\|y\|_{\mathcal{A}}\leq E\} and M=2(C+E)2M=2(C+E)^{2}. Let pdp\in\mathbb{R}^{d} satisfy p=1\|p\|=1 and let δ(0,1)\delta\in(0,1). Then, for any g𝒢~(,p)g\in\tilde{\mathcal{G}}(\mathcal{F},p), with probability at least 1δ1-\delta, we have

E[|g(x,y)|𝒜2]1/2p21ni=1n|g(xi,yi)|𝒜2𝒜+2R^n(𝐱,𝒢~(,p))+3Mlog(2/δ)2n.\displaystyle\|\mathrm{E}[|g(x,y)|^{2}_{\mathcal{A}}]^{1/2}p\|^{2}\leq\bigg{\|}\frac{1}{n}\sum_{i=1}^{n}|g(x_{i},y_{i})|^{2}_{\mathcal{A}}\bigg{\|}_{\mathcal{A}}+2\hat{R}_{n}(\mathbf{x},\tilde{\mathcal{G}}(\mathcal{F},p))+3M\sqrt{\frac{\log(2/\delta)}{2n}}.
Lemma 4.3

With the same notations as in Lemma 4.2, let K=2\sqrt{2}(C+E). Then, we have \hat{R}_n(\mathbf{x},\tilde{\mathcal{G}}(\mathcal{F},p))\leq K\hat{R}_n(\mathbf{x},\mathcal{F}p), where \mathcal{F}p=\{x\mapsto f(x)p\,\mid\,f\in\mathcal{F}\}.

Lemma 4.4

Let pdp\in\mathbb{R}^{d} satisfy p=1\|p\|=1. For 1\mathcal{F}_{1} defined in Section 3, we have

R^n(𝐱,1p)B1n(i=1ntr(k(xi,xi)))1/2.\displaystyle\hat{R}_{n}(\mathbf{x},\mathcal{F}_{1}p)\leq\frac{B_{1}}{n}\Big{(}\sum_{i=1}^{n}\operatorname{tr}(k(x_{i},x_{i}))\Big{)}^{1/2}.

4.2 Bound for deep RKHMs

We now generalize Theorem 4.1 to the deep setting (L2L\geq 2) using the Perron–Frobenius operators.

Theorem 4.5

Assume there exists D>0D>0 such that kL(x,x)𝒜D\|k_{L}(x,x)\|_{\mathcal{A}}\leq D for any x𝒜0x\in\mathcal{A}_{0}. Let K~=42(DBL+E)B1BL\tilde{K}=4\sqrt{2}(\sqrt{D}B_{L}+E)B_{1}\cdots B_{L} and M~=6(DBL+E)2\tilde{M}=6(\sqrt{D}B_{L}+E)^{2}. Let δ(0,1)\delta\in(0,1). Then, for any g𝒢(Ldeep)g\in\mathcal{G}(\mathcal{F}_{L}^{\operatorname{deep}}), with probability at least 1δ1-\delta, we have

E[|g(x,y)|𝒜2]𝒜1ni=1n|g(xi,yi)|𝒜2𝒜+K~n(i=1ntrk1(xi,xi))1/2+M~log(2/δ)2n.\displaystyle\|\mathrm{E}[|g(x,y)|_{\mathcal{A}}^{2}]\|_{\mathcal{A}}\leq\bigg{\|}\frac{1}{n}\sum_{i=1}^{n}|g(x_{i},y_{i})|_{\mathcal{A}}^{2}\bigg{\|}_{\mathcal{A}}+\frac{\tilde{K}}{n}\Big{(}\sum_{i=1}^{n}\operatorname{tr}k_{1}(x_{i},x_{i})\Big{)}^{1/2}+\tilde{M}\sqrt{\frac{\log(2/\delta)}{2n}}.

We use the following proposition and Lemmas 4.2 and 4.3 to show Theorem 4.5. The key idea of the proof is that by the reproducing property and the definition of the Perron–Frobenius operator, we get f_L\circ\cdots\circ f_1(x)=\left\langle\phi_L(f_{L-1}\circ\cdots\circ f_1(x)),f_L\right\rangle_{\tilde{\mathcal{M}}_L}=\left\langle P_{f_{L-1}}\cdots P_{f_1}\phi_1(x),f_L\right\rangle_{\tilde{\mathcal{M}}_L}.

Proposition 4.6

Let pdp\in\mathbb{R}^{d} satisfy p=1\|p\|=1. Then, we have

\displaystyle\hat{R}_{n}(\mathbf{x},\mathcal{F}^{\mathrm{deep}}_{L}p)\leq\frac{1}{n}\sup_{(f_{j}\in\mathcal{F}_{j})_{j}}\|P_{f_{L-1}}\cdots P_{f_{1}}|_{\tilde{\mathcal{V}}(\mathbf{x})}\|_{\operatorname{op}}\;\|f_{L}\|_{{\mathcal{M}}_{L}}\;\Big{(}\sum_{i=1}^{n}\operatorname{tr}(k_{1}(x_{i},x_{i}))\Big{)}^{1/2}.

Here, \tilde{\mathcal{V}}(\mathbf{x}) is the submodule of \tilde{\mathcal{M}}_{1} generated by \phi_{1}(x_{1}),\ldots,\phi_{1}(x_{n}).

Corollary 4.7

Let pdp\in\mathbb{R}^{d} satisfy p=1\|p\|=1. Then, we have

R^n(𝐱,Ldeepp)1nB1BL(i=1ntr(k1(xi,xi)))1/2.\displaystyle\hat{R}_{n}(\mathbf{x},\mathcal{F}^{\mathrm{deep}}_{L}p)\leq\frac{1}{n}B_{1}\cdots B_{L}\;\Big{(}\sum_{i=1}^{n}\operatorname{tr}(k_{1}(x_{i},x_{i}))\Big{)}^{1/2}.
Comparison to deep vvRKHS

We can also regard d×d\mathbb{C}^{d\times d} as the Hilbert space equipped with the Hilbert–Schmidt inner product, i.e., we can flatten matrices and get d2d^{2}-dimensional Hilbert space. In this case, the corresponding operator-valued kernel is the multiplication operator of k(x,y)k(x,y), which we denote by Mk(x,y)M_{k(x,y)}. Then, we can apply existing results for vvRKHSs [5, 12], which involve the term (i=1ntr(Mk(xi,xi)))1/2(\sum_{i=1}^{n}\operatorname{tr}(M_{k(x_{i},x_{i})}))^{1/2}. It is calculated as

i=1ntr(Mk(xi,xi))=i=1nj,l=1dejl,k(xi,xi)ejlHS=i=1nj,l=1dk(xi,xi)l,l=di=1ntrk(xi,xi).\displaystyle\sum_{i=1}^{n}\operatorname{tr}(M_{k(x_{i},x_{i})})=\sum_{i=1}^{n}\sum_{j,l=1}^{d}\left\langle e_{jl},k(x_{i},x_{i})e_{jl}\right\rangle_{\operatorname{HS}}=\sum_{i=1}^{n}\sum_{j,l=1}^{d}k(x_{i},x_{i})_{l,l}=d\sum_{i=1}^{n}\operatorname{tr}k(x_{i},x_{i}).

Thus, using the existing approaches, we have the factor (di=1ntrk(xi,xi))1/2(d\sum_{i=1}^{n}\operatorname{tr}k(x_{i},x_{i}))^{1/2}. On the other hand, we have the smaller factor (i=1ntrk(xi,xi))1/2(\sum_{i=1}^{n}\operatorname{tr}k(x_{i},x_{i}))^{1/2} in Theorems 4.1 and 4.5. Using the operator norm, we can reduce the dependency on the dimension dd.
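The trace identity used above can be checked numerically. The sketch below (ours) builds the multiplication operator M_a for a single positive semi-definite matrix a, which plays the role of k(x_i,x_i), and verifies \operatorname{tr}(M_a)=d\operatorname{tr}a.

```python
# Sketch: trace of the multiplication operator M_a b = a b on C^{d x d}.
import numpy as np

rng = np.random.default_rng(0)
d = 5
a = rng.standard_normal((d, d))
a = a @ a.T                                  # a positive semi-definite "kernel value"

# With column-major vectorization, vec(a @ b) = (I kron a) vec(b).
M_a = np.kron(np.eye(d), a)
print(np.isclose(np.trace(M_a), d * np.trace(a)))   # True
```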

5 Learning Deep RKHMs

We focus on the practical learning problem. To learn deep RKHMs, we consider the following minimization problem based on the generalization bound derived in Section 4:

\displaystyle\min_{(f_{j}\in\mathcal{M}_{j})_{j}}\bigg{\|}\frac{1}{n}\sum_{i=1}^{n}|f_{L}\circ\cdots\circ f_{1}(x_{i})-y_{i}|_{\mathcal{A}}^{2}\bigg{\|}_{\mathcal{A}}+\lambda_{1}\|P_{f_{L-1}}\cdots P_{f_{1}}|_{\tilde{\mathcal{V}}(\mathbf{x})}\|_{\operatorname{op}}+\lambda_{2}\|f_{L}\|_{{\mathcal{M}}_{L}}. (1)

The second term regarding the Perron–Frobenius operators comes from the bound in Proposition 4.6. We try to reduce the generalization error by reducing the magnitude of the norm of the Perron–Frobenius operators.

5.1 Representer theorem

We first show a representer theorem to guarantee that a solution of the minimization problem (1) is represented only with given samples.

Proposition 5.1

Let h:𝒜n×𝒜n+h:\mathcal{A}^{n}\times\mathcal{A}^{n}\to\mathbb{R}_{+} be an error function, let g1g_{1} be an +\mathbb{R}_{+}-valued function on the space of bounded linear operators on ~1\tilde{\mathcal{M}}_{1}, and let g2:++g_{2}:\mathbb{R}_{+}\to\mathbb{R}_{+} satisfy g2(a)g2(b)g_{2}(a)\leq g_{2}(b) for aba\leq b. Assume the following minimization problem has a solution:

\displaystyle\min_{(f_{j}\in\mathcal{M}_{j})_{j}}h(f_{L}\circ\cdots\circ f_{1}(x_{1}),\ldots,f_{L}\circ\cdots\circ f_{1}(x_{n}))+g_{1}(P_{f_{L-1}}\cdots P_{f_{1}}|_{\tilde{\mathcal{V}}(\mathbf{x})})+g_{2}(\|f_{L}\|_{\mathcal{M}_{L}}).

Then, there exists a solution admitting a representation of the form fj=i=1nϕj(xij1)ci,jf_{j}=\sum_{i=1}^{n}\phi_{j}(x_{i}^{j-1})c_{i,j} for some c1,j,,cn,j𝒜c_{1,j},\ldots,c_{n,j}\in\mathcal{A} and for j=1,,Lj=1,\ldots,L. Here, xij=fjf1(xi)x_{i}^{j}=f_{j}\circ\cdots\circ f_{1}(x_{i}) for j=1,,Lj=1,\ldots,L and xi0=xix_{i}^{0}=x_{i}.
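The representer theorem suggests a practical parametrization of a deep RKHM. The sketch below (ours, with a hypothetical choice of block-diagonal subalgebras and the simple separable kernel k_j(x,y)=\tilde{k}(x,y)I) evaluates f_L\circ\cdots\circ f_1(x) from coefficients c_{i,j}, updating the intermediate inputs x_i^j layer by layer.

```python
# Sketch: forward pass of a deep RKHM in representer form
# f_j(x) = sum_i k_j(x, x_i^{j-1}) c_{i,j}, with block-diagonal c_{i,j}.
import numpy as np

rng = np.random.default_rng(0)
d, n, L = 6, 8, 3
block_sizes = [(1,) * 6, (2, 2, 2), (6,)]     # hypothetical choice of A_1, A_2, A_3

def ktilde(x, y, gamma=0.1):
    return np.exp(-gamma * np.linalg.norm(x - y) ** 2)

def block_mask(sizes, d):
    mask, i = np.zeros((d, d)), 0
    for m in sizes:
        mask[i:i + m, i:i + m] = 1.0
        i += m
    return mask

xs = [rng.standard_normal((d, d)) for _ in range(n)]          # training inputs x_i
# Learnable coefficients c_{i,j}, constrained to the subalgebra A_j of layer j.
cs = [[rng.standard_normal((d, d)) * block_mask(block_sizes[j], d) / n
       for _ in range(n)] for j in range(L)]

def forward(x, xs, cs):
    """Evaluate f_L o ... o f_1(x), updating the layer inputs x_i^j alongside."""
    layer_inputs = list(xs)
    for j in range(L):
        f_j = lambda z: sum(ktilde(z, xi) * c for xi, c in zip(layer_inputs, cs[j]))
        x = f_j(x)
        layer_inputs = [f_j(xi) for xi in layer_inputs]       # x_i^j for the next layer
    return x

print(forward(rng.standard_normal((d, d)), xs, cs).shape)     # (6, 6)
```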

Remark 5.2

An advantage of deep RKHM compared to deep vvRKHS is that we can make use of the structure of matrices. For example, the product of two diagonal matrices is calculated by the elementwise product of diagonal elements. Thus, when 𝒜1==𝒜L=Block((1,,1),d)\mathcal{A}_{1}=\cdots=\mathcal{A}_{L}=Block((1,\ldots,1),d), interactions among elements in an input are induced only by the kernels, not by the product kj(x,xij1)ci,jk_{j}(x,x_{i}^{j-1})\cdot c_{i,j}. That is, the form of interactions does not depend on the learning parameter ci,jc_{i,j}. On the other hand, if we set 𝒜j=Block(𝐦j,d)\mathcal{A}_{j}=Block(\mathbf{m}_{j},d) with 𝐦j=(m1,j,,mMj,j)(1,,1)\mathbf{m}_{j}=(m_{1,j},\ldots,m_{M_{j},j})\neq(1,\ldots,1), then at the jjth layer, we get interactions among elements in the same block through the product kj(x,xij1)ci,jk_{j}(x,x_{i}^{j-1})\cdot c_{i,j}. In this case, the form of interactions is learned through ci,jc_{i,j}. For example, the part M1MlM_{1}\leq\cdots\leq M_{l} (the encoder) in Example 3.1 tries to gradually reduce the dependency among elements in the output of each layer and describe the input with a small number of variables. The part MlMLM_{l}\geq\cdots\geq M_{L} (the decoder) tries to increase the dependency.

5.2 Computing the norm of the Perron–Frobenius operator

We discuss the practical computation and the role of the factor PfL1Pf1|𝒱~(𝐱)op\|P_{f_{L-1}}\cdots P_{f_{1}}|_{\tilde{\mathcal{V}}(\mathbf{x})}\|_{\operatorname{op}} in Eq. (1). Let Gj𝒜n×nG_{j}\in\mathcal{A}^{n\times n} be the Gram matrix whose (i,l)(i,l)-entry is kj(xij1,xlj1)𝒜k_{j}(x_{i}^{j-1},x_{l}^{j-1})\in\mathcal{A}.

Proposition 5.3

For j=1,Lj=1,L, let [ϕj(x1j1),,ϕj(xnj1)]Rj=Qj[\phi_{j}(x_{1}^{j-1}),\ldots,\phi_{j}(x_{n}^{j-1})]R_{j}=Q_{j} be the QR decomposition of [ϕj(x1j1),,ϕj(xnj1)][\phi_{j}(x_{1}^{j-1}),\ldots,\phi_{j}(x_{n}^{j-1})]. Then, we have PfL1Pf1|𝒱~(𝐱)op=RLGLR1op\|P_{f_{L-1}}\cdots P_{f_{1}}|_{\tilde{\mathcal{V}}(\mathbf{x})}\|_{\operatorname{op}}=\|R_{L}^{*}G_{L}R_{1}\|_{\operatorname{op}}.

The computational cost of RLGLR1op\|R_{L}^{*}G_{L}R_{1}\|_{\operatorname{op}} is expensive if nn is large since computing R1R_{1} and RLR_{L} is expensive. Thus, we consider upper bounding RLGLR1op\|R_{L}^{*}G_{L}R_{1}\|_{\operatorname{op}} by a computationally efficient value.

Proposition 5.4

Assume GLG_{L} is invertible. Then, we have

PfL1Pf1|𝒱~(𝐱)opGL1op1/2GLopG11op1/2.\displaystyle\|P_{f_{L-1}}\cdots P_{f_{1}}|_{\tilde{\mathcal{V}}(\mathbf{x})}\|_{\operatorname{op}}\leq\|G_{L}^{-1}\|^{1/2}_{\operatorname{op}}\|G_{L}\|_{\operatorname{op}}\|G_{1}^{-1}\|_{\operatorname{op}}^{1/2}. (2)

Since G1G_{1} is independent of f1,,fLf_{1},\ldots,f_{L}, to make the value PfL1Pf1|𝒱~(𝐱)op\|P_{f_{L-1}}\cdots P_{f_{1}}|_{\tilde{\mathcal{V}}(\mathbf{x})}\|_{\operatorname{op}} small, we try to make the norm of GLG_{L} and GL1G_{L}^{-1} small according to Proposition 5.4. For example, instead of the second term regarding the Perron–Frobenius operators in Eq. (1), we can consider the following term with η>0\eta>0:

λ1((ηI+GL)1op+GLop).\displaystyle\lambda_{1}(\|(\eta I+G_{L})^{-1}\|_{\operatorname{op}}+\|G_{L}\|_{\operatorname{op}}).

Note that the term depends on the training samples x1,,xnx_{1},\ldots,x_{n}. The situation is different from the third term in Eq. (1) since the third term does not depend on the training samples before applying the representer theorem. If we try to minimize a value depending on the training samples, the model seems to be more specific for the training samples, and it may cause overfitting. Thus, the connection between the minimization of the second term in Eq. (1) and generalization cannot be explained by the classical argument about generalization and regularization. However, as we will see in Subsection 6.2, it is related to benign overfitting and has a good effect on generalization.
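A minimal sketch of this surrogate term (ours, assuming a separable last-layer kernel so that G_L is a Kronecker product) is given below. In training, it would be added to the data-fit term and minimized over the coefficients c_{i,j}, typically by gradient descent in a differentiable framework.

```python
# Sketch: surrogate for the Perron-Frobenius regularization term in Eq. (1),
# lambda_1 * (||(eta I + G_L)^{-1}||_op + ||G_L||_op).
import numpy as np

def pf_surrogate(G_L, eta=0.01, lam1=1.0):
    m = G_L.shape[0]
    inv_norm = np.linalg.norm(np.linalg.inv(eta * np.eye(m) + G_L), 2)
    return lam1 * (inv_norm + np.linalg.norm(G_L, 2))

# Toy Gram matrix from a separable kernel k_L(x, y) = ktilde(x, y) * I_d.
rng = np.random.default_rng(0)
n, d = 10, 4
zs = rng.standard_normal((n, d * d))              # outputs of f_{L-1} o ... o f_1, flattened
ktilde = np.exp(-0.1 * np.sum((zs[:, None] - zs[None, :]) ** 2, axis=-1))
G_L = np.kron(ktilde, np.eye(d))                  # n*d x n*d block Gram matrix
print(pf_surrogate(G_L))
```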

Remark 5.5

The inequality (2) implies that as \{\phi_L(x_1^{L-1}),\ldots,\phi_L(x_n^{L-1})\} becomes nearly linearly dependent, the Rademacher complexity becomes large. Through the term \|(\eta I+G_L)^{-1}\|_{\operatorname{op}} in the surrogate above, the function f_{L-1}\circ\cdots\circ f_1 is learned so that it separates x_1,\ldots,x_n well.

Remark 5.6

To evaluate f_L\circ\cdots\circ f_1(x_i) for i=1,\ldots,n, we have to construct a Gram matrix G_j\in\mathcal{A}_j^{n\times n} and compute the product of the Gram matrix and a vector c_j\in\mathcal{A}_j^{n} for j=1,\ldots,L. The computational cost of constructing the Gram matrices does not depend on d. The cost of computing G_jc_j is O(n^2\tilde{d}_j), where \tilde{d}_j is the number of nonzero elements in the matrices in \mathcal{A}_j. Note that if \mathcal{A}_j is the C^*-algebra of block diagonal matrices, then we have \tilde{d}_j\ll d^2. Regarding the cost with respect to n, if the positive definite kernel is separable, we can use random Fourier features [37] to replace the factor n^2 with mn for a small integer m\ll n, as sketched below.
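For the separable case, the following sketch (ours) approximates a Gaussian \tilde{k} with m random Fourier features [37], so that Gram–vector products cost O(mn\tilde{d}_j) instead of O(n^2\tilde{d}_j); the feature dimension m and the bandwidth \sigma are illustrative choices.

```python
# Sketch: random Fourier features for the scalar part of a separable kernel.
import numpy as np

rng = np.random.default_rng(0)
d, n, m, sigma = 4, 200, 100, 2.0

xs = rng.standard_normal((n, d * d))              # flattened d x d inputs
W = rng.standard_normal((m, d * d)) / sigma       # spectral samples for the Gaussian kernel
b = rng.uniform(0, 2 * np.pi, size=m)

def features(x):
    return np.sqrt(2.0 / m) * np.cos(W @ x + b)

Z = np.stack([features(x) for x in xs])           # n x m feature matrix
ktilde_approx = Z @ Z.T                           # approximates exp(-||x-y||^2 / (2 sigma^2))
ktilde_exact = np.exp(-np.sum((xs[:, None] - xs[None, :]) ** 2, -1) / (2 * sigma ** 2))
print(np.abs(ktilde_approx - ktilde_exact).max()) # approximation error, shrinks as m grows
```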

6 Connection and Comparison with Existing Studies

The proposed deep RKHM is deeply related to existing studies by virtue of C^*-algebra and the Perron–Frobenius operators. We discuss these connections below.

6.1 Connection with CNN

The proposed deep RKHM has a duality with CNNs. Let 𝒜0==𝒜L=Circ(d)\mathcal{A}_{0}=\cdots=\mathcal{A}_{L}=Circ(d), the CC^{*}-algebra of dd by dd circulant matrices. For j=1,,Lj=1,\ldots,L, let aj𝒜ja_{j}\in\mathcal{A}_{j}. Let kjk_{j} be an 𝒜j\mathcal{A}_{j}-valued positive definite kernel defined as kj(x,y)=k~j(ajx,ajy)k_{j}(x,y)=\tilde{k}_{j}(a_{j}x,a_{j}y), where k~j\tilde{k}_{j} is an 𝒜j\mathcal{A}_{j}-valued function. The 𝒜j\mathcal{A}_{j}-valued positive definite kernel makes the output of each layer become a circulant matrix, which enables us to apply the convolution as the product of the output and a parameter. Then, fjf_{j} is represented as fj(x)=i=1nk~j(ajx,ajxij1)ci,jf_{j}(x)=\sum_{i=1}^{n}\tilde{k}_{j}(a_{j}x,a_{j}x_{i}^{j-1})c_{i,j} for some ci,j𝒜jc_{i,j}\in\mathcal{A}_{j}. Thus, at the jjth layer, the input xx is multiplied by aja_{j} and is transformed nonlinearly by σj(x)=i=1nkj(x,ajxij1)ci,j\sigma_{j}(x)=\sum_{i=1}^{n}k_{j}(x,a_{j}x_{i}^{j-1})c_{i,j}. Since the product of two circulant matrices corresponds to the convolution, aja_{j} corresponds to a filter. In addition, σj\sigma_{j} corresponds to the activation function at the jjth layer. Thus, the deep RKHM with the above setting corresponds to a CNN. The difference between the deep RKHM and the CNN is the parameters that we learn. Whereas for the deep RKHM, we learn the coefficients c1,j,,cn,jc_{1,j},\ldots,c_{n,j}, for the CNN, we learn the parameter aja_{j}. In other words, whereas for the deep RKHM, we learn the activation function σj\sigma_{j}, for the CNN, we learn the filter aja_{j}. It seems reasonable to interpret this difference as a consequence of solving the problem in the primal or in the dual. In the primal, the number of the learning parameters depends on the dimension of the data (or the filter for convolution), while in the dual, it depends on the size of the data.
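The circulant–convolution correspondence underlying this connection can be checked directly. The sketch below (ours) verifies that multiplying by a circulant matrix built from a filter equals circular convolution with that filter, computed here via the FFT.

```python
# Sketch: product with a circulant matrix equals circular convolution.
import numpy as np

def circulant(c):
    return np.stack([np.roll(c, i) for i in range(len(c))], axis=1)

rng = np.random.default_rng(0)
d = 8
filt = rng.standard_normal(d)                 # plays the role of the filter a_j
x_col = rng.standard_normal(d)                # first column of a circulant input x

prod_col = (circulant(filt) @ circulant(x_col))[:, 0]               # first column of a_j x
conv = np.real(np.fft.ifft(np.fft.fft(filt) * np.fft.fft(x_col)))   # circular convolution
print(np.allclose(prod_col, conv))            # True
```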

Remark 6.1

The connection between CNNs and shallow RKHMs has already been studied [27]. However, the existing study does not provide the connection between the filter and activation function. The above investigation shows a more clear layer-wise connection of deep RKHMs with CNNs.

6.2 Connection with benign overfitting

Benign overfitting is a phenomenon where a model fits any amount of data yet generalizes well [20, 21]. For kernel regression, Mallinar et al. [22] showed that if the eigenvalues of the integral operator associated with the kernel function over the data distribution decay slower than any power-law decay, then the model exhibits benign overfitting. The Gram matrix is obtained by replacing the integral with the sum over the finite samples. The inequality (2) suggests that the generalization error becomes smaller as the smallest and the largest eigenvalues of the Gram matrix get closer, which means the eigenvalue decay is slower. Combining the observation in Remark 5.5, we can interpret that as the right-hand side of the inequality (2) becomes smaller, the outputs of noisy training data at the (L-1)th layer tend to be more separated from the other outputs. In other words, for the random variable x following the data distribution, f_{L-1}\circ\cdots\circ f_1 is learned so that the distribution of f_{L-1}\circ\cdots\circ f_1(x) generates an integral operator with more separated eigenvalues, which promotes benign overfitting. We also observe this phenomenon numerically in Section 7 and Appendix C.3.2. Since the generalization bound for deep vvRKHSs [5] is described by the Lipschitz constants of the feature maps and the norm of f_j for j=1,\ldots,L, this type of theoretical interpretation regarding benign overfitting is not available for the existing bound for deep vvRKHSs.

Remark 6.2

The above arguments about benign overfitting are valid only for deep RKHMs, i.e., the case of L2L\geq 2. If L=1L=1 (shallow RKHM), then the Gram matrix GL=[k(xi,xj)]i,jG_{L}=[k(x_{i},x_{j})]_{i,j} is fixed and determined only by the training data and the kernel. On the other hand, if L2L\geq 2 (deep RKHM), then GL=[kL(fL1f1(xi),fL1f1(xj))]i,jG_{L}=[k_{L}(f_{L-1}\circ\cdots\circ f_{1}(x_{i}),f_{L-1}\circ\cdots\circ f_{1}(x_{j}))]_{i,j} depends also on f1,,fL1f_{1},\ldots,f_{L-1}. As a result, by adding the term using GLG_{L} to the loss function, we can learn proper f1,,fL1f_{1},\ldots,f_{L-1} so that they make the right-hand side of Eq. (2) small, and the whole network overfits benignly. As LL becomes large, the function fL1f1f_{L-1}\circ\cdots\circ f_{1} changes more flexibly to attain a smaller value of the term. This is an advantage of considering a large LL.

6.3 Connection with neural tangent kernel

Neural tangent kernel has been investigated to understand neural networks using the theory of kernel methods [6, 7]. Generalizing neural networks to CC^{*}-algebra, which is called the CC^{*}-algebra network, is also investigated [32, 38]. We define a neural tangent kernel for the CC^{*}-algebra network and develop a theory for combining neural networks and deep RKHMs as an analogy of the existing studies. Consider the CC^{*}-algebra network f:𝒜N0𝒜f:\mathcal{A}^{N_{0}}\to\mathcal{A} over 𝒜\mathcal{A} with 𝒜\mathcal{A}-valued weight matrices Wj𝒜Nj×Nj1W_{j}\in\mathcal{A}^{N_{j}\times N_{j-1}} and element-wise activation functions σj\sigma_{j}: f(x)=WLσL1(WL1σ1(W1x))f(x)=W_{L}\sigma_{L-1}(W_{L-1}\cdots\sigma_{1}(W_{1}x)\cdots). The (i,j)(i,j)-entry of f(x)f(x) is fi(𝐱j)=𝐖L,iσL1(WL1σ1(W1𝐱j))f_{i}(\mathbf{x}_{j})=\mathbf{W}_{L,i}\sigma_{L-1}(W_{L-1}\cdots\sigma_{1}(W_{1}\mathbf{x}_{j})\cdots), where 𝐱j\mathbf{x}_{j} is the jjth column of xx regarded as xdN0×dx\in\mathbb{C}^{dN_{0}\times d} and 𝐖L,i\mathbf{W}_{L,i} is the iith row of WLW_{L} regarded as WLd×dNL1W_{L}\in\mathbb{C}^{d\times dN_{L-1}}. Thus, the (i,j)(i,j)-entry of f(x)f(x) corresponds to the output of the network fi(𝐱j)f_{i}(\mathbf{x}_{j}). We can consider the neural tangent kernel for each fif_{i}. Chen and Xu [7] showed that the RKHS associated with the neural tangent kernel restricted to the sphere is the same set as that associated with the Laplacian kernel. Therefore, the iith row of f(x)f(x) is described by the shallow RKHS associated with the neural tangent kernel kiNTk_{i}^{\operatorname{NT}}. Let kNNk^{\operatorname{NN}} be the 𝒜\mathcal{A}-valued positive definite kernel whose (i,j)(i,j)-entry is kiNTk_{i}^{\operatorname{NT}} for i=ji=j and 0 for iji\neq j. Then, for any function gkNNg\in\mathcal{M}_{k^{\operatorname{NN}}}, the elements in the iith row of gg are in the RKHS associated with kiNTk_{i}^{\operatorname{NT}}, i.e., associated with the Laplacian kernel. Thus, ff is described by this shallow RKHM.

Remark 6.3

We can combine the deep RKHM and existing neural networks by replacing some fjf_{j} in our model with an existing neural network. The above observation enables us to apply our results in Section 4 to the combined network.

6.4 Comparison to bounds for classical neural networks

Existing bounds for classical neural networks typically depend on the product or sum of the matrix (p,q) norms of the weight matrices W_j [14, 15, 16, 17]. A typical bound is O(\sqrt{1/n}\prod_{j=1}^{L}\|W_j\|_{\operatorname{HS}}). Note that the Hilbert–Schmidt norm is the matrix (2,2) norm. Unlike the operator norm, the matrix (p,q) norm tends to be large as the width of the layers becomes large. On the other hand, the dependency of our bound on the width of the layers is not affected by the number of layers when the kernels are separable. Indeed, assume we set k_1=\tilde{k}_1a_1 and k_L=\tilde{k}_La_L for some complex-valued kernels \tilde{k}_1 and \tilde{k}_L, a_1\in\mathcal{A}_1, and a_L\in\mathcal{A}_L. Then, by Proposition 5.3, the factor \alpha(L):=\|P_{f_{L-1}}\cdots P_{f_1}|_{\tilde{\mathcal{V}}(\mathbf{x})}\|_{\operatorname{op}} is written as \|\tilde{R}_L^*\tilde{G}_L\tilde{R}_1\otimes a_L^2a_1\|_{\operatorname{op}}=\|\tilde{R}_L^*\tilde{G}_L\tilde{R}_1\|_{\operatorname{op}}\;\|a_L^2a_1\|_{\mathcal{A}} for some \tilde{R}_L,\tilde{G}_L,\tilde{R}_1\in\mathbb{C}^{n\times n}. Thus, it is independent of d. The only part depending on d is \operatorname{tr}k_1(x_i,x_i), which results in the bound O(\alpha(L)\sqrt{d/n}). Note that the width of the jth layer corresponds to the number of nonzero elements in a matrix in \mathcal{A}_j. We also discuss in Appendix B the connection of our bound with the bound based on Koopman operators, the adjoints of the Perron–Frobenius operators [18].

7 Numerical Results

We numerically confirm our theory and the validity of the proposed deep RKHM.

Comparison to vvRKHS

We compared the generalization property of the deep RKHM to the deep vvRKHS with the same positive definite kernel. For d=10d=10 and n=10n=10, we set xi=(azi)2+ϵix_{i}=(az_{i})^{2}+\epsilon_{i} as input samples, where a100×10a\in\mathbb{R}^{100\times 10} and zi10z_{i}\in\mathbb{R}^{10} are randomly generated by 𝒩(0,0.1)\mathcal{N}(0,0.1), the normal distribution of mean 0 and standard deviation 0.1, ()2(\cdot)^{2} is the elementwise product, and ϵi\epsilon_{i} is the random noise drawn from 𝒩(0,1e3)\mathcal{N}(0,1e-3). We reshaped xix_{i} to a 1010 by 1010 matrix. We set L=3L=3 and kj=k~Ik_{j}=\tilde{k}I for j=1,2,3j=1,2,3, where k~\tilde{k} is the Laplacian kernel. For RKHMs, we set 𝒜1=Block((1,,1),d)\mathcal{A}_{1}=Block((1,\ldots,1),d), 𝒜2=Block((2,,2),d)\mathcal{A}_{2}=Block((2,\ldots,2),d), and 𝒜3=d×d\mathcal{A}_{3}=\mathbb{C}^{d\times d}. This is the autoencoder mentioned in Example 3.1. For vvRKHSs, we set the corresponding Hilbert spaces with the Hilbert–Schmidt inner product. We set the loss function as 1/ni=1n|f(xi)xi|𝒜2𝒜1/n\|\sum_{i=1}^{n}|f(x_{i})-x_{i}|_{\mathcal{A}}^{2}\|_{\mathcal{A}} for the deep RKHM and as 1/ni=1nf(xi)xiHS21/n\sum_{i=1}^{n}\|f(x_{i})-x_{i}\|_{\operatorname{HS}}^{2} for the deep vvRKHS. Here, f=f3f2f1f=f_{3}\circ f_{2}\circ f_{1}. We did not add any terms to the loss function to see how the loss function with the operator norm affects the generalization performance. We computed the same value E[|f(x)x|𝒜2]𝒜1/ni=1n|f(xi)xi|𝒜2𝒜\|\mathrm{E}[|f(x)-x|_{\mathcal{A}}^{2}]\|_{\mathcal{A}}-1/n\|\sum_{i=1}^{n}|f(x_{i})-x_{i}|_{\mathcal{A}}^{2}\|_{\mathcal{A}} for both RKHM and vvRKHS. Figure 2 (a) shows the results. We can see that the deep RKHM generalizes better than the deep vvRKHS, only with the loss function and without any additional terms.
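For reference, the following sketch (ours, not the authors' code) mirrors the synthetic data described above and the two training losses being compared: the operator-norm loss for the deep RKHM and the Hilbert–Schmidt loss for the deep vvRKHS. The stand-in reconstructions are only for illustration; in the experiment they are the outputs of the trained three-layer models.

```python
# Sketch: synthetic autoencoder data and the two loss functions compared here.
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 10
a = 0.1 * rng.standard_normal((100, 10))
zs = 0.1 * rng.standard_normal((n, 10))
xs = [((a @ z) ** 2 + 1e-3 * rng.standard_normal(100)).reshape(d, d) for z in zs]

def rkhm_loss(preds, targets):
    """(1/n) || sum_i |f(x_i) - x_i|_A^2 ||_A with the operator norm."""
    s = sum((p - t).T @ (p - t) for p, t in zip(preds, targets))
    return np.linalg.norm(s, 2) / len(preds)

def vvrkhs_loss(preds, targets):
    """(1/n) sum_i ||f(x_i) - x_i||_HS^2."""
    return sum(np.linalg.norm(p - t, 'fro') ** 2 for p, t in zip(preds, targets)) / len(preds)

preds = [x + 0.01 * rng.standard_normal((d, d)) for x in xs]   # stand-in reconstructions
print(rkhm_loss(preds, xs), vvrkhs_loss(preds, xs))
```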

Observation about benign overfitting

We analyzed the overfitting numerically. For d=10d=10 and n=1000n=1000, we randomly sampled dd by dd diagonal matrices x1,,xn𝒜0x_{1},\ldots,x_{n}\in\mathcal{A}_{0} from the normal distribution 𝒩(0,0.1)\mathcal{N}(0,0.1). We set yi=xi2+ϵiy_{i}=x_{i}^{2}+\epsilon_{i} for i=1,,ni=1,\ldots,n, where ϵi\epsilon_{i} is a noise drawn from the normal distribution 𝒩(0,0.001)\mathcal{N}(0,0.001). The magnitude of the noise is 1010% of xi2x_{i}^{2}. In addition, we set L=2L=2, 𝒜1=d×d\mathcal{A}_{1}=\mathbb{C}^{d\times d}, 𝒜2=Block((1,,1),d)\mathcal{A}_{2}=Block((1,\ldots,1),d), and kjk_{j} as the same kernel as the above experiment. The additional term to the loss function is set as λ1((ηI+GL)1op+GLop)+λ2fLL2\lambda_{1}(\|(\eta I+G_{L})^{-1}\|_{\operatorname{op}}+\|G_{L}\|_{\operatorname{op}})+\lambda_{2}\|f_{L}\|_{\mathcal{M}_{L}}^{2}, where η=0.01\eta=0.01 and λ2=0.01\lambda_{2}=0.01 according to Subsection 5.2. We computed the generalization error for the cases of λ1=0\lambda_{1}=0 and λ1=102\lambda_{1}=10^{2}. Figure 2 (b) shows the result. We can see that the generalization error saturates without the additional term motivated by the Perron–Frobenius operator. On the other hand, with the additional term, the generalization error becomes small, which is the effect of benign overfitting.

Comparison to CNN

We compared the deep RKHM to a CNN on the classification task with MNIST [39]. We set d=28d=28 and n=20n=20. We constructed a deep RKHM combined with a neural network with 2 dense layers. For the deep RKHM, we set L=2L=2, 𝒜0=d×d\mathcal{A}_{0}=\mathbb{C}^{d\times d}, 𝒜1=Block((7,7,7,7),d)\mathcal{A}_{1}=Block((7,7,7,7),d), and 𝒜2=Block((4,,4),d)\mathcal{A}_{2}=Block((4,\ldots,4),d). Then, two dense layers are added. See Subsection 6.3 about combining the deep RKHM with neural networks. Regarding the additional term to the loss function, we set the same term with the previous experiment with λ2=0.001\lambda_{2}=0.001 and set λ1=1\lambda_{1}=1 or λ1=0\lambda_{1}=0. To compare the deep RKHM to CNNs, we also constructed a network by replacing the deep RKHM with a CNN. The CNN is composed of 2 layers with 7×77\times 7 and 4×44\times 4 filters. Figure 2 (c) shows the test accuracy of these networks. We can see that the deep RKHM outperforms the CNN. In addition, we can see that the test accuracy is higher if we add the term regarding the Perron–Frobenius operators. We discuss the memory consumption and the computational cost of each of the deep RKHM and the CNN in Appendix C.3, and we empirically show that the deep RKHM outperforms the CNN that has the same size of learning parameters as the deep RKHM. We also show additional results about benign overfitting in Appendix C.3.

Figure 2: (a) The box plot of the generalization error of the deep RKHM and vvRKHS at the point that the training error reaches 0.05. (b) Behavior of the generalization error during the learning process with and without the additional term regarding the Perron–Frobenius operators. (c) Test accuracy of the classification task with MNIST for a deep RKHM and a CNN.

8 Conclusion and Limitations

In this paper, we proposed deep RKHM and analyzed it through C^*-algebra and the Perron–Frobenius operators. We derived a generalization bound whose dependency on the output dimension is alleviated by the operator norm and which is related to benign overfitting. We showed a representer theorem for the proposed deep RKHM and connections with existing studies such as CNNs and the neural tangent kernel. Our theoretical analysis shows that C^*-algebra and Perron–Frobenius operators are effective tools for analyzing deep kernel methods. The main contributions of this paper are the theoretical results based on C^*-algebra and the Perron–Frobenius operators, and more practical investigations are required for further progress. For example, although we numerically showed the validity of our method for the case where the number of samples is limited (the last experiment in Section 7), more experimental results for the case where the number of samples is large would be useful for further analysis. Also, although we can apply random Fourier features (Remark 5.6) to reduce the computational cost, more efficient methods specific to deep RKHM remain to be investigated in future work. As for theory, we assumed the well-definedness of the Perron–Frobenius operators. Although separable kernels with invertible matrices, which are typical examples of kernels, satisfy this assumption, generalizing our analysis to other kernels is also left for future work.

Acknowledgements

Hachem Kadri is partially supported by grant ANR-19-CE23-0011 from the French National Research Agency. Masahiro Ikeda is partially supported by grant JPMJCR1913 from JST CREST.

References

  • [1] Youngmin Cho and Lawrence Saul. Kernel methods for deep learning. In Proceedings of Advances in Neural Information Processing Systems (NIPS) 22, 2009.
  • [2] Sebastian W. Ober, Carl E. Rasmussen, and Mark van der Wilk. The promises and pitfalls of deep kernel learning. In Proceedings of the 37th Conference on Uncertainty in Artificial Intelligence (UAI), 2021.
  • [3] Behnam Gholami and Abolfazl Hajisami. Kernel auto-encoder for semi-supervised hashing. In Proceedings of 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), 2016.
  • [4] Bastian Bohn, Michael Griebel, and Christian Rieger. A representer theorem for deep kernel learning. Journal of Machine Learning Research, 20(64):1–32, 2019.
  • [5] Pierre Laforgue, Stéphan Clémençon, and Florence d’Alche Buc. Autoencoding any data through kernel autoencoders. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.
  • [6] Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In In proceedings of Advances in Neural Information Processing Systems (NeurIPS) 31, 2018.
  • [7] Lin Chen and Sheng Xu. Deep neural tangent kernel and laplace kernel have the same RKHS. In Proceedings of the 9th International Conference on Learning Representations (ICLR), 2021.
  • [8] Julien Mairal, Piotr Koniusz, Zaid Harchaoui, and Cordelia Schmid. Convolutional kernel networks. In Proceedings of the Advances in Neural Information Processing Systems (NIPS) 27, 2014.
  • [9] Alberto Bietti, Grégoire Mialon, Dexiong Chen, and Julien Mairal. A kernel perspective for regularizing deep neural networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), 2019.
  • [10] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
  • [11] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT press, 2018.
  • [12] Vikas Sindhwani, Ha Quang Minh, and Aurélie C. Lozano. Scalable matrix-valued kernel learning for high-dimensional nonlinear multivariate regression and granger causality. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI), 2013.
  • [13] Riikka Huusari and Hachem Kadri. Entangled kernels - beyond separability. Journal of Machine Learning Research, 22(24):1–40, 2021.
  • [14] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Proceedings of the 2015 Conference on Learning Theory (COLT), 2015.
  • [15] Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. In Proceedings of the 2018 Conference On Learning Theory (COLT), 2018.
  • [16] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Proceedings of Advances in Neural Information Processing Systems (NIPS) 31, 2017.
  • [17] Haotian Ju, Dongyue Li, and Hongyang R Zhang. Robust fine-tuning of deep neural networks with Hessian-based generalization guarantees. In Proceedings of the 39th International Conference on Machine Learning (ICML), 2022.
  • [18] Yuka Hashimoto, Sho Sonoda, Isao Ishikawa, Atsushi Nitanda, and Taiji Suzuki. Koopman-based bound for generalization: New aspect of neural networks regarding nonlinear noise filtering. arXiv: 2302.05825, 2023.
  • [19] Taiji Suzuki, Hiroshi Abe, and Tomoaki Nishimura. Compression based bound for non-compressed network: unified generalization error analysis of large compressible deep neural network. In Proceedings of the 8th International Conference on Learning Representations (ICLR), 2020.
  • [20] Mikhail Belkin, Alexander Rakhlin, and Alexandre B. Tsybakov. Does data interpolation contradict statistical optimality? In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.
  • [21] Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020.
  • [22] Neil Rohit Mallinar, James B Simon, Amirhesam Abedsoltan, Parthe Pandit, Misha Belkin, and Preetum Nakkiran. Benign, tempered, or catastrophic: Toward a refined taxonomy of overfitting. In Proceedings of Advances in Neural Information Processing Systems (NeurIPS) 37, 2022.
  • [23] Jaeseong Heo. Reproducing kernel Hilbert CC^{*}-modules and kernels associated with cocycles. Journal of Mathematical Physics, 49(10):103507, 2008.
  • [24] Shigeru Itoh. Reproducing kernels in modules over CC^{*}-algebras and their applications. Journal of Mathematics in Nature Science, pages 1–20, 1990.
  • [25] Mohammad S. Moslehian. Vector-valued reproducing kernel Hilbert CC^{*}-modules. Complex Analysis and Operator Theory, 16(1):Paper No. 2, 2022.
  • [26] Yuka Hashimoto, Isao Ishikawa, Masahiro Ikeda, Fuyuta Komura, Takeshi Katsura, and Yoshinobu Kawahara. Reproducing kernel Hilbert CC^{*}-module and kernel mean embeddings. Journal of Machine Learning Research, 22(267):1–56, 2021.
  • [27] Yuka Hashimoto, Masahiro Ikeda, and Hachem Kadri. Learning in RKHM: a CC^{*}-algebraic twist for kernel machines. In Proceedings of the 26th International Conference on Artificial Intelligence and Statistics (AISTATS), 2023.
  • [28] Yoshinobu Kawahara. Dynamic mode decomposition with reproducing kernels for Koopman spectral analysis. In Advances in Neural Information Processing Systems (NIPS) 29, pages 911–919, 2016.
  • [29] Isao Ishikawa, Keisuke Fujii, Masahiro Ikeda, Yuka Hashimoto, and Yoshinobu Kawahara. Metric on nonlinear dynamical systems with Perron-Frobenius operators. In Advances in Neural Information Processing Systems (NeurIPS) 31, pages 2856–2866, 2018.
  • [30] Dimitrios Giannakis and Suddhasattwa Das. Extraction and prediction of coherent patterns in incompressible flows through space-time Koopman analysis. Physica D: Nonlinear Phenomena, 402:132211, 2020.
  • [31] Yuka Hashimoto, Isao Ishikawa, Masahiro Ikeda, Yoichi Matsuo, and Yoshinobu Kawahara. Krylov subspace method for nonlinear dynamical systems with random noise. Journal of Machine Learning Research, 21(172):1–29, 2020.
  • [32] Ryuichiro Hataya and Yuka Hashimoto. Noncommutative CC^{*}-algebra net: Learning neural networks with powerful product structure in CC^{*}-algebra. arXiv: 2302.01191, 2023.
  • [33] E. Christopher Lance. Hilbert CC^{*}-modules – a Toolkit for Operator Algebraists. London Mathematical Society Lecture Note Series, vol. 210. Cambridge University Press, 1995.
  • [34] Gerard J. Murphy. CC^{*}-Algebras and Operator Theory. Academic Press, 1990.
  • [35] Yuka Hashimoto, Isao Ishikawa, Masahiro Ikeda, Fuyuta Komura, Takeshi Katsura, and Yoshinobu Kawahara. Analysis via orthonormal systems in reproducing kernel Hilbert C^{*}-modules and applications. arXiv: 2003.00738, 2020.
  • [36] Mauricio A Álvarez, Lorenzo Rosasco, Neil D Lawrence, et al. Kernels for vector-valued functions: A review. Foundations and Trends® in Machine Learning, 4(3):195–266, 2012.
  • [37] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Proceedings of the Advances in Neural Information Processing Systems 20 (NIPS), 2007.
  • [38] Yuka Hashimoto, Zhao Wang, and Tomoko Matsui. CC^{*}-algebra net: a new approach generalizing neural network parameters to CC^{*}-algebra. In Proceedings of the 39th International Conference on Machine Learning (ICML), 2022.
  • [39] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [40] Andreas Maurer. A vector-contraction inequality for rademacher complexities. In Proceedings of the 27th International Conference on Algorithmic Learning Theory (ALT), 2016.

Appendix A Proofs

We provide the proofs of the theorems, propositions, and lemmas in the main paper.

Lemma 2.7    If {ϕ1(x)x𝒳}\{\phi_{1}(x)\,\mid\,x\in\mathcal{X}\} is 𝒜\mathcal{A}-linearly independent, then PfP_{f} is well-defined.

Proof  Assume \sum_{i=1}^{n}\phi_1(x_i)c_i=\sum_{i=1}^{n}\phi_1(x_i)d_i for n\in\mathbb{N} and c_i,d_i\in\mathcal{A}. Since \{\phi_1(x)\,\mid\,x\in\mathcal{X}\} is \mathcal{A}-linearly independent, we have c_i=d_i for i=1,\ldots,n. Thus, P_f\sum_{i=1}^{n}\phi_1(x_i)c_i=\sum_{i=1}^{n}\phi_2(f(x_i))c_i=\sum_{i=1}^{n}\phi_2(f(x_i))d_i=P_f\sum_{i=1}^{n}\phi_1(x_i)d_i. \Box

Lemma 2.8    Let k_1=\tilde{k}a, i.e., k_1 is separable, for an invertible operator a\in\mathcal{A} and a complex-valued kernel \tilde{k}. Assume \{\tilde{\phi}(x)\,\mid\,x\in\mathcal{X}\} is linearly independent (e.g., \tilde{k} is Gaussian or Laplacian), where \tilde{\phi} is the feature map associated with \tilde{k}. Then, \{\phi_1(x)\,\mid\,x\in\mathcal{X}\} is \mathcal{A}-linearly independent.

Proof  If i=1nϕ1(xi)ci=0\sum_{i=1}^{n}\phi_{1}(x_{i})c_{i}=0, we have

0=i=1nϕ1(xi)ci,i=1nϕ1(xi)cik=cGc=(G1/2c)(G1/2c),\displaystyle 0=\left\langle\sum_{i=1}^{n}\phi_{1}(x_{i})c_{i},\sum_{i=1}^{n}\phi_{1}(x_{i})c_{i}\right\rangle_{\mathcal{M}_{k}}\!\!=c^{*}Gc=(G^{1/2}c)^{*}(G^{1/2}c),

where GG is the 𝒜n×n\mathcal{A}^{n\times n}-valued Gram matrix whose (i,j)(i,j)-entry is k1(xi,xj)𝒜k_{1}(x_{i},x_{j})\in\mathcal{A} and c=(c1,,cn)c=(c_{1},\ldots,c_{n}). Thus, we obtain G1/2c=0G^{1/2}c=0. Let G~\tilde{G} be the standard Gram matrix whose (i,j)(i,j)-entry is k~(xi,xj)\tilde{k}(x_{i},x_{j}). Since G=G~aG=\tilde{G}\otimes a and both G~\tilde{G} and aa are invertible, the inverse of GG is G~1a1\tilde{G}^{-1}\otimes a^{-1}. Thus, G1/2G^{1/2} is invertible and we have c=0c=0. \Box
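As a numerical illustration of the separable structure used above (outside the proof), the following sketch builds the 𝒜\mathcal{A}-valued Gram matrix of k1=k~ak_{1}=\tilde{k}a as a Kronecker product and checks its invertibility. The Gaussian scalar kernel, the dimensions, and the randomly generated positive operator aa are placeholders chosen only for this illustration.

```python
import numpy as np

def gaussian_gram(X, gamma=1.0):
    # Scalar-valued Gram matrix of a Gaussian kernel; strictly positive
    # definite (hence invertible) for pairwise distinct inputs.
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dist)

rng = np.random.default_rng(0)
n, m, d = 5, 3, 4                     # n samples, inputs in R^m, A = d x d matrices
X = rng.standard_normal((n, m))
B = rng.standard_normal((d, d))
a = B @ B.T + 0.1 * np.eye(d)         # a positive and invertible operator a

G_tilde = gaussian_gram(X)            # standard Gram matrix of k_tilde
G = np.kron(G_tilde, a)               # A-valued Gram matrix of k_1 = k_tilde * a

# G^{-1} = G_tilde^{-1} (x) a^{-1}, so G^{1/2} is invertible and c = 0 is forced.
G_inv = np.kron(np.linalg.inv(G_tilde), np.linalg.inv(a))
print(np.allclose(G @ G_inv, np.eye(n * d)))
```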

Lemma 4.2    Let \mathcal{F} be a function class of d×d\mathbb{R}^{d\times d}-valued functions on 𝒜0\mathcal{A}_{0} bounded by CC (i.e., f(x)𝒜C\|f(x)\|_{\mathcal{A}}\leq C for any x𝒜0x\in\mathcal{A}_{0}). Let 𝒢~(,p)={(x,y)(f(x)y)p2f,y𝒜E}\tilde{\mathcal{G}}(\mathcal{F},p)=\{(x,y)\mapsto\|(f(x)-y)p\|^{2}\,\mid\,f\in\mathcal{F},\|y\|_{\mathcal{A}}\leq E\} and M=2(C+E)2M=2(C+E)^{2}. Let pdp\in\mathbb{R}^{d} satisfy p=1\|p\|=1 and let δ(0,1)\delta\in(0,1). Then, for any g𝒢~(,p)g\in\tilde{\mathcal{G}}(\mathcal{F},p), with probability at least 1δ1-\delta, we have

E[|g(x,y)|𝒜2]1/2p21ni=1n|g(xi,yi)|𝒜2𝒜+2R^n(𝐱,𝒢~(,p))+3Mlog(2/δ)2n.\displaystyle\|\mathrm{E}[|g(x,y)|^{2}_{\mathcal{A}}]^{1/2}p\|^{2}\leq\bigg{\|}\frac{1}{n}\sum_{i=1}^{n}|g(x_{i},y_{i})|^{2}_{\mathcal{A}}\bigg{\|}_{\mathcal{A}}+2\hat{R}_{n}(\mathbf{x},\tilde{\mathcal{G}}(\mathcal{F},p))+3M\sqrt{\frac{\log(2/\delta)}{2n}}.

Proof  For (x,y),(x,y)𝒜×d×d(x,y),(x^{\prime},y^{\prime})\in\mathcal{A}\times\mathbb{R}^{d\times d}, we have

(f(x)y)p2(f(x)y)p2\displaystyle\|(f(x)-y)p\|^{2}-\|(f(x^{\prime})-y^{\prime})p\|^{2}
=f(x)p22f(x)p,yp+yp2f(x)p2+2f(x)p,ypyp22(C+E)2.\displaystyle\qquad=\|f(x)p\|^{2}-2\left\langle f(x)p,yp\right\rangle+\|yp\|^{2}-\|f(x^{\prime})p\|^{2}+2\left\langle f(x^{\prime})p,y^{\prime}p\right\rangle-\|y^{\prime}p\|^{2}\leq 2(C+E)^{2}.

In addition, we have

E[|g(x,y)|𝒜2]1/2p2(1ni=1n|g(xi,yi)|𝒜2)1/2p2=p,E[|g(x,y)|𝒜2]pp,1ni=1n|g(xi,yi)|𝒜2p\displaystyle\|\mathrm{E}[|g(x,y)|^{2}_{\mathcal{A}}]^{1/2}p\|^{2}-\bigg{\|}\bigg{(}\frac{1}{n}\sum_{i=1}^{n}|g(x_{i},y_{i})|^{2}_{\mathcal{A}}\bigg{)}^{1/2}p\bigg{\|}^{2}=\left\langle p,\mathrm{E}[|g(x,y)|_{\mathcal{A}}^{2}]p\right\rangle-\bigg{\langle}p,\frac{1}{n}\sum_{i=1}^{n}|g(x_{i},y_{i})|^{2}_{\mathcal{A}}p\bigg{\rangle}
=E[p,|g(x,y)|𝒜2p]1ni=1np,|g(xi,yi)|𝒜2p=E[g(x,y)p2]1ni=1ng(xi,yi)p2.\displaystyle\qquad=\mathrm{E}[\left\langle p,|g(x,y)|_{\mathcal{A}}^{2}p\right\rangle]-\frac{1}{n}\sum_{i=1}^{n}\left\langle p,|g(x_{i},y_{i})|_{\mathcal{A}}^{2}p\right\rangle=\mathrm{E}[\|g(x,y)p\|^{2}]-\frac{1}{n}\sum_{i=1}^{n}\|g(x_{i},y_{i})p\|^{2}.

Since (x,y)g(x,y)p2(x,y)\mapsto\|g(x,y)p\|^{2} is a real-valued map, by Theorem 3.3 by Mohri et al. [11], we have

E[|g(x,y)|𝒜2]1/2p2(1ni=1n|g(xi,yi)|𝒜2)1/2p22R^n(𝐱,𝒢~(,p))+3Mlog(2/δ)2n.\displaystyle\|\mathrm{E}[|g(x,y)|_{\mathcal{A}}^{2}]^{1/2}p\|^{2}-\bigg{\|}\bigg{(}\frac{1}{n}\sum_{i=1}^{n}|g(x_{i},y_{i})|_{\mathcal{A}}^{2}\bigg{)}^{1/2}p\bigg{\|}^{2}\leq 2\hat{R}_{n}(\mathbf{x},\tilde{\mathcal{G}}(\mathcal{F},p))+3M\sqrt{\frac{\log(2/\delta)}{2n}}.

The inequality (1ni=1n|g(xi,yi)|𝒜2)1/2p21ni=1n|g(xi,yi)|𝒜2𝒜\|(\frac{1}{n}\sum_{i=1}^{n}|g(x_{i},y_{i})|_{\mathcal{A}}^{2})^{1/2}p\|^{2}\leq\|\frac{1}{n}\sum_{i=1}^{n}|g(x_{i},y_{i})|_{\mathcal{A}}^{2}\|_{\mathcal{A}} completes the proof. \Box

Lemma 4.3    With the same notation as in Lemma 4.2, let K=22(C+E)K=2\sqrt{2}(C+E). Then, we have R^n(𝐱,𝒢~(,p))KR^n(𝐱,p)\hat{R}_{n}(\mathbf{x},\tilde{\mathcal{G}}(\mathcal{F},p))\leq K\hat{R}_{n}(\mathbf{x},\mathcal{F}p), where p={xf(x)pf}\mathcal{F}p=\{x\mapsto f(x)p\,\mid\,f\in\mathcal{F}\}.

Proof  Let hi(z)=zyip2h_{i}(z)=\|z-y_{i}p\|^{2} for z{f(x)pf}dz\in\{f(x)p\,\mid\,f\in\mathcal{F}\}\subseteq\mathbb{R}^{d} and i=1,,ni=1,\ldots,n. The Lipschitz constant of hih_{i} is calculated as follows: We have

hi(z)hi(z)\displaystyle h_{i}(z)-h_{i}(z^{\prime}) =z22zz,yipz2\displaystyle=\|z\|^{2}-2\left\langle z-z^{\prime},y_{i}p\right\rangle-\|z^{\prime}\|^{2}
(z+z)(zz)+2zzyip\displaystyle\leq(\|z\|+\|z^{\prime}\|)(\|z\|-\|z^{\prime}\|)+2\|z-z^{\prime}\|\|y_{i}p\|
(z+z)zz+2zzyip\displaystyle\leq(\|z\|+\|z^{\prime}\|)\|z-z^{\prime}\|+2\|z-z^{\prime}\|\|y_{i}p\|
(2C+2E)zz.\displaystyle\leq(2C+2E)\|z-z^{\prime}\|.

Thus, we have |hi(z)hi(z)|2(C+E)zz|h_{i}(z)-h_{i}(z^{\prime})|\leq 2(C+E)\|z-z^{\prime}\|. By Corollary 4 by Maurer [40], the statement is proved. \Box

Lemma 4.4    Let pdp\in\mathbb{R}^{d} satisfy p=1\|p\|=1. For 1\mathcal{F}_{1} defined in Section 3, we have

R^n(𝐱,1p)B1n(i=1ntr(k1(xi,xi)))1/2.\displaystyle\hat{R}_{n}(\mathbf{x},\mathcal{F}_{1}p)\leq\frac{B_{1}}{n}\Big{(}\sum_{i=1}^{n}\operatorname{tr}(k_{1}(x_{i},x_{i}))\Big{)}^{1/2}.

Proof  We have

E[supf11ni=1nσi,f(xi)p]=E[supf11ni=1n(σip)p,ϕ1(xi),f1p]\displaystyle\mathrm{E}\bigg{[}\sup_{f\in\mathcal{F}_{1}}\frac{1}{n}\sum_{i=1}^{n}\left\langle\sigma_{i},f(x_{i})p\right\rangle\bigg{]}=\mathrm{E}\bigg{[}\sup_{f\in\mathcal{F}_{1}}\frac{1}{n}\sum_{i=1}^{n}\left\langle(\sigma_{i}p^{*})p,\left\langle\phi_{1}(x_{i}),f\right\rangle_{\mathcal{M}_{1}}p\right\rangle\bigg{]}
=E[supf11np,i=1n(pσi)ϕ1(xi),f1p]\displaystyle\qquad=\mathrm{E}\bigg{[}\sup_{f\in\mathcal{F}_{1}}\frac{1}{n}\bigg{\langle}p,\;\sum_{i=1}^{n}(p\sigma_{i}^{*})\left\langle\phi_{1}(x_{i}),f\right\rangle_{\mathcal{M}_{1}}p\bigg{\rangle}\bigg{]}
=E[supf11np,i=1nϕ1(xi)(σip),f~1p]\displaystyle\qquad=\mathrm{E}\bigg{[}\sup_{f\in\mathcal{F}_{1}}\frac{1}{n}\bigg{\langle}p,\;\bigg{\langle}\sum_{i=1}^{n}\phi_{1}(x_{i})(\sigma_{i}p^{*}),f\bigg{\rangle}_{\tilde{\mathcal{M}}_{1}}p\bigg{\rangle}\bigg{]}
=E[supf11n(i=1nϕ1(xi)(σip),f)~1,p]\displaystyle\qquad=\mathrm{E}\bigg{[}\sup_{f\in\mathcal{F}_{1}}\frac{1}{n}\bigg{(}\sum_{i=1}^{n}\phi_{1}(x_{i})(\sigma_{i}p^{*}),f\bigg{)}_{\tilde{\mathcal{M}}_{1},p}\bigg{]}
E[supf11ni=1nϕ1(xi)(σip)~1,pf~1,p]\displaystyle\qquad\leq\mathrm{E}\bigg{[}\sup_{f\in\mathcal{F}_{1}}\frac{1}{n}\bigg{\|}\sum_{i=1}^{n}\phi_{1}(x_{i})(\sigma_{i}p^{*})\bigg{\|}_{\tilde{\mathcal{M}}_{1},p}\;\|f\|_{\tilde{\mathcal{M}}_{1},p}\bigg{]} (3)
E[supf11np,|i=1nϕ1(xi)(σip)|~12p1/2f~1]\displaystyle\qquad\leq\mathrm{E}\bigg{[}\sup_{f\in\mathcal{F}_{1}}\frac{1}{n}\bigg{\langle}p,\;\bigg{|}\sum_{i=1}^{n}\phi_{1}(x_{i})(\sigma_{i}p^{*})\bigg{|}^{2}_{\tilde{\mathcal{M}}_{1}}p\bigg{\rangle}^{1/2}\;\|f\|_{\tilde{\mathcal{M}}_{1}}\bigg{]} (4)
B1nE[(i,j=1nppσik1(xi,xj)σjpp)1/2]\displaystyle\qquad\leq\frac{B_{1}}{n}\mathrm{E}\bigg{[}\bigg{(}\sum_{i,j=1}^{n}p^{*}p\sigma_{i}^{*}k_{1}(x_{i},x_{j})\sigma_{j}p^{*}p\bigg{)}^{1/2}\bigg{]}
B1nE[i,j=1nσik1(xi,xj)σj]1/2\displaystyle\qquad\leq\frac{B_{1}}{n}\mathrm{E}\bigg{[}\sum_{i,j=1}^{n}\sigma_{i}^{*}k_{1}(x_{i},x_{j})\sigma_{j}\bigg{]}^{1/2} (5)
=B1nE[i=1nσik1(xi,xi)σi]1/2=B1n(i=1ntrk1(xi,xi))1/2,\displaystyle\qquad=\frac{B_{1}}{n}\mathrm{E}\bigg{[}\sum_{i=1}^{n}\sigma_{i}^{*}k_{1}(x_{i},x_{i})\sigma_{i}\bigg{]}^{1/2}=\frac{B_{1}}{n}\bigg{(}\sum_{i=1}^{n}\operatorname{tr}k_{1}(x_{i},x_{i})\bigg{)}^{1/2},

where for pdp\in\mathbb{C}^{d} and f,g~1f,g\in\tilde{\mathcal{M}}_{1}, (f,g)~1,p(f,g)_{\tilde{\mathcal{M}}_{1},p} is the semi-inner product defined by (f,g)~1,p=p,f,g~1p(f,g)_{\tilde{\mathcal{M}}_{1},p}=\left\langle p,\left\langle f,g\right\rangle_{\tilde{\mathcal{M}}_{1}}p\right\rangle and |f|~12=f,f~1|f|_{\tilde{\mathcal{M}}_{1}}^{2}=\left\langle f,f\right\rangle_{\tilde{\mathcal{M}}_{1}}. In addition, the inequality (3) follows from the Cauchy–Schwarz inequality and the inequality (5) follows from Jensen’s inequality. Note that the Cauchy–Schwarz inequality is still valid for semi-inner products. The inequality (4) is derived from the inequality

f~1,p2=p,f,f~1pp2f,f~1𝒜f~12.\displaystyle\|f\|_{\tilde{\mathcal{M}}_{1},p}^{2}=\left\langle p,\left\langle f,f\right\rangle_{\tilde{\mathcal{M}}_{1}}p\right\rangle\leq\|p\|^{2}\|\left\langle f,f\right\rangle_{\tilde{\mathcal{M}}_{1}}\|_{\mathcal{A}}\leq\|f\|_{\tilde{\mathcal{M}}_{1}}^{2}.

\Box

Theorem 4.1    Assume there exists D>0D>0 such that k1(x,x)𝒜D\|k_{1}(x,x)\|_{\mathcal{A}}\leq D for any x𝒜0x\in\mathcal{A}_{0}. Let K~=42(DB1+E)B1\tilde{K}=4\sqrt{2}(\sqrt{D}B_{1}+E)B_{1} and M~=6(DB1+E)2\tilde{M}=6(\sqrt{D}B_{1}+E)^{2}. Let δ(0,1)\delta\in(0,1). Then, for any g𝒢(1)g\in\mathcal{G}(\mathcal{F}_{1}), where 1\mathcal{F}_{1} is defined in Section 3, with probability at least 1δ1-\delta, we have

E[|g(x,y)|𝒜2]𝒜1ni=1n|g(xi,yi)|𝒜2𝒜+K~n(i=1ntrk1(xi,xi))1/2+M~log(2/δ)2n.\displaystyle\|\mathrm{E}[|g(x,y)|^{2}_{\mathcal{A}}]\|_{\mathcal{A}}\leq\bigg{\|}\frac{1}{n}\sum_{i=1}^{n}|g(x_{i},y_{i})|^{2}_{\mathcal{A}}\bigg{\|}_{\mathcal{A}}+\frac{\tilde{K}}{n}\bigg{(}\sum_{i=1}^{n}\operatorname{tr}k_{1}(x_{i},x_{i})\bigg{)}^{1/2}+\tilde{M}\sqrt{\frac{\log(2/\delta)}{2n}}.

Proof  For f1f\in\mathcal{F}_{1} and x𝒜0x\in\mathcal{A}_{0}, we have

f(x)𝒜=ϕ1(x),f1𝒜k1(x,x)𝒜1/2f1DB1.\displaystyle\|f(x)\|_{\mathcal{A}}=\|\left\langle\phi_{1}(x),f\right\rangle_{\mathcal{M}_{1}}\|_{\mathcal{A}}\leq\|k_{1}(x,x)\|_{\mathcal{A}}^{1/2}\|f\|_{\mathcal{M}_{1}}\leq\sqrt{D}B_{1}.

Thus, we set CC as DB1\sqrt{D}B_{1} and apply Lemmas 4.2, 4.3, and 4.4. Then, for each pdp\in\mathbb{R}^{d} satisfying p=1\|p\|=1, with probability at least 1δ1-\delta, we have

E[|g(x,y)|𝒜2]1/2p21ni=1n|g(xi,yi)|𝒜2𝒜+K~n(i=1ntrk1(xi,xi))1/2+M~log(2/δ)2n.\displaystyle\|\mathrm{E}[|g(x,y)|_{\mathcal{A}}^{2}]^{1/2}p\|^{2}\leq\bigg{\|}\frac{1}{n}\sum_{i=1}^{n}|g(x_{i},y_{i})|_{\mathcal{A}}^{2}\bigg{\|}_{\mathcal{A}}+\frac{\tilde{K}}{n}\bigg{(}\sum_{i=1}^{n}\operatorname{tr}k_{1}(x_{i},x_{i})\bigg{)}^{1/2}+\tilde{M}\sqrt{\frac{\log(2/\delta)}{2n}}.

Therefore, choosing pp as a unit eigenvector of E[|g(x,y)|𝒜2]\mathrm{E}[|g(x,y)|^{2}_{\mathcal{A}}] corresponding to its largest eigenvalue, so that E[|g(x,y)|𝒜2]1/2p2=E[|g(x,y)|𝒜2]1/2𝒜2\|\mathrm{E}[|g(x,y)|^{2}_{\mathcal{A}}]^{1/2}p\|^{2}=\|\mathrm{E}[|g(x,y)|^{2}_{\mathcal{A}}]^{1/2}\|^{2}_{\mathcal{A}}, we obtain

E[|g(x,y)|𝒜2]𝒜\displaystyle\|\mathrm{E}[|g(x,y)|_{\mathcal{A}}^{2}]\|_{\mathcal{A}} =E[|g(x,y)|𝒜2]1/2𝒜2\displaystyle=\|\mathrm{E}[|g(x,y)|_{\mathcal{A}}^{2}]^{1/2}\|^{2}_{\mathcal{A}}
1ni=1n|g(xi,yi)|𝒜2𝒜+K~n(i=1ntrk1(xi,xi))1/2+M~log(2/δ)2n,\displaystyle\leq\bigg{\|}\frac{1}{n}\sum_{i=1}^{n}|g(x_{i},y_{i})|_{\mathcal{A}}^{2}\bigg{\|}_{\mathcal{A}}+\frac{\tilde{K}}{n}\bigg{(}\sum_{i=1}^{n}\operatorname{tr}k_{1}(x_{i},x_{i})\bigg{)}^{1/2}+\tilde{M}\sqrt{\frac{\log(2/\delta)}{2n}},

which completes the proof. \Box
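The data-dependent terms in Theorem 4.1 can be evaluated directly from the diagonal Gram blocks k1(xi,xi)k_{1}(x_{i},x_{i}). The sketch below is only illustrative; the constants B1B_{1}, EE, DD, the dimension, and the identity blocks are placeholders rather than values used in the experiments.

```python
import numpy as np

def theorem41_terms(gram_diag_blocks, B1, E, D, delta):
    # gram_diag_blocks: list of the d x d matrices k_1(x_i, x_i), i = 1, ..., n.
    n = len(gram_diag_blocks)
    K_tilde = 4 * np.sqrt(2) * (np.sqrt(D) * B1 + E) * B1        # K~ in Theorem 4.1
    M_tilde = 6 * (np.sqrt(D) * B1 + E) ** 2                     # M~ in Theorem 4.1
    trace_sum = sum(np.trace(b) for b in gram_diag_blocks)
    complexity = (K_tilde / n) * np.sqrt(trace_sum)              # (K~/n)(sum_i tr k_1(x_i, x_i))^{1/2}
    confidence = M_tilde * np.sqrt(np.log(2.0 / delta) / (2 * n))
    return complexity, confidence

# Toy example with identity blocks, so tr k_1(x_i, x_i) = d.
n, d = 100, 9
blocks = [np.eye(d) for _ in range(n)]
print(theorem41_terms(blocks, B1=1.0, E=1.0, D=1.0, delta=0.05))
```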

Proposition 4.6    We have

R^n(𝐱,Ldeepp)1nsup(fjj)jPfL1Pf1|𝒱~(𝐱)opfLL(i=1ntr(k1(xi,xi)))1/2.\displaystyle\hat{R}_{n}(\mathbf{x},\mathcal{F}^{\mathrm{deep}}_{L}p)\leq\frac{1}{n}\sup_{(f_{j}\in\mathcal{F}_{j})_{j}}\|P_{f_{L-1}}\cdots P_{f_{1}}|_{\tilde{\mathcal{V}}(\mathbf{x})}\|_{\operatorname{op}}\;\|f_{L}\|_{{\mathcal{M}}_{L}}\;\Big{(}\sum_{i=1}^{n}\operatorname{tr}(k_{1}(x_{i},x_{i}))\Big{)}^{1/2}.

Here, 𝒱~(𝐱)\tilde{\mathcal{V}}(\mathbf{x}) is the submodule of ~1\tilde{\mathcal{M}}_{1} generated by ϕ1(x1),,ϕ1(xn)\phi_{1}(x_{1}),\ldots,\phi_{1}(x_{n}).

The following lemma by Lance [33, Proposition 1.2] is used in proving Proposition 4.6. Here, for a,b𝒜a,b\in\mathcal{A}, aba\leq b means bab-a is Hermitian positive semi-definite.

Lemma A.1    Let \mathcal{M} and 𝒩\mathcal{N} be Hilbert 𝒜\mathcal{A}-modules and let AA be a bounded 𝒜\mathcal{A}-linear operator from \mathcal{M} to 𝒩\mathcal{N}. Then, for any ww\in\mathcal{M}, we have |Aw|𝒩2Aop2|w|2|Aw|_{\mathcal{N}}^{2}\leq\|A\|_{\operatorname{op}}^{2}|w|_{\mathcal{M}}^{2}.

Proof of Proposition 4.6 

E[supfLdeep1ni=1nσi,f(xi)p]=E[sup(fjj)j1ni=1n(σip)p,fL(fL1(f1(xi)))p]\displaystyle\mathrm{E}\bigg{[}\sup_{f\in\mathcal{F}^{\mathrm{deep}}_{L}}\frac{1}{n}\sum_{i=1}^{n}\left\langle\sigma_{i},f(x_{i})p\right\rangle\bigg{]}=\mathrm{E}\bigg{[}\sup_{(f_{j}\in\mathcal{F}_{j})_{j}}\frac{1}{n}\sum_{i=1}^{n}\left\langle(\sigma_{i}p^{*})p,f_{L}(f_{L-1}(\cdots f_{1}(x_{i})\cdots))p\right\rangle\bigg{]}
=E[sup(fjj)j1ni=1n(σip)p,ϕL(fL1(f1(xi))),fLLp]\displaystyle\qquad=\mathrm{E}\bigg{[}\sup_{(f_{j}\in\mathcal{F}_{j})_{j}}\frac{1}{n}\sum_{i=1}^{n}\big{\langle}(\sigma_{i}p^{*})p,\left\langle\phi_{L}(f_{L-1}(\cdots f_{1}(x_{i})\cdots)),f_{L}\right\rangle_{\mathcal{M}_{L}}p\big{\rangle}\bigg{]}
=E[sup(fjj)j1ni=1np,(pσi)ϕL(fL1(f1(xi))),fLLp]\displaystyle\qquad=\mathrm{E}\bigg{[}\sup_{(f_{j}\in\mathcal{F}_{j})_{j}}\frac{1}{n}\sum_{i=1}^{n}\big{\langle}p,(p\sigma_{i}^{*})\left\langle\phi_{L}(f_{L-1}(\cdots f_{1}(x_{i})\cdots)),f_{L}\right\rangle_{\mathcal{M}_{L}}p\big{\rangle}\bigg{]}
=E[sup(fjj)j1np,i=1nϕL(fL1(f1(xi)))(σip),fL~Lp]\displaystyle\qquad=\mathrm{E}\bigg{[}\sup_{(f_{j}\in\mathcal{F}_{j})_{j}}\frac{1}{n}\bigg{\langle}p,\bigg{\langle}\sum_{i=1}^{n}\phi_{L}(f_{L-1}(\cdots f_{1}(x_{i})\cdots))(\sigma_{i}p^{*}),f_{L}\bigg{\rangle}_{\tilde{\mathcal{M}}_{L}}p\bigg{\rangle}\bigg{]}
=E[sup(fjj)j1n(i=1nϕL(fL1(f1(xi)))(σip),fL)~L,p]\displaystyle\qquad=\mathrm{E}\bigg{[}\sup_{(f_{j}\in\mathcal{F}_{j})_{j}}\frac{1}{n}\bigg{(}\sum_{i=1}^{n}\phi_{L}(f_{L-1}(\cdots f_{1}(x_{i})\cdots))(\sigma_{i}p^{*}),f_{L}\bigg{)}_{\tilde{\mathcal{M}}_{L},p}\bigg{]}
E[sup(fjj)j1ni=1nϕL(fL1(f1(xi)))(σip)~L,pfL~L,p]\displaystyle\qquad\leq\mathrm{E}\bigg{[}\sup_{(f_{j}\in\mathcal{F}_{j})_{j}}\frac{1}{n}\bigg{\|}\sum_{i=1}^{n}\phi_{L}(f_{L-1}(\cdots f_{1}(x_{i})\cdots))(\sigma_{i}p^{*})\bigg{\|}_{\tilde{\mathcal{M}}_{L},p}\|f_{L}\|_{\tilde{\mathcal{M}}_{L},p}\bigg{]} (6)
E[sup(fjj)j1nPfL1Pf1i=1nϕ1(xi)(σip)~L,pfL~L]\displaystyle\qquad\leq\mathrm{E}\bigg{[}\sup_{(f_{j}\in\mathcal{F}_{j})_{j}}\frac{1}{n}\bigg{\|}P_{f_{L-1}}\cdots P_{f_{1}}\sum_{i=1}^{n}\phi_{1}(x_{i})(\sigma_{i}p^{*})\bigg{\|}_{\tilde{\mathcal{M}}_{L},p}\|f_{L}\|_{\tilde{\mathcal{M}}_{L}}\bigg{]}
=E[sup(fjj)j1np,|PfL1Pf1i=1nϕ1(xi)(σip)|~L2p1/2fL~L]\displaystyle\qquad=\mathrm{E}\bigg{[}\sup_{(f_{j}\in\mathcal{F}_{j})_{j}}\frac{1}{n}\bigg{\langle}p,\;\bigg{|}P_{f_{L-1}}\cdots P_{f_{1}}\sum_{i=1}^{n}\phi_{1}(x_{i})(\sigma_{i}p^{*})\bigg{|}_{\tilde{\mathcal{M}}_{L}}^{2}p\bigg{\rangle}^{1/2}\|f_{L}\|_{\tilde{\mathcal{M}}_{L}}\bigg{]}
E[sup(fjj)j1np,PfL1Pf1|𝒱~(𝐱)op2|i=1nϕ1(xi)(σip)|~L2p1/2fL~L]\displaystyle\qquad\leq\mathrm{E}\bigg{[}\sup_{(f_{j}\in\mathcal{F}_{j})_{j}}\frac{1}{n}\bigg{\langle}p,\;\|P_{f_{L-1}}\cdots P_{f_{1}}|_{\tilde{\mathcal{V}}(\mathbf{x})}\|_{\operatorname{op}}^{2}\;\bigg{|}\sum_{i=1}^{n}\phi_{1}(x_{i})(\sigma_{i}p^{*})\bigg{|}_{\tilde{\mathcal{M}}_{L}}^{2}p\bigg{\rangle}^{1/2}\|f_{L}\|_{\tilde{\mathcal{M}}_{L}}\bigg{]} (7)
1nsup(fjj)jPfL1Pf1|𝒱~(𝐱)opfLLE[p,|i=1nϕ1(xi)(σip)|~L2p1/2]\displaystyle\qquad\leq\frac{1}{n}\sup_{(f_{j}\in\mathcal{F}_{j})_{j}}\|P_{f_{L-1}}\cdots P_{f_{1}}|_{\tilde{\mathcal{V}}(\mathbf{x})}\|_{\operatorname{op}}\;\|f_{L}\|_{{\mathcal{M}}_{L}}\;\mathrm{E}\bigg{[}\bigg{\langle}p,\;\bigg{|}\sum_{i=1}^{n}\phi_{1}(x_{i})(\sigma_{i}p^{*})\bigg{|}_{\tilde{\mathcal{M}}_{L}}^{2}p\bigg{\rangle}^{1/2}\bigg{]}
1nsup(fjj)jPfL1Pf1|𝒱~(𝐱)opfLLE[(i,j=1nppσik1(xi,xj)σjpp)1/2]\displaystyle\qquad\leq\frac{1}{n}\sup_{(f_{j}\in\mathcal{F}_{j})_{j}}\|P_{f_{L-1}}\cdots P_{f_{1}}|_{\tilde{\mathcal{V}}(\mathbf{x})}\|_{\operatorname{op}}\;\|f_{L}\|_{{\mathcal{M}}_{L}}\;\mathrm{E}\bigg{[}\bigg{(}\sum_{i,j=1}^{n}p^{*}p\sigma_{i}^{*}k_{1}(x_{i},x_{j})\sigma_{j}p^{*}p\bigg{)}^{1/2}\bigg{]}
1nsup(fjj)jPfL1Pf1|𝒱~(𝐱)opfLL(i=1ntr(k1(xi,xi)))1/2,\displaystyle\qquad\leq\frac{1}{n}\sup_{(f_{j}\in\mathcal{F}_{j})_{j}}\|P_{f_{L-1}}\cdots P_{f_{1}}|_{\tilde{\mathcal{V}}(\mathbf{x})}\|_{\operatorname{op}}\;\|f_{L}\|_{{\mathcal{M}}_{L}}\;\bigg{(}\sum_{i=1}^{n}\operatorname{tr}(k_{1}(x_{i},x_{i}))\bigg{)}^{1/2},

where the inequality (6) is by the Cauchy–Schwarz inequality and the inequality (7) is by Lemma A.1. \Box

Proposition 5.1    Let h:𝒜n×𝒜n+h:\mathcal{A}^{n}\times\mathcal{A}^{n}\to\mathbb{R}_{+} be an error function, let g1g_{1} be an +\mathbb{R}_{+}-valued function on the space of bounded linear operators on ~1\tilde{\mathcal{M}}_{1}, and let g2:++g_{2}:\mathbb{R}_{+}\to\mathbb{R}_{+} satisfy g2(a)g2(b)g_{2}(a)\leq g_{2}(b) for aba\leq b. Assume the following minimization problem has a solution:

min(fjj)jh(fLf1(x1),,fLf1(xn))+g1(PfL1Pf1|𝒱~(𝐱))+g2(fLL).\displaystyle\min_{(f_{j}\in\mathcal{M}_{j})_{j}}h(f_{L}\circ\cdots\circ f_{1}(x_{1}),\ldots,f_{L}\circ\cdots\circ f_{1}(x_{n}))+g_{1}(P_{f_{L-1}}\cdots P_{f_{1}}|_{\tilde{\mathcal{V}}(\mathbf{x})})+g_{2}(\|f_{L}\|_{\mathcal{M}_{L}}).

Then, there exists a solution admitting a representation of the form fj=i=1nϕj(xij1)ci,jf_{j}=\sum_{i=1}^{n}\phi_{j}(x_{i}^{j-1})c_{i,j} for some c1,j,,cn,j𝒜c_{1,j},\ldots,c_{n,j}\in\mathcal{A} and for j=1,,Lj=1,\ldots,L. Here, xij=fjf1(xi)x_{i}^{j}=f_{j}\circ\cdots\circ f_{1}(x_{i}) for j=1,,Lj=1,\ldots,L and xi0=xix_{i}^{0}=x_{i}.

Proof  Let 𝒱j(𝐱)\mathcal{V}_{j}(\mathbf{x}) be the submodule of j{\mathcal{M}}_{j} generated by ϕj(x1j1),,ϕj(xnj1)\phi_{j}(x_{1}^{j-1}),\ldots,\phi_{j}(x_{n}^{j-1}). Let fj=fj+fjf_{j}=f_{j}^{\parallel}+f_{j}^{\perp}, where fj𝒱j(𝐱)f_{j}^{\parallel}\in\mathcal{V}_{j}(\mathbf{x}) and fj𝒱j(𝐱)f_{j}^{\perp}\in\mathcal{V}_{j}(\mathbf{x})^{\perp}. Then, for j=1,,Lj=1,\ldots,L, we have

fj(xij1)=ϕj(xij1),fjj=ϕj(xij1),fjj=fj(xij1).\displaystyle f_{j}(x_{i}^{j-1})=\big{\langle}\phi_{j}(x_{i}^{j-1}),f_{j}\big{\rangle}_{\mathcal{M}_{j}}=\big{\langle}\phi_{j}(x_{i}^{j-1}),f_{j}^{\parallel}\big{\rangle}_{\mathcal{M}_{j}}=f_{j}^{\parallel}(x_{i}^{j-1}). (8)

In addition, we have

PfL1Pf1|𝒱~(𝐱)=PfL1|𝒱~L1(𝐱)Pf1|𝒱~1(𝐱),\displaystyle P_{f_{L-1}}\cdots P_{f_{1}}|_{\tilde{\mathcal{V}}(\mathbf{x})}=P_{f_{L-1}}|_{\tilde{\mathcal{V}}_{L-1}(\mathbf{x})}\cdots P_{f_{1}}|_{\tilde{\mathcal{V}}_{1}(\mathbf{x})},

where 𝒱~j(𝐱)\tilde{\mathcal{V}}_{j}(\mathbf{x}) is the submodule of ~j\tilde{\mathcal{M}}_{j} generated by ϕj(x1j1),,ϕj(xnj1)\phi_{j}(x_{1}^{j-1}),\ldots,\phi_{j}(x_{n}^{j-1}). In the same manner as Eq. (8), we obtain

Pfji=1nϕj(xij1)ci,j\displaystyle P_{f_{j}}\sum_{i=1}^{n}\phi_{j}(x_{i}^{j-1})c_{i,j} =i=1nϕj+1(fj(xij1))ci,j=i=1nϕj+1(fj(xij1))ci,j\displaystyle=\sum_{i=1}^{n}\phi_{j+1}(f_{j}(x_{i}^{j-1}))c_{i,j}=\sum_{i=1}^{n}\phi_{j+1}\big{(}f_{j}^{\parallel}(x_{i}^{j-1})\big{)}c_{i,j}
=i=1nPfjϕj(xij1)ci,j.\displaystyle=\sum_{i=1}^{n}P_{f_{j}^{\parallel}}\phi_{j}(x_{i}^{j-1})c_{i,j}.

Thus, we have Pfj|𝒱~j(𝐱)=Pfj|𝒱~j(𝐱)P_{f_{j}}|_{\tilde{\mathcal{V}}_{j}(\mathbf{x})}=P_{f_{j}^{\parallel}}|_{\tilde{\mathcal{V}}_{j}(\mathbf{x})}. Furthermore, we have

fLL2=fL+fL,fL+fLL𝒜=fL,fLL+fL,fLL𝒜fL,fLL𝒜=fLL2,\displaystyle\|f_{L}\|_{\mathcal{M}_{L}}^{2}=\big{\|}\big{\langle}f_{L}^{\parallel}+f_{L}^{\perp},f_{L}^{\parallel}+f_{L}^{\perp}\big{\rangle}_{\mathcal{M}_{L}}\big{\|}_{\mathcal{A}}=\big{\|}\big{\langle}f_{L}^{\parallel},f_{L}^{\parallel}\big{\rangle}_{\mathcal{M}_{L}}+\big{\langle}f_{L}^{\perp},f_{L}^{\perp}\big{\rangle}_{\mathcal{M}_{L}}\big{\|}_{\mathcal{A}}\geq\big{\|}\big{\langle}f_{L}^{\parallel},f_{L}^{\parallel}\big{\rangle}_{\mathcal{M}_{L}}\big{\|}_{\mathcal{A}}=\|f_{L}^{\parallel}\|_{\mathcal{M}_{L}}^{2},

where the last inequality is derived from the fact that fL,fLL+fL,fLLfL,fLL=fL,fLL\big{\langle}f_{L}^{\parallel},f_{L}^{\parallel}\big{\rangle}_{\mathcal{M}_{L}}+\big{\langle}f_{L}^{\perp},f_{L}^{\perp}\big{\rangle}_{\mathcal{M}_{L}}-\big{\langle}f_{L}^{\parallel},f_{L}^{\parallel}\big{\rangle}_{\mathcal{M}_{L}}=\big{\langle}f_{L}^{\perp},f_{L}^{\perp}\big{\rangle}_{\mathcal{M}_{L}} is positive and Theorem 2.2.5 (3) by Murphy [34]. As a result, replacing each fjf_{j} with fjf_{j}^{\parallel} does not increase the objective, and the statement is proved. \Box
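As an illustration of the representation guaranteed by Proposition 5.1, the following sketch evaluates a deep RKHM whose layers are written as fj=i=1nϕj(xij1)ci,jf_{j}=\sum_{i=1}^{n}\phi_{j}(x_{i}^{j-1})c_{i,j}, so that fj(x)=i=1nkj(x,xij1)ci,jf_{j}(x)=\sum_{i=1}^{n}k_{j}(x,x_{i}^{j-1})c_{i,j}. The separable kernels kj(x,y)=k~(x,y)Ik_{j}(x,y)=\tilde{k}(x,y)I, the Gaussian scalar kernel based on the Frobenius norm, and the random coefficients are placeholders for illustration and do not reproduce the implementation used in Section 7.

```python
import numpy as np

def k_scalar(x, y, gamma=0.01):
    # Placeholder scalar kernel on d x d matrices (Gaussian in the Frobenius norm).
    return np.exp(-gamma * np.sum((x - y) ** 2))

def layer(x, anchors, coeffs, gamma=0.01):
    # f_j(x) = sum_i k_j(x, x_i^{j-1}) c_{i,j} with k_j(x, y) = k_scalar(x, y) * I.
    return sum(k_scalar(x, xi, gamma) * ci for xi, ci in zip(anchors, coeffs))

def deep_rkhm(x, anchors_per_layer, coeffs_per_layer):
    # Composition f_L o ... o f_1; the anchors of layer j are x_i^{j-1}.
    for anchors, coeffs in zip(anchors_per_layer, coeffs_per_layer):
        x = layer(x, anchors, coeffs)
    return x

rng = np.random.default_rng(0)
n, d, L = 4, 6, 2
xs = [rng.standard_normal((d, d)) for _ in range(n)]                      # x_i = x_i^0
coeffs = [[0.1 * rng.standard_normal((d, d)) for _ in range(n)] for _ in range(L)]

anchors = [xs]                                                            # anchors[j-1] = [x_i^{j-1}]
for j in range(L - 1):
    anchors.append([layer(x, anchors[j], coeffs[j]) for x in anchors[j]])

x_new = rng.standard_normal((d, d))
print(deep_rkhm(x_new, anchors, coeffs).shape)                            # (d, d), an element of A
```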

Proposition 5.3    For j=1,Lj=1,L, let [ϕj(x1j1),,ϕj(xnj1)]Rj=Qj[\phi_{j}(x_{1}^{j-1}),\ldots,\phi_{j}(x_{n}^{j-1})]R_{j}=Q_{j} be the QR decomposition of [ϕj(x1j1),,ϕj(xnj1)][\phi_{j}(x_{1}^{j-1}),\ldots,\phi_{j}(x_{n}^{j-1})]. Then, we have PfL1Pf1|𝒱~(𝐱)op=RLGLR1op\|P_{f_{L-1}}\cdots P_{f_{1}}|_{\tilde{\mathcal{V}}(\mathbf{x})}\|_{\operatorname{op}}=\|R_{L}^{*}G_{L}R_{1}\|_{\operatorname{op}}.

Proof  The result is derived by the identities

PfL1Pf1|𝒱~(𝐱)op\displaystyle\|P_{f_{L-1}}\cdots P_{f_{1}}|_{\tilde{\mathcal{V}}(\mathbf{x})}\|_{\operatorname{op}} =QLPfL1Pf1Q1op=QLPfL1Pf1[ϕ1(x1),,ϕ1(xn)]R1op\displaystyle=\|Q_{L}^{*}P_{f_{L-1}}\cdots P_{f_{1}}Q_{1}\|_{\operatorname{op}}=\|Q_{L}^{*}P_{f_{L-1}}\cdots P_{f_{1}}[\phi_{1}(x_{1}),\ldots,\phi_{1}(x_{n})]R_{1}\|_{\operatorname{op}}
=RL[ϕL(x1L1),,ϕL(xnL1)][ϕL(x1L1),,ϕL(xnL1)]R1op\displaystyle=\|R_{L}^{*}[\phi_{L}(x_{1}^{L-1}),\ldots,\phi_{L}(x_{n}^{L-1})]^{*}[\phi_{L}(x_{1}^{L-1}),\ldots,\phi_{L}(x_{n}^{L-1})]R_{1}\|_{\operatorname{op}}
=RLGLR1op.\displaystyle=\|R_{L}^{*}G_{L}R_{1}\|_{\operatorname{op}}.

\Box

Proposition 5.4    Assume GLG_{L} is invertible. Then, we have

PfL1Pf1|𝒱~(𝐱)opGL1op1/2GLopG11op1/2.\displaystyle\|P_{f_{L-1}}\cdots P_{f_{1}}|_{\tilde{\mathcal{V}}(\mathbf{x})}\|_{\operatorname{op}}\leq\|G_{L}^{-1}\|^{1/2}_{\operatorname{op}}\|G_{L}\|_{\operatorname{op}}\|G_{1}^{-1}\|_{\operatorname{op}}^{1/2}.

Proof  Since GLG_{L} is invertible, by Proposition 5.3, RLGLR1op\|R_{L}^{*}G_{L}R_{1}\|_{\operatorname{op}} is bounded as

PfL1Pf1|𝒱~(𝐱)op=RLGLR1opRLopGLopR1op=GL1op1/2GLopG11op1/2,\displaystyle\|P_{f_{L-1}}\cdots P_{f_{1}}|_{\tilde{\mathcal{V}}(\mathbf{x})}\|_{\operatorname{op}}=\|R_{L}^{*}G_{L}R_{1}\|_{\operatorname{op}}\leq\|R_{L}\|_{\operatorname{op}}\|G_{L}\|_{\operatorname{op}}\|R_{1}\|_{\operatorname{op}}=\|G_{L}^{-1}\|^{1/2}_{\operatorname{op}}\|G_{L}\|_{\operatorname{op}}\|G_{1}^{-1}\|_{\operatorname{op}}^{1/2},

where the last equality is derived by the identity Rjop2=RjRjop=Gj1op\|R_{j}\|^{2}_{\operatorname{op}}=\|R_{j}R_{j}^{*}\|_{\operatorname{op}}=\|{G_{j}}^{-1}\|_{\operatorname{op}}. \Box
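Propositions 5.3 and 5.4 give a concrete way to evaluate and to bound the Perron–Frobenius term from the Gram matrices of the first and last layers. The sketch below uses the choice Rj=(Lj)1R_{j}=(L_{j}^{*})^{-1}, where Gj=LjLjG_{j}=L_{j}L_{j}^{*} is a Cholesky decomposition, so that RjGjRj=IR_{j}^{*}G_{j}R_{j}=I; the random positive definite matrices stand in for the actual Gram matrices and are placeholders.

```python
import numpy as np

def pf_norm_and_bound(G1, GL):
    # G1, GL: positive definite Gram matrices of the first and last layers,
    # viewed as (n*d) x (n*d) real matrices.
    L1 = np.linalg.cholesky(G1)                     # G1 = L1 L1^T
    LL = np.linalg.cholesky(GL)                     # GL = LL LL^T
    # With R_j = (L_j^T)^{-1}, we have R_j^T G_j R_j = I, and
    # R_L^T G_L R_1 = LL^{-1} GL (L1^T)^{-1}  (Proposition 5.3).
    M = np.linalg.solve(LL, GL) @ np.linalg.inv(L1.T)
    exact = np.linalg.norm(M, 2)                    # operator norm (largest singular value)
    bound = (np.linalg.norm(np.linalg.inv(GL), 2) ** 0.5
             * np.linalg.norm(GL, 2)
             * np.linalg.norm(np.linalg.inv(G1), 2) ** 0.5)   # Proposition 5.4
    return exact, bound

rng = np.random.default_rng(0)
N = 12
A = rng.standard_normal((N, N)); G1 = A @ A.T + 0.1 * np.eye(N)
B = rng.standard_normal((N, N)); GL = B @ B.T + 0.1 * np.eye(N)
exact, bound = pf_norm_and_bound(G1, GL)
print(exact <= bound + 1e-10)                       # the bound of Proposition 5.4 holds
```

Any other RjR_{j} coming from a valid QR decomposition yields the same operator norm, since two such choices differ only by a unitary factor.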

Appendix B Connection with Generalization Bound with Koopman Operators

Hashimoto et al. [18] derived a generalization bound for classical neural networks composed of linear transformations and activation functions using Koopman operators. The Koopman operator is the adjoint of the Perron–Frobenius operator. In their analysis, the final transformation fLf_{L} lies in an RKHS, and they observed that if the transformation of each layer is injective, the noise is separated through the linear transformations and cut off by fLf_{L}. Since the final transformation fLf_{L} in our case lies in an RKHM, the same interpretation of noise is valid for the deep RKHM. Indeed, if fL1f1f_{L-1}\circ\cdots\circ f_{1} maps two distinct training inputs to the same point, then GLG_{L} is not invertible, and the right-hand side of the inequality (2) diverges. The bound derived by Hashimoto et al. also diverges if the network is not injective.

Appendix C Experimental Details and Additional Results

We provide details for the experiments in Section 7. All the experiments were executed with Python 3.9 and TensorFlow 2.6 on an Intel(R) Core(TM) i9 CPU and an NVIDIA Quadro RTX 5000 GPU with CUDA 11.7.

C.1 Comparison with vvRKHS

For the positive definite kernels, we set kj(x,y)=k~(x,y)Ik_{j}(x,y)=\tilde{k}(x,y)I for j=1,,3j=1,\ldots,3 with k~(x,y)=eci,j=1d|xi,jyi,j|\tilde{k}(x,y)=\mathrm{e}^{-c\sum_{i,j=1}^{d}|x_{i,j}-y_{i,j}|} and c=0.001c=0.001. For the optimizer, we used SGD with learning rate 10410^{-4} for both the deep RKHM and the deep vvRKHS. The initial value of ci,jc_{i,j} is set as ai,j+ϵi,ja_{i,j}+\epsilon_{i,j}, where ai,j𝒜ja_{i,j}\in\mathcal{A}_{j} is the block matrix all of whose elements are 0.10.1 and ϵi,j\epsilon_{i,j} is randomly drawn from 𝒩(0,0.05)\mathcal{N}(0,0.05). The result in Figure 2 is obtained from 5 independent runs.
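For concreteness, a minimal sketch of this kernel is given below; the dimension and the inputs are placeholders, and the training code is not reproduced.

```python
import numpy as np

def k_tilde(x, y, c=0.001):
    # Scalar kernel exp(-c * sum_{i,j} |x_{i,j} - y_{i,j}|) for d x d inputs.
    return np.exp(-c * np.sum(np.abs(x - y)))

def k_layer(x, y, c=0.001):
    # A-valued kernel of each layer: k_j(x, y) = k_tilde(x, y) * I.
    d = x.shape[0]
    return k_tilde(x, y, c) * np.eye(d)

rng = np.random.default_rng(0)
x, y = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
print(np.allclose(k_layer(x, x), np.eye(4)))     # k_tilde(x, x) = 1, so k_j(x, x) = I
```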

C.2 Observation about benign overfitting

We set the same positive definite kernel as in Subsection C.1 for j=1,2j=1,2. For the optimizer, we used SGD with learning rate 3×1043\times 10^{-4}. The initial value of ci,jc_{i,j} is the same as in Subsection C.1. The result in Figure 2 is obtained from 3 independent runs.

C.3 Comparison to CNN

For the positive definite kernels, we set kj(x,y)=k~(xaj,yaj)xyk_{j}(x,y)=\tilde{k}(xa_{j},ya_{j})xy^{*} for j=1,2j=1,2, where a1a_{1} and a2a_{2} are block matrices whose block sizes are 22 and 44, respectively. All the nonzero elements of aja_{j} are set as 11; a1a_{1} and a2a_{2} are chosen to induce interactions among the block elements (see Remark 5.2). For the optimizer, we used Adam with learning rate 10310^{-3} for both the deep RKHM and the CNN. The initial value of ci,jc_{i,j} is set as ϵi,j\epsilon_{i,j}, where ϵi,j\epsilon_{i,j} is randomly drawn from 𝒩(0,0.1)\mathcal{N}(0,0.1).
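A minimal sketch of this kernel is given below. It assumes that aja_{j} is block diagonal with all-ones blocks and that k~\tilde{k} is the same scalar kernel as in Subsection C.1; both are assumptions made only for illustration.

```python
import numpy as np

d = 28                                      # MNIST images regarded as 28 x 28 matrices

def block_ones(block_size, dim=d):
    # a_j: assumed block diagonal with all-ones blocks of the given size,
    # inducing interactions within each block (Remark 5.2).
    return np.kron(np.eye(dim // block_size), np.ones((block_size, block_size)))

def k_tilde(x, y, c=0.001):
    # Assumed to be the same scalar kernel as in Subsection C.1.
    return np.exp(-c * np.sum(np.abs(x - y)))

def k_layer(x, y, a):
    # k_j(x, y) = k_tilde(x a_j, y a_j) x y^*.
    return k_tilde(x @ a, y @ a) * (x @ y.T)

a1, a2 = block_ones(2), block_ones(4)
rng = np.random.default_rng(0)
x, y = rng.standard_normal((d, d)), rng.standard_normal((d, d))
print(k_layer(x, y, a1).shape)              # (28, 28), an element of A
```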

We combined both the deep RKHM and the CNN with 22 dense layers, whose activation functions are sigmoid and softmax, respectively. For the CNN layers, we also used sigmoid activation functions. The loss function for the CNN is set as the categorical cross-entropy. The result in Figure 2 is obtained from 5 independent runs.

C.3.1 Memory consumption and computational cost

Memory consumption

We used a CNN with (7×7)(7\times 7) and (4×4)(4\times 4)-filters. On the other hand, for the deep RKHM, we learned the coefficients ci,jc_{i,j} in Proposition 5.1 for j=1,2j=1,2 and i=1,,ni=1,\ldots,n. That is, we learned the following coefficients:

  • n(=20)n(=20) block diagonal matrices, each of which has four (7×7)(7\times 7)-blocks (for the first layer)

  • nn block diagonal matrices, each of which has seven (4×4)(4\times 4)-blocks (for the second layer)

Thus, the number of parameters to learn is larger for the deep RKHM than for the CNN. Since memory consumption depends on the number of learning parameters, the deep RKHM consumes more memory than the CNN.

Computational cost

In each iteration, we compute f(xi)f(x_{i}) for i=1,,ni=1,\ldots,n and the derivative of ff with respect to the learning parameters. Here, f=f2f1f=f_{2}\circ f_{1} is the network. For the deep RKHM, computing fj(xi)f_{j}(x_{i}) for all i=1,,ni=1,\ldots,n amounts to the product of a Gram matrix (composed of n×nn\times n block diagonal matrices) and a vector (composed of nn block diagonal matrices). Thus, the computational cost for computing f(xi)f(x_{i}) for all i=1,,ni=1,\ldots,n is O(n2d(m1+m2))O(n^{2}d(m_{1}+m_{2})), where m1=7m_{1}=7 and m2=4m_{2}=4 are the sizes of the block diagonal matrices. For the CNN, the computational cost for computing f(xi)f(x_{i}) for all i=1,,ni=1,\ldots,n is O(nd2(l1+l2))O(nd^{2}(l_{1}+l_{2})), where l1=7×7l_{1}=7\times 7 and l2=4×4l_{2}=4\times 4 are the numbers of elements in the filters. Since we set n=20n=20 and d=28d=28, the cost of the deep RKHM for computing f(xi)f(x_{i}) for i=1,,ni=1,\ldots,n is smaller than that of the CNN (see the short calculation below). However, since the number of learning parameters of the deep RKHM is larger, the computational cost for computing the derivative of f(xi)f(x_{i}) is larger than that of the CNN.
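Plugging the values above into the two cost estimates gives the following comparison (the constants hidden by the OO-notation are ignored):

```python
n, d = 20, 28
m1, m2 = 7, 4           # block sizes of the deep RKHM coefficients
l1, l2 = 7 * 7, 4 * 4   # numbers of filter elements in the CNN

deep_rkhm_cost = n ** 2 * d * (m1 + m2)   # 20^2 * 28 * 11 = 123,200
cnn_cost = n * d ** 2 * (l1 + l2)         # 20 * 28^2 * 65 = 1,019,200
print(deep_rkhm_cost, cnn_cost, deep_rkhm_cost < cnn_cost)
```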

Additional results

To compare the deep RKHM to a CNN with the same number of learning parameters (and hence the same memory consumption), we conducted the same experiment as the last experiment in the main text, except for the structure of the CNN. We constructed a CNN with (28×720)(28\times 7\cdot 20) and (28×420)(28\times 4\cdot 20)-filters, whose number of learning parameters is the same as that of the deep RKHM, and replaced the CNN used in the experiment in the main text with it. Figure 3 shows the result. The result is similar to that in Figure 2 (c), and the deep RKHM also outperforms the CNN with the same number of learning parameters. Since the number of learning parameters is the same in this case, the computational cost of one iteration for learning the deep RKHM is the same as or smaller than that for learning the CNN.

Figure 3: Comparison of the deep RKHM to a CNN with the same number of learning parameters.

C.3.2 Additional results for benign overfitting

We also observed benign overfitting for the deep RKHM. We compared the train and test losses for the deep RKHM with λ1=1\lambda_{1}=1 to those with λ1=0\lambda_{1}=0. The results are illustrated in Figure 4. If we set λ1=0\lambda_{1}=0 in Eq. (1), i.e., do not minimize the term in Eq. (2), then the training loss becomes small as the learning process proceeds, whereas the test loss becomes large after sufficiently many iterations. On the other hand, if we set λ1=1\lambda_{1}=1, i.e., minimize the term in Eq. (2), then the training loss becomes smaller than in the case of λ1=0\lambda_{1}=0, and the test loss does not become large even as the learning process proceeds. The result implies that the term regarding the Perron–Frobenius operators in Eq. (1) makes the model fit the training data more strongly, but the resulting overfitting is benign.

Figure 4: Train and test losses for the MNIST classification task with (λ1=1\lambda_{1}=1) and without (λ1=0\lambda_{1}=0) the minimization of the second term in Eq. (1) regarding the Perron–Frobenius operators.