
On the Equivalence between Neural Network and Support Vector Machine

Yilan Chen
Computer Science and Engineering
University of California San Diego
La Jolla, CA
[email protected]
Wei Huang
Engineering and Information Technology
University of Technology Sydney
Ultimo, Australia
[email protected]
Lam M. Nguyen
IBM Research
Thomas J. Watson Research Center
Yorktown Heights, NY
[email protected]
Tsui-Wei Weng
Halıcıoğlu Data Science Institute
University of California San Diego
La Jolla, CA
[email protected]
Abstract

Recent research shows that the dynamics of an infinitely wide neural network (NN) trained by gradient descent can be characterized by the Neural Tangent Kernel (NTK) [27]. Under the squared loss, the infinite-width NN trained by gradient descent with an infinitely small learning rate is equivalent to kernel regression with NTK [4]. However, the equivalence is currently only known for ridge regression [6], while the equivalence between NN and other kernel machines (KMs), e.g. the support vector machine (SVM), remains unknown. Therefore, in this work, we propose to establish the equivalence between NN and SVM, and specifically, between the infinitely wide NN trained by soft margin loss and the standard soft margin SVM with NTK trained by subgradient descent. Our main theoretical results include establishing the equivalences between NNs and a broad family of \ell_{2} regularized KMs with finite-width bounds, which cannot be handled by prior work, and showing that every finite-width NN trained by such regularized loss functions is approximately a KM. Furthermore, we demonstrate that our theory enables three practical applications, including (i) a non-vacuous generalization bound for the NN via the corresponding KM; (ii) a non-trivial robustness certificate for the infinite-width NN (while existing robustness verification methods would provide vacuous bounds); (iii) intrinsically more robust infinite-width NNs than those from previous kernel regression. Our code for the experiments is available at https://github.com/leslie-CH/equiv-nn-svm.

1 Introduction

Recent research has made some progress towards deep learning theory from the perspective of infinite-width neural networks (NNs). A fully-trained infinite-width NN follows kernel gradient descent in function space with respect to the Neural Tangent Kernel (NTK) [27]. Under this linear regime and squared loss, it is proved that the fully-trained NN is equivalent to kernel regression with NTK [27, 4], which gives the generalization ability of such a model [5]. NTK helps us understand the optimization [27, 19] and generalization [5, 12] of NNs through the perspective of kernels. However, existing theories about NTK [27, 31, 4, 14] usually assume the loss is a function of the model output, which does not include the case of regularization. Besides, they usually consider the squared loss that corresponds to a kernel regression, which may offer limited insight into classification problems, since squared loss and kernel regression are usually used for regression.

On the other hand, another popular machine learning paradigm with a solid theoretical foundation before the prevalence of deep neural networks is the support vector machine (SVM) [10, 16], which allows learning linear classifiers in high dimensional feature spaces. SVM tackles the sample complexity challenge by searching for large margin separators and tackles the computational complexity challenge using the idea of kernels [46]. Learning an SVM model usually involves solving a dual problem cast as a convex quadratic program. Recently, algorithms based on subgradient descent [47] and coordinate descent [23] have been proposed to further scale SVM models to large datasets and high dimensional feature spaces.

We note that existing theoretical analysis mostly focused on connecting NN with kernel regression [27, 4, 31], but the connections between NN and SVM have not yet been explored. In this work, we establish the equivalence between NN and SVM for the first time to the best of our knowledge. More broadly, we show that our analysis can connect NNs with a family of \ell_{2} regularized KMs, including kernel ridge regression (KRR), support vector regression (SVR) and \ell_{2} regularized logistic regression, which previous results [27, 4, 31] cannot handle. These are the first equivalences beyond ridge regression. Importantly, the equivalences between infinite-width NNs and these \ell_{2} regularized KMs may shed light on the understanding of NNs from these new equivalent KMs [17, 48, 45, 52], especially towards understanding the training, generalization, and robustness of NNs for classification problems. Besides, regularization plays an important role in machine learning by restricting model complexity, and these equivalences may shed light on the understanding of regularization for NNs. We highlight our contributions as follows:

  • We derive the continuous (gradient flow) and discrete dynamics of SVM trained by subgradient descent and the dynamics of NN trained by soft margin loss. We show that the dynamics of SVM with NTK and NN are exactly the same in the infinite width limit because of the constancy of the tangent kernel, and thus establish the equivalence. We show the same linear convergence rate for SVM and NN under a reasonable assumption. We verify the equivalence by experiments with subgradient descent and stochastic subgradient descent on the MNIST dataset [29].

  • We generalize our theory to general loss functions with \ell_{2} regularization and establish the equivalences between NNs and a family of \ell_{2} regularized KMs as summarized in Table 1. We prove that the difference between the outputs of the NN and the KM scales as O(\ln{m}/(\lambda\sqrt{m})), where \lambda is the coefficient of the regularization and m is the width of the NN. Additionally, we show that every finite-width neural network trained by an \ell_{2} regularized loss function is approximately a KM.

  • We show that our theory offers three practical benefits: (i) computing a non-vacuous generalization bound of the NN via the corresponding KM; (ii) delivering a nontrivial robustness certificate for the over-parameterized NN (with width m\rightarrow\infty), while existing robustness verification methods would give trivial robustness certificates due to bound propagation [22, 57, 60]. In particular, the certificate decreases at a rate of O(1/\sqrt{m}) as the width of the NN increases; (iii) showing that the equivalent infinite-width NNs trained from our \ell_{2} regularized KMs are more robust than the equivalent NN trained from previous kernel regression [27, 4] (see Table 3), which is perhaps not too surprising as regularization has a strong connection to robust machine learning.

2 Related Works and Background

2.1 Related Works

Neural Tangent Kernel and dynamics of neural networks. NTK was first introduced in [27] and extended to the Convolutional NTK [4] and Graph NTK [21]. [26] studied the NTK of orthogonal initialization. [6] reported strong performance of NTK on small-data tasks both for kernel regression and kernel SVM. However, the equivalence is currently only known for ridge regression, not for SVM and other KMs. A line of recent work [20, 1] proved the convergence of (convolutional) neural networks with large but finite width in a non-asymptotic way by showing the weights do not move far away from initialization in the optimization dynamics (trajectory). [31] showed the dynamics of wide neural networks are governed by a linear model given by the first-order Taylor expansion around the initial parameters. However, existing theories about NTK [27, 31, 4] usually assume the loss is a function of the model output, which does not include the case of regularization. Besides, they usually consider the squared loss that corresponds to a kernel regression, which may offer limited insight into classification problems, since squared loss and kernel regression are usually used for regression. In this paper, we study regularized loss functions and establish equivalences with KMs beyond kernel regression and regression problems.

Besides, we study the robustness of NTK models. [24] studied label noise (the labels are generated by a ground truth function plus Gaussian noise) while we consider robustness to input perturbations. They study the convergence rate of an NN trained by \ell_{2} regularized squared loss to an underlying true function, while we give explicit robustness certificates for NNs. Our robustness certificate enables us to compare different models and to show that the equivalent infinite-width NNs trained from our \ell_{2} regularized KMs are more robust than the equivalent NN trained from previous kernel regression.

Neural network and support vector machine. Prior works [54, 53, 35, 50, 32] have explored the benefits of encouraging large margin in the context of deep networks. [15] introduced a new family of positive-definite kernel functions that mimic the computation in multi-layer NNs and applied the kernels to SVM. [18] showed that NNs trained by gradient flow are approximately "kernel machines" with the weights and bias as functions of input data, which however can be much more complex than a typical kernel machine. [47] proposed a subgradient algorithm to solve the primal problem of SVM, which can obtain a solution of accuracy \epsilon in \tilde{O}(1/\epsilon) iterations, where \tilde{O} omits the logarithmic factors. In this paper, we also consider the SVM trained by subgradient descent and connect it with the NN trained by subgradient descent. [49, 3] studied the connection between SVM and regularization neural networks [44], one-hidden-layer NNs whose structure is very similar to that of KMs and which are not widely used in practice. NNs used in practice now (e.g. fully connected ReLU NNs, CNNs, ResNets) do not have such structures. [43] analyzed a two-layer NN trained by hinge loss without regularization on a linearly separable dataset. Note that an SVM must have a regularization term in order to achieve the max-margin solution.

Implicit bias of neural network towards max-margin solution. There is also a line of research on the implicit bias of neural networks towards the max-margin (hard margin SVM) solution under different settings and assumptions [51, 28, 13, 56, 40, 37]. Our paper complements this line of research by establishing an exact equivalence between the infinite-width NN and SVM. Our equivalences include not only hard margin SVM but also soft margin SVM and other \ell_{2} regularized KMs.

2.2 Neural Networks and Tangent Kernel

We consider a general form of deep neural network f with a linear output layer as in [33]. Let [L]=\{1,\dots,L\}; \forall l\in[L],

\alpha^{(0)}(w,x)=x,\ \alpha^{(l)}(w,x)=\phi_{l}(w^{(l)},\alpha^{(l-1)}),\ f(w,x)=\frac{1}{\sqrt{m_{L}}}\langle w^{(L+1)},\alpha^{(L)}(w,x)\rangle, (1)

where each vector-valued function \phi_{l}(w^{(l)},\cdot):\mathbb{R}^{m_{l-1}}\rightarrow\mathbb{R}^{m_{l}}, with parameter w^{(l)}\in\mathbb{R}^{p_{l}} (p_{l} is the number of parameters), is considered as a layer of the network. This definition includes the standard fully connected, convolutional (CNN), and residual (ResNet) neural networks as special cases. For a fully connected ReLU NN, \alpha^{(l)}(w,x)=\sigma(\frac{1}{\sqrt{m_{l-1}}}w^{(l)}\alpha^{(l-1)}) with w^{(l)}\in\mathbb{R}^{m_{l}\times m_{l-1}} and \sigma(z)=\max(0,z).
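To make Eq. (1) concrete, a minimal PyTorch sketch of a fully connected ReLU network under this parameterization is shown below (the class name, depth, and widths are illustrative choices, not part of the paper's code): weights are drawn from a standard Gaussian and each layer rescales its pre-activations by 1/\sqrt{m_{l-1}}.

```python
import torch
import torch.nn as nn

class NTKReLUNet(nn.Module):
    """Fully connected ReLU network in NTK parameterization, following Eq. (1)."""
    def __init__(self, d_in, widths=(512, 512)):  # widths chosen only for illustration
        super().__init__()
        dims = (d_in,) + tuple(widths)
        # w^{(l)} ~ N(0, 1); the 1/sqrt(m_{l-1}) factor is applied in forward().
        self.hidden = nn.ParameterList(
            [nn.Parameter(torch.randn(dims[l + 1], dims[l])) for l in range(len(widths))]
        )
        self.w_out = nn.Parameter(torch.randn(widths[-1]))  # w^{(L+1)}

    def forward(self, x):
        alpha = x
        for w in self.hidden:
            alpha = torch.relu(alpha @ w.t() / w.shape[1] ** 0.5)
        # f(w, x) = <w^{(L+1)}, alpha^{(L)}(w, x)> / sqrt(m_L)
        return alpha @ self.w_out / self.w_out.shape[0] ** 0.5
```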

Initialization and parameterization. In this paper, we consider the NTK parameterization [27], under which the constancy of the tangent kernel was initially observed. Specifically, the parameters w:=\{w^{(1)};w^{(2)};\cdots;w^{(L)};w^{(L+1)}\} are drawn i.i.d. from a standard Gaussian \mathcal{N}(0,1) at initialization, denoted as w_{0}. The factor 1/\sqrt{m_{L}} in the output layer is required by the NTK parameterization so that the output f is of order O(1). While we only consider the NTK parameterization here, the results should extend to general parameterizations of the kernel regime [58].

Definition 2.1 (Tangent kernel [27]).

The tangent kernel associated with function f(w,x) at some parameter w is \hat{\Theta}(w;x,x^{\prime})=\langle\nabla_{w}f(w,x),\nabla_{w}f(w,x^{\prime})\rangle. Under certain conditions (usually the infinite width limit and NTK parameterization), the tangent kernel at initialization converges in probability to a deterministic limit and stays constant during training, \hat{\Theta}(w;x,x^{\prime})\rightarrow\Theta_{\infty}(x,x^{\prime}). This limiting kernel is called the Neural Tangent Kernel (NTK).
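As an illustration of Definition 2.1, the empirical tangent kernel of any differentiable model can be estimated by pairing per-example parameter gradients. The sketch below is a minimal helper under that assumption (it takes any PyTorch model with a scalar output, e.g. the NTKReLUNet sketch above), not the paper's implementation.

```python
import torch

def empirical_tangent_kernel(model, x1, x2):
    """Theta_hat(w; x, x') = <grad_w f(w, x), grad_w f(w, x')>, computed pairwise."""
    params = [p for p in model.parameters() if p.requires_grad]

    def grad_vec(x):
        # Gradient of the scalar output f(w, x) with respect to all parameters, flattened.
        grads = torch.autograd.grad(model(x.unsqueeze(0)).squeeze(), params)
        return torch.cat([g.reshape(-1) for g in grads])

    g1 = torch.stack([grad_vec(x) for x in x1])  # (n1, num_params)
    g2 = torch.stack([grad_vec(x) for x in x2])  # (n2, num_params)
    return g1 @ g2.t()                           # (n1, n2) kernel matrix
```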

2.3 Kernel Machines

A kernel machine (KM) is a model of the form g(\beta,x)=\varphi(\langle\beta,\Phi(x)\rangle+b), where \beta is the model parameter and \Phi is a mapping from the input space to some feature space, \Phi:\mathcal{X}\rightarrow\mathcal{F}. \varphi is an optional nonlinear function, such as the identity mapping for kernel regression and sign(\cdot) for SVM and logistic regression. The kernel can be exploited whenever the weight vector can be expressed as a linear combination of the training points, \beta=\sum_{i=1}^{n}\alpha_{i}\Phi(x_{i}) for some values of \alpha_{i}, i\in[n], implying that we can express g as g(x)=\varphi(\sum_{i=1}^{n}\alpha_{i}K(x,x_{i})+b), where K(x,x_{i})=\langle\Phi(x),\Phi(x_{i})\rangle is the kernel function. For a NN in the NTK regime, we have f(w_{t},x)\approx f(w_{0},x)+\langle\nabla_{w}f(w_{0},x),w_{t}-w_{0}\rangle, which makes the NN linear in the gradient feature mapping x\rightarrow\nabla_{w}f(w_{0},x). Under squared loss, it is equivalent to kernel regression with \Phi(x)=\nabla_{w}f(w_{0},x) (or equivalently using NTK as the kernel), \beta=w_{t}-w_{0} and \varphi the identity mapping [4].

As far as we know, there is no work establishing the equivalence between fully trained NN and SVM. [18] showed that NNs trained by gradient flow are approximately "kernel machines" with the weights and bias as functions of input data, which however can be much more complex than a typical kernel machine. In this work, we compare the dynamics of SVM and NN trained by subgradient descent with soft margin loss and show the equivalence between them in the infinite width limit.

2.4 Subgradient Optimization of Support Vector Machine

We first formally define the standard soft margin SVM and then show how subgradient descent can be applied to obtain an approximate solution of the SVM primal problem. For simplicity, we consider the homogeneous model, g(\beta,x)=\langle\beta,\Phi(x)\rangle. (One can always handle the bias term b by augmenting each sample with an additional dimension, \Phi(x)^{T}\leftarrow[\Phi(x)^{T},1], \beta^{T}\leftarrow[\beta^{T},1].)

Definition 2.2 (Soft margin SVM).

Given labeled samples \{(x_{i},y_{i})\}_{i=1}^{n} with y_{i}\in\{-1,+1\}, the hyperplane \beta^{*} that solves the optimization problem below realizes the soft margin classifier with geometric margin \gamma=2/\lVert\beta^{*}\rVert.

\min_{\beta,\xi}\ \frac{1}{2}\lVert\beta\rVert^{2}+C\sum_{i=1}^{n}\xi_{i},\quad\text{s.t.}\ y_{i}\langle\beta,\Phi(x_{i})\rangle\geq 1-\xi_{i},\ \xi_{i}\geq 0,\ i\in[n],
Proposition 2.1.

The above primal problem of soft margin SVM can be equivalently formulated as

\min_{\beta}\frac{1}{2}\lVert\beta\rVert^{2}+C\sum_{i=1}^{n}\max(0,1-y_{i}\langle\beta,\Phi(x_{i})\rangle), (2)

where the second term is a hinge loss. Denote this function as L(\beta), which is strongly convex in \beta.

From this, we see that the SVM technique is equivalent to empirical risk minimization with \ell_{2} regularization, where in this case the loss function is the nonsmooth hinge loss. The classical approaches usually consider the dual problem of SVM and solve it as a quadratic programming problem. Some recent algorithms, however, use subgradient descent [47] to optimize Eq. (2), which shows significant advantages when dealing with large datasets.

In this paper, we consider the soft margin SVM trained by subgradient descent with L(\beta). We use the subgradient \nabla_{\beta}L(\beta)=\beta-C\sum_{i=1}^{n}\mathbbm{1}(y_{i}g(\beta,x_{i})<1)y_{i}\Phi(x_{i}), where \mathbbm{1}(\cdot) is the indicator function. As proved in [47], we can find a solution of accuracy \epsilon, i.e. L(\beta)-L(\beta^{*})\leq\epsilon, in \tilde{O}(1/\epsilon) iterations. Other works also give convergence guarantees for subgradient descent of convex functions [11, 9]. In the following analysis, we will generally assume the convergence of SVM trained by subgradient descent.
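For concreteness, a minimal sketch of this primal subgradient update for Eq. (2) with an explicit (finite-dimensional) feature map is given below; the feature matrix, learning rate, and step count are placeholders, not values used in the paper.

```python
import numpy as np

def svm_subgradient_descent(Phi, y, C=1.0, lr=0.01, steps=1000):
    """Minimize 0.5*||beta||^2 + C*sum_i max(0, 1 - y_i <beta, Phi(x_i)>) by subgradient descent.

    Phi: (n, p) matrix whose rows are explicit features Phi(x_i); y: (n,) labels in {-1, +1}.
    """
    n, p = Phi.shape
    beta = np.zeros(p)
    for _ in range(steps):
        margins = y * (Phi @ beta)
        active = (margins < 1).astype(float)       # indicator 1(y_i g(beta, x_i) < 1)
        subgrad = beta - C * Phi.T @ (active * y)  # subgradient of L(beta)
        beta -= lr * subgrad
    return beta
```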

3 Main Theoretical Results

In this section, we describe our main results. We first derive the continuous (gradient flow) and discrete dynamics of SVM trained by subgradient descent (in Section 3.1) and the dynamics of NN trained by soft margin loss (in Section 3.2 and Section 3.3). We show that they have similar dynamics, characterized by an inhomogeneous linear differential (difference) equation, and have the same convergence rate under a reasonable assumption. Next, we show that their dynamics are exactly the same in the infinite width limit because of the constancy of the tangent kernel, and thus establish the equivalence (Theorem 3.4). Furthermore, in Section 3.4, we generalize our theory to general loss functions with \ell_{2} regularization and establish the equivalences between NNs and a family of \ell_{2} regularized KMs as summarized in Table 1.

3.1 Dynamics of Soft Margin SVM

For simplicity, we denote \beta_{t} as \beta at some time t and g_{t}(x)=g(\beta_{t},x). The proofs of the following two theorems are detailed in Appendix C.

Theorem 3.1 (Continuous dynamics and convergence rate of SVM).

Consider training the soft margin SVM by subgradient descent with an infinitesimally small learning rate (gradient flow [2]): \frac{d\beta_{t}}{dt}=-\nabla_{\beta}L(\beta_{t}). The model g_{t}(x) follows the evolution below:

\frac{dg_{t}(x)}{dt}=-g_{t}(x)+C\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{t}(x_{i})<1)y_{i}K(x,x_{i}), (3)

and has a linear convergence rate:

L(\beta_{t})-L(\beta^{*})\leq e^{-2t}\left(L(\beta_{0})-L(\beta^{*})\right).

Denote Q(t)=C\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{t}(x_{i})<1)y_{i}K(x,x_{i}), which changes over time until convergence. The model output at some time T is

g_{T}(x)=e^{-T}\Bigl(g_{0}(x)+\int_{0}^{T}Q(t)e^{t}\,dt\Bigr),\quad\lim_{T\rightarrow\infty}g_{T}(x)=C\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{T}(x_{i})<1)y_{i}K(x,x_{i}). (4)

The continuous dynamics of SVM is described by an inhomogeneous linear differential equation (Eq. (3)), which admits an analytical solution. From Eq. (4), we can see that the influence of the initial model g_{0}(x) decreases as time T\rightarrow\infty and eventually disappears.

Theorem 3.2 (Discrete dynamics of SVM).

Let \eta\in(0,1) be the learning rate. The dynamics of subgradient descent is

g_{t+1}(x)-g_{t}(x)=-\eta g_{t}(x)+\eta C\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{t}(x_{i})<1)y_{i}K(x,x_{i}). (5)

Denote Q(t)=\eta C\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{t}(x_{i})<1)y_{i}K(x,x_{i}), which changes over time. The model output at some time T is

g_{T}(x)=(1-\eta)^{T}\Bigl(g_{0}(x)+\sum_{t=0}^{T-1}(1-\eta)^{-t-1}Q(t)\Bigr),\quad\lim_{T\rightarrow\infty}g_{T}(x)=C\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{T}(x_{i})<1)y_{i}K(x,x_{i}).

The discrete dynamics is characterized by an inhomogeneous linear difference equation (Eq. (5)). The discrete dynamics and solution of SVM have a similar structure to the continuous case.

3.2 Soft Margin Neural Network

We first formally define the soft margin neural network and then derive the dynamics of training a neural network by subgradient descent with a soft margin loss. We consider a neural network defined as in Eq. (1). For convenience, we redefine f(w,x)=\langle W^{(L+1)},\alpha^{(L)}(w,x)\rangle with W^{(L+1)}=\frac{1}{\sqrt{m_{L}}}w^{(L+1)} and w:=\{w^{(1)};w^{(2)};\cdots;w^{(L)};W^{(L+1)}\}.

Definition 3.1 (Soft margin neural network).

Given samples \{(x_{i},y_{i})\}_{i=1}^{n}, y_{i}\in\{-1,+1\}, the neural network w^{*} defined as in Eq. (1) that solves the following two equivalent optimization problems

\min_{w,\xi}\frac{1}{2}\lVert W^{(L+1)}\rVert^{2}+C\sum_{i=1}^{n}\xi_{i},\quad\text{s.t.}\ y_{i}f(w,x_{i})\geq 1-\xi_{i},\ \xi_{i}\geq 0,\ i\in[n],
\min_{w}\frac{1}{2}\lVert W^{(L+1)}\rVert^{2}+C\sum_{i=1}^{n}\max(0,1-y_{i}f(w,x_{i})), (6)

realizes the soft margin classifier with geometric margin \gamma=2/\lVert W_{*}^{(L+1)}\rVert. Denote Eq. (6) as L(w) and call it the soft margin loss.

This is generally a hard nonconvex optimization problem, but we can apply subgradient descent to optimize it heuristically. At initialization, \lVert W_{0}^{(L+1)}\rVert^{2}=O(1). The derivative of the regularization with respect to w^{(L+1)} is w^{(L+1)}/\sqrt{m_{L}}=O(1/\sqrt{m_{L}})\rightarrow 0 as m\rightarrow\infty. For a fixed \alpha^{(L)}(w,x), this problem is the same as SVM with \Phi(x)=\alpha^{(L)}(w,x), kernel K(x,x^{\prime})=\alpha^{(L)}(w,x)\cdot\alpha^{(L)}(w,x^{\prime}) and parameter \beta=W^{(L+1)}. If we only train the last layer of an infinite-width soft margin NN, it corresponds to an SVM with an NNGP kernel [30, 38]. But for a fully-trained NN, \alpha^{(L)}(w,x) changes over time.

3.3 Dynamics of Neural Network Trained by Soft Margin Loss

Denote the hinge loss in L(w) as L_{h}(y_{i},f(w,x_{i}))=C\max(0,1-y_{i}f(w,x_{i})). We use the same subgradient as that for SVM, L_{h}^{\prime}(y_{i},f(w,x_{i}))=-Cy_{i}\mathbbm{1}(y_{i}f(w,x_{i})<1).
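A minimal PyTorch sketch of training with the soft margin loss in Eq. (6) might look like the code below. It reuses the hypothetical NTKReLUNet from Section 2.2, regularizes only the (rescaled) output-layer weights, and lets plain SGD play the role of subgradient descent, since autograd returns a valid subgradient of the hinge loss; the data loader, C, and learning rate are placeholders.

```python
import torch

def soft_margin_loss(model, x, y, C=1.0):
    """L(w) = 0.5 * ||W^{(L+1)}||^2 + C * sum_i max(0, 1 - y_i f(w, x_i)), as in Eq. (6)."""
    f = model(x)                                     # f(w, x_i), shape (n,)
    hinge = torch.clamp(1.0 - y * f, min=0.0).sum()
    # W^{(L+1)} = w^{(L+1)} / sqrt(m_L); only this rescaled output layer is regularized.
    reg = 0.5 * (model.w_out / model.w_out.shape[0] ** 0.5).pow(2).sum()
    return reg + C * hinge

# Illustrative training loop (model, data loader, and hyperparameters are assumptions):
# model = NTKReLUNet(d_in=784)
# opt = torch.optim.SGD(model.parameters(), lr=0.1)
# for x_batch, y_batch in loader:                    # y_batch in {-1, +1}
#     opt.zero_grad()
#     soft_margin_loss(model, x_batch, y_batch).backward()
#     opt.step()
```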

Theorem 3.3 (Continuous dynamics and convergence rate of NN).

Suppose a NN f(w,x) defined as in Eq. (1), with f a differentiable function of w, is learned from a training set \{(x_{i},y_{i})\}_{i=1}^{n} by subgradient descent with L(w) and gradient flow. Then the network has the following dynamics:

\frac{df_{t}(x)}{dt}=-f_{t}(x)+C\sum_{i=1}^{n}\mathbbm{1}(y_{i}f_{t}(x_{i})<1)y_{i}\hat{\Theta}(w_{t};x,x_{i}).

Let \hat{\Theta}(w_{t})\in\mathbb{R}^{n\times n} be the tangent kernel evaluated on the training set and \lambda_{min}(\hat{\Theta}(w_{t})) be its minimum eigenvalue. Assume \lambda_{min}(\hat{\Theta}(w_{t}))\geq\frac{2}{C}; then the NN has at least a linear convergence rate, the same as SVM:

L(w_{t})-L(w^{*})\leq e^{-2t}\left(L(w_{0})-L(w^{*})\right).

The proof is in Appendix D. The key observation is that when deriving the dynamics of f_{t}(x), the \frac{1}{2}\lVert W^{(L+1)}\rVert^{2} term in the loss function produces the -f_{t}(x) term and the hinge loss produces the tangent kernel term, which overall gives a dynamics similar to that of SVM. Compared with the previous continuous dynamics without regularization [27, 31], our result has the extra -f_{t}(x) term because of the regularization term of the loss function. The convergence rate is proved based on a sufficient condition for the PL inequality. The assumption \lambda_{min}(\hat{\Theta}(w_{t}))\geq\frac{2}{C} can be guaranteed in a parameter ball when \lambda_{min}(\hat{\Theta}(w_{0}))>\frac{2}{C}, by using a sufficiently wide NN [34].

If the tangent kernel \hat{\Theta}(w_{t};x,x_{i}) is fixed, \hat{\Theta}(w_{t};x,x_{i})\rightarrow\hat{\Theta}(w_{0};x,x_{i}), the dynamics of the NN is the same as that of SVM (Eq. (3)) with kernel \hat{\Theta}(w_{0};x,x_{i}), assuming the neural network and SVM have the same initial output g_{0}(x)=f_{0}(x). (This can be done by setting the initial values to 0, i.e. g_{0}(x)=f_{0}(x)=0.) This constancy of the tangent kernel holds for infinitely wide neural networks of common architectures, and it does not depend on the optimization algorithm or the choice of loss function, as discussed in [33].

Assumptions. We assume that the (vector-valued) layer functions \phi_{l}(w,\alpha), l\in[L], are L_{\phi}-Lipschitz continuous and twice differentiable with respect to the input \alpha and the parameters w. These assumptions serve the following theorem to show the constancy of the tangent kernel.

Theorem 3.4 (Equivalence between NN and SVM).

As the minimum width of the NN, m=\min_{l\in[L]}m_{l}, goes to infinity, the tangent kernel tends to a constant, \hat{\Theta}(w_{t};x,x_{i})\rightarrow\hat{\Theta}(w_{0};x,x_{i}). Assume g_{0}(x)=f_{0}(x). Then the infinitely wide NN trained by subgradient descent with soft margin loss has the same dynamics as the SVM with kernel \hat{\Theta}(w_{0};x,x_{i}) trained by subgradient descent:

\frac{df_{t}(x)}{dt}=-f_{t}(x)+C\sum_{i=1}^{n}\mathbbm{1}(y_{i}f_{t}(x_{i})<1)y_{i}\hat{\Theta}(w_{0};x,x_{i}).

Thus such an NN and the SVM converge to the same solution.

The proof is in Appendix E. We apply the results of [33] to show the constancy of the tangent kernel in the infinite width limit. Then it is easy to check that the dynamics of the infinitely wide NN and the SVM with NTK are the same. We give finite-width bounds for general loss functions in the next section. This theorem establishes the equivalence between infinitely wide NN and SVM for the first time. Previous theoretical results on SVM [17, 48, 45, 52] can be directly applied to understand the generalization of NNs trained by soft margin loss. Given that the tangent kernel is constant, or equivalently that the model is linear, we can also give the discrete dynamics of the NN (Appendix D.4), which is identical to that of SVM. Compared with the previous discrete-time gradient descent [31, 58], our result has an extra -\eta f_{t}(x) term because of the regularization term of the loss function.

f_{t+1}(x)-f_{t}(x)=-\eta f_{t}(x)+\eta C\sum_{i=1}^{n}\mathbbm{1}(y_{i}f_{t}(x_{i})<1)y_{i}\hat{\Theta}(w_{0};x,x_{i}).
Table 1: Summary of our theoretical results on the equivalences between infinite-width NNs and a family of KMs. Thanks to the representer theorem [45], our \ell_{2} regularized KMs can all apply the kernel trick, meaning the infinite-width NTK can be applied in these \ell_{2} regularized KMs.
\lambda | Loss l(z,y_{i}) | Kernel machine
\lambda=0 ([27, 4]) | (y_{i}-z)^{2} | Kernel regression
\lambda\rightarrow 0 (ours) | \max(0,1-y_{i}z) | Hard margin SVM
\lambda>0 (ours) | \max(0,1-y_{i}z) | (1-norm) soft margin SVM
\lambda>0 (ours) | \max(0,1-y_{i}z)^{2} | 2-norm soft margin SVM
\lambda>0 (ours) | \max(0,\lvert y_{i}-z\rvert-\epsilon) | Support vector regression
\lambda>0 (ours) | (y_{i}-z)^{2} | Kernel ridge regression (KRR)
\lambda>0 (ours) | \log(1+e^{-y_{i}z}) | Logistic regression with \ell_{2} regularization

3.4 General Loss Functions

We note that the above analysis does not depend specifically on the hinge loss. Thus we can generalize our analysis to general loss functions l(z,y_{i}), where z is the model output, as long as the loss function is differentiable (or has subgradients) with respect to z, such as the squared loss and the logistic loss. Besides, we can scale the regularization term by a factor \lambda instead of scaling l(z,y_{i}) by C as for SVM; the two are equivalent. Suppose the loss functions for the KM and the NN are

L(\beta)=\frac{\lambda}{2}\lVert\beta\rVert^{2}+\sum_{i=1}^{n}l(g(\beta,x_{i}),y_{i}),\quad L(w)=\frac{\lambda}{2}\lVert W^{(L+1)}\rVert^{2}+\sum_{i=1}^{n}l(f(w,x_{i}),y_{i}). (7)

Then the continuous dynamics of g_{t}(x) and f_{t}(x) are

\frac{dg_{t}(x)}{dt}=-\lambda g_{t}(x)-\sum_{i=1}^{n}l^{\prime}(g_{t}(x_{i}),y_{i})K(x,x_{i}), (8)
\frac{df_{t}(x)}{dt}=-\lambda f_{t}(x)-\sum_{i=1}^{n}l^{\prime}(f_{t}(x_{i}),y_{i})\hat{\Theta}(w_{t};x,x_{i}), (9)

where l^{\prime}(z,y_{i})=\frac{\partial l(z,y_{i})}{\partial z}. When \hat{\Theta}(w_{t};x,x_{i})\rightarrow\hat{\Theta}(w_{0};x,x_{i}) and K(x,x_{i})=\hat{\Theta}(w_{0};x,x_{i}), these two dynamics are the same (assuming g_{0}(x)=f_{0}(x)). When \lambda=0, we recover the previous results of kernel regression. When \lambda>0, we obtain our new results for \ell_{2} regularized loss functions. Table 1 lists the different loss functions and the corresponding KMs that infinite-width NNs are equivalent to. KRR is considered in [25] to analyze the generalization of NNs. However, they directly assume the NN is a linear model and use it in KRR. Below we give finite-width bounds on the difference between the outputs of the NN and the corresponding KM. The proof is in Appendix F.
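To illustrate Eq. (8), the sketch below simulates the \ell_{2} regularized kernel-machine dynamics on the training set by Euler discretization, using the logistic loss from Table 1; the kernel matrix, labels, step size, \lambda, and step count are placeholders.

```python
import numpy as np

def logistic_loss_grad(z, y):
    """l'(z, y) for l(z, y) = log(1 + exp(-y z)) (see Table 1)."""
    return -y / (1.0 + np.exp(y * z))

def simulate_km_dynamics(K, y, lam=0.1, dt=0.01, steps=5000, loss_grad=logistic_loss_grad):
    """Euler-discretize Eq. (8) on the training points:
       dg/dt = -lam * g - K @ l'(g, y), starting from g_0 = 0."""
    g = np.zeros(len(y))
    for _ in range(steps):
        g = g + dt * (-lam * g - K @ loss_grad(g, y))
    return g  # approximate g_T(x_i) on the training set
```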

Theorem 3.5 (Bounds on the difference between NN and KM).

Assume g_{0}(x)=f_{0}(x),\forall x and K(x,x_{i})=\hat{\Theta}(w_{0};x,x_{i}). (A linearized NN is a special case of such g.) Suppose the KM and NN are trained with losses (7) and gradient flow. Suppose l is \rho-Lipschitz and \beta_{l}-smooth in its first argument (i.e. the model output). Given any w_{T}\in B(w_{0};R):=\{w:\lVert w-w_{0}\rVert\leq R\} for some fixed R>0, for training data X\in\mathbb{R}^{d\times n} and a test point x\in\mathbb{R}^{d}, with high probability over the initialization,

\lVert f_{T}(X)-g_{T}(X)\rVert=O\Bigl(\frac{e^{\beta_{l}\lVert\hat{\Theta}(w_{0})\rVert}R^{3L+1}\rho n^{\frac{3}{2}}\ln{m}}{\lambda\sqrt{m}}\Bigr),
\lVert f_{T}(x)-g_{T}(x)\rVert=O\Bigl(\frac{e^{\beta_{l}\lVert\hat{\Theta}(w_{0};X,x)\rVert}R^{3L+1}\rho n\ln{m}}{\lambda\sqrt{m}}\Bigr),

where f_{T}(X),g_{T}(X)\in\mathbb{R}^{n} are the outputs on the training data and \hat{\Theta}(w_{0};X,x)\in\mathbb{R}^{n} is the tangent kernel evaluated between the training data and the test point.

4 Discussion

In this section, we give some extensions and applications of our theory. We first show that every finite-width neural network trained by an \ell_{2} regularized loss function is approximately a KM in Section 4.1, which enables us to compute a non-vacuous generalization bound of the NN via the corresponding KM. Next, in Section 4.2, we show that our theory of equivalence (in Section 3.3) is useful for evaluating the robustness of over-parameterized NNs with infinite width. In particular, our theory allows us to deliver nontrivial robustness certificates for infinite-width NNs, while certificates from existing robustness verification methods [22, 57, 60] become much looser (decreasing at a rate of O(1/\sqrt{m})) as the width of the NN increases and trivial at infinite width (the experiment results are in Section 5 and Table 2).

4.1 Finite-width Neural Network Trained by \ell_{2} Regularized Loss

Inspired by [18], we can also show that every NN trained by (sub)gradient descent with loss function (7) is approximately a KM without the assumption of infinite width.

Theorem 4.1.

Suppose a NN f(w,x) is learned from a training set \{(x_{i},y_{i})\}_{i=1}^{n} by (sub)gradient descent with loss function (7) and gradient flow. Assume \text{sign}(l^{\prime}(f_{t}(x_{i}),y_{i}))=\text{sign}(l^{\prime}(f_{0}(x_{i}),y_{i})),\forall t\in\left[0,T\right]. (This is the case for hinge loss and logistic loss.) Then at some time T>0,

f_{T}(x)=\sum_{i=1}^{n}a_{i}K(x,x_{i})+b,\quad\text{with}\quad K(x,x_{i})=e^{-\lambda T}\int_{0}^{T}\lvert l^{\prime}(f_{t}(x_{i}),y_{i})\rvert\hat{\Theta}(w_{t};x,x_{i})e^{\lambda t}\,dt,

and a_{i}=-\text{sign}(l^{\prime}(f_{0}(x_{i}),y_{i})), b=e^{-\lambda T}f_{0}(x).

See the proof in Appendix G, which utilizes the solution of an inhomogeneous linear differential equation instead of integrating both sides of the dynamics (Eq. (9)) directly [18]. Note that in Theorem 4.1, a_{i} is deterministic and independent of x, different from [18], where a_{i} depends on x. Deterministic a_{i} makes the function class simpler. Combining Theorem 4.1 with a bound on the Rademacher complexity of the KM [7] and a standard generalization bound using Rademacher complexity [39], we can compute the generalization bound of the NN via the corresponding KM. See Appendix B for more background and experiments. The generalization bound we get depends on a_{i}, which depends on the label y_{i}. This differs from traditional complexity measures, which cannot explain the random label phenomenon [59].
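As a numerical illustration of Theorem 4.1, the kernel K(x, x_i) can be approximated by a Riemann sum over checkpoints of the tangent kernel recorded during training. The sketch below assumes such checkpoints (and the corresponding loss derivatives) have already been collected, e.g. with the empirical_tangent_kernel helper sketched in Section 2.2; the checkpoint spacing dt is an assumption.

```python
import numpy as np

def theorem41_kernel(theta_checkpoints, lprime_checkpoints, dt, lam):
    """Riemann-sum approximation of K(x, x_i) from Theorem 4.1.

    theta_checkpoints:  (T_steps, n) array with Theta_hat(w_t; x, x_i) at times t = k*dt.
    lprime_checkpoints: (T_steps, n) array with l'(f_t(x_i), y_i) at the same times.
    """
    T_steps, _ = theta_checkpoints.shape
    T = T_steps * dt
    times = np.arange(T_steps) * dt
    integrand = np.abs(lprime_checkpoints) * theta_checkpoints * np.exp(lam * times)[:, None]
    return np.exp(-lam * T) * integrand.sum(axis=0) * dt  # shape (n,)
```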

Remark 4.1.

We note that the kernel here is valid only when \lvert l^{\prime}(f_{t}(x_{i}),y_{i})\rvert is a constant, e.g. l^{\prime}(f_{t}(x_{i}),y_{i})=-y_{i} at the initial training stage of the hinge loss with f_{0}(x)=0; otherwise the kernel is not symmetric.

4.2 Robustness Verification of Infinite-width Neural Network

Our theory of equivalence allows us to deliver nontrivial robustness certificates for infinite-width NNs by considering the equivalent KMs. For an input x_{0}\in\mathbb{R}^{d}, the objective of robustness verification is to find the largest ball such that no example within this ball, x\in B(x_{0},\delta), can change the classification result. Without loss of generality, we assume g(x_{0})>0. The robustness verification problem can be formulated as follows,

\max\ \delta,\quad\text{s.t.}\ \ g(x)>0,\ \forall x\in B(x_{0},\delta). (10)

For an infinitely wide two-layer fully connected ReLU NN, f(x)=\frac{1}{\sqrt{m}}\sum_{j=1}^{m}v_{j}\sigma(\frac{1}{\sqrt{d}}w_{j}^{T}x), where \sigma(z)=\max(0,z) is the ReLU activation, the NTK is

\Theta(x,x^{\prime})=\frac{\langle x,x^{\prime}\rangle}{d}\Bigl(\frac{\pi-\arccos(u)}{\pi}\Bigr)+\frac{\lVert x\rVert\lVert x^{\prime}\rVert}{2\pi d}\sqrt{1-u^{2}},

where u=\frac{\langle x,x^{\prime}\rangle}{\lVert x\rVert\lVert x^{\prime}\rVert}\in\left[-1,1\right]. See the proof of the following theorem in Appendix H.1.

Theorem 4.2.

Consider the \ell_{\infty} perturbation: for x\in B_{\infty}(x_{0},\delta)=\{x\in\mathbb{R}^{d}:\lVert x-x_{0}\rVert_{\infty}\leq\delta\}, we can bound \Theta(x,x^{\prime}) within some interval [\Theta^{L}(x,x^{\prime}),\Theta^{U}(x,x^{\prime})]. Suppose g(x)=\sum_{i=1}^{n}\alpha_{i}\Theta(x,x_{i}), where the \alpha_{i} are known after solving the KM problem (e.g. SVM or KRR). Then we can lower bound g(x) as follows,

g(x)\geq\sum_{i=1,\alpha_{i}>0}^{n}\alpha_{i}\Theta^{L}(x,x_{i})+\sum_{i=1,\alpha_{i}<0}^{n}\alpha_{i}\Theta^{U}(x,x_{i}).

Using a simple binary search and the above theorem, we can find a certified lower bound for (10). Because of the equivalence between the infinite-width NN and the KM, the certified lower bound we get for the KM is equivalently a certified lower bound for the corresponding infinite-width NN.
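A sketch of this certification procedure is given below: it implements the two-layer ReLU NTK above, the lower bound of Theorem 4.2, and the binary search over \delta. The functions theta_lower and theta_upper, which should return the interval [\Theta^{L},\Theta^{U}] over the \ell_{\infty} ball (derived in Appendix H.1), are left as assumed inputs here rather than implemented.

```python
import numpy as np

def ntk_two_layer_relu(x, xp):
    """NTK of the infinitely wide two-layer ReLU network (formula above)."""
    d = len(x)
    u = np.clip(x @ xp / (np.linalg.norm(x) * np.linalg.norm(xp)), -1.0, 1.0)
    return (x @ xp / d) * (np.pi - np.arccos(u)) / np.pi \
        + np.linalg.norm(x) * np.linalg.norm(xp) / (2 * np.pi * d) * np.sqrt(1 - u ** 2)

def certified_radius(x0, X_train, alpha, theta_lower, theta_upper, delta_max=1.0, iters=20):
    """Binary search for the largest delta whose certified lower bound on g(x) stays positive.

    theta_lower(x0, xi, delta) / theta_upper(x0, xi, delta) are assumed to return
    Theta^L / Theta^U over B_inf(x0, delta), as in Theorem 4.2 and Appendix H.1.
    """
    lo, hi = 0.0, delta_max
    for _ in range(iters):
        delta = 0.5 * (lo + hi)
        lower = sum(a * (theta_lower(x0, xi, delta) if a > 0 else theta_upper(x0, xi, delta))
                    for a, xi in zip(alpha, X_train))
        if lower > 0:
            lo = delta   # certified at this radius; try a larger one
        else:
            hi = delta
    return lo
```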

Figure 1: Training dynamics of the neural network and SVM behave similarly. (a)(b) show the dynamics of outputs for two randomly selected samples. (c) shows the difference between the outputs of SVM and NN; the dynamics of SVM agrees better with the wider NN. (d) shows the dynamics of the hinge loss for SVM and NN. Unless otherwise specified, the width of the NN is 10K and \hat{\eta}=0.1.

5 Experiments

(I) Verification of the equivalence. The first experiment verifies the equivalence between the soft margin SVM with NTK trained by subgradient descent and the infinite-width NN trained by soft margin loss. We train the SVM and a 3-layer fully connected ReLU NN for binary MNIST [29] classification (0 and 1) with learning rates \hat{\eta}=0.1 and \hat{\eta}=0.01 using full-batch subgradient descent on n=128 samples, where \hat{\eta} is the learning rate used in the experiments. Figure 1 shows the dynamics of the outputs and the loss for the NN and SVM. Since the regularization terms in the losses of the NN and SVM are different, we only plot the hinge loss. It can be seen that the dynamics of the NN and SVM agree very well. We also run a stochastic subgradient descent experiment for binary classification on the full MNIST 0 and 1 data (12665 training and 2115 test samples) with learning rate \hat{\eta}=1 and batch size 64, shown in Figure A.1. For more details, please see Appendix A.

(II) Robustness of over-parameterized neural network. Table 2 shows the robustness certificates of two-layer over-parameterized NNs with increasing width and of the SVM (which is equivalent to an infinite-width two-layer ReLU NN) on binary classification of MNIST (0 and 1). We use the NN robustness verification algorithm IBP [22] to compute the robustness certificates for the two-layer over-parameterized NNs. The robustness certificate for the SVM is computed using our method in Section 4.2. As demonstrated in Table 2, the certificate of the NN decreases almost at a rate of O(1/\sqrt{m}) and will decrease to 0 as m\rightarrow\infty, where m is the width of the hidden layer. We show that this is due to bound propagation in Appendix H.2. Unfortunately, the decrease rate becomes faster if the NN is deeper. The same problem happens for LeCun initialization as well, which PyTorch uses for fully connected layers by default. Notably, however, thanks to our theory, we can compute a nontrivial robustness certificate for an infinite-width NN through the equivalent SVM, as demonstrated.

Table 2: Robustness certified lower bounds of two-layer ReLU NNs and SVM (infinite-width two-layer ReLU NN) tested on binary classification of MNIST (0 and 1). 100 test: 100 randomly selected test samples. Full test: full test data. Tested only on data that is classified correctly. The std is computed over data samples. All models have test accuracy 99.95%. All values are means over 5 experiments.
Robustness certificate \delta (mean \pm std) \times 10^{-3}
Model Width 100 test Full test
NN 10^{3} 7.4485 \pm 2.5667 7.2708 \pm 2.1427
NN 10^{4} 2.9861 \pm 1.0730 2.9367 \pm 0.89807
NN 10^{5} 0.99098 \pm 0.35775 0.97410 \pm 0.29997
NN 10^{6} 0.31539 \pm 0.11380 0.30997 \pm 0.095467
SVM \infty 8.0541 \pm 2.5827 7.9733 \pm 2.1396

(III) Comparison with kernel regression. Table 3 compares our \ell_{2} regularized models (KRR and SVM with NTK) with the previous kernel regression model (\lambda=0 for KRR). All robustness certified lower bounds are computed using our method in Section 4.2. While the accuracies of the different models are similar, the robustness of KRR increases as the regularization increases. The SVM is substantially more robust than KRR with the same regularization magnitude. Our theory enables us to train an equivalent infinite-width NN through SVM and KRR, which is intrinsically more robust than the previous kernel regression model.

Table 3: Robustness of equivalent infinite-width NN models with different loss functions (see Table 1) on binary classification of MNIST (0 and 1). \lambda is the parameter in Eq. (7).
Model \lambda Test accuracy Robustness certificate \delta Robustness improvement
\lambda=0 ([27, 4]) KRR 0 99.95% 3.30202\times 10^{-5} -
\lambda>0 (ours) KRR 0.001 99.95% 3.756122\times 10^{-5} 1.14X
KRR 0.01 99.95% 6.505500\times 10^{-5} 1.97X
KRR 0.1 99.95% 2.229960\times 10^{-4} 6.75X
KRR 1 99.95% 0.001005 30.43X
KRR 10 99.91% 0.005181 156.90X
KRR 100 99.86% 0.020456 619.50X
KRR 1000 99.76% 0.026088 790.06X
SVM 0.064 99.95% 0.008054 243.91X

6 Conclusion and Future Works

In this paper, we establish the equivalence between the SVM with NTK and the NN trained by soft margin loss with subgradient descent in the infinite width limit, and we show that they have the same dynamics and solution. We also extend our analysis to general \ell_{2} regularized loss functions and show that every neural network trained by such loss functions is approximately a KM. Finally, we demonstrate that our theory is useful for computing a non-vacuous generalization bound for an NN and a non-trivial robustness certificate for an infinite-width NN, which existing neural network robustness verification algorithms cannot handle, and that the resulting infinite-width NN from our \ell_{2} regularized models is intrinsically more robust than that from the previous NTK kernel regression. For future research, since the equivalence between NN and SVM (and other \ell_{2} regularized KMs) with NTK has been established, it would be very interesting to understand the generalization and robustness of NNs from the perspective of these KMs. Our main results are currently still limited to the linear regime. It would be interesting to extend the results to the mean field setting or consider their connection with the implicit bias of NNs.

7 Acknowledgement

We thank the anonymous reviewers for useful suggestions to improve the paper. We thank Libin Zhu for helpful discussions. We thank San Diego Supercomputer Center and MIT-IBM Watson AI Lab for computing resources. This work used the Extreme Science and Engineering Discovery Environment (XSEDE) [55], which is supported by National Science Foundation grant number ACI-1548562. This work used the Extreme Science and Engineering Discovery Environment (XSEDE) Expanse at San Diego Supercomputer Center through allocation TG-ASC150024 and ddp390. T.-W. Weng is partially supported by National Science Foundation under Grant No. 2107189.

References

  • Allen-Zhu et al. [2019] Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pages 242–252. PMLR, 2019.
  • Ambrosio et al. [2008] L. Ambrosio, N. Gigli, and G. Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.
  • Andras [2002] P. Andras. The equivalence of support vector machine and regularization neural networks. Neural Processing Letters, 15(2):97–104, 2002.
  • Arora et al. [2019a] S. Arora, S. S. Du, W. Hu, Z. Li, R. R. Salakhutdinov, and R. Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, pages 8141–8150, 2019a.
  • Arora et al. [2019b] S. Arora, S. S. Du, W. Hu, Z. Li, and R. Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019b.
  • Arora et al. [2019c] S. Arora, S. S. Du, Z. Li, R. Salakhutdinov, R. Wang, and D. Yu. Harnessing the power of infinitely wide deep nets on small-data tasks. arXiv preprint arXiv:1910.01663, 2019c.
  • Bartlett and Mendelson [2002] P. L. Bartlett and S. Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
  • Bartlett et al. [2019] P. L. Bartlett, N. Harvey, C. Liaw, and A. Mehrabian. Nearly-tight vc-dimension and pseudodimension bounds for piecewise linear neural networks. The Journal of Machine Learning Research, 20(1):2285–2301, 2019.
  • Bertsekas and Scientific [2015] D. P. Bertsekas and A. Scientific. Convex optimization algorithms. Athena Scientific Belmont, 2015.
  • Boser et al. [1992] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory, pages 144–152, 1992.
  • Boyd et al. [2003] S. Boyd, L. Xiao, and A. Mutapcic. Subgradient methods. lecture notes of EE392o, Stanford University, Autumn Quarter, 2004:2004–2005, 2003.
  • Cao and Gu [2019] Y. Cao and Q. Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. arXiv preprint arXiv:1905.13210, 2019.
  • Chizat and Bach [2020] L. Chizat and F. Bach. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. In Conference on Learning Theory, pages 1305–1338. PMLR, 2020.
  • Chizat et al. [2018] L. Chizat, E. Oyallon, and F. Bach. On lazy training in differentiable programming. arXiv preprint arXiv:1812.07956, 2018.
  • Cho [2012] Y. Cho. Kernel methods for deep learning. PhD thesis, UC San Diego, 2012.
  • Cortes and Vapnik [1995] C. Cortes and V. Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.
  • Cristianini et al. [2000] N. Cristianini, J. Shawe-Taylor, et al. An introduction to support vector machines and other kernel-based learning methods. Cambridge university press, 2000.
  • Domingos [2020] P. Domingos. Every model learned by gradient descent is approximately a kernel machine. arXiv preprint arXiv:2012.00152, 2020.
  • Du et al. [2019a] S. Du, J. Lee, H. Li, L. Wang, and X. Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pages 1675–1685. PMLR, 2019a.
  • Du et al. [2018] S. S. Du, X. Zhai, B. Poczos, and A. Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
  • Du et al. [2019b] S. S. Du, K. Hou, B. Póczos, R. Salakhutdinov, R. Wang, and K. Xu. Graph neural tangent kernel: Fusing graph neural networks with graph kernels. arXiv preprint arXiv:1905.13192, 2019b.
  • Gowal et al. [2018] S. Gowal, K. Dvijotham, R. Stanforth, R. Bunel, C. Qin, J. Uesato, R. Arandjelovic, T. Mann, and P. Kohli. On the effectiveness of interval bound propagation for training verifiably robust models. arXiv preprint arXiv:1810.12715, 2018.
  • Hsieh et al. [2008] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear svm. In Proceedings of the 25th international conference on Machine learning, pages 408–415, 2008.
  • Hu et al. [2021] T. Hu, W. Wang, C. Lin, and G. Cheng. Regularization matters: A nonparametric perspective on overparametrized neural network. In International Conference on Artificial Intelligence and Statistics, pages 829–837. PMLR, 2021.
  • Hu et al. [2019] W. Hu, Z. Li, and D. Yu. Simple and effective regularization methods for training on noisily labeled data with generalization guarantee. arXiv preprint arXiv:1905.11368, 2019.
  • Huang et al. [2020] W. Huang, W. Du, and R. Y. Da Xu. On the neural tangent kernel of deep networks with orthogonal initialization. arXiv preprint arXiv:2004.05867, 2020.
  • Jacot et al. [2018] A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571–8580, 2018.
  • Ji and Telgarsky [2018] Z. Ji and M. Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300, 2018.
  • LeCun et al. [1998] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Lee et al. [2017] J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein. Deep neural networks as gaussian processes. arXiv preprint arXiv:1711.00165, 2017.
  • Lee et al. [2019] J. Lee, L. Xiao, S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in neural information processing systems, pages 8572–8583, 2019.
  • Liang et al. [2017] X. Liang, X. Wang, Z. Lei, S. Liao, and S. Z. Li. Soft-margin softmax for deep classification. In International Conference on Neural Information Processing, pages 413–421. Springer, 2017.
  • Liu et al. [2020a] C. Liu, L. Zhu, and M. Belkin. On the linearity of large non-linear models: when and why the tangent kernel is constant. Advances in Neural Information Processing Systems, 33, 2020a.
  • Liu et al. [2020b] C. Liu, L. Zhu, and M. Belkin. Loss landscapes and optimization in over-parameterized non-linear systems and neural networks. arXiv preprint arXiv:2003.00307, 2020b.
  • Liu et al. [2016] W. Liu, Y. Wen, Z. Yu, and M. Yang. Large-margin softmax loss for convolutional neural networks. In ICML, volume 2, page 7, 2016.
  • Long and Sedghi [2019] P. M. Long and H. Sedghi. Generalization bounds for deep convolutional neural networks. arXiv preprint arXiv:1905.12600, 2019.
  • Lyu and Li [2019] K. Lyu and J. Li. Gradient descent maximizes the margin of homogeneous neural networks. arXiv preprint arXiv:1906.05890, 2019.
  • Matthews et al. [2018] A. G. d. G. Matthews, M. Rowland, J. Hron, R. E. Turner, and Z. Ghahramani. Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271, 2018.
  • Mohri et al. [2018] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of machine learning. MIT press, 2018.
  • Nacson et al. [2019] M. S. Nacson, S. Gunasekar, J. Lee, N. Srebro, and D. Soudry. Lexicographic and depth-sensitive margins in homogeneous and non-homogeneous deep models. In International Conference on Machine Learning, pages 4683–4692. PMLR, 2019.
  • Novak et al. [2020] R. Novak, L. Xiao, J. Hron, J. Lee, A. A. Alemi, J. Sohl-Dickstein, and S. S. Schoenholz. Neural tangents: Fast and easy infinite neural networks in python. In International Conference on Learning Representations, 2020. URL https://github.com/google/neural-tangents.
  • Paszke et al. [2019] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703, 2019.
  • Pellegrini and Biroli [2020] F. Pellegrini and G. Biroli. An analytic theory of shallow networks dynamics for hinge loss classification. Advances in Neural Information Processing Systems, 33, 2020.
  • Poggio and Girosi [1990] T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78(9):1481–1497, 1990.
  • Schölkopf et al. [2002] B. Schölkopf, A. J. Smola, F. Bach, et al. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2002.
  • Shalev-Shwartz and Ben-David [2014] S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
  • Shalev-Shwartz et al. [2011] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for svm. Mathematical programming, 127(1):3–30, 2011.
  • Shawe-Taylor et al. [2004] J. Shawe-Taylor, N. Cristianini, et al. Kernel methods for pattern analysis. Cambridge university press, 2004.
  • Smola et al. [1998] A. J. Smola, B. Schölkopf, and K.-R. Müller. The connection between regularization operators and support vector kernels. Neural networks, 11(4):637–649, 1998.
  • Sokolić et al. [2017] J. Sokolić, R. Giryes, G. Sapiro, and M. R. Rodrigues. Robust large margin deep neural networks. IEEE Transactions on Signal Processing, 65(16):4265–4280, 2017.
  • Soudry et al. [2018] D. Soudry, E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.
  • Steinwart and Christmann [2008] I. Steinwart and A. Christmann. Support vector machines. Springer Science & Business Media, 2008.
  • Sun et al. [2016] S. Sun, W. Chen, L. Wang, X. Liu, and T.-Y. Liu. On the depth of deep neural networks: A theoretical view. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
  • Tang [2013] Y. Tang. Deep learning using linear support vector machines. arXiv preprint arXiv:1306.0239, 2013.
  • Towns et al. [2014] J. Towns, T. Cockerill, M. Dahan, I. Foster, K. Gaither, A. Grimshaw, V. Hazlewood, S. Lathrop, D. Lifka, G. D. Peterson, R. Roskies, J. R. Scott, and N. Wilkins-Diehr. Xsede: Accelerating scientific discovery. Computing in Science & Engineering, 16(5):62–74, Sept.-Oct. 2014. ISSN 1521-9615. doi: 10.1109/MCSE.2014.80. URL doi.ieeecomputersociety.org/10.1109/MCSE.2014.80.
  • Wei et al. [2019] C. Wei, J. Lee, Q. Liu, and T. Ma. Regularization matters: Generalization and optimization of neural nets vs their induced kernel. 2019.
  • Weng et al. [2018] L. Weng, H. Zhang, H. Chen, Z. Song, C.-J. Hsieh, L. Daniel, D. Boning, and I. Dhillon. Towards fast computation of certified robustness for relu networks. In International Conference on Machine Learning, pages 5276–5285. PMLR, 2018.
  • Yang and Hu [2020] G. Yang and E. J. Hu. Feature learning in infinite-width neural networks. arXiv preprint arXiv:2011.14522, 2020.
  • Zhang et al. [2021] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.
  • Zhang et al. [2018] H. Zhang, T.-W. Weng, P.-Y. Chen, C.-J. Hsieh, and L. Daniel. Efficient neural network robustness certification with general activation functions. arXiv preprint arXiv:1811.00866, 2018.

Appendix A Experiment Details

Figure A.1: SVM and NN trained by stochastic subgradient descent for the binary MNIST classification task on the full 0 and 1 data with learning rate \hat{\eta}=1 and batch size 64. The width of the NN is 10K.

A.1 SVM Training

We use the following loss to train the SVM,

L(\beta)=\frac{\lambda}{2}\left\lVert\beta\right\rVert^{2}+\frac{1}{n}\sum_{i=1}^{n}\max(0,1-y_{i}\left\langle\beta,\Phi(x_{i})\right\rangle). (11)

Let \hat{\eta} be the learning rate for this loss in the experiments. Then the dynamics of subgradient descent is

g_{t+1}(x)=(1-\hat{\eta}\lambda)g_{t}(x)+\frac{\hat{\eta}}{n}\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{t}(x_{i})<1)y_{i}K(x,x_{i}). (12)

Denote Q(t)=\frac{\hat{\eta}}{n}\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{t}(x_{i})<1)y_{i}K(x,x_{i}), which is a linear combination of the K(x,x_{i}) and changes over time. The model output at some time T is

g_{T}(x)=(1-\hat{\eta}\lambda)^{T}\Bigl(g_{0}(x)+\sum_{t=0}^{T-1}(1-\hat{\eta}\lambda)^{-t-1}Q(t)\Bigr). (13)

If we set g_{0}(x)=0, we have

g_{T}(x)=\sum_{t=0}^{T-1}(1-\hat{\eta}\lambda)^{T-1-t}Q(t). (14)

We see that g_{T}(x) is always a linear combination of the kernel values K(x,x_{i}) for i=1,\dots,n. Since the K(x,x_{i}) are fixed, we just need to store and update the weights of the kernel values. Let \alpha_{t}\in\mathbb{R}^{n} be the weights at time t, that is

g_{t}(x)=\sum_{i=1}^{n}\alpha_{t,i}K(x,x_{i}). (15)

Then, according to Eq. (12), we update \alpha at each subgradient descent step as follows.

\alpha_{t+1,i}=(1-\hat{\eta}\lambda)\alpha_{t,i}+\frac{\hat{\eta}}{n}\mathbbm{1}(y_{i}g_{t}(x_{i})<1)y_{i},\quad\forall i\in\{1,\dots,n\}. (16)

For the SGD case, we sample S_{t}\subseteq\{1,\dots,n\} at step t and update the weights of this subset while keeping the other weights unchanged.

αt+1,i=(1η^λ)αt,i+η^|St|𝟙(yigt(xi)<1)yi,iSt,\displaystyle\alpha_{t+1,i}=(1-\hat{\eta}\lambda)\alpha_{t,i}+\frac{\hat{\eta}}{\left\lvert S_{t}\right\rvert}\mathbbm{1}(y_{i}g_{t}(x_{i})<1)y_{i},\quad\forall i\in S_{t},
αt+1,i=αt,i,iSt.\displaystyle\alpha_{t+1,i}=\alpha_{t,i},\quad\forall i\notin S_{t}.

The kernelized implementation of Pegasos [47] sets \hat{\eta}_{t}=\frac{1}{\lambda t} in order to prove convergence of the algorithm. In our experiments, we use a constant \hat{\eta}.
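For concreteness, the following is a minimal NumPy sketch of the kernelized training procedure above (Eq. (16) and the stochastic variant), assuming the kernel Gram matrix K has been precomputed (e.g., the NTK Gram matrix); function names such as train_svm_kernel are illustrative.

```python
import numpy as np

def train_svm_kernel(K, y, lam=0.001, lr=0.1, steps=1000, batch_size=None, seed=0):
    """Kernelized soft margin SVM trained by (stochastic) subgradient descent.

    K   : (n, n) kernel Gram matrix, K[i, j] = K(x_i, x_j)
    y   : (n,) labels in {-1, +1}
    lam : regularization parameter lambda
    lr  : learning rate \hat{eta}
    Returns the weights alpha so that g(x) = sum_i alpha_i K(x, x_i), as in Eq. (15).
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    alpha = np.zeros(n)              # g_0(x) = 0, matching beta_0 = 0
    for _ in range(steps):
        g = K @ alpha                # g_t(x_i) on the training points
        violated = (y * g < 1).astype(float)   # indicator 1(y_i g_t(x_i) < 1)
        if batch_size is None:       # full-batch update, Eq. (16)
            alpha = (1 - lr * lam) * alpha + (lr / n) * violated * y
        else:                        # stochastic variant: update only the sampled subset S_t
            S = rng.choice(n, size=batch_size, replace=False)
            alpha[S] = (1 - lr * lam) * alpha[S] + (lr / len(S)) * violated[S] * y[S]
    return alpha

def predict(K_test_train, alpha):
    """g(x) = sum_i alpha_i K(x, x_i) for test points (rows of K_test_train)."""
    return K_test_train @ alpha
```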

A.2 More Details

(I) Verification of the equivalence. The first experiment illustrates the equivalence between the soft margin SVM with NTK trained by subgradient descent and the NN trained by the soft margin loss. We initialize 3-layer fully connected ReLU neural networks of width 10000 and 900 with NTK parameterization and ensure f_{0}(x)=0 by subtracting the initial values from the NN outputs. We initialize the parameter of the SVM as \beta_{0}=0, which automatically ensures g_{0}(x)=0. The SVM is trained by directly updating the weights of the kernel values [47]; more details are given in Appendix A.1. We set the regularization parameter to \lambda=0.001 and take the average of the hinge loss instead of the sum (this is equivalent to using \lambda=0.001\times n in Eq. (7)). We train the NN and SVM for a binary MNIST [29] classification task (0 and 1) with learning rates \hat{\eta}=0.1 and \hat{\eta}=0.01 using full-batch subgradient descent on n=128 samples, where \hat{\eta} is the learning rate used in the experiments (see Appendix A.1). Figure 1 shows the dynamics of the outputs and loss for the NN and SVM. Since the regularization terms in the losses of the NN and SVM differ, we plot only the hinge loss. We see that the dynamics of the NN and SVM agree well. We also run a stochastic subgradient descent experiment for the binary MNIST classification task on the full 0 and 1 data (12665 training and 2115 test samples) with learning rate \hat{\eta}=1 and batch size 64, shown in Figure A.1.

Experiments are implemented in PyTorch [42], and the NTK of the infinite-width NN is computed using Neural Tangents [41]. We run our experiments on a 16GB V100 GPU.

Appendix B Computing Non-vacuous Generalization Bounds via Corresponding Kernel Machines

Figure B.2: Computing non-vacuous generalization bounds via corresponding kernel machines. A two-layer NN with 100 hidden nodes is trained by full-batch subgradient descent for the binary MNIST classification task on 0 and 1 data of size 1000 with learning rate \hat{\eta}=0.1 and hinge loss. We set f_{0}(x)=0. The kernel machine (KM) approximates the NN very well, and we obtain a tight upper bound on the true loss by computing its Rademacher complexity. The confidence parameter is set to 1-\delta=0.99. Since the kernel is only valid when the loss gradient of the output is a constant, we only run the initial training stage of the hinge loss, where the kernel is valid.

Using Theorem 4.1, we can numerically compute the kernel machine that the NN is equivalent to, i.e., we can compute the kernel matrix and the weights at any time during training. One can then apply a generalization bound for kernel machines to obtain a generalization bound for this kernel machine (equivalently, for this NN). Let \mathcal{H} be the reproducing kernel Hilbert space (RKHS) corresponding to the kernel K(\cdot,\cdot). The RKHS norm of a function f(x)=\sum_{i=1}^{n}a_{i}K(x,x_{i}) is (assuming f_{0}(x)=0)

f=i=1naiΦ(xi)=i=1nj=1naiajK(xi,xj)\left\lVert f\right\rVert_{\mathcal{H}}=\left\lVert\sum_{i=1}^{n}a_{i}\Phi(x_{i})\right\rVert=\sqrt{\sum_{i=1}^{n}\sum_{j=1}^{n}a_{i}a_{j}K(x_{i},x_{j})}
Lemma B.1 (Lemma 22 in [7]).

For a function class B={f(x)=i=1naiK(x,xi):fB}{xβ,Φ(x):βB}\mathcal{F}_{B}=\{f(x)=\sum_{i=1}^{n}a_{i}K(x,x_{i}):\left\lVert f\right\rVert_{\mathcal{H}}\leq B\}\subseteq\{x\rightarrow\left\langle\beta,\Phi(x)\right\rangle:\left\lVert\beta\right\rVert\leq B\}, its empirical Rademacher complexity can be bounded as

^S(B)=1n𝔼σi{±1}n[supfBi=1nσif(xi)]Bni=1nK(xi,xi)\hat{\mathcal{R}}_{S}(\mathcal{F}_{B})=\frac{1}{n}\mathop{\mathbb{E}}_{\sigma_{i}\sim\{\pm 1\}^{n}}\left[\sup_{f\in\mathcal{F}_{B}}\sum_{i=1}^{n}\sigma_{i}f(x_{i})\right]\leq\frac{B}{n}\sqrt{\sum_{i=1}^{n}K(x_{i},x_{i})}

Assume the data is sampled i.i.d. from some distribution D and the population loss is L_{D}(f)=\mathop{\mathbb{E}}_{(x,y)\sim D}\left[l(f(x),y)\right]. The empirical loss is L_{S}(f)=\frac{1}{n}\sum_{i=1}^{n}l(f(x_{i}),y_{i}). Combining this with the standard Rademacher-complexity-based generalization bound below [39], we obtain a bound on the population loss L_{D}(f) for the kernel machine (equivalently, for this NN).

Lemma B.2.

Suppose the loss (,)\ell(\cdot,\cdot) is bounded in [0,c]\left[0,c\right], and is ρ\rho-Lipschitz in the first argument. Then with probability at least 1δ1-\delta over the sample SS of size nn,

supf{LD(f)LS(f)}2ρ^S()+3clog(2/δ)2n\sup_{f\in\mathcal{F}}\{L_{D}(f)-L_{S}(f)\}\leq 2\rho\hat{\mathcal{R}}_{S}(\mathcal{F})+3c\sqrt{\frac{\log(2/\delta)}{2n}}

Most existing generalization bounds for NNs [8, 36] are vacuous since they depend on the number of parameters. In contrast, the bound for kernel machines does not depend on the number of the NN's parameters, which makes it non-vacuous and promising.
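As a concrete illustration, the following is a minimal NumPy sketch of how the bound can be evaluated from the kernel matrix K and the weights a of the corresponding kernel machine (Theorem 4.1), combining the RKHS norm, Lemma B.1 and Lemma B.2; the values of \rho, c and \delta are inputs chosen by the user.

```python
import numpy as np

def rkhs_norm(a, K):
    """||f||_H = sqrt(a^T K a) for f(x) = sum_i a_i K(x, x_i) (assuming f_0(x) = 0)."""
    return float(np.sqrt(a @ K @ a))

def rademacher_bound(a, K):
    """Lemma B.1: hat{R}_S(F_B) <= (B / n) * sqrt(sum_i K(x_i, x_i)), with B = ||f||_H."""
    n = K.shape[0]
    B = rkhs_norm(a, K)
    return B / n * np.sqrt(np.trace(K))

def generalization_bound(a, K, empirical_loss, rho=1.0, c=1.0, delta=0.01):
    """Lemma B.2: L_D(f) <= L_S(f) + 2*rho*hat{R}_S(F) + 3c*sqrt(log(2/delta) / (2n))."""
    n = K.shape[0]
    rad = rademacher_bound(a, K)
    return empirical_loss + 2 * rho * rad + 3 * c * np.sqrt(np.log(2 / delta) / (2 * n))
```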

Appendix C Dynamics of Support Vector Machine

In this section, we derive the continuous and discrete dynamics of the soft margin SVM trained by subgradient descent with the following loss function

L(β)=12β2+Ci=1nmax(0,1yiβ,Φ(xi)),L(\beta)=\frac{1}{2}\left\lVert\beta\right\rVert^{2}+C\sum_{i=1}^{n}\max(0,1-y_{i}\left\langle\beta,\Phi(x_{i})\right\rangle), (17)

and the subgradient

βL(βt)=βtCi=1n𝟙(yigt(xi)<1)yiΦ(xi).\nabla_{\beta}L(\beta_{t})=\beta_{t}-C\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{t}(x_{i})<1)y_{i}\Phi(x_{i}). (18)
Lemma C.1.

L(\beta) satisfies the Polyak-Łojasiewicz (PL) inequality,

L(βt)L(β)12βL(βt)2βt.L(\beta_{t})-L(\beta^{*})\leq\frac{1}{2}\left\lVert\nabla_{\beta}L(\beta_{t})\right\rVert^{2}\quad\forall\ \beta_{t}. (19)

where β=argminβL(β)\beta^{*}=\operatorname*{arg\,min}_{\beta}L(\beta).

Proof.

Since L(β)L(\beta) is 1-strongly convex, by the definition of strong convexity and subgradient

L(β)L(βt)+βL(βt),ββt+12ββt2L(\beta)\geq L(\beta_{t})+\left\langle\nabla_{\beta}L(\beta_{t}),\beta-\beta_{t}\right\rangle+\frac{1}{2}\left\lVert\beta-\beta_{t}\right\rVert^{2} (20)

The right-hand side is a convex quadratic function of \beta (for fixed \beta_{t}). Setting its gradient with respect to \beta to 0, we find that \tilde{\beta}=\beta_{t}-\nabla_{\beta}L(\beta_{t}) minimizes the right-hand side. Therefore, we have

L(β)L(βt)+βL(βt),ββt+12ββt2L(βt)+βL(βt),β~βt+12β~βt2=L(βt)12βL(βt)2.\begin{split}L(\beta)&\geq L(\beta_{t})+\left\langle\nabla_{\beta}L(\beta_{t}),\beta-\beta_{t}\right\rangle+\frac{1}{2}\left\lVert\beta-\beta_{t}\right\rVert^{2}\\ &\geq L(\beta_{t})+\left\langle\nabla_{\beta}L(\beta_{t}),\tilde{\beta}-\beta_{t}\right\rangle+\frac{1}{2}\left\lVert\tilde{\beta}-\beta_{t}\right\rVert^{2}\\ &=L(\beta_{t})-\frac{1}{2}\left\lVert\nabla_{\beta}L(\beta_{t})\right\rVert^{2}.\end{split} (21)

Since this holds for any β\beta, we have

L(β)L(βt)12βL(βt)2.L(\beta^{*})\geq L(\beta_{t})-\frac{1}{2}\left\lVert\nabla_{\beta}L(\beta_{t})\right\rVert^{2}. (22)

C.1 Continuous Dynamics of SVM

Here we give the detailed derivation of the dynamics of the soft margin SVM trained by subgradient descent. In the limit of the learning rate \eta\rightarrow 0, the subgradient descent update, which can be written as

βt+1βtη=βL(βt),\frac{\beta_{t+1}-\beta_{t}}{\eta}=-\nabla_{\beta}L(\beta_{t}), (23)

becomes a differential equation

dβtdt=βL(βt).\frac{d\beta_{t}}{dt}=-\nabla_{\beta}L(\beta_{t}). (24)

This is known as gradient flow [2]. Recall that we have defined the subgradient as

βL(βt)=βtCi=1n𝟙(yigt(xi)<1)yiΦ(xi).\nabla_{\beta}L(\beta_{t})=\beta_{t}-C\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{t}(x_{i})<1)y_{i}\Phi(x_{i}). (25)

Applying the chain rule, the dynamics of gt(x)=βt,Φ(x)g_{t}(x)=\left\langle\beta_{t},\Phi(x)\right\rangle is

dgt(x)dt=gt(x)βtdβtdt=Φ(x),βt+Ci=1n𝟙(yigt(xi)<1)yiΦ(xi)=gt(x)+Ci=1n𝟙(yigt(xi)<1)yiK(x,xi).\begin{split}\frac{dg_{t}(x)}{dt}&=\frac{\partial g_{t}(x)}{\partial\beta_{t}}\frac{d\beta_{t}}{dt}\\ &=\left\langle\Phi(x),-\beta_{t}+C\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{t}(x_{i})<1)y_{i}\Phi(x_{i})\right\rangle\\ &=-g_{t}(x)+C\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{t}(x_{i})<1)y_{i}K(x,x_{i}).\end{split} (26)

Denoting Q(t)=Ci=1n𝟙(yigt(xi)<1)yiK(x,xi)Q(t)=C\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{t}(x_{i})<1)y_{i}K(x,x_{i}), the equation becomes

dgt(x)dt+gt(x)=Q(t).\frac{dg_{t}(x)}{dt}+g_{t}(x)=Q(t). (27)

Note this is a first-order inhomogeneous differential equation. The general solution at some time TT is given by

gT(x)=eT(g0(x)+0TQ(t)et𝑑t).g_{T}(x)=e^{-T}\biggl{(}g_{0}(x)+\int_{0}^{T}Q(t)e^{t}\,dt\biggr{)}. (28)
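The closed-form solution (28) can be checked numerically on a toy problem: simulate the functional dynamics (27) with small forward Euler steps, record Q(t) along the trajectory, and compare with the variation-of-constants formula. The sketch below uses an assumed Gaussian kernel and random labels purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, C, dt, T = 20, 1.0, 1e-3, 3.0
X = rng.normal(size=(n, 2))
y = rng.choice([-1.0, 1.0], size=n)

# Assumed kernel for the toy check (any positive definite kernel works here).
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq)

steps = int(T / dt)
g = np.zeros(n)                      # g_0 = 0 on the training points
Q_hist = np.zeros((steps, n))
for k in range(steps):
    Q = C * K @ (((y * g) < 1) * y)  # Q(t) = C * sum_i 1(y_i g_t(x_i) < 1) y_i K(., x_i)
    Q_hist[k] = Q
    g = g + dt * (-g + Q)            # forward Euler step of dg/dt = -g + Q(t), Eq. (27)

# Variation-of-constants formula, Eq. (28), with the integral as a Riemann sum.
ts = np.arange(steps) * dt
g_closed = np.exp(-T) * (Q_hist * np.exp(ts)[:, None] * dt).sum(0)
print(np.max(np.abs(g - g_closed)))  # small: the two agree up to O(dt) discretization error
```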

Since the loss function is strongly convex, \beta converges to the global optimizer in this infinitesimally small learning rate setting. This can be seen from

d(L(βt)L(β))dt=dL(βt)dt=L(βt)βtdβtdt=βL(βt),βL(βt)=βL(βt)2.\frac{d\left(L(\beta_{t})-L(\beta^{*})\right)}{dt}=\frac{dL(\beta_{t})}{dt}=\frac{\partial L(\beta_{t})}{\partial\beta_{t}}\frac{d\beta_{t}}{dt}=\left\langle\nabla_{\beta}L(\beta_{t}),-\nabla_{\beta}L(\beta_{t})\right\rangle=-\left\lVert\nabla_{\beta}L(\beta_{t})\right\rVert^{2}. (29)

We see that L(\beta_{t}) is always decreasing. Since L(\beta) is strongly convex and thus bounded from below, by the monotone convergence theorem L(\beta_{t}) always converges. By Lemma C.1, we have the Polyak-Łojasiewicz (PL) inequality,

L(βt)L(β)12βL(βt)2L(\beta_{t})-L(\beta^{*})\leq\frac{1}{2}\left\lVert\nabla_{\beta}L(\beta_{t})\right\rVert^{2} (30)

Combining with the above, we have

d(L(βt)L(β))dt2(L(βt)L(β)).\frac{d\left(L(\beta_{t})-L(\beta^{*})\right)}{dt}\leq-2\left(L(\beta_{t})-L(\beta^{*})\right). (31)

Solving the equation, we get

L(βt)L(β)e2t(L(β0)L(β)).L(\beta_{t})-L(\beta^{*})\leq e^{-2t}\left(L(\beta_{0})-L(\beta^{*})\right). (32)

Thus we have a linear convergence rate.

Now, let us assume that g_{T}(x) converges and examine its limit as T\rightarrow\infty. As T\rightarrow\infty, e^{-T}g_{0}(x)\rightarrow 0, so

gT(x)eT0TQ(t)et𝑑tg_{T}(x)\rightarrow e^{-T}\int_{0}^{T}Q(t)e^{t}\,dt (33)

Q(t) changes over time because g_{t}(x) changes. Suppose Q(t) keeps changing until some time T_{1} and stays constant, Q(t)=Q, after T_{1},

limTgT(x)=eT0T1Q(t)et𝑑t+eTT1TQet𝑑t.\begin{split}\lim_{T\rightarrow\infty}g_{T}(x)=e^{-T}\int_{0}^{T_{1}}Q(t)e^{t}\,dt+e^{-T}\int_{T_{1}}^{T}Qe^{t}\,dt.\end{split} (34)

As T\rightarrow\infty, the first term on the right-hand side converges to 0, so

limTgT(x)eTT1TQet𝑑t=eTT1Tet𝑑tQ=eT(eTeT1)QQ=Ci=1n𝟙(yigT(xi)<1)yiK(x,xi).\begin{split}\lim_{T\rightarrow\infty}g_{T}(x)&\rightarrow e^{-T}\int_{T_{1}}^{T}Qe^{t}\,dt\\ &=e^{-T}\int_{T_{1}}^{T}e^{t}\,dt\cdot Q\\ &=e^{-T}(e^{T}-e^{T_{1}})\cdot Q\\ &\rightarrow Q\\ &=C\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{T}(x_{i})<1)\cdot y_{i}K(x,x_{i}).\end{split} (35)

C.2 Discrete Dynamics of SVM

Let \eta\in(0,1) be the learning rate. The subgradient descent update at time t is

βt+1βt=ηβL(βt).\beta_{t+1}-\beta_{t}=-\eta\nabla_{\beta}L(\beta_{t}).\\ (36)

The dynamics of gt(x)g_{t}(x) is

gt+1(x)gt(x)=βt+1βt,Φ(x)=ηβt+ηCi=1n𝟙(yigt(xi)<1)yiΦ(xi),Φ(x)=ηgt(x)+ηCi=1n𝟙(yigt(xi)<1)yiK(x,xi).\begin{split}g_{t+1}(x)-g_{t}(x)&=\left\langle\beta_{t+1}-\beta_{t},\Phi(x)\right\rangle\\ &=\left\langle-\eta\beta_{t}+\eta C\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{t}(x_{i})<1)y_{i}\Phi(x_{i}),\Phi(x)\right\rangle\\ &=-\eta g_{t}(x)+\eta C\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{t}(x_{i})<1)y_{i}K(x,x_{i}).\end{split} (37)

Denote the second term as Q(t)=\eta C\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{t}(x_{i})<1)y_{i}K(x,x_{i}), which changes over time. The model output g_{T}(x) at some time T is

gT(x)=(1η)gT1(x)+Q(T1)=(1η)((1η)gT2(x)+Q(T2))+Q(T1)=(1η)Tg0(x)+t=0T1(1η)T1tQ(t)=(1η)T(g0(x)+t=0T1(1η)t1Q(t)).\begin{split}g_{T}(x)&=(1-\eta)g_{T-1}(x)+Q(T-1)\\ &=(1-\eta)\biggl{(}(1-\eta)g_{T-2}(x)+Q(T-2)\biggr{)}+Q(T-1)\\ &=(1-\eta)^{T}g_{0}(x)+\sum_{t=0}^{T-1}(1-\eta)^{T-1-t}Q(t)\\ &=(1-\eta)^{T}\biggl{(}g_{0}(x)+\sum_{t=0}^{T-1}(1-\eta)^{-t-1}Q(t)\biggr{)}.\end{split} (38)

The convergence of subgradient descent usually requires the additional assumption that the norm of the subgradient is bounded; we refer readers to [47, 11, 9] for proofs. Here, let us assume that subgradient descent converges to the global optimizer and that Q(t) keeps changing until some time T_{1} and stays constant, Q(t)=Q, after T_{1}. As T\rightarrow\infty,

gT(x)t=0T1(1η)T1tQ(t)=t=0T11(1η)T1tQ(t)+t=T1T1(1η)T1tQt=T1T1(1η)T1tQ=t=T1T1(1η)T1tQ=(1η)TT1+1ηQ.\begin{split}g_{T}(x)&\rightarrow\sum_{t=0}^{T-1}(1-\eta)^{T-1-t}Q(t)\\ &=\sum_{t=0}^{T_{1}-1}(1-\eta)^{T-1-t}Q(t)+\sum_{t=T_{1}}^{T-1}(1-\eta)^{T-1-t}Q\\ &\rightarrow\sum_{t=T_{1}}^{T-1}(1-\eta)^{T-1-t}Q\\ &=\sum_{t=T_{1}}^{T-1}(1-\eta)^{T-1-t}Q\\ &=\frac{-(1-\eta)^{T-T_{1}}+1}{\eta}Q.\end{split} (39)

Since \eta\in(0,1), we have (1-\eta)^{T-T_{1}}\rightarrow 0 as T\rightarrow\infty, so

gT(x)1ηQ=Ci=1n𝟙(yigT(xi)<1)yiK(x,xi).\begin{split}g_{T}(x)&\rightarrow\frac{1}{\eta}Q\\ &=C\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{T}(x_{i})<1)y_{i}K(x,x_{i}).\\ \end{split} (40)
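A quick numerical check of the geometric-series step used in Eq. (39), and of the limit 1/\eta that gives Eq. (40):

```python
import numpy as np

eta, T1, T = 0.1, 50, 2000
lhs = sum((1 - eta) ** (T - 1 - t) for t in range(T1, T))   # sum_{t=T1}^{T-1} (1-eta)^{T-1-t}
rhs = (1 - (1 - eta) ** (T - T1)) / eta                     # closed form from Eq. (39)
print(lhs, rhs, 1 / eta)   # lhs == rhs, and both approach 1/eta as T grows
```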

Appendix D Dynamics and Convergence Rate of Neural Network Trained by Soft Margin Loss

D.1 Continuous Dynamics of NN

In the limit of the learning rate \eta\rightarrow 0, the subgradient descent update, which can be written as

wt+1wtη=wL(wt),\frac{w_{t+1}-w_{t}}{\eta}=-\nabla_{w}L(w_{t}), (41)

becomes a differential equation

dwtdt=wL(wt).\frac{dw_{t}}{dt}=-\nabla_{w}L(w_{t}). (42)

This is known as gradient flow [2]. Then for any differentiable function ft(x)f_{t}(x),

dft(x)dt=j=1pft(x)wjdwjdt,\frac{df_{t}(x)}{dt}=\sum_{j=1}^{p}\frac{\partial f_{t}(x)}{\partial w_{j}}\frac{dw_{j}}{dt}, (43)

where pp is the number of parameters. Replacing dwjdt\frac{dw_{j}}{dt} by its subgradient descent expression:

dft(x)dt=j=1pft(x)wj(L(wt)wj).\frac{df_{t}(x)}{dt}=\sum_{j=1}^{p}\frac{\partial f_{t}(x)}{\partial w_{j}}(-\frac{\partial L(w_{t})}{\partial w_{j}}). (44)

And we know

L(wt)wj=wj𝟙(wjW(L+1))+i=1nLhft(xi)ft(xi)wj.\frac{\partial L(w_{t})}{\partial w_{j}}=w_{j}\mathbbm{1}(w_{j}\in W^{(L+1)})+\sum_{i=1}^{n}\frac{\partial L_{h}}{\partial f_{t}(x_{i})}\frac{\partial f_{t}(x_{i})}{\partial w_{j}}. (45)

where \mathbbm{1}(w_{j}\in W^{(L+1)}) equals 1 if the parameter w_{j} is in the last layer W^{(L+1)} and 0 otherwise. Combining the above,

dft(x)dt=j=1pft(x)wj(wj𝟙(wjW(L+1))i=1nLhft(xi)ft(xi)wj).\frac{df_{t}(x)}{dt}=\sum_{j=1}^{p}\frac{\partial f_{t}(x)}{\partial w_{j}}\biggl{(}-w_{j}\mathbbm{1}(w_{j}\in W^{(L+1)})-\sum_{i=1}^{n}\frac{\partial L_{h}}{\partial f_{t}(x_{i})}\frac{\partial f_{t}(x_{i})}{\partial w_{j}}\biggr{)}. (46)

Rearranging terms:

dft(x)dt=k=1pL+1ft(x)Wk(L+1)Wk(L+1)i=1nLhft(xi)j=1pft(x)wjft(xi)wj,\frac{df_{t}(x)}{dt}=-\sum_{k=1}^{p_{L+1}}\frac{\partial f_{t}(x)}{\partial W^{(L+1)}_{k}}W^{(L+1)}_{k}-\sum_{i=1}^{n}\frac{\partial L_{h}}{\partial f_{t}(x_{i})}\sum_{j=1}^{p}\frac{\partial f_{t}(x)}{\partial w_{j}}\frac{\partial f_{t}(x_{i})}{\partial w_{j}}, (47)

where p_{L+1} is the number of parameters in the last ((L+1)-th) layer. The first term on the right-hand side is

k=1pL+1ft(x)Wk(L+1)Wk(L+1)=ft(x)W(L+1),W(L+1)=αt(L)(x),W(L+1)=ft(x).\sum_{k=1}^{p_{L+1}}\frac{\partial f_{t}(x)}{\partial W^{(L+1)}_{k}}W^{(L+1)}_{k}=\left\langle\frac{\partial f_{t}(x)}{\partial W^{(L+1)}},W^{(L+1)}\right\rangle=\left\langle\alpha_{t}^{(L)}(x),W^{(L+1)}\right\rangle=f_{t}(x). (48)

Applying L_{h}^{\prime}(y_{i},f_{t}(x_{i}))=\frac{\partial L_{h}}{\partial f_{t}(x_{i})}, the subgradient of the hinge loss, and the definition of the tangent kernel (2.1), the second term is

i=1nLhft(xi)j=1pft(x)wjft(xi)wj=i=1nLh(yi,ft(xi))Θ^(wt;x,xi).-\sum_{i=1}^{n}\frac{\partial L_{h}}{\partial f_{t}(x_{i})}\sum_{j=1}^{p}\frac{\partial f_{t}(x)}{\partial w_{j}}\frac{\partial f_{t}(x_{i})}{\partial w_{j}}=-\sum_{i=1}^{n}L_{h}^{\prime}(y_{i},f_{t}(x_{i}))\hat{\Theta}(w_{t};x,x_{i}). (49)

Thus the equation becomes

dft(x)dt=ft(x)i=1nLh(yi,ft(xi))Θ^(wt;x,xi).\frac{df_{t}(x)}{dt}=-f_{t}(x)-\sum_{i=1}^{n}L_{h}^{\prime}(y_{i},f_{t}(x_{i}))\hat{\Theta}(w_{t};x,x_{i}). (50)

Taking L_{h}^{\prime}(y_{i},f_{t}(x_{i}))=-Cy_{i}\mathbbm{1}(y_{i}f_{t}(x_{i})<1), we obtain

dft(x)dt=ft(x)+Ci=1n𝟙(yift(xi)<1)yiΘ^(wt;x,xi).\frac{df_{t}(x)}{dt}=-f_{t}(x)+C\sum_{i=1}^{n}\mathbbm{1}(y_{i}f_{t}(x_{i})<1)y_{i}\hat{\Theta}(w_{t};x,x_{i}). (51)
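The empirical tangent kernel \hat{\Theta}(w_{t};x,x_{i}) appearing in Eq. (51) can be computed directly from its definition, \hat{\Theta}(w;x,x^{\prime})=\sum_{j}\frac{\partial f(x)}{\partial w_{j}}\frac{\partial f(x^{\prime})}{\partial w_{j}}. Below is a minimal PyTorch sketch for a scalar-output network; the architecture and the standard (non-NTK) parameterization are only for illustration.

```python
import torch

def empirical_ntk(model, x1, x2):
    """Empirical tangent kernel Theta_hat(w; x1, x2) = <grad_w f(w, x1), grad_w f(w, x2)>
    for a scalar-output model, computed with autograd."""
    params = [p for p in model.parameters() if p.requires_grad]

    def grad_vec(x):
        out = model(x.unsqueeze(0)).squeeze()
        grads = torch.autograd.grad(out, params)
        return torch.cat([g.reshape(-1) for g in grads])

    return torch.dot(grad_vec(x1), grad_vec(x2)).item()

# A small fully connected ReLU network, purely as an illustration.
d, m = 10, 512
model = torch.nn.Sequential(
    torch.nn.Linear(d, m), torch.nn.ReLU(),
    torch.nn.Linear(m, m), torch.nn.ReLU(),
    torch.nn.Linear(m, 1),
)
x, xp = torch.randn(d), torch.randn(d)
print(empirical_ntk(model, x, xp))
```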

D.2 Additional Notations

Denote Xd×nX\in\mathbb{R}^{d\times n} as the training data. Denote ft=ft(X)nf_{t}=f_{t}(X)\in\mathbb{R}^{n} and gt=gt(X)ng_{t}=g_{t}(X)\in\mathbb{R}^{n} as the outputs of NN and SVM on the training data. Denote Θ^(wt)=Θ^(wt;X,X)n×n\hat{\Theta}(w_{t})=\hat{\Theta}(w_{t};X,X)\in\mathbb{R}^{n\times n} as the tangent kernel evaluated on the training data at time tt, and l(ft)nl^{\prime}(f_{t})\in\mathbb{R}^{n} as the derivative of the loss function w.r.t. ftf_{t}. Denote wftn×p\nabla_{w}f_{t}\in\mathbb{R}^{n\times p} as the Jacobian and we have Θ^(wt)=wftwftT\hat{\Theta}(w_{t})=\nabla_{w}f_{t}\nabla_{w}f_{t}^{T}. Denote λ0=λmin(Θ^(wt))\lambda_{0}=\lambda_{min}\left(\hat{\Theta}(w_{t})\right) as the smallest eigenvalue of Θ^(wt)\hat{\Theta}(w_{t}). Then we can write the dynamics of NN as

ddtft=ftΘ^(wt)l(ft).\displaystyle\frac{d}{dt}f_{t}=-f_{t}-\hat{\Theta}(w_{t})l^{\prime}(f_{t}).

Let vpv\in\mathbb{R}^{p} with vj=𝟙(wjW(L+1))v_{j}=\mathbbm{1}(w_{j}\in W^{(L+1)}). We can write the gradient as

wL(wt)=wtv+wftTl(ft).\displaystyle\nabla_{w}L(w_{t})=w_{t}\odot v+\nabla_{w}f_{t}^{T}l^{\prime}(f_{t}).

D.3 Convergence of NN

The loss of NN is

L(wt)=12Wt(L+1)2+inl(ft(xi),yi),L(w_{t})=\frac{1}{2}\left\lVert W_{t}^{(L+1)}\right\rVert^{2}+\sum_{i}^{n}l(f_{t}(x_{i}),y_{i}),

where l(f,y)=C\max(0,1-yf). The dynamics of the loss are

dL(wt)dt=L(wt)wtdwtdt=wL(wt),wL(wt)=wL(wt)2.\frac{dL(w_{t})}{dt}=\frac{\partial L(w_{t})}{\partial w_{t}}\frac{dw_{t}}{dt}=\left\langle\nabla_{w}L(w_{t}),-\nabla_{w}L(w_{t})\right\rangle=-\left\lVert\nabla_{w}L(w_{t})\right\rVert^{2}.

Since L(w_{t})\geq 0 is bounded from below, by the monotone convergence theorem L(w_{t}) always converges. Applying Lemma D.1, we have

d(L(wt)L(w))dt=wL(wt)22(L(wt)L(w)).\displaystyle\frac{d\left(L(w_{t})-L(w^{*})\right)}{dt}=-\left\lVert\nabla_{w}L(w_{t})\right\rVert^{2}\leq-2\left(L(w_{t})-L(w^{*})\right).

Thus we have linear convergence, the same as for the SVM:

L(wt)L(w)e2t(L(w0)L(w)).L(w_{t})-L(w^{*})\leq e^{-2t}\left(L(w_{0})-L(w^{*})\right).
Lemma D.1 (PL inequality of NN for soft margin loss).

Assume \lambda_{0}\geq\frac{2}{C}; then L(w_{t}) satisfies the PL condition

wL(wt)22(L(wt)L(w)).\left\lVert\nabla_{w}L(w_{t})\right\rVert^{2}\geq 2\left(L(w_{t})-L(w^{*})\right).
Proof.
wL(wt)2\displaystyle\left\lVert\nabla_{w}L(w_{t})\right\rVert^{2} =wtv+wftTl(ft),wtv+wftTl(ft)\displaystyle=\left\langle w_{t}\odot v+\nabla_{w}f_{t}^{T}l^{\prime}(f_{t}),w_{t}\odot v+\nabla_{w}f_{t}^{T}l^{\prime}(f_{t})\right\rangle
=wtv,wtv+wftTl(ft),wftTl(ft)+2wtv,wftTl(ft)\displaystyle=\left\langle w_{t}\odot v,w_{t}\odot v\right\rangle+\left\langle\nabla_{w}f_{t}^{T}l^{\prime}(f_{t}),\nabla_{w}f_{t}^{T}l^{\prime}(f_{t})\right\rangle+2\left\langle w_{t}\odot v,\nabla_{w}f_{t}^{T}l^{\prime}(f_{t})\right\rangle
=Wt(L+1)2+l(ft)TΘ^(wt)l(ft)+2Wt(L+1),W(L+1)ftTl(ft)\displaystyle=\left\lVert W_{t}^{(L+1)}\right\rVert^{2}+l^{\prime}(f_{t})^{T}\hat{\Theta}(w_{t})l^{\prime}(f_{t})+2\left\langle W_{t}^{(L+1)},\nabla_{W^{(L+1)}}f_{t}^{T}l^{\prime}(f_{t})\right\rangle
=Wt(L+1)2+l(ft)TΘ^(wt)l(ft)+2ftTl(ft).\displaystyle=\left\lVert W_{t}^{(L+1)}\right\rVert^{2}+l^{\prime}(f_{t})^{T}\hat{\Theta}(w_{t})l^{\prime}(f_{t})+2f_{t}^{T}l^{\prime}(f_{t}).

We want the loss to satisfy the PL condition \left\lVert\nabla_{w}L(w_{t})\right\rVert^{2}\geq 2\left(L(w_{t})-L(w^{*})\right).

wL(wt)22(L(wt)L(w))\displaystyle\left\lVert\nabla_{w}L(w_{t})\right\rVert^{2}-2\left(L(w_{t})-L(w^{*})\right)
=wL(wt)22L(wt)+2L(w)\displaystyle=\left\lVert\nabla_{w}L(w_{t})\right\rVert^{2}-2L(w_{t})+2L(w^{*})
=l(ft)TΘ^(wt)l(ft)+2ftTl(ft)2inl(ft(xi),yi)+2L(w)\displaystyle=l^{\prime}(f_{t})^{T}\hat{\Theta}(w_{t})l^{\prime}(f_{t})+2f_{t}^{T}l^{\prime}(f_{t})-2\sum_{i}^{n}l(f_{t}(x_{i}),y_{i})+2L(w^{*})
λ0l(ft)2+2ftTl(ft)2inl(ft(xi),yi)+2L(w),\displaystyle\geq\lambda_{0}\left\lVert l^{\prime}(f_{t})\right\rVert^{2}+2f_{t}^{T}l^{\prime}(f_{t})-2\sum_{i}^{n}l(f_{t}(x_{i}),y_{i})+2L(w^{*}),

where the last inequality uses the quadratic form bound l^{\prime}(f_{t})^{T}\hat{\Theta}(w_{t})l^{\prime}(f_{t})\geq\lambda_{0}\left\lVert l^{\prime}(f_{t})\right\rVert^{2}. For the hinge loss l(f,y)=C\max(0,1-yf)=C(1-yf)\mathbbm{1}(1-yf>0) and l^{\prime}(f,y)=-Cy\mathbbm{1}(1-yf>0),

wL(wt)22(L(wt)L(w))\displaystyle\left\lVert\nabla_{w}L(w_{t})\right\rVert^{2}-2\left(L(w_{t})-L(w^{*})\right)
λ0l(ft)2+2ftTl(ft)2inl(ft(xi),yi)+2L(w)\displaystyle\geq\lambda_{0}\left\lVert l^{\prime}(f_{t})\right\rVert^{2}+2f_{t}^{T}l^{\prime}(f_{t})-2\sum_{i}^{n}l(f_{t}(x_{i}),y_{i})+2L(w^{*})
=λ0inl(ft(xi),yi)2+2inft(xi)l(ft(xi),yi)2inl(ft(xi),yi)+2L(w)\displaystyle=\lambda_{0}\sum_{i}^{n}{l^{\prime}(f_{t}(x_{i}),y_{i})}^{2}+2\sum_{i}^{n}f_{t}(x_{i})l^{\prime}(f_{t}(x_{i}),y_{i})-2\sum_{i}^{n}l(f_{t}(x_{i}),y_{i})+2L(w^{*})
=λ0inC2𝟙(1yift(xi)>0)2inCyift(xi)𝟙(1yift(xi)>0)\displaystyle=\lambda_{0}\sum_{i}^{n}C^{2}\mathbbm{1}(1-y_{i}f_{t}(x_{i})>0)-2\sum_{i}^{n}Cy_{i}f_{t}(x_{i})\mathbbm{1}(1-y_{i}f_{t}(x_{i})>0)
2inC(1yift(xi))𝟙(1yift(xi)>0)+2L(w)\displaystyle\qquad-2\sum_{i}^{n}C(1-y_{i}f_{t}(x_{i}))\mathbbm{1}(1-y_{i}f_{t}(x_{i})>0)+2L(w^{*})
=Cin𝟙(1yift(xi)>0)(Cλ02)+2L(w).\displaystyle=C\sum_{i}^{n}\mathbbm{1}(1-y_{i}f_{t}(x_{i})>0)\left(C\lambda_{0}-2\right)+2L(w^{*}).

Since L(w^{*})\geq 0, as long as \lambda_{0}\geq 2/C the loss L(w_{t}) satisfies the PL condition \left\lVert\nabla_{w}L(w_{t})\right\rVert^{2}\geq 2\left(L(w_{t})-L(w^{*})\right). The condition \lambda_{0}\geq 2/C can be guaranteed in a parameter ball when \frac{2}{C}<\lambda_{min}\left(\hat{\Theta}(w_{0})\right) by using a sufficiently wide NN [34]. ∎

D.4 Discrete Dynamics of NN

The subgradient descent update is

wt+1wt=ηwL(wt).w_{t+1}-w_{t}=-\eta\nabla_{w}L(w_{t}). (52)

We consider the setting of a constant NTK, \hat{\Theta}(w_{t};x,x_{i})\rightarrow\hat{\Theta}(w_{0};x,x_{i}), or equivalently a linear model. As proved in Proposition 2.2 of [33], the tangent kernel of a differentiable function f(w,x) is constant if and only if f(w,x) is linear in w. Taking the Taylor expansion of f(w_{t+1},x) at w_{t},

f(wt+1,x)f(wt,x)=f(wt,x)+wf(wt,x),wt+1wtf(wt,x)=wf(wt,x),ηwL(wt)=wf(wt,x),η(wv+i=1nLh(yi,ft(xi))wft(xi))=ηft(x)+ηi=1nLh(yi,ft(xi))Θ^(wt;x,xi)=ηft(x)+ηCi=1n𝟙(yift(xi)<1)yiΘ^(wt;x,xi)ηft(x)+ηCi=1n𝟙(yift(xi)<1)yiΘ^(w0;x,xi).\begin{split}&\quad f(w_{t+1},x)-f(w_{t},x)\\ &=f(w_{t},x)+\left\langle\nabla_{w}f(w_{t},x),w_{t+1}-w_{t}\right\rangle-f(w_{t},x)\\ &=\left\langle\nabla_{w}f(w_{t},x),-\eta\nabla_{w}L(w_{t})\right\rangle\\ &=\left\langle\nabla_{w}f(w_{t},x),-\eta\Big{(}wv+\sum_{i=1}^{n}L_{h}^{\prime}(y_{i},f_{t}(x_{i}))\nabla_{w}f_{t}(x_{i})\Big{)}\right\rangle\\ &=-\eta f_{t}(x)+\eta\sum_{i=1}^{n}L_{h}^{\prime}(y_{i},f_{t}(x_{i}))\hat{\Theta}(w_{t};x,x_{i})\\ &=-\eta f_{t}(x)+\eta C\sum_{i=1}^{n}\mathbbm{1}(y_{i}f_{t}(x_{i})<1)y_{i}\hat{\Theta}(w_{t};x,x_{i})\\ &\rightarrow-\eta f_{t}(x)+\eta C\sum_{i=1}^{n}\mathbbm{1}(y_{i}f_{t}(x_{i})<1)y_{i}\hat{\Theta}(w_{0};x,x_{i}).\end{split} (53)

Appendix E Proof of Theorem 3.4

Proof.

We prove the constancy of the tangent kernel by adopting the results of [35].

Lemma E.1 (Theorem 3.3 in [33]; Hessian norm is controlled by the minimum hidden layer width).

Consider a general neural network f(w,x)f(w,x) of the form Eq. (1), which can be a fully connected network, CNN, ResNet or a mixture of these types. Let m be the minimum of the hidden layer widths, i.e., m=minl[L]mlm=\min_{l\in[L]}m_{l}. Given any fixed R>0R>0, and any wB(w0;R):={w:ww0R}w\in B(w_{0};R):=\{w:\left\lVert w-w_{0}\right\rVert\leq R\}, with high probability over the initialization, the Hessian spectral norm satisfies the following:

H(w)=O(R3Llnmm).\left\lVert H(w)\right\rVert=O(\frac{R^{3L}\ln{m}}{\sqrt{m}}). (54)
Lemma E.2 (Proposition 2.3 in [33]; Small Hessian norm \Rightarrow Small change of tangent kernel).

Given a point w0pw_{0}\in\mathbb{R}^{p} and a ball B(w0;R):={w:ww0R}B(w_{0};R):=\{w:\left\lVert w-w_{0}\right\rVert\leq R\} with fixed radius R>0R>0, if the Hessian matrix satisfies H(w)<ϵ\left\lVert H(w)\right\rVert<\epsilon, where ϵ>0\epsilon>0, for all wB(w0,R)w\in B(w_{0},R), then the tangent kernel Θ^(w;x,x)\hat{\Theta}(w;x,x^{\prime}) of the model, as a function of ww, satisfies

|Θ^(w;x,x)Θ^(w0;x,x)|=O(ϵR),wB(w0;R),x,xd.\left\lvert\hat{\Theta}(w;x,x^{\prime})-\hat{\Theta}(w_{0};x,x^{\prime})\right\rvert=O(\epsilon R),\quad\forall w\in B(w_{0};R),\ \forall x,x^{\prime}\in\mathbb{R}^{d}. (55)

Applying the above two lemmas, we see that in the limit m\rightarrow\infty, the spectral norm of the Hessian converges to 0 and the tangent kernel stays constant in the ball B(w_{0};R).

Corollary E.2.1 (Constancy of the tangent kernel).

Consider a general neural network f(w,x)f(w,x) of the form Eq. (1). Given a point w0pw_{0}\in\mathbb{R}^{p} and a ball B(w0;R):={w:ww0R}B(w_{0};R):=\{w:\left\lVert w-w_{0}\right\rVert\leq R\} with fixed radius R>0R>0, in the infinite width limit, mm\rightarrow\infty,

\lim_{m\rightarrow\infty}\hat{\Theta}(w;x,x^{\prime})=\hat{\Theta}(w_{0};x,x^{\prime}),\quad\forall w\in B(w_{0};R),\ \forall x,x^{\prime}\in\mathbb{R}^{d}. (56)

Thus we prove the constancy of the tangent kernel in the infinite-width limit. It is then easy to check that the dynamics of the infinitely wide NN coincide with the dynamics of the SVM with constant NTK. ∎

Appendix F Bound the difference between SVM and NN

Assume the loss l is \rho-Lipschitz and \beta_{l}-smooth in the first argument (i.e., the model output). Assume f_{0}(x)=g_{0}(x) for any x.

F.1 Bound the difference on the Training Data

The dynamics of the NN and SVM are

ddtft=λftΘ^(wt)l(ft)\displaystyle\frac{d}{dt}f_{t}=-\lambda f_{t}-\hat{\Theta}(w_{t})l^{\prime}(f_{t})
ddtgt=λgtΘ^(w0)l(gt)\displaystyle\frac{d}{dt}g_{t}=-\lambda g_{t}-\hat{\Theta}(w_{0})l^{\prime}(g_{t})

The dynamics of the difference between them is

ddt(ftgt)=λ(ftgt)(Θ^(wt)l(ft)Θ^(w0)l(gt))\frac{d}{dt}\left(f_{t}-g_{t}\right)=-\lambda\left(f_{t}-g_{t}\right)-\left(\hat{\Theta}(w_{t})l^{\prime}(f_{t})-\hat{\Theta}(w_{0})l^{\prime}(g_{t})\right)

The solution of the above differential equation at time TT is

fTgT\displaystyle f_{T}-g_{T} =eλT(f0g0)eλT0T(Θ^(wt)l(ft)Θ^(w0)l(gt))eλt𝑑t\displaystyle=e^{-\lambda T}\left(f_{0}-g_{0}\right)-e^{-\lambda T}\int_{0}^{T}\left(\hat{\Theta}(w_{t})l^{\prime}(f_{t})-\hat{\Theta}(w_{0})l^{\prime}(g_{t})\right)e^{\lambda t}dt
=eλT0T(Θ^(w0)l(gt)Θ^(wt)l(ft))eλt𝑑t\displaystyle=e^{-\lambda T}\int_{0}^{T}\left(\hat{\Theta}(w_{0})l^{\prime}(g_{t})-\hat{\Theta}(w_{t})l^{\prime}(f_{t})\right)e^{\lambda t}dt

using f0=g0f_{0}=g_{0}. Thus

fTgT\displaystyle\left\lVert f_{T}-g_{T}\right\rVert eλT0TΘ^(w0)l(gt)Θ^(wt)l(ft)eλt𝑑t\displaystyle\leq e^{-\lambda T}\int_{0}^{T}\left\lVert\hat{\Theta}(w_{0})l^{\prime}(g_{t})-\hat{\Theta}(w_{t})l^{\prime}(f_{t})\right\rVert e^{\lambda t}dt

Since ll is βl\beta_{l} smooth,

Θ^(w0)l(gt)Θ^(wt)l(ft)\displaystyle\left\lVert\hat{\Theta}(w_{0})l^{\prime}(g_{t})-\hat{\Theta}(w_{t})l^{\prime}(f_{t})\right\rVert =Θ^(w0)l(gt)Θ^(w0)l(ft)+Θ^(w0)l(ft)Θ^(wt)l(ft)\displaystyle=\left\lVert\hat{\Theta}(w_{0})l^{\prime}(g_{t})-\hat{\Theta}(w_{0})l^{\prime}(f_{t})+\hat{\Theta}(w_{0})l^{\prime}(f_{t})-\hat{\Theta}(w_{t})l^{\prime}(f_{t})\right\rVert
=Θ^(w0)(l(gt)l(ft))+(Θ^(w0)Θ^(wt))l(ft)\displaystyle=\left\lVert\hat{\Theta}(w_{0})\left(l^{\prime}(g_{t})-l^{\prime}(f_{t})\right)+\left(\hat{\Theta}(w_{0})-\hat{\Theta}(w_{t})\right)l^{\prime}(f_{t})\right\rVert
Θ^(w0)(l(gt)l(ft))+(Θ^(w0)Θ^(wt))l(ft)\displaystyle\leq\left\lVert\hat{\Theta}(w_{0})\left(l^{\prime}(g_{t})-l^{\prime}(f_{t})\right)\right\rVert+\left\lVert\left(\hat{\Theta}(w_{0})-\hat{\Theta}(w_{t})\right)l^{\prime}(f_{t})\right\rVert
βlΘ^(w0)gtft+ρnΘ^(w0)Θ^(wt)\displaystyle\leq\beta_{l}\left\lVert\hat{\Theta}(w_{0})\right\rVert\left\lVert g_{t}-f_{t}\right\rVert+\rho\sqrt{n}\left\lVert\hat{\Theta}(w_{0})-\hat{\Theta}(w_{t})\right\rVert

where l(ft)ρn\left\lVert l^{\prime}(f_{t})\right\rVert\leq\rho\sqrt{n}. Thus we have

fTgT\displaystyle\left\lVert f_{T}-g_{T}\right\rVert eλTβlΘ^(w0)0Tgtfteλt𝑑t+eλTρn0TΘ^(w0)Θ^(wt)eλt𝑑t\displaystyle\leq e^{-\lambda T}\beta_{l}\left\lVert\hat{\Theta}(w_{0})\right\rVert\int_{0}^{T}\left\lVert g_{t}-f_{t}\right\rVert e^{\lambda t}dt+e^{-\lambda T}\rho\sqrt{n}\int_{0}^{T}\left\lVert\hat{\Theta}(w_{0})-\hat{\Theta}(w_{t})\right\rVert e^{\lambda t}dt

Applying Grönwall's inequality,

fTgT\displaystyle\left\lVert f_{T}-g_{T}\right\rVert eλTρn0TΘ^(w0)Θ^(wt)eλt𝑑teeλTβlΘ^(w0)0Teλt𝑑t\displaystyle\leq e^{-\lambda T}\rho\sqrt{n}\int_{0}^{T}\left\lVert\hat{\Theta}(w_{0})-\hat{\Theta}(w_{t})\right\rVert e^{\lambda t}dt\cdot e^{e^{-\lambda T}\beta_{l}\left\lVert\hat{\Theta}(w_{0})\right\rVert\int_{0}^{T}e^{\lambda t}dt}
=eλTρn0TΘ^(w0)Θ^(wt)eλt𝑑te1λ(1eλT)βlΘ^(w0)\displaystyle=e^{-\lambda T}\rho\sqrt{n}\int_{0}^{T}\left\lVert\hat{\Theta}(w_{0})-\hat{\Theta}(w_{t})\right\rVert e^{\lambda t}dt\cdot e^{\frac{1}{\lambda}(1-e^{-\lambda T})\beta_{l}\left\lVert\hat{\Theta}(w_{0})\right\rVert}
=eλTe1λ(1eλT)βlΘ^(w0)ρn0TΘ^(w0)Θ^(wt)eλt𝑑t\displaystyle=e^{-\lambda T}e^{\frac{1}{\lambda}(1-e^{-\lambda T})\beta_{l}\left\lVert\hat{\Theta}(w_{0})\right\rVert}\rho\sqrt{n}\int_{0}^{T}\left\lVert\hat{\Theta}(w_{0})-\hat{\Theta}(w_{t})\right\rVert e^{\lambda t}dt

By Lemma E.1 and Lemma E.2, in a parameter ball B(w0;R)={w:ww0R}B(w_{0};R)=\{w:\left\lVert w-w_{0}\right\rVert\leq R\}, with high probability, |Θ^(w;x,x)Θ^(w0;x,x)|=O(R3L+1lnm/m)\left\lvert\hat{\Theta}(w;x,x^{\prime})-\hat{\Theta}(w_{0};x,x^{\prime})\right\rvert=O(R^{3L+1}\ln{m}/\sqrt{m}) w.r.t. mm. Then we have

Θ^(w0)Θ^(wt)Θ^(w0)Θ^(wt)F=O(R3L+1nlnmm)\displaystyle\left\lVert\hat{\Theta}(w_{0})-\hat{\Theta}(w_{t})\right\rVert\leq\left\lVert\hat{\Theta}(w_{0})-\hat{\Theta}(w_{t})\right\rVert_{F}=O(\frac{R^{3L+1}n\ln{m}}{\sqrt{m}})

Thus we have

fTgT\displaystyle\left\lVert f_{T}-g_{T}\right\rVert 1λ(1eλT)e(1eT)βlΘ^(w0)ρnO(R3L+1nlnmm)\displaystyle\leq\frac{1}{\lambda}(1-e^{-\lambda T})e^{(1-e^{-T})\beta_{l}\left\lVert\hat{\Theta}(w_{0})\right\rVert}\rho\sqrt{n}\cdot O(\frac{R^{3L+1}n\ln{m}}{\sqrt{m}})
=O(eβlΘ^(w0)R3L+1ρn32lnmλm)\displaystyle=O(\frac{e^{\beta_{l}\left\lVert\hat{\Theta}(w_{0})\right\rVert}R^{3L+1}\rho n^{\frac{3}{2}}\ln{m}}{\lambda\sqrt{m}})

F.2 Bound on the Test Data

For a test point x, the proof is similar to the training case. Denote \hat{\Theta}(w_{t};X,x)\in\mathbb{R}^{n} as the tangent kernel evaluated between the training data and the test point x. Recall

dft(x)dt=λft(x)Θ^(wt;X,x)Tl(ft)\displaystyle\frac{df_{t}(x)}{dt}=-\lambda f_{t}(x)-\hat{\Theta}(w_{t};X,x)^{T}l^{\prime}(f_{t})
dgt(x)dt=λgt(x)Θ^(w0;X,x)Tl(gt)\displaystyle\frac{dg_{t}(x)}{dt}=-\lambda g_{t}(x)-\hat{\Theta}(w_{0};X,x)^{T}l^{\prime}(g_{t})
ddt(ft(x)gt(x))=λ(ft(x)gt(x))(Θ^(wt;X,x)Tl(ft)Θ^(w0;X,x)Tl(gt))\displaystyle\frac{d}{dt}\left(f_{t}(x)-g_{t}(x)\right)=-\lambda\left(f_{t}(x)-g_{t}(x)\right)-\left(\hat{\Theta}(w_{t};X,x)^{T}l^{\prime}(f_{t})-\hat{\Theta}(w_{0};X,x)^{T}l^{\prime}(g_{t})\right)

The solution of the above differential equation is

fT(x)gT(x)\displaystyle f_{T}(x)-g_{T}(x) =eλT(f0g0)eλT0T(Θ^(wt;X,x)Tl(ft)Θ^(w0;X,x)Tl(gt))eλt𝑑t\displaystyle=e^{-\lambda T}\left(f_{0}-g_{0}\right)-e^{-\lambda T}\int_{0}^{T}\left(\hat{\Theta}(w_{t};X,x)^{T}l^{\prime}(f_{t})-\hat{\Theta}(w_{0};X,x)^{T}l^{\prime}(g_{t})\right)e^{\lambda t}dt
=eλT0T(Θ^(w0;X,x)Tl(gt)Θ^(wt;X,x)Tl(ft))eλt𝑑t\displaystyle=e^{-\lambda T}\int_{0}^{T}\left(\hat{\Theta}(w_{0};X,x)^{T}l^{\prime}(g_{t})-\hat{\Theta}(w_{t};X,x)^{T}l^{\prime}(f_{t})\right)e^{\lambda t}dt

using f0=g0f_{0}=g_{0}. Thus

fT(x)gT(x)\displaystyle\left\lVert f_{T}(x)-g_{T}(x)\right\rVert eλT0TΘ^(w0;X,x)Tl(gt)Θ^(wt;X,x)Tl(ft)eλt𝑑t\displaystyle\leq e^{-\lambda T}\int_{0}^{T}\left\lVert\hat{\Theta}(w_{0};X,x)^{T}l^{\prime}(g_{t})-\hat{\Theta}(w_{t};X,x)^{T}l^{\prime}(f_{t})\right\rVert e^{\lambda t}dt

Since ll is βl\beta_{l} smooth,

Θ^(w0;X,x)Tl(gt)Θ^(wt;X,x)Tl(ft)\displaystyle\left\lVert\hat{\Theta}(w_{0};X,x)^{T}l^{\prime}(g_{t})-\hat{\Theta}(w_{t};X,x)^{T}l^{\prime}(f_{t})\right\rVert
=Θ^(w0;X,x)Tl(gt)Θ^(w0;X,x)Tl(ft)+Θ^(w0;X,x)Tl(ft)Θ^(wt;X,x)Tl(ft)\displaystyle=\left\lVert\hat{\Theta}(w_{0};X,x)^{T}l^{\prime}(g_{t})-\hat{\Theta}(w_{0};X,x)^{T}l^{\prime}(f_{t})+\hat{\Theta}(w_{0};X,x)^{T}l^{\prime}(f_{t})-\hat{\Theta}(w_{t};X,x)^{T}l^{\prime}(f_{t})\right\rVert
=Θ^(w0;X,x)T(l(gt)l(ft))+(Θ^(w0;X,x)TΘ^(wt;X,x)T)l(ft)\displaystyle=\left\lVert\hat{\Theta}(w_{0};X,x)^{T}\left(l^{\prime}(g_{t})-l^{\prime}(f_{t})\right)+\left(\hat{\Theta}(w_{0};X,x)^{T}-\hat{\Theta}(w_{t};X,x)^{T}\right)l^{\prime}(f_{t})\right\rVert
Θ^(w0;X,x)T(l(gt)l(ft))+(Θ^(w0;X,x)TΘ^(wt;X,x)T)l(ft)\displaystyle\leq\left\lVert\hat{\Theta}(w_{0};X,x)^{T}\left(l^{\prime}(g_{t})-l^{\prime}(f_{t})\right)\right\rVert+\left\lVert\left(\hat{\Theta}(w_{0};X,x)^{T}-\hat{\Theta}(w_{t};X,x)^{T}\right)l^{\prime}(f_{t})\right\rVert
βlΘ^(w0;X,x)gtft+ρn(Θ^(w0;X,x)TΘ^(wt;X,x)T)\displaystyle\leq\beta_{l}\left\lVert\hat{\Theta}(w_{0};X,x)\right\rVert\left\lVert g_{t}-f_{t}\right\rVert+\rho\sqrt{n}\left\lVert\left(\hat{\Theta}(w_{0};X,x)^{T}-\hat{\Theta}(w_{t};X,x)^{T}\right)\right\rVert

where l(ft)ρn\left\lVert l^{\prime}(f_{t})\right\rVert\leq\rho\sqrt{n}. Thus we have

fT(x)gT(x)\displaystyle\left\lVert f_{T}(x)-g_{T}(x)\right\rVert
eλTβlΘ^(w0;X,x)0Tgtfteλt𝑑t+eλTρn0TΘ^(w0;X,x)TΘ^(wt;X,x)Teλt𝑑t\displaystyle\leq e^{-\lambda T}\beta_{l}\left\lVert\hat{\Theta}(w_{0};X,x)\right\rVert\int_{0}^{T}\left\lVert g_{t}-f_{t}\right\rVert e^{\lambda t}dt+e^{-\lambda T}\rho\sqrt{n}\int_{0}^{T}\left\lVert\hat{\Theta}(w_{0};X,x)^{T}-\hat{\Theta}(w_{t};X,x)^{T}\right\rVert e^{\lambda t}dt

Applying Grönwall's inequality,

fT(x)gT(x)\displaystyle\left\lVert f_{T}(x)-g_{T}(x)\right\rVert
eλTρn0TΘ^(w0;X,x)TΘ^(wt;X,x)Teλt𝑑teeλTβlΘ^(w0;X,x)0Teλt𝑑t\displaystyle\leq e^{-\lambda T}\rho\sqrt{n}\int_{0}^{T}\left\lVert\hat{\Theta}(w_{0};X,x)^{T}-\hat{\Theta}(w_{t};X,x)^{T}\right\rVert e^{\lambda t}dt\cdot e^{e^{-\lambda T}\beta_{l}\left\lVert\hat{\Theta}(w_{0};X,x)\right\rVert\int_{0}^{T}e^{\lambda t}dt}
=eλTρn0TΘ^(w0;X,x)TΘ^(wt;X,x)Teλt𝑑te1λ(1eλT)βlΘ^(w0;X,x)\displaystyle=e^{-\lambda T}\rho\sqrt{n}\int_{0}^{T}\left\lVert\hat{\Theta}(w_{0};X,x)^{T}-\hat{\Theta}(w_{t};X,x)^{T}\right\rVert e^{\lambda t}dt\cdot e^{\frac{1}{\lambda}(1-e^{-\lambda T})\beta_{l}\left\lVert\hat{\Theta}(w_{0};X,x)\right\rVert}
=eλTe1λ(1eλT)βlΘ^(w0;X,x)ρn0TΘ^(w0;X,x)TΘ^(wt;X,x)Teλt𝑑t\displaystyle=e^{-\lambda T}e^{\frac{1}{\lambda}(1-e^{-\lambda T})\beta_{l}\left\lVert\hat{\Theta}(w_{0};X,x)\right\rVert}\rho\sqrt{n}\int_{0}^{T}\left\lVert\hat{\Theta}(w_{0};X,x)^{T}-\hat{\Theta}(w_{t};X,x)^{T}\right\rVert e^{\lambda t}dt

By Lemma E.1 and Lemma E.2, in a parameter ball B(w0;R)={w:ww0R}B(w_{0};R)=\{w:\left\lVert w-w_{0}\right\rVert\leq R\}, with high probability, |Θ^(w;x,x)Θ^(w0;x,x)|=O(R3L+1lnm/m)\left\lvert\hat{\Theta}(w;x,x^{\prime})-\hat{\Theta}(w_{0};x,x^{\prime})\right\rvert=O(R^{3L+1}\ln{m}/\sqrt{m}). Then we have

Θ^(w0;X,x)TΘ^(wt;X,x)T=O(R3L+1nlnmm)\displaystyle\left\lVert\hat{\Theta}(w_{0};X,x)^{T}-\hat{\Theta}(w_{t};X,x)^{T}\right\rVert=O(\frac{R^{3L+1}\sqrt{n}\ln{m}}{\sqrt{m}})

Thus we have

fT(x)gT(x)\displaystyle\left\lVert f_{T}(x)-g_{T}(x)\right\rVert 1λ(1eλT)e(1eT)βlΘ^(w0;X,x)ρnO(R3L+1nlnmm)\displaystyle\leq\frac{1}{\lambda}(1-e^{-\lambda T})e^{(1-e^{-T})\beta_{l}\left\lVert\hat{\Theta}(w_{0};X,x)\right\rVert}\rho\sqrt{n}\cdot O(\frac{R^{3L+1}\sqrt{n}\ln{m}}{\sqrt{m}})
=O(eβlΘ^(w0;X,x)R3L+1ρnlnmλm)\displaystyle=O(\frac{e^{\beta_{l}\left\lVert\hat{\Theta}(w_{0};X,x)\right\rVert}R^{3L+1}\rho n\ln{m}}{\lambda\sqrt{m}})

Appendix G Finite-width Neural Networks are Kernel Machines

Inspired by [18], we can also show that every neural network trained by (sub)gradient descent with a loss function of the form (7) is approximately a kernel machine, without the infinite-width assumption.

Theorem G.1.

Suppose a neural network f(w,x), with f a differentiable function of w, is learned from a training set \{(x_{i},y_{i})\}_{i=1}^{n} by (sub)gradient descent with the loss function L(w)=\frac{\lambda}{2}\left\lVert W^{(L+1)}\right\rVert^{2}+\sum_{i=1}^{n}l(f(w,x_{i}),y_{i}) and gradient flow. Assume the sign of the loss derivative stays unchanged during training, i.e., \text{sign}(l^{\prime}(f_{t}(x_{i}),y_{i}))=\text{sign}(l^{\prime}(f_{0}(x_{i}),y_{i})) for all t\in[0,T]. Then at time T,

fT(x)=i=1naiK(x,xi)+b,f_{T}(x)=\sum_{i=1}^{n}a_{i}K(x,x_{i})+b, (57)

where

ai=sign(l(f0(xi),yi)),b=eλTf0(x),\displaystyle a_{i}=-\text{sign}(l^{\prime}(f_{0}(x_{i}),y_{i})),\qquad b=e^{-\lambda T}f_{0}(x),
K(x,xi)=eλT0T|l(ft(xi),yi)|Θ^(wt;x,xi)eλt𝑑t\displaystyle K(x,x_{i})=e^{-\lambda T}\int_{0}^{T}\left\lvert l^{\prime}(f_{t}(x_{i}),y_{i})\right\rvert\hat{\Theta}(w_{t};x,x_{i})e^{\lambda t}\,dt
Proof.

As we have derived, the neural network follows the dynamics of Eq. (9):

dft(x)dt=λft(x)i=1nl(ft(xi),yi)Θ^(wt;x,xi).\frac{df_{t}(x)}{dt}=-\lambda f_{t}(x)-\sum_{i=1}^{n}l^{\prime}(f_{t}(x_{i}),y_{i})\hat{\Theta}(w_{t};x,x_{i}). (58)

Note this is a first-order inhomogeneous linear differential equation with time-dependent terms. Denote Q(t)=-\sum_{i=1}^{n}l^{\prime}(f_{t}(x_{i}),y_{i})\hat{\Theta}(w_{t};x,x_{i}),

dft(x)dt+λft(x)=Q(t).\frac{df_{t}(x)}{dt}+\lambda f_{t}(x)=Q(t). (59)

Let f0(x)f_{0}(x) be the initial model, prior to gradient descent. The solution is given by

fT(x)=eλT(f0(x)+0TQ(t)eλt𝑑t).f_{T}(x)=e^{-\lambda T}\biggl{(}f_{0}(x)+\int_{0}^{T}Q(t)e^{\lambda t}\,dt\biggr{)}. (60)

Then

fT(x)=eλT(f0(x)i=1n0Tl(ft(xi),yi)Θ^(wt;x,xi)eλt𝑑t)=eλTf0(x)i=1neλT0Tl(ft(xi),yi)Θ^(wt;x,xi)eλt𝑑t=eλTf0(x)i=1neλT0Tsign(l(ft(xi),yi))|l(ft(xi),yi)|Θ^(wt;x,xi)eλt𝑑t=eλTf0(x)i=1nsign(l(f0(xi),yi))eλT0T|l(ft(xi),yi)|Θ^(wt;x,xi)eλt𝑑t.\begin{split}f_{T}(x)&=e^{-\lambda T}\biggl{(}f_{0}(x)-\sum_{i=1}^{n}\int_{0}^{T}l^{\prime}(f_{t}(x_{i}),y_{i})\hat{\Theta}(w_{t};x,x_{i})e^{\lambda t}\,dt\biggr{)}\\ &=e^{-\lambda T}f_{0}(x)-\sum_{i=1}^{n}e^{-\lambda T}\int_{0}^{T}l^{\prime}(f_{t}(x_{i}),y_{i})\hat{\Theta}(w_{t};x,x_{i})e^{\lambda t}\,dt\\ &=e^{-\lambda T}f_{0}(x)-\sum_{i=1}^{n}e^{-\lambda T}\int_{0}^{T}\text{sign}(l^{\prime}(f_{t}(x_{i}),y_{i}))\cdot\left\lvert l^{\prime}(f_{t}(x_{i}),y_{i})\right\rvert\hat{\Theta}(w_{t};x,x_{i})e^{\lambda t}\,dt\\ &=e^{-\lambda T}f_{0}(x)-\sum_{i=1}^{n}\text{sign}(l^{\prime}(f_{0}(x_{i}),y_{i}))\cdot e^{-\lambda T}\int_{0}^{T}\left\lvert l^{\prime}(f_{t}(x_{i}),y_{i})\right\rvert\hat{\Theta}(w_{t};x,x_{i})e^{\lambda t}\,dt.\end{split} (61)

where the last equality uses the assumption sign(l(ft(xi),yi))=sign(l(f0(xi),yi)),t[0,T]\text{sign}(l^{\prime}(f_{t}(x_{i}),y_{i}))=\text{sign}(l^{\prime}(f_{0}(x_{i}),y_{i})),\forall t\in[0,T]. Thus

fT(x)=i=1naiK(x,xi)+b,f_{T}(x)=\sum_{i=1}^{n}a_{i}K(x,x_{i})+b, (62)

with

ai=sign(l(f0(xi),yi)),b=eλTf0(x),\displaystyle a_{i}=-\text{sign}(l^{\prime}(f_{0}(x_{i}),y_{i})),\qquad b=e^{-\lambda T}f_{0}(x),
K(x,xi)=eλT0T|l(ft(xi),yi)|Θ^(wt;x,xi)eλt𝑑t\displaystyle K(x,x_{i})=e^{-\lambda T}\int_{0}^{T}\left\lvert l^{\prime}(f_{t}(x_{i}),y_{i})\right\rvert\hat{\Theta}(w_{t};x,x_{i})e^{\lambda t}\,dt

K(x,x_{i})=e^{-\lambda T}\int_{0}^{T}\left\lvert l^{\prime}(f_{t}(x_{i}),y_{i})\right\rvert\hat{\Theta}(w_{t};x,x_{i})e^{\lambda t}\,dt is a valid kernel since it is a nonnegatively weighted combination of positive semi-definite kernels. The quantities a_{i}, b and K(x,x_{i}) stay bounded as long as f_{0}(x), l^{\prime}(f_{t}(x_{i}),y_{i}) and \hat{\Theta}(w_{t};x,x_{i}) are bounded. ∎
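To make the path-kernel construction concrete, the following PyTorch sketch accumulates the signed path integrals from Eq. (61) along a small-step (sub)gradient descent trajectory (as a proxy for gradient flow) for a toy two-layer ReLU network, and compares the reconstruction with the network output f_{T}(x). All sizes and the architecture are illustrative; when the sign assumption of Theorem G.1 holds, the signed term -e^{-\lambda T}S_{i} below equals a_{i}K(x,x_{i}).

```python
import math
import torch

torch.manual_seed(0)
C, lam, dt, steps = 1.0, 0.01, 1e-2, 300       # dt plays the role of a small learning rate
n, d, m = 8, 5, 256
T = steps * dt

X, y = torch.randn(n, d), torch.sign(torch.randn(n))
x_test = torch.randn(d)

# Two-layer ReLU net; only the last-layer weights are regularized, matching L(w) above.
W1 = torch.randn(m, d, requires_grad=True)
W2 = torch.randn(1, m, requires_grad=True)
params = [W1, W2]

def f(x):                                       # scalar output
    return (W2 @ torch.relu(W1 @ x)).squeeze() / math.sqrt(m)

def grad_vec(value):                            # gradient of a scalar w.r.t. all parameters
    g = torch.autograd.grad(value, params, retain_graph=True)
    return torch.cat([gi.reshape(-1) for gi in g])

f0_test = f(x_test).item()
S = torch.zeros(n)   # S_i = int_0^T l'(f_t(x_i), y_i) * Theta_hat(w_t; x_test, x_i) * e^{lam t} dt
for k in range(steps):
    fs = torch.stack([f(X[i]) for i in range(n)])
    lprime = -C * y * (y * fs.detach() < 1).float()        # l'(f_t(x_i), y_i) for the hinge loss
    g_test = grad_vec(f(x_test))
    for i in range(n):
        theta_i = torch.dot(g_test, grad_vec(fs[i]))       # Theta_hat(w_t; x_test, x_i)
        S[i] += lprime[i] * theta_i.detach() * math.exp(lam * k * dt) * dt
    # one (sub)gradient descent step on L(w) = lam/2 ||W2||^2 + sum_i l(f(x_i), y_i)
    loss = 0.5 * lam * (W2 ** 2).sum() + C * torch.clamp(1 - y * fs, min=0).sum()
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p, gp in zip(params, grads):
            p -= dt * gp

b = math.exp(-lam * T) * f0_test                            # b in Theorem G.1
# Eq. (61): f_T(x) = b - sum_i e^{-lam T} S_i; close up to the Euler discretization error.
print(f(x_test).item(), (b - math.exp(-lam * T) * S.sum()).item())
```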

Appendix H Robustness of Over-parameterized Neural Network

H.1 Robustness Verification of NTK

Consider an infinitely wide two-layer fully connected ReLU NN, f(x)=\frac{1}{\sqrt{m}}\sum_{j=1}^{m}v_{j}\sigma(\frac{1}{\sqrt{d}}w_{j}^{T}x), where \sigma(z)=\max(0,z) is the ReLU activation. The NTK is

Θ(x,x)=x,xd(πarccos(u)π)+xx2πd1u2=xx2πdh(u),\displaystyle\begin{split}\Theta(x,x^{\prime})=\frac{\left\langle x,x^{\prime}\right\rangle}{d}(\frac{\pi-\arccos(u)}{\pi})+\frac{\left\lVert x\right\rVert\left\lVert x^{\prime}\right\rVert}{2\pi d}\sqrt{1-u^{2}}=\frac{\left\lVert x\right\rVert\left\lVert x^{\prime}\right\rVert}{2\pi d}h(u),\end{split} (63)
h(u)=2u(πarccos(u))+1u2.\displaystyle h(u)=2u(\pi-\arccos(u))+\sqrt{1-u^{2}}. (64)

where u=\frac{\left\langle x,x^{\prime}\right\rangle}{\left\lVert x\right\rVert\left\lVert x^{\prime}\right\rVert}\in\left[-1,1\right]. Consider an \ell_{\infty} perturbation: for x\in B_{\infty}(x_{0},\delta)=\{x\in\mathbb{R}^{d}:\left\lVert x-x_{0}\right\rVert_{\infty}\leq\delta\}, we can bound \left\lVert x\right\rVert in the interval [\left\lVert x\right\rVert^{L},\left\lVert x\right\rVert^{U}] as follows.

x=x0+Δx0+Δx0+dδ=xU,\displaystyle\left\lVert x\right\rVert=\left\lVert x_{0}+\Delta\right\rVert\leq\left\lVert x_{0}\right\rVert+\left\lVert\Delta\right\rVert\leq\left\lVert x_{0}\right\rVert+\sqrt{d}\delta=\left\lVert x\right\rVert^{U},
x=x0+Δ|x0Δ|max(x0dδ,0)=xL.\displaystyle\left\lVert x\right\rVert=\left\lVert x_{0}+\Delta\right\rVert\geq\left\lvert\left\lVert x_{0}\right\rVert-\left\lVert\Delta\right\rVert\right\rvert\geq\max(\left\lVert x_{0}\right\rVert-\sqrt{d}\delta,0)=\left\lVert x\right\rVert^{L}.

Then we can also bound uu in [uL,uU][u^{L},u^{U}].

x,x=x0+Δ,x[x0,xdδx,x0,x+dδx],\displaystyle\left\langle x,x^{\prime}\right\rangle=\left\langle x_{0}+\Delta,x^{\prime}\right\rangle\in\left[\left\langle x_{0},x^{\prime}\right\rangle-\sqrt{d}\delta\left\lVert x^{\prime}\right\rVert,\left\langle x_{0},x^{\prime}\right\rangle+\sqrt{d}\delta\left\lVert x^{\prime}\right\rVert\right],
uL=x0,xdδxxUxifx0,xdδx0elsex0,xdδxxLx,\displaystyle u^{L}=\frac{\left\langle x_{0},x^{\prime}\right\rangle-\sqrt{d}\delta\left\lVert x^{\prime}\right\rVert}{\left\lVert x\right\rVert^{U}\left\lVert x^{\prime}\right\rVert}\quad\text{if}\ \left\langle x_{0},x^{\prime}\right\rangle-\sqrt{d}\delta\left\lVert x^{\prime}\right\rVert\geq 0\quad\text{else}\quad\frac{\left\langle x_{0},x^{\prime}\right\rangle-\sqrt{d}\delta\left\lVert x^{\prime}\right\rVert}{\left\lVert x\right\rVert^{L}\left\lVert x^{\prime}\right\rVert},
uU=x0,x+dδxxLxifx0,x+dδx0elsex0,x+dδxxUx,\displaystyle u^{U}=\frac{\left\langle x_{0},x^{\prime}\right\rangle+\sqrt{d}\delta\left\lVert x^{\prime}\right\rVert}{\left\lVert x\right\rVert^{L}\left\lVert x^{\prime}\right\rVert}\quad\text{if}\ \left\langle x_{0},x^{\prime}\right\rangle+\sqrt{d}\delta\left\lVert x^{\prime}\right\rVert\geq 0\quad\text{else}\quad\frac{\left\langle x_{0},x^{\prime}\right\rangle+\sqrt{d}\delta\left\lVert x^{\prime}\right\rVert}{\left\lVert x\right\rVert^{U}\left\lVert x^{\prime}\right\rVert},
uU=min(uU,1).\displaystyle u^{U}=\min(u^{U},1).

where \Delta\in B_{\infty}(0,\delta). h(u) is a bowl-shaped (convex) function, so it is easy to obtain its interval [h^{L}(u),h^{U}(u)]. Then we can obtain the interval of \Theta(x,x^{\prime}), denoted [\Theta^{L}(x,x^{\prime}),\Theta^{U}(x,x^{\prime})].

ΘL(x,x)=xLx2πdhL(u)ifhL(u)0elsexUx2πdhL(u),\displaystyle\Theta^{L}(x,x^{\prime})=\frac{\left\lVert x\right\rVert^{L}\left\lVert x^{\prime}\right\rVert}{2\pi d}h^{L}(u)\quad\text{if}\ h^{L}(u)\geq 0\quad\text{else}\quad\frac{\left\lVert x\right\rVert^{U}\left\lVert x^{\prime}\right\rVert}{2\pi d}h^{L}(u),
ΘU(x,x)=xUx2πdhU(u)ifhU(u)0elsexLx2πdhU(u).\displaystyle\Theta^{U}(x,x^{\prime})=\frac{\left\lVert x\right\rVert^{U}\left\lVert x^{\prime}\right\rVert}{2\pi d}h^{U}(u)\quad\text{if}\ h^{U}(u)\geq 0\quad\text{else}\quad\frac{\left\lVert x\right\rVert^{L}\left\lVert x^{\prime}\right\rVert}{2\pi d}h^{U}(u).

Suppose g(x)=\sum_{i=1}^{n}\alpha_{i}\Theta(x,x_{i}), where the \alpha_{i} are known after solving the kernel machine problem. Then we can lower and upper bound g(x) as follows.

g(x)i=1,αi>0nαiΘL(x,xi)+i=1,αi<0nαiΘU(x,xi),\displaystyle g(x)\geq\sum_{i=1,\alpha_{i}>0}^{n}\alpha_{i}\Theta^{L}(x,x_{i})+\sum_{i=1,\alpha_{i}<0}^{n}\alpha_{i}\Theta^{U}(x,x_{i}), (65)
g(x)i=1,αi<0nαiΘL(x,xi)+i=1,αi>0nαiΘU(x,xi).\displaystyle g(x)\leq\sum_{i=1,\alpha_{i}<0}^{n}\alpha_{i}\Theta^{L}(x,x_{i})+\sum_{i=1,\alpha_{i}>0}^{n}\alpha_{i}\Theta^{U}(x,x_{i}). (66)
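The following NumPy sketch implements this interval computation (Eqs. (63)-(66)). The function names are illustrative; for simplicity the interior minimum of the bowl-shaped h(u) is approximated on a grid, and \delta is assumed small enough that \lVert x\rVert^{L}>0.

```python
import numpy as np

def h(u):
    u = np.clip(u, -1.0, 1.0)
    return 2 * u * (np.pi - np.arccos(u)) + np.sqrt(1 - u ** 2)

def ntk_interval(x0, xp, delta, grid=1000):
    """Interval [Theta^L, Theta^U] of the two-layer ReLU NTK (Eq. 63) over the
    l_inf ball B(x0, delta), using the bounds on ||x|| and u derived above."""
    d = len(x0)
    nx0, nxp = np.linalg.norm(x0), np.linalg.norm(xp)
    xU, xL = nx0 + np.sqrt(d) * delta, max(nx0 - np.sqrt(d) * delta, 0.0)
    ip_lo, ip_hi = x0 @ xp - np.sqrt(d) * delta * nxp, x0 @ xp + np.sqrt(d) * delta * nxp
    uL = ip_lo / (xU * nxp) if ip_lo >= 0 else ip_lo / (xL * nxp)
    uU = ip_hi / (xL * nxp) if ip_hi >= 0 else ip_hi / (xU * nxp)
    uL, uU = max(uL, -1.0), min(uU, 1.0)
    hs = h(np.linspace(uL, uU, grid))     # h is bowl-shaped; a grid suffices for a sketch
    hL, hU = hs.min(), hs.max()
    thL = (xL if hL >= 0 else xU) * nxp * hL / (2 * np.pi * d)
    thU = (xU if hU >= 0 else xL) * nxp * hU / (2 * np.pi * d)
    return thL, thU

def bound_g(x0, delta, X_train, alpha):
    """Lower/upper bounds on g(x) = sum_i alpha_i Theta(x, x_i) over B(x0, delta), Eqs. (65)-(66)."""
    lo = hi = 0.0
    for xi, ai in zip(X_train, alpha):
        thL, thU = ntk_interval(x0, xi, delta)
        lo += ai * (thL if ai > 0 else thU)
        hi += ai * (thU if ai > 0 else thL)
    return lo, hi
```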

H.2 IBP for Two-layer Neural Network

See [22] for the computation of IBP. For affine layers under the NTK parameterization, the IBP bounds are computed as follows.

μk1=z¯k1+z¯k12rk1=z¯k1z¯k12μk=1mWμk1+brk=1m|W|rk1z¯k=μkrkz¯k=μk+rk\begin{split}\mu_{k-1}&=\frac{\overline{z}_{k-1}+\underline{z}_{k-1}}{2}\\ r_{k-1}&=\frac{\overline{z}_{k-1}-\underline{z}_{k-1}}{2}\\ \mu_{k}&=\frac{1}{\sqrt{m}}W\mu_{k-1}+b\\ r_{k}&=\frac{1}{\sqrt{m}}\left\lvert W\right\rvert r_{k-1}\\ \underline{z}_{k}&=\mu_{k}-r_{k}\\ \overline{z}_{k}&=\mu_{k}+r_{k}\end{split} (67)

where mm is the input dimension of that layer. At initialization, WW, μk1\mu_{k-1} and bb are independent. Since 𝔼[W]=0\mathop{\mathbb{E}}\left[W\right]=0 and 𝔼[b]=0\mathop{\mathbb{E}}\left[b\right]=0,

𝔼[μk]=1m𝔼[W]𝔼[μk1]+𝔼[b]=0\mathop{\mathbb{E}}\left[\mu_{k}\right]=\frac{1}{\sqrt{m}}\mathop{\mathbb{E}}\left[W\right]\mathop{\mathbb{E}}\left[\mu_{k-1}\right]+\mathop{\mathbb{E}}\left[b\right]=0 (68)

Since \left\lvert W\right\rvert follows a folded normal distribution (the absolute value of a normal distribution) and r_{k-1}\geq 0, \left\lvert W\right\rvert\geq 0, we have \mathop{\mathbb{E}}\left[\left\lvert W\right\rvert\right]\mathop{\mathbb{E}}\left[r_{k-1}\right]=O(m), so

𝔼[rk]=1m𝔼[|W|]𝔼[rk1]=O(m)\mathop{\mathbb{E}}\left[r_{k}\right]=\frac{1}{\sqrt{m}}\mathop{\mathbb{E}}\left[\left\lvert W\right\rvert\right]\mathop{\mathbb{E}}\left[r_{k-1}\right]=O(\sqrt{m}) (69)

Thus

𝔼[z¯k]=𝔼[μk]+𝔼[rk]=O(m)\displaystyle-\mathop{\mathbb{E}}\left[\underline{z}_{k}\right]=-\mathop{\mathbb{E}}\left[\mu_{k}\right]+\mathop{\mathbb{E}}\left[r_{k}\right]=O(\sqrt{m}) (70)
𝔼[z¯k]=𝔼[μk]+𝔼[rk]=O(m)\displaystyle\mathop{\mathbb{E}}\left[\overline{z}_{k}\right]=\mathop{\mathbb{E}}\left[\mu_{k}\right]+\mathop{\mathbb{E}}\left[r_{k}\right]=O(\sqrt{m}) (71)

This causes the robustness lower bound to decrease at a rate of O(1/\sqrt{m}). The same result holds for LeCun initialization, which is used in PyTorch for fully connected layers by default.
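A minimal NumPy sketch of the IBP propagation in Eq. (67) under the NTK parameterization, together with the width dependence discussed above; the layer sizes and Gaussian weights are illustrative.

```python
import numpy as np

def ibp_affine_ntk(z_lo, z_hi, W, b):
    """IBP bounds through an NTK-parameterized affine layer z -> W z / sqrt(m) + b, Eq. (67)."""
    m = W.shape[1]                       # input dimension of the layer
    mu = (z_hi + z_lo) / 2
    r = (z_hi - z_lo) / 2
    mu_out = W @ mu / np.sqrt(m) + b
    r_out = np.abs(W) @ r / np.sqrt(m)
    return mu_out - r_out, mu_out + r_out

def ibp_relu(z_lo, z_hi):
    """IBP bounds through an elementwise ReLU."""
    return np.maximum(z_lo, 0), np.maximum(z_hi, 0)

# Illustration: for a random two-layer net, the output interval width after the hidden layer
# of width m grows roughly like sqrt(m), consistent with Eqs. (70)-(71).
rng = np.random.default_rng(0)
d, delta = 20, 0.01
x = rng.normal(size=d)
for m in [100, 1000, 10000]:
    W1, W2 = rng.normal(size=(m, d)), rng.normal(size=(1, m))
    lo1, hi1 = ibp_affine_ntk(x - delta, x + delta, W1, np.zeros(m))   # first layer, input dim d
    lo1, hi1 = ibp_relu(lo1, hi1)
    lo2, hi2 = ibp_affine_ntk(lo1, hi1, W2, np.zeros(1))               # second layer, input dim m
    print(m, (hi2 - lo2).item())
```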