
On the Equivalence between Neural Network and Support Vector Machine

Yilan Chen
Computer Science and Engineering
University of California San Diego
La Jolla, CA
[email protected]
Wei Huang
Engineering and Information Technology
University of Technology Sydney
Ultimo, Australia
[email protected]
Lam M. Nguyen
IBM Research
Thomas J. Watson Research Center
Yorktown Heights, NY
[email protected]
Tsui-Wei Weng
Halıcıoğlu Data Science Institute
University of California San Diego
La Jolla, CA
[email protected]
Abstract

Recent research shows that the dynamics of an infinitely wide neural network (NN) trained by gradient descent can be characterized by the Neural Tangent Kernel (NTK) [27]. Under the squared loss, the infinite-width NN trained by gradient descent with an infinitely small learning rate is equivalent to kernel regression with NTK [4]. However, the equivalence is currently only known for ridge regression [6], while the equivalence between NN and other kernel machines (KMs), e.g. the support vector machine (SVM), remains unknown. Therefore, in this work, we propose to establish the equivalence between NN and SVM, and specifically, between the infinitely wide NN trained by soft margin loss and the standard soft margin SVM with NTK trained by subgradient descent. Our main theoretical results include establishing the equivalences between NNs and a broad family of \ell_{2} regularized KMs with finite-width bounds, which cannot be handled by prior work, and showing that every finite-width NN trained by such regularized loss functions is approximately a KM. Furthermore, we demonstrate that our theory enables three practical applications, including (i) a non-vacuous generalization bound for the NN via the corresponding KM; (ii) a non-trivial robustness certificate for the infinite-width NN (while existing robustness verification methods would provide vacuous bounds); (iii) intrinsically more robust infinite-width NNs than those from previous kernel regression. Our code for the experiments is available at https://github.com/leslie-CH/equiv-nn-svm.

1 Introduction

Recent research has made some progress towards deep learning theory from the perspective of infinite-width neural networks (NNs). A fully-trained infinite-width NN follows kernel gradient descent in function space with respect to the Neural Tangent Kernel (NTK) [27]. Under this linear regime and squared loss, it is proved that the fully-trained NN is equivalent to kernel regression with NTK [27, 4], which gives the generalization ability of such a model [5]. NTK helps us understand the optimization [27, 19] and generalization [5, 12] of NNs through the perspective of kernels. However, existing theories about NTK [27, 31, 4, 14] usually assume the loss is a function of the model output, which does not include the case of regularization. Besides, they usually consider the squared loss that corresponds to a kernel regression, which may offer limited insight into classification problems, since squared loss and kernel regression are usually used for regression.

On the other hand, another popular machine learning paradigm with a solid theoretical foundation before the prevalence of deep neural networks is the support vector machine (SVM) [10, 16], which allows learning linear classifiers in high dimensional feature spaces. SVM tackles the sample complexity challenge by searching for large margin separators and tackles the computational complexity challenge using the idea of kernels [46]. Learning an SVM model usually involves solving a dual problem cast as a convex quadratic program. Recently, algorithms based on subgradient descent [47] and coordinate descent [23] have been proposed to further scale SVM models to large datasets and high dimensional feature spaces.

We note that existing theoretical analysis mostly focused on connecting NN with kernel regression [27, 4, 31], but the connections between NN and SVM have not yet been explored. In this work, we establish the equivalence between NN and SVM for the first time to the best of our knowledge. More broadly, we show that our analysis can connect NNs with a family of \ell_{2} regularized KMs, including kernel ridge regression (KRR), support vector regression (SVR) and \ell_{2} regularized logistic regression, which previous results [27, 4, 31] cannot handle. These are the first equivalences beyond ridge regression. Importantly, the equivalences between infinite-width NNs and these \ell_{2} regularized KMs may shed light on the understanding of NNs from these new equivalent KMs [17, 48, 45, 52], especially towards understanding the training, generalization, and robustness of NNs for classification problems. Besides, regularization plays an important role in machine learning by restricting model complexity, and these equivalences may shed light on the understanding of regularization for NNs. We highlight our contributions as follows:

  • We derive the continuous (gradient flow) and discrete dynamics of SVM trained by subgradient descent and the dynamics of NN trained by soft margin loss. We show that the dynamics of SVM with NTK and NN are exactly the same in the infinite width limit because of the constancy of the tangent kernel, and thus establish the equivalence. We show the same linear convergence rate for SVM and NN under a reasonable assumption. We verify the equivalence by experiments with subgradient descent and stochastic subgradient descent on the MNIST dataset [29].

  • We generalize our theory to general loss functions with \ell_{2} regularization and establish the equivalences between NNs and a family of \ell_{2} regularized KMs as summarized in Table 1. We prove that the difference between the outputs of the NN and the KM scales as O(\ln{m}/(\lambda\sqrt{m})), where \lambda is the coefficient of the regularization and m is the width of the NN. Additionally, we show that every finite-width neural network trained by an \ell_{2} regularized loss function is approximately a KM.

  • We show that our theory offers three practical benefits: (i) computing a non-vacuous generalization bound of the NN via the corresponding KM; (ii) delivering a nontrivial robustness certificate for the over-parameterized NN (with width m\rightarrow\infty), while existing robustness verification methods would give trivial robustness certificates due to bound propagation [22, 57, 60]. In particular, the certificate decreases at a rate of O(1/\sqrt{m}) as the width of the NN increases; (iii) showing that the equivalent infinite-width NNs trained from our \ell_{2} regularized KMs are more robust than the equivalent NN trained from previous kernel regression [27, 4] (see Table 3), which is perhaps not too surprising as regularization has a strong connection to robust machine learning.

2 Related Works and Background

2.1 Related Works

Neural Tangent Kernel and dynamics of neural networks. NTK was first introduced in [27] and extended to the Convolutional NTK [4] and Graph NTK [21]. [26] studied the NTK of orthogonal initialization. [6] reported strong performance of NTK on small-data tasks both for kernel regression and kernel SVM. However, the equivalence is currently only known for ridge regression, not for SVM and other KMs. A line of recent work [20, 1] proved the convergence of (convolutional) neural networks with large but finite width in a non-asymptotic way by showing the weights do not move far away from initialization in the optimization dynamics (trajectory). [31] showed the dynamics of wide neural networks are governed by a linear model given by the first-order Taylor expansion around the initial parameters. However, existing theories about NTK [27, 31, 4] usually assume the loss is a function of the model output, which does not include the case of regularization. Besides, they usually consider the squared loss that corresponds to a kernel regression, which may offer limited insight into classification problems, since squared loss and kernel regression are usually used for regression. In this paper, we study regularized loss functions and establish equivalences with KMs beyond kernel regression and regression problems.

Besides, we study the robustness of NTK models. [24] studied label noise (the labels are generated by a ground truth function plus Gaussian noise) while we consider robustness to input perturbations. They study the convergence rate of an NN trained by \ell_{2} regularized squared loss to an underlying true function, while we give explicit robustness certificates for NNs. Our robustness certificate enables us to compare different models and to show that the equivalent infinite-width NNs trained from our \ell_{2} regularized KMs are more robust than the equivalent NN trained from previous kernel regression.

Neural network and support vector machine. Prior works [54, 53, 35, 50, 32] have explored the benefits of encouraging large margin in the context of deep networks. [15] introduced a new family of positive-definite kernel functions that mimic the computation in multi-layer NNs and applied the kernels to SVM. [18] showed that NNs trained by gradient flow are approximately "kernel machines" with the weights and bias as functions of input data, which however can be much more complex than a typical kernel machine. [47] proposed a subgradient algorithm to solve the primal problem of SVM, which can obtain a solution of accuracy \epsilon in \tilde{O}(1/\epsilon) iterations, where \tilde{O} omits the logarithmic factors. In this paper, we also consider the SVM trained by subgradient descent and connect it with the NN trained by subgradient descent. [49, 3] studied the connection between SVM and regularization neural networks [44], one-hidden-layer NNs whose structure is very similar to that of KMs and which are not widely used in practice. NNs used in practice now (e.g. fully connected ReLU NNs, CNNs, ResNets) do not have such structures. [43] analyzed a two-layer NN trained by hinge loss without regularization on a linearly separable dataset. Note that an SVM must have a regularization term in order to achieve the max-margin solution.

Implicit bias of neural network towards max-margin solution. There is also a line of research on the implicit bias of neural networks towards the max-margin (hard margin SVM) solution under different settings and assumptions [51, 28, 13, 56, 40, 37]. Our paper complements this line of research by establishing an exact equivalence between the infinite-width NN and SVM. Our equivalences include not only hard margin SVM but also soft margin SVM and other \ell_{2} regularized KMs.

2.2 Neural Networks and Tangent Kernel

We consider a general form of deep neural network f with a linear output layer as in [33]. Let [L]=\{1,\dots,L\}; \forall l\in[L],

\alpha^{(0)}(w,x)=x,\ \alpha^{(l)}(w,x)=\phi_{l}(w^{(l)},\alpha^{(l-1)}),\ f(w,x)=\frac{1}{\sqrt{m_{L}}}\langle w^{(L+1)},\alpha^{(L)}(w,x)\rangle, (1)

where each vector-valued function \phi_{l}(w^{(l)},\cdot):\mathbb{R}^{m_{l-1}}\rightarrow\mathbb{R}^{m_{l}}, with parameter w^{(l)}\in\mathbb{R}^{p_{l}} (p_{l} is the number of parameters), is considered as a layer of the network. This definition includes the standard fully connected, convolutional (CNN), and residual (ResNet) neural networks as special cases. For a fully connected ReLU NN, \alpha^{(l)}(w,x)=\sigma(\frac{1}{\sqrt{m_{l-1}}}w^{(l)}\alpha^{(l-1)}) with w^{(l)}\in\mathbb{R}^{m_{l}\times m_{l-1}} and \sigma(z)=\max(0,z).
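To make Eq. (1) concrete, a minimal PyTorch sketch of a fully connected ReLU network under this parameterization is shown below (the class name, depth, and widths are illustrative choices, not part of the paper's code): weights are drawn from a standard Gaussian and each layer rescales its pre-activations by 1/\sqrt{m_{l-1}}.

```python
import torch
import torch.nn as nn

class NTKReLUNet(nn.Module):
    """Fully connected ReLU network in NTK parameterization, following Eq. (1)."""
    def __init__(self, d_in, widths=(512, 512)):  # widths chosen only for illustration
        super().__init__()
        dims = (d_in,) + tuple(widths)
        # w^{(l)} ~ N(0, 1); the 1/sqrt(m_{l-1}) factor is applied in forward().
        self.hidden = nn.ParameterList(
            [nn.Parameter(torch.randn(dims[l + 1], dims[l])) for l in range(len(widths))]
        )
        self.w_out = nn.Parameter(torch.randn(widths[-1]))  # w^{(L+1)}

    def forward(self, x):
        alpha = x
        for w in self.hidden:
            alpha = torch.relu(alpha @ w.t() / w.shape[1] ** 0.5)
        # f(w, x) = <w^{(L+1)}, alpha^{(L)}(w, x)> / sqrt(m_L)
        return alpha @ self.w_out / self.w_out.shape[0] ** 0.5
```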

Initialization and parameterization. In this paper, we consider the NTK parameterization [27], under which the constancy of the tangent kernel was initially observed. Specifically, the parameters w:=\{w^{(1)};w^{(2)};\cdots;w^{(L)};w^{(L+1)}\} are drawn i.i.d. from a standard Gaussian \mathcal{N}(0,1) at initialization, denoted as w_{0}. The factor 1/\sqrt{m_{L}} in the output layer is required by the NTK parameterization so that the output f is of order O(1). While we only consider the NTK parameterization here, the results should extend to general parameterizations of the kernel regime [58].

Definition 2.1 (Tangent kernel [27]).

The tangent kernel associated with function f(w,x) at some parameter w is \hat{\Theta}(w;x,x^{\prime})=\langle\nabla_{w}f(w,x),\nabla_{w}f(w,x^{\prime})\rangle. Under certain conditions (usually the infinite width limit and NTK parameterization), the tangent kernel at initialization converges in probability to a deterministic limit and stays constant during training, \hat{\Theta}(w;x,x^{\prime})\rightarrow\Theta_{\infty}(x,x^{\prime}). This limiting kernel is called the Neural Tangent Kernel (NTK).
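As an illustration of Definition 2.1, the empirical tangent kernel of any differentiable model can be estimated by pairing per-example parameter gradients. The sketch below is a minimal helper under that assumption (it takes any PyTorch model with a scalar output, e.g. the NTKReLUNet sketch above), not the paper's implementation.

```python
import torch

def empirical_tangent_kernel(model, x1, x2):
    """Theta_hat(w; x, x') = <grad_w f(w, x), grad_w f(w, x')>, computed pairwise."""
    params = [p for p in model.parameters() if p.requires_grad]

    def grad_vec(x):
        # Gradient of the scalar output f(w, x) with respect to all parameters, flattened.
        grads = torch.autograd.grad(model(x.unsqueeze(0)).squeeze(), params)
        return torch.cat([g.reshape(-1) for g in grads])

    g1 = torch.stack([grad_vec(x) for x in x1])  # (n1, num_params)
    g2 = torch.stack([grad_vec(x) for x in x2])  # (n2, num_params)
    return g1 @ g2.t()                           # (n1, n2) kernel matrix
```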

2.3 Kernel Machines

A kernel machine (KM) is a model of the form g(\beta,x)=\varphi(\langle\beta,\Phi(x)\rangle+b), where \beta is the model parameter and \Phi is a mapping from the input space to some feature space, \Phi:\mathcal{X}\rightarrow\mathcal{F}. \varphi is an optional nonlinear function, such as the identity mapping for kernel regression and sign(\cdot) for SVM and logistic regression. The kernel can be exploited whenever the weight vector can be expressed as a linear combination of the training points, \beta=\sum_{i=1}^{n}\alpha_{i}\Phi(x_{i}) for some values of \alpha_{i}, i\in[n], implying that we can express g as g(x)=\varphi(\sum_{i=1}^{n}\alpha_{i}K(x,x_{i})+b), where K(x,x_{i})=\langle\Phi(x),\Phi(x_{i})\rangle is the kernel function. For a NN in the NTK regime, we have f(w_{t},x)\approx f(w_{0},x)+\langle\nabla_{w}f(w_{0},x),w_{t}-w_{0}\rangle, which makes the NN linear in the gradient feature mapping x\rightarrow\nabla_{w}f(w_{0},x). Under squared loss, it is equivalent to kernel regression with \Phi(x)=\nabla_{w}f(w_{0},x) (or equivalently using NTK as the kernel), \beta=w_{t}-w_{0} and \varphi the identity mapping [4].

As far as we know, there is no work establishing the equivalence between fully trained NN and SVM. [18] showed that NNs trained by gradient flow are approximately "kernel machines" with the weights and bias as functions of input data, which however can be much more complex than a typical kernel machine. In this work, we compare the dynamics of SVM and NN trained by subgradient descent with soft margin loss and show the equivalence between them in the infinite width limit.

2.4 Subgradient Optimization of Support Vector Machine

We first formally define the standard soft margin SVM and then show how subgradient descent can be applied to obtain an approximate solution of the SVM primal problem. For simplicity, we consider the homogeneous model, g(\beta,x)=\langle\beta,\Phi(x)\rangle. (One can always handle the bias term b by augmenting each sample with an additional dimension, \Phi(x)^{T}\leftarrow[\Phi(x)^{T},1], \beta^{T}\leftarrow[\beta^{T},1].)

Definition 2.2 (Soft margin SVM).

Given labeled samples \{(x_{i},y_{i})\}_{i=1}^{n} with y_{i}\in\{-1,+1\}, the hyperplane \beta^{*} that solves the optimization problem below realizes the soft margin classifier with geometric margin \gamma=2/\lVert\beta^{*}\rVert.

\min_{\beta,\xi}\ \frac{1}{2}\lVert\beta\rVert^{2}+C\sum_{i=1}^{n}\xi_{i},\quad\text{s.t.}\ y_{i}\langle\beta,\Phi(x_{i})\rangle\geq 1-\xi_{i},\ \xi_{i}\geq 0,\ i\in[n],
Proposition 2.1.

The above primal problem of soft margin SVM can be equivalently formulated as

\min_{\beta}\frac{1}{2}\lVert\beta\rVert^{2}+C\sum_{i=1}^{n}\max(0,1-y_{i}\langle\beta,\Phi(x_{i})\rangle), (2)

where the second term is a hinge loss. Denote this function as L(\beta), which is strongly convex in \beta.

From this, we see that the SVM technique is equivalent to empirical risk minimization with \ell_{2} regularization, where in this case the loss function is the nonsmooth hinge loss. The classical approaches usually consider the dual problem of SVM and solve it as a quadratic programming problem. Some recent algorithms, however, use subgradient descent [47] to optimize Eq. (2), which shows significant advantages when dealing with large datasets.

In this paper, we consider the soft margin SVM trained by subgradient descent with L(\beta). We use the subgradient \nabla_{\beta}L(\beta)=\beta-C\sum_{i=1}^{n}\mathbbm{1}(y_{i}g(\beta,x_{i})<1)y_{i}\Phi(x_{i}), where \mathbbm{1}(\cdot) is the indicator function. As proved in [47], we can find a solution of accuracy \epsilon, i.e. L(\beta)-L(\beta^{*})\leq\epsilon, in \tilde{O}(1/\epsilon) iterations. Other works also give convergence guarantees for subgradient descent of convex functions [11, 9]. In the following analysis, we will generally assume the convergence of SVM trained by subgradient descent.
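For concreteness, a minimal sketch of this primal subgradient update for Eq. (2) with an explicit (finite-dimensional) feature map is given below; the feature matrix, learning rate, and step count are placeholders, not values used in the paper.

```python
import numpy as np

def svm_subgradient_descent(Phi, y, C=1.0, lr=0.01, steps=1000):
    """Minimize 0.5*||beta||^2 + C*sum_i max(0, 1 - y_i <beta, Phi(x_i)>) by subgradient descent.

    Phi: (n, p) matrix whose rows are explicit features Phi(x_i); y: (n,) labels in {-1, +1}.
    """
    n, p = Phi.shape
    beta = np.zeros(p)
    for _ in range(steps):
        margins = y * (Phi @ beta)
        active = (margins < 1).astype(float)       # indicator 1(y_i g(beta, x_i) < 1)
        subgrad = beta - C * Phi.T @ (active * y)  # subgradient of L(beta)
        beta -= lr * subgrad
    return beta
```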

3 Main Theoretical Results

In this section, we describe our main results. We first derive the continuous (gradient flow) and discrete dynamics of SVM trained by subgradient descent (in Section 3.1) and the dynamics of NN trained by soft margin loss (in Section 3.2 and Section 3.3). We show that they have similar dynamics, characterized by an inhomogeneous linear differential (difference) equation, and have the same convergence rate under a reasonable assumption. Next, we show that their dynamics are exactly the same in the infinite width limit because of the constancy of the tangent kernel, and thus establish the equivalence (Theorem 3.4). Furthermore, in Section 3.4, we generalize our theory to general loss functions with \ell_{2} regularization and establish the equivalences between NNs and a family of \ell_{2} regularized KMs as summarized in Table 1.

3.1 Dynamics of Soft Margin SVM

For simplicity, we denote \beta_{t} as \beta at some time t and g_{t}(x)=g(\beta_{t},x). The proofs of the following two theorems are detailed in Appendix C.

Theorem 3.1 (Continuous dynamics and convergence rate of SVM).

Consider training the soft margin SVM by subgradient descent with an infinitesimally small learning rate (gradient flow [2]): \frac{d\beta_{t}}{dt}=-\nabla_{\beta}L(\beta_{t}). The model g_{t}(x) follows the evolution below:

\frac{dg_{t}(x)}{dt}=-g_{t}(x)+C\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{t}(x_{i})<1)y_{i}K(x,x_{i}), (3)

and has a linear convergence rate:

L(\beta_{t})-L(\beta^{*})\leq e^{-2t}\left(L(\beta_{0})-L(\beta^{*})\right).

Denote Q(t)=C\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{t}(x_{i})<1)y_{i}K(x,x_{i}), which changes over time until convergence. The model output at some time T is

g_{T}(x)=e^{-T}\Bigl(g_{0}(x)+\int_{0}^{T}Q(t)e^{t}\,dt\Bigr),\quad\lim_{T\rightarrow\infty}g_{T}(x)=C\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{T}(x_{i})<1)y_{i}K(x,x_{i}). (4)

The continuous dynamics of SVM is described by an inhomogeneous linear differential equation (Eq. (3)), which admits an analytical solution. From Eq. (4), we can see that the influence of the initial model g_{0}(x) decreases as time T\rightarrow\infty and eventually disappears.

Theorem 3.2 (Discrete dynamics of SVM).

Let \eta\in(0,1) be the learning rate. The dynamics of subgradient descent is

g_{t+1}(x)-g_{t}(x)=-\eta g_{t}(x)+\eta C\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{t}(x_{i})<1)y_{i}K(x,x_{i}). (5)

Denote Q(t)=\eta C\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{t}(x_{i})<1)y_{i}K(x,x_{i}), which changes over time. The model output at some time T is

g_{T}(x)=(1-\eta)^{T}\Bigl(g_{0}(x)+\sum_{t=0}^{T-1}(1-\eta)^{-t-1}Q(t)\Bigr),\quad\lim_{T\rightarrow\infty}g_{T}(x)=C\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{T}(x_{i})<1)y_{i}K(x,x_{i}).

The discrete dynamics is characterized by an inhomogeneous linear difference equation (Eq. (5)). The discrete dynamics and solution of SVM have a similar structure to the continuous case.

3.2 Soft Margin Neural Network

We first formally define the soft margin neural network and then derive the dynamics of training a neural network by subgradient descent with a soft margin loss. We consider a neural network defined as in Eq. (1). For convenience, we redefine f(w,x)=\langle W^{(L+1)},\alpha^{(L)}(w,x)\rangle with W^{(L+1)}=\frac{1}{\sqrt{m_{L}}}w^{(L+1)} and w:=\{w^{(1)};w^{(2)};\cdots;w^{(L)};W^{(L+1)}\}.

Definition 3.1 (Soft margin neural network).

Given samples \{(x_{i},y_{i})\}_{i=1}^{n}, y_{i}\in\{-1,+1\}, the neural network w^{*} defined as in Eq. (1) that solves the following two equivalent optimization problems

\min_{w,\xi}\frac{1}{2}\lVert W^{(L+1)}\rVert^{2}+C\sum_{i=1}^{n}\xi_{i},\quad\text{s.t.}\ y_{i}f(w,x_{i})\geq 1-\xi_{i},\ \xi_{i}\geq 0,\ i\in[n],
\min_{w}\frac{1}{2}\lVert W^{(L+1)}\rVert^{2}+C\sum_{i=1}^{n}\max(0,1-y_{i}f(w,x_{i})), (6)

realizes the soft margin classifier with geometric margin \gamma=2/\lVert W_{*}^{(L+1)}\rVert. Denote Eq. (6) as L(w) and call it the soft margin loss.

This is generally a hard nonconvex optimization problem, but we can apply subgradient descent to optimize it heuristically. At initialization, \lVert W_{0}^{(L+1)}\rVert^{2}=O(1). The derivative of the regularization with respect to w^{(L+1)} is w^{(L+1)}/\sqrt{m_{L}}=O(1/\sqrt{m_{L}})\rightarrow 0 as m\rightarrow\infty. For a fixed \alpha^{(L)}(w,x), this problem is the same as SVM with \Phi(x)=\alpha^{(L)}(w,x), kernel K(x,x^{\prime})=\alpha^{(L)}(w,x)\cdot\alpha^{(L)}(w,x^{\prime}) and parameter \beta=W^{(L+1)}. If we only train the last layer of an infinite-width soft margin NN, it corresponds to an SVM with an NNGP kernel [30, 38]. But for a fully-trained NN, \alpha^{(L)}(w,x) changes over time.

3.3 Dynamics of Neural Network Trained by Soft Margin Loss

Denote the hinge loss in L(w) as L_{h}(y_{i},f(w,x_{i}))=C\max(0,1-y_{i}f(w,x_{i})). We use the same subgradient as that for SVM, L_{h}^{\prime}(y_{i},f(w,x_{i}))=-Cy_{i}\mathbbm{1}(y_{i}f(w,x_{i})<1).
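A minimal PyTorch sketch of training with the soft margin loss in Eq. (6) might look like the code below. It reuses the hypothetical NTKReLUNet from Section 2.2, regularizes only the (rescaled) output-layer weights, and lets plain SGD play the role of subgradient descent, since autograd returns a valid subgradient of the hinge loss; the data loader, C, and learning rate are placeholders.

```python
import torch

def soft_margin_loss(model, x, y, C=1.0):
    """L(w) = 0.5 * ||W^{(L+1)}||^2 + C * sum_i max(0, 1 - y_i f(w, x_i)), as in Eq. (6)."""
    f = model(x)                                     # f(w, x_i), shape (n,)
    hinge = torch.clamp(1.0 - y * f, min=0.0).sum()
    # W^{(L+1)} = w^{(L+1)} / sqrt(m_L); only this rescaled output layer is regularized.
    reg = 0.5 * (model.w_out / model.w_out.shape[0] ** 0.5).pow(2).sum()
    return reg + C * hinge

# Illustrative training loop (model, data loader, and hyperparameters are assumptions):
# model = NTKReLUNet(d_in=784)
# opt = torch.optim.SGD(model.parameters(), lr=0.1)
# for x_batch, y_batch in loader:                    # y_batch in {-1, +1}
#     opt.zero_grad()
#     soft_margin_loss(model, x_batch, y_batch).backward()
#     opt.step()
```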

Theorem 3.3 (Continuous dynamics and convergence rate of NN).

Suppose a NN f(w,x) defined as in Eq. (1), with f a differentiable function of w, is learned from a training set \{(x_{i},y_{i})\}_{i=1}^{n} by subgradient descent with L(w) and gradient flow. Then the network has the following dynamics:

\frac{df_{t}(x)}{dt}=-f_{t}(x)+C\sum_{i=1}^{n}\mathbbm{1}(y_{i}f_{t}(x_{i})<1)y_{i}\hat{\Theta}(w_{t};x,x_{i}).

Let \hat{\Theta}(w_{t})\in\mathbb{R}^{n\times n} be the tangent kernel evaluated on the training set and \lambda_{min}(\hat{\Theta}(w_{t})) be its minimum eigenvalue. Assume \lambda_{min}(\hat{\Theta}(w_{t}))\geq\frac{2}{C}; then the NN has at least a linear convergence rate, the same as SVM:

L(w_{t})-L(w^{*})\leq e^{-2t}\left(L(w_{0})-L(w^{*})\right).

The proof is in Appendix D. The key observation is that when deriving the dynamics of f_{t}(x), the \frac{1}{2}\lVert W^{(L+1)}\rVert^{2} term in the loss function produces the -f_{t}(x) term and the hinge loss produces the tangent kernel term, which overall gives a dynamics similar to that of SVM. Compared with the previous continuous dynamics without regularization [27, 31], our result has the extra -f_{t}(x) term because of the regularization term of the loss function. The convergence rate is proved based on a sufficient condition for the PL inequality. The assumption \lambda_{min}(\hat{\Theta}(w_{t}))\geq\frac{2}{C} can be guaranteed in a parameter ball when \lambda_{min}(\hat{\Theta}(w_{0}))>\frac{2}{C}, by using a sufficiently wide NN [34].

If the tangent kernel \hat{\Theta}(w_{t};x,x_{i}) is fixed, \hat{\Theta}(w_{t};x,x_{i})\rightarrow\hat{\Theta}(w_{0};x,x_{i}), the dynamics of the NN is the same as that of SVM (Eq. (3)) with kernel \hat{\Theta}(w_{0};x,x_{i}), assuming the neural network and SVM have the same initial output g_{0}(x)=f_{0}(x). (This can be done by setting the initial values to 0, i.e. g_{0}(x)=f_{0}(x)=0.) This constancy of the tangent kernel holds for infinitely wide neural networks of common architectures, and it does not depend on the optimization algorithm or the choice of loss function, as discussed in [33].

Assumptions. We assume that the (vector-valued) layer functions \phi_{l}(w,\alpha), l\in[L], are L_{\phi}-Lipschitz continuous and twice differentiable with respect to the input \alpha and the parameters w. These assumptions serve the following theorem to show the constancy of the tangent kernel.

Theorem 3.4 (Equivalence between NN and SVM).

As the minimum width of the NN, m=\min_{l\in[L]}m_{l}, goes to infinity, the tangent kernel tends to a constant, \hat{\Theta}(w_{t};x,x_{i})\rightarrow\hat{\Theta}(w_{0};x,x_{i}). Assume g_{0}(x)=f_{0}(x). Then the infinitely wide NN trained by subgradient descent with soft margin loss has the same dynamics as the SVM with kernel \hat{\Theta}(w_{0};x,x_{i}) trained by subgradient descent:

\frac{df_{t}(x)}{dt}=-f_{t}(x)+C\sum_{i=1}^{n}\mathbbm{1}(y_{i}f_{t}(x_{i})<1)y_{i}\hat{\Theta}(w_{0};x,x_{i}).

Thus such an NN and the SVM converge to the same solution.

The proof is in Appendix E. We apply the results of [33] to show the constancy of the tangent kernel in the infinite width limit. Then it is easy to check that the dynamics of the infinitely wide NN and the SVM with NTK are the same. We give finite-width bounds for general loss functions in the next section. This theorem establishes the equivalence between infinitely wide NN and SVM for the first time. Previous theoretical results on SVM [17, 48, 45, 52] can be directly applied to understand the generalization of NNs trained by soft margin loss. Given that the tangent kernel is constant, or equivalently that the model is linear, we can also give the discrete dynamics of the NN (Appendix D.4), which is identical to that of SVM. Compared with the previous discrete-time gradient descent [31, 58], our result has an extra -\eta f_{t}(x) term because of the regularization term of the loss function.

f_{t+1}(x)-f_{t}(x)=-\eta f_{t}(x)+\eta C\sum_{i=1}^{n}\mathbbm{1}(y_{i}f_{t}(x_{i})<1)y_{i}\hat{\Theta}(w_{0};x,x_{i}).
Table 1: Summary of our theoretical results on the equivalences between infinite-width NNs and a family of KMs. Thanks to the representer theorem [45], our \ell_{2} regularized KMs can all apply the kernel trick, meaning the infinite-width NTK can be applied in these \ell_{2} regularized KMs.
\lambda | Loss l(z,y_{i}) | Kernel machine
\lambda=0 ([27, 4]) | (y_{i}-z)^{2} | Kernel regression
\lambda\rightarrow 0 (ours) | \max(0,1-y_{i}z) | Hard margin SVM
\lambda>0 (ours) | \max(0,1-y_{i}z) | (1-norm) soft margin SVM
\lambda>0 (ours) | \max(0,1-y_{i}z)^{2} | 2-norm soft margin SVM
\lambda>0 (ours) | \max(0,\lvert y_{i}-z\rvert-\epsilon) | Support vector regression
\lambda>0 (ours) | (y_{i}-z)^{2} | Kernel ridge regression (KRR)
\lambda>0 (ours) | \log(1+e^{-y_{i}z}) | Logistic regression with \ell_{2} regularization

3.4 General Loss Functions

We note that the above analysis does not depend specifically on the hinge loss. Thus we can generalize our analysis to general loss functions l(z,y_{i}), where z is the model output, as long as the loss function is differentiable (or has subgradients) with respect to z, such as the squared loss and the logistic loss. Besides, we can scale the regularization term by a factor \lambda instead of scaling l(z,y_{i}) by C as for SVM; the two are equivalent. Suppose the loss functions for the KM and the NN are

L(\beta)=\frac{\lambda}{2}\lVert\beta\rVert^{2}+\sum_{i=1}^{n}l(g(\beta,x_{i}),y_{i}),\quad L(w)=\frac{\lambda}{2}\lVert W^{(L+1)}\rVert^{2}+\sum_{i=1}^{n}l(f(w,x_{i}),y_{i}). (7)

Then the continuous dynamics of g_{t}(x) and f_{t}(x) are

\frac{dg_{t}(x)}{dt}=-\lambda g_{t}(x)-\sum_{i=1}^{n}l^{\prime}(g_{t}(x_{i}),y_{i})K(x,x_{i}), (8)
\frac{df_{t}(x)}{dt}=-\lambda f_{t}(x)-\sum_{i=1}^{n}l^{\prime}(f_{t}(x_{i}),y_{i})\hat{\Theta}(w_{t};x,x_{i}), (9)

where l^{\prime}(z,y_{i})=\frac{\partial l(z,y_{i})}{\partial z}. When \hat{\Theta}(w_{t};x,x_{i})\rightarrow\hat{\Theta}(w_{0};x,x_{i}) and K(x,x_{i})=\hat{\Theta}(w_{0};x,x_{i}), these two dynamics are the same (assuming g_{0}(x)=f_{0}(x)). When \lambda=0, we recover the previous results of kernel regression. When \lambda>0, we obtain our new results for \ell_{2} regularized loss functions. Table 1 lists the different loss functions and the corresponding KMs that infinite-width NNs are equivalent to. KRR is considered in [25] to analyze the generalization of NNs. However, they directly assume the NN is a linear model and use it in KRR. Below we give finite-width bounds on the difference between the outputs of the NN and the corresponding KM. The proof is in Appendix F.
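To illustrate Eq. (8), the sketch below simulates the \ell_{2} regularized kernel-machine dynamics on the training set by Euler discretization, using the logistic loss from Table 1; the kernel matrix, labels, step size, \lambda, and step count are placeholders.

```python
import numpy as np

def logistic_loss_grad(z, y):
    """l'(z, y) for l(z, y) = log(1 + exp(-y z)) (see Table 1)."""
    return -y / (1.0 + np.exp(y * z))

def simulate_km_dynamics(K, y, lam=0.1, dt=0.01, steps=5000, loss_grad=logistic_loss_grad):
    """Euler-discretize Eq. (8) on the training points:
       dg/dt = -lam * g - K @ l'(g, y), starting from g_0 = 0."""
    g = np.zeros(len(y))
    for _ in range(steps):
        g = g + dt * (-lam * g - K @ loss_grad(g, y))
    return g  # approximate g_T(x_i) on the training set
```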

Theorem 3.5 (Bounds on the difference between NN and KM).

Assume g_{0}(x)=f_{0}(x),\forall x and K(x,x_{i})=\hat{\Theta}(w_{0};x,x_{i}). (A linearized NN is a special case of such g.) Suppose the KM and NN are trained with losses (7) and gradient flow. Suppose l is \rho-Lipschitz and \beta_{l}-smooth in its first argument (i.e. the model output). Given any w_{T}\in B(w_{0};R):=\{w:\lVert w-w_{0}\rVert\leq R\} for some fixed R>0, for training data X\in\mathbb{R}^{d\times n} and a test point x\in\mathbb{R}^{d}, with high probability over the initialization,

\lVert f_{T}(X)-g_{T}(X)\rVert=O\Bigl(\frac{e^{\beta_{l}\lVert\hat{\Theta}(w_{0})\rVert}R^{3L+1}\rho n^{\frac{3}{2}}\ln{m}}{\lambda\sqrt{m}}\Bigr),
\lVert f_{T}(x)-g_{T}(x)\rVert=O\Bigl(\frac{e^{\beta_{l}\lVert\hat{\Theta}(w_{0};X,x)\rVert}R^{3L+1}\rho n\ln{m}}{\lambda\sqrt{m}}\Bigr),

where f_{T}(X),g_{T}(X)\in\mathbb{R}^{n} are the outputs on the training data and \hat{\Theta}(w_{0};X,x)\in\mathbb{R}^{n} is the tangent kernel evaluated between the training data and the test point.

4 Discussion

In this section, we give some extensions and applications of our theory. We first show that every finite-width neural network trained by an \ell_{2} regularized loss function is approximately a KM in Section 4.1, which enables us to compute a non-vacuous generalization bound of the NN via the corresponding KM. Next, in Section 4.2, we show that our theory of equivalence (in Section 3.3) is useful for evaluating the robustness of over-parameterized NNs with infinite width. In particular, our theory allows us to deliver nontrivial robustness certificates for infinite-width NNs, while certificates from existing robustness verification methods [22, 57, 60] become much looser (decreasing at a rate of O(1/\sqrt{m})) as the width of the NN increases and trivial at infinite width (the experiment results are in Section 5 and Table 2).

4.1 Finite-width Neural Network Trained by \ell_{2} Regularized Loss

Inspired by [18], we can also show that every NN trained by (sub)gradient descent with loss function (7) is approximately a KM without the assumption of infinite width.

Theorem 4.1.

Suppose a NN f(w,x) is learned from a training set \{(x_{i},y_{i})\}_{i=1}^{n} by (sub)gradient descent with loss function (7) and gradient flow. Assume \text{sign}(l^{\prime}(f_{t}(x_{i}),y_{i}))=\text{sign}(l^{\prime}(f_{0}(x_{i}),y_{i})),\forall t\in\left[0,T\right]. (This is the case for hinge loss and logistic loss.) Then at some time T>0,

f_{T}(x)=\sum_{i=1}^{n}a_{i}K(x,x_{i})+b,\quad\text{with}\quad K(x,x_{i})=e^{-\lambda T}\int_{0}^{T}\lvert l^{\prime}(f_{t}(x_{i}),y_{i})\rvert\hat{\Theta}(w_{t};x,x_{i})e^{\lambda t}\,dt,

and a_{i}=-\text{sign}(l^{\prime}(f_{0}(x_{i}),y_{i})), b=e^{-\lambda T}f_{0}(x).

See the proof in Appendix G, which utilizes the solution of an inhomogeneous linear differential equation instead of integrating both sides of the dynamics (Eq. (9)) directly [18]. Note that in Theorem 4.1, a_{i} is deterministic and independent of x, different from [18], where a_{i} depends on x. Deterministic a_{i} makes the function class simpler. Combining Theorem 4.1 with a bound on the Rademacher complexity of the KM [7] and a standard generalization bound using Rademacher complexity [39], we can compute the generalization bound of the NN via the corresponding KM. See Appendix B for more background and experiments. The generalization bound we get depends on a_{i}, which depends on the label y_{i}. This differs from traditional complexity measures, which cannot explain the random label phenomenon [59].
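As a numerical illustration of Theorem 4.1, the kernel K(x, x_i) can be approximated by a Riemann sum over checkpoints of the tangent kernel recorded during training. The sketch below assumes such checkpoints (and the corresponding loss derivatives) have already been collected, e.g. with the empirical_tangent_kernel helper sketched in Section 2.2; the checkpoint spacing dt is an assumption.

```python
import numpy as np

def theorem41_kernel(theta_checkpoints, lprime_checkpoints, dt, lam):
    """Riemann-sum approximation of K(x, x_i) from Theorem 4.1.

    theta_checkpoints:  (T_steps, n) array with Theta_hat(w_t; x, x_i) at times t = k*dt.
    lprime_checkpoints: (T_steps, n) array with l'(f_t(x_i), y_i) at the same times.
    """
    T_steps, _ = theta_checkpoints.shape
    T = T_steps * dt
    times = np.arange(T_steps) * dt
    integrand = np.abs(lprime_checkpoints) * theta_checkpoints * np.exp(lam * times)[:, None]
    return np.exp(-lam * T) * integrand.sum(axis=0) * dt  # shape (n,)
```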

Remark 4.1.

We note that the kernel here is valid only when \lvert l^{\prime}(f_{t}(x_{i}),y_{i})\rvert is a constant, e.g. l^{\prime}(f_{t}(x_{i}),y_{i})=-y_{i} at the initial training stage of the hinge loss with f_{0}(x)=0; otherwise the kernel is not symmetric.

4.2 Robustness Verification of Infinite-width Neural Network

Our theory of equivalence allows us to deliver nontrivial robustness certificates for infinite-width NNs by considering the equivalent KMs. For an input x_{0}\in\mathbb{R}^{d}, the objective of robustness verification is to find the largest ball such that no example within this ball, x\in B(x_{0},\delta), can change the classification result. Without loss of generality, we assume g(x_{0})>0. The robustness verification problem can be formulated as follows,

\max\ \delta,\quad\text{s.t.}\ \ g(x)>0,\ \forall x\in B(x_{0},\delta). (10)

For an infinitely wide two-layer fully connected ReLU NN, f(x)=\frac{1}{\sqrt{m}}\sum_{j=1}^{m}v_{j}\sigma(\frac{1}{\sqrt{d}}w_{j}^{T}x), where \sigma(z)=\max(0,z) is the ReLU activation, the NTK is

\Theta(x,x^{\prime})=\frac{\langle x,x^{\prime}\rangle}{d}\Bigl(\frac{\pi-\arccos(u)}{\pi}\Bigr)+\frac{\lVert x\rVert\lVert x^{\prime}\rVert}{2\pi d}\sqrt{1-u^{2}},

where u=\frac{\langle x,x^{\prime}\rangle}{\lVert x\rVert\lVert x^{\prime}\rVert}\in\left[-1,1\right]. See the proof of the following theorem in Appendix H.1.

Theorem 4.2.

Consider the \ell_{\infty} perturbation: for x\in B_{\infty}(x_{0},\delta)=\{x\in\mathbb{R}^{d}:\lVert x-x_{0}\rVert_{\infty}\leq\delta\}, we can bound \Theta(x,x^{\prime}) within some interval [\Theta^{L}(x,x^{\prime}),\Theta^{U}(x,x^{\prime})]. Suppose g(x)=\sum_{i=1}^{n}\alpha_{i}\Theta(x,x_{i}), where the \alpha_{i} are known after solving the KM problem (e.g. SVM or KRR). Then we can lower bound g(x) as follows,

g(x)\geq\sum_{i=1,\alpha_{i}>0}^{n}\alpha_{i}\Theta^{L}(x,x_{i})+\sum_{i=1,\alpha_{i}<0}^{n}\alpha_{i}\Theta^{U}(x,x_{i}).

Using a simple binary search and the above theorem, we can find a certified lower bound for (10). Because of the equivalence between the infinite-width NN and the KM, the certified lower bound we get for the KM is equivalently a certified lower bound for the corresponding infinite-width NN.
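A sketch of this certification procedure is given below: it implements the two-layer ReLU NTK above, the lower bound of Theorem 4.2, and the binary search over \delta. The functions theta_lower and theta_upper, which should return the interval [\Theta^{L},\Theta^{U}] over the \ell_{\infty} ball (derived in Appendix H.1), are left as assumed inputs here rather than implemented.

```python
import numpy as np

def ntk_two_layer_relu(x, xp):
    """NTK of the infinitely wide two-layer ReLU network (formula above)."""
    d = len(x)
    u = np.clip(x @ xp / (np.linalg.norm(x) * np.linalg.norm(xp)), -1.0, 1.0)
    return (x @ xp / d) * (np.pi - np.arccos(u)) / np.pi \
        + np.linalg.norm(x) * np.linalg.norm(xp) / (2 * np.pi * d) * np.sqrt(1 - u ** 2)

def certified_radius(x0, X_train, alpha, theta_lower, theta_upper, delta_max=1.0, iters=20):
    """Binary search for the largest delta whose certified lower bound on g(x) stays positive.

    theta_lower(x0, xi, delta) / theta_upper(x0, xi, delta) are assumed to return
    Theta^L / Theta^U over B_inf(x0, delta), as in Theorem 4.2 and Appendix H.1.
    """
    lo, hi = 0.0, delta_max
    for _ in range(iters):
        delta = 0.5 * (lo + hi)
        lower = sum(a * (theta_lower(x0, xi, delta) if a > 0 else theta_upper(x0, xi, delta))
                    for a, xi in zip(alpha, X_train))
        if lower > 0:
            lo = delta   # certified at this radius; try a larger one
        else:
            hi = delta
    return lo
```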

Figure 1: Training dynamics of the neural network and SVM behave similarly. (a)(b) show the dynamics of outputs for two randomly selected samples. (c) shows the difference between the outputs of SVM and NN; the dynamics of SVM agrees better with the wider NN. (d) shows the dynamics of the hinge loss for SVM and NN. Unless otherwise specified, the width of the NN is 10K and \hat{\eta}=0.1.

5 Experiments

(I) Verification of the equivalence. The first experiment verifies the equivalence between the soft margin SVM with NTK trained by subgradient descent and the infinite-width NN trained by soft margin loss. We train the SVM and a 3-layer fully connected ReLU NN for binary MNIST [29] classification (0 and 1) with learning rates \hat{\eta}=0.1 and \hat{\eta}=0.01 using full-batch subgradient descent on n=128 samples, where \hat{\eta} is the learning rate used in the experiments. Figure 1 shows the dynamics of the outputs and the loss for the NN and SVM. Since the regularization terms in the losses of the NN and SVM are different, we only plot the hinge loss. It can be seen that the dynamics of the NN and SVM agree very well. We also run a stochastic subgradient descent experiment for binary classification on the full MNIST 0 and 1 data (12665 training and 2115 test samples) with learning rate \hat{\eta}=1 and batch size 64, shown in Figure A.1. For more details, please see Appendix A.

(II) Robustness of over-parameterized neural network. Table 2 shows the robustness certificates of two-layer over-parameterized NNs with increasing width and of the SVM (which is equivalent to an infinite-width two-layer ReLU NN) on binary classification of MNIST (0 and 1). We use the NN robustness verification algorithm IBP [22] to compute the robustness certificates for the two-layer over-parameterized NNs. The robustness certificate for the SVM is computed using our method in Section 4.2. As demonstrated in Table 2, the certificate of the NN decreases almost at a rate of O(1/\sqrt{m}) and will decrease to 0 as m\rightarrow\infty, where m is the width of the hidden layer. We show that this is due to bound propagation in Appendix H.2. Unfortunately, the decrease rate becomes faster if the NN is deeper. The same problem happens for LeCun initialization as well, which PyTorch uses for fully connected layers by default. Notably, however, thanks to our theory, we can compute a nontrivial robustness certificate for an infinite-width NN through the equivalent SVM, as demonstrated.

Table 2: Robustness certified lower bounds of two-layer ReLU NNs and SVM (infinite-width two-layer ReLU NN) tested on binary classification of MNIST (0 and 1). 100 test: 100 randomly selected test samples. Full test: full test data. Tested only on data that is classified correctly. The std is computed over data samples. All models have test accuracy 99.95%. All values are means over 5 experiments.
Robustness certificate \delta (mean \pm std) \times 10^{-3}
Model Width 100 test Full test
NN 10^{3} 7.4485 \pm 2.5667 7.2708 \pm 2.1427
NN 10^{4} 2.9861 \pm 1.0730 2.9367 \pm 0.89807
NN 10^{5} 0.99098 \pm 0.35775 0.97410 \pm 0.29997
NN 10^{6} 0.31539 \pm 0.11380 0.30997 \pm 0.095467
SVM \infty 8.0541 \pm 2.5827 7.9733 \pm 2.1396

(III) Comparison with kernel regression. Table 3 compares our \ell_{2} regularized models (KRR and SVM with NTK) with the previous kernel regression model (\lambda=0 for KRR). All robustness certified lower bounds are computed using our method in Section 4.2. While the accuracies of the different models are similar, the robustness of KRR increases as the regularization increases. The SVM is substantially more robust than KRR with the same regularization magnitude. Our theory enables us to train an equivalent infinite-width NN through SVM and KRR, which is intrinsically more robust than the previous kernel regression model.

Table 3: Robustness of equivalent infinite-width NN models with different loss functions (see Table 1) on binary classification of MNIST (0 and 1). \lambda is the parameter in Eq. (7).
Model \lambda Test accuracy Robustness certificate \delta Robustness improvement
\lambda=0 ([27, 4]) KRR 0 99.95% 3.30202\times 10^{-5} -
\lambda>0 (ours) KRR 0.001 99.95% 3.756122\times 10^{-5} 1.14X
KRR 0.01 99.95% 6.505500\times 10^{-5} 1.97X
KRR 0.1 99.95% 2.229960\times 10^{-4} 6.75X
KRR 1 99.95% 0.001005 30.43X
KRR 10 99.91% 0.005181 156.90X
KRR 100 99.86% 0.020456 619.50X
KRR 1000 99.76% 0.026088 790.06X
SVM 0.064 99.95% 0.008054 243.91X

6 Conclusion and Future Works

In this paper, we establish the equivalence between the SVM with NTK and the NN trained by soft margin loss with subgradient descent in the infinite width limit, and we show that they have the same dynamics and solution. We also extend our analysis to general \ell_{2} regularized loss functions and show that every neural network trained by such loss functions is approximately a KM. Finally, we demonstrate that our theory is useful for computing a non-vacuous generalization bound for an NN and a non-trivial robustness certificate for an infinite-width NN, which existing neural network robustness verification algorithms cannot handle, and that the resulting infinite-width NN from our \ell_{2} regularized models is intrinsically more robust than that from the previous NTK kernel regression. For future research, since the equivalence between NN and SVM (and other \ell_{2} regularized KMs) with NTK has been established, it would be very interesting to understand the generalization and robustness of NNs from the perspective of these KMs. Our main results are currently still limited to the linear regime. It would be interesting to extend the results to the mean field setting or consider their connection with the implicit bias of NNs.

7 Acknowledgement

We thank the anonymous reviewers for useful suggestions to improve the paper. We thank Libin Zhu for helpful discussions. We thank San Diego Supercomputer Center and MIT-IBM Watson AI Lab for computing resources. This work used the Extreme Science and Engineering Discovery Environment (XSEDE) [55], which is supported by National Science Foundation grant number ACI-1548562. This work used the Extreme Science and Engineering Discovery Environment (XSEDE) Expanse at San Diego Supercomputer Center through allocation TG-ASC150024 and ddp390. T.-W. Weng is partially supported by National Science Foundation under Grant No. 2107189.

References

  • Allen-Zhu et al. [2019] Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pages 242–252. PMLR, 2019.
  • Ambrosio et al. [2008] L. Ambrosio, N. Gigli, and G. Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.
  • Andras [2002] P. Andras. The equivalence of support vector machine and regularization neural networks. Neural Processing Letters, 15(2):97–104, 2002.
  • Arora et al. [2019a] S. Arora, S. S. Du, W. Hu, Z. Li, R. R. Salakhutdinov, and R. Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, pages 8141–8150, 2019a.
  • Arora et al. [2019b] S. Arora, S. S. Du, W. Hu, Z. Li, and R. Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019b.
  • Arora et al. [2019c] S. Arora, S. S. Du, Z. Li, R. Salakhutdinov, R. Wang, and D. Yu. Harnessing the power of infinitely wide deep nets on small-data tasks. arXiv preprint arXiv:1910.01663, 2019c.
  • Bartlett and Mendelson [2002] P. L. Bartlett and S. Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
  • Bartlett et al. [2019] P. L. Bartlett, N. Harvey, C. Liaw, and A. Mehrabian. Nearly-tight vc-dimension and pseudodimension bounds for piecewise linear neural networks. The Journal of Machine Learning Research, 20(1):2285–2301, 2019.
  • Bertsekas and Scientific [2015] D. P. Bertsekas and A. Scientific. Convex optimization algorithms. Athena Scientific Belmont, 2015.
  • Boser et al. [1992] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory, pages 144–152, 1992.
  • Boyd et al. [2003] S. Boyd, L. Xiao, and A. Mutapcic. Subgradient methods. lecture notes of EE392o, Stanford University, Autumn Quarter, 2004:2004–2005, 2003.
  • Cao and Gu [2019] Y. Cao and Q. Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. arXiv preprint arXiv:1905.13210, 2019.
  • Chizat and Bach [2020] L. Chizat and F. Bach. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. In Conference on Learning Theory, pages 1305–1338. PMLR, 2020.
  • Chizat et al. [2018] L. Chizat, E. Oyallon, and F. Bach. On lazy training in differentiable programming. arXiv preprint arXiv:1812.07956, 2018.
  • Cho [2012] Y. Cho. Kernel methods for deep learning. PhD thesis, UC San Diego, 2012.
  • Cortes and Vapnik [1995] C. Cortes and V. Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.
  • Cristianini et al. [2000] N. Cristianini, J. Shawe-Taylor, et al. An introduction to support vector machines and other kernel-based learning methods. Cambridge university press, 2000.
  • Domingos [2020] P. Domingos. Every model learned by gradient descent is approximately a kernel machine. arXiv preprint arXiv:2012.00152, 2020.
  • Du et al. [2019a] S. Du, J. Lee, H. Li, L. Wang, and X. Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pages 1675–1685. PMLR, 2019a.
  • Du et al. [2018] S. S. Du, X. Zhai, B. Poczos, and A. Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
  • Du et al. [2019b] S. S. Du, K. Hou, B. Póczos, R. Salakhutdinov, R. Wang, and K. Xu. Graph neural tangent kernel: Fusing graph neural networks with graph kernels. arXiv preprint arXiv:1905.13192, 2019b.
  • Gowal et al. [2018] S. Gowal, K. Dvijotham, R. Stanforth, R. Bunel, C. Qin, J. Uesato, R. Arandjelovic, T. Mann, and P. Kohli. On the effectiveness of interval bound propagation for training verifiably robust models. arXiv preprint arXiv:1810.12715, 2018.
  • Hsieh et al. [2008] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear svm. In Proceedings of the 25th international conference on Machine learning, pages 408–415, 2008.
  • Hu et al. [2021] T. Hu, W. Wang, C. Lin, and G. Cheng. Regularization matters: A nonparametric perspective on overparametrized neural network. In International Conference on Artificial Intelligence and Statistics, pages 829–837. PMLR, 2021.
  • Hu et al. [2019] W. Hu, Z. Li, and D. Yu. Simple and effective regularization methods for training on noisily labeled data with generalization guarantee. arXiv preprint arXiv:1905.11368, 2019.
  • Huang et al. [2020] W. Huang, W. Du, and R. Y. Da Xu. On the neural tangent kernel of deep networks with orthogonal initialization. arXiv preprint arXiv:2004.05867, 2020.
  • Jacot et al. [2018] A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571–8580, 2018.
  • Ji and Telgarsky [2018] Z. Ji and M. Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300, 2018.
  • LeCun et al. [1998] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Lee et al. [2017] J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein. Deep neural networks as gaussian processes. arXiv preprint arXiv:1711.00165, 2017.
  • Lee et al. [2019] J. Lee, L. Xiao, S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in neural information processing systems, pages 8572–8583, 2019.
  • Liang et al. [2017] X. Liang, X. Wang, Z. Lei, S. Liao, and S. Z. Li. Soft-margin softmax for deep classification. In International Conference on Neural Information Processing, pages 413–421. Springer, 2017.
  • Liu et al. [2020a] C. Liu, L. Zhu, and M. Belkin. On the linearity of large non-linear models: when and why the tangent kernel is constant. Advances in Neural Information Processing Systems, 33, 2020a.
  • Liu et al. [2020b] C. Liu, L. Zhu, and M. Belkin. Loss landscapes and optimization in over-parameterized non-linear systems and neural networks. arXiv preprint arXiv:2003.00307, 2020b.
  • Liu et al. [2016] W. Liu, Y. Wen, Z. Yu, and M. Yang. Large-margin softmax loss for convolutional neural networks. In ICML, volume 2, page 7, 2016.
  • Long and Sedghi [2019] P. M. Long and H. Sedghi. Generalization bounds for deep convolutional neural networks. arXiv preprint arXiv:1905.12600, 2019.
  • Lyu and Li [2019] K. Lyu and J. Li. Gradient descent maximizes the margin of homogeneous neural networks. arXiv preprint arXiv:1906.05890, 2019.
  • Matthews et al. [2018] A. G. d. G. Matthews, M. Rowland, J. Hron, R. E. Turner, and Z. Ghahramani. Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271, 2018.
  • Mohri et al. [2018] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of machine learning. MIT press, 2018.
  • Nacson et al. [2019] M. S. Nacson, S. Gunasekar, J. Lee, N. Srebro, and D. Soudry. Lexicographic and depth-sensitive margins in homogeneous and non-homogeneous deep models. In International Conference on Machine Learning, pages 4683–4692. PMLR, 2019.
  • Novak et al. [2020] R. Novak, L. Xiao, J. Hron, J. Lee, A. A. Alemi, J. Sohl-Dickstein, and S. S. Schoenholz. Neural tangents: Fast and easy infinite neural networks in python. In International Conference on Learning Representations, 2020. URL https://github.com/google/neural-tangents.
  • Paszke et al. [2019] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703, 2019.
  • Pellegrini and Biroli [2020] F. Pellegrini and G. Biroli. An analytic theory of shallow networks dynamics for hinge loss classification. Advances in Neural Information Processing Systems, 33, 2020.
  • Poggio and Girosi [1990] T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78(9):1481–1497, 1990.
  • Schölkopf et al. [2002] B. Schölkopf, A. J. Smola, F. Bach, et al. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2002.
  • Shalev-Shwartz and Ben-David [2014] S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
  • Shalev-Shwartz et al. [2011] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for svm. Mathematical programming, 127(1):3–30, 2011.
  • Shawe-Taylor et al. [2004] J. Shawe-Taylor, N. Cristianini, et al. Kernel methods for pattern analysis. Cambridge university press, 2004.
  • Smola et al. [1998] A. J. Smola, B. Schölkopf, and K.-R. Müller. The connection between regularization operators and support vector kernels. Neural networks, 11(4):637–649, 1998.
  • Sokolić et al. [2017] J. Sokolić, R. Giryes, G. Sapiro, and M. R. Rodrigues. Robust large margin deep neural networks. IEEE Transactions on Signal Processing, 65(16):4265–4280, 2017.
  • Soudry et al. [2018] D. Soudry, E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.
  • Steinwart and Christmann [2008] I. Steinwart and A. Christmann. Support vector machines. Springer Science & Business Media, 2008.
  • Sun et al. [2016] S. Sun, W. Chen, L. Wang, X. Liu, and T.-Y. Liu. On the depth of deep neural networks: A theoretical view. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
  • Tang [2013] Y. Tang. Deep learning using linear support vector machines. arXiv preprint arXiv:1306.0239, 2013.
  • Towns et al. [2014] J. Towns, T. Cockerill, M. Dahan, I. Foster, K. Gaither, A. Grimshaw, V. Hazlewood, S. Lathrop, D. Lifka, G. D. Peterson, R. Roskies, J. R. Scott, and N. Wilkins-Diehr. Xsede: Accelerating scientific discovery. Computing in Science & Engineering, 16(5):62–74, Sept.-Oct. 2014. ISSN 1521-9615. doi: 10.1109/MCSE.2014.80. URL doi.ieeecomputersociety.org/10.1109/MCSE.2014.80.
  • Wei et al. [2019] C. Wei, J. Lee, Q. Liu, and T. Ma. Regularization matters: Generalization and optimization of neural nets vs their induced kernel. 2019.
  • Weng et al. [2018] L. Weng, H. Zhang, H. Chen, Z. Song, C.-J. Hsieh, L. Daniel, D. Boning, and I. Dhillon. Towards fast computation of certified robustness for relu networks. In International Conference on Machine Learning, pages 5276–5285. PMLR, 2018.
  • Yang and Hu [2020] G. Yang and E. J. Hu. Feature learning in infinite-width neural networks. arXiv preprint arXiv:2011.14522, 2020.
  • Zhang et al. [2021] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.
  • Zhang et al. [2018] H. Zhang, T.-W. Weng, P.-Y. Chen, C.-J. Hsieh, and L. Daniel. Efficient neural network robustness certification with general activation functions. arXiv preprint arXiv:1811.00866, 2018.

Appendix A Experiment Details

Figure A.1: SVM and NN trained by stochastic subgradient descent for the binary MNIST classification task on the full 0 and 1 data with learning rate \hat{\eta}=1 and batch size 64. The width of the NN is 10K.

A.1 SVM Training

We use the following loss to train the SVM,

L(\beta)=\frac{\lambda}{2}\left\lVert\beta\right\rVert^{2}+\frac{1}{n}\sum_{i=1}^{n}\max(0,1-y_{i}\left\langle\beta,\Phi(x_{i})\right\rangle). (11)

Let \hat{\eta} be the learning rate for this loss in the experiments. Then the dynamics of subgradient descent is

g_{t+1}(x)=(1-\hat{\eta}\lambda)g_{t}(x)+\frac{\hat{\eta}}{n}\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{t}(x_{i})<1)y_{i}K(x,x_{i}). (12)

Denote Q(t)=\frac{\hat{\eta}}{n}\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{t}(x_{i})<1)y_{i}K(x,x_{i}), which is a linear combination of the K(x,x_{i}) and changes over time. The model output at some time T is

g_{T}(x)=(1-\hat{\eta}\lambda)^{T}\Bigl(g_{0}(x)+\sum_{t=0}^{T-1}(1-\hat{\eta}\lambda)^{-t-1}Q(t)\Bigr). (13)

If we set g_{0}(x)=0, we have

g_{T}(x)=\sum_{t=0}^{T-1}(1-\hat{\eta}\lambda)^{T-1-t}Q(t). (14)

We see that g_{T}(x) is always a linear combination of the kernel values K(x,x_{i}) for i=1,\dots,n. Since the K(x,x_{i}) are fixed, we just need to store and update the weights of the kernel values. Let \alpha_{t}\in\mathbb{R}^{n} be the weights at time t, that is

g_{t}(x)=\sum_{i=1}^{n}\alpha_{t,i}K(x,x_{i}). (15)

Then, according to Eq. (12), we update \alpha at each subgradient descent step as follows.

\alpha_{t+1,i}=(1-\hat{\eta}\lambda)\alpha_{t,i}+\frac{\hat{\eta}}{n}\mathbbm{1}(y_{i}g_{t}(x_{i})<1)y_{i},\quad\forall i\in\{1,\dots,n\}. (16)

For the SGD case, we sample S_{t}\subseteq\{1,\dots,n\} at step t and update the weights of this subset while keeping the other weights unchanged.

αt+1,i=(1η^λ)αt,i+η^|St|𝟙(yigt(xi)<1)yi,iSt,\displaystyle\alpha_{t+1,i}=(1-\hat{\eta}\lambda)\alpha_{t,i}+\frac{\hat{\eta}}{\left\lvert S_{t}\right\rvert}\mathbbm{1}(y_{i}g_{t}(x_{i})<1)y_{i},\quad\forall i\in S_{t},
αt+1,i=αt,i,iSt.\displaystyle\alpha_{t+1,i}=\alpha_{t,i},\quad\forall i\notin S_{t}.

The kernelized implementation of Pegasos [47] sets \hat{\eta}_{t}=\frac{1}{\lambda t} in order to prove convergence of the algorithm. In our experiments, we use a constant \hat{\eta}.
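For concreteness, the following is a minimal NumPy sketch of the kernelized training procedure above (Eq. (16) and the stochastic variant), assuming the kernel Gram matrix K has been precomputed (e.g., the NTK Gram matrix); function names such as train_svm_kernel are illustrative.

```python
import numpy as np

def train_svm_kernel(K, y, lam=0.001, lr=0.1, steps=1000, batch_size=None, seed=0):
    """Kernelized soft margin SVM trained by (stochastic) subgradient descent.

    K   : (n, n) kernel Gram matrix, K[i, j] = K(x_i, x_j)
    y   : (n,) labels in {-1, +1}
    lam : regularization parameter lambda
    lr  : learning rate \hat{eta}
    Returns the weights alpha so that g(x) = sum_i alpha_i K(x, x_i), as in Eq. (15).
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    alpha = np.zeros(n)              # g_0(x) = 0, matching beta_0 = 0
    for _ in range(steps):
        g = K @ alpha                # g_t(x_i) on the training points
        violated = (y * g < 1).astype(float)   # indicator 1(y_i g_t(x_i) < 1)
        if batch_size is None:       # full-batch update, Eq. (16)
            alpha = (1 - lr * lam) * alpha + (lr / n) * violated * y
        else:                        # stochastic variant: update only the sampled subset S_t
            S = rng.choice(n, size=batch_size, replace=False)
            alpha[S] = (1 - lr * lam) * alpha[S] + (lr / len(S)) * violated[S] * y[S]
    return alpha

def predict(K_test_train, alpha):
    """g(x) = sum_i alpha_i K(x, x_i) for test points (rows of K_test_train)."""
    return K_test_train @ alpha
```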

A.2 More Details

(I) Verification of the equivalence. The first experiment illustrates the equivalence between the soft margin SVM with NTK trained by subgradient descent and the NN trained by the soft margin loss. We initialize 3-layer fully connected ReLU neural networks of width 10000 and 900 with NTK parameterization and ensure f_{0}(x)=0 by subtracting the initial values from the NN outputs. We initialize the parameter of the SVM as \beta_{0}=0, which automatically ensures g_{0}(x)=0. The SVM is trained by directly updating the weights of the kernel values [47]; more details are given in Appendix A.1. We set the regularization parameter to \lambda=0.001 and take the average of the hinge loss instead of the sum (this is equivalent to using \lambda=0.001\times n in Eq. (7)). We train the NN and SVM for a binary MNIST [29] classification task (0 and 1) with learning rates \hat{\eta}=0.1 and \hat{\eta}=0.01 using full-batch subgradient descent on n=128 samples, where \hat{\eta} is the learning rate used in the experiments (see Appendix A.1). Figure 1 shows the dynamics of the outputs and loss for the NN and SVM. Since the regularization terms in the losses of the NN and SVM differ, we plot only the hinge loss. We see that the dynamics of the NN and SVM agree well. We also run a stochastic subgradient descent experiment for the binary MNIST classification task on the full 0 and 1 data (12665 training and 2115 test samples) with learning rate \hat{\eta}=1 and batch size 64, shown in Figure A.1.

Experiments are implemented in PyTorch [42], and the NTK of the infinite-width NN is computed using Neural Tangents [41]. We run our experiments on a 16GB V100 GPU.

Appendix B Computing Non-vacuous Generalization Bounds via Corresponding Kernel Machines

Figure B.2: Computing non-vacuous generalization bounds via corresponding kernel machines. A two-layer NN with 100 hidden nodes is trained by full-batch subgradient descent for the binary MNIST classification task on 0 and 1 data of size 1000 with learning rate \hat{\eta}=0.1 and hinge loss. We set f_{0}(x)=0. The kernel machine (KM) approximates the NN very well, and we obtain a tight upper bound on the true loss by computing its Rademacher complexity. The confidence parameter is set to 1-\delta=0.99. Since the kernel is only valid when the loss gradient of the output is a constant, we only run the initial training stage of the hinge loss, where the kernel is valid.

Using Theorem 4.1, we can numerically compute the kernel machine that the NN is equivalent to, i.e., we can compute the kernel matrix and the weights at any time during training. One can then apply a generalization bound for kernel machines to obtain a generalization bound for this kernel machine (equivalently, for this NN). Let \mathcal{H} be the reproducing kernel Hilbert space (RKHS) corresponding to the kernel K(\cdot,\cdot). The RKHS norm of a function f(x)=\sum_{i=1}^{n}a_{i}K(x,x_{i}) is (assuming f_{0}(x)=0)

f=i=1naiΦ(xi)=i=1nj=1naiajK(xi,xj)\left\lVert f\right\rVert_{\mathcal{H}}=\left\lVert\sum_{i=1}^{n}a_{i}\Phi(x_{i})\right\rVert=\sqrt{\sum_{i=1}^{n}\sum_{j=1}^{n}a_{i}a_{j}K(x_{i},x_{j})}
Lemma B.1 (Lemma 22 in [7]).

For a function class B={f(x)=i=1naiK(x,xi):fB}{xβ,Φ(x):βB}\mathcal{F}_{B}=\{f(x)=\sum_{i=1}^{n}a_{i}K(x,x_{i}):\left\lVert f\right\rVert_{\mathcal{H}}\leq B\}\subseteq\{x\rightarrow\left\langle\beta,\Phi(x)\right\rangle:\left\lVert\beta\right\rVert\leq B\}, its empirical Rademacher complexity can be bounded as

^S(B)=1n𝔼σi{±1}n[supfBi=1nσif(xi)]Bni=1nK(xi,xi)\hat{\mathcal{R}}_{S}(\mathcal{F}_{B})=\frac{1}{n}\mathop{\mathbb{E}}_{\sigma_{i}\sim\{\pm 1\}^{n}}\left[\sup_{f\in\mathcal{F}_{B}}\sum_{i=1}^{n}\sigma_{i}f(x_{i})\right]\leq\frac{B}{n}\sqrt{\sum_{i=1}^{n}K(x_{i},x_{i})}

Assume the data is sampled i.i.d. from some distribution D and the population loss is L_{D}(f)=\mathop{\mathbb{E}}_{(x,y)\sim D}\left[l(f(x),y)\right]. The empirical loss is L_{S}(f)=\frac{1}{n}\sum_{i=1}^{n}l(f(x_{i}),y_{i}). Combining this with the standard Rademacher-complexity-based generalization bound below [39], we obtain a bound on the population loss L_{D}(f) for the kernel machine (equivalently, for this NN).

Lemma B.2.

Suppose the loss (,)\ell(\cdot,\cdot) is bounded in [0,c]\left[0,c\right], and is ρ\rho-Lipschitz in the first argument. Then with probability at least 1δ1-\delta over the sample SS of size nn,

supf{LD(f)LS(f)}2ρ^S()+3clog(2/δ)2n\sup_{f\in\mathcal{F}}\{L_{D}(f)-L_{S}(f)\}\leq 2\rho\hat{\mathcal{R}}_{S}(\mathcal{F})+3c\sqrt{\frac{\log(2/\delta)}{2n}}

Most existing generalization bounds for NNs [8, 36] are vacuous since they depend on the number of parameters. In contrast, the bound for kernel machines does not depend on the number of the NN's parameters, which makes it non-vacuous and promising.
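As a concrete illustration, the following is a minimal NumPy sketch of how the bound can be evaluated from the kernel matrix K and the weights a of the corresponding kernel machine (Theorem 4.1), combining the RKHS norm, Lemma B.1 and Lemma B.2; the values of \rho, c and \delta are inputs chosen by the user.

```python
import numpy as np

def rkhs_norm(a, K):
    """||f||_H = sqrt(a^T K a) for f(x) = sum_i a_i K(x, x_i) (assuming f_0(x) = 0)."""
    return float(np.sqrt(a @ K @ a))

def rademacher_bound(a, K):
    """Lemma B.1: hat{R}_S(F_B) <= (B / n) * sqrt(sum_i K(x_i, x_i)), with B = ||f||_H."""
    n = K.shape[0]
    B = rkhs_norm(a, K)
    return B / n * np.sqrt(np.trace(K))

def generalization_bound(a, K, empirical_loss, rho=1.0, c=1.0, delta=0.01):
    """Lemma B.2: L_D(f) <= L_S(f) + 2*rho*hat{R}_S(F) + 3c*sqrt(log(2/delta) / (2n))."""
    n = K.shape[0]
    rad = rademacher_bound(a, K)
    return empirical_loss + 2 * rho * rad + 3 * c * np.sqrt(np.log(2 / delta) / (2 * n))
```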

Appendix C Dynamics of Support Vector Machine

In this section, we derive the continuous and discrete dynamics of the soft margin SVM trained by subgradient descent with the following loss function

L(β)=12β2+Ci=1nmax(0,1yiβ,Φ(xi)),L(\beta)=\frac{1}{2}\left\lVert\beta\right\rVert^{2}+C\sum_{i=1}^{n}\max(0,1-y_{i}\left\langle\beta,\Phi(x_{i})\right\rangle), (17)

and the subgradient

βL(βt)=βtCi=1n𝟙(yigt(xi)<1)yiΦ(xi).\nabla_{\beta}L(\beta_{t})=\beta_{t}-C\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{t}(x_{i})<1)y_{i}\Phi(x_{i}). (18)
Lemma C.1.

L(\beta) satisfies the Polyak-Łojasiewicz (PL) inequality,

L(βt)L(β)12βL(βt)2βt.L(\beta_{t})-L(\beta^{*})\leq\frac{1}{2}\left\lVert\nabla_{\beta}L(\beta_{t})\right\rVert^{2}\quad\forall\ \beta_{t}. (19)

where β=argminβL(β)\beta^{*}=\operatorname*{arg\,min}_{\beta}L(\beta).

Proof.

Since L(β)L(\beta) is 1-strongly convex, by the definition of strong convexity and subgradient

L(β)L(βt)+βL(βt),ββt+12ββt2L(\beta)\geq L(\beta_{t})+\left\langle\nabla_{\beta}L(\beta_{t}),\beta-\beta_{t}\right\rangle+\frac{1}{2}\left\lVert\beta-\beta_{t}\right\rVert^{2} (20)

The right-hand side is a convex quadratic function of \beta (for fixed \beta_{t}). Setting its gradient with respect to \beta to 0, we find that \tilde{\beta}=\beta_{t}-\nabla_{\beta}L(\beta_{t}) minimizes the right-hand side. Therefore, we have

L(β)L(βt)+βL(βt),ββt+12ββt2L(βt)+βL(βt),β~βt+12β~βt2=L(βt)12βL(βt)2.\begin{split}L(\beta)&\geq L(\beta_{t})+\left\langle\nabla_{\beta}L(\beta_{t}),\beta-\beta_{t}\right\rangle+\frac{1}{2}\left\lVert\beta-\beta_{t}\right\rVert^{2}\\ &\geq L(\beta_{t})+\left\langle\nabla_{\beta}L(\beta_{t}),\tilde{\beta}-\beta_{t}\right\rangle+\frac{1}{2}\left\lVert\tilde{\beta}-\beta_{t}\right\rVert^{2}\\ &=L(\beta_{t})-\frac{1}{2}\left\lVert\nabla_{\beta}L(\beta_{t})\right\rVert^{2}.\end{split} (21)

Since this holds for any β\beta, we have

L(β)L(βt)12βL(βt)2.L(\beta^{*})\geq L(\beta_{t})-\frac{1}{2}\left\lVert\nabla_{\beta}L(\beta_{t})\right\rVert^{2}. (22)

C.1 Continuous Dynamics of SVM

Here we give the detailed derivation of the dynamics of the soft margin SVM trained by subgradient descent. In the limit of the learning rate \eta\rightarrow 0, the subgradient descent update, which can be written as

βt+1βtη=βL(βt),\frac{\beta_{t+1}-\beta_{t}}{\eta}=-\nabla_{\beta}L(\beta_{t}), (23)

becomes a differential equation

dβtdt=βL(βt).\frac{d\beta_{t}}{dt}=-\nabla_{\beta}L(\beta_{t}). (24)

This is known as gradient flow [2]. Recall that we have defined the subgradient as

βL(βt)=βtCi=1n𝟙(yigt(xi)<1)yiΦ(xi).\nabla_{\beta}L(\beta_{t})=\beta_{t}-C\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{t}(x_{i})<1)y_{i}\Phi(x_{i}). (25)

Applying the chain rule, the dynamics of gt(x)=βt,Φ(x)g_{t}(x)=\left\langle\beta_{t},\Phi(x)\right\rangle is

dgt(x)dt=gt(x)βtdβtdt=Φ(x),βt+Ci=1n𝟙(yigt(xi)<1)yiΦ(xi)=gt(x)+Ci=1n𝟙(yigt(xi)<1)yiK(x,xi).\begin{split}\frac{dg_{t}(x)}{dt}&=\frac{\partial g_{t}(x)}{\partial\beta_{t}}\frac{d\beta_{t}}{dt}\\ &=\left\langle\Phi(x),-\beta_{t}+C\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{t}(x_{i})<1)y_{i}\Phi(x_{i})\right\rangle\\ &=-g_{t}(x)+C\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{t}(x_{i})<1)y_{i}K(x,x_{i}).\end{split} (26)

Denoting Q(t)=Ci=1n𝟙(yigt(xi)<1)yiK(x,xi)Q(t)=C\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{t}(x_{i})<1)y_{i}K(x,x_{i}), the equation becomes

dgt(x)dt+gt(x)=Q(t).\frac{dg_{t}(x)}{dt}+g_{t}(x)=Q(t). (27)

Note this is a first-order inhomogeneous differential equation. The general solution at some time TT is given by

gT(x)=eT(g0(x)+0TQ(t)et𝑑t).g_{T}(x)=e^{-T}\biggl{(}g_{0}(x)+\int_{0}^{T}Q(t)e^{t}\,dt\biggr{)}. (28)
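The closed-form solution (28) can be checked numerically on a toy problem: simulate the functional dynamics (27) with small forward Euler steps, record Q(t) along the trajectory, and compare with the variation-of-constants formula. The sketch below uses an assumed Gaussian kernel and random labels purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, C, dt, T = 20, 1.0, 1e-3, 3.0
X = rng.normal(size=(n, 2))
y = rng.choice([-1.0, 1.0], size=n)

# Assumed kernel for the toy check (any positive definite kernel works here).
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq)

steps = int(T / dt)
g = np.zeros(n)                      # g_0 = 0 on the training points
Q_hist = np.zeros((steps, n))
for k in range(steps):
    Q = C * K @ (((y * g) < 1) * y)  # Q(t) = C * sum_i 1(y_i g_t(x_i) < 1) y_i K(., x_i)
    Q_hist[k] = Q
    g = g + dt * (-g + Q)            # forward Euler step of dg/dt = -g + Q(t), Eq. (27)

# Variation-of-constants formula, Eq. (28), with the integral as a Riemann sum.
ts = np.arange(steps) * dt
g_closed = np.exp(-T) * (Q_hist * np.exp(ts)[:, None] * dt).sum(0)
print(np.max(np.abs(g - g_closed)))  # small: the two agree up to O(dt) discretization error
```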

Since the loss function is strongly convex, \beta converges to the global optimizer in this infinitesimally small learning rate setting. This can be seen from

d(L(βt)L(β))dt=dL(βt)dt=L(βt)βtdβtdt=βL(βt),βL(βt)=βL(βt)2.\frac{d\left(L(\beta_{t})-L(\beta^{*})\right)}{dt}=\frac{dL(\beta_{t})}{dt}=\frac{\partial L(\beta_{t})}{\partial\beta_{t}}\frac{d\beta_{t}}{dt}=\left\langle\nabla_{\beta}L(\beta_{t}),-\nabla_{\beta}L(\beta_{t})\right\rangle=-\left\lVert\nabla_{\beta}L(\beta_{t})\right\rVert^{2}. (29)

We see that L(\beta_{t}) is always decreasing. Since L(\beta) is strongly convex and thus bounded from below, by the monotone convergence theorem L(\beta_{t}) always converges. By Lemma C.1, we have the Polyak-Łojasiewicz (PL) inequality,

L(βt)L(β)12βL(βt)2L(\beta_{t})-L(\beta^{*})\leq\frac{1}{2}\left\lVert\nabla_{\beta}L(\beta_{t})\right\rVert^{2} (30)

Combining with the above, we have

d(L(βt)L(β))dt2(L(βt)L(β)).\frac{d\left(L(\beta_{t})-L(\beta^{*})\right)}{dt}\leq-2\left(L(\beta_{t})-L(\beta^{*})\right). (31)

Solving the equation, we get

L(βt)L(β)e2t(L(β0)L(β)).L(\beta_{t})-L(\beta^{*})\leq e^{-2t}\left(L(\beta_{0})-L(\beta^{*})\right). (32)

Thus we have a linear convergence rate.

Now, let us assume that g_{T}(x) converges and examine its limit as T\rightarrow\infty. As T\rightarrow\infty, e^{-T}g_{0}(x)\rightarrow 0, so

gT(x)eT0TQ(t)et𝑑tg_{T}(x)\rightarrow e^{-T}\int_{0}^{T}Q(t)e^{t}\,dt (33)

Q(t) changes over time because g_{t}(x) changes. Suppose Q(t) keeps changing until some time T_{1} and stays constant, Q(t)=Q, after T_{1},

limTgT(x)=eT0T1Q(t)et𝑑t+eTT1TQet𝑑t.\begin{split}\lim_{T\rightarrow\infty}g_{T}(x)=e^{-T}\int_{0}^{T_{1}}Q(t)e^{t}\,dt+e^{-T}\int_{T_{1}}^{T}Qe^{t}\,dt.\end{split} (34)

As T\rightarrow\infty, the first term on the right-hand side converges to 0, so

limTgT(x)eTT1TQet𝑑t=eTT1Tet𝑑tQ=eT(eTeT1)QQ=Ci=1n𝟙(yigT(xi)<1)yiK(x,xi).\begin{split}\lim_{T\rightarrow\infty}g_{T}(x)&\rightarrow e^{-T}\int_{T_{1}}^{T}Qe^{t}\,dt\\ &=e^{-T}\int_{T_{1}}^{T}e^{t}\,dt\cdot Q\\ &=e^{-T}(e^{T}-e^{T_{1}})\cdot Q\\ &\rightarrow Q\\ &=C\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{T}(x_{i})<1)\cdot y_{i}K(x,x_{i}).\end{split} (35)

C.2 Discrete Dynamics of SVM

Let \eta\in(0,1) be the learning rate. The subgradient descent update at time t is

βt+1βt=ηβL(βt).\beta_{t+1}-\beta_{t}=-\eta\nabla_{\beta}L(\beta_{t}).\\ (36)

The dynamics of gt(x)g_{t}(x) is

gt+1(x)gt(x)=βt+1βt,Φ(x)=ηβt+ηCi=1n𝟙(yigt(xi)<1)yiΦ(xi),Φ(x)=ηgt(x)+ηCi=1n𝟙(yigt(xi)<1)yiK(x,xi).\begin{split}g_{t+1}(x)-g_{t}(x)&=\left\langle\beta_{t+1}-\beta_{t},\Phi(x)\right\rangle\\ &=\left\langle-\eta\beta_{t}+\eta C\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{t}(x_{i})<1)y_{i}\Phi(x_{i}),\Phi(x)\right\rangle\\ &=-\eta g_{t}(x)+\eta C\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{t}(x_{i})<1)y_{i}K(x,x_{i}).\end{split} (37)

Denote the second term as Q(t)=\eta C\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{t}(x_{i})<1)y_{i}K(x,x_{i}), which changes over time. The model output g_{T}(x) at some time T is

gT(x)=(1η)gT1(x)+Q(T1)=(1η)((1η)gT2(x)+Q(T2))+Q(T1)=(1η)Tg0(x)+t=0T1(1η)T1tQ(t)=(1η)T(g0(x)+t=0T1(1η)t1Q(t)).\begin{split}g_{T}(x)&=(1-\eta)g_{T-1}(x)+Q(T-1)\\ &=(1-\eta)\biggl{(}(1-\eta)g_{T-2}(x)+Q(T-2)\biggr{)}+Q(T-1)\\ &=(1-\eta)^{T}g_{0}(x)+\sum_{t=0}^{T-1}(1-\eta)^{T-1-t}Q(t)\\ &=(1-\eta)^{T}\biggl{(}g_{0}(x)+\sum_{t=0}^{T-1}(1-\eta)^{-t-1}Q(t)\biggr{)}.\end{split} (38)

The convergence of subgradient descent usually requires the additional assumption that the norm of the subgradient is bounded; we refer readers to [47, 11, 9] for proofs. Here, let us assume that subgradient descent converges to the global optimizer and that Q(t) keeps changing until some time T_{1} and stays constant, Q(t)=Q, after T_{1}. As T\rightarrow\infty,

gT(x)t=0T1(1η)T1tQ(t)=t=0T11(1η)T1tQ(t)+t=T1T1(1η)T1tQt=T1T1(1η)T1tQ=t=T1T1(1η)T1tQ=(1η)TT1+1ηQ.\begin{split}g_{T}(x)&\rightarrow\sum_{t=0}^{T-1}(1-\eta)^{T-1-t}Q(t)\\ &=\sum_{t=0}^{T_{1}-1}(1-\eta)^{T-1-t}Q(t)+\sum_{t=T_{1}}^{T-1}(1-\eta)^{T-1-t}Q\\ &\rightarrow\sum_{t=T_{1}}^{T-1}(1-\eta)^{T-1-t}Q\\ &=\sum_{t=T_{1}}^{T-1}(1-\eta)^{T-1-t}Q\\ &=\frac{-(1-\eta)^{T-T_{1}}+1}{\eta}Q.\end{split} (39)

Since \eta\in(0,1), we have (1-\eta)^{T-T_{1}}\rightarrow 0 as T\rightarrow\infty, so

gT(x)1ηQ=Ci=1n𝟙(yigT(xi)<1)yiK(x,xi).\begin{split}g_{T}(x)&\rightarrow\frac{1}{\eta}Q\\ &=C\sum_{i=1}^{n}\mathbbm{1}(y_{i}g_{T}(x_{i})<1)y_{i}K(x,x_{i}).\\ \end{split} (40)
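A quick numerical check of the geometric-series step used in Eq. (39), and of the limit 1/\eta that gives Eq. (40):

```python
import numpy as np

eta, T1, T = 0.1, 50, 2000
lhs = sum((1 - eta) ** (T - 1 - t) for t in range(T1, T))   # sum_{t=T1}^{T-1} (1-eta)^{T-1-t}
rhs = (1 - (1 - eta) ** (T - T1)) / eta                     # closed form from Eq. (39)
print(lhs, rhs, 1 / eta)   # lhs == rhs, and both approach 1/eta as T grows
```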

Appendix D Dynamics and Convergence Rate of Neural Network Trained by Soft Margin Loss

D.1 Continuous Dynamics of NN

In the limit of the learning rate \eta\rightarrow 0, the subgradient descent update, which can be written as

wt+1wtη=wL(wt),\frac{w_{t+1}-w_{t}}{\eta}=-\nabla_{w}L(w_{t}), (41)

becomes a differential equation

dwtdt=wL(wt).\frac{dw_{t}}{dt}=-\nabla_{w}L(w_{t}). (42)

This is known as gradient flow [2]. Then for any differentiable function ft(x)f_{t}(x),

dft(x)dt=j=1pft(x)wjdwjdt,\frac{df_{t}(x)}{dt}=\sum_{j=1}^{p}\frac{\partial f_{t}(x)}{\partial w_{j}}\frac{dw_{j}}{dt}, (43)

where pp is the number of parameters. Replacing dwjdt\frac{dw_{j}}{dt} by its subgradient descent expression:

dft(x)dt=j=1pft(x)wj(L(wt)wj).\frac{df_{t}(x)}{dt}=\sum_{j=1}^{p}\frac{\partial f_{t}(x)}{\partial w_{j}}(-\frac{\partial L(w_{t})}{\partial w_{j}}). (44)

And we know

L(wt)wj=wj𝟙(wjW(L+1))+i=1nLhft(xi)ft(xi)wj.\frac{\partial L(w_{t})}{\partial w_{j}}=w_{j}\mathbbm{1}(w_{j}\in W^{(L+1)})+\sum_{i=1}^{n}\frac{\partial L_{h}}{\partial f_{t}(x_{i})}\frac{\partial f_{t}(x_{i})}{\partial w_{j}}. (45)

where \mathbbm{1}(w_{j}\in W^{(L+1)}) equals 1 if the parameter w_{j} is in the last layer W^{(L+1)} and 0 otherwise. Combining the above,

dft(x)dt=j=1pft(x)wj(wj𝟙(wjW(L+1))i=1nLhft(xi)ft(xi)wj).\frac{df_{t}(x)}{dt}=\sum_{j=1}^{p}\frac{\partial f_{t}(x)}{\partial w_{j}}\biggl{(}-w_{j}\mathbbm{1}(w_{j}\in W^{(L+1)})-\sum_{i=1}^{n}\frac{\partial L_{h}}{\partial f_{t}(x_{i})}\frac{\partial f_{t}(x_{i})}{\partial w_{j}}\biggr{)}. (46)

Rearranging terms:

dft(x)dt=k=1pL+1ft(x)Wk(L+1)Wk(L+1)i=1nLhft(xi)j=1pft(x)wjft(xi)wj,\frac{df_{t}(x)}{dt}=-\sum_{k=1}^{p_{L+1}}\frac{\partial f_{t}(x)}{\partial W^{(L+1)}_{k}}W^{(L+1)}_{k}-\sum_{i=1}^{n}\frac{\partial L_{h}}{\partial f_{t}(x_{i})}\sum_{j=1}^{p}\frac{\partial f_{t}(x)}{\partial w_{j}}\frac{\partial f_{t}(x_{i})}{\partial w_{j}}, (47)

where p_{L+1} is the number of parameters in the last ((L+1)-th) layer. The first term on the right-hand side is

k=1pL+1ft(x)Wk(L+1)Wk(L+1)=ft(x)W(L+1),W(L+1)=αt(L)(x),W(L+1)=ft(x).\sum_{k=1}^{p_{L+1}}\frac{\partial f_{t}(x)}{\partial W^{(L+1)}_{k}}W^{(L+1)}_{k}=\left\langle\frac{\partial f_{t}(x)}{\partial W^{(L+1)}},W^{(L+1)}\right\rangle=\left\langle\alpha_{t}^{(L)}(x),W^{(L+1)}\right\rangle=f_{t}(x). (48)

Applying L_{h}^{\prime}(y_{i},f_{t}(x_{i}))=\frac{\partial L_{h}}{\partial f_{t}(x_{i})}, the subgradient of the hinge loss, and the definition of the tangent kernel (2.1), the second term is

i=1nLhft(xi)j=1pft(x)wjft(xi)wj=i=1nLh(yi,ft(xi))Θ^(wt;x,xi).-\sum_{i=1}^{n}\frac{\partial L_{h}}{\partial f_{t}(x_{i})}\sum_{j=1}^{p}\frac{\partial f_{t}(x)}{\partial w_{j}}\frac{\partial f_{t}(x_{i})}{\partial w_{j}}=-\sum_{i=1}^{n}L_{h}^{\prime}(y_{i},f_{t}(x_{i}))\hat{\Theta}(w_{t};x,x_{i}). (49)

Thus the equation becomes

dft(x)dt=ft(x)i=1nLh(yi,ft(xi))Θ^(wt;x,xi).\frac{df_{t}(x)}{dt}=-f_{t}(x)-\sum_{i=1}^{n}L_{h}^{\prime}(y_{i},f_{t}(x_{i}))\hat{\Theta}(w_{t};x,x_{i}). (50)

Taking L_{h}^{\prime}(y_{i},f_{t}(x_{i}))=-Cy_{i}\mathbbm{1}(y_{i}f_{t}(x_{i})<1), we obtain

dft(x)dt=ft(x)+Ci=1n𝟙(yift(xi)<1)yiΘ^(wt;x,xi).\frac{df_{t}(x)}{dt}=-f_{t}(x)+C\sum_{i=1}^{n}\mathbbm{1}(y_{i}f_{t}(x_{i})<1)y_{i}\hat{\Theta}(w_{t};x,x_{i}). (51)
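The empirical tangent kernel \hat{\Theta}(w_{t};x,x_{i}) appearing in Eq. (51) can be computed directly from its definition, \hat{\Theta}(w;x,x^{\prime})=\sum_{j}\frac{\partial f(x)}{\partial w_{j}}\frac{\partial f(x^{\prime})}{\partial w_{j}}. Below is a minimal PyTorch sketch for a scalar-output network; the architecture and the standard (non-NTK) parameterization are only for illustration.

```python
import torch

def empirical_ntk(model, x1, x2):
    """Empirical tangent kernel Theta_hat(w; x1, x2) = <grad_w f(w, x1), grad_w f(w, x2)>
    for a scalar-output model, computed with autograd."""
    params = [p for p in model.parameters() if p.requires_grad]

    def grad_vec(x):
        out = model(x.unsqueeze(0)).squeeze()
        grads = torch.autograd.grad(out, params)
        return torch.cat([g.reshape(-1) for g in grads])

    return torch.dot(grad_vec(x1), grad_vec(x2)).item()

# A small fully connected ReLU network, purely as an illustration.
d, m = 10, 512
model = torch.nn.Sequential(
    torch.nn.Linear(d, m), torch.nn.ReLU(),
    torch.nn.Linear(m, m), torch.nn.ReLU(),
    torch.nn.Linear(m, 1),
)
x, xp = torch.randn(d), torch.randn(d)
print(empirical_ntk(model, x, xp))
```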

D.2 Additional Notations

Denote Xd×nX\in\mathbb{R}^{d\times n} as the training data. Denote ft=ft(X)nf_{t}=f_{t}(X)\in\mathbb{R}^{n} and gt=gt(X)ng_{t}=g_{t}(X)\in\mathbb{R}^{n} as the outputs of NN and SVM on the training data. Denote Θ^(wt)=Θ^(wt;X,X)n×n\hat{\Theta}(w_{t})=\hat{\Theta}(w_{t};X,X)\in\mathbb{R}^{n\times n} as the tangent kernel evaluated on the training data at time tt, and l(ft)nl^{\prime}(f_{t})\in\mathbb{R}^{n} as the derivative of the loss function w.r.t. ftf_{t}. Denote wftn×p\nabla_{w}f_{t}\in\mathbb{R}^{n\times p} as the Jacobian and we have Θ^(wt)=wftwftT\hat{\Theta}(w_{t})=\nabla_{w}f_{t}\nabla_{w}f_{t}^{T}. Denote λ0=λmin(Θ^(wt))\lambda_{0}=\lambda_{min}\left(\hat{\Theta}(w_{t})\right) as the smallest eigenvalue of Θ^(wt)\hat{\Theta}(w_{t}). Then we can write the dynamics of NN as

ddtft=ftΘ^(wt)l(ft).\displaystyle\frac{d}{dt}f_{t}=-f_{t}-\hat{\Theta}(w_{t})l^{\prime}(f_{t}).

Let vpv\in\mathbb{R}^{p} with vj=𝟙(wjW(L+1))v_{j}=\mathbbm{1}(w_{j}\in W^{(L+1)}). We can write the gradient as

wL(wt)=wtv+wftTl(ft).\displaystyle\nabla_{w}L(w_{t})=w_{t}\odot v+\nabla_{w}f_{t}^{T}l^{\prime}(f_{t}).

D.3 Convergence of NN

The loss of NN is

L(wt)=12Wt(L+1)2+inl(ft(xi),yi),L(w_{t})=\frac{1}{2}\left\lVert W_{t}^{(L+1)}\right\rVert^{2}+\sum_{i}^{n}l(f_{t}(x_{i}),y_{i}),

where l(f,y)=C\max(0,1-yf). The dynamics of the loss are

dL(wt)dt=L(wt)wtdwtdt=wL(wt),wL(wt)=wL(wt)2.\frac{dL(w_{t})}{dt}=\frac{\partial L(w_{t})}{\partial w_{t}}\frac{dw_{t}}{dt}=\left\langle\nabla_{w}L(w_{t}),-\nabla_{w}L(w_{t})\right\rangle=-\left\lVert\nabla_{w}L(w_{t})\right\rVert^{2}.

Since L(w_{t})\geq 0 is bounded from below, by the monotone convergence theorem L(w_{t}) always converges. Applying Lemma D.1, we have

d(L(wt)L(w))dt=wL(wt)22(L(wt)L(w)).\displaystyle\frac{d\left(L(w_{t})-L(w^{*})\right)}{dt}=-\left\lVert\nabla_{w}L(w_{t})\right\rVert^{2}\leq-2\left(L(w_{t})-L(w^{*})\right).

Thus we have linear convergence, the same as for the SVM:

L(wt)L(w)e2t(L(w0)L(w)).L(w_{t})-L(w^{*})\leq e^{-2t}\left(L(w_{0})-L(w^{*})\right).
Lemma D.1 (PL inequality of NN for soft margin loss).

Assume \lambda_{0}\geq\frac{2}{C}; then L(w_{t}) satisfies the PL condition

wL(wt)22(L(wt)L(w)).\left\lVert\nabla_{w}L(w_{t})\right\rVert^{2}\geq 2\left(L(w_{t})-L(w^{*})\right).
Proof.
wL(wt)2\displaystyle\left\lVert\nabla_{w}L(w_{t})\right\rVert^{2} =wtv+wftTl(ft),wtv+wftTl(ft)\displaystyle=\left\langle w_{t}\odot v+\nabla_{w}f_{t}^{T}l^{\prime}(f_{t}),w_{t}\odot v+\nabla_{w}f_{t}^{T}l^{\prime}(f_{t})\right\rangle
=wtv,wtv+wftTl(ft),wftTl(ft)+2wtv,wftTl(ft)\displaystyle=\left\langle w_{t}\odot v,w_{t}\odot v\right\rangle+\left\langle\nabla_{w}f_{t}^{T}l^{\prime}(f_{t}),\nabla_{w}f_{t}^{T}l^{\prime}(f_{t})\right\rangle+2\left\langle w_{t}\odot v,\nabla_{w}f_{t}^{T}l^{\prime}(f_{t})\right\rangle
=Wt(L+1)2+l(ft)TΘ^(wt)l(ft)+2Wt(L+1),W(L+1)ftTl(ft)\displaystyle=\left\lVert W_{t}^{(L+1)}\right\rVert^{2}+l^{\prime}(f_{t})^{T}\hat{\Theta}(w_{t})l^{\prime}(f_{t})+2\left\langle W_{t}^{(L+1)},\nabla_{W^{(L+1)}}f_{t}^{T}l^{\prime}(f_{t})\right\rangle
=Wt(L+1)2+l(ft)TΘ^(wt)l(ft)+2ftTl(ft).\displaystyle=\left\lVert W_{t}^{(L+1)}\right\rVert^{2}+l^{\prime}(f_{t})^{T}\hat{\Theta}(w_{t})l^{\prime}(f_{t})+2f_{t}^{T}l^{\prime}(f_{t}).

We want the loss to satisfy the PL condition \left\lVert\nabla_{w}L(w_{t})\right\rVert^{2}\geq 2\left(L(w_{t})-L(w^{*})\right).

wL(wt)22(L(wt)L(w))\displaystyle\left\lVert\nabla_{w}L(w_{t})\right\rVert^{2}-2\left(L(w_{t})-L(w^{*})\right)
=wL(wt)22L(wt)+2L(w)\displaystyle=\left\lVert\nabla_{w}L(w_{t})\right\rVert^{2}-2L(w_{t})+2L(w^{*})
=l(ft)TΘ^(wt)l(ft)+2ftTl(ft)2inl(ft(xi),yi)+2L(w)\displaystyle=l^{\prime}(f_{t})^{T}\hat{\Theta}(w_{t})l^{\prime}(f_{t})+2f_{t}^{T}l^{\prime}(f_{t})-2\sum_{i}^{n}l(f_{t}(x_{i}),y_{i})+2L(w^{*})
λ0l(ft)2+2ftTl(ft)2inl(ft(xi),yi)+2L(w),\displaystyle\geq\lambda_{0}\left\lVert l^{\prime}(f_{t})\right\rVert^{2}+2f_{t}^{T}l^{\prime}(f_{t})-2\sum_{i}^{n}l(f_{t}(x_{i}),y_{i})+2L(w^{*}),

where the last inequality uses the quadratic form bound l^{\prime}(f_{t})^{T}\hat{\Theta}(w_{t})l^{\prime}(f_{t})\geq\lambda_{0}\left\lVert l^{\prime}(f_{t})\right\rVert^{2}. For the hinge loss l(f,y)=C\max(0,1-yf)=C(1-yf)\mathbbm{1}(1-yf>0) and l^{\prime}(f,y)=-Cy\mathbbm{1}(1-yf>0),

wL(wt)22(L(wt)L(w))\displaystyle\left\lVert\nabla_{w}L(w_{t})\right\rVert^{2}-2\left(L(w_{t})-L(w^{*})\right)
λ0l(ft)2+2ftTl(ft)2inl(ft(xi),yi)+2L(w)\displaystyle\geq\lambda_{0}\left\lVert l^{\prime}(f_{t})\right\rVert^{2}+2f_{t}^{T}l^{\prime}(f_{t})-2\sum_{i}^{n}l(f_{t}(x_{i}),y_{i})+2L(w^{*})
=λ0inl(ft(xi),yi)2+2inft(xi)l(ft(xi),yi)2inl(ft(xi),yi)+2L(w)\displaystyle=\lambda_{0}\sum_{i}^{n}{l^{\prime}(f_{t}(x_{i}),y_{i})}^{2}+2\sum_{i}^{n}f_{t}(x_{i})l^{\prime}(f_{t}(x_{i}),y_{i})-2\sum_{i}^{n}l(f_{t}(x_{i}),y_{i})+2L(w^{*})
=λ0inC2𝟙(1yift(xi)>0)2inCyift(xi)𝟙(1yift(xi)>0)\displaystyle=\lambda_{0}\sum_{i}^{n}C^{2}\mathbbm{1}(1-y_{i}f_{t}(x_{i})>0)-2\sum_{i}^{n}Cy_{i}f_{t}(x_{i})\mathbbm{1}(1-y_{i}f_{t}(x_{i})>0)
2inC(1yift(xi))𝟙(1yift(xi)>0)+2L(w)\displaystyle\qquad-2\sum_{i}^{n}C(1-y_{i}f_{t}(x_{i}))\mathbbm{1}(1-y_{i}f_{t}(x_{i})>0)+2L(w^{*})
=Cin𝟙(1yift(xi)>0)(Cλ02)+2L(w).\displaystyle=C\sum_{i}^{n}\mathbbm{1}(1-y_{i}f_{t}(x_{i})>0)\left(C\lambda_{0}-2\right)+2L(w^{*}).

Since L(w^{*})\geq 0, as long as \lambda_{0}\geq 2/C the loss L(w_{t}) satisfies the PL condition \left\lVert\nabla_{w}L(w_{t})\right\rVert^{2}\geq 2\left(L(w_{t})-L(w^{*})\right). The condition \lambda_{0}\geq 2/C can be guaranteed in a parameter ball when \frac{2}{C}<\lambda_{min}\left(\hat{\Theta}(w_{0})\right) by using a sufficiently wide NN [34]. ∎

D.4 Discrete Dynamics of NN

The subgradient descent update is

wt+1wt=ηwL(wt).w_{t+1}-w_{t}=-\eta\nabla_{w}L(w_{t}). (52)

We consider the setting of a constant NTK, \hat{\Theta}(w_{t};x,x_{i})\rightarrow\hat{\Theta}(w_{0};x,x_{i}), or equivalently a linear model. As proved in Proposition 2.2 of [33], the tangent kernel of a differentiable function f(w,x) is constant if and only if f(w,x) is linear in w. Taking the Taylor expansion of f(w_{t+1},x) at w_{t},

f(wt+1,x)f(wt,x)=f(wt,x)+wf(wt,x),wt+1wtf(wt,x)=wf(wt,x),ηwL(wt)=wf(wt,x),η(wv+i=1nLh(yi,ft(xi))wft(xi))=ηft(x)+ηi=1nLh(yi,ft(xi))Θ^(wt;x,xi)=ηft(x)+ηCi=1n𝟙(yift(xi)<1)yiΘ^(wt;x,xi)ηft(x)+ηCi=1n𝟙(yift(xi)<1)yiΘ^(w0;x,xi).\begin{split}&\quad f(w_{t+1},x)-f(w_{t},x)\\ &=f(w_{t},x)+\left\langle\nabla_{w}f(w_{t},x),w_{t+1}-w_{t}\right\rangle-f(w_{t},x)\\ &=\left\langle\nabla_{w}f(w_{t},x),-\eta\nabla_{w}L(w_{t})\right\rangle\\ &=\left\langle\nabla_{w}f(w_{t},x),-\eta\Big{(}wv+\sum_{i=1}^{n}L_{h}^{\prime}(y_{i},f_{t}(x_{i}))\nabla_{w}f_{t}(x_{i})\Big{)}\right\rangle\\ &=-\eta f_{t}(x)+\eta\sum_{i=1}^{n}L_{h}^{\prime}(y_{i},f_{t}(x_{i}))\hat{\Theta}(w_{t};x,x_{i})\\ &=-\eta f_{t}(x)+\eta C\sum_{i=1}^{n}\mathbbm{1}(y_{i}f_{t}(x_{i})<1)y_{i}\hat{\Theta}(w_{t};x,x_{i})\\ &\rightarrow-\eta f_{t}(x)+\eta C\sum_{i=1}^{n}\mathbbm{1}(y_{i}f_{t}(x_{i})<1)y_{i}\hat{\Theta}(w_{0};x,x_{i}).\end{split} (53)

Appendix E Proof of Theorem 3.4

Proof.

We prove the constancy of the tangent kernel by adopting the results of [35].

Lemma E.1 (Theorem 3.3 in [33]; Hessian norm is controlled by the minimum hidden layer width).

Consider a general neural network f(w,x)f(w,x) of the form Eq. (1), which can be a fully connected network, CNN, ResNet or a mixture of these types. Let m be the minimum of the hidden layer widths, i.e., m=minl[L]mlm=\min_{l\in[L]}m_{l}. Given any fixed R>0R>0, and any wB(w0;R):={w:ww0R}w\in B(w_{0};R):=\{w:\left\lVert w-w_{0}\right\rVert\leq R\}, with high probability over the initialization, the Hessian spectral norm satisfies the following:

H(w)=O(R3Llnmm).\left\lVert H(w)\right\rVert=O(\frac{R^{3L}\ln{m}}{\sqrt{m}}). (54)
Lemma E.2 (Proposition 2.3 in [33]; Small Hessian norm \Rightarrow Small change of tangent kernel).

Given a point w0pw_{0}\in\mathbb{R}^{p} and a ball B(w0;R):={w:ww0R}B(w_{0};R):=\{w:\left\lVert w-w_{0}\right\rVert\leq R\} with fixed radius R>0R>0, if the Hessian matrix satisfies H(w)<ϵ\left\lVert H(w)\right\rVert<\epsilon, where ϵ>0\epsilon>0, for all wB(w0,R)w\in B(w_{0},R), then the tangent kernel Θ^(w;x,x)\hat{\Theta}(w;x,x^{\prime}) of the model, as a function of ww, satisfies

|Θ^(w;x,x)Θ^(w0;x,x)|=O(ϵR),wB(w0;R),x,xd.\left\lvert\hat{\Theta}(w;x,x^{\prime})-\hat{\Theta}(w_{0};x,x^{\prime})\right\rvert=O(\epsilon R),\quad\forall w\in B(w_{0};R),\ \forall x,x^{\prime}\in\mathbb{R}^{d}. (55)

Applying the above two lemmas, we see that in the limit m\rightarrow\infty, the spectral norm of the Hessian converges to 0 and the tangent kernel stays constant in the ball B(w_{0};R).

Corollary E.2.1 (Constancy of the tangent kernel).

Consider a general neural network f(w,x)f(w,x) of the form Eq. (1). Given a point w0pw_{0}\in\mathbb{R}^{p} and a ball B(w0;R):={w:ww0R}B(w_{0};R):=\{w:\left\lVert w-w_{0}\right\rVert\leq R\} with fixed radius R>0R>0, in the infinite width limit, mm\rightarrow\infty,

\lim_{m\rightarrow\infty}\hat{\Theta}(w;x,x^{\prime})=\hat{\Theta}(w_{0};x,x^{\prime}),\quad\forall w\in B(w_{0};R),\ \forall x,x^{\prime}\in\mathbb{R}^{d}. (56)

Thus we prove the constancy of the tangent kernel in the infinite-width limit. It is then easy to check that the dynamics of the infinitely wide NN coincide with the dynamics of the SVM with constant NTK. ∎

Appendix F Bound the difference between SVM and NN

Assume the loss l is \rho-Lipschitz and \beta_{l}-smooth in the first argument (i.e., the model output). Assume f_{0}(x)=g_{0}(x) for any x.

F.1 Bound the difference on the Training Data

The dynamics of the NN and SVM are

ddtft=λftΘ^(wt)l(ft)\displaystyle\frac{d}{dt}f_{t}=-\lambda f_{t}-\hat{\Theta}(w_{t})l^{\prime}(f_{t})
ddtgt=λgtΘ^(w0)l(gt)\displaystyle\frac{d}{dt}g_{t}=-\lambda g_{t}-\hat{\Theta}(w_{0})l^{\prime}(g_{t})

The dynamics of the difference between them is

ddt(ftgt)=λ(ftgt)(Θ^(wt)l(ft)Θ^(w0)l(gt))\frac{d}{dt}\left(f_{t}-g_{t}\right)=-\lambda\left(f_{t}-g_{t}\right)-\left(\hat{\Theta}(w_{t})l^{\prime}(f_{t})-\hat{\Theta}(w_{0})l^{\prime}(g_{t})\right)

The solution of the above differential equation at time TT is

fTgT\displaystyle f_{T}-g_{T} =eλT(f0g0)eλT0T(Θ^(wt)l(ft)Θ^(w0)l(gt))eλt𝑑t\displaystyle=e^{-\lambda T}\left(f_{0}-g_{0}\right)-e^{-\lambda T}\int_{0}^{T}\left(\hat{\Theta}(w_{t})l^{\prime}(f_{t})-\hat{\Theta}(w_{0})l^{\prime}(g_{t})\right)e^{\lambda t}dt
=eλT0T(Θ^(w0)l(gt)Θ^(wt)l(ft))eλt𝑑t\displaystyle=e^{-\lambda T}\int_{0}^{T}\left(\hat{\Theta}(w_{0})l^{\prime}(g_{t})-\hat{\Theta}(w_{t})l^{\prime}(f_{t})\right)e^{\lambda t}dt

using f0=g0f_{0}=g_{0}. Thus

fTgT\displaystyle\left\lVert f_{T}-g_{T}\right\rVert eλT0TΘ^(w0)l(gt)Θ^(wt)l(ft)eλt𝑑t\displaystyle\leq e^{-\lambda T}\int_{0}^{T}\left\lVert\hat{\Theta}(w_{0})l^{\prime}(g_{t})-\hat{\Theta}(w_{t})l^{\prime}(f_{t})\right\rVert e^{\lambda t}dt

Since ll is βl\beta_{l} smooth,

Θ^(w0)l(gt)Θ^(wt)l(ft)\displaystyle\left\lVert\hat{\Theta}(w_{0})l^{\prime}(g_{t})-\hat{\Theta}(w_{t})l^{\prime}(f_{t})\right\rVert =Θ^(w0)l(gt)Θ^(w0)l(ft)+Θ^(w0)l(ft)Θ^(wt)l(ft)\displaystyle=\left\lVert\hat{\Theta}(w_{0})l^{\prime}(g_{t})-\hat{\Theta}(w_{0})l^{\prime}(f_{t})+\hat{\Theta}(w_{0})l^{\prime}(f_{t})-\hat{\Theta}(w_{t})l^{\prime}(f_{t})\right\rVert
=Θ^(w0)(l(gt)l(ft))+(Θ^(w0)Θ^(wt))l(ft)\displaystyle=\left\lVert\hat{\Theta}(w_{0})\left(l^{\prime}(g_{t})-l^{\prime}(f_{t})\right)+\left(\hat{\Theta}(w_{0})-\hat{\Theta}(w_{t})\right)l^{\prime}(f_{t})\right\rVert
Θ^(w0)(l(gt)l(ft))+(Θ^(w0)Θ^(wt))l(ft)\displaystyle\leq\left\lVert\hat{\Theta}(w_{0})\left(l^{\prime}(g_{t})-l^{\prime}(f_{t})\right)\right\rVert+\left\lVert\left(\hat{\Theta}(w_{0})-\hat{\Theta}(w_{t})\right)l^{\prime}(f_{t})\right\rVert
βlΘ^(w0)gtft+ρnΘ^(w0)Θ^(wt)\displaystyle\leq\beta_{l}\left\lVert\hat{\Theta}(w_{0})\right\rVert\left\lVert g_{t}-f_{t}\right\rVert+\rho\sqrt{n}\left\lVert\hat{\Theta}(w_{0})-\hat{\Theta}(w_{t})\right\rVert

where l(ft)ρn\left\lVert l^{\prime}(f_{t})\right\rVert\leq\rho\sqrt{n}. Thus we have

fTgT\displaystyle\left\lVert f_{T}-g_{T}\right\rVert eλTβlΘ^(w0)0Tgtfteλt𝑑t+eλTρn0TΘ^(w0)Θ^(wt)eλt𝑑t\displaystyle\leq e^{-\lambda T}\beta_{l}\left\lVert\hat{\Theta}(w_{0})\right\rVert\int_{0}^{T}\left\lVert g_{t}-f_{t}\right\rVert e^{\lambda t}dt+e^{-\lambda T}\rho\sqrt{n}\int_{0}^{T}\left\lVert\hat{\Theta}(w_{0})-\hat{\Theta}(w_{t})\right\rVert e^{\lambda t}dt

Applying Grönwall's inequality,

fTgT\displaystyle\left\lVert f_{T}-g_{T}\right\rVert eλTρn0TΘ^(w0)Θ^(wt)eλt𝑑teeλTβlΘ^(w0)0Teλt𝑑t\displaystyle\leq e^{-\lambda T}\rho\sqrt{n}\int_{0}^{T}\left\lVert\hat{\Theta}(w_{0})-\hat{\Theta}(w_{t})\right\rVert e^{\lambda t}dt\cdot e^{e^{-\lambda T}\beta_{l}\left\lVert\hat{\Theta}(w_{0})\right\rVert\int_{0}^{T}e^{\lambda t}dt}
=eλTρn0TΘ^(w0)Θ^(wt)eλt𝑑te1λ(1eλT)βlΘ^(w0)\displaystyle=e^{-\lambda T}\rho\sqrt{n}\int_{0}^{T}\left\lVert\hat{\Theta}(w_{0})-\hat{\Theta}(w_{t})\right\rVert e^{\lambda t}dt\cdot e^{\frac{1}{\lambda}(1-e^{-\lambda T})\beta_{l}\left\lVert\hat{\Theta}(w_{0})\right\rVert}
=eλTe1λ(1eλT)βlΘ^(w0)ρn0TΘ^(w0)Θ^(wt)eλt𝑑t\displaystyle=e^{-\lambda T}e^{\frac{1}{\lambda}(1-e^{-\lambda T})\beta_{l}\left\lVert\hat{\Theta}(w_{0})\right\rVert}\rho\sqrt{n}\int_{0}^{T}\left\lVert\hat{\Theta}(w_{0})-\hat{\Theta}(w_{t})\right\rVert e^{\lambda t}dt

By Lemma E.1 and Lemma E.2, in a parameter ball B(w0;R)={w:ww0R}B(w_{0};R)=\{w:\left\lVert w-w_{0}\right\rVert\leq R\}, with high probability, |Θ^(w;x,x)Θ^(w0;x,x)|=O(R3L+1lnm/m)\left\lvert\hat{\Theta}(w;x,x^{\prime})-\hat{\Theta}(w_{0};x,x^{\prime})\right\rvert=O(R^{3L+1}\ln{m}/\sqrt{m}) w.r.t. mm. Then we have

Θ^(w0)Θ^(wt)Θ^(w0)Θ^(wt)F=O(R3L+1nlnmm)\displaystyle\left\lVert\hat{\Theta}(w_{0})-\hat{\Theta}(w_{t})\right\rVert\leq\left\lVert\hat{\Theta}(w_{0})-\hat{\Theta}(w_{t})\right\rVert_{F}=O(\frac{R^{3L+1}n\ln{m}}{\sqrt{m}})

Thus we have

fTgT\displaystyle\left\lVert f_{T}-g_{T}\right\rVert 1λ(1eλT)e(1eT)βlΘ^(w0)ρnO(R3L+1nlnmm)\displaystyle\leq\frac{1}{\lambda}(1-e^{-\lambda T})e^{(1-e^{-T})\beta_{l}\left\lVert\hat{\Theta}(w_{0})\right\rVert}\rho\sqrt{n}\cdot O(\frac{R^{3L+1}n\ln{m}}{\sqrt{m}})
=O(eβlΘ^(w0)R3L+1ρn32lnmλm)\displaystyle=O(\frac{e^{\beta_{l}\left\lVert\hat{\Theta}(w_{0})\right\rVert}R^{3L+1}\rho n^{\frac{3}{2}}\ln{m}}{\lambda\sqrt{m}})

F.2 Bound on the Test Data

For a test point x, the proof is similar to the training case. Denote \hat{\Theta}(w_{t};X,x)\in\mathbb{R}^{n} as the tangent kernel evaluated between the training data and the test point x. Recall

dft(x)dt=λft(x)Θ^(wt;X,x)Tl(ft)\displaystyle\frac{df_{t}(x)}{dt}=-\lambda f_{t}(x)-\hat{\Theta}(w_{t};X,x)^{T}l^{\prime}(f_{t})
dgt(x)dt=λgt(x)Θ^(w0;X,x)Tl(gt)\displaystyle\frac{dg_{t}(x)}{dt}=-\lambda g_{t}(x)-\hat{\Theta}(w_{0};X,x)^{T}l^{\prime}(g_{t})
ddt(ft(x)gt(x))=λ(ft(x)gt(x))(Θ^(wt;X,x)Tl(ft)Θ^(w0;X,x)Tl(gt))\displaystyle\frac{d}{dt}\left(f_{t}(x)-g_{t}(x)\right)=-\lambda\left(f_{t}(x)-g_{t}(x)\right)-\left(\hat{\Theta}(w_{t};X,x)^{T}l^{\prime}(f_{t})-\hat{\Theta}(w_{0};X,x)^{T}l^{\prime}(g_{t})\right)

The solution of the above differential equation is

fT(x)gT(x)\displaystyle f_{T}(x)-g_{T}(x) =eλT(f0g0)eλT0T(Θ^(wt;X,x)Tl(ft)Θ^(w0;X,x)Tl(gt))eλt𝑑t\displaystyle=e^{-\lambda T}\left(f_{0}-g_{0}\right)-e^{-\lambda T}\int_{0}^{T}\left(\hat{\Theta}(w_{t};X,x)^{T}l^{\prime}(f_{t})-\hat{\Theta}(w_{0};X,x)^{T}l^{\prime}(g_{t})\right)e^{\lambda t}dt
=eλT0T(Θ^(w0;X,x)Tl(gt)Θ^(wt;X,x)Tl(ft))eλt𝑑t\displaystyle=e^{-\lambda T}\int_{0}^{T}\left(\hat{\Theta}(w_{0};X,x)^{T}l^{\prime}(g_{t})-\hat{\Theta}(w_{t};X,x)^{T}l^{\prime}(f_{t})\right)e^{\lambda t}dt

using f0=g0f_{0}=g_{0}. Thus

fT(x)gT(x)\displaystyle\left\lVert f_{T}(x)-g_{T}(x)\right\rVert eλT0TΘ^(w0;X,x)Tl(gt)Θ^(wt;X,x)Tl(ft)eλt𝑑t\displaystyle\leq e^{-\lambda T}\int_{0}^{T}\left\lVert\hat{\Theta}(w_{0};X,x)^{T}l^{\prime}(g_{t})-\hat{\Theta}(w_{t};X,x)^{T}l^{\prime}(f_{t})\right\rVert e^{\lambda t}dt

Since ll is βl\beta_{l} smooth,

Θ^(w0;X,x)Tl(gt)Θ^(wt;X,x)Tl(ft)\displaystyle\left\lVert\hat{\Theta}(w_{0};X,x)^{T}l^{\prime}(g_{t})-\hat{\Theta}(w_{t};X,x)^{T}l^{\prime}(f_{t})\right\rVert
=Θ^(w0;X,x)Tl(gt)Θ^(w0;X,x)Tl(ft)+Θ^(w0;X,x)Tl(ft)Θ^(wt;X,x)Tl(ft)\displaystyle=\left\lVert\hat{\Theta}(w_{0};X,x)^{T}l^{\prime}(g_{t})-\hat{\Theta}(w_{0};X,x)^{T}l^{\prime}(f_{t})+\hat{\Theta}(w_{0};X,x)^{T}l^{\prime}(f_{t})-\hat{\Theta}(w_{t};X,x)^{T}l^{\prime}(f_{t})\right\rVert
=Θ^(w0;X,x)T(l(gt)l(ft))+(Θ^(w0;X,x)TΘ^(wt;X,x)T)l(ft)\displaystyle=\left\lVert\hat{\Theta}(w_{0};X,x)^{T}\left(l^{\prime}(g_{t})-l^{\prime}(f_{t})\right)+\left(\hat{\Theta}(w_{0};X,x)^{T}-\hat{\Theta}(w_{t};X,x)^{T}\right)l^{\prime}(f_{t})\right\rVert
Θ^(w0;X,x)T(l(gt)l(ft))+(Θ^(w0;X,x)TΘ^(wt;X,x)T)l(ft)\displaystyle\leq\left\lVert\hat{\Theta}(w_{0};X,x)^{T}\left(l^{\prime}(g_{t})-l^{\prime}(f_{t})\right)\right\rVert+\left\lVert\left(\hat{\Theta}(w_{0};X,x)^{T}-\hat{\Theta}(w_{t};X,x)^{T}\right)l^{\prime}(f_{t})\right\rVert
βlΘ^(w0;X,x)gtft+ρn(Θ^(w0;X,x)TΘ^(wt;X,x)T)\displaystyle\leq\beta_{l}\left\lVert\hat{\Theta}(w_{0};X,x)\right\rVert\left\lVert g_{t}-f_{t}\right\rVert+\rho\sqrt{n}\left\lVert\left(\hat{\Theta}(w_{0};X,x)^{T}-\hat{\Theta}(w_{t};X,x)^{T}\right)\right\rVert

where l(ft)ρn\left\lVert l^{\prime}(f_{t})\right\rVert\leq\rho\sqrt{n}. Thus we have

fT(x)gT(x)\displaystyle\left\lVert f_{T}(x)-g_{T}(x)\right\rVert
eλTβlΘ^(w0;X,x)0Tgtfteλt𝑑t+eλTρn0TΘ^(w0;X,x)TΘ^(wt;X,x)Teλt𝑑t\displaystyle\leq e^{-\lambda T}\beta_{l}\left\lVert\hat{\Theta}(w_{0};X,x)\right\rVert\int_{0}^{T}\left\lVert g_{t}-f_{t}\right\rVert e^{\lambda t}dt+e^{-\lambda T}\rho\sqrt{n}\int_{0}^{T}\left\lVert\hat{\Theta}(w_{0};X,x)^{T}-\hat{\Theta}(w_{t};X,x)^{T}\right\rVert e^{\lambda t}dt

Applying Grönwall's inequality,

fT(x)gT(x)\displaystyle\left\lVert f_{T}(x)-g_{T}(x)\right\rVert
eλTρn0TΘ^(w0;X,x)TΘ^(wt;X,x)Teλt𝑑teeλTβlΘ^(w0;X,x)0Teλt𝑑t\displaystyle\leq e^{-\lambda T}\rho\sqrt{n}\int_{0}^{T}\left\lVert\hat{\Theta}(w_{0};X,x)^{T}-\hat{\Theta}(w_{t};X,x)^{T}\right\rVert e^{\lambda t}dt\cdot e^{e^{-\lambda T}\beta_{l}\left\lVert\hat{\Theta}(w_{0};X,x)\right\rVert\int_{0}^{T}e^{\lambda t}dt}
=eλTρn0TΘ^(w0;X,x)TΘ^(wt;X,x)Teλt𝑑te1λ(1eλT)βlΘ^(w0;X,x)\displaystyle=e^{-\lambda T}\rho\sqrt{n}\int_{0}^{T}\left\lVert\hat{\Theta}(w_{0};X,x)^{T}-\hat{\Theta}(w_{t};X,x)^{T}\right\rVert e^{\lambda t}dt\cdot e^{\frac{1}{\lambda}(1-e^{-\lambda T})\beta_{l}\left\lVert\hat{\Theta}(w_{0};X,x)\right\rVert}
=eλTe1λ(1eλT)βlΘ^(w0;X,x)ρn0TΘ^(w0;X,x)TΘ^(wt;X,x)Teλt𝑑t\displaystyle=e^{-\lambda T}e^{\frac{1}{\lambda}(1-e^{-\lambda T})\beta_{l}\left\lVert\hat{\Theta}(w_{0};X,x)\right\rVert}\rho\sqrt{n}\int_{0}^{T}\left\lVert\hat{\Theta}(w_{0};X,x)^{T}-\hat{\Theta}(w_{t};X,x)^{T}\right\rVert e^{\lambda t}dt

By Lemma E.1 and Lemma E.2, in a parameter ball B(w0;R)={w:ww0R}B(w_{0};R)=\{w:\left\lVert w-w_{0}\right\rVert\leq R\}, with high probability, |Θ^(w;x,x)Θ^(w0;x,x)|=O(R3L+1lnm/m)\left\lvert\hat{\Theta}(w;x,x^{\prime})-\hat{\Theta}(w_{0};x,x^{\prime})\right\rvert=O(R^{3L+1}\ln{m}/\sqrt{m}). Then we have

Θ^(w0;X,x)TΘ^(wt;X,x)T=O(R3L+1nlnmm)\displaystyle\left\lVert\hat{\Theta}(w_{0};X,x)^{T}-\hat{\Theta}(w_{t};X,x)^{T}\right\rVert=O(\frac{R^{3L+1}\sqrt{n}\ln{m}}{\sqrt{m}})

Thus we have

fT(x)gT(x)\displaystyle\left\lVert f_{T}(x)-g_{T}(x)\right\rVert 1λ(1eλT)e(1eT)βlΘ^(w0;X,x)ρnO(R3L+1nlnmm)\displaystyle\leq\frac{1}{\lambda}(1-e^{-\lambda T})e^{(1-e^{-T})\beta_{l}\left\lVert\hat{\Theta}(w_{0};X,x)\right\rVert}\rho\sqrt{n}\cdot O(\frac{R^{3L+1}\sqrt{n}\ln{m}}{\sqrt{m}})
=O(eβlΘ^(w0;X,x)R3L+1ρnlnmλm)\displaystyle=O(\frac{e^{\beta_{l}\left\lVert\hat{\Theta}(w_{0};X,x)\right\rVert}R^{3L+1}\rho n\ln{m}}{\lambda\sqrt{m}})

Appendix G Finite-width Neural Networks are Kernel Machines

Inspired by [18], we can also show that every neural network trained by (sub)gradient descent with a loss function of the form (7) is approximately a kernel machine, without the infinite-width assumption.

Theorem G.1.

Suppose a neural network f(w,x), with f a differentiable function of w, is learned from a training set \{(x_{i},y_{i})\}_{i=1}^{n} by (sub)gradient descent with the loss function L(w)=\frac{\lambda}{2}\left\lVert W^{(L+1)}\right\rVert^{2}+\sum_{i=1}^{n}l(f(w,x_{i}),y_{i}) and gradient flow. Assume the sign of the loss derivative stays unchanged during training, i.e., \text{sign}(l^{\prime}(f_{t}(x_{i}),y_{i}))=\text{sign}(l^{\prime}(f_{0}(x_{i}),y_{i})) for all t\in[0,T]. Then at time T,

fT(x)=i=1naiK(x,xi)+b,f_{T}(x)=\sum_{i=1}^{n}a_{i}K(x,x_{i})+b, (57)

where

ai=sign(l(f0(xi),yi)),b=eλTf0(x),\displaystyle a_{i}=-\text{sign}(l^{\prime}(f_{0}(x_{i}),y_{i})),\qquad b=e^{-\lambda T}f_{0}(x),
K(x,xi)=eλT0T|l(ft(xi),yi)|Θ^(wt;x,xi)eλt𝑑t\displaystyle K(x,x_{i})=e^{-\lambda T}\int_{0}^{T}\left\lvert l^{\prime}(f_{t}(x_{i}),y_{i})\right\rvert\hat{\Theta}(w_{t};x,x_{i})e^{\lambda t}\,dt
Proof.

As we have derived, the neural network follows the dynamics of Eq. (9):

dft(x)dt=λft(x)i=1nl(ft(xi),yi)Θ^(wt;x,xi).\frac{df_{t}(x)}{dt}=-\lambda f_{t}(x)-\sum_{i=1}^{n}l^{\prime}(f_{t}(x_{i}),y_{i})\hat{\Theta}(w_{t};x,x_{i}). (58)

Note this is a first-order inhomogeneous linear differential equation with time-dependent terms. Denote Q(t)=-\sum_{i=1}^{n}l^{\prime}(f_{t}(x_{i}),y_{i})\hat{\Theta}(w_{t};x,x_{i}),

dft(x)dt+λft(x)=Q(t).\frac{df_{t}(x)}{dt}+\lambda f_{t}(x)=Q(t). (59)

Let f0(x)f_{0}(x) be the initial model, prior to gradient descent. The solution is given by

fT(x)=eλT(f0(x)+0TQ(t)eλt𝑑t).f_{T}(x)=e^{-\lambda T}\biggl{(}f_{0}(x)+\int_{0}^{T}Q(t)e^{\lambda t}\,dt\biggr{)}. (60)

Then

fT(x)=eλT(f0(x)i=1n0Tl(ft(xi),yi)Θ^(wt;x,xi)eλt𝑑t)=eλTf0(x)i=1neλT0Tl(ft(xi),yi)Θ^(wt;x,xi)eλt𝑑t=eλTf0(x)i=1neλT0Tsign(l(ft(xi),yi))|l(ft(xi),yi)|Θ^(wt;x,xi)eλt𝑑t=eλTf0(x)i=1nsign(l(f0(xi),yi))eλT0T|l(ft(xi),yi)|Θ^(wt;x,xi)eλt𝑑t.\begin{split}f_{T}(x)&=e^{-\lambda T}\biggl{(}f_{0}(x)-\sum_{i=1}^{n}\int_{0}^{T}l^{\prime}(f_{t}(x_{i}),y_{i})\hat{\Theta}(w_{t};x,x_{i})e^{\lambda t}\,dt\biggr{)}\\ &=e^{-\lambda T}f_{0}(x)-\sum_{i=1}^{n}e^{-\lambda T}\int_{0}^{T}l^{\prime}(f_{t}(x_{i}),y_{i})\hat{\Theta}(w_{t};x,x_{i})e^{\lambda t}\,dt\\ &=e^{-\lambda T}f_{0}(x)-\sum_{i=1}^{n}e^{-\lambda T}\int_{0}^{T}\text{sign}(l^{\prime}(f_{t}(x_{i}),y_{i}))\cdot\left\lvert l^{\prime}(f_{t}(x_{i}),y_{i})\right\rvert\hat{\Theta}(w_{t};x,x_{i})e^{\lambda t}\,dt\\ &=e^{-\lambda T}f_{0}(x)-\sum_{i=1}^{n}\text{sign}(l^{\prime}(f_{0}(x_{i}),y_{i}))\cdot e^{-\lambda T}\int_{0}^{T}\left\lvert l^{\prime}(f_{t}(x_{i}),y_{i})\right\rvert\hat{\Theta}(w_{t};x,x_{i})e^{\lambda t}\,dt.\end{split} (61)

where the last equality uses the assumption sign(l(ft(xi),yi))=sign(l(f0(xi),yi)),t[0,T]\text{sign}(l^{\prime}(f_{t}(x_{i}),y_{i}))=\text{sign}(l^{\prime}(f_{0}(x_{i}),y_{i})),\forall t\in[0,T]. Thus

fT(x)=i=1naiK(x,xi)+b,f_{T}(x)=\sum_{i=1}^{n}a_{i}K(x,x_{i})+b, (62)

with

ai=sign(l(f0(xi),yi)),b=eλTf0(x),\displaystyle a_{i}=-\text{sign}(l^{\prime}(f_{0}(x_{i}),y_{i})),\qquad b=e^{-\lambda T}f_{0}(x),
K(x,xi)=eλT0T|l(ft(xi),yi)|Θ^(wt;x,xi)eλt𝑑t\displaystyle K(x,x_{i})=e^{-\lambda T}\int_{0}^{T}\left\lvert l^{\prime}(f_{t}(x_{i}),y_{i})\right\rvert\hat{\Theta}(w_{t};x,x_{i})e^{\lambda t}\,dt

K(x,x_{i})=e^{-\lambda T}\int_{0}^{T}\left\lvert l^{\prime}(f_{t}(x_{i}),y_{i})\right\rvert\hat{\Theta}(w_{t};x,x_{i})e^{\lambda t}\,dt is a valid kernel since it is a nonnegatively weighted combination of positive semi-definite kernels. The quantities a_{i}, b and K(x,x_{i}) stay bounded as long as f_{0}(x), l^{\prime}(f_{t}(x_{i}),y_{i}) and \hat{\Theta}(w_{t};x,x_{i}) are bounded. ∎
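To make the path-kernel construction concrete, the following PyTorch sketch accumulates the signed path integrals from Eq. (61) along a small-step (sub)gradient descent trajectory (as a proxy for gradient flow) for a toy two-layer ReLU network, and compares the reconstruction with the network output f_{T}(x). All sizes and the architecture are illustrative; when the sign assumption of Theorem G.1 holds, the signed term -e^{-\lambda T}S_{i} below equals a_{i}K(x,x_{i}).

```python
import math
import torch

torch.manual_seed(0)
C, lam, dt, steps = 1.0, 0.01, 1e-2, 300       # dt plays the role of a small learning rate
n, d, m = 8, 5, 256
T = steps * dt

X, y = torch.randn(n, d), torch.sign(torch.randn(n))
x_test = torch.randn(d)

# Two-layer ReLU net; only the last-layer weights are regularized, matching L(w) above.
W1 = torch.randn(m, d, requires_grad=True)
W2 = torch.randn(1, m, requires_grad=True)
params = [W1, W2]

def f(x):                                       # scalar output
    return (W2 @ torch.relu(W1 @ x)).squeeze() / math.sqrt(m)

def grad_vec(value):                            # gradient of a scalar w.r.t. all parameters
    g = torch.autograd.grad(value, params, retain_graph=True)
    return torch.cat([gi.reshape(-1) for gi in g])

f0_test = f(x_test).item()
S = torch.zeros(n)   # S_i = int_0^T l'(f_t(x_i), y_i) * Theta_hat(w_t; x_test, x_i) * e^{lam t} dt
for k in range(steps):
    fs = torch.stack([f(X[i]) for i in range(n)])
    lprime = -C * y * (y * fs.detach() < 1).float()        # l'(f_t(x_i), y_i) for the hinge loss
    g_test = grad_vec(f(x_test))
    for i in range(n):
        theta_i = torch.dot(g_test, grad_vec(fs[i]))       # Theta_hat(w_t; x_test, x_i)
        S[i] += lprime[i] * theta_i.detach() * math.exp(lam * k * dt) * dt
    # one (sub)gradient descent step on L(w) = lam/2 ||W2||^2 + sum_i l(f(x_i), y_i)
    loss = 0.5 * lam * (W2 ** 2).sum() + C * torch.clamp(1 - y * fs, min=0).sum()
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p, gp in zip(params, grads):
            p -= dt * gp

b = math.exp(-lam * T) * f0_test                            # b in Theorem G.1
# Eq. (61): f_T(x) = b - sum_i e^{-lam T} S_i; close up to the Euler discretization error.
print(f(x_test).item(), (b - math.exp(-lam * T) * S.sum()).item())
```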

Appendix H Robustness of Over-parameterized Neural Network

H.1 Robustness Verification of NTK

Consider an infinitely wide two-layer fully connected ReLU NN, f(x)=\frac{1}{\sqrt{m}}\sum_{j=1}^{m}v_{j}\sigma(\frac{1}{\sqrt{d}}w_{j}^{T}x), where \sigma(z)=\max(0,z) is the ReLU activation. The NTK is

Θ(x,x)=x,xd(πarccos(u)π)+xx2πd1u2=xx2πdh(u),\displaystyle\begin{split}\Theta(x,x^{\prime})=\frac{\left\langle x,x^{\prime}\right\rangle}{d}(\frac{\pi-\arccos(u)}{\pi})+\frac{\left\lVert x\right\rVert\left\lVert x^{\prime}\right\rVert}{2\pi d}\sqrt{1-u^{2}}=\frac{\left\lVert x\right\rVert\left\lVert x^{\prime}\right\rVert}{2\pi d}h(u),\end{split} (63)
h(u)=2u(πarccos(u))+1u2.\displaystyle h(u)=2u(\pi-\arccos(u))+\sqrt{1-u^{2}}. (64)

where u=\frac{\left\langle x,x^{\prime}\right\rangle}{\left\lVert x\right\rVert\left\lVert x^{\prime}\right\rVert}\in\left[-1,1\right]. Consider an \ell_{\infty} perturbation: for x\in B_{\infty}(x_{0},\delta)=\{x\in\mathbb{R}^{d}:\left\lVert x-x_{0}\right\rVert_{\infty}\leq\delta\}, we can bound \left\lVert x\right\rVert in the interval [\left\lVert x\right\rVert^{L},\left\lVert x\right\rVert^{U}] as follows.

x=x0+Δx0+Δx0+dδ=xU,\displaystyle\left\lVert x\right\rVert=\left\lVert x_{0}+\Delta\right\rVert\leq\left\lVert x_{0}\right\rVert+\left\lVert\Delta\right\rVert\leq\left\lVert x_{0}\right\rVert+\sqrt{d}\delta=\left\lVert x\right\rVert^{U},
x=x0+Δ|x0Δ|max(x0dδ,0)=xL.\displaystyle\left\lVert x\right\rVert=\left\lVert x_{0}+\Delta\right\rVert\geq\left\lvert\left\lVert x_{0}\right\rVert-\left\lVert\Delta\right\rVert\right\rvert\geq\max(\left\lVert x_{0}\right\rVert-\sqrt{d}\delta,0)=\left\lVert x\right\rVert^{L}.

Then we can also bound uu in [uL,uU][u^{L},u^{U}].

x,x=x0+Δ,x[x0,xdδx,x0,x+dδx],\displaystyle\left\langle x,x^{\prime}\right\rangle=\left\langle x_{0}+\Delta,x^{\prime}\right\rangle\in\left[\left\langle x_{0},x^{\prime}\right\rangle-\sqrt{d}\delta\left\lVert x^{\prime}\right\rVert,\left\langle x_{0},x^{\prime}\right\rangle+\sqrt{d}\delta\left\lVert x^{\prime}\right\rVert\right],
uL=x0,xdδxxUxifx0,xdδx0elsex0,xdδxxLx,\displaystyle u^{L}=\frac{\left\langle x_{0},x^{\prime}\right\rangle-\sqrt{d}\delta\left\lVert x^{\prime}\right\rVert}{\left\lVert x\right\rVert^{U}\left\lVert x^{\prime}\right\rVert}\quad\text{if}\ \left\langle x_{0},x^{\prime}\right\rangle-\sqrt{d}\delta\left\lVert x^{\prime}\right\rVert\geq 0\quad\text{else}\quad\frac{\left\langle x_{0},x^{\prime}\right\rangle-\sqrt{d}\delta\left\lVert x^{\prime}\right\rVert}{\left\lVert x\right\rVert^{L}\left\lVert x^{\prime}\right\rVert},
uU=x0,x+dδxxLxifx0,x+dδx0elsex0,x+dδxxUx,\displaystyle u^{U}=\frac{\left\langle x_{0},x^{\prime}\right\rangle+\sqrt{d}\delta\left\lVert x^{\prime}\right\rVert}{\left\lVert x\right\rVert^{L}\left\lVert x^{\prime}\right\rVert}\quad\text{if}\ \left\langle x_{0},x^{\prime}\right\rangle+\sqrt{d}\delta\left\lVert x^{\prime}\right\rVert\geq 0\quad\text{else}\quad\frac{\left\langle x_{0},x^{\prime}\right\rangle+\sqrt{d}\delta\left\lVert x^{\prime}\right\rVert}{\left\lVert x\right\rVert^{U}\left\lVert x^{\prime}\right\rVert},
uU=min(uU,1).\displaystyle u^{U}=\min(u^{U},1).

where \Delta\in B_{\infty}(0,\delta). h(u) is a bowl-shaped (convex) function, so it is easy to obtain its interval [h^{L}(u),h^{U}(u)]. Then we can obtain the interval of \Theta(x,x^{\prime}), denoted [\Theta^{L}(x,x^{\prime}),\Theta^{U}(x,x^{\prime})].

ΘL(x,x)=xLx2πdhL(u)ifhL(u)0elsexUx2πdhL(u),\displaystyle\Theta^{L}(x,x^{\prime})=\frac{\left\lVert x\right\rVert^{L}\left\lVert x^{\prime}\right\rVert}{2\pi d}h^{L}(u)\quad\text{if}\ h^{L}(u)\geq 0\quad\text{else}\quad\frac{\left\lVert x\right\rVert^{U}\left\lVert x^{\prime}\right\rVert}{2\pi d}h^{L}(u),
ΘU(x,x)=xUx2πdhU(u)ifhU(u)0elsexLx2πdhU(u).\displaystyle\Theta^{U}(x,x^{\prime})=\frac{\left\lVert x\right\rVert^{U}\left\lVert x^{\prime}\right\rVert}{2\pi d}h^{U}(u)\quad\text{if}\ h^{U}(u)\geq 0\quad\text{else}\quad\frac{\left\lVert x\right\rVert^{L}\left\lVert x^{\prime}\right\rVert}{2\pi d}h^{U}(u).

Suppose g(x)=\sum_{i=1}^{n}\alpha_{i}\Theta(x,x_{i}), where the \alpha_{i} are known after solving the kernel machine problem. Then we can lower and upper bound g(x) as follows.

g(x)i=1,αi>0nαiΘL(x,xi)+i=1,αi<0nαiΘU(x,xi),\displaystyle g(x)\geq\sum_{i=1,\alpha_{i}>0}^{n}\alpha_{i}\Theta^{L}(x,x_{i})+\sum_{i=1,\alpha_{i}<0}^{n}\alpha_{i}\Theta^{U}(x,x_{i}), (65)
g(x)i=1,αi<0nαiΘL(x,xi)+i=1,αi>0nαiΘU(x,xi).\displaystyle g(x)\leq\sum_{i=1,\alpha_{i}<0}^{n}\alpha_{i}\Theta^{L}(x,x_{i})+\sum_{i=1,\alpha_{i}>0}^{n}\alpha_{i}\Theta^{U}(x,x_{i}). (66)
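The following NumPy sketch implements this interval computation (Eqs. (63)-(66)). The function names are illustrative; for simplicity the interior minimum of the bowl-shaped h(u) is approximated on a grid, and \delta is assumed small enough that \lVert x\rVert^{L}>0.

```python
import numpy as np

def h(u):
    u = np.clip(u, -1.0, 1.0)
    return 2 * u * (np.pi - np.arccos(u)) + np.sqrt(1 - u ** 2)

def ntk_interval(x0, xp, delta, grid=1000):
    """Interval [Theta^L, Theta^U] of the two-layer ReLU NTK (Eq. 63) over the
    l_inf ball B(x0, delta), using the bounds on ||x|| and u derived above."""
    d = len(x0)
    nx0, nxp = np.linalg.norm(x0), np.linalg.norm(xp)
    xU, xL = nx0 + np.sqrt(d) * delta, max(nx0 - np.sqrt(d) * delta, 0.0)
    ip_lo, ip_hi = x0 @ xp - np.sqrt(d) * delta * nxp, x0 @ xp + np.sqrt(d) * delta * nxp
    uL = ip_lo / (xU * nxp) if ip_lo >= 0 else ip_lo / (xL * nxp)
    uU = ip_hi / (xL * nxp) if ip_hi >= 0 else ip_hi / (xU * nxp)
    uL, uU = max(uL, -1.0), min(uU, 1.0)
    hs = h(np.linspace(uL, uU, grid))     # h is bowl-shaped; a grid suffices for a sketch
    hL, hU = hs.min(), hs.max()
    thL = (xL if hL >= 0 else xU) * nxp * hL / (2 * np.pi * d)
    thU = (xU if hU >= 0 else xL) * nxp * hU / (2 * np.pi * d)
    return thL, thU

def bound_g(x0, delta, X_train, alpha):
    """Lower/upper bounds on g(x) = sum_i alpha_i Theta(x, x_i) over B(x0, delta), Eqs. (65)-(66)."""
    lo = hi = 0.0
    for xi, ai in zip(X_train, alpha):
        thL, thU = ntk_interval(x0, xi, delta)
        lo += ai * (thL if ai > 0 else thU)
        hi += ai * (thU if ai > 0 else thL)
    return lo, hi
```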

H.2 IBP for Two-layer Neural Network

See [22] for the computation of IBP. For affine layers under the NTK parameterization, the IBP bounds are computed as follows.

μk1=z¯k1+z¯k12rk1=z¯k1z¯k12μk=1mWμk1+brk=1m|W|rk1z¯k=μkrkz¯k=μk+rk\begin{split}\mu_{k-1}&=\frac{\overline{z}_{k-1}+\underline{z}_{k-1}}{2}\\ r_{k-1}&=\frac{\overline{z}_{k-1}-\underline{z}_{k-1}}{2}\\ \mu_{k}&=\frac{1}{\sqrt{m}}W\mu_{k-1}+b\\ r_{k}&=\frac{1}{\sqrt{m}}\left\lvert W\right\rvert r_{k-1}\\ \underline{z}_{k}&=\mu_{k}-r_{k}\\ \overline{z}_{k}&=\mu_{k}+r_{k}\end{split} (67)

where mm is the input dimension of that layer. At initialization, WW, μk1\mu_{k-1} and bb are independent. Since 𝔼[W]=0\mathop{\mathbb{E}}\left[W\right]=0 and 𝔼[b]=0\mathop{\mathbb{E}}\left[b\right]=0,

𝔼[μk]=1m𝔼[W]𝔼[μk1]+𝔼[b]=0\mathop{\mathbb{E}}\left[\mu_{k}\right]=\frac{1}{\sqrt{m}}\mathop{\mathbb{E}}\left[W\right]\mathop{\mathbb{E}}\left[\mu_{k-1}\right]+\mathop{\mathbb{E}}\left[b\right]=0 (68)

Since \left\lvert W\right\rvert follows a folded normal distribution (the absolute value of a normal distribution) and r_{k-1}\geq 0, \left\lvert W\right\rvert\geq 0, we have \mathop{\mathbb{E}}\left[\left\lvert W\right\rvert\right]\mathop{\mathbb{E}}\left[r_{k-1}\right]=O(m), so

𝔼[rk]=1m𝔼[|W|]𝔼[rk1]=O(m)\mathop{\mathbb{E}}\left[r_{k}\right]=\frac{1}{\sqrt{m}}\mathop{\mathbb{E}}\left[\left\lvert W\right\rvert\right]\mathop{\mathbb{E}}\left[r_{k-1}\right]=O(\sqrt{m}) (69)

Thus

𝔼[z¯k]=𝔼[μk]+𝔼[rk]=O(m)\displaystyle-\mathop{\mathbb{E}}\left[\underline{z}_{k}\right]=-\mathop{\mathbb{E}}\left[\mu_{k}\right]+\mathop{\mathbb{E}}\left[r_{k}\right]=O(\sqrt{m}) (70)
𝔼[z¯k]=𝔼[μk]+𝔼[rk]=O(m)\displaystyle\mathop{\mathbb{E}}\left[\overline{z}_{k}\right]=\mathop{\mathbb{E}}\left[\mu_{k}\right]+\mathop{\mathbb{E}}\left[r_{k}\right]=O(\sqrt{m}) (71)

This causes the robustness lower bound to decrease at a rate of O(1/\sqrt{m}). The same result holds for LeCun initialization, which is used in PyTorch for fully connected layers by default.
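A minimal NumPy sketch of the IBP propagation in Eq. (67) under the NTK parameterization, together with the width dependence discussed above; the layer sizes and Gaussian weights are illustrative.

```python
import numpy as np

def ibp_affine_ntk(z_lo, z_hi, W, b):
    """IBP bounds through an NTK-parameterized affine layer z -> W z / sqrt(m) + b, Eq. (67)."""
    m = W.shape[1]                       # input dimension of the layer
    mu = (z_hi + z_lo) / 2
    r = (z_hi - z_lo) / 2
    mu_out = W @ mu / np.sqrt(m) + b
    r_out = np.abs(W) @ r / np.sqrt(m)
    return mu_out - r_out, mu_out + r_out

def ibp_relu(z_lo, z_hi):
    """IBP bounds through an elementwise ReLU."""
    return np.maximum(z_lo, 0), np.maximum(z_hi, 0)

# Illustration: for a random two-layer net, the output interval width after the hidden layer
# of width m grows roughly like sqrt(m), consistent with Eqs. (70)-(71).
rng = np.random.default_rng(0)
d, delta = 20, 0.01
x = rng.normal(size=d)
for m in [100, 1000, 10000]:
    W1, W2 = rng.normal(size=(m, d)), rng.normal(size=(1, m))
    lo1, hi1 = ibp_affine_ntk(x - delta, x + delta, W1, np.zeros(m))   # first layer, input dim d
    lo1, hi1 = ibp_relu(lo1, hi1)
    lo2, hi2 = ibp_affine_ntk(lo1, hi1, W2, np.zeros(1))               # second layer, input dim m
    print(m, (hi2 - lo2).item())
```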