On the Equivalence between Neural Network and Support Vector Machine
Abstract
Recent research shows that the dynamics of an infinitely wide neural network (NN) trained by gradient descent can be characterized by Neural Tangent Kernel (NTK) [27]. Under the squared loss, the infinite-width NN trained by gradient descent with an infinitely small learning rate is equivalent to kernel regression with NTK [4]. However, the equivalence is only known for ridge regression currently [6], while the equivalence between NN and other kernel machines (KMs), e.g. support vector machine (SVM), remains unknown. Therefore, in this work, we propose to establish the equivalence between NN and SVM, and specifically, the infinitely wide NN trained by soft margin loss and the standard soft margin SVM with NTK trained by subgradient descent. Our main theoretical results include establishing the equivalences between NNs and a broad family of regularized KMs with finite-width bounds, which cannot be handled by prior work, and showing that every finite-width NN trained by such regularized loss functions is approximately a KM. Furthermore, we demonstrate our theory can enable three practical applications, including (i) non-vacuous generalization bound of NN via the corresponding KM; (ii) non-trivial robustness certificate for the infinite-width NN (while existing robustness verification methods would provide vacuous bounds); (iii) intrinsically more robust infinite-width NNs than those from previous kernel regression. Our code for the experiments is available at https://github.com/leslie-CH/equiv-nn-svm.
1 Introduction
Recent research has made some progress towards deep learning theory from the perspective of infinite-width neural networks (NNs). A fully-trained infinite-width NN follows kernel gradient descent in function space with respect to the Neural Tangent Kernel (NTK) [27]. Under this linear regime and the squared loss, it has been proved that the fully-trained NN is equivalent to kernel regression with NTK [27, 4], which characterizes the generalization ability of such a model [5]. NTK helps us understand the optimization [27, 19] and generalization [5, 12] of NNs through the lens of kernels. However, existing theories about NTK [27, 31, 4, 14] usually assume the loss is a function of the model output only, which excludes the case of regularization. Besides, they usually consider the squared loss, which corresponds to kernel regression and may offer limited insight into classification problems, since squared loss and kernel regression are typically used for regression.
On the other hand, another popular machine learning paradigm with a solid theoretical foundation before the prevalence of deep neural networks is the support vector machine (SVM) [10, 16], which allows learning linear classifiers in high-dimensional feature spaces. SVM tackles the sample complexity challenge by searching for large-margin separators and tackles the computational complexity challenge using the idea of kernels [46]. Learning an SVM model usually involves solving a dual problem cast as a convex quadratic program. Recently, algorithms based on subgradient descent [47] and coordinate descent [23] have further scaled SVM models to large datasets and high-dimensional feature spaces.
We note that existing theoretical analysis has mostly focused on connecting NNs with kernel regression [27, 4, 31], while the connection between NNs and SVMs has not yet been explored. In this work, we establish the equivalence between NN and SVM for the first time, to the best of our knowledge. More broadly, we show that our analysis can connect NNs with a family of regularized KMs, including kernel ridge regression (KRR), support vector regression (SVR), and regularized logistic regression, which previous results [27, 4, 31] cannot handle. To our knowledge, these are the first such equivalences beyond ridge regression. Importantly, the equivalences between infinite-width NNs and these regularized KMs may shed light on the understanding of NNs through these equivalent KMs [17, 48, 45, 52], especially towards understanding the training, generalization, and robustness of NNs for classification problems. Besides, regularization plays an important role in machine learning in restricting the complexity of models, and these equivalences may also shed light on the role of regularization for NNs. We highlight our contributions as follows:
• We derive the continuous (gradient flow) and discrete dynamics of SVM trained by subgradient descent and the dynamics of NN trained by the soft margin loss. We show that the dynamics of SVM with NTK and of NN are exactly the same in the infinite-width limit because of the constancy of the tangent kernel, and thus establish the equivalence. We show the same linear convergence rate of SVM and NN under a reasonable assumption. We verify the equivalence by subgradient descent and stochastic subgradient descent experiments on the MNIST dataset [29].
• We generalize our theory to general loss functions with regularization and establish the equivalences between NNs and a family of regularized KMs, as summarized in Table 1. We prove that the difference between the outputs of the NN and the KM is bounded in terms of the regularization coefficient and the width of the NN, and vanishes as the width goes to infinity. Additionally, we show that every finite-width neural network trained by such a regularized loss function is approximately a KM.
• We show that our theory offers three practical benefits: (i) computing a non-vacuous generalization bound of the NN via the corresponding KM; (ii) delivering a nontrivial robustness certificate for the over-parameterized NN, while existing robustness verification methods would give a trivial robustness certificate due to bound propagation [22, 57, 60]; in particular, their certificate decreases toward zero as the width of the NN increases; (iii) showing that the equivalent infinite-width NNs trained from our regularized KMs are more robust than the equivalent NN trained from previous kernel regression [27, 4] (see Table 3), which is perhaps not too surprising, as regularization has a strong connection to robust machine learning.
2 Related Works and Background
2.1 Related Works
Neural Tangent Kernel and dynamics of neural networks. NTK was first introduced in [27] and extended to the Convolutional NTK [4] and the Graph NTK [21]. [26] studied the NTK under orthogonal initialization. [6] reported strong performance of NTK on small-data tasks for both kernel regression and kernel SVM; however, the equivalence is currently known only for ridge regression, not for SVM and other KMs. A line of recent work [20, 1] proved the convergence of (convolutional) neural networks with large but finite width in a non-asymptotic way by showing that the weights do not move far from initialization along the optimization trajectory. [31] showed that the dynamics of wide neural networks are governed by a linear model given by the first-order Taylor expansion around the initial parameters. However, existing theories about NTK [27, 31, 4] usually assume the loss is a function of the model output only, which excludes the case of regularization. Besides, they usually consider the squared loss, which corresponds to kernel regression and may offer limited insight into classification problems, since squared loss and kernel regression are typically used for regression. In this paper, we study regularized loss functions and establish equivalences with KMs beyond kernel regression and regression problems.
We also study the robustness of NTK models. [24] studied label noise (labels generated by a ground-truth function plus Gaussian noise), while we consider robustness to input perturbations. They study the convergence rate of a NN trained by regularized squared loss to an underlying true function, while we give explicit robustness certificates for NNs. Our robustness certificate enables us to compare different models and show that the equivalent infinite-width NNs trained from our regularized KMs are more robust than the equivalent NN trained from previous kernel regression.
Neural network and support vector machine. Prior works [54, 53, 35, 50, 32] have explored the benefits of encouraging large margins in the context of deep networks. [15] introduced a new family of positive-definite kernel functions that mimic the computation in multi-layer NNs and applied these kernels in SVMs. [18] showed that NNs trained by gradient flow are approximately "kernel machines" with weights and bias that are functions of the input data, which, however, can be much more complex than a typical kernel machine. [47] proposed a subgradient algorithm to solve the primal problem of SVM, which obtains a solution of accuracy $\epsilon$ in $\tilde{O}(1/\epsilon)$ iterations, where $\tilde{O}$ omits logarithmic factors. In this paper, we also consider the SVM trained by subgradient descent and connect it with the NN trained by subgradient descent. [49, 3] studied the connection between SVM and the regularization neural network [44], a one-hidden-layer NN whose structure closely resembles that of a KM and which is not widely used in practice; NNs used in practice today (e.g., fully connected ReLU NNs, CNNs, ResNets) do not have such structures. [43] analyzed a two-layer NN trained by hinge loss without regularization on a linearly separable dataset. Note that an SVM must include a regularization term in order to achieve the max-margin solution.
Implicit bias of neural networks towards the max-margin solution. There is also a line of research on the implicit bias of neural networks towards the max-margin (hard margin SVM) solution under different settings and assumptions [51, 28, 13, 56, 40, 37]. Our paper complements this line of research by establishing an exact equivalence between the infinite-width NN and SVM. Our equivalences include not only the hard margin SVM but also the soft margin SVM and other regularized KMs.
2.2 Neural Networks and Tangent Kernel
We consider a general form of deep neural network with a linear output layer as [33]. Let , ,
(1) |
where each vector-valued function , with parameter ( is the number of parameters), is considered as a layer of the network. This definition includes the standard fully connected, convolutional (CNN), and residual (ResNet) neural networks as special cases. For a fully connected ReLU NN, with and .
Initialization and parameterization. In this paper, we consider the NTK parameterization [27], under which the constancy of the tangent kernel was initially observed. Specifically, the parameters are drawn i.i.d. from a standard Gaussian at initialization. The scaling factor in the output layer is required by the NTK parameterization so that the output is of constant order at initialization. While we only consider the NTK parameterization here, the results should extend to general parameterizations in the kernel regime [58].
Definition 2.1 (Tangent kernel [27]).
The tangent kernel associated with a function at some parameter is the inner product of the parameter gradients of the function evaluated at two inputs. Under certain conditions (usually the infinite-width limit and NTK parameterization), the tangent kernel at initialization converges in probability to a deterministic limit and remains constant during training. This limiting kernel is called the Neural Tangent Kernel (NTK).
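To make the definition concrete, here is a minimal sketch (not the paper's code) that estimates the tangent kernel of a two-layer ReLU network under NTK parameterization by taking inner products of parameter gradients; the width, input dimension, and architecture are illustrative choices.

```python
import torch

# Minimal sketch: empirical tangent kernel K(x1, x2) = <grad_theta f(x1), grad_theta f(x2)>
# for a two-layer ReLU network f(x) = v^T relu(W x) / sqrt(m) under NTK parameterization.
torch.manual_seed(0)
d, m = 5, 4096                                  # input dimension and hidden width (illustrative)
W = torch.randn(m, d, requires_grad=True)       # i.i.d. standard Gaussian initialization
v = torch.randn(m, requires_grad=True)

def f(x):
    return v @ torch.relu(W @ x) / m ** 0.5     # the 1/sqrt(m) factor keeps the output of constant order

def grad_theta(x):
    gW, gv = torch.autograd.grad(f(x), (W, v))
    return torch.cat([gW.flatten(), gv.flatten()])

def tangent_kernel(x1, x2):
    return grad_theta(x1) @ grad_theta(x2)

x1, x2 = torch.randn(d), torch.randn(d)
print(tangent_kernel(x1, x2).item())            # concentrates around the deterministic NTK as m grows
```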
2.3 Kernel Machines
Kernel machine (KM) is a model of the form , where is the model parameter and is a mapping from input space to some feature space, . is an optional nonlinear function, such as identity mapping for kernel regression and for SVM and logistic regression. The kernel can be exploited whenever the weight vector can be expressed as a linear combination of the training points, for some value of , , implying that we can express as , where is the kernel function. For a NN in NTK regime, we have , which makes the NN linear in the gradient feature mapping . Under squared loss, it is equivalent to kernel regression with (or equivalently using NTK as the kernel), and identity mapping [4].
As far as we know, no prior work establishes the equivalence between a fully trained NN and an SVM. [18] showed that NNs trained by gradient flow are approximately "kernel machines" with weights and bias that are functions of the input data, which, however, can be much more complex than a typical kernel machine. In this work, we compare the dynamics of the SVM and the NN trained by subgradient descent with the soft margin loss and show the equivalence between them in the infinite-width limit.
2.4 Subgradient Optimization of Support Vector Machine
We first formally define the standard soft margin SVM and then show how subgradient descent can be applied to obtain an approximate solution of the SVM primal problem. For simplicity, we consider the homogeneous model. (Note that one can always handle the bias term by appending an additional constant dimension to each sample.)
Definition 2.2 (Soft margin SVM).
Given labeled samples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ with $y_i \in \{-1, +1\}$, the hyperplane $\mathbf{w}$ that solves the optimization problem below realizes the soft margin classifier with geometric margin $1/\|\mathbf{w}\|$:

$$\min_{\mathbf{w},\,\boldsymbol{\xi}}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.}\quad y_i\,\mathbf{w}^\top\mathbf{x}_i \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \ i = 1,\dots,n.$$
Proposition 2.1.
The above primal problem of soft margin SVM can be equivalently formulated as the unconstrained problem

$$\min_{\mathbf{w}}\ g(\mathbf{w}), \qquad g(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}\max\big(0,\ 1 - y_i\,\mathbf{w}^\top\mathbf{x}_i\big), \tag{2}$$

where the second term is a hinge loss. The objective $g(\mathbf{w})$ is strongly convex in $\mathbf{w}$.
From this, we see that the SVM technique is equivalent to empirical risk minimization with regularization, where in this case the loss function is the nonsmooth hinge loss. Classical approaches usually consider the dual problem of the SVM and solve it as a convex quadratic program. Some recent algorithms, however, use subgradient descent [47] to optimize Eq. (2), which shows significant advantages when dealing with large datasets.
In this paper, we consider the soft margin SVM trained by subgradient descent. We use the subgradient $\mathbf{w} - C\sum_{i=1}^{n}\mathbb{1}[y_i\,\mathbf{w}^\top\mathbf{x}_i < 1]\, y_i\,\mathbf{x}_i$ of $g(\mathbf{w})$, where $\mathbb{1}[\cdot]$ is the indicator function. As proved in [47], subgradient descent finds a solution of accuracy $\epsilon$, i.e., an objective value within $\epsilon$ of the optimum, in $\tilde{O}(1/\epsilon)$ iterations, where $\tilde{O}$ omits logarithmic factors. Other works also give convergence guarantees for subgradient descent of convex functions [11, 9]. In the following analysis, we will generally assume the convergence of the SVM trained by subgradient descent.
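As an illustration of this procedure, the following is a minimal sketch (not the paper's implementation) of full-batch subgradient descent on a primal objective of the form 0.5·||w||² + C·Σᵢ max(0, 1 − yᵢ w·xᵢ) for a linear SVM; the data, step size, and iteration count are placeholders.

```python
import numpy as np

# Minimal sketch: full-batch subgradient descent on
# g(w) = 0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i * w.x_i).
def svm_subgradient_descent(X, y, C=1.0, lr=1e-3, steps=5000):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = y * (X @ w)
        active = (margins < 1).astype(float)      # indicator 1[y_i f(x_i) < 1]
        subgrad = w - C * (active * y) @ X        # a subgradient of g at w
        w -= lr * subgrad
    return w

# toy usage on nearly separable data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=200))
w = svm_subgradient_descent(X, y)
print("training accuracy:", np.mean(np.sign(X @ w) == y))
```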
3 Main Theoretical Results
In this section, we describe our main results. We first derive the continuous (gradient flow) and discrete dynamics of SVM trained by subgradient descent (Section 3.1) and the dynamics of NN trained by the soft margin loss (Sections 3.2 and 3.3). We show that they have similar dynamics, characterized by an inhomogeneous linear differential (difference) equation, and the same convergence rate under a reasonable assumption. Next, we show that their dynamics are exactly the same in the infinite-width limit because of the constancy of the tangent kernel, and thus establish the equivalence (Theorem 3.4). Furthermore, in Section 3.4, we generalize our theory to general loss functions with regularization and establish the equivalences between NNs and a family of regularized KMs, as summarized in Table 1.
3.1 Dynamics of Soft Margin SVM
For simplicity, we denote as at some time and . The proofs of the following two theorems are detailed in Appendix C.
Theorem 3.1 (Continuous dynamics and convergence rate of SVM).
Consider training the soft margin SVM by subgradient descent with an infinitesimally small learning rate (gradient flow [2]); the model then follows the evolution below:
(3) |
and has a linear convergence rate:
Denote , which changes over time until convergence. The model output at some time is
(4) |
The continuous dynamics of the SVM are described by an inhomogeneous linear differential equation (Eq. (3)), which admits an analytical solution. From Eq. (4), we can see that the influence of the initial model decays over time and eventually vanishes.
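For reference, the generic variation-of-constants solution of such an inhomogeneous linear ODE is shown below; Eq. (4) instantiates it with the specific coefficients of the SVM dynamics, which are not reproduced here.

```latex
\frac{d f_t}{dt} = -\lambda f_t + b_t
\quad\Longrightarrow\quad
f_t = e^{-\lambda t} f_0 + \int_0^t e^{-\lambda (t-s)}\, b_s \, ds .
```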
Theorem 3.2 (Discrete dynamics of SVM).
Let be the learning rate. The dynamics of subgradient descent is
(5) |
Denote , which changes over time. The model output at some time is
The discrete dynamics are characterized by an inhomogeneous linear difference equation (Eq. (5)). The discrete dynamics and solution of the SVM have a similar structure to the continuous case.
3.2 Soft Margin Neural Network
We first formally define the soft margin neural network and then derive the dynamics of training a neural network by subgradient descent with a soft margin loss. We will consider a neural network defined as Eq. (1). For convenience, we redefine with and .
Definition 3.1 (Soft margin neural network).
This is generally a hard nonconvex optimization problem, but we can apply subgradient descent to optimize it heuristically. At initialization, . The derivative of the regularization for is as . For a fixed , this problem is the same as an SVM with , kernel and parameter . If we only train the last layer of an infinite-width soft margin NN, it corresponds to an SVM with an NNGP kernel [30, 38]. But for a fully-trained NN, it changes over time.
3.3 Dynamics of Neural Network Trained by Soft Margin Loss
Denote the hinge loss in as . We use the same subgradient as that for SVM, .
Theorem 3.3 (Continuous dynamics and convergence rate of NN).
Suppose a NN defined as Eq. (1), with a differentiable function of , is learned from a training set by subgradient descent with and gradient flow. Then the network has the following dynamics:
Let be the tangent kernel evaluated on the training set and be its minimum eigenvalue. Assume the minimum eigenvalue is bounded away from zero; then the NN has at least a linear convergence rate, the same as the SVM:
The proof is in Appendix D. The key observation is that, when deriving the dynamics of the output, the regularization term in the loss function produces a decay term and the hinge loss produces the tangent kernel term, which overall gives dynamics similar to those of the SVM. Compared with the previous continuous dynamics without regularization [27, 31], our result has an extra term here because of the regularization term in the loss function. The convergence rate is proved based on a sufficient condition for the PL inequality. The assumption on the minimum eigenvalue can be guaranteed in a parameter ball by using a sufficiently wide NN [34].
If the tangent kernel is fixed, the dynamics of the NN are the same as those of the SVM (Eq. (3)) with that kernel, assuming the neural network and the SVM have the same initial output. (This can be achieved by adjusting the initial values, e.g., by subtracting the initial outputs, as we do in the experiments.) This constancy of the tangent kernel holds for infinitely wide neural networks of common architectures and does not depend on the optimization algorithm or the choice of loss function, as discussed in [33].
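The following is a minimal training sketch for a finite-width NN under a soft margin loss; it assumes a loss of the form 0.5·reg·||θ||² + Σᵢ max(0, 1 − yᵢ f(xᵢ)), which may differ from the exact scaling in Definition 3.1, and it subtracts the initial outputs so that the NN starts from zero output, as in our experiments.

```python
import torch

# Minimal sketch (assumed loss form: 0.5*reg*||theta||^2 + sum_i max(0, 1 - y_i f(x_i))).
# X: (n, d) float tensor; y: (n,) float tensor with entries +1/-1.
class TwoLayerNTK(torch.nn.Module):
    def __init__(self, d, m):
        super().__init__()
        self.W = torch.nn.Parameter(torch.randn(m, d))   # NTK parameterization: N(0, 1) entries
        self.v = torch.nn.Parameter(torch.randn(m))
        self.m = m

    def forward(self, x):
        return torch.relu(x @ self.W.T) @ self.v / self.m ** 0.5

def train_soft_margin_nn(X, y, m=2048, reg=1e-3, lr=1e-2, steps=2000):
    net = TwoLayerNTK(X.shape[1], m)
    with torch.no_grad():
        f0 = net(X).clone()                              # fixed initial outputs on the training set
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        out = net(X) - f0                                # start from zero output, matching the SVM
        hinge = torch.clamp(1 - y * out, min=0).sum()
        l2 = sum((p ** 2).sum() for p in net.parameters())
        (0.5 * reg * l2 + hinge).backward()              # autograd returns a valid subgradient of the hinge
        opt.step()
    return net, f0
```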
Assumptions. We assume that the (vector-valued) layer functions are Lipschitz continuous and twice differentiable with respect to the input and the parameters. These assumptions serve the following theorem, which shows the constancy of the tangent kernel.
Theorem 3.4 (Equivalence between NN and SVM).
As the minimum width of the NN, , goes to infinity, the tangent kernel tends to be constant, . Assume . Then the infinitely wide NN trained by subgradient descent with soft margin loss has the same dynamics as SVM with trained by subgradient descent:
And thus such NN and SVM converge to the same solution.
The proof is in Appendix E. We apply the results of [33] to show the constancy of the tangent kernel in the infinite-width limit. It is then easy to check that the dynamics of the infinitely wide NN and the SVM with NTK are the same. We give finite-width bounds for general loss functions in the next section. This theorem establishes the equivalence between the infinitely wide NN and SVM for the first time. Previous theoretical results on SVM [17, 48, 45, 52] can be directly applied to understand the generalization of NNs trained by the soft margin loss. Given that the tangent kernel is constant, or equivalently that the model is linear, we can also give the discrete dynamics of the NN (Appendix D.4), which are identical to those of the SVM. Compared with the previous discrete-time gradient descent [31, 58], our result has an extra term because of the regularization term in the loss function.
Table 1: Loss functions and the corresponding kernel machines that infinite-width NNs are equivalent to.

| | Loss | Kernel machine |
|---|---|---|
| ([27, 4]) | Squared loss, no regularization | Kernel regression |
| (ours) | Hinge loss, vanishing regularization | Hard margin SVM |
| (ours) | Hinge loss with ℓ2 regularization | (1-norm) soft margin SVM |
| | Squared hinge loss with ℓ2 regularization | 2-norm soft margin SVM |
| | ε-insensitive loss with ℓ2 regularization | Support vector regression |
| | Squared loss with ℓ2 regularization | Kernel ridge regression (KRR) |
| | Logistic loss with ℓ2 regularization | Logistic regression with regularization |
3.4 General Loss Functions
We note that the above analysis does not depend specifically on the hinge loss. Thus we can generalize our analysis to general loss functions of the model output, as long as the loss function is differentiable (or has subgradients) with respect to the model output, such as the squared loss and the logistic loss. Besides, we can scale the regularization term by a coefficient instead of scaling the loss term as we did for SVM; the two formulations are equivalent. Suppose the loss functions for the KM and the NN are
(7) |
Then the continuous dynamics of and are
(8) | |||
(9) |
where . When the tangent kernel is constant and equal to the KM's kernel, these two dynamics are the same (assuming the initial outputs agree). When the regularization coefficient is zero, we recover the previous results on kernel regression; when it is nonzero, we obtain our new results for regularized loss functions. Table 1 lists the different loss functions and the corresponding KMs that infinite-width NNs are equivalent to. KRR is considered in [25] to analyze the generalization of NNs; however, they directly assume the NN is a linear model and plug it into KRR. Below we give finite-width bounds on the difference between the outputs of the NN and the corresponding KM. The proof is in Appendix F.
Theorem 3.5 (Bounds on the difference between NN and KM).
Assume the model satisfies the assumptions stated in Appendix F; a linearized NN is a special case of such a model. Suppose the KM and NN are trained with the losses in (7) and gradient flow. Suppose the loss is Lipschitz and smooth in its first argument (i.e., the model output). Given any for some fixed , for the training data and a test point, with high probability over the initialization,
where are the outputs of the training data and is the tangent kernel evaluated between training data and test point.
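As a concrete instance of the correspondences in Table 1, the sketch below fits kernel ridge regression with a precomputed kernel (e.g., NTK) Gram matrix; the regularization scaling is the standard one and may differ from the exact constants in the loss (7).

```python
import numpy as np

# Minimal sketch: kernel ridge regression with a precomputed kernel Gram matrix.
# Standard form: alpha = (K + lam * I)^{-1} y,  f(x) = sum_j alpha_j k(x, x_j).
def krr_fit(K_train, y, lam):
    n = K_train.shape[0]
    return np.linalg.solve(K_train + lam * np.eye(n), y)

def krr_predict(K_test_train, alpha):
    # K_test_train[i, j] = k(x_test_i, x_train_j)
    return K_test_train @ alpha

# toy usage with a Gaussian kernel standing in for the NTK
rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 3)), rng.normal(size=50)
K = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
alpha = krr_fit(K, y, lam=0.1)
print(krr_predict(K, alpha)[:3])
```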
4 Discussion
In this section, we give some extensions and applications of our theory. We first show that every finite-width neural network trained by a regularized loss function is approximately a KM in Section 4.1, which enables us to compute a non-vacuous generalization bound of the NN via the corresponding KM. Next, in Section 4.2, we show that our theory of equivalence (in Section 3.3) is useful for evaluating the robustness of over-parameterized NNs with infinite width. In particular, our theory allows us to deliver nontrivial robustness certificates for infinite-width NNs, while existing robustness verification methods [22, 57, 60] become much looser as the width of the NN increases and trivial at infinite width (the experimental results are in Section 5 and Table 2).
4.1 Finite-width Neural Network Trained by Regularized Loss
Inspired by [18], we can also show that every NN trained by (sub)gradient descent with loss function (7) is approximately a KM without the assumption of infinite width.
Theorem 4.1.
Suppose a NN is learned from a training set by (sub)gradient descent with loss function (7) and gradient flow. Assume the required condition on the loss holds (this is the case for the hinge loss and the logistic loss). Then at some time ,
and , .
See the proof in Appendix G, which utilizes the solution of an inhomogeneous linear differential equation instead of integrating both sides of the dynamics (Eq. (9)) directly as in [18]. Note that in Theorem 4.1 the weights are deterministic, in contrast to [18]; deterministic weights make the function class simpler. Combining Theorem 4.1 with a bound on the Rademacher complexity of the KM [7] and a standard generalization bound using Rademacher complexity [39], we can compute a generalization bound for the NN via the corresponding KM. See Appendix B for more background and experiments. The generalization bound we obtain depends on label-dependent quantities; this differs from traditional complexity measures, which cannot explain the random-label phenomenon [59].
Remark 4.1.
We note that the kernel here is valid only when is a constant, e.g. at the initial training stage of hinge loss with , otherwise the kernel is not symmetric.
4.2 Robustness Verification of Infinite-width Neural Network
Our theory of equivalence allows us to deliver nontrivial robustness certificates for infinite-width NNs by considering the equivalent KMs. For an input , the objective of robustness verification is to find the largest ball such that no examples within this ball can change the classification result. Without loss of generality, we assume . The robustness verification problem can be formulated as follows,
(10) |
For an infinitely wide two-layer fully connected ReLU NN, , where is the ReLU activation, the NTK is
where . See the proof of the following theorem in Appendix H.1.
Theorem 4.2.
Consider the perturbation; for the perturbed input, we can bound the relevant kernel quantities into intervals. Suppose the learned weights are known after solving the KM problems (e.g., SVM and KRR). Then we can lower bound the output as follows,
Using a simple binary search and the above theorem, we can find a certified lower bound for (10). Because of the equivalence between the infinite-width NN and the KM, the certified lower bound we obtain for the KM is equivalently a certified lower bound for the corresponding infinite-width NN.
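A minimal sketch of this binary search is given below; `lower_bound_margin` is a hypothetical oracle that must return a valid lower bound on y·f over the perturbation ball (for example, via the interval bounds of Theorem 4.2) and is not reproduced from the paper.

```python
# Minimal sketch: binary search for a certified radius, given a sound oracle
# lower_bound_margin(x, eps) <= min over ||x' - x|| <= eps of y * f(x').
def certified_radius(x, lower_bound_margin, eps_max=1.0, tol=1e-4):
    lo, hi = 0.0, eps_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if lower_bound_margin(x, mid) > 0:   # prediction provably unchanged within radius mid
            lo = mid
        else:
            hi = mid
    return lo                                # certified lower bound for problem (10)
```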
Figure 1: Dynamics of the outputs and the hinge loss for the NN and the SVM trained by full-batch subgradient descent on binary MNIST (0 vs. 1).
5 Experiments
(I) Verification of the equivalence. The first experiment verifies the equivalence between the soft margin SVM with NTK trained by subgradient descent and the infinite-width NN trained by the soft margin loss. We train the SVM and a 3-layer fully connected ReLU NN for binary MNIST [29] classification (digits 0 and 1) with full-batch subgradient descent (the learning rates and sample sizes are given in Appendix A). Figure 1 shows the dynamics of the outputs and loss for the NN and the SVM. Since the regularization terms in the losses of the NN and the SVM differ, we plot only the hinge loss. The dynamics of the NN and the SVM agree very well. We also run a stochastic subgradient descent experiment with mini-batches for binary classification on the full MNIST 0 and 1 data (both training and test sets), shown in Figure A.1. For more details, please see Appendix A.
(II) Robustness of over-parameterized neural networks. Table 2 shows the robustness certificates of two-layer over-parameterized NNs with increasing width and of the SVM (which is equivalent to an infinite-width two-layer ReLU NN) on binary classification of MNIST (0 and 1). We use the NN robustness verification algorithm IBP [22] to compute the robustness certificates for the two-layer over-parameterized NNs. The robustness certificate for the SVM is computed using our method in Section 4.2. As demonstrated in Table 2, the certificate of the NN decreases as the width of the hidden layer grows and tends to 0 in the infinite-width limit. We show that this is due to bound propagation in Appendix H.2. Unfortunately, the decrease is even faster if the NN is deeper. The same problem occurs for LeCun initialization as well, which is used in PyTorch for fully connected layers by default. Notably, however, thanks to our theory, we can compute a nontrivial robustness certificate for an infinite-width NN through the equivalent SVM, as demonstrated.
Table 2: Robustness certificates (mean ± std) on 100 test points and on the full test set.

| Model | Width | 100 test | Full test |
|---|---|---|---|
| NN | | 7.4485 ± 2.5667 | 7.2708 ± 2.1427 |
| NN | | 2.9861 ± 1.0730 | 2.9367 ± 0.89807 |
| NN | | 0.99098 ± 0.35775 | 0.97410 ± 0.29997 |
| NN | | 0.31539 ± 0.11380 | 0.30997 ± 0.095467 |
| SVM | ∞ | 8.0541 ± 2.5827 | 7.9733 ± 2.1396 |
(III) Comparison with kernel regression. Table 3 compares our regularized models (KRR and SVM with NTK) with the previous kernel regression model (KRR with zero regularization). All the robustness certified lower bounds are computed using our method in Section 4.2. While the accuracies of the different models are similar, the robustness of KRR increases as the regularization increases, and the SVM is substantially more robust than KRR with a comparable regularization magnitude. Our theory enables us to train an equivalent infinite-width NN through SVM and KRR, which is intrinsically more robust than the previous kernel regression model.
Table 3: Test accuracy and robustness of KRR and SVM with NTK under different regularization strengths, compared with the previous kernel regression model.

| | Model | Regularization | Test accuracy | Robustness certificate | Robustness improvement |
|---|---|---|---|---|---|
| ([27, 4]) | KRR | 0 | 99.95% | 3.30202 | - |
| (ours) | KRR | 0.001 | 99.95% | 3.756122 | 1.14X |
| | KRR | 0.01 | 99.95% | 6.505500 | 1.97X |
| | KRR | 0.1 | 99.95% | 2.229960 | 6.75X |
| | KRR | 1 | 99.95% | 0.001005 | 30.43X |
| | KRR | 10 | 99.91% | 0.005181 | 156.90X |
| | KRR | 100 | 99.86% | 0.020456 | 619.50X |
| | KRR | 1000 | 99.76% | 0.026088 | 790.06X |
| | SVM | 0.064 | 99.95% | 0.008054 | 243.91X |
6 Conclusion and Future Works
In this paper, we establish the equivalence between the SVM with NTK and the NN trained by the soft margin loss with subgradient descent in the infinite-width limit, and we show that they have the same dynamics and solution. We also extend our analysis to general regularized loss functions and show that every neural network trained by such loss functions is approximately a KM. Finally, we demonstrate that our theory is useful for computing non-vacuous generalization bounds for NNs and non-trivial robustness certificates for infinite-width NNs, which existing NN robustness verification algorithms cannot handle, and that the infinite-width NNs obtained from our regularized models are intrinsically more robust than those from the previous NTK kernel regression. For future research, since the equivalence between NN and SVM (and other regularized KMs) with NTK has been established, it would be very interesting to understand the generalization and robustness of NNs from the perspective of these KMs. Our main results are currently still limited to the linear regime. It would be interesting to extend the results to the mean field setting or to consider the connection with the implicit bias of NNs.
7 Acknowledgement
We thank the anonymous reviewers for useful suggestions to improve the paper. We thank Libin Zhu for helpful discussions. We thank San Diego Supercomputer Center and MIT-IBM Watson AI Lab for computing resources. This work used the Extreme Science and Engineering Discovery Environment (XSEDE) [55], which is supported by National Science Foundation grant number ACI-1548562. This work used the Extreme Science and Engineering Discovery Environment (XSEDE) Expanse at San Diego Supercomputer Center through allocation TG-ASC150024 and ddp390. T.-W. Weng is partially supported by National Science Foundation under Grant No. 2107189.
References
- Allen-Zhu et al. [2019] Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pages 242–252. PMLR, 2019.
- Ambrosio et al. [2008] L. Ambrosio, N. Gigli, and G. Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.
- Andras [2002] P. Andras. The equivalence of support vector machine and regularization neural networks. Neural Processing Letters, 15(2):97–104, 2002.
- Arora et al. [2019a] S. Arora, S. S. Du, W. Hu, Z. Li, R. R. Salakhutdinov, and R. Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, pages 8141–8150, 2019a.
- Arora et al. [2019b] S. Arora, S. S. Du, W. Hu, Z. Li, and R. Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019b.
- Arora et al. [2019c] S. Arora, S. S. Du, Z. Li, R. Salakhutdinov, R. Wang, and D. Yu. Harnessing the power of infinitely wide deep nets on small-data tasks. arXiv preprint arXiv:1910.01663, 2019c.
- Bartlett and Mendelson [2002] P. L. Bartlett and S. Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
- Bartlett et al. [2019] P. L. Bartlett, N. Harvey, C. Liaw, and A. Mehrabian. Nearly-tight vc-dimension and pseudodimension bounds for piecewise linear neural networks. The Journal of Machine Learning Research, 20(1):2285–2301, 2019.
- Bertsekas and Scientific [2015] D. P. Bertsekas and A. Scientific. Convex optimization algorithms. Athena Scientific Belmont, 2015.
- Boser et al. [1992] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory, pages 144–152, 1992.
- Boyd et al. [2003] S. Boyd, L. Xiao, and A. Mutapcic. Subgradient methods. lecture notes of EE392o, Stanford University, Autumn Quarter, 2004:2004–2005, 2003.
- Cao and Gu [2019] Y. Cao and Q. Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. arXiv preprint arXiv:1905.13210, 2019.
- Chizat and Bach [2020] L. Chizat and F. Bach. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. In Conference on Learning Theory, pages 1305–1338. PMLR, 2020.
- Chizat et al. [2018] L. Chizat, E. Oyallon, and F. Bach. On lazy training in differentiable programming. arXiv preprint arXiv:1812.07956, 2018.
- Cho [2012] Y. Cho. Kernel methods for deep learning. PhD thesis, UC San Diego, 2012.
- Cortes and Vapnik [1995] C. Cortes and V. Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.
- Cristianini et al. [2000] N. Cristianini, J. Shawe-Taylor, et al. An introduction to support vector machines and other kernel-based learning methods. Cambridge university press, 2000.
- Domingos [2020] P. Domingos. Every model learned by gradient descent is approximately a kernel machine. arXiv preprint arXiv:2012.00152, 2020.
- Du et al. [2019a] S. Du, J. Lee, H. Li, L. Wang, and X. Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pages 1675–1685. PMLR, 2019a.
- Du et al. [2018] S. S. Du, X. Zhai, B. Poczos, and A. Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
- Du et al. [2019b] S. S. Du, K. Hou, B. Póczos, R. Salakhutdinov, R. Wang, and K. Xu. Graph neural tangent kernel: Fusing graph neural networks with graph kernels. arXiv preprint arXiv:1905.13192, 2019b.
- Gowal et al. [2018] S. Gowal, K. Dvijotham, R. Stanforth, R. Bunel, C. Qin, J. Uesato, R. Arandjelovic, T. Mann, and P. Kohli. On the effectiveness of interval bound propagation for training verifiably robust models. arXiv preprint arXiv:1810.12715, 2018.
- Hsieh et al. [2008] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear svm. In Proceedings of the 25th international conference on Machine learning, pages 408–415, 2008.
- Hu et al. [2021] T. Hu, W. Wang, C. Lin, and G. Cheng. Regularization matters: A nonparametric perspective on overparametrized neural network. In International Conference on Artificial Intelligence and Statistics, pages 829–837. PMLR, 2021.
- Hu et al. [2019] W. Hu, Z. Li, and D. Yu. Simple and effective regularization methods for training on noisily labeled data with generalization guarantee. arXiv preprint arXiv:1905.11368, 2019.
- Huang et al. [2020] W. Huang, W. Du, and R. Y. Da Xu. On the neural tangent kernel of deep networks with orthogonal initialization. arXiv preprint arXiv:2004.05867, 2020.
- Jacot et al. [2018] A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571–8580, 2018.
- Ji and Telgarsky [2018] Z. Ji and M. Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300, 2018.
- LeCun et al. [1998] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Lee et al. [2017] J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein. Deep neural networks as gaussian processes. arXiv preprint arXiv:1711.00165, 2017.
- Lee et al. [2019] J. Lee, L. Xiao, S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in neural information processing systems, pages 8572–8583, 2019.
- Liang et al. [2017] X. Liang, X. Wang, Z. Lei, S. Liao, and S. Z. Li. Soft-margin softmax for deep classification. In International Conference on Neural Information Processing, pages 413–421. Springer, 2017.
- Liu et al. [2020a] C. Liu, L. Zhu, and M. Belkin. On the linearity of large non-linear models: when and why the tangent kernel is constant. Advances in Neural Information Processing Systems, 33, 2020a.
- Liu et al. [2020b] C. Liu, L. Zhu, and M. Belkin. Loss landscapes and optimization in over-parameterized non-linear systems and neural networks. arXiv preprint arXiv:2003.00307, 2020b.
- Liu et al. [2016] W. Liu, Y. Wen, Z. Yu, and M. Yang. Large-margin softmax loss for convolutional neural networks. In ICML, volume 2, page 7, 2016.
- Long and Sedghi [2019] P. M. Long and H. Sedghi. Generalization bounds for deep convolutional neural networks. arXiv preprint arXiv:1905.12600, 2019.
- Lyu and Li [2019] K. Lyu and J. Li. Gradient descent maximizes the margin of homogeneous neural networks. arXiv preprint arXiv:1906.05890, 2019.
- Matthews et al. [2018] A. G. d. G. Matthews, M. Rowland, J. Hron, R. E. Turner, and Z. Ghahramani. Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271, 2018.
- Mohri et al. [2018] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of machine learning. MIT press, 2018.
- Nacson et al. [2019] M. S. Nacson, S. Gunasekar, J. Lee, N. Srebro, and D. Soudry. Lexicographic and depth-sensitive margins in homogeneous and non-homogeneous deep models. In International Conference on Machine Learning, pages 4683–4692. PMLR, 2019.
- Novak et al. [2020] R. Novak, L. Xiao, J. Hron, J. Lee, A. A. Alemi, J. Sohl-Dickstein, and S. S. Schoenholz. Neural tangents: Fast and easy infinite neural networks in python. In International Conference on Learning Representations, 2020. URL https://github.com/google/neural-tangents.
- Paszke et al. [2019] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703, 2019.
- Pellegrini and Biroli [2020] F. Pellegrini and G. Biroli. An analytic theory of shallow networks dynamics for hinge loss classification. Advances in Neural Information Processing Systems, 33, 2020.
- Poggio and Girosi [1990] T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78(9):1481–1497, 1990.
- Schölkopf et al. [2002] B. Schölkopf, A. J. Smola, F. Bach, et al. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2002.
- Shalev-Shwartz and Ben-David [2014] S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
- Shalev-Shwartz et al. [2011] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for svm. Mathematical programming, 127(1):3–30, 2011.
- Shawe-Taylor et al. [2004] J. Shawe-Taylor, N. Cristianini, et al. Kernel methods for pattern analysis. Cambridge university press, 2004.
- Smola et al. [1998] A. J. Smola, B. Schölkopf, and K.-R. Müller. The connection between regularization operators and support vector kernels. Neural networks, 11(4):637–649, 1998.
- Sokolić et al. [2017] J. Sokolić, R. Giryes, G. Sapiro, and M. R. Rodrigues. Robust large margin deep neural networks. IEEE Transactions on Signal Processing, 65(16):4265–4280, 2017.
- Soudry et al. [2018] D. Soudry, E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.
- Steinwart and Christmann [2008] I. Steinwart and A. Christmann. Support vector machines. Springer Science & Business Media, 2008.
- Sun et al. [2016] S. Sun, W. Chen, L. Wang, X. Liu, and T.-Y. Liu. On the depth of deep neural networks: A theoretical view. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
- Tang [2013] Y. Tang. Deep learning using linear support vector machines. arXiv preprint arXiv:1306.0239, 2013.
- Towns et al. [2014] J. Towns, T. Cockerill, M. Dahan, I. Foster, K. Gaither, A. Grimshaw, V. Hazlewood, S. Lathrop, D. Lifka, G. D. Peterson, R. Roskies, J. R. Scott, and N. Wilkins-Diehr. Xsede: Accelerating scientific discovery. Computing in Science & Engineering, 16(5):62–74, Sept.-Oct. 2014. ISSN 1521-9615. doi: 10.1109/MCSE.2014.80. URL doi.ieeecomputersociety.org/10.1109/MCSE.2014.80.
- Wei et al. [2019] C. Wei, J. Lee, Q. Liu, and T. Ma. Regularization matters: Generalization and optimization of neural nets vs their induced kernel. 2019.
- Weng et al. [2018] L. Weng, H. Zhang, H. Chen, Z. Song, C.-J. Hsieh, L. Daniel, D. Boning, and I. Dhillon. Towards fast computation of certified robustness for relu networks. In International Conference on Machine Learning, pages 5276–5285. PMLR, 2018.
- Yang and Hu [2020] G. Yang and E. J. Hu. Feature learning in infinite-width neural networks. arXiv preprint arXiv:2011.14522, 2020.
- Zhang et al. [2021] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.
- Zhang et al. [2018] H. Zhang, T.-W. Weng, P.-Y. Chen, C.-J. Hsieh, and L. Daniel. Efficient neural network robustness certification with general activation functions. arXiv preprint arXiv:1811.00866, 2018.
Appendix A Experiment Details
Figure A.1: Dynamics of the outputs and the hinge loss for the NN and the SVM trained by stochastic subgradient descent on the full binary MNIST (0 vs. 1) data.
A.1 SVM Training
We use the following loss to train the SVM,
(11) |
Let be the learning rate for this loss in experiments. Then the dynamics of subgradient descent is
(12) |
Denote , which is a linear combination of and changes over time. The model output at some time is
(13) |
If we set , we have
(14) |
We see that is always a linear combination of kernel values for . Since are fixed, we just need to store and update the weights of the kernel values. Let be the weights at time , that is
(15) |
Then according to Eq. (12), we update at each subgradient descent step as follows.
(16) |
For the SGD case, we sample at step and update the weights of this subset while keeping the other weights unchanged.
The kernelized implementation of Pegasos [47] uses a decaying learning rate of order 1/t to prove convergence of the algorithm; in our experiments, we use a constant learning rate.
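A minimal sketch of this kernelized update is shown below, assuming the primal objective 0.5·||w||² + C·Σᵢ max(0, 1 − yᵢ f(xᵢ)) and a constant learning rate; it is an illustration rather than the exact implementation used in the experiments.

```python
import numpy as np

# Minimal sketch: kernelized full-batch subgradient descent that stores only the weights
# beta of the kernel values, so that f(x) = sum_i beta_i * k(x, x_i).
# Assumed objective: 0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i f(x_i)).
def kernel_svm_subgradient(K, y, C=1.0, lr=1e-3, steps=2000):
    beta = np.zeros(K.shape[0])
    for _ in range(steps):
        f = K @ beta                                   # model outputs on the training set
        active = (y * f < 1).astype(float)             # 1[y_i f(x_i) < 1]
        beta = (1 - lr) * beta + lr * C * active * y   # w <- w - lr * (w - C * sum_i active_i y_i phi(x_i))
        # For SGD, recompute "active" only on the sampled batch and update just those weights.
    return beta
```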
A.2 More Details
(I) Verification of the equivalence. The first experiment illustrates the equivalence between the soft margin SVM with NTK trained by subgradient descent and the NN trained by the soft margin loss. We initialize 3-layer fully connected ReLU neural networks of two different widths, with NTK parameterization, and ensure the initial outputs are zero by subtracting the initial values from the NN's outputs. We initialize the SVM parameters to zero, which automatically makes the initial output zero. The SVM is trained by directly updating the weights of the kernel values [47]; more details can be found in Appendix A.1. We set the regularization parameter and take the average of the hinge loss instead of the sum (this is equivalent to rescaling the regularization coefficient in Eq. (7)). We train the NN and the SVM on a binary MNIST [29] classification task (digits 0 and 1) with full-batch subgradient descent. Figure 1 shows the dynamics of the outputs and loss for the NN and the SVM. Since the regularization terms in the losses of the NN and the SVM differ, we plot only the hinge loss. We see that the dynamics of the NN and the SVM agree well. We also run a stochastic subgradient descent experiment with mini-batches for binary MNIST classification on the full 0 and 1 data (both training and test sets), shown in Figure A.1.
Appendix B Computing Non-vacuous Generalization Bounds via Corresponding Kernel Machines

Using Theorem 4.1, we can numerically compute the kernel machine that the NN is equivalent to, i.e., we can compute the kernel matrix and the weights at any time during training. One can then apply a generalization bound for kernel machines to obtain a generalization bound for this kernel machine (equivalently, for this NN). Let be the reproducing kernel Hilbert space (RKHS) corresponding to the kernel. We use the RKHS norm of the learned function to control the complexity of the function class.
Lemma B.1 (Lemma 22 in [7]).
For a function class , its empirical Rademacher complexity can be bounded as
Assume the data is sampled i.i.d. from some distribution and the population loss is . The empirical loss is . Combining this with the standard generalization bound using Rademacher complexity below [39], we can obtain a bound on the population loss for the kernel machine (equivalently, for this NN).
Lemma B.2.
Suppose the loss is bounded in , and is -Lipschitz in the first argument. Then with probability at least over the sample of size ,
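The following is a minimal numerical sketch of how such a bound can be evaluated for a kernel machine; it assumes the function class is the RKHS ball whose radius equals the learned function's RKHS norm and uses standard constants, which may differ from the exact constants in Lemmas B.1 and B.2.

```python
import numpy as np

# Minimal sketch: evaluate a Rademacher-complexity generalization bound for a kernel
# machine f(x) = sum_i a_i k(x, x_i) with Gram matrix K on n training points.
# Assumptions: loss bounded in [0, c] and L-Lipschitz in the model output; the function
# class is the RKHS ball of radius B = sqrt(a^T K a).
def km_generalization_bound(K, a, empirical_loss, L=1.0, c=1.0, delta=0.05):
    n = K.shape[0]
    B = np.sqrt(a @ K @ a)                        # RKHS norm of the learned function
    rademacher = (B / n) * np.sqrt(np.trace(K))   # Lemma B.1 type bound (Bartlett & Mendelson [7])
    return empirical_loss + 2 * L * rademacher + 3 * c * np.sqrt(np.log(2 / delta) / (2 * n))
```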
Appendix C Dynamics of Support Vector Machine
In this section, we derive the continuous and discrete dynamics of the soft margin SVM trained by subgradient descent with the following loss function
(17) |
and the subgradient
(18) |
Lemma C.1.
satisfies the Polyak-Łojasiewicz (PL) inequality,
(19) |
where .
Proof.
Since is 1-strongly convex, by the definition of strong convexity and subgradient
(20) |
The right-hand side is a convex quadratic function of (for fixed ). Setting its gradient to zero, we find the minimizer of the right-hand side. Therefore we have
(21) |
Since this holds for any , we have
(22) |
∎
C.1 Continuous Dynamics of SVM
Here we give the detailed derivation of the dynamics of the soft margin SVM trained by subgradient descent. In the limit of an infinitesimally small learning rate, the subgradient descent update, which can also be written as
(23) |
becomes a differential equation
(24) |
This is known as gradient flow [2]. And we have defined the subgradient as
(25) |
Applying the chain rule, the dynamics of is
(26) |
Denoting , the equation becomes
(27) |
Note this is a first-order inhomogeneous differential equation. The general solution at some time is given by
(28) |
Since the loss function is strongly convex, the iterates will converge to the global optimizer in this infinitesimally small learning rate setting. This can be seen from
(29) |
We see that the loss is always decreasing. Since it is strongly convex and thus bounded from below, by the monotone convergence theorem, it will always converge. By Lemma C.1, we have the Polyak-Łojasiewicz (PL) inequality,
(30) |
Combining with above, we have
(31) |
Solving the equation, we get
(32) |
Thus we have a linear convergence rate.
Now, let us assume the iterates converge and examine the limit as time goes to infinity; as time increases, the contribution of the initial condition vanishes.
(33) |
This quantity changes over time because its constituent terms change. Suppose it keeps changing until some time and remains constant afterwards,
(34) |
As , the first part of the right-hand side converges to .
(35) |
C.2 Discrete Dynamics of SVM
Let be the learning rate. The equation of subgradient descent update at some time is
(36) |
The dynamics of is
(37) |
Denote the second part as , which changes over time. The model at some time is
(38) |
The convergence of subgradient descent usually requires the additional assumption that the norm of the subgradient is bounded. We refer readers to [47, 11, 9] for proofs. Here let us assume that subgradient descent converges to the global optimizer and that the above quantity keeps changing until some time and remains constant afterwards. As ,
(39) |
As , .
(40) |
Appendix D Dynamics and Convergence Rate of Neural Network Trained by Soft Margin Loss
D.1 Continuous Dynamics of NN
In the limit of an infinitesimally small learning rate, the subgradient descent update, which can also be written as
(41) |
becomes a differential equation
(42) |
This is known as gradient flow [2]. Then for any differentiable function ,
(43) |
where is the number of parameters. Replacing by its subgradient descent expression:
(44) |
And we know
(45) |
where the indicator equals 1 if the parameter is in the last layer and 0 otherwise. Combining the above,
(46) |
Rearranging terms:
(47) |
where is the number of parameters of the last layer. The first part of the right-hand side is
(48) |
Applying the subgradient of the hinge loss and the definition of the tangent kernel (Definition 2.1), the second part is
(49) |
Thus the equation becomes
(50) |
Take in
(51) |
D.2 Additional Notations
Denote as the training data. Denote and as the outputs of NN and SVM on the training data. Denote as the tangent kernel evaluated on the training data at time , and as the derivative of the loss function w.r.t. . Denote as the Jacobian and we have . Denote as the smallest eigenvalue of . Then we can write the dynamics of NN as
Let with . We can write the gradient as
D.3 Convergence of NN
The loss of NN is
where . The dynamics of the loss is
Since the loss is bounded from below, by the monotone convergence theorem, it will always converge to a stationary point. Applying Lemma D.1, we have
Thus we have linear convergence, the same as for the SVM.
Lemma D.1 (PL inequality of NN for soft margin loss).
Assume , then satisfies the PL condition
Proof.
We want the loss to satisfy the PL condition.
where the last inequality is the standard quadratic-form bound via the minimum eigenvalue. For the hinge loss and ,
Since , as long as , the loss satisfies the PL condition . can be guaranteed in a parameter ball when by using a sufficiently wide NN [34]. ∎
D.4 Discrete Dynamics of NN
The subgradient descent update is
(52) |
We consider the situation of a constant NTK, or equivalently a linear model. As proved in Proposition 2.2 of [33], the tangent kernel of a differentiable function is constant if and only if the function is linear in its parameters. Taking the Taylor expansion of at ,
(53) |
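A minimal sketch of this linearized model is given below for a tiny two-layer ReLU network; the architecture and the size of the parameter perturbation are illustrative choices, not taken from the paper.

```python
import torch

# Minimal sketch of the linearized (constant tangent kernel) model:
# f_lin(x; theta) = f(x; theta_0) + <grad_theta f(x; theta_0), theta - theta_0>.
torch.manual_seed(0)
d, m = 3, 64
theta0 = [torch.randn(m, d, requires_grad=True), torch.randn(m, requires_grad=True)]

def f(theta, x):
    W, v = theta
    return torch.relu(W @ x) @ v / m ** 0.5

def f_lin(theta, x):
    out0 = f(theta0, x)
    grads = torch.autograd.grad(out0, theta0)
    return out0.detach() + sum((g * (p - p0.detach())).sum()
                               for g, p, p0 in zip(grads, theta, theta0))

x = torch.randn(d)
theta = [p.detach() + 0.01 * torch.randn_like(p) for p in theta0]
print(f(theta, x).item(), f_lin(theta, x).item())   # nearly equal for a small parameter change
```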
Appendix E Proof of Theorem 3.4
Proof.
We prove the constancy of the tangent kernel by adopting the results of [33].
Lemma E.1 (Theorem 3.3 in [33]; Hessian norm is controlled by the minimum hidden layer width).
Consider a general neural network of the form Eq. (1), which can be a fully connected network, CNN, ResNet or a mixture of these types. Let m be the minimum of the hidden layer widths, i.e., . Given any fixed , and any , with high probability over the initialization, the Hessian spectral norm satisfies the following:
(54) |
Lemma E.2 (Proposition 2.3 in [33]; a small Hessian norm implies a small change of the tangent kernel).
Given a point and a ball with fixed radius , if the Hessian matrix satisfies , where , for all , then the tangent kernel of the model, as a function of , satisfies
(55) |
Applying the above two lemmas, we see that in the infinite-width limit, the spectral norm of the Hessian converges to zero and the tangent kernel remains constant in the ball .
Corollary E.2.1 (Constancy of the tangent kernel).
Consider a general neural network of the form Eq. (1). Given a point and a ball with fixed radius , in the infinite width limit, ,
(56) |
Thus we prove the constancy of the tangent kernel in the infinite-width limit. It is then easy to check that the dynamics of the infinitely wide NN are the same as the dynamics of the SVM with the constant NTK.
∎
Appendix F Bound the difference between SVM and NN
Assume the loss is Lipschitz and smooth in its first argument (i.e., the model output). Assume for any .
F.1 Bound the difference on the Training Data
The dynamics of the NN and SVM are
The dynamics of the difference between them is
The solution of the above differential equation at time is
using . Thus
F.2 Bound on the Test Data
For a test point , the proof is similar to the training case. Denote as the tangent kernel evaluated between the training data and the test point . Recall
The solution of the above differential equation is
using . Thus
Since is smooth,
where . Thus we have
Applying Grönwall's inequality,
Appendix G Finite-width Neural Networks are Kernel Machines
Inspired by [18], we can also show that every neural network trained by (sub)gradient descent with a loss function of the form (7) is approximately a kernel machine, without the infinite-width assumption.
Theorem G.1.
Suppose a neural network , with a differentiable function of , is learned from a training set by (sub)gradient descent with loss function and gradient flow. Assume remains unchanged during training. Then at some time ,
(57) |
where
Proof.
As we have derived, the neural network follows the dynamics of Eq. (9):
(58) |
Note this is a first-order inhomogeneous linear differential equation whose coefficients depend on . Denote ,
(59) |
Let be the initial model, prior to gradient descent. The solution is given by
(60) |
Then
(61) |
where the last equality uses the assumption . Thus
(62) |
with
∎
is a valid kernel since it is a nonnegative sum of positive definite kernels. Our , and will stay bounded as long as , and are bounded.
Appendix H Robustness of Over-parameterized Neural Network
H.1 Robustness Verification of NTK
For an infinitely wide two-layer fully connected ReLU NN with ReLU activation, the NTK is
(63) | |||
(64) |
where . Consider the perturbation; for the perturbed input, we can bound in the interval as follows.
Then we can also bound in .
where . This is a bow-shaped function, so it is easy to get its interval . Then we can get the interval of , denoted as .
Suppose the , are known after solving the kernel machine problem. Then we can lower bound and upper bound as follows.
(65) | |||
(66) |
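For completeness, the sketch below evaluates a closed-form two-layer ReLU NTK of the type in Eqs. (63)-(64), derived under standard NTK parameterization with both layers trained and no bias terms; the exact normalization in the paper may differ by constant factors.

```python
import numpy as np

# Minimal sketch: analytic NTK of a two-layer ReLU network (both layers trained, no bias).
def relu_ntk(x, z):
    nx, nz = np.linalg.norm(x), np.linalg.norm(z)
    u = np.clip(x @ z / (nx * nz), -1.0, 1.0)
    phi = np.arccos(u)
    sigma = nx * nz * (np.sin(phi) + (np.pi - phi) * np.cos(phi)) / (2 * np.pi)  # E[relu(w.x) relu(w.z)]
    sigma_dot = (np.pi - phi) / (2 * np.pi)                                      # E[relu'(w.x) relu'(w.z)]
    return sigma + (x @ z) * sigma_dot

x, z = np.array([1.0, 0.0]), np.array([0.6, 0.8])
print(relu_ntk(x, z))
```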
H.2 IBP for Two-layer Neural Network
See the computation of IBP in [22]. For affine layers under NTK parameterization, the IBP bounds are computed as follows.
(67) |
where is the input dimension of that layer. At initialization, , and are independent. Since and ,
(68) |
Since follows a folded normal distribution (absolute value of normal distribution) and , , ,
(69) |
Thus
(70) | |||
(71) |
This causes the robustness lower bound to decrease as the width grows. The same result holds for LeCun initialization, which is used in PyTorch for fully connected layers by default.
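The following toy sketch illustrates this effect numerically: it propagates an input interval through a randomly initialized two-layer ReLU network under NTK parameterization and prints the output interval width, which grows with the hidden width. It is an illustration of IBP rather than the exact computation of [22].

```python
import numpy as np

# Minimal sketch: interval bound propagation (IBP) through f(x) = v^T relu(W x) / sqrt(m).
def ibp_two_layer(W, v, x, eps):
    m = W.shape[0]
    center, radius = x, np.full_like(x, eps)
    h_c, h_r = W @ center, np.abs(W) @ radius                       # affine layer in center/radius form
    h_l, h_u = np.maximum(h_c - h_r, 0), np.maximum(h_c + h_r, 0)   # ReLU is elementwise monotone
    pos, neg = np.maximum(v, 0), np.minimum(v, 0)
    out_l = (pos @ h_l + neg @ h_u) / np.sqrt(m)
    out_u = (pos @ h_u + neg @ h_l) / np.sqrt(m)
    return out_l, out_u

rng = np.random.default_rng(0)
x = rng.normal(size=8)
for m in [100, 1000, 10000]:
    W, v = rng.normal(size=(m, 8)), rng.normal(size=m)
    lo, hi = ibp_two_layer(W, v, x, eps=0.01)
    print(m, hi - lo)   # the output interval width grows with the hidden width m
```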