
Explicit regularization and implicit bias in deep network classifiers trained with the square loss

Tomaso Poggio and Qianli Liao
Abstract

Deep ReLU networks trained with the square loss have been observed to perform well in classification tasks. We provide here a theoretical justification based on analysis of the associated gradient flow. We show that convergence to a solution with the absolute minimum norm is expected when normalization techniques such as Batch Normalization (BN) or Weight Normalization (WN) are used together with Weight Decay (WD). The main property of the minimizers that bounds their expected error is the norm: we prove that among all the close-to-interpolating solutions, the ones associated with smaller Frobenius norms of the unnormalized weight matrices have better margin and better bounds on the expected classification error. With BN but in the absence of WD, the dynamical system is singular. Implicit dynamical regularization – that is, zero initial conditions biasing the dynamics towards high-margin solutions – is also possible in the no-BN and no-WD case. The theory yields several predictions, including the role of BN and weight decay, aspects of Papyan, Han and Donoho's Neural Collapse and the constraints induced by BN on the network weights.

1 Introduction

In the case of exponential-type loss functions, a mechanism of complexity control underlying generalization was identified in the asymptotic margin-maximization effect of minimizing such losses [1, 2, 3]. However, this mechanism

  • cannot explain the good empirical results that have recently been demonstrated using the square loss [4];

  • cannot explain the empirical evidence that convergence for cross-entropy loss minimization depends on initialization.

This puzzle motivates our focus in this paper on the square loss. More details and sketches of the proofs can be found in [5].

Here we assume commonly used GD-based normalization algorithms such as BN [6] (or WN [7]) together with weight decay (WD), since they appear to be essential for reliably training deep networks [8] (and were used by [4]). We also consider, however, the case in which neither BN nor WD is used (this was the main concern of the first version of [5]), showing that a dynamic “implicit regularization” effect for classification is still possible, with convergence depending strongly on initial conditions.

1.1 Notation

We define a deep network with $L$ layers, with the usual coordinate-wise scalar activation functions $\sigma(z):\mathbf{R}\to\mathbf{R}$, as the set of functions $g(W;x)=W_{L}\sigma(W_{L-1}\cdots\sigma(W_{1}x))$, where the input is $x\in\mathbf{R}^{d}$ and the weights are given by the matrices $W_{k}$, one per layer, with matching dimensions. We sometimes use the symbol $W$ as a shorthand for the set of $W_{k}$ matrices, $k=1,\cdots,L$. There are no bias terms: the bias is instantiated in the input layer by one of the input dimensions being a constant. The activation nonlinearity is a ReLU, given by $\sigma(x)=x_{+}=\max(0,x)$. Furthermore,

  • we define $g(x)=\rho f(x)$, with $\rho$ defined as the product of the Frobenius norms of the weight matrices of the $L$ layers of the network and $f$ as the corresponding network with normalized weight matrices $V_{k}$ (because the ReLU is homogeneous [3]); see the sketch after this list;

  • in the following we use the notation $f_{n}$ for $f(x_{n})$, that is, the output of the normalized network for the input $x_{n}$;

  • we assume $||x||=1$;

  • separability is defined as correct classification of all training data, that is $y_{n}f_{n}>0\ \forall n$. We say there is average separability when $\sum_{n}y_{n}f_{n}>0$.
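To make the decomposition $g(x)=\rho f(x)$ concrete, here is a minimal NumPy sketch (the architecture and dimensions are arbitrary illustrative choices, not those used in our experiments). It builds a bias-free ReLU network, computes $\rho$ as the product of the Frobenius norms, and checks that $g(W;x)=\rho f(x)$ because the ReLU is positively homogeneous:

    import numpy as np

    rng = np.random.default_rng(0)
    dims = [10, 8, 6, 1]                                   # input dim d = 10, two hidden layers, scalar output
    W = [rng.normal(size=(dims[k + 1], dims[k])) for k in range(len(dims) - 1)]

    def net(weights, x):
        # g(W; x) = W_L sigma(W_{L-1} ... sigma(W_1 x)), no biases, ReLU sigma
        h = x
        for Wk in weights[:-1]:
            h = np.maximum(Wk @ h, 0.0)
        return weights[-1] @ h

    rho = np.prod([np.linalg.norm(Wk) for Wk in W])        # product of Frobenius norms
    V = [Wk / np.linalg.norm(Wk) for Wk in W]              # normalized weight matrices V_k

    x = rng.normal(size=dims[0]); x /= np.linalg.norm(x)   # ||x|| = 1
    print(net(W, x), rho * net(V, x))                      # equal up to floating-point error, by homogeneity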

1.2 Regression and classification

In our analysis of the square loss we need to explain when and why regression works well for classification: training minimizes the square loss, but we are interested in good classification performance (for simplicity we consider binary classification here). Unlike the case of linear networks, we expect several global zero-square-loss minima corresponding to interpolating solutions (in general degenerate, see [9] and references therein). Although all interpolating solutions are optimal solutions of the regression problem, they will in general have different margins and thus different expected classification performance. In other words, zero square loss does not by itself imply either large margin or good expected classification. Notice that if $g$ is a zero-loss solution of the regression problem, then $g(x_{n})=y_{n}\ \forall n$. This is equivalent to $\rho f_{n}=y_{n}$, where $f_{n}$ is the margin for $x_{n}$. Thus the norm $\rho$ of a minimizer is inversely related to its average margin. In fact, for an exact zero-loss solution of the regression problem the margin is the same for all training data $x_{n}$ and is equal to $\frac{1}{\rho_{eq}}$. Starting from small initialization, GD will explore critical points with $\rho$ growing from zero, as we will show. Thus interpolating solutions with small norm $\rho_{eq}$ (corresponding to the best margin) may be found before large-$\rho_{eq}$ solutions, which have worse margin. If the weight decay parameter is non-zero and large enough, there is independence from initial conditions. Otherwise, a near-zero initialization is required, as in the case of linear networks, though the reason is quite different and is due to an implicit bias in the dynamics of GD.
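As a simple worked instance of this inverse relation (the numbers are purely illustrative): with labels $y_{n}=\pm 1$, an exact zero-loss solution satisfies $\rho_{eq}f_{n}=y_{n}$, hence

y_{n}f_{n}=\frac{y_{n}^{2}}{\rho_{eq}}=\frac{1}{\rho_{eq}},

so a minimizer with $\rho_{eq}=10$ has margin $0.1$ on every training point, while one with $\rho_{eq}=100$ has margin only $0.01$.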

2 Dynamics and Generalization

Our key assumption is that the main property of batch normalization (BN) and weight normalization (WN) – the normalization of the weight matrices – can be captured by the gradient flow on a loss function modified by adding Lagrange multipliers.

Gradient descent on a modified square loss

\mathcal{L}=\sum_{n}(\rho f_{n}-y_{n})^{2}+\nu\sum_{k}||V_{k}||^{2} (1)

with $||V_{k}||^{2}=1$ is in fact exactly equivalent to “Weight Normalization”, as proved in [3], for deep networks.

This dynamics can be written as $\dot{\rho_{k}}=V_{k}^{T}\dot{W}_{k}$ and $\dot{V_{k}}=\rho S\dot{W}_{k}$ with $S=I-V_{k}V_{k}^{T}$. This shows that if $W_{k}=\rho_{k}V_{k}$ then $\dot{V_{k}}=\frac{1}{\rho_{k}}\dot{W_{k}}$, as mentioned in [8]. The condition $||V_{k}||^{2}=1$ yields, as shown in [5], $\nu=-\sum_{n}(\rho^{2}f_{n}^{2}-\rho y_{n}f_{n})$. Here we assume that the BN module is used in all layers apart from the last one, that is, we assume $\rho_{k}=1\ \forall k<L$ and $\rho_{L}=\rho$, where $L$ is the number of layers. (It is important to observe here that batch normalization – unlike Weight Normalization – leads not only to normalization of the weight matrices but also to normalization of each row of the weight matrices [3], because it normalizes separately the activity of each unit $i$ and thus – indirectly – the $W_{i,j}$ for each $i$ separately. This implies that each row $i$ of $(V_{k})_{i,j}$ is normalized independently and thus the whole matrix $V_{k}$ is normalized, assuming the normalization of each row is the same, $1$, for all rows. The equations involving $V_{k}$ can be read in this way, that is, restricted to each row.)

As we will show, the dynamical system associated with the gradient flow of the Lagrangian in Equation 1 is “singular”, in the sense that normalization is not guaranteed at the critical points. Regularization is needed, and in fact it is common in gradient descent to use not only batch normalization but also weight decay. Weight decay consists of a regularization term $\lambda||W_{k}||^{2}$ added to the Lagrangian, yielding

\mathcal{L}=\sum_{n}(\rho f_{n}-y_{n})^{2}+\nu\sum_{k}||V_{k}||^{2}+\lambda\rho^{2}. (2)

The associated gradient flow is then the following dynamical system

\dot{\rho}=-2[\sum_{n}\rho(f_{n})^{2}-\sum_{n}f_{n}y_{n}]-2\lambda\rho (3)
\dot{V_{k}}=2\rho\sum_{n}[(\rho f_{n}-y_{n})(V_{k}f_{n}-\frac{\partial f_{n}}{\partial V_{k}})] (4)

where the critical points $\dot{\rho}=0,\ \dot{V_{k}}=0$ are singular for $\lambda=0$ but are not singular for any arbitrarily small $\lambda>0$. (For $\lambda=0$ the zero-loss critical point is pathological, since $\dot{V_{k}}=0$ even when $(V_{k}f_{n}-\frac{\partial f_{n}}{\partial V_{k}})\neq 0$, implying that an un-normalized interpolating solution satisfies the equilibrium equations. Numerical simulations show that even for degenerate linear networks convergence is independent of initial conditions only if $\lambda>0$.) In particular, normalization is then effective at $\rho_{eq}$, unlike in the $\lambda=0$ case. As a side remark, SGD, as opposed to gradient flow, may help (especially with label noise) to counter to some extent the singularity of the $\lambda=0$ case, even without weight decay, because of the associated random fluctuations around the pathological critical point.

The equilibrium value at $\dot{\rho}=0$ is

\rho_{eq}=\frac{\sum_{n}y_{n}f_{n}}{\lambda+\sum_{n}f^{2}_{n}}. (5)

Observe that $\dot{\rho}>0$ if $\rho$ is smaller than $\rho_{eq}$ and if average separability holds. Recall also that zero-loss “global” minima (in fact arbitrarily close to zero for small but positive $\lambda$) are expected to exist and be degenerate [9].
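As an illustration of this dynamics, the following minimal NumPy sketch integrates Equations 3 and 4 with forward Euler for a toy one-layer model $f(x)=V\cdot x$, so that $\frac{\partial f_{n}}{\partial V}=x_{n}$. The data, step size, $\lambda$ and number of steps are arbitrary illustrative choices; at convergence the value of $\rho$ matches the $\rho_{eq}$ of Equation 5:

    import numpy as np

    rng = np.random.default_rng(0)
    N, d = 20, 5
    X = rng.normal(size=(N, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)          # ||x_n|| = 1
    y = np.sign(X @ rng.normal(size=d))                    # linearly separable +/-1 labels

    lam, dt = 0.01, 1e-3                                   # weight decay lambda and Euler step
    rho = 0.01                                             # small initialization rho(0)
    V = (y[:, None] * X).mean(axis=0)                      # initial direction with average separability
    V /= np.linalg.norm(V)

    for _ in range(200_000):
        f = X @ V                                          # normalized outputs f_n = V . x_n
        err = rho * f - y
        rho_dot = -2 * (rho * np.sum(f**2) - np.sum(f * y)) - 2 * lam * rho       # Equation 3
        V_dot = 2 * rho * (np.sum(err * f) * V - X.T @ err)                       # Equation 4, df_n/dV = x_n
        rho, V = rho + dt * rho_dot, V + dt * V_dot
        V /= np.linalg.norm(V)                             # the exact flow keeps ||V|| = 1; this removes Euler drift

    f = X @ V
    print(rho, np.sum(y * f) / (lam + np.sum(f**2)))       # rho vs. rho_eq of Equation 5: approximately equal

Changing the initialization of $\rho$ (small or large) changes the transient but not the equilibrium relation checked in the last line, since that relation simply expresses $\dot{\rho}=0$.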

If we assume that the loss (with the constraint $||V_{k}||=1$) is a continuous function of the $V_{k}$, then there will be at least one minimum of $\mathcal{L}$ at any fixed $\rho$, because the domain of the $V_{k}$ is compact. This means that for each $\rho$ there is at least one critical point of the gradient flow in $V_{k}$, and hence that for each critical $\rho$ with $\dot{\rho}=0$ there is at least one critical point of the dynamical system in $\rho$ and $V_{k}$.

Around $\dot{V_{k}}=0$ we have

\sum_{n}(\rho f_{n}-y_{n})\frac{\partial f_{n}}{\partial V_{k}}=\sum_{n}(\rho f_{n}-y_{n})(V^{eq}_{k}f_{n}), (6)

where the terms $(\rho f_{n}-y_{n})$ will be generically different from zero if $\lambda>0$.

The conclusions of this analysis can be summarized in

Observation 1

Assuming average separability and gradient flow starting from small norm $\rho(0)=\epsilon>0$, $\rho(t)$ grows monotonically until a minimum is reached at which $\rho_{eq}=\frac{\sum_{n}y_{n}f_{n}}{\lambda+\sum_{n}f^{2}_{n}}$. This dynamics is expected even in the limit of $\lambda=0$, which corresponds to exact interpolation of all the training data at a singular critical point.

and

Observation 2

Minimizers with small $\rho_{eq}$ correspond to large average margin $\sum_{n}y_{n}f_{n}$. In particular, suppose that the gradient flow converges to a $\rho_{eq}$ and $V_{k}^{eq}$ which correspond to zero square loss. Among all such minimizers, the one with the smallest $\rho_{eq}$ (typically found first during the GD dynamics as $\rho$ increases from $\rho=0$) corresponds to the (absolute) minimum norm – and maximum margin – solution.

In general, there may be several critical points of the $V_{k}$ for the same $\rho_{eq}$; they are typically degenerate (see references in [10]), with dimensionality $W-N$, where $W$ is the number of weights in the network and $N$ the number of training data. All of them correspond to the same norm and all have the same margin on each of the training points.

Since the maximum output of a multilayer network is usually $\ll 1$, the first critical point for increasing $\rho$ occurs when $\rho$ becomes large enough for the following equation to have solutions

\sum_{n}y_{n}f_{n}=\rho(\lambda+\sum_{n}f^{2}_{n}). (7)

If gradient flow starts from very small $\rho$ and there is average separability, $\rho$ increases monotonically until such a minimum is found. If $\rho$ is large, then $\dot{\rho}<0$ and $\rho$ will decrease until a minimum is found.

For large $\rho$ and very small $\lambda$, we expect many solutions under GD. (It is interesting to recall [9] that for SGD – unlike GD – the algorithm stops only when $\ell_{n}=0\ \forall n$, which is the global minimum and corresponds to perfect interpolation. At the other critical points, where GD stops, SGD never stops but keeps fluctuating around the critical point.) The emerging picture is a landscape in which there are no zero-loss minima for $\rho<\rho_{min}$. As $\rho$ increases from $\rho=0$, zero square-loss degenerate minima appear, with the minimizer representing an almost-interpolating solution (for $\lambda>0$). We expect, however, that depending on the value of $\lambda$, there is a bias towards minimum $\rho_{eq}$ even for large-$\rho$ initializations and certainly for small initializations.

All these observations are also supported by our numerical experiments. Figures 1, 2, 3 and 4 show the case of gradient descent with batch normalization and weight decay, which corresponds to a well-posed dynamical system for gradient flow; the other figures show the same networks and data with BN but without WD, and with neither BN nor WD. As predicted by the analysis, the BN+WD case is the most well-behaved, whereas the others depend strongly on initial conditions.
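For reference, here is a minimal PyTorch sketch of the kind of training run described in the caption of Figure 1. It is not the exact code used for the experiments: the hidden widths, the choice of class pair and the $\pm 1$ label mapping are our own illustrative assumptions, while batch size, learning rate, momentum, weight decay and the Frobenius-norm-5 initialization follow the caption.

    import torch
    import torch.nn as nn
    from torchvision import datasets, transforms

    # Two-class subset of CIFAR-10 (the pair 0 vs. 1 is an arbitrary choice), labels mapped to +/-1.
    data = datasets.CIFAR10(root="./data", train=True, download=True, transform=transforms.ToTensor())
    idx = [i for i, t in enumerate(data.targets) if t in (0, 1)]
    subset = torch.utils.data.Subset(data, idx)

    def collate(batch):
        x = torch.stack([img.flatten() for img, _ in batch])
        x = x / x.norm(dim=1, keepdim=True)                # every input scaled to norm 1
        y = torch.tensor([1.0 if label == 0 else -1.0 for _, label in batch])
        return x, y

    loader = torch.utils.data.DataLoader(subset, batch_size=128, shuffle=True, collate_fn=collate)

    dims = [3 * 32 * 32, 512, 512, 512, 1]                 # 4 fully connected layers (hidden width assumed)
    layers = []
    for i in range(4):
        layers.append(nn.Linear(dims[i], dims[i + 1], bias=False))
        if i < 3:
            layers += [nn.BatchNorm1d(dims[i + 1]), nn.ReLU()]
    model = nn.Sequential(*layers)

    with torch.no_grad():                                  # scale each weight matrix to Frobenius norm 5
        for m in model:
            if isinstance(m, nn.Linear):
                m.weight.mul_(5.0 / m.weight.norm())

    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=0.01)
    mse = nn.MSELoss()

    for epoch in range(1000):
        for x, y in loader:
            opt.zero_grad()
            loss = mse(model(x).squeeze(1), y)
            loss.backward()
            opt.step()
        rho = torch.prod(torch.stack([m.weight.norm() for m in model if isinstance(m, nn.Linear)]))
        print(epoch, loss.item(), rho.item())              # track rho, the product of Frobenius norms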

Figure 1: ConvNet with Batch Normalization and Weight Decay. Binary classification on two classes from CIFAR-10, trained with MSE loss. The model is a very simple network with 4 fully connected layers. ReLU nonlinearity is used. Batch normalization is used. The weight matrices of all layers are initialized with a zero-mean normal distribution, scaled by a constant such that the Frobenius norm of each matrix is 5. We use weight decay of 0.01. We run SGD with batch size 128, constant learning rate 0.1 and momentum 0.9 for 1000 epochs. No data augmentation. Every input to the network is scaled such that it has Frobenius norm 1.
Figure 2: ConvNet with Batch Normalization and Weight Decay. Dynamics of $\rho$ from experiments in Figure 1. First row: small initialization (0.1). Second row: medium initialization (1). Third row: large initialization (5). A dashed rectangle denotes the previous subplot’s domain and range in the new subplot.
Figure 3: ConvNet with Batch Normalization and Weight Decay. Dynamics of the average of $|f_{n}|$ from experiments in Figure 1. First row: small initialization (0.1). Second row: medium initialization (1). Third row: large initialization (5). A dashed rectangle denotes the previous subplot’s domain and range in the new subplot.
Figure 4: ConvNet with Batch Normalization and Weight Decay. Margin of all training samples.

To show that $\rho$ indeed controls the expected error we use classical bounds that lead to the following theorem

Observation 3

With probability $1-\delta$

L(f)\leq c_{1}\rho\mathbb{R}_{N}(\tilde{\mathbb{F}})+c_{2}\epsilon(N,\delta) (8)

where $c_{1},c_{2}$ are constants that reflect the Lipschitz constant of the loss function (for the square loss this requires a bound on $f(x)$) and the architecture of the network. The Rademacher average $\mathbb{R}_{N}(\tilde{\mathbb{F}})$ depends on the normalized network architecture and on $N$, the number of training data. Thus for the same network and the same data, the upper bound on the expected error of the minimizer is smaller for smaller $\rho$.
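Under the interpolation assumption of Section 1.2 (the margin $y_{n}f_{n}$ is the same for every training point and equals $1/\rho_{eq}$), the bound of Equation 8 can be read in margin form, with the same constants $c_{1},c_{2}$:

L(f)\leq c_{1}\rho_{eq}\mathbb{R}_{N}(\tilde{\mathbb{F}})+c_{2}\epsilon(N,\delta)=c_{1}\frac{\mathbb{R}_{N}(\tilde{\mathbb{F}})}{y_{n}f_{n}}+c_{2}\epsilon(N,\delta),

so a smaller $\rho_{eq}$, that is a larger margin, tightens the bound.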

The theorem proves the conjecture in [11] that for deep networks, as for kernel machines, minimum norm interpolating solutions are the most stable.

Figure 5: ConvNet with Batch Normalization but no Weight Decay. Binary classification on two classes from CIFAR-10, trained with MSE loss. The model is a very simple network with 4 layers of convolutions. ReLU nonlinearity is used. Batch normalization is used without parameters (affine=False in PyTorch). The weight matrices of all layers are initialized with a zero-mean normal distribution, scaled by a constant such that the Frobenius norm of each matrix is either 0.1 or 5. We run SGD with batch size 128, constant learning rate 0.01 and momentum 0.9 for 1000 epochs. No data augmentation. Every input to the network is scaled such that it has Frobenius norm 1. This is a single run but it is typical for the parameter values we used.
Figure 6: ConvNet with Batch Normalization but no Weight Decay. Dynamics of $\rho$ from experiments in Figure 5. Top row: small initialization (0.1). Bottom row: large initialization (5). The plot starts at $\rho(0)=0$ despite a non-zero initialization of the $\rho_{k}$ because the scaling factor of BN starts from 0. A dashed rectangle denotes the previous subplot’s domain and range in the new subplot.
Figure 7: ConvNet with Batch Normalization but no Weight Decay. Margin of all training samples (see previous figures). If the solution corresponded to exactly zero square loss, the margin distribution would be a horizontal line.
Figure 8: ConvNet with Batch Normalization but no Weight Decay. Histogram of $|f_{n}|$ over time. Top figure: initial $\rho_{k}=0.1$. Bottom figure: initial $\rho_{k}=5$.

2.1 Predictions

  • In a recent paper, Papyan, Han and Donoho [12] described four empirical properties of the terminal phase of training (TPT) deep networks using the cross-entropy loss function. TPT begins at the epoch where the training error first vanishes. During TPT, the training error stays effectively zero, while the training loss is pushed toward zero. Direct empirical measurements expose an inductive bias they call neural collapse (NC), involving four interconnected phenomena. (NC1) Cross-example within-class variability of last-layer training activations collapses to zero, as the individual activations themselves collapse to their class means. (NC2) The class means collapse to the vertices of a simplex equiangular tight frame (ETF). (NC3) Up to rescaling, the last-layer classifiers collapse to the class means or, in other words, to the simplex ETF (i.e., to a self-dual configuration). (NC4) For a given activation, the classifier’s decision collapses to simply choosing whichever class has the closest train class mean (i.e., the nearest class center [NCC] decision rule). We show in [5] that these properties of Neural Collapse [12] seem to be predicted by the theory of this paper for the global (that is, close-to-zero square-loss) minima, irrespective of the value of $\rho_{eq}$. We recall that the basic assumptions of the analysis are Batch Normalization and Weight Decay. Our predictions are for the square loss, but we show that they should also hold in the case of cross-entropy, explored in [12].

  • At a close-to-zero-loss critical point of the flow, $\nabla_{V_{k}}f(x_{j})=V_{k}f(x_{j})$ for $x_{j}$ in the training set; these are powerful constraints on the weight matrices to which training converges. A specific dependence of the matrix at each layer on the matrices at the other layers is thus required. In particular, there are specific relations for each layer matrix $V_{k}$ of the type, explained in the Appendix,

    V_{k}f=[V_{L}D_{L-1}(x)V_{L-1}\cdots V_{k+1}D_{k}(x)]^{T}D_{k-1}(x)V_{k-1}D_{k-2}(x)\cdots D_{1}(x)V_{1}x, (9)

    where the $D$ matrices are diagonal with entries either 0 or 1, depending on whether the corresponding ReLU unit is on or off.

    As described in [5] for linear networks, a class of possible solutions to these constraint equations is given by projection matrices; another by orthogonal matrices and, more generally, orthogonal Stiefel matrices on the sphere. These are sufficient but not necessary conditions for satisfying the constraint equations. Interestingly, randomly initialized weight matrices (an extreme case of the NTK regime) are approximately orthogonal. A numerical check of the gradient structure underlying Equation 9 is sketched below.
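    These relations rely on the standard ReLU gradient structure of $\frac{\partial f}{\partial V_{k}}$ written with the diagonal $D$ matrices. Here is a minimal PyTorch check of that structure for the first-layer gradient of a small, bias-free ReLU network (the sizes are arbitrary; this checks only the form of the gradient, not the critical-point constraint itself):

        import torch

        torch.manual_seed(0)
        dims = [6, 5, 4, 1]                                # three layers, scalar output, no biases
        V = [torch.randn(dims[k + 1], dims[k], requires_grad=True) for k in range(3)]
        x = torch.randn(dims[0]); x = x / x.norm()

        # Forward pass, recording the diagonal ReLU gates D_k(x).
        h, D = x, []
        for Vk in V[:-1]:
            z = Vk @ h
            D.append(torch.diag((z > 0).float()))
            h = torch.relu(z)
        f = (V[-1] @ h).squeeze()
        f.backward()

        # For k = 1, the right-hand side of Equation 9 is built from
        # b = [V_3 D_2(x) V_2 D_1(x)]^T and the input x, and b x^T equals df/dV_1.
        with torch.no_grad():
            b = (V[2] @ D[1] @ V[1] @ D[0]).t()            # shape (dims[1], 1)
            print(torch.allclose(V[0].grad, b @ x.unsqueeze(0), atol=1e-6))   # True almost everywhere in x

    The critical-point constraint of Equation 9 then asks, in addition, that this quantity equal $V_{k}f(x_{j})$ for every training point $x_{j}$.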

3 Explicit Regularization and Implicit Dynamic Bias

We have established here convergence of the gradient flow to a minimum norm solution for the square loss, when gradient descent is used with BN (or WN) and WD. This result assumes explicit regularization (WD) and is thus consistent with [13], where it is proved that implicit regularization with the square loss cannot be characterized by any explicit function of the model parameters. We also show, however, that in the absence of WD and BN good solutions for binary classification can be found because of the bias towards small norm solutions introduced in the dynamics of GD by near-zero initial conditions. Again, this is consistent with [13], because we identify an implicit bias in the dynamics which is relevant for classification.

For the exponential loss, BN is strictly speaking not needed, since minimization of the exponential loss maximizes the margin and minimizes the norm without BN, independently of initial conditions. Thus under the exponential loss we expect a margin-maximization bias for $t\to\infty$, as shown in [14], independently of initial conditions. This effect, however, can require very long times and unreasonably high numerical precision to become significant in practice.

If there exist several almost-interpolating solutions with the same norm $\rho_{eq}$, they also have the same margin for each of the training data. Though they have the same norm and the same margin on each data point, they may in principle differ in the ranks of the weight matrices or in the rank of the local Jacobian $\frac{\partial f_{n}}{\partial V_{k}}$ (at the minimum $W^{*}$). Notice that in deep linear networks the GD dynamics biases the solution towards small rank, since large eigenvalues converge much faster than small ones [15]. It is unclear whether the rank plays a role in our analysis of generalization; we conjecture it does not.

Why does GD have difficulties converging in the absence of BN+WD, especially for very deep networks? For the square loss, the best answer is that good tuning of the learning rate is important, and BN together with weight decay has been shown to provide a remarkable auto-tuning [8]. A related answer is that regularization is needed to provide stability, including numerical stability.

4 Summary

The main results of the paper can be summarized in the following

Lemma 1

If the gradient flow with normalization and weight decay converges to an interpolating solution with near-zero square loss, the following properties hold:

  1. The global minima of the square loss with the smallest $\rho$ are the global minimum norm solutions and have the best margin and the best bound on the expected error;

  2. Conditions that favour convergence to such minimum norm solutions are batch normalization with weight decay ($\lambda>0$) and small initialization (small $\rho$);

  3. Initialization with small $\rho(0)$ can be sufficient to induce a bias in the dynamics of GD leading to large margin minimizers of the square loss;

  4. The condition $\frac{\partial f(x_{j})}{\partial V_{k}}=V_{k}f(x_{j})$, which holds at the critical points of the SGD dynamics that are global minima, is key in predicting several properties of Neural Collapse [12];

  5. The same condition represents a powerful constraint on the set of weight matrices at convergence.

4.1 Discussion

The role of the Lagrange multiplier term $\nu\sum_{k}||V_{k}||^{2}$ in Equation 1 is different from that of a standard regularization term, because $\nu$, determined by the constraint $||V_{k}||=1$, can be positive or negative, depending on the sign of the error: $\nu=-\sum_{n}(\rho^{2}f_{n}^{2}-\rho y_{n}f_{n})$. Thus the $\nu$ term acts as a regularizer when the norm of $V_{k}$ is larger than 1 but has the opposite effect for $||V_{k}||<1$, thereby constraining each $V_{k}$ to the unit sphere. For the exponential loss the situation is different, and $\nu$ in $\nu\sum_{k}||V_{k}||^{2}$ acts as a positive regularization parameter, albeit a vanishing one (for $t\to\infty$).

Are there any implications of the theory sketched here for mechanisms of learning in cortex? Somewhat intriguingly, some form of normalization, often described as a balance of excitation and inhibition, has long been thought to be a key function of intracortical circuits in cortical areas [16]. One of the first deep models of visual cortex, HMAX, explored the biological plausibility of specific normalization circuits with spiking and non-spiking neurons. It is also interesting to note that the Oja rule, which describes synaptic plasticity in terms of changes to the synaptic weight, is the Hebb rule plus a normalization term that corresponds to a Lagrange multiplier.

The main problems left open by this paper are:

  • The analysis is so far restricted to gradient flow. It should be extended to gradient descent along the lines of [8].

  • It is remarkable that for the case of no BN and no WD, the dynamical system still yields good results, provided initialization is small. The case of BN+WD is the only one which seems rather independent of initial conditions in our experiments.

  • In this context, an extension of the analysis to SGD may also be critical for providing a satisfactory analysis of convergence.

Acknowledgments We are grateful to Shai Shalev-Shwartz, Andrzej Banburski, Arturo Deza, Akshay Rangamani, Santosh Vempala, David Donoho, Vardan Papyan, X.Y. Han, Silvia Villa and especially to Eran Malach for very useful comments. This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216, and in part by C-BRIC, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA.

References

  • [1] Mor Shpigel Nacson, Suriya Gunasekar, Jason D. Lee, Nathan Srebro, and Daniel Soudry. Lexicographic and Depth-Sensitive Margins in Homogeneous and Non-Homogeneous Deep Models. arXiv e-prints, page arXiv:1905.07325, May 2019.
  • [2] Kaifeng Lyu and Jian Li. Gradient descent maximizes the margin of homogeneous neural networks. CoRR, abs/1906.05890, 2019.
  • [3] A. Banburski, Q. Liao, B. Miranda, T. Poggio, L. Rosasco, B. Liang, and J. Hidary. Theory of deep learning III: Dynamics and generalization in deep networks. CBMM Memo No. 090, 2019.
  • [4] Like Hui and Mikhail Belkin. Evaluation of neural architectures trained with square loss vs cross-entropy in classification tasks, 2020.
  • [5] T. Poggio and Q. Liao. Generalization in deep network classifiers trained with the square loss. CBMM Memo No. 112, 2019.
  • [6] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
  • [7] Tim Salimans and Diederik P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. Advances in Neural Information Processing Systems, 2016.
  • [8] Sanjeev Arora, Zhiyuan Li, and Kaifeng Lyu. Theoretical analysis of auto rate-tuning by batch normalization. CoRR, abs/1812.03981, 2018.
  • [9] T. Poggio and Y. Cooper. Loss landscape: Sgd has a better view. CBMM Memo 107, 2020.
  • [10] T. Poggio and Y. Cooper. Loss landscape: Sgd can have a better view than gd. CBMM memo 107, 2020.
  • [11] Tomaso Poggio. Stable foundations for learning. Center for Brains, Minds and Machines (CBMM) Memo No. 103, 2020.
  • [12] Vardan Papyan, X. Y. Han, and David L. Donoho. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40):24652–24663, 2020.
  • [13] Gal Vardi and Ohad Shamir. Implicit regularization in ReLU networks with the square loss. 2020.
  • [14] Tomaso Poggio, Andrzej Banburski, and Qianli Liao. Theoretical issues in deep networks. PNAS, 2020.
  • [15] Daniel Gissin, Shai Shalev-Shwartz, and Amit Daniely. The Implicit Bias of Depth: How Incremental Learning Drives Generalization. arXiv e-prints, page arXiv:1909.12051, September 2019.
  • [16] RJ Douglas and KA Martin. Neuronal circuits of the neocortex. Annu Rev Neuroscience, 27:419–51, 2004.
Figure 9: ConvNet, no Batch Normalization, no Weight Decay. Binary classification on two classes from CIFAR-10, trained with MSE loss. The model is a very simple network with 4 fully connected layers. The ReLU nonlinearity is used. The weight matrices of all layers are initialized with a zero-mean normal distribution, scaled by a constant such that the Frobenius norm of each matrix is either 5, 15 or 30. We run SGD with batch size 128, constant learning rate 0.1 and momentum 0.9 for 1000 epochs. No data augmentation. Every input to the network is scaled such that it has Frobenius norm 1.
Figure 10: ConvNet, no Batch Normalization, no Weight Decay. Dynamics of $\rho$ from experiments in Figure 9. First row: small initialization (5). Second row: large initialization (15). Third row: extra large initialization (30). A dashed rectangle denotes the previous subplot’s domain and range in the new subplot.
Figure 11: ConvNet, no Batch Normalization, no Weight Decay. Margin of all training samples.