A Convergence Theory Towards Practical
Over-parameterized Deep Neural Networks

Asaf Noy, Yi Xu, Yonathan Aflalo, Lihi Zelnik-Manor, Rong Jin
Machine Intelligence Technology, Alibaba Group
{asaf.noy, yixu, jonathan.aflalo, lihi.zelnik, jinrong.jr}@alibaba-inc.com
Equal contribution.
Abstract

Deep neural networks’ remarkable ability to correctly fit training data when optimized by gradient-based algorithms is yet to be fully understood. Recent theoretical results explain the convergence for ReLU networks that are wider than those used in practice by orders of magnitude. In this work, we take a step towards closing the gap between theory and practice by significantly improving the known theoretical bounds on both the network width and the convergence time. We show that convergence to a global minimum is guaranteed for networks whose widths are quadratic in the sample size and linear in their depth, within a number of iterations that is logarithmic in both. Our analysis and convergence bounds are derived via the construction of a surrogate network with fixed activation patterns that can be transformed at any time to an equivalent ReLU network of a reasonable size. This construction can be viewed as a novel technique to accelerate training, while its tight finite-width equivalence to the Neural Tangent Kernel (NTK) suggests it can be utilized to study generalization as well.

1 Introduction

Deep neural networks have achieved remarkable success in machine learning applications of different fields such as computer vision [Voulodimos et al., 2018], speech recognition [Hinton et al., 2012], and natural language processing [Devlin et al., 2018]. Much of this success is yet to be fully explained. One of the existing gaps is the network’s trainability, its ability to perfectly fit training data when initialized randomly and trained by first-order methods. Modern deep networks are commonly equipped with rectified linear unit (ReLU) activations [Xu et al., 2015], forming highly non-convex and non-smooth optimization problems. Such optimization problems are known to be generally computationally infeasible [Livni et al., 2014], and NP-hard in some cases [Blum and Rivest, 1989]. Nevertheless, in practice, trainability is often achieved in various tasks. Such empirical findings were summarized by  Zhang et al. [2016]: “Deep neural networks easily fit (random) labels”.

Previous theoretical works focused on networks’ expressivity [Bengio and Delalleau, 2011, Cohen et al., 2016], the existence of a solution that perfectly fits the training data, which is necessary for trainability. As it has been shown that even shallow nonlinear networks are universal approximators [Hornik et al., 1989], a better understanding of the power of depth was desired [Eldan and Shamir, 2016, Liang and Srikant, 2016, Telgarsky, 2015]. One conclusion is that increased depth allows exponentially more efficient representations, thus per parameter, deep networks can approximate a richer class of functions than shallow ones [Cohen et al., 2016, Liang and Srikant, 2016]. However, a recent work by Yun et al. [2019] provided tighter expressivity bounds on ReLU networks, showing that datasets of $n$ training examples can be perfectly expressed by a 3-layer ReLU network with $\Omega(\sqrt{n})$ parameters. This does not explain the sizes of practical neural networks, which are typically over-parameterized and exceed the training data’s size. It appears that expressivity can provide only a limited explanation for the overparameterization of modern neural networks, which keep growing deeper and wider in search of state-of-the-art results [Zagoruyko and Komodakis, 2016, Huang et al., 2019, Ridnik et al., 2020].

Important insights regarding trainability emerged from the analysis of simplified variants of neural networks. Deep linear networks are of special interest since increased depth does not affect their expressiveness and only changes their optimization landscape. Therefore, the effects of increased width and depth on the training process can be isolated and carefully studied in this setting. Arora et al. [2018a] proved that trainability is attained at a linear rate under proper initialization as long as the network is wide enough. Arora et al. [2018b] showed that training with gradient-descent could be accelerated by increasing the network’s depth. Their empirical evaluations supported the existence of a similar outcome in deep nonlinear networks as well. Another simplified variant with further insights is overparameterized shallow nonlinear networks, typically with a single hidden layer. Du and Lee [2018] proved that for a quadratic activation function all local minima are global minima. Safran and Shamir [2018] showed that ReLU networks suffer from spurious local minima, which can be drastically reduced by overparameterization. Oymak and Soltanolkotabi [2020] achieved trainability where the required proportion between the number of hidden units and the number of training examples $n$ depends on the activation: for smooth activations it is $\Omega(n^{2})$, while for ReLU it is significantly larger, $\Omega(n^{4})$. Additional works examined the unique dynamics of the ReLU activation during training and their relation to gradient-descent optimization [Li and Liang, 2018, Arora et al., 2019b].

The analysis of deep ReLU networks is more challenging, as additional problems emerge with increased depth. Exploding and vanishing gradients [Hanin, 2018] and bias shifts [Clevert et al., 2015] are shared with other activations, while the dying-ReLU problem, which causes neurons to remain permanently inactive, is unique to ReLU [Lu et al., 2019]. While in practice those problems are solved by the introduction of additional tensor transformations [Ioffe and Szegedy, 2015, He et al., 2016b], their theoretical understanding remains limited. Correspondingly, existing trainability guarantees for deep ReLU networks are considerably weaker compared with shallow networks. Du et al. [2019] required a minimal network width which scales exponentially with the depth $L$ and polynomially with the number of examples $n$. More recent works [Zou et al., 2018, Allen-Zhu et al., 2019] improved the dependencies to high-order polynomials, $\Omega(n^{26}L^{38})$ and $\Omega(n^{24}L^{12})$ respectively. The best known result, by Zou and Gu [2019], required a network width of $\Omega(n^{8}L^{12})$, which is still prohibitively large compared to practice. For instance, training their network with 1001 layers [He et al., 2016b] over the common ImageNet dataset [Krizhevsky et al., 2017] requires a network of $\Omega(10^{172})$ parameters, while the largest network reported to date contains only $O(10^{11})$ [Brown et al., 2020].

In this work, we take a step towards closing the gap between theory and practice. We develop a novel technique to analyze the trainability of overparameterized deep neural networks and provide convergence guarantees under size requirements that are significantly milder than previously known. The network width required by our analysis for trainability scales only quadratically with $n$, enabling for the first time an empirical evaluation of the theoretical bounds, and bridging previous empirical results related to overparameterization. In addition, the width required by our theory is linear in $L$, paving the way for a better understanding of the behavior of deep neural networks of practical depths. Finally, the number of training iterations needed to reach a global minimum by our theory is logarithmic in $nL$, significantly smaller than in previous theoretical results and similar to leading practical applications (e.g., BiT [Kolesnikov et al., 2019]). A full comparison with previous methods can be found in Table 1.

A key novelty of our analysis is the construction of a surrogate network, named Gated ReLU or GReLU, illustrated in Figure 1. The activation patterns of GReLU, i.e. which entries output zero [Hanin and Rolnick, 2019b], are determined at initialization and kept fixed during training. They are set by a random transformation of the input, independent of the network initialization. Therefore, GReLU networks are immune to two main problems of training ReLU networks, dying ReLUs and bias shifts, while still enjoying the advantages of the ReLU activation. We first prove tighter bounds on the width and convergence time for training GReLU networks, then show that they can be transformed to equivalent ReLU networks with $100\%$ train accuracy. We further investigate the proposed technique and derive a finite-width Neural Tangent Kernel equivalence with an improved condition on the network width. Finally, we empirically validate the effectiveness of our technique with datasets of different domains and practical network sizes.

Our main contributions can be summarized as follows:

  • We show that a randomly initialized deep ReLU network of depth $L$ and width $m=\tilde{\Omega}(n^{2}L)$, trained over $n$ samples of dimension $d$, reaches an $\varepsilon$-error global minimum for $\ell_{2}$ regression in a logarithmic number of iterations, $T=c\log\left(\frac{n^{3}L}{d\epsilon}\right)$. For comparison, the previous state-of-the-art result by Zou and Gu [2019] required $m=\tilde{O}(n^{8}L^{12})$ and $T=O\left(n^{2}L^{2}\log\left(\frac{1}{\varepsilon}\right)\right)$.

  • To achieve that, we propose a novel technique of training a surrogate network with fixed activation patterns that can be transformed to its equivalent ReLU network at any point. It is immune to bias-shifts and dying-ReLU problems and converges faster without affecting generalization, as we empirically demonstrate.

  • We derive an NTK equivalence for GReLU networks with width linear in $n$, connecting the regimes of practical networks and kernel methods for the first time (improving the $\tilde{\Omega}(n^{3}2^{O(L)})$ requirement of Huang and Yau [2020]). This connection allows utilizing NTK-driven theoretic discoveries for practical neural networks, most notably generalization theory.

The remainder of this paper is organized as follows. In Section 2 we state the problem setup and introduce our technique. Our main theoretical result is covered in Section 3, followed by additional properties of the proposed technique in Section 4. In Section 5 we provide a proof sketch of the main theory, and in Section 6 we empirically demonstrate its properties and effectiveness. Section 7 contains additional related work. Finally, Section 8 concludes our work.

Method | $\tilde{\Omega}(\#\text{Neurons})$ | $O_{\varepsilon}(\#\text{Iters})$ | $O(\text{Prob})$ | $\tilde{\Theta}(\text{GD Step})$ | Remarks
Du [2019] | $\frac{n^{6}}{\lambda_{0}^{4}p^{3}}$ | $\frac{1}{\eta\lambda_{0}}\log\frac{1}{\varepsilon}$ | $p$ | $\frac{\lambda_{0}}{n^{2}}$ | $\lambda_{0}^{-1}=\mathrm{poly}(e^{L},n)$, binary-class, smooth activation
Zou [2018] | $n^{26}L^{38}$ | $n^{8}L^{9}$ | – | $\frac{1}{n^{29}L^{47}}$ | binary classification
Allen-Zhu [2019] | $n^{24}L^{12}$ | $n^{6}L^{2}\log(\frac{1}{\varepsilon})$ | $e^{-\log^{2}m}$ | $\frac{1}{n^{28}\log^{5}m\,L^{14}}$ | $\propto\mathrm{Poly}(\max_{i}|y_{i}|)$
Zou [2019] | $n^{8}L^{12}$ | $n^{2}L^{2}\log(\frac{1}{\varepsilon})$ | $n^{-1}$ | $\frac{1}{n^{8}L^{14}}$ | –
Ours | $n^{2}L$ | $\log\left(\frac{n^{3}L}{d_{x}\epsilon}\right)$ | $e^{-\sqrt{m}}$ | $\frac{d_{x}}{n^{4}L^{3}d_{y}}$ | $L=\Omega(\log n)$
Table 1: Comparison of leading works on overparameterized deep nonlinear neural networks trained with gradient-descent.

2 Preliminaries

In this section we introduce the main problem and the novel setup which is used to solve it.
Notations: we denote the Euclidean norm of a vector $v$ by $\lVert v\rVert$ or $\lVert v\rVert_{2}$, and the Kronecker product and Frobenius inner-product of matrices $M,N$ by $M\otimes N$ and $\langle M,N\rangle$ respectively. We denote by $\lambda_{\min}(M)$, $\lambda_{\max}(M)$ the minimal and maximal eigenvalues of a matrix. We use the shorthand $[k]$ to denote the set $\{1,\dots,k\}$ and $A_{[k]}=\{A_{1},\dots,A_{k}\}$. We denote the positive part of a vector by $[v]^{+}=\left(\max(0,v_{1}),\dots,\max(0,v_{d})\right)$ and its indicator vector by $[v]_{+}=(\mathbb{1}_{v_{1}>0},\dots,\mathbb{1}_{v_{d}>0})$. Matrix column-wise vectorization is denoted by $\mathrm{vec}(\cdot)$. For two sequences $\{a_{n}\},\{b_{n}\}$, we denote $a_{n}=O(b_{n})$ if there exists a constant $C_{O}$ such that $a_{n}\leq C_{O}b_{n}$, and $a_{n}=\Omega(b_{n})$ if a constant $C_{\Omega}$ satisfies $a_{n}\geq C_{\Omega}b_{n}$. In addition, $\tilde{O}(\cdot),\tilde{\Omega}(\cdot)$ hide logarithmic factors. Finally, we denote the normal and chi-square distributions by $\mathcal{N}(\mu,\Sigma)$ and $\chi^{2}_{d}$ respectively.
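As a small numeric illustration of the two ReLU-related notations above (a NumPy sketch with arbitrary values; not part of the paper's notation):

```python
import numpy as np

v = np.array([-1.5, 0.0, 2.0, -0.3])

# [v]^+ : the positive part, max(0, v_i) entrywise.
positive_part = np.maximum(v, 0.0)       # [0.0, 0.0, 2.0, 0.0]

# [v]_+ : the indicator vector of strictly positive entries, 1_{v_i > 0}.
indicator = (v > 0.0).astype(v.dtype)    # [0.0, 0.0, 1.0, 0.0]
```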

2.1 Problem Setup

Let $\mathcal{T}=\left\{(x_{i},\,y_{i}=\Phi_{i}x_{i})\right\}_{i\in[n]}$ be the training examples, with $x_{i}\in\mathbb{R}^{d_{x}}$, $y_{i}\in\mathbb{R}^{d_{y}}$, $\Phi_{i}\in\mathbb{R}^{d_{y}\times d_{x}}$. Note that a different linear mapping $\Phi_{i}$ is used for each training example, as our goal is to fit a nonlinear objective $y_{i}$. We assume for convenience and without loss of generality that the input data is normalized, $\lVert x_{i}\rVert=1$. The data is used for training a fully-connected neural network with $L$ hidden layers and $m$ neurons on each layer.

We propose a novel network structure named GReLU, as illustrated in Figure 1, which improves the optimization and leads to tighter trainability bounds. The key idea of this construction is to decouple the activation patterns from the weight tuning. The role of the activation patterns is to make sure different input samples propagate through different paths, while the role of the weights is to fit the output with high accuracy. Therefore, GReLU is composed of two parts: one consists of the weights that are trained, while the other defines the data-dependent activations of the ReLU gates; the latter is independently initialized and then remains fixed. In Section 4.1, we show that a GReLU network has an equivalent ReLU network, and the two can be switched at any point. All layers are initialized with independent Gaussian distributions, for $k\in[L]$:

$$\left[W_{k}\right]_{i,j}\sim\mathcal{N}(0,2/m),\quad\left[\Psi_{k}\right]_{i,j}\sim\mathcal{N}(0,2/m),\quad\left[C\right]_{i,j}\sim\mathcal{N}(0,2/d_{x}),\quad\left[B\right]_{i,j}\sim\mathcal{N}(0,2/d_{y})$$

Then, only the layers $W_{[L]}$ (blue blocks) are trained, while $\left(B,C,\Psi_{[L]}\right)$ (red blocks) remain fixed during training. The activations $\{D^{i}_{k},\,i\in[n],\,k\in[L]\}$ are computed based on the random weights $\Psi_{[L]},B,C$ as follows,

$$z^{i}_{0}=[Cx_{i}]^{+},\quad z^{i}_{k}=[\Psi_{k}z_{k-1}^{i}]^{+},\quad D_{k}^{i}=\mathrm{diag}([z_{k}^{i}]_{+}),\qquad k=1,\dots,L.$$

This implies that the activations do not change during training. Note that while previous works also used fixed initial and final layers, e.g. [Allen-Zhu et al., 2019, Zou and Gu, 2019], the introduction of $\Psi_{[L]}$ as fixed activation layers is novel.
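For concreteness, a minimal NumPy sketch of how the fixed activation patterns could be computed at initialization is given below. The sizes, the helper name `make_activation_patterns`, and the inclusion of a pattern $D_{0}^{i}$ for $z_{0}^{i}$ (which (1) below uses, while the definition above states $k=1,\dots,L$) are our illustrative assumptions.

```python
import numpy as np

def make_activation_patterns(X, L, m, rng):
    """Draw the frozen weights C, Psi_[L] and compute the fixed GReLU
    activation patterns; X holds one training example per row (n x d_x)."""
    n, d_x = X.shape
    C = rng.normal(0.0, np.sqrt(2.0 / d_x), size=(m, d_x))
    Psi = [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m)) for _ in range(L)]

    patterns = []                                   # patterns[k][:, i] = diagonal of D_k^i
    Z = np.maximum(C @ X.T, 0.0)                    # z_0^i = [C x_i]^+ (columns = examples)
    patterns.append((Z > 0.0).astype(X.dtype))      # D_0^i = diag([z_0^i]_+)
    for k in range(L):
        Z = np.maximum(Psi[k] @ Z, 0.0)             # z_k^i = [Psi_k z_{k-1}^i]^+
        patterns.append((Z > 0.0).astype(X.dtype))  # D_k^i = diag([z_k^i]_+)
    return C, Psi, patterns

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))
X /= np.linalg.norm(X, axis=1, keepdims=True)       # normalized inputs, ||x_i|| = 1
C, Psi, patterns = make_activation_patterns(X, L=3, m=32, rng=rng)
```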

We denote the $k$th layer at iteration $t$ by $W_{t,k}\in\mathbb{R}^{m\times m}$, and the concatenation of all layers by $\bar{W}_{t}=(C,W_{t,1},\dots,W_{t,L},B)$. Since the activations are fixed in time and change per example, the full network applied to example $i$ is the following matrix,

$$W_{t}^{i}:=BD_{L}^{i}W_{t,L}\cdots D_{1}^{i}W_{t,1}D_{0}^{i}C\in\mathbb{R}^{d_{y}\times d_{x}} \qquad (1)$$

Following previous works [Allen-Zhu et al., 2019, Du et al., 2019, Oymak and Soltanolkotabi, 2020], we focus here for the sake of brevity on the task of regression with the square loss,

$$\ell(\bar{W}_{t})=\frac{1}{2}\sum_{i=1}^{n}\left\lVert(W_{t}^{i}-\Phi_{i})x_{i}\right\rVert^{2}. \qquad (2)$$

The loss is minimized by gradient-descent with a constant learning rate $\eta$. Our proofs can be extended to other tasks and loss functions, such as classification with cross-entropy, similarly to [Allen-Zhu et al., 2019, Zou et al., 2018]; we omit these extensions for readability.
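Putting (1) and (2) together, the following PyTorch sketch assembles the per-example map, evaluates the square loss, and takes one gradient step on $W_{[L]}$ only, with the gating layers held fixed. The tiny dimensions and the random construction of the targets $\Phi_{i}$ are illustrative choices, not part of the theory.

```python
import torch

torch.manual_seed(0)
n, d_x, d_y, m, L, eta = 8, 16, 4, 32, 3, 1e-3

X = torch.randn(n, d_x)
X = X / X.norm(dim=1, keepdim=True)                    # ||x_i|| = 1
Phi = torch.randn(n, d_y, d_x)                         # targets y_i = Phi_i x_i
Y = torch.einsum('ioj,ij->io', Phi, X)

# Frozen parts (B, C, Psi_[L]) and trained layers W_[L], initialized as in Sec. 2.1.
C = torch.randn(m, d_x) * (2.0 / d_x) ** 0.5
B = torch.randn(d_y, m) * (2.0 / d_y) ** 0.5
Psi = [torch.randn(m, m) * (2.0 / m) ** 0.5 for _ in range(L)]
W = [(torch.randn(m, m) * (2.0 / m) ** 0.5).requires_grad_() for _ in range(L)]

# Fixed activation patterns, computed once from C and Psi_[L].
with torch.no_grad():
    Z = torch.relu(C @ X.T)                            # z_0^i
    D = [(Z > 0).float()]
    for k in range(L):
        Z = torch.relu(Psi[k] @ Z)                     # z_k^i
        D.append((Z > 0).float())

def apply_Wti(i):
    """Apply W_t^i = B D_L^i W_L ... D_1^i W_1 D_0^i C to x_i, as in (1)."""
    h = D[0][:, i] * (C @ X[i])
    for k in range(L):
        h = D[k + 1][:, i] * (W[k] @ h)
    return B @ h

loss = 0.5 * sum((apply_Wti(i) - Y[i]).pow(2).sum() for i in range(n))   # eq. (2)
loss.backward()
with torch.no_grad():
    for Wk in W:                                       # one gradient-descent step on W_[L]
        Wk -= eta * Wk.grad
        Wk.grad = None
```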

Finally, our analysis requires two further definitions. The intermediate transform for $1\leq k^{\prime}\leq k\leq L$ is defined as:

$$Z^{t,i}_{k,k^{\prime}}:=D_{k}^{i}W_{t,k}\cdots W_{t,k^{\prime}+1}D_{k^{\prime}}^{i} \qquad (3)$$

and the maximal variation from initialization of the network’s trained layers is:

$$\tau:=\max_{1\leq t\leq T}\max_{k\in[L]}\left\|W_{t,k}-W_{1,k}\right\|_{2}. \qquad (4)$$
Figure 1: An illustration of the proposed network. Blue layers are trained while red layers set the activations and remain unchanged during training.

3 Main Theory

In this section, we present our main theoretical results. We start with making the following assumptions on the training data.

Assumption 1 (non-degenerate input).

Every two distinct examples $x_{i},x_{j}$ satisfy $\|x_{i}^{\top}x_{j}\|\leq\delta$.

Assumption 2 (common regression labels).

Labels are bounded: $\max_{i}|y_{i}|\leq\frac{m}{d_{x}}$.

Both assumptions hold for typical datasets and are stated in order to simplify the derivation. Assumption 2 is more relaxed than the corresponding assumptions in previous works, which bound the labels by a constant; the relaxation is possible due to the overparameterization ($m\gg d_{x}$), which enables the network to handle large outputs better. We are ready to state the main result, guaranteeing a linear-rate convergence to a global minimum.

Theorem 1.

Suppose a deep neural network of depth $L=\Omega(\log n)$ is trained by gradient-descent with learning rate $\eta=\frac{d_{x}}{n^{4}L^{3}d_{y}}$ under the scheme in Section 2.1, and with a width $m$ that satisfies,

$$m=\tilde{\Omega}\left(n^{2}Ld_{y}\right).$$

Then, with probability at least $1-\exp(-\Omega(\sqrt{m}))$ over the random initialization, the network reaches an $\epsilon$-error within a number of iterations,

$$T=O\left(\log\left(\frac{n^{3}L}{d_{x}\epsilon}\right)\right).$$

The proof appears in Section 10.4. Theorem 1 improves previously known results [Zou et al., 2018, Allen-Zhu et al., 2019, Du et al., 2018, Zou and Gu, 2019] by orders of magnitude, as can be seen in Table 1. Specifically, it takes a step towards closing the gap between existing theory and practice, with required widths that can be used in practice for the first time. The dependence of $m$ on the depth $L$ is linear, offering a significant improvement over the previous tightest bound of $O(L^{12})$ [Zou and Gu, 2019]. The bound on the number of iterations is logarithmic in the number of training examples, in line with current common practice. For example, BiT [Kolesnikov et al., 2019, Sec. 3.3] obtained state-of-the-art results on CIFAR-10/100 and ImageNet with a similar adjustment of the number of training epochs to the dataset size.
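To make the scale of the improvement concrete, consider the ImageNet example from the introduction ($n\approx 1.3\cdot 10^{6}$ training images and $L=1001$ layers). Ignoring logarithmic factors, constants, and the output dimension $d_{y}$, our bound gives roughly

$$m=\tilde{\Omega}(n^{2}L)\approx(1.3\cdot 10^{6})^{2}\cdot 10^{3}\approx 10^{15},\qquad \#\text{parameters}\approx m^{2}L\approx 10^{33},$$

compared with the $\Omega(10^{172})$ parameters implied by the previous $\tilde{\Omega}(n^{8}L^{12})$ requirement; still far from practical sizes, but smaller by roughly 140 orders of magnitude.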
An interesting question is whether the bounds we prove on $m$ and $T$ are tight. The empirical evaluation presented in Figure 2(a) suggests they might be. We show that on a synthetic problem, fully-connected ReLU and GReLU networks with widths $m=(n^{2}L)^{p}$, $p<1$, which are below our bound, do not converge to zero loss, while those with $p=1$ do. Next, we make two important remarks that further compare the above theoretical results with previous work.

Remark 1.

The majority of works on the trainability of overparameterized networks focus on simplified shallow 1-hidden-layer networks (e.g., [Zou and Gu, 2019, Du et al., 2018, Wu et al., 2019, Oymak and Soltanolkotabi, 2020, Song and Yang, 2019, Li and Liang, 2018]). Our focus is on the more challenging practical deep networks, and Theorem 1 requires a minimal depth $L=\Omega(\log n)$. Our analysis, however, can be extended to 1-hidden-layer networks with the same dependencies on the number of training samples, i.e., $m=\tilde{\Omega}(n^{2})$, $T=O(\log n)$. To the best of our knowledge, these dependencies improve all previous results on shallow networks.

Remark 2.

For most practical regression tasks $d_{y}$ is small, and it is therefore often assumed to be $O(1)$ and ignored in previous works. We make no such assumption in this work, and in particular not in Theorem 1, but we omit $d_{y}$ from Table 1 for a clear comparison.

The proposed GReLU construction is used here to prove bounds on optimization. However, we believe it could also serve as a technical tool for better understanding the generalization of deep neural networks. Previous works offered convergence guarantees for networks of practically infinite width. Empirical results show a poor generalization ability for such networks [Arora et al., 2019a, Li et al., 2019] when compared to mildly overparameterized ones [Srivastava et al., 2015, Noy et al., 2020, Nayman et al., 2019] over various datasets. Tighter bounds on the network size are possibly key to capturing the training dynamics of deep neural networks, and to further improving them from a theoretical perspective. We leave this to future work.

4 Properties

In this section, we discuss the properties of the suggested technique. It complements the former section by providing intuition on why we are able to achieve the rates in Theorem 1, which are much closer to the sizes used in practice.

First, it is important to note that modern state-of-the-art models are not based on plain fully-connected networks but rather on improved architectures. A classic example is the improved optimization achieved by residual connections [He et al., 2016a]. Neural networks equipped with residual connections typically retain an improved empirical performance for a relatively small degradation in runtime [Monti et al., 2018]. However, previous theoretical works on the trainability of residual networks do not share those improvements [Allen-Zhu et al., 2019, Du et al., 2018]. A few works have shown that a residual network can be transformed to an equivalent network with no residual connections (e.g., [Monti et al., 2018]).

Similarly to residual networks, GReLU networks offer appealing properties regarding the optimization of deep networks (Section 4.3), and can be mapped to their equivalent ReLU networks at any time (Section 4.1). Unlike residual networks, GReLU networks are based on theoretical guarantees (Sections 3, 4.2), and actually accelerate the network runtime as discussed in Section 4.4.

4.1 ReLU Network Equivalence

ReLU networks are extensively studied theoretically due to their dominant empirical performance in both optimization and generalization. While this work studies optimization, the following equivalence between GReLU and ReLU networks is important for inferring the generalization ability of the proposed technique.

Theorem 2.

Let $\bar{W}_{t}=(W_{t,1},\dots,W_{t,L};C,B,\Psi_{[L]})$ be an overparameterized GReLU network of depth $L$ and width $m$, trained for $t$ steps. Then a unique equivalent ReLU network of the same sizes, $\tilde{W}_{t}=(\tilde{W}_{t,1},\dots,\tilde{W}_{t,L};C,B)$, can be obtained, with identical intermediate and output values over the training set.

The proof appears in Section 10.7 and is done by construction of the equivalent ReLU network. The equivalent networks have an identical training footprint, in the sense that every intermediate feature map of any training example is identical for both, leading to identical predictions and training loss. This offers a dual view of GReLU: a technique to improve and accelerate the training of ReLU networks. A good analogy is Batch Normalization, which is used to smooth the optimization landscape [Santurkar et al., 2018] and can be absorbed at any time into the preceding layers [Krishnamoorthi, 2018, (20)-(21)]. Notice that a ReLU network can also be trained as a GReLU network at any time by fixing its activation pattern and using the scheme in Section 2.1 with $\Psi_{k}\leftarrow W_{k}$, $k\in[L]$.

We note that while our proof uses the Moore–Penrose pseudo-inverse for clarity, modern least-squares solvers [Lacotte and Pilanci, 2019] provide similar results more efficiently, so the overall complexity of our method stays the same. In other words, training according to Theorem 1 and then switching to the equivalent ReLU network results in a trained ReLU network with an $\varepsilon$-error in $T=O\left(\log\left(\frac{n^{3}L}{d_{x}\epsilon}\right)\cdot m^{2}L\right)$ operations, where $m^{2}L$ is the number of trained network parameters.
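The construction itself is deferred to Section 10.7. Purely as an illustration of the kind of layer-wise least-squares fit alluded to above, the sketch below matches a single ReLU layer to stored GReLU feature maps of the training set via the pseudo-inverse (`np.linalg.pinv` returns the minimum-norm least-squares solution, in line with the minimal $\ell_{2}$-norm equivalent network mentioned in Section 6.2). The helper name and the layer-by-layer matching criterion are our assumptions and need not coincide with the actual construction.

```python
import numpy as np

def fit_equivalent_layer(H_prev, H_target):
    """Illustrative layer-wise fit: find W_tilde such that relu(W_tilde @ H_prev)
    approximates the (non-negative) GReLU feature map H_target on the training
    set.  Columns of H_prev / H_target are training examples.  The pseudo-inverse
    gives the least-squares solution of minimal Frobenius norm; the exact
    construction of Section 10.7 matches the feature maps exactly."""
    return H_target @ np.linalg.pinv(H_prev)

# Toy usage: H_prev are previous-layer outputs, H_target the gated outputs.
rng = np.random.default_rng(0)
H_prev = np.maximum(rng.normal(size=(32, 100)), 0.0)
H_target = np.maximum(rng.normal(size=(32, 100)), 0.0)
W_tilde = fit_equivalent_layer(H_prev, H_target)
approx = np.maximum(W_tilde @ H_prev, 0.0)      # reconstructed ReLU layer output
```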

4.2 Neural Tangent Kernel Equivalence

The introduction of the Neural Tangent Kernel (NTK) [Jacot et al., 2018] offered a new perspective on deep learning, with insights on different aspects of neural network training. Follow-up works analyzed the behavior of neural networks in the infinite-width regime, where neural networks roughly become equivalent to linear models with the NTK. An important result by Allen-Zhu et al. [2019] extended this equivalence to finite-width networks for $m=\Omega(\mathrm{poly}(n,L))$. Huang and Yau [2020] reduced the sample-size dependency to $m=\tilde{\Omega}(n^{3}2^{O(L)})$, which is still prohibitively large for practical use.

In this section, we derive a corresponding finite-width NTK equivalence, with a width that is linear in the sample-size. Closing this gap between theory and practice offers advantages for both:

  1. Practice: networks can be actually trained in the NTK regime, optimizing weights and architectures to retain properties such as improved generalization [Geiger et al., 2020, Xiao et al., 2019] and overall performance [Lee et al., 2020, Sec 3.1].

  2. Theory: NTK-driven theoretic insights can be utilized effectively for neural networks of practical sizes. Those insights relate to loss landscapes, implicit regularization, training dynamics and generalization theory [Belkin et al., 2018b, Cao and Gu, 2019, Belkin et al., 2018a, Li et al., 2019, Huang et al., 2020, Liang et al., 2020].

The analysis of Theorem 1 bounds the maximal variation of the network’s layers with probability at least $1-\exp(-\Omega(\sqrt{m}))$ as follows,

$$\tau=\max_{1\leq t\leq T}\max_{k\in[L]}\left\lVert W_{t,k}-W_{1,k}\right\rVert_{2}\leq\tilde{O}\left(\frac{d_{y}n^{1/2}m^{-1/2}}{L}\right):=\frac{\xi}{L} \qquad (5)$$

for some small number $\xi(m;n,d_{y})$. Let $\bar{W}=(\bar{W}_{1},\dots,\bar{W}_{L})$ be any solution which satisfies,

$$\left\lVert\bar{W}_{k}-W_{1,k}\right\rVert_{2}\leq\xi L^{-1},\quad\forall k\in[L]$$

Denote the difference $W^{\prime}=\bar{W}-W_{1}$, and notice that $\left\lVert W^{\prime}\right\rVert_{2}\leq\xi$.
Define the NTK and the NTK objective [Jacot et al., 2018, Allen-Zhu et al., 2019] of the initialized network for the $p$-th output, $p\in[d_{y}]$, respectively,

$$K_{p}^{\mathrm{NTK}}(x_{i},x_{j}):=\langle\nabla y_{p}(x_{i},W_{1}),\nabla y_{p}(x_{j},W_{1})\rangle,\qquad y_{p}^{\mathrm{NTK}}(x_{i},W^{\prime})=\langle\nabla y_{p}(x_{i},W_{1}),W^{\prime}\rangle \qquad (6)$$

Jacot et al. [2018] proved that for infinite $m$, the dynamic NTK and the NTK are in fact equivalent as $\xi\rightarrow 0$. Allen-Zhu et al. [2019] showed a high-order polynomial bound on this equivalence. We further improve their results by stating a tighter bound on $m$ for our setting.

Theorem 3.

For every $x_{i}\in\mathbb{R}^{d_{x}}$, $1\leq p\leq d_{y}$, $W^{\prime}$ with $\left\lVert W^{\prime}\right\rVert_{2}\leq\xi$, and $m=\Omega(nd_{x}d_{y}^{2}L^{3})$, with probability at least $1-\exp(-\Omega(m))$ over the initialization of $\left\{B,C,\Psi,W_{1}\right\}$ we have,

$$\left\lVert\nabla y_{p}(x_{i},W_{1}+W^{\prime})-\nabla y_{p}^{\mathrm{NTK}}(x_{i},W^{\prime})\right\rVert_{F}\leq\mathcal{R}\left\lVert\nabla y_{p}^{\mathrm{NTK}}(x_{i},W^{\prime})\right\rVert_{F} \qquad (7)$$

with the ratio

$$\mathcal{R}=\tilde{O}\left(\frac{n^{1/2}d_{x}(Ld_{y})^{2}}{m^{5/2}}\right).$$
Corollary 4.

For every $x_{i},x_{j}\in\mathbb{R}^{d_{x}}$, $1\leq p\leq d_{y}$, $W^{\prime}$ with $\left\lVert W^{\prime}\right\rVert_{2}\leq\xi$, and $m=\Omega(nd_{x}d_{y}^{2}L^{3})$, with probability at least $1-\exp(-\Omega(m))$,

$$|\langle\nabla y_{p}(x_{i},W_{1}+W^{\prime}),\nabla y_{p}(x_{j},W_{1}+W^{\prime})\rangle-K_{p}^{\mathrm{NTK}}(x_{i},x_{j})|\leq 3\mathcal{R}\sqrt{K_{p}^{\mathrm{NTK}}(x_{i},x_{i})K_{p}^{\mathrm{NTK}}(x_{j},x_{j})} \qquad (8)$$

Proofs can be found in Sections 10.5-10.6. Theorem 3 guarantees that the difference between gradients is negligible compared to their norms; thus the NTK is a good approximation for the dynamic one. For simplicity, we state the result per single output dimension $p$, while it holds for outputs of dimension $d_{y}$ as well.
For comparison, the corresponding ratio by Allen-Zhu et al. [2019] is significantly worse: $\frac{L^{3/2}}{\sqrt{\log m}}$. More specifically, equipped with the width required by Theorem 1, the ratio guaranteed by Theorem 3 is negligible even for small problems, $\mathcal{R}^{-1}=\tilde{O}\left(n^{9/2}d_{x}^{-1}(d_{y}L)^{1/2}\right)$. The more regular trajectory of GReLU gradients compared to ReLU gradients can also be seen in experiments with smaller network widths, as demonstrated in Figure 2(b).
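The kernel in (6) can also be estimated numerically for a finite network with automatic differentiation. The following PyTorch sketch computes a single entry $K_{p}^{\mathrm{NTK}}(x_{i},x_{j})$ for a generic model; the small fully-connected stand-in used here is only a placeholder for the trained part of the network.

```python
import torch

def ntk_entry(model, x_i, x_j, p=0):
    """K_p(x_i, x_j) = <grad_W y_p(x_i, W), grad_W y_p(x_j, W)>, with the
    gradients taken at the model's current weights (W_1 at initialization)."""
    params = [w for w in model.parameters() if w.requires_grad]
    def grad_vec(x):
        y = model(x.unsqueeze(0))[0, p]
        grads = torch.autograd.grad(y, params)
        return torch.cat([g.reshape(-1) for g in grads])
    return torch.dot(grad_vec(x_i), grad_vec(x_j)).item()

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, 4))
x1, x2 = torch.randn(16), torch.randn(16)
print(ntk_entry(model, x1, x2, p=1))
```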

4.3 Improved Gradient Optimization

Modern state-of-the-art models are modified versions of fully-connected networks, offering improved optimization with additional layers [He et al., 2016b, Ioffe and Szegedy, 2015, Bachlechner et al., 2020], activations [Clevert et al., 2015, Xu et al., 2015] and training schemes [Zagoruyko and Komodakis, 2017, Jia et al., 2017]. Those are mostly aimed at solving problems that arise when training plain deep ReLU networks. It is important to note that ReLU networks often do not achieve small training error without such modifications.
Two of the most studied problems of ReLU networks are bias shifts (mean shifts) [Clevert et al., 2015] and dying ReLUs [Lu et al., 2019].
Bias shifts affect networks equipped with activations of non-zero mean. Specifically, ReLU is a non-negative function, leading to increasingly positively-biased layer outputs in ReLU networks, a problem that grows with depth: $\mathrm{E}_{W_{k}}\left[\max(W_{k}z_{k-1},0)\right]-z_{k-1}>0$. Reduced bias shifts were shown to speed up the learning by bringing the normal gradient closer to the unit natural gradient [Clevert et al., 2015]. Interestingly, GReLUs do not suffer from bias shifts, as negative values are passed as well, and the fixed activations of different layers are independent.
Dying ReLUs refers to the problem of ReLU neurons becoming inactive and outputting 0 for any input. This problem also grows with depth, as Lu et al. [2019] show that a deep enough ReLU network will eventually become a constant function.
Considering a GReLU network, given that the initial network is properly initialized (i.e., not ’born dead’), it is guaranteed that no neuron dies during training, due to its fixed activation pattern.

Remark 3.

GReLU networks are immune to the problems of bias-shifts and dying-ReLUs.

These properties essentially lead to better back-propagation of gradients during training and a correspondingly faster convergence to global minima under milder requirements on the networks’ depths.

4.4 Faster Train and Inference Iterations

Theorem 1 guarantees a faster convergence to global minima in terms of the number of iterations. We now explain how a straightforward implementation of a GReLU network leads to an additional, approximately $2\times$ shorter iteration time compared to its equivalent ReLU network, for both training and inference.
Train: consider a single GReLU training step as described in Figure 1. Since the activations are fixed, a binary lookup table of size $n\times m\times L$ is built once and used to calculate only the values related to active ReLU entries, which are around $50\%$ due to the proposed initialization of $\Psi_{[L]}$, leading to approximately $50\%$ of the FLOPs of a fully-connected network of the same dimensions (acceleration is achieved without additional implementation effort, as PyTorch and TensorFlow automatically skip backward-propagation calculations for such zeroed gradients).
Inference: a non-negative matrix approximation (NMA) [Tandon and Sra, 2010] is used to replace each matrix $\Psi_{k}$ with two smaller ones of respective sizes $(m\times r)$ and $(r\times m)$, $\Psi_{k}\approx W_{k}H_{k}$ for $r\ll m$. The total FLOP count is therefore proportional to $O(0.5m^{2}+2rm)\approx O(0.5m^{2})$. While this approximation is not covered by our theory, experiments with $r=\sqrt{m}/2$ yielded the desired acceleration without any accuracy degradation. This result leads to a general insight.

Remark 4.

Our technique for $2\times$ accelerated inference can be used to accelerate any neural network with ReLU activation. Given network layers $W_{[L]}$, simply transition to the GReLU architecture described in Figure 1 with $\Psi^{\prime}_{k}\leftarrow W_{k}$, $\forall k\in[L]$, then use the NMA approximation $\Psi^{\prime}_{k}\approx W_{k}H_{k}$.
Additional modifications can be applied for further acceleration with minimal inference variation, such as binarized $\Psi^{\prime}_{L}$ matrices [Hubara et al., 2016]. These exceed our scope and are left as future work.
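As an illustration of the factorization step only, the sketch below replaces a gating matrix with a rank-$r$ product. It uses a truncated SVD as a generic stand-in for the NMA of Tandon and Sra [2010] that the text actually refers to, so the FLOP accounting is comparable but the factors are not constrained to be non-negative and the approximation quality is not the one reported above.

```python
import numpy as np

def low_rank_gating(Psi_k, r):
    """Rank-r factorization Psi_k ~= W H with W (m x r) and H (r x m).
    Applying H and then W costs about 2*r*m multiply-adds per input
    instead of m^2 for the full matrix."""
    U, s, Vt = np.linalg.svd(Psi_k, full_matrices=False)
    return U[:, :r] * s[:r], Vt[:r, :]

m = 1024
rng = np.random.default_rng(0)
Psi_k = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m))
W_f, H_f = low_rank_gating(Psi_k, r=int(np.sqrt(m) / 2))
z = np.maximum(W_f @ (H_f @ rng.normal(size=m)), 0.0)   # gated activation via the factors
```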

5 Main Theory Proof Sketch

Calculating the gradient of $\ell(W_{t})$ with respect to $W_{t,k}$, denoted by $\nabla_{k}\ell(W_{t})$, is straightforward,

$$\nabla_{k}\ell(W_{t})=\sum_{i=1}^{n}\left[F^{i}_{t,k+1}\right]^{\top}(W_{t}^{i}-\Phi_{i})x_{i}x_{i}^{\top}\left[G^{i}_{t,k-1}\right]^{\top}\in\mathbb{R}^{m\times m} \qquad (9)$$

where

$$F^{i}_{t,k}=BD_{L}^{i}W_{t,L}\cdots D_{k}^{i}W_{t,k}D_{k-1}^{i}\in\mathbb{R}^{d_{y}\times m},\qquad G^{i}_{t,k}=D_{k}^{i}W_{t,k}\cdots D_{1}^{i}W_{t,1}D_{0}^{i}C\in\mathbb{R}^{m\times d_{x}} \qquad (10)$$

and we set $G_{t,0}^{i}=D_{0}^{i}C$. Those represent the network partition and will be referred to as sub-networks. Notice that $W^{i}_{t}=F^{i}_{t,k+1}G^{i}_{t,k}$. We now calculate the difference between $W_{t+1}^{i}$ and $W_{t}^{i}$.

Lemma 1.
$$W_{t+1}^{i}-W_{t}^{i}=-\eta\sum_{k=1}^{L}F_{t,k+1}^{i}\left[F_{t,k+1}^{i}\right]^{\top}\left(W_{t}^{i}-\Phi_{i}\right)x_{i}x_{i}^{\top}\left[G_{t,k-1}^{i}\right]^{\top}G_{t,k-1}^{i}-\eta\Gamma_{t,i}+\eta^{2}\Delta_{t,i},$$

where

$$\Gamma_{t,i}:=\sum_{k=1}^{L}\sum_{j\neq i}F_{t,k+1}^{i}\left[F_{t,k+1}^{j}\right]^{\top}\left(W_{t}^{j}-\Phi_{j}\right)x_{j}x_{j}^{\top}\left[G_{t,k-1}^{j}\right]^{\top}G_{t,k-1}^{i}. \qquad (11)$$
$$\Delta_{t,i}:=\sum_{s=2}^{L}(-\eta)^{s-2}\sum_{L\geq k_{1}>k_{2}>\dots>k_{s}\geq 1}F^{i}_{t,k_{1}+1}\nabla_{k_{1}}\ell(W_{t})D_{k_{1}-1}^{i}W_{t,k_{1}-1}\cdots D_{k_{s}}^{i}\nabla_{k_{s}}\ell(W_{t})G^{i}_{t,k_{s}-1}, \qquad (12)$$

Lemma 1 breaks a single gradient step into three terms. The first term represents the impact of example $i$ on itself, while the second term (11) represents the impact of the other examples on it and plays a key role in our analysis. Our technique of fixed activations enables a more careful optimization of this term, followed by improved guarantees. The last term (12) stands for higher-order dependencies between different weights, i.e. the impact of a change in some weights on the change of others, as a result of optimizing all weights in parallel. It negatively affects the optimization and should be minimized. The next lemma deals with the change in the loss.

Lemma 2.

For any set of positive numbers $a_{1},\dots,a_{n}$, we have

$$\ell(W_{t+1})-\ell(W_{t})\leq\sum_{i=1}^{n}\frac{\Lambda_{i}+\eta^{2}a_{i}}{2}\left\lVert(W_{t}^{i}-\Phi_{i})x_{i}\right\rVert^{2}+\sum_{i=1}^{n}\frac{\eta^{2}\left(3\eta^{2}+1/a_{i}\right)}{2}\left\lVert\Delta_{t,i}x_{i}\right\rVert^{2} \qquad (13)$$

where

$$\begin{aligned}
\Lambda_{i}=&-2\eta\sum_{k=1}^{L}\lambda_{\min}\left(F_{t,k+1}^{i}\left[F_{t,k+1}^{i}\right]^{\top}\right)\lambda_{\min}\left(\left[G_{t,k-1}^{i}\right]^{\top}G_{t,k-1}^{i}\right)\\
&+2\eta\sum_{k=1}^{L}\sum_{j\neq i}\left|\langle G^{j}_{t,k-1}x_{j},G^{i}_{t,k-1}x_{i}\rangle\right|\left\lVert F^{j}_{t,k+1}\left[F_{t,k+1}^{i}\right]^{\top}\right\rVert_{2}\\
&+3\eta^{2}L\sum_{k=1}^{L}\lambda_{\max}\left(F_{t,k+1}^{i}\left[F_{t,k+1}^{i}\right]^{\top}\right)\left\lVert G_{t,k-1}^{i}x_{i}\right\rVert^{4}\\
&+3\eta^{2}nL\sum_{k=1}^{L}\sum_{j\neq i}\left|\langle G^{j}_{t,k-1}x_{j},G^{i}_{t,k-1}x_{i}\rangle\right|^{2}\left\lVert F^{j}_{t,k+1}\left[F_{t,k+1}^{i}\right]^{\top}\right\rVert_{2}^{2}.
\end{aligned}$$

We wish to bound the right-hand side by a value as negative as possible. The first term of (13) is proportional to the loss (2); thus negative $\Lambda_{i}+\eta^{2}a_{i}$ values can lead to an exponential loss decay, i.e., linear-rate convergence: $\ell(W_{t+1})=(1-|\rho|)\ell(W_{t})$. In order to minimize $\Lambda_{i}$, two properties are desired: (a) the eigenvalues of the squared weight matrices $F^{i}_{t,k}\left[F_{t,k}^{i}\right]^{\top}$ and $\left[G_{t,k}^{i}\right]^{\top}G_{t,k}^{i}$ are concentrated, that is, the ratio between the minimal and maximal eigenvalues is independent of the problem's parameters; (b) the covariance between sub-networks of different examples is much smaller than the covariance between sub-networks of the same example. The second term of (13) represents high-order dependencies between gradients and is positive, restricting the learning rate from being too high.
Next, we assume that the following inequalities, which ensure (a) and (b), hold with high probability,

$$\lambda_{\min}\left(F_{t,k}^{i}\left[F_{t,k}^{i}\right]^{\top}\right)\geq\alpha_{y},\qquad\lambda_{\min}\left(G_{t,k}^{i}\left[G_{t,k}^{i}\right]^{\top}\right)\geq\alpha_{x}, \qquad (14)$$
$$\lambda_{\max}\left(F_{t,k}^{i}\left[F_{t,k}^{i}\right]^{\top}\right)\leq\beta_{y},\qquad\lambda_{\max}\left(G_{t,k}^{i}\left[G_{t,k}^{i}\right]^{\top}\right)\leq\beta_{x}, \qquad (15)$$
$$\left|\left\langle G_{t,k-1}^{j}x_{j},G_{t,k-1}^{i}x_{i}\right\rangle\right|\left\lVert F^{j}_{t,k+1}\left[F_{t,k+1}^{i}\right]^{\top}\right\rVert_{2}\leq\gamma\beta^{2}, \qquad (16)$$
$$\beta^{2}\gamma n\leq\frac{\alpha^{2}}{2}, \qquad (17)$$

where $\alpha=\sqrt{\alpha_{x}\alpha_{y}}$, $\beta=\sqrt{\beta_{y}\beta_{x}}$ and $\gamma$ are auxiliary parameters. Assumptions (14), (15) are related to property (a), while (16), (17) guarantee property (b). We will show that the above bounds hold with high probability under the proposed initialization and the fixed activation patterns defined in Section 2. Using those assumptions, we state the following lemma.

Lemma 3.

Set $a_{i}=\beta^{4}L^{2}$ and

$$\eta=\min\left(\frac{\alpha^{2}}{12\beta^{2}\beta_{x}L},\frac{1}{3L},\frac{\alpha^{2}}{4\beta^{4}L},\frac{1}{\beta^{2}L},\frac{1}{4\sqrt{2}\sqrt{L}e^{\theta/2}\theta^{-1/2}\beta\sqrt{\ell_{0}}},\frac{\alpha^{2}}{1024ne^{2\theta}\theta^{-2}\beta^{2}\ell_{0}}\right). \qquad (18)$$

Under the assumptions in (14), (15), (16), (17) and $\ell(W_{t})\leq\ell_{0}$ for $t\geq 1$, for any $\theta\in(0,1/5)$, with probability $1-L^{2}\sqrt{m}\exp\left(-\theta m/[4L]+6\sqrt{m}\right)$, we have

$$\ell(W_{t+1})\leq\left(1-\frac{\eta\alpha^{2}L}{2}\right)\ell(W_{t}). \qquad (19)$$

This inequality indicates a linear-rate convergence with rate $\frac{\eta\alpha^{2}L}{2}$. The proof of Theorem 1 follows directly from (19). Proofs for the validity of the above assumptions with high probability are based on concentration bounds for sufficiently overparameterized networks under the unique initialization proposed in Section 2, and are quite technical.
Theorem 5 validates (14), (15) by bounding the ratio between the maximal and minimal eigenvalues of the squared weight matrices by a small constant for all $(t,k,i)$. Theorem 6 utilizes the fixed activation patterns to show a small covariance between sub-networks of different examples, as required by (16), (17). Both theorems hold for a small enough maximal variation from initialization $\tau$ (4), which is achieved by sufficient network overparameterization, as shown in Theorem 1.

6 Experiments

In this section, we provide empirical validation of our theoretical results, as well as further testing on complementary setups. We wish to answer the following questions:

  1. Is the theoretical guarantee of Theorem 1 tight, or can it be further improved?

  2. How well do GReLU networks train and generalize compared to ReLU networks?

  3. What is the difference between the training dynamics of GReLU and ReLU when optimized with gradient-descent?

The next sections are dedicated to discussing these questions. For fairness, we compare networks with the same width and depth, with the initialization described in Section 2, trained by gradient descent with a constant learning rate in each run. The term MSE refers to the sum of squared errors described in (2). Since the theory covers general setups, it often uses small learning-rate values to fit all setups, while in practice larger values can be utilized for faster convergence. Therefore, we experiment with a variety of learning rates according to the task. In addition, in all the experiments we optimize with batch gradient-descent due to hardware constraints, and we pick the largest possible batch size per experiment.

Figure 2: Training dynamics of ReLU and GReLU networks. (a) ReLU (solid lines) and GReLU (dashed) networks with different widths trained over the synthetic dataset. (b) Average temporal difference of gradients on the SGEMM dataset. (c) Average Hamming distance of successive ReLU activation patterns for the SGEMM dataset.

6.1 Lower Trainability Bound

To answer question 1, we experiment on a synthetic dataset, in order to control the number of examples and their dimension to fit the theory under GPU memory constraints. We generate a regression problem with the following complex version of the Ackley function [Potter and De Jong, 1994],

$$y_{i}=\sum_{d^{\prime}=1}^{d}x_{i,d-d^{\prime}}\left(\log\left(|x_{i,d^{\prime}}|\right)\left(\cos(x_{i,d^{\prime}})+x^{3}\sin(x_{i,d^{\prime}})\right)+\sqrt{|x_{i,d^{\prime}}|}\right),\qquad i=1,\dots,n$$

We generate $(n,d)=(100,200)$ random vectors and labels, and use 4-layer networks for both ReLU and GReLU, with various widths. Each experiment is repeated 5 times with different random seeds.
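A sketch of the data generation is given below; the standard-normal input distribution, our reading of the ambiguous $x^{3}$ term as $x_{i,d^{\prime}}^{3}$, and the 0-indexed handling of $x_{i,d-d^{\prime}}$ at $d^{\prime}=d$ are assumptions made for illustration.

```python
import numpy as np

def make_synthetic_dataset(n=100, d=200, seed=0):
    """Generate the Ackley-like regression targets of Section 6.1 (our reading)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    Y = np.zeros(n)
    for i in range(n):
        x = X[i]
        for dp in range(1, d + 1):          # d' = 1, ..., d
            xd = x[dp - 1]                  # x_{i, d'}
            coeff = x[d - dp]               # x_{i, d - d'} (0-indexed)
            Y[i] += coeff * (np.log(np.abs(xd)) * (np.cos(xd) + xd ** 3 * np.sin(xd))
                             + np.sqrt(np.abs(xd)))
    return X, Y

X, Y = make_synthetic_dataset()
```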

The results of the different runs appear in Figure 2(a). To have a fair comparison between the different settings, we proceed as follows. For the experiment corresponding to the theoretical width in Theorem 1, we plot a curve where every entry is computed by taking the maximal MSE over the 5 runs. For lower widths, $m=(n^{2}L)^{p}$ with $p<1$ and $m=nL$, we plot the minimal MSE over the corresponding runs. All 5 runs with our theoretical width converged to zero error, for both ReLU and GReLU. In contrast, none of the runs with lower widths converged, for either network. Since the theory guarantees convergence for any data distribution, these results provide some empirical support that Theorem 1 is tight.

We acknowledge that training with significantly smaller learning rates (e.g., $10^{-20}$) would correspond to longer and impractical convergence times, but might allow narrower networks to converge as well. In addition, while we focus on fully-connected networks, modern architectures like ResNets have been empirically shown to converge with smaller widths (channels) for specific tasks. Improved theoretical guarantees for these architectures via the GReLU technique are an interesting direction for future work.

6.2 Optimization and Generalization in Practical Scenarios

Figure 3: MSE as a function of epochs for the CIFAR-10 dataset. (a) Train MSE with no augmentations. (b) Train and (c) test MSE for the GReLU$\rightarrow$ReLU scheme with the transform at epoch 40.

In this section, in order to answer question 2, we evaluate the training of ReLU and GReLU networks in more practical scenarios, by experimenting with sub-theoretic widths on two popular datasets from different domains, input dimensions and structures:
CIFAR-10: the popular image classification dataset, with $n=50{,}000$ and $d=32\times32\times3=3072$. We transform each classification into a scalar regression by calculating the squared loss between the one-hot $10$-dimensional vector encoding its correct label and the output logits of the network.
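A minimal sketch of this conversion, following the $\tfrac{1}{2}$ convention of (2); the helper name is illustrative:

```python
import torch

def onehot_regression_loss(logits, labels, num_classes=10):
    """Squared loss between the network's 10-dimensional output logits and the
    one-hot encoding of the correct class (labels: integer class indices),
    summed over the batch."""
    targets = torch.nn.functional.one_hot(labels, num_classes).float()
    return 0.5 * (logits - targets).pow(2).sum()
```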

Table 2: CIFAR-10 MSE
Architecture | ReLU | GReLU
Depth 6, Width 10k | 0.48 | 0.02
Depth 7, Width 10k | 0.27 | 0.01
Depth 4, Width 15k | 0.75 | 0.03

SGEMM: the well-known GPU kernel-performance regression dataset (https://archive.ics.uci.edu/ml/datasets/SGEMM+GPU+kernel+performance) [Nugteren and Codreanu, 2015]. It contains $n=241{,}600$ samples of dimension $d=18$, measuring GPU runtimes of matrix products. The SGEMM experiments demonstrate a similar behavior to CIFAR-10, and appear in Section 9 of the supplementary material.

We initialize both networks with the same trainable weights $\bar{W}_{0}$, with the additional $\Psi_{[L]}$ for GReLU, and train them with a fixed learning rate of $10^{-4}$. The results presented in Table 2 show that all variants of GReLU converged to near-zero losses. Conversely, the ReLU variants improved with increased depth, but did not reach similar losses. Possibly, deeper ReLU networks with fine-tuned learning-rate schedules would reach near-zero losses as well. The loss progression of the different variants is plotted in Figure 3(a); the GReLU networks (blue) also converge at higher rates.

We proceed to evaluate the GReLU$\rightarrow$ReLU scheme proposed in Section 4.1. While this paper is focused on optimization, comparing the empirical generalization of this scheme against a traditional ReLU network is intriguing. For a fair and simple comparison with no additional hyper-parameters to tune, we train both networks with RandAugment [Cubuk et al., 2020] to avoid overfitting. We train the variants for 80 epochs, and transform the GReLU network to its equivalent ReLU network exactly in the middle. Results on the train and test sets (an additional 10,000 examples) appear in Figures 3(b) and 3(c). The train loss of the GReLU variant is continuous at epoch 40, as guaranteed by Theorem 2, and comparable to that of the ReLU variant. On the test set, however, a significant loss drop can be seen as a result of the transform, leading to an improved performance of the GReLU network. Since GReLU is transformed to a ReLU network with minimal $\ell_{2}$-norm, the transform effectively acts as a ridge-regularization step. While the final loss is similar, we note that the GReLU training is almost twice as fast, as explained in Section 4.4. This can be used for further improvement.

6.3 Training Dynamics

To answer question 3, we start with a comparison of the optimization trajectories of the two networks. Regular trajectories are favored due to their typically smoother corresponding optimization landscapes. Acceleration methods like gradient-descent with momentum [Qian, 1999] provide more regular trajectories and are commonly used. During a full training run of each network with $\eta=10^{-4}$, we compute the $\ell_{2}$ norm between consecutive gradients, $\|g_{t}-g_{t-1}\|_{2}$, and present the results in Figure 2(b). The gradient differences of the GReLU network are smaller than those of the ReLU network, and gradually decrease along the convergence to the global optimum. This provides an additional empirical validation of the relatively large learning rates allowed by our theory. In addition, this empirical evaluation matches the tighter bound on the gradient differences guaranteed by Theorem 3. The gradient differences of the ReLU network increase monotonically along the training, with highly irregular trajectories that match the relatively high variation of the losses in Figure 5.
To better understand the behavior of the ReLU network, we conduct an additional experiment, calculating the change in its activation patterns along the training. More specifically, we present the mean Hamming distance between its activation patterns at consecutive epochs. The distance is computed across all ReLU activations over a selected validation set. The results are illustrated in Figure 2(c) and are quite surprising: a large portion of the activations, $20\%$-$25\%$, changes every epoch, and this portion keeps increasing along the run. This result complements the previous one by showing some instability of the ReLU network. We point out that some level of change in the activation patterns along the optimization possibly contributes to generalization via implicit regularization. In such a case, a hybrid version of the GReLU network that allows gradual, controlled changes of its activation patterns is an interesting future direction to pursue.
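The pattern-change measurement can be reproduced with a few lines; below is a sketch assuming the binary activation indicators of a validation batch are stored at each epoch (names and the synthetic snapshots are illustrative).

```python
import numpy as np

def mean_hamming_distance(patterns_prev, patterns_curr):
    """Mean fraction of ReLU activation indicators that flipped between two
    consecutive epochs; inputs are binary arrays of shape
    (num_validation_examples, total_num_relu_units)."""
    return (patterns_prev != patterns_curr).mean()

# Synthetic example: flip ~22% of the gates between two random snapshots.
rng = np.random.default_rng(0)
prev = rng.integers(0, 2, size=(256, 4096))
curr = prev.copy()
flip = rng.random(prev.shape) < 0.22
curr[flip] = 1 - curr[flip]
print(mean_hamming_distance(prev, curr))    # ~0.22
```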

7 Related Work

Neural network overparameterization has been extensively studied under different assumptions on the training data and network architecture. For the simplified case of well-separated data, Ji and Telgarsky [2018] showed that a poly-logarithmic width is sufficient to guarantee convergence in two-layer networks, and Chen et al. [2019] extended the result to deep networks using the NTRF function class of Cao and Gu [2019]. For mixtures of well-separated distributions, Li and Liang [2018] proved that when training two-layer networks with SGD, with width proportional to a high-order polynomial in the parameters of the problem, small train and generalization errors are achievable. Other common assumptions include Gaussian input [Brutzkus and Globerson, 2017, Zhong et al., 2017, Li and Yuan, 2017], independent activations [Choromanska et al., 2015, Kawaguchi, 2016], and output generated by a teacher network [Li and Yuan, 2017, Zhang et al., 2019b, Brutzkus and Globerson, 2017]. In our work, only a mild assumption of non-degenerate input is used.
Overparameterization of special classes of neural networks has been analyzed for linear networks [Kawaguchi, 2016, Arora et al., 2018a, Bartlett et al., 2018], smooth activations [Du and Lee, 2018, Kawaguchi and Huang, 2019], and shallow networks [Du and Lee, 2018, Oymak and Soltanolkotabi, 2020]. For deep linear networks, Kawaguchi [2016] proved that all local minima are global minima. Arora et al. [2018a] proved trainability for deep linear networks of minimal width, generalizing a result by Bartlett et al. [2018] on residual linear networks. Du and Lee [2018] considered neural networks with smooth activations, and showed that in overparameterized shallow networks with quadratic activation, all local minima are global minima. Kawaguchi and Huang [2019] showed that when only the output layer is trained, i.e., a least-squares estimate with gradient descent, tighter bounds on the network width are achievable. Such training assumptions are rarely used in practice due to poor generalization. Considering one-hidden-layer networks, Du et al. [2018] showed that under the assumption of a non-degenerate population Gram matrix, gradient-descent converges to a globally optimal solution. Oymak and Soltanolkotabi [2020] stated the current tightest bound on the width of shallow networks for different activation functions. For the ReLU activation, the bound we derive for deep networks improves theirs over shallow networks as well.
We focus on the general setting of deep networks equipped with ReLU activation, as typically used in practice. Similar works [Zou et al., 2018, Du et al., 2019, Allen-Zhu et al., 2019, Zou and Gu, 2019] require unrealistic overparameterization conditions, namely a minimal width proportional to the size of the dataset to the power of 8, 24, or 26 [Zou and Gu, 2019, Allen-Zhu et al., 2019, Zou et al., 2018], and to the depth of the network to the power of 12 to 38 [Zou and Gu, 2019, Zou et al., 2018], or with an exponential dependency on the depth [Du et al., 2019]. Our work improves the current tightest bound [Zou and Gu, 2019] by a factor of $O(n^{6}L^{11})$, with a corresponding significant decrease in convergence time, from polynomial to logarithmic in the parameters of the problem.
Optimization landscape of deep networks under different assumptions is an active line of research. A few works showed that ReLU networks have bad local minima [Venturi et al., 2018, Safran and Shamir, 2018, Swirszcz et al., 2016, Yun et al., 2017] due to "dead regions", even for random input data and labels created according to a planted model [Yun et al., 2017, Safran and Shamir, 2018], a property which is not shared with smooth activations [Liang et al., 2018, Du and Lee, 2018, Soltanolkotabi et al., 2018]. Specifically, Hardt and Ma [2016] showed that deep linear residual networks have no spurious local minima.
Training dynamics of ReLU networks and their activation patterns were studied by a few recent works [Hanin and Rolnick, 2019a, b, Li and Liang, 2018]. Hanin and Rolnick [2019b] showed that the average number of patterns along the training is significantly smaller than the possible one and only weakly depends on the depth. We study a single random pattern, entirely decoupled from the depth. Li and Liang [2018] studied "pseudo gradients" for shallow networks, close in spirit to the fixed ReLU activations suggested by our work. They showed that unless the generalization error is small, pseudo gradients remain large, providing motivation for fixing activation patterns from a generalization perspective. Furthermore, they showed that pseudo gradients could be coupled with the original gradients in most cases when the network is sufficiently overparameterized.
Neural Tangent Kernel regression has been shown to describe the evolution of infinitely wide fully-connected networks when trained with gradient descent [Jacot et al., 2018]. Additional works provided corresponding kernels for residual [Huang et al., 2020] and convolutional networks [Li et al., 2019], allowing the training dynamics of those networks to be studied in functional space. A few works [Belkin et al., 2018b, Cao and Gu, 2019, Belkin et al., 2018a, Liang et al., 2020] analyzed the NTK to explain properties of deep neural networks, most notably their trainability and generalizability. An important result by Allen-Zhu et al. [2019] extended the NTK equivalence to deep networks of finite width. We provide an improved condition on the required width for this equivalence to hold.

8 Conclusions and Future Work

In this paper we proposed a novel technique for studying overparameterized deep ReLU networks, which leads to state-of-the-art guarantees on trainability. Both the required network size and the convergence time significantly improve previous theory and can be tested in practice for the first time. Further theoretical and empirical analyses provide insights on the optimization of neural networks with first-order methods. Our analysis and proof technique can be extended to additional setups such as optimization with stochastic gradient-descent, cross-entropy loss, and other network architectures such as convolutional and residual networks, similarly to [Zhang et al., 2019a, Allen-Zhu et al., 2019, Du et al., 2018]. Our encouraging empirical results motivate analyzing additional variants like Gated-CNNs and Gated-ResNets. Our improved finite-width NTK equivalence enables studying neural networks of practical sizes using kernel theory, potentially leading to a better understanding of learning dynamics and generalization.

Acknowledgements

We would like to thank Assaf Hallak, Tal Ridnik, Avi Ben-Cohen, Yushun Zhang and Itamar Friedman for their feedback and productive discussions.

References

  • Allen-Zhu et al. [2019] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pages 242–252, 2019.
  • Arora et al. [2018a] Sanjeev Arora, Nadav Cohen, Noah Golowich, and Wei Hu. A convergence analysis of gradient descent for deep linear neural networks. arXiv preprint arXiv:1810.02281, 2018a.
  • Arora et al. [2018b] Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. arXiv preprint arXiv:1802.06509, 2018b.
  • Arora et al. [2019a] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, pages 8141–8150, 2019a.
  • Arora et al. [2019b] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019b.
  • Bachlechner et al. [2020] Thomas Bachlechner, Bodhisattwa Prasad Majumder, Huanru Henry Mao, Garrison W Cottrell, and Julian McAuley. Rezero is all you need: Fast convergence at large depth. arXiv preprint arXiv:2003.04887, 2020.
  • Bartlett et al. [2018] Peter Bartlett, Dave Helmbold, and Philip Long. Gradient descent with identity initialization efficiently learns positive definite linear transformations by deep residual networks. In International conference on machine learning, pages 521–530. PMLR, 2018.
  • Belkin et al. [2018a] Mikhail Belkin, Daniel J Hsu, and Partha Mitra. Overfitting or perfect fitting? risk bounds for classification and regression rules that interpolate. In Advances in neural information processing systems, pages 2300–2311, 2018a.
  • Belkin et al. [2018b] Mikhail Belkin, Siyuan Ma, and Soumik Mandal. To understand deep learning we need to understand kernel learning. arXiv preprint arXiv:1802.01396, 2018b.
  • Bengio and Delalleau [2011] Yoshua Bengio and Olivier Delalleau. On the expressive power of deep architectures. In International Conference on Algorithmic Learning Theory, pages 18–36. Springer, 2011.
  • Blum and Rivest [1989] Avrim Blum and Ronald L Rivest. Training a 3-node neural network is np-complete. In Advances in neural information processing systems, pages 494–501, 1989.
  • Brown et al. [2020] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
  • Brutzkus and Globerson [2017] Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with gaussian inputs. arXiv preprint arXiv:1702.07966, 2017.
  • Cao and Gu [2019] Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. In Advances in Neural Information Processing Systems, pages 10836–10846, 2019.
  • Chen et al. [2019] Zixiang Chen, Yuan Cao, Difan Zou, and Quanquan Gu. How much over-parameterization is sufficient to learn deep relu networks? arXiv preprint arXiv:1911.12360, 2019.
  • Choromanska et al. [2015] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial intelligence and statistics, pages 192–204, 2015.
  • Clevert et al. [2015] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 2015.
  • Cohen et al. [2016] Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. In Conference on learning theory, pages 698–728, 2016.
  • Cubuk et al. [2020] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 702–703, 2020.
  • Devlin et al. [2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Du et al. [2019] Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pages 1675–1685, 2019.
  • Du and Lee [2018] Simon S Du and Jason D Lee. On the power of over-parametrization in neural networks with quadratic activation. arXiv preprint arXiv:1803.01206, 2018.
  • Du et al. [2018] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
  • Eldan and Shamir [2016] Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Conference on learning theory, pages 907–940, 2016.
  • Geiger et al. [2020] Mario Geiger, Arthur Jacot, Stefano Spigler, Franck Gabriel, Levent Sagun, Stéphane d’Ascoli, Giulio Biroli, Clément Hongler, and Matthieu Wyart. Scaling description of generalization with number of parameters in deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2020(2):023401, 2020.
  • Hanin [2018] Boris Hanin. Which neural net architectures give rise to exploding and vanishing gradients? In Advances in Neural Information Processing Systems, pages 582–591, 2018.
  • Hanin and Rolnick [2019a] Boris Hanin and David Rolnick. Complexity of linear regions in deep networks. arXiv preprint arXiv:1901.09021, 2019a.
  • Hanin and Rolnick [2019b] Boris Hanin and David Rolnick. Deep relu networks have surprisingly few activation patterns. In Advances in Neural Information Processing Systems, pages 361–370, 2019b.
  • Hardt and Ma [2016] Moritz Hardt and Tengyu Ma. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016.
  • He et al. [2016a] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016a.
  • He et al. [2016b] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer, 2016b.
  • Hinton et al. [2012] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal processing magazine, 29(6):82–97, 2012.
  • Hornik et al. [1989] Kurt Hornik, Maxwell Stinchcombe, Halbert White, et al. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
  • Huang and Yau [2020] Jiaoyang Huang and Horng-Tzer Yau. Dynamics of deep neural networks and neural tangent hierarchy. In International Conference on Machine Learning, pages 4542–4551. PMLR, 2020.
  • Huang et al. [2020] Kaixuan Huang, Yuqing Wang, Molei Tao, and Tuo Zhao. Why do deep residual networks generalize better than deep feedforward networks?–a neural tangent kernel perspective. arXiv preprint arXiv:2002.06262, 2020.
  • Huang et al. [2019] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in neural information processing systems, pages 103–112, 2019.
  • Hubara et al. [2016] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. Advances in neural information processing systems, 29:4107–4115, 2016.
  • Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
  • Jacot et al. [2018] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571–8580, 2018.
  • Ji and Telgarsky [2018] Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300, 2018.
  • Jia et al. [2017] Kui Jia, Dacheng Tao, Shenghua Gao, and Xiangmin Xu. Improving training of deep neural networks via singular value bounding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4344–4352, 2017.
  • Kawaguchi [2016] Kenji Kawaguchi. Deep learning without poor local minima. In Advances in neural information processing systems, pages 586–594, 2016.
  • Kawaguchi and Huang [2019] Kenji Kawaguchi and Jiaoyang Huang. Gradient descent finds global minima for generalizable deep neural networks of practical sizes. In 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 92–99. IEEE, 2019.
  • Kolesnikov et al. [2019] Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big transfer (bit): General visual representation learning. arXiv preprint arXiv:1912.11370, 6:1, 2019.
  • Krishnamoorthi [2018] Raghuraman Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342, 2018.
  • Krizhevsky et al. [2017] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017.
  • Lacotte and Pilanci [2019] Jonathan Lacotte and Mert Pilanci. Faster least squares optimization. arXiv preprint arXiv:1911.02675, 2019.
  • Laurent and Massart [2000] Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302–1338, 2000.
  • Lee et al. [2020] Jaehoon Lee, Samuel S Schoenholz, Jeffrey Pennington, Ben Adlam, Lechao Xiao, Roman Novak, and Jascha Sohl-Dickstein. Finite versus infinite neural networks: an empirical study. arXiv preprint arXiv:2007.15801, 2020.
  • Li and Liang [2018] Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8157–8166, 2018.
  • Li and Yuan [2017] Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with relu activation. Advances in neural information processing systems, 30:597–607, 2017.
  • Li et al. [2019] Zhiyuan Li, Ruosong Wang, Dingli Yu, Simon S Du, Wei Hu, Ruslan Salakhutdinov, and Sanjeev Arora. Enhanced convolutional neural tangent kernels. arXiv preprint arXiv:1911.00809, 2019.
  • Liang and Srikant [2016] Shiyu Liang and Rayadurgam Srikant. Why deep neural networks for function approximation? arXiv preprint arXiv:1610.04161, 2016.
  • Liang et al. [2018] Shiyu Liang, Ruoyu Sun, Yixuan Li, and Rayadurgam Srikant. Understanding the loss surface of neural networks for binary classification. arXiv preprint arXiv:1803.00909, 2018.
  • Liang et al. [2020] Tengyuan Liang, Alexander Rakhlin, et al. Just interpolate: Kernel “ridgeless” regression can generalize. Annals of Statistics, 48(3):1329–1347, 2020.
  • Livni et al. [2014] Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training neural networks. In Advances in neural information processing systems, pages 855–863, 2014.
  • Lu et al. [2019] Lu Lu, Yeonjong Shin, Yanhui Su, and George Em Karniadakis. Dying relu and initialization: Theory and numerical examples. arXiv preprint arXiv:1903.06733, 2019.
  • Monti et al. [2018] Ricardo Pio Monti, Sina Tootoonian, and Robin Cao. Avoiding degradation in deep feed-forward networks by phasing out skip-connections. In International Conference on Artificial Neural Networks, pages 447–456. Springer, 2018.
  • Nayman et al. [2019] Niv Nayman, Asaf Noy, Tal Ridnik, Itamar Friedman, Rong Jin, and Lihi Zelnik. Xnas: Neural architecture search with expert advice. In Advances in Neural Information Processing Systems, pages 1977–1987, 2019.
  • Noy et al. [2020] Asaf Noy, Niv Nayman, Tal Ridnik, Nadav Zamir, Sivan Doveh, Itamar Friedman, Raja Giryes, and Lihi Zelnik. Asap: Architecture search, anneal and prune. In International Conference on Artificial Intelligence and Statistics, pages 493–503. PMLR, 2020.
  • Nugteren and Codreanu [2015] Cedric Nugteren and Valeriu Codreanu. Cltune: A generic auto-tuner for opencl kernels. In 2015 IEEE 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip, pages 195–202. IEEE, 2015.
  • Oymak and Soltanolkotabi [2020] Samet Oymak and Mahdi Soltanolkotabi. Towards moderate overparameterization: global convergence guarantees for training shallow neural networks. IEEE Journal on Selected Areas in Information Theory, 2020.
  • Pisier [1999] Gilles Pisier. The volume of convex bodies and Banach space geometry, volume 94. Cambridge University Press, 1999.
  • Potter and De Jong [1994] Mitchell A Potter and Kenneth A De Jong. A cooperative coevolutionary approach to function optimization. In International Conference on Parallel Problem Solving from Nature, pages 249–257. Springer, 1994.
  • Qian [1999] Ning Qian. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145–151, 1999.
  • Ridnik et al. [2020] Tal Ridnik, Hussam Lawen, Asaf Noy, and Itamar Friedman. Tresnet: High performance gpu-dedicated architecture. arXiv preprint arXiv:2003.13630, 2020.
  • Safran and Shamir [2018] Itay Safran and Ohad Shamir. Spurious local minima are common in two-layer relu neural networks. In International Conference on Machine Learning, pages 4433–4441. PMLR, 2018.
  • Santurkar et al. [2018] Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization? arXiv preprint arXiv:1805.11604, 2018.
  • Soltanolkotabi et al. [2018] Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory, 65(2):742–769, 2018.
  • Song and Yang [2019] Zhao Song and Xin Yang. Quadratic suffices for over-parametrization via matrix chernoff bound. arXiv preprint arXiv:1906.03593, 2019.
  • Srivastava et al. [2015] Rupesh K Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep networks. Advances in neural information processing systems, 28:2377–2385, 2015.
  • Swirszcz et al. [2016] Grzegorz Swirszcz, Wojciech Marian Czarnecki, and Razvan Pascanu. Local minima in training of deep networks, 2016.
  • Tandon and Sra [2010] Rashish Tandon and Suvrit Sra. Sparse nonnegative matrix approximation: new formulations and algorithms, 2010.
  • Telgarsky [2015] Matus Telgarsky. Representation benefits of deep feedforward networks. arXiv preprint arXiv:1509.08101, 2015.
  • Venturi et al. [2018] Luca Venturi, Afonso S Bandeira, and Joan Bruna. Spurious valleys in two-layer neural network optimization landscapes. arXiv preprint arXiv:1802.06384, 2018.
  • Voulodimos et al. [2018] Athanasios Voulodimos, Nikolaos Doulamis, Anastasios Doulamis, and Eftychios Protopapadakis. Deep learning for computer vision: A brief review. Computational intelligence and neuroscience, 2018, 2018.
  • Wu et al. [2019] Xiaoxia Wu, Simon S Du, and Rachel Ward. Global convergence of adaptive gradient methods for an over-parameterized neural network. arXiv preprint arXiv:1902.07111, 2019.
  • Xiao et al. [2019] Lechao Xiao, Jeffrey Pennington, and Sam Schoenholz. Disentangling trainability and generalization in deep learning, 2019.
  • Xu et al. [2015] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853, 2015.
  • Yun et al. [2017] Chulhee Yun, Suvrit Sra, and Ali Jadbabaie. Global optimality conditions for deep neural networks. arXiv preprint arXiv:1707.02444, 2017.
  • Yun et al. [2019] Chulhee Yun, Suvrit Sra, and Ali Jadbabaie. Small relu networks are powerful memorizers: a tight analysis of memorization capacity. In Advances in Neural Information Processing Systems, pages 15558–15569, 2019.
  • Zagoruyko and Komodakis [2016] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
  • Zagoruyko and Komodakis [2017] Sergey Zagoruyko and Nikos Komodakis. Diracnets: Training very deep neural networks without skip-connections. arXiv preprint arXiv:1706.00388, 2017.
  • Zhang et al. [2016] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
  • Zhang et al. [2019a] Huishuai Zhang, Da Yu, Wei Chen, and Tie-Yan Liu. Training over-parameterized deep resnet is almost as easy as training a two-layer network. arXiv preprint arXiv:1903.07120, 2019a.
  • Zhang et al. [2019b] Xiao Zhang, Yaodong Yu, Lingxiao Wang, and Quanquan Gu. Learning one-hidden-layer relu networks via gradient descent. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1524–1534. PMLR, 2019b.
  • Zhong et al. [2017] Kai Zhong, Zhao Song, and Inderjit S Dhillon. Learning non-overlapping convolutional neural networks with multiple kernels. arXiv preprint arXiv:1711.03440, 2017.
  • Zou and Gu [2019] Difan Zou and Quanquan Gu. An improved analysis of training over-parameterized deep neural networks. In Advances in Neural Information Processing Systems, pages 2055–2064, 2019.
  • Zou et al. [2018] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep relu networks. arXiv preprint arXiv:1811.08888, 2018.

Appendix

9 Additional experiments on SGEMM dataset

We start by training networks of width 4096 and depth 10 on SGEMM, and plot their losses during training for various learning rates in Figure 4(a) for the GReLU network and Figure 4(b) for the ReLU network. While the results are quite similar for smaller learning rates, for larger ones the losses of the GReLU network decrease more monotonically and reach lower values. A comparison at high learning rates in Figure 5 shows that the ReLU network oscillates with larger amplitudes towards higher loss values, whereas the GReLU losses keep decreasing and reach smaller values as the learning rate grows. Neither network reaches zero loss, as their widths are smaller than required by our theory. We further test GReLU networks with a fixed learning rate for different widths and depths, and present the results in Figure 4(c). While both increased depth and increased width improve the results, deeper and narrower networks perform better for the same number of parameters. This result is consistent with similar comparisons of ReLU networks in previous works.
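For concreteness, a minimal sketch of this comparison is given below, assuming PyTorch; here GReLU is interpreted as gating each layer's pre-activation by the sign pattern of a frozen auxiliary copy of the network kept at initialization (following the recursions in the proof of Theorem 2), and the data, width, depth, number of steps, and learning rate are toy placeholders rather than the exact experimental setup.

# Minimal sketch (not the exact experimental code): a ReLU MLP vs. a "GReLU" MLP
# on a regression task such as SGEMM. GReLU here means that the activation gates
# are taken from a frozen copy of the network kept at initialization; this
# interpretation of GReLU is an assumption for illustration purposes.
import copy
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, d_in, d_out, width=4096, depth=10):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Linear(d_in, width)] + [nn.Linear(width, width) for _ in range(depth - 1)])
        self.head = nn.Linear(width, d_out)

    def forward(self, x, gate_net=None):
        g = x
        for k, layer in enumerate(self.layers):
            z = layer(x)
            if gate_net is None:                      # plain ReLU network
                x = torch.relu(z)
            else:                                     # GReLU: gate z by the frozen net's sign pattern
                g = gate_net.layers[k](g)             # auxiliary pre-activation (kept linear)
                x = z * (g > 0).float()
        return self.head(x)

def train(model, data, target, lr, steps=200, gate_net=None):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(data, gate_net), target)
        loss.backward()
        opt.step()
    return loss.item()

# toy stand-in for the 14 SGEMM input features and a single regression target
x, y = torch.randn(256, 14), torch.randn(256, 1)
relu_net = MLP(14, 1, width=256, depth=4)
grelu_net = copy.deepcopy(relu_net)                      # identical initialization
frozen = copy.deepcopy(relu_net).requires_grad_(False)   # gates fixed at initialization
print("ReLU  final loss:", train(relu_net, x, y, lr=1e-3))
print("GReLU final loss:", train(grelu_net, x, y, lr=1e-3, gate_net=frozen))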

Figure 4: MSE as a function of epochs for the SGEMM dataset; panels: (a) GReLU, (b) ReLU, (c) GReLU.
Figure 5: MSE vs. epochs per learning rate for GReLU and ReLU over the SGEMM dataset; panels: (a) $\eta=10^{-2}$, (b) $\eta=10^{-3}$, (c) $\eta=7\cdot 10^{-4}$.

10 Proofs for Section 5

10.1 Proof of Lemma 1

Proof.

By the definition of WtiW_{t}^{i} in (1) we have

Wt+1iWti=\displaystyle W^{i}_{t+1}-W^{i}_{t}= BDLiWt+1,LD1iWt+1,1D0iCBDLiWt,LD1iWt,1D0iC\displaystyle BD_{L}^{i}W_{t+1,L}\ldots D_{1}^{i}W_{t+1,1}D_{0}^{i}C-BD_{L}^{i}W_{t,L}\ldots D_{1}^{i}W_{t,1}D_{0}^{i}C
=(a)\displaystyle\overset{(a)}{=} BDLi(Wt,LηL(Wt))D1i(Wt,1η1(Wt))D0iCBDLiWt,LD1iWt,1D0iC\displaystyle BD_{L}^{i}\left(W_{t,L}-\eta\nabla_{L}\ell(W_{t})\right)\ldots D_{1}^{i}\left(W_{t,1}-\eta\nabla_{1}\ell(W_{t})\right)D_{0}^{i}C-BD_{L}^{i}W_{t,L}\ldots D_{1}^{i}W_{t,1}D_{0}^{i}C
=(b)\displaystyle\overset{(b)}{=} η2Δt,iηk=1LBDLiWt,LDkik(Wt)Dk1iWt,k1D1iWt,1D0iC,:=Zti\displaystyle\eta^{2}\Delta_{t,i}-\eta\underbrace{\sum_{k=1}^{L}BD_{L}^{i}W_{t,L}\ldots D_{k}^{i}\nabla_{k}\ell(W_{t})D_{k-1}^{i}W_{t,k-1}\ldots D_{1}^{i}W_{t,1}D_{0}^{i}C,}_{:=Z_{t}^{i}}

where (a) is due to the update rule $W_{t+1,k}=W_{t,k}-\eta\nabla_{k}\ell(W_{t})$ for any $k\in[L]$; (b) uses the definition of $\Delta_{t,i}$ in (12). In the above, we split $W^{i}_{t+1}-W^{i}_{t}$ into two parts, where $\eta Z_{t}^{i}$ is proportional to $\eta$ and $\eta^{2}\Delta_{t,i}$ is proportional to $\eta^{2}$. We can simplify $Z_{t}^{i}$ as

Zti=(10)\displaystyle Z_{t}^{i}\overset{(\ref{notation:gradient:parameter})}{=} k=1LFt,k+1ik(Wt)Gt,k1i\displaystyle\sum_{k=1}^{L}F_{t,k+1}^{i}\nabla_{k}\ell(W_{t})G_{t,k-1}^{i}
=(9)(11)\displaystyle\overset{(\ref{eqn:grad})(\ref{eqn:gamma})}{=} k=1LFt,k+1i[Ft,k+1i](WtiΦi)xixi[Gt,k1i]Gt,k1i+Γt,i.\displaystyle\sum_{k=1}^{L}F_{t,k+1}^{i}\left[F_{t,k+1}^{i}\right]^{\top}\left(W_{t}^{i}-\Phi_{i}\right)x_{i}x_{i}^{\top}\left[G_{t,k-1}^{i}\right]^{\top}G_{t,k-1}^{i}+\Gamma_{t,i}.

We complete the proof by plugging in the expression of ZtiZ_{t}^{i} into the expression of Wt+1iWtiW_{t+1}^{i}-W_{t}^{i}. ∎

10.2 Proof of Lemma 2

Proof.

By the definition of the loss function in (2), we have

(Wt+1)(Wt)=\displaystyle\ell(W_{t+1})-\ell(W_{t})= i=1n{12(Wt+1iΦi)xi212(WtiΦi)xi2}\displaystyle\sum_{i=1}^{n}\left\{\frac{1}{2}\left\lVert(W_{t+1}^{i}-\Phi_{i})x_{i}\right\rVert^{2}-\frac{1}{2}\left\lVert(W_{t}^{i}-\Phi_{i})x_{i}\right\rVert^{2}\right\}
=\displaystyle= i=1n{(Wt+1iWti)xi,(WtiΦi)xi+12(WtiWt+1i)xi2}.\displaystyle\sum_{i=1}^{n}\left\{\langle(W_{t+1}^{i}-W_{t}^{i})x_{i},(W_{t}^{i}-\Phi_{i})x_{i}\rangle+\frac{1}{2}\left\lVert(W_{t}^{i}-W_{t+1}^{i})x_{i}\right\rVert^{2}\right\}. (20)

For the first term in (20), by Lemma 1, we have

(Wt+1iWti)xi,(WtiΦi)xi\displaystyle\langle(W_{t+1}^{i}-W_{t}^{i})x_{i},(W_{t}^{i}-\Phi_{i})x_{i}\rangle
=\displaystyle= ηk=1LFk+1i[Fk+1i](WtiΦi)xixi[Gt,k1i]Gt,k1ixi,(WtiΦi)xi+η2Δt,ixi,(WtiΦi)xi\displaystyle-\eta\sum_{k=1}^{L}\left\langle F_{k+1}^{i}\left[F_{k+1}^{i}\right]^{\top}\left(W_{t}^{i}-\Phi_{i}\right)x_{i}x_{i}^{\top}\left[G_{t,k-1}^{i}\right]^{\top}G_{t,k-1}^{i}x_{i},(W_{t}^{i}-\Phi_{i})x_{i}\right\rangle+\eta^{2}\langle\Delta_{t,i}x_{i},(W_{t}^{i}-\Phi_{i})x_{i}\rangle
ηk=1LjiFt,k+1i[Ft,k+1j](WtjΦj)xjxj[Gt,k1j]Gt,k1ixi,(WtiΦi)xi\displaystyle-\eta\sum_{k=1}^{L}\sum_{j\neq i}\left\langle F_{t,k+1}^{i}\left[F_{t,k+1}^{j}\right]^{\top}\left(W_{t}^{j}-\Phi_{j}\right)x_{j}x_{j}^{\top}\left[G_{t,k-1}^{j}\right]^{\top}G_{t,k-1}^{i}x_{i},(W_{t}^{i}-\Phi_{i})x_{i}\right\rangle
=\displaystyle= ηk=1LFk+1i[Fk+1i](WtiΦi)xixi[Gt,k1i]Gt,k1ixi,(WtiΦi)xi+η2Δt,ixi,(WtiΦi)xi\displaystyle-\eta\sum_{k=1}^{L}\left\langle F_{k+1}^{i}\left[F_{k+1}^{i}\right]^{\top}\left(W_{t}^{i}-\Phi_{i}\right)x_{i}x_{i}^{\top}\left[G_{t,k-1}^{i}\right]^{\top}G_{t,k-1}^{i}x_{i},(W_{t}^{i}-\Phi_{i})x_{i}\right\rangle+\eta^{2}\langle\Delta_{t,i}x_{i},(W_{t}^{i}-\Phi_{i})x_{i}\rangle
ηk=1LjiGt,k1jxj,Gt,k1ixi(WtjΦj)xj,Ft,k+1j[Ft,k+1i](WtiΦi)xi,\displaystyle-\eta\sum_{k=1}^{L}\sum_{j\neq i}\left\langle G_{t,k-1}^{j}x_{j},G_{t,k-1}^{i}x_{i}\right\rangle\left\langle(W_{t}^{j}-\Phi_{j})x_{j},F_{t,k+1}^{j}\left[F_{t,k+1}^{i}\right]^{\top}(W_{t}^{i}-\Phi_{i})x_{i}\right\rangle, (21)

where the last equality is due to the facts that $x_{j}^{\top}\left[G_{t,k-1}^{j}\right]^{\top}G_{t,k-1}^{i}x_{i}$ is a scalar and $\langle Ax,y\rangle=\langle x,A^{\top}y\rangle$ for any matrix $A$ and vectors $x,y$. On the other hand, the first term in (21) can be written as

Fk+1i[Fk+1i](WtiΦi)xixi[Gt,k1i]Gt,k1ixi,(WtiΦi)xi\displaystyle\left\langle F_{k+1}^{i}\left[F_{k+1}^{i}\right]^{\top}\left(W_{t}^{i}-\Phi_{i}\right)x_{i}x_{i}^{\top}\left[G_{t,k-1}^{i}\right]^{\top}G_{t,k-1}^{i}x_{i},(W_{t}^{i}-\Phi_{i})x_{i}\right\rangle
=\displaystyle= Tr(xi(WtiΦi)Fk+1i[Fk+1i](WtiΦi)xixi[Gt,k1i]Gt,k1ixi)\displaystyle\text{Tr}\left(x_{i}^{\top}(W_{t}^{i}-\Phi_{i})^{\top}F_{k+1}^{i}\left[F_{k+1}^{i}\right]^{\top}\left(W_{t}^{i}-\Phi_{i}\right)x_{i}x_{i}^{\top}\left[G_{t,k-1}^{i}\right]^{\top}G_{t,k-1}^{i}x_{i}\right)
=(a)\displaystyle\overset{(a)}{=} Tr([Gt,k1i]Gt,k1ixixi(WtiΦi)Fk+1i[Fk+1i](WtiΦi)xixi)\displaystyle\text{Tr}\left(\left[G_{t,k-1}^{i}\right]^{\top}G_{t,k-1}^{i}x_{i}x_{i}^{\top}(W_{t}^{i}-\Phi_{i})^{\top}F_{k+1}^{i}\left[F_{k+1}^{i}\right]^{\top}\left(W_{t}^{i}-\Phi_{i}\right)x_{i}x_{i}^{\top}\right)
=(b)\displaystyle\overset{(b)}{=} vec((WtiΦi)xixi)vec(([Gt,k1i]Gt,k1ixixi(WtiΦi)Fk+1i[Fk+1i]))\displaystyle\mbox{vec}^{\top}(\left(W_{t}^{i}-\Phi_{i}\right)x_{i}x_{i}^{\top})\mbox{vec}\left(\left(\left[G_{t,k-1}^{i}\right]^{\top}G_{t,k-1}^{i}x_{i}x_{i}^{\top}(W_{t}^{i}-\Phi_{i})^{\top}F_{k+1}^{i}\left[F_{k+1}^{i}\right]^{\top}\right)^{\top}\right)
=(c)\displaystyle\overset{(c)}{=} vec((WtiΦi)xixi)vec(Fk+1i[Fk+1i](WtiΦi)xixi[Gt,k1i]Gt,k1i)\displaystyle\mbox{vec}^{\top}(\left(W_{t}^{i}-\Phi_{i}\right)x_{i}x_{i}^{\top})\mbox{vec}\left(F_{k+1}^{i}\left[F_{k+1}^{i}\right]^{\top}(W_{t}^{i}-\Phi_{i})x_{i}x_{i}^{\top}\left[G_{t,k-1}^{i}\right]^{\top}G_{t,k-1}^{i}\right)
=(d)\displaystyle\overset{(d)}{=} vec((WtiΦi)xixi)([Gt,k1i]Gt,k1)(Fk+1i[Fk+1i])vec((WtiΦi)xixi),\displaystyle\mbox{vec}^{\top}(\left(W_{t}^{i}-\Phi_{i}\right)x_{i}x_{i}^{\top})\left(\left[G_{t,k-1}^{i}\right]^{\top}G_{t,k-1}\right)\otimes\left(F_{k+1}^{i}\left[F_{k+1}^{i}\right]^{\top}\right)\mbox{vec}\left((W_{t}^{i}-\Phi_{i})x_{i}x_{i}^{\top}\right), (22)

where for any matrices A,B,XA,B,X: (a) uses Tr(AB)=Tr(BA)\text{Tr}(AB)=\text{Tr}(BA); (b) uses Tr(AB)=vec(B)vec(A)\text{Tr}(AB)=\mbox{vec}^{\top}(B)\mbox{vec}(A^{\top}); (c) uses (AB)=BA(AB)^{\top}=B^{\top}A^{\top}; (d) uses vec(AXB)=BAvec(X)\mbox{vec}(AXB)=B^{\top}\otimes A\mbox{vec}(X). Define

At,ki:=\displaystyle A_{t,k}^{i}:= ([Gt,k1i]Gt,k1i)(Ft,k+1i[Ft,k+1i]),\displaystyle\left(\left[G_{t,k-1}^{i}\right]^{\top}G_{t,k-1}^{i}\right)\otimes\left(F_{t,k+1}^{i}\left[F_{t,k+1}^{i}\right]^{\top}\right), (23)
Ξt,ki,j:=\displaystyle\Xi_{t,k}^{i,j}:= Gt,k1jxj,Gt,k1ixiFt,k+1j[Ft,k+1i],\displaystyle\langle G^{j}_{t,k-1}x_{j},G^{i}_{t,k-1}x_{i}\rangle F_{t,k+1}^{j}\left[F_{t,k+1}^{i}\right]^{\top}, (24)

where $\otimes$ is the Kronecker product. By plugging (22) into (21) and using the definitions in (23) and (24), we have

i=1n(Wt+1iWti)xi,(WtiΦi)xi\displaystyle\sum_{i=1}^{n}\langle(W_{t+1}^{i}-W_{t}^{i})x_{i},(W_{t}^{i}-\Phi_{i})x_{i}\rangle
=\displaystyle= ηi=1nvec((WtiΦi)xixi)(k=1LAt,ki)vec((WtiΦi)xixi)+η2i=1nΔt,ixi,(WtiΦi)xi\displaystyle-\eta\sum_{i=1}^{n}\mbox{vec}^{\top}\left((W_{t}^{i}-\Phi_{i})x_{i}x_{i}^{\top}\right)\left(\sum_{k=1}^{L}A_{t,k}^{i}\right)\mbox{vec}\left((W_{t}^{i}-\Phi_{i})x_{i}x_{i}^{\top}\right)+\eta^{2}\sum_{i=1}^{n}\langle\Delta_{t,i}x_{i},(W_{t}^{i}-\Phi_{i})x_{i}\rangle
ηi=1nk=1Lji(WtjΦj)xj,Ξt,ki,j(WtiΦi)xi\displaystyle-\eta\sum_{i=1}^{n}\sum_{k=1}^{L}\sum_{j\neq i}\left\langle(W_{t}^{j}-\Phi_{j})x_{j},\Xi_{t,k}^{i,j}(W_{t}^{i}-\Phi_{i})x_{i}\right\rangle (25)

By using the fact that (AB)=AB(A\otimes B)^{\top}=A^{\top}\otimes B^{\top}, we know that (At,ki)=At,ki(A_{t,k}^{i})^{\top}=A_{t,k}^{i}, which means that At,kiA_{t,k}^{i} is symmetric. Thus we have

vec((WtiΦi)xixi)(k=1LAt,ki)vec((WtiΦi)xixi)(WtiΦi)xi2(k=1Lλmin(At,ki)).\displaystyle\mbox{vec}^{\top}\left((W_{t}^{i}-\Phi_{i})x_{i}x_{i}^{\top}\right)\left(\sum_{k=1}^{L}A_{t,k}^{i}\right)\mbox{vec}\left((W_{t}^{i}-\Phi_{i})x_{i}x_{i}^{\top}\right)\geq\left\lVert(W_{t}^{i}-\Phi_{i})x_{i}\right\rVert^{2}\left(\sum_{k=1}^{L}\lambda_{\min}\left(A_{t,k}^{i}\right)\right). (26)

For the last term in (25),

ηi=1nk=1Lji(WtjΦj)xj,Ξt,ki,j(WtiΦi)xi\displaystyle-\eta\sum_{i=1}^{n}\sum_{k=1}^{L}\sum_{j\neq i}\left\langle(W_{t}^{j}-\Phi_{j})x_{j},\Xi_{t,k}^{i,j}(W_{t}^{i}-\Phi_{i})x_{i}\right\rangle
(a)\displaystyle\overset{(a)}{\leq} η2i=1nk=1LjiΞt,ki,j2((WtiΦi)xi2+(WtjΦj)xj2)\displaystyle\frac{\eta}{2}\sum_{i=1}^{n}\sum_{k=1}^{L}\sum_{j\neq i}\left\lVert\Xi_{t,k}^{i,j}\right\rVert_{2}\left(\left\lVert(W_{t}^{i}-\Phi_{i})x_{i}\right\rVert^{2}+\left\lVert(W_{t}^{j}-\Phi_{j})x_{j}\right\rVert^{2}\right)
=Lemma6\displaystyle\overset{\text{Lemma}~{}\ref{tech:lemm:1}}{=} ηi=1nk=1L(WtiΦi)xi2(jiΞt,ki,j2+Ξt,kj,i22)\displaystyle\eta\sum_{i=1}^{n}\sum_{k=1}^{L}\left\lVert(W_{t}^{i}-\Phi_{i})x_{i}\right\rVert^{2}\left(\sum_{j\neq i}\frac{\left\|\Xi_{t,k}^{i,j}\right\|_{2}+\left\|\Xi_{t,k}^{j,i}\right\|_{2}}{2}\right)
=(b)\displaystyle\overset{(b)}{=} ηi=1nk=1L(WtiΦi)xi2(jiΞt,ki,j2)\displaystyle\eta\sum_{i=1}^{n}\sum_{k=1}^{L}\left\lVert(W_{t}^{i}-\Phi_{i})x_{i}\right\rVert^{2}\left(\sum_{j\neq i}\left\|\Xi_{t,k}^{i,j}\right\|_{2}\right) (27)

where (a) uses the fact that $\langle x,Ay\rangle\leq\|x\|\|Ay\|\leq\|x\|\|A\|_{2}\|y\|\leq\frac{1}{2}\|A\|_{2}(\|x\|^{2}+\|y\|^{2})$ for any matrix $A\in\mathbb{R}^{d\times d}$ and vectors $x,y\in\mathbb{R}^{d}$; (b) uses the definition of $\Xi_{t,k}^{i,j}$, which implies $\|\Xi_{t,k}^{i,j}\|_{2}=\|\Xi_{t,k}^{j,i}\|_{2}$. As a result, by (25), (26) and (27) we have

i=1n(Wt+1iWti)xi,(WtiΦi)xi\displaystyle\sum_{i=1}^{n}\langle(W_{t+1}^{i}-W_{t}^{i})x_{i},(W_{t}^{i}-\Phi_{i})x_{i}\rangle
\displaystyle\leq ηi=1n(WtiΦi)xi2(k=1Lλmin(At,ki))+η2i=1nΔt,i,(WtiΦi)xixi\displaystyle-\eta\sum_{i=1}^{n}\left\lVert(W_{t}^{i}-\Phi_{i})x_{i}\right\rVert^{2}\left(\sum_{k=1}^{L}\lambda_{\min}\left(A_{t,k}^{i}\right)\right)+\eta^{2}\sum_{i=1}^{n}\langle\Delta_{t,i},(W_{t}^{i}-\Phi_{i})x_{i}x_{i}^{\top}\rangle
+ηi=1nk=1L(WtiΦi)xi2(jiΞt,ki,j2).\displaystyle+\eta\sum_{i=1}^{n}\sum_{k=1}^{L}\left\lVert(W_{t}^{i}-\Phi_{i})x_{i}\right\rVert^{2}\left(\sum_{j\neq i}\left\|\Xi_{t,k}^{i,j}\right\|_{2}\right). (28)

We now handle the second term in (20). By Lemma 1 and the Cauchy-Schwarz inequality, we have

12(Wt+1iWti)xi2\displaystyle\frac{1}{2}\left\lVert(W_{t+1}^{i}-W_{t}^{i})x_{i}\right\rVert^{2}
\displaystyle\leq 3η22k=1LFt,k+1i[Ft,k+1i](WtiΦi)xiGt,k1ixi22+3η42Δt,ixi2\displaystyle\frac{3\eta^{2}}{2}\left\lVert\sum_{k=1}^{L}F_{t,k+1}^{i}\left[F_{t,k+1}^{i}\right]^{\top}(W_{t}^{i}-\Phi_{i})x_{i}\left\lVert G_{t,k-1}^{i}x_{i}\right\rVert^{2}\right\rVert^{2}+\frac{3\eta^{4}}{2}\left\lVert\Delta_{t,i}x_{i}\right\rVert^{2}
+3η22k=1LjiFt,k+1i[Ft,k+1j](WtjΦj)xjGt,k1jxj,Gt,k1ixi2\displaystyle+\frac{3\eta^{2}}{2}\left\lVert\sum_{k=1}^{L}\sum_{j\neq i}F_{t,k+1}^{i}\left[F_{t,k+1}^{j}\right]^{\top}(W_{t}^{j}-\Phi_{j})x_{j}\langle G_{t,k-1}^{j}x_{j},G_{t,k-1}^{i}x_{i}\rangle\right\rVert^{2}
\displaystyle\leq 3η2L2k=1LFt,k+1i[Ft,k+1i](WtiΦi)xi2Gt,k1ixi4+3η42Δt,ixi2\displaystyle\frac{3\eta^{2}L}{2}\sum_{k=1}^{L}\left\lVert F_{t,k+1}^{i}\left[F_{t,k+1}^{i}\right]^{\top}(W_{t}^{i}-\Phi_{i})x_{i}\right\rVert^{2}\left\lVert G_{t,k-1}^{i}x_{i}\right\rVert^{4}+\frac{3\eta^{4}}{2}\left\lVert\Delta_{t,i}x_{i}\right\rVert^{2}
+3η2Ln2k=1LjiFt,k+1i[Ft,k+1j](WtjΦj)xj2|Gt,k1jxj,Gt,k1ixi|2.\displaystyle+\frac{3\eta^{2}Ln}{2}\sum_{k=1}^{L}\sum_{j\neq i}\left\lVert F_{t,k+1}^{i}\left[F_{t,k+1}^{j}\right]^{\top}(W_{t}^{j}-\Phi_{j})x_{j}\right\rVert^{2}\left|\langle G_{t,k-1}^{j}x_{j},G_{t,k-1}^{i}x_{i}\rangle\right|^{2}. (29)

By using $\|Ax\|\leq\|A\|_{2}\|x\|$ for any matrix $A\in\mathbb{R}^{d\times d}$ and vector $x\in\mathbb{R}^{d}$, we have

Ft,k+1i[Ft,k+1i](WtiΦi)xi2\displaystyle\left\lVert F_{t,k+1}^{i}\left[F_{t,k+1}^{i}\right]^{\top}(W_{t}^{i}-\Phi_{i})x_{i}\right\rVert^{2}\leq Ft,k+1i[Ft,k+1i]22(WtiΦi)xi2\displaystyle\left\lVert F_{t,k+1}^{i}\left[F_{t,k+1}^{i}\right]^{\top}\right\rVert_{2}^{2}\left\lVert(W_{t}^{i}-\Phi_{i})x_{i}\right\rVert^{2}
\displaystyle\leq λmax2(Ft,k+1i[Ft,k+1i])(WtiΦi)xi2,\displaystyle\lambda^{2}_{\max}\left(F_{t,k+1}^{i}\left[F_{t,k+1}^{i}\right]^{\top}\right)\left\lVert(W_{t}^{i}-\Phi_{i})x_{i}\right\rVert^{2}, (30)

and

k=1LjiFt,k+1i[Ft,k+1j](WtjΦj)xj2|Gt,k1jxj,Gt,k1ixi|2\displaystyle\sum_{k=1}^{L}\sum_{j\neq i}\left\lVert F_{t,k+1}^{i}\left[F_{t,k+1}^{j}\right]^{\top}(W_{t}^{j}-\Phi_{j})x_{j}\right\rVert^{2}\left|\langle G_{t,k-1}^{j}x_{j},G_{t,k-1}^{i}x_{i}\rangle\right|^{2}
\displaystyle\leq k=1Lji(WtjΦj)xj2Ft,k+1i[Ft,k+1j]22|Gt,k1jxj,Gt,k1ixi|2\displaystyle\sum_{k=1}^{L}\sum_{j\neq i}\left\lVert(W_{t}^{j}-\Phi_{j})x_{j}\right\rVert^{2}\left\lVert F_{t,k+1}^{i}\left[F_{t,k+1}^{j}\right]^{\top}\right\rVert_{2}^{2}\left|\langle G_{t,k-1}^{j}x_{j},G_{t,k-1}^{i}x_{i}\rangle\right|^{2}
=\displaystyle= k=1Lji(WtjΦj)xj2Ξt,ki,j22.\displaystyle\sum_{k=1}^{L}\sum_{j\neq i}\left\lVert(W_{t}^{j}-\Phi_{j})x_{j}\right\rVert^{2}\left\lVert\Xi_{t,k}^{i,j}\right\rVert_{2}^{2}. (31)

Hence, plugging (30) and (31) into (29), we have

12i=1n(Wt+1iWti)xi2\displaystyle\frac{1}{2}\sum_{i=1}^{n}\left\lVert(W_{t+1}^{i}-W_{t}^{i})x_{i}\right\rVert^{2}
\displaystyle\leq 3η2L2i=1n(WtiΦi)xi2(k=1Lλmax2(Ft,k+1i[Ft,k+1i])Gt,k1ixi4)+3η42Δt,ixi2\displaystyle\frac{3\eta^{2}L}{2}\sum_{i=1}^{n}\left\lVert(W_{t}^{i}-\Phi_{i})x_{i}\right\rVert^{2}\left(\sum_{k=1}^{L}\lambda^{2}_{\max}\left(F_{t,k+1}^{i}\left[F_{t,k+1}^{i}\right]^{\top}\right)\left\lVert G_{t,k-1}^{i}x_{i}\right\rVert^{4}\right)+\frac{3\eta^{4}}{2}\left\lVert\Delta_{t,i}x_{i}\right\rVert^{2}
+3η2Ln2i=1nk=1Lji(WtjΦj)xj2Ξt,ki,j22\displaystyle+\frac{3\eta^{2}Ln}{2}\sum_{i=1}^{n}\sum_{k=1}^{L}\sum_{j\neq i}\left\lVert(W_{t}^{j}-\Phi_{j})x_{j}\right\rVert^{2}\left\lVert\Xi_{t,k}^{i,j}\right\rVert_{2}^{2}
\displaystyle\leq 3η2L2i=1n(WtiΦi)xi2(k=1Lλmax2(Ft,k+1i[Ft,k+1i])Gt,k1ixi4)+3η42Δt,ixi2\displaystyle\frac{3\eta^{2}L}{2}\sum_{i=1}^{n}\left\lVert(W_{t}^{i}-\Phi_{i})x_{i}\right\rVert^{2}\left(\sum_{k=1}^{L}\lambda^{2}_{\max}\left(F_{t,k+1}^{i}\left[F_{t,k+1}^{i}\right]^{\top}\right)\left\lVert G_{t,k-1}^{i}x_{i}\right\rVert^{4}\right)+\frac{3\eta^{4}}{2}\left\lVert\Delta_{t,i}x_{i}\right\rVert^{2}
+3η2Ln2i=1n(WtiΦi)xi2k=1LjiΞt,ki,j22,\displaystyle+\frac{3\eta^{2}Ln}{2}\sum_{i=1}^{n}\left\lVert(W_{t}^{i}-\Phi_{i})x_{i}\right\rVert^{2}\sum_{k=1}^{L}\sum_{j\neq i}\left\lVert\Xi_{t,k}^{i,j}\right\rVert_{2}^{2}, (32)

where the last inequality uses

i=1nk=1Lji(WtjΦj)xj2Ξt,ki,j22=\displaystyle\sum_{i=1}^{n}\sum_{k=1}^{L}\sum_{j\neq i}\left\lVert(W_{t}^{j}-\Phi_{j})x_{j}\right\rVert^{2}\left\lVert\Xi_{t,k}^{i,j}\right\rVert_{2}^{2}= k=1Li=1nji(WtjΦj)xj2Ξt,ki,j22\displaystyle\sum_{k=1}^{L}\sum_{i=1}^{n}\sum_{j\neq i}\left\lVert(W_{t}^{j}-\Phi_{j})x_{j}\right\rVert^{2}\left\lVert\Xi_{t,k}^{i,j}\right\rVert_{2}^{2}
=Lemma6\displaystyle\overset{\text{Lemma}~{}\ref{tech:lemm:1}}{=} k=1Li=1n(WtiΦi)xi2jiΞt,kj,i22\displaystyle\sum_{k=1}^{L}\sum_{i=1}^{n}\left\lVert(W_{t}^{i}-\Phi_{i})x_{i}\right\rVert^{2}\sum_{j\neq i}\left\lVert\Xi_{t,k}^{j,i}\right\rVert_{2}^{2}
=\displaystyle= i=1n(WtiΦi)xi2k=1LjiΞt,kj,i22\displaystyle\sum_{i=1}^{n}\left\lVert(W_{t}^{i}-\Phi_{i})x_{i}\right\rVert^{2}\sum_{k=1}^{L}\sum_{j\neq i}\left\lVert\Xi_{t,k}^{j,i}\right\rVert_{2}^{2}
=\displaystyle= i=1n(WtiΦi)xi2k=1LjiΞt,ki,j22,\displaystyle\sum_{i=1}^{n}\left\lVert(W_{t}^{i}-\Phi_{i})x_{i}\right\rVert^{2}\sum_{k=1}^{L}\sum_{j\neq i}\left\lVert\Xi_{t,k}^{i,j}\right\rVert_{2}^{2},

where the last equality is due to Ξt,kj,i2=Ξt,ki,j2\left\lVert\Xi_{t,k}^{j,i}\right\rVert_{2}=\left\lVert\Xi_{t,k}^{i,j}\right\rVert_{2}. We complete the proof by using Young’s inequality with ai>0a_{i}>0 to upper bound

Δt,i,(WtiΦi)xi12aiΔt,ixi2+ai2(WtiΦi)xi2\displaystyle\langle\Delta_{t,i},(W_{t}^{i}-\Phi_{i})x_{i}\rangle\leq\frac{1}{2a_{i}}\left\lVert\Delta_{t,i}x_{i}\right\rVert^{2}+\frac{a_{i}}{2}\left\lVert(W_{t}^{i}-\Phi_{i})x_{i}\right\rVert^{2} (33)

and using the following property of the eigenvalues of the Kronecker product of two symmetric positive semi-definite matrices:

\displaystyle\lambda_{\min}\left(A_{t,k}^{i}\right)=\lambda_{\min}\left(F_{t,k+1}^{i}\left[F_{t,k+1}^{i}\right]^{\top}\right)\lambda_{\min}\left(\left[G_{t,k-1}^{i}\right]^{\top}G_{t,k-1}^{i}\right). (34) ∎
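The two linear-algebra facts used in steps (a)-(d) and in (34), namely $\mbox{vec}(AXB)=(B^{\top}\otimes A)\mbox{vec}(X)$ and $\lambda_{\min}(P\otimes Q)=\lambda_{\min}(P)\lambda_{\min}(Q)$ for symmetric positive semi-definite $P,Q$, can be verified numerically; the NumPy snippet below (with column-major vectorization) is only a sanity check of these standard identities.

# Numerical sanity check (NumPy) of the identities used above:
#   (i)  vec(A X B) = (B^T kron A) vec(X), with column-stacking vec
#   (ii) lambda_min(P kron Q) = lambda_min(P) * lambda_min(Q) for PSD P, Q
import numpy as np

rng = np.random.default_rng(0)
A, X, B = rng.standard_normal((3, 4)), rng.standard_normal((4, 5)), rng.standard_normal((5, 2))

vec = lambda M: M.flatten(order="F")                 # column-stacking vectorization
lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)
print("vec identity error:", np.max(np.abs(lhs - rhs)))      # ~1e-15

# PSD matrices P = G^T G and Q = F F^T, as in (23)
G, F = rng.standard_normal((6, 4)), rng.standard_normal((4, 6))
P, Q = G.T @ G, F @ F.T
lam_kron = np.linalg.eigvalsh(np.kron(P, Q)).min()
lam_prod = np.linalg.eigvalsh(P).min() * np.linalg.eigvalsh(Q).min()
print("eigenvalue identity error:", abs(lam_kron - lam_prod))  # ~0 up to round-off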

10.3 Proof for Lemma 3

Proof.

Using the assumptions in (14), (15), (16) and (17), and setting $a_{i}=\beta^{4}L^{2}$, we obtain

Λi+η2ai\displaystyle\Lambda_{i}+\eta^{2}a_{i}\leq 2ηLα2+2ηL(n1)γβ2+3η2L2β2βx+3η2L2n(n1)γ2β4+η2L2β4\displaystyle-2\eta L\alpha^{2}+2\eta L(n-1)\gamma\beta^{2}+3\eta^{2}L^{2}\beta^{2}\beta_{x}+3\eta^{2}L^{2}n(n-1)\gamma^{2}\beta^{4}+\eta^{2}L^{2}\beta^{4}
\displaystyle\leq 2ηLα22ηLγβ2+ηLα2+3η2L2β2βx3η2L2nγ2β4+34η2L2α2+η2L2β4\displaystyle-2\eta L\alpha^{2}-2\eta L\gamma\beta^{2}+\eta L\alpha^{2}+3\eta^{2}L^{2}\beta^{2}\beta_{x}-3\eta^{2}L^{2}n\gamma^{2}\beta^{4}+\frac{3}{4}\eta^{2}L^{2}\alpha^{2}+\eta^{2}L^{2}\beta^{4}
\displaystyle\leq ηLα2+3η2L2β2βx+34η2L2α2+η2L2β4.\displaystyle-\eta L\alpha^{2}+3\eta^{2}L^{2}\beta^{2}\beta_{x}+\frac{3}{4}\eta^{2}L^{2}\alpha^{2}+\eta^{2}L^{2}\beta^{4}.

By choosing step size η\eta as ηmin(α212β2βxL,13L,α24β4L,1β2L)\eta\leq\min\left(\frac{\alpha^{2}}{12\beta^{2}\beta_{x}L},\frac{1}{3L},\frac{\alpha^{2}}{4\beta^{4}L},\frac{1}{\beta^{2}L}\right), we have Λi3ηα2L4\Lambda_{i}\leq-\frac{3\eta\alpha^{2}L}{4}, and therefore

(Wt+1)(Wt)3ηα2L4(Wt)+2η2β4L2i=1nΔt,ixi2.\displaystyle\ell(W_{t+1})-\ell(W_{t})\leq-\frac{3\eta\alpha^{2}L}{4}\ell(W_{t})+\frac{2\eta^{2}}{\beta^{4}L^{2}}\sum_{i=1}^{n}\left\lVert\Delta_{t,i}x_{i}\right\rVert^{2}. (35)

Next, we need to bound $\left\lVert\Delta_{t,i}x_{i}\right\rVert^{2}$. Since $\left\lVert\Delta_{t,i}x_{i}\right\rVert\leq\left\lVert\Delta_{t,i}\right\rVert_{2}\left\lVert x_{i}\right\rVert=\left\lVert\Delta_{t,i}\right\rVert_{2}$, it suffices to bound $\left\lVert\Delta_{t,i}\right\rVert_{2}$. To this end, we first bound $\left\lVert\nabla_{k}\ell(W_{t})\right\rVert_{2}^{2}$, i.e.

k(Wt)22\displaystyle\left\lVert\nabla_{k}\ell(W_{t})\right\rVert_{2}^{2}
=(a)\displaystyle\overset{(a)}{=} k(Wt)k(Wt)2\displaystyle\left\lVert\nabla_{k}\ell(W_{t})^{\top}\nabla_{k}\ell(W_{t})\right\rVert_{2}
=(9)\displaystyle\overset{(\ref{eqn:grad})}{=} (i=1n[Ft,k+1i](WtiΦi)xixi[Gt,k1i])(i=1n[Ft,k+1i](WtiΦi)xixi[Gt,k1i])2\displaystyle\left\|\left(\sum_{i=1}^{n}\left[F^{i}_{t,k+1}\right]^{\top}(W_{t}^{i}-\Phi_{i})x_{i}x_{i}^{\top}\left[G^{i}_{t,k-1}\right]^{\top}\right)^{\top}\left(\sum_{i=1}^{n}\left[F^{i}_{t,k+1}\right]^{\top}(W_{t}^{i}-\Phi_{i})x_{i}x_{i}^{\top}\left[G^{i}_{t,k-1}\right]^{\top}\right)\right\|_{2}
=(b)\displaystyle\overset{(b)}{=} (i=1nGt,k1ixixi(WtiΦi)Ft,k+1i)(i=1n[Ft,k+1i](WtiΦi)xixi[Gt,k1i])2\displaystyle\left\|\left(\sum_{i=1}^{n}G^{i}_{t,k-1}x_{i}x_{i}^{\top}(W_{t}^{i}-\Phi_{i})^{\top}F^{i}_{t,k+1}\right)\left(\sum_{i=1}^{n}\left[F^{i}_{t,k+1}\right]^{\top}(W_{t}^{i}-\Phi_{i})x_{i}x_{i}^{\top}\left[G^{i}_{t,k-1}\right]^{\top}\right)\right\|_{2}
(c)\displaystyle\overset{(c)}{\leq} i=1nGt,k1ixixi(WtiΦi)Ft,k+1i[Ft,k+1i](WtiΦi)xixi[Gt,k1i]2\displaystyle\sum_{i=1}^{n}\left\|G^{i}_{t,k-1}x_{i}x_{i}^{\top}(W_{t}^{i}-\Phi_{i})^{\top}F^{i}_{t,k+1}\left[F^{i}_{t,k+1}\right]^{\top}(W_{t}^{i}-\Phi_{i})x_{i}x_{i}^{\top}\left[G^{i}_{t,k-1}\right]^{\top}\right\|_{2}
+i=1njiGt,k1ixixi(WtiΦi)Ft,k+1i[Ft,k+1j](WtjΦj)xjxj[Gt,k1j]2,\displaystyle+\sum_{i=1}^{n}\sum_{j\neq i}\left\|G^{i}_{t,k-1}x_{i}x_{i}^{\top}(W_{t}^{i}-\Phi_{i})^{\top}F^{i}_{t,k+1}\left[F^{j}_{t,k+1}\right]^{\top}(W_{t}^{j}-\Phi_{j})x_{j}x_{j}^{\top}\left[G^{j}_{t,k-1}\right]^{\top}\right\|_{2}, (36)

where (a) is due to A22=AA2\|A\|_{2}^{2}=\|A^{\top}A\|_{2} for matrix Am×mA\in\mathbb{R}^{m\times m}; (b) uses the facts that (A+B)=A+B(A+B)^{\top}=A^{\top}+B^{\top} and (AB)=BA(AB)^{\top}=B^{\top}A^{\top} for matrices A,Bm×mA,B\in\mathbb{R}^{m\times m}; (c) uses A+B2A2+B2\|A+B\|_{2}\leq\|A\|_{2}+\|B\|_{2} and A2=A2\|A\|_{2}=\|A^{\top}\|_{2}. Since AA2=A22AF2=tr(AA)\|A^{\top}A\|_{2}=\|A\|_{2}^{2}\leq\|A\|_{F}^{2}=\text{tr}(A^{\top}A), then

Gt,k1ixixi(WtiΦi)Ft,k+1i[Ft,k+1i](WtiΦi)xixi[Gt,k1i]2\displaystyle\left\|G^{i}_{t,k-1}x_{i}x_{i}^{\top}(W_{t}^{i}-\Phi_{i})^{\top}F^{i}_{t,k+1}\left[F^{i}_{t,k+1}\right]^{\top}(W_{t}^{i}-\Phi_{i})x_{i}x_{i}^{\top}\left[G^{i}_{t,k-1}\right]^{\top}\right\|_{2}
\displaystyle\leq Tr(Gt,k1ixixi(WtiΦi)Ft,k+1i[Ft,k+1i](WtiΦi)xixi[Gt,k1i])\displaystyle\text{Tr}\left(G^{i}_{t,k-1}x_{i}x_{i}^{\top}(W_{t}^{i}-\Phi_{i})^{\top}F^{i}_{t,k+1}\left[F^{i}_{t,k+1}\right]^{\top}(W_{t}^{i}-\Phi_{i})x_{i}x_{i}^{\top}\left[G^{i}_{t,k-1}\right]^{\top}\right)
=(a)\displaystyle\overset{(a)}{=} Tr([Gt,k1i]Gt,k1ixixi(WtiΦi)Fk+1i[Fk+1i](WtiΦi)xixi)\displaystyle\text{Tr}\left(\left[G_{t,k-1}^{i}\right]^{\top}G_{t,k-1}^{i}x_{i}x_{i}^{\top}(W_{t}^{i}-\Phi_{i})^{\top}F_{k+1}^{i}\left[F_{k+1}^{i}\right]^{\top}\left(W_{t}^{i}-\Phi_{i}\right)x_{i}x_{i}^{\top}\right)
=(b)\displaystyle\overset{(b)}{=} vec((WtiΦi)xixi)vec(([Gt,k1i]Gt,k1ixixi(WtiΦi)Fk+1i[Fk+1i]))\displaystyle\mbox{vec}^{\top}(\left(W_{t}^{i}-\Phi_{i}\right)x_{i}x_{i}^{\top})\mbox{vec}\left(\left(\left[G_{t,k-1}^{i}\right]^{\top}G_{t,k-1}^{i}x_{i}x_{i}^{\top}(W_{t}^{i}-\Phi_{i})^{\top}F_{k+1}^{i}\left[F_{k+1}^{i}\right]^{\top}\right)^{\top}\right)
=(c)\displaystyle\overset{(c)}{=} vec((WtiΦi)xixi)vec(Fk+1i[Fk+1i](WtiΦi)xixi[Gt,k1i]Gt,k1i)\displaystyle\mbox{vec}^{\top}(\left(W_{t}^{i}-\Phi_{i}\right)x_{i}x_{i}^{\top})\mbox{vec}\left(F_{k+1}^{i}\left[F_{k+1}^{i}\right]^{\top}(W_{t}^{i}-\Phi_{i})x_{i}x_{i}^{\top}\left[G_{t,k-1}^{i}\right]^{\top}G_{t,k-1}^{i}\right)
=(d)\displaystyle\overset{(d)}{=} vec((WtiΦi)xixi)([Gt,k1i]Gt,k1)(Fk+1i[Fk+1i])vec((WtiΦi)xixi)\displaystyle\mbox{vec}^{\top}(\left(W_{t}^{i}-\Phi_{i}\right)x_{i}x_{i}^{\top})\left(\left[G_{t,k-1}^{i}\right]^{\top}G_{t,k-1}\right)\otimes\left(F_{k+1}^{i}\left[F_{k+1}^{i}\right]^{\top}\right)\mbox{vec}\left((W_{t}^{i}-\Phi_{i})x_{i}x_{i}^{\top}\right)
\displaystyle\leq [Gt,k1i]Gt,k12Fk+1i[Fk+1i]2(WtiΦi)xi2\displaystyle{\left\lVert\left[G_{t,k-1}^{i}\right]^{\top}G_{t,k-1}\right\rVert_{2}\left\lVert F_{k+1}^{i}\left[F_{k+1}^{i}\right]^{\top}\right\rVert_{2}\left\lVert\left(W_{t}^{i}-\Phi_{i}\right)x_{i}\right\rVert^{2}}
(15)\displaystyle\overset{(\ref{eqn:eig-upper-bound})}{\leq} β2(WtiΦi)xi2,\displaystyle\beta^{2}\left\lVert\left(W_{t}^{i}-\Phi_{i}\right)x_{i}\right\rVert^{2}, (37)

where for any matrices $A,B,X$: (a) uses $\text{Tr}(AB)=\text{Tr}(BA)$; (b) uses $\text{Tr}(AB)=\mbox{vec}^{\top}(B)\mbox{vec}(A^{\top})$; (c) uses $(AB)^{\top}=B^{\top}A^{\top}$; (d) uses $\mbox{vec}(AXB)=\left(B^{\top}\otimes A\right)\mbox{vec}(X)$. For the cross terms in (36), we similarly have

i=1njiGt,k1ixixi(WtiΦi)Ft,k+1i[Ft,k+1j](WtjΦj)xjxj[Gt,k1j]2\displaystyle\sum_{i=1}^{n}\sum_{j\neq i}\left\|G^{i}_{t,k-1}x_{i}x_{i}^{\top}(W_{t}^{i}-\Phi_{i})^{\top}F^{i}_{t,k+1}\left[F^{j}_{t,k+1}\right]^{\top}(W_{t}^{j}-\Phi_{j})x_{j}x_{j}^{\top}\left[G^{j}_{t,k-1}\right]^{\top}\right\|_{2}
=(a)\displaystyle\overset{(a)}{=} i=1nji|xi(WtiΦi)Ft,k+1i[Ft,k+1j](WtjΦj)xj|Gt,k1ixixj[Gt,k1j]2\displaystyle\sum_{i=1}^{n}\sum_{j\neq i}\left|x_{i}^{\top}(W_{t}^{i}-\Phi_{i})^{\top}F^{i}_{t,k+1}\left[F^{j}_{t,k+1}\right]^{\top}(W_{t}^{j}-\Phi_{j})x_{j}\right|\left\|G^{i}_{t,k-1}x_{i}x_{j}^{\top}\left[G^{j}_{t,k-1}\right]^{\top}\right\|_{2}
(b)\displaystyle\overset{(b)}{\leq} i=1nji(WtiΦi)xiFt,k+1i[Ft,k+1j]2(WtjΦj)xjGt,k1ixiGt,k1jxj\displaystyle\sum_{i=1}^{n}\sum_{j\neq i}\left\|(W_{t}^{i}-\Phi_{i})x_{i}\right\|\left\|F^{i}_{t,k+1}\left[F^{j}_{t,k+1}\right]^{\top}\right\|_{2}\left\|(W_{t}^{j}-\Phi_{j})x_{j}\right\|\left\|G^{i}_{t,k-1}x_{i}\right\|\left\|G^{j}_{t,k-1}x_{j}\right\|
(16)\displaystyle\overset{(\ref{eqn:cross-upper-bound})}{\leq} γβ2i=1nji(WtiΦi)xi(WtjΦj)xj\displaystyle\gamma\beta^{2}\sum_{i=1}^{n}\sum_{j\neq i}\left\|(W_{t}^{i}-\Phi_{i})x_{i}\right\|\left\|(W_{t}^{j}-\Phi_{j})x_{j}\right\|
(c)\displaystyle\overset{(c)}{\leq} γβ22i=1nji((WtiΦi)xi2+(WtjΦj)xj2)\displaystyle\frac{\gamma\beta^{2}}{2}\sum_{i=1}^{n}\sum_{j\neq i}\left(\left\|(W_{t}^{i}-\Phi_{i})x_{i}\right\|^{2}+\left\|(W_{t}^{j}-\Phi_{j})x_{j}\right\|^{2}\right)
\displaystyle\leq nγβ2i=1n(WtiΦi)xi2\displaystyle n\gamma\beta^{2}\sum_{i=1}^{n}\left\|(W_{t}^{i}-\Phi_{i})x_{i}\right\|^{2}
=\displaystyle= nγβ2(Wt),\displaystyle n\gamma\beta^{2}\ell(W_{t}), (38)

where (a) uses $\|cA\|_{2}=|c|\|A\|_{2}$ for a matrix $A$ and a scalar $c$; (b) uses the facts that $|x^{\top}Ay|\leq\|x\|\|Ay\|\leq\|x\|\|A\|_{2}\|y\|$ and $\|xy^{\top}\|_{2}=\|x\otimes y^{\top}\|_{2}=\|x\|\|y\|$; (c) uses Young's inequality. Therefore, by (36), (37) and (38) we have

k(Wt)22(β2+nγβ2)(Wt)\displaystyle\left\lVert\nabla_{k}\ell(W_{t})\right\rVert_{2}^{2}\leq(\beta^{2}+n\gamma\beta^{2})\ell(W_{t}) (39)

We can rewrite Δt,i\Delta_{t,i} in (12) as

Δt,i=\displaystyle\Delta_{t,i}= s=2L(η)s2Lk1>k2>>ks1Ft,k1+1i(=1sk(Wt)Zk1,k+1t,i)Gt,ks1i, where Zks1,ks+1t,i=I.\displaystyle\sum_{s=2}^{L}(-\eta)^{s-2}\sum_{L\geq k_{1}>k_{2}>\ldots>k_{s}\geq 1}F_{t,k_{1}+1}^{i}\left(\prod_{\ell=1}^{s}\nabla_{k_{\ell}}\ell(W_{t})Z^{t,i}_{k_{\ell}-1,k_{\ell+1}}\right)G^{i}_{t,k_{s}-1},{\text{~{}where~{}}Z_{k_{s}-1,k_{s+1}}^{t,i}=I.}

Therefore, according to Lemma 7, for any $\theta\in(0,1/2)$, with probability at least $1-4L^{2}\exp(-\theta^{2}m/[16L^{2}])$, for any $1\leq k_{a}<k_{b}\leq L$ we have

Zka,kbt24Leθ/2θ1/2\|Z_{k_{a},k_{b}}^{t}\|_{2}\leq 4\sqrt{L}e^{\theta/2}\theta^{-1/2}

and, with probability at least $1-L^{2}\sqrt{m}\exp\left(-\theta m/[4L]+6\sqrt{m}\right)$,

Δt,i2(a)\displaystyle\|\Delta_{t,i}\|_{2}\overset{(a)}{\leq} s=2Lηs2Lk1>k2>>ks1Ft,k1+1i2Gt,ks1i2(=1sk(Wt)2Zk1,k+1t,i2)\displaystyle\sum_{s=2}^{L}\eta^{s-2}\sum_{L\geq k_{1}>k_{2}>\ldots>k_{s}\geq 1}\|F_{t,k_{1}+1}^{i}\|_{2}\|G^{i}_{t,k_{s}-1}\|_{2}\left(\prod_{\ell=1}^{s}\|\nabla_{k_{\ell}}\ell(W_{t})\|_{2}\|Z^{t,i}_{k_{\ell}-1,k_{\ell+1}}\|_{2}\right)
(b)\displaystyle\overset{(b)}{\leq} s=2Lηs2(sL)β((β2+nγβ2)(Wt)×4Leθ/2θ1/2)s\displaystyle\sum_{s=2}^{L}\eta^{s-2}(_{s}^{L})\beta\left(\sqrt{(\beta^{2}+n\gamma\beta^{2})\ell(W_{t})}\times 4\sqrt{L}e^{\theta/2}\theta^{-1/2}\right)^{s}
(c)\displaystyle\overset{(c)}{\leq} s=2Lβηs2(22Leθ/2θ1/2β(Wt))s\displaystyle\sum_{s=2}^{L}\beta\eta^{s-2}\left(2\sqrt{2}\sqrt{L}e^{\theta/2}\theta^{-1/2}\beta\sqrt{\ell(W_{t})}\right)^{s}
(d)\displaystyle\overset{(d)}{\leq} 8Leθθ1β3(Wt)s=0L2ηs(22Leθ/2θ1/2β0)s\displaystyle 8Le^{\theta}\theta^{-1}\beta^{3}\ell(W_{t})\sum_{s=0}^{L-2}\eta^{s}\left(2\sqrt{2}\sqrt{L}e^{\theta/2}\theta^{-1/2}\beta\sqrt{\ell_{0}}\right)^{s}
(e)\displaystyle\overset{(e)}{\leq} 8Leθθ1β3(Wt)122ηLeθ/2θ1/2β0\displaystyle\frac{8Le^{\theta}\theta^{-1}\beta^{3}\ell(W_{t})}{1-2\sqrt{2}\eta\sqrt{L}e^{\theta/2}\theta^{-1/2}\beta\sqrt{\ell_{0}}} (40)

where (a) uses $\|AB\|_{2}\leq\|A\|_{2}\|B\|_{2}$ and $\|A+B\|_{2}\leq\|A\|_{2}+\|B\|_{2}$ for any matrices $A,B\in\mathbb{R}^{m\times m}$; (b) uses (15), (39) and the bound on $\|Z_{k_{a},k_{b}}^{t}\|_{2}$ above; (c) uses the facts that $(_{s}^{L})=\frac{L!}{(L-s)!s!}\leq\frac{L!}{(L-s)!}=L(L-1)\dots(L-s+1)\leq L^{s}$ and (17) with $\alpha\leq\beta$; (d) uses $\ell(W_{t})\leq\ell_{0}$; (e) uses $\eta<1/(2\sqrt{2}\sqrt{L}e^{\theta/2}\theta^{-1/2}\beta\sqrt{\ell_{0}})$. Therefore, by (35) and (40) we have

(Wt+1)(Wt)3ηα2L4(Wt)+128nη2e2θθ2β22(Wt)122ηLeθ/2θ1/2β0.\displaystyle\ell(W_{t+1})-\ell(W_{t})\leq-\frac{3\eta\alpha^{2}L}{4}\ell(W_{t})+\frac{128n\eta^{2}e^{2\theta}\theta^{-2}\beta^{2}\ell^{2}(W_{t})}{1-2\sqrt{2}\eta\sqrt{L}e^{\theta/2}\theta^{-1/2}\beta\sqrt{\ell_{0}}}.

We complete the proof by using the selection of η\eta. ∎

10.4 Proof for Theorem 1

Proof.

We start by restricting the network's width and depth to values for which assumption (17) holds, that is, $\beta^{2}\gamma n\leq\alpha^{2}/2$. Revisiting (52), we need the condition

γ=C′′((L3/2τ)2+δ(56)L+1m1/2)=O(1n)\displaystyle\gamma=C^{\prime\prime}\left(\left(L^{3/2}\tau\right)^{2}+\delta\left(\frac{5}{6}\right)^{L}+\frac{1}{m^{1/2}}\right)=O\left(\frac{1}{n}\right)

To meet the above condition, all of the following should hold:

L3/2τ=O(1n),L=Ω(logn),m=Ω(n2).\displaystyle L^{3/2}\tau=O\left(\frac{1}{\sqrt{n}}\right),\quad L=\Omega(\log n),\quad m=\Omega(n^{2}). (41)

We thus need to bound τ\tau. According to the setting, we know the number of iterations is

T=2ηα2Llog0ϵ\displaystyle T=\frac{2}{\eta\alpha^{2}L}\log\frac{\ell_{0}}{\epsilon} (42)
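For completeness, here is how (42) suffices to reach loss $\epsilon$, assuming Lemma 3 yields a per-step contraction of the form $\ell(W_{t+1})\leq\left(1-\frac{\eta\alpha^{2}L}{2}\right)\ell(W_{t})$ (the exact constant in the contraction is our reading of Lemma 3 and does not affect the scaling): using $1-x\leq e^{-x}$,

\displaystyle\ell(W_{T})\leq\left(1-\frac{\eta\alpha^{2}L}{2}\right)^{T}\ell_{0}\leq\exp\left(-\frac{\eta\alpha^{2}L\,T}{2}\right)\ell_{0}\leq\epsilon\quad\text{whenever}\quad T\geq\frac{2}{\eta\alpha^{2}L}\log\frac{\ell_{0}}{\epsilon}.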

On the other hand,

τηt=1T1maxk[L]k(Wt)ηβt=1T12(Wt)ηβT20=2β20α2Llog0ϵ\displaystyle\tau\leq\eta\sum_{t=1}^{T-1}\max_{k\in[L]}\|\nabla_{k}\ell(W_{t})\|\leq\eta\beta\sum_{t=1}^{T-1}\sqrt{2\ell(W_{t})}\leq\eta\beta T\sqrt{2\ell_{0}}=\frac{2\beta\sqrt{2\ell_{0}}}{\alpha^{2}L}\log\frac{\ell_{0}}{\epsilon}

where the second inequality uses (39), which relies on (9) and (16). Similarly to (51), the initial output of the network can be bounded with high probability in order to bound $\ell_{0}$: with probability at least $1-4L^{2}\exp(-\theta^{2}m/[8L^{2}]+3d_{x}+3d_{y})$,

\displaystyle\ell_{0}=\frac{1}{2}\sum_{i=1}^{n}\left\lVert(W_{0}^{i}-\Phi_{i})x_{i}\right\rVert^{2}\leq\frac{1}{2}\sum_{i=1}^{n}\left(\left\lVert W_{0}^{i}x_{i}\right\rVert^{2}+\left\lVert y_{i}\right\rVert^{2}\right)=\frac{n}{2}\left(\frac{3m}{d_{x}}e^{\theta}+\frac{m}{d_{x}}\right)\leq\frac{4mn}{d_{x}} (43)

where we also used Assumption (2) and $\theta<0.5$. We now extract the required width from (41):

L3/2τ=O~(0dxdyLm)=O(1n)\displaystyle L^{3/2}\tau=\tilde{O}\left(\frac{\sqrt{\ell_{0}d_{x}d_{y}L}}{m}\right)=O\left(\frac{1}{\sqrt{n}}\right) (44)

Therefore, from (44) and (43) we get

m=Ω~(n2Ldy)\displaystyle m=\tilde{\Omega}\left({n^{2}Ld_{y}}\right) (45)

As for the learning rate, revisiting (18) we have

η\displaystyle\eta =\displaystyle= min(α212β2βxL,13L,α24β4L,1β2L,142Leθ/2θ1/2β0,α21024ne2θθ2β20)\displaystyle\min\left(\frac{\alpha^{2}}{12\beta^{2}\beta_{x}L},\frac{1}{3L},\frac{\alpha^{2}}{4\beta^{4}L},\frac{1}{\beta^{2}L},\frac{1}{4\sqrt{2}\sqrt{L}e^{\theta/2}\theta^{-1/2}\beta\sqrt{\ell_{0}}},\frac{\alpha^{2}}{1024ne^{2\theta}\theta^{-2}\beta^{2}\ell_{0}}\right)
=\displaystyle= min(O(dx1/2m1/2L),O(1L),O(dxdym2L),O(dxdym2L),O(dxdy1/2m3/2n1/2L1/2),O(dxmn2))\displaystyle\min\left(O\left({\frac{d_{x}^{1/2}}{m^{1/2}L}}\right),O\left({\frac{1}{L}}\right),O\left({\frac{d_{x}d_{y}}{m^{2}L}}\right),O\left({\frac{d_{x}d_{y}}{m^{2}L}}\right),O\left({\frac{d_{x}d_{y}^{1/2}}{m^{3/2}n^{1/2}L^{1/2}}}\right),O\left({\frac{d_{x}}{mn^{2}}}\right)\right)
=(45)\displaystyle\overset{(\ref{eqn:min_width})}{=} min(O~(dx1/2nL3/2dy1/2),O~(1L),O~(dxn4L3dy),O~(dxn7/2L2dy),O~(dxn4Ldy))\displaystyle\min\left(\tilde{O}\left({\frac{d_{x}^{1/2}}{nL^{3/2}d_{y}^{1/2}}}\right),\tilde{O}\left({\frac{1}{L}}\right),\tilde{O}\left({\frac{d_{x}}{n^{4}L^{3}d_{y}}}\right),\tilde{O}\left({\frac{d_{x}}{n^{7/2}L^{2}d_{y}}}\right),\tilde{O}\left({\frac{d_{x}}{n^{4}Ld_{y}}}\right)\right)
=\displaystyle= O~(dxn4L3dy)\displaystyle\tilde{O}\left({\frac{d_{x}}{n^{4}L^{3}d_{y}}}\right)

The learning rate is therefore $\frac{1}{\beta^{2}L}$, and the corresponding number of iterations in (42) is

\displaystyle T=c_{T}\log\frac{\ell_{0}}{\epsilon}=c_{T}\log\left({\frac{n^{3}L}{d_{x}\epsilon}}\right)\quad,\quad c_{T}=2\cdot 81^{2} (46) ∎
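As a rough numerical illustration only, the scalings in (45), the resulting step size, and (46) can be evaluated for example problem sizes; all hidden constants and logarithmic factors in $\tilde{\Omega}$ and $\tilde{O}$ are set to $1$ here, which is an arbitrary simplification.

# Illustrative arithmetic only: evaluates the scalings of (45), the step size,
# and (46) with all hidden constants / log factors set to 1 (an arbitrary choice).
import math

n, L, d_x, d_y, eps = 10_000, 10, 14, 1, 1e-3   # example problem sizes
m   = n**2 * L * d_y                            # width requirement, Eq. (45), up to log factors
eta = d_x / (n**4 * L**3 * d_y)                 # step size, up to log factors
c_T = 2 * 81**2
T   = c_T * math.log(n**3 * L / (d_x * eps))    # iteration count, Eq. (46)

print(f"width m  ~ {m:.2e}")                    # ~1e9 for these sizes
print(f"step eta ~ {eta:.2e}")
print(f"iters T  ~ {T:.0f}")                    # logarithmic in n, L and 1/eps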

10.5 Proof for Theorem 3

Proof.

We start with the left-hand side, and for brevity denote $D^{i}_{k}=D_{k}$. We bound the difference for every $k\in[L]$ independently, and the final result follows:

Wkyp(xi,W¯)WkypNTK(xi,W1)F\displaystyle\left\lVert\nabla_{W_{k}}y_{p}(x_{i},\bar{W})-\nabla_{W_{k}}y_{p}^{\textrm{NTK}}(x_{i},W_{1})\right\rVert_{F} \displaystyle\leq BF(ZL,ki+ΣL,ki)(Zk1,1i+Σk1,1i)ZL,kiZk1,1iFCF\displaystyle\left\lVert B\right\rVert_{F}\left\lVert(Z^{i}_{L,k}+\Sigma^{i}_{L,k})(Z^{i}_{k-1,1}+\Sigma^{i}_{k-1,1})-Z^{i}_{L,k}Z^{i}_{k-1,1}\right\rVert_{F}\left\lVert C\right\rVert_{F}
\displaystyle\leq BF(ZL,kiΣk1,1iF+Zk1,1iΣL,kiF+Σk1,1iΣL,kiF)CF\displaystyle\left\lVert B\right\rVert_{F}(\left\lVert Z^{i}_{L,k}\Sigma^{i}_{k-1,1}\right\rVert_{F}+\left\lVert Z^{i}_{k-1,1}\Sigma^{i}_{L,k}\right\rVert_{F}+\left\lVert\Sigma^{i}_{k-1,1}\Sigma^{i}_{L,k}\right\rVert_{F})\left\lVert C\right\rVert_{F}

where we used the sub-multiplicativity of the Frobenius norm, with

Zk,ki\displaystyle Z^{i}_{k,k^{\prime}} =\displaystyle= DkiW1,kW1,k+1Dki\displaystyle D_{k}^{i}W_{1,k}\ldots W_{1,k^{\prime}+1}D_{k^{\prime}}^{i}
Σk,ki\displaystyle\Sigma^{i}_{k,k^{\prime}} =\displaystyle= s=1kkk1>k2>>ksZk,k1+1i=1sWkZk+1,k+1i\displaystyle\sum_{s=1}^{k-k^{\prime}}\sum_{k_{1}>k_{2}>\ldots>k_{s}}Z^{i}_{k,k_{1}+1}\prod_{\ell=1}^{s}W^{\prime}_{k_{\ell}}Z^{i}_{k_{\ell}+1,k_{\ell+1}}

According to Lemma 5, with probability larger than $1-81L^{3}\exp(-m/[108L])$, we have for all $k$,

Zk1,1iΣL,ki219Ls=1kk[19L3/2ξ]s(19L)2ξ119L3/2ξ.\displaystyle\left\lVert Z^{i}_{k-1,1}\Sigma^{i}_{L,k}\right\rVert_{2}\leq 19\sqrt{L}\sum_{s=1}^{k-k^{\prime}}\left[19L^{3/2}\xi\right]^{s}\leq\frac{(19L)^{2}\xi}{1-19L^{3/2}\xi}.

A similar analysis applies to the other terms, and therefore

(ZL,kiΣk1,1iF+Zk1,1iΣL,kiF+Σk1,1iΣL,kiF)3Zk1,1iΣL,kiF\displaystyle(\left\lVert Z^{i}_{L,k}\Sigma^{i}_{k-1,1}\right\rVert_{F}+\left\lVert Z^{i}_{k-1,1}\Sigma^{i}_{L,k}\right\rVert_{F}+\left\lVert\Sigma^{i}_{k-1,1}\Sigma^{i}_{L,k}\right\rVert_{F})\leq 3\left\lVert Z^{i}_{k-1,1}\Sigma^{i}_{L,k}\right\rVert_{F}

The terms $\left\lVert B\right\rVert^{2}_{F}$ and $\left\lVert C\right\rVert^{2}_{F}$ follow independent $\frac{2}{d_{y}}\chi^{2}_{md_{y}}$ and $\frac{2}{d_{x}}\chi^{2}_{md_{x}}$ distributions, respectively. Therefore, by the Laurent-Massart concentration bound [Laurent and Massart, 2000],

Pr(BF5m)2exp(mdy4),Pr(CF5m)2exp(mdx4)\displaystyle\Pr\left(\left\lVert B\right\rVert_{F}\geq\sqrt{5m}\right)\leq 2\exp\left(-\frac{md_{y}}{4}\right)\quad,\quad\Pr\left(\left\lVert C\right\rVert_{F}\geq\sqrt{5m}\right)\leq 2\exp\left(-\frac{md_{x}}{4}\right) (47)

We can conclude that, with probability at least $1-\exp(-\Omega(m))$,

Wkyp(xi,W¯)WkypNTK(xi,W1)FO(mL2ξ)=O~((mn)1/2dyL2)\displaystyle\left\lVert\nabla_{W_{k}}y_{p}(x_{i},\bar{W})-\nabla_{W_{k}}y_{p}^{\textrm{NTK}}(x_{i},W_{1})\right\rVert_{F}\leq O\left(mL^{2}\xi\right)=\tilde{O}\left((mn)^{1/2}d_{y}L^{2}\right)

This term grows with $m$ due to the unique normalization of the network. We now show that it is negligible with respect to the NTK gradient on the right-hand side. Using Theorem 5, with probability at least $1-\exp(-\Omega(m))$,

kypNTK(xi,W1)F2=[G1,k1iFt,k+1i]F2\displaystyle\left\lVert\nabla_{k}y_{p}^{\textrm{NTK}}(x_{i},W_{1})\right\rVert^{2}_{F}=\left\lVert\left[G^{i}_{1,k-1}F^{i}_{t,k+1}\right]^{\top}\right\rVert_{F}^{2} =\displaystyle= Tr(G1,k1iFt,k+1i[G1,k1iFt,k+1i])\displaystyle\operatorname{Tr}\left(G^{i}_{1,k-1}F^{i}_{t,k+1}\left[G^{i}_{1,k-1}F^{i}_{t,k+1}\right]^{\top}\right)
=\displaystyle= Tr([G1,k1i]G1,k1iFt,k+1i[Ft,k+1i])\displaystyle\operatorname{Tr}\left(\left[G^{i}_{1,k-1}\right]^{\top}G^{i}_{1,k-1}F^{i}_{t,k+1}\left[F^{i}_{t,k+1}\right]^{\top}\right)
=\displaystyle= i=1mλ([G1,k1i]G1,k1iFt,k+1i[Ft,k+1i])\displaystyle\sum_{i=1}^{m}\lambda\left(\left[G^{i}_{1,k-1}\right]^{\top}G^{i}_{1,k-1}F^{i}_{t,k+1}\left[F^{i}_{t,k+1}\right]^{\top}\right)
\displaystyle\geq mλmin([G1,k1i]G1,k1i)λmin(Ft,k+1i[Ft,k+1i])\displaystyle m\lambda_{\min}\left(\left[G^{i}_{1,k-1}\right]^{\top}G^{i}_{1,k-1}\right)\lambda_{\min}\left(F^{i}_{t,k+1}\left[F^{i}_{t,k+1}\right]^{\top}\right)
=\displaystyle= mα2\displaystyle m\alpha^{2}
=\displaystyle= m3144dxdy\displaystyle\frac{m^{3}}{144d_{x}d_{y}}

where we used the inequality of arithmetic and geometric means and the multiplicativity of the determinant for symmetric positive semi-definite $A,B$, i.e.,

i=1mλi(AB)mi=1mλi(AB)m=mi=1mλi(A)λi(B)mmλminm(A)λminm(B)m=mλmin(A)λmin(B)\displaystyle\sum_{i=1}^{m}\lambda_{i}\left(AB\right)\geq m\sqrt[m]{\prod_{i=1}^{m}\lambda_{i}\left(AB\right)}=m\sqrt[m]{\prod_{i=1}^{m}\lambda_{i}\left(A\right)\lambda_{i}\left(B\right)}\geq m\sqrt[m]{\lambda^{m}_{\min}\left(A\right)\lambda^{m}_{\min}\left(B\right)}=m\lambda_{\min}\left(A\right)\lambda_{\min}\left(B\right)

Combining both terms completes the proof. ∎
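The trace lower bound used in the last step, $\operatorname{Tr}(AB)\geq m\,\lambda_{\min}(A)\lambda_{\min}(B)$ for symmetric positive semi-definite $A,B\in\mathbb{R}^{m\times m}$, can be checked numerically; the snippet below is only a sanity check of this standard fact.

# Numerical check of Tr(AB) >= m * lambda_min(A) * lambda_min(B)
# for symmetric positive semi-definite A, B (the AM-GM step above).
import numpy as np

rng = np.random.default_rng(1)
m = 8
Ma, Mb = rng.standard_normal((m, m)), rng.standard_normal((m, m))
A, B = Ma @ Ma.T, Mb @ Mb.T                      # random PSD matrices

lhs = np.trace(A @ B)
rhs = m * np.linalg.eigvalsh(A).min() * np.linalg.eigvalsh(B).min()
print(lhs >= rhs, lhs, rhs)                      # the inequality holds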

10.6 Proof for Corollary 4

Proof.

We apply Theorem 3 to two examples $(x_{i},x_{j})$ to validate the given bound. Notice the following:

{ypNTK(xi,W1),ypNTK(xi,W1)=KpNTK(xi,xi)ypNTK(xi,W1),ypNTK(xj,W1)=KpNTK(xi,xj)\displaystyle\begin{cases}\langle\nabla y_{p}^{\textrm{NTK}}(x_{i},W_{1}),\nabla y_{p}^{\textrm{NTK}}(x_{i},W_{1})\rangle&=~{}~{}K_{p}^{\textrm{NTK}}(x_{i},x_{i})\\ \langle\nabla y_{p}^{\textrm{NTK}}(x_{i},W_{1}),\nabla y_{p}^{\textrm{NTK}}(x_{j},W_{1})\rangle&=~{}~{}K_{p}^{\textrm{NTK}}(x_{i},x_{j})\\ \end{cases}

We start from the left-hand side of equation (8):

|yp(xi,W¯),yp(xj,W¯)KpNTK(xi,xj)|\displaystyle|\langle\nabla y_{p}(x_{i},\bar{W}),\nabla y_{p}(x_{j},\bar{W})\rangle-K_{p}^{\textrm{NTK}}(x_{i},x_{j})|
=\displaystyle= |yp(xi,W¯),yp(xj,W¯)ypNTK(xi,W1),yp(xj,W¯)\displaystyle|\langle\nabla y_{p}(x_{i},\bar{W}),\nabla y_{p}(x_{j},\bar{W})\rangle-\langle\nabla y_{p}^{\textrm{NTK}}(x_{i},W_{1}),\nabla y_{p}(x_{j},\bar{W})\rangle
+ypNTK(xi,W1),yp(xj,W¯)ypNTK(xi,W1),ypNTK(xj,W1)|\displaystyle+\langle\nabla y_{p}^{\textrm{NTK}}(x_{i},W_{1}),\nabla y_{p}(x_{j},\bar{W})\rangle-\langle\nabla y_{p}^{\textrm{NTK}}(x_{i},W_{1}),\nabla y_{p}^{\textrm{NTK}}(x_{j},W_{1})\rangle|
\displaystyle\leq yp(xj,W¯)Fyp(xi,W¯)ypNTK(xi,W1)F\displaystyle\left\lVert\nabla y_{p}(x_{j},\bar{W})\right\rVert_{F}\left\lVert\nabla y_{p}(x_{i},\bar{W})-\nabla y_{p}^{\textrm{NTK}}(x_{i},W_{1})\right\rVert_{F}
+ypNTK(xi,W1)Fyp(xj,W¯)ypNTK(xj,W1)F\displaystyle+\left\lVert\nabla y_{p}^{\textrm{NTK}}(x_{i},W_{1})\right\rVert_{F}\left\lVert\nabla y_{p}(x_{j},\bar{W})-\nabla y_{p}^{\textrm{NTK}}(x_{j},W_{1})\right\rVert_{F}
\displaystyle\leq ypNTK(xi,W1)F(2ypNTK(xj,W1)F+yp(xj,W¯)ypNTK(xj,W1)F)\displaystyle\mathcal{R}\left\lVert\nabla y_{p}^{\textrm{NTK}}(x_{i},W_{1})\right\rVert_{F}\left({2\left\lVert\nabla y_{p}^{\textrm{NTK}}(x_{j},W_{1})\right\rVert_{F}+\left\lVert\nabla y_{p}(x_{j},\bar{W})-\nabla y_{p}^{\textrm{NTK}}(x_{j},W_{1})\right\rVert_{F}}\right)
\displaystyle\leq (2+2)KpNTK(xi,xi)KpNTK(xj,xj)\displaystyle\left({2\mathcal{R}+\mathcal{R}^{2}}\right)\sqrt{K_{p}^{\textrm{NTK}}(x_{i},x_{i})K_{p}^{\textrm{NTK}}(x_{j},x_{j})}

Since \mathcal{R}\ll 1, we have 2\mathcal{R}+\mathcal{R}^{2}<3\mathcal{R}, which completes the proof. ∎
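
The argument above is a generic perturbation bound. The sketch below illustrates it with random stand-in vectors a_0, b_0 playing the role of the NTK gradients and perturbations of relative size R; all names, sizes, and data are hypothetical and chosen only for illustration.

import numpy as np

rng = np.random.default_rng(2)
d, R = 1000, 0.05                       # R plays the role of the relative radius
for _ in range(200):
    a0, b0 = rng.normal(size=d), rng.normal(size=d)      # stand-in NTK gradients
    da = rng.normal(size=d); da *= R * np.linalg.norm(a0) / np.linalg.norm(da)
    db = rng.normal(size=d); db *= R * np.linalg.norm(b0) / np.linalg.norm(db)
    a, b = a0 + da, b0 + db                               # perturbed gradients
    lhs = abs(a @ b - a0 @ b0)
    rhs = (2 * R + R**2) * np.linalg.norm(a0) * np.linalg.norm(b0)
    assert lhs <= rhs + 1e-9
print("|<a,b> - <a0,b0>| <= (2R + R^2) ||a0|| ||b0|| held in all trials")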

10.7 Proof for Theorem 2

Proof.

In order to obtain a ReLU network that is equivalent to a given GReLU network, we proceed as follows. First, let us define the following notation:

  • xix_{i} denotes the ithi^{\text{th}} training example.

  • zReLUki{z^{\text{ReLU}}}^{i}_{k} denotes the output feature map of the kthk^{\text{th}} layer taken before the ReLU in the ReLU network on the ithi^{\text{th}} training example, and is recursively defined as:

    zReLUki=W~kReLU(zReLUk1i){z^{\text{ReLU}}}^{i}_{k}=\tilde{W}_{k}\text{ReLU}\left({z^{\text{ReLU}}}^{i}_{k-1}\right)
  • z~ki{\tilde{z}}^{i}_{k} denotes the output feature map of the kthk^{\text{th}} layer taken after the GReLU activation in the GReLU network on the ithi^{\text{th}} training example, and is recursively defined as:

    z~ki=GReLU(Wt,kz~k1i,Ψizk1i){\tilde{z}}^{i}_{k}=\text{GReLU}\left(W_{t,k}{\tilde{z}}^{i}_{k-1},\Psi_{i}{z}^{i}_{k-1}\right)
  • zki{z}^{i}_{k} represents the feature map of the auxiliary network activating the kthk^{\text{th}} GReLU gate on the ithi^{\text{th}} training example, and is recursively defined as:

    zki=Ψizk1i{z}^{i}_{k}=\Psi_{i}{z}^{i}_{k-1}

Matching the intermediate layers of the GReLU and ReLU networks can be written as:

zReLUki=z~kik[L]i[n].{z^{\text{ReLU}}}^{i}_{k}={\tilde{z}}^{i}_{k}~{}~{}\forall k\in[L]~{}~{}\forall i\in[n].

that is

W~kReLU(zReLUk1i)=GReLU(Wt,kz~k1i,Ψizk1i)k[L]i[n],\tilde{W}_{k}\text{ReLU}\left({z^{\text{ReLU}}}^{i}_{k-1}\right)=\text{GReLU}\left(W_{t,k}{\tilde{z}}^{i}_{k-1},\Psi_{i}{z}^{i}_{k-1}\right)~{}~{}\forall k\in[L]~{}~{}\forall i\in[n],

which, stacking the n training examples as columns, is equivalent to

W~kReLU(zReLUk1)=GReLU(Wt,kz~k1,Ψizk1)k[L].\tilde{W}_{k}\text{ReLU}\left({z^{\text{ReLU}}}_{k-1}\right)=\text{GReLU}\left(W_{t,k}{\tilde{z}}_{k-1},\Psi_{i}{z}_{k-1}\right)~{}~{}\forall k\in[L]. (48)

Hence we proceed by recursion. Given that

zReLUk1=z~k1,{z^{\text{ReLU}}}_{k-1}={\tilde{z}}_{k-1},

we seek \tilde{W}_{k} that obeys equation (48). Denoting by (\cdot)^{\dagger} the Moore–Penrose pseudo-inverse, and setting

\tilde{W}_{k}=\text{GReLU}\left(W_{t,k}{\tilde{z}}_{k-1},\Psi_{i}{z}_{k-1}\right)\left(\text{ReLU}\left({z^{\text{ReLU}}}_{k-1}\right)\right)^{\dagger}, (49)

equation (48) holds such that we have

zReLUk=z~k.{z^{\text{ReLU}}}_{k}={\tilde{z}}_{k}.

The equivalence between (48) and (49) lies in the fact that the network is over-parameterized such that n\leq m. In this case \tilde{W}_{k} has dimension m\times m, while \text{GReLU}\left(W_{t,k}{\tilde{z}}_{k-1},\Psi_{i}{z}_{k-1}\right) and \text{ReLU}\left({z^{\text{ReLU}}}_{k-1}\right) have dimension m\times n, and thus (48) and (49) are equivalent. ∎
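
A minimal numerical sketch of the pseudo-inverse step in (49), using random stand-ins for the stacked post-ReLU features X and for the target GReLU outputs B; the sizes, names, and random data below are illustrative assumptions and not the paper's construction.

import numpy as np

rng = np.random.default_rng(3)
m, n = 256, 32                                   # over-parameterized: n <= m
X = np.maximum(rng.normal(size=(m, n)), 0.0)     # stand-in post-ReLU features, full column rank a.s.
B = rng.normal(size=(m, n))                      # stand-in target (GReLU) layer outputs

W = B @ np.linalg.pinv(X)                        # the pseudo-inverse solve; W is m x m
print("W shape:", W.shape)
print("max |W X - B|:", np.abs(W @ X - B).max()) # W X reproduces B up to round-off, so (48) holds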

11 Main Theorems with Proofs

Theorem 5.

Choose θ(0,1/2)\theta\in(0,1/2) which satisfies

Lτ312Le2θθ1/219\displaystyle L\tau 3\sqrt{12L}e^{2\theta}\theta^{-1/2}\leq\frac{1}{9} (50)

With a probability at least 13L2mexp(θm/[4L]+6m+3max{dx,dy})1-3L^{2}\sqrt{m}\exp\left(-\theta m/[4L]+6\sqrt{m}+3\max\{d_{x},d_{y}\}\right), we have, for any tt, kk and ii

[Gt,ki]Gt,ki227m4dx,Ft,ki[Ft,ki]227m4dy\left\|\left[G_{t,k}^{i}\right]^{\top}G_{t,k}^{i}\right\|_{2}\leq\frac{27m}{4d_{x}},\quad\left\|F_{t,k}^{i}\left[F_{t,k}^{i}\right]^{\top}\right\|_{2}\leq\frac{27m}{4d_{y}}

and

λmin([Gt,ki]Gt,ki)m12dx,λmin(Ft,ki[Ft,ki])m12dy\lambda_{\min}\left(\left[G_{t,k}^{i}\right]^{\top}G_{t,k}^{i}\right)\geq\frac{m}{12d_{x}},\quad\lambda_{\min}\left(F_{t,k}^{i}\left[F_{t,k}^{i}\right]^{\top}\right)\geq\frac{m}{12d_{y}}

Thus, with a high probability, we have

αx=m12dx,αy=m12dy,βx=27m4dx,βy=27m4dy.\alpha_{x}=\frac{m}{12d_{x}},\alpha_{y}=\frac{m}{12d_{y}},\quad\beta_{x}=\frac{27m}{4d_{x}},\beta_{y}=\frac{27m}{4d_{y}}.
Proof.

We will focus on G^{i}_{t,k} first; a similar analysis applies to F^{i}_{t,k}. To bound \left\lVert\left[G^{i}_{t,k}\right]^{\top}G^{i}_{t,k}\right\rVert_{2}, we will bound \max\limits_{\left\lVert v\right\rVert\leq 1}\left\lVert G^{i}_{t,k}v\right\rVert^{2}. Defining \delta W_{t,k}:=W_{t,k}-W_{1,k}, we have

Gt,ki=\displaystyle G^{i}_{t,k}= Dki(W1,k+δWt,k)D1i(W1,1+δWt,1)D0iC\displaystyle D^{i}_{k}\left(W_{1,k}+\delta W_{t,k}\right)\ldots D^{i}_{1}\left(W_{1,1}+\delta W_{t,1}\right)D_{0}^{i}C
=\displaystyle= DkiW1,kD0iC\displaystyle D^{i}_{k}W_{1,k}\ldots D^{i}_{0}C
+s=1kk1>k2>>ksDkiW1,kDk1iδWt,k1Dk11iW1,k11DksiδWt,ksDks1iW1,ks1D0iC\displaystyle+\sum_{s=1}^{k}\sum_{k_{1}>k_{2}>\ldots>k_{s}}D^{i}_{k}W_{1,k}\ldots D^{i}_{k_{1}}\delta W_{t,k_{1}}D^{i}_{k_{1}-1}W_{1,k_{1}-1}\ldots D^{i}_{k_{s}}\delta W_{t,k_{s}}D^{i}_{k_{s}-1}W_{1,k_{s}-1}\ldots D^{i}_{0}C
=\displaystyle= Zk,01,iC+s=1kk1>k2>>ks(j=1sZkj1,kjδWt,kj)Zks1,0C.\displaystyle Z_{k,0}^{1,i}C+\sum_{s=1}^{k}\sum_{k_{1}>k_{2}>\ldots>k_{s}}\left(\prod_{j=1}^{s}Z_{k_{j-1},k_{j}}\delta W_{t,k_{j}}\right)Z_{k_{s}-1,0}C.

According to Lemma 5, we have, with a probability 12Lexp(θ2m/[16L2])1-2L\exp(-\theta^{2}m/[16L^{2}]), for any 1kbkaL1\leq k_{b}\leq k_{a}\leq L and any θ(0,1/2)\theta\in(0,1/2),

Zka,kb1,i212Leθ/2θ1/2\|Z_{k_{a},k_{b}}^{1,i}\|_{2}\leq\sqrt{12L}e^{\theta/2}\theta^{-1/2}

Following the same analysis as in Lemma 5, with a probability 1-4L^{2}\exp(-\theta^{2}m/[8L^{2}]+3d_{x}), we have

minudxZka,kb1,iCuum3dxeθandmaxudxZka,kb1,iCuu3mdxeθ/2\displaystyle\min\limits_{u\in\mathbb{R}^{d_{x}}}\frac{\left\lVert Z^{1,i}_{k_{a},k_{b}}Cu\right\rVert}{\left\lVert u\right\rVert}\geq\sqrt{\frac{m}{3d_{x}}}e^{-\theta}\quad\textrm{and}\quad\max\limits_{u\in\mathbb{R}^{d_{x}}}\frac{\left\lVert Z^{1,i}_{k_{a},k_{b}}Cu\right\rVert}{\left\lVert u\right\rVert}{\leq}\sqrt{\frac{3m}{d_{x}}}e^{\theta/2} (51)

Combining the above results, we have, with a probability 1-4L^{3}\exp(-\theta^{2}m/[16L^{2}]+3d_{x}),

minudxGt,kiuum3dxeθ3mdxeθ/2s=1L(Lτ12Leθ/2θ1/2)sm3dxeθ(1Lτ312Le2θθ1/21Lτ12Leθ/2θ1/2)\min\limits_{u\in\mathbb{R}^{d_{x}}}\frac{\left\lVert G_{t,k}^{i}u\right\rVert}{\left\lVert u\right\rVert}\geq\sqrt{\frac{m}{3d_{x}}}e^{-\theta}-\sqrt{\frac{3m}{d_{x}}}e^{\theta/2}\sum_{s=1}^{L}\left(L\tau\sqrt{12L}e^{\theta/2}\theta^{-1/2}\right)^{s}\geq\sqrt{\frac{m}{3d_{x}}}e^{-\theta}\left(1-\frac{L\tau 3\sqrt{12L}e^{2\theta}\theta^{-1/2}}{1-L\tau\sqrt{12L}e^{\theta/2}\theta^{-1/2}}\right)

and

maxudxGt,kiuu3mdxeθ/2+3mdxeθ/2s=1L(Lτ12Leθ/2θ1/2)s3mdxeθ/2(1+Lτ12Leθ/2θ1/21Lτ12Leθ/2θ1/2)\max\limits_{u\in\mathbb{R}^{d_{x}}}\frac{\left\lVert G_{t,k}^{i}u\right\rVert}{\left\lVert u\right\rVert}\leq\sqrt{\frac{3m}{d_{x}}}e^{\theta/2}+\sqrt{\frac{3m}{d_{x}}}e^{\theta/2}\sum_{s=1}^{L}\left(L\tau\sqrt{12L}e^{\theta/2}\theta^{-1/2}\right)^{s}\leq\sqrt{\frac{3m}{d_{x}}}e^{\theta/2}\left(1+\frac{L\tau\sqrt{12L}e^{\theta/2}\theta^{-1/2}}{1-L\tau\sqrt{12L}e^{\theta/2}\theta^{-1/2}}\right)

Since

Lτ312Le2θθ1/219L\tau 3\sqrt{12L}e^{2\theta}\theta^{-1/2}\leq\frac{1}{9}

we have, with a probability 14L3exp(θ2m/[16L2]+3dx)1-4L^{3}\exp(-\theta^{2}m/[16L^{2}]+3d_{x}),

minudxGt,kiu2u12m3dx,maxudxGt,kiuu323mdx\min\limits_{u\in\mathbb{R}^{d_{x}}}\frac{\left\lVert G_{t,k}^{i}u\right\rVert_{2}}{\left\lVert u\right\rVert}\geq\frac{1}{2}\sqrt{\frac{m}{3d_{x}}},\quad\max\limits_{u\in\mathbb{R}^{d_{x}}}\frac{\left\lVert G_{t,k}^{i}u\right\rVert}{\left\lVert u\right\rVert}\leq\frac{3}{2}\sqrt{\frac{3m}{d_{x}}}

Similar analysis applies to Ft,ki2\left\lVert F_{t,k}^{i}\right\rVert_{2}. ∎
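
As an informal Monte Carlo illustration of Theorem 5 (not used in the proof), the sketch below forms the product D_{k}W_{k}\cdots D_{1}W_{1}D_{0}C at initialization, under the simplifying assumptions that C has N(0, 2/d_x) entries, each W_k has N(0, 2/m) entries, and the gate patterns D_k are i.i.d. half-on diagonal masks fixed independently of the weights.

import numpy as np

rng = np.random.default_rng(4)
m, d_x, L = 1024, 16, 6

C = rng.normal(0.0, np.sqrt(2.0 / d_x), size=(m, d_x))   # input layer
G = (rng.random(m) < 0.5)[:, None] * C                    # D_0 C
for _ in range(L):
    W = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m))    # hidden layer at init
    G = (rng.random(m) < 0.5)[:, None] * (W @ G)          # D_k W_k times the previous product

eigs = np.linalg.eigvalsh(G.T @ G)                        # d_x eigenvalues
print("lambda_min vs m/(12 d_x):", eigs.min(), m / (12 * d_x))
print("lambda_max vs 27m/(4 d_x):", eigs.max(), 27 * m / (4 * d_x))
# The eigenvalues concentrate around m/d_x, comfortably between the two bounds.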

Theorem 6.

Assume L3/2τ<1L^{3/2}\tau<1. With a probability 1(4L2+n2)exp(Ω(m+max{dx,dy}))1-(4L^{2}+n^{2})\exp(-\Omega(\sqrt{m}+\max\{d_{x},d_{y}\})), we have

Ft,kj[Ft,ki]2\displaystyle\left\lVert F_{t,k}^{j}\left[F_{t,k}^{i}\right]^{\top}\right\rVert_{2} \displaystyle\leq C(1m1/4+(56)Lk+L3/2τ)βy\displaystyle C^{\prime}\left(\frac{1}{m^{1/4}}+\left(\frac{5}{6}\right)^{L-k}+L^{3/2}\tau\right)\beta_{y}
|Gt,kjxj,Gt,kixi|\displaystyle\left|\langle G_{t,k}^{j}x_{j},G_{t,k}^{i}x_{i}\rangle\right| \displaystyle\leq C(1m1/4+δ(56)k+L3/2τ)βx\displaystyle C^{\prime}\left(\frac{1}{m^{1/4}}+\delta\left(\frac{5}{6}\right)^{k}+L^{3/2}\tau\right)\beta_{x}

where C^{\prime} is a universal constant, and therefore

|Gt,k1jxj,Gt,k1ixi|Ft,k+1j[Ft,k+1i]2γβ2\left|\left\langle G_{t,k-1}^{j}x_{j},G_{t,k-1}^{i}x_{i}\right\rangle\right|\left\lVert F^{j}_{t,k+1}\left[F_{t,k+1}^{i}\right]^{\top}\right\rVert_{2}\leq\gamma\beta^{2}

with

γ=C′′(L3τ2+δ(56)L+1m1/2)\displaystyle\gamma=C^{\prime\prime}\left(L^{3}\tau^{2}+\delta\left(\frac{5}{6}\right)^{L}+\frac{1}{m^{1/2}}\right) (52)
Proof.

To bound Ft,kj[Ft,ki]2\left\lVert F_{t,k}^{j}\left[F_{t,k}^{i}\right]^{\top}\right\rVert_{2}, we define k0=L+1k_{0}=L+1 and express Ft,kj[Ft,ki]F_{t,k}^{j}\left[F_{t,k}^{i}\right]^{\top} as

Ft,kj[Ft,ki]\displaystyle F_{t,k}^{j}\left[F_{t,k}^{i}\right]^{\top}
=\displaystyle= BDLjW1,LW1,k+1DkjDkiW1,k+1W1,LDLiB:=Z\displaystyle\underbrace{BD^{j}_{L}W_{1,L}\ldots W_{1,k+1}D^{j}_{k}D_{k}^{i}W_{1,k+1}^{\top}\ldots W_{1,L}^{\top}D^{i}_{L}B^{\top}}_{:=Z}
+BZL:kjs=1Lk[(=1sZk11:kiδWt,k)Zks1:ki]B:=𝒜1\displaystyle+\underbrace{BZ_{L:k}^{j}\sum_{s=1}^{L-k}\left[\left(\prod_{\ell=1}^{s}Z_{k_{\ell-1}-1:k_{\ell}}^{i}\delta W_{t,k_{\ell}}\right)Z^{i}_{k_{s}-1:k}\right]^{\top}B^{\top}}_{:=\mathcal{A}_{1}}
+Bs=1Lk[(=1sZk11:kjδWt,k)Zks1:kj][ZL:ki]B:=𝒜2\displaystyle+\underbrace{B\sum_{s=1}^{L-k}\left[\left(\prod_{\ell=1}^{s}Z_{k_{\ell-1}-1:k_{\ell}}^{j}\delta W_{t,k_{\ell}}\right)Z^{j}_{k_{s}-1:k}\right]\left[Z_{L:k}^{i}\right]^{\top}B^{\top}}_{:=\mathcal{A}_{2}}
+Bsa=1Lksb=1Lk(=1saZk1:kjδWt,k)Zksa1:kj[Zkb1:ki][=2sbZk1:kiδWt,k]B:=𝒜3\displaystyle+\underbrace{B\sum_{s_{a}=1}^{L-k}\sum_{s_{b}=1}^{L-k}\left(\prod_{\ell=1}^{s_{a}}Z_{k_{\ell-1}:k_{\ell}}^{j}\delta W_{t,k_{\ell}}\right)Z^{j}_{k_{s_{a}}-1:k}\left[Z^{i}_{k_{b}-1:k}\right]^{\top}\left[\prod_{\ell=2}^{s_{b}}Z_{k_{\ell-1}:k_{\ell}}^{i}\delta W_{t,k_{\ell}}\right]^{\top}B^{\top}}_{:=\mathcal{A}_{3}}

According to Lemma 5, with a probability 14L3exp(θ2m/[16L2]+3dy)1-4L^{3}\exp(-\theta^{2}m/[16L^{2}]+3d_{y}),

Zka,kb212Leθ/2θ1/2,maxv1vBZka,kb323mdy\|Z_{k_{a},k_{b}}\|_{2}\leq\sqrt{12L}e^{\theta/2}\theta^{-1/2},\quad\max\limits_{\|v\|\leq 1}\left\|v^{\top}BZ_{k_{a},k_{b}}\right\|\leq\frac{3}{2}\sqrt{\frac{3m}{d_{y}}}

As a result, with a probability 14L3exp(θ2m/[16L2]+3dy)1-4L^{3}\exp(-\theta^{2}m/[16L^{2}]+3d_{y}), we have

\displaystyle\max\limits_{\|v\|\leq 1}v^{\top}\mathcal{A}_{1}v\leq\frac{27m}{4d_{y}}\sum_{s=1}^{L}\left(\sqrt{12L}e^{\theta/2}\theta^{-1/2}L\tau\right)^{s}\leq\xi\frac{m}{d_{y}}

where ξ=812Leθ/2θ1/2Lτ1\xi=8\sqrt{12L}e^{\theta/2}\theta^{-1/2}L\tau\ll 1 and the last step utilizes the assumption 12Leθ/2θ1/2Lτ1/9\sqrt{12L}e^{\theta/2}\theta^{-1/2}L\tau\leq 1/9. Using the same analysis, we have, with a probability 14L3exp(θ2m/[16L2]+3dy)1-4L^{3}\exp(-\theta^{2}m/[16L^{2}]+3d_{y}),

maxv1v𝒜2vξmdy,maxv1v𝒜3vξ2mdy\max\limits_{\|v\|\leq 1}v^{\top}\mathcal{A}_{2}v\leq\xi\frac{m}{d_{y}},\quad\max\limits_{\|v\|\leq 1}v^{\top}\mathcal{A}_{3}v\leq\xi^{2}\frac{m}{d_{y}}

Next, we need to bound the spectral norm of Z=F_{1,k}^{j}\left[F_{1,k}^{i}\right]^{\top}. To this end, we will bound

maxudy,vdy,u=1,v=1[F1,kj]u,[F1,ki]v\max\limits_{u\in\mathbb{R}^{d_{y}},v\in\mathbb{R}^{d_{y}},\|u\|=1,\|v\|=1}\left\langle\left[F^{j}_{1,k}\right]^{\top}u,\left[F^{i}_{1,k}\right]^{\top}v\right\rangle

Define

a_{k}=\left[F^{j}_{1,k}\right]^{\top}u,\quad b_{k}=\left[F^{i}_{1,k}\right]^{\top}v

We have

akbk=ak+1W1,k+1DkjDkiW1,k+1bk+1\displaystyle a^{\top}_{k}b_{k}=a_{k+1}^{\top}W_{1,k+1}D_{k}^{j}D_{k}^{i}W^{\top}_{1,k+1}b_{k+1}
=\displaystyle= ak+1W1,k+1DkjDkiW1,k+1Pak+1(bk+1):=Z1+ak+1W1,k+1DkjDkiW1,k+1(bk+1Pak+1(bk+1)):=Z2\displaystyle\underbrace{a_{k+1}^{\top}W_{1,k+1}D_{k}^{j}D_{k}^{i}W^{\top}_{1,k+1}P_{a_{k+1}}\left(b_{k+1}\right)}_{:=Z_{1}}+\underbrace{a_{k+1}^{\top}W_{1,k+1}D_{k}^{j}D_{k}^{i}W^{\top}_{1,k+1}\left(b_{k+1}-P_{a_{k+1}}\left(b_{k+1}\right)\right)}_{:=Z_{2}}

where P_{a}(b)=aa^{\top}b/\left\lVert a\right\rVert^{2} projects vector b onto the direction of a. Since W^{\top}_{1,k+1}\left(b_{k+1}-P_{a_{k+1}}(b_{k+1})\right) is statistically independent of W_{1,k+1}^{\top}a_{k+1}, by fixing W_{1,k+1}^{\top}a_{k+1} and viewing Z_{2} as a random variable of W^{\top}_{1,k+1}\left(b_{k+1}-P_{a_{k+1}}(b_{k+1})\right), we have that Z_{2} follows a \mathcal{N}(0,\sigma^{2}) distribution, where

σ2=1mbk+1Pak+1(bk+1)2DkjW1,k+1ak+12\sigma^{2}=\frac{1}{m}\left\lVert b_{k+1}-P_{a_{k+1}}(b_{k+1})\right\rVert^{2}\left\lVert D_{k}^{j}W_{1,k+1}^{\top}a_{k+1}\right\rVert^{2}

Using the property of Gaussian distribution, we have, with a probability 12exp(m/2)1-2\exp(-\sqrt{m}/2),

|Z2|bk+1m1/2DkjW1,k+1ak+1\left|Z_{2}\right|\leq\frac{\left\lVert b_{k+1}\right\rVert}{m^{1/2}}\left\lVert D_{k}^{j}W_{1,k+1}^{\top}a_{k+1}\right\rVert

Since mDkjW1,k+1ak+122|ak+1|2\frac{m\left\lVert D_{k}^{j}W_{1,k+1}^{\top}a_{k+1}\right\rVert^{2}}{2|a_{k+1}|^{2}} follows χm/22\chi^{2}_{m/2}, with a probability 1exp(θm/4)1-\exp(-\theta m/4), we have

DkjW1,k+1ak+12(1+θ)ak+12\left\lVert D_{k}^{j}W_{1,k+1}^{\top}a_{k+1}\right\rVert^{2}\leq(1+\theta)\left\lVert a_{k+1}\right\rVert^{2}

Combining the above two results, with a probability 12exp(m/2)exp(θm/4)1-2\exp(-\sqrt{m}/2)-\exp(-\theta m/4),

|Z2|1+θm1/2bk+1ak+1(1+θ)βm1/2\left|Z_{2}\right|\leq\frac{1+\theta}{m^{1/2}}\left\lVert b_{k+1}\right\rVert\left\lVert a_{k+1}\right\rVert\leq\frac{(1+\theta)\beta}{m^{1/2}}

where the last step utilizes the fact bk+1β\left\lVert b_{k+1}\right\rVert\leq\sqrt{\beta} and ak+1β\left\lVert a_{k+1}\right\rVert\leq\sqrt{\beta}.
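
The following sketch illustrates the size of the cross term Z_{2} for stand-in vectors a and b, a Gaussian weight matrix with N(0, 2/m) entries, and a single Bernoulli(1/2) gate standing in for D^j_k D^i_k; all of these are assumptions of the sketch rather than the network's actual quantities. The observed magnitude is on the order of ||a|| ||b|| / sqrt(m), as claimed.

import numpy as np

rng = np.random.default_rng(5)
m, trials = 2048, 100
a, b = rng.normal(size=m), rng.normal(size=m)     # stand-ins for a_{k+1}, b_{k+1}
b_perp = b - (a @ b) / (a @ a) * a                # b - P_a(b)

vals = []
for _ in range(trials):
    W = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m))
    D = (rng.random(m) < 0.5).astype(float)       # stand-in gate pattern
    vals.append(a @ W @ (D * (W.T @ b_perp)))     # the cross term Z_2
scale = np.linalg.norm(a) * np.linalg.norm(b) / np.sqrt(m)
print("median |Z_2| / (||a|| ||b|| / sqrt(m)):", np.median(np.abs(vals)) / scale)
# The printed ratio is O(1), i.e. Z_2 is of order ||a|| ||b|| / sqrt(m).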

To bound Z1Z_{1}, let ki,j\mathcal{I}_{k}^{i,j} be the set of indices of diagonal elements that are nonzero in both DkjD_{k}^{j} and DkiD_{k}^{i}. According to Lemma 4, with a probability 1exp(Ω(m))1-\exp(-\Omega(m)), we have |ki,j|m/3|\mathcal{I}_{k}^{i,j}|\leq m/3. Define u=W1,k+1ak+1/ak+1u=W^{\top}_{1,k+1}a_{k+1}/\left\lVert a_{k+1}\right\rVert. We have

Z_{1}=a_{k+1}^{\top}b_{k+1}\sum_{s\in\mathcal{I}_{k}^{i,j}}u_{s}^{2}\quad\text{and hence}\quad\left|Z_{1}\right|\leq\left|a_{k+1}^{\top}b_{k+1}\right|\sum_{s\in\mathcal{I}_{k}^{i,j}}u_{s}^{2}

Using the concentration of the \chi^{2} distribution, we have, with a probability 1-\exp(-\theta m/3),

\sum_{s\in\mathcal{I}_{k}^{i,j}}u_{s}^{2}\leq\frac{2(1+\theta)}{3}

By combining the results for Z_{1} and Z_{2}, for fixed normalized u\in\mathbb{R}^{d_{y}}, v\in\mathbb{R}^{d_{y}}, with a probability 1-\exp(-\Omega(\sqrt{m})), we have

|[F1,kj]u,[F1,ki]v|(1+θ)βm1/2+2(1+θ)3|[F1,k+1j]u,[F1,k+1i]v|\left|\left\langle\left[F_{1,k}^{j}\right]^{\top}u,\left[F_{1,k}^{i}\right]^{\top}v\right\rangle\right|\leq\frac{(1+\theta)\beta}{m^{1/2}}+\frac{2(1+\theta)}{3}\left|\left\langle\left[F_{1,k+1}^{j}\right]^{\top}u,\left[F_{1,k+1}^{i}\right]^{\top}v\right\rangle\right|

By combining the results over iterations, we have, with a probability 12Lexp(θm/3)2Lexp(m/2)1-2L\exp(-\theta m/3)-2L\exp\left(-\sqrt{m}/2\right), for fixed normalized udyu\in\mathbb{R}^{d_{y}} and vdyv\in\mathbb{R}^{d_{y}}

|[Ft,kj]u,[Ft,ki]v|4(1+θ)βm1/2+(2(1+θ)3)Lkβ\left|\left\langle\left[F_{t,k}^{j}\right]^{\top}u,\left[F_{t,k}^{i}\right]^{\top}v\right\rangle\right|\leq\frac{4(1+\theta)\beta}{m^{1/2}}+\left(\frac{2(1+\theta)}{3}\right)^{L-k}\beta

We extend the result to any u and v by a standard covering-number argument, and complete the proof by choosing \theta=1/6.

To bound |\langle G_{t,k}^{j}x_{j},G^{i}_{t,k}x_{i}\rangle|, following the same analysis as above, we have

\left|\langle G_{t,k}^{j}x_{j},G^{i}_{t,k}x_{i}\rangle\right|\leq O\left(L^{3/2}\tau\beta\right)+\left|\langle G_{1,k}^{j}x_{j},G^{i}_{1,k}x_{i}\rangle\right|\leq O\left(\left(L^{3/2}\tau+\frac{1}{m^{1/2}}\right)\beta\right)+\left(\frac{5}{6}\right)^{k}\langle Cx_{i},Cx_{j}\rangle

We expand Cxi,Cxj\langle Cx_{i},Cx_{j}\rangle as

Cxi,Cxj=CPxj(xi),Cxj+C(xiPxj(xi)),Cxj\langle Cx_{i},Cx_{j}\rangle=\langle CP_{x_{j}}(x_{i}),Cx_{j}\rangle+\langle C\left(x_{i}-P_{x_{j}}(x_{i})\right),Cx_{j}\rangle

With a probability 1exp(θm/4)1-\exp(-\theta m/4), we have

|CPxj(xi),Cxj|(1+θ)δβ\left|\langle CP_{x_{j}}(x_{i}),Cx_{j}\rangle\right|\leq(1+\theta)\delta\beta

With a probability 12exp(2m)1-2\exp(-2\sqrt{m})

\left|\langle C\left(x_{i}-P_{x_{j}}(x_{i})\right),Cx_{j}\rangle\right|\leq\frac{\beta}{m^{1/2}}

Combining the last two bounds with the preceding inequality yields the stated bounds and completes the proof. ∎

12 Auxiliary Lemmas

Lemma 4.

Define

ki={s[m]:[Ψkxi]s>0},kj={s[m]:[Ψkxj]s>0}\mathcal{I}_{k}^{i}=\left\{s\in[m]:[\Psi_{k}x_{i}]_{s}>0\right\},\quad\mathcal{I}_{k}^{j}=\left\{s\in[m]:[\Psi_{k}x_{j}]_{s}>0\right\}

where Ψkm×m\Psi_{k}\in\mathbb{R}^{m\times m} is a Gaussian random matrix. Suppose |xi|=1|x_{i}|=1, |xj|=1|x_{j}|=1, and |xixj|δ|x_{i}^{\top}x_{j}|\leq\delta, where δ1/3\delta\leq 1/3. Then, with a probability 1exp(Ω(m))1-\exp(-\Omega(m)), we have

|kikj|m3\left|\mathcal{I}_{k}^{i}\cap\mathcal{I}_{k}^{j}\right|\leq\frac{m}{3}
Proof.

Define [z]_{+} as the function that outputs 1 when z>0, and zero otherwise. We have

|kjki|\displaystyle\left|\mathcal{I}_{k}^{j}\cap\mathcal{I}_{k}^{i}\right| =\displaystyle= [Ψkxj]+,[Ψkxi]+\displaystyle\langle\left[\Psi_{k}x_{j}\right]_{+},\left[\Psi_{k}x_{i}\right]_{+}\rangle
\displaystyle\leq [Ψkxj]+,[Ψk(xiPxj(xi))]++|[Ψkxi]+[Ψk(xiPxj(xi))]+|\displaystyle\langle\left[\Psi_{k}x_{j}\right]_{+},\left[\Psi_{k}\left(x_{i}-P_{x_{j}}(x_{i})\right)\right]_{+}\rangle+\left|\left[\Psi_{k}x_{i}\right]_{+}-\left[\Psi_{k}\left(x_{i}-P_{x_{j}}(x_{i})\right)\right]_{+}\right|

Since \Psi_{k}(x_{i}-P_{x_{j}}(x_{i})) and \Psi_{k}x_{j} are statistically independent, with a probability 1-\exp(-\theta m/2),

\langle\left[\Psi_{k}x_{j}\right]_{+},\left[\Psi_{k}\left(x_{i}-P_{x_{j}}(x_{i})\right)\right]_{+}\rangle\leq\frac{(1+\theta)m}{4}

and therefore,

|kjki|\displaystyle\left|\mathcal{I}_{k}^{j}\cap\mathcal{I}_{k}^{i}\right|
\displaystyle\leq (1+θ)m4+|[Ψkxi]+[Ψk(xiPxj(xi))]+|\displaystyle\frac{(1+\theta)m}{4}+\left|\left[\Psi_{k}x_{i}\right]_{+}-\left[\Psi_{k}\left(x_{i}-P_{x_{j}}(x_{i})\right)\right]_{+}\right|
\displaystyle\leq (1+θ)m4+=1mI(|[ΨkPxj(xi)]|>|[Ψk(xiPxj(xi))]|):=Z\displaystyle\frac{(1+\theta)m}{4}+\underbrace{\sum_{\ell=1}^{m}I\left(\left|\left[\Psi_{k}P_{x_{j}}(x_{i})\right]_{\ell}\right|>\left|\left[\Psi_{k}\left(x_{i}-P_{x_{j}}(x_{i})\right)\right]_{\ell}\right|\right)}_{:=Z}

We proceed to bound ZZ. First, each element of ΨkPxj(xi)\Psi_{k}P_{x_{j}}(x_{i}) follows a Gaussian distribution 𝒩(0,σ12)\mathcal{N}(0,\sigma_{1}^{2}), where σ12=2|ΨkPxj(xi)|2/m\sigma_{1}^{2}=2|\Psi_{k}P_{x_{j}}(x_{i})|^{2}/m. Similarly, each element Ψk(xiPxj(xi))\Psi_{k}(x_{i}-P_{x_{j}}(x_{i})) follows a distribution 𝒩(0,σ22)\mathcal{N}(0,\sigma_{2}^{2}), where σ22=2|Ψk(xiPxj(xi))|2/m\sigma_{2}^{2}=2|\Psi_{k}(x_{i}-P_{x_{j}}(x_{i}))|^{2}/m. Using the concentration inequality of χm2\chi^{2}_{m} distribution, we have, with a probability 12exp(θm/4)1-2\exp(-\theta m/4),

σ122(1+θ)m|Pxj(xi)|22(1+θ)δ2m,σ222(1θ)m|(xiPxj(xi))|22(1θ)(1δ2)m\sigma_{1}^{2}\leq\frac{2(1+\theta)}{m}\left|P_{x_{j}}(x_{i})\right|^{2}\leq\frac{2(1+\theta)\delta^{2}}{m},\quad\sigma_{2}^{2}\geq\frac{2(1-\theta)}{m}\left|\left(x_{i}-P_{x_{j}}(x_{i})\right)\right|^{2}\geq\frac{2(1-\theta)(1-\delta^{2})}{m}

Using the properties of Gaussian distribution, we have

\displaystyle\Pr\left(\left|\left[\Psi_{k}\left(x_{i}-P_{x_{j}}(x_{i})\right)\right]_{\ell}\right|^{2}\geq\frac{2\delta}{m}\sqrt{\frac{(1-\theta)(1-\delta^{2})}{1+\theta}}\right)
\displaystyle\geq 1-\left(\frac{(1+\theta)\delta^{2}}{(1-\theta)(1-\delta^{2})}\right)^{1/4}\exp\left(-\frac{1-\left[(1+\theta)\delta^{2}/[(1-\theta)(1-\delta^{2})]\right]^{1/4}}{2}\right)\geq 1-2\delta^{1/4}
\displaystyle\Pr\left(\left|\left[\Psi_{k}P_{x_{j}}(x_{i})\right]_{\ell}\right|^{2}\geq\frac{2\delta}{m}\sqrt{\frac{(1-\theta)(1-\delta^{2})}{1+\theta}}\right)
\displaystyle\leq\left(\frac{(1-\theta)(1-\delta^{2})}{(1+\theta)\delta^{2}}\right)^{1/4}\exp\left(-\frac{\left[(1-\theta)(1-\delta^{2})/[(1+\theta)\delta^{2}]\right]^{1/4}}{2}\right)\leq\exp\left(-\frac{1}{2\delta}\right)

As a result, we have

Pr(|[Ψk(xiPxj(xi))]|<|[ΨkPxj(xi)]|)3δ1/4exp(12δ)\Pr\left(\left|\left[\Psi_{k}\left(x_{i}-P_{x_{j}}(x_{i})\right)\right]_{\ell}\right|<\left|\left[\Psi_{k}P_{x_{j}}(x_{i})\right]_{\ell}\right|\right)\leq 3\delta^{1/4}\exp\left(-\frac{1}{2\delta}\right)

Using the standard concentration inequality, we have, with a probability 1exp(m/2)1-\exp(-m/2),

Z(2δ1/4exp(12δ)+13.5)m+2mδ1/4exp(12δ)m7Z\leq\left(2\delta^{1/4}\exp\left(-\frac{1}{2\delta}\right)+\frac{1}{3.5}\right)m+2m\sqrt{\delta^{1/4}\exp\left(-\frac{1}{2\delta}\right)}\leq\frac{m}{7}

and therefore, with a probability 1exp(Ω(m))1-\exp(-\Omega(m)), we have

|kjki|m3\left|\mathcal{I}_{k}^{j}\cap\mathcal{I}_{k}^{i}\right|\leq\frac{m}{3}
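
An informal Monte Carlo illustration of Lemma 4, under the simplifying assumptions of exactly orthogonal unit inputs (i.e. delta = 0) and a standard Gaussian gating matrix:

import numpy as np

rng = np.random.default_rng(6)
m, trials = 1024, 200
x_i = np.zeros(m); x_i[0] = 1.0                   # orthogonal unit inputs (delta = 0)
x_j = np.zeros(m); x_j[1] = 1.0

overlaps = []
for _ in range(trials):
    Psi = rng.normal(size=(m, m))                 # Gaussian gating matrix
    overlaps.append(np.mean((Psi @ x_i > 0) & (Psi @ x_j > 0)))
print("average overlap fraction:", np.mean(overlaps), "(Lemma 4 bound: 1/3)")
# For orthogonal inputs the two sign patterns agree on about a quarter of the
# coordinates, well below m/3.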

Lemma 5.

With a probability at least 12Lexp(θ2m/[16L2])1-2L\exp(-\theta^{2}m/[16L^{2}]), for any 1kbkaL1\leq k_{b}\leq k_{a}\leq L and any θ(0,1/2)\theta\in(0,1/2), we have

Zka,kb1,i212Leθ/2θ1/2.\left\lVert Z_{k_{a},k_{b}}^{1,i}\right\rVert_{2}\leq\sqrt{12L}e^{\theta/2}\theta^{-1/2}.
Proof.

To bound Zka,kb1,i2\|Z_{k_{a},k_{b}}^{1,i}\|_{2}, we will bound maxv1Zka,kb1,iv\max\limits_{\left\lVert v\right\rVert\leq 1}\left\lVert Z_{k_{a},k_{b}}^{1,i}v\right\rVert. To this end, we divide vv into rr components of equal size, denoted by v1,,vrv_{1},\ldots,v_{r}. We furthermore introduce v~km\tilde{v}_{k}\in\mathbb{R}^{m} to include the kkth component vkv_{k} of vv with everything else padded with zeros. As a result,

maxv1Zka,kb1,iv\displaystyle\max\limits_{\left\lVert v\right\rVert\leq 1}\left\lVert Z_{k_{a},k_{b}}^{1,i}v\right\rVert\leq maxj=1rvj21j=1rZka,kb1,iv~j\displaystyle\max\limits_{\sum_{j=1}^{r}\left\lVert v_{j}\right\rVert^{2}\leq 1}\sum_{j=1}^{r}\left\lVert Z_{k_{a},k_{b}}^{1,i}\tilde{v}_{j}\right\rVert
\displaystyle\leq maxj=1rvj21j=1r{vjmaxvjm/rZka,kb1,iv~jvj}\displaystyle\max\limits_{\sum_{j=1}^{r}\left\lVert v_{j}\right\rVert^{2}\leq 1}\sum_{j=1}^{r}\left\{\left\lVert v_{j}\right\rVert\max_{v_{j}\in\mathbb{R}^{m/r}}\frac{\left\lVert Z_{k_{a},k_{b}}^{1,i}\tilde{v}_{j}\right\rVert}{\left\lVert v_{j}\right\rVert}\right\}
(a)\displaystyle\overset{(a)}{\leq} rmaxj=1rvj21,j[r],vjm/rZka,kb1,iv~jvj\displaystyle\sqrt{r}\max\limits_{\begin{subarray}{c}\sum_{j=1}^{r}\left\lVert v_{j}\right\rVert^{2}\leq 1,\\ j\in[r],v_{j}\in\mathbb{R}^{m/r}\end{subarray}}\frac{\left\lVert Z_{k_{a},k_{b}}^{1,i}\tilde{v}_{j}\right\rVert}{\left\lVert v_{j}\right\rVert}
(b)\displaystyle\overset{(b)}{\leq} rmaxj[r],vjm/rZka,kb1,iv~jvj,\displaystyle\sqrt{r}\max\limits_{j\in[r],v_{j}\in\mathbb{R}^{m/r}}\frac{\left\lVert Z_{k_{a},k_{b}}^{1,i}\tilde{v}_{j}\right\rVert}{\left\lVert v_{j}\right\rVert},

where (a) uses the fact that \max_{v_{j}\in\mathbb{R}^{m/r}}\frac{\left\lVert Z_{k_{a},k_{b}}^{1,i}\tilde{v}_{j}\right\rVert}{\left\lVert v_{j}\right\rVert}\leq\max_{j\in[r],v_{j}\in\mathbb{R}^{m/r}}\frac{\left\lVert Z_{k_{a},k_{b}}^{1,i}\tilde{v}_{j}\right\rVert}{\left\lVert v_{j}\right\rVert} for any j\in[r], together with the Cauchy-Schwarz inequality \sum_{j=1}^{r}\|v_{j}\|\leq\sqrt{\sum_{j=1}^{r}1^{2}\sum_{j=1}^{r}\|v_{j}\|^{2}}\leq\sqrt{r}, which holds since \sum_{j=1}^{r}\|v_{j}\|^{2}=\|v\|^{2}\leq 1; (b) holds since the constraint \sum_{j=1}^{r}\left\lVert v_{j}\right\rVert^{2}\leq 1 of the maximization is removed. We fix j and consider a fixed \tilde{v}_{j}. We will prove the following high-probability bound by induction:

\displaystyle\left\lVert Z_{k_{a},k_{b}}^{1,i}\tilde{v}_{j}\right\rVert^{2}\in\left(\left(1-\frac{\theta}{L}\right)^{k_{a}-k_{b}}\left\lVert D_{k_{b}}^{i}\tilde{v}_{j}\right\rVert^{2},\left(1+\frac{\theta}{L}\right)^{k_{a}-k_{b}}\left\lVert D_{k_{b}}^{i}\tilde{v}_{j}\right\rVert^{2}\right), (53)

where \theta\in(0,1/2) is an auxiliary parameter related to the failure probability. For k_{a}=k_{b}+1, we have

Zka,kb1,iv~j2=Dkb+1iW1,kb+1Dkbiv~j2\left\lVert Z_{k_{a},k_{b}}^{1,i}\tilde{v}_{j}\right\rVert^{2}=\left\lVert D^{i}_{k_{b}+1}W_{1,k_{b}+1}D^{i}_{k_{b}}\tilde{v}_{j}\right\rVert^{2}

It is easy to show that \frac{m\left\lVert Z_{k_{a},k_{b}}^{1,i}\tilde{v}_{j}\right\rVert^{2}}{2\left\lVert D^{i}_{k_{b}}\tilde{v}_{j}\right\rVert^{2}} follows a \chi^{2} distribution with m/2 degrees of freedom. Using the concentration of the \chi^{2} distribution, we have

Pr(Zka,kb1,iv~j2(1+θL)Dkbiv~j2)exp(θ2m8L2),\displaystyle\Pr\left(\left\lVert Z_{k_{a},k_{b}}^{1,i}\tilde{v}_{j}\right\rVert^{2}\geq\left(1+\frac{\theta}{L}\right)\left\lVert D^{i}_{k_{b}}\tilde{v}_{j}\right\rVert^{2}\right)\leq\exp\left({-\frac{\theta^{2}m}{8L^{2}}}\right),
Pr(Zka,kb1,iv~j2(1θL)Dkbiv~j2)exp(θ2m8L2).\displaystyle\Pr\left(\left\lVert Z_{k_{a},k_{b}}^{1,i}\tilde{v}_{j}\right\rVert^{2}\leq\left(1-\frac{\theta}{L}\right)\left\lVert D^{i}_{k_{b}}\tilde{v}_{j}\right\rVert^{2}\right)\leq\exp\left({-\frac{\theta^{2}m}{8L^{2}}}\right).

Hence, with a probability 12exp(θ2m/[8L2])1-2\exp\left({-\theta^{2}m/[8L^{2}]}\right), when ka=kb+1k_{a}=k_{b}+1, we have

Zka,kb1,iv~j2/Dkbiv~j2=Zkb+1,kb1,iv~j2/Dkbiv~j2[1θL,1+θL].\displaystyle\left\lVert Z_{k_{a},k_{b}}^{1,i}\tilde{v}_{j}\right\rVert^{2}/\left\lVert D_{k_{b}}^{i}\tilde{v}_{j}\right\rVert^{2}=\left\lVert Z_{k_{b}+1,k_{b}}^{1,i}\tilde{v}_{j}\right\rVert^{2}/\left\lVert D_{k_{b}}^{i}\tilde{v}_{j}\right\rVert^{2}\in\left[1-\frac{\theta}{L},1+\frac{\theta}{L}\right].

As a result, for a fixed v~j\tilde{v}_{j}, for any kak_{a} and kbk_{b}, with a probability 12Lexp(θ2m/[8L2])1-{2L\exp\left(-\theta^{2}m/[8L^{2}]\right)}, we have

Zka,kb1,iv~j2[(1θL)LDkbiv~j2,(1+θL)LDkbiv~j2].\displaystyle\left\lVert Z_{k_{a},k_{b}}^{1,i}\tilde{v}_{j}\right\rVert^{2}\in\left[\left(1-\frac{\theta}{L}\right)^{L}\left\lVert D^{i}_{k_{b}}\tilde{v}_{j}\right\rVert^{2},\left(1+\frac{\theta}{L}\right)^{L}\left\lVert D_{k_{b}}^{i}\tilde{v}_{j}\right\rVert^{2}\right]. (54)

Since θ<1/2\theta<1/2, the above bound can be simplified as

Zka,kb1,iv~j2[e2θDkbiv~j2,eθDkbiv~j2],\displaystyle\left\lVert Z_{k_{a},k_{b}}^{1,i}\tilde{v}_{j}\right\rVert^{2}\in\left[e^{-2\theta}\left\lVert D^{i}_{k_{b}}\tilde{v}_{j}\right\rVert^{2},e^{\theta}\left\lVert D_{k_{b}}^{i}\tilde{v}_{j}\right\rVert^{2}\right], (55)

where we use two inequalities: (1+\theta/L)^{L}\leq\exp(\theta) for L>1 and |\theta|\leq L, and \exp(-\theta)\leq 1-\theta/2 for 0<\theta<1.

To bound Zka,kbv~j2\left\lVert Z_{k_{a},k_{b}}\tilde{v}_{j}\right\rVert^{2} for any v~j\tilde{v}_{j}, we introduce 𝒩m/r1\mathcal{N}\subseteq\mathbb{R}^{m/r-1} as an appropriate ϵ\epsilon-net of m/r1\mathbb{R}^{m/r-1}. Using the well known property of appropriate ϵ\epsilon-net, we have

maxvj𝒩Zka,kb1,iv~j2(a)maxvj𝒩Zka,kb1,iv~j2vj2(b)maxvjm/r1Zka,kb1,iv~j2vj2(c)112ϵmaxvj𝒩Zka,kb1,iv~j2,\displaystyle\max\limits_{v_{j}\in\mathcal{N}}\left\lVert Z_{k_{a},k_{b}}^{1,i}\tilde{v}_{j}\right\rVert^{2}\overset{(a)}{\leq}\max\limits_{v_{j}\in\mathcal{N}}\frac{\left\lVert Z_{k_{a},k_{b}}^{1,i}\tilde{v}_{j}\right\rVert^{2}}{\left\lVert v_{j}\right\rVert^{2}}\overset{(b)}{\leq}\max\limits_{v_{j}\in\mathbb{R}^{m/r-1}}\frac{\left\lVert Z_{k_{a},k_{b}}^{1,i}\tilde{v}_{j}\right\rVert^{2}}{\left\lVert v_{j}\right\rVert^{2}}\overset{(c)}{\leq}\frac{1}{1-2\epsilon}\max\limits_{v_{j}\in\mathcal{N}}\left\lVert Z_{k_{a},k_{b}}^{1,i}\tilde{v}_{j}\right\rVert^{2}, (56)

where (a) is due to \left\lVert v_{j}\right\rVert^{2}\leq 1; (b) is due to \mathcal{N}\subseteq\mathbb{R}^{m/r-1}; (c) is due to the property of the \epsilon-net. Since |\mathcal{N}|\leq(3/\epsilon)^{m/r-1}<(1+3/\epsilon)^{m/r}\leq\exp(3m/[r\epsilon]) [Pisier, 1999], by taking the union bound, we have, with probability at least 1-2L\exp(-\theta^{2}m/[8L^{2}]+3m/[r\epsilon]),

maxvjm/r1Zka,kb1,iv~j2vj2[e2θDkbiv~j2vj2,eθ12ϵDkbiv~j2vj2].\displaystyle\max\limits_{v_{j}\in\mathbb{R}^{m/r-1}}\frac{\left\lVert Z_{k_{a},k_{b}}^{1,i}\tilde{v}_{j}\right\rVert^{2}}{\left\lVert v_{j}\right\rVert^{2}}\in\left[e^{-2\theta}\frac{\left\lVert D^{i}_{k_{b}}\tilde{v}_{j}\right\rVert^{2}}{\left\lVert v_{j}\right\rVert^{2}},\frac{e^{\theta}}{1-2\epsilon}\frac{\left\lVert D^{i}_{k_{b}}\tilde{v}_{j}\right\rVert^{2}}{\left\lVert v_{j}\right\rVert^{2}}\right].

By choosing r=\frac{48L^{2}}{\theta^{2}\epsilon} with \epsilon=\frac{1}{3}, we have, with a probability at least 1-2L\exp(-\theta^{2}m/[16L^{2}]),

maxvjm/r1Zka,kb1,iv~j2vj2[e2θDkbiv~j2vj2,3eθDkbiv~j2vj2].\displaystyle\max\limits_{v_{j}\in\mathbb{R}^{m/r-1}}\frac{\left\lVert Z_{k_{a},k_{b}}^{1,i}\tilde{v}_{j}\right\rVert^{2}}{\left\lVert v_{j}\right\rVert^{2}}\in\left[e^{-2\theta}\frac{\left\lVert D^{i}_{k_{b}}\tilde{v}_{j}\right\rVert^{2}}{\left\lVert v_{j}\right\rVert^{2}},3e^{\theta}\frac{\left\lVert D^{i}_{k_{b}}\tilde{v}_{j}\right\rVert^{2}}{\left\lVert v_{j}\right\rVert^{2}}\right]. (57)

and therefore, with a probability 12Lexp(θ2m/[16L2])1-2L\exp(-\theta^{2}m/[16L^{2}]), for any 1kb<kaL1\leq k_{b}<k_{a}\leq L, we have

Zka,kb1,i212Leθ/2θ.\displaystyle\left\lVert Z_{k_{a},k_{b}}^{1,i}\right\rVert_{2}\leq\frac{\sqrt{12L}e^{\theta/2}}{\sqrt{\theta}}. (58)
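
The single-layer step behind (53) can also be illustrated numerically. In the sketch below, the weight matrix has N(0, 2/m) entries and the gates are i.i.d. Bernoulli(1/2) diagonal patterns, simplifying assumptions standing in for the network's activation patterns.

import numpy as np

rng = np.random.default_rng(7)
m, trials = 2048, 100
v = rng.normal(size=m)

ratios = []
for _ in range(trials):
    D_prev = (rng.random(m) < 0.5).astype(float)  # gate of layer k_b
    D_next = (rng.random(m) < 0.5).astype(float)  # gate of layer k_b + 1
    W = rng.normal(0.0, np.sqrt(2.0 / m), size=(m, m))
    y = D_prev * v
    ratios.append(np.linalg.norm(D_next * (W @ y)) ** 2 / np.linalg.norm(y) ** 2)
print("mean and std of the norm ratio:", np.mean(ratios), np.std(ratios))
# The ratio concentrates around 1 with fluctuations of order 1/sqrt(m),
# matching the single-step concentration used in the induction.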

Lemma 6.

For any sequences {ai,i=1,2,,n}\{a_{i},i=1,2,\dots,n\} and {bi,j,i,j=1,2,,n}\{b_{i,j},i,j=1,2,\dots,n\}, we have

i=1njiajbi,j=i=1naijibj,i,\displaystyle\sum_{i=1}^{n}\sum_{j\neq i}a_{j}b_{i,j}=\sum_{i=1}^{n}a_{i}\sum_{j\neq i}b_{j,i}, (59)
i=1njiaibi,j=i=1naijibi,j.\displaystyle\sum_{i=1}^{n}\sum_{j\neq i}a_{i}b_{i,j}=\sum_{i=1}^{n}a_{i}\sum_{j\neq i}b_{i,j}. (60)

where the notation ji\sum_{j\neq i} means jibj=j=1nbjbi\sum_{j\neq i}b_{j}=\sum_{j=1}^{n}b_{j}-b_{i}.

Proof.

First, we know

i=1njiajbi,j\displaystyle\sum_{i=1}^{n}\sum_{j\neq i}a_{j}b_{i,j}
=\displaystyle= j1ajb1,j+j2ajb2,j++jnajbn,j\displaystyle\sum_{j\neq 1}a_{j}b_{1,j}+\sum_{j\neq 2}a_{j}b_{2,j}+\dots+\sum_{j\neq n}a_{j}b_{n,j}
=\displaystyle= (j=1najb1,ja1b1,1)+(j=1najb2,ja2b2,2)++(j=1najbn,janbn,n)\displaystyle\left(\sum_{j=1}^{n}a_{j}b_{1,j}-a_{1}b_{1,1}\right)+\left(\sum_{j=1}^{n}a_{j}b_{2,j}-a_{2}b_{2,2}\right)+\dots+\left(\sum_{j=1}^{n}a_{j}b_{n,j}-a_{n}b_{n,n}\right)
=\displaystyle= i=1nj=1najbi,ji=1naibi,i=j=1naji=1nbi,ji=1naibi,i\displaystyle\sum_{i=1}^{n}\sum_{j=1}^{n}a_{j}b_{i,j}-\sum_{i=1}^{n}a_{i}b_{i,i}=\sum_{j=1}^{n}a_{j}\sum_{i=1}^{n}b_{i,j}-\sum_{i=1}^{n}a_{i}b_{i,i}
=\displaystyle= j=1najijbi,j=i=1naijibj,i(note that bi,jbj,i)\displaystyle\sum_{j=1}^{n}a_{j}\sum_{i\neq j}b_{i,j}=\sum_{i=1}^{n}a_{i}\sum_{j\neq i}b_{j,i}\quad\text{(note that $b_{i,j}\neq b_{j,i}$)}

This proves (59). Next, we prove (60). To this end, we have

i=1njiaibi,j\displaystyle\sum_{i=1}^{n}\sum_{j\neq i}a_{i}b_{i,j}
=\displaystyle= j1a1b1,j+j2a2b2,j++jnanbn,j\displaystyle\sum_{j\neq 1}a_{1}b_{1,j}+\sum_{j\neq 2}a_{2}b_{2,j}+\dots+\sum_{j\neq n}a_{n}b_{n,j}
=\displaystyle= (j=1na1b1,ja1b1,1)+(j=1na2b2,ja2b2,2)++(j=1nanbn,janbn,n)\displaystyle\left(\sum_{j=1}^{n}a_{1}b_{1,j}-a_{1}b_{1,1}\right)+\left(\sum_{j=1}^{n}a_{2}b_{2,j}-a_{2}b_{2,2}\right)+\dots+\left(\sum_{j=1}^{n}a_{n}b_{n,j}-a_{n}b_{n,n}\right)
=\displaystyle= i=1naij=1nbi,ji=1naibi,i=i=1naijibi,j.\displaystyle\sum_{i=1}^{n}a_{i}\sum_{j=1}^{n}b_{i,j}-\sum_{i=1}^{n}a_{i}b_{i,i}=\sum_{i=1}^{n}a_{i}\sum_{j\neq i}b_{i,j}.
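
The identities (59) and (60) can also be checked directly, e.g. with the short script below for random sequences (the sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(8)
n = 7
a = rng.normal(size=n)
b = rng.normal(size=(n, n))

lhs59 = sum(a[j] * b[i, j] for i in range(n) for j in range(n) if j != i)
rhs59 = sum(a[i] * sum(b[j, i] for j in range(n) if j != i) for i in range(n))
lhs60 = sum(a[i] * b[i, j] for i in range(n) for j in range(n) if j != i)
rhs60 = sum(a[i] * sum(b[i, j] for j in range(n) if j != i) for i in range(n))
print(np.isclose(lhs59, rhs59), np.isclose(lhs60, rhs60))   # True True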

Lemma 7.

Assume

12eθ/2θ1/2L3/2τ19\sqrt{12}e^{\theta/2}\theta^{-1/2}L^{3/2}\tau\leq\frac{1}{9}

where \tau is defined in (4). Then, for any \theta\in(0,1/2), with a probability 1-4L^{2}\exp(-\theta^{2}m/[16L^{2}]), for any 1\leq k_{b}<k_{a}\leq L, we have

Zka,kbt,i24Leθ/2θ1/2\|Z_{k_{a},k_{b}}^{t,i}\|_{2}\leq 4\sqrt{L}e^{\theta/2}\theta^{-1/2}
Proof.

Expand Zka,kbtZ^{t}_{k_{a},k_{b}} as

\displaystyle Z_{k_{a},k_{b}}^{t,i}=Z_{k_{a},k_{b}}^{1,i}+\sum_{s=1}^{k_{a}-k_{b}}\sum_{k_{1}>k_{2}>\ldots>k_{s}}Z_{k_{a},k_{1}+1}^{1,i}\prod_{\ell=1}^{s}\delta W_{t,k_{\ell}}Z_{k_{\ell}+1,k_{\ell+1}}^{1,i}

According to Lemma 5, for any θ(0,1/2)\theta\in(0,1/2), with a probability 12Lexp(θ2m/[16L2])1-2L\exp(-\theta^{2}m/[16L^{2}]), we have Zka,kb1,i12Leθ/2θ1/2\|Z_{k_{a},k_{b}}^{1,i}\|\leq\sqrt{12L}e^{\theta/2}\theta^{-1/2}. We thus have, with a probability 14L2exp(θ2m/[16L2])1-4L^{2}\exp(-\theta^{2}m/[16L^{2}]),

\displaystyle\|Z_{k_{a},k_{b}}^{t,i}\|_{2}\overset{(a)}{\leq} \|Z_{k_{a},k_{b}}^{1,i}\|_{2}+\sum_{s=1}^{k_{a}-k_{b}}\sum_{k_{1}>k_{2}>\ldots>k_{s}}\|Z_{k_{a},k_{1}+1}^{1,i}\|_{2}\prod_{\ell=1}^{s}\|\delta W_{t,k_{\ell}}\|_{2}\|Z_{k_{\ell}+1,k_{\ell+1}}^{1,i}\|_{2}
\displaystyle\overset{(b)}{\leq} \sqrt{12L}e^{\theta/2}\theta^{-1/2}\left(1+\sum_{s=1}^{k_{a}-k_{b}}\binom{L}{s}(\tau\sqrt{12L}e^{\theta/2}\theta^{-1/2})^{s}\right)
\displaystyle\overset{(c)}{\leq} \sqrt{12L}e^{\theta/2}\theta^{-1/2}\left(1+\sum_{s=1}^{k_{a}-k_{b}}(L\tau\sqrt{12L}e^{\theta/2}\theta^{-1/2})^{s}\right)
(d)\displaystyle\overset{(d)}{\leq} 4Leθ/2θ1/2,\displaystyle 4\sqrt{L}e^{\theta/2}\theta^{-1/2},

where (a) uses \|AB\|_{2}\leq\|A\|_{2}\|B\|_{2} and \|A+B\|_{2}\leq\|A\|_{2}+\|B\|_{2} for any matrices A,B\in\mathbb{R}^{m\times m}; (b) applies Lemma 5 up to 2^{L} times, once for each combination, to bound \|Z_{k_{a},k_{b}}^{1,i}\|_{2}\leq\sqrt{12L}e^{\theta/2}\theta^{-1/2}, together with the definition of \tau; (c) uses the fact that \binom{L}{s}\leq L^{s}; (d) uses \sum_{i=0}^{n}a^{i}\leq\frac{1}{1-a} for a\in(0,1) with \sqrt{12L}e^{\theta/2}\theta^{-1/2}L\tau\leq\frac{1}{9}, and \frac{9\sqrt{3}}{4}<4. ∎
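
For step (d), the relevant arithmetic can be checked directly: with x=L\tau\sqrt{12L}e^{\theta/2}\theta^{-1/2}\leq 1/9, the prefactor becomes \sqrt{12L}\cdot\frac{1}{1-x}\leq\frac{9\sqrt{3}}{4}\sqrt{L}<4\sqrt{L}. A one-line check of the constants:

import math

x = 1.0 / 9.0                                   # x = L*tau*sqrt(12L)*e^{theta/2}/sqrt(theta)
prefactor = math.sqrt(12) * (1.0 / (1.0 - x))   # sqrt(12) * (1 + x + x^2 + ...) bound
print(prefactor, "= 9*sqrt(3)/4 =", 9 * math.sqrt(3) / 4, "< 4")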