
Zenan Ling ([email protected])
Zhenyu Liao ([email protected])
Robert C. Qiu ([email protected])
Huazhong University of Science and Technology, Wuhan, China

On the Equivalence between Implicit and Explicit Neural Networks:
A High-dimensional Viewpoint

Abstract

Implicit neural networks have demonstrated remarkable success in various tasks. However, there is a lack of theoretical analysis of the connections and differences between implicit and explicit networks. In this paper, we study high-dimensional implicit neural networks and provide high-dimensional equivalents for the corresponding conjugate kernels and neural tangent kernels. Building upon this, we establish the equivalence between implicit and explicit networks in high dimensions.

1 Introduction

Implicit neural networks (NNs) [Bai et al.(2019)Bai, Kolter, and Koltun] have recently emerged as a new paradigm in neural network design. An implicit NN is equivalent to an infinite-depth weight-shared explicit NN with input injection. Unlike explicit NNs, implicit NNs generate features by directly solving for the fixed point, rather than through layer-by-layer forward propagation. Moreover, implicit NNs have the remarkable advantage that gradients can be computed analytically through the fixed point alone via implicit differentiation. Therefore, training implicit NNs requires only constant memory.

Despite the empirical success achieved by implicit NNs [Bai et al.(2020)Bai, Koltun, and Kolter, Xie et al.(2022)Xie, Wang, Ling, Li, Liu, and Lin], our theoretical understanding of these models is still limited. In particular, there is a lack of theoretical analysis of the training dynamics and generalization performance of implicit NNs and, possibly more importantly, of whether these properties can be connected to those of explicit NNs. [Bai et al.(2019)Bai, Kolter, and Koltun] demonstrates that any deep NN can be reformulated as a special implicit NN; however, it remains unknown whether general implicit NNs have advantages over explicit NNs. [Feng and Kolter(2020)] extends previous neural tangent kernel (NTK) studies to implicit NNs and gives the exact expression of the NTK of ReLU implicit NNs, but the differences between implicit and explicit NTKs are not analyzed. Moreover, previous works [Ling et al.(2023)Ling, Xie, Wang, Zhang, and Lin, Truong(2023)] have proved the global convergence of gradient descent for training implicit NNs. However, it remains unclear what distinguishes the training dynamics of implicit NNs from those of explicit NNs.

In this paper, we investigate implicit NNs from a high-dimensional viewpoint. Specifically, we perform a fine-grained asymptotic analysis of the eigenspectra of the conjugate kernels (CKs) and NTKs of implicit NNs, which play a fundamental role in the convergence and generalization of high-dimensional NNs [Jacot et al.(2018)Jacot, Gabriel, and Hongler]. By considering input data uniformly drawn from the unit sphere, we derive, using recent advances in random matrix theory, high-dimensional (spectral) equivalents for the CKs and NTKs of implicit NNs, and establish the equivalence between implicit and explicit NNs by matching the coefficients of the corresponding asymptotic spectral equivalents. Surprisingly, our results reveal that a single-layer explicit NN with carefully designed activations has the same CK or NTK eigenspectra as a ReLU implicit NN, whose depth is essentially infinite.

2 Preliminaries

2.1 Implicit and Explicit NNs

Implicit NNs.

In this paper, we study a typical implicit neural network, the deep equilibrium model (DEQ) [Bai et al.(2019)Bai, Kolter, and Koltun]. Let $\bm{X}=[\bm{x}_1,\cdots,\bm{x}_n]\in\mathbb{R}^{d\times n}$ denote the input data. We define a vanilla DEQ with the transform at the $l$-th layer as

\[
\bm{h}_i^{(l)}=\sqrt{\frac{\sigma_a^2}{m}}\,\bm{A}\bm{z}_i^{(l-1)}+\sqrt{\frac{\sigma_b^2}{m}}\,\bm{B}\bm{x}_i,\qquad \bm{z}_i^{(l)}=\phi\big(\bm{h}_i^{(l)}\big), \tag{1}
\]

where $\bm{A}\in\mathbb{R}^{m\times m}$ and $\bm{B}\in\mathbb{R}^{m\times d}$ are weight matrices, $\sigma_a,\sigma_b\in\mathbb{R}$ are constants, $\phi$ is an element-wise activation, $\bm{h}_i^{(l)}$ is the pre-activation, and $\bm{z}_i^{(l)}\in\mathbb{R}^m$ is the output feature of the $l$-th hidden layer corresponding to the input data $\bm{x}_i$. The output of the last hidden layer is defined by $\bm{z}_i^*\triangleq\lim_{l\rightarrow\infty}\bm{z}_i^{(l)}$, and we denote the corresponding pre-activation by $\bm{h}_i^*$. Note that $\bm{z}_i^*$ can be calculated by directly solving for the equilibrium point of the following equation

\[
\bm{z}_i^{*}=\phi\left(\sqrt{\frac{\sigma_a^2}{m}}\,\bm{A}\bm{z}_i^{*}+\sqrt{\frac{\sigma_b^2}{m}}\,\bm{B}\bm{x}_i\right). \tag{2}
\]
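For intuition, below is a minimal numerical sketch of how $\bm{z}_i^*$ in Eq. (2) can be computed, assuming NumPy; the function name `deq_forward` and all parameter choices are ours. Plain fixed-point (Picard) iteration is used purely for illustration: its convergence is not guaranteed for every random draw, and practical DEQ implementations typically rely on root solvers such as Broyden's method or Anderson acceleration.

```python
import numpy as np

def deq_forward(A, B, x, sigma_a2=0.2, sigma_b2=0.8, tol=1e-8, max_iter=1000):
    """Fixed-point iteration for the DEQ feature z_i^* of Eq. (2).

    A: (m, m) and B: (m, d) weight matrices, x: (d,) input;
    phi is the normalized ReLU used throughout the paper.
    """
    m = A.shape[0]
    phi = lambda h: np.sqrt(2.0) * np.maximum(h, 0.0)
    z = np.zeros(m)
    for _ in range(max_iter):
        z_new = phi(np.sqrt(sigma_a2 / m) * (A @ z) + np.sqrt(sigma_b2 / m) * (B @ x))
        if np.linalg.norm(z_new - z) <= tol * (1.0 + np.linalg.norm(z)):
            return z_new
        z = z_new
    return z  # may not be fully converged; see the caveat above

# usage on random Gaussian weights and a unit-sphere input
rng = np.random.default_rng(0)
m, d = 400, 200
A, B = rng.standard_normal((m, m)), rng.standard_normal((m, d))
x = rng.standard_normal(d); x /= np.linalg.norm(x)
z_star = deq_forward(A, B, x)
```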

We are interested in the conjugate kernel and neural tangent kernel (Implicit-CK and Implicit-NTK, for short) of the implicit NN defined in Eq. (2). Following [Feng and Kolter(2020)], we denote the corresponding Implicit-CK by $\bm{G}^*=\lim_{l\rightarrow\infty}\bm{G}^{(l)}$, where the $(i,j)$-th entry of $\bm{G}^{(l)}$ is defined recursively as

\[
\begin{split}
&\bm{G}^{(0)}_{ij}=\bm{x}_i^\top\bm{x}_j,\qquad
\bm{\Lambda}^{(l)}_{ij}=\begin{bmatrix}\bm{G}^{(l-1)}_{ii} & \bm{G}^{(l-1)}_{ij}\\ \bm{G}^{(l-1)}_{ji} & \bm{G}^{(l-1)}_{jj}\end{bmatrix},\\
&\bm{G}^{(l)}_{ij}=\sigma_a^2\,\mathbb{E}_{(u,v)\sim\mathcal{N}(0,\bm{\Lambda}^{(l)}_{ij})}\big[\phi(u)\phi(v)\big]+\sigma_b^2\,\bm{x}_i^\top\bm{x}_j,\\
&\dot{\bm{G}}^{(l)}_{ij}=\sigma_a^2\,\mathbb{E}_{(u,v)\sim\mathcal{N}(0,\bm{\Lambda}^{(l)}_{ij})}\big[\phi'(u)\phi'(v)\big].
\end{split} \tag{3}
\]

The Implicit-NTK is defined as $\bm{K}^*=\lim_{l\rightarrow\infty}\bm{K}^{(l)}$, whose $(i,j)$-th entry is given by

\[
\bm{K}^{(l)}_{ij}=\sum_{h=1}^{l+1}\left(\bm{G}^{(h-1)}_{ij}\prod_{h'=h}^{l+1}\dot{\bm{G}}^{(h')}_{ij}\right). \tag{4}
\]
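The recursion in Eqs. (3)-(4) can be simulated entry-wise by replacing the Gaussian expectations with Monte-Carlo averages and truncating the depth at a finite $l$. A sketch follows, assuming NumPy; the function names are ours, and the estimates are noisy since this is written for illustration rather than efficiency.

```python
import numpy as np

def gauss_exp(fn, Lam, rng, n_mc=100_000):
    """Monte-Carlo estimate of E[fn(u) fn(v)] for (u, v) ~ N(0, Lam), Lam a 2x2 covariance."""
    L = np.linalg.cholesky(Lam + 1e-10 * np.eye(2))   # jitter guards against Lam being singular
    uv = rng.standard_normal((n_mc, 2)) @ L.T
    return float(np.mean(fn(uv[:, 0]) * fn(uv[:, 1])))

def implicit_kernels_entry(xi, xj, sigma_a2=0.2, n_layers=40, seed=0):
    """Truncated recursion of Eqs. (3)-(4) for one data pair (x_i, x_j) on the unit sphere."""
    rng = np.random.default_rng(seed)
    phi = lambda t: np.sqrt(2.0) * np.maximum(t, 0.0)   # normalized ReLU
    dphi = lambda t: np.sqrt(2.0) * (t > 0.0)           # its derivative (a.e.)
    sigma_b2 = 1.0 - sigma_a2
    Gij = float(xi @ xj)
    G_hist, Gdot_hist = [Gij], []                       # G^(0), then G^(l) and Gdot^(l)
    for _ in range(n_layers):
        # for unit-norm data and the normalized ReLU, the diagonal entries stay equal to 1
        Lam = np.array([[1.0, Gij], [Gij, 1.0]])
        Gdot_hist.append(sigma_a2 * gauss_exp(dphi, Lam, rng))
        Gij = sigma_a2 * gauss_exp(phi, Lam, rng) + sigma_b2 * float(xi @ xj)
        G_hist.append(Gij)
    # finite-depth truncation of the NTK sum in Eq. (4)
    Kij = sum(G_hist[h] * np.prod(Gdot_hist[h:]) for h in range(n_layers))
    return Gij, Kij
```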

Explicit Neural Networks.

We consider a single-layer fully-connected NN model defined as $\bm{Y}=\sqrt{\tfrac{1}{p}}\,\sigma(\bm{W}\bm{X})$, where $\bm{W}\in\mathbb{R}^{p\times d}$ is the weight matrix and $\sigma$ is an element-wise activation function. Let $\bm{w}\sim\mathcal{N}(0,\bm{I}_d)$; the corresponding Explicit-CK matrix $\bm{\Sigma}$ and Explicit-NTK matrix $\bm{\Theta}$ are defined as follows:

\[
\bm{\Sigma}=\mathbb{E}_{\bm{w}}\big[\sigma(\bm{w}^\top\bm{X})^\top\sigma(\bm{w}^\top\bm{X})\big],\qquad
\bm{\Theta}=\bm{\Sigma}+\big(\bm{X}^\top\bm{X}\big)\odot\mathbb{E}_{\bm{w}}\big[\sigma'(\bm{w}^\top\bm{X})^\top\sigma'(\bm{w}^\top\bm{X})\big]. \tag{5}
\]
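The expectations in Eq. (5) can likewise be estimated by averaging over independent draws of $\bm{w}$. A small sketch, assuming NumPy; the function name and the quadratic activation used in the usage lines are arbitrary placeholders of ours.

```python
import numpy as np

def explicit_ck_ntk(X, sigma, dsigma, n_mc=20_000, rng=None):
    """Monte-Carlo estimates of the Explicit-CK and Explicit-NTK of Eq. (5).

    X has shape (d, n); sigma and dsigma are the activation and its derivative,
    applied element-wise; the expectation over w ~ N(0, I_d) is replaced by an
    average over n_mc independent draws (the rows of W).
    """
    rng = rng or np.random.default_rng(0)
    d, n = X.shape
    W = rng.standard_normal((n_mc, d))     # each row plays the role of w^T
    H = W @ X                              # (n_mc, n) matrix of the w^T x_j
    S, Sp = sigma(H), dsigma(H)
    Sigma = S.T @ S / n_mc                 # ~ E_w[sigma(w^T X)^T sigma(w^T X)]
    Theta = Sigma + (X.T @ X) * (Sp.T @ Sp / n_mc)
    return Sigma, Theta

# usage with an (arbitrary, illustrative) quadratic activation
rng = np.random.default_rng(1)
d, n = 300, 200
X = rng.standard_normal((d, n)); X /= np.linalg.norm(X, axis=0)  # columns on the unit sphere
Sigma, Theta = explicit_ck_ntk(X, lambda t: 0.3 * t**2 + 0.8 * t + 0.1,
                               lambda t: 0.6 * t + 0.8)
```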

2.2 CKs and NTKs of ReLU Implicit NNs

We make the following assumptions on the random initialization, the input data, and activations.

Assumption 1

(i) As $n\rightarrow\infty$, $d/n\rightarrow c\in(0,\infty)$. All data points $\bm{x}_i$, $i\in[n]$, are independent and uniformly sampled from $\mathbb{S}^{d-1}$. (ii) $\bm{A}$, $\bm{B}$, and $\bm{W}$ are independent and have i.i.d. entries with zero mean, unit variance, and finite fourth moment. Moreover, we require $\sigma_a^2+\sigma_b^2=1$. (iii) The activation $\phi$ of the implicit NN is the normalized ReLU, i.e., $\phi(x)=\sqrt{2}\max(x,0)$. The activation $\sigma$ of the explicit NN is a $C^3$ function.

Remark 1

(i) Although our results are derived for the uniform distribution on the unit sphere, we conjecture that they extend to more general distributions using the techniques developed in [Fan and Wang(2020), Gu et al.(2022)Gu, Du, Yuan, Xie, Pu, Qiu, and Liao]. (ii) The additional requirement on the variances ensures the existence and uniqueness of the fixed point of the NTK and keeps the diagonal entries of the CK matrix equal to $1$; see the examples in [Feng and Kolter(2020)]. (iii) It is possible to extend our results to implicit NNs with general activations using the technique proposed in [Truong(2023)]. We defer the extension to more general data distributions and activation functions to future work.

Under Assumption 1, the limits of the Implicit-CK and Implicit-NTK exist, and one has the following precise expressions for $\bm{G}^*$ and $\bm{K}^*$ [Feng and Kolter(2020), Ling et al.(2023)Ling, Xie, Wang, Zhang, and Lin].

Lemma 1

Let $f(x)=\frac{\sqrt{1-x^2}+(\pi-\arccos x)\,x}{\pi}$. Under Assumption 1, the fixed point of the Implicit-CK, $\bm{G}^*_{ij}$, is the root of

\[
\bm{G}^*_{ij}=\sigma_a^2 f(\bm{G}^*_{ij})+(1-\sigma_a^2)\,\bm{x}_i^\top\bm{x}_j. \tag{6}
\]

The limit of Implicit-NTK is

\[
\bm{K}^*_{ij}=h(\bm{G}^*_{ij})\triangleq\frac{\bm{G}^*_{ij}}{1-\dot{\bm{G}}^*_{ij}},\qquad\text{where}\quad \dot{\bm{G}}^*_{ij}\triangleq\sigma_a^2\,\pi^{-1}\big(\pi-\arccos(\bm{G}^*_{ij})\big). \tag{7}
\]
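Since Eq. (6) is a scalar equation per entry and $0\le f'\le 1$, the map $G\mapsto\sigma_a^2 f(G)+(1-\sigma_a^2)\,\bm{x}_i^\top\bm{x}_j$ is a $\sigma_a^2$-contraction whenever $\sigma_a^2<1$, so $\bm{G}^*_{ij}$ can be computed by simple fixed-point iteration and then mapped through Eq. (7). A sketch assuming NumPy follows; the default $\sigma_a^2=0.2$ matches the value used in Figure 1, and the function names are ours.

```python
import numpy as np

def f(x):
    """f(x) = (sqrt(1 - x^2) + (pi - arccos x) x) / pi from Lemma 1 (entry-wise)."""
    x = np.clip(x, -1.0, 1.0)
    return (np.sqrt(1.0 - x**2) + (np.pi - np.arccos(x)) * x) / np.pi

def g(t, sigma_a2=0.2, n_iter=200):
    """Solve Eq. (6), G = sigma_a^2 f(G) + (1 - sigma_a^2) t, by fixed-point iteration.

    Since 0 <= f' <= 1, the iteration is a sigma_a^2-contraction for sigma_a^2 < 1
    and converges to the unique root (entry-wise when t is an array).
    """
    t = np.asarray(t, dtype=float)
    G = t.copy()
    for _ in range(n_iter):
        G = sigma_a2 * f(G) + (1.0 - sigma_a2) * t
    return G

def h(G, sigma_a2=0.2):
    """Implicit-NTK map of Eq. (7): K = G / (1 - Gdot), Gdot = sigma_a^2 (pi - arccos G) / pi."""
    Gdot = sigma_a2 * (np.pi - np.arccos(np.clip(G, -1.0, 1.0))) / np.pi
    return G / (1.0 - Gdot)

# for a data matrix X in R^{d x n} with unit-norm columns:
#   G_star = g(X.T @ X);  K_star = h(G_star)
```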

3 Main Results

In this section, we derive high-dimensional equivalents for the CKs and NTKs of implicit and explicit NNs. By matching the coefficients of these asymptotic spectral equivalents, we then establish the equivalence between implicit and explicit NNs in high dimensions.

3.1 Asymptotic Approximations

CKs.

We begin by defining several quantities that are crucial to our results. Note that the unique fixed point of Eq. (6) exists as long as $\sigma_a^2<1$. We define the implicit map induced by Eq. (6) as $\bm{G}^*_{ij}\triangleq g(\bm{x}_i^\top\bm{x}_j)$. Let $\angle^*=g(0)$ denote the solution of $\angle^*=\sigma_a^2 f(\angle^*)$, i.e., the value of $\bm{G}^*_{ij}$ when $\bm{x}_i^\top\bm{x}_j=0$. Using implicit differentiation, one obtains

\[
g'(0)=\frac{1-\sigma_a^2}{1-\sigma_a^2 f'(\angle^*)},\qquad
g''(0)=\frac{\sigma_a^2(1-\sigma_a^2)^2 f''(\angle^*)}{\big(1-\sigma_a^2 f'(\angle^*)\big)^3}.
\]
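For completeness, these expressions follow from differentiating the fixed-point relation $g(t)=\sigma_a^2 f(g(t))+(1-\sigma_a^2)\,t$ once and twice in $t$,
\[
g'(t)=\sigma_a^2 f'(g(t))\,g'(t)+(1-\sigma_a^2),\qquad
g''(t)=\sigma_a^2\big(f''(g(t))\,g'(t)^2+f'(g(t))\,g''(t)\big),
\]
and solving these linear relations for $g'$ and $g''$ at $t=0$, where $g(0)=\angle^*$.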

Now we are ready to present the asymptotic equivalent of the Implicit-CK matrix.

Theorem 1 (Asymptotic approximation of Implicit-CKs)

Let Assumption 1 hold. As $n,d\rightarrow\infty$, the Implicit-CK matrix $\bm{G}^*$ defined in Eq. (6) can be approximated consistently, in operator norm, by the matrix $\overline{\bm{G}}$, that is, $\|\bm{G}^*-\overline{\bm{G}}\|_2\rightarrow 0$, where

\[
\overline{\bm{G}}=\alpha\,\bm{1}\bm{1}^\top+\beta\,\bm{X}^\top\bm{X}+\mu\,\bm{I}_n,
\]

with $\alpha=g(0)+\frac{g''(0)}{2d}$, $\beta=g'(0)$, and $\mu=g(1)-g(0)-g'(0)$.
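A numerical sanity check of Theorem 1 can be carried out directly; the sketch below (assuming NumPy, with $\bm{X}\in\mathbb{R}^{d\times n}$ having unit-norm columns) builds $\overline{\bm{G}}$ from the formulas above, solves Eq. (6) entry-wise for $\bm{G}^*$, and reports the operator-norm gap. It reuses the functions `f` and `g` from the sketch following Lemma 1, and the function name is ours.

```python
import numpy as np

def implicit_ck_equivalent_error(X, sigma_a2=0.2):
    """Operator-norm gap between G* (Eq. (6), solved entry-wise) and the equivalent
    G_bar of Theorem 1. Reuses f and g from the sketch following Lemma 1."""
    d, n = X.shape
    ang = float(g(0.0, sigma_a2))                      # \angle^* = g(0)
    fp = (np.pi - np.arccos(ang)) / np.pi              # f'(\angle^*)
    fpp = 1.0 / (np.pi * np.sqrt(1.0 - ang**2))        # f''(\angle^*)
    gp0 = (1.0 - sigma_a2) / (1.0 - sigma_a2 * fp)     # g'(0)
    gpp0 = sigma_a2 * (1.0 - sigma_a2)**2 * fpp / (1.0 - sigma_a2 * fp)**3  # g''(0)
    alpha = ang + gpp0 / (2.0 * d)
    beta = gp0
    mu = 1.0 - ang - gp0                               # g(1) = 1 for the normalized ReLU
    G_bar = alpha * np.ones((n, n)) + beta * (X.T @ X) + mu * np.eye(n)
    G_star = g(X.T @ X, sigma_a2)                      # entry-wise fixed points
    return np.linalg.norm(G_star - G_bar, 2)
```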

Theorem 2 (Asymptotic approximation for Explicit-CKs)

Let Assumption 1 hold. As $n,d\rightarrow\infty$, the Explicit-CK matrix $\bm{\Sigma}$ defined in Eq. (5) can be approximated consistently, in operator norm, by the matrix $\overline{\bm{\Sigma}}$, that is, $\|\bm{\Sigma}-\overline{\bm{\Sigma}}\|_2\rightarrow 0$, where

\[
\overline{\bm{\Sigma}}=\alpha_1\,\bm{1}\bm{1}^\top+\beta_1\,\bm{X}^\top\bm{X}+\mu_1\,\bm{I}_n,
\]

with $\alpha_1=\mathbb{E}[\sigma(z)]^2+\frac{\mathbb{E}[\sigma''(z)]^2}{2d}$, $\beta_1=\mathbb{E}[\sigma'(z)]^2$, and $\mu_1=\mathbb{E}[\sigma^2(z)]-\mathbb{E}[\sigma(z)]^2-\mathbb{E}[\sigma'(z)]^2$, for $z\sim\mathcal{N}(0,1)$.

NTKs.

For the Implicit-NTK, we define $\bm{K}^*_{ij}=k(\bm{x}_i^\top\bm{x}_j)$, i.e., $k(\bm{x}_i^\top\bm{x}_j)=h(g(\bm{x}_i^\top\bm{x}_j))$, for $i,j\in[n]$. It is easy to check that $k(0)=h(\angle^*)$ and $k(1)=h(g(1))$. Using implicit differentiation again, we have

\[
k'(0)=\frac{(1-\sigma_a^2)\,h'(\angle^*)}{1-\sigma_a^2 f'(\angle^*)},\qquad
k''(0)=\frac{(1-\sigma_a^2)^2\big(h''(\angle^*)-\sigma_a^2 f'(\angle^*)h''(\angle^*)+\sigma_a^2 h'(\angle^*)f''(\angle^*)\big)}{\big(1-\sigma_a^2 f'(\angle^*)\big)^3}.
\]

Now we are ready to present the asymptotic equivalent of the Implicit-NTK matrix.

Theorem 3 (Asymptotic approximation for Implicit-NTKs)

Let Assumption 1 hold. As $n,d\rightarrow\infty$, the Implicit-NTK matrix $\bm{K}^*$ defined in Eq. (7) can be approximated consistently, in operator norm, by the matrix $\overline{\bm{K}}$, that is, $\|\bm{K}^*-\overline{\bm{K}}\|_2\rightarrow 0$, where

\[
\overline{\bm{K}}=\dot{\alpha}\,\bm{1}\bm{1}^\top+\dot{\beta}\,\bm{X}^\top\bm{X}+\dot{\mu}\,\bm{I}_n,
\]

with $\dot{\alpha}=k(0)+\frac{k''(0)}{2d}$, $\dot{\beta}=k'(0)$, and $\dot{\mu}=k(1)-k(0)-k'(0)$.

Theorem 4 (Asymptotic approximation for Explicit-NTKs)

Let Assumption 1 hold. As $n,d\rightarrow\infty$, the Explicit-NTK matrix $\bm{\Theta}$ defined in Eq. (5) can be approximated consistently, in operator norm, by the matrix $\overline{\bm{\Theta}}$, that is, $\|\bm{\Theta}-\overline{\bm{\Theta}}\|_2\rightarrow 0$, where

\[
\overline{\bm{\Theta}}=\dot{\alpha}_1\,\bm{1}\bm{1}^\top+\dot{\beta}_1\,\bm{X}^\top\bm{X}+\dot{\mu}_1\,\bm{I}_n,
\]

with $\dot{\alpha}_1=\mathbb{E}[\sigma(z)]^2+\frac{3\,\mathbb{E}[\sigma''(z)]^2}{2d}$, $\dot{\beta}_1=2\,\mathbb{E}[\sigma'(z)]^2$, and $\dot{\mu}_1=\mathbb{E}[\sigma^2(z)]+\mathbb{E}[\sigma'(z)^2]-\mathbb{E}[\sigma(z)]^2-2\,\mathbb{E}[\sigma'(z)]^2$, for $z\sim\mathcal{N}(0,1)$.
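The Gaussian moments appearing in Theorems 2 and 4 can be computed for any sufficiently smooth $\sigma$ by Gauss-Hermite quadrature. The sketch below assumes NumPy; the function name, its interface, and the quadrature degree are ours.

```python
import numpy as np

def explicit_coefficients(sigma, dsigma, ddsigma, d, n_quad=200):
    """Coefficients of Theorems 2 and 4 for a given activation sigma, computed from
    its Gaussian moments via (probabilists') Gauss-Hermite quadrature."""
    t, w = np.polynomial.hermite_e.hermegauss(n_quad)
    w = w / np.sqrt(2.0 * np.pi)                    # sum(w * fn(t)) ~ E[fn(z)], z ~ N(0, 1)
    E = lambda fn: float(np.sum(w * fn(t)))
    m0, m0sq = E(sigma), E(lambda z: sigma(z)**2)
    m1, m1sq = E(dsigma), E(lambda z: dsigma(z)**2)
    m2 = E(ddsigma)
    alpha1 = m0**2 + m2**2 / (2.0 * d)              # Theorem 2 (CK)
    beta1 = m1**2
    mu1 = m0sq - m0**2 - m1**2
    alpha1_dot = m0**2 + 3.0 * m2**2 / (2.0 * d)    # Theorem 4 (NTK)
    beta1_dot = 2.0 * m1**2
    mu1_dot = m0sq + m1sq - m0**2 - 2.0 * m1**2
    return (alpha1, beta1, mu1), (alpha1_dot, beta1_dot, mu1_dot)

# usage for a quadratic sigma(t) = a2 t^2 + a1 t + a0 (arbitrary illustrative coefficients)
a2, a1, a0 = 0.3, 0.8, 0.1
ck, ntk = explicit_coefficients(lambda z: a2 * z**2 + a1 * z + a0,
                                lambda z: 2 * a2 * z + a1,
                                lambda z: np.full_like(z, 2 * a2), d=1200)
```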

Remark 2

(i) Due to the homogeneity of the ReLU function, the Implicit-CK and the Implicit-NTK are essentially inner-product kernel random matrices. Consequently, Theorems 1 and 3 can be built upon the results in [El Karoui(2010)]. We postpone the study of general activations to future work. (ii) The results in Theorems 2 and 4 generalize those of [Gu et al.(2022)Gu, Du, Yuan, Xie, Pu, Qiu, and Liao, Ali et al.(2022)Ali, Liao, and Couillet] to the case of "non-centred" activations, i.e., we do not require $\mathbb{E}[\sigma(z)]=0$ for $z\sim\mathcal{N}(0,1)$.

3.2 The Equivalence between Implicit and Explicit NNs

Figure 1: We independently generate $n=1\,000$ data points uniformly on the unit sphere of dimension $d=1\,200$, use Gaussian initialization, and set $\sigma_a^2=0.2$. Top: CK results. Bottom: NTK results. (a) Spectral densities of the implicit kernels; (b) spectral densities of the explicit kernels; (c) the quadratic activations.

In the following corollary, we show a concrete case of a single-layer explicit NN with a quadratic activation that matches the CK or NTK eigenspectra of a ReLU implicit NN. The idea is to use the results of Theorems 1-4 and match the coefficients of the asymptotic equivalents so that $\alpha_1=\alpha,\beta_1=\beta,\mu_1=\mu$, or $\dot{\alpha}_1=\dot{\alpha},\dot{\beta}_1=\dot{\beta},\dot{\mu}_1=\dot{\mu}$. We perform numerical simulations to verify our theory; the results are shown in Figure 1.

Corollary 1

We consider a quadratic polynomial activation $\sigma(t)=a_2t^2+a_1t+a_0$. Let Assumption 1 hold. As $n,d\rightarrow\infty$, the Implicit-CK matrix $\bm{G}^*$ defined in Eq. (6) can be approximated consistently, in operator norm, by the Explicit-CK matrix $\bm{\Sigma}$ defined in Eq. (5), i.e., $\|\bm{G}^*-\bm{\Sigma}\|_2\rightarrow 0$, as long as

\[
a_2=\pm\sqrt{\frac{\mu}{2}},\quad a_1=\pm\sqrt{\beta},\quad a_0=\pm\sqrt{\alpha-\frac{\mu}{d}}-a_2,
\]

and the Implicit-NTK matrix $\bm{K}^*$ defined in Eq. (7) can be approximated consistently, in operator norm, by the Explicit-NTK matrix $\bm{\Theta}$ defined in Eq. (5), i.e., $\|\bm{K}^*-\bm{\Theta}\|_2\rightarrow 0$, as long as

\[
a_2=\pm\sqrt{\frac{\dot{\mu}}{6}},\quad a_1=\pm\sqrt{\frac{\dot{\beta}}{2}},\quad a_0=\pm\sqrt{\dot{\alpha}-\frac{\dot{\mu}}{d}}-a_2.
\]
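A minimal helper, assuming NumPy and taking the $+$ branch of each $\pm$ (the function names are ours), that returns the matched quadratic coefficients of Corollary 1:

```python
import numpy as np

def matched_quadratic_ck(alpha, beta, mu, d):
    """(a2, a1, a0) matching the Implicit-CK coefficients, '+' branches of Corollary 1."""
    a2 = np.sqrt(mu / 2.0)
    a1 = np.sqrt(beta)
    a0 = np.sqrt(alpha - mu / d) - a2
    return a2, a1, a0

def matched_quadratic_ntk(alpha_dot, beta_dot, mu_dot, d):
    """(a2, a1, a0) matching the Implicit-NTK coefficients, '+' branches of Corollary 1."""
    a2 = np.sqrt(mu_dot / 6.0)
    a1 = np.sqrt(beta_dot / 2.0)
    a0 = np.sqrt(alpha_dot - mu_dot / d) - a2
    return a2, a1, a0
```

Plugging the resulting quadratic $\sigma$ into the explicit model of Section 2.1 yields, by Theorems 2 and 4, an Explicit-CK (resp. Explicit-NTK) whose eigenspectrum asymptotically matches that of the ReLU implicit NN, as illustrated in Figure 1.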

4 Conclusion

In this paper, we study the CKs and NTKs of high-dimensional ReLU implicit NNs. We derive asymptotic spectral equivalents for Implicit-CKs and Implicit-NTKs and establish the equivalence between implicit and explicit NNs by matching the coefficients of these equivalents. In particular, we show that a single-layer explicit NN with a carefully designed activation has the same CK or NTK eigenspectrum as a ReLU implicit NN. For future work, it would be interesting to extend our analysis to more general data distributions and activation functions.

Acknowledgements

Z. Liao would like to acknowledge the National Natural Science Foundation of China (via fund NSFC-62206101) and the Fundamental Research Funds for the Central Universities of China (2021XXJS110) for providing partial support. R. C. Qiu and Z. Liao would like to acknowledge the National Natural Science Foundation of China (via fund NSFC-12141107), the Key Research and Development Program of Hubei (2021BAA037) and of Guangxi (GuiKe-AB21196034).

References

  • [Ali et al.(2022)Ali, Liao, and Couillet] Hafiz Tiomoko Ali, Zhenyu Liao, and Romain Couillet. Random matrices in service of ml footprint: ternary random features with no performance loss. ICLR, 2022.
  • [Bai et al.(2019)Bai, Kolter, and Koltun] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. Deep equilibrium models. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  • [Bai et al.(2020)Bai, Koltun, and Kolter] Shaojie Bai, Vladlen Koltun, and J Zico Kolter. Multiscale deep equilibrium models. Advances in Neural Information Processing Systems, 2020.
  • [El Karoui(2010)] Noureddine El Karoui. The spectrum of kernel random matrices. The Annals of Statistics, 2010.
  • [Fan and Wang(2020)] Zhou Fan and Zhichao Wang. Spectra of the conjugate kernel and neural tangent kernel for linear-width neural networks. Advances in neural information processing systems, 33:7710–7721, 2020.
  • [Feng and Kolter(2020)] Zhili Feng and J. Zico Kolter. On the neural tangent kernel of equilibrium models. arXiv preprint, 2020.
  • [Gu et al.(2022)Gu, Du, Yuan, Xie, Pu, Qiu, and Liao] Lingyu Gu, Yongqi Du, Zhang Yuan, Di Xie, Shiliang Pu, Robert Qiu, and Zhenyu Liao. "Lossless" compression of deep neural networks: A high-dimensional neural tangent kernel approach. Advances in Neural Information Processing Systems, 35:3774–3787, 2022.
  • [Jacot et al.(2018)Jacot, Gabriel, and Hongler] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018.
  • [Ling et al.(2023)Ling, Xie, Wang, Zhang, and Lin] Zenan Ling, Xingyu Xie, Qiuhao Wang, Zongpeng Zhang, and Zhouchen Lin. Global convergence of over-parameterized deep equilibrium models. In International Conference on Artificial Intelligence and Statistics, pages 767–787. PMLR, 2023.
  • [Truong(2023)] Lan V Truong. Global convergence rate of deep equilibrium models with general activations. arXiv preprint arXiv:2302.05797, 2023.
  • [Xie et al.(2022)Xie, Wang, Ling, Li, Liu, and Lin] Xingyu Xie, Qiuhao Wang, Zenan Ling, Xia Li, Guangcan Liu, and Zhouchen Lin. Optimization induced equilibrium networks: An explicit optimization perspective for understanding equilibrium models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.