
On the Landscape of One-hidden-layer Sparse Networks and Beyond

Dachao Lin Academy for Advanced Interdisciplinary Studies, Peking University. [email protected]    Ruoyu Sun Department of Industrial and Enterprise Systems Engineering, Coordinated Science Laboratory (affiliated), University of Illinois Urbana-Champaign. [email protected]    Zhihua Zhang School of Mathematical Sciences, Peking University. [email protected]
Abstract

Sparse neural networks have received increasing interest due to their small size compared to dense networks. Nevertheless, most existing works on neural network theory have focused on dense networks, and the understanding of sparse networks is very limited. In this paper we study the loss landscape of one-hidden-layer sparse networks. First, we consider sparse networks with a dense final layer. We show that linear networks have no spurious valleys under certain sparse structures, and that non-linear networks can also admit no spurious valleys when the final hidden layer is wide. Second, we discover that spurious valleys and spurious minima can exist for wide sparse networks with a sparse final layer. This is different from wide dense networks, which do not have spurious valleys under mild assumptions.

1 Introduction

Motivation

Deep neural networks (DNNs) have achieved remarkable empirical successes in the fields of computer vision, speech recognition, and natural language processing, sparking great interest in the theory underlying their architectures and training. However, DNNs are typically highly over-parameterized, demanding large amounts of memory and computation. For example, it may take more than 1 week to train ResNet-52 on ImageNet with a single GPU. Thus, DNNs are often unsuitable for smaller devices such as embedded electronics, sensors, and smartphones, and there is a pressing demand for techniques that yield models with reduced size, faster inference, and lower power consumption.

Sparse networks, in which a large fraction of the connections are absent, are a possible choice for small-size networks. It has been shown empirically that sparse networks can achieve performance comparable to dense networks [13, 11, 25]. In recent years, many sparse networks have been obtained via network pruning [48, 21, 24, 9].

There are many existing theoretical works on dense neural networks. One popular theoretical line of research is to analyze the loss surface of neural networks [42]. Although this line does not provide a full description of the algorithm behavior, it serves as a starting point for understanding DNNs. For deep linear networks, it is shown that every local minimum is a global minimum [2, 18]. For non-linear networks, researchers have found spurious local minima for various non-linear networks [1, 47, 43, 46, 40]. Despite the existence of suboptimal local minima, researchers discovered that non-linear neural nets can be trained to global minima. One explanation is that such over-parameterized networks still exhibit nice geometrical landscape properties such as the absence of bad valleys and basins [10, 44, 22]. Nguyen et al. [37] also considered sparse networks, showing no spurious “valleys” (their definition of spurious valleys differs slightly from the commonly adopted one of Freeman and Bruna [10] and Venturi et al. [44]; we employ the latter in this paper) under strictly increasing activations as long as the bias term is not pruned and the width (or the connections) of the final hidden layer is larger than the number of training samples. It is not clear whether a more general sparse network (e.g., allowing the final hidden layer to be sparse with few connections) still exhibits the same nice landscape property.

Main contribution

Since sparse networks have achieved great success in practice, we may wonder what kind of sparse network is good for training, or still preserves the benign landscape of dense networks. In this work, we show that some specific sparse structures can still provide a benign landscape. In general, however, sparsity can deteriorate the loss landscape and create spurious minima and spurious valleys (defined in Definitions 2 and 3) even in wide networks. We summarize our contributions in Table 1. We say a two-layer neural net is an SD net if it has a sparse first layer and a dense final layer, and an SS net if it has a sparse first layer and a sparse final layer.

        | Spurious valleys for linear nets | Spurious valleys for non-linear nets
D nets  | No ([44])                        | No ([44, 22])
SD nets | No (Theorem 1)                   | No ([37, 36] and Theorem 4)
SS nets | Yes (Theorem 6)                  | Yes (Theorem 6)
Table 1: Summary of our results and comparison of existing results for two-layer linear/non-linear networks. D nets: dense networks; SD nets: sparse networks with a sparse first layer and a dense final layer; SS nets: sparse networks with a sparse first layer and a sparse final layer. Some works adopt analogous definitions instead of valleys. We view all of them as valleys for brevity. Additionally, for SD nets, the results of linear networks hold for some special structures, and the results of non-linear networks hold for wide SD nets.
  • For a linear SD net, we show that sparse networks can have a benign landscape, as dense linear networks do, in some special cases (in Theorem 1). Specifically, we show no spurious valleys if one of the following conditions holds: each sparse pattern is over-parameterized, the data are orthogonal with non-overlapping filters, or the target is scalar.

  • For a wide non-linear SD net, we show the consistency between sparse and dense networks (in Theorems 3 and 4). We show no spurious valleys for polynomial activations when the hidden layer width is larger than corresponding upper intrinsic dimensions [44], and for real analytic activations when the hidden layer width is larger than the number of training samples. This is the same as dense networks.

  • For a linear/non-linear SS net, we identify the difference between sparse and dense networks by constructively proving the existence of spurious valleys and spurious minima (in Theorem 6) with common activations.

Finally, we conduct numerical experiments under linear/non-linear activations to verify our theoretical results. Though we mainly focus on two-layer networks, we also generalize some results to deep networks. For example, we show no spurious valleys for deep linear networks with all sparse layers under scalar output (in Theorem 2). Moreover, we prove no spurious valleys with non-empty interiors for deep sparse networks with a dense wide final layer (in Theorem 5). In summary, we hope that our work contributes to a better understanding of the landscape of sparse networks and brings insights for future research.

1.1 Related Work

Our work is strongly related to and inspired by a line of recent work that analyzes the landscape of neural networks.

Landscape of neural networks

The landscape analysis of neural networks was a popular topic in the early days of neural net research; see, e.g., Bianchini and Gori [3] for an overview. One notable early work [2] proved that shallow linear neural networks do not have bad local minima. Kawaguchi [18] generalized this result to deep linear neural networks. However, the situation is more complicated when nonlinear activations are introduced [46, 15, 12]. Many works [47, 46, 40, 15, 7] show that spurious local minima can happen even in a two-layer network with nonlinear activations. Despite the existence of spurious minima, over-parameterized networks can still exhibit some nice geometrical properties. Freeman and Bruna [10] considered the topological properties of the sublevel sets. Venturi et al. [44] showed the absence of spurious valleys for polynomial activations under a wide hidden layer, and proved the existence of spurious valleys for non-polynomial activation functions if the hidden layer is narrow. Nguyen [36] showed that a pyramidal DNN has no spurious “valleys” for one wide layer and strictly monotonic activations. Moreover, Li et al. [22] used a different notion named “spurious basin” to show that deep fully connected networks with any continuous activation admit no spurious basins. Nguyen et al. [37] also showed no spurious “valleys” defined by strict sublevel sets for general sparse networks with plenty of connections to each output neuron and strictly monotonic activations.

Some works [30, 20] also focus on the effect of regularization in deep linear networks. Mehta et al. [30] employed novel algebraic symmetries that result in “flat” critical manifolds, and analyzed how \ell_{2}-regularization breaks these symmetries to produce isolated critical points. In this work, we do not apply regularization, because linear networks and wide non-linear networks already have a benign landscape based on previous work; we keep the loss function as simple as possible in order to isolate the influence of sparsity. Moreover, Mehta et al. [30] did not distinguish between spurious minima and global minima, which we believe the theory still needs to address.

Sparse networks

Sparse networks [13, 48, 24, 9, 14] have a long history, and they have gained much recent interest due to their potential in on-device AI (i.e., deploying AI models on small devices). In practice, sparse networks are mainly produced by network pruning techniques [13, 11, 21, 24, 9, 45, 16, 35, 34, 28], to name a few. There are many variants of pruning methods and surveys of this literature, e.g., Hoefler et al. [17], Mishra et al. [31]. Pruning can lead to a reduction in storage and model runtime, and the performance is usually maintained by retraining the pruned network.

Most research efforts have been devoted to empirical aspects. Frankle and Carbin [9] recommended reusing the sparsity pattern found through pruning and training a sparse network from the same initialization as the original training (i.e., the “lottery”) to obtain comparable performance and avoid bad solutions. Moreover, several recent works propose a variety of methods for choosing weights or sparse network structures while achieving comparable performance [21, 33, 26, 4, 23]. There are also theoretical investigations of representation power. Several works [29, 39] proved that an over-parameterized neural network contains a subnetwork with roughly the same accuracy as the target network, providing guarantees for the existence of “good” sparse subnetworks.

The closest work to ours is probably Evci et al. [8], who showed empirically that bad local minima can appear in a sparse network, but this work did not provide any theoretical result. Our work provides a partial theoretical justification for this phenomenon: we show that spurious valleys can exist in a sparse network.

1.2 Organization

This article is organized as follows. We formally define our setting and notation in Section 2. We then analyze linear SD nets in Section 3 and non-linear SD nets in Section 4. In Section 5, we give a counterexample for SS nets. We verify our theorems through experiments in Section 6, and conclude in Section 7.

2 Preliminaries

Notation

We use bold-faced letters (e.g., \bm{w} and \bm{a}) to denote vectors, and capital letters (e.g., W=[w_{ij}] and A=[a_{ij}]) for matrices. We use (W)_{i,\cdot} and (W)_{\cdot,j} for the i-th row and j-th column of a matrix W, and (\bm{a})_{i} for the i-th entry of a vector \bm{a}, if no ambiguity arises. We denote by \widetilde{W} a sparse version of W (to be described in detail), and by W^{+} the pseudoinverse of W. We abbreviate [n]=\{1,\ldots,n\}. We let \bm{e}_{i} be the i-th standard unit vector, and \bm{1}_{n} and \bm{0}_{n} denote the all-ones and all-zeros vectors in \mathbb{R}^{n}, respectively. The range of a real-valued function \sigma(\cdot) is \sigma(\mathbb{R})=\{\sigma(x):x\in\mathbb{R}\}, and the number of elements in a set \mathcal{S} is |\mathcal{S}|. We denote by \mathrm{span}(\mathcal{S}) the linear span (or linear hull) of a set \mathcal{S} of vectors.

Problem setup

Given training samples \{(\bm{x}_{i},\bm{y}_{i})\}_{i=1}^{n}\subset\mathbb{R}^{d_{x}}\times\mathbb{R}^{d_{y}}, we form the data matrices X=[\bm{x}_{1},\ldots,\bm{x}_{n}]\in\mathbb{R}^{d_{x}\times n} and Y=[\bm{y}_{1},\ldots,\bm{y}_{n}]\in\mathbb{R}^{d_{y}\times n}. We mainly focus on one-hidden-layer neural networks with the squared loss, though some results also apply to general convex losses. The objective is (adding a bias is equivalent to appending the row \bm{1}_{n}^{\top} to X, and our setting also covers a sparse bias in the first layer; hence we do not treat bias terms separately in the sequel)

\min_{U,W}\; L(U,W):=\frac{1}{2}\left\|U\sigma(WX)-Y\right\|_{F}^{2}, \qquad (1)

where W=[\bm{w}_{1},\ldots,\bm{w}_{p}]^{\top}\in\mathbb{R}^{p\times d_{x}} and U=[\bm{u}_{1},\ldots,\bm{u}_{p}]\in\mathbb{R}^{d_{y}\times p}. Here p is the width of the hidden layer and \sigma(\cdot) is a continuous element-wise activation function.

Generally, when a sparse pattern is applied after weight pruning or masking, many weights become zero and are not updated during training. Our goal is to study the landscape of the sparse structures obtained after weight pruning. Thus, we focus on the sparse structures themselves, rather than the underlying pruning methods.

We begin with SD nets (sparse-dense networks). We denote the pattern of each hidden weight \bm{w}_{i} by \bm{m}_{i}\in\{0,1\}^{d_{x}}, i\in[p], and assume the first-layer weights have s distinct patterns \bm{m}_{1}^{*},\ldots,\bm{m}_{s}^{*}\neq\bm{0} in total. We group the weights and data that share the same pattern via \mathcal{S}_{i}=\{j:\bm{m}_{j}=\bm{m}_{i}^{*}\}, i\in[s]. Then the objective of the sparse one-hidden-layer network becomes

\min_{U,\widetilde{W}} L(U,\widetilde{W}):=\frac{1}{2}\left\|\sum_{i=1}^{s}U_{i}\sigma(W_{i}Z_{i})-Y\right\|_{F}^{2}=\frac{1}{2}\left\|\left(U_{1},\ldots,U_{s}\right)\sigma\left[\left(\begin{smallmatrix}W_{1}&\cdots&\bm{0}\\ \vdots&\ddots&\vdots\\ \bm{0}&\cdots&W_{s}\end{smallmatrix}\right)\left(\begin{smallmatrix}Z_{1}\\ \vdots\\ Z_{s}\end{smallmatrix}\right)\right]-Y\right\|_{F}^{2}, \qquad (2)

where we view the sparse hidden-layer weight as a block-diagonal matrix with blocks W_{i}=[\bm{w}_{j},j\in\mathcal{S}_{i}]^{\top}\in\mathbb{R}^{p_{i}\times d_{i}}, duplicate and rearrange the input data as Z_{i}=[X_{j,\cdot},(\bm{m}^{*}_{i})_{j}=1]\in\mathbb{R}^{d_{i}\times n}, and separate U as U_{i}=[\bm{u}_{j},j\in\mathcal{S}_{i}]\in\mathbb{R}^{d_{y}\times p_{i}}. Here i\in[s], p_{i}=|\mathcal{S}_{i}|, d_{i}=\bm{1}_{d_{x}}^{\top}\bm{m}^{*}_{i}, and p=\sum_{i=1}^{s}p_{i} (see Figure 1 for an example).

Figure 1: An example of sparse weight transformation in Eq. (2).
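To make the transformation in Eq. (2) concrete, here is a minimal NumPy sketch (ours, not from the paper; the dimensions, random patterns, and tanh activation are arbitrary choices) that groups hidden neurons by their sparsity pattern and checks that the block form in Eq. (2) matches the masked dense computation in Eq. (1):

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_y, p, n = 6, 3, 8, 10          # input dim, output dim, hidden width, samples
X = rng.standard_normal((d_x, n))
U = rng.standard_normal((d_y, p))
W = rng.standard_normal((p, d_x))
sigma = np.tanh

# Random binary pattern m_i for each hidden neuron (at least one active input).
M = (rng.random((p, d_x)) < 0.5).astype(float)
M[M.sum(axis=1) == 0, 0] = 1.0
out_dense = U @ sigma((W * M) @ X)     # Eq. (1) with a sparse (masked) first layer

# Group neurons that share the same pattern, as in Eq. (2).
out_grouped = np.zeros_like(out_dense)
for pattern in np.unique(M, axis=0):
    S = np.where((M == pattern).all(axis=1))[0]    # neuron indices S_i with this pattern
    active = np.where(pattern.astype(bool))[0]     # surviving input coordinates
    W_i = (W * M)[np.ix_(S, active)]               # block W_i (p_i x d_i)
    Z_i = X[active, :]                             # rearranged data Z_i (d_i x n)
    U_i = U[:, S]                                  # corresponding output columns U_i
    out_grouped += U_i @ sigma(W_i @ Z_i)

print(np.allclose(out_dense, out_grouped))         # True: the two forms agree
```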

Note that there may be useless connections and nodes, such as a node with zero out-degree, which can be detected and removed by scanning from the final layer back to the first layer; other cases are shown in Appendix A. Thus we do not consider them and assume the sparse structure is effective in the sense of Definition 1, meaning that each neuron has a potential contribution to the network output.

Definition 1 (Effective neuron and sparse network)

A neuron is effective if it appears in at least one directed path from an input node to an output node. A sparse network is effective if every neuron (including input and output neurons) is effective.
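As an illustration of Definition 1 (our sketch, with a hypothetical helper effective_neurons), effectiveness can be computed from per-layer binary connection masks by combining forward reachability from the inputs with backward reachability from the outputs:

```python
import numpy as np

def effective_neurons(masks):
    """masks[l] has shape (n_{l+1}, n_l); entry 1 means the connection is kept.
    Returns one boolean array per layer marking neurons on some input-output path."""
    # Forward pass: reachable from at least one input neuron.
    fwd = [np.ones(masks[0].shape[1], dtype=bool)]
    for M in masks:
        fwd.append((M @ fwd[-1]) > 0)
    # Backward pass: can reach at least one output neuron.
    bwd = [np.ones(masks[-1].shape[0], dtype=bool)]
    for M in reversed(masks):
        bwd.append((M.T @ bwd[-1]) > 0)
    bwd = bwd[::-1]
    return [f & b for f, b in zip(fwd, bwd)]

# Tiny example: 3 inputs -> 3 hidden -> 2 outputs; the third hidden neuron has no outgoing edge.
m1 = np.array([[1, 0, 0], [0, 1, 1], [0, 0, 1]])
m2 = np.array([[1, 0, 0], [0, 1, 0]])
print(effective_neurons([m1, m2]))   # hidden layer: [True, True, False]
```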

Previous works mainly focus on local minima (Definition 2) and valleys (Definition 3) to describe the landscape of neural networks. The absence of spurious valleys implies the absence of spurious strict minima [10, 44]. Local search methods may get stuck in a spurious valley as well as in a spurious strict minimum. Thus, understanding whether such objects exist in the landscape helps explain the difficulty of training (sparse) neural networks.

Definition 2 (Spurious minimum)

We say a point is a spurious (strict) minimum if it is a (strict) local minimum but not a global minimum. Here, \bm{\theta}_{0} is a local minimum if there exists \epsilon>0 such that f(\bm{\theta}_{0})\leq f(\bm{\theta}) for all \bm{\theta} with \left\|\bm{\theta}-\bm{\theta}_{0}\right\|\leq\epsilon, and a strict local minimum if there exists \epsilon>0 such that f(\bm{\theta}_{0})<f(\bm{\theta}) for all \bm{\theta}\neq\bm{\theta}_{0} with \left\|\bm{\theta}-\bm{\theta}_{0}\right\|\leq\epsilon.

Definition 3 (Spurious valley)

Given \alpha\in\mathbb{R}, we define the \alpha-sublevel set of f(\bm{\theta}) as \Omega_{f}(\alpha)=\{\bm{\theta}\in\mathrm{dom}(f)\colon f(\bm{\theta})\leq\alpha\}. A spurious valley \mathcal{T} is a connected component of a sublevel set \Omega_{f}(\alpha) on which f cannot approach its infimum, that is, \inf_{\bm{\theta}\in\mathcal{T}}f(\bm{\theta})>\inf_{\bm{\theta}\in\mathrm{dom}(f)}f(\bm{\theta}).

3 Sparse-dense Linear Networks

We begin with the linear activation (\sigma(z)=z) as a warm-up for understanding the landscape of neural networks. Previous works have shown that dense linear networks have no spurious minima [18, 27] and no spurious valleys [44]. Building on their descriptions of the landscape, we show that under certain conditions SD linear networks also have no spurious valleys.

[Figure 2 diagrams: left, a network with inputs x_{1},\dots,x_{4}, hidden neurons h^{(1)}_{1},h^{(1)}_{2}, and outputs y_{1},y_{2}; right, a network with inputs x_{1},x_{2},x_{3}, hidden neurons h^{(1)}_{1},h^{(1)}_{2},h^{(1)}_{3}, and outputs y_{1},\dots,y_{4}.]
Figure 2: Sparse network without (left) / with (right) overlapping filters in the first layer.
Theorem 1

Under any of the following conditions, any effective SD linear network has no spurious valleys: 1) p_{i}\geq d_{i},\forall i\in[s]; 2) Z_{i}Z_{j}^{\top}=\bm{0},\forall i\neq j\in[s]; 3) d_{y}=1.

The proof of Theorem 1 is deferred to Appendix B.1. We prove the result by constructing a non-increasing path that approaches the infimum of the loss function (i.e., zero loss). That is, we show Property P below under the conditions of Theorem 1, and use the fact that Property P implies the absence of spurious valleys [44, Lemma 2].

Definition 4 (Property P)

We say the landscape of an objective f(\bm{\theta}) satisfies Property P if there exists a non-increasing continuous path from any initialization that approaches the infimum. Formally, given any initial parameter \bm{\theta}_{0}\in\mathrm{dom}(f), there exists a continuous path \bm{\theta}\colon t\in[0,1)\to\bm{\theta}(t)\in\mathrm{dom}(f) such that: (a) \bm{\theta}(0)=\bm{\theta}_{0}; (b) \lim_{t\to 1}f(\bm{\theta}(t))=\inf_{\bm{\theta}}f(\bm{\theta}); (c) f(\bm{\theta}(t)) is non-increasing in t\in[0,1).

The first condition in Theorem 1 can be viewed as over-parameterization of each sparse pattern. The second condition has two parts: the patterns are non-overlapping (as the left graph of Figure 2 depicts) and the input data are orthogonal, showing that structured training data matched with a sparse network can preserve the benign landscape. The third condition states that the output dimension is one. Theorem 1 shows that some special sparse linear networks still have benign landscapes, as dense linear networks do, which motivates discovering or designing efficient sparse networks.

3.1 Extensions

Combining with previous works, we generalize Theorem 1 to deep linear networks; the proof can be found in Appendix B.2. The intuition is that deep linear networks have landscape properties similar to the shallow case [46, 27]. However, except for the scalar-output case, which we have solved, understanding the landscape of an arbitrary deep sparse network remains complicated.

Theorem 2

Under either of the following conditions, any effective deep linear neural network with a sparse first layer has no spurious valleys: 1) p_{i}\geq d_{i},\forall i\in[s]; 2) d_{y}=1. Moreover, the result under assumption 2) holds for networks in which all layers are sparse.

4 Sparse-dense Non-linear Networks

Previous work [46, 15] has established the intrinsic difference between linear and non-linear activations. Nevertheless, the landscape of neural networks still enjoys benign geometric properties [44, 37, 36], in the sense of no spurious valleys, if the network is wide enough. In this section, we verify that a wide SD net with non-linear activations still possesses a benign landscape. We show the absence of spurious valleys for polynomial activations in Subsection 4.1 and for general real analytic activations in Subsection 4.2, and defer the proofs to Appendix C.

4.1 Polynomial Activations

For some simple activations, such as polynomial functions, we can view them as the linear activation applied after a non-linear mapping. Specifically, consider the quadratic activation with d_{x}=2, \bm{x}=(x_{1},x_{2}) and \bm{w}=(w_{1},w_{2},b). If we define \bm{\psi}(\bm{w})=(w_{1}^{2},w_{2}^{2},2w_{1}w_{2},2bw_{1},2bw_{2},b^{2}) and \bm{\phi}(\bm{x})=(x_{1}^{2},x_{2}^{2},x_{1}x_{2},x_{1},x_{2},1), then \sigma(w_{1}x_{1}+w_{2}x_{2}+b)=(w_{1}x_{1}+w_{2}x_{2}+b)^{2}=\langle\bm{\psi}(\bm{w}),\bm{\phi}(\bm{x})\rangle, showing that we can convert a polynomial activation into the linear activation after projecting the original data and weights into feature spaces.

More abstractly, we define V_{\sigma}(X):=\mathrm{span}\left(\left\{\sigma(\bm{w}^{\top}X),\bm{w}\in\mathbb{R}^{d_{x}}\right\}\right) and the upper intrinsic dimension [44] as \dim^{*}(\sigma,X):=\mathrm{dim}\left(V_{\sigma}(X)\right). From the above example, \dim^{*}(\sigma,X)\leq 6 for the quadratic activation with d_{x}=2 because \bm{\psi}(\bm{w})\in\mathbb{R}^{6}. Therefore, we can still regard non-linear activations (particularly polynomial activations) as the linear activation acting on the feature mappings \bm{\psi}(\bm{w}) and \bm{\phi}(\bm{x}).
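As a quick numerical sanity check of the feature-map view (a sketch we add; it only re-verifies the identity stated above):

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(5):
    w1, w2, b = rng.standard_normal(3)
    x1, x2 = rng.standard_normal(2)
    psi = np.array([w1**2, w2**2, 2*w1*w2, 2*b*w1, 2*b*w2, b**2])
    phi = np.array([x1**2, x2**2, x1*x2, x1, x2, 1.0])
    lhs = (w1*x1 + w2*x2 + b)**2       # quadratic activation on the pre-activation
    rhs = psi @ phi                    # linear activation in the 6-dimensional feature space
    assert np.isclose(lhs, rhs)
print("quadratic activation matches <psi(w), phi(x)>")
```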

Theorem 3

For any continuous activation function \sigma(\cdot), suppose \dim^{*}(\sigma,X)<+\infty. Then any effective SD net has no spurious valleys if p_{i}\geq\dim^{*}(\sigma,Z_{i}),\forall i\in[s].

Remark 1

The condition \dim^{*}(\sigma,X)<+\infty also appeared in [44], but we extend it to the sparse setting. Moreover, if \sigma(z)=\sum_{k=0}^{t}a_{k}z^{k}, then \dim^{*}(\sigma,X)=O(d^{t}) (e.g., see Venturi et al. [44, Corollary 10]). Therefore, we obtain no spurious valleys when the hidden width p=\sum_{i=1}^{s}p_{i}=\Omega(\sum_{i=1}^{s}d_{i}^{t}). For non-overlapping filters, if we choose d_{i}=d/s for all s patterns and t\geq 3, then p=\Omega(s(d/s)^{t})=\Omega(d^{t}/s^{t-1}), a smaller hidden-layer width requirement than the \Omega(d^{t}) needed in the dense setting by Venturi et al. [44, Corollary 10].

4.2 Real Analytic Activations

For general non-polynomial activations, particularly real analytic activations, Nguyen et al. [37] provided results for sparse networks, but they required strictly increasing activations and defined spurious valleys via strict sublevel sets. Li et al. [22] used a different notion of “spurious basin” to show that deep fully-connected networks with any continuous activation admit no spurious basins, but did not consider sparse networks. In contrast, we establish the absence of spurious valleys for SD nets with real analytic activations. We show the comparison in Table 2.

                 | Nguyen et al. [37] | Li et al. [22] | Theorem 4
Sparse NN        | ✓                  | ×              | ✓
Non-increasing σ | ×                  | ✓              | ✓
Table 2: Comparison of results for two-layer networks with real analytic activations.

Before introducing our results, we need some assumptions, as in previous works.

Assumption 1

1) \forall i\neq j\in[n],\forall k\in[d_{x}], \left|(\bm{x}_{i})_{k}\right|\neq\left|(\bm{x}_{j})_{k}\right|\neq 0. 2) Except for the last hidden-layer weight and all biases, no hidden-layer weights are entirely pruned, i.e., \widetilde{W}_{i,\cdot}\neq\bm{0}_{d_{x}}^{\top},\forall i\in[p].

Assumption 2

The activation function \sigma(\cdot) is real analytic, and there exist n distinct non-negative integers l_{1},\dots,l_{n} forming an arithmetic sequence (that is, l_{i}=l_{1}+(i-1)(l_{1}-l_{0}),\forall i\in[n]), such that \sigma^{(l_{i})}(0)\neq 0,\forall i\in[n], where \sigma^{(k)}(0) denotes the k-th order derivative of \sigma(\cdot) at zero.

The first part of Assumption 1 says that the training data have distinct features in each dimension, which also appears in previous works [22]; it can easily be achieved if an arbitrarily small perturbation of the data is allowed. The second part of Assumption 1 excludes useless patterns that only use the bias feature \bm{1}_{n}^{\top} and do not encode any effective features of the training data. Assumption 2 states a minimal non-linearity requirement on the activation at the origin, which is satisfied by Sigmoid, Tanh, and Softplus (a smooth approximation of ReLU). Now we state that two-layer SD networks admit no spurious valleys in the over-parameterized regime.
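Assumption 2 can be checked symbolically for a given activation. The sketch below (ours, using SymPy) prints the derivative orders at zero, up to order 9, at which the derivative is non-zero; for Tanh these are the odd orders, which already form an arithmetic sequence.

```python
import sympy as sp

z = sp.symbols('z')
activations = {
    "tanh": sp.tanh(z),
    "sigmoid": 1 / (1 + sp.exp(-z)),
    "softplus": sp.log(1 + sp.exp(z)),
}
for name, f in activations.items():
    nonzero = [k for k in range(10)
               if sp.simplify(sp.diff(f, z, k).subs(z, 0)) != 0]
    print(f"{name}: non-zero derivative orders at 0 (up to 9): {nonzero}")
```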

Theorem 4

Under Assumptions 1 and 2, if the hidden-layer width p\geq n, we have \mathrm{rank}(\sigma(\widetilde{W}X))=n, a.s., and any effective SD non-linear network has no spurious valleys.

The key proof idea is to show that we can always make the hidden-layer output \sigma(\widetilde{W}X) have full column rank while keeping the loss unchanged. Otherwise, we can compute the l_{i}-th order derivatives of \sigma(\widetilde{W}X) at a specific row and derive a contradiction with Assumptions 1 and 2. Interestingly, Theorem 4 shows that there is no substantial difference, in terms of spurious valleys, when sparsity is introduced only in the first layer. Additionally, the large-width requirement (p\geq n) also appears in [44, 22, 37, 36].

4.3 Extension

We extend Theorem 4 to deep sparse networks. Due to the complex stacked structure of deep sparse networks, we can only show the absence of spurious valleys with non-empty interiors. This constraint implies that spurious valleys, if they exist, are necessarily degenerate.

Theorem 5

Under Assumptions 1 and 2, any effective deep sparse neural network has no spurious valleys with non-empty interiors if the last hidden-layer width p\geq n and each output neuron has at least n connected neurons in the last hidden layer.

Analogous to the two-layer case, the main proof idea is to show that the output matrix of each hidden layer almost surely has full rank. Moreover, for deep networks, previous works also needed to refine the notion of spurious valley, e.g., the spurious basin [22] and the spurious “valley” defined by strict sublevel sets [37, 36].

5 Sparse-sparse Networks

Finally, we consider the remaining setting where sparsity is introduced in both layers, i.e., SS nets (sparse-sparse networks). In the following, we show the necessity of a dense (or n-connected) final layer for general continuous activations through a concrete example with spurious valleys. Furthermore, in our construction, the spurious valley is a global minimum set of a sub-network (i.e., a network obtained by removing some connections from the original network) of the original sparse network.

Since no spurious valleys appear in dense wide (two-layer) networks [44, 37, 36], our result indicates that sparsity can provably deteriorate the landscape.

Theorem 6

If the continuous activation satisfies \sigma(0)=0 and \sigma(\mathbb{R})\neq\{0\}, and the hidden width p\geq 3, then there exists an effective SS network which has a spurious valley. Additionally, this spurious valley corresponds to a global minimum set of a sub-network of the original sparse network.

Proof: The proof mainly utilizes the relationship among sparse connections. We consider a sparse network with the sub-structure in the right graph of Figure 2, and the objective becomes:

\min_{\substack{\bm{\theta}=(w_{1},\dots,w_{8})\\ \bm{\theta}_{1}=(\bm{v}_{1},\bm{v}_{2},\bm{v}_{3},V)}} L(\bm{\theta},\bm{\theta}_{1})=\frac{1}{2}\left\|\left(\begin{smallmatrix}w_{1}&0&\bm{0}^{\top}\\ w_{2}&w_{3}&\bm{0}^{\top}\\ 0&w_{4}&\bm{0}^{\top}\\ \bm{0}&\bm{0}&V\end{smallmatrix}\right)\sigma\left[\left(\begin{smallmatrix}w_{5}&w_{6}&0\\ 0&w_{7}&w_{8}\\ \bm{v}_{1}&\bm{v}_{2}&\bm{v}_{3}\end{smallmatrix}\right)X\right]-Y\right\|_{F}^{2} \qquad (3)
=\frac{1}{2}\left\|\left(\begin{smallmatrix}w_{1}\sigma(w_{5})&w_{1}\sigma(w_{6})&0\\ w_{2}\sigma(w_{5})&w_{2}\sigma(w_{6})+w_{3}\sigma(w_{7})&w_{3}\sigma(w_{8})\\ 0&w_{4}\sigma(w_{7})&w_{4}\sigma(w_{8})\\ V\sigma(\bm{v}_{1})&V\sigma(\bm{v}_{2})&V\sigma(\bm{v}_{3})\end{smallmatrix}\right)-Y\right\|_{F}^{2},

where we assume X=I_{3} and Y=\left(\begin{smallmatrix}y_{1}&y_{1}&0\\ y_{2}&y_{2}+y_{3}&0\\ 0&0&y_{4}\\ \bm{*}&\bm{*}&\bm{*}\end{smallmatrix}\right) with y_{3}>4y_{4}>4y_{1}>0 and y_{2}>0.

Because the variables \bm{v}_{i} and V are unrelated to the w_{i}, we only need to consider the w_{i}, assuming that the \bm{v}_{i} and V can always attain zero loss on the remaining rows of Y. Therefore, in the following we only consider the parameter \bm{\theta}.

Now we consider the sublevel set \mathcal{T}:=\{\bm{\theta}:L(\bm{\theta})\leq y_{4}^{2}/2\}\cap\{w_{4}=0\}. Noting that L(\bm{\theta})\geq y_{4}^{2}/2 whenever w_{4}=0, we obtain L(\bm{\theta})=y_{4}^{2}/2 and \left(\begin{smallmatrix}w_{1}&0\\ w_{2}&w_{3}\end{smallmatrix}\right)\left(\begin{smallmatrix}\sigma(w_{5})&\sigma(w_{6})&0\\ 0&\sigma(w_{7})&\sigma(w_{8})\end{smallmatrix}\right)=\left(\begin{smallmatrix}y_{1}&y_{1}&0\\ y_{2}&y_{2}+y_{3}&0\end{smallmatrix}\right) for all \bm{\theta}\in\mathcal{T}, which leads to

\mathcal{T}=\bigg\{\bm{\theta}(a,b)\ \bigg|\ \left(\begin{smallmatrix}w_{1}&0\\ w_{2}&w_{3}\\ 0&w_{4}\end{smallmatrix}\right)=\left(\begin{smallmatrix}y_{1}a&0\\ y_{2}a&y_{3}b\\ 0&0\end{smallmatrix}\right),\ \left(\begin{smallmatrix}\sigma(w_{5})&\sigma(w_{6})&0\\ 0&\sigma(w_{7})&\sigma(w_{8})\end{smallmatrix}\right)=\left(\begin{smallmatrix}1/a&1/a&0\\ 0&1/b&0\end{smallmatrix}\right),\ a,b\neq 0\bigg\}.

Since \sigma(\mathbb{R})\neq\{0\}, we have \mathcal{T}\neq\emptyset. We choose a connected component \mathcal{T}_{0}\subset\mathcal{T} with a,b>0 or a,b<0, depending on \sigma(\mathbb{R}). For any small perturbations \delta_{i},i\in[8],

2L(w_{1}+\delta_{1},\dots,w_{8}+\delta_{8})=\left\|\left(\begin{smallmatrix}*&*&*\\ *&*&(w_{3}+\delta_{3})\sigma(w_{8}+\delta_{8})\\ *&\delta_{4}\sigma(w_{7}+\delta_{7})&\delta_{4}\sigma(w_{8}+\delta_{8})-y_{4}\end{smallmatrix}\right)\right\|_{F}^{2}
\stackrel{(i)}{\geq}(w_{3}+\delta_{3})^{2}\sigma(w_{8}+\delta_{8})^{2}+\delta_{4}^{2}\sigma(w_{7}+\delta_{7})^{2}+\left(\delta_{4}\sigma(w_{8}+\delta_{8})-y_{4}\right)^{2}
\stackrel{(ii)}{\geq}2\left|(w_{3}+\delta_{3})\sigma(w_{7}+\delta_{7})\delta_{4}\sigma(w_{8}+\delta_{8})\right|-2y_{4}\delta_{4}\sigma(w_{8}+\delta_{8})+y_{4}^{2}\stackrel{(iii)}{\geq}y_{4}^{2},

where (i) keeps only the last column and row of the loss, (ii) uses the arithmetic-geometric mean inequality, and (iii) follows for small perturbations when we choose |\delta_{3}|\leq|w_{3}|/2 and |\delta_{7}| small enough that |\sigma(w_{7}+\delta_{7})-\sigma(w_{7})|\leq|\sigma(w_{7})|/2 (possible since \sigma(\cdot) is continuous and \sigma(w_{7})\neq 0), leading to |(w_{3}+\delta_{3})\sigma(w_{7}+\delta_{7})|\geq|(w_{3}/2)\sigma(w_{7})/2|=y_{3}/4>y_{4}>0.

Thus, each \bm{\theta}\in\mathcal{T}_{0} is a local minimum, and \mathcal{T}_{0} is connected. Moreover, if \delta_{4}\neq 0, then from (iii) equality only holds when \sigma(w_{8}+\delta_{8})=0; from (i) we then obtain 2L(w_{1}+\delta_{1},\dots,w_{8}+\delta_{8})\geq\delta_{4}^{2}\sigma(w_{7}+\delta_{7})^{2}+y_{4}^{2}\geq\delta_{4}^{2}\sigma(w_{7})^{2}/4+y_{4}^{2}>y_{4}^{2}. Hence, the connected component of the sublevel set \{\bm{\theta}:L(\bm{\theta})\leq y_{4}^{2}/2\} that contains \mathcal{T}_{0} can never reach a point \bm{\theta}^{\prime} with w_{4}^{\prime}\neq 0 and L(\bm{\theta}^{\prime})\leq y_{4}^{2}/2.

Finally, we show that \mathcal{T}_{0} is not a global minimum set. We choose a point \tilde{\bm{\theta}}(a) with \left(\begin{smallmatrix}w_{1}&0\\ w_{2}&w_{3}\\ 0&w_{4}\end{smallmatrix}\right)=\left(\begin{smallmatrix}y_{1}a&0\\ (y_{2}+y_{3})a&0\\ 0&y_{4}a\end{smallmatrix}\right) and \left(\begin{smallmatrix}\sigma(w_{5})&\sigma(w_{6})&0\\ 0&\sigma(w_{7})&\sigma(w_{8})\end{smallmatrix}\right)=\left(\begin{smallmatrix}\frac{y_{2}}{(y_{2}+y_{3})a}&\frac{1}{a}&0\\ 0&0&\frac{1}{a}\end{smallmatrix}\right), where a\neq 0 and 1/a\in\sigma(\mathbb{R}). By the intermediate value theorem for the continuous activation \sigma(\cdot), we have \frac{y_{2}}{(y_{2}+y_{3})a}\in\sigma(\mathbb{R}) because \sigma(0)=0 and 0<y_{2}<y_{2}+y_{3}. Then we obtain L(w_{1},\dots,w_{8})=\frac{y_{1}^{2}y_{3}^{2}}{2(y_{2}+y_{3})^{2}}<y_{1}^{2}/2<y_{4}^{2}/2. Hence, a spurious valley exists. \square
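To make the construction tangible, the following sketch (ours; it picks Tanh and the example values y_1=1, y_2=1, y_3=9, y_4=2, which satisfy y_3>4y_4>4y_1>0 and y_2>0) evaluates the loss at a point \bm{\theta}(a,b) of the valley and at the better point \tilde{\bm{\theta}}(a), reproducing the gap y_4^2/2 versus y_1^2y_3^2/(2(y_2+y_3)^2):

```python
import numpy as np

y1, y2, y3, y4 = 1.0, 1.0, 9.0, 2.0    # satisfy y3 > 4*y4 > 4*y1 > 0 and y2 > 0
Y = np.array([[y1, y1, 0.0], [y2, y2 + y3, 0.0], [0.0, 0.0, y4]])
sigma, sigma_inv = np.tanh, np.arctanh

def loss(w):
    w1, w2, w3, w4, w5, w6, w7, w8 = w
    out = np.array([[w1, 0.0], [w2, w3], [0.0, w4]]) @ sigma(
        np.array([[w5, w6, 0.0], [0.0, w7, w8]]))   # X = I_3
    return 0.5 * np.sum((out - Y) ** 2)

a, b = 2.0, 2.0
# A point theta(a, b) in the valley (w4 = 0 and sigma(w8) = 0).
valley = [y1 * a, y2 * a, y3 * b, 0.0,
          sigma_inv(1 / a), sigma_inv(1 / a), sigma_inv(1 / b), 0.0]
# The better point theta_tilde(a) constructed at the end of the proof.
better = [y1 * a, (y2 + y3) * a, 0.0, y4 * a,
          sigma_inv(y2 / ((y2 + y3) * a)), sigma_inv(1 / a), 0.0, sigma_inv(1 / a)]

print(loss(valley), y4**2 / 2)                            # both equal 2.0
print(loss(better), y1**2 * y3**2 / (2 * (y2 + y3)**2))   # both equal 0.405
```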

Remark 2

From Theorem 6, we see the necessity of n connections for each output neuron to guarantee the absence of spurious valleys as in dense networks. Furthermore, one may wonder whether a sparse network with at most n{-}1 connections per output neuron and n hidden neurons is the same as a dense network with n{-}1 hidden neurons. The answer is no. We allow at least n effective neurons after pruning, while in the narrow-network examples no more than n{-}1 neurons exist, which leads to different expressive power. We elaborate below. We denote by A_{a\times b} a matrix in \mathbb{R}^{a\times b}. For a dense network with n{-}1 hidden nodes, \mathrm{rank}(U_{m\times(n-1)}\sigma(\widetilde{W}_{(n-1)\times d}X_{d\times n}))\leq n{-}1. In our result, we allow m=n and a sparse matrix \widetilde{U}_{n\times n}=I_{n} (i.e., only one connection for each output neuron); then from Theorem 4, \mathrm{rank}(\widetilde{U}_{m\times n}\sigma(\widetilde{W}_{n\times d}X_{d\times n}))=n, a.s. Thus, wide sparse networks have more expressive power than dense narrow networks.

From the proof of Theorem 6, we emphasize that such a spurious valley is a set of spurious minima generated from the sub-network obtained when w_{4} is removed. We encounter no spurious valleys in wide SD networks (see Theorem 4). Thus, sparse connections indeed deteriorate the landscape, because sparsity obstructs the decreasing path. Moreover, when the number of unpruned weights in some row of the final layer is less than the training sample size, a strict valley may indeed exist by our construction. Note that Venturi et al. [44] also gave an existence proof for spurious valleys in dense narrow networks. The difference is that our example is concrete and applies to sparse structures of arbitrary width. Furthermore, our example also covers SS linear networks by adopting the linear activation. Previous works have proven that there are no spurious valleys in dense linear networks [44]. Hence, we conclude that sparsity in the final layer also deteriorates the landscape of linear networks.

5.1 Practical Implications

Theorems 4 and 5 give positive results for sparse networks: with a dense final layer, sparsity in the other layers generally does not affect the benign landscape from the theoretical perspective. Theorem 6 provides the negative part: sparsity in the final layer can bring spurious valleys. These results convey the message that we need to be cautious when pruning the final layer, because sparsifying the final-layer weight can directly deteriorate the landscape of the network. Intuitively, the final-layer weight decides the output of the network, which is crucial for fitting data from the optimization viewpoint even if the width is large enough. Thus, we should preserve a dense final-layer weight if possible, while pruning the weights of the other layers has less impact on the landscape; hence we can adopt many variants of pruning methods in practice to obtain powerful features and produce highly efficient networks. Modern pruning methods adopt many tricks, such as iterative pruning and retraining [24, 9]. We therefore recommend a small pruning rate for the final-layer weight to avoid spurious valleys, from the landscape perspective. Moreover, a recent work [23, Appendix F] also empirically finds that allocating more parameters to the last fully-connected layer while keeping the overall parameter count fixed can lead to higher accuracy. These experimental findings in [23] demonstrate the importance of the final-layer weight from the generalization perspective, which supports our viewpoint from another angle.

6 Experiments

The previous theoretical findings mainly describe the global landscape of sparse neural networks. In this section, we conduct experiments to investigate deep sparse networks using practical optimization methods.

Datasets

We consider a synthetic dataset and the CIFAR10 dataset [19]. For the synthetic dataset, we sample training data \bm{x}_{i}\stackrel{i.i.d.}{\sim}\mathcal{N}(\bm{0},I_{100}) and construct the corresponding targets \bm{y}_{i}=A\bm{x}_{i}+\bm{\varepsilon}_{i}, where \bm{\varepsilon}_{i}\stackrel{i.i.d.}{\sim}\mathcal{N}(\bm{0},I_{d_{y}}), \bm{\varepsilon}_{i}\perp\!\!\!\perp\bm{x}_{j}, and A_{kl}\stackrel{i.i.d.}{\sim}\mathcal{N}(0,1) for all i,j\in[400], k\in[d_{y}], l\in[100]. We scale A so that \|A\|_{F}=5 to avoid a small numerical output when extreme sparsity is introduced. In the following, we construct two datasets with d_{y}=1 and 10. For the CIFAR10 dataset, we sample a subset with 400 training samples and use one-hot labels as targets (d_{y}=10).
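For reference, the synthetic data described above can be generated as follows (a sketch of the stated setup; the seed and the helper name make_synthetic are our own choices):

```python
import numpy as np

def make_synthetic(d_y, n=400, d_x=100, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((d_x, n))            # x_i ~ N(0, I_100), columns are samples
    A = rng.standard_normal((d_y, d_x))
    A *= 5.0 / np.linalg.norm(A)                 # rescale so that ||A||_F = 5
    Y = A @ X + rng.standard_normal((d_y, n))    # targets y_i = A x_i + eps_i
    return X, Y

X1, Y1 = make_synthetic(d_y=1)
X10, Y10 = make_synthetic(d_y=10)
print(X1.shape, Y1.shape, X10.shape, Y10.shape)
```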

Experimental Settings

We use the Mean Squared Error (MSE) loss with a unified constant learning rate of 0.01 under the gradient descent (GD) method, and we use the same hidden width of 400 across all layers, equal to the number of training samples, to match our assumptions. We do not employ bias terms or any data pre-processing or data augmentation in our experiments, in order to match our theoretical setting. We use the default initialization in PyTorch [38] in all experiments, unless specifically mentioned.
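A minimal sketch of how such sparse training can be implemented (our implementation detail, not prescribed above): fix one binary mask per layer, zero the pruned weights at initialization, and re-apply the mask after every gradient step.

```python
import torch

def train_sparse_mlp(X, Y, masks, widths, lr=0.01, epochs=1000):
    # X: (n, d_x), Y: (n, d_y); widths: [d_x, 400, ..., d_y];
    # masks[l]: float tensor of shape (widths[l+1], widths[l]), 1 = kept connection.
    layers = [torch.nn.Linear(widths[l], widths[l + 1], bias=False) for l in range(len(masks))]
    params = [layer.weight for layer in layers]
    with torch.no_grad():
        for p, m in zip(params, masks):
            p.mul_(m)                              # start with pruned weights at zero
    opt = torch.optim.SGD(params, lr=lr)
    for _ in range(epochs):
        h = X
        for i, layer in enumerate(layers):
            h = layer(h)
            if i < len(layers) - 1:
                h = torch.tanh(h)                  # for linear networks, drop this line
        loss = torch.nn.functional.mse_loss(h, Y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            for p, m in zip(params, masks):
                p.mul_(m)                          # project back onto the sparse pattern
    return loss.item()
```

Re-applying the mask after each step amounts to projected gradient descent onto the fixed sparsity pattern, so pruned weights never re-enter training.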

Figure 3: Simulation of 5-layer linear sparse networks. (a) d_{y}=10: only the first-layer weight is sparse and each sparse pattern is over-parameterized (p_{i}\geq d_{i},\forall i\in[s]); the sparsity ratio applies only to the first layer. (b) d_{y}=1: all layer weights are sparse, and the total sparsity is shown in the legend. Legend ‘1.0’ denotes the dense network.

6.1 Sparse Linear Networks

We first verify the results for deep sparse linear networks in Theorem 2. Specifically, a deep sparse linear network has no spurious valleys when 1) only the first-layer weight is sparse and each sparse pattern in the first layer is over-parameterized (i.e., p_{i}\geq d_{i},\forall i\in[s]), shown in Figure 3(a) with d_{y}=10, or 2) all layer weights are sparse and the output dimension is d_{y}=1, shown in Figure 3(b). We adopt a 5-layer linear network because a larger depth may cause vanishing or exploding gradients [41]. To guarantee the effectiveness of the sparse networks, we first apply random pruning with fixed pruning rates for each layer, and then remove useless connections, which may lead to an even sparser structure.

We plot the difference between the current loss L(\bm{\theta}) and the global minimum L^{*} obtained from linear regression in Figure 3. Here, \bm{\theta} denotes the parameters remaining after random pruning and effectiveness detection, and from our construction, L^{*}=0. Though Theorem 2 only shows the absence of spurious valleys, we suspect that a nice landscape enables a common optimizer to find a global minimum (i.e., zero loss) in practice. The simulation verifies this conjecture: the optimization trajectory has nearly non-increasing loss and approaches a global minimum.

6.2 Sparse-dense Non-linear Networks

Figure 4: Simulation of 5-layer non-linear sparse networks with d_{y}=10 and total sparsity shown in the legend: (a) Tanh (Synthetic); (b) ReLU (Synthetic); (c) Tanh (CIFAR10). We plot the loss in the first row and the rank of each hidden-layer output in the second row. Legend ‘0.1-L3’ means the rank of the 3rd hidden-layer output under 0.1 sparsity. Since the total number of epochs differs across sparsity levels, we rescale the epochs by the total number of epochs to plot all results in one graph.

Second, we conduct experiments on deep sparse non-linear networks, motivated by Theorem 5, to verify that a wide, dense final layer in an SD net preserves a benign landscape as in the dense network, and to check the full-rank property of each hidden-layer output matrix. The numerical findings are shown in Figure 4, where we use the dataset with d_{y}=10 and prune all layer weights except the final layer to guarantee n connections for each output neuron. For a reasonable comparison, every input neuron has at least one directed path to a final output, and we also check that each neuron in the last hidden layer is effective. During optimization, we find that the default initialization gives very slow convergence on the CIFAR10 dataset; hence we multiply each weight by 10 after initialization when running on CIFAR10.

For non-linear networks in Figure 4, we observe that all sparse structures could obtain near-zero training loss, which matches our theoretical finding. Moreover, we discover that the outputs of the first few hidden layers are full rank, but the output matrices of the later layers become rank deficient even for dense networks in Figure 4(a). We suspect that such a phenomenon could be caused by numerical overflow from the saturation of activations and the small default initialization. When applied to the CIFAR10 dataset with large initialization, we discover a stable rank phenomenon in Figure 4(c). Thus, we could confirm that the outputs of hidden layers are nearly full rank.

Additionally, we also try the non-analytic ReLU activation in Figure 4(b), though our theory does not cover it. We find it still reaches near-zero loss even with large sparsity, but training stalls near the origin for a long, distinct period and oscillates at the end, and the rank behavior under ReLU is also unusual. For dense networks, the full-rank property is preserved in the first few layers while the last few hidden layers become rank deficient; for sparse networks, the rank drops quickly during training and does not recover at the end. We leave such phenomena for future work.

Overall, the approximate full-rank property still holds under large sparsity as long as the effective width is larger than the number of training samples. However, sparsity usually leads to a much longer training trajectory under the default setting and exhibits complex behavior coupled with the optimization method.

6.3 Sparse-sparse Networks

Figure 5: Loss landscape visualization for the examples in Section 6.3: (a) Tanh; (b) LeakyReLU. We project the landscape onto a 2D space spanned by a and t, where t describes the interpolation t\tilde{\bm{\theta}}(a)+(1-t)\bm{\theta}(a,b). Here a, \tilde{\bm{\theta}}(a), and \bm{\theta}(a,b) follow the proof of Theorem 6. The red line is the loss value at \bm{\theta}(a,b) with b=10 and various a>0, which forms a spurious valley.

Finally, we verify the bad landscape of the SS nets constructed in the proof of Theorem 6. For simplicity, we select X=I_{3} and Y=\left(\begin{smallmatrix}1&1&0\\ 2&8&0\\ 0&0&2\end{smallmatrix}\right) (indeed, we only need y_{3}>y_{4}>y_{1}>0 and y_{2}>0 in the proof of Theorem 6) following Eq. (3), and consider the sparse structure in the right graph of Figure 2 with d_{x}=3, p=2, d_{y}=3, i.e., the loss function is

L(\bm{w})=\frac{1}{2}\left\|\left(\begin{smallmatrix}w_{1}&0\\ w_{2}&w_{3}\\ 0&w_{4}\end{smallmatrix}\right)\sigma\left[\left(\begin{smallmatrix}w_{5}&w_{6}&0\\ 0&w_{7}&w_{8}\end{smallmatrix}\right)X\right]-Y\right\|_{F}^{2}=\frac{1}{2}\left\|\left(\begin{smallmatrix}w_{1}\sigma(w_{5})-1&w_{1}\sigma(w_{6})-1&0\\ w_{2}\sigma(w_{5})-2&w_{2}\sigma(w_{6})+w_{3}\sigma(w_{7})-8&w_{3}\sigma(w_{8})\\ 0&w_{4}\sigma(w_{7})&w_{4}\sigma(w_{8})-2\end{smallmatrix}\right)\right\|_{F}^{2}.

We use several common continuous activations, including the shifted Sigmoid (we choose the shifted Sigmoid activation \sigma(z)=1/(1+e^{-z})-1/2 so that \sigma(0)=0), Tanh, LeakyReLU, ELU [5], and ReLU, all of which satisfy our activation assumption \sigma(0)=0. We run 100 trials in total using GD with a learning rate of 0.01 until training converges (up to 50000 iterations), and record the distribution of convergent solutions in Table 3.
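A condensed version of this experiment can be reproduced with autograd as in the sketch below (ours; it uses only Tanh, fewer trials, and fewer iterations than the 50000 reported above). Final losses near y_4^2/2=2 correspond to the constructed spurious valley.

```python
import torch

Y = torch.tensor([[1., 1., 0.], [2., 8., 0.], [0., 0., 2.]])

def loss_fn(w):
    s = torch.tanh
    out = torch.stack([
        torch.stack([w[0] * s(w[4]), w[0] * s(w[5]), torch.zeros(())]),
        torch.stack([w[1] * s(w[4]), w[1] * s(w[5]) + w[2] * s(w[6]), w[2] * s(w[7])]),
        torch.stack([torch.zeros(()), w[3] * s(w[6]), w[3] * s(w[7])]),
    ])
    return 0.5 * torch.sum((out - Y) ** 2)

torch.manual_seed(0)
finals = []
for _ in range(20):                      # 20 random trials (the paper runs 100)
    w = torch.randn(8, requires_grad=True)
    opt = torch.optim.SGD([w], lr=0.01)
    for _ in range(20000):
        opt.zero_grad()
        loss = loss_fn(w)
        loss.backward()
        opt.step()
    finals.append(round(loss_fn(w).item(), 3))
print(finals)                            # values close to 2.0 indicate the spurious valley
```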

From Table 3, except for ReLU, the training loss for the other activations either approaches a global minimum or gets trapped in the valley we constructed, and a trial is more likely to fall into the spurious valley than to reach a global minimum. Moreover, for the ReLU activation there are 11 distinct final convergent solutions, while only 5 trials approach global minima, showing a worse landscape compared to the other activations. Finally, the landscape visualization of some activations in Figure 5 also matches our theorem well. Overall, our theorems and experiments show that SS nets can suffer severe landscape issues.

Shifted Sigmoid: \framebox{95}, \mathbf{5}; Tanh: \framebox{99}, \mathbf{1}; LeakyReLU: \framebox{45}, \mathbf{55}; ELU: \framebox{67}, \mathbf{33}; ReLU: 17, 15, \mathbf{5}, 19, 6, 6, 6, 8, \framebox{7}, 7, 4.
Table 3: Statistics of the final convergent solutions over 100 trials for different continuous activations. The number inside a box is the number of trials trapped in our constructed valley, and the bold number is the number of trials in which the algorithm reaches an approximate global minimum. In particular, we find 11 distinct solutions under ReLU.

7 Conclusion

We have investigated the landscape of sparse networks in several scenarios. For a sparse linear network with a dense final layer, we have shown that some specific structures have no spurious valleys. For a sparse non-linear network with a dense wide final layer, there is no difference from the dense case in terms of spurious valleys. Finally, we have shown that spurious valleys can exist for networks in which both layers are sparse. These results reveal the negative impact of sparsity on the loss landscape of neural nets. Future research directions include providing more necessary conditions for sparse non-linear networks to have a benign landscape, and analyzing the optimization trajectory of popular gradient descent methods applied to sparse networks.

Appendix A Useless Connections and Nodes in Sparse Networks

We show several kinds of useless connections caused by sparsity or network pruning.

  1. Zero out-degree I: if a node has zero out-degree, we can eliminate its input connections because the node does not influence the network’s output, e.g., h^{(2)}_{1} in Figure 6.

  2. Zero out-degree II: if a node has zero out-degree after removing its output connections based on the criterion of Zero out-degree I in later layers, we can also eliminate its input connections, e.g., h^{(1)}_{1} in Figure 6. Although it has one output connection, the connected node h^{(2)}_{1} has zero out-degree, so that connection can be removed, leaving zero out-degree. Hence we can eliminate the input connections of h^{(1)}_{1} as well.

  3. Zero in-degree I: if a node has zero in-degree, we can eliminate its output connections when \sigma(0)=0 and no bias term is used, because the node only provides a zero feature, e.g., h^{(2)}_{4} and h^{(1)}_{4} in Figure 6. However, when the node has a bias term, we cannot remove its output connections, since the bias constant propagates to subsequent layers, even though such a node only provides a feature of the form \sigma(b\cdot\bm{1}_{n}) for some b\in\mathbb{R}.

  4. Zero in-degree II: if a node has zero in-degree after removing its input connections based on the criterion of Zero in-degree I in earlier layers, we can also eliminate its output connections, e.g., h^{(2)}_{3} in Figure 6. Although it has one input connection, the connected node h^{(1)}_{4} has zero in-degree, so that connection can be removed, leaving zero in-degree. Hence we can eliminate the output connections of h^{(2)}_{3} as well. Similarly, we cannot remove the output connections if a bias term is employed.

In conclusion, we can remove hidden nodes that do not appear in any directed path from one input node to one output node. Then the remaining structure is effective.

[Figure 6 diagram: inputs x_{1},x_{2},x_{3}, hidden layers h^{(1)}_{1},\dots,h^{(1)}_{4} and h^{(2)}_{1},\dots,h^{(2)}_{4}, and outputs y_{1},y_{2}; dotted lines mark removable connections.]
Figure 6: An example of sparse networks. Lines are connections of the original sparse network, dotted lines are useless connections that can be removed, and solid lines are effective connections.

Appendix B Missing Proofs for Section 3

B.1 Proof of Theorem 1

Lemma 1 (Modified from [44])

For two matrices U=[\bm{u}_{1},\ldots,\bm{u}_{m}]\in\mathbb{R}^{m\times p} and W=[\bm{w}_{1},\ldots,\bm{w}_{p}]^{\top}\in\mathbb{R}^{p\times n} with p\geq n, if \mathrm{rank}(W)=r<n and \bm{w}_{1},\ldots,\bm{w}_{r} are linearly independent, then we can construct a U^{0} such that U^{0}W=UW and U^{0} has p{-}r zero columns.

Proof: From the condition, we can write U=(U_{1}\ \ U_{2}) with U_{1}\in\mathbb{R}^{m\times r} and U_{2}\in\mathbb{R}^{m\times(p-r)}, and W=\left(\begin{smallmatrix}I_{r}&\bm{0}\\ Q&\bm{0}\end{smallmatrix}\right)\left(\begin{smallmatrix}W_{1}^{\top}\\ \bm{0}\end{smallmatrix}\right) for some Q\in\mathbb{R}^{(p-r)\times r}, because W_{1}^{\top}=[\bm{w}_{1},\ldots,\bm{w}_{r}]^{\top} is a basis of the row space of W. Then UW=\left[(U_{1}\ \ U_{2})\left(\begin{smallmatrix}I_{r}&\bm{0}\\ Q&\bm{0}\end{smallmatrix}\right)\right]\left(\begin{smallmatrix}W_{1}^{\top}\\ \bm{0}\end{smallmatrix}\right)=(U_{1}+U_{2}Q\ \ \bm{0})\left(\begin{smallmatrix}W_{1}^{\top}\\ \bm{0}\end{smallmatrix}\right)=(U_{1}+U_{2}Q\ \ \bm{0})W. Therefore, we can choose U^{0}=(U_{1}+U_{2}Q\ \ \bm{0}), which satisfies the requirement. \square

Lemma 2

Suppose f(\bm{u})\colon\mathbb{R}^{d}\to\mathbb{R} is a convex function of \bm{u}, and \bm{u}^{*}\in\mathop{\arg\min}_{\bm{u}}f(\bm{u}). Then for any \bm{u}_{0}\in\mathbb{R}^{d}, the function t\mapsto f((1-t)\bm{u}_{0}+t\bm{u}^{*}) is non-increasing on t\in[0,1].

Proof: For any 1\geq t>s\geq 0, by the convexity of f(\cdot) and f(\bm{u}^{*})=\min_{\bm{u}}f(\bm{u}), we have

f((1-s)\bm{u}_{0}+s\bm{u}^{*})\geq\frac{1-t}{1-s}\cdot f((1-s)\bm{u}_{0}+s\bm{u}^{*})+\frac{t-s}{1-s}\cdot f(\bm{u}^{*})\geq f((1-t)\bm{u}_{0}+t\bm{u}^{*}).

Thus, t\mapsto f((1-t)\bm{u}_{0}+t\bm{u}^{*}) is non-increasing on [0,1]. \square

Now we turn back to the proof of Theorem 1.

Proof: We prove the results of no spurious valleys by constructing the path that satisfies Property P.

  1) If p_{i}\geq d_{i}, we reformulate the objective as

     \min_{\substack{U_{1},\dots,U_{s},\\ W_{1},\dots,W_{s}}} L(U_{1},\dots,U_{s},W_{1},\dots,W_{s}):=\frac{1}{2}\left\|(U_{1}\ \cdots\ U_{s})\left(\begin{smallmatrix}W_{1}&\dots&\bm{0}\\ \vdots&\ddots&\vdots\\ \bm{0}&\dots&W_{s}\end{smallmatrix}\right)Z-Y\right\|_{F}^{2},\qquad Z:=\left(\begin{smallmatrix}Z_{1}\\ \vdots\\ Z_{s}\end{smallmatrix}\right).

     Now we construct a non-increasing path to a global minimum. If for a certain i, \mathrm{rank}(W_{i})=r<d_{i}, we write W_{i}=[\bm{w}_{1},\dots,\bm{w}_{p_{i}}]^{\top} and, w.l.o.g., assume \bm{w}_{1},\dots,\bm{w}_{r} are linearly independent. Then from Lemma 1, we can find U_{i}^{0} such that U_{i}^{0}W_{i}=U_{i}W_{i} and the last p_{i}-r columns of U_{i}^{0} are zero. We first fix W_{i} and take a linear path U_{i}(t)=(1-t)U_{i}+tU_{i}^{0}, 0\leq t\leq 1, with invariant loss. Then we fix U_{i}^{0} and take another linear path W_{i}(t^{\prime})=(1-t^{\prime})W_{i}+t^{\prime}W_{i}^{0}, 0\leq t^{\prime}\leq 1, where W_{i}^{0} has the same r rows as W_{i} and \mathrm{rank}(W_{i}^{0})=d_{i} (we can always extend linearly independent vectors to a basis because p_{i}\geq d_{i}). Hence, we can reach a point (U_{1}^{0},\dots,U_{s}^{0},W_{1}^{0},\dots,W_{s}^{0}) in which all the W_{i}^{0} have full column rank, with unchanged loss.

     Next, note that a global minimizer (U_{i}^{*},W^{*}_{i}) satisfies \left(U_{1}^{*}W_{1}^{*},\dots,U_{s}^{*}W_{s}^{*}\right)=YZ^{+}, where Z^{+} is the pseudoinverse of Z. Hence we can obtain a global minimizer by choosing U_{i}^{*} to satisfy \left(U_{1}^{*}W^{0}_{1},\dots,U_{s}^{*}W_{s}^{0}\right)=YZ^{+}, since the W_{i}^{0} have full column rank. Finally, we take U_{i}(t^{\prime\prime})=(1-t^{\prime\prime})U_{i}^{0}+t^{\prime\prime}U_{i}^{*}, 0\leq t^{\prime\prime}\leq 1, \forall i\in[s]. Since L(U_{i},W_{i},i\in[s]) is convex in the U_{i}, the constructed path is a non-increasing path to a global minimum by Lemma 2. Hence, there are no spurious valleys.

2)

If $Z_{i}Z_{j}^{\top}=\bm{0}$, $\forall i,j\in[s],i\neq j$, then $Z_{i}$ and $Z_{j}$ share no rows of $X$. Moreover, since every row of $X$ has connections to the first-layer weight, $Z_{1},\ldots,Z_{s}$ together form a rearrangement of the rows of $X$. Then we only need to consider the objective:

\[\min_{U,W_{1},\dots,W_{s}}L(U,W_{1},\dots,W_{s})=\frac{1}{2}\left\|\sum_{i=1}^{s}U_{i}W_{i}Z_{i}-Y\right\|_{F}^{2}\stackrel{(i)}{=}\frac{1}{2}\sum_{i=1}^{s}\left\|U_{i}W_{i}Z_{i}-Y\right\|_{F}^{2}-\frac{s-1}{2}\left\|Y\right\|_{F}^{2},\]

where $(i)$ uses the assumption $Z_{i}Z_{j}^{\top}=\bm{0}$, $\forall i,j\in[s],i\neq j$ again. The objective thus decomposes into $s$ independent parts, each of which is a two-layer dense linear network. By [44, Theorem 11], dense linear networks have no spurious valleys. Hence the original SD network has no spurious valleys.

3)

If $d_{y}=1$, we can write $U=\left(u_{1},\dots,u_{p}\right)\in\mathbb{R}^{1\times p}$ and, with some abuse of notation, denote the rows of $W_{1},\dots,W_{s}$ in sequence by $\bm{w}_{1}^{\top},\dots,\bm{w}_{p}^{\top}$ (the $\bm{w}_{i}$s may have different dimensions), with corresponding data matrices $Z_{1},\dots,Z_{p}$ (some of the $Z_{i}$s may coincide). Then we can reformulate the original problem as

\[\min_{u_{i},\bm{w}_{i},i\in[p]}L(U,W)=\frac{1}{2}\left\|\sum_{i=1}^{p}u_{i}\bm{w}_{i}^{\top}Z_{i}-Y\right\|_{2}^{2}.\]

From any initialization, if $u_{i}=0$ for some $i\in[p]$, we can construct a linear path from $\bm{w}_{i}$ to $\bm{w}_{i}^{\prime}=\bm{0}$, and then from $u_{i}=0$ to $u_{i}^{\prime}=1$ with the other weights fixed, along which the loss is unchanged. Therefore, we may assume $u_{i}\neq 0,\forall i\in[p]$. The objective is then convex in the sparse first-layer weight $\widetilde{W}$. Hence, we can construct a non-increasing path to a global minimum by Lemma 2.

\square
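The following sketch illustrates the path constructed in part 1) on a toy instance with $s=2$ blocks (assuming NumPy; all dimensions and the rank-deficient initialization are arbitrary choices for illustration). It checks that the loss is unchanged along the first two segments and non-increasing along the last one, ending at the global minimum.

    import numpy as np

    rng = np.random.default_rng(2)
    d1, d2, p1, p2, dy, n = 3, 3, 4, 4, 2, 10

    X = rng.standard_normal((d1 + d2, n))
    Y = rng.standard_normal((dy, n))
    Z1, Z2 = X[:d1], X[d1:]

    def loss(U1, U2, W1, W2):
        return 0.5 * np.linalg.norm(U1 @ W1 @ Z1 + U2 @ W2 @ Z2 - Y, "fro") ** 2

    # initialization: W1 is rank-deficient (rank 2 < d1), W2 already has full column rank
    r = 2
    W1_top = rng.standard_normal((r, d1))
    Q = rng.standard_normal((p1 - r, r))
    W1 = np.vstack([W1_top, Q @ W1_top])
    W2 = rng.standard_normal((p2, d2))
    U1, U2 = rng.standard_normal((dy, p1)), rng.standard_normal((dy, p2))

    # segment 1: U1 -> U1_0 via Lemma 1 (loss unchanged)
    U1_0 = np.hstack([U1[:, :r] + U1[:, r:] @ Q, np.zeros((dy, p1 - r))])
    # segment 2: W1 -> W1_0, completing the rows to full column rank;
    # loss unchanged because the corresponding columns of U1_0 are zero
    null_dir = np.linalg.svd(W1_top)[2][r:]        # orthogonal complement of the row space
    W1_0 = W1.copy()
    W1_0[r:r + (d1 - r)] = null_dir
    # segment 3: (U1_0, U2) -> (U1*, U2*) realizing the unconstrained optimum Y Z^+
    A = Y @ np.linalg.pinv(X)
    U1_s = A[:, :d1] @ np.linalg.pinv(W1_0)
    U2_s = A[:, d1:] @ np.linalg.pinv(W2)

    ts = np.linspace(0, 1, 51)
    path = []
    path += [loss((1 - t) * U1 + t * U1_0, U2, W1, W2) for t in ts]
    path += [loss(U1_0, U2, (1 - t) * W1 + t * W1_0, W2) for t in ts]
    path += [loss((1 - t) * U1_0 + t * U1_s, (1 - t) * U2 + t * U2_s, W1_0, W2) for t in ts]
    assert np.all(np.diff(path) <= 1e-7)           # non-increasing along the whole path
    assert np.isclose(path[-1], 0.5 * np.linalg.norm(A @ X - Y, "fro") ** 2)
    print("constructed path is non-increasing and ends at the global minimum")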

B.2 Proof of Theorem 2

Proof: The objective of an $(L{+}1)$-layer network is $\min_{V_{i},i\in[L],W}\frac{1}{2}\left\|V_{L}\cdots V_{1}WX-Y\right\|_{F}^{2}$. We first treat only the first-layer weight $W$ as sparse, as in the shallow linear networks above. We prove the absence of spurious valleys by constructing a path that satisfies Property P, following Venturi et al. [44, Lemma 2].

1)

If $p_{i}\geq d_{i}$, we reformulate the objective as $(P)$ below:

\[\min_{\begin{subarray}{c}W_{i},i\in[s]\\ V_{j},j\in[L]\end{subarray}}L(V_{L},\dots,V_{1},W_{1},\dots,W_{s}):=\frac{1}{2}\left\|V_{L}\cdots V_{1}\left(\begin{smallmatrix}W_{1}&\dots&\bm{0}\\ \vdots&\ddots&\vdots\\ \bm{0}&\dots&W_{s}\end{smallmatrix}\right)\left(\begin{smallmatrix}Z_{1}\\ \vdots\\ Z_{s}\end{smallmatrix}\right)-Y\right\|_{F}^{2}.\qquad(P)\]

As in the proof of part 1) of Theorem 1, we may assume each $W_{i}$ has full column rank, and we take $\widetilde{V}_{1}:=V_{1}\left(\begin{smallmatrix}W_{1}&\dots&\bm{0}\\ \vdots&\ddots&\vdots\\ \bm{0}&\dots&W_{s}\end{smallmatrix}\right)$ as a new variable. Since $W_{i}^{+}W_{i}=I_{d_{i}}$, any matrix $V$ can be realized as $\widetilde{V}_{1}$ by choosing $V_{1}=V\left(\begin{smallmatrix}W_{1}^{+}&\dots&\bm{0}\\ \vdots&\ddots&\vdots\\ \bm{0}&\dots&W_{s}^{+}\end{smallmatrix}\right)$, so $\widetilde{V}_{1}$ is unconstrained (a numerical sketch of this reparameterization is given after the proof). Thus, the original problem shares the same global minimum as $(P_{1})$ below:

\[\min_{V_{L},\dots,V_{2},\widetilde{V}_{1}}\widetilde{L}(V_{L},\dots,V_{2},\widetilde{V}_{1}):=\frac{1}{2}\left\|V_{L}\cdots V_{2}\widetilde{V}_{1}\left(\begin{smallmatrix}Z_{1}\\ \vdots\\ Z_{s}\end{smallmatrix}\right)-Y\right\|_{F}^{2}.\qquad(P_{1})\]

From Venturi et al. [44, Theorem 11], we can construct a non-increasing path to a global minimum of $(P_{1})$ from any initialization. Thus we obtain a non-increasing path to a global minimum of $(P)$ as well.

2)

If $d_{y}=1$, we directly consider the objective with all sparse layers as

\[\min_{\begin{subarray}{c}\bm{w}_{i},i\in[p]\\ \widetilde{V}_{j},j\in[L]\end{subarray}}\frac{1}{2}\left\|\widetilde{V}_{L}\cdots\widetilde{V}_{1}\left(\begin{smallmatrix}\bm{w}_{1}^{\top}Z_{1}\\ \vdots\\ \bm{w}_{p}^{\top}Z_{p}\end{smallmatrix}\right)-Y\right\|_{F}^{2},\]

where $\widetilde{V}_{i}\in\mathbb{R}^{p_{i}\times p_{i-1}},\forall i\in[L]$ are sparse layer weights with $p_{0}=p$, $\bm{w}_{i}$ collects the remaining (unpruned) parameters in the $i$-th row of the first-layer weight, and $Z_{i}$ is the corresponding data matrix. Note that $\widetilde{V}_{L}\cdots\widetilde{V}_{1}\in\mathbb{R}^{1\times p}$ because $d_{y}=1$. If the initialization satisfies $(\widetilde{V}_{L}\cdots\widetilde{V}_{1})_{j}\neq 0,\forall j\in[p]$, then the objective is convex in the $\bm{w}_{i}$s, $i\in[p]$, and we can construct a non-increasing path to a global minimum by Lemma 2. Otherwise, there is some $j\in[p]$ with $(\widetilde{V}_{L}\cdots\widetilde{V}_{1})_{j}=0$. We fix such a $j$ and prove by induction on $L$ that we can construct a path with unchanged loss to a point $(\widetilde{V}_{L}^{\prime},\dots,\widetilde{V}_{1}^{\prime},\bm{w}_{1},\dots,\bm{w}_{j}^{\prime},\dots,\bm{w}_{p})$ such that $\bm{w}_{j}^{\prime}=\bm{0}$ and $\widetilde{V}_{L}^{\prime}\cdots\widetilde{V}_{1}^{\prime}=\widetilde{V}_{L}\cdots\widetilde{V}_{1}+a\bm{e}_{j}^{\top}$ for some $a\neq 0$, i.e., $\widetilde{V}_{L}^{\prime}\cdots\widetilde{V}_{1}^{\prime}$ changes only the $j$-th entry of $\widetilde{V}_{L}\cdots\widetilde{V}_{1}$, making it nonzero. The case $L=1$ has been shown for the shallow linear network above. Now suppose the result holds for $L-1$ layers, and consider the case of $L$ layers.

Case 1. Suppose for some $i\in[p_{1}]$, $(\widetilde{V}_{L}\cdots\widetilde{V}_{2})_{i}\neq 0$ and $(\widetilde{V}_{1})_{i,j}$ is not pruned. We first construct a linear path from $\bm{w}_{j}$ to $\bm{w}_{j}^{\prime}=\bm{0}$ with the other weights fixed; the loss is unchanged because $(\widetilde{V}_{L}\cdots\widetilde{V}_{1})_{j}=0$. Then we construct another linear path from $\widetilde{V}_{1}$ to $\widetilde{V}_{1}^{\prime}$ with $(\widetilde{V}_{1}^{\prime})_{\cdot,j}=\bm{e}_{i}$ and all other columns unchanged. The loss is still unchanged because $\bm{w}_{j}^{\prime}=\bm{0}$. Now we obtain $(\widetilde{V}_{L}\cdots\widetilde{V}_{2}\widetilde{V}_{1}^{\prime})_{j}=(\widetilde{V}_{L}\cdots\widetilde{V}_{2})\cdot(\widetilde{V}_{1}^{\prime})_{\cdot,j}=(\widetilde{V}_{L}\cdots\widetilde{V}_{2})_{i}\neq 0$ and $(\widetilde{V}_{L}\cdots\widetilde{V}_{2}\widetilde{V}_{1}^{\prime})_{k}=(\widetilde{V}_{L}\cdots\widetilde{V}_{1})_{k},\forall k\neq j$, which satisfies our requirement.

Case 2. Otherwise, for every $i\in[p_{1}]$ such that $(\widetilde{V}_{1})_{i,j}$ is not pruned, we have $(\widetilde{V}_{L}\cdots\widetilde{V}_{2})_{i}=0$. Now we view $X_{new}:=\left(\begin{smallmatrix}\bm{w}_{1}^{\top}Z_{1}\\ \vdots\\ \bm{w}_{p}^{\top}Z_{p}\end{smallmatrix}\right)$ as the new data and rearrange it to define $\left(Z_{new}\right)_{t},t\in[p_{1}]$ similarly to Eq. (2). Then we have

\[\widetilde{V}_{L}\cdots\widetilde{V}_{1}X_{new}=\widetilde{V}_{L}\cdots\widetilde{V}_{2}\left(\begin{smallmatrix}\left(\widetilde{V}_{1}\right)_{1,\cdot}\left(Z_{new}\right)_{1}\\ \vdots\\ \left(\widetilde{V}_{1}\right)_{p_{1},\cdot}\left(Z_{new}\right)_{p_{1}}\end{smallmatrix}\right).\]

For each $i$ with $(\widetilde{V}_{L}\cdots\widetilde{V}_{2})_{i}=0$, by the inductive hypothesis for $L-1$ layers, we can construct a loss-invariant path to a point $(\widetilde{V}_{L}^{\prime},\dots,\widetilde{V}_{2}^{\prime},(\widetilde{V}_{1})_{1,\cdot}^{\top},\dots,(\widetilde{V}_{1}^{\prime})_{i,\cdot}^{\top},\dots,(\widetilde{V}_{1})_{p_{1},\cdot}^{\top})$ that changes only the $i$-th entry of $\widetilde{V}_{L}\cdots\widetilde{V}_{2}$ to be nonzero and makes $(\widetilde{V}_{1}^{\prime})_{i,\cdot}=\bm{0}$; in particular, $(\widetilde{V}_{1}^{\prime})_{i,j}=0$. Repeating this argument for all such $i$, we obtain $(\widetilde{V}_{1}^{\prime})_{\cdot,j}=\bm{0}$. We then construct a loss-invariant linear path from $\bm{w}_{j}$ to $\bm{w}_{j}^{\prime}=\bm{0}$, which is possible because $(\widetilde{V}_{1}^{\prime})_{\cdot,j}=\bm{0}$. Subsequently, since $\bm{w}_{j}^{\prime}=\bm{0}$, we can construct another loss-invariant linear path from $(\widetilde{V}_{1}^{\prime})_{\cdot,j}$ to $(\widetilde{V}_{1}^{\prime\prime})_{\cdot,j}=\bm{e}_{k}$ by choosing some $k\in[p_{1}]$ such that $(\widetilde{V}_{1})_{k,j}$ is not pruned and $(\widetilde{V}_{L}^{\prime}\cdots\widetilde{V}_{2}^{\prime})_{k}\neq 0$. This completes the inductive step because $(\widetilde{V}_{L}^{\prime}\cdots\widetilde{V}_{2}^{\prime}\widetilde{V}_{1}^{\prime\prime})_{j}=(\widetilde{V}_{L}^{\prime}\cdots\widetilde{V}_{2}^{\prime})_{k}\neq 0$.

Combining Case 1 and Case 2 finishes the induction. Therefore, we can always reach a point with $(\widetilde{V}_{L}\cdots\widetilde{V}_{1})_{j}\neq 0,\forall j\in[p]$, and then construct a non-increasing path to a global minimum by adjusting only the $\bm{w}_{i}$s, by Lemma 2.

\square
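To illustrate the change of variables used in part 1), the following sketch (assuming NumPy and SciPy; the block sizes are placeholders) checks that any target value of $\widetilde{V}_{1}$ is realizable through $V_{1}=V\,\mathrm{blockdiag}(W_{1}^{+},\dots,W_{s}^{+})$ once every $W_{i}$ has full column rank:

    import numpy as np
    from scipy.linalg import block_diag

    rng = np.random.default_rng(3)
    d1, d2, p1, p2, q = 3, 2, 5, 4, 6           # q = number of rows of V_1

    W1 = rng.standard_normal((p1, d1))          # full column rank almost surely
    W2 = rng.standard_normal((p2, d2))
    target = rng.standard_normal((q, d1 + d2))  # an arbitrary desired value of V~_1

    # choose V_1 = target * blockdiag(W_1^+, W_2^+); then V_1 * blockdiag(W_1, W_2) = target
    V1 = target @ block_diag(np.linalg.pinv(W1), np.linalg.pinv(W2))
    V1_tilde = V1 @ block_diag(W1, W2)
    assert np.allclose(V1_tilde, target)
    print("any value of V~_1 is realizable, so (P) and (P_1) share the same optimum")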

Appendix C Missing Proofs for Section 4

C.1 Proof of Theorem 3

Proof: The proof is partially inspired by Venturi et al. [44], but we extend the result to sparse networks. Since $\dim^{*}(\sigma,X)<\infty$, we have $\dim^{*}(\sigma,Z_{i})<\infty$ for all $i\in[s]$. Thus, $\forall\bm{w},\bm{z}\in\mathbb{R}^{d_{i}}$, we have $\sigma(\bm{w}^{\top}\bm{z})=\langle\bm{\psi}_{i}(\bm{w}),\bm{\varphi}_{i}(\bm{z})\rangle$ for some maps $\bm{\psi}_{i},\bm{\varphi}_{i}:\mathbb{R}^{d_{i}}\to\mathbb{R}^{r_{i}}$, where $r_{i}:=\dim^{*}(\sigma,Z_{i})$, $i\in[s]$. Denote

\[\bm{\psi}(\widetilde{W})=\left(\begin{smallmatrix}\bm{\psi}_{1}(W_{1})&\cdots&\bm{0}\\ \vdots&\ddots&\vdots\\ \bm{0}&\cdots&\bm{\psi}_{s}(W_{s})\end{smallmatrix}\right),\text{ where }\bm{\psi}_{i}(W_{i})\in\mathbb{R}^{p_{i}\times r_{i}},\;\forall i\in[s].\]

Then the network output becomes $U\bm{\psi}(\widetilde{W})\left(\begin{smallmatrix}\bm{\varphi}_{1}(Z_{1})\\ \vdots\\ \bm{\varphi}_{s}(Z_{s})\end{smallmatrix}\right)$. Once $\bm{\psi}(\widetilde{W})$ has full column rank, we only need to adjust $U$ to obtain a non-increasing path to zero loss by Lemma 2. Otherwise, we can first reach a point where $\bm{\psi}(\widetilde{W})$ has full column rank by adjusting the $U_{i}$s and $W_{i}$s, following the proof of Venturi et al. [44, Corollary 10] or Lemma 1. \square
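As a concrete (illustrative, not from the paper) instance of such feature maps: for the quadratic activation $\sigma(t)=t^{2}$ one may take $\bm{\psi}(\bm{w})=\mathrm{vec}(\bm{w}\bm{w}^{\top})$ and $\bm{\varphi}(\bm{z})=\mathrm{vec}(\bm{z}\bm{z}^{\top})$, so that $\sigma(\bm{w}^{\top}\bm{z})=\langle\bm{\psi}(\bm{w}),\bm{\varphi}(\bm{z})\rangle$ with feature dimension at most $d^{2}$ (assuming NumPy):

    import numpy as np

    rng = np.random.default_rng(4)
    d = 4
    w, z = rng.standard_normal(d), rng.standard_normal(d)

    sigma = lambda t: t ** 2                    # activation with finite intrinsic dimension
    psi = lambda w: np.outer(w, w).ravel()      # psi(w) = vec(w w^T)
    phi = lambda z: np.outer(z, z).ravel()      # phi(z) = vec(z z^T)

    assert np.isclose(sigma(w @ z), psi(w) @ phi(z))
    print("sigma(w^T z) = <psi(w), phi(z)> with feature dimension d^2 =", d ** 2)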

C.2 Proof of Theorem 4

Lemma 3 (Dang [6], Mityagin [32])

If $f:\mathbb{R}^{n}\to\mathbb{R}$ is a real analytic function which is not identically zero, then $\{x\in\mathbb{R}^{n}\,|\,f(x)=0\}$ has Lebesgue measure zero.

Now we turn to the proof of Theorem 4.

Proof: Suppose $\mathrm{rank}(\sigma(\widetilde{W}X))=t-1<n$ and, without loss of generality, that the first $t-1$ rows of $\sigma(\widetilde{W}X)\in\mathbb{R}^{p\times n}$ are linearly independent. Then, by Lemma 1, we can construct a loss-invariant linear path from $(U,\widetilde{W})$ to $(U^{\prime},\widetilde{W})$, where $U^{\prime}$ has $p-t+1$ zero columns. It remains to show that $\sigma(\widetilde{W}X)$ can be made full column rank by replacing the last $p-t+1$ rows of $\widetilde{W}$; the loss is unchanged during this replacement because of the zero columns of $U^{\prime}$.

First, we show that there exists $\bm{w}_{t}$, the $t$-th row of $\widetilde{W}$, such that the first $t$ rows of $\sigma(\widetilde{W}X)$ are linearly independent. Consider the determinant of the submatrix formed by the first $t$ rows and the first $t$ columns of $\sigma(\widetilde{W}X)$:

\[g(\bm{w}_{t}):=\det\left(\begin{smallmatrix}\sigma(\bm{w}_{1}^{\top}(Z_{1})_{\cdot,1})&\dots&\sigma(\bm{w}_{1}^{\top}(Z_{1})_{\cdot,i})&\dots&\sigma(\bm{w}_{1}^{\top}(Z_{1})_{\cdot,t})\\ \vdots&\ddots&\vdots&\ddots&\vdots\\ \sigma(\bm{w}_{t}^{\top}(Z_{t})_{\cdot,1})&\dots&\sigma(\bm{w}_{t}^{\top}(Z_{t})_{\cdot,i})&\dots&\sigma(\bm{w}_{t}^{\top}(Z_{t})_{\cdot,t})\end{smallmatrix}\right).\]

Then $g(\bm{w}_{t})$ is a real analytic function by Assumption 2. If $g(\cdot)\not\equiv 0$, then by Lemma 3 we can find some $\bm{w}_{t}$ with $g(\bm{w}_{t})\neq 0$, which already shows that the first $t$ rows of $\sigma(\widetilde{W}X)$ are linearly independent. Otherwise, $g(\cdot)\equiv 0$, and we have

\[\frac{\partial^{k}g(\bm{w}_{t})}{\partial(\bm{w}_{t})_{1}^{k}}=\det\left(\begin{smallmatrix}\sigma(\bm{w}_{1}^{\top}(Z_{1})_{\cdot,1})&\dots&\sigma(\bm{w}_{1}^{\top}(Z_{1})_{\cdot,i})&\dots&\sigma(\bm{w}_{1}^{\top}(Z_{1})_{\cdot,t})\\ \vdots&\ddots&\vdots&\ddots&\vdots\\ \sigma(\bm{w}_{t-1}^{\top}(Z_{t-1})_{\cdot,1})&\dots&\sigma(\bm{w}_{t-1}^{\top}(Z_{t-1})_{\cdot,i})&\dots&\sigma(\bm{w}_{t-1}^{\top}(Z_{t-1})_{\cdot,t})\\ (Z_{t})_{1,1}^{k}\sigma^{(k)}(\bm{w}_{t}^{\top}(Z_{t})_{\cdot,1})&\dots&(Z_{t})_{1,i}^{k}\sigma^{(k)}(\bm{w}_{t}^{\top}(Z_{t})_{\cdot,i})&\dots&(Z_{t})_{1,t}^{k}\sigma^{(k)}(\bm{w}_{t}^{\top}(Z_{t})_{\cdot,t})\end{smallmatrix}\right)\equiv 0.\]

Since the first $t-1$ rows of $\sigma(\widetilde{W}X)$ are linearly independent, the last row of $\frac{\partial^{k}g(\bm{w}_{t})}{\partial(\bm{w}_{t})_{1}^{k}}$ can be expressed as a linear combination of the remaining $t-1$ rows. Hence the last rows of $\frac{\partial^{l_{i}}g(\bm{w}_{t})}{\partial(\bm{w}_{t})_{1}^{l_{i}}},i\in[t]$ are linearly dependent, showing that for all $\bm{w}_{t}$,

\[\det\left(\begin{smallmatrix}(Z_{t})_{1,1}^{l_{1}}\sigma^{(l_{1})}(\bm{w}_{t}^{\top}(Z_{t})_{\cdot,1})&\dots&(Z_{t})_{1,t}^{l_{1}}\sigma^{(l_{1})}(\bm{w}_{t}^{\top}(Z_{t})_{\cdot,t})\\ \vdots&\ddots&\vdots\\ (Z_{t})_{1,1}^{l_{t}}\sigma^{(l_{t})}(\bm{w}_{t}^{\top}(Z_{t})_{\cdot,1})&\dots&(Z_{t})_{1,t}^{l_{t}}\sigma^{(l_{t})}(\bm{w}_{t}^{\top}(Z_{t})_{\cdot,t})\end{smallmatrix}\right)=0.\qquad(4)\]

Since $l_{1},\dots,l_{n}\in\mathbb{N}$ form an arithmetic sequence, we denote its common difference by $d\in\mathbb{N}_{+}$ and choose $\bm{w}_{t}=\bm{0}$; then the left side of Eq. (4) equals

\[\prod_{i=1}^{t}\sigma^{(l_{i})}(0)\cdot\prod_{i=1}^{t}\left(Z_{t}\right)_{1,i}^{l_{1}}\cdot\prod_{1\leq i<j\leq t}\left[(Z_{t})_{1,j}^{d}-(Z_{t})_{1,i}^{d}\right],\]

where we use the determinant formula for the Vandermonde matrix in the variables $(Z_{t})_{1,i}^{d}$. However, Assumption 2 implies that $\sigma^{(l_{1})}(0),\dots,\sigma^{(l_{t})}(0)\neq 0$. Furthermore, Assumption 1 implies $(Z_{t})_{1,i}^{d}\neq(Z_{t})_{1,j}^{d}$ and $(Z_{t})_{1,i}^{l_{1}}\neq 0$, since $\left(Z_{t}\right)_{1,\cdot}$ consists of distinct non-zero features (rather than the all-ones row $\bm{1}^{\top}$). Therefore, we obtain a contradiction, so we can find $\bm{w}_{t}$ that makes the first $t$ rows of $\sigma(\widetilde{W}X)$ linearly independent.
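The determinant evaluation at $\bm{w}_{t}=\bm{0}$ can be checked numerically; in the sketch below (assuming NumPy; all concrete values are arbitrary), $x_{i}$ plays the role of $(Z_{t})_{1,i}$ and $c_{j}$ of $\sigma^{(l_{j})}(0)$:

    import numpy as np
    from itertools import combinations

    t, l1, d = 4, 2, 3                                  # size, first exponent, common difference
    ls = l1 + d * np.arange(t)                          # arithmetic sequence l_1, ..., l_t
    x = np.array([1.0, 2.0, 3.0, 4.0])                  # distinct non-zero "features"
    c = np.array([0.5, -1.0, 2.0, 1.5])                 # stands for sigma^{(l_j)}(0)

    M = np.array([[c[j] * x[i] ** ls[j] for i in range(t)] for j in range(t)])
    vandermonde = np.prod([x[j] ** d - x[i] ** d for i, j in combinations(range(t), 2)])
    closed_form = np.prod(c) * np.prod(x ** l1) * vandermonde
    assert np.isclose(np.linalg.det(M), closed_form, rtol=1e-8)
    print("det matches the generalized Vandermonde formula")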

Second, we can repeat the above step until $\sigma(\widetilde{W}X)$ has full column rank. Moreover, the previous argument also shows that $\mathrm{rank}(\sigma(\widetilde{W}X))=n$ almost surely. In particular, any $n$ rows of $\sigma(\widetilde{W}X)$ have full rank with measure one.

Now, since $U^{\prime}$ has $p-t+1$ zero columns, we can adjust the $i$-th row of $\widetilde{W}$ with $i\geq t$ arbitrarily without changing the loss. Hence, by the above argument, we can reach some $\widetilde{W}^{\prime}$ such that $\sigma(\widetilde{W}^{\prime}X)$ has full column rank. Then we reach zero loss via the linear path from $U^{\prime}$ to $U^{\prime\prime}=Y[\sigma(\widetilde{W}^{\prime}X)]^{+}$, along which the loss is non-increasing by Lemma 2. \square
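The final interpolation step can be checked directly: once $\sigma(\widetilde{W}^{\prime}X)$ has full column rank, $U^{\prime\prime}=Y[\sigma(\widetilde{W}^{\prime}X)]^{+}$ attains zero loss. A minimal sketch (assuming NumPy, with $\tanh$ as one admissible analytic activation and random data standing in for Assumption 1):

    import numpy as np

    rng = np.random.default_rng(6)
    d, p, n, dy = 5, 12, 8, 3                          # width p exceeds the sample size n

    X = rng.standard_normal((d, n))
    Y = rng.standard_normal((dy, n))
    W = rng.standard_normal((p, d)) / np.sqrt(d)       # stands for the (sparse) first layer
    S = np.tanh(W @ X)                                 # sigma(W X), generically of rank n
    assert np.linalg.matrix_rank(S) == n               # full column rank

    U2 = Y @ np.linalg.pinv(S)                         # U'' = Y [sigma(W X)]^+
    assert np.allclose(U2 @ S, Y)                      # zero training loss
    print("loss at (U'', W) is", 0.5 * np.linalg.norm(U2 @ S - Y, "fro") ** 2)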

C.3 Proof of Theorem 5

Lemma 4 (Lemma 3 of Li et al. [22])

Suppose $\sigma(\cdot)$ is a non-constant analytic function. Then for any $\bm{a},\bm{b}\in\mathbb{R}^{d}$ with $\bm{a}\neq\bm{b}$, the set $\Omega=\{\bm{z}\in\mathbb{R}^{d}\,|\,\sigma(\bm{a}^{\top}\bm{z})=\sigma(\bm{b}^{\top}\bm{z})\}$ has zero measure.

Now we turn to the proof of Theorem 5.

Proof: We first consider the sparse network with a dense final layer. For deep sparse networks, we need to show that all hidden-layer outputs still satisfy Assumption 1. Denote the outputs of a given hidden layer on the $n$ training samples by $\bm{z}_{1},\dots,\bm{z}_{n}$, and define $\mathcal{S}_{ij}:=\{\bm{w}\,|\,\sigma(\bm{w}^{\top}\bm{z}_{i})^{2}=\sigma(\bm{w}^{\top}\bm{z}_{j})^{2},\bm{z}_{i}-\bm{z}_{j}\neq\bm{0}\}$. Since $\sigma^{2}(\cdot)$ is still real analytic, Lemma 4 shows that each $\mathcal{S}_{ij}$ has zero measure, and hence $\cup_{i\neq j}\mathcal{S}_{ij}$ has zero measure. Moreover, by Lemma 3, $\cup_{i=1}^{n}\{\bm{w}\,|\,\sigma(\bm{w}^{\top}\bm{z}_{i})=0,\bm{z}_{i}\neq\bm{0}\}$ has zero measure. Therefore, each hidden-layer output satisfies 1) in Assumption 1 with measure one. Combining this with the proof of Theorem 4, each hidden-layer output has full rank with measure one (by choosing a square submatrix). Therefore, the full-rank argument can be applied layer by layer up to the last hidden layer.

If there exists a spurious valley $\mathcal{T}$ with an interior point $\bm{\theta}$, then by the measure-one argument we can find a point $\bm{\theta}^{\prime}\in\mathcal{T}$ arbitrarily close to $\bm{\theta}$ at which the last hidden-layer output has full rank. Using the over-parameterized final hidden layer, we obtain a non-increasing path from $\bm{\theta}^{\prime}$ to zero loss by adjusting only the final-layer weight. This contradicts the assumption that $\mathcal{T}$ is a spurious valley.

Now, if the final-layer weight $\widetilde{U}$ is a sparse matrix, we can also rearrange the order to formulate the output as

\[\widetilde{U}\sigma\left(\begin{smallmatrix}W_{1}Z_{1}\\ \vdots\\ W_{s}Z_{s}\end{smallmatrix}\right)=\left(\begin{smallmatrix}U_{1}&\dots&\bm{0}\\ \vdots&\ddots&\vdots\\ \bm{0}&\dots&U_{r}\end{smallmatrix}\right)\left(\begin{smallmatrix}G_{1}\\ \vdots\\ G_{r}\end{smallmatrix}\right),\]

where the $G_{i}$s consist of some rows of the last hidden-layer output matrix, and the $U_{i}$s are dense matrices with no fewer than $n$ columns. Repeating the argument of Theorem 4, each $G_{i}$ has rank $n$ with measure one. Finally, the loss is convex in each $U_{i}$, so we can still construct a non-increasing path to zero loss by Lemma 2. \square
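For the sparse final layer, the block structure above can be exercised directly: once each $G_{i}$ has rank $n$, the block-wise choice $U_{i}=Y_{i}G_{i}^{+}$ gives zero loss. A minimal sketch (assuming NumPy; the two-block split and the $\tanh$ activation are arbitrary illustrative choices):

    import numpy as np

    rng = np.random.default_rng(7)
    n, dy1, dy2 = 6, 2, 3                             # sample size and output rows per block
    p1, p2, d = 9, 10, 4                              # hidden rows feeding each output block

    X = rng.standard_normal((d, n))
    Y = rng.standard_normal((dy1 + dy2, n))
    G1 = np.tanh(rng.standard_normal((p1, d)) @ X)    # rows of the last hidden-layer output
    G2 = np.tanh(rng.standard_normal((p2, d)) @ X)    # seen by the second output block
    assert np.linalg.matrix_rank(G1) == n and np.linalg.matrix_rank(G2) == n

    U1 = Y[:dy1] @ np.linalg.pinv(G1)                 # block-wise U_i = Y_i G_i^+
    U2 = Y[dy1:] @ np.linalg.pinv(G2)
    out = np.vstack([U1 @ G1, U2 @ G2])               # block-diagonal final layer output
    assert np.allclose(out, Y)
    print("block-diagonal sparse final layer reaches zero loss")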

References

  • Auer et al. [1996] Peter Auer, Mark Herbster, and Manfred K. Warmuth. Exponentially many local minima for single neurons. In Advances in Neural Information Processing Systems, pages 316–322, 1996.
  • Baldi and Hornik [1989] Pierre Baldi and Kurt Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural networks, 2(1):53–58, 1989.
  • Bianchini and Gori [1996] Monica Bianchini and Marco Gori. Optimal learning in artificial neural networks: A review of theoretical results. Neurocomputing, 13(2-4):313–346, 1996.
  • Carreira-Perpinán and Idelbayev [2018] Miguel A Carreira-Perpinán and Yerlan Idelbayev. “learning-compression” algorithms for neural net pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8532–8541, 2018.
  • Clevert et al. [2015] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 2015.
  • Dang [2015] Nguyen Viet Dang. Complex powers of analytic functions and meromorphic renormalization in QFT. arXiv preprint arXiv:1503.00995, 2015.
  • Ding et al. [2019] Tian Ding, Dawei Li, and Ruoyu Sun. Sub-optimal local minima exist for almost all over-parameterized neural networks. arXiv preprint arXiv:1911.01413, 2019.
  • Evci et al. [2019] Utku Evci, Fabian Pedregosa, Aidan Gomez, and Erich Elsen. The difficulty of training sparse neural networks. 2019.
  • Frankle and Carbin [2018] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2018.
  • Freeman and Bruna [2017] C Daniel Freeman and Joan Bruna. Topology and geometry of half-rectified network optimization. International Conference on Learning Representations, 2017.
  • Gale et al. [2019] Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574, 2019.
  • Goldblum et al. [2019] Micah Goldblum, Jonas Geiping, Avi Schwarzschild, Michael Moeller, and Tom Goldstein. Truth or backpropaganda? an empirical investigation of deep learning theory. In International Conference on Learning Representations, 2019.
  • Han et al. [2015] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
  • Han et al. [2016] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. International Conference on Learning Representations, 2016.
  • He et al. [2020] Fengxiang He, Bohan Wang, and Dacheng Tao. Piecewise linear activations substantially shape the loss surfaces of neural networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=B1x6BTEKwr.
  • He et al. [2017] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE international conference on computer vision, pages 1389–1397, 2017.
  • Hoefler et al. [2021] Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandra Peste. Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. arXiv preprint arXiv:2102.00554, 2021.
  • Kawaguchi [2016] Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.
  • Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Kunin et al. [2019] Daniel Kunin, Jonathan Bloom, Aleksandrina Goeva, and Cotton Seed. Loss landscapes of regularized linear autoencoders. In International Conference on Machine Learning, pages 3560–3569. PMLR, 2019.
  • Lee et al. [2018] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr. Snip: Single-shot network pruning based on connection sensitivity. In International Conference on Learning Representations, 2018.
  • Li et al. [2018] Dawei Li, Tian Ding, and Ruoyu Sun. On the benefit of width for neural networks: Disappearance of bad basins. arXiv preprint, 2018.
  • Liu et al. [2022] Shiwei Liu, Tianlong Chen, Xiaohan Chen, Li Shen, Decebal Constantin Mocanu, Zhangyang Wang, and Mykola Pechenizkiy. The unreasonable effectiveness of random pruning: Return of the most naive baseline for sparse training. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=VBZJ_3tz-t.
  • Liu et al. [2018] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. In International Conference on Learning Representations, 2018.
  • Louizos et al. [2017] Christos Louizos, Karen Ullrich, and Max Welling. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems, pages 3288–3298, 2017.
  • Louizos et al. [2018] Christos Louizos, Max Welling, and Diederik P Kingma. Learning sparse neural networks through $L_0$ regularization. In International Conference on Learning Representations, 2018.
  • Lu and Kawaguchi [2017] Haihao Lu and Kenji Kawaguchi. Depth creates no bad local minima. arXiv preprint arXiv:1702.08580, 2017.
  • Luo et al. [2017] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. Thinet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE international conference on computer vision, pages 5058–5066, 2017.
  • Malach et al. [2020] Eran Malach, Gilad Yehudai, Shai Shalev-Schwartz, and Ohad Shamir. Proving the lottery ticket hypothesis: Pruning is all you need. In International Conference on Machine Learning, pages 6682–6691. PMLR, 2020.
  • Mehta et al. [2021] Dhagash Mehta, Tianran Chen, Tingting Tang, and Jonathan Hauenstein. The loss surface of deep linear networks viewed through the algebraic geometry lens. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
  • Mishra et al. [2020] Rahul Mishra, Hari Prabhat Gupta, and Tanima Dutta. A survey on deep neural network compression: Challenges, overview, and solutions. arXiv preprint arXiv:2010.03954, 2020.
  • Mityagin [2015] Boris Mityagin. The zero set of a real analytic function. arXiv preprint arXiv:1512.07276, 2015.
  • Molchanov et al. [2017] Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. In International Conference on Machine Learning, pages 2498–2507. PMLR, 2017.
  • Morcos et al. [2019] Ari Morcos, Haonan Yu, Michela Paganini, and Yuandong Tian. One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers. Advances in Neural Information Processing Systems, 32:4932–4942, 2019.
  • Mozer and Smolensky [1989] Michael C Mozer and Paul Smolensky. Using relevance to reduce network size automatically. Connection Science, 1(1):3–16, 1989.
  • Nguyen [2019] Quynh Nguyen. On connected sublevel sets in deep learning. In International Conference on Machine Learning, pages 4790–4799. PMLR, 2019.
  • Nguyen et al. [2018] Quynh Nguyen, Mahesh Chandra Mukkamala, and Matthias Hein. On the loss landscape of a class of deep neural networks with no bad local valleys. In International Conference on Learning Representations, 2018.
  • Paszke et al. [2017] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
  • Pensia et al. [2020] Ankit Pensia, Shashank Rajput, Alliot Nagle, Harit Vishwakarma, and Dimitris Papailiopoulos. Optimal lottery tickets via subset sum: Logarithmic over-parameterization is sufficient. Advances in Neural Information Processing Systems, 33, 2020.
  • Safran and Shamir [2018] Itay Safran and Ohad Shamir. Spurious local minima are common in two-layer ReLU neural networks. In International Conference on Machine Learning, pages 4433–4441. PMLR, 2018.
  • Sun [2019] Ruoyu Sun. Optimization for deep learning: theory and algorithms. arXiv preprint arXiv:1912.08957, 2019.
  • Sun et al. [2020] Ruoyu Sun, Dawei Li, Shiyu Liang, Tian Ding, and Rayadurgam Srikant. The global landscape of neural networks: An overview. IEEE Signal Processing Magazine, 37(5):95–108, 2020.
  • Swirszcz et al. [2016] Grzegorz Swirszcz, Wojciech Marian Czarnecki, and Razvan Pascanu. Local minima in training of neural networks. arXiv preprint arXiv:1611.06310, 2016.
  • Venturi et al. [2019] Luca Venturi, Afonso S Bandeira, and Joan Bruna. Spurious valleys in one-hidden-layer neural network optimization landscapes. Journal of Machine Learning Research, 20(133):1–34, 2019.
  • Yu et al. [2018] Ruichi Yu, Ang Li, Chun-Fu Chen, Jui-Hsin Lai, Vlad I Morariu, Xintong Han, Mingfei Gao, Ching-Yung Lin, and Larry S Davis. Nisp: Pruning networks using neuron importance score propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9194–9203, 2018.
  • Yun et al. [2018] Chulhee Yun, Suvrit Sra, and Ali Jadbabaie. Small nonlinearities in activation functions create bad local minima in neural networks. In International Conference on Learning Representations, 2018.
  • Zhou and Liang [2018] Yi Zhou and Yingbin Liang. Critical points of linear neural networks: Analytical forms and landscape properties. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=SysEexbRb.
  • Zhu and Gupta [2017] Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878, 2017.