
Theoretical Analysis of the Advantage of
Deepening Neural Networks

Yasushi Esaki Faculty of Science and Engineering
Waseda University
Tokyo, Japan
[email protected]
   Yuta Nakahara Center for Data Science
Waseda University
Tokyo, Japan
[email protected]
   Toshiyasu Matsushima Faculty of Science and Engineering
Waseda University
Tokyo, Japan
[email protected]
Abstract

We propose two new criteria for understanding the advantage of deepening neural networks. It is important to know the expressivity of functions computable by deep neural networks in order to understand the advantage of deepening them. Unless deep neural networks have enough expressivity, they cannot have good performance even if learning is successful. In this situation, the proposed criteria contribute to understanding the advantage of deepening neural networks because they evaluate the expressivity independently of the efficiency of learning. The first criterion shows the approximation accuracy of deep neural networks to the target function. It is motivated by the fact that the goal of deep learning is to approximate the target function by deep neural networks. The second criterion shows a property of linear regions of functions computable by deep neural networks. It is motivated by the fact that deep neural networks whose activation functions are piecewise linear are themselves piecewise linear. Furthermore, using the two criteria, we show that increasing the number of layers is more effective than increasing the number of units in each layer for improving the expressivity of deep neural networks.

Index Terms:
deep learning theory, expressivity, approximation accuracy, linear region

I Introduction

In recent research on machine learning, deep neural networks have shown very good prediction performance. For example, some papers report that increasing the number of layers in neural networks improves the accuracy of image recognition, e.g., [1, 2], and others report that it improves the accuracy of speech recognition, e.g., [3, 4]. Neural networks are getting deeper with each passing year.

In this situation, many researchers are interested in the relation between the depth of neural networks and their performance. In order to show the advantage of deepening neural networks, the following two factors should be distinguished:

  1. the expressivity of deep neural networks as functions which express the relation between explanatory variables and objective variables;

  2. the efficiency of learning by the gradient method.

The prediction accuracy after learning depends on both the expressivity and the efficiency of learning. Unless deep neural networks have enough expressivity, they cannot have good performance even if learning is successful.

Therefore, there have been studies that evaluate deep neural networks while distinguishing between the expressivity and the efficiency of learning, e.g., [5, 6]. Some of these studies assume that data are scattered around the target function and investigate the approximation accuracy achieved when approximating the target function by deep neural networks. They discuss the distance between deep neural networks and the target function, and regard that the shorter that distance is, the more expressivity the deep neural networks have. For example, some studies consider how many layers and units are necessary for the distance between deep neural networks and the target function to be less than $\epsilon$, for any $\epsilon>0$ and any target function, e.g., [7, 8]. Other studies consider how the lower limit of that distance depends on the depth and the number of units, e.g., [9, 10].

On the other hand, there are other studies that evaluate deep neural networks while distinguishing between the expressivity and the efficiency of learning by investigating the properties of linear regions of functions computable by deep neural networks. Since deep neural networks whose activation functions are piecewise linear are themselves piecewise linear, investigating the properties of linear regions is a good way to understand the shape of the functions drawn by those deep neural networks. For example, some studies count the number of linear regions of deep neural networks [11, 12, 13]. They regard that the larger the number of linear regions is, the more expressivity the deep neural networks have. Other studies calculate the volume of the boundary between two linear regions of deep neural networks [14]. They regard that the larger that volume is, the more expressivity the deep neural networks have.

In this paper, we propose two criteria to evaluate the expressivity of deep neural networks independently of the efficiency of learning, in the spirit of the above studies. The first criterion evaluates the approximation accuracy of deep neural networks to the target function. It shows the ratio of the parameter values which make the distance between the deep neural network and the target function small. We regard that the larger this ratio is, the more expressivity the deep neural networks have. The novelty of this criterion is that we can figure out the volume of parameter values which give functions close to the target function. The proposed criterion shows not only the existence of appropriate parameter values but also how many of them exist.

The second criterion evaluates a property of linear regions of functions computable by deep neural networks whose activation functions are piecewise linear. It shows the size of the maximal linear region of deep neural networks. We regard that the smaller the size of the maximal linear region is, the more expressivity the deep neural networks have. The novelty of this criterion is that it figures out how fine the linear regions are in the least flexible area of the input domain. The proposed criterion shows whether linear regions are scattered finely and uniformly or not.

The above two criteria are related. A small size of the maximal linear region of deep neural networks is a requirement for a good approximation to the target function when the target function is smooth. This requirement can be understood by referring to Fig. 1. Unless the size of the maximal linear region of deep neural networks is small, there is a gap between the deep neural networks and the target function no matter how the parameter values are chosen.



Figure 1: The relation between the size of the maximal linear region of deep neural networks and the approximation of the target function by deep neural networks. If the size of the maximal linear region is large, a gap inevitably occurs between the deep neural network and the target function, as in the left panel. On the other hand, if the size of the maximal linear region is small, the target function can possibly be approximated by the deep neural network, as in the right panel.

Moreover, using the two criteria, we show that increasing the number of layers is more effective than increasing the number of units in each layer for improving the expressivity of neural networks. We show this by both theoretical proofs and computer simulations.

In summary, our contributions are as follows.

  1. We propose new criteria to evaluate the expressivity of deep neural networks independently of the efficiency of learning.

  2. We show, by the new criteria, that increasing the number of layers is more effective than increasing the number of units in each layer for improving the expressivity of neural networks.

II Definitions and Theoretical Facts

II-A Deep Neural Networks

We define a neural network as a composite function 𝑭𝜽:n0nL\bm{F}_{\bm{\theta}}:\mathbb{R}^{n_{0}}\to\mathbb{R}^{n_{L}} parametrized by 𝜽\bm{\theta}, which alternately computes linear functions 𝒇(l):nl1nl(l=1,,L)\bm{f}^{(l)}:\mathbb{R}^{n_{l-1}}\to\mathbb{R}^{n_{l}}\ (l=1,\cdots,L) and non-linear activation functions 𝒈(l):nlnl(l=1,,L1)\bm{g}^{(l)}:\mathbb{R}^{n_{l}}\to\mathbb{R}^{n_{l}}\ (l=1,\cdots,L-1) such as

𝑭𝜽=𝒇(L)𝒈(L1)𝒇(L1)𝒈(1)𝒇(1).\displaystyle\bm{F}_{\bm{\theta}}=\bm{f}^{(L)}\circ\bm{g}^{(L-1)}\circ\bm{f}^{(L-1)}\circ\cdots\circ\bm{g}^{(1)}\circ\bm{f}^{(1)}.

The linear function 𝒇(l)\bm{f}^{(l)} is given by

fj(l)(𝒙)=i=1nl1wji(l)xi+bj(l)(j=1,,nl),\displaystyle f^{(l)}_{j}(\bm{x})=\sum^{n_{l-1}}_{i=1}w^{(l)}_{ji}x_{i}+b^{(l)}_{j}\ (j=1,\cdots,n_{l}),

where $x_{i}$ is the $i$-th coordinate of $\bm{x}$ and $f^{(l)}_{j}$ is the $j$-th coordinate of $\bm{f}^{(l)}$. The parameter $\bm{\theta}$ is a vector which consists of $w^{(l)}_{ji}\in\mathbb{R}$ and $b^{(l)}_{j}\in\mathbb{R}$ for each $i\in\{1,2,\cdots,n_{l-1}\}$, $j\in\{1,2,\cdots,n_{l}\}$ and $l\in\{1,2,\cdots,L\}$. We define $\bm{\Theta}\subset\mathbb{R}^{M}$ as the set of $\bm{\theta}$, where $M$ is the number of parameters of the neural network. We call $\bm{g}^{(l)}\circ\bm{f}^{(l)}$ the $l$-th layer and call the $j$-th coordinate of $\bm{g}^{(l)}\circ\bm{f}^{(l)}$ the $j$-th unit of the $l$-th layer. We note that the 0-th layer, which is called the input layer, is not counted in the number of layers.

Deep neural networks are used for supervised learning. We can express the relation between explanatory variables and objective variables with deep neural networks if the values of 𝜽\bm{\theta} are set appropriately.
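For concreteness, the following is a minimal sketch of such a network in Python with the ReLU activation introduced in Section II-C1. The layer widths and the standard normal initialization below are illustrative assumptions only, not part of the definition.

  import numpy as np

  def relu(x):
      # Piecewise linear activation g(x) = max{0, x}
      return np.maximum(0.0, x)

  def init_params(widths, rng):
      # theta consists of (W^(l), b^(l)) for l = 1, ..., L; the entries are drawn
      # from a standard normal purely for illustration.
      return [(rng.standard_normal((widths[l + 1], widths[l])),
               rng.standard_normal(widths[l + 1]))
              for l in range(len(widths) - 1)]

  def forward(theta, x):
      # F_theta = f^(L) o g^(L-1) o f^(L-1) o ... o g^(1) o f^(1)
      h = x
      for l, (W, b) in enumerate(theta):
          h = W @ h + b                    # linear map f^(l)
          if l < len(theta) - 1:           # no activation after the last layer
              h = relu(h)                  # activation g^(l)
      return h

  rng = np.random.default_rng(0)
  theta = init_params([1, 4, 4, 1], rng)   # n_0 = 1, n_1 = n_2 = 4, n_3 = 1 (L = 3)
  print(forward(theta, np.array([0.3])))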

II-B The Ratio of the Desired Parameters

In this section, we assume a probability distribution for the objective variables in supervised learning and propose a new criterion to evaluate the expressivity of functions computable by deep neural networks under that assumption. For simplicity, we restrict the dimension of the objective variables to 1. Let the probability distribution of the objective variables be a normal distribution such that

y=F(𝒙)+ε(εN(0,σ2)),\displaystyle y=F^{*}(\bm{x})+\varepsilon\ \ (\varepsilon\sim N(0,\sigma^{2})), (1)

where $F^{*}:\mathbb{R}^{n_{0}}\to\mathbb{R}$ is a continuous function. The goal of deep learning is to express the relation between explanatory variables and objective variables by approximating $F^{*}$ with a deep neural network $F_{\bm{\theta}}$ in the set $\{F_{\bm{\theta}}\ |\ \bm{\theta}\in\bm{\Theta}\}$. Therefore, we call $F^{*}$ the target function in this paper.

When approximating the target function $F^{*}$ by deep neural networks $F_{\bm{\theta}}$, it is desirable for the set $\{F_{\bm{\theta}}\ |\ \bm{\theta}\in\bm{\Theta}\}$ to have many functions close to $F^{*}$. However, there have been few studies which investigate the volume of the functions close to $F^{*}$ in $\{F_{\bm{\theta}}\ |\ \bm{\theta}\in\bm{\Theta}\}$, although some studies prove that at least one function close to $F^{*}$ exists in $\{F_{\bm{\theta}}\ |\ \bm{\theta}\in\bm{\Theta}\}$, e.g., [7, 10]. Therefore, we define a new criterion to investigate it independently of the efficiency of learning as follows.

Definition 1.

Let $F_{\bm{\theta}}:\mathbb{R}^{n_{0}}\to\mathbb{R}$ be a deep neural network. Let $F^{*}:\mathbb{R}^{n_{0}}\to\mathbb{R}$ be the target function defined as (1). Moreover, we assume that the explanatory variables $\bm{x}\in\mathbb{R}^{n_{0}}$ are random variables with a probability density function $p(\bm{x})$. Then we define the ratio of the desired parameters between the set $\mathcal{F}:=\{F_{\bm{\theta}}\ |\ \bm{\theta}\in\bm{\Theta}\}$ and $F^{*}$ as

Rϵ(,F):=𝜽𝚯ϵ(,F)𝑑𝜽𝜽𝚯𝑑𝜽,\displaystyle R_{\epsilon}(\mathcal{F},F^{*}):=\frac{\int_{\bm{\theta}\in\bm{\Theta}_{\epsilon}(\mathcal{F},F^{*})}d\bm{\theta}}{\int_{\bm{\theta}\in\bm{\Theta}}d\bm{\theta}}, (2)

where

d(F𝜽,F)\displaystyle d(F_{\bm{\theta}},F^{*}) :=𝒙n0(F𝜽(𝒙)F(𝒙))2p(𝒙)𝑑𝒙,\displaystyle:=\int_{\bm{x}\in\mathbb{R}^{n_{0}}}(F_{\bm{\theta}}(\bm{x})-F^{*}(\bm{x}))^{2}p(\bm{x})d\bm{x}, (3)
𝚯ϵ(,F)\displaystyle\bm{\Theta}_{\epsilon}(\mathcal{F},F^{*}) :={𝜽𝚯|d(F𝜽,F)ϵ}.\displaystyle:=\{\bm{\theta}\in\bm{\Theta}\ |\ d(F_{\bm{\theta}},F^{*})\leq\epsilon\}.

We can say that d(F𝜽,F)d(F_{\bm{\theta}},F^{*}) represents the distance between the deep neural network F𝜽F_{\bm{\theta}} and the target function FF^{*}. Moreover, Rϵ(,F)R_{\epsilon}(\mathcal{F},F^{*}) calculates the ratio of the values of 𝜽\bm{\theta} which make that distance short. Therefore, we regard that the larger Rϵ(,F)R_{\epsilon}(\mathcal{F},F^{*}) is, the more functions close to the target function the set \mathcal{F} has.

The ratio of the desired parameters shows the volume of parameter values which give functions close to the target function. Therefore, we can know not only the existence of appropriate parameter values but also how many of them exist. Recalling learning with the gradient method, we can consider that the ratio of the desired parameters may also indicate the number of local solutions of the loss function.
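As an illustration of Definition 1, the following is a minimal Monte Carlo sketch of $R_{\epsilon}(\mathcal{F},F^{*})$. The prior over $\bm{\Theta}$ (standard normal), the input density $p(\bm{x})$ (uniform on $[0,1]$), the one-hidden-layer architecture with the activation $\max\{0,x\}$, and the target function are assumptions made only for this illustration; Section III describes the setting used in the actual simulations.

  import numpy as np

  # Monte Carlo sketch of R_epsilon (Definition 1) under the assumptions above.
  rng = np.random.default_rng(0)
  xs = np.linspace(0.0, 1.0, 1_000)              # grid approximating E_x[.] under p(x)
  F_star = np.sin(4 * np.pi * xs)                # an example target function F*

  def sample_network_outputs(width=4):
      # Draw theta for a one-hidden-layer network and return F_theta evaluated on xs.
      W1, b1 = rng.standard_normal(width), rng.standard_normal(width)
      W2, b2 = rng.standard_normal(width), rng.standard_normal()
      hidden = np.maximum(0.0, np.outer(xs, W1) + b1)
      return hidden @ W2 + b2

  def estimate_ratio(eps, n_samples=20_000):
      # Fraction of sampled theta with d(F_theta, F*) <= eps, cf. (2) and (3).
      hits = sum(np.mean((sample_network_outputs() - F_star) ** 2) <= eps
                 for _ in range(n_samples))
      return hits / n_samples

  print(estimate_ratio(eps=0.6))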

II-C Deep Neural Networks and Linear Regions

II-C1 ReLU Neural Networks

Several kinds of non-linear functions can be used as the non-linear activation function $\bm{g}^{(l)}$. One example is the ReLU function, defined as $g(x)=\max\{0,x\}$. ReLU neural networks have the following special property.

Lemma 1 ([10] ).

Let 𝒟\mathcal{D} be a bounded subset of n0\mathbb{R}^{n_{0}}. Let 𝑭𝜽:𝒟nL\bm{F}_{\bm{\theta}}:\mathcal{D}\to\mathbb{R}^{n_{L}} be a ReLU neural network with LL layers, MM parameters and SS units in total. Then there exists a neural network 𝑭~𝜽:𝒟nL\tilde{\bm{F}}_{\bm{\theta}}:\mathcal{D}\to\mathbb{R}^{n_{L}} with any continuous piecewise linear activation function, having LL layers, 4M4M parameters and 2S2S units in total, and that computes the same function as 𝑭𝜽\bm{F}_{\bm{\theta}}.

Lemma 1 points out that deep neural networks with any piecewise linear activation function can represent any ReLU neural network. In the proof in [10], two units of $\tilde{\bm{F}}_{\bm{\theta}}$ are assigned to each unit of $\bm{F}_{\bm{\theta}}$ when $\bm{F}_{\bm{\theta}}$ is replaced with $\tilde{\bm{F}}_{\bm{\theta}}$. Therefore, the number of units at the $l$-th layer, $n_{l}$, increases to $2n_{l}$. Lemma 1 is used in the proof of the theorem shown in Section II-C5.


Figure 2: The examples of linear regions of piecewise linear functions.

II-C2 Fineness of Linear Regions

The ReLU function g(x)=max{0,x}g(x)=\max\{0,x\} is a piecewise linear function whose inclination changes at the origin. If a neural network has piecewise linear activation functions like this, 𝒇(l)(l=1,2,,L)\bm{f}^{(l)}(l=1,2,\cdots,L) are linear functions and 𝒈(l)(l=1,2,,L1)\bm{g}^{(l)}(l=1,2,\cdots,L-1) are piecewise linear functions. Therefore, a neural network whose activation functions are piecewise linear is a piecewise linear function. Based on this fact, we can introduce a linear region to consider the flexibility of deep neural networks with piecewise linear activation functions. It is defined as follows.

Definition 2 ([11] ).

Let 𝒟\mathcal{D} be a bounded subset of n\mathbb{R}^{n}. Let 𝑭:𝒟m\bm{F}:\mathcal{D}\to\mathbb{R}^{m} be a piecewise linear function. Let UU be a nn-dimensional interval in 𝒟\mathcal{D}. Let 𝑭|U:Um\bm{F}|_{U}:U\to\mathbb{R}^{m} be the function given by restricting the input domain of 𝑭\bm{F} from 𝒟\mathcal{D} to UU. We say that UU is a linear region of 𝑭\bm{F} if the followings hold.

  • 𝑭|U\bm{F}|_{U} is linear.

  • For any subset V𝒟V\subset\mathcal{D} such that VUV\supsetneq U, 𝑭|V\bm{F}|_{V} is not linear.

Example 1.

For example, let $F:[0,1]\to\mathbb{R}$ be the function in Fig. 2. We can say that $[0,\frac{1}{4}]$, $[\frac{1}{4},\frac{2}{3}]$ and $[\frac{2}{3},1]$ are linear regions of $F$. In contrast, $[\frac{1}{3},\frac{1}{2}]$ is not a linear region: although $F$ is linear on it, it is strictly contained in $[\frac{1}{4},\frac{2}{3}]$, on which $F$ is also linear, so the second condition of Definition 2 fails.


Figure 3: The problem of evaluating the flexibility of deep neural networks by the number of linear regions. We cannot distinguish between F1F_{1} and F2F_{2} when we use the number of linear regions. Therefore, we propose a new criterion in this paper.

In this paper, the set of linear regions of 𝑭\bm{F} is represented by L(𝑭)L(\bm{F}). We note that

UL(𝑭)U=𝒟,UL(𝑭)𝒙U𝑑𝒙=𝒙𝒟𝑑𝒙\displaystyle\bigcup_{U\in L(\bm{F})}U=\mathcal{D},\ \ \ \ \ \sum_{U\in L(\bm{F})}\int_{\bm{x}\in U}d\bm{x}=\int_{\bm{x}\in\mathcal{D}}d\bm{x} (4)

when 𝑭:𝒟m\bm{F}:\mathcal{D}\to\mathbb{R}^{m} is a piecewise linear function.

There have been some studies that evaluate the flexibility of deep neural networks by the number of linear regions [12, 11, 13]. However, there is room for improvement in evaluating the flexibility by the number of linear regions when we want to know whether linear regions are scattered uniformly. The number of linear regions cannot ensure that many linear regions do not gather in a certain area of the input domain. This problem is illustrated in Fig. 3. $F_{2}$ does not keep its flexibility in the left-side area. However, $F_{1}$ and $F_{2}$ are regarded as having similar flexibility when evaluated by the number of linear regions. The same problem also arises when the flexibility is evaluated by the volume of the boundary between two linear regions, as proposed in [14]. Therefore, we propose a new criterion to evaluate the flexibility of deep neural networks. It is defined as follows.

Definition 3.

Let 𝒟\mathcal{D} be a bounded subset of n\mathbb{R}^{n}. We define the fineness of linear regions of a piecewise linear function 𝑭:𝒟m\bm{F}:\mathcal{D}\to\mathbb{R}^{m} such as

I(𝑭):=maxUL(𝑭)𝒙U𝑑𝒙𝒙𝒟𝑑𝒙.\displaystyle I(\bm{F}):=\max_{U\in L(\bm{F})}\frac{\int_{\bm{x}\in U}d\bm{x}}{\int_{\bm{x}\in\mathcal{D}}d\bm{x}}. (5)
Example 2.

We consider the example function F:[0,1]F:[0,1]\to\mathbb{R} in Fig. 2. In this example, the fineness of linear regions I(F)I(F) is computed as follows.

I(F)\displaystyle I(F) =maxU{[0,14],[14,23],[23,1]}xU𝑑xx[0,1]𝑑x=1423𝑑x=512.\displaystyle=\max_{U\in\left\{\left[0,\frac{1}{4}\right],\left[\frac{1}{4},\frac{2}{3}\right],\left[\frac{2}{3},1\right]\right\}}\frac{\int_{x\in U}dx}{\int_{x\in[0,1]}dx}=\int^{\frac{2}{3}}_{\frac{1}{4}}dx=\frac{5}{12}.

Note that $0\leq I(\bm{F})\leq 1$. We regard that the smaller the value of $I(\bm{F})$ is, the finer the linear regions of $\bm{F}$ are. The fineness of linear regions shows the size of the maximal linear region of a piecewise linear function, that is, how fine the linear regions are in the least flexible area of the input domain. Therefore, by this criterion, we can know whether the linear regions are scattered finely and uniformly over the input domain or not, which the number of linear regions cannot tell us. In the case of neural networks $\bm{F}_{\bm{\theta}}$ whose activation functions are piecewise linear, the fineness of linear regions $I(\bm{F}_{\bm{\theta}})$ changes when the values of $\bm{\theta}$ change. Therefore, we regard that the smaller the value of $\min_{\bm{\theta}\in\bm{\Theta}}I(\bm{F}_{\bm{\theta}})$ is, the more flexibility the set $\{\bm{F}_{\bm{\theta}}\ |\ \bm{\theta}\in\bm{\Theta}\}$ has.
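For one-dimensional piecewise linear functions, $I(\bm{F})$ is easy to compute once the boundaries of the linear regions are known. The following is a minimal sketch using the breakpoints of the function in Fig. 2.

  import numpy as np

  def fineness(breakpoints):
      # I(F) = (size of the maximal linear region) / (size of the input domain),
      # for a one-dimensional piecewise linear F described by its breakpoints.
      b = np.asarray(breakpoints, dtype=float)
      region_sizes = np.diff(b)                 # lengths of the linear regions
      return region_sizes.max() / (b[-1] - b[0])

  print(fineness([0.0, 1/4, 2/3, 1.0]))         # 5/12, as in Example 2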

For the readers who are interested in the relation between the number of linear regions and the fineness of linear regions, we show the following fact.

Corollary 1.

Let 𝒟\mathcal{D} be a bounded subset of n\mathbb{R}^{n}. Let 𝑭:𝒟m\bm{F}:\mathcal{D}\to\mathbb{R}^{m} be a piecewise linear function. The fineness of linear regions I(𝑭)I(\bm{F}) and the number of linear regions |L(𝑭)||L(\bm{F})| satisfy

|L(𝑭)|1I(𝑭).\displaystyle|L(\bm{F})|\geq\frac{1}{I(\bm{F})}.
Proof of Corollary 1.

From (4),

𝒙𝒟𝑑𝒙\displaystyle\int_{\bm{x}\in\mathcal{D}}d\bm{x} =UL(𝑭)𝒙U𝑑𝒙\displaystyle=\sum_{U\in L(\bm{F})}\int_{\bm{x}\in U}d\bm{x}
|L(𝑭)|×maxUL(𝑭)𝒙U𝑑𝒙.\displaystyle\leq|L(\bm{F})|\times\max_{U\in L(\bm{F})}\int_{\bm{x}\in U}d\bm{x}.

Therefore,

|L(𝑭)|𝒙𝒟𝑑𝒙maxUL(𝑭)𝒙U𝑑𝒙=1I(𝑭)((5)).\displaystyle|L(\bm{F})|\geq\frac{\int_{\bm{x}\in\mathcal{D}}d\bm{x}}{\max_{U\in L(\bm{F})}\int_{\bm{x}\in U}d\bm{x}}=\frac{1}{I(\bm{F})}\ \ \ \ (\because\ (\ref{eqif})).
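As a quick check with the function of Fig. 2, $|L(F)|=3$ and $1/I(F)=\frac{12}{5}=2.4$, so the bound of Corollary 1 indeed holds.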

II-C3 The Partial Order of Piecewise Linear Functions

Next, we define a partial order between two piecewise linear functions as follows.

Definition 4.

Let 𝒟\mathcal{D} be a bounded subset of n\mathbb{R}^{n}. Let 𝑭,𝑮:𝒟m\bm{F},\bm{G}:\mathcal{D}\to\mathbb{R}^{m} be two piecewise linear functions. We say that 𝑮r𝑭\bm{G}\preceq_{r}\bm{F}, where 0r10\leq r\leq 1, if the following statements hold.

  • For any linear region UL(𝑭)U\in L(\bm{F}), there exist s(s1)s\in\mathbb{N}\ (s\geq 1) and some linear regions ViL(𝑮)(i=1,2,,s)V_{i}\in L(\bm{G})\ (i=1,2,\cdots,s) such that

    U=i=1sVi.\displaystyle U=\bigcup^{s}_{i=1}V_{i}. (6)
  • For any linear region UL(𝑭)U\in L(\bm{F}) and V{VL(𝑮)|VU}V\in\{V^{\prime}\in L(\bm{G})\ |\ V^{\prime}\subset U\},

    𝒙V𝑑𝒙r𝒙U𝑑𝒙.\displaystyle\int_{\bm{x}\in V}d\bm{x}\leq r\int_{\bm{x}\in U}d\bm{x}. (7)
Example 3.

For example, let $F,G:[0,1]\to\mathbb{R}$ be the functions in Fig. 2. We can say that $G\preceq_{\frac{5}{9}}F$: the largest linear region of $G$ contained in a linear region of $F$ has length $\frac{1}{4}-\frac{1}{9}$ and lies in $[0,\frac{1}{4}]$, so $r$ is computed as

r=1419140=59.\displaystyle r=\frac{\frac{1}{4}-\frac{1}{9}}{\frac{1}{4}-0}=\frac{5}{9}.

Equation (6) states that the linear regions of $\bm{G}$ refine those of $\bm{F}$, and (7) states that each linear region of $\bm{G}$ contained in a linear region of $\bm{F}$ has at most $r$ times its size. In fact, we can say the following.

Lemma 2.

Let 𝒟\mathcal{D} be a bounded subset of n\mathbb{R}^{n}. Let 𝑭,𝑮:𝒟m\bm{F},\bm{G}:\mathcal{D}\to\mathbb{R}^{m} be two piecewise linear functions. We can say that

𝑮r𝑭I(𝑮)rI(𝑭),\displaystyle\bm{G}\preceq_{r}\bm{F}\ \Rightarrow\ I(\bm{G})\leq rI(\bm{F}),

where 0r10\leq r\leq 1.

Proof of Lemma 2.
I(𝑮)\displaystyle I(\bm{G}) =maxVL(𝑮)𝒙V𝑑𝒙𝒙𝒟𝑑𝒙((5))\displaystyle=\max_{V\in L(\bm{G})}\frac{\int_{\bm{x}\in V}d\bm{x}}{\int_{\bm{x}\in\mathcal{D}}d\bm{x}}\ \ \ \ (\because(\ref{eqif}))
=maxUL(𝑭)maxV{VL(𝑮)|VU}𝒙V𝑑𝒙𝒙𝒟𝑑𝒙((6))\displaystyle=\max_{U\in L(\bm{F})}\max_{V\in\{V^{\prime}\in L(\bm{G})\ |\ V^{\prime}\subset U\}}\frac{\int_{\bm{x}\in V}d\bm{x}}{\int_{\bm{x}\in\mathcal{D}}d\bm{x}}\ \ \ \ (\because(\ref{equ}))
maxUL(𝑭)r𝒙U𝑑𝒙𝒙𝒟𝑑𝒙((7))\displaystyle\leq\max_{U\in L(\bm{F})}r\frac{\int_{\bm{x}\in U}d\bm{x}}{\int_{\bm{x}\in\mathcal{D}}d\bm{x}}\ \ \ \ (\because(\ref{eqr}))
=rI(𝑭)((5)).\displaystyle=rI(\bm{F})\ \ \ \ (\because(\ref{eqif})).

Lemma 2 will be used for the proof of the theorem shown in Section II-C5.

II-C4 Linear Regions and Identification of Input Subsets

Montufar et al. [11] define the identification of input subsets in their paper. It is defined as follows.

Definition 5 ([11] ).

We say that a map $\bm{g}:\mathbb{R}^{n}\to\mathbb{R}^{m}$ identifies $K(\in\mathbb{N})$ subsets $U_{1},\cdots,U_{K}(\subset\mathbb{R}^{n})$ onto $V(\subset\mathbb{R}^{m})$ if it maps them to a common image $V=\bm{g}(U_{1})=\cdots=\bm{g}(U_{K})$ in $\mathbb{R}^{m}$. We also say that $U_{1},\cdots,U_{K}$ are identified onto $V$ by $\bm{g}$.

Example 4.

For example, the four quadrants of the two-dimensional Euclidean space are identified onto $[0,\infty)^{2}$ by the function $\bm{g}:\mathbb{R}^{2}\to[0,\infty)^{2}$ defined by $\bm{g}(x_{1},x_{2})=(|x_{1}|,|x_{2}|)^{T}$.
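A tiny numerical check of this example is sketched below: a grid placed in each of the four (closed) quadrants is mapped by $\bm{g}$ to one common image grid in $[0,1]^{2}$.

  import numpy as np

  # g(x1, x2) = (|x1|, |x2|) sends a grid in each quadrant to the same image grid.
  grid = np.stack(np.meshgrid(np.linspace(0, 1, 5), np.linspace(0, 1, 5)), axis=-1)
  images = {(s1, s2): np.abs(grid * np.array([s1, s2]))
            for s1 in (1, -1) for s2 in (1, -1)}
  base = images[(1, 1)]
  print(all(np.allclose(img, base) for img in images.values()))   # True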

If U1,,UK(n)U_{1},\cdots,U_{K}(\subset\mathbb{R}^{n}) are all of the subsets identified onto V(m)V(\subset\mathbb{R}^{m}) by 𝒈:nm\bm{g}:\mathbb{R}^{n}\to\mathbb{R}^{m}, we can say that

𝒈1(V)=k=1KUk.\displaystyle\bm{g}^{-1}(V)=\bigcup^{K}_{k=1}U_{k}. (8)

Montufar et al. [11] interpret (8) as meaning that the common subset $V$ is replicated to $U_{1},\cdots,U_{K}$. We can prove the following lemma related to Definition 5.

Lemma 3.

Let 𝒟1\mathcal{D}_{1} be a bounded subset of n\mathbb{R}^{n}. Let 𝒟2\mathcal{D}_{2} be a bounded subset of n\mathbb{R}^{n^{\prime}}. Let 𝒈:𝒟1𝒟2\bm{g}:\mathcal{D}_{1}\to\mathcal{D}_{2} be a piecewise linear function identifying all its linear regions onto 𝒟2\mathcal{D}_{2}. Let 𝒇:𝒟2m\bm{f}:\mathcal{D}_{2}\to\mathbb{R}^{m} be a piecewise linear function. The following holds.

𝒇𝒈I(𝒇)𝒈.\displaystyle\bm{f}\circ\bm{g}\preceq_{I(\bm{f})}\bm{g}.
Proof of Lemma 3.

For any linear region UL(𝒈)U\in L(\bm{g}) and VL(𝒇)V\in L(\bm{f})𝒈|U1(V)\bm{g}|^{-1}_{U}(V) is a linear region of 𝒇𝒈\bm{f}\circ\bm{g} and

U\displaystyle U =𝒈|U1(𝒟2)(𝒈(U)=𝒟2)\displaystyle=\bm{g}|_{U}^{-1}(\mathcal{D}_{2})\ \ \ \ (\because\bm{g}(U)=\mathcal{D}_{2})
=𝒈|U1(VL(𝒇)V)((4))\displaystyle=\bm{g}|_{U}^{-1}\left(\bigcup_{V\in L(\bm{f})}V\right)\ \ \ \ (\because(\ref{eqpiece}))
=VL(𝒇)𝒈|U1(V).\displaystyle=\bigcup_{V\in L(\bm{f})}\bm{g}|_{U}^{-1}(V).

Since 𝒈|U1\bm{g}|_{U}^{-1} is linear, the jacobian of 𝒈|U1\bm{g}|_{U}^{-1} satisfies |J𝒈|U1|0\left|J_{\bm{g}|_{U}^{-1}}\right|\neq 0. Therefore, from the property of integration by substitution with 𝒙=𝒈|U1(𝒚)\bm{x}=\bm{g}|_{U}^{-1}(\bm{y}),

𝒙U𝑑𝒙\displaystyle\int_{\bm{x}\in U}d\bm{x} =𝒙𝒈|U1(𝒟2)𝑑𝒙=|J𝒈|U1|𝒚𝒟2𝑑𝒚.\displaystyle=\int_{\bm{x}\in\bm{g}|_{U}^{-1}(\mathcal{D}_{2})}d\bm{x}=\left|J_{\bm{g}|_{U}^{-1}}\right|\int_{\bm{y}\in\mathcal{D}_{2}}d\bm{y}. (9)

From (9),

|J𝒈|U1|=𝒙U𝑑𝒙𝒚𝒟2𝑑𝒚.\displaystyle\left|J_{\bm{g}|_{U}^{-1}}\right|=\frac{\int_{\bm{x}\in U}d\bm{x}}{\int_{\bm{y}\in\mathcal{D}_{2}}d\bm{y}}. (10)

Moreover,

𝒙𝒈|U1(V)𝑑𝒙\displaystyle\int_{\bm{x}\in\bm{g}|_{U}^{-1}(V)}d\bm{x} =|J𝒈|U1|𝒚V𝑑𝒚\displaystyle=\left|J_{\bm{g}|_{U}^{-1}}\right|\int_{\bm{y}\in V}d\bm{y}
=𝒙U𝑑𝒙𝒚𝒟2𝑑𝒚𝒚Vd𝒚((10))\displaystyle=\frac{\int_{\bm{x}\in U}d\bm{x}}{\int_{\bm{y}\in\mathcal{D}_{2}}d\bm{y}}\int_{\bm{y}\in V}d\bm{y}\ \ \ \ (\because\mbox{(\ref{eqj})})
I(𝒇)𝒙Ud𝒙((5)).\displaystyle\leq I(\bm{f})\int_{\bm{x}\in U}d\bm{x}\ \ \ \ (\because(\ref{eqif})).

II-C5 The Relation Between the Depth of Neural Networks and the Fineness of Linear Regions

We can show that increasing the number of layers is more effective than increasing the number of units in each layer for improving the expressivity of neural networks by proving the following theorem.

Theorem 1.

Let 𝑭~𝜽:[0,1]n0nL\tilde{\bm{F}}_{\bm{\theta}}:[0,1]^{n_{0}}\to\mathbb{R}^{n_{L}} be a neural network with any piecewise linear activation function, having LL layers and nln_{l} units at the ll-th layer, where nl2n0n_{l}\geq 2n_{0} for any l{1,2,,L1}l\in\{1,2,\cdots,L-1\}. We can say that

𝜽𝚯s.t.I(𝑭~𝜽)l=1L1nl2n0n0.{}^{\exists}\bm{\theta}\in\bm{\Theta}\ \ \mathrm{s.t.}\ \ I(\tilde{\bm{F}}_{\bm{\theta}})\leq\prod^{L-1}_{l=1}\left\lfloor\frac{n_{l}}{2n_{0}}\right\rfloor^{-n_{0}}. (11)

If nl=n(2n0)n_{l}=n(\geq 2n_{0}) for any l{1,2,,L1}l\in\{1,2,\cdots,L-1\}, we can say that

𝜽𝚯s.t.I(𝑭~𝜽)O((n2n0)n0(L1)){}^{\exists}\bm{\theta}\in\bm{\Theta}\ \ \mathrm{s.t.}\ \ I(\tilde{\bm{F}}_{\bm{\theta}})\leq O\left(\left(\frac{n}{2n_{0}}\right)^{-n_{0}(L-1)}\right)

from Theorem 1. In other words, $\tilde{\bm{F}}_{\bm{\theta}}$ can express functions whose fineness of linear regions is $O((\frac{n}{2n_{0}})^{-n_{0}(L-1)})$ if $\bm{\theta}$ has appropriate values. For example, with $n_{0}=1$, $n=4$ and $L=6$, the bound in (11) is $\lfloor 4/2\rfloor^{-5}=2^{-5}=0.03125$, whereas for a two-layer network with $n=20$ units the corresponding bound is only $\lfloor 20/2\rfloor^{-1}=0.1$. Therefore, we consider that the flexibility of the set of deep neural networks $\{\tilde{\bm{F}}_{\bm{\theta}}\ |\ \bm{\theta}\in\bm{\Theta}\}$ grows exponentially in $L$ and polynomially in $n$. We use the following lemma in order to prove Theorem 1.

Lemma 4.

Let 𝑭𝜽:[0,1]n0nL\bm{F}_{\bm{\theta}}:[0,1]^{n_{0}}\to\mathbb{R}^{n_{L}} be a ReLU neural network with LL layers and nln_{l} units at the ll-th layer, where nln0n_{l}\geq n_{0} for any l{1,2,,L1}l\in\{1,2,\cdots,L-1\}. We can say that

𝜽𝚯s.t.I(𝑭𝜽)l=1L1nln0n0.{}^{\exists}{\bm{\theta}}\in\bm{\Theta}\ \ \mathrm{s.t.}\ \ I(\bm{F}_{\bm{\theta}})\leq\prod^{L-1}_{l=1}\left\lfloor\frac{n_{l}}{n_{0}}\right\rfloor^{-n_{0}}.
Proof of Theorem 1.

Let 𝑭𝜽:[0,1]n0nL\bm{F}_{\bm{\theta}}:[0,1]^{n_{0}}\to\mathbb{R}^{n_{L}} be a ReLU neural network with LL layers and nl2(nl2n0)\frac{n_{l}}{2}(n_{l}\geq 2n_{0}) units at the ll-th layer for any l{1,2,,L1}l\in\{1,2,\cdots,L-1\}. From Lemma 4, there exists 𝜽^𝚯\hat{\bm{\theta}}\in\bm{\Theta} such that

I(𝑭𝜽^)l=1L1nl2n0n0.\displaystyle I(\bm{F}_{\hat{\bm{\theta}}})\leq\prod^{L-1}_{l=1}\left\lfloor\frac{n_{l}}{2n_{0}}\right\rfloor^{-n_{0}}.

Moreover, from Lemma 1, there exists a neural network 𝑭~𝜽:[0,1]n0nL\tilde{\bm{F}}_{\bm{\theta}}:[0,1]^{n_{0}}\to\mathbb{R}^{n_{L}} with any piecewise linear activation function, having LL layers and nln_{l} units at the ll-th layer, and that computes the same function as 𝑭𝜽^\bm{F}_{\hat{\bm{\theta}}}. It satisfies

I(𝑭~𝜽)l=1L1nl2n0n0.\displaystyle I(\tilde{\bm{F}}_{\bm{\theta}})\leq\prod^{L-1}_{l=1}\left\lfloor\frac{n_{l}}{2n_{0}}\right\rfloor^{-n_{0}}.

Now, we prove Lemma 4.

Proof of Lemma 4.

We prove Lemma 4 by two steps as follows.

Step 1.

Construct a specific ReLU neural network $\bm{F}_{\hat{\bm{\theta}}}$ which recursively calculates the functions with common parameters.

Step 2.

Show that the fineness of linear regions of the network $\bm{F}_{\hat{\bm{\theta}}}$ satisfies $I(\bm{F}_{\hat{\bm{\theta}}})\leq\prod^{L-1}_{l=1}\lfloor\frac{n_{l}}{n_{0}}\rfloor^{-n_{0}}$ by applying Lemma 2 and Lemma 3 inductively.

Step 1. Let an integer nn\in\mathbb{N} satisfy nn0n\geq n_{0} and p:=nn0p:=\lfloor\frac{n}{n_{0}}\rfloor. We define a pn0pn_{0}-dimensional function 𝒇~:[0,1]n0pn0\tilde{\bm{f}}:[0,1]^{n_{0}}\to\mathbb{R}^{pn_{0}} as

f~(i1)n0+j(𝒙):={pxj(i=1)2pxj2(i1)(i=2,,p)\displaystyle\tilde{f}_{(i-1)n_{0}+j}(\bm{x}):=\left\{\begin{array}[]{ll}px_{j}&(i=1)\\ 2px_{j}-2(i-1)&(i=2,\cdots,p)\end{array}\right. (14)
(j=1,2,,n0),\displaystyle\hskip 142.26378pt(j=1,2,\cdots,n_{0}),

where f~(i1)n0+j\tilde{f}_{(i-1)n_{0}+j} is the ((i1)n0+j)((i-1)n_{0}+j)-th coordinate of 𝒇~\tilde{\bm{f}} and xjx_{j} is the jj-th coordinate of 𝒙\bm{x}. We define a n0n_{0}-dimensional function 𝒇~~:pn0[0,1]n0\tilde{\tilde{\bm{f}}}:\mathbb{R}^{pn_{0}}\to[0,1]^{n_{0}} as

f~~j(𝒙):=i=1p(1)i1x(i1)n0+j(j=1,2,,n0),\displaystyle\tilde{\tilde{f}}_{j}(\bm{x}):=\sum^{p}_{i=1}(-1)^{i-1}x_{(i-1)n_{0}+j}\ (j=1,2,\cdots,n_{0}),

where f~~j\tilde{\tilde{f}}_{j} is the jj-th coordinate of 𝒇~~\tilde{\tilde{\bm{f}}}. Furthermore, we define a nLn_{L}-dimensional function 𝒇~~~:[0,1]n0nL\tilde{\tilde{\tilde{\bm{f}}}}:[0,1]^{n_{0}}\to\mathbb{R}^{n_{L}} as

f~~~j(𝒙):=i=1n0wjixi+bj(wji>0;j=1,2,,nL),\displaystyle\tilde{\tilde{\tilde{f}}}_{j}(\bm{x}):=\sum^{n_{0}}_{i=1}w_{ji}x_{i}+b_{j}\ (w_{ji}>0;\ j=1,2,\cdots,n_{L}),

where f~~~j\tilde{\tilde{\tilde{f}}}_{j} is the jj-th coordinate of 𝒇~~~\tilde{\tilde{\tilde{\bm{f}}}}. We construct the neural network using these functions as follows.

At first, let the number of units at the ll-th layer be equal to n(n0)n(\geq n_{0}) for any l{1,2,,L1}l\in\{1,2,\cdots,L-1\}. We separate the set of nn units at the ll-th layer, (l=1,2,,L1)(l=1,2,\cdots,L-1), into p(=nn0)p(=\lfloor\frac{n}{n_{0}}\rfloor) subsets and remainder units. Each of the pp subsets contains n0n_{0} units. Let the values of the parameters of the remainder npn0n-pn_{0} units be 0. For any l{1,2,,L1}l\in\{1,2,\cdots,L-1\}, let the non-linear activation function 𝒈(l):pn0pn0\bm{g}^{(l)}:\mathbb{R}^{pn_{0}}\to\mathbb{R}^{pn_{0}} be

gj(l)(𝒙)=max{0,xj}(j=1,2,,pn0),\displaystyle g^{(l)}_{j}(\bm{x})=\max\{0,x_{j}\}\ (j=1,2,\cdots,pn_{0}),

where gj(l)g^{(l)}_{j} is the jj-th coordinate of 𝒈(l)\bm{g}^{(l)}.

Next, we consider the linear functions 𝒇(l)(l=1,2,,L)\bm{f}^{(l)}(l=1,2,\cdots,L). When l=1l=1, for the jj-th unit of the ii-th subset, (j=1,2,,n0j=1,2,\dots,n_{0}, i=1,2,,pi=1,2,\dots,p), let the linear function f(i1)n0+j(1):[0,1]n0f^{(1)}_{(i-1)n_{0}+j}:[0,1]^{n_{0}}\to\mathbb{R} be

f(i1)n0+j(1)=f~(i1)n0+j.\displaystyle f^{(1)}_{(i-1)n_{0}+j}=\tilde{f}_{(i-1)n_{0}+j}.

When l=2,3,,L1l=2,3,\dots,L-1, for the jj-th unit of the ii-th subset, (j=1,2,,n0j=1,2,\dots,n_{0}, i=1,2,,pi=1,2,\dots,p), let the linear function f(i1)n0+j(l):pn0f^{(l)}_{(i-1)n_{0}+j}:\mathbb{R}^{pn_{0}}\to\mathbb{R} be

f(i1)n0+j(l)=f~(i1)n0+j𝒇~~.\displaystyle f_{(i-1)n_{0}+j}^{(l)}=\tilde{f}_{(i-1)n_{0}+j}\circ\tilde{\tilde{\bm{f}}}.

When l=Ll=L, let the linear function 𝒇(L):pn0nL\bm{f}^{(L)}:\mathbb{R}^{pn_{0}}\to\mathbb{R}^{n_{L}} be

𝒇(L)=𝒇~~~𝒇~~.\displaystyle\bm{f}^{(L)}=\tilde{\tilde{\tilde{\bm{f}}}}\circ\tilde{\tilde{\bm{f}}}.

Then the constructed network is represented as follows. Since 𝒈(1),𝒈(2),,𝒈(L1)\bm{g}^{(1)},\bm{g}^{(2)},\cdots,\bm{g}^{(L-1)} are common, we express them as 𝒈\bm{g}.

𝑭𝜽^\displaystyle\bm{F}_{\hat{\bm{\theta}}} =𝒇(L)𝒈(L1)𝒇(L1)𝒈(1)𝒇(1)\displaystyle=\bm{f}^{(L)}\circ\bm{g}^{(L-1)}\circ\bm{f}^{(L-1)}\circ\cdots\circ\bm{g}^{(1)}\circ\bm{f}^{(1)}
=𝒇~~~𝒇~~𝒈𝒇~𝒇~~𝒈𝒇~𝒇~~𝒈𝒇~.\displaystyle=\tilde{\tilde{\tilde{\bm{f}}}}\circ\tilde{\tilde{\bm{f}}}\circ\bm{g}\circ\tilde{\bm{f}}\circ\cdots\circ\tilde{\tilde{\bm{f}}}\circ\bm{g}\circ\tilde{\bm{f}}\circ\tilde{\tilde{\bm{f}}}\circ\bm{g}\circ\tilde{\bm{f}}.

Refer to caption

Figure 4: The graph of $h_{j}\ (j=1,\cdots,n_{0})$, where $p$ is an even number.

Step 2. Let 𝒉:=𝒇~~𝒈𝒇~\bm{h}:=\tilde{\tilde{\bm{f}}}\circ\bm{g}\circ\tilde{\bm{f}}. We can re-represent our network as

𝑭𝜽^=𝒇~~~𝒉𝒉𝒉L1.\displaystyle\bm{F}_{\hat{\bm{\theta}}}=\tilde{\tilde{\tilde{\bm{f}}}}\circ\underbrace{\bm{h}\circ\bm{h}\circ\cdots\circ\bm{h}}_{L-1}.

We consider the fineness of linear regions I(𝒇~~~)I(\tilde{\tilde{\tilde{\bm{f}}}}) and I(𝒉)I(\bm{h}) as well as the identified regions of 𝒉\bm{h} in order to apply Lemma 2 and Lemma 3. Since 𝒇~~~\tilde{\tilde{\tilde{\bm{f}}}} is linear on [0,1]n0[0,1]^{n_{0}},

I(𝒇~~~)=1.\displaystyle I(\tilde{\tilde{\tilde{\bm{f}}}})=1.

On the other hand, the jj-th coordinate of 𝒉:[0,1]n0[0,1]n0\bm{h}:[0,1]^{n_{0}}\to[0,1]^{n_{0}} is described such as

hj(𝒙)\displaystyle h_{j}(\bm{x}) =max{0,pxj}max{0, 2pxj2}+\displaystyle=\max\left\{0,\ px_{j}\right\}-\max\left\{0,\ 2px_{j}-2\right\}+\cdots
+(1)p1max{0, 2pxj2(p1)}\displaystyle\hskip 28.45274pt\cdots+(-1)^{p-1}\max\left\{0,\ 2px_{j}-2(p-1)\right\}
(j=1,2,,n0).\displaystyle\hskip 113.81102pt(j=1,2,\cdots,n_{0}).

Figure 4 shows the graph of hjh_{j}. From Fig. 4, we can say that hjh_{j} identifies its linear regions

[tp,t+1p](t=0,1,,p1)\displaystyle\left[\frac{t}{p},\frac{t+1}{p}\right]\ (t=0,1,\cdots,p-1)

onto $[0,1]$. In other words, $\bm{h}$ identifies these linear regions on each coordinate $h_{1},h_{2},\cdots,h_{n_{0}}$. Therefore, it identifies its linear regions

j=1n0[tjp,tj+1p](tj=0,1,,p1)\displaystyle\prod^{n_{0}}_{j=1}\left[\frac{t_{j}}{p},\frac{t_{j}+1}{p}\right]\ (t_{j}=0,1,\cdots,p-1)

onto [0,1]n0[0,1]^{n_{0}}. Since these linear regions are n0n_{0}-dimensional hypercubes with the side length of p1p^{-1} for any tj{0,1,,p1}t_{j}\in\{0,1,\cdots,p-1\}, the fineness of linear regions I(𝒉)I(\bm{h}) satisfies

I(𝒉)=pn0.\displaystyle I(\bm{h})=p^{-n_{0}}.

Then we can say that

𝒇~~~𝒉I(𝒇~~~)𝒉\displaystyle\tilde{\tilde{\tilde{\bm{f}}}}\circ\bm{h}\preceq_{I(\tilde{\tilde{\tilde{\bm{f}}}})}\bm{h}

from Lemma 3. Therefore, from Lemma 2,

I(𝒇~~~𝒉)I(𝒇~~~)I(𝒉)=pn0.\displaystyle I(\tilde{\tilde{\tilde{\bm{f}}}}\circ\bm{h})\leq I(\tilde{\tilde{\tilde{\bm{f}}}})I(\bm{h})=p^{-n_{0}}. (15)

In a similar manner, from Lemma 3,

(𝒇~~~𝒉)𝒉I(𝒇~~~𝒉)𝒉.\displaystyle(\tilde{\tilde{\tilde{\bm{f}}}}\circ\bm{h})\circ\bm{h}\preceq_{I(\tilde{\tilde{\tilde{\bm{f}}}}\circ\bm{h})}\bm{h}.

Therefore, from Lemma 2 and (15),

I(𝒇~~~𝒉𝒉)I(𝒇~~~𝒉)I(𝒉)I(𝒇~~~)I(𝒉)I(𝒉)=p2n0.\displaystyle I(\tilde{\tilde{\tilde{\bm{f}}}}\circ\bm{h}\circ\bm{h})\leq I(\tilde{\tilde{\tilde{\bm{f}}}}\circ\bm{h})I(\bm{h})\leq I(\tilde{\tilde{\tilde{\bm{f}}}})I(\bm{h})I(\bm{h})=p^{-2n_{0}}.

If we repeat these operations inductively, we obtain the following inequality.

I(𝑭𝜽^)\displaystyle I(\bm{F}_{\hat{\bm{\theta}}}) =I(𝒇~~~𝒉𝒉𝒉L1)\displaystyle=I(\tilde{\tilde{\tilde{\bm{f}}}}\circ\underbrace{\bm{h}\circ\bm{h}\circ\cdots\circ\bm{h}}_{L-1})
I(𝒇~~~)I(𝒉)I(𝒉)I(𝒉)L1=pn0(L1).\displaystyle\leq I(\tilde{\tilde{\tilde{\bm{f}}}})\underbrace{I(\bm{h})I(\bm{h})\cdots I(\bm{h})}_{L-1}=p^{-n_{0}(L-1)}.

If the numbers of units at the layers differ from each other and satisfy $n_{l}\geq n_{0}\ (l=1,2,\cdots,L-1)$, we may change $p=\lfloor\frac{n}{n_{0}}\rfloor$ to

pl:=nln0(l=1,2,,L1).\displaystyle p_{l}:=\left\lfloor\frac{n_{l}}{n_{0}}\right\rfloor\ (l=1,2,\cdots,L-1).

In this case, the fineness of linear regions I(𝑭𝜽^)I(\bm{F}_{\hat{\bm{\theta}}}) satisfies

I(𝑭𝜽^)l=1L1pln0.\displaystyle I(\bm{F}_{\hat{\bm{\theta}}})\leq\prod^{L-1}_{l=1}p_{l}^{-n_{0}}.
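The construction above can be checked numerically for $n_{0}=1$: $\bm{h}$ is then the $p$-piece sawtooth on $[0,1]$, and composing it $L-1$ times yields $p^{L-1}$ linear regions of equal size. The following sketch, with the assumed values $p=2$ and $L=4$, detects slope changes on a fine grid (in the spirit of the detection used in Section III-B) and recovers the fineness $p^{-(L-1)}$.

  import numpy as np

  def h(x, p):
      # h(x) = max(0, p x) + sum_{i=2}^{p} (-1)^{i-1} max(0, 2 p x - 2(i - 1)),
      # the sawtooth from Step 1 of the proof, written for n_0 = 1.
      out = np.maximum(0.0, p * x)
      for i in range(2, p + 1):
          out += (-1) ** (i - 1) * np.maximum(0.0, 2 * p * x - 2 * (i - 1))
      return out

  def fineness_on_grid(f, n_grid=100_000, tol=1e-6):
      # Approximate I(f) on [0, 1]: detect slope changes on a grid and return
      # the largest gap between consecutive boundaries of linear regions.
      x = np.linspace(0.0, 1.0, n_grid)
      slope = np.diff(f(x)) / np.diff(x)
      kinks = x[1:-1][np.abs(np.diff(slope)) > tol]
      edges = np.concatenate(([0.0], kinks, [1.0]))
      return np.max(np.diff(edges))

  p, L = 2, 4
  F = lambda x: h(h(h(x, p), p), p)              # L - 1 = 3 compositions of h
  print(fineness_on_grid(F), p ** -(L - 1))      # both approximately 0.125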

III Experiments

TABLE I: The neural networks compared in the computer simulations.
l   Network 1 (L = 6)   Network 2 (L = 2)
0   n_0 = 1             n_0 = 1
1   n_1 = 4             n_1 = 20
2   n_2 = 4             n_2 = 1
3   n_3 = 4
4   n_4 = 4
5   n_5 = 4
6   n_6 = 1

We compared the two neural networks shown in Table I in terms of expressivity. The activation function in both networks is the ReLU function. Although their depths are different, the total number of units (including the input and output units) is 22 for both. We approximately calculated the ratio of the desired parameters and the fineness of linear regions by computer simulations in order to evaluate the expressivity of the two neural networks.

III-A The Calculation of the Ratio of the Desired Parameters

We calculated the ratio of the desired parameters by computer simulations, restricting the input and output dimensions to one for simplicity. Since it is difficult to calculate (3) exactly by computer simulations, we instead calculated

d^(F𝜽,F):=1Nn=1N(F𝜽(xn)F(xn))2,\displaystyle\hat{d}(F_{\bm{\theta}},F^{*}):=\frac{1}{N}\sum^{N}_{n=1}(F_{\bm{\theta}}(x_{n})-F^{*}(x_{n}))^{2},

where $x_{n}\in\mathbb{R}\ (n=1,2,\cdots,N)$ are finite samples. For the target function $F^{*}$, we adopted two kinds of functions: one was the sine function $F^{*}(x)=\sin(4\pi x)$ and the other was the Weierstrass function shown in Fig. 5. The samples were set as $x_{n}=(n-1)\times 10^{-4}$ with $N=10^{4}$.
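The exact parameters of the Weierstrass function in Fig. 5 are not essential to the procedure. The following sketch generates a truncated Weierstrass series with assumed parameters $a=0.5$ and $b=3$ on the same grid of samples, only to illustrate how such a target can be produced.

  import numpy as np

  def weierstrass(x, a=0.5, b=3.0, n_terms=30):
      # Truncated Weierstrass series sum_k a^k cos(b^k pi x); the parameters a, b
      # and the number of terms are assumptions for illustration only.
      k = np.arange(n_terms)
      return np.sum((a ** k)[:, None] * np.cos((b ** k)[:, None] * np.pi * x), axis=0)

  xs = np.arange(10_000) * 1e-4                  # x_n = (n - 1) * 10^{-4}
  targets = {"sin": np.sin(4 * np.pi * xs), "weierstrass": weierstrass(xs)}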

Moreover, it is also difficult to calculate (2) exactly by computer simulations. Therefore, we approximately calculated it by random sampling of the values of $\bm{\theta}$. We generated the values of $\bm{\theta}$ by sampling from a standard normal distribution and calculated $\hat{d}(F_{\bm{\theta}},F^{*})$ with the generated values, repeating this $2\times 10^{4}$ times. Then we checked whether each $\hat{d}(F_{\bm{\theta}},F^{*})$ was less than $\epsilon_{k}\ (k=1,2,\cdots,10^{4})$ or not. When the target function was $\sin(4\pi x)$, $\epsilon_{k}=0.4+4(k-1)\times 10^{-5}$. When the target function was the Weierstrass function, $\epsilon_{k}=0.6+2(k-1)\times 10^{-5}$. We incremented a counter whenever $\hat{d}(F_{\bm{\theta}},F^{*})\leq\epsilon_{k}$ and calculated the ratio of $\bm{\theta}$ satisfying the inequality by dividing the final count by $2\times 10^{4}$. The following summarizes the algorithm which approximately calculates $R_{\epsilon}(\mathcal{F},F^{*})$.


Figure 5: The Weierstrass function we used in the computer simulations.
  ϵk 0.4+4(k1)×105\epsilon_{k}\ \leftarrow\ 0.4+4(k-1)\times 10^{-5} or 0.6+2(k1)×105(k=1,2,,104)0.6+2(k-1)\times 10^{-5}\ (k=1,2,\cdots,10^{4})
  tk 0(k=1,2,,104)t_{k}\ \leftarrow\ 0\ (k=1,2,\cdots,10^{4})
  xn(n1)×104(n=1,2,,104)x_{n}\ \leftarrow\ (n-1)\times 10^{-4}\ (n=1,2,\cdots,10^{4})
  F(xn)sin(4πxn)F^{*}(x_{n})\ \leftarrow\ \sin(4\pi x_{n}) or the Weierstrass function (n=1,2,,104)(n=1,2,\cdots,10^{4})
  for j=1j=1 to 2×1042\times 10^{4} do
     Generate 𝜽\bm{\theta} with random sampling of standard normal distribution
     Compute a neural network F𝜽(xn)(n=1,2,,104)F_{\bm{\theta}}(x_{n})\ (n=1,2,\cdots,10^{4})
     μ^1104n=1104F𝜽(xn)\hat{\mu}\ \leftarrow\ \frac{1}{10^{4}}\sum^{10^{4}}_{n=1}F_{\bm{\theta}}(x_{n})
     σ^1104n=1104(F𝜽(xn)μ^)2\hat{\sigma}\ \leftarrow\ \sqrt{\frac{1}{10^{4}}\sum^{10^{4}}_{n=1}\left(F_{\bm{\theta}}(x_{n})-\hat{\mu}\right)^{2}}
     F𝜽(xn)F𝜽(xn)μ^σ^(n=1,2,,104)F_{\bm{\theta}}(x_{n})\ \leftarrow\ \frac{F_{\bm{\theta}}(x_{n})-\hat{\mu}}{\hat{\sigma}}\ (n=1,2,\cdots,10^{4})
     d^(F𝜽,F)1104n=1104(F𝜽(xn)F(xn))2\hat{d}(F_{\bm{\theta}},F^{*})\ \leftarrow\ \frac{1}{10^{4}}\sum^{10^{4}}_{n=1}(F_{\bm{\theta}}(x_{n})-F^{*}(x_{n}))^{2}
     for k=1k=1 to 10410^{4} do
        if d^(F𝜽,F)ϵk\hat{d}(F_{\bm{\theta}},F^{*})\leq\epsilon_{k} then
           tktk+1t_{k}\ \leftarrow\ t_{k}+1
        end if
     end for
  end for
  R^ϵk(,F)tk2×104(k=1,2,,104)\hat{R}_{\epsilon_{k}}(\mathcal{F},F^{*})\ \leftarrow\ \frac{t_{k}}{2\times 10^{4}}\ (k=1,2,\cdots,10^{4})
  return  R^ϵk(,F)(k=1,2,,104)\hat{R}_{\epsilon_{k}}(\mathcal{F},F^{*})\ (k=1,2,\cdots,10^{4})

We ran the above algorithm using the neural networks in Table I as $F_{\bm{\theta}}$. Figure 6 and Fig. 7 show the results of the calculation. From these results, we can see that Network 1 can express more functions close to $\sin(4\pi x)$ and to the Weierstrass function than Network 2. This suggests that increasing the number of layers is more effective than increasing the number of units in each layer for improving the expressivity of neural networks.

III-B The Calculation of the Fineness of Linear Regions

In order to confirm Theorem 1, we approximately calculated the minimum value of the fineness of linear regions $\min_{\bm{\theta}\in\bm{\Theta}}I(\bm{F}_{\bm{\theta}})$ by computer simulations. To know the fineness of linear regions $I(\bm{F}_{\bm{\theta}})$, we must detect the linear regions in the input domain of the deep neural network. We detected them by calculating the inflection of the slope of the function computed by the deep neural network.


Figure 6: The result of approximately calculating the ratio of the desired parameters with the sine function. From this figure, we can say that increasing the number of layers is more effective than increasing the number of units in each layer for improving the approximation accuracy, since the ratio of the desired parameters of Network 1 exceeds that of Network 2.

We executed the calculation restricting the input and output dimensions to one for simplicity. We generated an arithmetic sequence $x_{n}=(n-1)\times 10^{-5}\ (n=1,2,\cdots,10^{5})$ and calculated the slope of the line connecting $(x_{n},F_{\bm{\theta}}(x_{n}))$ and $(x_{n+1},F_{\bm{\theta}}(x_{n+1}))$ as

gn:=F𝜽(xn+1)F𝜽(xn)xn+1xn.\displaystyle g_{n}:=\frac{F_{\bm{\theta}}(x_{n+1})-F_{\bm{\theta}}(x_{n})}{x_{n+1}-x_{n}}.

Then we calculated the inflection of gng_{n} such as

dn+1:=|gn+1gn|.\displaystyle d_{n+1}:=|g_{n+1}-g_{n}|.

We regarded $x_{n+1}$ as a boundary of linear regions if $d_{n+1}$ was larger than a threshold value, and calculated the size of the maximal interval between two adjacent boundaries as the fineness of linear regions $I(F_{\bm{\theta}})$. As mentioned in Section I, we must calculate it independently of the efficiency of learning. Therefore, we searched for the minimum value of $I(F_{\bm{\theta}})$ by generating the values of $\bm{\theta}$ randomly and repeatedly, not by learning. The following summarizes the algorithm to approximately calculate $\min_{\bm{\theta}\in\bm{\Theta}}I(F_{\bm{\theta}})$.

  xn(n1)×105(n=1,2,,105)x_{n}\ \leftarrow\ (n-1)\times 10^{-5}\ (n=1,2,\cdots,10^{5})
  fineness_list[]fineness\_list\ \leftarrow\ [\ ]
  for i=1i=1 to 10310^{3} do
     Generate the value of $\bm{\theta}$ by random sampling from the uniform distribution $U(0,1)$
     count 0count\ \leftarrow\ 0
     count_list[]count\_list\ \leftarrow\ [\ ]
     for n=1n=1 to 105210^{5}-2 do
        for k=0,1,2k=0,1,2 do
           Compute the output of a neural network F𝜽(xn+k)F_{\bm{\theta}}(x_{n+k})
        end for
        gnF𝜽(xn+1)F𝜽(xn)xn+1xng_{n}\ \leftarrow\ \frac{F_{\bm{\theta}}(x_{n+1})-F_{\bm{\theta}}(x_{n})}{x_{n+1}-x_{n}}
        gn+1F𝜽(xn+2)F𝜽(xn+1)xn+2xn+1g_{n+1}\ \leftarrow\ \frac{F_{\bm{\theta}}(x_{n+2})-F_{\bm{\theta}}(x_{n+1})}{x_{n+2}-x_{n+1}}
        dn+1|gn+1gn|d_{n+1}\ \leftarrow\ |g_{n+1}-g_{n}|
        if dn+1<0.5d_{n+1}<0.5 then
           countcount+1count\ \leftarrow\ count+1
        else
           Append countcount to count_listcount\_list
           count 0count\ \leftarrow\ 0
        end if
     end for
     Append countcount to count_listcount\_list
     count_maxcount\_max\ \leftarrow\ the maximum value in count_listcount\_list
     finenesscount_max105fineness\ \leftarrow\ \frac{count\_max}{10^{5}}
     Append finenessfineness to fineness_listfineness\_list
  end for
  fineness_minfineness\_min\ \leftarrow\ the minimum value in fineness_listfineness\_list
  return  fineness_minfineness\_min


Figure 7: The result of approximately calculating the ratio of the desired parameters with the Weierstrass function. From this figure, we can say that increasing the number of layers is more effective than increasing the number of units in each layer for improving the approximation accuracy, since the ratio of the desired parameters of Network 1 exceeds that of Network 2.

In the above algorithm, the threshold value of the inflection $d_{n+1}$ is set to 0.5. If $d_{n+1}\geq 0.5$, $x_{n+1}$ is regarded as a boundary of linear regions. The variable $count$ increases while the deep neural network stays linear, and returns to 0 at a boundary of linear regions. The list $count\_list$ stores the sizes of the linear regions, and the maximum value in $count\_list$ gives the fineness of linear regions. However, the maximum value in $count\_list$ must be divided by $10^{5}$ because the difference between $x_{n}$ and $x_{n+1}$ is $10^{-5}$. We also calculated the value of $\prod^{L-1}_{l=1}\lfloor\frac{n_{l}}{2n_{0}}\rfloor^{-n_{0}}$, the right-hand side of (11).

We ran the above algorithm using the neural networks in Table I as $F_{\bm{\theta}}$. The results of the simulations are consistent with Theorem 1. The minimum value of the fineness of linear regions of Network 1 is 0.00004 and that of Network 2 is 0.06345, where the values of $\bm{\theta}$ are restricted to the finite set of samples. On the other hand, the value of $\prod^{L-1}_{l=1}\lfloor\frac{n_{l}}{2n_{0}}\rfloor^{-n_{0}}$ is 0.03125 for Network 1 and 0.1 for Network 2. From these results, we can confirm Theorem 1 and say that increasing the number of layers is more effective than increasing the number of units in each layer when we want neural networks to be more flexible.

IV Conclusion

In this paper, we proposed two new criteria to evaluate the expressivity of functions computable by deep neural networks independently of the efficiency of learning. The first criterion, the ratio of the desired parameters, shows the approximation accuracy of deep neural networks to the target function. The second criterion, the fineness of linear regions, shows a property of linear regions of functions computable by deep neural networks. Furthermore, using the two criteria, we showed that increasing the number of layers is more effective than increasing the number of units in each layer for improving the expressivity of deep neural networks. We hope that our study will contribute to a better understanding of the advantage of deepening neural networks.

Acknowledgment

This research was supported by JSPS Grants-in-Aid for Scientific Research JP17K00316, JP17K06446, JP18K11585, JP19K04914.

References

  • [1] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv: 1409.1556, 2014.
  • [2] A. Veit, M. Wilber, and S. Belongie, “Residual networks behave like ensembles of relatively shallow networks,” Advances in Neural Information Processing Systems 29, pp. 550–558, 2016.
  • [3] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber, “Gradient flow in recurrent nets: The difficulty of learning long-term dependencies,” A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press, 2001.
  • [4] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [5] D. Yarotsky, “Optimal approximation of continuous functions by very deep relu networks,” Proceedings of Machine Learning Research, vol. 75, pp. 639–649, 2018.
  • [6] M. Telgarsky, “Benefits of depth in neural networks,” arXiv preprint arXiv: 1602.04485, 2016.
  • [7] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, 1989.
  • [8] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.
  • [9] H. N. Mhaskar, “Neural networks for optimal approximation of smooth and analytic functions,” Neural Computation, vol. 8, no. 1, pp. 164–177, 1996.
  • [10] D. Yarotsky, “Error bounds for approximations with deep relu networks,” Neural Networks, vol. 94, pp. 103–114, 2017.
  • [11] G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio, “On the number of linear regions of deep neural networks,” Advances in Neural Information Processing Systems 27, pp. 2924–2932, 2014.
  • [12] R. Pascanu, G. F. Montufar, and Y. Bengio, “On the number of inference regions of deep feed forward networks with piece-wise linear activations,” arXiv preprint arXiv: 1312.6098, 2013.
  • [13] T. Serra, C. Tjandraatmadja, and S. Ramalingam, “Bounding and counting linear regions of deep neural networks,” Proceedings of Machine Learning Research, vol. 80, pp. 4558–4566, 2018.
  • [14] B. Hanin and D. Rolnick, “Complexity of linear regions in deep networks,” Proceedings of Machine Learning Research, vol. 97, pp. 2596–2604, 2019.