
Theoretical Analysis of the Advantage of
Deepening Neural Networks

Yasushi Esaki Faculty of Science and Engineering
Waseda University
Tokyo, Japan
[email protected]
   Yuta Nakahara Center for Data Science
Waseda University
Tokyo, Japan
[email protected]
   Toshiyasu Matsushima Faculty of Science and Engineering
Waseda University
Tokyo, Japan
[email protected]
Abstract

We propose two new criteria for understanding the advantage of deepening neural networks. It is important to know the expressivity of functions computable by deep neural networks in order to understand the advantage of deepening them. Unless deep neural networks have enough expressivity, they cannot have good performance even if learning is successful. In this situation, the proposed criteria contribute to understanding the advantage of deepening neural networks because they evaluate the expressivity independently of the efficiency of learning. The first criterion shows the approximation accuracy of deep neural networks to the target function. It is motivated by the fact that the goal of deep learning is to approximate the target function by deep neural networks. The second criterion shows a property of linear regions of functions computable by deep neural networks. It is motivated by the fact that deep neural networks whose activation functions are piecewise linear are themselves piecewise linear. Furthermore, using the two criteria, we show that increasing the number of layers is more effective than increasing the number of units in each layer for improving the expressivity of deep neural networks.

Index Terms:
deep learning theory, expressivity, approximation accuracy, linear region

I Introduction

In recent research on machine learning, deep neural networks have shown very good prediction performance. For example, some papers report that increasing the number of layers in neural networks improves the accuracy of image recognition, e.g., [1, 2], and others report that it improves the accuracy of speech recognition, e.g., [3, 4]. Neural networks are getting deeper with each passing year.

In this situation, many researchers are interested in the relation between the depth of neural networks and their performance. In order to show the advantage of deepening neural networks, the following two factors should be distinguished:

  1. the expressivity of deep neural networks as functions which express the relation between explanatory variables and objective variables;

  2. the efficiency of learning by the gradient method.

The prediction accuracy after learning depends on both the expressivity and the efficiency of learning. Unless deep neural networks have enough expressivity, they cannot have good performance even if learning is successful.

Therefore, there have been studies that evaluate deep neural networks while distinguishing between the expressivity and the efficiency of learning, e.g., [5, 6]. Some of these studies assume that data are scattered around the target function and investigate the approximation accuracy achieved when approximating the target function by deep neural networks. They discuss the distance between deep neural networks and the target function, and regard that the shorter that distance is, the more expressivity the deep neural networks have. For example, some studies consider how many layers and units are necessary for the distance between deep neural networks and the target function to be less than $\epsilon$, for any $\epsilon>0$ and any target function, e.g., [7, 8]. Other studies consider how the lower limit of that distance depends on the depth and the number of units, e.g., [9, 10].

On the other hand, there are other studies that evaluate deep neural networks while distinguishing between the expressivity and the efficiency of learning by investigating the properties of linear regions of functions computable by deep neural networks. Since deep neural networks whose activation functions are piecewise linear are themselves piecewise linear, investigating the properties of linear regions is a good way to understand the shape of the functions drawn by those deep neural networks. For example, some studies count the number of linear regions of deep neural networks [11, 12, 13]. They regard that the larger the number of linear regions is, the more expressivity the deep neural networks have. Other studies calculate the volume of the boundary between two linear regions of deep neural networks [14]. They regard that the larger that volume is, the more expressivity the deep neural networks have.

In this paper, we propose two criteria to evaluate the expressivity of deep neural networks independently of the efficiency of learning, in the spirit of the above studies. The first criterion evaluates the approximation accuracy of deep neural networks to the target function. It shows the ratio of the parameter values which make the distance between the deep neural network and the target function small. We regard that the larger this ratio is, the more expressivity the deep neural networks have. The novelty of this criterion is that we can figure out the volume of parameter values which give functions close to the target function. The proposed criterion shows not only the existence of appropriate parameter values but also how many of them exist.

The second criterion evaluates a property of linear regions of functions computable by deep neural networks whose activation functions are piecewise linear. It shows the size of the maximal linear region of deep neural networks. We regard that the smaller the size of the maximal linear region is, the more expressivity the deep neural networks have. The novelty of this criterion is that it figures out how fine the linear regions are in the least flexible area of the input domain. The proposed criterion shows whether linear regions are scattered finely and uniformly or not.

The above two criteria are related. A small size of the maximal linear region of deep neural networks is a requirement for a good approximation to the target function when the target function is smooth. This requirement can be understood by referring to Fig. 1. Unless the size of the maximal linear region of deep neural networks is small, there is a gap between the deep neural networks and the target function no matter how the parameter values are chosen.



Figure 1: The relation between the size of the maximal linear region of deep neural networks and the approximation of the target function by deep neural networks. If the size of the maximal linear region is large, a gap inevitably occurs between the deep neural network and the target function, as in the left panel. On the other hand, if the size of the maximal linear region is small, the target function can possibly be approximated by the deep neural network, as in the right panel.

Moreover, using the two criteria, we show that increasing the number of layers is more effective than increasing the number of units in each layer for improving the expressivity of neural networks. We show this by both theoretical proofs and computer simulations.

In summary, our contributions are as follows.

  1. We propose new criteria to evaluate the expressivity of deep neural networks independently of the efficiency of learning.

  2. We show, by the new criteria, that increasing the number of layers is more effective than increasing the number of units in each layer for improving the expressivity of neural networks.

II Definitions and Theoretical Facts

II-A Deep Neural Networks

We define a neural network as a composite function 𝑭𝜽:n0nL\bm{F}_{\bm{\theta}}:\mathbb{R}^{n_{0}}\to\mathbb{R}^{n_{L}} parametrized by 𝜽\bm{\theta}, which alternately computes linear functions 𝒇(l):nl1nl(l=1,,L)\bm{f}^{(l)}:\mathbb{R}^{n_{l-1}}\to\mathbb{R}^{n_{l}}\ (l=1,\cdots,L) and non-linear activation functions 𝒈(l):nlnl(l=1,,L1)\bm{g}^{(l)}:\mathbb{R}^{n_{l}}\to\mathbb{R}^{n_{l}}\ (l=1,\cdots,L-1) such as

𝑭𝜽=𝒇(L)𝒈(L1)𝒇(L1)𝒈(1)𝒇(1).\displaystyle\bm{F}_{\bm{\theta}}=\bm{f}^{(L)}\circ\bm{g}^{(L-1)}\circ\bm{f}^{(L-1)}\circ\cdots\circ\bm{g}^{(1)}\circ\bm{f}^{(1)}.

The linear function 𝒇(l)\bm{f}^{(l)} is given by

fj(l)(𝒙)=i=1nl1wji(l)xi+bj(l)(j=1,,nl),\displaystyle f^{(l)}_{j}(\bm{x})=\sum^{n_{l-1}}_{i=1}w^{(l)}_{ji}x_{i}+b^{(l)}_{j}\ (j=1,\cdots,n_{l}),

where $x_{i}$ is the $i$-th coordinate of $\bm{x}$ and $f^{(l)}_{j}$ is the $j$-th coordinate of $\bm{f}^{(l)}$. The parameter $\bm{\theta}$ is a vector which consists of $w^{(l)}_{ji}\in\mathbb{R}$ and $b^{(l)}_{j}\in\mathbb{R}$ for each $i\in\{1,2,\cdots,n_{l-1}\}$, $j\in\{1,2,\cdots,n_{l}\}$ and $l\in\{1,2,\cdots,L\}$. We define $\bm{\Theta}\subset\mathbb{R}^{M}$ as the set of $\bm{\theta}$, where $M$ is the number of parameters of the neural network. We call $\bm{g}^{(l)}\circ\bm{f}^{(l)}$ the $l$-th layer and call the $j$-th coordinate of $\bm{g}^{(l)}\circ\bm{f}^{(l)}$ the $j$-th unit of the $l$-th layer. We note that the 0-th layer, which is called the input layer, is not counted in the number of layers.

Deep neural networks are used for supervised learning. We can express the relation between explanatory variables and objective variables with deep neural networks if the values of 𝜽\bm{\theta} are set appropriately.
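For concreteness, the following is a minimal sketch of such a network in Python with the ReLU activation introduced in Section II-C1. The layer widths and the standard normal initialization below are illustrative assumptions only, not part of the definition.

  import numpy as np

  def relu(x):
      # Piecewise linear activation g(x) = max{0, x}
      return np.maximum(0.0, x)

  def init_params(widths, rng):
      # theta consists of (W^(l), b^(l)) for l = 1, ..., L; the entries are drawn
      # from a standard normal purely for illustration.
      return [(rng.standard_normal((widths[l + 1], widths[l])),
               rng.standard_normal(widths[l + 1]))
              for l in range(len(widths) - 1)]

  def forward(theta, x):
      # F_theta = f^(L) o g^(L-1) o f^(L-1) o ... o g^(1) o f^(1)
      h = x
      for l, (W, b) in enumerate(theta):
          h = W @ h + b                    # linear map f^(l)
          if l < len(theta) - 1:           # no activation after the last layer
              h = relu(h)                  # activation g^(l)
      return h

  rng = np.random.default_rng(0)
  theta = init_params([1, 4, 4, 1], rng)   # n_0 = 1, n_1 = n_2 = 4, n_3 = 1 (L = 3)
  print(forward(theta, np.array([0.3])))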

II-B The Ratio of the Desired Parameters

In this section, we assume a probability distribution for the objective variables in supervised learning and propose a new criterion to evaluate the expressivity of functions computable by deep neural networks under that assumption. For simplicity, we restrict the dimension of the objective variables to 1. Let the probability distribution of the objective variables be a normal distribution such that

y=F(𝒙)+ε(εN(0,σ2)),\displaystyle y=F^{*}(\bm{x})+\varepsilon\ \ (\varepsilon\sim N(0,\sigma^{2})), (1)

where $F^{*}:\mathbb{R}^{n_{0}}\to\mathbb{R}$ is a continuous function. The goal of deep learning is to express the relation between explanatory variables and objective variables by approximating $F^{*}$ with a deep neural network $F_{\bm{\theta}}$ in the set $\{F_{\bm{\theta}}\ |\ \bm{\theta}\in\bm{\Theta}\}$. Therefore, we call $F^{*}$ the target function in this paper.

When approximating the target function $F^{*}$ by deep neural networks $F_{\bm{\theta}}$, it is desirable for the set $\{F_{\bm{\theta}}\ |\ \bm{\theta}\in\bm{\Theta}\}$ to have many functions close to $F^{*}$. However, there have been few studies which investigate the volume of the functions close to $F^{*}$ in $\{F_{\bm{\theta}}\ |\ \bm{\theta}\in\bm{\Theta}\}$, although some studies prove that at least one function close to $F^{*}$ exists in $\{F_{\bm{\theta}}\ |\ \bm{\theta}\in\bm{\Theta}\}$, e.g., [7, 10]. Therefore, we define a new criterion to investigate it independently of the efficiency of learning as follows.

Definition 1.

Let $F_{\bm{\theta}}:\mathbb{R}^{n_{0}}\to\mathbb{R}$ be a deep neural network. Let $F^{*}:\mathbb{R}^{n_{0}}\to\mathbb{R}$ be the target function defined as (1). Moreover, we assume that the explanatory variables $\bm{x}\in\mathbb{R}^{n_{0}}$ are random variables with a probability density function $p(\bm{x})$. Then we define the ratio of the desired parameters between the set $\mathcal{F}:=\{F_{\bm{\theta}}\ |\ \bm{\theta}\in\bm{\Theta}\}$ and $F^{*}$ as

Rϵ(,F):=𝜽𝚯ϵ(,F)𝑑𝜽𝜽𝚯𝑑𝜽,\displaystyle R_{\epsilon}(\mathcal{F},F^{*}):=\frac{\int_{\bm{\theta}\in\bm{\Theta}_{\epsilon}(\mathcal{F},F^{*})}d\bm{\theta}}{\int_{\bm{\theta}\in\bm{\Theta}}d\bm{\theta}}, (2)

where

d(F𝜽,F)\displaystyle d(F_{\bm{\theta}},F^{*}) :=𝒙n0(F𝜽(𝒙)F(𝒙))2p(𝒙)𝑑𝒙,\displaystyle:=\int_{\bm{x}\in\mathbb{R}^{n_{0}}}(F_{\bm{\theta}}(\bm{x})-F^{*}(\bm{x}))^{2}p(\bm{x})d\bm{x}, (3)
𝚯ϵ(,F)\displaystyle\bm{\Theta}_{\epsilon}(\mathcal{F},F^{*}) :={𝜽𝚯|d(F𝜽,F)ϵ}.\displaystyle:=\{\bm{\theta}\in\bm{\Theta}\ |\ d(F_{\bm{\theta}},F^{*})\leq\epsilon\}.

We can say that d(F𝜽,F)d(F_{\bm{\theta}},F^{*}) represents the distance between the deep neural network F𝜽F_{\bm{\theta}} and the target function FF^{*}. Moreover, Rϵ(,F)R_{\epsilon}(\mathcal{F},F^{*}) calculates the ratio of the values of 𝜽\bm{\theta} which make that distance short. Therefore, we regard that the larger Rϵ(,F)R_{\epsilon}(\mathcal{F},F^{*}) is, the more functions close to the target function the set \mathcal{F} has.

The ratio of the desired parameters shows the volume of parameter values which give functions close to the target function. Therefore, we can know not only the existence of appropriate parameter values but also how many of them exist. Recalling learning with the gradient method, we can consider that the ratio of the desired parameters may also indicate the number of local solutions of the loss function.
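As an illustration of Definition 1, the following is a minimal Monte Carlo sketch of $R_{\epsilon}(\mathcal{F},F^{*})$. The prior over $\bm{\Theta}$ (standard normal), the input density $p(\bm{x})$ (uniform on $[0,1]$), the one-hidden-layer architecture with the activation $\max\{0,x\}$, and the target function are assumptions made only for this illustration; Section III describes the setting used in the actual simulations.

  import numpy as np

  # Monte Carlo sketch of R_epsilon (Definition 1) under the assumptions above.
  rng = np.random.default_rng(0)
  xs = np.linspace(0.0, 1.0, 1_000)              # grid approximating E_x[.] under p(x)
  F_star = np.sin(4 * np.pi * xs)                # an example target function F*

  def sample_network_outputs(width=4):
      # Draw theta for a one-hidden-layer network and return F_theta evaluated on xs.
      W1, b1 = rng.standard_normal(width), rng.standard_normal(width)
      W2, b2 = rng.standard_normal(width), rng.standard_normal()
      hidden = np.maximum(0.0, np.outer(xs, W1) + b1)
      return hidden @ W2 + b2

  def estimate_ratio(eps, n_samples=20_000):
      # Fraction of sampled theta with d(F_theta, F*) <= eps, cf. (2) and (3).
      hits = sum(np.mean((sample_network_outputs() - F_star) ** 2) <= eps
                 for _ in range(n_samples))
      return hits / n_samples

  print(estimate_ratio(eps=0.6))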

II-C Deep Neural Networks and Linear Regions

II-C1 ReLU Neural Networks

Several kinds of non-linear functions can be used as the non-linear activation function $\bm{g}^{(l)}$. One example is the ReLU function, defined as $g(x)=\max\{0,x\}$. ReLU neural networks have the following special property.

Lemma 1 ([10] ).

Let 𝒟\mathcal{D} be a bounded subset of n0\mathbb{R}^{n_{0}}. Let 𝑭𝜽:𝒟nL\bm{F}_{\bm{\theta}}:\mathcal{D}\to\mathbb{R}^{n_{L}} be a ReLU neural network with LL layers, MM parameters and SS units in total. Then there exists a neural network 𝑭~𝜽:𝒟nL\tilde{\bm{F}}_{\bm{\theta}}:\mathcal{D}\to\mathbb{R}^{n_{L}} with any continuous piecewise linear activation function, having LL layers, 4M4M parameters and 2S2S units in total, and that computes the same function as 𝑭𝜽\bm{F}_{\bm{\theta}}.

Lemma 1 points out that deep neural networks with any piecewise linear activation function can represent any ReLU neural network. In the proof in [10], two units of $\tilde{\bm{F}}_{\bm{\theta}}$ are assigned to each unit of $\bm{F}_{\bm{\theta}}$ when $\bm{F}_{\bm{\theta}}$ is replaced with $\tilde{\bm{F}}_{\bm{\theta}}$. Therefore, the number of units at the $l$-th layer, $n_{l}$, increases to $2n_{l}$. Lemma 1 is used in the proof of the theorem shown in Section II-C5.


Figure 2: The examples of linear regions of piecewise linear functions.

II-C2 Fineness of Linear Regions

The ReLU function g(x)=max{0,x}g(x)=\max\{0,x\} is a piecewise linear function whose inclination changes at the origin. If a neural network has piecewise linear activation functions like this, 𝒇(l)(l=1,2,,L)\bm{f}^{(l)}(l=1,2,\cdots,L) are linear functions and 𝒈(l)(l=1,2,,L1)\bm{g}^{(l)}(l=1,2,\cdots,L-1) are piecewise linear functions. Therefore, a neural network whose activation functions are piecewise linear is a piecewise linear function. Based on this fact, we can introduce a linear region to consider the flexibility of deep neural networks with piecewise linear activation functions. It is defined as follows.

Definition 2 ([11] ).

Let 𝒟\mathcal{D} be a bounded subset of n\mathbb{R}^{n}. Let 𝑭:𝒟m\bm{F}:\mathcal{D}\to\mathbb{R}^{m} be a piecewise linear function. Let UU be a nn-dimensional interval in 𝒟\mathcal{D}. Let 𝑭|U:Um\bm{F}|_{U}:U\to\mathbb{R}^{m} be the function given by restricting the input domain of 𝑭\bm{F} from 𝒟\mathcal{D} to UU. We say that UU is a linear region of 𝑭\bm{F} if the followings hold.

  • 𝑭|U\bm{F}|_{U} is linear.

  • For any subset V𝒟V\subset\mathcal{D} such that VUV\supsetneq U, 𝑭|V\bm{F}|_{V} is not linear.

Example 1.

For example, let $F:[0,1]\to\mathbb{R}$ be the function in Fig. 2. We can say that $[0,\frac{1}{4}]$, $[\frac{1}{4},\frac{2}{3}]$ and $[\frac{2}{3},1]$ are linear regions of $F$. In contrast, $[\frac{1}{3},\frac{1}{2}]$ is not a linear region: although $F$ is linear on it, it is strictly contained in $[\frac{1}{4},\frac{2}{3}]$, on which $F$ is also linear, so the second condition of Definition 2 fails.


Figure 3: The problem of evaluating the flexibility of deep neural networks by the number of linear regions. We cannot distinguish between F1F_{1} and F2F_{2} when we use the number of linear regions. Therefore, we propose a new criterion in this paper.

In this paper, the set of linear regions of 𝑭\bm{F} is represented by L(𝑭)L(\bm{F}). We note that

UL(𝑭)U=𝒟,UL(𝑭)𝒙U𝑑𝒙=𝒙𝒟𝑑𝒙\displaystyle\bigcup_{U\in L(\bm{F})}U=\mathcal{D},\ \ \ \ \ \sum_{U\in L(\bm{F})}\int_{\bm{x}\in U}d\bm{x}=\int_{\bm{x}\in\mathcal{D}}d\bm{x} (4)

when 𝑭:𝒟m\bm{F}:\mathcal{D}\to\mathbb{R}^{m} is a piecewise linear function.

There have been some studies that evaluate the flexibility of deep neural networks by the number of linear regions [12, 11, 13]. However, there is room for improvement in evaluating the flexibility by the number of linear regions when we want to know whether linear regions are scattered uniformly. The number of linear regions cannot ensure that many linear regions do not gather in a certain area of the input domain. This problem is illustrated in Fig. 3. $F_{2}$ does not keep its flexibility in the left-side area. However, $F_{1}$ and $F_{2}$ are regarded as having similar flexibility when evaluated by the number of linear regions. The same problem also arises when the flexibility is evaluated by the volume of the boundary between two linear regions, as proposed in [14]. Therefore, we propose a new criterion to evaluate the flexibility of deep neural networks. It is defined as follows.

Definition 3.

Let 𝒟\mathcal{D} be a bounded subset of n\mathbb{R}^{n}. We define the fineness of linear regions of a piecewise linear function 𝑭:𝒟m\bm{F}:\mathcal{D}\to\mathbb{R}^{m} such as

I(𝑭):=maxUL(𝑭)𝒙U𝑑𝒙𝒙𝒟𝑑𝒙.\displaystyle I(\bm{F}):=\max_{U\in L(\bm{F})}\frac{\int_{\bm{x}\in U}d\bm{x}}{\int_{\bm{x}\in\mathcal{D}}d\bm{x}}. (5)
Example 2.

We consider the example function F:[0,1]F:[0,1]\to\mathbb{R} in Fig. 2. In this example, the fineness of linear regions I(F)I(F) is computed as follows.

I(F)\displaystyle I(F) =maxU{[0,14],[14,23],[23,1]}xU𝑑xx[0,1]𝑑x=1423𝑑x=512.\displaystyle=\max_{U\in\left\{\left[0,\frac{1}{4}\right],\left[\frac{1}{4},\frac{2}{3}\right],\left[\frac{2}{3},1\right]\right\}}\frac{\int_{x\in U}dx}{\int_{x\in[0,1]}dx}=\int^{\frac{2}{3}}_{\frac{1}{4}}dx=\frac{5}{12}.

Note that $0\leq I(\bm{F})\leq 1$. We regard that the smaller the value of $I(\bm{F})$ is, the finer the linear regions of $\bm{F}$ are. The fineness of linear regions shows the size of the maximal linear region of a piecewise linear function, that is, how fine the linear regions are in the least flexible area of the input domain. Therefore, by this criterion, we can know whether the linear regions are scattered finely and uniformly over the input domain or not, which the number of linear regions cannot tell us. In the case of neural networks $\bm{F}_{\bm{\theta}}$ whose activation functions are piecewise linear, the fineness of linear regions $I(\bm{F}_{\bm{\theta}})$ changes when the values of $\bm{\theta}$ change. Therefore, we regard that the smaller the value of $\min_{\bm{\theta}\in\bm{\Theta}}I(\bm{F}_{\bm{\theta}})$ is, the more flexibility the set $\{\bm{F}_{\bm{\theta}}\ |\ \bm{\theta}\in\bm{\Theta}\}$ has.
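For one-dimensional piecewise linear functions, $I(\bm{F})$ is easy to compute once the boundaries of the linear regions are known. The following is a minimal sketch using the breakpoints of the function in Fig. 2.

  import numpy as np

  def fineness(breakpoints):
      # I(F) = (size of the maximal linear region) / (size of the input domain),
      # for a one-dimensional piecewise linear F described by its breakpoints.
      b = np.asarray(breakpoints, dtype=float)
      region_sizes = np.diff(b)                 # lengths of the linear regions
      return region_sizes.max() / (b[-1] - b[0])

  print(fineness([0.0, 1/4, 2/3, 1.0]))         # 5/12, as in Example 2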

For the readers who are interested in the relation between the number of linear regions and the fineness of linear regions, we show the following fact.

Corollary 1.

Let 𝒟\mathcal{D} be a bounded subset of n\mathbb{R}^{n}. Let 𝑭:𝒟m\bm{F}:\mathcal{D}\to\mathbb{R}^{m} be a piecewise linear function. The fineness of linear regions I(𝑭)I(\bm{F}) and the number of linear regions |L(𝑭)||L(\bm{F})| satisfy

|L(𝑭)|1I(𝑭).\displaystyle|L(\bm{F})|\geq\frac{1}{I(\bm{F})}.
Proof of Corollary 1.

From (4),

𝒙𝒟𝑑𝒙\displaystyle\int_{\bm{x}\in\mathcal{D}}d\bm{x} =UL(𝑭)𝒙U𝑑𝒙\displaystyle=\sum_{U\in L(\bm{F})}\int_{\bm{x}\in U}d\bm{x}
|L(𝑭)|×maxUL(𝑭)𝒙U𝑑𝒙.\displaystyle\leq|L(\bm{F})|\times\max_{U\in L(\bm{F})}\int_{\bm{x}\in U}d\bm{x}.

Therefore,

|L(𝑭)|𝒙𝒟𝑑𝒙maxUL(𝑭)𝒙U𝑑𝒙=1I(𝑭)((5)).\displaystyle|L(\bm{F})|\geq\frac{\int_{\bm{x}\in\mathcal{D}}d\bm{x}}{\max_{U\in L(\bm{F})}\int_{\bm{x}\in U}d\bm{x}}=\frac{1}{I(\bm{F})}\ \ \ \ (\because\ (\ref{eqif})).
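As a quick check with the function of Fig. 2, $|L(F)|=3$ and $1/I(F)=\frac{12}{5}=2.4$, so the bound of Corollary 1 indeed holds.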

II-C3 The Partial Order of Piecewise Linear Functions

Next, we define a partial order between two piecewise linear functions as follows.

Definition 4.

Let 𝒟\mathcal{D} be a bounded subset of n\mathbb{R}^{n}. Let 𝑭,𝑮:𝒟m\bm{F},\bm{G}:\mathcal{D}\to\mathbb{R}^{m} be two piecewise linear functions. We say that 𝑮r𝑭\bm{G}\preceq_{r}\bm{F}, where 0r10\leq r\leq 1, if the following statements hold.

  • For any linear region UL(𝑭)U\in L(\bm{F}), there exist s(s1)s\in\mathbb{N}\ (s\geq 1) and some linear regions ViL(𝑮)(i=1,2,,s)V_{i}\in L(\bm{G})\ (i=1,2,\cdots,s) such that

    U=i=1sVi.\displaystyle U=\bigcup^{s}_{i=1}V_{i}. (6)
  • For any linear region UL(𝑭)U\in L(\bm{F}) and V{VL(𝑮)|VU}V\in\{V^{\prime}\in L(\bm{G})\ |\ V^{\prime}\subset U\},

    𝒙V𝑑𝒙r𝒙U𝑑𝒙.\displaystyle\int_{\bm{x}\in V}d\bm{x}\leq r\int_{\bm{x}\in U}d\bm{x}. (7)
Example 3.

For example, let $F,G:[0,1]\to\mathbb{R}$ be the functions in Fig. 2. We can say that $G\preceq_{\frac{5}{9}}F$: the largest linear region of $G$ contained in a linear region of $F$ has length $\frac{1}{4}-\frac{1}{9}$ and lies in $[0,\frac{1}{4}]$, so $r$ is computed as

r=1419140=59.\displaystyle r=\frac{\frac{1}{4}-\frac{1}{9}}{\frac{1}{4}-0}=\frac{5}{9}.

Equation (6) states that the linear regions of $\bm{G}$ refine those of $\bm{F}$, and (7) states that each linear region of $\bm{G}$ contained in a linear region of $\bm{F}$ has at most $r$ times its size. In fact, we can say the following.

Lemma 2.

Let 𝒟\mathcal{D} be a bounded subset of n\mathbb{R}^{n}. Let 𝑭,𝑮:𝒟m\bm{F},\bm{G}:\mathcal{D}\to\mathbb{R}^{m} be two piecewise linear functions. We can say that

𝑮r𝑭I(𝑮)rI(𝑭),\displaystyle\bm{G}\preceq_{r}\bm{F}\ \Rightarrow\ I(\bm{G})\leq rI(\bm{F}),

where 0r10\leq r\leq 1.

Proof of Lemma 2.
I(𝑮)\displaystyle I(\bm{G}) =maxVL(𝑮)𝒙V𝑑𝒙𝒙𝒟𝑑𝒙((5))\displaystyle=\max_{V\in L(\bm{G})}\frac{\int_{\bm{x}\in V}d\bm{x}}{\int_{\bm{x}\in\mathcal{D}}d\bm{x}}\ \ \ \ (\because(\ref{eqif}))
=maxUL(𝑭)maxV{VL(𝑮)|VU}𝒙V𝑑𝒙𝒙𝒟𝑑𝒙((6))\displaystyle=\max_{U\in L(\bm{F})}\max_{V\in\{V^{\prime}\in L(\bm{G})\ |\ V^{\prime}\subset U\}}\frac{\int_{\bm{x}\in V}d\bm{x}}{\int_{\bm{x}\in\mathcal{D}}d\bm{x}}\ \ \ \ (\because(\ref{equ}))
maxUL(𝑭)r𝒙U𝑑𝒙𝒙𝒟𝑑𝒙((7))\displaystyle\leq\max_{U\in L(\bm{F})}r\frac{\int_{\bm{x}\in U}d\bm{x}}{\int_{\bm{x}\in\mathcal{D}}d\bm{x}}\ \ \ \ (\because(\ref{eqr}))
=rI(𝑭)((5)).\displaystyle=rI(\bm{F})\ \ \ \ (\because(\ref{eqif})).

Lemma 2 will be used for the proof of the theorem shown in Section II-C5.

II-C4 Linear Regions and Identification of Input Subsets

Montufar et al. [11] define the identification of input subsets in their paper. It is defined as follows.

Definition 5 ([11] ).

We say that a map $\bm{g}:\mathbb{R}^{n}\to\mathbb{R}^{m}$ identifies $K(\in\mathbb{N})$ subsets $U_{1},\cdots,U_{K}(\subset\mathbb{R}^{n})$ onto $V(\subset\mathbb{R}^{m})$ if it maps them to a common image $V=\bm{g}(U_{1})=\cdots=\bm{g}(U_{K})$ in $\mathbb{R}^{m}$. We also say that $U_{1},\cdots,U_{K}$ are identified onto $V$ by $\bm{g}$.

Example 4.

For example, the four quadrants of the two-dimensional Euclidean space are identified onto $[0,\infty)^{2}$ by the function $\bm{g}:\mathbb{R}^{2}\to[0,\infty)^{2}$ defined by $\bm{g}(x_{1},x_{2})=(|x_{1}|,|x_{2}|)^{T}$.
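A tiny numerical check of this example is sketched below: a grid placed in each of the four (closed) quadrants is mapped by $\bm{g}$ to one common image grid in $[0,1]^{2}$.

  import numpy as np

  # g(x1, x2) = (|x1|, |x2|) sends a grid in each quadrant to the same image grid.
  grid = np.stack(np.meshgrid(np.linspace(0, 1, 5), np.linspace(0, 1, 5)), axis=-1)
  images = {(s1, s2): np.abs(grid * np.array([s1, s2]))
            for s1 in (1, -1) for s2 in (1, -1)}
  base = images[(1, 1)]
  print(all(np.allclose(img, base) for img in images.values()))   # True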

If U1,,UK(n)U_{1},\cdots,U_{K}(\subset\mathbb{R}^{n}) are all of the subsets identified onto V(m)V(\subset\mathbb{R}^{m}) by 𝒈:nm\bm{g}:\mathbb{R}^{n}\to\mathbb{R}^{m}, we can say that

𝒈1(V)=k=1KUk.\displaystyle\bm{g}^{-1}(V)=\bigcup^{K}_{k=1}U_{k}. (8)

Montufar et al. [11] interpret (8) as meaning that the common subset $V$ is replicated to $U_{1},\cdots,U_{K}$. We can prove the following lemma related to Definition 5.

Lemma 3.

Let 𝒟1\mathcal{D}_{1} be a bounded subset of n\mathbb{R}^{n}. Let 𝒟2\mathcal{D}_{2} be a bounded subset of n\mathbb{R}^{n^{\prime}}. Let 𝒈:𝒟1𝒟2\bm{g}:\mathcal{D}_{1}\to\mathcal{D}_{2} be a piecewise linear function identifying all its linear regions onto 𝒟2\mathcal{D}_{2}. Let 𝒇:𝒟2m\bm{f}:\mathcal{D}_{2}\to\mathbb{R}^{m} be a piecewise linear function. The following holds.

𝒇𝒈I(𝒇)𝒈.\displaystyle\bm{f}\circ\bm{g}\preceq_{I(\bm{f})}\bm{g}.
Proof of Lemma 3.

For any linear region UL(𝒈)U\in L(\bm{g}) and VL(𝒇)V\in L(\bm{f})𝒈|U1(V)\bm{g}|^{-1}_{U}(V) is a linear region of 𝒇𝒈\bm{f}\circ\bm{g} and

U\displaystyle U =𝒈|U1(𝒟2)(𝒈(U)=𝒟2)\displaystyle=\bm{g}|_{U}^{-1}(\mathcal{D}_{2})\ \ \ \ (\because\bm{g}(U)=\mathcal{D}_{2})
=𝒈|U1(VL(𝒇)V)((4))\displaystyle=\bm{g}|_{U}^{-1}\left(\bigcup_{V\in L(\bm{f})}V\right)\ \ \ \ (\because(\ref{eqpiece}))
=VL(𝒇)𝒈|U1(V).\displaystyle=\bigcup_{V\in L(\bm{f})}\bm{g}|_{U}^{-1}(V).

Since 𝒈|U1\bm{g}|_{U}^{-1} is linear, the jacobian of 𝒈|U1\bm{g}|_{U}^{-1} satisfies |J𝒈|U1|0\left|J_{\bm{g}|_{U}^{-1}}\right|\neq 0. Therefore, from the property of integration by substitution with 𝒙=𝒈|U1(𝒚)\bm{x}=\bm{g}|_{U}^{-1}(\bm{y}),

𝒙U𝑑𝒙\displaystyle\int_{\bm{x}\in U}d\bm{x} =𝒙𝒈|U1(𝒟2)𝑑𝒙=|J𝒈|U1|𝒚𝒟2𝑑𝒚.\displaystyle=\int_{\bm{x}\in\bm{g}|_{U}^{-1}(\mathcal{D}_{2})}d\bm{x}=\left|J_{\bm{g}|_{U}^{-1}}\right|\int_{\bm{y}\in\mathcal{D}_{2}}d\bm{y}. (9)

From (9),

|J𝒈|U1|=𝒙U𝑑𝒙𝒚𝒟2𝑑𝒚.\displaystyle\left|J_{\bm{g}|_{U}^{-1}}\right|=\frac{\int_{\bm{x}\in U}d\bm{x}}{\int_{\bm{y}\in\mathcal{D}_{2}}d\bm{y}}. (10)

Moreover,

𝒙𝒈|U1(V)𝑑𝒙\displaystyle\int_{\bm{x}\in\bm{g}|_{U}^{-1}(V)}d\bm{x} =|J𝒈|U1|𝒚V𝑑𝒚\displaystyle=\left|J_{\bm{g}|_{U}^{-1}}\right|\int_{\bm{y}\in V}d\bm{y}
=𝒙U𝑑𝒙𝒚𝒟2𝑑𝒚𝒚Vd𝒚((10))\displaystyle=\frac{\int_{\bm{x}\in U}d\bm{x}}{\int_{\bm{y}\in\mathcal{D}_{2}}d\bm{y}}\int_{\bm{y}\in V}d\bm{y}\ \ \ \ (\because\mbox{(\ref{eqj})})
I(𝒇)𝒙Ud𝒙((5)).\displaystyle\leq I(\bm{f})\int_{\bm{x}\in U}d\bm{x}\ \ \ \ (\because(\ref{eqif})).

II-C5 The Relation Between the Depth of Neural Networks and the Fineness of Linear Regions

We can show that increasing the number of layers is more effective than increasing the number of units in each layer for improving the expressivity of neural networks by proving the following theorem.

Theorem 1.

Let 𝑭~𝜽:[0,1]n0nL\tilde{\bm{F}}_{\bm{\theta}}:[0,1]^{n_{0}}\to\mathbb{R}^{n_{L}} be a neural network with any piecewise linear activation function, having LL layers and nln_{l} units at the ll-th layer, where nl2n0n_{l}\geq 2n_{0} for any l{1,2,,L1}l\in\{1,2,\cdots,L-1\}. We can say that

𝜽𝚯s.t.I(𝑭~𝜽)l=1L1nl2n0n0.{}^{\exists}\bm{\theta}\in\bm{\Theta}\ \ \mathrm{s.t.}\ \ I(\tilde{\bm{F}}_{\bm{\theta}})\leq\prod^{L-1}_{l=1}\left\lfloor\frac{n_{l}}{2n_{0}}\right\rfloor^{-n_{0}}. (11)

If nl=n(2n0)n_{l}=n(\geq 2n_{0}) for any l{1,2,,L1}l\in\{1,2,\cdots,L-1\}, we can say that

𝜽𝚯s.t.I(𝑭~𝜽)O((n2n0)n0(L1)){}^{\exists}\bm{\theta}\in\bm{\Theta}\ \ \mathrm{s.t.}\ \ I(\tilde{\bm{F}}_{\bm{\theta}})\leq O\left(\left(\frac{n}{2n_{0}}\right)^{-n_{0}(L-1)}\right)

from Theorem 1. In other words, $\tilde{\bm{F}}_{\bm{\theta}}$ can express functions whose fineness of linear regions is $O((\frac{n}{2n_{0}})^{-n_{0}(L-1)})$ if $\bm{\theta}$ has appropriate values. For example, with $n_{0}=1$, $n=4$ and $L=6$, the bound in (11) is $\lfloor 4/2\rfloor^{-5}=2^{-5}=0.03125$, whereas for a two-layer network with $n=20$ units the corresponding bound is only $\lfloor 20/2\rfloor^{-1}=0.1$. Therefore, we consider that the flexibility of the set of deep neural networks $\{\tilde{\bm{F}}_{\bm{\theta}}\ |\ \bm{\theta}\in\bm{\Theta}\}$ grows exponentially in $L$ and polynomially in $n$. We use the following lemma in order to prove Theorem 1.

Lemma 4.

Let 𝑭𝜽:[0,1]n0nL\bm{F}_{\bm{\theta}}:[0,1]^{n_{0}}\to\mathbb{R}^{n_{L}} be a ReLU neural network with LL layers and nln_{l} units at the ll-th layer, where nln0n_{l}\geq n_{0} for any l{1,2,,L1}l\in\{1,2,\cdots,L-1\}. We can say that

𝜽𝚯s.t.I(𝑭𝜽)l=1L1nln0n0.{}^{\exists}{\bm{\theta}}\in\bm{\Theta}\ \ \mathrm{s.t.}\ \ I(\bm{F}_{\bm{\theta}})\leq\prod^{L-1}_{l=1}\left\lfloor\frac{n_{l}}{n_{0}}\right\rfloor^{-n_{0}}.
Proof of Theorem 1.

Let 𝑭𝜽:[0,1]n0nL\bm{F}_{\bm{\theta}}:[0,1]^{n_{0}}\to\mathbb{R}^{n_{L}} be a ReLU neural network with LL layers and nl2(nl2n0)\frac{n_{l}}{2}(n_{l}\geq 2n_{0}) units at the ll-th layer for any l{1,2,,L1}l\in\{1,2,\cdots,L-1\}. From Lemma 4, there exists 𝜽^𝚯\hat{\bm{\theta}}\in\bm{\Theta} such that

I(𝑭𝜽^)l=1L1nl2n0n0.\displaystyle I(\bm{F}_{\hat{\bm{\theta}}})\leq\prod^{L-1}_{l=1}\left\lfloor\frac{n_{l}}{2n_{0}}\right\rfloor^{-n_{0}}.

Moreover, from Lemma 1, there exists a neural network 𝑭~𝜽:[0,1]n0nL\tilde{\bm{F}}_{\bm{\theta}}:[0,1]^{n_{0}}\to\mathbb{R}^{n_{L}} with any piecewise linear activation function, having LL layers and nln_{l} units at the ll-th layer, and that computes the same function as 𝑭𝜽^\bm{F}_{\hat{\bm{\theta}}}. It satisfies

I(𝑭~𝜽)l=1L1nl2n0n0.\displaystyle I(\tilde{\bm{F}}_{\bm{\theta}})\leq\prod^{L-1}_{l=1}\left\lfloor\frac{n_{l}}{2n_{0}}\right\rfloor^{-n_{0}}.

Now, we prove Lemma 4.

Proof of Lemma 4.

We prove Lemma 4 by two steps as follows.

Step 1.

Construct a specific ReLU neural network $\bm{F}_{\hat{\bm{\theta}}}$ which recursively calculates the functions with common parameters.

Step 2.

Show that the fineness of linear regions of the network $\bm{F}_{\hat{\bm{\theta}}}$ satisfies $I(\bm{F}_{\hat{\bm{\theta}}})\leq\prod^{L-1}_{l=1}\lfloor\frac{n_{l}}{n_{0}}\rfloor^{-n_{0}}$ by applying Lemma 2 and Lemma 3 inductively.

Step 1. Let an integer nn\in\mathbb{N} satisfy nn0n\geq n_{0} and p:=nn0p:=\lfloor\frac{n}{n_{0}}\rfloor. We define a pn0pn_{0}-dimensional function 𝒇~:[0,1]n0pn0\tilde{\bm{f}}:[0,1]^{n_{0}}\to\mathbb{R}^{pn_{0}} as

f~(i1)n0+j(𝒙):={pxj(i=1)2pxj2(i1)(i=2,,p)\displaystyle\tilde{f}_{(i-1)n_{0}+j}(\bm{x}):=\left\{\begin{array}[]{ll}px_{j}&(i=1)\\ 2px_{j}-2(i-1)&(i=2,\cdots,p)\end{array}\right. (14)
(j=1,2,,n0),\displaystyle\hskip 142.26378pt(j=1,2,\cdots,n_{0}),

where f~(i1)n0+j\tilde{f}_{(i-1)n_{0}+j} is the ((i1)n0+j)((i-1)n_{0}+j)-th coordinate of 𝒇~\tilde{\bm{f}} and xjx_{j} is the jj-th coordinate of 𝒙\bm{x}. We define a n0n_{0}-dimensional function 𝒇~~:pn0[0,1]n0\tilde{\tilde{\bm{f}}}:\mathbb{R}^{pn_{0}}\to[0,1]^{n_{0}} as

f~~j(𝒙):=i=1p(1)i1x(i1)n0+j(j=1,2,,n0),\displaystyle\tilde{\tilde{f}}_{j}(\bm{x}):=\sum^{p}_{i=1}(-1)^{i-1}x_{(i-1)n_{0}+j}\ (j=1,2,\cdots,n_{0}),

where f~~j\tilde{\tilde{f}}_{j} is the jj-th coordinate of 𝒇~~\tilde{\tilde{\bm{f}}}. Furthermore, we define a nLn_{L}-dimensional function 𝒇~~~:[0,1]n0nL\tilde{\tilde{\tilde{\bm{f}}}}:[0,1]^{n_{0}}\to\mathbb{R}^{n_{L}} as

f~~~j(𝒙):=i=1n0wjixi+bj(wji>0;j=1,2,,nL),\displaystyle\tilde{\tilde{\tilde{f}}}_{j}(\bm{x}):=\sum^{n_{0}}_{i=1}w_{ji}x_{i}+b_{j}\ (w_{ji}>0;\ j=1,2,\cdots,n_{L}),

where f~~~j\tilde{\tilde{\tilde{f}}}_{j} is the jj-th coordinate of 𝒇~~~\tilde{\tilde{\tilde{\bm{f}}}}. We construct the neural network using these functions as follows.

At first, let the number of units at the ll-th layer be equal to n(n0)n(\geq n_{0}) for any l{1,2,,L1}l\in\{1,2,\cdots,L-1\}. We separate the set of nn units at the ll-th layer, (l=1,2,,L1)(l=1,2,\cdots,L-1), into p(=nn0)p(=\lfloor\frac{n}{n_{0}}\rfloor) subsets and remainder units. Each of the pp subsets contains n0n_{0} units. Let the values of the parameters of the remainder npn0n-pn_{0} units be 0. For any l{1,2,,L1}l\in\{1,2,\cdots,L-1\}, let the non-linear activation function 𝒈(l):pn0pn0\bm{g}^{(l)}:\mathbb{R}^{pn_{0}}\to\mathbb{R}^{pn_{0}} be

gj(l)(𝒙)=max{0,xj}(j=1,2,,pn0),\displaystyle g^{(l)}_{j}(\bm{x})=\max\{0,x_{j}\}\ (j=1,2,\cdots,pn_{0}),

where gj(l)g^{(l)}_{j} is the jj-th coordinate of 𝒈(l)\bm{g}^{(l)}.

Next, we consider the linear functions 𝒇(l)(l=1,2,,L)\bm{f}^{(l)}(l=1,2,\cdots,L). When l=1l=1, for the jj-th unit of the ii-th subset, (j=1,2,,n0j=1,2,\dots,n_{0}, i=1,2,,pi=1,2,\dots,p), let the linear function f(i1)n0+j(1):[0,1]n0f^{(1)}_{(i-1)n_{0}+j}:[0,1]^{n_{0}}\to\mathbb{R} be

f(i1)n0+j(1)=f~(i1)n0+j.\displaystyle f^{(1)}_{(i-1)n_{0}+j}=\tilde{f}_{(i-1)n_{0}+j}.

When l=2,3,,L1l=2,3,\dots,L-1, for the jj-th unit of the ii-th subset, (j=1,2,,n0j=1,2,\dots,n_{0}, i=1,2,,pi=1,2,\dots,p), let the linear function f(i1)n0+j(l):pn0f^{(l)}_{(i-1)n_{0}+j}:\mathbb{R}^{pn_{0}}\to\mathbb{R} be

f(i1)n0+j(l)=f~(i1)n0+j𝒇~~.\displaystyle f_{(i-1)n_{0}+j}^{(l)}=\tilde{f}_{(i-1)n_{0}+j}\circ\tilde{\tilde{\bm{f}}}.

When l=Ll=L, let the linear function 𝒇(L):pn0nL\bm{f}^{(L)}:\mathbb{R}^{pn_{0}}\to\mathbb{R}^{n_{L}} be

𝒇(L)=𝒇~~~𝒇~~.\displaystyle\bm{f}^{(L)}=\tilde{\tilde{\tilde{\bm{f}}}}\circ\tilde{\tilde{\bm{f}}}.

Then the constructed network is represented as follows. Since 𝒈(1),𝒈(2),,𝒈(L1)\bm{g}^{(1)},\bm{g}^{(2)},\cdots,\bm{g}^{(L-1)} are common, we express them as 𝒈\bm{g}.

𝑭𝜽^\displaystyle\bm{F}_{\hat{\bm{\theta}}} =𝒇(L)𝒈(L1)𝒇(L1)𝒈(1)𝒇(1)\displaystyle=\bm{f}^{(L)}\circ\bm{g}^{(L-1)}\circ\bm{f}^{(L-1)}\circ\cdots\circ\bm{g}^{(1)}\circ\bm{f}^{(1)}
=𝒇~~~𝒇~~𝒈𝒇~𝒇~~𝒈𝒇~𝒇~~𝒈𝒇~.\displaystyle=\tilde{\tilde{\tilde{\bm{f}}}}\circ\tilde{\tilde{\bm{f}}}\circ\bm{g}\circ\tilde{\bm{f}}\circ\cdots\circ\tilde{\tilde{\bm{f}}}\circ\bm{g}\circ\tilde{\bm{f}}\circ\tilde{\tilde{\bm{f}}}\circ\bm{g}\circ\tilde{\bm{f}}.

Refer to caption

Figure 4: The graph of $h_{j}\ (j=1,\cdots,n_{0})$, where $p$ is an even number.

Step 2. Let 𝒉:=𝒇~~𝒈𝒇~\bm{h}:=\tilde{\tilde{\bm{f}}}\circ\bm{g}\circ\tilde{\bm{f}}. We can re-represent our network as

𝑭𝜽^=𝒇~~~𝒉𝒉𝒉L1.\displaystyle\bm{F}_{\hat{\bm{\theta}}}=\tilde{\tilde{\tilde{\bm{f}}}}\circ\underbrace{\bm{h}\circ\bm{h}\circ\cdots\circ\bm{h}}_{L-1}.

We consider the fineness of linear regions I(𝒇~~~)I(\tilde{\tilde{\tilde{\bm{f}}}}) and I(𝒉)I(\bm{h}) as well as the identified regions of 𝒉\bm{h} in order to apply Lemma 2 and Lemma 3. Since 𝒇~~~\tilde{\tilde{\tilde{\bm{f}}}} is linear on [0,1]n0[0,1]^{n_{0}},

I(𝒇~~~)=1.\displaystyle I(\tilde{\tilde{\tilde{\bm{f}}}})=1.

On the other hand, the jj-th coordinate of 𝒉:[0,1]n0[0,1]n0\bm{h}:[0,1]^{n_{0}}\to[0,1]^{n_{0}} is described such as

hj(𝒙)\displaystyle h_{j}(\bm{x}) =max{0,pxj}max{0, 2pxj2}+\displaystyle=\max\left\{0,\ px_{j}\right\}-\max\left\{0,\ 2px_{j}-2\right\}+\cdots
+(1)p1max{0, 2pxj2(p1)}\displaystyle\hskip 28.45274pt\cdots+(-1)^{p-1}\max\left\{0,\ 2px_{j}-2(p-1)\right\}
(j=1,2,,n0).\displaystyle\hskip 113.81102pt(j=1,2,\cdots,n_{0}).

Figure 4 shows the graph of hjh_{j}. From Fig. 4, we can say that hjh_{j} identifies its linear regions

[tp,t+1p](t=0,1,,p1)\displaystyle\left[\frac{t}{p},\frac{t+1}{p}\right]\ (t=0,1,\cdots,p-1)

onto $[0,1]$. In other words, $\bm{h}$ identifies these linear regions on each coordinate $h_{1},h_{2},\cdots,h_{n_{0}}$. Therefore, it identifies its linear regions

j=1n0[tjp,tj+1p](tj=0,1,,p1)\displaystyle\prod^{n_{0}}_{j=1}\left[\frac{t_{j}}{p},\frac{t_{j}+1}{p}\right]\ (t_{j}=0,1,\cdots,p-1)

onto [0,1]n0[0,1]^{n_{0}}. Since these linear regions are n0n_{0}-dimensional hypercubes with the side length of p1p^{-1} for any tj{0,1,,p1}t_{j}\in\{0,1,\cdots,p-1\}, the fineness of linear regions I(𝒉)I(\bm{h}) satisfies

I(𝒉)=pn0.\displaystyle I(\bm{h})=p^{-n_{0}}.

Then we can say that

𝒇~~~𝒉I(𝒇~~~)𝒉\displaystyle\tilde{\tilde{\tilde{\bm{f}}}}\circ\bm{h}\preceq_{I(\tilde{\tilde{\tilde{\bm{f}}}})}\bm{h}

from Lemma 3. Therefore, from Lemma 2,

I(𝒇~~~𝒉)I(𝒇~~~)I(𝒉)=pn0.\displaystyle I(\tilde{\tilde{\tilde{\bm{f}}}}\circ\bm{h})\leq I(\tilde{\tilde{\tilde{\bm{f}}}})I(\bm{h})=p^{-n_{0}}. (15)

In a similar manner, from Lemma 3,

(𝒇~~~𝒉)𝒉I(𝒇~~~𝒉)𝒉.\displaystyle(\tilde{\tilde{\tilde{\bm{f}}}}\circ\bm{h})\circ\bm{h}\preceq_{I(\tilde{\tilde{\tilde{\bm{f}}}}\circ\bm{h})}\bm{h}.

Therefore, from Lemma 2 and (15),

I(𝒇~~~𝒉𝒉)I(𝒇~~~𝒉)I(𝒉)I(𝒇~~~)I(𝒉)I(𝒉)=p2n0.\displaystyle I(\tilde{\tilde{\tilde{\bm{f}}}}\circ\bm{h}\circ\bm{h})\leq I(\tilde{\tilde{\tilde{\bm{f}}}}\circ\bm{h})I(\bm{h})\leq I(\tilde{\tilde{\tilde{\bm{f}}}})I(\bm{h})I(\bm{h})=p^{-2n_{0}}.

If we repeat these operations inductively, we obtain the following inequality.

I(𝑭𝜽^)\displaystyle I(\bm{F}_{\hat{\bm{\theta}}}) =I(𝒇~~~𝒉𝒉𝒉L1)\displaystyle=I(\tilde{\tilde{\tilde{\bm{f}}}}\circ\underbrace{\bm{h}\circ\bm{h}\circ\cdots\circ\bm{h}}_{L-1})
I(𝒇~~~)I(𝒉)I(𝒉)I(𝒉)L1=pn0(L1).\displaystyle\leq I(\tilde{\tilde{\tilde{\bm{f}}}})\underbrace{I(\bm{h})I(\bm{h})\cdots I(\bm{h})}_{L-1}=p^{-n_{0}(L-1)}.

If the numbers of units at the layers differ from each other and satisfy $n_{l}\geq n_{0}\ (l=1,2,\cdots,L-1)$, we may change $p=\lfloor\frac{n}{n_{0}}\rfloor$ to

pl:=nln0(l=1,2,,L1).\displaystyle p_{l}:=\left\lfloor\frac{n_{l}}{n_{0}}\right\rfloor\ (l=1,2,\cdots,L-1).

In this case, the fineness of linear regions I(𝑭𝜽^)I(\bm{F}_{\hat{\bm{\theta}}}) satisfies

I(𝑭𝜽^)l=1L1pln0.\displaystyle I(\bm{F}_{\hat{\bm{\theta}}})\leq\prod^{L-1}_{l=1}p_{l}^{-n_{0}}.
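The construction above can be checked numerically for $n_{0}=1$: $\bm{h}$ is then the $p$-piece sawtooth on $[0,1]$, and composing it $L-1$ times yields $p^{L-1}$ linear regions of equal size. The following sketch, with the assumed values $p=2$ and $L=4$, detects slope changes on a fine grid (in the spirit of the detection used in Section III-B) and recovers the fineness $p^{-(L-1)}$.

  import numpy as np

  def h(x, p):
      # h(x) = max(0, p x) + sum_{i=2}^{p} (-1)^{i-1} max(0, 2 p x - 2(i - 1)),
      # the sawtooth from Step 1 of the proof, written for n_0 = 1.
      out = np.maximum(0.0, p * x)
      for i in range(2, p + 1):
          out += (-1) ** (i - 1) * np.maximum(0.0, 2 * p * x - 2 * (i - 1))
      return out

  def fineness_on_grid(f, n_grid=100_000, tol=1e-6):
      # Approximate I(f) on [0, 1]: detect slope changes on a grid and return
      # the largest gap between consecutive boundaries of linear regions.
      x = np.linspace(0.0, 1.0, n_grid)
      slope = np.diff(f(x)) / np.diff(x)
      kinks = x[1:-1][np.abs(np.diff(slope)) > tol]
      edges = np.concatenate(([0.0], kinks, [1.0]))
      return np.max(np.diff(edges))

  p, L = 2, 4
  F = lambda x: h(h(h(x, p), p), p)              # L - 1 = 3 compositions of h
  print(fineness_on_grid(F), p ** -(L - 1))      # both approximately 0.125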

III Experiments

TABLE I: The neural networks compared in the computer simulations.
l   Network 1 (L = 6)   Network 2 (L = 2)
0   n_0 = 1             n_0 = 1
1   n_1 = 4             n_1 = 20
2   n_2 = 4             n_2 = 1
3   n_3 = 4
4   n_4 = 4
5   n_5 = 4
6   n_6 = 1

We compared the two neural networks shown in Table I in terms of expressivity. The activation function in both networks is the ReLU function. Although their depths are different, the total number of units (including the input and output units) is 22 for both. We approximately calculated the ratio of the desired parameters and the fineness of linear regions by computer simulations in order to evaluate the expressivity of the two neural networks.

III-A The Calculation of the Ratio of the Desired Parameters

We calculated the ratio of the desired parameters by computer simulations, restricting the input and output dimensions to one for simplicity. Since it is difficult to calculate (3) exactly by computer simulations, we instead calculated

d^(F𝜽,F):=1Nn=1N(F𝜽(xn)F(xn))2,\displaystyle\hat{d}(F_{\bm{\theta}},F^{*}):=\frac{1}{N}\sum^{N}_{n=1}(F_{\bm{\theta}}(x_{n})-F^{*}(x_{n}))^{2},

where $x_{n}\in\mathbb{R}\ (n=1,2,\cdots,N)$ are finite samples. For the target function $F^{*}$, we adopted two kinds of functions: one was the sine function $F^{*}(x)=\sin(4\pi x)$ and the other was the Weierstrass function shown in Fig. 5. The samples were set as $x_{n}=(n-1)\times 10^{-4}$ with $N=10^{4}$.
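The exact parameters of the Weierstrass function in Fig. 5 are not essential to the procedure. The following sketch generates a truncated Weierstrass series with assumed parameters $a=0.5$ and $b=3$ on the same grid of samples, only to illustrate how such a target can be produced.

  import numpy as np

  def weierstrass(x, a=0.5, b=3.0, n_terms=30):
      # Truncated Weierstrass series sum_k a^k cos(b^k pi x); the parameters a, b
      # and the number of terms are assumptions for illustration only.
      k = np.arange(n_terms)
      return np.sum((a ** k)[:, None] * np.cos((b ** k)[:, None] * np.pi * x), axis=0)

  xs = np.arange(10_000) * 1e-4                  # x_n = (n - 1) * 10^{-4}
  targets = {"sin": np.sin(4 * np.pi * xs), "weierstrass": weierstrass(xs)}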

Moreover, it is also difficult to calculate (2) exactly by computer simulations. Therefore, we approximately calculated it by random sampling of the values of $\bm{\theta}$. We generated the values of $\bm{\theta}$ by sampling from a standard normal distribution and calculated $\hat{d}(F_{\bm{\theta}},F^{*})$ with the generated values, repeating this $2\times 10^{4}$ times. Then we checked whether each $\hat{d}(F_{\bm{\theta}},F^{*})$ was less than $\epsilon_{k}\ (k=1,2,\cdots,10^{4})$ or not. When the target function was $\sin(4\pi x)$, $\epsilon_{k}=0.4+4(k-1)\times 10^{-5}$. When the target function was the Weierstrass function, $\epsilon_{k}=0.6+2(k-1)\times 10^{-5}$. We incremented a counter whenever $\hat{d}(F_{\bm{\theta}},F^{*})\leq\epsilon_{k}$ and calculated the ratio of $\bm{\theta}$ satisfying the inequality by dividing the final count by $2\times 10^{4}$. The following summarizes the algorithm which approximately calculates $R_{\epsilon}(\mathcal{F},F^{*})$.


Figure 5: The Weierstrass function we used in the computer simulations.
  ϵk 0.4+4(k1)×105\epsilon_{k}\ \leftarrow\ 0.4+4(k-1)\times 10^{-5} or 0.6+2(k1)×105(k=1,2,,104)0.6+2(k-1)\times 10^{-5}\ (k=1,2,\cdots,10^{4})
  tk 0(k=1,2,,104)t_{k}\ \leftarrow\ 0\ (k=1,2,\cdots,10^{4})
  xn(n1)×104(n=1,2,,104)x_{n}\ \leftarrow\ (n-1)\times 10^{-4}\ (n=1,2,\cdots,10^{4})
  F(xn)sin(4πxn)F^{*}(x_{n})\ \leftarrow\ \sin(4\pi x_{n}) or the Weierstrass function (n=1,2,,104)(n=1,2,\cdots,10^{4})
  for j=1j=1 to 2×1042\times 10^{4} do
     Generate 𝜽\bm{\theta} with random sampling of standard normal distribution
     Compute a neural network F𝜽(xn)(n=1,2,,104)F_{\bm{\theta}}(x_{n})\ (n=1,2,\cdots,10^{4})
     μ^1104n=1104F𝜽(xn)\hat{\mu}\ \leftarrow\ \frac{1}{10^{4}}\sum^{10^{4}}_{n=1}F_{\bm{\theta}}(x_{n})
     σ^1104n=1104(F𝜽(xn)μ^)2\hat{\sigma}\ \leftarrow\ \sqrt{\frac{1}{10^{4}}\sum^{10^{4}}_{n=1}\left(F_{\bm{\theta}}(x_{n})-\hat{\mu}\right)^{2}}
     F𝜽(xn)F𝜽(xn)μ^σ^(n=1,2,,104)F_{\bm{\theta}}(x_{n})\ \leftarrow\ \frac{F_{\bm{\theta}}(x_{n})-\hat{\mu}}{\hat{\sigma}}\ (n=1,2,\cdots,10^{4})
     d^(F𝜽,F)1104n=1104(F𝜽(xn)F(xn))2\hat{d}(F_{\bm{\theta}},F^{*})\ \leftarrow\ \frac{1}{10^{4}}\sum^{10^{4}}_{n=1}(F_{\bm{\theta}}(x_{n})-F^{*}(x_{n}))^{2}
     for k=1k=1 to 10410^{4} do
        if d^(F𝜽,F)ϵk\hat{d}(F_{\bm{\theta}},F^{*})\leq\epsilon_{k} then
           tktk+1t_{k}\ \leftarrow\ t_{k}+1
        end if
     end for
  end for
  R^ϵk(,F)tk2×104(k=1,2,,104)\hat{R}_{\epsilon_{k}}(\mathcal{F},F^{*})\ \leftarrow\ \frac{t_{k}}{2\times 10^{4}}\ (k=1,2,\cdots,10^{4})
  return  R^ϵk(,F)(k=1,2,,104)\hat{R}_{\epsilon_{k}}(\mathcal{F},F^{*})\ (k=1,2,\cdots,10^{4})

We ran the above algorithm using the neural networks in Table I as $F_{\bm{\theta}}$. Figure 6 and Fig. 7 show the results of the calculation. From these results, we can see that Network 1 can express more functions close to $\sin(4\pi x)$ and to the Weierstrass function than Network 2. This suggests that increasing the number of layers is more effective than increasing the number of units in each layer for improving the expressivity of neural networks.

III-B The Calculation of the Fineness of Linear Regions

In order to confirm Theorem 1, we approximately calculated the minimum value of the fineness of linear regions $\min_{\bm{\theta}\in\bm{\Theta}}I(\bm{F}_{\bm{\theta}})$ by computer simulations. To know the fineness of linear regions $I(\bm{F}_{\bm{\theta}})$, we must detect the linear regions in the input domain of the deep neural network. We detected them by calculating the inflection of the slope of the function computed by the deep neural network.


Figure 6: The result of approximately calculating the ratio of the desired parameters with the sine function. From this figure, we can say that increasing the number of layers is more effective than increasing the number of units in each layer for improving the approximation accuracy, since the ratio of the desired parameters of Network 1 exceeds that of Network 2.

We executed the calculation restricting the input and output dimensions to one for simplicity. We generated an arithmetic sequence $x_{n}=(n-1)\times 10^{-5}\ (n=1,2,\cdots,10^{5})$ and calculated the slope of the line connecting $(x_{n},F_{\bm{\theta}}(x_{n}))$ and $(x_{n+1},F_{\bm{\theta}}(x_{n+1}))$ as

gn:=F𝜽(xn+1)F𝜽(xn)xn+1xn.\displaystyle g_{n}:=\frac{F_{\bm{\theta}}(x_{n+1})-F_{\bm{\theta}}(x_{n})}{x_{n+1}-x_{n}}.

Then we calculated the inflection of gng_{n} such as

dn+1:=|gn+1gn|.\displaystyle d_{n+1}:=|g_{n+1}-g_{n}|.

We regarded $x_{n+1}$ as a boundary of linear regions if $d_{n+1}$ was larger than a threshold value, and calculated the size of the maximal interval between two adjacent boundaries as the fineness of linear regions $I(F_{\bm{\theta}})$. As mentioned in Section I, we must calculate it independently of the efficiency of learning. Therefore, we searched for the minimum value of $I(F_{\bm{\theta}})$ by generating the values of $\bm{\theta}$ randomly and repeatedly, not by learning. The following summarizes the algorithm to approximately calculate $\min_{\bm{\theta}\in\bm{\Theta}}I(F_{\bm{\theta}})$.

  xn(n1)×105(n=1,2,,105)x_{n}\ \leftarrow\ (n-1)\times 10^{-5}\ (n=1,2,\cdots,10^{5})
  fineness_list[]fineness\_list\ \leftarrow\ [\ ]
  for i=1i=1 to 10310^{3} do
     Generate the value of $\bm{\theta}$ by random sampling from the uniform distribution $U(0,1)$
     count 0count\ \leftarrow\ 0
     count_list[]count\_list\ \leftarrow\ [\ ]
     for n=1n=1 to 105210^{5}-2 do
        for k=0,1,2k=0,1,2 do
           Compute the output of a neural network F𝜽(xn+k)F_{\bm{\theta}}(x_{n+k})
        end for
        gnF𝜽(xn+1)F𝜽(xn)xn+1xng_{n}\ \leftarrow\ \frac{F_{\bm{\theta}}(x_{n+1})-F_{\bm{\theta}}(x_{n})}{x_{n+1}-x_{n}}
        gn+1F𝜽(xn+2)F𝜽(xn+1)xn+2xn+1g_{n+1}\ \leftarrow\ \frac{F_{\bm{\theta}}(x_{n+2})-F_{\bm{\theta}}(x_{n+1})}{x_{n+2}-x_{n+1}}
        dn+1|gn+1gn|d_{n+1}\ \leftarrow\ |g_{n+1}-g_{n}|
        if dn+1<0.5d_{n+1}<0.5 then
           countcount+1count\ \leftarrow\ count+1
        else
           Append countcount to count_listcount\_list
           count 0count\ \leftarrow\ 0
        end if
     end for
     Append countcount to count_listcount\_list
     count_maxcount\_max\ \leftarrow\ the maximum value in count_listcount\_list
     finenesscount_max105fineness\ \leftarrow\ \frac{count\_max}{10^{5}}
     Append finenessfineness to fineness_listfineness\_list
  end for
  fineness_minfineness\_min\ \leftarrow\ the minimum value in fineness_listfineness\_list
  return  fineness_minfineness\_min


Figure 7: The result of approximately calculating the ratio of the desired parameters with the Weierstrass function. From this figure, we can say that increasing the number of layers is more effective than increasing the number of units in each layer for improving the approximation accuracy, since the ratio of the desired parameters of Network 1 exceeds that of Network 2.

In the above algorithm, the threshold value of the inflection $d_{n+1}$ is set to 0.5. If $d_{n+1}\geq 0.5$, $x_{n+1}$ is regarded as a boundary of linear regions. The variable $count$ increases while the deep neural network stays linear, and returns to 0 at a boundary of linear regions. The list $count\_list$ stores the sizes of the linear regions, and the maximum value in $count\_list$ gives the fineness of linear regions. However, the maximum value in $count\_list$ must be divided by $10^{5}$ because the difference between $x_{n}$ and $x_{n+1}$ is $10^{-5}$. We also calculated the value of $\prod^{L-1}_{l=1}\lfloor\frac{n_{l}}{2n_{0}}\rfloor^{-n_{0}}$, the right-hand side of (11).

We ran the above algorithm using the neural networks in Table I as $F_{\bm{\theta}}$. The results of the simulations are consistent with Theorem 1. The minimum value of the fineness of linear regions of Network 1 is 0.00004 and that of Network 2 is 0.06345, where the values of $\bm{\theta}$ are restricted to the finite set of samples. On the other hand, the value of $\prod^{L-1}_{l=1}\lfloor\frac{n_{l}}{2n_{0}}\rfloor^{-n_{0}}$ is 0.03125 for Network 1 and 0.1 for Network 2. From these results, we can confirm Theorem 1 and say that increasing the number of layers is more effective than increasing the number of units in each layer when we want neural networks to be more flexible.

IV Conclusion

In this paper, we proposed two new criteria to evaluate the expressivity of functions computable by deep neural networks independently of the efficiency of learning. The first criterion, the ratio of the desired parameters, shows the approximation accuracy of deep neural networks to the target function. The second criterion, the fineness of linear regions, shows a property of linear regions of functions computable by deep neural networks. Furthermore, using the two criteria, we showed that increasing the number of layers is more effective than increasing the number of units in each layer for improving the expressivity of deep neural networks. We hope that our study will contribute to a better understanding of the advantage of deepening neural networks.

Acknowledgment

This research was supported by JSPS Grants-in-Aid for Scientific Research JP17K00316, JP17K06446, JP18K11585, JP19K04914.

References

  • [1] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv: 1409.1556, 2014.
  • [2] A. Veit, M. Wilber, and S. Belongie, “Residual networks behave like ensembles of relatively shallow networks,” Advances in Neural Information Processing Systems 29, pp. 550–558, 2016.
  • [3] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber, “Gradient flow in recurrent nets: The difficulty of learning long-term dependencies,” A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press, 2001.
  • [4] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [5] D. Yarotsky, “Optimal approximation of continuous functions by very deep relu networks,” Proceedings of Machine Learning Research, vol. 75, pp. 639–649, 2018.
  • [6] M. Telgarsky, “Benefits of depth in neural networks,” arXiv preprint arXiv: 1602.04485, 2016.
  • [7] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, 1989.
  • [8] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.
  • [9] H. N. Mhaskar, “Neural networks for optimal approximation of smooth and analytic functions,” Neural Computation, vol. 8, no. 1, pp. 164–177, 1996.
  • [10] D. Yarotsky, “Error bounds for approximations with deep relu networks,” Neural Networks, vol. 94, pp. 103–114, 2017.
  • [11] G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio, “On the number of linear regions of deep neural networks,” Advances in Neural Information Processing Systems 27, pp. 2924–2932, 2014.
  • [12] R. Pascanu, G. F. Montufar, and Y. Bengio, “On the number of inference regions of deep feed forward networks with piece-wise linear activations,” arXiv preprint arXiv: 1312.6098, 2013.
  • [13] T. Serra, C. Tjandraatmadja, and S. Ramalingam, “Bounding and counting linear regions of deep neural networks,” Proceedings of Machine Learning Research, vol. 80, pp. 4558–4566, 2018.
  • [14] B. Hanin and D. Rolnick, “Complexity of linear regions in deep networks,” Proceedings of Machine Learning Research, vol. 97, pp. 2596–2604, 2019.