Theoretical Analysis of the Advantage of
Deepening Neural Networks
Abstract
We propose two new criteria for understanding the advantage of deepening neural networks. To understand this advantage, it is important to know the expressivity of the functions computable by deep neural networks. Unless deep neural networks have enough expressivity, they cannot achieve good performance even if learning is successful. In this situation, the proposed criteria contribute to understanding the advantage of deepening neural networks, since they evaluate expressivity independently of the efficiency of learning. The first criterion measures the approximation accuracy of deep neural networks with respect to the target function; its background is that the goal of deep learning is to approximate the target function by deep neural networks. The second criterion measures a property of the linear regions of the functions computable by deep neural networks; its background is that deep neural networks whose activation functions are piecewise linear are themselves piecewise linear. Furthermore, using the two criteria, we show that increasing the number of layers is more effective than increasing the number of units per layer for improving the expressivity of deep neural networks.
Index Terms:
deep learning theory, expressivity, approximation accuracy, linear region
I Introduction
In recent research on machine learning, deep neural networks have shown very good prediction performance. For example, several papers report that increasing the number of layers in neural networks improves the accuracy of image recognition, e.g., [1, 2]. Other papers report that increasing the number of layers improves the accuracy of speech recognition, e.g., [3, 4]. Neural networks are getting deeper and deeper with each passing year.
In this situation, many researchers are interested in the relation between the depth of neural networks and their good performance. In order to show the advantage of deepening neural networks, the following two aspects should be distinguished.
1. the expressivity of deep neural networks as functions which express the relation between explanatory variables and objective variables;
2. the efficiency of learning by the gradient method.
The prediction accuracy after learning depends on both the expressivity and the efficiency of learning. Unless deep neural networks have enough expressivity, they cannot achieve good performance even if learning is successful.
Therefore, there have been several studies that evaluate deep neural networks while distinguishing between expressivity and the efficiency of learning, e.g., [5, 6]. Some of these studies assume that the data are scattered around a target function and investigate the approximation accuracy obtained when approximating the target function by deep neural networks. They discuss the distance between deep neural networks and the target function, and regard the networks as more expressive the shorter that distance is. For example, some studies consider how many layers and units are necessary for the distance between a deep neural network and the target function to be less than $\varepsilon$, for any $\varepsilon > 0$ and any target function, e.g., [7, 8]. Other studies consider how the lower bound of the distance between deep neural networks and the target function depends on the depth and the number of units, e.g., [9, 10].
On the other hand, there are other kinds of studies that evaluate deep neural networks while distinguishing between expressivity and the efficiency of learning. These studies investigate properties of the linear regions of functions computable by deep neural networks. Since deep neural networks whose activation functions are piecewise linear are themselves piecewise linear, investigating the linear regions is a good way to know the shapes that those deep neural networks can draw. For example, some studies count the number of linear regions of deep neural networks [11, 12, 13]; they regard the networks as more expressive the larger the number of linear regions is. Other studies calculate the volume of the boundary between two linear regions of deep neural networks [14]; they regard the networks as more expressive the larger that volume is.
In this paper, we propose two criteria to evaluate the expressivity of deep neural networks independently of the efficiency of learning, in the same spirit as the above studies. The first criterion evaluates the approximation accuracy of deep neural networks with respect to the target function. It measures the ratio of parameter values that make the distance between the deep neural network and the target function small. We regard the set of deep neural networks as more expressive the larger this ratio is. The novelty of this criterion is that it quantifies the volume of parameter values which give functions close to the target function: it shows not only the existence of appropriate parameter values but also how many of them exist.
The second criterion evaluates a property of the linear regions of functions computable by deep neural networks whose activation functions are piecewise linear. It measures the size of the maximal linear region of a deep neural network. We regard the networks as more expressive the smaller the size of the maximal linear region is. The novelty of this criterion is that it quantifies how finely the least flexible area of the input domain is partitioned into linear regions, and therefore shows whether the linear regions are scattered finely and uniformly or not.
The two criteria are related. A small size of the maximal linear region of a deep neural network is a requirement for a good approximation of the target function when the target function is smooth. This requirement can be understood by referring to Fig. 1: unless the size of the maximal linear region is small, a gap remains between the deep neural network and the target function, no matter how the parameters are tuned.
Moreover, using the two criteria, we show that increasing the number of layers is more effective than increasing the number of units per layer for improving the expressivity of neural networks. We show this by theoretical proofs and computer simulations.
In summary, our contributions are as follows.
1. We propose new criteria to evaluate the expressivity of deep neural networks independently of the efficiency of learning.
2. Using the new criteria, we show that increasing the number of layers is more effective than increasing the number of units per layer for improving the expressivity of neural networks.
II Definitions and Theoretical Facts
II-A Deep Neural Networks
We define a neural network as a composite function $f_{\theta} : \mathbb{R}^{n_0} \to \mathbb{R}^{n_L}$ parametrized by $\theta$, which alternately computes linear functions $g_l$ and non-linear activation functions $h_l$ such as
$$ f_{\theta}(x) = g_L \circ h_{L-1} \circ g_{L-1} \circ \cdots \circ h_1 \circ g_1(x). $$
The linear function $g_l$ is given by
$$ g_l(z)_j = \sum_{i} w^{(l)}_{ji} z_i + b^{(l)}_j, $$
where $z_i$ is the $i$-th coordinate of the input $z$ and $g_l(z)_j$ is the $j$-th coordinate of $g_l(z)$. The parameter $\theta$ is a vector which consists of the weights $w^{(l)}_{ji}$ and the biases $b^{(l)}_j$ for each $l$, $i$ and $j$. We define $\Theta \subseteq \mathbb{R}^{W}$ as the set of possible values of $\theta$, where $W$ is the number of parameters of the neural network. We call $h_l \circ g_l$ the $l$-th layer (and $g_L$ alone the output layer), write $n_l$ for its number of units, and call the $j$-th coordinate of its output the $j$-th unit of the $l$-th layer. We note that the 0-th layer, which is called the input layer, is not counted in the number of layers, so $n_0$ is the input dimension.
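To make the definition concrete, the following is a minimal NumPy sketch of such a composite function. The layer sizes, the Gaussian initialization, the choice of ReLU as the activation, and keeping the output layer linear are illustrative assumptions, not part of the definition.

```python
import numpy as np

def relu(z):
    # one possible piecewise linear activation function h_l
    return np.maximum(z, 0.0)

def init_theta(layer_sizes, rng):
    # theta consists of a weight matrix and a bias vector for every layer
    return [(rng.standard_normal((m, n)), rng.standard_normal(m))
            for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]

def network(theta, x):
    # alternately apply the linear functions g_l and the activations h_l;
    # the output layer is kept linear
    z = x
    for l, (W, b) in enumerate(theta):
        z = W @ z + b                  # linear function g_l
        if l < len(theta) - 1:
            z = relu(z)                # activation h_l
    return z

rng = np.random.default_rng(0)
theta = init_theta([2, 4, 4, 1], rng)  # illustrative sizes: input dim 2, two hidden layers
print(network(theta, np.array([0.3, -0.7])))
```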
Deep neural networks are used for supervised learning. We can express the relation between explanatory variables and objective variables with a deep neural network if the values of $\theta$ are set appropriately.
II-B The Ratio of the Desired Parameters
In this section, we assume a probability distribution for the objective variables in supervised learning and propose a new criterion to evaluate the expressivity of functions computable by deep neural networks under that assumption. For simplicity, we restrict the dimension of the objective variables to 1. Let the probability distribution of the objective variable $y$ given the explanatory variables $x$ be a normal distribution such as
$$ p(y \mid x) = \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\!\left( -\frac{(y - f^{*}(x))^{2}}{2\sigma^{2}} \right), \qquad (1) $$
where $f^{*}$ is a continuous function and $\sigma^{2}$ is the variance. The goal of deep learning is to express the relation between explanatory variables and objective variables by approximating $f^{*}$ with deep neural networks in the set $\{ f_{\theta} : \theta \in \Theta \}$. Therefore, we call $f^{*}$ the target function in this paper.
When approximating the target function $f^{*}$ by deep neural networks $f_{\theta}$, it is desirable for the set $\{ f_{\theta} : \theta \in \Theta \}$ to contain many functions close to $f^{*}$. However, few studies have investigated the volume of functions close to $f^{*}$ in this set, although some studies prove that at least one function close to $f^{*}$ exists in it, e.g., [7, 10]. Therefore, we define a new criterion to investigate this, independently of the efficiency of learning, as follows.
Definition 1.
Let $f_{\theta}$ be a deep neural network. Let $f^{*}$ be the target function defined in (1). Moreover, we assume that the explanatory variables $x$ are random variables with a probability density function $p(x)$. Then we define the ratio of the desired parameters between the set $\{ f_{\theta} : \theta \in \Theta \}$ and $f^{*}$ as
(2)
where
(3)
The quantity in (3) represents the distance between the deep neural network $f_{\theta}$ and the target function $f^{*}$. Moreover, (2) calculates the ratio of the values of $\theta$ which make that distance small. Therefore, we regard that the larger the ratio of the desired parameters is, the more functions close to the target function the set $\{ f_{\theta} : \theta \in \Theta \}$ has.
The ratio of the desired parameters shows the volume of parameter values which give functions close to the target function. Therefore, we can know not only whether appropriate parameter values exist but also how many of them exist. If we recall learning with the gradient method, the ratio of the desired parameters may also be an indicator of the number of local solutions of the loss function.
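As a toy illustration of the criterion (not the experiment of Section III), the following sketch estimates, for a one-parameter model $f_{\theta}(x) = \theta x$ and an assumed target $f^{*}(x) = 2x$, the fraction of randomly drawn parameter values whose empirical distance to the target falls below an assumed threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.linspace(-1.0, 1.0, 200)       # fixed samples of the explanatory variable
target = 2.0 * xs                       # toy target function f*(x) = 2x

def distance(theta):
    # empirical squared distance between the model theta * x and the target
    return np.mean((theta * xs - target) ** 2)

C = 0.1                                 # illustrative closeness threshold
thetas = rng.standard_normal(100_000)   # parameters drawn at random, not learned
ratio = np.mean(np.array([distance(t) for t in thetas]) < C)
print(f"estimated ratio of desired parameters: {ratio:.4f}")
```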
II-C Deep Neural Networks and Linear Regions
II-C1 ReLU Neural Networks
We can use various kinds of non-linear functions as the activation functions $h_l$. One example is the ReLU function, defined as $\mathrm{ReLU}(x) = \max(0, x)$. ReLU neural networks have the following special property.
Lemma 1 ([10]).
Let $D$ be a bounded subset of $\mathbb{R}^{n_0}$. Let $f_{\theta}$ be a ReLU neural network with $L$ layers, $W$ parameters and $U$ units in total. Then there exists a neural network with any continuous piecewise linear activation function, having $L$ layers, a constant multiple of $W$ parameters and $2U$ units in total, that computes the same function as $f_{\theta}$ on $D$.
Lemma 1 points out that deep neural networks with any continuous piecewise linear activation function can represent any ReLU neural network. In the proof in [10], two units of the new network are assigned to each unit of the original ReLU network when the ReLU activation is replaced. Therefore, the number of units at the $l$-th layer increases to $2 n_l$. Lemma 1 is used in the proof of the theorem shown in Section II-C5.
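The idea behind Lemma 1 can be checked numerically in a special case. In the sketch below we choose the leaky ReLU as the other activation (our choice for illustration; the lemma covers any continuous piecewise linear activation) and verify that two leaky ReLU units, combined linearly, reproduce one ReLU unit exactly; the linear combination can be absorbed into the next layer's linear function, which is why only the unit count doubles.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def leaky_relu(x, a=0.1):
    # a continuous piecewise linear activation with slope a on the negative side
    return np.where(x >= 0, x, a * x)

def relu_from_two_leaky_units(x, a=0.1):
    # x itself can be recovered from the two units: x = (s(x) - s(-x)) / (1 + a)
    u1, u2 = leaky_relu(x, a), leaky_relu(-x, a)
    x_rec = (u1 - u2) / (1.0 + a)
    # then ReLU(x) = (s(x) - a*x) / (1 - a); the outer combination is linear
    return (u1 - a * x_rec) / (1.0 - a)

x = np.linspace(-3, 3, 1001)
print(np.max(np.abs(relu(x) - relu_from_two_leaky_units(x))))  # ~0 up to rounding
```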
II-C2 Fineness of Linear Regions
The ReLU function is a piecewise linear function whose slope changes at the origin. If a neural network has piecewise linear activation functions like this, the functions $g_l$ are linear and the functions $h_l \circ g_l$ are piecewise linear. Therefore, a neural network whose activation functions are piecewise linear is itself a piecewise linear function. Based on this fact, we introduce the notion of a linear region to consider the flexibility of deep neural networks with piecewise linear activation functions. It is defined as follows.
Definition 2 ([11]).
Let $D$ be a bounded subset of $\mathbb{R}^{n_0}$. Let $f$ be a piecewise linear function on $D$. Let $R$ be an $n_0$-dimensional interval in $D$. Let $f|_{R}$ be the function given by restricting the input domain of $f$ from $D$ to $R$. We say that $R$ is a linear region of $f$ if the following conditions hold.
• $f|_{R}$ is linear.
• For any subset $R' \subseteq D$ such that $R \subsetneq R'$, $f|_{R'}$ is not linear.
Example 1.
For example, let $f$ be the function in Fig. 2. Three of the intervals indicated in the figure are linear regions of $f$, while the remaining indicated interval is not a linear region.
In this paper, the set of linear regions of $f$ on $D$ is denoted by $\mathcal{L}_D(f)$. We note that
$$ \bigcup_{R \in \mathcal{L}_D(f)} R = D \qquad (4) $$
when $f$ is a piecewise linear function.
There have been some studies that evaluate the flexibility of deep neural networks by the number of linear regions [12, 11, 13]. However, there is room for improvement in evaluating flexibility by the number of linear regions when we want to know whether the linear regions are scattered uniformly. The number of linear regions cannot ensure that many linear regions do not gather in one particular area of the input domain. This problem is illustrated in Fig. 3: one of the two functions shown there does not keep its flexibility in the left-side area, yet the two functions are regarded as having similar flexibility when evaluated by the number of linear regions. The same problem arises when flexibility is evaluated by the volume of the boundary between two linear regions, as proposed in [14]. Therefore, we propose a new criterion to evaluate the flexibility of deep neural networks. It is defined as follows.
Definition 3.
Let $D$ be a bounded subset of $\mathbb{R}^{n_0}$. We define the fineness of linear regions of a piecewise linear function $f$ on $D$ such as
$$ F_D(f) = \max_{R \in \mathcal{L}_D(f)} \mu(R), \qquad (5) $$
where $\mu$ denotes the Lebesgue measure.
Example 2.
We consider the example function $f$ in Fig. 2. In this one-dimensional example, the fineness of linear regions $F_D(f)$ is the length of the largest of the linear regions of $f$ shown in the figure. We regard the linear regions of $f$ as finer the smaller the value of $F_D(f)$ is. The fineness of linear regions measures the size of the maximal linear region of a piecewise linear function, that is, how finely the least flexible area of the input domain is partitioned. Therefore, this criterion tells us whether the linear regions are scattered finely and uniformly over the input domain or not, which the number of linear regions alone cannot tell us. In the case of neural networks whose activation functions are piecewise linear, the fineness of linear regions changes when the values of $\theta$ change. Therefore, we regard the set $\{ f_{\theta} : \theta \in \Theta \}$ as more flexible the smaller the attainable value of $F_D(f_{\theta})$ is.
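As a one-dimensional illustration of Definition 3 (the breakpoint locations below are assumed for the example), the following sketch computes the fineness of linear regions from the points where the slope changes, and shows that two functions with the same number of linear regions can have very different fineness.

```python
import numpy as np

def fineness_from_breakpoints(breaks, domain=(0.0, 1.0)):
    # F_D(f) for a one-dimensional piecewise linear function on `domain`,
    # given the points where its slope changes: the length of the largest region
    pts = np.concatenate(([domain[0]], np.sort(breaks), [domain[1]]))
    return np.max(np.diff(pts))

# two functions with the same number of linear regions (five each) ...
uniform_breaks   = np.array([0.2, 0.4, 0.6, 0.8])      # regions spread evenly
clustered_breaks = np.array([0.05, 0.10, 0.15, 0.20])  # regions crowded on the left

# ... but very different fineness: 0.2 versus 0.8
print(fineness_from_breakpoints(uniform_breaks))
print(fineness_from_breakpoints(clustered_breaks))
```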
For the readers who are interested in the relation between the number of linear regions and the fineness of linear regions, we show the following fact.
Corollary 1.
Let $D$ be a bounded subset of $\mathbb{R}^{n_0}$. Let $f$ be a piecewise linear function on $D$. The fineness of linear regions and the number of linear regions satisfy
$$ F_D(f) \ge \frac{\mu(D)}{\lvert \mathcal{L}_D(f) \rvert}. $$
Proof of Corollary 1.
From (4), $\mu(D) \le \sum_{R \in \mathcal{L}_D(f)} \mu(R) \le \lvert \mathcal{L}_D(f) \rvert \, F_D(f)$, which gives the claim. ∎
II-C3 The Partial Order of Piecewise Linear Functions
Next, we define a partial order between two piecewise linear functions as follows.
Definition 4.
Let $D$ be a bounded subset of $\mathbb{R}^{n_0}$. Let $f_1$ and $f_2$ be two piecewise linear functions on $D$. We say that $f_1$ is $k$ times finer than $f_2$, where $k$ is a positive integer, if the following statements hold.
• For any linear region of $f_2$, there exist some linear regions of $f_1$ that satisfy (6).
• For any linear region of $f_1$ and any linear region of $f_2$, (7) holds.
Example 3.
For example, let $f_1$ and $f_2$ be the two functions in Fig. 2. The relation of Definition 4 holds between them, which can be verified by decomposing each linear region of $f_2$ into linear regions of $f_1$.
Equation (6) claims that the linear regions of $f_1$ are finer than those of $f_2$, and (7) claims that the linear regions of $f_1$ are finer than those of $f_2$ by a factor of $k$ or more. In fact, we can say the following.
Lemma 2.
Let $D$ be a bounded subset of $\mathbb{R}^{n_0}$. Let $f_1$ and $f_2$ be two piecewise linear functions on $D$ such that $f_1$ is $k$ times finer than $f_2$ in the sense of Definition 4. Then the fineness of their linear regions satisfies $F_D(f_1) \le F_D(f_2) / k$.
Proof of Lemma 2.
∎
II-C4 Linear Regions and Identification of Input Subsets
Montufar et al. [11] define the identification of input subsets as follows.
Definition 5 ([11]).
We say that a map $h$ identifies the subsets $S_1, \dots, S_k$ onto $S$ if it maps them to a common image $S$. We also say that $S_1, \dots, S_k$ are identified onto $S$ by $h$.
Example 4.
For example, the four quadrants of the two-dimensional Euclidean space are identified onto the non-negative quadrant by the coordinate-wise absolute value function $(x_1, x_2) \mapsto (\lvert x_1 \rvert, \lvert x_2 \rvert)$.
If $S_1, \dots, S_k$ are all of the subsets identified onto $S$ by $h$, we can say that
(8)
Montufar et al. [11] interpret (8) as meaning that the common subset $S$ is replicated onto $S_1, \dots, S_k$. We can prove the following lemma related to Definition 5.
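A minimal numerical illustration of Example 4 (the sample point is arbitrary): the coordinate-wise absolute value maps the four copies of a point, one in each quadrant, to a common image in the non-negative quadrant.

```python
import numpy as np

def identify(points):
    # coordinate-wise absolute value: maps each quadrant onto the non-negative quadrant
    return np.abs(points)

mirrored = np.array([[ 1.5,  0.7],
                     [-1.5,  0.7],
                     [ 1.5, -0.7],
                     [-1.5, -0.7]])   # the same point reflected into each quadrant
print(identify(mirrored))             # all four rows coincide: the quadrants are identified
```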
Lemma 3.
Let $D$ be a bounded subset of $\mathbb{R}^{n_0}$ and let $E$ be a bounded subset of another Euclidean space. Let $h : D \to E$ be a piecewise linear function identifying all its linear regions onto $E$. Let $f$ be a piecewise linear function on $E$. The following holds.
Proof of Lemma 3.
For any linear region and , is a linear region of and
Since is linear, the jacobian of satisfies . Therefore, from the property of integration by substitution with ,
(9)
From (9),
(10)
Moreover,
∎
II-C5 The Relation Between the Depth of Neural Networks and the Fineness of Linear Regions
We can show that increasing the number of layers is more effective than increasing the number of units per layer for improving the expressivity of neural networks, by proving the following theorem.
Theorem 1.
Let $f_{\theta}$ be a neural network with any continuous piecewise linear activation function, having $L$ layers and $n_l$ units at the $l$-th layer, where $n_l \ge 2 n_0$ for any $l$. We can say that
(11)
If $n_l = n$ for any $l$, the right side of (11) decreases exponentially in the depth $L$ and polynomially in $n$, as we can see from Theorem 1. In other words, $f_{\theta}$ can express functions whose fineness of linear regions is that small if $\theta$ takes appropriate values. Therefore, we consider that the flexibility of the set of deep neural networks grows exponentially in $L$ and polynomially in $n$; a one-dimensional illustration of this effect is given after the proof of Theorem 1. We use the following lemma in order to prove Theorem 1.
Lemma 4.
Let $f_{\theta}$ be a ReLU neural network with $L$ layers and $n_l$ units at the $l$-th layer, where $n_l \ge n_0$ for any $l$. We can say that
Proof of Theorem 1.
Let $f'_{\theta'}$ be a ReLU neural network with $L$ layers and $\lfloor n_l / 2 \rfloor$ units at the $l$-th layer for any $l$. From Lemma 4, there exists $\theta'$ such that
Moreover, from Lemma 1, there exists a neural network with any continuous piecewise linear activation function, having $L$ layers and $2 \lfloor n_l / 2 \rfloor \le n_l$ units at the $l$-th layer, that computes the same function as $f'_{\theta'}$. It satisfies
∎
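As a side illustration of the exponential effect of depth established by Theorem 1, the following sketch uses a classical one-dimensional construction that is in the same spirit as, though simpler than, the network built in the proof of Lemma 4 below: composing a two-unit ReLU hat layer $d$ times yields $2^{d}$ linear regions of equal length on $[0,1]$, so the fineness shrinks like $2^{-d}$, whereas widening a single layer shrinks it only polynomially.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def hat(x):
    # a single ReLU layer with two units realizes the hat map on [0, 1]:
    # hat(x) = 2x on [0, 1/2] and 2 - 2x on [1/2, 1]
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def regions_and_fineness(f, n=200_000):
    # detect slope changes of a one-dimensional piecewise linear function on [0, 1]
    xs = np.linspace(0.0, 1.0, n + 1)
    slopes = np.diff(f(xs)) / np.diff(xs)
    boundaries = xs[1:-1][np.abs(np.diff(slopes)) > 0.5]
    pts = np.concatenate(([0.0], boundaries, [1.0]))
    return len(pts) - 1, np.max(np.diff(pts))

def depth_d_network(x, d):
    # composing the hat layer d times gives a depth-d ReLU network
    z = x
    for _ in range(d):
        z = hat(z)
    return z

for d in (1, 2, 3, 4, 5):
    regions, fineness = regions_and_fineness(lambda x: depth_d_network(x, d))
    print(d, regions, fineness)   # 2**d regions, fineness = 2**(-d)
```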
Now, we prove Lemma 4.
Proof of Lemma 4.
We prove Lemma 4 in two steps, as follows.
Step 1. Construct a specific ReLU neural network which recursively calculates the functions defined below with common parameters.
Step 2. Evaluate the fineness of linear regions of the constructed network by applying Lemma 2 and Lemma 3.
Step 1. Let an integer satisfy and . We define a -dimensional function as
(14)
where is the -th coordinate of and is the -th coordinate of . We define a -dimensional function as
where is the -th coordinate of . Furthermore, we define a -dimensional function as
where is the -th coordinate of . We construct the neural network using these functions as follows.
At first, let the number of units at the -th layer be equal to for any . We separate the set of units at the -th layer, , into subsets and remainder units. Each of the subsets contains units. Let the values of the parameters of the remainder units be . For any , let the non-linear activation function be
where is the -th coordinate of .
Next, we consider the linear functions . When , for the -th unit of the -th subset, (, ), let the linear function be
When , for the -th unit of the -th subset, (, ), let the linear function be
When , let the linear function be
Then the constructed network is represented as follows. Since are common, we express them as .
Step 2. Let . We can re-represent our network as
We consider the fineness of linear regions and as well as the identified regions of in order to apply Lemma 2 and Lemma 3. Since is linear on ,
On the other hand, the -th coordinate of is described such as
Figure 4 shows the graph of . From Fig. 4, we can say that identifies its linear regions
onto . In other words, identifies these linear regions on the each coordinate . Therefore, it identifies its linear regions
onto . Since these linear regions are -dimensional hypercubes with the side length of for any , the fineness of linear regions satisfies
Then we can say that
from Lemma 3. Therefore, from Lemma 2,
(15)
In a similar manner, from Lemma 3,
Therefore, from Lemma 2 and (15),
If we inductively repeat these operations, we obtain the following inequality.
If the number of units at each layer is different from each other and satisfies , we may change to
In this case, the fineness of linear regions satisfies
∎
III Experiments
Layer | Network 1 | Network 2
---|---|---
0 | |
1 | |
2 | |
3 | |
4 | |
5 | |
6 | |
We compared the two types of neural networks shown in Table I in terms of expressivity. The activation function in both neural networks is the ReLU function. Although their depths are different, the total number of units is 22 in both cases. We approximately calculated the ratio of the desired parameters and the fineness of linear regions by computer simulations in order to evaluate the expressivity of the two neural networks.
III-A The Calculation of the Ratio of the Desired Parameters
We calculated the ratio of the desired parameters by computer simulations, restricting the input and output dimensions to one for simplicity. It is difficult to calculate (3) exactly by computer simulations. Therefore, we instead calculated an empirical version of the distance over finite samples $x_1, \dots, x_n$. For the target function $f^{*}$, we adopted two kinds of functions: one was the sine function and the other was the Weierstrass function shown in Fig. 5. The sample points $x_1, \dots, x_n$ were fixed in advance.
Moreover, it is also difficult to calculate (2) exactly by computer simulations. Therefore, we approximated it by random sampling of the values of $\theta$. We generated values of $\theta$ from the standard normal distribution and calculated the empirical distance for each generated $\theta$, repeating this a fixed number of times. We then checked whether each empirical distance was less than a threshold $C$, where $C$ was set separately for the sine target and for the Weierstrass target. We counted the number of samples whose empirical distance was below $C$ and obtained the ratio of the desired parameters by dividing this count by the total number of samples. The following summarizes the algorithm which approximately calculates the ratio of the desired parameters.
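The original algorithm listing is not reproduced here; the following Python sketch implements the procedure as described, under several assumptions: the architecture is a small stand-in rather than the networks of Table I, the empirical distance is taken to be a mean squared error over the fixed sample points, and the threshold $C$, the sample range, and the number of trials are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.linspace(-np.pi, np.pi, 200)      # fixed sample points (illustrative range)
target = np.sin(xs)                        # sine target; the Weierstrass case is analogous

layer_sizes = [1, 4, 4, 1]                 # stand-in architecture, not the networks of Table I

def sample_theta():
    # random sampling of theta from the standard normal distribution
    return [(rng.standard_normal((m, n)), rng.standard_normal(m))
            for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(theta, x):
    z = x.reshape(1, -1)
    for l, (W, b) in enumerate(theta):
        z = W @ z + b[:, None]             # linear function g_l
        if l < len(theta) - 1:
            z = np.maximum(z, 0.0)         # ReLU activation
    return z.ravel()

def empirical_distance(theta):
    # mean squared error over the fixed samples, standing in for (3)
    return np.mean((forward(theta, xs) - target) ** 2)

C, trials = 0.5, 10_000                    # placeholder threshold and trial count
hits = sum(empirical_distance(sample_theta()) < C for _ in range(trials))
print(f"estimated ratio of the desired parameters: {hits / trials:.4f}")
```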
We ran the above algorithm using the neural networks in Table I as $f_{\theta}$. Figure 6 and Fig. 7 show the results of the calculation. From these results, we can see that Network 1 can express more functions close to the sine function and to the Weierstrass function than Network 2. This suggests that increasing the number of layers is more effective than increasing the number of units per layer for improving the expressivity of neural networks.
III-B The Calculation of the Fineness of Linear Regions
In order to confirm Theorem 1, we approximately calculated the minimum value of the fineness of linear regions by computer simulations. We must detect the linear regions in the input domain of a deep neural network in order to know the fineness of its linear regions. We did this by calculating the change of the slope of the function computed by the deep neural network.
We executed the calculation with the input and output dimensions restricted to one for simplicity. We generated an arithmetic sequence $x_0, x_1, \dots, x_n$ and calculated the gradient $d_i$ of the line connecting $(x_i, f_{\theta}(x_i))$ and $(x_{i+1}, f_{\theta}(x_{i+1}))$ such as
$$ d_i = \frac{f_{\theta}(x_{i+1}) - f_{\theta}(x_i)}{x_{i+1} - x_i}. $$
Then we calculated the inflection of the gradient such as
$$ \Delta_i = \lvert d_{i+1} - d_i \rvert. $$
We regarded $x_{i+1}$ as a boundary of linear regions if $\Delta_i$ was larger than a threshold value, and we took the size of the maximal interval between two boundaries as the fineness of linear regions. As mentioned in Section I, we must calculate it independently of the efficiency of learning. Therefore, we searched for the minimum value of the fineness by generating the values of $\theta$ randomly and repeatedly, not by learning. The following summarizes the algorithm to approximately calculate the minimum fineness of linear regions.
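The original algorithm listing is not reproduced here either; the following Python sketch follows the description above and below (a threshold of 0.5 on the inflection, a counter that grows while the network stays linear and resets at each boundary, and a list of region sizes whose maximum is divided by $n$). The stand-in architecture, the input domain $[0,1]$, the grid size $n$, and the number of random trials are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [1, 4, 4, 1]                 # stand-in architecture, not the networks of Table I
n = 10_000                                 # number of grid intervals on the input domain
xs = np.linspace(0.0, 1.0, n + 1)          # arithmetic sequence x_0, ..., x_n

def sample_theta():
    return [(rng.standard_normal((m, k)), rng.standard_normal(m))
            for k, m in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(theta, x):
    z = x.reshape(1, -1)
    for l, (W, b) in enumerate(theta):
        z = W @ z + b[:, None]
        if l < len(theta) - 1:
            z = np.maximum(z, 0.0)
    return z.ravel()

def fineness(theta, threshold=0.5):
    y = forward(theta, xs)
    d = np.diff(y) / np.diff(xs)           # gradients of the connecting lines
    infl = np.abs(np.diff(d))              # inflection of the gradient
    sizes, count = [], 0
    for v in infl:
        if v > threshold:                  # boundary of linear regions detected
            sizes.append(count)
            count = 0
        else:
            count += 1                     # the network stays linear here
    sizes.append(count)
    return max(sizes) / n                  # convert the largest count to an interval length

# search the minimum fineness over randomly generated parameters, not by learning
best = min(fineness(sample_theta()) for _ in range(1000))
print(f"approximate minimum fineness of linear regions: {best:.5f}")
```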
In the above algorithm, the threshold value of the inflection is set to 0.5. If $\Delta_i > 0.5$, the point $x_{i+1}$ is regarded as a boundary of linear regions. The counter increases while the deep neural network stays linear and returns to 0 at each boundary of linear regions. The list stores the sizes of the detected linear regions, and the maximum value in the list gives the fineness of linear regions. However, the maximum value in the list must be divided by $n$ because the difference between $x_i$ and $x_{i+1}$ is $1/n$. On the other hand, we also calculated the value of the right side of (11).
We ran the above algorithm using the neural networks in Table I as $f_{\theta}$. The results of the simulations are consistent with Theorem 1. The minimum value of the fineness of linear regions of Network 1 is 0.00004 and that of Network 2 is 0.06345, where the values of $\theta$ are restricted to the finite sampling. On the other hand, the value of the right side of (11) is 0.03125 for Network 1 and 0.1 for Network 2. From these results, we can confirm Theorem 1 and say that increasing the number of layers is more effective than increasing the number of units per layer when we want neural networks to be more flexible.
IV Conclusion
In this paper, we proposed two new criteria to evaluate the expressivity of functions computable by deep neural networks independently of the efficiency of learning. The first criterion, the ratio of the desired parameters, measures the approximation accuracy of deep neural networks with respect to the target function. The second criterion, the fineness of linear regions, measures a property of the linear regions of functions computable by deep neural networks. Furthermore, using the two criteria, we showed that increasing the number of layers is more effective than increasing the number of units per layer for improving the expressivity of deep neural networks. It is hoped that our study will contribute to a better understanding of the advantage of deepening neural networks.
Acknowledgment
This research was supported by JSPS Grants-in-Aid for Scientific Research JP17K00316, JP17K06446, JP18K11585, JP19K04914.
References
- [1] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv: 1409.1556, 2014.
- [2] A. Veit, M. Wilber, and S. Belongie, “Residual networks behave like ensembles of relatively shallow networks,” Advances in Neural Information Processing Systems 29, pp. 550–558, 2016.
- [3] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber, “Gradient flow in recurrent nets: The difficulty of learning long-term dependencies,” A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press, 2001.
- [4] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
- [5] D. Yarotsky, “Optimal approximation of continuous functions by very deep relu networks,” Proceedings of Machine Learning Research, vol. 75, pp. 639–649, 2018.
- [6] M. Telgarsky, “Benefits of depth in neural networks,” arXiv preprint arXiv: 1602.04485, 2016.
- [7] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, 1989.
- [8] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.
- [9] H. N. Mhaskar, “Neural networks for optimal approximation of smooth and analytic functions,” Neural Computation, vol. 8, no. 1, pp. 164–177, 1996.
- [10] D. Yarotsky, “Error bounds for approximations with deep relu networks,” Neural Networks, vol. 94, pp. 103–114, 2017.
- [11] G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio, “On the number of linear regions of deep neural networks,” Advances in Neural Information Processing Systems 27, pp. 2924–2932, 2014.
- [12] R. Pascanu, G. F. Montufar, and Y. Bengio, “On the number of inference regions of deep feed forward networks with piece-wise linear activations,” arXiv preprint arXiv: 1312.6098, 2013.
- [13] T. Serra, C. Tjandraatmadja, and S. Ramalingam, “Bounding and counting linear regions of deep neural networks,” Proceedings of Machine Learning Research, vol. 80, pp. 4558–4566, 2018.
- [14] B. Hanin and D. Rolnick, “Complexity of linear regions in deep networks,” Proceedings of Machine Learning Research, vol. 97, pp. 2596–2604, 2019.