
On the High Symmetry of Neural Network Functions

Umberto Michelucci
Research and Development
TOELT LLC
Switzerland
[email protected]
Department of Computer Science
Lucerne University of Applied Sciences and Arts
Switzerland
https://www.toelt.ai
Abstract

Training neural networks means solving a high-dimensional optimization problem. Normally the goal is to minimize a loss function that depends on what is called the network function, in other words the function that gives the network output for a given input. This function depends on a large number of parameters, also known as weights, whose number is determined by the network architecture. In general the goal of this optimization problem is to find the global minimum of the loss function. In this paper it is discussed how, due to the way neural networks are designed, the network function presents a very large symmetry in parameter space. This work shows that the network function has a number of equivalent minima, in other words minima that give the same value of the loss function and the exact same output, which grows factorially with the number of neurons in each layer for feed-forward neural networks, or with the number of filters in each layer for convolutional neural networks. When the number of neurons and layers is large, the number of equivalent minima grows extremely fast. This of course has consequences for the study of how neural networks converge to minima during training. The result is known, but in this paper a proper mathematical discussion is presented for the first time and an estimate of the number of equivalent minima is derived.

Keywords Machine Learning, Neural Networks, Symmetry

1 Introduction

Note that this short paper is not meant to review existing results or to analyze in depth the consequences of the high symmetry of neural networks. The only goal is to formalise and calculate the number of equivalent points in parameter space for a feed-forward neural network. I hope this can be helpful to someone. The result highlighted in this short paper (notably without references so far) is known, and it is therefore not a new contribution. But to the best of my knowledge no one has analyzed the problem formally and shown where this high symmetry comes from in a short and concise form. I hope the mathematics shown here can be helpful in this regard. I do not claim to be the first to have noticed this, of course, but I think this paper shows for the first time where the symmetry comes from.

I plan to add references and search for similar contributions soon, and to update this short paper accordingly.

The contribution of this short paper is the Symmetry Theorem for Feed Forward Neural Networks, or in other words the statement that for a FFNN with $L$ layers the network function $\hat{y}(\theta,{\bf x})$ has a very high symmetry in $\theta$-space. With $\theta$ we indicate the vector of all the network parameters (or weights), and with ${\bf x}$ a generic (possibly multi-dimensional) input observation. More precisely, there are $\Lambda=n_{1}!\,n_{2}!\cdots n_{L-1}!$ (with $n_{i}$ the number of neurons in layer $i$) sets of parameters $\theta$ that give the exact same value of the network function for the same input ${\bf x}$. As a direct consequence, the loss function $L$ cannot have just one absolute minimum, but always $\Lambda$ of them, all with the same value of $L$. The paper is organised as follows. In Section 2 the formalism and notation are discussed. In Section 3 the conclusions and important further research directions are discussed.

2 Formalism

In this section the case of feed forward neural networks (FFNNs) is discussed: the definition of equivalent minima is given, and their exact number, which depends on the network architecture, is calculated.

Let us consider a simple FFNN. An example with just two hidden layers is depicted in Figure 1. The symbol $w_{i,j}^{[l]}$ indicates the weight between neuron $i$ in layer $l-1$ and neuron $j$ in layer $l$, where by convention layer 0 is the input layer.

Figure 1: An example of a simple FFNN with two hidden layers. In this diagram only some of the weights (parameters) are depicted for clarity. The symbol $w_{i,j}^{[l]}$ indicates the weight between neuron $i$ in layer $l-1$ and neuron $j$ in layer $l$, where by convention layer 0 is the input layer.

Normally each neuron also has a bias as input, but for simplicity we will neglect the biases, since the final general result is not affected by this simplification. We will indicate the output of a generic neuron $i$ in layer $l$ with

a_{i}^{[l]}=f^{[l]}\left(\sum_{j=1}^{n_{l-1}}w_{j,i}^{[l]}\,a_{j}^{[l-1]}\right)   (1)

where $f^{[l]}$ is the activation function in layer $l$. For simplicity we will assume that all layers, except the output one, have the same activation function $f$. $n_{l}$ is the number of neurons in layer $l$, and $w_{j,i}^{[l]}$ indicates, as already mentioned, the weight between neuron $j$ in layer $l-1$ and neuron $i$ in layer $l$. By convention layer 0 is the input layer, and therefore $a_{j}^{[0]}\equiv x_{j}$, the $j$-th component (or feature) of a generic input observation. In matrix form, leaving the elementwise application of $f^{[l]}$ implicit for compactness (an elementwise activation commutes with any permutation of the entries of its argument, so this simplification does not affect the argument that follows), the output of layer $l$ can be written as

\underbrace{\begin{pmatrix}a_{1}^{[l]}\\ a_{2}^{[l]}\\ \vdots\\ a_{n_{l}}^{[l]}\end{pmatrix}}_{\displaystyle{\bf a}^{[l]}}=\underbrace{\begin{pmatrix}w_{1,1}^{[l]}&w_{2,1}^{[l]}&\cdots&w_{n_{l-1},1}^{[l]}\\ w_{1,2}^{[l]}&w_{2,2}^{[l]}&\cdots&w_{n_{l-1},2}^{[l]}\\ \vdots&\vdots&\ddots&\vdots\\ w_{1,n_{l}}^{[l]}&w_{2,n_{l}}^{[l]}&\cdots&w_{n_{l-1},n_{l}}^{[l]}\end{pmatrix}}_{\displaystyle{\bf W}^{[l]}}\underbrace{\begin{pmatrix}a_{1}^{[l-1]}\\ a_{2}^{[l-1]}\\ \vdots\\ a_{n_{l-1}}^{[l-1]}\end{pmatrix}}_{\displaystyle{\bf a}^{[l-1]}}   (2)

or, by using the matrix symbols ${\bf a}^{[l]}$ and ${\bf W}^{[l]}$,

{\bf a}^{[l]}={\bf W}^{[l]}{\bf a}^{[l-1]}   (3)
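As a concrete illustration, the following minimal NumPy sketch (with arbitrary layer widths and a $\tanh$ activation chosen purely for the example) checks that the neuron-by-neuron sum of Eq. (1) and the matrix form of Eqs. (2)-(3) give the same layer output; row $i$ of ${\bf W}^{[l]}$ collects the weights $w_{j,i}^{[l]}$ feeding into neuron $i$.

import numpy as np

rng = np.random.default_rng(0)
f = np.tanh                          # elementwise activation (arbitrary choice)

n_prev, n_l = 4, 3                   # neurons in layers l-1 and l (arbitrary widths)
a_prev = rng.normal(size=n_prev)     # a^[l-1]
W = rng.normal(size=(n_l, n_prev))   # W^[l]; row i holds the weights w_{j,i}^[l] into neuron i

# Eq. (1): neuron-by-neuron weighted sum followed by the activation
a_eq1 = np.array([f(sum(W[i, j] * a_prev[j] for j in range(n_prev)))
                  for i in range(n_l)])

# Eqs. (2)-(3): matrix form, with f applied elementwise to the product
a_eq3 = f(W @ a_prev)

print(np.allclose(a_eq1, a_eq3))     # True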

Now let us consider the example in Figure 1, and suppose we switch neurons 1 and 2 in layer 1. Suppose we do that by also switching the respective connections, or in other words by switching the respective weights as well. To clarify this concept: after switching neurons 1 and 2 in layer 1, one ends up with the network depicted in Figure 2, where the irrelevant parts have been colored in light gray to highlight only the changes that the switch has generated.

Figure 2: An example of a simple FFNN with two hidden layers. In this diagram neurons 1 and 2 in layer 1 have been switched. The connections, with their respective weights, have also been switched.

The neuron switch, performed as described, has the effect of switching the first and second entries of the output vector ${\bf a}^{[1]}$. Note that this can equivalently be achieved by switching rows 1 and 2 of ${\bf W}^{[1]}$. To summarize, switching two generic neurons $i$ and $j$ in layer $l$ as described is equivalent to switching rows $i$ and $j$ of the weight matrix ${\bf W}^{[l]}$. Switching neurons amounts to a reordering of the neurons in layer $l$, and there are $n_{l}!$ possible ways of ordering the $n_{l}$ neurons in layer $l$. It is now important to note that switching two neurons as described will in general give a different output of the network function. But we can easily correct this by simply switching columns $i$ and $j$ of ${\bf W}^{[l+1]}$. To see why this is the case, let us consider switching neurons $i$ and $j$ in layer $l$ as described. This results in an output vector ${\bf a^{\prime}}^{[l]}$ given by

{\bf a^{\prime}}^{[l]}=\begin{pmatrix}a_{1}^{[l]}\\ \vdots\\ a_{j}^{[l]}\\ \vdots\\ a_{i}^{[l]}\\ \vdots\\ a_{n_{l}}^{[l]}\end{pmatrix}   (4)

where entries $i$ and $j$ have been swapped ($a_{j}^{[l]}$ now sits in position $i$ and $a_{i}^{[l]}$ in position $j$). Now let us consider what happens to the output of layer $l+1$. If ${\bf W}^{[l+1]}$ is not changed, the output vector ${\bf a}^{[l+1]}$ will of course be different from the one before the swap of the neurons. But by swapping columns $i$ and $j$ of ${\bf W}^{[l+1]}$, and indicating the new weight matrix with ${\bf W}_{c}^{[l+1]}$, the output vector ${\bf a}^{[l+1]}$ remains unchanged. In fact

\underbrace{\begin{pmatrix}a_{1}^{[l+1]}\\ a_{2}^{[l+1]}\\ \vdots\\ a_{n_{l+1}}^{[l+1]}\end{pmatrix}}_{\displaystyle{\bf a}^{[l+1]}}=\underbrace{\begin{pmatrix}w_{1,1}^{[l+1]}&\cdots&w_{j,1}^{[l+1]}&\cdots&w_{i,1}^{[l+1]}&\cdots&w_{n_{l},1}^{[l+1]}\\ w_{1,2}^{[l+1]}&\cdots&w_{j,2}^{[l+1]}&\cdots&w_{i,2}^{[l+1]}&\cdots&w_{n_{l},2}^{[l+1]}\\ \vdots&&\vdots&&\vdots&&\vdots\\ w_{1,n_{l+1}}^{[l+1]}&\cdots&w_{j,n_{l+1}}^{[l+1]}&\cdots&w_{i,n_{l+1}}^{[l+1]}&\cdots&w_{n_{l},n_{l+1}}^{[l+1]}\end{pmatrix}}_{\displaystyle{\bf W}_{c}^{[l+1]}}\underbrace{\begin{pmatrix}a_{1}^{[l]}\\ \vdots\\ a_{j}^{[l]}\\ \vdots\\ a_{i}^{[l]}\\ \vdots\\ a_{n_{l}}^{[l]}\end{pmatrix}}_{\displaystyle{\bf a^{\prime}}^{[l]}}   (5)

This means that by simultaneously swapping rows $i$ and $j$ of ${\bf W}^{[l]}$ (indicating the new weight matrix with ${\bf W}_{r}^{[l]}$) and columns $i$ and $j$ of ${\bf W}^{[l+1]}$, the output of the network does not change. We have effectively found two sets of weights

\{{\bf W}^{[1]},{\bf W}^{[2]},\ldots,{\bf W}^{[L]}\}

and

\{{\bf W}^{[1]},\ldots,{\bf W}_{r}^{[l]},{\bf W}_{c}^{[l+1]},\ldots,{\bf W}^{[L]}\}

that will give the exact same value of the loss function and the same predictions. As discussed, there are $n_{l}!$ possible ways of ordering the neurons in layer $l$ that give a set of weights with the property of keeping the neural network output and loss function unchanged. One can say that the neural network is invariant under neuron swaps performed as described in the text. This property holds for all layers except the input and the output, i.e. for layers $1,\ldots,L-1$. So in total one can create $\Lambda=n_{1}!\,n_{2}!\cdots n_{L-1}!$ sets of weights that give the exact same predictions and loss function value. In a small network with three hidden layers, each with 128 neurons, the number of equivalent sets of weights is $128!^{3}$, which is approximately $5.7\cdot 10^{646}$. This is a very large number, indicating that in parameter space the loss function has this many equivalent minima, where again by equivalent minima we mean locations in parameter space that give the same value of the loss function and the exact same predictions.
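Both observations can be checked directly. The following minimal NumPy sketch (the layer widths, random weights and $\tanh$ activations are arbitrary choices for the example) verifies that a row swap in ${\bf W}^{[1]}$ combined with the corresponding column swap in ${\bf W}^{[2]}$ leaves the network output unchanged, and it also reproduces the estimate $128!^{3}\approx 5.7\cdot 10^{646}$.

import math
import numpy as np

rng = np.random.default_rng(1)
f = np.tanh  # elementwise hidden activation (arbitrary choice)

# A FFNN with 4 inputs, two hidden layers of 5 and 6 neurons, 2 outputs (arbitrary)
W1 = rng.normal(size=(5, 4))   # W^[1]
W2 = rng.normal(size=(6, 5))   # W^[2]
W3 = rng.normal(size=(2, 6))   # W^[3], output layer (taken linear here)
x  = rng.normal(size=4)

def network(x, W1, W2, W3):
    return W3 @ f(W2 @ f(W1 @ x))

# Generalized i-j neuron switch in layer 1: swap rows of W^[1] and columns of W^[2]
i, j = 0, 3
W1r = W1.copy(); W1r[[i, j], :] = W1r[[j, i], :]   # W_r^[1]
W2c = W2.copy(); W2c[:, [i, j]] = W2c[:, [j, i]]   # W_c^[2]

print(np.allclose(network(x, W1, W2, W3), network(x, W1r, W2c, W3)))  # True

# Number of equivalent weight sets for three hidden layers of 128 neurons each
Lam = math.factorial(128) ** 3
s = str(Lam)  # Lam is far too large for a float, so read mantissa/exponent off its digits
print(f"Lambda has {len(s)} digits, roughly {s[0]}.{s[1]}e{len(s) - 1}")  # ~5.7e646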

Let us formally define a generalized neuron switch.

Definition 2.1.

A generalized $i$-$j$ neuron switch is a transformation of the set of weights described by

\left\{\begin{matrix}{\bf W}^{[l]}\xrightarrow{R_{i}\rightarrow R_{j}}{\bf W}_{r}^{[l]}\\ {\bf W}^{[l+1]}\xrightarrow{C_{i}\rightarrow C_{j}}{\bf W}_{c}^{[l+1]}\end{matrix}\right.   (6)

where the notation $\xrightarrow{R_{i}\rightarrow R_{j}}$ indicates the switch of rows $i$ and $j$ and $\xrightarrow{C_{i}\rightarrow C_{j}}$ the switch of columns $i$ and $j$. This matrix transformation is equivalent to switching neurons $i$ and $j$ in layer $l$ as described in the text.
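Equivalently, and as a compact way to see the invariance, the generalized $i$-$j$ neuron switch can be expressed with the permutation matrix $P_{ij}$ that exchanges entries $i$ and $j$: since $P_{ij}=P_{ij}^{\top}=P_{ij}^{-1}$ and the elementwise activation satisfies $f(P_{ij}{\bf z})=P_{ij}f({\bf z})$, one can write

{\bf W}_{r}^{[l]}=P_{ij}\,{\bf W}^{[l]},\qquad {\bf W}_{c}^{[l+1]}={\bf W}^{[l+1]}P_{ij}

and therefore

{\bf W}_{c}^{[l+1]}\,f\!\left({\bf W}_{r}^{[l]}{\bf a}^{[l-1]}\right)={\bf W}^{[l+1]}P_{ij}\,P_{ij}\,f\!\left({\bf W}^{[l]}{\bf a}^{[l-1]}\right)={\bf W}^{[l+1]}\,f\!\left({\bf W}^{[l]}{\bf a}^{[l-1]}\right),

so the input to layer $l+1$, and hence the network output, is unchanged.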

One can say that $L$ and $\hat{y}$ are invariant under a generalized $i$-$j$ neuron switch. By composing generalized neuron switches, every reordering of the neurons within the hidden layers can be realised, for a total of $\Lambda=n_{1}!\,n_{2}!\cdots n_{L-1}!$ equivalent configurations in a FFNN with $L$ layers (including the output layer). Let us also define the concept of equivalent sets of parameters for a given FFNN.

Definition 2.2.

Two sets of parameters $\overline{\bf W}_{1}=\{{\bf W}_{1}^{[1]},\ldots,{\bf W}_{1}^{[L]}\}$ and $\overline{\bf W}_{2}=\{{\bf W}_{2}^{[1]},\ldots,{\bf W}_{2}^{[L]}\}$ are said to be equivalent for a given FFNN when $L(\overline{\bf W}_{1})=L(\overline{\bf W}_{2})$ and $\hat{y}(\overline{\bf W}_{1})=\hat{y}(\overline{\bf W}_{2})$ for any input ${\bf x}=(x_{1},\ldots,x_{n})$.

So the conclusion of this paper can be stated in the following theorem.

Theorem 2.1 (Symmetry Theorem for Feed Forward Neural Networks).

Given a FFNN with $L$ layers, the $i$-th of which has $n_{i}$ neurons, there exist $\Lambda=n_{1}!\,n_{2}!\cdots n_{L-1}!$ equivalent sets of weights.

The proof has been given in the previous discussion.

2.1 Convolutional Neural Networks

The same approach can be used for convolutional neural networks (CNNs). In fact, the order of the filters in each convolutional layer is not relevant, provided the corresponding channels of the following layer's kernels are reordered accordingly, and therefore CNNs also exhibit a high symmetry. This time the relevant parameters are not the numbers of neurons in each layer, but the numbers of filters in each convolutional layer. Details for CNNs will be developed by the author in a subsequent publication.
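As a small numerical illustration of this symmetry (a sketch only: the filter counts, kernel sizes and ReLU activation are arbitrary choices, and SciPy's correlate2d is used for the cross-correlations), permuting the filters of one convolutional layer together with the corresponding input channels of the next layer's kernels leaves the output feature maps unchanged:

import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(2)
relu = lambda z: np.maximum(z, 0.0)

def conv_layer(x, W):
    # x: (C, H, W); W: (F, C, k, k); 'valid' cross-correlation, no bias
    F, C = W.shape[0], W.shape[1]
    return np.stack([sum(correlate2d(x[c], W[f, c], mode="valid") for c in range(C))
                     for f in range(F)])

x  = rng.normal(size=(3, 8, 8))     # 3 input channels (arbitrary)
W1 = rng.normal(size=(5, 3, 3, 3))  # 5 filters in the first convolutional layer
W2 = rng.normal(size=(4, 5, 3, 3))  # 4 filters in the second convolutional layer

perm = rng.permutation(5)           # reorder the 5 filters of the first layer
out_ref  = conv_layer(relu(conv_layer(x, W1)), W2)
out_perm = conv_layer(relu(conv_layer(x, W1[perm])), W2[:, perm])
print(np.allclose(out_ref, out_perm))  # True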

3 Conclusions

For any point in parameter space $\overline{\bf W}=\{{\bf W}^{[1]},\ldots,{\bf W}^{[L]}\}$ there are always $\Lambda-1$ additional points in parameter space that give the same loss function value and the same network output as $\overline{\bf W}$. This highlights the very high symmetry of the network function in parameter space, as the number $\Lambda$ grows as $(n!)^{L-1}$ when each of the $L-1$ hidden layers has $n$ neurons, with $L$ the total number of layers. It also means that the loss function cannot have a single isolated global minimum: if one minimum exists, then there are $\Lambda$ equivalent ones according to Definition 2.2.
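To give a sense of this growth, a few values of $\Lambda$ (computed here for the illustrative case in which all $L-1$ hidden layers have the same width $n$) can be obtained with a few lines of Python:

import math

# Lambda = (n!)^(L-1) for a FFNN whose L-1 hidden layers all have n neurons
for n in (4, 16, 64):
    for L in (2, 3, 4):
        Lam = math.factorial(n) ** (L - 1)
        print(f"n = {n:3d}, L = {L}: Lambda has {len(str(Lam))} digits")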