
Minimum number of neurons in fully connected layers of a given neural network (the first approximation)

Oleg I. Berngardt
Abstract

This paper presents an algorithm for finding the minimum number of neurons in the fully connected layers of an arbitrary network solving a given problem, without training the network multiple times with different numbers of neurons. The algorithm is based on training the initial wide network using cross-validation over at least two folds. Then, by inserting a truncated singular value decomposition (SVD) autoencoder after the studied layer of the trained network, we search for the minimum number of neurons using the network in inference-only mode.

It is shown that the minimum number of neurons in a fully connected layer can be interpreted not as a network hyperparameter coupled to the other hyperparameters of the network, but as an internal (latent) property of the solution, determined by the network architecture, the training dataset, the layer position, and the quality metric used. Hence the minimum number of neurons can be estimated for each hidden fully connected layer independently. The proposed algorithm provides a first approximation of the minimum number of neurons in a layer: on the one hand, it does not guarantee that a neural network with the found number of neurons can be trained to the required quality, and on the other hand, it searches for the minimum number of neurons within a limited class of possible solutions.

The solution was tested on several datasets in classification and regression problems.

Keywords: neural networks, number of neurons, fully connected layer, SVD autoencoder

1 Introduction

In machine learning, the question often arises: how many neurons should a fully connected layer of a neural network contain so that the network can correctly solve a given problem? For fully connected networks there are general theorems that allow one to answer this question. Initially, [1, 2] showed that for a two-layer network $2N_{in}+1$ neurons suffice in the second layer (where $N_{in}$ is the input data dimension and $N_{out}=1$ is the output data dimension). More recent papers improve this estimate: for an infinitely deep network, [3] obtained the limit $N_{in}+N_{out}+2$; [4] obtained the limit $N_{in}+N_{out}$ for ReLU activations; [5] found the limit $max(N_{in}+1,N_{out})$ for ReLU activations; most recently, [6] found the limit $max(N_{in},N_{out})$ for any activations.

However, this minimum number is task-dependent, and for a specific problem it can be smaller than the estimates above. So researchers are sometimes interested (for example, in [7, 8]) in a more specific question: how many neurons should the layer of their neural network of a given architecture and depth contain to solve the problem at hand? In [9], a network information criterion (NIC) is presented to compare two trained networks and decide whether it is better to increase or decrease the number of neurons. One of the widely used methods of answering this question is to treat the number of neurons as a hyperparameter found by training the network multiple times with different numbers of neurons. Obviously, this approach is resource-intensive when training complex neural networks on large datasets.

In this paper we consider a method for estimating the minimum required number of neurons in the fully connected layers of a neural network of a given architecture solving a given problem, which does not require training the network multiple times with different numbers of neurons in the layers.

2 Deterministic view

Let us have a two-layer fully connected neural network $S$ (source) that produces targets $T=f^{(S)}(X)$ on the dataset samples $X$:

$\varphi^{\circ(1)}(W_{j,k}^{(1)}X_{i,j}+B_{k}^{(1)})=Y_{i,k}$ (1)
$\varphi^{\circ(2)}(W_{k,l}^{(2)}Y_{i,k}+B_{l}^{(2)})=T_{i,l}$ (2)

where $X\in R^{I\times J}$, $T\in R^{I\times L}$, $Y\in R^{I\times K}$; $\varphi^{\circ(i)}$, $W_{j,k}^{(i)}$, $B_{k}^{(i)}$ are the element-wise activation function, weights and biases of the neurons of the $i$-th layer, respectively; $X_{i,j}$, $Y_{i,k}$ are the vectors of input features of the first and second layers, respectively; $K$, $L$ are the numbers of neurons in the first and second layers, respectively; and the number of samples $I$ in the dataset is greater than the number of neurons in the first layer: $I>K$.
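
For concreteness, the forward pass (1)-(2) can be written in a few lines of NumPy; the shapes and the Abs/identity activations below are illustrative assumptions rather than the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, L = 1000, 64, 128, 10                 # samples, inputs, hidden width K, outputs

X = rng.normal(size=(I, J))                    # dataset X
W1, B1 = rng.normal(size=(J, K)), rng.normal(size=K)
W2, B2 = rng.normal(size=(K, L)), rng.normal(size=L)

phi1 = np.abs                                  # element-wise activation of the first layer
phi2 = lambda z: z                             # activation of the second layer (identity here)

Y = phi1(X @ W1 + B1)                          # eq. (1): first-layer outputs, shape (I, K)
T = phi2(Y @ W2 + B2)                          # eq. (2): network outputs, shape (I, L)
```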

We also have a network $D$ (destination) of identical architecture, producing results $T^{\prime}=f^{(D)}(X)$ on the same dataset $X$, and differing from network $S$ only in the number of neurons in the first layer and in the weights and biases $W$, $B$:

$\varphi^{\circ(1)}(W_{j,m}^{(\prime 1)}X_{i,j}+B_{m}^{(\prime 1)})=Y^{\prime}_{i,m}$ (3)
$\varphi^{\circ(2)}(W_{m,l}^{(\prime 2)}Y^{\prime}_{i,m}+B_{l}^{(\prime 2)})=T^{\prime}_{i,l}$ (4)

where $Y^{\prime}\in R^{I\times M}$, $M$ is the number of neurons in the first layer of $D$, and $T^{\prime}\in R^{I\times L}$.

Definition 1 (Equivalence of two networks). We call two fully connected neural networks of the same depth and with the same input and output dimensions equivalent on a dataset $X$ if they produce identical output values $T\equiv T^{\prime}$ on each sample of $X$; they may differ only in the number of neurons in each hidden layer and in the neuron coefficients $W$, $B$.

Lemma 1 (Equivalence of two-layer networks). Let the outputs of the first fully connected layers of two-layer fully connected networks $S$ and $D$ ((1) and (3), respectively) on the dataset $X$ satisfy the linear bijection:

$Y_{i,k}=Y^{\prime}_{i,m}A_{m,k}$ (5)
$Y^{\prime}_{i,m}=Y_{i,k}A_{m,k}^{(1)}$ (6)

where $A$, $A^{(1)}$ are certain matrices, and $Y_{i,k}$ and $Y^{\prime}_{i,m}$ are the outputs of the first layers of the networks $S$ and $D$, respectively. Then these networks become equivalent on the dataset $X$ with the appropriate choice of the coefficients of the second layer of one of the networks.

Proof. If the conditions (5)-(6) are fulfilled, the second layer of network $D$ can be written as:

$\varphi^{\circ(2)}(W_{m,l}^{(\prime 2)}Y_{i,k}A_{m,k}^{(1)}+B_{l}^{(\prime 2)})=T^{\prime}_{i,l}$ (7)

This equation coincides with equation (2) when substituting

$W_{k,l}^{(2)}=W_{m,l}^{(\prime 2)}A_{m,k}^{(1)}$ (8)
$B_{l}^{(2)}=B_{l}^{(\prime 2)}$ (9)

therefore, with such a choice of the coefficients of the second layer, the forecasts of network $S$ on dataset $X$ repeat the forecasts of network $D$, and $T^{\prime}\equiv T$. The ability of network $D$ to repeat the forecasts of network $S$ is proven in a similar way; therefore $S$ and $D$ are equivalent. QED.

Corollary 1.1. Under the limitations (5)-(6) of Lemma 1, the number of neurons in the first layer of a fully connected two-layer neural network can be changed independently of the number of neurons in the next layer, without changing the predicted values. The coefficients of the neurons of the second layer can be calculated without additional training of the new network if the matrices $A_{m,k}$, $A_{m,k}^{(1)}$ of the bijection transformation (5)-(6) are known.
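
A small numerical check of Corollary 1.1 under an assumed exact bijection (5): the second-layer weights of the narrow network follow from (8) without retraining. The sketch uses a linear second layer for brevity, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
I, K, M, L = 500, 128, 40, 10
Yp = rng.normal(size=(I, M))                   # Y': outputs of the narrow first layer
A  = rng.normal(size=(M, K))                   # bijection matrix of (5): Y = Y' A
Y  = Yp @ A                                    # wide first-layer outputs, rank <= M

W2, B2 = rng.normal(size=(K, L)), rng.normal(size=L)
T      = Y @ W2 + B2                           # wide network output (linear second layer)

W2_new = A @ W2                                # eq. (8): remapped second-layer weights
T_new  = Yp @ W2_new + B2                      # narrow network output

print(np.allclose(T, T_new))                   # True: identical forecasts on X
```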

Lemma 2 (Minimum number of neurons in the first layer of a two-layer equivalent network). The minimum possible number of neurons $M$ in the first layer among all networks $D$ that are equivalent to network $S$ on dataset $X$ and satisfy the limitations (5)-(6) of Lemma 1 is not less than the rank of the matrix $Y$ on dataset $X$:

$min(M)\geq rank(Y)$ (10)

Proof. According to Lemma 1, if the limitations (5)-(6) are satisfied, then the networks $S$ and $D$ are equivalent. For the transformations (5)-(6) to exist, the ranks of the matrices $Y$, $Y^{\prime}$ must coincide:

$rank(Y)=rank(Y^{\prime})$ (11)

If

$rank(Y)\neq rank(Y^{\prime})$ (12)

then the networks do not satisfy the initial limitations of Lemma 1. The rank $rank(Y^{\prime})$ cannot exceed the number of neurons in this layer (the number of columns of the matrix $Y^{\prime}$):

$M\geq rank(Y^{\prime})$

Therefore $min(M)\geq rank(Y^{\prime})$, and consequently (10) holds. QED.

Corollary 2.1. If among all networks satisfying the limitations of Lemma 2 there is a network equivalent to the given one with $M=rank(Y)$ neurons in the first layer, then it has the smallest possible number of neurons in this layer among all such networks.

Corollary 2.2. Given a trained two-layer network, under the limitations of Lemma 2 it is possible to find the lower bound (10) on the minimum number of neurons in the first layer of its equivalent network, and this search does not require training equivalent networks multiple times.
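
In practice, the bound (10) can be estimated directly from the matrix of first-layer outputs collected on the dataset. A minimal sketch, assuming $Y$ is available as a NumPy array:

```python
import numpy as np

def min_neurons_lower_bound(Y, tol=None):
    """Lower bound (10): numerical rank of the layer-output matrix Y (I x K)."""
    return int(np.linalg.matrix_rank(Y, tol=tol))

# example: a 128-neuron layer whose outputs happen to span only 40 dimensions
rng = np.random.default_rng(2)
Y = rng.normal(size=(1000, 40)) @ rng.normal(size=(40, 128))
print(min_neurons_lower_bound(Y))              # 40
```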

Lemma 3 (Minimum number of neurons in the hidden layers of a multilayer fully connected network). Let us have a fully connected neural network with $N$ hidden layers, whose $n$-th hidden layer produces outputs $Y^{(n)}=Y_{i,j}^{(n)}$ on the dataset $X$, $n\in[1,N]$. Then any network that is equivalent to it on the dataset $X$, has $M_{n}$ neurons in the $n$-th hidden layer, and satisfies the limitations (5)-(6) of Lemma 1 in each hidden layer also satisfies the condition:

$M_{n}\geq rank(Y^{(n)})$ (13)

Proof. Let us consider layers $n$ and $n+1$ as a separate fully connected two-layer neural network with inputs $Y^{(n-1)}$ and outputs $Y^{(n+1)}$. By Lemma 2, a neural network equivalent to it that satisfies the limitations of Lemma 1 and has the minimum possible number of neurons in this layer has at least $rank(Y^{(n)})$ neurons in its first layer (corresponding to layer $n$ of the original network). This holds for every layer except the last layer of the original network, i.e., for all hidden layers of the original network ($n\leq N$). QED.

Corollary 3.1. If a network $D$ with $rank(Y^{(n)})$ neurons in layer $n$ is equivalent to some network $S$ with outputs $Y^{(n)}$ on dataset $X$, then it has the minimum possible number of neurons in this layer among all networks equivalent to $S$ on this dataset that satisfy the limitations (5)-(6) of Lemma 1.

3 Stochastic view

Based on Lemma 3, the problem of finding the minimum number of neurons in each hidden layer $n$ of a fully connected neural network can be reduced to finding $rank(Y^{(n)})$, the rank of the layer outputs on the given dataset, followed by a single retraining of the resulting network with the found minimal number of neurons to finally check its equivalence to the original network.

Let us assume that we already have a trained fully connected neural network $S$ of sufficiently large width that solves some problem $X\rightarrow T$, $X\in R^{I\times J}$, $T\in R^{I\times L}$. Then we can find the minimum number of neurons in the equivalent network $D$ for each hidden fully connected layer separately.

When searching, one should take into account the following facts:

  • Training a neural network is a stochastic process, and the resulting neuron coefficients are determined by a combination of random factors: the initial conditions and the learning process. Therefore, the matrices $Y^{(n)}$ obtained in each training run of the same network are different and may have different ranks: for $C$ independent trainings of the same network $S$, instead of the matrix $Y^{(n)}$ we have a set of $C$ matrices $\{Y^{(n)}\}_{C}$, with ranks $\{M^{(n)}\}_{C}$, respectively;

  • Taking into account the limited accuracy of numerical algorithms and source data, finding $rank(Y^{(n)})$ (and the matrix $Y^{(n)}$ turns out to be very large) amounts to computing its truncated SVD decomposition and choosing the optimal truncation level $M$. For large matrices this problem is usually difficult, and there are many ways to choose the truncation level $M$ (corresponding to the rank of the matrix). These approaches produce different values, so the choice of $M$ is a subjective problem, as shown in [10, 11, 12], and therefore depends on the actual problem solved by the network (a small numerical illustration follows this list).
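
The sketch below illustrates this point on synthetic data: two common truncation rules, a relative singular-value cutoff and a 99% cumulative-energy criterion, generally give different values of $M$ for the same output matrix. All numbers and thresholds are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
Y = rng.normal(size=(1000, 40)) @ rng.normal(size=(40, 128))   # "true" rank 40
Y += 1e-3 * rng.normal(size=Y.shape)                            # numerical noise

s = np.linalg.svd(Y, compute_uv=False)                          # singular values
M_cutoff = int(np.sum(s > 1e-2 * s[0]))                         # relative threshold rule
energy   = np.cumsum(s**2) / np.sum(s**2)
M_energy = int(np.searchsorted(energy, 0.99) + 1)               # 99% energy rule
print(M_cutoff, M_energy)                                       # the two rules typically disagree
```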

Thus, the task of constructing a minimal network $D$ equivalent to a network $S$ of a given architecture should be treated from a statistical point of view and may be subjective. A similar remark applies to the minimum number of neurons and to network equivalence.

Definition 2 (Statistical equivalence of networks). Let us call two networks $S$ and $D$ statistically equivalent on a dataset $X$ with a quality metric $Q$ and its threshold level $Q_{0}$ ($Q_{0}$ worse than the absolutely best value of $Q$) if, on the dataset $X$, the networks $S$ and $D$ produce almost identical forecasts, with the quality metric $Q$ not worse than $Q_{0}$.

Corollary 4.1. Network $S$ is statistically equivalent to itself with any pre-specified threshold level $Q_{0}$ satisfying the limitations of Definition 2.

Corollary 4.2. All fully overtrained networks (those that exactly reproduce the targets/labels $T$ of the training dataset $X$) are equivalent and statistically equivalent on this dataset with any pre-specified threshold level $Q_{0}$ satisfying the limitations of Definition 2.

Let us formulate an algorithm for finding the minimum number of neurons in a fully connected layer by checking the statistical equivalence of two networks: a wide one, $S$, and a probe one, $D$.

To obtain a set of matrices $\{Y\}_{C}$, we use cross-validation, dividing the original dataset $X$ into $C$ parts. We use randomly selected $C-1$ parts of the dataset for training, and the remaining part for validation and for checking the stopping conditions of training. Thus, we obtain $C$ variants of network $S$ and $C$ variants of the matrix $Y$ (the set of matrices $\{Y\}_{C}$).

From Corollary 4.2 it follows that checking the statistical equivalence of two networks should be carried out on a dataset that contains a part on which none of the networks was trained. On the part of the dataset on which they were both trained, they can be equivalent with any predefined quality $Q_{0}$ (satisfying the limitations of Definition 2) simply due to overtraining.

Therefore, as the dataset $X^{\prime}$ for checking statistical equivalence, we choose a dataset half of whose elements were used to train network $S$ and the other half to validate network $S$.

Let us take into account the following property of most quality metrics on non-intersecting datasets $X_{1}$, $X_{2}$:

$Q(X_{1}\vee X_{2})=\frac{Dim(X_{1})Q(X_{1})+Dim(X_{2})Q(X_{2})}{Dim(X_{1})+Dim(X_{2})};\quad X_{1}\wedge X_{2}=\emptyset$ (14)

Given this property, the approximate equality of the fold sizes ($Dim(X_{1})\approx Dim(X_{2})$), and Corollary 4.2, we choose the threshold $Q_{0}$ equal to the average of the quality metric $Q$ of network $S$ on the validation fold $X_{1}$ ($\overline{Q_{@val}}(S)$) and the best possible quality metric $Best(Q)$ on the training fold $X_{2}$:

$Q_{0}=\frac{\overline{Q_{@val}}(S)+Best(Q)}{2}$ (15)
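
For example, with $Best(Q)=1$ for accuracy and $Best(Q)=0$ for MSE (see the comments to Algorithm 2), the threshold (15) works out as in this small sketch (the numeric inputs are illustrative only):

```python
def threshold_q0(q_val_mean, best_q):
    """Threshold (15): midpoint between the mean validation quality and the best possible quality."""
    return 0.5 * (q_val_mean + best_q)

print(threshold_q0(0.94, 1.0))   # accuracy: Q0 = 0.97
print(threshold_q0(0.48, 0.0))   # MSE:      Q0 = 0.24
```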

From each network $S_{C_{i}}$ we produce a network $D_{C_{i}}$ and check these two networks for statistical equivalence with the threshold level (15). Since each network $S$, $D$ produces different results depending on the training dataset and on the dataset on which the comparison is made, it is possible to obtain $C(C-1)$ estimates of the network quality metric $Q_{ij}=Q(S_{C_{i}}(X^{\prime}_{ij}),D_{C_{i}}(X^{\prime}_{ij}))$, $i\neq j$, where the checking dataset $X^{\prime}_{ij}$ is:

$X^{\prime}_{ij}=(X\setminus X^{(C_{i})})\lor(X\setminus X^{(C_{j})})$ (16)

where $X^{(C_{i})}$, $X^{(C_{j})}$ are the training datasets used for the $i$-th and $j$-th cross-validation folds.

Quality is a stochastic quantity, so we need to know the worst bound of the metric $Q_{ij}$ (at some significance level $\alpha$). If this bound is not worse than $Q_{0}$, then the networks $S$, $D$ are considered statistically equivalent with threshold level $Q_{0}$.

Unfortunately, the number of folds $C$ is usually small and traditionally does not exceed 10, so it is difficult to reach a high enough level of statistical significance using these data only. We use the traditional way of increasing the size of the ensemble: the bootstrap technique, presented in [13].

The resulting approximate method for assessing statistical equivalence is presented in Algorithms 1-2. It uses the average validation quality $\overline{Q_{@val}}$ and the worst value of $Q$ obtained from bootstrap estimates of $Q$ for each of the $X^{\prime}_{ij}$ datasets. The quality estimate is shown in Algorithm 1.

Algorithm 1 Estimating the quality metric between model predictions
procedure WorstQ($X, X_{C}, S_{C}, D_{C}, Q, i, j$)
    Input: data set $X$
    Input: $C$ training folds $X^{(C_{i})}$
    Input: $C$ trained networks $S_{C_{i}}$, $D_{C_{i}}$ at folds $X^{(C_{i})}$
    Input: metric $Q$
    Input: cross-validation fold numbers $i$, $j$
    Calculate network outputs:
    $T\leftarrow S_{C_{i}}(X\setminus(X^{(C_{i})}\wedge X^{(C_{j})}))$
    $T^{\prime}\leftarrow D_{C_{j}}(X\setminus(X^{(C_{i})}\wedge X^{(C_{j})}))$
    $N\leftarrow 10000$
    $Qset\leftarrow\emptyset$
    for $iteration\in[1..N]$ do
        $t,t^{\prime}\leftarrow BootstrappedPairs(T,T^{\prime})$
        $Qset\leftarrow Qset\vee Q(t,t^{\prime})$
    end for
    return $Worst(Qset)$
end procedure

Comments to Algorithm 1:

  • The function $Q(T,T^{\prime})$ returns the value of the metric $Q$ between the outputs of networks $S$, $D$, computed over pairs of network output forecasts $(T_{i},T^{\prime}_{i})$;

  • The function $BootstrappedPairs$ returns pairs $(T_{i},T^{\prime}_{i})$ randomly selected from the samples $T$, $T^{\prime}$; the number of pairs returned is equal to the number of samples in $T$ (and $T^{\prime}$);

  • The function $Worst(Qset)$ determines the worst metric value in the set $Qset$: for example, for $Q=Accuracy$ it is the minimum value over the set, and for $Q=MSE$ it is the maximum value over the set;

  • The number of folds is $C\geq 2$ to ensure that the networks were not trained on the entire dataset used for the statistical equivalence check (see Corollary 4.2);

  • The number of bootstrap iterations $N$ can be used to estimate the minimal significance level of the algorithm as $\alpha\sim 1/N$.
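
A possible NumPy realization of Algorithm 1 is sketched below; `metric` and `worst` are assumptions standing in for $Q$ and $Worst$ (e.g. accuracy with `min`, or MSE with `max`), and the predictions are assumed to be NumPy arrays aligned sample by sample.

```python
import numpy as np

def worst_q(T, T_prime, metric, worst=min, n_boot=10_000, seed=0):
    """Bootstrap estimate of the worst quality metric between two prediction sets."""
    rng = np.random.default_rng(seed)
    n = len(T)
    q_set = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # BootstrappedPairs: resample pairs with replacement
        q_set.append(metric(T[idx], T_prime[idx]))
    return worst(q_set)

# usage, e.g. accuracy between class predictions of two networks on the checking dataset:
accuracy = lambda a, b: float(np.mean(a == b))
# q_worst = worst_q(preds_S, preds_D, accuracy, worst=min)
```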

Knowing how to check the statistical equivalence of two networks, and having a set of networks $S$ trained by cross-validation using $C$ folds, we can formulate an algorithm for finding the minimum number of neurons in a layer.

Since the choice of the SVD truncation level $M$ is subjective, as shown in [10, 11, 12], the problem of finding the minimum number of neurons can be formulated as follows: find the minimum number of neurons $M$ in network $D$ such that networks $S$ and $D$ remain statistically equivalent with a given threshold level $Q_{0}$ and quality metric $Q$.

To find the ranks $\{M\}_{C}$ of the set of matrices $\{Y\}_{C}$, we create the network $D^{\prime}$ by inserting a linear transformation between the first and second fully connected layers:

$Y^{\prime}=YA^{(M)}A^{(M)+}$ (17)

where $A^{(M)}$, $A^{(M)+}$ are the matrices of the direct and pseudo-inverse truncated SVD transformation, both truncated at level $M$:

$\varphi^{\circ(1)}(W_{j,k}^{(1)}X_{i,j}+B_{k}^{(1)})=Y_{i,k}$ (18)
$Y^{\prime}_{i,r}=A_{s,r}^{(M)+}A_{k,s}^{(M)}Y_{i,k}$ (19)
$\varphi^{\circ(2)}(W_{r,l}^{(2)}Y^{\prime}_{i,r}+B_{l}^{(2)})=T_{i,l}$ (20)

Here $W_{j,k}^{(1)}$, $B_{k}^{(1)}$, $W_{r,l}^{(2)}$, $B_{l}^{(2)}$ are the weights and biases of the trained network $S$.

For $M\geq rank(Y)$, by the definition of the SVD, $Y^{\prime}\equiv Y$, $T^{\prime}\equiv T$, and the networks $S$, $D^{\prime}$ are equivalent and statistically equivalent. Therefore, the task of finding the rank of a matrix $Y\in\{Y\}_{C}$ reduces to finding the minimum $M$ for which the networks $S$, $D^{\prime}$ are statistically equivalent in terms of the metric $Q$ with a good enough threshold level $Q_{0}$ (Algorithm 1).

The transformation $A^{(M)+}A^{(M)}Y$ is essentially an autoencoder-like architecture, first described in [14], but based on the truncated SVD transformation. It does not distort the input data when the size of its latent representation $M$ is not less than the rank of the matrix $Y$. SVD autoencoders have already been used for optimizing neural networks, for example in [15, 16], but not for finding the minimum number of neurons in a fully connected layer without multiple training. The architectures of the original network $S$ (1)-(2) and network $D^{\prime}$ (18)-(20) are shown in Fig.1A-B.
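
A minimal sketch of the truncated-SVD autoencoder (17)-(19) as two fixed linear maps inserted after the studied layer at inference time; the function and variable names are illustrative, not the paper's code.

```python
import numpy as np

def make_svd_autoencoder(Y, M):
    """Build encode/decode matrices from the SVD of the layer outputs Y, truncated at level M."""
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    A      = Vt[:M].T          # A^(M)  : K x M (encoder)
    A_pinv = Vt[:M]            # A^(M)+ : M x K (columns of A are orthonormal, so pinv(A) = A.T)
    return A, A_pinv

# inference with the autoencoder inserted after the first layer, cf. (18)-(20):
# Y_enc = phi1(X @ W1 + B1) @ A        # M-dimensional latent representation
# Y_rec = Y_enc @ A_pinv               # equals Y exactly when M >= rank(Y)
# T_out = phi2(Y_rec @ W2 + B2)
```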

Figure 1: Architecture for searching for the minimum number of neurons in a layer $n$. A) architecture of network $S$; B) architecture of network $D^{\prime}$; C) architecture used to study the MNIST problem; $p$ is a network width multiplier.

Thus, assuming that the limitations of Lemma 1 are met, the search for the minimum number of neurons comes down to calculating quality metrics over an ensemble of already trained networks. We insert the truncated SVD autoencoder after the studied layer and study the forecast quality metrics of these networks on the studied dataset at various truncation levels, as shown in Algorithm 2.

Algorithm 2 Finding the minimum number of neurons in a layer
procedure FindNumberOfNeurons($S, L, X, C, Q$)
    Input: network $S$
    Input: layer number $L$
    Input: data set $X$
    Input: folds number $C$
    Input: metric $Q$
    Make $C$ folds $X^{(C_{i})}$ from $X$ of nearly equal size:
    $(X\setminus X^{(C_{i})})\wedge(X\setminus X^{(C_{j\neq i})})=\emptyset$, $\vee_{i\in[1..C]}(X\setminus X^{(C_{i})})=X$
    for $i\in[1..C]$ do
        Train $S_{C_{i}}$ networks:
        $S_{C_{i}}\leftarrow S$ trained at $X^{(C_{i})}$, validated at $X\setminus X^{(C_{i})}$
        $Q_{i}\leftarrow Q(S_{C_{i}})$ at $X\setminus X^{(C_{i})}$
        Calculate layer $L$ outputs:
        $Y_{C_{i}}\leftarrow S_{C_{i}}(X,L)$
        Calculate SVD decompositions of layer outputs:
        $U_{i}\Sigma_{i}V_{i}^{T}\leftarrow Y_{C_{i}}$
    end for
    $\overline{Q_{@val}}\leftarrow Mean(\{Q_{i}\})$
    $Q_{search}\leftarrow(BestQ+\overline{Q_{@val}})/2$
    $M_{found}\leftarrow\emptyset$
    for $i\in[1..C]$ do
        for $j\in[1..C], j\neq i$ do
            $M_{min}\leftarrow 1$
            $M_{max}\leftarrow Dim(Y_{C})$
            for $iteration\in[0,\log_{2}(Dim(Y_{C})+1)]$ do
                $M\leftarrow(M_{max}+M_{min})/2$
                for $k\in[1..C]$ do
                    $D^{\prime}_{C_{k}}\leftarrow S_{C_{k}}$
                    $AutoEnc_{k}\leftarrow makeTruncatedSVDAutoencoder(\Sigma_{k},V_{k}^{T};M)$
                    $D^{\prime}_{C_{k}}\leftarrow InsertAfterNetworkLayerL(AutoEnc_{k},D^{\prime}_{C_{k}},L)$
                end for
                $Q^{\prime}\leftarrow WorstQ(X,X_{C},S_{C},D^{\prime}_{C},Q,i,j)$
                if $Q^{\prime}$ is worse than $Q_{search}$ then
                    $M_{min}\leftarrow M$
                else
                    $M_{max}\leftarrow M$
                end if
                if $M_{max}-M_{min}\leq 1$ then
                    $M_{found}\leftarrow M_{found}\vee M_{max}$
                    break for
                end if
            end for
        end for
    end for
    return $Mean(M_{found})$
end procedure

Comments to Algorithm 2:

  • $BestQ$ is the best possible value of the metric $Q$: for example, for accuracy in a classification problem it is 1 (100%), and for MSE or MAE in a regression problem it is 0.

  • The search for the truncation level $M$ of the SVD autoencoder is carried out by finding the point $Q(M)=Q_{0}$ with the bisection method (which is accurate under the assumption that the dependence $Q(M)$ is smooth and monotonic, which is not always the case); a Python sketch of this search is given after this list.

  • Due to Lemma 3, Algorithm 2 can be independently repeated for any hidden fully connected layer of network $S$, splitting $X$ into cross-validation folds and training the variants of networks $S_{C_{i}}$ only once, which further reduces the time and required resources.
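
A sketch of the bisection loop of Algorithm 2, referenced in the comment above; `is_equivalent(M)` is an assumed callback that inserts the autoencoders truncated at level $M$, calls WorstQ, and compares the result with $Q_{search}$.

```python
def find_min_truncation(width, is_equivalent):
    """Smallest M in [1, width] with is_equivalent(M) True, assuming Q(M) is monotonic."""
    if is_equivalent(1):
        return 1
    # invariant: is_equivalent(m_min) is False, is_equivalent(m_max) is True
    # (m_max = width always qualifies, since full truncation leaves S unchanged)
    m_min, m_max = 1, width
    while m_max - m_min > 1:
        m = (m_max + m_min) // 2
        if is_equivalent(m):
            m_max = m
        else:
            m_min = m
    return m_max
```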

In accordance with Corollary 3.1 of Lemma 3, after determining the minimum number of neurons by Algorithm 2, it is necessary to retrain the network of architecture $S$ with the found minimum number of neurons in its layers, to make sure that it is statistically equivalent to the original networks $S_{C_{i}}$ with threshold level $Q_{0}=Q_{@val}$ on the test dataset.

4 Experiments

To test the performance and stability of the algorithm, the MNIST dataset presented in [17] and the classification task were chosen. Training datasets of grayscale (single-channel) 8x8 images with 10 classes were studied. We used the MNIST dataset variant from the TensorFlow library (28x28 images, 60000 training and 10000 test images); the images were reduced to 8x8 (MaxPooling + Padding) to speed up calculations and flattened (reshaped) into 64-dimensional vectors.

For the experiments, a neural network with three layers was used: two hidden layers of equal (variable) width, an output layer of 10 neurons, and an output Softmax activation. The input is passed through a batch normalization layer. The activation function of all layers is Abs.

The network architecture is shown in Fig.1C. The network formula is: Softmax, FCx(10)(Abs), FCx(p*128)(Abs), FCx(p*128)(Abs), BN, where $p$ is a layer width multiplier, BN is a Batch Normalization layer, and FC is a fully connected layer. The SVD autoencoder was implemented in TensorFlow as two layers of untrained embeddings whose coefficients were set from the truncated SVD decomposition matrices (direct and inverse). For training we used the ADAM optimizer with a constant learning rate of $10^{-3}$ and early stopping with patience 3, monitoring the loss function on the validation data. The loss function is cross-entropy, and the quality metric $Q$ is Accuracy.
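
A hedged tf.keras sketch of the architecture in Fig.1C and of the training settings described above; everything not stated in the text (input pipeline, label format, exact layer options) is an assumption.

```python
import tensorflow as tf

def build_S(p=1, n_inputs=64, n_classes=10):
    """Network S: BN -> FCx(p*128)(Abs) -> FCx(p*128)(Abs) -> FCx10(Abs) -> Softmax."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(n_inputs,)),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dense(p * 128, activation=tf.abs),
        tf.keras.layers.Dense(p * 128, activation=tf.abs),
        tf.keras.layers.Dense(n_classes, activation=tf.abs),
        tf.keras.layers.Softmax(),
    ])

model = build_S(p=1)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy",   # integer class labels assumed
              metrics=["accuracy"])
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3)
# model.fit(x_train, y_train, validation_data=(x_val, y_val), callbacks=[early_stop])
```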

During the experiments, the following criteria for the stability and performance of the algorithm were numerically checked:

  1. Stability of the result to changes in the number of neurons in the original layer (variants with different $p$ in Fig.1C);

  2. Stability of the result to the choice of the combination of folds used for comparison: pairs $i,j$ in cross-validation;

  3. Stability of the result to changes in the number of folds (variants for different $C\in[2..7]$);

  4. Non-decreasing forecast quality (the forecast quality of the found network is not worse than that of the original network);

  5. Equivalence of the original network and the new one on the test dataset.

The results of estimating the optimal number of neurons for each width (varying the width multiplier $p$ of network $S$, Fig.1C) and for each layer (marked in Fig.1C as transform1 and transform2) are shown in Fig.2A. It can be seen from the figure that the estimate reaches a constant value when the ratio between the number of neurons in the layer of $S$ and the predicted minimum value in the layer of $D^{\prime}$ is greater than approximately 3. As one can see, the minimum width of the first layer is about 40 neurons, and the minimum width of the second layer is about 10 neurons.

Figure 2: Performance and stability of the algorithm. A) predicted minimum number of neurons for different numbers of neurons in the original layer of $S$ (mean over the $C$ fold combinations); B) predicted minimum number of neurons for different numbers of neurons in the original layer of $S$ (over the $C$ fold combinations); C) predicted minimum number of neurons for different numbers of folds $C$ (for 128 neurons in the original layer); D) predicted minimum number of neurons for different dataset separation and ordering variants (for 128 neurons in the original layer); E) accuracy on the test dataset for the original network $S$ (red circles) and the found equivalent network $D$ with the minimal number of neurons in the layers (40 and 10 for the first and second layers, respectively): green diamonds - simple training with constant learning rate $10^{-3}$ and early stopping with patience 3; blue crosses - learning rate decreasing from $10^{-3}$ to $10^{-6}$ and early stopping with patience 10.

Fig.2B shows the results of testing robustness to the choice of fold pair for comparison. In Algorithm 2 these values are stored in the array $M_{found}$ and are averaged to obtain the result for a given network width; Fig.2B shows these data before averaging over pairs of folds. The figure shows that the spread of values across fold pairs is relatively low and on average does not exceed 10-20%, which, after averaging over pairs of folds, makes the standard deviation several times smaller.

Fig.2C shows the predicted number of neurons as a function of the number of folds $C$ used for cross-validation (before averaging over the pairs of folds). It can be seen from the figure that the dependence on the number of folds $C$ is weak, and therefore even cross-validation over 2 or 3 folds can be used to solve the problem.

Fig.2D shows the results of testing the stability of the algorithm to the cross-validation partitioning variants; the width of the original network is 128 neurons. Previously, when studying the algorithm, we fixed the order of samples in the training dataset; the figure shows how the results depend on shuffling this order, and shows the distribution of the results for each pair of folds. The figure shows that the results are stable to within about 10%. Thus, the result depends little on the order of elements in the dataset and on the separation into cross-validation folds.

After this, the neural network was trained 10 times with the minimum number of neurons found (40 neurons in the first layer, transform1, and 10 neurons in the second layer, transform2); the accuracies obtained on the test dataset are shown in Fig.2E. The figure shows that the accuracies achieved with standard training (shown in Fig.2E with diamonds) are slightly worse than those achieved by the original networks (the intervals intersect, but the averages are separated). However, with more elaborate training (learning rate decreasing from $10^{-3}$ to $10^{-6}$ with step 0.1, and early-stopping patience increased from 3 to 10), the achieved accuracies become nearly the same as those of the original network (shown in Fig.2E with crosses).

For the final check of the equivalence of the found network to the original ones, quality metrics were calculated between the predictions of three original networks S with a width of 128 neurons in each hidden layer and three found networks D with widths of 40 and 10 neurons in the corresponding hidden layers. The three variants of the networks in each case were obtained by cross-validation. The resulting 9 quality metric values (accuracy) are in the range 0.919..0.931. The mutual quality of the networks on the test dataset is not lower than the quality of the original network on the test dataset, which suggests that the found network (D) with the minimum number of neurons is equivalent to the original one (S).

5 Discussion

The presented results show that the proposed algorithm is stable enough on average: the estimated minimum layer width depends weakly on the initial number of neurons in the original layer when the neural network is wider than three minimum widths. The minimum number of neurons can indeed be found for each layer independently, which suggests that the number of neurons is not a hyperparameter whose combination with others determines the accuracy of the solution, but an internal property determined for each layer separately. This suggests that on average we actually find some latent dimension of the layer, which is a basic internal property of the solution and does not depend on how wide the original layer is. Therefore, to find the internal (latent) dimension, it is enough to train a fairly wide network with a number of neurons exceeding the expected minimum by at least 3 times; with a smaller width of the original network, the resulting estimate could be inaccurate.

We also tested the algorithm on other problems, both classification and regression: full-size 28x28 MNIST and the FashionMNIST dataset presented in [18] (classification tasks), as well as the California housing dataset from [19] and the Wine Quality dataset from [20] (regression tasks). For the classification tasks, networks with two hidden layers were used; for the regression tasks, networks with one hidden layer. In all cases, the initial width of the hidden layers was chosen to be 200 neurons (300 for the MNIST task), and cross-validation was carried out using 3 folds. The results are shown in Table 1. The table shows that the solutions found are equivalent to the original ones with $Q_{0}$ not worse than $Q_{@val}$, so the algorithm looks useful. As one can also see, the found minimum number of neurons is much smaller than expected both from the universal formula of [3] ($N_{in}+N_{out}+2$, where $N_{in}$, $N_{out}$ are the input and output dimensions, respectively) and from the universal formula of [6] ($max(N_{in},N_{out})$).

It is important to note that the discussed approach can be useful not only for fully connected networks but for any network architecture, to estimate the minimal size of any fully connected layer followed by another fully connected layer.

Table 1: Algorithm results for other tasks, C=3. Designations: BN - Batch Normalization layer, FC - Fully connected layer, FL - Flatten layer.
| Data set | MNIST 28x28 | Fashion MNIST | California housing | Wine Quality |
| --- | --- | --- | --- | --- |
| Task | Classification | Classification | Regression | Regression |
| Metric $Q$ | Accuracy | Accuracy | MSE | MSE |
| Source network $S$ formula | FCx10 (Softmax), FCx300 (ReLU), FCx300 (ReLU), BN, FL | FCx10 (Softmax), FCx200 (ReLU), FCx200 (ReLU), BN, FL | FCx1 (linear), FCx200 (ReLU), BN | FCx1 (linear), FCx200 (ReLU), BN |
| $S_{C_{i}}$ metric value at validation dataset | 0.926..0.950 | 0.883..0.887 | 0.442..0.490 | 0.476..0.559 |
| Resulting network $D$ formula | FCx10 (Softmax), FCx19 (ReLU), FCx68 (ReLU), BN, FL | FCx10 (Softmax), FCx25 (ReLU), FCx36 (ReLU), BN, FL | FCx1 (linear), FCx5 (ReLU), BN | FCx1 (linear), FCx6 (ReLU), BN |
| $D_{C_{i}}$ metric value at validation dataset | 0.962..0.964 | 0.886..0.888 | 0.516..0.521 | 0.500..0.621 |
| Metric value between $S_{C_{i}}$ and $D_{C_{i}}$ at test dataset | 0.922..0.941 | 0.869..0.894 | 0.049..0.084 | 0.060..0.088 |
| Minimum number of neurons in layers found by our algorithm | 68, 19 | 36, 25 | 5 | 6 |
| Expected minimum number of neurons per layer from [3] | 796, 796 | 796, 796 | 11 | 14 |
| Expected minimum number of neurons per layer from [6] | 784, 784 | 784, 784 | 8 | 11 |

The main problems of this algorithm are:

  1. The algorithm is stochastic, so the minimal number of neurons it predicts is also stochastic, and one cannot guarantee that the minimum number of neurons cannot be greater or smaller than the number found by this algorithm.

  2. The algorithm is computationally expensive at the initial stage: it requires $C$ versions of the original network trained by cross-validation over $C$ folds. However, this is often a standard procedure for estimating the accuracy of the network, and it was shown that 2-3 folds look sufficient for the calculations.

  3. The initial complete SVD decomposition of the outputs of the neural layer is difficult to compute. To speed things up, one can try to compute it from random $N>J$ samples of the dataset (where $J$ is the number of output neurons of the studied layer of network $S$), at the cost of some accuracy loss.

  4. The truncated SVD autoencoder makes the calculations cumbersome: it uses matrix multiplications, and in addition it is necessary to calculate quality metrics between variants of $S$, $D^{\prime}$ at different fold combinations, which also slows down the algorithm.

  5. The algorithm operates under the assumption that the limitations of Lemma 1 are met. Due to the nonlinearity of the activation functions, it is not obvious that an equivalent network with the minimum number of neurons in a layer satisfies the limitations of Lemma 1, so one cannot guarantee that the minimum number of neurons cannot be less than the number found by this algorithm.

  6. It is not obvious how (and whether it is possible) to formalize the calculation of the minimum number of neurons for non-elementwise activation functions, for example Softmax or Maxout: in this case, the activation functions can create additional relationships between the columns of the matrix $Y$ and change the rank of the output matrix compared to the rank of the argument matrix. For Softmax, for example, the rank of the output matrix is one less than the number of neurons due to the normalization.

  7. It is not obvious how (or whether) the algorithm will work in the case of training with random data augmentation, which is now widely used when training networks.

6 Conclusion

The paper presents an algorithm for finding the minimum number of neurons in the fully connected layers of a network. The basis of the algorithm is training the initial wide network by cross-validation using at least two folds. After training, for the analyzed fully connected layer, truncated SVD autoencoders are built for each trained network. Each SVD autoencoder is inserted into the corresponding network after the studied layer. The quality of the resulting ensemble of networks is determined by comparing the forecasts of the original network and of the network with the truncated SVD autoencoder on a dataset composed of a pair of folds of the original dataset. The statistical equivalence of the new and original networks is determined by the quality metric reaching a threshold value: the average between the metric of the original network on the validation dataset and the best possible metric value. The algorithm constructed in this way searches for the minimum dimension of the truncated SVD autoencoder, which has the meaning of the rank of the matrix of output values of the layer of the original network and, within the framework of the described approach, corresponds to the minimum number of neurons in the fully connected layer.

The minimum number of neurons in a layer determined by this algorithm does not require training the network multiple times for different numbers of neurons in the layer. Therefore, the minimum number of neurons is not a hyperparameter related to the other hyperparameters of the network, but an internal property of the solution; it can be calculated independently for each layer and depends on the given network architecture, layer position, quality metric, and training dataset. The proposed algorithm determines a first approximation of the minimum number of neurons: on the one hand, it does not guarantee that a neural network with the found number of neurons can be trained to the required quality (which would ultimately increase the found minimum number of neurons), and on the other hand, it searches for the minimum number of neurons in a limited class of solutions satisfying the limitations of Lemma 1 (which would ultimately decrease the found minimum number of neurons).

An experimental study of the algorithm and of the properties of the solution was carried out on the small-size (8x8) MNIST dataset and demonstrated the effectiveness of the proposed solution for this problem. The described algorithm was also tested on several well-known classification and regression problems, including full 28x28 MNIST images.

Sample code is available at https://github.com/berng/FCLayerMinimumNeuronsFinder .

Acknowledgments

The work has been done under financial support of the Russian Science Foundation (grant No.24-22-00436).


References

  • [1] A.N. Kolmogorov. On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. Dokl. Akad. Nauk SSSR, pages 953–956, 1957.
  • [2] Vladimir Arnold. On the function of three variables. Amer. Math. Soc. Transl., pages 51–54, 1963.
  • [3] G. Gripenberg. Approximation by neural networks with a bounded number of nodes at each level. Journal of Approximation Theory, 122(2):260–266, 2003.
  • [4] Boris Hanin and Mark Sellke. Approximating Continuous Functions by ReLU Nets of Minimal Width. arXiv e-prints, page arXiv:1710.11278, October 2017.
  • [5] Sejun Park, Chulhee Yun, Jaeho Lee, and Jinwoo Shin. Minimum Width for Universal Approximation. arXiv e-prints, page arXiv:2006.08859, June 2020.
  • [6] Yongqiang Cai. Achieve the minimum width of neural networks for universal approximation. In The Eleventh International Conference on Learning Representations, 2023.
  • [7] H. Ruppert, A. Krug, and Y.A.W. Shardt. Method to design a neural network with minimal number of neurons for approximation problems. IFAC-PapersOnLine, 55(7):568–573, 2022. 13th IFAC Symposium on Dynamics and Control of Process Systems, including Biosystems DYCOPS 2022.
  • [8] F Arifin, H Robbani, T Annisa, and N N M I Ma’arof. Variations in the number of layers and the number of neurons in artificial neural networks: Case study of pattern recognition. Journal of Physics: Conference Series, 1413(1):012016, nov 2019.
  • [9] N. Murata, S. Yoshizawa, and S. Amari. Network information criterion-determining the number of hidden units for an artificial neural network model. IEEE Transactions on Neural Networks, 5(6):865–872, 1994.
  • [10] Pedro R. Peres-Neto, Donald A. Jackson, and Keith M. Somers. How many principal components? stopping rules for determining the number of non-trivial axes revisited. Computational Statistics & Data Analysis, 49(4):974–997, 2005.
  • [11] Jushan Bai and Serena Ng. Determining the number of factors in approximate factor models. Econometrica, 70(1):191–221, 2002.
  • [12] Antonella Falini. A review on the selection criteria for the truncated SVD in Data Science applications. Journal of Computational Mathematics and Data Science, 5:100064, 2022.
  • [13] B. Efron. Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics, 7(1):1 – 26, 1979.
  • [14] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation, page 318–362. MIT Press, Cambridge, MA, USA, 1986.
  • [15] Jian Xue, Jinyu Li, and Yifan Gong. Restructuring of deep neural network acoustic models with singular value decomposition. In Proc. Interspeech 2013, pages 2365–2369, 2013.
  • [16] Jian Xue, Jinyu Li, Dong Yu, Mike Seltzer, and Yifan Gong. Singular value decomposition based low-footprint speaker adaptation and personalization for deep neural network. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6359–6363, 2014.
  • [17] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [18] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv e-prints, page arXiv:1708.07747, 2017.
  • [19] R. Kelley Pace and Ronald Barry. Sparse spatial autoregressions. Statistics & Probability Letters, 33(3):291–297, 1997.
  • [20] Paulo Cortez, António Cerdeira, Fernando Almeida, Telmo Matos, and José Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4):547–553, 2009. Smart Business Networks: Concepts and Empirical Evidence.