
Efficient Sparse Artificial Neural Networks

Seyed Majid Naji1, Azra Abtahi1, and Farokh Marvasti1 Corresponding author: Azra Abtahi (email: [email protected]). 1 Department of Electrical Engineering, Sharif University of Technology, Tehran, Iran.
Abstract

The brain, as the source of inspiration for Artificial Neural Networks (ANN), is based on a sparse structure. This sparse structure helps the brain consume less energy, learn more easily, and generalize patterns better than ANNs. In this paper, two evolutionary methods for introducing sparsity into ANNs are proposed. In the proposed methods, the sparse structure of a network as well as the values of its parameters are trained and updated during the learning process. The simulation results show that these two methods have better accuracy and faster convergence and need fewer training samples compared to their sparse and non-sparse counterparts. Furthermore, the proposed methods significantly improve the generalization power and reduce the number of parameters. For example, sparsifying the ResNet18 network with the proposed methods for image classification on the ImageNet dataset uses 40% fewer parameters, while the top-1 accuracy of the model improves by 12% and 5% compared to the dense network and its sparse counterpart, respectively. As another example, on the CIFAR10 dataset, the proposed methods converge to their final structure 7 times faster than their sparse counterpart, while the final accuracy increases by 6%.

Index Terms:
Artificial Neural Network, Natural Neural Network, Sparse Structure, Sparsifying, Scalable Training.

I Introduction

Artificial Neural Networks (ANN) are inspired by the Natural Neural Networks (NNN) which constitute the brain. Hence, scientists are trying to adapt natural network functionalities to ANNs. Observations show that NNNs have sparse properties. First, neurons in NNNs fire sparsely in both the time and spatial domains [1, 2, 3, 4, 5, 6]. Furthermore, NNNs profit from a scale-free [7], small-world [8], and sparse structure, which means each neuron connects to a very limited set of other neurons. In other words, the connections in a natural network constitute a very small portion of the possible connections [9, 10, 11]. This property results in very important features of NNNs such as low complexity, low power consumption, and strong generalization power [12, 13]. On the other hand, conventional ANNs do not have sparse structures, which leads to important differences between them and their natural counterparts. They have an extraordinarily large number of parameters and need extraordinarily large storage space, many training samples, and powerful underlying hardware [14]. Although deep learning has gained enormous success in a wide variety of domains such as image and video processing [15, 16, 17], biomedical processing [18, 19, 15, 20, 21, 22], speech recognition [23], physics [24, 25], and geophysics [26], the mentioned barriers make it impractical to exploit deep networks in portable and low-cost applications. Sparsifying a neural network can significantly reduce the number of network parameters and the learning time. However, we need to find a suitable sparse structure which also leads to an acceptable accuracy.

There are three main approaches to sparsifying a neural network:

  1. Training a conventional network completely and, at the end, making it sparse with pruning techniques [27, 28, 29].

  2. Modifying the loss function to enforce sparsity [30, 31].

  3. Updating a sparse structure during the learning process (evolutionary methods) [32, 33, 34, 35, 36].

The methods using the first approach train a conventional ANN and then remove the least important connections, i.e., those whose elimination does not significantly decrease the accuracy of the network. Hence, finding a good measure to identify such connections is the most important part of this approach. Pruning in this way guarantees good performance of the network model [28]. However, the learning time does not decrease in this approach, as a conventional non-sparse network must be trained completely. There are also methods which fine-tune the resulting pruned model to compensate for even a small drop of accuracy in the pruning procedure [37, 38, 39]. In [40], it is shown that a randomly initialized dense neural network contains a sub-network that can match the test accuracy of the dense network while needing the same number of training iterations (called the winning lottery ticket). However, finding this sub-network is computationally expensive, as it needs multiple rounds of pruning and re-training starting from a dense network. Thus, several references have concentrated on finding this configuration faster [41, 42, 43] and with less memory [44].

In the second approach, a sparsifying term is added to the loss function in order to shape the sparse structure of the network. If the loss function is chosen properly, the network tends towards a sparse structure with good performance. This can be achieved by adding a sparse regularization term such as the $l_1$ norm. In MorphNet [45], such a term is added to the loss function. MorphNet iteratively shrinks and expands a network to find a suitable configuration; in other words, it needs multiple re-trainings to achieve its final configuration.
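As an illustration of this approach, the following minimal sketch (plain NumPy; the function name, the per-layer weight list, and the value of the regularization strength are our illustrative assumptions, not taken from [30, 31, 45]) adds an $l_1$ penalty to a task loss:

```python
# A minimal, hypothetical sketch of l1-regularized training (assumed NumPy setup).
import numpy as np

def l1_regularized_loss(task_loss, weight_matrices, l1_lambda=1e-4):
    """task_loss: scalar loss of the network on a batch.
    weight_matrices: list of per-layer weight arrays.
    l1_lambda: strength of the sparsity term (illustrative value)."""
    l1_term = sum(np.abs(W).sum() for W in weight_matrices)   # pushes weights toward zero
    return task_loss + l1_lambda * l1_term
```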

In the third approach, both the network structure and the values of the connections are trained and updated during the learning process. First, an initial sparse graph model is generated for the network; then, at each epoch of the training phase, both the connection weights and the sparse structure are updated. NeuroEvolution of Augmenting Topologies (NEAT) [46] is an example of such methods, seeking to optimize both the weights and the topology of an ANN for a given task. Although impressive results have been reported [47, 48], NEAT is not scalable to large networks, as it must search a very large space. References [49, 38, 39] also propose non-scalable evolutionary methods which are not practical for large networks.

In [32], a scalable evolutionary method called SET is proposed. In this method, the weights are updated by the back-propagation algorithm [50], and in each training epoch a set of candidate connections is chosen to be replaced by new ones. The candidate connections are chosen from among the weakest ones, so replacing them improves the structure of the model. This method benefits from reduced learning time and acceptable performance; however, it has some drawbacks. The novelty of this method is the process through which the structure is updated: identifying weak connections and replacing them with new ones. Hence, there are two questions to be answered: 1) what is a “weak” connection, and 2) what makes a connection suitable for replacing an eliminated one? In [32], the magnitudes of the connections are considered as their strength (importance), which means connections with smaller magnitudes are weaker. Furthermore, new connections are chosen randomly. It seems that these two choices can be made more efficiently by utilizing mechanisms of evolution in natural networks.
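For concreteness, the following minimal sketch (NumPy; the function name, the boolean connection mask, and the small random re-initialization are our illustrative assumptions rather than the exact procedure of [32]) shows a magnitude-based rewiring step of this kind:

```python
# A hypothetical sketch of magnitude-based rewiring in one sparse layer.
import numpy as np

def magnitude_rewire(W, mask, zeta=0.3, rng=np.random.default_rng(0)):
    """W: weight array of one layer; mask: boolean array of existing connections;
    zeta: fraction of existing connections to replace each epoch."""
    threshold = np.quantile(np.abs(W[mask]), zeta)      # magnitude cut-off
    removed = mask & (np.abs(W) < threshold)            # weakest existing connections
    mask = mask & ~removed
    # Re-grow the same number of connections at randomly chosen empty positions.
    empty = np.flatnonzero(~mask)
    new_idx = rng.choice(empty, size=int(removed.sum()), replace=False)
    mask.flat[new_idx] = True
    W = W * mask
    W.flat[new_idx] = rng.normal(0.0, 0.01, size=new_idx.size)  # small random init (assumed)
    return W, mask
```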

In this paper, we propose two evolutionary sparse methods, the “Path-Weight” method and the “Sensitivity” method, both of which update the structure in a more principled way. The proposed methods show significant improvements in structure updating compared to existing methods. They exhibit greater generalization power and higher accuracy on challenging datasets such as ImageNet [51]. Furthermore, they need fewer training samples and training epochs, which makes them suitable for cases with a small number of training samples.

This paper is organized as follows. In Section II and Section III, we propose the “Path-Weight” and the “Sensitivity” methods, respectively. The simulation results are discussed in Section IV, and finally, we conclude the paper in Section V.

II Path-Weight Method

As discussed earlier, an important part of the evolutionary methods for finding sparse structures is identifying the weak connections at each training epoch; this requires a measure to quantify the “importance” of each connection. In [32], the weight magnitude of a connection is considered as its importance. In our approach, however, we measure the importance of a connection based on its effect on the final output. Two facts should be noted. First, the magnitudes of the inputs are as important as the weight magnitudes of the connections: a connection with a small weight magnitude can be effective if the input of its node has a large magnitude, in which case a small variation in its weight may change the output significantly. Second, the “path” which contains a connection affects its importance. We define a path as a sequence of connections, one in each layer, which starts at the first layer and ends at the last layer. A path with large connection weight magnitudes generally has more impact on the final output of the network: a small variation in the value of the corresponding feature (the feature which is the input of the path in the first layer) leads to a significantly large variation at the output of such a “strong path”. Hence, a connection with a low weight magnitude in a strong path should not be eliminated during the learning procedure, as eliminating this connection also eliminates its underlying path, which is not desirable.

Considering the aforementioned facts, we propose an evolutionary method leading to a sparse structure, called the Path-Weight method. In this method, we aim to choose the weakest connections in the weakest paths and replace them with the best candidates, as opposed to the random replacement described in [32].

In the proposed method, the initial structure is based on the Erdos-Renyi random graph [52], which leads to a binomial distribution for the degrees of hidden neurons [53]. Let us denote the connection between the $i^{th}$ neuron in the $(k-1)^{th}$ layer and the $j^{th}$ neuron in the $k^{th}$ layer by $W_{i,j}^{(k)}$. In this initial distribution, the existence probability of connection $W_{i,j}^{(k)}$ is

P(W_{i,j}^{(k)}) = \frac{\epsilon\,(n^{(k)}+n^{(k-1)})}{n^{(k)}\,n^{(k-1)}}, (1)

where $n^{(k)}$ is the number of neurons in the $k^{th}$ layer and $\epsilon$ is a parameter which controls the sparsity of the model; the higher $\epsilon$ is, the lower the sparsity becomes.
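As an illustration, the following minimal sketch (NumPy; the layer sizes, the weight initialization, and the value of $\epsilon$ are our illustrative assumptions) draws a sparse connection mask according to the probability in (1):

```python
# A hypothetical sketch of the Erdos-Renyi sparse initialization of Eq. (1).
import numpy as np

def init_sparse_layer(n_prev, n_curr, eps=10.0, rng=np.random.default_rng(0)):
    """Return a boolean connection mask and masked initial weights for one layer."""
    p = min(1.0, eps * (n_curr + n_prev) / (n_curr * n_prev))  # Eq. (1)
    mask = rng.random((n_prev, n_curr)) < p                    # which connections exist
    weights = rng.normal(0.0, 0.1, (n_prev, n_curr)) * mask    # assumed Gaussian init
    return mask, weights

mask, W = init_sparse_layer(784, 1000)        # layer sizes are illustrative
print(f"connection density: {mask.mean():.3f}")
```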

In the rest of this section, the proposed method is described in three parts: identifying weak connections, adding new connections to the network in the substitution of the eliminated ones, and the time-varying version of the method.

II-A The Identification of Weak Connections

In each training epoch, the structure must be updated as well as the weights. Updating the structure needs a measure to identify the importance, or effectiveness, of the paths. Let us denote the $m^{th}$ path of the network by $p_m$ and the importance measure of $p_m$ by $I_{p_m}$. This importance measure is defined as follows:

I_{p_m} = \prod_{l=1}^{L} I(W_{l}^{(p_m)}), (2)

where $L$ is the number of network layers, and $W_{l}^{(p_m)}$ is the constituent connection of $p_m$ in the $l^{th}$ layer. In (2), $I(W_{l}^{(p_m)})$ is called the Normalized-Weight of $W_{l}^{(p_m)}$ and is defined as follows:

I(W_{l}^{(p_m)}) = \frac{|W_{l}^{(p_m)}|}{|F_{l}|}, (3)

where $|F_{l}|$ denotes the Euclidean norm of the feature vector of layer $l$. This definition allows the measure to account for the magnitude of the input. We choose the elimination candidates from the paths which have the smallest importance measures; hence, the strong paths remain intact. To select the weakest connections, we first determine a fraction $\lambda$ of the paths with the smallest importance measures and then remove a fraction $\zeta$ of the connections with the smallest Normalized-Weights in those paths. $\lambda$ and $\zeta$ are parameters which should be set beforehand; increasing them leads to more changes in the structure during each training epoch. These parameters can also be changed during the training procedure. At the end of this section, we propose a time-varying version of these parameters which improves the convergence speed of the proposed method.
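The following minimal sketch (NumPy; the toy layer sizes, function names, and exhaustive path enumeration are our illustrative assumptions) computes the Normalized-Weights of (3), the path importances of (2), and the elimination candidates; weights[l] is the weight matrix from layer l to layer l+1 and features[l] is the feature vector entering layer l:

```python
# A hypothetical sketch of weak-connection identification via Eqs. (2)-(3).
import itertools
import numpy as np

def normalized_weights(weights, features):
    """Eq. (3): |W| divided by the Euclidean norm of that layer's input features."""
    return [np.abs(W) / np.linalg.norm(F) for W, F in zip(weights, features)]

def weakest_connections(weights, features, lam=0.3, zeta=0.3):
    I = normalized_weights(weights, features)
    layer_sizes = [W.shape[0] for W in weights] + [weights[-1].shape[1]]
    # Enumerate all paths; feasible only for toy sizes (the exponential growth
    # noted at the end of Section II is what motivates the Sensitivity method).
    paths = list(itertools.product(*[range(n) for n in layer_sizes]))
    path_importance = {                       # Eq. (2): product along each path
        p: np.prod([I[l][p[l], p[l + 1]] for l in range(len(weights))]) for p in paths
    }
    weak_paths = sorted(paths, key=path_importance.get)[: int(lam * len(paths))]
    # Connections (layer, i, j) lying on weak paths, ranked by Normalized-Weight.
    conns = {(l, p[l], p[l + 1]) for p in weak_paths for l in range(len(weights))}
    ranked = sorted(conns, key=lambda c: I[c[0]][c[1], c[2]])
    return ranked[: int(zeta * len(ranked))]  # candidates for elimination
```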

II-B Adding New Connections

After removing the less important connections, they should be replaced by new ones. To select the best substituting candidates, we propose a probability-based method which aims to add connections to the most important nodes, called key nodes, instead of adding them randomly. A key node is a node with a significant impact on the model; in other words, removing it may ruin the performance, and its connections are very important. If a node is part of several strong paths, it can be a key node. We define the importance measure of a node as the sum of the importance measures of all paths which pass through this node:

I_{n_i^{(k)}} = \sum_{m=1}^{M} U(n_i^{(k)} \in p_m) \times I_{p_m}, (4)
U(x) = \begin{cases} 1, & \text{if } x \text{ is true} \\ 0, & \text{otherwise,} \end{cases} (5)

where $I_{n_i^{(k)}}$ is the importance measure of the $i^{th}$ node in the $k^{th}$ layer and $M$ is the total number of paths in the network. By normalizing this measure, we obtain a probability density function for adding new connections based on their source node as follows:

\bar{I}_{n_i^{(k)}} = \frac{\delta \times I_{n_i^{(k)}}}{\sum_{i}\sum_{k} I_{n_i^{(k)}}}. (6)

The parameter $\delta$ controls the number of connections established in each epoch; higher values of $\delta$ lead to more established connections per epoch. In this way, a connection originating from an important node is more likely to be added, which helps the model reach its final stable structure faster. According to [13], in natural networks, adding new connections through evolution mostly occurs at the “hub nodes” (nodes which are more important in connecting different network regions). Thus, our method for evolving the network is aligned with the behavior of natural networks. The Path-Weight method is presented in Algorithm 1.

Initialization:
  Set the algorithm parameters $\epsilon$, $\lambda$, $\zeta$, and $\delta$;
  Initialize the connections and their weights by exploiting the Erdos-Renyi distribution;

Learning Procedure:
for each epoch do
      Run the standard back-propagation algorithm;
      Update the weights;
      for each connection do
            Calculate its importance measure;
      Detect a fraction $\lambda$ of the weakest paths and remove a fraction $\zeta$ of the connections with the smallest importance measures in them;
      for each node do
            Sum the importance measures of all paths which pass through the node to obtain its importance measure;
      Normalize the importance measures of all nodes to obtain the probability density function for adding connections;
      Add connections based on the probability density function of the nodes which these connections originate from;

Algorithm 1 The Proposed Evolutionary Sparse Model
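The connection-addition step of Algorithm 1 can be sketched as follows (NumPy; the data layout, the uniform choice of destination nodes, and the use of $\delta$ to set a per-layer budget are our illustrative assumptions about one way to realize (4)–(6)):

```python
# A hypothetical sketch of node importance (Eqs. (4)-(5)) and probability-based
# connection addition (Eq. (6)).
import numpy as np

def node_importance(path_importance, layer_sizes):
    """Sum the importance of every path passing through each node (k, i)."""
    imp = [np.zeros(n) for n in layer_sizes]
    for path, I_p in path_importance.items():     # path = (i_0, i_1, ..., i_L)
        for k, i in enumerate(path):
            imp[k][i] += I_p
    return imp

def add_connections(imp, masks, delta=0.05, rng=np.random.default_rng(0)):
    """Open new connections; source nodes are drawn in proportion to their
    importance, destinations uniformly. masks[k] is the boolean connection
    mask between layers k and k+1."""
    for k, mask in enumerate(masks):
        probs = imp[k] / imp[k].sum()             # normalized node importance
        n_new = max(1, int(delta * mask.size))    # per-layer budget set by delta
        src = rng.choice(len(probs), size=n_new, p=probs)
        dst = rng.integers(0, mask.shape[1], size=n_new)
        mask[src, dst] = True                     # establish the new connections
    return masks
```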

II-C Time-Varying Version

The parameters of the proposed method ($\lambda$, $\zeta$, and $\delta$) directly affect the properties of the model: they control the number of connections removed or established at each epoch. In the early epochs of training, the network needs to change a lot because the initial structure is random. As the number of training epochs increases, most of the important connections have already been established and the structure of the model reaches a kind of maturity; thus, less manipulation is required than in the early epochs.

In the time-varying version of the proposed method, we initialize these parameters with large values. Let us define the “primary” and the “secondary” criteria at the $t^{th}$ epoch, $C_{prim}^{(t)}$ and $C_{sec}^{(t)}$, as follows:

C_{prim}^{(t)} = \frac{\sum_{\forall i,k} I^{(t)}_{n_i^{(k)}}}{N}, (7)
C_{sec}^{(t)} = \frac{\sum_{i,k \in \mho^{(t)}} I^{(t)}_{n_i^{(k)}}}{|\mho^{(t)}|}, (8)

which are the average of all importance measures and the average over the eliminated ones, respectively. In (8), $\mho^{(t)}$ is the set of eliminated connections at the $t^{th}$ epoch.

Now, the parameters can be changed during the learning process in the following manner:

\rho^{(t+1)} = \begin{cases} K_{1}\,\rho^{(t)}, & \text{if } C_{sec} < K_{3}\,C_{prim} \\ K_{2}\,\rho^{(t)}, & \text{if } C_{sec} > K_{4}\,C_{prim} \\ \rho^{(t)}, & \text{otherwise,} \end{cases} (9)

where $\rho$ stands for $\lambda$, $\zeta$, or $\delta$. In (9), $K_{1}$, $K_{2}$, $K_{3}$, and $K_{4}$ are constants with $K_{2} < 1 < K_{1}$ and $K_{3} < K_{4} < 1$. According to the simulation results, $K_{1}=2$, $K_{2}=0.5$, $K_{3}=0.1$, and $K_{4}=0.5$ are good choices for these constants.
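A minimal sketch of the update rule in (9), assuming plain Python and the constants reported above (the function name is ours):

```python
# A hypothetical sketch of the time-varying parameter update of Eq. (9).
def update_parameter(rho, c_prim, c_sec, k1=2.0, k2=0.5, k3=0.1, k4=0.5):
    """rho: current value of lambda, zeta, or delta.
    c_prim: average importance over all connections (Eq. (7)).
    c_sec: average importance over the eliminated connections (Eq. (8))."""
    if c_sec < k3 * c_prim:   # eliminations are clearly weak: change the structure more
        return k1 * rho
    if c_sec > k4 * c_prim:   # eliminations approach the average: slow the changes down
        return k2 * rho
    return rho
```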

The simulation results show a great improvement for this method, especially in generalization power and data requirements. Moreover, owing to the improvements in structure updating and convergence speed, the proposed method can reach better performance than non-sparse methods with fewer training samples. Despite its good performance, the complexity of this method may cause problems as the number of layers increases. The total number of paths in a non-sparse model grows exponentially with the number of layers. Although sparse structures have far fewer paths than non-sparse ones, an increase in the number of layers still leads to a large number of paths and, consequently, a high processing load for updating the structure. Hence, in the next section, we propose another evolutionary method which has approximately no processing overhead and only slightly lower performance than the Path-Weight method.

III Sensitivity Method

In this section, we propose another evolutionary method for applying sparsity to ANNs, called the Sensitivity method. This method has a low computational load, unlike the Path-Weight method, while their accuracies are almost the same. As discussed before, the measure which determines the weak connections is critical. In the previous section, we defined a weak connection as one which has a small effect on the output; in other words, a large change in its weight or in its corresponding input is required to produce a notable change at the output. If we consider the layers of a neural network as Multi-Input Multi-Output (MIMO) systems, the importance measure, under this definition, can be expressed by the “sensitivity” measure, which is defined as:

S(W_{i,j}^{(k)}) = \left| \frac{\partial f / \partial W_{i,j}^{(k)}}{W_{i,j}^{(k)}} \right|, (10)

where $f$ stands for the final output of the neural network. This measure is well known in the field of control and systems, where it is used for measuring the importance of parameters in a system [54].

Fortunately, the partial derivatives of the final output with respect to all the connections are already calculated during the training procedure by the back-propagation algorithm. Hence, calculating the sensitivity measure in the updating procedure incurs no computational overhead except for dividing the partial derivatives by the corresponding weights.

The only difference between the Sensitivity method and the Path-Weight method is their importance measures; all other stages of the methods such as initialization and adding new connections are the same. In the Sensitivity method, the importance measure of a node is considered as the sum of sensitivity measures of all the connections that originate from that particular node:

S(n^{(k)}_{i}) = \sum_{j=1}^{n^{(k+1)}} \left| \frac{\partial f / \partial W_{i,j}^{(k)}}{W_{i,j}^{(k)}} \right|. (11)
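The following minimal sketch (NumPy; grad_W is assumed to be the gradient array already produced by back-propagation for one layer, and the small eps guard is ours) computes the connection sensitivities of (10) and the node sensitivities of (11):

```python
# A hypothetical sketch of the Sensitivity measure, Eqs. (10)-(11).
import numpy as np

def connection_sensitivity(grad_W, W, eps=1e-12):
    """S(W_ij^(k)) = |df/dW_ij^(k)| / |W_ij^(k)| for every connection in a layer."""
    return np.abs(grad_W) / (np.abs(W) + eps)     # eps avoids division by exact zeros

def node_sensitivity(grad_W, W):
    """Eq. (11): sum the sensitivities of all connections leaving each node."""
    return connection_sensitivity(grad_W, W).sum(axis=1)
```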

At each epoch of the training phase, the number of operations required for the evolutionary part of the Sensitivity method is $O(L\mu\bar{N})$, where $\bar{N}$ is the average number of neurons per layer and $\mu$ is the sparsity factor of the model, defined as the ratio of the number of existing connections to the number of all possible ones (between 0 and 1). On the other hand, the Path-Weight method requires $O((\mu\bar{N})^{L})$ operations per epoch. As expected, the complexity of the Path-Weight method grows exponentially with the number of layers, whereas the linear complexity of the Sensitivity method makes it scalable to networks with a large number of layers. Although the Path-Weight measure is more elaborate, the simulation results show that the Sensitivity method performs only slightly worse than the Path-Weight method in terms of accuracy, the number of required training samples, and generalization power. Details are discussed in the next section.

IV Simulation Results

In this section, we evaluate the performance of the proposed methods in terms of accuracy, generalization power, convergence speed, complexity, and the number of required training samples. In the simulations, four different networks are compared for both multi-layer perceptron (MLP) and convolutional architectures: the conventional non-sparse network, the sparse evolutionary training (SET) network introduced in [32], the sparse network based on the Path-Weight method, and the sparse network based on the Sensitivity method. The convolutional networks are based on the ResNet18 [55] architecture, whose last fully-connected layers are modified according to the four aforementioned methods.

In the first scenario, an MLP network is trained on the CIFAR10 dataset. All networks have 5 layers and 2000 neurons, and all use ReLU [56] as the activation function. The only differences are the number and the configuration of the connections. The dense network also uses dropout [57] in each layer.

Figure 1: CIFAR10 top-1 accuracy for the dense, SET, Path-Weight, and Sensitivity models of an MLP network versus the epoch number.

The accuracy of the mentioned networks versus the epoch number is shown in Fig. 1. In this figure, the sparse networks have one tenth of the possible connections, i.e., 90% sparsity. The networks using the proposed methods have 5% better accuracy than the dense network and 3% better accuracy than the SET network. The network using the Path-Weight method reaches 70% accuracy at the 36th epoch and the network using the Sensitivity method reaches the same accuracy at the 60th epoch, while the dense and SET networks reach 70% accuracy at the 350th epoch. This confirms that the proposed methods improve the convergence speed. The time-varying version of these methods further accelerates convergence: when the parameters change with respect to the current state of the model, a finalized structure can be achieved in fewer epochs. In Fig. 1, the time-varying version of the Path-Weight method reaches 80% accuracy at epoch 100, while the other networks reach the same accuracy at epoch 500. Another point seen in Fig. 1 is the enhanced generalization power of the sparse networks; in other words, the overfitting problem is alleviated in the sparse networks, whose accuracy does not drop even after thousands of training epochs.

Convolutional Neural Networks (CNN) are the most popular networks for image-related tasks. In a CNN, the last layers have an MLP architecture and consequently contain a major fraction of the parameters of the whole model. Sparsifying these layers can reduce the parameters of the whole model by up to 40%, which greatly accelerates the training phase and also improves the generalization power of the model with no drop in accuracy. In Fig. 2, four versions of the ResNet18 network for image classification on the ImageNet dataset are compared in terms of accuracy at different epoch numbers.

Figure 2: ImageNet top-1 accuracy for the dense, SET, Path-Weight, and Sensitivity versions of the ResNet18 network versus the epoch number.

As can be seen from Fig. 2, the proposed methods have almost 12% and 5% higher top-1 accuracy than the dense and SET versions of the ResNet18 network, respectively. The proposed methods show faster accuracy growth in the early epochs, and they converge to their final models earlier than their sparse and non-sparse counterparts.

The proposed sparse methods establish only the important connections in the network and eliminate the useless ones in order to reach a less complex model; thus, these networks can be trained with fewer training samples and achieve approximately the same performance as non-sparse networks trained with more samples.

| Fraction of used samples | Dense ResNet | SET ResNet | Path-Weight ResNet | Sensitivity ResNet |
| --- | --- | --- | --- | --- |
| 100% | 79.9% | 77.2% | 81.4% | 80.5% |
| 90% | 75.3% | 74.5% | 79.1% | 78.3% |
| 80% | 69.8% | 68.9% | 74.5% | 71.4% |
| 60% | 61.2% | 65.7% | 69.4% | 66.1% |
| 40% | 49.1% | 57.4% | 63.8% | 60.3% |

TABLE I: The top-1 accuracy of different versions of ResNet18 for various fractions of the used training samples of the ImageNet dataset.

Table I shows the top-1 accuracy of the different versions of ResNet18 for various fractions of the used training samples of the ImageNet dataset. It can be concluded from Table I that the proposed methods are more robust to the size of the training data than the others. This makes it possible to use deep networks in applications such as medical imaging, cancer detection, and disease prediction, where there are not enough samples for conventional deep networks to be trained.

V Conclusion

In this paper, we have proposed two evolutionary methods for sparsifying ANNs which update both the sparse structure of a network and the values of its parameters during the learning procedure. In the first method, called the “Path-Weight” method, we consider end-to-end paths of the network for updating the structure instead of only single connections, and the effects of the inputs are also taken into account in the updating metric; new connections are added to the most important nodes with high probability. In Section III, a less complex method, the “Sensitivity” method, is introduced, which performs slightly worse than the “Path-Weight” method of Section II and uses the sensitivity measure as its updating metric. The simulation results show that these two methods have 5% better accuracy, 7 times faster convergence to their final structures, and more generalization power, while they reduce the number of parameters by up to 95% on the ImageNet and CIFAR10 datasets.

References

  • [1] C. M. Niell and M. P. Stryker, “Highly selective receptive fields in mouse visual cortex,” Journal of Neuroscience, vol. 28, no. 30, pp. 7520–7536, 2008.
  • [2] K. D. Harris and T. D. Mrsic-Flogel, “Cortical connectivity and sensory coding,” Nature, vol. 503, no. 7474, p. 51, 2013.
  • [3] D. Mao, S. Kandler, B. L. McNaughton, and V. Bonin, “Sparse orthogonal population representation of spatial context in the retrosplenial cortex,” Nature communications, vol. 8, no. 1, p. 243, 2017.
  • [4] C. McCafferty, F. David, M. Venzi, M. L. Lőrincz, F. Delicata, Z. Atherton, G. Recchia, G. Orban, R. C. Lambert, G. Di Giovanni et al., “Cortical drive and thalamic feed-forward inhibition control thalamic output synchrony during absence seizures,” Nature neuroscience, vol. 21, no. 5, p. 744, 2018.
  • [5] M. Radosevic, A. Willumsen, P. C. Petersen, H. Linden, M. Vestergaard, and R. W. Berg, “Decoupling of timescales reveals sparse convergent cpg network in the adult spinal cord,” Nature communications, vol. 10, 2019.
  • [6] S. P. Jadhav, J. Wolfe, and D. E. Feldman, “Sparse temporal coding of elementary tactile features during active whisker sensation,” Nature neuroscience, vol. 12, no. 6, p. 792, 2009.
  • [7] A.-L. Barabási and R. Albert, “Emergence of scaling in random networks,” science, vol. 286, no. 5439, pp. 509–512, 1999.
  • [8] D. J. Watts and S. H. Strogatz, “Collective dynamics of ‘small-world’networks,” nature, vol. 393, no. 6684, p. 440, 1998.
  • [9] S. B. Hofer, H. Ko, B. Pichler, J. Vogelstein, H. Ros, H. Zeng, E. Lein, N. A. Lesica, and T. D. Mrsic-Flogel, “Differential connectivity and response dynamics of excitatory and inhibitory neurons in visual cortex,” Nature neuroscience, vol. 14, no. 8, p. 1045, 2011.
  • [10] V. K. Jirsa and A. R. McIntosh, Handbook of brain connectivity.   Springer, 2007, vol. 1.
  • [11] J. Clune, J.-B. Mouret, and H. Lipson, “The evolutionary origins of modularity,” Proceedings of the Royal Society b: Biological sciences, vol. 280, no. 1755, p. 20122863, 2013.
  • [12] K. Safaryan, R. Maex, N. Davey, R. Adams, and V. Steuber, “Nonspecific synaptic plasticity improves the recognition of sparse patterns degraded by local noise,” Scientific reports, vol. 7, p. 46550, 2017.
  • [13] S. Ganguli and H. Sompolinsky, “Compressed sensing, sparsity, and dimensionality in neuronal information processing and data analysis,” Annual review of neuroscience, vol. 35, pp. 485–508, 2012.
  • [14] E. Bullmore and O. Sporns, “Complex brain networks: graph theoretical analysis of structural and functional systems,” Nature reviews neuroscience, vol. 10, no. 3, p. 186, 2009.
  • [15] R. Miao, L.-Y. Xia, H.-H. Chen, H.-H. Huang, and Y. Liang, “Improved classification of blood-brain-barrier drugs using deep learning,” Scientific Reports, vol. 9, no. 1, p. 8802, 2019.
  • [16] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sánchez, “A survey on deep learning in medical image analysis,” Medical image analysis, vol. 42, pp. 60–88, 2017.
  • [17] E. Moen, D. Bannon, T. Kudo, W. Graf, M. Covert, and D. Van Valen, “Deep learning for cellular image analysis,” Nature methods, p. 1, 2019.
  • [18] C. A. Nelson, A. Butte, and S. E. Baranzini, “Integrating biomedical research and electronic health records to create knowledge based biologically meaningful machine readable embeddings,” bioRxiv, p. 540963, 2019.
  • [19] L. Wang, H.-F. Wang, S.-R. Liu, X. Yan, and K.-J. Song, “Predicting protein-protein interactions from matrix-based protein sequence using convolution neural network and feature-selective rotation forest,” Scientific Reports, vol. 9, no. 1, p. 9848, 2019.
  • [20] D. Chen, S. Liu, P. Kingsbury, S. Sohn, C. B. Storlie, E. B. Habermann, J. M. Naessens, D. W. Larson, and H. Liu, “Deep learning and alternative learning strategies for retrospective real-world clinical data,” npj Digital Medicine, vol. 2, no. 1, p. 43, 2019.
  • [21] N. Amin, A. McGrath, and Y.-P. P. Chen, “Evaluation of deep learning in non-coding rna classification,” Nature Machine Intelligence, vol. 1, no. 5, p. 246, 2019.
  • [22] G. Eraslan, Ž. Avsec, J. Gagneur, and F. J. Theis, “Deep learning: new computational modelling techniques for genomics,” Nature Reviews Genetics, p. 1, 2019.
  • [23] N. Rezaii, E. Walker, and P. Wolff, “A machine learning approach to predicting psychosis using semantic density and latent content analysis,” NPJ schizophrenia, vol. 5, 2019.
  • [24] R. Porotti, D. Tamascelli, M. Restelli, and E. Prati, “Coherent transport of quantum states by deep reinforcement learning,” Communications Physics, vol. 2, no. 1, p. 61, 2019.
  • [25] P. Baldi, P. Sadowski, and D. Whiteson, “Searching for exotic particles in high-energy physics with deep learning,” Nature communications, vol. 5, p. 4308, 2014.
  • [26] M. De Jong, W. Chen, R. Notestine, K. Persson, G. Ceder, A. Jain, M. Asta, and A. Gamst, “A statistical learning framework for materials science: application to elastic moduli of k-nary inorganic polycrystalline compounds,” Scientific reports, vol. 6, p. 34256, 2016.
  • [27] S. A. Mengiste, A. Aertsen, and A. Kumar, “Effect of edge pruning on structural controllability and observability of complex networks,” Scientific reports, vol. 5, p. 18145, 2015.
  • [28] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Advances in neural information processing systems, 2015, pp. 1135–1143.
  • [29] S. Anwar, K. Hwang, and W. Sung, “Structured pruning of deep convolutional neural networks,” ACM Journal on Emerging Technologies in Computing Systems (JETC), vol. 13, no. 3, p. 32, 2017.
  • [30] S. Scardapane, D. Comminiello, A. Hussain, and A. Uncini, “Group sparse regularization for deep neural networks,” Neurocomputing, vol. 241, pp. 81–89, 2017.
  • [31] C. Louizos, M. Welling, and D. P. Kingma, “Learning sparse neural networks through $l_0$ regularization,” arXiv preprint arXiv:1712.01312, 2017.
  • [32] D. C. Mocanu, E. Mocanu, P. Stone, P. H. Nguyen, M. Gibescu, and A. Liotta, “Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science,” Nature communications, vol. 9, no. 1, p. 2383, 2018.
  • [33] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “Scnn: An accelerator for compressed-sparse convolutional neural networks,” in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).   IEEE, 2017, pp. 27–40.
  • [34] A. Makhzani and B. Frey, “K-sparse autoencoders,” arXiv preprint arXiv:1312.5663, 2013.
  • [35] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, “Learning efficient convolutional networks through network slimming,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2736–2744.
  • [36] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy et al., “Evolving deep neural networks,” in Artificial Intelligence in the Age of Neural Networks and Brain Computing.   Elsevier, 2019, pp. 293–312.
  • [37] Y. Sun, X. Wang, and X. Tang, “Sparsifying neural network connections for face recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4856–4864.
  • [38] Y. He, X. Zhang, and J. Sun, “Channel pruning for accelerating very deep neural networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1389–1397.
  • [39] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
  • [40] J. Frankle and M. Carbin, “The lottery ticket hypothesis: Finding sparse, trainable neural networks,” 2019.
  • [41] J. Frankle, G. K. Dziugaite, D. M. Roy, and M. Carbin, “Stabilizing the lottery ticket hypothesis,” arXiv preprint arXiv:1903.01611, 2019.
  • [42] H. Zhou, J. Lan, R. Liu, and J. Yosinski, “Deconstructing lottery tickets: Zeros, signs, and the supermask,” arXiv preprint arXiv:1905.01067, 2019.
  • [43] T. Dettmers and L. Zettlemoyer, “Sparse networks from scratch: Faster training without losing performance,” arXiv preprint arXiv:1907.04840, 2019.
  • [44] U. Evci, T. Gale, J. Menick, P. S. Castro, and E. Elsen, “Rigging the lottery: Making all tickets winners,” in International Conference on Machine Learning.   PMLR, 2020, pp. 2943–2952.
  • [45] A. Gordon, E. Eban, O. Nachum, B. Chen, H. Wu, T.-J. Yang, and E. Choi, “Morphnet: Fast & simple resource-constrained structure learning of deep networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 1586–1595.
  • [46] K. O. Stanley and R. Miikkulainen, “Evolving neural networks through augmenting topologies,” Evolutionary computation, vol. 10, no. 2, pp. 99–127, 2002.
  • [47] M. Hausknecht, J. Lehman, R. Miikkulainen, and P. Stone, “A neuroevolution approach to general atari game playing,” IEEE Transactions on Computational Intelligence and AI in Games, vol. 6, no. 4, pp. 355–366, 2014.
  • [48] T. Miconi, “Neural networks with differentiable structure,” arXiv preprint arXiv:1606.06216, 2016.
  • [49] P. Verbancsics and K. O. Stanley, “Constraining connectivity to encourage modularity in hyperneat,” in Proceedings of the 13th annual conference on Genetic and evolutionary computation, 2011, pp. 1483–1490.
  • [50] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning.   MIT press, 2016.
  • [51] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition.   Ieee, 2009, pp. 248–255.
  • [52] P. Erdős and A. Rényi, “On the evolution of random graphs,” Publ. Math. Inst. Hung. Acad. Sci, vol. 5, no. 1, pp. 17–60, 1960.
  • [53] M. E. Newman, S. H. Strogatz, and D. J. Watts, “Random graphs with arbitrary degree distributions and their applications,” Physical review E, vol. 64, no. 2, p. 026118, 2001.
  • [54] N. S. Nise, Control systems engineering.   John Wiley & Sons, 2020.
  • [55] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [56] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th international conference on machine learning (ICML-10), 2010, pp. 807–814.
  • [57] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv preprint arXiv:1207.0580, 2012.