
Dense Optimizer : An Information Entropy-Guided
Structural Search Method for Dense-like Neural Network Design

Tianyuan Liu, Libin Hou, Linyuan Wang, Xiyu Song, Bin Yan
Information Engineering University, Zhengzhou, China
Abstract

Dense Convolutional Networks (DenseNets) have been continuously refined toward highly efficient and compact architectures owing to their lightweight structure and feature reuse. However, current Dense-like architectures are mainly designed manually, and it becomes increasingly difficult to adjust the channels and the level of feature reuse based on past experience. We therefore propose an architecture search method, Dense Optimizer, that automatically searches for high-performance dense-like networks. In Dense Optimizer, we view the dense network as a hierarchical information system and maximize the network's information entropy while constraining the distribution of entropy across stages to follow a power law, thereby constructing an optimization problem. We also propose a branch-and-bound optimization algorithm that tightly integrates the power-law principle with search-space scaling to solve the optimization problem efficiently. The superiority of Dense Optimizer is validated on different computer vision benchmark datasets. Specifically, Dense Optimizer completes a high-quality search in only 4 hours on a single CPU. Our searched model, DenseNet-OPT, achieves a top-1 accuracy of 84.3% on CIFAR-100, which is 5.97% higher than the original DenseNet.

Index Terms:
Dense network Optimizer, information entropy, power-law, structural search.

I Introduction

In the field of computer vision, searching for superior and lightweight network architectures will never be an outdated task. A series of works such as AlexNet [1], VGG [2], ResNet [3], and DenseNet [4] has improved the effectiveness of neural networks and our understanding of network design. Since the proposal of DenseNet, its design concept has been widely applied to the design of various advanced backbone models. Networks based on improvements to DenseNet, such as channel cross-linked dense convolutional networks [5], lightweight DenseNet structures, dense units, dense connection modes, and attention mechanisms, have achieved significant results in tasks such as EEG emotion detection and medical analysis [6, 7, 8, 9].

However, with the expansion of model scale, it becomes increasingly difficult to adjust the network's channel design and level of feature reuse based on past experience. Neural Architecture Search (NAS) has provided convenience for structural design by constructing high-performance networks with reinforcement learning or gradient-based algorithms within a given fixed search space [10] [11]. But to some extent, automatic search does not equal a good automatic design method: previous NAS works cannot skillfully control aspects such as the convolution kernel size and the number of channels. Meanwhile, NAS methods require the training and evaluation of a large number of candidate networks during the search phase, which is both time-consuming and computationally intensive.

The bilevel optimization of previous NAS methods, such as differentiable architecture search (DARTS) [11], brings a serious computational overhead problem. Our solution is to separate the structural parameter search from the weight parameter training of the dense network, decoupling the bilevel optimization into a two-step optimization and transforming the design of structural hyperparameters (such as kernel size and the number of channels) into an optimization problem. Information entropy is an effective tool to describe a network's representation ability [12] and can be used as the main optimization objective to search dense structures. After the structural parameters are optimized, the weight parameters are trained to obtain the final high-performance model.

In practical terms, we propose Dense Optimizer, a universal structural-parameter optimizer for dense-like network structures. Dense Optimizer establishes an optimization model that maximizes the structural information entropy of the dense backbone by searching for the optimal configuration of network depth, width, and kernel size. It incorporates a power-law distribution as an evaluation metric and naturally embeds it into the optimization model. Besides, we propose a branch-and-bound optimization algorithm based on search-space scaling to solve the problem. Using traditional Dense-BC modules, the models designed by Dense Optimizer are comparable with CNN models of the same size and FLOPs. The key contributions of this work are as follows:

  • Dense Optimizer is proposed as a dense architecture search method. The search process is transformed into an optimization problem to construct Dense models efficiently.

  • Maximizing the network's information entropy under a multi-scale entropy power-law distribution principle is proposed to construct the optimization model.

  • A branch-and-bound optimization algorithm is proposed, which tightly integrates the power-law principle with search-space scaling.

  • Dense Optimizer is specifically designed for dense-like architectures and achieves significant improvements across different datasets.

II Related Work

II-A Design of DenseNet

DenseNet has garnered widespread attention and research interest in the field of computer vision [13]. Owing to its excellent performance and flexibility, a multitude of manual designs in recent years have been leveraged to refine and augment the DenseNet architecture [14][15]. However, these designs still rely heavily on human expertise and lack principles to guide structural design [6]. When selecting hyperparameters such as the channel growth rate and convolution kernels, extensive experimentation and tuning are often required [16]. Dense Optimizer provides an efficient and automatic structural search method for Dense-like networks while promoting their performance at the same time.

II-B Neural Architecture Search

Neural Architecture Search (NAS) has been extensively studied in the past few years to automatically design more effective architectures. Popular NAS algorithms include genetic algorithms [17, 18, 19, 20], reinforcement learning (RL) [21, 22], differentiable search [23, 11], and many other types of optimization algorithms, e.g., Bayesian optimization and particle swarm optimization [24, 25, 26]. As a classic differentiable architecture search method, DARTS is among the most widely studied algorithms. It models architecture design as a bi-level optimization problem, which requires training vast numbers of candidate networks to inform the search process and often leads to high computational cost.

Recent NAS work by Mellor et al. [27] starts to explore indicators that can predict a network's performance without training. Moreover, Shen et al. [28] endeavor to decouple network weights from network architecture, focusing on the discovery of improved design principles through the exploration of structural parameters alone. These methods represent an effective exploration of architecture search that decouples structural parameters and weights, but they still lack good design guidelines. Dense Optimizer conducts further in-depth research along this line: it circumvents the time-consuming bi-level optimization and uses information entropy as a design criterion to obtain a better structure by solving an optimization problem.

II-C Mathematical Architecture Design

Information theory is a powerful tool for studying complex systems such as deep neural networks. Recently, mathematical architecture design (MAD) has been proposed [28]. Unlike existing hyperparameter optimization methods [29, 30, 31], MAD does not require any model training during optimization, allowing optimized network structures to be obtained within minutes. It maximizes the network entropy under three empirical guidelines and demonstrates an advancement in designing network structures via mathematical programming. However, MAD is unable to characterize the information flow of the concatenation operation or accurately estimate the information entropy of dense network structures, and its three empirical guidelines lack a strong theoretical foundation. Dense Optimizer not only specifically addresses these issues, but also constrains the distribution of entropy at different scales with a power law and proposes a new optimization model for dense-like network architectures.

III Dense based architecture Optimizer

In this section, we introduce the core architecture of Dense Optimizer. Specifically, we consider a deep neural network as a continuous information processing system. We provide a definition of structural entropy effectiveness and extend it from the Multi-Layer Perceptron (MLP) to networks with dense connections. Then, we propose an optimization model to study the architectural design of DenseNet. To formulate the optimization problem, we first define the entropy that governs the network's expressiveness, followed by the power-law constraints that regulate the efficacy of the multi-scale entropy distribution. Finally, we present the precise optimization model and propose the corresponding branch-and-bound optimization algorithm.

III-A Entropy of DenseBlock

The principle of maximum entropy is one of the most widely applied principles in information theory [32]. Some previous works have also attempted to establish a connection between entropy and the structure of neural networks. Here, we provide a re-derivation and give the entropy upper bound of a DenseBlock. Suppose that in an L-layer MLP f(·), the i-th layer has w_i input channels and w_{i+1} output channels. The output x_{i+1} and the input x_i are connected by x_{i+1} = M_i x_i, where M_i ∈ R^{w_{i+1}×w_i} is a trainable weight matrix. The structural parameters then define how the input x_i propagates inside the network, which can be characterized explicitly.

For a DenseBlock with L layers, we consider the information reuse caused by the network’s dense connections and the impact of information distribution brought about by the concatenation operation. Then we obtain the information entropy of the basic block of the dense network, defined in Proposition 1:

Proposition 1. The normalized Gaussian entropy upper bound of the DenseBlock f(·) is

H_f = w_L \log\left(w_0^{L} \cdot i!\right), (1)

where w_L is the width of the L-th layer and w_0 is the initial width of the DenseBlock. The whole derivation is given in Appendix A. The entropy measures the expressiveness of a DenseBlock. Following the principle of maximum entropy [33, 34], we propose to maximize the entropy of the DenseBlock under given computational budgets. When calculating the precise information entropy of a dense block, for the i-th dense layer the number of input channels is c_i, the number of output channels is c_{i+1}, and the kernel size is k_i. Consequently, the "width" of a dense-block layer is projected as c_i k_i^2 in (1). Therefore, for a feature map with a resolution of r_i × r_i, the entropy of a DenseBlock with L layers is defined by

H_f = \log\left(r_L^2\, c_{L+1}\right) \log\left(\left(c_i k_i^2\right)^{L} \cdot i!\right). (2)

Inspired by [35], taking the logarithm better formulates the ground-truth entropy for natural images.
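For concreteness, the following is a minimal sketch of evaluating the block entropy in Eq. (2) for one layer index; the function and variable names are ours, and the interpretation of the i! term as the factorial of the layer index is our reading of the formula rather than a detail confirmed in the paper.

```python
import math

def denseblock_entropy(channels, kernels, resolution, layer_idx):
    """Sketch of Eq. (2): block entropy evaluated at dense layer `layer_idx`.

    channels   -- [c_1, ..., c_{L+1}] per-layer channel counts (last entry is the block output)
    kernels    -- [k_1, ..., k_L] per-layer kernel sizes
    resolution -- feature-map side length r_L
    layer_idx  -- layer index i appearing in the i! term (assumed interpretation)
    """
    L = len(kernels)
    c_i, k_i = channels[layer_idx - 1], kernels[layer_idx - 1]
    c_out = channels[-1]                                   # c_{L+1}
    # H_f = log(r_L^2 * c_{L+1}) * [ L * log(c_i * k_i^2) + log(i!) ]
    return math.log(resolution ** 2 * c_out) * (
        L * math.log(c_i * k_i ** 2) + math.lgamma(layer_idx + 1)
    )

# Illustrative 4-layer block with growth rate 12 and 3x3 kernels on 32x32 feature maps
chs = [24, 36, 48, 60, 72]
print(denseblock_entropy(chs, [3, 3, 3, 3], resolution=32, layer_idx=4))
```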

III-B Effectiveness Defined in DenseBlock

Inspired by previous research, an infinitely deep network becomes hard to train unless it meets particular structural requirements. Therefore, in Dense Optimizer, we propose to control the depth of the dense-like network so that gradients can flow effectively throughout the entire architecture. Typically, the depth and width of a network are relative; thus, the effectiveness of a network with L layers, where each layer has the same width W, can be defined as follows:

\rho = L/W. (3)

Normally, the width w_i of each layer can be different, so the average width of an L-layer network f(·) is defined by

\overline{w} = \left(\prod_{i=1}^{L} w_i\right)^{1/L} = \exp\left(\frac{1}{L}\sum_{i=1}^{L}\log w_i\right). (4)

In a DenseBlock, however, each layer connects to all previous layers, which results in a relatively steady increase in the number of parameters with each layer. The growth rate K is always the same and much smaller than the number of input channels, so the average width can be defined as:

\overline{w} = w_0 + K/2 \approx w_0. (5)

So for a DenseBlock with L layers, input width w_0, and growth rate K, the effectiveness is defined by

\rho = L/w_0. (6)
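As a small self-contained check (illustrative numbers only), the geometric-mean width of Eq. (4) and the effectiveness ratio of Eq. (6) can be computed directly; `rho_budget` below stands in for the bound ρ later used in the constraint of Eq. (10).

```python
import math

def average_width(widths):
    """Geometric-mean width of an L-layer network, Eq. (4)."""
    return math.exp(sum(math.log(w) for w in widths) / len(widths))

def effectiveness(num_layers, w0):
    """Effectiveness of a DenseBlock, Eq. (6): rho = L / w_0."""
    return num_layers / w0

# Illustrative DenseBlock: 16 layers, 64 input channels, growth rate 12
widths = [64 + 12 * i for i in range(16)]
rho_budget = 20                              # Eq. (10) usually tunes rho in [10, 20]
print(round(average_width(widths), 1))       # geometric-mean width of the block
print(effectiveness(16, 64) <= rho_budget)   # depth/width constraint check
```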

III-C Power law in entropy distribution

In information theory, entropy reflects the amount of information or uncertainty within a system; higher information entropy implies that the system is more dispersed and diverse [36]. The stages of a multi-stage neural network are akin to independent information systems. To avoid a scattered distribution of information entropy and uncertainty across these systems, it is essential to impose a multi-scale entropy distribution under mathematical constraints. According to highly optimized tolerance (HOT) [37], when a complex system is in a HOT state it satisfies power laws; that is, a global optimization process can lead to power-law distributions: inputs with characteristic scales, after undergoing a global system's "output" optimization process, can produce outputs with power-law characteristics [38].

Based on extensive experimental statistics, we find that the information entropy distribution of dense networks follows a power-law distribution, as shown in Figure 1.

Figure 1: Visualization of the multi-scale entropy power-law distribution, based on the statistical results of dense backbones. The entropy distribution of a dense backbone across different feature sizes is consistent with a power-law function.

To reinforce this constraint, we propose a power law for the entropy distribution and use a two-parameter fitting function with parameters a and b to optimize the distribution. The objective is to maximize the value of a and minimize the value of b under the same fitting parameter settings.

Here, we provide specific definitions. Following (2), the cumulative entropy distribution sequence at the current stage is:

H = \left[H_f^{1}, H_f^{2}, H_f^{3}, \cdots, H_f^{L}\right]. (7)

The fitting expression of this sequence under the power law function is as follows:

H = a \cdot M_i^{b}, (8)

where a and b are power-law parameters and M_i represents the i-th stage. The optimization target S is then:

S = a - b. (9)

Subsequently, we establish optimization constraints to achieve the objective of maximizing a while minimizing b.
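As a minimal sketch (SciPy assumed, entropy values illustrative), the two-parameter power-law fit of Eq. (8) and the resulting score S = a − b of Eq. (9) can be obtained with a standard curve fit:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(m, a, b):
    # Eq. (8): H = a * M_i^b, where m is the stage index
    return a * np.power(m, b)

def power_law_score(stage_entropies):
    """Fit Eq. (8) to per-stage entropies and return S = a - b, Eq. (9)."""
    stages = np.arange(1, len(stage_entropies) + 1, dtype=float)
    (a, b), _ = curve_fit(power_law, stages, stage_entropies, p0=(1.0, 1.0), maxfev=10000)
    return a - b

# Illustrative per-stage entropy values (not taken from the paper)
print(power_law_score([120.0, 310.0, 560.0, 880.0]))
```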

III-D Optimization model and Solutions

We now gather everything together and present the final optimization model of Dense Optimizer. Suppose that we aim to design an L-layer dense-like model F with M stages. The entropy of the i-th stage is denoted as H_i, defined in (2). We propose to optimize {c_i, k_i, L_i} via the following optimization problem:

\max_{w_i, L_i} \sum_{i=1}^{M} \alpha_i H_i + \beta S,
s.t. L_i/W_i \le \rho,
\text{FLOPs}[f(\cdot)] \le \text{budget},
\text{Params}[f(\cdot)] \le \text{budget},
w_1 \le w_2 \le \cdots \le w_L. (10)

where α_i is the weight of the entropy at scale i and β is a tuning coefficient. ρ controls the effectiveness of the network and is usually tuned in the range [10, 20].
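A compact, self-contained sketch of how a candidate structure could be scored against Eq. (10) is given below; the `Stage` record and its fields are our own illustrative representation (not the paper's code), and `power_law_score` refers to the fitting sketch above.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    width: int       # stage output width W_i
    layers: int      # number of dense layers L_i
    entropy: float   # H_i, precomputed via Eq. (2)
    flops: float
    params: float

def objective(stages, alphas, beta, rho_max, flop_budget, param_budget):
    """Score a candidate structure per Eq. (10); -inf marks an infeasible candidate."""
    if any(s.layers / s.width > rho_max for s in stages):      # L_i / W_i <= rho
        return float("-inf")
    if sum(s.flops for s in stages) > flop_budget:             # FLOPs budget
        return float("-inf")
    if sum(s.params for s in stages) > param_budget:           # Params budget
        return float("-inf")
    widths = [s.width for s in stages]
    if widths != sorted(widths):                               # w_1 <= ... <= w_L
        return float("-inf")
    entropies = [s.entropy for s in stages]
    # Weighted entropy sum plus the power-law term beta * S of Eq. (9)
    return sum(a * h for a, h in zip(alphas, entropies)) + beta * power_law_score(entropies)
```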

Given the complexity of the above optimization problem and the particular structure of its candidate solution space, we propose a branch-and-bound algorithm that integrates the power law (Algorithm 1) to achieve an efficient optimization search. By exploiting the properties of the power-law distribution, we decompose the optimization problem into sub-problems, gradually narrowing the search space and finding the optimal solution through search.

Specifically, we first score the initial densely connected architecture through its information entropy representation. Then, during the search process, we relax the initial network structure while employing region-reduction techniques during the relaxation. Based on the network information entropy at a given stage, we compute the entropy space that better conforms to the power-law distribution and prune the search space accordingly. Throughout the iterative optimization, we always retain the best solution found; a simplified code sketch of this loop is given after Algorithm 1.

Algorithm 1 Branch-and-Bound Method for Coarse-to-Fine Optimization
Input: Search space S, inference budget B, maximal depth L, total number of iterations T, evolutionary population size N, initial structure F_0, fine-search flag Flag.
Output: Dense optimized backbone F*.
  Initialize population P = {F_0}, Flag = False.
  for t = 1, 2, ..., T do
     Calculate the network information entropy at each stage
     Conduct a mathematical optimization
     Compute the ideal information entropy under the power-law distribution
     Adjust the search space at each stage
     Perform internal mathematical optimization for each stage
     Remove the networks of smallest entropy if the size of P exceeds B
  end for
  return F*, the network of the highest entropy in P
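The following is a simplified, self-contained sketch of the search loop in Algorithm 1; it keeps only the coarse structure (score, shrink the per-stage search space toward the power-law-ideal entropy, re-sample, keep the best), and every helper passed in is an illustrative stand-in rather than the authors' implementation.

```python
def branch_and_bound_search(init_structure, search_space, score_fn,
                            ideal_entropy_fn, shrink_fn, mutate_fn,
                            iterations=500, population=256):
    """Simplified coarse-to-fine search in the spirit of Algorithm 1 (a sketch)."""
    pop = [init_structure]
    best, best_score = init_structure, score_fn(init_structure)
    space = search_space
    for _ in range(iterations):
        # 1) Score every candidate (stage-wise entropy plus the power-law term)
        scored = [(score_fn(c), c) for c in pop]
        top_score, top = max(scored, key=lambda x: x[0])
        # 2) Always retain the best solution found so far
        if top_score > best_score:
            best, best_score = top, top_score
        # 3) Compute the ideal per-stage entropy under the power law and
        #    shrink the per-stage search space toward it (the bounding step)
        space = shrink_fn(space, ideal_entropy_fn(best))
        # 4) Re-sample / mutate the fittest candidates inside the reduced space
        keep = sorted(scored, key=lambda x: x[0], reverse=True)[:population]
        pop = [mutate_fn(c, space) for _, c in keep]
    return best
```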

IV Experiments and Results

In this section, we first describe the detailed settings for search optimization using Dense Optimizer. The optimized dense network is then trained, with training settings introduced in detail in Section IV-B, and tested on the CIFAR-10, CIFAR-100, and SVHN datasets. The performance of the optimized structure is compared with the classic ResNet and DenseNet, and ablation experiments are conducted in Section IV-D to verify the effectiveness of the multi-scale information entropy power-law distribution.

IV-A Search Settings

In Dense Optimizer, the search population size N is 256 and the number of iterations is 500,000. The classic DenseNet-121 is used as the initial backbone network. The search space covers the number of input and output channels of each block, convolution kernel sizes in [3, 5, 7], a layer budget of 130, and maximum growth rates in [12, 24, 40]. The optimization problem is solved on a CPU device.
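For reference, the settings above might be collected into a configuration such as the one below; the field names are our own, not taken from the paper's code.

```python
# Illustrative search configuration mirroring Section IV-A (field names are ours)
SEARCH_CONFIG = {
    "population_size": 256,
    "iterations": 500_000,
    "initial_backbone": "densenet121",
    "kernel_sizes": [3, 5, 7],
    "layer_budget": 130,
    "max_growth_rates": [12, 24, 40],
    "device": "cpu",        # the structural optimization itself runs on a CPU
}
```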

IV-B Training Settings

Following previous works, the SGD optimizer with momentum 0.9 is adopted to train the dense models. The weight decay is 5e-4 for the CIFAR datasets. The initial learning rate is 0.1 with a batch size of 32. We use cosine learning rate decay with 5 epochs of warm-up. The number of training epochs is 100 for CIFAR-100. All experiments use the following data augmentations: mix-up, label smoothing, random erasing, random crop/resize/flip/lighting, and Auto-Augment.
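A minimal PyTorch sketch of these optimization hyperparameters (SGD with momentum 0.9, weight decay 5e-4, initial learning rate 0.1, cosine decay with a 5-epoch warm-up) is shown below; the data pipeline, batch size, and augmentations are omitted, and the helper name is ours.

```python
import math
import torch

def build_optimizer_and_scheduler(model, epochs=100, warmup_epochs=5,
                                  base_lr=0.1, weight_decay=5e-4):
    """Sketch of the Section IV-B training setup: SGD + cosine decay with warm-up."""
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                momentum=0.9, weight_decay=weight_decay)

    def lr_lambda(epoch):
        if epoch < warmup_epochs:                          # linear warm-up over 5 epochs
            return (epoch + 1) / warmup_epochs
        progress = (epoch - warmup_epochs) / max(1, epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to 0

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```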

IV-C Result

As shown in Table I and Table II, after optimization with Dense Optimizer, the performance of the dense network on the CIFAR-100 dataset significantly surpasses that of the original network. Given diverse budgets, our method outperforms the compared NAS methods in terms of the accuracy of the searched architectures, and the optimization problem is solved on a CPU device with a small search cost. The optimized 32.6M-parameter model achieves a top-1 error rate of 16.96%, while the larger 171.7M-parameter network reaches 15.70%.

Moreover, we compare the searched architectures on the CIFAR-10 and SVHN datasets, as shown in Table III and Table IV.

The results demonstrate that Dense Optimizer achieves effective results on different datasets. At the same model size, the accuracy of the optimized models is significantly higher than that of the original DenseNet, and they easily outperform the ResNet family.

TABLE I: Results of searched optimized architectures on CIFAR-100
Model Parameters Max K top-1 error (%)
ResNet [39] 1.7M - 27.22
Wide ResNet [40] 36.5M - 20.50
ResNet(pre-activation) 10.2M - 22.71
FractalNet [41] 38.6M - 23.3
DenseNet-BC [6] 0.8M 12 24.15
DenseNet 27.2M 24 23.42
DenseNet-BC(121) 9.02M 24 19.90
DenseNet-OPT(123) 24.12M 24 17.74
DenseNet-OPT(129) 32.60M 40 16.96
DenseNet-OPT(86) 171.7M 128 15.70
TABLE II: Comparison with NAS methods on CIFAR-100
Method Parameters top-1 error (%) Search Cost
SNAS [42] 2.8M 20.09 1.5 GPU-days
DARTS [11] 3.4M 21.26 0.4 GPU-days
ZARTS [43] 4.1M 21.00 1.0 GPU-days
DenseNet-OPT(123) 24.12M 17.74 0.2 CPU-days
TABLE III: Results of searched optimized architectures on CIFAR-10
Model Parameters Max K top-1 error (%)
ResNet 19.3M - 7.93
DenseNet 1.0M 12 7.00
DenseNet 27.2M 24 5.83
DenseNet-BC(250) 15.3M 24 5.19
DenseNet-OPT(123) 24.1M 24 3.53
TABLE IV: Results of searched optimized architectures on SVHN
Model Parameters Max K top-1 error (%)
ResNet-18 11.7M - 2.65
DenseNet-BC 15.3M 12 1.74
DenseNet-OPT(123) 24.1M 24 1.49

IV-D Ablation Study and Analysis

In this section, we conduct ablation experiments on the CIFAR-100 dataset with power-law distribution constraints and perform a statistical analysis of the information entropy of multi-stage power-law distributions. We utilize traditional Dense-BC convolution blocks and networks generated with different information entropy distributions. The network structures found under different maximum channel growth rates K (40, 24, 12) were trained, and the fitted values of the information entropy distribution at each stage were statistically analyzed.

As shown in Table V, controlling the information entropy distribution at each stage can improve the performance of image classification tasks. When the distribution satisfies a power law, the network achieves the best classification performance. As can be seen in Figure 2, the model performance is positively correlated with the power-law hyperparameter a (Pearson correlation coefficient of 0.86) and negatively correlated with the hyperparameter b (Pearson correlation coefficient of -0.94). Therefore, better performance can be achieved by optimizing the model structure through power-law constraints.
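The correlation analysis can be reproduced in one line once the fitted (a, b) values and the corresponding accuracies are collected; the arrays below are illustrative placeholders, not the paper's measurements.

```python
import numpy as np

# Illustrative fitted power-law parameters and top-1 accuracies (not the paper's data)
a_values = np.array([2.1, 2.6, 3.0, 3.4])
b_values = np.array([0.9, 0.7, 0.6, 0.4])
accuracy = np.array([81.0, 82.5, 83.4, 84.3])

print(np.corrcoef(a_values, accuracy)[0, 1])   # Pearson r between a and accuracy
print(np.corrcoef(b_values, accuracy)[0, 1])   # Pearson r between b and accuracy
```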

Figure 2: Power-law fit hyperparameters vs. top-1 accuracy of each optimized model on CIFAR-100. From left to right: correlation of the power-law fit hyperparameters a and b against accuracies on CIFAR-100. Strong correlations appear for both hyperparameters and remain consistent across different growth-rate settings.

We also averaged the information entropy over all optimized dense network stages and fitted it with different functions, comparing the fitting errors, including SSE (sum of squared errors), R-square (coefficient of determination), adjusted R-square, and RMSE (root mean squared error). Figure 3 indicates that, compared with first-order and second-order polynomial functions and exponential functions, the power-law function has the smallest fitting error and the highest fitting score.
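A sketch of how such a comparison could be run is shown below (SciPy assumed, stage-entropy values illustrative), reporting SSE, R-square, and RMSE for a power-law fit versus an exponential fit:

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_metrics(model, x, y, p0):
    """Fit `model` to (x, y) and return (SSE, R-square, RMSE)."""
    popt, _ = curve_fit(model, x, y, p0=p0, maxfev=10000)
    residuals = y - model(x, *popt)
    sse = float(np.sum(residuals ** 2))
    r2 = 1.0 - sse / float(np.sum((y - y.mean()) ** 2))
    rmse = float(np.sqrt(sse / len(y)))
    return sse, r2, rmse

x = np.arange(1, 6, dtype=float)
y = np.array([110.0, 260.0, 430.0, 640.0, 880.0])   # illustrative stage entropies

power = lambda m, a, b: a * np.power(m, b)
expo = lambda m, a, b: a * np.exp(b * m)
print("power:", fit_metrics(power, x, y, p0=(1.0, 1.0)))
print("exp:  ", fit_metrics(expo, x, y, p0=(1.0, 0.5)))
```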

Figure 3: From left to right, the figure displays the fitting errors, sum of squared residuals, coefficient of determination, and adjusted coefficient of determination for the high-performance dense network's multiscale information entropy, using first-order and second-order polynomials, linear functions, power functions, and exponential functions. Among all the indicators, the power-law function exhibits the best fitting metrics.

Subsequently, we conducted ablation experiments under power-law constraints. In Table V, we observe that the model performance is significantly improved after optimization with Dense Optimizer and the addition of power-law constraint terms, showing how the model performance evolves as the power-law constraints are strengthened. The DenseNet optimized without the power-law term achieved accuracy gains of +0.20%, +0.57%, and +2.45% on SVHN, CIFAR-10, and CIFAR-100, respectively. Meanwhile, the DenseNet optimized under the power law achieved accuracy gains of +0.23%, +1.66%, and +2.94% on SVHN, CIFAR-10, and CIFAR-100, respectively.

TABLE V: Error rate (%) on the CIFAR and SVHN datasets; * marked results are our own tests
Dataset Original Optimized Power-optimized
SVHN 1.74 1.54* 1.49*
CIFAR-10 5.19 4.62* 3.53*
CIFAR-100 19.90* 17.45* 16.96*

Moreover, the β parameter is crucial in tuning the model: it controls the balance between the power-law distribution term and the magnitude of the information entropy. Our results indicate that when β is set to 0.1, there is an optimal balance between the information entropy and the power-law constraint, leading to the best model performance (see Table VI).

TABLE VI: Error rate (%) on the CIFAR-100 dataset with different β; * marked results are our own tests
Dataset β=0 β=0.001 β=0.1 β=10
CIFAR-100 19.90* 16.95* 16.92* 18.00*

Overall, the analysis shows a strong correlation between the hyperparameters a and b and model performance. The ablation results further validate the significance of power-law constraints in model design. We also analyzed different distribution functions to show the suitability of the power-law fit. Finally, we identified an optimal hyperparameter β that achieves the best balance of the entropy distribution.

V Conclusion

In this paper, we propose a dense architecture search method, Dense Optimizer, which achieves automatic network structure design via mathematical optimization and improves network performance. Dense Optimizer decouples network weights from network architecture and maximizes the network entropy while keeping the distribution of structural entropy under power-law constraints. We show that Dense Optimizer can design models comparable to modern CNN models using only traditional Dense-BC convolutional blocks, demonstrating its ability to release the potential of traditional DenseNet models. Furthermore, Dense Optimizer can be applied to the design of other dense-like networks, and the power-law distribution of structural information entropy provides considerable insight for models with multi-scale features.

Acknowledgments

This work was supported by the Major Projects of Technological Innovation 2030 of China (Grant No. 2022ZD0208500) and the National Natural Science Foundation of China under Grant No. 62271504.

References

  • [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural information processing systems, vol. 25, 2012.
  • [2] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [3] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [4] G. Huang, Z. Liu, G. Pleiss, L. Van Der Maaten, and K. Q. Weinberger, “Convolutional networks with dense connectivity,” IEEE transactions on pattern analysis and machine intelligence, vol. 44, no. 12, pp. 8704–8716, 2019.
  • [5] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng, “Dual path networks,” Advances in neural information processing systems, vol. 30, 2017.
  • [6] T. Zhou, X. Ye, H. Lu, X. Zheng, S. Qiu, Y. Liu et al., “Dense convolutional network and its application in medical image analysis,” BioMed Research International, vol. 2022, 2022.
  • [7] Y. Yang, Z. Zhong, T. Shen, and Z. Lin, “Convolutional neural networks with alternately updated clique,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2413–2422.
  • [8] B. Lodhi and J. Kang, “Multipath-densenet: A supervised ensemble architecture of densely connected convolutional networks,” Information Sciences, vol. 482, pp. 63–72, 2019.
  • [9] B. Chen, T. Zhao, J. Liu, and L. Lin, “Multipath feature recalibration densenet for image classification,” International Journal of Machine Learning and Cybernetics, vol. 12, pp. 651–660, 2021.
  • [10] Z. Guo, X. Zhang, H. Mu, W. Heng, Z. Liu, Y. Wei, and J. Sun, “Single path one-shot neural architecture search with uniform sampling,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI 16.   Springer, 2020, pp. 544–560.
  • [11] H. Liu, K. Simonyan, and Y. Yang, “Darts: Differentiable architecture search,” arXiv preprint arXiv:1806.09055, 2018.
  • [12] E. T. Jaynes, “Information theory and statistical mechanics,” Physical Review, vol. 106, pp. 620–630, 1957. [Online]. Available: https://api.semanticscholar.org/CorpusID:17870175
  • [13] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708.
  • [14] C. Li, Y. Yang, H. Liang, and B. Wu, “Transfer learning for establishment of recognition of covid-19 on ct imaging using small-sized training datasets,” Knowledge-Based Systems, vol. 218, p. 106849, 2021.
  • [15] R. Srivastva, A. Singh, and Y. N. Singh, “Plexnet: A fast and robust ecg biometric system for human recognition,” Information Sciences, vol. 558, pp. 208–228, 2021.
  • [16] M. Zhao, S. Zhong, X. Fu, B. Tang, and M. Pecht, “Deep residual shrinkage networks for fault diagnosis,” IEEE Transactions on Industrial Informatics, vol. 16, no. 7, pp. 4681–4690, 2019.
  • [17] Y. Liu, Y. Sun, B. Xue, M. Zhang, G. G. Yen, and K. C. Tan, “A survey on evolutionary neural architecture search,” IEEE transactions on neural networks and learning systems, vol. 34, no. 2, pp. 550–570, 2021.
  • [18] W. Ying, K. Zheng, Y. Wu, J. Li, and X. Xu, “Neural architecture search using multi-objective evolutionary algorithm based on decomposition,” in Artificial Intelligence Algorithms and Applications: 11th International Symposium, ISICA 2019, Guangzhou, China, November 16–17, 2019, Revised Selected Papers 11.   Springer, 2020, pp. 143–154.
  • [19] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin, “Large-scale evolution of image classifiers,” in International conference on machine learning.   PMLR, 2017, pp. 2902–2911.
  • [20] L. Xie and A. Yuille, “Genetic cnn,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 1379–1388.
  • [21] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le, “Mnasnet: Platform-aware neural architecture search for mobile,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 2820–2828.
  • [22] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,” arXiv preprint arXiv:1611.01578, 2016.
  • [23] H. Cai, L. Zhu, and S. Han, “Proxylessnas: Direct neural architecture search on target task and hardware,” arXiv preprint arXiv:1812.00332, 2018.
  • [24] D. Eriksson, P. I.-J. Chuang, S. Daulton, P. Xia, A. Shrivastava, A. Babu, S. Zhao, A. Aly, G. Venkatesh, and M. Balandat, “Latency-aware neural architecture search with multi-objective bayesian optimization,” arXiv preprint arXiv:2106.11890, 2021.
  • [25] S. C. Nistor and G. Czibula, “Intelliswas: Optimizing deep neural network architectures using a particle swarm-based approach,” Expert Systems with Applications, vol. 187, p. 115945, 2022.
  • [26] C. White, W. Neiswanger, and Y. Savani, “Bananas: Bayesian optimization with neural architectures for neural architecture search,” in Proceedings of the AAAI conference on artificial intelligence, vol. 35, no. 12, 2021, pp. 10 293–10 301.
  • [27] J. Mellor, J. Turner, A. Storkey, and E. J. Crowley, “Neural architecture search without training,” in International conference on machine learning.   PMLR, 2021, pp. 7588–7598.
  • [28] X. Shen, Y. Wang, M. Lin, Y. Huang, H. Tang, X. Sun, and Y. Wang, “Deepmad: Mathematical architecture design for deep convolutional neural network,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6163–6173.
  • [29] B. Bischl, M. Binder, M. Lang, T. Pielok, J. Richter, S. Coors, J. Thomas, T. Ullmann, M. Becker, A.-L. Boulesteix et al., “Hyperparameter optimization: Foundations, algorithms, best practices, and open challenges,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 13, no. 2, p. e1484, 2023.
  • [30] T. Yu and H. Zhu, “Hyper-parameter optimization: A review of algorithms and applications,” arXiv preprint arXiv:2003.05689, 2020.
  • [31] Y. Yu, K. H. R. Chan, C. You, C. Song, and Y. Ma, “Learning diverse and discriminative representations via the principle of maximal coding rate reduction,” Advances in Neural Information Processing Systems, vol. 33, pp. 9422–9434, 2020.
  • [32] K. H. R. Chan, Y. Yu, C. You, H. Qi, J. Wright, and Y. Ma, “Redunet: A white-box deep network from the principle of maximizing rate reduction,” The Journal of Machine Learning Research, vol. 23, no. 1, pp. 4907–5009, 2022.
  • [33] I. Csiszár, P. C. Shields et al., “Information theory and statistics: A tutorial,” Foundations and Trends® in Communications and Information Theory, vol. 1, no. 4, pp. 417–528, 2004.
  • [34] S. Kullback, Information theory and statistics.   Courier Corporation, 1997.
  • [35] A. Hyvärinen, J. Hurri, and P. O. Hoyer, Natural image statistics: A probabilistic approach to early computational vision.   Springer Science & Business Media, 2009, vol. 39.
  • [36] A. M. Saxe, Y. Bansal, J. Dapello, M. Advani, A. Kolchinsky, B. D. Tracey, and D. D. Cox, “On the information bottleneck theory of deep learning,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2019, no. 12, p. 124020, 2019.
  • [37] J. M. Carlson and J. Doyle, “Highly optimized tolerance: A mechanism for power laws in designed systems,” Physical Review E, vol. 60, no. 2, p. 1412, 1999.
  • [38] S. Tyagi, N. Shukla, and S. Kulkarni, “Optimal design of fixture layout in a multi-station assembly using highly optimized tolerance inspired heuristic,” Applied Mathematical Modelling, vol. 40, no. 11, pp. 6134–6147, 2016.
  • [39] B. Koonce, “ResNet 50,” in Convolutional Neural Networks with Swift for TensorFlow: Image Recognition and Dataset Categorization, pp. 63–72, 2021.
  • [40] S. Zagoruyko and N. Komodakis, “Wide residual networks,” arXiv preprint arXiv:1605.07146, 2016.
  • [41] G. Larsson, M. Maire, and G. Shakhnarovich, “Fractalnet: Ultra-deep neural networks without residuals,” arXiv preprint arXiv:1605.07648, 2016.
  • [42] S. Xie, H. Zheng, C. Liu, and L. Lin, “Snas: stochastic neural architecture search,” arXiv preprint arXiv:1812.09926, 2018.
  • [43] X. Wang, W. Guo, J. Su, X. Yang, and J. Yan, “Zarts: On zero-order optimization for neural architecture search,” Advances in Neural Information Processing Systems, vol. 35, pp. 12 868–12 880, 2022.