
This is the authors’ extended version. The authoritative version will appear in the proceedings of ACM/IEEE SEC’20.

Third ArchEdge Workshop: Exploring the Design Space of Efficient Deep Neural Networks

Invited Paper*

Fuxun Yu¹, Dimitrios Stamoulis², Di Wang², Dimitrios Lymberopoulos², Xiang Chen¹
¹George Mason University, ²Microsoft

*Presented at the Third Workshop on Computing Architecture for Edge Computing (ArchEdge), co-located with the Fifth ACM/IEEE Symposium on Edge Computing (SEC), November 11–13, 2020. Email: [email protected]
Abstract

This paper gives an overview of our ongoing work on the design space exploration of efficient deep neural networks (DNNs). Specifically, we cover two aspects: (1) static architecture design efficiency and (2) dynamic model execution efficiency. For static architecture design, different from existing “end-to-end” hardware modeling assumptions, we conduct “full-stack” profiling at the GPU core level to identify better accuracy-latency trade-offs for DNN designs. For dynamic model execution, different from prior work that tackles model redundancy at the DNN-channels level, we explore a new dimension of DNN feature map redundancy to be dynamically traversed at runtime. Last, we highlight several open questions that are poised to draw research attention in the next few years.

I Introduction

This paper summarizes our latest explorations in the design space of efficient DNNs. Specifically, we cover two complementary aspects: (a) static architecture design efficiency and (b) dynamic model execution efficiency.

I-A Static Architecture Design Efficiency

Recent AutoML techniques, such as Neural Architecture Search (NAS), aim to automate the design of DNNs [1, 2, 3]. Due to the considerable search cost required to traverse the DNN design space, early approaches can take thousands of GPU hours to select the final model candidates [4]. To this end, in the context of hardware-constrained DNNs, previous work has significantly improved the search efficiency thanks to hardware performance models (e.g., DNN FLOPs, energy consumption, latency, etc.) which allow the AutoML algorithm to efficiently traverse the design space in a hardware-aware manner [5, 6, 7], by quickly discarding DNNs that violate the hardware constraints of the target platform.
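To make this filtering step concrete, the minimal Python sketch below (a hypothetical illustration, not the exact procedure of [5, 6, 7]) drops candidate architectures whose estimated cost exceeds the hardware budget before any training or accuracy evaluation is spent on them.

```python
from typing import Callable, Dict, List

def hardware_aware_filter(candidates: List[Dict],
                          cost_fn: Callable[[Dict], float],
                          budget: float) -> List[Dict]:
    """Discard architectures whose estimated cost (FLOPs, latency, energy, ...)
    violates the target-platform budget, so the search spends accuracy
    evaluation only on feasible designs."""
    return [arch for arch in candidates if cost_fn(arch) <= budget]

# Usage sketch: cost_fn could be a FLOPs counter or a profiled latency model.
# feasible = hardware_aware_filter(search_space, cost_fn=predict_latency_ms, budget=15.0)
```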

However, hardware-aware AutoML algorithms often rely on performance proxies that poorly reflect the underlying hardware. On the one hand, early AutoML methods [8] use FLOPs as a general performance indicator, yet recent works demonstrate a mismatch between FLOPs and actual hardware metrics [9, 10, 11, 12]; this mismatch, however, has been discussed mainly through empirical results and has not been comprehensively analyzed. On the other hand, recent methods that replace FLOPs with predictive models (e.g., for latency or power consumption) rely on “end-to-end” profiling, which is either limited to discrete design choices (e.g., 50% or 100% of channels) [13] or follows a look-up-table-based approach [7].
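To contrast the two proxy styles, the hedged sketch below compares an arithmetic FLOPs/MACs count for a convolution layer with a layer-wise look-up-table latency estimate; the table contents are illustrative placeholders, since a real LUT would be populated by profiling each layer configuration on the target device.

```python
# Sketch of two performance proxies for convolution layers (placeholder numbers).

def conv_macs(h_out, w_out, c_in, c_out, k):
    """Multiply-accumulate count of a dense k x k convolution layer."""
    return h_out * w_out * c_out * c_in * k * k

# FLOPs/MACs proxy: purely arithmetic, blind to kernel-launch overhead,
# memory traffic, and GPU occupancy.
macs = conv_macs(h_out=56, w_out=56, c_in=64, c_out=128, k=3)

# Look-up-table proxy: per-layer latencies profiled once on the target GPU
# and summed end-to-end (values below are illustrative placeholders).
latency_lut_ms = {
    ("conv3x3", 64, 128, 56): 0.21,
    ("conv3x3", 128, 128, 28): 0.18,
}
estimated_latency_ms = sum(latency_lut_ms.values())
```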

To this end, we present a comprehensive “full-stack” profiling analysis that dives into individual GPU cores/threads to examine the intrinsic mechanisms of DNN execution [14]. As a key contribution, we shed light on the “GPU tail” effect as a root cause of the FLOPs-latency mismatch and of GPU under-utilization. Based on our findings, we revisit the DNN design configuration choices of state-of-the-art AutoML methodologies to eliminate the tail effect, enabling larger, more accurate DNN designs at no latency cost. Hence, our method concretely improves accuracy-latency trade-offs, e.g., 27% latency and 4% accuracy improvements on top of SOTA DNN pruning and NAS methods. Moreover, we extend our profiling findings across different GPU configurations.
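For intuition, the following simplified sketch models the tail effect with a wave-based occupancy calculation; the SM count and blocks-per-SM values are illustrative assumptions rather than measurements from [14].

```python
import math

def tail_utilization(num_blocks: int, num_sms: int, blocks_per_sm: int = 1) -> float:
    """Fraction of GPU capacity used under a simplified wave model:
    thread blocks execute in waves of (num_sms * blocks_per_sm); a partially
    filled last wave (the "tail") leaves the remaining SMs idle."""
    wave_capacity = num_sms * blocks_per_sm
    num_waves = math.ceil(num_blocks / wave_capacity)
    return num_blocks / (num_waves * wave_capacity)

# Example: 100 thread blocks on a GPU with 80 SMs -> 2 waves; the second
# wave is only 25% full, so overall utilization drops to 62.5%.
print(tail_utilization(num_blocks=100, num_sms=80))
```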

Discussion - Future work: while our investigation is employed as a fine-tuning (local search) step on top of SOTA designs, our findings can be flexibly incorporated into other AutoML methods. That is, a direction for future work is to revisit the predictive models of existing single- and multi-path NAS works [15, 16] to further improve the accuracy-latency trade-offs by traversing the design space in a “tail effect”-aware fashion. Moreover, our methodology focuses on eliminating “tail effects” at the DNN design level, but improvements from alleviating GPU under-utilization can also be realized at other design levels, as shown by novel scheduling- and computational flow-level explorations [17, 18].

Next, we hope that our findings inspire researchers to revisit design-space assumptions by identifying hardware-optimal DNN candidates early and eliminating sub-optimal ones. For example, our channel-level analysis reveals a discrete set of DNN channel configurations [14] with optimal GPU utilization, which could potentially reduce the number of candidates by 10× compared to traversing a continuous channel-number space, e.g., on top of existing channel-pruning methods [19, 20] (see the sketch below). Last, an interesting direction would be to investigate the severity of similar under-utilization beyond GPUs, especially in the context of hardware accelerators and co-design NAS methodologies [21].
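Under the same simplified wave model as above, the sketch below shows how a continuous channel range collapses into a small discrete set of tail-free candidates; the channel-to-block mapping is a placeholder assumption, not the actual kernel mapping analyzed in [14].

```python
def full_wave_channels(channel_range, num_sms=80, blocks_per_channel=1):
    """Keep only channel counts whose (assumed) thread-block count is a
    multiple of the per-wave capacity, i.e., configurations with no tail wave."""
    return [c for c in channel_range
            if (c * blocks_per_channel) % num_sms == 0]

# A continuous range 1..512 collapses to a handful of tail-free candidates.
print(full_wave_channels(range(1, 513)))   # -> [80, 160, 240, 320, 400, 480]
```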

I-B Dynamic Model Execution Efficiency

Dynamic execution methods aim at selecting among “switchable” DNN components at runtime [22, 23, 24, 25, 26]. The key insight behind these works is to improve overall model efficiency by adaptively selecting and executing (a subset of) the model based on the input characteristics [27]. In our work, we extend this intuition to a new model redundancy dimension, namely dynamic feature-map redundancy [28]. Specifically, we show that feature redundancy exists along the spatial dimensions of DNN convolutions, which allows us to formulate a dynamic pruning methodology along both the channel and spatial dimensions. Our method greatly reduces model computation, with up to 54.5% FLOPs reduction and negligible accuracy drop on various image-classification DNNs.
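As a rough illustration of input-dependent gating along both dimensions, the hedged PyTorch sketch below masks low-saliency channels and spatial positions at runtime; the gating head, saliency scores, and keep ratios are illustrative choices rather than the attention-based formulation of [28].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicGatedConv(nn.Module):
    """Conv block that, per input, keeps only the top-k output channels and
    masks low-saliency spatial positions (illustrative gating, not [28])."""
    def __init__(self, c_in, c_out, channel_keep=0.5, spatial_keep=0.5):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, 3, padding=1)
        self.channel_gate = nn.Linear(c_in, c_out)   # predicts channel saliency
        self.channel_keep = channel_keep
        self.spatial_keep = spatial_keep

    def forward(self, x):
        out = self.conv(x)                                    # (N, C, H, W)
        n, c, h, w = out.shape

        # Channel-wise gating: keep the top-k channels per sample.
        ctx = F.adaptive_avg_pool2d(x, 1).flatten(1)          # (N, C_in)
        scores = self.channel_gate(ctx)                       # (N, C)
        k = max(1, int(self.channel_keep * c))
        topk = scores.topk(k, dim=1).indices
        ch_mask = torch.zeros_like(scores).scatter_(1, topk, 1.0)
        out = out * ch_mask.view(n, c, 1, 1)

        # Spatial-wise gating: keep the most active spatial positions.
        saliency = out.abs().mean(dim=1).flatten(1)           # (N, H*W)
        k_sp = max(1, int(self.spatial_keep * h * w))
        top_sp = saliency.topk(k_sp, dim=1).indices
        sp_mask = torch.zeros_like(saliency).scatter_(1, top_sp, 1.0)
        return out * sp_mask.view(n, 1, h, w)

# Half of the channels and spatial positions are zeroed per input; turning this
# into actual latency savings requires sparse operators (see the discussion below).
y = DynamicGatedConv(16, 32)(torch.randn(2, 16, 8, 8))
```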

Discussion - Future work: Drawing inspiration from our analysis of the FLOPs-latency mismatch, we highlight that, when implemented naively, merely pruning convolution weights at the spatial level does not translate to latency savings. To this end, we postulate that advances in sparse DNN operators will be essential to support dynamic-sparse execution, as recently shown with CUDA implementations for dynamic convolutions [29].

II Conclusion

In this paper, we summarize a set of novel efficiency-optimization angles for DNN design, covering both static architecture design and dynamic model execution. Further gains can be attained by integrating these new perspectives into current optimization methods.

References

  • [1] W. Wen, H. Liu, Y. Chen, H. Li, G. Bender, and P.-J. Kindermans, “Neural predictor for neural architecture search,” in European Conference on Computer Vision.   Springer, 2020, pp. 660–676.
  • [2] H.-P. Cheng, T. Zhang, S. Li, F. Yan, M. Li, V. Chandra, H. Li, and Y. Chen, “Nasgem: Neural architecture search via graph embedding method,” arXiv preprint arXiv:2007.04452, 2020.
  • [3] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,” arXiv preprint arXiv:1611.01578, 2016.
  • [4] E. Lybecker, “How NAS was improved. From days to hours in search time,” 2020. [Online]. Available: https://peltarion.com/blog/data-science/nas-search
  • [5] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le, “Mnasnet: Platform-aware neural architecture search for mobile,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2820–2828.
  • [6] M. Tan and Q. V. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” arXiv preprint arXiv:1905.11946, 2019.
  • [7] D. Stamoulis, R. Ding, D. Wang, D. Lymberopoulos, N. B. Priyantha, J. Liu, and D. Marculescu, “Single-path mobile automl: Efficient convnet design and nas hyperparameter optimization,” IEEE Journal of Selected Topics in Signal Processing, 2020.
  • [8] H. Liu, K. Simonyan, and Y. Yang, “Darts: Differentiable architecture search,” arXiv preprint arXiv:1806.09055, 2018.
  • [9] J.-D. Dong, A.-C. Cheng, D.-C. Juan, W. Wei, and M. Sun, “Dpp-net: Device-aware progressive search for pareto-optimal neural architectures,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 517–531.
  • [10] D. Marculescu, D. Stamoulis, and E. Cai, “Hardware-aware machine learning: modeling and optimization,” in Proceedings of the International Conference on Computer-Aided Design, 2018, pp. 1–8.
  • [11] E. Cai, D.-C. Juan, D. Stamoulis, and D. Marculescu, “Neuralpower: Predict and deploy energy-efficient convolutional neural networks,” in Asian Conference on Machine Learning, 2017, pp. 622–637.
  • [12] T.-J. Yang, Y.-H. Chen, and V. Sze, “Designing energy-efficient convolutional neural networks using energy-aware pruning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5687–5695.
  • [13] D. Stamoulis, E. Cai, D.-C. Juan, and D. Marculescu, “Hyperpower: Power-and memory-constrained hyper-parameter optimization for neural networks,” in 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE).   IEEE, 2018, pp. 19–24.
  • [14] F. Yu, Z. Xu, T. Shen, D. Stamoulis, L. Shangguan, D. Wang, R. Madhok, C. Zhao, X. Li, N. Karianakis, D. Lymberopoulos, A. Li, C. Liu, Y. Chen, and X. Chen, “Towards latency-aware dnn optimization with gpu runtime analysis and tail effect elimination,” arXiv preprint arXiv:2011.03897, 2020.
  • [15] D. Stamoulis, R. Ding, D. Wang, D. Lymberopoulos, B. Priyantha, J. Liu, and D. Marculescu, “Single-path nas: Designing hardware-efficient convnets in less than 4 hours,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases.   Springer, 2019, pp. 481–497.
  • [16] H. Cai, L. Zhu, and S. Han, “Proxylessnas: Direct neural architecture search on target task and hardware,” arXiv preprint arXiv:1812.00332, 2018.
  • [17] Y. Ding, L. Zhu, Z. Jia, G. Pekhimenko, and S. Han, “Ios: Inter-operator scheduler for cnn acceleration,” arXiv preprint arXiv:2011.01302, 2020.
  • [18] F. Yu, Z. Qin, D. Wang, P. Xu, C. Liu, Z. Tian, and X. Chen, “Dc-cnn: computational flow redefinition for efficient cnn through structural decoupling,” in 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE).   IEEE, 2020, pp. 1097–1102.
  • [19] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, “Amc: Automl for model compression and acceleration on mobile devices,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 784–800.
  • [20] T.-W. Chin, R. Ding, C. Zhang, and D. Marculescu, “Towards efficient model compression via learned global ranking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 1518–1528.
  • [21] Y. Zhang, Y. Fu, W. Jiang, C. Li, H. You, M. Li, V. Chandra, and Y. Lin, “Dna: Differentiable network-accelerator co-search,” arXiv preprint arXiv:2010.14778, 2020.
  • [22] Z. Xu, F. Yu, Z. Qin, C. Liu, and X. Chen, “Directx: Dynamic resource-aware cnn reconfiguration framework for real-time mobile applications,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2020.
  • [23] Y. Chen, X. Dai, M. Liu, D. Chen, L. Yuan, and Z. Liu, “Dynamic convolution: Attention over convolution kernels,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11030–11039.
  • [24] Z. Wu, T. Nagarajan, A. Kumar, S. Rennie, L. S. Davis, K. Grauman, and R. Feris, “Blockdrop: Dynamic inference paths in residual networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8817–8826.
  • [25] X. Gao, Y. Zhao, Ł. Dudziak, R. Mullins, and C.-z. Xu, “Dynamic channel pruning: Feature boosting and suppression,” arXiv preprint arXiv:1810.05331, 2018.
  • [26] Z. Xu, F. Yu, C. Liu, and X. Chen, “Reform: Static and dynamic resource-aware dnn reconfiguration framework for mobile device,” in Proceedings of the 56th Annual Design Automation Conference 2019, 2019, pp. 1–6.
  • [27] D. Stamoulis, T.-W. Chin, A. K. Prakash, H. Fang, S. Sajja, M. Bognar, and D. Marculescu, “Designing adaptive neural networks for energy-constrained image classification,” in Proceedings of the International Conference on Computer-Aided Design, 2018, pp. 1–8.
  • [28] F. Yu, C. Liu, D. Wang, Y. Wang, and X. Chen, “Antidote: attention-based dynamic optimization for neural network runtime efficiency,” in 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE).   IEEE, 2020, pp. 951–956.
  • [29] T. Verelst and T. Tuytelaars, “Dynamic convolutions: Exploiting spatial sparsity for faster inference,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2320–2329.