ONNX-to-Hardware Design Flow for the Generation of Adaptive Neural-Network Accelerators on FPGAs
Abstract
Neural Networks (NNs) provide a solid and reliable way of executing different types of applications, ranging from speech recognition to medical diagnosis, speeding up onerous and long workloads. The challenges involved in their implementation at the edge include providing diversity, flexibility, and sustainability; this implies, for instance, supporting evolving applications and algorithms energy-efficiently. Hardware or software accelerators can deliver fast and efficient computation of NNs, while flexibility can be exploited to support long-term adaptivity. Nonetheless, handcrafting a NN for a specific device, although it may lead to an optimal solution, takes time and experience, which is why frameworks for the automated design of hardware accelerators are being developed. This work-in-progress study explores the possibility of combining the toolchain proposed by Ratto et al. [1], which has the distinctive ability to favor adaptivity, with Approximate Computing (AC). The goal is to enable lightweight, adaptable NN inference on FPGAs at the edge. Before that, the work presents a detailed review of established frameworks that adopt a similar streaming architecture, for future comparison.
Index Terms:
Cyber-Physical Systems, Convolutional Neural Networks, Approximate Computing, FPGAs

Acronyms:

- AC: Approximate Computing
- AES: Advanced Encryption Standard
- API: Application Programming Interface
- ASIC: Application Specific Integrated Circuit
- BNN: Binary Neural Network
- CGR: Coarse-Grain Reconfigurable
- CPG: Co-Processor Generator
- CPS: Cyber-Physical System
- CPU: Central Processing Unit
- DAG: Directed Acyclic Graph
- DNN: Deep Neural Network
- DPN: Dataflow Process Network
- DSE: Design Space Exploration
- DSP: Digital Signal Processing
- FFT: Fast Fourier Transform
- FIFO: First-In First-Out queue
- FPGA: Field Programmable Gate Array
- GPU: Graphics Processing Unit
- HEVC: High Efficiency Video Coding
- HLS: High Level Synthesis
- HWPU: HW Processing Unit
- IP: Intellectual Property
- LUT: Look-Up Table
- MDC: Multi-Dataflow Composer
- MDG: Multi-Dataflow Generator
- MLP: Multi-Layer Perceptron
- MCDMA: Multi-Channel DMA
- MoA: Model of Architecture
- MoC: Model of Computation
- MPSoC: Multi-Processor System on a Chip (SoC)
- NN: Neural Network
- OS: Operating System
- PC: Platform Composer
- PE: Processing Element
- PiSDF: Parameterized and Interfaced Synchronous DataFlow
- PDF: Parameterized DataFlow
- PL: Programmable Logic
- PS: Processing System
- PMC: Performance Monitoring Counter
- PSDF: Parameterized Synchronous DataFlow
- QSoC: Quantized SoC
- QNN: Quantized Neural Network
- RTL: Register Transfer Level
- SG: Scatter-Gather
- SDF: Synchronous DataFlow
- SoC: System on a Chip
- SMMU: System Memory Management Unit
- TIL: Template Interface Layer
- UAV: Unmanned Aerial Vehicle
- UGV: Unmanned Ground Vehicle
I Introduction
Cyber-Physical Systems (CPS) integrate “computation with physical processes whose behavior is defined by both the computational (digital and other forms) and the physical parts of the system” (https://csrc.nist.gov/glossary/term/cyber_physical_systems). They are characterized by a considerable information exchange with the environment and by dynamic, reactive behaviors in response to environmental changes. In modern systems, CPS or not, decision making can be brought directly to the edge on small embedded platforms by exploiting the capabilities of NNs. This calls for real-time, low-energy execution, which can be achieved by leveraging different resources on the same platform, making heterogeneous computing fundamental.
Field Programmable Gate Array (FPGA) devices can guarantee hardware acceleration, execution flexibility, and energy efficiency, as well as heterogeneity, which is why this type of device is a valuable option for NN inference at the edge [17]. Indeed, their integration into heterogeneous Multi-Processor Systems on Chip has opened up a wide range of possibilities, exploiting the potential of the FPGA fabric along with the CPU's ability to manage control flow and communication.
While many solutions exist to deploy AI models on these platforms, they often lack full support for advanced features, such as flexibility. This paper focuses on laying the groundwork for future runtime adaptivity, which is, in our opinion, key to addressing the dynamic and reactive nature of CPS. Designing and deploying reconfigurable accelerators with these capabilities still needs to be investigated, as it requires in-depth knowledge of the underlying hardware and hand-tailored solutions. This belief is at the basis of this study, which, starting from a state-of-the-art framework [1], ultimately seeks automated strategies to deploy lightweight, flexible, and adaptive NN accelerators for CPS: different Pareto-optimal working points can be merged into an adaptive accelerator and exploited at runtime to serve variable and evolving applications.
II Background
Several software libraries and frameworks have been developed to facilitate the development and high-performance execution of CNNs. Tools such as Caffe (https://caffe.berkeleyvision.org/), CoreML (https://developer.apple.com/documentation/coreml), PyTorch (https://pytorch.org/), Theano (https://github.com/Theano), and TensorFlow (https://www.tensorflow.org/) aim to increase the productivity of CNN developers by providing high-level APIs that simplify data-pipeline development. Such environments are currently flanked by High Level Synthesis (HLS) tools, which are used to generate FPGA-based hardware designs from a high level of abstraction and to speed up the porting of complex algorithms to the edge. Examples of HLS environments are AMD’s Vitis HLS, Intel FPGA OpenCL SDK, Maxeler’s MaxCompiler [9], and LegUp [7]. They employ commonly used programming languages such as C, C++, OpenCL, and Java, to fill the gap between software-defined applications and their hardware implementation.
To execute NNs at the edge, three main types of architectures can be found in the literature [2]: the Single Computational Engine architecture, based on a single computation engine, typically a systolic array of processing elements or a matrix multiplication unit, that executes the CNN layers sequentially [11]; the Vector Processor architecture, which provides instructions specific to accelerating convolution-related operations [13]; and the Streaming architecture, which consists of one distinct hardware block for each layer of the target CNN, with each block optimized separately [3, 4]. In our studies, we focus mainly on the latter.
II-A Streaming Architectures
In our previous work [1], the dataflow model proved to be the best suited to support runtime adaptivity and enhance parallelism. The resulting hardware is a streaming architecture that uses on-chip memory, guaranteeing low-latency and low-energy computing. Solutions that exploit a similar streaming architecture are FINN [4], an experimental framework from AMD Research Labs based on Theano, and HLS4ML [3], an open-source framework designed to facilitate the deployment of machine learning models on FPGAs, targeting low-latency and low-power edge applications.
These two solutions are described in Sections II-A1 and II-A2 respectively, and their performance is compared in Table I.
II-A1 FINN
FINN is a framework for building scalable and fast NN accelerators, with a focus on supporting Quantized Neural Network (QNN) inference on FPGAs. A given model, trained through Brevitas, is compiled by the FINN compiler, producing a synthesizable C++ description of a heterogeneous streaming architecture. All QNN parameters are stored in on-chip memory, which greatly reduces the power consumed and simplifies the design. The computing engines communicate via on-chip data streams. Avoiding a “one-size-fits-all” design, an ad-hoc topology is built for each network. The resulting accelerator is deployed on the target board using the AMD Pynq framework. Two works adopting the FINN framework have been analyzed, and their results are summarized in Table I.
II-A2 HLS4ML
The main operation of the HLS4ML library is to translate the model of the network into an HLS project. The work in [6] focused on reducing the computational complexity and resource usage of a fully connected network for MNIST classification: the data is fed to a multi-layer perceptron with an input layer of 784 nodes, three hidden layers of 128 nodes each, and an output layer of 10 nodes. The work exploits the potential of pruning and Quantization-Aware Training to drastically reduce the model size with limited impact on its accuracy.
To the best of our knowledge, neither FINN nor HLS4ML, despite targeting FPGA-based streaming architectures and supporting AC features such as pruning and quantization, has ever proposed a reconfigurable solution for runtime-adaptive environments.
Table I: Performance of FINN and HLS4ML streaming accelerators reported in the literature.

Framework | Dataset | FC [#] | CONV [#] | Datatype [bits] | Target board | LUT [#] | DSP [#] | Latency [us] | Throughput [FPS] | Power [W] | Accuracy [%]
---|---|---|---|---|---|---|---|---|---|---|---
FINN [5] | CIFAR-10* | 2 | 6 | 2 | Zynq7000 | 46253 | - | 283 | 21.9k | 15.3 | 80.1
FINN [5] | SVHN* | 2 | 6 | 2 | Zynq7000 | 46253 | - | 283 | 21.9k | 15.3 | 94.3
FINN [4] | CIFAR-10 | 2 | 6 | 2 | UltraScale | 392947 | - | 671 | 12k | <41 | 88.3
HLS4ML [3] | SVHN | 3 | 3 | 7 | UltraScale+ | 38795 | 72 | 1035 | - | - | 95
HLS4ML [6] | MNIST | 3 | 0 | 16 | UltraScale+ | 366494 | 11 | 200 | - | - | 96

* The two datasets are cropped to have the same image size.
II-B Approximate Computing
AC has been established as a new design paradigm for energy-efficient circuits, exploiting the inherent ability of a large number of applications to produce results of acceptable quality despite some approximations in their computations. Leveraging this property, AC approximates the hardware execution of error-resilient computations in a manner that favors performance and energy efficiency. Moreover, NNs have demonstrated strong resilience to errors and can take great advantage of AC [12]. In particular, hardware NN approximation can be classified into three broad categories: Computation Reduction, Approximate Arithmetic Units, and Precision Scaling [8].
Computation Reduction
The Computation Reduction category aims at systematically avoiding, at the hardware level, the execution of some computations, significantly decreasing the executed workload. An example is pruning: biases, weights, or entire neurons can be removed to lighten the workload [15].
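As a minimal illustration of this idea (a software-level sketch of zero-skipping, not the specific mechanism of [15]), multiplications by pruned, zero-valued weights can simply be skipped in a dot product:

```cpp
#include <cstdint>
#include <cstddef>

// Zero-skipping dot product: multiplications by pruned (zero) weights are
// simply not issued, so the executed workload shrinks with the pruning ratio.
int32_t sparse_dot(const int8_t *w, const int8_t *x, std::size_t n) {
    int32_t acc = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (w[i] != 0) {  // skip work on pruned weights
            acc += static_cast<int32_t>(w[i]) * static_cast<int32_t>(x[i]);
        }
    }
    return acc;
}
```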
Approximate Arithmetic Units
Approximate Arithmetic Units improve the energy consumption and latency of DNN accelerators by employing approximate circuits in place of more accurate units, such as the Multiply-and-Accumulate (MAC) unit [16].
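As one simple, generic flavor of this approach (a sketch, not the Wallace-tree design of [16]), a truncation-based multiplier drops the operands' least-significant bits so that a narrower, cheaper multiplier can be used:

```cpp
#include <cstdint>

// Truncation-based approximate multiplier: the K least-significant bits of
// each operand are dropped before multiplying, so a narrower multiplier is
// sufficient; the product is then rescaled to the original range.
template <int K>
int32_t approx_mul(int16_t a, int16_t b) {
    const int32_t at = a >> K;      // arithmetic shift keeps the sign
    const int32_t bt = b >> K;
    return (at * bt) << (2 * K);    // rescale the truncated product
}
```

Larger values of K shrink the multiplier, trading accuracy for area and energy.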
Precision Scaling
The most widely used Precision Scaling practice is quantization: quantized hardware implementations feature reduced bit-width datapaths and arithmetic units, attaining very high energy, latency, and bandwidth gains compared to 32-bit floating-point implementations. Rather than executing all the required mathematical operations in ordinary 32-bit or 16-bit floating point, quantization allows smaller integer operations to be used instead [14]. For this purpose, AMD provides an arbitrary-precision data type library for use in Vivado HLS designs, which allows data types of any bit width to be specified, beyond what the standard C++ data types provide. The library also supports customizable fixed-point data types [10].
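A minimal sketch of how such arbitrary-precision types can be used in a MAC loop is shown below; the chosen bit widths and the function name are illustrative assumptions, not values prescribed by [10]:

```cpp
#include <ap_fixed.h>  // AMD/Xilinx arbitrary-precision types for Vivado HLS

// ap_fixed<W, I> has W total bits, I of which are integer bits. Here
// activations use 16 bits, weights 8 bits, and the accumulator is kept
// wider to avoid overflow during the summation (hypothetical widths).
typedef ap_fixed<16, 6>  act_t;
typedef ap_fixed<8, 2>   weight_t;
typedef ap_fixed<32, 12> acc_t;

acc_t mac16(const act_t x[16], const weight_t w[16]) {
    acc_t acc = 0;
    for (int i = 0; i < 16; ++i) {
#pragma HLS PIPELINE II=1
        acc += x[i] * w[i];  // narrow multipliers, wide accumulation
    }
    return acc;
}
```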
III Proposed design flow
The main innovation of the design flow proposed in [1] is the possibility of supporting runtime adaptivity in the hardware accelerator through reconfiguration.
III-A Tools
Different tools are utilized along the design flow:
- the ONNXParser (https://gitlab.com/aloha.eu/onnxparser), a Python application intended to parse ONNX models and automatically create the code for a target device. It is composed of a Reader and several Writers, one for each target;
- the Vivado HLS tool (https://www.AMD.com/support/documentation-navigation/design-hubs/dh0012-vivado-high-level-synthesis-hub.html), which synthesizes a C or C++ function into RTL code for implementation on AMD FPGAs. The resulting hardware can be optimized and customized through the insertion of directives in the code;
- the Multi-Dataflow Composer (MDC) (https://mdc-suite.github.io/), an open-source tool that can offer optional Coarse-Grained reconfigurability support for hardware acceleration. It takes as input the applications, specified as dataflows, together with a library of HDL files describing the actors. These dataflows are then combined, and the resulting multi-dataflow topology is filled with the actors taken from the HDL library.
III-B Design Flow
The proposed flow, displayed in Figure 1, starts from the ONNX representation of the NN and produces a streaming accelerator for the input model. The ONNX file is given as input to the ONNXParser: first, the Reader parses the ONNX file and produces an intermediate format, a list of objects describing the layers and connections of the ONNX model. Then, the selected Writer creates the target-dependent files. When the target is the HLS flow, it is possible to customize the data precision used to represent weights and activations. The HLS Writer produces the C++ files that implement the layers and the TCL scripts that automate the synthesis with Vivado HLS.

The C++ description of the layers is based on a template architecture: for the CONV layer, the core of the CNN, the template is composed of a Line Buffer actor, which stores the input stream to provide data reuse; the Convolutional actor, which executes the actual computation; and the Weight and Bias actors, which store the kernel parameters needed for the convolution. The resulting template is depicted in Figure 2 (a simplified sketch is given at the end of this section). Each actor is developed to be customizable with the hyperparameters, e.g., input and kernel sizes, extracted from the ONNX model.

The HDL library produced by Vivado HLS is given as input to the Multi-Dataflow Composer, together with the XDF file that describes the topology of the network and the CAL files that identify the different actors; the latter are also generated by the HLS Writer. Finally, the HDL file of the complete dataflow is automatically generated. Optionally, the MDC Co-Processor Generator can be used to deploy the accelerator using the Vivado design suite: it delivers the scripts needed to wrap the accelerator and connect it to a complete processor-coprocessor system. Along with the hardware system, the drivers to call the coprocessor from the SDK application are made available.
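To make the template concrete, the following is a minimal, heavily simplified HLS-style sketch of the CONV decomposition described above. It uses a 1-D window, fixed illustrative sizes, and hypothetical actor signatures; it is not the generated code, which handles 2-D line buffering and is parameterized from the ONNX model:

```cpp
#include <ap_fixed.h>
#include <hls_stream.h>

// Illustrative fixed sizes (the actual template is parameterized with the
// hyperparameters extracted from the ONNX model).
typedef ap_fixed<16, 6> data_t;
const int K = 3;   // kernel size
const int W = 32;  // row width

// Line Buffer actor: stores the incoming stream and emits K-wide windows,
// providing data reuse (the real template keeps a 2-D line buffer).
void line_buffer(hls::stream<data_t> &in, hls::stream<data_t> &win) {
    data_t buf[K] = {0};
    for (int x = 0; x < W; ++x) {
#pragma HLS PIPELINE II=1
        for (int i = 0; i < K - 1; ++i) buf[i] = buf[i + 1];
        buf[K - 1] = in.read();
        if (x >= K - 1)
            for (int i = 0; i < K; ++i) win.write(buf[i]);
    }
}

// Convolutional actor: consumes the windows together with locally stored
// weights and bias (the Weight and Bias actors of the template) and
// produces one output sample per window.
void conv(hls::stream<data_t> &win, hls::stream<data_t> &out,
          const data_t weight[K], data_t bias) {
    for (int x = 0; x < W - K + 1; ++x) {
#pragma HLS PIPELINE II=1
        data_t acc = bias;
        for (int i = 0; i < K; ++i) acc += weight[i] * win.read();
        out.write(acc);
    }
}
```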
In the work of Ratto et al. [1] the flow was only semi-integrated; as part of this preliminary study, the entire generation process, from the ONNX file down to the accelerator deployment, has been fully automated.
IV Preliminary results and future direction
Table II: Exploration of the MNIST classifier for different activation (D) and weight (W) bit widths; resource columns report the percentage of available resources used.

Datatype | Zero-weights [%] | LUT [%] | FF [%] | DSP [%] | BRAM [%] | Latency [us] | Throughput [FPS] | Power [mW] | Energy [uJ] | Accuracy [%]
---|---|---|---|---|---|---|---|---|---|---
D32-W32 | 0.0 | 29.6 | 24.5 | 29.5 | 15.4 | 1530 | 88K | 28.6 | 43.7 | 98
D16-W16 | 0.0 | 23.4 | 20.2 | 52.7 | 15.4 | 1510 | 89K | 25.3 | 38.3 | 98
D8-W16 | 0.8 | 9.1 | 5.6 | 15.5 | 13.2 | 510 | 296K | 20.1 | 10.2 | 76
D16-W8 | 15.0 | 8.5 | 0.6 | 15.5 | 13.2 | 510 | 296K | 19.5 | 9.9 | 98
D16-W4 | 55.3 | 7.7 | 4.3 | 15.5 | 4.3 | 510 | 296K | 17.5 | 8.9 | 97
D16-W2 | 85.7 | 7.7 | 4.3 | 15.5 | 4.3 | 1140 | 117K | 15.0 | 17.1 | 68
To assess the re-engineered flow described in Section III-B, a wide exploration targeting the MNIST classifier has been carried out, as reported in Table II. The intent was also to show the impact of quantization on both model accuracy and hardware performance, which is generally in line with the expectations that AC can offer, as discussed in Section II-B. It can be noticed that accuracy is not as affected by reducing parameter precision as it is by reducing activation precision. Moreover, reduced parameter precision leads to a reduced memory footprint (BRAM column) and a high percentage of zero weights. The latter can be exploited to combine quantization with pruning, which skips multiplications by zero. For a fair comparison with the state-of-the-art solutions presented in Table I, onboard experiments that also account for memory accesses are needed. However, the preliminary results already show competitive performance in terms of utilized resources and latency/throughput. A broader comparison against the state of the art, based on significant onboard measurements and targeting more complex datasets, will be carried out in the future. Nonetheless, it is worth recalling that state-of-the-art approaches are not conceived to support runtime adaptivity, which is instead what motivates our research.
Indeed, our future work intends to explore mixed precision in adaptive NN accelerators. Having a fully automated flow with reconfiguration support available was a key preliminary step to reduce the manual effort in accelerator definition and exploration. The analysis carried out so far on non-reconfigurable accelerators shows that promising trade-offs are present, e.g., trading accuracy for reduced energy consumption. The ultimate goal is the efficient runtime management of the system, which implies, as a first step, the combination of the different working points over a reconfigurable substrate. The latter can be achieved by leveraging the whole set of functionalities offered by the MDC tool to design and operate reconfigurable and evolvable NN accelerators for CPS, including the one presented in this study. The resulting accelerators will be able to switch configuration at runtime to adapt to the desired goal: e.g., when a limited energy budget is left, a reduction in energy consumption is worth some accuracy loss.
One of the challenges we expect to face in the future steps of this research is the limited onboard memory, which could constrain us to the execution of relatively small models (e.g., TinyML), especially when runtime switching among algorithms/configurations is required. The adoption of a reconfigurable approach, capable of sharing weights among configurations, should help us tackle that issue, limiting the memory-footprint impact of having more than one network available. This may limit the accuracy advantages achievable with Quantization-Aware Training [18]. However, the preliminary results with Post-Training Quantization show a limited drop in accuracy even with 4-bit weights.
Acknowledgments
The authors would like to thank Stefano Esposito for his contribution to this work during his Master’s thesis.
References
- [1] Ratto, Francesco, et al. "An Automated Design Flow for Adaptive Neural Network Hardware Accelerators." Journal of Signal Processing Systems (2023): 1-23.
- [2] Venieris, Stylianos, et al. "Toolflows for mapping convolutional neural networks on FPGAs: A survey and future directions." ACM Computing Surveys (CSUR) 51.3 (2018): 1-39.
- [3] Aarrestad, Thea, et al. "Fast convolutional neural networks on FPGAs with hls4ml." Machine Learning: Science and Technology 2.4 (2021): 045015.
- [4] Fraser, Nicholas J., et al. "Scaling binarized neural networks on reconfigurable logic." Proceedings of the 8th Workshop and 6th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms. 2017.
- [5] Umuroglu, Yaman, et al. "FINN: A framework for fast, scalable binarized neural network inference." Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 2017.
- [6] Ngadiuba, Jennifer, et al. "Compressing deep neural networks on FPGAs to binary and ternary precision with hls4ml." Machine Learning: Science and Technology 2.1 (2020): 015001.
- [7] Canis, Andrew, et al. "LegUp: An open-source high-level synthesis tool for FPGA-based processor/accelerator systems." ACM Transactions on Embedded Computing Systems (TECS) 13.2 (2013): 1-27.
- [8] Armeniakos, Giorgos, et al. "Hardware approximate techniques for deep neural network accelerators: A survey." ACM Computing Surveys 55.4 (2022): 1-36.
- [9] Summers, Sioni, et al. "Using MaxCompiler for the high level synthesis of trigger algorithms." Journal of Instrumentation 12.02 (2017): C02015.
- [10] https://jiafulow.github.io/blog/2020/08/02/hls-arbitrary-precision-data-types
- [11] Guan, Yijin, et al. "FP-DNN: An automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates." 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2017.
- [12] Mittal, Sparsh. "A survey of techniques for approximate computing." ACM Computing Surveys (CSUR) 48.4 (2016): 1-33.
- [13] Farabet, C., et al. "An FPGA-based processor for convolutional networks." Proc. IEEE International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2009: 32-37.
- [14] Choi, Jungwook, et al. "Accurate and Efficient 2-bit Quantized Neural Networks." Proceedings of Machine Learning and Systems, A. Talwalkar, V. Smith, and M. Zaharia (Eds.), Vol. 1 (2019): 348-359.
- [15] Han, Song, et al. "EIE: Efficient Inference Engine on Compressed Deep Neural Network." Proceedings of the 43rd International Symposium on Computer Architecture (2016): 243-254.
- [16] Bhardwaj, Kartikeya, et al. "Power- and area-efficient approximate Wallace tree multiplier for error-resilient systems." Fifteenth International Symposium on Quality Electronic Design. IEEE, 2014.
- [17] Guo, Kaiyuan, et al. "[DL] A survey of FPGA-based neural network inference accelerators." ACM Transactions on Reconfigurable Technology and Systems (TRETS) 12.1 (2019): 1-26.
- [18] Gholami, Amir, et al. "A survey of quantization methods for efficient neural network inference." arXiv preprint arXiv:2103.13630 (2021).