ONNX-to-Hardware Design Flow for the Generation of Adaptive Neural-Network Accelerators on FPGAs
Abstract
Neural Networks (NNs) provide a solid and reliable way of executing different types of applications, ranging from speech recognition to medical diagnosis, speeding up onerous and long workloads. The challenges involved in their implementation at the edge include providing diversity, flexibility, and sustainability; this implies, for instance, supporting evolving applications and algorithms energy-efficiently. Hardware or software accelerators can deliver fast and efficient computation of NNs, while flexibility can be exploited to support long-term adaptivity. Nonetheless, handcrafting a NN for a specific device, although it may lead to an optimal solution, takes time and experience, which is why frameworks for the automated design of hardware accelerators are being developed. This work-in-progress study explores the possibility of combining the toolchain proposed by Ratto et al. [1], which has the distinctive ability to favor adaptivity, with Approximate Computing (AC). The goal is to enable lightweight, adaptable NN inference on FPGAs at the edge. Before that, the work presents a detailed review of established frameworks that adopt a similar streaming architecture, for future comparison.
Index Terms:
Cyber-Physical Systems, Convolutional Neural Networks, Approximate Computing, FPGAs

Acronyms:

- AC: Approximate Computing
- AES: Advanced Encryption Standard
- API: Application Programming Interface
- ASIC: Application Specific Integrated Circuit
- BNN: Binary Neural Network
- CGR: Coarse-Grain Reconfigurable
- CPG: Co-Processor Generator
- CPS: Cyber-Physical System
- CPU: Central Processing Unit
- DAG: Directed Acyclic Graph
- DNN: Deep Neural Network
- DPN: Dataflow Process Network
- DSE: Design Space Exploration
- DSP: Digital Signal Processing
- FFT: Fast Fourier Transform
- FIFO: First-In First-Out queue
- FPGA: Field Programmable Gate Array
- GPU: Graphics Processing Unit
- HEVC: High Efficiency Video Coding
- HLS: High Level Synthesis
- HWPU: HW Processing Unit
- IP: Intellectual Property
- LUT: Look-Up Table
- MDC: Multi-Dataflow Composer
- MDG: Multi-Dataflow Generator
- MLP: Multi-Layer Perceptron
- MCDMA: Multi-Channel DMA
- MoA: Model of Architecture
- MoC: Model of Computation
- MPSoC: Multi-Processor System on a Chip (SoC)
- NN: Neural Network
- OS: Operating System
- PC: Platform Composer
- PE: Processing Element
- PiSDF: Parameterized and Interfaced Synchronous DataFlow
- PDF: Parameterized DataFlow
- PL: Programmable Logic
- PS: Processing System
- PMC: Performance Monitoring Counter
- PSDF: Parameterized Synchronous DataFlow
- QSoC: Quantized SoC
- QNN: Quantized Neural Network
- RTL: Register Transfer Level
- SG: Scatter-Gather
- SDF: Synchronous DataFlow
- SoC: System on a Chip
- SMMU: System Memory Management Unit
- TIL: Template Interface Layer
- UAV: Unmanned Aerial Vehicle
- UGV: Unmanned Ground Vehicle
I Introduction
Cyber-Physical Systems (CPS) integrate “computation with physical processes whose behavior is defined by both the computational (digital and other forms) and the physical parts of the system” (https://csrc.nist.gov/glossary/term/cyber_physical_systems). They are characterized by a considerable information exchange with the environment and by dynamic, reactive behaviors in response to environmental changes. In modern systems, CPS or not, decision making can be brought directly to the edge on small embedded platforms by exploiting the capabilities of NNs. This calls for real-time, low-energy execution, which can be achieved by leveraging different resources on the same platform, making heterogeneous computing fundamental.
Field Programmable Gate Array (FPGA) devices can guarantee hardware acceleration, execution flexibility, and energy efficiency, as well as heterogeneity, which is why this type of device is a valuable option for NN inference at the edge [17]. Indeed, their integration into heterogeneous Multi-Processor Systems on Chip has opened up a wide range of possibilities, exploiting the potential of the FPGA fabric along with the CPU's ability to manage control flow and communication.
While many solutions exist to deploy AI models on these platforms, they often lack full support for advanced features, such as flexibility. This paper focuses on laying the groundwork for future runtime adaptivity, which is, in our opinion, key to addressing the dynamic and reactive nature of CPS. Designing and deploying reconfigurable accelerators with these capabilities still needs to be investigated, as it requires in-depth knowledge of the underlying hardware and hand-tailored solutions. This belief is at the basis of this study, which, starting from a state-of-the-art framework [1], ultimately seeks automated strategies to deploy lightweight, flexible, and adaptive NN accelerators for CPS: different Pareto-optimal working points can be merged into an adaptive accelerator and exploited at runtime to serve variable and evolving applications.
II Background
Several software libraries and frameworks have been developed to facilitate the development and high-performance execution of CNNs. Tools such as Caffe (https://caffe.berkeleyvision.org/), CoreML (https://developer.apple.com/documentation/coreml), PyTorch (https://pytorch.org/), Theano (https://github.com/Theano), and TensorFlow (https://www.tensorflow.org/) aim to increase the productivity of CNN developers by providing high-level APIs that simplify data-pipeline development. Such environments are currently flanked by High Level Synthesis (HLS) tools, which are used to generate FPGA-based hardware designs from a high level of abstraction and to speed up the porting of complex algorithms to the edge. Examples of HLS environments are AMD’s Vitis HLS, Intel FPGA OpenCL SDK, Maxeler’s MaxCompiler [9], and LegUp [7]. They employ commonly used programming languages such as C, C++, OpenCL, and Java, to fill the gap between software-defined applications and their hardware implementation.
To execute NNs at the edge, three main types of architectures can be found in the literature [2]: the Single Computational Engine architecture, based on a single computation engine, typically a systolic array of processing elements or a matrix multiplication unit, that executes the CNN layers sequentially [11]; the Vector Processor architecture, which provides instructions specific to accelerating convolution-related operations [13]; and the Streaming architecture, which consists of one distinct hardware block for each layer of the target CNN, with each block optimized separately [3, 4]. In our studies, we focus mainly on the latter.
II-A Streaming Architectures
In our previous work [1], the dataflow model proved to be the best suited to support runtime adaptivity and enhance parallelism. The resulting hardware is a streaming architecture that uses on-chip memory, guaranteeing low-latency and low-energy computing. Solutions that exploit a similar streaming architecture are FINN [4], an experimental framework from AMD Research Labs based on Theano, and HLS4ML [3], an open-source framework designed to facilitate the deployment of machine learning models on FPGAs, targeting low-latency and low-power edge applications.
These two solutions are described in Sections II-A1 and II-A2 respectively, and their performance is compared in Table I.
II-A1 FINN
FINN is a framework for building scalable and fast NN accelerators, with a focus on supporting Quantized Neural Network (QNN) inference on FPGAs. A given model, trained through Brevitas, is compiled by the FINN compiler, producing a synthesizable C++ description of a heterogeneous streaming architecture. All QNN parameters are stored in on-chip memory, which greatly reduces the power consumed and simplifies the design. The computing engines communicate via on-chip data streams. Avoiding a “one-size-fits-all” design, an ad-hoc topology is built for each network. The resulting accelerator is deployed on the target board using the AMD Pynq framework. Two works adopting the FINN framework have been analyzed, and their results are summarized in Table I.
II-A2 HLS4ML
The main operation of the HLS4ML library is to translate the model of the network into an HLS project. The work in [6] focused on reducing the computational complexity and resource usage of a fully connected network for MNIST classification: the data is fed to a multi-layer perceptron with an input layer of 784 nodes, three hidden layers of 128 nodes each, and an output layer of 10 nodes. The work exploits the potential of pruning and Quantization-Aware Training to drastically reduce the model size with limited impact on its accuracy.
To the best of our knowledge, neither FINN nor HLS4ML, despite targeting FPGA-based streaming architectures and supporting AC features such as pruning and quantization, has ever proposed a reconfigurable solution for runtime-adaptive environments.
Table I: Performance of FINN and HLS4ML streaming accelerators reported in the literature.

Framework | Dataset | FC [#] | CONV [#] | Datatype [bits] | Target board | LUT [#] | DSP [#] | Latency [us] | Throughput [FPS] | Power [W] | Accuracy [%]
---|---|---|---|---|---|---|---|---|---|---|---
FINN [5] | CIFAR-10* | 2 | 6 | 2 | Zynq7000 | 46253 | - | 283 | 21.9k | 15.3 | 80.1
FINN [5] | SVHN* | 2 | 6 | 2 | Zynq7000 | 46253 | - | 283 | 21.9k | 15.3 | 94.3
FINN [4] | CIFAR-10 | 2 | 6 | 2 | UltraScale | 392947 | - | 671 | 12k | <41 | 88.3
HLS4ML [3] | SVHN | 3 | 3 | 7 | UltraScale+ | 38795 | 72 | 1035 | - | - | 95
HLS4ML [6] | MNIST | 3 | 0 | 16 | UltraScale+ | 366494 | 11 | 200 | - | - | 96

* The two datasets are cropped to have the same image size.
II-B Approximate Computing
AC has been established as a new design paradigm for energy-efficient circuits, exploiting the inherent ability of a large number of applications to produce results of acceptable quality despite some approximations in their computations. Leveraging this property, AC approximates the hardware execution of error-resilient computations in a manner that favors performance and energy efficiency. Moreover, NNs have demonstrated strong resilience to errors and can take great advantage of AC [12]. In particular, hardware NN approximation can be classified into three broad categories: Computation Reduction, Approximate Arithmetic Units, and Precision Scaling [8].
Computation Reduction
The Computation Reduction category aims at systematically avoiding, at the hardware level, the execution of some computations, significantly decreasing the executed workload. An example is pruning: biases, weights, or entire neurons can be removed to lighten the workload [15].
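As a minimal illustration of this idea (a software-level sketch of zero-skipping, not the specific mechanism of [15]), multiplications by pruned, zero-valued weights can simply be skipped in a dot product:

```cpp
#include <cstdint>
#include <cstddef>

// Zero-skipping dot product: multiplications by pruned (zero) weights are
// simply not issued, so the executed workload shrinks with the pruning ratio.
int32_t sparse_dot(const int8_t *w, const int8_t *x, std::size_t n) {
    int32_t acc = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (w[i] != 0) {  // skip work on pruned weights
            acc += static_cast<int32_t>(w[i]) * static_cast<int32_t>(x[i]);
        }
    }
    return acc;
}
```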
Approximate Arithmetic Units
Approximate Arithmetic Units improve the energy consumption and latency of DNN accelerators by employing approximate circuits in place of more accurate units, such as the Multiply-and-Accumulate (MAC) unit [16].
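As one simple, generic flavor of this approach (a sketch, not the Wallace-tree design of [16]), a truncation-based multiplier drops the operands' least-significant bits so that a narrower, cheaper multiplier can be used:

```cpp
#include <cstdint>

// Truncation-based approximate multiplier: the K least-significant bits of
// each operand are dropped before multiplying, so a narrower multiplier is
// sufficient; the product is then rescaled to the original range.
template <int K>
int32_t approx_mul(int16_t a, int16_t b) {
    const int32_t at = a >> K;      // arithmetic shift keeps the sign
    const int32_t bt = b >> K;
    return (at * bt) << (2 * K);    // rescale the truncated product
}
```

Larger values of K shrink the multiplier, trading accuracy for area and energy.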
Precision Scaling
The most widely used Precision Scaling practice is quantization: quantized hardware implementations feature reduced bit-width datapaths and arithmetic units, attaining very high energy, latency, and bandwidth gains compared to 32-bit floating-point implementations. Rather than executing all the required mathematical operations in ordinary 32-bit or 16-bit floating point, quantization allows smaller integer operations to be used instead [14]. For this purpose, AMD provides an arbitrary-precision data type library for use in Vivado HLS designs, which allows data types of any bit width to be specified, beyond what the standard C++ data types provide. The library also supports customizable fixed-point data types [10].
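A minimal sketch of how such arbitrary-precision types can be used in a MAC loop is shown below; the chosen bit widths and the function name are illustrative assumptions, not values prescribed by [10]:

```cpp
#include <ap_fixed.h>  // AMD/Xilinx arbitrary-precision types for Vivado HLS

// ap_fixed<W, I> has W total bits, I of which are integer bits. Here
// activations use 16 bits, weights 8 bits, and the accumulator is kept
// wider to avoid overflow during the summation (hypothetical widths).
typedef ap_fixed<16, 6>  act_t;
typedef ap_fixed<8, 2>   weight_t;
typedef ap_fixed<32, 12> acc_t;

acc_t mac16(const act_t x[16], const weight_t w[16]) {
    acc_t acc = 0;
    for (int i = 0; i < 16; ++i) {
#pragma HLS PIPELINE II=1
        acc += x[i] * w[i];  // narrow multipliers, wide accumulation
    }
    return acc;
}
```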
III Proposed design flow
The main innovation of the design flow proposed in [1] is the possibility of supporting runtime adaptivity in the hardware accelerator through reconfiguration.
III-A Tools
Different tools are utilized along the design flow:
- the ONNXParser (https://gitlab.com/aloha.eu/onnxparser), a Python application intended to parse ONNX models and automatically create the code for a target device. It is composed of a Reader and several Writers, one for each target;
- the Vivado HLS tool (https://www.AMD.com/support/documentation-navigation/design-hubs/dh0012-vivado-high-level-synthesis-hub.html), which synthesizes a C or C++ function into RTL code for implementation on AMD FPGAs. The resulting hardware can be optimized and customized through the insertion of directives in the code;
- the Multi-Dataflow Composer (MDC) (https://mdc-suite.github.io/), an open-source tool that can offer optional Coarse-Grained reconfigurability support for hardware acceleration. It takes as input the applications, specified as dataflows, together with a library of HDL files describing the actors. These dataflows are then combined, and the resulting multi-dataflow topology is filled with the actors taken from the HDL library.
III-B Design Flow
The proposed flow, displayed in Figure 1, starts from the ONNX representation of the NN and produces a streaming accelerator for the input model. The ONNX file is given as input to the ONNXParser: first, the Reader parses the ONNX file and produces an intermediate format, a list of objects describing the layers and connections of the ONNX model. Then, the selected Writer creates the target-dependent files. When the target is the HLS flow, it is possible to customize the data precision used to represent weights and activations. The HLS Writer produces the C++ files that implement the layers and the TCL scripts that automate the synthesis with Vivado HLS.

The C++ description of the layers is based on a template architecture: for the CONV layer, the core of the CNN, the template is composed of a Line Buffer actor, which stores the input stream to provide data reuse; the Convolutional actor, which executes the actual computation; and the Weight and Bias actors, which store the kernel parameters needed for the convolution. The resulting template is depicted in Figure 2 (a simplified sketch is given at the end of this section). Each actor is developed to be customizable with the hyperparameters, e.g., input and kernel sizes, extracted from the ONNX model.

The HDL library produced by Vivado HLS is given as input to the Multi-Dataflow Composer, together with the XDF file that describes the topology of the network and the CAL files that identify the different actors; the latter are also generated by the HLS Writer. Finally, the HDL file of the complete dataflow is automatically generated. Optionally, the MDC Co-Processor Generator can be used to deploy the accelerator using the Vivado design suite: it delivers the scripts needed to wrap the accelerator and connect it to a complete processor-coprocessor system. Along with the hardware system, the drivers to call the coprocessor from the SDK application are made available.
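To make the template concrete, the following is a minimal, heavily simplified HLS-style sketch of the CONV decomposition described above. It uses a 1-D window, fixed illustrative sizes, and hypothetical actor signatures; it is not the generated code, which handles 2-D line buffering and is parameterized from the ONNX model:

```cpp
#include <ap_fixed.h>
#include <hls_stream.h>

// Illustrative fixed sizes (the actual template is parameterized with the
// hyperparameters extracted from the ONNX model).
typedef ap_fixed<16, 6> data_t;
const int K = 3;   // kernel size
const int W = 32;  // row width

// Line Buffer actor: stores the incoming stream and emits K-wide windows,
// providing data reuse (the real template keeps a 2-D line buffer).
void line_buffer(hls::stream<data_t> &in, hls::stream<data_t> &win) {
    data_t buf[K] = {0};
    for (int x = 0; x < W; ++x) {
#pragma HLS PIPELINE II=1
        for (int i = 0; i < K - 1; ++i) buf[i] = buf[i + 1];
        buf[K - 1] = in.read();
        if (x >= K - 1)
            for (int i = 0; i < K; ++i) win.write(buf[i]);
    }
}

// Convolutional actor: consumes the windows together with locally stored
// weights and bias (the Weight and Bias actors of the template) and
// produces one output sample per window.
void conv(hls::stream<data_t> &win, hls::stream<data_t> &out,
          const data_t weight[K], data_t bias) {
    for (int x = 0; x < W - K + 1; ++x) {
#pragma HLS PIPELINE II=1
        data_t acc = bias;
        for (int i = 0; i < K; ++i) acc += weight[i] * win.read();
        out.write(acc);
    }
}
```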
In the work of Ratto et al. [1] the flow was only semi-integrated; as part of this preliminary study, the entire generation process, from the ONNX file down to the accelerator deployment, has been fully automated.
IV Preliminary results and future direction
Table II: Exploration of the MNIST classifier for different activation (D) and weight (W) bit widths; resource columns report the percentage of available resources used.

Datatype | Zero-weights [%] | LUT [%] | FF [%] | DSP [%] | BRAM [%] | Latency [us] | Throughput [FPS] | Power [mW] | Energy [uJ] | Accuracy [%]
---|---|---|---|---|---|---|---|---|---|---
D32-W32 | 0.0 | 29.6 | 24.5 | 29.5 | 15.4 | 1530 | 88K | 28.6 | 43.7 | 98
D16-W16 | 0.0 | 23.4 | 20.2 | 52.7 | 15.4 | 1510 | 89K | 25.3 | 38.3 | 98
D8-W16 | 0.8 | 9.1 | 5.6 | 15.5 | 13.2 | 510 | 296K | 20.1 | 10.2 | 76
D16-W8 | 15.0 | 8.5 | 0.6 | 15.5 | 13.2 | 510 | 296K | 19.5 | 9.9 | 98
D16-W4 | 55.3 | 7.7 | 4.3 | 15.5 | 4.3 | 510 | 296K | 17.5 | 8.9 | 97
D16-W2 | 85.7 | 7.7 | 4.3 | 15.5 | 4.3 | 1140 | 117K | 15.0 | 17.1 | 68
To assess the re-engineered flow described in Section III-B, a wide exploration targeting the MNIST classifier has been carried out, as reported in Table II. The intent was also to show the impact of quantization on both model accuracy and hardware performance, which is generally in line with the expectations that AC can offer, as discussed in Section II-B. It can be noticed that accuracy is not as affected by reducing parameter precision as it is by reducing activation precision. Moreover, reduced parameter precision leads to a reduced memory footprint (BRAM column) and a high percentage of zero weights. The latter can be exploited to combine quantization with pruning, which skips multiplications by zero. For a fair comparison with the state-of-the-art solutions presented in Table I, onboard experiments that also account for memory accesses are needed. However, the preliminary results already show competitive performance in terms of utilized resources and latency/throughput. A broader comparison against the state of the art, based on significant onboard measurements and targeting more complex datasets, will be carried out in the future. Nonetheless, it is worth recalling that state-of-the-art approaches are not conceived to support runtime adaptivity, which is instead what motivates our research.
Indeed, our future work intends to explore mixed precision in adaptive NN accelerators. Having a fully automated flow with reconfiguration support available was a key preliminary step to reduce the manual effort in accelerator definition and exploration. The analysis carried out so far on non-reconfigurable accelerators shows that promising trade-offs are present, e.g., trading accuracy for reduced energy consumption. The ultimate goal is the efficient runtime management of the system, which implies, as a first step, the combination of the different working points over a reconfigurable substrate. The latter can be achieved by leveraging the whole set of functionalities offered by the MDC tool to design and operate reconfigurable and evolvable NN accelerators for CPS, including the one presented in this study. The resulting accelerators will be able to switch configuration at runtime to adapt to the desired goal: e.g., when a limited energy budget is left, a reduction in energy consumption is worth some accuracy loss.
One of the challenges we expect to face in the future steps of this research is the limited onboard memory, which could constrain us to the execution of relatively small models (e.g., TinyML), especially when runtime switching among algorithms/configurations is required. The adoption of a reconfigurable approach, capable of sharing weights among configurations, should help us tackle that issue, limiting the memory-footprint impact of having more than one network available. This may limit the accuracy advantages achievable with Quantization-Aware Training [18]. However, the preliminary results with Post-Training Quantization show a limited drop in accuracy even with 4-bit weights.
Acknowledgments
The authors would like to thank Stefano Esposito for his contribution to this work during his Master’s thesis.
References
- [1] Ratto, Francesco, et al. "An Automated Design Flow for Adaptive Neural Network Hardware Accelerators." Journal of Signal Processing Systems (2023): 1-23.
- [2] Venieris, Stylianos, et al. "Toolflows for mapping convolutional neural networks on FPGAs: A survey and future directions." ACM Computing Surveys (CSUR) 51.3 (2018): 1-39.
- [3] Aarrestad, Thea, et al. "Fast convolutional neural networks on FPGAs with hls4ml." Machine Learning: Science and Technology 2.4 (2021): 045015.
- [4] Fraser, Nicholas J., et al. "Scaling binarized neural networks on reconfigurable logic." Proceedings of the 8th Workshop and 6th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms. 2017.
- [5] Umuroglu, Yaman, et al. "FINN: A framework for fast, scalable binarized neural network inference." Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 2017.
- [6] Ngadiuba, Jennifer, et al. "Compressing deep neural networks on FPGAs to binary and ternary precision with hls4ml." Machine Learning: Science and Technology 2.1 (2020): 015001.
- [7] Canis, Andrew, et al. "LegUp: An open-source high-level synthesis tool for FPGA-based processor/accelerator systems." ACM Transactions on Embedded Computing Systems (TECS) 13.2 (2013): 1-27.
- [8] Armeniakos, Giorgos, et al. "Hardware approximate techniques for deep neural network accelerators: A survey." ACM Computing Surveys 55.4 (2022): 1-36.
- [9] Summers, Sioni, et al. "Using MaxCompiler for the high level synthesis of trigger algorithms." Journal of Instrumentation 12.02 (2017): C02015.
- [10] https://jiafulow.github.io/blog/2020/08/02/hls-arbitrary-precision-data-types
- [11] Guan, Yijin, et al. "FP-DNN: An automated framework for mapping deep neural networks onto FPGAs with RTL-HLS hybrid templates." 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2017.
- [12] Mittal, Sparsh. "A survey of techniques for approximate computing." ACM Computing Surveys (CSUR) 48.4 (2016): 1-33.
- [13] Farabet, C., et al. "An FPGA-based processor for convolutional networks." Proc. IEEE International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2009: 32-37.
- [14] Choi, Jungwook, et al. "Accurate and Efficient 2-bit Quantized Neural Networks." Proceedings of Machine Learning and Systems, A. Talwalkar, V. Smith, and M. Zaharia (Eds.), Vol. 1 (2019): 348-359.
- [15] Han, Song, et al. "EIE: Efficient Inference Engine on Compressed Deep Neural Network." Proceedings of the 43rd International Symposium on Computer Architecture (2016): 243-254.
- [16] Bhardwaj, Kartikeya, et al. "Power- and area-efficient approximate Wallace tree multiplier for error-resilient systems." Fifteenth International Symposium on Quality Electronic Design. IEEE, 2014.
- [17] Guo, Kaiyuan, et al. "[DL] A survey of FPGA-based neural network inference accelerators." ACM Transactions on Reconfigurable Technology and Systems (TRETS) 12.1 (2019): 1-26.
- [18] Gholami, Amir, et al. "A survey of quantization methods for efficient neural network inference." arXiv preprint arXiv:2103.13630 (2021).