This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

vlang: Mapping Verilog Netlists to Modern Technologies

Nicholas V. Giamblanco Computer Engineering Department
University of Toronto
Toronto, Canada
[email protected]
   Andrew Schmidt This work is funded by the Department of Energy’s Kansas City National Security Campus, operated by Honeywell Federal Manufacturing & Technologies, LLC, under contract number DE-NA0002839. Information Sciences Institute
University of Southern California
Arlington, USA
[email protected]
Abstract

Portability of hardware designs between Programmable Logic Devices (PLD) can be accomplished through the use of device-agnostic hardware description languages (HDL) such as Verilog or VHDL. Hardware designers can use HDLs to migrate hardware designs between devices and explore performance, area and power tradeoffs, as well as, port designs to an alternative device. However, if design files are corrupt or missing, the portability of the design is lost. While reverse engineering efforts may be able to recover an HDL-netlist of the original design, HDL-netlists use device-specific primitives, restricting portability. Additionally, the recovered design may benefit from other computational technologies (e.g., μ\muP, GPGPUs), but is restricted to the domain of PLDs. In this work, we provide a new framework, vlang, which automatically maps Verilog-netlists into LLVM’s intermediate representation (IR). The remapped design can use the LLVM-framework to target many device technologies such as: x86-64 assembly, RISC-V, ARM or to other PLDs with a modern high-level synthesis tool. Our framework is able to preserve the exact functionality of the original design within the software executable. The vlang-produced software executable can be used with other software programs, or to verify the functionality and correctness of the remapped design. We evaluate our work with a suite of hardware designs from OpenCores. We compare our framework against state-of-the-art simulators, thereby outlining our framework’s ability to produce a fully-functional, cycle accurate software-executable. We also explore the usage of vlang as a front-end for high-level synthesis tools.

Index Terms:
Verilog-to-C Compiler, High-Level Synthesis, Design Portability, Graph Algorithms

I Introduction

Programmable logic devices (PLDs) are rapidly evolving: many new devices technologies and programming models are being delivered to the market to meet the computing needs of today [1, 2, 3, 4, 5]. However, maintaining portability of designs between past, current or future PLD-technologies is crucial  [6, 7]: either for continued product use, or for broad adoption of upcoming technologies.

Traditionally, the portability of hardware designs is generally handled with a portable design medium such as a hardware-description language (HDL) or with specialized design files (i.e., bitstream files). However, if these design mediums are lost or damaged, preserving the design on a programmable device is difficult. If a PLD is still configured with the design’s bitstream, it may be possible to recover the design through bitstream readback [8, 9, 10, 11]. Additionally, the original PLD model may be deprecated (i.e., the PLD model is no longer manufactured), rendering the recovered bitstream as useless since it cannot be applied to any other PLD. Therefore, the bitstream will need to undergo reverse engineering efforts [10].

Typically, the recovered design will be in the form of primitives native to the device (e.g., device-specific LUTs, DSPs, BRAMs, etc.), and represented with a HDL, which is referred to as a HDL netlist [8, 10]. The primitives used in the HDL netlist may not exist on the desired replacement device. Therefore, additional methods must be applied to remap deprecated-device primitives in the HDL netlist to the fabric of a user-selected modern device. Once this design has been remapped to the new-device’s primitives, it will undergo the corresponding vendor’s compilation flow.

It is not clear if the recovered-design will retain similar performance, power and area characteristics on alternative PLD technologies. Additionally, the recovered hardware design may benefit from other computational technologies (e.g. a μ\muProcessor, GPGPU, CGRA). However, the reverse engineering process restricts portability of the recovered design to only programmable-hardware platforms. This is due the structural encoding of the recovered design: the low-level representation of an algorithm or hardware design limits the possibility to explore design optimizations or modifications during the recovery process (i.e. security patches, performance improvements, power reduction techniques) or to gain an understanding of an application’s intentions.

To address the issue of hardware-design portability with HDL-netlists, we present a framework, vlang. Our framework is a new LLVM [12] front-end which compiles Verilog-HDL netlists into LLVM’s intermediate representation (IR), permitting portability of designs across a variety of computational devices (e.g., x86, RISC-V, ARM, etc.). vlang’s LLVM-IR is high-level synthesis (HLS) compatible, allowing designs to be remapped to alternative PLD technologies supported by modern HLS tools. Additionally, vlang produces LLVM-IR code which is cycle-accurate and functionally equivalent and to a hardware circuit, and can be used for simulation. We demonstrate that we can compile vlang’s LLVM-IR into a software executable which can be linked into C/C++ programs.

netlist.vnetlist.llnetlist.hParserSW ModelSchedulerExtracted NetlistvlangLLVMHLS (FPGA)x86RISC-V
Figure 1: An Overview of vlang’s compilation process.

The main contributions of this paper are:

  • A LLVM-frontend, vlang, which compiles a Verilog-netlist into LLVM-IR program which is functionally equivalent and cycle accurate to the Verilog-netlist.

  • A performance comparison of vlang’s simulation capabilities against state-of-the-art tools.

  • A study on vlang’s ability to preserve the functionality of designs through the use of modern HLS tools.

II Background

We provide: (1) a brief review of the LLVM compilation framework and (2) device-specific netlist example to assist the reader in understanding our work.

II-A LLVM

The LLVM compilation framework [12] provides a set of toolchains to enable compilation between any supported source-language to any supported target language. Source-languages are compiled into LLVM’s-IR, a strongly-typed RISC instruction-set. The IR is agnostic to the source and target languages. RISC instructions are simple operations such as add, or, xor, store, load, etc. Instructions may have: (a) 0 or more inputs and (b) 0 or 1 outputs. LLVM encodes the IR using single static assignment [13]. Control flow in the LLVM-IR is handled with the use of basic blocks and control flow graphs. A basic-block requires an entry to the block, a set of control-flow-independent instructions, and an termination point. Termination points can either be branch (br) or return (ret) instructions. Connections between basic-blocks are made if a branch exists to some other basic-block or itself. Connections between basic blocks can be represented graphically, which we refer to as a control flow graph (CFG). CFGs represent conditional flow of a program.

II-B Verilog Netlist

A Verilog-netlist encodes a hardware-design (e.g., an nn-bit Adder, a registered and-gate) in terms of a device’s available programmable logic resources (e.g., DSPs, LUTs, etc.). We will refer to a device’s available programmable logic resources as device-specific primitives. In our work, we consider PLD netlist extraction processes which recover designs and represent them as a Verilog-netlist. We provide an example in Fig. 2. In Fig. 2(a) a registered-AND-gate is behaviourally described with Verilog. The same design is presented in Fig. 2(b); however, it has been encoded in terms of device-specific primitives.

module reg_and (clk, a, b, out);
input a,b,clk;
output reg out;
always @(posedge clk) begin
out <= a & b;
end
endmodule
(a)
module reg_and (clk, a, b, out);
input a, b, clk;
output out;
wire a_inv;
FDR fdr (.C(clk), .D(b), .R(a_inv), .Q(out));
INV inv (.I(a), .O(a_inv));
endmodule
(b)
Figure 2: An example of (a) behaviourally-described hardware design in Verilog and (b) the same design described using device-specific primitives.

III vlang Approach

We present vlang, a LLVM-frontend to compile a Verilog netlist into LLVM’s IR. vlang’s compilation process for a Verilog netlist is outlined in Fig. 1. In this process, a Verilog netlist is first parsed into a software model. The software-model holds information regarding the primitives (e.g., LUTs, AND-Gates, Flip-Flops, etc.) in the supplied design, as well as the connections between primitives. During compilation, vlang replaces each primitive in the software-model with an LLVM-function. Replacement LLVM-functions are behavioural descriptions of the device-specific primitives. Once all primitives have been replaced with LLVM-functions, vlang determines the order-of-execution for all of the LLVM-functions through a scheduling algorithm; the explicit parallel structures in hardware need to be mapped into sequentially executed code111vlang does not attempt to create multi-threaded programs; this is left as future work.. After scheduling has completed, vlang will connect LLVM-functions by assembling a set of temporary registers in the LLVM-IR to act as the wiring between modules. Lastly, vlang will produce a C-header file which outlines the calling-convention for this compiled design. The following subsections provide further detail on vlang’s compilation process.

III-A Mapping Primitive Modules to LLVM

Typical PLD-fabric consists of programmable elements such as look-up tables (LUTs), flip-flops (FFs), digital-signal processing blocks (DSPs) and on-chip memories (RAMs). Routing resources are required to route information to-and-from these hardware-blocks. A Verilog netlist describes the hardware design (which has been configured on the PLD) by structurally representing programmable-hardware resources (i.e., primitives) as Verilog modules. Connections between primitives are explicitly made through the use of wires [14]. For each primitive in the design, vlang will construct an LLVM-function, which we describe below.

III-B Inputs and Outputs

module example(a, b, c);
input [31:0] a;
input [31:0] b;
output [31:0] c;
//== structural description.
endmodule
(a)
define void @example(i32 ; Translated Verilog module here..
;
;
ret void
}
(b)
Figure 3: An example of how vlang translates input/output nets to LLVM. (a) is an example of a Verilog module to be translated in LLVM-IR. (b) is the result.

vlang remaps a primitive’s inputs into LLVM-software datatypes (e.g., i1, i32, i43) which match the bit-widths of the inputs222Both LLVM and Verilog support variable-sized bitwidth datatypes. Since a Verilog module may have more than one output bus/net, vlang remaps output nets to pointers (sized to match the datatype-width of the net) as this provides the ability to obtain more than one output from a software function. Fig. 3, demonstrates this remapping, where Fig. 3(a) provides the original Verilog module and Fig. 3(b) provides the LLVM-function declaration generated by vlang.

III-C Primitives

Primitives on PLD fabric can either be combinational or clock triggered. We map both classes of primitives to LLVM-functions.

III-C1 Combinational Primitives

Combinational primitives such as {AND, XOR, OR, …} gates or LUTs, can be mapped directly to LLVM instructions (e.g., an AND gate is replaced with the and instruction). However, replacing the LUT combinational primitive is not as trivial. To map a LUT into an LLVM-function, the following information must be provided: (a) the size of the LUT (i.e., the number of inputs to the LUT) and (b) the LUT-mask. Using the LUT-mask, the sum-of-products formula is constructed, therefore, a sequence of and and or LLVM-instructions can be used to construct the behaviour of the LUT.

III-C2 Flip-Flops

DQQ¯\overline{\mbox{Q}}DFF0DQQ¯\overline{\mbox{Q}}DFF1Qt\text{Q}_{t}Pt\text{P}_{t}Rt\text{R}_{t}clk
Figure 4: Motivating example illustrating the need for two global-variables for the LLVM-function equivalent.

Flip-flops are clock-triggered primitives which provide storage capabilities. Replacement LLVM-functions must be able emulate a flip-flop. Consider the circuit in Fig. 4, where two D-Flip Flops are chained together. At the next clock edge, data at the D inputs of the flip-flops will be stored and displayed at the output: DFF1 will display Qt\text{Q}_{t} at it’s output, and DFF0 will output Pt\text{P}_{t}. However, if we were to sequentially update the flips flops from left-to-right (i.e., DFF0 is updated, then DFF1), DFF1 would improperly update to output Pt\text{P}_{t}. Unfortunately, a naive replacement LLVM-function would lead to the above error, since vlang produces sequentially executed LLVM-IR. Therefore, the replacement LLVM-function must preserve the parallel-functionality of the flip-flop. We detail how to update LLVM-described flip-flops. The function signature of the replacement LLVM-function for a flip-flop is depicted in Fig. 5.

define void @dff(i1 ; Translated Verilog module here..
ret void
}
Figure 5: The LLVM-function signature of a D-flip-flop.

Each flip-flop replacement must have two global variables, GV1 and GV2. GV1 holds the state of an incoming changes. GV2 holds the current state of the flip-flop, which will then be returned at the end of the function and then updated with the contents of GV1. Additionally, the flip-flop should only change state on a clock-edge. However, LLVM lacks the notion of edge-triggered logic. To capture a clock-edge, the function must be able to recall the previous state of the clock, and then use this memory as a reference against the current clock-signal. XOR-ing the recalled and current clock-signal will dictate if there was a change in the clock. Then, inspecting the current clock level will dictate if this was a positive edge, or negative edge shown in Fig. 6. Other flip-flop types (e.g., asynchronous reset, clock-enabled) can also be constructed as an LLVM function, by modifying this base-design.

define i32 @dff(i1 ;… if hi, save d.
; Remember the state of the clock
store i1 ;…
}
Figure 6: Clock edge detection logic in LLVM.

III-C3 Other Primitives

We currently do not support DSP and RAM blocks due the large number of configuration possibilities. Instead, if DSPs or RAM blocks are used in the original hardware, we map these to LUT and Flip-Flop elements. Adding support for these primitives is left as future work. We summarize the mapping between Verilog primitives and LLVM Datatypes in Table. I.

TABLE I: Mapping between Verilog primitives and LLVM structures.
Verilog Primitive Declaration Inputs and Wires Outputs Logic Gates Other Primitives
LLVM Function Declaration Variables Pointers Logic Instructions LLVM Functions

III-D Module Execution Schedule

All primitives mapped into LLVM functions may only be executed sequentially and must preserve the functionality of the original hardware design. However, Verilog-netlists are a form of dataflow-graphs (DFGs) which expose explicit parallelism between operators. An example of this encoding is pictured in Fig. 7, where the logic equation:

(a((ab)))c(\texttt{a}\land(\sim(\texttt{a}\oplus\texttt{b})))\lor\texttt{c} (1)

is encoded as a DFG. Additionally, Verilog-netlists may contain feedback (e.g., a linear-shift feedback register). vlang must determine an ordering of the nodes in the dataflow graph which preserves the data flow between nodes (or synonymously, primitives). Typically, a topological ordering would suffice if the DFG was guaranteed to be acyclic [15]. Since Verilog-netlist DFGs may contain feedback paths, a topological ordering cannot be computed unless feedback paths in the graph can be handled.

abcandinvxoror
Figure 7: An example of a device-specific netlist encoded as a data flow graph.

Feedback loops in Verilog netlists can be divided into two classes: (a) sequential feedback and (b) combinational feedback. If a feedback edge passes through a clock-triggered element which saves state (i.e., a flip-flop), then this will be referred to as sequential feedback. Otherwise, combinational feedback occurs if all primitives along the feedback edge are a form of combinational logic. Our tool does not support combinational logic loops.

To schedule the execution of primitives, the corresponding DFG must be acyclic. We present a methodology to (a) identify feedback-edges (i.e., cycles) from flip-flops in a DFG, (b) temporarily modify the DFG to be acyclic by removing feedback, (c) schedule the acyclic DFG, and then reintroducing feedback edges to restore the functionality of the original circuit.

III-E Breaking Sequential Feedback Loops

First, vlang must identify if there are cycles in the provided DFG. Recall, the DFG is a directed graph. vlang employs Tarjan’s Strongly-Connected Components algorithm [16] to the DFG, to discover any strongly-connected components (SCC), thereby discovering cycles. The vertices in each discovered SCC are inspected to ensure there exists at-least one flip-flop (as this permits valid feedback paths). If there are no flip-flops in any of the strongly-connected components, vlang will abort compilation. Otherwise, the flip-flops within the SCC must be modified to break the cycle/SCC (providing an acyclic graph).

Algorithm 1 Cycle Removal Algorithm
1:function CycleRemoval(GG)
2:     CTarjanSCC(G)C\leftarrow\texttt{TarjanSCC(}G\texttt{)} \triangleright List of SCCs
3:     RnewKeyValueTable()R\leftarrow\texttt{newKeyValueTable()}
4:     if HasCombinationalSCC(C) then
5:         abort()      
6:     for cCc\in C do
7:         for vcv\in c do
8:              if IsaFlipFlop(v)\texttt{IsaFlipFlop(}v\texttt{)} then
9:                  for Uv.getModulesCon2In()U\in v\texttt{.getModulesCon2In()} do
10:                       if UcU\in c then
11:                           R[v].append(U)R\texttt{[}v\texttt{].append(}U\texttt{)}
12:                           v.removeInput(U)v\texttt{.removeInput(}U\texttt{)}                                                                      
13:     return RR

Our procedure for flip-flop modification is outlined in Algorithm 1. For each flip-flop within an SCC, the modules which provide input (i.e., drivers) to the flip flop are inspected to see if the module is within the SCC. If a flip-flop’s driver is within the vertex set of the SCC, the edge (i.e., wire) connecting them is cut. The flip-flop driver pair is noted down in RR, a key-value data structure, as these modules will need to be reconnected after scheduling has occurred.

III-F ASAP Scheduling

Once Algorithm 1 has completed, vlang generates an ILP formulation for an LP solver to serialize the data-flow graph. We used the system of difference constraints ILP formulation from [17], which will provide a solution for an as-soon-as-possible (ASAP) execution schedule. We used the LP solver, lpsolve [18], to compute the schedule. As an example, if we serialized the DFG-encoding of Equation 1 using the ILP, the order of execution would be: (xor,inv,and,or)(\texttt{xor},\texttt{inv},\texttt{and},\texttt{or}). Optimizations to the serialization process could be made by adjusting the SDC-constraint formulation; we have left this as future work.

However, with Algorithm 1, flip-flops will be scheduled to be connected to what it drives. For example, if a flip-flop drives an and gate, the flip-flop will be scheduled to execute just-before the and gate. However, the execution schedule of the modified flip-flops are not guaranteed to have the correct data driving it’s inputs.

Algorithm 2 vlang’s ASAP Scheduling
1:function Schedule
2:     Gcircuit(E,V)G_{\text{circuit}}(E,V)\triangleright This is the DFG of the circuit
3:     RCycleRemoval(G)R\leftarrow\textsc{CycleRemoval(}G\textsc{)} \triangleright From Alg. 1
4:     SASAP(G)S\leftarrow\texttt{ASAP(}G\texttt{)} \triangleright Computed Serialized Schedule
5:     for vSv\in S do
6:         if vRv\in R then
7:              vcopyv.MakeCopy()v_{\text{copy}}\leftarrow v\texttt{.MakeCopy()}
8:              vcopy.disableClock()v_{\text{copy}}\texttt{.disableClock()}
9:              S.replace(v,vcopy)S.\texttt{replace(}v,v_{\text{copy}}\texttt{)}               
10:     for {v,(U0,U1,)}Rv,(U_{0},U_{1},...)\}\in R do
11:         Dv.getInputs()D\leftarrow v\texttt{.getInputs()}
12:         TGetLatestStart(D0,D1,,U0,U1,)T\leftarrow\texttt{GetLatestStart(}D_{0},D_{1},...,U_{0},U_{1},...\texttt{)}
13:         S.insertAfter(v,T)S\texttt{.insertAfter(}v,T\texttt{)}      
14:     return SS

Therefore, modifications to the schedule must be made to allow for the flip-flops to (a) propagate data to the necessary consumers and (b) permit the updating of the flip-flops by reattaching their original drivers. We outline this procedure in Algorithm 2. For each modified flip-flop, we make a copy of the flip flop and set the clock to zero (thereby disabling the flip-flop). This allows the consumers of the affected flip-flops to obtain the data at the correct time. However, the affected flip-flops also need to be updated at the correct time. Recall, vlang had stored the cut drivers from the flip-flops in a key-value data structure, RR. To restore updates to the flip-flops, the original copy of the flip-flop is inserted into the schedule. The insertion point is just-after the latest start-time of all of it’s original inputs. This restores updates to the flip-flop and guarantees the driver’s data to be arriving at the flip-flop at the correct time.

Once the schedule has been generated and modified, the original DFG has been serialized. vlang will iterate through the schedule, and begin inserting call instructions to the the corresponding primitive-modules. Connections between modules are made with local variables, which are known as alloca instructions in LLVM. vlang uses the connection-information collected during it’s parsing stage to determine how may temporary registers (in the LLVM-IR sense) should be instantiated to simulate data-transfer using wires. If a subsection of a bus is set or read, vlang will use and/or logic to set or read from the sub-selected bus.

III-G Cleaning-Up

Upon completion of schedule-implementation, vlang produces a C header-file which provides the function-signature to call the top-level modules as a software function.

IV Experimental Setup and Results

A variety of experiments are performed with vlang to showcase the ability to preserve portability of Verilog-netlist. First, we demonstrate vlang’s ability to produce a software-executable which preserves the functionality and cycle accuracy of the Verilog netlist. The software executable is linked to a C-testbench program. The results of the C-testbench are compared against three state-of-the-art hardware simulators: Verilator [19], Icarus Verilog [20] and Modelsim  [21]. We then explore the usage of vlang as a front-end for HLS tools, by exploring if the HLS-generated circuit from vlang’s LLVM-IR is functionally correct. We also explore the achieved performance and area from the high-level synthesis of vlang’s LLVM-IR, and compare against the original design’s performance and area. To conduct our studies, we explore a number of examples from OpenCore [22].

IV-A OpenCore Benchmarks

OpenCore is an online repository of hardware-designs represented in Verilog or VHDL. A number of designs from OpenCores have been selected to evaluate vlang’s ability to handle: (a) combinational-logic circuits, and (b) sequential-logic circuit with the possibility of feedback.

IV-A1 Combinational Logic Circuits

  • adder: is a ripple-carry adder which has (a) three inputs, two 16-bit numbers and a 1-bit carry-in and (b) two outputs, one 16-bit sum, and a 1-bit carry-out.

  • bcdadder: implements the addition of two 4-bit numbers and a 1-bit carry-in, and produces a 4-bit sum as a binary-coded-decimal with a 1-bit carry-out.

  • divide: performs the division of two 32-bit inputs, and outputs a 32-bit quotient.

  • mod3: calculates a 3-bit remainder of an 8-bit input in modulo-3 space.

  • popcount32: determines the number of one’s in a 32-bit input and is output onto a 5-bit bus.

IV-A2 Benchmarks with Clock-Triggered Elements

  • addertree: implements an adder tree for five 16-bit inputs, where each pair of inputs are added and registered. The last registered stage is output to a 16-bit bus.

  • andreg: computes the logical-and between two single-bit inputs, registers and then returns the result.

  • gcd: calculates the greastest-common-divisor (GCD) between two 32-bit numbers, storing the result in a 32-bit register. This benchmark has feedback.

We selected these benchmarks to provide a variety of tests with respect to: (a) complexity of the design, (b) clock-triggered behaviour with and without feedback, and (c) a variety of applications. However, these benchmarks are described behaviourally; vlang can only compile a Verilog-netlist into LLVM-IR. To simulate this environment, we assume that is it possible to retrieve designs from Xilinx’s Kintex-7 FPGAs in the form of Verilog netlists. Using xst from Xilinx’s ISE design suite [23], and targeting a Kintex-7 device (xc7k70t-1fbg484), we use xst to synthesize a behavioural design from our OpenCores benchmark suite into the Kintex-7’s FPGA-fabric, thereby producing a device-specific netlist. The device-specific netlist can be represented as Verilog-netlist using Xilinx’s netgen.

IV-B Results: Functional Correctness and Simulation Comparison

Recall that vlang produces an LLVM-IR program which reproduces the functionality of a Verilog-netlist, and thereby, the original design. Simultaneously, vlang also produces a C-header file to permit other programs to reference and use this program. The vlang produced LLVM-IR can be linked into a C program using the LLVM framework [24]. Once linked, the corresponding LLVM-IR can either be executed just-in-time [25] or compiled to a binary. To evaluate vlang, we compile the test-bench and vlang-produced LLVM-IR to a binary using llc and gcc [26]. Each C test-bench tests a large subset of all possible test-cases, providing an avenue to validate correctness. The number of unique test-cases issued for each OpenCore benchmark is detailed in Table II. We also compare vlang’s simulation results and times against two open-source tools, Icarus Verilog [20] and Verilator [19] and a commercial simulation tool, Modelsim. All simulators were provided with the same test-cases. For all of the OpenCore benchmarks, Icarus Verilog, Modelsim and vlang produced the correct results. However, Verilator is unable to produce correct results for the addertree and gcd benchmarks. This is due to a software-bug in Verilator, where certain feedback paths in the Verilog netlist are not handled correctly.

TABLE II: Number of unique test-cases performed on the OpenCore benchmarks.
adder bcdadder divide mod3 popcount32 addertree andreg gcd
# Testcases 2359296 256 65536 256 1048576 625 8 262144
Icarus Verilog Passed? Y Y Y Y Y Y Y Y
Verilator Passed? Y Y Y Y Y N Y N
Modelsim Passed? Y Y Y Y Y Y Y Y
vlang Passed? Y Y Y Y Y Y Y Y
adderdividepopcount32addertreegcd10010^{0}10110^{1}10210^{2}10310^{3}10410^{4}Time (s)Icarus-VerilogVerilatorModelsimvlang
Figure 8: Comparison of Icarus Verilog [20], Verilator [19] and vlang’s simulation time (in Log Scale) for the OpenCore benchmarks. Results for bcdadder, mod3 and andlatch are not shown due to fast simulation time (on the order of 10310^{-3} seconds)

Additionally, we compare the execution time for each testbench with all three simulators. The results are outlined in Fig. 8. This figure does not include results for bcdadder, mod3 and andreg, as simulation times were on the order of 10310^{-3} seconds. For a majority of the benchmarks, vlang is able to achieve a faster simulation time, compared to Icarus Verilog, Verilator and Modelsim. This is expected, as vlang is a compiled simulator [27]; these styles of simulators are known to provide faster execution time compared to event-driven simulators (Icarus Verilog, Verilator and Modelsim are event driven simulators) [28]. Overall vlang is 37.3×\sim 37.3\times faster than Icarus Verilog, 3.5×\sim 3.5\times faster than Verilator and 12.3×\sim 12.3\times faster than Modelsim. Although the focus of this paper is not on simulation, we were able to present vlang’s ability to: (a) produce an executable which reproduces the functionality of a Verilog-netlist with an LLVM-IR program and (b) be used a simulation tool, which is competitive in terms of simulation time and is cycle accurate.

IV-C Results: Porting between PLDs

We evaluate vlang’s portability by supplying vlang-produced LLVM-IR to LegUp, an open-source high-level synthesis tool from the University of Toronto [29]. Using LegUp, we compile the vlang-produced LLVM-IR to a Verilog design file, and target two devices, a Xilinx Kintex-7 device (xc7k70t-1fbg484) and an Intel Stratix V device (5SGXEA7N2F45C2), to study if the functionality of these designs are portable. With the Xilinx device, we also measure performance and area differences: the original design files (i.e., not the Verilog-netlists) are compiled with Xilinx’s ISE v14.7 and targeted for the Kintex-7 device. For Xilinx devices, area is gauged in terms of slices and number of LUTs and performance is measure in terms of FmaxF_{\text{max}}. For the combinational circuits in our OpenCore Benchmark suite, we used the worst-case estimated logic delay as our measure of FmaxF_{\text{max}}. The post place-and-route (PnR) area-and-performance metrics for the original design files are outlined in Table III.

TABLE III: ISE’s Post PnR Kintex-7 Performance and Area metrics for OpenCore benchmarks.
Area Performance
Benchmark Slices LUTs Flip-Flops FmaxF_{\text{max}} (MHz)
adder 4 16 - 136.44
bcdadder 4 12 - 143.32
divide 477 1347 - 11.86
mod3 2 3 - 164.10
popcount32 19 53 - 103.61
addertree 16 64 64 671.14
andreg 2 4 1 709.723
gcd 41 135 93 429.73

To compare the HLS-generated circuits (from vlang’s LLVM-IR) to the original design files, we compiled the design’s Verilog-netlists with vlang. The vlang-produced IR was then supplied to LegUp. LegUp was set produce Xilinx-amenable Verilog, with a 1 ns clock-period. The HLS-generated circuit then underwent ISE’s compilation flow, again targeting the Kintex-7 device, with a 1 ns clock-period. The results are tabulated in Table IV. Using our C-test-benches along with vlang-produced IR, we confirmed the functionality of the Verilog-netlist from the Kintex-7 was successfully ported back to the same device using LegUp’s test-bench generation flow [29]. However, supplying vlang’s LLVM-IR to an HLS tool may affect the overall area of HLS-generated design as shown in Table IV. This is demonstrated for all benchmarks, where the number of Slices and LUTs increase. As a general trend, the increase in LUTs follows the complexity of the original design: for example, the gcd benchmark uses 135 Slice-LUTs in the original design and the vlang-mapped design uses 5952 LUTs, whereas the adder benchmark uses 16 LUTs in the original design and vlang-mapped design uses 370. Additionally, the divide benchmark is not route-able on the Kintex-7 device (the design uses 97% of the available slices on this particular device). The increase in area is primarily attributed to the HLS-tool’s ability to produce hardware designs from vlang’s LLVM-Functions. The LLVM-function replicates the behaviour of a low-level device-primitive; there is no guarantee that the HLS tool will remap these LLVM-functions into these same primitive. Instead, the HLS tool may proceed to use additional logic resources to implement the LLVM-function. Additionally, extra logic from: (a) the HLS-compiler will be introduced to the original design (e.g., introducing an FSM-controller to enable switching between functions), thereby adding additional area costs and (b) from duplicateds flip-flops on feedback paths. Another notable difference: the combinational circuits have transformed into sequential circuits. This is expected since HLS-tools will infer a clock-signal (which is propagated throughout the entire design). This allows HLS tools to insert register stages into the design, to improve of upon the performance (i.e., breaking-up the critical path, pipe-lining, etc.). This can be desirable, especially if the combinational circuit need not be combinational with the new device. In our case all combinational benchmarks had a significant improvement in FmaxF_{\text{max}}. If cycle-latency is a sensitive parameter for design portability, the current flow will not be suitable: the additional register stages may affect the circuit’s cycle-latency. LegUp (or any LLVM-based HLS tool) could be modified to reflect the original design’s clock usage (or lack thereof) in vlang’s IR, but this is left as future work. Introducing register stages into combinational logic also prevents Quartus from inferring the entire logic function into a LUT: several LUTS and register would need to be used to infer the behaviour of this function.

TABLE IV: ISE’s Post PnR Kintex-7 Performance and Area metrics for OpenCore benchmarks (files regenerated with LegUp HLS from vlang). X indicates the design was unroutable.
Area Performance
Benchmark Slices LUTs Flip-Flops FmaxF_{\text{max}} (MHz)
adder 214 370 389 709.72
bcdadder 178 318 280 709.72
divide 9841 37,219 36,627 X
mod3 90 200 151 709.72
popcount32 319 636 583 709.72
addertree 1334 2289 2175 594.88
andreg 11 29 18 709.72
gcd 3625 5952 8908 472.14

Lastly, we confirm the portability of vlang’s LLVM-IR by using LegUp HLS and targeting a Stratix V device. LegUp was set to target a 1 ns clock-period. The LegUp produced hardware design was then compiled with Quartus Prime 18.0 Standard Edition, where we set a target clock-period of 1 ns, and targeted the Stratix V device 5SGXEA7N2F45C2. We use the Stratix V ALM (adaptive logic module) count as our area metric. We characterize performance through FmaxF_{\text{max}}. Again, using our C-test-benches we used LegUp’s test-bench generation flow to verify the functionality of the Verilog-netlist from the Kintex-7 was successfully ported to a Stratix V device. Table V tabulates the post place-and-route performance and area metrics for the Intel Device.

TABLE V: Quartus’s Post PnR Stratix V Performance and Area metrics for the high-level synthesis of vlang-compiled OpenCore benchmarks.
Area Performance
Benchmark ALMs Registers ALUT FmaxF_{\text{max}} (MHz)
adder 57 141 74 508.65
bcdadder 52 130 71 543.48
divide 18153 37519 36349 307.88
mod3 23 53 29 714.29
popcount32 54 140 69 399.04
addertree 407 428 625 381.24
andreg 32 58 54 554.32
gcd 3079 6135 5258 317.06

We note on the performance and area characteristics on the HLS-generated circuits for both Xilinx and Intel devices. Although there is no direct area comparison between Intel and Xilinx devices, we compare the number of LUTs used in Xilinx device (from Table III) to the number of ALUTs used in the Intel device. As before, we observe an increase in area: this is primarily attributed to the HLS-tool’s inability to recognize the device-specific primitives as LLVM-functions. Lastly, the achieved FmaxF_{\text{max}} for the combinational benchmarks when using vlang with LegUp HLS improve upon the baseline. Again, this is due to the HLS tool’s ability to (a) insert a clock signal to the design, and (b) insert register stages to break-up the critical path. Referring to Table IV and Table V, there is a performance difference between the designs on the Intel device versus the Xilinx device. From our investigation, we discovered that Quartus had purposed many LUTs as route-throughs introducing additional logic delay, thereby affecting the FmaxF_{\text{max}}. This differs from Xilinx’s ISE, where the majority of designs utilized routing tracks versus LUTs as route-throughs, thereby providing a higher FmaxF_{\text{max}} overall. From our experiments, vlang’s IR was able to port designs originally intended for a Kintex-7 Xilinx FPGA to an alternative device technology, an Intel Stratix V device. We also demonstrated vlang’s LLVM-IR can be used as a software executable and can be integrated with other software-ecosystems.

V Related Work

We present a list of works related to: (a) Verilog-to-C compilation procedures, (b) a (c) high-level synthesis technologies

V-A Verilog-to-C/C++ Compilers

V-A1 Verilog to C/C++ Compilers for Hardware Simulation

Open-source tools v2c [30], icarus [20] and verilator [19] compile hardware description languages (i.e., Verilog and VHDL) to C/C++ for verification purposes. However, these tools do not present HLS-synthesizable C/C++ code.

V-A2 Mapping Verilog to C/C++ for High-Level Synthesis

Bomberi et al. explored raising the level of abstraction of hardware designs for: (1) generating and exploring designs with an HLS compiler [31], and (2) using IP cores as standalone system software [32]. This work was able to recover the IP block specification for system-level design, and enable the derivation of more optimized implementations through HLS.

Mahapatra and Schafer identify a similar issue: older designs intended for a target architecture may suffer when porting HDL [33]. Additionally, their may be a need to optimize and port RTL cores [33, 31]. The authors provide a tool, veriIntel2c, to compile a subset of Verilog to a synthesizable set of C [33]. The compiled code is then explored for pareto-optimal designs by applying different code modifications to the C program. The authors built on this work in [34], where they optimize their Verilog to C process.

In these works, only either (a) a small subset of Verilog can be retargeted to C/C++ programs [33] or (b) an event scheduler-kernel is required to simulate a circuit in software [31, 32]. Our work is able to produce a software executable that is scheduled at compile time, while handling all gate-level representations in HDL.

V-B LLVM-Based High-Level Synthesis Tools

High-level synthesis is a compilation flow which compiles a high-level language (e.g., C/C++) to a hardware description language (e.g., Verilog, VHDL) [35]. A number of algorithmic processes are required to map a high-level language to a hardware description language. The program will undergo a number of static and dynamic program analyses to either restructure the program to expose parallelism and make the program amenable for hardware. The restructured program will then undergo allocation, scheduling and binding. Generally, HLS tools take advantage of pre-exising compilation frameworks, such as LLVM or GCC [29, 36]. Our work produced LLVM-IR, to be suitable for usage with HLS compilers built within LLVM frameworks.

VI Conclusion

This paper proposes a compilation method which transforms Verilog netlists into LLVM-IR to allow portability of these designs across a variety of computer architectures (e.g., μ\muProcessors, FPGAs, etc) by harnessing the LLVM-framework [12]. Our method was implemented as an LLVM-frontend, which compiles the Verilog netlist to LLVM’s IR. Our frontend, vlang, is able to produce an LLVM-IR program which preserves the functionality of the Verilog-netlist and is cycle accurate. Due to this ability, vlang can be used a verification tool. Lastly, vlang’s LLVM-IR is HLS-friendly. A number of experiments were performed with vlang. First, vlang’s LLVM-IR was validated by using brute-force comparisons against the original Verilog design. This simultaneously demonstrated vlang’s ability to function as a Gate-level simulator and it’s ability to retain the exact functionality and cycle accuracy of a hardware design as a software executable. Lastly, we evaluated vlang’s LLVM-IR for it’s usage in HLS processes. We demonstrated designs originally on a Xilinx device were successfully migrated to (a) the original Xilinx device and (b) an Intel FPGA. We commented on the HLS-generated hardware design’s performance and area metrics against (a) the original design’s performance and area metrics on the original device and (b) against the original design compiled to the Intel device.

References

  • [1] L. Eeckhout, “Is moore’s law slowing down? what’s next?” IEEE Micro, no. 4, pp. 4–5, 2017.
  • [2] S. Ahmad, S. Subramanian, V. Boppana, S. Lakka, F.-H. Ho, T. Knopp, J. Noguera, G. Singh, and R. Wittig, “Xilinx first 7nm device: Versal ai core (vc1902),” in 2019 IEEE Hot Chips 31 Symposium (HCS).   IEEE, 2019, pp. 1–28.
  • [3] D. Koeplinger, M. Feldman, R. Prabhakar, Y. Zhang, S. Hadjis, R. Fiszel, T. Zhao, L. Nardi, A. Pedram, C. Kozyrakis et al., “Spatial: A language and compiler for application accelerators,” in ACM Sigplan Notices, vol. 53, no. 4.   ACM, 2018, pp. 296–311.
  • [4] S. A. Chin, N. Sakamoto, A. Rui, J. Zhao, J. H. Kim, Y. Hara-Azumi, and J. Anderson, “Cgra-me: A unified framework for cgra modelling and exploration,” in 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP).   IEEE, 2017, pp. 184–189.
  • [5] Z. Li, Y. Zhang, J. Wang, and J. Lai, “A survey of fpga design for ai era,” Journal of Semiconductors, vol. 40, pp. 000 000–000 000, 2019.
  • [6] N. V. Giamblanco and J. H. Anderson, “ASAP: Automatic Sizing and Partitioning for Dynamic Memory Heaps in High-Level Synthesis,” in 2019 International Conference on Field-Programmable Technology (ICFPT).   IEEE, 2019, pp. 275–278.
  • [7] N. V. Giamblanco, “Dynamic Memory Allocation Techniques for High-Level Synthesis,” Ph.D. dissertation, University of Toronto (Canada), 2020.
  • [8] F. Benz, A. Seffrin, and S. A. Huss, “Bil: A tool-chain for bitstream reverse-engineering,” in 22nd International Conference on Field Programmable Logic and Applications (FPL).   IEEE, 2012, pp. 735–738.
  • [9] S. E. Quadir, J. Chen, D. Forte, N. Asadizanjani, S. Shahbazmohamadi, L. Wang, J. Chandy, and M. Tehranipoor, “A Survey on Chip to System Reverse Engineering,” ACM journal on emerging technologies in computing systems (JETC), vol. 13, no. 1, p. 6, 2016.
  • [10] J.-B. Note and É. Rannaud, “From the Bitstream to the Netlist.” in FPGA, vol. 8, 2008, pp. 264–264.
  • [11] H. Yu, H. Lee, S. Lee, Y. Kim, and H.-M. Lee, “Recent Advances in FPGA Reverse Engineering,” Electronics, vol. 7, no. 10, p. 246, 2018.
  • [12] C. Lattner and V. Adve, “Llvm: A compilation framework for lifelong program analysis & transformation,” in Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization.   IEEE Computer Society, 2004, p. 75.
  • [13] A. Leung and L. George, “Static single assignment form for machine code,” in ACM SIGPLAN Notices, vol. 34, no. 5.   ACM, 1999, pp. 204–214.
  • [14] D. Thomas and P. Moorby, The Verilog® Hardware Description Language.   Springer Science & Business Media, 2008.
  • [15] A. B. Kahn, “Topological sorting of large networks,” Communications of the ACM, vol. 5, no. 11, pp. 558–562, 1962.
  • [16] R. Tarjan, “Depth-first search and linear graph algorithms,” SIAM journal on computing, vol. 1, no. 2, pp. 146–160, 1972.
  • [17] J. Cong and Z. Zhang, “An efficient and versatile scheduling algorithm based on sdc formulation,” in 2006 43rd ACM/IEEE Design Automation Conference.   IEEE, 2006, pp. 433–438.
  • [18] M. Berkelaar, K. Eikland, P. Notebaert et al., “lpsolve: Open source (mixed-integer) linear programming system,” Eindhoven U. of Technology, vol. 63, 2004.
  • [19] W. Snyder, D. Galbi, and P. Wasson, “Verilator-verilog hdl simulator,” 2017.
  • [20] S. Williams, “Icarus verilog,” On-line: http://iverilog. icarus. com, 2006.
  • [21] M. Graphics, “Modelsim,” 2007.
  • [22] OpenCores. (1999) Opencores. [Online]. Available: https://opencores.org/
  • [23] I. Xilinx, “Design suite,” 2015.
  • [24] C. Lattner, “Llvm and clang: Next generation compiler technology,” in The BSD conference, vol. 5, 2008.
  • [25] T. Æ. Mogensen, Basics of compiler design.   Torben Ægidius Mogensen, 2009.
  • [26] A. Griffith, GCC: the complete reference.   McGraw-Hill, Inc., 2002.
  • [27] V. Zivojnovic and H. Meyr, “Compiled hw/sw co-simulation,” in 33rd Design Automation Conference Proceedings, 1996.   IEEE, 1996, pp. 690–695.
  • [28] Z. Wang and P. M. Maurer, “Lecsim: a levelized event driven compiled logic simulation,” in Proceedings of the 27th ACM/IEEE Design Automation conference.   ACM, 1991, pp. 491–496.
  • [29] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, J. H. Anderson, S. Brown, and T. Czajkowski, “LegUp: High-Level Synthesis for FPGA-based Processor/Accelerator Systems,” in ACM FPGA, 2011, pp. 33–36.
  • [30] R. Mukherjee, M. Tautschnig, and D. Kroening, “v2c–A verilog to C Translator,” in International Conference on Tools and Algorithms for the Construction and Analysis of Systems.   Springer, 2016, pp. 580–586.
  • [31] N. Bombieri, H.-Y. Liu, F. Fummi, and L. Carloni, “A Method to Abstract RTL IP Blocks into C++ Code and Enable High-Level Synthesis,” in Proceedings of the 50th Annual Design Automation Conference.   ACM, 2013, p. 156.
  • [32] N. Bombieri, F. Fummi, and S. Vinco, “A Methodology to Recover RTL IP Functionality for Automatic Generation of SW Applications,” ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 20, no. 3, p. 36, 2015.
  • [33] A. Mahapatra and B. C. Schafer, “VeriIntel2C: Abstracting RTL to C to maximize High-Level Synthesis Design Space Exploration,” Integration, vol. 64, pp. 1–12, 2019.
  • [34] ——, “Optimizing RTL to C Abstraction Methodologies to Improve HLS Design Space Exploration,” in IEEE ISCAS, 2019, pp. 1–5.
  • [35] P. Coussy and A. Morawiec, High-level synthesis: from algorithm to digital circuit.   Springer Science & Business Media, 2008.
  • [36] C. Pilato and F. Ferrandi, “Bambu: A modular framework for the high level synthesis of memory-intensive applications,” in 2013 23rd International Conference on Field programmable Logic and Applications.   IEEE, 2013, pp. 1–4.