
Automatic Datapath Optimization using E-Graphs

Samuel Coward, Numerical Hardware Group, Intel Corporation. Email: [email protected]
George A. Constantinides, Electrical and Electronic Engineering, Imperial College London. Email: [email protected]
Theo Drane, Numerical Hardware Group, Intel Corporation. Email: [email protected]
Abstract

Manual optimization of Register Transfer Level (RTL) datapath designs is commonplace in industry but holds back development, as it can be very time-consuming. We exploit the fact that a complex transformation of one RTL design into another, equivalent one can be broken down into a sequence of smaller, localized transformations. By representing RTL as a graph and deploying modern graph-rewriting techniques, we automate circuit design space exploration, allowing us to discover functionally equivalent but optimized architectures. We demonstrate that modern rewriting frameworks can adequately capture a wide variety of complex optimizations performed by human designers on bitvector-manipulating code, including significant error-prone subtleties regarding the validity of transformations under complex interactions of bitwidths. The proposed automated optimization approach is able to reproduce the results of typical industrial manual optimization, reducing circuit area by up to 71%. Not only does our tool discover optimized RTL, it also correctly identifies that the optimal architecture implementing a given arithmetic expression can depend on the widths of the operands, thus producing a library of optimized designs rather than the single design point typically generated by manual optimization. In addition, we demonstrate that prior academic work on maximally exploiting carry-save representation and on multiple constant multiplication is both generalized and extended, falling out as special cases of this paper.

Index Terms:
hardware optimization, design automation, datapath design

I Introduction

In industry and academia, Register Transfer Level (RTL) development involves minimal design space exploration, due to the complexity of the design space, and is slowed by long debug timelines. RTL optimization of datapath designs is still often a manual task, as synthesis tools are unable to match the results of a skilled engineer [1].

A key observation is that manual RTL optimization is typically performed by applying a number of known ‘useful’ transformations to a design. These transformations, and their domains of validity, are accumulated through years of design experience. In combination, they may result in substantial changes to the underlying RTL. Apart from some simple transformations implemented automatically in modern ASIC design tools [2], the process of determining a sequence of transformations to apply to an RTL design is currently based on designer intuition [3], largely due to the non-convex nature of the design space: it is often necessary to apply an early transformation that results in a worse-quality circuit before applying a later one that leads to an overall improvement. It is this process we seek to automate.

Such automation facilitates the creation of bitwidth-dependent architectures, where different parameterisations may result in different architectures: the best design approach for narrow bitwidths may not be the best for wider bitwidths. A range of RTLs automatically generated from a single parameterisable input retains the ease-of-use benefits of parameterisable RTL without sacrificing quality.

Existing commercial synthesis tools are capable of merging consecutive additions in a circuit design to make best use of carry-save representations [1]. However, when arithmetic is interspersed with logic, the tools frequently miss optimization opportunities [4].

We aim to leverage existing commercial synthesis tools by transforming RTL to a form the existing tools can maximally optimize. We focus on combinational RTL, although the techniques described are equally applicable to pipelined designs via retiming. Given a design in the form of an RTL implementation R, we aim to find an RTL implementation R′ that minimises cost(R′) for some cost function, such that the two RTLs are functionally equivalent, R ≃ R′. We define the equivalence relation ≃ by R ≃ R′ if and only if, for all possible inputs, all outputs of R and R′ are equal.

We represent such RTL as a data-flow graph, where operators and operands are represented by nodes, with edges, labelled by bitwidth, connecting operands and operators. This graphical representation allows us to formulate the problem as a graph optimization problem, in which we manipulate the graph using equivalence-preserving transformations. This formulation allows us to take advantage of recent advances in e(quivalence)-graph and equality saturation [5] technology, discussed in Section II, alongside previous motivating work in automated RTL optimization and design. Application of e-graphs to the RTL optimization problem is presented in Section III. We demonstrate results in Section IV and validate our cost metric in Section V.

The paper contains the following novel contributions:

  • application of e-graphs and equality saturation to automate datapath RTL optimization,

  • a precisely defined set of rewrites that facilitate efficient design space exploration together with their domains of applicability in designs utilizing multiple bitwidths,

  • an automated method to optimize architectures as a function of bitwidth parameters,

  • quantification of a ‘noise floor’ in datapath logic synthesis using ‘fuzzing’ techniques from software testing.

II Background

II-A Datapath Optimization

A useful example of transformation-based datapath improvement comes from Verma, Brisk and Ienne [4], which automatically applies data-flow transformations to maximally exploit carry-save representation. Their primary objective is to cluster additions together in the data-flow graph, a useful target as full carry-propagate addition is generally expensive and can often be avoided. This can be done by deploying compressor trees: circuits that reduce three or more input words to two output words, a carry and a save. Using a carry-propagate adder to sum the carry and the save returns the sum of all the inputs. This can be beneficial as, for example, combining two consecutive carry-propagate adders into a compressor tree and one carry-propagate adder saves the cost of one carry-propagate adder at the expense of a compressor tree. We generalize this work using our methodology, which is able to replicate the results obtained in [4] as a special case.
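The compressor-tree idea can be sketched in a few lines: a 3:2 carry-save compressor reduces three addends to a save word and a carry word using only bitwise logic, so a single carry-propagate addition suffices at the end. A minimal Python illustration of the principle (our own sketch, not the paper's implementation):

```python
def compress_3_2(a, b, c):
    """Reduce three addends to (save, carry) with no carry propagation.

    Per bit position, the sum of three bits equals their XOR plus twice
    their majority; save collects the XORs and carry the shifted majorities.
    """
    save = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return save, carry

# One carry-propagate addition then finishes the sum.
s, c = compress_3_2(11, 22, 33)
assert s + c == 11 + 22 + 33
```

Chaining such compressors yields a compressor tree that sums arbitrarily many operands while paying for only one carry-propagate adder.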

Another well-studied transformation beyond the reach of standard commercial synthesis tools is the multiple constant multiplication (MCM) problem [6, 7]. The MCM problem asks: given a set of integer coefficients {a₁, …, aₙ} and a variable x, what is the optimal architecture to compute the set {a₁×x, …, aₙ×x}? Competing solutions use a fixed number representation of the constants [7], often the canonical signed digit (CSD) representation [8], and/or deploy an adder graph algorithm [6]. A transformation-based approach captures both of these methods.
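For intuition, the CSD representation rewrites a constant over digits {−1, 0, +1} such that no two adjacent digits are nonzero, minimising the number of add/subtract terms in a shift-and-add constant multiplier. A small illustrative routine (our own sketch, not the algorithm of [8]):

```python
def csd(n):
    """Return the canonical signed-digit form of n, LSB first, digits in {-1,0,1}."""
    digits = []
    while n:
        if n & 1:
            d = 2 - (n & 3)   # +1 if n ≡ 1 (mod 4), -1 if n ≡ 3 (mod 4)
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits

# 7·x via CSD: 7 = 8 − 1, one subtraction instead of two additions (4+2+1).
assert csd(7) == [-1, 0, 0, 1]
assert sum(d << i for i, d in enumerate(csd(45))) == 45
```

Each nonzero digit corresponds to one shifted add or subtract of x, so fewer nonzero digits means a cheaper multiplier.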

In addition to these special cases, there is a wide variety of transformations that can be captured through standard arithmetic rewrites, e.g. associativity, distributivity, etc. Often these rewrites interact with each other, in the sense that applying one type of transformation opens or closes the door to applying a different class of transformation.

II-B E-graphs and Equality Saturation

Equivalence graphs, commonly called e-graphs, provide a dense representation of equivalence classes (e-classes) of expressions [9]. Often found in theorem provers, this data structure enables a graph optimization technique called equality saturation [5, 10, 11]. The e-graph represents expressions, where the nodes, known as e-nodes, represent functions (including variables and constants, as 0-arity functions) and are partitioned into a set of e-classes. The intuition is that e-classes compactly represent sets of equivalent expressions, whose evaluation always leads to the same result. Edges represent function inputs and run from e-nodes to e-classes; see Figure 1, where dashed lines represent e-class boundaries, solid ellipses are e-nodes and arrows are edges. We define 𝒞 to be the set of e-classes, 𝒩 the set of e-nodes and E ⊆ 𝒩 × 𝒞 the set of edges. We also write 𝒩_c for the set of e-nodes in a given e-class c.

Rewrites define equivalences over expressions; for example, x + x → 2×x says that x + x is equivalent to 2×x. Such rewrites are applied constructively to the e-graph, meaning that the left-hand side of the rewrite remains in the data structure. Constructive rewrite application avoids the concern of which order to apply rewrites in. As rewrites are applied, the e-graph grows monotonically, representing more and more equivalent expressions, and hence naturally capturing the interaction between different rewrite rules.
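To make constructive application concrete, the following deliberately simplified sketch (our own illustration, omitting the hashconsing, union-find and congruence maintenance of a real e-graph) models e-classes as sets of e-nodes; applying a rewrite only ever adds an e-node to the matched class, so the e-graph grows monotonically:

```python
# E-graph sketch: e-class id -> set of e-nodes; an e-node is a pair
# (operator, child e-class ids). Hypothetical minimal structure.
egraph = {
    0: {("x", ())},        # e-class 0: the variable x
    1: {("2", ())},        # e-class 1: the constant 2
    2: {("*", (1, 0))},    # e-class 2: 2 * x
}

def apply_mul2_to_shift(eg):
    """Constructively apply 2*x -> x<<1 (x*2 follows by commutativity).

    The matched e-node stays in the e-graph; the rewritten form is merely
    added to the same e-class, so no information is lost.
    """
    one = max(eg) + 1                      # fresh e-class for the constant 1
    added = False
    for cid, nodes in list(eg.items()):
        for op, kids in list(nodes):
            if op == "*" and ("2", ()) in eg[kids[0]]:
                eg.setdefault(one, set()).add(("1", ()))
                nodes.add(("<<", (kids[1], one)))
                added = True
    return added

apply_mul2_to_shift(egraph)
# E-class 2 now holds both representations of the same value.
assert ("*", (1, 0)) in egraph[2] and ("<<", (0, 3)) in egraph[2]
```

Because the old e-node is kept alongside the new one, later rewrites can still match either form, which is how interactions between rules are captured.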

(a) Initial e-graph contains (2×x) >> 1
(b) Apply x×2 → x << 1
(c) Apply (x << s) >> s → x
Figure 1: E-graph rewriting for standard integer arithmetic. Dashed boxes represent e-classes of equivalent expressions. Green nodes represent newly added nodes. Red dashed boxes highlight which e-class has been modified.

Equality saturation provides us with a stopping condition. At the point where further rewrites add no additional information, we say that the e-graph has saturated. From an e-graph representing potentially infinitely many equivalent expressions we may choose the “best” expression [5].

egg is a recent Rust e-graph library, intended to be a general-purpose and reusable implementation [5]. It adds powerful performance optimizations over existing, usually bespoke, e-graph implementations, along with some useful additional features. It has been used to automatically improve the numerical stability of floating-point expressions [12], map programs onto hardware accelerators [13] and optimize linear algebra [14]. To build a functioning e-graph optimization tool, egg must be supplied with a language definition – that is, a set of operator names together with their arities – and a rewrite set – that is, a set of equivalences over the given language.

III Methodology

This section demonstrates how e-graphs can be applied to the RTL optimization problem. We use a natural graphical representation of RTL as data-flow graphs, allowing us to fit the optimization problem into the e-graph framework. In this framework, the e-graph contains classes of equivalent bitvector-manipulating expressions, and the rewrites transform such expressions into alternative equivalent expressions. The tool parses input Verilog using Yosys [15] and converts it into nested S-expressions in the style of Common Lisp [16].

term ::= (operator [term] [term] … [term])

The syntax is defined by the language described in Section III-A. These expressions are converted into e-graphs by egg. From the e-graph we extract a nested S-expression from which RTL is automatically generated, writing one operation per line. Figure 2 provides a flow diagram of the tool.
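For instance, the Verilog expression (a + b) << c corresponds, under the grammar above, to the nested S-expression (<< (+ a b) c). A sketch of such a serialiser (our own illustration; the tool's actual printer may differ):

```python
def to_sexpr(node):
    """Serialise a nested (operator, operands...) tuple to an S-expression."""
    if isinstance(node, str):
        return node                       # variable or constant leaf
    op, *args = node
    return "(" + " ".join([op] + [to_sexpr(a) for a in args]) + ")"

expr = ("<<", ("+", "a", "b"), "c")       # (a + b) << c
assert to_sexpr(expr) == "(<< (+ a b) c)"
```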

Input Verilog → E-graph → Rewrite (egg) → Extract (ILP) → Optimised Verilog
Figure 2: Flow diagram describing the operation of the tool.
TABLE I: Operators defined in our egg implementation of RTL optimization. We include the architecture chosen for theoretical cost assignment.
Operator | Symbol | Arity | Architecture
Left/Right Shift | << / >> | 2 | Mux Tree
Addition/Subtraction | + / − | 2 | Prefix Adder (PA) [17]
Negation | − | 1 | PA
Multiplication | × | 2 | Booth Radix-4 [18]
Multiplexer | · ? · : · | 3 | Mux gates
Not/Inversion | ~ | 1 | One-input gates
Concatenate | {·,·} | n | Wiring
Comparison | > / < | 2 | PA
---------------------------------------------
Sum | SUM | n | CSA and PA
Muxed Mult Array | MUXAR | 3 | Array Reduction and PA
Fused Multiply-Add | FMA | 3 | Booth Radix-4

III-A Language

We consider the problem from the abstraction level of a Verilog parser and operate on finite-length bitvectors. We target bitvector arithmetic and bitwise Boolean operations, as they form the basis of low-level datapath optimization. Including the bitwidth of these vectors is crucial to correctly model the circuit’s behaviour. Bitwidths are also essential to correctly evaluate the cost of a given operation: clearly, an 8-bit addition is less expensive than a 32-bit addition. The operations represented by nodes in the e-graph are described in Table I, and are defined for all inputs.

The first set in the table is a subset of the operators defined in Verilog [19]. These operators are fundamental to most arithmetic circuit designs, and manipulating designs using them can have significant effects on power, performance and area. Bitwidth information is included in the e-graph as edge labels between the nodes. In addition to these basic operators, we introduce a second set of operators in Table I (below the dashed line), which encode the merging and sharing capabilities of modern synthesis tools. In [4], the benefits of merging adjacent additions to maximise the use of carry-save representation are exploited. We capture the same effect in an automated manner through the introduction of a SUM node in our language definition, which combines an arbitrary number of additions into a single node. This node encodes the compressor tree and carry-propagate adder described in Section I. Figure 3 shows how consecutive additions can be rewritten as a SUM node. Other such merging operators are the fused multiply-add (FMA) and the less familiar muxed multiplication array (MUXAR), which blends two disjoint multiplication arrays into one. These are discussed further in Section III-B.
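The effect of SUM can be pictured as flattening a chain of binary additions into a single variadic operator, which then maps onto one compressor tree plus one carry-propagate adder. A sketch of the flattening on S-expression-style tuples (illustrative only; the real rewrite in Table II also tracks and constrains bitwidths):

```python
def merge_additions(node):
    """Rewrite (+ a (+ b (+ c d))) into (SUM a b c d), recursing bottom-up."""
    if isinstance(node, str):
        return node
    op, *args = node
    args = [merge_additions(a) for a in args]
    if op == "+":
        operands = []
        for a in args:
            # Absorb any operand that is itself an addition or a SUM.
            if isinstance(a, tuple) and a[0] in ("+", "SUM"):
                operands.extend(a[1:])
            else:
                operands.append(a)
        return ("SUM", *operands)
    return (op, *args)

assert merge_additions(("+", "a", ("+", "b", ("+", "c", "d")))) == \
    ("SUM", "a", "b", "c", "d")
```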

(a) Consecutive additions
(b) Merged additions encoded as a SUM
Figure 3: The edge labels contain the operand’s index and the operand’s bitwidth in square brackets.

III-B Rewrites

TABLE II: The set of bitwidth-dependent rewrites supplied to egg. Left-subscript notation, _p x, denotes a bitvector x of length p bits. The * operation stands for either of {+, ×}. Square brackets denote Verilog bit slicing, where a[x:y] takes bits x down to y of a. The rules are applied conditionally according to the final column, which gives a sufficient but not always necessary condition.

Bitvector Arithmetic Identities
• Commutativity: _r(_p a * _q b) → _r(_q b * _p a) | True
• Mult Associativity: _t(_u(_p a × _r b) × _s c) → _t(_p a × _q(_r b × _s c)) | (q ≥ t ∨ r+s ≤ q) ∧ (u ≥ t ∨ p+r ≤ u)
• Add Associativity: _t(_u(_p a + _r b) + _s c) → _t(_p a + _q(_r b + _s c)) | (q ≥ t ∨ max(r,s) < q) ∧ (u ≥ t ∨ max(p,r) < u)
• Distribute Mult over Add: _r(_p a × _q(_s b + _t c)) → _r(_u(_p a × _s b) + _v(_p a × _t c)) | min(q,u,v) ≥ r
• Sum Same: _q(_p a + _p a) → _q(_2 2 × _p a) | True
• Mult Sum Same: _r(_s(_p a × _q b) + _q b) → _r(_t(_p a + _1 1) × _q b) | t > p ∧ s ≥ p+q
• Add Zero: _p(_p a + _q b) → _p a | b ≡ 0 mod 2^p
• Sub to Neg: _r(_p a − _q b) → _r(_p a + _q(−_q b)) | True
• Mult by One: _p(_p a × _q b) → _p a | b ≡ 1 mod 2^p
• Mult by Two: _r(_p a × _2 2) → _r(_p a << _1 1) | True

Bitvector Logic Identities
• Merge Left Shift: _r(_u(_p a << _q b) << _s c) → _r(_p a << _t(_q b + _s c)) | t > max(q,s) ∧ u ≥ r
• Merge Right Shift: _r(_u(_p a >> _q b) >> _s c) → _r(_p a >> _t(_q b + _s c)) | t > max(q,s) ∧ u ≥ p
• Redundant Sel: _p(_1 b ? _p a : _p a) → _p a | True
• Neg Not: _r(−_p a) → _r(_p(~ _p a) + _1 1) | r ≤ p
• Not over Con: _r(~ _{q+s}{_q a, _s b}) → _r{_q(~ _q a), _s(~ _s b)} | q+s ≥ r

Constant Expansion
• Mult Constant: _r(_q c × _p x) → _r(_r(_q(_2 2 × _{q−1} c[q−1:1]) × _p x) + _p(_1 c[0] × _p x)) | c constant
• One to Two Mult: _p(_1 1 × _p x) → _p(_q(_2 2 × _p x) − _p x) | q > p

Arithmetic Logic Exchange
• Left Shift Add: _r(_s(_p a + _q b) << _t c) → _r(_u(_p a << _t c) + _u(_q b << _t c)) | (s ≥ r ∨ max(p,q) < s) ∧ u ≥ r
• Add Right Shift: _r(_p a + _q(_t b >> _u c)) → _r(_v(_s(_p a << _u c) + _t b) >> _u c) | q ≥ t ∧ s ≥ p + 2^u − 1 ∧ v > max(s,t)
• Left Shift Mult: _r(_t(_p a × _q b) << _u c) → _r(_v(_p a << _u c) × _q b) | t ≥ r ∧ v ≥ r
• Sel Add: _r(_1 e ? _r(_p a + _q b) : _r(_p c + _q d)) → _r(_p(_1 e ? _p a : _p c) + _q(_1 e ? _q b : _q d)) | True
• Sel Add Zero: _p(_1 e ? _p(_p a + _q b) : _p c) → _p(_p(_1 e ? _p a : _p c) + _q(_1 e ? _q b : _q 0)) | True
• Move Sel Zero: _r(_p(_1 b ? _p 0 : _p a) × _q c) → _r(_p a × _q(_1 b ? _q 0 : _q c)) | True
• Concat to Add: _r{_p a, _q b} → _r(_s(_p a << _u q) + _q b) | s ≥ p + 2^u − 1 ∧ u ≥ ⌈log₂(q+1)⌉

Merging Ops
• Merge Additions: _q₁(_p₁ a₁ + _q₂(_p₂ a₂ + _q₃(_p₃ a₃ + … + _pₙ aₙ)…)) → _q₁(SUM(_p₁ a₁, _p₂ a₂, …, _pₙ aₙ)) | qᵢ > max(pᵢ, qᵢ₊₁) for i = 1, …, n−2, and qₙ₋₁ > max(pₙ₋₁, pₙ)
• Merge Mult Array: _t(_s(_q a × _r b) + _s(_q c × _r(~ _r b))) → _t(MUXAR(_r b, _q a, _q c)) | s ≥ q+r ∧ t > s
• FMA Merge: _t(_s(_p a × _q b) + _r c) → _t(FMA(_p a, _q b, _r c)) | s ≥ p+q ∧ t > max(s,r)

We have identified and formalized rules that capture many of the transformations that Intel’s Numerical Hardware Group regularly deploys manually. The set of rewrites described in Table II defines equivalences over expressions. Each rewrite incorporates bitwidth information, where bitwidths are associated with operands. We introduce a left-subscript notation, _p x, to denote a bitvector x of length p bits. The left subscript takes the lowest operator precedence.

egg supports conditional rewrites. We use this facility to decide whether a rule may be applied based on its bitwidth information. The sufficient (but not always necessary) conditions under which each rule applies are given in Table II. As an example, consider the “Distribute Mult over Add” rewrite, whose sufficient condition is a function of the bitwidth parameters, min(q, u, v) ≥ r. When this condition is satisfied, the rewrite can be safely applied as the equivalence holds. For some rewrites, a parameter appears on the right-hand side but not on the left, for example q in the associativity rewrites. The condition then constrains but does not fully determine the values such parameters can take, in which case we select the minimum feasible bitwidth. The conditions have been validated by creating a parameterizable SMT query for each rewrite using the theory of fixed-size bitvectors [20], checked for all combinations of bitwidths in {1, …, 10}. A formal equivalence check between the generated and original RTL additionally ensures that we need not trust the correctness of our rewrites or of the egg library.
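The spirit of those checks can be conveyed in pure Python by exhaustive enumeration over tiny bitwidths, modelling each width-annotated operation as arithmetic modulo a power of two. Here we test the “Distribute Mult over Add” rewrite (our own sketch, not the SMT encoding):

```python
def trunc(x, w):
    """Model a width-annotated _w(...) operation: keep the low w bits."""
    return x & ((1 << w) - 1)

def check_distribute(p, s, t, q, u, v, r):
    """Exhaustively compare both sides of the rewrite for all operand values."""
    for a in range(1 << p):
        for b in range(1 << s):
            for c in range(1 << t):
                lhs = trunc(a * trunc(b + c, q), r)
                rhs = trunc(trunc(a * b, u) + trunc(a * c, v), r)
                if lhs != rhs:
                    return False
    return True

# Satisfying the sufficient condition min(q, u, v) >= r: the rewrite is safe.
assert check_distribute(p=2, s=2, t=2, q=3, u=3, v=3, r=3)
# Violating the condition (q < r) admits a counterexample.
assert not check_distribute(p=2, s=2, t=2, q=2, u=4, v=4, r=4)
```

This brute-force check is only feasible for very small widths, which is precisely why the paper resorts to SMT for widths up to 10.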

Arithmetic Logic Exchange rewrites are inspired by a set taken from [4], which focus on the interplay of logic and addition. To extend the prior work we have added rewrites that are able to move other arithmetic operators and we have generalized the rules so they are able to be applied in a multiple bitwidth setting.

The “Merging Ops” rewrites allow us to correctly evaluate the cost of specific sequences of operations that logic synthesis tools are able to optimize effectively. We have seen that the SUM node merges consecutive additions, highlighting that the area usage of consecutive additions is not additive. This captures what logic synthesis tools do to an expression such as a + b + c, converting it to a compressor tree and deploying a single carry-propagate adder. Further merging optimization capabilities of Synopsys Design Compiler are documented in [1]. The “Merge Mult Array” rewrite does not make use of carry-save format but identifies that two disjoint multiplication arrays can be merged. Letting a[i] represent bit i of a and u = ⌈log₂(r)⌉, MUXAR is shorthand for the right-hand side of the rewrite, where the SUM represents array reduction:

_t(_s(_q a × _r b) + _s(_q c × _r(~ _r b))) →
_t(SUM( _s(_q(_1 b[0] ? _q a : _q c) << _u 0),
        _s(_q(_1 b[1] ? _q a : _q c) << _u 1),
        …,
        _s(_q(_1 b[r−1] ? _q a : _q c) << _u(r−1)) )).

Logic synthesis is capable of exploiting this optimization if it identifies an expression of the form (a × b) + (c × ~b), but we must indicate the merging opportunity to our tool.
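The expansion can be sanity-checked numerically: bit i of b selects between a and c as the i-th shifted row, so the mux-selected array sums to a·b + c·~b. A quick exhaustive check in Python over small widths (our own illustration, ignoring the width annotations):

```python
def muxar_value(b, a, c, r):
    """Sum of shifted mux-selected rows: bit i of b picks a, else c."""
    return sum(((a if (b >> i) & 1 else c) << i) for i in range(r))

# Check against the merged form a*b + c*(~b), with ~b the r-bit complement.
r = 4
mask = (1 << r) - 1
for b in range(1 << r):
    for a in range(8):
        for c in range(8):
            assert muxar_value(b, a, c, r) == a * b + c * (~b & mask)
```

The identity holds because the rows where b's bit is set contribute a·b, while the remaining rows contribute c·(2^r − 1 − b) = c·~b.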

We added the “Constant Expansion” rules to re-express multiplication of a variable by a constant. These rules allow us to recreate the results from the literature on the MCM problem described in Section II-A. In addition to the explicit rewrite rules, constant folding is implemented as an e-class analysis in egg [5].

III-C Extraction

Having applied rewrites to the e-graph until saturation or a timeout limit is reached, we must now extract the optimal design from potentially infinitely many choices. The extraction process must select a set of e-classes to implement and, for each selected e-class, which e-node within that class to implement, subject to the constraint that the child e-classes of each selected e-node must also be selected. Choosing an optimal design requires a metric that discriminates between competing implementations. Industrial circuit design is typically judged on area, latency and power consumption. In this contribution we use only an area metric; our definition of optimal is therefore the smallest circuit implementation.

We have developed a theoretical area-estimate cost function in terms of the number of two-input gates required for each operator. It assigns a cost per operator that is a function of the input and output bitwidths. Table I lists the architectures on which we base the costs of the more complex operators. We introduce different costs for the case where at least one of the operands is a constant. The cost of a complete design is then the sum of the operator costs, and the objective is to minimise this cost.

The major benefit of using a theoretical cost metric as opposed to using metrics derived from logic synthesis or HLS tools is the computation speed which, when combined with equality saturation, enables effective design space exploration. Cost metric validation is addressed in Section V.

By explicitly introducing rules for operator merging and sharing, we are able to define a cost for each node based only on its type and argument bitwidths, capturing downstream synthesis optimizations for SUM, MUXAR and FMA.

When extracting an RTL implementation from an e-graph, one is immediately faced with the question of common subexpressions. Common subexpressions are naturally identified as part of the e-graph construction process [5] and ideally we would utilize this information in the resulting hardware, for example extracting (x+1)×(x+1) as let y = x+1 in y×y. However, this makes extraction an inherently global problem over the e-classes, in the sense that the optimal e-node implementation for a given e-class may depend on the selected e-node implementations of the other e-classes in the graph. Previous work has solved this by posing optimal extraction as an integer linear programming (ILP) problem [13, 14], and we follow the same approach.

Using the notation defined in Section II-B, for each e-node n ∈ 𝒩 we associate a cost, cost(n), given by the theoretical cost function, and a binary variable xₙ ∈ {0, 1}, which indicates whether n is implemented in the final extracted RTL. Our objective is to minimize the total implementation cost, as described by (1). Constraint (2) then guarantees that whenever a node n is selected, a node from each of its child e-classes is selected too. Lastly we introduce 𝒮, the set of e-classes representing the desired expressions to implement in RTL; (3) then ensures that all these outputs are produced by the final RTL.

minimize: Σ_{n∈𝒩} cost(n)·xₙ subject to:    (1)
∀(n, c) ∈ E.  xₙ ≤ Σ_{n′∈𝒩_c} xₙ′    (2)
∀c ∈ 𝒮.  Σ_{n∈𝒩_c} xₙ = 1.    (3)

E-graphs may contain cycles; e.g., the rewrite x + 0 → x induces a cycle. By introducing a topological sorting variable t_c for each e-class c, and the associated constraints (4), where N is the number of e-classes and 𝒞(n) is the e-class containing node n, we ensure that the output expression is acyclic.

∀(n, k) ∈ E.  t_{𝒞(n)} − N·xₙ − t_k ≥ 1 − N    (4)

If we select a node n ∈ 𝒩_c with child e-class k, i.e. xₙ = 1, the constraint simplifies to t_c ≥ t_k + 1, giving a topologically sorted result; in the case xₙ = 0, the constraint is vacuously satisfied. We use the open-source GLPK solver to solve this ILP [21].
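On a tiny e-graph the same optimisation can be solved by brute force, which makes the constraints easy to see. The toy below (our own example, with made-up node names and costs) enumerates one e-node per e-class, charges each e-class reachable from the root exactly once, mirroring constraints (1)–(3), and returns the cheapest selection; the toy graph is acyclic, so the topological constraint (4) is trivially satisfied:

```python
from itertools import product

# Toy e-graph: e-class id -> list of (name, cost, child e-class ids).
enodes = {
    0: [("x", 0, ())],
    1: [("1", 0, ())],
    2: [("add_x_1", 8, (0, 1)),       # x + 1
        ("shift_form", 12, (0, 1))],  # a pricier equivalent in the same class
    3: [("mul", 20, (2, 2))],         # (x+1) * (x+1): e-class 2 is shared
}
root = 3

def best_selection(enodes, root):
    classes = list(enodes)
    best, best_cost = None, float("inf")
    # Choose one candidate e-node per e-class, then evaluate the choice.
    for pick in product(*(range(len(enodes[c])) for c in classes)):
        chosen = dict(zip(classes, pick))
        # Classes actually implemented: root plus transitive children
        # of the chosen e-nodes (constraint (2)).
        selected, frontier = set(), [root]
        while frontier:
            c = frontier.pop()
            if c in selected:
                continue
            selected.add(c)
            frontier.extend(enodes[c][chosen[c]][2])
        cost = sum(enodes[c][chosen[c]][1] for c in selected)
        if cost < best_cost:
            best = {c: enodes[c][chosen[c]][0] for c in selected}
            best_cost = cost
    return best, best_cost

sel, cost = best_selection(enodes, root)
assert cost == 28          # mul (20) + add_x_1 (8); e-class 2 is paid once
```

Note that the shared e-class for x + 1 is paid for only once, which is precisely why extraction with common subexpressions is a global problem: the ILP scales this idea to e-graphs far too large to enumerate.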

We deploy the ILP extraction method in cases where sharing common subexpressions offers some improvement, otherwise for improved performance we can resort to the standard egg extraction method [5].

IV Results

The RTL test cases are automatically optimized using our egg-based implementation, and optimized RTL is extracted. Original and optimized RTLs are synthesized using Synopsys Design Compiler for a TSMC 7nm cell library. We proved the formal equivalence of the original and optimized RTLs using Synopsys HECTOR technology, a formal equivalence checking tool that runs in minutes on these test cases.

The original and optimized RTLs were synthesised at the minimum delay of the slowest of the two, and then similarly at the minimum area of the larger of the two designs. These represent comparisons towards the endpoints of a standard area-delay curve for a design. Table III summarises the results. In Figure 4, we present the area-delay profiles for the competing architectures of a Smoothing Kernel, which highlights the points of comparison used in Table III.

We consider two sets of examples. First we demonstrate how the tool exploits complex datapath blocks to optimize designs. Then we consider bitwidth-dependent architectures, where the optimal design may vary as a function of bitwidth.

TABLE III: Logic synthesis results using Synopsys Design Compiler. We compare the original and automatically optimized designs at the minimum delay that both could meet and at the smallest area that both could achieve.
Benchmark Min Achievable Delay Min Achievable Area
Delay (ns) Orig Area (μm²) Opt Area (μm²) Area (μm²) Orig Delay (ns) Opt Delay (ns)
Smoothing Kernel 0.289 550 158 (-71%) 150 1.25 0.29 (-77%)
FIR Filter Kernel 0.611 1710 679 (-60%) 570 1.38 2.48 (+80%)
ADPCM Decoder [22] 0.102 103 102 (-1%) 32 0.72 0.45 (-38%)
Shifted FMA 0.181 310 210 (-32%) 81 0.97 0.57 (-41%)
MCM Solution 0.132 90 104 (+15%) 40 0.72 0.56 (-22%)
TABLE IV: Benchmark complexity and resulting e-graph size after rewriting. \dagger indicates that we removed the associativity-of-addition rewrite to limit e-graph growth. The ILP column indicates whether ILP or standard egg extraction was used.
Benchmark Ops E-graph Nodes ILP Runtime (sec)
Smoothing Kernel 17 27,000 No 140
FIR Filter Kernel \dagger 23 550 Yes 100
ADPCM Decoder 9 6,700 No 19
Shifted FMA 3 22 No 0.04
MCM Solution 3 4,900 Yes 31

IV-A Datapath Optimizations using Complex Blocks

The first benchmark is a kernel from a media module, which is an industrially relevant example supplied by Intel. It was manually optimized by a single engineer over the course of one week. Our tool automatically matches the results obtained via manual optimization, making use of the “Merge Mult Array” rewrite. Figure 4 shows the area-delay curves for the original and optimized architectures. The optimized design performs strictly better, reducing the minimum achievable delay by 13% with a 28% area reduction.

Refer to caption
Figure 4: Area-delay profile of the Smoothing Kernel. The horizontal/vertical grey lines represent the minimum area/delay comparison points.

Next we have a set of benchmarks taken directly from [4], which are intended to show how this prior work is generalized by our optimization tool. The first example is a finite-impulse response (FIR) filter with 8 taps. The second example, a computational kernel of the ADPCM decoder [22], is an approximation to a 16×4 multiplier. These two examples are optimized by deploying the “Arithmetic Logic Exchange” class of rewrites described in Section III-B to cluster additions together. The FIR filter example is particularly interesting as it also introduces a multiple constant multiplication (MCM) problem [7]. Since the ILP formulation of extraction correctly accounts for the cost of common subexpressions, the minimal area solution it extracts maximally shares common subexpressions to generate $\{2\times x, 3\times x, \dots, 7\times x\}$. For example, $2\times x$ and $3\times x$ are constructed as follows

$x_{2} = x << 1,\quad x_{3} = x_{2} + x.$ (5)
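A full shared shift-and-add network for the multiples $\{2\times x,\dots,7\times x\}$ might look as follows; this is one plausible sharing in the spirit of (5), not necessarily the exact network the ILP extracts:

```python
# Shift-and-add construction of 2x..7x with shared subexpressions,
# extending equation (5). Each value is built from one shift or one
# addition, reusing earlier results.

def constant_multiples(x):
    x2 = x << 1   # 2*x, one shift
    x3 = x2 + x   # 3*x, reuses 2*x
    x4 = x2 << 1  # 4*x, reuses 2*x
    x5 = x4 + x   # 5*x, reuses 4*x
    x6 = x3 << 1  # 6*x, reuses 3*x
    x7 = x6 + x   # 7*x, reuses 6*x
    return [x2, x3, x4, x5, x6, x7]

# Sanity check against plain multiplication:
assert constant_multiples(11) == [c * 11 for c in range(2, 8)]
```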

Looking more generally at the MCM problem, using the “Constant Expansion” rewrites we are able to match the operator count of a solution from [6]. An example from this paper generates adder graphs to compute $\{3\times x, 7\times x, 21\times x\}$. The optimal design from our tool maximally shares common subexpressions to generate the data-flow graph shown in Figure 5, which uses 3 addition/subtraction operations to generate the results. Since logic synthesis likely implements constant multiplication using a CSD representation [8], sharing common subexpressions generates a 15% larger circuit than the basic architecture for small delay targets, as the design does not match the performance of CSD.

Refer to caption
Figure 5: Data-flow graph representing an optimal adder tree to compute the set $3\times x$, $7\times x$ and $21\times x$. The blue nodes represent additions which generate the result in the node label.
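One adder chain achieving this set with three additions and maximal sharing is sketched below; it is a plausible instance of such an adder graph, not necessarily the exact graph of Figure 5:

```python
# Compute {3x, 7x, 21x} with exactly three additions, each result
# feeding the next via a shift (shared subexpression chain).

def mcm_3_7_21(x):
    x3 = (x << 1) + x     # 3*x  = 2x + x   (addition 1)
    x7 = (x3 << 1) + x    # 7*x  = 6x + x   (addition 2, reuses 3x)
    x21 = (x7 << 1) + x7  # 21*x = 14x + 7x (addition 3, reuses 7x)
    return x3, x7, x21

assert mcm_3_7_21(5) == (15, 35, 105)
```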

Finally we consider a shifted FMA, where the tool discovers the opportunity to merge a multiplication and addition that were originally separated by a left shift, $(a\times b)<<S + c$. Using the “Left Shift Mult” and “FMA Merge” rewrites, this can be implemented as an FMA block. The approach proposed by Verma, Brisk and Ienne misses this opportunity since it will not move the shift over the multiplication [4].
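On unbounded integers the rewrite is simply the identity $(a\times b)\!<<\!S = (a\!<<\!S)\times b$, which exposes a single multiply-add; a minimal sketch (the RTL version must also adjust bitwidths, which is omitted here):

```python
# "Left Shift Mult" rewrite: move the left shift onto one multiplicand,
# turning ((a*b) << s) + c into a fused multiply-add (a << s)*b + c.

def shifted_fma_original(a, b, s, c):
    return ((a * b) << s) + c

def shifted_fma_rewritten(a, b, s, c):
    return (a << s) * b + c  # now a single multiply-add

# The two forms agree on integers:
for a, b, s, c in [(3, 5, 2, 7), (-4, 9, 3, 1), (0, 13, 4, -6)]:
    assert shifted_fma_original(a, b, s, c) == shifted_fma_rewritten(a, b, s, c)
```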

IV-B Bitwidth Dependent Optimal Architecture

The last set of results considers parameterizable RTL; in general, RTL engineers do not generate alternative architectures for different bitwidths, and as a result may be sacrificing performance. To demonstrate this we again consider the FIR filter, but the 4 tap variant, whose original architecture, shown in Figure 6, we refer to as Architecture 0. We optimized Architecture 0 using our tool for increasing bitwidths and observed that the selected architecture varied between Architectures 0, 1 and 2 (Figure 6). The difference between Architectures 1 and 2 is that the addition involving $Z4$ in Architecture 1 has been pushed back over the right shift, incurring the cost of an extra left shift but saving a full carry-propagate adder.
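The rewrite distinguishing the architectures, pushing an addition back over a right shift, can be sketched on integers as follows; the identity holds exactly because $b<<q$ has $q$ zero low bits, so the addition cannot generate a carry out of the truncated region:

```python
# Push an addition back over a right shift:
#   (a >> q) + b  ==  (a + (b << q)) >> q
# trading an extra left shift for one fewer carry-propagate adder.

def before(a, b, q):
    return (a >> q) + b

def after(a, b, q):
    return (a + (b << q)) >> q

# Exhaustive check over a small range:
for a in range(64):
    for b in range(16):
        for q in (1, 2, 3):
            assert before(a, b, q) == after(a, b, q)
```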

The architectural choices are determined by the theoretical two-input gate cost metric discussed in Section III-C. We synthesized each of the three architectures for bitwidths $4, 8, \dots, 64$, updating the delay target each time to the minimum delay that all three architectures could meet at that bitwidth. The theoretically optimal architecture for each bitwidth can be seen in Figure 6. For 56% of the bitwidths, the architecture selected by the theoretical cost metric gave the lowest area result from logic synthesis. The theoretical metric cannot reliably pick the minimum area design according to logic synthesis, since logic synthesis incorporates delay considerations into its results and those results are noisy. We demonstrate this ‘noise floor’ in Section V.

Refer to caption
Refer to caption
Refer to caption
(a) Architecture 0 {4}
(b) Architecture 1 {8, …, 28, 36, …, 48}
(c) Architecture 2 {32, 52, …, 64}
Figure 6: Simplified FIR filter data-flow graphs representing optimal architectures for different choices of the input bitwidth parameter $p$ and shift bitwidth parameter $q$. The sets in curly braces are the bitwidths for which that architecture is optimal. In these graphs $Zi = Xi\times Ai$, and $2S$ and $3S$ are computed according to (5).

IV-C Performance

For these test cases, only the “Shifted FMA” e-graph saturated, as we limited the exploration to 10 iterations of e-graph rewriting [5]. For test cases using ILP extraction we limited the solver to a 100 second time budget, which means that the FIR Filter solution is classified as feasible but possibly sub-optimal. There is a tradeoff between growing the e-graph and solving the ILP efficiently. Only the MCM solution included the “Constant Expansion” rewrite class, since these rewrites lead to exponential growth of the e-graph.

V Cost Metric Validation

This section evaluates the theoretical cost metric in terms of its ability to steer optimal architecture choices. To do so, we compare the theoretical cost to logic synthesis area costs. The results in Table III demonstrate that, for every test case, the theoretically chosen architecture improved the performance, the area, or both.

When determining the accuracy of a cost estimate, it is necessary to consider the inherent variability of the logic synthesis process. Small non-functional tweaks, e.g. changing a variable name in RTL code, can have an impact on the synthesis results. This forms a ‘noise floor’ against which any theoretical cost model must be validated. To evaluate this noise floor, we used a technique known as fuzzing [23], which involves automatically generating random mutations of a program. We fuzz the RTL allowing two types of semantics-preserving mutations: variable renaming and swapping the order of always/assign blocks [19] in the code, modifications which one would not expect to have a meaningful impact on synthesis results.
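The variable-renaming mutation can be sketched as a word-boundary-aware substitution over the RTL text; the snippet and signal names below are illustrative, not taken from the benchmarks:

```python
import re

def rename_signal(rtl, old, new):
    """Rename one RTL signal, matching whole identifiers only so that
    e.g. renaming 'acc' does not corrupt a signal named 'accum'."""
    return re.sub(rf"\b{re.escape(old)}\b", new, rtl)

rtl = "wire [7:0] acc;\nassign acc = a + b;\n"
mutated = rename_signal(rtl, "acc", "acc_r")
assert mutated == "wire [7:0] acc_r;\nassign acc_r = a + b;\n"
```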

We provide results for the Smoothing Kernel and 4 tap FIR Filter, synthesizing 30 fuzzed designs in each case at relevant delay targets. Variability of the results is shown in Figure 7.

Refer to caption
Figure 7: A violin plot depicting the logic synthesis area results for 30 fuzzed designs of the Smoothing Kernel and the FIR Filter at a 0.5ns delay target. For each violin, the area results are normalized by the mean.
Refer to caption
Figure 8: Histogram plot of logic synthesis area results for 30 fuzzed designs for each of the three FIR Filter architectures for 12 bits (Fig 6).

Figure 8 highlights how this noise can affect the architectural choices made in Section IV-B. For 12 bit inputs the synthesis results for fuzzed Architectures 0 and 1 overlap, with Architecture 1 generating lower area results on average whilst Architecture 0 obtains the single minimum area result. This noise is not captured by the theoretical cost metric, as these fuzzed designs are theoretically identical. At other input bitwidths, however, there are cases where one architecture is clearly optimal.

VI Conclusion

This paper has demonstrated the application of e-graphs and equality saturation to the RTL datapath optimization problem. Applying a precisely defined set of bitwidth-dependent, equivalence-preserving transformations, in the form of rewrites, we efficiently explore the design space and extract optimized RTL using egg. We also quantify the noise floor in logic synthesis results to understand the limits of a theoretical cost metric.

The results show that this automated rewriting technique can match the results of a skilled hardware engineer within a short timescale. The tool was able to achieve up to 71% area improvement and up to 77% delay improvement. We also demonstrate that automatic RTL optimization can generate different architectures for different bitwidth designs, since the tradeoff points are bitwidth dependent.

Future work will address the limitations of an area-only cost metric by incorporating delay and power allowing us to generate Pareto optimal solution curves [24]. We will expand the domain to tackle floating point operations and incorporate greater support for automated design verification. We will also address the scalability limits of applying equality saturation by incorporating intelligent design space search procedures.

Acknowledgment

The authors would like to thank Yann Herklotz for his assistance in setting up the fuzzing experiments.

References

  • [1] Synopsys, “Coding Guidelines for Datapath Synthesis,” Synopsys, Mountain View, Tech. Rep., 12 2019.
  • [2] ——, “Design Compiler User Guide S-2021.06-SP2,” Synopsys, Mountain View, Tech. Rep., 6 2021.
  • [3] A. K. Verma, P. Brisk, and P. Ienne, “Challenges in automatic optimization of arithmetic circuits,” in Proceedings - IEEE Symposium on Computer Arithmetic, 2009, pp. 213–218.
  • [4] ——, “Data-flow transformations to maximize the use of carry-save representation in arithmetic circuits,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 10, pp. 1761–1774, 2008.
  • [5] M. Willsey, C. Nandi, Y. R. Wang, O. Flatt, Z. Tatlock, and P. Panchekha, “Egg: Fast and extensible equality saturation,” Proceedings of the ACM on Programming Languages, vol. 5, no. POPL, 2021.
  • [6] O. Gustafsson, “A difference based adder graph heuristic for multiple constant multiplication problems,” in IEEE International Symposium on Circuits and Systems, 2007, pp. 1097–1100.
  • [7] R. I. Hartley, “Subexpression Sharing in Filters Using Canonic Signed Digit Multipliers,” IEEE Transactions on Circuits and Systems, vol. 11, 1996.
  • [8] M. D. Ercegovac and T. Lang, Digital arithmetic.   Elsevier, 2004.
  • [9] C. G. Nelson, “Techniques for program verification,” Ph.D. dissertation, Stanford University, 1980.
  • [10] R. Joshi, G. Nelson, and K. Randall, “Denali: A goal-directed superoptimizer,” in Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2002.
  • [11] R. Tate, M. Stepp, Z. Tatlock, and S. Lerner, “Equality saturation: A new approach to optimization,” in ACM SIGPLAN Notices, vol. 44, no. 1, 2009.
  • [12] P. Panchekha, A. Sanchez-Stern, J. R. Wilcox, and Z. Tatlock, “Automatically improving accuracy for floating point expressions,” ACM SIGPLAN Notices, vol. 50, no. 6, pp. 1–11, 2015.
  • [13] G. H. Smith, A. Liu, S. Lyubomirsky, S. Davidson, J. McMahan, M. Taylor, L. Ceze, and Z. Tatlock, “Pure tensor program rewriting via access patterns (representation pearl),” in Proceedings of the 5th ACM SIGPLAN International Symposium on Machine Programming, 2021.
  • [14] Y. R. Wang, S. Hutchison, J. Leang, B. Howe, and D. Suciu, “SPORES: Sum-product optimization via relational equality saturation for large scale linear algebra,” Proceedings of the VLDB Endowment, vol. 13, no. 11, 2020.
  • [15] C. Wolf and J. Glaser, “Yosys-A Free Verilog Synthesis Suite,” in Proceedings of Austrochip, 2013.
  • [16] G. Steele, Common LISP: the language.   Elsevier, 1990.
  • [17] A. Beaumont-Smith and C.-C. Lim, “Parallel prefix adder design,” in Proceedings - IEEE Symposium on Computer Arithmetic, 2001, pp. 218–225.
  • [18] I. Koren, Computer arithmetic algorithms.   AK Peters/CRC Press, 2018.
  • [19] D. Thomas and P. Moorby, The Verilog® hardware description language.   Springer Science & Business Media, 2008.
  • [20] C. Barrett, P. Fontaine, and C. Tinelli, “The Satisfiability Modulo Theories Library (SMT-LIB),” www.SMT-LIB.org, 2016.
  • [21] A. Makhorin, “GLPK (GNU linear programming kit),” 2008. [Online]. Available: http://www.gnu.org/s/glpk/glpk.html
  • [22] C. Lee, M. Potkonjak, and W. H. Mangione-Smith, “MediaBench: A tool for evaluating and synthesizing multimedia and communications systems,” in Proceedings of the Annual International Symposium on Microarchitecture, 1997.
  • [23] B. P. Miller, L. Fredriksen, and B. So, “An empirical study of the reliability of UNIX utilities,” Communications of the ACM, vol. 33, no. 12, pp. 32–44, 1990.
  • [24] E. Ustun, I. San, J. Yin, C. Yu, and Z. Zhang, “IMpress: Large Integer Multiplication Expression Rewriting for FPGA HLS,” in 2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2022, pp. 1–10.