
NEAT: A Framework for Automated Exploration of Floating Point Approximations

Saeid Barati Computer Science Department
University of Chicago
Chicago, USA
[email protected]
   Lee Ehudin Computer Science Department
University of Chicago
Chicago, USA
[email protected]
   Henry Hoffmann Computer Science Department
University of Chicago
Chicago, USA
[email protected]
Abstract

Much recent research is devoted to exploring tradeoffs between computational accuracy and energy efficiency at different levels of the system stack. Approximation at the floating point unit (FPU) saves energy by simply reducing the number of computed floating point bits in return for accuracy loss. The main challenge, however, is finding the most energy efficient approximation for various applications with minimal effort. To address this issue, we propose NEAT: a Pin tool that helps users automatically explore the accuracy-energy tradeoff space induced by various floating point implementations. NEAT helps programmers explore the effects of simultaneously using multiple floating point implementations to achieve the lowest energy consumption for an accuracy constraint, or vice versa. NEAT accepts one or more user-defined floating point implementations and programmable placement rules for where/when to apply them. NEAT then automatically replaces floating point operations with different implementations based on the user-specified rules at runtime and explores the resulting tradeoff space to find the best use of approximate floating point implementations for precision tuning throughout the program. We evaluate NEAT by enforcing combinations of 24/53 different floating point implementations with three sets of placement rules on a wide range of benchmarks. We find that heuristic precision tuning at the function level provides up to 22% and 48% energy savings at 1% and 10% accuracy loss, respectively, compared to applying a single implementation to the whole application. NEAT is also applicable to neural networks, where it finds the optimal precision level for each layer given an accuracy target for the model.

I Introduction

Early work in approximate computing demonstrates tremendous energy and execution time reductions by making a variety of approximate arithmetic and logic functional units available [12, 11, 22, 24]. Reduced-precision methods advocate using less numerical precision for data storage and computation to achieve higher performance and energy efficiency [73, 77, 16, 64].

The proliferation of both different approximate functional units and reduced-precision software methods creates tremendous opportunity, but it also creates a new problem. While designing for reduced precision has long been common in specialized application domains, for example digital signal processing [10], the proliferation of these techniques means that general programmers now have to consider the implications of such designs. Specifically, it is up to programmers to decide which level of approximation to use at different points in their application and to navigate the immense tradeoff space created by allowing multiple approximations within a single program.

Consider 10 different levels of approximation available to be enforced at the function level for a moderate-sized program with 10 functions. Programmers attempting to design for energy efficiency and accuracy in this scenario face two separate, but related, challenges. First is the challenge of correctly (in terms of achieved accuracy) implementing 10 different versions of each candidate function (one version for each available level of precision). Second is the challenge of searching the resulting tradeoff space with 10^10 points to explore. The tradeoff space could be even larger if we exploit data type approximation, where each variable in the program could acquire a different level of approximation [30, 65, 9, 71]. Constructing a large number of alternative implementations and then navigating such an immense tradeoff space is likely beyond the abilities of even domain experts. Thus, we need an automated precision tuning framework that can both generate alternative implementations and then explore the induced tradeoff space.

In this paper, we propose one mechanism that helps address both of the above challenges: programmable placement rules for approximate floating point computation. We argue that asking programmers to implement N different versions of key functions is unnecessarily burdensome, and generating all possible approximations of each function would make the search space prohibitively large. The programmable rules are a compromise, where programmers encode their knowledge of the application into concise rules about which functions can be approximated, by how much, and when it might be permissible to do so. These rules can then be used by an automated tool to generate a candidate set of approximate function implementations that is much smaller than the set of all possible approximations.

To address the challenges of creating and selecting from a large number of approximation alternatives, we propose NEAT—Navigating Energy Approximation tradeoffs—a tool that helps users explore different levels of approximation within a program without detailed instrumentation and without laboriously creating many alternative implementations of functions. NEAT accepts a user program, a set of approximate floating point implementations, and a set of programmable placement rules for when to use a specific implementation within a program. NEAT then runs the program and dynamically replaces floating point operations (FLOPs) with the approximate version as specified by the rules. NEAT reports the program’s output with the estimation of floating point unit (FPU) and memory access energy alongside an itemized report of FLOPs in the program. Thus, NEAT helps developers explore the configuration space of floating point implementations (FPI) without requiring them to have deep numerical expertise.

We implement NEAT for x86 using the Pin binary instrumentation system [51]. We demonstrate NEAT's value by comparing the approximations produced by different placement rule sets. In the first, we write a simple rule that picks a single floating point implementation for the entire program; i.e., the rule is a simple one-to-one replacement (whole-program rule) common to many proposed approximation methods, e.g., those that use a single, reduced precision for machine learning [21] or scientific simulation [22]. In the second, we allow each of the 10 functions that execute the most FLOPs to use a different approximation (per-function rules); the target of the approximate floating point implementation is either the currently-in-progress function (CIP) or the most recent function on the call stack (FCS). For all rules, NEAT uses a genetic algorithm to guide exploration of the enormous resulting search space.

We evaluate NEAT on a selected set of benchmarks from the Parsec 3.0 [7] and Rodinia 3.1 [13] suites, which cover a variety of real-world applications. For the FPIs, we apply mantissa bitwidth tuning. On average, the per-function placement retrieves more energy-optimal floating point implementations than the whole-program approach, providing 22.1% and 3.2% energy savings in the FPU and memory, respectively, with an allowance of 1% accuracy loss. To ensure the robustness of NEAT, we include multiple inputs for each application, divided into training and test sets, to evaluate whether NEAT produces statistically sound results. We also extend the evaluation to a digit recognition application implemented with a neural network and the MNIST dataset; for any accuracy target, NEAT provides the required precision level for each layer. NEAT is also released as open source, so others can evaluate or use it freely.

In summary, this paper proposes:

  • The NEAT framework that helps users explore the tradeoff space of reduced precision floating point combinations while not requiring hand tuning or code instrumentation.

  • A case study that compares whole-program vs. per-function approximation placements for a variety of benchmarks. In addition, NEAT offers a separate placement rule based on the caller function, which is useful for frequently invoked functions.

  • A demonstration of robustness on unseen inputs with high correlation coefficients: NEAT finds statistically meaningful approximations that are not sensitive to input data and are likely to remain efficient on an unseen set of inputs.

  • A demonstration of NEAT’s applicability to Convolutional Neural Networks (CNN), providing precision computation modes per layer resulting in energy savings with minimal loss of model accuracy.

II Background & Motivation

II-A Prior Work

While there has been a substantial amount of effort aimed at finding new forms of approximation [69, 63, 56, 36, 70, 37, 50, 1, 17, 80, 25, 61, 65], there is a lack of solutions that help users both develop their own approximation methods and then specify the approximation level to enforce within a single application.

Hardware approximation computes inexactly in return for reduced energy, area, or time [49, 14]. Approximate multipliers [81, 44, 46] and adders [20, 84] are widely advocated for energy-efficient computing. State-of-the-art neural network training platforms offer 16-bit floating point hardware that provides up to 4x performance gains compared to traditional 32-bit systems [29]. Recent proposals promote putting many different approximate units or customized accelerators on a single core [31]. Thus, it is beneficial to include multiple FPUs on a chip for higher energy efficiency [22], but this requires tedious hand-tuning. The challenge, therefore, is figuring out which FPU to use in each part of the program. This is the challenge that motivates NEAT.

Language support for approximation allows the specification of variants for key functionality and formal analysis of their effects [3, 9, 55]. Approximation Knobs provide a way to lend performance and energy gains to existing power knobs [43]. Quora is a quality programmable processor where the notion of quality is codified in the instruction set of the processor [73]. Another example of user-defined approximation is Green, a system that allows programmers to supply approximate versions of loops and while-blocks that terminate early [4]. In contrast to these programming language techniques, our proposal lets users easily examine and change the accuracy of FLOPs through our programmable substitution rules, giving them more control over the floating point computations in a program.

Performing precision tuning at fine grain is possible through software libraries. EnerJ proposes declaring approximate data via type qualifiers [65]. MPFR adds support for rounding modes, exceptions, and special values, as defined in the IEEE 754 standard, to its arbitrary-precision representation [30]. FlexFloat reduces floating point emulation time by providing a C/C++ interface that supports multiple FP formats. These techniques require source code instrumentation (changing float and double variable definitions to custom types) or target more precise computation (for instance, floating point numbers with more than 128 bits). NEAT focuses on energy efficiency by reducing precision while requiring only the program binary.

Convolutional Neural Networks (CNNs) include a significant amount of floating point computation in the training and inference stages. A large body of research has focused on CNN precision scaling [64, 72, 62, 32, 15, 33]. For example, WAGE quantizes weights to 2 bits while activations, errors, and gradients use 8 bits [79]. FLEXpoint presents a new format with a 16-bit mantissa to train CNNs at full precision [45]. Another piece of research demonstrates successful training with 8-16 bit floating point numbers at full accuracy [77]. Other, tangentially related approaches create networks with early exit points [75, 76], but those are not related to the problem of changing numerical precision. Prior approaches either change the training architecture or apply a coarse-grain precision level to all layers. In contrast, NEAT produces precision tuning analyses at different granularities by offering WP and CIP solutions, without modifying the application's internal structure or exhaustively exploring precisions.

While prior work mainly develops mechanisms that enable approximation to provide energy and runtime savings in different domains, it does not help users make more informed decisions about approximation. These techniques are mostly inflexible about how much, where, and when to approximate, and only provide discrete approximation knobs, which leads to more conservative design choices. NEAT does not propose new mechanisms, but helps users answer the questions above.

II-B Motivation

Current inexact functional units, in addition to approximate software libraries, create an opportunity to exploit quality-energy tradeoffs. While an FPU accounts for 2-5% of chip area, floating point instructions consume significantly more energy than other classes of instructions such as integer, memory, and control [54, 5]. Figure 1 illustrates the energy per instruction (EPI) results for different classes of instructions on a 64-bit 32nm processor. With random operands, a 64-bit floating point add consumes 400 pJ, and a division operation can go as high as 680 pJ. For the 32-bit versions, the energy consumption is 350 and 420 pJ, respectively.

Figure 1: Energy Per Instruction for different classes of instructions.

As expected given the operation types, executing floating point instructions emerges as a major contributor to total energy consumption. Recent empirical studies have shown that up to 50% of the energy consumed in a core and memory is related to floating point instructions [54]. Thus, exploiting reduced bitwidth at the instruction level (bit truncation) to generate Floating Point Implementations (FPIs) can facilitate higher energy efficiency. Another useful insight from Figure 1 is the relationship between computation and memory accesses. For example, three add operations consume the same amount of energy as a ldx instruction. Hence, from an energy efficiency point of view, reducing memory traffic can be as effective as optimizing floating point arithmetic operations [54].

A body of literature has focused on providing tool support that allows users to define several approximations for different components of an application [66, 17, 60, 25, 37, 63, 56, 36]. Petabricks provides language extensions that expose tradeoffs between time and accuracy to the compiler [3]. The compiler then runs dynamic autotuning to generate optimized elements that achieve the target accuracy. However, the autotuner must be set up on a per-application basis by the user. OpenTuner provides a fully-customizable configuration representation and ensembles of search techniques to find an optimal solution [2]. Both autotuning techniques are meant to help programmers, but Petabricks requires a separate language and both require users to implement all alternatives before the search can be conducted. NEAT also helps users deal with approximation, but instead of requiring users to implement all possible alternatives, they simply describe programmable rules that are then used to automatically generate the alternatives.

Hence, there is a need for a generic framework that provides multiple precision levels, accommodates custom user-defined floating point implementations, and does not require code refactoring. NEAT provides such a solution: it generates insightful information for precision tuning at the function level for floating point programs.

III System Design

In this section, we describe our solution, which generates insightful information about floating point precision tuning for applications. This tool, named Navigating Energy and Accuracy tradeoffs (referred to as NEAT), allows users to collect energy and performance data from applications using custom implementations of floating-point arithmetic.

The main challenge of precision tuning is constructing the right configuration of floating point precisions for the application. This configuration space can range from extremely large to fairly small, from using a different floating point implementation for each dynamic floating point instruction, to using a different implementation for different function calls, to picking a single floating point implementation for the entire application. NEAT provides such flexibility in the granularity of enforcing floating point approximations by introducing programmable placement rules and then automatically searching the accuracy and energy tradeoff space to find the optimal frontier.

Figure 2 illustrates the NEAT system from the user perspective. Users specify: (1) the application that they want to understand (this could be just a binary and requires no special changes), (2) whether NEAT should consider double or single precision (or both), (3) a set of alternative implementations for floating-point arithmetic, and (4) the programmable placement rules that describe when, where, and how in the program to replace the standard floating point operations with one of the alternative implementations. NEAT then runs the program as a Pin tool, intercepting all floating point operations of the specified type and replacing them according to the rules. NEAT performs multiple runs of the application and collects statistics on floating point usage, accuracy, and estimated energy. NEAT also offers a profiling mode in which the user collects a precision analysis, such as the quantity and frequency of FLOPs in the application, before applying any FPIs. Ultimately, NEAT can repeatedly test different assignments of floating point operations to find the frontier of optimal configurations, i.e., assignments of floating point implementations to different regions of the code. This section describes NEAT's inputs, internals, and outputs.

Figure 2: NEAT Design

III-A NEAT Inputs

NEAT's user inputs include: a user application to instrument, a precision level as the optimization target, the desired FP arithmetic implementations, and a set of FPI-to-function mappings (programmable placement rules).

NEAT receives the binary of the program and instruments the floating point instructions. Unlike other precision tuning tools, NEAT does not require the source code of the program. NEAT then expects the optimization target, which can be either single or double precision. There are two reasons for including an optimization target. First, most programs hold the same precision level across the code base for their data structures and functions. Second, if we considered both float and double FLOPs for optimization, the configuration space of FPI combinations would explode.

Next, users specify multiple FPIs for the individual arithmetic instructions (addition, subtraction, multiplication, and division) and their operands. Finally, NEAT expects a mapping between the candidate code sections and the FPIs used to calculate each FLOP in the program. By default, NEAT enforces the FPIs at the function level, meaning all FLOPs executed within a specific function use the same customized FPI. Any function that has at least one FLOP can be considered a candidate for approximation.

III-B NEAT Internal Structure

The NEAT dynamic instrumentation tool was written in C++ using the Intel Pin instrumentation system [51]. NEAT performs run-time instrumentation to facilitate the analysis and replacement of floating-point arithmetic operations during the execution of compiled C and C++ binaries.

III-B1 Intel Pin Tool

The Pin instrumentation system was chosen as the backbone for this tool because of its clean API and efficient implementation. The Pin API makes it possible to write instrumentation routines that observe and alter the architectural state of a process. Pin uses a JIT compiler to generate instrumented code, which is cached so that it can be re-executed without repeating the instrumentation work.

III-B2 Floating Point Operations

For the purposes of this tool, we identify floating-point arithmetic operations as the Streaming SIMD Extensions (SSE) instructions for scalar arithmetic. These instructions are included in a SIMD instruction set extension to the x86 architecture and operate on 32-bit or 64-bit floating point numbers. More specifically, the instructions we use for our definition of floating-point operation are ADDSS, SUBSS, MULSS, DIVSS, ADDSD, SUBSD, MULSD, and DIVSD.
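As a concrete but hedged sketch of how such instructions can be recognized, the minimal Pin tool below counts scalar SSE FLOPs by opcode. This is our own illustration rather than NEAT's actual source (the routine and counter names are ours), though it uses only standard Pin API calls.

    #include "pin.H"
    #include <iostream>

    static UINT64 flopCount = 0;

    // Analysis routine: executed before every matched scalar SSE instruction.
    VOID CountFlop() { flopCount++; }

    // Instrumentation routine: runs once per static instruction when it is JIT-compiled.
    VOID Instruction(INS ins, VOID *v) {
        switch (INS_Opcode(ins)) {
            case XED_ICLASS_ADDSS: case XED_ICLASS_SUBSS:
            case XED_ICLASS_MULSS: case XED_ICLASS_DIVSS:
            case XED_ICLASS_ADDSD: case XED_ICLASS_SUBSD:
            case XED_ICLASS_MULSD: case XED_ICLASS_DIVSD:
                INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)CountFlop, IARG_END);
                break;
            default:
                break;
        }
    }

    VOID Fini(INT32 code, VOID *v) { std::cerr << "scalar SSE FLOPs: " << flopCount << std::endl; }

    int main(int argc, char *argv[]) {
        PIN_Init(argc, argv);
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_AddFiniFunction(Fini, 0);
        PIN_StartProgram();  // never returns
        return 0;
    }

A replacement-based tool such as NEAT additionally needs to read the XMM operand registers, compute the result with the selected FPI, and write it back in place of the original instruction.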

III-B3 Floating Point Arithmetic Implementation

Custom hardware units or accelerators have been considered for enriching the quality versus energy tradeoff spaces. Approximate adders [74, 84, 20] and multipliers [46, 44, 81] have been designed as a solution for lower power consumption and high performance. In the presence of inexact hardware units, NEAT provides information on how to efficiently redirect the arithmetic instructions to these units.

Floating point formats with fewer bits present an appealing opportunity to reduce energy consumption, since they allow both simpler hardware units and a reduction in the memory bandwidth required to transfer data between memory and registers. An FPI can be as simple as bit truncation of the FP representation, direct approximation of the operands or results of arithmetic operations, or redirection of instructions to approximate hardware units or software libraries.
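For example, a minimal single-precision bit-truncation FPI (a sketch assuming IEEE-754 binary32, not NEAT's exact code) can zero out the low-order stored mantissa bits of each operand and of the result:

    #include <cstdint>
    #include <cstring>

    // Keep only the top 'bits' of the 23 explicitly stored mantissa bits of a
    // binary32 value and zero the rest. (The paper's 24-bit count for single
    // precision includes the implicit leading 1, which is never stored.)
    static float TruncateMantissa(float x, int bits) {
        if (bits >= 23) return x;                  // nothing to truncate
        uint32_t raw;
        std::memcpy(&raw, &x, sizeof raw);         // reinterpret the bits safely
        raw &= ~((1u << (23 - bits)) - 1);         // clear the low (23 - bits) bits
        std::memcpy(&x, &raw, sizeof raw);
        return x;
    }

    // An approximate add built from the truncation above.
    static float ApproxAdd(float a, float b, int bits) {
        return TruncateMantissa(TruncateMantissa(a, bits) + TruncateMantissa(b, bits), bits);
    }

The analogous double-precision version masks the low bits of the 52 stored mantissa bits of a binary64 value.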

III-B4 Execution of Floating Point Instructions

Defining an FPI is fairly straightforward. The main challenge in enforcing FPIs dynamically is specifying the exact mapping between FPIs and FLOPs. NEAT allows users to define placement rules that determine which FPI is used to calculate each FLOP in a program. Every time a FLOP is about to be calculated in the user application, NEAT examines the mappings, captures information about the current state of the application, and uses them to determine which FPI will be applied to calculate the result of the FLOP.

TABLE I: Built-in Placement Rules in NEAT.
Placement Rule | Description | Tradeoff Space Size
WP | one FPI for the whole program | 24-53
CIP | one FPI for the currently in progress function | 24^10 - 53^10
FCS | one FPI for the most recent function on the call stack | 24^10 - 53^10

NEAT comes packaged with three predefined sets of FPI placement rules that cover many use cases and show off its versatility. Table I lists the default placement rules and the corresponding tradeoff space sizes. Sets of rules are specified as C++ routines that accept the program state as input and return a single FPI as output.

The first set applies the same FPI for every FLOP in the whole-program (WP) regardless of the current function and the program state.

For finer granularity, the user can register callbacks through NEAT that can be executed whenever a function is entered or exited in the instrumented application. This allows more complex information to be collected about the program state, such as the call stack of the application. The second set of placement rules allows the user to specify a map of function names to FPIs and employs each FPI for the FLOPs in the corresponding currently in progress (CIP) function. Similarly, the third set of placement rules uses callbacks registered with NEAT to keep track of the function call stack (FCS) of the program. Instead of inspecting the current function, NEAT first checks the most recent function on the call stack. If no functions in the call stack match the names of those in the user-supplied map, a default implementation is used.
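To make the lookup concrete, a per-function selector in the spirit of the CIP and FCS rules can be sketched as follows; the FpImplementation name comes from Section IV, while the state structure and helper names here are our own assumptions.

    #include <map>
    #include <string>
    #include <vector>

    class FpImplementation;  // user-supplied FPIs (Section IV)

    // Hypothetical program state captured by NEAT's function entry/exit callbacks.
    struct ProgramState {
        std::string currentFunction;         // currently-in-progress function
        std::vector<std::string> callStack;  // most recent caller at the back
    };

    // CIP-style rule: use the FPI mapped to the currently-in-progress function.
    FpImplementation* SelectCip(const ProgramState& s,
                                const std::map<std::string, FpImplementation*>& rules,
                                FpImplementation* fallback) {
        auto it = rules.find(s.currentFunction);
        return it != rules.end() ? it->second : fallback;
    }

    // FCS-style rule: walk the call stack from the most recent frame and use the
    // first function that appears in the user-supplied map.
    FpImplementation* SelectFcs(const ProgramState& s,
                                const std::map<std::string, FpImplementation*>& rules,
                                FpImplementation* fallback) {
        for (auto f = s.callStack.rbegin(); f != s.callStack.rend(); ++f) {
            auto it = rules.find(*f);
            if (it != rules.end()) return it->second;
        }
        return fallback;
    }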

To highlight the difference between CIP and FCS, we analyzed the structure of 7 functions in the Radar benchmark, shown in Figure 3. Radar is an embedded real-time signal processing application used to find moving targets on the ground [47, 35]. It includes both a low-pass filter (LPF) and pulse compression (PC). Both of these components use a Fast Fourier Transform (FFT) as part of their computation.

Figure 3: FCS placement considers FFT function call stack before selecting the approximate FPI.

With the CIP option, NEAT enforces the same FPI every time the FFT function is called. With the FCS option, NEAT distinguishes between the two occurrences of FFT based on who made the function call. Therefore, NEAT uses one FPI for the FFT in the low-pass filter (LPF) stage and a second FPI for the FFT in the pulse compression (PC) stage. Empirically, we have found that the FCS and CIP results do not differ for most benchmarks because the callers of the FLOP-intensive functions are the same. Radar is an example where multiple functions make numerous calls to the same FLOP-intensive function that is accuracy sensitive.

III-C Outputs

There are five outputs from this tool: the output from the user application, a trace of the operands and result of every FLOP executed by the program, the estimated FPU energy of FLOPs in the execution of the program, the estimated energy of off-chip memory accesses of the program, and the number of FLOPs executed per function in the program.

The trace of the FLOPs executed by the instrumented application is written to a file while the application is running. If FPIs are supplied to NEAT by the user, the result of each operation will be printed after the operation is calculated with the chosen FPI. The operands and result of each operation are printed as hexadecimal numbers so that there is no confusion in rounding the floating-point values.

NEAT reports the total energy consumed in the FPU by using the energy per instruction (EPI) of different classes of floating-point operations. We extracted the energy model of fadd, fmul, and fdiv for single and double precision operations from related work [54].

To this end, NEAT counts the number of bits manipulated in the operands and results of every FLOP in the instrumented program. Modifying the bit width of the exponent or sign of a floating-point number changes the accuracy so significantly that the quality of output becomes unacceptable; hence, NEAT only focuses on mantissa bits. NEAT counts the number of zeroes in the binary representation of the floating-point number, starting with the least significant bit, and subtracts it from the available mantissa bits of the floating point type (24/53 bits in single/double precision, respectively) to calculate the number of manipulated bits. NEAT uses the EPI models and the number of manipulated bits per FLOP to estimate the total floating-point energy consumed in the FPU.
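This bit-counting step can be reconstructed roughly as follows for single precision (our sketch of the description above, not NEAT's verbatim code):

    #include <cstdint>
    #include <cstring>

    // Manipulated mantissa bits of a binary32 value: the 24 available bits
    // (23 stored + 1 implicit) minus the run of trailing zeroes in the stored mantissa.
    static int ManipulatedBits32(float x) {
        uint32_t raw;
        std::memcpy(&raw, &x, sizeof raw);
        uint32_t mantissa = raw & 0x7FFFFFu;   // 23 explicitly stored mantissa bits
        int trailingZeros = 0;
        if (mantissa == 0) {
            trailingZeros = 23;                // every stored bit is zero
        } else {
            while ((mantissa & 1u) == 0) { mantissa >>= 1; ++trailingZeros; }
        }
        return 24 - trailingZeros;
    }

The double-precision version is identical except that it inspects the 52 stored bits and subtracts from 53.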

NEAT also records the total number of bits used in FLOPs during the execution of the program, which is output to a file after the application terminates. Unlike the FPU energy estimation, this metric can be used as a platform-independent way to evaluate the approximate amount of energy used by FLOPs when instrumenting a program.

Currently, memory accounts for more than 25% of the energy spent in a large scale system. While, on average, each single precision FLOP takes 400 pJ to execute, a byte read from memory consumes 1.5 nJ [8]. Accordingly, NEAT counts the total number of bits transmitted to/from memory and then estimates the total memory access energy of the instrumented program [53]. This allows NEAT to yield a better energy estimate of the program on a real system.
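As a rough worked example using the figures above, the memory estimate is simply the byte count times the per-byte cost: a program phase that moves 10^8 bytes to or from off-chip memory is charged about 10^8 x 1.5 nJ = 0.15 J, whereas 10^8 single-precision FLOPs at 400 pJ each cost only about 0.04 J, which is why trimming memory traffic can matter as much as trimming arithmetic.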

NEAT generates detailed statistics about the floating point instructions in the program. Users might use NEAT to profile the application before performing precision tuning, first to decide whether NEAT is useful for their application, and second to choose which types of FPIs to use, which functions to target, and how to map them.

In general, NEAT is a tool used at program design time. NEAT allows users to evaluate many points on the accuracy/energy tradeoff curve without having to implement all possible alternatives. After profiling with NEAT, users can then select a point and implement it with confidence that it will provide the desired behavior.

Future work could explore additional machine learning techniques to configure the floating point usage differently for different functions in the program [19, 58, 59, 39]. Another promising line of work is using a runtime system to dynamically tune floating point usage to maintain either energy or accuracy constraints in a changing workload [41, 57, 40, 27, 52, 34, 26, 28, 78, 38, 6], or possibly implementing this control scheme in hardware [67, 83, 68].

IV NEAT Interface and Runtime

We now explain how the user manages floating point precision scaling with the NEAT framework described in the previous section. We specify the information that NEAT expects to receive from users and then discuss the steps to execute NEAT's runtime engine.

The NEAT procedure is as follows:

1. Profile the Program: The user runs the application. NEAT records the single and double precision instructions and the functions associated with them, and generates a detailed report in CSV format.

2. Assign FP Optimization Target: Since applications usually use the same precision level across the source code, NEAT tunes either single or double precision instructions at a time. At this point, the user defines the directive for NEAT to target 32 or 64 bit FLOPs.

3. Develop FPIs: Users may define multiple FPIs to be explored by NEAT, and NEAT supports FPIs developed in a number of different ways. An FPI can be created by truncating mantissa bits of the FLOP representation or by injecting direct approximation into the operands or results of floating point arithmetic operations. For example, approximating the inverse function [82] or the sin function with a neural network [23] is considered an FPI, too. An FPI can be applied to one or more floating point arithmetic instructions. For instance, one benchmark might include numerous accumulations but few divisions; the user could therefore define an FPI that enforces 8 precision bits for the add/sub instructions and 24 precision bits for the multiply instructions. The user develops an FPI by creating an instance of the FpImplementation virtual class, and may customize its PerformOperation subroutine to modify the operands or results of a floating point instruction directly (see the sketch after this list).

4. Register FPI Placement Rules and Functions: NEAT expects to receive a mapping between FPIs and when to enforce them. For the WP approach, the user only needs to instantiate the Register_FP_selector class with the desired FPI as the argument. For the per-function rules, NEAT by default considers the top 10 FLOP-intensive functions; the user might also pre-profile the program to detect and select any number of functions. The user then provides a mapping between functions and FPIs by defining a pair<functionName, FPI*> map data structure, and combines the map with one of the pre-packaged placement rules (CIP or FCS). This mapping is also referred to as a configuration. Finally, the user creates an instance of the Register_FP_selector class and passes the map and placement strategy as the input arguments; at runtime, the user passes the registered instance name via the fp_selector_name command line flag to NEAT. This interface is simple, but provides a quite flexible approach to replacing standard floating point operations with approximate versions. For example, the user can provide several maps and have their instantiation of the selector class look at the current program context to select the desired map, which allows NEAT to explore many different options for a single function within a program. Users can, for instance, specify that the map should depend on the function call stack so that different FP implementations are used for the same function based on where it was called from (the sketch after this list also illustrates this registration step).

5. Activate Exploration Scripts: If the CIP or FCS scheme is selected, the tradeoff space of FPI-to-function mappings (configurations) becomes too large to explore exhaustively. Hence, NEAT uses the NSGA-II genetic exploration technique to search for energy efficient configurations [18]. If the user wants to tune the exploration of the configuration space further, NEAT provides an interface through command line flags to manually modify the tuning parameters of NSGA-II, such as population size, number of generations, or convergence threshold.

6. Analyze the Output: NEAT reports detailed energy and performance data per configuration. Moreover, a Python script is provided to generate scatter plots of the tradeoff space with the lower convex hulls.
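As referenced in steps 3 and 4, the sketch below shows one way an FPI and its registration might look. The FpImplementation, PerformOperation, and Register_FP_selector names come from the text above, but their exact signatures are not spelled out here, so the argument lists, the placement token, and the helper Trunc are our assumptions.

    #include <cstdint>
    #include <cstring>

    // Assumed shape of NEAT's virtual base class (step 3); the real signature may differ.
    class FpImplementation {
     public:
        virtual ~FpImplementation() {}
        virtual float PerformOperation(char op, float a, float b) = 0;
    };

    // The FPI from the example in step 3: 8 mantissa bits for add/sub,
    // untouched (24-bit) precision for multiply and divide.
    class MixedPrecisionFpi : public FpImplementation {
     public:
        float PerformOperation(char op, float a, float b) override {
            switch (op) {
                case '+': return Trunc(Trunc(a, 8) + Trunc(b, 8), 8);
                case '-': return Trunc(Trunc(a, 8) - Trunc(b, 8), 8);
                case '*': return a * b;
                case '/': return a / b;
                default:  return a;
            }
        }
     private:
        static float Trunc(float x, int bits) {    // zero the low stored mantissa bits
            if (bits >= 23) return x;
            uint32_t r;
            std::memcpy(&r, &x, sizeof r);
            r &= ~((1u << (23 - bits)) - 1);
            std::memcpy(&x, &r, sizeof r);
            return x;
        }
    };

    // Step 4 (pseudocode, since Register_FP_selector's exact constructor is not shown;
    // the function names and the CIP placement token are illustrative only):
    //   std::map<std::string, FpImplementation*> rules = {
    //       {"FFT", &mixedFpi}, {"LowPassFilter", &mixedFpi}};
    //   Register_FP_selector selector("radar_cip", rules, /*placement=*/CIP);
    //   // at run time, pass "radar_cip" via the fp_selector_name flag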

At the completion of these steps, the user finds information about the most appropriate precision level for each individual function or the whole program.

V Experimental Results

We evaluate the efficacy and flexibility of NEAT in providing floating point approximation analysis. In general, NEAT generates useful information on the precision tuning of applications, which can be used at the design stage of software or conveyed to other layers of the system, such as compilers or hardware (e.g., building a set of reduced-precision FPUs). Section V-B inspects NEAT's floating point profiling of the applications. The primary challenge of automatic precision tuning is creating approximation configurations; we examine NEAT's flexibility in producing customized FPI definitions in Sections V-C and V-D. Moreover, NEAT's main mechanism, programmable placement rules, is investigated in Sections V-E and V-F.

To navigate the immense configuration space, NEAT comes with a tunable genetic exploration algorithm, which is used in the sections listed above. To ensure the robustness of NEAT on unseen data, we evaluate the difference between predicted accuracy and energy on training and test data, demonstrating that NEAT finds configurations that are robust across test inputs that were not seen in training (Section V-G). Finally, in Section V-H, we evaluate NEAT's general applicability to finding appropriate reduced precision floating point configurations by applying it to a problem that has recently seen a tremendous amount of attention from human experts: trading accuracy for reduced precision in neural network inference. We find that NEAT can use the whole-program rule to automatically find a single floating point precision similar to those reported by human experts. Further, we find that by using different floating point implementations for different layers, NEAT produces even greater energy savings for the same accuracy.

V-A Evaluation platform

We evaluate NEAT by exploring the tradeoff spaces of the placement rules for a variety of benchmarks. Table II lists the applications from Parsec 3.0 [7] and Rodinia 3.1 [13] suites with the configuration space size (default precision optimization target) and training and test inputs for each benchmark. These benchmarks cover domains from finance to image processing.

TABLE II: Benchmarks Used for Evaluation.
Benchmark | Training Inputs | Test Inputs | Possible Configuration Space
Blackscholes | 10 lists with 100K initial prices | 30 lists with 100K initial prices | 24^4
Bodytrack | Sequence of 5 frames | Sequence of 20 frames | 24^24
Fluidanimate | 5 fluids with 15K+ particles | 15 fluids with 15K+ particles | 24^9
Ferret | 5 databases of 16 images | 15 databases of 16 images | 24^12
Heartwall | Sequence of 15 frames | Sequence of 60 frames | 24^4
Kmeans | 10 vectors with 512 data points | 30 vectors with 512 data points | 24^9
Particlefilter | Sequence of 32 frames | Sequence of 128 frames | 53^10
Radar | Sequence of 10 frames | Sequence of 40 frames | 24^13

To create FPIs, we use bit truncation. For single precision floating point numbers (the float type in C), we have 24 different FPIs corresponding to the mantissa bits. Similarly, we create 53 FPIs for double precision floating point numbers. For the whole-program approach, the size of the tradeoff space is the total number of possible FPIs, which is 24 or 53 points. For the per-function approaches, we consider the top 10 functions with the most FLOPs when enforcing the FP rules, so each of the top 10 functions may use a different FPI.

In each experiment, at most 400 configurations in the tradeoff space (less than 6^-12 of all possible configurations) have been evaluated through NEAT's genetic algorithm.

V-B Floating Point Precision Distribution

NEAT can be used to analyze the type, distribution, and the intensity of the FLOPs in a program. Figure 4 depicts the ratio of single and double precision FLOPs for each benchmark.

Figure 4: Floating Point Type Breakdown for Benchmarks. While most benchmarks have a dominant FP type, some carry both.

Most of the benchmarks hold the same precision level across the source for correctness and portability. For example, Bodytrack, Heartwall, and Kmeans are all implemented with the float type, while Canneal mainly uses double. However, some benchmarks, such as Ferret, Particlefilter, and Srad, include external libraries and therefore contain a mixture of both precision levels. In this case, users might choose which optimization target to enforce. Specifying the right target opens up further opportunities for additional energy savings.

V-C FPU Energy Saving

NEAT provides an estimate of the FPU energy consumed by the FLOPs. We compare two rules: whole program (WP) and currently-in-progress (CIP). As a reminder, WP uses one floating point implementation for the entirety of the program, while CIP is free to choose a separate implementation for each of the top 10 functions (by FLOP count) in the program. For Particlefilter, we set the optimization target to double precision as most of its FLOPs are double. For the rest of the benchmarks, we apply single precision optimization.

We consider the top 10 FLOP-intensive functions for the CIP placement. One might ask how many of the FLOPs are included in the top 10 functions. For all benchmarks, at least 98% of FLOPs come from the top 10 functions; thus NEAT covers almost all of the FLOPs in the program.

Figure 5 illustrates the lower convex hull of normalized FPU energy and the error rate (also referred to as accuracy loss). The error rate metric is the relative error of a configuration compared to the highest quality configuration (the baseline), where no approximation happens. The horizontal axis is the error rate, while the Normalized Energy Consumption (NEC) relative to the baseline is shown on the y-axis. The lower the curve, the more energy efficient the discovered configurations are. Since users generally do not care about extremely inaccurate outputs, only error rates less than 20% are shown in the subfigures. The results show that if we assign multiple FPIs at the function level, NEAT retrieves more energy efficient configurations that are not reachable with a single FPI for the whole program. This result further demonstrates NEAT's value in design space exploration.
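Concretely, for a scalar output quantity Q, the error rate of a configuration c can be written as |Q_c - Q_baseline| / |Q_baseline|; for benchmarks whose outputs are vectors or images, the per-element relative errors are aggregated, with the exact aggregation being benchmark-specific.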

Figure 5: Lower Convex Hulls of FPU energy and Error Rates for the WP and CIP. Values are normalized to the baseline.

With minimal error in the final output of the benchmark, NEAT reduces the FPU energy by up to 60%. For some applications, such as Blackscholes, Fluidanimate, and Particlefilter, the FPU energy savings are more considerable. These benchmarks have fewer than 10 FLOP-intensive functions. Therefore, first, CIP covers all the FLOPs in the program; second, since the tradeoff space is relatively small, NSGA-II searches a larger portion of it in the same exploration time.

For the Fluidanimate and Ferret benchmarks, there are only three and two configurations, respectively, where WP outperforms CIP. The reason is that NEAT's genetic algorithm, being a heuristic, fails to explore those specific configurations. The same pattern can be seen for the Radar benchmark, where CIP does not fully dominate the whole-program approach.

The Heartwall benchmark has only two FLOP functions, which are very sensitive to bit width adjustment; any modification leads to more than 20% error. Consequently, NEAT is not able to decrease FPU energy to less than 71% of the baseline with a reasonable error rate. The opposite scenario happens for the Particlefilter application, where the major FLOP functions do not considerably impact the quality of output; hence NEAT aggressively reduces the FPU energy without causing much error.

For a more detailed comparison, we re-illustrate a quantized representation of the previous plot. Figure 6 displays how the FPU energy savings grow as the tolerated error threshold increases. Higher bars indicate more energy savings. By harmonic mean, applying the CIP approach rather than WP results in 7%, 12%, and 13% more energy savings at 1%, 5%, and 10% error rates, respectively.

Figure 6: FPU Energy Savings at Different Error Rates, normalized to the baseline. The higher the bars, the more energy efficient the configuration.

A steeper slope in the lower convex hull curves in the subplots of Figure 5 translates into higher bars in Figure 6 as the error threshold increases. The Blackscholes and Particlefilter benchmarks demonstrate such behavior. In contrast, increasing the error threshold in the Particlefilter and Radar applications does not inflate the FPU energy savings similarly.

From these graphs, we draw two conclusions. First, specifying the FPI placement at a finer granularity results in more efficient FPI-to-function mappings. In other words, per-function rules use less energy at the same error than a single FPI for the whole application, and this type of insight is really only achievable with an automated system like NEAT. Second, if higher error rates are allowed, NEAT achieves higher FPU energy efficiency. Thus, NEAT can navigate the whole tradeoff space and give users a range of options depending on the tolerable error rate.

V-D Memory Instructions

Main memory (DRAM) consumes as much as half of the total system power in a computer today, due to the increasing demand for memory capacity and bandwidth [53]. Hence, reducing memory traffic translates directly into substantial energy savings. NEAT estimates memory energy by accounting only for accesses to/from off-chip memory, keeping track of memory operations such as MOVSS and MOVSD. Figure 7 depicts memory access energy for a range of error rates for both the whole-program (WP) and per-function (CIP) approaches across the benchmarks. As before, higher bars indicate higher energy efficiency. Values are normalized to the non-approximated version of the application, which acts as the baseline. By harmonic mean, increasing the error rate from 1% to 10% results in 3.2-10.5% less energy consumption.

Figure 7: Memory Transfer Energy Savings at Different Error Rates, normalized to the baseline.

If the FLOP functions are memory intensive, reducing the precision bits results in lower memory bandwidth, and consequently more energy savings. That is why benchmarks such as Bodytrack, Fluidanimate, and Radar reduce the memory energy by more than 60%. In the rest of the benchmarks, the FLOP functions are purely compute intensive.

To conclude the experiments above: the WP rule represents prior work [79] that tries to find a single, most optimal approximation for the whole application. The per-function rules of NEAT show the ability of the replacement rules to let programmers explore a richer set of tradeoffs without having to come up with whole new implementations of existing program functionality.

V-E Flexible Precision Level

In previous sections, we observed that some benchmarks have a mixture of both float and double FLOPs. To choose the right optimization target, we compare the energy and accuracy of selected benchmarks under both single and double precision optimization targets. The FPI-to-function mapping is CIP in this experiment.

Figure 8: FPU Energy Savings with Different Optimization Targets for NEAT.

Figure 8 shows the normalized energy savings for both single and double optimization targets. As expected, if we choose the optimization target to match the FP type that has the larger share of the FLOP distribution, higher energy savings are achieved. This observation is easily justified by looking back at Section V-B: both Canneal and Particlefilter contain more 64-bit than 32-bit FLOPs. Thus, double precision as the NEAT directive is the right choice to achieve substantially higher energy efficiency.

Ferret requires special attention, as it is not obvious how to choose the optimization target based on the FLOP distribution ratio: it has almost equal amounts of float and double FLOPs. At the 10% error rate, NEAT saves up to 92% of the FPU energy corresponding to double instructions, while only 38% savings are available if we consider only float instructions. There are two reasons for the discrepancy. One is that double FLOPs generally yield more precise output, but they use more precision bits in return. Thus, NEAT has more freedom to cut down unnecessary floating point bits without losing much accuracy, because the double baseline is already more accurate than the float one. Second, the double functions in Ferret are not accuracy sensitive, meaning that enforcing approximation on these functions does not excessively change the quality of the output. This is a good example of how NEAT determines the most efficient configurations for any benchmark regardless of how its floating point precision is specified in the source (or binary).

V-F Function Call Stack

As we mentioned in Section III-B4, if we map an FPI to a function, the quality of output can change depending on the caller. While on most benchmarks the CIP and FCS approaches produce the same result, on Radar they differ. Hence, we examine the impact of the caller of the FFT function on the energy and accuracy of the benchmark. Figure 9 illustrates the FPU energy savings normalized to the baseline for the CIP and FCS placement rules. FCS was able to explore a handful of more optimal configurations, resulting in 7% more energy savings at 1% accuracy loss compared to CIP, without extra runtime overhead. At 5% and 10% error rates, the additional energy savings are 4% and 2%, respectively.

Figure 9: Comparison of CIP and FCS for the FPU Energy Savings in Radar.

V-G Sensitivity to Input Changes

Since we employ a heuristic exploration technique, we ensure that NEAT produces statistically sound results by evaluating each application with multiple inputs divided into training and test sets. We take the median of normalized accuracy loss and FPU energy for each set of inputs, compute a linear least squares fit of training data to test data, and compute the correlation coefficient of each fit. Higher correlation coefficients imply less input sensitivity; i.e., the behavior of configurations found on the training data is a good predictor of test behavior.
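Assuming the standard Pearson definition, the reported R-value over the paired training/test medians (x_i, y_i) is r = Σ(x_i - mean(x))(y_i - mean(y)) / sqrt(Σ(x_i - mean(x))^2 · Σ(y_i - mean(y))^2); values near 1 mean the test-set behavior is almost exactly a linear function of the training-set behavior.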

TABLE III: Correlation Coefficients for Error Rates and FPU energy.
Benchmark | Error Rates | FPU Energy
Blackscholes | 0.999 | 0.999
Bodytrack | 0.958 | 0.989
Fluidanimate | 0.995 | 1.0
Ferret | 0.973 | 1.0
Heartwall | 0.999 | 1.0
Kmeans | 0.932 | 1.0
Particlefilter | 0.991 | 1.0
Radar | 0.992 | 1.0

Table III shows the correlation coefficients (R-values) for accuracy loss and FPU energy for each benchmark. Due to the heuristic nature of the exploration technique, it is possible to select configurations that perform differently on unseen data; for instance, Kmeans clearly stresses the difference between training and test inputs. Nevertheless, all benchmarks have uniformly high R-values on accuracy loss and FPU energy, at least 0.93. This demonstrates that NEAT's search techniques are robust and that the accuracy and energy results they predict on training inputs hold up well on test inputs. The robustness of the energy results is, perhaps, not surprising, as those should be highly predictable (simpler FLOP implementations predictably lower energy). The robustness of the accuracy results is perhaps more surprising, as it is not intuitively obvious that floating point implementations that work well for one set of inputs would also work for another.

V-H Neural Network Integration

The energy and resource constraints of neural networks create an intriguing challenge. Recently, a growing body of literature has tried to sacrifice the precision of training and inference for lower runtime and energy consumption [16]. NEAT can be used to identify the FLOP-intensive sections of the network and then provide the minimum precision required for the computation without considerable loss of model accuracy. This tradeoff (small accuracy loss for large energy savings) is well known, and we perform this study not to claim a new result, but to demonstrate that NEAT's automated approach can produce the same types of savings for this problem that have been produced by human domain experts. We also believe that using NEAT's programmable replacement rules to create DNNs with differing precision throughout the network is a new contribution that would, due to the size of the search space, be quite difficult even for human experts.

We use a handwritten digit classification task with the MNIST dataset, which includes 60K training images and 10K test images. For the CNN, we consider the LeNet-5 model, with the architecture summarized in Table IV. The LeNet-5 architecture consists of two sets of convolutional and average pooling layers, followed by a flattening convolutional layer, then two fully-connected layers, and finally a softmax classifier [48].

TABLE IV: LeNet-5 Architecture Summary.
Layer | Feature Map | Size | Kernel Size | Activation
Input Image | 1 | 32x32 | - | -
1 Convolutional (1) | 6 | 28x28 | 5x5 | tanh
2 Average Pooling (1) | 6 | 14x14 | 2x2 | tanh
3 Convolutional (2) | 16 | 10x10 | 5x5 | tanh
4 Average Pooling (2) | 16 | 5x5 | 2x2 | tanh
5 Convolutional (3) | 120 | 1x1 | 5x5 | tanh
6 Fully Connected | - | 84 | - | tanh
Output Fully Connected | - | 10 | - | softmax
TABLE V: Mantissa Bits For Single Precision FP Recommended by NEAT for Each Layer at Different Error Rates.
Layers / Error Rates | Conv 1 | Avg Pool 1 | Conv 2 | Avg Pool 2 | Conv 3 | FC | Tanh | Internal Func.
1% | 10 | 23 | 14 | 4 | 19 | 4 | 20 | 17
5% | 10 | 5 | 5 | 16 | 13 | 4 | 18 | 15
10% | 6 | 16 | 12 | 9 | 13 | 1 | 17 | 11

Figure 10 shows the FLOP breakdown for CNN training with a minibatch size of 4, a learning rate of 1, and 30 epochs. We first measured how many of the operations are floating point to determine the applicability of NEAT. For inference, more than 73% of operations are FLOPs, which makes NEAT highly beneficial to apply. Next, we analyze the FLOP distribution across the layers. We observe that more than 69% of the floating point computation happens in the convolutional layers, which extract interesting features from an image. Activation phases and internal compute functions are responsible for the majority of the remainder. Finally, we show that the number of FLOPs decreases in the later layers of the CNN, since the amount of data transferred between layers shrinks as well.

Figure 10: 32-bit FLOP breakdown per layer in digit recognition CNN.

To apply the FPI-to-function placement rules to a CNN, there are two options. The first is to apply one FPI per layer category (which we refer to as PLC), meaning that all convolutional layers use the same precision level. The second is to apply a different FPI per layer instance (PLI), in which case the first and third layers might use distinct precision levels even though they are both convolutional layers.

Picking the right FPI placement policy is not trivial for CNNs. Unlike the WP versus CIP rules, where one has a significantly larger tradeoff space than the other, the PLC and PLI tradeoff spaces are both large enough that heuristic exploration is required; thus, either of these rules could outperform the other given the same exploration time. With PLC, NEAT explores a larger portion of the tradeoff space, leading it to locate efficient configurations more quickly. On the other hand, PLI examines FPI mappings at a finer granularity, so it has a higher chance of discovering more optimal configurations.

Figure 11: Comparison of PLC and PLI replacements for the CNN. (a) Lower Convex Hull Curves of Energy and Error Rate. (b) Quantized Energy Savings at Different Error Rates.

Figure 11(a) illustrates the lower convex hull of normalized FPU energy and accuracy for both approaches. The accuracy loss is the error relative to the baseline configuration without approximation. The baseline recognition accuracy in the inference stage is 99.04% with a fully accurate trained model. Each point in the tradeoff space represents an FPI to layer (category or instance) mapping. Points closer to the origin indicate higher energy efficiency.

As can be seen, the lower convex hull of PLI (finer granularity) outperforms the PLC curve for error rates of less than 20%. The quantized representation of the FPU energy versus error rate tradeoff space is shown in Figure 11(b) for both PLC and PLI placements. As in the previous evaluation, finer granularity results in higher energy efficiency. With 1%, 5%, and 10% accuracy loss, NEAT with PLI placements achieves 6%, 4%, and 3% more energy savings compared to the default configuration.

NEAT’s programmable placement rules allow developers to analyze various precision levels for different components of their neural network without requiring them to instrument the source code or re-design the architecture.

Since the FPIs are based on bit truncation of the mantissa, NEAT uses the above analysis to find the required precision bits for each layer in the LeNet-5 network under accuracy loss constraints. By default, each layer is implemented with single precision floating point numbers (24 mantissa bits). Table V shows the mantissa bits required for every layer in the network. These precisions could later be realized with the MPFR library in C [30] or the mpmath library in Python [42].
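To illustrate that hand-off, a layer whose recommended width is, say, 10 mantissa bits could perform its accumulations through MPFR with the precision set accordingly. The sketch below is ours and assumes the recommended mantissa width maps directly onto the MPFR working precision; the function and variable names are illustrative.

    #include <mpfr.h>

    // Multiply-accumulate acc += w * x at a reduced, per-layer precision.
    void mac_reduced(mpfr_t acc, double w, double x, mpfr_prec_t layer_bits) {
        mpfr_t tw, tx;
        mpfr_init2(tw, layer_bits);          // per-layer precision, e.g. 10 bits
        mpfr_init2(tx, layer_bits);
        mpfr_set_d(tw, w, MPFR_RNDN);
        mpfr_set_d(tx, x, MPFR_RNDN);
        mpfr_mul(tw, tw, tx, MPFR_RNDN);     // w * x rounded to the reduced precision
        mpfr_add(acc, acc, tw, MPFR_RNDN);   // accumulate (acc keeps its own precision)
        mpfr_clear(tw);
        mpfr_clear(tx);
    }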

VI Conclusion

In this work, we proposed NEAT, a tool for automated precision tuning of floating point applications. NEAT provides mechanisms for programmers to explore the tradeoff space of combinations of approximate floating point implementations without extensive source code refactoring. We evaluated NEAT on various benchmarks with whole-program and per-function placement rules and found that, at the finer granularity, up to 54% and 74% energy savings are available in the FPU and in memory transfers, respectively. We empirically show that NEAT performs robustly on unseen inputs as well. We also perform a case study on a digit recognition CNN to find the optimal precision level for each layer.

Acknowledgments

This research is supported by NSF(CCF-1439156, CNS-1526304, CCF-1823032, CNS-1764039). Additional support comes from the Proteus project under the DARPA BRASS program and a DOE Early Career award.

References

  • [1] C. Alvarez, J. Corbal, and M. Valero, “Fuzzy memoization for floating-point multimedia applications,” IEEE Transactions on Computers, vol. 54, no. 7, pp. 922–927, July 2005.
  • [2] J. Ansel, S. Kamil, K. Veeramachaneni, J. Ragan-Kelley, J. Bosboom, U.-M. O’Reilly, and S. Amarasinghe, “Opentuner: An extensible framework for program autotuning,” in Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, ser. PACT ’14.   New York, NY, USA: ACM, 2014, pp. 303–316. [Online]. Available: http://doi.acm.org/10.1145/2628071.2628092
  • [3] J. Ansel, Y. L. Wong, C. Chan, M. Olszewski, A. Edelman, and S. Amarasinghe, “Language and compiler support for auto-tuning variable-accuracy algorithms,” in Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization.   IEEE Computer Society, 2011, pp. 85–96.
  • [4] W. Baek and T. M. Chilimbi, “Green: A framework for supporting energy-conscious programming using controlled approximation,” SIGPLAN Not., vol. 45, no. 6, pp. 198–209, Jun. 2010. [Online]. Available: http://doi.acm.org/10.1145/1809028.1806620
  • [5] J. Balkind, M. McKeown, Y. Fu, T. Nguyen, Y. Zhou, A. Lavrov, M. Shahrad, A. Fuchs, S. Payne, X. Liang et al., “Openpiton: An open source manycore research framework,” in ACM SIGARCH Computer Architecture News, vol. 44, no. 2.   ACM, 2016, pp. 217–232.
  • [6] S. Barati, F. A. Bartha, S. Biswas, R. Cartwright, A. Duracz, D. S. Fussell, H. Hoffmann, C. Imes, J. E. Miller, N. Mishra, Arvind, D. Nguyen, K. V. Palem, Y. Pei, K. Pingali, R. Sai, A. Wright, Y. Yang, and S. Zhang, “Proteus: Language and runtime support for self-adaptive software development,” IEEE Software, vol. 36, no. 2, pp. 73–82, 2019. [Online]. Available: https://doi.org/10.1109/MS.2018.2884864
  • [7] C. Bienia, “Benchmarking modern multiprocessors,” Ph.D. dissertation, Princeton University, January 2011.
  • [8] S. Borkar, “The exascale challenge.” Keynote Talk, Parallel Architectures and Compilation Techniques (PACT), Galveston Island, Texas, USA, Oct 2011.
  • [9] J. Bornholt, T. Mytkowicz, and K. S. McKinley, “Uncertain<T>: A first-order type for uncertain data,” ACM SIGPLAN Notices, vol. 49, no. 4, pp. 51–66, 2014.
  • [10] A. Boutros, S. Yazdanshenas, and V. Betz, “Embracing diversity: Enhanced dsp blocks for low-precision deep learning on fpgas,” in 2018 28th International Conference on Field Programmable Logic and Applications (FPL).   IEEE, 2018, pp. 35–357.
  • [11] L. N. Chakrapani, B. E. S. Akgul, S. Cheemalavagu, P. Korkmaz, K. V. Palem, and B. Seshasayee, “Ultra-efficient (embedded) soc architectures based on probabilistic cmos (pcmos) technology,” in Proceedings of the Conference on Design, Automation and Test in Europe: Proceedings, ser. DATE ’06.   3001 Leuven, Belgium, Belgium: European Design and Automation Association, 2006, pp. 1110–1115. [Online]. Available: http://dl.acm.org/citation.cfm?id=1131481.1131790
  • [12] A. P. Chandrakasan and R. W. Brodersen, “Minimizing power consumption in digital cmos circuits,” Proceedings of the IEEE, vol. 83, no. 4, pp. 498–523, Apr 1995.
  • [13] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. H. Lee, and K. Skadron, “Rodinia: A benchmark suite for heterogeneous computing,” in Workload Characterization, 2009. IISWC 2009. IEEE International Symposium on, Oct 2009, pp. 44–54.
  • [14] V. K. Chippa, S. Venkataramani, S. T. Chakradhar, K. Roy, and A. Raghunathan, “Approximate computing: An integrated hardware approach,” in 2013 Asilomar Conference on Signals, Systems and Computers, Nov 2013, pp. 111–117.
  • [15] M. Courbariaux, Y. Bengio, and J. David, “Binaryconnect: Training deep neural networks with binary weights during propagations,” CoRR, vol. abs/1511.00363, 2015. [Online]. Available: http://arxiv.org/abs/1511.00363
  • [16] D. Das, N. Mellempudi, D. Mudigere, D. Kalamkar, S. Avancha, K. Banerjee, S. Sridharan, K. Vaidyanathan, B. Kaul, E. Georganas et al., “Mixed precision training of convolutional neural networks using integer operations,” arXiv preprint arXiv:1802.00930, 2018.
  • [17] M. de Kruijf, S. Nomura, and K. Sankaralingam, “Relax: An architectural framework for software recovery of hardware faults,” in Proceedings of the 37th Annual International Symposium on Computer Architecture, ser. ISCA ’10.   New York, NY, USA: ACM, 2010, pp. 497–508. [Online]. Available: http://doi.acm.org/10.1145/1815961.1816026
  • [18] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: Nsga-ii,” IEEE Transactions on Evolutionary Computation, vol. 6, no. 2, pp. 182–197, Apr 2002.
  • [19] Y. Ding, N. Mishra, and H. Hoffmann, “Generative and multi-phase learning for computer systems optimization,” in Proceedings of the 46th International Symposium on Computer Architecture, ser. ISCA ’19.   New York, NY, USA: Association for Computing Machinery, 2019, pp. 39–52. [Online]. Available: https://doi.org/10.1145/3307650.3326633
  • [20] K. Du, P. Varman, and K. Mohanram, “High performance reliable variable latency carry select addition,” in 2012 Design, Automation Test in Europe Conference Exhibition (DATE), March 2012, pp. 1257–1262.
  • [21] Z. Du, K. Palem, A. Lingamneni, O. Temam, Y. Chen, and C. Wu, “Leveraging the error resilience of machine-learning applications for designing highly energy efficient accelerators,” in 2014 19th Asia and South Pacific Design Automation Conference (ASP-DAC), Jan 2014, pp. 201–206.
  • [22] P. D. Düben, J. Joven, A. Lingamneni, H. McNamara, G. De Micheli, K. V. Palem, and T. N. Palmer, “On the use of inexact, pruned hardware in atmospheric modelling,” Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, vol. 372, no. 2018, 2014. [Online]. Available: http://rsta.royalsocietypublishing.org/content/372/2018/20130276
  • [23] S. Eldridge, F. Raudies, D. Zou, and A. Joshi, “Neural network-based accelerators for transcendental function approximation,” in Proceedings of the 24th edition of the great lakes symposium on VLSI.   ACM, 2014, pp. 169–174.
  • [24] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, “Architecture support for disciplined approximate programming,” in Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS XVII.   New York, NY, USA: ACM, 2012, pp. 301–312. [Online]. Available: http://doi.acm.org/10.1145/2150976.2151008
  • [25] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, “Neural acceleration for general-purpose approximate programs,” in Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-45.   Washington, DC, USA: IEEE Computer Society, 2012, pp. 449–460. [Online]. Available: http://dx.doi.org/10.1109/MICRO.2012.48
  • [26] A. Farrell and H. Hoffmann, “MEANTIME: achieving both minimal energy and timeliness with approximate computing,” in 2016 USENIX Annual Technical Conference, USENIX ATC 2016, Denver, CO, USA, June 22-24, 2016., 2016, pp. 421–435.
  • [27] A. Filieri, H. Hoffmann, and M. Maggio, “Automated multi-objective control for self-adaptive software design,” in Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, Bergamo, Italy, August 30 - September 4, 2015, E. D. Nitto, M. Harman, and P. Heymans, Eds.   ACM, 2015, pp. 13–24. [Online]. Available: https://doi.org/10.1145/2786805.2786833
  • [28] A. Filieri, M. Maggio, K. Angelopoulos, N. D’Ippolito, I. Gerostathopoulos, A. B. Hempel, H. Hoffmann, P. Jamshidi, E. Kalyvianaki, C. Klein, F. Krikava, S. Misailovic, A. V. Papadopoulos, S. Ray, A. M. Sharifloo, S. Shevtsov, M. Ujma, and T. Vogel, “Control strategies for self-adaptive software systems,” ACM Trans. Auton. Adapt. Syst., vol. 11, no. 4, pp. 24:1–24:31, 2017. [Online]. Available: https://doi.org/10.1145/3024188
  • [29] B. Fleischer, S. Shukla, M. Ziegler, J. Silberman, J. Oh, V. Srinivasan, J. Choi, S. Mueller, A. Agrawal, T. Babinsky, N. Cao, C. Chen, P. Chuang, T. Fox, G. Gristede, M. Guillorn, H. Haynie, M. Klaiber, D. Lee, S. Lo, G. Maier, M. Scheuermann, S. Venkataramani, C. Vezyrtzis, N. Wang, F. Yee, C. Zhou, P. Lu, B. Curran, L. Chang, and K. Gopalakrishnan, “A scalable multi-teraops deep learning processor core for ai training and inference,” in 2018 IEEE Symposium on VLSI Circuits, June 2018, pp. 35–36.
  • [30] L. Fousse, G. Hanrot, V. Lefèvre, P. Pélissier, and P. Zimmermann, “Mpfr: A multiple-precision binary floating-point library with correct rounding,” ACM Transactions on Mathematical Software (TOMS), vol. 33, no. 2, p. 13, 2007.
  • [31] N. Gajjar, N. M. Devahsrayee, and K. S. Dasgupta, “Scalable leon 3 based soc for multiple floating point operations,” in 2011 Nirma University International Conference on Engineering, Dec 2011, pp. 1–3.
  • [32] B. Grigorian, N. Farahpour, and G. Reinman, “Brainiac: Bringing reliable accuracy into neurally-implemented approximate computing,” in 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), Feb 2015, pp. 615–626.
  • [33] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “Eie: Efficient inference engine on compressed deep neural network,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), June 2016, pp. 243–254.
  • [34] H. Hoffmann, “Coadapt: Predictable behavior for accuracy-aware applications running on power-aware systems,” in 26th Euromicro Conference on Real-Time Systems, ECRTS 2014, Madrid, Spain, July 8-11, 2014, 2014, pp. 223–232.
  • [35] H. Hoffmann, A. Agarwal, and S. Devadas, “Selecting spatiotemporal patterns for development of parallel applications,” IEEE Trans. Parallel Distributed Syst., vol. 23, no. 10, pp. 1970–1982, 2012. [Online]. Available: https://doi.org/10.1109/TPDS.2011.298
  • [36] H. Hoffmann, S. Misailovic, S. Sidiroglou, A. Agarwal, and M. Rinard, “Using code perforation to improve performance, reduce energy consumption, and respond to failures,” no. MIT-CSAIL-TR-2009-042, 09 2009.
  • [37] H. Hoffmann, S. Sidiroglou, M. Carbin, S. Misailovic, A. Agarwal, and M. Rinard, “Dynamic knobs for responsive power-aware computing,” in Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS XVI.   New York, NY, USA: ACM, 2011, pp. 199–212. [Online]. Available: http://doi.acm.org/10.1145/1950365.1950390
  • [38] C. Imes and H. Hoffmann, “Bard: A unified framework for managing soft timing and power constraints,” in International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation, SAMOS 2016, Agios Konstantinos, Samos Island, Greece, July 17-21, 2016, W. A. Najjar and A. Gerstlauer, Eds.   IEEE, 2016, pp. 31–38. [Online]. Available: https://doi.org/10.1109/SAMOS.2016.7818328
  • [39] C. Imes, S. A. Hofmeyr, and H. Hoffmann, “Energy-efficient application resource scheduling using machine learning classifiers,” in Proceedings of the 47th International Conference on Parallel Processing, ICPP 2018, Eugene, OR, USA, August 13-16, 2018.   ACM, 2018, pp. 45:1–45:11. [Online]. Available: https://doi.org/10.1145/3225058.3225088
  • [40] C. Imes, D. H. K. Kim, M. Maggio, and H. Hoffmann, “POET: a portable approach to minimizing energy under soft real-time constraints,” in 21st IEEE Real-Time and Embedded Technology and Applications Symposium, Seattle, WA, USA, April 13-16, 2015.   IEEE Computer Society, 2015, pp. 75–86. [Online]. Available: https://doi.org/10.1109/RTAS.2015.7108419
  • [41] C. Imes, H. Zhang, K. Zhao, and H. Hoffmann, “Copper: Soft real-time application performance using hardware power capping,” in 2019 IEEE International Conference on Autonomic Computing, ICAC 2019, Umeå, Sweden, June 16-20, 2019.   IEEE, 2019, pp. 31–41. [Online]. Available: https://doi.org/10.1109/ICAC.2019.00015
  • [42] F. Johansson et al., mpmath: a Python library for arbitrary-precision floating-point arithmetic (version 0.14), February 2010, http://code.google.com/p/mpmath/.
  • [43] A. Kanduri, M. H. Haghbayan, A. M. Rahmani, P. Liljeberg, A. Jantsch, N. Dutt, and H. Tenhunen, “Approximation knob: Power capping meets energy efficiency,” in 2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov 2016, pp. 1–8.
  • [44] K. Y. Kyaw, W. L. Goh, and K. S. Yeo, “Low-power high-speed multiplier for error-tolerant application,” in 2010 IEEE International Conference of Electron Devices and Solid-State Circuits (EDSSC), Dec 2010, pp. 1–4.
  • [45] U. Köster, T. J. Webb, X. Wang, M. Nassar, A. K. Bansal, W. H. Constable, O. H. Elibol, S. Gray, S. Hall, L. Hornof, A. Khosrowshahi, C. Kloss, R. J. Pai, and N. Rao, “Flexpoint: An adaptive numerical format for efficient training of deep neural networks,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, ser. NIPS’17.   USA: Curran Associates Inc., 2017, pp. 1740–1750. [Online]. Available: http://dl.acm.org/citation.cfm?id=3294771.3294937
  • [46] P. Kulkarni, P. Gupta, and M. Ercegovac, “Trading accuracy for power with an underdesigned multiplier architecture,” in 2011 24th Internatioal Conference on VLSI Design, Jan 2011, pp. 346–351.
  • [47] J. Lebak, J. Kepner, H. Hoffmann, and E. Rutledge, “Parallel vsipl++: An open standard software library for high-performance parallel signal processing,” Proceedings of the IEEE, vol. 93, no. 2, pp. 313–330, Feb 2005.
  • [48] Y. LeCun, P. Haffner, L. Bottou, and Y. Bengio, “Object recognition with gradient-based learning,” in Shape, contour and grouping in computer vision.   Springer, 1999, pp. 319–345.
  • [49] A. Lingamneni, C. Enz, K. Palem, and C. Piguet, “Designing energy-efficient arithmetic operators using inexact computing,” Journal of Low Power Electronics, vol. 9, no. 1, pp. 141–153, 2013.
  • [50] S. Liu, K. Pattabiraman, T. Moscibroda, and B. G. Zorn, “Flikker: Saving dram refresh-power through critical data partitioning,” in Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS XVI.   New York, NY, USA: ACM, 2011, pp. 213–224. [Online]. Available: http://doi.acm.org/10.1145/1950365.1950391
  • [51] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, “Pin: Building customized program analysis tools with dynamic instrumentation,” in Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI ’05.   New York, NY, USA: ACM, 2005, pp. 190–200. [Online]. Available: http://doi.acm.org/10.1145/1065010.1065034
  • [52] M. Maggio, A. V. Papadopoulos, A. Filieri, and H. Hoffmann, “Automated control of multiple software goals using multiple actuators,” in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2017, Paderborn, Germany, September 4-8, 2017, 2017, pp. 373–384. [Online]. Available: https://doi.org/10.1145/3106237.3106247
  • [53] K. T. Malladi, F. A. Nothaft, K. Periyathambi, B. C. Lee, C. Kozyrakis, and M. Horowitz, “Towards energy-proportional datacenter memory with mobile dram,” in 2012 39th Annual International Symposium on Computer Architecture (ISCA), June 2012, pp. 37–48.
  • [54] M. McKeown, A. Lavrov, M. Shahrad, P. J. Jackson, Y. Fu, J. Balkind, T. M. Nguyen, K. Lim, Y. Zhou, and D. Wentzlaff, “Power and energy characterization of an open source 25-core manycore processor,” in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), Feb 2018, pp. 762–775.
  • [55] S. Misailovic, M. Carbin, S. Achour, Z. Qi, and M. C. Rinard, “Chisel: Reliability- and accuracy-aware optimization of approximate computational kernels,” in Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications, ser. OOPSLA ’14.   New York, NY, USA: ACM, 2014, pp. 309–328. [Online]. Available: http://doi.acm.org/10.1145/2660193.2660231
  • [56] S. Misailovic, S. Sidiroglou, H. Hoffmann, and M. Rinard, Quality of Service Profiling.   New York, NY, USA: Association for Computing Machinery, 2010, pp. 25–34. [Online]. Available: https://doi.org/10.1145/1806799.1806808
  • [57] N. Mishra, C. Imes, J. D. Lafferty, and H. Hoffmann, “CALOREE: learning control for predictable latency and low energy,” in Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2018, Williamsburg, VA, USA, March 24-28, 2018, X. Shen, J. Tuck, R. Bianchini, and V. Sarkar, Eds.   ACM, 2018, pp. 184–198. [Online]. Available: https://doi.org/10.1145/3173162.3173184
  • [58] N. Mishra, J. D. Lafferty, and H. Hoffmann, “ESP: A machine learning approach to predicting application interference,” in 2017 IEEE International Conference on Autonomic Computing, ICAC 2017, Columbus, OH, USA, July 17-21, 2017, X. Wang, C. Stewart, and H. Lei, Eds.   IEEE Computer Society, 2017, pp. 125–134. [Online]. Available: https://doi.org/10.1109/ICAC.2017.29
  • [59] N. Mishra, H. Zhang, J. D. Lafferty, and H. Hoffmann, “A probabilistic graphical model-based approach for minimizing energy under performance constraints,” in Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’15, Istanbul, Turkey, March 14-18, 2015, Ö. Özturk, K. Ebcioglu, and S. Dwarkadas, Eds.   ACM, 2015, pp. 267–281. [Online]. Available: https://doi.org/10.1145/2694344.2694373
  • [60] T. Moreau, A. Sampson, and L. Ceze, “Approximate computing: Making mobile systems more efficient,” IEEE Pervasive Computing, vol. 14, no. 2, pp. 9–13, Apr 2015.
  • [61] K. V. Palem, L. N. Chakrapani, Z. M. Kedem, A. Lingamneni, and K. K. Muntimadugu, “Sustaining moore’s law in embedded computing through probabilistic and approximate design: Retrospects and prospects,” in Proceedings of the 2009 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, ser. CASES ’09.   New York, NY, USA: ACM, 2009, pp. 1–10. [Online]. Available: http://doi.acm.org/10.1145/1629395.1629397
  • [62] Q. Zhang, F. Yuan, R. Ye, and Q. Xu, “Approxit: An approximate computing framework for iterative methods,” in 2014 51st ACM/EDAC/IEEE Design Automation Conference (DAC), June 2014, pp. 1–6.
  • [63] M. Rinard, H. Hoffmann, S. Misailovic, and S. Sidiroglou, “Patterns and statistical analysis for understanding reduced resource computing,” in Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, ser. OOPSLA ’10.   New York, NY, USA: Association for Computing Machinery, 2010, pp. 806–821. [Online]. Available: https://doi.org/10.1145/1869459.1869525
  • [64] C. Sakr, N. Wang, C.-Y. Chen, J. Choi, A. Agrawal, N. Shanbhag, and K. Gopalakrishnan, “Accumulation bit-width scaling for ultra-low precision training of deep networks,” arXiv preprint arXiv:1901.06588, 2019.
  • [65] A. Sampson, W. Dietl, E. Fortuna, D. Gnanapragasam, L. Ceze, and D. Grossman, “Enerj: Approximate data types for safe and general low-power computation,” in Proceedings of the 32Nd ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI ’11.   New York, NY, USA: ACM, 2011, pp. 164–174. [Online]. Available: http://doi.acm.org/10.1145/1993498.1993518
  • [66] A. Sampson, W. Dietl, E. Fortuna, D. Gnanapragasam, L. Ceze, and D. Grossman, “Enerj: Approximate data types for safe and general low-power computation,” in Proceedings of the 32Nd ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI ’11.   New York, NY, USA: ACM, 2011, pp. 164–174. [Online]. Available: http://doi.acm.org/10.1145/1993498.1993518
  • [67] M. H. Santriaji and H. Hoffmann, “GRAPE: minimizing energy for GPU applications with performance requirements,” in 49th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2016, Taipei, Taiwan, October 15-19, 2016.   IEEE Computer Society, 2016, pp. 16:1–16:13. [Online]. Available: https://doi.org/10.1109/MICRO.2016.7783719
  • [68] M. H. Santriaji and H. Hoffmann, “MERLOT: architectural support for energy-efficient real-time processing in gpus,” in IEEE Real-Time and Embedded Technology and Applications Symposium, RTAS 2018, 11-13 April 2018, Porto, Portugal, R. Pellizzoni, Ed.   IEEE Computer Society, 2018, pp. 214–226. [Online]. Available: https://doi.org/10.1109/RTAS.2018.00030
  • [69] Q. Shi, H. Hoffmann, and O. Khan, “A cross-layer multicore architecture to tradeoff program accuracy and resilience overheads,” IEEE Comput. Archit. Lett., vol. 14, no. 2, pp. 85–89, 2015. [Online]. Available: https://doi.org/10.1109/LCA.2014.2365204
  • [70] S. Sidiroglou-Douskos, S. Misailovic, H. Hoffmann, and M. Rinard, “Managing performance vs. accuracy trade-offs with loop perforation,” in Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, ser. ESEC/FSE ’11.   New York, NY, USA: ACM, 2011, pp. 124–134. [Online]. Available: http://doi.acm.org/10.1145/2025113.2025133
  • [71] G. Tagliavini, A. Marongiu, and L. Benini, “Flexfloat: A software library for transprecision computing,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2018.
  • [72] S. Venkataramani, A. Ranjan, K. Roy, and A. Raghunathan, “Axnn: Energy-efficient neuromorphic systems using approximate computing,” in 2014 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), Aug 2014, pp. 27–32.
  • [73] S. Venkataramani, V. K. Chippa, S. T. Chakradhar, K. Roy, and A. Raghunathan, “Quality programmable vector processors for approximate computing,” in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-46.   New York, NY, USA: ACM, 2013, pp. 1–12. [Online]. Available: http://doi.acm.org/10.1145/2540708.2540710
  • [74] A. K. Verma, P. Brisk, and P. Ienne, “Variable latency speculative addition: A new paradigm for arithmetic circuit design,” in 2008 Design, Automation and Test in Europe, March 2008, pp. 1250–1255.
  • [75] C. Wan, H. Hoffmann, S. Lu, and M. Maire, “Orthogonalized SGD and nested architectures for anytime neural networks,” in Proceedings of the 37th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, H. D. III and A. Singh, Eds., vol. 119.   PMLR, 13–18 Jul 2020, pp. 9807–9817. [Online]. Available: http://proceedings.mlr.press/v119/wan20a.html
  • [76] C. Wan, M. Santriaji, E. Rogers, H. Hoffmann, M. Maire, and S. Lu, “ALERT: Accurate learning for energy and timeliness,” in 2020 USENIX Annual Technical Conference (USENIX ATC 20).   USENIX Association, Jul. 2020, pp. 353–369. [Online]. Available: https://www.usenix.org/conference/atc20/presentation/wan
  • [77] N. Wang, J. Choi, D. Brand, C.-Y. Chen, and K. Gopalakrishnan, “Training deep neural networks with 8-bit floating point numbers,” in Advances in neural information processing systems, 2018, pp. 7675–7684.
  • [78] S. Wang, C. Li, H. Hoffmann, S. Lu, W. Sentosa, and A. I. Kistijantoro, “Understanding and auto-adjusting performance-sensitive configurations,” in Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2018, Williamsburg, VA, USA, March 24-28, 2018, X. Shen, J. Tuck, R. Bianchini, and V. Sarkar, Eds.   ACM, 2018, pp. 154–168. [Online]. Available: https://doi.org/10.1145/3173162.3173206
  • [79] S. Wu, G. Li, F. Chen, and L. Shi, “Training and inference with integers in deep neural networks,” arXiv preprint arXiv:1802.04680, 2018.
  • [80] A. Yazdanbakhsh, D. Mahajan, B. Thwaites, J. Park, A. Nagendrakumar, S. Sethuraman, K. Ramkrishnan, N. Ravindran, R. Jariwala, A. Rahimi, H. Esmaeilzadeh, and K. Bazargan, “Axilog: Language support for approximate hardware design,” in 2015 Design, Automation Test in Europe Conference Exhibition (DATE), March 2015, pp. 812–817.
  • [81] G. Zervakis, K. Tsoumanis, S. Xydis, D. Soudris, and K. Pekmestzi, “Design-efficient approximate multiplication circuits through partial product perforation,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, no. 10, pp. 3105–3117, Oct 2016.
  • [82] H. Zhang, M. Putic, and J. Lach, “Low power gpgpu computation with imprecise hardware,” in Proceedings of the 51st Annual Design Automation Conference, ser. DAC ’14.   New York, NY, USA: ACM, 2014, pp. 99:1–99:6. [Online]. Available: http://doi.acm.org/10.1145/2593069.2593156
  • [83] Y. Zhou, H. Hoffmann, and D. Wentzlaff, “CASH: supporting iaas customers with a sub-core configurable architecture,” in 43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea, June 18-22, 2016.   IEEE Computer Society, 2016, pp. 682–694. [Online]. Available: https://doi.org/10.1109/ISCA.2016.65
  • [84] N. Zhu, W. L. Goh, W. Zhang, K. S. Yeo, and Z. H. Kong, “Design of low-power high-speed truncation-error-tolerant adder and its application in digital signal processing,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 18, no. 8, pp. 1225–1229, Aug 2010.