
1: University of Perpignan, LAMPS laboratory, 52 Av. P. Alduy, Perpignan, France
2: Numalis, Cap Omega, Rond-point Benjamin Franklin, Montpellier, France
email: {assale.adje,matthieu.martel,dorra.ben-khalifa}@univ-perp.fr

Fast and Efficient Bit-Level Precision Tuning

   Assalé Adjé 1    Dorra Ben Khalifa 1    Matthieu Martel 1,2
Abstract

In this article, we introduce a new technique for precision tuning. This problem consists of finding the smallest data types for numerical values such that the result of the computation satisfies some accuracy requirement. State-of-the-art techniques for precision tuning use a try-and-fail approach: they change the data types of some variables of the program and evaluate the accuracy of the result; depending on what is obtained, they change more or fewer data types and repeat the process. Our technique is radically different. Based on semantic equations, we generate an Integer Linear Problem (ILP) from the program source code. Basically, this is done by reasoning on the most significant bit and the number of significant bits of the values, which are integer quantities. The integer solution to this problem, computed in polynomial time by a (real) linear programming solver, gives the optimal data types at the bit level. A finer set of semantic equations is also proposed which does not reduce directly to an ILP problem, so we use policy iteration to find the solution. Both techniques have been implemented and we show that our results encompass the results of state-of-the-art tools.

Keywords:
Static analysis, computer arithmetic, integer linear problems, numerical accuracy, policy iteration.

1 Introduction

Let us consider a program $P$ computing some numerical result $R$, typically but not necessarily in the IEEE754 floating-point arithmetic [1]. Precision tuning then consists of finding the smallest data types for all the variables and expressions of $P$ such that the result $R$ has some desired accuracy. In recent years, much attention has been paid to this problem [6, 8, 10, 11, 12, 18]. Indeed, precision tuning makes it possible to save memory and, by way of consequence, it has a positive impact on the footprint of programs concerning energy consumption, bandwidth usage, computation time, etc.

A common point to all the techniques cited previously is that they follow a try-and-fail strategy. Roughly speaking, one chooses a subset $S$ of the variables of $P$, assigns to them smaller data types (e.g. binary32 instead of binary64 [1]) and evaluates the accuracy of the tuned program $P'$. If the accuracy of the result returned by $P'$ is satisfying then new variables are included in $S$ or even smaller data types are assigned to certain variables already in $S$ (e.g. binary16). Otherwise, if the accuracy of the result of $P'$ is not satisfying, then some variables are removed from $S$. This process is applied repeatedly, until a stable state is found. Existing techniques differ in their way to evaluate the accuracy of programs, done by dynamic analysis [10, 11, 12, 18] or by static analysis [6, 8] of $P$ and $P'$. They may also differ in the algorithm used to define $S$, delta debugging being the most widespread method [18].

However, all these techniques suffer from the same combinatorial limitation: if $P$ has $n$ variables and if the method tries $k$ different data types, then the search space contains $k^n$ configurations. They scale neither in the number $n$ of variables (even if heuristics such as delta debugging [18] or branch and bound [6] reduce the search space, at the price of optimality) nor in the number $k$ of data types which can be tried. In particular, bit-level precision tuning, which consists of finding the minimal number of bits needed for each variable to reach the desired accuracy, independently of a limited number $k$ of data types, is not an option.

Thus, the method introduced in this article for precision tuning of programs is radically different. Here, no try-and-fail method is employed. Instead, the accuracy of the arithmetic expressions assigned to variables is determined by semantic equations, as a function of the accuracy of the operands. By reasoning on the number of significant bits of the variables of $P$, and knowing the weight of their most significant bit thanks to a range analysis performed before the tuning phase, we are able to reduce the problem to an Integer Linear Problem (ILP) which can be optimally solved in one shot by a classical linear programming solver (no iteration). Concerning the number $n$ of variables, the method scales up to the solver limitations, and the solutions are naturally found at the bit level, making the parameter $k$ irrelevant. An important point is that the optimal solution to the continuous linear programming relaxation of our ILP is a vector of integers, as demonstrated in Section 4.2. Consequently, we may use a linear solver over the real numbers, whose complexity is polynomial [19] (contrary to linear solvers over the integers, for which the problem is NP-complete [16]). This makes our precision tuning method solvable in polynomial time, contrary to the existing exponential methods.

Next, we go one step further by introducing a second set of semantic equations. These new equations make it possible to tune the precision even further by being less pessimistic on the propagation of carries in arithmetic operations. However, the problem then no longer reduces to an ILP ($\min$ and $\max$ operators are needed), so we use policy iteration (PI) [7] to find the solution efficiently.

Both methods have been implemented inside a tool for precision tuning named XXX [5] (in this article, XXX hides the actual name of our tool, and missing references refer to our previous work, for anonymity). Formerly, XXX expressed the precision tuning problem as a set of first-order logical propositions among relations between linear integer expressions. An SMT solver (Z3 in practice [15]) was used repeatedly to find the existence of a solution with a certain weight expressing the number of significant bits ($\mathsf{nsb}$) of the variables [5]. In the present article, we compare experimentally our new methods to the SMT-based method previously used by XXX [5] and to the Precimonious tool [10, 18]. These experiments, on programs coming from mathematical libraries or other applicative domains such as IoT [3], show that the technique introduced in this article for precision tuning clearly encompasses the state-of-the-art techniques.

The rest of this article is organized as follows. In the next section, we provide a motivating example. We then present in Section 3 some essential background on the functions needed for the constraint generation, and we detail the sets of constraints for both the ILP and PI methods. Section 4 presents the proofs of correctness. We end up in Section 5 by showing that our new techniques exhibit very good results in practice, before concluding in Section 6.

2 Motivating Example

A motivating example to better explain our method is given by the code snippet of Figure 1. In this example, we aim at modeling the movement of a simple pendulum without damping. Let $l = 0.5\ m$ be the length of this pendulum, $m = 1\ kg$ its mass and $g = 9.81\ m\cdot s^{-2}$ the gravitational acceleration. We denote by $\theta$ the tilt angle in radians, as shown in Figure 1 (initially $\theta = \frac{\pi}{4}$). The equation describing the movement of the pendulum is given in Equation (1).

Top left (source program):

1  g = 9.81; l = 0.5;
2  y1 = 0.785398; y2 = 0.785398;
3  h = 0.1; t = 0.0;
4  while (t < 10.0) {
5    y1new = y1 + y2 * h;
6    aux1 = sin(y1);
7    aux2 = aux1 * h * g / l;
8    y2new = y2 - aux2;
9    t = t + h;
10   y1 = y1new; y2 = y2new;
11 };
12 require_nsb(y2,20);

[Top right: plot of the pendulum movement for $\theta=\frac{\pi}{4}$, not reproduced.]

Bottom left (program annotated with labels):

1  g^{l1} = 9.81^{l0}; l^{l3} = 0.5^{l2};
2  y1^{l5} = 0.785398^{l4};
3  y2^{l7} = 0.785398^{l6};
4  h^{l9} = 0.1^{l8}; t^{l11} = 0.0^{l10};
5  while (t^{l13} <^{l15} 10.0^{l14})^{l59} {
6    y1new^{l24} = y1^{l17} +^{l23} y2^{l19} *^{l22} h^{l21};
7    aux1^{l28} = sin(y1^{l26})^{l27};
8    aux2^{l40} = aux1^{l30} *^{l39} h^{l32}
9              *^{l38} g^{l34} /^{l37} l^{l36};
10   y2new^{l46} = y2^{l42} -^{l45} aux2^{l44};
11   t^{l52} = t^{l48} +^{l51} h^{l50};
12   y1^{l55} = y1new^{l54};
13   y2^{l58} = y2new^{l57};
14 };
15 require_nsb(y2,20)^{l61};

Bottom right (program with inferred accuracies):

1  g|20| = 9.81|20|; l|20| = 0.5|20|;
2  y1|29| = 0.785398|29|;
3  y2|21| = 0.785398|21|;
4  h|21| = 0.1|21|; t|21| = 0.0|21|;
5  while (t < 10.0) {
6    y1new|20| = y1|21| +|20| y2|21|
7              *|22| h|21|;
8    aux1|20| = sin(y1|29|)|20|;
9    aux2|20| = aux1|19| *|20| h|18|
10             *|19| g|17| /|18| l|17|;
11   y2new|20| = y2|21| -|20| aux2|18|;
12   t|20| = t|21| +|20| h|17|;
13   y1|20| = y1new|20|;
14   y2|20| = y2new|20|;
15 };
16 require_nsb(y2,20);

Figure 1: Top left: source program. Top right: pendulum movement for $\theta=\frac{\pi}{4}$. Bottom left: program annotated with labels. Bottom right: program with inferred accuracies.
$m\cdot l\cdot\frac{d^{2}\theta}{dt^{2}}=-m\cdot g\cdot\sin\theta$ (1)

Equation (1) being a second-order differential equation, we first transform it into a system of two first-order differential equations by setting $y_1=\theta$ and $y_2=\frac{d\theta}{dt}$, which yields Equation (2). Applying Euler's method to this system gives the implementation of Figure 1.

$\frac{dy_{1}}{dt}=y_{2}\quad\text{and}\quad\frac{dy_{2}}{dt}=-\frac{g}{l}\cdot\sin y_{1}$ (2)
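Concretely, applying Euler's method with step $h$ to Equation (2) yields the updates computed at each iteration of the loop (this is just a restatement of lines 5 and 8 of the source program of Figure 1, not a new equation):

$y_1^{new}=y_1+h\cdot y_2 \qquad\text{and}\qquad y_2^{new}=y_2-h\cdot\frac{g}{l}\cdot\sin y_1$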

The key point of our technique is to generate a set of constraints for each statement of our imperative language, introduced further in Section 3. For our example, we suppose that all variables, before the XXX analysis, are in double precision (source program in the top left corner of Figure 1) and that a range determination is performed by dynamic analysis on the program variables (we plan to use a static analyzer in the future). XXX assigns to each node of the program's syntactic tree a unique control point in order to determine easily the number of significant bits of the result, as shown in the bottom left corner of Figure 1. Let us stress some notations concerning the structure of the XXX source code. For instance, the annotation g^{l1} = 9.81^{l0} denotes that g has a unique control point $\ell_1$. Similarly, the statement require_nsb(y2,20) informs the tool that the user wants to get only 20 significant bits on variable y2 (we consider that a result has $n$ significant bits if the relative error between the exact and approximated results is less than $2^{-n}$). Finally, the minimal precision needed for the inputs and intermediary results satisfying the user assertion is shown in the bottom right corner of Figure 1. In this code, if we consider for instance Line 6, then y1new|20| means that the variable needs 20 significant bits at this point. Similarly, y1 and y2 need 21 bits each and the addition requires 20 bits.

The originality of our method is that we reduce the precision tuning problem to an ILP. For example, taking again Line 6 of the pendulum code, we generate the constraints shown in Equation (3) (this is detailed further in Section 3).

$C_{1}=\left\{\begin{array}{l}\mathsf{nsb}(\ell_{17})\geq\mathsf{nsb}(\ell_{23})+(-1)+\xi(\ell_{23})(\ell_{17},\ell_{22})-(-1),\\ \mathsf{nsb}(\ell_{22})\geq\mathsf{nsb}(\ell_{23})+0+\xi(\ell_{23})(\ell_{17},\ell_{22})-(-1),\\ \mathsf{nsb}(\ell_{19})\geq\mathsf{nsb}(\ell_{22})+\xi(\ell_{22})(\ell_{19},\ell_{21})-1,\\ \mathsf{nsb}(\ell_{21})\geq\mathsf{nsb}(\ell_{22})+\xi(\ell_{22})(\ell_{19},\ell_{21})-1,\\ \mathsf{nsb}(\ell_{23})\geq\mathsf{nsb}(\ell_{24}),\ \ \xi(\ell_{23})(\ell_{17},\ell_{22})\geq 1,\ \ \xi(\ell_{22})(\ell_{19},\ell_{21})\geq 1\end{array}\right\}$ (3)

The first two constraints are for the addition. Basically, $\mathsf{nsb}(\ell_{23})$ stands for the number of significant bits as described in Section 3.1. It represents the difference between the unit in the first place ($\mathsf{ufp}$, see Section 3.1) of the result of the sum and the $\mathsf{ufp}$ of its error, denoted $\mathsf{ufp_e}(\ell_{23})$. We have $\mathsf{ufp_e}(\ell_{17})=\mathsf{ufp}(\ell_{17})-\mathsf{nsb}(\ell_{17})$. As mentioned previously, the $\mathsf{ufp}$ are computed by a prior range analysis; at constraint generation time, they are therefore constants. For our example, $\mathsf{ufp}(\ell_{17})=-1$. This quantity occurs in the first two constraints. The next two constraints are for the multiplication. The fifth constraint, $\mathsf{nsb}(\ell_{23})\geq\mathsf{nsb}(\ell_{24})$, is for the assignment, and the last two constraints are for the constant functions $\xi(\ell_{23})(\ell_{17},\ell_{22})$ and $\xi(\ell_{22})(\ell_{19},\ell_{21})$, respectively for the addition and the multiplication (see the next paragraphs and Section 3 for more details). Note that XXX generates such constraints for all the statements of the program.

For a user requirement of 20 bits on the variable y2, as shown in the original program in the top left corner of Figure 1 (all variables are initially in double precision), XXX succeeds in tuning the majority of the variables of the pendulum program to single precision, with a total of 274 bits at the bit level (originally, the program used 689 bits). The new mixed-precision formats obtained for Line 6 are y1new|20| = y1|21| +|20| y2|22| $\times$|22| h|22|.
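For illustration, the small ILP of Equation (3) can be fed to an off-the-shelf solver. The Python sketch below is ours and uses PuLP (XXX relies on GLPK): the bound $\mathsf{nsb}(\ell_{24})\geq 20$ is an assumed stand-in for the user requirement propagated to y1new, and $\xi$ is frozen to its pessimistic value 1. The optimum reproduces the formats reported above.

# Solving the constraints C1 of Equation (3) with PuLP (our sketch; the
# paper's tool uses GLPK). nsb(l24) >= 20 is an assumed stand-in for the
# propagated user requirement; xi is the pessimistic carry bit 1.
import pulp

prob = pulp.LpProblem("bit_level_tuning", pulp.LpMinimize)
nsb = {l: pulp.LpVariable(f"nsb_{l}", lowBound=0, cat="Integer")
       for l in (17, 19, 21, 22, 23, 24)}
xi = 1

prob += pulp.lpSum(nsb.values())               # minimize the total nsb
prob += nsb[17] >= nsb[23] + (-1) + xi - (-1)  # addition, operand l17
prob += nsb[22] >= nsb[23] + 0 + xi - (-1)     # addition, operand l22
prob += nsb[19] >= nsb[22] + xi - 1            # multiplication, operand l19
prob += nsb[21] >= nsb[22] + xi - 1            # multiplication, operand l21
prob += nsb[23] >= nsb[24]                     # assignment y1new
prob += nsb[24] >= 20                          # assumed user requirement

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({l: int(v.value()) for l, v in nsb.items()})
# {17: 21, 19: 22, 21: 22, 22: 22, 23: 20, 24: 20}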

Let us now focus on the term $\xi(\ell_{23})(\ell_{17},\ell_{22})$ (for the addition). In our ILP, we always assume that $\xi$ is the constant function equal to 1. This accounts for the carry bit which can be propagated up to the $\mathsf{ufp}$ and $\mathsf{ufp_e}$ and increment them by 1, which is correct but pessimistic. In large codes, this over-approximation becomes very costly when many operations are chained, and the resulting over-estimation of the errors would be considerable, even though in many cases adding this carry bit is useless because the operands and their errors do not overlap. It is then crucial to use a more precise function $\xi$. Unfortunately, when we model this optimization, the problem is no longer linear ($\min$ and $\max$ operators arise) and we have to use the PI technique [7] to solve it (see Section 3.3).

In this case, by analyzing Line 6 of our program, we have to add the new constraints of Equation (4), along with the former ones. In fact, policy iteration makes it possible to break the $\min$ in the function $\xi(\ell_{23})(\ell_{17},\ell_{22})$ by choosing among the $\max$ between $\mathsf{ufp}(\ell_{22})-\mathsf{ufp}(\ell_{17})+\mathsf{nsb}(\ell_{17})-\mathsf{nsb}(\ell_{22})-\mathsf{nsb_e}(\ell_{22})$ and 0, the $\max$ between $\mathsf{ufp}(\ell_{17})-\mathsf{ufp}(\ell_{22})+\mathsf{nsb}(\ell_{22})-\mathsf{nsb}(\ell_{17})-\mathsf{nsb_e}(\ell_{17})$ and 0, and the constant 1. Next, it becomes possible to solve the corresponding ILP. If no fixpoint is reached, XXX iterates until a solution is found, as shown in Section 3.3. By applying this optimization, the new formats of Line 6 are y1new|20| = y1|21| +|20| y2|21| $\times$|22| h|21|. Compared with the formats obtained above, a gain of 1 bit of precision is observed on variables y2 and h (272 bits in total at the bit level for the optimized program). The program in the bottom right corner of Figure 1 shows the new optimized formats obtained with the policy iteration technique.

$C_{2}=\left\{\begin{array}{l}\mathsf{nsb_e}(\ell_{23})\geq\mathsf{nsb_e}(\ell_{17}),\ \ \mathsf{nsb_e}(\ell_{23})\geq\mathsf{nsb_e}(\ell_{22}),\\ \mathsf{nsb_e}(\ell_{23})\geq-1-0+\mathsf{nsb}(\ell_{22})-\mathsf{nsb}(\ell_{17})+\mathsf{nsb_e}(\ell_{22})+\xi(\ell_{23})(\ell_{17},\ell_{22}),\\ \mathsf{nsb_e}(\ell_{23})\geq 0-(-1)+\mathsf{nsb}(\ell_{17})-\mathsf{nsb}(\ell_{22})+\mathsf{nsb_e}(\ell_{17})+\xi(\ell_{23})(\ell_{17},\ell_{22}),\\ \mathsf{nsb_e}(\ell_{23})\geq\mathsf{nsb_e}(\ell_{24}),\\ \mathsf{nsb_e}(\ell_{22})\geq\mathsf{nsb}(\ell_{19})+\mathsf{nsb_e}(\ell_{19})+\mathsf{nsb_e}(\ell_{21})-2,\\ \mathsf{nsb_e}(\ell_{22})\geq\mathsf{nsb}(\ell_{21})+\mathsf{nsb_e}(\ell_{21})+\mathsf{nsb_e}(\ell_{19})-2,\\ \xi(\ell_{23})(\ell_{17},\ell_{22})=\min\left(\begin{array}{l}\max\big(0-(-1)+\mathsf{nsb}(\ell_{17})-\mathsf{nsb}(\ell_{22})-\mathsf{nsb_e}(\ell_{22}),0\big),\\ \max\big(-1-0+\mathsf{nsb}(\ell_{22})-\mathsf{nsb}(\ell_{17})-\mathsf{nsb_e}(\ell_{17}),0\big),\ 1\end{array}\right)\end{array}\right\}$ (4)

 

$\varepsilon(c^{\ell}\texttt{\#p})\leq 2^{\mathsf{ufp}(c)-\min(\texttt{p},\,\mathsf{prec}(\ell))}$ (5)

$\varepsilon(c_{1}^{\ell_{1}}\texttt{\#}p_{1}+^{\ell}c_{2}^{\ell_{2}}\texttt{\#}p_{2})\leq\varepsilon(c_{1}^{\ell_{1}}\texttt{\#}p_{1})+\varepsilon(c_{2}^{\ell_{2}}\texttt{\#}p_{2})+2^{\mathsf{ufp}(c_{1}+c_{2})-\mathsf{prec}(\ell)}$ (6)

$\varepsilon(c_{1}^{\ell_{1}}\texttt{\#}p_{1}-^{\ell}c_{2}^{\ell_{2}}\texttt{\#}p_{2})\leq\varepsilon(c_{1}^{\ell_{1}}\texttt{\#}p_{1})+\varepsilon(c_{2}^{\ell_{2}}\texttt{\#}p_{2})+2^{\mathsf{ufp}(c_{1}-c_{2})-\mathsf{prec}(\ell)}$ (7)

$\varepsilon(c_{1}^{\ell_{1}}\texttt{\#}p_{1}\times^{\ell}c_{2}^{\ell_{2}}\texttt{\#}p_{2})\leq c_{1}\cdot\varepsilon(c_{2}^{\ell_{2}}\texttt{\#}p_{2})+c_{2}\cdot\varepsilon(c_{1}^{\ell_{1}}\texttt{\#}p_{1})+\varepsilon(c_{1}^{\ell_{1}}\texttt{\#}p_{1})\cdot\varepsilon(c_{2}^{\ell_{2}}\texttt{\#}p_{2})+2^{\mathsf{ufp}(c_{1}\times c_{2})-\mathsf{prec}(\ell)}$ (8)

$\varepsilon(c_{1}^{\ell_{1}}\texttt{\#}p_{1}\div^{\ell}c_{2}^{\ell_{2}}\texttt{\#}p_{2})\leq\varepsilon(c_{1}^{\ell_{1}}\texttt{\#}p_{1}\times^{\ell}c_{2}^{\prime\,\ell_{2}}\texttt{\#}p_{2})\quad\text{with}\quad c_{2}^{\prime}=\frac{1}{c_{2}}$ (9)

$\varepsilon\big(\phi(c^{\ell_{1}}\texttt{\#p})^{\ell}\big)\leq 2^{\mathsf{ufp}(\phi(c))-\texttt{p}+\varphi}+2^{\mathsf{ufp}(\phi(c))-\mathsf{prec}(\ell)}\quad\text{with}\quad\phi\in\{\sin,\cos,\tan,\log,\ldots\}$ (10)

$\varepsilon\big(\sqrt{c^{\ell_{1}}\texttt{\#p}}^{\,\ell}\big)\leq 2^{\mathsf{ufp}(\sqrt{c})-\texttt{p}}+2^{\mathsf{ufp}(\sqrt{c})-\mathsf{prec}(\ell)}$ (11)

 

Figure 2: Numerical error on arithmetic expressions.

3 Constraints Generation for Bit-Level Precision Tuning

In this section, we start by providing essential definitions for understanding the rest of the article. We also define a simple imperative language (see Figure 3) from which we generate semantic equations in order to determine the least precision needed for the numerical values of a program. Then, we focus on the difference between the two sets of constraints obtained when using the simple ILP formulation and the more complex PI formulation, which optimizes the carry bit that can propagate throughout computations. The operational semantics of the language, as well as the theorem proving that the solution to the system of constraints gives the desired $\mathsf{nsb}$ when running programs, are detailed further in Section 4.

3.1 Elements of Computer Arithmetic

Our technique is independent of a particular computer arithmetic. In fact, we manipulate numbers for which we know the unit in the first place ($\mathsf{ufp}$) and the number of significant bits ($\mathsf{nsb}$), defined as follows.

Unit in the First Place

The unit in the first place of a real number $x$ (possibly encoded, up to some rounding mode, by a floating-point or a fixed-point number) is given in Equation (12). This function, which is independent of the representation of $x$, will be used further in this section to describe the error propagation across the computations.

$\mathsf{ufp}(x)=\min\{i\in\mathbb{Z}:2^{i+1}>x\}=\lfloor\log_{2}(x)\rfloor\,.$ (12)
Number of Significant Bits

Intuitively, $\mathsf{nsb}(x)$ is the number of significant bits of $x$. Formally, following Parker [17], if $\mathsf{nsb}(x)=k$, for $x\neq 0$, then the error $\varepsilon(x)$ on $x$ is less than $2^{\mathsf{ufp}(x)-k}$. If $x=0$ then $\mathsf{nsb}(x)=0$. For example, if the exact binary value $1.0101$ is approximated by either $x=1.010$ or $x=1.011$ then $\mathsf{nsb}(x)=3$.

In the following, we also use $\mathsf{ufp_e}(x)$ and $\mathsf{nsb_e}(x)$ to denote the $\mathsf{ufp}$ and $\mathsf{nsb}$ of the error on $x$, i.e. $\mathsf{ufp_e}(x)=\mathsf{ufp}(\varepsilon(x))$ and $\mathsf{nsb_e}(x)=\mathsf{nsb}(\varepsilon(x))$.
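To make these definitions concrete, here is a small executable Python sketch (ours, for illustration only) of Equation (12) and of the error bound attached to an $\mathsf{nsb}$, for a nonzero value x:

# Sketch of ufp (Equation (12)) and of the nsb-induced error bound.
import math

def ufp(x: float) -> int:
    """Unit in the first place: ufp(x) = floor(log2(|x|)), x != 0."""
    return math.floor(math.log2(abs(x)))

def error_bound(x: float, k: int) -> float:
    """If nsb(x) = k, the error on x is below 2^(ufp(x) - k)."""
    return 2.0 ** (ufp(x) - k)

# 1.0101_2 = 1.3125 approximated by 1.010_2 = 1.25: nsb = 3, and the
# actual error 0.0625 is indeed below the bound 2^(0 - 3) = 0.125.
print(ufp(1.3125), error_bound(1.3125, 3))  # 0 0.125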

In this article, we consider a finite-precision arithmetic, independently of any particular representation (IEEE754 [1], POSIT [7], \ldots). Nevertheless, the representation being finite, roundoff errors may arise when representing values or performing elementary operations. These errors are defined in Figure 2. First of all, by definition, using the function $\mathsf{ufp}$ of Equation (12), for a number $x$ with p significant bits, the roundoff error $\varepsilon(x)$ is bounded as shown in Equation (13).

$\varepsilon(x)\leq 2^{\mathsf{ufp}(x)-\texttt{p}+1}$ (13)

Let $\mathsf{prec}(\ell)$ be the precision of the operation at control point $\ell$; for example, the precision is 53 bits for the IEEE754 binary64 format. In fact, $\mathsf{prec}(\ell)$ is used to compute the truncation error of the operations. Equations (5) to (11) of Figure 2 define the numerical errors of the arithmetic expressions of our language (presented in Section 3.2). For constants occurring in the code, the initial precision must be given by the user, and we write $c^{\ell}\texttt{\#p}$ for a constant $c$ with p significant bits at control point $\ell$. Then, following Equation (13), $\varepsilon(c^{\ell}\texttt{\#p})$ is defined in Equation (5) of Figure 2: the $\mathsf{nsb}$ of the constant is $\min(\texttt{p},\mathsf{prec}(\ell))$ and, consequently, the error is bounded by $2^{\mathsf{ufp}(c)-\min(\texttt{p},\mathsf{prec}(\ell))}$. In Equations (6) to (9), we propagate the errors on the operands and we add the roundoff error due to the elementary operation itself. For an elementary function $\phi$, we assume that $\varphi$ bits are lost, as shown in Equation (10) (more details are given in Section 3.2). The last equation is for the square root; this function being exactly computable, it introduces no more error than an elementary operation.
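As a numeric illustration of Equation (5), the following sketch (ours) assumes an IEEE754 binary64 operation, i.e. $\mathsf{prec}(\ell)=53$, and checks the bound for the constant 9.81 of the pendulum:

# Numeric check of Equation (5) under the assumption prec(l) = 53.
import math

def const_error_bound(c: float, p: int, prec: int = 53) -> float:
    ufp_c = math.floor(math.log2(abs(c)))   # ufp(9.81) = 3
    return 2.0 ** (ufp_c - min(p, prec))

print(const_error_bound(9.81, p=20))        # 2^(3-20) = 2^-17 ~ 7.6e-06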

3.2 Integer Linear Problem Formulation

First, we define in Figure 3 the simple imperative language in which our input programs are written.

 

$\ell\in Lab$ $\qquad$ $x\in Id$ $\qquad$ $\odot\in\{+,-,\times,\div\}$ $\qquad$ $math\in\{\sin,\cos,\tan,\arcsin,\log,\ldots\}$

Expr $\ni$ e ::= $c\texttt{\#p}^{\ell}$ | $x^{\ell}$ | $e_{1}^{\ell_{1}}\odot^{\ell}e_{2}^{\ell_{2}}$ | $math(e^{\ell_{1}})^{\ell}$ | $sqrt(e^{\ell_{1}})^{\ell}$

Cmd $\ni$ c ::= $c_{1}^{\ell_{1}};c_{2}^{\ell_{2}}$ | $x\texttt{:=}^{\ell}e^{\ell_{1}}$ | $\textbf{while}^{\ell}\ b^{\ell_{0}}\ \textbf{do}\ c_{1}^{\ell_{1}}$ | $\textbf{if}^{\ell}\ b^{\ell_{0}}\ \textbf{then}\ c_{1}^{\ell_{1}}\ \textbf{else}\ c_{2}^{\ell_{2}}$ | $\textbf{require\_nsb}(x,n)^{\ell}$

Figure 3: Simple imperative language of constraints.

 

$\mathcal{E}[c\texttt{\#p}^{\ell}]\varrho=\emptyset\quad(\textsc{Const})\qquad\mathcal{E}[x^{\ell}]\varrho=\big\{\mathsf{nsb}(\varrho(x))\geq\mathsf{nsb}(\ell)\big\}\quad(\textsc{Id})$

$\mathcal{E}[e_{1}^{\ell_{1}}+^{\ell}e_{2}^{\ell_{2}}]\varrho=\mathcal{E}[e_{1}^{\ell_{1}}]\varrho\,\cup\,\mathcal{E}[e_{2}^{\ell_{2}}]\varrho\,\cup\,\big\{\mathsf{nsb}(\ell_{1})\geq\mathsf{nsb}(\ell)+\mathsf{ufp}(\ell_{1})-\mathsf{ufp}(\ell)+\xi(\ell)(\ell_{1},\ell_{2}),\ \mathsf{nsb}(\ell_{2})\geq\mathsf{nsb}(\ell)+\mathsf{ufp}(\ell_{2})-\mathsf{ufp}(\ell)+\xi(\ell)(\ell_{1},\ell_{2})\big\}\quad(\textsc{Add})$

$\mathcal{E}[e_{1}^{\ell_{1}}-^{\ell}e_{2}^{\ell_{2}}]\varrho=\mathcal{E}[e_{1}^{\ell_{1}}]\varrho\,\cup\,\mathcal{E}[e_{2}^{\ell_{2}}]\varrho\,\cup\,\big\{\mathsf{nsb}(\ell_{1})\geq\mathsf{nsb}(\ell)+\mathsf{ufp}(\ell_{1})-\mathsf{ufp}(\ell)+\xi(\ell)(\ell_{1},\ell_{2}),\ \mathsf{nsb}(\ell_{2})\geq\mathsf{nsb}(\ell)+\mathsf{ufp}(\ell_{2})-\mathsf{ufp}(\ell)+\xi(\ell)(\ell_{1},\ell_{2})\big\}\quad(\textsc{Sub})$

$\mathcal{E}[e_{1}^{\ell_{1}}\times^{\ell}e_{2}^{\ell_{2}}]\varrho=\mathcal{E}[e_{1}^{\ell_{1}}]\varrho\,\cup\,\mathcal{E}[e_{2}^{\ell_{2}}]\varrho\,\cup\,\big\{\mathsf{nsb}(\ell_{1})\geq\mathsf{nsb}(\ell)+\xi(\ell)(\ell_{1},\ell_{2})-1,\ \mathsf{nsb}(\ell_{2})\geq\mathsf{nsb}(\ell)+\xi(\ell)(\ell_{1},\ell_{2})-1\big\}\quad(\textsc{Mult})$

$\mathcal{E}[e_{1}^{\ell_{1}}\div^{\ell}e_{2}^{\ell_{2}}]\varrho=\mathcal{E}[e_{1}^{\ell_{1}}]\varrho\,\cup\,\mathcal{E}[e_{2}^{\ell_{2}}]\varrho\,\cup\,\big\{\mathsf{nsb}(\ell_{1})\geq\mathsf{nsb}(\ell)+\xi(\ell)(\ell_{1},\ell_{2})-1,\ \mathsf{nsb}(\ell_{2})\geq\mathsf{nsb}(\ell)+\xi(\ell)(\ell_{1},\ell_{2})-1\big\}\quad(\textsc{Div})$

$\mathcal{E}\big[\sqrt{e^{\ell_{1}}}^{\,\ell}\big]\varrho=\mathcal{E}[e^{\ell_{1}}]\varrho\,\cup\,\big\{\mathsf{nsb}(\ell_{1})\geq\mathsf{nsb}(\ell)\big\}\quad(\textsc{Sqrt})$

$\mathcal{E}\big[\phi(e^{\ell_{1}})^{\ell}\big]\varrho=\mathcal{E}[e^{\ell_{1}}]\varrho\,\cup\,\big\{\mathsf{nsb}(\ell_{1})\geq\mathsf{nsb}(\ell)+\varphi\big\}$ with $\phi\in\{\sin,\cos,\tan,\log,\ldots\}\quad(\textsc{Math})$

$\mathcal{C}[x\texttt{:=}^{\ell}e^{\ell_{1}}]\varrho=(C,\varrho[x\mapsto\ell])$ where $C=\mathcal{E}[e^{\ell_{1}}]\varrho\cup\{\mathsf{nsb}(\ell_{1})\geq\mathsf{nsb}(\ell)\}\quad(\textsc{Assign})$

$\mathcal{C}[c_{1}^{\ell_{1}};c_{2}^{\ell_{2}}]\varrho=(C_{1}\cup C_{2},\varrho_{2})$ where $(C_{1},\varrho_{1})=\mathcal{C}[c_{1}^{\ell_{1}}]\varrho$ and $(C_{2},\varrho_{2})=\mathcal{C}[c_{2}^{\ell_{2}}]\varrho_{1}\quad(\textsc{Seq})$

$\mathcal{C}[\textbf{if}^{\ell}\ e^{\ell_{0}}\ \textbf{then}\ c_{1}^{\ell_{1}}\ \textbf{else}\ c_{2}^{\ell_{2}}]\varrho=(C_{1}\cup C_{2}\cup C_{3},\varrho')$ where $\forall x\in Id,\ \varrho'(x)=\ell$, $(C_{1},\varrho_{1})=\mathcal{C}[c_{1}^{\ell_{1}}]\varrho$, $(C_{2},\varrho_{2})=\mathcal{C}[c_{2}^{\ell_{2}}]\varrho$ and $C_{3}=\bigcup_{x\in Id}\{\mathsf{nsb}(\varrho_{1}(x))\geq\mathsf{nsb}(\ell),\ \mathsf{nsb}(\varrho_{2}(x))\geq\mathsf{nsb}(\ell)\}\quad(\textsc{Cond})$

$\mathcal{C}[\textbf{while}^{\ell}\ e^{\ell_{0}}\ \textbf{do}\ c_{1}^{\ell_{1}}]\varrho=(C_{1}\cup C_{2},\varrho')$ where $\forall x\in Id,\ \varrho'(x)=\ell$, $(C_{1},\varrho_{1})=\mathcal{C}[c_{1}^{\ell_{1}}]\varrho'$ and $C_{2}=\bigcup_{x\in Id}\{\mathsf{nsb}(\varrho(x))\geq\mathsf{nsb}(\ell),\ \mathsf{nsb}(\varrho_{1}(x))\geq\mathsf{nsb}(\ell)\}\quad(\textsc{While})$

$\mathcal{C}[\textbf{require\_nsb}(x,\texttt{p})^{\ell}]\varrho=\big\{\mathsf{nsb}(\varrho(x))\geq\texttt{p}\big\}\quad(\textsc{Req})$

 
$\xi(\ell)(\ell_{1},\ell_{2})=1$

Figure 4: ILP constraints with pessimistic carry bit propagation $\xi=1$.

We denote by $Id$ the set of identifiers and by $Lab$ the set of control points of the program, used to assign to each element $e\in Expr$ and $c\in Cmd$ of our language a unique control point $\ell\in Lab$. First, in $c\texttt{\#p}$, p indicates the number of significant bits of the constant $c$ in the source code. The parameter p is computed by our tool XXX when solving the constraints. Next, the statement require_nsb(x,n) indicates the number of significant bits $n$ that a variable $x$ must have at control point $\ell$. The rest of the grammar is standard.

As we have mentioned, we are able to reduce the problem of determining the lowest precision of the variables and intermediary values of a program to an Integer Linear Problem (ILP) by reasoning on their unit in the first place ($\mathsf{ufp}$) and their number of significant bits. In addition, we assign to each control point $\ell$ an integer variable $\mathsf{nsb}(\ell)$ corresponding to the $\mathsf{nsb}$ of the arithmetic expression at this point; $\mathsf{nsb}(\ell)$ is determined by solving the ILP generated by the rules of Figure 4.

Let us now focus on the rules of Figure 4, where $\varrho:Id\rightarrow Id\times Lab$ is an environment which relates each identifier $x$ to its last assignment $x^{\ell}$: assuming that $x\texttt{:=}^{\ell}e^{\ell_{1}}$ is the last assignment of $x$, the environment $\varrho$ maps $x$ to $x^{\ell}$. Then, $\mathcal{E}[e]\varrho$ generates the set of constraints for an expression $e\in Expr$ in the environment $\varrho$. We now formally define these constraints for each element of our language. No constraint is generated for a constant $c\texttt{\#p}$, as stated in Rule (Const) of Figure 4. In Rule (Id) for a variable $x^{\ell}$, we require that the $\mathsf{nsb}$ at control point $\ell$ is less than the $\mathsf{nsb}$ at the last assignment of $x$, given by $\varrho(x)$. For a binary operator $\odot\in\{+,-,\times,\div\}$, we first generate the sets of constraints $\mathcal{E}[e_{1}^{\ell_{1}}]\varrho$ and $\mathcal{E}[e_{2}^{\ell_{2}}]\varrho$ for the operands at control points $\ell_{1}$ and $\ell_{2}$. Considering Rule (Add), the result of the addition of two numbers is stored at control point $\ell$. Since a range determination is performed before the accuracy analysis, $\mathsf{ufp}(\ell)$, $\mathsf{ufp}(\ell_{1})$ and $\mathsf{ufp}(\ell_{2})$ are known at constraint generation time.
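As an illustration of how these rules can be turned into code, the following Python sketch (ours, not the actual implementation of XXX) renders Rules (Const), (Id) and (Add) with the pessimistic carry bit $\xi=1$; expressions are encoded as tuples carrying their control point, env maps each variable to the label of its last assignment, and ufp is the table produced by the prior range analysis:

# Toy constraint generator for the rules (Const), (Id) and (Add).
def gen_expr(e, env, ufp, cs):
    kind = e[0]
    if kind == "const":                 # (Const): no constraint
        _, _c, lab = e
        return lab
    if kind == "var":                   # (Id): nsb(rho(x)) >= nsb(l)
        _, x, lab = e
        cs.append(f"nsb({env[x]}) >= nsb({lab})")
        return lab
    if kind == "add":                   # (Add), with xi = 1
        _, e1, e2, lab = e
        l1 = gen_expr(e1, env, ufp, cs)
        l2 = gen_expr(e2, env, ufp, cs)
        for li in (l1, l2):             # nsb(li) >= nsb(l) + ufp(li) - ufp(l) + xi
            cs.append(f"nsb({li}) >= nsb({lab}) + ({ufp[li] - ufp[lab] + 1})")
        return lab
    raise ValueError(kind)

# x^{l6} +^{l5} y^{l7}, with x last assigned at l2 and y at l4, and
# ufp(l5)=3, ufp(l6)=2, ufp(l7)=1 (e.g. x = 5.0, y = 3.0, x + y = 8.0).
cs = []
gen_expr(("add", ("var", "x", 6), ("var", "y", 7), 5),
         {"x": 2, "y": 4}, {5: 3, 6: 2, 7: 1}, cs)
print(cs)  # ['nsb(2) >= nsb(6)', 'nsb(4) >= nsb(7)',
           #  'nsb(6) >= nsb(5) + (0)', 'nsb(7) >= nsb(5) + (-1)']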

Now, before going further in the explanation of the constraint generation for binary operations, we introduce the function $\xi$ which computes the carry bit that can occur throughout an addition (a similar reasoning is done for the other elementary operations). In the present ILP of Figure 4, we over-approximate the function $\xi$ by $\xi(\ell)(\ell_{1},\ell_{2})=1$ for all $\ell$, $\ell_{1}$ and $\ell_{2}$, thus assuming the worst case, i.e. a carry bit is added at each operation. We will optimize $\xi$ in Section 3.3, but the problem will then no longer be an ILP. To wrap up, for the addition (Rule (Add)), the $\mathsf{nsb}(\ell)$ of the exact result is the number of bits between the $\mathsf{ufp}$ of the result and the $\mathsf{ufp}$ of the error $e$, which is:

$e=\max\big(\mathsf{ufp}(\ell_{1})-\mathsf{nsb}(\ell_{1}),\ \mathsf{ufp}(\ell_{2})-\mathsf{nsb}(\ell_{2})\big)+\xi(\ell)(\ell_{1},\ell_{2})$ (14)

Hence, the error on the addition in precision $\mathsf{prec}(\ell)$ is

$e'=\max\big(\mathsf{ufp}(\ell_{1})-\mathsf{nsb}(\ell_{1}),\ \mathsf{ufp}(\ell_{2})-\mathsf{nsb}(\ell_{2}),\ \mathsf{ufp}(\ell)-\mathsf{prec}(\ell)\big)+\xi(\ell)(\ell_{1},\ell_{2})$ (15)

Equation (14) is obtained from Equation (6):

$\varepsilon(c_{1}^{\ell_{1}}\texttt{\#}p_{1}+^{\ell}c_{2}^{\ell_{2}}\texttt{\#}p_{2})\leq\varepsilon(c_{1}^{\ell_{1}}\texttt{\#}p_{1})+\varepsilon(c_{2}^{\ell_{2}}\texttt{\#}p_{2})+2^{\mathsf{ufp}(c_{1}+c_{2})-\mathsf{prec}(\ell)}$
$\leq 2\cdot\max\big(\varepsilon(c_{1}^{\ell_{1}}\texttt{\#}p_{1}),\varepsilon(c_{2}^{\ell_{2}}\texttt{\#}p_{2})\big)+2^{\mathsf{ufp}(c_{1}+c_{2})-\mathsf{prec}(\ell)}$
$\leq 2^{\max(\mathsf{ufp}(\ell_{1})-\mathsf{nsb}(\ell_{1}),\,\mathsf{ufp}(\ell_{2})-\mathsf{nsb}(\ell_{2}))+\xi(\ell)(\ell_{1},\ell_{2})}+2^{\mathsf{ufp}(c_{1}+c_{2})-\mathsf{prec}(\ell)}$

Since $\mathsf{nsb}(\ell)\leq\mathsf{prec}(\ell)$, we may get rid of the last term of $e'$ in Equation (15), and the two constraints generated by Rule (Add) are derived from Equation (16).

$\mathsf{nsb}(\ell)\leq\mathsf{ufp}(\ell)-\max\big(\mathsf{ufp}(\ell_{1})-\mathsf{nsb}(\ell_{1}),\ \mathsf{ufp}(\ell_{2})-\mathsf{nsb}(\ell_{2})\big)-\xi(\ell)(\ell_{1},\ell_{2})$ (16)

For example, let us consider the piece of code hereafter.

1 x^{l2} = 5.0^{l1}; y^{l4} = 3.0^{l3};
2 z^{l8} = x^{l6} +^{l5} y^{l7};
3 require_nsb(z,15)^{l9};

Wrapping up our constraints, we have $15\leq\mathsf{nsb}(\ell_{9})\leq\mathsf{nsb}(\ell_{8})\leq\mathsf{nsb}(\ell_{5})$ and $\mathsf{nsb}(\ell_{6})\geq\mathsf{nsb}(\ell_{8})+\mathsf{ufp}(\ell_{6})-\mathsf{ufp}(\ell_{7})+\xi(\ell_{8})(\ell_{6},\ell_{7})$. Since $\mathsf{ufp}(\ell_{6})=2$ and $\mathsf{ufp}(\ell_{7})=1$, we have $\mathsf{nsb}(\ell_{6})\geq\mathsf{nsb}(\ell_{8})+1+\xi(\ell_{8})(\ell_{6},\ell_{7})$ and, consequently, $\mathsf{nsb}(\ell_{6})\geq 15+1+1=17$. Rule (Sub) for the subtraction is obtained similarly to the addition case. For Rule (Mult) of the multiplication (and in the same manner Rule (Div)), the reasoning mimics that of the addition. Let $x$ and $y$ be two floating-point numbers and $z$ the result of their product, $z=x^{\ell_{1}}\times^{\ell}y^{\ell_{2}}$. We denote by $\varepsilon(x)$, $\varepsilon(y)$ and $\varepsilon(z)$ the errors on $x$, $y$ and $z$, respectively. The error of this multiplication is $\varepsilon(z)=x\cdot\varepsilon(y)+y\cdot\varepsilon(x)+\varepsilon(x)\cdot\varepsilon(y)$. These numbers are bounded as shown in Equation (17).

$2^{\mathsf{ufp}(x)}\leq x\leq 2^{\mathsf{ufp}(x)+1}\quad\text{and}\quad 2^{\mathsf{ufp}(x)-\mathsf{nsb}(x)}\leq\varepsilon(x)\leq 2^{\mathsf{ufp}(x)-\mathsf{nsb}(x)+1}$
$2^{\mathsf{ufp}(y)}\leq y\leq 2^{\mathsf{ufp}(y)+1}\quad\text{and}\quad 2^{\mathsf{ufp}(y)-\mathsf{nsb}(y)}\leq\varepsilon(y)\leq 2^{\mathsf{ufp}(y)-\mathsf{nsb}(y)+1}$
$2^{\mathsf{ufp}(x)+\mathsf{ufp}(y)-\mathsf{nsb}(y)}+2^{\mathsf{ufp}(y)+\mathsf{ufp}(x)-\mathsf{nsb}(x)}+2^{\mathsf{ufp}(x)+\mathsf{ufp}(y)-\mathsf{nsb}(x)-\mathsf{nsb}(y)}\leq\varepsilon(z)\leq 2^{\mathsf{ufp}(z)-\mathsf{nsb}(z)+1}$ (17)

By getting rid of the last term $2^{\mathsf{ufp}(x)+\mathsf{ufp}(y)-\mathsf{nsb}(x)-\mathsf{nsb}(y)}$ of the error $\varepsilon(z)$, which is strictly smaller than the two former ones, by assuming that $\mathsf{ufp}(x)+\mathsf{ufp}(y)=\mathsf{ufp}(z)$ and, finally, by reasoning on the exponents, we obtain the equations of Rule (Mult).

$\mathsf{nsb}(x)\geq\mathsf{nsb}(z)+\xi(\ell)(\ell_{1},\ell_{2})-1\quad\text{and}\quad\mathsf{nsb}(y)\geq\mathsf{nsb}(z)+\xi(\ell)(\ell_{1},\ell_{2})-1$
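As a quick sanity check, the decomposition $\varepsilon(z)=x\cdot\varepsilon(y)+y\cdot\varepsilon(x)+\varepsilon(x)\cdot\varepsilon(y)$ used above is exact for perturbed inputs; the minimal Python verification below is ours, with arbitrarily chosen dyadic values so that every operation is exact in binary64:

# With x' = x + ex and y' = y + ey, the error on the product is exactly
# x'*y' - x*y = x*ey + y*ex + ex*ey (all operations below are exact).
x, y, ex, ey = 1.5, 2.5, 2.0**-12, 2.0**-10
z_err = (x + ex) * (y + ey) - x * y
print(z_err == x * ey + y * ex + ex * ey)  # True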

Although the square root (Rule (Sqrt)) is included, e.g., in the IEEE754 Standard, this is not the case for the other elementary functions such as the natural logarithm, the exponential function and the hyperbolic and trigonometric functions, gathered in Rule (Math). Moreover, each implementation of these functions has its own $\mathsf{nsb}$, which we would have to know in order to model the propagation of errors in our analyses. To cope with this limitation, we consider that each elementary function introduces a loss of precision of $\varphi$ bits, where $\varphi\in\mathbb{N}$ is a parameter of the analysis and, consequently, of our tool XXX.

The rules for commands are rather classical; we use control points to distinguish multiple assignments of the same variable and also to implement joins in conditionals and loops. Given a command $c$ and an environment $\varrho$, $\mathcal{C}[c]\varrho$ returns a pair $(C,\varrho')$ made of a set $C$ of constraints and of a new environment $\varrho'$. The function $\mathcal{C}$ is defined by induction on the structure of commands in Figures 4 and 5. For conditionals, we generate the constraints for the then and else branches plus additional constraints to join the results of both branches. For loops, we relate the number of significant bits of the variables at the end of the loop body to the $\mathsf{nsb}$ of the same variables at the beginning of the loop. A sketch of this constraint generation for commands is given below.
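Extending the expression sketch of Section 3.2 to commands (again ours, not the implementation of XXX), the rules (Assign), (Seq) and (Req) thread the environment $\varrho$ through the program and return the updated one, as $\mathcal{C}[\cdot]$ does:

# Toy constraint generator for (Assign), (Seq) and (Req); reuses the
# gen_expr function sketched earlier in this section.
def gen_cmd(c, env, ufp, cs):
    kind = c[0]
    if kind == "assign":                # (Assign): x :=^l e^{l1}
        _, x, e, lab = c
        l1 = gen_expr(e, env, ufp, cs)
        cs.append(f"nsb({l1}) >= nsb({lab})")
        return {**env, x: lab}          # rho[x -> l]
    if kind == "seq":                   # (Seq): c1^{l1}; c2^{l2}
        _, c1, c2 = c
        return gen_cmd(c2, gen_cmd(c1, env, ufp, cs), ufp, cs)
    if kind == "require":               # (Req): require_nsb(x, p)^l
        _, x, p, _lab = c
        cs.append(f"nsb({env[x]}) >= {p}")
        return env
    raise ValueError(kind)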

 

$\mathcal{E}'[c\texttt{\#p}^{\ell}]\varrho=\big\{\mathsf{nsb_e}(\ell)=0\big\}\quad(\textsc{Const}')\qquad\mathcal{E}'[x^{\ell}]\varrho=\big\{\mathsf{nsb_e}(\varrho(x))\geq\mathsf{nsb_e}(\ell)\big\}\quad(\textsc{Id}')$

$\mathcal{E}'[e_{1}^{\ell_{1}}+^{\ell}e_{2}^{\ell_{2}}]\varrho=\mathcal{E}'[e_{1}^{\ell_{1}}]\varrho\,\cup\,\mathcal{E}'[e_{2}^{\ell_{2}}]\varrho\,\cup\,\big\{\mathsf{nsb_e}(\ell)\geq\mathsf{nsb_e}(\ell_{1}),\ \mathsf{nsb_e}(\ell)\geq\mathsf{nsb_e}(\ell_{2}),\ \mathsf{nsb_e}(\ell)\geq\mathsf{ufp}(\ell_{1})-\mathsf{ufp}(\ell_{2})+\mathsf{nsb}(\ell_{2})-\mathsf{nsb}(\ell_{1})+\mathsf{nsb_e}(\ell_{2})+\xi(\ell)(\ell_{1},\ell_{2}),\ \mathsf{nsb_e}(\ell)\geq\mathsf{ufp}(\ell_{2})-\mathsf{ufp}(\ell_{1})+\mathsf{nsb}(\ell_{1})-\mathsf{nsb}(\ell_{2})+\mathsf{nsb_e}(\ell_{1})+\xi(\ell)(\ell_{1},\ell_{2})\big\}\quad(\textsc{Add}')$

$\mathcal{E}'[e_{1}^{\ell_{1}}-^{\ell}e_{2}^{\ell_{2}}]\varrho=\mathcal{E}'[e_{1}^{\ell_{1}}]\varrho\,\cup\,\mathcal{E}'[e_{2}^{\ell_{2}}]\varrho\,\cup\,\big\{\mathsf{nsb_e}(\ell)\geq\mathsf{nsb_e}(\ell_{1}),\ \mathsf{nsb_e}(\ell)\geq\mathsf{nsb_e}(\ell_{2}),\ \mathsf{nsb_e}(\ell)\geq\mathsf{ufp}(\ell_{1})-\mathsf{ufp}(\ell_{2})+\mathsf{nsb}(\ell_{2})-\mathsf{nsb}(\ell_{1})+\mathsf{nsb_e}(\ell_{2})+\xi(\ell)(\ell_{1},\ell_{2}),\ \mathsf{nsb_e}(\ell)\geq\mathsf{ufp}(\ell_{2})-\mathsf{ufp}(\ell_{1})+\mathsf{nsb}(\ell_{1})-\mathsf{nsb}(\ell_{2})+\mathsf{nsb_e}(\ell_{1})+\xi(\ell)(\ell_{1},\ell_{2})\big\}\quad(\textsc{Sub}')$

$\mathcal{E}'[e_{1}^{\ell_{1}}\times^{\ell}e_{2}^{\ell_{2}}]\varrho=\mathcal{E}'[e_{1}^{\ell_{1}}]\varrho\,\cup\,\mathcal{E}'[e_{2}^{\ell_{2}}]\varrho\,\cup\,\big\{\mathsf{nsb_e}(\ell)\geq\mathsf{nsb}(\ell_{1})+\mathsf{nsb_e}(\ell_{1})+\mathsf{nsb_e}(\ell_{2})-2,\ \mathsf{nsb_e}(\ell)\geq\mathsf{nsb}(\ell_{2})+\mathsf{nsb_e}(\ell_{2})+\mathsf{nsb_e}(\ell_{1})-2\big\}\quad(\textsc{Mult}')$

$\mathcal{E}'[e_{1}^{\ell_{1}}\div^{\ell}e_{2}^{\ell_{2}}]\varrho=\mathcal{E}'[e_{1}^{\ell_{1}}]\varrho\,\cup\,\mathcal{E}'[e_{2}^{\ell_{2}}]\varrho\,\cup\,\big\{\mathsf{nsb_e}(\ell)\geq\mathsf{nsb}(\ell_{1})+\mathsf{nsb_e}(\ell_{1})+\mathsf{nsb_e}(\ell_{2})-2,\ \mathsf{nsb_e}(\ell)\geq\mathsf{nsb}(\ell_{2})+\mathsf{nsb_e}(\ell_{2})+\mathsf{nsb_e}(\ell_{1})-2\big\}\quad(\textsc{Div}')$

$\mathcal{E}'\big[\sqrt{e^{\ell_{1}}}^{\,\ell}\big]\varrho=\mathcal{E}'[e^{\ell_{1}}]\varrho\,\cup\,\big\{\mathsf{nsb_e}(\ell)\geq\mathsf{nsb_e}(\ell_{1})\big\}\quad(\textsc{Sqrt}')$

$\mathcal{E}'\big[\phi(e^{\ell_{1}})^{\ell}\big]\varrho=\mathcal{E}'[e^{\ell_{1}}]\varrho\,\cup\,\big\{\mathsf{nsb_e}(\ell)\geq+\infty\big\}$ with $\phi\in\{\sin,\cos,\tan,\log,\ldots\}\quad(\textsc{Math}')$

$\mathcal{C}'[x\texttt{:=}^{\ell}e^{\ell_{1}}]\varrho=(C,\varrho[x\mapsto\ell])$ where $C=\mathcal{E}'[e^{\ell_{1}}]\varrho\cup\{\mathsf{nsb_e}(\ell_{1})\geq\mathsf{nsb_e}(\ell)\}\quad(\textsc{Assign}')$

$\mathcal{C}'[c_{1}^{\ell_{1}};c_{2}^{\ell_{2}}]\varrho=(C_{1}\cup C_{2},\varrho_{2})$ with $(C_{1},\varrho_{1})=\mathcal{C}'[c_{1}^{\ell_{1}}]\varrho$ and $(C_{2},\varrho_{2})=\mathcal{C}'[c_{2}^{\ell_{2}}]\varrho_{1}\quad(\textsc{Seq}')$

$\mathcal{C}'[\textbf{if}^{\ell}\ e^{\ell_{0}}\ \textbf{then}\ c_{1}^{\ell_{1}}\ \textbf{else}\ c_{2}^{\ell_{2}}]\varrho=(C_{1}\cup C_{2}\cup C_{3},\varrho')$ where $\forall x\in Id,\ \varrho'(x)=\ell$, $(C_{1},\varrho_{1})=\mathcal{C}'[c_{1}^{\ell_{1}}]\varrho$, $(C_{2},\varrho_{2})=\mathcal{C}'[c_{2}^{\ell_{2}}]\varrho$ and $C_{3}=\bigcup_{x\in Id}\{\mathsf{nsb_e}(\varrho_{1}(x))\geq\mathsf{nsb_e}(\ell),\ \mathsf{nsb_e}(\varrho_{2}(x))\geq\mathsf{nsb_e}(\ell)\}\quad(\textsc{Cond}')$

$\mathcal{C}'[\textbf{while}^{\ell}\ e^{\ell_{0}}\ \textbf{do}\ c_{1}^{\ell_{1}}]\varrho=(C_{1}\cup C_{2},\varrho')$ where $\forall x\in Id,\ \varrho'(x)=\ell$, $(C_{1},\varrho_{1})=\mathcal{C}'[c_{1}^{\ell_{1}}]\varrho'$ and $C_{2}=\bigcup_{x\in Id}\{\mathsf{nsb_e}(\varrho(x))\geq\mathsf{nsb_e}(\ell),\ \mathsf{nsb_e}(\varrho_{1}(x))\geq\mathsf{nsb_e}(\ell)\}\quad(\textsc{While}')$

$\mathcal{C}'[\textbf{require\_nsb}(x,\texttt{p})^{\ell}]\varrho=\emptyset\quad(\textsc{Req}')$

 
$\xi(\ell)(\ell_{1},\ell_{2})=\min\left(\begin{array}{l}\max\big(\mathsf{ufp}(\ell_{2})-\mathsf{ufp}(\ell_{1})+\mathsf{nsb}(\ell_{1})-\mathsf{nsb}(\ell_{2})-\mathsf{nsb_e}(\ell_{2}),0\big),\\ \max\big(\mathsf{ufp}(\ell_{1})-\mathsf{ufp}(\ell_{2})+\mathsf{nsb}(\ell_{2})-\mathsf{nsb}(\ell_{1})-\mathsf{nsb_e}(\ell_{1}),0\big),\ 1\end{array}\right)$

Figure 5: Constraints solved by PI, with the $\min$/$\max$ carry bit formulation.

3.3 Policy Iteration for Optimized Carry Bit Propagation

The policy iteration algorithm is used to solve nonlinear fixpoint equations where the function is written as the infimum of functions for which a fixpoint can easily be computed. The infimum formulation makes the function non-differentiable in the classical sense. The algorithm proposed in [7] to solve smallest fixpoint equations in static analysis requires the function to be order-preserving, to ensure the decrease of the intermediate solutions provided by the algorithm. In this article, because of the nature of the semantics, we propose a policy iteration algorithm for a non order-preserving function.

More precisely, let $F$ be a map from a complete lattice $L$ to itself such that $F=\inf_{\pi\in\Pi}f^{\pi}$. Classical policy iteration solves $F(\mathbf{x})=\mathbf{x}$ by generating a sequence $(\mathbf{x}^{k})_{k}$ such that $f^{\pi^{k}}(\mathbf{x}^{k})=\mathbf{x}^{k}$ and $\mathbf{x}^{k+1}<\mathbf{x}^{k}$. The set $\Pi$ is called the set of policies and $f^{\pi}$ a policy map (associated with $\pi$). The set of policy maps has to satisfy the selection property, meaning that for all $\mathbf{x}\in L$, there exists $\pi\in\Pi$ such that $F(\mathbf{x})=f^{\pi}(\mathbf{x})$. This is equivalent to saying that, for each $\mathbf{x}\in L$, the minimization problem $\operatorname{Min}_{\pi\in\Pi}f^{\pi}(\mathbf{x})$ has an optimal solution. If $\Pi$ is finite and $F$ is order-preserving, policy iteration converges in finite time to a fixpoint of $F$; the number of iterations is bounded from above by the number of policies. Indeed, a policy cannot be selected twice during the run of the algorithm, which follows from the fact that the smallest fixpoint of each policy map is computed. In this article, we adapt policy iteration to the problem of precision tuning. The function $F$ here is constructed from the inequalities depicted in Figure 4 and Figure 5; we thus naturally have constraints of the form $F(\mathbf{x})\leq\mathbf{x}$. We give details about the construction of $F$ in Proposition 1. Consequently, we are interested in solving:

$\operatorname*{Min}_{\mathsf{nsb},\,\mathsf{nsb_e}}\ \sum_{\ell}\mathsf{nsb}(\ell)\quad\operatorname*{s.t.}\quad F\big(\mathsf{nsb},\mathsf{nsb_e}\big)\leq\big(\mathsf{nsb},\mathsf{nsb_e}\big),\quad\mathsf{nsb}\in\mathbb{N}^{Lab},\ \mathsf{nsb_e}\in\mathbb{N}^{Lab}$ (18)

Let $\xi:Lab\to\{0,1\}$. We write $S_{\xi}^{1}$ for the system of inequalities depicted in Figure 4 and $S_{\xi}^{2}$ for the system of inequalities presented in Figure 5. Note that the final system of inequalities is $S_{\xi}=S_{\xi}^{1}\cup S_{\xi}^{2}$, meaning that we add the new constraints to $S_{\xi}^{1}$. If the system $S_{\xi}^{1}$ is used alone, $\xi$ is the constant function equal to 1. Otherwise, $\xi$ is defined by the formula at the end of Figure 5.

Proposition 1

The following results hold:

1. Let $\xi$ be the constant function equal to 1. The system $S_{\xi}^{1}$ can be rewritten as $\{\mathsf{nsb}\in\mathbb{N}^{Lab}\mid F(\mathsf{nsb})\leq\mathsf{nsb}\}$, where $F$ maps $\mathbb{R}^{Lab}$ to itself, $F(\mathbb{N}^{Lab})\subseteq\mathbb{N}^{Lab}$, and its coordinates are the maxima of finite families of affine order-preserving functions.

2. Let $\xi$ be the function defined at the bottom of Figure 5. The system $S_{\xi}$ can be rewritten as $\{(\mathsf{nsb},\mathsf{nsb_e})\in\mathbb{N}^{Lab}\times\mathbb{N}^{Lab}\mid F(\mathsf{nsb},\mathsf{nsb_e})\leq(\mathsf{nsb},\mathsf{nsb_e})\}$, where $F$ maps $\mathbb{R}^{Lab}\times\mathbb{R}^{Lab}$ to itself, $F(\mathbb{N}^{Lab}\times\mathbb{N}^{Lab})\subseteq\mathbb{N}^{Lab}\times\mathbb{N}^{Lab}$, and all its coordinates are the min-max of finite families of affine functions.

From Proposition 1, when $S_{\xi}$ is used, we can write $F$ as $F=\min_{\pi\in\Pi}f^{\pi}$, where each $f^{\pi}$ is the maximum of a finite family of affine functions, and thus use a modified policy iteration algorithm. A policy here is a map $\pi:Lab\to\{0,1\}$; a choice is thus a vector of 0s and 1s. A policy map $f^{\pi}$ is a function from $\mathbb{N}^{Lab}$ to itself whose coordinates are the $f_{\ell}^{\pi}$. If the coordinate $f_{\ell}^{\pi}$ depends on $\xi$, then $\xi(\ell)=\pi(\ell)$; otherwise, the coordinate is the maximum of affine functions and no choice is required.

Result: An over-approximation of an optimal solution of Equation (18)
1. Let $k:=0$, $S:=+\infty$;
2. Choose $\pi^{0}\in\Pi$;
3. Select an optimal solution $(\mathsf{nsb}^{k},\mathsf{nsb_e}^{k})$ of the integer linear program
   $\operatorname*{Min}\big\{\sum_{\ell\in Lab}\mathsf{nsb}(\ell)\mid f^{\pi^{k}}(\mathsf{nsb},\mathsf{nsb_e})\leq(\mathsf{nsb},\mathsf{nsb_e}),\ \mathsf{nsb}\in\mathbb{N}^{Lab},\ \mathsf{nsb_e}\in\mathbb{N}^{Lab}\big\}$;
4. if $\sum_{\ell\in Lab}\mathsf{nsb}^{k}(\ell)<S$ then
5.    $S:=\sum_{\ell\in Lab}\mathsf{nsb}^{k}(\ell)$;
6.    Choose $\pi^{k+1}\in\Pi$ such that $F(\mathsf{nsb}^{k},\mathsf{nsb_e}^{k})=f^{\pi^{k+1}}(\mathsf{nsb}^{k},\mathsf{nsb_e}^{k})$;
7.    $k:=k+1$ and go to step 3;
8. else
9.    return $S$ and $\mathsf{nsb}^{k}$.
10. end if

Algorithm 1: Non-monotone Policy Iterations Algorithm
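For concreteness, Algorithm 1 can be phrased as the following Python skeleton (ours). Here solve_ilp and select_policy are assumed callbacks standing for the ILP resolution of step 3 and the policy improvement of step 6; they are not part of the paper's formalism:

# Schematic rendering of Algorithm 1. solve_ilp(pi) is assumed to return
# an optimal (nsb, nsb_e) pair (two dicts indexed by labels) for the
# policy map f^pi; select_policy(nsb, nsb_e) is assumed to return a
# policy pi' such that F(nsb, nsb_e) = f^{pi'}(nsb, nsb_e).
import math

def policy_iteration(pi0, solve_ilp, select_policy):
    pi, best = pi0, math.inf
    while True:
        nsb, nsb_e = solve_ilp(pi)      # step 3: ILP for the policy f^pi
        total = sum(nsb.values())
        if total >= best:               # no strict decrease: stop
            return best, nsb
        best = total                    # step 5: record the improvement
        pi = select_policy(nsb, nsb_e)  # step 6: policy improvement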
Proposition 2 (Algorithm correctness)

The sequence $(\sum_{\ell\in Lab}\mathsf{nsb}^{k}(\ell))_{0\leq k\leq K}$ generated by Algorithm 1 is of finite length (i.e. $K\in\mathbb{N}$) and strictly decreases before convergence: $\sum_{\ell\in Lab}\mathsf{nsb}^{k+1}(\ell)<\sum_{\ell\in Lab}\mathsf{nsb}^{k}(\ell)$ if $k<K-1$, and $\sum_{\ell\in Lab}\mathsf{nsb}^{K}(\ell)=\sum_{\ell\in Lab}\mathsf{nsb}^{K-1}(\ell)$. The number of terms is smaller than the number of policies.

Proof

Let $\sum_{\ell\in Lab}\mathsf{nsb}^{k}(\ell)$ be a term of the sequence and $(\mathsf{nsb}^{k},\mathsf{nsb_e}^{k})$ be the optimal solution of $\operatorname*{Min}\{\sum_{\ell\in Lab}\mathsf{nsb}(\ell)\mid f^{\pi^{k}}(\mathsf{nsb},\mathsf{nsb_e})\leq(\mathsf{nsb},\mathsf{nsb_e}),\ \mathsf{nsb}\in\mathbb{N}^{Lab},\ \mathsf{nsb_e}\in\mathbb{N}^{Lab}\}$. Then $F(\mathsf{nsb}^{k},\mathsf{nsb_e}^{k})\leq f^{\pi^{k}}(\mathsf{nsb}^{k},\mathsf{nsb_e}^{k})$ by definition of $F$. Moreover, $F(\mathsf{nsb}^{k},\mathsf{nsb_e}^{k})=f^{\pi^{k+1}}(\mathsf{nsb}^{k},\mathsf{nsb_e}^{k})$ and $f^{\pi^{k}}(\mathsf{nsb}^{k},\mathsf{nsb_e}^{k})\leq(\mathsf{nsb}^{k},\mathsf{nsb_e}^{k})$. It follows that $f^{\pi^{k+1}}(\mathsf{nsb}^{k},\mathsf{nsb_e}^{k})\leq(\mathsf{nsb}^{k},\mathsf{nsb_e}^{k})$, so $(\mathsf{nsb}^{k},\mathsf{nsb_e}^{k})$ is feasible for the minimization problem of which $(\mathsf{nsb}^{k+1},\mathsf{nsb_e}^{k+1})$ is an optimal solution. We conclude that $\sum_{\ell\in Lab}\mathsf{nsb}^{k+1}(\ell)\leq\sum_{\ell\in Lab}\mathsf{nsb}^{k}(\ell)$; the algorithm terminates if equality holds and continues otherwise, with a strictly decreasing criterion. Finally, by the strict decrease, a policy cannot be selected twice without the algorithm terminating. In conclusion, the number of iterations is smaller than the number of policies. $\Box$

Figure 5 displays the new rules that we add to the global system of constraints when optimizing the carry bit of the elementary operations. Before introducing the optimized function $\xi$, we define the unit in the last place ($\mathsf{ulp}$) of a real number $x$ in Equation (19).

$\mathsf{ulp}(x)=\mathsf{ufp}(x)-\mathsf{nsb}(x)+1\,.$ (19)

Indeed, during an operation between two numbers, the function $\xi$ is computed as follows: if the $\mathsf{ulp}$ of one of the two operands is greater than the $\mathsf{ufp}$ of the other one (or conversely), then the two numbers are not aligned and no carry bit can be propagated through the operation (otherwise $\xi=1$). This idea is expressed in Figure 5, where $\xi$ is formulated with $\min$ and $\max$ operators. Gaining one bit may seem a negligible optimization at first sight. However, when many operations are performed in a program which has to compute with some tens of significant bits, it is far from negligible. Formally, let $x$ and $y$ be the operands of some operation whose result is $z$, with errors $\varepsilon(x)$, $\varepsilon(y)$ and $\varepsilon(z)$ on $x$, $y$ and $z$, respectively. Note that the behaviour of $\xi$ is also valid for the propagation of the carry bit on the errors. Consequently, using the definitions of $\mathsf{ufp_e}$ and $\mathsf{nsb_e}$ introduced in Section 3.1, we have $\mathsf{ufp_e}(x)=\mathsf{ufp}(x)-\mathsf{nsb}(x)$ and $\mathsf{ulp_e}(x)=\mathsf{ufp_e}(x)-\mathsf{nsb_e}(x)+1$, where $\mathsf{ufp_e}$ and $\mathsf{ulp_e}$ are the unit in the first and in the last place of the error on $x$, respectively (and similarly for $y$ and $z$). The optimized function $\xi$ of Figure 5 derives from Equation (20).

$\xi(z,x,y)=\begin{cases}0&\text{if }\mathsf{ufp_e}(x)-\mathsf{nsb_e}(x)\geq\mathsf{ufp}(y)-\mathsf{nsb}(y)\ \text{or conversely,}\\ 1&\text{otherwise.}\end{cases}$ (20)
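As an executable reading of this definition, here is a small Python sketch (ours) of the $\min$/$\max$ formulation of $\xi$ given at the bottom of Figure 5; its arguments are the integer quantities $\mathsf{ufp}$, $\mathsf{nsb}$ and $\mathsf{nsb_e}$ of the two operands:

# Optimized carry-bit function, transcribed from the min/max formula.
def xi(ufp1, nsb1, nsbe1, ufp2, nsb2, nsbe2):
    return min(
        max(ufp2 - ufp1 + nsb1 - nsb2 - nsbe2, 0),
        max(ufp1 - ufp2 + nsb2 - nsb1 - nsbe1, 0),
        1,
    )

# Operands whose errors do not overlap: no carry bit is propagated.
print(xi(ufp1=0, nsb1=10, nsbe1=5, ufp2=-20, nsb2=4, nsbe2=3))  # 0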

As mentioned in Equation (20), to compute the $\mathsf{ulp}$ of the errors on the operands, we need to estimate the number of significant bits of the error, $\mathsf{nsb_e}$, for each operand, which explains the new rules and constraints given in Figure 5. These constraints are complementary to the rules of Figure 4, already explained in Section 3.2; the only difference is that we activate the new function $\xi$ instead of its over-approximation of Figure 4. Let us concentrate on the rules of Figure 5. The function $\mathcal{E}'[e]\varrho$ generates the new set of constraints for an expression $e\in Expr$ in the environment $\varrho$. In Rule $(\textsc{Const}')$, the number of significant bits of the error is $\mathsf{nsb_e}=0$, whereas we impose that the $\mathsf{nsb_e}$ of a variable $x$ at control point $\ell$ is less than the $\mathsf{nsb_e}$ at the last assignment $\varrho(x)$ of $x$, as shown in Rule $(\textsc{Id}')$ of Figure 5. Considering Rule $(\textsc{Add}')$, we start by generating the new sets of constraints $\mathcal{E}'[e_{1}^{\ell_{1}}]\varrho$ and $\mathcal{E}'[e_{2}^{\ell_{2}}]\varrho$ on the operands at control points $\ell_{1}$ and $\ell_{2}$. Then, we require that $\mathsf{nsb_e}(\ell)\geq\mathsf{nsb_e}(\ell_{1})$ and $\mathsf{nsb_e}(\ell)\geq\mathsf{nsb_e}(\ell_{2})$, where the result of the addition is stored at control point $\ell$. Additionally, the number of significant bits of the error on the result, $\mathsf{nsb_e}(\ell)$, is computed as shown hereafter.

𝗇𝗌𝖻𝖾()max(𝗎𝖿𝗉(1)𝗇𝗌𝖻(1)𝗎𝖿𝗉(2)𝗇𝗌𝖻(2))min(𝗎𝖿𝗉(1)𝗇𝗌𝖻(1)𝗇𝗌𝖻𝖾(1)𝗎𝖿𝗉(2)𝗇𝗌𝖻(2)𝗇𝗌𝖻𝖾(2))+ξ()(1,2)\small\mathsf{nsb_{e}}(\ell)\geq\max\left(\begin{array}[]{l}\mathsf{ufp}(\ell_{1})-\mathsf{nsb}(\ell_{1})\\ \mathsf{ufp}(\ell_{2})-\mathsf{nsb}(\ell_{2})\end{array}\right)-\min\left(\begin{array}[]{l}\mathsf{ufp}(\ell_{1})-\mathsf{nsb}(\ell_{1})-\mathsf{nsb_{e}}(\ell_{1})\\ \mathsf{ufp}(\ell_{2})-\mathsf{nsb}(\ell_{2})-\mathsf{nsb_{e}}(\ell_{2})\end{array}\right)+\xi(\ell)(\ell_{1},\ell_{2})

By breaking the $\min$ and $\max$ operators, we obtain the constraints on $\mathsf{nsb_e}(\ell)$ of Rule (Add'). For the subtraction, the generated constraints are similar to those of the addition. Considering now Rule (Mult'), as defined in Section 3.2, $\varepsilon(z) = x\cdot\varepsilon(y) + y\cdot\varepsilon(x) + \varepsilon(x)\cdot\varepsilon(y)$ where $z$ is the result of the product of $x$ and $y$. By reasoning on the $\mathsf{ulp}$ of the error, we bound $\varepsilon(z)$ by

$\varepsilon(z) \leq 2^{\mathsf{ufp}(x)}\cdot 2^{\mathsf{ufp}(y)-\mathsf{nsb}(y)-\mathsf{nsb_e}(y)+1} + 2^{\mathsf{ufp}(y)}\cdot 2^{\mathsf{ufp}(x)-\mathsf{nsb}(x)-\mathsf{nsb_e}(x)+1} + 2^{\mathsf{ufp}(x)+\mathsf{ufp}(y)-\mathsf{nsb}(x)-\mathsf{nsb}(y)-\mathsf{nsb_e}(x)-\mathsf{nsb_e}(y)+2}\enspace.$

By selecting the smallest term 𝗎𝖿𝗉(y)+𝗎𝖿𝗉(x)𝗇𝗌𝖻(x)𝗇𝗌𝖻(y)𝗇𝗌𝖻𝖾(x)𝗇𝗌𝖻𝖾(y)+2\mathsf{ufp}(y)+\mathsf{ufp}(x)-\mathsf{nsb}(x)-\mathsf{nsb}(y)-\mathsf{nsb_{e}}(x)-\mathsf{nsb_{e}}(y)+2, we obtain that

$\mathsf{nsb_e}(\ell) \geq \max\big(\mathsf{ufp}(\ell_1)+\mathsf{ufp}(\ell_2)-\mathsf{nsb}(\ell_1),\ \mathsf{ufp}(\ell_1)+\mathsf{ufp}(\ell_2)-\mathsf{nsb}(\ell_2)\big) - \big(\mathsf{ufp}(\ell_1)+\mathsf{ufp}(\ell_2)-\mathsf{nsb}(\ell_1)-\mathsf{nsb}(\ell_2)-\mathsf{nsb_e}(\ell_1)-\mathsf{nsb_e}(\ell_2)+2\big)\enspace.$

Finally, by simplifying the equation above, we obtain the constraints of Rule (Mult') in Figure 5 (and similarly for Rule (Div')). For Rule (Sqrt'), we generate the constraints $\mathcal{E}'[e_1^{\ell_1}]\varrho$ for the sub-expression and we require that the $\mathsf{nsb_e}$ of the result stored at control point $\ell$ is greater than the $\mathsf{nsb_e}$ of the expression at control point $\ell_1$. For Rule (Math'), we assume that $\mathsf{nsb_e}(\ell)$ is unbounded. Concerning the commands, we define the set $\mathcal{C}'[c]\varrho$, which plays the same role as $\mathcal{C}$ defined in Figure 4. The reasoning on the commands also remains similar, except that this time we reason on the number of bits $\mathsf{nsb_e}$ of the errors. The only difference is in Rule (Req'), where the set of constraints is empty. Let us recall that the constraints of Figure 5 are added to the former constraints of Figure 4 and sent to a linear solver (GLPK in practice).
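To make the linearization explicit, the following sketch (our own illustration working on textual constraints; XXX's generator operates on its internal representation) expands the (Add') inequality on $\mathsf{nsb_e}(\ell)$, whose right-hand side is a $\max$ minus a $\min$, into the four equivalent linear constraints obtained by pairing each branch of the $\max$ with each branch of the $\min$:

  from itertools import product

  def add_prime_constraints(l, l1, l2):
      # nsbe(l) >= max(A1, A2) - min(B1, B2) + xi is equivalent to the
      # conjunction of nsbe(l) >= Ai - Bj + xi for all i, j in {1, 2}.
      A = [f"ufp({k}) - nsb({k})" for k in (l1, l2)]
      B = [f"ufp({k}) - nsb({k}) - nsbe({k})" for k in (l1, l2)]
      return [f"nsbe({l}) >= ({a}) - ({b}) + xi({l})({l1},{l2})"
              for a, b in product(A, B)]

  for c in add_prime_constraints("l", "l1", "l2"):
      print(c)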

 

ϱ(x)=c#px,ϱc#p,ϱ\frac{\varrho(x)=c\texttt{\#p}}{\langle x^{\ell},\varrho\rangle\longrightarrow\langle c^{\ell}\texttt{\#p},\varrho\rangle}
c=c1c2,p=𝗎𝖿𝗉(c)𝗎𝖿𝗉𝖾(c#p)c11#p1c22#p2,ϱc#p,ϱ{+,,×,÷}\frac{c=c_{1}\odot c_{2},\ \texttt{p}=\mathsf{ufp}(c)-\mathsf{ufp_{e}}\big{(}c^{\ell}\texttt{\#p}\big{)}}{\langle c_{1}^{\ell_{1}}\texttt{\#$p_{1}$}\odot^{\ell}c_{2}^{\ell_{2}}\texttt{\#$p_{2}$},\varrho\rangle\longrightarrow\langle c\texttt{\#p},\varrho\rangle}\quad\odot\in\{+,-,\times,\div\}
\frac{\langle e_{1}^{\ell_{1}},\varrho\rangle\longrightarrow\langle e_{1}^{\prime\ell_{1}},\varrho\rangle}{\langle e_{1}^{\ell_{1}}\odot^{\ell}e_{2}^{\ell_{2}},\varrho\rangle\longrightarrow\langle e_{1}^{\prime\ell_{1}}\odot^{\ell}e_{2}^{\ell_{2}},\varrho\rangle}\qquad\frac{\langle e_{2}^{\ell_{2}},\varrho\rangle\longrightarrow\langle e_{2}^{\prime\ell_{2}},\varrho\rangle}{\langle c_{1}^{\ell_{1}}\texttt{\#p}\odot^{\ell}e_{2}^{\ell_{2}},\varrho\rangle\longrightarrow\langle c_{1}^{\ell_{1}}\texttt{\#p}\odot^{\ell}e_{2}^{\prime\ell_{2}},\varrho\rangle}
e1,ϱe1,ϱϕ(e1),ϱϕ(e1),ϱc=ϕ(c1)q=𝗎𝖿𝗉(c)𝗎𝖿𝗉𝖾(c#p)ϕ(c11#p),ϱc#q,ϱϕ{sin,cos,}\frac{\langle e^{\ell_{1}},\varrho\rangle\longrightarrow\langle e^{\prime\ell_{1}},\varrho\rangle}{\langle\phi(e^{\ell_{1}})^{\ell},\varrho\rangle\longrightarrow\langle\phi(e^{\prime\ell_{1}})^{\ell},\varrho\rangle}\qquad\frac{c=\phi(c_{1})\quad\texttt{q}=\mathsf{ufp}(c)-\mathsf{ufp_{e}}(c^{\ell}\texttt{\#p})}{\langle\phi(c_{1}^{\ell_{1}}\texttt{\#p})^{\ell},\varrho\rangle\longrightarrow\langle c^{\ell}\texttt{\#q},\varrho\rangle}\quad\phi\in\{\sin,\cos,\ldots\}
e1,ϱe1,ϱe1,ϱe1,ϱc=c1q=𝗎𝖿𝗉(c)𝗎𝖿𝗉𝖾(c#p)c11#p,ϱc#q,ϱ\frac{\langle e^{\ell_{1}},\varrho\rangle\longrightarrow\langle e^{\prime\ell_{1}},\varrho\rangle}{\langle\sqrt{e^{\ell_{1}}}^{\ell},\varrho\rangle\longrightarrow\langle\sqrt{e^{\prime\ell_{1}}}^{\ell},\varrho\rangle}\qquad\frac{c=\sqrt{c_{1}}\quad\texttt{q}=\mathsf{ufp}(c)-\mathsf{ufp_{e}}(c^{\ell}\texttt{\#p})}{\langle\sqrt{c_{1}^{\ell_{1}}\texttt{\#p}}^{\ell},\varrho\rangle\longrightarrow\langle c^{\ell}\texttt{\#q},\varrho\rangle}

 

Figure 6: Small-step operational semantics of arithmetic expressions.

4 Correctness

In this section, we present proofs of correctness concerning the soundness of the analysis (Section 4.1) and the integer nature of the solutions (Section 4.2).

4.1 Soundness of the Constraint System

Let $\equiv$ denote syntactic equivalence and let $e^{\ell} \in Expr$ be an expression. We write $\mathsf{Const}(e^{\ell})$ for the set of constants occurring in the expression $e^{\ell}$. For example, $\mathsf{Const}(18.0^{\ell_1}\times^{\ell_2}x^{\ell_3}+^{\ell_4}12.0^{\ell_5}\times^{\ell_6}y^{\ell_7}+^{\ell_8}z^{\ell_9}) = \{18.0^{\ell_1}, 12.0^{\ell_5}\}$. Also, we denote by $\tau : Lab \rightarrow \mathbb{N}$ a function mapping the labels of an expression to a $\mathsf{nsb}$. The notation $\tau \models \mathcal{E}[e^{\ell}]\varrho$ means that $\tau$ is the minimal solution to the ILP $\mathcal{E}[e^{\ell}]\varrho$. We write $\varrho_{\perp}$ for the empty environment ($dom(\varrho_{\perp}) = \emptyset$).

The small-step operational semantics of our language is displayed in Figure 6. It is standard, the only originality being that the $\mathsf{nsb}$ of constants is indicated explicitly. For the result of an elementary operation, this $\mathsf{nsb}$ is computed from the $\mathsf{nsb}$ of the operands. Lemma 1 below asserts the soundness of the constraints for one step of the semantics.
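As a concrete reading of the addition rule of Figure 6, the following sketch (our own illustration, assuming the pessimistic carry bit $\xi = 1$ and the characterization of $\mathsf{ufp_e}$ by Equation (15), recalled in the proof of Lemma 1 below; the definition of ufp used is one standard choice) computes the annotation #p of the sum of two annotated constants:

  import math

  def ufp(x):
      # unit in the first place: weight of the most significant bit of x
      return math.floor(math.log2(abs(x)))

  def add(c1, p1, c2, p2, xi=1):
      # Result c#p with p = ufp(c) - ufp_e(c), where ufp_e(c) is the
      # maximum of the ufp of the operand errors, shifted by the carry bit.
      c = c1 + c2
      ufpe = max(ufp(c1) - p1, ufp(c2) - p2) + xi
      return c, ufp(c) - ufpe

  print(add(3.0, 10, 1.5, 12))   # (4.5, 10): the sum keeps 10 significant bits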

Lemma 1

Given an expression $e^{\ell} \in Expr$, if $e^{\ell} \rightarrow e^{\prime\ell}$ and $\tau \models \mathcal{E}[e^{\ell}]\varrho_{\perp}$ then for all $c^{\ell_c}\texttt{\#p} \in \mathsf{Const}(e^{\prime\ell})$ we have $\texttt{p} = \tau(\ell_c)$.

Proof

By case examination of the rules of Figure 4. Hereafter, we focus on the most interesting case, the addition of two constants. Recall that $\mathsf{ufp_e}(\ell) = \mathsf{ufp}(\ell) - \mathsf{nsb}(\ell)$ for any control point $\ell$. Assuming that $e^{\ell} \equiv c_1^{\ell_1} +^{\ell} c_2^{\ell_2}$, then by the reduction rule of Figure 6 we have $e^{\ell} \rightarrow c^{\ell}\texttt{\#p}$ with $\texttt{p} = \mathsf{ufp}(c) - \mathsf{ufp_e}(c)$. On the other hand, by the set of constraints of Rule (Add) in Figure 4 we have $\mathcal{E}[e^{\ell}]\varrho = \{\mathsf{nsb}(\ell_1) \geq \mathsf{nsb}(\ell) + \mathsf{ufp}(\ell_1) - \mathsf{ufp}(\ell) + \xi(\ell)(\ell_1,\ell_2),\ \mathsf{nsb}(\ell_2) \geq \mathsf{nsb}(\ell) + \mathsf{ufp}(\ell_2) - \mathsf{ufp}(\ell) + \xi(\ell)(\ell_1,\ell_2)\}$. These constraints can be written as

𝗇𝗌𝖻()\displaystyle\small\mathsf{nsb}(\ell) \displaystyle\leq 𝗎𝖿𝗉()𝗎𝖿𝗉(1)+𝗇𝗌𝖻(1)ξ()(1,2)\displaystyle\mathsf{ufp}(\ell)-\mathsf{ufp}(\ell_{1})+\mathsf{nsb}(\ell_{1})-\xi(\ell)(\ell_{1},\ell_{2})
𝗇𝗌𝖻()\displaystyle\mathsf{nsb}(\ell) \displaystyle\leq 𝗎𝖿𝗉()𝗎𝖿𝗉(2)+𝗇𝗌𝖻(2)ξ()(1,2)\displaystyle\mathsf{ufp}(\ell)-\mathsf{ufp}(\ell_{2})+\mathsf{nsb}(\ell_{2})-\xi(\ell)(\ell_{1},\ell_{2})

and may themselves be rewritten as Equation (16), i.e.

𝗇𝗌𝖻()𝗎𝖿𝗉()max(𝗎𝖿𝗉(1)𝗇𝗌𝖻(1),𝗎𝖿𝗉(2)𝗇𝗌𝖻(2))ξ()(1,2).\small\mathsf{nsb}(\ell)\leq\mathsf{ufp}(\ell)-\max\big{(}\mathsf{ufp}(\ell_{1})-\mathsf{nsb}(\ell_{1}),\mathsf{ufp}(\ell_{2})-\mathsf{nsb}(\ell_{2})\big{)}-\xi(\ell)(\ell_{1},\ell_{2})\enspace.

Since, obviously, 𝗎𝖿𝗉(c)=𝗎𝖿𝗉()\mathsf{ufp}(c)=\mathsf{ufp}(\ell) and since the solver finds the minimal solution to the ILP, it remains to show that

$\mathsf{ufp_e}(\ell) = \max\big(\mathsf{ufp}(\ell_1)-\mathsf{nsb}(\ell_1),\ \mathsf{ufp}(\ell_2)-\mathsf{nsb}(\ell_2),\ \mathsf{ufp}(\ell)-\mathsf{prec}(\ell)\big) + \xi(\ell)(\ell_1,\ell_2)$

which corresponds to the assertion of Equation (15). Consequently, $\mathsf{nsb}(\ell) = \texttt{p}$ as required, for this case, in Figure 6. $\Box$

Theorem 4.1

Given an expression $e^{\ell} \in Expr$. If $e^{\ell} \rightarrow^{*} e^{\prime\ell}$ and if $\tau \models \mathcal{E}[e^{\ell}]\varrho_{\perp}$, then for all $c^{\ell_c}\texttt{\#p} \in \mathsf{Const}(e^{\prime\ell})$ we have $\texttt{p} = \tau(\ell_c)$.

Proof

By induction on the length of the reduction path, applying Lemma 1 at each step. $\Box$

4.2 ILP Nature of the Problem

In this section, we give insights into the complexity of the problem. The computation relies on integer linear programming, which is known to be NP-hard [16]. A lower bound on the optimal value of a minimization problem is furnished by its continuous linear programming relaxation, obtained by removing the integrality constraint. Recall that a (classical) linear program can be solved in polynomial time [19]. Hence, we can solve our problem in polynomial time if we can show that the continuous linear programming relaxation of our ILP has a unique optimal solution with integer coordinates. Proposition 3 presents a situation where a linear program has a unique optimal solution which is a vector of integers.

Proposition 3

Let $G : [0,+\infty)^d \mapsto [0,+\infty)^d$ be an order-preserving function such that $G(\mathbb{N}^d) \subseteq \mathbb{N}^d$. Suppose that the set $\{y \in \mathbb{N}^d \mid G(y) \leq y\}$ is non-empty. Let $\varphi : \mathbb{R}^d \mapsto \mathbb{R}$ be a strictly monotone function such that $\varphi(\mathbb{N}^d) \subseteq \mathbb{N}$. Then, the minimization problem:

Miny[0,+)dφ(y)s.t.G(y)y\displaystyle{\operatorname*{Min}_{y\in[0,+\infty)^{d}}}\ \varphi(y)\ \operatorname*{s.t.}G(y)\leq y

has a unique optimal solution, which is integral.

Proof

Let $L := \{x \in [0,+\infty)^d \mid G(x) \leq x\}$ and $u = \inf L$. It suffices to prove that $u \in \mathbb{N}^d$. Indeed, as $\varphi$ is strictly monotone, $\varphi(u) < \varphi(x)$ for all $x \in [0,+\infty)^d$ such that $G(x) \leq x$ and $x \neq u$; the optimal solution is thus $u$. If $u = 0$, the result holds. Now suppose that $0 < u$; then $0 \leq G(0)$. Let $M := \{y \in \mathbb{N}^d \mid y \leq G(y),\ y \leq u\}$. Then $0 \in M$ and we write $v := \sup M$. As $M$ is a complete lattice such that $G(M) \subseteq M$, Tarski's theorem gives $G(v) = v$, with $v \in \mathbb{N}^d$ and $v \leq u$. Again from Tarski's theorem, $u$ is the smallest fixpoint of $G$, so it coincides with $v$. We conclude that $u \in \mathbb{N}^d$. $\Box$

Theorem 4.2

Assume that the system of inequalities depicted in Figure 4 has a solution. Then the smallest amount of memory $\sum_{\ell \in Lab} \mathsf{nsb}(\ell)$ satisfying this system can be computed in polynomial time by linear programming.

Proof

The function $\sum_{\ell \in Lab} \mathsf{nsb}(\ell)$ is strictly monotone and stable on integers. From the first statement of Proposition 1, the system of constraints is of the form $F(\mathsf{nsb}) \leq \mathsf{nsb}$ where $F$ is order-preserving and stable on integers. By assumption, there exists a vector of integers $\mathsf{nsb}$ such that $F(\mathsf{nsb}) \leq \mathsf{nsb}$. We conclude from Proposition 3. $\Box$
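As a small sanity check of Theorem 4.2 (a toy system written by hand in the shape of Figure 4, not one generated by XXX), the continuous relaxation below is solved with an off-the-shelf LP solver and its unique optimum is already integral:

  from scipy.optimize import linprog

  # Toy system: nsb(l1) >= nsb(l0) + 3, nsb(l2) >= nsb(l0) + 1 and a user
  # requirement nsb(l0) >= 11, encoded as A @ y <= b
  # over y = (nsb(l0), nsb(l1), nsb(l2)).
  A = [[ 1, -1,  0],
       [ 1,  0, -1],
       [-1,  0,  0]]
  b = [-3, -1, -11]
  res = linprog(c=[1, 1, 1], A_ub=A, b_ub=b, bounds=(0, None))
  print(res.x)   # [11. 14. 12.]: integral although no integrality was imposed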

For the second system, in practice, we obtain integral solutions to the continuous linear programming relaxation of our ILP of Equation (18). However, because of the lack of monotonicity of the functions for the rules (ADD) and (SUB), we cannot exploit Proposition 3 to prove polynomial-time solvability.
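To give the flavour of policy iteration for such min-max systems, here is a deliberately tiny sketch (our own two-variable example, in the spirit of [7] but much simpler than the actual algorithm and the systems of Figure 5): a policy selects one branch of each min, the induced max-affine system is a pure LP, and the policy is improved until the selected branches attain their min at the current solution.

  from scipy.optimize import linprog

  # Minimize x + y subject to x >= max(2, y - 1) and y >= min(x + 3, 5).
  def solve(policy):
      A = [[-1, 0], [-1, 1]]                # x >= 2 and x >= y - 1
      b = [-2, 1]
      if policy == "x+3":
          A.append([1, -1]); b.append(-3)   # selected branch: y >= x + 3
      else:
          A.append([0, -1]); b.append(-5)   # selected branch: y >= 5
      return linprog([1, 1], A_ub=A, b_ub=b, bounds=(0, None))

  policy = "x+3"
  for _ in range(4):                        # few policies: few improvements
      res = solve(policy)
      if res.success:
          x, y = res.x
          best = "x+3" if x + 3 <= 5 else "5"   # branch attaining the min
          if best == policy:
              break                         # policy consistent: done
          policy = best
      else:
          policy = "5" if policy == "x+3" else "x+3"  # infeasible: switch

  print(policy, res.x)                      # 5 [4. 5.]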

5 Experimental Results

Program         TH       BL    IEEE   ILP-time   BL    IEEE   PI-time     H     S     D    LD
(first BL, IEEE and ILP-time columns: ILP formulation; second BL, IEEE and PI-time columns: PI formulation)
arclength       10^-4    61%    43%   0.9s       62%    45%   1.5s        8    88    25     0
                10^-6    50%    21%   0.9s       51%    21%   1.4s        2    45    74     0
                10^-8    37%     3%   0.8s       38%     4%   1.6s        2     6   113     0
                10^-10   24%    -1%   1.0s       25%    -1%   1.7s        2     0   116     3
                10^-12   12%   -17%   0.3s       14%    -8%   1.5s        2     0   109    10
simpson         10^-4    64%    45%   0.1s       67%    56%   0.5s        6    42     1     0
                10^-6    53%    30%   0.2s       56%    31%   0.5s        1    27    21     0
                10^-8    40%     4%   0.1s       43%     7%   0.3s        1     5    43     0
                10^-10   27%     1%   0.1s       28%     1%   0.4s        1     0    48     0
                10^-12   16%     1%   0.1s       16%     1%   0.3s        0     1    48     0
accelerometer   10^-4    73%    61%   0.2s       76%    62%   1.0s       53    69     0     0
                10^-6    62%    55%   0.2s       65%    55%   1.0s        2   102     0     0
                10^-8    49%    15%   0.2s       52%    18%   1.0s        2    33    69     0
                10^-10   36%     1%   0.2s       39%     1%   1.0s        2     0   102     0
                10^-12   25%     1%   0.2s       28%     1%   1.0s        2     0   102     0
rotation        10^-4    78%    66%   0.08s      79%    68%   1.3s       46    38     0     0
                10^-6    67%    53%   0.08s      68%    56%   0.5s       12    70     2     0
                10^-8    53%    29%   0.07s      54%    29%   0.4s        0    46    38     0
                10^-10   40%     0%   0.1s       41%     0%   0.5s        0     0    84     0
                10^-12   29%     0%   0.09s      30%     0%   0.5s        0     0    48     0
lowPassFilter   10^-4    68%    46%   1.8s       69%    46%   10.7s     260   581     0     0
                10^-6    57%    38%   1.8s       58%    45%   11.0s     258   580     3     0
                10^-8    44%    -7%   2.0s       45%    -7%   11.4s     258     2   581     0
                10^-10   31%    -7%   1.7s       32%    -7%   10.9s     258     0   583     0
                10^-12   20%    -7%   1.8s       21%    -7%   11.3s     258     0   583     0
pendulum        10^-4    71%    54%   0.15s      71%    54%   0.4s        0    13     0     0
                10^-6    60%    50%   0.2s       60%    50%   0.5s        0    12     1     0
                10^-8    47%     0%   0.12s      47%     0%   0.4s        0     0    13     0
                10^-10   33%     0%   0.16s      34%     0%   0.5s       0     0    13     0
                10^-12   22%     0%   0.11s      22%     0%   0.4s        0     0    13     0

Table 1: Precision tuning results for XXX for the ILP and PI methods.

In this section, we evaluate the performance of our tool XXX implementing the techniques of Section 3. Recall that XXX reduces the problem of precision tuning to an ILP by generating the constraints defined in Section 3.2, which can be solved by a linear solver. In practice we use GLPK (https://www.gnu.org/software/glpk/) with continuous variables, since we know from Section 4.2 that the continuous relaxation already yields the integral optimum: GLPK's column kind (col_kind) is set to CV and not to IV as for integer problems. Alternatively, a second, optimized set of constraints may be generated and solved with the PI technique introduced in Section 3.3.

We have evaluated XXX on several numerical programs. Two of them were used as benchmarks for precision tuning in prior work [18] and come from the GNU Scientific Library (GSL): arclength, a program first introduced in [2], and simpson, an implementation of the widely used Simpson's rule for numerical integration [13]. In addition, we have experimented with the techniques introduced in this article on three programs used as benchmarks for the former version of XXX [3, 4, 5]. The program rotation performs a matrix-vector product to rotate a vector around the $z$ axis by an angle $\theta$ [5]. The second program, accelerometer, is an application measuring an inclination angle with an accelerometer [3] and, finally, the lowPassFilter program [4] is taken from a pedometer application [14] that counts the number of footsteps of the user. The last two programs come from the IoT field. We also take the pendulum program already introduced in Section 2.

The experiments shown in Table 1 present the precision tuning results returned by XXX for the error thresholds $10^{-4}$, $10^{-6}$, $10^{-8}$, $10^{-10}$ and $10^{-12}$. An error threshold expresses the required $\mathsf{nsb}$: for instance, for a threshold of $10^{-4}$, the result is required to be accurate up to 4 decimal digits. Since XXX counts numbers of significant bits, these thresholds are converted into binary. In Table 1, ”TH” denotes the error threshold given by the user and ”BL” the percentage of optimization at bit level. ”IEEE” denotes the percentage of optimization when variables are assigned IEEE754 formats (binary16, binary32, \ldots), each $\mathsf{nsb}$ obtained at bit level being rounded up to the number of bits of the immediately larger IEEE754 format. ”ILP-time” is the total analysis time of XXX for the ILP formulation, and ”PI-time” the time spent by XXX to find the right policy and solve the precision tuning problem. ”H”, ”S”, ”D” and ”LD” denote, respectively, the number of variables obtained in half, single, double and long-double precision when using the PI formulation, which clearly displays better results.
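The decimal-to-binary conversion of thresholds can be sketched as follows (our reading, consistent with the comparison setup described later in this section: the binary threshold immediately lower than the decimal one; the exact rounding inside XXX may differ):

  import math

  def digits_to_bits(d):
      # smallest n such that 2**-n <= 10**-d
      return math.ceil(d * math.log2(10))

  print([digits_to_bits(d) for d in (4, 6, 8, 10, 12)])   # [14, 20, 27, 34, 40]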

Program     Tool                Threshold 10^-4    Threshold 10^-6    Threshold 10^-8    Threshold 10^-10
arclength   XXX ILP (28)        2464b. - 1.8s.     2144b. - 1.5s.     1792b. - 1.7s.     1728b. - 1.8s.
            XXX SMT (22)        1488b. - 4.7s.     1472b. - 3.04s.    864b. - 3.09s.     384b. - 2.9s.
            Precimonious (9)    576b. - 146.4s.    576b. - 156.0s.    576b. - 145.8s.    576b. - 215.0s.
simpson     XXX ILP (14)        1344b. - 0.4s.     1152b. - 0.5s.     896b. - 0.4s.      896b. - 0.4s.
            XXX SMT (11)        896b. - 2.9s.      896b. - 1.9s.      704b. - 1.7s.      704b. - 1.8s.
            Precimonious (10)   704b. - 208.1s.    704b. - 213.7s.    704b. - 207.5s.    704b. - 200.3s.
rotation    XXX ILP (25)        2624b. - 0.47s.    2464b. - 0.47s.    2048b. - 0.54s.    1600b. - 0.48s.
            XXX SMT (22)        1584b. - 1.85s.    2208b. - 1.7s.     1776b. - 1.6s.     1600b. - 1.7s.
            Precimonious (27)   2400b. - 9.53s.    2592b. - 12.2s.    2464b. - 10.7s.    2464b. - 7.4s.
accel.      XXX ILP (18)        1776b. - 1.05s.    1728b. - 1.05s.    1248b. - 1.04s.    1152b. - 1.03s.
            XXX SMT (15)        1488b. - 2.6s.     1440b. - 2.6s.     1056b. - 2.4s.     960b. - 2.4s.
            Precimonious (0)    -                  -                  -                  -

Table 2: Comparison between XXX ILP, XXX SMT and Precimonious: number of bits saved by each tool (in parentheses, the number of variables it optimizes) and time in seconds for analyzing the programs.

Let us focus on the first ”TH”, ”BL”, ”IEEE” and ”ILP-time” columns of Table 1. These columns have been measured for the ILP case, which is pessimistic about the propagation of the carry bit, as described in Section 3.2. We recall that all the variables of our programs are in double precision before the analysis. For the arclength program, the percentage of optimization reaches 61% at bit level and 43% in IEEE754 formats (100% being the size of the original program, whose 121 variables, all in double precision, occupy 7744 bits). This is obtained in only 0.9 seconds with the ILP formulation. When we refine the solution with the policy iteration method (from the sixth column), XXX displays better results, attaining 62% at bit level and 45% for the IEEE formats. Although XXX needs more analysis time to find and iterate between policies, this time remains negligible, under two seconds. Out of the 121 variables of the original arclength program, XXX tunes 8 variables to half precision (H), the majority, 88 variables, to single precision (S), while 25 variables remain in double precision (D), for an error threshold of $10^{-4}$. Our second method also displays better results for the other user error thresholds. For simpson, accelerometer, rotation and lowPassFilter, the improvement is likewise more important with the PI technique than with the ILP formulation. For instance, for an error threshold of $10^{-6}$ on the simpson program, one variable passes to half precision, 27 variables turn to single precision and 21 remain in double precision, for a total of 56% of optimization at bit level with the policy iteration method. Concerning the pendulum code, the two techniques return the same percentages at bit level and in IEEE754 formats for most error thresholds (except $10^{-10}$, where XXX reaches 34% at bit level with the PI method).

Let us now explain the negative percentages of Table 1, in the IEEE columns for the arclength program with thresholds $10^{-10}$ and $10^{-12}$ and for the lowPassFilter program with thresholds $10^{-8}$, $10^{-10}$ and $10^{-12}$. XXX is able to return formats for any accuracy required by the user, without additional cost or increased complexity, even when no improvement over the original program is possible. Specifically, for the arclength program with an error of $10^{-12}$, XXX fulfills the requirement by informing the user that this accuracy is achievable only if 10 variables pass to long-double precision (LD), i.e., to a larger format than in the original program, whose variables are all in double precision. Consequently, the IEEE percentages of the ILP and PI formulations drop to -17% and -8%, respectively. The same reasoning applies to the lowPassFilter program (841 variables in total), on which the policy iteration technique spends more time, nearly 12 seconds, to find the optimized formats. Note that in these cases, other tools such as Precimonious [18] fail to propose formats.

Table 2 shows a comparison between the new version of XXX combining the ILP and PI formulations, the former version of XXX that uses the Z3 SMT solver [15] coupled with a binary search to find an optimal solution [5], and the prior state-of-the-art tool Precimonious [18]. The results of mixed-precision tuning are shown for the arclength, simpson, rotation and accelerometer programs. Since XXX and Precimonious implement two different techniques, we have adjusted the comparison criteria in several respects. First, XXX optimizes many more variables than Precimonious; although this disadvantages XXX, we only consider in the experiments of Table 2 the variables optimized by Precimonious to estimate the quality of the optimization. Second, the error thresholds are expressed in base 2 in XXX and in base 10 in Precimonious. For the relevance of the comparison, all thresholds are expressed in base 10 in Tables 1 and 2; in practice, XXX uses the base 2 threshold immediately lower than the required base 10 threshold.

Table 2 indicates, for each error threshold and each program, which tool exhibits the better results. Starting with the arclength program, XXX ILP displays the best results, optimizing 28 variables. For an error threshold of $10^{-4}$, 2464 bits are saved by XXX ILP in 1.8 seconds, while XXX SMT saves only 1488 bits in more time (4.7 seconds). Precimonious is the slowest tool on this example, taking more than 2 minutes to save 576 bits over only 9 variables. For the simpson program, XXX ILP also does better than both other tools. For the rotation program, however, XXX ILP saves more bits than the other tools only for an error of $10^{-4}$, while Precimonious does better for the remaining thresholds. Finally, Precimonious fails to tune the accelerometer program (0 variables), whereas XXX ILP is fast (about one second) and saves many more bits than XXX SMT for the four given error thresholds.

6 Conclusion and Perspectives

In this article, we have introduced a new technique for precision tuning, totally different from the existing ones. Instead of changing more or less randomly the data types of the numerical variables and running the programs to see what happens, we propose a semantical modelling of the propagation of the numerical errors throughout the code. This yields a system of constraints whose minimal solution gives the best tuning of the program, furthermore, in polynomial time. Two variants of this system are proposed. The first one corresponds to a pure ILP. The second one, which optimizes the propagation of carries in the elementary operations can be solved using policy iterations [7]. Proofs of correctness concerning the soundness of the analysis and the integer nature of the solutions have been presented in Section 4 and experimental results showing the efficiency of our method have been introduced in Section 5.

Compared to other approaches, the strength of our method is to find directly the minimal number of bits needed at each control point to reach a given accuracy on the results. Consequently, it does not depend on a finite set of data types (e.g. the IEEE754 formats), and its complexity does not increase with the number of data types considered. The information provided may also be used to generate computations in fixed-point arithmetic with an accuracy guarantee on the results. Concerning scalability, we generate a number of constraints and variables linear in the size of the analyzed program; the only limitation is the size of the problem accepted by the solver. Note that the number of variables could be reduced by assigning the same precision to a whole piece of code (for example an arithmetic expression, a line of code or a function). Code synthesis for fixed-point arithmetic and assigning the same precision to pieces of code are perspectives we aim to explore in the short term.

In the longer term, other developments of the present work are planned. First, we wish to adapt the techniques developed in this article to the special case of deep neural networks, for which saving memory and computational resources is important. Second, we aim to use our precision tuning method to guide lossy compression techniques for floating-point datasets [9]. In this case, the bit-level accuracy inferred by our method would determine the compression rate of the lossy technique.

References

  • [1] ANSI/IEEE: IEEE Standard for Binary Floating-point Arithmetic, std 754-2008 edn. (2008)
  • [2] Bailey, D.: Resolving numerical anomalies in scientific computation (March 2008)
  • [3] Ben Khalifa, D., Martel, M.: Precision tuning and internet of things. In: International Conference on Internet of Things, Embedded Systems and Communications, IINTEC 2019. pp. 80–85. IEEE (2019)
  • [4] Ben Khalifa, D., Martel, M.: Precision tuning of an accelerometer-based pedometer algorithm for iot devices. In: International Conference on Internet of Things and Intelligence System, IOTAIS 2020. pp. 113–119. IEEE (2020)
  • [5] Ben Khalifa, D., Martel, M., Adjé, A.: POP: A tuning assistant for mixed-precision floating-point computations. In: Formal Techniques for Safety-Critical Systems - 7th International Workshop, FTSCS 2019. Communications in Computer and Information Science, vol. 1165, pp. 77–94. Springer (2019)
  • [6] Chiang, W., Baranowski, M., Briggs, I., Solovyev, A., Gopalakrishnan, G., Rakamaric, Z.: Rigorous floating-point mixed-precision tuning. In: Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Programming Languages, POPL. pp. 300–315. ACM (2017)
  • [7] Costan, A., Gaubert, S., Goubault, E., Martel, M., Putot, S.: A policy iteration algorithm for computing fixed points in static analysis of programs. In: Computer Aided Verification, 17th International Conference, CAV 2005, Edinburgh, Scotland, UK, July 6-10, 2005, Proceedings. Lecture Notes in Computer Science, vol. 3576, pp. 462–475. Springer (2005)
  • [8] Darulova, E., Horn, E., Sharma, S.: Sound mixed-precision optimization with rewriting. In: Proceedings of the 9th ACM/IEEE International Conference on Cyber-Physical Systems, ICCPS. pp. 208–219. IEEE Computer Society / ACM (2018)
  • [9] Diffenderfer, J., Fox, A., Hittinger, J.A.F., Sanders, G., Lindstrom, P.G.: Error analysis of ZFP compression for floating-point data. SIAM J. Sci. Comput. 41(3), A1867–A1898 (2019)
  • [10] Guo, H., Rubio-González, C.: Exploiting community structure for floating-point precision tuning. In: Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2018. pp. 333–343. ACM (2018)
  • [11] Kotipalli, P.V., Singh, R., Wood, P., Laguna, I., Bagchi, S.: AMPT-GA: automatic mixed precision floating point tuning for GPU applications. In: Proceedings of the ACM International Conference on Supercomputing, ICS. pp. 160–170. ACM (2019)
  • [12] Lam, M.O., Hollingsworth, J.K., de Supinski, B.R., LeGendre, M.P.: Automatically adapting programs for mixed-precision floating-point computation. In: International Conference on Supercomputing, ICS’13. pp. 369–378. ACM (2013)
  • [13] McKeeman, W.M.: Algorithm 145: Adaptive numerical integration by Simpson's rule. Commun. ACM 5(12), 604 (1962)
  • [14] Morris, D., Saponas, T., Guillory, A., Kelner, I.: RecoFit: Using a wearable sensor to find, recognize, and count repetitive exercises. In: Conference on Human Factors in Computing Systems, CHI 2014. ACM (2014)
  • [15] de Moura, L.M., Bjørner, N.: Z3: an efficient SMT solver. In: Tools and Algorithms for the Construction and Analysis of Systems. LNCS, vol. 4963, pp. 337–340. Springer (2008)
  • [16] Papadimitriou, C.H.: On the complexity of integer programming. Journal of the ACM (JACM) 28(4), 765–768 (1981)
  • [17] Parker, D.S.: Monte carlo arithmetic: exploiting randomness in floating-point arithmetic. Tech. Rep. CSD-970002, University of California (Los Angeles) (1997)
  • [18] Rubio-González, C., Nguyen, C., Nguyen, H.D., Demmel, J., Kahan, W., Sen, K., Bailey, D.H., Iancu, C., Hough, D.: Precimonious: tuning assistant for floating-point precision. In: International Conference for High Performance Computing, Networking, Storage and Analysis, SC’13. pp. 27:1–27:12. ACM (2013)
  • [19] Schrijver, A.: Theory of linear and integer programming. John Wiley & Sons (1998)

7 Appendix

We need a lemma stating that some algebraic operations are stable on the set of functions written as the min-max of a finite family of affine functions. All functions considered are defined on $\mathbb{R}^d$.

Lemma 2

The following statements hold:

  • The sum of two min-max of a finite family of affine functions is a min-max of a finite family of affine functions.

  • The maximum of two min-max of a finite family of affine functions is a min-max of a finite family of affine functions.

Proof

Let $g$ and $h$ be two min-max of finite families of affine functions and let $f = g + h$, with $g = \min_i \max_j g^{ij}$ and $h = \min_k \max_l h^{kl}$. Let $x \in \mathbb{R}^d$. There exist $i, k$ such that $f(x) \geq \max_j g^{ij}(x) + \max_l h^{kl}(x) = \max_{j,l}\big(g^{ij}(x) + h^{kl}(x)\big)$. Also, for all $i, k$, $f(x) \leq \max_j g^{ij}(x) + \max_l h^{kl}(x) = \max_{j,l}\big(g^{ij}(x) + h^{kl}(x)\big)$. We conclude that $f(x) = \min_{i,k} \max_{j,l}\big(g^{ij}(x) + h^{kl}(x)\big)$ for all $x$. The same argument applies to the maximum. $\Box$
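For instance, with $g(x,y) = \min(x, 1)$ and $h(x,y) = \min(y, 2)$ (degenerate min-max with singleton max families), the lemma gives $g + h = \min(x + y,\ x + 2,\ y + 1,\ 3)$, again a min-max of affine functions.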

Proposition 4

The following results hold:

  1. Let $\xi$ be the constant function equal to 1. The system $S_{\xi}^{1}$ can be rewritten as $\{\mathsf{nsb} \in \mathbb{N}^{Lab} \mid F(\mathsf{nsb}) \leq \mathsf{nsb}\}$, where $F$ maps $\mathbb{R}^{Lab}$ to itself, satisfies $F(\mathbb{N}^{Lab}) \subseteq \mathbb{N}^{Lab}$, and has coordinates which are maxima of finite families of order-preserving affine functions.

  2. Let $\xi$ be the function such that $\xi(\ell)$ equals the function defined in Figure 5. The system $S_{\xi}$ can be rewritten as $\{(\mathsf{nsb}, \mathsf{nsb_e}) \in \mathbb{N}^{Lab} \times \mathbb{N}^{Lab} \mid F(\mathsf{nsb}, \mathsf{nsb_e}) \leq (\mathsf{nsb}, \mathsf{nsb_e})\}$, where $F$ maps $\mathbb{R}^{Lab} \times \mathbb{R}^{Lab}$ to itself, satisfies $F(\mathbb{N}^{Lab} \times \mathbb{N}^{Lab}) \subseteq \mathbb{N}^{Lab} \times \mathbb{N}^{Lab}$, and all its coordinates are min-max of finite families of affine functions.

Proof

We only give details for the system $S_{\xi}^{1}$ (Figure 4); the proof is by induction on the rules. We write $L = \{\ell \in Lab \mid F_{\ell} \text{ is already constructed}\}$; this set is used throughout the proof to construct $F$ inductively.

For the rule (CONST), there is nothing to do. For the rule (ID), if the label $\ell' = \rho(x) \in L$ then we define $F_{\ell'}(\mathsf{nsb}) = \max(F_{\ell'}(\mathsf{nsb}), \mathsf{nsb}(\ell))$; otherwise $F_{\ell'}(\mathsf{nsb}) = \mathsf{nsb}(\ell)$. As $\mathsf{nsb} \mapsto \mathsf{nsb}(\ell)$ is order-preserving and (trivially) the maximum of one affine function, $F_{\ell'}$ is the maximum of a finite family of order-preserving affine functions, since $\max$ preserves order-preservation.

For the rules (ADD), (SUB), (MULT), (DIV), (MATH) and (ASSIGN), by induction, it suffices to focus on the new inequalities. If $\ell_1 \in L$, we define $F_{\ell_1}$ as the maximum of its old definition and $RHS(\mathsf{nsb})$, i.e. $F_{\ell_1}(\mathsf{nsb}) = \max(RHS(\mathsf{nsb}), F_{\ell_1}(\mathsf{nsb}))$, where $RHS(\mathsf{nsb})$ is the right-hand side of the new inequality. If $\ell_1 \notin L$, we define $F_{\ell_1}(\mathsf{nsb}) = RHS(\mathsf{nsb})$. For these rules, $RHS(\mathsf{nsb})$ is an order-preserving affine function; it follows that $F_{\ell_1}$ is the maximum of a finite family of order-preserving affine functions.

The result follows by induction for the rule (SEQ).

The rules (COND) and (WHILE) are treated like the rules (ADD), (SUB), (MULT), (DIV), (MATH) and (ASSIGN), by induction and by considering the new set of inequalities.

The last rule (REQ) constructs $F_{\rho(x)}$ either as the constant function equal to $p$ at label $\rho(x)$ or, if $\rho(x) \in L$, as the maximum of the old definition of $F_{\rho(x)}$ and $p$.

The proof for the system $S_{\xi}$ uses the same arguments together with Lemma 2. $\Box$