
Optimal Quantization for Batch Normalization in Neural Network Deployments and Beyond

Dachao Lin
Center for Data Science
Peking University
Beijing, China
[email protected]
&Peiqin Sun
Megvii
Beijing, China
[email protected]
&Guangzeng Xie
Center for Data Science
Peking University
Beijing, China
[email protected]
&Shuchang Zhou
Megvii
Beijing, China
[email protected]
&Zhihua Zhang
School of Mathematical Sciences
Peking University
Beijing, China
[email protected]
Abstract

Quantized Neural Networks (QNNs) use low bit-width fixed-point numbers to represent weight parameters and activations, and are often deployed in real-world applications because they save computational resources and produce reproducible results. Batch Normalization (BN) poses a challenge for QNNs because its reciprocal operation requires floating-point arithmetic, so previous QNNs either compute BN at high precision or replace it with heuristic variants. In this work, we propose a novel method to quantize BN by converting an affine transformation of two floating-point numbers into a fixed-point operation with a shared quantized scale, which is friendly to hardware acceleration and model deployment. We confirm through rigorous theoretical analysis and numerical experiments that our method preserves the original outputs. The accuracy and efficiency of our quantization method are verified by layer-level experiments on the CIFAR and ImageNet datasets. We also believe that our method is potentially useful in other problems involving quantization.

1 Introduction

Deep neural networks have achieved great success in many applications, such as computer vision [20], speech recognition [10] and natural language processing [1]. In computer vision, the most widely used architecture, the Convolutional Neural Network (CNN), has demonstrated state-of-the-art results in many tasks. However, CNN-based recognition systems need large amounts of memory and computational power; training may take up to weeks on a modern multi-GPU server for large datasets such as ImageNet [5]. Hence, they are often unsuitable for smaller devices like embedded electronics, and there is a pressing demand for techniques that yield models with reduced size, faster inference and lower power consumption.

Accordingly, a large body of work has focused on reducing model size through quantization [22, 4, 21], low-rank matrix factorization [6, 16], architecture pruning [7, 8], etc. Quantization is one of the simpler ways to reduce the complexity of a model: it lowers the precision requirements for weights and activations and speeds up computation.

Methods for quantizing gradients, weights and activations [22, 4, 13, 33, 24, 2] have achieved performance much closer to that of full-precision networks. However, even after such efficient quantized neural network (QNN) training, floating-point operations remain in model inference, in particular because of the float-type numbers in Batch Normalization (BN).

Since deep networks are hard to train, the BN operator [15] is widely used. Implementing conventional BN requires computing the mean and variance, which involves sums of squares, square-root and reciprocal operations. These operations require float-type numbers to retain precision. Previous attempts at low-precision networks either drop BN layers [30] or keep them in full precision [33]. In addition, Hubara et al. [13] proposed shift-based BN, replacing almost all multiplications with power-of-2 approximations and shift operations, and Banner et al. [2] devised Range BN, which normalizes the variance according to the range of the input distribution. These modifications to BN for lower computational cost are heuristic and confirmed only by experiments. Moreover, modern hardware incurs lower deployment cost with fixed-point arithmetic, and some hardware supports only fixed-point numbers [14]. Hence, previous QNN training methods with float-type BN layers cannot be deployed on purely fixed-point hardware.

In this paper we develop a direct way to handle the floating-point numbers in BN after a benchmark QNN has been obtained with previous training methods. We view the conventional BN operator, combined with the quantization operators used in the feedforward step, as an affine transformation with only two floating-point numbers. To eliminate these floating-point numbers, we seek an exact substitution that replaces them with as few fixed-point numbers as possible. We propose to quantize the floating-point numbers with a shared quantized scale and show mathematically that, in an idealized setting, all floating-point operations can be converted to fixed-point operations. We also derive lower and upper bounds on the least possible quantized scale in our scheme. Given the proposed quantized scale, we describe a short search procedure to determine the remaining integers. Our method, used for model deployment, also preserves the precision observed in QNN experiments and can be seen as a complement to modern QNN training.

Our main contribution is summarized as follows:

  • We propose a new method for quantizing BN in model deployment by converting the two-floating-point affine transformation into a fixed-point operation with a shared quantized scale.

  • We give theoretical guarantees for our quantization method, including the existence of the quantized scale and the magnitude of the optimal quantized bit-width. In addition, we search exactly for the least possible quantized scale for each quantization bit-width to support our transformation numerically.

  • We evaluate our method on the CIFAR and ImageNet datasets with benchmark quantized neural networks, showing little performance degradation, if any.

  • Our scheme is not restricted to BN; it applies to other affine operators with floating-point numbers in a similar quantization structure, such as several variants of BN: L^{p} BN [11], Instance Normalization [28] and Range BN [2].

The remainder of the paper is organized as follows. In Section 2, we formulate the problem and present the motivation for our transformation. In Section 3, we give an equivalent form of the problem in Section 2 and establish upper and lower bounds on satisfying solutions through several propositions. In Section 4, we give a simple search algorithm for finding the exact value of the solution and verifying the results of Section 3. In Section 5, we briefly discuss the precision loss in practical hardware implementations and validate our method numerically on common quantized neural networks. Finally, we conclude in Section 6.

2 Problem Formulation

Quantization mainly focuses on compressing parameters in neural networks, such as weights in convolutional and fully-connected layers, activation outputs of hidden layers, and parameters in BN. A common design for QNNs is uniform quantization [32, 18], which distributes the quantized values evenly among the possible values, that is,

Q(x)=\frac{\lfloor\Delta x\rfloor}{\Delta},\quad x\in[x_{\min},x_{\max}],

where \Delta\in\mathbb{N} (the set of positive integers) is the quantized scale, which measures the smallest gap between quantized values, and \lfloor\cdot\rfloor denotes the floor function. The input x\in\mathbb{R} is restricted to a predefined interval [x_{\min},x_{\max}] due to limited computing power. In particular, for k-bit uniform quantization one chooses \Delta=2^{k}-1 for friendly hardware support, and simply sets x_{\min}=0,x_{\max}=1 for a non-symmetric quantizer or x_{\min}=-\frac{1}{2},x_{\max}=\frac{1}{2} for a symmetric quantizer.
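As an illustration, here is a minimal NumPy sketch of the k-bit uniform quantizer Q (the function name and the default clip range for the non-symmetric case are our own choices, not part of the paper):

    import numpy as np

    def uniform_quantize(x, k, x_min=0.0, x_max=1.0):
        # Q(x) = floor(Delta * x) / Delta with Delta = 2^k - 1, x restricted to [x_min, x_max]
        delta = 2 ** k - 1
        x = np.clip(x, x_min, x_max)
        return np.floor(delta * x) / delta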

Weights to be uniformly quantized may first be mapped into the required input range through the normalization [13]

x_{i}\rightarrow\frac{x_{i}}{2\max_{j}(|x_{j}|)}+\frac{1}{2}\in[0,1],

or combined with the hyperbolic tangent function [33]

x_{i}\rightarrow\frac{\tanh(x_{i})}{2\max_{j}(|\tanh(x_{j})|)}+\frac{1}{2}\in[0,1].

Here the maximum is taken over all weights x_{j}\in\mathbb{R} in the same layer.

Activations are harder to quantize than weights because the quantizer functions make the optimization nondifferentiable [3, 29]. Several works [21, 24, 33] have addressed this problem using bit-width allocation across layers or a continuous approximation to the quantizer function in the backpropagation step.

A common quantization approach for activations [23, 13, 33] quantizes the output of the previous layer after clipping the input to a limited interval, that is,

Q_{a}(x,y_{\min},y_{\max})=Q(\mathrm{clip}(x,y_{\min},y_{\max})),

where \mathrm{clip}(x,a,b)\triangleq\min\{\max\{x,a\},b\} with b>a, y_{\min} and y_{\max} are usually integers, and Q(\cdot) is a quantizer function as defined above. For the commonly used ReLU function, y_{\min}=0 and y_{\max}=1, y_{\max}=6 [12, 26] (ReLU6), or some other small positive integer.
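A corresponding sketch of the activation quantizer Q_a, again only illustrative (the name is ours; the defaults follow the ReLU case y_min = 0, y_max = 1):

    import numpy as np

    def quantize_activation(x, k, y_min=0.0, y_max=1.0):
        # Q_a(x, y_min, y_max) = Q(clip(x, y_min, y_max)); the output has the form N / A with N an integer
        A = 2 ** k - 1                      # quantized scale of the activations
        clipped = np.clip(x, y_min, y_max)  # clip(x, a, b) = min{max{x, a}, b}
        return np.floor(A * clipped) / A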

Since weights and activations take the form of integers over a constant quantization denominator, the feed-forward computation can be executed entirely with fixed-point operations, ignoring the fixed denominator, which enables hardware acceleration.

Benchmark models employ BN for training deep neural networks. The conventional BN operator is a composition of two affine transformations with four parameters:

\mathrm{BN}(x,\mu,\sigma,\gamma,\beta)=\gamma\frac{x-\mu}{\sigma}+\beta,

where \mu and \sigma are the mean and standard deviation estimators of the output of the same layer, updated according to training batch statistics. Besides conventional BN, several BN variants use a similar transformation with a different calculation method, such as L^{p} BN [11], which uses the L^{p} norm of the feature as \sigma; Range BN [2], which uses the maximum gap in the feature as \sigma; and Group Normalization [31] and Instance Normalization [28], which use different parts of the feature to compute \mu and \sigma. In any case, all parameters are fixed during inference and deployment.

We emphasize that the floating-point numbers come from the division operation in BN and its variants: \mu is the running mean of the batch feature in the same layer, and \beta and \gamma are updated based on quantized gradients, but the reciprocal of \sigma cannot be represented in fixed point.

Previous state-of-the-art networks such as ResNet [9] and VGG [27] employ BN after a convolutional or fully-connected layer, followed by an activation function. Let \mathbb{Z} denote the set of integers. Suppose we use quantized scales \Delta=A for activations and \Delta=W for weights. Then, from the quantizer function Q_{a}, the outputs of previous layers are of the form N_{i}/A with N_{i}\in\mathbb{Z}, and the quantized weights are of the form M_{i}/W with M_{i}\in\mathbb{Z}. Therefore, the outputs of a convolutional or fully-connected layer are of the form \sum_{i=1}^{r}\frac{M_{i}}{W}\cdot\frac{N_{i}}{A}+c, where r is the number of terms involved in computing one output of the next layer, and c is the bias term (if present), which may also be quantized. We write this output as \frac{N}{AW}+c with N\in\mathbb{Z}. Here we do not track the specific position of each parameter when there is no ambiguity.
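To make the N/(AW)+c form concrete, the following small sketch (shapes and values are made up for illustration) shows how a quantized dot product reduces to an integer accumulation N over the fixed denominator AW:

    import numpy as np

    A, W = 15, 15                   # quantized scales of activations and weights (4-bit each)
    M = np.array([3, -7, 15, 2])    # integer weight codes; the weights are M / W
    Nin = np.array([9, 14, 0, 5])   # integer activation codes; the activations are Nin / A

    N = int(np.dot(M, Nin))         # integer accumulation sum_i M_i * N_i, done in fixed point
    c = 0.0                         # bias term, if present
    output = N / (A * W) + c        # the layer output has the form N / (A W) + c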

After the quantized convolutional or fully-connected layer come BN and the quantized activation, so an element of the next layer's output is

Q_{a}\left(\mathrm{BN}\left(\frac{N}{AW}+c,\mu,\sigma,\gamma,\beta\right),y_{\min},y_{\max}\right)=\dfrac{\left\lfloor A\,\mathrm{clip}\left(\gamma\dfrac{\frac{N}{AW}+c-\mu}{\sigma}+\beta,\,y_{\min},y_{\max}\right)\right\rfloor}{A} (1)
=\dfrac{1}{A}\,\mathrm{clip}\left(\left\lfloor\dfrac{N+AW\big(\frac{\beta}{\gamma}\sigma+c-\mu\big)}{\frac{W}{\gamma}\sigma}\right\rfloor,\,Ay_{\min},\,Ay_{\max}\right)\triangleq\dfrac{1}{A}\,\mathrm{clip}\left(\left\lfloor\dfrac{N+b}{t}\right\rfloor,\,Y_{\min},\,Y_{\max}\right).

Here we can exchange the order of the clip and floor functions in the second equality because y_{\min} and y_{\max} are integers. We also collapse the two affine operators of BN into a single affine operator and substitute as follows:

Y_{\min}=Ay_{\min},\quad Y_{\max}=Ay_{\max},
b=AW\Big(\dfrac{\beta}{\gamma}\sigma+c-\mu\Big)\in\mathbb{R},\quad t=\dfrac{W}{\gamma}\sigma\in\mathbb{R}.
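For one output channel, the absorbed parameters b and t can be computed directly from the stored BN statistics; a hedged sketch (function and variable names are ours):

    def absorb_bn(A, W, gamma, beta, mu, sigma, c=0.0):
        # Collapse BN and the quantizers into the one-affine form of Eq. (1):
        #   t = (W / gamma) * sigma,  b = A * W * ((beta / gamma) * sigma + c - mu)
        t = W * sigma / gamma
        b = A * W * (beta * sigma / gamma + c - mu)
        return b, t

The next-layer activation code is then clip(\lfloor(N+b)/t\rfloor, Y_{\min}, Y_{\max}), divided by A.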

In general, floating-point arithmetic loses precision, so this modification may introduce some error due to limited computation precision. We briefly examine this with experiments in a subsequent section.

Without loss of generality, we assume y_{\min}=0 and y_{\max}=1, as clipped by the ReLU function. Other cases can be obtained by shifting the clip range:

\mathrm{clip}(x,a,b)=\mathrm{clip}(x-t,a-t,b-t)+t.

Then Y_{\min}=0 and Y_{\max}=A.

A natural attempt is to make all operations involve only fixed-point numbers, which means quantizing the floating-point numbers t and b to integers. Simply requiring t and b to be integers themselves is unlikely to work, because the input variable N\in\mathbb{Z} may vary over a wide range. A weaker requirement is to quantize t and b in the same way as the quantizer functions Q and Q_{a} do, namely to approximate t and b by rational numbers:

b\approx\frac{B_{1}}{B_{2}},\quad t\approx\frac{T_{1}}{T_{2}}.

However, this introduces two additional numbers, and the hardware cannot be fixed in advance unless B_{2} and T_{2} are predetermined. Another option is to let the surrogates of t and b share the same quantized scale, i.e., a prefixed common denominator for all potential floating-point values of t and b. This is consistent with the previous scheme, because

b\approx\frac{T_{2}B_{1}}{T_{2}B_{2}}\triangleq\frac{B}{K},\quad t\approx\frac{T_{1}B_{2}}{T_{2}B_{2}}\triangleq\frac{T}{K},

where we set K=T_{2}B_{2}, B=T_{2}B_{1}, and T=T_{1}B_{2}. Then t and b share the same prefixed quantized scale K\in\mathbb{N}.

Thus we want to replace the floating-point numbers t,b by integers T,B,K for hardware-friendly execution, where K should not change as t and b vary, so as to support a fixed hardware design.

Problem 1

Given A\in\mathbb{N}, find some K\in\mathbb{N} (as small as possible) such that, for all t,b\in\mathbb{R} with t\neq 0, there exist T,B\in\mathbb{Z} with T\neq 0 such that, for all N\in\mathbb{Z},

\mathrm{clip}\left(\left\lfloor\frac{N+b}{t}\right\rfloor,0,A\right)=\mathrm{clip}\left(\left\lfloor\frac{NK+B}{T}\right\rfloor,0,A\right). (2)

In particular, if K=1 already suffices, we can check whether t,b can be converted directly to integers.
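A candidate triple (K, T, B) for a given (t, b) can be checked against Eq. (2) by brute force over a finite range of integer inputs N; this is only a numerical sanity check (the range bound is our choice), not the formal argument of Section 3:

    import math

    def clip(x, lo, hi):
        return min(max(x, lo), hi)

    def matches(t, b, K, T, B, A, n_max=100000):
        # Compare both sides of Eq. (2) for N in [-n_max, n_max]
        for N in range(-n_max, n_max + 1):
            lhs = clip(math.floor((N + b) / t), 0, A)
            rhs = clip((N * K + B) // T, 0, A)   # exact floor via integer arithmetic
            if lhs != rhs:
                return False
        return True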

Having obtained K, we still need to decide the remaining quantized numbers T,B from the given t,b, which is the second problem we need to tackle.

Problem 2

Given A\in\mathbb{N}, t,b\in\mathbb{R} with t\neq 0, and a suitable K\in\mathbb{N} that solves Problem 1, find T,B\in\mathbb{Z} with T\neq 0 such that, for all N\in\mathbb{Z},

\mathrm{clip}\left(\left\lfloor\frac{N+b}{t}\right\rfloor,0,A\right)=\mathrm{clip}\left(\left\lfloor\frac{NK+B}{T}\right\rfloor,0,A\right).

In particular, in the conventional k-bit quantization setting, we set A=2^{k}-1 for some small k\in\mathbb{N} and report the least number of bits required for a satisfying K.

3 Theoretical Analysis

In this section, we analyze Problems 1 and 2 through several propositions, leaving the details to Appendix B.

Figure 1: Left: exact K_{n} for 1\leq n\leq 63, with the upper and lower bounds of Propositions 2 and 3. Right: exact K_{n} for 1\leq n\leq 255 in log scale, showing the quadratic growth of K_{n}.

3.1 Analysis of Problem 1

For Problem 1, a direct way to deal with the arbitrariness of N is to specify all clipped intervals as N varies.

We state a simplified version of Problem 1 in Proposition 1 (see Appendix B.1 for the proof).

Proposition 1

Problem 1 is equivalent to finding K\in\mathbb{N} such that, for all t,b\in\mathbb{R} with t\neq 0, we can obtain T,B\in\mathbb{Z} with T\neq 0 s.t.

\lceil it-b\rceil=\Big\lceil\frac{iT-B}{K}\Big\rceil,\quad\forall i\in\{1,2,\dots,A\}. (3)

Here \lceil\cdot\rceil denotes the ceiling function. We first analyze Eq. (3) without the constraint T\neq 0, which has little influence in practice; the constraint is treated in Appendix E.

Set S_{i}=\lceil it-b\rceil. According to Proposition 1, we only need to analyze the sequence \{S_{i}\}_{i=1}^{n} (here we replace A by n for notational simplicity). An analytic formula for K seems hard to obtain, but we only need a rough understanding of the least possible K for future hardware support. Let K_{n} denote the minimal K\in\mathbb{N} that works for all sequences of length n. First, we need to know whether our transformation is feasible at all. The following proposition guarantees the existence of K_{n} and gives the magnitude of K_{n} with respect to n.

Proposition 2

For all n\in\mathbb{N}, K_{n} exists. Moreover, if n>4, then as long as K>\frac{(n-1)(n-3)}{2}, we can find T,B satisfying the requirements. Hence, K_{n}\leq\frac{(n-1)(n-3)}{2}+1.

The key insight behind the existence of K_{n} is that when K is large enough, there are more choices of T,B available to satisfy the finitely many constraints in Eq. (3). The proof proceeds by eliminating T,B from the conditions over all possible sequences \{S_{i}\}_{i=1}^{n} and applying the discreteness of the integers. The proof is given in Appendix B.2.

For the lower bound on K_{n}, we can examine specific sequences to see how large K_{n} must be. The idea is to choose sequences \{S_{i}\}_{i=1}^{n} whose maximum gap between items is small, so that the existence of T can be checked quickly.

Proposition 3

If n\geq 15, we have K_{n}\geq\frac{(n-1)^{2}}{4}. Furthermore, if n\geq 27, then K_{n}>\frac{n^{2}}{4}.

Remark 1

Propositions 2 and 3 show that K_{n}=\Theta(n^{2}); more precisely, \frac{1}{4}n^{2}<K_{n}<\frac{1}{2}n^{2} except for some small n. For practical deployment, when quantizing activations with k bits, we need to choose K with at least (2k-2) bits.
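As a concrete check of these bounds against the exact value reported later in Table 1, take k=4, so n=2^{4}-1=15:

\frac{(n-1)^{2}}{4}=49\;\leq\;K_{15}=51\;\leq\;\frac{(n-1)(n-3)}{2}+1=85,

and 51 indeed requires 6=2k-2 bits.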

Remark 2

From Proposition 2, every K>\frac{(n-1)(n-3)}{2} is a solution of Problem 1. However, as shown in Appendix D, the satisfying K are not always consecutive in the range K_{n}\leq K\leq\frac{(n-1)(n-3)}{2}.

Moreover, some special input ranges can substantially decrease the magnitude of K_{n}, as shown in Appendix G.2.

3.2 Analysis of Problem 2

Intuitively, given K,b,t and ignoring the floor and clip functions, we have

\frac{K}{T}\approx\frac{1}{t},\quad\frac{B}{T}\approx\frac{b}{t}.

Hence T\approx Kt and B\approx Kb. Because T,B\in\mathbb{Z}, we take T\approx\lfloor Kt\rfloor (or \lceil Kt\rceil) and B\approx\lfloor Kb\rfloor (or \lceil Kb\rceil). However, after an exhaustive search over all possible T,B, we find that the nearest integer is not always correct (see Proposition 8 and Appendix B.8). The gap between suitable T,B and this intuitive approximation depends on the specific t,b; still, the intuitive choice of T,B is correct most of the time.
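Concretely, Proposition 9 below bounds how far a valid T can lie from Kt, so a simple deployment-time routine can scan a small window around \lfloor Kt\rfloor and recover B from the interval used in the proof of Proposition 4. A hedged Python sketch (names are ours; it assumes t>0 as in the reduction of Proposition 1, and returns None if the given K is unsuitable):

    import math

    def find_TB(t, b, K, n):
        # S_i = ceil(i*t - b) for i = 1..n, the sequence the replacement must reproduce
        S = [math.ceil(i * t - b) for i in range(1, n + 1)]
        window = math.ceil(2 * K / (n - 1)) + 1       # Proposition 9: |T - K*t| < 2K/(n-1)
        T0 = round(K * t)
        for dT in sorted(range(-window, window + 1), key=abs):   # try candidates nearest K*t first
            T = T0 + dT
            if T == 0:
                continue
            # B must satisfy i*T - K*S_i <= B < i*T - K*S_i + K for every i (see Appendix B)
            lo = max(i * T - K * S[i - 1] for i in range(1, n + 1))
            hi = min(i * T - K * S[i - 1] for i in range(1, n + 1)) + K
            if lo < hi:
                return T, lo          # any integer B in [lo, hi) works; lo itself is valid
        return None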

To establish suitable T,B for a given K based on the above, we offer a conservative way to obtain all appropriate T,B, as stated in Proposition 4 and Algorithm 5 (see Appendix F).

Proposition 4

Given t,b\in\mathbb{R} and a proper K\in\mathbb{N}, a pair T,B\in\mathbb{N} satisfies Problem 2 if and only if there exists a T for which the following conditions are met:

\max_{i>j}\Big\lfloor\frac{S_{i}-S_{j}-1}{i-j}K\Big\rfloor<T<\min_{i>j}\Big\lceil\frac{S_{i}-S_{j}+1}{i-j}K\Big\rceil, (4)
\max_{i}\left(iT-KS_{i}\right)<\min_{j}\left(jT-KS_{j}\right)+K,

where i,j\in\{1,2,\dots,n\}.

Moreover, we show in Proposition 8 that the intuitive way of obtaining T,B is partially correct, and we give a loose bound on the possible ranges of T and B in Proposition 9 (see Appendix A). In addition, for a general quantizer range instead of [0,A] in Problem 1, a similar analysis applies (Appendix G.1).
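Proposition 4 translates directly into a feasibility test: given K and a sequence {S_i}, compute the integer interval allowed for T by condition (4) and check whether some T in it admits a B. A rough sketch of this test, assuming n >= 2 (this is our reading of the conditions, not the exact Algorithm 5 of Appendix F; as in the main text, the constraint T != 0 is deferred to Appendix E):

    from itertools import combinations

    def ceil_div(a, b):
        # exact ceiling of a / b for integers with b > 0
        return -((-a) // b)

    def K_is_feasible(S, K):
        # Proposition 4: does some pair (T, B) exist for scale K and sequence S?
        n = len(S)
        pairs = [(i, j) for j, i in combinations(range(1, n + 1), 2)]        # all i > j
        T_low = max((S[i-1] - S[j-1] - 1) * K // (i - j) for i, j in pairs)
        T_high = min(ceil_div((S[i-1] - S[j-1] + 1) * K, i - j) for i, j in pairs)
        for T in range(T_low + 1, T_high):
            vals = [i * T - K * S[i - 1] for i in range(1, n + 1)]
            if max(vals) < min(vals) + K:     # second condition: an integer B exists
                return True
        return False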

Table 1: Searched K_{n} for commonly used k-bit quantization (n=2^{k}-1).

n        3 (2-bit)   7 (3-bit)   15 (4-bit)   31 (5-bit)   63 (6-bit)   127 (7-bit)   255 (8-bit)
K_{n}    2           9           51           289          1459         6499          28323

4 Numerical Computation

In the previous section we discussed K_{n} theoretically. In this section, we present how to compute the exact K_{n} to evaluate the theoretical analysis.

For convenience, by Proposition 5 we may restrict to t\in[0,1) and \lceil t-b\rceil=1. Then S_{1}=1, 0\leq S_{i+d}-S_{i}\leq d, and S_{i}\leq i for i\in\{1,\dots,n-1\} and d\in\{1,\dots,n-i\}.

Proposition 5

It suffices to consider t\in[0,1) and \lceil t-b\rceil=1 when searching for K_{n}.

As a safeguard, we naively store all possible sequences \{S_{i}\}_{i=1}^{n} and then check, for every K, whether the corresponding T,B exist. We generate the sequences \{S_{i}\}_{i=1}^{n} recursively based on the following result and the ranges of t,b.

Proposition 6

Suppose \{S_{i}\}_{i=1}^{n+1} is a valid sequence of length n+1. Then \{S_{i}\}_{i=1}^{n} is a valid sequence of length n. Conversely, if \{S_{i}\}_{i=1}^{n} is a valid sequence of length n, there must be a valid sequence of length n+1 with the same first n terms.

We now turn to finding K_{n} exactly. On account of Proposition 3, we start the search from \frac{1}{4}(n-1)^{2} when n\geq 15. We use brute force, guided by the proposition below, to search for the K_{n} that satisfies all possible sequences \{S_{i}\}_{i=1}^{n}.

Proposition 7

Given a sequence \{S_{1},\dots,S_{n}\}, if

\min_{j>i}\frac{j(S_{i}-1)-iS_{j}}{i-j}>\max_{j<i}\frac{j(S_{i}-1)-iS_{j}}{i-j}, (5)

then there exist t,b\in\mathbb{R} that can generate the sequence above.

The whole process is shown below:

  1. First, produce candidate sequences. From Proposition 6, we can generate all \{S_{i}\}_{i=1}^{n} recursively.

  2. Second, record all valid sequences \{S_{i}\}_{i=1}^{n}. Given a candidate sequence \{S_{1},\dots,S_{n}\}, we use Proposition 7 to confirm that there exist t,b which reproduce it.

  3. Third, for each recorded sequence \{S_{i}\}_{i=1}^{n}, check whether the given K is satisfied via Proposition 4.

Due to the large number of possible sequences \{S_{i}\}_{i=1}^{n}, we suggest searching over a small window of the sequences at a time and sliding the window until all sequences are covered. The main pseudo-algorithm and the auxiliary pseudo-algorithms for searching the exact K_{n} are given in Appendix F.
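For completeness, here is a compact, unoptimized sketch of the three steps for small n (it reuses K_is_feasible from the sketch after Proposition 4; the realizability test is our transcription of Proposition 7, the starting point of the scan follows Proposition 3, and the windowed search described above is omitted):

    from fractions import Fraction

    def realizable(S):
        # Proposition 7 test: some t, b generate S when min_{j>i}(...) > max_{j<i}(...)
        n = len(S)
        lo, hi = Fraction(-10**18), Fraction(10**18)   # stand-ins for -inf / +inf
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                if j == i:
                    continue
                v = Fraction(j * (S[i-1] - 1) - i * S[j-1], i - j)
                if j > i:
                    hi = min(hi, v)
                else:
                    lo = max(lo, v)
        return hi > lo

    def all_sequences(n):
        # Proposition 6 plus Section 4: S_1 = 1 and consecutive steps are 0 or 1; prune while extending
        seqs = [[1]]
        for _ in range(n - 1):
            seqs = [s + [s[-1] + d] for s in seqs for d in (0, 1) if realizable(s + [s[-1] + d])]
        return seqs

    def exact_Kn(n):
        # Smallest K that passes the Proposition 4 test for every realizable sequence of length n
        seqs = all_sequences(n)
        K = (n - 1) ** 2 // 4 if n >= 15 else 1        # start from the Proposition 3 lower bound
        while not all(K_is_feasible(S, K) for S in seqs):
            K += 1
        return K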

4.1 Experimental KnK_{n} Results

We give a short Table 1 corresponding to specific bit-widths of quantized activations. For more values of K_{n} with other n, see Appendix C. Guarding against numerical errors, we also re-examine a wide range of inputs N in [-M-b,\,At-b+M] with M large enough to cover every possible input.

We also plot the growth of K_{n} versus n in Figure 1. Since saving all sequences is time- and space-consuming, we only search n\in[1,63] and n\in\{127,255\}. The magnitude shown in the figure is consistent with Propositions 2 and 3, confirming the quadratic growth of K_{n}.

In practice, it is not necessary to find K_{n} exactly: we only need some proper K that converts the float-type parameters in BN and is hardware-friendly, so the exact K_{n} is not necessarily the best choice.

5 Experiments

In this section we assess the effect of our BN quantization method on previously trained QNNs on ImageNet [25] (we also experiment on CIFAR-10 [19] in Appendix H). Our method is applied to model deployment at the layer level and fits various quantized neural networks using conventional BN or its variants. Our aim is to verify the accuracy change introduced by our hardware-friendly modification, rather than to obtain high accuracy as previous quantization training methods do.

There are two main concerns to verify. First, the derivation in Eq. (1) absorbs the parameters of BN into the neighboring convolutional or fully-connected layer, which may degrade precision. We use bt to denote absorbing the BN parameters in such a layer through Eq. (1), applied to quantization-trained models. Second, because the theoretical analysis works with real numbers rather than floating-point numbers, we should check whether the theoretical results hold under floating-point arithmetic. Although the numerical experiments in Section 4 already examined this, we report the final accuracy to support our method again. In the experimental tables, BT denotes our layer-wise fixed-point replacement from Problem 1 applied on top of bt.

We use DoReFa-Net [33], an efficient QNN method, to train a quantized model for verification, and denote by QT the accuracy obtained with DoReFa-Net quantization.

DoReFa-Net uses low-bit-width weights and activations together with low-bit-width parameter gradients to accelerate both training and inference. To obtain more suitable baselines, we quantize only weights and activations while using full-precision gradients in training, and we do not quantize the final fully-connected layer for better training performance. We use xWyA to denote x-bit quantized weights and y-bit quantized activations, and we examine 2W4A, 4W4A and 8W8A in the experimental tables. We choose K=64 and K=2^{16} for 4-bit and 8-bit quantized activations, respectively, according to the discussion in Appendix D.

5.1 Quantization for ImageNet Classification

The next set of experiments studies our method using VGG16 [27] and ResNet18 [9] on the ImageNet dataset. We adopt data augmentation including random cropping, horizontal flipping, lighting and color jittering, and use the Adam optimizer with a step-wise learning rate initialized at 0.3 and divided by 10 every 30 epochs, for 100 epochs in total. After obtaining the trained QNN, we convert the final model into the bt and BT modes described above. The main results are shown in Table 2.

5.2 Accuracy and Complexity Analysis

Table 2 shows that the QNN training method achieves comparable performance even when quantized with 2W4A and 4W4A. Once a QNN is trained, we absorb the only floating-point numbers, those in BN, with our scheme. From the results, bt differs only slightly from the original quantized model QT, which means that the one-affine transformation in Eq. (1) introduces a tiny disturbance. Moreover, our substitution from Problem 1 yields identical results for bt and BT. In principle, bt and BT could differ when operations exceed the computation precision. We prefer the BT mode because of its entirely fixed-point implementation and its slight performance degradation, if any.

Additionally, practical QNN applications mainly use small bit-widths (up to 8 bits to our knowledge, i.e., n=255) for quantization. From Proposition 4 (or Proposition 9 in Appendix A), a single conversion in the worst case is O(n), and time can be saved by searching around the intuitive values Kt and Kb; moreover, the model only needs to be converted once, with few BN parameters.

Table 2: Top-1 and Top-5 test accuracy of different quantized weight and activation bit-widths with our transformation on ImageNet, trained with VGG16 (top) and ResNet18 (bottom).

VGG16
Bits   2W4A            4W4A            8W8A
       Top-1   Top-5   Top-1   Top-5   Top-1   Top-5
QT     69.46   88.84   70.62   89.46   70.83   89.59
bt     69.43   88.88   70.52   89.45   70.81   89.58
BT     69.43   88.88   70.52   89.45   70.81   89.58

ResNet18
Bits   2W4A            4W4A            8W8A
       Top-1   Top-5   Top-1   Top-5   Top-1   Top-5
QT     65.94   86.54   68.15   88.09   68.67   88.18
bt     65.99   86.55   68.15   88.11   68.66   88.17
BT     65.99   86.55   68.15   88.11   68.66   88.17

6 Conclusion

In this paper we have investigated the problem of quantizing the floating-point numbers in BN, combined with quantized convolutional or fully-connected weights and activations. Noting that conventional BN consists of two affine transformations, we collapse all floating-point numbers into a single affine operator with only two floating-point parameters. We accordingly propose to quantize the floating-point numbers in this one-affine operator with a shared quantized scale. We have shown that the required quantized scale is roughly twice the activation bit-width, and we have given numerical schemes for computing the corresponding substitution. Our approach matches the high-precision inference results with no observed error, and the recommended strategy enables efficient model deployment with entirely fixed-point operations.

It is worth emphasizing that our quantization scheme suits other affine operators, which are also common in deep neural networks. Beyond BN, and even beyond the neural network context, we believe that our quantization scheme has potential applications in other problems involving quantization.

Broader Impact

a) & b) If the proposed quantization method is verified to be useful in numerous real applications by engineers in the future, it will have a positive impact on model compression; QNN design would then pay more attention to fully fixed-point, effective QNNs with BN. c) Our method covers all possible outputs and gives convincing theoretical results, though it may fail when the fixed-point range of the hardware does not support the converted large range of T,B; we seldom see this in practice. Besides, floating-point operations may introduce precision error, which our method encounters in the first conversion step. d) Our method is data-independent.

References

  • Bahdanau et al. [2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  • Banner et al. [2018] Ron Banner, Itay Hubara, Elad Hoffer, and Daniel Soudry. Scalable methods for 8-bit training of neural networks. In Advances in Neural Information Processing Systems, pages 5145–5153, 2018.
  • Cai et al. [2017] Zhaowei Cai, Xiaodong He, Jian Sun, and Nuno Vasconcelos. Deep learning with low precision by half-wave gaussian quantization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5918–5926, 2017.
  • Courbariaux et al. [2015] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pages 3123–3131, 2015.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • Denton et al. [2014] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in neural information processing systems, pages 1269–1277, 2014.
  • Han et al. [2015a] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015a.
  • Han et al. [2015b] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages 1135–1143, 2015b.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Hinton et al. [2012] Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Brian Kingsbury, et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal processing magazine, 29, 2012.
  • Hoffer et al. [2018] Elad Hoffer, Ron Banner, Itay Golan, and Daniel Soudry. Norm matters: efficient and accurate normalization schemes in deep networks. In Advances in Neural Information Processing Systems, pages 2160–2170, 2018.
  • Howard et al. [2017] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • Hubara et al. [2017] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 18(1):6869–6898, 2017.
  • Ignatov et al. [2018] Andrey Ignatov, Radu Timofte, William Chou, Ke Wang, Max Wu, Tim Hartley, and Luc Van Gool. Ai benchmark: Running deep neural networks on android smartphones. In Proceedings of the European Conference on Computer Vision (ECCV), pages 0–0, 2018.
  • Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
  • Jaderberg et al. [2014] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Krishnamoorthi [2018] Raghuraman Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342, 2018.
  • Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • Lin et al. [2016] Darryl Lin, Sachin Talathi, and Sreekanth Annapureddy. Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning, pages 2849–2858, 2016.
  • Lin et al. [2015] Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, and Yoshua Bengio. Neural networks with few multiplications. arXiv preprint arXiv:1510.03009, 2015.
  • Mishra et al. [2017] Asit Mishra, Eriko Nurvitadhi, Jeffrey J Cook, and Debbie Marr. Wrpn: wide reduced-precision networks. arXiv preprint arXiv:1709.01134, 2017.
  • Rastegari et al. [2016] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542. Springer, 2016.
  • Russakovsky et al. [2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
  • Sandler et al. [2018] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4510–4520, 2018.
  • Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Ulyanov et al. [2016] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
  • Vanhoucke et al. [2011] Vincent Vanhoucke, Andrew Senior, and Mark Z Mao. Improving the speed of neural networks on cpus. 2011.
  • Wu et al. [2018] Shuang Wu, Guoqi Li, Feng Chen, and Luping Shi. Training and inference with integers in deep neural networks. arXiv preprint arXiv:1802.04680, 2018.
  • Wu and He [2018] Yuxin Wu and Kaiming He. Group normalization. In The European Conference on Computer Vision (ECCV), September 2018.
  • Zhou et al. [2017] Shu-Chang Zhou, Yu-Zhi Wang, He Wen, Qin-Yao He, and Yu-Heng Zou. Balanced quantization: An effective and efficient approach to quantized neural networks. Journal of Computer Science and Technology, 32(4):667–682, 2017.
  • Zhou et al. [2016] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.

Appendix A Additional Propositions

Proposition 8

Given t,b\in\mathbb{R} and a proper K\in\mathbb{N}, suppose T,B satisfying Eq. (2) exist. Then either T=\lfloor tK\rfloor or T=\lceil tK\rceil is a candidate value, and either B=\lfloor bK\rfloor or B=\lceil bK\rceil satisfies the requirements. However, a pair T,B combined from these candidate values may not satisfy Problem 2.

Proposition 9

Given t,b\in\mathbb{R} and a proper K\in\mathbb{N}, suppose T,B satisfying Eq. (2) exist. Then the candidate values of T and B satisfy

Kt-\frac{2K}{n-1}<T<Kt+\frac{2K}{n-1},
Kb-\frac{n+1}{n-1}K<B<Kb+\frac{n+1}{n-1}K.

Appendix B Proof of Propositions

The following proofs use the properties of the ceiling function below.

Lemma 1 (Properties of the ceiling function)
\lceil x\rceil+\lceil y\rceil\geq\lceil x+y\rceil\geq\lceil x\rceil+\lceil y\rceil-1,
x\leq\lceil x\rceil<x+1.

B.1 Proof of Proposition 1

Proof: Without loss of generality, we may assume t>0; otherwise we can negate b,t,B,T. Then, as N\rightarrow\infty, \frac{K}{T} and t must have the same sign, so T\geq 0; we leave the case T=0 to Appendix E.

By the property of the floor function, the original problem is equivalent to the following conditions:

\left\{\begin{aligned} &\frac{N+b}{t}<1&\Leftrightarrow&&&\frac{NK+B}{T}<1,\\ i\leq&\frac{N+b}{t}<i+1&\Leftrightarrow&&\ i\leq&\frac{NK+B}{T}<i+1,\ \forall i\in\{1,2,\dots,A-1\},\\ &\frac{N+b}{t}\geq A&\Leftrightarrow&&&\frac{NK+B}{T}\geq A.\end{aligned}\right. (6)

Equivalently, for all i\in\{1,2,\dots,A-1\},

\left\{\begin{aligned} &N<t-b&\Leftrightarrow&&&N<\frac{T-B}{K},\\ it-b\leq&N<(i+1)t-b&\Leftrightarrow&&\ \frac{iT-B}{K}\leq&N<\frac{(i+1)T-B}{K},\\ &N\geq At-b&\Leftrightarrow&&&N\geq\frac{AT-B}{K}.\end{aligned}\right.

Hence \lceil it-b\rceil=\lceil\frac{iT-B}{K}\rceil for all i\in\{1,2,\dots,A\}. \square

B.2 Proof of Proposition 2

Proof: It is obvious that if n=1,2n=1,2, then Kn=1K_{n}=1.

When n>2n>2, we only need to consider the problem in Proposition 1, that is, i{1,2,,n}\forall i\in\{1,2,\cdots,n\},

itbiTBK>itb1.\lceil it-b\rceil\geq\frac{iT-B}{K}>\lceil it-b\rceil-1.

Set Si=itbS_{i}=\lceil it-b\rceil,

KSiiTB>KSiK.KS_{i}\geq iT-B>KS_{i}-K. (7)

Then

iTKSiB<iTKSi+K.iT-KS_{i}\leq B<iT-KS_{i}+K.

Pay attention to BB\in\mathbb{Z}, so that

maxi(iTKSi)<mini(iTKSi+K),\max_{i}(iT-KS_{i})<\min_{i}(iT-KS_{i}+K),

which means

(ij)TK(SiSj)<K,i,j{1,2,,n}.(i{-}j)T-K(S_{i}{-}S_{j})<K,\ \forall i,j\in\{1,2,\cdots,n\}.

Analyzing the cases of i>ji>j and i<ji<j, we obtain
i,j{1,2,,n},i>j\forall i,j\in\{1,2,\cdots,n\},\ i>j,

SiSj1ijK<T<SiSj+1ijK.\frac{S_{i}-S_{j}-1}{i-j}K<T<\frac{S_{i}-S_{j}+1}{i-j}K. (8)

It follows from Lemma 1 in Appendix B that i,j{1,,n}\forall i,j\in\{1,\cdots,n\}, i>ji>j,

(Si1)Sjij<(itb)(jtb)ij=t.\frac{(S_{i}-1)-S_{j}}{i-j}<\frac{(it-b)-(jt-b)}{i-j}=t. (9)

l,m{1,2,,n}\forall l,m\in\{1,2,\cdots,n\}, l>ml>m,

t=(ltb)(mtb)lm<Sl(Sm1)lm.t=\frac{(lt-b)-(mt-b)}{l-m}<\frac{S_{l}-(S_{m}-1)}{l-m}.

Hence,

SiSj1ij<SlSm+1lm.\frac{S_{i}-S_{j}-1}{i-j}<\frac{S_{l}-S_{m}+1}{l-m}.

Therefore (SlSm+1)(ij)(SiSj1)(lm)(S_{l}-S_{m}+1)(i-j)-(S_{i}-S_{j}-1)(l-m)\in\mathbb{N}.
We make

(SlSm+1lmSiSj1ij)K>1.\left(\frac{S_{l}-S_{m}+1}{l-m}-\frac{S_{i}-S_{j}-1}{i-j}\right)K>1. (10)

Then TT always exists because the scope of TT is larger than one and includes at least one integer.

A naive idea is that when ijlmi-j\neq l-m,

SlSm+1lmSiSj1ij\displaystyle\frac{S_{l}-S_{m}+1}{l-m}-\frac{S_{i}-S_{j}-1}{i-j}
=(SlSm+1)(ij)(SiSj1)(lm)(lm)(ij)\displaystyle=\frac{(S_{l}{-}S_{m}+1)(i{-}j)-(S_{i}{-}S_{j}{-}1)(l{-}m)}{(l{-}m)(i{-}j)}
1(lm)(ij)1(n1)(n2).\displaystyle\geq\frac{1}{(l-m)(i-j)}\geq\frac{1}{(n-1)(n-2)}.

When ij=lmi-j=l-m and n>2n>2, we have that

SlSm+1lmSiSj1ij\displaystyle\frac{S_{l}-S_{m}+1}{l-m}-\frac{S_{i}-S_{j}-1}{i-j} 1(lm)1n1\displaystyle\geq\frac{1}{(l-m)}\geq\frac{1}{n-1}
1(n1)(n2).\displaystyle\geq\frac{1}{(n-1)(n-2)}.

So if K>(n1)(n2)K>(n-1)(n-2), we are able to find T,BT,B given any t,bt,b. Therefore, KnK_{n} exists.

Before the remaining proof, we need lemma below.

Lemma 2

Suppose a,b,c,d,a,c>1a,b,c,d\in\mathbb{N},a,c>1 and abcd=1ab-cd=1, then (b,c)=(a,d)=1(b,c)=(a,d)=1 and

n,n<a,{dna}0,\forall n\in\mathbb{N},n<a,\{\frac{dn}{a}\}\neq 0,
n,n<c,{bnc}0,\forall n\in\mathbb{N},n<c,\{\frac{bn}{c}\}\neq 0,

where {x}\{x\} is the fractional part of xx.

Now we turn to the left part of Proposition 2.

Using Proposition 5, we only need consider t[0,1)t\in[0,1). Let us take more precise analysis from Eq. (10). Suppose

ΔSlSm+1lmSiSj1ij=(SlSm+1)(ij)(SiSj1)(lm)(lm)(ij).\Delta\triangleq\frac{S_{l}-S_{m}+1}{l-m}-\frac{S_{i}-S_{j}-1}{i-j}=\frac{(S_{l}-S_{m}+1)(i-j)-(S_{i}-S_{j}-1)(l-m)}{(l-m)(i-j)}. (11)

When lm=ijl-m=i-j and n5n\geq 5,

SlSm+1lmSiSj1ij1lm1n12(n1)(n3).\frac{S_{l}-S_{m}+1}{l-m}-\frac{S_{i}-S_{j}-1}{i-j}\geq\frac{1}{l-m}\geq\frac{1}{n-1}\geq\frac{2}{(n-1)(n-3)}.

So we only need to check lmijl-m\neq i-j.

  1. 1.

    If (SlSm+1)(ij)(SiSj1)(lm)3(S_{l}-S_{m}+1)(i-j)-(S_{i}-S_{j}-1)(l-m)\geq 3, n>4n>4.

    SlSm+1lmSiSj1ij3(lm)(ij)3(n1)(n2)2(n1)(n3).\frac{S_{l}-S_{m}+1}{l-m}-\frac{S_{i}-S_{j}-1}{i-j}\geq\frac{3}{(l-m)(i-j)}\geq\frac{3}{(n-1)(n-2)}\geq\frac{2}{(n-1)(n-3)}.
  2. 2.

    If (SlSm+1)(ij)(SiSj1)(lm)=2(S_{l}-S_{m}+1)(i-j)-(S_{i}-S_{j}-1)(l-m)=2, n>4n>4.

    Set α=SlSm,β=SiSj\alpha=S_{l}-S_{m},\beta=S_{i}-S_{j}. Use Proposition 5, then 0αlm,0βij0\leq\alpha\leq l-m,0\leq\beta\leq i-j.

    • If (lm,ij)=(n1,n2)(l-m,i-j)=(n-1,n-2), then (l,m)=(n,1)(l,m)=(n,1),  (i,j)=(n,2)(i,j)=(n,2) or (n1,1)(n-1,1), then αβ\alpha\geq\beta.

      Hence

      2=(α+1)(n2)(β1)(n1)2n3β2n3(n2)>3.2=(\alpha+1)(n-2)-(\beta-1)(n-1)\geq 2n-3-\beta\geq 2n-3-(n-2)>3.

      No solution!

    • If (lm,ij)=(n2,n1)(l-m,i-j)=(n-2,n-1), then (i,j)=(n,1)(i,j)=(n,1), (l,m)=(n,2)(l,m)=(n,2) or (n1,1)(n-1,1), then αβ1\alpha\geq\beta-1.

      Hence

      2=(α+1)(n1)(β1)(n2)n2+β>2.2=(\alpha+1)(n-1)-(\beta-1)(n-2)\geq n-2+\beta>2.

      No solution!

    Therefore, (lm)(ij)(n1)(n3)(l-m)(i-j)\leq(n-1)(n-3)

    SlSm+1lmSiSj1ij2(lm)(ij)2(n1)(n3).\frac{S_{l}-S_{m}+1}{l-m}-\frac{S_{i}-S_{j}-1}{i-j}\geq\frac{2}{(l-m)(i-j)}\geq\frac{2}{(n-1)(n-3)}.
  3. 3.

    If (SlSm+1)(ij)(SiSj1)(lm)=1(S_{l}-S_{m}+1)(i-j)-(S_{i}-S_{j}-1)(l-m)=1, n>4n>4.

    Use Lemma 1,

    1(ij)(lm)t(lm)((ij)t1)>(ij)(lm)t(lm)(ij)t=0.1\geq(i-j)\lceil(l-m)t\rceil-(l-m)(\lceil(i-j)t\rceil-1)>(i-j)(l-m)t-(l-m)(i-j)t=0.

    Hence

    (ij)(lm)t(lm)((ij)t1)=1.(i-j)\lceil(l-m)t\rceil-(l-m)(\lceil(i-j)t\rceil-1)=1. (12)

    Set a=ij,c=lma=i-j,c=l-m, which lead to

    actc(at1)=1.a\lceil ct\rceil-c(\lceil at\rceil-1)=1. (13)

    So we can see

    Δ1(lm)(ij)=1ca.\Delta\geq\frac{1}{(l-m)(i-j)}=\frac{1}{ca}.

    Next we will prove that a,ca,c satisfy the follow cases.

    Case1. If a=1a=1 or c=1c=1, then

    Δ1n12(n1)(n3).\Delta\geq\frac{1}{n-1}\geq\frac{2}{(n-1)(n-3)}.

    Case2. If a+cn1a+c\leq n-1, then

    Δ4(a+c)24(n1)22(n1)(n3).\Delta\geq\frac{4}{(a+c)^{2}}\geq\frac{4}{(n-1)^{2}}\geq\frac{2}{(n-1)(n-3)}.

    From Case1, we may assume a,c>1a,c>1. Meanwhile take Eq. (13) into the condition (SlSm+1)(ij)(SiSj1)(lm)=1(S_{l}-S_{m}+1)(i-j)-(S_{i}-S_{j}-1)(l-m)=1, we get (a,c)=1(a,c)=1, and

    (SlSm+1(lm)t)a=(SiSj(ij)t)c.(S_{l}-S_{m}+1-\lceil(l-m)t\rceil)a=(S_{i}-S_{j}-\lceil(i-j)t\rceil)c.

    Since 0SlSm+1(lm)t10\leq S_{l}-S_{m}+1-\lceil(l-m)t\rceil\leq 1, therefore

    ltbmtb+1=(lm)t=ct,\displaystyle\lceil lt-b\rceil-\lceil mt-b\rceil+1=\lceil(l-m)t\rceil=\lceil ct\rceil, (14)
    itbjtb=(ij)t=at.\displaystyle\lceil it-b\rceil-\lceil jt-b\rceil=\lceil(i-j)t\rceil=\lceil at\rceil.

    Hence

    {mtb}+{ct}1andct,mtb.\displaystyle\{mt-b\}+\{ct\}\leq 1\ and\ ct,mt-b\notin\mathbb{Z}. (15)
    {jtb}+{at}>1oratorjtb.\displaystyle\{jt-b\}+\{at\}>1\ or\ at\in\mathbb{Z}\ or\ jt-b\in\mathbb{Z}. (16)
    • If atat\in\mathbb{Z}, then actcactc=at=at\frac{a\lceil ct\rceil}{c}\geq\frac{act}{c}=at=\lceil at\rceil. By Eq. (13),

      c(at1)+1cat,\frac{c(\lceil at\rceil-1)+1}{c}\geq\lceil at\rceil,

      so c=1c=1, to Case1.

    • If jtbjt-b\in\mathbb{Z} or {jtb}+{at}>1\{jt-b\}+\{at\}>1, by the previous fact that at,ctat,ct\notin\mathbb{Z} and Eq. (13), we can get the range of tt:

      at1a<t<ctc.\frac{\lceil at\rceil-1}{a}<t<\frac{\lceil ct\rceil}{c}.

      We set

      t=act1+δac=c(at1)+δac,δ(0,1),t=\frac{a\lceil ct\rceil-1+\delta}{ac}=\frac{c(\lceil at\rceil-1)+\delta}{ac},\delta\in(0,1),

      then

      {ct}=11δa>{at}=δc.\{ct\}=1-\frac{1-\delta}{a}>\{at\}=\frac{\delta}{c}.

      The last inequality comes from a,c>1a,c>1.

      Reuse Eq. (15) and Eq. (16),

      {mtb}1δa,{jtb}>1δcor{jtb}=0.\{mt-b\}\leq\frac{1-\delta}{a},\{jt-b\}>1-\frac{\delta}{c}\ or\ \{jt-b\}=0.
      • When m=jm=j, because from Eq. (15), mtbZmt-b\notin Z, so {jtb}0\{jt-b\}\neq 0, then

        1δa{jtb}>1δa.\frac{1-\delta}{a}\geq\{jt-b\}>1-\frac{\delta}{a}.

        Contradiction!

      • When m>jm>j,

        {(mj)t}<1δa+δc.\{(m-j)t\}<\frac{1-\delta}{a}+\frac{\delta}{c}.

        If mj<am-j<a,

        (mj)at1a<(mj)t<(mj)ctc.(m-j)\frac{\lceil at\rceil-1}{a}<(m-j)t<(m-j)\frac{\lceil ct\rceil}{c}.

        Then the range of (mj)t(m-j)t is mjac\frac{m-j}{ac} which less than 1c\frac{1}{c}, and the fraction of the left point of (mj)t(m-j)t can’t be zero by Lemma 2. So 1a1δa+δc\frac{1}{a}\leq\frac{1-\delta}{a}+\frac{\delta}{c}, else mjac1a+1c\frac{m-j}{ac}\geq\frac{1}{a}+\frac{1}{c}, which means a+cn1a+c\leq n-1, to Case2.
        Then aca\geq c. As (a,c)=1(a,c)=1, so a>ca>c.

        {(mj)t}<1δa+δc<1c.\{(m-j)t\}<\frac{1-\delta}{a}+\frac{\delta}{c}<\frac{1}{c}.

        Due to 0<(mj)ctc(mj)t<1c0<\frac{(m-j)\lceil ct\rceil}{c}-(m-j)t<\frac{1}{c}, hence

        {(mj)ctc}=1c.\left\{\frac{(m-j)\lceil ct\rceil}{c}\right\}=\frac{1}{c}.

        Therefore mj=arc,r,rc<am-j=a-rc,\ r\in\mathbb{N},rc<a, but this time

        {(mj)t}=δc+(1δ)raδc+1δa.\left\{(m-j)t\right\}=\frac{\delta}{c}+\frac{(1-\delta)r}{a}\geq\frac{\delta}{c}+\frac{1-\delta}{a}.

        No solution!
        Therefore mjam-j\geq a, so c+a(lm)+(mj)=ljn1c+a\leq(l-m)+(m-j)=l-j\leq n-1, to Case2.

      • Similarly we can get same results when m<jm<j with

        {(jm)t}>11δaδc.\{(j-m)t\}>1-\frac{1-\delta}{a}-\frac{\delta}{c}.

From previous discussion, we obtain

SlSm+1lmSiSj1ij2(n1)(n3).\frac{S_{l}-S_{m}+1}{l-m}-\frac{S_{i}-S_{j}-1}{i-j}\geq\frac{2}{(n-1)(n-3)}.

Hence, T,BT,B exists from Eq. (10) when n>4n>4 and K>(n1)(n3)2K>\frac{(n-1)(n-3)}{2}. \square

B.3 Proof of Proposition 3

Proof: Consider {Si}i=1n\{S_{i}\}_{i=1}^{n} through special choices of t,bt,b when n>8n>8.

  • t=1n2t=\frac{1}{n-2}, b=32n41b=\frac{3}{2n-4}-1, with sequence {Si}i=1n={1,2,2,,2,2,2,3}\{S_{i}\}_{i=1}^{n}=\{1,2,2,\dots,2,2,2,3\},

  • t=1n3t=\frac{1}{n-3}, b=32n61b=\frac{3}{2n-6}-1, with sequence {Si}i=1n={1,2,2,,2,2,3,3}\{S_{i}\}_{i=1}^{n}=\{1,2,2,\dots,2,2,3,3\},

  • t=1n4t=\frac{1}{n-4}, b=32n81b=\frac{3}{2n-8}-1, with sequence {Si}i=1n={1,2,2,,2,3,3,3}\{S_{i}\}_{i=1}^{n}=\{1,2,2,\dots,2,3,3,3\}.

The corresponding satisfied TT is T1,T2,T3T_{1},T_{2},T_{3}. From Eq. (8), the corresponding results are

(SlSm+1lm)min=1n3,1n4,1n5l,m[n],l>m.\left(\frac{S_{l}-S_{m}+1}{l-m}\right)_{min}=\frac{1}{n-3},\frac{1}{n-4},\frac{1}{n-5}\ \forall l,m\in[n],\ l>m.
(SiSj1ij)max=1n1,1n2,1n3i,j[n],i>j.\left(\frac{S_{i}-S_{j}-1}{i-j}\right)_{max}=\frac{1}{n-1},\frac{1}{n-2},\frac{1}{n-3}\ \forall i,j\in[n],\ i>j.

Therefore,

{Kn1<T1<Kn3,Kn2<T2<Kn4,Kn3<T3<Kn5.\left\{\begin{aligned} &\frac{K}{n-1}<T_{1}<\frac{K}{n-3},\\ &\frac{K}{n-2}<T_{2}<\frac{K}{n-4},\\ &\frac{K}{n-3}<T_{3}<\frac{K}{n-5}.\\ \end{aligned}\right. (17)

Then

(n3)T1<K<(n1)T1,\displaystyle(n-3)T_{1}<K<(n-1)T_{1}, (18)
(n4)T2<K<(n2)T2,\displaystyle(n-4)T_{2}<K<(n-2)T_{2}, (19)
(n5)T3<K<(n3)T3.\displaystyle(n-5)T_{3}<K<(n-3)T_{3}. (20)

From Eq. (18) and Eq. (20), then (n3)T1<(n3)T3(n-3)T_{1}<(n-3)T_{3}, so T3T1+1T_{3}\geq T_{1}+1. Take this into Eq. (18) and Eq. (20), then

(n1)T1(n5)T3+2(n5)(T1+1)+2T1n34.(n-1)T_{1}\geq(n-5)T_{3}+2\geq(n-5)(T_{1}+1)+2\Rightarrow T_{1}\geq\Big{\lceil}\frac{n-3}{4}\Big{\rceil}.

Then

K(n3)T1+1(n3)n34+1(n3)24+1.K\geq(n-3)T_{1}+1\geq(n-3)\Big{\lceil}\frac{n-3}{4}\Big{\rceil}+1\geq\frac{(n-3)^{2}}{4}+1.

If T1=n34,n>9T_{1}=\big{\lceil}\frac{n-3}{4}\big{\rceil},n>9, from

(n1)T1(n5)T3+2T3=n34+1.(n-1)T_{1}\geq(n-5)T_{3}+2\Rightarrow T_{3}=\Big{\lceil}\frac{n-3}{4}\Big{\rceil}+1.

So either T1T2T_{1}\geq T_{2} or T2T3T_{2}\geq T_{3} can be satisfied.

  • If T1T2T_{1}\geq T_{2}, from Eq. (18) and Eq. (19)

    (n3)T1<K<(n2)T1,\displaystyle(n-3)T_{1}<K<(n-2)T_{1}, (21)
    (n5)T3<K<(n3)T3.\displaystyle(n-5)T_{3}<K<(n-3)T_{3}. (22)
    (n5)T3+2(n2)T1(n5)(n34+1)+2(n2)n34n<15.(n-5)T_{3}+2\leq(n-2)T_{1}\Rightarrow(n-5)(\Big{\lceil}\frac{n-3}{4}\Big{\rceil}+1)+2\leq(n-2)\Big{\lceil}\frac{n-3}{4}\Big{\rceil}\Rightarrow n<15.
  • If T2T3T_{2}\geq T_{3}, from Eq. (18) and Eq. (19)

    (n3)T1<K<(n1)T1,\displaystyle(n-3)T_{1}<K<(n-1)T_{1}, (23)
    (n4)T3<K<(n3)T3.\displaystyle(n-4)T_{3}<K<(n-3)T_{3}. (24)
    (n4)T3+2(n1)T1(n4)(n34+1)+2(n1)n34n<11.(n-4)T_{3}+2\leq(n-1)T_{1}\Rightarrow(n-4)(\Big{\lceil}\frac{n-3}{4}\Big{\rceil}+1)+2\leq(n-1)\Big{\lceil}\frac{n-3}{4}\Big{\rceil}\Rightarrow n<11.

So when n15n\geq 15, then T1=n34+1T_{1}=\big{\lceil}\frac{n-3}{4}\big{\rceil}+1. Hence,

Kn(n3)T1+1(n3)(n34+1)+1(n1)24.K_{n}\geq(n-3)T_{1}+1\geq(n-3)(\Big{\lceil}\frac{n-3}{4}\Big{\rceil}+1)+1\geq\frac{(n-1)^{2}}{4}.

With the same idea further on the T1=n34+1T_{1}=\big{\lceil}\frac{n-3}{4}\big{\rceil}+1, we can get when n27n\geq 27,

Kn(n3)T1+1(n3)(n34+2)+1>n24.K_{n}\geq(n-3)T_{1}+1\geq(n-3)(\Big{\lceil}\frac{n-3}{4}\Big{\rceil}+2)+1>\frac{n^{2}}{4}.

\square

B.4 Proof of Proposition 4

Proof: From Eq. (7), we conclude that

\max_{i}\left(iT-KS_{i}\right)\leq B<\min_{j}\left(jT-KS_{j}\right)+K.

Therefore, given T, a valid B exists if and only if

\max_{i}\left(iT-KS_{i}\right)<\min_{j}\left(jT-KS_{j}\right)+K.

From Eq. (8) and T\in\mathbb{N}, we conclude that

\Big\lfloor\frac{S_{i}-S_{j}-1}{i-j}K\Big\rfloor<T<\Big\lceil\frac{S_{i}-S_{j}+1}{i-j}K\Big\rceil,

where i>j, i,j\in\{1,2,\dots,n\}. Hence, T exists if and only if

\max_{i>j}\Big\lfloor\frac{S_{i}-S_{j}-1}{i-j}K\Big\rfloor<\min_{i>j}\Big\lceil\frac{S_{i}-S_{j}+1}{i-j}K\Big\rceil. (25)

In conclusion, T,B exist if and only if there exists T\in\mathbb{N} such that

\max_{i>j}\Big\lfloor\frac{S_{i}-S_{j}-1}{i-j}K\Big\rfloor<T<\min_{i>j}\Big\lceil\frac{S_{i}-S_{j}+1}{i-j}K\Big\rceil, (26)
\max_{i}\left(iT-KS_{i}\right)<\min_{j}\left(jT-KS_{j}\right)+K,

where i,j\in\{1,2,\dots,n\}. \square

B.5 Proof of Proposition 5

Proof: For the convenience of analysis, we can assume t\in[0,1) and \lceil t-b\rceil=1, since we may subtract an integer from both sides:

\lceil i(t-\lfloor t\rfloor)-(b-b^{\prime})\rceil=\left\lceil\frac{i(T-\lfloor t\rfloor K)-(B-b^{\prime}K)}{K}\right\rceil,\ \forall i\in\{1,2,\cdots,n\},

where b^{\prime} is the integer satisfying \left\lceil\{t\}-(b-b^{\prime})\right\rceil=1.

S_{1}=1 is obvious from the assumptions 0\leq t<1 and \lceil t-b\rceil=1.

From Lemma 1, S_{i+1}-S_{i}=0 or 1, so the sequence \{S_{i}\}_{i=1}^{n} is non-decreasing and S_{i+d}-S_{i}\leq d,\ S_{i}\leq i. \square

B.6 Proof of Proposition 6

Proof: First, since \{S_{i}\}_{i=1}^{n+1} is a valid sequence of length n+1, there exist t,b with S_{i}=\lceil it-b\rceil for all i\in\{1,2,\dots,n+1\}, so \{S_{i}\}_{i=1}^{n} is a valid sequence of length n generated by the same t,b.
Second, if \{S_{i}\}_{i=1}^{n} is a valid sequence of length n, there exist t,b with S_{i}=\lceil it-b\rceil for all i\in\{1,2,\dots,n\}, so \{S_{1},S_{2},\dots,S_{n},\lceil(n+1)t-b\rceil\} is a valid sequence of length n+1 with the same first n terms. \square

B.7 Proof of Proposition 7

Note that

S_{i}-1<it-b\leq S_{i},\ \forall i\in\{1,2,\dots,n\}.

Then

\frac{S_{i}-1+b}{i}<t\leq\frac{S_{i}+b}{i},\ \forall i\in\{1,2,\dots,n\}.

So t\in\mathbb{R} exists as long as

\frac{S_{i}-1+b}{i}<\frac{S_{j}+b}{j},\ \forall i,j\in\{1,2,\dots,n\},

which means that

\left\{\begin{aligned} &b<\frac{j(S_{i}-1)-iS_{j}}{i-j},\ j>i,\\ &b>\frac{j(S_{i}-1)-iS_{j}}{i-j},\ j<i.\end{aligned}\right.

Therefore, b exists if and only if

\min_{j>i}\frac{j(S_{i}-1)-iS_{j}}{i-j}>\max_{j<i}\frac{j(S_{i}-1)-iS_{j}}{i-j}. (27)

B.8 Proof of Proposition 8

Proof: Once we know T,BT,B exist, then from Eq. (9), and Lemma 1, i,j,l,m[n],i>j,l>m\forall i,j,l,m\in[n],i>j,l>m,

SiSj1ijK<tK<SlSm+1lmK.\frac{S_{i}-S_{j}-1}{i-j}K<tK<\frac{S_{l}-S_{m}+1}{l-m}K. (28)

Since there exist TT satisfy requirements, so there exists an integer in the interval

(maxi>jSiSj1ijK,mini>jSiSj+1ijK).\bigg{(}\max_{i>j}\frac{S_{i}-S_{j}-1}{i-j}K,\ \min_{i>j}\frac{S_{i}-S_{j}+1}{i-j}K\bigg{)}.

From Eq. (28), tKtK is in the interval, so the neighboring integer tK\lfloor tK\rfloor or tK\lceil tK\rceil must have at least one satisfies Eq. (2).

Similarly, from Eq. (7)

KSiK+Bi<TKSi+Bi,i{1,2,,n}.\frac{KS_{i}-K+B}{i}<T\leq\frac{KS_{i}+B}{i},\ \forall i\in\{1,2,\cdots,n\}.
KSiK+Bi<KSj+Bj,i,j{1,2,,n}.\frac{KS_{i}-K+B}{i}<\frac{KS_{j}+B}{j},\ \forall i,j\in\{1,2,\cdots,n\}.
(1i1j)B<K(SjjSi1i),i,j{1,2,,n}.\left(\frac{1}{i}-\frac{1}{j}\right)B<K\left(\frac{S_{j}}{j}-\frac{S_{i}-1}{i}\right),\ \forall i,j\in\{1,2,\cdots,n\}.
K(Si1iSjj)/(1j1i)<B<K(SllSm1m)/(1m1l),i>j,l>m.K\left(\frac{S_{i}-1}{i}-\frac{S_{j}}{j}\right)\bigg{/}\left(\frac{1}{j}-\frac{1}{i}\right)<B<K\left(\frac{S_{l}}{l}-\frac{S_{m}-1}{m}\right)\bigg{/}\left(\frac{1}{m}-\frac{1}{l}\right),\ \forall i>j,l>m.

Since

SllSm1m>ltblmtbm=(1m1l)b,\frac{S_{l}}{l}-\frac{S_{m}-1}{m}>\frac{lt-b}{l}-\frac{mt-b}{m}=\left(\frac{1}{m}-\frac{1}{l}\right)b,
Si1iSjj<itbijtbj=(1j1i)b.\frac{S_{i}-1}{i}-\frac{S_{j}}{j}<\frac{it-b}{i}-\frac{jt-b}{j}=\left(\frac{1}{j}-\frac{1}{i}\right)b.

Hence

K(Si1iSjj)/(1j1i)<bK<K(SllSm1m)/(1m1l),i>j,l>m.K\left(\frac{S_{i}-1}{i}-\frac{S_{j}}{j}\right)\bigg{/}\left(\frac{1}{j}-\frac{1}{i}\right)<bK<K\left(\frac{S_{l}}{l}-\frac{S_{m}-1}{m}\right)\bigg{/}\left(\frac{1}{m}-\frac{1}{l}\right),\ \forall i>j,l>m.

Since there exist BB satisfy requirements, so that there exists an integer in the interval

(maxi>jK(Si1iSjj)/(1j1i),mini>jK(SiiSj1j)/(1j1i)).\bigg{(}\max_{i>j}K\left(\frac{S_{i}-1}{i}-\frac{S_{j}}{j}\right)\bigg{/}\left(\frac{1}{j}-\frac{1}{i}\right),\ \min_{i>j}K\left(\frac{S_{i}}{i}-\frac{S_{j}-1}{j}\right)\bigg{/}\left(\frac{1}{j}-\frac{1}{i}\right)\bigg{)}. (29)

From Eq. (29), bK lies in this interval, so at least one of the neighboring integers \lfloor bK\rfloor and \lceil bK\rceil satisfies Eq. (2).

However, the combined pair with T\in\{\lfloor tK\rfloor,\ \lceil tK\rceil\} and B\in\{\lfloor bK\rfloor,\ \lceil bK\rceil\} may still fail to satisfy Problem 2.
For example, take b=0.198,\ t=0.618,\ n=15,\ K=64; all valid pairs are (T,B)=(39,6),(39,7),(39,8), while \lfloor tK\rfloor=39 and \lfloor bK\rfloor=12. \square
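To make the counterexample easy to reproduce, here is a minimal Python check; the helper name `check` is ours and not part of the paper's code.

```python
import math

def check(T, B, t, b, K, n):
    # Verify ceil(i*t - b) == ceil((i*T - B) / K) for i = 1, ..., n.
    # For integers x and K > 0, ceil(x / K) is computed exactly as -((-x) // K).
    return all(math.ceil(i * t - b) == -((-(i * T - B)) // K) for i in range(1, n + 1))

t, b, n, K = 0.618, 0.198, 15, 64
print(check(39, 6, t, b, K, n))    # True: (T, B) = (39, 6) is a valid pair
print(check(39, 12, t, b, K, n))   # False: the rounded pair (floor(tK), floor(bK)) = (39, 12) fails
```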

B.9 Proof of Proposition 9

Proof: From Eq. (9) and Lemma 1,

SlSm+1lm<(ltb+1)(mtb)+1lm=t+2lm.\frac{S_{l}-S_{m}+1}{l-m}<\frac{(lt-b+1)-(mt-b)+1}{l-m}=t+\frac{2}{l-m}.

Therefore

minl>mSlSm+1lm<t+minl>m2lm=t+2n1.\min_{l>m}\frac{S_{l}-S_{m}+1}{l-m}<t+\min_{l>m}\frac{2}{l-m}=t+\frac{2}{n-1}.

Similarly,

maxi>jSiSj1ij>maxi>j(itb)(jtb+1)1ij=t+maxi>j2ij=t2n1.\max_{i>j}\frac{S_{i}-S_{j}-1}{i-j}>\max_{i>j}\frac{(it-b)-(jt-b+1)-1}{i-j}=t+\max_{i>j}-\frac{2}{i-j}=t-\frac{2}{n-1}.

Hence

Kt2Kn1<T<Kt+2Kn1.Kt-\frac{2K}{n-1}<T<Kt+\frac{2K}{n-1}.

From Eq. (29)

\left(\frac{S_{i}-1}{i}-\frac{S_{j}}{j}\right)>\frac{it-b-1}{i}-\frac{jt-b+1}{j}=-\left(\frac{1}{i}-\frac{1}{j}\right)b-\left(\frac{1}{i}+\frac{1}{j}\right),
\left(\frac{S_{i}}{i}-\frac{S_{j}-1}{j}\right)<\frac{it-b+1}{i}-\frac{jt-b-1}{j}=-\left(\frac{1}{i}-\frac{1}{j}\right)b+\left(\frac{1}{i}+\frac{1}{j}\right).
\max_{i>j}\left(\frac{S_{i}-1}{i}-\frac{S_{j}}{j}\right)\bigg{/}\left(\frac{1}{j}-\frac{1}{i}\right)>\max_{i>j}\left[b-\left(\frac{i}{j}+1\right)\Big{/}\left(\frac{i}{j}-1\right)\right]=b-\frac{n+1}{n-1}.
\min_{i>j}\left(\frac{S_{i}}{i}-\frac{S_{j}-1}{j}\right)\bigg{/}\left(\frac{1}{j}-\frac{1}{i}\right)<\min_{i>j}\left[b+\left(\frac{i}{j}+1\right)\Big{/}\left(\frac{i}{j}-1\right)\right]=b+\frac{n+1}{n-1}.

Hence

Kbn+1n1K<B<Kb+n+1n1K.Kb-\frac{n+1}{n-1}K<B<Kb+\frac{n+1}{n-1}K.

\square

Appendix C Searched KnK_{n}

In this section, we list more K_{n} for various n in Table 3. Since searching for the exact K_{n} is time-consuming, for large n we only compute the scales used for k-bit quantization (n=2^{k}-1).

Table 3: Searched KnK_{n} with various nn. Some useful numbers for kk-bit quantization (n=2k1n=2^{k}-1) are displayed in bold.

nn KnK_{n} nn KnK_{n} nn KnK_{n} nn KnK_{n} nn KnK_{n}
1 1 14 46 27 221 40 529 53 1013
2 1 15 51 28 232 41 545 54 1036
3 2 16 67 29 265 42 596 55 1059
4 3 17 73 30 277 43 613 56 1129
5 5 18 79 31 289 44 667 57 1153
6 7 19 99 32 326 45 685 58 1226
7 9 20 106 33 339 46 742 59 1251
8 11 21 113 34 379 47 761 60 1327
9 13 22 137 35 393 48 781 61 1353
10 22 23 145 36 407 49 841 62 1379
11 25 24 172 37 451 50 862 63 1459
12 29 25 181 38 466 51 925 127 6499
13 41 26 191 39 513 52 947 255 28323

Appendix D Satisfied KK is Not Sequential

Table 4: Searched lists of satisfied K for n=15 and n=31.

n:15n:15 51,61,62,63,64,67,68,69,73,74,75,76,77,78,79,80,81,82,83,8551,61,62,63,64,67,68,69,73,74,75,76,77,78,79,80,81,82,83,85\cdots.
n:31  289, 313, 314, 315, 316, 317, 318, 326, 327, 328, 329, 339, 340, 341, 342, 343, 344, 345, 346, 347, 352, 353, 354, 355, 356, 357, 358, 359, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 421, 422, \cdots

In this section, we emphasize that not all K\geq K_{n} satisfy Problem 1. We show all satisfied K for Problem 1 when choosing A=15 (n=15), i.e., 4-bit quantization. From Proposition 2, all K\geq\frac{(n-1)(n-3)}{2}+1=85 satisfy Problem 1, and from Appendix C, K_{15}=51; the satisfied values of K are listed in Table 4.

When choosing A=31 (n=31) for 5-bit quantization, K_{31}=289 and all K\geq\frac{(n-1)(n-3)}{2}+1=421 satisfy Problem 1; the satisfied values of K are also listed in Table 4.

From Table 4, some numbers between K_{n} and our upper bound \frac{(n-1)(n-3)}{2}+1 are missing from the satisfied list. Moreover, hardware may prefer K to be a power of 2, so we recommend K=64 for 4-bit activation quantization, K=512 for 5-bit activation quantization, and simply K=2^{16} for 8-bit activation quantization (actually 2^{15} is enough).

Appendix E Constraint of T

From Problem 1 and Proposition 1, we need T\neq 0, but we did not consider this constraint in the main body of the paper. In this section, we discuss the case where T=0 would satisfy the matching condition.

Based on Eq. (3), when T=0 we have \lceil it-b\rceil=\lceil B/K\rceil for all i\in\{1,2,\dots,A\}, which means there exists N_{0}\in\mathbb{N} such that N_{0}<t-b<At-b\leq N_{0}+1 if we assume t>0. Therefore (A-1)t<1, that is, 0<t<\frac{1}{A-1}. Additionally, \lfloor\frac{N+b}{t}\rfloor<1 when N\leq N_{0} and \lfloor\frac{N+b}{t}\rfloor\geq A when N\geq N_{0}+1, showing that

clip(N+bt,0,A)=0orA.\mathrm{clip}\left(\left\lfloor\frac{N+b}{t}\right\rfloor,0,A\right)=0\;or\;A.

We propose to use a sign function to quantize in this case.

Q_{BN}(N,b,t,A)\triangleq\mathrm{clip}\left(\left\lfloor\frac{N+b}{t}\right\rfloor,0,A\right)=A\,\mathrm{sign}(N-N_{0})=\begin{cases}A&N>N_{0}\\ 0&N\leq N_{0}\\ \end{cases},

where N0=tb1N_{0}=\lceil t-b\rceil-1 and we need to check tb=Atb\lceil t-b\rceil=\lceil At-b\rceil.

Since the range |t|<\frac{1}{A-1} is very narrow in \mathbb{R} when A is large, we seldom encounter this situation; even when it occurs, we can handle it by replacing the original formulation with the simple binary function Q_{BN} above.
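As a minimal sketch of this degenerate case, the following Python helper illustrates the step-function behavior; the function names and the example values t=0.05, b=-0.1, A=15 are our own assumptions chosen to satisfy 0<t<1/(A-1).

```python
import math

def qbn_float(N, b, t, A):
    # Reference path: clip(floor((N + b) / t), 0, A).
    return min(max(math.floor((N + b) / t), 0), A)

def qbn_sign(N, b, t, A):
    # Degenerate case 0 < t < 1/(A-1): the output is a step function of N.
    N0 = math.ceil(t - b) - 1
    return A if N > N0 else 0

A, t, b = 15, 0.05, -0.1
assert math.ceil(t - b) == math.ceil(A * t - b)   # the condition noted above holds
assert all(qbn_float(N, b, t, A) == qbn_sign(N, b, t, A) for N in range(-50, 50))
```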

Appendix F Auxiliary Algorithms

Algorithm 1 Generate: (𝒮n\mathcal{S}_{n}) Recursively generate all possible sequences from length nn to length n+1n+1.
1:  Input: 𝒮n\mathcal{S}_{n} which includes all possible sequences with length nn;
2:  Set empty list 𝒮n+1\mathcal{S}_{n+1};
3:  for {Si}i=1n\{S_{i}\}_{i=1}^{n} in 𝒮n\mathcal{S}_{n} do
4:     Generate new {Si}i=1n+1\{S_{i}\}_{i=1}^{n+1} with length n+1n+1 using Proposition 6 (e.g. choose Sn+1=SnS_{n+1}=S_{n} or Sn+1S_{n}+1);
5:     if  maxj>ij(Si1)iSjji<minj<ij(Si1)iSjji\max_{j>i}\frac{j(S_{i}-1)-iS_{j}}{j-i}<\min_{j<i}\frac{j(S_{i}-1)-iS_{j}}{j-i}  then
6:        Add {Si}i=1n+1\{S_{i}\}_{i=1}^{n+1} to 𝒮n+1\mathcal{S}_{n+1};
7:     end if
8:  end for
9:  Output: 𝒮n+1\mathcal{S}_{n+1}
Algorithm 2 Find(\{S_{i}\}_{i=1}^{n}, K): Check whether K is suitable for the given possible sequence \{S_{i}\}_{i=1}^{n}
1:  Input: A possible sequence {Si}i=1n\{S_{i}\}_{i=1}^{n} from 𝒮n\mathcal{S}_{n}, KK\in\mathbb{N};
2:  for  TT\in\mathbb{N} and KSn2K+1n1TKSn1n1\frac{KS_{n}-2K+1}{n-1}\leq T\leq\frac{KS_{n}-1}{n-1} do
3:     if maxi(iTKSi)<mini(iTKSi)+K\max_{i}\left(iT-KS_{i}\right)<\min_{i}\left(iT-KS_{i}\right)+K then
4:        B=maxi(iTKSi)B=\max_{i}\left(iT-KS_{i}\right);
5:        Output: T,BT,B
6:     end if
7:  end for
8:  Output: NoneNone
Algorithm 3 Search: (𝒮no\mathcal{S}_{n}^{o}, K0K_{0}): Search KK for given possible sequences in 𝒮no\mathcal{S}_{n}^{o}, a subset of 𝒮n\mathcal{S}_{n}
1:  Input: 𝒮no\mathcal{S}_{n}^{o}, a subset of 𝒮n\mathcal{S}_{n}, start K0K_{0}\in\mathbb{N};
2:  find=Falsefind=False; K=K0K=K_{0};
3:  while notnot findfind do
4:     Set empty list KlstK_{lst} ;
5:     for {Si}i=1n\{S_{i}\}_{i=1}^{n} in 𝒮no\mathcal{S}_{n}^{o} do
6:        while Find({Si}i=1n\{S_{i}\}_{i=1}^{n}, KK) return None do
7:           K=K+1K=K+1;
8:        end while
9:        Append KK to KlstK_{lst};
10:     end for
11:     Get the maximum value KmaxK_{\max} from KlstK_{lst} ;
12:     if Kmax>KK_{\max}>K then
13:        KK = KmaxK_{\max};
14:     else
15:        findfind = TrueTrue;
16:     end if
17:  end while
18:  Output: KK
Algorithm 4 Search-All (𝒮n\mathcal{S}_{n}, K0K_{0}, ww): Search KnK_{n} based on all possible sequences with length nn in 𝒮n\mathcal{S}_{n}.
1:  Input: All possible sequences 𝒮n\mathcal{S}_{n} of sequence length nn generated and checked from Generate function (Algorithm 1 in Appendix), Search function (Algorithm 3 in Appendix), search starting point K0K_{0}\in\mathbb{N} and running window ww\in\mathbb{N}.
2:  find=Falsefind=False, K=K0K=K_{0};
3:  while notnot findfind do
4:     s=0s=0, findfind = TrueTrue;
5:     while s<s< the length of 𝒮n\mathcal{S}_{n} do
6:        Get 𝒮no\mathcal{S}_{n}^{o}, a subset of 𝒮n\mathcal{S}_{n} from item index ss to s+ws+w.
7:        Kmax=K_{max}= Search(𝒮no\mathcal{S}_{n}^{o}, KK);
8:        if Kmax>KK_{max}>K then
9:           K=KmaxK=K_{max}, findfind = FalseFalse;
10:        end if
11:        s=s+ws=s+w;
12:     end while
13:  end while
14:  Output: KK
Algorithm 5 Get(t,b,K,nt,b,K,n): Get T,BT,B in Problem 2 when given t,b,Kt,b,K
1:  Input: t,b,nt,b,n and KK from previous search.
2:  Si=itb,i{1,2,n}S_{i}=\lceil it-b\rceil,i\in\{1,2\cdots,n\}.
3:  Tlower=maxi>jSiSj1ijKT_{lower}=\max_{i>j}\Big{\lfloor}\frac{S_{i}-S_{j}-1}{i-j}K\Big{\rfloor}.
4:  Tupper=mini>jSiSj+1ijKT_{upper}=\min_{i>j}\Big{\lceil}\frac{S_{i}-S_{j}+1}{i-j}K\Big{\rceil}.
5:  for TT\in\mathbb{N} and Tlower<T<TupperT_{lower}<T<T_{upper} (searching according to the distance to KtKt is faster)  do
6:     Blower=maxi(iTKSi)B_{lower}=\max_{i}\left(iT-KS_{i}\right)
7:     Bupper=mini(iTKSi)+KB_{upper}=\min_{i}\left(iT-KS_{i}\right)+K
8:     if Blower<BupperB_{lower}<B_{upper} then
9:        B=BlowerB=B_{lower}.
10:        Output: T,BT,B
11:     end if
12:  end for
13:  Output: T,B  (If we need to obtain all satisfied T,B in Problem 2, we should not return upon finding the first pair, but continue until all possible values have been searched.)
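As a companion to Algorithm 5, the following minimal Python sketch (the name get_TB and the exact integer floor/ceil expressions are ours) searches T in the interval of steps 3-4 and returns the first feasible pair; it assumes n >= 2.

```python
import math

def get_TB(t, b, K, n):
    # Find integers T, B with ceil(i*t - b) == ceil((i*T - B) / K) for all i = 1, ..., n.
    S = [math.ceil(i * t - b) for i in range(1, n + 1)]              # S_1, ..., S_n
    pairs = [(i, j) for i in range(2, n + 1) for j in range(1, i)]   # all i > j
    # Exact integer floor / ceil of (S_i - S_j -/+ 1) * K / (i - j).
    T_low = max(((S[i-1] - S[j-1] - 1) * K) // (i - j) for i, j in pairs)
    T_up = min(-((-(S[i-1] - S[j-1] + 1) * K) // (i - j)) for i, j in pairs)
    for T in range(T_low + 1, T_up):
        B_low = max(i * T - K * S[i-1] for i in range(1, n + 1))
        B_up = min(i * T - K * S[i-1] for i in range(1, n + 1)) + K
        if B_low < B_up:
            return T, B_low
    return None

print(get_TB(0.618, 0.198, 64, 15))   # (39, 6) for the example of Appendix B.8
```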

Appendix G Other Cases

G.1 General Quantizer Range

We have already mentioned in Section 2 the case where y_{\min},y_{\max} take other integer values. Since the previous discussion focuses only on the non-symmetric quantizer with y_{\min}=0 in Q_{a}(\cdot) and x_{\min}=0 in Q(\cdot), we briefly show how to handle the symmetric quantizer and more general cases in this section.

Using an analysis similar to that of Eq. (1), we obtain the general formulation of Problem 1.

Problem 3

Given Y_{\min},Y_{\max}\in\mathbb{Z} with Y_{\min}<Y_{\max}, the problem is to find some K\in\mathbb{N} (as small as possible) such that for all t,b\in\mathbb{R} with t\neq 0, there exist T,B\in\mathbb{Z} with T\neq 0, such that for all N\in\mathbb{Z},

clip(N+bt,Ymin,Ymax)=clip(N×K+BT,Ymin,Ymax).\mathrm{clip}\left(\left\lfloor\frac{N+b}{t}\right\rfloor,Y_{\min},Y_{\max}\right)=\mathrm{clip}\left(\left\lfloor\frac{N{\times}K+B}{T}\right\rfloor,Y_{\min},Y_{\max}\right).

A simple way to solve Problem 3 is to shift the range of the clip function:

clip(N+btYmint,0,YmaxYmin)=clip(N×K+BTYminT,0,YmaxYmin).\mathrm{clip}\left(\left\lfloor\frac{N+b-tY_{\min}}{t}\right\rfloor,0,Y_{\max}-Y_{\min}\right)=\mathrm{clip}\left(\left\lfloor\frac{N{\times}K+B-TY_{\min}}{T}\right\rfloor,0,Y_{\max}-Y_{\min}\right).

Hence we can apply the conclusions of Problem 1 with the changed b^{\prime}=b-tY_{\min}, B^{\prime}=B-TY_{\min} and the same t,T as follows:

clip(N+bt,0,YmaxYmin)=clip(N×K+BT,0,YmaxYmin).\mathrm{clip}\left(\left\lfloor\frac{N+b^{\prime}}{t}\right\rfloor,0,Y_{\max}-Y_{\min}\right)=\mathrm{clip}\left(\left\lfloor\frac{N{\times}K+B^{\prime}}{T}\right\rfloor,0,Y_{\max}-Y_{\min}\right).

However, due to limited numerical precision, our searching method already introduces error, so we prefer to apply our derivation to Problem 3 directly.

Accordingly, we obtain a simple version of Problem 3.

Proposition 10

Problem 3 is equivalent to finding such a KK\in\mathbb{N} that for all t0,t,bt\neq 0,t,b\in\mathbb{R}, we can obtain T0,T,BT\neq 0,T,B\in\mathbb{Z}, s.t.

itb=iTBK,i{Ymin+1,,Ymax}.\lceil it-b\rceil=\Big{\lceil}\frac{iT-B}{K}\Big{\rceil},\ \forall i\in\{Y_{\min}+1,\dots,Y_{\max}\}. (30)

From the previous analysis, Problem 3 is essentially the same as Problem 1, which means that the searched K_{n} with n=Y_{\max}-Y_{\min} already satisfies Problem 3. Analogously to Proposition 4, to establish T,B given a suitable K, we also offer a conservative way to obtain all appropriate T,B as follows.

Proposition 11

Given t,b\in\mathbb{R} and a proper K\in\mathbb{N}, a pair T,B\in\mathbb{Z} satisfying Problem 3 exists if and only if there exists a T for which the following conditions are met:

maxi>jSiSj1ijK\displaystyle\max_{i>j}\Big{\lfloor}\frac{S_{i}-S_{j}-1}{i-j}K\Big{\rfloor} <T<mini>jSiSj+1ijK,\displaystyle<T<\min_{i>j}\Big{\lceil}\frac{S_{i}-S_{j}+1}{i-j}K\Big{\rceil}, (31)
maxi(iTKSi)\displaystyle\max_{i}\left(iT-KS_{i}\right) <minj(jTKSj)+K,\displaystyle<\min_{j}\left(jT-KS_{j}\right)+K,

where i,j{Ymin+1,,Ymax}i,j\in\{Y_{\min}+1,\dots,Y_{\max}\}.

In other words, our proposed method and results carry over directly to a general quantizer range.
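As an illustration of Proposition 11, a minimal Python sketch for a general range is given below; the name get_TB_general is ours, and it mirrors the sketch given after Algorithm 5 with the index set \{Y_{\min}+1,\dots,Y_{\max}\}.

```python
import math

def get_TB_general(t, b, K, y_min, y_max):
    # Find integers T (nonzero), B with ceil(i*t - b) == ceil((i*T - B) / K)
    # for all i in {y_min + 1, ..., y_max}.
    idx = list(range(y_min + 1, y_max + 1))
    S = {i: math.ceil(i * t - b) for i in idx}
    pairs = [(i, j) for i in idx for j in idx if i > j]
    T_low = max(((S[i] - S[j] - 1) * K) // (i - j) for i, j in pairs)
    T_up = min(-((-(S[i] - S[j] + 1) * K) // (i - j)) for i, j in pairs)
    for T in range(T_low + 1, T_up):
        B_low = max(i * T - K * S[i] for i in idx)
        B_up = min(i * T - K * S[i] for i in idx) + K
        if T != 0 and B_low < B_up:
            return T, B_low
    return None

# Assumed example: symmetric 4-bit range [-8, 7] (n = 15), K = 64.
print(get_TB_general(0.618, 0.198, 64, -8, 7))   # a valid pair such as (39, 11)
```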

G.2 Special Input Range

Before training, we have little knowledge of the final BN parameters and hidden-layer outputs, hence we consider all possible values that b,t,N can take. In fact, due to the quantized activation and weight functions in previous layers, the input N varies within a finite range, but it is hard to extract useful information because t,b and N are coupled, so we still let N range over all integers.

Besides, some structures may lead to a special input range of N, which can reduce the magnitude of K. For example, if N is always even, the problem becomes

clip(2N+bt,Ymin,Ymax)=clip(2N×K+BT,Ymin,Ymax),N.\mathrm{clip}\left(\left\lfloor\frac{2N+b}{t}\right\rfloor,Y_{\min},Y_{\max}\right)=\mathrm{clip}\left(\left\lfloor\frac{2N{\times}K+B}{T}\right\rfloor,Y_{\min},Y_{\max}\right),\forall N\in\mathbb{N}.

After a slight transformation,

clip(N+b/2t/2,Ymin,Ymax)=clip(N×(2K)+BT,Ymin,Ymax),N.\mathrm{clip}\left(\left\lfloor\frac{N+b/2}{t/2}\right\rfloor,Y_{\min},Y_{\max}\right)=\mathrm{clip}\left(\left\lfloor\frac{N{\times}(2K)+B}{T}\right\rfloor,Y_{\min},Y_{\max}\right),\forall N\in\mathbb{N}.

Hence we can apply our searching method to t/2 and b/2 first, with 2K serving as the satisfied scale in Problem 3. We still obtain T,B, but with a smaller K. For instance, for 4-bit quantization K_{15}=51 and we may use K=64 for hardware-friendly support; when N is always even, we can use K=32 instead.

Generally, when the input can be described as \alpha N+\beta with fixed \alpha,\beta\in\mathbb{N} and varying N\in\mathbb{N}, we can adopt a smaller K whenever \alpha K already satisfies Problem 3. In conclusion, exploiting special input structure may further reduce the K obtained by our search; a minimal code sketch for the even-input case is given below.
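A minimal sketch of the even-input reduction, reusing the get_TB helper from the sketch after Algorithm 5; all names and test values are our own assumptions.

```python
import math

def solve_even_input(t, b, K, n):
    # If the input is always even, N = 2M, then floor((N + b) / t) = floor((M + b/2) / (t/2))
    # and N*K + B = M*(2K) + B, so solving Problem 1 for (t/2, b/2) with scale 2K
    # yields T, B that work with the smaller multiplier K on the original even input.
    return get_TB(t / 2, b / 2, 2 * K, n)

t, b, A = 0.618, 0.198, 15
T, B = solve_even_input(t, b, 32, A)      # K = 32 suffices here because 2K = 64 satisfies Problem 1
for N in range(-20, 60, 2):               # even inputs only
    lhs = min(max(math.floor((N + b) / t), 0), A)
    rhs = min(max((N * 32 + B) // T, 0), A)
    assert lhs == rhs
```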

Appendix H Quantization for CIFAR-10 Classification

We evaluate our method using VGG11 [27] with BN on the CIFAR-10 dataset. We adopt data augmentation including horizontal flips, random cropping, random rotation and random scaling, and train with the Adam [17] optimizer and a cyclical learning rate initialized at 0.003 for a total of 640 epochs with batch size 256 (a rough configuration sketch is given after Table 5). The main results are shown in Table 5 and agree with the results on ImageNet.

Table 5: Top-1 test accuracy of different quantized weight and activation bits with our transformation on CIFAR-10 trained by VGG11.

Bits 2W4A 4W4A 8W8A
QTQT 90.60 91.75 91.74
btbt 90.40 91.71 91.67
BTBT 90.40 91.71 91.67
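For completeness, a rough PyTorch sketch of this training configuration is given below; the augmentation magnitudes, learning-rate bounds and the plain vgg11_bn model are our assumptions, and the quantization of weights, activations and BN from the main text is omitted.

```python
import torch
import torchvision
from torchvision import transforms

# Augmentation roughly matching the description; exact magnitudes are assumptions.
train_tf = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.RandomAffine(degrees=15, scale=(0.8, 1.2)),   # random rotation + scaling
    transforms.ToTensor(),
])
train_set = torchvision.datasets.CIFAR10("./data", train=True, download=True, transform=train_tf)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=256, shuffle=True, num_workers=4)

model = torchvision.models.vgg11_bn(num_classes=10)           # quantization wrappers omitted
optimizer = torch.optim.Adam(model.parameters(), lr=3e-3)
# Cyclical ("circular") learning rate starting from 3e-3; the bounds are assumptions.
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=1e-5, max_lr=3e-3,
    step_size_up=10 * len(train_loader), cycle_momentum=False)  # cycle_momentum=False for Adam
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(640):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        scheduler.step()
```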