
On the quantization of recurrent neural networks

Jian Li  and Raziel Alvarez
Google, Inc., USA.
Correspondence to: [email protected]
Abstract

Integer quantization of neural networks can be defined as the approximation of the high-precision computation of the canonical neural network formulation, using reduced integer precision. It plays a significant role in the efficient deployment and execution of machine learning (ML) systems, reducing memory consumption and leveraging typically faster computations. In this work, we present an integer-only quantization strategy for Long Short-Term Memory (LSTM) neural network topologies, which themselves are the foundation of many production ML systems. Our quantization strategy is accurate (e.g. works well with post-training quantization), efficient and fast to execute (utilizing 8 bit integer weights and mostly 8 bit activations), and able to target a variety of hardware (by leveraging instruction sets available in common CPU architectures, as well as available neural accelerators).

Keywords: integer quantization · RNN/LSTM · hardware accelerator

1 Introduction

One important area in the development of machine learning (ML) systems is the efficient utilization of the available computational resources. In most practical applications of ML, efficiency directly determines whether the system can be put in production at all, due to its implications for quality and cost.

Quantization of neural networks transforms the model’s computation such that it can be represented and executed in a lower precision. It is an important technique that reduces memory consumption (on disk and in RAM), and can help speed up execution and reduce power consumption [1]. Moreover, such a transformation may be required to execute on a variety of hardware that only supports integer computations, from digital signal processors [2] to the latest neural accelerators [3].

However, by performing an approximation to a reduced precision form, quantization may negatively affect the model’s quality (i.e. the target loss the neural network was trained to reduce). Hence, devising quantization strategies that preserve the quality of the original neural networks, while providing the benefits described above, is a challenging but important area of research.

Such quantization strategies involve defining how the original computation will be approximated, including decisions about how the original floating point parameters will be converted and represented in the integer domain, the precision used to represent the outputs of the now-integer computations, and how such intermediate outputs will be converted to a reduced precision so that subsequent computations can take place efficiently. All this while preserving the quality of the original neural network model.

Whereas significant work has been done in the field over the past several years [4, 5, 6, 7], devising a fully integer quantization strategy for recurrent neural networks (RNNs), and in particular for long short-term memory (LSTM) topologies, has remained challenging.

Recurrent neural networks are widely used to solve tasks involving the processing of sequential data. In particular, the long short-term memory topology is used in a number of state-of-the-art production systems, like speech recognition [8, 9, 10], text translation [11, 12] and text-to-speech [13, 14]. However, the stateful nature of RNNs makes their quantization numerically challenging, since quantization error can accumulate along both the depth dimension and the sequence (e.g. time) dimension.

There are successful approaches to quantize RNNs. For example, [6] uses a quantization strategy that quantizes the static parameters (i.e. weights) in the neural network to 8 bit integers, but relies on a dynamic computation (i.e. at execution time) of the true floating point ranges of the intermediate values to quantize them with higher precision. Whereas such an approach results in good quality [15, 16, 10], and can be implemented efficiently on typical CPU architectures (e.g. x86, ARM) resulting in significant performance gains, it still makes use of some floating point operations and thus lacks the hardware portability, energy efficiency, and ultimate performance potential that integer-only computation can provide.

With this in mind, this work focuses on the quantization of long short-term memory topologies, with the side goal that by providing a successful quantization strategy for one of the most useful and complex RNNs, we will be providing a good foundation to develop quantization strategies for arguably simpler topologies.

This work is organized as follows. In section 2 we introduce the different variants of the LSTM topology that our quantization strategy covers. In section 3 the quantization strategy is described in full detail: we start with the overall design approach, followed by the building blocks and the design considerations, and then present the end-to-end quantization strategy. This is followed by a discussion in section 4 on how the proposed strategy’s statistics collection can take place either during training, or entirely post-training (with a much smaller subset of data and without fine-tuning the neural network parameters) with good results, as validated in the following section. Experimental results, including accuracy and speed, are listed in section 5. Section 6 describes additional optimizations for deploying models in production. Finally we provide conclusions in section 7.

2 LSTM architecture

The long short-term memory cell [17] is one of the most complex and widely used RNN topologies. Over time it has seen a number of extensions. In this work we consider some of the most widely used ones: the addition of peephole connections [18], the coupling of input and forget gates (CIFG) [19], the addition of a projection layer [9], and the addition of layer normalization [20]. Thus, in its most complex form, containing all of the above extensions except CIFG, the LSTM cell can be described as:

i^{t}=\sigma(norm(W_{i}x^{t}+R_{i}h^{t-1}+P_{i}\odot c^{t-1})\odot L_{i}+b_{i})   (1)
f^{t}=\sigma(norm(W_{f}x^{t}+R_{f}h^{t-1}+P_{f}\odot c^{t-1})\odot L_{f}+b_{f})   (2)
z^{t}=g(norm(W_{z}x^{t}+R_{z}h^{t-1})\odot L_{z}+b_{z})   (3)
c^{t}=i^{t}\odot z^{t}+f^{t}\odot c^{t-1}   (4)
o^{t}=\sigma(norm(W_{o}x^{t}+R_{o}h^{t-1}+P_{o}\odot c^{t})\odot L_{o}+b_{o})   (5)
m^{t}=o^{t}\odot g(c^{t})   (6)
h^{t}=W_{proj}m^{t}+b_{proj}   (7)

In the LSTM cell definition above (equations 1 to 7), x^{t} is the input at time t, W and R the input and recurrent weight matrices respectively, b the bias vectors, \sigma the sigmoid function, z, i, f, o and c the LSTM-block input, input gate, forget gate, output gate and cell activation vectors, h the cell output, \odot the element-wise product of vectors, and g the activation function, generally tanh. We also present other common extensions such as the peephole connections P, the output projection W_{proj} proposed by [9] to reduce the cell size, and the layer normalization computation norm() and terms L added to the input block z^{t} and gates (i^{t}, f^{t}, and o^{t}) to help stabilize the hidden layer dynamics and speed up model convergence [20].

CIFG changes the formulation by "coupling" the input gate to the forget gate as i^{t}=1-f^{t}. Removing the peephole connections means the term P\odot c is absent. Removing layer normalization means the norm() computation and terms L are absent. Finally, removing the projection layer implies the hidden state m becomes the output of the cell, such that h=m.

For clarity, we make use of graph diagrams to represent core formulations from the LSTM that will be relevant to our quantization strategy. The common gate calculation W_{\lambda}x^{t}+R_{\lambda}h^{t-1}+P_{\lambda}\odot c^{t-1} (where \lambda can be any of the gates i,f,o) is shown in fig 1 (without layer normalization) and fig 4 (with layer normalization). The layer normalization calculation norm()\odot L+b is shown in fig 7. The calculation from the gates to the cell state, c^{t}=i^{t}\odot z^{t}+f^{t}\odot c^{t-1}, and from the cell to the hidden state, m^{t}=o^{t}\odot g(c^{t}), is shown in fig 10. The projection layer calculation h^{t}=W_{proj}m^{t}+b_{proj} is shown in fig 13.

3 LSTM quantization strategy

This section lists the details of our quantization strategy for LSTMs, which covers the transformation of floating point values to integer, as well as the necessary rewrite of the computation for the execution to take place entirely in the integer domain.

More specifically, our strategy is built around these principles:

  • No floating-point arithmetic: no operation takes place in floating-point, and thus no on-the-fly quantization/de-quantization.

  • No inner loop branching: deeper instruction pipelining.

  • No lookup tables: leverages modern SIMD instructions.

In practice, we aim to leverage 8 bit integer computation as much as possible, and only go to higher bit widths when needed. While existing work has shown that 8 bit quantization is enough for convolutions [7], we have found that it is not sufficient for LSTMs, where it leads to significant accuracy losses. As such, we quantize all static weight parameters in the LSTM to an 8 bit integer representation, and use a higher number of bits for calculations that are scalar or non-linear. The relevant details are covered in the sections below.

3.1 Quantization fundamentals

We quantize values as a linear affine transformation that in principle is similar to many previous works like [4, 6, 7], to mention a few. This means that at a high level, we quantize each of the values from a given tensor T of static values (in the original high precision) by computing a linear scale s_{T} that aims to evenly distribute them in the narrower scale dictated by the target precision’s number of bits n (Equation 8). Thus, each quantized value of the resulting tensor T_{i}^{\prime} is transformed according to s_{T} (Equation 9).

s_{T}=\frac{max(T)-min(T)}{2^{n}}   (8)
T_{i}^{\prime}=\frac{T_{i}}{s_{T}}   (9)
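As an illustration of Equations 8 and 9, the following Python sketch (a minimal example with hypothetical helper names, not the production TensorFlow Lite implementation) computes a linear scale for a float tensor and maps its values to integers:

```python
import numpy as np

def linear_scale(tensor, num_bits):
    # Scale that spreads the tensor's range over 2**num_bits steps (Equation 8).
    return (float(np.max(tensor)) - float(np.min(tensor))) / (2 ** num_bits)

def quantize(tensor, scale):
    # Map float values to integers with the given scale (Equation 9).
    return np.round(tensor / scale).astype(np.int32)

weights = np.array([-0.41, 0.07, 0.32, 0.55], dtype=np.float32)
s = linear_scale(weights, num_bits=8)
q = quantize(weights, s)
approx = q * s  # de-quantization; the difference from `weights` is the quantization error
```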

Our quantization strategy uses this formulation as its foundation, but important details and decisions around overflow and saturation, as well as the definition of scales, are described in the following subsections.

3.1.1 Overflow and saturation

Matrix multiplication ("matmul") is a basic operation in neural networks, including the LSTM. It can be defined as a binary operation on two input matrices producing a third one such that c_{ik}=\sum_{j}a_{ij}b_{jk}, where j is summed over for all values of i and k. Relevant to our work is that its overflow behavior can be modeled via a random walk, and a safe accumulation depth can be calculated. For example, in a typical "matmul" operation with inputs comprised of signed 8 bit integers (int8) that accumulate into a signed 32 bit integer (int32) result, there is no possibility of overflowing within 2^{15} accumulation steps. However, a 24 bit accumulator has a safe accumulation depth of only 2^{7}.

Thus, as long as the dimensions of the tensors in the model are smaller than that upper limit of steps, the "matmul" computations used to execute the model are safe from accumulator overflow. Empirically we have found that, when constraining "matmul" operations to int8 inputs with an int32 accumulator, most models [6, 21, 7] are safe from overflow.
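To make the accumulation-depth argument concrete, the short sketch below (our own illustration, not part of the paper’s tooling) computes the worst-case number of int8 products that a signed accumulator can absorb before it could possibly overflow; the depths quoted in the text (2^{15} and 2^{7}) are more conservative, since they come from the random-walk model rather than the worst case:

```python
def safe_accumulation_depth(input_bits, accumulator_bits):
    # Worst-case depth: every product has the maximum possible magnitude.
    max_product = (2 ** (input_bits - 1)) ** 2           # e.g. 128 * 128 for int8
    accumulator_limit = 2 ** (accumulator_bits - 1) - 1  # e.g. 2^31 - 1 for int32
    return accumulator_limit // max_product

print(safe_accumulation_depth(8, 32))  # 131071, comfortably above 2^15 = 32768
print(safe_accumulation_depth(8, 24))  # 511, above 2^7 = 128
```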

When overflows do occur, in some cases saturation can solve the issue. For example, when the operation is followed by a sigmoid function, the clamping error introduced by saturating values outside a suitable range such as [-8,8] is negligibly small (see section 3.2.1). However, in other cases overflow can harm accuracy and should be avoided. Common ways to avoid overflow (though not entirely guaranteed) are using quantization simulation during training [7, 22], or selecting a good calibration dataset when post-training quantization is used [23].

The main takeaway we want to convey here is that overflow and saturation behavior needs to be modeled carefully, and thus is a key driver of the decisions (described later in the document) in our proposed quantization strategy.

3.1.2 Power-of-two scales and Q_{m.n} format

While Equation 8 defines the scale as a real number, in practice we approximate scales as power-of-two values so as to operate in the fixed-point domain, since computation involving power-of-two values can be implemented more efficiently in digital computers as bitwise operations.

We make use of the Q_{m.n} number format [24], where the integers m and n represent the number of integer and fractional bits, respectively. More specifically, in this work Q itself denotes signed fixed-point numbers and follows the convention that m+n+1 equals the bit width of the type.

This means that Q_{m.n} can represent floating-point values within the range [-2^{m}, 2^{m}-2^{-n}], with a resolution of 2^{-n}.
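The following sketch (hypothetical helpers, not from the paper) shows how a float can be stored and recovered in the signed Q_{m.n} format described above:

```python
def to_fixed_point(x, m, n):
    # Signed Q(m.n): m integer bits, n fractional bits, plus one sign bit.
    lo, hi = -(2 ** m), 2 ** m - 2 ** (-n)   # representable range
    clamped = max(lo, min(hi, x))
    return round(clamped * 2 ** n)           # stored as an integer

def from_fixed_point(q, n):
    return q * 2 ** (-n)

# Q3.12 is a 16-bit type covering [-8, 8 - 2^-12] with resolution 2^-12.
q = to_fixed_point(1.5, m=3, n=12)   # 6144
print(from_fixed_point(q, n=12))     # 1.5
```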

3.2 LSTM quantization breakdown

In this subsection, we break down the quantization of common building blocks of the LSTM computation. This is particularly useful because sometimes specific decisions need to be made for such blocks, and because different LSTM variants are constructed from combinations of those blocks.

The LSTM is comprised of a set of gate computations. Each gate computation contains a series of matrix multiplications, followed by non-linear activation functions (g and \sigma) applied to the result of these matrix multiplications. Even though the activation functions \sigma and g (see eq 3) sit in the middle of the LSTM inference, their quantization is discussed first (section 3.2.1) because they require special input and output scales. The cell state c (see eq 4) is the input to the activations, so its quantization is determined next, in section 3.2.2. This is followed by the quantization of the peephole weights in section 3.2.3, since the peephole weights are only used together with the cell state c. Gate calculations without layer normalization (Wx^{t}+Rh^{t-1}+P\odot c^{t-1}+b) and with layer normalization (Wx^{t}+Rh^{t-1}+P\odot c^{t-1}) are discussed in sections 3.2.4 and 3.2.5, respectively. The quantization of layer normalization is presented in section 3.2.6, where we introduce an extra factor into the inference process to avoid a catastrophic accuracy loss that cannot be solved by tuning scales. The quantization of the hidden state (equation 6) is discussed in section 3.2.7, and when there is a projection, the quantization of the projection weights and output state (equation 7) is discussed in section 3.2.8. At last we discuss CIFG in section 3.2.9, where equation 1 becomes 1-f^{t}.

3.2.1 Non-linear activation functions

The non-linear activation functions \sigma (see eq 1, eq 2, eq 5) and g (see eq 3, eq 6) have restrictions on their input and output scales, so their scales need to be decided first.

For the activation functions, we utilize 16 bits, since given their scalar nature we found that the higher precision is needed to ensure the overall accuracy of the LSTM across different types of models and tasks. Additionally, Q_{m.15-m} formats are used as the input and output scales of the activations.

For the output of the activation functions, Q_{0.15} is selected since it maps nicely to the output range of \sigma and tanh. The output values are slightly clamped to [-1,\frac{32767}{32768}], and experimentally no accuracy loss is observed (see section 5 for more details).

Activations are subject to both clamping error and resolution error. The maximum clamping error is f(\infty)-f(2^{m}). Taking tanh as an example, if we restrict the input of tanh to [-8,8] (Q_{3.12}), there is a clamping error of 1-tanh(8)\approx 2.25e-7. Resolution error is the error from representing all values in a quantization "bucket" with one quantized value. The maximum resolution error is 2^{-n}max(f^{\prime}(x)), where max(f^{\prime}(x)) is the maximum of the derivative of f. Taking tanh as an example, the maximum resolution error happens at x=0 (maximum gradient) with value tanh(2^{-12})\approx 2.44e-4.

When m becomes bigger, the clamping error is reduced but the resolution error increases and becomes dominant; when m becomes smaller, the clamping error dominates. The value m in Q_{m.15-m} for the activation inputs is determined by balancing the two errors. Working out the math for both tanh and \sigma, Q_{3.12} has the lowest error.
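The trade-off can be checked numerically; the sketch below (our own illustration) evaluates both error terms of tanh for candidate Q_{m.15-m} input formats:

```python
import numpy as np

def tanh_errors(m, total_bits=16):
    # Clamping and resolution error when the tanh input uses Q(m.(15-m)).
    n = total_bits - 1 - m                 # fractional bits
    clamping = 1.0 - np.tanh(2.0 ** m)     # f(inf) - f(2^m)
    resolution = np.tanh(2.0 ** (-n))      # worst case near x = 0 (maximum slope)
    return clamping, resolution

for m in range(6):
    c, r = tanh_errors(m)
    print(f"Q{m}.{15 - m}: clamping {c:.2e}, resolution {r:.2e}, total {c + r:.2e}")
# Q3.12 yields the smallest combined error, matching the choice above.
```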

3.2.2 Cell state

The cell state c is the internal memory of the LSTM cell. Because it persists across multiple invocations and is mainly involved in scalar operations, 16 bits are needed to preserve accuracy.

Cell state is used in 3 places:

  • as input to elementwise tanhtanh.

  • as input to elementwise multiplication with forget gate.

  • as input to elementwise multiplication for peephole connection.

We cannot clamp the cell state to [-8,8] because it is also used in multiplications. Instead, we extend the measured range to the next power-of-two value and symmetrically quantize it to 16 bits. For example, suppose the cell state is measured to be in the range [-3.2,10]; we extend that to [-16,16), which leads to quantization with Q_{4.11}. In this case, even though Q_{3.12} is still the best input format for tanh, we can directly use Q_{4.11} as the input to tanh to remove an unnecessary rescaling.
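A small sketch of this range extension (a hypothetical helper illustrating the rule, not the TensorFlow Lite code) is shown below:

```python
import math

def cell_state_format(measured_min, measured_max, total_bits=16):
    # Extend the measured cell-state range to the next power of two and
    # derive the resulting Q(m.n) fixed-point format.
    bound = max(abs(measured_min), abs(measured_max))
    m = max(0, math.ceil(math.log2(bound)))   # integer bits needed
    n = total_bits - 1 - m                    # remaining fractional bits
    return m, n, 2.0 ** (-n)                  # format and its scale

# Measured range [-3.2, 10] extends to [-16, 16), i.e. Q4.11 with scale 2^-11.
print(cell_state_format(-3.2, 10.0))   # (4, 11, 0.00048828125)
```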

3.2.3 Peephole connection

The peephole connection P\odot c^{t-1} is the optional connection from the cell state to the gates (see fig 1). Since the multiplication between the 16 bit state and the peephole coefficients is elementwise, we symmetrically quantize the peephole coefficients to [-32767,32767] with scale s=\frac{max(abs(P))}{32767} and store them as int16. The lack of a 16 bit by 8 bit multiplication instruction in ARM NEON further diminishes the need to quantize the peephole weights to 8 bits.

3.2.4 Gate without layer normalization

There are four gates in one LSTM cell: input (eq 1), forget (eq 2), update (eq 3) and output (eq 5). Without layer normalization, the gate calculation is Wx^{t}+Rh^{t-1}+P\odot c^{t-1}+b. As discussed above, P and c^{t-1} are 16 bits. For the matrix multiplication operations, 8 bit values and operations are sufficient because the quantization errors are expected to statistically cancel during the accumulation. W and R are symmetrically quantized to [-127, 127] with scale s_{W}=\frac{max(abs(W))}{127} and stored as int8. x and h are asymmetrically quantized to [-128, 127] with scale s_{x}=\frac{max(x)-min(x)}{255} and stored as int8. In practice, to ensure that float zero maps to an integer zero point, max(x) and min(x) are lightly nudged [7] in the asymmetric case.
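The sketch below (our own helper functions, following the convention that the real value equals (q + zp) times the scale, as used in the W(x + zp) formulas later on) illustrates the symmetric weight quantization and the asymmetric activation parameters with a nudged zero point:

```python
import numpy as np

def quantize_weights_symmetric(w):
    # Symmetric int8 quantization of weights to [-127, 127].
    scale = float(np.max(np.abs(w))) / 127.0
    return np.clip(np.round(w / scale), -127, 127).astype(np.int8), scale

def activation_quant_params(x_min, x_max):
    # Asymmetric int8 parameters for activations in [-128, 127].
    # Convention: real = (q + zero_point) * scale, so the matmul uses W(x + zp).
    scale = (x_max - x_min) / 255.0
    zero_point = int(np.clip(round(x_min / scale + 128.0), -128, 127))
    nudged_min = (-128 + zero_point) * scale   # range actually represented,
    nudged_max = (127 + zero_point) * scale    # with float 0.0 on an exact integer
    return scale, zero_point, (nudged_min, nudged_max)
```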

Wx^{t} and Rh^{t-1} are accumulations over the product of two int8 values, so int32 is used to accumulate the result. For P\odot c^{t-1}+b, even though there is no accumulation, int32 is needed for the product of two int16 values. The results of Wx, Rh and P\odot c are therefore stored in three 32 bit accumulators, each with a different scale.

The bias is added elementwise to an accumulator, so we choose to quantize it to 32 bits. The quantized bias needs to be added before rescaling, and it can use the scale of any of the three accumulators. In the models we tested, s_{R}s_{h} is the smallest of s_{W}s_{x}, s_{R}s_{h} and s_{P}s_{c}, so it provides the best resolution. The bias is therefore symmetrically quantized with scale s_{R}s_{h} to [-(2^{31}-1),2^{31}-1] and stored as int32.

As discussed in section 3.2.1, without layer normalization the overall result of Wx^{t}+Rh^{t-1}+P\odot c^{t-1}+b uses Q_{3.12} as its scale, so it is symmetrically quantized with scale s=2^{-12} to [-32768,32767] and stored as int16.

Figure 2 shows the quantization details of the gates.

The three int32 accumulators need to be rescaled to the output scale s=2^{-12} before being added together. To rescale an accumulator, an effective scale is calculated, e.g. s_{effx}=2^{12}s_{W}s_{x}, which is the ratio between the accumulator scale and the output scale. Similarly, s_{effh}=2^{12}s_{R}s_{h} and s_{effc}=2^{12}s_{P}s_{c} are calculated.

The gate integer execution becomes rescale(W(x+zp),s_{effx})+rescale(R(h+zp)+b,s_{effh})+rescale(Pc,s_{effc}). This is visualized in figure 3.
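In an integer-only kernel the effective scales themselves cannot stay in floating point; a common approach, similar in spirit to [7] (our sketch below, not necessarily the exact TensorFlow Lite routine), is to decompose each effective scale into a 32 bit fixed-point multiplier plus a shift:

```python
import math
import numpy as np

def quantize_multiplier(real_scale):
    # Decompose real_scale into a Q0.31 integer multiplier and a power-of-two exponent.
    mantissa, exponent = math.frexp(real_scale)      # real_scale = mantissa * 2^exponent
    multiplier = int(round(mantissa * (1 << 31)))
    if multiplier == (1 << 31):                      # mantissa rounded up to 1.0
        multiplier //= 2
        exponent += 1
    return multiplier, exponent

def rescale(acc, multiplier, exponent):
    # Integer rescaling: (acc * multiplier) >> (31 - exponent), with rounding.
    product = np.int64(acc) * np.int64(multiplier)
    shift = 31 - exponent
    return int((product + (np.int64(1) << (shift - 1))) >> shift)

s_w, s_x = 0.0041, 0.0187                  # assumed weight and input scales
s_eff_x = (2 ** 12) * s_w * s_x            # accumulator scale over the 2^-12 output scale
mult, exp = quantize_multiplier(s_eff_x)
print(rescale(12345, mult, exp), 12345 * s_eff_x)   # integer result vs float reference
```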

3.2.5 Gate with layer normalization

In the presence of layer normalization, the gate calculation becomes Wx+Rh+P\odot c, and the bias is applied after normalization. As in the case without layer normalization, W, x, R and h are quantized to 8 bits and P and c are quantized to 16 bits. Because the output of the gate is not consumed directly by the activations, we cannot use 2^{-12} as the output scale. Instead, the output scale is calculated from measured values: s=\frac{max(abs(Wx+Rh+P\odot c))}{32767}. The quantized values are mapped to [-32767, 32767] and stored as int16. The quantization is shown in figure 5.

As in the case without layer normalization, three effective scales (s_{effx}, s_{effh}, s_{effc}) are calculated. The integer execution is rescale(W(x+zp),s_{effx})+rescale(R(h+zp),s_{effh})+rescale(Pc,s_{effc}), which is visualized in figure 6.

3.2.6 Layer normalization

Layer normalization [20], norm()\odot L+b, is widely used in streaming use cases. It normalizes the activation vector and therefore protects the model from the overall shifts caused by the gate matrix multiplications, which are one of the primary sources of accuracy degradation in LSTM cell quantization.

Layer normalization normalizes x to x^{\prime} with zero mean and unit standard deviation, and then applies the layer normalization coefficients and bias:

\bar{x}=\frac{\sum_{i=1}^{n}x_{i}}{n}   (10)
x_{i}^{\prime}=\frac{x_{i}-\bar{x}}{\sqrt{\frac{\sum_{i=1}^{n}(x_{i}^{2}-\bar{x}^{2})}{n}}}   (11)
y_{i}=x_{i}^{\prime}L_{i}+b_{i}   (12)

The float calculation is shown in fig 7.

The output of the normalization, x^{\prime}, is limited to a small range. Assuming a normal distribution, 99.7% of the values in x_{i}^{\prime} are confined to [-3.0,3.0], which is roughly 2.8 bits in the integer representation. This leads to catastrophic accuracy degradation in the model. Increasing the number of bits and/or adjusting the scale of the normalization input would not help, because any scale would be cancelled since it appears in both the numerator and the denominator in the calculation of x_{i}^{\prime}.

Instead, we solve the challenge by adding a scaling factor s^{\prime} to x^{\prime} in the computational graph. With x_{i}^{\prime}=q_{i}^{\prime}s^{\prime}, the quantized value q_{i}^{\prime} can now be expressed as q_{i}^{\prime}=\frac{x_{i}^{\prime}}{s^{\prime}}=\frac{q_{i}-\bar{q}}{\sqrt{\frac{\sum_{i=1}^{n}(q_{i}^{2}-\bar{q}^{2})}{n}}}\times\frac{1}{s^{\prime}}. s^{\prime} is chosen to be the smallest power-of-two number that will not cause overflows in q_{i}^{\prime}, which is 2^{-10} in our experiments. With s^{\prime} determined, the integer inference becomes [21]:

\bar{q}=round\left(\frac{\sum_{i=1}^{n}2^{10}q_{i}}{n}\right)   (13)
\sigma=\sqrt{\frac{\sum_{i=1}^{n}(2^{20}q_{i}^{2}-\bar{q}^{2})}{n}}=\sqrt{\frac{2^{20}}{n}\sum_{i=1}^{n}q_{i}^{2}-\bar{q}^{2}}   (14)
q_{i}^{\prime}=round\left(\frac{2^{10}q_{i}-\bar{q}}{\sigma}\right)   (15)
q_{i}^{\prime\prime}=round\left(\frac{q_{i}^{\prime}q_{L_{i}}+q_{b_{i}}}{2^{10}}\right)   (16)

q_{L} is the quantized layer normalization weight and q_{b} is the quantized bias. Layer normalization weights are symmetrically quantized with scale s_{L}=\frac{max(abs(L))}{32767} to [-32767, 32767] and stored as int16. The bias is symmetrically quantized with scale s_{b}=2^{-10}s_{L} to [-(2^{31}-1),2^{31}-1] and stored as int32. The quantization of layer normalization is shown in fig 8 and the integer execution is shown in fig 9.
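A direct transcription of Equations 13 to 16 is sketched below; for brevity the division and square root are written with floating-point operators, whereas production kernels use integer routines for these steps as well:

```python
import numpy as np

def integer_layer_norm(q, q_L, q_b):
    # q: int16 gate output, q_L: int16 layer-norm weights, q_b: int32 bias.
    # int64 intermediates provide headroom for the 2^10 / 2^20 factors.
    q = q.astype(np.int64)
    n = q.size
    q_mean = np.round(np.sum(1024 * q) / n)                        # Equation 13
    sigma = np.sqrt((2 ** 20 / n) * np.sum(q * q) - q_mean ** 2)   # Equation 14
    q_norm = np.round((1024 * q - q_mean) / sigma)                 # Equation 15
    y = np.round((q_norm * q_L.astype(np.int64) + q_b) / 1024)     # Equation 16
    return y.astype(np.int64)
```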

3.2.7 Gate outputs to hidden

The new cell state c^{t} is calculated from the old cell state c^{t-1} and the input, forget and update gates through c^{t}=i^{t}\odot z^{t}+f^{t}\odot c^{t-1}. As shown in fig 11, the three gates all use Q_{0.15} as their scale and the cell uses Q_{m.15-m}, i.e. 15-m fractional bits. The quantized execution from gates to cell is c^{t}=\textsf{shift}(i^{t}\odot z^{t},30-(15-m))+\textsf{shift}(f^{t}\odot c^{t-1},15), where \textsf{shift}(\cdot,k) denotes an arithmetic right shift by k bits.

The hidden state m^{t}=o^{t}\odot g(c^{t}) is the elementwise product of the output gate and the activated cell state. The hidden state is mathematically bounded to [-1,1] because it is the product of two values that are bounded to [-1,1]. Experimentally we have observed that using the measured range instead of [-1,1] improves accuracy. The hidden state is therefore quantized asymmetrically with scale s_{m}=range(m)/255 to [-128, 127] and stored as int8.

From the cell to the hidden state m, an effective scale s_{eff}=2^{-30}/s_{m} is calculated and the integer execution is m=\textsf{rescale}(o\odot g(c),s_{eff})-zp. The overall integer execution from gates to hidden state is shown in fig 12.
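The two updates can be sketched as below (our own illustration; the rescale to the hidden-state scale is shown with a float effective scale for brevity):

```python
import numpy as np

def update_cell_state(i_gate, z_gate, f_gate, c_prev, cell_frac_bits):
    # Gates are Q0.15 int16; the cell state has cell_frac_bits fractional
    # bits (e.g. 11 for Q4.11). Products are widened to int64 before shifting.
    iz = i_gate.astype(np.int64) * z_gate.astype(np.int64)   # 30 fractional bits
    fc = f_gate.astype(np.int64) * c_prev.astype(np.int64)   # 15 + cell_frac_bits
    c_new = (iz >> (30 - cell_frac_bits)) + (fc >> 15)       # back to the cell format
    return np.clip(c_new, -32768, 32767).astype(np.int16)

def hidden_state(o_gate, g_c, s_m, m_zp):
    # m = o * g(c): the Q0.30 product is rescaled to the int8 hidden-state scale s_m.
    prod = o_gate.astype(np.int64) * g_c.astype(np.int64)
    m = np.round(prod * (2.0 ** -30) / s_m) - m_zp
    return np.clip(m, -128, 127).astype(np.int8)
```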

3.2.8 Projection

The projection h^{t}=W_{proj}m^{t}+b_{proj} is the optional connection that projects the hidden state m to the output h. W_{proj} is symmetrically quantized with scale s_{w}=max(abs(W_{proj}))/127 to [-127, 127] and stored as int8. The bias b_{proj} is symmetrically quantized with scale s_{b}=s_{w}s_{m} to [-(2^{31}-1),2^{31}-1] and stored as int32. h is asymmetrically quantized with scale s_{h}=range(h)/255 to [-128, 127] and stored as int8. This is shown in fig 14.

Similarly, with the effective scale s_{eff}=s_{W_{proj}}s_{m}/s_{h}, the integer execution is h=\textsf{rescale}(W_{proj}(m+m_{zp})+b,s_{eff})-h_{zp}, as shown in fig 15.
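A sketch of the integer projection (again with the rescale shown via a float effective scale for brevity) could look like this:

```python
import numpy as np

def projection(w_proj_q, m_q, bias_q, s_w, s_m, s_h, m_zp, h_zp):
    # w_proj_q: int8 weights, m_q: int8 hidden state, bias_q: int32 bias
    # quantized with scale s_w * s_m; the output h is asymmetric int8.
    acc = w_proj_q.astype(np.int64) @ (m_q.astype(np.int64) + m_zp)
    acc += bias_q                                   # bias shares the accumulator scale
    h = np.round(acc * (s_w * s_m / s_h)) - h_zp    # rescale with s_eff, apply zero point
    return np.clip(h, -128, 127).astype(np.int8)
```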

3.2.9 CIFG

With CIFG, the input gate and forget gate are coupled: i^{t}=1-f^{t}. In the quantized case, because the input and forget gates have 2^{-15} as their scale, the coupling becomes i^{t}=min(32768-f^{t},32767). The extra clamping makes sure the result fits in int16. And because the forget gate is clamped slightly to [0,\frac{32767}{32768}] (instead of [0,1]), the input gate is clamped to [\frac{1}{32768},\frac{32767}{32768}] for CIFG.
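In code, the coupling is a single elementwise operation on the int16 forget gate (a minimal sketch):

```python
import numpy as np

def couple_input_gate(f_gate):
    # i = 1 - f in the Q0.15 domain, where 1.0 corresponds to 32768;
    # the result is clamped so that it fits in int16.
    i_gate = np.minimum(32768 - f_gate.astype(np.int32), 32767)
    return i_gate.astype(np.int16)
```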

3.3 Quantization strategy

The quantization of all LSTM variants can be built on top of the building blocks described above. They are summarized in table 2. At a high level, the quantization is:

  • Matrix related operations, such as matrix-matrix multiplications, are in 8 bits;

  • Vector related operations, such as the elementwise sigmoid, are in a mixture of 8 and 16 bits.

4 Collecting statistics

As shown in equation 8, in order to calculate a scale, the maximum and minimum values of the tensor are needed. For static weights, the maximum and minimum values can be easily obtained. For activations, there are two ways of collecting maximum and minimum values: post-training quantization [25] and quantization aware training (QAT) [7, 22]. Post-training quantization runs inference on a representative dataset and collects statistics; QAT collects tensor statistics during training and additionally fine-tunes the model weights by simulating quantization noise in the training process. Our quantization strategy, including both the quantization of the model and the integer execution, applies to both the post-training and QAT approaches. Post-training quantization is easier to use because it does not require training. Currently, LSTM quantization is enabled in the post-training approach in TensorFlow.

For the sake of completeness, we document the QAT approach for LSTM as well. In the original TensorFlow graph, for performance reasons, the recurrent and input activations are concatenated, as are the recurrent and input weights. Since the weights and activations in the integer LSTM have separate scales for the input and recurrent calculations, the training graph needs to be rewritten to remove the concatenation of the weights and activations for all the gates. The graphs before and after the rewrite are shown in fig 16.

5 Experiments

In table 1, we reproduce from [21] the accuracy for speech recognition on three anonymized private benchmark speech datasets: VoiceSearch, YouTube and Telephony. The VoiceSearch and Telephony datasets have an average utterance length of 4.7 seconds, while the YouTube dataset contains longer utterances, averaging 16.5 minutes per utterance. All models have the same RNN Transducer (RNN-T) [8, 16] architecture with 10 LSTM layers, each containing 2048 hidden units. A fixed 100-utterance dataset is sufficient to quantize the model with negligible accuracy loss across all LSTM architectures and sparsity levels, despite the large variety of audio and semantic conditions in speech recognition. The good accuracy on the long utterances in the YouTube dataset shows the robustness of the quantization strategy.

| Model | Enc & Dec | Sparsity | #Params (M) | Size vs. baseline | Quantization (size) | VoiceSearch | YouTube | Telephony |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LSTM (baseline) | LSTMx8 / LSTMx2 | 0% | 122.1 | 100% | Float (466 MB) | 6.6 | 19.5 | 8.1 |
| | | | | | Hybrid (117 MB) | 6.7 | 19.8 | 8.2 |
| | | | | | Integer (117 MB) | 6.7 | 19.8 | 8.2 |
| Sparse LSTM | LSTMx8 / LSTMx2 | 50% | 69.7 | 57% | Float (270 MB) | 6.7 | 20.2 | 8.2 |
| | | | | | Hybrid (71 MB) | 6.8 | 20.4 | 8.4 |
| | | | | | Integer (71 MB) | 6.9 | 22.9 | 8.7 |
| Sparse CIFG | CIFGx8 / CIFGx2 | 50% | 56.3 | 46% | Float (219 MB) | 7.1 | 21.7 | 8.3 |
| | | | | | Hybrid (57 MB) | 7.2 | 21.4 | 8.5 |
| | | | | | Integer (57 MB) | 7.2 | 20.6 | 8.7 |
Table 1: Comparison of float, hybrid and fully quantized models (WER, %) across the different datasets. Table reproduced from [21].

6 Deployment

In this section we discuss an optimization for the integer execution of the LSTM that is important for deployment.

The most computationally heavy part of the execution is the accumulation over the product of weights and activations: \Sigma_{i}(W_{i}\times(x_{i}+zp))+b_{i}. The zero point term \Sigma_{i}(W_{i}\times zp) is static, so it can be computed offline. In TensorFlow Lite [26], the pre-computed zero point terms are folded into the bias, \Sigma_{i}(W_{i}\times x_{i})+\Sigma_{i}(W_{i}\times zp)+b_{i}=\Sigma_{i}(W_{i}\times x_{i})+b^{\prime}_{i}, and the actual kernel treats both weights and activations as symmetric. With this optimization, the integer LSTM is about 5% faster than the hybrid version and about two times faster than float in RT factor [21].
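The folding itself is a one-time transformation of the bias; a minimal sketch (not the actual TensorFlow Lite code) is:

```python
import numpy as np

def fold_zero_point_into_bias(w_q, bias_q, zp):
    # Pre-compute sum_j(W_ij * zp) and fold it into the bias, so the runtime
    # kernel only evaluates sum_j(W_ij * x_j) + b'_i with symmetric operands.
    return bias_q + w_q.astype(np.int64).sum(axis=1) * zp
```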

7 Conclusions

We have demonstrated that integer RNN is accurate and meaningfully faster on CPU. More importantly, integer quantization has a number of additional advantages:

  • Widespread availability. Integer operations are common across hardware, including the latest generation of specialized chips.

  • Efficiency. Having all operations in integer means faster execution and lower power consumption. Furthermore, the use of pre-computed scales means there is no overhead from re-computing scales with every inference, nor from quantizing and dequantizing tensors on the fly as in the hybrid approach.

It is expected that some platforms will deviate from the quantization specification due to limitations in their hardware capabilities. The considerations described in this work can help decide where (and how much) deviation is acceptable.

It is also expected that this quantization strategy extends to other RNN variants. Unidirectional and bidirectional RNN/LSTM layers are loops around the LSTM cell, so the quantization strategy described in this work can be applied directly. For other RNNs, such as the Gated Recurrent Unit (GRU) [27] and the Simple Recurrent Unit (SRU) [28], the same design considerations and building blocks can be used.

References

  • [1] Alexander Gruenstein, Raziel Alvarez, Chris Thornton, and Mohammadali Ghodrat. A cascade architecture for keyword spotting on mobile devices. 2017.
  • [2] L. Codrescu, W. Anderson, S. Venkumanhanti, M. Zeng, E. Plondke, C. Koob, A. Ingle, C. Tabony, and R. Maule. Hexagon dsp: An architecture optimized for mobile multimedia and communications. IEEE Micro, 34(2):34–43, 2014.
  • [3] Google EdgeTPU. Google EdgeTPU.
  • [4] Vincent Vanhoucke, Andrew Senior, and Mark Z. Mao. Improving the speed of neural networks on cpus. In Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011, 2011.
  • [5] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages 1135–1143, 2015.
  • [6] Raziel Alvarez, Rohit Prabhavalkar, and Anton Bakhtin. On the efficient representation and execution of deep acoustic models. In Proceedings of Annual Conference of the International Speech Communication Association (Interspeech), 2016.
  • [7] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew G. Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. CoRR, abs/1712.05877, 2017.
  • [8] Alex Graves. Sequence transduction with recurrent neural networks. In CoRR, page vol. abs/1211.3711, 2012.
  • [9] Haşim Sak, Andrew Senior, and Françoise Beaufays. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Fifteenth annual conference of the international speech communication association, 2014.
  • [10] Tara Sainath, Yanzhang (Ryan) He, Bo Li, Arun Narayanan, Ruoming Pang, Antoine Bruguier, Shuo yiin Chang, Wei Li, Raziel Alvarez, Zhifeng Chen, Chung-Cheng Chiu, David Garcia, Alex Gruenstein, Kevin Hu, Minho Jin, Anjuli Kannan, Qiao Liang, Ian McGraw, Cal Peyser, Rohit Prabhavalkar, Golan Pundak, David Rybach, (June) Yuan Shangguan, Yash Sheth, Trevor Strohman, Mirkó Visontai, Yonghui Wu, Yu Zhang, and Ding Zhao. A streaming on-device end-to-end model surpassing server-side conventional model quality and latency. 2020.
  • [11] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144, 2016.
  • [12] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. CoRR, abs/1409.3215, 2014.
  • [13] Heiga Zen and Hasim Sak. Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 4470–4474, 2015.
  • [14] Yuxuan Wang, R. J. Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc V. Le, Yannis Agiomyrgiannakis, Rob Clark, and Rif A. Saurous. Tacotron: A fully end-to-end text-to-speech synthesis model. CoRR, abs/1703.10135, 2017.
  • [15] Ian McGraw, Rohit Prabhavalkar, Raziel Alvarez, Montse Gonzalez Arenas, Kanishka Rao, David Rybach, Ouais Alsharif, Hasim Sak, Alexander Gruenstein, Françoise Beaufays, and Carolina Parada. Personalized speech recognition on mobile devices. In Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.
  • [16] Yanzhang He, Tara N Sainath, Rohit Prabhavalkar, Ian McGraw, Raziel Alvarez, Ding Zhao, David Rybach, Anjuli Kannan, Yonghui Wu, Ruoming Pang, et al. Streaming end-to-end speech recognition for mobile devices. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6381–6385. IEEE, 2019.
  • [17] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, November 1997.
  • [18] Felix A. Gers, Nicol N. Schraudolph, and Jürgen Schmidhuber. Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research, 3:115–143, March 2003.
  • [19] Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen Schmidhuber. Lstm: A search space odyssey. IEEE transactions on neural networks and learning systems, 28(10):2222–2232, 2016.
  • [20] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • [21] Yuan Shangguan, Jian Li, Qiao Liang, Raziel Alvarez, and Ian McGraw. Optimizing speech recognition for the edge. arXiv preprint arXiv:1909.12408, 2019.
  • [22] TensorFlow model optimization. Tensorflow model optimization toolkit quantization aware training.
  • [23] TensorFlow model optimization. Tensorflow model optimization toolkit post training.
  • [24] Texas Instruments. Tms320c64x dsp library programmer’s reference. Literature Number: SPRU565B, 2003.
  • [25] TensorFlow model optimization. Tensorflow model optimization toolkit.
  • [26] TensorFlow Lite. TensorFlow Lite.
  • [27] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014.
  • [28] Tao Lei, Yu Zhang, Sida I Wang, Hui Dai, and Yoav Artzi. Simple recurrent units for highly parallelizable recurrence. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4470–4481, 2018.

Appendix A APPENDIX

| Tensor | Bits | Scale | Present in |
| --- | --- | --- | --- |
| x | 8 | range/255 | all variants |
| W_{i}, W_{f}, W_{z}, W_{o} | 8 | max/127 | all variants |
| R_{i}, R_{f}, R_{z}, R_{o} | 8 | max/127 | all variants |
| P_{i}, P_{f}, P_{o} | 16 | max/32767 | PH variants only |
| b_{i}, b_{f}, b_{z}, b_{o} | 32 | h\times R_{\lambda} (no LN); L_{\lambda}\times 2^{-10} (with LN) | all variants |
| W_{proj} | 8 | max/127 | Proj variants only |
| b_{proj} | 32 | W_{proj}\times m | Proj variants only |
| h | 8 | range/255 | all variants |
| c | 16 | POT(max)/32768 | all variants |
| L_{i}, L_{f}, L_{z}, L_{o} | 16 | max/32767 | LN variants only |
| g_{i}, g_{f}, g_{z}, g_{o} | 16 | max/32767 | LN variants only |
| m | 8 | range/255 | Proj variants only |

The rows corresponding to the input gate (W_{i}, R_{i}, P_{i}, b_{i}, L_{i}, g_{i}) become invalid when CIFG is used.

Table 2: The quantization recipe. max is max(|x_{i}|); range is max(x_{i})-min(x_{i}); POT(x) is x extended to the next power of two. For the variants, "LN" means layer normalization, "Proj" means projection and "PH" means peephole. Some scales are derived from other scales: for example, when there is no layer normalization, the input-gate bias b_{i} is quantized with the scale that is the product of the scale of the recurrent activation h and the scale of the recurrent weight R_{i}. Note that the cell state c has a power-of-two scale, so the denominator is 32768 instead of 32767. g_{i}, g_{f}, g_{z}, g_{o} are the gate matrix multiplication outputs Wx+Rh+P\odot c for the four gates.
Figure 1: Gate calculation without layer normalization in the original float graph. The cell gate does not have P and c.
Figure 2: Quantized gate calculation without layer normalization. The cell gate does not have P and c. "(8, 8)" denotes an asymmetrically quantized int8 input together with its zero point.
Figure 3: Integer execution of the gate calculation without layer normalization. The cell gate does not have P and c. The multiplication between (8) and (8, 8) is W(x+zp), which in practice is computed as Wx+W\,zp.
Figure 4: Gate calculation with layer normalization in the original float graph. The cell gate does not have P and c. Note that the bias is absent compared with the case without layer normalization. The output range needs to be logged.
Figure 5: Quantized gate calculation with layer normalization. The cell gate does not have P and c. "(8, 8)" denotes an asymmetrically quantized int8 input together with its zero point.
Figure 6: Integer execution of the gate calculation with layer normalization. The cell gate does not have P and c. The multiplication between (8) and (8, 8) is W(x+zp), which in practice is computed as Wx+W\,zp.
Figure 7: Layer normalization float calculation.
Figure 8: Layer normalization in quantized format.
Figure 9: Layer normalization integer execution.
Figure 10: Float calculation from gates to hidden state. When there is a projection, the hidden state range needs to be logged.
Figure 11: From gates to hidden state, quantized.
Figure 12: Integer execution from gates to hidden state.
Figure 13: Float calculation for the projection.
Figure 14: Projection calculation in quantized format.
Figure 15: Integer execution for the projection. The last add applies the zero point of the output h.
Figure 16: Float graph modification for quantization aware training of the LSTM: (a) original float TensorFlow graph; (b) rewritten float TensorFlow graph. The input and recurrent activations have different scales, so fake-quant nodes need to be added to each component separately, without the concatenation.