
Bayesian Attention Networks for Data Compression

Michael Tetelman
Volkswagen Group of America Innovation Center California, BayzAI
[email protected]
Abstract

The lossless data compression algorithm based on Bayesian Attention Networks is derived from first principles. Bayesian Attention Networks are defined by introducing an attention factor per training-sample loss as a function of two sample inputs: the training sample and the prediction sample. Using a sharpened Jensen's inequality we show that the attention factor is completely defined by a correlation function of the two samples w.r.t. the model weights. Due to the attention factor, the solution for a prediction sample is mostly defined by the few training samples that are correlated with it. Finding a specific solution per prediction sample couples training and prediction. To make the approach practical we introduce a latent space, map each prediction sample to that latent space, and learn all possible solutions as a function of the latent space, along with learning attention as a function of the latent space and a training sample. The latent space plays the role of a context representation: a prediction sample defines a context, and the learned context-dependent solution is used for the prediction.

1 Introduction: Lossless Compression as a Prediction Problem

Data compression is an important area of study, both for practical applications and for fundamental research in information theory (Salomon, 2006), (MacKay, 2002), (Mahoney, 2013b), (Lossless Compression, 2020), and it is the subject of many competitions, such as the Hutter Prize (Hutter Prize, 2021); see also (Mahoney, 2013a). Neural compression is an active research area (Kingma et al., 2019).

The fundamental approach to compressing a sequence of data elements is to use a probabilistic model that can reproducibly compute the probability of the sequence via the conditional probabilities of all its elements, and then to encode the sequence probability with an entropy encoder (Shannon, 1948), (Rissanen, 1976), (Pasco, 1976).

The direct consequence of this approach is that finding a good predictive model becomes critical for the development of a high-performance compression algorithm.

For a sequence of elements

$$s = a_1 a_2 a_3 \ldots a_n \ldots a_N \tag{1}$$

the probability of the sequence is a product of conditional probabilities of each element

$$P(s) = P(a_1)\, P(a_2|a_1)\, P(a_3|a_1 a_2) \ldots P(a_n|s_n) \ldots P(a_N|s_N), \tag{2}$$

where the sub-sequences $s_n$ are defined as $s_n = a_1 a_2 \ldots a_{n-1}$, $n = 2..N$. To simplify, we will consider the conditional dependencies for all probabilities as functions of fixed-length sub-sequences $s_{n-l,n}$ of length $l$, with left-side zero padding as needed. Then the probability of the sequence $s$ will be

$$P(s) = \prod_{n=1}^{N} P(a_n|s_{n-l,n}), \qquad s_{n-l,n} = a_{n-l} a_{n-l+1} \ldots a_{n-1}. \tag{3}$$

By defining a model $P(a|s,w)$ for the probability of an element $a$ given a preceding sequence $s$, parametrized by weights $w$, we obtain a parametrized probability of any sequence $s$

$$P(s|w) = \prod_{n=1}^{N} P(a_n|s_{n-l,n}, w), \qquad s_{n-l,n} = a_{n-l} a_{n-l+1} \ldots a_{n-1}. \tag{4}$$
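
To make Eq. (4) concrete, here is a minimal Python sketch that accumulates per-element log-probabilities over fixed-length, left-padded contexts and converts the result into an ideal entropy-coder length in bits. The `cond_prob` callable is a hypothetical placeholder for whatever parametrized model $P(a|s,w)$ is used.

```python
import math

def sequence_log_prob(seq, cond_prob, w, l, pad=0):
    """Log-probability of a sequence under Eq. (4): a product of conditional
    probabilities with fixed-length, left-padded contexts."""
    log_p = 0.0
    for n, a in enumerate(seq):
        # fixed-length context a_{n-l}..a_{n-1}, zero-padded on the left
        context = [pad] * max(0, l - n) + list(seq[max(0, n - l):n])
        log_p += math.log(cond_prob(a, context, w))
    return log_p

def ideal_code_length_bits(seq, cond_prob, w, l):
    """Ideal entropy-coder length in bits: -log2 P(s|w)."""
    return -sequence_log_prob(seq, cond_prob, w, l) / math.log(2.0)

# Toy example: a uniform model over a 256-symbol alphabet gives 8 bits/symbol.
uniform = lambda a, context, w: 1.0 / 256
print(ideal_code_length_bits(b"hello world", uniform, None, l=8))  # ~88 bits
```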

For a known training dataset of independent sequences $\{s_i, i=1..M\}$, the probability of the data for a selected model is given by a Bayesian integral over the weights, as follows from Bayes' theorem (Bayes & Price, 1763)

$$Prob(\{s_i\}|H) = \int_w \prod_{i=1}^{M} P(s_i|w)\, P_0(w|H)\, dw, \tag{5}$$

where $P_0(w|H)$ is a prior over the weights conditional on hyperparameters $H$.

The prediction of the probability $P(s|\{s_i\},H)$ of a new sequence $s$ is then given by the average

$$P(s|\{s_i\},H) = \langle P(s|w) \rangle = \int_w P(s|w) \prod_{i=1}^{M} P(s_i|w)\, P_0(w|H)\, dw \;/\; Prob(\{s_i\}|H). \tag{6}$$

We will call the new sequence $s$ a test sequence.
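
As an illustration of Eq. (6), the sketch below estimates the predictive probability by self-normalized Monte Carlo sampling of the weights from the prior. The `seq_prob` and `sample_prior` callables are hypothetical placeholders, and a practical implementation would work in the log domain to avoid underflow.

```python
import numpy as np

def predictive_prob(seq_prob, test_seq, train_seqs, sample_prior, n_samples=1000):
    """Conceptual Monte Carlo estimate of Eq. (6): draw weights from the prior
    P0(w|H), weight each draw by the training-data likelihood, and average the
    test-sequence probability under that (self-normalized) weighting."""
    rng = np.random.default_rng(0)
    num, den = 0.0, 0.0
    for _ in range(n_samples):
        w = sample_prior(rng)
        train_likelihood = np.prod([seq_prob(s, w) for s in train_seqs])
        num += seq_prob(test_seq, w) * train_likelihood
        den += train_likelihood
    return num / den   # the normalization Prob({s_i}|H) cancels in the ratio
```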

The prediction probability $P(s|\{s_i\},H)$ is an average over weights, with the most significant contribution coming from weights that maximize both the probability of the training data and the probability of the test sequence $P(s|w)$ in Eq. (6). The resulting average probability is high when the test sequence $s$ is similar to the most typical training sequences. However, the average will be low when $s$ is very different from the typical training data. The problem is that all training sequences contribute equally to the average, so a good prediction is possible only for a test sequence that is very similar to the training data.

To improve the prediction of the probability for a given sequence $s$, we will introduce importance weights for the training data samples in Eq. (6) that increase the contributions of samples with sequences $s_i$ similar to the test sequence $s$ and reduce the contributions of other samples.

The idea is an extension of a well-known approach used in compression: first map a prediction sample to a certain context, and then make the prediction for that sample using a context-specific predictor.

The importance-weight approach described in this paper leads to the definition of Bayesian Attention Networks (BAN), considered next.

The idea of accounting for long-range correlations between data samples to achieve better compression has a long history. The attention mechanism recently proposed in (Vaswani et al., 2017) was a great success for sequence prediction tasks as well as for other problems, including image recognition (Dosovitskiy et al., 2020).

2 Prediction of Sequence Probabilities with Bayesian Attention Networks (BAN)

We will define BAN in a few steps. First, we introduce the importance factors $\rho_i(s)$ for sequences by modifying the contribution of each training data sample in the Bayesian integrals

$$P(s|\{s_i\},H) = \int_w P(s|w) \prod_{i=1}^{M} e^{-\rho_i(s)\, l(s_i|w)}\, P_0(w|H)\, dw \;/\; Prob(\{s_i\}|H), \tag{7}$$
$$Prob(\{s_i\}|H) = \int_w \prod_{i=1}^{M} e^{-\rho_i(s)\, l(s_i|w)}\, P_0(w|H)\, dw. \tag{8}$$

Here $l(s_i|w)$ are the loss functions for sequences, defined as negative log-probabilities, $l(s|w) = -\log P(s|w)$. The importance factors $\rho_i(s)$ are assigned to each training sequence $s_i$ and are functions of the predicted sequence $s$. The importance factors must satisfy the normalization condition $\sum_i \rho_i(s) = M$, so that the total contribution of all training sequences is preserved.
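
A small sketch of how the normalization constraint $\sum_i \rho_i(s) = M$ can be applied to raw importance scores when forming the weighted training loss; the function and variable names are illustrative only.

```python
import numpy as np

def importance_weighted_loss(train_losses, raw_importance):
    """Reweight per-sample training losses with importance factors normalized
    to sum to M, as required by the constraint below Eqs. (7)-(8)."""
    losses = np.asarray(train_losses, dtype=float)
    raw = np.asarray(raw_importance, dtype=float)
    M = len(losses)
    rho = M * raw / raw.sum()          # normalized importance factors
    return float(np.sum(rho * losses))

losses = [0.7, 1.2, 0.3]
print(importance_weighted_loss(losses, [1.0, 1.0, 1.0]))  # uniform: ordinary total loss, 2.2
print(importance_weighted_loss(losses, [0.1, 0.1, 5.0]))  # dominated by the third sample
```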

The meaning of the importance factors is simple: $\rho_i(s)$ defines how strongly a given training sample $s_i$ influences the solution for the prediction probability of the test sequence $s$, with higher importance $\rho_i(s)$ increasing the contribution of the sample $s_i$ to predicting $s$.

In the limit where the unnormalized importance takes only the values zero or one, this is equivalent to clustering: the solution for the prediction sample $s$ is defined by the few training samples $s_i$ correlated with the prediction sample.

We will clarify the definition of importance weights in the following steps.

Next, we approximate the average of the probability of a sequence as a product of averages of the predicted probabilities of its elements, conditional on the corresponding immediate sub-sequences

$$P(s) = P(a_1 a_2 \ldots a_N) = \langle P(a_1|s_1,w) \rangle \langle P(a_2|s_2,w) \rangle \ldots \langle P(a_N|s_N,w) \rangle. \tag{9}$$

Then the prediction of the probability of an element is given by

$$\langle P(a|s,w) \rangle = \int_w P(a|s,w) \prod_{i=1}^{K} e^{-\rho_i(s)\, l(a_i|s_i,w)}\, P_0(w|H)\, dw \;/\; Prob(\{a_i\}|H) \tag{10}$$

with the normalization factor $Prob(\{a_i\}|H)$, which keeps $\langle P(a|s,w) \rangle$ normalized, defined as follows

$$Prob(\{a_i\}|H) = \int_w \prod_{i=1}^{K} e^{-\rho_i(s)\, l(a_i|s_i,w)}\, P_0(w|H)\, dw. \tag{11}$$

Here the importance factors depend only on the input sequences in each sample and not on the predicted elements. The index $i$ runs over all $K$ sequence elements $a_i$ in the training data in Eqs. (10, 11). The losses $l(a_i|s_i,w)$ are per element of the training sequences, and the importance factors $\rho_i(s)$ control the contribution of each training sample $i$ and depend on the sub-sequence $s$ of the element $a$ in the prediction probability $P(a|s)$. The importance factors are normalized by the condition $\sum_i \rho_i(s) = K$, where $K$ is the total number of training data samples.

Eqs. (10, 11) define the Bayesian Attention Network with attention variables $\rho_i(s)$, which are functions of two sequences. The attention or importance factors make it possible to take into account long-range interactions between different data samples.

It is important to note that for any approximation of the integrals in Eqs. (10, 11) the solution depends on the input sequence $s$ of the data sample for which we are making the prediction $P(a|s)$, which requires computing the integrals for every test sample.

To resolve this problem and find a practical computational approach, we will use a latent-variable representation $z$ with an encoder-decoder network, considered in the next section.

3 Computing Importance via Extended Variational Approximation

Let us define an encoder $\rho(z|s)$ and a decoder $\rho(s|z)$. To make a prediction for a sample $(a,s)$, the encoder $\rho(z|s)$ gives the probability of $z$ for input $s$, and the decoder $\rho(s_i|z)$ gives the probability of a training sequence $s_i$ for a given $z$.

The decoder allows us to define the normalized importance factor

$$\rho_i(z) = K \rho(s_i|z) \,/\, \sum_j \rho(s_j|z). \tag{12}$$
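
Eq. (12) can be computed in a numerically stable, softmax-like form when the decoder returns log-probabilities; the sketch below assumes such hypothetical decoder outputs.

```python
import numpy as np

def normalized_importance(decoder_log_probs):
    """Eq. (12): rho_i(z) = K * rho(s_i|z) / sum_j rho(s_j|z), computed from
    decoder log-probabilities in a numerically stable way."""
    log_p = np.asarray(decoder_log_probs, dtype=float)
    K = len(log_p)
    p = np.exp(log_p - log_p.max())    # stable exponentiation
    return K * p / p.sum()             # sums to K by construction

rho = normalized_importance([-2.0, -0.1, -5.0])
print(rho, rho.sum())  # importance concentrates on the second sample; sum == 3
```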

With that we can reformulate Eqs.(10,11) as follows

$$\langle P(a|s,w) \rangle = \int_{w,z} P(a|s,w) \prod_{i=1}^{K} e^{-\rho_i(z)\, l(a_i|s_i,w)}\, P_0(w|H)\, \rho(z|s)\, dw\, dz \;/\; Prob(\{a_i\}|H),$$
$$Prob(\{a_i\}|H) = \int_{w,z} \prod_{i=1}^{K} e^{-\rho_i(z)\, l(a_i|s_i,w)}\, P_0(w|H)\, \rho(z|s)\, dw\, dz. \tag{13}$$

Due to Bayes' theorem (Bayes & Price, 1763), the encoder probability is related to the decoder probability and can be expressed via the decoder probability and the priors for $z$ and $s$

$$\rho(z|s) = \frac{\rho(s|z)\, \rho(z)}{\rho(s)}. \tag{14}$$

By introducing an approximation $q(z|s)$ for the encoder, we can use a variational method to simplify the integral over $z$, following (Kingma & Welling, 2014)

$$\int_z e^{-L(z,w)}\, \rho(s|z)\, \rho(z)\, dz \;\geqslant\; \exp\left( -\int_z q(z|s)\, L(z,w)\, dz + \int_z q(z|s) \log \frac{\rho(s|z)\, \rho(z)}{q(z|s)}\, dz \right), \tag{15}$$

where the total training loss is $L(z,w) = \sum_i \rho_i(z)\, l(a_i|s_i,w)$. Maximizing the right-hand side of Eq. (15) w.r.t. the parameters of $q(z|s)$ allows us to find the encoder $q$.
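
A Monte Carlo sketch of the objective on the right-hand side of Eq. (15), written in log form. Every callable (`q_sample`, `q_log_prob`, `decoder_log_prob`, `prior_z_log_prob`, `weighted_train_loss`) is a hypothetical placeholder for the corresponding network or loss; the dependence of $L(z,w)$ on $w$ is folded into `weighted_train_loss`.

```python
import numpy as np

def variational_objective(s, q_sample, q_log_prob, decoder_log_prob,
                          prior_z_log_prob, weighted_train_loss, n_samples=16):
    """Estimate of the log of the right-hand side of Eq. (15):
    E_q[-L(z,w)] + E_q[log rho(s|z) + log rho(z) - log q(z|s)]."""
    rng = np.random.default_rng(0)
    total = 0.0
    for _ in range(n_samples):
        z = q_sample(s, rng)                       # z ~ q(z|s)
        total += (-weighted_train_loss(z)          # -L(z,w) = -sum_i rho_i(z) l(a_i|s_i,w)
                  + decoder_log_prob(s, z)         # log rho(s|z)
                  + prior_z_log_prob(z)            # log rho(z)
                  - q_log_prob(z, s))              # -log q(z|s)
    return total / n_samples                       # maximize w.r.t. the parameters of q
```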

However, the variational approximation based on Jensen's inequality (Jensen, 1906), $\langle \exp(x) \rangle \geqslant \exp(\langle x \rangle)$, results in the loss of the dependency on the importance factors in the prediction probability in Eq. (13). To retain that dependency we have to use a sharpened Jensen's inequality (see Appendix A for the proof)

$$\langle e^x \rangle \;\geqslant\; e^{\langle x \rangle + \frac{1}{2} \langle (x - \langle x \rangle)^2 \rangle}. \tag{16}$$

Then the prediction probability can be maximized w.r.t. the importance weights. We find that the variation of the log of the prediction probability is equal to a correlation function of the test and training losses

$$\frac{\delta \log \langle P(a|s,w) \rangle}{\delta \rho_i(z)} = \langle l(a_i|s_i,w)\, l(a|s,w) \rangle_w - \langle l(a_i|s_i,w) \rangle_w \langle l(a|s,w) \rangle_w \approx \sum_j \sigma_j^2\, \frac{\partial l(a_i|s_i,w)}{\partial w_j}\, \frac{\partial l(a|s,w)}{\partial w_j}, \tag{17}$$

where the averages $\langle (..) \rangle_w$ are computed with the posterior distribution of the weights

$$Post(w|H_0,z) = \prod_{i=1}^{K} e^{-\rho_i(z)\, l(a_i|s_i,w)}\, P_0(w|H_0), \tag{18}$$

by expanding in the weights around their means up to second order and averaging.
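
The approximation in Eq. (17) reduces to a variance-weighted dot product of per-weight loss gradients; a minimal sketch, assuming the gradients and posterior variances have already been computed elsewhere:

```python
import numpy as np

def importance_gradient(grad_train_i, grad_test, weight_var):
    """Eq. (17): sensitivity of log <P(a|s,w)> to rho_i, approximated by the
    variance-weighted correlation of training- and test-loss gradients."""
    g_i = np.asarray(grad_train_i, dtype=float)   # d l(a_i|s_i,w) / d w_j
    g_t = np.asarray(grad_test, dtype=float)      # d l(a|s,w) / d w_j
    var = np.asarray(weight_var, dtype=float)     # sigma_j^2 from the posterior
    return float(np.sum(var * g_i * g_t))

# A training sample whose loss gradient aligns with the test-loss gradient gets
# a larger correlation and therefore a larger attention (importance) update.
print(importance_gradient([0.5, -1.0], [0.4, -0.8], [0.1, 0.2]))  # 0.18
```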

4 Complete Loss and Recurrent Update Rules for Hyperparameters

Due to the sharpened Jensen's inequality in Eq. (16), the complete loss includes a cross-correlation term between the test-sample loss and the training-sample losses

$$\int_{w,z} P_0(w|H_0)\, q(z|s)\, \bigl( L(z,w) - l(a|s,w)\, L(z,w) \bigr)\, dw\, dz \;-\; \int_z q(z|s) \log \frac{\rho(s|z)\, \rho(z)}{q(z|s)}\, dz, \tag{19}$$

where the training loss is $L(z,w) = \sum_i \rho_i(z)\, l(a_i|s_i,w)$.

Finding the importance factors by maximizing the prediction probability for the test sample requires minimizing only the cross-correlation loss, due to the normalization factor in Eq. (13). All other parameters can be found by minimizing the loss in Eq. (19).
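
A hedged Monte Carlo sketch of the complete loss in Eq. (19); all callables are hypothetical placeholders for the actual model components, evaluated for a fixed test sample $(a, s)$.

```python
import numpy as np

def complete_loss(z_samples, w_samples, train_loss, test_loss,
                  log_decoder, log_prior_z, log_q):
    """Eq. (19): the importance-weighted training loss, minus the
    cross-correlation term test_loss * train_loss, minus the variational
    regularizer E_q[log(rho(s|z) rho(z) / q(z|s))]."""
    terms = []
    for z, w in zip(z_samples, w_samples):      # z ~ q(z|s), w ~ P0(w|H0)
        L = train_loss(z, w)                    # L(z,w) = sum_i rho_i(z) l(a_i|s_i,w)
        terms.append(L - test_loss(w) * L       # cross-correlation term
                     - (log_decoder(z)          # log rho(s|z) for the fixed test input s
                        + log_prior_z(z)        # log rho(z)
                        - log_q(z)))            # -log q(z|s)
    return float(np.mean(terms))
```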

Let us use the posterior distribution for the weights that depends on the initial hyperparameters $H_0$ and the latent variable $z$ in Eq. (18). The method developed in (Tetelman, 2020) allows us to compute approximations of the Bayesian integrals by re-parametrizing the prior of the weights $P_0(w|H)$ to represent the posterior

$$Post(w|H',z) = \prod_{i=1}^{K} e^{-\rho_i(z)\, l(a_i|s_i,w)}\, P_0(w|H_0) \;\sim\; P_0(w|H(z,H_0)). \tag{20}$$

The re-parametrization is computed recursively. With Gaussian priors for all network weights and a Gaussian encoder $q(z|s)$, we can find the following update rules for the hyperparameters: (mean, variance) pairs $(\mu, \sigma^2)$ for all weights of all networks. The training update steps are as follows

$$i \sim \mathbf{Uniform}[1..K] \tag{21}$$
$$z(s,v) \sim q(z|s,v) \tag{22}$$
$$w(z,v) \sim P_0(w \mid \mu_w(z|v), \sigma_w(z|v)) \tag{23}$$
$$\rho_i(z,u) = K \rho(s_i|z,u) \,/\, \sum_j \rho(s_j|z,u) \tag{24}$$
$$\mathbf{grad}_v = \frac{\partial}{\partial v} \left[ \rho_i(z,u)\, l(a_i|s_i, w(z,v)) - \log \frac{\rho(s|z,u)\, \rho(z)}{q(z|s,v)} \right] \tag{25}$$
$$\Delta\mu_v \mathrel{+}= -\varepsilon\, \sigma_v^2\, \mathbf{grad}_v \tag{26}$$
$$\Delta\frac{1}{\sigma_v^2} \mathrel{+}= \varepsilon\, (\mathbf{grad}_v)^2 \tag{27}$$
$$\mathbf{grad}_u = \sum_j \left( \frac{\sigma_{w_j}^2}{K}\, \frac{\partial l(a_i|s_i,w)}{\partial w_j}\, \frac{\partial l(a|s,w)}{\partial w_j} \right) \frac{\partial \rho_i(z,u)}{\partial u} \tag{28}$$
$$\Delta\mu_u \mathrel{+}= -\varepsilon\, \sigma_u^2\, \mathbf{grad}_u \tag{29}$$
$$\Delta\frac{1}{\sigma_u^2} \mathrel{+}= \varepsilon\, (\mathbf{grad}_u)^2. \tag{30}$$

Here the learning rate $\varepsilon$ is equal to the inverse of the number of epochs $T$: $\varepsilon = 1/T$.
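
The update rules of Eqs. (21)-(30) can be organized into a single training step. The following is a structural sketch only: all model-specific callables (`sample_z`, `sample_w`, `raw_importance`, `grad_v`, `loss_grads_w`) are hypothetical placeholders, and `params` is assumed to store (mean, inverse-variance) arrays for the solution network $v$ and the attention network $u$.

```python
import numpy as np

def train_step(rng, data, test_sample, params, eps,
               sample_z, sample_w, raw_importance, grad_v, loss_grads_w):
    """One update following Eqs. (21)-(30), with hypothetical model callables."""
    K = len(data)
    i = rng.integers(K)                              # Eq. (21): pick a training sample
    z = sample_z(test_sample, params["v"], rng)      # Eq. (22): z ~ q(z|s,v)
    w, sigma_w2 = sample_w(z, params["v"], rng)      # Eq. (23): w ~ P0(w|mu_w(z,v), sigma_w(z,v))
    raw = raw_importance(data, z, params["u"])       # decoder scores rho(s_j|z,u)
    rho = K * raw / raw.sum()                        # Eq. (24)

    # Eq. (25): gradient w.r.t. v of the weighted sample loss minus the variational term
    g_v = grad_v(data[i], test_sample, z, w, rho[i], params)
    mu_v, inv_var_v = params["v"]
    mu_v -= eps * g_v / inv_var_v                    # Eq. (26), with sigma_v^2 = 1 / inv_var_v
    inv_var_v += eps * g_v ** 2                      # Eq. (27)

    # Eq. (28): attention update driven by the train/test loss-gradient correlation
    g_train, g_test, drho_du = loss_grads_w(data[i], test_sample, z, w, params)
    g_u = np.sum(sigma_w2 / K * g_train * g_test) * drho_du
    mu_u, inv_var_u = params["u"]
    mu_u -= eps * g_u / inv_var_u                    # Eq. (29)
    inv_var_u += eps * g_u ** 2                      # Eq. (30)
```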

The prediction for a sample with input $s$ is done by sampling $z \sim q(z|s,v)$, computing the means of the weights $w$, equal to $\mu_w(z,v)$, and finally computing the prediction probability $P(a|s,\mu_w)$.
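
A minimal sketch of this prediction procedure, with all callables again hypothetical placeholders; the returned distribution over the next element is what an entropy coder consumes during encoding and decoding.

```python
def predict(test_input, params, sample_z, mean_weights, element_prob, rng):
    """Sample z ~ q(z|s,v), take the weight means mu_w(z,v), and return the
    element distribution P(. | s, mu_w) for the entropy coder."""
    z = sample_z(test_input, params["v"], rng)          # z ~ q(z|s,v)
    mu_w = mean_weights(z, params["v"])                 # context-dependent weight means
    return lambda a: element_prob(a, test_input, mu_w)  # a -> P(a|s, mu_w)
```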

5 Conclusion

Compression is defined by computing the probability of a prediction sample, which is then encoded with the entropy coder. The contribution of this paper is to derive a compression algorithm by defining Bayesian Attention Networks for predicting the probability of a new sample, introducing an attention or importance factor per training sample that depends on the prediction sample. The attention factors make it possible to find a specific solution for each prediction sample that is influenced by the few training samples most correlated with the prediction sample.

Because the solution becomes specific to a given prediction sample, the training and prediction processes are coupled together. To make this approach practical we introduce a latent-space representation with encoder-decoder networks that allows training and prediction to be separated by mapping any prediction sample to the latent space and learning all solutions as a function of the latent space, along with learning attention as a function of the latent space and a training sample.

The latent space plays the role of a context representation: a prediction sample defines a context, and the learned context-dependent solution is used for the prediction.

References

  • Bayes & Price (1763) Thomas Bayes and Richard Price. An Essay towards Solving a Problem in the Doctrine of Chances. Philosophical Transactions of the Royal Society of London, 53:370–418, 1763.
  • Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929, 2020.
  • Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. The MIT Press, 2016. ISBN 0262035618.
  • Hutter Prize (2021) Hutter Prize. Wikipedia. https://en.wikipedia.org/wiki/Hutter_Prize, 2021. Online.
  • Jensen (1906) Johan L. Jensen. Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta Math., 30:175–193, 1906.
  • Kingma & Welling (2014) Diederik P Kingma and Max Welling. Auto-Encoding Variational Bayes. arXiv:1312.6114, 2014.
  • Kingma et al. (2019) Friso H. Kingma, Pieter Abbeel, and Jonathan Ho. Bit-Swap: Recursive Bits-Back Coding for Lossless Compression with Hierarchical Latent Variables. arXiv:1905.06845, 2019.
  • Lossless Compression (2020) Lossless Compression. Wikipedia. https://en.wikipedia.org/wiki/Lossless_compression, 2020. Online.
  • MacKay (2002) David J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, USA, 2002. ISBN 0521642981.
  • Mahoney (2013a) Matt Mahoney. Data Compression Programs. http://mattmahoney.net/dc/, 2013a. Online.
  • Mahoney (2013b) Matt Mahoney. Data Compression Explained. http://mattmahoney.net/dc/dce.html, 2013b. Online.
  • Pasco (1976) Richard Pasco. Source coding algorithms for fast data compression. PhD thesis, Stanford, Calif., 1976.
  • Rissanen (1976) Jorma J. Rissanen. Generalized Kraft inequality and arithmetic coding. IBM J. Res. Dev., pp. 198–203, May 1976.
  • Salomon (2006) David Salomon. Data Compression: The Complete Reference. Springer-Verlag, Berlin, Heidelberg, 2006. ISBN 1846286026.
  • Shannon (1948) Claude Elwood Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, 27:379–423,623–656, 1948.
  • Tetelman (2020) Michael Tetelman. On Compression Principle and Bayesian Optimization for Neural Networks. arXiv:2006.12714, 2020.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. arXiv:1706.03762, 2017.

Appendix A

To prove the sharpened Jensen’s inequality we note that for a normal distribution

$$\langle e^x \rangle = e^{\langle x \rangle + \frac{1}{2} \langle (x - \langle x \rangle)^2 \rangle}. \tag{31}$$

Since a Gaussian mixture is a universal approximator for any (non-pathological) distribution, see (Goodfellow et al., 2016), p. 65, this directly proves the sharpened Jensen's inequality in Eq. (16):

$$\int_x dx \sum_k A_k\, G(x|\mu_k,\sigma_k)\, e^x = \sum_k A_k\, e^{\mu_k + \frac{1}{2}\sigma_k^2} \;\geqslant\; e^{\mu + \frac{1}{2}\sigma^2}. \tag{32}$$

Here $\sum_k A_k = 1$, $\mu = \sum_k A_k \mu_k$, and $\sigma^2 = \sum_k A_k \sigma_k^2$.
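
A quick numerical check of the Gaussian identity (31) and the mixture bound (32) for an arbitrary mixture; this only illustrates the relations and is not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([0.2, 0.5, 0.3])            # mixture weights, sum to 1
mu_k = np.array([-1.0, 0.5, 2.0])
sigma_k = np.array([0.3, 1.0, 0.7])

# Left side of Eq. (32): Monte Carlo estimate of <e^x> under the mixture
comp = rng.choice(len(A), size=1_000_000, p=A)
x = rng.normal(mu_k[comp], sigma_k[comp])
lhs = np.exp(x).mean()

# Middle of Eq. (32): exact per-component means e^{mu_k + sigma_k^2 / 2}, from Eq. (31)
mid = np.sum(A * np.exp(mu_k + 0.5 * sigma_k**2))

# Right side of Eq. (32): e^{mu + sigma^2/2} with mu, sigma^2 as defined above
mu, sigma2 = np.sum(A * mu_k), np.sum(A * sigma_k**2)
rhs = np.exp(mu + 0.5 * sigma2)

print(lhs, mid, rhs)   # lhs is close to mid, and both exceed rhs
```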