
Feature Whitening via Gradient Transformation
for Improved Convergence

Shmulik Markovich-Golan
[email protected]
Barak Battash
[email protected]
Amit Bleiweiss
[email protected]
Abstract

Feature whitening [16] is a known technique for speeding up the training of DNNs. Under certain assumptions, whitening the activations [7] reduces the Fisher information matrix to the identity matrix, in which case stochastic gradient descent is equivalent to the faster natural gradient descent. Due to the additional complexity of transforming the layer inputs and their corresponding gradients in the forward and backward propagation, and of repeatedly computing the eigenvalue decomposition (EVD), this method is not commonly used to date.

In this work, we address the complexity drawbacks of feature whitening. Our contribution is twofold. First, we derive an equivalent method which replaces the per-sample transformations by a transformation of the weight gradients, applied once per batch of $B$ samples. The complexity of the transformations is thereby reduced to a fraction $S/(2B)$ of that of the direct method, where $S$ denotes the feature dimension of the layer output. As the batch size grows with distributed training, the benefit of the proposed method becomes more compelling. Second, motivated by the theoretical relation between the condition number of the sample covariance matrix and the convergence speed, we derive an alternative sub-optimal algorithm which recursively reduces the condition number of that matrix. Compared to the EVD, its complexity is reduced by a factor of the input feature dimension $M$. We exemplify the proposed algorithms with ResNet-based networks for image classification on the CIFAR and ImageNet datasets. Parallelizing the proposed algorithms is straightforward, and we implement a distributed version thereof. Improved convergence, in terms of speed and attained accuracy, is observed in our experiments.

1 Introduction

Feature whitening is a known method for speeding up training [15, 19, 18, 16]. It aims to decorrelate the inputs to the network layers, making it possible to match the learning rate in each eigenspace to its optimal value, which is inversely proportional to the corresponding eigenvalue. Due to the additional complexity of transforming the layer inputs and their gradients in the forward- and backward-propagation stages, and of repeatedly computing the EVD, it is rarely used. The celebrated batch norm (BN) method [12] implements a degenerate version of feature whitening, which standardizes the features without decorrelating them. Although sub-optimal, BN was found to speed up convergence [5] and has become common practice in designing deep neural networks thanks to its appealingly low complexity. In [7] it was shown that when the features are white, the Fisher information matrix (FIM) reduces to an identity matrix and simple stochastic gradient descent (SGD) optimization coincides with natural gradient descent (NGD) [2], which is better suited to difficult optimization tasks [17]. Some computations can be saved by assuming that the feature statistics vary slowly during training, which allows the principal component analysis (PCA)-based whitening transformation to be updated once per block of samples and amortizes the cost of the EVD computation. Combining BN with feature whitening has been found useful for limiting the variation of the feature statistics within blocks. Instability of feature whitening has been reported in [12] and was attributed to treating the transformation as constant and independent of the data during the back-propagation stage. In [11] this stability issue was investigated, and it was found that constructing the whitening matrix by zero-phase component analysis (ZCA) [4, 13] is more stable than by PCA, as it introduces a lower distortion of the original samples and avoids the stochastic axis-swapping problem. Furthermore, the derivatives with respect to the whitening transformation were computed and incorporated in the back-propagation stage, which further improved convergence speed at the expense of increased complexity.

In this work, we consider the feature whitening method [7] using ZCA instead of PCA and derive two reduced-complexity methods. The first method is equivalent to the direct method, and yet computations are saved by transforming the weight gradients once per batch instead of whitening the activations and their gradients for every sample. The second method replaces the EVD computation with an approximate recursive algorithm which conditions the activations' covariance matrix one subspace at a time. The structure of the paper is as follows: In Sec. 2 we present the feature whitening method of [7]. In Secs. 3 and 4 we derive the proposed reduced-complexity methods. The complexity of the proposed methods is analyzed in Sec. 5. We evaluate the proposed methods with various datasets and models in Sec. 6 and conclude the work in Sec. 7.

2 Direct feature whitening

We describe the feature whitening method as presented in [7], with the transformation constructed using ZCA as in [11], and refer to it as the direct feature whitening method. Consider the input of the $l$-th layer of a DNN, denoted by an $M_l \times N_l$ matrix $\mathbf{X}_l^p$, where $p$ denotes the input index, $N_l$ is the number of columns, corresponding to the spatial dimension, and $M_l$ is the number of rows, corresponding to the feature dimension. This formulation is generic, in that any spatial dimension can be vectorized (using the $\textit{vec}(\cdot)$ operator) into a single dimension. SGD is used to update the parameters, and it has been shown to be equivalent to the NGD optimizer when the features are white, under certain assumptions [7]. For brevity of notation, we omit the layer index $l$ hereafter, unless explicitly stated.

Denote the $n$-th spatial element of the input, for $n=0,\ldots,N-1$, as $\mathbf{x}_n^p$, and define the input matrix

\mathbf{X}^p \triangleq \left[\mathbf{x}_0^p, \mathbf{x}_1^p, \ldots, \mathbf{x}_{N-1}^p\right]. \quad (1)

During the training procedure, as the network adapts, the mean and covariance of $\mathbf{x}_n$ vary as well, and the whitening matrix has to be adapted correspondingly. A common method for continuously tracking the moments is to compute the empirical sample mean and sample covariance matrix over blocks of $L$ batches, each consisting of $B$ samples, and to average them recursively between blocks. The sample indices of the $i$-th block are $p\in\left[LBi, LB(i+1)-1\right]$. The varying mean and covariance are estimated at block index $i\geq 1$ by:

\mathbf{\mu}_x(i) \triangleq \alpha\,\mathbf{\mu}_x(i-1) + (1-\alpha)\frac{1}{LBN}\sum_{p=LBi}^{LB(i+1)-1}\sum_{n}\mathbf{x}_n^p \quad (2)
\mathbf{\Phi}_x(i) \triangleq \alpha\,\mathbf{\Phi}_x(i-1) + (1-\alpha)\frac{1}{LBN}\sum_{p=LBi}^{LB(i+1)-1}\sum_{n}\left(\mathbf{x}_n^p-\mathbf{\mu}_x(i)\right)\left(\mathbf{x}_n^p-\mathbf{\mu}_x(i)\right)^T \quad (3)

where $0\leq\alpha<1$ is a recursive-averaging factor between blocks. The mean and covariance are initialized in the first block ($i=0$) to:

\mathbf{\mu}_x(0) \triangleq \frac{1}{LBN}\sum_{p=0}^{LB-1}\sum_{n}\mathbf{x}_n^p \quad (4)
\mathbf{\Phi}_x(0) \triangleq \frac{1}{LBN}\sum_{p=0}^{LB-1}\sum_{n}\left(\mathbf{x}_n^p-\mathbf{\mu}_x(0)\right)\left(\mathbf{x}_n^p-\mathbf{\mu}_x(0)\right)^T. \quad (5)

Assuming that the first- and second-order moments vary slowly and are similar across space, we can reduce the estimation complexity by averaging over a subset of the samples and spatial elements. Storing the moments requires $M(M+1)$ additional memory elements.
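For concreteness, the following is a minimal NumPy sketch of the block-wise moment tracking in (2)-(5); the class name, the column-stacked layout of `x_block` and the default value of $\alpha$ are choices made for this illustration only.

```python
import numpy as np

class MomentTracker:
    """Recursive block-wise tracking of the feature mean and covariance, eqs. (2)-(5).

    `x_block` gathers the samples of one block, flattened over the spatial
    dimension, as an (M, L*B*N) array of feature columns.
    """

    def __init__(self, alpha=0.9):
        self.alpha = alpha
        self.mu = None     # running mean, shape (M,)
        self.phi = None    # running covariance, shape (M, M)

    def update(self, x_block):
        mu_blk = x_block.mean(axis=1)
        if self.mu is None:                                   # first block, eq. (4)
            self.mu = mu_blk
        else:                                                 # eq. (2)
            self.mu = self.alpha * self.mu + (1 - self.alpha) * mu_blk
        xc = x_block - self.mu[:, None]                       # center with the updated mean
        phi_blk = xc @ xc.T / x_block.shape[1]
        if self.phi is None:                                  # first block, eq. (5)
            self.phi = phi_blk
        else:                                                 # eq. (3)
            self.phi = self.alpha * self.phi + (1 - self.alpha) * phi_blk
        return self.mu, self.phi
```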

We turn to the construction of the whitening matrix at the $i$-th block. A whitening matrix $\mathbf{T}(i)$ satisfies the condition:

\mathbf{T}(i)\,\mathbf{\Phi}_x(i)\,\mathbf{T}^T(i) = \mathbf{I} \quad (6)

where $\mathbf{I}$ is the $M\times M$ identity matrix. The whitening matrix $\mathbf{T}(i)$ is not unique and can be constructed in several ways [8]. Some methods are based on the EVD of $\mathbf{\Phi}_x(i)$, defined as:

\mathbf{\Phi}_x(i) = \mathbf{V}_x(i)\,\mathbf{\Lambda}_x(i)\,\mathbf{V}_x^T(i) \quad (7)

where $\mathbf{V}_x(i)$ is an $M\times M$ eigenvector matrix,

\mathbf{\Lambda}_x(i) \triangleq \textrm{diag}\left(\left[\lambda_{x,0}(i),\ldots,\lambda_{x,M-1}(i)\right]\right) \quad (8)

is a diagonal $M\times M$ eigenvalue matrix, and $\textrm{diag}(\cdot)$ denotes a diagonal matrix with the argument vector placed on the diagonal. Without loss of generality we assume that the eigenvalues are sorted in decreasing order, i.e., $\lambda_{x,0}(i)\geq\lambda_{x,1}(i)\geq\cdots\geq\lambda_{x,M-1}(i)$. Given this decomposition, the ZCA-based whitening matrix is constructed as:

\mathbf{T}(i) \triangleq \mathbf{V}_x(i)\,\textrm{diag}^{1/2}\left(\mathbf{g}(i)\right)\mathbf{V}_x^T(i) \quad (9)

where $\textrm{diag}^{1/2}\left(\mathbf{g}(i)\right) \triangleq \textrm{diag}\left(\left[\sqrt{g_0(i)},\ldots,\sqrt{g_{M-1}(i)}\right]^T\right)$, the gain vector $\mathbf{g}(i)$ is defined as

\mathbf{g}(i) \triangleq \left[1/\max\left\{\lambda_{x,0}(i),\epsilon\right\},\ldots,1/\max\left\{\lambda_{x,M-1}(i),\epsilon\right\}\right]^T \quad (10)

and $\epsilon>0$ limits the minimal denominator for numerical stability.
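A compact NumPy sketch of the ZCA construction in (7)-(10) is given below; the function name and the explicit eigenvalue sorting are ours, and the EVD is obtained with `numpy.linalg.eigh` since $\mathbf{\Phi}_x$ is symmetric. With a small $\epsilon$, the returned matrix satisfies $\mathbf{T}\mathbf{\Phi}_x\mathbf{T}^T \approx \mathbf{I}$, as required by (6).

```python
import numpy as np

def zca_whitening_matrix(phi_x, eps=1e-5):
    """ZCA whitening matrix of eqs. (7)-(10): T = V diag(g)^{1/2} V^T with
    g_m = 1 / max(lambda_m, eps)."""
    lam, V = np.linalg.eigh(phi_x)     # EVD of the symmetric covariance, eq. (7)
    lam = lam[::-1]                    # sort eigenvalues in decreasing order
    V = V[:, ::-1]
    g = 1.0 / np.maximum(lam, eps)     # gains of eq. (10)
    return (V * np.sqrt(g)) @ V.T      # eq. (9); broadcasting applies diag^{1/2}(g)
```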

Denote the output of the direct feature whitening transformation in the forward-propagation stage of the $i$-th block as:

\mathbf{y}_n^p \triangleq \mathbf{T}(i-1)\left(\mathbf{x}_n^p - \mathbf{\mu}_x(i-1)\right) \quad (11)

Note that the outputs of the $i$-th block are constructed using the $(i-1)$-th transformation. The whitening transformation is initialized as:

\mathbf{T}(-1) \triangleq \mathbf{I} \quad (12)
\mathbf{\mu}_x(-1) \triangleq \mathbf{0} \quad (13)

In the back propagation stage, propagating the gradient through the whitening transformation from (11) requires computing:

\frac{\partial\mathcal{L}}{\partial\mathbf{x}_n^p} = \frac{\partial\mathbf{y}_n^p}{\partial\mathbf{x}_n^p}\frac{\partial\mathcal{L}}{\partial\mathbf{y}_n^p} = \mathbf{T}^T(i-1)\frac{\partial\mathcal{L}}{\partial\mathbf{y}_n^p} \quad (14)

where $\mathcal{L}$ denotes the loss function being optimized. Note that we adopt the denominator-layout convention for derivatives with respect to vectors or matrices.
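The forward and backward passes through the whitening transformation, (11) and (14), can be sketched as follows; the array shapes and function names are assumptions made only for the illustration.

```python
import numpy as np

def whiten_forward(X, T, mu):
    """Forward pass, eq. (11): y_n^p = T (x_n^p - mu).
    X is (M, N) and holds the spatial columns x_n^p of one sample."""
    return T @ (X - mu[:, None])

def whiten_backward(dL_dY, T):
    """Backward pass through the transform, eq. (14): dL/dx_n^p = T^T dL/dy_n^p."""
    return T.T @ dL_dY
```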

Denote the $S\times R$ output of the $l$-th layer as $\mathbf{Z}^p$, where $S$ and $R$ respectively denote its feature and spatial dimensions. As in [7], we limit the discussion to whitening the input of linear layers. In this case the $l$-th layer can be formulated as

\mathbf{z}_r^p \triangleq \sum_{n}\mathbf{W}_{rn}\,\mathbf{y}_n^p + \mathbf{b} \quad (15)

for $r=0,\ldots,R-1$, where the trained parameters of the layer are the $S\times M$ weight matrices $\left\{\mathbf{W}_{rn}\right\}_{rn}$, for $r\in[0,R-1]$ and $n\in[0,N-1]$, and an $S\times 1$ bias vector $\mathbf{b}$. Convolution and fully-connected layers are special cases of this generic formulation. Substituting the transformed input (11) into (15) yields:

\mathbf{z}_r^p = \sum_{n}\mathbf{W}_{rn}\,\mathbf{T}(i-1)\left(\mathbf{x}_n^p - \mathbf{\mu}_x(i-1)\right) + \mathbf{b}. \quad (16)

Considering (15), the gradients with respect to the weights and bias in the $j$-th batch are computed by:

\frac{\partial\mathcal{L}}{\partial\mathbf{W}_{rn}} = \sum_{p=jB}^{(j+1)B-1}\sum_{s}\frac{\partial z_{rs}^p}{\partial\mathbf{W}_{rn}}\frac{\partial\mathcal{L}}{\partial z_{rs}^p} = \sum_{p=jB}^{(j+1)B-1}\frac{\partial\mathcal{L}}{\partial\mathbf{z}_r^p}\left(\mathbf{y}_n^p\right)^T \quad (17)
\frac{\partial\mathcal{L}}{\partial\mathbf{b}} = \sum_{p=jB}^{(j+1)B-1}\sum_{r,s}\frac{\partial z_{rs}^p}{\partial\mathbf{b}}\frac{\partial\mathcal{L}}{\partial z_{rs}^p} = \sum_{p=jB}^{(j+1)B-1}\sum_{r}\frac{\partial\mathcal{L}}{\partial\mathbf{z}_r^p} \quad (18)

for $r=0,\ldots,R-1$ and $n=0,\ldots,N-1$. The parameters are updated at every batch according to the SGD rule:

\mathbf{W}_{rn} \coloneqq \mathbf{W}_{rn} - \eta\frac{\partial\mathcal{L}}{\partial\mathbf{W}_{rn}} \quad (19)
\mathbf{b} \coloneqq \mathbf{b} - \eta\frac{\partial\mathcal{L}}{\partial\mathbf{b}} \quad (20)

where $\coloneqq$ denotes the assignment operator and $\eta$ denotes the learning rate.

3 Gradient-based feature whitening

We propose the gradient-based feature whitening method in Sec. 3.1 and prove its equivalence to the direct feature whitening method from the previous section. In Sec. 3.2 we present some practical considerations and supplements to the basic method.

3.1 Method

The forward propagation stage of the method is given by:

\mathbf{z}_r^p = \sum_{n}\tilde{\mathbf{W}}_{rn}\left(\mathbf{x}_n^p - \mathbf{\mu}_x\right) + \tilde{\mathbf{b}} \quad (21)

where $\left\{\tilde{\mathbf{W}}_{rn}\right\}_{r,n}$, for $r\in[0,R-1]$ and $n\in[0,N-1]$, and $\tilde{\mathbf{b}}$ are the trained parameters. Unlike in [7], $\tilde{\mathbf{W}}_{rn}$ is not split into two parts but applied as a single transformation, conceptually combining the whitening and the trained transformations. The activations and their gradients in the forward and backward propagation are computed as usual and do not require an additional whitening transformation as in [7]. The whitening matrix is computed as in (9); however, it is used only in the back-propagation stage, to construct the weight-gradient transformation, such that the weights are updated exactly as in the direct whitening method.

The equivalence is proven by induction. Without loss of generality, let us assume that (21) is equivalent to the forward propagation of the direct whitening in (16), and that the corresponding whitening transformations are identical, i.e.:

\sum_{n}\tilde{\mathbf{W}}_{rn}\left(\mathbf{x}_n^p - \mathbf{\mu}_x\right) + \tilde{\mathbf{b}} = \sum_{n}\mathbf{W}_{rn}\,\mathbf{T}\left(\mathbf{x}_n^p - \mathbf{\mu}_x\right) + \mathbf{b} \quad (22)

for $r=0,\ldots,R-1$. It follows that:

\tilde{\mathbf{W}}_{rn} = \mathbf{W}_{rn}\,\mathbf{T} \quad (23)
\tilde{\mathbf{b}} = \mathbf{b}. \quad (24)

We now relate the parameter gradients of the direct whitening method, denoted $\frac{\partial\mathcal{L}}{\partial\mathbf{W}_{rn}}$ and $\frac{\partial\mathcal{L}}{\partial\mathbf{b}}$, to their gradient-based whitening counterparts, denoted $\frac{\partial\mathcal{L}}{\partial\tilde{\mathbf{W}}_{rn}}$ and $\frac{\partial\mathcal{L}}{\partial\tilde{\mathbf{b}}}$. Considering (23) and (24) and using matrix derivative rules, we obtain:

\frac{\partial\mathcal{L}}{\partial\mathbf{W}_{rn}} = \frac{\partial\mathcal{L}}{\partial\tilde{\mathbf{W}}_{rn}}\,\mathbf{T}^T \quad (25)
\frac{\partial\mathcal{L}}{\partial\mathbf{b}} = \frac{\partial\mathcal{L}}{\partial\tilde{\mathbf{b}}}. \quad (26)

Finally, following the SGD update rules in (19) and (20) and substituting (23) and (24), we derive the update rules of the gradient-based whitening as:

\tilde{\mathbf{W}}_{rn} \coloneqq \tilde{\mathbf{W}}_{rn} - \eta\frac{\partial\mathcal{L}}{\partial\bar{\mathbf{W}}_{rn}} \quad (27)
\tilde{\mathbf{b}} \coloneqq \tilde{\mathbf{b}} - \eta\frac{\partial\mathcal{L}}{\partial\tilde{\mathbf{b}}} \quad (28)
\frac{\partial\mathcal{L}}{\partial\bar{\mathbf{W}}_{rn}} \triangleq \frac{\partial\mathcal{L}}{\partial\tilde{\mathbf{W}}_{rn}}\,\mathbf{Q} \quad (29)
\mathbf{Q} \triangleq \mathbf{T}^T\mathbf{T}. \quad (30)

Note that these update rules ensure that the updated parameters of the gradient-based whitening method equal their direct whitening counterparts; the two evolving networks are therefore identical, and the methods are equivalent by construction.
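The core of the gradient-based method, (29)-(30), amounts to a single matrix product per weight slice before the SGD step; a minimal sketch (with function and variable names of our choosing) is:

```python
import numpy as np

def transform_weight_gradient(grad_W_tilde, T):
    """Gradient-based whitening, eqs. (29)-(30): right-multiply the S x M weight
    gradient of each (r, n) slice by Q = T^T T before the SGD step (27)."""
    Q = T.T @ T                 # eq. (30)
    return grad_W_tilde @ Q     # eq. (29)

# Illustrative use on a single weight slice (names are ours):
# W_tilde -= eta * transform_weight_gradient(grad_W_tilde, T)
```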

Special treatment is required at every $i$-th block, when the empirical mean is updated from $\mathbf{\mu}_x(i-1)$ to $\mathbf{\mu}_x(i)$. In this case, the bias parameter $\tilde{\mathbf{b}}$ is adjusted to compensate for the update and keep the overall network unchanged. Considering (21) and denoting by $\tilde{\mathbf{b}}|_{i^-}$ and $\tilde{\mathbf{b}}|_{i}$ the bias prior to and after the update, respectively, we would like to maintain the relation $\frac{1}{R}\sum_{r,n}\tilde{\mathbf{W}}_{rn}\left(\mathbf{x}_n^p - \mathbf{\mu}_x(i)\right) + \tilde{\mathbf{b}}|_{i} = \frac{1}{R}\sum_{r,n}\tilde{\mathbf{W}}_{rn}\left(\mathbf{x}_n^p - \mathbf{\mu}_x(i-1)\right) + \tilde{\mathbf{b}}|_{i^-}$. The bias compensation rule is therefore:

\tilde{\mathbf{b}}|_{i} \triangleq \tilde{\mathbf{b}}|_{i^-} + \frac{1}{R}\sum_{r,n}\tilde{\mathbf{W}}_{rn}\left(\mathbf{\mu}_x(i-1) - \mathbf{\mu}_x(i)\right). \quad (31)

This concept is similar to [7]. However, there is no need to compensate for changes in the whitening matrix $\mathbf{T}$, as they are only manifested in the gradients and not explicitly applied to the activations.
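A possible implementation of the bias compensation (31) is sketched below; the `(R, N, S, M)` stacking of the weight slices is an assumption made only to keep the example short.

```python
import numpy as np

def compensate_bias(b_tilde, W_tilde_stack, mu_old, mu_new, R):
    """Bias compensation of eq. (31) when the tracked mean moves from mu_old to
    mu_new. `W_tilde_stack` holds the weight slices as an (R, N, S, M) array."""
    W_sum = W_tilde_stack.sum(axis=(0, 1))              # sum over r and n -> (S, M)
    return b_tilde + (W_sum @ (mu_old - mu_new)) / R    # eq. (31)
```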

3.2 Practical considerations

We present some modifications to the basic method of the previous section which we found to improve performance and robustness in practice. In the general case, the $M$-dimensional feature space can be split into a signal subspace, which corresponds to the $M_s$ largest eigenvalues and holds most of the energy, and a noise subspace, which corresponds to the smallest $M-M_s$ eigenvalues. We conjecture that the noise subspace does not contain information that is instrumental to further reducing the loss function. Consequently, amplifying the noise components might have a destructive effect and hamper convergence. We propose to adopt the Akaike information criterion (AIC) [1] for estimating the rank of the signal subspace $M_s$:

\hat{M}_s \triangleq \exp\left(-\sum_{m}\bar{\lambda}_m\log\bar{\lambda}_m\right) \quad (32)
\bar{\lambda}_m \triangleq \lambda_m\Big/\sum_{m^{\prime}}\lambda_{m^{\prime}} \quad (33)

where the latter are the normalized eigenvalues for $m=0,\ldots,M-1$. Note that in the extreme case of identical eigenvalues we obtain $\hat{M}_s = M$. In the other extreme case, where the samples are completely correlated, i.e., $\lambda_0 > 0$ and $\lambda_m \rightarrow 0$ for $m=1,\ldots,M-1$, we obtain $\hat{M}_s \rightarrow 1$.
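An illustrative implementation of the rank estimate (32)-(33) is given below; the clipping constant, which guards the logarithm against vanishing eigenvalues, is our own addition.

```python
import numpy as np

def aic_signal_rank(lam, tol=1e-12):
    """Signal-subspace rank estimate of eqs. (32)-(33): the exponential of the
    entropy of the normalized eigenvalues."""
    lam_bar = lam / lam.sum()                  # eq. (33)
    lam_bar = np.clip(lam_bar, tol, None)      # guard log(0)
    return float(np.exp(-(lam_bar * np.log(lam_bar)).sum()))   # eq. (32)
```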

Instead of forcing the variance of the whitened signals to be $1$, we maintain the average variance of the input and limit the maximal gain to $g_{\textrm{max}}$. To summarize, the gain in (10) is replaced by:

g_m = \min\left\{1/(M\bar{\lambda}_m),\, g_{\textrm{max}}\right\} \quad (34)

for $m=0,\ldots,\hat{M}_s-1$, and $g_m = 1$ otherwise. Finally, we suggest using a recursively smoothed version of $\mathbf{Q}(i)$, denoted $\mathbf{Q}_s(i)$, initialized by $\mathbf{Q}_s(-1)\triangleq\mathbf{I}$ and updated every block according to:

\mathbf{Q}_s(i) = \beta\,\mathbf{Q}_s(i-1) + (1-\beta)\,\mathbf{Q}(i) \quad (35)

where $0\leq\beta<1$ is a recursive averaging factor.
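The gain limiting (34) and the inter-block smoothing (35) can be sketched as follows, assuming the eigenvalues are passed in decreasing order and $\hat{M}_s$ has been rounded to an integer; names and defaults are illustrative.

```python
import numpy as np

def limited_gains(lam, M_s_hat, g_max=10.0):
    """Gain rule of eq. (34): keep the average input variance, cap the gain at
    g_max, and leave the noise subspace (indices >= M_s_hat) untouched."""
    M = lam.shape[0]
    lam_bar = lam / lam.sum()                  # normalized eigenvalues, eq. (33)
    g = np.ones(M)
    g[:M_s_hat] = np.minimum(1.0 / (M * lam_bar[:M_s_hat]), g_max)
    return g

def smooth_Q(Q_s_prev, Q_new, beta=0.95):
    """Inter-block smoothing of the gradient transformation, eq. (35)."""
    return beta * Q_s_prev + (1 - beta) * Q_new
```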

4 Recursive feature whitening

Haykin [9] analyzed the stability and convergence of the gradient-descent algorithm for Gaussian signals and a linear system. Similarly to LeCun [16], he concluded that the convergence time is proportional to the condition number of the input covariance matrix, i.e., convergence is faster when the eigenvalue spread is reduced. In the limit, when the condition number equals $1$, the convergence time is minimal and the corresponding input covariance matrix is white.

Motivated by the high computational complexity of the EVD and by the relation between the convergence time and the condition number of the input covariance matrix, we propose a recursive approach for reducing the condition number. The idea is to monitor the covariance matrix of the transformed input, $\mathbf{\Phi}_y(i)$, computed similarly to (3), and at every block identify a high-power subspace. A simple transformation step that reduces its power is then constructed and appended to the previous whitening transformation, thereby reducing the condition number of $\mathbf{\Phi}_y(i+1)$ in the following block. The method for constructing the transformation is described in Sec. 4.1. For estimating the high-power subspace, we follow [3] and adopt its procedure for estimating the principal eigenvector, presented in Sec. 4.2.

4.1 Method

Let $\mathbf{v}_e(i)$ denote a unit vector spanning a high-power subspace of $\mathbf{\Phi}_y(i)$, i.e., one containing more power than the average eigenvalue, with $\|\mathbf{v}_e\|=1$. The power contained in this subspace is denoted $\lambda_e(i)$ and computed by:

\lambda_e(i) \triangleq \mathbf{v}_e^T(i)\,\mathbf{\Phi}_y(i)\,\mathbf{v}_e(i). \quad (36)

Define the recursive average power of the $m$-th input feature as:

\phi_{x,m}(i) \triangleq \alpha\,\phi_{x,m}(i-1) + (1-\alpha)\frac{1}{LBN}\sum_{p=LBi}^{LB(i+1)-1}\sum_{n}\left(x_{n,m}^p\right)^2 \quad (37)

and define the average over all input features as:

\bar{\lambda}_x(i) = \frac{1}{M}\sum_{m}\phi_{x,m}(i). \quad (38)

We define the transformation step which reduces the power in the subspace of $\mathbf{v}_e$ as a sum of two projection matrices:

\tilde{\mathbf{T}}(i) \triangleq \sqrt{g_e(i)}\,\mathbf{v}_e(i)\mathbf{v}_e^T(i) + \left(\mathbf{I} - \mathbf{v}_e(i)\mathbf{v}_e^T(i)\right) = \mathbf{I} + \left(\sqrt{g_e(i)}-1\right)\mathbf{v}_e(i)\mathbf{v}_e^T(i) = \mathbf{I} + a_e(i)\,\mathbf{v}_e(i)\mathbf{v}_e^T(i) \quad (39)
g_e(i) \triangleq \frac{\delta\,\bar{\lambda}_x(i)}{\max\left\{\lambda_e(i),\epsilon\right\}} \quad (40)
a_e(i) \triangleq \sqrt{g_e(i)} - 1 \quad (41)

with $\delta>0$ a parameter controlling the power reduction. Note that the transformation step reduces the power of the high-power subspace to $\mathbf{v}_e^T(i)\tilde{\mathbf{T}}^T(i)\mathbf{\Phi}_y(i)\tilde{\mathbf{T}}(i)\mathbf{v}_e(i) = \delta\bar{\lambda}_x(i)$, whereas all orthogonal subspaces remain unchanged.
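A minimal sketch of the transformation step (39)-(41) follows; the argument names mirror the symbols of the equations, and the defaults are the values used later in Sec. 6.

```python
import numpy as np

def transformation_step(v_e, lam_e, lam_bar_x, delta=0.25, eps=1e-5):
    """One recursive whitening step, eqs. (39)-(41): scale the subspace of the
    unit vector v_e so its power drops to delta * lam_bar_x, and leave the
    orthogonal complement unchanged."""
    g_e = delta * lam_bar_x / max(lam_e, eps)       # eq. (40)
    a_e = np.sqrt(g_e) - 1.0                        # eq. (41)
    M = v_e.shape[0]
    T_step = np.eye(M) + a_e * np.outer(v_e, v_e)   # eq. (39)
    return T_step, a_e
```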

Due to variations of the feature covariance matrix, resulting from variations of the network parameters, it is necessary to introduce a forgetting procedure which reverts the transformation in subspaces that no longer contain high power. The procedure is based on leaking a fraction of the input $\mathbf{x}_n$ into the whitened input $\mathbf{y}_n$. Given the whitening transformation of the previous block, $\mathbf{T}(i-1)$, and the recursive transformation step (39), the update rule for the whitening transformation is defined as:

\mathbf{T}(i) \triangleq \gamma\,\tilde{\mathbf{T}}(i)\,\mathbf{T}(i-1) + (1-\gamma)\,\mathbf{I} = \gamma\left(a_e(i)\,\mathbf{v}_e(i)\,\mathbf{t}_{ve}^T(i) + \mathbf{T}(i-1)\right) + (1-\gamma)\,\mathbf{I} \quad (42)

where $0<\gamma\leq 1$ is the leakage factor and

\mathbf{t}_{ve}(i) \triangleq \mathbf{T}(i-1)\,\mathbf{v}_e(i). \quad (43)

Considering the transformation step (39) and the update (42), the matrix $\mathbf{Q}(i)$ from (30) can be computed efficiently as:

\mathbf{Q}(i) = \gamma^2\left(\mathbf{Q}(i-1) + a_e(i)\left(a_e(i)+2\right)\mathbf{t}_{ve}(i)\,\mathbf{t}_{ve}^T(i)\right) + (1-\gamma)^2\,\mathbf{I}
\;+\; \gamma(1-\gamma)\left(a_e(i)\left(\mathbf{v}_e(i)\,\mathbf{t}_{ve}^T(i) + \mathbf{t}_{ve}(i)\,\mathbf{v}_e^T(i)\right) + \mathbf{T}(i-1) + \mathbf{T}^T(i-1)\right) \quad (44)

requiring $\mathcal{O}(M^2)$ multiply-and-accumulate (MAC) operations instead of $\mathcal{O}(M^3)$. As with (35) in the gradient-based feature whitening method, the weight gradients are transformed by an inter-block smoothed version of (44).
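The leaky update (42)-(43) together with the $\mathcal{O}(M^2)$ computation of $\mathbf{Q}(i)$ in (44) can be sketched as follows; for clarity the sketch favors readability over in-place updates, and the default $\gamma$ is the value of Sec. 6.

```python
import numpy as np

def recursive_whitening_update(T_prev, Q_prev, v_e, a_e, gamma=0.99):
    """Leaky update of the whitening transform, eqs. (42)-(43), and the O(M^2)
    update of Q = T^T T from eq. (44)."""
    M = T_prev.shape[0]
    t_ve = T_prev @ v_e                                                   # eq. (43)
    T_new = gamma * (a_e * np.outer(v_e, t_ve) + T_prev) + (1 - gamma) * np.eye(M)  # eq. (42)
    Q_new = (gamma ** 2 * (Q_prev + a_e * (a_e + 2) * np.outer(t_ve, t_ve))
             + (1 - gamma) ** 2 * np.eye(M)
             + gamma * (1 - gamma) * (a_e * (np.outer(v_e, t_ve) + np.outer(t_ve, v_e))
                                      + T_prev + T_prev.T))               # eq. (44)
    return T_new, Q_new
```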

4.2 High power subspace estimation

We adopt the procedure from [3] for estimating the principal eigenvector in the high signal-to-noise-ratio case. The block index $i$ is omitted for brevity.

Let $\mathbf{\xi}$ be an $M\times 1$ vector comprised of the norms of the columns of $\mathbf{\Phi}_y$, i.e.:

\xi_m \triangleq \left\|\mathbf{\Phi}_y\,\mathbf{e}_m\right\| \quad (45)

for $m=0,1,\ldots,M-1$, where $\mathbf{e}_m \triangleq \left[\mathbf{0}_{1\times m}, 1, \mathbf{0}_{1\times M-m-1}\right]^T$ is a selection vector used to pick the $m$-th column of $\mathbf{\Phi}_y$. Let

m^{\prime} \triangleq \textrm{argmax}_m\left(\left\{\xi_m\right\}_{m=0}^{M-1}\right) \quad (46)

be the index of the column with the highest norm. The following procedure aligns the other columns with the $m^{\prime}$-th column, and the high-power subspace is computed as the normalized average of the aligned columns. Define the inner products between the $m^{\prime}$-th column and every column of $\mathbf{\Phi}_y$ as

\mathbf{c}^T \triangleq \mathbf{e}_{m^{\prime}}^T\,\mathbf{\Phi}_y^T\,\mathbf{\Phi}_y. \quad (47)

Next, we define a set of column indices $\mathcal{M}$. An index $m$ is included in $\mathcal{M}$ if the correlation between its corresponding column and the $m^{\prime}$-th column is sufficiently large, i.e.:

|c_m| \geq \max\left\{c_{\textrm{rel}}\left\|\mathbf{\Phi}_y\,\mathbf{e}_{m^{\prime}}\right\|\cdot\left\|\mathbf{\Phi}_y\,\mathbf{e}_m\right\|,\, c_{\textrm{abs}}\right\} \quad (48)

where $c_{\textrm{rel}}$ and $c_{\textrm{abs}}$ are relative and absolute coherence thresholds. Finally, we align the columns of $\mathbf{\Phi}_y$ whose indices are in $\mathcal{M}$, average them and normalize, to obtain the high-power subspace $\mathbf{v}_e$:

\tilde{\mathbf{v}}_e \triangleq \frac{1}{|\mathcal{M}|}\sum_{m\in\mathcal{M}}\frac{1}{c_m}\mathbf{\Phi}_y\,\mathbf{e}_m \quad (49)
\mathbf{v}_e \triangleq \frac{\tilde{\mathbf{v}}_e}{\left\|\tilde{\mathbf{v}}_e\right\|}. \quad (50)
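The complete estimation procedure (45)-(50) is summarized in the following sketch; the default thresholds are the values reported in Sec. 6, and the function name is ours.

```python
import numpy as np

def estimate_high_power_subspace(phi_y, c_rel=0.025, c_abs=1e-6):
    """High-power subspace estimate of eqs. (45)-(50): pick the column of phi_y
    with the largest norm, align sufficiently coherent columns with it, then
    average and normalize."""
    col_norms = np.linalg.norm(phi_y, axis=0)           # xi_m, eq. (45)
    m_prime = int(np.argmax(col_norms))                 # eq. (46)
    c = phi_y.T @ phi_y[:, m_prime]                     # inner products, eq. (47)
    mask = np.abs(c) >= np.maximum(c_rel * col_norms[m_prime] * col_norms, c_abs)  # eq. (48)
    cols = phi_y[:, mask] / c[mask]                     # aligned columns
    v_tilde = cols.mean(axis=1)                         # eq. (49)
    return v_tilde / np.linalg.norm(v_tilde)            # eq. (50)
```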

5 Complexity analysis

The computational complexity of the direct feature whitening of Sec. 2 is comprised of: 1) computing the transformation at every block based on the EVD, requiring $M^3/(LB)$ MACs per sample; 2) transforming the input in the forward-propagation stage, requiring $NM^2$ MACs per sample; and 3) transforming the activation gradients in the backward-propagation stage, requiring $NM^2$ MACs per sample. In total, $M^3/(LB) + 2NM^2$ MACs per sample are required.

The gradient-based feature whitening of Sec. 3 replaces the per-sample transformation of the activations and their gradients by a per-batch transformation of the weight gradients, requiring $NSM^2/B$ MACs per sample. The computational complexity of the forward and backward transformations is therefore scaled by a factor of $S/(2B)$, which decreases as the batch size grows, as is the tendency when training is parallelized over multiple machines. In the case that $S/(2B) > 1$, one could apply the direct whitening method instead.
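For illustration, consider a layer with output feature dimension $S=512$, a value we pick only for this example: with a batch size of $B=256$ the transform cost is scaled by $S/(2B)=1$, i.e., the two approaches break even, whereas with $B=1024$ the factor drops to $S/(2B)=0.25$, so the gradient-based method performs only a quarter of the transform MACs of the direct method.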

In the recursive gradient-based whitening method of Sec. 4, the EVD is replaced by recursively computing a high-power subspace and updating the transformation. Moreover, by leveraging the structure of the recursive transformation, the gradient transformation $\mathbf{Q}(i)$ is computed with $\mathcal{O}(M^2)$ MACs according to (44) instead of the $\mathcal{O}(M^3)$ of a generic matrix multiplication. The computational complexity of constructing the whitening transformation is thus reduced by a factor of $M$.

Note that we neglect the complexity of tracking the moments, which is identical in all methods. This is reasonable given the low variability of the statistics across consecutive activations and across spatial elements.

6 Experimental study

We incorporate the proposed methods into the ResNet-110 and ResNet-50 models [10] and evaluate their convergence when training on the CIFAR [14] and ImageNet [6] datasets, with $100$ and $1000$ classes, respectively. For each model we compare three methods: the original model, denoted baseline; the EVD gradient-based feature whitening method, denoted EVD; and the recursive feature whitening method, denoted recursive. In models incorporating feature whitening, we place a whitening layer prior to the convolution layers of each basic block (i.e., in ResNet-110 and ResNet-50 we respectively add $108$ and $52$ whitening layers). Training on CIFAR consists of $200$ epochs; the learning rate is initialized to $0.1$ and reduced to $0.01$ and $0.001$ at epochs $100$ and $150$, respectively. The SGD rule is applied with a momentum of $0.9$ and weight decay of $5\cdot 10^{-4}$. Training on ImageNet consists of $90$ epochs; the learning rate is initialized to $0.1$ and reduced to $0.01$ and $0.001$ at epochs $30$ and $60$, respectively. The SGD rule is applied with a momentum of $0.9$ and weight decay of $10^{-4}$. The hyper-parameters of the EVD-based feature whitening are $\alpha=0.9$, $\beta=0.95$, $g_{\textrm{max}}=10$ and $\epsilon=10^{-5}$. The hyper-parameters of the recursive feature whitening are $\alpha=0.1$, $\beta=0.1$, $\gamma=0.99$, $\delta=0.25$, $c_{\textrm{rel}}=0.025$, $c_{\textrm{abs}}=10^{-6}$ and $\epsilon=10^{-5}$. To evaluate the effectiveness of feature whitening, we report the classification accuracy and examine the normalized rank and whiteness of the covariance matrix of the whitened features. The normalized rank is defined as $\kappa\triangleq\hat{M}_s/M$, where $\hat{M}_s$ is estimated using the AIC. The whiteness of the covariance matrix measures how close it is to being diagonal and is defined as

\rho \triangleq M\Big/\sum_{m,m^{\prime}}\frac{\left(\mathbf{e}_{m^{\prime}}^T\mathbf{\Phi}_y\,\mathbf{e}_m\right)^2}{\mathbf{e}_m^T\mathbf{\Phi}_y\,\mathbf{e}_m\cdot\mathbf{e}_{m^{\prime}}^T\mathbf{\Phi}_y\,\mathbf{e}_{m^{\prime}}} \quad (51)

where in the extreme case of a diagonal $\mathbf{\Phi}_y$ we get $\rho=1$, and in the other extreme case of completely correlated $\mathbf{y}_n$ we get $\rho=\frac{1}{M}$. We average $\kappa$ and $\rho$ over all layers. The testing accuracy, the average normalized rank and the average whiteness of the feature covariance for ResNet-110 and ResNet-50 are depicted in Figs. 1(a), 1(b), 1(c) and Figs. 2(a), 2(b), 2(c), respectively. We capture the variability of the normalized rank and whiteness across different whitening layers and also depict their standard-error boundaries. As expected, the normalized rank (Figs. 1(b), 2(b)) and the whiteness (Figs. 1(c), 2(c)) show that the whitening methods attain a consistently higher rank and higher whiteness than the baseline. Considering the accuracy figures, the whitening methods converge faster and are more stable and less noisy. The converged accuracy of the baseline, EVD and recursive whitening methods is $73.0\%$, $73.5\%$ and $73.0\%$, respectively, for ResNet-110, and $75.7\%$, $75.9\%$ and $76.0\%$ for ResNet-50. That is, the whitening methods are on par with or slightly better than the baseline.
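For reference, the two criteria can be computed from the whitened-feature covariance and its eigenvalues as sketched below; the function names and the numerical guard are ours.

```python
import numpy as np

def whiteness(phi_y):
    """Whiteness measure of eq. (51): rho = 1 for a diagonal covariance and
    rho -> 1/M for fully correlated features."""
    d = np.sqrt(np.diag(phi_y))
    r2 = (phi_y / np.outer(d, d)) ** 2          # squared correlation coefficients
    return phi_y.shape[0] / r2.sum()

def normalized_rank(lam):
    """Normalized rank kappa = M_s_hat / M, with M_s_hat estimated as in (32)-(33)."""
    lam_bar = lam / lam.sum()
    m_s_hat = np.exp(-(lam_bar * np.log(np.clip(lam_bar, 1e-12, None))).sum())
    return m_s_hat / lam.shape[0]
```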

Figure 1: Performance criteria vs. epoch number for ResNet-110 models over CIFAR: (a) testing accuracy, (b) normalized rank, (c) whiteness.
Figure 2: Performance criteria vs. epoch number for ResNet-50 models over ImageNet: (a) testing accuracy, (b) normalized rank, (c) whiteness.

7 Conclusion

Two novel methods for whitening the features, named the EVD and the recursive gradient-based feature whitening, have been proposed. The methods offer reduced complexity compared to the direct feature whitening method [7] by applying the transformation to the weight gradients instead of to the activations and their gradients. The recursive method further saves computations by replacing the EVD-based transformation with a recursive transformation that is updated in steps, treating one subspace per step and designed to gradually reduce the condition number of the feature covariance matrix. The proposed methods are applied to ResNet-110 and ResNet-50 and obtain state-of-the-art convergence in terms of speed, stability and accuracy.

References

  • [1] Hirotugu Akaike. A new look at the statistical model identification. IEEE transactions on automatic control, 19(6):716–723, 1974.
  • [2] Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.
  • [3] Anna Barnov, Vered Bar Bracha, and Shmulik Markovich-Golan. QRD based MVDR beamforming for fast tracking of speech and noise dynamics. In 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 369–373. IEEE, 2017.
  • [4] Anthony J Bell and Terrence J Sejnowski. The “independent components” of natural scenes are edge filters. Vision research, 37(23):3327–3338, 1997.
  • [5] Nils Bjorck, Carla P Gomes, Bart Selman, and Kilian Q Weinberger. Understanding batch normalization. In Advances in Neural Information Processing Systems, pages 7694–7705, 2018.
  • [6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
  • [7] Guillaume Desjardins, Karen Simonyan, Razvan Pascanu, et al. Natural neural networks. In Advances in Neural Information Processing Systems, pages 2071–2079, 2015.
  • [8] Gene H Golub and Charles F Van Loan. Matrix computations, volume 3. JHU press, 2012.
  • [9] Simon S Haykin. Adaptive filter theory. Pearson Education India, 2005.
  • [10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [11] Lei Huang, Dawei Yang, Bo Lang, and Jia Deng. Decorrelated batch normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 791–800, 2018.
  • [12] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
  • [13] Agnan Kessy, Alex Lewin, and Korbinian Strimmer. Optimal whitening and decorrelation. The American Statistician, 72(4):309–314, 2018.
  • [14] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • [15] Yann Le Cun, Ido Kanter, and Sara A Solla. Eigenvalues of covariance matrices: Application to neural-network learning. Physical Review Letters, 66(18):2396, 1991.
  • [16] Yann A LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop. In Neural networks: Tricks of the trade, pages 9–48. Springer, 2012.
  • [17] Daniel Povey, Xiaohui Zhang, and Sanjeev Khudanpur. Parallel training of deep neural networks with natural gradient and parameter averaging. arXiv preprint arXiv:1410.7455, 2014.
  • [18] Tapani Raiko, Harri Valpola, and Yann LeCun. Deep learning made easier by linear transformations in perceptrons. In Artificial intelligence and statistics, pages 924–932, 2012.
  • [19] Simon Wiesler and Hermann Ney. A convergence analysis of log-linear training. In Advances in Neural Information Processing Systems, pages 657–665, 2011.