
Update Compression for Deep Neural Networks on the Edge

Bo Chen1  Ali Bakhshi2  Gustavo Batista2  Brian Ng1  Tat-Jun Chin1
1The University of Adelaide  2University of New South Wales
Abstract

An increasing number of artificial intelligence (AI) applications involve the execution of deep neural networks (DNNs) on edge devices. Many practical reasons motivate the need to update the DNN model on the edge device post-deployment, such as refining the model, concept drift, or outright change in the learning task. In this paper, we consider the scenario where retraining can be done on the server side based on a copy of the DNN model, with only the necessary data transmitted to the edge to update the deployed model. However, due to bandwidth constraints, we want to minimise the transmission required to achieve the update. We develop a simple approach based on matrix factorisation to compress the model update—this differs from compressing the model itself. The key idea is to preserve existing knowledge in the current model and optimise only small additional parameters for the update which can be used to reconstitute the model on the edge. We compared our method to similar techniques used in federated learning; our method usually requires less than half of the update size of existing methods to achieve the same accuracy.

1 Introduction

The significant progress in AI has been due in large part to the resurgence of deep neural networks (DNNs), particularly convolutional neural networks (CNNs) and variants thereof. As the complexity and breadth of problems that are solvable in an end-to-end manner by DNNs increase, the DNN architectures (“models”) themselves are becoming deeper [61, 25, 19, 18] and wider [64, 74, 65], which demand greater computational resources to execute.

On the other hand, there is a trend to deploy DNNs in the field through edge computing devices, such as embedded GPUs and FPGAs. This is due to the growing popularity of concepts such as AI on the edge [44, 6, 41, 51], which emphasises processing closer to the source of the data to minimise latency and reduce data transfer, and embodied AI [15, 55, 5], which focusses on building intelligent agents that can explore and interact with the environment.

The gap in computational capability between centralised processing systems and edge computing devices has motivated research in making complex DNNs feasible on the latter platforms. Significant attention has been devoted to compressing DNNs to reduce their memory consumption and/or accelerate their execution. Major categories of such methods include pruning and sparsification [69, 76, 20, 21, 43], quantisation [50, 39, 4, 46, 56], knowledge distillation [68, 53, 73, 7, 22], low-rank optimisation [32, 14, 52, 8, 31, 34], and architecture search [72, 63, 24, 45, 75, 16, 66, 67, 57]. A common observation is that many complex DNNs are over-parametrised, hence allowing substantial compression [29, 60, 9, 54].

Figure 1: Illustrating update compression to alleviate bandwidth constraints for updating DNN models on the edge.

Another important requirement is updating the DNN on the edge after deployment. Numerous practical reasons, such as refining the model using new data, concept drift, or outright changes to the learning task, make updating DNN models unavoidable in machine learning applications. In this paper, we consider the specific scenario where

  • The (new) training data and supervision labels are available on the central processing system (the “server”);

  • Communication between server and edge can be unreliable or costly, which discourages large data transfers.

Many applications fall under this setting, e.g., AI-enabled sensor networks, field robotics, autonomous driving cars, intelligent drone fleets, etc. The scenario argues for retraining the DNN on the server based on a clone of the deployed DNN, then transmitting only the new parameter values to update the model on the edge, as illustrated in Fig. 1.

Update compression

To mitigate the communication bottleneck, it is vital to compress the model update such that the data can be transferred as bandwidth-efficiently as possible. Note that this aim differs from compressing the model itself; indeed, even if the deployed DNN is already a compressed model, it is still worthwhile to seek to transfer as little as possible to the edge to accomplish the update, particularly when recurrent updates are required.

A simple solution is to apply lossless compression techniques [58] to the model changes. However, this misses the potential gains from a more strategic model refinement that directly compresses the update during retraining.

Our contributions

Towards the above end, we inject the goal of update compression into the model refinement step itself. We propose a compact refinement technique based on re-parameterisation of DNN layers, which expresses the model changes in terms of a small set of new parameters and preserves a large part of the model knowledge in the form of frozen buffers. On the edge side, a reconstitution step integrates the learnt update package with the preserved knowledge and reconfigures the model. The proposed method is illustrated in Figs. 1 and 2.

We examined the trade-off between update size and impact on inference accuracy to illustrate the superiority of our method over existing update compression techniques. In our classification experiment, our method lifts the test accuracy of the initial model by $\approx 9\%$ with an update package equivalent to 0.5% of the model parameters, while competing methods require update packages 2 to 40 times larger to achieve the same lift. Moreover, our experiments show that our method is less reliant on parameter redundancy than existing techniques, making it more suitable for compact models deployed on resource-limited edge devices.

2 Related Work

In this section, we briefly survey the related literature. We focus on three main research directions: (i) model compression techniques that create small-scale versions of large DNNs; (ii) update compression methods that allow efficient transmission of model parameters in client-server architectures; and (iii) split computing approaches that divide a DNN model into two parts that execute on the client and the server cooperatively.

2.1 Model Compression

The execution of deep-learning models on edge devices has attracted much attention in the last few years [6]. Edge devices are often limited by memory, processing and power constraints that motivate the development of compressed DNN models.

Compressing a model can reduce its memory footprint and processing time, often with only a minor decrease in classification performance [10]. There are six main categories of model compression techniques: quantisation, network slimming, weight pruning, low-rank factorisation, knowledge distillation and neural architecture search. We list below a few prominent works from each category.

Quantisation reduces the number of bits used to represent the network weights [50, 39, 4, 46, 56]. Network slimming removes the least important channels from convolutional layers [69, 76, 21]. Weight pruning removes the least significant connections in the model [20]. Low-rank factorisation uses matrix/tensor decomposition to estimate the informative parameters of the DNN model [32, 14, 52, 8, 31, 34]. Knowledge distillation trains a more compact neural network to reproduce the output of a larger network [68, 53, 73, 7, 22]. Neural architecture search uses a search procedure to automatically design efficient models [72, 63, 24, 45, 75, 16, 66, 67, 57].

While the above-mentioned techniques can effectively compress the model size and, consequently, the update size, they are likely to be less efficient for update compression because the scenario is different: update compression does not require all model parameters to be changed, so the model can set aside useful information and modify only what is necessary.

2.2 Model Update Compression

Besides hardware constraints, edge devices often rely on network communication to exchange data, model parameters and predictions. However, network communication can be costly, unreliable and energy-hungry. Improving communication efficiency is a research topic that has been explored in various client-server DNN architectures such as Federated Learning (FL) [49] and, more generally, distributed SGD [71].

Our work assumes that training data is available at the server, which has the computational power to retrain the DNN model. In contrast, Federated Learning requires that training data, including class labels, be available at the client devices. These devices have the computational power to update the model locally. The server receives the model updates from multiple clients and averages these updates.

The main approaches for model update compression can be divided into three groups: low-rank update, random masking and gradient-based techniques.

The low-rank update was used as an update compression method in FL by [33]. It decomposes the gradient tensor into two matrices, $L$ and $R$, with rank $r$. $L$ is generated randomly, and $R$ is optimised during training. After training, $L$ is compressed as a random seed and transmitted with $R$ to the server.

Random masking defines a random sparsity pattern and restricts training to updating only the non-zero entries of the resulting sparse matrix [33]. As in the low-rank update method, a random seed defines the sparsity pattern, which is transmitted to the server together with the non-zero entries.

Gradient-based techniques aim to reduce the communication burden during training via gradient quantisation [59, 3], sparsification [33, 62], or both [12, 17]. However, compressing communication during training differs from compressing communication for a model update: training requires many iterations, whereas the latter only allows for a one-off data transmission.

2.3 Split Computation

Split Computing (SC) [48] is a framework that divides the DNN model into head and tail models, which are executed on the edge device and the server, respectively. SC is attractive when compressed models for edge devices cannot achieve the same level of accuracy as their full counterparts. SC has two main limitations: the inference time becomes the sum of the inference times on the client and the server plus the communication latency, and a “bottleneck” layer is needed among the first model layers. The bottleneck layer provides a cut point between the head and tail models with a compact representation that reduces the communication overhead.

The straightforward splitting of the DNN suggested by [36, 30, 28] can result in either transferring a large part of the processing burden to the edge device or transmitting a larger volume of data over the network. Distilling the head section of the DNN and introducing a bottleneck within the distilled head model, as suggested in [47], can mitigate this problem and considerably decrease the computational cost and the required bandwidth.

3 Problem setting

We consider the problem of performing bandwidth-efficient updates for a DNN on an edge device deployed in the field. As depicted in Fig. 1, the setting consists of a central server and an edge device, between which there is a communication bandwidth limit. We assume that the central server has no restriction on computational resources, while the edge device has limited resources and can only perform inference and light-duty computation.

Pre-deployment

For concreteness, we focus on classifier models $C(\cdot|\theta)$ with parameters $\theta$. Also, since our work concentrates on update compression and not model compression (see Fig. 1 on the distinction), we assume that $C(\cdot|\theta)$ is already a lightweight model suitable for edge devices.

Let $D_{1}=\{(\bm{X}_{i},Y_{i})\}_{i=1}^{N_{1}}$ denote the initial training dataset. At the pre-deployment stage, training $C$ with $D_{1}$ results in model parameters $\theta_{1}$

$\theta_{1}=\operatorname*{arg\,min}_{\theta}\ell(\theta,D_{1}),$ (1)

where $\ell$ is some loss function for training $C$, such as the cross-entropy loss. The classifier $C(\cdot|\theta_{1})$ is then loaded onto the edge device and deployed to the field.

Post-deployment

After deployment, the central server has accumulated a larger dataset $D_{2}=\{(\bm{X}_{i},Y_{i})\}_{i=1}^{N_{2}}$, where $N_{2}>N_{1}$ and $D_{2}\supset D_{1}$. The goal is to update $C$ on the edge device, currently with parameters $\theta_{1}$, with the new dataset $D_{2}$, subject to a data transmission bottleneck.

Let $\Delta$ denote the update package to be sent from the server to the edge device. A basic method is to retrain $C$ using $D_{2}$, and let $\Delta$ be the set of all model parameters, i.e.,

$\Delta=\operatorname*{arg\,min}_{\theta}\ell(\theta,D_{2}).$

At the edge, upon receiving $\Delta$, $\theta_{1}$ is discarded and we simply equate $\theta_{2}=\Delta$ and insert it into $C(\cdot|\theta_{2})$ to complete the model update. However, in this case $\Delta$ will always be the set of all parameters, even for a small incremental update. The key to generating more compact updates lies in reusing the learnt knowledge embedded in $\theta_{1}$, as we will show next.

4 Update compression

Instead of densely retraining the whole model, we would like a compact refinement algorithm that produces a compact update package $\Delta$ using the current model $C(\cdot|\theta_{1})$ and the new dataset $D_{2}$

$\Delta=\text{compact-refine}(C(\cdot|\theta_{1}),D_{2}),$ (2)

so that it can be efficiently transmitted to the edge device. At the edge side, upon receiving the compact update package $\Delta$, the current model $C(\cdot|\theta_{1})$ incorporates $\Delta$ and reconstitutes the updated model

$C(\cdot|\theta_{2})=\text{reconstitute}(C(\cdot|\theta_{1}),\Delta).$ (3)

The overall pipeline of such an update compression scheme is shown in Fig. 2.

Figure 2: A schematic pipeline of the proposed post-deployment update compression framework. To achieve compactness of the update package, we propose to re-parameterise the model into a learnable part and a frozen part. The learnable parameters facilitate the compact refinement and the frozen data enables the edge device to recycle knowledge from the existing model.

4.1 Compact refinement

To achieve compact refinement, we first re-parameterise the current network in a way such that

  1. the new learnable parameters are much fewer than the original ones; and

  2. the model can be effectively updated by refining the new parameters with the new dataset $D_{2}$.

To this end, we focus on the network’s convolutional (Conv) layers and fully connected (FC) layers, as mainstream DNNs largely consist of these two types of layers. Other types of layers, such as batch normalisation [26], are not targeted for update compression in our method.

Low Rank Approximation (LRA)

We first describe the compact refinement algorithm using a baseline re-parameterisation method, LRA, before we introduce the proposed techniques. Let $\phi_{l}\subset\theta$ denote the weight tensor of layer $l$, which is either a Conv or FC layer. Note that $\phi_{l}$ does not include the bias terms $\bm{b}_{l}$ of the layer. If the layer has bias terms, they are not targeted for update compression in our method and remain as learnable parameters. We first re-arrange $\phi_{l}$ into a matrix of size $o\times i$. For Conv layers, $o$ is the number of output channels and $i$ is the product of the number of input channels and the size of the kernel along each dimension, e.g., the kernel height and the kernel width for the 2D case. For FC layers, $o$ is simply the number of output features and $i$ the number of input features.

The matrix $\phi_{l}$ is then decomposed using SVD:

$\phi_{l}=U\cdot\text{diag}(\bm{s})\cdot V^{T},$ (4)

where $U\in\mathbb{R}^{o\times m}$, $V\in\mathbb{R}^{i\times m}$, $m=\min(o,i)$, $\bm{s}\in\mathbb{R}^{m\times 1}$ is the vector of singular values sorted in descending order, and $\text{diag}(\bm{s})$ is the $m\times m$ square matrix of zeros except for the diagonal entries, which are specified by $\bm{s}$.

For either Conv or FC layers, the arrangement of $\phi_{l}$ can be viewed as $o$ rows of filters of length $i$. Consequently, the decomposed $U$, $V$ and $\bm{s}$ can be viewed as a mapping matrix, a matrix of base filters, and a weight vector respectively, the latter indicating the importance of each base filter.
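To make Eq. (4) concrete, the following PyTorch sketch reshapes a Conv weight into the $o\times i$ matrix and decomposes it with SVD. The tensor shape and variable names are illustrative assumptions, not the authors' code.

```python
import torch

# Reshape a Conv2d weight tensor into an o x i matrix and decompose it as in Eq. (4).
# The shapes below are assumptions for illustration only.
weight = torch.randn(64, 32, 3, 3)          # (out_channels, in_channels, kH, kW)
o, i = weight.shape[0], weight[0].numel()   # o = 64, i = 32 * 3 * 3 = 288
phi = weight.reshape(o, i)

U, s, Vh = torch.linalg.svd(phi, full_matrices=False)  # U: o x m, s: m, Vh: m x i, m = min(o, i)
recon = U @ torch.diag(s) @ Vh                          # equals phi up to numerical error
print((recon - phi).abs().max())
```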

The decomposition of $\phi_{l}$ can result in even more parameters if $U$, $\bm{s}$ and $V$ are directly used as new parameters. To achieve compactness, we adopt LRA to encode the original parameters. Let $U_{1:r}$ denote the $o\times r$ matrix which consists of the first $r$ columns of $U$, and $\bm{s}_{1:r}$ denote the $r\times 1$ vector of the first $r$ entries in $\bm{s}$. We then define a new set of learnable parameters for the layer

$\phi_{l}^{\prime}=\{L,R\},$ (5)

where

$L=U_{1:r}\cdot\text{diag}(\bm{s}_{1:r}),\quad R=V_{1:r}^{T}.$ (6)

Note that this does not change the network’s architecture. After a Conv or FC layer is re-parameterised, for each forward propagation during the refinement process, the layer’s weight tensor $\phi_{l}$ is first computed from $\phi_{l}^{\prime}$ using a recover-weight() routine

$\phi_{l}\leftarrow\text{recover-weight}(\phi_{l}^{\prime})$ (7)
$\phantom{\phi_{l}}=L\cdot R.$ (8)

The layer then performs normal forward processing. For backward propagation in such layers, only gradients with respect to $\phi_{l}^{\prime}$ are computed instead of $\phi_{l}$, since $\phi_{l}$ is now an intermediate variable rather than a leaf variable in the computational graph. The LRA re-parameterisation method is conceptually illustrated in Fig. 3, row 1.
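A minimal PyTorch sketch of the LRA re-parameterisation and its recover-weight() routine (Eqs. (5)-(8)); the shapes and the rank $r$ are assumed for illustration.

```python
import torch

# A sketch of LRA: both L and R are learnable, and the full weight is an intermediate
# variable recovered as L @ R on every forward pass. Shapes and rank r are assumptions.
weight = torch.randn(64, 32, 3, 3)
o, i = weight.shape[0], weight[0].numel()
U, s, Vh = torch.linalg.svd(weight.reshape(o, i), full_matrices=False)

r = 8
L = (U[:, :r] * s[:r]).clone().requires_grad_(True)   # L = U_{1:r} diag(s_{1:r}), learnable
R = Vh[:r, :].clone().requires_grad_(True)            # R = V_{1:r}^T, learnable

# recover-weight(): phi_l = L R is not a leaf variable, so back-propagation
# only produces gradients for L and R.
phi_l = L @ R
conv_weight = phi_l.reshape(weight.shape)             # used in the layer's normal forward pass
```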

Figure 3: Illustration of different re-parameterisation methods.

By re-parameterising all Conv and FC layers, the network’s new learnable parameters become

$\theta^{\prime}=\left(\theta\setminus\{\phi_{l}\}_{l=1}^{\mathcal{L}}\right)\cup\{\phi_{l}^{\prime}\}_{l=1}^{\mathcal{L}},$ (9)

where $\mathcal{L}$ is the total number of Conv and FC layers in the network. Note that $\theta^{\prime}$ includes the network’s original parameters that are not targeted for compact refinement, e.g., bias terms and batch-norm parameters.

Finally, the update package $\Delta$ is obtained via refining $\theta^{\prime}$ from its current values with $D_{2}$

$\Delta=\operatorname*{arg\,min}_{\theta^{\prime}}\ell(\theta^{\prime},D_{2}).$ (10)

The LRA approach could use a small rank $r$ to achieve compactness

$|\Delta|=|\theta^{\prime}|\ll|\theta|.$ (11)

However, since it does not reuse any existing knowledge from $\theta_{1}$, it can result in severe underfitting if $r$ is too small. LRA can be viewed as a network compression method similar to previous works [27, 37, 11, 32, 42]. It is thus considered a baseline method in our experiments.

Mapping Learning (ML)

To further reduce the number of learnable parameters used in refinement while still performing an effective update, we propose to recycle part of the previously learnt knowledge embedded in $\theta_{1}$. We obtain $L$ and $R$ in the same way as Eq. (6) with rank $r$. The matrix $R$ consists of the $r$ most important rows of base filters learnt from $D_{1}$. It is thus possible to utilise these base filters and only learn a new mapping for the model to adapt to dataset $D_{2}$.

Therefore, the re-parameterisation of $\phi_{l}$ should produce a set of learnable parameters $\phi_{l}^{\prime}$ and a set of frozen data $\phi_{l}^{*}$ which stores the recycled knowledge:

$\phi_{l}^{\prime}=L,\quad\phi_{l}^{*}=R.$ (12)

The frozen data $\phi_{l}^{*}$ is stored in the network as a buffer, which participates in forward computation but does not require gradient computation in back-propagation. Consequently, the recover-weight() routine is given by

$\text{recover-weight}(\phi_{l}^{\prime},\phi_{l}^{*})=\phi_{l}^{\prime}\cdot\phi_{l}^{*}=L\cdot R.$ (13)
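One possible way to realise the ML re-parameterisation (Eqs. (12)-(13)) in PyTorch is sketched below: the mapping $L$ is the only learnable parameter, while the base filters $R$ are registered as a frozen buffer. The class and argument names are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLConv2d(nn.Module):
    """Conv layer whose weight is recovered as L @ R; R holds frozen recycled knowledge."""
    def __init__(self, weight, rank, bias=None, padding=1):
        super().__init__()
        o, i = weight.shape[0], weight[0].numel()
        U, s, Vh = torch.linalg.svd(weight.reshape(o, i), full_matrices=False)
        self.weight_shape = weight.shape
        self.padding = padding
        self.L = nn.Parameter(U[:, :rank] * s[:rank])       # learnable mapping, o x r
        self.register_buffer("R", Vh[:rank, :].clone())     # frozen base filters, r x i
        self.bias = nn.Parameter(bias.clone()) if bias is not None else None

    def forward(self, x):
        weight = (self.L @ self.R).reshape(self.weight_shape)  # recover-weight()
        return F.conv2d(x, weight, self.bias, padding=self.padding)
```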

Knowledge Augmentation (KA)

To maximally preserve previously learnt knowledge in order to minimise the update size, we propose to freeze both $U$ and $V$ in Eq. (4), and only refine $\bm{s}$ as free parameters. However, the base filters $V$ and their mapping $U$ learnt from $D_{1}$ are likely to generalise worse than they would have if they were learnt from $D_{2}$, because a larger training set is usually beneficial as it is less likely to cause overfitting.

To let the model learn to adapt to $D_{2}$ whilst preserving the current knowledge $U$ and $V$, we propose to augment $U$ and $V$ with additional columns for capturing the “knowledge drift” between $D_{1}$ and $D_{2}$. Let $U^{\prime}\in\mathbb{R}^{o\times n}$ denote the augmenting vectors and $[U,U^{\prime}]\in\mathbb{R}^{o\times(m+n)}$ denote the augmented mapping matrix. Similarly, let $V^{\prime}\in\mathbb{R}^{i\times n}$ be the augmenting vectors and $[V,V^{\prime}]\in\mathbb{R}^{i\times(m+n)}$ be the augmented filter matrix. Note that both $U$ and $V$ are frozen and only $U^{\prime}$ and $V^{\prime}$ are learnable parameters.

Let $\bm{s}^{\prime}\in\mathbb{R}^{(m+n)\times 1}$ denote the augmented weight vector, where the whole vector is treated as learnable parameters. The layer is thus re-parameterised as

$\phi_{l}^{\prime}=\{U^{\prime},\bm{s}^{\prime},V^{\prime}\},\quad\phi_{l}^{*}=\{U,V\},$ (14)

and the recover-weight() routine is

$\text{recover-weight}(\phi_{l}^{\prime},\phi_{l}^{*})=[U,U^{\prime}]\cdot\text{diag}(\bm{s}^{\prime})\cdot[V,V^{\prime}]^{T}.$ (15)

An illustration of this re-parameterisation method is given in Fig. 3, row 3. The motivation of this design is to maximally reuse previous knowledge by freezing $U$ and $V$, and at the same time allowing them to be updated by learning the augmenting vectors and incorporating them using the re-learned weights $\bm{s}^{\prime}$.

After re-parameterisation, we initialise $\phi_{l}^{\prime}$ such that the refinement process starts from the current accuracy level. The refinement process starts with a weight tensor

$\phi_{l}\leftarrow[U,U^{\prime}]\cdot\text{diag}(\bm{s}^{\prime})\cdot[V,V^{\prime}]^{T}$ (16)

which could be markedly different from the original $\phi_{l}$ if $U^{\prime}$, $\bm{s}^{\prime}$ and $V^{\prime}$ were initialised randomly. To avoid a significant drop in accuracy at the beginning of refinement, we initialise $\bm{s}_{1:m}^{\prime}=\bm{s}$ and set $U^{\prime}$, $V^{\prime}$ and $\bm{s}_{m+1:m+n}^{\prime}$ to small random values close to 0.
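The sketch below shows one way the KA re-parameterisation (Eqs. (14)-(16)) and the above initialisation could look in PyTorch, so that the recovered weight initially stays close to the deployed one. The class name, the scale eps and other details are assumptions.

```python
import torch
import torch.nn as nn

class KALayer(nn.Module):
    """Re-parameterises one o x i weight matrix with frozen U, V and learnable U', s', V'."""
    def __init__(self, phi, n, eps=1e-4):
        super().__init__()
        U, s, Vh = torch.linalg.svd(phi, full_matrices=False)
        o, m = U.shape
        i = Vh.shape[1]
        self.register_buffer("U", U.clone())                    # frozen knowledge, o x m
        self.register_buffer("V", Vh.t().clone())               # frozen knowledge, i x m
        self.U_aug = nn.Parameter(eps * torch.randn(o, n))      # learnable augmenting vectors U'
        self.V_aug = nn.Parameter(eps * torch.randn(i, n))      # learnable augmenting vectors V'
        self.s_aug = nn.Parameter(torch.cat([s, eps * torch.randn(n)]))  # s'_{1:m} = s

    def recover_weight(self):
        U_full = torch.cat([self.U, self.U_aug], dim=1)         # [U, U'], o x (m+n)
        V_full = torch.cat([self.V, self.V_aug], dim=1)         # [V, V'], i x (m+n)
        return U_full @ torch.diag(self.s_aug) @ V_full.t()     # Eq. (15)

# At initialisation the recovered weight is close to the original phi.
phi = torch.randn(64, 288)
layer = KALayer(phi, n=3)
print((layer.recover_weight() - phi).abs().max())               # tiny reconstruction error
```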

Algorithm 1 compact-refine$(C(\cdot|\theta),D)$
1:  $\theta^{\prime}\leftarrow\theta$
2:  for $l$ in $\{1,\ldots,\mathcal{L}\}$ do
3:     $\phi_{l}\leftarrow$ layer $l$ weight tensor in $\theta_{1}$
4:     $\phi_{l}^{\prime},\phi_{l}^{*}\leftarrow\text{re-parameterise}(\phi_{l})$
5:     $\theta^{\prime}\leftarrow(\theta^{\prime}\setminus\phi_{l})\cup\phi_{l}^{\prime}$
6:  end for
7:  $\Delta\leftarrow\operatorname*{arg\,min}_{\theta^{\prime}}\ell(\theta^{\prime},D)$
8:  return  $\Delta$

The overall procedure of compact refinement is summarised in Algorithm 1, with the re-parameterise() routine and the recover-weight() routine depending on the specific method employed.
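As a rough illustration of Algorithm 1, the sketch below assumes a hypothetical helper reparameterise_model() that swaps every Conv/FC layer for one of the re-parameterised wrappers above; only the remaining learnable parameters $\theta^{\prime}$ are optimised and returned as the update package.

```python
import torch

def compact_refine(model, loader, reparameterise_model, epochs=10, lr=1e-3):
    """Sketch of Algorithm 1; reparameterise_model is an assumed, user-supplied helper."""
    reparameterise_model(model)                                      # lines 1-6: build theta'
    learnable = [p for p in model.parameters() if p.requires_grad]   # frozen buffers are excluded
    optimiser = torch.optim.SGD(learnable, lr=lr, momentum=0.9)
    criterion = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):                                          # line 7: refine theta' on D
        for x, y in loader:
            optimiser.zero_grad()
            criterion(model(x), y).backward()
            optimiser.step()
    # line 8: the update package Delta = the refined learnable parameters only
    return {name: p.detach().cpu() for name, p in model.named_parameters() if p.requires_grad}
```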

4.2 Model reconstitution

Once the edge device receives $\Delta$, it incorporates $\Delta$ to update its current model. This process is described in Algorithm 2. For all non-Conv/FC layers, the edge device updates their parameters with the new values from $\Delta$. For Conv and FC layers, their weights are recovered using the new values of the learnable parameters $\phi_{l}^{\prime}$ from $\Delta$ and the values of the frozen data $\phi_{l}^{*}$ obtained by decomposing the current model weights $\theta_{1}$.

Algorithm 2 reconstitute$(C(\cdot|\theta_{1}),\Delta)$
1:  $\theta_{2}\leftarrow\{\text{all non Conv/FC layers in }\Delta\}$
2:  for $l$ in $\{1,\ldots,\mathcal{L}\}$ do
3:     $\phi_{l}\leftarrow$ layer $l$ weight tensor in $\theta_{1}$
4:     $\phi_{l}^{\prime},\phi_{l}^{*}\leftarrow\text{re-parameterise}(\phi_{l})$
5:     $\phi_{l}^{\Delta}\leftarrow$ layer $l$ free parameters in $\Delta$
6:     $\phi_{l}\leftarrow\text{recover-weight}(\phi_{l}^{\Delta},\phi_{l}^{*})$
7:     $\bm{b}_{l}\leftarrow$ layer $l$ bias in $\Delta$
8:     $\theta_{2}\leftarrow\theta_{2}\cup\{\phi_{l},\bm{b}_{l}\}$
9:  end for
10:  return  $C(\cdot|\theta_{2})$

An additional desirable feature of the proposed KA method is that, once the edge device reconstitutes the updated model, the augmenting vectors are integrated into the weight tensor by the recover-weight() routine and can thus be discarded. This prevents the edge device from accumulating more and more augmenting vectors over repeated updates, which removes the need to expand its memory capacity over time.
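To make the edge-side step concrete, here is a sketch of lines 3-6 of Algorithm 2 for a KA layer: the frozen $U$ and $V$ are re-derived from the deployed weight, combined with the transmitted learnable parts, and folded into a dense weight tensor so that the augmenting vectors can be discarded afterwards. The dictionary keys are illustrative.

```python
import torch

def reconstitute_ka_layer(old_weight, delta):
    """Recover a layer's dense weight from the deployed weight and a KA update package.

    delta is assumed to hold the transmitted tensors {"U_aug", "V_aug", "s_aug"}.
    """
    o = old_weight.shape[0]
    phi = old_weight.reshape(o, -1)
    U, s, Vh = torch.linalg.svd(phi, full_matrices=False)      # frozen knowledge from theta_1
    U_full = torch.cat([U, delta["U_aug"]], dim=1)              # [U, U']
    V_full = torch.cat([Vh.t(), delta["V_aug"]], dim=1)         # [V, V']
    new_phi = U_full @ torch.diag(delta["s_aug"]) @ V_full.t()  # recover-weight()
    return new_phi.reshape(old_weight.shape)                    # augmenting vectors now merged in
```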

5 Experiments

In this section we conduct various experiments to evaluate the performance of update compression with the proposed methods, as well as competing methods.

5.1 Competitor methods

We compared the proposed update compression algorithms to two techniques used in Federated Learning, Random Mask (RM) [33] and Low-Rank Update (LRU) [33], as well as a compression method Filter Pruning (FP) [40].

RM

For the RM method we generate a random binary mask over the entire set of model parameters. The proportion of parameters selected by the mask is controlled by a hyper-parameter $P$. We then refine the model on $D_{2}$ by updating only the selected parameters while keeping the unselected ones frozen. The update package size is counted as the number of selected parameters. We ignore the size of the mask as it can be represented by a random seed.
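A sketch of how RM could be implemented: a seeded random mask selects a proportion $P$ of parameters to refine, the rest stay frozen, and only the seed and the selected values need to be transmitted. Function names are illustrative.

```python
import torch

def make_random_masks(model, P, seed=0):
    """One boolean mask per parameter tensor; True entries (a fraction P) are refined."""
    gen = torch.Generator().manual_seed(seed)
    return {name: torch.rand(p.shape, generator=gen) < P
            for name, p in model.named_parameters()}

def mask_gradients(model, masks):
    """Call after loss.backward(): zero the gradients of unselected parameters."""
    for name, p in model.named_parameters():
        if p.grad is not None:
            p.grad.mul_(masks[name].to(p.grad.dtype))
```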

LRU

To implement LRU, we first decompose the weight tensor of each Conv and FC layer into two matrices $L$ and $R$ with rank $r$ in the same way as Eq. (6). We then replace the values of $R$ with random numbers and fix them. During refinement, $L$ is initialised with small values and $L\cdot R$ is added to the layer’s original weight tensor, i.e., $L\cdot R$ learns the changes to the current weight tensor, as described in [33], Sec. 2. The update size of LRU comprises all the $L$ matrices and the other free parameters (such as biases, Batch-Norm layers, etc.). Again, we ignore the size of the random matrices $R$ as they can be represented by random seeds.
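A sketch of the LRU baseline for a Conv layer, following the description above: $R$ is a fixed random matrix recoverable from a seed, $L$ starts near zero, and $L\cdot R$ learns the change that is added to the frozen deployed weight. Names, the scale and the padding are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LRUConv2d(nn.Module):
    """Conv layer whose weight is the frozen deployed weight plus a learnt low-rank change."""
    def __init__(self, weight, rank, seed=0, eps=1e-4, bias=None, padding=1):
        super().__init__()
        o, i = weight.shape[0], weight[0].numel()
        gen = torch.Generator().manual_seed(seed)
        self.register_buffer("W0", weight.clone())                      # frozen deployed weight
        self.register_buffer("R", torch.randn(rank, i, generator=gen))  # fixed random; only the seed is sent
        self.L = nn.Parameter(eps * torch.randn(o, rank))               # learnable, transmitted in Delta
        self.bias = nn.Parameter(bias.clone()) if bias is not None else None
        self.padding = padding

    def forward(self, x):
        weight = self.W0 + (self.L @ self.R).reshape(self.W0.shape)     # W0 + L R learns the change
        return F.conv2d(x, weight, self.bias, padding=self.padding)
```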

FP

The FP method prunes whole filters as well as their connected feature maps in the network. The size of the pruned model depends on a compression rate $c$ and the layers selected for the pruning plan. In this experiment we only consider Conv and batch normalisation layers in the pruning plan. We count all parameters remaining after pruning towards the update size.

5.2 Update compression for classification models

(a) CIFAR10 classification with ResNet18. Results of RM and LRU are averages of 6 runs and their 90% confidence intervals are plotted.
(b) MNIST classification with tiny-ResNet. Results of RM and LRU are averages of 6 runs and their 90% confidence intervals are plotted.
Figure 4: The update size versus test accuracy after refinement. Update size is measured as the ratio of the number of new parameters ($|\theta^{\prime}|$) to the number of original parameters ($|\theta|$).

We evaluate update compression methods on two classification tasks. The first uses the CIFAR10 [35, 1] dataset with a ResNet18 [18] model and the second classifies the MNIST [38, 2] dataset with a modified tiny-ResNet model. Since we are mainly concerned with power- and bandwidth-constrained applications, the low-resolution CIFAR10 and MNIST datasets are a good choice. The tiny-ResNet architecture is composed of a feature extractor with three layers, each consisting of a basic residual block, and a classifier that includes one fully connected layer. Similarly, the VGG-tiny architecture comprises a feature extractor that contains a sequence of convolutional layers with 16, 16, 32, and 64 output channels, each followed by a max-pooling layer, topped by one fully connected layer. For both tasks we use a proportion $p$ of the training set as $D_{1}$ and the whole training set as $D_{2}$. The model is initially trained with $D_{1}$ and then updated using $D_{2}$ with the various update methods. We then test the updated model on the test set and report its accuracy against the update size. The proportion for $D_{1}$ is $p=0.2$ for the CIFAR10 experiment and $p=0.02$ for MNIST, as MNIST is a much easier dataset and a higher $p$ would make the accuracy gap between the initial model and the refined model too small for the experiment. The initial test accuracy of the model trained with $D_{1}$ is 77.5% for ResNet18 on CIFAR10 and 95.5% for tiny-ResNet on MNIST. The results of this experiment are shown in Fig. 4. Note that the purpose of this experiment is not to push the test accuracy as high as possible, but to compare the updating efficiency of the different methods, i.e., achieving high accuracy with a small update size.

Fig. 4(a) shows that KA has the highest accuracy for update sizes below 2%, while ML performs best for update sizes between 2% and 5%. The curves of RM and LRU lie much further to the right than KA’s, which means that to achieve the same accuracy, RM and LRU require update packages 2 to 3 times larger than KA’s. The low-rank approximation nature of ML makes it deteriorate quickly when the update size shrinks below 1%. KA, on the other hand, is much more robust at small update sizes thanks to its augmentation mechanism. For the compression method FP, its accuracy at an update size of 20% is merely similar to that of KA at an update size of 1%. The results in Fig. 4(b) tell a similar story: KA has the highest accuracy for the smallest update, closely followed by ML. In contrast, RM and LRU need an update size at least twice as big as KA’s to reach the same accuracy.

Note that in both experiments the LRA baseline is capable of achieving the highest accuracy when given a large enough update size, but performs poorly at low update sizes. This indicates that knowledge recycling has its pros and cons: while it can be highly efficient in terms of update size, methods like LRA that do not preserve any old knowledge can achieve higher accuracy when given enough parameters.

Model Method Update size Update % Model size Acc. lift Refined acc. Init acc.
VGG-small KA 7413 5.44 136266 2.28 78.68 76.4
VGG-small RM 7413 5.44 136266 0.02 76.42 76.4
VGG-small LRU 7878 5.78 136266 1.45 77.85 76.4
VGG-medium KA 14672 2.72 539786 2.83 80.97 78.14
VGG-medium RM 14681 2.72 539786 0.07 78.21 78.14
VGG-medium LRU 15638 2.90 539786 1.44 79.58 78.14
MobileNetV2 KA 122632 5.34 2296922 3.67 87.92 84.25
MobileNetV2 RM 122660 5.34 2296922 3.18 87.43 84.25
MobileNetV2 LRU 128734 5.60 2296922 3.54 87.79 84.25
ResNet18 KA 127939 1.14 11173962 9.67 87.16 77.49
ResNet18 RM 128378 1.15 11173962 8.97 86.46 77.49
ResNet18 LRU 135670 1.21 11173962 9.02 86.51 77.49
ResNet50 KA 310656 1.32 23520842 6.57 84.34 77.77
ResNet50 RM 312837 1.33 23520842 4.61 82.38 77.77
ResNet50 LRU 327185 1.39 23520842 6.16 83.93 77.77
Table 1: Accuracy lift on different models. Update size and model size are both measured in number of parameters.

5.3 The effect of parameter redundancy

Besides knowledge recycling, we suspect that another factor contributing to the effectiveness of update compression is parameter redundancy, which allows the model to adapt to the new dataset by updating only a small fraction of its parameters. However, models deployed on edge devices are usually optimised, distilled, or compressed to perform the target task well enough while minimising resource consumption. Therefore, update compression methods should be robust to the lack of parameter redundancy.

Figure 5: The update size versus test accuracy after refinement on VGG-tiny.

We repeat the CIFAR10 experiment on a customised VGG-tiny [61] model, a much smaller model than ResNet18 with only 26554 parameters, to test the different updating methods on compact models that do not have much parameter redundancy. The result in Fig. 5 shows that the gap between KA and the two competing methods RM and LRU is clearly wider than in the ResNet18 experiment. This suggests that with a small model the lack of parameter redundancy restricts the performance of RM and LRU, while the proposed KA is less affected by the small model size, which is attractive for resource-limited edge devices.

To further test this hypothesis, we compare KA, RM and LRU on different models with parameter counts ranging from 136K to 23M. For each model, we set the hyper-parameter $n$ of KA to 3 and let the update sizes of RM and LRU be similar to, but no less than, KA’s. We use 20% of the CIFAR10 training set as $D_{1}$ and 100% as $D_{2}$. We report the accuracy lift as well as other experimental details in Table 1.

For all models in Table 1, regardless of their size, KA consistently outperforms RM and LRU with an equal or smaller update size. Furthermore, as the model becomes more compact, the accuracy lifts of all methods tend to shrink as well, supporting our hypothesis about parameter redundancy. This effect is especially obvious for RM, which can hardly lift accuracy for small models. In contrast, KA is less reliant on parameter redundancy and remains effective for models of all sizes.

5.4 Effectiveness of knowledge renewal

All three methods, KA, RM and LRU, reuse significant old knowledge to achieve update compression. RM keeps all unselected parameters and only learns the selected ones to update the model, while LRU renews the old model by learning to approximate the residuals of the old parameters. We evaluate the effectiveness of knowledge renewal among the three techniques in the setting where the model is compact and there is little parameter redundancy. We train a VGG-tiny model with different initial training set sizes, i.e., $|D_{1}|$ ranges from 0.1 to 0.7 of $|D_{2}|$. We then perform update compression on the model with the different techniques and report their accuracy lifts after refinement with $D_{2}$. For KA the hyper-parameter $n$ is set to 1. For RM and LRU the update sizes are set to be as close as possible to, but not less than, KA’s.

Figure 6: Initial training set size vs accuracy lift on VGG-tiny. The initial training set size is computed as $|D_{1}|/|D_{2}|$.

Fig. 6 shows that as the gap between the initial set $D_{1}$ and the target set $D_{2}$ widens, KA is more effective at learning new knowledge to update the model than RM and LRU, even when parameter redundancy is lacking. This implies that the KA update strategy would benefit applications where $D_{2}$ grows ever larger over time.

6 Limitations

The proposed update compression methods only target Conv and FC layers. They cannot handle models that largely consist of other types of layers, such as RNNs [13], LSTMs [23] and Transformers [70]. Due to limited resources and paper space, we have also limited our experiments to classification tasks and small-to-medium-sized models; the practical applicability of our methods outside these settings requires further testing.

7 Conclusion

In this paper we address the problem of server-to-edge device model update with a limited communication bandwidth. The proposed update compression framework achieves superior communication efficiency via novel compact refinement and reconstitution techniques. Experiments further show that our method is robust to the lack of parameter redundancy.

Acknowledgements

We gratefully acknowledge generous funding from SmartSat CRC, Research Program 2: Advanced Satellite Systems, Sensors and Intelligence.

Tat-Jun Chin is SmartSat CRC Professorial Chair of Sentient Satellites.

References

  • [1] Cifar10 dataset. https://www.cs.toronto.edu/~kriz/cifar.html.
  • [2] Mnist dataset. http://yann.lecun.com/exdb/mnist/. License: Creative Commons Attribution-Share Alike 3.0 license.
  • [3] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In NeurIPS, 2017.
  • [4] Milad Alizadeh, Arash Behboodi, Mart van Baalen, Christos Louizos, Tijmen Blankevoort, and Max Welling. Gradient $\ell_1$ regularization for quantization robustness. arXiv preprint arXiv:2002.07520, 2020.
  • [5] Devendra Singh Chaplot, Dhiraj Prakashchand Gandhi, Abhinav Gupta, and Russ R Salakhutdinov. Object goal navigation using goal-oriented semantic exploration. In NeurIPS, 2020.
  • [6] Jiasi Chen and Xukan Ran. Deep learning with edge computing: A review. Proceedings of the IEEE, 107(8):1655–1674, 2019.
  • [7] Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. Net2net: Accelerating learning via knowledge transfer. In ICLR, 2016.
  • [8] Ting Chen, Ji Lin, Tian Lin, Song Han, Chong Wang, and Denny Zhou. Adaptive mixture of low-rank factorizations for compact neural modeling. 2018.
  • [9] Yunpeng Chen, Haoqi Fan, Bing Xu, Zhicheng Yan, Yannis Kalantidis, Marcus Rohrbach, Shuicheng Yan, and Jiashi Feng. Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution. In ICCV, 2019.
  • [10] Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282, 2017.
  • [11] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NeurIPS, 2014.
  • [12] Nikoli Dryden, Tim Moon, Sam Ade Jacobs, and Brian Van Essen. Communication quantization for data-parallel training of deep neural networks. In 2nd Workshop on Machine Learning in HPC Environments (MLHPC), pages 1–8. IEEE, 2016.
  • [13] Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
  • [14] Timur Garipov, Dmitry Podoprikhin, Alexander Novikov, and Dmitry Vetrov. Ultimate tensorization: compressing convolutional and fc layers alike. arXiv preprint arXiv:1611.03214, 2016.
  • [15] Graham Gobieski, Brandon Lucia, and Nathan Beckmann. Intelligence beyond the edge: Inference on intermittent embedded systems. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 199–213, 2019.
  • [16] Ariel Gordon, Elad Eban, Ofir Nachum, Bo Chen, Hao Wu, Tien-Ju Yang, and Edward Choi. Morphnet: Fast & simple resource-constrained structure learning of deep networks. In CVPR, 2018.
  • [17] Corentin Hardy, Erwan Le Merrer, and Bruno Sericola. Distributed deep learning on edge-devices: feasibility via adaptive compression. In International Symposium on Network Computing and Applications (NCA), 2017.
  • [18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV. Springer, 2016.
  • [20] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. Amc: Automl for model compression and acceleration on mobile devices. In ECCV, 2018.
  • [21] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In ICCV, 2017.
  • [22] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • [23] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [24] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. In ICCV, 2019.
  • [25] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, 2017.
  • [26] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448–456. PMLR, 2015.
  • [27] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. In Proceedings of the British Machine Vision Conference, 2014.
  • [28] Hyuk-Jin Jeong, InChang Jeong, Hyeon-Jae Lee, and Soo-Mook Moon. Computation offloading for machine learning web apps in the edge server environment. In 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS), pages 1492–1499. IEEE, 2018.
  • [29] Kumara Kahatapitiya and Ranga Rodrigo. Exploiting the redundancy in convolutional filters for parameter reduction. In WACV, 2021.
  • [30] Yiping Kang, Johann Hauswald, Cao Gao, Austin Rovinski, Trevor Mudge, Jason Mars, and Lingjia Tang. Neurosurgeon: Collaborative intelligence between the cloud and mobile edge. ACM SIGARCH Computer Architecture News, 45(1):615–629, 2017.
  • [31] Hyeji Kim, Muhammad Umar Karim Khan, and Chong-Min Kyung. Efficient neural network compression. In CVPR, 2019.
  • [32] Yong-Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, and Dongjun Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. In ICLR, 2016.
  • [33] Jakub Konečnỳ, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
  • [34] Jean Kossaifi, Adrian Bulat, Georgios Tzimiropoulos, and Maja Pantic. T-net: Parametrizing fully convolutional nets with a single high-order tensor. In CVPR, 2019.
  • [35] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images, 2009.
  • [36] Nicholas D Lane, Sourav Bhattacharya, Petko Georgiev, Claudio Forlivesi, Lei Jiao, Lorena Qendro, and Fahim Kawsar. Deepx: A software accelerator for low-power deep learning inference on mobile devices. In 15th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN), pages 1–12. IEEE, 2016.
  • [37] V Lebedev, Y Ganin, M Rakhuba, I Oseledets, and V Lempitsky. Speeding-up convolutional neural networks using fine-tuned cp-decomposition. In ICLR, 2015.
  • [38] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [39] Hao Li, Soham De, Zheng Xu, Christoph Studer, Hanan Samet, and Tom Goldstein. Training quantized nets: A deeper understanding. In NeurIPS, 2017.
  • [40] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. In ICLR, 2017.
  • [41] He Li, Kaoru Ota, and Mianxiong Dong. Learning iot in edge: Deep learning for the internet of things with edge computing. IEEE network, 32(1):96–101, 2018.
  • [42] Shaohui Lin, Rongrong Ji, Chao Chen, Dacheng Tao, and Jiebo Luo. Holistic cnn compression via low-rank decomposition with knowledge transfer. TPAMI, 41(12):2889–2905, 2018.
  • [43] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. Sphereface: Deep hypersphere embedding for face recognition. In CVPR, 2017.
  • [44] Xiangzhong Luo, Di Liu, Hao Kong, and Weichen Liu. Edgenas: Discovering efficient neural architectures for edge systems. In International Conference on Computer Design (ICCD), pages 288–295. IEEE, 2020.
  • [45] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In ECCV, 2018.
  • [46] Brais Martinez, Jing Yang, Adrian Bulat, and Georgios Tzimiropoulos. Training binary neural networks with real-to-binary convolutions. In ICLR, 2020.
  • [47] Yoshitomo Matsubara, Sabur Baidya, Davide Callegaro, Marco Levorato, and Sameer Singh. Distilled split deep neural networks for edge-assisted real-time systems. In Proceedings of the 2019 Workshop on Hot Topics in Video Analytics and Intelligent Edges, pages 21–26, 2019.
  • [48] Yoshitomo Matsubara, Marco Levorato, and Francesco Restuccia. Split computing and early exiting for deep learning applications: Survey and research challenges. arXiv preprint arXiv:2103.04505, 2021.
  • [49] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In International Conference on Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017.
  • [50] Markus Nagel, Rana Ali Amjad, Mart Van Baalen, Christos Louizos, and Tijmen Blankevoort. Up or down? adaptive rounding for post-training quantization. In ICML. PMLR, 2020.
  • [51] Zhaolong Ning, Peiran Dong, Xiaojie Wang, Xiping Hu, Lei Guo, Bin Hu, Yi Guo, Tie Qiu, and Ricky YK Kwok. Mobile edge computing enabled 5g health monitoring for internet of medical things: A decentralized game theoretic approach. IEEE Journal on Selected Areas in Communications, 39(2):463–478, 2020.
  • [52] Bo Peng, Wenming Tan, Zheyang Li, Shun Zhang, Di Xie, and Shiliang Pu. Extreme network compression via filter group approximation. In ECCV, 2018.
  • [53] Antonio Polino, Razvan Pascanu, and Dan Alistarh. Model compression via distillation and quantization. In ICLR, 2018.
  • [54] Jiaxiong Qiu, Cai Chen, Shuaicheng Liu, Heng-Yu Zhang, and Bing Zeng. Slimconv: Reducing channel redundancy in convolutional neural networks by features recombining. IEEE Transactions on Image Processing, 30:6434–6445, 2021.
  • [55] Santhosh K Ramakrishnan, Dinesh Jayaraman, and Kristen Grauman. An exploration of embodied visual exploration. International Journal of Computer Vision, 129(5):1616–1649, 2021.
  • [56] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, 2016.
  • [57] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, 2018.
  • [58] Khalid Sayood. Lossless compression handbook. Elsevier, 2002.
  • [59] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns. In Fifteenth Annual Conference of the International Speech Communication Association. Citeseer, 2014.
  • [60] Wenling Shang, Kihyuk Sohn, Diogo Almeida, and Honglak Lee. Understanding and improving convolutional neural networks via concatenated rectified linear units. In ICML. PMLR, 2016.
  • [61] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [62] Nikko Strom. Scalable distributed dnn training using commodity gpu cloud computing. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.
  • [63] Ke Sun, Mingjie Li, Dong Liu, and Jingdong Wang. Igcv3: Interleaved low-rank group convolutions for efficient deep neural networks. arXiv preprint arXiv:1806.00178, 2018.
  • [64] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  • [65] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
  • [66] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. Mnasnet: Platform-aware neural architecture search for mobile. In CVPR, 2019.
  • [67] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML. PMLR, 2019.
  • [68] Shitao Tang, Litong Feng, Wenqi Shao, Zhanghui Kuang, Wei Zhang, and Yimin Chen. Learning efficient detector with semi-supervised adaptive distillation. arXiv preprint arXiv:1901.00366, 2019.
  • [69] Frederick Tung and Greg Mori. Clip-q: Deep network compression learning by in-parallel pruning-quantization. In CVPR, 2018.
  • [70] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, pages 5998–6008, 2017.
  • [71] Jianyu Wang and Gauri Joshi. Cooperative sgd: A unified framework for the design and analysis of communication-efficient sgd algorithms. arXiv preprint arXiv:1808.07576, 2018.
  • [72] Guotian Xie, Jingdong Wang, Ting Zhang, Jianhuang Lai, Richang Hong, and Guo-Jun Qi. Interleaved structured sparse convolutional neural networks. In CVPR, 2018.
  • [73] Zheng Xu, Yen-Chang Hsu, and Jiawei Huang. Training shallow and thin networks for acceleration via knowledge distillation with conditional adversarial networks. arXiv preprint arXiv:1709.00513, 2017.
  • [74] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
  • [75] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In CVPR, 2018.
  • [76] Zhuangwei Zhuang, Mingkui Tan, Bohan Zhuang, Jing Liu, Yong Guo, Qingyao Wu, Junzhou Huang, and Jinhui Zhu. Discrimination-aware channel pruning for deep neural networks. In NeurIPS, 2018.