Dynamic Neural Network Architectural and Topological Adaptation and Related Methods -
A Survey
Abstract
The steady increase in architectural complexity and data set size of deep neural networks (DNNs) has led to the development of strategies for reducing the time and space requirements of DNN training and inference. This is of particular importance in scenarios where training takes place in resource-constrained computation environments or inference is part of a time-critical application. In this survey, we aim to provide a general overview and categorization of state-of-the-art (SOTA) techniques for reducing DNN training and inference time and space complexities, with a particular focus on architectural adaptions.
1 Introduction
Recently published literature on strategies for minimizing DNN training and/or inference runtime and model size while maximizing accuracy can be broadly categorized into the following four categories based on their most important methodical concepts:
1. Quantization (bit-width reduction of layers, i.e. weights, gradients and activations, sec. 2.1) to computationally more efficient floating-, fixed- or block-floating-point (an arithmetic approximating floating-point on fixed-point hardware) [1] representations [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. Quantization can in principle be applied either globally or on a per-layer basis, and either pre-training (a priori choice of a certain representation), intra-training (dynamic choice of representation during DNN training) or post-training (after training in float32, quantization to a lower-precision representation is applied); a minimal sketch of the post-training variant follows this list.
2. Adapting network architecture or topology (order, type, size or connections of layers, sec. 2.2), either by pruning over-parameterized weights s.t. layers that do not contribute to the network's accuracy are reduced in size, bypassed or removed completely, by extending weights that are under-parameterized [13, 14, 15, 12, 16, 17, 18, 19], or by controlling network size via hyperparameters [20, 21, 22, 23, 24]. Similar to quantization, architectural adaption can be applied either globally or on a per-layer basis, pre-training (pruning an architecture before training or re-training it), intra-training (dynamically pruning or extending the DNN's weights or modifying its topology during training based on some criterion) or post-training (after training).
3. Designing new, inherently efficient architectures instead of adapting existing ones (sec. 2.3) [26, 27].
4. Distributing training or inference across multiple nodes, e.g. via federated learning or communication-efficient distributed SGD (sec. 2.4) [29, 30, 7, 8].
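To make category 1 more concrete, the following minimal sketch (our own illustration, not taken from any of the cited works) simulates symmetric uniform post-training quantization of a weight tensor to a configurable bit width on floating-point hardware:

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int = 8) -> np.ndarray:
    """Simulate symmetric uniform quantization of a weight tensor.

    The weights are mapped to integers in [-2^(bits-1)+1, 2^(bits-1)-1]
    and immediately de-quantized, so the returned tensor is float32 but
    only takes on 2^bits - 1 distinct values (fake quantization).
    """
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax          # one scale per tensor (per-layer)
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return (q * scale).astype(np.float32)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, size=(64, 128)).astype(np.float32)
w_q = quantize_symmetric(w, bits=8)
print("max abs quantization error:", np.abs(w - w_q).max())
```

Applied per layer after training, this corresponds to the simplest post-training variant; the intra-training variants discussed below recompute the scale (or even the bit width) while training proceeds.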
Strategies from categories 1 and 2 aim at optimizing DNN parameterization, i.e. reducing over-parameterization, either by keeping the number of parameters the same while reducing the precision of their numerical representation or by reducing the number of parameters while keeping their precision - both approaches lead to reduced model sizes and improved computational efficiency. Approaches falling into categories 3 and 4 do not directly consider the number of model parameters or their numerical representation as relevant variables, but instead focus on more efficient use of computational resources for a given number of parameters and numerical representation. Besides this strategy-based super-categorization, we further sub-categorize the literature in tab. 1 depending on the following parameters:
• Whether abstract guarantees are provided by the authors (proof of convergence, time and space complexity, test error)
• The data agnosticism of the algorithm (i.e. whether the content of the data is irrelevant to the algorithm's function)
• The task type the authors used in their paper for demonstrating their approach (classification, object detection, speech recognition, semantic segmentation)
• The scenario in which the approach claims an improvement over existing methods (training, inference, simulation - the last indicates the paper provides a simulation framework, e.g. for simulating fixed-point quantization on floating-point hardware)
• Whether the approach is dynamic or static, in the sense that it is computed during the target scenario or pre-computed
• Whether a publicly available code base exists that allows reproduction of the authors' results, implemented either by the original authors or shipped as part of a machine learning framework, e.g. TensorFlow, PyTorch, MXNet.
Empirically measured speedups or test errors were intentionally not included in tab. 1, since they might not always have been produced under comparable circumstances. If an approach matches the criteria for more than one category, it is listed in tab. 1 in all categories it fits into.
1.1 Research and Selection of Literature
The papers incorporated in our survey were researched through the search engines of the Proceedings of Machine Learning Research (PMLR), Neural Information Processing Systems (NeurIPS), Institute of Electrical and Electronics Engineers (IEEE) Xplore as well as Google Scholar. While we preferred works published in peer-reviewed journals or conference proceedings in the 2018-2021 timeframe, we included arXiv or earlier papers as well if we deemed them particularly novel or creative. We selected papers based on novelty, ingenuity, citation count and reproducibility of results, i.e. the existence of a public code repository on e.g. GitHub or similar platforms, or inclusion in a major machine learning framework such as TensorFlow, PyTorch or MXNet.
1.2 Related Surveys
The focus of other recent surveys in the field lies on parallel and distributed deep learning [28], whereby [31] specializes particularly in mobile applications. [32] treats theory and methods of quantization and discusses their merits and drawbacks, while the complementary work [33] discusses frameworks and methods for executing DNNs on Field Programmable Gate Arrays (FPGAs). DNN pruning techniques are surveyed and categorized by [34].
Recent developments in DNN architectures are described in [25], which provides an overview not only of the SOTA but also of the historical developments that led there. The comparably new field of Quantum DNNs (QDNNs) is covered by [35, 36].
Our survey differs from others by its wider scope, incorporating works from different strategic categories. We think that the most recent mixed-strategy approaches (sec. 2.5), as well as the fact that quantization (sec. 2.1), DNN pruning (sec. 2.2) and the search for more efficient DNN architectures (sec. 2.3) essentially treat the same issue - the systematic simplification of DNNs that are overly complex for the tasks they are intended to solve, leading to disproportionate computation, memory and storage requirements - require a broader perspective in order to provide a complete picture of the current SOTA. For brevity, our survey does not include recent advancements in QDNNs.
2 Overview
In this section, we provide a categorization in tab. 1 along the lines introduced in sec. 1, as well as brief descriptions of the mentioned approaches, with particular attention paid to architectural adaption techniques.
Paper | Cat. | Metric | Guar. | DA | Task | CA | Scenario | DS | Ref |
---|---|---|---|---|---|---|---|---|---|
[6] | 1 | RT, TE | N | Y | CL | 1 | TR | D | 2.1.2 |
[2, 5] | 1 | RT, TE | N | Y | CL | 1, 2 | INF | S | 2.1.1 |
[11] | 1 | TE | N | Y | CL | N | INF | S | 2.1.1 |
[3] | 1 | RT, TE | TC, SC | Y | CL, OB | 1 | INF | S | 2.1.1 |
[7, 8] | 1, 4 | RT, TE | C, SC | Y | CL, SR | 1 | TR | D | 2.5.3 |
[9, 10] | 1 | TE | N | Y | CL | 1, 2 | SIM | - | 2.1.3 |
[4] | 1 | RT, TE | N | Y | CL | 1 | INF | S | 2.1.1 |
[15] | 1, 2 | RT, MS, TE | N | Y | CL | 1 | INF | S | 2.5.2 |
[12] | 1, 2, 3 | RT, MS, TE | SC | Y | CL | 1 | TR, INF | S | 2.5.1 |
[13] | 2 | RT, MS, TE | N | Y | CL | 1 | INF | D | 2.2.3 |
[20, 21, 22] | 2 | RT, MS, TE | SC | Y | CL, OB, SS | 1, 2 | TR, INF | S | 2.2.1 |
[14] | 2 | RT, MS, TE | TC, SC | Y | CL | 1 | INF | S | 2.2.2 |
[16] | 2 | RT, MS, TE | N | Y | OB, SS | 1 | INF | S | 2.2.2 |
[18] | 2 | RT, MS, TE | N | Y | CL | N | TR, INF | D | 2.2.3 |
[17] | 2 | RT, MS, TE, EE | N | N | CL | N | INF | D | 2.2.3 |
[19] | 2 | RT, MS, TE, EE | N | Y | CL | N | TR, INF | D | 2.2.3 |
[26, 27] | 3 | RT, MS, TE | N | Y | CL, OB | 1, 2 | TR, INF | S | 2.3 |
[29, 30] | 4 | RT, TE | N | Y | CL, SR | N | TR, INF | S | 2.4 |
2.1 Quantization
2.1.1 Quantized Inference
Minimizing the inference accuracy degradation induced by quantizing weights and activations while leveraging the associated performance gains can be achieved by incorporating (simulated) quantization into model training so that the model itself learns to compensate the introduced errors (Quantization-Aware Training, QAT) [2, 3], by learning optimal quantization schemes through jointly training DNNs and their associated quantizers (Learned Quantization Nets, LQ-Nets) [4], or by using dedicated quantization-friendly operations [5]. A notably different approach is taken by Variational Network Quantization (VNQ) [11], which uses variational dropout training [37] with a structured sparsity-inducing prior [38] to formulate post-training quantization as a variational inference problem, searching for the posterior that optimizes the Kullback-Leibler divergence (KLD) [39].
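A common way to realize the simulated quantization used in QAT is a fake-quantization operator whose backward pass is a straight-through estimator. The PyTorch sketch below is a generic illustration of that idea, not the exact scheme of [2, 3, 4]:

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Quantize in the forward pass, pass gradients straight through."""

    @staticmethod
    def forward(ctx, x, bits: int = 8):
        qmax = 2 ** (bits - 1) - 1
        scale = x.abs().max().clamp(min=1e-8) / qmax
        return torch.clamp(torch.round(x / scale), -qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: ignore the rounding in the backward pass.
        return grad_output, None

class QuantLinear(torch.nn.Linear):
    """Linear layer that sees quantized weights during training."""

    def forward(self, x):
        return torch.nn.functional.linear(x, FakeQuant.apply(self.weight, 8), self.bias)

layer = QuantLinear(16, 4)
loss = layer(torch.randn(2, 16)).sum()
loss.backward()          # gradients flow to the full-precision shadow weights
print(layer.weight.grad.shape)
```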
2.1.2 Quantized Training
For speeding up training on a single node via block-floating-point quantization, [6] introduced a dynamic training quantization scheme (Multi-Precision Policy Enforced Training, MuPPET) that tracks mini-batch gradient diversity [40] across epochs and decides whether a precision switch is triggered for the next batch based on the violation of an empirically determined threshold. Due to a lack of dedicated fixed-point (i.e. FPGA) hardware, the speedups achieved by MuPPET were the product of simulations executed on floating-point hardware using NVIDIA CUTLASS [41].
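The quantity being tracked, gradient diversity as defined in [40], and a threshold-based switching decision can be sketched as follows; the decision rule shown here is a simplified stand-in for MuPPET's actual criterion, not a reproduction of it:

```python
import numpy as np

def gradient_diversity(grads):
    """Gradient diversity as defined in [40]: sum of squared per-mini-batch
    gradient norms divided by the squared norm of their sum."""
    sum_sq = sum(float(np.dot(g, g)) for g in grads)
    norm_of_sum = float(np.dot(sum(grads), sum(grads)))
    return sum_sq / max(norm_of_sum, 1e-12)

def should_switch_precision(diversity_history, threshold):
    """Simplified stand-in for MuPPET's decision rule: trigger a switch to the
    next-higher precision when the relative change of the gradient diversity
    across recent epochs falls below an empirically chosen threshold."""
    if len(diversity_history) < 2:
        return False
    prev, curr = diversity_history[-2], diversity_history[-1]
    return abs(curr - prev) / max(abs(prev), 1e-12) < threshold

rng = np.random.default_rng(1)
grads = [rng.normal(size=1000) for _ in range(8)]   # per-mini-batch gradients
history = [gradient_diversity(grads)] * 2
print(should_switch_precision(history, threshold=0.05))
```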
2.1.3 Simulation
For exploring the accuracy degradation induced by quantization of weights, activations and gradients, [9] and [10] introduced the frameworks TensorQuant and QPyTorch, which are capable of simulating the most common quantizations for training and inference tasks on a float32 basis. Both frameworks allow the user to freely choose exponent and mantissa for floating-point, word and fractional bit length for fixed-point, and word length for block-floating-point representations, as well as signed/unsigned representations. Since the quantization remains a simulation, however, no actual speedups can be achieved using these frameworks.
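The core operation behind such simulations can be illustrated in a few lines of NumPy; the sketch below rounds a float32 tensor onto a fixed-point grid with a given word and fractional bit length and is our own illustration, not the TensorQuant or QPyTorch API:

```python
import numpy as np

def simulate_fixed_point(x: np.ndarray, word_bits: int, frac_bits: int,
                         signed: bool = True) -> np.ndarray:
    """Round a float32 tensor onto a fixed-point grid (word_bits total,
    frac_bits fractional) while keeping float32 as the storage type."""
    step = 2.0 ** -frac_bits                      # resolution of the grid
    int_bits = word_bits - frac_bits - (1 if signed else 0)
    lo = -(2.0 ** int_bits) if signed else 0.0
    hi = 2.0 ** int_bits - step
    return np.clip(np.round(x / step) * step, lo, hi).astype(np.float32)

x = np.linspace(-2.0, 2.0, 9, dtype=np.float32)
print(simulate_fixed_point(x, word_bits=8, frac_bits=4))
```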
2.2 Architectural and Topological Adaption
2.2.1 Pre-Training Adaption
MobileNets
A representative of pre-training architectural adaption, MobileNets, first introduced by [21], are models based on depth-wise separable convolutions (highly efficient factorized convolutions, factorizing a standard convolution into a depthwise convolution and a pointwise convolution [42]). Their width and resolution can be controlled by multipliers (i.e. hyperparameters) controlling layer input and output channels as well as input sizes, thus allowing architectural adaption. While the original paper introduces only a sequential model and does not offer a systematic approach for finding the best architecture for a certain use case, later extensions integrate non-sequential residual blocks [20], increasing representational power, and employ Neural Architecture Search (NAS) algorithms [22]. Numerous extensions of the popular MobileNet concept by researchers other than the original authors exist [23, 24].
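The following PyTorch sketch shows the basic building block, a depthwise separable convolution whose channel counts are scaled by a width multiplier alpha; it is a minimal illustration of the block, not the full MobileNet architecture:

```python
import torch
from torch import nn

def depthwise_separable_block(c_in: int, c_out: int, alpha: float = 1.0,
                              stride: int = 1) -> nn.Sequential:
    """Depthwise 3x3 conv followed by a pointwise 1x1 conv; the width
    multiplier alpha thins the input and output channel counts."""
    c_in, c_out = max(1, int(c_in * alpha)), max(1, int(c_out * alpha))
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, kernel_size=3, stride=stride, padding=1,
                  groups=c_in, bias=False),              # depthwise
        nn.BatchNorm2d(c_in), nn.ReLU(inplace=True),
        nn.Conv2d(c_in, c_out, kernel_size=1, bias=False),  # pointwise
        nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
    )

block = depthwise_separable_block(32, 64, alpha=0.5)
print(block(torch.randn(1, 16, 56, 56)).shape)   # alpha=0.5: 32 -> 16 in-channels
```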
2.2.2 Post-Training Adaption
WoodFisher
Based on the classic Optimal Brain Damage/Surgeon (OBD/S) [43, 44] framework, WoodFisher [14] uses second-order information in the form of efficiently approximated (inverse) Hessians to determine the change in loss induced by the removal of one or more parameters and prunes the network architecture accordingly. The approach is notable because it not only requires no retraining after pruning, but the authors also provide guarantees for time and space complexity.
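As a rough illustration of the underlying OBD/OBS idea (not WoodFisher's actual block-wise inverse-Fisher approximation), the sketch below scores each weight with the classic saliency w_i^2 / (2 [H^-1]_ii) under a diagonal Hessian approximation and prunes the lowest-scoring fraction:

```python
import numpy as np

def obs_saliency(weights: np.ndarray, hessian_diag: np.ndarray) -> np.ndarray:
    """Classic OBD/OBS saliency: estimated loss increase when a single
    weight is removed, w_i^2 / (2 * [H^-1]_ii), here with a diagonal
    Hessian approximation so [H^-1]_ii ~= 1 / H_ii."""
    inv_h_diag = 1.0 / np.maximum(hessian_diag, 1e-8)
    return weights ** 2 / (2.0 * inv_h_diag)

def prune_by_saliency(weights: np.ndarray, hessian_diag: np.ndarray,
                      sparsity: float) -> np.ndarray:
    """Zero out the fraction `sparsity` of weights with the lowest saliency."""
    s = obs_saliency(weights, hessian_diag)
    threshold = np.quantile(s, sparsity)
    return np.where(s <= threshold, 0.0, weights)

rng = np.random.default_rng(0)
w = rng.normal(size=1000)
h = rng.uniform(0.1, 2.0, size=1000)       # stand-in for diag(H)
w_pruned = prune_by_saliency(w, h, sparsity=0.9)
print("non-zero weights left:", np.count_nonzero(w_pruned))
```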
Fast Neural Network Adaption
Solving a problem different from reducing training or inference times but nonetheless interesting is Fast Neural Network Adaption (FNA) [16]: FNA uses depth-, width- and kernel-level parameter remapping to map parameters from a pre-trained seed network to an arbitrary target network, thus allowing NAS to search for architectures optimized for tasks (e.g. object detection, semantic segmentation) other than the one the seed network was trained for (e.g. classification) without requiring retraining.
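A width-level remapping step of this flavor can be sketched as copying the seed network's weights into a (wider or narrower) target tensor along the channel dimensions; the snippet below is a simplified illustration, not FNA's full depth/width/kernel-level scheme:

```python
import torch

def remap_width(seed_w: torch.Tensor, target_shape: tuple) -> torch.Tensor:
    """Map a seed conv weight (out, in, kh, kw) onto a target shape by
    copying existing channels and cycling through them when the target
    is wider than the seed (a simplified width-level remapping)."""
    out_t, in_t = target_shape[0], target_shape[1]
    out_idx = torch.arange(out_t) % seed_w.shape[0]   # reuse seed out-channels
    in_idx = torch.arange(in_t) % seed_w.shape[1]     # reuse seed in-channels
    return seed_w[out_idx][:, in_idx]

seed = torch.randn(32, 16, 3, 3)                      # pre-trained seed weights
target = remap_width(seed, (48, 24, 3, 3))            # wider target layer
print(target.shape)                                   # torch.Size([48, 24, 3, 3])
```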
2.2.3 Dynamic Adaption
MorphNet
[13] is an efficient resource-constrained structure learning algorithm based on the combination of width multipliers (as first introduced by MobileNet, see sec. 2.2.1) and sparsifying regularizers (e.g. L1 [45, 46, 47]) to obtain particularly pruning-friendly weight tensors. MorphNet iteratively trains, shrinks and expands a given DNN, finding each layer's width multiplier s.t. a certain resource constraint (e.g. FLOPs or model size) is satisfied. While the authors are unable to prove convergence of their algorithm, they state that in their empirical evaluation, significant reductions in resource requirements were observed after 1 to 3 iterations while maintaining iso-accuracy.
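The shrinking half of such an iteration can be sketched as an L1 penalty on the per-channel batch-norm scales followed by reading off each layer's new width; the code below is a simplified illustration of that idea, not the exact MorphNet regularizer (which additionally weights each channel by its FLOP or size cost):

```python
import torch
from torch import nn

def sparsity_penalty(model: nn.Module, strength: float = 1e-4) -> torch.Tensor:
    """L1 penalty on BatchNorm scale factors; added to the task loss it
    drives unimportant channels' gammas towards zero."""
    penalty = sum(bn.weight.abs().sum()
                  for bn in model.modules() if isinstance(bn, nn.BatchNorm2d))
    return strength * penalty

def read_new_widths(model: nn.Module, threshold: float = 1e-2) -> list:
    """After training with the penalty, a layer's new width is the number
    of channels whose gamma survived above the threshold."""
    return [int((bn.weight.abs() > threshold).sum())
            for bn in model.modules() if isinstance(bn, nn.BatchNorm2d)]

model = nn.Sequential(nn.Conv2d(3, 32, 3), nn.BatchNorm2d(32), nn.ReLU(),
                      nn.Conv2d(32, 64, 3), nn.BatchNorm2d(64), nn.ReLU())
loss = model(torch.randn(1, 3, 32, 32)).mean() + sparsity_penalty(model)
loss.backward()
print(read_new_widths(model))
```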
Dropback
introduced by [18] is a DNN training algorithm which trains DNNs under a pruned weight budget. During training, Dropback tracks only the weights with the highest accumulated gradients, up to a fixed weight budget, while untracked weights retain their initial values, thus reducing training time and memory accesses significantly. The actual model itself, however, is pruned neither in width nor in depth, and no advantage for inference is obtained.
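The core bookkeeping can be sketched as follows (a simplified, dense-tensor illustration of the idea, not the authors' implementation): only the weights with the largest accumulated gradient magnitude are trained, all others are reset to their initialization after every step.

```python
import torch

def dropback_step(w, w_init, grad, accum_grad, budget: int, lr: float = 0.1):
    """One simplified Dropback update: apply SGD everywhere, accumulate
    gradient magnitudes, then revert every weight outside the top-`budget`
    accumulated gradients back to its initial value."""
    accum_grad += grad.abs()
    w = w - lr * grad
    threshold = accum_grad.flatten().topk(budget).values.min()  # k-th largest
    tracked = accum_grad >= threshold
    w = torch.where(tracked, w, w_init)                 # untracked -> init value
    return w, accum_grad

w_init = torch.randn(1000)
w, accum = w_init.clone(), torch.zeros(1000)
for _ in range(5):
    grad = torch.randn(1000)                            # stand-in for real gradients
    w, accum = dropback_step(w, w_init, grad, accum, budget=100)
print("weights differing from init:", int((w != w_init).sum()))
```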
Channel Gating
(CGNets) [17] is a dynamic inference topology pruning scheme that learns during training which regions in the input features contribute least to the classification result and skips computations on a subset of these ineffective regions' input channels. This is achieved by introducing learnable gate functions which, for each channel of a specific layer, learn whether the channel's output would be zeroed out by a ReLU or saturated by a sigmoid or tanh, allowing that channel to be bypassed in that layer.
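A generic per-channel gating layer of this flavor can be sketched as below; this is a simplified illustration of dynamic channel skipping (with a pooling-based gate and a hard threshold), not CGNet's exact partial-sum-based gating mechanism, and training such gates typically requires a soft, differentiable relaxation:

```python
import torch
from torch import nn

class ChannelGate(nn.Module):
    """Learn a per-channel, input-dependent gate from globally pooled
    features; channels whose gate falls below a threshold are zeroed,
    which on suitable hardware allows skipping their computation."""

    def __init__(self, channels: int, threshold: float = 0.5):
        super().__init__()
        self.fc = nn.Linear(channels, channels)
        self.threshold = threshold

    def forward(self, x):                      # x: (batch, channels, H, W)
        pooled = x.mean(dim=(2, 3))            # global average pooling
        gate = torch.sigmoid(self.fc(pooled))  # per-channel gate in (0, 1)
        mask = (gate > self.threshold).float() # hard decision at inference
        return x * mask[:, :, None, None]

gate = ChannelGate(64)
y = gate(torch.randn(2, 64, 14, 14))
print("active channels per sample:", (y.abs().sum(dim=(2, 3)) > 0).sum(dim=1))
```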
Procrustes
[19] is an energy-efficient sparse DNN training accelerator aimed at producing pruned models with iso-accuracy. Procrustes is based on the above-mentioned Dropback (sec. 2.2.3) and extends the concept by inducing sparsity through decaying the untracked weights' initial values over the first iterations s.t. these weights can actually be pruned once they decay to zero, remedying one of Dropback's major drawbacks. Additionally, Procrustes avoids the need to sort all accumulated gradients by using dynamic quantile estimation to continuously track the target sparsity.
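Continuing the simplified Dropback sketch above, the stated changes could look like the following; this is an illustration of the idea only (an exact quantile is used here for simplicity in place of Procrustes' dynamic quantile estimation, and the accelerator-specific dataflow is omitted entirely):

```python
import torch

def procrustes_like_step(w, w_init, grad, accum_grad, step: int,
                         target_sparsity: float = 0.9, lr: float = 0.1,
                         decay_steps: int = 1000):
    """Variation of the Dropback step above: untracked weights decay
    towards zero (and can then be pruned) instead of keeping their
    initial values; the tracking threshold comes from a quantile of the
    accumulated gradients rather than a full sort."""
    accum_grad += grad.abs()
    w = w - lr * grad
    threshold = torch.quantile(accum_grad, target_sparsity)
    tracked = accum_grad >= threshold
    decay = max(0.0, 1.0 - step / decay_steps)        # linear decay schedule
    return torch.where(tracked, w, w_init * decay), accum_grad
```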
2.3 New Architectures
Instead of pruning, compressing or quantizing a pre-existing architecture, ShuffleNet [26] aims to be highly efficient by design: the computationally expensive pointwise convolutions used in SOTA architectures such as Xception [48] or ResNeXt [49], but also in the adaptive MobileNet and MorphNet architectures discussed in sec. 2.2.1 and 2.2.3, are replaced in ShuffleNet by less costly pointwise group convolutions. The downside introduced by grouping pointwise convolutions (outputs of a certain channel are derived only from a small fraction of the input channels) is compensated by shuffling the output channels s.t. the subsequent layer's input channels receive features extracted from all other channels. The ShuffleNet concept is extended by ShuffleNetV2 [27], which introduces channel splitting to balance the numbers of groups and channels, which were immutable in the original work.
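The channel shuffle operation at the heart of this design is only a reshape and transpose; a minimal PyTorch version, equivalent in spirit to the operation described in [26], looks like this:

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups so that the next grouped
    convolution sees features from every group: reshape the channel
    dimension to (groups, channels_per_group), transpose, flatten."""
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(b, c, h, w)

x = torch.arange(8, dtype=torch.float32).view(1, 8, 1, 1)
print(channel_shuffle(x, groups=2).flatten().tolist())
# [0, 4, 1, 5, 2, 6, 3, 7] -- channels from both groups are interleaved
```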
2.4 Distributed Training/Inference
Proposing a new, completely decentralized view of DNN training, [29, 30] introduce Federated Learning/Optimization (FL/O), an approach in which no centralized learning takes place; instead, client nodes (e.g. mobile devices) locally compute updates to a global model based on local data, which are then averaged by a server node and shared among the participating client nodes, thus performing a global update w.r.t. the global data.
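One global round of this scheme can be sketched as follows (a minimal NumPy illustration of federated averaging in the spirit of [30], with a hypothetical local_update function standing in for the clients' local training steps):

```python
import numpy as np

def federated_round(global_weights: np.ndarray, client_datasets: list,
                    local_update) -> np.ndarray:
    """One round of federated averaging: every client starts from the
    current global model, trains on its private data, and the server
    averages the results weighted by the clients' data set sizes."""
    client_weights, client_sizes = [], []
    for data in client_datasets:
        client_weights.append(local_update(global_weights.copy(), data))
        client_sizes.append(len(data))
    sizes = np.asarray(client_sizes, dtype=np.float64)
    return np.average(np.stack(client_weights), axis=0, weights=sizes)

# toy example: "training" just nudges the weights towards the local data mean
local_update = lambda w, data: w + 0.1 * (np.mean(data) - w)
clients = [np.random.default_rng(i).normal(i, 1.0, size=50) for i in range(3)]
global_w = np.zeros(4)
for _ in range(10):
    global_w = federated_round(global_w, clients, local_update)
print(global_w)
```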
2.5 Mixed
2.5.1 New Architecture and Quantization
By combining alternating pointwise and 3x3 convolutions into an architectural element called the Fire module, as well as post-training quantization, SqueezeNet's [12] authors empirically show that their approach significantly reduces model size compared to other SOTA models while maintaining iso-accuracy.
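A Fire module as described in [12] consists of a 1x1 "squeeze" convolution followed by an "expand" stage that mixes 1x1 and 3x3 convolutions; a compact PyTorch version could look like this (the layer sizes are illustrative of the general pattern rather than a specific SqueezeNet configuration):

```python
import torch
from torch import nn

class Fire(nn.Module):
    """Squeeze with 1x1 convs, then expand with parallel 1x1 and 3x3 convs
    whose outputs are concatenated along the channel dimension."""

    def __init__(self, c_in: int, squeeze: int, expand1x1: int, expand3x3: int):
        super().__init__()
        self.squeeze = nn.Sequential(nn.Conv2d(c_in, squeeze, 1), nn.ReLU(inplace=True))
        self.expand1x1 = nn.Sequential(nn.Conv2d(squeeze, expand1x1, 1), nn.ReLU(inplace=True))
        self.expand3x3 = nn.Sequential(nn.Conv2d(squeeze, expand3x3, 3, padding=1),
                                       nn.ReLU(inplace=True))

    def forward(self, x):
        s = self.squeeze(x)
        return torch.cat([self.expand1x1(s), self.expand3x3(s)], dim=1)

fire = Fire(96, squeeze=16, expand1x1=64, expand3x3=64)
print(fire(torch.randn(1, 96, 55, 55)).shape)   # -> (1, 128, 55, 55)
```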
2.5.2 Architectural Adaption and Quantization
Deep Compression [15] combines pruning of a pre-trained network with trained quantization of the remaining weights and Huffman coding [50] of the quantized values, substantially reducing the storage requirements of the resulting models while maintaining accuracy.
2.5.3 Distributed Training and Quantization
For training in multi-node (i.e. distributed) environments, Quantized Stochastic Gradient Descent (QSGD), introduced by [7], provides a family of gradient compression schemes aimed at reducing the inter-node communication costs incurred during SGD's gradient updates, producing a significant speedup while providing convergence guarantees under standard assumptions (not included here for brevity, see [8] for details).
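The basic QSGD quantizer can be sketched as follows (an unoptimized illustration of the stochastic rounding scheme only, without the integer coding used in [7] to compress the transmitted values):

```python
import numpy as np

def qsgd_quantize(v: np.ndarray, levels: int, rng=np.random.default_rng()):
    """Stochastically quantize a gradient vector to `levels` uniform levels
    per component. Returns (norm, signs, integer levels), which is what a
    node would communicate; the expectation equals v (unbiased)."""
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return norm, np.sign(v), np.zeros_like(v, dtype=np.int64)
    scaled = np.abs(v) / norm * levels            # in [0, levels]
    lower = np.floor(scaled)
    prob_up = scaled - lower                      # round up with this probability
    q = lower + (rng.random(v.shape) < prob_up)
    return norm, np.sign(v), q.astype(np.int64)

def qsgd_dequantize(norm, signs, q, levels: int) -> np.ndarray:
    return norm * signs * q / levels

rng = np.random.default_rng(0)
g = rng.normal(size=10000)
decoded = qsgd_dequantize(*qsgd_quantize(g, levels=4, rng=rng), levels=4)
print("relative error:", np.linalg.norm(decoded - g) / np.linalg.norm(g))
```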
3 Discussion
3.1 Metrics
As can be seen in tab. 1, the common ground shared by all surveyed approaches is that they recognize test accuracy (expressed as test error in some works) as a relevant metric. We conjecture this reflects the common goal of iso-accuracy, i.e. any improvement in terms of time and space complexity must not come at the price of degraded test accuracy compared to more complex state-of-the-art approaches. Second to test accuracy in terms of prevalence is runtime (expressed in time units, e.g. milliseconds, as speedup or as operation counts), which is recognized as relevant by all except 3 of the 24 surveyed papers. The papers not recognizing runtime as a relevant metric for their contribution were either quantization simulation frameworks (TensorQuant, QPyTorch, sec. 2.1.3) or did not disclose why they decided against runtime as a metric (VNQ, sec. 2.1.1). With regard to model size (expressed as number of parameters or bytes), notably no work falling solely into the quantization (sec. 1 cat. 1) or distributed (sec. 1 cat. 4) category uses this metric; however, it is used by all works introducing new architectures, adapting existing architectures (sec. 1 cat. 2 and 3) or mixing multiple approaches. While it is clear why work distribution strategies do not lead to decreased model sizes, a possible reason for the lack of reported model size reductions through quantization could be that quantization alone, while yielding improvements in runtime, does not result in a reduction in model size large enough for the authors to consider it worthwhile reporting - verifying this conjecture is, however, out of the scope of this survey. An interesting metric explicitly reported by only 2 works (Procrustes and CGNets, sec. 2.2.3) is energy efficiency: this is distinctive because the energy efficiency (and indirectly the carbon emissions) of deep learning algorithms has recently become a subject of research and is suggested as a key metric for evaluating deep learning models [51, 52].
3.2 Tasks
The baseline task all surveyed works compete on is classification. In only 7 of the 24 works did the authors apply their approach to object detection, speech recognition or semantic segmentation tasks (namely QSGD (sec. 2.5.3), MobileNets (sec. 2.2.1), FNA (sec. 2.2.2), ShuffleNets (sec. 2.3) and FL/O (sec. 2.4)). There is no obvious connection between the type of task chosen by the authors and the strategic category, so we conjecture that authors might display the observed preference for classification tasks for reasons of simplicity.
3.3 Guarantees, Data Agnosticism, Scenario
Regarding abstract guarantees, only 5 of the 24 works are able to provide guarantees w.r.t. convergence, time and/or space complexity, namely QAT (sec. 2.1.1), QSGD (sec. 2.5.3), MobileNets (sec. 2.2.1) and WoodFisher (sec. 2.2.2), and no work provides guarantees w.r.t. test error. The majority of authors rely solely on an empirical evaluation of their work.
Only CGNets (sec. 2.2.3) considers the input data in its decision of where to prune, which distinguishes it from the 23 other, data-agnostic, surveyed approaches.
There seems to be a focus on increasing inference performance in the surveyed literature, with 2 works focusing on training, 12 on inference and 10 offering improvements for both.
3.4 Performance of Architectural Adaption Methods
In our work, we found that the most common experiment conducted by researchers working on DNN architectural adaption problems is classifying ImageNet [53, 54] (followed by CIFAR-10/100 [55]) and comparing the metrics test accuracy or error, Multiply-Add operations (MAdds, 1 MAdd = 2 FLOPs) and model size (usually given as the number of parameters) with other SOTA approaches. Hence, we found this experimental setup most useful for comparing the performance of different approaches and summarize the authors' results in tab. 2. This nevertheless remains only the smallest comparable subset of experiments and should not be used to judge the contribution of an approach without considering its other properties (dynamic/static adaption, pre- or post-training, proven generalizability to other tasks such as object detection or semantic segmentation, etc.).
3.4.1 The Best Solution?
Within the limited expressiveness of the comparable experiments listed in tab. 2, however, we found that Procrustes (sec. 2.2.3) outperforms all other works in terms of top-1 accuracy to model size ratio and top-1 accuracy to MAdds ratio, reporting a top-1 test accuracy of 71.13% with only 75M MAdds and 0.35M weights when combined with MobileNetV2 (sec. 2.2.1). The best top-1 accuracy to model compression ratio, though, is reached by WoodFisher (sec. 2.2.2), which reports a top-1 test accuracy of 72.16% after inducing 95% sparsity in a ResNet50's weights, which, under the assumption of a sparse weight storage format, translates to a reduction in model size by a factor of 20. WoodFisher also outperforms all other approaches in terms of highest test accuracy: with the induction of 80% sparsity, WoodFisher reaches 76.73% accuracy and a model compression by a factor of 5. Unfortunately, the WoodFisher paper does not explicitly report MAdds or FLOPs but only speedups, s.t. we cannot compare the approach to others using this metric. It should also be taken into account that WoodFisher and Procrustes solve two similar but different problems: WoodFisher receives a pre-trained, un-pruned network and prunes it to obtain its results, while Procrustes already prunes during training, i.e. the latter is capable of accelerating training as well as inference - which illustrates the limited usefulness of comparing experiments without taking algorithmic properties into account.
Approach | Top 1 | MAdds | Size | Src | Ref |
---|---|---|---|---|---|
MobileNetV1 (1.0) | 70.6 | 569M | 4.2M | [21] | 2.2.1 |
MobileNetV1 (0.75) | 68.4 | 325M | 2.6M | [21] | 2.2.1 |
MobileNetV1 (0.5) | 64.7 | 149M | 1.3M | [21] | 2.2.1 |
MobileNetV1 (0.25) | 50.6 | 41M | 0.5M | [21] | 2.2.1 |
GoogLeNet | 69.7 | 1500M | 6.8M | [21] | |
VGG16 [56] | 71.5 | 15300M | 138M | [21] | |
SqueezeNet | 57.5 | 1700M | 1.25M | [21] | 2.5.1 |
AlexNet [57] | 57.2 | 720M | 60M | [21] | |
MobileNetV2 (1.4) | 74.7 | 585M | 6.9M | [20] | 2.2.1 |
MobileNetV2 (1.0) | 72.0 | 300M | 3.4M | [20] | 2.2.1 |
ShuffleNetV1 (1.5) | 71.5 | 292M | 3.4M | [20] | 2.3 |
ShuffleNetV1 (2.0) | 73.7 | 524M | 5.4M | [20] | 2.3 |
MobileNetV3-L (1.0) | 75.2 | 219M | 5.4M | [22] | 2.2.1 |
MobileNetV3-L (0.75) | 73.3 | 155M | 3.0M | [22] | 2.2.1 |
MobileNetV3-S (1.0) | 67.4 | 66M | 2.9M | [22] | 2.2.1 |
MobileNetV3-S (0.75) | 65.4 | 44M | 2.4M | [22] | 2.2.1 |
InceptionV2 [58] | 74.1 | 5000M[58] | 25M [58] | [13] | |
MorphNet-InceptionV2 | 75.2 | 5000M | 25M | [13] | 2.2.3 |
MobileNetV1 (0.5) | 57.1 | 149M | 1.3M | [13] | 2.2.1 |
MorphNet-MobileNetV1 (0.5) | 58.1 | 149M | 1.3M | [13] | 2.2.3 |
MobileNetV1 (0.25) | 44.8 | 41M | 0.5M | [13] | 2.2.1 |
MorphNet-MobileNetV1 (0.25) | 45.9 | 41M | 0.5M | [13] | 2.2.3 |
ResNet18 [59] | 68.41 | 1800M[19] | 11M | [18] | |
Dropback-ResNet18 | 70.05 | ? | 2M | [18] | 2.2.3 |
Dropback-ResNet18 | 67.99 | ? | 1M | [18] | 2.2.3 |
ResNet18 | 69.17 | 1800M | 11M | [19] | |
Procrustes-ResNet18 | 69.31 | 359M | 1M | [19] | 2.2.3 |
MobileNetV2 (1.0) | 70.98 | 301M | 3.5M | [19] | 2.2.1 |
Procrustes-MobileNetV2 (1.0) | 71.13 | 75M | 0.35M | [19] | 2.2.3 |
ResNet18 | 69.2 | 1800M | 11M | [17] | |
CGNet-A-ResNet18 | 68.8 | 933M | ? | [17] | 2.2.3 |
CGNet-B-ResNet18 | 68.3 | 887M | ? | [17] | 2.2.3 |
MobileNetV1 (0.75) | 68.8 | 325M | 2.6M | [17] | 2.2.1 |
CGNet-A-MobileNetV1 (0.75) | 68.2 | 173M | ? | [17] | 2.2.3 |
CGNet-B-MobileNetV1 (0.75) | 67.8 | 136M | ? | [17] | 2.2.3 |
ResNet50 | 77.01 | 3727M[60] | 20.69M[60] | [14] | |
WoodFisher-ResNet50 | 76.73 | ? | 4.14M | [14] | 2.2.2 |
WoodFisher-ResNet50 | 75.26 | ? | 2.07M | [14] | 2.2.2 |
WoodFisher-ResNet50 | 72.16 | ? | 1.03M | [14] | 2.2.2 |
4 Concluding Remarks
We present a survey, which differs from other surveys in the field as outlined in sec. 1.2, of carefully researched and selected (sec. 1.1) recently published SOTA methods in the field of dynamic DNN architectural and topological adaptation (sec. 2.2), quantized (sec. 2.1) and distributed (sec. 2.4) DNN training and inference as well as new, efficient architectures (sec. 2.3). We further contribute a categorization (sec. 1) and overview (sec. 2) of these methods, describe the metrics (sec. 3.1) and experimental setups (sec. 3.4) used commonly in the field as well as the problems faced when comparing different approaches (sec. 3.4). Finally, we highlight open questions (sec. 5) we regard as relevant for their high potential to advance the SOTA in the field of deep learning.
5 Open Questions
In DNN training, a DNN stores information about the shape of its decision surface in its weights. We found that no approach exists that uses a metric directly measuring the amount of information lost through weight pruning (e.g. the KLD), and we conjecture that the concept introduced by VNQ (sec. 2.1.1) could be extended to that purpose. We also point out that it has not yet been explored how quantized training impacts vanishing/exploding gradients and the corresponding counter-strategies [61, 62, 63, 64, 65, 66, 67], and vice versa. It is furthermore unknown how combinations of the various approaches perform. It might be worthwhile to explore how a MobileNet trained by Procrustes, pruned by WoodFisher and finally compressed and quantized by Deep Compression performs (sec. 2.2.1, 2.2.3, 2.2.2, 2.5.2), and whether the results can be improved further by the most recent data augmentation techniques for semi-supervised and supervised deep learning tasks [68, 69].
Given [70] showed the practical feasibility of software-induced fault injection attacks on DNNs and the high vulnerability of DNNs to such attacks, another important question left untreated by all quantization related works in sec. 2.1 and 2.5 is how these approaches hold up against dedicated attacks on quantized networks [71, 72] for they do not natively include defense mechanisms [73, 74] against such attacks. Exemplary studies of the robustness of quantized neural networks against bit error attacks were recently conducted by [75, 76].
Another type of attack to which DNNs are particularly vulnerable are out-of-distribution (OOD) attacks [77], which is why we deem it important to explore how architectural adaption or quantization affects a network's robustness to such attacks. Provable guarantees w.r.t. the detection of OOD attacks are provided by [78].
As discussed in sec. 3, only 5 of 24 surveyed works provide abstract guarantees regarding convergence, time or space complexity and not a single work provides any guarantees w.r.t. test error (i.e. network degradation induced by architecture modifications or quantization). We conjecture component-wise perturbation analysis [79] might lead to such guarantees as this technique has in similar scenarios not only been applied to study historical neural network architectures [80, 81, 82] but also yielded results for simple modern architectures recently [83].
Another question we deem worthy of further investigation is whether the interpretability [84, 85] of quantized, pruned or otherwise adapted DNNs changes w.r.t. a base model and whether these changes, if present, can be quantified and lower or upper bounds for their magnitude be provided.
References
- [1] James Hardy Wilkinson “Rounding errors in algebraic processes” Courier Corporation, 1994
- [2] Benoit Jacob et al. “Quantization and training of neural networks for efficient integer-arithmetic-only inference” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2704–2713
- [3] Jiwei Yang et al. “Quantization networks” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7308–7316
- [4] Dongqing Zhang, Jiaolong Yang, Dongqiangzi Ye and Gang Hua “LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks” In Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 365–382
- [5] Tao Sheng et al. “A quantization-friendly separable convolution for mobilenets” In 2018 1st Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications (EMC2), 2018, pp. 14–18 IEEE
- [6] Aditya Rajagopal, Diederik Adriaan Vink, Stylianos I Venieris and Christos-Savvas Bouganis “Multi-Precision Policy Enforced Training (MuPPET): A precision-switching strategy for quantised fixed-point training of CNNs” In Proceedings of the 37th International Conference on Machine Learning PMLR, 2020, pp. 7943–7952
- [7] Dan Alistarh et al. “QSGD: Communication-efficient SGD via gradient quantization and encoding” In Advances in Neural Information Processing Systems 30, 2017, pp. 1709–1720
- [8] Dan Alistarh et al. “QSGD: Communication-efficient SGD via gradient quantization and encoding” In arXiv preprint arXiv:1610.02132, 2016 URL: https://arxiv.org/pdf/1610.02132.pdf
- [9] Dominik Marek Loroch, Franz-Josef Pfreundt, Norbert Wehn and Janis Keuper “Tensorquant: A simulation toolbox for deep neural network quantization” In Proceedings of the Machine Learning on HPC Environments, 2017, pp. 1–8
- [10] Tianyi Zhang, Zhiqiu Lin, Guandao Yang and Christopher De Sa “Qpytorch: A low-precision arithmetic simulation framework” In arXiv preprint arXiv:1910.04540, EMC2: Workshop on Energy Efficient ML and Cognitive Computing, at NeurIPS, 2019 URL: https://arxiv.org/pdf/1910.04540.pdf
- [11] Jan Achterhold, Jan M. Köhler, Anke Schmeink and Tim Genewein “Variational Network Quantization” In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings OpenReview.net, 2018 URL: https://openreview.net/forum?id=ry-TW-WAb
- [12] Forrest N Iandola et al. “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size” In arXiv preprint arXiv:1602.07360, 2016 URL: https://arxiv.org/pdf/1602.07360.pdf
- [13] Ariel Gordon et al. “Morphnet: Fast & simple resource-constrained structure learning of deep networks” In Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 1586–1595
- [14] Sidak Pal Singh and Dan Alistarh “WoodFisher: Efficient Second-Order Approximation for Neural Network Compression” In Advances in Neural Information Processing Systems 33, 2020
- [15] Song Han, Huizi Mao and William J. Dally “Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding” In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016 URL: http://arxiv.org/abs/1510.00149
- [16] Jiemin Fang et al. “FNA++: Fast Network Adaptation via Parameter Remapping and Architecture Search” In IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, pp. 1–1 DOI: 10.1109/TPAMI.2020.3044416
- [17] Weizhe Hua et al. “Channel Gating Neural Networks” In Advances in Neural Information Processing Systems 32 Curran Associates, Inc., 2019 URL: https://proceedings.neurips.cc/paper/2019/file/68b1fbe7f16e4ae3024973f12f3cb313-Paper.pdf
- [18] Mieszko Lis, Maximilian Golub and Guy Lemieux “Full Deep Neural Network Training On A Pruned Weight Budget” In Proceedings of Machine Learning and Systems 1, 2019, pp. 252–263 URL: https://proceedings.mlsys.org/paper/2019/file/7f1de29e6da19d22b51c68001e7e0e54-Paper.pdf
- [19] Dingqing Yang et al. “Procrustes: a dataflow and accelerator for sparse deep neural network training” In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020, pp. 711–724 IEEE
- [20] Mark Sandler et al. “Mobilenetv2: Inverted residuals and linear bottlenecks” In Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4510–4520
- [21] Andrew G Howard et al. “Mobilenets: Efficient convolutional neural networks for mobile vision applications” In arXiv preprint arXiv:1704.04861, 2017 URL: https://arxiv.org/pdf/1704.04861.pdf
- [22] Andrew Howard et al. “Searching for mobilenetv3” In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1314–1324
- [23] Debjyoti Sinha and Mohamed El-Sharkawy “Thin MobileNet: An Enhanced MobileNet Architecture” In 2019 IEEE 10th Annual Ubiquitous Computing, Electronics Mobile Communication Conference (UEMCON), 2019, pp. 0280–0285 DOI: 10.1109/UEMCON47517.2019.8993089
- [24] Hong-Yen Chen and Chung-Yen Su “An Enhanced Hybrid MobileNet” In 2018 9th International Conference on Awareness Science and Technology (iCAST), 2018, pp. 308–312 DOI: 10.1109/ICAwST.2018.8517177
- [25] Asifullah Khan, Anabia Sohail, Umme Zahoora and Aqsa Saeed Qureshi “A survey of the recent architectures of deep convolutional neural networks” In Artificial Intelligence Review Springer, 2019, pp. 1–62
- [26] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin and Jian Sun “Shufflenet: An extremely efficient convolutional neural network for mobile devices” In Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6848–6856
- [27] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng and Jian Sun “Shufflenet v2: Practical guidelines for efficient cnn architecture design” In Proceedings of the European conference on computer vision (ECCV), 2018, pp. 116–131
- [28] Tal Ben-Nun and Torsten Hoefler “Demystifying parallel and distributed deep learning: An in-depth concurrency analysis” In ACM Computing Surveys (CSUR) 52.4 ACM New York, NY, USA, 2019, pp. 1–43
- [29] Brendan McMahan and Daniel Ramage “Federated Learning: Collaborative Machine Learning without Centralized Training Data” Online, accessed 16.07.2020, 2017 URL: https://ai.googleblog.com/2017/04/federated-learning-collaborative.html
- [30] Brendan McMahan et al. “Communication-Efficient Learning of Deep Networks from Decentralized Data” In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics 54, Proceedings of Machine Learning Research Fort Lauderdale, FL, USA: PMLR, 2017, pp. 1273–1282 URL: http://proceedings.mlr.press/v54/mcmahan17a.html
- [31] Ji Wang et al. “Deep learning towards mobile applications” In 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS), 2018, pp. 1385–1393 IEEE
- [32] Yunhui Guo “A Survey on Methods and Theories of Quantized Neural Networks” In CoRR abs/1808.04752, 2018 arXiv: http://arxiv.org/abs/1808.04752
- [33] El Hadrami Cheikh Tourad and Mohsine Eleuldj “Survey of Deep Learning Neural Networks Implementation on FPGAs” In 2020 5th International Conference on Cloud Computing and Artificial Intelligence: Technologies and Applications (CloudTech), 2020, pp. 1–8 DOI: 10.1109/CloudTech49835.2020.9365911
- [34] Sheng Xu, Anran Huang, Lei Chen and Baochang Zhang “Convolutional Neural Network Pruning: A Survey” In 2020 39th Chinese Control Conference (CCC), 2020, pp. 7458–7463 DOI: 10.23919/CCC50068.2020.9189610
- [35] Yangyang Li et al. “Quantum Optimization and Quantum Learning: A Survey” In IEEE Access 8, 2020, pp. 23568–23593 DOI: 10.1109/ACCESS.2020.2970105
- [36] Somayeh Bakhtiari Ramezani et al. “Machine Learning Algorithms in Quantum Computing: A Survey” In 2020 International Joint Conference on Neural Networks (IJCNN), 2020, pp. 1–8 DOI: 10.1109/IJCNN48605.2020.9207714
- [37] Durk P Kingma, Tim Salimans and Max Welling “Variational Dropout and the Local Reparameterization Trick” In Advances in Neural Information Processing Systems 28 Curran Associates, Inc., 2015 URL: https://proceedings.neurips.cc/paper/2015/file/bc7316929fe1545bf0b98d114ee3ecb8-Paper.pdf
- [38] Kirill Neklyudov, Dmitry Molchanov, Arsenii Ashukha and Dmitry Vetrov “Structured Bayesian Pruning via Log-Normal Multiplicative Noise”, NIPS’17 Long Beach, California, USA: Curran Associates Inc., 2017, pp. 6778–6787
- [39] Solomon Kullback and Richard A Leibler “On information and sufficiency” In The annals of mathematical statistics 22.1 JSTOR, 1951, pp. 79–86
- [40] Dong Yin et al. “Gradient diversity: a key ingredient for scalable distributed learning” In International Conference on Artificial Intelligence and Statistics, 2018, pp. 1998–2007
- [41] Andrew Kerr et al. “NVIDIA CUTLASS” Online, accessed 15.07.2020, 2020 URL: https://github.com/NVIDIA/cutlass
- [42] Laurent Sifre and Stéphane Mallat “Rigid-motion scattering for image classification” In Ph. D. thesis Citeseer, 2014
- [43] Yann LeCun, John S Denker and Sara A Solla “Optimal brain damage” In Advances in neural information processing systems, 1990, pp. 598–605
- [44] Babak Hassibi and David G Stork “Second order derivatives for network pruning: Optimal brain surgeon” Morgan Kaufmann, 1993
- [45] Peter M Williams “Bayesian regularization and pruning using a Laplace prior” In Neural computation 7.1 MIT Press, 1995, pp. 117–143
- [46] Andrew Y Ng “Feature selection, L 1 vs. L 2 regularization, and rotational invariance” In Proceedings of the twenty-first international conference on Machine learning, 2004, pp. 78
- [47] Robert Tibshirani “Regression shrinkage and selection via the lasso” In Journal of the Royal Statistical Society: Series B (Methodological) 58.1 Wiley Online Library, 1996, pp. 267–288
- [48] François Chollet “Xception: Deep learning with depthwise separable convolutions” In Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1251–1258
- [49] Saining Xie et al. “Aggregated residual transformations for deep neural networks” In Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1492–1500
- [50] David A Huffman “A method for the construction of minimum-redundancy codes” In Proceedings of the IRE 40.9 IEEE, 1952, pp. 1098–1101
- [51] David Patterson et al. “Carbon Emissions and Large Neural Network Training” In arXiv preprint arXiv:2104.10350, 2021 URL: https://arxiv.org/pdf/2104.10350.pdf
- [52] Lasse F Wolff Anthony, Benjamin Kanding and Raghavendra Selvan “Carbontracker: Tracking and predicting the carbon footprint of training deep learning models” In arXiv preprint arXiv:2007.03051, 2020 URL: https://arxiv.org/pdf/2007.03051.pdf
- [53] Olga Russakovsky et al. “Imagenet large scale visual recognition challenge” In International journal of computer vision 115.3 Springer, 2015, pp. 211–252
- [54] Jia Deng et al. “Imagenet: A large-scale hierarchical image database” In 2009 IEEE conference on computer vision and pattern recognition, 2009, pp. 248–255 Ieee
- [55] Alex Krizhevsky and Geoffrey Hinton “The CIFAR-10 dataset” Online, accessed 15.07.2020, 2019 URL: https://www.cs.toronto.edu/~kriz/cifar.html
- [56] Karen Simonyan and Andrew Zisserman “Very Deep Convolutional Networks for Large-Scale Image Recognition” In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015 URL: http://arxiv.org/abs/1409.1556
- [57] Alex Krizhevsky, Ilya Sutskever and Geoffrey E Hinton “Imagenet classification with deep convolutional neural networks” In Advances in neural information processing systems, 2012, pp. 1097–1105
- [58] Christian Szegedy et al. “Rethinking the inception architecture for computer vision” In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826
- [59] Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun “Deep residual learning for image recognition” In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778
- [60] Xijun Wang, Meina Kan, Shiguang Shan and Xilin Chen “Fully learnable group convolution for acceleration of deep neural networks” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 9049–9058
- [61] Sunitha Basodi, Chunyan Ji, Haiping Zhang and Yi Pan “Gradient amplification: An efficient way to train deep neural networks” In Big Data Mining and Analytics 3.3, 2020, pp. 196–207 DOI: 10.26599/BDMA.2020.9020004
- [62] Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee and Andrew Rabinovich “Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks” In International Conference on Machine Learning, 2018, pp. 794–803
- [63] Tim Salimans and Durk P Kingma “Weight normalization: A simple reparameterization to accelerate training of deep neural networks” In Advances in neural information processing systems, 2016, pp. 901–909
- [64] Günter Klambauer, Thomas Unterthiner, Andreas Mayr and Sepp Hochreiter “Self-normalizing neural networks” In Advances in neural information processing systems, 2017, pp. 971–980
- [65] Xavier Glorot and Yoshua Bengio “Understanding the difficulty of training deep feedforward neural networks” In Proceedings of the thirteenth international conference on artificial intelligence and statistics, 2010, pp. 249–256
- [66] Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification” In Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034
- [67] Boris Hanin and David Rolnick “How to start training: The effect of initialization and architecture” In Advances in Neural Information Processing Systems, 2018, pp. 571–581
- [68] Prannay Khosla et al. “Supervised Contrastive Learning” In Advances in Neural Information Processing Systems 33 Curran Associates, Inc., 2020, pp. 18661–18673 URL: https://proceedings.neurips.cc/paper/2020/file/d89a66c7c80a29b1bdbab0f2a1a94af8-Paper.pdf
- [69] Ting Chen, Simon Kornblith, Mohammad Norouzi and Geoffrey Hinton “A simple framework for contrastive learning of visual representations” In International conference on machine learning, 2020, pp. 1597–1607 PMLR
- [70] Sanghyun Hong et al. “Terminal brain damage: Exposing the graceless degradation in deep neural networks under hardware fault attacks” In 28th USENIX Security Symposium (USENIX Security 19), 2019, pp. 497–514
- [71] Adnan Siraj Rakin, Zhezhi He and Deliang Fan “Bit-flip attack: Crushing neural network with progressive bit search” In Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1211–1220
- [72] Adnan Siraj Rakin, Zhezhi He and Deliang Fan “Bit-Flips Attack and Defense” Online, accessed 15.07.2020, 2020 URL: https://github.com/elliothe/BFA
- [73] Zhezhi He et al. “Defending and Harnessing the Bit-Flip based Adversarial Weight Attack” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 14095–14103
- [74] Jingtao Li et al. “Defending Bit-Flip Attack through DNN Weight Reconstruction” In 2020 57th ACM/IEEE Design Automation Conference (DAC), 2020, pp. 1–6 DOI: 10.1109/DAC18072.2020.9218665
- [75] Mirco Giacobbe, Thomas A Henzinger and Mathias Lechner “How many bits does it take to quantize your neural network?” In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, 2020, pp. 79–97 Springer
- [76] David Stutz, Nandhini Chandramoorthy, Matthias Hein and Bernt Schiele “Bit error robustness for energy-efficient dnn accelerators” In Fourth Conference on Machine Learning and Systems, 2021 mlsys.org URL: https://proceedings.mlsys.org/paper/2021/file/a684eceee76fc522773286a895bc8436-Paper.pdf
- [77] Maximilian Augustin, Alexander Meinke and Matthias Hein “Adversarial robustness on in-and out-distribution improves explainability” In European Conference on Computer Vision, 2020, pp. 228–245 Springer
- [78] Julian Bitterwolf, Alexander Meinke and Matthias Hein “Certifiably Adversarially Robust Detection of Out-of-Distribution Data” In Advances in Neural Information Processing Systems 33 Curran Associates, Inc., 2020, pp. 16085–16095 URL: https://proceedings.neurips.cc/paper/2020/file/b90c46963248e6d7aab1e0f429743ca0-Paper.pdf
- [79] Nicholas J. Higham “Accuracy and Stability of Numerical Algorithms”, pp. 119–137 DOI: 10.1137/1.9780898718027.ch7
- [80] Kaining Wang and Anthony N. Michel “Robustness and Perturbation Analysis of a Class of Artificial Neural Networks” In Neural Networks 7.2, 1994, pp. 251–259
- [81] A. Meyer-Base “Perturbation analysis of a class of neural networks” In Proceedings of International Conference on Neural Networks (ICNN’97) 2, 1997, pp. 825–828 vol. 2
- [82] Xiaoqin Zeng and D. S. Yeung “Sensitivity analysis of multilayer perceptron to input and weight perturbations” In IEEE Transactions on Neural Networks 12.6, 2001, pp. 1358–1366
- [83] L. Xiang, X. Zeng, Y. Niu and Y. Liu “Study of Sensitivity to Weight Perturbation for Convolution Neural Network” In IEEE Access 7, 2019, pp. 93898–93908
- [84] Quanshi Zhang, Ying Nian Wu and Song-Chun Zhu “Interpretable Convolutional Neural Networks” In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 8827–8836 DOI: 10.1109/CVPR.2018.00920
- [85] Quan-shi Zhang and Song-Chun Zhu “Visual interpretability for deep learning: a survey” In Frontiers of Information Technology & Electronic Engineering 19.1 Springer, 2018, pp. 27–39