AdaSpring: Context-adaptive and Runtime-evolutionary Deep Model Compression for Mobile Applications

Sicong Liu (Northwestern Polytechnical University, School of Computer Science, Xi’an, China), Bin Guo (Northwestern Polytechnical University, School of Computer Science, Xi’an, China), Ke Ma (Northwestern Polytechnical University, School of Computer Science, Xi’an, China), Zhiwen Yu (Northwestern Polytechnical University, School of Computer Science, Xi’an, China), and Junzhao Du (Xidian University, School of Computer Science and Technology, Xi’an, China)

Abstract.

Many deep learning (e.g. DNN) powered mobile and wearable applications now continuously and unobtrusively sense their ambient surroundings to enhance all aspects of human life. To enable robust and private mobile sensing, DNNs tend to be deployed locally on resource-constrained mobile devices via model compression. Current practice, whether hand-crafted DNN compression techniques that optimize DNN-relative performance (e.g. parameter size) or on-demand DNN compression methods that optimize hardware-dependent metrics (e.g. latency), cannot run locally online because it requires offline retraining to ensure accuracy. Moreover, none of these approaches addresses runtime adaptive compression, which is needed to account for the dynamic deployment context of mobile applications. To address these challenges, we present AdaSpring, a context-adaptive and self-evolutionary DNN compression framework that enables runtime adaptive DNN compression locally online. Specifically, it presents the ensemble training of a retraining-free and self-evolutionary network to integrate multiple alternative DNN compression configurations (i.e., compressed architectures and weights). It then introduces a runtime search strategy to quickly find the most suitable compression configuration and evolve the corresponding weights. In an evaluation on five tasks across three platforms and a real-world case study, AdaSpring obtains up to $3.1\times$ latency reduction and $4.2\times$ energy efficiency improvement in DNNs compared to hand-crafted compression techniques, while incurring only $\leq 6.2ms$ runtime-evolution latency.

Corresponding author: [email protected]

Copyright: ACM. Journal: IMWUT, volume 5, number 1, article 24, March 2021. DOI: 10.1145/3448125. CCS concepts: Human-centered computing → Ubiquitous and mobile computing systems and tools.

1. Introduction

In recent years, many ubiquitous devices (e.g. smartphones, wearables, and embedded facilities) have been integrated with continuously running applications to facilitate all aspects of human life. Examples include smartphone-based speech assistants (e.g. ProxiTalk (Yang et al., 2019)) and wearable sensor-enabled activity recognition (IMUTube (Kwon et al., 2020), MITIER (Chen et al., 2020)). Notably, there is a growing trend to bring deep learning (e.g. DNN) powered intelligence to mobile devices, which enables effective data analysis. Besides, due to increasing user concerns about transmission cost and privacy, executing DNNs on local devices is a promising paradigm for robust mobile sensing (Wang et al., 2020; Lane et al., 2015). However, it is non-trivial to deploy computation-intensive DNNs on mobile platforms with tightly limited resources (i.e., storage, battery).

Given those challenges, prior works have investigated different DNN specialization schemes to explore the desired tradeoff between application performance (i.e., accuracy, latency) and resource constraints (i.e., battery and storage budgets). Firstly, as illustrated in Figure 1(a), hand-crafted DNN compression methods, e.g. weight pruning (Luo and Wu, 2020), rely on manual design to reduce model complexity. They may not suffice to meet diverse performance requirements. Secondly, on-demand DNN compression schemes (see Figure 1(b)), e.g. DeepX (Lane et al., 2016), AMC (He et al., 2018), and AdaDeep (Liu et al., 2020), adopt a trainable meta-learner to automatically find the most suitable DNN compression strategies for various platforms. They need offline retraining to ensure accuracy and update the meta-learner, and the extra overhead and latency of offline retraining are intolerable for responsive applications. Thirdly, one-shot neural architecture search (NAS) methods (see Figure 1(c)) pre-train a super-net and automatically search for the best DNN architecture for target platforms (Fang et al., 2020; Zoph et al., 2018; Saikia et al., 2019; Cai et al., 2019). However, they also incur high overhead for scanning and searching a large-scale candidate space. None of them can work locally online.

Despite major advances in existing DNN compression techniques, none of them addresses runtime adaptive compression, which accounts for the dynamics of the deployment context in continuously running applications. We have identified that a self-adaptive, retraining-free, and fast framework for runtime adaptive DNN compression (see Figure 2(d)) is necessary yet challenging. As illustrated in Figure 1(d), the DNN deployment context often exhibits high dynamics and unpredictability in practice, and dynamic changes in the deployment context lead to varying performance demands on DNN compression. Specifically, we identify the dynamic context as mainly including the time-varying hardware capabilities (e.g. storage, battery, processor), the DNN active execution time, the agnostic inference frequency triggered by real environments, and the unpredictable resource contention imposed by other Apps.

Figure 2 shows an example in which a user carries a smartphone-based hearing assistant App (e.g. UbiEar (Sicong et al., 2017)) to continuously sense ambient acoustic events of interest. During its use, the smartphone’s battery is dynamically consumed by the DNN execution, the memory access, the microphone sampling, and the screen with unpredictable frequency, which together determine the dynamic energy constraints for the deployed DNN. The storage unit (e.g. L2-Cache) is also dynamically occupied by other applications, resulting in varying storage budgets for DNN parameters. Both the mobile developer and the user face a problem: how to automatically and effectively re-compress the DNN at runtime to meet dynamic demands? This raises the following two challenges:

• Firstly, it is non-trivial to continually scale the DNN compression configurations up or down, including both architectures and weights, to meet dynamic optimization objectives on multiple DNN performance metrics (i.e., accuracy, latency, energy consumption) on-the-fly. This is because most DNN compression methods are irreversible: they cannot scale the DNN up again, i.e., recover fine details of the DNN architecture and weights, from a compressed/pruned model. Moreover, weight evolution is always limited by offline retraining.

• Secondly, it is intractable to provide an efficient and effective solution to the runtime optimization problem. Solving it requires quickly searching for suitable compression techniques from an elite candidate set and efficiently evolving weights without model retraining. Moreover, it is difficult to systematically balance multiple conflicting and interdependent performance metrics (e.g. latency, storage and energy efficiency) by merely tuning the DNN compression methods.

In view of those challenges and limitations, we present AdaSpring, a context-adaptive and runtime-evolutionary deep model compression framework. It continually controls the tradeoff among multiple performance metrics by re-selecting the proper DNN compression techniques. We formulate the runtime tuning of compression techniques as a dynamic optimization problem (see Eq. (1) in $\S$ 3.2), in which we model the dynamic context as a set of time-varying constraints (i.e., accuracy loss threshold, latency and storage budgets, and the relative importance of objectives). We then present a heuristic solution. In particular, to eliminate the runtime retraining cost, we decouple offline training from online adaptation by moving weight tuning ahead into the training of a self-evolutionary network (see $\S$ 4). Furthermore, we present an efficient and effective search strategy. It involves an elite and flexible search space (see $\S$ 5.1), the progressive shortest candidate encoding, and the Runtime3C search algorithm (see $\S$ 5.2) to boost the efficiency and quality of the locally online search. The main contributions of this work are summarized as follows.

• To the best of our knowledge, AdaSpring is the first context-adaptive and self-adaptive DNN compression framework that continually shrinks model architectures and evolves the corresponding weights by automatically applying the proper retraining-free compression techniques on-the-fly. It trains a self-evolutionary network to synergize weight recycling across multi-scale compression operators and to decouple offline training from online compression.

• AdaSpring presents an efficient runtime search strategy to solve the runtime adaptive compression problem. It introduces an elite and flexibly combined compression operator space, the fast Runtime3C search algorithm, and a set of speedup mechanisms to boost search efficiency and quality at runtime while avoiding combinatorial explosion.

• Using five mobile applications across three platforms and a real-world case study of DNN-powered sound recognition on NVIDIA Jetbot, extensive experiments show that AdaSpring can continually optimize DNN configurations. It adaptively adjusts the compression configurations to tune energy cost by $1.6mJ\sim 5.6mJ$ , latency by $1.3ms\sim 10.2ms$ , and storage by $201KB\sim 1.9MB$ , with $\leq 2.1\%$ accuracy loss. The online evolution latency of compression configurations to meet dynamic contexts is $\leq 6.2ms$ .

In the rest of the paper, we review related work in $\S$ 2, present the system overview in $\S$ 3, and elaborate the AdaSpring design in $\S$ 4 and $\S$ 5. We report an evaluation of AdaSpring in $\S$ 6 and conclude in $\S$ 7.

2. Related Work

Our work is inspired by and closely related to the following works.

DNN Compression for Ubiquitous Mobile Applications. There is a promising trend to bring deep learning powered intelligence to mobile and embedded devices (e.g. smartphones, wearables, IoT) for enhancing all aspects of human life, such as smartphone-based speech assistants (e.g. ProxiTalk (Yang et al., 2019)), smart-home device-enabled sound detection (Bhattacharya et al., 2020), and wearable-based activity recognition (e.g. MITIER (Chen et al., 2020)). Recent research has demonstrated the potential of fitting DNNs into resource-constrained mobile devices (Cheng et al., 2017) by using DNN compression techniques, including parameter pruning (He et al., 2017), sharing (Wu et al., 2018b), quantization (Zhu and Zabaras, 2018), compact components (Iandola et al., 2016; Howard et al., 2017), and model distillation (Chen et al., 2017). However, we note that few efficient compression techniques are dedicated to optimizing the application-driven system performance (e.g. energy efficiency). For example, recent work (Jha et al., 2019) argues that the platform-aware SqueezeNet (Iandola et al., 2016) and SqueezeNext (Gholami et al., 2018) merely reduce parameter size or MAC amount, which does not necessarily lead to reduced energy cost or latency (Yang et al., 2017). Also, all of these techniques need several epochs of model retraining to ensure accuracy, so they cannot run locally online. Instead, AdaSpring decouples DNN training from online adaptive compression, and considers the dynamic arithmetic intensity of both parameters and activations to guide the best specialization of convolutional compression configurations.

On-demand DNN Computation for Diverse User Demands. There are two categories of on-demand DNN computation adjustment methods: on-demand DNN compression (Liu et al., 2018a; Luo and Wu, 2020) and on-demand DNN segmentation (Zhao et al., 2018). They specialize DNN computation offline to meet diverse hardware resource budgets (e.g. battery, memory, and computation) and application demands (e.g. input diversity). To satisfy resource budgets, He et al. (He et al., 2018) adopt reinforcement learning to adaptively sample the design space. Yao et al. (Yao et al., 2017) use a recurrent model to control the adaptive compression ratio of each layer. Singh et al. (Singh et al., 2019) introduce a min-max game to achieve maximum pruning with minimal accuracy drop. These methods, however, need extra offline training to update the meta-controller for on-demand adjustment. Zhao et al. (Zhao et al., 2018) present the adaptively distributed execution of CNN-based applications on resource-constrained IoT edge clusters. However, because of mobile platforms’ mobility and opportunistic connectivity, distributed DNN inference in mobile clusters is not yet robust. Building upon these efforts, AdaSpring is the first to enable runtime adaptive DNN evolution locally, without requiring Wi-Fi/cellular networks to connect with other platforms, while achieving competitive performance.

Dynamic Adaptation of DNN Execution. Prior works have investigated the runtime adaptation of DNN execution to diverse inputs from two directions: dynamic selection of the inference path (Wu et al., 2018a) or of the network variant (Teerapittayanon et al., 2016). Wu et al. (Wu et al., 2018a) adaptively choose which residual blocks to execute during inference to reduce computation without degrading accuracy. Teerapittayanon et al. (Teerapittayanon et al., 2016) propose a multi-branch network that allows inference to adaptively exit from early branches. Han et al. (Han et al., 2016) adaptively select model variants to optimize accuracy and satisfy resource constraints (e.g. memory and energy). Gao et al. (Gao et al., 2018) propose a feature boosting and suppression method to predictively amplify salient convolutional channels and skip unimportant ones. However, these methods depend heavily on a pre-defined design space of alternative execution paths and variants, and it is prohibitive to specify all of them before deploying models into agnostic mobile contexts. AdaSpring dynamically selects and combines the proper compression operators to flexibly shrink the model configurations along multiple scaling dimensions at runtime.

Fast and Platform-aware Neural Architecture Search. Recent studies have verified the potential of leveraging the neural architecture search (NAS) framework to automate neural architecture specialization for mobile applications, from two aspects. Firstly, Fast-NAS aims to automatically specialize the neural architecture for different performance demands (e.g. computation amount, parameter size), using as little search cost as possible (Ren et al., 2020). To speed up the search, researchers have investigated the modular search strategy (Cai et al., 2018; Zhong et al., 2018; Liu et al., 2017), differentiable search strategy (Liu et al., 2018b; Jiang et al., 2019; Chen et al., 2019; Zhou et al., 2020), and super-network search strategy (Fang et al., 2020; Bender et al., 2018; Cai et al., 2019). For example, Liu et al. (Liu et al., 2018b) relax the search space to be continuous, so that it can be optimized by gradient descent, using orders of magnitude less search cost. Cai et al. (Cai et al., 2019) train a once-for-all (OFA) super-network that supports the search of $2\times{10^{19}}$ diverse variant-networks. We note that the OFA super-network includes some redundant and invalid variant-networks, so its search space is not elite and incurs a high search cost (as we will discuss in $\S$ 5). Secondly, unlike general NAS, which only optimizes model-relative metrics such as FLOPs, platform-aware NAS also incorporates platform-relative metrics (e.g. latency) into the optimization objectives. For example, Tan et al. (Tan et al., 2019) explicitly incorporate latency into the NAS objective to identify a mobile CNN model, and Dai et al. (Dai et al., 2019) propose an efficient search algorithm aided by efficient accuracy and hardware resource predictors. However, the above methods still incur high overhead to obtain the ranking of candidate architectures based on their performance on validation sets. They also do not accurately consider the energy consumption target, since energy efficiency is not straightforward to measure on different platforms with a dynamic nature. Departing from existing efforts, AdaSpring treats the retraining-free compression operators (illustrated in $\S$ 5.1) as a new ensemble to be tuned by automated macro-NAS. It trains a self-evolutionary network (see $\S$ 4) at design time to decouple model retraining from adaptive compression. Besides, it presents a set of mechanisms to boost search efficiency (i.e., at the millisecond level) and quality during dynamic inference. Notably, AdaSpring leverages dynamically measured hardware-relative metrics (i.e., arithmetic intensity of parameters and activations) to guide the selection, which also prevents combinatorial explosion (see $\S$ 5).

3. Overview

This section starts with problem analysis and then presents an overview of AdaSpring design.

3.1. Problem Study

Due to the dynamic nature of the DNN deployment context, we aim to continually tune the DNN compression configurations to directly or indirectly optimize the application-driven system performance (i.e., accuracy, energy efficiency, latency). The hybrid dependency between multiple platform-relative performance metrics and DNN-dependent metrics is shown in Figure 4. To further understand the performance requirements of DNNs for continuously running mobile applications, we asked $60$ mobile users and $10$ Android developers to rate the importance of different DNN performance aspects on mobile devices, and we summarize the results as our design goals. Specifically, a DNN for continuously running mobile Apps needs to fulfill the following requirements:

• Accurate: the DNN is accurate enough to guarantee a high-quality task. The model weights at different scales are well-trained to represent the generic information of recognition objects.

• Responsive: the complexity of the DNN should be controllable to satisfy diverse user demands on latency constraints, especially on low-end (e.g. CPU-powered) mobiles.

• Energy-Efficient: the energy consumption of the DNN should be continually optimized, which is the bottleneck metric for continuously sensing applications (Yang et al., 2017).

• Runtime-evolutionary: both the DNN architecture and parameter weights are runtime-evolutionary to meet the dynamic deployment context for continually optimizing the above three requirements (i.e., accurate, responsive, energy-efficient) at runtime.

Unfortunately, none of previous efforts satisfy all these requirements (as discussed in $\S$ 2). To this end, this paper proposes AdaSpring, a context-adaptive and runtime-evolutionary DNN compression framework to automatically optimize the requirements mentioned above, which are closely related to the user experience.

3.2. Optimization Formulation

As shown in Figure 4, AdaSpring intends to provide a systematic method to automatically select the compression operator combination for tuning the above conflicting and interdependent performance metrics. Mathematically, AdaSpring explores an efficient solution to the following dynamic optimization problem:

$$\begin{aligned}\operatorname*{argmin}_{\delta_{i}\in\mathrm{\Delta}}\quad&\lambda_{1}(t)\,Norm\big(A(\Omega)-A(\Omega(\delta_{i}))\big)-\lambda_{2}(t)\,Norm\big(E(\Omega(\delta_{i}))\big)\\ \text{s.t.}\quad&A(\Omega)-A(\Omega(\delta_{i}))\leq A_{threshold}(t),\quad T(\Omega(\delta_{i}))\leq T_{bgt}(t),\quad S(\Omega(\delta_{i}))\leq S_{bgt}(t)\end{aligned}\qquad(1)$$

where $\mathrm{\Delta}$ represents the set of all optional convolutional compression operators (as enumerated in $\S$ 5.1). Given a backbone-net architecture $\Omega$ , $\Omega(\delta_{i})$ represents the re-configured model architecture compressed by the selected compression operator $\delta_{i}$ . $A$ , $E$ , $T$ and $S$ denote the measured accuracy, energy efficiency, latency, and memory footprint of a given model running on the target mobile platform. The two objectives on $A$ and $E$ are combined by the relative importance coefficients $\lambda_{1}$ and $\lambda_{2}$ , which dynamically depend on the platform’s remaining battery. We express the dynamic deployment context as a set of time-varying constraints, i.e., the threshold of accuracy loss $A_{threshold}(t)$ , the latency budget $T_{bgt}(t)$ , the storage budget $S_{bgt}(t)$ , and the relative importance coefficients of the objectives ( $\lambda_{1}(t)$ , $\lambda_{2}(t)$ ). The latency budget $T_{bgt}(t)$ is application-specified, and the storage budget $S_{bgt}(t)$ is platform-imposed. For example, reducing the model size $S$ to satisfy the L2-Cache budget $S_{bgt}$ helps fit the model into on-chip memory and avoids expensive off-chip access. We note that $Norm(.)$ is a normalization operation for objective aggregation, e.g. $log(.)$ . We then propose a heuristic optimization solution that adjusts the model architecture to satisfy the dynamic performance requirements. In particular, the model architecture $\Omega(\delta_{i})$ directly determines both $S$ and $A$ (see Figure 4), whereas the quantification of the hardware-dependent metrics $E$ and $T$ is not straightforward. Therefore, AdaSpring’s goal becomes adaptively selecting the compression operator combination $\delta_{i}$ from the discrete set of all possible combinations $\mathrm{\Delta}$ , so that it can directly or indirectly tune the model performance metrics.
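To make the formulation concrete, the following is a minimal Python sketch of how the objective and constraints of Eq. (1) could be evaluated over a small set of candidates. It is an illustration under stated assumptions, not the search procedure of AdaSpring: the `Context` and `Candidate` containers, the exhaustive `select` loop, and the choice of `log1p` as $Norm(.)$ (so that a zero accuracy loss stays well defined) are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Context:
    """Time-varying deployment context at time t (hypothetical container)."""
    lambda1: float         # relative importance of the accuracy-loss objective
    lambda2: float         # relative importance of the energy-efficiency objective
    acc_threshold: float   # A_threshold(t): maximum tolerated accuracy loss
    latency_budget: float  # T_bgt(t), in ms
    storage_budget: float  # S_bgt(t), in KB


@dataclass
class Candidate:
    """Measured metrics of the backbone compressed by one operator combination."""
    name: str
    accuracy: float           # A(Omega(delta_i))
    energy_efficiency: float  # E(Omega(delta_i)), larger is better
    latency: float            # T(Omega(delta_i)), in ms
    storage: float            # S(Omega(delta_i)), in KB


def norm(x):
    # Assumed normalization: log(1 + x), a log-style Norm(.) defined at x = 0.
    return math.log1p(x)


def feasible(c, backbone_acc, ctx):
    """Check the three constraints of Eq. (1)."""
    return (backbone_acc - c.accuracy <= ctx.acc_threshold
            and c.latency <= ctx.latency_budget
            and c.storage <= ctx.storage_budget)


def objective(c, backbone_acc, ctx):
    """Weighted accuracy loss minus weighted energy efficiency (to be minimized)."""
    acc_loss = max(backbone_acc - c.accuracy, 0.0)
    return ctx.lambda1 * norm(acc_loss) - ctx.lambda2 * norm(c.energy_efficiency)


def select(candidates, backbone_acc, ctx):
    """Exhaustive argmin over feasible candidates; None if nothing is feasible."""
    ok = [c for c in candidates if feasible(c, backbone_acc, ctx)]
    return min(ok, key=lambda c: objective(c, backbone_acc, ctx)) if ok else None
```

Calling `select` again with an updated `Context` (for example, a tighter latency budget) re-evaluates the same candidates against the new constraints, which is the sense in which the problem is dynamic.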

3.3. AdaSpring Framework

The above challenging problem motivates the AdaSpring design. As shown in Figure 4, the AdaSpring framework consists of a self-evolutionary network, a runtime adaptive compression block, and a dynamic context awareness block. (i) The self-evolutionary network is an ensemble of a backbone-net and multiple retraining-free compression operator-variants, which enables weight recycling between numerous variants while avoiding catastrophic interference. We initialize the backbone-net’s hyperparameters at design time using an on-demand DNN generation framework, i.e., AdaDeep (Liu et al., 2020), to satisfy mobile application performance demands on a target platform. (ii) The runtime adaptive compression block selects a deterministic optimal combination of compression operators for reconfiguring and evolving the backbone-net at runtime. (iii) The dynamic deployment context awareness block detects the evolution demands and triggers the runtime adaptive compression block. The trigger condition can be modeled as a noticeable context change or as a pre-defined frequency (e.g. a time slice) for Apps that run continuously in daily use.
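The two trigger modes described above (a noticeable context change, or a fixed time slice) can be sketched as follows. This is a hypothetical illustration only: the function name, the dictionary of context readings, and the default thresholds `rel_change` and `period_s` are assumptions and do not come from the paper.

```python
def should_trigger(prev, curr, elapsed_s, rel_change=0.2, period_s=60.0):
    """Decide whether to invoke the runtime adaptive compression block.

    prev, curr: dicts of context readings (e.g. battery level, free cache).
    elapsed_s:  seconds since the last adaptation.
    Both thresholds are illustrative assumptions.
    """
    # Mode 1: a pre-defined frequency (time slice) has elapsed.
    if elapsed_s >= period_s:
        return True
    # Mode 2: some reading changed noticeably relative to its previous value.
    for key, old in prev.items():
        new = curr.get(key, old)
        denom = max(abs(old), 1e-9)  # guard against division by zero
        if abs(new - old) / denom >= rel_change:
            return True
    return False
```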

4. Retraining-free and Self-evolutionary Network Design

This section presents the design of the retraining-free and self-evolutionary network. The self-evolutionary network consists of a high-performance backbone network and multiple compression operator-variants.

Figure 5. Four categories of compression operators that synthesize multiple scaling dimensions: (a) operator $\delta_{1}$ , (b) operator $\delta_{2}$ , (c) operator $\delta_{3}$ , (d) operator $\delta_{4}$ .

4.1. Compression Operators

This paper focuses on the configuration optimization of convolutional architectures, operations, and activations, because recent successful DNN models tend to place more parameters in convolutional layers and use fewer fully-connected layers (Wu et al., 2018b; Chen et al., 2016; Cai et al., 2019). Building upon existing compression experience, we propose the following alternative convolutional compression operators that synthesize multiple scaling dimensions (e.g. width, depth, and connection).

• Compression operator $\delta_{1}$ : multi-branch channel merging techniques (e.g. Fire block (Iandola et al., 2016)) increase the model depth with fewer parameters by replacing one conv layer with two conv layers (i.e., a squeeze layer and an expand layer), which are elaborately designed to decrease the kernel size and channel size per unit.

• Compression operator $\delta_{2}$ : low-rank convolution factorization techniques (e.g. SVD-based (Wu et al., 2018b) or sparse coding-based (Bhattacharya and Lane, 2016) factorization, or depth/group-wise convolution (Li et al., 2019)) decompose a conv layer into several conv layers with smaller kernel sizes, which leads to a growing model depth with fewer parameters.

• Compression operator $\delta_{3}$ : channel-wise scaling techniques (e.g. channel-level pruning (Cai et al., 2019) and channel-wise architecture noise injection (Yang et al., 2020)) allow variable operator-variants to be sampled by tuning the channel-wise scale.

• Compression operator $\delta_{4}$ : depth scaling techniques (e.g. depth-elastic pruning (Cai et al., 2019), residual connection (He et al., 2016)) derive a shallower variant-network from a backbone-network by skipping connections.
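As a rough illustration of how these operators change the parameter count of a single layer, the sketch below compares a standard convolution with a Fire-style block (as in $\delta_{1}$), a depthwise-separable factorization (one instance of $\delta_{2}$), and channel pruning (as in $\delta_{3}$). The layer sizes in the usage note are illustrative assumptions, biases are ignored, and the helper names are hypothetical.

```python
def conv_params(m, n, k):
    """Weights of a standard k x k convolution with m input and n output channels (no bias)."""
    return m * n * k * k


def fire_params(m, squeeze, expand1x1, expand3x3):
    """Weights of a Fire-style block: 1x1 squeeze, then parallel 1x1 and 3x3 expand layers."""
    return (conv_params(m, squeeze, 1)
            + conv_params(squeeze, expand1x1, 1)
            + conv_params(squeeze, expand3x3, 3))


def depthwise_separable_params(m, n, k):
    """Weights of a depthwise k x k convolution followed by a 1x1 pointwise convolution."""
    return m * k * k + conv_params(m, n, 1)


def prune_channels(n, ratio):
    """Output channels left after removing a fraction `ratio` of them (at least one is kept)."""
    return max(1, round(n * (1.0 - ratio)))
```

For a hypothetical layer with 128 input channels, 128 output channels, and a 3x3 kernel, `conv_params(128, 128, 3)` gives 147,456 weights, `fire_params(128, 16, 64, 64)` gives 12,288, `depthwise_separable_params(128, 128, 3)` gives 17,536, and pruning half of the output channels (`conv_params(128, prune_channels(128, 0.5), 3)`) gives 73,728.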

4.2. Ensemble Training of Self-evolutionary Network

We put the retraining process ahead in the ensemble training of the self-evolutionary network at design time to get rid of weight retraining during dynamic inference. Therefore, the self-evolutionary network training is an ensemble of a backbone-net and multiple variant-nets derived by various convolutional compression operators.

4.2.1. Primer on Parameter Recycling

We refer to the parameter recycling strategy (Wu et al., 2018b; Cai et al., 2017) to recycle the backbone-net weights and take less search time than those searching from scratch. The weight recycling strategy is conducive to making maximum use of the existing architectures’ experience to reduce the time complexity of the searching process. In particular, we reuse the existing hand-crafted/elaborated high-performance DNN, including architecture and weight, as an initialization point (i.e., backbone network). And then, we leverage an automated optimizer (i.e., search strategy) to only search for the optimal architecture adjustments (e.g. widening a certain network, skipping connections) to obtain a promising new model. However, these methods challenge the ensemble training of multiple variant models that drift away from the backbone-net’s initial configuration. In detail, the training of a variant’s weights will likely interfere/override the weights that have been learned for other variants and thus degrade the overall performance. We note that the catastrophic interference problem when multiple variant-nets share parameters is a long-standing problem in itself (French, 1999; Kirkpatrick et al., 2017).

4.2.2. Training Strategy

Our goal with the self-evolutionary network is to integrate multiple versions of DNN architectures and the corresponding weights introduced by different compression operators. The above-mentioned parameter recycling strategy offers great potential for this. To avoid the catastrophic interference problem caused by parameter recycling, we present a novel training strategy that uses parameter transformation and knowledge distillation to preserve the parametric function of multiple variant-nets. In detail, we first perform standard back-propagation to train a high-accuracy backbone-net. Afterwards, we leverage the parameter transformation techniques for learning compression operators $\delta_{1}$ and $\delta_{2}$ , the knowledge distillation techniques for learning compression operators $\delta_{3}$ and $\delta_{4}$ , and the trainable channel-wise mutation techniques for learning $\delta_{3}$ .

(1) Parameter transformations for learning variant-nets derived by compression operators $\delta_{1}$ and $\delta_{2}$ . We consider function-preserving parameter transformation when recycling parameters. It allows us to initialize a new variant-net, derived by a compression operator, so that it preserves the function of the given backbone-net but uses a different parameterization that can be further trained to improve performance (Cai et al., 2017). We transform the original convolutional parameters and store an extra copy of the weights for $\delta_{1}$ and $\delta_{2}$ . We further set an accuracy target as a threshold, so that the transformed parameters of a compression operator-variant are fine-tuned only when its accuracy falls below that target. Thus, we need only a small number of extra parameters to store the transformed parameters for compression operators $\delta_{1}$ and $\delta_{2}$ , and we only access the weights of the deterministically selected compression operator to evolve the model weights.

(2) Knowledge distillation for learning variant-nets derived by compression operators $\delta_{3}$ and $\delta_{4}$ . We allow each conv layer to choose its depth/channel compression ratio flexibly, and we adopt knowledge distillation techniques (Cai et al., 2019) to fine-tune the parameters of compression operators $\delta_{3}$ and $\delta_{4}$ with different compression ratios (e.g. $20\%,50\%$ ). As a result, AdaSpring can flexibly switch between the different parameter weights of the channel-wise and depth-wise scaling operators and avoid weight interference. We also use a trainable channel-wise and depth-wise architecture ranking as the weight importance criterion to guide adaptive layer slimming and scaling. In particular, we pre-train a self-evolutionary network to evaluate the overall performance (e.g. accuracy drop, parameter arithmetic intensity, activation arithmetic intensity, and latency) of the variant-networks compressed by different operators. These evaluations serve as a prior-based architecture importance ranking that guides the runtime scaling to shrink unimportant layers/channels first, rather than scaling randomly.

(3) Trainable channel-wise mutation for training variant-nets derived by compression operator $\delta_{3}$ . To maintain a good diversity of solutions, we present a novel trainable architecture mutation technique that injects architecture variance into the compressed network. This idea is supported by recent DNN studies, which have verified the dominant effect of model architecture on accuracy compared to model parameters (Yu and Huang, 2019). AdaSpring can directly use the trainable architecture mutation technique with diverse noise magnitudes because the channel importance ranking of the backbone-net is consistent at both design time and runtime. Specifically, we inject Gaussian noise into the channel-wise operator’s scaling ratio (i.e., $\delta_{3}$ ), and the noise magnitude is trainable according to the channel importance ranking: the more important the channel is, the lower the intensity of the noise we inject. As we will evaluate in $\S$ 6.5, this plays a nontrivial role in AdaSpring’s progressive shortest encoding of DNN compression configurations and in the runtime search, boosting the quality and efficiency of runtime DNN adaptation.
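The importance-dependent noise injection described in item (3) can be illustrated with the simplified sketch below. It is not the trainable mechanism of the paper: here the noise magnitude is a fixed, hand-written function of an importance score rather than a learned quantity, and the per-layer ratio granularity, the linear mapping, and the parameters `sigma_max` and `max_ratio` are assumptions made for illustration.

```python
import random


def mutate_ratios(base_ratios, importance, sigma_max=0.1, max_ratio=0.9, seed=None):
    """Perturb channel scaling ratios with importance-dependent Gaussian noise.

    base_ratios: nominal channel compression ratio per layer, each in [0, max_ratio].
    importance:  importance score per layer, assumed to lie in [0, 1].
    """
    rng = random.Random(seed)
    mutated = []
    for ratio, imp in zip(base_ratios, importance):
        # Assumed mapping: the more important the channels, the smaller the noise.
        sigma = sigma_max * (1.0 - imp)
        noisy = ratio + rng.gauss(0.0, sigma)
        # Clip so that the result remains a valid compression ratio.
        mutated.append(min(max(noisy, 0.0), max_ratio))
    return mutated
```

With an importance of 1.0 the ratio is left unchanged, while an importance of 0.0 receives the full noise magnitude `sigma_max`; passing a fixed `seed` makes the mutation reproducible.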

Besides, to enable the stable ensemble training of multiple variant-nets, we leverage the mini-batch techniques to split the training data into small batches. We normalize the gradient to reduce the interference caused by gradient variance (Li et al., 2014).

5. Runtime Adaptive Compression

This section presents how AdaSpring quickly searches for the most suitable combination of retraining-free compression operators, from a flexible and elite space, to reconfigure the trained self-evolutionary network on-the-fly.

5.1. Flexible and Elite Search Space

5.1.1. Multi-granularity Search Space

We form an elite search space, which includes a set of coarse-grained compression operators (e.g. Fire block (Iandola et al., 2016), SVD-based (Wu et al., 2018b), sparse coding-based (Bhattacharya and Lane, 2016) factorization) for faster convergence, and fine-grained compression operators (e.g. channel-level and depth-level pruning and channel-wise randomization) for better diversity. Consider a convolutional layer whose total parameters are (input feature map channel size $M$ ) $\times$ (output feature map channel size $N$ ) $\times$ (kernel width/height $S_{P}$ ) $\times$ (kernel width/height $S_{P}$ ), and whose total activations are $N\times$ $S_{A}$ (output feature map width/height) $\times$ $S_{A}$ . Scaling the input feature map, kernel, channel, or output feature map can each shrink the model complexity. We empirically observe that different scaling dimensions are not independent. Firstly, there is no single compression technique that achieves the best application-driven performance (i.e., $A$ , $T$ , $E$ , $C$ , $S_{a}$ , and $S_{p}$ ). It is necessary to combine several compression techniques. Secondly, as mentioned in $\S$ 2, few existing compression techniques are retraining-free or dedicated to optimizing the holistic hardware efficiency across various platforms. These findings suggest that we should flexibly coordinate and balance multiple scaling dimensions by searching for the best combination of compression operators, rather than scaling a single dimension (e.g. model pruning).
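As a worked instance of the two counts above, take the illustrative sizes $M=64$ , $N=128$ , $S_{P}=3$ , and $S_{A}=56$ (these values are assumptions chosen for the example, not taken from the paper):

```latex
\begin{aligned}
\text{parameters}  &= M \times N \times S_{P} \times S_{P} = 64 \times 128 \times 3 \times 3 = 73{,}728,\\
\text{activations} &= N \times S_{A} \times S_{A} = 128 \times 56 \times 56 = 401{,}408.
\end{aligned}
```

In this example, halving $N$ halves both counts, whereas halving $S_{A}$ quarters the activations and leaves the parameters unchanged, so the scaling dimensions affect the two counts differently.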

5.1.2. Hardware Efficiency-guided Combination.

We argue that the widely used parameter count, MAC amount, and speedup ratio are not good approximations of hardware efficiency, which heavily depends on the memory movement and bandwidth bound. For example, Jha et al. (Jha et al., 2019) reported that although SqueezeNet (Iandola et al., 2016) has $51.8\times$ fewer parameters than AlexNet (Krizhevsky et al., 2012), it consumes 33 $\%$ more energy due to its larger amount of activations and data movement. We also identify that merely cutting down the parameter size may lead to an increase in activation size, which, in turn, increases the memory footprint and energy consumption (Jha and Mittal, 2020). Indeed, recent studies (Jha et al., 2019; Jha and Mittal, 2020) have shown that the energy consumption of CNNs mainly depends on the memory movement, memory reuse, and bandwidth bound.

To this end, we present a controllable hardware-efficiency criterion, i.e., arithmetic intensity, to guide the automated combination of compression operators in different layers. Inspired by hardware studies (Jha et al., 2019; Jha and Mittal, 2020), we leverage the arithmetic intensity as a proxy for the degree of reuse of parameters and activations and for the energy consumption required for processing inputs. Because the measurement of hardware-related metrics, especially energy efficiency, is not straightforward, we present three hardware-efficiency metrics to predict how efficiently an arithmetic operation can reuse the data fetched from different levels in the memory hierarchy and how efficiently the arithmetic operation is executed.

•

Computation/parameter ratio $C/S_{p}$ : is an approximation of the parameter arithmetic intensity;
•

Computation/Activation ratio $C/S_{a}$ : is the proxy of the activation arithmetic intensity;
•

Latency $T$ : includes the measured inference time $T_{inference}$ of a specialized model and the time $T_{load}$ for loading parameters and activations for convolution computing on the target mobile device, i.e., $T=T_{load}+T_{inference}$ .

We separately evaluate $C/S_{p}$ and $C/S_{a}$ and then aggregate them together by the aggregation coefficients $\mu_{1}$ and $\mu_{2}$ , to better profile the energy efficiency of each candidate compression operator $i$ .

(2) $E\approx\mu_{1}C/S_{p}+\mu_{2}C/S_{a}$

Based on these criteria, AdaSpring automatically selects and combines compression operators to maximize the aggregated value of $\mu_{1}C/S_{p}+\mu_{2}C/S_{a}$ , subject to the upper limit of the calculation intensity of the mobile platform. It also tests the real latency $T$ against the latency budget to prevent the exploration of invalid solutions. We empirically set $\mu_{2}>\mu_{1}$ (e.g. $\mu_{1}=0.4,\mu_{2}=0.6$ as default) since $C/S_{a}$ contributes more to the memory footprint (as benchmarked in $\S$ 6.5). AdaSpring also discovers some novel combinations for optimizing the underlying data movement. For example, we suggest the $\delta_{1}+\delta_{3}$ and $\delta_{2}+\delta_{4}$ groups (as discussed in $\S$ 6). The fine-grained channel-wise scaling operators (e.g. $\delta_{3}$ , $\delta_{4}$ ) readjust the channel size, MAC amount, and output activation size of the conv layers to smooth out the bandwidth bound problem caused by the coarse-grained operators (e.g. $\delta_{1}$ and $\delta_{2}$ ). The hardware efficiency-guided combination of several compression operators also helps to avoid a blind, explosive enumeration of combinations.
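The selection rule described above can be sketched in a few lines. This is an illustrative sketch only: the candidate dictionaries, their field names, and all numbers are hypothetical, and the platform's upper limit on calculation intensity is not modeled. Only the aggregation with $\mu_{1}=0.4$ and $\mu_{2}=0.6$ and the latency check follow the text.

```python
def efficiency_proxy(macs, param_size, act_size, mu1=0.4, mu2=0.6):
    """Aggregate parameter and activation arithmetic intensity (Equ. 2)."""
    return mu1 * macs / param_size + mu2 * macs / act_size


def pick_group(candidates, latency_budget_ms):
    """Return the candidate within the latency budget with the largest proxy."""
    valid = [c for c in candidates if c["latency_ms"] <= latency_budget_ms]
    if not valid:
        return None
    return max(valid, key=lambda c: efficiency_proxy(
        c["macs"], c["params"], c["acts"]))


# Hypothetical candidates; the numbers are illustrative, not measured.
candidates = [
    {"name": "a", "macs": 1000, "params": 10, "acts": 20, "latency_ms": 12},
    {"name": "b", "macs": 1000, "params": 20, "acts": 5, "latency_ms": 18},
    {"name": "c", "macs": 2000, "params": 10, "acts": 5, "latency_ms": 30},
]
best = pick_group(candidates, latency_budget_ms=20)
```

Candidate `c` has the highest proxy value but is rejected by the latency test, so the sketch returns `b`, whose small activation size dominates the weighted sum.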

5.2. Runtime Search Strategy

To evolve DNN architecture and weight to an optimal configuration at runtime, we propose the runtime search strategy based on the above flexible and elite search space.

5.2.1. Progressive Shortest Encoding of Candidate

A complex self-evolutionary network contains many combinations of compression operator variants and configurations, so systematically and generically choosing the right candidate configurations and encoding them into the representation used by a search algorithm is difficult. For optimizing DNN compression configurations at runtime, such representations define the potential search space of the problem to be explored. Given that some candidate configurations do not contribute to the specific performance optimization demand, or can be represented by other candidates, the shortest actual encoding benefits both the search result (i.e., the model evolution plan) and the search overhead.

As shown in Figure 7(a), the classic binary encoding of all compression operator configurations across all layers is redundant. Specifically, consider a backbone network with $N$ conv layers to be selectively compressed, and take $N=3$ as an example. A classic binary encoding method needs $3$ bits to record whether a specific layer participates in compression or not, and another $3*4$ bits (four bits to represent $4^{2}$ selective operators per layer) to record the selected operators. In this way, the encoding length is $N+MN=(M+1)N$ when we have $M$ optional compression operators, and the search space derived from this encoding is $2^{N}\times{M^{N}}$ , i.e., $O(M^{N})$ . Furthermore, it increases exponentially as the number of optional compression operators increases.

To better represent the fundamental search space, we propose the progressive shortest encoding of compression operator configurations in a layer-dependent manner. As we will show in $\S$ 6.5.3, it improves the search efficiency by one order of magnitude, compared to the classic binary encoding. As shown in Figure 7(b), we use $N$ digits to record the count of layers that have been compressed. The first digit represents the compressed layer count, and the following variable-length digits record the selected compression operator index of each layer. For example, the value $1$ of the first digit means that only the first conv layer is compressed on-demand. Thereby, only one additional digit is needed to record the compression operator index (i.e., $1$ ) for it. Afterward, AdaSpring inherits the above $2$ -digit string and injects channel-wise variance to mutate the inherited survival $2$ -digit encoding string. We describe the channel-wise variance mutation process in $\S$ 5.2. We then turn to the second adaptable conv layer. If the first digit of compressed layer count is updated to 2, we append one more digit indicating the selected compression operator index to the survival $2$ -digit encoding string. Thus, the encoding length progressively increases from 2 to ( $N+1$ ), and the complexity of the search space is reduced to $O(N^{2})$ . The progressive shortest encoding of the candidate is conducive to the flexibility of AdaSpring and prevents unnecessary exploration.
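The two encodings can be compared with a short sketch. It assumes that an encoding is a plain Python list whose first entry is the compressed layer count, followed by one operator index per compressed layer; the function names are hypothetical and the channel-wise mutation step is omitted.

```python
def binary_encoding_length(num_layers, bits_per_layer):
    """Length N + M*N of the classic binary encoding."""
    return num_layers + bits_per_layer * num_layers


def binary_search_space(num_layers, num_operators):
    """Size 2^N * M^N of the search space derived from binary encoding."""
    return (2 ** num_layers) * (num_operators ** num_layers)


def progressive_encode(operator_indices):
    """Encode as [compressed layer count, operator index per layer...]."""
    return [len(operator_indices)] + list(operator_indices)


def progressive_decode(code):
    """Recover the per-layer operator indices and check the layer count."""
    count, ops = code[0], code[1:]
    if len(ops) != count:
        raise ValueError("layer count does not match the number of indices")
    return ops


# N = 3 layers, M = 4: the binary encoding always needs 15 positions,
# while the progressive encoding grows from 2 to N + 1 = 4 digits.
first = progressive_encode([1])          # only the first layer is compressed
final = progressive_encode([1, 3, 2])    # all three layers are compressed
```

The operator indices in `final` are arbitrary example values; the point is that the string is extended by one digit per layer instead of reserving space for every layer up front.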

5.2.2. Runtime3C Search Algorithm

This subsection presents the Runtime3C search algorithm, a Pareto optimal decision-based search algorithm that picks a single optimal solution from the search space at runtime. To the best of our knowledge, many widely used universal search algorithms (e.g. evolutionary algorithms) are not designed to optimize the runtime adaptive compression problem or to handle the dependency constraints among multiple DNN performance metrics. We heuristically regard the selection of compression operators for each layer as a single-layer optimization subproblem, and solve the subproblems in a collaborative manner to derive the most suitable solution quickly and effectively.

Input: (a) Deployment context: dynamic context constraints $\theta_{A_{loss}},T_{bgt},S_{bgt}$ ; relative importance $\lambda_{1}$ , $\lambda_{2}$ ;
(b) DNN: a trained self-evolutionary network comprising of a backbone-net $\Omega$ and multiple retraining-free compression operator-variants $\mathcal{\delta}$ , $\delta\in\mathrm{\Delta}$
Output: A reconfigured DNN model $\Omega(\delta,w_{\delta})$
1 Transforming search space from $(\delta\in\mathrm{\Delta})$ to hardware-efficient group $(group(\delta)\in\mathrm{\Delta^{\prime}})$ ;
2 while (layer $i$ is conv type) $\&\&$ (layer $i$ is to be compressed) do
3     Inherit 3C configurations from layer $(i-1)$ : $\mathcal{E}\{\Omega,group(\delta_{(i-1)})\}$ ;
4     Select 2 candidates $group(\delta_{j}),group(\delta_{k})$ from the Pareto front of the valid space $\mathrm{\Delta^{\prime}}$ ;
5     Mutate and augment 2 candidates to 6 candidates using the trained channel-wise variances $\epsilon$ ;
6     Select the Pareto-optimal candidate (i.e., min $A_{loss}$ while max $E$ ) as the survival candidate $group(\delta_{i})$ ;
7     Evolve weights $w_{\delta_{i}}$ via parameter transformation ;
8     Encode candidates $\mathcal{E}\{\Omega,group(\delta_{i})\}$ ;
9     Forward DNN $\Omega(group(\delta_{i}))$ to measure $A$ , $T$ , $S_{p}$ , $S_{a}$ , $C$ , $E$ of the overall model $\Omega(\delta_{i})$ ;
10     Judge whether the DNN performance satisfy constraints of the current deployment context;
11     if context constraints satisfied then
12         Searching stops;
13     end if
14     layer ++;
15 end while

*Note: we start exploring compression operator configurations from the second conv layer by default to preserve more input details.

Algorithm 1 Pareto decision-based Runtime search algorithm for convolutional compression operator configurations (Runtime3C)

As shown in Algorithm 1, each subproblem at layer $i$ is to search for the optimal group of compression operators for optimizing the overall performance of the entire DNN. Starting from the second conv layer by default, AdaSpring selects two candidate solutions at layer $i$ from the Pareto front of the selectable compression operator groups for optimizing the accuracy and energy efficiency of the entire model (line 4). In detail, the two picked candidates are the best two compromises in $\lambda_{1}log(A_{loss})$ vs. $\lambda_{2}log(E)$ , from the Pareto front within the valid search space, i.e., $A_{loss}>5\%$ . Here, we leverage the ranking of the pre-tested accuracy and energy cost of the DNNs to establish the Pareto front. The accuracy ranking derived from historical results is consistent with the ranking of the actual accuracy of these DNNs measured on mobile devices. We then mutate and augment the candidates from two to six by injecting the channel-wise variance into the candidate configurations (line 5). The trained architecture importance is the criterion for Gaussian noise injection. This process can improve the diversity of the subproblem solutions as well as the performance of the global solution, inspired by the genetic algorithm in adaptive software engineering (Chen et al., 2018). We choose the best candidate as the survival subproblem solution for compressing layer $i$ (line 6). Afterward, the $i$ -th layer’s survival solution is used to reconfigure the $i$ -th layer and becomes the initial state of the subproblem at the $(i+1)$ -th layer. We fix the selected compression configurations for the $i$ -th layer and repeat the above search steps (line $3\sim 9$ ) to specialize the optimal compression strategies for the $(i+1)$ -th layer. Once the model satisfies the dynamic constraints on latency $T_{bgt}(t)$ and memory $S_{bgt}(t)$ at time $t$ , the subproblem expansion stops (line 12). Finally, the algorithm outputs the global compression configuration solution.
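The layer-by-layer loop of Algorithm 1 can be illustrated with a toy sketch. Everything below is an assumption made for illustration: operator groups are modeled as `(name, channel keep-ratio)` pairs, `toy_evaluate` is a made-up cost model rather than measured data, the two objectives are combined as $\lambda_{1}log(A_{loss})-\lambda_{2}log(E)$ (one possible scalarization), and weight evolution, the encoding step, and the default start from the second conv layer are omitted.

```python
import math
import random


def pareto_front(cands):
    """Candidates not dominated in (lower acc_loss, higher eff)."""
    def dominates(o, c):
        return (o["acc_loss"] <= c["acc_loss"] and o["eff"] >= c["eff"]
                and (o["acc_loss"] < c["acc_loss"] or o["eff"] > c["eff"]))
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


def score(c, lam1, lam2):
    """Assumed scalarization of the two objectives (lower is better)."""
    return lam1 * math.log(c["acc_loss"]) - lam2 * math.log(c["eff"])


def mutate(group, rng, sigma=0.1):
    """Perturb the channel keep-ratio of a group with Gaussian noise."""
    name, ratio = group
    return (name, min(1.0, max(0.1, ratio + rng.gauss(0.0, sigma))))


def runtime_search(num_layers, groups, evaluate, lam1, lam2, t_bgt, s_bgt, rng):
    config, metrics = [], None
    for _ in range(num_layers):
        # Evaluate every group on top of the inherited configuration.
        cands = [dict(evaluate(config + [g]), group=g) for g in groups]
        front = sorted(pareto_front(cands), key=lambda c: score(c, lam1, lam2))
        pool = [c["group"] for c in front[:2]]
        # Augment each selected candidate with two mutants (2 -> up to 6).
        pool += [mutate(g, rng) for g in list(pool) for _ in range(2)]
        scored = [dict(evaluate(config + [g]), group=g) for g in pool]
        metrics = min(scored, key=lambda c: score(c, lam1, lam2))
        config.append(metrics["group"])      # survival candidate for this layer
        if metrics["latency"] <= t_bgt and metrics["size"] <= s_bgt:
            break                            # context constraints satisfied
    return config, metrics


def toy_evaluate(config):
    """Hypothetical cost model for illustration only (not measured data)."""
    keep = 1.0
    for _, ratio in config:
        keep *= ratio
    return {
        "acc_loss": 0.5 + 5.0 * (1.0 - keep),   # percent
        "eff": 100.0 + 150.0 * (1.0 - keep),    # proxy value
        "latency": 30.0 * keep,                 # ms
        "size": 4.0 * keep,                     # MB
    }


groups = [("fire+prune", 0.5), ("svd+prune", 0.75), ("fire+depth", 0.9)]
config, metrics = runtime_search(4, groups, toy_evaluate, 0.5, 0.5,
                                 t_bgt=20.0, s_bgt=2.0, rng=random.Random(0))
```

Fixing each layer's survivor before moving on keeps the number of evaluations per layer bounded (here at most nine), which is the property the text relies on for a fast runtime search.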

6. Evaluation

This section presents the evaluation of AdaSpring over different mobile applications on diverse mobile and embedded platforms with dynamic deployment context. We compare AdaSpring against ten alternative methods reported in the state-of-the-art literature.

Table 1. Summary of the applications and corresponding datasets for evaluating AdaSpring

No.	Target task (utility label)	Dataset	Description
$D_{1}$	Image ( $10$ classes)	CIFAR-100(Krizhevsky, 2009a)	$60,000$ images
$D_{2}$	Image ( $5$ classes)	ImageNet(Deng et al., 2009)	$65,000$ images
$D_{3}$	Acoustic event ( $9$ classes)	UbiSound(Sicong et al., 2017)	$7,500$ audio clips
$D_{4}$	Human activity ( $7$ classes)	Har(UCI, 2017)	$10,000$ records of accelerometer and gyroscope
$D_{5}$	Driver behavior ( $10$ classes)	StateFarm(Kaggle, 2019)	$22,424$ images

6.1. Experiment Setup

We first present the settings for our evaluation.

System Implementation. We implement AdaSpring’s offline block with TensorFlow (Google, 2017) in Python on the server side to train the self-evolutionary network (see $\S$ 4). And we realize the AdaSpring’s online blocks on the mobile and embedded platforms to adjust the DNN configurations on the fly for better inference performance. The self-evolutionary network (i.e., a backbone-net and multiple variant compression operators), generated by AdaSpring’s offline component, is then loaded into the target platform. To further reduce the memory access cost, we load DNN parameters from L2-Cache memory.

Evaluation Applications/Datasets. We use five commonly used mobile applications/datasets to evaluate AdaSpring’s performance, as elaborated in Table 1. Specifically, we test AdaSpring for mobile image classification (D1: CIFAR-100 (Krizhevsky, 2009b), D2: ImageNet (Deng et al., 2009)), mobile acoustic event awareness (D3: UbiSound (Sicong et al., 2017)), mobile human activity sensing (D4: Har (UCI, 2017)), and mobile driver behavior prediction (D5: StateFarm (Kaggle, 2019)).

Mobile Platforms with Dynamic Context Settings. We evaluate AdaSpring on three categories of commonly used mobile and embedded platforms, including one personal smartphone, i.e., Xiaomi RedMi 3S (device1), one embedded development board, i.e., Raspberry Pi 4B (device3), and one mobile robot platform, i.e., NVIDIA Jetbot (device4) loaded with the mobile development board. They are equipped with diverse processors, storage, and battery capacity. The dynamic context is formulated by the time-varying latency budget $T_{bgt}(t)$ , storage budget $S_{bgt}(t)$ , and the relative importance coefficients of the accuracy and energy efficiency objectives.

Comparison Baselines. We employ three categories of DNN specialization baselines to evaluate AdaSpring. The detailed settings of the ten baselines from the three categories are as below. Firstly, the hand-crafted compression baselines rely on manual design to realize efficient DNN compression. They provide a high standard for AdaSpring to tune the specialized DNNs’ performance tradeoff between accuracy, latency, and resource efficiency.

•

Fire (Iandola et al., 2016) presented in SqueezeNet reduces filter size and decreases input channels using squeeze layers.
•

MobileNetV2 (Sandler et al., 2018) replaces the traditional convolutional operation by an inverted residual with the linear bottleneck to expand module to high dimension and then filter with a depth-wise convolution.
•

SVD-based convolutional decomposition technique (Lane et al., 2016) introduces an extra conv layer between $conv_{i}$ and $conv_{(i+1)}$ using the singular value decomposition (SVD) based parameter matrix factorization. The number of neurons $k$ in the inserted layer is set according to the dynamic neuron numbers $m$ in $conv_{i}$ , i.e., $k=m/12$ .
•

Sparse coding-based convolutional decomposition technique (Bhattacharya and Lane, 2016) inserts a conv layer between $conv_{i}$ and $conv_{(i+1)}$ using the sparse coding-based parameter matrix factorization. The k-basis dictionary is dynamically determined by the neuron number $m$ in $conv_{i}$ , i.e., $k=m/6$ .
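The effect of the inserted $k$-unit layer in the two decomposition baselines can be illustrated with a parameter count. This is a simplified sketch that treats the weights between $conv_{i}$ and $conv_{(i+1)}$ as a dense $m\times n$ matrix, which is an assumption; the function name and the example sizes are hypothetical. Only the ratios $k=m/12$ and $k=m/6$ come from the text.

```python
def factorized_params(m, n, divisor):
    """Parameter count before and after inserting a k-unit layer.

    An m x n weight matrix is replaced by an m x k and a k x n matrix,
    with k = m / divisor (integer division, at least 1).
    """
    k = max(1, m // divisor)
    original = m * n
    factorized = m * k + k * n
    return k, original, factorized


# Hypothetical layer sizes m = 96, n = 256.
svd_like = factorized_params(96, 256, divisor=12)     # k = m / 12
sparse_like = factorized_params(96, 256, divisor=6)   # k = m / 6
```

A smaller $k$ saves more parameters but leaves fewer basis directions to approximate the original weights, which is why the two baselines use different ratios.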

Secondly, the on-demand DNN compression baseline methods adopt a trainable optimizer to automatically find the most suitable DNN compression strategies for various mobile platforms. These baselines provide a strict benchmark against which we can validate that both searching and retraining costs are bottlenecks for runtime adaptation demands.

•

AdaDeep (Liu et al., 2020) automatically selects and combines compression techniques to generate a specialized DNN that balances accuracy and resource constraints.
•

ProxylessNAS (Cai et al., 2018) directly learns architectures without any proxy while still allowing a large candidate set and removing the restriction of repeating blocks.
•

Once-for-all(OFA) (Cai et al., 2019) obtains a specialized sub-network by selecting from the once-for-all network that supports diverse architectural settings without additional training.

Thirdly, since runtime adaptive DNN compression needs to search quickly for the most suitable combination of retraining-free compression techniques, we select two baseline optimization methods to compare with AdaSpring. Here, the baseline optimizers represent two intuitive searching ideas for the runtime adaptive compression of DNN configurations.

•

Exhaustive optimizer tests the performance of all combinations of compression operators on the validation set and then selects the one with the best tradeoff based on the fixed performance ranking. It then fixes the compression operators and only scales down their hyperparameters, i.e., the compression ratio, to satisfy the dynamic resource budgets.
•

Greedy optimizer selects the best compression operator layer-by-layer that obtains the best tradeoff between accuracy and parameter size, in which the relative importance is equally set to a fixed value of $0.5$ .
•

AdaSpring selects and applies the most suitable combination of compression operators into the self-evolutionary backbone network for accuracy and resource efficiency tradeoff.

Table 2. Performance comparison of AdaSpring with three categories of baselines on Raspberry Pi 4B (device 2) using CIFAR-100 ( $D_{1}$ ). The backbone network includes 5 conv layers, and 1 GAP layer.

Baselines	DNN compression techniques	Performance of specialized DNN					Performance of DNN specialization scheme			
Baselines	DNN compression techniques	$A$ ( $\%$ ) *1	$T$ (ms)	$C/S_{p}$	$C/S_{a}$	$En$ (mJ)	Search cost	Retraining cost (hours)	Scale down	Scale up
Stand-alone compression	Fire (Iandola et al., 2016)	72.3	24.7	81.2	394.7	3.1		1.5N *2	fix	
Stand-alone compression	MobileNetV2 (Sandler et al., 2018)	72.6	48.1	84.3	128.4	5.2		1.8N *2	fix	—
Stand-alone compression	SVD decomposition (Lane et al., 2016)	71.2	21.7	68.6	165.8	4.8		2.3N *2	scalable	—
Stand-alone compression	Sparse coding decomposition (Bhattacharya and Lane, 2016)	72.9	22.3	69.8	195.2	4.6		2.3N *2	scalable	—
On-demand compression	AdaDeep (Liu et al., 2020)	73.5	21.9	78.3	264.6	3.5	18N hours *2	38N *2	scalable	—
On-demand compression	ProxylessNAS (Cai et al., 2018)	74.2	49.5	121.3	232.1	3.8	196N hours	29N *2	scalable	—
On-demand compression	OFA (Cai et al., 2019)	71.4	51.2	123.4	257.3	3.1	hours		scalable	
Runtime adaptive compression	Exhaustive optimizer	58.3	21.1	81.2	283.2	2.9				—
Runtime adaptive compression	Greedy optimizer	65.3	16.7	83.5	298.4	3.1	ms			—
Runtime adaptive compression	AdaSpring	74.1	15.6	158.9	358.7	1.9	3.8 ms		scalable	

*1 We test the average DNN accuracy at three dynamic moments.
*2 The $N$ in search cost and retraining cost columns shows that the cost is linear to the number of deployment contexts.

6.2. Performance Comparison

We evaluate AdaSpring in terms of the specialized DNNs’ running performance (i.e., accuracy $A$ , amount of MACs $C$ , parameter arithmetic intensity $C/S_{p}$ , activation arithmetic intensity $C/S_{a}$ , and energy consumption $En$ ) and the specialization methods’ all-around performance (i.e., search cost, retraining cost, and scaling flexibility). As shown in Table 2, we compare AdaSpring’s performance with ten baselines. In this thread of experiments, we use the same mobile sensing task (i.e., image recognition using the CIFAR-100 ( $D_{1}$ ) dataset) and target embedded platform (i.e., Raspberry Pi 4B) for six state-of-the-art baselines and AdaSpring for a fair comparison. We adopt different baseline methods to specialize the DNN architectures and weights for optimizing accuracy and resource efficiency objectives with dynamically specified constraints (see Equ. 3.2). Here, the relative importance coefficients (i.e., $\lambda_{1}$ and $\lambda_{2}$ ) are dynamically determined by the remaining battery percentage $E_{remaining}$ of the target platform, i.e., $\lambda_{2}=max\{0.3,E_{remaining}\}$ , and $\lambda_{1}=1-\lambda_{2}$ . Afterwards, we test the specialized DNN’s running performance on a Raspberry Pi 4B platform. To mitigate the effect of noise and increase the robustness of performance measurements, we repeat the steps mentioned above five times and take the average.
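The coefficient rule in this setup is simple enough to write down directly. The following sketch implements the two formulas exactly as stated in the paragraph above; the function name and the range check are additions made for illustration.

```python
def importance_coefficients(e_remaining):
    """Return (lambda_1, lambda_2) from the remaining battery fraction.

    Follows lambda_2 = max{0.3, E_remaining} and lambda_1 = 1 - lambda_2,
    with E_remaining expressed as a fraction in [0, 1].
    """
    if not 0.0 <= e_remaining <= 1.0:
        raise ValueError("e_remaining must be a fraction in [0, 1]")
    lam2 = max(0.3, e_remaining)
    lam1 = 1.0 - lam2
    return lam1, lam2


lam1_high, lam2_high = importance_coefficients(0.86)   # battery at 86%
lam1_low, lam2_low = importance_coefficients(0.20)     # battery at 20%
```

The floor of 0.3 keeps $\lambda_{2}$ from vanishing, so neither objective is ever dropped from the tradeoff.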

Table 3. Performance of AdaSpring evaluated on different tasks/datasets, compared to the corresponding DNNs compressed by depthwise convolutional decomposition, i.e., MobileNet.

Mobile Task	Compression operator configurations	Compared to the performance of MobileNet network
Mobile Task	Compression operator configurations	$A$ loss	$E$	$T$	$C$	$S_{p}$	$S_{a}$
CIFAR-100( $D_{1}$ )	$\delta_{1}+\delta_{3}(50\%)$	-2.1%	2.5 $\times{}$	1.2 $\times{}$	5.6 $\times{}$	2.8 $\times{}$	1.2 $\times{}$
ImageNet( $D_{2}$ )	$\delta_{2}+\delta_{4}(1)$	-0.9%	8.9 $\times{}$	1.3 $\times{}$	8.6 $\times{}$	5.2 $\times{}$	1.9 $\times{}$
UbiSound( $D_{3}$ )	$\delta_{2}+\delta_{3}(75\%)$	1.3%	15.2 $\times{}$	1.1 $\times{}$	4.3 $\times{}$	3.8 $\times{}$	1.2 $\times{}$
Har( $D_{4}$ )	$\delta_{1}+\delta_{4}(1)$	-0.3%	2.1 $\times{}$	0.8 $\times{}$	9.2 $\times{}$	7.1 $\times{}$	1.3 $\times{}$
StateFarm( $D_{5}$ )	$\delta_{2}+\delta_{3}(55\%)$	0.2%	5.9 $\times{}$	0.7 $\times{}$	5.6 $\times{}$	4.3 $\times{}$	1.6 $\times{}$

Performance comparison. Table 2 summarizes the performance comparison between ten baseline methods and AdaSpring. First, AdaSpring achieves the best overall performance in terms of accuracy $A$ , MAC amount $C$ , parameter arithmetic intensity $C/S_{p}$ , activation arithmetic intensity $C/S_{a}$ , and energy consumption $En$ , while incurring negligible accuracy loss, compared to the DNNs specialized by other baseline methods. AdaSpring reduces the model inference latency to $15.6$ ms and the energy consumption to $1.9mJ$ , and it increases the parameter arithmetic intensity $C/S_{p}$ to $158.9$ and the activation arithmetic intensity $C/S_{a}$ to $358.7$ . Notably, the DNN generated by AdaSpring has the largest parameter arithmetic intensity $C/S_{p}$ and the second-largest activation arithmetic intensity $C/S_{a}$ in Table 2. Compared to the parameter size, the influence of activation arithmetic intensity upon energy consumption is equally or even more critical. The DNNs specialized by the hand-crafted Fire, MobileNetV2, SVD, and sparse coding techniques consume $3.1mJ$ , $5.2mJ$ , $4.8mJ$ , and $4.6mJ$ , respectively. The accuracy of the exhaustive optimizer is much lower than that of the proposed design, because it fixes the compression operator categories and only over-compresses their hyperparameters. This outcome demonstrates that the reselection of different compression operators is necessary. The accuracy of the DNN specialized by AdaSpring is comparable to that of ProxylessNAS ( $74.1\%$ vs. $74.2\%$ ) and better than that of the hand-crafted compression techniques. Second, AdaSpring’s specialization scheme is the most efficient in reducing the search cost and retraining cost. The adaptive compression baselines involve a high overhead in retraining. For example, AdaDeep requires an average of $19\sim 38$ hours for retraining (e.g. retraining the deep reinforcement learning model-based optimizer) offline on the GPU platform for each adjustment of compression strategies. AdaDeep and ProxylessNAS need $18N$ and $196N$ hours, respectively, to search from the candidate configurations, which increases linearly with the number of dynamic contexts. Although neither OFA nor AdaSpring needs retraining, OFA needs $41$ hours to search per adaptation, while AdaSpring only needs $3.8ms$ . This is because AdaSpring leverages the elite compression operator space, rather than the basic kernel size or channel number space in OFA, to avoid redundant search exploration.

Summary. AdaSpring outperforms the other ten baselines in terms of the DNN performance tradeoff between accuracy, latency, arithmetic intensity of parameters and activations, and energy consumption. Meanwhile, it incurs only a modest search cost without retraining, making it well suited for runtime adaptive DNN compression.

6.3. AdaSpring’s Performance over Different Tasks

To illustrate AdaSpring’s performance over different tasks, we evaluate it using all five applications/datasets (see $\S$ 6.1) on a Raspberry Pi 4B platform (Device 3), which is powered by a mobile $3800mAh$ battery. AdaSpring dynamically detects the platform’s remaining battery and sets the coefficients between accuracy and energy efficiency in Equ. 3.2 according to the percentage of remaining power $E_{remaining}$ , i.e., $\lambda_{2}=max\{0.3,1-E_{remaining}\}$ . In addition, we specify the storage budget as 2MB, which is the capacity of the L2-Cache. We set the accuracy loss thresholds to $0.5,0.3,0.6,0.5$ for the image classification tasks ( $D_{1}$ , $D_{2}$ ), the sound sensing task ( $D_{3}$ ), the human activity prediction task ( $D_{4}$ ), and the driver behavior recognition task ( $D_{5}$ ), respectively. We express the latency sensitivity as latency budgets of $20ms,10ms$ , $30ms,20ms$ for $D_{1}\sim D_{5}$ .

Table 4. We test AdaSpring’s performance across three platforms on four moments with dynamic contexts.

Diverse platform				Dynamic context
Device	Processor	L2-Cache	Battery	Time	$9:00am$	$10:00am$	$11:00am$	$12:00noon$
Redmi 3S smartphone	Qualcomm B21	2MB	4100mAh	Remaining battery	$86\%$	$78\%$	$72\%$	$61\%$
Raspberry Pi 4B	Cortex-A72	2MB	3800mAh	Available cache	2MB	1.6MB	1.5MB	1.7MB
NVIDIA Jetbot	Cortex-A57	2MB	7200mAh	Inference requests	2 times	1 time	2 times	1 time

Performance. Figure 8 compares the performance of the DNN configurations specialized by AdaSpring on five different tasks in terms of user experience metrics (i.e., inference accuracy $A$ , energy efficiency $E$ , and inference latency $T$ ) and direct DNN metrics (i.e., computation $C$ , parameter size $S_{p}$ , activation size $S_{a}$ ). We compute the mean and standard deviation of the running performance of the DNN specialized at five dynamic moments, at which the percentage of remaining battery is $0.85$ , $0.75$ , $0.62$ , $0.52$ , and $0.38$ , respectively. These affect the tradeoff demands on the objectives. The storage budget $S_{bgt}$ for parameters dynamically depends on the available Cache capacity. We simulate the unpredictable resource contention by other software by injecting randomization noise $\sigma$ into the Cache’s available capacity, i.e., $(2-\sigma)$ MB. For different tasks, datasets, and deployment contexts, AdaSpring selects various combinations of compression operators to scale the model configurations up or down to optimize and balance multiple performance metrics. It achieves the inference latency $1.2\sim 2.8$ , the parameter arithmetic intensity $106\sim 229$ , and the activation arithmetic intensity $143\sim 220$ , with a negligible accuracy loss ( $\leq 0.5\%$ ) or even accuracy improvement ( $\leq 2.2\%$ ).

Summary. For different tasks with diverse backbone model shapes and various sensitivity to accuracy loss and latency, the DNN specialized by AdaSpring varies. For the same task, the DNN compression configurations found by AdaSpring also differ according to the dynamic deployment context.

6.4. AdaSpring’s Performance across Diverse and Dynamic Deployment Contexts

In this experiment, we compare AdaSpring’s performance for the mobile sound sensing application ( $D_{3}$ ) on three different platforms. We adopt the same self-evolutionary network, comprising the same backbone-net and some optional compression operator-variants. Different platforms have different resource characteristics, which are further affected by dynamic deployment contexts. In particular, we adopt the RedMi 3S smartphone equipped with Qualcomm B21 processor, $2MB$ L2-Cache, and $4100mAh$ battery; the Raspberry Pi 4B with $2MB$ L2-Cache, and $3800mAh$ battery; and the NVIDIA Jetbot with quad-core ARM Cortex-A57 processor, $2MB$ L2-Cache, and $7200mAh$ battery. We adopt dynamic deployment context settings similar to those in $\S$ 6.3.

Performance. Figure 9 summarizes the performance of the DNN specialized by AdaSpring along with the dynamic changes of deployment context. We first initialize different DNN configurations for the various platform constraints and then leverage AdaSpring to update the compression operator-variant selections according to the specific platform’s dynamic contexts. We select four points of dynamic contexts. AdaSpring identifies DNN configurations that obtain latency of $1.1\sim 1.8$ , parameter arithmetic intensity $81\sim 151$ , and activation arithmetic intensity $192\sim 397$ while having slightly degraded or even better accuracy $91.2\%\sim 98.6\%$ . We pick a time fragment to show four moments with dynamic deployment contexts, as shown in Table 4. As the battery power gradually decreases and the Cache capacity fluctuates, we show the performance changes of the DNN, which is continually scaled down or up by selecting and combining different compression operators. Moreover, AdaSpring supports scaling up the model architecture again when the dynamic constraints on resource efficiency are relaxed, bringing better flexibility.

Summary. AdaSpring adaptively selects the proper combination of compression operators to continually optimize DNN performance to meet dynamic context demands. Moreover, AdaSpring accomplishes a flexible evolution, i.e., it supports both scaling up and scaling down the DNN configurations as the context demands change.

6.5. Micro-benchmarks of AdaSpring

In this subsection, we evaluate the impact of different factors on AdaSpring’s design.

6.5.1. Hardware Efficiency-guided Combination

We compare the performance of DNNs reconfigured by a stand-alone compression technique (e.g. the Fire module (Iandola et al., 2016)), the blindly combined two compression techniques (e.g. Fire module plus depth-wise pruning), and the proposed hardware-efficient grouping of compression operators (see Figure 10(a)). And we show that the hardware-efficient grouping can always guarantee a comparable overall performance in terms of accuracy, energy efficiency and latency.

6.5.2. Layer-dependent Inheriting and Mutation

As discussed in $\S$ 5.2, we leverage the inheriting and mutation schemes to balance the searching diversity and convergence. We compare the locally greedy scheme layer by layer, the layer-dependent inheriting scheme, and the proposed layer-dependent inheriting plus mutation scheme in AdaSpring. Figure 10(b) shows that AdaSpring achieves the best tradeoff between model accuracy and energy efficiency.

6.5.3. Progressive Shortest Encoding

The encoding of convolutional compression configurations at multiple layers affects the complexity of the search space. Figure 10(c) compares the performance of classic binary encoding and progressive shortest encoding scheme. And we find that AdaSpring’s progressive shortest encoding method boosts the search efficiency.

6.5.4. Aggregation Coefficients in Arithmetic Intensity

As mentioned in $\S$ 5.1.2, the aggregation coefficients $\mu_{1}$ and $\mu_{2}$ for the parameter and activation arithmetic intensity $\mu_{1}C/S_{p}+\mu_{2}C/S_{a}$ need to be optimized empirically. Figure 10(d) illustrates the estimated energy consumption using different aggregation coefficient settings. Therefore, we set $\mu_{1}=0.4,\mu_{2}=0.6$ by default across different platforms.

6.6. Case Study

We deploy AdaSpring on a commercial mobile robot platform (i.e., NVIDIA Jetbot, device4) and conduct a one-day experiment (09:00 to 17:00) to continually optimize the DNN configurations for a sound assistant application (i.e., UbiEar (Sicong et al., 2017)). This application adopts a DNN to realize a sound recognition and notification tool for hard-of-hearing people to sense emergency (e.g. fire alarms, smoke alarms, kettle boiling whistle) and social events (e.g. doorbell ring, knocking door, people crying). We simulate the dynamic mobile context of the DNN (as described in $\S$ 3) as follows. On the one hand, we artificially play some audio clips for emergency events and generate social events to control the happening frequency of acoustic events, affecting the DNN inference frequency. On the other hand, we simulate the unpredictable storage resource contention by other software using the randomization noise (e.g. Gaussian noise) $\sigma$ injection to the available capacity of L2-Cache, i.e., $(2-\sigma)$ MB. Here, the maximum capacity of L2-Cache on NVIDIA Jetbot platform is $2MB$ , and we update the randomized resource contention value of $\sigma$ per hour. We do not artificially change the battery power, which is continuously consumed in the real-world as the application runs. Therefore, the remaining battery is dynamically changing, e.g., 86%, 72%, and 63%, as shown in Figure 13, which forms the dynamic energy budgets.
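The storage-contention part of this setup can be sketched as follows. This is an illustrative sketch only: the noise standard deviation, the use of the absolute value to keep $\sigma$ non-negative, and the function name are assumptions, since the text states only that Gaussian noise $\sigma$ is injected hourly into the $2MB$ L2-Cache capacity.

```python
import random

L2_CACHE_MB = 2.0  # maximum L2-Cache capacity stated in the text


def simulate_storage_budgets(hours, rng, noise_std=0.3):
    """Return {hour: available cache in MB}, i.e. (2 - sigma) MB per hour.

    noise_std is a hypothetical value; the text does not state one.
    """
    budgets = {}
    for hour in hours:
        # Clip sigma to [0, 2] so the available capacity stays physical.
        sigma = min(L2_CACHE_MB, abs(rng.gauss(0.0, noise_std)))
        budgets[hour] = L2_CACHE_MB - sigma
    return budgets


hours = list(range(9, 18))   # hourly updates from 09:00 to 17:00
budgets = simulate_storage_budgets(hours, random.Random(42))
```

Each hourly value would then play the role of the storage budget $S_{bgt}$ handed to the runtime search at that moment.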

Figure 13 illustrates the dynamic deployment context (i.e., energy, storage, event frequency) of the DNN for the continuous sound sensing application. The battery’s remaining energy determines the importance coefficient $\lambda_{2}$ in the runtime optimization problem (Equ. 3.2). The available capacity of the L2-Cache decides the storage budget of parameters $S_{bgt}$ . The frequency of sound events indirectly influences the battery’s power. Different deployment contexts impose different resource constraints and sensitivities to the performance objectives, which lead to different performance and budget demands on the DNN. AdaSpring triggers the runtime DNN evolution block at a pre-defined frequency (e.g. every two hours) to shrink the DNN configurations over this regular day. Figure 12 shows the running performance of the DNNs specialized by AdaSpring. AdaSpring can continually and adaptively select the best compression strategy to shrink the DNN configurations given diverse user demands. Specifically, it selects $\delta_{1}$ (Fire) + $\delta_{3}$ (pruning 50% channel) for the regular resource-constrained moment, $\delta_{1}$ (Fire) + $\delta_{4}$ (pruning 1 layer) for the tight memory constraint moment, and $\delta_{2}$ (SVD-based decomposition) + $\delta_{3}$ (pruning 65% channel) for the tight battery-bounded moment. The evolved models achieve $\geq 95.6\%$ accuracy and $168.2\sim 202.6$ arithmetic intensity. AdaSpring finds proper combinations of compression operators that satisfy the diverse demands on accuracy and resource efficiency within $2.8ms\sim 3.1ms$ .

7. Conclusion

This paper addresses the runtime adaptive DNN compression problem, taking into account the dynamic deployment context of continuously running mobile applications. We present AdaSpring, a context-adaptive and runtime-evolutionary DNN compression framework that continually optimizes the DNN configurations (i.e., architectures and weights) to adapt to the dynamic context. We formulate the dynamic performance demands (e.g. accuracy, latency, energy efficiency) as a time-varying constrained optimization problem, and we propose a heuristic solution that quickly searches for the most suitable combination of retraining-free compression techniques at runtime. To decouple DNN training from runtime adaptive compression, we move the computation ahead into the training of a self-evolutionary network at design time (see $\S$ 4). We also present the Runtime3C search algorithm and a set of search speedup mechanisms to boost the runtime search efficiency and quality. Evaluation using five different mobile applications across four mobile platforms and a real-world case study shows the performance advantages of AdaSpring in evolving the DNN compression configurations locally online at the millisecond level. In future work, given the diverse and dynamic mobile scenarios (e.g. data, task, and platform), more effort and insight are needed for self-evolutionary deep model compression and optimization frameworks.

Acknowledgements.

This work was partially supported by the National Key R&D Program of China (2019YFB1703901), National Science Fund for Distinguished Young Scholars (62025205, 61725205), National Natural Science Foundation of China (No. 62032020, 61960206008, 62032017), and the Fundamental Research Funds for the Central Universities (No. 3102020QD1005). The authors also thank the anonymous reviewers for their constructive feedback that has made the work stronger.

References

Bender et al. (2018) Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le. 2018. Understanding and simplifying one-shot architecture search. In International Conference on Machine Learning. 550–559.
Bhattacharya and Lane (2016) Sourav Bhattacharya and Nicholas D Lane. 2016. Sparsification and separation of deep learning layers for constrained resource inference on wearables. In Proceedings of SenSys. 176–189.
Bhattacharya et al. (2020) Sourav Bhattacharya, Dionysis Manousakas, Alberto Gil CP Ramos, Stylianos I Venieris, Nicholas D Lane, and Cecilia Mascolo. 2020. Countering Acoustic Adversarial Attacks in Microphone-equipped Smart Home Devices. Proceedings of the IMWUT 4, 2 (2020), 1–24.
Cai et al. (2017) Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. 2017. Efficient architecture search by network transformation. arXiv preprint arXiv:1707.04873 (2017).
Cai et al. (2019) Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. 2019. Once-for-all: Train one network and specialize it for efficient deployment. arXiv preprint arXiv:1908.09791 (2019).
Cai et al. (2018) Han Cai, Ligeng Zhu, and Song Han. 2018. Proxylessnas: Direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332 (2018).
Chen et al. (2017) Guobin Chen, Wongun Choi, Xiang Yu, Tony Han, and Manmohan Chandraker. 2017. Learning efficient object detection models with knowledge distillation. In Advances in Neural Information Processing Systems. 742–751.
Chen et al. (2020) Ling Chen, Yi Zhang, and Liangying Peng. 2020. METIER: A Deep Multi-Task Learning Based Activity and User Recognition Model Using Wearable Sensors. Proceedings of IMWUT 4, 1 (2020), 1–18.
Chen et al. (2018) Tao Chen, Ke Li, Rami Bahsoon, and Xin Yao. 2018. FEMOSAA: Feature-guided and knee-driven multi-objective optimization for self-adaptive software. ACM Transactions on Software Engineering and Methodology 27, 2 (2018).
Chen et al. (2016) Wenlin Chen, James Wilson, Stephen Tyree, Kilian Q Weinberger, and Yixin Chen. 2016. Compressing convolutional neural networks in the frequency domain. In Proceedings of SIGKDD. 1475–1484.
Chen et al. (2019) Xin Chen, Lingxi Xie, Jun Wu, and Qi Tian. 2019. Progressive differentiable architecture search: Bridging the depth gap between search and evaluation. In Proceedings of ICCV. 1294–1303.
Cheng et al. (2017) Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. 2017. A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282 (2017).
Dai et al. (2019) Xiaoliang Dai, Peizhao Zhang, Bichen Wu, Hongxu Yin, Fei Sun, Yanghan Wang, Marat Dukhan, Yunqing Hu, Yiming Wu, Yangqing Jia, et al. 2019. Chamnet: Towards efficient network design through platform-aware model adaptation. In Proceedings of CVPR. 11398–11407.
Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In Proceedings of CVPR.
Fang et al. (2020) Jiemin Fang, Yuzhu Sun, Kangjian Peng, Qian Zhang, Yuan Li, Wenyu Liu, and Xinggang Wang. 2020. Fast neural network adaptation via parameter remapping and architecture search. arXiv preprint arXiv:2001.02525 (2020).
French (1999) Robert M French. 1999. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences 3, 4 (1999), 128–135.
Gao et al. (2018) Xitong Gao, Yiren Zhao, Łukasz Dudziak, Robert Mullins, and Cheng-zhong Xu. 2018. Dynamic channel pruning: Feature boosting and suppression. arXiv preprint arXiv:1810.05331 (2018).
Gholami et al. (2018) Amir Gholami, Kiseok Kwon, Bichen Wu, Zizheng Tai, Xiangyu Yue, Peter Jin, Sicheng Zhao, and Kurt Keutzer. 2018. Squeezenext: Hardware-aware neural network design. In Proceedings of CVPR. 1638–1647.
Google (2017) Google. 2017. TensorFlow. https://goo.gl/j7HAZJ.
Han et al. (2016) Seungyeop Han, Haichen Shen, Matthai Philipose, Sharad Agarwal, Alec Wolman, and Arvind Krishnamurthy. 2016. Mcdnn: An approximation-based execution framework for deep stream processing under resource constraints. In Proceedings of MobiSys. 123–136.
He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of CVPR. 770–778.
He et al. (2018) Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. 2018. Amc: Automl for model compression and acceleration on mobile devices. In Proceedings of ECCV. 784–800.
He et al. (2017) Yihui He, Xiangyu Zhang, and Jian Sun. 2017. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision. 1389–1397.
Howard et al. (2017) Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).
Iandola et al. (2016) Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360 (2016).
Jha and Mittal (2020) Nandan Kumar Jha and Sparsh Mittal. 2020. Modeling Data Reuse in Deep Neural Networks by Taking Data-Types into Cognizance. IEEE Trans. Comput. (2020).
Jha et al. (2019) Nandan Kumar Jha, Sparsh Mittal, and Govardhan Mattela. 2019. The ramifications of making deep neural networks compact. In Proceedings of VLSID. IEEE, 215–220.
Jiang et al. (2019) Yufan Jiang, Chi Hu, Tong Xiao, Chunliang Zhang, and Jingbo Zhu. 2019. Improved differentiable architecture search for language modeling and named entity recognition. In Proceedings of EMNLP-IJCNLP. 3576–3581.
Kaggle (2019) Kaggle. 2019. State Farm Distracted Driver Detection. https://www.kaggle.com/c/state-farm-distracted-driver-detection.
Kirkpatrick et al. (2017) James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114, 13 (2017), 3521–3526.
Krizhevsky (2009a) Alex Krizhevsky. 2009a. Learning multiple layers of features from tiny images. https://www.tensorflow.org/datasets/catalog/cifar100.
Krizhevsky (2009b) Alex Krizhevsky. 2009b. Learning multiple layers of features from tiny images. Technical Report.
Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105.
Kwon et al. (2020) Hyeokhyen Kwon, Catherine Tong, Harish Haresamudram, Yan Gao, Gregory D Abowd, Nicholas D Lane, and Thomas Ploetz. 2020. IMUTube: Automatic extraction of virtual on-body accelerometry from video for human activity recognition. arXiv preprint arXiv:2006.05675 (2020).
Lane et al. (2016) Nicholas D Lane, Sourav Bhattacharya, Petko Georgiev, Claudio Forlivesi, Lei Jiao, Lorena Qendro, and Fahim Kawsar. 2016. Deepx: A software accelerator for low-power deep learning inference on mobile devices. In Proceedings of IPSN. IEEE, 1–12.
Lane et al. (2015) Nicholas D Lane, Petko Georgiev, and Lorena Qendro. 2015. DeepEar: robust smartphone audio sensing in unconstrained acoustic environments using deep learning. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing. 283–294.
Li et al. (2019) Gen Li, Inyoung Yun, Jonghyun Kim, and Joongkyu Kim. 2019. Dabnet: Depth-wise asymmetric bottleneck for real-time semantic segmentation. arXiv preprint arXiv:1907.11357 (2019).
Li et al. (2014) Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J Smola. 2014. Efficient mini-batch training for stochastic optimization. In Proceedings of SIGKDD. 661–670.
Liu et al. (2017) Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. 2017. Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436 (2017).
Liu et al. (2018b) Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2018b. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055 (2018).
Liu et al. (2020) Sicong Liu, Junzhao Du, Kaiming Nan, Atlas Wang, Yingyan Lin, et al. 2020. AdaDeep: A Usage-Driven, Automated Deep Model Compression Framework for Enabling Ubiquitous Intelligent Mobiles. arXiv preprint arXiv:2006.04432 (2020).
Liu et al. (2018a) Sicong Liu, Yingyan Lin, Zimu Zhou, Kaiming Nan, Hui Liu, and Junzhao Du. 2018a. On-demand deep model compression for mobile devices: A usage-driven model selection framework. In Proceedings of MobiSys. 389–400.
Luo and Wu (2020) Jian-Hao Luo and Jianxin Wu. 2020. Autopruner: An end-to-end trainable filter pruning method for efficient deep model inference. Pattern Recognition (2020), 107461.
Ren et al. (2020) Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, Xiaojiang Chen, and Xin Wang. 2020. A Comprehensive Survey of Neural Architecture Search: Challenges and Solutions. arXiv preprint arXiv:2006.02903 (2020).
Saikia et al. (2019) Tonmoy Saikia, Yassine Marrakchi, Arber Zela, Frank Hutter, and Thomas Brox. 2019. Autodispnet: Improving disparity estimation with automl. In Proceedings of ICCV. 1812–1823.
Sandler et al. (2018) Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4510–4520.
Sicong et al. (2017) Liu Sicong, Zhou Zimu, Du Junzhao, Shangguan Longfei, Jun Han, and Xin Wang. 2017. Ubiear: Bringing location-independent sound awareness to the hard-of-hearing people with smartphones. Proceedings of IMWUT 1, 2 (2017), 1–21.
Singh et al. (2019) Pravendra Singh, Vinay Kumar Verma, Piyush Rai, and Vinay P Namboodiri. 2019. Play and prune: Adaptive filter pruning for deep model compression. arXiv preprint arXiv:1905.04446 (2019).
Tan et al. (2019) Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. 2019. Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of CVPR. 2820–2828.
Teerapittayanon et al. (2016) Surat Teerapittayanon, Bradley McDanel, and Hsiang-Tsung Kung. 2016. Branchynet: Fast inference via early exiting from deep neural networks. In Proceedings of ICPR. IEEE, 2464–2469.
UCI (2017) UCI. 2017. Dataset for Human Activity Recognition. https://goo.gl/m5bRo1.
Wang et al. (2020) Xiaofei Wang, Yiwen Han, Victor CM Leung, Dusit Niyato, Xueqiang Yan, and Xu Chen. 2020. Convergence of edge computing and deep learning: A comprehensive survey. IEEE Communications Surveys & Tutorials 22, 2 (2020), 869–904.
Wu et al. (2018b) Junru Wu, Yue Wang, Zhenyu Wu, Zhangyang Wang, Ashok Veeraraghavan, and Yingyan Lin. 2018b. Deep $k$-Means: Re-Training and Parameter Sharing with Harder Cluster Assignments for Compressing Deep Convolutions. arXiv preprint arXiv:1806.09228 (2018).
Wu et al. (2018a) Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, Steven Rennie, Larry S Davis, Kristen Grauman, and Rogerio Feris. 2018a. Blockdrop: Dynamic inference paths in residual networks. In Proceedings of CVPR. 8817–8826.
Yang et al. (2020) Li Yang, Zhezhi He, Yu Cao, and Deliang Fan. 2020. A Progressive Sub-Network Searching Framework for Dynamic Inference. arXiv preprint arXiv:2009.05681 (2020).
Yang et al. (2017) Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. 2017. Designing energy-efficient convolutional neural networks using energy-aware pruning. In Proceedings of CVPR. 5687–5695.
Yang et al. (2019) Zhican Yang, Chun Yu, Fengshi Zheng, and Yuanchun Shi. 2019. ProxiTalk: Activate Speech Input by Bringing Smartphone to the Mouth. Proceedings of IMWUT 3, 3 (2019), 1–25.
Yao et al. (2017) Shuochao Yao, Yiran Zhao, Aston Zhang, Lu Su, and Tarek Abdelzaher. 2017. Deepiot: Compressing deep neural network structures for sensing systems with a compressor-critic framework. In Proceedings of SenSys. 1–14.
Yu and Huang (2019) Jiahui Yu and Thomas Huang. 2019. AutoSlim: Towards One-Shot Architecture Search for Channel Numbers. arXiv preprint arXiv:1903.11728 (2019).
Zhao et al. (2018) Zhuoran Zhao, Kamyar Mirzazad Barijough, and Andreas Gerstlauer. 2018. DeepThings: Distributed adaptive deep learning inference on resource-constrained IoT edge clusters. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37, 11 (2018), 2348–2359.
Zhong et al. (2018) Zhao Zhong, Junjie Yan, Wei Wu, Jing Shao, and Cheng-Lin Liu. 2018. Practical block-wise neural network architecture generation. In Proceedings of CVPR. 2423–2432.
Zhou et al. (2020) Pan Zhou, Caiming Xiong, Richard Socher, and Steven CH Hoi. 2020. Theory-inspired path-regularized differential network architecture search. arXiv preprint arXiv:2006.16537 (2020).
Zhu and Zabaras (2018) Yinhao Zhu and Nicholas Zabaras. 2018. Bayesian deep convolutional encoder–decoder networks for surrogate modeling and uncertainty quantification. J. Comput. Phys. 366 (2018), 415–447.
Zoph et al. (2018) Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. 2018. Learning transferable architectures for scalable image recognition. In Proceedings of CVPR. 8697–8710.