AdaSpring: Context-adaptive and Runtime-evolutionary Deep Model Compression for Mobile Applications
Abstract.
Many deep learning (e.g. DNN) powered mobile and wearable applications today continuously and unobtrusively sense the ambient surroundings to enhance all aspects of human lives. To enable robust and private mobile sensing, DNNs tend to be deployed locally on resource-constrained mobile devices via model compression. Current practice, whether hand-crafted DNN compression techniques that optimize DNN-relative performance (e.g. parameter size) or on-demand DNN compression methods that optimize hardware-dependent metrics (e.g. latency), cannot run locally online because it requires offline retraining to ensure accuracy. Also, none of these methods has been correlated with runtime adaptive compression to account for the dynamic nature of the deployment context of mobile applications. To address those challenges, we present AdaSpring, a context-adaptive and self-evolutionary DNN compression framework. It enables runtime adaptive DNN compression locally online. Specifically, it presents the ensemble training of a retraining-free and self-evolutionary network to integrate multiple alternative DNN compression configurations (i.e., compressed architectures and weights). It then introduces a runtime search strategy to quickly search for the most suitable compression configurations and evolve the corresponding weights. With evaluation on five tasks across three platforms and a real-world case study, experiment outcomes show that AdaSpring obtains up to latency reduction, energy efficiency improvement in DNNs, compared to hand-crafted compression techniques, while only incurring runtime-evolution latency.
1. Introduction
In recent years, many ubiquitous devices (e.g. smartphones, wearables, and embedded facilities) have been integrated with continuously running applications to facilitate all aspects of human lives. Examples include smartphone-based speech assistants (e.g. ProxiTalk (Yang et al., 2019)) and wearable sensor-enabled activity recognition (IMUTube (Kwon et al., 2020), MITIER (Chen et al., 2020)). Notably, there is a growing trend to bring deep learning (e.g. DNN) powered intelligence into mobile devices, which benefits effective data analysis. Besides, due to increasing user concerns about transmission cost and privacy, executing DNNs on local devices is a promising paradigm for robust mobile sensing (Wang et al., 2020; Lane et al., 2015). However, it is non-trivial to deploy computation-intensive DNNs on mobile platforms with tightly limited resources (i.e., storage, battery).
Given those challenges, prior works have investigated different DNN specialization schemes to explore the desired tradeoff between application performance (i.e., accuracy, latency) and resource constraints (i.e., battery and storage budgets). Firstly, as illustrated in Figure 1(a), the hand-crafted DNN compression methods, e.g. weight pruning (Luo and Wu, 2020), rely on manual design to reduce the model complexity. They may not suffice to meet diverse performance requirements. Secondly, the on-demand DNN compression schemes (see Figure 1(b)), e.g. DeepX (Lane et al., 2016), AMC (He et al., 2018), and AdaDeep (Liu et al., 2020), adopt a trainable meta-learner to automatically find the most suitable DNN compression strategies for various platforms. They need offline retraining to ensure accuracy and update the meta-learner. The extra overhead and latency of offline retraining are intolerable for responsive applications. Thirdly, the one-shot neural architecture search (NAS) methods (see Figure 1(c)) pre-train a super-net and automatically search for the best DNN architecture for target platforms (Fang et al., 2020; Zoph et al., 2018; Saikia et al., 2019; Cai et al., 2019). However, they also incur high overhead for scanning and searching a large-scale candidate space. None of them can work locally online.




Despite major advances in existing DNN compression techniques, none of them has been correlated with runtime adaptive compression to account for the dynamics of the deployment context in continuously running applications. We have identified that a self-adaptive, retraining-free, and fast framework for runtime adaptive DNN compression (see Figure 1(d)) is necessary yet challenging. As illustrated in Figure 1(d), the DNN deployment context often exhibits high dynamics and unpredictability in practice, and the dynamic changes in the deployment context further lead to varying performance demands on DNN compression. Specifically, we identify the dynamic context to mainly include the time-varying hardware capabilities (e.g. storage, battery, processor), the DNN active execution time, the agnostic inference frequency triggered by real environments, and the unpredictable resource contention imposed by other Apps.
Figure 2 shows an example in which a user carries a smartphone-based hearing assistant App (e.g. UbiEar (Sicong et al., 2017)) to sense the ambient acoustic event of interest continuously. During its use, the smartphone’s battery is dynamically consumed by the DNN execution, the memory access, the microphone sampling, and the screen with unpredictable frequency, which further characterizes the dynamic energy constraints for the deployed DNN. The storage unit (e.g. L2-Cache) is also dynamically occupied by other applications, resulting in various storage budgets for DNN parameters. Both the mobile developer and the user face a problem: how to automatically and effectively re-compress the DNN at runtime to meet dynamic demands? They face the following two challenges:
• Firstly, it is non-trivial to continually scale up/down the DNN compression configurations, including both architectures and weights, to meet the dynamic optimization objectives on multiple DNN performance metrics (i.e., accuracy, latency, energy consumption) on-the-fly. This is because most DNN compression methods are irreversible: they cannot scale the DNN up again, i.e., recover fine details of the DNN architecture and weights, from a compressed/pruned model. Moreover, the weight evolution is always limited by offline retraining.
• Secondly, it is intractable to provide an efficient and effective solution to the runtime optimization problem. Tackling this problem requires quickly searching for the suitable compression techniques from an elite candidate set and efficiently evolving weights without model retraining. Moreover, it is difficult to systematically balance the trade-off among multiple conflicting and interdependent performance metrics (e.g. latency, storage and energy efficiency) by merely tuning the DNN compression methods.
In view of those challenges and limitations, we present AdaSpring, a context-adaptive and runtime-evolutionary deep model compression framework. It continually controls the trade-off among multiple performance metrics by re-selecting the proper DNN compression techniques. To capture the dynamic context, we formulate the runtime tuning of compression techniques as a dynamic optimization problem (see Eq. (1) in Section 3.2), in which we model the dynamic context by a set of time-varying constraints (i.e., accuracy loss threshold, latency and storage budgets, and the relative importance of objectives). We then present a heuristic solution. In particular, to eliminate the runtime retraining cost, we decouple offline training from online adaptation by moving weight tuning ahead into the training of a self-evolutionary network (see Section 4). Furthermore, we present an efficient and effective search strategy. It involves an elite and flexible search space (see Section 5.1), the progressive shortest candidate encoding, and the Runtime3C search algorithm (see Section 5.2) to boost the locally online search efficiency and quality. The main contributions of this work are summarized as follows.
• To the best of our knowledge, AdaSpring is the first context-adaptive and self-adaptive DNN compression framework to continually shrink model architectures and evolve the corresponding weights by automatically applying the proper retraining-free compression techniques on-the-fly. It trains a self-evolutionary network to synergize the weight recycling of multi-scale compression operators and to decouple offline training from online compression.
• AdaSpring presents an efficient runtime search strategy to optimize the runtime adaptive compression problem. It introduces the elite and flexibly combined compression operator space, the fast Runtime3C search algorithm, and a set of speedup mechanisms to boost the search efficiency and quality at runtime while avoiding explosive combination.
• Using five mobile applications across three platforms and a real-world case study of DNN-powered sound recognition on NVIDIA Jetbot, extensive experiments show the advantage of AdaSpring in continually optimizing DNN configurations. It adaptively adjusts the compression configurations to tune energy cost by , latency by , and storage by , with accuracy loss. And the online evolution latency of compression configurations to meet dynamic contexts is .

2. Related Work
Our work is inspired by and closely related to the following works.
DNN Compression for Ubiquitous Mobile Applications. There is a promising trend to bring deep learning powered intelligence to mobile and embedded devices (e.g. smartphones, wearables, IoT) for enhancing all aspects of human lives, such as smartphone-based speech assistants (e.g. ProxiTalk (Yang et al., 2019)), smart-home device-enabled sound detection (Bhattacharya et al., 2020), and wearable-based activity recognition (e.g. MITIER (Chen et al., 2020)). Recent research has demonstrated the potential of fitting DNNs into resource-constrained mobile devices (Cheng et al., 2017) by using DNN compression techniques, including parameter pruning (He et al., 2017), sharing (Wu et al., 2018b), and quantization (Zhu and Zabaras, 2018), compact components (Iandola et al., 2016; Howard et al., 2017), and model distillation (Chen et al., 2017). However, we note that few efficient compression techniques are dedicated to optimizing the application-driven system performance (e.g. energy efficiency). For example, the recent work (Jha et al., 2019) argues that the platform-aware SqueezeNet (Iandola et al., 2016) and SqueezeNext (Gholami et al., 2018) merely reduce parameter size or MAC amount, which does not necessarily lead to reduced energy cost or latency (Yang et al., 2017). Also, all of these techniques need several epochs of model retraining to ensure accuracy, so they cannot run locally online. Instead, AdaSpring decouples DNN training from online adaptive compression and considers the dynamic arithmetic intensity of both parameters and activations to guide the best specialization of convolutional compression configurations.
On-demand DNN Computation for Diverse User Demands. There have been two categories of on-demand DNN computation adjustment methods: on-demand DNN compression (Liu et al., 2018a; Luo and Wu, 2020) and on-demand DNN segmentation (Zhao et al., 2018). They specialize DNN computation offline to meet diverse hardware resource budgets (e.g. battery, memory, and computation) and application demands (e.g. input diversity). To satisfy resource budgets, He et al. (He et al., 2018) adopt reinforcement learning to adaptively sample the design space. Yao et al. (Yao et al., 2017) use a recurrent model to control the adaptive compression ratio of each layer. Singh et al. (Singh et al., 2019) introduce a min-max game to achieve maximum pruning with minimal accuracy drop. These methods, however, need extra offline training to update the meta-controller for on-demand adjustment. Zhao et al. (Zhao et al., 2018) present the adaptively distributed execution of CNN-based applications on resource-constrained IoT edge clusters. However, because of mobile platforms’ mobility and opportunistic connectivity, distributed DNN inference in mobile clusters is not yet robust. Building upon these efforts, AdaSpring is the first to enable runtime and adaptive DNN evolution locally, without requiring Wi-Fi/cellular networks to connect with other platforms, while achieving competitive performance.
Dynamic Adaptation of DNN Execution. Prior works have investigated the run-time adaptation of DNN execution to adapt to diverse inputs from two directions: dynamic selection of the inference path (Wu et al., 2018a) or of the network variant (Teerapittayanon et al., 2016). Wu et al. (Wu et al., 2018a) adaptively choose which residual blocks to execute during inference to reduce computation without degrading accuracy. Teerapittayanon et al. (Teerapittayanon et al., 2016) propose a multi-branch network that allows inference to exit adaptively from early branches. Han et al. (Han et al., 2016) adaptively select model variants to optimize accuracy and satisfy resource constraints (e.g. memory and energy). Gao et al. (Gao et al., 2018) propose a feature boosting and suppression method to predictively amplify salient convolutional channels and skip unimportant ones. However, these methods depend heavily on the pre-defined design space of alternative execution paths and variants, and it is prohibitive to specify all of them before deploying models into agnostic mobile contexts. AdaSpring dynamically selects and combines the proper compression operators to flexibly shrink the model configurations from multiple scaling dimensions at runtime.
Fast and Platform-aware Neural Architecture Search. Recent studies have verified the potential of leveraging the neural architecture search (NAS) framework to automate neural architecture specialization for mobile applications, from two aspects. Firstly, Fast-NAS aims to automatically specialize neural architectures for different performance demands (e.g. computation amount, parameter size), using as little search cost as possible (Ren et al., 2020). To speed up the search, researchers have investigated the modular search strategy (Cai et al., 2018; Zhong et al., 2018; Liu et al., 2017), the differentiable search strategy (Liu et al., 2018b; Jiang et al., 2019; Chen et al., 2019; Zhou et al., 2020), and the super-network search strategy (Fang et al., 2020; Bender et al., 2018; Cai et al., 2019). For example, Liu et al. (Liu et al., 2018b) relax the search space to be continuous, so that it can be optimized by gradient descent, using orders of magnitude less search cost. Cai et al. (Cai et al., 2019) train a once-for-all (OFA) super-network that supports diverse variant-network search. We note that the OFA super-network includes some redundant and invalid variant-networks, so it is not elite and incurs a high search cost (as we will discuss in Section 5). Secondly, unlike general NAS that only optimizes for model-relative metrics, such as FLOPs, the platform-aware NAS also incorporates platform-relative metrics (e.g. latency) into the optimization objectives. For example, Tan et al. (Tan et al., 2019) explicitly incorporate latency into the NAS objective to identify a mobile CNN model. Dai et al. (Dai et al., 2019) propose an efficient search algorithm aided by efficient accuracy and hardware resource predictors. However, the above methods still incur high overhead to obtain the ranking of candidate architectures based on their performance on validation sets. They also do not accurately consider the energy consumption improvement target, since energy efficiency measurement is not straightforward on different platforms with a dynamic nature. Departing from existing efforts, AdaSpring treats the retraining-free compression operator (illustrated in Section 5.1) as a new ensemble to be tuned by automated macro-NAS. It trains a self-evolutionary network (see Section 4) at design time to decouple model retraining and adaptive compression. Besides, it presents a set of mechanisms to boost search efficiency (i.e., at millisecond level) and quality during dynamic inference. Notably, AdaSpring leverages the dynamically measured hardware-relative metrics (i.e., arithmetic intensity of parameter and activation) to derive a guiding selection, which also prevents the explosive combination (see Section 5).
3. Overview
This section starts with problem analysis and then presents an overview of AdaSpring design.


3.1. Problem Study
Due to the dynamic nature of the DNN deployment context, we aim to continually tune the DNN compression configurations to directly/indirectly optimize the application-driven system performance (i.e., accuracy, energy efficiency, latency). The hybrid dependency between multiple platform-relative performance metrics and DNN-dependent metrics is shown in Figure 4. To further understand the performance requirements of DNNs for continuously running mobile applications, we asked mobile users and Android developers to rate the importance of different DNN performance aspects on mobile devices, and we summarize the results as our design goals. Specifically, a DNN for continuously running mobile Apps needs to fulfill the following requirements:
• Accurate: the DNN is accurate enough to guarantee a high-quality task. The model weights at different scales are well-trained to represent the generic information of recognition objects.
• Responsive: the complexity of the DNN should be controllable to satisfy diverse user demands on latency constraints, especially on low-end (e.g. CPU-powered) mobiles.
• Energy-Efficient: the energy consumption of the DNN should be continually optimized, which is the bottleneck metric for continuously sensing applications (Yang et al., 2017).
• Runtime-evolutionary: both the DNN architecture and parameter weights are runtime-evolutionary to meet the dynamic deployment context for continually optimizing the above three requirements (i.e., accurate, responsive, energy-efficient) at runtime.
Unfortunately, none of the previous efforts satisfies all these requirements (as discussed in Section 2). To this end, this paper proposes AdaSpring, a context-adaptive and runtime-evolutionary DNN compression framework to automatically optimize the requirements mentioned above, which are closely related to the user experience.
3.2. Optimization Formulation
As shown in Figure 4, AdaSpring intends to provide a systematic method to automatically select the compression operator combination for tuning the above conflicting and interdependent performance metrics. Mathematically, AdaSpring explores an efficient solution to the following dynamic optimization problem:
(1) | s.t. |
where represents the set of all optional convolutional compression operators (as enumerated in Section 5.1). Given a backbone-net architecture , represents the re-configured model architecture compressed by the selected compression operator . , , and denote the measured accuracy, energy efficiency, latency, and memory footprint of a given model running on the target mobile platform. The two objectives on and are combined by relative importance coefficients and , which dynamically depend on the platform’s remaining battery. We express the dynamic deployment contexts as a set of time-varying constraints, i.e., the threshold of accuracy loss , the latency budget , the storage budget , and relative importance coefficients of objectives (, ). The latency budget is application-specified, and the storage budget is platform-imposed. For example, reducing the model size to satisfy the budget of L2-Cache helps to fit it into the on-chip memory and avoids the expensive off-chip access. We note that is a normalization operation for objective aggregation, e.g. . We then propose a heuristic optimization solution that adjusts the model architecture to satisfy dynamic performance requirements. In particular, the model architecture can directly determine both and (see Figure 4), whereas the quantification of the hardware-dependent metrics and is not straightforward. Therefore, AdaSpring’s goal turns to adaptively selecting the compression operator combination from a discrete set of all possible combinations , so that it can directly/indirectly tune model performance metrics.
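For orientation, the problem described in words above can be sketched in the following generic form. All symbols here ($\Delta$ for the operator set, $\delta$ for a selected operator combination, $\Omega$ for the backbone-net, $A$, $E$, $T$, $S$ for accuracy, energy efficiency, latency, and memory footprint, $\lambda_1$, $\lambda_2$ for the importance coefficients, and the threshold and budget names) are assumed notation chosen for this sketch and may differ from the paper's own. Writing the accuracy constraint as a bound on the drop relative to the backbone-net is likewise an assumption.

```latex
\begin{equation*}
\begin{aligned}
\max_{\delta \in \Delta} \quad
  & \lambda_1(t)\,\mathrm{Norm}\!\big(A(\delta(\Omega))\big)
    + \lambda_2(t)\,\mathrm{Norm}\!\big(E(\delta(\Omega))\big) \\
\text{s.t.} \quad
  & A(\Omega) - A(\delta(\Omega)) \le A_{\mathrm{thr}}(t), \\
  & T(\delta(\Omega)) \le T_{\mathrm{bgt}}(t), \\
  & S(\delta(\Omega)) \le S_{\mathrm{bgt}}(t).
\end{aligned}
\end{equation*}
```

The time argument $t$ on the coefficients and budgets reflects that the constraints are time-varying, as stated in the text.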
3.3. AdaSpring Framework
The above challenging problem motivates the AdaSpring design. As shown in Figure 4, the AdaSpring framework consists of a self-evolutionary network, a runtime adaptive compression block, and a dynamic context awareness block. (i) The self-evolutionary network is an ensemble of a backbone-net and multiple retraining-free compression operator-variants, which enables weight recycling between numerous variants while avoiding catastrophic interference. We initialize the backbone-net’s hyperparameters at design time using an on-demand DNN generation framework, i.e., AdaDeep (Liu et al., 2020), to satisfy mobile application performance demands on a target platform. (ii) The runtime adaptive compression block selects a deterministic optimal combination of compression operators for reconfiguring and evolving the backbone-net at runtime. (iii) The dynamic deployment context awareness block detects the evolution demands and triggers the runtime adaptive compression block. The triggering condition can be modeled as a noticeable context change or as a pre-defined frequency (e.g. time slice) for continuously running Apps in regular use.
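The triggering logic of block (iii) can be illustrated with a minimal Python sketch. The `Context` fields, the relative-change threshold `change_thr`, and the period `period_s` are hypothetical names and values introduced only for illustration; the paper does not specify how a "noticeable" change is detected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    """Snapshot of the deployment context (all fields are illustrative)."""
    battery_frac: float       # remaining battery in [0, 1]
    storage_budget_kb: float  # storage currently available for DNN parameters
    latency_budget_ms: float  # application-specified latency budget


def relative_change(old: float, new: float) -> float:
    """Relative change of a quantity, guarded against division by zero."""
    return abs(new - old) / max(abs(old), 1e-9)


def should_trigger(prev: Context, cur: Context, elapsed_s: float,
                   change_thr: float = 0.2, period_s: float = 3600.0) -> bool:
    """Trigger on a noticeable context change or when the time slice expires."""
    if elapsed_s >= period_s:
        return True
    changes = (
        relative_change(prev.battery_frac, cur.battery_frac),
        relative_change(prev.storage_budget_kb, cur.storage_budget_kb),
        relative_change(prev.latency_budget_ms, cur.latency_budget_ms),
    )
    return max(changes) >= change_thr
```

A caller would compare the context recorded at the last evolution with the current one and invoke the runtime adaptive compression block whenever `should_trigger` returns `True`.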
4. Retraining-free and Self-evolutionary Network Design
This section presents the design of the retraining-free and self-evolutionary network. The self-evolutionary network consists of a high-performance backbone network and multiple compression operator-variants.




4.1. Compression Operators
This paper focuses on the configuration optimization of convolutional architecture, operations, and activations, because recent successful DNN models tend to shift more parameters to convolutional layers and use fewer fully-connected layers (Wu et al., 2018b; Chen et al., 2016; Cai et al., 2019). Building upon existing compression experience, we propose the following alternative convolutional compression operators that synthesize multiple scaling dimensions (e.g. width, depth, and connection).
• Compression operator : multi-branch channel merging techniques (e.g. Fire block (Iandola et al., 2016)) increase the model depth with fewer parameters by replacing one conv layer with two conv layers (i.e., squeeze layer and expand layer), which are elaborately designed to decrease the kernel size and channel size per unit.
• Compression operator : low-rank convolution factorization techniques (e.g. SVD-based (Wu et al., 2018b), sparse coding-based (Bhattacharya and Lane, 2016) factorization, or depth/group-wise convolution (Li et al., 2019)) decompose a conv layer into several conv layers with smaller kernel size, which leads to a growing model depth with fewer parameters.
- •
- •
4.2. Ensemble Training of Self-evolutionary Network
We put the retraining process ahead in the ensemble training of the self-evolutionary network at design time to get rid of weight retraining during dynamic inference. Therefore, the self-evolutionary network training is an ensemble of a backbone-net and multiple variant-nets derived by various convolutional compression operators.
4.2.1. Primer on Parameter Recycling
We adopt the parameter recycling strategy (Wu et al., 2018b; Cai et al., 2017) to recycle the backbone-net weights, which takes less search time than searching from scratch. The weight recycling strategy makes maximum use of the existing architectures’ experience to reduce the time complexity of the searching process. In particular, we reuse an existing hand-crafted/elaborated high-performance DNN, including architecture and weights, as an initialization point (i.e., backbone network). We then leverage an automated optimizer (i.e., search strategy) to search only for the optimal architecture adjustments (e.g. widening a certain network, skipping connections) to obtain a promising new model. However, these methods challenge the ensemble training of multiple variant models that drift away from the backbone-net’s initial configuration. In detail, the training of one variant’s weights will likely interfere with or override the weights that have been learned for other variants and thus degrade the overall performance. We note that the catastrophic interference problem that arises when multiple variant-nets share parameters is a long-standing problem in itself (French, 1999; Kirkpatrick et al., 2017).
4.2.2. Training Strategy
Our goal with the self-evolutionary network is to integrate multiple versions of DNN architectures and the corresponding weights introduced by different compression operators. The above-mentioned parameter recycling strategy offers great potential for this. To further avoid the catastrophic interference problem caused by parameter recycling, we present a novel training strategy that uses parameter transformation and knowledge distillation to preserve the parametric function of multiple variant-nets. In detail, we first perform standard back-propagation to train a high-accuracy backbone-net. Afterwards, we respectively leverage the parameter transformation techniques for learning compression operators and , the knowledge distillation techniques for learning compression operators and , and the trainable channel-wise mutation techniques for learning .
(1) Parameter transformations for learning variant-nets derived by compression operator and . We consider the function-preserving parameter transformation when recycling parameters. It allows us to initialize a new variant-net, derived by a compression operator, so that it preserves the function of the given backbone-net but uses a different parameterization that can be further trained to improve the performance (Cai et al., 2017). We transform the original convolutional parameters and store the extra copy of weights for and . We further set an accuracy target as a threshold, so that the transformed parameters for compression operator-variants are fine-tuned only when their accuracy is lower than that target. Thus, we need only a small number of extra parameters to store the transformed parameters for compression operators and , and we access only the weights of the deterministically selected compression operator to evolve model weights.
(2) Knowledge distillation for learning variant-nets derived by compression operator and . We allow each conv layer to choose the depth/channel compression ratio flexibly. We adopt the knowledge distillation techniques (Cai et al., 2019) to fine-tune the parameters of compression operators and with different compression ratios (e.g. ), so that AdaSpring can flexibly switch among the different parameter weights of channel-wise and depth-wise scaling operators and avoid weight interference. Also, we perform the trainable channel-wise and depth-wise architecture ranking as the weight importance criterion to guide the adaptive layer slimming and scaling. In particular, we pre-train a self-evolutionary network to evaluate the overall performance (e.g. accuracy drop, parameter arithmetic intensity, activation arithmetic intensity, and latency) of different variant-networks that are compressed by different operators. These evaluations are used as the prior-based architecture importance ranking to guide the runtime scaling to shrink unimportant layers/channels first, rather than scaling randomly.
(3) Trainable channel-wise mutation for training variant-nets derived by compression operator . To maintain a good diversity of solutions, we present a novel trainable architecture mutation technique to inject architecture variance into the compressed network. This idea is supported by recent DNN studies, which have verified the dominant effect of model architecture on accuracy compared to model parameters (Yu and Huang, 2019). AdaSpring can directly use the trainable architecture mutation technique with diverse noise magnitudes, since the channel importance ranking of the backbone-net is consistent at both design time and runtime. Specifically, we inject Gaussian noise into the channel-wise operator’s scaling ratio (i.e., ), and the noise magnitude is trainable for channel importance ranking. That is, the more important the channel is, the lower the intensity of the noise we inject. This, as we will evaluate in Section 6.5, plays a nontrivial role in AdaSpring’s progressive shortest encoding of DNN compression configurations and in the runtime searching process, boosting the runtime DNN adaptation quality and efficiency.
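The noise injection in item (3) can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions: the noise magnitude here is a fixed linear function of a given importance score rather than a trained quantity, and the function names, `max_sigma`, and the clipping to [0, 1] are hypothetical choices made for illustration only.

```python
import random


def noise_magnitudes(importance, max_sigma=0.1):
    """Map importance scores to noise std-devs: the more important, the less noise."""
    lo, hi = min(importance), max(importance)
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    return [max_sigma * (1.0 - (s - lo) / span) for s in importance]


def mutate_scaling_ratios(ratios, importance, max_sigma=0.1, rng=None):
    """Add zero-mean Gaussian noise to scaling ratios and clip the result to [0, 1]."""
    rng = rng or random.Random()
    sigmas = noise_magnitudes(importance, max_sigma)
    return [min(1.0, max(0.0, r + rng.gauss(0.0, s)))
            for r, s in zip(ratios, sigmas)]
```

With this mapping, the most important entry receives zero noise and the least important one receives noise with standard deviation `max_sigma`, which matches the qualitative rule stated above.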
Besides, to enable the stable ensemble training of multiple variant-nets, we leverage the mini-batch techniques to split the training data into small batches. We normalize the gradient to reduce the interference caused by gradient variance (Li et al., 2014).
5. Runtime Adaptive Compression
This section presents how AdaSpring quickly searches for the most suitable combination of retraining-free compression operators, from a flexible and elite space, to reconfigure the trained self-evolutionary network on-the-fly.
5.1. Flexible and Elite Search Space
5.1.1. Multi-granularity Search Space
We form an elite search space, which includes a set of coarse-grained compression operators (e.g. Fire block (Iandola et al., 2016), SVD-based (Wu et al., 2018b), sparse coding-based (Bhattacharya and Lane, 2016) factorization), for faster convergence, and the fine-grained compression operators (e.g. channel-level and depth-level pruning and channel-wise randomization), for better diversity. Consider a convolutional layer that has the total parameters: (input feature map channel size ) × (output feature map channel size ) × (kernel width/height ) × (kernel width/height ), and total activations: (output feature map width/height) . Scaling either the input feature map, kernel, channel, or output feature map can shrink the model complexity. We empirically observe that different scaling dimensions are not independent. Firstly, there is no single compression technique that achieves the best application-driven performance (i.e., , , , , , and ). It is necessary to combine several compression techniques. Secondly, as mentioned in Section 2, few existing compression techniques are retraining-free or dedicated to optimizing the holistic hardware efficiency across various platforms. These findings further suggest that we flexibly coordinate and balance multiple scaling dimensions by searching for the best combination of compression operators, rather than scaling a single dimension (e.g. model pruning).
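The per-layer quantities mentioned above can be computed with a short Python sketch. It uses the standard counting formulas for a square-kernel convolution with bias ignored; the activation count (output channels times output height times output width) and the MAC count are the usual textbook definitions and are assumed here rather than taken from the paper.

```python
def conv_output_size(in_size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (in_size + 2 * padding - kernel) // stride + 1


def conv_layer_counts(c_in, c_out, k, in_h, in_w, stride=1, padding=0):
    """Return (parameters, activations, MACs) of a square-kernel conv layer."""
    out_h = conv_output_size(in_h, k, stride, padding)
    out_w = conv_output_size(in_w, k, stride, padding)
    params = c_in * c_out * k * k          # weights only, bias ignored
    activations = c_out * out_h * out_w    # size of the output feature map
    macs = params * out_h * out_w          # every weight is used once per output position
    return params, activations, macs
```

For example, `conv_layer_counts(64, 128, 3, 32, 32, padding=1)` gives 73,728 parameters, 131,072 activations, and 75,497,472 MACs. Halving the output channels halves all three counts, whereas shrinking only the kernel reduces parameters and MACs but leaves the activation count unchanged, which is one way to see that the scaling dimensions affect the metrics differently.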
5.1.2. Hardware Efficiency-guided Combination.
We argue that the widely used parameter count, MAC amount, or speedup ratio are not good approximations of hardware efficiency, which heavily depends on the memory movement and bandwidth bound. For example, Jha et al. (Jha et al., 2019) reported that although SqueezeNet (Iandola et al., 2016) has fewer parameters than AlexNet (Krizhevsky et al., 2012), it consumes 33% more energy due to its larger amount of activations and data movement. We also identify that merely cutting down the parameter size may lead to an increase in activation size, which, in turn, increases the memory footprint and energy consumption (Jha and Mittal, 2020). Indeed, recent studies (Jha et al., 2019; Jha and Mittal, 2020) have shown that the energy consumption of CNNs mainly depends on the memory movement, memory reuse, and bandwidth bound.

To this end, we present a controllable hardware-efficiency criterion, i.e., arithmetic intensity, to guide the automated combination of compression operators in different layers. Inspired by hardware studies (Jha et al., 2019; Jha and Mittal, 2020), we leverage the arithmetic intensity as a proxy for the degree of reuse of parameters and activations and for the energy consumption required for processing inputs. Because the measurement of hardware-relative metrics, especially energy efficiency, is not straightforward, we present three hardware-efficiency metrics to predict how efficiently arithmetic operations can reuse the data fetched from different levels in the memory hierarchy and how efficiently the arithmetic operations are executed.
- Computation/parameter ratio : an approximation of the parameter arithmetic intensity;
- Computation/activation ratio : a proxy of the activation arithmetic intensity;
- Latency : includes the measured inference time of a specialized model and the time for loading parameters and activations for convolution computing on the target mobile device, i.e., .
We separately evaluate and and then aggregate them together by the aggregation coefficients and , to better profile the energy efficiency of each candidate compression operator .
(2) |
Based on these criteria, AdaSpring automatically selects and combines compression operators to maximize the aggregated value of , subject to the upper limit of the computational intensity of the mobile platform. It also tests the real latency against the latency budget to avoid exploring invalid solutions. We empirically set (e.g. as default) since contributes more to the memory footprint (as benchmarked in Section 6.5). AdaSpring also discovers some novel combinations for optimizing the underlying data movement. For example, we suggest the and groups (as discussed in Section 6). The fine-grained channel-wise scaling operators (e.g. , ) readjust the channel size, MAC amount, and output activation size of the conv layers to smooth out the bandwidth-bound problem caused by the coarse-grained operators (e.g. and ). The hardware efficiency-guided combination of several compression operators also helps to avoid a blind, explosive growth of combinations.
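The selection rule described above can be sketched in Python as follows. This is only an illustrative sketch: it assumes that the aggregation in Equ. (2) is a weighted sum of the computation/parameter and computation/activation ratios, and the class `OperatorProfile`, the function names, and the coefficient values 0.4 and 0.6 are hypothetical placeholders rather than the values used by AdaSpring.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OperatorProfile:
    """Profiled numbers for one candidate compression operator (hypothetical structure)."""
    name: str
    macs: float        # computation amount
    param_size: float  # parameter size
    act_size: float    # activation size
    latency_ms: float  # measured latency on the target device


def aggregated_intensity(p: OperatorProfile, mu_param: float = 0.4,
                         mu_act: float = 0.6) -> float:
    """Weighted sum of parameter and activation arithmetic intensity (assumed form)."""
    return mu_param * p.macs / p.param_size + mu_act * p.macs / p.act_size


def select_operator(candidates: List[OperatorProfile],
                    latency_budget_ms: float) -> Optional[OperatorProfile]:
    """Drop candidates that violate the latency budget, then maximize the aggregate."""
    valid = [p for p in candidates if p.latency_ms <= latency_budget_ms]
    if not valid:
        return None
    return max(valid, key=aggregated_intensity)
```

Filtering on measured latency before ranking mirrors the step of testing the real latency so that invalid solutions are never explored further.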
5.2. Runtime Search Strategy
To evolve the DNN architecture and weights to an optimal configuration at runtime, we propose a runtime search strategy based on the above flexible and elite search space.
5.2.1. Progressive Shortest Encoding of Candidate
A complex self-evolutionary network contains many combinations of compression operator variants and configurations, so it is difficult to systematically and generically choose the right candidate configurations and encode them into the representation used by a search algorithm. For optimizing DNN compression configurations at runtime, such representations define the potential search space of the problem to be explored. Given that some candidate configurations do not contribute to the specific performance optimization demand, and that other candidates can already represent some of their information, the shortest practical encoding benefits both the search result (i.e., model evolution plans) and the search overhead.
As shown in Figure 7(a), the classic binary encoding of all compression operator configurations across all layers is redundant. Specifically, consider a backbone network with conv layers to be selectively compressed, and take as an example. A classic binary encoding method needs bits to record whether a specific layer participates in compression or not, plus other bits (four bits per layer to represent the selectable operators). In this way, the encoding length is when we have optional compression operators. The search space derived from this encoding is , i.e., . Furthermore, it increases exponentially as the number of optional compression operators increases.


To better represent the fundamental search space, we propose the progressive shortest encoding of compression operator configurations in a layer-dependent manner. As we will show in Section 6.5.3, it improves the search efficiency by one order of magnitude compared to the classic binary encoding. As shown in Figure 7(b), we use digits to record the count of layers that have been compressed. The first digit represents the compressed layer count, and the following variable-length digits record the selected compression operator index of each layer. For example, the value of the first digit means that only the first conv layer is compressed on demand. Therefore, only one additional digit is needed to record its compression operator index (i.e., ). Afterward, AdaSpring inherits the above -digit string and injects channel-wise variance to mutate the inherited surviving -digit encoding string. The channel-wise variance mutation process is described in Section 5.2. We then turn to the second adaptable conv layer. If the first digit (the compressed layer count) is updated to 2, we append one more digit indicating the selected compression operator index to the surviving -digit encoding string. Thus, the encoding length progressively increases from 2 to (), and the complexity of the search space is reduced to . The progressive shortest encoding of candidates is conducive to the flexibility of AdaSpring and prevents unnecessary exploration.
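The progressive encoding described above can be illustrated with a small Python sketch. The list-of-integers representation and the helper names `extend_encoding` and `decode` are assumptions made for this example; the mutation step is omitted.

```python
from typing import Dict, List


def extend_encoding(encoding: List[int], operator_index: int) -> List[int]:
    """Inherit a surviving encoding and append the operator chosen for the next layer.

    The first digit is the number of compressed layers; each following digit
    is the operator index selected for one compressed layer, in layer order.
    """
    count = encoding[0] if encoding else 0
    return [count + 1] + list(encoding[1:]) + [operator_index]


def decode(encoding: List[int]) -> Dict[int, int]:
    """Map each compressed layer (1-based position) to its operator index."""
    count, operators = encoding[0], encoding[1:]
    if count != len(operators):
        raise ValueError("layer count digit does not match the number of operator digits")
    return {layer: op for layer, op in enumerate(operators, start=1)}
```

For instance, starting from an empty encoding, choosing operator 3 for the first adaptable layer yields a 2-digit string, and choosing operator 0 for the second layer extends it to 3 digits, so the string only grows as far as the search actually proceeds.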
5.2.2. Runtime3C Search Algorithm
This subsection presents the Runtime3C search algorithm, a Pareto-optimal decision-based search algorithm that picks a single optimal solution from the search space at runtime. To the best of our knowledge, many widely used general-purpose search algorithms (e.g. evolutionary algorithms) are not designed to optimize the runtime adaptive compression problem or to handle the dependency constraints among multiple DNN performance metrics. We heuristically regard the selection of compression operators for each layer as a single-layer optimization subproblem and solve the subproblems collaboratively to derive the most suitable solution quickly and effectively.
As shown in Algorithm 1, each subproblem at layer searches for the optimal group of compression operators for optimizing the overall performance of the entire DNN. Starting from the second conv layer by default, AdaSpring selects two candidate solutions at layer from the Pareto front of the selectable compression operator groups for optimizing the accuracy and energy efficiency of the entire model (line 2). In detail, the two picked candidate solutions are the best two compromises in vs. from the Pareto front within the valid search space, i.e., . Here, we leverage the ranking of the pre-tested accuracy and energy cost of the DNNs to establish the Pareto front; the accuracy ranking derived from historical results is consistent with the ranking of the actual accuracy of these DNNs measured on mobile devices. We then mutate and augment the candidates from two to six by injecting channel-wise variance into the candidate configurations, using the trained architecture importance as the criterion for Gaussian noise injection. Inspired by the genetic algorithm in adaptive software engineering (Chen et al., 2018), this process improves the diversity of the subproblem solutions as well as the performance of the global solution. We choose the best candidate as the surviving subproblem solution for compressing layer (line 6). Afterward, the surviving solution of the -th layer is used to reconfigure that layer and becomes the initial state of the subproblem at the -th layer. We fix the selected compression configurations for the -th layer and repeat the above search steps (line ) to specialize the optimal compression strategies for the -th layer. Once the model satisfies the dynamic constraints on latency and memory at time , the subproblem expansion stops (line 12). Finally, the algorithm outputs the global compression configuration solution.
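The layer-by-layer procedure above can be approximated by the following Python sketch. It is not Algorithm 1 itself: the callbacks `evaluate` and `meets_budget`, the weighted scalarization used to rank candidates, and the fixed Gaussian noise scale (in place of the trained architecture importance) are all assumptions introduced for illustration.

```python
import random
from typing import Callable, List, Sequence, Tuple

Choice = Tuple[int, float]     # (operator index, channel keep ratio); hypothetical
Metrics = Tuple[float, float]  # (accuracy, energy cost) of the whole model


def pareto_front(scored: Sequence[Tuple[object, float, float]]):
    """Keep entries (item, accuracy, energy) that no other entry dominates."""
    front = []
    for item, acc, energy in scored:
        dominated = any(a >= acc and e <= energy and (a > acc or e < energy)
                        for _, a, e in scored)
        if not dominated:
            front.append((item, acc, energy))
    return front


def runtime_search(num_layers: int, operators: Sequence[int],
                   evaluate: Callable[[List[Choice]], Metrics],
                   meets_budget: Callable[[List[Choice]], bool],
                   w_acc: float = 0.5, seed: int = 0) -> List[Choice]:
    rng = random.Random(seed)

    def score(metrics: Metrics) -> float:
        acc, energy = metrics
        return w_acc * acc - (1.0 - w_acc) * energy  # assumed scalarization

    plan: List[Choice] = []
    for _layer in range(num_layers):
        # Rank every operator for this layer given the choices already fixed.
        scored = [(op, *evaluate(plan + [(op, 1.0)])) for op in operators]
        front = sorted(pareto_front(scored),
                       key=lambda s: score((s[1], s[2])), reverse=True)
        seeds = [(op, 1.0) for op, _, _ in front[:2]]  # best two compromises
        candidates = list(seeds)
        for op, ratio in seeds:                        # two mutants per seed: 2 -> 6
            for _ in range(2):
                mutated = min(1.0, max(0.1, ratio + rng.gauss(0.0, 0.1)))
                candidates.append((op, mutated))
        plan.append(max(candidates, key=lambda c: score(evaluate(plan + [c]))))
        if meets_budget(plan):                         # stop expanding subproblems
            break
    return plan
```

Because each layer only evaluates a handful of candidates given the inherited plan, the number of model evaluations grows linearly with the number of layers in this sketch rather than with the number of full operator combinations.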
6. Evaluation
This section presents the evaluation of AdaSpring over different mobile applications on diverse mobile and embedded platforms with dynamic deployment context. We compare AdaSpring against ten alternative methods reported in the state-of-the-art literature.
No. | Target task (utility label) | Dataset | Description
D1 | Image ( classes) | CIFAR-100 (Krizhevsky, 2009a) | images
D2 | Image ( classes) | ImageNet (Deng et al., 2009) | images
D3 | Acoustic event ( classes) | UbiSound (Sicong et al., 2017) | audio clips
D4 | Human activity ( classes) | Har (UCI, 2017) | records of accelerometer and gyroscope
D5 | Driver behavior ( classes) | StateFarm (Kaggle, 2019) | images
6.1. Experiment Setup
We first present the settings for our evaluation.
System Implementation. We implement AdaSpring’s offline block with TensorFlow (Google, 2017) in Python on the server side to train the self-evolutionary network (see Section 4). We implement AdaSpring’s online blocks on the mobile and embedded platforms to adjust the DNN configurations on the fly for better inference performance. The self-evolutionary network (i.e., a backbone-net and multiple variant compression operators), generated by AdaSpring’s offline component, is then loaded into the target platform. To further reduce the memory access cost, we load DNN parameters from L2-Cache memory.
Evaluation Applications/Datasets. We use five commonly used mobile applications/datasets to evaluate AdaSpring’s performance as elaborated in Table 1. Specifically, we test AdaSpring for mobile image classification (D1: Cifar100 (Krizhevsky, 2009b), D2: ImageNet (Deng et al., 2009)), mobile acoustic event awareness (D3: UbiSound (Sicong et al., 2017)), mobile human activity sensing (D4: Har (UCI, 2017)), and mobile driver behavior prediction (D5: StateFarm (Kaggle, 2019)).
Mobile Platforms with Dynamic Context Settings. We evaluate AdaSpring on three categories of commonly used mobile and embedded platforms, including one personal smartphone, i.e., Xiaomi RedMi 3S (device1), one embedded development board, i.e., Raspberry Pi 4B (device3), and one mobile robot platform, i.e., NVIDIA Jetbot (device4), loaded with the mobile development board. They are equipped with diverse processors, storage, and battery capacities. The dynamic context is formulated by the time-varying latency budget , storage budget , and the relative importance coefficient of the accuracy and energy efficiency objectives.
Comparison Baselines. We employ three categories of DNN specialization baselines to evaluate AdaSpring. The detailed settings of the ten baselines from three categories are as follows. Firstly, the hand-crafted compression baselines rely on manual design to realize efficient DNN compression. They provide a high standard for AdaSpring to tune the specialized DNNs’ performance tradeoff between accuracy, latency, and resource efficiency.
- Fire (Iandola et al., 2016), presented in SqueezeNet, reduces filter size and decreases input channels using squeeze layers.
- MobileNetV2 (Sandler et al., 2018) replaces the traditional convolutional operation with an inverted residual with a linear bottleneck, which expands the module to a high dimension and then filters with a depth-wise convolution.
- SVD-based convolutional decomposition technique (Lane et al., 2016) introduces an extra conv layer between and using singular value decomposition (SVD) based parameter matrix factorization. The number of neurons in the inserted layer is set according to the dynamic neuron numbers in , i.e., .
- Sparse coding-based convolutional decomposition technique (Bhattacharya and Lane, 2016) inserts a conv layer between and using sparse coding-based parameter matrix factorization. The k-basis dictionary is dynamically determined by the neuron number in , i.e., .
Secondly, the on-demand DNN compression baseline methods adopt a trainable optimizer to automatically find the most suitable DNN compression strategies for various mobile platforms. These baselines provide a strict benchmark against which we can validate that both searching and retraining costs are bottleneck limitations for the runtime adaptation demands.
- AdaDeep (Liu et al., 2020) automatically selects and combines compression techniques to generate a specialized DNN that balances accuracy and resource constraints.
- ProxylessNAS (Cai et al., 2018) directly learns architectures without any proxy while still allowing a large candidate set and removing the restriction of repeating blocks.
- Once-for-all (OFA) (Cai et al., 2019) obtains a specialized sub-network by selecting from the once-for-all network that supports diverse architectural settings without additional training.
Thirdly, since runtime adaptive DNN compression requires quickly searching for the most suitable combination of retraining-free compression techniques, we select two baseline optimization methods to compare with AdaSpring. Here, the baseline optimizers represent two intuitive searching ideas for the runtime adaptive compression of DNN configurations.
- Exhaustive optimizer tests the performance of all combinations of compression operators on the validation set and then selects the one with the best tradeoff based on the fixed performance ranking. It then fixes the compression operators and only scales down the compression operators’ hyperparameters, i.e., compression ratio, to satisfy the dynamic resource budgets.
- Greedy optimizer selects, layer by layer, the compression operator that obtains the best tradeoff between accuracy and parameter size, in which the relative importance is equally set to a fixed value of .
- AdaSpring selects and applies the most suitable combination of compression operators to the self-evolutionary backbone network for the accuracy and resource efficiency tradeoff.
Baselines | DNN compression techniques | Performance of specialized DNN | Performance of DNN specialization scheme
 | | Accuracy (%) *1 | Latency (ms) | | | Energy (mJ) | Search cost | Retraining cost | Scaling flexibility |
Stand-alone compression | Fire (Iandola et al., 2016) | 72.3 | 24.7 | 81.2 | 394.7 | 3.1 | 0 | 1.5N *2 | fix |
Stand-alone compression | MobileNetV2 (Sandler et al., 2018) | 72.6 | 48.1 | 84.3 | 128.4 | 5.2 | 0 | 1.8N *2 | fix | —
Stand-alone compression | SVD decomposition (Lane et al., 2016) | 71.2 | 21.7 | 68.6 | 165.8 | 4.8 | 0 | 2.3N *2 | scalable | —
Stand-alone compression | Sparse coding decomposition (Bhattacharya and Lane, 2016) | 72.9 | 22.3 | 69.8 | 195.2 | 4.6 | 0 | 2.3N *2 | scalable | —
On-demand compression | AdaDeep (Liu et al., 2020) | 73.5 | 21.9 | 78.3 | 264.6 | 3.5 | 18N *2 | 38N *2 | scalable | —
On-demand compression | ProxylessNAS (Cai et al., 2018) | 74.2 | 49.5 | 121.3 | 232.1 | 3.8 | 196N *2 | 29N *2 | scalable | —
On-demand compression | OFA (Cai et al., 2019) | 71.4 | 51.2 | 123.4 | 257.3 | 3.1 | 41 | 0 | scalable | scalable
Runtime adaptive compression | Exhaustive optimizer | 58.3 | 21.1 | 81.2 | 283.2 | 2.9 | 0 | 0 | — | —
Runtime adaptive compression | Greedy optimizer | 65.3 | 16.7 | 83.5 | 298.4 | 3.1 | 25 | 0 | — | —
Runtime adaptive compression | AdaSpring | 74.1 | 15.6 | 158.9 | 358.7 | 1.9 | 3.8 | 0 | scalable | scalable
*1 We test the average DNN accuracy at three dynamic moments.
*2 The N in the search cost and retraining cost columns indicates that the cost is linear in the number of deployment contexts.
6.2. Performance Comparison
We evaluate AdaSpring in terms of the specialized DNNs’ running performance (i.e., accuracy, amount of MACs, parameter arithmetic intensity, activation arithmetic intensity, and energy consumption) and the specialization methods’ all-around performance (i.e., search cost, retraining cost, and scaling flexibility). As shown in Table 2, we compare AdaSpring’s performance with ten baselines. In this set of experiments, we use the same mobile sensing task (i.e., image recognition on the CIFAR-100 (D1) dataset) and target embedded platform (i.e., Raspberry Pi 4B) for six state-of-the-art baselines and AdaSpring for a fair comparison. We adopt different baseline methods to specialize the DNN architectures and weights for optimizing the accuracy and resource efficiency objectives under dynamically specified constraints (see Equ. 3.2). Here, the relative importance coefficients (i.e., and ) are dynamically determined by the remaining battery percentage of the target platform, i.e., , and . Afterwards, we test the specialized DNN’s running performance on a Raspberry Pi 4B platform. To mitigate the effect of noise and increase the robustness of the performance measurements, we repeat the steps mentioned above five times and report the average.

Mobile task | Compression operator configurations | Compared to the performance of MobileNet network
 | | A loss | E | T | C | Sp | Sa
CIFAR-100 (D1) | | -2.1% | 2.5 | 1.2 | 5.6 | 2.8 | 1.2
ImageNet (D2) | | -0.9% | 8.9 | 1.3 | 8.6 | 5.2 | 1.9
UbiSound (D3) | | 1.3% | 15.2 | 1.1 | 4.3 | 3.8 | 1.2
Har (D4) | | -0.3% | 2.1 | 0.8 | 9.2 | 7.1 | 1.3
StateFarm (D5) | | 0.2% | 5.9 | 0.7 | 5.6 | 4.3 | 1.6
Performance comparison. Table 2 summarizes the performance comparison between the ten baseline methods and AdaSpring. First, AdaSpring achieves the best overall performance in terms of accuracy, MAC amount, parameter arithmetic intensity, activation arithmetic intensity, and energy consumption, while incurring negligible accuracy loss, compared to the DNNs specialized by other baseline methods. AdaSpring reduces the model inference latency to 15.6 ms and the energy consumption to 1.9 mJ. It increases the parameter arithmetic intensity to and the activation arithmetic intensity to . Notably, the DNN generated by AdaSpring has the largest activation arithmetic intensity and the second-largest parameter arithmetic intensity. Compared to the parameter size, the influence of activation arithmetic intensity on energy consumption is equally or even more critical. The DNNs specialized by the hand-crafted Fire, MobileNetV2, SVD, and sparse coding techniques consume energy by , , , and , respectively. The accuracy of the exhaustive optimizer is much lower than that of the proposed design, because it fixes the compression operator categories and only over-compresses their hyperparameters. This outcome demonstrates that reselecting among different compression operators is necessary. The accuracy of the DNN specialized by AdaSpring is comparable to that of ProxylessNAS (74.1% vs. 74.2% in Table 2) and better than that of the hand-crafted compression techniques. Second, AdaSpring’s specialization scheme is the most efficient in reducing the searching cost and retraining cost. The adaptive compression baselines involve a high overhead in retraining. For example, AdaDeep requires an average of hours for retraining (e.g. retraining the deep reinforcement learning model-based optimizer) offline on the GPU platform for each adjustment of compression strategies. AdaDeep and ProxylessNAS need and hours, respectively, to search from the candidate configurations, which increases linearly with the number of dynamic contexts. Although neither OFA nor AdaSpring needs retraining, OFA needs to search per adaptation, while AdaSpring only needs to do that. This is because AdaSpring leverages the elite compression operator space, rather than the basic kernel size or channel number space in OFA, to avoid redundant search exploration.
Summary. AdaSpring outperforms the other baselines in terms of the DNN performance tradeoff between accuracy, latency, arithmetic intensity of parameters and activations, and energy consumption. Meanwhile, it incurs a modest searching cost without retraining, making it well suited for runtime adaptive DNN compression.
6.3. AdaSpring’s Performance over Different Tasks
To illustrate AdaSpring’s performance over different tasks, we evaluate it using all five applications/datasets (see Section 6.1) on a Raspberry Pi 4B platform (device3), which is powered by a mobile battery. AdaSpring dynamically detects the platform’s remaining battery and sets the coefficients between accuracy and energy efficiency in Equ. 3.2 according to the percentage of remaining power , i.e., . In addition, we specify the storage budget as 2MB, which is the capacity of the L2-Cache. We set an accuracy loss threshold of for the image classification tasks (, ), the sound sensing task (i.e., ), the human activity prediction task (), and the driver behavior recognition task (), respectively. We also assume the latency sensitivity as the latency budget of , for .



Diverse platform | Dynamic context
Device | Processor | L2-Cache | Battery | Time | | | |
Redmi 3S smartphone | Qualcomm B21 | 2MB | 4100mAh | Remaining battery | | | |
Raspberry Pi 4B | Cortex-A72 | 2MB | 3800mAh | Available cache | 2MB | 1.6MB | 1.5MB | 1.7MB
NVIDIA Jetbot | Cortex-A57 | 2MB | 7200mAh | Inference requests | 2 times | 1 time | 2 times | 1 time
Performance. Figure 8 compares the performance of the DNN configurations specialized by AdaSpring on five different tasks in terms of user experience metrics (i.e., inference accuracy, energy efficiency, and inference latency) and direct DNN metrics (i.e., computation, parameter size, and activation size). We compute the mean and standard deviation of the running performance of the DNNs specialized at five dynamic moments, at which the percentage of remaining battery is , , , , and , respectively. These battery levels affect the tradeoff demands on the objectives. The storage budget for parameters dynamically depends on the available Cache capacity. We simulate the unpredictable resource contention from other software by injecting random noise into the Cache’s available capacity, i.e., MB. For different tasks, datasets, and deployment contexts, AdaSpring selects various combinations of compression operators to scale the model configurations up or down, optimizing and balancing multiple performance metrics. It achieves the inference latency , the parameter arithmetic intensity , and the activation arithmetic intensity , with a negligible accuracy loss ( ) or even accuracy improvement ().
Summary. For different tasks with diverse backbone model shapes and various sensitivities to accuracy loss and latency, the DNN specialized by AdaSpring varies. For the same task, the DNN compression configurations found by AdaSpring also differ according to the dynamic deployment context.
6.4. AdaSpring’s Performance across Diverse and Dynamic Deployment Contexts
In this experiment, we compare AdaSpring’s performance for the mobile sound sensing application (D3), tested on three different platforms. We adopt the same self-evolutionary network, comprising the same backbone-net and some optional compression operator variants. Different platforms have different resource characteristics, which are further affected by dynamic deployment contexts. In particular, we adopt the RedMi 3S smartphone equipped with a Qualcomm B21 processor, 2MB L2-Cache, and 4100mAh battery; the Raspberry Pi 4B with a Cortex-A72 processor, 2MB L2-Cache, and 3800mAh battery; and the NVIDIA Jetbot with a quad-core ARM Cortex-A57 processor, 2MB L2-Cache, and 7200mAh battery (see Table 4). We adopt dynamic deployment context settings similar to those in Section 6.3.




Performance. Figure 9 summarizes the performance of the DNNs specialized by AdaSpring as the deployment context changes dynamically. We first initialize different DNN configurations for the various platform constraints and then leverage AdaSpring to update the compression operator-variant selections according to the specific platform’s dynamic contexts. We select four points of dynamic contexts. AdaSpring identifies DNN configurations that obtain a latency of , parameter arithmetic intensity , and activation arithmetic intensity while having slightly degraded or even better accuracy . We pick a time fragment to show four moments with dynamic deployment contexts, as shown in Table 4. As the battery power gradually decreases and the Cache capacity fluctuates dynamically, we show the performance changes of the DNN, which is continually scaled down or up by selecting and combining different compression operators. Moreover, AdaSpring supports scaling up the model architecture again when the dynamic constraints on resource efficiency are relaxed, bringing better flexibility.
Summary. AdaSpring adaptively selects the proper combination of compression operators to continually optimize DNN performance to meet dynamic context demands. Moreover, AdaSpring accomplishes a flexible evolution, i.e., it supports both scaling up and scaling down the DNN configurations as the context demands change.
6.5. Micro-benchmarks of AdaSpring
In this subsection, we evaluate the impact of different factors on AdaSpring’s design.
6.5.1. Hardware Efficiency-guided Combination
We compare the performance of DNNs reconfigured by a stand-alone compression technique (e.g. the Fire module (Iandola et al., 2016)), by two blindly combined compression techniques (e.g. Fire module plus depth-wise pruning), and by the proposed hardware-efficient grouping of compression operators (see Figure 10(a)). We show that the hardware-efficient grouping consistently achieves a comparable overall performance in terms of accuracy, energy efficiency, and latency.







6.5.2. Layer-dependent Inheriting and Mutation
As discussed in Section 5.2, we leverage the inheriting and mutation schemes to balance the searching diversity and convergence. We compare the locally greedy layer-by-layer scheme, the layer-dependent inheriting scheme, and the proposed layer-dependent inheriting plus mutation scheme in AdaSpring. Figure 10(b) shows that AdaSpring achieves the best tradeoff between model accuracy and energy efficiency.
6.5.3. Progressive Shortest Encoding
The encoding of convolutional compression configurations at multiple layers affects the complexity of the search space. Figure 10(c) compares the performance of the classic binary encoding and the progressive shortest encoding scheme. We find that AdaSpring’s progressive shortest encoding method boosts the search efficiency.
6.5.4. Aggregation Coefficients in Arithmetic Intensity
As mentioned in Section 5.1.2, the aggregation coefficients and for the parameter and activation arithmetic intensity need to be optimized empirically. Figure 10(d) illustrates the estimated energy consumption under different aggregation coefficient settings. Based on these results, we set by default across different platforms.
6.6. Case Study
We deploy AdaSpring on a commercial mobile robot platform (i.e., NVIDIA Jetbot, device4) and conduct a one-day experiment (09:00 to 17:00) to continually optimize the DNN configurations for a sound assistant application (i.e., UbiEar (Sicong et al., 2017)). This application adopts a DNN to realize a sound recognition and notification tool for hard-of-hearing people to sense emergency events (e.g. fire alarms, smoke alarms, kettle boiling whistle) and social events (e.g. doorbell ring, knocking door, people crying). We simulate the dynamic mobile context of the DNN (as described in Section 3) as follows. On the one hand, we artificially play some audio clips for emergency events and generate social events to control the frequency of acoustic events, which affects the DNN inference frequency. On the other hand, we simulate the unpredictable storage resource contention from other software by injecting random noise (e.g. Gaussian noise) into the available capacity of the L2-Cache, i.e., MB. Here, the maximum capacity of the L2-Cache on the NVIDIA Jetbot platform is 2MB (see Table 4), and we update the randomized resource contention value of once per hour. We do not artificially change the battery power, which is continuously consumed in the real world as the application runs. Therefore, the remaining battery changes dynamically, e.g., 86%, 72%, and 63%, as shown in Figure 13, which forms the dynamic energy budgets.
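The storage contention simulation described above could be sketched as follows. This is an assumption-laden illustration: the text only states that random (e.g. Gaussian) noise is injected into the available L2-Cache capacity and refreshed hourly, so the subtraction form, the noise scale `noise_std_mb`, and the function names are hypothetical.

```python
import random
from typing import List


def available_cache_mb(max_cache_mb: float, noise_std_mb: float,
                       rng: random.Random) -> float:
    """Available capacity = maximum capacity minus a non-negative Gaussian contention."""
    contention = abs(rng.gauss(0.0, noise_std_mb))
    return max(0.0, max_cache_mb - contention)


def hourly_storage_budgets(hours: int, max_cache_mb: float = 2.0,
                           noise_std_mb: float = 0.3, seed: int = 0) -> List[float]:
    """Draw one storage budget per hour, e.g. 8 values for a 09:00 to 17:00 run."""
    rng = random.Random(seed)
    return [available_cache_mb(max_cache_mb, noise_std_mb, rng) for _ in range(hours)]
```

Each drawn value would then serve as the storage budget for the parameters during the corresponding hour.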

Figure 13 illustrates the dynamic deployment context (i.e., energy, storage, event frequency) of the DNN for the continuous sound sensing application. The battery’s remaining energy determines the importance coefficient in the runtime optimization problem (Equ. 3.2). The available capacity of the L2-Cache decides the storage budget of parameters . The frequency of sound emergencies indirectly influences the battery’s power. Different deployment contexts have various resource constraints and performance objective sensitivities, which lead to further performance and budget demands on the DNN. AdaSpring triggers the runtime DNN evolution block at a pre-defined frequency (e.g. every two hours) to shrink the DNN configurations on this regular day. Figure 12 shows the running performance of the DNNs specialized by AdaSpring. AdaSpring can continually and adaptively select the best compression strategy to shrink the DNN configurations given diverse user demands. Specifically, it selects (Fire) + (pruning 50% channel) for the regular resource-constrained moment, (Fire) + (pruning 1 layer) for the tight memory constraint moment, and (SVD-based decomposition) + (pruning 65% channel) for the tight battery-bounded moment. The evolved models can achieve accuracy and arithmetic intensity. AdaSpring searches for the proper combinations of compression operators that satisfy diverse demands on accuracy and resource efficiency within .
7. Conclusion
This paper addresses the runtime adaptive DNN compression problem, considering the dynamic deployment context of continuously running mobile applications. We present AdaSpring, a context-adaptive and runtime-evolutionary DNN compression framework that continually optimizes the DNN configurations (i.e., architectures and weights) to adapt to the dynamic context. We formulate the dynamic performance demands (e.g. accuracy, latency, energy efficiency) as a time-varying constrained optimization problem, and we propose a heuristic solution that quickly searches for the most suitable combination of retraining-free compression techniques at runtime. To decouple DNN training from runtime adaptive compression, we move computation ahead into the training of a self-evolutionary network at design time (see Section 4). We also present the Runtime3C search algorithm and a set of search speedup mechanisms to boost the runtime search efficiency and quality. Evaluation using five different mobile applications across four mobile platforms and a real-world case study shows the performance advantages of AdaSpring in evolving the DNN compression configurations locally online at the millisecond level. In future work, given the diverse and dynamic mobile scenarios (e.g. data, task, and platform), more efforts and insights into self-evolutionary deep model compression and optimization frameworks are much needed.
Acknowledgements.
This work was partially supported by the National Key R&D Program of China (2019YFB1703901), National Science Fund for Distinguished Young Scholars (62025205, 61725205), National Natural Science Foundation of China (No. 62032020, 61960206008, 62032017), and the Fundamental Research Funds for the Central Universities (No. 3102020QD1005). The authors also thank the anonymous reviewers for their constructive feedback that has made the work stronger.
References
- Bender et al. (2018) Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le. 2018. Understanding and simplifying one-shot architecture search. In International Conference on Machine Learning. 550–559.
- Bhattacharya and Lane (2016) Sourav Bhattacharya and Nicholas D Lane. 2016. Sparsification and separation of deep learning layers for constrained resource inference on wearables. In Proceedings of CD-ROM. 176–189.
- Bhattacharya et al. (2020) Sourav Bhattacharya, Dionysis Manousakas, Alberto Gil CP Ramos, Stylianos I Venieris, Nicholas D Lane, and Cecilia Mascolo. 2020. Countering Acoustic Adversarial Attacks in Microphone-equipped Smart Home Devices. Proceedings of the IMWUT 4, 2 (2020), 1–24.
- Cai et al. (2017) Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. 2017. Efficient architecture search by network transformation. arXiv preprint arXiv:1707.04873 (2017).
- Cai et al. (2019) Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han. 2019. Once-for-all: Train one network and specialize it for efficient deployment. arXiv preprint arXiv:1908.09791 (2019).
- Cai et al. (2018) Han Cai, Ligeng Zhu, and Song Han. 2018. Proxylessnas: Direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332 (2018).
- Chen et al. (2017) Guobin Chen, Wongun Choi, Xiang Yu, Tony Han, and Manmohan Chandraker. 2017. Learning efficient object detection models with knowledge distillation. In Advances in Neural Information Processing Systems. 742–751.
- Chen et al. (2020) Ling Chen, Yi Zhang, and Liangying Peng. 2020. METIER: A Deep Multi-Task Learning Based Activity and User Recognition Model Using Wearable Sensors. Proceedings of IMWUT 4, 1 (2020), 1–18.
- Chen et al. (2018) Tao Chen, Ke Li, Rami Bahsoon, and Xin Yao. 2018. FEMOSAA: Feature-guided and knee-driven multi-objective optimization for self-adaptive software. ACM Transactions on Software Engineering and Methodology 27, 2 (2018).
- Chen et al. (2016) Wenlin Chen, James Wilson, Stephen Tyree, Kilian Q Weinberger, and Yixin Chen. 2016. Compressing convolutional neural networks in the frequency domain. In Proceedings of SIGKDD. 1475–1484.
- Chen et al. (2019) Xin Chen, Lingxi Xie, Jun Wu, and Qi Tian. 2019. Progressive differentiable architecture search: Bridging the depth gap between search and evaluation. In Proceedings of ICCV. 1294–1303.
- Cheng et al. (2017) Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. 2017. A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282 (2017).
- Dai et al. (2019) Xiaoliang Dai, Peizhao Zhang, Bichen Wu, Hongxu Yin, Fei Sun, Yanghan Wang, Marat Dukhan, Yunqing Hu, Yiming Wu, Yangqing Jia, et al. 2019. Chamnet: Towards efficient network design through platform-aware model adaptation. In Proceedings of CVPR. 11398–11407.
- Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In Proceedings of CVPR.
- Fang et al. (2020) Jiemin Fang, Yuzhu Sun, Kangjian Peng, Qian Zhang, Yuan Li, Wenyu Liu, and Xinggang Wang. 2020. Fast neural network adaptation via parameter remapping and architecture search. arXiv preprint arXiv:2001.02525 (2020).
- French (1999) Robert M French. 1999. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences 3, 4 (1999), 128–135.
- Gao et al. (2018) Xitong Gao, Yiren Zhao, Łukasz Dudziak, Robert Mullins, and Cheng-zhong Xu. 2018. Dynamic channel pruning: Feature boosting and suppression. arXiv preprint arXiv:1810.05331 (2018).
- Gholami et al. (2018) Amir Gholami, Kiseok Kwon, Bichen Wu, Zizheng Tai, Xiangyu Yue, Peter Jin, Sicheng Zhao, and Kurt Keutzer. 2018. Squeezenext: Hardware-aware neural network design. In Proceedings of CVPR. 1638–1647.
- Google (2017) Google. 2017. TensorFlow. https://goo.gl/j7HAZJ.
- Han et al. (2016) Seungyeop Han, Haichen Shen, Matthai Philipose, Sharad Agarwal, Alec Wolman, and Arvind Krishnamurthy. 2016. Mcdnn: An approximation-based execution framework for deep stream processing under resource constraints. In Proceedings of MobiSys. 123–136.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of CVPR. 770–778.
- He et al. (2018) Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. 2018. Amc: Automl for model compression and acceleration on mobile devices. In Proceedings of ECCV. 784–800.
- He et al. (2017) Yihui He, Xiangyu Zhang, and Jian Sun. 2017. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision. 1389–1397.
- Howard et al. (2017) Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).
- Iandola et al. (2016) Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360 (2016).
- Jha and Mittal (2020) Nandan Kumar Jha and Sparsh Mittal. 2020. Modeling Data Reuse in Deep Neural Networks by Taking Data-Types into Cognizance. IEEE Trans. Comput. (2020).
- Jha et al. (2019) Nandan Kumar Jha, Sparsh Mittal, and Govardhan Mattela. 2019. The ramifications of making deep neural networks compact. In Proceedings of VLSID. IEEE, 215–220.
- Jiang et al. (2019) Yufan Jiang, Chi Hu, Tong Xiao, Chunliang Zhang, and Jingbo Zhu. 2019. Improved differentiable architecture search for language modeling and named entity recognition. In Proceedings of EMNLP-IJCNLP. 3576–3581.
- Kaggle (2019) Kaggle. 2019. State Farm Distracted Driver Detection. https://www.kaggle.com/c/state-farm-distracted-driver-detection.
- Kirkpatrick et al. (2017) James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114, 13 (2017), 3521–3526.
- Krizhevsky (2009a) Alex Krizhevsky. 2009a. Learning multiple layers of features from tiny images. https://www.tensorflow.org/datasets/catalog/cifar100.
- Krizhevsky (2009b) Alex Krizhevsky. 2009b. Learning multiple layers of features from tiny images. Technical Report.
- Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105.
- Kwon et al. (2020) Hyeokhyen Kwon, Catherine Tong, Harish Haresamudram, Yan Gao, Gregory D Abowd, Nicholas D Lane, and Thomas Ploetz. 2020. IMUTube: Automatic extraction of virtual on-body accelerometry from video for human activity recognition. arXiv preprint arXiv:2006.05675 (2020).
- Lane et al. (2016) Nicholas D Lane, Sourav Bhattacharya, Petko Georgiev, Claudio Forlivesi, Lei Jiao, Lorena Qendro, and Fahim Kawsar. 2016. Deepx: A software accelerator for low-power deep learning inference on mobile devices. In Proceedings of IPSN. IEEE, 1–12.
- Lane et al. (2015) Nicholas D Lane, Petko Georgiev, and Lorena Qendro. 2015. DeepEar: robust smartphone audio sensing in unconstrained acoustic environments using deep learning. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing. 283–294.
- Li et al. (2019) Gen Li, Inyoung Yun, Jonghyun Kim, and Joongkyu Kim. 2019. Dabnet: Depth-wise asymmetric bottleneck for real-time semantic segmentation. arXiv preprint arXiv:1907.11357 (2019).
- Li et al. (2014) Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J Smola. 2014. Efficient mini-batch training for stochastic optimization. In Proceedings of SIGKDD. 661–670.
- Liu et al. (2017) Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. 2017. Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436 (2017).
- Liu et al. (2018b) Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2018b. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055 (2018).
- Liu et al. (2020) Sicong Liu, Junzhao Du, Kaiming Nan, Atlas Wang, Yingyan Lin, et al. 2020. AdaDeep: A Usage-Driven, Automated Deep Model Compression Framework for Enabling Ubiquitous Intelligent Mobiles. arXiv preprint arXiv:2006.04432 (2020).
- Liu et al. (2018a) Sicong Liu, Yingyan Lin, Zimu Zhou, Kaiming Nan, Hui Liu, and Junzhao Du. 2018a. On-demand deep model compression for mobile devices: A usage-driven model selection framework. In Proceedings of MobiSys. 389–400.
- Luo and Wu (2020) Jian-Hao Luo and Jianxin Wu. 2020. Autopruner: An end-to-end trainable filter pruning method for efficient deep model inference. Pattern Recognition (2020), 107461.
- Ren et al. (2020) Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, Xiaojiang Chen, and Xin Wang. 2020. A Comprehensive Survey of Neural Architecture Search: Challenges and Solutions. arXiv preprint arXiv:2006.02903 (2020).
- Saikia et al. (2019) Tonmoy Saikia, Yassine Marrakchi, Arber Zela, Frank Hutter, and Thomas Brox. 2019. Autodispnet: Improving disparity estimation with automl. In Proceedings of ICCV. 1812–1823.
- Sandler et al. (2018) Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4510–4520.
- Sicong et al. (2017) Liu Sicong, Zhou Zimu, Du Junzhao, Shangguan Longfei, Jun Han, and Xin Wang. 2017. Ubiear: Bringing location-independent sound awareness to the hard-of-hearing people with smartphones. Proceedings of IMWUT 1, 2 (2017), 1–21.
- Singh et al. (2019) Pravendra Singh, Vinay Kumar Verma, Piyush Rai, and Vinay P Namboodiri. 2019. Play and prune: Adaptive filter pruning for deep model compression. arXiv preprint arXiv:1905.04446 (2019).
- Tan et al. (2019) Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. 2019. Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of CVPR. 2820–2828.
- Teerapittayanon et al. (2016) Surat Teerapittayanon, Bradley McDanel, and Hsiang-Tsung Kung. 2016. Branchynet: Fast inference via early exiting from deep neural networks. In Proceedings of ICPR. IEEE, 2464–2469.
- UCI (2017) UCI. 2017. Dataset for Human Activity Recognition. https://goo.gl/m5bRo1.
- Wang et al. (2020) Xiaofei Wang, Yiwen Han, Victor CM Leung, Dusit Niyato, Xueqiang Yan, and Xu Chen. 2020. Convergence of edge computing and deep learning: A comprehensive survey. IEEE Communications Surveys & Tutorials 22, 2 (2020), 869–904.
- Wu et al. (2018b) Junru Wu, Yue Wang, Zhenyu Wu, Zhangyang Wang, Ashok Veeraraghavan, and Yingyan Lin. 2018b. Deep k-Means: Re-Training and Parameter Sharing with Harder Cluster Assignments for Compressing Deep Convolutions. arXiv preprint arXiv:1806.09228 (2018).
- Wu et al. (2018a) Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, Steven Rennie, Larry S Davis, Kristen Grauman, and Rogerio Feris. 2018a. Blockdrop: Dynamic inference paths in residual networks. In Proceedings of CVPR. 8817–8826.
- Yang et al. (2020) Li Yang, Zhezhi He, Yu Cao, and Deliang Fan. 2020. A Progressive Sub-Network Searching Framework for Dynamic Inference. arXiv preprint arXiv:2009.05681 (2020).
- Yang et al. (2017) Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. 2017. Designing energy-efficient convolutional neural networks using energy-aware pruning. In Proceedings of CVPR. 5687–5695.
- Yang et al. (2019) Zhican Yang, Chun Yu, Fengshi Zheng, and Yuanchun Shi. 2019. ProxiTalk: Activate Speech Input by Bringing Smartphone to the Mouth. Proceedings of IMWUT 3, 3 (2019), 1–25.
- Yao et al. (2017) Shuochao Yao, Yiran Zhao, Aston Zhang, Lu Su, and Tarek Abdelzaher. 2017. Deepiot: Compressing deep neural network structures for sensing systems with a compressor-critic framework. In Proceedings of SenSys. 1–14.
- Yu and Huang (2019) Jiahui Yu and Thomas Huang. 2019. AutoSlim: Towards One-Shot Architecture Search for Channel Numbers. arXiv preprint arXiv:1903.11728 (2019).
- Zhao et al. (2018) Zhuoran Zhao, Kamyar Mirzazad Barijough, and Andreas Gerstlauer. 2018. DeepThings: Distributed adaptive deep learning inference on resource-constrained IoT edge clusters. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37, 11 (2018), 2348–2359.
- Zhong et al. (2018) Zhao Zhong, Junjie Yan, Wei Wu, Jing Shao, and Cheng-Lin Liu. 2018. Practical block-wise neural network architecture generation. In Proceedings of CVPR. 2423–2432.
- Zhou et al. (2020) Pan Zhou, Caiming Xiong, Richard Socher, and Steven CH Hoi. 2020. Theory-inspired path-regularized differential network architecture search. arXiv preprint arXiv:2006.16537 (2020).
- Zhu and Zabaras (2018) Yinhao Zhu and Nicholas Zabaras. 2018. Bayesian deep convolutional encoder–decoder networks for surrogate modeling and uncertainty quantification. J. Comput. Phys. 366 (2018), 415–447.
- Zoph et al. (2018) Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. 2018. Learning transferable architectures for scalable image recognition. In Proceedings of CVPR. 8697–8710.