Progressive Transmission and Inference of Deep Learning Models
Abstract
Modern image files are usually transmitted progressively, providing a preview before the entire image is downloaded to improve the user experience over slow network connections. In this paper, with a similar goal, we propose a progressive transmission framework for deep learning models, especially for the scenario where pre-trained models are transmitted from servers and executed on user devices (e.g., web browsers or mobile devices). Our progressive transmission allows approximate models to be inferred in the middle of file delivery and quickly provides acceptable intermediate outputs. On the server side, a deep learning model is divided and progressively transmitted to the user devices. The divided pieces are then progressively concatenated to construct approximate models on the user devices. Experiments show that our method is computationally efficient, does not increase the model size or total transmission time, and preserves the model accuracy. We further demonstrate that our method can improve the user experience by providing approximate models, especially over slow connections.
Index Terms:
deep learning model transmission, deep learning model deployment, deep learning application, progressive transmission, user experience
I Introduction
Recently, deep learning models have spread to user edge devices beyond powerful enterprise servers [1]. Although edge devices have lower computational power than servers, factors such as data privacy, communication latency, and server-side computational burden accelerate their adoption [2, 3, 4]. One simple way to use deep learning models on edge devices is to embed the model in the application at the deployment stage. However, embedding is not always possible, and the case where the model is transmitted over a network connection has received little attention. For example, in a complex application that uses multiple models, it might be impossible to embed all the models due to limited storage resources (Fig. 2a). In some cases, deep learning models might have to be transmitted again after deployment, since they are frequently updated on the server or situation-specific models must be provided each time (Fig. 2b). In a web application, the deep models are executed in a web browser, which does not support pre-embedding of the models (Fig. 2c). In these cases, the user has to wait a long time until the transmission finishes, and if the network connection is slow, the model transmission time harms the user experience. We further expect this problem to occur not only over slow network connections but also over relatively fast ones, since deep models keep growing in size for higher accuracy [5].

In this paper, we introduce a compromise solution to the problem of long transmission times of deep learning models. Instead of reducing the transmission time itself, we propose to provide approximate inference results in the middle of the transmission. A similar approach has been taken for image transmission: for example, the JPEG format supports a progressive mode in which a reasonable preview is available after only a portion of the data has been transmitted [6, 7]. Inspired by this classic yet effective approach, we argue that progressive transmission of deep models can improve the response time and the user experience.
Fig. 1 illustrates the tasks and the flow for progressive transmission and inference. As shown in the figure, the original model is divided before deployment, and concatenation and inference are conducted multiple times on the user device afterward. To enable progressive transmission, an interface that supports flexible division and concatenation should be designed without increasing the overall model size. In this paper, we present a new framework that includes division and concatenation schemes for progressive transmission and inference of deep learning models along with the compression algorithm. Next, we demonstrate that our framework does not increase the model size and total execution time even if it provides approximate inference results. We further evaluate whether our method improves user experience in a real-world web application by conducting a user study. The source code and demo are available at https://github.com/Prev/progressivenet.

II Related Work
II-A Progressive Transmission of Images
The goal of classic progressive transmission of images, introduced by Sloan and Tanimoto [6], is to display a reasonable preview with a minimum number of bits. To display a reasonable approximate image, limiting the pixel values (i.e., quantizing the pixel value range from 256 to 16) or reducing the image resolution has been used [7]. These algorithms cannot be applied directly to the transmission of deep learning models since the transmitted file types are completely different: a visual image versus a collection of weight matrices. However, the two problems share a common goal in that both consider human perception when showing intermediate results. In image transmission, the most plausible intermediate image must be shown; in deep learning model transmission, providing an intermediate model with the highest possible accuracy is crucial.
II-B Deep Model Compression
One way to improve the user experience when transmitting a deep model over a network is to compress the model. Many techniques have been proposed to compress deep models, such as network pruning [8, 9], weight quantization [10, 11], matrix factorization [12], and knowledge distillation [13]. While such methods are efficient in terms of compression rate, deep model compression necessarily trades off accuracy. Rather than compressing the model at the cost of accuracy, we can design an algorithm specialized for the model transmission scenario. Progressive transmission is a compromise policy that can be applied after deep model compression to improve the user experience.
II-C Deep Model over a Network
Recently, Chen et al. [14] introduced a design paradigm for compressing and transmitting models in a network environment, though it is dedicated to cloud and edge computing infrastructure and does not consider response time or user experience. Collaborative intelligence [15, 16, 17] is another paradigm in deep model deployment, proposed to address the transmission time, computational load, and storage limitations of edge devices. In collaborative intelligence, only a few layers of the model are executed on the edge device, and the output of these front layers is transmitted to the server instead of heavy raw input such as images or videos [15]. Collaborative intelligence works as a compromise between the edge-only and server-only approaches, and these studies show that it can improve inference latency. However, it still requires server resources, so it cannot be used in scenarios with cost or data privacy constraints.
Unlike these methods, our progressive transmission enhances the user experience by providing approximate inference results during transmission. In addition, our method avoids server cost and data privacy problems since no server-side computation is required.
III Framework Design

In this section, we propose a framework that transforms a trained model into a progressive model and provides approximate inference results during transmission. The framework includes a representation of the progressive model that supports multiple inferences during transmission while preserving the model size. The overall flow of the framework is illustrated in Fig. 1. To increase the usability of progressive transmission, we design the framework to support various vision tasks, including image classification and object detection. We further expose a flexible configuration that allows users of the framework to set the number of divisions and the size of each part on demand.
III-A A Naive Approach
A naive idea to design a representation for progressive transmission is to use the precision of the floating points in the model. Because most deep learning models operate through matrix additions and multiplications, model inference works well even if the floating points are manipulated and precision is slightly decreased [10]. Therefore, by transmitting significant bits of the floating points first and transmitting trivial bits later, the intermediate results can be inferred even if only a part of the model is delivered. For example, when the model is transmitted in two stages, a number in the model can be represented in the following form:
$$x = \pm\,(1.\,m_{1}\,m_{2})_{2} \times 2^{e} \qquad (1)$$
where the first significand bits $m_{1}$ and the exponent $e$ are sent first, and the second significand bits $m_{2}$ are sent later. This methodology is intuitive; however, it is not efficient in terms of representation space. Instead of splitting the floating points by digit, compression algorithms can be used to represent the model with smaller precision drops. Among them, quantization [10, 9, 11] is a general way to compress deep learning models.
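To make the digit-wise splitting concrete, the following sketch (our own illustration, not part of the released implementation) splits a 32-bit IEEE-754 float into a coarse part that keeps the sign, exponent, and leading significand bits, and a residual part holding the trailing significand bits; the split point high_bits is an arbitrary choice for illustration.

```python
import struct

def split_float(x, high_bits=12):
    """Split a 32-bit float into a coarse value and a residual bit pattern.

    The coarse value keeps the sign, exponent, and the top high_bits of the
    23-bit significand; the residual holds the remaining significand bits.
    """
    raw = struct.unpack("<I", struct.pack("<f", x))[0]  # reinterpret float as uint32
    low_bits = 23 - high_bits
    mask = (1 << low_bits) - 1
    residual = raw & mask          # trailing significand bits (sent later)
    coarse_raw = raw & ~mask       # sign + exponent + leading significand (sent first)
    coarse = struct.unpack("<f", struct.pack("<I", coarse_raw))[0]
    return coarse, residual

def merge_float(coarse, residual):
    """Re-attach the residual significand bits to the coarse value."""
    raw = struct.unpack("<I", struct.pack("<f", coarse))[0] | residual
    return struct.unpack("<f", struct.pack("<I", raw))[0]
```

Since the sign and exponent must always be sent in the first stage and the residual carries no shared scale across the matrix, this representation spends bits less efficiently than the quantization-based scheme described next.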
III-B Using Quantization
Fig. 3 illustrates an overview of the procedure from the original weight matrix to the intermediate inference results. Overall, the procedure consists of four steps: 1) quantization, 2) bit division, 3) bit concatenation, and 4) dequantization. We describe the framework step by step in this section.
Step 1: Quantization
To better represent the deep model, we first quantize every floating point of the matrices in the deep model. The most common quantization algorithm for deep models is to calculate the maximum and minimum values of each matrix and divide the range into uniform intervals onto which the matrix values are mapped. Our quantization scheme is similar to this algorithm, which is also used in the TensorFlow Lite [18] and TensorFlow.js [2] frameworks, yet we replace the rounding function with the flooring function. Jin et al. [19] have already addressed the issue that the rounding function causes precision losses in bit concatenation and requires extra storage to avoid them. We quantize a floating-point scalar $x$ in a matrix to the $k$-bit unsigned integer $q$ using
$$q = \left\lfloor \frac{x - x_{\min}}{x_{\max} - x_{\min} + \epsilon} \cdot 2^{k} \right\rfloor \qquad (2)$$
where $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of the matrix. We use a small enough value $\epsilon$ to keep the scaled scalars in the range $[0, 2^{k})$ before applying the flooring function. Using this equation, we obtain a quantized matrix from the original floating-point matrix, as displayed in Fig. 3.
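A minimal NumPy sketch of this step is given below; it follows Eq. (2), but the function and variable names, and the default value of $\epsilon$, are our own assumptions rather than the framework's actual API.

```python
import numpy as np

def quantize(w, k=16, eps=1e-6):
    """Floor-quantize a float matrix to k-bit unsigned integers (sketch of Eq. 2).

    The matrix range is scaled to [0, 2^k); eps keeps the maximum strictly
    below 2^k so the flooring function never produces an out-of-range value.
    """
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min + eps) / (2 ** k)
    q = np.floor((w - w_min) / scale).astype(np.uint32)
    return q, w_min, scale  # min and scale are kept for dequantization
```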
Step 2: Bit division
To support progressive transmission, we next divide the quantized matrix into multiple fraction matrices which have the same dimensions but different bit-widths. As one of the design goals of the framework is to provide flexible configuration to users, we expose the variable $b = (b_{1}, \ldots, b_{n})$, which corresponds to the bit-widths of the elements of the divided matrices (e.g., the configuration shown in Fig. 3). After dividing the matrices, we progressively transmit the divided model to the user device. The scheme to fetch the $i$-th part $q_{i}$ from the $k$-bit quantized integer $q$ is given by
$$q_{i} = \left( q \ll \sum_{j=1}^{i-1} b_{j} \right) \gg \left( k - b_{i} \right) \qquad (3)$$
where $b_{i}$ is the $i$-th bit-width and $\sum_{i=1}^{n} b_{i} = k$. Note that $\ll$ and $\gg$ are the unsigned bit-shift operations to the left and right, respectively, adopted for fast calculation.
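The division can be implemented with shifts and masks; the sketch below is an equivalent right-shift-and-mask form of Eq. (3), assuming the parts are ordered from most to least significant bits and that the bit-widths sum to $k$. Names are illustrative.

```python
import numpy as np

def divide(q, bit_widths, k=16):
    """Split a k-bit quantized matrix into fraction matrices (sketch of Eq. 3).

    Each fraction has the same shape as q but carries only b_i bits per
    element, ordered from the most significant bits to the least significant.
    """
    assert sum(bit_widths) == k
    parts, consumed = [], 0
    for b in bit_widths:
        consumed += b
        part = (q >> (k - consumed)) & ((1 << b) - 1)  # next b most significant bits
        parts.append(part.astype(np.uint32))
    return parts
```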
Table I: Total execution time (transmission, concatenation, dequantization, and inference) of singleton and progressive models.

| Model | Size | H/W | Singleton | Progressive (w/o concurrent) | Progressive (w/ concurrent) |
|---|---|---|---|---|---|
| MobileNetV2 [20] | 7.1 MB | CPU | 8s | 13s (+63%) | 8s (+0%) |
| MobileNetV1 [21] | 8.5 MB | CPU | 10s | 18s (+80%) | 10s (+0%) |
| InceptionV1 [22] | 13.4 MB | GPU | 14s | 17s (+21%) | 14s (+0%) |
| ResNet50 [23] | 51.2 MB | GPU | 52s | 63s (+21%) | 53s (+2%) |
| SSDLite-MobileNetV2 [20] | 9.3 MB | GPU | 10s | 13s (+30%) | 10s (+0%) |
| SSD-MobileNetV2 [20] | 33.8 MB | GPU | 35s | 50s (+42%) | 35s (+0%) |
Step 3: Bit concatenation
The divided matrices from the previous step are deployed to the server and transmitted to the user device when the deep model is requested. To properly infer the deep model from the divided pieces, the divided matrices must be concatenated before inference. For example, the second approximate model becomes available after concatenating the elements of the model.part1 and model.part2 matrices, grouped by their local positions, and restoring the floating points. The scheme to concatenate the divided binaries is given by
$$\hat{q} = \bigvee_{i=1}^{m} \left( q_{i} \ll \left( k - \sum_{j=1}^{i} b_{j} \right) \right) \qquad (4)$$
where the $\vee$ operator indicates the bitwise OR operation and $m$ is the available number of fractions.
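On the user device, the received fractions can be recombined with shifts and bitwise OR, as in the following sketch of Eq. (4); parts not yet transmitted are simply treated as zero low-order bits.

```python
import numpy as np

def concatenate(parts, bit_widths, k=16):
    """Recombine the fraction matrices received so far (sketch of Eq. 4).

    Only the first m fractions need to be present; the missing low-order
    bits stay zero, yielding the approximate quantized matrix.
    """
    q_hat = np.zeros(parts[0].shape, dtype=np.uint32)
    consumed = 0
    for part, b in zip(parts, bit_widths):  # zip stops at the m received parts
        consumed += b
        q_hat |= part.astype(np.uint32) << (k - consumed)
    return q_hat
```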
Step 4: Dequantization
After concatenation, each quantized integer in the matrix is restored to a floating point. This operation is the inverse of quantization, and the equation is given by
$$\hat{x} = \frac{(\hat{q} + \delta)\,(x_{\max} - x_{\min} + \epsilon)}{2^{k}} + x_{\min} \qquad (5)$$
where $\delta$ is a revision factor that restores the loss from the flooring function.
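A sketch of the dequantization step is shown below. The revision factor delta here is our assumption: it places the restored value at the center of the interval that the flooring (and any still-missing low-order bits) could span, which is one plausible choice rather than the exact constant used in the paper.

```python
import numpy as np

def dequantize(q_hat, w_min, scale, k=16, received_bits=16):
    """Restore floats from the (possibly partial) quantized matrix (sketch of Eq. 5).

    delta compensates the downward bias of the flooring function; when only
    received_bits of the k bits have arrived, it also centers the estimate
    within the range the missing low-order bits could still cover.
    """
    delta = 0.5 * (2 ** (k - received_bits))  # equals 0.5 once all bits have arrived
    return (q_hat.astype(np.float32) + delta) * scale + w_min
```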
III-C Concurrent Transmission and Inference
While quantization and bit division are performed once before model deployment, concatenation and dequantization are performed at every inference on the user device. Furthermore, our framework requires multiple inferences to show the intermediate results of the model. These overheads from concatenation and multiple inferences can become a problem since user devices usually do not have powerful hardware, unlike enterprise servers. To minimize the overhead of progressive inference, we introduce concurrent execution of transmission and inference.
Fig. 4 shows timelines of a singleton model and two progressive models, where the upper progressive model is naively implemented while the lower one performs concatenation and inference concurrently with the transmission. When inference is performed concurrently with the transmission, we can achieve a completion time equivalent to the traditional method even though additional operations for concatenation and multiple inferences are added. As modern operating systems and browsers download files in the background, concurrent execution can be implemented easily and efficiently.
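The pattern can be sketched as follows. The real implementation is in JavaScript, where the browser fetches the next part in the background; here we use Python threads only to illustrate the overlap, and fetch_part, build_model, and run_inference are hypothetical callables, not the framework's actual API.

```python
import queue
import threading

def progressive_inference(fetch_part, num_parts, build_model, run_inference, image):
    """Overlap part downloads with approximate inference (illustrative sketch).

    fetch_part(i) downloads the i-th fraction; build_model(parts) concatenates,
    dequantizes, and assembles an approximate model; run_inference executes it.
    """
    received = queue.Queue()

    def downloader():
        for i in range(num_parts):
            received.put(fetch_part(i))  # network-bound, runs in the background

    threading.Thread(target=downloader, daemon=True).start()

    parts = []
    for _ in range(num_parts):
        parts.append(received.get())        # wait for the next fraction to arrive
        model = build_model(parts)          # concatenate + dequantize received parts
        yield run_inference(model, image)   # intermediate (or, at the end, final) result
```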

IV Evaluation


IV-A Total Execution Time
To demonstrate that our framework does not increase the total execution time, we measured the total execution times of popular deep learning models with our JavaScript implementation. Table I shows the summarized results of concurrent transmission and inference. The first four models are image classifiers trained on ImageNet [24], and the last two are object detection models trained on MS COCO [25]. Notably, the total execution times (transmission + concatenation + dequantization + inference) of the progressive models are equivalent to those of the singleton models when concurrent execution is implemented, while execution takes an additional 20% to 80% longer when it is not.
Experimental Setup
With the observation that 16-bit quantized models show accuracy equivalent to the full-precision models, we used 16-bit quantized models as baselines. We configured the transmission to proceed in eight stages in the progressive method (cumulative bit-widths 2, 4, 6, 8, 10, 12, 14, and 16), and used the CPU (JS) backend for inferring small models and the GPU (WebGL) backend for relatively heavy models. The experiment was conducted on a MacBook Air (M1, 2020) with Chrome 88 and TensorFlow.js. The transmission speed was set to 1 MB/s.
Table II: Accuracy of progressive models at each bit-width (ImageNet top-1 accuracy for classifiers, COCO boxAP for detectors).

| Model | 2 | 4 | 6 | 8 | 10 | 12 | 14 | 16 | orig. |
|---|---|---|---|---|---|---|---|---|---|
| MobileNetV2 [20] | 0.0 | 0.0 | 40.1 | 70.7 | 71.8 | 71.9 | 71.9 | 71.9 | 71.9 |
| ResNet18 [23] | 0.0 | 0.0 | 67.5 | 69.6 | 69.8 | 69.8 | 69.8 | 69.8 | 69.8 |
| EfficientNet-b0 [26] | 0.0 | 0.0 | 0.0 | 48.7 | 70.9 | 76.4 | 77.5 | 77.5 | 77.6 |
| SSD300-VGG16 [27] | 0.0 | 0.0 | 18.6 | 25.0 | 25.1 | 25.1 | 25.1 | 25.1 | 25.1 |
| SSDLite320-MobileNetV3-Large [28] | 0.0 | 0.0 | 15.5 | 20.3 | 21.1 | 21.2 | 21.3 | 21.3 | 21.3 |
| FasterRCNN-MobileNetV3-Large-FPN [29, 28] | 0.0 | 0.0 | 24.3 | 31.8 | 23.6 | 32.8 | 32.8 | 32.8 | 32.8 |
IV-B Qualitative Results
We qualitatively show examples of progressive transmission for the classification and detection models in Figs. 5 and 6, respectively. Our progressive model provides approximate results before the entire model file is transmitted, allowing users to interact with the model quickly and improving the user experience.
IV-C Accuracy Analysis
We evaluate the accuracy of progressive models with popular classification and detection model architectures. We took pre-trained models provided on public websites and converted them with our framework to support progressive transmission. Table II shows the accuracy comparison, where we measure the ImageNet top-1 accuracy and COCO boxAP of the approximate models provided by our framework during transmission.
For the final model accuracy, we find no degradation in our progressive models even though they support progressive inference during transmission. For the intermediate model accuracy, the 2-bit and 4-bit models do not produce meaningful inference results due to precision loss, but usable results are obtained from 6 bits onward. Interestingly, our experiments were conducted by simply converting existing pre-trained models without adaptive quantization-aware training [19, 30], which shows the broad applicability of our method.
IV-D User Study
We further demonstrate the applicability of progressive transmission in a real-world environment by conducting a user experiment with our web application. The experiment compares users' tolerance with and without progressive transmission when a long transmission time of the deep model is imposed.
Experimental Setup
Fig. 7 shows the overall design of our user study. First, the participants are given two choices to classify an image: obtain the classification results from a deep learning model (Find automatically button) or manually classify without waiting for the deep model’s results (Do it myself button). Then, we divide the participants into two groups and implement two versions of the experimental application according to the groups:
- Group A: No progressive transmission. Users can only see the final result after the model is fully transmitted.
- Group B: Progressive transmission is implemented. Users can see intermediate results before the entire model is delivered.
We measure the ratio of the participants who actively use the Find automatically button during the experiments. To clearly see the effect of the progressive transmission, we limit the transmission speed of the deep model. Due to the repetitive and boring task, the participants might prefer to use the Find automatically button to obtain the classification results of the deep model. However, due to the long transmission time of the model, they may prefer to manually annotate the images instead of waiting for the model’s results. We hypothesize that Group B will be more likely to use the button than Group A, since the progressive transmission reduces the boredom and frustration caused by the slow transmission time.

Implementation Detail
The images to be classified are from the ImageNet [24] validation set, and the experiment has six repetitive stages. We use MobileNetV2 [31] as the deep model for the Find automatically button. The model is executed on the web browsers of the participants' own devices (desktops or laptops). We implement the experimental application as a web application using TensorFlow.js [2]. The model transmission speed is set to one of three configurations (0.1 MB/s, 0.2 MB/s, and 0.5 MB/s) to simulate network speeds from slow to relatively fast. Participants are asked to classify 8 or 12 images in each stage (12 images at 0.1–0.2 MB/s and 8 images at 0.5 MB/s). The network speed configurations are evenly distributed between Groups A and B. In Group B, the model is quantized after training, and eight intermediate results are provided during transmission (cumulative bit-widths 2, 4, 6, 8, 10, 12, 14, and 16).
Result
We recruited 66 participants online and randomly assigned them to Groups A and B. A total of 57 valid responses were collected; we excluded data from participants who never used the Find automatically button during the experiment. We then measured the ratio of participants who used the Find automatically button three or more times during the six stages (a click ratio of at least 0.5).
Table III: Ratio of participants who used the Find automatically button three or more times.

| Network Speed | Group A (n=29) | Group B (n=28) |
|---|---|---|
| 0.1 MB/s (n=18) | 44% | 67% |
| 0.2 MB/s (n=23) | 42% | 64% |
| 0.5 MB/s (n=16) | 50% | 88% |
| Overall | 45% | 71% |
Table III shows the results of our experiment. In Group A, more than half of the participants gave up on using the automatic tool and performed the task manually. In contrast, more than 70% of the participants in Group B actively used the tool, which is consistent with our hypothesis. Notably, progressive transmission was effective in all network speed configurations, suggesting that it can serve as a general solution when delivering a model over a network.

We further asked participants about their experience with the Find automatically button after the experiment. Fig. 8 shows the survey results, to which 39 of the 57 participants responded. Group A tended to be more dissatisfied with the model's inference speed than Group B. We suppose that progressive transmission relieves some of the dissatisfaction caused by the slow transmission speed.
Taking the above results together, we argue that progressive transmission and inference of deep models is beneficial to the user experience. As model sizes increase or network speeds are limited, the effectiveness of progressive transmission will become more pronounced. In real-world scenarios, progressive transmission can compensate for the long transmission times of deep models, especially in regions that have not yet built fast networks.
V Conclusion
We introduced progressive transmission and inference of deep learning models, which provides approximate inference results in the middle of the transmission. Our progressive transmission framework allows multiple inferences during transmission without increasing the model size or the total execution time. We demonstrated that our method improves the user experience by conducting a user study with a realistic web application. To the best of our knowledge, this is the first approach to introduce progressive transmission into deep learning models. We implemented progressive model transmission in a simple way in this paper, but we believe it can be implemented with more advanced methods. We hope that this study motivates academia and industry to consider user experience in addition to deep learning model size or transmission time, and encourages applications and services that transmit deep learning models to user devices.
Acknowledgement
This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-01158, Development of a Framework for 3D Geometric Model Processing).
References
- [1] J. Wang, B. Cao, P. Yu, L. Sun, W. Bao, and X. Zhu, “Deep learning towards mobile applications,” in 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS). IEEE, 2018, pp. 1385–1393.
- [2] D. Smilkov, N. Thorat, Y. Assogba, A. Yuan, N. Kreeger, P. Yu, K. Zhang, S. Cai, E. Nielsen, D. Soergel, S. Bileschi, M. Terry, C. Nicholson, S. N. Gupta, S. Sirajuddin, D. Sculley, R. Monga, G. Corrado, F. B. Viégas, and M. Wattenberg, “TensorFlow.js: Machine Learning for the Web and Beyond,” arXiv preprint arXiv:1901.05350, 2019.
- [3] M. Hidaka, Y. Kikura, Y. Ushiku, and T. Harada, “Webdnn: Fastest dnn execution framework on web browser,” in Proceedings of the 25th ACM international conference on Multimedia, 2017, pp. 1213–1216.
- [4] C. Zhang, P. Patras, and H. Haddadi, “Deep learning in mobile and wireless networking: A survey,” IEEE Communications Surveys & Tutorials, vol. 21, no. 3, pp. 2224–2287, 2019.
- [5] Z. Chen, S. Wang, D. O. Wu, T. Huang, and L.-Y. Duan, “From data to knowledge: Deep learning model compression, transmission and communication,” in Proceedings of the 26th ACM international conference on Multimedia, 2018, pp. 1625–1633.
- [6] Sloan and Tanimoto, “Progressive Refinement of Raster Images,” IEEE Transactions on Computers, vol. C-28, no. 11, pp. 871–874, 1979.
- [7] K.-H. Tzou, “Progressive Image Transmission: A Review And Comparison Of Techniques,” Optical Engineering, vol. 26, no. 7, pp. 581 – 589, 1987.
- [8] Y. Guo, A. Yao, and Y. Chen, “Dynamic network surgery for efficient dnns,” Advances in Neural Information Processing Systems 29 (NIPS 2016), pp. 1379–1387, 2016.
- [9] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” in International Conference on Learning Representations (ICLR), 2016.
- [10] Y. Gong, L. Liu, M. Yang, and L. Bourdev, “Compressing Deep Convolutional Networks using Vector Quantization,” arXiv preprint arXiv:1412.6115, 2014.
- [11] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, “Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- [12] X. Yu, T. Liu, X. Wang, and D. Tao, “On compressing deep models by low rank and sparse decomposition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7370–7379.
- [13] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
- [14] Z. Chen, L.-Y. Duan, S. Wang, Y. Lou, T. Huang, D. O. Wu, and W. Gao, “Toward knowledge as a service over networks: A deep learning model communication paradigm,” IEEE Journal on Selected Areas in Communications, vol. 37, no. 6, pp. 1349–1363, 2019.
- [15] A. E. Eshratifar, M. S. Abrishami, and M. Pedram, “Jointdnn: an efficient training and inference engine for intelligent mobile cloud computing services,” IEEE Transactions on Mobile Computing, 2019.
- [16] A. E. Eshratifar, A. Esmaili, and M. Pedram, “Bottlenet: A deep learning architecture for intelligent mobile cloud computing services,” in 2019 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED). IEEE, 2019, pp. 1–6.
- [17] Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang, “Neurosurgeon: Collaborative intelligence between the cloud and mobile edge,” ACM SIGARCH Computer Architecture News, vol. 45, no. 1, pp. 615–629, 2017.
- [18] “TensorFlow Lite - ML for Mobile and Edge Devices,” https://www.tensorflow.org/lite.
- [19] Q. Jin, L. Yang, and Z. Liao, “Adabits: Neural network quantization with adaptive bit-widths,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
- [20] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4510–4520.
- [21] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications,” arXiv preprint arXiv:1704.04861, 2017.
- [22] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going Deeper With Convolutions,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
- [23] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- [24] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
- [25] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision. Springer, 2014, pp. 740–755.
- [26] M. Tan and Q. V. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in ICML, 2019.
- [27] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision. Springer, 2016, pp. 21–37.
- [28] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan et al., “Searching for mobilenetv3,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1314–1324.
- [29] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” Advances in neural information processing systems, vol. 28, pp. 91–99, 2015.
- [30] H. Yu, H. Li, H. Shi, T. S. Huang, and G. Hua, “Any-precision deep neural networks,” arXiv preprint arXiv:1911.07346, 2019.
- [31] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted Residuals and Linear Bottlenecks,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.