Bandwidth Utilization Side-Channel on ML Inference Accelerators

Sarbartha Banerjee The University of Texas at Austin
[email protected] Shijia Wei The University of Texas at Austin
[email protected] Prakash Ramrakhyani
{@IEEEauthorhalign} Mohit Tiwari
ARM Research
[email protected] The University of Texas at Austin
[email protected]

Abstract

Accelerators used for machine learning (ML) inference provide great performance benefits over CPUs. Securing confidential model in inference against off-chip side-channel attacks is critical in harnessing the performance advantage in practice. Data and memory address encryption has been recently proposed to defend against off-chip attacks.

In this paper, we demonstrate that bandwidth utilization on the interface between accelerators and the weight storage can serve a side-channel for leaking confidential ML model architecture. This side channel is independent of the type of interface, leaks even in the presence of data and memory address encryption and can be monitored through performance counters or through bus contention from an on-chip unprivileged process.¹¹1Presented at Secure and Private Systems for Machine Learning (SPSL 2021)

I Introduction

Deep learning model inference services have spawned a domain-specific computing revolution. High performance ML inference accelerators in the form of neural processing units (NPU) are being developed by both industry [14, 2, 17] and academia [11, 16, 8]. NPUs may be integrated either in the system-on-chip (SoC) [2] or connected to system bus [14]. Inference-as-a-service (InFaaS) [19] is deployed by cloud providers like Amazon Sagemaker [1] running user inference on ML accelerators. This incentivizes ML model vendors to host trained models on cloud platforms and provide services on confidential user data like face recognition and organizational data like disease classification on patient private data.

From the security perspective, the model vendor requires the cloud provider to protect the confidentiality of model parameters as well as the layer dimensions. Prior work [5, 18, 20] show how knowledge of layer dimensions can be used to steal a victim’s ML IP by reconstructing a model with similar accuracy. An attacker can further use a stolen model to launch adversarial attacks on the victim system [15, 21].

Temporal and spatial sharing of NPUs by multiple applications has been proposed to improve system scalability and overall inference time [9, 10]. This allows multiple tenants sharing the memory bus to infer victim model utilization through bandwidth contention channels. To support such sharing, the cloud hypervisors collect NPU resource utilization information (e.g. memory bus utilization) for each tenant for providing quality-of-service (QoS) guarantees, memory traffic and infrastructure management.

In this paper, we develop a new attack based on observing the NPUs bandwidth utilization. We highlight that observing this side channel alone can leak the ML model structure even when off-chip data and addresses are encrypted. Summarizing our key contributions:

•

We introduce the bandwidth utilization side-channel on NPUs to leak ML model dimensions.
•

A proof-of-concept exploit on the DRAM interface against 6 image classification models and explore several classifier to identify model layer boundaries and layer dimension.
•

We propose possible defences to create a demand agnostic bandwidth utilization. Software defence leads to a $1.6x$ increase in overhead while hardware countermeasures to generate constant memory traffic has $14.6\%$ to $19.3\%$ overhead.

II Overview and background

Refer to caption — Figure 1: Different steps of an end-to-end bandwidth utilization attack. (1) Collect bandwidth utilization; (2) Classify model layer boundary and layer type; (3) Use black box attack to reconstruct model or perform adversarial attack.

Figure 1 shows the different stages of an end-to-end attack with utilization bandwidth as a side-channel. The memory bus connecting the NPU and the model weight storage reveals layer variations through the bandwidth utilization side channel. Our attack receives the bandwidth utilization trace and performs two classification steps: (1) Detect layer boundaries with a boundary detector (Section III-D2); (2) Split the time-series along layer boundaries to detect the layer dimensions with type classifier (Section III-D3). The number of layers and its dimension can be fed to black box attacks for model weights reconstruction or an adversarial attack (Section 1).

NPU to memory interface. The NPU accelerator can be resident on the system-on-chip (SoC) with memory bus connected directly to CPU last-level cache (LLC) as in ARM Ethos [2] or connected with an on-chip eDRAM as in DaDianNao [8]. Other accelerators [14] are connected off-chip performing direct system memory accesses. All the model weights cannot be loaded upfront due to limited NPU internal memory. This motivates tiling the weights for each DNN layer to fit in the NPU and loading them as required, therefore creating bandwidth variation irrespective of the NPU memory interface.

Tile size variations. The model layer dimensions depend on the model configuration like number of filters, height, width and channels. The tiling is not only devised on these four dimensions but also on the data locality of the input feature maps and the intermediate partial sums. Moreover, the number and type of operation for each layer are different. Convolution and dense layers perform matrix multiplication while pooling and activation perform ALU operations like max, min or addition leading to parameter variations for each layer.

Black box attacks. A large body of work [5, 18, 20] recreates a model having similar classification accuracy as the victim model with only the model dimension. As a first step, the attacker initializes a model with victim model dimension. Then, she sends multiple inference requests to the victim model and uses the classification to label the data creating a dataset. This dataset is used to train the attacker’s model until it achieves similar accuracy as the victim model. prior work [15, 21] further illustrates use of reconstructed model to generate adversarial examples for the victim model.

III Bandwidth Utilization Attack

III-A Threat Model

We adopt a threat model similar to trusted execution environments (TEE) like Intel SGX [3]. Specifically, we assume that the underlying privileged software is untrusted, malicious co-tenant may share the same NPU and the system may be exposed to physical attacks. We assume the techniques in secure processors are adopted by the NPU providers. Beyond memory encryption and integrity checks, we also assume memory address encryption is employed as in [4, 6], thus restricting the adversary to only observe the bandwidth utilization.

Additionally, we assume that the victim uses the optimal tile size for performance, and that the adversary can perform offline profiling to collect all possible layer and tile configuration characteristics for training her classifier(s).

III-B Point of Leakage

Bandwidth utilization can be observed by malicious hypervisors or co-executing tenants. Cloud hypervisors collect traffic statistics for each application for load balancing and congestion control. Considering the threat model of a malicious cloud hypervisor, high-precision counters can be used to collect bandwidth utilization of each tenant. Such widgets are (or can be) placed at (1) the Last-Level Cache (LLC) interface, (2) the DMA engine of the NPU, or (3) the DRAM controller.

Unprivileged co-executing tenants share the same memory interface. Malicious tenants in the same NPU can constantly monitor the bandwidth and record victim utilization from bandwidth drops due to contention. This drop in bandwidth is proportional to the victim load request size scheduled by a physically shared NPU DMA controller.

III-C Example of Bandwidth Variation at DRAM Interface

As mentioned in Section 1, tile execution of different layers on the NPU loads tiled matrix weights from either LLC or main memory. Figure 2(a) presents the layer-dependent load bandwidth variation for the model weights in a VGG16 model. There are three types of bandwidth variations across adjacent layer boundaries: (T1) Layers with different tile sizes (e.g. layer 2 and 3) where the filter size is different with a pooling layer in between. (T2) Layers with the same sized tiles but having different number of tiles (e.g. layer 1 and 2) which is possible with adjacent layer having the same filter size but differing in the number of channels. (T3) Identical adjacent layers (e.g. layer 10 and 11) with identical sized filters having the same number of channels as in some deeper resnet layers.

Figure 2(b) zooms into the $5^{th}$ layer revealing the number of tiles. Due to tile size optimization, the number of tiles and the bandwidth utilization vary across model layers depending on their dimensions. There are finite tiling schemes with each having a unique signature of bandwidth and execution time.

III-D Attack Demonstration

As a proof-of-concept, we demonstrate the bandwidth utilization attack on the DRAM interface and model the adversary as a malicious hypervisor.

III-D1 Experimental Setup

We prototype the VTA [16] ML inference accelerator on a Xilinx Pynq-Z1 fpga running at 100 MHz. VTA is an fpga implementation of an NPU with the widely used TVM [7] software stack. Similar to an NPU, the VTA has its own instruction set consisting of load, compute and store operations. We augment the VTA DMA controller with a memory transaction counter to log the read transactions. The model weights are not updated at inference time, hence monitoring only read transaction is enough for inferring the model dimensions. The counter accumulates the number of bytes transferred for each DMA transaction and stores in a memory-mapped register, read by the runtime driver at 250KHz.

We run six image classification DNN inference models, namely, VGG-11, VGG-16, Alexnet, Resnet-18, Resnet-34 and Resnet-50. The memory utilization shape for each of the workload is shown in Figure 3. These traces are taken with the NPU connected to system memory bus. To validate possibility of attack on the LLC interface, we have also collected traces with NPU connected to LLC through accelerator coherency port (ACP). Those traces also have similar demand signature and are not shown due to space constraints.

For layer boundary detection, we use precision and recall to evaluate the detector performance. Precision indicates the fraction of correctly detected boundaries out of the total detected boundaries. The higher the precision is, the less false positive claims the detector makes. Recall measures the fraction of correctly detected boundaries out of the all true boundaries. The higher the recall is, the less correct layer boundaries the detector misses. To evaluate layer type classification performance, we use accuracy which indicates the ratio of correctly classified samples out of all samples. We did not use accuracy for layer boundary detector because the number of non-boundary inputs outnumbers the number of boundary inputs by orders of magnitude. Using accuracy may present a falsely satisfying result.

III-D2 Layer Boundary Detection

A single DNN model consists of multiple layers. The first step in the attack is layer boundary detection. Prior arts [13, 12] demonstrated that the read-after-write (RAW) pattern on the address trace reveals the layer boundaries accurately. However, our threat model restricts attackers to use bandwidth utilization variation rather than the address trace.

Fine-grained observations enable attackers to model each layer as a time-series of bandwidth information. To identify layer boundaries, the adversary clusters the observed time-series trace into different classes. First, we collect statistics like total data transferred per sampling window, median and peak bandwidth, bandwidth standard deviation as well as frequency domain signals extracted using discrete wavelet transform (DWT) to perform feature extraction and selection. Second, the adversary builds a bag-of-words model with the extracted features, a popular Natural Language Processing (NLP) technique, on sliding windows of the trace. The combination of a bag-of-words model and sliding windows enables the attacker classifier to obtain both the frequency- and time-domain information of the collected trace. Then, the attacker performs clustering to obtain the potential layer boundary candidates. Subsequently, offline profiled termination timings of all possible layers are used to validate these candidates to reduce false positives.

The boundary detection results are shown in Table I. The table heading lists the different benchmarks with the total number of layer boundaries. Adjacent layers with a different shape (T1 in III-C) are identified as ‘easy’ because they utilize different memory bandwidth and are therefore easy to detect. The classifier can detect easy boundaries with 100% precision for AlexNet, VGG11 and VGG16. Overall, 73.9% of layer boundaries are detected across our six workloads. Note that for ResNet models, the residual layers are very short, which makes boundaries hard to detect due to our choice of sliding windows size. We leave these hyperparameter tuning (e.g. sliding window size) for future work.

AlexNet

VGG11

VGG16

ResNet18

ResNet34

ResNet50

easy

all

easy

all

easy

all

easy

all

easy

all

easy

all

precision

0.69

0.64

0.66

0.72

0.33

recall

0.75

0.83

0.73

0.96

0.67

0.96

TABLE I: Precision and recall when identifying layer boundaries for each network. easy boundaries refer to ones between adjacent layers with different utilization shapes, while all includes all layer boundaries.

III-D3 Layer Type Classification

Similar to layer boundary detection, we include frequency domain signals from DWT to capture tile bandwidth signatures for layer type classification. DWT detects change in bandwidth across tiles at layer boundary. The wavelets also captures sharp changes in bandwidth, which is useful for tile boundary detection.

We test victim memory traffic time-series using three classifiers: Support Vector Machine (SVM), Multilayer Perceptron (MLP), and Convolutional Neural Network (CNN), each trained on either the time-series features for potential layers or on the features extracted from DWT.

AlexNet

VGG11

VGG16

ResNet18

ResNet34

ResNet50

Overall

Execution

time only

0.958

0.896

0.851

0.824

0.826

SVM w/

(w/o) DWT

0.811

0.927

MLP w/

(w/o) DWT

(0.986)

0.868

(0.849)

0.949

(0.934)

CNN w/

(w/o) DWT

0.877

(0.830)

0.953

(0.934)

TABLE II: Layer type classification accuracy using unshaped traffic assuming perfect layer boundary detection. The last three rows also show results without DWT signals in parentheses.

Table II shows the layer-type classification accuracy for each tested model, using bandwidth trace assuming perfect identification of layer boundaries. The last column (Overall) summarizes the weighted accuracy of all layers in all models. The first row shows classification accuracy merely using the termination timing of each layer. This is a baseline accuracy for any attacker with the knowledge of the layer boundaries (execution timing for each layers). All the layers of AlexNet and VGG-11 are identifiable using this basic classifier. These workloads have few layers and all the layers differ in their execution time. Therefore, merely having the layer termination time is sufficient to classify the layers. The accuracy decreases for deeper models like ResNets with identical adjacent layers (T3 in III-C). The 2-4th rows show accuracy of the three evaluated layer-type classifiers. From execution-time based classifier to bandwidth-based classifiers, the accuracy jumps from 84% to 93% on average. From SVM to CNN, accuracy improves with increasing classifier complexity. In addition, including frequency domain signals improves the classifier, resulting in an accuracy of 95.3%. This is because different tile size configurations have different compute/bandwidth ratios.

IV Countermeasures

Bandwidth utilization channel leaks because of the layer dependent bandwidth utilization during ML inference. This can be prevented by disabling tile-size optimization or shaping the shared interface traffic (via pure software or software-hardware co-design).

IV-A Disabling tile size optimization

Constant tile size across all layers within a model can make the effective bandwidth constant throughout the model execution. However, disabling this optimization leads to inference time overhead and lower utilization of NPU resources including compute, on-chip storage, and off-chip memory bus bandwidth. To illustrate the wasted performance opportunity, we explore execution performance of 800 distinct tile-size configurations for each of the models. The overhead of each configuration with respect to the most performant configuration is plotted in Figure 4. The performance overhead of the median configuration is, on an average, 1.2X more compared to the best case. The performance overhead varies up to 1.6 - 2.0X in certain tile configurations. The resulting performance overhead impacts the service-level agreement (SLA) and causes under-utilization of cloud resources. This is due to the large variation in the height, width and number of channels for different layers of the DNN model.

IV-B Memory traffic shaping

The memory trace could be made independent of the demand trace. The CSP can choose a predefined memory utilization independent from the execution model. The memory controller interleaves the real read requests with fake transactions if the demand falls below the expected bandwidth. On the other hand, demand requests need to be throttled when utilization exceeds the assigned bandwidth. The predefined memory trace can be a fixed bandwidth trace or some other model demand independent pattern. The write requests can be sent at regular intervals. Encrypted stores with valid flag can distinguish between real and fake transactions.

To defend against bandwidth utilization attack on ML inference, we modified the accelerator design to split each transaction to multiple transactions of fixed size. The fixed size transactions were sent at an equal interval of time resulting in a constant bandwidth. Equal-sized splits are created by padding the unequal transactions. In case of memory idle intervals, fake transactions are sent to fill the gap. The tile size optimization is performed based on the fixed bandwidth to obtain the best tile configuration for each layer.

The precision of layer boundary detection reduces drastically as shown in Table III. The boundary detection classifier has increased sensitivity to have enough coverage as visible from the recall numbers. NA in the table for ResNet models indicate the attacker’s failure to identify the true boundaries even with 10000x false positives $(precision<0.0001)$ . The layer construction is infeasible with such high false positives for the shaped constant trace. The overall precision drops to less than 0.01%. The model type classifier fails with such low precision on layer boundary detection thwarting the attack.

With a specific bandwidth choice, the tradeoff is between the amount of wasted memory bandwidth and the performance overhead caused by request throttling. With a fixed bandwidth of 200 MB/s, the geomean overhead is 14.6% with worst case overhead of 19.3%.

AlexNet

VGG11

VGG16

ResNet18

ResNet34

ResNet50

easy

all

easy

all

easy

all

easy

all

easy

all

easy

all

precision

0.03

0.01

0.0027

0.00011

recall

0.75

0.83

0.73

TABLE III: Memory traffic shaping with constant bandwidth reduces the precision of layer boundary detection drastically for all models

V Related work

Recent works [13, 12] illustrate that memory access patterns reveal a DNN model structure by snooping the off-chip address bus. We demonstrate a new alternative side-channel, using bandwidth variation, for leaking model dimensions even in the presence of data and address encryption. Observing bandwidth variation is feasible even by on-chip unprivileged malicious co-tenants without the use of performance counters.

Memory traffic shaping as a defence mechanism is illustrated in prior works like MITTS [23] or camouflage [22]. These are applicable to general-purpose workloads with a runtime shaping logic. However, the memory demand requests for a DNN workload are known at compile time. We demonstrate that the compiler can choose a traffic pattern and perform the tile size analysis to improve the overall inference latency.

VI Conclusion

This paper studies the bandwidth utilization side-channel to infer confidential model structure. The channel can be observed by performance counters or even by unprivileged co-executing tenant through traffic contention. This work shows that model structure can be leaked even with an encrypted address trace. The study discusses potential countermeasures and highlights one that leverages the knowledge of the inference workload at compile time to tune the tiles accordingly and improve the bus utilization while closing the bandwidth utilization channel.

VII Acknowledgements

We would like to thank our anonymous reviewers for providing their valuable feedback and suggestions. This work is funded by under task $\#2965.001$ by Semiconductor Research Corporation (SRC) and Intel Strategic Research Alliance (ISRA) grant.

References

[1] “Amazon sagemaker,” https://aws.amazon.com/sagemaker/.
[2] “Arm ethos-n series processors,” https://developer.arm.com/ip-products/processors/machine-learning/arm-ethos-n.
[3] “Intel software guard extensions,” https://software.intel.com/en-us/sgx/details.
[4] S. Aga and S. Narayanasamy, “Invisimem: Smart memory defenses for memory bus side channel,” in Proceedings of the 44th Annual International Symposium on Computer Architecture, ser. ISCA ’17. ACM, 2017. [Online]. Available: http://doi.acm.org/10.1145/3079856.3080232
[5] S. Alfeld, X. Zhu, and P. Barford, “Data poisoning attacks against autoregressive models,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1, 2016.
[6] A. Awad, Y. Wang, D. Shands, and Y. Solihin, “ObfusMem: A Low-Overhead Access Obfuscation for Trusted Memories.” ACM Press, 2017, pp. 107–119.
[7] T. Chen, T. Moreau, Z. Jiang, H. Shen, E. Q. Yan, L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy, “Tvm: end-to-end optimization stack for deep learning,” arXiv preprint arXiv:1802.04799, pp. 1–15, 2018.
[8] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun et al., “Dadiannao: A machine-learning supercomputer,” in 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE, 2014, pp. 609–622.
[9] Y. Choi and M. Rhu, “Prema: A predictive multi-task scheduling algorithm for preemptible neural processing units,” in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2020, pp. 220–233.
[10] S. Ghodrati, B. H. Ahn, J. K. Kim, S. Kinzer, B. R. Yatham, N. Alla, H. Sharma, M. Alian, E. Ebrahimi, N. S. Kim et al., “Planaria: Dynamic architecture fission for spatial multi-tenant acceleration of deep neural networks,” in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2020, pp. 681–697.
[11] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “Eie: efficient inference engine on compressed deep neural network,” ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 243–254, 2016.
[12] X. Hu, L. Liang, S. Li, L. Deng, P. Zuo, Y. Ji, X. Xie, Y. Ding, C. Liu, T. Sherwood, and Y. Xie, “Deepsniffer: A dnn model extraction framework based on learning architectural hints,” in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’20. New York, NY, USA: Association for Computing Machinery, 2020, p. 385–399. [Online]. Available: https://doi.org/10.1145/3373376.3378460
[13] W. Hua, Z. Zhang, and G. E. Suh, “Reverse engineering convolutional neural networks through side-channel information leaks,” in Proceedings of the 55th Annual Design Automation Conference, ser. DAC ’18. New York, NY, USA: ACM, 2018, pp. 4:1–4:6. [Online]. Available: http://doi.acm.org/10.1145/3195970.3196105
[14] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers et al., “In-datacenter performance analysis of a tensor processing unit,” in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2017, pp. 1–12.
[15] Y. Liu, X. Chen, C. Liu, and D. Song, “Delving into transferable adversarial examples and black-box attacks,” arXiv preprint arXiv:1611.02770, 2016.
[16] T. Moreau, T. Chen, Z. Jiang, L. Ceze, C. Guestrin, and A. Krishnamurthy, “Vta: An open hardware-software stack for deep learning,” arXiv preprint arXiv:1807.04188, 2018.
[17] J. F. K. O. M. Papamichael, T. M. M. Liu, D. L. S. A. M. Haselman, L. A. M. Ghandi, S. H. P. P. A. Sapek, and G. W. L. Woods, “A configurable cloud-scale dnn processor for real-time ai,” in Proceedings of the 45th Annual International Symposium on Computer Architecture, ser. ISCA ’18, 2018.
[18] N. Papernot, P. McDaniel, A. Sinha, and M. Wellman, “Towards the science of security and privacy in machine learning,” arXiv preprint arXiv:1611.03814, 2016.
[19] F. Romero, Q. Li, N. J. Yadwadkar, and C. Kozyrakis, “Infaas: Managed & model-less inference serving,” arXiv preprint arXiv:1905.13348, 2019.
[20] F. Tramèr, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart, “Stealing machine learning models via prediction apis,” in 25th $\{$ USENIX $\}$ Security Symposium ( $\{$ USENIX $\}$ Security 16), 2016, pp. 601–618.
[21] B. Wang and N. Z. Gong, “Stealing hyperparameters in machine learning,” in 2018 IEEE Symposium on Security and Privacy (SP). IEEE, 2018, pp. 36–52.
[22] Y. Zhou, S. Wagh, P. Mittal, and D. Wentzlaff, “Camouflage: Memory traffic shaping to mitigate timing attacks,” in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2017, pp. 337–348.
[23] Y. Zhou and D. Wentzlaff, “Mitts: Memory inter-arrival time traffic shaping,” ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 532–544, 2016.