
Privacy-Preserving Feature Coding for Machines

Bardia Azizian and Ivan V. Bajić School of Engineering Science, Simon Fraser University, Burnaby, BC, Canada
Abstract

Automated machine vision pipelines do not need the exact visual content to perform their tasks. Therefore, there is a potential to remove private information from the data without significantly affecting the machine vision accuracy. We present a novel method to create a privacy-preserving latent representation of an image that can be used by a downstream machine vision model. This latent representation is constructed using adversarial training to prevent accurate reconstruction of the input while preserving the task accuracy. Specifically, we split a Deep Neural Network (DNN) model and insert an autoencoder whose purpose is both to reduce the dimensionality and to remove information relevant to input reconstruction, while minimizing the impact on task accuracy. Our results show that input reconstruction ability can be reduced by about 0.8 dB at equivalent task accuracy, with the degradation concentrated near the edges, which is important for privacy. At the same time, 30% bit savings are achieved compared to coding the features directly.

Index Terms:
Deep neural network, coding for machines, privacy, adversarial training, feature coding

I Introduction

The growth of Artificial Intelligence (AI) applications such as Internet of Things (IoT), visual surveillance, autonomous driving, industrial machine vision, etc., has resulted in a proliferation of “intelligent” edge devices, sensors, and their associated infrastructure. These devices need to communicate with each other and the cloud-based services to accomplish specific tasks. According to Cisco’s annual Internet report [1], most globally connected devices will be allocated to machine-to-machine (M2M) connections by 2023. Accordingly, significant research effort is being devoted to the development of efficient data coding for machines. Related standardization activities such as JPEG AI [2] and MPEG-VCM [3] have also been initiated.

With the massive amounts of data being passed around among edge devices and the cloud, privacy concerns naturally arise. What happens if a malicious third party gains access to this data? Depending on the particular scenario and application, there are a number of privacy and security concerns, exemplified through various attacks [4, 5]. Our particular focus in this paper is on the model inversion attack [6], where the attacker attempts to recover the original image from the intercepted features. If such an attack is successful, the attacker obtains the original image containing private information, which they may then utilize for malicious purposes. Cryptographic methods [7] provide one possible way to protect the data, although they have their own risks and challenges. In the context of data coding for machines, however, there is an opportunity to reduce or remove private information from the data while retaining task-relevant information. This is because machine vision generally needs higher-level information, rather than the details of each pixel, in order to perform a given task. For instance, the details of a vehicle's license plate or a person's face are not necessary if the DNN model is only supposed to detect cars and pedestrians on the street.

In this paper, we propose a feature coding method for machines that allows object detection at near-default accuracy (i.e., close to the model’s accuracy without compression) at high bitrates, while being resistant to model inversion attacks. To do so, we train two networks simultaneously in an adversarial manner [8]. One is part of an object detection pipeline, and the other is an adversary trying to perform input reconstruction from encoded features. The loss functions used in training are designed to encourage high object detection accuracy and low input reconstruction performance, especially near the edges because those details often reveal private information.

The paper is organized as follows. Related work is discussed in Section II. A comprehensive explanation of the proposed method is given in Section III. Section IV presents our experimental results, followed by conclusions in Section V.

Figure 1: The overall block diagram of the proposed method. Conv(n, k, s) is a 2D convolution layer followed by a batch normalization layer and a SiLU() activation, with n output channels, a kernel size of k×k, and stride s. Deconv(n, k, s) is the same as Conv(n, k, s), except with a transposed convolution layer. Note that the Conv layers in the autoencoder do not have batch normalization, and the last Conv in AE does not have the SiLU() activation either. The structure of the other blocks is shown in Fig. 2.

II Related Work

For decades, image codecs have been designed to improve compression efficiency by decreasing the number of bits representing the input while maintaining the fidelity between the original and reconstructed input. This principle works well for human vision, especially when the fidelity is measured using perceptual metrics [9]. In recent years, DNN models have entered the area of image coding [10, 11, 12], showing steady progress in comparison with traditional codecs.

Increasingly, however, much of the visual content is only “seen” by machines in applications such as autonomous driving and navigation, traffic monitoring, etc. In such cases, it may be advantageous to code task-relevant features derived from images instead of images themselves. The utility of feature coding has been recognized even prior to the current wave of interest in DNNs, through MPEG standards on Compact Descriptors for Visual Search (CDVS) [13] and Compact Descriptors for Visual Analysis (CDVA) [14]. More recently, feature compression has been studied in the context of collaborative intelligence [15], where a DNN is distributed between an edge device and the cloud, and features need to be uploaded from the edge to the cloud to complete the inference. This scenario is also assumed in the present paper.

Like images, features derived from intermediate DNN layers can be encoded either using traditional or DNN-based codecs. Early work on this topic [16, 17, 18] preferred traditional codecs; such approaches are still appealing due to traditional codecs’ computational simplicity relative to DNN-based codecs, and their wide availability in existing cameras and devices. In order to use a conventional codec for feature coding, the feature tensor usually needs to be tiled into an image, scaled, and pre-quantized [16]. The authors of [17] additionally reduced the dimensionality of the latent space in terms of both the number of channels and spatial resolution using an autoencoder prior to coding the bottleneck tensors via JPEG. Such dimensionality reduction usually helps compression efficiency.

More recent schemes [19, 20] employ DNN-based coding tools, especially advanced entropy models, to code features. An advantage of such schemes is their ability to be trained end-to-end, with a loss function that combines a differentiable rate estimate and task accuracy. Specifically, [19, 20] use the scale hyperprior entropy model from [10] to obtain the rate estimate. While the ability to train end-to-end affords additional flexibility to these methods, the downside is increased complexity and the fact that their coding engines are not widely available in existing devices.

None of the above-mentioned feature coding methods has considered privacy issues related to coded features. In fact, research on privacy protection in the context of feature coding for machines is fairly scarce. One of the rare works on this topic is [21], where an information-theoretic privacy approach called privacy fan is proposed. Here, features from a DNN are scored according to the mutual information (MI) [22] relative to the private and non-private tasks. Features with less private and more non-private information are only lightly compressed, whereas those that carry more private information are compressed more heavily to protect privacy.

At its core, privacy fan is a feature selection method based on MI between features and private/non-private tasks, which is difficult to estimate in high-dimensional spaces [23]. The approach presented in this paper avoids this challenge by using an autoencoder. For this reason, it is also more flexible than the privacy fan because it enables not just feature selection, but also feature modification. Details are presented next.

III Proposed Method

The main goal of this research is to develop a framework for feature coding for machines with improved resistance to model inversion attacks. Our pipeline is shown in Fig. 1. We chose object detection by YOLOv5 [24] as the machine task, but the framework is applicable to other DNN models and tasks as well. Several models with varying complexity and accuracy are available in the YOLOv5 GitHub repository (https://github.com/ultralytics/yolov5). We chose YOLOv5m and set the "image-size" parameter to 512.

Figure 2: The architecture of some blocks shown in Fig. 1. Conv(n, k, s) is the same as defined in the caption of Fig. 1. Conv3×3(n) is a single 3×3 convolution layer with stride 1 and n output channels. "×k" indicates k repetitions of the blocks enclosed by the dashed line.

III-A Split Point

To fit into the edge-cloud collaborative setting, the YOLOv5 model is split into two parts: the front-end, which is deployed on an edge device, and the back-end, which resides in the cloud. Choosing a split point is a design issue [25, 26], which depends on energy considerations, computational resources at the edge, the type of connection between the edge and the cloud, and so on. Here, our selection of the split point is mainly based on information-theoretic considerations.

Let $\mathbf{X}$ be the input image and $\mathcal{Y}_{i}$ be the feature tensor at the $i$-th layer. Since $\mathbf{X}\to\mathcal{Y}_{i}\to\mathcal{Y}_{i+1}$ forms a Markov chain, by the data processing inequality [22] we have that

$I(\mathbf{X};\mathcal{Y}_{i})\geq I(\mathbf{X};\mathcal{Y}_{i+1}),$ (1)

where $I(\cdot;\cdot)$ is the mutual information. Hence, deeper layers carry less information about the input and are therefore more resilient to model inversion attacks. Also, as shown in [27], deeper layers are more compressible. These arguments would suggest choosing a split point as deep as possible.

On the other hand, limited computation and energy resources on the edge device favor selecting a shallower split point. Moreover, the YOLOv5m model branches out at layer 5, meaning that if we select the split point after layer 5, we would have to encode and transmit multiple feature tensors, which would increase both the complexity and the total bitrate. Hence, we choose to split the YOLOv5m model at layer 5.

III-B Autoencoder

At the split point, we insert an autoencoder whose purpose is both to reduce the dimensionality and to modify the features so as to improve resistance to model inversion. This is a plug-and-play strategy and can be used for other models and tasks as well. The autoencoder is shown in Fig. 1: its encoder portion is referred to as AE and its decoder as AD. They consist of Conv(n, k, s) layers, whose structure is specified in the caption of Fig. 1, and ResBlocks, whose structure is shown in Fig. 2. The AE outputs the bottleneck feature tensor, whose channel dimension (64) is lower than the input tensor's channel dimension (192), while the spatial dimensions are unchanged. This is done to preserve the spatial precision of subsequent object detection. The resulting bottleneck feature tensor is tiled, pre-quantized to 8 bits per element, and encoded using Versatile Video Coding (VVC)-Intra [28]. At the cloud side, the encoded bitstream is decoded by a VVC decoder, passed through AD, and then fed to the YOLOv5m back-end.
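To make the structure concrete, the following is a minimal PyTorch sketch of the encoder (AE) and decoder (AD) halves, assuming the Conv/ResBlock layout of Fig. 1. Only the channel dimensions (192 → 64 → 192), the absence of batch normalization, the missing activation on the last AE layer, and the unchanged spatial resolution are fixed by the text; the intermediate channel count, kernel sizes, and number of ResBlocks are our assumptions.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block: two 3x3 convolutions with a skip connection (layout assumed)."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, kernel_size=3, stride=1, padding=1),
            nn.SiLU(),
            nn.Conv2d(ch, ch, kernel_size=3, stride=1, padding=1),
        )
    def forward(self, x):
        return x + self.body(x)

class AE(nn.Module):
    """Encoder: 192-channel YOLOv5m layer-5 features -> 64-channel bottleneck.
    Stride 1 throughout, so spatial dimensions are preserved; no batch norm,
    and no activation on the last Conv (see the caption of Fig. 1)."""
    def __init__(self, in_ch=192, bottleneck_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 128, kernel_size=3, stride=1, padding=1), nn.SiLU(),
            ResBlock(128),
            nn.Conv2d(128, bottleneck_ch, kernel_size=3, stride=1, padding=1),
        )
    def forward(self, x):
        return self.net(x)

class AD(nn.Module):
    """Decoder: 64-channel bottleneck -> 192-channel tensor for the YOLOv5m back-end."""
    def __init__(self, bottleneck_ch=64, out_ch=192):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(bottleneck_ch, 128, kernel_size=3, stride=1, padding=1), nn.SiLU(),
            ResBlock(128),
            nn.Conv2d(128, out_ch, kernel_size=3, stride=1, padding=1), nn.SiLU(),
        )
    def forward(self, x):
        return self.net(x)
```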

III-C Adversarial Training

Since the goal is to create bottleneck features with improved resilience to model inversion, we construct an auxiliary DNN model, called the Reconstruction Network (RecNet), whose goal is to reconstruct the input image from the bottleneck features. As depicted in Fig. 1, the architecture of RecNet is roughly a mirror of the YOLOv5m front-end and AE. We train the autoencoder and RecNet together in an adversarial manner [8]. During training, RecNet tries to recover the input image from the bottleneck features as best it can, while AE simultaneously tries to disrupt RecNet's performance by manipulating the generated bottleneck features. At the same time, both AE and AD attempt to keep the object detection accuracy high. Note that the pre-trained YOLOv5m model is kept intact, and its weights are frozen during the entire training process. The training process for each batch of data is summarized in the following steps:

1. The input $\mathbf{X}$ goes through the front-end, AE, and RecNet, and the reconstruction loss is computed as follows:

   $L_{rec}=\frac{1}{n}\|\mathbf{X}-\widehat{\mathbf{X}}\|_{1}+\frac{\beta}{n}\|S_{x}*\mathbf{X}-S_{x}*\widehat{\mathbf{X}}\|_{1}+\frac{\beta}{n}\|S_{y}*\mathbf{X}-S_{y}*\widehat{\mathbf{X}}\|_{1},$ (2)

   where $\widehat{\mathbf{X}}$ is the reconstructed input, $\|\cdot\|_{1}$ is the $\ell_{1}$-norm, $S_{x}$ and $S_{y}$ are the horizontal and vertical Sobel filters, respectively, $*$ is the convolution operator, and $n$ is the total number of tensor elements in the batch. The value of $\beta$ is empirically set to 5. We adopted the Sobel filter in the reconstruction loss in order to emphasize the edges, since private information is usually associated with fine detail. As a result, RecNet becomes more powerful at reconstructing the edges, which forces AE to remove edge information from the bottleneck features.

2. The gradient of $L_{rec}$ backpropagates only through RecNet and updates its weights. Note that the autoencoder's weights are frozen at this step.

3. The same batch of images goes through the whole network, and the total loss is computed as follows:

   $L_{tot}=L_{obj}-w\cdot L_{rec},$ (3)

   where $L_{obj}$ is the YOLOv5 object detection loss [24], and $w$ is the balancing weight between reconstruction and object detection. We empirically set it to 0.1.

4. The gradient of $L_{tot}$ backpropagates only through the autoencoder and updates its weights; RecNet's weights are frozen in this step. Note that the negative sign of $L_{rec}$ in (3) leads AE to make reconstruction more difficult for RecNet, while the positive sign of $L_{obj}$ causes AE and AD to improve object detection accuracy. A minimal sketch of this alternating update is given after the list.
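For concreteness, here is a minimal PyTorch sketch of one adversarial training step under the assumptions above. The component names (front_end, ae, ad, yolo_backend, recnet), the optimizers (opt_ae over the AE/AD parameters, opt_rec over RecNet), and the obj_loss callable are placeholders, and the Sobel-based reconstruction loss follows Eq. (2) with a mean in place of the exact 1/n normalization.

```python
import torch
import torch.nn.functional as F

# Sobel kernels for the edge terms of Eq. (2); replicated per channel at call time.
_SX = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
_SY = _SX.transpose(2, 3)

def rec_loss(x, x_hat, beta=5.0):
    """Reconstruction loss of Eq. (2): L1 on pixels plus weighted L1 on Sobel gradients."""
    c = x.shape[1]
    sx, sy = _SX.to(x).repeat(c, 1, 1, 1), _SY.to(x).repeat(c, 1, 1, 1)
    gx, gx_hat = F.conv2d(x, sx, padding=1, groups=c), F.conv2d(x_hat, sx, padding=1, groups=c)
    gy, gy_hat = F.conv2d(x, sy, padding=1, groups=c), F.conv2d(x_hat, sy, padding=1, groups=c)
    return ((x - x_hat).abs().mean()
            + beta * (gx - gx_hat).abs().mean()
            + beta * (gy - gy_hat).abs().mean())

def adversarial_step(x, targets, front_end, ae, ad, yolo_backend, recnet,
                     opt_ae, opt_rec, obj_loss, w=0.1):
    """One adversarial training step; front_end and yolo_backend are frozen."""
    # Steps 1-2: update RecNet only (no gradient flows into the autoencoder).
    with torch.no_grad():
        z = ae(front_end(x))
    loss_rec = rec_loss(x, recnet(z))
    opt_rec.zero_grad()
    loss_rec.backward()
    opt_rec.step()

    # Steps 3-4: update the autoencoder (AE + AD) only, with RecNet frozen.
    for p in recnet.parameters():
        p.requires_grad_(False)
    z = ae(front_end(x))
    loss_obj = obj_loss(yolo_backend(ad(z)), targets)
    loss_tot = loss_obj - w * rec_loss(x, recnet(z))   # Eq. (3)
    opt_ae.zero_grad()
    loss_tot.backward()
    opt_ae.step()
    for p in recnet.parameters():
        p.requires_grad_(True)
    return loss_tot.item(), loss_rec.item()
```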

IV Experimental Results

We trained our networks in several steps on the COCO-2017 object detection dataset [29] using an NVIDIA Tesla V100-SXM2 GPU with 32 GB of memory. In the first step, only the autoencoder was trained, with $L_{obj}$, for 50 epochs. Next, we trained only RecNet, with $L_{rec}$, for 20 epochs, with the autoencoder's weights frozen to those obtained in the first step. Finally, with the autoencoder and RecNet initialized to the previously obtained weights, adversarial training was conducted as described in Section III-C for 40 epochs. In all steps, we used the Stochastic Gradient Descent (SGD) optimizer with an initial learning rate of 0.01, decayed by a cosine learning rate schedule [30] over training, as sketched below. The object detection accuracy of the proposed autoencoder after training and of the original YOLOv5m model is given in Table I.
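A brief sketch of the optimizer setup implied by this schedule, assuming standard PyTorch SGD and cosine annealing; only the initial learning rate (0.01), the cosine decay, and the epoch counts come from the text, while the momentum value and the surrounding loop structure are assumptions.

```python
import torch

def make_optimizer(params, epochs, lr=0.01, momentum=0.937):
    """SGD with cosine learning-rate decay over the given number of epochs (sketch)."""
    opt = torch.optim.SGD(params, lr=lr, momentum=momentum)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    return opt, sched

# Stage 1: train AE + AD with L_obj for 50 epochs (RecNet not involved).
# Stage 2: train RecNet with L_rec for 20 epochs (AE/AD frozen).
# Stage 3: adversarial training (Sec. III-C) for 40 epochs, with one optimizer for
#          the autoencoder and one for RecNet; step each scheduler once per epoch.
```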

TABLE I: Results without feature compression

Detection accuracy on COCO-2017   mAP@.5   mAP@.5:.95
Proposed autoencoder              61.3     42.8
Original YOLOv5m                  62.4     43.7

IV-A Resistance to model inversion

As mentioned before, RecNet is an auxiliary DNN model used only in the adversarial training stage, so it is not part of the final pipeline. In a real situation, however, if an adversary can get hold of the edge device, they can try to train their own input reconstruction model using input-bottleneck pairs. To test our model against this attack, we train a new, randomly initialized RecNet on the bottleneck features generated by the autoencoder obtained in the adversarial training stage. This RecNet is trained with an $\ell_{1}$-norm loss (the first term in (2)) and is called RecNet-bottleneck. For comparison, we have also trained another RecNet, with its first three layers removed, on the original YOLOv5m latent space (without the AE). We call this model RecNet-latent.

We test the resistance against a model inversion attack by applying RecNet-latent to YOLOv5m features at layer 5 (without VVC compression) and applying RecNet-bottleneck to our bottleneck features (also without VVC compression). Input reconstruction performance is measured using the conventional Peak Signal-to-Noise Ratio (PSNR), as well as a new quality metric called edge-PSNR that emphasizes fidelity near the edges. To compute the edge-PSNR, the horizontal and vertical Sobel filters are applied to the original and reconstructed images to capture the image gradients in the horizontal and vertical directions. The gradient magnitude is then treated as a new image, on which the edge-PSNR is calculated. The values of PSNR and edge-PSNR are given in Table II. These results show that our AE is able to remove some of the information required for input reconstruction, especially near the edges. In particular, reconstruction from our bottleneck features is 1.4 dB worse than reconstruction from YOLOv5m features, and this loss is mostly concentrated near the edges, since the edge-PSNR is 2.5 dB lower. Some visual examples are provided in Fig. 3. As can be seen, the edges are more distorted in the output of RecNet-bottleneck: the text is not readable, and the faces and facial expressions are not easily recognizable.

TABLE II: Input reconstruction results

            RecNet-latent   RecNet-bottleneck   Difference
PSNR        21.78 dB        20.38 dB            -1.40 dB
edge-PSNR   25.61 dB        23.12 dB            -2.49 dB
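The following is a minimal sketch of how edge-PSNR could be computed as described above (Sobel gradient magnitude of the original and reconstructed images, followed by ordinary PSNR). The peak value and the per-channel handling are assumptions, since they are not specified in the text.

```python
import torch
import torch.nn.functional as F

_SX = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
_SY = _SX.transpose(2, 3)

def grad_magnitude(img):
    """Per-channel Sobel gradient magnitude of an (N, C, H, W) image tensor."""
    c = img.shape[1]
    gx = F.conv2d(img, _SX.to(img).repeat(c, 1, 1, 1), padding=1, groups=c)
    gy = F.conv2d(img, _SY.to(img).repeat(c, 1, 1, 1), padding=1, groups=c)
    return torch.sqrt(gx ** 2 + gy ** 2)

def psnr(a, b, peak=255.0):
    mse = F.mse_loss(a, b)
    return 10.0 * torch.log10(peak ** 2 / mse)

def edge_psnr(original, reconstructed, peak=255.0):
    """PSNR between the Sobel gradient-magnitude images (see Sec. IV-A)."""
    return psnr(grad_magnitude(original), grad_magnitude(reconstructed), peak)
```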
Figure 3: Examples of input reconstruction: (a) original image, (b) RecNet-latent's output, (c) RecNet-bottleneck's output.

IV-B Feature compression results

To measure the impact of feature compression, we encode the bottleneck features of the COCO validation images using VVC-Intra, specifically its VVenC [31] implementation with the lowdelay-faster preset. Prior to that, the 64 channels in the bottleneck are clipped, quantized to 8 bits, and tiled into an 8×8 matrix to create a gray-scale image. On the decoder side, the bitstream is decoded, and the resulting tensors are passed to AD and the YOLOv5m back-end for inference.

We found that most of the feature values in the bottleneck lie in the range $[-6,6]$. As noted in [32], feature compression performance can be improved via clipping. To this end, we tested three clipping ranges: $[-6,6]$, $[-3,3]$, and $[-1.5,1.5]$. We computed Bjøntegaard-Delta values [33] between their associated Rate-Accuracy curves (not shown due to space constraints), where object detection accuracy is measured via mean Average Precision (mAP) at the IoU threshold of 0.5. QP values $\{34,36,38,40,41,42\}$ were used to obtain these results. Using the performance for the $[-1.5,1.5]$ clipping range as the anchor, BD-Rate calculated based on the MPEG-VCM reporting template [34] shows that clipping ranges $[-3,3]$ and $[-6,6]$ reduce the bitrate by 8.1% and 7.1% on average, respectively. Hence, we selected $[-3,3]$ as the clipping range for further experiments.
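To make the pre-processing concrete, below is a minimal NumPy sketch of the clip-quantize-tile step described above, assuming the $[-3,3]$ clipping range and the 8×8 tiling of the 64 bottleneck channels; the rounding convention and the row-major tile ordering are assumptions.

```python
import numpy as np

def tile_bottleneck(z, clip=3.0, grid=(8, 8)):
    """Clip, quantize to 8 bits, and tile a (64, H, W) bottleneck tensor into a
    gray-scale image of shape (8*H, 8*W) for VVC-Intra coding (sketch)."""
    c, h, w = z.shape
    assert c == grid[0] * grid[1]
    zq = np.clip(z, -clip, clip)
    zq = np.round((zq + clip) / (2 * clip) * 255.0).astype(np.uint8)  # 8-bit pre-quantization
    return zq.reshape(grid[0], grid[1], h, w).transpose(0, 2, 1, 3).reshape(grid[0] * h, grid[1] * w)

def untile_bottleneck(img, clip=3.0, grid=(8, 8)):
    """Inverse of tile_bottleneck: recover a float (64, H, W) tensor after decoding."""
    gh, gw = grid
    h, w = img.shape[0] // gh, img.shape[1] // gw
    z = img.reshape(gh, h, gw, w).transpose(0, 2, 1, 3).reshape(gh * gw, h, w)
    return z.astype(np.float32) / 255.0 * (2 * clip) - clip
```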

Finally, we compare our proposed method against direct coding of YOLOv5m layer 5 features, which we call the "Anchor." Similar to encoding the bottleneck, the 192 channels in the YOLOv5m latent space are clipped, quantized to 8 bits, tiled into a 12×16 matrix, and coded using VVC-Intra.

Fig. 4 shows the Rate-Accuracy and Quality-Accuracy curves obtained with QP $\in\{34,36,38,40,41,42\}$ for the "Proposed" and QP $\in\{39,41,42,43,44,45\}$ for the "Anchor." The BD-Rate between the curves presented in Fig. 4(a) is $-31.3\%$. Hence, our proposed method reduces the bitrate by over 30% on average for the equivalent accuracy. This reduction is not surprising given the reduced dimensionality of the bottleneck features. However, the novel benefit of the proposed method is the reduction of input reconstruction ability at the same accuracy, which is depicted in Fig. 4(b). Here, the BD-PSNR between the curves is $-0.76$ dB, meaning that our bottleneck features lower the input reconstruction ability by about 0.8 dB on average. As seen earlier, by design, most of the degradation is near the edges, which is important for privacy.
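For reference, a minimal sketch of a standard Bjøntegaard-Delta rate computation that could be used to obtain such numbers (cubic polynomial fit of log-rate versus the accuracy metric, averaged over the overlapping metric range); this follows the usual BD recipe [33] and is not a reproduction of the MPEG-VCM reporting template [34].

```python
import numpy as np

def bd_rate(rate_anchor, acc_anchor, rate_test, acc_test):
    """Average bitrate difference (%) of the test curve vs. the anchor at equal accuracy.
    Inputs are per-QP rate (e.g., bits-per-pixel) and accuracy (e.g., mAP@.5) points."""
    log_ra, log_rt = np.log(rate_anchor), np.log(rate_test)
    # Fit log-rate as a cubic polynomial of the accuracy metric.
    pa = np.polyfit(acc_anchor, log_ra, 3)
    pt = np.polyfit(acc_test, log_rt, 3)
    lo = max(min(acc_anchor), min(acc_test))
    hi = min(max(acc_anchor), max(acc_test))
    # Average the log-rate difference over the overlapping accuracy range.
    int_a = np.polyval(np.polyint(pa), hi) - np.polyval(np.polyint(pa), lo)
    int_t = np.polyval(np.polyint(pt), hi) - np.polyval(np.polyint(pt), lo)
    avg_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_diff) - 1.0) * 100.0  # negative values mean bit savings
```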

Figure 4: (a) Bits-per-pixel vs. mAP@.5, (b) PSNR vs. mAP@.5.

V Conclusion

In this paper, we presented a novel feature coding scheme for machines with improved resistance to model inversion attacks. The features of the YOLOv5m latent space were transformed into a lower-dimensional space using an autoencoder, which was trained in an adversarial manner to reduce the ability to reconstruct the input. Visual and quantitative results showed that our method degrades the quality of images recovered by a model inversion attack, especially near the edges. Meanwhile, coding the features produced by our autoencoder resulted in more than 30% bit savings, on average, at the same object detection accuracy compared to coding the original YOLOv5m features.

References

  • [1] Cisco, “Cisco annual Internet report (2018–2023),” Mar. 2020.
  • [2] ISO/IEC JTC 1/SC29/WG1, “Final call for proposals for JPEG AI,” Jan. 2022, N100095.
  • [3] W. Gao, S. Liu, X. Xu, M. Rafie, Y. Zhang, and I. Curcio, “Recent standard development activities on video coding for machines,” arXiv:2105.12653, 2021.
  • [4] X. Liu, L. Xie, Y. Wang, J. Zou, J. Xiong, Z. Ying, and A. V. Vasilakos, “Privacy and security issues in deep learning: A survey,” IEEE Access, vol. 9, pp. 4566–4593, 2021.
  • [5] F. Mireshghallah, M. Taram, P. Vepakomma, A. Singh, R. Raskar, and H. Esmaeilzadeh, “Privacy in deep learning: A survey,” arXiv:2004.12254, Jul. 2020.
  • [6] Z. He, T. Zhang, and R. B. Lee, “Model inversion attacks against collaborative inference,” in Proc. 35th Annual Computer Security Applications Conference, 2019, p. 148–162.
  • [7] K.-A. Shim, “A survey of public-key cryptographic primitives in wireless sensor networks,” IEEE Commun. Surveys Tuts., vol. 18, no. 1, pp. 577–601, 2016.
  • [8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proc. NeurIPS, 2014.
  • [9] W. Lin and C.-C. Jay Kuo, “Perceptual visual quality metrics: A survey,” J. Visual Commun. Image Represent., vol. 22, no. 4, pp. 297–312, 2011.
  • [10] J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, “Variational image compression with a scale hyperprior,” in Proc. ICLR, 2018.
  • [11] D. Minnen, J. Ballé, and G. D. Toderici, “Joint autoregressive and hierarchical priors for learned image compression,” in NeurIPS, 2018.
  • [12] Z. Cheng, H. Sun, M. Takeuchi, and J. Katto, “Learned image compression with discretized Gaussian mixture likelihoods and attention modules,” in Proc. IEEE/CVF CVPR, June 2020.
  • [13] MPEG-CDVS, “Compact descriptors for visual search,” 2015, ISO/IEC JTC 1 15938-13.
  • [14] MPEG-CDVA, “Compact descriptors for video analysis,” 2019, ISO/IEC JTC 1 15938-15.
  • [15] I. V. Bajić, W. Lin, and Y. Tian, “Collaborative intelligence: Challenges and opportunities,” in Proc. IEEE ICASSP, 2021, pp. 8493–8497.
  • [16] H. Choi and I. V. Bajić, “Deep feature compression for collaborative object detection,” in Proc. IEEE ICIP, 2018, pp. 3743–3747.
  • [17] A. E. Eshratifar, A. Esmaili, and M. Pedram, “Bottlenet: A deep learning architecture for intelligent mobile cloud computing services,” in Proc. IEEE/ACM ISLPED, 2019, pp. 1–6.
  • [18] S. R. Alvar and I. V. Bajić, “Multi-task learning with compressible features for collaborative intelligence,” in Proc. IEEE ICIP, 2019.
  • [19] N. Le, H. Zhang, F. Cricri, R. Ghaznavi-Youvalari, and E. Rahtu, “Image coding for machines: an end-to-end learned approach,” in Proc. IEEE ICASSP, 2021, pp. 1590–1594.
  • [20] Z. Yuan, S. Rawlekar, S. Garg, E. Erkip, and Y. Wang, “Feature compression for rate constrained object detection on the edge,” arXiv:2204.07314, 2022.
  • [21] S. R. Alvar and I. V. Bajić, “Scalable privacy in multi-task image compression,” in Proc. IEEE VCIP, 2021, pp. 1–5.
  • [22] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed.   Wiley, 2006.
  • [23] A. Kraskov, H. Stögbauer, and P. Grassberger, “Estimating mutual information,” Phys. Rev. E, vol. 69, Jun 2004.
  • [24] G. Jocher et al., “ultralytics/yolov5: v6.1 - TensorRT, TensorFlow Edge TPU and OpenVINO Export and Inference,” Feb. 2022. [Online]. Available: https://doi.org/10.5281/zenodo.6222936
  • [25] Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang, “Neurosurgeon: Collaborative intelligence between the cloud and mobile edge,” ACM SIGARCH Computer Architecture News, vol. 45, no. 1, pp. 615–629, 2017.
  • [26] A. E. Eshratifar, M. S. Abrishami, and M. Pedram, “JointDNN: an efficient training and inference engine for intelligent mobile cloud computing services,” IEEE Trans. Mobile Comput., vol. 20, no. 2, pp. 565–576, 2021.
  • [27] H. Choi and I. V. Bajić, “Scalable image coding for humans and machines,” IEEE Trans. Image Process., vol. 31, pp. 2739–2754, 2022.
  • [28] B. Bross, Y.-K. Wang, Y. Ye, S. Liu, J. Chen, G. J. Sullivan, and J.-R. Ohm, “Overview of the Versatile Video Coding (VVC) standard and its applications,” IEEE Trans. Circuits Syst. Video Technol., vol. 31, no. 10, pp. 3736–3764, 2021.
  • [29] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: common objects in context,” in Proc. ECCV, 2014, pp. 740–755.
  • [30] T. He, Z. Zhang, H. Zhang, Z. Zhang, J. Xie, and M. Li, “Bag of tricks for image classification with convolutional neural networks,” in Proc. IEEE/CVF CVPR, June 2019.
  • [31] A. Wieckowski, J. Brandenburg, T. Hinz, C. Bartnik, V. George, G. Hege, C. Helmrich, A. Henkel, C. Lehmann, C. Stoffers, I. Zupancic, B. Bross, and D. Marpe, “VVenC: an open and optimized VVC encoder implementation,” in Proc. IEEE ICME Workshops, 2021, pp. 1–2.
  • [32] R. A. Cohen, H. Choi, and I. V. Bajić, “Lightweight compression of intermediate neural network features for collaborative intelligence,” IEEE Open J. Circuits Syst., vol. 2, pp. 350–362, 2021.
  • [33] G. Bjøntegaard, “Calculation of average PSNR differences between RD-curves,” in VCEG Meeting (ITU-T SG16 Q.6), 2001, VCEG-M33.
  • [34] C. Hollmann, S. Liu, W. Gao, and X. Xu, “[VCM] on VCM reporting template,” ISO/IEC JTC 1/SC 29/WG2 M56185, Jan. 2021.