HUT: Enabling High-UTility, Batched Queries
under Differential Privacy Protection for Internet-of-Vehicles
Abstract
The emerging trend of Internet-of-Vehicles (IoV) demands centralized servers to collect and process sensitive data, since the computational resources on a single vehicle are limited. Such centralization of sensitive data demands practical privacy protection. One widely-applied paradigm, Differential Privacy (DP), can provide strong guarantees over sensitive data by adding noise. However, directly applying DP to IoV incurs significant challenges for data utility and effective protection. We observe that the key issue of DP-enabled protection in IoV lies in how to synergistically combine DP with the special characteristics of IoV, whose query sequences are usually formed as unbalanced batches due to frequent interactions between centralized servers and edge vehicles.
To this end, we propose HUT, a new algorithm to enable High UTility for DP-enabled protection in IoV. Our key insight is to leverage an inherent characteristic of IoV: the unbalanced batches. Our key idea is to aggregate local batches and apply Order Constraints, so that the information loss from DP protection can be mitigated. We evaluate the effectiveness of HUT against state-of-the-art DP protection mechanisms. The results show that HUT reduces information loss by up to 95.69% while simultaneously enabling strong, mathematically-guaranteed protection over sensitive data.
1 Introduction
Internet-of-Vehicles (IoV) is considered a promising solution to the limited computational power of a single vehicle. IoV centralizes statistics from edge vehicles in a powerful server for computation. However, the centralized statistics may contain sensitive data, which substantially requires privacy protection in practice [1, 2, 3, 4, 5, 6]. Differential Privacy (DP) [7], the dominant mechanism for privacy-preserving computation, provides strong guarantees over sensitive data by statistically adding noise. Though DP is a promising solution to the privacy concerns in IoV, directly applying DP to IoV incurs significant challenges in terms of both data utility and effective protection.
We address the key issue behind applying DP to IoV: how to synergistically combine DP with the special characteristics of IoV, whose query sequences are usually formed as unbalanced batches in practice. This unique characteristic arises from a practical need: vehicles issue frequent queries for relevant data to make in-time decisions [8]. Hence, striking a balance between privacy protection and data utility is a major challenge when applying DP to IoV [9], because over-protection can cause low data utility.
Prior works aim to balance the tradeoffs between data utility and privacy protection, but fail to address the batched queries within IoV. [10] introduces Constraint Inference, which applies a constraint on the order of a sorted data set to decrease unnecessary noise; [11] introduces Micro-Aggregation, which clusters the data set with fixed-size clusters before adding noise, to reduce the information loss; and [12] introduces K-aggregation, which aggregates small-value data into new data once the summation of the small values exceeds a certain threshold. Ignoring the batched structure prevents DP from being applied effectively in the context of IoV, which demands new insights for bridging DP with IoV.
We propose HUT, a new algorithm to enable High UTility for DP-enabled protection in the context of IoV. Our key insight is to take advantage of the fact that the interactions between centralized servers and edge vehicles are usually frequent and fine-grained, and are combined into a series of unbalanced batches. To exploit this characteristic, our key idea is to aggregate local batches and apply Order Constraints, so that the information loss from DP protection can be mitigated.
We quantitatively evaluate the effectiveness of HUT against the state-of-the-art DP protection mechanism. Our results show that HUT reduces information loss by up to 95.69% while simultaneously enabling strong, mathematically-guaranteed protection of sensitive data. Meanwhile, HUT does not weaken the guarantee of the DP design, which ensures effective protection of users' sensitive data.
2 Design Overview
We give an overview of our HUT design for ensuring DP protection with high data utility. HUT consists of three steps: 1) we leverage Micro-Aggregation with the K-means algorithm to group small-value batches together, so that we can mitigate the negative effects of uniform DP protection; 2) we deploy Differential Privacy (DP) to preserve the privacy of the pre-processed data; 3) we exploit Order Constraints (OC) to re-sort the DP-protected data, so that we can achieve high data utility. Figure 1 compares HUT with two other state-of-the-art DP protection methods, and we elaborate on the differences as follows.

1) Micro-Aggregation: HUT first applies Micro-Aggregation to cluster small-value raw data, so that we can mitigate the negative impacts of DP on data utility. This is because DP adds uniform random noise to the dataset, so small-value data suffer from a lower Signal-to-Noise Ratio (SNR) than large-value data. The data utility is therefore impacted unevenly, depending on the exact data values, and a mechanism to mitigate such impacts is essential. We achieve this as follows. Before applying DP techniques on the raw data for protection, HUT sets a threshold to divide the small-value data and the rest into different batches (To-be-clustered batches and Not-clustered batches). Next, we apply K-means clustering on the data within the threshold (the To-be-clustered batches) to automatically group such small-value data further into smaller intra-batches (i.e., Micro-Aggregation).
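To make this step concrete, the sketch below splits a toy value set at a threshold and runs a plain 1-D K-means (Lloyd's algorithm) on the small-value batch, replacing each small value with its cluster centroid. The function name `micro_aggregate` and the sample values are illustrative, not from the paper:

```python
import numpy as np

def micro_aggregate(values, threshold, k, iters=50, seed=0):
    """Split `values` at `threshold`, then cluster the small-value part
    with a simple 1-D K-means and replace each small value by its
    cluster centroid (micro-aggregation)."""
    values = np.asarray(values, dtype=float)
    small = values[values <= threshold]   # To-be-clustered batch
    large = values[values > threshold]    # Not-clustered batch

    # Plain 1-D K-means (Lloyd's algorithm) on the small-value batch.
    rng = np.random.default_rng(seed)
    centroids = rng.choice(small, size=k, replace=False)
    for _ in range(iters):
        # Assign each point to its nearest centroid, then recompute means.
        labels = np.argmin(np.abs(small[:, None] - centroids[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = small[labels == j].mean()

    aggregated = centroids[labels]        # each small value -> its centroid
    return aggregated, large

agg, rest = micro_aggregate([1, 2, 2, 3, 40, 55, 1.5, 2.5], threshold=10, k=2)
```

After this step, the variation of the small-value batch is bounded by the cluster centroids, while the large values pass through untouched.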
Compared with [11], HUT effectively reduces the sensitivity and variation of small-value data, since they are upper-bounded by the centroid of the respective cluster. In [11] (shown as ➋ in Figure 1), a fixed-size clustering method is applied; this yields a consistent distance function on a total order relation, but incurs a large bias on datasets with highly uneven data distributions. Such bias on unevenly distributed data can harm the integrity of the dataset, leading to additional information loss. We therefore leverage the K-means clustering algorithm instead, to mitigate this bias.
2) DP Protection: we then add Laplace noise to the data to achieve ε-Differential Privacy. DP aims to minimize the changes in query answers when a record in the data set is changed, by applying a randomized function (subject to ε-differential privacy on the data). Hereby, we elaborate ε-Differential Privacy via the mathematical description of a randomized function (shown below).
Definition 1
(Randomized Function subject to ε-DP) A randomized function K satisfies ε-DP if, for any two data sets D₁ and D₂ differing in at most one element, and for any subset S of the output range of K, the function obeys the following inequality:

Pr[K(D₁) ∈ S] ≤ e^ε · Pr[K(D₂) ∈ S]
The above randomized function is commonly realized by adding Laplace-distributed noise Lap(Δf/ε), where Δf is the sensitivity of the query and ε is inversely proportional to the noise magnitude (the smaller ε is, the larger the noise). In HUT, we try out various ε settings to test the feasibility of the proposed method at different protection scales.
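A minimal sketch of the Laplace mechanism described above, assuming a generic numeric query with known sensitivity Δf (the name `laplace_mechanism` and the example query value are ours, not the paper's):

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    """Return a noisy answer satisfying epsilon-DP:
    noise ~ Lap(sensitivity / epsilon).
    Smaller epsilon -> larger noise scale -> stronger protection."""
    if rng is None:
        rng = np.random.default_rng()
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# e.g. protect a hypothetical speed query (sensitivity 1)
# under the epsilon values used in the paper's experiments
for eps in (0.008, 0.01, 0.02, 0.05):
    noisy = laplace_mechanism(60.0, sensitivity=1.0, epsilon=eps)
```

Note how the noise scale at ε = 0.008 is over six times that at ε = 0.05, which is why the choice of ε dominates the utility/protection tradeoff.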
3) Order Constraint:
HUT finally utilizes a noisy-data post-processing method, Order Constraint (OC), to minimize unnecessary noise. A general feature of noisy data is that, after adding noise (especially large-scale noise) to a previously sorted dataset, the noisy dataset often turns out to be disordered. [10] introduces OC to re-process the noisy data so that it preserves the original order. Specifically, the Min-Max formulas (constraint inference) [13] are used to remove the unnecessary noise, applying least-squares regression to adjust the values in the noisy dataset so that they still follow the previously sorted order. The removed noise is proven to be unnecessary: removing it does not weaken the privacy protection of the dataset, but significantly reduces unnecessary information loss. Thus, we apply this method in the context of DP protection as a step to further reduce unnecessarily-added DP noise, with no negative impact on the privacy protection itself.
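An order-constrained least-squares fit of this kind can be sketched with the standard Pool Adjacent Violators algorithm, which we use here as one illustrative realization of constraint inference (a stand-in for the Min-Max formulas of [13], not the paper's exact implementation):

```python
def enforce_order(noisy):
    """Least-squares projection of `noisy` onto non-decreasing sequences
    (Pool Adjacent Violators), restoring the pre-noise sorted order."""
    sums, counts = [], []           # blocks of (sum, count), means non-decreasing
    for v in map(float, noisy):
        sums.append(v)
        counts.append(1)
        # Merge adjacent blocks while their means violate the order.
        while len(sums) > 1 and sums[-2] / counts[-2] > sums[-1] / counts[-1]:
            s, c = sums.pop(), counts.pop()
            sums[-1] += s
            counts[-1] += c
    out = []
    for s, c in zip(sums, counts):  # expand each block back to its points
        out.extend([s / c] * c)
    return out
```

For example, a noisy sequence `[1, 3, 2, 4]` (whose raw counterpart was sorted) is projected to `[1, 2.5, 2.5, 4]`: the out-of-order pair is averaged, and the rest is left untouched.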
Novelty of HUT: HUT has three novelties compared with the state-of-the-art DP protection mechanisms. First, HUT applies two-stage Micro-Aggregation to mitigate the negative impacts on fine-grained batches, which is a unique characteristic of IoV; second, HUT leverages K-means clustering, rather than fixed-size clustering, to improve the utility of protected data; and third, HUT selectively enables lightweight constraints that are synergistic with the other designs, for both DP protection and data utility.
Are there any negative impacts on DP protection? In short, there are none when applying HUT: since K-means clustering does not conflict with the mathematical guarantee of DP, we focus on the Micro-Aggregation algorithm. First, Micro-Aggregation in HUT does not weaken the DP protection, since K-means clustering constrains only small-value data; and second, Micro-Aggregation in HUT does not otherwise change the data, and therefore no negative impacts are incurred on DP protection.

3 Experimental Setup
Selected Dataset:
We use the BROOK dataset [14], a multi-modal dataset with facial videos and 11 dimensions of biosignal/contextual data from 34 drivers' driving experiences. In our experiments, we specifically handcraft a dataset that contains drivers' facial images and the corresponding vehicle speeds, as a proof-of-concept. Figure 2 demonstrates an example of our handcrafted dataset.
Configurations of Proposed Methods:
There are three key parameters in our method: the scale of DP ε, the threshold, and the number of clusters k for Micro-Aggregation. We test our design under various settings of ε, k, and the threshold, in which some settings are used for simple queries and others for counting queries, to satisfy different data set configurations. We conduct 20 experiments for each parameter setting and take the mean as the final value of information loss.
Query Settings:
We examine the data utility of the DP-protected dataset with two representative types of queries. The first query type is a simple query, which takes facial images as input and queries the corresponding speed data in our handcrafted dataset. These queries simulate the unit-length querying process through which intelligent in-vehicle systems continuously retrieve data for fine-grained decision-making.
The second query type is the counting query, which simulates the statistical analysis functions of in-vehicle systems. A counting query is similar to querying data from a contingency table (the bottom image in Figure 2). Different from the first, general situation, this query type already has a fixed sensitivity of 1, making it more challenging for HUT because the room for sensitivity reduction is limited.
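A counting query of this kind can be sketched as follows; the sample records and the `noisy_count` helper are illustrative, but the fixed sensitivity of 1 is exactly the property discussed above (adding or removing one record changes the count by at most 1):

```python
import numpy as np

def noisy_count(records, predicate, epsilon, rng=None):
    """Counting query with fixed sensitivity 1: the DP noise scale is
    therefore always 1/epsilon, independent of the data values."""
    if rng is None:
        rng = np.random.default_rng()
    true_count = sum(1 for r in records if predicate(r))
    return true_count + rng.laplace(scale=1.0 / epsilon)

# e.g. count how many hypothetical records exceed a speed of 50
records = [{"speed": 62}, {"speed": 48}, {"speed": 71}]
answer = noisy_count(records, lambda r: r["speed"] > 50, epsilon=0.05)
```

Because the sensitivity cannot be lowered below 1, the only levers left for utility here are aggregation and order constraints, which is why this query type stresses HUT.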
Evaluation Metric: We use the Mean Square Error (MSE) as the evaluation metric for data utility and information loss. We denote the i-th query response on the DP-protected data as q̃ᵢ, and the corresponding query response on the raw data as qᵢ. MSE over n queries is calculated as MSE = (1/n) Σᵢ (q̃ᵢ − qᵢ)².
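This metric can be sketched directly; `information_loss` is an illustrative name for the MSE over paired query responses:

```python
import numpy as np

def information_loss(protected_answers, raw_answers):
    """Mean Square Error between query responses on the DP-protected
    data and the same queries on the raw data."""
    p = np.asarray(protected_answers, dtype=float)
    q = np.asarray(raw_answers, dtype=float)
    return float(np.mean((p - q) ** 2))

information_loss([61.0, 59.0], [60.0, 60.0])  # -> 1.0
```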
4 Experimental Results
We compare HUT with the state-of-the-art DP algorithms to examine its effectiveness in terms of data utility. Since the average information losses of methods ➋ and ➌ in Figure 1 are highly similar under all experimental conditions, we select the overall best performer, ➌, to compare against the information loss of HUT.
The results of the simple query and the counting query under different ε values, with fixed k clusters and threshold portions, are presented in Figure 3 and Figure 4. Both figures demonstrate the effectiveness of Micro-Aggregation in increasing the SNR of small-value data and mitigating the negative effects of DP protection, in a common setting (simple query) and a challenging setting (counting query). For the simple query, our method achieves a maximum of 98% (when ε=0.008) and a minimum of 79% (when ε=0.05) information loss reduction compared to the state-of-the-art, with k=10 and a 30% threshold. For the counting query, the maximum information loss reduction is 65% (when ε=0.008) and the minimum is 27% (when ε=0.05). We note that when the noise magnitude grows (i.e., ε decreases), the performance of our algorithm degrades for the simple query, but improves for the counting query. The opposite trends in the two query settings demonstrate two things: 1) applying DP on raw data can significantly harm normal data utility, as the simple query, a primary indexing operation, can cause relatively severe information loss; 2) aggregation, as one kind of counting operation, can effectively avoid over-protection in extreme conditions, but is less effective when the data is not over-protected.
Figure 5 and Figure 6 present the information loss of the simple query and the counting query under different k settings, with ε and the threshold fixed at 0.02 and 35%, respectively. We observe that an effective choice of the number of clusters for K-means clustering falls in the range between 5 and 15, which yields the optimal information loss reduction. In addition, our empirical study shows that any reasonable threshold value incurs only negligible perturbations on the information loss, as different threshold values cause at most a 1% variance for the simple query and 6% for the counting query.
Table 1 reports the best percentage of average information loss reduction compared to the state-of-the-art mechanism, for both the simple query and the counting query, for each selected ε as the number of clusters k and the aggregation threshold vary. We believe that studying the mitigation of DP's negative effects under different settings is essential, as a small change in the settings can cause relatively large fluctuations in the information loss.
| Query Type | Epsilon | K | Threshold | % of reduction |
|---|---|---|---|---|
| Simple Query | 0.008 | 10 | 30% | 78.51% |
| | 0.01 | 8 | 30% | 80.74% |
| | 0.02 | 8 | 30% | 88.75% |
| | 0.05 | 5 | 30% | 95.69% |
| Counting Query | 0.008 | 10 | 35% | 64.13% |
| | 0.01 | 5 | 30% | 71.71% |
| | 0.02 | 5 | 35% | 45.22% |
| | 0.05 | 10 | 40% | 50.46% |
5 Conclusions and Future Works
We present HUT, a new algorithm to enable High UTility for DP-enabled protection in the context of IoV. Our key insight is to take advantage of the fact that the interactions between centralized servers and edge vehicles are usually frequent and fine-grained, and are combined into a series of unbalanced batches. We evaluate the effectiveness of HUT against the state-of-the-art DP protection mechanism, and the results show that HUT reduces information loss by up to 95.69% while simultaneously enabling strong, mathematically-guaranteed protection of sensitive data.
Our future work aims to explore the scalability and robustness of HUT. For scalability, we expect to examine whether HUT can be applied to datasets with broad value ranges or uneven distributions; as for robustness, HUT still suffers from the instability of the K-means clustering method, which can produce query results with relatively large variance. Therefore, we expect a trade-off among different clustering algorithms in terms of the robustness of HUT.
References
- [1] Qian Mei, Hu Xiong, Yanan Zhao, and Kuo-Hui Yeh, “Toward blockchain-enabled iov with edge computing: Efficient and privacy-preserving vehicular communication and dynamic updating,” in 2021 IEEE Conference on Dependable and Secure Computing (DSC), 2021, pp. 1–8.
- [2] Kai Fan, Wei Jiang, Qi Luo, Hui Li, and Yintang Yang, “Cloud-based rfid mutual authentication scheme for efficient privacy preserving in iov,” Journal of the Franklin Institute, vol. 358, no. 1, pp. 193–209, 2021.
- [3] Zeyu Xiong, Jiahao Wang, Wangkai Jin, Junyu Liu, Yicun Duan, Zilin Song, and Xiangjun Peng, “Face2Statistics: User-Friendly, Low-Cost and Effective Alternative to In-Vehicle Sensors/Monitors for Drivers,” in International Conference on Human-Computer Interaction, 2022.
- [4] Zhentao Huang, Rongze Li, Wangkai Jin, Zilin Song, Yu Zhang, Xiangjun Peng, and Xu Sun, “Face2Multi-modal: In-vehicle Multi-modal Predictors via Facial Expressions,” in Adjunct Proceedings of the 12th International Conference on Automotive User Interfaces and Interactive Vehicular Applications, AutomotiveUI 2020, Virtual Event, Washington, DC, USA, September 21-22, 2020. 2020, pp. 30–33, ACM.
- [5] Jiahao Wang, Zeyu Xiong, Yicun Duan, Junyu Liu, Zilin Song, and Xiangjun Peng, “The Importance Distribution of Drivers’ Facial Expressions Varies over Time!,” in AutomotiveUI ’21: 13th International Conference on Automotive User Interfaces and Interactive Vehicular Applications, Leeds, United Kingdom, September 9-14, 2021 - Adjunct Proceedings. 2021, pp. 148–151, ACM.
- [6] Yu Zhang, Wangkai Jin, Zeyu Xiong, Zhihao Li, Yuyang Liu, and Xiangjun Peng, “Demystifying Interactions Between Driving Behaviors and Styles Through Self-clustering Algorithms,” in International Conference on Human-Computer Interaction, Heidi Krömker, Ed., 2021.
- [7] Cynthia Dwork, “Differential privacy,” in Automata, Languages and Programming, Michele Bugliesi, Bart Preneel, Vladimiro Sassone, and Ingo Wegener, Eds., Berlin, Heidelberg, 2006, pp. 1–12, Springer Berlin Heidelberg.
- [8] Fiodar Kazhamiaka, Matei Zaharia, and Peter Bailis, “Challenges and opportunities for autonomous vehicle query systems,” in 11th Conference on Innovative Data Systems Research, CIDR 2021, Virtual Event, January 11-15, 2021, Online Proceedings. 2021, www.cidrdb.org.
- [9] Yicun Duan, Junyu Liu, Wangkai Jin, and Xiangjun Peng, “Characterizing Differentially-Private Techniques in the Era of Internet-of-Vehicles,” Technical Report-Feb-03 at User-Centric Computing Group, University of Nottingham Ningbo China, 2022.
- [10] Michael Hay, Vibhor Rastogi, Gerome Miklau, and Dan Suciu, “Boosting the accuracy of differentially private histograms through consistency,” Proc. VLDB Endow., vol. 3, no. 1, pp. 1021–1032, 2010.
- [11] Jordi Soria-Comas, Josep Domingo-Ferrer, David Sánchez, and Sergio Martínez, “Enhancing data utility in differential privacy via microaggregation-based k-anonymity,” VLDB J., vol. 23, no. 5, pp. 771–794, 2014.
- [12] Bo-Chen Tai, Szu-Chuang Li, and Yennun Huang, “K-aggregation: Improving accuracy for differential privacy synthetic dataset by utilizing k-anonymity algorithm,” in 31st IEEE International Conference on Advanced Information Networking and Applications, AINA 2017, Taipei, Taiwan, March 27-29, 2017, Leonard Barolli, Makoto Takizawa, Tomoya Enokido, Hui-Huang Hsu, and Chi-Yi Lin, Eds. 2017, pp. 772–779, IEEE Computer Society.
- [13] S. Gopal Krishna Patro and Kishore Kumar Sahu, “Normalization: A preprocessing stage,” 2015.
- [14] Xiangjun Peng, Zhentao Huang, and Xu Sun, “Building BROOK: A multi-modal and facial video database for human-vehicle interaction research,” CoRR, vol. abs/2005.08637, 2020.