Enhancing Utility in the Watchdog Privacy Mechanism
Abstract
This paper is concerned with enhancing data utility in the privacy watchdog method for attaining information-theoretic privacy. For a given privacy constraint, the watchdog method filters out the high-risk data symbols by applying a uniform data regulation scheme, e.g., merging all high-risk symbols together. While this method entirely trades off symbol resolution for privacy, we show that data utility can be greatly improved by partitioning the high-risk symbol set and privatizing each subset individually. We further propose an agglomerative merging algorithm that finds a suitable partition of high-risk symbols: it starts with a singleton high-risk symbol, which is iteratively fused with others until the resulting subsets are private. Numerical simulations demonstrate the efficacy of this algorithm in privately achieving higher utilities in the watchdog scheme.
Index Terms— Information-theoretic privacy; Watchdog privacy mechanism; Privacy-utility trade-off.
1 Introduction
Industries and governments are increasingly sharing data to unlock economic and societal benefits through advances in data analytics and machine learning. However, such data also contains sensitive information about individuals or businesses, which makes privacy regulators, users, and data providers concerned about the leakage of confidential information, either explicit or implicit. In signal processing and information theory, data privacy is underpinned by a measure called the information lift [1, 2].
To evaluate how much the public data $X$ is informative about the sensitive data $S$, the lift measures the change from the prior belief $P_S(s)$ to the posterior belief $P_{S|X}(s|x)$ for each instance $s$ of $S$ and $x$ of $X$ by

$$ l(s,x) = \frac{P_{S|X}(s|x)}{P_S(s)}. \tag{1} $$
It is clear that a small lift indicates limited private information gain by the adversary and therefore the more private $X$ is. The lift is the elementary measure underlying almost all information leakage measures, e.g., mutual information [1, 3], Sibson mutual information [4, 5, 6], $\alpha$-leakage [7] and local differential privacy [8, 9]: as proved in [2], if the lift is bounded, all these leakage measures are also bounded.
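To make (1) concrete, the following minimal sketch (illustrative joint pmf and alphabet sizes, not from the paper's experiments) computes the lift and log-lift for every pair $(s,x)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint pmf P_{SX}(s, x): 3 sensitive symbols, 5 public symbols.
P = rng.random((3, 5))
P /= P.sum()

P_S = P.sum(axis=1, keepdims=True)  # marginal P_S(s), shape (3, 1)
P_X = P.sum(axis=0, keepdims=True)  # marginal P_X(x), shape (1, 5)

# Lift (1): l(s, x) = P_{S|X}(s|x) / P_S(s) = P_{SX}(s, x) / (P_S(s) P_X(x)).
lift = P / (P_S * P_X)
log_lift = np.log(lift)
print(np.round(log_lift, 3))  # entries near 0 mean little change in the adversary's belief
```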
| | utility | privacy leakage |
|---|---|---|
| complete merging | 0.5913 | 0.8037 |
| two-subset merging | 0.7335 | 0.8488 |

Table 1: Utility and privacy leakage when the high-risk symbol set is merged completely versus split into two subsets.
The existing approach to attain lift-based privacy is the watchdog method proposed in [2]. For a specific privacy constraint, i.e., a threshold $\varepsilon$ on the (log-)lift, the watchdog method filters out and privatizes the high-risk symbols of $X$. The authors in [2] adopted a uniform approach: merging all high-risk symbols into a 'super' symbol, which is proved in [10, 6] to be the optimal privatization scheme for attaining data privacy. However, this uniform approach neglects an important issue in data privacy: to achieve the benefits of data sharing, the privatized data $Y$ should provide a satisfactory level of usefulness (data utility is usually quantified by average performance measures such as mutual information [3], $f$-divergence [11] and Hamming distortion [12]; we use mutual information in this paper). Despite the relaxation attempts in [10, 6], the complete merging method minimizes the resolution of high-risk symbols and significantly deteriorates data utility, which is at odds with the purpose of data sharing.
In fact, even a small alteration can greatly enhance the data utility. In Table 1, we arbitrarily cut a high-risk symbol set (of size 7) into two subsets, each of which is then privatized individually. The utility (measured by mutual information) increases significantly without sacrificing too much data privacy, which remains below the design constraint $\varepsilon$. This not only shows that the complete merging approach is an 'overkill' in terms of data utility, but also suggests a partitioned privatization approach.
In this paper, we propose a subset merging method for the watchdog mechanism to enhance data utility. Finding the best partition of the high-risk symbol set that achieves optimal utility is in general a combinatorial problem. Accordingly, we propose a heuristic greedy algorithm that finds good subsets to merge while guaranteeing data privacy. This greedy algorithm searches for the finest partition of the original high-risk symbol set that ensures the resulting lift of the whole dataset does not exceed $\varepsilon$. It starts with the singleton partition of high-risk symbols (highest resolution) and iteratively merges symbols until the lift values of all resulting subsets are below $\varepsilon$. Numerical simulations show that our proposed algorithm enhances utility significantly while maintaining the privacy leakage constraint.
2 System Model
Denote by random variables $S$ and $X$ the sensitive and public data, respectively. The joint distribution $P_{SX}$ describes the statistical correlation between $S$ and $X$. To protect the privacy of $S$, we sanitize $X$ to $Y$ by the transition probability $r(y|x)$. Here, $r(y|x) = P_{Y|XS}(y|x,s)$ for each $s$, and therefore the Markov chain $S - X - Y$ is formed.
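A sketch of this setup (the pmf, channel, and alphabet sizes below are illustrative assumptions): applying the transition probability to the joint distribution yields the distribution of the released pair $(S,Y)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint pmf P_{SX} and a row-stochastic sanitization channel r(y|x).
P_SX = rng.random((3, 5))
P_SX /= P_SX.sum()
R = rng.random((5, 4))
R /= R.sum(axis=1, keepdims=True)   # each row of R is a distribution r(.|x)

# Markov chain S - X - Y: Y depends on (S, X) only through X.
P_SY = P_SX @ R                     # P_{SY}(s, y) = sum_x P_{SX}(s, x) r(y|x)
P_Y = P_SY.sum(axis=0)              # marginal of the released data
print(np.round(P_Y, 3), P_SY.sum())
```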
The watchdog method is based on the logarithm of the lift measure,

$$ i(s,x) = \log \frac{P_{S|X}(s|x)}{P_S(s)}. $$

For each $x \in \mathcal{X}$, denote the maximum symbol-wise information leakage by $\omega(x)$, where

$$ \omega(x) = \max_{s \in \mathcal{S}} |i(s,x)|. \tag{2} $$
Applying an upper bound $\varepsilon$ on $\omega(x)$ for all symbols $x$, the whole alphabet $\mathcal{X}$ is divided into two subsets: the low-risk subset, given by $\mathcal{X}_\varepsilon^c = \{x \in \mathcal{X} : \omega(x) \le \varepsilon\}$, which is safe to publish, and the high-risk symbol set
$$ \mathcal{X}_\varepsilon = \{x \in \mathcal{X} : \omega(x) > \varepsilon\}, $$
which requires some treatment before data publishing.
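A minimal sketch of this filtering step, assuming the joint pmf is available as a matrix (the threshold and distribution below are illustrative, not the paper's settings):

```python
import numpy as np

def watchdog_split(P, eps):
    """Split the alphabet of X into low-risk and high-risk symbols, cf. (2).

    P   : 2D array with P[s, x] = P_{SX}(s, x)
    eps : threshold on the maximum symbol-wise log-lift omega(x)
    """
    P_S = P.sum(axis=1, keepdims=True)
    P_X = P.sum(axis=0, keepdims=True)
    log_lift = np.log(P / (P_S * P_X))        # i(s, x)
    omega = np.abs(log_lift).max(axis=0)      # omega(x) = max_s |i(s, x)|
    low_risk = np.flatnonzero(omega <= eps)   # X_eps^c: safe to publish
    high_risk = np.flatnonzero(omega > eps)   # X_eps: needs randomization
    return low_risk, high_risk, omega

rng = np.random.default_rng(1)
P = rng.random((3, 8))
P /= P.sum()
low, high, omega = watchdog_split(P, eps=0.4)
print(low, high, np.round(omega, 3))
```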
The authors in [10, 6] adopt a uniform randomization scheme

$$ r(y|x) = \begin{cases} \mathbf{1}\{y = x\}, & x \in \mathcal{X}_\varepsilon^c, \\ r_{\mathcal{X}_\varepsilon}(y|x), & x \in \mathcal{X}_\varepsilon, \end{cases} \tag{3} $$

where $r_{\mathcal{X}_\varepsilon}$ is the complete merging solution, i.e., there is only one super symbol $y_{\mathcal{X}_\varepsilon}$ such that $r_{\mathcal{X}_\varepsilon}(y_{\mathcal{X}_\varepsilon}|x) = 1$ for all $x \in \mathcal{X}_\varepsilon$ and $r_{\mathcal{X}_\varepsilon}(y|x) = 0$ otherwise.
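The scheme (3) with complete merging can be written as an explicit channel matrix; the indexing convention below (super symbol appended last) is an assumption for illustration, not the authors' implementation.

```python
import numpy as np

def complete_merging_channel(n_x, high_risk):
    """Channel matrix for the uniform scheme (3): identity on low-risk symbols,
    one appended super symbol absorbing every high-risk symbol.

    Returns an (n_x, n_y) row-stochastic matrix R with R[x, y] = r(y|x).
    """
    high_risk = set(int(x) for x in high_risk)
    low_risk = [x for x in range(n_x) if x not in high_risk]
    n_y = len(low_risk) + (1 if high_risk else 0)
    R = np.zeros((n_x, n_y))
    for y, x in enumerate(low_risk):
        R[x, y] = 1.0              # publish low-risk symbols unchanged
    for x in high_risk:
        R[x, n_y - 1] = 1.0        # map every high-risk symbol to the super symbol
    return R

print(complete_merging_channel(n_x=6, high_risk=[1, 4]))
```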
After randomization, the log-lift is given by $i(s,y) = \log \frac{P_{S|Y}(s|y)}{P_S(s)}$, where $P_{SY}(s,y) = \sum_{x} P_{SX}(s,x)\, r(y|x)$ due to the Markov property, and $P_Y(y) = \sum_x P_X(x)\, r(y|x)$. We can extend the notion of the log-lift from a single symbol $x$ to a subset $\mathcal{X}' \subseteq \mathcal{X}$ [10]:

$$ i(s,\mathcal{X}') = \log \frac{P(\mathcal{X}'|s)}{P(\mathcal{X}')}, \tag{4} $$

where $P(\mathcal{X}'|s) = \sum_{x \in \mathcal{X}'} P_{X|S}(x|s)$ and $P(\mathcal{X}') = \sum_{x \in \mathcal{X}'} P_X(x)$.
Applying $r_{\mathcal{X}_\varepsilon}$, the resulting upper bound on the privacy leakage after randomization is given by [10]

$$ \omega(\mathcal{X}_\varepsilon) = \max_{s \in \mathcal{S}} |i(s,\mathcal{X}_\varepsilon)|. \tag{5} $$

Merging can only reduce the leakage, i.e., $\omega(\mathcal{X}_\varepsilon) \le \max_{x \in \mathcal{X}_\varepsilon} \omega(x)$, and in practice $\omega(\mathcal{X}_\varepsilon)$ typically falls below $\varepsilon$, so the scheme attains the privacy constraint. However, the value of $\omega(\mathcal{X}_\varepsilon)$ is variable and depends on the joint probability distribution $P_{SX}$.
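Under (4)-(5), the post-merging leakage depends only on the aggregate subset probabilities. The sketch below (illustrative pmf and subset) computes it and also checks numerically that the merged leakage never exceeds the largest constituent leakage:

```python
import numpy as np

def subset_log_lift(P, subset):
    """i(s, X') = log( P(X'|s) / P(X') ) for a subset X' of public symbols, cf. (4)."""
    cols = list(subset)
    P_S = P.sum(axis=1)
    P_sub_given_s = P[:, cols].sum(axis=1) / P_S   # P(X'|s)
    P_sub = P[:, cols].sum()                       # P(X')
    return np.log(P_sub_given_s / P_sub)

def subset_leakage(P, subset):
    """omega(X') = max_s |i(s, X')|, the post-merging leakage in (5)."""
    return np.abs(subset_log_lift(P, subset)).max()

rng = np.random.default_rng(2)
P = rng.random((3, 8))
P /= P.sum()
high_risk = [2, 5, 7]   # illustrative high-risk indices
print(round(subset_leakage(P, high_risk), 3))                      # merged leakage
print(round(max(subset_leakage(P, [x]) for x in high_risk), 3))    # >= merged leakage
```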
Example 1
Fix finite alphabets for $S$ and $X$ and a threshold $\varepsilon$. We randomly generate a joint distribution $P_{SX}$; the resulting maximum symbol-wise leakage is $\max_x \omega(x)$ and the utility before randomization (with $Y = X$) is $H(X)$. For the given $\varepsilon$, we obtain the low-risk and high-risk subsets $\mathcal{X}_\varepsilon^c$ and $\mathcal{X}_\varepsilon$. After randomization, the high-risk symbols are mapped to the single super symbol $y_{\mathcal{X}_\varepsilon}$ as in (3), and the maximum symbol-wise leakage and the utility become $\omega(\mathcal{X}_\varepsilon)$ and $I(X;Y)$, as in (5) and (6), respectively.
In Example 1, the leakage in the high-risk subset after randomization, $\omega(\mathcal{X}_\varepsilon)$, is an order of magnitude smaller than the original privacy constraint $\varepsilon$ based on which $\mathcal{X}_\varepsilon$ was obtained. Although this small leakage guarantees a very high level of privacy, it damages utility drastically: the utility drops far below its value before randomization. On the other hand, when a threshold $\varepsilon$ is given as the privacy constraint, it is acceptable to keep the privacy leakage just below $\varepsilon$, even if it becomes very close to this threshold. Thus, in the next section, we propose a subset merging method that enhances utility: the privacy leakage increases, but remains under $\varepsilon$ to the extent possible.
3 Enhancing Utility
In this section, we introduce an approach to enhance utility while maintaining a set privacy constraint. We measure utility by the mutual information $I(X;Y)$, which for the bi-partition $\{\mathcal{X}_\varepsilon^c, \mathcal{X}_\varepsilon\}$ and complete merging randomization is given by [10]

$$ I(X;Y) = H(X) - P_X(\mathcal{X}_\varepsilon)\, H(X \mid X \in \mathcal{X}_\varepsilon), \tag{6} $$

where $H(X \mid X \in \mathcal{X}_\varepsilon)$ is the entropy of the conditional distribution $\{P_X(x)/P_X(\mathcal{X}_\varepsilon)\}_{x \in \mathcal{X}_\varepsilon}$.
Clearly, the utility depends on $P_X(x)$ for $x \in \mathcal{X}_\varepsilon$ and on the overall probability $P_X(\mathcal{X}_\varepsilon)$. Our proposed approach to enhance data utility is through increasing the data resolution. That is, we propose to randomize subsets of $\mathcal{X}_\varepsilon$ separately rather than completely merging the whole set $\mathcal{X}_\varepsilon$.
Consider the bi-partition $\{\mathcal{X}_\varepsilon^c, \mathcal{X}_\varepsilon\}$, a further partitioning of the elements in $\mathcal{X}_\varepsilon$ denoted by $\mathcal{P} = \{\mathcal{X}_1, \dots, \mathcal{X}_M\}$, and complete merging randomizations $r_{\mathcal{X}_i}$, where $i \in \{1, \dots, M\}$. In other words, we partition $\mathcal{X}_\varepsilon$ into subsets $\mathcal{X}_i$, so $\bigcup_{i=1}^{M} \mathcal{X}_i = \mathcal{X}_\varepsilon$, and each subset $\mathcal{X}_i$ is randomized by the corresponding complete merging $r_{\mathcal{X}_i}$ into its own super symbol. Note that for $i \neq j$ we have $\mathcal{X}_i \cap \mathcal{X}_j = \emptyset$. The resulting mutual information is

$$ I(X;Y) = H(X) - \sum_{i=1}^{M} P_X(\mathcal{X}_i)\, H(X \mid X \in \mathcal{X}_i). \tag{7} $$
Then the normalized mutual information loss (NMIL) for a partition $\mathcal{P}$ of $\mathcal{X}_\varepsilon$ is defined as

$$ \mathrm{NMIL}(\mathcal{P}) = \frac{H(X) - I(X;Y)}{H(X)} = \frac{\sum_{i=1}^{M} P_X(\mathcal{X}_i)\, H(X \mid X \in \mathcal{X}_i)}{H(X)}. \tag{8} $$
Since for $M > 1$ the data resolution increases for each subset $\mathcal{X}_i$ compared to the complete merging of $\mathcal{X}_\varepsilon$, this can result in a larger mutual information and hence a lower utility loss. The following definition and derivation formalize this observation.
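Assuming the normalization by $H(X)$ in (8), the utility loss of any partition of $\mathcal{X}_\varepsilon$ can be evaluated from the marginal $P_X$ alone. The sketch below (illustrative marginal and subsets) also shows numerically that a refinement incurs a smaller loss than its aggregation:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def nmil(P_X, partition):
    """Normalized mutual information loss (8) when each subset in `partition` is merged.

    By (7), I(X;Y) = H(X) - sum_i P_X(X_i) H(X | X in X_i), so
    NMIL = (H(X) - I(X;Y)) / H(X).
    """
    loss = 0.0
    for subset in partition:
        p = P_X[list(subset)]
        loss += p.sum() * entropy(p / p.sum())
    return loss / entropy(P_X)

P_X = np.array([0.05, 0.10, 0.20, 0.15, 0.10, 0.25, 0.10, 0.05])  # illustrative marginal
print(round(nmil(P_X, [[2, 5], [6, 7]]), 3))   # refinement: smaller loss
print(round(nmil(P_X, [[2, 5, 6, 7]]), 3))     # aggregation (complete merging): larger loss
```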
Definition 1
Assume two partitions $\mathcal{P} = \{\mathcal{X}_1, \dots, \mathcal{X}_M\}$ and $\mathcal{P}' = \{\mathcal{X}'_1, \dots, \mathcal{X}'_{M'}\}$ of $\mathcal{X}_\varepsilon$. We say that $\mathcal{P}$ is a refinement of $\mathcal{P}'$ and $\mathcal{P}'$ is an aggregation of $\mathcal{P}$ [13] if for every $\mathcal{X}_i \in \mathcal{P}$, where $i \in \{1, \dots, M\}$, there exists $\mathcal{X}'_j \in \mathcal{P}'$ such that $\mathcal{X}_i \subseteq \mathcal{X}'_j$.
If $\mathcal{P}$ is a refinement of $\mathcal{P}'$ (equivalently, $\mathcal{P}'$ is an aggregation of $\mathcal{P}$), then $\mathrm{NMIL}(\mathcal{P}) \le \mathrm{NMIL}(\mathcal{P}')$. This is because, by the grouping property of entropy, for each $\mathcal{X}'_j \in \mathcal{P}'$,

$$ P_X(\mathcal{X}'_j)\, H(X \mid X \in \mathcal{X}'_j) = P_X(\mathcal{X}'_j)\, H\Big(\Big\{\tfrac{P_X(\mathcal{X}_i)}{P_X(\mathcal{X}'_j)}\Big\}_{\mathcal{X}_i \subseteq \mathcal{X}'_j}\Big) + \sum_{\mathcal{X}_i \subseteq \mathcal{X}'_j} P_X(\mathcal{X}_i)\, H(X \mid X \in \mathcal{X}_i) \;\ge\; \sum_{\mathcal{X}_i \subseteq \mathcal{X}'_j} P_X(\mathcal{X}_i)\, H(X \mid X \in \mathcal{X}_i), \tag{9} $$

and summing over $j$ and dividing by $H(X)$ gives the claim.
Generally, finding an optimal partition that maximizes utility while maintaining the privacy constraint is combinatorial, since it depends on the high-risk symbol probabilities $P_X(x)$, $x \in \mathcal{X}_\varepsilon$, and on the joint probability distribution $P_{SX}$. However, we can use $\varepsilon$ as a stopping criterion in a heuristic agglomerative algorithm that obtains a refinement of the complete merging partition $\{\mathcal{X}_\varepsilon\}$ and thereby enhances utility as much as possible.
3.1 A greedy algorithm to refine the high-risk subset
Knowing that aggregation of a partition of $\mathcal{X}_\varepsilon$ reduces the symbol resolution and the data utility, we propose a heuristic greedy algorithm that determines the most refined partition of $\mathcal{X}_\varepsilon$ satisfying the data privacy constraint specified by $\varepsilon$. This is a bottom-up algorithm that bootstraps from the most refined (singleton-element) partition of $\mathcal{X}_\varepsilon$. This starting point provides the highest resolution/utility but results in the lowest data privacy level. We let the subsets in the singleton partition merge with each other to reduce the log-lift measure until the log-lift of every subset falls below $\varepsilon$. To achieve a low-complexity but effective procedure that results in a finer partition of $\mathcal{X}_\varepsilon$, we implement the subset merging in order. A good candidate to begin with is the most challenging symbol, i.e., the one with the highest log-lift leakage.
The pseudo-code of our method is shown in Algorithm 1. To form each subset $\mathcal{X}_i$, we start with the symbol of highest risk remaining in $\mathcal{X}_\varepsilon$ (line 4), and then, to bring the leakage of the subset below $\varepsilon$, in each iteration we merge into it the symbol that minimizes $\omega(\mathcal{X}_i)$ (lines 5-8). Each symbol added to $\mathcal{X}_i$ is removed from $\mathcal{X}_\varepsilon$ (line 7). When a subset is complete, we add it to the partition $\mathcal{P}$ and repeat the same process for the remaining symbols. After all subsets are formed, it is possible that the leakage of the last subset is still greater than $\varepsilon$, i.e., $\omega(\mathcal{X}_M) > \varepsilon$. If this happens, while the leakage exceeds $\varepsilon$ we agglomerate this subset with the subset that minimizes the leakage of the union (lines 11-15). Complete merging is a special output of our algorithm when no finer partition that maintains the privacy can be found.
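Since Algorithm 1 is described above only in words, the following sketch gives one possible rendering of the greedy procedure; the helper function, tie-breaking, and example values are assumptions rather than the authors' exact implementation.

```python
import numpy as np

def subset_leakage(P, subset):
    """omega(X') = max_s |log( P(X'|s) / P(X') )|, cf. (4)-(5)."""
    cols = list(subset)
    P_sub_given_s = P[:, cols].sum(axis=1) / P.sum(axis=1)
    return np.abs(np.log(P_sub_given_s / P[:, cols].sum())).max()

def greedy_subset_merging(P, high_risk, eps):
    """Agglomerative refinement of the high-risk set (a sketch of Algorithm 1)."""
    # Process symbols from highest to lowest individual leakage.
    remaining = sorted(high_risk, key=lambda x: subset_leakage(P, [x]), reverse=True)
    partition = []
    while remaining:
        subset = [remaining.pop(0)]                      # seed: highest-risk symbol left
        while remaining and subset_leakage(P, subset) > eps:
            best = min(remaining, key=lambda x: subset_leakage(P, subset + [x]))
            subset.append(best)                          # grow with the least-leaky addition
            remaining.remove(best)
        partition.append(subset)
    if not partition:                                    # no high-risk symbols at all
        return partition
    # If the last subset still leaks more than eps, agglomerate it with the
    # subset that minimizes the leakage of the union (cf. lines 11-15).
    agg = partition.pop()
    while partition and subset_leakage(P, agg) > eps:
        j = min(range(len(partition)),
                key=lambda k: subset_leakage(P, partition[k] + agg))
        agg = partition.pop(j) + agg
    partition.append(agg)
    return partition

rng = np.random.default_rng(3)
P = rng.random((4, 12))
P /= P.sum()
eps = 0.3
high_risk = [x for x in range(12) if subset_leakage(P, [x]) > eps]
print(greedy_subset_merging(P, high_risk, eps))
```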
4 Experiments
To compare Algorithm 1 with complete merging, we randomly generated 1000 joint distributions $P_{SX}$ with fixed alphabet sizes $|\mathcal{S}|$ and $|\mathcal{X}|$. For each distribution, after randomization we obtained the maximum privacy leakage over the (merged) high-risk subsets, the maximum overall privacy leakage, and the utility loss NMIL under Algorithm 1 and complete merging for different values of $\varepsilon$. Then, for each $\varepsilon$, we derived the mean value and standard deviation of these three quantities across the 1000 observations.
In Fig. 1, the mean value of the maximum privacy leakage of the high-risk subsets is depicted for each $\varepsilon$, along with its standard deviation (shown as tolerance bars). As expected, complete merging provides a strong guarantee on privacy leakage and keeps both the mean and the standard deviation much less than $\varepsilon$ in all cases. In contrast, Algorithm 1 increases the privacy leakage and lets it approach $\varepsilon$ more closely than complete merging, but crucially it still keeps the mean value and the corresponding deviation below $\varepsilon$ in all cases. As the value of $\varepsilon$ increases, the standard deviation also increases. This is because when $\varepsilon$ increases, the privacy constraint is less strict and consequently the size of $\mathcal{X}_\varepsilon$ decreases. As a result, the sample size used to calculate the mean value of the privacy leakage reduces, which causes a larger deviation.
Fig. 2 shows the normalized mutual information loss under Algorithm 1 and complete merging. It demonstrates that Algorithm 1 enhances utility substantially for each value of $\varepsilon$.
Finally, for a clear privacy-utility trade-off (PUT) comparison, we present PUT curves for both Algorithm 1 and complete merging in Fig. 3. For each $\varepsilon$, the mean value of the overall privacy leakage is shown versus the corresponding mean value of NMIL. Clearly, Algorithm 1 enhances utility significantly while satisfying the privacy leakage constraint. For a very strict privacy constraint ($\varepsilon$ close to zero), complete merging results in perfect privacy with total utility loss, i.e., $I(X;Y) = 0$ and $\mathrm{NMIL} = 1$. However, Algorithm 1 keeps the average utility loss noticeably below this total loss.
5 Conclusion
In this paper, we introduced a method to enhance utility in the watchdog privacy mechanism. We showed that it is possible to maintain the privacy constraint while improving utility. In our approach, instead of randomizing the whole high-risk set, we randomized subsets of high-risk symbols separately. We then proposed a heuristic greedy algorithm that finds subsets of high-risk elements while keeping the leakage of each subset below the privacy constraint $\varepsilon$. The simulation results showed substantial utility enhancement and preservation of the privacy constraint.
In future research, it would be interesting to develop more efficient subset merging algorithms. Investigating effective parameters such as subset size, as well as different privacy and utility measures, could be helpful. It would also be beneficial to apply the proposed algorithm to real-world datasets and measure the obtained practical utility.
References
- [1] F. d. P. Calmon and N. Fawaz, “Privacy against statistical inference,” in Proc. Annu. Allerton Conf. Commun., Control, and Comput., Monticello, IL, 2012, pp. 1401–1408.
- [2] H. Hsu, S. Asoodeh, and F. d. P. Calmon, “Information-theoretic privacy watchdogs,” in Proc. IEEE Int. Symp. Inf. Theory, Paris, France, 2019, pp. 552–556.
- [3] A. Makhdoumi, S. Salamatian, N. Fawaz, and M. Médard, “From the information bottleneck to the privacy funnel,” in Proc. IEEE Inf. Theory Workshop, Hobart, TAS, 2014, pp. 501–505.
- [4] R. Sibson, “Information radius,” Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, vol. 14, no. 2, pp. 149–160, 1969.
- [5] S. Verdú, “α-mutual information,” in Proc. Information Theory and Applications Workshop (ITA), San Diego, CA, 2015, pp. 1–6.
- [6] N. Ding, M. A. Zarrabian, and P. Sadeghi, “α-information-theoretic privacy watchdog and optimal privatization scheme,” in Proc. IEEE Int. Symp. Inf. Theory, 2021, pp. 2584–2589.
- [7] J. Liao, O. Kosut, L. Sankar, and F. d. P. Calmon, “Tunable measures for information leakage and applications to privacy-utility tradeoffs,” IEEE Trans. Inf. Theory, vol. 65, no. 12, pp. 8043–8066, 2019.
- [8] J. C. Duchi, M. I. Jordan, and M. J. Wainwright, “Local privacy and statistical minimax rates,” in Proc. IEEE 54th Annu. Symp. Found. Comput. Sci., 2013, pp. 429–438.
- [9] S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith, “What can we learn privately?,” SIAM Journal on Computing, vol. 40, no. 3, pp. 793–826, 2011.
- [10] P. Sadeghi, N. Ding, and T. Rakotoarivelo, “On properties and optimization of information-theoretic privacy watchdog,” in Proc. IEEE Inf. Theory Workshop, 2020.
- [11] P. Kairouz, S. Oh, and P. Viswanath, “Extremal mechanisms for local differential privacy,” in Adv. Neural Inf. Process. Syst., 2014, vol. 4, pp. 2879–2887.
- [12] A. D. Sarwate and L. Sankar, “A rate-distortion perspective on local differential privacy,” in Proc. Annu. Allerton Conf. Commun., Control, and Comput., 2014, pp. 903–908.
- [13] Y. Liu, P. Sadeghi, F. Arbabjolfaei, and Y. H. Kim, “Capacity theorems for distributed index coding,” IEEE Trans. Inf. Theory, vol. 66, no. 8, pp. 4653–4680, 2020.