Local Differential Privacy for Sequential Decision Making in a Changing Environment
Abstract
We study the problem of preserving privacy while still providing high utility in sequential decision making scenarios in a changing environment. We consider an abruptly changing environment: the environment remains constant between change points and changes at unknown time instants. To formulate this problem, we propose a variant of multi-armed bandits called non-stationary stochastic corrupt bandits. We construct an algorithm called SW-KLUCB-CF and prove an upper bound on its utility using the performance measure of regret. The proven regret upper bound for SW-KLUCB-CF is near-optimal in the number of time steps and matches the best known bound for analogous problems in terms of the number of time steps and the number of changes. Moreover, we present a provably optimal mechanism which can guarantee the desired level of local differential privacy while providing high utility.
Introduction
Several practically relevant applications, including recommender systems and Internet advertising, have been formulated as sequential decision making problems using the framework of multi-armed bandits. The importance of privacy in such sequential decision making problems has been extensively discussed in the literature (see, for example, DBLP:conf/nips/ThakurtaS13; DBLP:conf/uai/MishraT15; tossou:aaai2016).
Differential privacy, introduced by Dwork06calibratingnoise, is one of the popular approaches to address such privacy concerns. In sequential decision making problems, algorithms providing differential privacy preserve data privacy by adding appropriate statistical noise to the data. Duchi:2014:PAL:2700084.2666468 extend this notion to local differential privacy in which data remains private even from the algorithm. The main difference between global and local differential privacy is whether privacy is to be maintained from the algorithm or the (possibly unintended) recipient of the output of the algorithm. In global differential privacy, noise is added by the algorithm so the output does not reveal private information about the input. In local differential privacy, noise is added to the input of the algorithm so that privacy is maintained even from the algorithm.
To understand the motivation for local differential privacy, let us consider the practical application of Internet advertising (we consider a simplistic scenario for illustrative purposes). An advertising system receives, as input, feedback from the users which may reveal private information about them. The advertising system employs a suitable learning algorithm and selects ads for the users tailored to the feedback given by them. These selected ads are then given to the advertisers as output. With global differential privacy, privacy is maintained from the advertisers by ensuring that the output of the learning algorithm does not reveal information about the input (i.e., user information). Typically, advertising systems are operated by leading social media networks, web browsers and other popular websites. DBLP:conf/icdm/Korolova10; kosinski2013private show that it is possible to accurately predict a range of highly sensitive personal attributes, including age, sexual orientation, relationship status, and political and religious affiliation, using the feedback available to the advertising systems. Such a possible breach of privacy makes it necessary to protect personal user information not only from the advertisers but also from the advertising systems. Local differential privacy, unlike global differential privacy, achieves this objective.
In this article, we propose to use a low privacy regime via local differential privacy. In the low privacy regime, the noise added to the data is small and the aim of the privacy mechanism is to send as much information about the data as allowed, but no more (NIPS2014_5392). This is in alignment with our dual goal in recommender systems, Internet advertising, and other similar applications: provide useful recommendations/ads to the users while respecting their privacy as much as possible.
We measure the utility of our proposed algorithm using regret, which is a measure of the total mistake cost (precise definitions follow in the next section). When rewards are bounded (as assumed in most works in the literature), the regret of any algorithm is trivially bounded linearly in the number of time steps $T$. An algorithm is said to be learning if its regret is bounded sub-linearly in $T$.
Main Contributions
1. We propose non-stationary stochastic corrupt bandits, a novel formulation which aims to preserve local differential privacy while still providing high utility for sequential decision making in a non-stationary environment.
2. We construct an algorithm called SW-KLUCB-CF for the considered problem.
3. We prove an upper bound on the utility of SW-KLUCB-CF in terms of its regret. This upper bound is near-optimal in terms of the number of time steps and matches the best known bound for analogous problems in terms of the number of time steps and the number of changes.
4. We provide an optimal mechanism to achieve a desired level of local differential privacy while achieving high utility.
This work is an extension of pmlr-v83-gajane18a to non-stationary environments and reuses some of the concepts introduced there. However, it should be noted that the algorithms proposed in pmlr-v83-gajane18a are not able to solve the problem considered in this article. In fact, it is easy to construct non-stationary environments for which the algorithms proposed in pmlr-v83-gajane18a (and all other differentially private algorithms designed for stationary environments) suffer regret linear in the number of time steps $T$. On the other hand, the algorithm proposed in this article can guarantee regret sub-linear in $T$ in such scenarios. Furthermore, due to the changing environment and the use of a sliding window, the regret analysis in our article presents challenges not encountered in stationary settings.
Our extension to non-stationary environments is practically relevant, as the assumption of stationarity is sometimes unrealistic in real-world applications. Such an extension providing local differential privacy in non-stationary environments for the problem of data collection is given by NEURIPS2018_a0161022. Our problem is different from NEURIPS2018_a0161022 as we study learning to make optimal sequential decisions in a non-stationary environment while providing local differential privacy. Note that a naive strategy of restarting an algorithm (designed for a stationary environment) after each change is not possible in the problem considered here, as the time instants at which the changes occur are unknown.
Related Work
In the context of sequential decision-making, global differential privacy has been studied in various settings including stochastic bandits (DBLP:conf/uai/MishraT15; tossou:aaai2016), adversarial bandits (DBLP:conf/nips/ThakurtaS13; tossou:aaai2017a) and collaborative bandits (10.1145/3383313.3412254). In the context of sequential decision-making, local differential privacy has been considered in the stochastic bandit setting (pmlr-v83-gajane18a; pmlr-v151-tao22a), contextual bandits (NEURIPS2020_908c9a56), collaborative bandits (10.1145/3383313.3412254) and Markov decision processes (Chowdhury_Zhou_2022; DBLP:journals/corr/abs-2010-07778). For a comprehensive overview of differential privacy and its application to other problems, see Dwork:2014:AFD:2693052.2693053.
The notion of using a sliding window mechanism (as we do in our proposed algorithm) to deal with a non-stationary environment has been employed in classical bandits (GarivierSW) as well as Markov decision processes (GajaneSW).
Non-Stationary Stochastic Corrupt Bandits
A non-stationary stochastic corrupt bandits problem is formally characterized by a set of $K$ arms, each arm $a$ being associated with an unknown sub-Gaussian reward distribution, an unknown sub-Gaussian feedback distribution, and a known mean-corruption function $g_a$. The total number of time steps (i.e., the horizon) is indicated as $T$. The environment undergoes abrupt changes at unknown time steps called breakpoints, and it remains constant in the intervals between two successive breakpoints; we denote the number of breakpoints up to the horizon by $\Upsilon_T$.
For notational convenience, we assume that the first breakpoint occurs at time step $1$. From breakpoint $i$ until the subsequent breakpoint (or the horizon, in case of the last breakpoint), if the learner pulls an arm $a$ at time $t$, they receive a (hidden) reward drawn from the reward distribution of arm $a$ with mean $\mu_a^{i}$ and observe a feedback drawn from the feedback distribution of arm $a$ with mean $\lambda_a^{i}$. We assume that, for each arm $a$, there exists a loose link between the reward and the feedback through a known corruption function $g_a$ which maps the mean of the reward distribution to the mean of the feedback distribution: $g_a(\mu_a^{i}) = \lambda_a^{i}$ for every arm $a$ and every breakpoint $i$. Our proposed algorithm and the proven regret bound also work if the corruption function for an arm changes across time, as long as the current corruption function is known.
Note that these functions may be completely different from one arm to another. For Bernoulli distributions, the means of the reward distributions and the feedback distributions lie in $[0,1]$ for all arms, and we assume all the corruption functions to be continuous on this interval. We also assume the corruption functions to be strictly monotonic and denote the corresponding inverse functions by $g_a^{-1}$. The assumption of monotonicity is required for efficient learning, as proved in pmlr-v83-gajane18a.
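As a concrete illustration (with hypothetical numbers of our choosing), a strictly increasing linear corruption function and its inverse could be
\[
g_a(\mu) \;=\; 0.8\,\mu + 0.1, \qquad g_a^{-1}(\lambda) \;=\; \frac{\lambda - 0.1}{0.8},
\]
so a reward mean of $0.5$ induces a feedback mean of $g_a(0.5) = 0.5$, while a reward mean of $0.9$ induces a feedback mean of $0.82$.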
Another way to define the link between the reward and the feedback is to provide a corruption scheme operator which maps the rewards into feedback distributions.
Randomized Response
Randomized response (a privacy protection technique introduced by Warner1965) can also be simulated by a Bernoulli corrupt bandits problem, and the corresponding corruption scheme is encoded, for each arm $a$, as:
\[
\mathbf{M}_a \;=\; \begin{pmatrix} p_{00}(a) & 1-p_{11}(a) \\ 1-p_{00}(a) & p_{11}(a) \end{pmatrix}. \tag{1}
\]
Each entry of $\mathbf{M}_a$ denotes the probability of observing a particular feedback for a particular reward, i.e., $p_{00}(a) = \mathbb{P}(\text{feedback} = 0 \mid \text{reward} = 0)$ and $p_{11}(a) = \mathbb{P}(\text{feedback} = 1 \mid \text{reward} = 1)$. The corresponding corruption function is $g_a(\mu) = \big(p_{00}(a) + p_{11}(a) - 1\big)\,\mu + 1 - p_{00}(a)$.
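As an illustration, a minimal simulation of such a randomized-response feedback channel might look as follows (the class and parameter names are ours, not from the article; the segment structure encodes the breakpoints):

```python
import numpy as np

class RandomizedResponseCorruptBandit:
    """Non-stationary Bernoulli corrupt bandit with randomized-response feedback.

    mean_schedule[i][a] is the reward mean of arm a during segment i (segments are
    delimited by the breakpoints), and p00[a], p11[a] are the randomized-response
    parameters of arm a.
    """

    def __init__(self, mean_schedule, breakpoints, p00, p11, rng=None):
        self.mean_schedule = mean_schedule          # list of per-segment reward means
        self.breakpoints = breakpoints              # sorted change times, the first one being 1
        self.p00, self.p11 = np.asarray(p00), np.asarray(p11)
        self.rng = rng if rng is not None else np.random.default_rng()

    def _segment(self, t):
        # Index of the current segment: number of breakpoints that occurred up to time t, minus one.
        return int(np.searchsorted(self.breakpoints, t, side="right")) - 1

    def pull(self, arm, t):
        mu = self.mean_schedule[self._segment(t)][arm]
        reward = self.rng.random() < mu                    # hidden Bernoulli reward
        if reward:
            feedback = self.rng.random() < self.p11[arm]   # a 1 is kept with probability p11
        else:
            feedback = self.rng.random() >= self.p00[arm]  # a 0 is flipped with probability 1 - p00
        return int(reward), int(feedback)                  # the learner only ever observes the feedback
```

For instance, with $p_{00}(a) = p_{11}(a) = 0.9$ for every arm, each arm's corruption function is $g_a(\mu) = 0.8\,\mu + 0.1$, matching the linear form above.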
To measure the utility of an algorithm for this problem, we define the notion of regret in the following. Let us denote the mean reward of arm $a$ at time step $t$ as $\mu_a(t)$. The objective of an algorithm, which chooses the arm $\hat{a}_t$ at time $t$ based only on the previously observed feedback, is to maximize the expected sum of rewards, i.e., to achieve high utility. This is equivalent to minimizing the regret
\[
\mathcal{R}_T \;=\; \mathbb{E}\left[\sum_{t=1}^{T} \big(\mu_{a^*_t}(t) - \mu_{\hat{a}_t}(t)\big)\right],
\quad\text{where}\quad a^*_t = \operatorname*{argmax}_{a} \mu_a(t).
\]
Regret measures the performance of the algorithm against an omniscient policy that at each time step chooses the arm with the maximal mean reward. Thus, low regret translates to achieving high utility.
The Proposed Algorithm
To solve the problem at hand, we propose SW-KLUCB-CF, an adaptation of the kl-UCB algorithm of KLUCBJournal. The algorithm takes as input: the window size $\tau$, a non-decreasing function $f$, the horizon $T$ and the corruption functions $g_1, \dots, g_K$. We assume that the horizon $T$ is known; an unknown $T$ can be handled using the doubling trick (besson:hal-01736357). We use $\mathrm{kl}(x, y)$ to denote the Kullback-Leibler divergence between two Bernoulli distributions with means $x$ and $y$.
At each time step $t$, the algorithm computes for each arm $a$ an index $\mathrm{Index}_a(t)$, which is an upper-confidence bound on the mean reward $\mu_a(t)$ built from a confidence interval on the mean feedback, based on the KL-divergence. The quantity $N_a(\tau, t)$ denotes the number of times arm $a$ was chosen in the last $\tau$ time steps until time $t$. Correspondingly, $\hat{\lambda}_a(\tau, t)$ denotes the empirical mean of the feedback observed from arm $a$ in the last $\tau$ time steps until time $t$:
\[
\hat{\lambda}_a(\tau, t) \;=\; \frac{1}{N_a(\tau, t)} \sum_{s = \max(1,\, t-\tau+1)}^{t} F_s\, \mathbb{1}\{\hat{a}_s = a\},
\]
where $F_s$ is the feedback observed at time step $s$ and $\hat{a}_s$ is the arm pulled at time step $s$.
Theorem 1 gives an upper bound on the regret of SW-KLUCB-CF. A more explicit bound is proved in the Appendix.
Theorem 1
The regret of SW-KLUCB-CF, using the window size $\tau = 2\sqrt{T\log(T)/\Upsilon_T}$ and a suitable non-decreasing exploration function $f$, on a Bernoulli non-stationary stochastic corrupt bandits problem with strictly monotonic and continuous corruption functions is, at time $T$, upper-bounded by
\[
\tilde{O}\!\left(\sqrt{\Upsilon_T\, T}\ \sum_{a=1}^{K}\ \max_{i\,:\,\mu_a^{i} \neq \mu^{*i}}\ \frac{\mu^{*i}-\mu_a^{i}}{\mathrm{kl}\!\left(g_a(\mu_a^{i}),\, g_a(\mu^{*i})\right)}\right),
\]
where $\tilde{O}(\cdot)$ ignores logarithmic factors and constants, and $a^{*i}$ and $\mu^{*i}$ are the optimal arm and the corresponding optimal mean reward, respectively, after change $i$ and before the subsequent change (with $\mu_a^{i}$ the mean reward of arm $a$ in the same interval).
The lower bound on regret in terms of $T$ for classical non-stationary stochastic bandits is $\Omega(\sqrt{T})$ (GarivierSW). Theorem 1 matches this lower bound up to logarithmic factors, so SW-KLUCB-CF has near-optimal regret guarantees in terms of the time horizon $T$. The best known regret upper bounds for classical non-stationary stochastic bandits (e.g., pmlr-v99-auer19a) also feature logarithmic terms beyond the lower bound, hence our regret bound is in line with the best known results for analogous problems. Moreover, the bound in Theorem 1 also matches the best known regret bound in terms of $T$ and $\Upsilon_T$ for classical non-stationary stochastic bandits, which is $\tilde{O}\big(\sqrt{\Upsilon_T\, T}\big)$.
Input: Window size $\tau$, a non-decreasing function $f$, horizon $T$, monotonic and continuous corruption functions $g_1, \dots, g_K$ and their inverses $g_1^{-1}, \dots, g_K^{-1}$.
1. Initialization: Pull each arm once.
2. for time $t = K+1, \dots, T$ do
   (a) Compute for each arm $a$ the quantity
   \[
   \mathrm{Index}_a(t) \;=\; \max\big\{q \in [0,1] \,:\, N_a(\tau, t)\,\mathrm{kl}\big(\hat{\lambda}_a(\tau, t),\, g_a(q)\big) \le f(t)\big\}.
   \]
   (b) Pull arm $\hat{a}_t = \operatorname*{argmax}_a \mathrm{Index}_a(t)$ and observe the feedback $F_t$.
   end for
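As a minimal sketch of how the index computation could be implemented (assuming Bernoulli feedback and a simple bisection search for the KL confidence bounds; the function names below are ours):

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence kl(p, q) between two Bernoulli distributions with means p and q."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def kl_confidence_bound(lam_hat, n, f_t, upper=True, tol=1e-6):
    """Largest (upper=True) or smallest (upper=False) q in [0, 1] with n * kl(lam_hat, q) <= f_t."""
    if n == 0:
        return 1.0 if upper else 0.0
    lo, hi = (lam_hat, 1.0) if upper else (0.0, lam_hat)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if n * bernoulli_kl(lam_hat, mid) <= f_t:
            # mid is still inside the confidence region: move towards the boundary
            lo, hi = (mid, hi) if upper else (lo, mid)
        else:
            lo, hi = (lo, mid) if upper else (mid, hi)
    return (lo + hi) / 2.0

def sw_klucb_cf_index(lam_hat, n, f_t, g_inverse, g_increasing):
    """Index of one arm: a KL confidence bound on the (sliding-window) feedback mean,
    mapped back to the reward scale through the inverse corruption function."""
    bound = kl_confidence_bound(lam_hat, n, f_t, upper=g_increasing)
    return g_inverse(bound)
```

For an increasing corruption function the upper confidence bound on the feedback mean is mapped back through $g_a^{-1}$, while for a decreasing one the lower confidence bound is mapped back, so that the result is in both cases an upper confidence bound on the reward mean.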
We can use SW-KLUCB-CF on non-stationary stochastic corrupt bandits where the corruption is done via randomized response. The following corollary bounds the resulting regret.
Corollary 1
The regret of SW-KLUCB-CF on a Bernoulli non-stationary stochastic corrupt bandits problem with randomized response using corruption matrices $\{\mathbf{M}_a\}_{a=1}^{K}$ is, at time $T$, upper-bounded by
\[
\tilde{O}\!\left(\sqrt{\Upsilon_T\, T}\ \sum_{a=1}^{K}\ \max_{i\,:\,\mu_a^{i} \neq \mu^{*i}}\ \frac{1}{\big(p_{00}(a)+p_{11}(a)-1\big)^{2}\,\big(\mu^{*i}-\mu_a^{i}\big)}\right).
\]
This corollary follows from Theorem 1 and Pinsker's inequality: $\mathrm{kl}(p, q) \ge 2(p-q)^2$. The term $p_{00}(a)+p_{11}(a)-1$ can be understood as the slope of the corruption function $g_a$.
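To make this step explicit (a short derivation using the linear form of the randomized-response corruption function given above):
\[
\mathrm{kl}\!\big(g_a(\mu_a^{i}),\, g_a(\mu^{*i})\big) \;\geq\; 2\big(g_a(\mu_a^{i})-g_a(\mu^{*i})\big)^{2} \;=\; 2\big(p_{00}(a)+p_{11}(a)-1\big)^{2}\big(\mu^{*i}-\mu_a^{i}\big)^{2},
\]
so each KL term appearing in Theorem 1 is lower-bounded by the squared slope times the squared gap, which produces the factor $\frac{1}{(p_{00}(a)+p_{11}(a)-1)^{2}}$ in Corollary 1.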
Corruption Mechanism to Preserve Local Privacy in Non-Stationary Environment
First, let us formally define local differential privacy.
Definition 1
(Locally differentially private mechanism) Any randomized mechanism $\mathcal{M}$ is $\epsilon$-locally differentially private for $\epsilon \geq 0$, if for all inputs $d_{1}, d_{2}$ and for all $\mathcal{S} \subseteq \mathrm{Range}(\mathcal{M})$,
\[
\mathbb{P}\big[\mathcal{M}(d_{1}) \in \mathcal{S}\big] \;\leq\; e^{\epsilon}\;\mathbb{P}\big[\mathcal{M}(d_{2}) \in \mathcal{S}\big].
\]
As done in pmlr-v83-gajane18a, a straightforward approach to achieve local differential privacy using corrupt bandits is to employ a corruption scheme on the user feedback. This is similar to how randomized response is used in data collection by DBLP:conf/edbt/0009WH16.
Definition 2
($\epsilon$-locally differentially private bandit feedback corruption scheme) A bandit feedback corruption scheme $\mathcal{C}$ is $\epsilon$-locally differentially private for $\epsilon \geq 0$, if for all reward sequences $r_{t_1}, \dots, r_{t_2}$ and $r'_{t_1}, \dots, r'_{t_2}$, and for all $\mathcal{S} \subseteq \mathrm{Range}(\mathcal{C})$,
\[
\mathbb{P}\big[\mathcal{C}(r_{t_1}, \dots, r_{t_2}) \in \mathcal{S}\big] \;\leq\; e^{\epsilon}\;\mathbb{P}\big[\mathcal{C}(r'_{t_1}, \dots, r'_{t_2}) \in \mathcal{S}\big].
\]
When corruption is done by randomized response, local differential privacy requires that, for all arms $a$,
\[
\max\!\left(\frac{p_{00}(a)}{1-p_{11}(a)},\ \frac{1-p_{11}(a)}{p_{00}(a)},\ \frac{p_{11}(a)}{1-p_{00}(a)},\ \frac{1-p_{00}(a)}{p_{11}(a)}\right) \;\le\; e^{\epsilon}.
\]
From Corollary 1, we can see that to achieve lower regret, $p_{00}(a)+p_{11}(a)$ is to be maximized for all arms $a$. Using DBLP:conf/edbt/0009WH16, we can state that, in order to achieve $\epsilon$-local differential privacy while maximizing $p_{00}(a)+p_{11}(a)$,
\[
p_{00}(a) \;=\; p_{11}(a) \;=\; \frac{e^{\epsilon}}{1+e^{\epsilon}} \qquad \text{for all arms } a. \tag{2}
\]
As it turns out, this is equivalent to the staircase mechanism for local privacy, which is the optimal local differential privacy mechanism in the low privacy regime (JMLR:v17:15-135, Theorem 14). The trade-off between utility and privacy is controlled by $\epsilon$.
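A quick sanity check of why these parameters saturate the privacy constraint (a short calculation under the randomized-response form above):
\[
\frac{p_{00}(a)}{1-p_{11}(a)} \;=\; \frac{e^{\epsilon}/(1+e^{\epsilon})}{1/(1+e^{\epsilon})} \;=\; e^{\epsilon},
\qquad
\frac{1-p_{00}(a)}{p_{11}(a)} \;=\; e^{-\epsilon} \;\le\; e^{\epsilon},
\]
so the binding ratios equal $e^{\epsilon}$ exactly, and the slope $p_{00}(a)+p_{11}(a)-1 = \frac{e^{\epsilon}-1}{e^{\epsilon}+1}$ is as large as the privacy constraint allows.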
Using the corruption parameters from Eq. (2) with Corollary 1, we arrive at the following upper bound.
Corollary 2
At time $T$, the regret of SW-KLUCB-CF with the $\epsilon$-locally differentially private bandit feedback corruption scheme given by Eq. (2) is
\[
\tilde{O}\!\left(\left(\frac{e^{\epsilon}+1}{e^{\epsilon}-1}\right)^{2}\sqrt{\Upsilon_T\, T}\ \sum_{a=1}^{K}\ \max_{i\,:\,\mu_a^{i} \neq \mu^{*i}}\ \frac{1}{\mu^{*i}-\mu_a^{i}}\right).
\]
The term $\left(\frac{e^{\epsilon}+1}{e^{\epsilon}-1}\right)^{2}$ in the above expression conveys the relationship of the regret with the level of local differential privacy symbolized by $\epsilon$. For low values of $\epsilon$, $\left(\frac{e^{\epsilon}+1}{e^{\epsilon}-1}\right)^{2} \approx \frac{4}{\epsilon^{2}}$, so stronger privacy guarantees (smaller $\epsilon$) come at the cost of higher regret. This is in line with other bandit algorithms providing differential privacy (e.g., DBLP:conf/uai/MishraT15).
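As a quick numerical illustration of this trade-off (a sketch; the helper names are ours):

```python
import math

def ldp_randomized_response_params(epsilon):
    """Corruption parameters p00 = p11 achieving epsilon-LDP while maximizing p00 + p11."""
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return p, p

def regret_privacy_factor(epsilon):
    """Multiplicative factor ((e^eps + 1) / (e^eps - 1))^2 appearing in Corollary 2."""
    return ((math.exp(epsilon) + 1.0) / (math.exp(epsilon) - 1.0)) ** 2

for eps in (0.1, 0.5, 1.0, 2.0):
    p00, p11 = ldp_randomized_response_params(eps)
    print(f"eps={eps}: p00 = p11 = {p00:.3f}, regret factor ~ {regret_privacy_factor(eps):.1f}")
```

For $\epsilon = 0.1$ the feedback is transmitted unchanged only about 52% of the time and the factor is roughly $400 \approx 4/\epsilon^{2}$, whereas for $\epsilon = 2$ the feedback is kept about 88% of the time and the factor drops below $2$.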
Elements of Mathematical Analysis
Here, we provide a proof outline for Theorem 1. Please refer to the Appendix for the complete proof.
We start by bounding the expected number of times a suboptimal arm (i.e., an arm other than the optimal arm at the time of selection) is pulled by the algorithm till horizon $T$. Recall that, at any time step $t$, SW-KLUCB-CF pulls an arm maximizing an index defined as
\[
\mathrm{Index}_a(t) \;=\; \max\big\{q \in [0,1] \,:\, N_a(\tau, t)\,\mathrm{kl}\big(\hat{\lambda}_a(\tau, t),\, g_a(q)\big) \le f(t)\big\}.
\]
We further decompose the computation of the index as follows,
\[
\mathrm{Index}_a(t) \;=\; \begin{cases} g_a^{-1}\big(u_a(t)\big) & \text{if } g_a \text{ is increasing},\\ g_a^{-1}\big(l_a(t)\big) & \text{if } g_a \text{ is decreasing}, \end{cases}
\]
where
\[
u_a(t) \;=\; \max\big\{q \in [0,1] \,:\, N_a(\tau, t)\,\mathrm{kl}\big(\hat{\lambda}_a(\tau, t),\, q\big) \le f(t)\big\}
\quad\text{and}\quad
l_a(t) \;=\; \min\big\{q \in [0,1] \,:\, N_a(\tau, t)\,\mathrm{kl}\big(\hat{\lambda}_a(\tau, t),\, q\big) \le f(t)\big\}.
\]
The interval $[l_a(t), u_a(t)]$ is a KL-based confidence interval on the mean feedback of arm $a$ at time $t$. This is in contrast to kl-UCB (KLUCBJournal), where the confidence interval is placed on the mean reward. Furthermore, this differs from kl-UCB-CF (pmlr-v83-gajane18a), where the mean feedback of an arm remains the same for all time steps and the index does not involve a sliding window.
In our analysis, we use the fact that when a suboptimal arm $a$ is picked at time $t$ by SW-KLUCB-CF, one of the following is true. Either the mean feedback of the optimal arm $a^*_t$ with mean reward $\mu_{a^*_t}(t)$ is outside its confidence interval (i.e., $u_{a^*_t}(t) < g_{a^*_t}\big(\mu_{a^*_t}(t)\big)$ or $l_{a^*_t}(t) > g_{a^*_t}\big(\mu_{a^*_t}(t)\big)$), which is unlikely. Or, the mean feedback of the optimal arm is where it should be, and then the fact that arm $a$ is selected indicates that the confidence interval on the mean feedback of arm $a$ cannot be too small, as either $u_a(t) \ge g_a\big(\mu_{a^*_t}(t)\big)$ or $l_a(t) \le g_a\big(\mu_{a^*_t}(t)\big)$. The previous statement follows from considering various cases depending on whether the corruption functions $g_a$ and $g_{a^*_t}$ are increasing or decreasing. We then need to control the two terms in the decomposition of the expected number of draws of arm $a$. The term regarding the "unlikely" event is bounded using the same technique as in the kl-UCB analysis, however with some added challenges due to the use of a sliding window. In particular, the analysis of a typical upper confidence bound algorithm for bandits relies on the fact that the confidence interval for any arm is always non-increasing; however, this is not true while using a sliding window. To control the second term, depending on the monotonicity of the corruption functions $g_a$ and $g_{a^*_t}$, we need to meticulously adapt the arguments in KLUCBJournal to control the number of draws of a suboptimal arm, as can be seen in the Appendix.
Concluding Remarks
In this work, we proposed the setting of non-stationary stochastic corrupt bandits for preserving privacy while still maintaining high utility in sequential decision making in a changing environment. We devised an algorithm called SW-KLUCB-CF and proved its regret upper bound which is near-optimal in the number of time steps and matches the best known bound for analogous problems in terms of the number of time steps and the number of changes. Moreover, we provided an optimal corruption scheme to be used with our algorithm in order to attain the dual goal of achieving high utility while maintaining the desired level of privacy.
Interesting directions for future work include:
1. Complete an empirical evaluation of the proposed algorithm on simulated as well as real-life data.
2. Characterize the changes in the environment by a variation budget (as done in NIPS2014_903ce922 for classical bandits) instead of the number of changes.
3. Incorporate contextual information in the learning process.
4. Propose a Bayesian algorithm for non-stationary stochastic corrupt bandits.
5. Propose a (near-)optimal differentially private algorithm which does not need to know the number of changes.
Appendix A Proof of Theorem 1
Proof. The proof follows along the lines of the proof for Theorem 2 from pmlr-v83-gajane18a.
The index used by SW-KLUCB-CF is defined by
\[
\mathrm{Index}_a(t) \;=\; \max\big\{q \in [0,1] \,:\, N_a(\tau, t)\,\mathrm{kl}\big(\hat{\lambda}_a(\tau, t),\, g_a(q)\big) \le f(t)\big\}.
\]
For the purpose of this proof, we further decompose the computation of the index as follows,
\[
\mathrm{Index}_a(t) \;=\; \begin{cases} g_a^{-1}\big(u_a(t)\big) & \text{if } g_a \text{ is increasing},\\ g_a^{-1}\big(l_a(t)\big) & \text{if } g_a \text{ is decreasing}, \end{cases}
\]
where
\[
u_a(t) \;=\; \max\big\{q \in [0,1] \,:\, N_a(\tau, t)\,\mathrm{kl}\big(\hat{\lambda}_a(\tau, t),\, q\big) \le f(t)\big\}
\quad\text{and}\quad
l_a(t) \;=\; \min\big\{q \in [0,1] \,:\, N_a(\tau, t)\,\mathrm{kl}\big(\hat{\lambda}_a(\tau, t),\, q\big) \le f(t)\big\}.
\]
Note that the optimal arm at time $t$ is denoted as $a^*_t$ and $\mu^*(t) = \mu_{a^*_t}(t)$ is the corresponding optimal mean reward. Along the same lines, let $\lambda^*(t) = g_{a^*_t}\big(\mu^*(t)\big)$ denote the mean feedback of the optimal arm and $\hat{\lambda}^*(\tau, t) = \hat{\lambda}_{a^*_t}(\tau, t)$ its empirical counterpart.
Let $N_a(T)$ be the number of times arm $a$ has been pulled till time $T$. To get an upper bound on the regret of our algorithm, we first bound $\mathbb{E}[N_a(T)]$ for all the non-optimal arms (i.e., arms $a$ with $\mu_a(t) < \mu^*(t)$ at the time $t$ of selection). Recall that $\mu_a(t)$ is the mean reward of arm $a$ at time step $t$. Let us define $\mathcal{T}(\tau)$ as the set of time steps $t$ such that $\mu_a(s) = \mu_a(t)$ for all $t - \tau < s \le t$ and all arms $a$. That is to say, $\mathcal{T}(\tau)$ is the set of all time steps for which there was no change in the previous $\tau$ time steps. Recall that $\hat{a}_t$ is the arm chosen by the algorithm at time step $t$. Then, the expected number of pulls of arm $a$ decomposes into the time steps outside $\mathcal{T}(\tau)$, of which there are at most $\Upsilon_T\,\tau$, and the time steps in $\mathcal{T}(\tau)$ at which arm $a$ is chosen.
Depending upon whether $g_a$ and $g_{a^*_t}$ are increasing or decreasing, there are four possible sub-cases:
•
Both $g_a$ and $g_{a^*_t}$ are increasing. (3)
•
$g_a$ is decreasing and $g_{a^*_t}$ is increasing. (4)
•
$g_a$ is increasing and $g_{a^*_t}$ is decreasing. (5)
•
Both $g_a$ and $g_{a^*_t}$ are decreasing. (6)
We first upper bound the two sums
(7)
using that $l_{a^*_t}(t)$ and $u_{a^*_t}(t)$ are respectively lower and upper confidence bounds on $\lambda^*(t)$. One has
(8)
where the upper bound follows from Lemma 2 in KLUCBJournal, and the fact that $\hat{\lambda}^*(\tau, t)$ is the empirical mean of Bernoulli samples with mean $\lambda^*(t)$. Similarly, one has
(9)
Then, using Eq. (8) and Eq. (9), the two quantities in Eq. (7) can be upper bounded, and this proves that
(10)
(11)
We now turn our attention to the other two sums involved in the upper bound we gave for $\mathbb{E}[N_a(T)]$. Let the unknown time step at which change $i$ occurs be denoted as $t_i$. For notational convenience, we assume that the first change occurs at $t_1 = 1$ and that change $\Upsilon_T + 1$ takes place at $t_{\Upsilon_T + 1} = T + 1$, where $T$ is the horizon. When $g_a$ is increasing, one can write a chain of inequalities whose penultimate step follows from the fact that the former event is subsumed by the latter, and one obtains
(12)
Using similar arguments, one can show that when $g_a$ is decreasing,
(13)
Recall that $\mu_a^{i}$ is the mean reward of arm $a$ after change $i$ and before the subsequent change. Correspondingly, let $\lambda_a^{i} = g_a(\mu_a^{i})$ be the mean feedback of arm $a$ after change $i$ and before the subsequent change. Furthermore, let $\mu^{*i}$ be the optimal mean reward after change $i$ and before the subsequent change.
Using Appendix A.2 of KLUCBJournal, the quantity on the right-hand side of (12) can be upper-bounded by
(14)
For (13), note that $1 - \hat{\lambda}_a(\tau, t)$ is the empirical mean of observations of a Bernoulli random variable with mean $1 - \lambda_a^{i}$. Hence, the analysis of KLUCBJournal can be applied, and using that $\mathrm{kl}(1-p,\, 1-q) = \mathrm{kl}(p, q)$, the right-hand side of (13) can also be upper-bounded by (14).
Combining inequalities (10), (11) and (12), (13), (14) with the initial decomposition of $\mathbb{E}[N_a(T)]$, and substituting back, yields in all cases
(15)
Minimizing the leading terms on the right-hand side of Eq. (15), by taking the first derivative with respect to $\tau$ and equating it to $0$, leads to an equation in $\tau$. The quantity inside the logarithm of this equation must be positive, so $\tau$ can be rewritten in a parametrized form with a positive constant; the resulting equation has no solution in terms of elementary functions, although it can be expressed in terms of the Lambert W function (Corless1996). Opting for an elementary expression for $\tau$, we can choose $\tau = 2\sqrt{T\log(T)/\Upsilon_T}$, which leads to the following bound,
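For intuition, here is a schematic version of this trade-off (a sketch under the assumption, made here for illustration, that the leading terms of (15) consist of a change term growing with $\tau$ and an exploration term decaying with $\tau$):
\[
c_{1}\,\Upsilon_T\,\tau \;+\; c_{2}\,\frac{T}{\tau}\,\log \tau,
\qquad\text{with stationary point given by}\qquad
c_{1}\,\Upsilon_T \;=\; c_{2}\,\frac{T\,(\log\tau - 1)}{\tau^{2}},
\]
whose exact solution involves the Lambert W function, while the elementary choice $\tau = 2\sqrt{T\log(T)/\Upsilon_T}$ makes both terms of order $\sqrt{\Upsilon_T\, T\,\log T}$ up to constants.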
Since the rewards are bounded in $[0, 1]$ for Bernoulli non-stationary stochastic bandits, the regret is upper-bounded by the sum, over the suboptimal arms, of the expected numbers of pulls bounded above.
Assuming that $\Upsilon_T = O(T^{\beta})$ for some $\beta \in [0, 1)$, the expected regret is upper-bounded as $\tilde{O}\big(T^{(1+\beta)/2}\big)$. In particular, if $\beta = 0$, i.e., the number of breakpoints is upper-bounded by a constant independently of $T$, then with $\tau = 2\sqrt{T\log(T)/\Upsilon_T}$, the upper bound is $\tilde{O}\big(\sqrt{T}\big)$.