
Local Differential Privacy for Sequential Decision Making in a Changing Environment

Pratik Gajane
Abstract

We study the problem of preserving privacy while still providing high utility in sequential decision making scenarios in a changing environment. We consider an abruptly changing environment: the environment remains constant within periods and changes at unknown time instants. To formulate this problem, we propose a variant of multi-armed bandits called non-stationary stochastic corrupt bandits. We construct an algorithm called SW-KLUCB-CF and prove an upper bound on its utility using the performance measure of regret. The proven regret upper bound for SW-KLUCB-CF is near-optimal in the number of time steps and matches the best known bound for analogous problems in terms of the number of time steps and the number of changes. Moreover, we present a provably optimal mechanism which can guarantee the desired level of local differential privacy while providing high utility.

Introduction

Several practically relevant applications, including recommender systems and Internet advertising, have been formulated as sequential decision making problems using the framework of multi-armed bandits. The importance of privacy in such sequential decision making problems has been extensively discussed in the literature (see, for example, DBLP:conf/nips/ThakurtaS13; DBLP:conf/uai/MishraT15; tossou:aaai2016).

Differential privacy, introduced by Dwork06calibratingnoise, is one of the popular approaches to address such privacy concerns. In sequential decision making problems, algorithms providing differential privacy preserve data privacy by adding appropriate statistical noise to the data. Duchi:2014:PAL:2700084.2666468 extend this notion to local differential privacy in which data remains private even from the algorithm. The main difference between global and local differential privacy is whether privacy is to be maintained from the algorithm or the (possibly unintended) recipient of the output of the algorithm. In global differential privacy, noise is added by the algorithm so the output does not reveal private information about the input. In local differential privacy, noise is added to the input of the algorithm so that privacy is maintained even from the algorithm.

To understand the motivation for local differential privacy, let us consider the practical application of Internet advertising (we consider a simplistic scenario for illustrative purposes). An advertising system receives, as input, feedback from the users which may reveal private information about them. The advertising system employs a suitable learning algorithm and selects ads for the users tailored to the feedback given by them. These selected ads are then given to the advertisers as output. While using global differential privacy, privacy is maintained from the advertisers by ensuring that the output of the learning algorithm does not reveal information about the input (i.e., user information). Typically, advertising systems are established by leading social media networks, web browsers and other popular websites. DBLP:conf/icdm/Korolova10; kosinski2013private show that it is possible to accurately predict a range of highly sensitive personal attributes including age, sexual orientation, relationship status, political and religious affiliation using the feedback available to the advertising systems. Such a possible breach of privacy makes it necessary to protect personal user information not only from the advertisers but also from the advertising systems. Local differential privacy is able to achieve this objective, unlike global differential privacy.

In this article, we propose to use a low privacy regime with local differential privacy. In the low privacy regime, the noise added to the data is small and the aim of the privacy mechanism is to send as much information about the data as allowed, but no more (NIPS2014_5392). This is in alignment with our dual goal in recommendation systems, Internet advertising, and other similar applications: provide useful recommendations/ads to the users while respecting their privacy as much as possible.

We measure the utility of our proposed algorithm using regret, which is a measure of the total mistake cost (precise definitions follow in the next section). When rewards are bounded (as assumed in most works in the literature), the regret of any algorithm is trivially bounded linearly in the number of time steps $T$. An algorithm is said to be learning if its regret is bounded sub-linearly in $T$.

Main Contributions

  1. We propose non-stationary stochastic corrupt bandits, a novel formulation which aims to preserve local differential privacy while still providing high utility for sequential decision making in a non-stationary environment.

  2. We construct an algorithm called SW-KLUCB-CF for the considered problem.

  3. We prove an upper bound on the utility of SW-KLUCB-CF in terms of its regret. This upper bound is near-optimal in terms of the number of time steps and matches the best known bound for analogous problems in terms of the number of time steps and the number of changes.

  4. We provide an optimal mechanism to achieve a desired level of local differential privacy while achieving high utility.

This work is an extension of pmlr-v83-gajane18a to non-stationary environments and reuses some of the concepts used there. However, it should be noted that the algorithms proposed in pmlr-v83-gajane18a are not able to solve the problem considered in this article. In fact, it is easy to construct non-stationary environments in which the algorithms proposed in pmlr-v83-gajane18a (and all other differentially private algorithms designed for stationary environments) suffer regret linear in the number of time steps $T$. On the other hand, the algorithm proposed in this article can guarantee regret sub-linear in $T$ in such scenarios. Furthermore, due to the changing environment and the use of a sliding window, the regret analysis in our article presents challenges not encountered in stationary settings.

Our extension to non-stationary environments is practically relevant as the assumption of stationarity is sometimes unrealistic in real-world applications. An extension providing local differential privacy in non-stationary environments for the problem of data collection is given by NEURIPS2018_a0161022. Our problem is different from that of NEURIPS2018_a0161022 as we study learning to make optimal sequential decisions in a non-stationary environment while providing local differential privacy. Note that a naive strategy of restarting an algorithm (designed for a stationary environment) after each change is not possible in the problem considered here, as the time instants at which the changes occur are unknown.

Related Work

In the context of sequential decision-making, global differential privacy has been studied in various settings including stochastic bandits (DBLP:conf/uai/MishraT15; tossou:aaai2016), adversarial bandits (DBLP:conf/nips/ThakurtaS13; tossou:aaai2017a) and collaborative bandits (10.1145/3383313.3412254). Local differential privacy has been considered in the stochastic bandit setting (pmlr-v83-gajane18a; pmlr-v151-tao22a), contextual bandits (NEURIPS2020_908c9a56), collaborative bandits (10.1145/3383313.3412254) and Markov decision processes (Chowdhury_Zhou_2022; DBLP:journals/corr/abs-2010-07778). For a comprehensive overview of differential privacy and its application to other problems, see Dwork:2014:AFD:2693052.2693053.

The notion of using a sliding window mechanism (as we do in our proposed algorithm) to deal with a non-stationary environment has been employed in classical bandits (GarivierSW) as well as Markov decision processes (GajaneSW).

Non-Stationary Stochastic Corrupt Bandits

A non-stationary stochastic corrupt bandits problem is formally characterized by a set of arms $A=\{1,\dots,K\}$ on which are indexed a list of unknown sub-Gaussian reward distributions $\{\boldsymbol{\nu}_{a}(1)\}_{a\in A},\dots,\{\boldsymbol{\nu}_{a}(L_{T})\}_{a\in A}$, a list of unknown sub-Gaussian feedback distributions $\{\boldsymbol{\varsigma}_{a}(1)\}_{a\in A},\dots,\{\boldsymbol{\varsigma}_{a}(L_{T})\}_{a\in A}$, and a list of known mean-corruption functions $\{g_{a}\}_{a\in A}$. Here, the total number of time steps (i.e., the horizon) is denoted by $T$. The environment undergoes $L_{T}$ abrupt changes at unknown time steps called breakpoints, and it remains constant in the intervals between two successive breakpoints.

For notational convenience, we assume that the first breakpoint occurs at $t=1$. From the $i^{\text{th}}$ breakpoint until the subsequent breakpoint (or the horizon, in the case of the last breakpoint), if the learner pulls an arm $a\in A$ at time $t$, they receive a (hidden) reward $R_{t}$ drawn from the distribution $\boldsymbol{\nu}_{a}(i)$ with mean $\mu_{a}(i)$ and observe a feedback $F_{t}$ drawn from the distribution $\boldsymbol{\varsigma}_{a}(i)$ with mean $\lambda_{a}(i)$. We assume that, for each arm, there exists a loose link between the reward and the feedback through a known corruption function $g_{a}$ which maps the mean of the reward distribution to the mean of the feedback distribution: $g_{a}(\mu_{a}(i))=\lambda_{a}(i)$ for all $a\in A$ and $1\leq i\leq L_{T}$. Our proposed algorithm and the proven regret bound also work if the corruption function for an arm changes across time, as long as the current corruption function is known.

Note that these $g_{a}$ functions may be completely different from one arm to another. For Bernoulli distributions, the reward distributions and the feedback distributions are in $[0,1]$ for all $a\in A$, and we assume all the corruption functions $\{g_{a}\}_{a\in A}$ to be continuous on this interval. We also assume the corruption functions $\{g_{a}\}_{a\in A}$ to be strictly monotonic and denote the corresponding inverse functions by $g_{a}^{-1}$. The assumption of monotonicity is required for efficient learning, as proved in pmlr-v83-gajane18a.
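To make the setting concrete, below is a minimal simulation sketch of a Bernoulli non-stationary corrupt bandit environment. The number of arms, segment means, breakpoints and corruption functions are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

# Minimal sketch of a Bernoulli non-stationary stochastic corrupt bandit.
# Arm means, breakpoints and corruption functions are illustrative assumptions.
class NonStationaryCorruptBandit:
    def __init__(self, means_per_segment, breakpoints, corruption_fns, seed=0):
        # means_per_segment[i][a]: mean reward mu_a(i) of arm a in segment i
        # breakpoints: time steps (1-indexed) at which each segment starts; first is 1
        # corruption_fns[a]: known g_a mapping the mean reward to the mean feedback
        self.means = means_per_segment
        self.breakpoints = breakpoints
        self.g = corruption_fns
        self.rng = np.random.default_rng(seed)

    def segment(self, t):
        # Index of the segment containing time step t (last breakpoint <= t).
        return max(i for i, b in enumerate(self.breakpoints) if b <= t)

    def pull(self, a, t):
        # Return a (hidden) Bernoulli reward and an observed Bernoulli feedback.
        mu = self.means[self.segment(t)][a]
        reward = self.rng.binomial(1, mu)               # drawn from nu_a(i), hidden
        feedback = self.rng.binomial(1, self.g[a](mu))  # mean lambda_a(i) = g_a(mu_a(i))
        return reward, feedback

# Two arms, one change at t = 500; identity corruption for arm 0 and
# a decreasing corruption g_1(x) = 1 - x for arm 1.
env = NonStationaryCorruptBandit(
    means_per_segment=[[0.7, 0.4], [0.3, 0.6]],
    breakpoints=[1, 500],
    corruption_fns=[lambda x: x, lambda x: 1.0 - x],
)
print(env.pull(0, t=10), env.pull(1, t=700))
```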

Another way to define the link between the reward and the feedback is to provide a corruption scheme operator $\tilde{g}_{a}$ which maps the rewards into feedback distributions.

Randomized Response

Randomized response, a privacy protection technique introduced by Warner1965, can also be simulated by a Bernoulli corrupt bandits problem; the corresponding corruption scheme $\tilde{g}_{a}$ is encoded as:

$$\mathds{M}_{a}\coloneqq
\begin{bmatrix}
p_{00}(a) & 1-p_{11}(a)\\
1-p_{00}(a) & p_{11}(a)
\end{bmatrix},
\qquad\text{rows indexed by the feedback value }y\in\{0,1\},\ \text{columns by the reward value }x\in\{0,1\}.\qquad(1)$$

Each item in $\mathds{M}_{a}$ denotes the probability of observing a particular feedback for a particular reward, i.e., $\mathds{M}_{a}(y,x)\coloneqq\mathds{P}\big(\text{Feedback from arm }a=y\mid\text{Reward from arm }a=x\big)$. The corresponding corruption function is $g_{a}(x)=1-p_{00}(a)+[p_{00}(a)+p_{11}(a)-1]\cdot x$.
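As a concrete illustration of this corruption scheme, the following sketch applies randomized response to Bernoulli rewards and checks empirically that the mean feedback matches $g_{a}(\mu_{a})$; the probabilities $p_{00}$, $p_{11}$ and the mean $\mu$ are illustrative choices.

```python
import numpy as np

# Sketch of randomized response as a corruption scheme (Eq. 1).
# p00: P(feedback = 0 | reward = 0), p11: P(feedback = 1 | reward = 1).
def randomized_response(reward, p00, p11, rng):
    # Flip the binary reward according to the corruption matrix M_a.
    if reward == 1:
        return rng.binomial(1, p11)        # report 1 with probability p11
    return rng.binomial(1, 1.0 - p00)      # report 1 with probability 1 - p00

def g(x, p00, p11):
    # Induced corruption function: mean feedback as a function of mean reward.
    return 1.0 - p00 + (p00 + p11 - 1.0) * x

# Empirical check that the mean feedback matches g(mu).
rng = np.random.default_rng(0)
p00, p11, mu = 0.8, 0.9, 0.6
rewards = rng.binomial(1, mu, size=100_000)
feedback = np.array([randomized_response(r, p00, p11, rng) for r in rewards])
print(feedback.mean(), g(mu, p00, p11))    # both close to 0.62
```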

To measure the utility of an algorithm for this problem, we define the notion of regret in the following. Let us denote the mean reward of arm $a$ at time step $t$ as $\mu_{a,t}$. The objective of an algorithm, which chooses the arm $\hat{a}_{t}$ at time $t$ based only on the previously observed feedback $F_{1},\dots,F_{t-1}$, is to maximize the expected sum of rewards, i.e., to achieve high utility. This is equivalent to minimizing the regret,
$$\operatorname{Regret}(T)\coloneqq\sum_{t=1}^{T}\mu_{*,t}-\mathds{E}\left[\sum_{t=1}^{T}\mu_{\hat{a}_{t},t}\right],$$
where $\mu_{*,t}\coloneqq\max_{a\in A}\mu_{a,t}$. Regret measures the performance of the algorithm against an omniscient policy that at each time step chooses the arm with the maximal mean reward. Thus, low regret translates to achieving high utility.

The Proposed Algorithm

To solve the problem at hand, we propose SW-KLUCB-CF, an adaptation of the $\mathrm{kl}$-UCB algorithm of KLUCBJournal. The algorithm takes as input: the window size $w$, a non-decreasing function $f$, the horizon $T$ and the corruption functions $g_{1},\dots,g_{K}$. We assume that the horizon $T$ is known; an unknown $T$ can be handled using the doubling trick (besson:hal-01736357). We use $d(x,y)$ to denote the Kullback-Leibler divergence between two Bernoulli distributions with means $x$ and $y$. We also use the shorthand $x\wedge y$ to denote $\min(x,y)$.

At each time step $t$, the algorithm computes an index $\mathrm{Index}_{a}(t)$, which is an upper confidence bound on $\mu_{a,t}$ built from a confidence interval on $\lambda_{a,t}$ based on the KL divergence. The quantity $N_{a}(t,w)$ denotes the number of times arm $a$ was chosen in the last $w$ time steps up to time $t$. Correspondingly, $\hat{\lambda}_{a}(t,w)$ denotes the empirical mean of the feedback observed from arm $a$ in the last $w$ time steps up to time $t$:
$$\hat{\lambda}_{a}(t,w)\coloneqq\frac{1}{N_{a}(t,w)}\sum_{s=\max\{1,t-w+1\}}^{t}F_{s}\cdot\mathds{1}_{(\hat{a}_{s}=a)}.$$
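The index generally has no closed form, but for Bernoulli feedback it can be computed by bisection. Below is a minimal sketch assuming an increasing corruption function $g_{a}$, so that the index reduces to $g_{a}^{-1}$ of an upper confidence bound on the mean feedback; all numerical values are illustrative.

```python
import math

# Sketch of computing Index_a(t) by bisection for Bernoulli feedback,
# assuming an increasing corruption function g_a.
def kl_bernoulli(x, y, eps=1e-12):
    # KL divergence d(x, y) between Bernoulli(x) and Bernoulli(y).
    x = min(max(x, eps), 1.0 - eps)
    y = min(max(y, eps), 1.0 - eps)
    return x * math.log(x / y) + (1.0 - x) * math.log((1.0 - x) / (1.0 - y))

def kl_upper_confidence(lam_hat, n, threshold, tol=1e-6):
    # max{ q in [lam_hat, 1] : n * d(lam_hat, q) <= threshold }, by bisection.
    lo, hi = lam_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if n * kl_bernoulli(lam_hat, mid) <= threshold:
            lo = mid
        else:
            hi = mid
    return lo

# Example: arm pulled 40 times in the window, empirical mean feedback 0.55,
# exploration threshold f(t ^ w) with f(x) = log x + 3 log log x and t ^ w = 100.
f = lambda x: math.log(x) + 3.0 * math.log(math.log(x))
u = kl_upper_confidence(lam_hat=0.55, n=40, threshold=f(100))
g_inv = lambda y: y          # identity corruption for this example
print(g_inv(u))              # the index of the arm at this time step
```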

Theorem 1 gives an upper bound on the regret of SW-KLUCB-CF. A more explicit bound is proved in the Appendix.

Theorem 1

The regret of SW-KLUCB-CF using $f(x)\coloneqq\log(x)+3\log(\log(x))$ and $w=\sqrt{\frac{4eT}{L_{T}+4}}$ on a Bernoulli non-stationary stochastic corrupt bandits problem with strictly monotonic and continuous corruption functions $\{g_{a}\}_{a\in A}$ at time $T$ is upper-bounded by (here, $\tilde{O}$ ignores logarithmic factors and constants)

$$\tilde{O}\left(\sum_{a\in A}\sqrt{L_{T}T}+\sum_{i=1}^{L_{T}}\sum_{a\neq a_{*}(i)}\frac{\log\left(\sqrt{\tfrac{T}{L_{T}}}\right)}{d\left(\lambda_{a}(i),g_{a}(\mu_{*}(i))\right)}\right),$$

where $a_{*}(i)$ and $\mu_{*}(i)$ are the optimal arm and the corresponding optimal mean, respectively, after the $i^{\text{th}}$ change and before the subsequent change.

The lower bound on regret in terms of $T$ for classical non-stationary stochastic bandits is $\Omega(\sqrt{T})$ (GarivierSW). Theorem 1 matches this lower bound up to logarithmic factors, so SW-KLUCB-CF has near-optimal regret guarantees in terms of the time horizon $T$. The best known regret upper bounds for classical non-stationary stochastic bandits (e.g., pmlr-v99-auer19a) also carry logarithmic factors beyond the lower bound, hence our regret bound is in line with the best known results for analogous problems. Moreover, the bound in Theorem 1 also matches the best known regret bound in terms of $L_{T}$ for classical non-stationary stochastic bandits, which is $O(\sqrt{L_{T}})$.
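To see where the $\sqrt{L_{T}T}$ scaling and the choice of $w$ come from, consider the two window-related terms in the per-arm bound from the proof; ignoring constants and logarithmic factors, the trade-off is

$$\min_{w>0}\left(L_{T}\,w+\frac{T}{w}\right)=2\sqrt{L_{T}\,T},\qquad\text{attained at }w=\sqrt{\frac{T}{L_{T}}},$$

and summing this contribution over the $K$ arms yields the first term in Theorem 1.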

Algorithm 1 Sliding Window KLUCB for Non-Stationary Stochastic Corrupt Bandits (SW-KLUCB-CF)

Input: Window size $w$, a non-decreasing function $f:\mathds{N}\rightarrow\mathds{R}$, horizon $T$, monotonic and continuous corruption functions $g_{1},\dots,g_{K}$, and $d(x,y)\coloneqq\mathrm{KL}(\mathcal{B}(x),\mathcal{B}(y))$.

  1. Initialization: Pull each arm once.

  2. for time $t=K,\dots,T-1$ do

    (a) Compute for each arm $a\in A$ the quantity
    $$\mathrm{Index}_{a}(t)\coloneqq\max\left\{q:\ N_{a}(t,w)\cdot d\left(\hat{\lambda}_{a}(t,w),g_{a}(q)\right)\leq f\left(t\wedge w\right)\right\}$$

    (b) Pull arm $\hat{a}_{t+1}\coloneqq\operatorname{argmax}_{a\in A}\mathrm{Index}_{a}(t)$ and observe the feedback $F_{t+1}$.

    end for
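For illustration, here is a minimal end-to-end sketch of SW-KLUCB-CF for Bernoulli feedback. It assumes all corruption functions are increasing (so the index is the inverse corruption function applied to a KL upper confidence bound on the mean feedback) and reuses the simulator interface sketched earlier; the helper names and parameter values are assumptions made for this example.

```python
import math
import numpy as np

# Minimal sketch of SW-KLUCB-CF (Algorithm 1) for Bernoulli feedback.
# g_inv holds the known inverse corruption functions (all g_a assumed increasing).
# The environment is assumed to expose pull(arm, t) -> (reward, feedback).
def kl(x, y, eps=1e-12):
    x = min(max(x, eps), 1 - eps)
    y = min(max(y, eps), 1 - eps)
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

def kl_ucb(lam_hat, n, threshold, tol=1e-6):
    # max{q in [lam_hat, 1] : n * d(lam_hat, q) <= threshold} by bisection.
    lo, hi = lam_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if n * kl(lam_hat, mid) <= threshold else (lo, mid)
    return lo

def sw_klucb_cf(env, K, T, w, g_inv,
                f=lambda x: math.log(x) + 3 * math.log(math.log(x))):
    arms, fb = [], []                      # full history of pulled arms and feedback
    for t in range(1, K + 1):              # 1. Initialization: pull each arm once
        _, F = env.pull(t - 1, t)
        arms.append(t - 1)
        fb.append(F)
    for t in range(K, T):                  # 2. Main loop over t = K, ..., T-1
        lo = max(0, t - w)                 # indices of the last min(t, w) time steps
        win_arms = np.array(arms[lo:t])
        win_fb = np.array(fb[lo:t])
        index = np.empty(K)
        for a in range(K):
            n_a = int(np.sum(win_arms == a))
            if n_a == 0:
                index[a] = float("inf")    # arm unseen in the window: force exploration
                continue
            lam_hat = float(win_fb[win_arms == a].mean())
            index[a] = g_inv[a](kl_ucb(lam_hat, n_a, f(min(t, w))))
        a_next = int(np.argmax(index))     # (b) pull the arm with the largest index
        _, F = env.pull(a_next, t + 1)
        arms.append(a_next)
        fb.append(F)
    return arms

# Usage with the simulator sketched earlier (identity corruption on both arms):
# env = NonStationaryCorruptBandit([[0.7, 0.4], [0.3, 0.6]], [1, 500], [lambda x: x] * 2)
# choices = sw_klucb_cf(env, K=2, T=1000, w=120, g_inv=[lambda y: y] * 2)
```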

We can use SW-KLUCB-CF on non-stationary stochastic corrupt bandits where the corruption is done via randomized response. The following corollary bounds the resulting regret.

Corollary 1

The regret of SW-KLUCB-CF on a Bernoulli non-stationary stochastic corrupt bandits problem with randomized response using corruption matrices $\{\mathds{M}_{a}\}_{a\in A}$ at time $T$ is upper-bounded by

$$\tilde{O}\left(\sum_{a\in A}\sqrt{L_{T}T}+\sum_{i=1}^{L_{T}}\sum_{a\neq a_{*}(i)}\frac{\log\left(\sqrt{\tfrac{T}{L_{T}}}\right)}{(p_{00}(a)+p_{11}(a)-1)^{2}}\right).$$

This corollary follows from Theorem 1 and Pinsker's inequality: $d(x,y)\geq 2(x-y)^{2}$. The term $(p_{00}(a)+p_{11}(a)-1)$ can be understood as the slope of the corruption function $g_{a}$.
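To spell out the step from Theorem 1 to Corollary 1: under randomized response, $g_{a}$ is affine with slope $p_{00}(a)+p_{11}(a)-1$ and $\lambda_{a}(i)=g_{a}(\mu_{a}(i))$, so Pinsker's inequality gives

$$d\left(\lambda_{a}(i),g_{a}(\mu_{*}(i))\right)\geq 2\left(g_{a}(\mu_{a}(i))-g_{a}(\mu_{*}(i))\right)^{2}=2\big(p_{00}(a)+p_{11}(a)-1\big)^{2}\big(\mu_{a}(i)-\mu_{*}(i)\big)^{2},$$

and the squared gap $\big(\mu_{a}(i)-\mu_{*}(i)\big)^{2}$ is a problem-dependent constant absorbed by the $\tilde{O}$ notation.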

Corruption Mechanism to Preserve Local Privacy in Non-Stationary Environment

First, let us formally define local differential privacy.

Definition 1

(Locally differentially private mechanism) Any randomized mechanism $\mathcal{M}$ is $\epsilon$-locally differentially private for $\epsilon\geq 0$, if for all $d_{1},d_{2}\in \mathrm{Domain}(\mathcal{M})$ and for all $S\subset \mathrm{Range}(\mathcal{M})$,

$$\mathds{P}[\mathcal{M}(d_{1})\in S]\leq e^{\epsilon}\cdot\mathds{P}[\mathcal{M}(d_{2})\in S].$$

As done in pmlr-v83-gajane18a, a straightforward approach to achieve local differential privacy using corrupt bandits is to employ a corruption scheme on the user feedback. This is similar to how randomized response is used in data collection by DBLP:conf/edbt/0009WH16.

Definition 2

($\epsilon$-locally differentially private bandit feedback corruption scheme) A bandit feedback corruption scheme $\tilde{g}$ is $\epsilon$-locally differentially private for $\epsilon\geq 0$, if for all reward sequences $R_{t_{1}},\dots,R_{t_{2}}$ and $R^{\prime}_{t_{1}},\dots,R^{\prime}_{t_{2}}$, and for all $\mathcal{S}\subset \mathrm{Range}(\tilde{g})$,

$$\mathds{P}[\tilde{g}(R_{t_{1}},\dots,R_{t_{2}})\in\mathcal{S}]\leq e^{\epsilon}\cdot\mathds{P}[\tilde{g}(R^{\prime}_{t_{1}},\dots,R^{\prime}_{t_{2}})\in\mathcal{S}].$$

When corruption is done by randomized response, local differential privacy requires that $\max_{1\leq a\leq K}\Big(\frac{p_{00}(a)}{1-p_{11}(a)},\frac{p_{11}(a)}{1-p_{00}(a)}\Big)\leq e^{\epsilon}$. From Corollary 1, we can see that to achieve lower regret, $p_{00}(a)+p_{11}(a)$ is to be maximized for all $a\in A$. Using DBLP:conf/edbt/0009WH16, we can state that, in order to achieve $\epsilon$-local differential privacy while maximizing $p_{00}(a)+p_{11}(a)$,

$$\mathds{M}_{a}=
\begin{bmatrix}
\frac{e^{\epsilon}}{1+e^{\epsilon}} & \frac{1}{1+e^{\epsilon}}\\
\frac{1}{1+e^{\epsilon}} & \frac{e^{\epsilon}}{1+e^{\epsilon}}
\end{bmatrix},
\qquad\text{rows indexed by the feedback value }y\in\{0,1\},\ \text{columns by the reward value }x\in\{0,1\}.\qquad(2)$$

As it turns out, this is equivalent to the staircase mechanism for local privacy, which is the optimal local differential privacy mechanism for the low privacy regime (JMLR:v17:15-135, Theorem 14). The trade-off between utility and privacy is controlled by $\epsilon$.
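The following sketch builds the corruption matrix of Eq. (2) for a given $\epsilon$ and checks the local differential privacy constraint directly; it is an illustration of the mechanism, with $\epsilon=1$ chosen arbitrarily.

```python
import numpy as np

# Sketch of the eps-locally differentially private corruption matrix of Eq. (2)
# and a direct check of the local DP ratio constraint.
def ldp_corruption_matrix(eps):
    p = np.exp(eps) / (1.0 + np.exp(eps))   # p00(a) = p11(a) = e^eps / (1 + e^eps)
    return np.array([[p, 1.0 - p],
                     [1.0 - p, p]])          # rows: feedback 0/1, columns: reward 0/1

def ldp_ratio(M):
    # Largest ratio P(F = y | R = x) / P(F = y | R = x') over y and x, x';
    # eps-LDP requires this to be at most e^eps.
    ratios = np.maximum(M[:, 0] / M[:, 1], M[:, 1] / M[:, 0])
    return ratios.max()

eps = 1.0
M = ldp_corruption_matrix(eps)
print(ldp_ratio(M) <= np.exp(eps) + 1e-12)                               # True
print(M[0, 0] + M[1, 1] - 1.0, (np.exp(eps) - 1) / (np.exp(eps) + 1))    # slope matches
```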

Using the corruption parameters from Eq. (2) with Corollary 1, we arrive at the following upper bound.

Corollary 2

At time $T$, the regret of SW-KLUCB-CF with the $\epsilon$-locally differentially private bandit feedback corruption scheme given by Eq. (2) is
$$\tilde{O}\left(\sum_{a\in A}\sqrt{L_{T}T}+\sum_{i=1}^{L_{T}}\sum_{a\neq a_{*}(i)}\frac{\log\left(\sqrt{\tfrac{T}{L_{T}}}\right)}{\left(\frac{e^{\epsilon}-1}{e^{\epsilon}+1}\right)^{2}}\right).$$

The term $\big(\frac{e^{\epsilon}-1}{e^{\epsilon}+1}\big)^{2}$ in the above expression conveys the relationship between the regret and the level of local differential privacy symbolized by $\epsilon$. For low values of $\epsilon$, $\big(\frac{e^{\epsilon}-1}{e^{\epsilon}+1}\big)\approx\epsilon/2$. This is in line with other bandit algorithms providing differential privacy (e.g., DBLP:conf/uai/MishraT15).
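The approximation for small $\epsilon$ follows from a Taylor expansion:

$$\frac{e^{\epsilon}-1}{e^{\epsilon}+1}=\tanh\left(\frac{\epsilon}{2}\right)=\frac{\epsilon}{2}-\frac{\epsilon^{3}}{24}+O(\epsilon^{5}),$$

so the privacy-dependent factor $\big(\frac{e^{\epsilon}-1}{e^{\epsilon}+1}\big)^{-2}$ in the regret bound grows as approximately $4/\epsilon^{2}$ when stronger privacy (smaller $\epsilon$) is required.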

Elements of Mathematical Analysis

Here, we provide a proof outline for Theorem 1. Please refer to the Appendix for the complete proof.

We start by bounding the expected number of times a suboptimal arm (i.e., an arm other than the optimal arm at the time of selection) is pulled by the algorithm up to the horizon $T$. Recall that, at any time step $t$, SW-KLUCB-CF pulls an arm maximizing an index defined as

$$\begin{aligned}
\mathrm{Index}_{a}(t)&\coloneqq\max\left\{q:\ N_{a}(t,w)\cdot d\left(\hat{\lambda}_{a}(t,w),g_{a}(q)\right)\leq f\left(t\wedge w\right)\right\}\\
&=\max g_{a}^{-1}\left(\left\{q:N_{a}(t,w)\cdot d\left(\hat{\lambda}_{a}(t,w),q\right)\leq f\left(t\wedge w\right)\right\}\right).
\end{aligned}$$

We further decompose the computation of the index as follows,

$$\mathrm{Index}_{a}(t)\coloneqq\begin{cases}g_{a}^{-1}(\ell_{a}(t))&\text{if }g_{a}\text{ is decreasing},\\ g_{a}^{-1}(u_{a}(t))&\text{if }g_{a}\text{ is increasing},\end{cases}$$

where,

$$\begin{aligned}
\ell_{a}(t)&\coloneqq\min\Big\{q:N_{a}(t,w)\cdot d\left(\hat{\lambda}_{a}(t,w),q\right)\leq f\left(t\wedge w\right)\Big\},\\
u_{a}(t)&\coloneqq\max\Big\{q:N_{a}(t,w)\cdot d\left(\hat{\lambda}_{a}(t,w),q\right)\leq f\left(t\wedge w\right)\Big\}.
\end{aligned}$$

The interval $[\ell_{a}(t),u_{a}(t)]$ is a KL-based confidence interval on the mean feedback $\lambda_{a,t}$ of arm $a$ at time $t$. This is in contrast to $\mathrm{kl}$-UCB (KLUCBJournal), where a confidence interval is placed on the mean reward. Furthermore, this differs from $\mathrm{kl}$-UCB-CF (pmlr-v83-gajane18a), where the mean feedback of an arm remains the same for all time steps and $f$ does not feature $w$.

In our analysis, we use the fact that when an arm $a$ is picked at time $t+1$ by SW-KLUCB-CF, one of the following is true. Either the mean feedback of the optimal arm $a_{*,t}$ with mean reward $\mu_{*,t}$ is outside its confidence interval (i.e., $g_{a_{*,t}}(\mu_{*,t})<\ell_{a_{*,t}}(t)$ or $g_{a_{*,t}}(\mu_{*,t})>u_{a_{*,t}}(t)$), which is unlikely. Or, the mean feedback of the optimal arm is where it should be, and then the fact that arm $a$ is selected indicates that the confidence interval on $\lambda_{a}$ cannot be too small, since either $u_{a}(t)\geq g_{a}(\mu_{*,t})$ or $\ell_{a}(t)\leq g_{a}(\mu_{*,t})$. The previous statement follows from considering the various cases depending on whether the corruption functions $g_{a}$ and $g_{a_{*,t}}$ are increasing or decreasing. We then need to control the two terms in the decomposition of the expected number of draws of arm $a$. The term regarding the "unlikely" event is bounded using the same technique as in the $\mathrm{kl}$-UCB analysis, however with some added challenges due to the use of a sliding window. In particular, the analysis of a typical upper confidence bound algorithm for bandits relies on the fact that the confidence interval for any arm is always non-increasing; however, this is not true while using a sliding window. To control the second term, depending on the monotonicity of the corruption functions $g_{a}$ and $g_{a_{*,t}}$, we need to meticulously adapt the arguments in KLUCBJournal to control the number of draws of a suboptimal arm, as can be seen in the Appendix.

Concluding Remarks

In this work, we proposed the setting of non-stationary stochastic corrupt bandits for preserving privacy while still maintaining high utility in sequential decision making in a changing environment. We devised an algorithm called SW-KLUCB-CF and proved its regret upper bound which is near-optimal in the number of time steps and matches the best known bound for analogous problems in terms of the number of time steps and the number of changes. Moreover, we provided an optimal corruption scheme to be used with our algorithm in order to attain the dual goal of achieving high utility while maintaining the desired level of privacy.

Interesting directions for future work include:

  1. Complete an empirical evaluation of the proposed algorithm on simulated as well as real-life data.

  2. Characterize the changes in the environment by a variation budget (as done in NIPS2014_903ce922 for classical bandits) instead of the number of changes.

  3. Incorporate contextual information in the learning process.

  4. Propose a Bayesian algorithm for non-stationary stochastic corrupt bandits.

  5. Propose a (near-)optimal differentially private algorithm which does not need to know the number of changes.

Appendix A Proof of Theorem 1

Proof. The proof follows along the lines of the proof for Theorem 2 from pmlr-v83-gajane18a.

The index used by SW-KLUCB-CF is defined by

$$\begin{aligned}
\mathrm{Index}_{a}(t)&\coloneqq\max\left\{q:\ N_{a}(t,w)\cdot d\left(\hat{\lambda}_{a}(t,w),g_{a}(q)\right)\leq f\left(t\wedge w\right)\right\}\\
&=\max g_{a}^{-1}\left(\left\{q:N_{a}(t,w)\cdot d\left(\hat{\lambda}_{a}(t,w),q\right)\leq f\left(t\wedge w\right)\right\}\right).
\end{aligned}$$

For the purpose of this proof, we further decompose the computation of the index as follows,

$$\mathrm{Index}_{a}(t)\coloneqq\begin{cases}g_{a}^{-1}(\ell_{a}(t))&\text{if }g_{a}\text{ is decreasing},\\ g_{a}^{-1}(u_{a}(t))&\text{if }g_{a}\text{ is increasing},\end{cases}$$

where,

$$\begin{aligned}
\ell_{a}(t)&\coloneqq\min\left\{q:N_{a}(t,w)\cdot d\left(\hat{\lambda}_{a}(t,w),q\right)\leq f\left(t\wedge w\right)\right\}\text{ and }\\
u_{a}(t)&\coloneqq\max\left\{q:N_{a}(t,w)\cdot d\left(\hat{\lambda}_{a}(t,w),q\right)\leq f\left(t\wedge w\right)\right\}.
\end{aligned}$$

Note that the optimal arm at time $t$ is denoted as $a_{*,t}$ and $\mu_{*,t}$ is the corresponding optimal mean. Along the same lines, let $\ell_{*}(t)\coloneqq\ell_{a_{*,t}}(t)$ and $u_{*}(t)\coloneqq u_{a_{*,t}}(t)$.

Let $N_{a}(t)$ be the number of times arm $a$ has been pulled up to time $t$. To get an upper bound on the regret of our algorithm, we first bound $\mathds{E}[N_{a}(t)]$ for all the non-optimal arms $a$ (i.e., $a\neq a_{*,t}$ at time $t$). Recall that $\mu_{i,t}$ is the mean reward of arm $i$ at time step $t$. Let us define $\mathcal{T}(w)$ as the set of indices $t\in\{K+1,\dots,T\}$ such that $\mu_{i,s}=\mu_{i,t}$ for all $i\in\{1,\dots,K\}$ and all $t-w<s\leq t$. That is to say, $\mathcal{T}(w)$ is the set of all time steps $t\in\{K+1,\dots,T\}$ for which there was no change in the previous $w$ time steps. Recall that $\hat{a}_{t}$ is the arm chosen by the algorithm at time step $t$. Then,

$$\begin{aligned}
\mathds{E}(N_{a}(T))&=1+\sum_{t=K}^{T-1}\mathds{P}(\hat{a}_{t+1}=a)\\
&\leq 1+L_{T}\cdot w+\sum_{K\leq t\leq T-1,\ t\in\mathcal{T}(w)}\mathds{P}(\hat{a}_{t+1}=a).
\end{aligned}$$

Depending upon whether $g_{a}$ and $g_{a_{*,t}}$ are increasing or decreasing, there are four possible sub-cases:

  • Both $g_{a_{*,t}}$ and $g_{a}$ are increasing.

    $$\begin{aligned}
    (\hat{a}_{t+1}=a)
    &\subseteq\left(u_{*}(t)<g_{a_{*,t}}(\mu_{*,t})\right)\cup\left(\hat{a}_{t+1}=a,\,u_{*}(t)\geq g_{a_{*,t}}(\mu_{*,t})\right)\\
    &=\left(u_{*}(t)<g_{a_{*,t}}(\mu_{*,t})\right)\cup\left(\hat{a}_{t+1}=a,\,g^{-1}_{a_{*,t}}(u_{*}(t))\geq\mu_{*,t}\right)\quad\text{since $g_{a_{*,t}}$ is increasing}\\
    &=\left(u_{*}(t)<g_{a_{*,t}}(\mu_{*,t})\right)\cup\left(\hat{a}_{t+1}=a,\,g^{-1}_{a}(u_{a}(t))\geq\mu_{*,t}\right)\quad\text{since $\operatorname{Index}_{a}\geq\operatorname{Index}_{a_{*,t}}$}\\
    &=\left(u_{*}(t)<g_{a_{*,t}}(\mu_{*,t})\right)\cup\left(\hat{a}_{t+1}=a,\,u_{a}(t)\geq g_{a}(\mu_{*,t})\right)\quad\text{since $g_{a}$ is increasing.}
    \end{aligned}$$

    $$\begin{aligned}
    \therefore\ \mathds{E}(N_{a}(T))\leq\ &1+L_{T}\cdot w+\sum_{K\leq t\leq T-1,\ t\in\mathcal{T}(w)}\mathds{P}\left(u_{*}(t)<g_{a_{*,t}}(\mu_{*,t})\right)\\
    &+\sum_{K\leq t\leq T-1,\ t\in\mathcal{T}(w)}\mathds{P}\left(\hat{a}_{t+1}=a,\,u_{a}(t)\geq g_{a}(\mu_{*,t})\right).\qquad(3)
    \end{aligned}$$

  • $g_{a_{*,t}}$ is decreasing and $g_{a}$ is increasing.

    $$\begin{aligned}
    (\hat{a}_{t+1}=a)
    &\subseteq\left(\ell_{*}(t)>g_{a_{*,t}}(\mu_{*,t})\right)\cup\left(\hat{a}_{t+1}=a,\,\ell_{*}(t)\leq g_{a_{*,t}}(\mu_{*,t})\right)\\
    &=\left(\ell_{*}(t)>g_{a_{*,t}}(\mu_{*,t})\right)\cup\left(\hat{a}_{t+1}=a,\,g_{a_{*,t}}^{-1}(\ell_{*}(t))\geq\mu_{*,t}\right)\quad\text{since $g_{a_{*,t}}$ is decreasing}\\
    &=\left(\ell_{*}(t)>g_{a_{*,t}}(\mu_{*,t})\right)\cup\left(\hat{a}_{t+1}=a,\,g_{a}^{-1}(u_{a}(t))\geq\mu_{*,t}\right)\quad\text{since $\operatorname{Index}_{a}\geq\operatorname{Index}_{a_{*,t}}$}\\
    &=\left(\ell_{*}(t)>g_{a_{*,t}}(\mu_{*,t})\right)\cup\left(\hat{a}_{t+1}=a,\,u_{a}(t)\geq g_{a}(\mu_{*,t})\right)\quad\text{since $g_{a}$ is increasing.}
    \end{aligned}$$

    $$\begin{aligned}
    \therefore\ \mathds{E}(N_{a}(T))\leq\ &1+L_{T}\cdot w+\sum_{K\leq t\leq T-1,\ t\in\mathcal{T}(w)}\mathds{P}\left(\ell_{*}(t)>g_{a_{*,t}}(\mu_{*,t})\right)\\
    &+\sum_{K\leq t\leq T-1,\ t\in\mathcal{T}(w)}\mathds{P}\left(\hat{a}_{t+1}=a,\,u_{a}(t)\geq g_{a}(\mu_{*,t})\right).\qquad(4)
    \end{aligned}$$

  • $g_{a_{*,t}}$ is increasing and $g_{a}$ is decreasing.

    $$\begin{aligned}
    (\hat{a}_{t+1}=a)
    &\subseteq\left(u_{*}(t)<g_{a_{*,t}}(\mu_{*,t})\right)\cup\left(\hat{a}_{t+1}=a,\,u_{*}(t)\geq g_{a_{*,t}}(\mu_{*,t})\right)\\
    &=\left(u_{*}(t)<g_{a_{*,t}}(\mu_{*,t})\right)\cup\left(\hat{a}_{t+1}=a,\,g_{a_{*,t}}^{-1}(u_{*}(t))\geq\mu_{*,t}\right)\quad\text{since $g_{a_{*,t}}$ is increasing}\\
    &=\left(u_{*}(t)<g_{a_{*,t}}(\mu_{*,t})\right)\cup\left(\hat{a}_{t+1}=a,\,g_{a}^{-1}(\ell_{a}(t))\geq\mu_{*,t}\right)\quad\text{since $\operatorname{Index}_{a}>\operatorname{Index}_{a_{*,t}}$}\\
    &=\left(u_{*}(t)<g_{a_{*,t}}(\mu_{*,t})\right)\cup\left(\hat{a}_{t+1}=a,\,\ell_{a}(t)\leq g_{a}(\mu_{*,t})\right)\quad\text{since $g_{a}$ is decreasing.}
    \end{aligned}$$

    $$\begin{aligned}
    \therefore\ \mathds{E}(N_{a}(T))\leq\ &1+L_{T}\cdot w+\sum_{K\leq t\leq T-1,\ t\in\mathcal{T}(w)}\mathds{P}\left(u_{*}(t)<g_{a_{*,t}}(\mu_{*,t})\right)\\
    &+\sum_{K\leq t\leq T-1,\ t\in\mathcal{T}(w)}\mathds{P}\left(\hat{a}_{t+1}=a,\,\ell_{a}(t)\leq g_{a}(\mu_{*,t})\right).\qquad(5)
    \end{aligned}$$

  • $g_{a_{*,t}}$ is decreasing and $g_{a}$ is decreasing.

    $$\begin{aligned}
    (\hat{a}_{t+1}=a)
    &\subseteq\left(\ell_{*}(t)>g_{a_{*,t}}(\mu_{a_{*,t}})\right)\cup\left(\hat{a}_{t+1}=a,\,\ell_{*}(t)\leq g_{a_{*,t}}(\mu_{a_{*,t}})\right)\\
    &=\left(\ell_{*}(t)>g_{a_{*,t}}(\mu_{a_{*,t}})\right)\cup\left(\hat{a}_{t+1}=a,\,g_{a_{*,t}}^{-1}(\ell_{*}(t))\geq\mu_{a_{*,t}}\right)\quad\text{since $g_{a_{*,t}}$ is decreasing}\\
    &=\left(\ell_{*}(t)>g_{a_{*,t}}(\mu_{a_{*,t}})\right)\cup\left(\hat{a}_{t+1}=a,\,g_{a}^{-1}(\ell_{a}(t))\geq\mu_{a_{*,t}}\right)\quad\text{since $\operatorname{Index}_{a}>\operatorname{Index}_{a_{*,t}}$}\\
    &=\left(\ell_{*}(t)>g_{a_{*,t}}(\mu_{a_{*,t}})\right)\cup\left(\hat{a}_{t+1}=a,\,\ell_{a}(t)\leq g_{a}(\mu_{a_{*,t}})\right)\quad\text{since $g_{a}$ is decreasing.}
    \end{aligned}$$

    $$\begin{aligned}
    \therefore\ \mathds{E}(N_{a}(T))\leq\ &1+L_{T}\cdot w+\sum_{K\leq t\leq T-1,\ t\in\mathcal{T}(w)}\mathds{P}\left(\ell_{*}(t)>g_{a_{*,t}}(\mu_{a_{*,t}})\right)\\
    &+\sum_{K\leq t\leq T-1,\ t\in\mathcal{T}(w)}\mathds{P}\left(\hat{a}_{t+1}=a,\,\ell_{a}(t)\leq g_{a}(\mu_{a_{*,t}})\right).\qquad(6)
    \end{aligned}$$

We first upper bound the two sums

$$\sum_{K\leq t\leq T-1,\ t\in\mathcal{T}(w)}\mathds{P}\left(u_{*}(t)<g_{a_{*,t}}(\mu_{*,t})\right)\quad\text{and}\quad\sum_{K\leq t\leq T-1,\ t\in\mathcal{T}(w)}\mathds{P}\left(\ell_{*}(t)>g_{a_{*,t}}(\mu_{a_{*,t}})\right)\qquad(7)$$

using that $\ell_{*}(t)$ and $u_{*}(t)$ are respectively a lower and an upper confidence bound on $g_{a_{*,t}}(\mu_{*,t})$. Recall that $\min\{t,w\}$ is denoted as $t\wedge w$.

$$\begin{aligned}
\mathds{P}\left(u_{a_{*,t}}<g_{a_{*,t}}(\mu_{*,t})\right)
&\leq\mathds{P}\left(g_{a_{*,t}}(\mu_{*,t})>\hat{\lambda}_{a_{*,t}}(t,w)\text{ and }N_{a_{*,t}}(t,w)\cdot d\left(\hat{\lambda}_{a_{*,t}}(t,w),g_{a_{*,t}}(\mu_{*,t})\right)\geq f\left(t\wedge w\right)\right)\\
&\leq\mathds{P}\left(\exists s\in\{1,\dots,(t\wedge w)\}:g_{a_{*,t}}(\mu_{*,t})>\hat{\lambda}_{a_{*,t},s}\text{ and }s\cdot d\left(\hat{\lambda}_{a_{*,t},s},g_{a_{*,t}}(\mu_{*,t})\right)\geq f\left(t\wedge w\right)\right)\\
&\leq\min\left\{1,e\left\lceil f\left(t\wedge w\right)\log{t}\right\rceil e^{-f\left(t\wedge w\right)}\right\},\qquad(8)
\end{aligned}$$

where the upper bound follows from Lemma 2 in KLUCBJournal, and the fact that $\hat{\lambda}_{a_{*,t},s}$ is the empirical mean of $s$ Bernoulli samples with mean $g_{a_{*,t}}(\mu_{*,t})$. Similarly, one has

$$\mathds{P}\left(\ell_{*}(t)>g_{a_{*,t}}(\mu_{a_{*,t}})\right)\leq\min\left\{1,e\left\lceil f\left(t\wedge w\right)\log{t}\right\rceil e^{-f\left(t\wedge w\right)}\right\}.\qquad(9)$$

As $f(x)\coloneqq\log{x}+3\log{\log{x}}$, for $x\geq 3$,

$$e\lceil f(x)\log{x}\rceil\leq 4e\log^{2}{x}.$$

Then, using Eq. (8) and Eq. (9), the two quantities in Eq. (7) can be upper bounded by

$$\begin{aligned}
1+\sum_{t=3}^{T-1}e\left\lceil f\left(t\wedge w\right)\log{t}\right\rceil e^{-f\left(t\wedge w\right)}
&\leq 1+\sum_{t=3}^{T-1}4e\cdot\log^{2}\left(t\wedge w\right)\cdot e^{-f(t\wedge w)}\\
&=1+4e\sum_{t=3}^{T-1}\frac{1}{(t\wedge w)\cdot\log{(t\wedge w)}}\\
&=1+4e\sum_{t=3}^{w}\frac{1}{(t\wedge w)\cdot\log{(t\wedge w)}}+4e\sum_{t=w+1}^{T}\frac{1}{(t\wedge w)\cdot\log{(t\wedge w)}}\\
&\leq 1+4e\sum_{t=3}^{w}\frac{1}{3\log{3}}+4e\sum_{t=w+1}^{T}\frac{1}{w\log{w}}\\
&\leq 1+\frac{4ew}{3\log{3}}+\frac{4eT}{w\log{w}}.
\end{aligned}$$

This proves that

$$\begin{aligned}
\sum_{K\leq t\leq T-1,\ t\in\mathcal{T}(w)}\mathds{P}\left(u_{*}(t)<g_{a_{*,t}}(\mu_{*,t})\right)&\leq 1+\frac{4ew}{3\log{3}}+\frac{4eT}{w\log{w}}\quad\text{and,}\qquad(10)\\
\sum_{K\leq t\leq T-1,\ t\in\mathcal{T}(w)}\mathds{P}\left(\ell_{*}(t)>g_{a_{*,t}}(\mu_{a_{*,t}})\right)&\leq 1+\frac{4ew}{3\log{3}}+\frac{4eT}{w\log{w}}.\qquad(11)
\end{aligned}$$

We now turn our attention to the other two sums involved in the upper bound we gave for $\mathds{E}(N_{a}(T))$. Let the unknown time step at which the $i^{\text{th}}$ change occurs be denoted as $t_{i}$. For notational convenience, we assume that the first change occurs at $t=1$, so $t_{1}=1$, and change $L+1$ takes place at $t=T+1$ where $T$ is the horizon. We introduce the notation $d^{+}(x,y)=d(x,y)\cdot\mathds{1}_{(x<y)}$ and $d^{-}(x,y)=d(x,y)\cdot\mathds{1}_{(x>y)}$. So we can write, when $g_{a}$ is increasing,

$$\begin{aligned}
&\sum_{K\leq t\leq T-1,\ t\in\mathcal{T}(w)}\mathds{P}\left(\hat{a}_{t+1}=a,\,u_{a}(t)\geq g_{a}(\mu_{*,t})\right)\\
&\leq\sum_{i=1}^{L}\sum_{t_{i}\leq t<t_{i+1}-1,\ t\in\mathcal{T}(w)}\mathds{P}\left(\hat{a}_{t+1}=a,\,u_{a}(t)\geq g_{a}(\mu_{*,t})\right)\\
&=\mathds{E}\left[\sum_{i=1}^{L}\sum_{t_{i}\leq t<t_{i+1}-1,\ t\in\mathcal{T}(w)}\mathds{1}_{\hat{a}_{t+1}=a}\cdot\mathds{1}_{N_{a}(t,w)\cdot d^{+}(\hat{\lambda}_{a,N_{a}(t,w)},g_{a}(\mu_{*,t}))\leq f(t\wedge w)}\right]\\
&\leq\mathds{E}\left[\sum_{i=1}^{L}\sum_{t_{i}\leq t<t_{i+1}-1,\ t\in\mathcal{T}(w)}\sum_{s=1}^{t\wedge w}\mathds{1}_{\hat{a}_{t+1}=a}\cdot\mathds{1}_{N_{a}(t,w)=s}\cdot\mathds{1}_{s\cdot d^{+}(\hat{\lambda}_{a,s},g_{a}(\mu_{*,t}))\leq f(t\wedge w)}\right]\\
&\leq\mathds{E}\left[\sum_{i=1}^{L}\sum_{t_{i}\leq t<t_{i+1}-1,\ t\in\mathcal{T}(w)}\sum_{s=1}^{t\wedge w}\mathds{1}_{\hat{a}_{t+1}=a}\cdot\mathds{1}_{N_{a}(t)=s}\cdot\mathds{1}_{s\cdot d^{+}(\hat{\lambda}_{a,s},g_{a}(\mu_{*,t}))\leq f(t\wedge w)}\right]\\
&\leq\mathds{E}\Big[\sum_{i=1}^{L}\sum_{s=1}^{t\wedge w}\mathds{1}_{s\cdot d^{+}(\hat{\lambda}_{a,s},g_{a}(\mu_{*,t}))\leq f(t\wedge w)}\underbrace{\sum_{t_{i}\leq t<t_{i+1}-1,\ t\in\mathcal{T}(w)}\mathds{1}_{\hat{a}_{t+1}=a}\cdot\mathds{1}_{N_{a}(t)=s}}_{\leq 1}\Big].
\end{aligned}$$

In the above, the penultimate step follows from the fact that the event $N_{a}(t,w)=s$ is subsumed by the event $N_{a}(t)=s$. So, one obtains, when $g_{a}$ is increasing,

$$\sum_{K\leq t\leq T-1,\ t\in\mathcal{T}(w)}\mathds{P}\left(\hat{a}_{t+1}=a,\,u_{a}(t)\geq g_{a}(\mu_{*,t})\right)\leq\sum_{l=1}^{L}\sum_{s=1}^{t\wedge w}\mathds{P}\left(s\cdot d^{+}(\hat{\lambda}_{a,s},g_{a}(\mu_{*,t}))\leq f(t\wedge w)\right).\qquad(12)$$

Using similar arguments, one can show that, when $g_{a}$ is decreasing,

$$\sum_{K\leq t\leq T-1,\ t\in\mathcal{T}(w)}\mathds{P}\left(\hat{a}_{t+1}=a,\,\ell_{a}(t)\leq g_{a}(\mu_{a_{*,t}})\right)\leq\sum_{l=1}^{L}\sum_{s=1}^{t\wedge w}\mathds{P}\left(s\cdot d^{-}(\hat{\lambda}_{a,s},g_{a}(\mu_{*,t}))\leq f(t\wedge w)\right).\qquad(13)$$

Recall that $\mu_{a}(i)$ is the mean reward of arm $a$ after the $i^{\text{th}}$ change and before the subsequent change. Correspondingly, let $\lambda_{a}(i)$ be the mean feedback of arm $a$ after the $i^{\text{th}}$ change and before the subsequent change. Furthermore, let $\mu_{*}(i)$ be the optimal mean after the $i^{\text{th}}$ change and before the subsequent change.

Using Appendix A.2. of (KLUCBJournal), the quantity in the right-hand side of (12) can be upper-bounded by

$$\sum_{i=1}^{L}\frac{f(w)}{d\left(\lambda_{a}(i),g_{a}(\mu_{*}(i))\right)}+\sum_{i=1}^{L}\sqrt{2\pi}\sqrt{\frac{d^{\prime}\left(\lambda_{a}(i),g_{a}(\mu_{*}(i))\right)^{2}}{d\left(\lambda_{a}(i),g_{a}(\mu_{*}(i))\right)^{3}}}\sqrt{f(w)}+\sum_{i=1}^{L}2\left(\frac{d^{\prime}\left(\lambda_{a}(i),g_{a}(\mu_{*}(i))\right)}{d\left(\lambda_{a}(i),g_{a}(\mu_{*}(i))\right)}\right)^{2}+1.\qquad(14)$$

For (13), noting that $d^{-}(x,y)=d^{+}(1-x,1-y)$, one has

$$\begin{aligned}
\mathds{P}\left(s\cdot d^{-}(\hat{\lambda}_{a,s},g_{a}(\mu_{*,t}))\leq f(t\wedge w)\right)&=\mathds{P}\left(s\cdot d^{+}(1-\hat{\lambda}_{a,s},1-g_{a}(\mu_{*,t}))\leq f(t\wedge w)\right)\\
&=\mathds{P}\left(s\cdot d^{+}(\hat{\mu}_{a,s},1-g_{a}(\mu_{*,t}))\leq f(t\wedge w)\right),
\end{aligned}$$

where $\hat{\mu}_{a,s}\coloneqq 1-\hat{\lambda}_{a,s}$ is the empirical mean of $s$ observations of a Bernoulli random variable with mean $1-\lambda_{a}<1-g_{a}(\mu_{*,t})$. Hence, the analysis of KLUCBJournal can be applied, and using that $d(1-x,1-y)=d(x,y)$ and $d^{\prime}(1-x,1-y)=-d^{\prime}(x,y)$, the right-hand side of (13) can also be upper-bounded by (14).

Combining inequalities (10), (11), (12), (13) and (14) with the initial decomposition of $\mathds{E}[N_{a}(T)]$, and substituting $f(x)\coloneqq\log(x)+3\log\log(x)$, yields in all cases,

$$\begin{aligned}
\mathds{E}[N_{a}(T)]\leq\ &L_{T}\cdot w+\frac{4ew}{3\log{3}}+\frac{4eT}{w\log{w}}+\sum_{i=1}^{L_{T}}\frac{f(w)}{d\left(\lambda_{a}(i),g_{a}(\mu_{*}(i))\right)}\\
&+\sum_{i=1}^{L_{T}}\sqrt{2\pi}\sqrt{\frac{d^{\prime}\left(\lambda_{a}(i),g_{a}(\mu_{*}(i))\right)^{2}}{d\left(\lambda_{a}(i),g_{a}(\mu_{*}(i))\right)^{3}}}\sqrt{f(w)}+\sum_{i=1}^{L_{T}}2\left(\frac{d^{\prime}\left(\lambda_{a}(i),g_{a}(\mu_{*}(i))\right)}{d\left(\lambda_{a}(i),g_{a}(\mu_{*}(i))\right)}\right)^{2}+5\\
\leq\ &(L_{T}+4)\cdot w+\frac{4eT}{w\log{w}}+\sum_{i=1}^{L_{T}}\frac{\log(w)+3\log{\log(w)}}{d\left(\lambda_{a}(i),g_{a}(\mu_{*}(i))\right)}\\
&+\sum_{i=1}^{L_{T}}\sqrt{2\pi}\sqrt{\frac{d^{\prime}\left(\lambda_{a}(i),g_{a}(\mu_{*}(i))\right)^{2}}{d\left(\lambda_{a}(i),g_{a}(\mu_{*}(i))\right)^{3}}}\sqrt{\log(w)+3\log{\log(w)}}\\
&+\sum_{i=1}^{L_{T}}2\left(\frac{d^{\prime}\left(\lambda_{a}(i),g_{a}(\mu_{*}(i))\right)}{d\left(\lambda_{a}(i),g_{a}(\mu_{*}(i))\right)}\right)^{2}+5.\qquad(15)
\end{aligned}$$

Minimizing the leading terms in the RHS of Eq. (15) by taking the first derivative with respect to $w$ and equating it to 0 leads to solving for $w$ in

$$\frac{w^{2}\left(\log^{2}{w}\right)}{\log{w}+1}=\frac{4eT}{L_{T}+4}\quad\simeq\quad w^{2}\log{(w^{2})}=\frac{8eT}{L_{T}+4}.$$

Here, $w$ must be positive for the logarithm to exist, so we can write $w^{2}=e^{u}$ for some $u$, and the equation becomes

$$ue^{u}=\frac{8eT}{L_{T}+4}.$$

This equation has no solution in terms of elementary expressions, although it can be expressed in terms of the Lambert W function (Corless1996). Opting for an elementary expression for $w$, we can choose $w=\sqrt{\frac{4eT}{L_{T}+4}}$, which leads to the following bound,

$$\begin{aligned}
\mathds{E}[N_{a}(T)]&\leq\sqrt{4e(L_{T}+4)T}+\frac{\sqrt{4e(L_{T}+4)T}}{\log{\left(\sqrt{\frac{4eT}{L_{T}+4}}\right)}}+\sum_{i=1}^{L_{T}}\frac{\log{\left(\sqrt{\frac{4eT}{L_{T}+4}}\right)}+3\log{\log{\left(\sqrt{\frac{4eT}{L_{T}+4}}\right)}}}{d\left(\lambda_{a}(i),g_{a}(\mu_{*}(i))\right)}\\
&\quad+\sum_{i=1}^{L_{T}}\sqrt{2\pi}\sqrt{\frac{d^{\prime}\left(\lambda_{a}(i),g_{a}(\mu_{*}(i))\right)^{2}}{d\left(\lambda_{a}(i),g_{a}(\mu_{*}(i))\right)^{3}}}\sqrt{\log{\left(\sqrt{\frac{4eT}{L_{T}+4}}\right)}+3\log{\log{\left(\sqrt{\frac{4eT}{L_{T}+4}}\right)}}}\\
&\quad+\sum_{i=1}^{L_{T}}2\left(\frac{d^{\prime}\left(\lambda_{a}(i),g_{a}(\mu_{*}(i))\right)}{d\left(\lambda_{a}(i),g_{a}(\mu_{*}(i))\right)}\right)^{2}+5.
\end{aligned}$$

Since the rewards are bounded in $[0,1]$ for Bernoulli non-stationary stochastic bandits, the regret is upper-bounded by

$$\tilde{O}\left(\sum_{a\in A}\sqrt{L_{T}T}+\sum_{i=1}^{L_{T}}\sum_{a\neq a_{*}(i)}\frac{\log{\left(\sqrt{\tfrac{T}{L_{T}}}\right)}}{d\left(\lambda_{a}(i),g_{a}(\mu_{*}(i))\right)}\right).$$

Assuming that $L_{T}=O\left(T^{\beta}\right)$ for some $\beta\in[0,1)$, the expected regret is upper bounded as $\tilde{O}\left(T^{(1+\beta)/2}\right)$. In particular, if $\beta=0$, i.e., the number of breakpoints is upper-bounded by $L$ independently of $T$, then with $w=\sqrt{\frac{4eT}{L+4}}$, the upper bound is $\tilde{O}\left(\sqrt{LT}\right)$.